NEWS & BLOGS

Heading

Enterotyping Analysis Pipeline

Published on: 2025-09-03 | By: EDITOR


1. MAGs Generation Workflow

2. Isolated Bacterial Genome Assembly Workflow

3. Taxonomic Relative Abundance Estimation via Kraken2 and Bracken

4. Enterotyping Analysis Pipeline


Gut Enterotyping Analysis Overview

This enterotyping analysis leverages the Pbac v2 database and species-level abundance data from 142 samples, quantified using the Kraken2 and Bracken pipeline. Below is the R-based implementation of the workflow. 


1. Data Import & Cleaning

Action: Load species abundance data and preprocess to remove noise.

  • r
     
    taxa_abundance <- read.csv("fraction_total_reads_species.csv",  
                               row.names = 1,   # Use sample names as row IDs  
                               check.names = FALSE)  # Preserve special characters in column names  
    taxa_abundance[is.na(taxa_abundance)] <- 0  # Replace missing values with 0  
    taxa_abundance <- taxa_abundance[, colSums(taxa_abundance) > 0]  # Remove zero-sum columns (species)  
    taxa_abundance <- taxa_abundance[rowSums(taxa_abundance) > 0, ]   # Remove zero-sum rows (samples)  

2. Taxon Filtering

Action: Retain species meeting prevalence and abundance thresholds.

  • r
     
    filter_threshold <- function(x) {  
      (mean(x > 0) >= 0.10) & (sum(x) > 0.5)  # Keep species present in ≥10% of samples and total abundance >0.5%  
    }  
    keep_species <- apply(taxa_abundance, 2, filter_threshold)  # Apply threshold column-wise  
    taxa_filtered <- taxa_abundance[, keep_species]  

3. Hellinger Transformation

Action: Normalize abundance data to reduce dominance of highly abundant taxa.

  • r
     
    library(vegan)  
    taxa_hel <- decostand(taxa_filtered, method = "hellinger")  # Hellinger standardization  
    taxa_hel[is.na(taxa_hel)] <- 0  # Handle potential NA values post-transformation  

4. Bray-Curtis Distance Calculation

Action: Compute pairwise dissimilarity between samples.

  • r
     
    bray_dist <- vegdist(taxa_hel, method = "bray")  # Bray-Curtis dissimilarity matrix  

5. Optimal Cluster Number Determination

Action: Evaluate silhouette scores to identify the optimal number of enterotypes (k=2–5).

  • r
     
    library(cluster)  
    silhouette_scores <- sapply(2:5, function(k) {  
      pam_cluster <- pam(bray_dist, k = k)  
      mean(silhouette(pam_cluster$clustering, bray_dist)[, "sil_width"])  # Average silhouette width  
    })  
    optimal_k <- which.max(silhouette_scores) + 1  # Select k with highest score  

6. PAM Clustering

Action: Partition samples into enterotypes using Partitioning Around Medoids (PAM).

  • r
     
    set.seed(123)  # Ensure reproducibility  
    pam_result <- pam(bray_dist, k = optimal_k)  

7. Enterotype Annotation

Action: Assign readable labels to clusters.

  • r
     
    annotation_df <- data.frame(  
      Enterotype = factor(  
        pam_result$clustering,  
        levels = sort(unique(pam_result$clustering)),  
        labels = paste("Enterotype", sort(unique(pam_result$clustering)))  # Format labels as "Enterotype 1", etc.  
      ),  
      row.names = rownames(taxa_filtered)  # Match sample IDs  
    )