1. MAGs Generation Workflow
2. Isolated Bacterial Genome Assembly Workflow
3. Taxonomic Relative Abundance Estimation via Kraken2 and Bracken
4. Enterotyping Analysis Pipeline
Gut Enterotyping Analysis Overview
This enterotyping analysis leverages the Pbac v2 database and species-level abundance data from 142 samples, quantified using the Kraken2 and Bracken pipeline. Below is the R-based implementation of the workflow.
1. Data Import & Cleaning
Action: Load species abundance data and preprocess to remove noise.
taxa_abundance <- read.csv("fraction_total_reads_species.csv",
row.names = 1, # Use sample names as row IDs
check.names = FALSE) # Preserve special characters in column names
taxa_abundance[is.na(taxa_abundance)] <- 0 # Replace missing values with 0
taxa_abundance <- taxa_abundance[, colSums(taxa_abundance) > 0] # Remove zero-sum columns (species)
taxa_abundance <- taxa_abundance[rowSums(taxa_abundance) > 0, ] # Remove zero-sum rows (samples)
2. Taxon Filtering
Action: Retain species meeting prevalence and abundance thresholds.
filter_threshold <- function(x) {
(mean(x > 0) >= 0.10) & (sum(x) > 0.5) # Keep species present in ≥10% of samples and total abundance >0.5%
}
keep_species <- apply(taxa_abundance, 2, filter_threshold) # Apply threshold column-wise
taxa_filtered <- taxa_abundance[, keep_species]
3. Hellinger Transformation
Action: Normalize abundance data to reduce dominance of highly abundant taxa.
library(vegan)
taxa_hel <- decostand(taxa_filtered, method = "hellinger") # Hellinger standardization
taxa_hel[is.na(taxa_hel)] <- 0 # Handle potential NA values post-transformation
4. Bray-Curtis Distance Calculation
Action: Compute pairwise dissimilarity between samples.
bray_dist <- vegdist(taxa_hel, method = "bray") # Bray-Curtis dissimilarity matrix
5. Optimal Cluster Number Determination
Action: Evaluate silhouette scores to identify the optimal number of enterotypes (k=2–5).
library(cluster)
silhouette_scores <- sapply(2:5, function(k) {
pam_cluster <- pam(bray_dist, k = k)
mean(silhouette(pam_cluster$clustering, bray_dist)[, "sil_width"]) # Average silhouette width
})
optimal_k <- which.max(silhouette_scores) + 1 # Select k with highest score
6. PAM Clustering
Action: Partition samples into enterotypes using Partitioning Around Medoids (PAM).
set.seed(123) # Ensure reproducibility
pam_result <- pam(bray_dist, k = optimal_k)
7. Enterotype Annotation
Action: Assign readable labels to clusters.
annotation_df <- data.frame(
Enterotype = factor(
pam_result$clustering,
levels = sort(unique(pam_result$clustering)),
labels = paste("Enterotype", sort(unique(pam_result$clustering))) # Format labels as "Enterotype 1", etc.
),
row.names = rownames(taxa_filtered) # Match sample IDs
)