library(gRodon)
library(ggplot2)
library(ggfortify)
library(mclust, quietly = T)Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.
gRodon’s fitIoC() and predictIoC() functions
This tutorial assumes you already have the gRodon package (version 2.6.0 or higher) installed (see [here](https://github.com/jlw-ecoevo/gRodon2) for installation instructions). We will also use ggplot and ggfotify for visualization and mclust for clustering.
library(gRodon)
library(ggplot2)
library(ggfortify)
library(mclust, quietly = T)Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.
The approach to genomically measuring copiotrophy discussed here is more thoroughly outlined in Weissman et al. 2025 and Zakem et al. 2025. Briefly, we combine three quantities that are easy to compute for a genome: (1) the codon usage bias of the ribosomal proteins (computed as dCUB using gRodon’s predictGrowth() function), (2) the number of annotated carbohydrate active enzymes (annotated using dbcan), and (3) the number of protein coding genes. A large body of work suggests that these three axes of variation should broadly be associated with specialization for sporadic resource-abundance, especially codon usage bias, and our own work has shown that these three traits can be combined into an axis of variation that predicts microbial responses to permafrost thaw (Weissman et al. 2025) and the biogeography of marine heterotrophs (Zakem et al. 2025). To use gRodon’s two IoC functions you will need to already have these three datapoints in hand for a genome or set of genomes.
For today, we will use a dataset of 102 metagenome-assembled genomes (MAGs) derived from a permafrost thaw experiment (Weissman et al. 2025) that are included in the gRodon package. We will hold out three genomes and use the rest to fit our index of copiotrophy (IoC).
load(system.file('extdata',
'mags_df.rda',
package = 'gRodon'))
head(mags) Name dCUB nCAZy nGenes
1 11p_S11_bin.11 -0.061077866 41 1073
2 11p_S11_bin.13 0.059440625 55 1764
3 11p_S11_bin.14 -0.016531818 12 909
4 11p_S11_bin.20 -0.079038554 11 688
5 11p_S11_bin.24 0.004909033 11 445
6 11p_S11_bin.30 -0.039718414 40 1713
mags_held_out <- mags[1:3,]
mags_train <- mags[4:nrow(mags),]Let’s look at what our three measures of growth strategy look like for this data (note that more negative values of dCUB correspond to higher bias):
ggplot(mags,aes(x=-dCUB,y=nCAZy,size=nGenes)) +
geom_point(alpha=.75) +
labs(size="Number of\nGenes") +
scale_y_log10() +
theme(legend.position = "right") +
ylab("Number of CAZymes") +
xlab("Codon Usage Bias (-dCUB)") We can create an index of copiotrophy for this set of genomes using the fitIoC() function. In other words, we will create a single index that captures the variation in these three genomic traits across this set of genomes and ranges from 1 to positive infinity (or in this case, about 10). Importantly, we always create an index with respect to a specific set of genomes, usually sampled to be representative of the range in a particular system we are interested in (e.g., across a gradient of permafrost thaw, across a depth gradient in the pacific). To date, we have avoided trying to create a “master” index applicable across all microbes because most scientific questions are interested in the range of variation in a trait within a specific environmental context. In this way we avoid comparing apples to oranges (or soil microbes to marine microbes).
The fitIoC() function returns a list with two entries. The input to fitIoC() can take a few different forms (see helpfile with ?fitIoC())
IoC <- fitIoC(mags_train)The first entry is just a list of IoC values for all genomes in the dataframe
IoC$IoC [1] 2.100309 1.267135 2.425849 1.899993 1.274985 1.316930 2.066081 2.301826
[9] 1.067247 2.179564 1.094351 2.518870 1.766781 2.041328 1.495739 2.676753
[17] 2.751121 1.978550 4.328953 5.546610 2.066590 4.090138 3.259973 3.197463
[25] 1.221463 1.595758 1.933166 3.130683 1.532316 5.511867 5.891610 1.241403
[33] 1.700479 1.514774 3.909136 1.929521 1.691493 1.171920 4.030844 2.156021
[41] 2.256792 1.000000 3.215005 3.519152 3.982645 1.976501 2.677755 4.483684
[49] 4.018415 3.303272 2.284058 2.767192 2.266903 3.049527 4.161165 2.709966
[57] 3.393644 1.757988 1.411638 4.058649 1.840296 5.657543 4.172238 2.392310
[65] 2.944416 2.633718 1.981920 1.714091 2.632589 4.134205 1.972073 4.183917
[73] 4.065502 5.352371 1.601889 2.712035 6.861797 3.694392 3.342672 2.231958
[81] 3.325015 3.015017 1.620333 2.037290 4.684439 3.403603 3.531169 1.866263
[89] 4.512777 1.347368 2.012835 4.702533 1.899654 1.373193 1.544752 1.386913
[97] 1.590439 1.384955 1.914935
The second entry is a prcomp PCA object that can be used to place new genomes onto this index using the predictIoC() function. Your IoC values are just the first component of this PCA (with a correction factor so that higher IoC values always imply more copiotrophic and lower IoC values always imply more oligotrophic). We don’t have specific recommendations for when the percent variance explained by this axis is too low for IoC to be a useful indicator for a dataset, but we have found good results with 50% or greater variance explained.
summary(IoC$model)Importance of components:
PC1 PC2 PC3
Standard deviation 1.2713 0.9911 0.6336
Proportion of Variance 0.5388 0.3274 0.1338
Cumulative Proportion 0.5388 0.8662 1.0000
autoplot(IoC$model)Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Once we have an IoC fitted, we can generate values for new genomes along this index using the predictIoC() function.
predictIoC(mags_held_out,model=IoC)[1] 1.848724 1.717140 1.922849
We have also included pre-fitted IoC from a permafrost thaw experiment (Weissman et al. 2025), the P16 Pacific Ocean transect (Zakem et al. 2025), the human microbiome (Pasolli et al. 2019), diverse soil microbes (Ma et al. 2023), and a balanced combination of soil/marine/human microbes in a single index. We anticipate these will be useful to users hoping to project their own genomes onto these indices. The permafrost dataset captures microbial diversity pre-thaw and immediately post-thaw for a range of Alaskan permafrost types and may be useful for those studying trajectories of the thaw process or other soil datasets. The P16 transect captures heterotrophic diversity along a broad range of open ocean environments (a N-S transect from 60N to 60S with depth profiles to 60 meters) and may be useful for a variety of marine habitats. We have not studied the human microbiome as intensively, but it is strongly correlated with the permafrost and marine IoC’s (indicating some agreement in the major axes of growth variation across environments). The soil IoC, which covers a wide range of soil habitats, is not strongly correlated with the other indices except for the “all_habitat” index, and we suspect that by incorporating too diverse a set of environments these indices will have limited practical use. In general, we recommend fitting new IoC’s in a habitat-sepcific manner that captures the variation one is interested in studying. If you would like your IoC to be implemented in gRodon for others to use, please reach out to JL Weissman <jackie.weissman@stonybrook.edu> who will be happy to add your IoC and a link to the relevant paper.
predictIoC(mags_held_out,model="permafrost")[1] 2.651158 2.184963 1.507101
predictIoC(mags_held_out,model="pacific")[1] 1.681830 1.695430 1.861266
predictIoC(mags_held_out,model="human")[1] 2.500974 2.390334 1.396580
predictIoC(mags_held_out,model="soil")[1] 4.762635 3.553840 2.748666
predictIoC(mags_held_out,model="all_habitat")[1] 8.220195 7.364912 6.980824
Given a set of IoC values, we often wish to cluster organisms into growth strategies (e.g., as “copiotrophs” or “oligotrophs”). While IoC offers a continuous index along which genomes vary, we do often see evidence of clustering along this axis that corresponds to microbial responses to environmental change or biogeography. There are many ways to cluster values, here we will use a Gaussian mixture model as one example that has worked well in our hands for this application.
First, let’s fit our IoC for the complete set of MAG data and pull out just the estimated values:
mag_ioc <- fitIoC(mags)$IoC
plot(density(mag_ioc))We next apply a Gaussian mixture model to the log-transformed data to detect clusters using the mclust package:
#fit Gaussian mixture model
mC <- Mclust(mag_ioc,
verbose=F)
#Check means and variances of fitted Gaussians
mC$parameters$mean 1 2
1.818831 3.609664
mC$parameters$variance$sigmasq[1] 0.1598052 1.3935729
It looks like Mclust() detected two Gaussians.
We first clean up this output a little to get it into a convenient format for plotting:
#We create weightings for our two Gaussians for visualization
#based on the total probability of our MAGs being in each Gaussian
p <- (colSums(mC$z)/sum(mC$z))
#a grid of values to evaluate our Guassians over for plotting
x_seq <- seq(0.1,10,0.01)
#dataframe for plotting densities
cluster_df <- data.frame(x=x_seq,
gaussian_1=dnorm(x_seq,
mean=mC$parameters$mean[1],
sd=sqrt(mC$p$variance$sigmasq[1]))*p[1],
gaussian_2=dnorm(x_seq,
mean=mC$parameters$mean[2],
sd=sqrt(mC$p$variance$sigmasq[2]))*p[2])And now we can plot our fitted Gaussians against the empirical distribution of the IoC data:
ggplot() +
geom_density(data=data.frame(IoC=mag_ioc),
aes(x=IoC),
lwd=2) +
geom_polygon(data=cluster_df,
aes(x=x,
y=gaussian_1,
fill="Cluster 1"),
alpha=0.75,
color="black") +
geom_polygon(data=cluster_df,
aes(x=x,
y=gaussian_2,
fill="Cluster 2"),
alpha=0.75,
color="black") +
xlab("Index of Copiotrophy") +
ylab("Density") +
theme(legend.position = "right") +
labs(fill="") +
scale_fill_manual(values=c("#7570B3","#1B9E77")) +
theme(legend.position = "bottom")Great, we have clusters! We can pull out the classification of each datapoint in our cluster directly from Mclust() if we want:
ioc_classifications <- data.frame(IoC=mC$data,
cluster=mC$classification,
uncertainty=mC$uncertainty)
ggplot(ioc_classifications,
aes(x=IoC,y=cluster,color=factor(cluster))) +
geom_point(size=5) +
scale_color_manual(values=c("#7570B3","#1B9E77")) +
geom_vline(xintercept=mC$data[which.max(mC$uncertainty)],lty=2)Based on the classifications above, there is a clear cutoff around an IoC of about 2.8 for this set of genomes that could be used as a cutoff between oligotrophy and copiotrophy for this dataset. That being said, it will generally be true that for datapoints close to the cutoff our classification uncertainty will be high:
ggplot(ioc_classifications,
aes(x=IoC,y=uncertainty,color=factor(cluster))) +
geom_point(size=5) +
scale_color_manual(values=c("#7570B3","#1B9E77")) +
geom_vline(xintercept=mC$data[which.max(mC$uncertainty)],lty=2)One solution, and the one we prefer, is to define cutoffs for oligotrophy and copiotrophy independently where classification uncertainty drops below some threshold (say, 0.25), thus leaving the classification of some points uncertain:
#Defining cutoffs
c_l <- max(mC$data[mC$uncertainty<0.25 & mC$data<mC$data[which.max(mC$uncertainty)]])
c_h <- min(mC$data[mC$uncertainty<0.25 & mC$data>mC$data[which.max(mC$uncertainty)]])
#Classifications for plotting
ioc_classifications$GrowthCluster <- "Undefined"
ioc_classifications$GrowthCluster[ioc_classifications$IoC<c_l] <- "Oligotroph"
ioc_classifications$GrowthCluster[ioc_classifications$IoC>c_h] <- "Copiotroph"
ioc_classifications$GrowthCluster <- factor(ioc_classifications$GrowthCluster,
levels=c("Oligotroph",
"Undefined",
"Copiotroph"))
ggplot(ioc_classifications,
aes(x=IoC,y=uncertainty,color=factor(GrowthCluster))) +
geom_point(size=5) +
scale_color_brewer(palette="Dark2",direction=-1) +
geom_vline(xintercept=c(c_l,c_h),lty=2) +
labs(color="")In our previous work, we even found that these intermediate organisms with an undefined growth class sometimes have biogeographic patterns distinct from those of both oligotrophs and copiotrophs (in Zakem et al. 2025 we call them “slow copiotrophs”).