Bayesian method

The NCBI database of species with known genome sizes was split by superkingdom (Bacteria, Archaea, Eukaryota). A distributional Bayesian linear hierarchical model was fitted to each superkingdom dataset using the brm function from the brms package. The general model structure is outlined below.

\begin{gather*}
\log(G_i) \sim \mathcal{N}(\mu_i, \sigma_i^2)
\end{gather*}

where $G_i$ is the genome size of species $i$ in units of 10 Mbp. The model places predictors on both the mean and the standard deviation. The mean is modelled as follows:

\begin{gather*}
\mu_i = \alpha_0 + \alpha_{genus_{g[i]}} + \alpha_{family_{f[i]}} + \alpha_{order_{o[i]}} + \alpha_{class_{c[i]}} + \alpha_{phylum_{p[i]}} \\
\alpha_{genus_{g[i]}} \sim \mathcal{N}(0, \sigma_{genus}^2) \\
\alpha_{family_{f[i]}} \sim \mathcal{N}(0, \sigma_{family}^2) \\
\alpha_{order_{o[i]}} \sim \mathcal{N}(0, \sigma_{order}^2) \\
\alpha_{class_{c[i]}} \sim \mathcal{N}(0, \sigma_{class}^2) \\
\alpha_{phylum_{p[i]}} \sim \mathcal{N}(0, \sigma_{phylum}^2)
\end{gather*}

The following prior distributions are used:

\begin{gather*}
\alpha_0 \sim \mathcal{N}(0, 5) \\
(\sigma_{genus}, \sigma_{family}, \sigma_{order}, \sigma_{class}, \sigma_{phylum}) \sim \mathcal{N}^+(0, 1)
\end{gather*}

Differences in genome size variability were observed among taxa, so the model also places predictors on the standard deviation of the response. The standard deviation is modelled as follows:

\begin{gather*}
\log(\sigma_i) = \lambda_0 + \lambda_{class_{c[i]}} + \lambda_{phylum_{p[i]}} \\
\lambda_{class_{c[i]}} \sim \mathcal{N}(0, s_{class}^2) \\
\lambda_{phylum_{p[i]}} \sim \mathcal{N}(0, s_{phylum}^2)
\end{gather*}

with priors

\begin{gather*}
\lambda_0 \sim \mathcal{N}(0, 1) \\
(s_{class}, s_{phylum}) \sim \mathcal{N}^+(0, 1)
\end{gather*}

$\mathcal{N}^+$ denotes the normal distribution truncated to positive values. $g[i]$, $f[i]$, $o[i]$, $c[i]$ and $p[i]$ are, respectively, the indices of the genus, family, order, class, and phylum of entry $i$ in the species-level database. Note that taxonomic groups are naturally nested within each other, and each index is unique to the particular taxonomic group it represents.

The estimation process uses Stan's Hamiltonian Monte Carlo algorithm with the No-U-Turn Sampler (NUTS).
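As an illustration, a minimal brms specification consistent with the model above could look as follows. This is a sketch, not the package's actual code: the data frame ncbi_data and its column names (log_genome_size plus the taxonomic ranks) are assumptions.

library(brms)

# Mean depends on all five taxonomic ranks; the standard deviation
# (sigma, modelled on the log scale by default) depends on class and phylum.
model_formula <- bf(
  log_genome_size ~ 1 + (1 | genus) + (1 | family) + (1 | order) +
    (1 | class) + (1 | phylum),
  sigma ~ 1 + (1 | class) + (1 | phylum)
)

# Priors matching the ones stated above; "sd" priors become half-normal
# because brms bounds standard deviations at zero.
model_priors <- c(
  prior(normal(0, 5), class = "Intercept"),
  prior(normal(0, 1), class = "sd"),
  prior(normal(0, 1), class = "Intercept", dpar = "sigma"),
  prior(normal(0, 1), class = "sd", dpar = "sigma")
)

# Fitted separately to each superkingdom subset of the reference database
fit <- brm(model_formula, data = ncbi_data, prior = model_priors,
           family = gaussian())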

Posterior predictions are obtained using the predict function from the brms package, and 95% credible intervals are taken as the 2.5% and 97.5% quantiles of the posterior predictive distribution.
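Continuing the sketch above, with new_queries standing in for a data frame of taxonomic assignments for the queries:

# Posterior predictions with 95% credible intervals; allow_new_levels
# lets queries introduce taxa absent from the reference data.
preds <- predict(fit, newdata = new_queries,
                 probs = c(0.025, 0.975),
                 allow_new_levels = TRUE)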

Queries corresponding to identified species with an available genome size estimate in the NCBI database are assigned the genome size value from the database (averaged at the species level), and 95% confidence intervals are calculated as XXX
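The species-level averaging step could be sketched as follows; ncbi_db, species and genome_size are illustrative names:

# Average all genome size entries belonging to the same species
species_means <- aggregate(genome_size ~ species, data = ncbi_db, FUN = mean)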

Frequentist method

A frequentist linear mixed-effects model using the lmer function from the lme4 package was fitted to the NCBI database of species with known genome sizes. The model is as follows:

\begin{gather*}
\log(G_i) = \alpha_0 + \alpha_{genus_{g[i]}} + \alpha_{family_{f[i]}} + e_i
\end{gather*}

where $\alpha_0$ is the overall mean, $\alpha_{genus_{g[i]}}$ and $\alpha_{family_{f[i]}}$ are the random effects of genus $g[i]$ and family $f[i]$, and $e_i$ is the residual error of observation $i$.

The estimation process uses the restricted maximum likelihood (REML) method. A prediction interval is computed using the predictInterval function from the merTools package. As higher nesting levels (order, class) are not included in the model, predictions are not produced for queries above the family level.
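A minimal sketch of this fit, under the same illustrative data frame and column names as the Bayesian example above:

library(lme4)
library(merTools)

# Random intercepts for genus and family only; REML is the lmer default
fit_lmm <- lmer(log_genome_size ~ 1 + (1 | genus) + (1 | family),
                data = ncbi_data)

# 95% prediction intervals for queries at or below the family level
pred_int <- predictInterval(fit_lmm, newdata = new_queries, level = 0.95)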

Weighted mean method

The weighted mean method computes the genome size of a query by averaging the known genome sizes of surrounding taxa in the taxonomic tree, using a weighting scheme in which more distant neighbours contribute less to the computed mean. The identification of related taxa is limited to ranks at and below order.
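In equation form (restating the pseudocode below), if the related taxa $j = 1, \dots, J$ have known genome sizes $G_j$ and node distances $d_j$ from the query, then

\begin{gather*}
w_j = \frac{1}{d_j + 1}, \qquad \hat{G} = \frac{\sum_{j=1}^{J} w_j G_j}{\sum_{j=1}^{J} w_j}
\end{gather*}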

For queries relating to well-characterised species for which many genetic studies have been performed, such as model organisms, this can lead to more precise predictions than the other two methods. This method can also perform better than the others when queries consist of lists of taxa (for example, blastn output, where several matches can be obtained per query). Otherwise, we suggest using one of the other methods, as the confidence intervals calculated by the weighted mean method are less reliable.

Pseudocode describing the weighted mean method computation:

# 1. Build table of parent information

for each match (a query can be a list of matches, e.g. a blast result):
  get the associated taxid
  get the list of its parents and their ranks
  for each parent:
    if the parent is above order rank:
      STOP and return 'Not enough genome size references for close taxa'
    if a genome size is associated with that taxon in the reference database:
      store it in a 'parent information table'
      also store the distance (number of nodes between the query and this taxon)
      if more than one genome contributed to the data in the 'parent information table':
        STOP iterating through parents here
    if the query is a list of matches and their common ancestor has been reached:
      STOP iterating through parents here

# 2. Compute weighted mean from the table of parent information

estimated_size = 0
estimated_variance = 0
sum_weights = 0
for each row of the parent information table:
  weight = 1.0 / (distance + 1)   # closer taxa weigh more
  sum_weights = sum_weights + weight
  estimated_size = estimated_size + weight * genome_size
  estimated_variance = estimated_variance + (weight * precomputed_standard_error)**2
estimated_size = estimated_size / sum_weights
standard_error = square_root(estimated_variance) / sum_weights   # SE of the weighted mean
Z = 1.96   # standard normal quantile for a 95% CI
margin_of_error = Z * standard_error
confidence_interval_lower = estimated_size - margin_of_error
confidence_interval_upper = estimated_size + margin_of_error
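A runnable R sketch of this computation, assuming the parent information table has been assembled into a data frame with illustrative columns genome_size, standard_error and distance:

# Weighted mean genome size with a 95% confidence interval
weighted_mean_estimate <- function(parents) {
  w   <- 1 / (parents$distance + 1)                # closer taxa weigh more
  est <- sum(w * parents$genome_size) / sum(w)     # weighted mean
  se  <- sqrt(sum((w * parents$standard_error)^2)) / sum(w)
  moe <- 1.96 * se                                 # 95% margin of error
  c(estimate = est, lower = est - moe, upper = est + moe)
}

# Example: two related taxa at node distances 1 and 2
parents <- data.frame(genome_size    = c(4.6, 5.1),
                      standard_error = c(0.2, 0.4),
                      distance       = c(1, 2))
weighted_mean_estimate(parents)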