Input

The estimate_genome_size() function of genomesizeR accepts the following input formats:

- Any CSV-like file with a column containing either NCBI taxids or taxon names (format='csv' option; specify sep, match_column and match_sep if needed), e.g.:

input.csv:

Id,Taxid,Taxon_name
1,562,Escherichia coli
2,9606,Homo sapiens

output = estimate_genome_size('input.csv', 'genomesizeRdata.tar.gz',
                              format='csv', sep=',', match_column='Taxid')

OR

output = estimate_genome_size('input.csv', 'genomesizeRdata.tar.gz', 
                              format='csv', sep=',', match_column='Taxon_name')

- The common ‘taxonomy table’ format used by popular packages such as phyloseq (format='tax_table' option), e.g.:

input.csv:

TaxonID,Kingdom,Phylum,Class,Order,Family,Genus,Species
taxon1,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
taxon2,Eukaryota,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens

output = estimate_genome_size('input.csv', 'genomesizeRdata.tar.gz',
                              format='tax_table', sep=',')

A phyloseq tax_table object can also be used directly as input, e.g.:

library(phyloseq)
taxmat = read.csv("input.csv", row.names = 1, stringsAsFactors = FALSE)
taxmat = as.matrix(taxmat)
tax_tab = tax_table(taxmat)

output = estimate_genome_size(tax_tab, 'genomesizeRdata.tar.gz', 
                              format='tax_table')

- A file in BIOM format (format='biom' option), e.g.:

input.biom:

{
  "id": null,
  "format": "Biological Observation Matrix 1.0.0",
  "format_url": "http://biom-format.org",
  "type": "OTU table",
  "generated_by": "example",
  "date": "2025-10-02T00:00:00",
  "matrix_type": "dense",
  "matrix_element_type": "int",
  "shape": [2, 1],
  "rows": [
    {
      "id": "taxon1",
      "metadata": {
        "taxonomy": [
          "k__Bacteria",
          "p__Proteobacteria",
          "c__Gammaproteobacteria",
          "o__Enterobacterales",
          "f__Enterobacteriaceae",
          "g__Escherichia",
          "s__Escherichia coli"
        ]
      }
    },
    {
      "id": "taxon2",
      "metadata": {
        "taxonomy": [
          "k__Eukaryota",
          "p__Chordata",
          "c__Mammalia",
          "o__Primates",
          "f__Hominidae",
          "g__Homo",
          "s__Homo sapiens"
        ]
      }
    }
  ],
  "columns": [
    { "id": "sample1", "metadata": null }
  ],
  "data": [1, 1]
}

output = estimate_genome_size('input.biom', 'genomesizeRdata.tar.gz',
                              format='biom')

- Any table-like object (data.frame, matrix…) with a column containing either NCBI taxids or taxon names (format='dataframe' option), e.g.:

input_table = data.frame(
                Taxid = c(562, 9606),
                Taxon_name = c("Escherichia coli", "Homo sapiens"),
                stringsAsFactors = FALSE
              )

# input_table:
#   Taxid       Taxon_name
# 1   562 Escherichia coli
# 2  9606     Homo sapiens

output = estimate_genome_size(input_table, 'genomesizeRdata.tar.gz', 
                              format='dataframe', match_column='Taxid')

- A vector containing either NCBI taxids or taxon names (format='vector' option), e.g.:

input_vector = c("Escherichia coli", "Homo sapiens")

output = estimate_genome_size(input_vector, 'genomesizeRdata.tar.gz', 
                              format='vector')
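
The CSV-like input described above does not have to be comma-separated. As a minimal sketch, the following base R code writes the same example table as a tab-separated file, which could then be passed with sep='\t'. The file name input.tsv and the use of a temporary directory are illustrative choices, and the final call is left commented out because it requires genomesizeR and the reference archive.

```r
# Example table with the same columns as input.csv above
input_table = data.frame(
                Id = c(1, 2),
                Taxid = c(562, 9606),
                Taxon_name = c("Escherichia coli", "Homo sapiens"),
                stringsAsFactors = FALSE
              )

# Write it as a tab-separated file (hypothetical file name, temporary location)
input_path = file.path(tempdir(), "input.tsv")
write.table(input_table, input_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Inspect the raw lines that were written
file_lines = readLines(input_path)
print(file_lines)

# Not run here: requires genomesizeR and the reference data archive.
# output = estimate_genome_size(input_path, 'genomesizeRdata.tar.gz',
#                               format='csv', sep='\t', match_column='Taxid')
```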

Reference database

The genome size reference database is built by querying the genome metadata of the curated NCBI RefSeq database. Filters keep only full genomes and discard entries that NCBI has tagged as anomalous, or as abnormally large or small.

This raw database is then processed to add pre-computed information used by the package. Genome sizes are aggregated to the species level by iteratively averaging all entries below each species. As a result, the package can only provide estimates at the species level and above.
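
To illustrate what iterative averaging up to the species level can look like, here is a minimal sketch on a toy taxonomy. It is not the package's actual preparation code: the node names, the sizes, and the helper aggregate_size() are all hypothetical, and the sketch assumes that each node without its own size takes the mean of its children's sizes.

```r
# Toy taxonomy: only the lowest entries carry a genome size.
# All ids and sizes are made up for illustration.
nodes = data.frame(
          id     = c("species_A", "subsp_A1", "strain_A1a", "strain_A1b", "strain_A2"),
          parent = c(NA, "species_A", "subsp_A1", "subsp_A1", "species_A"),
          size   = c(NA, NA, 4.0e6, 5.0e6, 6.0e6),
          stringsAsFactors = FALSE
        )

# Hypothetical helper: a node's size is its own size if known,
# otherwise the mean of its children's aggregated sizes.
aggregate_size = function(id, nodes) {
  own = nodes$size[nodes$id == id]
  if (!is.na(own)) return(own)
  children = nodes$id[!is.na(nodes$parent) & nodes$parent == id]
  if (length(children) == 0) return(NA_real_)
  mean(vapply(children, aggregate_size, numeric(1), nodes = nodes), na.rm = TRUE)
}

# subsp_A1 averages its two strains (4.5e6), then species_A averages
# subsp_A1 and strain_A2
species_size = aggregate_size("species_A", nodes)
print(species_size)
```

Under this assumption the species value is 5.25e6, which differs from the flat mean of the three strains (5.0e6): averaging level by level gives each branch the same weight regardless of how many entries it contains.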

Output

The output is a data frame. If the input was a standard data frame, the output keeps the input columns and adds columns describing the estimation and its quality. An option also allows an output containing only the estimation information (TAXID, estimated_genome_size, confidence_interval_lower, confidence_interval_upper, genome_size_estimation_status, model_used).

Columns added:

  • estimated_genome_size: Estimated genome size
  • confidence_interval_lower: Lower limit of the confidence interval
  • confidence_interval_upper: Upper limit of the confidence interval
  • genome_size_estimation_status: ‘OK’ if the estimation succeeded; otherwise, the reason for failure:
    • ‘Query is NA’: The query’s taxon could not be read from the input
    • ‘NCBI taxid not found’: The query’s taxon was not found in the database
    • ‘Parent taxids not found’: The taxa at ranks above the query’s taxon could not be computed
    • ‘Parent taxid ranks not found’: The ranks of the taxa above the query’s taxon could not be computed
    • ‘Not enough genome size references for close taxa’: With the weighted mean method, no references were found close enough to the query in the taxonomic tree to estimate a genome size
    • ‘Confidence interval to estimated size ratio > ci_threshold’: The ratio of the confidence interval to the estimated size is greater than the chosen threshold
    • ‘Could not compute confidence interval’: The genome size was successfully estimated, but the confidence interval could not be estimated
    • ‘No reference and query too high in taxonomic tree to fit in model’: The linear mixed model (lmm) is not able to estimate a genome size for the query
    • ‘Bayesian model not found’: There was an issue loading the Bayesian models
  • model_used: The model used for the estimation
  • LCA: If a query consists of a list of taxa, the taxid of their Last Common Ancestor. Otherwise, LCA is equal to TAXID.
  • NCBI taxids of all taxonomic ranks, with one column per rank
  • genome_size_estimation_rank: Rank of the parent taxon used for the estimation (weighted mean method only)
  • genome_size_estimation_distance: Distance in tree nodes (ranks) between the query and the parent taxon used for the estimation (weighted mean method only)
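
The columns listed above lend themselves to simple post-processing in base R. The following sketch uses a mock output data frame (all values are invented for illustration, not real estimates) to keep the successful estimations, compute the relative width of the confidence interval, and tabulate the status values. The ci_ratio column and its definition (interval width divided by the estimate) are assumptions made for this example and may not match how the package computes the ratio compared to ci_threshold.

```r
# Mock output with the documented column names; values are invented
output = data.frame(
           TAXID = c(562, 9606, NA),
           estimated_genome_size = c(5.0e6, 3.1e9, NA),
           confidence_interval_lower = c(4.5e6, 2.8e9, NA),
           confidence_interval_upper = c(5.5e6, 3.4e9, NA),
           genome_size_estimation_status = c("OK", "OK", "Query is NA"),
           stringsAsFactors = FALSE
         )

# Keep only the rows where the estimation succeeded
ok = output[output$genome_size_estimation_status == "OK", ]

# Relative confidence interval width (assumed definition: width / estimate)
ok$ci_ratio = (ok$confidence_interval_upper - ok$confidence_interval_lower) /
  ok$estimated_genome_size

# Count how often each status occurs, e.g. to review failure reasons
status_counts = table(output$genome_size_estimation_status)
print(ok)
print(status_counts)
```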