Input
The estimate_genome_size() function of genomesizeR accepts the following input formats:
- Any csv-like file with a column containing either NCBI taxids or taxon names (format='csv' option, and specify sep, match_column and match_sep if needed), e.g. input.csv:
Id,Taxid,Taxon_name
1,562,Escherichia coli
2,9606,Homo sapiens
output = estimate_genome_size('input.csv', 'genomesizeRdata.tar.gz',
format='csv', sep=',', match_column='Taxid')
OR
output = estimate_genome_size('input.csv', 'genomesizeRdata.tar.gz',
format='csv', sep=',', match_column='Taxon_name')
- The common ‘taxonomy table’ format used by popular packages such as phyloseq (format='tax_table' option), e.g. input.csv:
TaxonID,Kingdom,Phylum,Class,Order,Family,Genus,Species
taxon1,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli
taxon2,Eukaryota,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
output = estimate_genome_size('input.csv', 'genomesizeRdata.tar.gz',
format='tax_table', sep=',')
A tax_table object can also be directly used as input, e.g.:
library(phyloseq)
taxmat = read.csv("input.csv", row.names = 1, stringsAsFactors = FALSE)
taxmat = as.matrix(taxmat)
tax_tab = tax_table(taxmat)
output = estimate_genome_size(tax_tab, 'genomesizeRdata.tar.gz',
format='tax_table')
- A file in BIOM format (format='biom' option), e.g. input.biom:
{
"id": null,
"format": "Biological Observation Matrix 1.0.0",
"format_url": "http://biom-format.org",
"type": "OTU table",
"generated_by": "example",
"date": "2025-10-02T00:00:00",
"matrix_type": "dense",
"matrix_element_type": "int",
"shape": [2, 1],
"rows": [
{
"id": "taxon1",
"metadata": {
"taxonomy": [
"k__Bacteria",
"p__Proteobacteria",
"c__Gammaproteobacteria",
"o__Enterobacterales",
"f__Enterobacteriaceae",
"g__Escherichia",
"s__Escherichia coli"
]
}
},
{
"id": "taxon2",
"metadata": {
"taxonomy": [
"k__Eukaryota",
"p__Chordata",
"c__Mammalia",
"o__Primates",
"f__Hominidae",
"g__Homo",
"s__Homo sapiens"
]
}
}
],
"columns": [
{ "id": "sample1", "metadata": null }
],
"data": [[1], [1]]
}
output = estimate_genome_size('input.biom', 'genomesizeRdata.tar.gz',
format='biom')
- Any table-like object (data.frame, matrix…) with a column containing either NCBI taxids or taxon names (format='dataframe' option), e.g.:
input_table = data.frame(
Taxid = c(562, 9606),
Taxon_name = c("Escherichia coli", "Homo sapiens"),
stringsAsFactors = FALSE
)
# input_table:
# Taxid Taxon_name
# 1 562 Escherichia coli
# 2 9606 Homo sapiens
output = estimate_genome_size(input_table, 'genomesizeRdata.tar.gz',
format='dataframe', match_column='Taxid')
- A vector containing either NCBI taxids or taxon names (format='vector' option), e.g.:
input_vector = c("Escherichia coli", "Homo sapiens")
output = estimate_genome_size(input_vector, 'genomesizeRdata.tar.gz',
format='vector')
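To try the file-based examples above, the CSV from the first bullet can be written from R and checked before it is passed to estimate_genome_size(). The following is a minimal base R sketch; the temporary path and the names csv_path and input_check are illustrative and not part of genomesizeR.

```r
# Write the example table from the first bullet to a temporary file.
csv_path <- file.path(tempdir(), "input.csv")
writeLines(c(
  "Id,Taxid,Taxon_name",
  "1,562,Escherichia coli",
  "2,9606,Homo sapiens"
), csv_path)

# Read it back with the same separator that would be passed as sep=','.
input_check <- read.csv(csv_path, sep = ",", stringsAsFactors = FALSE)

# The column intended for match_column must exist in the file.
match_column <- "Taxid"
stopifnot(match_column %in% names(input_check))
```

The same check applies when match_column='Taxon_name' is used instead.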
Reference database
The genome size reference database is built by querying the genome metadata of the curated NCBI RefSeq database. Filters keep only full genomes and discard entries that NCBI has tagged as anomalous or as abnormally large or small.
This raw database is then enriched with pre-computed information used by the package. Genome sizes are aggregated to the species level by iteratively averaging all entries below each species. As a result, the package can only provide estimates at the species level and above.
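The level-by-level averaging described above can be illustrated with a toy table. This is only a sketch under stated assumptions: the species and strain labels and the genome sizes are invented, and the two-step "mean of means" shown here is one plausible reading of the aggregation, not necessarily the procedure genomesizeR uses to build its database.

```r
# Toy reference entries (invented values): one row per assembly.
refs <- data.frame(
  species = c("Species X", "Species X", "Species X", "Species Y"),
  strain = c("X1", "X1", "X2", "Y1"),
  genome_size = c(4.0e6, 4.4e6, 5.0e6, 3.0e6),
  stringsAsFactors = FALSE
)

# Level 1: average the assemblies within each strain.
strain_means <- aggregate(genome_size ~ species + strain, data = refs, FUN = mean)

# Level 2: average the strain means within each species.
species_means <- aggregate(genome_size ~ species, data = strain_means, FUN = mean)

print(species_means)
```

Averaging level by level gives each strain equal weight in the species value, so a heavily sequenced strain does not dominate the result.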
Output
The output is a data frame with the same columns as the input if the input was a standard data frame, plus added columns describing the estimation and its quality. An option also allows an output format containing only the estimation information (TAXID, estimated_genome_size, confidence_interval_lower, confidence_interval_upper, genome_size_estimation_status, model_used).
Columns added:
- estimated_genome_size: Estimated genome size
- confidence_interval_lower: Lower limit of the confidence interval
- confidence_interval_upper: Upper limit of the confidence interval
- genome_size_estimation_status: Whether the estimation succeeded (‘OK’) or, if it failed, the reason for failure:
  - ‘Query is NA’: The query’s taxon could not be read from the input
  - ‘NCBI taxid not found’: The query’s taxon was not found in the database
  - ‘Parent taxids not found’: The taxa at ranks above the query’s taxon could not be computed
  - ‘Parent taxid ranks not found’: The ranks of the taxa above the query’s taxon could not be computed
  - ‘Not enough genome size references for close taxa’: With the weighted mean method, no references were found close enough to the query in the taxonomic tree to estimate a genome size
  - ‘Confidence interval to estimated size ratio > ci_threshold’: The ratio of the confidence interval to the estimated size is greater than the chosen threshold
  - ‘Could not compute confidence interval’: The genome size was successfully estimated, but the confidence interval could not be estimated
  - ‘No reference and query too high in taxonomic tree to fit in model’: The lmm model is not able to estimate a genome size for the query
  - ‘Bayesian model not found’: There was an issue loading the Bayesian models
- model_used: The model used for the estimation
- LCA: If queries are made of a list of taxa, the taxid of their Last Common Ancestor. Otherwise, LCA is equal to TAXID.
- NCBI taxids of all taxonomic ranks, with one column per rank
- genome_size_estimation_rank: Rank of the parent taxon used for the estimation (weighted mean method only)
- genome_size_estimation_distance: Distance in tree nodes (ranks) between the query and the parent taxon used for the estimation (weighted mean method only)
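The status column makes it straightforward to separate successful estimates from failures. The sketch below uses a mock data frame built with the column names listed above; every value in it is invented for illustration, and the relative interval width is computed as (upper - lower) / estimate, which is an assumption about how the ratio is defined rather than a documented formula.

```r
# Mock output with the documented column names; all values are invented.
output <- data.frame(
  TAXID = c(562, 9606, NA),
  estimated_genome_size = c(5.0e6, 3.1e9, NA),
  confidence_interval_lower = c(4.5e6, 2.9e9, NA),
  confidence_interval_upper = c(5.5e6, 3.3e9, NA),
  genome_size_estimation_status = c("OK", "OK", "Query is NA"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose estimation succeeded.
is_ok <- output$genome_size_estimation_status == "OK"
ok <- output[is_ok, ]

# Count the failure reasons among the remaining rows.
failures <- table(output$genome_size_estimation_status[!is_ok])

# Relative width of the confidence interval (assumed definition, see lead-in).
ok$ci_ratio <- (ok$confidence_interval_upper - ok$confidence_interval_lower) /
  ok$estimated_genome_size

print(ok[, c("TAXID", "estimated_genome_size", "ci_ratio")])
print(failures)
```

Inspecting the failure counts before downstream analysis shows how many queries were dropped and why.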