This function loads a query file or table, together with an archive containing reference databases and Bayesian models, and predicts genome sizes.

Usage

estimate_genome_size(
  queries,
  refdata_path,
  format = "csv",
  sep = ",",
  match_column = NA,
  match_sep = ";",
  output_format = "input",
  method = "bayesian",
  ci_threshold = 0.2,
  n_cores = "half"
)
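
A minimal call might look like the sketch below. The two file paths are hypothetical placeholders, and the call is guarded so that it only runs when the function is loaded and both files exist. The argument values are simply the defaults shown in the signature above.

```r
# Hypothetical paths: replace with your own query file and reference archive.
query_file <- "path/to/queries.csv"
refdata_archive <- "path/to/reference_archive"

# Only attempt the call if the function is loaded and both files exist.
can_run <- exists("estimate_genome_size", mode = "function") &&
  file.exists(query_file) && file.exists(refdata_archive)

if (can_run) {
  results <- estimate_genome_size(
    queries = query_file,
    refdata_path = refdata_archive,
    format = "csv",
    sep = ",",
    method = "bayesian",
    ci_threshold = 0.2,
    n_cores = "half"
  )
  head(results)
}
```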

Arguments

queries

Queries: path to a CSV or BIOM file, or variable name of a table object

refdata_path

Path to the downloadable archive containing the reference databases and the Bayesian models

format

Query format: CSV/data frame format ('table', default), taxonomy table format as used in e.g. phyloseq ('tax_table'), or BIOM format ('biom')

sep

If the query is in table format, the column separator

match_column

If the query is in table format, the column containing match information (one or several matches per row)

match_sep

If the query is in table format and the match column holds several matches, the separator between matches
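
To illustrate what a match column with several matches per row could look like, the sketch below splits such a column in base R. The column names and values are made up for illustration, and this is not necessarily how the function parses the column internally.

```r
# Hypothetical query table: 'matches' holds one or several matches per row,
# separated by ";" (the default value of match_sep).
queries <- data.frame(
  query_id = c("q1", "q2"),
  matches = c("match_A", "match_B;match_C;match_D"),
  stringsAsFactors = FALSE
)

# Split each entry on the separator; fixed = TRUE treats ";" literally
# rather than as a regular expression.
split_matches <- strsplit(queries$matches, ";", fixed = TRUE)
names(split_matches) <- queries$query_id

# Number of matches per query
n_matches <- lengths(split_matches)
print(n_matches)
```

With a table like this, the corresponding arguments would presumably be match_column = "matches" and match_sep = ";".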

output_format

Format of the output. Default: "input", a data frame with the same columns as the input, plus the added columns "TAXID", "estimated_genome_size", "confidence_interval_lower", "confidence_interval_upper", "genome_size_estimation_status", and "model_used", as well as taxids at all ranks. Other format available: "data.frame", a data frame with only the added columns listed above, without the taxid columns.

method

Method to use for genome size estimation: 'bayesian' (default), 'weighted_mean', or 'lmm'

ci_threshold

Threshold for the confidence interval, expressed as a proportion of the predicted size (e.g. 0.2 means that estimates whose confidence interval represents more than 20% of the predicted size are discarded)
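
The effect of the threshold can be sketched in base R as follows. This assumes that the size of the interval is measured as (upper - lower) / estimate, which is an interpretation of the description above rather than a confirmed detail of the implementation. The values are made up.

```r
# Made-up estimates with lower and upper interval bounds.
estimates <- data.frame(
  estimated_genome_size = c(4.0e6, 5.0e6),
  confidence_interval_lower = c(3.8e6, 3.0e6),
  confidence_interval_upper = c(4.2e6, 7.0e6)
)
ci_threshold <- 0.2

# ASSUMPTION: interval size is measured as (upper - lower) / estimate.
relative_width <- with(
  estimates,
  (confidence_interval_upper - confidence_interval_lower) / estimated_genome_size
)

# Keep only estimates whose relative interval width does not exceed the threshold.
kept <- estimates[relative_width <= ci_threshold, ]
print(relative_width)
print(nrow(kept))
```

Under this assumption, the first estimate has a relative width of 0.1 and is kept, while the second has a relative width of 0.8 and is discarded.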

n_cores

Number of CPU cores to use (default is 'half': half of all available cores)
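
As a rough idea of what 'half' could correspond to on a given machine, the sketch below computes half of the detected cores with the parallel package that ships with R. The rounding and the fallback to one core are assumptions, not necessarily what the function does internally.

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1.
available <- detectCores()
if (is.na(available)) available <- 1L

# ASSUMPTION: half of the available cores, rounded down, never fewer than one.
half_cores <- max(1L, available %/% 2L)
print(half_cores)
```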