This function loads a query file or table and an archive containing reference databases and Bayesian models, and predicts genome sizes.

Usage

estimate_genome_size(
  queries,
  refdata_path,
  format = "csv",
  sep = ",",
  match_column = NA,
  match_sep = ";",
  output_format = "input",
  method = "bayesian",
  ci_threshold = 0.3,
  n_cores = "half"
)

Arguments

queries

Queries: path to a CSV or BIOM file, or an R object used as input.

refdata_path

Path to the downloadable archive containing the reference databases and the Bayesian models.

format

Input format: "csv" for a CSV file (default), "tax_table" for a taxonomy table file or object as used in e.g. phyloseq, "biom" for a BIOM file, "dataframe" for a table-style object (e.g. a data.frame or matrix object), "vector" for a vector object.

sep

If table-style or csv format, column separator (default: ",").

match_column

If table-style or csv format, the column containing match information, with one or several matches per row (default: NA).

match_sep

If table-style or csv format and the match column contains several matches, the separator between matches (default: ";").
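As an illustration of how match_column and match_sep fit together, the sketch below builds a small query table whose match column holds several matches per row. The column names, the taxon names, and the archive path are hypothetical. Whether match_column accepts a column name (as assumed here) or an index, and what form the matches should take, are assumptions not confirmed by this page. The call is only attempted when the function and the archive are both available.

```r
# Hypothetical query table: one row per query, with several matches per row
# separated by ";" (the default match_sep).
queries <- data.frame(
  query_id = c("q1", "q2"),
  matches  = c("Escherichia coli;Shigella flexneri", "Bacillus subtilis"),
  stringsAsFactors = FALSE
)

# Base-R view of how a ";"-separated match column breaks into single matches.
split_matches <- strsplit(queries$matches, ";", fixed = TRUE)
n_matches <- lengths(split_matches)

# Hypothetical archive path; replace with the location of the downloaded archive.
refdata_path <- "path/to/refdata_archive"

# Only attempt the call when the function and the archive are both present.
if (exists("estimate_genome_size") && file.exists(refdata_path)) {
  results <- estimate_genome_size(
    queries,
    refdata_path,
    format = "dataframe",
    match_column = "matches",   # assumption: column given by name
    match_sep = ";",
    output_format = "data.frame"
  )
}
```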

output_format

Format of the output. Default: "input", a data frame with the same columns as the input plus the added columns "TAXID", "estimated_genome_size", "confidence_interval_lower", "confidence_interval_upper", "genome_size_estimation_status", "model_used", as well as taxids at all ranks. Also available: "data.frame", a data frame with only the added columns listed above, without the taxids at all ranks.

method

Method to use for genome size estimation: "bayesian" (default), "weighted_mean" or "lmm".

ci_threshold

Threshold for the confidence interval as a proportion of the predicted size (default: 0.3). For example, 0.3 means that estimates whose confidence interval represents more than 30% of the predicted size are tagged in the output table.
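The arithmetic behind such a threshold can be sketched in base R. This is not the package's implementation: it assumes that "represents" means the interval width (upper minus lower) divided by the predicted size, and the values below are made up for illustration.

```r
# Hypothetical estimates and confidence bounds (in base pairs).
estimates <- data.frame(
  estimated_genome_size     = c(5.0e6, 4.0e6),
  confidence_interval_lower = c(4.5e6, 3.0e6),
  confidence_interval_upper = c(5.5e6, 5.5e6)
)
ci_threshold <- 0.3

# Assumption: relative width = (upper - lower) / predicted size.
relative_width <- with(
  estimates,
  (confidence_interval_upper - confidence_interval_lower) / estimated_genome_size
)

# TRUE where the interval is wider than the threshold allows.
exceeds_threshold <- relative_width > ci_threshold
```

Under this reading, the first estimate has a relative width of 0.2 and is not tagged, while the second has a relative width of 0.625 and is tagged.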

n_cores

Number of CPU cores to use (default: "half", i.e. half of all available cores).
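One way a value such as "half" could be turned into a core count is sketched below. The helper resolve_n_cores is hypothetical and is not part of the documented function; rounding down, using at least one core, and clamping to the number of available cores are all assumptions.

```r
library(parallel)

# Hypothetical helper: translate an n_cores setting into an integer core count.
resolve_n_cores <- function(n_cores = "half",
                            available = parallel::detectCores()) {
  # detectCores() can return NA on some platforms; fall back to one core.
  if (is.na(available)) available <- 1L
  available <- as.integer(available)

  if (identical(n_cores, "half")) {
    # Assumption: round down, but never use fewer than one core.
    return(max(1L, available %/% 2L))
  }

  # Assumption: numeric requests are clamped to the range [1, available].
  max(1L, min(as.integer(n_cores), available))
}

n_workers <- resolve_n_cores("half")
```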