Estimate genome sizes — estimate_genome

This function loads a query file or table, together with an archive containing the reference databases and Bayesian models, and predicts genome sizes.

Usage

estimate_genome_size(
  queries,
  refdata_path,
  format = "csv",
  sep = ",",
  match_column = NA,
  match_sep = ";",
  output_format = "input",
  method = "bayesian",
  ci_threshold = 0.3,
  n_cores = "half"
)
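
As a rough illustration of the signature above, a call could be assembled as in the following sketch. The file names `queries.csv` and `reference_data.tar.gz` are hypothetical placeholders, not names taken from this documentation. The numeric value passed to `n_cores` assumes that a plain core count is accepted. All other values are the defaults shown in the signature, except `method`, which is set to one of the documented alternatives. The call is guarded so that the snippet does nothing when the function or the files are not available.

```r
# Hypothetical inputs: replace with real paths before running.
query_file   <- "queries.csv"            # assumed CSV file of queries
refdata_file <- "reference_data.tar.gz"  # assumed name of the downloaded archive

# Collect the arguments in a list so they can be inspected before the call.
call_args <- list(
  queries       = query_file,
  refdata_path  = refdata_file,
  format        = "csv",
  sep           = ",",
  match_column  = NA,
  match_sep     = ";",
  output_format = "input",
  method        = "weighted_mean",  # one of the documented methods
  ci_threshold  = 0.3,
  n_cores       = 1                 # assumption: a plain number of cores is accepted
)

# Only run the estimation when the function and both files are available.
if (exists("estimate_genome_size") &&
    file.exists(query_file) && file.exists(refdata_file)) {
  results <- do.call(estimate_genome_size, call_args)
  print(head(results))
}
```

Building the argument list first makes it easy to check or log the settings before a potentially long-running estimation starts.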

Arguments

queries: Path to a CSV or BIOM file, or the variable name of a table object
refdata_path: Path to the downloadable archive containing the reference databases and the Bayesian models
format: Query format: csv/dataframe format ('table', default), taxonomy table format as used in e.g. phyloseq ('tax_table') or BIOM format ('biom')
sep: If table format, column separator
match_column: If table format, the column containing match information (with one or several matches)
match_sep: If table format and several matches in match column, separator between matches
output_format: Format of the output. Default: "input", a data frame with the same columns as the input plus the added columns "TAXID", "estimated_genome_size", "confidence_interval_lower", "confidence_interval_upper", "genome_size_estimation_status" and "model_used", as well as taxids at all ranks. Also available: "data.frame", a data frame containing only the added columns listed above, without the taxid columns.
method: Method to use for genome size estimation, 'bayesian' (default), 'weighted_mean' or 'lmm'
ci_threshold: Threshold for the confidence interval as a proportion of the predicted size (e.g. 0.3 means that estimations with a confidence interval that represents more than 30% of the predicted size will be tagged)
n_cores: Number of CPU cores to use (default is 'half': half of all available cores)
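
To make the `ci_threshold` description more concrete, the sketch below applies one possible reading of it to made-up numbers: the width of the confidence interval (upper minus lower bound) is divided by the predicted size and compared with the threshold. That reading, the toy values, and the `ci_too_wide` column are assumptions for illustration only. The function's own tagging is reported in "genome_size_estimation_status", and its exact rule and values are not specified here.

```r
# Toy estimates (made-up values), using column names listed under output_format.
estimates <- data.frame(
  estimated_genome_size     = c(5.0e6, 3.0e6),
  confidence_interval_lower = c(4.5e6, 2.0e6),
  confidence_interval_upper = c(5.5e6, 4.2e6)
)

ci_threshold <- 0.3

# Assumed reading: confidence interval width relative to the predicted size.
relative_ci_width <- with(
  estimates,
  (confidence_interval_upper - confidence_interval_lower) / estimated_genome_size
)

# Hypothetical flag column; not the package's own status values.
estimates$ci_too_wide <- relative_ci_width > ci_threshold

print(estimates)
```

With these numbers the first estimate has a relative width of 0.2 and is not flagged, while the second has a relative width of about 0.73 and is flagged.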