This function loads a query file or table, together with an archive containing reference databases and Bayesian models, and predicts genome sizes.
Usage
estimate_genome_size(
  queries,
  refdata_path,
  format = "csv",
  sep = ",",
  match_column = NA,
  match_sep = ";",
  output_format = "input",
  method = "bayesian",
  ci_threshold = 0.3,
  n_cores = "half"
)
Arguments
- queries
Queries: path to a CSV or BIOM file, or an R object used as input.
- refdata_path
Path to the downloadable archive containing the reference databases and the Bayesian models.
- format
Input format: "csv" for a CSV file (default), "tax_table" for a taxonomy table file or object as used in e.g. phyloseq, "biom" for a BIOM file, "dataframe" for a table-style object (e.g. a data.frame or matrix object), "vector" for a vector object.
- sep
For table-style or CSV input, the column separator (default: ",").
- match_column
For table-style or CSV input, the column containing match information (one or several matches per entry).
- match_sep
For table-style or CSV input with several matches in the match column, the separator between matches (default: ";").
- output_format
Output format. Default: "input", a data frame with the same columns as the input plus the added columns "TAXID", "estimated_genome_size", "confidence_interval_lower", "confidence_interval_upper", "genome_size_estimation_status" and "model_used", as well as taxids at all ranks. Also available: "data.frame", a data frame with only the added columns listed above, without the taxids at all ranks.
- method
Method to use for genome size estimation: "bayesian" (default), "weighted_mean" or "lmm".
- ci_threshold
Threshold for the confidence interval as a proportion of the predicted size (e.g. 0.3 means that estimates whose confidence interval represents more than 30% of the predicted size are flagged in the output table).
- n_cores
Number of CPU cores to use (default: "half", i.e. half of all available cores).
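To illustrate how ci_threshold might be interpreted, the sketch below flags estimates whose confidence interval is wide relative to the predicted size. It is not the package's internal code: the helper flag_wide_intervals is hypothetical, the values are made up, and it assumes that the size of the interval is measured as its width (upper bound minus lower bound).

```r
# Illustrative sketch only; not taken from the package source.
# Assumption: an interval "represents" more than ci_threshold of the
# predicted size when (upper - lower) / estimate exceeds ci_threshold.
flag_wide_intervals <- function(estimate, lower, upper, ci_threshold = 0.3) {
  relative_width <- (upper - lower) / estimate
  relative_width > ci_threshold
}

# Made-up genome size estimates (in base pairs) for three hypothetical queries
estimate <- c(4.0e6, 5.0e6, 2.0e6)
lower    <- c(3.8e6, 4.0e6, 1.7e6)
upper    <- c(4.3e6, 6.5e6, 2.2e6)

# TRUE marks an estimate that would be flagged under this assumption
flagged <- flag_wide_intervals(estimate, lower, upper, ci_threshold = 0.3)
print(flagged)
```

Under this reading, only the second query is flagged, since its interval spans half of the predicted size. Lowering ci_threshold makes the flagging stricter.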