Estimating (meta)genome size from shotgun data

Written by: C. Titus Brown

Primary Source: Living in an Ivory Basement

This is a recipe that provides a time- and memory- efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads provided by you. It does take into account coverage, so you need to specify a desired assembly coverage – we recommend starting with a coverage of 20.

Uses for this recipe include estimating the amount of memory required to achieve an assembly and providing a lower bound for metagenome assembly size and single-copy genome diversity.

This recipe will provide inaccurate estimates on transcriptomes (where splice variants will end up confusing the issue – this looks at single-copy sequence only) or for metagenomes with high levels of strain variation (where the assembler may collapse strain variants that this estimate will split).

Note: at the moment, the khmer script needs to be updated from branch cleanup/normalize_and_saturate to run this code properly. Once we’ve merged it into the master branch and cut a release, we’ll remove this note and simply specify the khmer release required.

Let’s assume you have a simple genome with some 5x repeats, and you’ve done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa, you can get an estimate of the single-copy genome size (here known to be 5500 bp) by running

~/dev/khmer/scripts/ -x 1e8 -k 20 -C 20 -R report.txt reads.fa
./ -C 20 -k 20 reads.fa.keep report.txt

This yields the output:

Estimated (meta)genome size is: 8727 bp

This is off by about 50% for reasons that we don’t completely understand. Note that you can get more accurate estimates for this data set by increasing C and decreasing k, but 20/20 should work about this well for most data sets. (For an E. coli data set, it returns 6.5 Mbp, which is only about 25% off.)


The following two tabs change content below.
C. Titus Brown
C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering and the Department of Microbiology and Molecular Genetics. He earned his PhD ('06) in developmental molecular biology from the California Institute of Technology. Brown is director of the laboratory for Genomics, Evolution, and Development (GED) at Michigan State University. He is a member of the Python Software Foundation and an active contributor to the open source software community. His research interests include computational biology, bioinformatics, open source software development, and software engineering.