Written by: C. Titus Brown
Primary Source: Living in an Ivory Basement
This is a recipe that provides a time- and memory- efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads provided by you. It does take into account coverage, so you need to specify a desired assembly coverage – we recommend starting with a coverage of 20.
Uses for this recipe include estimating the amount of memory required to achieve an assembly and providing a lower bound for metagenome assembly size and single-copy genome diversity.
This recipe will provide inaccurate estimates on transcriptomes (where splice variants will end up confusing the issue – this looks at single-copy sequence only) or for metagenomes with high levels of strain variation (where the assembler may collapse strain variants that this estimate will split).
Note: at the moment, the khmer script normalize-by-median.py needs to be updated from branch cleanup/normalize_and_saturate to run this code properly. Once we’ve merged it into the master branch and cut a release, we’ll remove this note and simply specify the khmer release required.
Let’s assume you have a simple genome with some 5x repeats, and you’ve done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa, you can get an estimate of the single-copy genome size (here known to be 5500 bp) by running normalize-by-median.py
~/dev/khmer/scripts/normalize-by-median.py -x 1e8 -k 20 -C 20 -R report.txt reads.fa ./estimate-genome-size.py -C 20 -k 20 reads.fa.keep report.txt
This yields the output:
Estimated (meta)genome size is: 8727 bp
This is off by about 50% for reasons that we don’t completely understand. Note that you can get more accurate estimates for this data set by increasing C and decreasing k, but 20/20 should work about this well for most data sets. (For an E. coli data set, it returns 6.5 Mbp, which is only about 25% off.)
Latest posts by C. Titus Brown (see all)
- On gaining tenure as an open scientist - December 26, 2014
- My review of “Determining the quality and complexity of NGS…” - December 18, 2014
- The post-apocalyptic world of binary containers - December 17, 2014