Recipe: Estimate genome size and coverage from shotgun sequencing data

Written by: C. Titus Brown

Primary Source: Living in an Ivory Basement

In Extracting shotgun reads based on coverage in the data set, we showed how to get a read coverage spectrum for a shotgun data set. This is a useful diagnostic tool that can be used to estimate total genome size, average coverage, and repetitive content.

Uses for this recipe include estimating the sequencing depth of a particular library, estimating the repeat content, and generally characterizing your shotgun data.

See Estimating (meta)genome size from shotgun data for a recipe that estimates the single-copy genome size; that recipe can be used for any metagenomic or genomic data set, including ones subjected to whole genome amplification, while this recipe can only be used for data from unamplified/unbiased whole genome shotgun sequencing.

Let’s assume you have a genome that you’ve sequenced to some unknown coverage, and you want to estimate the total genome size, repeat content, and sequencing coverage. If your reads are in reads.fa, you can generate a read coverage spectrum from your genome like so: -x 1e8 -k 20 reads.fa
~/dev/khmer/sandbox/ reads.fa reads-cov.dist
./ reads-cov.dist reads-cov.png --xmax=600 --ymax=500

and the result looks like this:


For this (simulated) data set, you can see three peaks: one on the far right, which contains the high-abundance reads from repeats; one in the middle, which contains the reads from the single-copy genome; and one all the way on the left at ~1, which contains all of the highly erroneous reads

Pick a coverage number near the peak that you think represents the single-copy genome – with sufficient coverage, and in the absence of significant heterozygosity, it should be the first peak that’s well to the right of zero. Here, we’ll pick 100.

Then run

./ reads.fa reads-cov.dist 100

and you should see:

Estimated total genome size is: 7666 bp
Estimated size > 5x repeats is: 587 bp
Estimated single-genome coverage is: 109.8

The true values (for this simulated genome) are 7500bp genome size, 500 bp > 5x repeat, and 150x coverage, so this isn’t a terribly bad set of estimates. The single-genome coverage estimate is probably low because we’re estimating read coverage with a k-mer based approach, which is sensitive to highly erroneous reads.

A few notes:

  • Yes, you could automate the selection of the first peak, but I think it’s always a good idea to look at your data!
  • In the case of highly polymorphic diploid genomes, you would have to make sure to include the haploid reads but only count them 50% towards the final genome size.
The following two tabs change content below.
C. Titus Brown
C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering and the Department of Microbiology and Molecular Genetics. He earned his PhD ('06) in developmental molecular biology from the California Institute of Technology. Brown is director of the laboratory for Genomics, Evolution, and Development (GED) at Michigan State University. He is a member of the Python Software Foundation and an active contributor to the open source software community. His research interests include computational biology, bioinformatics, open source software development, and software engineering.