Estimate whether your sequencing has saturated your sample to a given coverage

Written by: C. Titus Brown

Primary Source: Living in an Ivory Basement

This recipe provides a time-efficient way to determine whether you’ve saturated your sequencing depth, i.e. how much new information is likely to arrive with your next set of sequencing reads. It does so by using digital normalization to generate a “collector’s curve” of information collection.

Uses for this recipe include evaluating whether or not you should do more sequencing.

This approach is more accurate for low coverage than normalize-by-median’s reporting, because it collects the redundant reads rather than discarding them.


Note: at the moment, the khmer script saturate-by-median.py needs to be updated from branch cleanup/normalize_and_saturate to run this code properly. Once we’ve merged it into the master branch and cut a release, we’ll remove this note and simply specify the khmer release required.

Let’s assume you have a simple genome with some 5x repeats, and you’ve done some shotgun sequencing, and you want to know whether or not you’ve saturated to a coverage of 5 with your sequencing. You can use a variant of digital normalization, saturate-by-median, to run a collector’s curve:

~/dev/khmer/sandbox/saturate-by-median.py -x 1e8 -k 20 -C 5 -R report.txt --report-frequency 10 reads.fa

Then, plot the resulting saturation curve:

./plot-saturation-curve.py report.txt saturation.png --xmin 0 --ymin 0 --xmax 1500

images/slice/saturation.png

The x axis here is the number of reads examined (column 1 in report.txt), while the y axis (column 2) is the number of reads that are below a coverage of 5 in the data set at that point. You can see here that by the time you had sampled 1000 reads, you’d stopped seeing new coverage=5 reads, which suggests that further sequencing is unnecessary.

If you zoom out on the graph, you’ll see that the curve keeps on climbing, albeit much more slowly. This is due to the influence of error rate on prediction of “novel” reads, and is something we have to fix.

The following two tabs change content below.
C. Titus Brown
C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering and the Department of Microbiology and Molecular Genetics. He earned his PhD ('06) in developmental molecular biology from the California Institute of Technology. Brown is director of the laboratory for Genomics, Evolution, and Development (GED) at Michigan State University. He is a member of the Python Software Foundation and an active contributor to the open source software community. His research interests include computational biology, bioinformatics, open source software development, and software engineering.