Compressed sensing and genomes

Written by: Stephen Hsu

Primary Source: Information Processing

For more discussion of our recent paper (The human genome as a compressed sensor), see this blog post by my collaborator Carson Chow and another on the machine learning blog Nuit Blanche. One of our main points in the paper is that the phase transition between the regimes of poor and good recovery for the L1-penalized algorithm (LASSO) is readily detectable, and that the scaling behavior of the phase boundary allows theoretical estimates of the amount of data required for good performance at a given sparsity. Apparently, this reasoning has appeared before in the compressed sensing literature, where it can be used to optimize hardware designs for sensors. In our case, the sensor is the human genome, and its statistical properties are fixed. Fortunately, we find that genotype matrices are in the same universality class as random matrices, which are good compressed sensors.

The black line in the figure below is the theoretical prediction (Donoho 2006) for the location of the phase boundary. The shading shows results from our simulations. The scale on the right is the squared L2 norm of the difference between the recovered effects vector and the actual effects.
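The kind of simulation behind such a figure can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used in the paper: a Gaussian random matrix stands in for the genotype matrix, the LASSO is solved with a plain iterative soft-thresholding (ISTA) loop, the penalty is set from the known noise level, and the error is normalized by the size of the true effects. The function names and parameter values are hypothetical.

```python
import numpy as np

def soft_threshold(x, t):
    """Shrink each entry of x toward zero by t (the L1 proximal step)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(A, y, lam, n_iter=500):
    """Approximately minimize (1/2n)||y - A x||^2 + lam * ||x||_1 with ISTA."""
    n, p = A.shape
    # Step size = 1 / Lipschitz constant of the gradient of the smooth term.
    step = n / (np.linalg.norm(A, 2) ** 2)
    x = np.zeros(p)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y) / n
        x = soft_threshold(x - step * grad, step * lam)
    return x

def recovery_error(n, p, s, h2=0.5, seed=0):
    """Relative squared L2 error of the recovered effects vector.

    Assumptions (not from the post): Gaussian sensing matrix, Gaussian
    effect sizes, and an oracle penalty based on the true noise level.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, p))           # stand-in for a genotype matrix
    x_true = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    x_true[support] = rng.standard_normal(s)  # s nonzero effects
    signal = A @ x_true
    # Choose the noise variance so that signal / (signal + noise) = h2.
    noise_sd = np.sqrt(signal.var() * (1.0 - h2) / h2)
    y = signal + rng.normal(0.0, noise_sd, size=n)
    lam = noise_sd * np.sqrt(2.0 * np.log(p) / n)
    x_hat = lasso_ista(A, y, lam)
    return np.sum((x_hat - x_true) ** 2) / np.sum(x_true ** 2)

if __name__ == "__main__":
    p, s = 400, 10
    for n in (30, 100, 300):
        err = recovery_error(n, p, s)
        print(f"n={n:4d}  delta=n/p={n / p:.3f}  rho=s/n={s / n:.3f}  error={err:.3f}")
```

Sweeping n at fixed sparsity s moves across the (delta, rho) plane; the error should drop as n grows past the boundary, although in a toy problem this small and this noisy the transition is smeared out rather than sharp.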

Perhaps we are approaching a D-T moment in genomics ;-)

… a Donoho-Tao moment in the Radar community at the next CoSeRa meeting :-). As a reminder the Donoho-Tao moment was well put in this 2008 IPAM newsletter: …. It’s David Donoho [5] reportedly exclaiming [to] a panel of NSF folks “You’ve got Terry Tao (a Fields medalist [6]) talking to geoscientists, what do you want?” ….

In previous discussions I predicted that on the order of millions of phenotype-genotype pairs would be sufficient to extract the genetic architecture of complex traits like height or g. This estimate is based on two ingredients:

1. The sparsity of these traits is probably no greater than s ~ 10k (evidence for this comes from looking at genomic Hamming distance as a function of phenotype distance).

2. The compressed sensing results suggest that good recovery can be achieved above a data threshold of roughly n ~ 30 s (assuming 10^6 SNPs and additive heritability h^2 = 0.5 or so).

With s ~ 10k, the threshold is n ~ 30 x 10k = 300k. Including an extra order of magnitude to be safe leads to n ~ millions.
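The arithmetic behind this estimate is simple enough to write out. The snippet below is just a back-of-the-envelope helper; the function name and its parameters are hypothetical, and the ratio of 30 is the rough figure quoted above for about 10^6 SNPs and h^2 around 0.5, not a universal constant.

```python
def required_sample_size(sparsity, ratio=30, safety_factor=10):
    """Return (threshold, padded) sample sizes for a given sparsity s.

    threshold = ratio * s is the rough n ~ 30 s data threshold;
    padded multiplies it by an extra safety factor (one order of magnitude).
    """
    threshold = ratio * sparsity
    return threshold, threshold * safety_factor

threshold, padded = required_sample_size(10_000)
print(threshold, padded)  # 300000 3000000
```

A different sparsity, SNP count, or heritability would change the ratio, so the output should be read as an order-of-magnitude guide only.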
