Scientists of Stature

Written by: Stephen Hsu

Primary Source: Information Processing, 8/28/18.

Table showing the variance between predicted and actual height.

The link below is to the published version of the paper we posted on biorxiv in late 2017 (see blog discussion). Our results have since been replicated by several groups in academia and in Silicon Valley.

Biorxiv article metrics: abstract views 31k, paper downloads 6k. Not bad! Perhaps that means the community understands now that genomic prediction of complex traits is a reality, given enough data.

Had we taken a poll on the eve of releasing our biorxiv article, I suspect 90+ percent of genomics researchers would have said that ~1 inch accuracy in predicted human height from genotype alone was impossible.

Since our article appeared, interesting results for complex phenotypes such as educational attainment, heart disease, diabetes, and other disease risks have been obtained.

Accurate Genomic Prediction Of Human Height

Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos and Stephen D. H. Hsu

GENETICS Early online August 27, 2018; https://doi.org/10.1534/genetics.118.301267

We construct genomic predictors for heritable but extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). The constructed predictors explain, respectively, ∼40, 20, and 9 percent of total variance for the three traits, in data not used for training. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The proportion of variance explained for height is comparable to the estimated common SNP heritability from Genome-Wide Complex Trait Analysis (GCTA), and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for SNPs. Thus, our results close the gap between prediction R-squared and common SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common variants. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier Genome-Wide Association Studies (GWAS) for out-of-sample validation of our results.
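As a quick sanity check on the numbers in the abstract: for a linear predictor, the fraction of variance explained is the squared correlation between predicted and actual values, so a correlation of about 0.65 and about 40 percent of variance explained are the same statement. The sketch below illustrates this. The height values are made up for illustration and are not data from the paper, and `pearson_r` is a helper name introduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predicted and actual heights in cm (toy values, not from the paper).
predicted_cm = [168.0, 171.0, 174.0, 177.0, 180.0]
actual_cm = [166.0, 175.0, 171.0, 181.0, 178.0]

r = pearson_r(predicted_cm, actual_cm)
variance_explained = r ** 2  # squared correlation of the toy data

# The correlation quoted in the abstract, squared.
abstract_r = 0.65
abstract_variance_explained = abstract_r ** 2  # 0.4225, i.e. roughly 40 percent
```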

The published version of the paper contains several new analyses in response to reviewer comments.

We added detailed comparisons between the top SNPs activated in our predictor and earlier GIANT GWAS hits. We also analyze the correlation structure of L1-activated SNPs: as expected, the algorithm automatically selects variants that are mostly decorrelated (approximately statistically independent) from each other.
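To see why an L1 penalty tends to pick decorrelated variants, here is a minimal sketch of L1-penalized regression (LASSO) solved by coordinate descent. It is a generic textbook version under stated assumptions, not the solver or settings used in the paper: the columns and phenotype are assumed to be centered, the penalty value is arbitrary, and the toy data and function names are invented for illustration. Column 1 is an exact copy of column 0, standing in for two SNPs in perfect linkage.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(columns, y, penalty, sweeps=100):
    """Minimize (1/2n) * ||y - X b||^2 + penalty * ||b||_1.

    columns: list of predictor columns (assumed centered); y: centered phenotype.
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)  # equals y - X b while b is all zeros
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            z = sum(x * x for x in col) / n
            if z == 0:
                continue  # a constant column carries no information
            # Covariance of column j with the residual, adding back its own term.
            rho = sum(x * (r + x * coefs[j]) for x, r in zip(col, residual)) / n
            new_coef = soft_threshold(rho, penalty) / z
            delta = new_coef - coefs[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(col, residual)]
                coefs[j] = new_coef
    return coefs

# Toy data: snp1 duplicates snp0; snp2 is orthogonal to both.
snp0 = [-1.0, -1.0, 1.0, 1.0]
snp1 = [-1.0, -1.0, 1.0, 1.0]
snp2 = [-1.0, 1.0, -1.0, 1.0]
phenotype = [-2.5, -1.5, 1.5, 2.5]  # equals 2 * snp0 + 0.5 * snp2

coefs = lasso_coordinate_descent([snp0, snp1, snp2], phenotype, penalty=0.25)
activated = [j for j, b in enumerate(coefs) if abs(b) > 1e-9]
```

In this toy case the duplicate column never activates: once column 0 has absorbed the shared signal, the leftover covariance with column 1 does not exceed the penalty, so its coefficient stays at zero.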

We compare our L1 method to simpler algorithms such as windowing: choose a genomic window size (e.g., 200k bp) and, within each window, keep only the SNP that accounts for the most variance. This does not work as well as L1 optimization, but it can produce a respectable predictor.
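The windowing idea can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: "accounts for the most variance" is taken here to mean the largest squared correlation between a single SNP and the phenotype, windows are fixed, non-overlapping bins on one chromosome, and the positions, genotypes, and function names are invented toy values.

```python
def variance_explained(genotypes, phenotype):
    """Squared Pearson correlation between one SNP and the phenotype."""
    n = len(phenotype)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_y = sum((y - mean_y) ** 2 for y in phenotype)
    if var_g == 0 or var_y == 0:
        return 0.0  # a SNP with no variation explains nothing
    return cov * cov / (var_g * var_y)

def select_by_window(positions, genotype_columns, phenotype, window_bp=200_000):
    """Return the index of the best-scoring SNP in each window, in window order."""
    best = {}  # window number -> (score, SNP index)
    for idx, (pos, col) in enumerate(zip(positions, genotype_columns)):
        window = pos // window_bp
        score = variance_explained(col, phenotype)
        if window not in best or score > best[window][0]:
            best[window] = (score, idx)
    return [best[w][1] for w in sorted(best)]

# Toy data: four SNPs on one chromosome, six individuals (all values hypothetical).
positions = [10_000, 150_000, 250_000, 390_000]
genotype_columns = [
    [0, 1, 0, 1, 0, 1],  # weakly related to the phenotype
    [0, 0, 1, 1, 2, 2],  # strongly related
    [2, 2, 1, 1, 0, 0],  # strongly related, opposite sign
    [1, 1, 1, 1, 1, 1],  # no variation
]
phenotype = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]

selected = select_by_window(positions, genotype_columns, phenotype)
```

With these values the first two SNPs fall in one 200k bp window and the last two in the next, so one SNP is kept per window. The selected SNPs could then be fed to an ordinary least-squares fit to build the simpler predictor.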

We investigate the correlation structure of height-associated SNPs: to what extent can the best linear combination of GIANT GWAS-significant SNPs predict the state of one of the predictor SNPs? This raises the interesting question: how much total information (entropy) is in the human genome?
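A toy calculation shows why correlation between SNPs matters for the entropy question: if one SNP's state can be predicted from another, the pair carries less information than the two counted separately. The sketch below uses empirical Shannon entropy on made-up genotype vectors. It is not an analysis from the paper, and the function name and data are assumptions for illustration.

```python
import math
from collections import Counter

def entropy_bits(states):
    """Empirical Shannon entropy, in bits, of a list of hashable states."""
    counts = Counter(states)
    total = len(states)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical genotypes (0, 1, or 2 copies of an allele) for eight individuals.
snp_a = [0, 0, 1, 1, 1, 1, 2, 2]
snp_b = list(snp_a)                 # perfectly correlated with snp_a
snp_c = [0, 1, 0, 1, 0, 1, 0, 1]    # independent of snp_a in this sample

h_a = entropy_bits(snp_a)
h_c = entropy_bits(snp_c)
joint_ab = entropy_bits(list(zip(snp_a, snp_b)))  # no more than h_a alone
joint_ac = entropy_bits(list(zip(snp_a, snp_c)))  # equals h_a + h_c
```

Here the correlated pair carries 1.5 bits in total, the same as `snp_a` alone, while the independent pair carries the full 2.5 bits.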

Stephen Hsu
Stephen Hsu is vice president for Research and Graduate Studies at Michigan State University. He also serves as scientific adviser to BGI (formerly Beijing Genomics Institute) and as a member of its Cognitive Genomics Lab. Hsu’s primary work has been in applications of quantum field theory, particularly to problems in quantum chromodynamics, dark energy, black holes, entropy bounds, and particle physics beyond the standard model. He has also made contributions to genomics and bioinformatics, the theory of modern finance, and in encryption and information security. Founder of two Silicon Valley companies—SafeWeb, a pioneer in SSL VPN (Secure Sockets Layer Virtual Private Networks) appliances, which was acquired by Symantec in 2003, and Robot Genius Inc., which developed anti-malware technologies—Hsu has given invited research seminars and colloquia at leading research universities and laboratories around the world.