Written by: Stephen Hsu
Primary Source: Information Processing, 8/28/18.
The link below is to the published version of the paper we posted on bioRxiv in late 2017 (see blog discussion). Our results have since been replicated by several groups in academia and in Silicon Valley.
bioRxiv article metrics: 31k abstract views, 6k paper downloads. Not bad! Perhaps that means the community now understands that genomic prediction of complex traits is a reality, given enough data.
Had we taken a poll on the eve of releasing our bioRxiv article, I suspect 90+ percent of genomics researchers would have said that ~1 inch accuracy in predicted human height from genotype alone was impossible.
Since our article appeared, interesting results have been obtained for other complex phenotypes, including educational attainment and risk of heart disease, diabetes, and other diseases.
Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos and Stephen D. H. Hsu
GENETICS Early online August 27, 2018; https://doi.org/10.1534/genetics.118.301267
We construct genomic predictors for heritable but extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). The constructed predictors explain, respectively, ∼40, 20, and 9 percent of total variance for the three traits, in data not used for training. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The proportion of variance explained for height is comparable to the estimated common SNP heritability from Genome-Wide Complex Trait Analysis (GCTA), and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for SNPs. Thus, our results close the gap between prediction R-squared and common SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common variants. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier Genome-Wide Association Studies (GWAS) for out-of-sample validation of our results.
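As a quick consistency check on the numbers in the abstract, the quoted correlation and the quoted variance explained can be related through the standard identity for a single predictor score. This is textbook arithmetic, not a calculation taken from the paper; the symbols r, R^2, sigma_y and sigma_res are notation introduced here for illustration.

```latex
% r: correlation between predicted and actual height (quoted as ~0.65)
% R^2: proportion of variance explained by the predictor
\[
  R^2 = r^2 \approx (0.65)^2 \approx 0.42 ,
\]
% which agrees with the quoted figure of roughly 40 percent.
% The residual (unexplained) standard deviation, as a fraction of the
% phenotype standard deviation \sigma_y, is then
\[
  \frac{\sigma_{\mathrm{res}}}{\sigma_y} = \sqrt{1 - r^2} \approx \sqrt{0.58} \approx 0.76 .
\]
```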
The published version of the paper contains several new analyses in response to reviewer comments.
We added detailed comparisons between the top SNPs activated in our predictor and earlier GIANT GWAS hits. We also analyzed the correlation structure of the L1-activated SNPs: as expected, the algorithm automatically selects variants that are mostly decorrelated from (close to statistically independent of) each other.
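For readers unfamiliar with L1-penalized regression, here is a minimal sketch of how an L1 penalty "activates" only a subset of SNPs. It is not the code used in the paper: it runs plain cyclic coordinate descent on a small simulated genotype matrix, and the sample size, SNP count, allele frequency, causal indices, and penalty `lam` are all assumptions chosen for illustration.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def standardize(values):
    """Return the values rescaled to mean 0 and variance 1 (all zeros if constant)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if sd == 0.0:
        return [0.0] * n
    return [(v - mean) / sd for v in values]


def lasso_coordinate_descent(columns, y, lam, n_sweeps=50):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent.

    `columns` holds the standardized columns of X, so each coordinate update
    reduces to a single soft-thresholding step.
    """
    n, p = len(y), len(columns)
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, kept up to date as beta changes
    for _ in range(n_sweeps):
        for j, xj in enumerate(columns):
            rho = sum(a * b for a, b in zip(xj, resid)) / n + beta[j]
            new_beta = soft_threshold(rho, lam)
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [r - delta * a for r, a in zip(resid, xj)]
                beta[j] = new_beta
    return beta


# Simulated data (hypothetical): 300 people, 50 SNPs, 5 of them causal.
random.seed(0)
n_people, n_snps = 300, 50
causal = [3, 11, 27, 35, 42]
columns = [
    # Genotype = number of minor alleles (0, 1 or 2), allele frequency 0.3.
    standardize([sum(random.random() < 0.3 for _ in range(2)) for _ in range(n_people)])
    for _ in range(n_snps)
]
raw = [sum(columns[j][i] for j in causal) + random.gauss(0.0, 1.0) for i in range(n_people)]
mean_raw = sum(raw) / n_people
y = [v - mean_raw for v in raw]  # centered phenotype

beta = lasso_coordinate_descent(columns, y, lam=0.2)
active = [j for j, b in enumerate(beta) if b != 0.0]
print("activated SNPs:", active)
```

The soft-thresholding step is what sets most coefficients exactly to zero, so the fit itself performs the variant selection. With real, correlated genotypes the penalty also tends to keep one representative of a correlated group of SNPs, which is consistent with the decorrelation noted above.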
We compare our L1 method to simpler algorithms, such as windowing: choose a genomic window size (e.g., 200k bp) and, within each window, use only the SNP that accounts for the most variance. This does not work as well as L1 optimization, but can produce a respectable predictor.
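The windowing baseline can be sketched in a few lines. The function below is a hypothetical illustration, not the paper's implementation: it assumes non-overlapping, fixed-size windows on a single chromosome, and it takes a precomputed per-SNP score (for example, single-SNP variance explained) as input. The name `select_by_window` and the toy positions and scores are made up for the example.

```python
def select_by_window(positions, scores, window_bp=200_000):
    """For each fixed-size window, keep the index of the highest-scoring SNP.

    positions: base-pair coordinates of the SNPs on one chromosome.
    scores: per-SNP score, e.g. variance explained by that SNP alone.
    """
    best = {}  # window number -> index of the best SNP seen so far
    for idx, (pos, score) in enumerate(zip(positions, scores)):
        window = pos // window_bp
        if window not in best or score > scores[best[window]]:
            best[window] = idx
    return sorted(best.values())


# Toy input (hypothetical): five SNPs falling into three 200k bp windows.
positions = [10_000, 150_000, 210_000, 390_000, 650_000]
scores = [0.002, 0.010, 0.004, 0.007, 0.001]
selected = select_by_window(positions, scores)
print(selected)  # [1, 3, 4]
```

Unlike the L1 approach, each SNP here is scored in isolation, so nothing prevents SNPs chosen in neighboring windows from tagging the same underlying signal.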
We investigate the correlation structure of height-associated SNPs: to what extent can the best linear combination of GIANT GWAS-significant SNPs predict the state of one of the predictor SNPs? This raises the interesting question: how much total information (entropy) is in the human genome?