Written by: Stephen Hsu
Primary Source: Information Processing
I recently blogged about a nice lecture by David Balding at the 2015 MLPM (Machine Learning for Personalized Medicine) Summer School: Machine Learning for Personalized Medicine: Heritability-based models for prediction of complex traits. In that talk he discussed some results concerning heritability estimation and potential improvements over GCTA. A new preprint on bioRxiv has the details:
Doug Speed, Na Cai, The UCLEB Consortium, Michael Johnson, Sergey Nejentsev, David Balding
SNP heritability, the proportion of phenotypic variance explained by SNPs, has been estimated for many hundreds of traits, and these estimates are being used to explore genetic architecture and guide future research. To estimate SNP heritability requires strong assumptions about how heritability is distributed across the genome, but the assumptions in current use have not been thoroughly tested. By analyzing imputed data for 42 human traits, we empirically derive an improved model for heritability estimation. It is commonly assumed that the expected heritability of a SNP does not depend on its allele frequency; we instead identify a more realistic relationship which reflects that heritability tends to decrease with minor allele frequency. Two methods for estimating SNP heritability, GCTA and LDAK, make contrasting assumptions about how heritability varies with linkage disequilibrium; we demonstrate that the model used by LDAK better reflects the properties of real data. Additionally, we show how genotype certainty can be incorporated in the heritability model; this enables the inclusion of poorly-imputed SNPs, which can capture substantial extra heritability. Our revised method typically results in substantially higher estimates of SNP heritability: for example, across 19 traits (mainly diseases), the estimates based on common SNPs (minor allele frequency >0.01) are on average 40% (SD 3) higher than those obtained using original GCTA, and 25% (SD 2) higher than those from the recently-proposed extension GCTA-LDMS. We conclude that for a wide range of traits, common SNPs tag a greater fraction of causal variation than is currently appreciated. When we also include rare SNPs (minor allele frequency <0.01), we find that across 23 quantitative traits, estimates of SNP heritability increase by on average 29% (SD 12), and that rare SNPs tend to contribute about half the heritability of common SNPs.
In contrast to GCTA, which assumes the same Gaussian distribution of effect sizes for every SNP, this paper considers effect sizes that depend on a weight w_j reflecting the local linkage disequilibrium around SNP j, as well as a SNP quality score r_j. (See equation 1 of the paper.) The intuition behind w_j is that if there are n SNPs in a small region which are all highly correlated, they are likely all proxies for the same causal variant, and hence one might overcount its contribution by assigning nearly equal effects to each of the SNPs. Instead, the method proposed in this paper (roughly) splits the effect size among the SNPs (Figure 1 below). Their model also allows the effect size distribution to depend on the MAF of SNP j: SNPs at lower frequency in the population contribute less to heritability than under the GCTA default assumption.
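To make the weighting idea concrete, here is a toy Python sketch. It assumes that the expected heritability of SNP j is proportional to [f_j(1 - f_j)]^(1 + alpha) × w_j × r_j, where f_j is the minor allele frequency; this is my reading of equation 1, so check the paper for the exact form. The function name, the toy data, and the hand-picked weights of 1/3 are all hypothetical (LDAK computes its weights from the observed LD structure, not by simple division), and the values alpha = -1 for the GCTA-like case and alpha = -0.25 for the LDAK-like case are likewise assumptions on my part.

```python
def expected_heritability_shares(mafs, ld_weights, info_scores, alpha):
    """Relative expected heritability per SNP, normalized to sum to 1.

    Assumed model (a sketch, not the authors' code):
        E[h_j^2]  proportional to  [f_j (1 - f_j)]^(1 + alpha) * w_j * r_j
    """
    raw = [
        (f * (1.0 - f)) ** (1.0 + alpha) * w * r
        for f, w, r in zip(mafs, ld_weights, info_scores)
    ]
    total = sum(raw)
    return [x / total for x in raw]


# Toy data: SNPs 0-2 are highly correlated proxies for one causal variant,
# SNP 3 stands alone. All are perfectly genotyped (r_j = 1) with equal MAF.
mafs = [0.30, 0.30, 0.30, 0.30]
info = [1.0, 1.0, 1.0, 1.0]

# GCTA-like assumption: no LD weighting (w_j = 1) and alpha = -1, so every
# SNP is expected to contribute the same heritability.
gcta_like = expected_heritability_shares(mafs, [1.0] * 4, info, alpha=-1.0)

# LDAK-like assumption: the three correlated SNPs share one unit of weight
# (hypothetical weights chosen by hand for illustration).
ld_weights = [1.0 / 3, 1.0 / 3, 1.0 / 3, 1.0]
ldak_like = expected_heritability_shares(mafs, ld_weights, info, alpha=-0.25)

print("GCTA-like share of the correlated cluster:", sum(gcta_like[:3]))
print("LDAK-like share of the correlated cluster:", sum(ldak_like[:3]))
```

Under the GCTA-like assumption the cluster of three proxies is expected to carry three times the heritability of the isolated SNP, while with LD weights the cluster as a whole counts about the same as a single SNP, which is the overcounting correction described above.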
The resulting heritability estimates tend to be higher than those from GCTA, so if this method is an improvement (as the authors argue), the amount of missing heritability is even smaller than the GCTA estimates suggest.
Supplement Figure 21 (p. 26) provides yet more criticism of Kumar et al., a paper we discussed previously here. [Kumar, S., Feldman, M., Rehkopf, D. & Tuljapurkar, S. Limitations of GCTA as a solution to the missing heritability problem, PNAS 113, E61–E70 (2015).]