Correlation and Variance

Written by: Stephen Hsu

Primary Source: Information Processing

In social science a correlation of R = 0.4 between two variables is typically considered a strong result. For example, both high school GPA and SAT score predict college performance with R ~ 0.4. Combining the two, one can achieve R ~ 0.5 to 0.6, depending on major. See Table 2 in my paper Data Mining the University.
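
As a rough check on how two R ~ 0.4 predictors can combine to R ~ 0.5 to 0.6, here is a minimal sketch using the standard two-predictor multiple correlation formula. The function name `multiple_r` and the values of `rho` (the correlation between the two predictors) are hypothetical illustrations; the post does not report the GPA-SAT correlation.

```python
import math

def multiple_r(r1, r2, rho):
    """Multiple correlation of Y with the best linear combination of two predictors.

    r1, r2: correlation of each predictor with Y.
    rho: correlation between the two predictors (must satisfy |rho| < 1).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must be strictly between -1 and 1")
    r_squared = (r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * rho) / (1.0 - rho * rho)
    return math.sqrt(r_squared)

# rho values are assumptions chosen for illustration, not taken from the post.
for rho in (0.0, 0.3, 0.5):
    print(f"rho = {rho:.1f}: combined R = {multiple_r(0.4, 0.4, rho):.2f}")
```

With both predictors at 0.4, the combined R falls from about 0.57 (uncorrelated predictors) toward 0.46 as the assumed overlap between them grows, which brackets the range quoted above.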

It’s easy to understand why SAT and college GPA are not more strongly correlated: some students work harder than others in school, and effort level is largely independent of SAT score. (For psychometricians, Conscientiousness and Intelligence are largely uncorrelated.) Also, it’s typically students in the upper half or quarter of cognitive ability relative to the general population that earn college degrees. If the entire range of students were enrolled in college the SAT-GPA correlation would be higher. Finally, there is, of course, inherent randomness in grading.
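
The range-restriction point can be illustrated with a small simulation. This is a sketch under stated assumptions: the full-population correlation `FULL_R = 0.5` is a made-up value, the data are bivariate normal, and selection is a hard cut at the population mean of X, which is cruder than real college admission.

```python
import random

def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
FULL_R = 0.5            # hypothetical full-population correlation (assumption)

pairs = []
for _ in range(200_000):
    x = rng.gauss(0.0, 1.0)
    y = FULL_R * x + (1.0 - FULL_R ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    pairs.append((x, y))

# Keep only the upper half of the X distribution (hard cut at the mean).
upper_half = [(x, y) for x, y in pairs if x > 0.0]

r_full = correlation(pairs)
r_restricted = correlation(upper_half)
print(f"full range: {r_full:.2f}, upper half only: {r_restricted:.2f}")
```

Under these assumptions the correlation within the selected upper half comes out near 0.33, noticeably below the 0.5 built into the full population.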

The figure below, from the Wikipedia entry on correlation, helps to visualize the meaning of various R values.

I often hear complaints of the type: “R = 0.4 is negligible! It only accounts for 16% of the total variance, leaving 84% unaccounted for!” (The fraction of variance unaccounted for is 1 – R^2.) This kind of remark even finds its way into quantitative genetics and genomics: “But the alleles so far discovered only account for 20% of total heritability! OMG GWAS is a failure!”

This is a misleading complaint. Variance is the mean of squared deviations, so it does not even carry the same units as the quantity of interest. Variance is a convenient quantity because it is additive for uncorrelated variables, but it leads to distorted intuitive evaluations of effect size: SDs are the natural unit, not SD^2!
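
For reference, the additivity property can be written out as follows. These are standard identities; A and B are generic placeholder variables, not quantities from the post.

```latex
% Variance of a sum; the covariance term vanishes for uncorrelated A and B.
\[
\operatorname{Var}(A+B) = \operatorname{Var}(A) + \operatorname{Var}(B) + 2\,\operatorname{Cov}(A,B)
\]
% So variances add, while standard deviations add only in quadrature
% (the inequality holds whenever both SDs are nonzero).
\[
\operatorname{Cov}(A,B) = 0 \;\Longrightarrow\;
\sigma_{A+B} = \sqrt{\sigma_A^2 + \sigma_B^2} \;\neq\; \sigma_A + \sigma_B
\]
```

If the quantity of interest is measured in points, the variance is in points squared, which is why a fraction of variance reads smaller than the corresponding effect in SD units.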

A less misleading way to think about the correlation R is as follows: given X, Y from a standardized bivariate distribution with correlation R, an increase in X corresponds to an expected increase in Y of dY = R dX. In other words, students with +1 SD SAT scores have, on average, roughly +0.4 SD college GPAs. Similarly, students with +1 SD college GPAs have, on average, +0.4 SD SAT scores.
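
The relation dY = R dX can be checked numerically. The sketch below assumes a standardized bivariate normal with R = 0.4 and fits least-squares slopes in both directions; the helper names `simulate_pairs` and `slope` are hypothetical.

```python
import random

def simulate_pairs(r, n, seed=0):
    """Draw n (x, y) pairs from a standardized bivariate normal with correlation r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # The sqrt(1 - r^2) weight keeps Var(y) = 1 and Corr(x, y) = r.
        y = r * x + (1.0 - r * r) ** 0.5 * noise
        pairs.append((x, y))
    return pairs

def slope(pairs):
    """Least-squares slope of the second coordinate regressed on the first."""
    n = len(pairs)
    m_first = sum(a for a, _ in pairs) / n
    m_second = sum(b for _, b in pairs) / n
    cov = sum((a - m_first) * (b - m_second) for a, b in pairs)
    var_first = sum((a - m_first) ** 2 for a, _ in pairs)
    return cov / var_first

pairs = simulate_pairs(r=0.4, n=200_000)
slope_y_on_x = slope(pairs)                         # expected change in Y per SD of X
slope_x_on_y = slope([(y, x) for x, y in pairs])    # expected change in X per SD of Y
print(f"Y on X: {slope_y_on_x:.2f}, X on Y: {slope_x_on_y:.2f}")
```

Both slopes should land close to 0.4, which is the symmetry described above: +1 SD in either variable goes with roughly +0.4 SD in the other, on average.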

Alternatively, if we assume that Y is the sum of (standardized) X and a noise term (the sum rescaled so that Y remains standardized), the standard deviation of the noise term relative to that of X is given by sqrt(1 - R^2)/R ~ 1/R for modest correlations. That is, the standard deviation of the noise is about 1/R times that of the signal X. When the correlation is 1/sqrt(2) ~ 0.7, the signal and noise terms have equal SD and variance. (“Half of the variance is accounted for by the predictor X”; see for comparison the figure above with R = 0.8.)
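
The noise formula follows in a few lines from the model just described. The symbols sigma (noise SD before rescaling) and epsilon (a standardized noise variable uncorrelated with X) are notation introduced here for the sketch.

```latex
% Model: signal plus noise, rescaled so that Var(Y) = 1.
\[
Y = \frac{X + \sigma\,\varepsilon}{\sqrt{1+\sigma^2}},
\qquad \operatorname{Var}(X) = \operatorname{Var}(\varepsilon) = 1,
\quad \operatorname{Cov}(X,\varepsilon) = 0
\]
% Since X and Y are both standardized, the correlation equals the covariance.
\[
R = \operatorname{Cov}(X,Y) = \frac{1}{\sqrt{1+\sigma^2}}
\;\Longrightarrow\;
\sigma^2 = \frac{1-R^2}{R^2},
\qquad
\sigma = \frac{\sqrt{1-R^2}}{R}
\]
% Checks: R = 1/\sqrt{2} gives \sigma = 1; for small R, \sigma \approx 1/R.
```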

As another example, test-retest correlations of SAT or IQ are pretty high, R ~ 0.9 or more. What fluctuations in score does this imply? In the model above the noise SD = sqrt(1 – 0.81)/0.9 ~ 0.5, so we’d expect the test score of an individual to fluctuate by about half a population SD (i.e., ~7 points for IQ or ~50 points per SAT section). This is similar to what is observed in the SAT data of Oregon students.
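
The arithmetic in this example can be reproduced with a short script. The population SDs used below (15 points for IQ, 100 points per SAT section) are conventional scalings assumed here; they are consistent with the ~7 and ~50 point figures above but are not stated explicitly in the post.

```python
import math

def noise_to_signal_sd(r):
    """Noise SD relative to the signal X in the 'Y = rescaled (X + noise)' model."""
    if not 0.0 < r <= 1.0:
        raise ValueError("r must be in (0, 1]")
    return math.sqrt(1.0 - r * r) / r

# Assumed population SDs (conventional scalings, not given in the post).
IQ_SD = 15.0
SAT_SECTION_SD = 100.0

ratio = noise_to_signal_sd(0.9)          # test-retest correlation R ~ 0.9
iq_fluctuation = ratio * IQ_SD
sat_fluctuation = ratio * SAT_SECTION_SD

print(f"noise/signal SD ratio: {ratio:.2f}")
print(f"IQ: ~{iq_fluctuation:.1f} points, SAT section: ~{sat_fluctuation:.0f} points")
```

The ratio comes out near 0.48, giving roughly 7 IQ points or just under 50 points per SAT section, in line with the rounded figures in the text.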

I worked this out during a boring meeting. It was partially stimulated by this article in the New Yorker about training for the SAT (if you go there, come back and read this to unfog your brain), and activist nonsense like this. Let me know if I made mistakes …  8-)

tl;dr Go back to bed. Big people are talking.

Stephen Hsu
Stephen Hsu is vice president for Research and Graduate Studies at Michigan State University. He also serves as scientific adviser to BGI (formerly Beijing Genomics Institute) and as a member of its Cognitive Genomics Lab. Hsu’s primary work has been in applications of quantum field theory, particularly to problems in quantum chromodynamics, dark energy, black holes, entropy bounds, and particle physics beyond the standard model. He has also made contributions to genomics and bioinformatics, the theory of modern finance, and in encryption and information security. Founder of two Silicon Valley companies—SafeWeb, a pioneer in SSL VPN (Secure Sockets Layer Virtual Private Networks) appliances, which was acquired by Symantec in 2003, and Robot Genius Inc., which developed anti-malware technologies—Hsu has given invited research seminars and colloquia at leading research universities and laboratories around the world.