Over- and Underfitting

I just read a nice post by Jean-François Puget, suitable for readers not terribly familiar with the subject, on overfitting in machine learning. I was going to leave a comment mentioning a couple of things, and then decided that with minimal padding I could make it long enough to be a blog post. I agree …

$1.2 trillion college loan bubble?

See also When everyone goes to college: a lesson from S. Korea. Returns to a “college education” are highly dependent on the intrinsic cognitive ability and work ethic of the individual. WSJ: College Loan Glut Worries Policy Makers The U.S. government over the last 15 years made a trillion-dollar investment to improve the nation’s workforce, …

University quality and global rankings

The paper below is one of the best I’ve seen on university rankings. Yes, there is a univariate factor one might characterize as “university quality” that correlates across multiple measures. As I have long suspected, the THE (Times Higher Education) and QS rankings, which are partially survey/reputation based, are biased …

Coin Flipping

I don’t recall the details, but in a group conversation recently someone brought up the fact that if you flip a fair coin repeatedly until you encounter a particular pattern, the expected number of tosses needed to get HH is greater than the expected number to get HT (H and T denoting head and tail …
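The HH-versus-HT claim can be checked with a short sketch. The function names `expected_tosses` and `simulate` are hypothetical, and the exact calculation uses the standard prefix/suffix overlap rule for a fair coin, which is not necessarily the argument the post itself goes on to make.

```python
import random


def expected_tosses(pattern: str) -> int:
    """Exact expected number of fair-coin tosses until `pattern` first appears.

    Overlap rule: add 2**k for every length k at which the first k symbols
    of the pattern equal its last k symbols.
    """
    n = len(pattern)
    return sum(2 ** k for k in range(1, n + 1) if pattern[:k] == pattern[n - k:])


def simulate(pattern: str, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (seeded for repeatability)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        recent = ""  # the most recent len(pattern) tosses
        tosses = 0
        while recent != pattern:
            recent = (recent + rng.choice("HT"))[-len(pattern):]
            tosses += 1
        total += tosses
    return total / trials


if __name__ == "__main__":
    for p in ("HH", "HT"):
        print(p, expected_tosses(p), round(simulate(p), 2))
```

HH overlaps itself at lengths 1 and 2, giving 2 + 4 = 6 expected tosses, while HT overlaps itself only at length 2, giving 4. The simulation should land close to those values.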

Genetic ancestry and brain morphology

Population structure — i.e., distribution of gene variants by ancestral group — is reflected in brain morphology, as measured using MRI. Brain morphology measurements can be used to predict ancestry. Strictly speaking, the data only show correlation, not genetic causation, but the most plausible interpretation is that genetic differences are causing morphological differences. One could …

GCTA, Missing Heritability, and All That

Bioinformaticist E. Stovner asked about a recent PNAS paper which is critical of GCTA. My comments are below. It’s a shame that we don’t have a better online platform (e.g., like Quora or StackOverflow) for discussing scientific papers. This would allow the authors of a paper to communicate directly with interested readers, immediately after the paper …

On Statistics, Reporting and Bacon

I’ve previously ranted about the need for a “journalistic analytics” college major, to help with reporting (and editing) news containing statistical analysis. Today I read an otherwise well written article that inadvertently demonstrates how easy it is for even seasoned reporters to slip up. The cover story of the November 9 issue of Time magazine, …

David Donoho interview at HKUST

A long interview with Stanford professor David Donoho (academic web page) at the IAS at HKUST. Donoho was a pioneer in thinking about sparsity in high dimensional statistical problems. The motivation for this came from real world problems in geosciences (oil exploration), encountered in Texas when he was still a student. Geophysicists were using Compressed …

Regression Via Pseudoinverse

In my last post (OLS Oddities), I mentioned that OLS linear regression could be done with multicollinear data using the Moore-Penrose pseudoinverse. I want to tidy up one small loose end. Specifically, let X be the matrix of predictor observations (including a column of ones if a constant term is desired), let y be a vector of …

OLS Oddities

During a couple of the lectures in the Machine Learning MOOC offered by Prof. Andrew Ng of Stanford University, I came across two statements about ordinary least squares linear regression (henceforth OLS) that surprised me. Given that I taught regression for years, I was surprised that I could be surprised (meta-surprised?), but these two facts …

Producing Reproducible R Code

A tip in the Google+ Statistics and R community led me to the reprex package for R. Quoting the author (Professor Jennifer Bryan, University of British Columbia), the purpose of reprex is to [r]ender reproducible example code to Markdown suitable for use in code-oriented websites, such as StackOverflow.com or GitHub. Much has been written about …

Expert Prediction: hard and soft

Jason Zweig writes about Philip Tetlock’s Good Judgment Project below. See also Expert Predictions, Perils of Prediction, and this podcast talk by Tetlock. A quick summary: good amateurs (i.e., smart people who think probabilistically and are well read) typically perform as well as or better than area experts (e.g., PhDs in Social Science, History, Government; …

Colleges ranked by Nobel, Fields, Turing and National Academies output

Colleges ranked by Nobel, Fields, Turing and National Academies output This Quartz article describes Jonathan Wai’s research on the rate at which different universities produce alumni who make great contributions to science, technology, medicine, and mathematics. I think the most striking result is the range of outcomes: the top school outperforms good state flagships (R1 …

More Shiny Hacks

In a previous entry, I posted code for a hack I came up with to add vertical scrolling to the sidebar of a web-based application I’m developing in Shiny (using shinydashboard). Since then, I’ve bumped into two more issues, leading to two more hacks that I’ll describe here. First, I should point out that I’m using …

One Hundred Years of Statistical Developments in Animal Breeding

This nice review gives a history of the last 100 years in statistical genetics as applied to animal breeding (via Andrew Gelman). One Hundred Years of Statistical Developments in Animal Breeding (Annu. Rev. Anim. Biosci. 2015. 3:19–56 DOI:10.1146/annurev-animal-022114-110733) Statistical methodology has played a key role in scientific animal breeding. Approximately one hundred years of statistical …

Sparsity estimates for complex traits

Note the estimate of few to ten thousand causal SNP variants, consistent with my estimates for height and cognitive ability. Sparsity (number of causal variants), along with heritability, determines the amount of data necessary to “solve” a specific trait. See Genetic architecture and predictive modeling of quantitative traits. T1D looks like it could be cracked …

Decision Analytics and Teacher Qualifications

Disclaimers: This is a post about statistics versus decision analytics, not a prescription for improving the educational system in the United States (or anywhere else, for that matter). tl;dr. The genesis of today’s post is a blog entry I read on Spartan Ideas titled “Is Michigan Turning Away Good Teachers?” (Spartan Ideas is a “metablog”, curated …

IQ prediction from structural MRI

These authors use machine learning techniques to build sparse predictors based on grey/white matter volumes of specific regions. Correlations obtained are ~ 0.7 (see figure). I predict that genomic estimators of this kind will be available once ~ 1 million genomes and cognitive scores are available for analysis. See also Myths, Sisyphus and g. MRI-Based …

The Monty Hall Evolver

The Monty Hall problem is very famous (Wikipedia, NYT). It is so famous because it so easily fools almost everyone the first time they hear about it, including people with doctoral degrees in various STEM fields. There are three doors. Behind one is a big prize, a car, and behind the two others are goats. …

Income, wealth, and IQ

I’m occasionally asked about financial returns to cognitive ability. As a rough rule of thumb, judging from the graphs below (obtained here), I would say: On average, an increase of IQ by one SD corresponds to ~ $30k per annum of additional income. (Somewhat less than 1 SD in income; the distribution is far from …

Rigorous inequalities

The Effects of an Anti-grade-Inflation Policy at Wellesley College Journal of Economic Perspectives, 28(3): 189-204 (2014) DOI: 10.1257/jep.28.3.189 Average grades in colleges and universities have risen markedly since the 1960s. Critics express concern that grade inflation erodes incentives for students to learn; gives students, employers, and graduate schools poor information on absolute and relative …

Top 25 richest living comedians

It’s fairly common knowledge that comedy isn’t a terribly lucrative career. Not only do most comedians spend decades doing small-time standup hoping to be discovered, but most of those comedians never end up being discovered either. But what about the comedians that did hit it big? To provide some insight into what it takes to …

CBO Against Piketty?

This report using CBO (Congressional Budget Office) data claims that income inequality did not widen during the Great Recession (table above compares 2007 to 2011). After government transfer payments (taxes, entitlements, etc.) are taken into account, one finds that low income groups were cushioned, while high earners saw significant declines in income. … The CBO on …

Python usage survey 2014

Remember that Python usage survey that went around the interwebs late last year? Well, the results are finally out and I’ve visualized them below for your perusal. This survey has been running for two years now (2013-2014), so where we have data for both years, I’ve charted the results so we can see the changes …

Venture capital in the 1980s

Via Dominic Cummings (@odysseanproject), this long discussion of the history of venture capital, which emphasizes the now largely forgotten 1980s. VC in most parts of the developed world, even large parts of the US, resembles the distant past of the above chart. There is a big gap between Silicon Valley and the rest. Heat Death: …
