Written by: Paul Rubin
Primary Source: OR in an OB World
I just read a nice post by Jean-François Puget, suitable for readers not terribly familiar with the subject, on overfitting in machine learning. I was going to leave a comment mentioning a couple of things, and then decided that with minimal padding I could make it long enough to be a blog post.
I agree with pretty much everything J-F wrote about overfitting. He mentioned cross-validation as a tool for combating the tendency to overfit. It is always advisable to partition your sample into a training set (observations used to compute the parameters of a model) and a testing set (used to assess the true accuracy of the model). The rationale is that a trained model tends to look more accurate on the training data than it truly is. In cross-validation, you divide the original sample several times (differently each time) and repeat the training and testing on each division.
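To make that concrete, here is a minimal sketch of k-fold cross-validation in plain Python. The function name `k_fold_splits` and the choice of five folds are my own illustration, not anything from J-F's post; in practice you would likely reach for a library routine rather than rolling your own.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Each of the k folds serves as the testing set exactly once, so every
    observation is used for testing in one round and for training in the
    other k - 1 rounds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a different partition for each seed
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        stop = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:stop])
        start = stop
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Example: 10 observations and 5 folds give 5 train/test pairs of sizes 8 and 2.
splits = list(k_fold_splits(10, k=5))
```

You would fit the model on each training index set, score it on the matching test index set, and average the k scores to estimate accuracy.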
A related approach, perhaps better suited to “big data” situations, is to split your (presumably large) sample into three subsamples: training, testing and validation. Every model under consideration is trained on the same training set, and then tested on the same testing set. Note that if your model contains a tunable parameter, such as the weight assigned to a regularization term, the same basic model with different (user-chosen) values of the tuning parameter counts as a collection of distinct models for our purposes here. Since the testing data is used to choose among models, the danger that results on the training set look better than they really are now morphs into the danger that results on the testing set for the “winning” model look better than they really are. Hence the third (validation) sample is used to get a more reliable estimate of how good the final model really is.
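Here is one way that three-subsample workflow might look, using a deliberately tiny model. The toy data, the one-parameter ridge fit and the candidate regularization weights are all invented for illustration; the point is only that the testing set picks the winner and the validation set, untouched until the end, reports its error.

```python
import random

random.seed(42)

# A toy dataset: x in [0, 1], y = 2x plus noise ("stuff we can't explain").
xs = [random.random() for _ in range(300)]
data = [(x, 2 * x + random.gauss(0, 0.1)) for x in xs]

# Three disjoint subsamples: train to fit, test to choose, validate to report.
train, test, validate = data[:200], data[200:250], data[250:]

def fit_ridge(sample, lam):
    """Fit y = w*x with an L2 penalty weight lam.

    Each value of lam is treated as a distinct model. Minimizing
    sum (y - w*x)^2 + lam * w^2 gives w = sum(xy) / (sum(x^2) + lam).
    """
    sxx = sum(x * x for x, _ in sample)
    sxy = sum(x * y for x, y in sample)
    return sxy / (sxx + lam)

def mse(w, sample):
    return sum((y - w * x) ** 2 for x, y in sample) / len(sample)

# Fit every candidate on the SAME training set, choose on the SAME testing set...
candidates = {lam: fit_ridge(train, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
best_lam = min(candidates, key=lambda lam: mse(candidates[lam], test))

# ...then report the winner's error on the untouched validation set.
final_error = mse(candidates[best_lam], validate)
```

The validation error is the number you would quote, since `best_lam` was chosen by peeking at the testing set and its testing-set error is therefore optimistically biased.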
One statement by J-F with which I disagree, based on a combination of things I’ve read and my experiences teaching statistics to business students, is the following:
Underfitting is quite easy to spot: predictions on train[ing] data aren’t great.
My problem with this is that people building machine learning models (or basic regression models, for that matter) frequently enter the process with a predetermined sense of either how accurate the model should be or how accurate they need it to be (to appease journal reviewers or get the boss off their backs). If they don’t achieve this desired accuracy, they will decide (consistent with J-F’s statement) that predictions “aren’t great” and move to a different (most likely more complex or sophisticated) model. In the “big data” era, it’s disturbingly easy to throw in more variables, but that was a danger even in the Dark Ages (i.e., when I was teaching).
I recall one team of MBAs working on a class project requiring them to build a predictive model for demand of some product. I gave every team the same time series for the dependent variable and told them to pick whatever predictors they wanted (subject, of course, to availability of data). This particular team came up with a reasonably accurate, reasonably plausible model, but it was noticeably less accurate on observations from the early 1980s. So they stuck in an indicator variable for whether Ronald Reagan was president of the US, and instantly got better accuracy on the training data. I’m inclined to think this was overfitting, and it was triggered because they thought their model needed to be more accurate than it realistically could be. (It was interesting to hear them explain the role of this variable in class.)
When I taught regression courses, I always started out by describing data as a mix of “pattern” and “noise”, with “noise” being a relative concept. I defined it as “stuff you can’t currently explain or predict”, leaving the door open to some future combination of better models, greater expertise and/or more data turning some of the “noise” into “pattern”. Overfitting occurs when your model “predicts” what is actually noise. Underfitting occurs when it claims part of the pattern is noise. The problem is that the noise content of the data is whatever the universe / the economy / Loki decided it would be. The universe does not adjust the noise level of the data based on what predictive accuracy you want or need. So calling a model underfitted just because you fell short of the accuracy you thought you should achieve (or needed to achieve) amounts to underestimating the relative noise content, and is both unreliable and likely to induce you to indulge in overfitting.
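To see how “predicting” noise can look like success, consider a toy example of my own construction (not from J-F's post): the dependent variable is a fair coin flip, completely unrelated to the predictor, and the “model” simply memorizes every training label.

```python
import random

rng = random.Random(0)

# y here is PURE noise: a fair coin flip, unrelated to x.
train = [(i, rng.choice([0, 1])) for i in range(50)]
test = [(i, rng.choice([0, 1])) for i in range(50, 100)]

# An overfitted "model": memorize every training label, i.e. predict the noise.
lookup = dict(train)
majority = round(sum(y for _, y in train) / len(train))  # fallback for unseen x

def memorizer(x):
    return lookup.get(x, majority)

def accuracy(model, sample):
    return sum(model(x) == y for x, y in sample) / len(sample)

train_acc = accuracy(memorizer, train)  # exactly 1.0 -- looks perfect
test_acc = accuracy(memorizer, test)    # roughly 0.5 -- no better than a coin flip
```

Perfect training accuracy here tells you nothing, because there was no pattern to find; the universe set the noise content at 100%, and no amount of model complexity changes that.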