Ridge Regression Revisited

Written by: Paul Rubin

Primary Source: OR in an OB world

I’ve been neglecting the blog a bit lately, partly because I haven’t had much to say and partly because I’ve been a bit busy with other things. One of the things keeping me occupied is the excellent Statistical Learning course offered by Stanford. (In one of those coming-full-cycle things, now that I’m retired from teaching, I’m going back to my student roots and taking MOOCs in subjects of interest.)

During one of the lectures I learned something new that was both interesting and had the side effect of making me feel old (a job my knees are entirely up to on their own, thank you very much). Ridge regression has apparently been picked up from the scrap heap of statistical history and repurposed.

Stiffness: it’s not just for us old people

To explain ridge regression as I originally learned it, let me start by defining the phrase ill-conditioned matrix. Let A be a square matrix (we’ll operate over the reals, although I think this works as well for complex matrices) and let ∥⋅∥ be a consistent matrix norm. The condition number ofA, assuming that is nonsingular, is

\(\kappa = \left\Vert A\vphantom{^{-1}}\right\Vert \left\Vert A^{-1}\right\Vert \ge 1\)

For singular matrices, κ is either undefined or ∞, according to your religious preferences.

Somewhat informally, κ can be interpreted as a measure of how much rounding errors may be inflated then inverting A (using floating-point arithmetic … “pure” mathematicians don’t commit rounding errors). An ill-conditioned or “stiff” matrix A is non-singular but has a large condition number (the criterion for “large” being rather fuzzy), meaning that rounding errors during the process of inverting (or factoring, or otherwise torturing) A may result in an answer with lots of incorrect digits. Long ago, a graduate school instructor told my class the story of an engineer who had inverted an ill-conditioned matrix, without realizing it was ill-conditioned, using home-brew FORTRAN code. He’d published the inverse in a journal article to four or five decimal places. All the decimal places in most of the coefficients were wrong. So were lots of digits to the left of the decimal points.

Ridge regression then

Here comes the tie-in to regression. Suppose we are trying to fit the model

\(y=x’\beta+\epsilon\)

(where ϵ is random noise with mean zero), and we have collected observations (X,Y) with X an n×m matrix (n = sample size, m = number of predictors). The normal equations tell us that the least squares estimate for β is

\(y=x’\b=(X^\prime X)^{-1}X^\prime Y.beta+\epsilon\)

In theory, X′X is nonsingular unless the predictors are perfectly multicollinear. In practice, ifX′X is nonsingular but stiff, our coefficient estimateb can suffer drastic rounding errors. To mitigate those errors, ridge regression tampers withX′X, adding a small positive multiple of the identity matrix of dimensionm. The modified normal equations are

\(b_\lambda=(X^\prime X + \lambda I)^{-1}X^\prime Y.\)

The result bλ is a deliberately biased estimate of β, in which we trade the bias for reduced rounding error. We generally choose λ>0 as small as possible subject to the need to stabilize the numerics.

When I first encountered ridge regression, we were in the (first) mainframe era. Machine words were small, single-precision arithmetic ruled the computing universe, and rounding errors were relatively easy to generate. Somewhere along the way, computing hardware got smaller and cheaper, double precision arithmetic became the norm and separate floating point units took over the computational load. At the same time, matrix factorization replaced matrix inversion, and factoring algorithms got smarter about managing rounding error. The problems with stiff X′X matrices seemed largely to disappear, and references to ridge regression became uncommon, at least in my experience.

Ridge regression now

In the Statistical Learning course, ridge regression makes a resurgence, with an entirely different purpose. In the era of Big Data, there is a distinct danger that we may overfit data by throwing in lots and lots of predictors (large dimension m). One way to mitigate that temptation is to penalize the number or size of the coefficients in the model.

The normal equations minimize the sum of squared deviations in a linear regression model. In other words, the ordinary least squares estimate ofβ solves the problem

$latex b_\lambda=(X^\prime X \min_b \Vert Y-X b\Vert^2.+ \lambda I)^{-1}X^\prime Y.$

Given that we are squaring errors, and given that quadratic objective functions have nice smoothness, a tempting way to penalize overfitting is to add a penalty for the size of the coefficient vector:

\(\min_b \Vert Y-X b\Vert^2 + \lambda \Vert b \Vert^2\)

where λ>0 is a penalty parameter. If λ=0, we are back to ordinary least squares. As λ→∞, ∥b∥ will converge toward (but never quite reach) 0.

I’ll skip the algebra and just state that solving this problem for fixed λ is mathematically equivalent to ridge regression as seen above. The only difference is in the choice of λ. When we were fighting rounding error, we wanted the smallest λ that would clean up most of the rounding error. Now, when we are fighting overfitting, we probably need larger values of λ to get the job done.

So everything old is new again … which bodes well for my knees.

The following two tabs change content below.
I'm an apostate mathematician, retired from a business school after 33 years of teaching mostly (but not exclusively) quantitative methods courses. My academic interests lie in operations research. I also study Tae Kwon Do a bit on the side.

Latest posts by Paul Rubin (see all)