Regression Via Pseudoinverse

Written by: Paul Rubin

Primary Source: OR in an OB World

In my last post (OLS Oddities), I mentioned that OLS linear regression could be done with multicollinear data using the Moore-Penrose pseudoinverse. I want to tidy up one small loose end.

Specifically, let \(X\) be the matrix of predictor observations (including a column of ones if a constant term is desired), let \(y\) be a vector of observations of the dependent variable, and suppose that you want to fit the model \(y = X\beta + \epsilon\) where \(\epsilon\) is the noise term and \(\beta\) the coefficient vector. The normal equations $$b = \left(X^\prime X\right)^{-1}X^\prime y$$produce the least squares estimate of \(\beta\) when \(X\) has full column rank. If we let \(M^+\) denote the Moore-Penrose pseudoinverse of matrix \(M\) (which always exists and is unique), then $$\hat{b} = X^+ y$$results in \(\hat{y} = X\hat{b}\) giving the correct fitted values even when \(X\) has less than full rank (i.e., when the predictors are multicollinear).

What if you replace the inverse with a pseudoinverse in the normal equations ? In other words, suppose we let $$\tilde{b} = \left(X^\prime X\right)^+X^\prime y.$$Do we get the same fitted values \(\hat{y}\)?

We do, and in fact \(\tilde{b} = \hat{b}\), i.e., both ways of using the pseudoinverse produce the same coefficient vector. The reason is that $$\left(X^\prime X\right)^+X^\prime = X^+.$$A proof is given in section 4.2 of the Wikipedia page of “Proofs involving the Moore-Penrose pseudoinverse“, so I won’t bother to reproduce it here.

If \(X\) is \(m \times n\), the second approach will be preferable only if the computational cost of finding the pseudoinverse of the \(n \times n\) matrix \(X^\prime X\) is sufficiently less than the cost of finding the pseudoinverse of \(X\) to offset the \(O\left(mn^2\right)\) cost of the multiplication of \(X^\prime\) and \(X\). I don’t know if that’s true, particularly in some machine learning applications where, apparently, \(n >> m\). So I’ll either stick to the simpler version (using \(X^+\)) or, more likely, continue with the time-honored tradition of weeding out redundant predictors before fitting the model.

The following two tabs change content below.
I'm an apostate mathematician, retired from a business school after 33 years of teaching mostly (but not exclusively) quantitative methods courses. My academic interests lie in operations research. I also study Tae Kwon Do a bit on the side.

Latest posts by Paul Rubin (see all)