Bias and Variance of Ridge Regression Estimator

1. Ridge Regression Estimator

The Ridge regression estimator is given by:

\[ \hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty \]

where λ > 0 is the regularization parameter, X is the design matrix, y is the response vector, and I is the identity matrix.
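To make the formula concrete, here is a minimal sketch of the closed-form estimator in NumPy. The function name `ridge_estimator` and all data values are illustrative, not from the text; a linear solve is used rather than an explicit inverse for numerical stability.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Compute (X^T X + lam*I)^{-1} X^T y via a linear solve."""
    p = X.shape[1]
    # Solving the linear system is more stable than forming the inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + rng.normal(scale=1.0, size=100)
print(ridge_estimator(X, y, lam=1.0))
```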

2. Bias of Ridge Regression Estimator

To derive the bias, we start with the true model y = Xβ + ε, where the errors satisfy E[ε] = 0.

  1. Express β̂_Ridge in terms of β:
    \[ \begin{aligned} \hat{\beta}_{Ridge} &= (X^TX + \lambda I)^{-1}X^T(X\beta + \varepsilon) \\ &= (X^TX + \lambda I)^{-1}(X^TX)\beta + (X^TX + \lambda I)^{-1}X^T\varepsilon \end{aligned} \]
  2. Take the expectation:
    \[ E[\hat{\beta}_{Ridge}] = (X^TX + \lambda I)^{-1}(X^TX)\beta \]
  3. The bias is then:
    \[ \begin{aligned} Bias(\hat{\beta}_{Ridge}) &= E[\hat{\beta}_{Ridge}] - \beta \\ &= [(X^TX + \lambda I)^{-1}(X^TX) - I]\beta \\ &= (X^TX + \lambda I)^{-1}[X^TX - (X^TX + \lambda I)]\beta \\ &= -\lambda(X^TX + \lambda I)^{-1}\beta \end{aligned} \]

Note that the bias is non-zero for any λ > 0, unlike the OLS estimator (λ = 0), which is unbiased.
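The closed-form bias can be checked numerically. The sketch below compares the formula −λ(X^TX + λI)^{-1}β against a Monte Carlo average of β̂_Ridge − β over many noise draws; the design, true β, and all constants are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 200, 4, 5.0, 1.0
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))  # (X^T X + lam I)^{-1}
H = A @ X.T                                   # maps y to beta_hat

# Theoretical bias: -lam * (X^T X + lam I)^{-1} beta
bias_theory = -lam * A @ beta

# Empirical bias: average of beta_hat - beta over many noise draws
reps = 20000
est = np.zeros(p)
for _ in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    est += H @ y
bias_mc = est / reps - beta

print(np.round(bias_theory, 4))
print(np.round(bias_mc, 4))  # should agree up to Monte Carlo error
```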

3. Variance of Ridge Regression Estimator

To derive the variance, assume the errors are uncorrelated with constant variance, so that Var(ε) = σ²I and hence Var(y) = σ²I (X is treated as fixed), where σ² is the error variance.

  1. Start with the expression for β̂_Ridge:
    \[ \hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty \]
  2. The variance is:
    \[ \begin{aligned} Var(\hat{\beta}_{Ridge}) &= (X^TX + \lambda I)^{-1}X^T Var(y) X(X^TX + \lambda I)^{-1} \\ &= \sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1} \end{aligned} \]
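The sandwich form of this variance can likewise be verified by simulation. Below is a sketch (with illustrative values) that computes σ²(X^TX + λI)^{-1}X^TX(X^TX + λI)^{-1} directly and compares it to the sample covariance of simulated Ridge estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 150, 3, 2.0, 1.5
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.0, -1.0])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))
H = A @ X.T

# Var(beta_hat) = sigma^2 * A X^T X A  (the sandwich formula)
var_theory = sigma**2 * A @ X.T @ X @ A

draws = np.array([
    H @ (X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(20000)
])
var_mc = np.cov(draws, rowvar=False)

print(np.round(var_theory, 4))
print(np.round(var_mc, 4))  # close to the theoretical matrix
```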

4. Mean Squared Error (MSE)

The total MSE of the estimator, E[‖β̂_Ridge − β‖²], is the sum of the squared norm of the bias and the trace of the variance matrix:

\[ MSE(\hat{\beta}_{Ridge}) = \|\text{Bias}(\hat{\beta}_{Ridge})\|^2 + \text{tr}(Var(\hat{\beta}_{Ridge})) \]
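Combining the two derivations gives MSE as a function of λ, as in this sketch (setup values are illustrative). Note that λ = 0 recovers the OLS MSE, σ² tr((X^TX)^{-1}).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 100, 5, 2.0
X = rng.normal(size=(n, p))
beta = rng.normal(scale=0.5, size=p)  # small true coefficients

def ridge_mse(lam):
    """Total MSE = ||bias||^2 + tr(Var) for the Ridge estimator."""
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    bias = -lam * A @ beta
    var = sigma**2 * A @ X.T @ X @ A
    return bias @ bias + np.trace(var)

for lam in [0.0, 0.1, 1.0, 10.0]:
    print(f"lambda={lam:5.1f}  MSE={ridge_mse(lam):.4f}")
```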

5. Conditions for Lower MSE than OLS

Ridge regression produces a lower MSE than OLS when the reduction in variance outweighs the introduced bias. This typically occurs when:

  1. Multicollinearity is present in the predictors.
  2. The true β values are small (close to zero).
  3. The signal-to-noise ratio is low (high error variance relative to the true effects).

Mathematically, Ridge regression has lower MSE when:

\[ \lambda^2 \beta^T(X^TX + \lambda I)^{-2}\beta < \sigma^2 \text{tr}\left((X^TX)^{-1}\right) - \sigma^2 \text{tr}\left((X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}\right) \]

This condition compares the squared bias introduced by Ridge regression (left side) to the reduction in variance it achieves (right side).
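The sketch below evaluates both sides of the inequality for a few λ values, in a setting chosen to favor Ridge (high noise, small true coefficients); whenever the squared bias is smaller than the variance reduction, Ridge beats OLS in MSE. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 80, 5, 3.0  # high noise, small effects favor Ridge
X = rng.normal(size=(n, p))
beta = rng.normal(scale=0.3, size=p)
XtX = X.T @ X
var_ols = sigma**2 * np.trace(np.linalg.inv(XtX))

for lam in [0.5, 2.0, 8.0]:
    A = np.linalg.inv(XtX + lam * np.eye(p))
    sq_bias = lam**2 * beta @ A @ A @ beta        # left side
    var_ridge = sigma**2 * np.trace(A @ XtX @ A)  # Ridge variance term
    print(f"lambda={lam:4.1f}  bias^2={sq_bias:.4f}  "
          f"variance reduction={var_ols - var_ridge:.4f}")
```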

6. Practical Implications