Bias and Variance of the Ridge Regression Estimator
1. Ridge Regression Estimator
The Ridge regression estimator is given by:
\[
\hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty
\]
where λ > 0 is the regularization parameter, X is the n×p design matrix, y is the n×1 response vector, and I is the p×p identity matrix.
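As a concrete illustration, the closed form above can be evaluated directly with NumPy. This is only a minimal sketch: the simulated X and y, the true coefficient vector beta_true, and the chosen lam are illustrative assumptions, not values prescribed by this note.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))                      # assumed design matrix
beta_true = np.array([2.0, 0.5, 0.0, -1.0, 0.3]) # assumed true coefficients
y = X @ beta_true + rng.normal(size=n)           # response with unit-variance noise

lam = 1.0                                        # assumed regularization parameter
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```

Solving the linear system (X^TX + λI)b = X^Ty is preferable to forming the inverse explicitly, but the two are algebraically the same estimator.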
2. Bias of Ridge Regression Estimator
To derive the bias, we start with the true model y = Xβ + ε, where the errors satisfy E[ε] = 0.
- Express β̂_Ridge in terms of β:
\[
\begin{aligned}
\hat{\beta}_{Ridge} &= (X^TX + \lambda I)^{-1}X^T(X\beta + \varepsilon) \\
&= (X^TX + \lambda I)^{-1}(X^TX)\beta + (X^TX + \lambda I)^{-1}X^T\varepsilon
\end{aligned}
\]
- Take the expectation:
\[
E[\hat{\beta}_{Ridge}] = (X^TX + \lambda I)^{-1}(X^TX)\beta
\]
- The bias is then:
\[
\begin{aligned}
Bias(\hat{\beta}_{Ridge}) &= E[\hat{\beta}_{Ridge}] - \beta \\
&= [(X^TX + \lambda I)^{-1}(X^TX) - I]\beta \\
&= (X^TX + \lambda I)^{-1}\left[X^TX - (X^TX + \lambda I)\right]\beta \\
&= -\lambda(X^TX + \lambda I)^{-1}\beta
\end{aligned}
\]
Note that the bias is non-zero whenever λ > 0 and β ≠ 0, unlike the OLS estimator, which is unbiased.
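The bias formula can be sanity-checked numerically by averaging the Ridge estimator over repeated noise draws with X held fixed. The sketch below uses an assumed X, beta_true, λ, and σ; none of these come from the derivation itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 200, 4, 5.0, 1.0              # assumed sizes and parameters
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))

bias_formula = -lam * A @ beta_true              # -lambda (X'X + lambda I)^{-1} beta

# Monte Carlo: average the estimator over repeated noise draws, X held fixed
reps = 5000
acc = np.zeros(p)
for _ in range(reps):
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    acc += A @ X.T @ y
bias_mc = acc / reps - beta_true

print(np.round(bias_formula, 3))                 # analytic bias
print(np.round(bias_mc, 3))                      # simulated bias (should be close)
```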
3. Variance of Ridge Regression Estimator
To derive the variance, we assume the errors are homoscedastic and uncorrelated, so that Var(ε) = Var(y) = σ²I, where σ² is the error variance.
- Start with the expression for β̂_Ridge:
\[
\hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty
\]
- The variance is:
\[
\begin{aligned}
Var(\hat{\beta}_{Ridge}) &= (X^TX + \lambda I)^{-1}X^T Var(y) X(X^TX + \lambda I)^{-1} \\
&= \sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}
\end{aligned}
\]
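A short numerical sketch of this formula: the helper below evaluates σ²(X^TX + λI)^{-1}X^TX(X^TX + λI)^{-1} for an assumed design matrix and σ², and prints its trace for a few λ values to show the total variance shrinking as λ grows (λ = 0 recovers the OLS variance when X^TX is invertible).

```python
import numpy as np

def ridge_cov(X, lam, sigma2):
    """Covariance of the Ridge estimator: sigma^2 A X'X A with A = (X'X + lam I)^{-1}."""
    p = X.shape[1]
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * A @ X.T @ X @ A

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))                    # assumed design matrix
for lam in [0.0, 1.0, 10.0, 100.0]:
    # total variance (trace of the covariance) falls as lam grows
    print(lam, np.trace(ridge_cov(X, lam, sigma2=1.0)))
```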
4. Mean Squared Error (MSE)
The MSE of the estimator is the sum of the squared norm of its bias and the trace of its variance matrix:
\[
MSE(\hat{\beta}_{Ridge}) = \|\text{Bias}(\hat{\beta}_{Ridge})\|^2 + \text{tr}(Var(\hat{\beta}_{Ridge}))
\]
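Putting the bias and variance formulas together, a small helper can evaluate this MSE as a function of λ; the design matrix, true coefficients, and error variance below are again illustrative assumptions.

```python
import numpy as np

def ridge_mse(X, beta_true, lam, sigma2):
    """||bias||^2 + tr(Var) for the Ridge estimator at a given lambda."""
    p = X.shape[1]
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    bias = -lam * A @ beta_true                  # bias from Section 2
    var = sigma2 * A @ X.T @ X @ A               # variance from Section 3
    return bias @ bias + np.trace(var)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))                     # assumed design matrix
beta_true = np.array([0.5, -0.2, 0.1, 0.0])      # assumed true coefficients
for lam in [0.0, 1.0, 10.0]:
    print(lam, ridge_mse(X, beta_true, lam, sigma2=1.0))
```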
5. Conditions for Lower MSE than OLS
Ridge regression produces a lower MSE than OLS when the reduction in variance outweighs the introduced bias. This typically occurs when:
- Multicollinearity is present in the predictors.
- The true β values are small (close to zero).
- The signal-to-noise ratio is low (high error variance relative to the true effects).
Mathematically, Ridge regression has lower MSE when:
\[
\lambda^2 \beta^T(X^TX + \lambda I)^{-2}\beta < \sigma^2 \text{tr}((X^TX)^{-1}) - \sigma^2 \text{tr}((X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1})
\]
This condition compares the squared bias introduced by Ridge regression (left side) to the reduction in variance it achieves (right side).
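The inequality can be checked numerically. The sketch below constructs a deliberately near-collinear design with small true coefficients and a fairly large error variance (all assumed values, chosen to match the bullet points above) and compares MSE(Ridge) against MSE(OLS) = σ² tr((X^TX)^{-1}) over a grid of λ values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 50, 5, 4.0                        # assumed sample size and error variance
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)    # make two predictors nearly collinear
beta_true = np.full(p, 0.3)                      # small true coefficients

mse_ols = sigma2 * np.trace(np.linalg.inv(X.T @ X))
for lam in [0.1, 1.0, 10.0, 100.0]:
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    bias_sq = lam**2 * beta_true @ A @ A @ beta_true   # squared bias (left side)
    var_tr = sigma2 * np.trace(A @ X.T @ X @ A)        # trace of Ridge variance
    print(f"lambda={lam:6.1f}  MSE_ridge={bias_sq + var_tr:9.3f}  MSE_OLS={mse_ols:9.3f}")
```

In this kind of ill-conditioned setup the OLS variance term dominates, so moderate values of λ typically satisfy the inequality.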
6. Practical Implications
- Ridge regression trades off increased bias for reduced variance.
- It's particularly useful when predictors are highly correlated or when there are many predictors relative to the sample size.
- The optimal λ can be chosen through cross-validation or other model selection techniques (a sketch using scikit-learn follows this list).
- As λ → 0, Ridge regression approaches OLS; as λ → ∞, all coefficients approach zero.
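For the cross-validation point, one common route is scikit-learn's RidgeCV, which searches a grid of penalty values (scikit-learn calls the penalty "alpha" rather than λ). The data below are simulated placeholders, and RidgeCV's default settings (e.g. intercept handling) are left as-is in this sketch.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))                   # assumed predictors
beta_true = rng.normal(scale=0.5, size=10)       # assumed true coefficients
y = X @ beta_true + rng.normal(size=200)

# scikit-learn's "alpha" plays the role of lambda here
model = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)
print("selected penalty:", model.alpha_)
print("coefficients:", np.round(model.coef_, 3))
```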