Derivation of Ridge Regression Estimator, Bias, and Variance
1. Derivation of Ridge Regression Estimator
Ridge regression minimizes the penalized sum of squared residuals:
\[
L(\beta) = (y - X\beta)'(y - X\beta) + \lambda \beta'\beta
\]
where λ > 0 is the regularization parameter.
- Expand the expression:
\[
L(\beta) = y'y - 2\beta'X'y + \beta'X'X\beta + \lambda \beta'\beta
\]
- Take the derivative with respect to β:
\[
\frac{\partial L}{\partial \beta} = -2X'y + 2X'X\beta + 2\lambda \beta
\]
- Set the derivative to zero and solve:
\[
\begin{aligned}
-2X'y + 2X'X\beta + 2\lambda \beta &= 0 \\
(X'X + \lambda I)\beta &= X'y \\
\beta &= (X'X + \lambda I)^{-1}X'y
\end{aligned}
\]
Because X'X + λI is positive definite for λ > 0, it is invertible and the stationary point is the unique minimizer. Therefore, the Ridge regression estimator is:
\[
\hat{\beta}_{Ridge} = (X'X + \lambda I)^{-1}X'y
\]
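As a quick numerical check of this closed form, a minimal NumPy sketch is shown below; the design matrix, response, and the value of lam are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form ridge estimator (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    # Solve the penalized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative data (assumed, not from the text): 100 observations, 5 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(size=100)

print(ridge_estimator(X, y, lam=10.0))
```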
2. Bias of Ridge Regression Estimator
To derive the bias, we start with the true model y = Xβ + ε, where E[ε] = 0 and X is treated as fixed.
- Express β̂_Ridge in terms of β:
\[
\begin{aligned}
\hat{\beta}_{Ridge} &= (X'X + \lambda I)^{-1}X'y \\
&= (X'X + \lambda I)^{-1}X'(X\beta + \varepsilon) \\
&= (X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'\varepsilon
\end{aligned}
\]
- Take the expectation:
\[
\begin{aligned}
E[\hat{\beta}_{Ridge}] &= E[(X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'\varepsilon] \\
&= (X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'E[\varepsilon] \\
&= (X'X + \lambda I)^{-1}X'X\beta
\end{aligned}
\]
- Calculate the bias:
\[
\begin{aligned}
Bias(\hat{\beta}_{Ridge}) &= E[\hat{\beta}_{Ridge}] - \beta \\
&= (X'X + \lambda I)^{-1}X'X\beta - \beta \\
&= [(X'X + \lambda I)^{-1}X'X - I]\beta \\
&= [(X'X + \lambda I)^{-1}(X'X + \lambda I - \lambda I) - I]\beta \\
&= [I - \lambda(X'X + \lambda I)^{-1} - I]\beta \\
&= -\lambda(X'X + \lambda I)^{-1}\beta
\end{aligned}
\]
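Both expressions for the bias (the definition E[β̂_Ridge] − β and the closed form −λ(X'X + λI)⁻¹β) can be compared against a Monte Carlo average. A minimal sketch, assuming fixed X, Gaussian errors, and illustrative values of n, p, λ, and σ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 200, 4, 5.0, 1.0
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.0, 1.5])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))

# Theoretical bias: -lam * (X'X + lam*I)^{-1} beta
bias_theory = -lam * A @ beta

# Monte Carlo estimate of E[beta_hat] - beta over 20,000 noise draws (X held fixed).
eps = rng.normal(scale=sigma, size=(20000, n))
beta_hats = (X @ beta + eps) @ X @ A   # row i equals A X'(X beta + eps_i); A is symmetric
bias_mc = beta_hats.mean(axis=0) - beta

print(bias_theory)
print(bias_mc)   # agrees with bias_theory up to simulation error
```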
3. Variance of Ridge Regression Estimator
To derive the variance, we use the fact that Var(ε) = σ²I, so that Var(y) = σ²I with X fixed, where σ² is the error variance.
- Start with the expression for β̂_Ridge:
\[
\hat{\beta}_{Ridge} = (X'X + \lambda I)^{-1}X'y
\]
- Apply the variance operator:
\[
\begin{aligned}
Var(\hat{\beta}_{Ridge}) &= Var((X'X + \lambda I)^{-1}X'y) \\
&= (X'X + \lambda I)^{-1}X'\,Var(y)\,[(X'X + \lambda I)^{-1}X']' \\
&= (X'X + \lambda I)^{-1}X'\,Var(y)\,X(X'X + \lambda I)^{-1} \\
&= \sigma^2 (X'X + \lambda I)^{-1}X'X(X'X + \lambda I)^{-1}
\end{aligned}
\]
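The covariance formula can be verified the same way. A minimal sketch under the same assumptions (fixed X, errors with variance σ²; the dimensions and λ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 200, 3, 5.0, 1.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, -2.0])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))

# Theoretical covariance: sigma^2 * A X'X A
var_theory = sigma**2 * A @ X.T @ X @ A

# Monte Carlo covariance of the ridge estimator over 20,000 noise draws (X held fixed).
eps = rng.normal(scale=sigma, size=(20000, n))
beta_hats = (X @ beta + eps) @ X @ A   # row i equals A X'(X beta + eps_i); A is symmetric
var_mc = np.cov(beta_hats, rowvar=False)

print(np.max(np.abs(var_theory - var_mc)))  # small, up to simulation error
```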
4. Comparison with OLS
For comparison, recall the OLS estimator:
\[
\hat{\beta}_{OLS} = (X'X)^{-1}X'y
\]
OLS Properties:
- Bias(β̂_OLS) = 0
- Var(β̂_OLS) = σ²(X'X)⁻¹
Key differences:
- Ridge regression introduces bias whose magnitude grows with λ.
- Whenever the OLS estimator exists, the variance of the Ridge estimator is smaller than that of OLS for every λ > 0.
- As λ → 0, the Ridge estimator approaches the OLS estimator.
- As λ → ∞, the Ridge estimator shrinks toward zero, so the bias approaches −β while the variance approaches zero (see the sketch below).
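The limiting behaviour in the last two points can be seen numerically. A minimal sketch on simulated data (the data and the λ grid are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
beta_true = np.array([3.0, -1.0, 2.0, 0.5])
y = X @ beta_true + rng.normal(size=100)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.0, 0.1, 1.0, 10.0, 1000.0]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    # ||beta_ridge|| shrinks toward 0 as lam grows; at lam = 0 it equals the OLS solution.
    print(lam, np.linalg.norm(beta_ridge), np.linalg.norm(beta_ridge - beta_ols))
```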
5. Mean Squared Error (MSE)
The MSE combines both bias and variance:
\[
MSE(\hat{\beta}_{Ridge}) = \|Bias(\hat{\beta}_{Ridge})\|^2 + tr(Var(\hat{\beta}_{Ridge}))
\]
Ridge regression can achieve a lower MSE than OLS when the reduction in variance outweighs the squared bias introduced by the penalty.
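A minimal sketch that evaluates this MSE over a grid of λ values, reusing the bias and variance formulas from sections 2 and 3; the data-generating setup is an illustrative assumption, and λ = 0 recovers the OLS MSE:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 50, 10, 2.0
X = rng.normal(size=(n, p))
beta = rng.normal(scale=0.5, size=p)      # modest true coefficients (illustrative)

def ridge_mse(lam):
    """||Bias||^2 + tr(Var) for the ridge estimator, using the formulas above."""
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    bias = -lam * A @ beta
    var = sigma**2 * A @ X.T @ X @ A
    return bias @ bias + np.trace(var)

mse_ols = ridge_mse(0.0)                  # lam = 0 reduces to the OLS MSE
lams = np.logspace(-2, 3, 60)
mses = np.array([ridge_mse(l) for l in lams])
best = lams[mses.argmin()]
print(f"OLS MSE: {mse_ols:.3f}, best ridge MSE: {mses.min():.3f} at lam = {best:.2f}")
# In setups like this one, the minimum over the grid typically falls below the OLS MSE.
```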