Derivation of Ridge Regression Estimator, Bias, and Variance

1. Derivation of Ridge Regression Estimator

Ridge regression minimizes the penalized sum of squared residuals:

\[ L(\beta) = (y - X\beta)'(y - X\beta) + \lambda \beta'\beta \]

where λ > 0 is the regularization parameter.

  1. Expand the expression:
    \[ L(\beta) = y'y - 2\beta'X'y + \beta'X'X\beta + \lambda \beta'\beta \]
  2. Take the derivative with respect to β:
    \[ \frac{\partial L}{\partial \beta} = -2X'y + 2X'X\beta + 2\lambda \beta \]
  3. Set the derivative to zero and solve:
    \[ \begin{aligned} -2X'y + 2X'X\beta + 2\lambda \beta &= 0 \\ (X'X + \lambda I)\beta &= X'y \\ \beta &= (X'X + \lambda I)^{-1}X'y \end{aligned} \]

Therefore, the Ridge regression estimator is:

\[ \hat{\beta}_{Ridge} = (X'X + \lambda I)^{-1}X'y \]
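As a quick illustration, this closed form translates directly into NumPy. The snippet below is a minimal sketch: the simulated data, the true coefficients, and the value of λ are placeholders chosen only for the example.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form ridge estimator: (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    # Solve the regularized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative data (values are placeholders, not from the text).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(size=100)

beta_ridge = ridge_estimator(X, y, lam=1.0)
print(beta_ridge)
```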

2. Bias of Ridge Regression Estimator

To derive the bias, we start from the true model y = Xβ + ε, where E[ε] = 0 and X is treated as fixed.

  1. Express β̂_Ridge in terms of β:
    \[ \begin{aligned} \hat{\beta}_{Ridge} &= (X'X + \lambda I)^{-1}X'y \\ &= (X'X + \lambda I)^{-1}X'(X\beta + \varepsilon) \\ &= (X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'\varepsilon \end{aligned} \]
  2. Take the expectation:
    \[ \begin{aligned} E[\hat{\beta}_{Ridge}] &= E[(X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'\varepsilon] \\ &= (X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'E[\varepsilon] \\ &= (X'X + \lambda I)^{-1}X'X\beta \end{aligned} \]
  3. Calculate the bias:
    \[ \begin{aligned} Bias(\hat{\beta}_{Ridge}) &= E[\hat{\beta}_{Ridge}] - \beta \\ &= (X'X + \lambda I)^{-1}X'X\beta - \beta \\ &= [(X'X + \lambda I)^{-1}X'X - I]\beta \\ &= [(X'X + \lambda I)^{-1}(X'X + \lambda I - \lambda I) - I]\beta \\ &= [I - \lambda(X'X + \lambda I)^{-1} - I]\beta \\ &= -\lambda(X'X + \lambda I)^{-1}\beta \end{aligned} \]
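The bias formula lends itself to a numerical sanity check: hold X fixed, simulate many noise draws, average the resulting estimates, and compare with -λ(X'X + λI)^{-1}β. The design matrix, coefficients, noise level, and λ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 1.0
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))  # (X'X + lambda I)^{-1}

# Monte Carlo estimate of E[beta_hat] over repeated noise draws (X held fixed).
draws = np.array([
    A @ X.T @ (X @ beta + rng.normal(size=n))
    for _ in range(20000)
])
empirical_bias = draws.mean(axis=0) - beta

# Theoretical bias: -lambda (X'X + lambda I)^{-1} beta.
theoretical_bias = -lam * A @ beta

print(np.round(empirical_bias, 3))
print(np.round(theoretical_bias, 3))
```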

3. Variance of Ridge Regression Estimator

To derive the variance, we use the fact that Var(y) = σ²I, where σ² is the error variance.

  1. Start with the expression for β̂_Ridge:
    \[ \hat{\beta}_{Ridge} = (X'X + \lambda I)^{-1}X'y \]
  2. Apply the variance operator:
    \[ \begin{aligned} Var(\hat{\beta}_{Ridge}) &= Var((X'X + \lambda I)^{-1}X'y) \\ &= (X'X + \lambda I)^{-1}X' Var(y) X(X'X + \lambda I)^{-1} \\ &= \sigma^2 (X'X + \lambda I)^{-1}X'X(X'X + \lambda I)^{-1} \end{aligned} \]
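The same kind of simulation verifies the variance expression: compare the sample covariance of simulated ridge estimates against σ²(X'X + λI)^{-1}X'X(X'X + λI)^{-1}. Again this is only a sketch, with arbitrary values for n, p, σ, and λ.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 200, 4, 2.0, 1.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
A = np.linalg.inv(X.T @ X + lam * np.eye(p))  # (X'X + lambda I)^{-1}

# Theoretical covariance: sigma^2 * A X'X A.
var_theory = sigma**2 * A @ X.T @ X @ A

# Empirical covariance of the estimator over repeated noise draws.
draws = np.array([
    A @ X.T @ (X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(20000)
])
var_empirical = np.cov(draws, rowvar=False)

print(np.max(np.abs(var_empirical - var_theory)))  # should be close to zero
```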

4. Comparison with OLS

For comparison, recall the OLS estimator:

\[ \hat{\beta}_{OLS} = (X'X)^{-1}X'y \]

OLS Properties:

  - Unbiasedness: E[β̂_OLS] = β, since (X'X)^{-1}X'E[ε] = 0.
  - Variance: Var(β̂_OLS) = σ²(X'X)^{-1}.

Key differences:

  - Ridge is biased, with Bias(β̂_Ridge) = -λ(X'X + λI)^{-1}β, while OLS is unbiased.
  - Ridge has smaller variance than OLS, since the added λI shrinks (X'X + λI)^{-1} relative to (X'X)^{-1}.
  - Ridge remains well defined when X'X is singular or ill-conditioned (e.g., under multicollinearity or when p > n), whereas OLS requires X'X to be invertible.
  - As λ → 0, the ridge estimator approaches the OLS estimator; as λ → ∞, it shrinks toward zero.
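The behavior under a nearly collinear design, one of the differences listed above, can be seen in a small simulation. The construction of the correlated columns and the choice of λ below are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 50, 1.0

# Build a design matrix with two almost identical columns (near collinearity).
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.cond(XtX))                    # huge: X'X is nearly singular
print(np.linalg.cond(XtX + lam * np.eye(3)))  # modest: the ridge system is well conditioned

beta_ols = np.linalg.solve(XtX, X.T @ y)                      # unstable, inflated coefficients
beta_ridge = np.linalg.solve(XtX + lam * np.eye(3), X.T @ y)  # shrunk, stable coefficients
print(beta_ols, beta_ridge)
```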

5. Mean Squared Error (MSE)

The MSE combines both bias and variance:

\[ MSE(\hat{\beta}_{Ridge}) = \|Bias(\hat{\beta}_{Ridge})\|^2 + tr(Var(\hat{\beta}_{Ridge})) \]

Ridge regression achieves a lower MSE than OLS whenever the reduction in variance outweighs the squared bias it introduces, which happens for sufficiently small λ > 0.
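To see the trade-off concretely, one can evaluate the theoretical MSE above over a grid of λ values and compare it with the OLS MSE (which is pure variance, since OLS is unbiased). The problem dimensions, noise level, and λ grid below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 10, 2.0
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
XtX = X.T @ X

def ridge_mse(lam):
    """Theoretical MSE = ||bias||^2 + tr(Var) for the ridge estimator."""
    A = np.linalg.inv(XtX + lam * np.eye(p))
    bias = -lam * A @ beta
    var = sigma**2 * A @ XtX @ A
    return bias @ bias + np.trace(var)

mse_ols = ridge_mse(0.0)          # lambda = 0 reduces to OLS: zero bias, variance sigma^2 (X'X)^{-1}
lambdas = np.logspace(-3, 2, 50)
mse_ridge = np.array([ridge_mse(l) for l in lambdas])

print(mse_ols, mse_ridge.min())   # the minimum over lambda is typically below the OLS value
```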