Derivation of Ridge Regression Estimator, Bias, and Variance
1. Derivation of Ridge Regression Estimator
Ridge regression minimizes the penalized sum of squared residuals:
\[
L(\beta) = (y - X\beta)'(y - X\beta) + \lambda \beta'\beta
\]
where λ > 0 is the regularization parameter.
- Expand the expression:
\[
L(\beta) = y'y - 2\beta'X'y + \beta'X'X\beta + \lambda \beta'\beta
\]
- Take the derivative with respect to β:
\[
\frac{\partial L}{\partial \beta} = -2X'y + 2X'X\beta + 2\lambda \beta
\]
- Set the derivative to zero and solve:
\[
\begin{aligned}
-2X'y + 2X'X\beta + 2\lambda \beta &= 0 \\
(X'X + \lambda I)\beta &= X'y \\
\beta &= (X'X + \lambda I)^{-1}X'y
\end{aligned}
\]
Because X'X + λI is positive definite for λ > 0, it is invertible and the stationary point is the unique minimizer. Therefore, the Ridge regression estimator is:
\[
\hat{\beta}_{Ridge} = (X'X + \lambda I)^{-1}X'y
\]
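As a quick numerical check of this closed form, a minimal NumPy sketch is shown below; the design matrix, response, and the value of lam are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form ridge estimator (X'X + lam*I)^{-1} X'y."""
    p = X.shape[1]
    # Solve the penalized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Illustrative data (assumed, not from the text): 100 observations, 5 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(size=100)

print(ridge_estimator(X, y, lam=10.0))
```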
2. Bias of Ridge Regression Estimator
To derive the bias, we start with the true model y = Xβ + ε, where E[ε] = 0 and X is treated as fixed.
- Express β̂_Ridge in terms of β:
\[
\begin{aligned}
\hat{\beta}_{Ridge} &= (X'X + \lambda I)^{-1}X'y \\
&= (X'X + \lambda I)^{-1}X'(X\beta + \varepsilon) \\
&= (X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'\varepsilon
\end{aligned}
\]
- Take the expectation:
\[
\begin{aligned}
E[\hat{\beta}_{Ridge}] &= E[(X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'\varepsilon] \\
&= (X'X + \lambda I)^{-1}X'X\beta + (X'X + \lambda I)^{-1}X'E[\varepsilon] \\
&= (X'X + \lambda I)^{-1}X'X\beta
\end{aligned}
\]
- Calculate the bias:
\[
\begin{aligned}
Bias(\hat{\beta}_{Ridge}) &= E[\hat{\beta}_{Ridge}] - \beta \\
&= (X'X + \lambda I)^{-1}X'X\beta - \beta \\
&= [(X'X + \lambda I)^{-1}X'X - I]\beta \\
&= [(X'X + \lambda I)^{-1}(X'X + \lambda I - \lambda I) - I]\beta \\
&= [I - \lambda(X'X + \lambda I)^{-1} - I]\beta \\
&= -\lambda(X'X + \lambda I)^{-1}\beta
\end{aligned}
\]
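Both expressions for the bias (the definition E[β̂_Ridge] − β and the closed form −λ(X'X + λI)⁻¹β) can be compared against a Monte Carlo average. A minimal sketch, assuming fixed X, Gaussian errors, and illustrative values of n, p, λ, and σ:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 200, 4, 5.0, 1.0
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.0, 0.0, 1.5])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))

# Theoretical bias: -lam * (X'X + lam*I)^{-1} beta
bias_theory = -lam * A @ beta

# Monte Carlo estimate of E[beta_hat] - beta over 20,000 noise draws (X held fixed).
eps = rng.normal(scale=sigma, size=(20000, n))
beta_hats = (X @ beta + eps) @ X @ A   # row i equals A X'(X beta + eps_i); A is symmetric
bias_mc = beta_hats.mean(axis=0) - beta

print(bias_theory)
print(bias_mc)   # agrees with bias_theory up to simulation error
```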
3. Variance of Ridge Regression Estimator
To derive the variance, we use the fact that Var(ε) = σ²I, so that Var(y) = σ²I with X fixed, where σ² is the error variance.
- Start with the expression for β̂_Ridge:
\[
\hat{\beta}_{Ridge} = (X'X + \lambda I)^{-1}X'y
\]
- Apply the variance operator:
\[
\begin{aligned}
Var(\hat{\beta}_{Ridge}) &= Var((X'X + \lambda I)^{-1}X'y) \\
&= (X'X + \lambda I)^{-1}X'\,Var(y)\,[(X'X + \lambda I)^{-1}X']' \\
&= (X'X + \lambda I)^{-1}X'\,Var(y)\,X(X'X + \lambda I)^{-1} \\
&= \sigma^2 (X'X + \lambda I)^{-1}X'X(X'X + \lambda I)^{-1}
\end{aligned}
\]
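The covariance formula can be verified the same way. A minimal sketch under the same assumptions (fixed X, errors with variance σ²; the dimensions and λ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam, sigma = 200, 3, 5.0, 1.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, 0.5, -2.0])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))

# Theoretical covariance: sigma^2 * A X'X A
var_theory = sigma**2 * A @ X.T @ X @ A

# Monte Carlo covariance of the ridge estimator over 20,000 noise draws (X held fixed).
eps = rng.normal(scale=sigma, size=(20000, n))
beta_hats = (X @ beta + eps) @ X @ A   # row i equals A X'(X beta + eps_i); A is symmetric
var_mc = np.cov(beta_hats, rowvar=False)

print(np.max(np.abs(var_theory - var_mc)))  # small, up to simulation error
```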
4. Comparison with OLS
For comparison, recall the OLS estimator:
\[
\hat{\beta}_{OLS} = (X'X)^{-1}X'y
\]
OLS Properties:
- Bias(β̂_OLS) = 0
- Var(β̂_OLS) = σ²(X'X)⁻¹
Key differences:
- Ridge regression introduces bias whose magnitude grows with λ.
- Whenever the OLS estimator exists, the variance of the Ridge estimator is smaller than that of OLS for every λ > 0.
- As λ → 0, the Ridge estimator approaches the OLS estimator.
- As λ → ∞, the Ridge estimator shrinks toward zero, so the bias approaches −β while the variance approaches zero (see the sketch below).
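The limiting behaviour in the last two points can be seen numerically. A minimal sketch on simulated data (the data and the λ grid are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
beta_true = np.array([3.0, -1.0, 2.0, 0.5])
y = X @ beta_true + rng.normal(size=100)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.0, 0.1, 1.0, 10.0, 1000.0]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    # ||beta_ridge|| shrinks toward 0 as lam grows; at lam = 0 it equals the OLS solution.
    print(lam, np.linalg.norm(beta_ridge), np.linalg.norm(beta_ridge - beta_ols))
```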
5. Mean Squared Error (MSE)
The MSE combines both bias and variance:
\[
MSE(\hat{\beta}_{Ridge}) = \|Bias(\hat{\beta}_{Ridge})\|^2 + tr(Var(\hat{\beta}_{Ridge}))
\]
Ridge regression can achieve a lower MSE than OLS when the reduction in variance outweighs the squared bias introduced by the penalty.
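A minimal sketch that evaluates this MSE over a grid of λ values, reusing the bias and variance formulas from sections 2 and 3; the data-generating setup is an illustrative assumption, and λ = 0 recovers the OLS MSE:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 50, 10, 2.0
X = rng.normal(size=(n, p))
beta = rng.normal(scale=0.5, size=p)      # modest true coefficients (illustrative)

def ridge_mse(lam):
    """||Bias||^2 + tr(Var) for the ridge estimator, using the formulas above."""
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    bias = -lam * A @ beta
    var = sigma**2 * A @ X.T @ X @ A
    return bias @ bias + np.trace(var)

mse_ols = ridge_mse(0.0)                  # lam = 0 reduces to the OLS MSE
lams = np.logspace(-2, 3, 60)
mses = np.array([ridge_mse(l) for l in lams])
best = lams[mses.argmin()]
print(f"OLS MSE: {mse_ols:.3f}, best ridge MSE: {mses.min():.3f} at lam = {best:.2f}")
# In setups like this one, the minimum over the grid typically falls below the OLS MSE.
```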