Bias and Variance of the Ridge Regression Estimator
1. Ridge Regression Estimator
The Ridge regression estimator is given by:
\[
\hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty
\]
where λ > 0 is the regularization parameter, X is the n×p design matrix, y is the n×1 response vector, and I is the p×p identity matrix.
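As a concrete illustration, the closed form above can be evaluated directly with NumPy. This is only a minimal sketch: the simulated X and y, the true coefficient vector beta_true, and the chosen lam are illustrative assumptions, not values prescribed by this note.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))                      # assumed design matrix
beta_true = np.array([2.0, 0.5, 0.0, -1.0, 0.3]) # assumed true coefficients
y = X @ beta_true + rng.normal(size=n)           # response with unit-variance noise

lam = 1.0                                        # assumed regularization parameter
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge)
```

Solving the linear system (X^TX + λI)b = X^Ty is preferable to forming the inverse explicitly, but the two are algebraically the same estimator.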
2. Bias of Ridge Regression Estimator
To derive the bias, we start with the true model y = Xβ + ε, where the errors satisfy E[ε] = 0.
- Express β̂_Ridge in terms of β:
\[
\begin{aligned}
\hat{\beta}_{Ridge} &= (X^TX + \lambda I)^{-1}X^T(X\beta + \varepsilon) \\
&= (X^TX + \lambda I)^{-1}(X^TX)\beta + (X^TX + \lambda I)^{-1}X^T\varepsilon
\end{aligned}
\]
- Take the expectation:
\[
E[\hat{\beta}_{Ridge}] = (X^TX + \lambda I)^{-1}(X^TX)\beta
\]
- The bias is then:
\[
\begin{aligned}
Bias(\hat{\beta}_{Ridge}) &= E[\hat{\beta}_{Ridge}] - \beta \\
&= [(X^TX + \lambda I)^{-1}(X^TX) - I]\beta \\
&= (X^TX + \lambda I)^{-1}\left[X^TX - (X^TX + \lambda I)\right]\beta \\
&= -\lambda(X^TX + \lambda I)^{-1}\beta
\end{aligned}
\]
Note that the bias is non-zero whenever λ > 0 and β ≠ 0, unlike the OLS estimator, which is unbiased.
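The bias formula can be sanity-checked numerically by averaging the Ridge estimator over repeated noise draws with X held fixed. The sketch below uses an assumed X, beta_true, λ, and σ; none of these come from the derivation itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, sigma = 200, 4, 5.0, 1.0              # assumed sizes and parameters
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
A = np.linalg.inv(X.T @ X + lam * np.eye(p))

bias_formula = -lam * A @ beta_true              # -lambda (X'X + lambda I)^{-1} beta

# Monte Carlo: average the estimator over repeated noise draws, X held fixed
reps = 5000
acc = np.zeros(p)
for _ in range(reps):
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    acc += A @ X.T @ y
bias_mc = acc / reps - beta_true

print(np.round(bias_formula, 3))                 # analytic bias
print(np.round(bias_mc, 3))                      # simulated bias (should be close)
```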
3. Variance of Ridge Regression Estimator
To derive the variance, we assume the errors are homoscedastic and uncorrelated, so that Var(ε) = Var(y) = σ²I, where σ² is the error variance.
- Start with the expression for β̂_Ridge:
\[
\hat{\beta}_{Ridge} = (X^TX + \lambda I)^{-1}X^Ty
\]
- The variance is:
\[
\begin{aligned}
Var(\hat{\beta}_{Ridge}) &= (X^TX + \lambda I)^{-1}X^T Var(y) X(X^TX + \lambda I)^{-1} \\
&= \sigma^2 (X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}
\end{aligned}
\]
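A short numerical sketch of this formula: the helper below evaluates σ²(X^TX + λI)^{-1}X^TX(X^TX + λI)^{-1} for an assumed design matrix and σ², and prints its trace for a few λ values to show the total variance shrinking as λ grows (λ = 0 recovers the OLS variance when X^TX is invertible).

```python
import numpy as np

def ridge_cov(X, lam, sigma2):
    """Covariance of the Ridge estimator: sigma^2 A X'X A with A = (X'X + lam I)^{-1}."""
    p = X.shape[1]
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    return sigma2 * A @ X.T @ X @ A

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))                    # assumed design matrix
for lam in [0.0, 1.0, 10.0, 100.0]:
    # total variance (trace of the covariance) falls as lam grows
    print(lam, np.trace(ridge_cov(X, lam, sigma2=1.0)))
```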
4. Mean Squared Error (MSE)
The MSE of the estimator is the sum of the squared norm of its bias and the trace of its variance matrix:
\[
MSE(\hat{\beta}_{Ridge}) = \|\text{Bias}(\hat{\beta}_{Ridge})\|^2 + \text{tr}(Var(\hat{\beta}_{Ridge}))
\]
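Putting the bias and variance formulas together, a small helper can evaluate this MSE as a function of λ; the design matrix, true coefficients, and error variance below are again illustrative assumptions.

```python
import numpy as np

def ridge_mse(X, beta_true, lam, sigma2):
    """||bias||^2 + tr(Var) for the Ridge estimator at a given lambda."""
    p = X.shape[1]
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    bias = -lam * A @ beta_true                  # bias from Section 2
    var = sigma2 * A @ X.T @ X @ A               # variance from Section 3
    return bias @ bias + np.trace(var)

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))                     # assumed design matrix
beta_true = np.array([0.5, -0.2, 0.1, 0.0])      # assumed true coefficients
for lam in [0.0, 1.0, 10.0]:
    print(lam, ridge_mse(X, beta_true, lam, sigma2=1.0))
```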
5. Conditions for Lower MSE than OLS
Ridge regression produces a lower MSE than OLS when the reduction in variance outweighs the introduced bias. This typically occurs when:
- Multicollinearity is present in the predictors.
- The true β values are small (close to zero).
- The signal-to-noise ratio is low (high error variance relative to the true effects).
Mathematically, Ridge regression has lower MSE when:
\[
\lambda^2 \beta^T(X^TX + \lambda I)^{-2}\beta < \sigma^2 \text{tr}((X^TX)^{-1}) - \sigma^2 \text{tr}((X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1})
\]
This condition compares the squared bias introduced by Ridge regression (left side) to the reduction in variance it achieves (right side).
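The inequality can be checked numerically. The sketch below constructs a deliberately near-collinear design with small true coefficients and a fairly large error variance (all assumed values, chosen to match the bullet points above) and compares MSE(Ridge) against MSE(OLS) = σ² tr((X^TX)^{-1}) over a grid of λ values.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma2 = 50, 5, 4.0                        # assumed sample size and error variance
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)    # make two predictors nearly collinear
beta_true = np.full(p, 0.3)                      # small true coefficients

mse_ols = sigma2 * np.trace(np.linalg.inv(X.T @ X))
for lam in [0.1, 1.0, 10.0, 100.0]:
    A = np.linalg.inv(X.T @ X + lam * np.eye(p))
    bias_sq = lam**2 * beta_true @ A @ A @ beta_true   # squared bias (left side)
    var_tr = sigma2 * np.trace(A @ X.T @ X @ A)        # trace of Ridge variance
    print(f"lambda={lam:6.1f}  MSE_ridge={bias_sq + var_tr:9.3f}  MSE_OLS={mse_ols:9.3f}")
```

In this kind of ill-conditioned setup the OLS variance term dominates, so moderate values of λ typically satisfy the inequality.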
6. Practical Implications
- Ridge regression trades off increased bias for reduced variance.
- It's particularly useful when predictors are highly correlated or when there are many predictors relative to the sample size.
- The optimal λ can be chosen through cross-validation or other model selection techniques (a sketch using scikit-learn follows this list).
- As λ → 0, Ridge regression approaches OLS; as λ → ∞, all coefficients approach zero.
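For the cross-validation point, one common route is scikit-learn's RidgeCV, which searches a grid of penalty values (scikit-learn calls the penalty "alpha" rather than λ). The data below are simulated placeholders, and RidgeCV's default settings (e.g. intercept handling) are left as-is in this sketch.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))                   # assumed predictors
beta_true = rng.normal(scale=0.5, size=10)       # assumed true coefficients
y = X @ beta_true + rng.normal(size=200)

# scikit-learn's "alpha" plays the role of lambda here
model = RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y)
print("selected penalty:", model.alpha_)
print("coefficients:", np.round(model.coef_, 3))
```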