Lasso Regression as a Bayesian Estimation Problem
1. Lasso Regression: Frequentist Formulation
Lasso (Least Absolute Shrinkage and Selection Operator) regression minimizes:
\[
L(\beta) = \|y - X\beta\|^2 + \lambda \|\beta\|_1
\]
where y is the n-vector of responses, X is the n×p design matrix, ||β||₁ = Σⱼ |βⱼ| is the L1 norm of the coefficient vector, and λ > 0 is the regularization parameter.
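As a concrete reference, this objective is a few lines of NumPy. This is an illustrative sketch of the loss itself, not a solver; beware that some libraries rescale the squared-error term, so their reported penalty parameter is not directly this λ.

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """Frequentist Lasso loss: ||y - X beta||^2 + lam * ||beta||_1."""
    residual = y - X @ beta
    return residual @ residual + lam * np.abs(beta).sum()
```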
2. Bayesian Framework
In Bayesian statistics, we seek the posterior distribution:
\[
p(\beta|y) \propto p(y|\beta) \cdot p(\beta)
\]
where p(y|β) is the likelihood and p(β) is the prior distribution on β. The proportionality hides the normalizing constant p(y), which does not depend on β and can therefore be ignored when maximizing over β.
3. Likelihood Function
Assuming the linear model y = Xβ + ε with i.i.d. Gaussian errors ε ~ N(0, σ²I), the likelihood is:
\[
p(y|\beta) \propto \exp\left(-\frac{1}{2\sigma^2}\|y - X\beta\|^2\right)
\]
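For completeness, this form follows directly from the i.i.d. Gaussian model: multiplying the n individual densities gives
\[
p(y|\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\right) \propto \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2\right),
\]
and the sum in the exponent is exactly ||y − Xβ||².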
4. Prior Distribution for Lasso
The prior that leads to Lasso regression places independent Laplace (double exponential) distributions on the coefficients:
\[
p(\beta) \propto \exp\left(-\frac{\lambda}{\sigma^2}\|\beta\|_1\right)
\]
This prior has the following properties:
- It has a sharp peak at zero (the density is not differentiable there), which is what drives coefficients to exactly zero and encourages sparsity.
- It has heavier tails than a Gaussian of the same variance, so large coefficients are penalized less harshly when the data support them. (Both properties are visible in the numerical sketch below.)
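The sketch matches a Laplace and a Gaussian to the same variance (assuming SciPy is available) and compares their log-densities at the origin and out in the tail:

```python
import numpy as np
from scipy.stats import laplace, norm

b = 1.0                    # Laplace scale; its variance is 2 * b**2
sigma = np.sqrt(2.0) * b   # Gaussian matched to the same variance

for x in (0.0, 1.0, 4.0):
    print(f"x={x}: Laplace log-density {laplace.logpdf(x, scale=b):+.3f}, "
          f"Gaussian log-density {norm.logpdf(x, scale=sigma):+.3f}")
```

With matched variances, the Laplace log-density is larger both at x = 0 (the sharper peak) and at x = 4 (the heavier tail), while the Gaussian is larger in between.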
5. Posterior Distribution
Combining the likelihood and prior:
\[
\begin{aligned}
p(\beta|y) &\propto p(y|\beta) \cdot p(\beta) \\
&\propto \exp\left(-\frac{1}{2\sigma^2}\|y - X\beta\|^2\right) \cdot \exp\left(-\frac{\lambda}{\sigma^2}\|\beta\|_1\right) \\
&\propto \exp\left(-\frac{1}{2\sigma^2}\left(\|y - X\beta\|^2 + 2\lambda\|\beta\|_1\right)\right)
\end{aligned}
\]
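This algebra is easy to sanity-check numerically on synthetic data (all names and values below are illustrative): differences of unnormalized log-posteriors between two coefficient vectors should equal −1/(2σ²) times the corresponding difference in the bracketed objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, lam = 50, 5, 1.0, 0.7
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + sigma * rng.normal(size=n)

def log_lik(beta):    # log p(y|beta), up to an additive constant
    r = y - X @ beta
    return -(r @ r) / (2 * sigma**2)

def log_prior(beta):  # log p(beta), up to an additive constant
    return -(lam / sigma**2) * np.abs(beta).sum()

def objective(beta):  # ||y - X beta||^2 + 2 * lam * ||beta||_1
    r = y - X @ beta
    return r @ r + 2 * lam * np.abs(beta).sum()

b1, b2 = rng.normal(size=p), rng.normal(size=p)
lhs = (log_lik(b1) + log_prior(b1)) - (log_lik(b2) + log_prior(b2))
rhs = -(objective(b1) - objective(b2)) / (2 * sigma**2)
print(np.isclose(lhs, rhs))  # True: the two exponents differ only by a constant
```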
6. Maximum A Posteriori (MAP) Estimation
The MAP estimate maximizes the posterior distribution, which is equivalent to minimizing the negative log-posterior:
\[
\hat{\beta}_{\text{MAP}} = \operatorname*{arg\,min}_{\beta} \left\{\|y - X\beta\|^2 + 2\lambda\|\beta\|_1\right\}
\]
This is exactly the Lasso problem: the rate λ of the Laplace prior corresponds to a penalty of 2λ in the frequentist objective, so the two parameterizations differ only by a factor of 2.
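To see the equivalence in practice, the MAP objective can be minimized directly with proximal gradient descent (ISTA, whose proximal step is soft-thresholding) and compared against an off-the-shelf solver. A sketch on synthetic data follows; note that scikit-learn's Lasso minimizes (1/(2n))||y − Xβ||² + α||β||₁, so the objective above corresponds to α = λ/n.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, lam = 100, 10, 5.0
X = rng.normal(size=(n, p))
y = X @ np.concatenate([rng.normal(size=3), np.zeros(p - 3)]) + rng.normal(size=n)

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Step size 1/L, where L = 2 * sigma_max(X)^2 is the Lipschitz constant
# of the gradient of ||y - X beta||^2.
step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
beta = np.zeros(p)
for _ in range(20000):
    grad = -2.0 * X.T @ (y - X @ beta)          # gradient of the smooth term
    beta = soft_threshold(beta - step * grad, step * 2.0 * lam)

sk = Lasso(alpha=lam / n, fit_intercept=False, tol=1e-10, max_iter=100_000).fit(X, y)
print(np.allclose(beta, sk.coef_, atol=1e-4))   # True: MAP estimate == Lasso solution
```

The soft-thresholding step is where the exact zeros come from: any coordinate whose gradient step lands within the threshold is set to exactly zero.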
7. Interpretation
- The Laplace prior encourages sparsity in the coefficient estimates, leading to feature selection.
- The regularization parameter λ controls the strength of the prior belief in sparsity.
- As λ → 0, the prior becomes uninformative, and the MAP estimate approaches the maximum likelihood estimate (OLS).
- As λ → ∞, the prior dominates, pushing all coefficients towards zero; the sketch below traces both limits numerically.
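This sketch uses synthetic data and scikit-learn's Lasso (with its α = λ/n scaling relative to the MAP objective above):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 10
true_beta = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])
X = rng.normal(size=(n, p))
y = X @ true_beta + rng.normal(size=n)

for lam in (0.01, 1.0, 10.0, 100.0, 1000.0):
    coef = Lasso(alpha=lam / n, fit_intercept=False, max_iter=100_000).fit(X, y).coef_
    print(f"lambda={lam:>7}: {np.count_nonzero(coef)} nonzero coefficients")
```

As λ increases, the nonzero count drops until every coefficient is shrunk exactly to zero.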
8. Conclusion
Lasso regression can be formulated as a Bayesian estimation problem with a Laplace (double exponential) prior distribution on the coefficients. This Bayesian perspective provides insight into why Lasso tends to produce sparse solutions and how the regularization parameter λ can be interpreted as a prior belief in the sparsity of the model.