Lasso Regression as a Bayesian Estimation Problem

1. Lasso Regression: Frequentist Formulation

Lasso (Least Absolute Shrinkage and Selection Operator) regression minimizes:

\[ L(\beta) = \|y - X\beta\|^2 + \lambda \|\beta\|_1 \]

where ||β||₁ is the L1 norm of β (sum of absolute values of coefficients), and λ > 0 is the regularization parameter.
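As a concrete sketch, this objective can be minimized by cyclic coordinate descent with soft-thresholding; the code below is an illustrative minimal implementation (function names and iteration count are our own choices, not part of the original formulation):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: the proximal map of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)  # per-coordinate curvature ||X_j||^2
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j's contribution removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j
            # Closed-form coordinate update; the objective has no 1/2 factor
            # on the squared error, so the threshold is lam / 2.
            beta[j] = soft_threshold(z, lam / 2.0) / col_sq[j]
    return beta
```

For moderate λ this recovers the support of a sparse true coefficient vector; once λ exceeds 2·maxⱼ |Xⱼᵀy|, every coordinate is thresholded to zero and the estimate is the all-zero vector.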

2. Bayesian Framework

In Bayesian statistics, we seek the posterior distribution:

\[ p(\beta|y) \propto p(y|\beta) \cdot p(\beta) \]

where p(y|β) is the likelihood and p(β) is the prior distribution on β.

3. Likelihood Function

Assuming Gaussian errors with variance σ², the likelihood is:

\[ p(y|\beta) \propto \exp\left(-\frac{1}{2\sigma^2}\|y - X\beta\|^2\right) \]
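The proportionality hides only a term that does not depend on β; a small numeric helper (illustrative, not from the original) makes the full log-likelihood explicit:

```python
import numpy as np

def gaussian_loglik(y, X, beta, sigma):
    """Exact log-likelihood of y ~ N(X beta, sigma^2 I):
    -(n/2) log(2 pi sigma^2) - ||y - X beta||^2 / (2 sigma^2)."""
    n = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    return -0.5 * n * np.log(2.0 * np.pi * sigma ** 2) - rss / (2.0 * sigma ** 2)
```

The first term is constant in β, which is why the likelihood can be written as proportional to exp(−‖y − Xβ‖²/(2σ²)) when estimating β.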

4. Prior Distribution for Lasso

The prior distribution that leads to Lasso regression is the Laplace (double exponential) distribution:

\[ p(\beta) \propto \exp\left(-\frac{\lambda}{\sigma^2}\|\beta\|_1\right) \]
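A short illustration (with hypothetical numbers) of why this prior favors sparsity: among coefficient vectors with the same Euclidean length, the Laplace prior assigns strictly higher density to the sparser one, whereas a Gaussian prior, which depends on β only through ‖β‖², cannot distinguish them:

```python
import numpy as np

def laplace_log_prior(beta, lam=1.0, sigma=1.0):
    """Log of the unnormalized Laplace prior exp(-(lam/sigma^2) * ||beta||_1)."""
    return -(lam / sigma ** 2) * np.sum(np.abs(beta))

# Two coefficient vectors with the same L2 norm but different L1 norm:
sparse = np.array([1.0, 0.0])
dense = np.array([1.0, 1.0]) / np.sqrt(2.0)

# ||sparse||_1 = 1 < ||dense||_1 = sqrt(2), so the Laplace prior prefers
# the sparse vector; a Gaussian prior proportional to exp(-||beta||^2 / 2)
# would assign both the same density.
```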

This prior has two properties that matter here: it is sharply peaked at zero, encoding a prior belief that many coefficients are exactly or nearly zero, and it has heavier tails than a Gaussian prior, so it shrinks small coefficients aggressively while penalizing large coefficients less severely than a Gaussian (ridge) prior would.

5. Posterior Distribution

Combining the likelihood and prior:

\[ \begin{aligned} p(\beta|y) &\propto p(y|\beta) \cdot p(\beta) \\ &\propto \exp\left(-\frac{1}{2\sigma^2}\|y - X\beta\|^2\right) \cdot \exp\left(-\frac{\lambda}{\sigma^2}\|\beta\|_1\right) \\ &\propto \exp\left(-\frac{1}{2\sigma^2}\left(\|y - X\beta\|^2 + 2\lambda\|\beta\|_1\right)\right) \end{aligned} \]
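The algebra in the last step can be checked numerically: writing the log-likelihood and log-prior separately and summing them gives exactly the combined expression (up to the constants already dropped). The helper names below are illustrative:

```python
import numpy as np

def log_posterior_unnorm(beta, y, X, lam, sigma):
    """Unnormalized log-posterior from the combined expression:
    -(||y - X beta||^2 + 2 lam ||beta||_1) / (2 sigma^2)."""
    rss = np.sum((y - X @ beta) ** 2)
    l1 = np.sum(np.abs(beta))
    return -(rss + 2.0 * lam * l1) / (2.0 * sigma ** 2)

def log_lik_plus_prior(beta, y, X, lam, sigma):
    """Log-likelihood plus log-prior, each term written separately
    (beta-independent constants dropped)."""
    rss = np.sum((y - X @ beta) ** 2)
    return -rss / (2.0 * sigma ** 2) - (lam / sigma ** 2) * np.sum(np.abs(beta))
```

Evaluating both functions at any β gives identical values, confirming that folding the prior's λ/σ² term inside the 1/(2σ²) factor produces the 2λ coefficient on the L1 penalty.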

6. Maximum A Posteriori (MAP) Estimation

The MAP estimate maximizes the posterior distribution, which is equivalent to minimizing the negative log-posterior:

\[ \hat{\beta}_{MAP} = \arg\min_{\beta} \left\{\|y - X\beta\|^2 + 2\lambda\|\beta\|_1\right\} \]

This is exactly the Lasso regression problem: a prior scale of λ in the Bayesian formulation corresponds to a penalty weight of 2λ in the frequentist formulation (equivalently, a frequentist penalty λ corresponds to a Laplace prior with scale λ/2).
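For a single predictor the equivalence can be verified directly: the MAP objective ‖y − xβ‖² + 2λ|β| has a closed-form soft-thresholding solution, and a brute-force search over the negative log-posterior recovers the same point. The data below are synthetic and for illustration only:

```python
import numpy as np

def map_1d(x, y, lam):
    """MAP / Lasso estimate with one predictor:
    argmin_b ||y - x b||^2 + 2 lam |b| = S(x.y, lam) / (x.x),
    where S(z, t) = sign(z) * max(|z| - t, 0) is soft-thresholding."""
    z = x @ y
    return np.sign(z) * max(abs(z) - lam, 0.0) / (x @ x)
```

The estimate is exactly zero whenever |xᵀy| ≤ λ, which is the mechanism behind Lasso's variable selection.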

7. Interpretation

Under this view, the regularization parameter λ is not an arbitrary tuning knob: it sets the scale of the Laplace prior and so quantifies how strongly we believe the coefficients are concentrated near zero. A larger λ corresponds to a tighter prior and a sparser MAP estimate. Note that the correspondence is specifically with the posterior mode: the full posterior mean under a Laplace prior is generally not sparse, and exact zeros arise only from the MAP point estimate.

8. Conclusion

Lasso regression can be formulated as a Bayesian estimation problem with a Laplace (double exponential) prior distribution on the coefficients. This Bayesian perspective provides insight into why Lasso tends to produce sparse solutions and how the regularization parameter λ can be interpreted as a prior belief in the sparsity of the model.