Confidence Intervals for Regression Coefficients
1. Introduction
In regression analysis, confidence intervals provide a range of plausible values for the true population parameters, given the observed data. We'll focus on the 95% confidence interval for the OLS estimator.
2. Formula for 95% Confidence Interval
For a given coefficient \(\beta_j\), the 95% confidence interval is given by:
\[
\hat{\beta}_j \pm t_{n-k,0.975} \cdot SE(\hat{\beta}_j)
\]
Where:
- \(\hat{\beta}_j\) is the OLS estimate of the j-th coefficient
- \(t_{n-k,0.975}\) is the 97.5th percentile of the t-distribution with n-k degrees of freedom
- n is the number of observations
- k is the number of estimated coefficients (including the intercept)
- \(SE(\hat{\beta}_j)\) is the standard error of \(\hat{\beta}_j\)
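The formula above is straightforward to compute directly. Here is a minimal Python sketch using SciPy; the function name `conf_interval` and the numbers in the usage line are illustrative, not taken from the text:

```python
# Sketch: 95% CI for a single coefficient from its estimate and standard
# error. All inputs below are illustrative values.
from scipy import stats

def conf_interval(beta_hat, se, n, k, level=0.95):
    """Return (lower, upper) t-based confidence interval for one coefficient."""
    # t_{n-k, 1-(1-level)/2}: critical value with n - k degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=n - k)
    half_width = t_crit * se
    return beta_hat - half_width, beta_hat + half_width

lo, hi = conf_interval(beta_hat=1.2, se=0.3, n=30, k=2)
```

Note that the interval is symmetric about \(\hat{\beta}_j\), with half-width equal to the critical value times the standard error.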
3. Calculating the Standard Error
The standard error for \(\hat{\beta}_j\) is given by:
\[
SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \cdot [(X'X)^{-1}]_{jj}}
\]
Where:
- \(\hat{\sigma}^2\) is the estimate of the error variance: \(\hat{\sigma}^2 = \frac{RSS}{n-k}\)
- RSS is the Residual Sum of Squares
- \([(X'X)^{-1}]_{jj}\) is the j-th diagonal element of \((X'X)^{-1}\)
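These quantities can all be computed from scratch with NumPy. The sketch below fits OLS on small synthetic data and recovers \(\hat{\sigma}^2\) and the standard errors exactly as defined above (the data and coefficient values are illustrative):

```python
# Sketch: OLS standard errors from first principles on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3                       # 50 observations; intercept + 2 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)   # (X'X)^{-1}
beta_hat = XtX_inv @ X.T @ y       # OLS estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - k)          # RSS / (n - k)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # SE(beta_j) for each j
```

For ill-conditioned design matrices, `np.linalg.lstsq` or a QR decomposition is numerically preferable to forming \((X'X)^{-1}\) explicitly; the inverse is used here only to mirror the formula.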
4. Interpretation
A 95% confidence interval means that if we were to repeat the sampling process many times and calculate the confidence interval each time, about 95% of these intervals would contain the true population parameter.
For a given sample:
- If the interval doesn't contain 0, we can conclude that the coefficient is statistically significant at the 5% level.
- The width of the interval provides information about the precision of the estimate. Narrower intervals indicate more precise estimates.
5. Assumptions
The validity of these confidence intervals relies on several assumptions:
- Linearity: The relationship between X and Y is linear.
- Independence: The observations are independent of each other.
- Homoscedasticity: The variance of the errors is constant across observations.
- Normality: The errors are normally distributed (required for the t-based intervals to be exact in small samples).
- No perfect multicollinearity among the predictors.
Violation of these assumptions can lead to incorrect confidence intervals.
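Two of these assumptions admit quick informal checks on the residuals. The sketch below uses stand-in arrays for the residuals and fitted values (in practice these would come from a fitted model); the checks shown are heuristics, not a substitute for formal diagnostics:

```python
# Sketch: informal residual diagnostics. `resid` and `fitted` are stand-in
# arrays here; in practice they come from a fitted OLS model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
resid = rng.normal(size=100)    # stand-in residuals
fitted = rng.normal(size=100)   # stand-in fitted values

# Normality: Shapiro-Wilk test (small p-value suggests non-normal residuals).
shapiro_stat, p_normal = stats.shapiro(resid)

# Crude homoscedasticity check: under constant variance, |residuals| should
# be roughly uncorrelated with the fitted values.
r, p_hetero = stats.pearsonr(np.abs(resid), fitted)
```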
6. Example Calculation
Suppose we have the following results for a coefficient \(\beta_1\):
- \(\hat{\beta}_1 = 2.5\)
- \(SE(\hat{\beta}_1) = 0.5\)
- n = 100, k = 3
- \(t_{97,0.975} \approx 1.98\) (from t-distribution table)
The 95% confidence interval would be:
\[
\begin{aligned}
2.5 \pm 1.98 \cdot 0.5 &= 2.5 \pm 0.99 \\
&= [1.51, 3.49]
\end{aligned}
\]
Interpretation: We are 95% confident that the true population value of \(\beta_1\) lies between 1.51 and 3.49.
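The arithmetic in this example can be checked with SciPy's t quantile function:

```python
# Verify the worked example: beta_hat = 2.5, SE = 0.5, n - k = 97 df.
from scipy import stats

t_crit = stats.t.ppf(0.975, df=97)        # critical value, approx. 1.98
lower = 2.5 - t_crit * 0.5
upper = 2.5 + t_crit * 0.5
print(round(lower, 2), round(upper, 2))   # → 1.51 3.49
```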
7. Considerations
- For large samples, the t-distribution approaches the normal distribution, so z-scores may be used instead of t-scores.
- In the presence of heteroscedasticity, robust standard errors (e.g., White's standard errors) should be used to calculate more accurate confidence intervals.
- For non-linear models or when assumptions are violated, bootstrap methods can be used to construct confidence intervals.
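As one concrete instance of the last point, a pairs (case-resampling) bootstrap interval for a slope coefficient can be sketched as follows. The synthetic data, seed, and number of resamples are illustrative choices:

```python
# Sketch: percentile bootstrap CI for a slope, resampling (x, y) pairs.
import numpy as np

rng = np.random.default_rng(2)
n = 80
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true slope = 2.0
X = np.column_stack([np.ones(n), x])

def ols_slope(X, y):
    """Slope coefficient from a least-squares fit."""
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)        # resample rows with replacement
    boot.append(ols_slope(X[idx], y[idx]))

lower, upper = np.percentile(boot, [2.5, 97.5])     # percentile bootstrap CI
```

The percentile method shown is the simplest variant; bias-corrected (BCa) intervals are generally preferred when the bootstrap distribution is skewed.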