Cook's distance is a measure used to identify influential observations in linear regression models. It quantifies the impact of deleting a given observation on the model's fitted values.
Consider the linear regression model:

y = Xβ + ε,

where y is the n×1 response vector, X is the n×p design matrix, β is the p×1 vector of coefficients, and ε is the error term.
The OLS estimator of β is:

β̂ = (X'X)⁻¹X'y.
The fitted values are given by:

ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy,

where H = X(X'X)⁻¹X' is the hat matrix.
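To make these quantities concrete, here is a minimal NumPy sketch that computes β̂, H, and ŷ; the simulated data, seed, and variable names are illustrative assumptions, not part of the source derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Simulated data: an intercept column plus two random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)   # (X'X)^-1
beta_hat = XtX_inv @ X.T @ y       # OLS estimator
H = X @ XtX_inv @ X.T              # hat matrix
y_hat = H @ y                      # fitted values

# Hy and X @ beta_hat are the same vector.
assert np.allclose(y_hat, X @ beta_hat)
```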
Let β̂₍ᵢ₎ be the OLS estimate when the i-th observation is removed, and let ŷ₍ᵢ₎ = Xβ̂₍ᵢ₎ be the corresponding fitted values. The change in fitted values is:

ŷ - ŷ₍ᵢ₎ = X(β̂ - β̂₍ᵢ₎).
Using the Sherman-Morrison-Woodbury formula (derived below), the coefficient change has a closed form:

β̂ - β̂₍ᵢ₎ = (X'X)⁻¹xᵢeᵢ / (1 - hᵢᵢ),

where xᵢ is the i-th row of X (written as a column vector), eᵢ = yᵢ - xᵢ'β̂ is the i-th residual, and hᵢᵢ = xᵢ'(X'X)⁻¹xᵢ is the i-th diagonal element of H.
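This closed form is easy to check numerically. The sketch below (again on simulated data of my own construction) compares the SMW expression for β̂ - β̂₍ᵢ₎ against an explicit refit with the i-th row deleted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat                          # residuals
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # diagonal of the hat matrix

i = 7  # arbitrary observation to delete

# Closed-form change in coefficients from the SMW update.
delta = XtX_inv @ X[i] * e[i] / (1 - h[i])

# Brute force: refit without observation i.
X_del, y_del = np.delete(X, i, axis=0), np.delete(y, i)
beta_hat_i = np.linalg.solve(X_del.T @ X_del, X_del.T @ y_del)

assert np.allclose(beta_hat - beta_hat_i, delta)
```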
Cook's distance for the i-th observation is defined as:

Dᵢ = (ŷ - ŷ₍ᵢ₎)'(ŷ - ŷ₍ᵢ₎) / (p σ̂²),

where p is the number of parameters and σ̂² = e'e/(n - p) is the mean squared error of the full model.
Substituting the closed form for β̂ - β̂₍ᵢ₎ into ŷ - ŷ₍ᵢ₎ = X(β̂ - β̂₍ᵢ₎), and using xᵢ'(X'X)⁻¹X'X(X'X)⁻¹xᵢ = hᵢᵢ, we get:

Dᵢ = (eᵢ² / (p σ̂²)) · (hᵢᵢ / (1 - hᵢᵢ)²).

This is the final expression for Cook's distance. Every term comes from the full-model fit, so Dᵢ can be computed for all n observations without refitting the model.
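As a sanity check of the final formula, the sketch below computes Dᵢ for all observations from full-model quantities only, then verifies one entry against the brute-force deletion definition. The simulated data and names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat
e = y - y_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
sigma2_hat = e @ e / (n - p)                  # mean squared error

# Closed form: Cook's distance for every observation, no refitting.
D = e**2 * h / (p * sigma2_hat * (1 - h)**2)

# Brute force for one observation, as a cross-check.
i = 7
X_del, y_del = np.delete(X, i, axis=0), np.delete(y, i)
beta_hat_i = np.linalg.solve(X_del.T @ X_del, X_del.T @ y_del)
diff = y_hat - X @ beta_hat_i                 # y_hat minus y_hat_(i)
assert np.isclose(D[i], diff @ diff / (p * sigma2_hat))
```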
The Sherman-Morrison-Woodbury (SMW) formula provides a way to compute the inverse of a matrix after a low-rank update. It is particularly useful for updating regression coefficients when adding or removing observations.
For a non-singular square matrix A and vectors u and v with 1 + v'A⁻¹u ≠ 0, the rank-one case of the SMW formula (often called the Sherman-Morrison formula) states:

(A + uv')⁻¹ = A⁻¹ - (A⁻¹uv'A⁻¹) / (1 + v'A⁻¹u).
The identity can be verified directly: multiplying A + uv' by the right-hand side collapses to the identity matrix. This formula allows us to efficiently compute the inverse of a matrix after a rank-one update, which is crucial in many statistical and computational applications, including the derivation of Cook's distance.
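A quick numerical verification of the rank-one identity, with a randomly generated (and deliberately well-conditioned) matrix chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
A = rng.normal(size=(k, k)) + k * np.eye(k)  # well-conditioned matrix
u = rng.normal(size=k)
v = rng.normal(size=k)

A_inv = np.linalg.inv(A)

# Sherman-Morrison: inverse of a rank-one update, without re-inverting.
updated_inv = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1 + v @ A_inv @ u)

assert np.allclose(updated_inv, np.linalg.inv(A + np.outer(u, v)))
```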
In the context of Cook's distance, we apply this formula with A = X'X, u = -xᵢ, and v = xᵢ to remove the i-th observation. Specifically:

(X₍ᵢ₎'X₍ᵢ₎)⁻¹ = (X'X - xᵢxᵢ')⁻¹ = (X'X)⁻¹ + ((X'X)⁻¹xᵢxᵢ'(X'X)⁻¹) / (1 - hᵢᵢ),

where X₍ᵢ₎ is the design matrix with the i-th row deleted.
This allows us to express β̂₍ᵢ₎ (the coefficients without the i-th observation) in terms of β̂ (the coefficients with all observations), which is a key step in deriving the computationally efficient form of Cook's distance.
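Putting the update to work: the sketch below recovers β̂₍ᵢ₎ from full-sample quantities via the downdated inverse, then confirms the result against a direct refit. As before, the data are simulated and the variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
i = 7
x_i = X[i]
h_ii = x_i @ XtX_inv @ x_i

# Downdated inverse via the rank-one SMW identity: no second inversion.
XtX_inv_loo = XtX_inv + np.outer(XtX_inv @ x_i, x_i @ XtX_inv) / (1 - h_ii)

# Leave-one-out coefficients from the downdated cross-products.
beta_hat_loo = XtX_inv_loo @ (X.T @ y - x_i * y[i])

# Same result as an explicit refit without observation i.
X_del, y_del = np.delete(X, i, axis=0), np.delete(y, i)
assert np.allclose(beta_hat_loo,
                   np.linalg.solve(X_del.T @ X_del, X_del.T @ y_del))
```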