Cook's distance is a measure used to identify influential observations in linear regression models. It quantifies the impact of deleting a given observation on the model's fitted values.
Consider the linear regression model:

y = Xβ + ε,

where y is the n×1 response vector, X is the n×p design matrix, β is the p×1 vector of coefficients, and ε is the error term.
The OLS estimator of β is:

β̂ = (X'X)⁻¹X'y.
The fitted values are given by:

ŷ = Xβ̂ = X(X'X)⁻¹X'y = Hy,

where H = X(X'X)⁻¹X' is the hat matrix.
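To make these quantities concrete, here is a minimal NumPy sketch that computes β̂, H, and ŷ; the simulated data, seed, and variable names are illustrative assumptions, not part of the source derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3

# Simulated data: an intercept column plus two random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)   # (X'X)^-1
beta_hat = XtX_inv @ X.T @ y       # OLS estimator
H = X @ XtX_inv @ X.T              # hat matrix
y_hat = H @ y                      # fitted values

# Hy and X @ beta_hat are the same vector.
assert np.allclose(y_hat, X @ beta_hat)
```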
Let β̂₍ᵢ₎ be the OLS estimate when the i-th observation is removed, and let ŷ₍ᵢ₎ = Xβ̂₍ᵢ₎ be the corresponding fitted values. The change in fitted values is:

ŷ - ŷ₍ᵢ₎ = X(β̂ - β̂₍ᵢ₎).
Using the Sherman-Morrison-Woodbury formula (derived below), the coefficient change has a closed form:

β̂ - β̂₍ᵢ₎ = (X'X)⁻¹xᵢeᵢ / (1 - hᵢᵢ),

where xᵢ is the i-th row of X (written as a column vector), eᵢ = yᵢ - xᵢ'β̂ is the i-th residual, and hᵢᵢ = xᵢ'(X'X)⁻¹xᵢ is the i-th diagonal element of H.
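This closed form is easy to check numerically. The sketch below (again on simulated data of my own construction) compares the SMW expression for β̂ - β̂₍ᵢ₎ against an explicit refit with the i-th row deleted:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat                          # residuals
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # diagonal of the hat matrix

i = 7  # arbitrary observation to delete

# Closed-form change in coefficients from the SMW update.
delta = XtX_inv @ X[i] * e[i] / (1 - h[i])

# Brute force: refit without observation i.
X_del, y_del = np.delete(X, i, axis=0), np.delete(y, i)
beta_hat_i = np.linalg.solve(X_del.T @ X_del, X_del.T @ y_del)

assert np.allclose(beta_hat - beta_hat_i, delta)
```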
Cook's distance for the i-th observation is defined as:

Dᵢ = (ŷ - ŷ₍ᵢ₎)'(ŷ - ŷ₍ᵢ₎) / (p σ̂²),

where p is the number of parameters and σ̂² = e'e/(n - p) is the mean squared error of the full model.
Substituting the closed form for β̂ - β̂₍ᵢ₎ into ŷ - ŷ₍ᵢ₎ = X(β̂ - β̂₍ᵢ₎), and using xᵢ'(X'X)⁻¹X'X(X'X)⁻¹xᵢ = hᵢᵢ, we get:

Dᵢ = (eᵢ² / (p σ̂²)) · (hᵢᵢ / (1 - hᵢᵢ)²).

This is the final expression for Cook's distance. Every term comes from the full-model fit, so Dᵢ can be computed for all n observations without refitting the model.
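As a sanity check of the final formula, the sketch below computes Dᵢ for all observations from full-model quantities only, then verifies one entry against the brute-force deletion definition. The simulated data and names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
y_hat = X @ beta_hat
e = y - y_hat
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)
sigma2_hat = e @ e / (n - p)                  # mean squared error

# Closed form: Cook's distance for every observation, no refitting.
D = e**2 * h / (p * sigma2_hat * (1 - h)**2)

# Brute force for one observation, as a cross-check.
i = 7
X_del, y_del = np.delete(X, i, axis=0), np.delete(y, i)
beta_hat_i = np.linalg.solve(X_del.T @ X_del, X_del.T @ y_del)
diff = y_hat - X @ beta_hat_i                 # y_hat minus y_hat_(i)
assert np.isclose(D[i], diff @ diff / (p * sigma2_hat))
```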
The Sherman-Morrison-Woodbury (SMW) formula provides a way to compute the inverse of a matrix after a low-rank update. It is particularly useful for updating regression coefficients when adding or removing observations.
For a non-singular square matrix A and vectors u and v with 1 + v'A⁻¹u ≠ 0, the rank-one case of the SMW formula (often called the Sherman-Morrison formula) states:

(A + uv')⁻¹ = A⁻¹ - (A⁻¹uv'A⁻¹) / (1 + v'A⁻¹u).
The identity can be verified directly: multiplying A + uv' by the right-hand side collapses to the identity matrix. This formula allows us to efficiently compute the inverse of a matrix after a rank-one update, which is crucial in many statistical and computational applications, including the derivation of Cook's distance.
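A quick numerical verification of the rank-one identity, with a randomly generated (and deliberately well-conditioned) matrix chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 4
A = rng.normal(size=(k, k)) + k * np.eye(k)  # well-conditioned matrix
u = rng.normal(size=k)
v = rng.normal(size=k)

A_inv = np.linalg.inv(A)

# Sherman-Morrison: inverse of a rank-one update, without re-inverting.
updated_inv = A_inv - np.outer(A_inv @ u, v @ A_inv) / (1 + v @ A_inv @ u)

assert np.allclose(updated_inv, np.linalg.inv(A + np.outer(u, v)))
```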
In the context of Cook's distance, we apply this formula with A = X'X, u = -xᵢ, and v = xᵢ to remove the i-th observation. Specifically:

(X₍ᵢ₎'X₍ᵢ₎)⁻¹ = (X'X - xᵢxᵢ')⁻¹ = (X'X)⁻¹ + ((X'X)⁻¹xᵢxᵢ'(X'X)⁻¹) / (1 - hᵢᵢ),

where X₍ᵢ₎ is the design matrix with the i-th row deleted.
This allows us to express β̂₍ᵢ₎ (the coefficients without the i-th observation) in terms of β̂ (the coefficients with all observations), which is a key step in deriving the computationally efficient form of Cook's distance.
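Putting the update to work: the sketch below recovers β̂₍ᵢ₎ from full-sample quantities via the downdated inverse, then confirms the result against a direct refit. As before, the data are simulated and the variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
i = 7
x_i = X[i]
h_ii = x_i @ XtX_inv @ x_i

# Downdated inverse via the rank-one SMW identity: no second inversion.
XtX_inv_loo = XtX_inv + np.outer(XtX_inv @ x_i, x_i @ XtX_inv) / (1 - h_ii)

# Leave-one-out coefficients from the downdated cross-products.
beta_hat_loo = XtX_inv_loo @ (X.T @ y - x_i * y[i])

# Same result as an explicit refit without observation i.
X_del, y_del = np.delete(X, i, axis=0), np.delete(y, i)
assert np.allclose(beta_hat_loo,
                   np.linalg.solve(X_del.T @ X_del, X_del.T @ y_del))
```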