1. Definitions and Setup
Let's start by defining our terms:
- y: observed values
- ŷ: predicted values
- ȳ: mean of observed values
- n: number of observations
Coefficient of Determination (R²)
\[
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = \frac{SS_{reg}}{SS_{tot}}
\]
Where:
- SS_res: Sum of squared residuals = Σ(y - ŷ)²
- SS_tot: Total sum of squares = Σ(y - ȳ)²
- SS_reg: Regression sum of squares = Σ(ŷ - ȳ)²
The second equality, R² = SS_reg/SS_tot, relies on the decomposition SS_tot = SS_res + SS_reg, which holds for OLS regression with an intercept; we prove it in Step 1 below.
Correlation Coefficient (r)
\[
r = \frac{\sum (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum (y_i - \bar{y})^2 \sum (\hat{y}_i - \bar{\hat{y}})^2}}
\]
For OLS regression with an intercept, the residuals sum to zero, so the mean of the predicted values equals ȳ; the proof below uses this fact to replace the mean of ŷ with ȳ throughout.
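To make these definitions concrete, here is a minimal numerical sketch (assuming Python with NumPy; the data values are purely illustrative) that computes each quantity from a small OLS fit:
```python
import numpy as np

# Illustrative toy data (not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple linear regression by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)           # Σ(y - ŷ)²
ss_tot = np.sum((y - y.mean()) ** 2)        # Σ(y - ȳ)²
ss_reg = np.sum((y_hat - y.mean()) ** 2)    # Σ(ŷ - ȳ)²

r_squared = 1 - ss_res / ss_tot             # coefficient of determination
r = np.corrcoef(y, y_hat)[0, 1]             # correlation between y and ŷ
print(ss_res, ss_tot, ss_reg, r_squared, r)
```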
2. Proof
Step 1: Expand SS_tot
\[
\begin{aligned}
SS_{tot} &= \sum (y_i - \bar{y})^2 \\
&= \sum [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 \\
&= \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2 + 2\sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\
&= SS_{res} + SS_{reg} + 0
\end{aligned}
\]
The cross term vanishes because the OLS normal equations make the residuals e = y - ŷ sum to zero and render them orthogonal to the fitted values: Σ(y - ŷ)(ŷ - ȳ) = Σeŷ - ȳΣe = 0.
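A quick numerical check of this decomposition (same illustrative setup as above, NumPy assumed):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ss_res = np.sum((y - y_hat) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

# The cross term 2Σ(y - ŷ)(ŷ - ȳ) vanishes for OLS with an intercept.
cross = 2 * np.sum((y - y_hat) * (y_hat - y.mean()))
assert np.isclose(cross, 0.0)
assert np.isclose(ss_tot, ss_res + ss_reg)
```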
Step 2: Express R² in terms of SS_reg and SS_tot
By the decomposition in Step 1, SS_tot - SS_res = SS_reg, so
\[
R^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}
\]
Step 3: Expand the numerator of the correlation coefficient
Since the mean of the predicted values equals ȳ, the numerator of r becomes Σ(y - ȳ)(ŷ - ȳ), which expands as follows:
\[
\begin{aligned}
\sum (y_i - \bar{y})(\hat{y}_i - \bar{y}) &= \sum (y_i\hat{y}_i - y_i\bar{y} - \bar{y}\hat{y}_i + \bar{y}^2) \\
&= \sum y_i\hat{y}_i - n\bar{y}^2 - n\bar{y}^2 + n\bar{y}^2 \quad \text{(using } \sum y_i = \sum \hat{y}_i = n\bar{y}\text{)} \\
&= \sum y_i\hat{y}_i - n\bar{y}^2 \\
&= \sum \hat{y}_i^2 - n\bar{y}^2 \quad \text{(since } \sum (y_i - \hat{y}_i)\hat{y}_i = 0\text{, hence } \sum y_i\hat{y}_i = \sum \hat{y}_i^2\text{)} \\
&= \sum (\hat{y}_i - \bar{y})^2
\end{aligned}
\]
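This identity is also easy to verify numerically (same illustrative setup as before):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# Mean of the fitted values equals ȳ for OLS with an intercept.
assert np.isclose(y_hat.mean(), y.mean())

# The covariance-style numerator collapses to Σ(ŷ - ȳ)², as derived.
lhs = np.sum((y - y.mean()) * (y_hat - y.mean()))
rhs = np.sum((y_hat - y.mean()) ** 2)
assert np.isclose(lhs, rhs)
```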
Step 4: Write out the squared correlation coefficient
\[
\begin{aligned}
r^2 &= \frac{[\sum (y_i - \bar{y})(\hat{y}_i - \bar{y})]^2}{\sum (y_i - \bar{y})^2 \sum (\hat{y}_i - \bar{y})^2} \\
&= \frac{[\sum (\hat{y}_i - \bar{y})^2]^2}{\sum (y_i - \bar{y})^2 \sum (\hat{y}_i - \bar{y})^2} \\
&= \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \\
&= R^2
\end{aligned}
\]
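The full chain of equalities can be checked end to end with the same toy data (NumPy assumed; values illustrative):
```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(y, y_hat)[0, 1]
assert np.isclose(r_squared, r ** 2)

# ŷ is an affine function of x, so corr(y, ŷ)² also equals corr(y, x)².
assert np.isclose(r ** 2, np.corrcoef(y, x)[0, 1] ** 2)
```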
3. Conclusion
We have shown that in simple linear regression, the coefficient of determination (R²) is indeed equal to the square of the correlation coefficient between the observed and predicted values.
This relationship highlights the dual interpretation of R² in simple linear regression:
- As the proportion of variance in the dependent variable that is predictable from the independent variable.
- As the square of the correlation between observed and predicted values.
It's important to note what is and is not special about the simple case. The equality proved here, R² = [corr(y, ŷ)]², in fact extends to multiple OLS regression with an intercept, where corr(y, ŷ) is the multiple correlation coefficient. What is unique to simple linear regression is that ŷ is an affine function of the single predictor x, so R² also equals the squared correlation between y and x itself; in multiple regression, R² is not in general the square of the correlation between y and any individual predictor.
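A short sketch of that distinction (Python with NumPy assumed; the data are synthetic and purely illustrative):
```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                       # two predictors
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n)

# OLS with an intercept via an augmented design matrix.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# R² still equals the squared correlation between y and ŷ ...
assert np.isclose(r_squared, np.corrcoef(y, y_hat)[0, 1] ** 2)

# ... but not, in general, the squared correlation with any one predictor.
print(r_squared, np.corrcoef(y, X[:, 0])[0, 1] ** 2)
```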