Proof: R² Equals Squared Correlation Coefficient in Simple Linear Regression

1. Definitions and Setup

Let's start by defining our terms:

Coefficient of Determination (R²)

\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = \frac{SS_{reg}}{SS_{tot}} \]

Where:

  - \( y_i \) are the observed values and \( \hat{y}_i \) the predicted (fitted) values,
  - \( \bar{y} = \frac{1}{n}\sum y_i \) is the mean of the observed values,
  - \( SS_{res} = \sum (y_i - \hat{y}_i)^2 \) is the residual sum of squares,
  - \( SS_{reg} = \sum (\hat{y}_i - \bar{y})^2 \) is the regression sum of squares,
  - \( SS_{tot} = \sum (y_i - \bar{y})^2 \) is the total sum of squares.

Correlation Coefficient (r)

\[ r = \frac{\sum (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum (y_i - \bar{y})^2 \sum (\hat{y}_i - \bar{\hat{y}})^2}} \]

Here \( \bar{\hat{y}} \) denotes the mean of the predicted values, so \( r \) is the sample correlation between the observed values \( y_i \) and the predicted values \( \hat{y}_i \).
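
Both expressions for R² in the definition above can be checked numerically. The sketch below is illustrative only: it assumes NumPy, makes up a small synthetic dataset, and fits the line with the standard closed-form OLS estimates (none of these names come from the text above).

```python
import numpy as np

# Synthetic data for illustration; any (x, y) sample works.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + rng.normal(size=50)

# Closed-form OLS estimates for slope and intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares

print(1.0 - ss_res / ss_tot)  # R² via the residual form
print(ss_reg / ss_tot)        # R² via the regression form; same value
```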

2. Key Properties in Simple Linear Regression

In simple linear regression (ordinary least squares with an intercept), the fit satisfies the following properties, each a consequence of the normal equations; a numerical check appears after the list:

  1. The mean of the predicted values equals the mean of the observed values: \( \bar{\hat{y}} = \bar{y} \)
  2. The line of best fit passes through the point \( (\bar{x}, \bar{y}) \)
  3. The residuals sum to zero: \( \sum (y_i - \hat{y}_i) = 0 \)
  4. The residuals are orthogonal to the centered predicted values: \( \sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = 0 \)
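
A minimal sketch of that check, again assuming NumPy and a made-up dataset; only the four identities being tested come from the list above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = -0.5 * x + 3.0 + rng.normal(size=40)

# Closed-form OLS fit with an intercept.
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

print(np.isclose(y_hat.mean(), y.mean()))        # property 1
print(np.isclose(b0 + b1 * x.mean(), y.mean()))  # property 2: line passes through (x̄, ȳ)
print(np.isclose(np.sum(y - y_hat), 0.0))        # property 3
print(np.isclose(np.sum((y - y_hat) * (y_hat - y.mean())), 0.0))  # property 4
```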

3. Proof

Step 1: Expand SS_tot

\[ \begin{aligned} SS_{tot} &= \sum (y_i - \bar{y})^2 \\ &= \sum [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2 \\ &= \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2 + 2\sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\ &= SS_{res} + SS_{reg} + 0 \quad \text{(the cross term vanishes by property 4)} \end{aligned} \]
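
The decomposition can be confirmed numerically. A sketch under the same assumptions as before (NumPy, synthetic data, closed-form OLS):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(size=30)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_tot = np.sum((y - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
cross = 2.0 * np.sum((y - y_hat) * (y_hat - y.mean()))

print(np.isclose(cross, 0.0))               # the cross term vanishes (property 4)
print(np.isclose(ss_tot, ss_res + ss_reg))  # so SS_tot = SS_res + SS_reg
```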

Step 2: Express R² in terms of SS_reg and SS_tot

Since \( SS_{tot} = SS_{res} + SS_{reg} \) by Step 1, we have \( 1 - SS_{res}/SS_{tot} = SS_{reg}/SS_{tot} \), so

\[ R^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \]

Step 3: Expand the numerator of the correlation coefficient

\[ \begin{aligned} \sum (y_i - \bar{y})(\hat{y}_i - \bar{y}) &= \sum (y_i\hat{y}_i - y_i\bar{y} - \bar{y}\hat{y}_i + \bar{y}^2) \\ &= \sum y_i\hat{y}_i - n\bar{y}^2 - n\bar{y}^2 + n\bar{y}^2 \quad \text{(using } \sum y_i = \sum \hat{y}_i = n\bar{y} \text{, by property 1)} \\ &= \sum y_i\hat{y}_i - n\bar{y}^2 \\ &= \sum \hat{y}_i^2 - n\bar{y}^2 \quad \text{(since } \sum (y_i - \hat{y}_i)\hat{y}_i = 0 \text{ by properties 3 and 4)} \\ &= \sum (\hat{y}_i - \bar{y})^2 \end{aligned} \]
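
The same collapse can be observed numerically (a sketch, same assumptions as the earlier checks):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=25)
y = 0.8 * x - 2.0 + rng.normal(size=25)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

lhs = np.sum((y - y.mean()) * (y_hat - y.mean()))  # numerator of r (ȳ for both means, by property 1)
rhs = np.sum((y_hat - y.mean()) ** 2)              # SS_reg
print(np.isclose(lhs, rhs))  # True: the cross-product sum equals SS_reg
```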

Step 4: Write out the squared correlation coefficient

By property 1, \( \bar{\hat{y}} = \bar{y} \), so both means in the definition of \( r \) may be replaced by \( \bar{y} \). Substituting the result of Step 3 into the numerator:

\[ \begin{aligned} r^2 &= \frac{[\sum (y_i - \bar{y})(\hat{y}_i - \bar{y})]^2}{\sum (y_i - \bar{y})^2 \sum (\hat{y}_i - \bar{y})^2} \\ &= \frac{[\sum (\hat{y}_i - \bar{y})^2]^2}{\sum (y_i - \bar{y})^2 \sum (\hat{y}_i - \bar{y})^2} \\ &= \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} \\ &= R^2 \end{aligned} \]
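
Putting it all together, the chain of equalities can be confirmed end to end. A sketch under the same assumptions, computing \( r \) directly from the definition in Section 1:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = -1.2 * x + 0.5 + rng.normal(size=50)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# r from the definition: correlation of observed vs. predicted values.
num = np.sum((y - y.mean()) * (y_hat - y_hat.mean()))
den = np.sqrt(np.sum((y - y.mean()) ** 2) * np.sum((y_hat - y_hat.mean()) ** 2))
r = num / den

r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # R² = SS_reg / SS_tot
print(np.isclose(r ** 2, r2))  # True
```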

4. Conclusion

We have shown that in simple linear regression, the coefficient of determination (R²) is indeed equal to the square of the correlation coefficient between the observed and predicted values.

This relationship highlights the dual interpretation of R² in simple linear regression:

  1. As the proportion of variance in the dependent variable that is predictable from the independent variable.
  2. As the square of the correlation between observed and predicted values.

It's important to note what is and is not specific to simple linear regression. The proof above uses only properties 1, 3, and 4, which hold for any least-squares fit that includes an intercept; consequently, R² equals the squared correlation between observed and predicted values in multiple regression as well (there \( \sqrt{R^2} \) is known as the multiple correlation coefficient). What is specific to the simple case is that \( r \) also coincides, up to sign, with the correlation between x and y. In multiple regression, R² is not in general equal to the squared correlation between y and any single predictor.
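
This distinction is also easy to check numerically. A sketch assuming NumPy, with two made-up predictors and `np.linalg.lstsq` for the multiple-regression fit:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])  # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(np.isclose(r2, np.corrcoef(y, y_hat)[0, 1] ** 2))  # True: R² = corr(y, ŷ)² still holds
print(np.isclose(r2, np.corrcoef(y, x1)[0, 1] ** 2))     # False: not the correlation with one predictor
```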