1 Welcome to S2Y Lab 8

Intended Learning Outcomes:

  • obtain the elements of an analysis of variance (ANOVA) table;
  • use the ANOVA table to compute and interpret the \(F\)-statistic and its hypothesis test;
  • assess the goodness of fit of linear models based on \(R^2\) and \(R^2_a\).

1.1 Introduction

1.1.1 The ANOVA table

In the lectures we studied the analysis of variance (ANOVA) table, which can be used to investigate the variability explained by the model. It is based on partitioning the sums of squares (\(SS\)) and the degrees of freedom (df) associated with the response variable \(Y\). The total deviation, \(Y_i - \overline{Y}\), can be decomposed into

  1. The deviation of the fitted value \(\hat{Y}_i\) from the mean \(\overline{Y}\), and

  2. The deviation of the observation \(Y_i\) from the fitted regression value \(\hat{Y}_i\).

\[\underbrace{Y_i - \bar{Y}}_{\text{Total Deviation}} = \underbrace{\hat{Y_i} - \bar{Y}}_{\substack{\text{Deviation of Fitted}\\ \text{Regression Value}\\ \text{around the Mean}}} + \underbrace{Y_i - \hat{Y_i}}_{\substack{\text{Deviation around}\\ \text{the Fitted Regression Line}}}\]

When we square these deviations and sum over all observations, the cross-product term vanishes and we obtain the following equation, which decomposes the total variability in the response into two parts:

\[\underbrace{\sum_{i=1}^n(Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^n(\hat{Y_i} - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^n(Y_i - \hat{Y_i})^2}_{SSE}\]

We have

  • \(SST\): Total sum of squares

  • \(SSR\): Regression sum of squares

  • \(SSE\): Error (residual) sum of squares.
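
As a quick numerical check of this decomposition, here is a minimal sketch using R's built-in cars data (not the lab data), verifying that \(SST = SSR + SSE\) for a fitted model:

fit <- lm(dist ~ speed, data = cars)             # simple linear regression on built-in data
SST <- sum((cars$dist - mean(cars$dist))^2)      # total sum of squares
SSR <- sum((fitted(fit) - mean(cars$dist))^2)    # regression sum of squares
SSE <- sum(resid(fit)^2)                         # error (residual) sum of squares
all.equal(SST, SSR + SSE)                        # TRUE: the decomposition holds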

In a simple linear regression (only one predictor), the degrees of freedom can be partitioned as follows: \(SST\) has \(n-1\) degrees of freedom, \(SSE\) has \(n-2\), and \(SSR\) has \(1\). If we divide each \(SS\) by its degrees of freedom, we obtain the corresponding mean square:

\[\frac{SSR}{1} = MSR \text{ and } \frac{SSE}{n-2} = MSE\]

When \(\beta_1 = 0\), meaning the predictor has no effect in our regression, \(MSR\) and \(MSE\) will be similar in magnitude. If \(\beta_1 \neq 0\), \(MSR\) will tend to be larger than \(MSE\).

We can use this understanding to test the effect of the predictor in the model using a hypothesis test. In particular, we could test \(H_0: \beta_1 = 0\) versus \(H_1: \beta_1 \neq 0\) using the test statistic

\[F_\text{obs} = \frac{MSR}{MSE}\]

When the null hypothesis is true, i.e. \(H_0: \beta_1 = 0\), then \[F_\text{obs} = \frac{MSR}{MSE} \sim F_{1,n-2}.\]

As a rough rule of thumb, values of \(F_\text{obs}\) close to \(1\) generally support the null hypothesis, while larger values support the alternative. However, you will always need to compare it with a critical value found using a statistical table, or look at the \(p\)-value produced in R. To generate an ANOVA table for a fitted linear model object in R, use anova(lm.object).
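
For example, here is a minimal sketch using R's built-in cars data (not the lab data) showing what anova() returns for a fitted lm object:

fit <- lm(dist ~ speed, data = cars)   # any fitted lm object will do
anova(fit)                             # columns: Df, Sum Sq, Mean Sq, F value, Pr(>F)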

The full ANOVA table for a single predictor looks like this:

Figure 1.1: ANOVA table for a simple linear regression model

More generally, for a multiple linear regression with \(p-1\) predictors, the ANOVA table looks like this:

Figure 1.2: ANOVA table for a multiple linear regression model

We can perform an \(F\)-test of whether all parameters of interest in the model (i.e. all parameters excluding the intercept) are zero:

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0 \quad \text{versus} \quad H_1: \text{at least one } \beta_i \neq 0 \text{ for } i =1, 2, \ldots, p-1.\] The test statistic is again \(F_\text{obs} = \frac{MSR}{MSE}\) and under the assumption that \(H_0\) is true,

\[F_\text{obs} = \frac{MSR}{MSE} \sim F_{p-1,n-p}.\]
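
As a sketch of how this works in practice, the following code (using R's built-in mtcars data, not the lab data) computes the overall \(F\)-statistic and its \(p\)-value by hand and compares them with the values reported by summary():

fit <- lm(mpg ~ wt + hp, data = mtcars)                  # multiple regression with two predictors
n <- nrow(mtcars)                                        # number of observations
p <- length(coef(fit))                                   # number of parameters (including the intercept)

SSE <- sum(resid(fit)^2)                                 # error sum of squares
SST <- sum((mtcars$mpg - mean(mtcars$mpg))^2)            # total sum of squares
SSR <- SST - SSE                                         # regression sum of squares

F_obs <- (SSR / (p - 1)) / (SSE / (n - p))               # MSR / MSE
F_obs
pf(F_obs, df1 = p - 1, df2 = n - p, lower.tail = FALSE)  # p-value of the overall F-test
summary(fit)$fstatistic                                  # should agree with F_obs and its df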

1.1.2 \(R^2\)

The ANOVA table also allows us to easily calculate the coefficient of determination, \(R^2\). \(R^2\) is a useful tool to assess the goodness of fit of our linear regression models. It provides a measure of the proportion of variability in the \(Y_i\)s that can be explained by our model. The formula for \(R^2\) is:

\[R^2 = 1-\frac{SSE}{SST} = \frac{SSR}{SST}. \]

Think about it this way: since \(SSE\) is the error sum of squares, it can be interpreted as the amount of variability in \(Y\) that is not explained by the model, while \(SST\) represents all the variability in the data. If our model did not explain any variation, it would be equivalent to a horizontal line at \(\bar{Y}\) and \(SSE=SST\). Therefore, the ratio \(\frac{SSE}{SST}\) represents the proportion of variability not explained by the model, and \(1-\frac{SSE}{SST}\) is the proportion explained.
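
As a minimal sketch (again using R's built-in cars data rather than the lab data), \(R^2\) can be computed directly from the sums of squares and compared with the value reported by summary():

fit <- lm(dist ~ speed, data = cars)
SSE <- sum(resid(fit)^2)                       # variability not explained by the model
SST <- sum((cars$dist - mean(cars$dist))^2)    # total variability in the response
1 - SSE / SST                                  # R^2 as 1 - SSE/SST
summary(fit)$r.squared                         # should agree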

In today’s lab we will first revisit ordinary least squares estimates (OLS) and then delve into the coefficient of determination (\(R^2\)) and the analysis of variance (ANOVA) table to deepen our understanding of the R output.

1.2 A brief reminder from Lab 7: The summary() function

Let's take a look at the summary() function again. Think of the HSwrestler data set, where we try to predict the hydrostatic fat (HWFAT) measurement of a wrestler from abdominal fat (ABS) and triceps fat (TRICEPS) measurements.

library(PASWR)
data(HSwrestler)
Model1 <- lm(HWFAT ~ ABS + TRICEPS, data = HSwrestler)

The summary function gives the following output:

summary(Model1)
## 
## Call:
## lm(formula = HWFAT ~ ABS + TRICEPS, data = HSwrestler)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5558 -2.2550 -0.5245  2.3365  9.4957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.05904    0.65219   3.157   0.0023 ** 
## ABS          0.33708    0.06415   5.255 1.34e-06 ***
## TRICEPS      0.50430    0.09920   5.084 2.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.061 on 75 degrees of freedom
## Multiple R-squared:  0.8832, Adjusted R-squared:  0.8801 
## F-statistic: 283.6 on 2 and 75 DF,  p-value: < 2.2e-16

The above R output displays a few elements, which are briefly explained below.

Call: This shows the formula that was used in the regression model.

Residuals: This lists the five-number summary of the residuals from the regression model.

Coefficients: This shows a summary of estimated coefficients of the regression model.

Within this section the column headers are:

  • Estimate: The estimated parameter. These can be used to write down the fitted regression model.

  • Std. Error: This is the estimated standard error of the parameter estimate.

  • t value: This is the \(t\)-statistic for the parameter, calculated as Estimate / Std. Error.

  • Pr(>|t|): This is the \(p\)-value that corresponds to the \(t\)-statistic, i.e. \(2\cdot \mathbb{P}(X>|t|)\) for \(X \sim t(n-p)\), where \(t\) is the t value computed above, \(n\) is the sample size, and \(p\) is the number of parameters.
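
As a sketch, the values in this column can be reproduced from the t values and the residual degrees of freedom of Model1 (fitted above):

tab <- summary(Model1)$coefficients                  # Estimate, Std. Error, t value, Pr(>|t|)
2 * pt(abs(tab[, "t value"]), df = df.residual(Model1), lower.tail = FALSE)
tab[, "Pr(>|t|)"]                                    # should match the values computed above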

Significance codes: These symbols are appended to the \(p\)-values in the regression output. They provide a quick indication of the level of significance of the predictors in the model. The most commonly used significance levels and their corresponding codes are:

significance code    p-value
***                  < 0.001
**                   0.001 - 0.01
*                    0.01 - 0.05
.                    0.05 - 0.1
(none)               ≥ 0.1

Residual standard error: This is the square root of the mean squared error (MSE), where \(MSE=\frac{\sum_{i=1}^n \hat{\epsilon}_i^2}{n-p}\) is the sum of squared residuals divided by the residual degrees of freedom, \(n-p\).
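
As a sketch, the residual standard error reported above for Model1 can be reproduced by hand:

sqrt(sum(resid(Model1)^2) / df.residual(Model1))   # square root of SSE / (n - p)
summary(Model1)$sigma                              # the reported residual standard error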

Multiple R-squared: The proportion of variability in the data that is explained by the model. It is calculated as \(R^2 = 1-\frac{SSE}{SST} = \frac{SSR}{SST}\). Higher values of \(R^2\) indicate a better model fit.

Adjusted R-squared: This adjusts \(R^2\) for the number of predictors in the model: \(R^2_a = 1-\frac{\frac{SSE}{n-p}}{\frac{SST}{n-1}}\).
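
As a sketch, both \(R^2\) and \(R^2_a\) for Model1 can be recomputed from the sums of squares and compared with the summary() output above:

y   <- model.response(model.frame(Model1))   # the observed HWFAT values used in the fit
n   <- length(y)                             # number of observations
p   <- length(coef(Model1))                  # number of parameters (including the intercept)
SSE <- sum(resid(Model1)^2)                  # error sum of squares
SST <- sum((y - mean(y))^2)                  # total sum of squares

1 - SSE / SST                                # multiple R-squared
1 - (SSE / (n - p)) / (SST / (n - 1))        # adjusted R-squared
summary(Model1)$adj.r.squared                # should agree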

F-statistic: This is the \(F\)-statistic from the ANOVA table (\(F_\text{obs}\)), along with the regression degrees of freedom and error degrees of freedom, followed by the \(p\)-value corresponding to the \(F\)-test.