2 Example 1: GRADES dataset
Recall the GRADES dataset we explored in Labs 5 and 7. Previously we built a simple linear regression model between GPA and SAT scores and verified the usefulness of SAT in predicting GPA using a \(t\)-test. Today, we will look at the ANOVA table produced in R, test the significance of the regression model using an \(F\)-test, and finally examine the \(R^2\) to understand how well the model performs.
To open the data, type:
2.1 Constructing an ANOVA table
Let's construct an ANOVA table for the simple linear regression using the data in GRADES. To start, we first fit the model for gpa and sat:
model.lm <- lm(gpa ~ sat, data = GRADES)
Then use the anova() command to obtain the ANOVA table:
anova(model.lm)
The ANOVA table output is shown below:
## Analysis of Variance Table
##
## Response: gpa
## Df Sum Sq Mean Sq F value Pr(>F)
## sat 1 40.397 40.397 253.18 < 2.2e-16 ***
## Residuals 198 31.592 0.160
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Check if you understand everything produced in the above table. If not, ask for help!
2.2 Testing for the significance of the regression model
The \(F\)-test is used to assess if the model of interest is useful or not. In this example, since there is only one predictor, testing the significance of the model is equivalent to testing the significance of the predictor. Therefore, we would expect the conclusion from the \(F\)-test to be same with that from the \(t\)-test in Lab 7. Let's check.
TASK: Test if the regression model is significant at the \(\alpha=0.05\) significance level.
Step 1: Hypotheses
The null and alternative hypotheses for testing the significance of the regression model are: \[H_0: \beta_1 = 0\quad \text{ versus }\quad H_1: \beta_1 \neq 0\] (Unsurprisingly, the hypotheses are identical to those in Example 1 of Lab 7, as there is only one predictor.)
Step 2: Test statistic
Looking at the ANOVA table above, what is the value of test statistic for this problem?
\(F_\text{obs}=\)
\(F_\text{obs}\) corresponds to F value in the above ANOVA table. It is calculated as:
\[F_{obs} = \frac{MSR}{MSE} = \frac{40.3965}{0.1596} = 253.1824\]
Step 3: Rejection region calculation
What is the critical value in this case?
ANSWER \(=\)
Because \(F_\text{obs} \sim F_{1,198}\) and this is a one-tailed test, the rejection region is \(F_\text{obs} > F_{0.95;1,198} = 3.888853\) (value found by using the R command below).
## [1] 3.888853
Step 4: Statistical conclusion
From the rejection region, we \(H_0\) because \(F_{obs}\) is the rejection region.
From the \(p\)-value (
Pr(>|F|))in the above ANOVA table), we \(H_0\) because the value forPr(>|F|))is 0.05.
Step 5: English conclusion
There is that suggests the current regression model is significant, thus there exists a linear relationship exists between first-semester gpa and sat scores.
2.3 Computing \(R^2\)
The \(R^2\) metric is a measurement of the proportion of variability captured by the model. It is a helpful tool to assess how well our model fits the data (goodness-of-fit).
Compute \(R^2\) based on anova(model.lm) (it should be the same as the value of Multiple R-squared returned in summary(model.lm)).
\(R^2 =\)
The formula for \(R^2\) is \(R^2 = 1-\frac{SSE}{SST}\).
\(R^2 = 1-\frac{SSE}{SST} = \frac{31.592}{31.592+40.397} = 0.5612\)