3 Exercise 1: HSwrestler dataset

In Lab 5, we looked at the HSwrestler dataset from the PASWR package, which records the body fat of 78 high school wrestlers measured using three separate techniques: hydrostatic weighing, skin fold measurements and the Tanita body fat scale. We built a multiple linear regression model to understand the relationship between hydrostatic fat (HWFAT; the response variable) and two predictors, abdominal fat (ABS; predictor variable 1) and tricep fat (TRICEPS; predictor variable 2). Our question of interest today is whether the model is useful and how well it performs.

Read in the data using:

library(PASWR)
data(HSwrestler)

3.1 Constructing an ANOVA table and testing for the significance of the regression model.

Task: Perform a hypothesis test to assess whether the multiple linear regression model is significant.

To complete the task, we will need to:

  1. Use R to construct an ANOVA table for a model where hydrostatic fat level (HWFAT) is the response variable and abdominal fat (ABS) and tricep fat (TRICEPS) are the predictor variables;

  2. Manually construct an ANOVA table that combines abdominal fat (ABS) and tricep fat (TRICEPS) into a single Regression term, as in Table 1.2;

  3. Perform the \(F\)-test with \(\alpha=0.05\) significance level.

Remember to follow the steps in Example 1.

Step 1: Hypotheses

Suppose the model is of the following form: \[Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i.\]

The null and alternative hypotheses for testing the significance of the regression model are:

\[H_0: \beta_1 = \beta_2 = 0\quad \text{ versus }\quad H_1: \text{at least one }\beta_i \neq 0 \text{ for } i =1,2\]

Step 2: Test statistic

Use the function anova() to obtain the initial ANOVA table, where each predictor is listed as a separate row. Then, create the ANOVA table as in Table 1.2. Remember, you will need to combine certain rows to create the Regression component.

Once you have completed the calculation, enter your answers below for each component of the ANOVA table:

\(n =\) ____, \(p =\) ____

\(SSR =\) ____, \(SSE =\) ____, \(SST =\) ____

\(MSR =\) ____, \(MSE =\) ____

model.lm <- lm(HWFAT ~ ABS + TRICEPS, data = HSwrestler)
anova(model.lm)
## Analysis of Variance Table
## 
## Response: HWFAT
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## ABS        1 5072.8  5072.8 541.365 < 2.2e-16 ***
## TRICEPS    1  242.2   242.2  25.844 2.639e-06 ***
## Residuals 75  702.8     9.4                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova.tab <- anova(model.lm)

n <- nrow(HSwrestler)                # number of observations (78)
p <- 3                               # intercept + parameters associated with the two predictors
SSR <- sum(anova.tab$`Sum Sq`[1:2])  # regression SS: ABS and TRICEPS rows combined
SSE <- anova.tab$`Sum Sq`[3]         # residual (error) SS
SST <- SSR + SSE                     # total SS
MSR <- SSR / (p - 1)                 # regression mean square, df = p - 1
MSE <- SSE / (n - p)                 # residual mean square, df = n - p

# Print each component
SSR
## [1] 5315.008
SSE
## [1] 702.7837
SST
## [1] 6017.792
MSR
## [1] 2657.504
MSE
## [1] 9.370449
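
As a rough sketch of the combined table, the components above can be arranged into a single data frame. The values are copied from the output above; the column layout (Source, Df, SS, MS, F_value) and the name `combined` are assumptions for illustration, since Table 1.2 is not reproduced in this section.

```r
# Values copied from the output above (not recomputed from the data)
n   <- 78         # number of wrestlers
p   <- 3          # intercept + two slopes
SSR <- 5315.008   # ABS and TRICEPS rows combined
SSE <- 702.7837   # Residuals row

# Hypothetical layout for the combined ANOVA table
combined <- data.frame(
  Source  = c("Regression", "Residual", "Total"),
  Df      = c(p - 1, n - p, n - 1),
  SS      = c(SSR, SSE, SSR + SSE),
  MS      = c(SSR / (p - 1), SSE / (n - p), NA),
  F_value = c((SSR / (p - 1)) / (SSE / (n - p)), NA, NA)
)
print(combined)
```

The Regression and Residual degrees of freedom add up to the Total degrees of freedom (\(n - 1\)), which is a quick check that the rows were combined correctly.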

Now calculate the value of the test statistic \(F_\text{obs}\).

\(F_\text{obs} =\) ____

The test statistic is given by \(F_\text{obs} = \frac{MSR}{MSE}\).

\(F_\text{obs} = \frac{MSR}{MSE} = \frac{2657.504}{9.370449} = 283.6043\)


Step 3: Rejection region calculation

Now calculate the critical value \(F_{1-\alpha;p-1,n-p}\) to define the rejection region. Remember that \(\alpha = 0.05\).

\(F_{1-\alpha;p-1,n-p} =\) ____

Use the function qf() to obtain the quantile of the \(F\)-distribution.

qf(0.95, 2, 75)
## [1] 3.118642


Step 4: Statistical conclusion

Using the rejection region approach, we reject \(H_0\) because \(F_\text{obs} = 283.6043\) is in the rejection region (it exceeds the critical value \(3.118642\)).
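
The comparison in Step 4 can be checked numerically. The sketch below uses the value of \(F_\text{obs}\) from Step 2 and also computes the corresponding p-value with pf() as a cross-check; the p-value approach is an addition to the rejection region approach used here, and the variable names are illustrative.

```r
# Inputs copied from Steps 2 and 3
F_obs <- 283.6043   # observed test statistic
alpha <- 0.05       # significance level
df1   <- 2          # p - 1
df2   <- 75         # n - p

# Rejection region: F_obs greater than the (1 - alpha) quantile of F(df1, df2)
F_crit <- qf(1 - alpha, df1, df2)
reject <- F_obs > F_crit

# Equivalent decision via the p-value (upper tail probability)
p_value <- pf(F_obs, df1, df2, lower.tail = FALSE)

cat("Critical value:", F_crit, "\n")
cat("Reject H0 (rejection region):", reject, "\n")
cat("Reject H0 (p-value < alpha):", p_value < alpha, "\n")
```

Both rules always lead to the same decision, because \(F_\text{obs}\) exceeds the critical value exactly when the upper tail probability is below \(\alpha\).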

Step 5: English conclusion

There is strong evidence to suggest that at least one of the predictors (ABS and TRICEPS) has a linear relationship with the response variable (HWFAT).

3.2 Computing \(R^2\)

Compute \(R^2\) based on anova(model.lm).

\(R^2 =\) ____

The formula for \(R^2\) is \(R^2 = 1-\frac{SSE}{SST}\)

\(R^2 = 1-\frac{SSE}{SST} = 1-\frac{702.7837}{6017.792} \approx 0.8832\)
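
The same calculation can be sketched in R using the sums of squares from Section 3.1. Because \(SST = SSR + SSE\), the ratio \(SSR/SST\) gives the same value, which serves as a check; the variable names below are illustrative.

```r
# Sums of squares copied from Section 3.1
SSR <- 5315.008
SSE <- 702.7837
SST <- SSR + SSE

# Two equivalent forms of the coefficient of determination
r2_from_sse <- 1 - SSE / SST
r2_from_ssr <- SSR / SST

print(round(r2_from_sse, 4))
print(round(r2_from_ssr, 4))
```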

Again, from the summary table output, the value of the coefficient of determination, \(R^2\), is about 88.32%. This gives us the proportion of variation in HWFAT that is explained by the linear regression model with ABS and TRICEPS as predictors. That is, about 88.32% of the variation in hydrostatic fat is explained by taking into account abdominal fat and tricep fat using a multiple linear regression model. The model therefore gives a good fit to the data.

The adjusted coefficient of determination, \(R^2_a\), is also useful in examining model fit and can be obtained from the summary table output. In this case, \(R^2_a\) is about 88.01%, which is slightly lower than the \(R^2\) value. This is reasonable since \(R^2_a\) includes a penalty for including more predictors in the model.
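
For reference, \(R^2_a\) can also be computed by hand from the ANOVA components. The sketch below assumes the usual definition \(R^2_a = 1-\frac{SSE/(n-p)}{SST/(n-1)}\), where \(p\) counts the intercept as in Section 3.1; the variable names are illustrative.

```r
# Components copied from Section 3.1
n   <- 78
p   <- 3          # intercept + two slopes
SSE <- 702.7837
SST <- 6017.792

# Ordinary and adjusted coefficients of determination
r2     <- 1 - SSE / SST
r2_adj <- 1 - (SSE / (n - p)) / (SST / (n - 1))

print(round(r2, 4))
print(round(r2_adj, 4))
```

The adjustment replaces each sum of squares by its mean square, so \(R^2_a\) can decrease when a predictor that explains little variation is added, whereas \(R^2\) never decreases.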

Which of the two statistics, \(R^2\) and \(R^2_a\), is more appropriate to assess the model's goodness of fit?