RA tutorial
Additional Material from Professor Zhao
Error correction and material supplement in the textbook
Error 1: Translocation of X and Y in the plot
reg.data <- read.table("regression.txt",header=T)
attach(reg.data)
names(reg.data)## [1] "growth" "tannin"
model <- lm(growth~tannin)Pay attention that growth is the response variable and tannin is the explanatory variable. The right plot is for correction.
par(mfrow=c(1,2))
plot(reg.data,main="Explanatory against Response (wrong plot) ")
abline(model,col="blue")
plot(reg.data$tannin,reg.data$growth, main="Response against Explanatory (right plot)")
abline(model,col="blue")Error 2: PCT in the part of Jackknife stands for PCI
Additionally, Jackknife deals with the skewness issue in the bootstrap distribution. For the percentile confidence interval (PCI), the endpoints for PCI is adjusted to right if it is positively shewed. Vice versa. The endpoints for PCI is adjusted to left if it is negatively shewed.
Supplement materials and websites for further study
Supplement 1: use ggplot2 package to automatically generate the 95% confident interval for prediction
library(ggplot2)
ggplot(reg.data, aes(tannin, growth)) +
geom_point() +
geom_smooth(method = 'lm')## `geom_smooth()` using formula 'y ~ x'
Supplement 2: Combine multiple figures using par function and assign letters for figure caption.
regdat <- read.table("regdat.txt",header=T)
attach(regdat)
model <- lm(response~explanatory)
par(mfrow = c(2,3))
plot(model, which = 1:6, caption = LETTERS[1:6])Additional websites for further learning and understanding on the topics mentioned.
How to interpret the diagnostic plots for regression analysis
Available at: https://data.library.virginia.edu/diagnostic-plots/
Standardized residuals
\[R_{st} = Res / sd(Res)\]
Available at:
https://www.isixsigma.com/dictionary/standardized-residual/
http://www.r-tutor.com/elementary-statistics/simple-linear-regression/standardized-residual
Leverage
An observation is considered to have high leverage if it has a value (or values) for the predictor variables that are much more extreme compared to the rest of the observations in the dataset.
Available at: https://www.statology.org/leverage-in-r/
Cook’s distance
it measures how much all of the fitted values in the model change when a data point is deleted.
Available at:
https://www.statology.org/how-to-identify-influential-data-points-using-cooks-distance/
Excercises for Regression Analysis
Excercise 1
trees is a built-in dataset in R. Try to build up possible multi-factor regressions of Volume against Girth and Height, and determine the best one through regression analysis. Choose methods we have learned today, whicht you think fit in this case most.
Solution for Excercise 1
Read the data first.
trees <- treesFind the plot out scatter plots of Volume against Girth and Height against Height. Try to find out the correlations separately.
par(mfrow=c(1,2))
plot(Volume~Girth,data = trees)
plot(Volume~Height,data = trees)We can notice that the correlation between Volume and Girth is likely to be straight fitting, while the correlation between Volume and Height is difficult to determine.
Let’s call the Volume as y, the Girth as x1 and the Height as x2.
y <- trees$Volume
x1 <- trees$Girth
x2 <- trees$HeightWe first try \(Y = β~0~ + β~1~X~1~ + β~2~X~2~ + β~3~X~2~^2^ + β~4~X~1~X~2~\)
lm1 <- lm(y ~ x1*x2 + I(x2^2))
summary(lm1)##
## Call:
## lm(formula = y ~ x1 * x2 + I(x2^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8437 -1.0913 0.3604 1.5254 4.5031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.193874 62.958066 0.495 0.62443
## x1 -6.708175 2.335613 -2.872 0.00801 **
## x2 -0.123276 1.814453 -0.068 0.94635
## I(x2^2) -0.008744 0.013313 -0.657 0.51710
## x1:x2 0.145602 0.029747 4.895 4.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.738 on 26 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.9723
## F-statistic: 263.9 on 4 and 26 DF, p-value: < 2.2e-16
We can see although this model passes the t-test and F-test, the p-value of some of the parameters are bigger than 0.05, which means these parameters probably do not exist. Let’s use the step function to optimize the model.
tstep <- step(lm1)## Start: AIC=66.98
## y ~ x1 * x2 + I(x2^2)
##
## Df Sum of Sq RSS AIC
## - I(x2^2) 1 3.233 198.08 65.495
## <none> 194.85 66.985
## - x1:x2 1 179.545 374.39 85.231
##
## Step: AIC=65.49
## y ~ x1 + x2 + x1:x2
##
## Df Sum of Sq RSS AIC
## <none> 198.08 65.495
## - x1:x2 1 223.84 421.92 86.936
It shows that after deleting the β3X22 the AIC decreased most, so the new model should be \(Y = β~0~ + β~1~X~1~ + β~2~X~2~ + β~3~X~1~X~2~\)
lm2 <- lm(y ~ x1 + x2 + x1:x2)
summary(lm2)##
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5821 -1.0673 0.3026 1.5641 4.6649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.39632 23.83575 2.911 0.00713 **
## x1 -5.85585 1.92134 -3.048 0.00511 **
## x2 -1.29708 0.30984 -4.186 0.00027 ***
## x1:x2 0.13465 0.02438 5.524 7.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.709 on 27 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9728
## F-statistic: 359.3 on 3 and 27 DF, p-value: < 2.2e-16
We can see this model passes the t-test and F-test, and the p-value of all the parameters are smaller than 0.05. Next we plot it out to further check the model.
par(mfrow=c(2,2))
plot(lm2)According to the Q-Q plot, we can see quite number of points do not fall on the line, which means this model is still not good enough. For the next step, I choose to build a new model with log. The new model will be like \(Y = β~0~ + β~1~log(X~1~) + β~2~log(X~2~) + β~3~log(X~1~X~2~)\)
lm3 <- lm(log(y)~log(x1)*log(x2))
summary(lm3)##
## Call:
## lm(formula = log(y) ~ log(x1) * log(x2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.165941 -0.048613 0.006384 0.062204 0.132295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.6869 7.6996 -0.479 0.636
## log(x1) 0.7942 3.0910 0.257 0.799
## log(x2) 0.4377 1.7788 0.246 0.808
## log(x1):log(x2) 0.2740 0.7124 0.385 0.704
##
## Residual standard error: 0.08265 on 27 degrees of freedom
## Multiple R-squared: 0.9778, Adjusted R-squared: 0.9753
## F-statistic: 396.4 on 3 and 27 DF, p-value: < 2.2e-16
We can see that though the residual standard error looks very good and the t-test and F-test are passed, the significance of all parameters are not significant. Still use step to this model to improve it.
tstep <- step(lm3)## Start: AIC=-150.85
## log(y) ~ log(x1) * log(x2)
##
## Df Sum of Sq RSS AIC
## - log(x1):log(x2) 1 0.0010105 0.18546 -152.69
## <none> 0.18445 -150.85
##
## Step: AIC=-152.69
## log(y) ~ log(x1) + log(x2)
##
## Df Sum of Sq RSS AIC
## <none> 0.1855 -152.685
## - log(x2) 1 0.1978 0.3832 -132.185
## - log(x1) 1 4.6275 4.8130 -53.743
It shows that after deleting β3log(X1X2) the AIC decreases the most. As the result the model will be change to \(Y = β~0~ + β~1~log(X~1~)+ β~2~log(X~2~)\)
lm4 <- lm(log(y)~log(x1)+log(x2))
summary(lm4)##
## Call:
## lm(formula = log(y) ~ log(x1) + log(x2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168561 -0.048488 0.002431 0.063637 0.129223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.63162 0.79979 -8.292 5.06e-09 ***
## log(x1) 1.98265 0.07501 26.432 < 2e-16 ***
## log(x2) 1.11712 0.20444 5.464 7.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
## F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
It passes the t-test and F-test, also all the parameters have p-values smaller than 0.05, so the optimization is successful. Next plot it to see if it is good enough.
par(mfrow=c(2,2))
plot(lm4)All the plots are acceptable for a good regression model, so the possible regression model in this exercise should be \(Y = β~0~ +β~1~log(X~1~) + β~2~log(X~2~)\)
Exercise 2
Use the ozone.data to perform a regression diagnosis using the strategy we have learned in this lesson. We want to figure out the relationship between the response variable ozone level with the three explanatory variables, namely radiation, temperature, and wind.
To give some insights for regression analysis, you can first have a general overview of the data using gam that we have learned in Week 8 and then use the diagnostic methods such as bootstrap, jackknife, diagnostic plot, and robust regression that we newly learned in this lesson. The outcome of this exercise is to generate a satisfying multiple regression model.
Solution for Exercise 2
ozone.pollution <- read.table("ozone.data.txt",header=T)
attach(ozone.pollution)
names(ozone.pollution)## [1] "rad" "temp" "wind" "ozone"
summary(ozone.pollution)## rad temp wind ozone
## Min. : 7.0 Min. :57.00 Min. : 2.300 Min. : 1.0
## 1st Qu.:113.5 1st Qu.:71.00 1st Qu.: 7.400 1st Qu.: 18.0
## Median :207.0 Median :79.00 Median : 9.700 Median : 31.0
## Mean :184.8 Mean :77.79 Mean : 9.939 Mean : 42.1
## 3rd Qu.:255.5 3rd Qu.:84.50 3rd Qu.:11.500 3rd Qu.: 62.0
## Max. :334.0 Max. :97.00 Max. :20.700 Max. :168.0
Question: How is ozone concentration related to wind speed, air temperature and the intensity of solar radiation?
- Check the relationship between the variables by correlation.
We use pairs function to look at all the correlation with pair-wise scatter plots between each variable pair.
pairs(ozone.pollution,panel=panel.smooth)Alternatively, ggparis function can also generate similar result, which also contains the density plots of the continuous variables and the correlation coefficient between each variable pair.
library(ggplot2)
library(GGally)## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(ozone.pollution)Besides, if you are interested specifically in the correlation coefficient,corrplot function can be applied to generate a visualized correlation matrix.
library(corrplot)## corrplot 0.92 loaded
cor <- cor(ozone.pollution)
corrplot(cor,method="shade",type="full",addCoef.col = "black")As a result, there is a strong negative correlation between ozone and wind (-0.61), a strong positive correlation between ozone and temp (0.7) and between wind and temp (0.5).
As learned in Week 8, we can build the non-parametric smoothers in a generalized additive model GAM like this:
library(mgcv)## 载入需要的程辑包:nlme
## This is mgcv 1.8-39. For overview type 'help("mgcv-package")'.
library(nlme)
par(mfrow=c(2,2))
model <- gam(ozone.pollution$ozone~s(rad)+s(temp)+s(wind))
plot(model)- Multiple linear regression diagnosis
We start with the most complicated model: this includes curvature terms for each variable, all three two-way interactions and a three-way interaction:
attach(ozone.pollution)## The following objects are masked from ozone.pollution (pos = 7):
##
## ozone, rad, temp, wind
w2 <- wind^2
t2 <- temp^2
r2 <- rad^2
tw <- temp*wind
wr <- wind*rad
tr <- temp*rad
wtr <- wind*temp*rad
model1 <- lm(ozone.pollution$ozone~rad+temp+wind+t2+w2+r2+wr+tr+tw+wtr)
summary(model1)##
## Call:
## lm(formula = ozone.pollution$ozone ~ rad + temp + wind + t2 +
## w2 + r2 + wr + tr + tw + wtr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.894 -11.205 -2.736 8.809 70.551
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **
## rad -3.117e-01 5.585e-01 -0.558 0.57799
## temp -1.076e+01 4.303e+00 -2.501 0.01401 *
## wind -3.237e+01 1.173e+01 -2.760 0.00687 **
## t2 5.833e-02 2.396e-02 2.435 0.01668 *
## w2 6.106e-01 1.469e-01 4.157 6.81e-05 ***
## r2 -3.619e-04 2.573e-04 -1.407 0.16265
## wr 2.054e-02 4.892e-02 0.420 0.67552
## tr 8.403e-03 7.512e-03 1.119 0.26602
## tw 2.377e-01 1.367e-01 1.739 0.08519 .
## wtr -4.324e-04 6.595e-04 -0.656 0.51358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.82 on 100 degrees of freedom
## Multiple R-squared: 0.7394, Adjusted R-squared: 0.7133
## F-statistic: 28.37 on 10 and 100 DF, p-value: < 2.2e-16
#conduct the ‘p values on deletion’
model2 <- update(model1,~.-wtr)
summary(model2)##
## Call:
## lm(formula = ozone.pollution$ozone ~ rad + temp + wind + t2 +
## w2 + r2 + wr + tr + tw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.611 -11.455 -2.901 8.548 70.325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.245e+02 1.957e+02 2.680 0.0086 **
## rad 2.628e-02 2.142e-01 0.123 0.9026
## temp -1.021e+01 4.209e+00 -2.427 0.0170 *
## wind -2.802e+01 9.645e+00 -2.906 0.0045 **
## t2 5.953e-02 2.382e-02 2.499 0.0141 *
## w2 6.173e-01 1.461e-01 4.225 5.25e-05 ***
## r2 -3.388e-04 2.541e-04 -1.333 0.1855
## wr -1.127e-02 6.277e-03 -1.795 0.0756 .
## tr 3.750e-03 2.459e-03 1.525 0.1303
## tw 1.734e-01 9.497e-02 1.825 0.0709 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.77 on 101 degrees of freedom
## Multiple R-squared: 0.7383, Adjusted R-squared: 0.715
## F-statistic: 31.66 on 9 and 101 DF, p-value: < 2.2e-16
model3 <- update(model2,~.-r2)
summary(model3)##
## Call:
## lm(formula = ozone.pollution$ozone ~ rad + temp + wind + t2 +
## w2 + wr + tr + tw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.188 -11.387 -1.500 8.752 71.289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 486.346603 194.333075 2.503 0.01392 *
## rad -0.043163 0.208535 -0.207 0.83644
## temp -9.446780 4.185240 -2.257 0.02613 *
## wind -26.471461 9.610816 -2.754 0.00697 **
## t2 0.056966 0.023835 2.390 0.01868 *
## w2 0.599709 0.146069 4.106 8.14e-05 ***
## wr -0.011359 0.006300 -1.803 0.07435 .
## tr 0.003160 0.002428 1.302 0.19600
## tw 0.157637 0.094595 1.666 0.09869 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.83 on 102 degrees of freedom
## Multiple R-squared: 0.7337, Adjusted R-squared: 0.7128
## F-statistic: 35.12 on 8 and 102 DF, p-value: < 2.2e-16
model4 <- update(model3,~.-tr)
summary(model4)##
## Call:
## lm(formula = ozone.pollution$ozone ~ rad + temp + wind + t2 +
## w2 + wr + tw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.379 -11.375 -2.217 8.921 71.247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 514.401470 193.783580 2.655 0.00920 **
## rad 0.212945 0.069283 3.074 0.00271 **
## temp -10.654041 4.094889 -2.602 0.01064 *
## wind -27.391965 9.616998 -2.848 0.00531 **
## t2 0.067805 0.022408 3.026 0.00313 **
## w2 0.619396 0.145773 4.249 4.72e-05 ***
## wr -0.013561 0.006089 -2.227 0.02813 *
## tw 0.169674 0.094458 1.796 0.07538 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.89 on 103 degrees of freedom
## Multiple R-squared: 0.7292, Adjusted R-squared: 0.7108
## F-statistic: 39.63 on 7 and 103 DF, p-value: < 2.2e-16
model5 <- update(model4,~.-tw)
summary(model5)##
## Call:
## lm(formula = ozone.pollution$ozone ~ rad + temp + wind + t2 +
## w2 + wr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.478 -10.735 -2.437 9.685 77.543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 223.573855 107.618223 2.077 0.040221 *
## rad 0.173431 0.066398 2.612 0.010333 *
## temp -5.197139 2.775039 -1.873 0.063902 .
## wind -10.816032 2.736757 -3.952 0.000141 ***
## t2 0.043640 0.018112 2.410 0.017731 *
## w2 0.430059 0.101767 4.226 5.12e-05 ***
## wr -0.009819 0.005783 -1.698 0.092507 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.08 on 104 degrees of freedom
## Multiple R-squared: 0.7208, Adjusted R-squared: 0.7047
## F-statistic: 44.74 on 6 and 104 DF, p-value: < 2.2e-16
model6 <- update(model5,~.-wr)
summary(model6)##
## Call:
## lm(formula = ozone.pollution$ozone ~ rad + temp + wind + t2 +
## w2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.044 -10.796 -4.138 8.131 80.098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 291.16758 100.87723 2.886 0.00473 **
## rad 0.06586 0.02005 3.285 0.00139 **
## temp -6.33955 2.71627 -2.334 0.02150 *
## wind -13.39674 2.29623 -5.834 6.05e-08 ***
## t2 0.05102 0.01774 2.876 0.00488 **
## w2 0.46464 0.10060 4.619 1.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.25 on 105 degrees of freedom
## Multiple R-squared: 0.713, Adjusted R-squared: 0.6994
## F-statistic: 52.18 on 5 and 105 DF, p-value: < 2.2e-16
Here is a summary table Table 1 to illustrate the model elements, significance score, and the adjusted R-squared value of each model for simplicity.
| Model_Name | Model_element | Significance_score | Adj.R.squared |
|---|---|---|---|
| md 1 | rad, temp, wind, t2, w2, r2, wr, tr, tw, wtr | (Intercept) **, temp *, wind **, t2 *, w2 ***, tw · |
0.7133 |
| md 2 | rad, temp, wind, t2, w2, r2, wr, tr, tw | (Intercept) **, temp *, wind **, t2 *, w2 *** , wr ·, tw · |
0.7150 |
| md 3 | rad, temp, wind, t2, w2, wr, tr, tw | (Intercept) **, temp *, wind **, t2 *, w2 ***, wr ·, tw · |
0.7128 |
| md 4 | rad, temp, wind, t2, w2, wr, tw | (Intercept) *, temp *, wind **, t2 *, w2 ***, wr ·, tw · |
0.7108 |
| md 5 | rad, temp, wind, t2, w2, wr | (Intercept) **, rad **, temp *, wind **, t2 **, w2 ***, wr *, tw · |
0.7047 |
| md 6 | rad, temp, wind, t2, w2 | (Intercept) **, rad **, temp *, wind ***, t2 *, w2 *** |
0.6994 |
At this moment, model 6 is the best one. So we call the regression diagnosis plots to conduct the model check
par(mfrow=c(2,3))
plot(model6,which = 1:6, caption = LETTERS[1:6])As a result, some regression diagnosis should be performed. First, the plot of Residuals v.s. Fitted (Figure A) looks not good since some patterns is shown. As the fitted value (y hat) becomes smaller, the residuals (y-y hat) starts to aggregate, which violates the assumption of “constancy of variance”. Second, the QQ plot (Figure B) shows that the line is not straight but a semi-S curve, which violates the assumption of “normality of error”. The Cook’s distance (Figure D) reminds us to pay attention to the influential point 77th and outlier point of 85th and 23th.
Let us try transforming the response variable with log to let data distribution scatter.
model7 <- lm(log(ozone.pollution$ozone) ~ rad+temp+wind+t2+w2+r2+wr+tr+tw+wtr)
summary(model7)##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + temp + wind +
## t2 + w2 + r2 + wr + tr + tw + wtr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91943 -0.24169 -0.01742 0.28213 1.11802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.803e+00 5.676e+00 0.494 0.6225
## rad 2.771e-02 1.529e-02 1.812 0.0729 .
## temp -3.018e-02 1.178e-01 -0.256 0.7983
## wind -9.812e-02 3.211e-01 -0.306 0.7605
## t2 6.034e-04 6.559e-04 0.920 0.3598
## w2 8.732e-03 4.021e-03 2.172 0.0322 *
## r2 -1.489e-05 7.043e-06 -2.114 0.0370 *
## wr -2.001e-03 1.339e-03 -1.494 0.1382
## tr -2.507e-04 2.056e-04 -1.219 0.2256
## tw -1.985e-03 3.742e-03 -0.530 0.5971
## wtr 2.535e-05 1.805e-05 1.404 0.1634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4877 on 100 degrees of freedom
## Multiple R-squared: 0.7116, Adjusted R-squared: 0.6827
## F-statistic: 24.67 on 10 and 100 DF, p-value: < 2.2e-16
model8 <- update(model7,~.-wtr)
summary(model8)##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + temp + wind +
## t2 + w2 + r2 + wr + tr + tw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.99582 -0.24838 -0.04271 0.32080 1.07835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.373e+00 5.398e+00 0.995 0.3219
## rad 7.896e-03 5.908e-03 1.336 0.1844
## temp -6.230e-02 1.161e-01 -0.537 0.5927
## wind -3.531e-01 2.660e-01 -1.327 0.1874
## t2 5.332e-04 6.571e-04 0.811 0.4191
## w2 8.340e-03 4.030e-03 2.069 0.0411 *
## r2 -1.624e-05 7.010e-06 -2.317 0.0225 *
## wr -1.368e-04 1.731e-04 -0.790 0.4313
## tr 2.195e-05 6.783e-05 0.324 0.7469
## tw 1.784e-03 2.620e-03 0.681 0.4975
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4901 on 101 degrees of freedom
## Multiple R-squared: 0.7059, Adjusted R-squared: 0.6797
## F-statistic: 26.93 on 9 and 101 DF, p-value: < 2.2e-16
model9 <- update(model8,~.-tr)
summary(model9)##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + temp + wind +
## t2 + w2 + r2 + wr + tw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.96263 -0.24298 -0.04081 0.31953 1.09081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.516e+00 5.357e+00 1.030 0.30558
## rad 9.533e-03 3.036e-03 3.140 0.00221 **
## temp -6.949e-02 1.134e-01 -0.613 0.54157
## wind -3.574e-01 2.645e-01 -1.351 0.17966
## t2 6.029e-04 6.180e-04 0.976 0.33160
## w2 8.451e-03 3.998e-03 2.114 0.03697 *
## r2 -1.584e-05 6.865e-06 -2.307 0.02310 *
## wr -1.517e-04 1.662e-04 -0.913 0.36341
## tw 1.846e-03 2.601e-03 0.710 0.47956
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4879 on 102 degrees of freedom
## Multiple R-squared: 0.7056, Adjusted R-squared: 0.6825
## F-statistic: 30.56 on 8 and 102 DF, p-value: < 2.2e-16
model10 <- update(model9,~.-tw)
summary(model10)##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + temp + wind +
## t2 + w2 + r2 + wr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.89186 -0.26391 -0.03075 0.33076 1.09627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.326e+00 2.907e+00 0.800 0.42544
## rad 8.875e-03 2.884e-03 3.077 0.00268 **
## temp -9.290e-03 7.515e-02 -0.124 0.90185
## wind -1.772e-01 7.366e-02 -2.405 0.01795 *
## t2 3.360e-04 4.892e-04 0.687 0.49375
## w2 6.389e-03 2.739e-03 2.333 0.02162 *
## r2 -1.515e-05 6.781e-06 -2.235 0.02761 *
## wr -1.112e-04 1.557e-04 -0.714 0.47676
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4867 on 103 degrees of freedom
## Multiple R-squared: 0.7041, Adjusted R-squared: 0.684
## F-statistic: 35.02 on 7 and 103 DF, p-value: < 2.2e-16
model11 <- update(model10,~.-t2)
summary(model11)##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + temp + wind +
## w2 + r2 + wr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82031 -0.25479 -0.02779 0.33595 1.15024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.989e-01 7.571e-01 0.527 0.59936
## rad 8.996e-03 2.871e-03 3.133 0.00225 **
## temp 4.214e-02 6.246e-03 6.746 8.79e-10 ***
## wind -1.816e-01 7.320e-02 -2.481 0.01472 *
## w2 6.758e-03 2.679e-03 2.523 0.01316 *
## r2 -1.477e-05 6.740e-06 -2.191 0.03071 *
## wr -1.368e-04 1.507e-04 -0.908 0.36615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4855 on 104 degrees of freedom
## Multiple R-squared: 0.7028, Adjusted R-squared: 0.6856
## F-statistic: 40.98 on 6 and 104 DF, p-value: < 2.2e-16
model12 <- update(model11,~.-wr)
summary(model12)##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + temp + wind +
## w2 + r2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.85551 -0.25578 0.00248 0.31349 1.16251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.724e-01 6.350e-01 1.216 0.226543
## rad 7.466e-03 2.323e-03 3.215 0.001736 **
## temp 4.193e-02 6.237e-03 6.723 9.52e-10 ***
## wind -2.211e-01 5.874e-02 -3.765 0.000275 ***
## w2 7.390e-03 2.585e-03 2.859 0.005126 **
## r2 -1.470e-05 6.734e-06 -2.183 0.031246 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4851 on 105 degrees of freedom
## Multiple R-squared: 0.7004, Adjusted R-squared: 0.6861
## F-statistic: 49.1 on 5 and 105 DF, p-value: < 2.2e-16
Here is a summary table Table 2 to illustrate the model elements, significance score, and the adjusted R-squared value of each model for simplicity.
| Model_Name | Model_element | Significance_score | Adj.R.squared |
|---|---|---|---|
| md 7 | rad, temp, wind, t2, w2, r2, wr, tr, tw, wtr | temp ·, w2 *, r2 * |
0.6827 |
| md 8 | rad, temp, wind, t2, w2, r2, wr, tr, tw | w2 *, r2 * |
0.6797 |
| md 9 | rad, temp, wind, t2, w2, r2, wr, tw | rad **, w2 *, r2 * |
0.6825 |
| md 10 | rad, temp, wind, t2, w2, r2 wr | rad **, wind *, w2 *, r2 * |
0.6840 |
| md 11 | rad, temp, wind, r2, w2, wr | rad **, temp ***, wind *, w2 *, r2 * |
0.6856 |
| md 12 | rad, temp, wind, r2, w2 | rad **, temp ***, wind ***, w2 **, r2 * |
0.6861 |
par(mfrow=c(2,3))
plot(model12,which = 1:6, caption = LETTERS[1:6])Pay attention that sometimes the step function is not reliable but some kind arbitrary when performing the model reduction based only on the AIC minimization. We try to test on the model1 and model 7, which evidence is summarized in the Table 3.
step(model1)## Start: AIC=649.8
## ozone.pollution$ozone ~ rad + temp + wind + t2 + w2 + r2 + wr +
## tr + tw + wtr
##
## Df Sum of Sq RSS AIC
## - wr 1 55.9 31798 648.00
## - rad 1 98.9 31841 648.15
## - wtr 1 136.4 31879 648.28
## - tr 1 397.1 32139 649.18
## <none> 31742 649.80
## - r2 1 628.0 32370 649.98
## - tw 1 959.4 32702 651.11
## - t2 1 1881.5 33624 654.19
## - temp 1 1985.4 33728 654.54
## - wind 1 2418.2 34160 655.95
## - w2 1 5486.1 37228 665.50
##
## Step: AIC=648
## ozone.pollution$ozone ~ rad + temp + wind + t2 + w2 + r2 + tr +
## tw + wtr
##
## Df Sum of Sq RSS AIC
## - rad 1 74.7 31873 646.26
## <none> 31798 648.00
## - r2 1 588.7 32387 648.03
## - wtr 1 1097.6 32896 649.76
## - tw 1 1246.7 33045 650.27
## - tr 1 1626.1 33424 651.53
## - temp 1 1929.5 33728 652.54
## - t2 1 1956.6 33755 652.63
## - wind 1 2878.3 34676 655.62
## - w2 1 5624.8 37423 664.08
##
## Step: AIC=646.26
## ozone.pollution$ozone ~ temp + wind + t2 + w2 + r2 + tr + tw +
## wtr
##
## Df Sum of Sq RSS AIC
## <none> 31873 646.26
## - r2 1 780.0 32653 646.94
## - wtr 1 1341.7 33215 648.83
## - tw 1 1372.8 33246 648.94
## - temp 1 2282.2 34155 651.93
## - t2 1 2565.4 34438 652.85
## - wind 1 3036.1 34909 654.36
## - tr 1 3295.2 35168 655.18
## - w2 1 5861.6 37734 663.00
##
## Call:
## lm(formula = ozone.pollution$ozone ~ temp + wind + t2 + w2 +
## r2 + tr + tw + wtr)
##
## Coefficients:
## (Intercept) temp wind t2 w2 r2
## 5.569e+02 -1.098e+01 -3.029e+01 6.337e-02 6.244e-01 -3.826e-04
## tr tw wtr
## 4.468e-03 2.060e-01 -1.685e-04
model1_step_op <- lm(ozone.pollution$ozone~temp+wind+t2+w2+r2+tr+tw+wtr)
summary(model1_step_op)##
## Call:
## lm(formula = ozone.pollution$ozone ~ temp + wind + t2 + w2 +
## r2 + tr + tw + wtr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.008 -11.439 -3.136 8.385 70.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.569e+02 1.934e+02 2.879 0.00486 **
## temp -1.098e+01 4.062e+00 -2.703 0.00806 **
## wind -3.029e+01 9.718e+00 -3.117 0.00237 **
## t2 6.337e-02 2.212e-02 2.865 0.00506 **
## w2 6.244e-01 1.442e-01 4.331 3.48e-05 ***
## r2 -3.826e-04 2.422e-04 -1.580 0.11722
## tr 4.468e-03 1.376e-03 3.247 0.00158 **
## tw 2.060e-01 9.829e-02 2.096 0.03856 *
## wtr -1.685e-04 8.132e-05 -2.072 0.04078 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.68 on 102 degrees of freedom
## Multiple R-squared: 0.7383, Adjusted R-squared: 0.7178
## F-statistic: 35.97 on 8 and 102 DF, p-value: < 2.2e-16
step(model7)## Start: AIC=-148.98
## log(ozone.pollution$ozone) ~ rad + temp + wind + t2 + w2 + r2 +
## wr + tr + tw + wtr
##
## Df Sum of Sq RSS AIC
## - temp 1 0.01562 23.803 -150.91
## - wind 1 0.02221 23.809 -150.88
## - tw 1 0.06690 23.854 -150.67
## - t2 1 0.20130 23.988 -150.05
## - tr 1 0.35368 24.141 -149.34
## <none> 23.787 -148.98
## - wtr 1 0.46883 24.256 -148.82
## - wr 1 0.53121 24.318 -148.53
## - rad 1 0.78131 24.568 -147.40
## - r2 1 1.06316 24.850 -146.13
## - w2 1 1.12186 24.909 -145.87
##
## Step: AIC=-150.91
## log(ozone.pollution$ozone) ~ rad + wind + t2 + w2 + r2 + wr +
## tr + tw + wtr
##
## Df Sum of Sq RSS AIC
## - wind 1 0.00766 23.810 -152.88
## - tw 1 0.19954 24.002 -151.98
## - tr 1 0.37762 24.180 -151.16
## <none> 23.803 -150.91
## - wtr 1 0.52236 24.325 -150.50
## - wr 1 0.58077 24.383 -150.24
## - rad 1 0.80362 24.606 -149.22
## - t2 1 0.88137 24.684 -148.88
## - r2 1 1.04911 24.852 -148.12
## - w2 1 1.25401 25.057 -147.21
##
## Step: AIC=-152.88
## log(ozone.pollution$ozone) ~ rad + t2 + w2 + r2 + wr + tr + tw +
## wtr
##
## Df Sum of Sq RSS AIC
## <none> 23.810 -152.88
## - tr 1 0.6953 24.506 -151.68
## - wtr 1 0.9420 24.752 -150.57
## - wr 1 1.0036 24.814 -150.29
## - r2 1 1.0415 24.852 -150.12
## - rad 1 1.3407 25.151 -148.79
## - w2 1 1.6988 25.509 -147.22
## - tw 1 1.9332 25.744 -146.21
## - t2 1 4.0503 27.860 -137.44
##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + t2 + w2 + r2 +
## wr + tr + tw + wtr)
##
## Coefficients:
## (Intercept) rad t2 w2 r2 wr
## 1.159e+00 2.959e-02 4.821e-04 7.906e-03 -1.448e-05 -2.199e-03
## tr tw wtr
## -2.792e-04 -3.080e-03 2.816e-05
model7_step_op <- lm(log(ozone.pollution$ozone) ~ rad + t2 + w2 + r2 + wr + tr + tw + wtr)
summary(model7_step_op)##
## Call:
## lm(formula = log(ozone.pollution$ozone) ~ rad + t2 + w2 + r2 +
## wr + tr + tw + wtr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.88083 -0.25112 -0.01786 0.28580 1.12695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.159e+00 5.547e-01 2.090 0.03915 *
## rad 2.959e-02 1.235e-02 2.397 0.01837 *
## t2 4.821e-04 1.157e-04 4.165 6.52e-05 ***
## w2 7.906e-03 2.931e-03 2.698 0.00817 **
## r2 -1.448e-05 6.857e-06 -2.112 0.03710 *
## wr -2.199e-03 1.060e-03 -2.073 0.04065 *
## tr -2.792e-04 1.617e-04 -1.726 0.08740 .
## tw -3.080e-03 1.070e-03 -2.878 0.00488 **
## wtr 2.816e-05 1.402e-05 2.009 0.04719 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4831 on 102 degrees of freedom
## Multiple R-squared: 0.7113, Adjusted R-squared: 0.6886
## F-statistic: 31.41 on 8 and 102 DF, p-value: < 2.2e-16
| Model_Name | Model_element | Significance_score | Adj.R.squared |
|---|---|---|---|
| omd 1 | rad, temp, wind, t2, w2 | (Intercept) **, rad **, temp *, wind ***, t2 *, w2 *** |
0.6994 |
| omd 1-auto | temp, wind, t2, w2, r2, tr, tw, wtr | (Intercept) **, temp **, wind **, t2 **, w2 ***, tr **, tw *, wtr * |
0.7178 |
| omd 7 | rad, temp, wind, t2, w2, r2 wr | rad **, temp ***, wind ***, w2 **, r2 * |
0.6861 |
| omd 7-auto | rad, t2, w2, r2, wr, tr, tw, wtr | (Intercept) *, rad *, t2 ***, w2 **, r2 *, wr *, tr ·, tw **, wtr * |
0.6886 |
It is found that for omd1-auto, the step function produces a model without the main explanatory variables of rad and remains some two-way interactions and a three-way interaction. For omd7-auto, the step function produces a model without the main explanatory variables of temp, wind and remains some two-way interactions. Therefore, we should be cautious not to blindly use the step function.
In addition, when we are choosing the optimal model by model reduction, normally the result obtained from the criteria of determination can not satisfy the significance and adjusted R squared at the same time. So we should balance between these 2 criteria of determination based on what we need in certain context.
- Evaluation of coefficient estimates by Bootstrap and Jackknife
Bootstrapping
Since we can not directly plot a 4D scatter plot for the multiple linear regression, so calculation both the confidence interval and prediction interval seems challenging for us to perform. However, we can still use the bootstrap and jackknife technique learnt in the week-9-lecture to evaluate.
Remember the re-sample and re-arrange technique introduced in the lecture, these methods does not seems to be useful in this context since the size of computing randomness in the light of the response variable (Y) and so so many explanatory variable (multiple Xs) is unrealistic and computationally intensive.
ozone <- structure(ozone.pollution)
lines <- nrow(ozone)
lines## [1] 111
To solve the problem, we randomly pick up same number of target row data (in this case: 111) to build the multiple linear model for many many times by relicate function (in this case: 5000), which means that the selected target can be repetitive and be considered more than 1 time. For example, assuming that we want to select 5 target row data, the sample function is applied to roll the dice 5 times, which reports “1,3,1,4,2”. Therefore, we choose these row data to build the random model via bootstrapping.
In our case, the bootstrapping results for 5000 times is stored in the reps objects and we use the boxplot function to visualize. If you are interested in the distribution of coefficients estimates as mentioned in the lecture, histograms for each individual coefficients estimates is plotted as follows.
reps <- replicate(5000, lm(log(ozone) ~ rad + temp + wind + w2 + r2,
data=ozone[sample(lines, replace = TRUE),])$coefficients)Jackknife
We can also perform the Jackknife on the model by deleting one data row at a time to evaluate. To implement, the for loop is used to run 111 times and store the information of coefficient estimates into an empty data frame called jack.reg. The following codes also demonstrate the nested for loop that can generate the same result.
jack.reg <- data.frame(intercept=0,rad=0,temp=0,wind=0,w2=0,r2=0)
for (i in 1:111) {
model <- lm(log(ozone) ~ rad + temp + wind +I(wind^2) + I(rad^2),
data=ozone.pollution[-i,])
jack.reg[i,1] <-coef(model)[1]
jack.reg[i,2] <-coef(model)[2]
jack.reg[i,3] <-coef(model)[3]
jack.reg[i,4] <-coef(model)[4]
jack.reg[i,5] <-coef(model)[5]
jack.reg[i,6] <-coef(model)[6]
}
for (i in 1:111) {
for(j in 1:6){
model <- lm(log(ozone) ~ rad + temp + wind +I(wind^2) + I(rad^2),
data=ozone.pollution[-i,])
jack.reg[i,j]<-coef(model)[j]
}
}summary(jack.reg)## intercept rad temp wind
## Min. :0.5529 Min. :0.005630 Min. :0.03908 Min. :-0.2478
## 1st Qu.:0.7515 1st Qu.:0.007399 1st Qu.:0.04171 1st Qu.:-0.2226
## Median :0.7717 Median :0.007467 Median :0.04191 Median :-0.2211
## Mean :0.7724 Mean :0.007465 Mean :0.04193 Mean :-0.2211
## 3rd Qu.:0.7917 3rd Qu.:0.007551 3rd Qu.:0.04209 3rd Qu.:-0.2197
## Max. :1.2001 Max. :0.008531 Max. :0.04454 Max. :-0.1967
## w2 r2
## Min. :0.006378 Min. :-1.730e-05
## 1st Qu.:0.007316 1st Qu.:-1.496e-05
## Median :0.007391 Median :-1.471e-05
## Mean :0.007389 Mean :-1.470e-05
## 3rd Qu.:0.007448 3rd Qu.:-1.449e-05
## Max. :0.008460 Max. :-1.013e-05
Then, the result of jackknife is visualized using boxplot and histogram, which show the distribution of the coefficients of each explanatory variables.
As a result, the histogrames show that the model is greatly influenced by the omission of one certain point, which is extracted based on its Cook’s Distance. We should find it!Pay attention that this time the Cook’s Distance is at the 9th column of the list called influence.measures(model12)$infmat.
model12 <- lm(log(ozone.pollution$ozone) ~ rad + temp + wind + w2+r2)
which(influence.measures(model12)$infmat[,9]
== max(influence.measures(model12)$infmat[,9]))## 17
## 17
Then a new model model12_new is built for comparison of model performance.
model12_new <- lm(log(ozone) ~ rad + temp + wind + I(wind^2) + I(rad^2),
data=ozone.pollution[-17,])
summary(model12_new)##
## Call:
## lm(formula = log(ozone) ~ rad + temp + wind + I(wind^2) + I(rad^2),
## data = ozone.pollution[-17, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0339 -0.2767 -0.0042 0.3597 1.0927
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.200e+00 5.946e-01 2.019 0.0461 *
## rad 5.630e-03 2.186e-03 2.576 0.0114 *
## temp 3.908e-02 5.797e-03 6.741 9.02e-10 ***
## wind -2.213e-01 5.425e-02 -4.079 8.87e-05 ***
## I(wind^2) 7.032e-03 2.389e-03 2.944 0.0040 **
## I(rad^2) -1.013e-05 6.307e-06 -1.606 0.1112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.448 on 104 degrees of freedom
## Multiple R-squared: 0.7047, Adjusted R-squared: 0.6905
## F-statistic: 49.64 on 5 and 104 DF, p-value: < 2.2e-16
Table 4 summarize the comparison between model12 and model12_new.
| Model_Name | Model_element | Significance_score | Adj.R.squared |
|---|---|---|---|
| md 12 | rad, temp, wind, w2, r2 | rad **, temp ***, wind ***, w2 **, r2 * |
0.6861 |
| md 12 new | same except the 17th row | Intercept *, rad *, temp ***, wind ***, w2 ** |
0.6905 |
After removing the 17th row data, it seems that the model has improved based on the increased Adjusted R squared value. We can also plot the diagnosis plot for model12_newto compare it with the one for model 12.
Before regression diagnosis: model12
par(mfrow=c(2,3))
plot(model12,which = 1:6, caption = LETTERS[1:6])After regression diagnosis: model12_new
par(mfrow=c(2,3))
plot(model12_new, which = 1:6, caption = LETTERS[1:6])We can also conduct a pair-wise comparison to effectively visualize the difference between these two models.
par(mfcol=c(2,2))
for(i in 2){
plot(model12,which = 1:i, caption = LETTERS[1:i])
plot(model12_new, which = 1:i, caption = LETTERS[1:i])
}par(mfcol=c(2,2))
for(i in 4){
plot(model12,which = 3:i, caption = LETTERS[3:i])
plot(model12_new, which = 3:i, caption = LETTERS[3:i])
}par(mfcol=c(2,2))
for(i in 6){
plot(model12,which = 5:i, caption = LETTERS[5:i])
plot(model12_new, which = 5:i, caption = LETTERS[5:i])
}