Exercise 1

Given a dataset called “glass.data”, try to construct a multinomial model following the process introduced in the simple example. Detailed information about this dataset is available in “glass.names”.

Question: How can we predict the glass type from the nine provided characteristics?

Step 1: Read the data set and take a brief look

library(nnet)
glass <- read.csv("glass.data", header = FALSE)
names(glass) <- c("id","RI","Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type")

Since the first row of the file contains data rather than column names, we set header = FALSE and name the attributes ourselves. The attribute information can be found in the file “glass.names”.
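
For the brief look, a few standard commands are enough (a suggested sketch; output omitted):

head(glass)          # first few rows
str(glass)           # variable types and dimensions
table(glass$Type)    # class distribution of the response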

Step 2: Data processing

Delete the first id column, as it is not a variable needed in model construction.

Split the data into a training set and a test set; here the training proportion is 80%. Note that the split ratio is somewhat subjective: you could just as well use 75% or 70%.

#delete the first column (serial number)
glass <- glass[,-1]
#split the data
train.sub <- sample(nrow(glass),8/10*nrow(glass))
glass_train <- glass[train.sub,]
glass_test <- glass[-train.sub,]
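
Because sample() draws a random subset, every run produces a different split, so your numbers will differ slightly from those shown below. A minimal sketch of a reproducible split (the seed value 42 is an arbitrary choice):

set.seed(42)                                         # fix the RNG so the split is reproducible
train.sub <- sample(nrow(glass), floor(0.8 * nrow(glass)))
glass_train <- glass[train.sub, ]
glass_test <- glass[-train.sub, ]
prop.table(table(glass_train$Type))                  # check the class balance of the training set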

Step 3: Model construction

glass_model <- multinom(Type~ ., data = glass_train)
## # weights:  66 (50 variable)
## initial  value 306.390869 
## iter  10 value 184.770489
## iter  20 value 136.614812
## iter  30 value 126.955770
## iter  40 value 123.998698
## iter  50 value 121.292900
## iter  60 value 121.053110
## iter  70 value 120.719536
## iter  80 value 119.416622
## iter  90 value 119.178689
## iter 100 value 118.548092
## final  value 118.548092 
## stopped after 100 iterations
#check the result
summary(glass_model)
## Call:
## multinom(formula = Type ~ ., data = glass_train)
## 
## Coefficients:
##   (Intercept)        RI         Na         Mg        Al          Si
## 2   100.23223 170.76852 -2.8425174 -5.3929462  1.293744 -3.72049165
## 3    49.06041 -41.04774  0.5815352 -0.2403516  1.848909  0.05226385
## 5    20.55760  17.82086 -0.5905658 -3.8183028 10.645443 -0.58104159
## 6   -12.98882 -17.40785 15.0377390 -8.1780902 34.067397 -2.29029201
## 7   -36.41714  30.19252  2.4502113 -4.9604707  6.289330 -0.27125152
##              K          Ca           Ba          Fe
## 2   -2.6624955 -3.85991539   -2.9151888    1.689711
## 3   -2.1827586 -0.02294878    1.7598066    1.675595
## 5    0.6088930 -0.52672369   -1.5491401   -1.768337
## 6 -167.7561153 -1.86839690 -147.8036277 -303.786284
## 7    0.1220649 -2.14651010   -0.3161803  -28.642787
## 
## Std. Errors:
##   (Intercept)         RI        Na        Mg        Al        Si            K
## 2  0.04375287 0.07436692 0.5995630 0.8174109 1.4175217 0.1528503 2.156688e+00
## 3  0.09154674 0.17354694 0.7910230 1.0415904 1.6325180 0.1978553 2.423340e+00
## 5  0.09268378 0.16274268 0.8320274 1.1278840 2.4773750 0.2247502 2.619899e+00
## 6  0.01671126 0.02503714 0.7239535 3.1195576 0.8207166 0.2667885 1.178741e-05
## 7  0.09261539 0.15183221 1.1152763 1.1839134 2.2588795 0.2418852 2.609692e+00
##          Ca           Ba           Fe
## 2 0.5277173 8.908255e+00 2.374720e+00
## 3 0.6776690 9.739969e+00 3.361018e+00
## 5 0.7779939 8.952810e+00 4.643463e+00
## 6 1.6795681 7.882619e-05 2.976338e-13
## 7 0.9483458 9.009162e+00 8.316669e-02
## 
## Residual Deviance: 237.0962 
## AIC: 337.0962
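
Note that the optimizer stopped at the default cap of 100 iterations before converging. multinom() passes extra arguments on to nnet(), so the cap can be raised with maxit if you want full convergence (a sketch; the object name is ours, and the refitted coefficients will differ slightly from the output above):

glass_model_converged <- multinom(Type ~ ., data = glass_train, maxit = 500)  # allow up to 500 iterations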

Step 4: Calculate the prediction accuracy

#predict with train set
glass_predictions <- predict(glass_model, glass_train)
mean(glass_predictions==glass_train$Type)
## [1] 0.7076023
table(predicted=glass_predictions,actual=glass_train$Type)
##          actual
## predicted  1  2  3  5  6  7
##         1 38 13 10  0  0  0
##         2 16 45  3  3  0  1
##         3  0  0  0  0  0  0
##         5  0  1  0  9  0  0
##         6  0  0  0  0  7  0
##         7  0  2  0  1  0 22
#predict with test set
glass_test_predictions <- predict(glass_model, glass_test)
mean(glass_test_predictions==glass_test$Type)
## [1] 0.6744186
table(predicted = glass_test_predictions,actual =glass_test$Type)
##          actual
## predicted  1  2  3  6  7
##         1 13  6  1  0  0
##         2  2  9  3  0  0
##         3  1  0  0  0  0
##         5  0  0  0  0  1
##         6  0  0  0  2  0
##         7  0  0  0  0  5

The model performs poorly on some classes: type 3 is never predicted correctly in either set, and types 1 and 2 are frequently confused with each other. These weak spots need more attention for improvement.
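
Per-class recall makes these weaknesses explicit. A minimal sketch on the training set (diag() works here because the training confusion matrix is square):

conf <- table(predicted = glass_predictions, actual = glass_train$Type)
round(diag(conf) / colSums(conf), 2)   # recall (sensitivity) for each glass type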

Exercise 2

‘airquality’ is a built-in dataset in R.

Pre-processing: delete the NA values and add a ‘season’ variable according to ‘Month’ in ‘airquality’:

Spring: Month = 5

Summer: Month = 6 or 7 or 8

Autumn: Month = 9

Study the differences of the four variables across the seasons using multinomial logistic regression.

  1. Fit a multinomial logistic regression model.

  2. Test the significance of each coefficient (if the summary does not give it).

  3. Perform AIC-based selection to choose a suitable model.

  4. (optional) Fit season with temperature (Temp) as the only independent variable in a multinomial logistic regression model. Plot the predicted probabilities against temperature for the different seasons. You need to create a prediction data set, for example Temp = data.frame(Temp = rep(c(57:97), 1)).

Solution


First of all, delete the NA values in the dataset and generate the variable ‘season’ from ‘Month’ using the ifelse() function.

library(nnet)
library(ggplot2)
library(reshape2)
df1 <- airquality
df1 <- na.omit(df1)    # drop rows with missing values
df1$season <- ifelse(df1$Month == 9, 'fall',
                     ifelse(df1$Month %in% c(6, 7, 8), 'summer', 'spring'))
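
Two optional checks (output omitted): verify the month-to-season mapping, and note that multinom() treats a character response as a factor whose alphabetically first level, here 'fall', becomes the reference category; making that explicit with relevel() does not change the fit.

table(df1$Month, df1$season)                             # verify the mapping
df1$season <- relevel(factor(df1$season), ref = "fall")  # make the reference level explicit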


1. Fit the multinomial logistic regression with Ozone, Solar.R, Wind, and Temp using multinom().

model<- multinom(season ~ Ozone+Solar.R+Wind+Temp, data = df1)
## # weights:  18 (10 variable)
## initial  value 121.945964 
## iter  10 value 89.803521
## iter  20 value 75.564422
## iter  30 value 75.550282
## iter  40 value 75.550184
## iter  40 value 75.550183
## iter  40 value 75.550183
## final  value 75.550183 
## converged
summary(model)
## Call:
## multinom(formula = season ~ Ozone + Solar.R + Wind + Temp, data = df1)
## 
## Coefficients:
##        (Intercept)      Ozone      Solar.R       Wind        Temp
## spring   18.995144 0.03481557  0.006662337 0.02033654 -0.30145705
## summer   -8.087726 0.02065903 -0.001498790 0.11702105  0.08824013
## 
## Std. Errors:
##        (Intercept)      Ozone     Solar.R       Wind       Temp
## spring    5.015056 0.01807580 0.003961135 0.11898038 0.07161554
## summer    3.950170 0.01415295 0.003130199 0.09362517 0.05046871
## 
## Residual Deviance: 151.1004 
## AIC: 171.1004
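
The coefficients are log-odds of spring (or summer) versus the reference season 'fall'. If you prefer relative risk ratios, exponentiate them (output omitted):

exp(coef(model))   # relative risk ratios against the reference season 'fall'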


2. Use the Wald test to assess significance. Since multinom() does not report p-values, we compute z-statistics as the ratio of each coefficient to its standard error and compare them with the standard normal distribution.

z <- summary(model)$coefficients/summary(model)$standard.errors
p <- (1 - pnorm(abs(z), 0, 1)) * 2  # two-tailed p-values
round(p, 4)
##        (Intercept)  Ozone Solar.R   Wind   Temp
## spring      0.0002 0.0541  0.0926 0.8643 0.0000
## summer      0.0406 0.1444  0.6321 0.2113 0.0804

At the 5% level, Temp is clearly significant in the odds p(season = spring)/p(season = fall) and Ozone is borderline (p ≈ 0.054), while Solar.R and Wind are not significant; in the odds p(season = summer)/p(season = fall), none of the predictors reaches significance, with Temp the closest. This already suggests that Wind, and possibly Ozone, can be dropped, which the AIC selection below confirms.


3. Choose the model using AIC-based stepwise selection with step().

model2<- step(model)
## Start:  AIC=171.1
## season ~ Ozone + Solar.R + Wind + Temp
## 
## trying - Ozone 
## # weights:  15 (8 variable)
## initial  value 121.945964 
## iter  10 value 79.298990
## iter  20 value 77.744596
## final  value 77.744009 
## converged
## trying - Solar.R 
## # weights:  15 (8 variable)
## initial  value 121.945964 
## iter  10 value 79.625833
## iter  20 value 77.714354
## iter  30 value 77.709926
## final  value 77.709917 
## converged
## trying - Wind 
## # weights:  15 (8 variable)
## initial  value 121.945964 
## iter  10 value 78.838470
## iter  20 value 76.404650
## iter  30 value 76.403254
## final  value 76.403180 
## converged
## trying - Temp 
## # weights:  15 (8 variable)
## initial  value 121.945964 
## iter  10 value 100.724660
## final  value 100.721050 
## converged
##           Df      AIC
## - Wind     8 168.8064
## <none>    10 171.1004
## - Solar.R  8 171.4198
## - Ozone    8 171.4880
## - Temp     8 217.4421
## # weights:  15 (8 variable)
## initial  value 121.945964 
## iter  10 value 78.838470
## iter  20 value 76.404650
## iter  30 value 76.403254
## final  value 76.403180 
## converged
## 
## Step:  AIC=168.81
## season ~ Ozone + Solar.R + Temp
## 
## trying - Ozone 
## # weights:  12 (6 variable)
## initial  value 121.945964 
## iter  10 value 78.784616
## iter  20 value 78.356364
## final  value 78.353540 
## converged
## trying - Solar.R 
## # weights:  12 (6 variable)
## initial  value 121.945964 
## iter  10 value 78.943041
## iter  20 value 78.393197
## final  value 78.375176 
## converged
## trying - Temp 
## # weights:  12 (6 variable)
## initial  value 121.945964 
## iter  10 value 101.483881
## final  value 101.480253 
## converged
##           Df      AIC
## - Ozone    6 168.7071
## - Solar.R  6 168.7504
## <none>     8 168.8064
## - Temp     6 214.9605
## # weights:  12 (6 variable)
## initial  value 121.945964 
## iter  10 value 78.784616
## iter  20 value 78.356364
## final  value 78.353540 
## converged
## 
## Step:  AIC=168.71
## season ~ Solar.R + Temp
## 
## trying - Solar.R 
## # weights:  9 (4 variable)
## initial  value 121.945964 
## iter  10 value 80.899581
## iter  20 value 80.772771
## iter  20 value 80.772771
## final  value 80.772762 
## converged
## trying - Temp 
## # weights:  9 (4 variable)
## initial  value 121.945964 
## final  value 112.521561 
## converged
##           Df      AIC
## <none>     6 168.7071
## - Solar.R  4 169.5455
## - Temp     4 233.0431
summary(model2)
## Call:
## multinom(formula = season ~ Solar.R + Temp, data = df1)
## 
## Coefficients:
##        (Intercept)      Solar.R       Temp
## spring   15.480891 0.0074004081 -0.2379759
## summer   -8.392658 0.0001989491  0.1131145
## 
## Std. Errors:
##        (Intercept)     Solar.R       Temp
## spring    3.886858 0.003731804 0.05819377
## summer    2.867453 0.002971258 0.03769079
## 
## Residual Deviance: 156.7071 
## AIC: 168.7071

The selected model drops the variables ‘Wind’ and ‘Ozone’, and the AIC decreases from 171.10 to 168.71.
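
As a quick cross-check, AIC() compares the two fitted objects directly; the values should match the summaries above (171.1004 versus 168.7071):

AIC(model, model2)   # the lower AIC marks the preferred model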


4. Fit season with temperature (Temp) as the only predictor and draw the graph.

model3 <- multinom(season ~ Temp, data = df1)# fit the model
## # weights:  9 (4 variable)
## initial  value 121.945964 
## iter  10 value 80.899581
## iter  20 value 80.772771
## iter  20 value 80.772771
## final  value 80.772762 
## converged
Temp <- data.frame(Temp = rep(c(57:97), 1))# 1. generate new data covering the observed temperature range
pp.Temp <- cbind(Temp, predict(model3, newdata = Temp, type = "probs"))# 2. combine the predicted probabilities with the new data
lppp <- melt(pp.Temp, id.vars = 'Temp', value.name = "probability")# 3. melt the table to long format: one probability per Temp and season
ggplot(lppp, aes(x = Temp, y = probability)) + geom_line() + facet_grid(variable ~ ., scales = "free")# 4. draw the graph with ggplot, one panel per season

The plot shows, for each temperature from 57 to 97, the predicted probability of each season: spring is most likely at low temperatures, summer at high temperatures, and fall in between.
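
If you prefer the three curves overlaid in one panel instead of facetted, mapping the season to colour works equally well (a sketch reusing the lppp data frame from above):

ggplot(lppp, aes(x = Temp, y = probability, colour = variable)) + geom_line() + labs(colour = "season")   # one coloured curve per season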