Given a dataset called “glass.data”, try to construct a multinomial model following the process introduced in the simple example. Detailed information about this dataset is available in “glass.names”.
Question: how can we predict the glass type from the nine provided characteristics?
Step 1: Read the data set and take a brief look
library(nnet)
glass <- read.csv("glass.data", header = FALSE)
names(glass) <- c("id","RI","Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type")
Since the first row of the file contains data rather than column names, we set header = FALSE and name the attributes ourselves. The attribute information can be found in the file “glass.names”.
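To take the brief look promised above, a few standard commands suffice (a quick sketch):
head(glass)        # first rows, now with the column names set above
str(glass)         # data types of the 11 columns
table(glass$Type)  # class counts; the data contain types 1, 2, 3, 5, 6 and 7 only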
Step 2: Data processing
Delete the first id column, as it is not a variable needed for model construction.
Split the data into a training set and a test set; here 80% of the rows go to the training set. Note that the split ratio is fairly subjective: you can choose 75% or 70% if you prefer.
#delete the first column (serial number)
glass <- glass[,-1]
#split the data
train.sub <- sample(nrow(glass),8/10*nrow(glass))
glass_train <- glass[train.sub,]
glass_test <- glass[-train.sub,]
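Because sample() draws a random subset, the split, and hence every number reported below, will vary from run to run. For a reproducible split, set a seed before sampling (a sketch; the seed value 2023 is arbitrary):
set.seed(2023)  # arbitrary value; fixes the random split
train.sub <- sample(nrow(glass), 8/10 * nrow(glass))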
Step 3: Model construction
glass_model <- multinom(Type~ ., data = glass_train)
## # weights: 66 (50 variable)
## initial value 306.390869
## iter 10 value 184.770489
## iter 20 value 136.614812
## iter 30 value 126.955770
## iter 40 value 123.998698
## iter 50 value 121.292900
## iter 60 value 121.053110
## iter 70 value 120.719536
## iter 80 value 119.416622
## iter 90 value 119.178689
## iter 100 value 118.548092
## final value 118.548092
## stopped after 100 iterations
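Note the last line of the log: the optimizer stopped at the default limit of 100 iterations without reporting convergence. multinom() forwards a maxit argument to nnet(), so one option, not used for the results below, is to allow more iterations:
glass_model <- multinom(Type ~ ., data = glass_train, maxit = 1000)  # optional refit with a higher iteration limit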
#check the result
summary(glass_model)
## Call:
## multinom(formula = Type ~ ., data = glass_train)
##
## Coefficients:
## (Intercept) RI Na Mg Al Si
## 2 100.23223 170.76852 -2.8425174 -5.3929462 1.293744 -3.72049165
## 3 49.06041 -41.04774 0.5815352 -0.2403516 1.848909 0.05226385
## 5 20.55760 17.82086 -0.5905658 -3.8183028 10.645443 -0.58104159
## 6 -12.98882 -17.40785 15.0377390 -8.1780902 34.067397 -2.29029201
## 7 -36.41714 30.19252 2.4502113 -4.9604707 6.289330 -0.27125152
## K Ca Ba Fe
## 2 -2.6624955 -3.85991539 -2.9151888 1.689711
## 3 -2.1827586 -0.02294878 1.7598066 1.675595
## 5 0.6088930 -0.52672369 -1.5491401 -1.768337
## 6 -167.7561153 -1.86839690 -147.8036277 -303.786284
## 7 0.1220649 -2.14651010 -0.3161803 -28.642787
##
## Std. Errors:
## (Intercept) RI Na Mg Al Si K
## 2 0.04375287 0.07436692 0.5995630 0.8174109 1.4175217 0.1528503 2.156688e+00
## 3 0.09154674 0.17354694 0.7910230 1.0415904 1.6325180 0.1978553 2.423340e+00
## 5 0.09268378 0.16274268 0.8320274 1.1278840 2.4773750 0.2247502 2.619899e+00
## 6 0.01671126 0.02503714 0.7239535 3.1195576 0.8207166 0.2667885 1.178741e-05
## 7 0.09261539 0.15183221 1.1152763 1.1839134 2.2588795 0.2418852 2.609692e+00
## Ca Ba Fe
## 2 0.5277173 8.908255e+00 2.374720e+00
## 3 0.6776690 9.739969e+00 3.361018e+00
## 5 0.7779939 8.952810e+00 4.643463e+00
## 6 1.6795681 7.882619e-05 2.976338e-13
## 7 0.9483458 9.009162e+00 8.316669e-02
##
## Residual Deviance: 237.0962
## AIC: 337.0962
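Each coefficient row compares one glass type with the baseline type 1 (the first factor level) on the log scale. Exponentiating gives relative-risk ratios, which are often easier to interpret (a sketch):
exp(coef(glass_model))  # relative-risk ratios of each type against baseline type 1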
Step 4: Calculate the prediction accuracy
#predict with train set
glass_predictions <- predict(glass_model, glass_train)
mean(glass_predictions==glass_train$Type)
## [1] 0.7076023
table(predicted=glass_predictions,actual=glass_train$Type)
## actual
## predicted 1 2 3 5 6 7
## 1 38 13 10 0 0 0
## 2 16 45 3 3 0 1
## 3 0 0 0 0 0 0
## 5 0 1 0 9 0 0
## 6 0 0 0 0 7 0
## 7 0 2 0 1 0 22
#predict with test set
glass_test_predictions <- predict(glass_model, glass_test)
mean(glass_test_predictions==glass_test$Type)
## [1] 0.6744186
table(predicted = glass_test_predictions,actual =glass_test$Type)
## actual
## predicted 1 2 3 6 7
## 1 13 6 1 0 0
## 2 2 9 3 0 0
## 3 1 0 0 0 0
## 5 0 0 0 0 1
## 6 0 0 0 2 0
## 7 0 0 0 0 5
The model performs poorly at predicting types 3, 5, and 6, so these three types need more attention for improvement.
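One way to quantify this per class is per-class recall: divide the diagonal of the (square) training confusion table by its column totals (a sketch):
tab <- table(predicted = glass_predictions, actual = glass_train$Type)
diag(tab)/colSums(tab)  # fraction of each actual type predicted correctly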
‘airquality’ is a built-in dataset in R.
Pre-processing: delete the NA values and add a ‘season’ variable according to ‘Month’ in ‘airquality’:
Spring: Month = 5
Summer: Month = 6 or 7 or 8
Autumn: Month = 9
Study the differences of the four variables across the seasons using multinomial logistic regression:
Fit a multinomial logistic regression model.
Test the significance of each coefficient (summary() does not report p-values).
Use AIC to select a suitable model.
(optional) Fit season with temperature (Temp) as the only independent variable in a multinomial logistic regression model. Plot the predicted probabilities against temperature for the different seasons. You need to create a data set of new temperature values to predict on, for example Temp = data.frame(Temp = rep(c(57:97),1)).
First of all, delete the NA values in the dataset and generate the variable ‘season’ from ‘Month’ using the ifelse() function.
library(foreign)
library(nnet)
library(ggplot2)
library(reshape2)
df1<- airquality
df1<- na.omit(df1)
df1$season <- ifelse(df1$Month == 9, 'fall', ifelse(df1$Month %in% c(6, 7, 8), 'summer', 'spring'))
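A quick cross-tabulation verifies the mapping (a sketch):
table(df1$Month, df1$season)  # each month should fall into exactly one season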
1. Fit the multinomial logistic regression with Ozone, Solar.R, Wind, and Temp using multinom().
model<- multinom(season ~ Ozone+Solar.R+Wind+Temp, data = df1)
## # weights: 18 (10 variable)
## initial value 121.945964
## iter 10 value 89.803521
## iter 20 value 75.564422
## iter 30 value 75.550282
## iter 40 value 75.550184
## iter 40 value 75.550183
## iter 40 value 75.550183
## final value 75.550183
## converged
summary(model)
## Call:
## multinom(formula = season ~ Ozone + Solar.R + Wind + Temp, data = df1)
##
## Coefficients:
## (Intercept) Ozone Solar.R Wind Temp
## spring 18.995144 0.03481557 0.006662337 0.02033654 -0.30145705
## summer -8.087726 0.02065903 -0.001498790 0.11702105 0.08824013
##
## Std. Errors:
## (Intercept) Ozone Solar.R Wind Temp
## spring 5.015056 0.01807580 0.003961135 0.11898038 0.07161554
## summer 3.950170 0.01415295 0.003130199 0.09362517 0.05046871
##
## Residual Deviance: 151.1004
## AIC: 171.1004
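Since season is coerced to a factor with alphabetically ordered levels, ‘fall’ serves as the baseline class, so the two coefficient rows compare spring and summer against fall. A different baseline can be chosen by releveling the factor before fitting (a sketch, not run here, with ‘spring’ as an arbitrary choice):
df1$season <- relevel(factor(df1$season), ref = 'spring')  # not run: would change the baseline used below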
2. Use the Wald test to assess the significance of each coefficient.
z <- summary(model)$coefficients/summary(model)$standard.errors # Wald z statistics
p <- (1 - pnorm(abs(z), 0, 1)) * 2 # two-sided p-values from the standard normal
round(p, 4)
##        (Intercept)  Ozone Solar.R   Wind   Temp
## spring      0.0002 0.0541  0.0926 0.8643 0.0000
## summer      0.0406 0.1444  0.6321 0.2113 0.0804
Judging by these p-values, Wind is insignificant in both comparisons, none of the slopes in the odds p(season = summer)/p(season = fall) reaches the 5% level, and in the odds p(season = spring)/p(season = fall) only Temp is clearly significant, with Ozone marginal.
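Equivalently, the squared z statistics are Wald chi-square statistics on one degree of freedom, and pchisq() reproduces the same p-values:
pchisq(z^2, df = 1, lower.tail = FALSE)  # identical to the normal-based p above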
3. Select the model using AIC.
model2<- step(model)
## Start: AIC=171.1
## season ~ Ozone + Solar.R + Wind + Temp
##
## trying - Ozone
## # weights: 15 (8 variable)
## initial value 121.945964
## iter 10 value 79.298990
## iter 20 value 77.744596
## final value 77.744009
## converged
## trying - Solar.R
## # weights: 15 (8 variable)
## initial value 121.945964
## iter 10 value 79.625833
## iter 20 value 77.714354
## iter 30 value 77.709926
## final value 77.709917
## converged
## trying - Wind
## # weights: 15 (8 variable)
## initial value 121.945964
## iter 10 value 78.838470
## iter 20 value 76.404650
## iter 30 value 76.403254
## final value 76.403180
## converged
## trying - Temp
## # weights: 15 (8 variable)
## initial value 121.945964
## iter 10 value 100.724660
## final value 100.721050
## converged
## Df AIC
## - Wind 8 168.8064
## <none> 10 171.1004
## - Solar.R 8 171.4198
## - Ozone 8 171.4880
## - Temp 8 217.4421
## # weights: 15 (8 variable)
## initial value 121.945964
## iter 10 value 78.838470
## iter 20 value 76.404650
## iter 30 value 76.403254
## final value 76.403180
## converged
##
## Step: AIC=168.81
## season ~ Ozone + Solar.R + Temp
##
## trying - Ozone
## # weights: 12 (6 variable)
## initial value 121.945964
## iter 10 value 78.784616
## iter 20 value 78.356364
## final value 78.353540
## converged
## trying - Solar.R
## # weights: 12 (6 variable)
## initial value 121.945964
## iter 10 value 78.943041
## iter 20 value 78.393197
## final value 78.375176
## converged
## trying - Temp
## # weights: 12 (6 variable)
## initial value 121.945964
## iter 10 value 101.483881
## final value 101.480253
## converged
## Df AIC
## - Ozone 6 168.7071
## - Solar.R 6 168.7504
## <none> 8 168.8064
## - Temp 6 214.9605
## # weights: 12 (6 variable)
## initial value 121.945964
## iter 10 value 78.784616
## iter 20 value 78.356364
## final value 78.353540
## converged
##
## Step: AIC=168.71
## season ~ Solar.R + Temp
##
## trying - Solar.R
## # weights: 9 (4 variable)
## initial value 121.945964
## iter 10 value 80.899581
## iter 20 value 80.772771
## iter 20 value 80.772771
## final value 80.772762
## converged
## trying - Temp
## # weights: 9 (4 variable)
## initial value 121.945964
## final value 112.521561
## converged
## Df AIC
## <none> 6 168.7071
## - Solar.R 4 169.5455
## - Temp 4 233.0431
summary(model2)
## Call:
## multinom(formula = season ~ Solar.R + Temp, data = df1)
##
## Coefficients:
## (Intercept) Solar.R Temp
## spring 15.480891 0.0074004081 -0.2379759
## summer -8.392658 0.0001989491 0.1131145
##
## Std. Errors:
## (Intercept) Solar.R Temp
## spring 3.886858 0.003731804 0.05819377
## summer 2.867453 0.002971258 0.03769079
##
## Residual Deviance: 156.7071
## AIC: 168.7071
The selected model drops the variables ‘Ozone’ and ‘Wind’, and the AIC decreases from 171.10 to 168.71.
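The same comparison can be phrased as a likelihood-ratio test using the reported deviances: the difference 156.7071 - 151.1004 ≈ 5.61 on 10 - 6 = 4 parameters gives p ≈ 0.23, so the smaller model is not significantly worse (a sketch):
pchisq(model2$deviance - model$deviance, df = 4, lower.tail = FALSE)  # LR test of the selected model against the full model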
4. Fit season with temperature (Temp) as the only predictor and draw the graph.
model3 <- multinom(season ~ Temp, data = df1)# fit the model
## # weights: 9 (4 variable)
## initial value 121.945964
## iter 10 value 80.899581
## iter 20 value 80.772771
## iter 20 value 80.772771
## final value 80.772762
## converged
# 1. generate a grid of new temperature values to predict on
Temp <- data.frame(Temp = 57:97)
# 2. combine the predicted probabilities with the temperature grid
pp.Temp <- cbind(Temp, predict(model3, newdata = Temp, type = "probs"))
# 3. reshape to long format: one row per (Temp, season) pair with its probability
lppp <- melt(pp.Temp, id.vars = 'Temp', value.name = "probability")
# 4. plot the predicted probability against Temp, one panel per season
ggplot(lppp, aes(x = Temp, y = probability)) + geom_line() + facet_grid(variable ~ ., scales = "free")
The plot shows, for each temperature in the range 57 to 97, the predicted probability that the observation belongs to each season.
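An alternative view overlays the three curves in a single panel, coloured by season (a sketch):
ggplot(lppp, aes(x = Temp, y = probability, colour = variable)) + geom_line()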