✨ Answers for Exercise 1
1. Exercise 1
Use the Petal.Width and the Species from iris.csv data
- Develop logistic regression
☝️ There are three species, which means you could try 3 different combinations. Please try all 3 combinations and see whether they worked or not. If not, why?
✌️ If you encountered some troubles, the suggestion is you first use some function for visualization to have some idea about the data.
- Use plot() or ggplot() for visualization
\(~\)
2. Step by step answers
2.1 Basic data processing
1️⃣ Import data
iris <- read.csv("iris.csv")\(~\)
\(~~~~\) Since there are three species in the “Species” column from iris, theoretically we could have three different combinations.
\(~\)
⭐ Note: For expected outcome there could only be two options (remember the binary outcome), whereas there could be multiple predictor variables. In addition, the value of the outcome should be in the range between 0 and 1.
\(~\)
2️⃣ Here we create three datasets based on the combinations.
cb1 <- subset(iris, iris$Species!="setosa") #without setosa
cb2 <- subset(iris, iris$Species!="versicolor") #without versicolor
cb3 <- subset(iris, iris$Species!="virginica") #without verginica\(~\)
Now, if you try to fit the model, it will return with error:
model <- glm(Species ~ Petal.Width, data=cb1, family=binomial)error in eval(family$initialize):y values must be 0 <= y <= 1
\(~\)
We could use typeof() function to check the data type.
typeof(iris$Species)## [1] "character"
\(~\)
3️⃣ To fix the issue that “Species” is not numeric data type, we could either:
Use factor() to label the “Species”. However, this requires further process later.
✨Recommended ✨Directly create another column with numeric data .
\(~\)
➡️ Use factor() to label the Species for three combinations.
cb1$Ss <- factor(cb1$Species, levels = c("versicolor","virginica"), labels = c(0,1))
cb2$Ss <- factor(cb2$Species, levels = c("setosa","virginica"), labels = c(0,1))
cb3$Ss <- factor(cb3$Species, levels = c("setosa","versicolor"), labels = c(0,1))\(~\)
➡️ Directly create another column with numeric data.
cb1$S <- ifelse(cb1$Species == "versicolor", 0, 1)
cb2$S <- ifelse(cb2$Species == "setosa", 0, 1)
cb3$S <- ifelse(cb3$Species == "setosa", 0, 1)\(~\)
Now we can move to the next section: fit model.
\(~\)
2.2 Model fitting
Fit model for three different combinations and check summary().
1️⃣ Combination 1: versicolor & virginica
#Note: Based on the previous choice you made,you choose either Ss or S
A <- glm(Ss ~ Petal.Width, data=cb1, family=binomial)
summary(A)##
## Call:
## glm(formula = Ss ~ Petal.Width, family = binomial, data = cb1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.13868 -0.16468 -0.00929 0.13000 2.46891
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -21.126 4.597 -4.596 4.31e-06 ***
## Petal.Width 12.947 2.843 4.554 5.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 33.421 on 98 degrees of freedom
## AIC: 37.421
##
## Number of Fisher Scoring iterations: 7
\(~\)
The P-value is much lower than 0.05, means it is statistically significance. This one looks good. How about the others?
\(~\)
2️⃣ Combination 2: setosa & virginica
#Note: Based on the previous choice you made,you choose either Ss or S
B <- glm(Ss ~ Petal.Width, data=cb2, family=binomial)## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Oooops, warning messages 😦 Still, let’s check the model.
summary(B)##
## Call:
## glm(formula = Ss ~ Petal.Width, family = binomial, data = cb2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.161e-05 -2.100e-08 0.000e+00 2.100e-08 3.264e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -53.49 51193.70 -0.001 0.999
## Petal.Width 53.46 46949.98 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.3863e+02 on 99 degrees of freedom
## Residual deviance: 2.0797e-09 on 98 degrees of freedom
## AIC: 4
##
## Number of Fisher Scoring iterations: 25
The P-value is extreme large ☹️ which means this is not a good model. Let’s see how the last combination works.
\(~\)
3️⃣ Combination 3: setosa & versicolor
#Note: Based on the previous choice you made,you choose either Ss or S
C <- glm(Ss ~ Petal.Width, data=cb3, family=binomial)## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
😧 Warning message again. Hmmmmmm.
summary(C)##
## Call:
## glm(formula = Ss ~ Petal.Width, family = binomial, data = cb3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.888e-05 -2.100e-08 0.000e+00 2.100e-08 1.379e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -87.12 65460.84 -0.001 0.999
## Petal.Width 110.20 80592.58 0.001 0.999
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.3863e+02 on 99 degrees of freedom
## Residual deviance: 2.8433e-09 on 98 degrees of freedom
## AIC: 4
##
## Number of Fisher Scoring iterations: 25
😕 Huh, still very large P-value. Not a good model.
\(~\)
Now let’s assessing the model fit by McFadden’s R2. Here we use the package “pscl”.
pscl::pR2(A)["McFadden"]## fitting null model for pseudo-r2
## McFadden
## 0.7589199
pscl::pR2(B)["McFadden"]## fitting null model for pseudo-r2
## McFadden
## 1
pscl::pR2(C)["McFadden"]## fitting null model for pseudo-r2
## McFadden
## 1
The first outcome is nice (higher than 0.4). However, the other two combinations with “setosa” have exact 1. This is the same with the previous warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
\(~\)
This is caused by the data is too separated. Here we use ggplot() to visualize the original data to have some idea 💡
ggplot(iris, aes(x=Petal.Width, color=Species, fill = Species)) +
labs(x = "Petal Width", y = "Density") +
geom_density(alpha = 0.3)📌 From this plot you could clearly see that “setosa” is too separated from the other two species. In these cases, logistic regression model is not the best choice since it fit tooooo well (overfitting). Instead, we could just use Petal Width < 0.75 to classify setasa and other species.
\(~\)
Then, how about the first warning message?
glm.fit: algorithm did not converge
The default setting for maximum attempts is 25 times, we could let it try more times by changing control=list(maxit=)
#Here we use the previous example
#B <- glm(Ss ~ Petal.Width, data=cb2, family=binomial)
B <- glm(Ss ~ Petal.Width, data=cb2, family=binomial, control=list(maxit=100))## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Now we all know why this warning message shows 😎
\(~\)
2.3 Summary for Q1
✏️ Combination 1 (“versicolor” & “virginica”) works the best.
✏️ Combinations with “setosa” are not good logistic regressoin models since the data were too perfectly separated (second warning message).
✏️ By increasing the attemptance time, we could slove the first warning message.
\(~\)
2.4 Before Visualization
⭐ Note: If you choose the ✨recommended✨ method converting data type, you could skip this section. Also, since only combination 1 worked well, now we only use cb1.
\(~\)
Although we have convert the data type to numeric, and the glm() did work with it (which means it should be numeric data type).
❗ However, if we use is.numeric() to test whether the data is numeric or not:
is.numeric(cb1$Ss)## [1] FALSE
🙂 I have no idea why this happens, but we could do few more steps to make it right. Also, we are doing these steps because ggplot() or plot() will not accept the data. Or you can choose the recommended method at first ah
\(~\)
So we first convert the data type to numeric, then cover them with 0 and 1 (because after the conversion it would be 1 and 2, which will be rejected too).
#rewrite the column Ss to numeric data
cb1$Ss <- as.numeric(cb1$Ss)
#rewrite the column Ss to 0 and 1
cb1$Ss[cb1$Ss==1] <- 0
cb1$Ss[cb1$Ss==2] <- 1Now let’s check again.
is.numeric(cb1$Ss)## [1] TRUE
Now we can finally visualize the logistic regression.
\(~\)
2.5 Visualization
Similar with what we did in the lecture, we also use plot() and ggplot().
\(~\)
1️⃣ With plot()
#Note: Based on the previous choice you made,you choose either Ss or S
#develop model, if you already done this please ignore
A <- glm(Ss ~ Petal.Width, data=cb1, family=binomial)
#intensify the X value
newdata <- data.frame(Petal.Width = seq(min(cb1$Petal.Width), max(cb1$Petal.Width),len=100))
#use the model to predict Y value
newdata$S = predict(A, newdata, type="response")
#plot the logistic regression curve
plot(Ss ~ Petal.Width, data=cb1, col='orange', las=1, pch=19, cex=1.5)
lines(S ~ Petal.Width, newdata, lwd=2, col='#1874CD')#The simplest code:
#plot(Ss ~ Petal.Width, data=cb1)
#lines(S ~ Petal.Width, newdata)\(~\)
2️⃣ With ggplot()
ggplot(cb1, aes(x=Petal.Width, y=Ss)) +
geom_point(size = 4, alpha = 0.3, color = "blue4") +
theme(panel.grid = element_blank()) +
labs(x = "Petal Width", y = "Species") +
theme(panel.background = element_blank())+
theme(panel.border = element_rect(fill=NA,color="black", size=1, linetype="solid"))+
stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial),
color = "gold1", size = 1.5)#The simplest code:
#ggplot(cb1, aes(x=Petal.Width, y=Ss)) + geom_point() + stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial))\(~\)
THE END
🌱 This is the end for exercise 1 :") thanks a lot!