This paper evaluates the performance of several statistical learning methods for prediction. The models are fitted to the Titanic dataset, which contains demographic and other information about the passengers along with whether or not they survived. We will fit each statistical learning method on a set of training observations, measure its performance on a set of test observations, and finally compare the methods using ROC curves.
The following R packages are required:
# Load standard libraries
library(tidyverse)
library(gridExtra)
library(MASS)
library(pROC)
library(arm)
library(randomForest)
library(xgboost)
We will use the Titanic dataset provided by Paul Hendricks (https://github.com/paulhendricks/titanic). The Titanic dataset contains data about the survival of passengers aboard the Titanic.
# Load data
titanic_data <- read.csv('titanic.csv')
str(titanic_data) # explore data structure
## 'data.frame': 1309 obs. of 14 variables:
## $ pclass : int 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : int 1 1 0 0 0 1 1 0 1 0 ...
## $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : num 29 0.917 2 30 25 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
## $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
## $ boat : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
## $ body : int NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
#cleaning data: casting categorical variables as factors
titanic_data$survived <- as.factor(titanic_data$survived)
titanic_data$pclass <- as.factor(titanic_data$pclass)
When the goal is to understand the data or infer relationships between variables, splitting the data into smaller samples makes little sense: we want as much data as possible to uncover patterns. But when we move into the realm of prediction and fit a statistical model to predict the outcomes of events that have not yet occurred, a major concern is the quality and performance of the model. That quality can only be assessed by using the model to predict values for data it has not yet seen, and prediction errors can then be calculated by comparing the predictions against the actual values. Since we cannot obtain real future data, the best available approach is to split the original dataset, use one part (the training set) to build the model, and use the other part (the test set) to evaluate it. Hence we first split our data into training and test sets in an 80:20 ratio.
set.seed(2)
# 80% of the sample size
smp_size <- floor(0.80 * nrow(titanic_data))
#training data index
train.index <- sample( 1: nrow(titanic_data), size = smp_size)
#training data
train <- titanic_data[train.index,]
#test data: data after filtering out the training data
test.data <- titanic_data[-train.index ,]
In this paper the goal is to predict the survival of passengers. We will consider different models to do so.
#logistic Regression model
glm.fit <- glm(survived ~ pclass, data = train, family = binomial)
#summary of fit
summary(glm.fit)
##
## Call:
## glm(formula = survived ~ pclass, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3824 -0.7702 -0.7702 0.9854 1.6493
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4700 0.1275 3.687 0.000227 ***
## pclass2 -0.6689 0.1856 -3.604 0.000314 ***
## pclass3 -1.5335 0.1598 -9.598 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.3 on 1046 degrees of freedom
## Residual deviance: 1295.6 on 1044 degrees of freedom
## AIC: 1301.6
##
## Number of Fisher Scoring iterations: 4
#coefficients of the model
coef(glm.fit)
## (Intercept) pclass2 pclass3
## 0.4700036 -0.6688545 -1.5335246
#bayesian model
bayes.fit <- bayesglm(survived ~ pclass, data = train, family = binomial)
#summary of fit
summary(bayes.fit)
##
## Call:
## bayesglm(formula = survived ~ pclass, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3784 -0.7713 -0.7713 0.9889 1.6478
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4611 0.1268 3.637 0.000276 ***
## pclass2 -0.6564 0.1844 -3.560 0.000371 ***
## pclass3 -1.5213 0.1589 -9.573 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.3 on 1046 degrees of freedom
## Residual deviance: 1295.6 on 1044 degrees of freedom
## AIC: 1301.6
##
## Number of Fisher Scoring iterations: 5
#coefficients of the model
coef(bayes.fit)
## (Intercept) pclass2 pclass3
## 0.4611199 -0.6563755 -1.5213075
Both the Bayesian and the standard logistic regression models show that passenger class is a highly significant predictor (very low p-values, especially for pclass3). The negative coefficient estimates also tell us that being a second- or third-class passenger substantially decreases the odds of survival relative to first class.
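To make these estimates more tangible, we can exponentiate the coefficients into odds ratios and compute the fitted survival probability for each class; a quick sketch:
#odds ratios: exponentiated log-odds coefficients
exp(coef(glm.fit))
#fitted survival probability for each passenger class
predict(glm.fit, newdata = data.frame(pclass = factor(1:3)), type = "response")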
Next, let’s consider the performance of this model.
We will predict, for a given passenger class, the probability that the passenger survived; setting type = "response" in the predict function returns these probabilities. We then apply a threshold of 0.5, since the response is binary.
#prediction of probabilities for test values
glm.probs <- predict(glm.fit, test.data, type = "response")
#predicting actual values based on probabilities
yhat <- rep(0, nrow(test.data))
yhat[glm.probs > 0.5] <- 1
yhat
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [211] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#creating confusion matrix
table(yhat , test.data$survived)
##
## yhat 0 1
## 0 144 55
## 1 23 40
#test error rate
mean(yhat != test.data$survived)
## [1] 0.2977099
The model has an error rate of about 29.8%, having predicted 184 of the 262 test values correctly (144 + 40). From the confusion matrix it is clear that the model produces 55 false negatives (survived = 1 but predicted = 0) and 23 false positives (survived = 0 but predicted = 1).
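For completeness, the sensitivity and specificity implied by this confusion matrix can be computed directly; a small sketch:
#sensitivity and specificity from the confusion matrix
cm <- table(yhat, test.data$survived)
sens <- cm["1", "1"] / sum(cm[, "1"]) #TP / (TP + FN)
spec <- cm["0", "0"] / sum(cm[, "0"]) #TN / (TN + FP)
c(sensitivity = sens, specificity = spec)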
#roc calculation
fit.roc <- roc(test.data$survived, yhat,
auc = TRUE, thresholds = 0.5)
#plotting
plot(fit.roc, col = "red", legacy.axes = TRUE,
main = "ROC curve for Logistic Regression Model")
##
## Call:
## roc.default(response = test.data$survived, predictor = yhat, auc = TRUE, thresholds = 0.5)
##
## Data: yhat in 167 controls (test.data$survived 0) < 95 cases (test.data$survived 1).
## Area under the curve: 0.6417
The ROC curve plots the true positive rate against the false positive rate for the classifier. For good performance we want the true positive rate to be as high, and the false positive rate as low, as possible, so we aim for a model whose ROC curve hugs the upper-left corner, maximizing the area under the curve (AUC). Our model's ROC curve has an AUC of 0.6417, and the sensitivity (TPR) is not as high as we would like, so we should try to improve the model.
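Note that yhat contains hard 0/1 labels, so the ROC above is effectively built from a single operating point. A sketch of the fuller curve obtained by passing the predicted probabilities (glm.probs) to roc instead:
#roc on predicted probabilities: traces the curve over all thresholds
prob.roc <- roc(test.data$survived, glm.probs, auc = TRUE)
plot(prob.roc, col = "blue", legacy.axes = TRUE,
     main = "ROC from predicted probabilities")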
We will now use the data to construct a new predictor variable based on a passenger's listed title (i.e. Mr., Mrs., Miss., Master).
A passenger's title encodes a variety of additional information: sex, approximate age (Master denotes a young boy), marital status (Mrs. versus Miss.), and social standing (titles such as Sir, Lady, Dr, and Rev).
Including such information in the model is interesting because it lets us look for patterns, if any, in which of these groups survived.
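Before writing a full helper, note that the title can also be pulled out with a single regular expression; a quick sketch, assuming every name follows the "Surname, Title. First" pattern:
#one-liner: capture the text between the first comma and the first period
titles <- sub("^[^,]*,\\s*([^.]*)\\..*$", "\\1", as.character(titanic_data$name))
head(trimws(titles))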
#creating a helper function for adding a title column
title.add <- function(dataset){
  #split name into last name and the remainder at the comma
  df1 <- separate(dataset, name, c('last', 'first'), sep = ',', remove = FALSE)
  #split the remainder into title and first name at the first period
  df2 <- separate(df1, first, c('title', 'firstName'), sep = '\\.', remove = FALSE)
  #removing whitespaces if any
  df2$title <- gsub("\\s", "", df2$title)
  #drop the intermediate helper columns
  drop <- c('last', 'first', 'firstName')
  return(df2[ , !(names(df2) %in% drop)])
}
#passing titanic data to function
new_data <- title.add(titanic_data)
str(new_data)
## 'data.frame': 1309 obs. of 15 variables:
## $ pclass : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ survived : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
## $ name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
## $ title : chr "Miss" "Master" "Miss" "Mr" ...
## $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : num 29 0.917 2 30 25 ...
## $ sibsp : int 0 1 1 1 1 0 1 0 2 0 ...
## $ parch : int 0 2 2 2 2 0 0 0 0 0 ...
## $ ticket : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
## $ fare : num 211 152 152 152 152 ...
## $ cabin : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
## $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
## $ boat : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
## $ body : int NA NA NA 135 NA NA NA NA NA 22 ...
## $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
#unique titles
unique(new_data$title)
## [1] "Miss" "Master" "Mr" "Mrs" "Col"
## [6] "Mme" "Dr" "Major" "Capt" "Lady"
## [11] "Sir" "Mlle" "Dona" "Jonkheer" "theCountess"
## [16] "Don" "Rev" "Ms"
#number of observations under titles
group_by(new_data, title) %>% summarize(
count_title = n())
## # A tibble: 18 × 2
## title count_title
## <chr> <int>
## 1 Capt 1
## 2 Col 4
## 3 Don 1
## 4 Dona 1
## 5 Dr 8
## 6 Jonkheer 1
## 7 Lady 1
## 8 Major 2
## 9 Master 61
## 10 Miss 260
## 11 Mlle 2
## 12 Mme 1
## 13 Mr 757
## 14 Mrs 197
## 15 Ms 2
## 16 Rev 8
## 17 Sir 1
## 18 theCountess 1
#handling special cases
#Variations of Mrs
new_data$title[new_data$title %in% c('Mme', 'Dona')] <- "Mrs"
#Variations of Miss, including Ms in this category
new_data$title[new_data$title %in% c('Mlle', 'Ms')] <- "Miss"
#new category for lady and similar titles
new_data$title[new_data$title %in% c('theCountess')] <- "Lady"
#new category for Sirs and other respected titles
new_data$title[new_data$title %in% c('Capt', 'Col',
'Don', 'Jonkheer', 'Major')] <- "Sir"
#changing to factor
new_data$title <- as.factor(new_data$title)
#creating test and train data from new data with title
set.seed(22)
# 80% of the sample size
smp_size <- floor(0.80 * nrow(new_data))
#training data index
train.index <- sample( 1: nrow(new_data), size = smp_size)
#training data
train <- new_data[train.index,]
#test data: data after filtering out the training data
test.data <- new_data[-train.index ,]
#logistic Regression model
glm.fit <- glm(survived ~ pclass + title, data = train, family = binomial)
#summary of fit
summary(glm.fit)
##
## Call:
## glm(formula = survived ~ pclass + title, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1660 -0.6473 -0.4023 0.6604 2.2602
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.44338 0.88140 -0.503 0.614938
## pclass2 -0.83323 0.23336 -3.571 0.000356 ***
## pclass3 -1.85000 0.20923 -8.842 < 2e-16 ***
## titleLady 16.00945 1455.39780 0.011 0.991223
## titleMaster 2.25217 0.94395 2.386 0.017037 *
## titleMiss 2.55404 0.90352 2.827 0.004702 **
## titleMr -0.17998 0.88973 -0.202 0.839691
## titleMrs 2.68841 0.90928 2.957 0.003110 **
## titleRev -14.28946 550.08929 -0.026 0.979276
## titleSir -0.06745 1.14464 -0.059 0.953013
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1383.64 on 1046 degrees of freedom
## Residual deviance: 953.58 on 1037 degrees of freedom
## AIC: 973.58
##
## Number of Fisher Scoring iterations: 14
#coefficients of the model
coef(glm.fit)
## (Intercept) pclass2 pclass3 titleLady titleMaster
## -0.44337973 -0.83322858 -1.85000103 16.00944798 2.25217257
## titleMiss titleMr titleMrs titleRev titleSir
## 2.55404356 -0.17998261 2.68841034 -14.28945993 -0.06744589
The estimated coefficients tell us that several of the titles are significant predictors of survival, so including them should improve the model's predictions. This is also borne out by the substantially reduced residual deviance and AIC of this model compared to the previous one.
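A sequential analysis of deviance makes the contribution of each term explicit; a quick check:
#deviance reduction contributed by pclass and title in turn
anova(glm.fit, test = "Chisq")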
#prediction of probabilities for test values
glm.probs <- predict(glm.fit, test.data, type = "response")
yhat2 <- rep(0, nrow(test.data))
yhat2[glm.probs > 0.5] <- 1
yhat2
## [1] 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 0 1 0
## [36] 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0
## [71] 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 0 0 1 0 0 1 1 0 0
## [106] 1 0 0 0 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1
## [141] 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0
## [176] 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
## [211] 1 1 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1
## [246] 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
#confusion matrix
table(yhat2 , test.data$survived)
##
## yhat2 0 1
## 0 127 31
## 1 26 78
#test error rate
mean(yhat2 != test.data$survived)
## [1] 0.2175573
#roc calculation
fit.roc2 <- roc(test.data$survived, yhat2,
auc = TRUE, thresholds = 0.5)
#plotting
plot(fit.roc2, col = "red")
##
## Call:
## roc.default(response = test.data$survived, predictor = yhat2, auc = TRUE, thresholds = 0.5)
##
## Data: yhat2 in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7728
The mean error rate has dropped from about 30% with the previous model to about 22% with the new one. From the confusion matrix we can see that 57 of the 262 test records (31 + 26) were classified incorrectly. The ROC curve also shows somewhat improved sensitivity and an AUC of 0.7728; while this is not yet a very good value, it is definitely better than the previous model. Let us look into the data that was misclassified.
#missclassified values
test.miss <- test.data
#adding predicted values
test.miss$predicted <- yhat2
test.miss$predicted <- as.factor(test.miss$predicted)
#filtering to only misclassified values
test.miss <- filter(test.miss, survived != predicted)
#titles for which misclassification occured
group_by(test.miss, title) %>% summarize(
count_title = n())
## # A tibble: 6 × 2
## title count_title
## <fctr> <int>
## 1 Dr 2
## 2 Master 2
## 3 Miss 18
## 4 Mr 26
## 5 Mrs 8
## 6 Sir 1
#pclass for which misclassification occured
group_by(test.miss, pclass) %>% summarize(
count_pclass = n())
## # A tibble: 3 × 2
## pclass count_pclass
## <fctr> <int>
## 1 1 17
## 2 2 3
## 3 3 37
The analysis tells us that most misclassification occurred for passengers titled “Miss” and “Mr” and for pclass 1 and 3. These are the largest and most heterogeneous groups, and class and title alone evidently cannot fully separate survivors from non-survivors within them.
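A cross-tabulation of title against class for the misclassified rows (using the test.miss data frame built above) makes the pattern easier to see:
#title vs pclass among misclassified test observations
with(test.miss, table(title, pclass))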
Although this model is better than the first model we created, we can try to reduce the classification error further by taking other predictors, or other models, into account.
We will now use the randomForest function to fit a random forest model with passenger class and title as predictors, and then make predictions for the test set using this model.
set.seed(33)
#fitting random forest model
forest.fit <- randomForest(survived ~ pclass + title,
data = train, importance =TRUE)
forest.fit
##
## Call:
## randomForest(formula = survived ~ pclass + title, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 20.34%
## Confusion matrix:
## 0 1 class.error
## 0 641 15 0.02286585
## 1 198 193 0.50639386
#importance of variables
importance(forest.fit)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## pclass 34.88239 31.57615 35.76548 43.38513
## title 50.47940 58.20294 57.53085 133.15669
#predicting test values
yhat3 <- predict(forest.fit , newdata = test.data)
#confusion matrix
table(yhat3 , test.data$survived)
##
## yhat3 0 1
## 0 149 55
## 1 4 54
#test error rate
mean(yhat3 != test.data$survived)
## [1] 0.2251908
#roc calculation
fit.roc3 <- roc(test.data$survived,
as.ordered(yhat3), plot = TRUE)
Now we will try to improve our random forest model. We will do so in three steps: add further plausible predictors (sex, age, and fare), impute the missing values in the training data with rfImpute, and then tune the ntree and mtry parameters.
set.seed(33)
#making a model with probable predictors
#missing values imputation
train.imputed <- rfImpute(survived ~ pclass + title +
sex + age + fare, data = train, importance =TRUE)
## ntree OOB 1 2
## 300: 19.68% 9.60% 36.57%
## ntree OOB 1 2
## 300: 18.62% 9.30% 34.27%
## ntree OOB 1 2
## 300: 19.29% 9.76% 35.29%
## ntree OOB 1 2
## 300: 17.86% 10.06% 30.95%
## ntree OOB 1 2
## 300: 18.72% 9.60% 34.02%
#new random forest fit
new.fit <- randomForest(survived ~ pclass + title +
sex + age + fare,
data = train.imputed, importance =TRUE)
#summary of fit
new.fit
##
## Call:
## randomForest(formula = survived ~ pclass + title + sex + age + fare, data = train.imputed, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 18.34%
## Confusion matrix:
## 0 1 class.error
## 0 599 57 0.08689024
## 1 135 256 0.34526854
#importance of variables
importance(new.fit)
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## pclass 22.88647 48.77495 52.29316 40.88463
## title 27.66866 22.20283 31.89233 92.82093
## sex 20.13725 16.87877 23.14630 57.03895
## age 11.56489 18.08702 22.55862 64.31046
## fare 14.64069 26.96837 30.96514 77.49999
#plot of importance
varImpPlot(new.fit, main= "Importance of predictors for new Random Forest Model")
From the analysis above we can see that all the predictors in the new model are important, so we will not remove any of them. Also, the OOB error rate of the new model has fallen to 18.34% from 20.34% for the previous model. Next we will optimize the ntree and mtry parameters.
#error rate plot
layout(matrix(c(1,2),nrow=1), width=c(4,1))
par(mar=c(5,4,4,0)) #No margin on the right side
plot(new.fit, log="y", main = "Error Rate vs Number of trees")
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1), type="n", axes=F, xlab="", ylab="")
legend("top", colnames(new.fit$err.rate),col=1:3,cex=0.8,fill=1:3)
The plot tells us that the OOB error reaches its minimum for ntree between roughly 50 and 60 and stays essentially flat as the number of trees increases further. So we will use a value of ntree in this range for our next model and try to find an optimized value of mtry.
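Rather than eyeballing the plot, the tree count with the lowest OOB error can be read directly from the fitted model's err.rate matrix; a short sketch:
#ntree value minimizing the OOB error of the 500-tree fit
best.ntree <- which.min(new.fit$err.rate[, "OOB"])
best.ntree
new.fit$err.rate[best.ntree, "OOB"]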
set.seed(33)
#optimize mtry
mtry <- tuneRF(train.imputed[,c('pclass','title','sex','age','fare')],
train.imputed$survived, ntreeTry=50,
stepFactor=1.5,improve=0.01, trace=TRUE, plot=FALSE)
## mtry = 2 OOB error = 18.34%
## Searching left ...
## Searching right ...
## mtry = 3 OOB error = 19.68%
## -0.07291667 0.01
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
#optimized mtry for least OOB
best.m
## [1] 2
set.seed(33)
#final optimized model
new.fit <- randomForest(survived ~ pclass + title +
sex + age + fare,
data = train.imputed, importance =TRUE,
ntree = 62, mtry = 2)
#final model
new.fit
##
## Call:
## randomForest(formula = survived ~ pclass + title + sex + age + fare, data = train.imputed, importance = TRUE, ntree = 62, mtry = 2)
## Type of random forest: classification
## Number of trees: 62
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 18.24%
## Confusion matrix:
## 0 1 class.error
## 0 595 61 0.0929878
## 1 130 261 0.3324808
#replacing missing age values in the test data with the mean
test.data$age[is.na(test.data$age)] <- mean(test.data$age, na.rm = TRUE)
#predicting test values
yhat4 <- predict(new.fit , newdata = test.data)
#confusion matrix
table(yhat4 , test.data$survived)
##
## yhat4 0 1
## 0 139 38
## 1 14 71
#test error rate
mean(yhat4 != test.data$survived)
## [1] 0.1984733
#roc calculation
fit.roc4 <- roc(test.data$survived, as.ordered(yhat4),
threshold = 0.5, plot = TRUE,
auc = TRUE)
The new model has an error rate of about 19.8%, the lowest of all the models.
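Collecting the test error rates reported above (values copied from the outputs, since each model was evaluated in its own section) gives a quick side-by-side view:
#test error rates of the four models, from the outputs above
err <- c(Logistic1 = 0.2977, #pclass only
         Logistic2 = 0.2176, #pclass + title
         RF1 = 0.2252, #random forest, pclass + title
         RF2 = 0.1985) #tuned random forest with imputation
sort(err)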
Now we will compare the accuracy of each of the models using ROC curves.
#plotting roc curves
plot(fit.roc, legacy.axes = TRUE, main = "ROC curves for all models")
##
## Call:
## roc.default(response = test.data$survived, predictor = yhat, auc = TRUE, thresholds = 0.5)
##
## Data: yhat in 167 controls (test.data$survived 0) < 95 cases (test.data$survived 1).
## Area under the curve: 0.6417
plot(fit.roc2, legacy.axes = TRUE, add=TRUE, col='red')
##
## Call:
## roc.default(response = test.data$survived, predictor = yhat2, auc = TRUE, thresholds = 0.5)
##
## Data: yhat2 in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7728
plot(fit.roc3, legacy.axes = TRUE, add=TRUE, col='blue')
##
## Call:
## roc.default(response = test.data$survived, predictor = as.ordered(yhat3), plot = TRUE)
##
## Data: as.ordered(yhat3) in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7346
plot(fit.roc4, legacy.axes = TRUE, add=TRUE, col='darkgreen')
##
## Call:
## roc.default(response = test.data$survived, predictor = as.ordered(yhat4), auc = TRUE, plot = TRUE, threshold = 0.5)
##
## Data: as.ordered(yhat4) in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7799
legend("bottomright", lty=c(1,1), lwd=c(2.5,2.5),
col=c("black","red", "blue", "darkgreen"),
legend= c('Logistic1', 'Logistic2','RF1', 'RF2'))
For a model to be accurate, the ROC curve should be pushed toward the top-left corner (high sensitivity and low FPR). As seen from the curves, the random forest models have the lowest FPR of all the models, and sensitivity increases as the models improve, so the final random forest model has a higher sensitivity than the first one. Hence, of all the models considered, the optimized random forest model is the best for predicting the survival of the passengers.
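The same ranking can be confirmed numerically by extracting the AUC of each fitted ROC object; a short sketch:
#AUC of each model's ROC object
sapply(list(Logistic1 = fit.roc, Logistic2 = fit.roc2,
            RF1 = fit.roc3, RF2 = fit.roc4), auc)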