Abstract

This paper evaluates the performance of different statistical learning methods for prediction. The models are fitted on the Titanic dataset, which contains demographic and other information about the passengers along with whether or not they survived. We will fit each statistical learning method on a set of training observations, measure its performance on a set of test observations, and finally compare the methods using ROC curves.

Setup

The following R packages are required:

# Load standard libraries
library(tidyverse)
library(gridExtra)
library(MASS)
library(pROC)
library(arm)
library(randomForest)
library(xgboost)

Data: Exploring and Tidying

We will use the Titanic dataset provided by Paul Hendricks (https://github.com/paulhendricks/titanic). The Titanic dataset contains data about the survival of passengers aboard the Titanic.

# Load data
titanic_data <- read.csv('titanic.csv')
str(titanic_data) # explore data structure
## 'data.frame':    1309 obs. of  14 variables:
##  $ pclass   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : int  1 1 0 0 0 1 1 0 1 0 ...
##  $ name     : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
##  $ sex      : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age      : num  29 0.917 2 30 25 ...
##  $ sibsp    : int  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch    : int  0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket   : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
##  $ fare     : num  211 152 152 152 152 ...
##  $ cabin    : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
##  $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
##  $ boat     : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
##  $ body     : int  NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
#cleaning data: casting categorical variables as factors
titanic_data$survived <- as.factor(titanic_data$survived)
titanic_data$pclass <- as.factor(titanic_data$pclass)

Description of Variables

  • pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
  • survived: Survival (0 = No; 1 = Yes)
  • name : Name
  • sex : Sex
  • age : Age
  • sibsp : Number of Siblings/Spouses Aboard
  • parch : Number of Parents/Children Aboard
  • ticket : Ticket Number
  • fare : Passenger Fare
  • cabin : Cabin
  • embarked : Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • boat : Lifeboat
  • body : Body Identification Number
  • home.dest : Home/Destination

Splitting Data

When we are trying to understand the data or infer relationships between variables, splitting the data into smaller samples makes little sense, as we need as much data as possible to find patterns. But when we move into the realm of prediction and fit a statistical model to predict events that have not yet occurred, a major concern is the quality and performance of the model. That quality can only be assessed by using the model to predict values for data it has not yet seen; comparing those predictions with the actual values then gives the prediction error. Since we cannot obtain real future data, the best substitute is to split the original dataset, use one part (training) to build the model, and use the other part to test it. Hence we first split our data into training and test datasets in an 80-20 ratio.

set.seed(2)
# 80% of the sample size
smp_size <- floor(0.80 * nrow(titanic_data))
#training data index
train.index <- sample( 1: nrow(titanic_data), size = smp_size)
#training data
train <- titanic_data[train.index,]
#test data: data after filtering out the training data
test.data <- titanic_data[-train.index ,]

In this paper the goal is to predict the survival of passengers. We will consider different models to do so.

Logistic Regression Model

Model Fitting

#logistic Regression model
glm.fit <- glm(survived ~ pclass, data = train, family = binomial)
#summary of fit
summary(glm.fit)
## 
## Call:
## glm(formula = survived ~ pclass, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3824  -0.7702  -0.7702   0.9854   1.6493  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4700     0.1275   3.687 0.000227 ***
## pclass2      -0.6689     0.1856  -3.604 0.000314 ***
## pclass3      -1.5335     0.1598  -9.598  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1397.3  on 1046  degrees of freedom
## Residual deviance: 1295.6  on 1044  degrees of freedom
## AIC: 1301.6
## 
## Number of Fisher Scoring iterations: 4
#coefficients of the model
coef(glm.fit)
## (Intercept)     pclass2     pclass3 
##   0.4700036  -0.6688545  -1.5335246

Bayesian Generalized Linear Model

#bayesian model
bayes.fit <- bayesglm(survived ~ pclass, data = train, family = binomial)

#summary of fit
summary(bayes.fit)
## 
## Call:
## bayesglm(formula = survived ~ pclass, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3784  -0.7713  -0.7713   0.9889   1.6478  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4611     0.1268   3.637 0.000276 ***
## pclass2      -0.6564     0.1844  -3.560 0.000371 ***
## pclass3      -1.5213     0.1589  -9.573  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1397.3  on 1046  degrees of freedom
## Residual deviance: 1295.6  on 1044  degrees of freedom
## AIC: 1301.6
## 
## Number of Fisher Scoring iterations: 5
#coefficients of the model
coef(bayes.fit)
## (Intercept)     pclass2     pclass3 
##   0.4611199  -0.6563755  -1.5213075

Both the Bayesian and the ordinary logistic regression model show that passenger class is highly significant (note the very low p-value for pclass3). The estimated coefficients also tell us that being a third-class passenger decreases the chance of survival by a significant amount.
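
To make these estimates more concrete, we can exponentiate the coefficients to obtain odds ratios; a minimal sketch using the fitted glm.fit from above:

#odds ratios for survival, relative to the 1st-class baseline
exp(coef(glm.fit))
#exp(-1.53) is roughly 0.22: a 3rd-class passenger's odds of survival
#are about one fifth of a 1st-class passenger's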

Performance Evaluation

Next, let’s consider the performance of this model.

We will predict the probability that a passenger survived for a given value of passenger class; setting type = "response" in the predict function returns predictions on the probability scale. We then apply a threshold of 0.5, as the class is binary.

#prediction of probabilities for test values
glm.probs <- predict(glm.fit, test.data, type = "response")
#predicting actual values based on probabilities
yhat <- rep (0 ,nrow(test.data))
yhat[glm.probs > 0.5] <- 1
yhat
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [211] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Test Error Rate:
#creating confusion matrix
table(yhat , test.data$survived)
##     
## yhat   0   1
##    0 144  55
##    1  23  40
#test error rate
mean(yhat != test.data$survived)
## [1] 0.2977099

The model has an error rate of about 29.7%, having predicted (144+40) 184 values correctly out of 262. From the confusion matrix it is clear that the number of false negatives on the test data (survived = 1 but predicted = 0) is 55, while the number of false positives (survived = 0 but predicted = 1) is 23.
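
The error rate can also be recovered directly from the confusion matrix; a minimal sketch:

#off-diagonal entries of the confusion matrix are the misclassifications
tab <- table(yhat, test.data$survived)
1 - sum(diag(tab)) / sum(tab) #same value as mean(yhat != test.data$survived)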

ROC Curve:
#roc calculation
fit.roc <- roc(test.data$survived, yhat, 
               auc = TRUE, thresholds = 0.5)
#plotting
plot(fit.roc, col = "red", legacy.axes = TRUE, 
main = "ROC curve for Logistic Regression Model")

## 
## Call:
## roc.default(response = test.data$survived, predictor = yhat,     auc = TRUE, thresholds = 0.5)
## 
## Data: yhat in 167 controls (test.data$survived 0) < 95 cases (test.data$survived 1).
## Area under the curve: 0.6417

The ROC curve plots the true positive rate against the false positive rate of the classifier. For better performance we want the true positive rate to be as high, and the false positive rate to be as low, as possible. So we try to build a model whose ROC curve is as close to the upper-left corner as possible, giving the maximum area under the curve (AUC). The ROC curve for our model has an AUC of 0.6417, and the sensitivity (TPR) is not as high as we would like it to be. So we should improve the model in order to increase these values.
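
Since yhat contains hard 0/1 labels, this ROC curve has a single operating point, and its AUC equals the average of sensitivity and specificity. A minimal sketch verifying this from the confusion matrix:

#sensitivity and specificity at the 0.5 threshold
tab <- table(yhat, test.data$survived)
sens <- tab["1", "1"] / sum(tab[, "1"]) #TPR: 40/95
spec <- tab["0", "0"] / sum(tab[, "0"]) #TNR: 144/167
c(sensitivity = sens, specificity = spec, auc = (sens + spec) / 2)
#the last value is about 0.6417, matching the AUC reported by roc()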

Data Wrangling

We will use the data to construct a new predictor variable based on a passenger's listed title (i.e. Mr., Mrs., Miss., Master).

The titles listed by the passengers encode a variety of additional information, such as:

  1. The age and sex of passengers. Although we have separate columns for these, many of their values are missing; the title can be used to estimate them.
  2. The economic background or social status of passengers (several people used Lady, the Countess, Sir, etc.).
  3. The profession of passengers (Dr, Rev, Capt, Col).
  4. The ethnic background of passengers. For instance, several passengers used Mme or Mlle, which suggests they were French.

Including such information in the model is promising, as we can then look for patterns, if any, among the passengers who survived and those who did not.

#creating a function for adding a title column extracted from the name
title.add <- function(dataset){
  df1 <- separate(dataset, name, c('last', 'first'), sep = ',', remove = FALSE)
  df2 <- separate(df1, first, c('title', 'firstName'), sep = '\\.', remove = FALSE)
  #removing whitespaces if any
  df2$title <- gsub("\\s", "", df2$title)
  #dropping the intermediate columns
  drop <- c('last', 'first', 'firstName')
  return(df2[ , !(names(df2) %in% drop)])
}

#passing titanic data to function
new_data <- title.add(titanic_data)

str(new_data)
## 'data.frame':    1309 obs. of  15 variables:
##  $ pclass   : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ survived : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 1 2 1 ...
##  $ name     : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
##  $ title    : chr  "Miss" "Master" "Miss" "Mr" ...
##  $ sex      : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age      : num  29 0.917 2 30 25 ...
##  $ sibsp    : int  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch    : int  0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket   : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
##  $ fare     : num  211 152 152 152 152 ...
##  $ cabin    : Factor w/ 187 levels "","A10","A11",..: 45 81 81 81 81 151 147 17 63 1 ...
##  $ embarked : Factor w/ 4 levels "","C","Q","S": 4 4 4 4 4 4 4 4 4 2 ...
##  $ boat     : Factor w/ 28 levels "","1","10","11",..: 13 4 1 1 1 14 3 1 28 1 ...
##  $ body     : int  NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest: Factor w/ 370 levels "","?Havana, Cuba",..: 310 232 232 232 232 238 163 25 23 230 ...
#unique titles
unique(new_data$title)
##  [1] "Miss"        "Master"      "Mr"          "Mrs"         "Col"        
##  [6] "Mme"         "Dr"          "Major"       "Capt"        "Lady"       
## [11] "Sir"         "Mlle"        "Dona"        "Jonkheer"    "theCountess"
## [16] "Don"         "Rev"         "Ms"
#number of observations under titles
group_by(new_data, title) %>% summarize(
count_title = n())
## # A tibble: 18 × 2
##          title count_title
##          <chr>       <int>
## 1         Capt           1
## 2          Col           4
## 3          Don           1
## 4         Dona           1
## 5           Dr           8
## 6     Jonkheer           1
## 7         Lady           1
## 8        Major           2
## 9       Master          61
## 10        Miss         260
## 11        Mlle           2
## 12         Mme           1
## 13          Mr         757
## 14         Mrs         197
## 15          Ms           2
## 16         Rev           8
## 17         Sir           1
## 18 theCountess           1
#handling special cases
#Variations of Mrs
new_data$title[new_data$title %in% c('Mme', 'Dona')] <- "Mrs"
#Variations of Miss, including Ms in this category
new_data$title[new_data$title %in% c('Mlle', 'Ms')] <- "Miss"
#new category for lady and similar titles
new_data$title[new_data$title %in% c('theCountess')] <- "Lady"
#new category for Sirs and other respected titles
new_data$title[new_data$title %in% c('Capt', 'Col', 
                                     'Don', 'Jonkheer', 'Major')] <- "Sir"

#changing to factor
new_data$title <- as.factor(new_data$title)
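
As a quick check, we can re-tabulate the collapsed titles; a minimal sketch:

#after collapsing there should be eight categories left:
#Dr, Lady, Master, Miss, Mr, Mrs, Rev, Sir
table(new_data$title)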

Improved Logistic Regression Model

#creating test and train data from new data with title
set.seed(22)
# 80% of the sample size
smp_size <- floor(0.80 * nrow(new_data))
#training data index
train.index <- sample( 1: nrow(new_data), size = smp_size)
#training data
train <- new_data[train.index,]
#test data: data after filtering out the training data
test.data <- new_data[-train.index ,]

#logistic Regression model
glm.fit <- glm(survived ~ pclass + title, data = train, family = binomial)
#summary of fit
summary(glm.fit)
## 
## Call:
## glm(formula = survived ~ pclass + title, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1660  -0.6473  -0.4023   0.6604   2.2602  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.44338    0.88140  -0.503 0.614938    
## pclass2       -0.83323    0.23336  -3.571 0.000356 ***
## pclass3       -1.85000    0.20923  -8.842  < 2e-16 ***
## titleLady     16.00945 1455.39780   0.011 0.991223    
## titleMaster    2.25217    0.94395   2.386 0.017037 *  
## titleMiss      2.55404    0.90352   2.827 0.004702 ** 
## titleMr       -0.17998    0.88973  -0.202 0.839691    
## titleMrs       2.68841    0.90928   2.957 0.003110 ** 
## titleRev     -14.28946  550.08929  -0.026 0.979276    
## titleSir      -0.06745    1.14464  -0.059 0.953013    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1383.64  on 1046  degrees of freedom
## Residual deviance:  953.58  on 1037  degrees of freedom
## AIC: 973.58
## 
## Number of Fisher Scoring iterations: 14
#coefficients of the model
coef(glm.fit)
##  (Intercept)      pclass2      pclass3    titleLady  titleMaster 
##  -0.44337973  -0.83322858  -1.85000103  16.00944798   2.25217257 
##    titleMiss      titleMr     titleMrs     titleRev     titleSir 
##   2.55404356  -0.17998261   2.68841034 -14.28945993  -0.06744589

The estimated coefficients tell us that some titles are strongly associated with the survived variable, so including title should improve the predictions; this is borne out by the much lower residual deviance and AIC of this model compared to the previous one. Note also the extreme estimates and huge standard errors for titleLady and titleRev: these rare titles are almost perfectly associated with a single outcome in the training data (quasi-complete separation), so their coefficients should not be interpreted literally.
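
To interpret the coefficients on the probability scale, we can map a passenger profile back through the inverse logit; a minimal sketch using the fitted glm.fit:

#predicted survival probability for a 3rd-class "Mr":
#intercept + pclass3 + titleMr on the logit scale, mapped back with plogis()
plogis(sum(coef(glm.fit)[c("(Intercept)", "pclass3", "titleMr")])) #about 0.08
#versus a 1st-class "Mrs": intercept + titleMrs
plogis(sum(coef(glm.fit)[c("(Intercept)", "titleMrs")])) #about 0.90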

Performance Evaluation

Test Error Rate and ROC:
#prediction of probabilities for test values
glm.probs <- predict(glm.fit, test.data, type = "response")

yhat2 <- rep (0 ,nrow(test.data))
yhat2[glm.probs > 0.5] <- 1
yhat2
##   [1] 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 1 0 1 1 0 0 0 0 1 0 0 1 0 1 0
##  [36] 1 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1 0 0 0 1 0
##  [71] 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1 0 0 1 0 0 1 1 0 0
## [106] 1 0 0 0 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 1
## [141] 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0
## [176] 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0
## [211] 1 1 0 0 1 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1
## [246] 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0
#confusion matrix
table(yhat2 , test.data$survived)
##      
## yhat2   0   1
##     0 127  31
##     1  26  78
#test error rate
mean(yhat2 != test.data$survived)
## [1] 0.2175573
#roc calculation
fit.roc2 <- roc(test.data$survived, yhat2, 
                auc = TRUE, thresholds = 0.5)
#plotting
plot(fit.roc2, col = "red")

## 
## Call:
## roc.default(response = test.data$survived, predictor = yhat2,     auc = TRUE, thresholds = 0.5)
## 
## Data: yhat2 in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7728

The mean error rate has been reduced from around 30% to around 22% with the new model. From the confusion matrix we can see that 57 of the 262 test records were classified incorrectly. The ROC curve also shows somewhat improved sensitivity, with an AUC of 0.7728. Although this value is not yet very good, it is definitely better than the previous model. Let us look into the data that was misclassified.

#missclassified values
test.miss <- test.data

#adding predicted values
test.miss$predicted <- yhat2
test.miss$predicted <- as.factor(test.miss$predicted)

#filtering to only misclassified values
test.miss <- filter(test.miss, survived != predicted)

#titles for which misclassification occured
group_by(test.miss, title) %>% summarize(
count_title = n())
## # A tibble: 6 × 2
##    title count_title
##   <fctr>       <int>
## 1     Dr           2
## 2 Master           2
## 3   Miss          18
## 4     Mr          26
## 5    Mrs           8
## 6    Sir           1
#pclass for which misclassification occured
group_by(test.miss, pclass) %>% summarize(
count_pclass = n())
## # A tibble: 3 × 2
##   pclass count_pclass
##   <fctr>        <int>
## 1      1           17
## 2      2            3
## 3      3           37

The analysis tells us that most misclassifications occurred for passengers with title "Miss" or "Mr" and for pclass 1 and 3. There could be three reasons behind these misclassifications:

  1. The handling of missing values, or the aggregation we did earlier for titles like "Mme" and "Mlle", is causing these errors.
  2. There are other predictors affecting the values of survived.
  3. The model chosen to fit the data is not suitable for this data.

Although this model is better than the first model we created, we can try to reduce the classification error further by taking other predictors, or other models, into account.

Random Forest Model

We will now use the randomForest function to fit a random forest model with passenger class and title as predictors, and then make predictions for the test set.

set.seed(33)
#fitting random forest model

forest.fit <- randomForest(survived ~ pclass + title, 
                           data = train, importance =TRUE)
forest.fit
## 
## Call:
##  randomForest(formula = survived ~ pclass + title, data = train,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 20.34%
## Confusion matrix:
##     0   1 class.error
## 0 641  15  0.02286585
## 1 198 193  0.50639386
#importance of variables
importance(forest.fit)
##               0        1 MeanDecreaseAccuracy MeanDecreaseGini
## pclass 34.88239 31.57615             35.76548         43.38513
## title  50.47940 58.20294             57.53085        133.15669
#predicting test values
yhat3 <- predict(forest.fit , newdata = test.data)


#confusion matrix
table(yhat3 , test.data$survived)
##      
## yhat3   0   1
##     0 149  55
##     1   4  54
#test error rate
mean(yhat3 != test.data$survived)
## [1] 0.2251908
#roc calculation
fit.roc3 <- roc(test.data$survived, 
                as.ordered(yhat3), plot = TRUE)
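
Because yhat3 holds hard class labels, this ROC curve again has a single operating point. A fuller curve could be obtained from the forest's class probabilities; a minimal sketch using forest.fit and test.data from above (fit.roc3b is just an illustrative name):

#class probabilities instead of hard labels give a smoother ROC curve
forest.probs <- predict(forest.fit, newdata = test.data, type = "prob")[, "1"]
fit.roc3b <- roc(test.data$survived, forest.probs, auc = TRUE)
fit.roc3b$auc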

Now we will try to improve our random forest model.

For improving the random forest model we will perform the following steps:

  1. Adding other probable predictors.
  2. Imputing missing values in the test and training data, replacing them with either the mean or the mode.
  3. Evaluating the importance of these predictors and removing any unimportant ones.
  4. Finding the optimal value of ntree for minimized OOB error.
  5. Finding the optimal value of mtry for minimized OOB error.
set.seed(33)
#making a model with probable predictors
#missing values imputation
train.imputed <- rfImpute(survived ~ pclass + title +
                          sex + age + fare, data = train, importance =TRUE)
## ntree      OOB      1      2
##   300:  19.68%  9.60% 36.57%
## ntree      OOB      1      2
##   300:  18.62%  9.30% 34.27%
## ntree      OOB      1      2
##   300:  19.29%  9.76% 35.29%
## ntree      OOB      1      2
##   300:  17.86% 10.06% 30.95%
## ntree      OOB      1      2
##   300:  18.72%  9.60% 34.02%
#new random forest fit
new.fit <- randomForest(survived ~ pclass + title +
                          sex + age + fare, 
                        data = train.imputed, importance =TRUE)


#summary of fit
new.fit
## 
## Call:
##  randomForest(formula = survived ~ pclass + title + sex + age +      fare, data = train.imputed, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 18.34%
## Confusion matrix:
##     0   1 class.error
## 0 599  57  0.08689024
## 1 135 256  0.34526854
#importance of variables
importance(new.fit)
##               0        1 MeanDecreaseAccuracy MeanDecreaseGini
## pclass 22.88647 48.77495             52.29316         40.88463
## title  27.66866 22.20283             31.89233         92.82093
## sex    20.13725 16.87877             23.14630         57.03895
## age    11.56489 18.08702             22.55862         64.31046
## fare   14.64069 26.96837             30.96514         77.49999
#plot of importance
varImpPlot(new.fit, main= "Importance of predictors for new Random Forest Model")

From the analysis above we can see that all the predictors in the new model are important and significant, so we will not remove any of them. Also, the OOB error rate of the new model has fallen to 18.34% from the previous model's 20.34%. Next we will optimize the ntree and mtry parameters.

#error rate plot
layout(matrix(c(1,2),nrow=1), width=c(4,1)) 
par(mar=c(5,4,4,0)) #No margin on the right side
plot(new.fit, log="y", main = "Error Rate vs Number of trees")
par(mar=c(5,0,4,2)) #No margin on the left side
plot(c(0,1), type="n", axes=F, xlab="", ylab="")
legend("top", colnames(new.fit$err.rate),col=1:3,cex=0.8,fill=1:3)

The plot tells us that the OOB error reaches its minimum for ntree between 50 and 60 and then stays roughly flat as the number of trees increases. So we will try a value in this range for ntree in our next model and find the optimized value of mtry.
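
The optimum can also be read off programmatically rather than from the plot; a minimal sketch using the fitted new.fit (err.rate has one row per tree, and which.min returns the first tree count at which the running OOB error is lowest):

#tree count with the lowest running OOB error
which.min(new.fit$err.rate[, "OOB"])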

set.seed(33)
#optimize mtry
mtry <- tuneRF(train.imputed[,c('pclass','title','sex','age','fare')],
               train.imputed$survived, ntreeTry=50,
stepFactor=1.5,improve=0.01, trace=TRUE, plot=FALSE)
## mtry = 2  OOB error = 18.34% 
## Searching left ...
## Searching right ...
## mtry = 3     OOB error = 19.68% 
## -0.07291667 0.01
best.m <- mtry[mtry[, 2] == min(mtry[, 2]), 1]

#optimized mtry for least OOB
best.m
## [1] 2
set.seed(33)
#final optimized model
new.fit <- randomForest(survived ~ pclass + title +
                          sex + age + fare, 
                        data = train.imputed, importance =TRUE, 
                        ntree = 62, mtry = 2)

#final model
new.fit
## 
## Call:
##  randomForest(formula = survived ~ pclass + title + sex + age +      fare, data = train.imputed, importance = TRUE, ntree = 62,      mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 62
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 18.24%
## Confusion matrix:
##     0   1 class.error
## 0 595  61   0.0929878
## 1 130 261   0.3324808
#replacing missing age values in the test data with the mean
test.data$age[is.na(test.data$age)] <- mean(test.data$age, na.rm = TRUE)


#predicting test values
yhat4 <- predict(new.fit , newdata = test.data)

#confusion matrix
table(yhat4 , test.data$survived)
##      
## yhat4   0   1
##     0 139  38
##     1  14  71
#test error rate
mean(yhat4 != test.data$survived)
## [1] 0.1984733
#roc calculation
fit.roc4 <- roc(test.data$survived, as.ordered(yhat4), 
                threshold = 0.5, plot = TRUE, 
                auc = TRUE)

The new model has an error rate of about 19.8%, which is the lowest of all the models.

Comparison between the Models

Now we will compare the accuracy of each of the models using ROC curves.

#plotting roc curves
plot(fit.roc, legacy.axes = TRUE, main = "ROC curves for all models")
## 
## Call:
## roc.default(response = test.data$survived, predictor = yhat,     auc = TRUE, thresholds = 0.5)
## 
## Data: yhat in 167 controls (test.data$survived 0) < 95 cases (test.data$survived 1).
## Area under the curve: 0.6417
plot(fit.roc2, legacy.axes = TRUE, add=TRUE, col='red')
## 
## Call:
## roc.default(response = test.data$survived, predictor = yhat2,     auc = TRUE, thresholds = 0.5)
## 
## Data: yhat2 in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7728
plot(fit.roc3, legacy.axes = TRUE, add=TRUE, col='blue')
## 
## Call:
## roc.default(response = test.data$survived, predictor = as.ordered(yhat3),     plot = TRUE)
## 
## Data: as.ordered(yhat3) in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7346
plot(fit.roc4, legacy.axes = TRUE, add=TRUE, col='darkgreen')
## 
## Call:
## roc.default(response = test.data$survived, predictor = as.ordered(yhat4),     auc = TRUE, plot = TRUE, threshold = 0.5)
## 
## Data: as.ordered(yhat4) in 153 controls (test.data$survived 0) < 109 cases (test.data$survived 1).
## Area under the curve: 0.7799
legend("bottomright", lty=c(1,1), lwd=c(2.5,2.5),
       col=c("black","red", "blue", "darkgreen"),
       legend= c('Logistic1', 'Logistic2','RF1', 'RF2'))

For a model to be accurate, the ROC curve should be pushed towards the top-left corner (high sensitivity and low FPR). As seen from the curves, the random forest models have the lowest FPR of all the models. Sensitivity also increases as the models improve, so the final RF model has a higher sensitivity than the first one. Hence, of all the models considered, the optimized random forest model is the best for predicting the survival of the passengers.
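
The same ranking can be read off numerically; a minimal sketch using pROC's auc() accessor (note that fit.roc was computed on the earlier train/test split, while the other three use the later one):

#AUC for each model, in the order the models were fitted
c(Logistic1 = auc(fit.roc), Logistic2 = auc(fit.roc2),
  RF1 = auc(fit.roc3), RF2 = auc(fit.roc4))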