Cars Case Study

Description: This project requires learners to understand which mode of transport employees prefer for commuting to their office. The dataset “Cars-dataset” includes information on each employee’s mode of transport as well as personal and professional details such as age, salary, and work experience. Objectives: (1) predict whether or not an employee will use Car as a mode of transport; (2) identify which variables are significant predictors of this decision.
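The chunks below assume a setup block (not echoed in the rendered output) that loads the packages whose functions are called throughout the report; a minimal sketch of what that block likely contains, with the package-to-function mapping inferred from those calls:
library(dplyr)        # select_if
library(ggplot2)      # ggplot
library(ggcorrplot)   # ggcorrplot
library(fastDummies)  # dummy_cols
library(MASS)         # stepAIC
library(car)          # vif, spreadLevelPlot, qqPlot
library(class)        # knn
library(caret)        # train, confusionMatrix, createDataPartition
library(klaR)         # backend for caret's "nb" method
library(e1071)        # naiveBayes
library(ROCR)         # prediction, performance
library(caTools)      # sample.split
library(ipred)        # bagging
library(rpart)        # rpart.control
library(gbm)          # gbm
library(DMwR)         # SMOTE
library(data.table)   # setDT
library(Matrix)       # sparse.model.matrix
library(xgboost)      # used once the sparse matrices are built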

1. Exploratory Data Analysis (EDA)

  • Import the data
cars_dt <- read.csv("Cars-dataset.csv")
str(cars_dt)
## 'data.frame':    418 obs. of  9 variables:
##  $ Age      : int  28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender   : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
##  $ Engineer : int  1 1 1 0 0 0 1 0 1 1 ...
##  $ MBA      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Work.Exp : int  5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary   : num  14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance : num  5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license  : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...
cars_dt2 = cars_dt # Dataset to be used to create a binary column for Transport by CAR.
summary(cars_dt)
##       Age           Gender       Engineer           MBA        
##  Min.   :18.00   Female:121   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:25.00   Male  :297   1st Qu.:0.2500   1st Qu.:0.0000  
##  Median :27.00                Median :1.0000   Median :0.0000  
##  Mean   :27.33                Mean   :0.7488   Mean   :0.2614  
##  3rd Qu.:29.00                3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :43.00                Max.   :1.0000   Max.   :1.0000  
##                                                NA's   :1       
##     Work.Exp          Salary          Distance        license      
##  Min.   : 0.000   Min.   : 6.500   Min.   : 3.20   Min.   :0.0000  
##  1st Qu.: 3.000   1st Qu.: 9.625   1st Qu.: 8.60   1st Qu.:0.0000  
##  Median : 5.000   Median :13.000   Median :10.90   Median :0.0000  
##  Mean   : 5.873   Mean   :15.418   Mean   :11.29   Mean   :0.2033  
##  3rd Qu.: 8.000   3rd Qu.:14.900   3rd Qu.:13.57   3rd Qu.:0.0000  
##  Max.   :24.000   Max.   :57.000   Max.   :23.40   Max.   :1.0000  
##                                                                    
##             Transport  
##  2Wheeler        : 83  
##  Car             : 35  
##  Public Transport:300  
##                        
##                        
##                        
## 
  • There are 418 observations and 9 variables in the dataset: the employee’s age, gender, whether the employee holds an engineering position, whether the employee has an MBA, years of work experience, salary (given as an hourly rate), commuting distance from home to work, whether the employee holds a driving license, and the mode of transport.

1.2 Missing Value Treatment

## [1] 1
##       Age    Gender  Engineer       MBA  Work.Exp    Salary  Distance 
##         0         0         0         1         0         0         0 
##   license Transport 
##         0         0
## [1] 0
  • MBA has 1 missing value.
  • Since there are 308 “0” entries compared to 109 “1” entries, the mode (0) is used to replace the missing value in the MBA variable (a sketch of the presumed treatment follows).
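The output above comes from a chunk that is not echoed; a sketch of the presumed treatment (counting the missing values, mode-imputing MBA, and recoding the 0/1 indicators as labelled factors, consistent with the summary that follows) is:
sum(is.na(cars_dt))                  # total number of missing values -> 1
colSums(is.na(cars_dt))              # missing values per column -> the single NA is in MBA
cars_dt$MBA[is.na(cars_dt$MBA)] <- 0 # impute with the mode (0)
sum(is.na(cars_dt))                  # -> 0
# Recode the 0/1 indicator columns as labelled factors (matches the summary below)
cars_dt$Engineer <- factor(cars_dt$Engineer, levels = c(0, 1), labels = c("No", "Yes"))
cars_dt$MBA      <- factor(cars_dt$MBA,      levels = c(0, 1), labels = c("No", "Yes"))
cars_dt$license  <- factor(cars_dt$license,  levels = c(0, 1), labels = c("No", "Yes"))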
summary(cars_dt) 
##       Age           Gender    Engineer   MBA         Work.Exp     
##  Min.   :18.00   Female:121   No :105   No :309   Min.   : 0.000  
##  1st Qu.:25.00   Male  :297   Yes:313   Yes:109   1st Qu.: 3.000  
##  Median :27.00                                    Median : 5.000  
##  Mean   :27.33                                    Mean   : 5.873  
##  3rd Qu.:29.00                                    3rd Qu.: 8.000  
##  Max.   :43.00                                    Max.   :24.000  
##      Salary          Distance     license              Transport  
##  Min.   : 6.500   Min.   : 3.20   No :333   2Wheeler        : 83  
##  1st Qu.: 9.625   1st Qu.: 8.60   Yes: 85   Car             : 35  
##  Median :13.000   Median :10.90             Public Transport:300  
##  Mean   :15.418   Mean   :11.29                                   
##  3rd Qu.:14.900   3rd Qu.:13.57                                   
##  Max.   :57.000   Max.   :23.40

Employee’s Background

  • The youngest employee is 18 years old and the oldest is 43.

  • The minimum work experience is less than a year and the maximum is 24 years, with a mean of 5.87 years.

  • Salary ranges from an hourly rate of $6.5 to $57, with a mean of about $15.4; the large gap between the upper quartile and the maximum suggests there are outliers in the Salary variable.

ggplot(cars_dt, aes(x = Work.Exp, 
                     y = Salary, 
                     color = Gender)) +
  geom_point(size = 2, 
             alpha = .6) +
  labs(title = "Work Experience and Salary")

  • The diagram above shows the relationship between work experience and salary (hourly rate) for males and females: more work experience is associated with a higher salary, and there are more male than female employees.

  • The plots above show that there are more males (297) than females (121) in the dataset. Of the 313 engineers, more are male than female. The plots also show that the single missing MBA value belonged to a female employee.

Illustrate the insights based on EDA.

  • The plot above shows the relationship between commuting distance and salary: employees with a longer commute and a higher salary tend to drive a car, while those with a lower salary tend to take public transport or a 2Wheeler.
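The chunk that produced that scatter plot is not echoed; a plot along these lines would reproduce it (an assumed reconstruction, not the original code):
ggplot(cars_dt, aes(x = Distance, y = Salary, color = Transport)) +
  geom_point(size = 2, alpha = .6) +
  labs(title = "Distance vs Salary by Mode of Transport")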
df <- dplyr::select_if(cars_dt, is.numeric)
r <- cor(df, use="complete.obs")
round(r,2)
##           Age Work.Exp Salary Distance
## Age      1.00     0.92   0.86     0.38
## Work.Exp 0.92     1.00   0.93     0.39
## Salary   0.86     0.93   1.00     0.48
## Distance 0.38     0.39   0.48     1.00
ggcorrplot(r, hc.order = TRUE, 
                  type = "lower",
                   lab = TRUE)

  • The corrplot above shows high correlations of Age with Work Experience and Salary. The diagrams in figures H1 and H2 are consistent with the correlation information in Figure 1: employees who use Car as a mode of transport tend to commute more than about 14 miles and earn an hourly salary greater than about $34. Therefore, we will split the dataset into 2 different sets: train_dt with Distance less than 14 and test_dt with Distance greater than or equal to 14.
ggplot(cars_dt, aes(x = Age)) +
  geom_histogram(color = "white",
                 fill = "cornflowerblue") +
  facet_grid(Gender ~ Transport) +
  labs(title = "Fig. H1: Age histograms by Gender and Type of Transportation",
       x = "Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(cars_dt, aes(x = Distance)) +
  geom_histogram(color = "white",
                 fill = "cornflowerblue") +
  facet_grid(Gender ~ Transport) +
  labs(title = "Fig. H2: Distance histograms by Gender and Type of Transportation",
       x = "Distance")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The figure above shows that most males and females take Public Transportation for commutes of roughly 5 to 18 miles. Only small counts (fewer than 8 per bin) of both males and females commuting farther than about 13 miles use a car, and fewer than 7 per bin of either gender, over distances of roughly 4 to 22 miles, use a 2Wheeler.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The Salary histograms above show that fewer than 10 male and female employees earn more than $30 per hour, and it is within this group that car usage is concentrated. The majority of both males and females earning less than $20 per hour commute by Public Transportation, and fewer than 18 of either gender, earning under $18 per hour, commute by 2Wheeler.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • Figure H3 shows that the majority of employees who use public transport do not hold a driving license.
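The chunk behind Fig. H3 is likewise not echoed; a stacked bar chart along these lines would show the license split by transport mode (an assumed reconstruction):
ggplot(cars_dt, aes(x = Transport, fill = license)) +
  geom_bar(position = "fill") +
  labs(title = "Fig. H3: License share by Type of Transportation", y = "Proportion")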

What is the most challenging aspect of this problem? What method will you use to deal with this?

  • [1] Recall that the objective of the analysis is to predict whether or not an employee will use Car as a mode of transport; the most challenging aspect of this problem is that Transport currently has 3 levels: 2Wheeler, Car, and Public Transport.

  • [2] The given dataset has 5 categorical variables (Gender, Engineer, MBA, license, and Transport type).

What method will you use to deal with this?

  • [1] Add a binary “Car” column that collapses Transport into 2 options: 1 for Car and 0 for 2Wheeler or Public Transport.
  • [2] Convert Engineer, MBA, and license to factor variables.

2. Data Preparation

2.1 Create a binary column “Car” in the “cars_dt” dataset to predict whether or not an employee will use Car as a mode of transport.

cars_dt2$Transport = dummy_cols(cars_dt2$Transport, 
                            select_columns = NULL,
                        remove_first_dummy = FALSE,
                remove_most_frequent_dummy = FALSE,
                                 ignore_na = FALSE,
                                     split = FALSE,
                   remove_selected_columns = FALSE)

cars_dt$Car = cars_dt2$Transport$.data_Car
str(cars_dt)
## 'data.frame':    418 obs. of  10 variables:
##  $ Age      : int  28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender   : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
##  $ Engineer : Factor w/ 2 levels "No","Yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ MBA      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Work.Exp : int  5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary   : num  14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance : num  5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Transport: Factor w/ 3 levels "2Wheeler","Car",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Car      : int  0 0 0 0 0 0 0 0 0 0 ...
  • A binary Car column has been added to the dataset (1: Car, 0: otherwise).
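An equivalent, simpler way to derive the same binary column, shown here only as an alternative to the dummy_cols route above:
# Same result without fastDummies: 1 when Transport is "Car", 0 otherwise
cars_dt$Car <- ifelse(cars_dt$Transport == "Car", 1L, 0L)
table(cars_dt$Car) # expect 383 zeros and 35 ones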

2.2 Outlier Treatment and Normality assumption testing

  • Dependent Variable (Transport Type): Car
2.2a Check outliers for Age
# Age variable - Percentile distribution
summary(cars_dt$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   25.00   27.00   27.33   29.00   43.00
  • The difference between the minimum and the first quartile is 7, the differences between the first quartile and the median and between the mean and the third quartile are about 2, and the gap between the third quartile and the maximum is 14; this suggests there are outliers in the Age variable.
quantile(cars_dt$Age,c(0.01,0.02,0.03,0.1,0.2,0.3,0.4,0.50,0.6,0.7,0.8,0.9,0.95,0.99,1))
##   1%   2%   3%  10%  20%  30%  40%  50%  60%  70%  80%  90%  95%  99% 100% 
## 20.0 21.0 21.0 23.0 24.0 25.0 26.0 27.0 28.0 28.0 30.0 32.3 37.0 40.0 43.0
  • Notice that below the 60th percentile the consecutive differences (60%−50%, 50%−40%, etc.) are about 1 year, whereas above the 60th percentile they widen to roughly 2 to 5 years; this indicates the existence of a few outlier observations at the upper end of this variable.
2.2.A1: Density plot for AGE.
par(mfrow = c(1,2))
plot(density(cars_dt$Age), main="Figure 2.2a: Age") # Looks right-skewed
boxplot(cars_dt$Age,
          main = "Employee's Age", 
          xlab = "Age in Years", 
          ylab = "",
          col = "seagreen1",
          border = "black", horizontal = TRUE, notch = FALSE)

  • The density plot in Figure 2.2a shows that Age is right-skewed, with outliers above roughly age 35 (beyond the 90th percentile).
2.2.A2 (ggplot plot): Histogram for Age.
  • Linear Regression Assumption checking - Normality test
ggplot(cars_dt,aes(Age)) + geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The ggplot histogram likewise shows a right-skewed distribution.
2.2.A3: Q-Q plot for normality for Age
qqnorm(cars_dt$Age) 

  • Overall, the three plots above indicate that Age is approximately normally distributed apart from a few outliers above age 35.

Outlier Treatment for variable: Age

cars_dt$Age[which(cars_dt$Age < 20)] <- 20 # flooring at the 1% quantile
cars_dt$Age[which(cars_dt$Age > 35)] <- 35 # capping at 35, based on the density plot
2.2b Check outliers for Work Experience
# Work.Exp variable - Percentile distribution
summary(cars_dt$Work.Exp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   5.000   5.873   8.000  24.000
  • The difference between the minimum and the first quartile is 3, the differences between the first quartile and the median and between the mean and the third quartile are about 2, and the gap between the third quartile and the maximum is 16; this suggests there are outliers in the Work Experience variable.
quantile(cars_dt$Work.Exp, c(0.01,0.02,0.03,0.1,0.2,0.3,0.4,0.50,0.6,0.7,0.8,0.9,0.95,0.99,1))
##   1%   2%   3%  10%  20%  30%  40%  50%  60%  70%  80%  90%  95%  99% 100% 
##    0    0    0    1    2    3    4    5    6    7    8   12   18   21   24
  • Notice that between the 10th and 80th percentiles the consecutive differences are about 1 year each, whereas above the 80th percentile they jump to roughly 4 to 6 years; this indicates the existence of a few outlier observations in the Work Experience variable.
2.2.B1: Density plot for Work Experiences.

  • The density plot in Figure 2.2b shows that Work Experience is right-skewed, with outliers above about 12 years.
2.2.B2 (ggplot plot): Histogram for Work Experiences.
  • Linear Regression Assumption checking - Normality test
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The ggplot histogram likewise shows a right-skewed distribution.
2.2.B3: Q-Q plot for normality for Work Experiences.

  • Overall, the three plots above indicate that Work Experience is approximately normally distributed apart from a few outliers above about 12 years of experience.

Outlier Treatment for variable: Work Experience

cars_dt$Work.Exp[which(cars_dt$Work.Exp > 15)] <- 15 # capping at 15 years
2.2c Check outliers for Salary
# Salary variable - Percentile distribution
summary(cars_dt$Salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.500   9.625  13.000  15.418  14.900  57.000
  • The difference between the minimum and the first quartile is about 3.1, the difference between the first quartile and the median is 3.375, and the mean (15.418) even exceeds the third quartile (14.9). The gap between the third quartile and the maximum is 42.1; this strongly suggests there are outliers in the Salary variable.
quantile(cars_dt$Salary, c(0.01,0.02,0.03,0.1,0.2,0.3,0.4,0.50,0.6,0.7,0.8,0.9,0.95,0.99,1))
##     1%     2%     3%    10%    20%    30%    40%    50%    60%    70% 
##  6.800  6.800  6.900  8.470  8.900 10.500 12.380 13.000 13.900 14.700 
##    80%    90%    95%    99%   100% 
## 15.800 28.730 41.915 51.000 57.000
  • Notice that up to the 80th percentile the consecutive differences are small (at most about 2), while the jumps from the 80th to the 90th percentile (about 12.9) and from the 90th to the 95th percentile (about 13.2) are much larger; this indicates the existence of a few outlier observations in the Salary variable.
2.2.C1: Density plot for Salary.

  • The density plot in Figure 2.2c shows that Salary is right-skewed, with outliers above about 28.73 (the 90th percentile).
2.2.C2 (ggplot plot): Histogram for Salary.
  • Linear Regression Assumption checking - Normality test
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The ggplot histogram likewise shows a right-skewed distribution.
2.2.C3: Q-Q plot for normality for Salary.

  • Overall, the three plots above indicate that Salary is approximately normally distributed apart from a few outliers at the upper end, above about $22.81 per hour (Q3 + 1.5*IQR).

Outlier Treatment for variable: Salary

cars_dt$Salary[which(cars_dt$Salary > 22.8125)] <- 22.8125 #14.9 + 1.5*IQR [IQR = Q3 - Q1]
2.2d Check outliers for Distance
# Distance variable - Percentile distribution
summary(cars_dt$Distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.20    8.60   10.90   11.29   13.57   23.40
  • The difference between the minimum and the first quartile is about 5.4, the difference between the first quartile and the median is 2.3, and the difference between the mean and the third quartile is 2.28. The gap between the third quartile and the maximum is 9.83, so there may be a few mild outliers in the Distance variable.
quantile(cars_dt$Distance, c(0.01,0.02,0.03,0.1,0.2,0.3,0.4,0.50,0.6,0.7,0.8,0.9,0.95,0.99,1))
##     1%     2%     3%    10%    20%    30%    40%    50%    60%    70% 
##  4.634  5.134  5.451  7.000  8.100  9.000  9.880 10.900 11.900 12.900 
##    80%    90%    95%    99%   100% 
## 14.300 16.330 17.915 21.383 23.400
  • The consecutive differences between the percentiles above are fairly even, so Distance appears close to normally distributed, with at most a few mild outliers at the upper end.
2.2.D1: Density plot for Distance.

  • The density plot in Figure 2.2d shows that the dataset for Distance is normally distributed with a few outliers.
2.2.D2 (ggplot plot): Histogram for Distance.
  • Linear Regression Assumption checking - Normality test
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The ggplot histogram likewise shows that Distance is approximately normally distributed with a few outliers.
2.2.D3: Q-Q plot for normality for Distance.

Outlier Treatment for variable: Distance

cars_dt$Distance[which(cars_dt$Distance > 21.383)] <- 21.383 #capping at 99% quantile
  • Overall, the three plots above confirm that Distance is approximately normally distributed, with a few outliers above 21.025 miles (Q3 + 1.5*IQR).

Density Plots and Boxplots after Outlier Treatment

Overview of the Cleaned Data to be used for the Analysis:

cars_dt$Transport <- NULL
summary(cars_dt)
##       Age           Gender    Engineer   MBA         Work.Exp     
##  Min.   :20.00   Female:121   No :105   No :309   Min.   : 0.000  
##  1st Qu.:25.00   Male  :297   Yes:313   Yes:109   1st Qu.: 3.000  
##  Median :27.00                                    Median : 5.000  
##  Mean   :27.13                                    Mean   : 5.577  
##  3rd Qu.:29.00                                    3rd Qu.: 8.000  
##  Max.   :35.00                                    Max.   :15.000  
##      Salary          Distance     license        Car         
##  Min.   : 6.500   Min.   : 3.20   No :333   Min.   :0.00000  
##  1st Qu.: 9.625   1st Qu.: 8.60   Yes: 85   1st Qu.:0.00000  
##  Median :13.000   Median :10.90             Median :0.00000  
##  Mean   :13.524   Mean   :11.28             Mean   :0.08373  
##  3rd Qu.:14.900   3rd Qu.:13.57             3rd Qu.:0.00000  
##  Max.   :22.812   Max.   :21.38             Max.   :1.00000
str(cars_dt)
## 'data.frame':    418 obs. of  9 variables:
##  $ Age     : num  28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender  : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
##  $ Engineer: Factor w/ 2 levels "No","Yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ MBA     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Work.Exp: num  5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary  : num  14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance: num  5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Car     : int  0 0 0 0 0 0 0 0 0 0 ...

3.1 Modeling

Create multiple models and explore how each model perform using appropriate model performance metrics.

3.1a Basic Linear Regression Model

basic_model <- lm(Car ~ ., data = cars_dt)
summary(basic_model)
## 
## Call:
## lm(formula = Car ~ ., data = cars_dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53913 -0.10189 -0.00929  0.09758  0.68348 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.262310   0.140755  -1.864  0.06309 .  
## Age         -0.001102   0.005811  -0.190  0.84971    
## GenderMale  -0.015423   0.020159  -0.765  0.44467    
## EngineerYes  0.013194   0.020477   0.644  0.51973    
## MBAYes      -0.025589   0.020411  -1.254  0.21068    
## Work.Exp     0.048363   0.007286   6.637 1.02e-10 ***
## Salary      -0.014426   0.004648  -3.104  0.00204 ** 
## Distance     0.024646   0.002632   9.364  < 2e-16 ***
## licenseYes   0.152429   0.024503   6.221 1.22e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1805 on 409 degrees of freedom
## Multiple R-squared:  0.5846, Adjusted R-squared:  0.5765 
## F-statistic: 71.95 on 8 and 409 DF,  p-value: < 2.2e-16
  • Basic Model Insights: The basic linear regression above shows that Work Experience, Distance, and license are highly significant (p < 0.001) in determining whether an employee will use Car as a mode of transport, and Salary is also significant (p = 0.00204). Thus, the Car prediction equation from basic_model is

  • Car = Work.Exp + Distance + license[Yes] + Salary, with Multiple R-squared: 0.5846 and Adjusted R-squared: 0.5765 (a worked prediction example follows below).
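To see how such a fitted equation is used, the score for a hypothetical employee (values chosen purely for illustration; not part of the original report) can be obtained with predict():
# Hypothetical employee; a fitted value near or above 0.5 leans towards Car usage
new_emp <- data.frame(Age = 30, Gender = "Male", Engineer = "Yes", MBA = "No",
                      Work.Exp = 10, Salary = 20, Distance = 15, license = "Yes")
predict(basic_model, newdata = new_emp)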

Divide the data 70:30 into training and test sets.

set.seed(100)
indices= sample(1:nrow(cars_dt), 0.7*nrow(cars_dt))
train=cars_dt[indices,] # Training set
test = cars_dt[-indices,] # Test set

Model 1: Develop the first model

model_1 <-lm(Car~.,data=train)
summary(model_1)
## 
## Call:
## lm(formula = Car ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49311 -0.09707 -0.00881  0.09041  0.56110 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.117566   0.160133  -0.734   0.4634    
## Age         -0.007337   0.006596  -1.112   0.2669    
## GenderMale  -0.004811   0.023252  -0.207   0.8362    
## EngineerYes  0.027281   0.023600   1.156   0.2487    
## MBAYes      -0.055129   0.023268  -2.369   0.0185 *  
## Work.Exp     0.047106   0.008108   5.810 1.68e-08 ***
## Salary      -0.011665   0.004994  -2.336   0.0202 *  
## Distance     0.022460   0.003030   7.413 1.43e-12 ***
## licenseYes   0.194199   0.029159   6.660 1.42e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1707 on 283 degrees of freedom
## Multiple R-squared:  0.6106, Adjusted R-squared:  0.5996 
## F-statistic: 55.48 on 8 and 283 DF,  p-value: < 2.2e-16
  • Model 1 Insights: Car = Work.Exp + Distance + License[Yes] + MBA[Yes] + Salary with Multiple R-squared: 0.611 and Adjusted R-squared: 0.597.

Model 2: Applying stepwise method for variable reduction with direction = “both”.

model_2 <- stepAIC(model_1, direction="both")
## Start:  AIC=-1023.42
## Car ~ Age + Gender + Engineer + MBA + Work.Exp + Salary + Distance + 
##     license
## 
##            Df Sum of Sq    RSS      AIC
## - Gender    1   0.00125 8.2514 -1025.38
## - Age       1   0.03607 8.2862 -1024.15
## - Engineer  1   0.03896 8.2891 -1024.05
## <none>                  8.2502 -1023.42
## - Salary    1   0.15907 8.4092 -1019.85
## - MBA       1   0.16365 8.4138 -1019.69
## - Work.Exp  1   0.98397 9.2341  -992.52
## - license   1   1.29308 9.5432  -982.91
## - Distance  1   1.60222 9.8524  -973.60
## 
## Step:  AIC=-1025.38
## Car ~ Age + Engineer + MBA + Work.Exp + Salary + Distance + license
## 
##            Df Sum of Sq    RSS      AIC
## - Age       1   0.03740 8.2888 -1026.06
## - Engineer  1   0.03852 8.2899 -1026.02
## <none>                  8.2514 -1025.38
## + Gender    1   0.00125 8.2502 -1023.42
## - Salary    1   0.15795 8.4094 -1021.84
## - MBA       1   0.17241 8.4238 -1021.34
## - Work.Exp  1   0.98550 9.2369  -994.44
## - license   1   1.31693 9.5683  -984.14
## - Distance  1   1.60352 9.8549  -975.52
## 
## Step:  AIC=-1026.06
## Car ~ Engineer + MBA + Work.Exp + Salary + Distance + license
## 
##            Df Sum of Sq    RSS      AIC
## - Engineer  1   0.03491 8.3237 -1026.83
## <none>                  8.2888 -1026.06
## + Age       1   0.03740 8.2514 -1025.38
## + Gender    1   0.00258 8.2862 -1024.15
## - Salary    1   0.14628 8.4351 -1022.95
## - MBA       1   0.15175 8.4406 -1022.76
## - license   1   1.28035 9.5692  -986.12
## - Work.Exp  1   1.35912 9.6479  -983.72
## - Distance  1   1.59193 9.8807  -976.76
## 
## Step:  AIC=-1026.83
## Car ~ MBA + Work.Exp + Salary + Distance + license
## 
##            Df Sum of Sq    RSS      AIC
## <none>                  8.3237 -1026.83
## + Engineer  1   0.03491 8.2888 -1026.06
## + Age       1   0.03378 8.2899 -1026.02
## + Gender    1   0.00190 8.3218 -1024.90
## - MBA       1   0.14336 8.4671 -1023.85
## - Salary    1   0.14675 8.4705 -1023.73
## - license   1   1.28164 9.6054  -987.01
## - Work.Exp  1   1.36968 9.6934  -984.35
## - Distance  1   1.62322 9.9469  -976.81
summary(model_2) #Summary 
## 
## Call:
## lm(formula = Car ~ MBA + Work.Exp + Salary + Distance + license, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48042 -0.09364 -0.01102  0.08631  0.58087 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.273580   0.048309  -5.663 3.62e-08 ***
## MBAYes      -0.050220   0.022628  -2.219   0.0272 *  
## Work.Exp     0.041107   0.005992   6.860 4.25e-11 ***
## Salary      -0.011145   0.004963  -2.246   0.0255 *  
## Distance     0.022512   0.003014   7.468 9.90e-13 ***
## licenseYes   0.188594   0.028420   6.636 1.61e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1706 on 286 degrees of freedom
## Multiple R-squared:  0.6072, Adjusted R-squared:  0.6003 
## F-statistic:  88.4 on 5 and 286 DF,  p-value: < 2.2e-16
Let’s check for multicollinearity (VIF).
vif(model_2) 
##      MBA Work.Exp   Salary Distance  license 
## 1.005681 5.795040 5.604226 1.234974 1.238849
  • Model 2 Insights: Car = MBA + Work.Exp + Salary + Distance + license with Multiple R-squared: 0.607 and Adjusted R-squared: 0.600.

Model 3: Applying stepwise method for variable reduction with direction = “backward”.

model_3 <- stepAIC(model_1, direction="backward", k=5)
## Start:  AIC=-996.42
## Car ~ Age + Gender + Engineer + MBA + Work.Exp + Salary + Distance + 
##     license
## 
##            Df Sum of Sq    RSS      AIC
## - Gender    1   0.00125 8.2514 -1001.38
## - Age       1   0.03607 8.2862 -1000.15
## - Engineer  1   0.03896 8.2891 -1000.05
## <none>                  8.2502  -996.42
## - Salary    1   0.15907 8.4092  -995.85
## - MBA       1   0.16365 8.4138  -995.69
## - Work.Exp  1   0.98397 9.2341  -968.52
## - license   1   1.29308 9.5432  -958.91
## - Distance  1   1.60222 9.8524  -949.60
## 
## Step:  AIC=-1001.38
## Car ~ Age + Engineer + MBA + Work.Exp + Salary + Distance + license
## 
##            Df Sum of Sq    RSS      AIC
## - Age       1   0.03740 8.2888 -1005.06
## - Engineer  1   0.03852 8.2899 -1005.02
## <none>                  8.2514 -1001.38
## - Salary    1   0.15795 8.4094 -1000.84
## - MBA       1   0.17241 8.4238 -1000.34
## - Work.Exp  1   0.98550 9.2369  -973.44
## - license   1   1.31693 9.5683  -963.14
## - Distance  1   1.60352 9.8549  -954.52
## 
## Step:  AIC=-1005.06
## Car ~ Engineer + MBA + Work.Exp + Salary + Distance + license
## 
##            Df Sum of Sq    RSS      AIC
## - Engineer  1   0.03491 8.3237 -1008.83
## <none>                  8.2888 -1005.06
## - Salary    1   0.14628 8.4351 -1004.95
## - MBA       1   0.15175 8.4406 -1004.76
## - license   1   1.28035 9.5692  -968.12
## - Work.Exp  1   1.35912 9.6479  -965.72
## - Distance  1   1.59193 9.8807  -958.76
## 
## Step:  AIC=-1008.83
## Car ~ MBA + Work.Exp + Salary + Distance + license
## 
##            Df Sum of Sq    RSS      AIC
## - MBA       1   0.14336 8.4671 -1008.85
## <none>                  8.3237 -1008.83
## - Salary    1   0.14675 8.4705 -1008.73
## - license   1   1.28164 9.6054  -972.01
## - Work.Exp  1   1.36968 9.6934  -969.35
## - Distance  1   1.62322 9.9469  -961.81
## 
## Step:  AIC=-1008.85
## Car ~ Work.Exp + Salary + Distance + license
## 
##            Df Sum of Sq     RSS      AIC
## <none>                   8.4671 -1008.85
## - Salary    1   0.14805  8.6151 -1008.78
## - license   1   1.26341  9.7305  -973.23
## - Work.Exp  1   1.34934  9.8164  -970.67
## - Distance  1   1.63474 10.1018  -962.30
summary(model_3) #Summary
## 
## Call:
## lm(formula = Car ~ Work.Exp + Salary + Distance + license, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51647 -0.09701 -0.00457  0.09269  0.59671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.285140   0.048355  -5.897 1.04e-08 ***
## Work.Exp     0.040789   0.006031   6.763 7.57e-11 ***
## Salary      -0.011194   0.004997  -2.240   0.0258 *  
## Distance     0.022590   0.003035   7.444 1.15e-12 ***
## licenseYes   0.187203   0.028607   6.544 2.75e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1718 on 287 degrees of freedom
## Multiple R-squared:  0.6004, Adjusted R-squared:  0.5948 
## F-statistic: 107.8 on 4 and 287 DF,  p-value: < 2.2e-16
vif(model_3) 
## Work.Exp   Salary Distance  license 
## 5.791724 5.604115 1.234806 1.238247
  • Model 3 Insights: Car = Work.Exp + Salary + Distance + license, with Multiple R-squared: 0.600 and Adjusted R-squared: 0.595.

Based on the above 4 models, Model 2: Car = MBA + Work.Exp + Salary + Distance + license is the final model for predicting future outcomes.

3.1b Model Prediction on test data

Apply the final model (model_2) to the test dataset.

Predict1 <- predict(model_2,test[,-1])
# Add the predictions as a new column "test_Car" in the test dataset
test$test_Car <- Predict1
#3.0B Model Metrics Evaluation
#############################
# test R2 value
###############
cor(test$Car,test$test_Car)
## [1] 0.7212118
cor(test$Car,test$test_Car)^2
## [1] 0.5201464
cor(train$Car,model_2$fitted.values)^2
## [1] 0.6071559
  • The test R² (0.52), computed as the squared correlation between actual and predicted values, is slightly lower than the train R² (0.607).
  • Since the R² values of train and test are reasonably close, the model does not appear to be overfitted to the data (a supplementary RMSE check is sketched below).
  • Since the given dataset is small, Homoscedasticity Testing is excluded.
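As an additional check not in the original report, the test-set RMSE of model_2 can be computed directly from the predictions added above:
# Root mean squared error of model_2 on the held-out 30%
sqrt(mean((test$Car - test$test_Car)^2))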

Residual diagnostics: spread-level plot and Q-Q plot of the studentized residuals

par(mfrow = c(1,2))
spreadLevelPlot(model_2) # It is approx equal variance distribution 
## 
## Suggested power transformation:  -0.001437018
qqPlot(model_2, main="QQ Plot") # 2.Normality of Residuals from qq plot for studentized resid

## 98 95 
##  9 35
  • The studentized residuals appear approximately normally distributed.

Model Coefficients for deployment

print(round(model_2$coefficients, 3))
## (Intercept)      MBAYes    Work.Exp      Salary    Distance  licenseYes 
##      -0.274      -0.050       0.041      -0.011       0.023       0.189

Final Model (model_2): Linear Regression interpretation

    1. Car usage is inversely related to having an MBA (holding an MBA decreases the predicted Car score by 0.050).
    2. Similarly, hourly Salary is inversely related to Car usage (a 1-unit increase in Salary decreases the predicted Car score by 0.011).
    3. Car usage is more likely as Work Experience, Distance, and holding a driving license increase, with coefficients of 0.041, 0.023, and 0.189 respectively.

3.2 kNN

Step 1: Converting the factor variables to numeric (0/1) codes.
car_dt = cars_dt
car_dt$Gender <- as.integer(ifelse(car_dt$Gender == "Male",1,0))
car_dt$Engineer <- as.integer(ifelse(car_dt$Engineer == "Yes",1,0))
car_dt$MBA <- as.integer(ifelse(car_dt$MBA == "Yes",1,0))
car_dt$license <- as.integer(ifelse(car_dt$license == "Yes",1,0))
str(car_dt)
## 'data.frame':    418 obs. of  9 variables:
##  $ Age     : num  28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender  : int  1 1 0 1 0 1 1 1 1 1 ...
##  $ Engineer: int  1 1 1 0 0 0 1 0 1 1 ...
##  $ MBA     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Work.Exp: num  5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary  : num  14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance: num  5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Car     : int  0 0 0 0 0 0 0 0 0 0 ...
head(car_dt)
##   Age Gender Engineer MBA Work.Exp Salary Distance license Car
## 1  28      1        1   0        5   14.4      5.1       0   0
## 2  24      1        1   0        6   10.6      6.1       0   0
## 3  27      0        1   0        9   15.5      6.1       0   0
## 4  25      1        0   0        1    7.6      6.3       0   0
## 5  25      0        0   0        3    9.6      6.7       0   0
## 6  21      1        0   0        3    9.5      7.1       0   0
Step 2: Data Normalization
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
car.n <- as.data.frame(lapply(car_dt[-c(9)], normalize))
str(car.n)
## 'data.frame':    418 obs. of  8 variables:
##  $ Age     : num  0.533 0.267 0.467 0.333 0.333 ...
##  $ Gender  : num  1 1 0 1 0 1 1 1 1 1 ...
##  $ Engineer: num  1 1 1 0 0 0 1 0 1 1 ...
##  $ MBA     : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ Work.Exp: num  0.3333 0.4 0.6 0.0667 0.2 ...
##  $ Salary  : num  0.4843 0.2513 0.5517 0.0674 0.19 ...
##  $ Distance: num  0.104 0.159 0.159 0.17 0.192 ...
##  $ license : num  0 0 0 0 0 0 0 0 0 1 ...
head(car.n)
##          Age Gender Engineer MBA   Work.Exp     Salary  Distance license
## 1 0.53333333      1        1   0 0.33333333 0.48429119 0.1044932       0
## 2 0.26666667      1        1   0 0.40000000 0.25134100 0.1594896       0
## 3 0.46666667      0        1   0 0.60000000 0.55172414 0.1594896       0
## 4 0.33333333      1        0   0 0.06666667 0.06743295 0.1704889       0
## 5 0.33333333      0        0   0 0.20000000 0.19003831 0.1924875       0
## 6 0.06666667      1        0   0 0.20000000 0.18390805 0.2144861       0
Step 3: Data Splitting
set.seed(300)
car.ind <- sample(1:nrow(car.n),size=nrow(car.n)*0.7,replace = FALSE) #random selection of 70% data.
 
train.car <- car_dt[car.ind,] # 70% training data
test.car <- car_dt[-car.ind,] # remaining 30% test data
Creating a separate dataframe for the ‘Car’ feature, which is our target.
train.car_use <- car_dt[car.ind,9]
test.car_use <- car_dt[-car.ind,9]
Step 4: Building a Machine Learning model
#install.packages('class')
library(class)
NROW(train.car_use)
## [1] 292
  • Thus, there are 292 observations in the training data set. The square root of 292 is approximately 17.09, so two models will be created: one with k = 17 and another with k = 18.
knn.17 <- knn(train = train.car, test = test.car, cl = train.car_use, k = 17)
ACC.17 <- 100 * sum(test.car_use == knn.17)/NROW(test.car_use)
ACC.17
## [1] 99.20635
  • As shown above, the accuracy for K = 17 is 99.21%.
Step 5: Model Evaluation

Calculate the accuracy of the created models:

#Calculate the proportion of correct classification for k = 17, 18
knn.18 <- knn(train = train.car, test = test.car, cl = train.car_use, k = 18)
ACC.18 <- 100 * sum(test.car_use == knn.18)/NROW(test.car_use)
  • As shown above, the accuracy for K = 18 is also 99.21%.

Check predictions against actual values in tabular form

table(knn.17 ,test.car_use)
##       test.car_use
## knn.17   0   1
##      0 116   1
##      1   0   9
table(knn.18 ,test.car_use)
##       test.car_use
## knn.18   0   1
##      0 116   1
##      1   0   9
  • The prediction against actual values for both k = 17 and k = 18 are the same.

Check confusion matrix to calculate the accuracy.

#install.packages('caret')
confusionMatrix(table(knn.17 ,test.car_use))
## Confusion Matrix and Statistics
## 
##       test.car_use
## knn.17   0   1
##      0 116   1
##      1   0   9
##                                           
##                Accuracy : 0.9921          
##                  95% CI : (0.9566, 0.9998)
##     No Information Rate : 0.9206          
##     P-Value [Acc > NIR] : 0.0003541       
##                                           
##                   Kappa : 0.9431          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.9915          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.9206          
##          Detection Rate : 0.9206          
##    Detection Prevalence : 0.9286          
##       Balanced Accuracy : 0.9500          
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(table(knn.18 ,test.car_use))
## Confusion Matrix and Statistics
## 
##       test.car_use
## knn.18   0   1
##      0 116   1
##      1   0   9
##                                           
##                Accuracy : 0.9921          
##                  95% CI : (0.9566, 0.9998)
##     No Information Rate : 0.9206          
##     P-Value [Acc > NIR] : 0.0003541       
##                                           
##                   Kappa : 0.9431          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.9915          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.9206          
##          Detection Rate : 0.9206          
##    Detection Prevalence : 0.9286          
##       Balanced Accuracy : 0.9500          
##                                           
##        'Positive' Class : 0               
## 
  • From the output of the KNN model, the model predicts the outcome with an accuracy for both k = 17 and k = 18 is 99.21%.
Step 6: Optimization
kNN_model <- train(
  Car ~., data = train.car, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale"),
  tuneLength = 20
  )
# Plot cross-validated RMSE vs. different values of k (Car is numeric here, so caret fits a regression kNN)
plot(kNN_model)

kNN_model$bestTune
##    k
## 4 11
  • According to the above output, the best k-value to calculate the accuracy in kNN model is k = 11.
knn.11 <- knn(train = train.car, test = test.car, cl = train.car_use, k = 11)
ACC.11 <- 100 * sum(test.car_use == knn.11)/NROW(test.car_use)
table(knn.11 ,test.car_use)
##       test.car_use
## knn.11   0   1
##      0 116   1
##      1   0   9
confusionMatrix(table(knn.11, test.car_use))
## Confusion Matrix and Statistics
## 
##       test.car_use
## knn.11   0   1
##      0 116   1
##      1   0   9
##                                           
##                Accuracy : 0.9921          
##                  95% CI : (0.9566, 0.9998)
##     No Information Rate : 0.9206          
##     P-Value [Acc > NIR] : 0.0003541       
##                                           
##                   Kappa : 0.9431          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9000          
##          Pos Pred Value : 0.9915          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.9206          
##          Detection Rate : 0.9206          
##    Detection Prevalence : 0.9286          
##       Balanced Accuracy : 0.9500          
##                                           
##        'Positive' Class : 0               
## 
  • Overall, the outcome for k = 11 is exactly the same as for k = 17 and k = 18: of the 126 employees in the test set, 9 are predicted to use a car (about 7.1%), with 99.21% accuracy.

3.3 Naive Bayes

(is it applicable here? comment and if it is not applicable, how can you build an NB model in this case?)

  • Since the goal of the analysis is to predict whether or not an employee will use Car as a mode of transport from the given predictor variables (Age, Gender, Engineer, MBA, Work Experience, Salary, Distance, and license), Naive Bayes is applicable to this dataset; the e1071 implementation handles the mix of factor and numeric predictors by modelling the numeric ones with Gaussian distributions.
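If one preferred not to rely on the Gaussian assumption for the numeric predictors, a fully categorical Naive Bayes could be built by binning them first; a sketch of that alternative (not what this report ran):
cars_nb <- cars_dt
num_cols <- c("Age", "Work.Exp", "Salary", "Distance")
cars_nb[num_cols] <- lapply(cars_nb[num_cols], function(x) cut(x, breaks = 4)) # 4 equal-width bins
nb_binned <- naiveBayes(Car ~ ., data = cars_nb)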
str(cars_dt)
## 'data.frame':    418 obs. of  9 variables:
##  $ Age     : num  28 24 27 25 25 21 23 23 24 28 ...
##  $ Gender  : Factor w/ 2 levels "Female","Male": 2 2 1 2 1 2 2 2 2 2 ...
##  $ Engineer: Factor w/ 2 levels "No","Yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ MBA     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Work.Exp: num  5 6 9 1 3 3 3 0 4 6 ...
##  $ Salary  : num  14.4 10.6 15.5 7.6 9.6 9.5 11.7 6.5 8.5 13.7 ...
##  $ Distance: num  5.1 6.1 6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 ...
##  $ license : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Car     : Factor w/ 2 levels "Others","Car": 1 1 1 1 1 1 1 1 1 1 ...
  • Recall the above structure of the cleaned data to be used in the analysis.
3.3 Step 1: Data Partition-Split data into training and test data sets
set.seed(300)
trainIndex = createDataPartition(cars_dt$Car, p=0.7)$Resample1 
  nb_train = cars_dt[trainIndex,]
  nb_test  = cars_dt[-trainIndex,]
print(table(cars_dt$Car))
## 
## Others    Car 
##    383     35
print(table(nb_train$Car))
## 
## Others    Car 
##    269     25
3.3 Step 2: Fitting the Naive Bayes model.
NBclassfier = naiveBayes(Car ~., data = nb_train)
print(NBclassfier)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     Others        Car 
## 0.91496599 0.08503401 
## 
## Conditional probabilities:
##         Age
## Y            [,1]     [,2]
##   Others 26.59108 3.045071
##   Car    34.28000 1.400000
## 
##         Gender
## Y          Female     Male
##   Others 0.267658 0.732342
##   Car    0.120000 0.880000
## 
##         Engineer
## Y               No       Yes
##   Others 0.2230483 0.7769517
##   Car    0.0400000 0.9600000
## 
##         MBA
## Y               No       Yes
##   Others 0.7174721 0.2825279
##   Car    0.7600000 0.2400000
## 
##         Work.Exp
## Y             [,1]     [,2]
##   Others  4.855019 3.282328
##   Car    14.240000 1.614517
## 
##         Salary
## Y            [,1]     [,2]
##   Others 12.78016 4.057321
##   Car    22.30350 1.768837
## 
##         Distance
## Y            [,1]     [,2]
##   Others 10.77844 3.206648
##   Car    17.68596 2.533349
## 
##         license
## Y              No      Yes
##   Others 0.866171 0.133829
##   Car    0.200000 0.800000
NBclassfier2 = naiveBayes(Car ~., data = nb_test)
print(NBclassfier2)
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##     Others        Car 
## 0.91935484 0.08064516 
## 
## Conditional probabilities:
##         Age
## Y            [,1]     [,2]
##   Others 26.23684 2.720953
##   Car    33.90000 1.595131
## 
##         Gender
## Y          Female     Male
##   Others 0.377193 0.622807
##   Car    0.300000 0.700000
## 
##         Engineer
## Y               No       Yes
##   Others 0.3508772 0.6491228
##   Car    0.4000000 0.6000000
## 
##         MBA
## Y               No       Yes
##   Others 0.7894737 0.2105263
##   Car    0.7000000 0.3000000
## 
##         Work.Exp
## Y             [,1]     [,2]
##   Others  4.649123 2.868759
##   Car    13.900000 1.852926
## 
##         Salary
## Y            [,1]     [,2]
##   Others 12.59901 3.939864
##   Car    22.11125 2.217547
## 
##         Distance
## Y           [,1]     [,2]
##   Others 10.4807 3.032038
##   Car    17.9966 2.472274
## 
##         license
## Y               No       Yes
##   Others 0.8245614 0.1754386
##   Car    0.1000000 0.9000000
  • The naiveBayes function estimates Gaussian distributions (the means and standard deviations shown above) for the numeric predictors and conditional probability tables for the factor predictors, separately for each class.
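For instance, the Age row of the table above defines a class-conditional Gaussian; the likelihood of Age = 30 under each class can be evaluated directly (illustration only, using the printed means and standard deviations):
dnorm(30, mean = 26.59, sd = 3.05) # likelihood of Age = 30 under "Others"
dnorm(30, mean = 34.28, sd = 1.40) # likelihood of Age = 30 under "Car"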
3.3 Step 3: Prediction on the training data and the test data.
3.3a1: Predicting Model for “train” dataset
NB_Pred1 = predict(NBclassfier, nb_train)
head(cbind(NB_Pred1, nb_train))
##   NB_Pred1 Age Gender Engineer MBA Work.Exp Salary Distance license    Car
## 2   Others  24   Male      Yes  No        6   10.6      6.1      No Others
## 4   Others  25   Male       No  No        1    7.6      6.3      No Others
## 5   Others  25 Female       No  No        3    9.6      6.7      No Others
## 6   Others  21   Male       No  No        3    9.5      7.1      No Others
## 7   Others  23   Male      Yes Yes        3   11.7      7.2      No Others
## 8   Others  23   Male       No  No        0    6.5      7.3      No Others
3.3a2: Confusion matrix to check accuracy for “train” dataset.
str(nb_train)
## 'data.frame':    294 obs. of  9 variables:
##  $ Age     : num  24 25 25 21 23 23 24 28 26 21 ...
##  $ Gender  : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 2 2 2 2 2 ...
##  $ Engineer: Factor w/ 2 levels "No","Yes": 2 1 1 1 2 1 2 2 1 1 ...
##  $ MBA     : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 1 2 ...
##  $ Work.Exp: num  6 1 3 3 3 0 4 6 4 3 ...
##  $ Salary  : num  10.6 7.6 9.6 9.5 11.7 6.5 8.5 13.7 12.6 10.6 ...
##  $ Distance: num  6.1 6.3 6.7 7.1 7.2 7.3 7.5 7.5 7.5 7.7 ...
##  $ license : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ Car     : Factor w/ 2 levels "Others","Car": 1 1 1 1 1 1 1 1 1 1 ...
tab1 = table(NB_Pred1, nb_train$Car)
print(tab1)
##         
## NB_Pred1 Others Car
##   Others    261   2
##   Car         8  23
Acc_Train = sum(diag(tab1))/sum(tab1)
print(Acc_Train)
## [1] 0.9659864
  • In the train dataset, 23 of the 294 observations (7.82%) are correctly predicted as Car users, and the overall accuracy is 96.6%.
#create objects x which holds the predictor variables and y which holds the response variables
x1 = nb_train[,1:8]
y1 = nb_train$Car
library(caret)
nb_model = train(x1, y1, 'nb', trControl = trainControl(method = 'cv', number = 10))
nb_model
## Naive Bayes 
## 
## 294 samples
##   8 predictor
##   2 classes: 'Others', 'Car' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 265, 265, 264, 264, 266, 265, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9627504  0.7767531
##    TRUE      0.9591790  0.7579070
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
##  and adjust = 1.
  • The cross-validated Naive Bayes model on the 294 training samples with 8 predictors reaches an accuracy of about 96.3% (with the final choice usekernel = FALSE).
3.3b1: Predicting Model for “test” dataset
NB_Pred2 = predict(NBclassfier2, nb_test)
head(cbind(NB_Pred2, nb_test))
##    NB_Pred2 Age Gender Engineer MBA Work.Exp  Salary Distance license
## 1    Others  28   Male      Yes  No        5 14.4000      5.1      No
## 3    Others  27 Female      Yes  No        9 15.5000      6.1      No
## 16   Others  28 Female       No  No       10 19.7000      9.0      No
## 18   Others  29 Female       No  No        7 14.6000      9.2      No
## 22   Others  25 Female      Yes  No        6 11.6000     10.1      No
## 23      Car  34   Male      Yes Yes       14 22.8125     10.4     Yes
##       Car
## 1  Others
## 3  Others
## 16 Others
## 18 Others
## 22 Others
## 23 Others
3.3b2: Confusion matrix to check accuracy for “test” dataset.
str(nb_test)
## 'data.frame':    124 obs. of  9 variables:
##  $ Age     : num  28 27 28 29 25 34 28 21 24 26 ...
##  $ Gender  : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 2 2 1 2 1 ...
##  $ Engineer: Factor w/ 2 levels "No","Yes": 2 2 1 1 2 2 2 1 2 2 ...
##  $ MBA     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Work.Exp: num  5 9 10 7 6 14 5 3 0 4 ...
##  $ Salary  : num  14.4 15.5 19.7 14.6 11.6 ...
##  $ Distance: num  5.1 6.1 9 9.2 10.1 10.4 10.5 11 11 11.1 ...
##  $ license : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 2 1 2 1 ...
##  $ Car     : Factor w/ 2 levels "Others","Car": 1 1 1 1 1 1 1 1 1 1 ...
print(table(nb_test$Car))
## 
## Others    Car 
##    114     10
p2 = predict(NBclassfier2, nb_test)
tab2 = table(NB_Pred2, nb_test$Car)
print(tab2)
##         
## NB_Pred2 Others Car
##   Others    112   1
##   Car         2   9
Acc_Test = sum(diag(tab2))/sum(tab2)
print(Acc_Test)
## [1] 0.9758065
  • In the test dataset, 9 of the 124 observations are correctly predicted as Car users, and the overall accuracy is 97.6%.
x2 = nb_test[,1:8]
y2 = nb_test$Car

library(caret)
nb_model2 = train(x2, y2, 'nb', trControl = trainControl(method = 'cv', number = 10))
nb_model2
## Naive Bayes 
## 
## 124 samples
##   8 predictor
##   2 classes: 'Others', 'Car' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 112, 112, 111, 111, 112, 112, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9821937  0.9170635
##    TRUE      0.9596154  0.6882143
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
##  and adjust = 1.
  • The cross-validated Naive Bayes model on the 124 test samples with 8 predictors reaches an accuracy of about 98.2% (with the final choice usekernel = FALSE).

3.4 Logistic Regression

3.4a Create Train and Test Samples
set.seed(123)
index = sample(1:nrow(cars_dt), .7*nrow(cars_dt))
lg_train = cars_dt[index,]
lg_test = cars_dt[-index,]
#sanity check
prop.table(table(lg_train$Car))
## 
##    Others       Car 
## 0.9109589 0.0890411
3.4b Build Logistic Regression
Train Dataset
logreg1 <- glm(Car ~ ., data = lg_train, family = binomial(link = "logit"))
summary(logreg1) 
## 
## Call:
## glm(formula = Car ~ ., family = binomial(link = "logit"), data = lg_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.60968  -0.00384  -0.00044  -0.00004   2.25098  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -79.5608    42.4347  -1.875   0.0608 .
## Age           1.9922     1.2808   1.555   0.1198  
## GenderMale   -0.9462     1.7262  -0.548   0.5836  
## EngineerYes   0.6289     1.8045   0.349   0.7275  
## MBAYes       -2.0184     1.8472  -1.093   0.2745  
## Work.Exp     -0.6142     1.2134  -0.506   0.6127  
## Salary        0.2107     0.6104   0.345   0.7300  
## Distance      1.1063     0.4296   2.575   0.0100 *
## licenseYes    3.3731     2.5317   1.332   0.1828  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.383  on 291  degrees of freedom
## Residual deviance:  17.162  on 283  degrees of freedom
## AIC: 35.162
## 
## Number of Fisher Scoring iterations: 11
  • The table above shows that Distance is a significant predictor of Car usage as a mode of transport, with a p-value of about 0.01.
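On the odds scale (a supplementary interpretation, not in the original output), the Distance coefficient implies:
exp(coef(logreg1)["Distance"]) # exp(1.106) is roughly 3.0: each extra mile of commute
                               # roughly triples the odds of commuting by car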
lg_train$predicted = predict(logreg1, lg_train, type="response")
confmatrix = table(Actual_value = lg_train$Car, Predicted_Value = lg_train$predicted > 0.5)
confmatrix 
##             Predicted_Value
## Actual_value FALSE TRUE
##       Others   265    1
##       Car        3   23
Accuracy = (confmatrix[[1,1]] + confmatrix[[2,2]]) / sum(confmatrix)
Accuracy
## [1] 0.9863014
  • In the Logistic Regression train dataset, 23 of the 292 observations (7.88%) are correctly predicted as Car users, with 98.63% overall accuracy.
Validate on Train dataset with ROC-AUC
lg_train$Car <- ifelse(lg_train$Car == 'Car', 1, 0)
pred1 = predict(logreg1, lg_train)
pred = prediction(pred1, lg_train$Car)
roc = performance(pred, "tpr", "fpr")
plot(roc, 
        colorize = T, 
            main = "ROC-AUC Curve for Train Dataset",
            ylab = "Sensitivity: True Positive Rate",
            xlab = "1 - Specificity: False Positive Rate")
        abline(a = 0, b = 1)

eval = performance(pred, "acc")
#Identify Best Cutoff Values
max = which.max(slot(eval, "y.values")[[1]])
acc = slot(eval, "y.values")[[1]][max]
cut = slot(eval, "x.values")[[1]][max]
print(c(Accuracy = acc, Cutoff = cut))
##  Accuracy Cutoff.98 
##  0.989726  1.210708
  • The ROC-AUC curve for the train dataset confirms the performance of the logistic regression on the training data, with 98.97% accuracy at the best cutoff.
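The AUC itself is not printed above; with the ROCR objects already created it could be extracted as follows (a supplementary step):
auc_train <- performance(pred, measure = "auc")
auc_train@y.values[[1]] # area under the ROC curve for the training predictions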
Test Dataset
logreg2 <- glm(Car ~ ., data = lg_test, family = binomial(link = "logit"))
summary(logreg2) 
## 
## Call:
## glm(formula = Car ~ ., family = binomial(link = "logit"), data = lg_test)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -2.775e-05  -2.110e-08  -2.110e-08  -2.110e-08   3.031e-05  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.080e+02  1.249e+06   0.000        1
## Age          2.926e-01  5.861e+04   0.000        1
## GenderMale  -1.912e+01  1.176e+05   0.000        1
## EngineerYes -1.132e+01  6.043e+04   0.000        1
## MBAYes      -1.519e+01  9.385e+04   0.000        1
## Work.Exp     6.485e+00  5.605e+04   0.000        1
## Salary       2.137e+00  1.954e+04   0.000        1
## Distance    -4.871e-01  1.660e+04   0.000        1
## licenseYes   3.185e+01  5.487e+04   0.001        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.4844e+01  on 125  degrees of freedom
## Residual deviance: 4.3690e-09  on 117  degrees of freedom
## AIC: 18
## 
## Number of Fisher Scoring iterations: 25
lg_test$predicted = predict(logreg2, lg_test, type="response")
confmatrix2 = table(Actual_value = lg_test$Car, Predicted_Value = lg_test$predicted > 0.5)
confmatrix2 
##             Predicted_Value
## Actual_value FALSE TRUE
##       Others   117    0
##       Car        0    9
Accuracy2 = (confmatrix2[[1,1]] + confmatrix2[[2,2]]) / sum(confmatrix2)
Accuracy2
## [1] 1
  • In the Logistic Regression test dataset, 9 of the 126 observations (7.14%) are predicted as Car users, with 100% accuracy.
Validate on Test Dataset
lg_test$Car <- ifelse(lg_test$Car == 'Car', 1, 0)
pred2 = predict(logreg2, lg_test)
tpred = prediction(pred2, lg_test$Car)
roc2 = performance(tpred, "tpr", "fpr")
plot(roc2, 
        colorize = T, 
            main = "ROC-AUC Curve for Test Dataset",
            ylab = "Sensitivity: True Positive Rate",
            xlab = "1 - Specificity: False Positive Rate")
        abline(a = 0, b = 1)

eval2 = performance(tpred, "acc")
#Identify Best Cutoff Values
max = which.max(slot(eval2, "y.values")[[1]])
acc2 = slot(eval, "y.values")[[1]][max]
cut2 = slot(eval, "x.values")[[1]][max]
print(c(Accuracy = acc2, Cutoff = cut2))
##   Accuracy Cutoff.107 
##  0.9452055  9.6134609
  • The ROC-AUC curve for the test dataset confirms the performance of the logistic regression on the test data, with 94.52% accuracy (about 4.45 percentage points lower than on the train dataset).

Apply both bagging and boosting modeling procedures to create 2 models and compare their accuracy with the best model from the steps above

table(cars_dt$Car)
## 
## Others    Car 
##    383     35
prop.table(table(cars_dt$Car))#highly imbalanced classification problem
## 
##     Others        Car 
## 0.91626794 0.08373206
  • The proportion of Car users in the full dataset is 8.37% (35 out of 418), so this is a highly imbalanced classification problem.
split <- sample.split(cars_dt$Car,  SplitRatio= 0.70)
split2 <- sample(1:nrow(cars_dt), .1*nrow(cars_dt))

cd<- subset(cars_dt, split == FALSE)
cd2 <- cars_dt[split2,]

table(cd$Car)
## 
## Others    Car 
##    115     11
prop.table(table(cd$Car))
## 
##     Others        Car 
## 0.91269841 0.08730159
  • The proportion of Car users in the first split subset (cd) is 8.73% (11 out of 126).
table(cd2$Car)
## 
## Others    Car 
##     38      3
prop.table(table(cd2$Car))
## 
##     Others        Car 
## 0.92682927 0.07317073
  • The proportion of Car users in the second split subset (cd2) is 7.32% (3 out of 41).
split <- sample.split(cd$Car,  SplitRatio= 0.7)
cd_train<- subset(cd, split == TRUE)
cd_test<- subset(cd, split == FALSE)

Bagging

Car.bagging<-bagging(Car ~.,data = lg_train,
                         control = rpart.control(maxdepth=5, 
                                              minsplit=4))

lg_test$pred.car <- predict(Car.bagging,  lg_test)
table(lg_test$Car,lg_test$pred.car > 0.4)
##    
##     FALSE TRUE
##   0   117    0
##   1     0    9
  • On the test dataset, the bagging model (0.4 cutoff) classifies every observation correctly: the 9 Car users out of 126 (7.14%) are all identified.
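As with the earlier models, overall accuracy can be computed directly from the confusion table; a short sketch using the objects above:
bag_tab = table(lg_test$Car, lg_test$pred.car > 0.4)  # same table as above
sum(diag(bag_tab)) / sum(bag_tab)                     # accuracy at the 0.4 cutoff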

Boosting

gbm.fit<-gbm(formula = Car ~ .,distribution = "bernoulli",
             data = lg_train,
             n.trees= 1000,
             interaction.depth= 3,
             shrinkage = 0.001,
             cv.folds= 5,
             n.cores= NULL, # will use all cores by default
             verbose = FALSE) 

lg_test$pred.class<-predict(gbm.fit, lg_test, type = "response") 
## Using 1000 trees...
table(lg_test$Car,lg_test$pred.class>0.5) 
##    
##     FALSE TRUE
##   0   117    0
##   1     0    9
  • Boosting gives the same result as bagging on the test dataset: all 9 Car users out of 126 (7.14%) are correctly identified at a 0.5 cutoff.
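Because the boosted model was fitted with 5-fold cross-validation, gbm's standard helpers can report the optimal number of trees and the relative influence of each predictor; a sketch (output not shown in the original report):
best.iter = gbm.perf(gbm.fit, method = "cv")   # CV-optimal number of trees
summary(gbm.fit, n.trees = best.iter)          # relative influence of each predictor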

SMOTE

smote.train <- subset(cd, split == TRUE)
smote.test <- subset(cd, split == FALSE)
smote.train$Car <- as.factor(smote.train$Car)
# SMOTE oversamples the minority class with synthetic examples (perc.over) and
# undersamples the majority class relative to the synthetic cases (perc.under)
balanced.cd1 <- SMOTE(Car ~ ., smote.train, perc.over = 50000,
                      k = 5, perc.under = 100)
table(smote.train$Car)
## 
## Others    Car 
##     80      8
prop.table(table(smote.train$Car))
## 
##     Others        Car 
## 0.90909091 0.09090909
table(balanced.cd1$Car)
## 
## Others    Car 
##   4000   4008
prop.table(table(balanced.cd1$Car))
## 
##    Others       Car 
## 0.4995005 0.5004995
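The class counts above follow directly from the SMOTE parameters (assuming the DMwR implementation, which these argument names suggest); a quick check of the arithmetic:
n_car = 8                               # minority ("Car") cases in smote.train
synthetic_car = 50000 / 100 * n_car     # perc.over/100 new cases per minority case -> 4000
total_car = n_car + synthetic_car       # 4008, matching table(balanced.cd1$Car)
others_kept = 100 / 100 * synthetic_car # perc.under/100 of the synthetic count -> 4000
c(Car = total_car, Others = others_kept)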
balanced.cd2 <- SMOTE(Car~., smote.test, perc.over= 50000, 
                     k = 5, perc.under = 100)
table(smote.test$Car)
## 
## Others    Car 
##     35      3
prop.table(table(smote.test$Car))
## 
##     Others        Car 
## 0.92105263 0.07894737
  • Car users make up 7.89% of the "test" subset used for SMOTE (3 out of 38).
table(balanced.cd2$Car)
## 
## Others    Car 
##   1500   1503
prop.table(table(balanced.cd2$Car))
## 
##    Others       Car 
## 0.4995005 0.5004995

Now feed the SMOTE-balanced data into our best XGBoost model

# Convert the balanced data frames to data.tables
setDT(balanced.cd1) # train
setDT(balanced.cd2) # test
str(balanced.cd1)
## Classes 'data.table' and 'data.frame':   8008 obs. of  9 variables:
##  $ Age     : num  34 28 29 27 29 30 27 26 24 24 ...
##  $ Gender  : Factor w/ 2 levels "Female","Male": 2 2 2 1 2 2 2 1 2 2 ...
##  $ Engineer: Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 2 2 ...
##  $ MBA     : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
##  $ Work.Exp: num  15 3 6 4 6 10 5 3 1 2 ...
##  $ Salary  : num  22.8 9.5 14.6 13.6 14.7 ...
##  $ Distance: num  12.2 9 7.6 8.2 10.4 12.1 9.3 8.9 12.7 8 ...
##  $ license : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Car     : Factor w/ 2 levels "Others","Car": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
smote_features_train <- sparse.model.matrix(Car ~ ., data = balanced.cd1)
head(smote_features_train)
## 6 x 9 sparse Matrix of class "dgCMatrix"
##   (Intercept) Age GenderMale EngineerYes MBAYes Work.Exp  Salary Distance
## 1           1  34          1           1      1       15 22.8125     12.2
## 2           1  28          1           .      1        3  9.5000      9.0
## 3           1  29          1           1      .        6 14.6000      7.6
## 4           1  27          .           .      .        4 13.6000      8.2
## 5           1  29          1           .      .        6 14.7000     10.4
## 6           1  30          1           .      .       10 22.8125     12.1
##   licenseYes
## 1          .
## 2          .
## 3          .
## 4          .
## 5          .
## 6          .
smote_label_train = balanced.cd1$Car == "Car"
smote.xgb.fit <- xgboost(data = smote_features_train,
                         label = smote_label_train,
                         eta = 0.7,
                         max_depth = 5,
                         nrounds = 50,
                         nfold = 5,   # note: nfold is an xgb.cv() argument and is ignored by xgboost()
                         objective = "binary:logistic",  # binary classification
                         verbose = 0,                    # silent
                         early_stopping_rounds = 10      # stop if no improvement for 10 rounds
                         )
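It can also be useful to inspect which predictors the SMOTE-trained booster relies on; a sketch using xgboost's importance helper (output not shown in the original report):
imp = xgb.importance(feature_names = colnames(smote_features_train), model = smote.xgb.fit)
head(imp)  # gain, cover, and frequency per feature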
smote_features_test <- sparse.model.matrix(Car ~ ., data = smote.test)
smote.test$smote.pred.CarUsage <- predict(smote.xgb.fit, smote_features_test)
table(smote.test$Car,smote.test$smote.pred.CarUsage >= 0.9)
##         
##          FALSE
##   Others    35
##   Car        3
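Note that balanced.cd2, the SMOTE-balanced test set built above, is never scored; a sketch of how it could be evaluated with the same model (not part of the original run):
smote_features_test2 = sparse.model.matrix(Car ~ ., data = balanced.cd2)
pred.balanced = predict(smote.xgb.fit, smote_features_test2)
table(balanced.cd2$Car, pred.balanced >= 0.5)  # confusion table on the balanced test set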
# For comparison: logistic predictions (log.pred) on cd_test, computed in an earlier chunk not shown here
table(cd_test$Car, cd_test$log.pred > 0.3)
##         
##          FALSE TRUE
##   Others    34    1
##   Car        0    3
round(prop.table(table(cd_test$Car, cd_test$log.pred > 0.3)), 3)
##         
##          FALSE  TRUE
##   Others 0.895 0.026
##   Car    0.000 0.079
  • At a 0.9 cutoff the SMOTE-trained XGBoost model predicts no Car users in the raw test subset (all 3 are missed), whereas the earlier logistic predictions at a 0.3 cutoff recover all 3 Car users with a single false positive.

4. Actionable Insights & Recommendations

(Summarize your findings from the exercise in a concise yet actionable note)

Summary

  • The dataset contains 418 observations and 9 variables: the employee's age, gender, whether they hold an engineering position, whether they have an MBA, years of work experience, hourly salary, commuting distance from home to work, whether they hold a driving licence, and mode of transport. Among these, Car is the dependent (target) variable, since the objective is to determine whether an employee uses a car as their mode of transport.
  • MBA has 1 missing value, which was replaced with the mode of the variable (0).
  • The correlation plot shows strong correlations between Age and Work Experience (0.92), Age and Salary (0.86), and Work Experience and Salary (0.93). Figures H1 and H2 support the correlation information in Figure 1: employees tend to use a car when the commute distance exceeds 14 miles and the hourly salary exceeds $34. The dataset was therefore split into two sets: train_dt with Distance less than 14 and test_dt with Distance greater than or equal to 14.
  • The histograms by Gender and Transport show that most males and females take public transport, with commutes of roughly 5 to 18 miles, ages between 30 and 40, and salaries above $40 per hour. Fewer than 8 employees of each gender, all commuting farther than 13 miles, use a car, and fewer than 7 of each gender, with commutes between 4 and 22 miles, use a 2Wheeler.
  • A few outliers were found in Age, Work Experience, Salary, and Distance. They were treated by flooring Age at the 1% quantile and capping all four variables at Q3 + 1.5·IQR (see the sketch below).
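A minimal sketch of the outlier treatment described above (hypothetical helper functions; the original column names are assumed):
cap_upper = function(x) pmin(x, quantile(x, 0.75, na.rm = TRUE) + 1.5 * IQR(x, na.rm = TRUE))  # cap at Q3 + 1.5*IQR
floor_lower = function(x) pmax(x, quantile(x, 0.01, na.rm = TRUE))                             # floor at the 1% quantile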

Model validation outcome to predict Car Usage as a form of Transport

  • The basic linear regression model indicates that Car usage is inversely related to having an MBA (a one-unit increase in MBA decreases predicted Car usage by 0.0502) and to hourly Salary (a one-unit increase decreases predicted Car usage by 0.011). Car usage becomes more likely as Work Experience, Distance, and holding a driving licence increase, with coefficients of 0.0411, 0.0225, and 0.189 respectively.

  • Model 1 (train dataset) of the linear regression predicts Car.Usage from Work.Exp + Distance + license[Yes] + MBA[Yes] + Salary, with Multiple R-squared 0.611 and Adjusted R-squared 0.597.

  • Model 2 (multicollinearity check on the test dataset) predicts Car.Usage = 1.01·MBA + 5.795·Work.Exp + 5.604·Salary + 1.235·Distance + 1.239·license, with Multiple R-squared 0.607 and Adjusted R-squared 0.600.

  • The kNN model predicts 9 Car users out of 226 employees, i.e. a Car-usage rate of 3.98%, with 99.21% accuracy.

  • The Naive Bayes model predicts Car usage of 7.82% (23 out of 294) on the train dataset with 96.6% accuracy, and 7.26% on the test dataset with 97.58% accuracy.

  • In the logistic regression, Distance is a significant predictor of Car usage, with a p-value between 1% and 5%. On the train dataset the model predicts 7.88% Car usage (23 out of 292) with 98.63% accuracy, supported by a 98.97% best-cutoff accuracy from the ROC-AUC analysis. On the test dataset it predicts 7.14% Car usage (9 out of 126) with 100% accuracy, although the ROC-AUC analysis gives 94.5% (about 4.45 percentage points below the train figure), suggesting the headline 100% is optimistic.

  • Bagging predicts Car usage at 7.14% (9 out of 126).

  • Boosting predicts Car usage at 7.14% (9 out of 126).

  • SMOTE predicts Car usage at 7.9% (3 out of 38).

Conclusion

  • Basic linear regression identifies four significant predictors of Car usage, namely MBA, Salary, Work Experience, and Distance, and gives a Car-usage probability of 14.77% (35 out of 237) with an R-squared of 0.61.

  • Logistic Regression (7.14% and 7.88%), Naive Bayes (7.26% and 7.82%), kNN (3.98%), Bagging (7.14%), Boosting (7.14%), and SMOTE (7.9%) predict Car usage using all of the independent variables.

  • Overall, linear regression identifies the significant variables for predicting Car usage, and the best-performing model is kNN with 99.21% accuracy, which predicts that about 3.98% of employees will use a car as their mode of transport (a compact comparison is sketched below).
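For a side-by-side view, the accuracy figures quoted above can be collected into a small table; a sketch that only restates numbers already reported:
model_summary = data.frame(
  Model        = c("kNN", "Logistic (train)", "Naive Bayes (test)", "Naive Bayes (train)"),
  Accuracy_pct = c(99.21, 98.63, 97.58, 96.60)
)
model_summary[order(-model_summary$Accuracy_pct), ]  # kNN ranks highest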

© Copyright Diem H. Vuong