Final Report

I. Executive Summary

  • Problem Statement: Predicting Falcon Airlines’ passenger Satisfaction based on 23 factors: Customer ID, Gender, Customer Type, Age, Type of Travel, Seating Class, Flight Distance, Departure Delay, Arrival Delay, Seat Comfort, Departure/Arrival Time Convenience, Food & Drink, Gate Location, Inflight Wi-Fi Service, Inflight Entertainment, Online Support, Ease of Online Booking, On-board Service, Leg-room Service, Baggage Handling, Check-In Service, Cleanliness, and Online Boarding.

  • Brief Description of Methods: Data was randomly collected/surveyed from 90,917 passengers who travelled with Falcon Airlines. The surveys were collected at the end of each flight journey and capture overall satisfaction with the flight experience across the 23 parameters/variables mentioned in the problem statement. There are 90,918 rows (including a header row) in both datasets.

  • There are two datasets: (1) Flight data, which provides passengers’ background information (i.e. Customer ID, Gender, Age, Customer Type, Purpose of Travel, Seating Class, Flight Distance, and Departure/Arrival Delay).

    • Numerical Data:
      • Customer ID
      • Age
      • Flight Distance
      • Departure/Arrival Delays in minutes
    • Categorical Data:
      • Gender
      • Customer Type (Loyal/Disloyal)
      • Type of Travel (Business/Personal)
      • Seating Class (Business/Economy/Economy Plus)
  • (2) Survey data, which captures passengers’ level of satisfaction on 14 categorical parameters mentioned in the problem statement, each rated on a scale from 0 to 5. These ratings are converted from characters to numerical values during modeling in order to identify the important variables determining passengers’ Satisfaction, as follows:

    • 0 – extremely poor
    • 1 – poor
    • 2 – need improvement
    • 3 – acceptable
    • 4 – good
    • 5 – excellent
  • To understand which variables/parameters play a major role in driving high passenger satisfaction with Falcon Airlines, CART and Random Forest models were used to build decision trees and identify the important variables. Averaged over the train and test datasets, the Random Forest model achieves an accuracy of 95.49% and a sensitivity of 94.37%, with an average 95% confidence interval of roughly 95.27% to 95.7%.

  • Final Insights and Recommendation: The aim is to improve services and flight performance so that more passengers choose Falcon Airlines for their travel. Both the CART and Random Forest models confirmed that Seat Comfort, Inflight Entertainment, Traveling Class, Check-in Service, Online Support, Leg Room Service, and Ease of Online Booking are among the most important factors driving passenger satisfaction with Falcon Airlines.

II. Approach: Logical steps to final model selection.

  • A basic Linear Regression Model on both the train and test data with the seven important variables yields the following:
    • Train dataset - Inflight Entertainment, Ease of Online Booking, Online Support, Onboard Service, Leg Room Service, and Online Boarding are significant in determining passengers’ Satisfaction with p-values less than 0.1%. The Multiple R-squared and Adjusted R-squared are 38.85% and 38.84%, respectively.

    • Test dataset - Inflight Entertainment, Ease of Online Booking, Online Support, Onboard Service, and Leg Room Service are significant in determining passengers’ Satisfaction with p-values near zero. The Multiple R-squared and Adjusted R-squared are 38.51% and 38.49%, respectively.

  • Predicting using PCA
    • From the Variables-PCA diagram, the PCA model identifies Ease of Online Booking, Online Support, Online Boarding, Inflight Wifi Service, Baggage Handling, Cleanliness, and Onboard Service as the seven important variables in predicting passengers’ satisfaction.
    • Multinomial Logistic Regression Model of the train dataset yields 80.67% accuracy in predicting Passengers’ satisfaction.
    • Multinomial Logistic Regression Model of the test dataset yields 19.2% accuracy in predicting Passengers’ satisfaction.
    • Logistic Regression Model of the train dataset yields 79.52% accuracy, with a best-cutoff accuracy of 80.86% from the ROC analysis, in predicting Passengers’ satisfaction.
    • Logistic Regression Model of the test dataset yields 79.51% accuracy, with a best-cutoff accuracy of 81.2% from the ROC analysis, in predicting Passengers’ satisfaction.
  • CART Model
    • CART Model Important Variables are Inflight Entertainment, Ease of Online Booking, Checkin Service, Online Boarding, and Seat Comfort.
    • Classification error rate of the prediction on Passengers’ Satisfaction from the CART train dataset is 10.12%, which means the accuracy of the CART model in predicting Customer Satisfaction is 89.88%.
    • Classification error rate of the prediction on Passengers’ Satisfaction from CART-Test Dataset is 12.91%, which means the accuracy using CART model to predict Customer’s Satisfaction is 87.09%.
  • Random Forest Model
    • The Random Forest model’s important variables in predicting passengers’ satisfaction are Seat Comfort, Inflight Entertainment, Traveling Class, Checkin Service, Online Support, Leg Room Service, and Ease of Online Booking; Seat Comfort and Inflight Entertainment are the two most important, contributing the largest decrease in accuracy when excluded.
    • The accuracy rate in predicting passengers’ satisfaction from the Random Forest train dataset is 96.97%, with a 95% confidence interval between 96.84% and 97.09% and a sensitivity rate of 95.86%.
    • The accuracy rate in predicting passengers’ satisfaction from the Random Forest test dataset is 94.01%, with a 95% confidence interval between 93.69% and 94.31% and a sensitivity rate of 92.88%.
  • XGBoost Model
    • The final accuracy of the XGBoost model in predicting passengers’ Satisfaction is 91.55%; however, this model was not used to identify the important variables for the Satisfaction rating.

III. Relevance and implementability of the conclusions and recommendations

  • Based on the interpretation of all models above, the Random Forest model yields the highest accuracy rates for both the train and test datasets, 96.97% and 94.01% respectively, in predicting Falcon Airlines passengers’ satisfaction, with sensitivity rates of 95.86% on the train and 92.88% on the test dataset.
  • The Random Forest model also shows that the important variables in predicting passengers’ satisfaction are Seat Comfort, Inflight Entertainment, Traveling Class, Checkin Service, Online Support, Leg Room Service, and Ease of Online Booking, with Seat Comfort and Inflight Entertainment being the most important.
  • In addition, the CART final tree (Figure 3C in the Appendix) confirms that passengers who are satisfied with Ease of Online Booking, Inflight Entertainment, Checkin Service, Online Boarding, Seat Comfort, and Legroom Service are most likely satisfied with Falcon Airlines overall.

Recommendation

  • To maintain a high passenger Satisfaction rate, the five most important variables for Falcon Airlines to concentrate on in providing the best service are Inflight Entertainment, Seat Comfort, Ease of Online Booking, Checkin Service, and Online Boarding.

Appendix

Exploratory Data Analysis

library(summarytools)  ## dfSummary() below comes from the summarytools package
dfSummary(Aviation, plain.ascii = FALSE, style = "grid", graph.magnif = 0.75, varnumbers = TRUE, valid.col = TRUE, tmp.img.dir = "./tmp", method = 'browser')

Data Frame Summary

Aviation

Dimensions: 90917 x 24
Duplicates: 0

1. CustomerID [numeric] - Mean (sd): 195423 (26245.6); min < med < max: 149965 < 195423 < 240881; IQR (CV): 45458 (0.1); 90917 distinct values. Valid: 90917 (100%); Missing: 0 (0%)
2. Gender [character] - Female 46186 (50.8%); Male 44731 (49.2%). Valid: 90917 (100%); Missing: 0 (0%)
3. CustomerType [character] - Disloyal Customer 14921 (18.2%); Loyal Customer 66897 (81.8%). Valid: 81818 (89.99%); Missing: 9099 (10.01%)
4. Age [numeric] - Mean (sd): 39.4 (15.1); min < med < max: 7 < 40 < 85; IQR (CV): 24 (0.4); 75 distinct values. Valid: 90917 (100%); Missing: 0 (0%)
5. TypeTravel [character] - Business travel 56481 (69.0%); Personal Travel 25348 (31.0%). Valid: 81829 (90%); Missing: 9088 (10%)
6. Class [character] - Business 43535 (47.9%); Eco 40758 (44.8%); Eco Plus 6624 (7.3%). Valid: 90917 (100%); Missing: 0 (0%)
7. Flight_Distance [numeric] - Mean (sd): 1981.6 (1026.8); min < med < max: 50 < 1927 < 6950; IQR (CV): 1182 (0.5); 5213 distinct values. Valid: 90917 (100%); Missing: 0 (0%)
8. DepartureDelayin_Mins [numeric] - Mean (sd): 14.7 (38.7); min < med < max: 0 < 0 < 1592; IQR (CV): 12 (2.6); 436 distinct values. Valid: 90917 (100%); Missing: 0 (0%)
9. ArrivalDelayin_Mins [numeric] - Mean (sd): 15.1 (39); min < med < max: 0 < 0 < 1584; IQR (CV): 13 (2.6); 445 distinct values. Valid: 90633 (99.69%); Missing: 284 (0.31%)
10. Satisfaction [character] - neutral or dissatisfied 41156 (45.3%); satisfied 49761 (54.7%). Valid: 90917 (100%); Missing: 0 (0%)
11. Seat_comfort [character] - acceptable 20552 (22.6%); excellent 12519 (13.8%); extremely poor 3368 (3.7%); good 19789 (21.8%); need improvement 20002 (22.0%); poor 14687 (16.2%). Valid: 90917 (100%); Missing: 0 (0%)
12. Departure.Arrival.time_convenient [character] - acceptable 14806 (17.9%); excellent 17079 (20.7%); extremely poor 4199 (5.1%); good 18840 (22.8%); need improvement 14539 (17.6%); poor 13210 (16.0%). Valid: 82673 (90.93%); Missing: 8244 (9.07%)
13. Food_drink [character] - acceptable 17991 (21.8%); excellent 12947 (15.6%); extremely poor 3794 (4.6%); good 17245 (20.8%); need improvement 17359 (21.0%); poor 13400 (16.2%). Valid: 82736 (91%); Missing: 8181 (9%)
14. Gate_location [character] - convenient 21088 (23.2%); inconvenient 15876 (17.5%); manageable 23385 (25.7%); need improvement 17113 (18.8%); very convenient 13454 (14.8%); very inconvenient 1 (0.0%). Valid: 90917 (100%); Missing: 0 (0%)
15. Inflightwifi_service [character] - acceptable 19199 (21.1%); excellent 20258 (22.3%); extremely poor 96 (0.1%); good 22159 (24.4%); need improvement 18894 (20.8%); poor 10311 (11.3%). Valid: 90917 (100%); Missing: 0 (0%)
16. Inflight_entertainment [character] - acceptable 16995 (18.7%); excellent 20786 (22.9%); extremely poor 2038 (2.2%); good 29373 (32.3%); need improvement 13527 (14.9%); poor 8198 (9.0%). Valid: 90917 (100%); Missing: 0 (0%)
17. Online_support [character] - acceptable 15090 (16.6%); excellent 24916 (27.4%); extremely poor 1 (0.0%); good 29042 (31.9%); need improvement 12063 (13.3%); poor 9805 (10.8%). Valid: 90917 (100%); Missing: 0 (0%)
18. Ease_of_Onlinebooking [character] - acceptable 15686 (17.2%); excellent 23960 (26.4%); extremely poor 12 (0.0%); good 27993 (30.8%); need improvement 13896 (15.3%); poor 9370 (10.3%). Valid: 90917 (100%); Missing: 0 (0%)
19. Onboard_service [character] - acceptable 17411 (20.8%); excellent 20396 (24.4%); extremely poor 3 (0.0%); good 26373 (31.5%); need improvement 11018 (13.2%); poor 8537 (10.2%). Valid: 83738 (92.1%); Missing: 7179 (7.9%)
20. Leg_room_service [character] - acceptable 15775 (17.3%); excellent 24071 (26.5%); extremely poor 322 (0.4%); good 27814 (30.6%); need improvement 15156 (16.7%); poor 7779 (8.6%). Valid: 90917 (100%); Missing: 0 (0%)
21. Baggage_handling [character] - acceptable 17233 (18.9%); excellent 25002 (27.5%); good 33822 (37.2%); need improvement 9301 (10.2%); poor 5559 (6.1%). Valid: 90917 (100%); Missing: 0 (0%)
22. Checkin_service [character] - acceptable 24941 (27.4%); excellent 18918 (20.8%); extremely poor 1 (0.0%); good 25483 (28.0%); need improvement 10813 (11.9%); poor 10761 (11.8%). Valid: 90917 (100%); Missing: 0 (0%)
23. Cleanliness [character] - acceptable 16930 (18.6%); excellent 25079 (27.6%); extremely poor 4 (0.0%); good 34246 (37.7%); need improvement 9283 (10.2%); poor 5375 (5.9%). Valid: 90917 (100%); Missing: 0 (0%)
24. Online_boarding [character] - acceptable 21427 (23.6%); excellent 20993 (23.1%); extremely poor 9 (0.0%); good 24676 (27.1%); need improvement 13035 (14.3%); poor 10777 (11.8%). Valid: 90917 (100%); Missing: 0 (0%)
  • Data was randomly collected/surveyed from 90,917 passengers who travelled with Falcon Airlines. The surveys were collected at the end of each flight journey and capture overall satisfaction with the flight experience across the 23 parameters/variables shown in the data frame summary above.

  • Removal of unwanted variables: CustomerID is removed from the dataset because it is only a passenger identifier and has no significance in predicting overall passenger satisfaction with Falcon Airlines.
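
The removal step itself is not shown in the report; a minimal sketch, assuming the Aviation data frame loaded above with CustomerID as its first column:

Aviation <- Aviation[-c(1)]   ## drop the CustomerID column (assumed to be column 1)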

  • Missing value treatment: the Data Frame Summary table shows missing values in Customer Type, Type of Travel, Arrival Delay in Minutes, Departure/Arrival Time Convenient, Food & Drink, and Onboard Service. Missing values are treated as follows:

    • Customer Type and Type of Travel are character variables with about 10% missing values; these two columns are dropped from the analysis.
    • Missing values in Arrival Delay in Minutes are imputed with the variable’s mean.
    • Missing values in Departure/Arrival Time Convenient, Food & Drink, and Onboard Service are imputed with each variable’s mode.
## Remove the Customer Type and Type Travel columns from the analysis as they both contain about 10% missing values.
Aviation = Aviation[-c(2,4)]
## imputing the mean into missing values for Arrival Delay in Minutes
Aviation$ArrivalDelayin_Mins[is.na(Aviation$ArrivalDelayin_Mins)] = mean(Aviation$ArrivalDelayin_Mins, na.rm = TRUE) 
## Create mode function 
dt_mode <- function(x) {                                     
  unique_x <- unique(x)
  mode <- unique_x[which.max(tabulate(match(x, unique_x)))]
  mode
}
# imputing the mode into surveyed missing values
Aviation$Departure.Arrival.time_convenient[is.na(Aviation$Departure.Arrival.time_convenient)] <- 
                                          dt_mode(Aviation$Departure.Arrival.time_convenient[!is.na(Aviation$Departure.Arrival.time_convenient)]) 
Aviation$Food_drink[is.na(Aviation$Food_drink)] <- dt_mode(Aviation$Food_drink[!is.na(Aviation$Food_drink)]) 
Aviation$Onboard_service[is.na(Aviation$Onboard_service)] <- dt_mode(Aviation$Onboard_service[!is.na(Aviation$Onboard_service)])
  • The Aviation dataset now has 0 missing values.
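
A quick check, as a minimal sketch (assuming the imputation steps above have been run):

sum(is.na(Aviation))       ## total remaining missing values; expected to be 0
colSums(is.na(Aviation))   ## per-column count of remaining missing values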

  • Outliers Treatment

quantile(Aviation$DepartureDelayin_Mins,c(0.01,0.02,0.03,0.1,0.2,0.3,0.4,0.50,0.6,0.7,0.8,0.9,0.95,0.99,1))
##   1%   2%   3%  10%  20%  30%  40%  50%  60%  70%  80%  90%  95%  99% 100% 
##    0    0    0    0    0    0    0    0    2    8   18   43   76  180 1592
par(mfrow = c(1,3))
plot(density(Aviation$DepartureDelayin_Mins), main="Departure Delay in Minutes") # Looks right-skewed
boxplot(Aviation$DepartureDelayin_Mins,
     main = "Departure Delay in Minutes",
     xlab ="Minutes",
          ylab = "",
          col = "green",
          border = "black", horizontal = TRUE, notch = FALSE)
ggplot(Aviation, aes(DepartureDelayin_Mins)) + geom_histogram() 
qqnorm(Aviation$DepartureDelayin_Mins) 

  • The Departure Delay in Minutes plots above show that the distribution is skewed to the right: between 50% and 60% of flights have a 0-minute departure delay, and fewer than 5% have a departure delay of more than 76 minutes.
quantile(Aviation$ArrivalDelayin_Mins,c(0.01,0.02,0.03,0.1,0.2,0.3,0.4,0.50,0.6,0.7,0.8,0.9,0.95,0.99,1))
##   1%   2%   3%  10%  20%  30%  40%  50%  60%  70%  80%  90%  95%  99% 100% 
##    0    0    0    0    0    0    0    0    2    9   19   44   77  181 1584
par(mfrow = c(1,3))
plot(density(Aviation$ArrivalDelayin_Mins), main="Arrival Delays in Minutes") # Looks right-skewed
boxplot(Aviation$ArrivalDelayin_Mins,
     main = "Arrival Delays in Minutes",
     xlab ="Minutes",
          ylab = "",
          col = "blue",
          border = "black", horizontal = TRUE, notch = FALSE)
ggplot(Aviation, aes(ArrivalDelayin_Mins)) + geom_histogram() 
qqnorm(Aviation$ArrivalDelayin_Mins)

  • The Arrival Delay in Minutes plots above show that the distribution is skewed to the right: between 50% and 60% of flights have a 0-minute arrival delay, and fewer than 5% have an arrival delay of more than 77 minutes.

Treating Outlier Values

#Outlier Treatment for Departure Delays
    Aviation$DepartureDelayin_Mins[which(Aviation$DepartureDelayin_Mins > 76)] <- 76 #capping at 95% from density Plot
#Outlier Treatment for Arrival Delays
    Aviation$ArrivalDelayin_Mins[which(Aviation$ArrivalDelayin_Mins > 77)] <- 77 #capping at 95% from density Plot
pD = ggplot(Aviation, aes(DepartureDelayin_Mins)) + geom_histogram() 
pA = ggplot(Aviation, aes(ArrivalDelayin_Mins)) + geom_histogram() 
library(cowplot)
plot_grid(pD, pA, labels = "AUTO")

  • Since both the Departure and Arrival Delay distributions have between 50% and 60% of their values at 0 minutes, no flooring is applied. Departure/Arrival Delay times are capped at the 95th percentile (i.e. 76 minutes for Departure and 77 minutes for Arrival).

  • Variable Transformation (transforming text values to numerical ratings as follows):

    • Gender: 1-Male and 0-Female
    • Satisfaction: 1-Satisfied and 0-neutral or dissatisfied
    • Class: 1-Business, 2-Economy Plus, and 3-Economy
    • Survey satisfaction levels: 0-Extremely Poor, 1-Poor, 2-Needs Improvement, 3-Acceptable, 4-Good, and 5-Excellent.
Aviation$Gender <- as.integer(ifelse(Aviation$Gender == "Male", 1, 0))
Aviation$Satisfaction <- as.integer(ifelse(Aviation$Satisfaction == "satisfied", 1, 0))
Aviation$Class <- as.integer(ifelse(Aviation$Class == "Business",1, 
                             ifelse(Aviation$Class == "Eco Plus",2, 3 )))

Aviation[c(8:21)] <- as.integer(ifelse(Aviation[c(8:21)] == "extremely poor", 0, 
                             ifelse(Aviation[c(8:21)] == "poor", 1,
                             ifelse(Aviation[c(8:21)] == "need improvement", 2,
                             ifelse(Aviation[c(8:21)] == "acceptable", 3,
                             ifelse(Aviation[c(8:21)] == "good", 4, 5))))))
  • The survey predictor variables (i.e. Seat_comfort, Departure.Arrival.time_convenient, Food_drink, Gate_location, Inflightwifi_service, Inflight_entertainment, Online_support, Ease_of_Onlinebooking, Onboard_service, Leg_room_service, Baggage_handling, Checkin_service, Cleanliness, and Online_boarding) have now been converted from characters to integers for further analysis.

    • The data frame now contains only numeric and integer variables.
  • Relationships among variables and important variables: the correlation plot provides correlations among the variables used in predicting customer satisfaction.

library(ggcorrplot)  ## ggcorrplot() below comes from the ggcorrplot package
df <- dplyr::select_if(Aviation, is.numeric)
cor_df <- cor(df, use = "complete.obs") #use complete.obs to discard the entire row if an NA is present
round(cor_df,2)
cor_df = as.data.frame(round(cor_df,2))
ggcorrplot(cor_df, hc.order = TRUE, type = "lower",  
                    ggtheme = ggplot2::theme_minimal, 
                      title = "Fig. 2A: Falcon Airline's Satisfaction Correlations", 
                      insig = "blank",
                        lab = TRUE,
                   lab_size = 2,pch.cex = 1, tl.cex = 8)

  • Figure 2A shows notable [raw data] correlations between Satisfaction and Inflight Entertainment (52%), Online Booking (43%), Online Support (39%), Online Boarding (34%), Onboard Service (33%), Legroom Service (31%), Checkin Service (27%), Baggage Handling and Cleanliness (26% each), Seat Comfort (24%), Inflight WiFi (23%), Age (12%), and Food & Drink (11%). These 13 correlated variables should be carried into the analysis for predicting the level of passengers’ satisfaction with Falcon Airlines. Although the remaining variables do not correlate highly with satisfaction, they may still be considered for service improvement.

  • Insightful visualizations. The Data Frame Summary table shows missing values in the following parameters: Customer Type, Type of Travel, Arrival Delay in Minutes (imputed with the variable’s mean), Departure/Arrival Time Convenient, Food & Drink, and Onboard Service. The remaining categorical missing values are replaced with the mode of each variable where they correlate highly with satisfaction. Key observations from the exploratory analysis are listed below, followed by a sketch of the kind of plots behind them.

    • There are more female than male passengers.
    • There are more Loyal Customers than Disloyal Customers.
    • More passengers travel for Business than for Personal reasons.
    • Passengers in Business and Economy class are approximately equally represented, while far fewer fly Economy Plus.
    • The average passenger age is about 40, and the majority are between 27 and 51 years old (the youngest is 7 and the oldest is 85).
    • The average flight distance is 1982 miles (the shortest is 50 miles and the longest is 6950 miles).
    • Both Departure and Arrival Delay times average approximately 15 minutes (the maximum delays are 1592 and 1584 minutes, respectively).
    • Overall, more passengers are Satisfied with Falcon Airlines than not, at 54.73%.
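
A hypothetical sketch of the kind of plots behind the observations above; Aviation_raw is an assumed copy of the data taken before columns were dropped and recoded:

library(ggplot2)
ggplot(Aviation_raw, aes(x = Gender)) + geom_bar() + ggtitle("Passengers by Gender")
ggplot(Aviation_raw, aes(x = Class)) + geom_bar() + ggtitle("Passengers by Seating Class")
ggplot(Aviation_raw, aes(x = Age)) + geom_histogram(binwidth = 5) + ggtitle("Passenger Age Distribution")
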
# Information value and Weights of Evidence (WOE) calculation
Imp_Var = Aviation
#install.packages("Information")
library(Information)
Imp_test <- create_infotables(Imp_Var, y = "Satisfaction", ncore = 2)
Imp_dataframe <- Imp_test$Summary
print(head(Imp_dataframe, 7))
##                  Variable        IV
## 12 Inflight_entertainment 2.2130310
## 7            Seat_comfort 1.9034769
## 14  Ease_of_Onlinebooking 0.9027137
## 13         Online_support 0.8062015
## 20        Online_boarding 0.5187982
## 15        Onboard_service 0.5048123
## 16       Leg_room_service 0.4680803
  • The top 7 important variables and their information values (IV) are shown in the table above.

  • Addition of new variables (recall that the objective of the analysis is to determine which variables are important in driving a high satisfaction rating for Falcon Airlines).

    • Adding a Rating column containing each passenger’s most frequent (modal) survey rating, to help predict overall customer satisfaction with three levels: Poor, Neutral, and Satisfied.
Aviation$Rating = apply(Aviation[c(8:21)], MARGIN = 1, dt_mode)
  • Data split into train and test sets (75:25)
## 75% of the sample size
   smp_size <- floor(0.75 * nrow(Aviation))

## set the seed to make your partition reproducible
   set.seed(123)
   indices <- sample(seq_len(nrow(Aviation)), size = smp_size)
     train <- Aviation[indices, ]
     test  <- Aviation[-indices,]

Analytical Approach

I. Modelling Process

A. Basic Linear Regression Model

model_1 <-lm(Satisfaction ~ Inflight_entertainment + Seat_comfort + Ease_of_Onlinebooking + Online_support + Online_boarding + Onboard_service + Leg_room_service, data = train)
summary(model_1)
## 
## Call:
## lm(formula = Satisfaction ~ Inflight_entertainment + Seat_comfort + 
##     Ease_of_Onlinebooking + Online_support + Online_boarding + 
##     Onboard_service + Leg_room_service, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07925 -0.30814  0.04724  0.25738  1.37652 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.6324618  0.0066969 -94.441  < 2e-16 ***
## Inflight_entertainment  0.1462264  0.0013749 106.357  < 2e-16 ***
## Seat_comfort           -0.0000184  0.0012010  -0.015  0.98777    
## Ease_of_Onlinebooking   0.0552473  0.0019146  28.855  < 2e-16 ***
## Online_support          0.0281856  0.0016925  16.653  < 2e-16 ***
## Online_boarding         0.0056975  0.0018032   3.160  0.00158 ** 
## Onboard_service         0.0598178  0.0014236  42.020  < 2e-16 ***
## Leg_room_service        0.0471757  0.0013035  36.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3893 on 68179 degrees of freedom
## Multiple R-squared:  0.3885, Adjusted R-squared:  0.3884 
## F-statistic:  6188 on 7 and 68179 DF,  p-value: < 2.2e-16
  • The basic linear regression model for the train dataset with the seven important variables shows that Inflight Entertainment, Ease of Online Booking, Online Support, Onboard Service, Leg Room Service, and Online Boarding are significant in determining passengers’ Satisfaction with p-values less than 0.1%. The Multiple R-squared and Adjusted R-squared are 38.85% and 38.84%, respectively.
model_2 <-lm(Satisfaction ~ Inflight_entertainment + Seat_comfort + Ease_of_Onlinebooking + Online_support + Online_boarding + Onboard_service + Leg_room_service, data = test)
summary(model_2)
## 
## Call:
## lm(formula = Satisfaction ~ Inflight_entertainment + Seat_comfort + 
##     Ease_of_Onlinebooking + Online_support + Online_boarding + 
##     Onboard_service + Leg_room_service, data = test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02781 -0.30574  0.04166  0.25525  1.42424 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -6.186e-01  1.153e-02 -53.636   <2e-16 ***
## Inflight_entertainment  1.441e-01  2.403e-03  59.970   <2e-16 ***
## Seat_comfort            7.366e-05  2.094e-03   0.035    0.972    
## Ease_of_Onlinebooking   5.543e-02  3.310e-03  16.748   <2e-16 ***
## Online_support          3.186e-02  2.944e-03  10.823   <2e-16 ***
## Online_boarding         3.042e-03  3.097e-03   0.982    0.326    
## Onboard_service         5.775e-02  2.488e-03  23.211   <2e-16 ***
## Leg_room_service        4.632e-02  2.272e-03  20.394   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3903 on 22722 degrees of freedom
## Multiple R-squared:  0.3851, Adjusted R-squared:  0.3849 
## F-statistic:  2033 on 7 and 22722 DF,  p-value: < 2.2e-16
  • The basic linear regression model for the test dataset with the seven important variables shows that Inflight Entertainment, Ease of Online Booking, Online Support, Onboard Service, and Leg Room Service are significant in determining passengers’ Satisfaction with p-values near zero. The Multiple R-squared and Adjusted R-squared are 38.51% and 38.49%, respectively.
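
Since Satisfaction is coded 0/1, the linear model can also be read as a rough classifier. A minimal sketch, assuming a 0.5 cutoff on the fitted values (not part of the original analysis):

lm_pred <- as.integer(predict(model_1, test) > 0.5)   ## train-fitted model scored on the test split
mean(lm_pred == test$Satisfaction)                    ## proportion classified correctly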

B. Principal Component Analysis (PCA) and Factor Analysis (FA).

*PCA is a method of obtaining important variables*

Calculate PCA values

Aviation_p = Aviation[-c(22)]
Aviation_p$Satisfaction <- as.factor(ifelse(Aviation_p$Satisfaction <= 0,"Neutral/Unsatisfied", "Satisfied"))
## train_p / test_p are not defined in the report; assumed here to reuse the 75:25 split indices created earlier
train_p <- Aviation_p[indices, ]
test_p  <- Aviation_p[-indices, ]
train.pca <- prcomp(train_p[, -7], 
                   center = TRUE,
                   scale. = TRUE) #prcomp() is used to perform/calculate PCA
trpca <- train.pca$rotation[, 1:7]
knitr::kable(trpca)
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Gender 0.0654329 -0.0007056 -0.0438511 -0.0036211 0.3087693 -0.1268917 0.7457004
Age -0.0664518 -0.0198820 0.0268271 0.0272237 -0.6136396 -0.1572994 0.4672788
Class 0.1362747 0.0115680 0.0770240 0.0319392 0.0461821 0.7005678 -0.0508878
Flight_Distance 0.0188101 -0.0183297 -0.0352722 -0.1141820 0.6201479 -0.3896129 -0.1052700
DepartureDelayin_Mins 0.0530409 -0.0538526 0.0411481 -0.6966518 -0.0643410 0.0285817 0.0122468
ArrivalDelayin_Mins 0.0575621 -0.0563217 0.0417910 -0.6952638 -0.0677856 0.0349566 0.0076457
Seat_comfort -0.2026007 0.2634025 0.4507876 -0.0074788 0.0233616 0.0465883 -0.1039955
Departure.Arrival.time_convenient -0.0751900 0.2896418 0.3603580 -0.0173375 0.1208583 0.0889510 0.2653535
Food_drink -0.1278557 0.3040501 0.4913077 -0.0109401 0.0290120 -0.0769625 -0.0611758
Gate_location -0.0086373 0.1230872 0.1804594 -0.0084376 0.1433974 0.1825977 0.2660316
Inflightwifi_service -0.2862264 -0.3817235 0.1076023 0.0150236 0.1444507 0.1549515 0.0660929
Inflight_entertainment -0.2926500 -0.0232763 0.2441931 0.0020465 -0.1652099 -0.3240620 -0.2055768
Online_support -0.3457127 -0.3394448 0.0832096 0.0060376 -0.0262950 -0.0237974 0.0349418
Ease_of_Onlinebooking -0.4220568 -0.1611544 -0.0706218 -0.0304902 0.0697588 0.2014204 0.0456307
Onboard_service -0.2768978 0.2554215 -0.2601954 -0.0490683 -0.0479921 0.0336124 0.0021018
Leg_room_service -0.2460797 0.2334602 -0.1967898 -0.0686339 -0.1110632 0.0317376 -0.0374638
Baggage_handling -0.2695756 0.3044375 -0.2871096 -0.0734435 0.0899706 0.0950453 0.0199710
Checkin_service -0.1910108 0.0729327 -0.1068826 -0.0174331 -0.0236666 -0.2483233 0.0175610
Cleanliness -0.2706575 0.3084226 -0.3013080 -0.0435899 0.0906117 0.1094766 0.0240682
Online_boarding -0.3439056 -0.3631448 0.0743956 -0.0035991 0.1042999 0.1009564 0.0577162

Calculate Standard Deviations and Variances

#compute standard deviation of each principal component
std_dev <- train.pca$sdev
#compute variance
pr_var <- std_dev^2
  • The standard deviations of the first 7 components are 1.9832899, 1.4601843, 1.456694, 1.382852, 1.1426512, 1.0936016, and 1.022788, and the corresponding variances (eigenvalues) are 3.933439, 2.1321383, 2.1219573, 1.9122798, 1.3056517, 1.1959646, and 1.0460953.
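
A short follow-up sketch showing how much of the total variance those components capture, using the pr_var vector computed above:

prop_var <- pr_var / sum(pr_var)    ## proportion of variance explained by each component
round(cumsum(prop_var)[1:7], 3)     ## cumulative share explained by the first 7 components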

Scree Plot shows a clear picture of number of components in the dataset.

library(factoextra)  ## fviz_eig() and fviz_pca_var() come from the factoextra package
fviz_eig(train.pca)

  • The scree plot (visualized eigenvalues) above shows the percentage of variance explained by each principal component; it confirms that the first 7 components are worth retaining, as each has a variance (eigenvalue) greater than 1.

Variables - PCA graph shows a clearer visualization of important variables in determine customers’ Airline satisfaction.

fviz_pca_var(train.pca,
             col.var = "contrib", # Color by contributions to the PC
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )

  • The graph of variables above shows that positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides.

Predict using PCA

tr.predict <- predict(train.pca, train_p)
tr.predict <- data.frame(tr.predict, train_p[7])
tst.predict <- predict(train.pca, test_p)
tst.predict <- data.frame(tst.predict, test_p$Satisfaction)

Multinomial Logistic Regression Model with the First Seven PCs

library(nnet)
tr.predict$Satisfaction <- relevel(tr.predict$Satisfaction, ref = "Satisfied")
pc_model <- multinom(Satisfaction ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 , data = tr.predict)
## # weights:  9 (8 variable)
## initial  value 47263.626801 
## iter  10 value 36164.928220
## final  value 29596.414032 
## converged
summary(pc_model)
## Call:
## multinom(formula = Satisfaction ~ PC1 + PC2 + PC3 + PC4 + PC5 + 
##     PC6 + PC7, data = tr.predict)
## 
## Coefficients:
##                   Values   Std. Err.
## (Intercept) -0.272367456 0.010372270
## PC1          0.903370085 0.006999610
## PC2         -0.008693525 0.006740759
## PC3          0.020946832 0.006980747
## PC4         -0.071686748 0.007383235
## PC5          0.419451413 0.009101106
## PC6          0.570291138 0.009475457
## PC7          0.494666538 0.010099782
## 
## Residual Deviance: 59192.83 
## AIC: 59208.83

Confusion Matrix & Misclassification Error - Train dataset

p1   <- predict(pc_model, tr.predict)
tab1 <- table(p1, tr.predict$Satisfaction)
tab1
##                      
## p1                    Satisfied Neutral/Unsatisfied
##   Satisfied               31145                7023
##   Neutral/Unsatisfied      6155               23864
error1 <- round(100*(1 - sum(diag(tab1))/sum(tab1)), 2)
satisfied1 <- round(100*((tab1[1,1])/sum(tab1)), 2) 

Acc_p1 <- (tab1[1,1] + tab1[2,2]) / sum(tab1)
Acc_p1 <- round((100*Acc_p1), 2)
  • The misclassification error on the train dataset is 19.33%, i.e. 80.67% accuracy; 45.68% of customers (31145 out of 68187) are correctly predicted as Satisfied.

Confusion Matrix & Misclassification Error - Test dataset

tst.predict <- data.frame(tst.predict, test_p[7])
p2   <- predict(pc_model, tst.predict)
tab2 <- table(p2, tst.predict$Satisfaction)
tab2
##                      
## p2                    Neutral/Unsatisfied Satisfied
##   Satisfied                          2259     10370
##   Neutral/Unsatisfied                8010      2091
error2 <- round(100*(1 - sum(diag(tab2))/sum(tab2)), 2)
satisfied2 <- round(100*((tab2[1,2])/sum(tab2)), 2) 

Acc_p2 <- (tab2[1,1] + tab2[1,2]) / sum(tab2)
Acc_p2 <- round((100*Acc_p2), 2)
  • The reported misclassification error on the test dataset is 80.86%, with 45.62% (10370 out of 22730) of customers predicted Satisfied and a computed accuracy of 55.56%. Note, however, that the factor levels in the test confusion matrix are ordered differently from the train matrix (Neutral/Unsatisfied first), so the diagonal here counts disagreements rather than agreements; the actual agreement between predictions and actuals is (8010 + 10370) / 22730, roughly 80.9%.
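
A minimal sketch of one way to keep the factor-level ordering consistent before tabulating, so that the diagonal counts agreements (not part of the original analysis):

lev <- levels(tr.predict$Satisfaction)   ## "Satisfied", "Neutral/Unsatisfied"
tab2_aligned <- table(factor(p2, levels = lev),
                      factor(tst.predict$Satisfaction, levels = lev))
round(100 * sum(diag(tab2_aligned)) / sum(tab2_aligned), 2)   ## accuracy with matching level order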

Logistic Regression on the Seven Important Variables

# flight dataset is the sub_Aviation dataset with only important variables: Inflight_entertainment + Seat_comfort + Ease_of_Onlinebooking + Online_support + Online_boarding + Onboard_service + Leg_room_service will be used to perform the analysis.
flight = Aviation[c(7, 8, 13:17, 21)]
flight$Satisfaction <- as.factor(ifelse(flight$Satisfaction <= 0,"Neutral/Unsatisfied", "Satisfied"))
set.seed(123)
index = sample(2:nrow(flight), .75*nrow(flight))
lg_train = flight[index,]
lg_test = flight[-index,]
#sanity check
prop.table(table(lg_train$Satisfaction))
## 
## Neutral/Unsatisfied           Satisfied 
##            0.452271            0.547729
  • Logistic regression on the seven important variables: the train dataset contains 54.77% satisfied passengers.

Logistic Model-Train Dataset

logreg1 <- glm(Satisfaction ~ ., data = lg_train,  family = binomial(link = "logit"))
summary(logreg1) 
## 
## Call:
## glm(formula = Satisfaction ~ ., family = binomial(link = "logit"), 
##     data = lg_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6273  -0.7419   0.2960   0.6616   3.3312  
## 
## Coefficients:
##                         Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)            -6.724201   0.056657 -118.683   <2e-16 ***
## Seat_comfort           -0.010560   0.008417   -1.255    0.210    
## Inflight_entertainment  0.853160   0.009898   86.194   <2e-16 ***
## Online_support          0.169080   0.010836   15.603   <2e-16 ***
## Ease_of_Onlinebooking   0.331576   0.012256   27.055   <2e-16 ***
## Onboard_service         0.366328   0.009352   39.171   <2e-16 ***
## Leg_room_service        0.292056   0.008483   34.428   <2e-16 ***
## Online_boarding         0.020691   0.011858    1.745    0.081 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 93905  on 68186  degrees of freedom
## Residual deviance: 62610  on 68179  degrees of freedom
## AIC: 62626
## 
## Number of Fisher Scoring iterations: 5
  • The table above shows that Inflight Entertainment, Online Support, Ease of Online Booking, Onboard Service, and Leg Room Service are significant in predicting passengers’ satisfaction, with p-values near zero.
lg_train$predicted <- predict(logreg1, lg_train, type="response")
confmatrix <- table(Actual_value = lg_train$Satisfaction, Predicted_Value = lg_train$predicted > 0.5)
confmatrix 
##                      Predicted_Value
## Actual_value          FALSE  TRUE
##   Neutral/Unsatisfied 23309  7530
##   Satisfied            6432 30916
cm1 <- round(100*((confmatrix[2,2])/sum(confmatrix)), 2) 
Acc1 <- (confmatrix[1,1] + confmatrix[2,2]) / sum(confmatrix)
Acc_lg1 <- round((100*Acc1), 2)
  • On the train dataset, the logistic regression correctly predicts 45.34% of all passengers (30916 out of 68187) as Satisfied, for an overall accuracy of 79.52%.

Validate on Train dataset with ROC-AUC

library(ROCR)  ## prediction() and performance() below come from the ROCR package
lg_train$Satisfaction <- ifelse(lg_train$Satisfaction  == 'Satisfied', 1, 0)
pred1 = predict(logreg1, lg_train)
pred = prediction(pred1, lg_train$Satisfaction)
roc = performance(pred, "tpr", "fpr")
plot(roc, 
        colorize = T, 
            main = "ROC-AUC Curve for Train Dataset",
            ylab = "Sensitivity: True Positive Rate",
            xlab = "1 - Specificity: False Positive Rate")
        abline(a = 0, b = 1)

eval = performance(pred, "acc")
#Identify Best Cutoff Values
max = which.max(slot(eval, "y.values")[[1]])
acc = slot(eval, "y.values")[[1]][max]
cut = slot(eval, "x.values")[[1]][max]
print(c(Accuracy = acc, Cutoff = cut))
##     Accuracy Cutoff.29690 
##    0.8086145    0.3861435
  • The ROC analysis for the train dataset confirms the performance of the logistic regression, with a best-cutoff accuracy of 80.86%.
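
For completeness, the AUC itself (distinct from the best-cutoff accuracy above) can be read from the same ROCR objects; a minimal sketch:

auc_train <- performance(pred, "auc")   ## area under the ROC curve for the train dataset
as.numeric(auc_train@y.values)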

Logistic Model-Test Dataset

logreg2 <- glm(Satisfaction ~ ., data = lg_test,  family = binomial(link = "logit"))
summary(logreg2) 
## 
## Call:
## glm(formula = Satisfaction ~ ., family = binomial(link = "logit"), 
##     data = lg_test)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6140  -0.7212   0.2881   0.6551   3.2240  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -6.698366   0.097367 -68.795   <2e-16 ***
## Seat_comfort           -0.028763   0.014741  -1.951    0.051 .  
## Inflight_entertainment  0.880391   0.017490  50.336   <2e-16 ***
## Online_support          0.178001   0.018865   9.435   <2e-16 ***
## Ease_of_Onlinebooking   0.356944   0.021168  16.862   <2e-16 ***
## Onboard_service         0.348017   0.016127  21.579   <2e-16 ***
## Leg_room_service        0.279568   0.014780  18.915   <2e-16 ***
## Online_boarding        -0.003629   0.020361  -0.178    0.859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31317  on 22729  degrees of freedom
## Residual deviance: 20747  on 22722  degrees of freedom
## AIC: 20763
## 
## Number of Fisher Scoring iterations: 5
  • GLM on the test dataset gives results consistent with the train dataset: Inflight Entertainment, Online Support, Ease of Online Booking, Onboard Service, and Leg Room Service are significant in predicting passengers’ satisfaction, with p-values near zero.
lg_test$predicted <- predict(logreg1, lg_test, type="response")
confmatrix2 <- table(Actual_value = lg_test$Satisfaction, Predicted_Value = lg_test$predicted > 0.5)
confmatrix2 
##                      Predicted_Value
## Actual_value          FALSE  TRUE
##   Neutral/Unsatisfied  7786  2531
##   Satisfied            2127 10286
cm2 <- round(100*((confmatrix2[2,2])/sum(confmatrix2)), 2) 
Acc2 <- (confmatrix2[1,1] + confmatrix2[2,2]) / sum(confmatrix2)
Acc_lg2 <- round((100*Acc2), 2)
  • On the test dataset, the logistic regression correctly predicts 45.25% of all passengers (10286 out of 22730) as Satisfied, for an overall accuracy of 79.51%.

Validate on Test dataset with ROC-AUC

lg_test$Satisfaction <- ifelse(lg_test$Satisfaction  == 'Satisfied', 1, 0)
pred2 = predict(logreg2, lg_test)
tpred = prediction(pred2, lg_test$Satisfaction)
roc2 = performance(tpred, "tpr", "fpr")
plot(roc2, 
        colorize = T, 
            main = "ROC-AUC Curve for Test Dataset",
            ylab = "Sensitivity: True Positive Rate",
            xlab = "1 - Specificity: False Positive Rate")
        abline(a = 0, b = 1)

eval2 = performance(tpred, "acc")
#Identify Best Cutoff Values
max = which.max(slot(eval2, "y.values")[[1]])
acc2 = slot(eval2, "y.values")[[1]][max]
cut2 = slot(eval2, "x.values")[[1]][max]
print(c(Accuracy = acc2, Cutoff = cut2))
##     Accuracy Cutoff.18948 
##    0.8119666    0.2997027
  • The ROC analysis for the test dataset confirms the performance of the logistic regression on the test data, with a best-cutoff accuracy of 81.2%.

C. Building CART Model

CRTflight = Aviation[-c(22)]
CRTflight$Satisfaction <- as.factor(ifelse(CRTflight$Satisfaction <= 0,"Neutral/Unsatisfied", "Satisfied"))
#Partitioning the data into training and test dataset
library(caTools)  ## sample.split() below comes from the caTools package
set.seed(123)
split <- sample.split(CRTflight$Satisfaction, SplitRatio = 0.75)
train.cart <- subset(CRTflight, split == TRUE)
test.cart <- subset(CRTflight, split == FALSE)
  • There are 68188 rows in train and 22729 in test CART dataset.
head(train.cart)
## # A tibble: 6 x 21
##   Gender   Age Class Flight_Distance DepartureDelayi~ ArrivalDelayin_~
##    <int> <dbl> <int>           <dbl>            <dbl>            <dbl>
## 1      0    65     3             265                0                0
## 2      0    60     3             623                0                0
## 3      0    66     3             227               17               15
## 4      1    10     3            1812                0                0
## 5      0    58     3             104               47               48
## 6      0    34     3            3633                0                0
## # ... with 15 more variables: Satisfaction <fct>, Seat_comfort <int>,
## #   Departure.Arrival.time_convenient <int>, Food_drink <int>,
## #   Gate_location <int>, Inflightwifi_service <int>,
## #   Inflight_entertainment <int>, Online_support <int>,
## #   Ease_of_Onlinebooking <int>, Onboard_service <int>,
## #   Leg_room_service <int>, Baggage_handling <int>, Checkin_service <int>,
## #   Cleanliness <int>, Online_boarding <int>
prop.table(table(train.cart$Satisfaction))
## 
## Neutral/Unsatisfied           Satisfied 
##            0.452675            0.547325
  • The table above shows that the CART train dataset contains 54.73% satisfied passengers.
#Setting the control parameters
library(rpart)
r.ctrl = rpart.control(minsplit = 1000, minbucket = 100, cp = 0, xval = 10)
#Building the CART model
flight_tree <- rpart(formula = Satisfaction ~., data = CRTflight, method = "class", control = r.ctrl)
printcp(flight_tree)
## 
## Classification tree:
## rpart(formula = Satisfaction ~ ., data = CRTflight, method = "class", 
##     control = r.ctrl)
## 
## Variables actually used in tree construction:
##  [1] Age                               Baggage_handling                 
##  [3] Checkin_service                   Class                            
##  [5] Cleanliness                       Departure.Arrival.time_convenient
##  [7] Ease_of_Onlinebooking             Flight_Distance                  
##  [9] Food_drink                        Gender                           
## [11] Inflight_entertainment            Leg_room_service                 
## [13] Onboard_service                   Online_boarding                  
## [15] Online_support                    Seat_comfort                     
## 
## Root node error: 41156/90917 = 0.45268
## 
## n= 90917 
## 
##            CP nsplit rel error  xerror      xstd
## 1  0.56273690      0   1.00000 1.00000 0.0036467
## 2  0.05270191      1   0.43726 0.43726 0.0029192
## 3  0.04446496      2   0.38456 0.38456 0.0027780
## 4  0.00947614      3   0.34010 0.34010 0.0026441
## 5  0.00652801      7   0.28735 0.28914 0.0024710
## 6  0.00410633     11   0.26047 0.26227 0.0023698
## 7  0.00323161     12   0.25637 0.25955 0.0023591
## 8  0.00289144     13   0.25313 0.25668 0.0023478
## 9  0.00262416     14   0.25024 0.25092 0.0023247
## 10 0.00216250     16   0.24499 0.24689 0.0023083
## 11 0.00150646     20   0.23596 0.23627 0.0022643
## 12 0.00129588     25   0.22553 0.22512 0.0022164
## 13 0.00127158     28   0.22164 0.22473 0.0022147
## 14 0.00106910     31   0.21783 0.22424 0.0022126
## 15 0.00036447     34   0.21462 0.22109 0.0021987
## 16 0.00019438     35   0.21426 0.22104 0.0021985
## 17 0.00017818     37   0.21387 0.22092 0.0021979
## 18 0.00000000     40   0.21333 0.21977 0.0021929
  • The variables actually used in tree construction include Inflight Entertainment, Online Support, Ease of Online Booking, Onboard Service, Seat Comfort, Online Boarding, and Leg Room Service, with a root node error of 45.3%.

Displaying the Decision Tree

#install.packages("rpart.plot")
library(RColorBrewer)
library(rattle)
library(rpart.plot)
fancyRpartPlot(flight_tree)

plotcp(flight_tree) 

  • The plot above suggests a complexity parameter (cp) of about 0.0079, which is used to prune the tree for clearer visualization and prediction of passengers’ satisfaction.
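
A programmatic alternative to reading the cp value off the plot (a sketch; the report itself uses 0.0079):

cp_tbl  <- flight_tree$cptable
best_cp <- cp_tbl[which.min(cp_tbl[, "xerror"]), "CP"]   ## cp with the lowest cross-validated error; a one-standard-error rule is a common, more conservative alternative
best_cp
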
ptree<- prune(flight_tree, cp= 0.0079 ,"CP") 
printcp(ptree)
## 
## Classification tree:
## rpart(formula = Satisfaction ~ ., data = CRTflight, method = "class", 
##     control = r.ctrl)
## 
## Variables actually used in tree construction:
## [1] Checkin_service        Ease_of_Onlinebooking  Inflight_entertainment
## [4] Online_boarding        Seat_comfort          
## 
## Root node error: 41156/90917 = 0.45268
## 
## n= 90917 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.5627369      0   1.00000 1.00000 0.0036467
## 2 0.0527019      1   0.43726 0.43726 0.0029192
## 3 0.0444650      2   0.38456 0.38456 0.0027780
## 4 0.0094761      3   0.34010 0.34010 0.0026441
## 5 0.0079000      7   0.28735 0.28914 0.0024710
  • The main variables actually used in tree construction with cp = 0.0079 are Checkin Service, Ease of Onlinebooking, Inflight Entertainment, Online Boarding, and Seat Comfort with the root node error of 45.3%.

Ploting the final CART model (Unbalanced Dataset)

fancyRpartPlot(ptree, 
               uniform = TRUE, 
               main = "Figure 3C: CART Model- Final Tree", 
               palettes = c("Blues", "Oranges")
               )

  • Figure 3C above shows the following:
    • 55% of the surveyed passengers are satisfied with Falcon Airlines’ Inflight Entertainment.
      • 82% of the passengers who are satisfied with Inflight Entertainment are also satisfied with Ease of Online Booking, and that group shows roughly 90% Satisfaction.
      • 12.24% (0.55 x 0.82 x 0.59 x 0.46 of 90,917) of passengers are Satisfied with Checkin Service.
      • 6.61% (0.55 x 0.82 x 0.59 x 0.46 x 0.54 of 90,917) of passengers are Satisfied with Online Boarding.
    • The chart also illustrates that passengers who are Satisfied with Ease of Online Booking, Inflight Entertainment, Checkin Service, Online Boarding, and Seat Comfort are most likely satisfied with Falcon Airlines overall.

Predict CART Train Dataset

#Setting the control parameters
r.ctrl = rpart.control(minsplit = 1000, minbucket = 100, cp = 0, xval = 10)
train_tree <- rpart(formula = Satisfaction ~., data = train.cart, method = "class", control = r.ctrl)
#Scoring/Predicting the training dataset
train.cart$prediction = predict(train_tree, train.cart, type = "class")
train.cart$probability = predict(train_tree, train.cart, type = "prob")
head(train.cart)
## # A tibble: 6 x 23
##   Gender   Age Class Flight_Distance DepartureDelayi~ ArrivalDelayin_~
##    <int> <dbl> <int>           <dbl>            <dbl>            <dbl>
## 1      0    65     3             265                0                0
## 2      0    60     3             623                0                0
## 3      0    66     3             227               17               15
## 4      1    10     3            1812                0                0
## 5      0    58     3             104               47               48
## 6      0    34     3            3633                0                0
## # ... with 18 more variables: Satisfaction <fct>, Seat_comfort <int>,
## #   Departure.Arrival.time_convenient <int>, Food_drink <int>,
## #   Gate_location <int>, Inflightwifi_service <int>,
## #   Inflight_entertainment <int>, Online_support <int>,
## #   Ease_of_Onlinebooking <int>, Onboard_service <int>,
## #   Leg_room_service <int>, Baggage_handling <int>, Checkin_service <int>,
## #   Cleanliness <int>, Online_boarding <int>, prediction <fct>,
## #   probability[,"Neutral/Unsatisfied"] <dbl>, [,"Satisfied"] <dbl>
cart.tbl = table(train.cart$Satisfaction, train.cart$prediction)
print(cart.tbl)
##                      
##                       Neutral/Unsatisfied Satisfied
##   Neutral/Unsatisfied               27422      3445
##   Satisfied                          3458     33863
pcr = round(100*((cart.tbl[1,2] + cart.tbl[2,1])/sum(cart.tbl)), 2)
print(paste("Classification error rate of the CART-Train Dataset is", pcr,"%"))
## [1] "Classification error rate of the CART-Train Dataset is 10.12 %"
  • Classification error rate of the prediction on Passengers’ Satisfaction from CART-Train Dataset is 10.12%, which means the accuracy using CART model to predict Customer’s Satisfaction is 89.88%.

Predict CART Test Dataset

#Setting the control parameters
t.ctrl = rpart.control(minsplit = 1000, minbucket = 100, cp = 0, xval = 10)
test_tree <- rpart(formula = Satisfaction ~., data = test.cart, method = "class", control = t.ctrl)
#Scoring/Predicting the training dataset
test.cart$prediction = predict(test_tree, test.cart, type = "class")
test.cart$probability = predict(test_tree, test.cart, type = "prob")
head(test.cart)
## # A tibble: 6 x 23
##   Gender   Age Class Flight_Distance DepartureDelayi~ ArrivalDelayin_~
##    <int> <dbl> <int>           <dbl>            <dbl>            <dbl>
## 1      0    15     3            2138                0                0
## 2      0    70     3             354                0                0
## 3      1    30     3            1894                0                0
## 4      1    22     3            1556               30               26
## 5      1    62     3            1695                0                0
## 6      0    55     3            2554                0                0
## # ... with 18 more variables: Satisfaction <fct>, Seat_comfort <int>,
## #   Departure.Arrival.time_convenient <int>, Food_drink <int>,
## #   Gate_location <int>, Inflightwifi_service <int>,
## #   Inflight_entertainment <int>, Online_support <int>,
## #   Ease_of_Onlinebooking <int>, Onboard_service <int>,
## #   Leg_room_service <int>, Baggage_handling <int>, Checkin_service <int>,
## #   Cleanliness <int>, Online_boarding <int>, prediction <fct>,
## #   probability[,"Neutral/Unsatisfied"] <dbl>, [,"Satisfied"] <dbl>
cart.tbl2 = table(test.cart$Satisfaction, test.cart$prediction)
print(cart.tbl2)
##                      
##                       Neutral/Unsatisfied Satisfied
##   Neutral/Unsatisfied                8511      1778
##   Satisfied                          1156     11284
pct = round(100*((cart.tbl2[1,2] + cart.tbl2[2,1])/sum(cart.tbl2)),2)
print(paste("Classification error rate of the CART-Test Dataset is", pct,"%"))
## [1] "Classification error rate of the CART-Test Dataset is 12.91 %"
  • Classification error rate of the prediction on Passengers’ Satisfaction from CART-Test Dataset is 12.91%, which means the accuracy using CART model to predict Customer’s Satisfaction is 87.09%.
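
Note that the test-set tree above is refit on the test data itself; a more conventional out-of-sample check would score the test data with the tree grown on the training data. A minimal sketch (not part of the original analysis):

oos_pred <- predict(train_tree, test.cart, type = "class")    ## training-data tree applied to the test split
round(100 * mean(oos_pred != test.cart$Satisfaction), 2)      ## out-of-sample classification error (%)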

Random Forests Model

RF = Aviation[-c(22)]
RF$Satisfaction <- as.factor(ifelse(RF$Satisfaction <= 0,"Neutral/Unsatisfied", "Satisfied"))
RF$Satisfaction <- as.character(RF$Satisfaction)
RF$Satisfaction <- as.factor(RF$Satisfaction)

library(randomForest)  ## tuneRF() and randomForest() below come from the randomForest package
RFindex <- sample(2:nrow(RF),0.75*nrow(RF))
RFtrain <- RF[RFindex,]
RFtest <- RF[-RFindex,]
tuned_rf <- tuneRF(x = RFtrain[,-7], 
                   y = RFtrain$Satisfaction,
           mtryStart = 2, 
            ntreeTry = 500, 
          stepFactor = 1.5, 
             improve = 0.001, 
               trace = T, 
                plot = T,
              doBest = TRUE,
            nodesize = 20, 
          importance = T
)
## mtry = 2  OOB error = 6.99% 
## Searching left ...
## Searching right ...
## mtry = 3     OOB error = 6.46% 
## 0.07598657 0.001 
## mtry = 4     OOB error = 6.17% 
## 0.04429805 0.001 
## mtry = 6     OOB error = 5.89% 
## 0.04540052 0.001 
## mtry = 9     OOB error = 5.85% 
## 0.00747012 0.001 
## mtry = 13    OOB error = 5.88% 
## -0.00526844 0.001

  • The tuned_rf plot shows that mtry = 9 has the lowest Out-of-Bag (OOB) error, 5.85%, so mtry = 9 is used in the Random Forest model.
random_forest <- randomForest(Satisfaction~ .,
                                data = RFtrain,
                               ntree = 500, 
                                mtry = 9, 
                            nodesize = 200,
                          importance =TRUE)

Listing the importance of the variables.

impVar <- round(randomForest::importance(random_forest), 2)
k.impVar <- impVar[order(impVar[,3], decreasing = TRUE),]
knitr::kable(k.impVar)
Neutral/Unsatisfied Satisfied MeanDecreaseAccuracy MeanDecreaseGini
Seat_comfort 141.61 69.37 153.39 4714.04
Inflight_entertainment 83.24 85.95 124.86 9524.05
Checkin_service 50.00 30.45 58.80 383.53
Online_support 41.59 34.76 53.42 1642.03
Class 72.38 42.57 52.92 910.09
Leg_room_service 25.83 42.46 50.45 956.15
Ease_of_Onlinebooking 33.90 40.31 48.95 2858.08
Food_drink 44.44 9.13 47.25 729.24
Flight_Distance 42.38 15.92 45.81 332.33
Baggage_handling 37.12 23.98 43.08 264.79
Age 29.83 27.75 41.33 271.86
Departure.Arrival.time_convenient 25.14 38.45 41.26 271.75
Online_boarding 34.67 22.81 40.94 779.81
Gender 14.30 37.34 39.13 365.22
Inflightwifi_service 23.03 28.28 36.80 435.39
Cleanliness 33.90 24.86 36.41 355.83
Onboard_service 20.09 30.38 33.62 487.02
ArrivalDelayin_Mins 16.81 15.81 22.11 89.69
DepartureDelayin_Mins 15.10 9.81 17.77 65.74
Gate_location -0.07 12.51 12.57 20.09
varImpPlot(random_forest)

  • The Random Forest model above shows that the important variables for predicting passengers’ satisfaction are Seat Comfort, Inflight Entertainment, Traveling Class, Checkin Service, Online Support, Leg Room Service, and Ease of Online Booking; Seat Comfort and Inflight Entertainment are the two most important, with the largest mean decrease in accuracy.

Plotting for arriving at the optimum number of trees

plot(tuned_rf, main = "")
legend("topright", c("Out-Of-Bag (OOB)", "Neutral/Unsatisfied", "Satisfied"), text.col = 1:6, lty = 1:3, col = 1:3)
title(main="Error Rates Random Forest RFDF.dev")

Random Forests Model with Train Dataset

train_perf_dataset <- RFtrain
train_perf_dataset$predict.class <- predict(tuned_rf, RFtrain, type = "class")
train_perf_dataset$predict.score <- predict(tuned_rf, RFtrain, type = "prob")
# Get AUC for Train dataset

RFpred <- prediction(train_perf_dataset$predict.score[,2], train_perf_dataset$Satisfaction)
RFperf <- performance(RFpred, "tpr", "fpr")
plot(RFperf)

RFauc <- performance(RFpred,"auc"); 
RFauc <- as.numeric(RFauc@y.values)

library(caret)  ## confusionMatrix() comes from the caret package
confusionMatrix(train_perf_dataset$Satisfaction,
                train_perf_dataset$predict.class)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Neutral/Unsatisfied Satisfied
##   Neutral/Unsatisfied               30004       772
##   Satisfied                          1296     36115
##                                              
##                Accuracy : 0.9697             
##                  95% CI : (0.9684, 0.9709)   
##     No Information Rate : 0.541              
##     P-Value [Acc > NIR] : < 2.2e-16          
##                                              
##                   Kappa : 0.9389             
##                                              
##  Mcnemar's Test P-Value : < 2.2e-16          
##                                              
##             Sensitivity : 0.9586             
##             Specificity : 0.9791             
##          Pos Pred Value : 0.9749             
##          Neg Pred Value : 0.9654             
##              Prevalence : 0.4590             
##          Detection Rate : 0.4400             
##    Detection Prevalence : 0.4513             
##       Balanced Accuracy : 0.9688             
##                                              
##        'Positive' Class : Neutral/Unsatisfied
## 
  • The accuracy rate in predicting passengers’ satisfaction on the Random Forest train dataset is 96.97%, with a sensitivity of 95.86% and a 95% confidence interval between 96.84% and 97.09%; Inflight Entertainment and Seat Comfort are the most important variables.

Random Forests Model with Test Dataset

#Model validation on Test dataset

test_perf_dataset <- RFtest

test_perf_dataset$predict.class <- predict(tuned_rf, test_perf_dataset, type="class")
test_perf_dataset$predict.score <- predict(tuned_rf, test_perf_dataset, type="prob")

# Get AUC for Test dataset

pred_test_rf <- prediction(test_perf_dataset$predict.score[,2], test_perf_dataset$Satisfaction)
perf_test_rf <- performance(pred_test_rf, "tpr", "fpr")
plot(perf_test_rf)

auc_test_rf <- performance(pred_test_rf,"auc"); 
auc_test_rf <- as.numeric(auc_test_rf@y.values)


confusionMatrix(test_perf_dataset$Satisfaction,
                test_perf_dataset$predict.class)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Neutral/Unsatisfied Satisfied
##   Neutral/Unsatisfied                9767       613
##   Satisfied                           749     11601
##                                              
##                Accuracy : 0.9401             
##                  95% CI : (0.9369, 0.9431)   
##     No Information Rate : 0.5374             
##     P-Value [Acc > NIR] : < 2.2e-16          
##                                              
##                   Kappa : 0.8794             
##                                              
##  Mcnemar's Test P-Value : 0.0002542          
##                                              
##             Sensitivity : 0.9288             
##             Specificity : 0.9498             
##          Pos Pred Value : 0.9409             
##          Neg Pred Value : 0.9394             
##              Prevalence : 0.4626             
##          Detection Rate : 0.4297             
##    Detection Prevalence : 0.4567             
##       Balanced Accuracy : 0.9393             
##                                              
##        'Positive' Class : Neutral/Unsatisfied
## 
  • The accuracy rate in predicting passengers’ satisfaction on the Random Forest test dataset is 94.01%, with a sensitivity of 92.88% and a 95% confidence interval between 93.69% and 94.31%; Inflight Entertainment and Seat Comfort are again the most important variables.
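
The AUC values computed above (RFauc for the train data, auc_test_rf for the test data) are never printed in the report; a one-line sketch to display them:

round(c(train_AUC = RFauc, test_AUC = auc_test_rf), 4)   ## areas under the ROC curves computed above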

Building Model using XGBoost

library(xgboost)
xbgFlight = Aviation[-c(22)]
# Label conversion (Satisfaction is already coded 0/1 at this point)
rate = xbgFlight$Satisfaction
label = as.integer(xbgFlight$Satisfaction) 
xbgFlight$Satisfaction = NULL
# Split the data for training and testing (75/25 split)
n = nrow(xbgFlight)
train.index = sample(n,floor(0.75*n))
train.data = as.matrix(xbgFlight[train.index,])
train.label = label[train.index]
test.data = as.matrix(xbgFlight[-train.index,])
test.label = label[-train.index]

Create xgb.DMatrix

# Transform the two data sets into xgb.Matrix
xgbTrain = xgb.DMatrix(data = train.data,label=train.label)
xgbTest = xgb.DMatrix(data = test.data,label=test.label)
xgbModel <- xgboost(data = xgbTrain, label = train.label,
                       eta = 0.7,
                       max_depth= 5,
                       nrounds= 50,
                       nfold= 5,
                       objective = "binary:logistic",  # binary classification objective
                       verbose = 0,               # silent,
                       early_stopping_rounds= 10 # stop if no improvement for 10 consecutive trees
                       )

Predict New Outcome and the Probability of the prediction

# Predict outcomes with the test data
xbgPreds = predict(xgbModel, xgbTest)
cv.res <- xgb.cv(data = xgbTrain, label = train.label, nfold = 5,
                 nrounds = 2, objective = "binary:logistic")
## [1]  train-error:0.106703+0.000977   test-error:0.107777+0.002048 
## [2]  train-error:0.098560+0.001185   test-error:0.100166+0.002848
  • The cross-validated test error for XGBoost is approximately 10% to 11%, which indicates an accuracy of roughly 89% to 90% in predicting Passengers’ Satisfaction.
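
A minimal sketch of how a single test-accuracy figure could be computed from the predictions above, assuming a 0.5 probability cutoff (not part of the original code):

xgb_class <- as.integer(xbgPreds > 0.5)          ## convert predicted probabilities to 0/1 classes
round(100 * mean(xgb_class == test.label), 2)    ## overall test accuracy (%)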
 

© D. Vuong