
AIRLINE PASSENGER

SATISFACTION PREDICTION
MODEL

GROUP MEMBERS
1-Shahryar Khan
2-Muhammad Ehtisham Asif
3-Muhammad Usman
PROJECT SUPERVISOR
Mr. Abdul Qudus Abbasi

Institute of Information Technology

Quaid-i-Azam University Islamabad


2022
ABSTRACT
The airline passenger satisfaction prediction model is a system that identifies the criteria passengers need in order to be satisfied with their journey and to become regular customers. During a flight, satisfaction depends on the travel class (Business, Eco, or Eco Plus) and on the facilities provided in that class, such as cleanliness, food quality, seat comfort, and many more. Passengers rate these things, and the ratings tell us whether they are satisfied or not. Customer type also has a great impact: a customer may be loyal (regular) or disloyal (not regular), and if a regular customer is dissatisfied with the given facilities, it means the airline's standard has dropped. The purpose of travel matters as well: business trips are usually taken in Business class, where travelers tend to be satisfied, while people travelling for personal reasons take whatever class is available and are often not satisfied with it. Travel distance also matters. For example, a passenger flying from Pakistan to the USA is on a long route, will usually prefer Business class, will mostly receive good facilities, and will be satisfied; a passenger travelling locally, say from Lahore to Karachi, will prefer the cheaper Eco class and may in turn be dissatisfied. These are only example scenarios and may vary. The main purpose of this prediction model is to help organizations learn what they can do better to satisfy their passengers. We use previously collected data that contains the features mentioned above, labelled with whether each person was satisfied or not. We train our model on that data; the prediction model uses the main features of the data to determine whether people are satisfied, and helps organizations improve the facilities in which they are lacking.
SECTION-1
MODEL DEVELOPMENT
CHAPTER-1

PROBLEM DEFINITION

1.1 PROBLEM STATEMENT

Satisfaction is one of the first things you need when you spend your money, whether on a business, on buying something, or anything else: you simply need to be satisfied. Travel companies likewise need to grow their business, for which they need their passengers to be satisfied with their services. But how would they know how people rate their services, and for which reasons they are not satisfied? They can collect ratings on different services, check the ratings for each customer class, and so on, and then determine from those ratings which passengers are satisfied. For this purpose they need a tool that takes those features and predicts whether a passenger is satisfied; this also helps them decide what to do even better in the future so that passengers will be satisfied.

1.2 PURPOSE

Satisfaction gives you the ease of choosing or doing a particular thing again and again. If passengers are not satisfied with the facilities provided, they will not travel again; and if they do, they will expect the facilities to be better. To make them better, we need a model that tells us which facilities make passengers satisfied and which make them dissatisfied, so that we get predictions through it and, in the future, a satisfaction rate through this model.
1.3 OBJECTIVES

Following are the objectives of this section:

I. Collecting historical data on airline passenger satisfaction.
II. Applying data analytics techniques to analyze the data.
III. Making a dashboard for the sake of business analytics.
IV. Choosing the best machine learning algorithm for this problem.
V. Splitting the data into training data and test data.
VI. The chosen algorithm must give the best classification metric (a good prediction rate, i.e. accuracy).

1.4 PROPOSED SOLUTION

Basically, we are going to analyze the historical data, clean it if required, and then visualize it. Since this is a supervised learning problem, we then split the data into features and labels, and split those into a training dataset and a test dataset. We then choose the algorithm best suited to the problem, fit (train) it on the training dataset, and obtain predictions on the test dataset. If the accuracy rate is good, we train the model on the whole dataset using that algorithm and make the model ready for use by saving its file.
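The workflow above can be sketched with scikit-learn. This is only a minimal illustration on a tiny synthetic dataset: the column names, the toy label rule, and the choice of RandomForestClassifier are assumptions made for the sketch, not the project's final schema or algorithm.

```python
import numpy as np
import pandas as pd
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the historical dataset (illustrative columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "flight_distance": rng.integers(31, 4984, 500),
    "seat_comfort": rng.integers(0, 6, 500),
    "online_boarding": rng.integers(0, 6, 500),
})
# Toy label: long flights with comfortable seats count as "satisfied".
df["satisfaction"] = ((df["flight_distance"] > 1000)
                      & (df["seat_comfort"] >= 3)).astype(int)

# Split into features/labels, then into training and test sets.
X, y = df.drop(columns="satisfaction"), df["satisfaction"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit on the training set, evaluate on the test set.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))

# If accuracy is good, retrain on the whole dataset and save the model file.
model.fit(X, y)
joblib.dump(model, "satisfaction_model.joblib")
```

Here `joblib.dump` is what "saving its file" refers to; the saved model can later be reloaded with `joblib.load`.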
CHAPTER 2

TOOLS AND TECHNOLOGIES USED IN THIS SECTION

Basically, this section is about model development, so the tools and technologies we are going to use in it are as follows.

2.1 PROGRAMMING LANGUAGE

The programming language we are going to use in this section is as follows.

2.1.1 PYTHON

Python is an interpreted, high-level, general-purpose, multi-paradigm programming language. Object-oriented programming and structured programming are fully supported, and many of its features support functional programming and aspect-oriented programming. Many other paradigms are supported via extensions, including design by contract and logic programming. Python offers concise and readable code. While complex algorithms and versatile workflows stand behind machine learning and AI, Python's simplicity allows developers to write reliable systems. Developers get to put all their effort into solving an ML problem instead of focusing on the technical nuances of the language.

2.2 LIBRARIES

Libraries are sets of routines and functions that are written in a given language. A robust
set of libraries can make it easier for developers to perform complex tasks without
rewriting many lines of code. Machine learning is largely based upon mathematics.
Specifically, mathematical optimization, statistics and probability. Python libraries help
to easily “do machine learning”.

Following are the libraries which we will use in our project

2.2.1 NUMPY

NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays.
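As a quick illustration of the kind of vectorized operation NumPy adds:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # a small 2-D array (matrix)
col_means = a.mean(axis=0)       # one call operates on the whole array
print(col_means)                 # [2. 3.]
```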

2.2.2 PANDAS

Pandas is a fast, powerful, flexible, and easy to use open source data analysis and
manipulation tool, built on top of the Python programming language.

2.2.3 MATPLOTLIB

Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots
into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or
GTK+.

2.2.4 SEABORN

Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

2.2.5 SCIKIT LEARN

Scikit-learn is an open-source Python machine learning library which provides numerous classification, regression, and clustering algorithms. This library was used in this project to perform the actual task of model building and prediction. It provides a variety of evaluation metrics to validate the performance of the model, which makes it a valuable tool.

2.2.6 TENSORFLOW

TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library and is also used for
machine learning applications such as neural networks.

2.2.7 KERAS

Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

2.2.8 JOBLIB

Joblib is a set of tools to provide lightweight pipelining in Python: in particular, transparent disk-caching of functions with lazy re-evaluation (the memoize pattern), and easy, simple parallel computing. Joblib is optimized to be fast and robust on large data in particular, and has specific optimizations for NumPy arrays. It is BSD-licensed.
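A minimal sketch of Joblib's transparent disk-caching; the function name `square` is a hypothetical stand-in for an expensive computation:

```python
from joblib import Memory

memory = Memory("joblib_cache", verbose=0)  # cache directory on disk

@memory.cache
def square(x):
    # A costly computation would stand here; its result is cached on disk.
    return x * x

first = square(6)   # computed and written to the cache
second = square(6)  # served back from the disk cache
print(first, second)
```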

2.3 OPEN SOURCE DISTRIBUTION

An open source distribution (distro) is a copy of an open source project, created and
managed separately from the main project, and independent of other distributions. The
open source project from which it is copied is a collaborative development,
documentation and testing effort.

2.3.1 ANACONDA
Anaconda is a free and open-source distribution of the Python and R programming
languages for scientific computing, that aims to simplify package management and
deployment. Package versions are managed by the package management system conda.

2.4 INTEGRATED DEVELOPMENT ENVIRONMENTS (IDES)

An integrated development environment is a software application that provides comprehensive facilities to computer programmers for software development. An IDE normally consists of at least a source code editor, build automation tools, and a debugger.

2.4.1 JUPYTER NOTEBOOK

The Jupyter Notebook is an open-source web application that allows you to create and
share documents that contain live code, equations, visualizations and narrative text. Uses
include data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more.

2.4.2 TABLEAU

Tableau is a powerful data analytics tool which is used for building interactive
dashboards. Tableau was mainly used in the project to generate interactive graphs and
observe patterns in the data. This information proved to be useful in determining the
features that could contribute well to the actual model building. It also provides a rich
map interface for geographical data.
CHAPTER 3

DATA ENGINEERING

3.1 WHAT IS DATA ENGINEERING

Data Engineering is the act of collecting, translating, and validating data for analysis.
In particular, data engineers build data warehouses to empower data-driven decisions.
Data engineering lays the foundation for real-world data science application. Working
harmoniously, data engineers and data scientists can deliver consistently valuable
insights.

3.1.1 REQUIRED SKILLS FOR DATA ENGINEERING

Data engineering requires a broad set of skills ranging from programming to database design and system architecture. Here are just a few:

I. Extensive experience with data processing
II. Knowledge of Python, SQL, and Linux
III. A deep understanding of clustering, machine learning, and data visualization
IV. Aptitude for developing a foundational understanding of company data
V. Proficiency in report and dashboard creation

Data engineers are focused on providing the right kind of data at the right time. A good
data engineer will anticipate data scientists’ questions and how they might want to
present data. Data engineers ensure that the most pertinent data is reliable, transformed,
and ready to use. This is a difficult feat, as most organizations rarely gather clean raw
data. To work their magic, most data engineers must be proficient in Python, SQL, and
Linux. Data engineers may also need skills in cluster management, data visualization,
batch processing, and machine learning. Data engineers use these processing techniques to massage data into a format that facilitates hundreds of queries. While data engineers
may not be directly involved in data analysis, they must have a baseline understanding of
company data to set up appropriate architecture. Creating the best system architecture
depends on a data engineer’s ability to shape and maintain data pipelines. Experienced
data engineers might blend multiple big data processing technologies to meet a
company’s overarching data needs.

3.2 DATA ENGINEERING IN OUR DATASET

Our dataset is already presented in processed form, being in tabular form in a CSV (comma-separated values) file. The data has also been translated by giving each column a proper name, as we can see in the picture below.
We can still discuss how this was done. The dataset contains features that predict the label of whether people are satisfied or not. In terms of data engineering, the engineers collected data from the booking site or from in-person booking records: gender, age, customer type, class, travel purpose, and delay in arrival and departure; then how passengers rated (0, 1, 2, 3, 4, 5) the different facilities present in-flight, and whether they were satisfied or dissatisfied. After collecting all this data, it was translated by giving each field a meaningful name and presenting it properly. The collected data was arranged into different subgroups; each subgroup was given a name that describes it and its purpose, and each subgroup was saved one by one in tabular form, either in a SQL file, a CSV file, or an Excel file. Finally the data was validated, so that each subgroup holds related data and each file contains proper data belonging to the domain, which makes us able to analyze it.

In this way, to the best of our knowledge, data engineering was performed on our dataset as described.
CHAPTER 4

DATA ANALYSIS

4.1 WHAT IS DATA ANALYSIS

Data analysis is the science of examining data to draw conclusions about the
information, in order to make decisions or expand knowledge on various subjects. It
consists of subjecting data to operations; this process is carried out to obtain precise
conclusions that help us achieve our goals. Such operations cannot always be defined in
advance, since data collection may reveal specific difficulties.

4.1.1 TECHNIQUE FOR ANALYSIS


It is essential to analyze raw data in order to understand it. We must resort to various
techniques that depend on the type of information collected, so it is crucial to define the
method before implementing it.

QUALITATIVE DATA ANALYSIS

Researchers collect qualitative data from the underlying emotions, body language, and
expressions. Its foundation is the interpretation of verbal responses. The most common
ways of obtaining this information are through open-ended interviews, focus groups, and
observation groups, where researchers generally analyze patterns in observations
throughout the data collection phase.

QUANTITATIVE DATA ANALYSIS

Quantitative data presents itself in numerical form. It focuses on tangible results.

Data analysis focuses on reaching a conclusion based solely on the researcher’s current
knowledge. How you collect your data should relate to how you plan to analyze and use
it. You also need to collect accurate and trustworthy information.

There are many data collection techniques, but experts’ most commonly used method is
online surveys. It offers significant benefits such as reducing time and money compared
to traditional data collection methods.

4.1.2 DATA ANALYSIS PROCESS

The data analysis process includes five steps:


1- IDENTIFY

Before you get your hands dirty with data, you first need to identify why you need it
in the first place. Identification is the stage in which you establish the questions you
will need to answer. For example: what is the customer's perception of our brand? Or:
what type of packaging is more engaging to our potential customers? Once the questions
are outlined, you are ready for the next step.

2-COLLECT

As its name suggests, this is the stage where you start collecting the needed data. Here,
you define which sources of information you will use and how you will use them. The
collection of data can come in different forms such as internal or external sources,
surveys, interviews, questionnaires, focus groups, among others. An important note here
is that the way you collect the information will be different in a quantitative and
qualitative scenario.

3-CLEAN
Once you have the necessary data, it is time to clean it and leave it ready for analysis. Not
all the data you collect will be useful; when collecting large amounts of information in
different formats, it is very likely that you will find yourself with duplicate or badly
formatted data. To avoid this, before you start working with your data you need to make
sure to erase any white spaces, duplicate records, or formatting errors. This way you
avoid hurting your analysis with incorrect data.

4-ANALYZE

With the help of various techniques such as statistical analysis, regressions, neural
networks, text analysis, and more, you can start analyzing and manipulating your data to
extract relevant conclusions. At this stage, you find trends, correlations, variations, and
patterns that can help you answer the questions you first thought of in the identify stage.
Various technologies in the market assist researchers and average business users with
the management of their data. Some of them include business intelligence and
visualization software, predictive analytics, and data mining, among others.

5-INTERPRET

Last but not least you have one of the most important steps: it is time to interpret your
results. This stage is where the researcher comes up with courses of action based on the
findings. For example, here you would understand if your clients prefer packaging that is
red or green, plastic or paper, etc. Additionally, at this stage, you can also find some
limitations and work on them.

4.2 DATA ANALYSIS ON PROJECT

Now we need to analyze our dataset. It contains historical information on some airline flights; it has rows and columns, and for each row we must identify whether the passenger is satisfied or dissatisfied. As our data is collected in proper tabular form, let us move toward analyzing and cleaning it. First, let us see the size of the data: how many rows and columns there are and how much memory it takes. We do this by making a DataFrame instance, reading the file, and calling DataFrame.info(), as seen below.

After doing this we see, as shown below, that there are 129880 rows and 24 columns (18 integer, 1 float, 5 object), with a memory usage of 23.8+ MB.
The first column contains the index; this type of column is created when a file is saved without setting the index to false, so it is of no use. We drop it using drop('Unnamed: 0', axis=1), after which 23 columns remain, as you can see below.
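A sketch of this loading-and-cleaning step. Since the real CSV is not reproduced here, a tiny stand-in DataFrame with the same stray 'Unnamed: 0' column is built in code:

```python
import pandas as pd

# Stand-in for pd.read_csv(...) on the real file (129880 rows, 24 columns).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Gender": ["Male", "Female", "Male"],
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
})
df.info()                           # row count, dtypes, memory usage
df = df.drop("Unnamed: 0", axis=1)  # drop the stray index column
print(df.shape)                     # one column fewer than before
```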
First, looking at the label we are trying to predict, we call value_counts() on it, as it is a categorical column, and then visualize it with a count plot, as we can see below.

We can see that dissatisfied people are the larger group (73452) versus satisfied (56428), so we can interpret that most people are dissatisfied.
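The label inspection can be sketched like this (toy counts stand in for the real 73452 vs 56428):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the plot renders without a display
import seaborn as sns

df = pd.DataFrame({"satisfaction": ["dissatisfied"] * 7 + ["satisfied"] * 5})
counts = df["satisfaction"].value_counts()     # tabulate each label
print(counts)
ax = sns.countplot(x="satisfaction", data=df)  # the same counts, as bars
```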

Now let us move to the categorical features. There are five categorical columns in total
in our dataset; one of these is the satisfaction column, which is our predicted label. That
leaves four categorical feature columns. The first one is gender; let us start
analyzing it with value_counts() and a count plot.

We can see that males (63981) and females (65899) are approximately equal in number,
with only a small difference of about 2000. Since we have to predict
satisfaction, let us see the gender column's behaviour against satisfaction by putting
satisfaction as the hue in a count plot with gender.

From this plot we can easily see that males and females are about equally satisfied and
dissatisfied: the dissatisfied bars for both genders are approximately equal, as are the
satisfied bars. So we can say that the Gender column does not affect satisfaction much;
males and females are equally likely to be satisfied or dissatisfied.
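Splitting each gender bar by satisfaction is done with seaborn's hue parameter; a sketch on toy data:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import seaborn as sns

df = pd.DataFrame({
    "Gender":       ["Male", "Male", "Female", "Female", "Male", "Female"],
    "satisfaction": ["satisfied", "dissatisfied", "satisfied",
                     "dissatisfied", "dissatisfied", "satisfied"],
})
# One bar group per gender, split into satisfied/dissatisfied by hue.
ax = sns.countplot(x="Gender", hue="satisfaction", data=df)
table = pd.crosstab(df["Gender"], df["satisfaction"])  # the same numbers
print(table)
```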
Now let us look at the next categorical column, customer type, applying value_counts() and a count plot.

As we see, loyal customers (106100) are the large majority and disloyal customers (23780) the minority, which we can also confirm with the count plot. Since satisfaction is to be predicted, let us view it against satisfaction, as we did previously for gender.

We can see that loyal customers account for most of the dissatisfied passengers, while disloyal customers account for very few; the same holds for the satisfied passengers. One explanation is that most people who travel are regular (loyal) customers, so they dominate both groups: they are dissatisfied in large numbers and also satisfied in large numbers, though less than they are dissatisfied. Disloyal customers, on the other hand, travel far less often, are dissatisfied at a high rate, and satisfied at a very low rate. So the customer type column affects satisfaction to a large degree, and it is an important feature.

Now let us look at type of travel, that is, the purpose for which a person is travelling, again with value_counts() and a count plot.

We can see two types of travelers: those travelling for business purposes are the larger group (89693), and those travelling for personal reasons the smaller (40187). Since we have to predict satisfaction, we proceed as before.
We can see that people travelling for business purposes are satisfied in large numbers, while people travelling for personal reasons are satisfied in very small numbers. In contrast, personal and business travelers are dissatisfied in approximately equal numbers. Now let us check how many loyal and disloyal people travel for each travel reason.

We can see that people who travel for business are mostly regular customers, and new passengers also mostly travel for business purposes. So we can say that most people travelling for business are loyal, regular customers and are mostly satisfied. Type of travel is therefore another important feature for prediction.
Now let us look at customer class, also a categorical column, applying value_counts() and a count plot.

We can see that people mostly travel in Business class (62160), then Eco (58309) and Eco Plus (9411), and the count plot shows the same.

Now, visualizing it against satisfaction as done above:

We can easily see that people travelling in Business class are mostly satisfied, while people travelling in Eco class are mostly dissatisfied, in large numbers.

Now let us check it against type of travel.

We can see that people in Business class are travelling for business purposes, and people travelling for personal reasons are mostly travelling in Eco class.

So we can make the statement that, of the people who are satisfied, the large majority are travelling in Business class, mostly for business purposes, and are loyal passengers.

All the categorical columns containing string-type values have now been visualized, and we interpret that:

"Mostly, regular customers travelling for business purposes in Business class are the satisfied people, in large numbers, and gender does not affect the satisfaction column much."

Now let us look at the columns containing numerical values.

First, the age column. As it holds numerical values, we apply the describe() function to it and view it with a boxplot.

We can see that the minimum age is 7, both from describe() and the boxplot, and the maximum age is 85, likewise from both. Now we have to see the effect of age on satisfaction, so let us visualize it using a bar plot.
We can see that satisfied passengers are mostly above age 40 and dissatisfied passengers mostly below age 40, but age does not affect satisfaction or dissatisfaction very much.
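The describe()-plus-boxplot step can be sketched on synthetic ages drawn from the observed 7-85 range:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import seaborn as sns

rng = np.random.default_rng(1)
ages = pd.Series(rng.integers(7, 86, 200), name="Age")  # values in 7..85
print(ages.describe())    # count, mean, min, quartiles, max
ax = sns.boxplot(x=ages)  # the same five-number summary, drawn
```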

Now let us look at flight distance, first applying describe() and a boxplot.

From the boxplot we can see that the minimum flight distance is 31 and the maximum flight distance is 4983.
Now let us see it against satisfaction.

We can see a simple pattern, both in the bar plot and the histogram: as flight distance increases, satisfaction increases, while at shorter flight distances passengers are dissatisfied. So flight distance plays an important role in satisfaction and is an important feature. Visualizing flight distance against customer class, we can also see that as the distance gets bigger, people prefer Business class, and people in Business class are mostly satisfied in large numbers, as the plot shows.
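The distance-versus-satisfaction bar plot can be sketched as below; the deterministic toy rule only mimics the observed trend that longer flights go with satisfaction:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import seaborn as sns

rng = np.random.default_rng(2)
dist = rng.integers(31, 4984, 300)
# Toy rule echoing the observation: longer flights -> "satisfied".
label = np.where(dist > 2000, "satisfied", "dissatisfied")
df = pd.DataFrame({"Flight Distance": dist, "satisfaction": label})

ax = sns.barplot(x="satisfaction", y="Flight Distance", data=df)  # mean bars
means = df.groupby("satisfaction")["Flight Distance"].mean()
print(means)
```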

Now let us look at the 14 columns that describe different flight facilities. These columns hold the values 0, 1, 2, 3, 4, 5, which seem numerical but are actually categorical, as passengers gave ratings from 0 to 5 for different flight facilities. Let us visualize the first, inflight WiFi service, through a count plot, also calling value_counts() since we are treating it as categorical.
We can see that people mostly voted 2 to 3 as the most common ratings for inflight WiFi service. Now let us visualize it against satisfaction.

After seeing this, we can say that people giving a rating of 5 are almost all satisfied; a rating of 4 also shows fairly high satisfaction, though not as much; and people are also satisfied when the rating is 0, where no WiFi is needed. Now let us take customer class into consideration by putting it as the hue.

So we can say that people who give inflight WiFi service a rating of 5 are mostly travelling in Business class, and people giving a rating of 0 are also travelling in Business class.
Now let us look at departure/arrival time convenience; as with the previous flight service, let us take its value_counts() and count plot.

We can see that people mostly gave 4 to 5 as the highest ratings for departure/arrival time convenience. Now let us check it against satisfaction.

After seeing the plot above, we can see that people are mostly dissatisfied with departure/arrival time convenience: in all rating categories there is only a minor difference between satisfaction and dissatisfaction, but the rating of 4 has a notably high dissatisfaction count. Let us check the customer class column before giving our perception. We can see that people who are mostly dissatisfied and gave ratings of 4 and 5 belong to Eco class, while people who gave ratings of 0 to 3 are approximately equally satisfied and dissatisfied. After seeing this whole scenario, we can say that people are mostly dissatisfied with departure/arrival time; satisfaction is better when the rating sits in the middle, at 2 or 3. It is an important feature because it drives dissatisfaction up, and we must reduce that by improving the service.

Now let us move to the next flight service; as we are treating them as categorical, let us take the value_counts() and count plot of ease of online booking.
We can see that people mostly voted 2 to 3 as the most common ratings for ease of online booking. Now let us take it into consideration with satisfaction.

We can see that passengers who voted 2 to 3 are mostly dissatisfied, in large numbers, while people who voted 5 are satisfied in very large numbers. Now let us look at customer class as well, to give our final interpretation.

After seeing this, we can say that people giving ratings of 2 to 3 are travelling in Eco class and are mostly dissatisfied, while people giving ratings of 4 to 5 (of whom the 4s are somewhat satisfied and the 5s highly satisfied) are mostly travelling in Business class. So we can say that passengers travelling in Business class are mostly satisfied and give a rating of 5.

Now let us look at gate location, as for the previous services.

We can see that people mostly voted 3 as the most common rating for gate location. Now let us see it against satisfaction.
Here we can see that, leaving aside 5, where satisfaction is slightly higher, passengers at all ratings are highly dissatisfied, especially those giving ratings of 3 and 4. Now let us see it against customer class and give our view.

We can see that passengers travelling in Business class are somewhat satisfied with the gate location, while the 3-to-4 raters, the most common votes, are travelling in Eco class and are dissatisfied.

Now let us look at food and drink, taking value_counts() and a count plot.
We can see there is no large difference between the ratings: 4 is the most common rating, and 3, 2, and 5 are at approximately the same level. Now let us take it into consideration with satisfaction.

Here we can see that at ratings 4 and 5 satisfaction is high, with only a minor difference between them; since 4 is the most common rating, many people are satisfied. At ratings 2 and 3 many people are dissatisfied, and at 1 they are highly dissatisfied. Now take a look at it with customer class.
As the ratings 2, 3, 4, and 5 mostly come from passengers travelling in Business class, we can say that both the satisfied and the dissatisfied people are in Business class, because most of them are regular customers; if they are dissatisfied, something is wrong with the food and drinks, which is why they are dissatisfied.

Now let us look at online boarding, taking value_counts() and a count plot.

We can see that 4 received by far the most votes as the highest rating, while 2, 3, and 5 differ only slightly. Now let us see it against satisfaction.
We see that people who voted 4 and 5 are highly satisfied, especially those who voted 5, and these passengers are also travelling in Business class, while people who voted 2 and 3 are highly dissatisfied and travelling in Eco class, as you can see below.

So we can say that the online boarding rating should be 4 or 5 for satisfaction, and it is an important feature.
Now let us look at seat comfort, taking value_counts() and a count plot.

We can see that people mostly gave it a rating of 4, then 5. Now let us consider it with satisfaction.
We can see a clear difference here: people who voted 4 and 5 are mostly satisfied with seat comfort and travel in Business class, while people who voted 1, 2, and 3 are dissatisfied and travel in Eco class.

So we can say that satisfied people mostly give seat comfort ratings of 4 and 5 and mostly travel in Business class.

Now let us look at inflight entertainment, taking value_counts() and a count plot.

We can see people mostly gave ratings of 4 and 5. Now let us see it with satisfaction.
We can see that people who gave high ratings of 4 and 5 are highly satisfied with the inflight entertainment and travel in Business class, while passengers who gave ratings of 1, 2, or 3 are highly dissatisfied and travel in Eco class.

So we can say that at ratings 4 and 5 people are satisfied with the inflight entertainment, and it is an important feature.

Now let us look at onboard service, taking value_counts() and a count plot.
We can see people mostly rated it 4 and 5. Now let us see it with satisfaction.

Passengers who rated it 5 are the most highly satisfied, and 4 also shows high satisfaction with only a minor difference; both groups are mostly travelling in Business class, while those who rated it 1, 2, or 3 are dissatisfied and mostly travelling in Eco class.
So we can say it is an important feature: people are satisfied when they give onboard service a rating of 5, mostly while travelling in Business class.

Now let us look at legroom service, taking its value_counts() and count plot.

Here we can see people mostly rated this service 4 and 5. Now let us take it into consideration with satisfaction.
We can see that passengers who rated it 5 are the most highly satisfied, and 4 shows high satisfaction with a minor difference; both are mostly travelling in Business class, while those who rated it 1, 2, or 3 are dissatisfied and mostly travelling in Eco class.

So we can say it is an important feature: people are satisfied when they give legroom service a rating of 4 or 5, mostly while travelling in Business class.
Now let's look at baggage handling, taking its value counts and count plot.

Most passengers rated it 4 or 5. Viewed against satisfaction, passengers who rated
it 5 are the most satisfied and are mostly travelling in business class, while those
who rated 1, 2 or 3 are mostly dissatisfied and travelling in eco class; passengers
who rated 4 lean slightly towards dissatisfied even though they are travelling in
business class.

So we can say that baggage handling is an important feature: satisfied passengers
mostly rate it 5 and travel in business class.

Now let's look at check-in service, taking its value counts and count plot.

Most passengers rated it 3 or 4. Viewed against satisfaction, passengers are only
somewhat satisfied at a rating of 5; at every other rating they are mostly
dissatisfied. The satisfied passengers are travelling in business class, and those
who rated 3 or 4 are also travelling in business class but are still dissatisfied.

So we can say that check-in service is an important feature: passengers are only
satisfied when they rate it 5, and they are mostly travelling in business class.

Now let's look at inflight service, taking its value counts and count plot.

Most passengers rated it 4 or 5. Viewed against satisfaction, passengers who rated
it 5 are the most satisfied and travel in business class, while those who rated 1,
2, 3 or 4 are mostly dissatisfied; the 1-3 group travels in eco class, while those
who rated 4 travel in business class.

So we can say that inflight service is an important feature: satisfied passengers
mostly rate it 5 and travel in business class.

Now let's look at cleanliness, taking its value counts and count plot.

Most passengers rated it 3 or 4. Viewed against satisfaction, passengers who rated
it 5 are the most satisfied, with 4 close behind; both groups are mostly travelling
in business class, while those who rated 1, 2 or 3 are dissatisfied and travelling
in eco class.

So we can say that at ratings of 4 and 5 passengers are satisfied with cleanliness,
which makes it an important feature.

Now that the column-by-column analysis is done, let's move on to the cleaning part,
starting with how many columns have missing values.

Only the column Arrival Delay in Minutes has missing values: 310 of them, which is
not a large amount. As a percentage of the whole dataset this is about 0.3%. There
are three things we can do:

1- As the amount of missing data is small, dropping those rows would not hurt us.

2- We can fill the gaps with a constant such as 0, as the column is numerical.

3- We can fill them statistically (mean, median, etc.).

We choose option 2. The reasoning is that a passenger who arrived on time would not
consider it important to report that the arrival delay is zero, so we fill the
missing values with 0. After filling with 0 the state becomes:
And now the data is cleaned.
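A minimal pandas sketch of this check-and-fill step, on a toy column with the same name as in the dataset:

```python
import pandas as pd
import numpy as np

# Toy column standing in for the real 129k-row dataset
df = pd.DataFrame({"Arrival Delay in Minutes": [0.0, 12.0, np.nan, 5.0, np.nan]})

print(df.isnull().sum())                                   # missing values per column
pct = df["Arrival Delay in Minutes"].isnull().mean() * 100
print(f"missing: {pct:.1f}%")

# Option 2 from the text: a missing delay is treated as "arrived on time"
df["Arrival Delay in Minutes"] = df["Arrival Delay in Minutes"].fillna(0)
print(df.isnull().sum().sum())                             # 0 -> data is clean
```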

Now let's look at Departure Delay in Minutes and Arrival Delay in Minutes, calling
the describe function on them as they are numerical columns.

We can see that the minimum value for both is 0 and the maxima are 1582 and 1584
minutes respectively.

Taking satisfaction into consideration with a scatter plot, the two delay columns
rise together almost linearly, and shorter delays go with satisfied passengers.
Visualizing further against flight distance: for departure delay, when the distance
increases and the delay stays below roughly 500 minutes, passengers are mostly
satisfied. The same holds for arrival delay with a threshold of roughly 600 minutes.

Now that each column has been discussed, let's do the final round of analysis by
removing outliers.

First we plot a boxplot of every numerical column, see which of them contain
outliers, find the outliers that are common between them, and remove those rows
from the dataset so that it fits a model more efficiently.

No outliers are present in the age column, so we move to the next column. There,
some outliers are present towards the upper limit, so we record their indexes and
save them. The third column also has outliers towards the upper limit, so we record
those indexes too; we now have two arrays and take the common outliers between
them. The last column again has outliers towards the upper limit; we record its
indexes and intersect them with the previous common-outlier array.

We end up with 342 rows that are common outliers, and we remove them. With the
outliers removed the analysis is done and our final file is ready for an ML model
to be trained.
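A sketch of this outlier step using the usual 1.5 × IQR boxplot-whisker rule, on synthetic columns; the exact rule and column names in the project may differ:

```python
import numpy as np
import pandas as pd

# Synthetic numeric columns standing in for the dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Flight Distance": rng.normal(1000, 200, 500),
    "Departure Delay in Minutes": rng.exponential(15, 500),
})
# Plant one row that is an extreme outlier in both columns
df.loc[0, ["Flight Distance", "Departure Delay in Minutes"]] = [9000, 1500]

def iqr_outlier_index(s):
    """Indexes of values beyond the 1.5*IQR boxplot whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)].index

# Intersect the per-column outlier indexes, then drop only the common rows
common = iqr_outlier_index(df["Flight Distance"]).intersection(
    iqr_outlier_index(df["Departure Delay in Minutes"]))
df_clean = df.drop(index=common)
print(len(common), "common outlier rows removed")
```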
CHAPTER 5

BUSINESS ANALYTICS

V.1 WHAT IS BUSINESS ANALYTICS

Business analytics is a set of automated data-analysis practices, tools and services
that help you understand both what is happening in your business and why, so you can
improve decision-making and plan for the future. The term “business analytics” is
often used in association with business intelligence (BI) and big data analytics.
In short, business analytics lets you visualize data, understand what is happening
according to that data and what insights it holds, and use those insights to build
models, products and much more.

V.2 BUSINESS ANALYTICS ON OUR DATA SET

We have analyzed our dataset quite comprehensively, but end users don't want to dig
through that analysis; they simply want to see what the data says. For that reason
we are going to create a dashboard of our dataset that helps the user visualize the
data and extract information from it. For the dashboard we will use Tableau, one of
the simplest tools for business analytics.
CHAPTER 6

MACHINE LEARNING

6.1 Machine Learning Project Workflow

We can define the machine learning projects workflow in following stages,

1. Data Collection

2. Data Pre-Processing

3. Researching the model that will be best for the type of data

4. Training and testing the model

5. Evaluation

6.2 What is the Machine Learning Model

The machine learning model is nothing but a piece of code that an engineer or data
scientist makes smart through training with data. So, if you give garbage to the
model, you will get garbage in return, i.e. the trained model will give false or
wrong predictions.

Now we define the steps of Machine Learning workflow

6.3 Data Collection

The process of gathering data depends on the type of project we want to build. If we
want an ML project that uses real-time data, we can build an IoT system that
collects data from different sensors. A data set can be collected from various
sources such as files, databases and sensors, but the collected data usually cannot
be used directly for analysis, as there may be a lot of missing values, extremely
large values, unorganized text or noisy data. Data Preparation is done to solve
these problems.

We can also use free data sets available on the internet. Kaggle and the UCI
Machine Learning Repository are the repositories used most for building
machine-learning models. Kaggle is one of the most visited websites for practicing
machine-learning algorithms; it also hosts competitions in which people can
participate and test their knowledge of machine learning.

6.4 Data Pre-Processing:

Data pre-processing is one of the most important steps in machine learning; it is
what lets us build machine learning models more accurately. In machine learning
there is a rough 80/20 rule: a data scientist typically spends 80% of the time on
data pre-processing and 20% actually performing the analysis.

1. What is Data Pre-Processing:

Data pre-processing is the process of cleaning raw data, i.e. data collected in the
real world is converted into a clean data set. In other words, whenever data is
gathered from different sources it arrives in a raw format that is not feasible for
analysis, so certain steps are executed to convert it into a small, clean data set.
This part of the process is called data pre-processing.

2. Why do we need it:

As we know, data pre-processing turns raw data into clean data that can be used to
train the model, so we definitely need it to achieve good results from the applied
model in machine learning and deep learning projects.

Most real-world data is messy; some common kinds of mess are:

Missing data: Missing data can appear when data is not created continuously or due
to technical issues in the application (e.g. an IoT system).

Noisy data: This type of data is also called outliers; it can occur due to human
error (a person manually gathering the data) or a technical problem with the device
at the time of data collection.

Inconsistent data: This type of data can result from human error (mistakes in names
or values) or duplication of data.

Three Types of data:

• Numeric e.g. income, age

• Categorical e.g. gender, nationality

• Ordinal e.g. low/medium/high

3. How can data pre-processing be performed:

These are some of the basic pre-processing techniques that can be used to convert
raw data.

Conversion of data: Machine learning models can only handle numeric features, so
categorical and ordinal data must somehow be converted into numeric features.

Ignoring the missing values: Whenever we encounter missing data in the data set, we
can remove the affected rows or columns, depending on our need. This method is
efficient, but it shouldn't be used when a lot of values are missing from the
dataset.

Filling the missing values: Alternatively, we can fill the missing data manually;
most commonly the mean, median or highest-frequency value is used.

Machine learning: If some data is missing, we can also predict what should be
present at the empty position using the existing data.

Outliers detection: There may be error values in our data set that deviate
drastically from the other observations. [Example: a human weight of 800 kg, caused
by mistyping an extra 0.]
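Two of these techniques, conversion of data and filling missing values with the mean, can be sketched in pandas (the column names are made up for the example):

```python
import pandas as pd
import numpy as np

raw = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Business"],   # categorical
    "Age": [25.0, np.nan, 40.0, 31.0],                      # numeric, one gap
})

# Filling the missing value: mean imputation
raw["Age"] = raw["Age"].fillna(raw["Age"].mean())

# Conversion of data: categorical column -> numeric dummy columns
encoded = pd.get_dummies(raw, columns=["Class"])
print(encoded)
```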

6.5 Model Selection

Our main goal is to train the best performing model possible, using the pre-processed
data.

6.5.1 Supervised Learning:

In supervised learning, an AI system is presented with labelled data, meaning each
data point is tagged with the correct label. Supervised learning is divided into two
categories: “Classification” and “Regression”.

1. Classification:

Classification is used when the target variable is categorical (i.e. the output can
be sorted into classes: it belongs to Class A or Class B or something else). A
classification problem is when the output variable is a category, such as “red” or
“blue”, “disease” or “no disease”, or “spam” or “not spam”.

Following are the most used classification algorithms:

• K-Nearest Neighbor

• Naive Bayes
• Decision Trees

• Random Forest

• Gradient Boosting

• Adaptive Boosting

• Support Vector Machine

• Logistic Regression

• Artificial Neural Networks

2. Regression:

Regression is used when the target variable is continuous (i.e. the output is numeric).

Following are the most used regression algorithms:

• Linear Regression

• Polynomial Regression

• Ridge Regression

• Lasso Regression

• Elastic Net

• K Neighbors Regressor

• Support Vector Regression

• Decision Trees

• Random Forest

• Gradient Boosting

• Adaptive Boosting
• Gaussian Process Regression

6.5.2 Unsupervised Learning:

In unsupervised learning, an AI system is presented with unlabelled, uncategorized
data, and the system's algorithms act on the data without prior training. The output
depends on the coded algorithms. Subjecting a system to unsupervised learning is one
way of testing AI.

Unsupervised learning is divided into two categories: “Clustering” and
“Association”.

1. Clustering:

A set of inputs is to be divided into groups. Unlike in classification, the groups are not
known beforehand, making this typically an unsupervised task.

Methods used for clustering are:

• Gaussian mixtures

• K-Means Clustering

• Hierarchical Clustering

• Spectral Clustering

6.5.3 Overview of models under categories


6.6 Training and testing the model on data:

To train a model we initially split the data into three sections: ‘Training data’,
‘Validation data’ and ‘Testing data’. You train the classifier using the training
set, tune the parameters using the validation set, and then test the performance of
the classifier on the unseen test set. An important point is that during training
only the training and/or validation sets are available; the test set must not be
used while training the classifier and only becomes available when testing it.

1. Training set:

The training set is the material through which the computer learns how to process
information. Machine learning uses algorithms to perform this training. It is the
set of data used for learning, that is, to fit the parameters of the classifier.

2. Validation set:

Cross-validation is primarily used in applied machine learning to estimate the skill
of a machine learning model on unseen data. The validation set is a portion of the
training data, held out as unseen data, that is used to tune the parameters of the
classifier.
3. Test set:

A set of unseen data used only to assess the performance of a fully specified classifier.
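The three-way split can be sketched with scikit-learn's `train_test_split` applied twice; the split sizes here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, one feature, binary label
X = np.arange(100).reshape(100, 1)
y = np.arange(100) % 2

# Hold out 15% as the final test set, then carve a validation set
# out of what remains.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15, random_state=42)
print(len(X_train), len(X_val), len(X_test))
```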

6.7 Model Evaluation:

Model Evaluation is an integral part of the model development process. It helps to find the best
model that represents our data and how well the chosen model will work in the future.

6.8 MODEL DEVELOPMENT OF OUR DATASET

Our dataset is already analyzed, which is to say pre-processed; the one basic step
left is the conversion of data. But first we have to establish what kind of problem
we are solving: we have to predict a label that is categorical, so this is a
supervised learning problem, specifically classification. Next we divide the data
into features and labels, first converting categorical data into numerical. There
are five categorical columns in total; since the satisfaction column is our label,
it does not need to be changed, leaving four categorical columns. The analysis
showed that the Gender and Age columns do not have much effect, so we drop both of
them, which leaves three categorical columns. First comes Customer Type. This column
contains two categorical values, loyal customer and disloyal customer, so we call
the map function on the column and assign loyal customer to 1 and disloyal customer
to 0, as you can see below,
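A minimal sketch of this mapping, assuming the category spellings "Loyal Customer" / "disloyal Customer" from the dataset description:

```python
import pandas as pd

df = pd.DataFrame({
    "Customer Type": ["Loyal Customer", "disloyal Customer", "Loyal Customer"],
})

# Two-category column -> 0/1 via map
df["Customer Type"] = df["Customer Type"].map(
    {"Loyal Customer": 1, "disloyal Customer": 0})
print(df["Customer Type"].tolist())   # [1, 0, 1]
```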
After dropping Gender and Age the data has 21 columns and 129,538 rows. Applying the
mapping, we can see that our data is converted; checking with info, the Customer
Type data type has changed to int. Now for the next column, Type of Travel. It also
contains two categorical values, business travel and personal travel, so we again
call the map function on the column and assign business travel to 1 and personal
travel to 0, as you can see below.
So Type of Travel is now converted. Next is the customer Class column. It contains
more than two categorical values, so we apply the dummy method to it and see what
happens: it deletes the Class column and creates three new columns from its values,
placing a 1 in whichever new column matches the value in the original row.
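A minimal sketch of the dummy step with pandas `get_dummies` (the usual implementation of this method; category names as in the dataset description):

```python
import pandas as pd

df = pd.DataFrame({"Class": ["Business", "Eco", "Eco Plus", "Eco"]})

# Multi-category column -> one indicator column per category
df = pd.get_dummies(df, columns=["Class"])
print(df.columns.tolist())
# ['Class_Business', 'Class_Eco', 'Class_Eco Plus']
```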
Now that the conversion is done, let's divide the data into features and labels. We
name the features X and the labels Y, as you can see below. The data frame above
holds the features, and the label column is satisfaction. Our data is ready, so we
split it into training and test data with a test size of 15%, leaving the training
set with 85% of the data on which to train and validate, as you can see below.
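A minimal sketch of this feature/label division and 85/15 split on a toy frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the converted dataset
df = pd.DataFrame({
    "Customer Type": [1, 0, 1, 0] * 5,
    "Flight Distance": [400, 1200, 800, 300] * 5,
    "satisfaction": ["satisfied", "neutral or dissatisfied"] * 10,
})

X = df.drop("satisfaction", axis=1)   # features
y = df["satisfaction"]                # label

# 15% held out for testing, 85% kept for training/validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)
print(X_train.shape, X_test.shape)
```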

Our data is split into training and testing sets. Next we have to find out which
model is best for our data. For that we run all the algorithms on our dataset, and
the one that gives the best accuracy and F1 score becomes our chosen algorithm; we
will perform the final evaluation on that one so that it also gives the best results
in the future.

First we look at all the metrics of the algorithms through their classification
reports, as this is a classification problem.
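A sketch of this model comparison loop on a synthetic dataset, using a few of the classifiers listed above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the real dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = (accuracy_score(y_te, pred), f1_score(y_te, pred))
    print(name, scores[name])
```

In the project the same loop runs over all the classifiers named above, and `classification_report` prints the per-class precision, recall and F1.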

6.8.1 LOGISTIC REGRESSION


6.8.2 KNN CLASSIFIER

6.8.3 SUPPORT VECTOR CLASSIFIER


6.8.4 DECISION TREE CLASSIFIER

6.8.5 RANDOM FOREST CLASSIFIER


6.8.6 GRADIENT BOOSTING CLASSIFIER

6.8.7 ADA BOOST CLASSIFIER


We get two models with an accuracy and F1 score of 96%, so we have to choose between
them. Random Forest and Gradient Boosting are both ensembles built to improve on a
single decision tree (boosting itself is a technique rather than an algorithm).
Random Forest has to choose random subsets of features to build different trees and
also bootstrap random rows, whereas Gradient Boosting builds its ensemble
sequentially: each new tree is fitted to the residual error of the previous
ensemble's predictions, so each step directly improves the prediction accuracy. In
that sense it does less work while giving us equally good accuracy, which is why we
choose the Gradient Boosting classifier.

6.9 PREPARING THE MODEL FILE


Now that we know which model we are going to use, let's prepare its file. This time
the model is trained on the whole dataset with the best parameters found for it
using grid search. Then we dump the trained model with JOBLIB, because for new data
we do not want to train the model again and again before getting a prediction; we
need a source from which we can get a prediction directly. The file will also be
used in the web section to make the web application work through an API. With this
last task done, our model is ready, so let's move to the next section.
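A sketch of this final step, assuming scikit-learn's `GridSearchCV` and a synthetic dataset in place of ours; the parameter grid and file name are illustrative:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the cleaned dataset
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Grid search; refit=True (the default) retrains the best model on all data
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    cv=3,
)
grid.fit(X, y)

# Dump the fitted model so predictions don't require retraining
joblib.dump(grid.best_estimator_, "satisfaction_model.joblib")
model = joblib.load("satisfaction_model.joblib")
print(model.predict(X[:3]))
```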
SECTION-II
WEB APP DEVELOPMENT
CHAPTER 7

WEB APPLICATION

7.1 Introduction

In the last section we prepared our model and dumped it to a file. In this section
we build a web app for the model so that airline services can easily use it to
improve their services, based on the ratings passengers give.

7.2 Back End

7.2.1 Flask

In this web application we use Flask to prepare the login pages, the registration
facilities and a REST API that takes the different feature values from the form
filled in by the passenger and passes them to the dumped model file, which predicts
whether the passenger is satisfied or dissatisfied. Flask is a Python
micro-framework that is easy to learn and rich in facilities, so we write our
model-prediction backend services, i.e. its REST API, with Flask.
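A minimal sketch of such a Flask endpoint. The route name, JSON payload shape and stub model are assumptions for illustration; in the real app the JOBLIB model file from the previous section would be loaded instead of the stub:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In the real app: model = joblib.load("satisfaction_model.joblib")
class _StubModel:
    """Stand-in for the dumped model so the sketch runs on its own."""
    def predict(self, rows):
        return ["satisfied" if sum(r) > 10 else "neutral or dissatisfied"
                for r in rows]

model = _StubModel()

@app.route("/predict", methods=["POST"])
def predict():
    # The form posts the feature values as JSON, e.g. {"features": [5, 4, 4]}
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": prediction})

# app.run(debug=True)  # start the development server
```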

7.2.2 SPRING BOOT

With one side prepared, we build a full-fledged web app for the airline service,
from which passengers book their travel tickets. We write its backend services with
Spring Boot. Through this application a user books a flight: where he wants to
travel from and to, which class he wants to travel in, the departure and arrival
times, and how much delay there is in the flight. Spring Boot is widely used for
enterprise application development, so it is a good fit for this part of our
application, and we create Spring REST APIs to handle all of these things.

7.2.3 MYSQL

Backend services are now written for both parts of the application, but how do we
integrate them? To make this easier we simply use one main database for both, so
they share the same DB. When a user books a ticket, his ID, name, class, departure
and arrival times, delay, and distance travelled are maintained by the Spring Boot
services in the MySQL DBMS. When the user reaches the destination he receives an
email; clicking the link in that email takes him to the Flask page, where all the
information needed for the model features is already filled in. The passenger then
only has to rate the in-flight facilities, after which all of this data is passed to
the model file, and the model predicts whether the user is satisfied or
dissatisfied.

7.2.4 POSTMAN

After creating the REST APIs we have to check whether they are working properly.
Before integrating them with the web app we first exercise them in POSTMAN, which
makes it easy to see whether any change is needed or any error is present.

7.3 FRONT END

After completing the backend we create a front end for the app so that users can
interact with it. For that we simply use HTML 5, CSS 3 and Bootstrap, which provide
ready-made layouts and designed elements; we just place them in our app to use them
easily and make a good web application.
