Machine Learning using Exploratory Analysis
http://doi.org/10.22214/ijraset.2019.8073
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.177
Volume 7 Issue VIII, Aug 2019- Available at www.ijraset.com
Abstract: Predictive analytics uses archival data to predict future events. Typically, past data is used to build a mathematical
model that captures the important trends. That predictive model is then applied to current data to predict future outcomes or to
suggest actions that lead to optimal results. Predictive analytics has received a lot of attention in recent years due to advances in
supporting technology, particularly in the areas of big data and machine learning. Companies also use predictive analytics to
create more accurate forecasts, such as forecasting the fare amount for a cab ride in the city. These forecasts enable resource
planning; for instance, the scheduling of cab rentals can be done more effectively. For a cab rental start-up company, the fare
amount depends on many factors. This research aims to understand these patterns and to apply analytics for fare prediction.
The proposed work is to design a system that predicts the fare amount for a cab ride in the city. The aim is to build regression
models that predict the continuous fare amount for each cab ride, based on multiple time-based, positional and general factors.
Keywords: Predictive Analytics, Forecasting, Regression Models, Random Forest, Decision Tree, K-NN
I. INTRODUCTION
Machine learning (ML) is closely related to computational statistics, which focuses on making predictions using computers. Data
mining (DM) is a field of study within ML and focuses on exploratory data analysis through unsupervised learning. In its
application across business problems, machine learning is also referred to as predictive analytics. Machine learning tasks are
classified into several broad categories.
In supervised learning, the algorithm builds a mathematical model from a set of data that contains both the inputs and the desired
outputs. Classification algorithms and regression algorithms are examples of supervised learning. Regression algorithms are named
for their continuous outputs, meaning the output may take any value within a range [1].
In unsupervised learning, the algorithm builds a mathematical model from a set of data that contains only inputs and no desired
output labels.
Unsupervised learning algorithms are used to find structure in the data, like grouping or clustering of data points. Unsupervised
learning can discover patterns in the data and can group the inputs into categories, as in feature learning. Dimensionality reduction is
the process of reducing the number of "features", or inputs, in a set of data.
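The dimensionality reduction idea mentioned above can be illustrated with principal component analysis (PCA). This is only an illustrative sketch on synthetic data; the paper itself does not apply PCA, and the variable names here are made up.

```python
# Dimensionality reduction with PCA: 5 correlated features collapse to 2.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                   # 2 underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 features, rank 2

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                   # shape (100, 2)
# Because the data has rank 2, two components capture almost all variance.
```

Here two components suffice because the extra features are linear combinations of the first two; on real data one typically keeps enough components to explain a chosen fraction of the variance.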
Machine learning and data mining often employ the same methods and overlap significantly, but while ML focuses on prediction,
based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties
in the data. This is the analysis step of knowledge discovery in databases (KDD) [1]. DM uses many ML methods, but with different
goals; on the other hand, ML also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve
learner accuracy.
B. Data
The aim is to build regression models that will predict the continuous fare amount for each cab ride, depending on multiple
time-based, positional and generic factors. This problem statement falls under the category of forecasting, which deals with
predicting continuous values for the future (here, the continuous value is the fare amount of the cab ride). Fig.1 shows a sample of
the data set [2] that will be used to predict the fare amount of a cab ride.
II. METHODOLOGY
One common methodology is the Cross-Industry Standard Process for Data Mining (CRISP-DM) model [1,2]. This is a six-phase
process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions.
Only passenger_count and fare_amount have missing values, and since the percentage of missing values is less than thirty percent,
they are imputed. On randomly assigning NA to a known value of passenger_count and then filling it using three methods (mean,
median and K-NN), it is found that the median gives the closest estimate to the actual value. Similarly, for fare_amount, the mean
yields the value nearest to the real one. Hence, the missing values for passenger_count are filled with the median and those for
fare_amount with the mean. After filling in the missing values, the data looks as shown in Fig.5.
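The imputation strategy described above can be sketched in pandas as follows. The column names match the paper's dataset, but the DataFrame here is a small toy stand-in rather than the actual data.

```python
# Median imputation for passenger_count, mean imputation for fare_amount.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fare_amount": [6.5, np.nan, 12.0, 8.2, np.nan],
    "passenger_count": [1, 2, np.nan, 1, 4],
})

# passenger_count: the median gave the closest estimate to the held-out value
df["passenger_count"] = df["passenger_count"].fillna(df["passenger_count"].median())
# fare_amount: the mean gave the closest estimate
df["fare_amount"] = df["fare_amount"].fillna(df["fare_amount"].mean())
```

After this step no NA values remain in either column.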
C. Outlier Analysis
As depicted in Fig.4, there is a lot of noisy data, so it is important to clean the data for better model performance. In this case, a
classic approach, namely Tukey's method, is used for removing outliers. The outliers are visualized using boxplots. Fig.6a, Fig.6b,
Fig.6c, and Fig.6d plot the boxplots of four of the six predictor variables (as a sample) and the target variable. Useful inferences
can be made from these plots; for example, many outliers and extreme values are present in each of the variables.
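Tukey's method flags any value outside the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR], where IQR is the interquartile range. A minimal sketch, using made-up fare values for illustration:

```python
# Tukey's fences: keep only values within 1.5 * IQR of the quartiles.
import numpy as np

def tukey_filter(x, k=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return x[(x >= lo) & (x <= hi)]

fares = np.array([4.5, 6.0, 7.5, 8.0, 9.5, 11.0, 250.0])  # 250 is an obvious outlier
clean = tukey_filter(fares)   # the extreme fare is dropped
```

The multiplier k = 1.5 is the conventional choice; a larger k removes only the most extreme points.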
D. Feature Selection
Because all the variables are numeric, the important features are extracted using the correlation matrix. As seen in Fig.8, all the
variables are important for predicting fare_amount, since no pair of variables has a high correlation (taking 0.9 as the
threshold); hence all the variables are kept for model building.
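The correlation screening described above can be sketched as follows. The feature names and data are invented for illustration; a redundant duplicate column is added so the 0.9 threshold actually fires.

```python
# Drop one of any pair of predictors whose absolute correlation exceeds 0.9.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "distance": rng.uniform(1, 20, 200),
    "passenger_count": rng.integers(1, 6, 200).astype(float),
})
df["distance_miles"] = df["distance"] * 0.621  # perfectly correlated duplicate

corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
```

In the paper's data no pair crosses the threshold, so nothing is dropped; here the redundant miles column is flagged.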
Another method for feature selection is Random Forest. Fig.9 shows how Random Forest is used to extract the importance of
each variable in R [4]:
model_rf <- randomForest(fare_amount ~ ., data = train, importance = TRUE, ntree = 300)
importance(model_rf, type = 1)
As is seen, distance has the highest prediction power for fare_amount whereas passenger_count and day have the least prediction
power.
E. Feature Engineering
It is important to infer some knowledge from the existing data and derive more valuable information. Since the dataset already
has a datetime variable, the year, month, day, weekday and hour are extracted, as these might affect the fare and allow further
EDA on the data. Also, since the longitude and latitude points are available, the distance traveled per ride can be calculated to
derive a relationship between the fare amount and the distance. The Haversine formula [5] is used to compute the distance in
kilometers: it calculates the shortest distance between two points on a sphere from their latitudes and longitudes, measured along
the surface, and is widely used in navigation.
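The feature engineering above can be sketched as follows. The column names (pickup_datetime and the pickup/drop-off coordinates) are assumptions about the dataset's schema, and the single row is a made-up example.

```python
# Extract datetime parts and compute the haversine distance in kilometres.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in km (mean Earth radius ~6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2015-01-27 13:08:24"]),
    "pickup_latitude": [40.7614], "pickup_longitude": [-73.9776],
    "dropoff_latitude": [40.6513], "dropoff_longitude": [-73.9496],
})
dt = df["pickup_datetime"].dt
df["year"], df["month"], df["day"] = dt.year, dt.month, dt.day
df["weekday"], df["hour"] = dt.weekday, dt.hour
df["distance_km"] = haversine_km(df["pickup_latitude"], df["pickup_longitude"],
                                 df["dropoff_latitude"], df["dropoff_longitude"])
```

For the sample coordinates the computed distance comes out a little over 12 km.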
III. MODELING
A. Model Selection
In the early stages of analysis, during pre-processing, it is understood that fare_amount depends on multiple factors.
Therefore, it is important to build a model that takes in all the required inputs and fits them in such a way that it gives the most
accurate result among all the models. The dependent variable can fall in any of four categories: Nominal, Ordinal, Interval, and
Ratio. Three approaches are taken and compared:
B. Decision Tree
A decision tree is a tree-like graph with nodes representing the place where an attribute is picked and queried; edges represent the
answers to the query, and the leaves represent the actual output or class label. Decision trees are nonlinear [6]. Decision Tree
algorithms are referred to as Classification and Regression Trees (CART) [7].
Max Depth: the larger the dataset, the harder it is to visualize, so the maximum depth of the tree is limited to five:
fit = DecisionTreeRegressor(max_depth=5).fit(train.iloc[:, 1:], train.iloc[:, 0])
C. Random Forest
Random forest is a tree-based algorithm, which involves building several trees (decision trees), then combining their output to
improve the generalization ability of the model. The method of combining trees is known as an ensemble method. The ensemble is a
combination of weak learners (individual trees) to produce a strong learner. Random Forest can be used to solve regression and
classification problems. In regression problems, the dependent variable is continuous. In classification problems, the dependent
variable is categorical.
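A minimal random-forest regression sketch, in the spirit of the paper's setup. The data here is synthetic (fare driven mostly by distance, with noise); the real model uses the full feature set described earlier.

```python
# Random forest regression on synthetic distance/passenger data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
distance = rng.uniform(1, 20, 500)
passengers = rng.integers(1, 6, 500)
fare = 2.5 + 1.8 * distance + rng.normal(0, 1.0, 500)  # fare mostly set by distance

X = np.column_stack([distance, passengers])
model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X[:400], fare[:400])                 # hold out the last 100 rows
pred = model.predict(X[400:])
rmse = mean_squared_error(fare[400:], pred) ** 0.5
```

As expected, the ensemble also ranks distance far above passenger count in feature importance, matching the paper's finding.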
D. K-NN
The K-Nearest Neighbors (K-NN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used
to solve both classification and regression problems. The K-NN algorithm assumes that similar things exist in close proximity.
K-NN makes predictions using the training dataset directly. Predictions are made for a new instance (x) by searching through the
entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For
regression this might be the mean output variable; in classification, it might be the mode (most common) class value. To determine which of
the K instances in the training dataset are most similar to a new input a distance measure is used. For real-valued input variables, the
most popular distance measure is Euclidean distance.
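The prediction rule above (average the target over the K nearest neighbors under Euclidean distance) can be sketched as follows, again on synthetic fare data rather than the paper's dataset.

```python
# K-NN regression: predict the mean fare of the 5 nearest training rides.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)
distance = rng.uniform(1, 20, 300).reshape(-1, 1)     # one feature: trip distance
fare = 2.5 + 1.8 * distance.ravel() + rng.normal(0, 0.5, 300)

knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean")
knn.fit(distance, fare)
pred = knn.predict([[10.0]])  # mean fare of the 5 rides closest to 10 km
```

The prediction lands near the underlying 2.5 + 1.8 * 10 = 20.5, since the five nearest neighbors all have distances close to 10 km.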
IV. RESULTS
K-NN: RMSE = 2.6025, R-Squared = 0.5297
V. CONCLUSION
The quality of a regression model depends on how well its predictions match the actual values. In regression problems, the
dependent variable is continuous; in classification problems, it is categorical. Random Forest can be used to solve both regression
and classification problems, as can K-NN, a simple, easy-to-implement supervised learning algorithm. Decision trees are
nonlinear; unlike linear regression, there is no single equation expressing the relationship between the independent and dependent
variables. Of the three models, Random Forest is the best, as it has the lowest RMSE score and the highest R-Squared score,
meaning it explains the most variability and fits this data best.