Machine Learning using Exploratory Analysis
http://doi.org/10.22214/ijraset.2019.8073
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.177
Volume 7 Issue VIII, Aug 2019- Available at www.ijraset.com
Abstract: Predictive analytics uses archival data to predict future events. Typically, past data is used to build a mathematical
model that captures the important trends. That predictive model is then applied to current data to predict future outcomes or to
suggest actions that lead to optimal results. Predictive analytics has received a lot of attention in recent years due to advances in
supporting technology, particularly in the areas of big data and machine learning. Companies also use predictive analytics to
create more accurate forecasts, such as forecasting the fare amount for a cab ride in the city. These forecasts enable resource
planning; for instance, the scheduling of cab rentals can be done more effectively. For a cab rental start-up company, the fare
amount depends on many factors. This research aims to understand these patterns and to apply analytics for fare prediction.
The proposed work is to design a system that predicts the fare amount for a cab ride in the city. The aim is to build regression
models that predict the continuous fare amount for each cab ride, based on multiple time-based, positional and general factors.
Keywords: Predictive Analytics, Forecasting, Regression Models, Random Forest, Decision Tree, K-NN
I. INTRODUCTION
Machine learning (ML) is closely related to computational statistics, which focuses on making predictions using computers. Data
mining (DM) is a field of study within ML and focuses on exploratory data analysis through unsupervised learning. In its
application across business problems, machine learning is also referred to as predictive analytics. Machine learning tasks are
classified into several broad categories.
In supervised learning, the algorithm builds a mathematical model from a set of data that contains both the inputs and the desired
outputs. Classification algorithms and regression algorithms are examples of supervised learning. Regression algorithms are named
for their continuous outputs, meaning the output may take any value within a range [1].
In unsupervised learning, the algorithm builds a mathematical model from a set of data that contains only inputs and no desired
output labels.
Unsupervised learning algorithms are used to find structure in the data, like grouping or clustering of data points. Unsupervised
learning can discover patterns in the data and can group the inputs into categories, as in feature learning. Dimensionality reduction is
the process of reducing the number of "features", or inputs, in a set of data.
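The dimensionality reduction idea mentioned above can be illustrated with principal component analysis (PCA). This is only an illustrative sketch on synthetic data; the paper itself does not apply PCA, and the variable names here are made up.

```python
# Dimensionality reduction with PCA: 5 correlated features collapse to 2.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))                   # 2 underlying factors
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 5 features, rank 2

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                   # shape (100, 2)
# Because the data has rank 2, two components capture almost all variance.
```

Here two components suffice because the extra features are linear combinations of the first two; on real data one typically keeps enough components to explain a chosen fraction of the variance.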
Machine learning and data mining often employ the same methods and overlap significantly, but while ML focuses on prediction,
based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties
in the data. This is the analysis step of knowledge discovery in databases (KDD) [1]. DM uses many ML methods, but with different
goals; on the other hand, ML also employs data mining methods as "unsupervised learning" or as a preprocessing step to improve
learner accuracy.
B. Data
The aim is to build regression models that will predict the continuous fare amount for each cab ride, depending on multiple
time-based, positional and generic factors. This problem statement falls under the category of forecasting, which deals with
predicting continuous values for the future (here, the continuous value is the fare amount of the cab ride). Fig.1 shows a sample of
the data set [2] that will be used to predict the fare amount of a cab ride.
II. METHODOLOGY
One common methodology is the Cross-Industry Standard Process for Data Mining (CRISP-DM) model [1,2]. This is a six-phase
process model that provides a fluid framework for devising, creating, building, testing, and deploying machine learning solutions.
Only passenger_count and fare_amount have missing values, and since the percentage of missing values is less than thirty percent,
they are imputed. On randomly assigning NA to a known value of passenger_count and then filling it using three methods (mean,
median and K-NN), it is found that the median gives the closest estimate to the actual value. Similarly, for fare_amount, the mean
yields the value nearest to the real one. Hence, the missing values for passenger_count are filled with the median and those for
fare_amount with the mean. After filling in the missing values, the data looks as shown in Fig.5.
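The imputation strategy described above can be sketched in pandas as follows. The column names match the paper's dataset, but the DataFrame here is a small toy stand-in rather than the actual data.

```python
# Median imputation for passenger_count, mean imputation for fare_amount.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fare_amount": [6.5, np.nan, 12.0, 8.2, np.nan],
    "passenger_count": [1, 2, np.nan, 1, 4],
})

# passenger_count: the median gave the closest estimate to the held-out value
df["passenger_count"] = df["passenger_count"].fillna(df["passenger_count"].median())
# fare_amount: the mean gave the closest estimate
df["fare_amount"] = df["fare_amount"].fillna(df["fare_amount"].mean())
```

After this step no NA values remain in either column.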
C. Outlier Analysis
As depicted in Fig.4, there is a lot of noisy data, so it is important to clean the data for better model performance. In this case, a
classic approach, namely Tukey's method, is used for removing outliers. The outliers are visualized using boxplots. Fig.6a, Fig.6b,
Fig.6c, and Fig.6d plot the boxplots of four of the six predictor variables (as a sample) and the target variable. Useful inferences
can be made from these plots; for example, many outliers and extreme values are present in each of the variables.
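Tukey's method flags any value outside the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR], where IQR is the interquartile range. A minimal sketch, using made-up fare values for illustration:

```python
# Tukey's fences: keep only values within 1.5 * IQR of the quartiles.
import numpy as np

def tukey_filter(x, k=1.5):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return x[(x >= lo) & (x <= hi)]

fares = np.array([4.5, 6.0, 7.5, 8.0, 9.5, 11.0, 250.0])  # 250 is an obvious outlier
clean = tukey_filter(fares)   # the extreme fare is dropped
```

The multiplier k = 1.5 is the conventional choice; a larger k removes only the most extreme points.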
D. Feature Selection
Because all the variables are numeric, the important features are extracted using the correlation matrix. As seen in Fig.8, all the
variables are important for predicting fare_amount, since no pair of variables has a high correlation (taking 0.9 as the
threshold); hence all the variables are kept for model building.
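The correlation screening described above can be sketched as follows. The feature names and data are invented for illustration; a redundant duplicate column is added so the 0.9 threshold actually fires.

```python
# Drop one of any pair of predictors whose absolute correlation exceeds 0.9.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "distance": rng.uniform(1, 20, 200),
    "passenger_count": rng.integers(1, 6, 200).astype(float),
})
df["distance_miles"] = df["distance"] * 0.621  # perfectly correlated duplicate

corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
```

In the paper's data no pair crosses the threshold, so nothing is dropped; here the redundant miles column is flagged.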
Another method for feature selection is Random Forest. Fig.9 shows how Random Forest is used to extract the importance of
each variable in R [4]:
model_rf <- randomForest(fare_amount ~ ., data = train, importance = TRUE, ntree = 300)
importance(model_rf, type = 1)
As is seen, distance has the highest prediction power for fare_amount whereas passenger_count and day have the least prediction
power.
E. Feature Engineering
It is important to infer some knowledge from the existing data and derive more valuable information. Since the dataset already
has a datetime variable, the year, month, day, weekday and hour are extracted, as these might affect the fare and allow further
EDA on the data. Also, since the longitude and latitude points are available, the distance traveled per ride can be calculated to
derive a relationship between the fare amount and the distance. The Haversine formula [5] is used to compute the distance in
kilometers: it calculates the shortest distance between two points on a sphere from their latitudes and longitudes, measured along
the surface, and is widely used in navigation.
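The feature engineering above can be sketched as follows. The column names (pickup_datetime and the pickup/drop-off coordinates) are assumptions about the dataset's schema, and the single row is a made-up example.

```python
# Extract datetime parts and compute the haversine distance in kilometres.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in km (mean Earth radius ~6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2015-01-27 13:08:24"]),
    "pickup_latitude": [40.7614], "pickup_longitude": [-73.9776],
    "dropoff_latitude": [40.6513], "dropoff_longitude": [-73.9496],
})
dt = df["pickup_datetime"].dt
df["year"], df["month"], df["day"] = dt.year, dt.month, dt.day
df["weekday"], df["hour"] = dt.weekday, dt.hour
df["distance_km"] = haversine_km(df["pickup_latitude"], df["pickup_longitude"],
                                 df["dropoff_latitude"], df["dropoff_longitude"])
```

For the sample coordinates the computed distance comes out a little over 12 km.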
III. MODELING
A. Model Selection
In the early stages of analysis, during pre-processing, it is understood that fare_amount depends on multiple factors.
Therefore, it is important to build a model that takes in all the required inputs and fits them in such a way that it gives the most
accurate result among all the models. The dependent variable can fall in any of four categories: Nominal, Ordinal, Interval, and
Ratio. Three approaches are taken and compared:
B. Decision Tree
A decision tree is a tree-like graph with nodes representing the place where an attribute is picked and queried; edges represent the
answers to the query, and the leaves represent the actual output or class label. Decision trees are nonlinear [6]. Decision Tree
algorithms are referred to as Classification and Regression Trees (CART) [7].
Max Depth: the larger the dataset, the harder it is to visualize, so the maximum depth of the tree is limited to five:
fit = DecisionTreeRegressor(max_depth=5).fit(train.iloc[:, 1:], train.iloc[:, 0])
C. Random Forest
Random forest is a tree-based algorithm, which involves building several trees (decision trees), then combining their output to
improve the generalization ability of the model. The method of combining trees is known as an ensemble method. The ensemble is a
combination of weak learners (individual trees) to produce a strong learner. Random Forest can be used to solve regression and
classification problems. In regression problems, the dependent variable is continuous. In classification problems, the dependent
variable is categorical.
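A minimal random-forest regression sketch, in the spirit of the paper's setup. The data here is synthetic (fare driven mostly by distance, with noise); the real model uses the full feature set described earlier.

```python
# Random forest regression on synthetic distance/passenger data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
distance = rng.uniform(1, 20, 500)
passengers = rng.integers(1, 6, 500)
fare = 2.5 + 1.8 * distance + rng.normal(0, 1.0, 500)  # fare mostly set by distance

X = np.column_stack([distance, passengers])
model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X[:400], fare[:400])                 # hold out the last 100 rows
pred = model.predict(X[400:])
rmse = mean_squared_error(fare[400:], pred) ** 0.5
```

As expected, the ensemble also ranks distance far above passenger count in feature importance, matching the paper's finding.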
D. K-NN
The K-Nearest Neighbors (K-NN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used
to solve both classification and regression problems. The K-NN algorithm assumes that similar things exist in close proximity.
K-NN makes predictions using the training dataset directly. Predictions are made for a new instance (x) by searching through the
entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For
regression this might be the mean output variable; in classification, it might be the mode (most common) class value. To determine which of
the K instances in the training dataset are most similar to a new input a distance measure is used. For real-valued input variables, the
most popular distance measure is Euclidean distance.
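The prediction rule above (average the target over the K nearest neighbors under Euclidean distance) can be sketched as follows, again on synthetic fare data rather than the paper's dataset.

```python
# K-NN regression: predict the mean fare of the 5 nearest training rides.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(7)
distance = rng.uniform(1, 20, 300).reshape(-1, 1)     # one feature: trip distance
fare = 2.5 + 1.8 * distance.ravel() + rng.normal(0, 0.5, 300)

knn = KNeighborsRegressor(n_neighbors=5, metric="euclidean")
knn.fit(distance, fare)
pred = knn.predict([[10.0]])  # mean fare of the 5 rides closest to 10 km
```

The prediction lands near the underlying 2.5 + 1.8 * 10 = 20.5, since the five nearest neighbors all have distances close to 10 km.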
IV. RESULTS
K-NN: RMSE = 2.6025, R-Squared = 0.5297
V. CONCLUSION
The quality of a regression model depends on how well its predictions match the actual values. In regression problems, the
dependent variable is continuous; in classification problems, it is categorical. Random Forest can be used to solve both regression
and classification problems, as can K-NN, a simple, easy-to-implement supervised learning algorithm. Decision trees are
nonlinear; unlike linear regression, there is no single equation expressing the relationship between the independent and dependent
variables. Of the three models, Random Forest is the best, as it has the lowest RMSE score and the highest R-Squared score,
meaning it explains the most variability and fits this data best.