
University of Sheffield

Machine Learning Based Modelling and Comparative Analysis of New York City
Taxi and Rideshare Fares with Web Integration

Jack Burke

Supervisor: Tahsinur Khan

A report submitted in fulfilment of the requirements
for the degree of MComp in Computer Science
in the
Department of Computer Science

May 8, 2024
Declaration

All sentences or passages quoted in this report from other people’s work have been
specifically acknowledged by clear cross-referencing to author, work and page(s). Any
illustrations that are not the work of the author of this report have been used with the
explicit permission of the originator and are specifically acknowledged. I understand that
failure to do this amounts to plagiarism and will be considered grounds for failure in this
project and the degree examination as a whole.

Name: Jack Burke

Signature: Jack Burke

Date: 08/05/2024

Abstract

Taxis and rideshares are key modes of transport across cities, each with its own pricing
structure: taxis use a fixed, metered fare system whose cost can vary with traffic levels,
whereas rideshares use surge pricing that fluctuates with user demand.

Using the New York City taxi dataset, which has been published monthly since 2009
and aggregates over 3 billion taxi and rideshare trips, this project applies machine learning
to make predictions about trips and provide greater insight into which transport mode is
cheapest at a given time and day of the week.

The goal is to provide users with a web application so they can access this data easily,
and plan their trips better by comparing both transport modes and their fares.

Contents

1 Introduction
  1.1 Background
  1.2 Aims
  1.3 Objectives
  1.4 Overview of the Report

2 Literature Survey
  2.1 Limitations of Existing Applications for Estimating and Comparing Journey Fares
  2.2 The Dataset
    2.2.1 New York City Taxi and Limousine Commission Trip Record Data
    2.2.2 New York City Taxi and Uber Fare Models
    2.2.3 Advantages of the Dataset
    2.2.4 Limitations of the Dataset
  2.3 Machine Learning Models Applied on the Dataset
    2.3.1 Decision Trees
    2.3.2 Random Forest
    2.3.3 Gradient Boosting
    2.3.4 XGBoost
    2.3.5 Multilayer Perceptron (MLP)
  2.4 Evaluation
    2.4.1 Root Mean Square Error (RMSE)
  2.5 Analysis of Machine Learning Models Applied on the Dataset
    2.5.1 Comparison of Linear Regression Against Random Forest to Predict Trip Fares and Duration on the Dataset
    2.5.2 Comparison of XGBoost Against MLP to Predict Trip Duration on the Dataset

3 Requirements and Analysis
  3.1 Project Requirements
    3.1.1 Functional
    3.1.2 Non-functional
  3.2 Evaluation of Machine Learning models
  3.3 Legal and Ethical considerations

4 Design
  4.1 Methodology
    4.1.1 Machine Learning Life Cycle
    4.1.2 Cross-Industry Standard Process for Data Mining
  4.2 Dataset and Features
    4.2.1 Feature Selection
    4.2.2 Feature Engineering
    4.2.3 Handling the Data
  4.3 Modeling
    4.3.1 Model Frameworks
    4.3.2 Model Structure
    4.3.3 Model Development Tools
  4.4 Web Application and Deployment
    4.4.1 Libraries and Frameworks
    4.4.2 Application Flow
    4.4.3 Wireframe

5 Implementation and testing
  5.1 Difficulties Faced
    5.1.1 Difficulties with The University of Sheffield High Performance Computing (HPC) cluster
    5.1.2 Databricks
  5.2 Data Preparation
    5.2.1 Feature Selection
    5.2.2 Feature Engineering
  5.3 Modeling
    5.3.1 Databricks AutoML
  5.4 Web Application
    5.4.1 Front-end
  5.5 Testing

6 Results and discussion
  6.1 Model Results
    6.1.1 Yellow Taxi Fare amount results
    6.1.2 Trip Time results
    6.1.3 Uber fare amount results
    6.1.4 Trip distance results
    6.1.5 Comparison to External Predictions
  6.2 Requirements Analysis
    6.2.1 Non-Functional requirements
    6.2.2 Functional requirements
  6.3 Challenges and Further Work
    6.3.1 H3 Spatial Indexes
    6.3.2 Progressive Web App
    6.3.3 Live traffic data

7 Conclusions

List of Figures

2.1 Map of New York City with TLC Taxi Zones, boroughs and airports labelled
2.2 Taxi Zones Example
2.3 Visualisation of a decision tree.
2.4 Visualisation of a random forest model.
2.5 The architecture of a Gradient Boosting Decision Tree.

4.1 CRISP-DM Life Cycle
4.2 Structure of the Machine Learning models
4.3 The Application flow
4.4 Wireframe of the web application.

5.1 The HPC cluster used for the project
5.2 Code Snippet: PySpark filter() and drop() functions
5.3 Code Snippet: Transformation of pick-up datetime to dayOfWeek, hourOfDay and minuteOfHour
5.4 Code Snippet: Haversine Formula on the pick-up and drop-off coordinates
5.5 Code Snippet: Calculating rush hour surcharges
5.6 GeoJSON boundaries of LaGuardia Airport in New York City
5.7 Code Snippet: Determine if a latitude and longitude point is inside a GeoJSON polygon.
5.8 Databricks AutoML configuration
5.9 Databricks AutoML runs
5.10 Databricks MLFlow model endpoint deployment
5.11 Code Snippet: Sampling the dataset
5.12 The web application showing a trip from the Empire State Building to Central Park.
5.13 One of the results returned from the Photon API for the search “Grand Central”.
5.14 Search for “rockefeller” and its autocomplete results.
5.15 Time and day of week selection.
5.16 Route from Chrysler Building to Grand Central displayed on web application map.
5.17 Code Snippet: Locating Taxi Zones using Turf.js
5.18 Website adjusted for mobile viewports

List of Tables

2.1 Data dictionary for Yellow, Green and HVFHS taxis.

3.1 Functional Requirements of the project
3.2 Non-Functional Requirements of the project

4.1 Taxi Fare Comparison across 2012-2022 and 2023 onwards

5.1 Manual Tests for Web Application
5.2 Functional Requirements of the project

6.1 Results of the models for the project
6.2 Success of Non-Functional Requirements of the project

Chapter 1

Introduction

1.1 Background
Yellow Taxis are an iconic symbol of New York City and form an integral part of its
transport network, with over a hundred thousand trips taken every day across the city’s five
boroughs (Schneider, 2024). However, rising competition from rideshare apps like Uber
and Lyft has completely changed the market, with these apps taking five or six times as
many trips per day and holding around an 87% market share. Pricing between the two
modes can vary heavily: rideshares are susceptible to “rush hour, rain or snow”
surge pricing (CBS News, 2023), whereas taxis use a fixed “metered” fare system (NYC
Taxi and Limousine Commission, 2023c).

This project addresses the need for commuters to compare the fares of New York City taxis
against rideshares and allow them to make more informed decisions on their transportation
options.

1.2 Aims
The aim of this project is to enable a user to compare the fares of New York City taxis
against rideshares. To elaborate further, the plan is to create a web application where
the user can input an origin and destination within New York and its vicinity, along with
the time and day of the week. The application will then present estimated fares for both
New York City taxis and rideshares to the user.

1.3 Objectives
The objectives of this project are:

• Develop and deploy machine learning models capable of predicting New York City
taxi and rideshare fares.

• Create a user-friendly web application that enables users to perform real-time
inference using the deployed models, presenting fare predictions in a clear and
intuitive manner.


1.4 Overview of the Report


This dissertation project is divided into seven chapters, each detailing a stage of the
project and the work undertaken to achieve its respective objectives.

Chapter 2 Literature Survey: This chapter surveys all relevant literature and research
for this project, specifically the existing solutions, the dataset used and previous research
on machine learning models applied to this dataset. Additionally, it examines the benefits
and drawbacks of these approaches.

Chapter 3 Requirements and Analysis: This chapter details the requirements for
this project to be deemed successful while also acknowledging potential constraints.

Chapter 4 Design: This chapter details the development methodology and the design
of the dataset, machine learning models and the web application, providing justification
for each approach and the libraries and frameworks used.

Chapter 5 Implementation and testing: This chapter details the development process
and the issues faced. It also highlights specific features and their implementation
and testing.

Chapter 6 Results and discussion: This chapter details the results of the machine
learning models and evaluates the success of the application in meeting its requirements.

Chapter 7 Conclusions: This chapter details what this project has achieved, the lessons
that have been learnt and what could be done in future implementations to build upon
this work.
Chapter 2

Literature Survey

2.1 Limitations of Existing Applications for Estimating and Comparing Journey Fares
Applications have been developed to provide a clear estimate and comparison of New York
City taxi and rideshare fares. Apps such as CityMapper [1] and RideGuru [2] allow users
to plan their journeys and see the transport modes available, along with their journey
times and costs. OpenStreetCab (Noulas et al., 2018) is another example; the app was
deployed on the iOS and Android platforms and has over 15,000 user installs.

However, the limitation of OpenStreetCab is that it relies on Application Programming
Interfaces (APIs) to get price estimates from Uber (Uber Inc., 2023a) and traffic time
estimates from HERE Maps (HERE, 2023). This is because the taxi providers in New York
City do not provide price estimate APIs. As a result, taxi price estimates have to be
calculated using traffic time estimates and the fixed tariffs set out by the city (NYC Taxi
and Limousine Commission, 2023c). These tariffs are mainly affected by the speed of the
vehicle and the time spent stationary, at a rate of “70 cents per 1/5 mile when traveling
above 12mph or per 60 seconds in slow traffic or when the vehicle is stopped.”

Using an API for this is not a completely effective approach, as there are “variations
in travel times driven by urban congestion or alternative routes picked by driver” (Noulas
et al., 2018). This means that for the same trip there can be multiple routes with varying
distances and travel times, directly impacting the cost of the trip. The API they used,
HERE Maps, only predicts the “fastest or shortest journey” (HERE, 2023), which does not
reflect real-world scenarios. Furthermore, since that paper was written, Uber has changed
its terms of service to state that “using the Uber API to offer price comparisons with
competitive third party services is in violation of § II B of the API Terms of Use”
(Uber Inc., 2023a).

[1] CityMapper, New York City: https://citymapper.com/nyc
[2] RideGuru: https://ride.guru


2.2 The Dataset


2.2.1 New York City Taxi and Limousine Commission Trip Record Data
The dataset for this project is provided by the NYC Taxi and Limousine Commission
(TLC), the city’s government agency responsible for licensing and regulating all medallion
taxis and for-hire vehicles. The dataset is released monthly as part of the “TLC Trip
Record Data” (NYC Taxi and Limousine Commission, 2023b).

The dataset accounts for: Yellow Taxis, which are the iconic street-hailed taxis and can
be hailed from anywhere in NYC; Green Taxis, which are also known as ‘boro taxis’ and
can only be hailed in Upper Manhattan and the outer boroughs (excluding the airports),
although they are able to travel to anywhere in NYC; and High-Volume For-Hire Services
(HVFHS), which are vehicles that can be requested from anywhere in NYC but only through
ride-hailing services such as Uber or Lyft.

The dataset is vast, with Yellow Taxis and HVFHS containing 100-200 million records
per year, and Green Taxis containing about 20 million records per year.

The dataset for Yellow Taxis dates back to 2009, with Green Taxis being added in 2013,
and HVFHS in 2019. Each record contains around 20 pieces of data, including:

Field               Description
RateCodeID          1 = Standard rate
                    2 = JFK (Airport)
                    3 = Newark (Airport)
                    4 = Nassau or Westchester
pickup datetime     The date and time when the meter was engaged.
dropoff datetime    The date and time when the meter was disengaged.
trip distance       The elapsed trip distance in miles reported by the taximeter.
pickup longitude*   Longitude where the meter was engaged.
pickup latitude*    Latitude where the meter was engaged.
dropoff longitude*  Longitude where the meter was disengaged.
dropoff latitude*   Latitude where the meter was disengaged.
PULocationID*       TLC Taxi Zone in which the taximeter was engaged.
DOLocationID*       TLC Taxi Zone in which the taximeter was disengaged.
fare amount         The time-and-distance fare calculated by the meter.
total amount        The total amount charged to passengers. Does not include cash tips.

Table 2.1: Data dictionary for Yellow, Green and HVFHS taxis.
Data Dictionary provided by NYC Taxi and Limousine Commission - Available at (NYC
Taxi and Limousine Commission, 2023a)

* For Yellow and Green Taxis, the coordinates of the pickup and drop-off locations are
provided up until July 2016, after which a zone system is used instead. The zone
system divides New York City into 262 zones, seen in Figure 2.1, which are based on NYC’s
Neighbourhood Tabulation Areas (NTAs) and are meant to closely approximate NYC’s
neighbourhoods. For all HVFHS data, only the zone system is used.

Figure 2.1: Map of New York City with TLC Taxi Zones, boroughs and airports labelled
Map data © OpenStreetMap contributors
Source: https://www.openstreetmap.org/
Licensed under the Open Data Commons Open Database License (ODbL)
https://www.openstreetmap.org/copyright
NYC Taxi Zones - Available at (NYC Taxi and Limousine Commission, 2023)

2.2.2 New York City Taxi and Uber Fare Models

Yellow and Green Taxis operate on the same fixed fare structure. As of 2023, there is an
initial $3 charge with a $2.50 surcharge for weekday rush hours and a $1 surcharge for
overnight hours (8pm to 6am). The meter charges “70 cents per 1/5 mile when travelling
above 12 mph or per 60 seconds in slow traffic or when the vehicle is stopped.” Other
surcharges exist for improvement works, state congestion fees, going to the airports and
any tolls (NYC Taxi and Limousine Commission, 2023c).

Uber uses a proprietary “dynamic pricing algorithm” to determine its fares. The
algorithm factors in variables such as the time and distance of the route, traffic and
the current rider-to-driver demand. When there is a sudden increase in rider demand due
to factors such as “bad weather, rush hour, and special events”, Uber implements ‘surge
pricing’, where fares are increased to match supply and demand (Uber Inc., 2023b).

Taxis do not have surge pricing, so during busy times they are often cheaper than
rideshares. On the other hand, unlike taxis, Ubers do not charge depending on the speed
of the vehicle.
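
The taxi tariff quoted above can be illustrated with a rough worked example. The sketch below is a simplification and not the TLC's actual metering logic: it assumes the whole trip distance is billed per 1/5 mile and that any slow or stopped time is billed separately per 60 seconds. The function name and parameters are hypothetical, and tolls, airport fees and the other surcharges are ignored.

```python
def estimate_metered_fare(distance_miles, slow_minutes=0.0,
                          rush_hour=False, overnight=False):
    """Rough taxi fare estimate using the 2023 tariff figures quoted above.

    Assumption: distance and slow/stopped time are billed independently,
    which only approximates how a real taximeter switches between them.
    """
    initial_charge = 3.00       # initial charge in dollars
    unit_charge = 0.70          # per 1/5 mile, or per 60 seconds in slow traffic
    rush_hour_surcharge = 2.50  # weekday rush hours
    overnight_surcharge = 1.00  # overnight hours (8pm to 6am)

    distance_units = distance_miles * 5  # number of 1/5 mile units
    time_units = slow_minutes            # number of 60 second units

    fare = initial_charge + unit_charge * (distance_units + time_units)
    if rush_hour:
        fare += rush_hour_surcharge
    if overnight:
        fare += overnight_surcharge
    return round(fare, 2)
```

Under these assumptions, a 2 mile trip with no slow traffic comes to 3.00 + 0.70 × 10 = $10.00, and the same trip with 5 minutes of slow traffic during a weekday rush hour comes to $16.00.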

2.2.3 Advantages of the Dataset


The TLC Trip Record Data (NYC Taxi and Limousine Commission, 2023b) is published
monthly as part of a New York State Freedom of Information Law (FOIL) request (NYC,
2023), so unlike the APIs discussed earlier, it is free to use with no restrictions or costs.
In addition, it dates back to 2009, with over 3 billion recorded trips (Schneider, 2023;
NYC Taxi and Limousine Commission, 2023b) made by taxis and rideshares. This means
trip distances, times and costs can be computed using historical trends, and the differences
between weekdays, weekends, seasons and more can be factored in. This could allow for
better journey planning for users by forecasting journey times and their cost across a
given time period. This is unlike APIs, where the data is only relevant at the time of
the request (Uber Inc., 2023a).

2.2.4 Limitations of the Dataset


The main limitation of the dataset is the zone system, as shown in Figure 2.1. The trip data
for yellow and green taxis after July 2016, and all HVFHS data, uses taxi zones identified by
numeric identifiers. Consequently, it is difficult to distinguish between records, as trips
could start and end anywhere within a zone, as seen in Figure 2.2. The only features that
can be used to distinguish records, apart from their zones, are the trip distance and time.

Handling of the zone-based data is a novel problem and has not been discussed in any
prior literature.

Figure 2.2: Taxi Zones Example
Designed using Diagrams.net - Available at https://diagrams.net
Made with data from NYC Taxi Zones - Available at (NYC Taxi and Limousine
Commission, 2023)

2.3 Machine Learning Models Applied on the Dataset

This section introduces the machine learning models and techniques that have been
successfully applied on the dataset. In prior research, Antoniades et al. (2016)
compare linear regression and random forest to predict trip fares and duration, and
Manoharan et al. (2021) compare XGBoost against Multilayer Perceptron to
predict trip duration.

2.3.1 Decision Trees


Decision trees are a supervised learning model which splits the data multiple times based
upon ”cutoff values” in its features. From these splits a hierarchical structure is made
with an initial ‘root’ node and branching ‘nodes’ leading to terminal ‘leaf nodes’ which
are the subsets of the data. At each node the model performs decision-making, evaluating
based upon the desired features of the terminal subset. Decision trees can be used for
classification and regression (Molnar, 2022).

[Figure: a root node branches on True/False into two internal nodes, each of which
branches on True/False into two leaf nodes.]

Figure 2.3: Visualisation of a decision tree.
Designed using Diagrams.net - Available at https://diagrams.net

Decision trees face problems such as overfitting, where the tree develops so many hypotheses
based upon the training data that it performs poorly with the test data (Sayad, 2023). In
addition, they can be high-variance estimators, as slight variations in the data can result
in completely different trees (IBM, 2023a). Lastly, they can be more computationally
costly compared to other models due to using a ‘greedy search approach.’
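
The idea of splitting on a “cutoff value” can be sketched with the smallest possible regression tree: a single split on one feature (often called a decision stump). This is an illustrative sketch only, not the method used in the cited works. The function names are hypothetical, squared error is assumed as the split criterion, and the input is assumed to contain at least two distinct feature values.

```python
def best_split(xs, ys):
    """Greedily find the cutoff on one feature that minimises squared error.

    Returns (cutoff, left_mean, right_mean), where the means are the values
    predicted by the two leaf nodes.
    """
    best = None
    values = sorted(set(xs))
    # Candidate cutoffs are the midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        cutoff = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= cutoff]
        right = [y for x, y in zip(xs, ys) if x > cutoff]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, cutoff, left_mean, right_mean)
    return best[1:]


def predict(stump, x):
    """Route a value to the left or right leaf and return that leaf's mean."""
    cutoff, left_mean, right_mean = stump
    return left_mean if x <= cutoff else right_mean
```

A full decision tree applies the same search recursively to each resulting subset, which is where the greedy cost mentioned above comes from: every node evaluates every candidate cutoff.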

2.3.2 Random Forest


Random forest is a model that combines multiple decision trees via bagging to reach a
single result (IBM, 2023b). Bagging is a method in which a random subset is taken from
the training data, with a decision tree being constructed from this data. That subset is
put back into the training data and this process is repeated multiple times. Random forest
enhances bagging by using ‘feature randomness’, where only a random subset of features
is used in the decision-making process at each node. With these multiple decision trees,
‘ensemble’ learning methods are used to produce a result.

Ensemble learning is where multiple machine learning models are combined. In simple
techniques, their predictions are combined either by majority voting for classification or
by averaging for regression (Mwiti, 2021).

Random forest solves the key problems of decision trees such as overfitting. This is
due to random forest using multiple decision trees and the ”averaging of uncorrelated
trees lowers the overall variance and prediction error” (IBM, 2023b).
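
The two building blocks described above, drawing a random sample for each tree and combining the trees' outputs, can be sketched as follows. This is a minimal illustration under stated assumptions rather than a full random forest: the function names are hypothetical, each bag is assumed to be drawn with replacement and to be the same size as the training data, and the tree training itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement (one 'bag')."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine class predictions from several trees (classification)."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Combine numeric predictions from several trees (regression)."""
    return sum(predictions) / len(predictions)
```

For fare prediction, which is a regression problem, the averaging form would apply: each tree is trained on its own bag and the forest returns the mean of the per-tree fares.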

[Figure: the dataset feeds several decision trees, each producing an output; the outputs
are combined by majority voting or averaging.]

Figure 2.4: Visualisation of a random forest model.
Designed using Diagrams.net - Available at https://diagrams.net

2.3.3 Gradient Boosting


Gradient boosting is another ensemble algorithm. It passes the results from one model
consecutively to the next, with each iteration improving on the last.
This process of iterating each model is called ‘fitting’ (Natekin and Knoll, 2013).

To work out how to improve the model, it attempts to minimise a ‘loss
function’, which is the “difference between the predicted outputs of a machine learning
algorithm and the actual target values.” There are many loss functions that are appropriate
for different problems, with popular ones including Mean Square Error, Mean Absolute
Error and Log Loss (Alake, 2023).

The results from all of these models are then combined, and ensemble techniques
are used to determine the final result (Natekin and Knoll, 2013).
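
The sequential process described above can be sketched for the squared-error case, where improving on the previous model amounts to fitting the next model to the current residuals. This is a simplified illustration, not the algorithm of any particular library: the function names, the single-split base learner, the number of rounds and the learning rate are all assumptions chosen for the example.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree; returns (cutoff, left_value, right_value)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        cutoff = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= cutoff]
        right = [t for x, t in zip(xs, targets) if x > cutoff]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, cutoff, lv, rv)
    return best[1:]


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Build an additive model: a constant plus a sequence of scaled stumps."""
    base = sum(ys) / len(ys)  # initial prediction: the mean target
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # For squared error, the residuals are the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        cutoff, lv, rv = fit_stump(xs, residuals)
        stumps.append((cutoff, lv, rv))
        preds = [p + learning_rate * (lv if x <= cutoff else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lv if x <= cutoff else rv)
                      for cutoff, lv, rv in stumps)
```

Each round only corrects a fraction of the remaining error (the learning rate), which is why many small models are combined rather than one large one.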

Figure 2.5: The architecture of a Gradient Boosting Decision Tree.
Source: Deng et al. (2021).
Licensed under the Creative Commons Attribution 4.0 International License
http://creativecommons.org/licenses/by/4.0/.

2.3.4 XGBoost
XGBoost, or extreme gradient boosting, is an “optimized distributed gradient boosting
library designed to be highly efficient, flexible and portable” (XGBoost Developers, 2022).
It improves on gradient boosting by using parallel tree boosting, which parallelises the
work of building each decision tree. This allows it to offer high accuracy as well as being
highly computationally efficient (Gupta et al., 2020). XGBoost works on both regression
and classification problems.

2.3.5 Multilayer Perceptron (MLP)


A multilayer perceptron is made of multiple connected neurons or nodes. There is
an input layer defined by an input vector and an output layer defined by an output vector.
The neurons in the input layer are connected to neurons in the output layer through one
or more hidden layers, with each connection having a variable weight. The input vectors
represent features of the dataset, and the output vectors represent the prediction
(Gardner and Dorling, 1998).
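
How an input vector is turned into an output vector through weighted connections can be sketched as a single forward pass. This is a minimal sketch under stated assumptions: one hidden layer and a ReLU activation are chosen for illustration, the function names are hypothetical, and the training of the weights is not shown.

```python
def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Activation function: negative values are clipped to zero."""
    return [max(0.0, v) for v in values]


def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input vector -> hidden layer (with ReLU) -> output vector."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return dense(hidden, out_w, out_b)
```

In a fare model the input vector would hold trip features and the output vector would hold a single predicted value; the weights are what training adjusts.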

2.4 Evaluation
Evaluation is used to determine the efficacy of a model. This is done by measuring the
model’s performance with key performance metrics such as Root Mean Square Error.

2.4.1 Root Mean Square Error (RMSE)

Root Mean Square Error (RMSE) is a performance metric used to determine the accuracy
of a model. RMSE “measures the difference between the predicted values and the actual
values in the units of the response variable” (Permetrics, 2023). The closer the value is to
0, the better the model.

\[
\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{\sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2}{N}} \tag{2.1}
\]

Here $(y_i - \hat{y}_i)^2$ is the squared difference for the $i$-th value, with $y_i$ being the
actual value, $\hat{y}_i$ the predicted value and $N$ the number of values.
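
Equation 2.1 translates directly into code. The sketch below is a plain-Python illustration with a hypothetical function name; the fare values in the usage note are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_differences = [(y - y_hat) ** 2
                           for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_differences) / len(actual))
```

For example, if four fares of 10, 20, 30 and 40 dollars are predicted as 13, 17, 33 and 37, every prediction is off by 3 dollars and the RMSE is 3.0, expressed in the same units as the fare.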

2.5 Analysis of Machine Learning Models Applied on the Dataset
In previous research (Antoniades et al., 2016; Manoharan et al., 2021) machine learning
techniques were used on the TLC Trip Record Data. In this section I will discuss their
results and explore how they could be improved.

2.5.1 Comparison of Linear Regression Against Random Forest to Predict Trip Fares and Duration on the Dataset
Antoniades et al. (2016) compared linear regression against random forest in
predicting fares and trip duration on the yellow taxi dataset (NYC Taxi and Limousine
Commission, 2023b). They found that random forest outperformed linear regression,
although this was mainly because the data was nonlinear. They identify that
improvements could be made to better model the effects of traffic along the route. This
could be implemented with the publicly available NYC Department of Transportation
Traffic Speeds NBE (City of New York Department of Transportation, 2023), which
aggregates traffic speeds from sensors across NYC on a 24-hour basis.

2.5.2 Comparison of XGBoost Against MLP to Predict Trip Duration on the Dataset
Manoharan et al. (2021) compared XGBoost against MLP in predicting trip
duration on the yellow taxi dataset (NYC Taxi and Limousine Commission, 2023b).
They found that XGBoost achieved a slightly better RMSE than MLP. They
also computed the feature importance graph and found that haversine distance was the
most important feature, with the largest F-score. The haversine distance is the length of
the most direct path between two coordinates on the surface of a sphere (Earth)
(scikit-learn developers, 2023).
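
The haversine distance described above can be computed with the standard haversine formula. The sketch below is an illustration, not the implementation used in the cited work: the function name is hypothetical, and the Earth is assumed to be a sphere with a mean radius of about 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Because it is a straight-line measure over the sphere, the haversine distance is a lower bound on the distance actually driven, which may explain why it works well as a feature alongside the metered trip distance.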

In their conclusion they propose further study that could improve their results. They
discussed improving the auto-tuning of the MLP to achieve better results. Additionally,
they discussed looking at location-based features that could influence trip durations
and incorporating speed-limit-based features to better model traffic. This could be
implemented using the DOT Traffic Speeds NBE (City of New York Department of
Transportation, 2023) discussed earlier.

Furthermore, they mention using weather data from areas like Central Park, as “New
Yorkers might take a taxi when they are near Central Park or when the weather condition
is severe.” This would be possible with the publicly available data from the National
Centers for Environmental Information weather station in Central Park (National Centers
for Environmental Information).

Lastly, they propose that the K-Means Clustering algorithm could be enhanced by clustering
around popular areas such as ”metro station, number of bars and eateries in a given zone,
etc.” This could be implemented using the publicly available Office of Technology and
Innovation (OTI) Points of Interest dataset (Office of Technology and Innovation (OTI),
2023) which details points of interest such as schools, transport facilities, commercial
buildings, landmarks and more. Additionally, there is the publicly available Department of
City Planning Facilities Database which ”aggregates information about 30,000+ facilities
and program sites that are owned, operated, funded, licensed, or certified by a City, State,
or Federal agency in the City of New York” (New York City Department of City Planning,
2023).
Chapter 3

Requirements and Analysis

This section outlines the requirements of the project, detailing both the functional and
non-functional requirements needed for project completion. It also specifies the steps
necessary to achieve these requirements.

3.1 Project Requirements


To determine the priority of each requirement, the MoSCoW prioritization method (McIntyre,
2016) is used. The MoSCoW prioritization method sorts requirements into four categories:

• Must: Required to complete the project


• Should: Key part of the project, but can be delivered without
• Could: Extra part of the project, can be implemented time permitting
• Won't: Will not be delivered, but can be reconsidered for future work

3.1.1 Functional

# | Requirement | Priority
1 | Develop and deploy a machine learning model capable of predicting taxi and Uber fares | M
2 | Implement a web app where inference can be done on the deployed models | M
3 | Implement on the web app the ability for the user to enter an origin and destination and compare the fares between taxis and Ubers for that trip | M
4 | Implement on the web app the ability for the user to enter the time of day and day of week for a given trip | M
5 | Display on the web app an interactive map showing where the origin and destination are | M
6 | Display on the web app the 'ideal' route to illustrate the user's journey | S
7 | Display on the web app metrics such as trip time and distance | S

Table 3.1: Functional Requirements of the project


3.1.2 Non-functional

# Requirement
1 The web app should have a user friendly interface
2 The web app should have high performance and return queries in a reasonable
time
3 The web app should be responsive to a variety of screen sizes
4 The web app should be compatible on major browsers

Table 3.2: Non-Functional Requirements of the project

3.2 Evaluation of Machine Learning models


Building on the evaluation methods used in prior literature (Antoniades et al., 2016;
Manoharan et al., 2021), Root Mean Square Error (RMSE), as discussed in Section 2.4.1,
will be used as the primary metric to evaluate the machine learning models in this project.

RMSE has been chosen as the primary metric as it enables effective comparisons to other
works and is expressed in the same units as the prediction, which makes it straightforward
to interpret. For instance, an RMSE value of 3 indicates that the typical error between the
predicted fare and the actual fare is about $3, with larger errors weighted more heavily
than smaller ones. This gives users a clear understanding of the prediction accuracy.
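
As an illustration of how RMSE behaves, the following minimal sketch computes it for a handful of made-up fares and compares it with the Mean Absolute Error (MAE). The fare values are invented for the example and are not taken from the dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented fares in dollars; the last prediction is off by $6.
actual_fares = [10.0, 20.0, 30.0, 40.0]
predicted_fares = [11.0, 19.0, 31.0, 34.0]

print(f"RMSE: {rmse(actual_fares, predicted_fares):.2f}")
print(f"MAE:  {mae(actual_fares, predicted_fares):.2f}")
```

Because the residuals are squared before averaging, the single $6 miss pulls the RMSE above the MAE. RMSE is never smaller than MAE on the same data.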

3.3 Legal and Ethical considerations


Legal considerations of the project include the handling of inputted addresses. This is
because the addresses could contain sensitive information such as the user's home or
place of work. This handling falls under the Data Protection Act 2018 (UK Government,
2018). To be in compliance, sensitive information will be sent securely over HTTPS and
no addresses will be stored on any servers or in any databases.
Chapter 4

Design

This section outlines the design approach taken in implementing the project, focusing
on the methodology used and the design of the model and of the front-end and back-end
of the web application.

4.1 Methodology
4.1.1 Machine Learning Life Cycle
The machine learning (ML) life cycle is the iterative process that encompasses the
preparation, development and deployment of an ML solution. This iterative process
is necessary as ML solutions are continually built on and improved (Ashmore et al., 2019).

4.1.2 Cross-Industry Standard Process for Data Mining


The cross-industry standard process for data mining (CRISP-DM) is a popular and proven
development methodology in data science projects (Saltz, 2024). It is also application
neutral and can be applied to ML pipelines and workloads (Amazon Web Services, 2023).
It will be used on this project for guiding the end-to-end ML process as it provides a
logical structure to the project.
CRISP-DM encompasses the following phases:

• Business understanding

• Data understanding

• Data preparation

• Modeling

• Evaluation

• Deployment


Figure 4.1: CRISP-DM Life Cycle


Designed using Diagrams.net - Available at https://diagrams.net

4.2 Dataset and Features


The project will utilise the Yellow Taxi dataset from July 2014 to July 2016 (NYC Taxi
and Limousine Commission, 2023b) for developing the taxi fare model. Similarly the trip
time and distance models will use this dataset as well. This time period was selected
because it was a period when coordinate data was still available and offers a large sample
size for analysis.

In addition, the HVFHS dataset covering July 2023 to July 2024 will be used for the
Uber fare model.

4.2.1 Feature Selection


Using the features detailed in the data dictionary in Table 2.1, feature selection will
include:
• Removing null or invalid records.

• Restricting the coordinates of records to New York and its immediate vicinity.

• Removing records where the trip time is over 5 hours.

• Removing records where the average speed is below 7 km/h.

• Removing records where the fare is less than $3, as this is the base fare (NYC Taxi
and Limousine Commission, 2023a).

4.2.2 Feature Engineering


Feature engineering is necessary to enhance the models with extra information and context,
ensuring the data provided is relevant and optimal for modeling.

Dealing with Older Data

With the data for taxis coming from the July 2014 to July 2016 time period, it is important
to use feature engineering to update the fares to match current rates.

Item 2012-2022 2023-


Base Fare $2.50 $3
Unit Rate 50¢ 70¢
Rush Hour Surcharge $1 $2.50 ($5 for trips to and from JFK)
Overnight Surcharge 50¢ $1
New York State Congestion $0 $2.50
JFK Flatrate (to and from Manhattan) $52 $70
EWR Dropoff Surcharge $17.50 $20
LGA Pickup/Dropoff Surcharge $0 $2.50
JFK/LGA Pickup Airport Fee $0 $1.25

Table 4.1: Taxi Fare Comparison across 2012-2022 and 2023 onwards
Sources (Woodhouse, 2022; NYC Taxi and Limousine Commission, 2023a)

As reported by Woodhouse (2022), the average fare went up 23% at the 2023 price
increase. This will be reflected in the base fare of each record.

Haversine Distance

As discussed in Manoharan et al. (2021), the Haversine distance had the highest feature
importance. Haversine distance is equal to the most direct path between two coordinates
on the surface of a sphere (scikit-learn developers, 2023). As the Earth is nearly spherical,
it can be applied with an error of less than 1% on average. The Haversine formula is defined as:
\[
D(x, y) = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{x_{Lat} - y_{Lat}}{2} \right) + \cos(y_{Lat}) \cos(x_{Lat}) \sin^2\left( \frac{x_{Lon} - y_{Lon}}{2} \right) } \right)
\]

Where x and y are points given as latitude and longitude coordinates in radians, and r is
the radius of the sphere.
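
The formula can be sketched in plain Python as follows. This is a minimal illustration rather than the project's PySpark implementation: the function name, the mean Earth radius of 6371 km used for r, and the example coordinates (approximate positions of the Empire State Building and Central Park) are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius used for r


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    half_dlat = (phi2 - phi1) / 2
    half_dlon = math.radians(lon2 - lon1) / 2
    a = (math.sin(half_dlat) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(half_dlon) ** 2)
    # Clamp to 1.0 to guard against floating point overshoot before arcsin.
    return 2 * radius * math.asin(min(1.0, math.sqrt(a)))


# Approximate coordinates, used purely for illustration.
empire_state = (40.7484, -73.9857)
central_park = (40.7829, -73.9654)
print(f"{haversine_km(*empire_state, *central_park):.2f} km")
```

The inputs are converted from degrees to radians first, since the trigonometric terms in the formula expect radians.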

4.2.3 Handling the Data


To handle the dataset Python (Python Software Foundation, 2024) will be used. Python
is a highly popular language in data science due to its large ecosystem of third-party
packages (VanderPlas, 2016).

Apache Spark

Apache Spark is the 'de facto framework for big data analytics' (Salloum et al., 2016). It
has features such as an advanced in-memory programming model and upper-level libraries for
scalable machine learning, graph analysis, streaming and structured data processing.

In this project Apache Spark will be used for the feature selection and feature engineering
process. Specifically, the Python language-integrated API, PySpark (Apache Spark,
2024), will be used.

4.3 Modeling
As outlined in the literature survey, a gradient boosted regression model is the most
suitable approach for handling the TLC Trip Record dataset. Gradient boosted models
are well suited to this problem because:

• They can handle the spatial data in the New York City taxi dataset with
high levels of accuracy, as highlighted by Manoharan et al. (2021).

• They are a proven approach, consistently among the top contenders in competitions
on the data science website Kaggle (Bentéjac et al., 2021).

• They can report feature importance, which is highly useful when performing
feature selection and engineering.

• Their implementations are optimised for distributed workloads, which is useful given
that this project is being developed on HPC clusters.

4.3.1 Model Frameworks


XGBoost, as discussed in Section 2.3.4 and used in prior research (Manoharan et al., 2021),
will be used as the primary approach. Additionally, an alternative framework called
LightGBM is also being considered, which has the potential to yield better results.

LightGBM

LightGBM is an open source gradient boosting framework developed by Microsoft (Microsoft
Corporation, 2023). It offers optimisations in speed, memory and accuracy when compared
to other gradient boosting frameworks. It can speed up the training process by 'over 20
times whilst achieving almost the same accuracy' (Ke et al., 2017).

4.3.2 Model Structure


Due to the restrictive nature of the HVFHS zone based data as discussed in Section 2.2.4,
four models will be developed.

In the HVFHS dataset the pick up and drop off locations use zones with numeric identifiers
based on New York City's neighbourhoods. Therefore, a reverse engineering approach is
needed when determining fares from latitude and longitude points within these zones.
Aside from the pick up and drop off zones, the only features in the HVFHS dataset that
can be used to distinguish trips are the time and distance values.
Obtaining an accurate estimate of these values for new data is therefore necessary for
optimal fare prediction. The chosen approach is to develop two
models trained on the coordinate based data from the Yellow Taxi dataset prior to the
switch to zone based identifiers. With these models, estimates for trip time and distance
can be produced. By matching the latitude and longitude of the pick up
and drop off locations to their respective zones, along with the estimated trip time and
distance, the Uber fare model can calculate an estimate of the fare.

Figure 4.2: Structure of the Machine Learning models

4.3.3 Model Development Tools


Python (Python Software Foundation, 2024) will be used to develop the machine learning
model. As discussed in Section 4.2.3, Python is highly popular for data science due to its
large ecosystem of third-party packages.

MLFlow

MLFlow is an open source platform for managing the end-to-end machine learning life
cycle. It allows for tracking of machine learning experiments, recording the “code
used, parameters, input data, metrics and output.”
Additionally, it allows models to be packaged and deployed, with inference served through
a REST API endpoint that can be integrated into web front-ends (Zaharia et al., 2018).

AutoML

AutoML is the process of automating the iterative tasks of machine learning development
(He et al., 2021).
As this project is focused on the whole machine learning cycle, it is important to be
able to develop models without needing to spend large amounts of time on tasks such as
comparing models and tuning hyperparameters.

4.4 Web Application and Deployment


4.4.1 Libraries and Frameworks
Ruby on Rails

Ruby on Rails is a popular web development framework for the Ruby language (Hartl,
2015). Rails is used on sites such as Airbnb, GitHub and Twitter/X.
Rails is well suited to this project as it is completely open source, is quick to adopt
new web technologies and has a large community of contributors and gems
(plugins).

OpenStreetMap

OpenStreetMap (OSM) is a community project to build a free geographic database of
the world. It includes features such as roads, administrative boundaries and details of
land use (Bennett, 2010). The data on OpenStreetMap is 'free to use by anyone, for any
purpose' and is released under a highly permissive license allowing users to copy, change
and redistribute its data.

Geocoding

Geocoding is the 'process of transforming a description of a location such as a pair of
coordinates, an address, or a name of a place to a location on the earth's surface' (Esri,
2024).

Using the open source Photon geocoder (Komoot, 2024) the user can perform search-as-you-type
on OSM data. This allows for real time suggestions of points of interest (POIs) as the
user types, and importantly, the translation of the desired location into a latitude and
longitude point.

Routing

Using OpenRouteService (ORS) (ORS, 2024) vehicle routing can be performed using
OpenStreetMap data. The OpenRouteService directions API generates vehicle routes in
a GeoJSON format (IETF GeoJSON Working Group, 2016).

GeoJSON is a format based on JavaScript Object Notation (JSON) that allows for the
storing and transmitting of geographic data, which makes it easy to integrate into a web front-end.
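
To make the format concrete, the sketch below builds a minimal GeoJSON Feature holding a route as a LineString and round-trips it through JSON. The two coordinates and the property name are invented for illustration; a real directions response would contain many more points. Note that GeoJSON positions are ordered longitude first, then latitude.

```python
import json

# A minimal GeoJSON Feature; coordinates are [longitude, latitude] pairs.
route_feature = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [
            [-73.9857, 40.7484],  # illustrative start point
            [-73.9654, 40.7829],  # illustrative end point
        ],
    },
    "properties": {"name": "illustrative route"},
}

# Serialise for transmission and parse it back, as a front-end would.
encoded = json.dumps(route_feature)
decoded = json.loads(encoded)

first_lon, first_lat = decoded["geometry"]["coordinates"][0]
print(decoded["geometry"]["type"], first_lat, first_lon)
```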

Leaflet.js

Leaflet.js is a front-end JavaScript library that is used to create interactive, web-based
maps (Crickard III, 2014). With Leaflet.js tile layers, the map images can be loaded
from a variety of sources such as OpenStreetMap.

Additionally, GeoJSONs can be loaded as layers on top of the map, allowing the displaying
of vehicle routes from services such as OpenRouteService.

4.4.2 Application Flow


Bringing everything discussed together, the application flow shows the intended behaviour
when the user interacts with the web application.

Figure 4.3: The Application flow


Designed using Diagrams.net - Available at https://diagrams.net

The web application will be a single page. From here the user can input the origin and
destination of their trip and the time and day of the week. The web application will then
perform an API request to the server with this information.

Using the deployed models the server will return the fare estimates of the taxi and
rideshare options. In addition the server will return the ’ideal’ vehicle route to be displayed
on the map.

4.4.3 Wireframe
Building on the application flow detailed in Figure 4.3, this wireframe shows the desired
appearance of the web application.

With this design the application achieves its desired functional and non-functional requirements,
delivering a clear and user friendly experience.

Figure 4.4: Wireframe of the web application.


Designed using moqups - Available at https://moqups.com
Chapter 5

Implementation and testing

In this section the project's development, as well as the associated challenges faced, will be
detailed.

Specifically, it will detail the data preparation, development and deployment of the machine
learning models, in addition to the development of the web application.

5.1 Difficulties Faced


A number of difficulties were encountered that required changes to the original design.

5.1.1 Difficulties with The University of Sheffield High Performance Computing (HPC) cluster
The original design of this project was to develop it on The University of Sheffield High
Performance Computing (HPC) clusters (IT Services' Research and Innovation team,
2023). However, this approach was unfeasible for several reasons.
Firstly, the HPC is a shared compute resource, which means that access privileges
are limited and commands like 'sudo' are not allowed. This restriction is significant as it
limits the ability to easily install software using package managers like apt-get. Instead,
software had to either be available via 'modules', or be downloaded as source code and
compiled in a time consuming process. This process was amplified when dealing with
software that has multiple dependencies.

This posed a challenge in development as libraries and frameworks that were needed
for the project such as MLFlow and AutoML were not available or had limited support.
For instance the ability to serve models with a REST API endpoint in MLFlow was
affected, which is an important requirement of this project.

Additionally, the HPC that was used, Stanage, still has very limited graphical session
support, meaning that all work had to be done via the command line, which can be
convoluted at times.


5.1.2 Databricks
To overcome the issues faced with the University's HPC, Databricks was used instead.

Databricks is a cloud-based analytics platform based on Apache Spark (Etaati, 2019)
and is used for big data analytics and machine learning. It is highly popular among data
scientists and data engineers. Databricks is a better approach for this project for multiple
reasons.

Firstly, it can ingest and store a significant amount of data utilising Delta Lakes (Armbrust
et al., 2020). Delta Lakes are cloud object stores that can store large tabular datasets
whilst maintaining Atomicity, Consistency, Isolation, and Durability (ACID) ensuring
data integrity. In addition, Delta Lakes can perform fast metadata operations such as data
querying, which is highly useful for the large datasets used in this project. Moreover, Delta
Lakes have integration with Apache Spark, allowing feature selection and engineering
operations to be performed seamlessly.

Additionally, Databricks supports Python notebooks with Apache Spark. This is highly
useful for tasks like feature selection and feature engineering as it provides an interactive
environment for data preparation.

Furthermore, Databricks has native integration with MLFlow (Zaharia et al., 2018),
allowing management and tracking of the end-to-end machine learning life cycle and,
importantly, the serving of models for inference with a REST API endpoint.

In addition, Databricks has built-in AutoML functionality (Databricks, 2024b) for model
selection and hyperparameter tuning. This feature significantly speeds up the machine
learning model development process, allowing faster development cycles.

Lastly, Databricks can be run on Microsoft Azure with Azure Databricks (Etaati, 2019).
Microsoft Azure offers a free $100 credit for students (Microsoft Corporation, 2024), which
is adequate to fund this project.
For the project, the cluster with the largest amount of memory was chosen, as this
allows for faster querying of large datasets. A constraint of Databricks
on a student account is that each cluster is restricted to a maximum of 4 cores and 32 GB
of memory. However, this was sufficient for the project.

Figure 5.1: The HPC cluster used for the project



5.2 Data Preparation


5.2.1 Feature Selection
To perform feature selection, PySpark is used to filter and drop records that are not
needed or could have a negative impact on the model results.

Figure 5.2: Code Snippet: PySpark filter() and drop() functions

Feature Selection for Yellow Taxi data:


• Dropping of any trips that have a null value.

• Dropping features that are not needed: vendorID, passengerCount, puLocationId,
doLocationId, storeAndFwdFlag, puYear, puMonth.

• Deducting tips from the total fare as this can cause outliers.

• Filtering where the fare amount is greater than $3.00 and is less than $200.

• Dropping trips where the paymentType is 3, 4, 5 or 6 (No charge, Dispute, Unknown
or Voided trip).

• Dropping trips where rateCodeId is 5 or 6 (Negotiated fares and group rides).

• Filtering trips where the trip time is greater than 60 seconds and is less than 4
hours.

• Filtering trips where the distance is greater than 300 meters.


Feature Selection for HVFHS data:
• Dropping of any trips that have a null value.

• Filtering trips where hvfhs_license_num is equal to 'HV0003' (Uber).

• Dropping features that are not needed: hvfhs_license_num, dispatching_base_num,
originating_base_num, request_datetime, on_scene_datetime, driver_pay, shared_request_flag,
shared_match_flag, access_a_ride_flag, wav_request_flag, wav_match_flag.

• Deducting tips from the total fare as this can cause outliers.

• Filtering where the fare amount is greater than $3.00 and is less than $200.

• Filtering trips where the trip time is greater than 60 seconds and is less than 4
hours.

• Filtering trips where the distance is greater than 300 meters.
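
The shared thresholds in the lists above can be sketched as a single predicate. This is a plain Python stand-in for the PySpark filter() calls used in the project, and the dictionary keys (fare_amount, trip_seconds, trip_metres) are hypothetical names chosen for the example rather than the dataset's column names.

```python
FOUR_HOURS_SECONDS = 4 * 60 * 60


def keep_trip(trip):
    """Return True if a trip record passes the fare, time and distance filters."""
    fare = trip.get("fare_amount")      # dollars, tips already deducted
    seconds = trip.get("trip_seconds")  # trip time in seconds
    metres = trip.get("trip_metres")    # trip distance in metres
    # Drop any record with a missing value.
    if fare is None or seconds is None or metres is None:
        return False
    return (3.00 < fare < 200
            and 60 < seconds < FOUR_HOURS_SECONDS
            and metres > 300)


trips = [
    {"fare_amount": 12.5, "trip_seconds": 600, "trip_metres": 2500},
    {"fare_amount": 2.5, "trip_seconds": 600, "trip_metres": 2500},   # fare too low
    {"fare_amount": 12.5, "trip_seconds": 30, "trip_metres": 2500},   # too short in time
    {"fare_amount": 12.5, "trip_seconds": 600, "trip_metres": None},  # null value
]
kept = [t for t in trips if keep_trip(t)]
print(len(kept))
```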



5.2.2 Feature Engineering


To perform feature engineering, PySpark is used again to perform functions on dataframes.

Firstly, the pick-up datetime has been transformed into three individual features: dayOfWeek
(0-6, representing Mon-Sun), which indicates the day of the week; hourOfDay (0-23),
representing the hour of the day; and minuteOfHour (0-59), representing the minutes of
the hour.

Figure 5.3: Code Snippet: Transformation of pick-up datetime to dayOfWeek, hourOfDay
and minuteOfHour

This transformation is done to make all the data uniform and enable the model to
distinguish between various days of the week, as well as the different hours and minutes
within a day.
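
A minimal sketch of this transformation, using Python's standard datetime module instead of the project's PySpark code, is shown below. The function name and the example timestamp are assumptions for illustration. Python's weekday() already uses the 0-6, Monday to Sunday encoding described above, whereas PySpark's dayofweek() numbers days from 1 (Sunday) to 7 (Saturday), so a PySpark version would need a conversion step.

```python
from datetime import datetime


def datetime_features(pickup):
    """Split a pick-up datetime into the three model features."""
    return {
        "dayOfWeek": pickup.weekday(),  # 0 = Monday ... 6 = Sunday
        "hourOfDay": pickup.hour,       # 0-23
        "minuteOfHour": pickup.minute,  # 0-59
    }


# Illustrative pick-up time: Friday 1 July 2016, 17:45.
features = datetime_features(datetime(2016, 7, 1, 17, 45))
print(features)
```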

Using the Haversine formula as discussed in Section 4.2.2, the direct distance between
two points can be calculated. PySpark expressions are used in conjunction with the
formula to transform the data in a distributed and optimized manner.

Figure 5.4: Code Snippet: Haversine Formula on the pick-up and drop-off coordinates

Following on from Section 4.2.2, Yellow Taxi fares have been updated to reflect current
pricing standards.

The new pricing structure includes a $2.50 rush hour surcharge from 4 pm to 8 pm
on weekdays. This calculation can be performed using the transformed datetime features,
as shown in Figure 5.5.

Figure 5.5: Code Snippet: Calculating rush hour surcharges
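
The rule described above can be sketched in plain Python as follows. It assumes the dayOfWeek (0-6, Mon-Sun) and hourOfDay (0-23) encodings defined earlier and treats the window as 16:00 up to, but not including, 20:00. The function name is hypothetical and the project's own version is written with PySpark expressions.

```python
RUSH_HOUR_SURCHARGE = 2.50  # dollars, weekday rush hour rate from Table 4.1


def rush_hour_surcharge(day_of_week, hour_of_day):
    """Return the surcharge for a trip starting at the given day and hour."""
    is_weekday = 0 <= day_of_week <= 4   # Monday to Friday
    in_window = 16 <= hour_of_day < 20   # 4 pm to 8 pm
    return RUSH_HOUR_SURCHARGE if is_weekday and in_window else 0.0


print(rush_hour_surcharge(4, 17))  # Friday, 5 pm
print(rush_hour_surcharge(5, 17))  # Saturday, 5 pm
```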

There are four locations where a trip that starts and/or ends there is subject to extra
tariffs. These locations can be viewed in Figure 2.1:

• John F. Kennedy International Airport ($70 flat fare on trips to and from Manhattan)
plus an additional $1.25 for any pick-up at the airport.

• LaGuardia Airport ($5 Surcharge for any pick-up and drop-offs) plus an additional
$1.25 for any pick-up at the airport.

• Newark International Airport ($20 Surcharge for drop-offs).

• New York State Congestion ($2.50 for any pick-up or drop-off that is south of 96th
Street in Manhattan).

For determining when trips have started and/or ended in locations with extra tariffs,
the Shapely Python library (https://shapely.readthedocs.io/en/stable/) has been
used. Shapely has functionality for determining if a point is in a polygon.

The polygons of the tariff affected areas are generated using the OpenStreetMap Overpass
Turbo website (https://overpass-turbo.eu). These are then stored in GeoJSON format
(explained in Section 4.4.1).

Figure 5.6: GeoJSON boundaries of LaGuardia Airport in New York City


Map data © OpenStreetMap contributors
Source: https://www.openstreetmap.org/
Licensed under the Open Data Commons Open Database License (ODbL)
https://www.openstreetmap.org/copyright
Generated using Overpass Turbo - Available at https://overpass-turbo.eu

Using the generated GeoJSON polygons, the function as seen in Figure 5.7 is used to
determine if a latitude and longitude point is within the boundaries of the polygon.

Figure 5.7: Code Snippet: Determine if a latitude and longitude point is inside a GeoJSON
polygon.
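
The idea behind the point-in-polygon check can be illustrated without Shapely. The sketch below is not the project's function: it swaps in a standard ray casting test written in plain Python, and the rectangle used as the polygon is an invented placeholder rather than a real airport boundary. GeoJSON stores each position as [longitude, latitude], with the first ring as the outer boundary and any further rings as holes.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: count how often a ray going east from the point crosses the ring."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_geojson_polygon(lat, lon, geometry):
    """Check a latitude and longitude point against a GeoJSON Polygon geometry."""
    outer, *holes = geometry["coordinates"]
    if not point_in_ring(lon, lat, outer):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in holes)


# Placeholder rectangle, NOT a real tariff boundary.
zone = {
    "type": "Polygon",
    "coordinates": [[
        [-73.90, 40.76], [-73.85, 40.76], [-73.85, 40.79],
        [-73.90, 40.79], [-73.90, 40.76],
    ]],
}
print(point_in_geojson_polygon(40.775, -73.875, zone))
print(point_in_geojson_polygon(40.700, -73.875, zone))
```

A library such as Shapely remains the more robust choice for real boundaries, since it also handles multi-polygons and points lying exactly on an edge.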

5.3 Modeling
5.3.1 Databricks AutoML
For modeling Databricks AutoML (Databricks, 2024a) has been used. Databricks AutoML
allows the quick generation of machine learning models along with an accompanying
notebook for the model. The Databricks AutoML workflow involves iterative cycles
between this notebook and the model, with the final result being the production model.

Figure 5.8: Databricks AutoML configuration



Databricks AutoML generates MLFlow experiments, with each run representing a different
model with different hyperparameters and dataset split. This allows for extensive exploration
and comparison of model configurations.

Figure 5.9: Databricks AutoML runs

The top performing model can then be configured for inference and deployed to a REST
API endpoint. This enables the front-end to query the deployed model.

Figure 5.10: Databricks MLFlow model endpoint deployment
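
As a rough sketch of what the front-end sends to such an endpoint, the snippet below builds a JSON scoring request body. It assumes the dataframe_records input format accepted by MLFlow model serving, and the feature names and values are hypothetical examples rather than the exact schema of the deployed models. No request is actually sent here; in practice the body would be posted to the endpoint's invocation URL together with an authorisation token.

```python
import json


def build_scoring_payload(records):
    """Wrap feature rows in the record-oriented scoring format (assumed)."""
    return json.dumps({"dataframe_records": records})


# Hypothetical feature row for a single trip.
trip_features = {
    "dayOfWeek": 0,
    "hourOfDay": 9,
    "minuteOfHour": 0,
    "haversineDistance": 4.2,
}

payload = build_scoring_payload([trip_features])
print(payload)
```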

Reduced Dataset size

Through testing, it was established that the maximum number of records that Databricks
AutoML can handle without crashing is about ten million. This is due to the
constraints of the student account cluster. Therefore, the PySpark sample() function has
been used to sample a subset of the dataset.

Figure 5.11: Code Snippet: Sampling the dataset
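
The sampling step can be illustrated with a plain Python stand-in. The sketch below derives a sampling fraction from a target record count and then keeps each record independently with that probability, which mirrors how fraction based sampling without replacement behaves: the result is close to, but not exactly, the target size. The function names, the total of 40 million records and the seed are assumptions for the example.

```python
import random


def sample_fraction(total_records, target_records):
    """Fraction of the dataset needed to reach roughly target_records."""
    return min(1.0, target_records / total_records)


def bernoulli_sample(records, fraction, seed):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [r for r in records if rng.random() < fraction]


# Hypothetical dataset size; the target of ten million comes from the text above.
fraction = sample_fraction(total_records=40_000_000, target_records=10_000_000)
subset = bernoulli_sample(range(10_000), fraction, seed=42)
print(fraction, len(subset))
```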



5.4 Web Application


5.4.1 Front-end
Main Page

Following from the design plan, the front-end was developed using HTML, CSS and
JavaScript. Leaflet.js (discussed in Section 4.4.1) provides the map tiles and interactivity.
Additionally, Bootstrap (https://getbootstrap.com) was used to improve the styling of
the page elements. In addition, Mapbox (https://www.mapbox.com) was used to improve
the look of the map tiles over the default OpenStreetMap tiles.

The website is a single page application, so all interactions are performed through
this page. From this page the user can enter their origin and destination as well as the time
and day of week of the trip. The website will then show the estimated fares for both
Yellow Taxi and Uber.

Figure 5.12: The web application showing a trip from the Empire State Building to
Central Park.
Map data © OpenStreetMap contributors
Source: https://www.openstreetmap.org/
Licensed under the Open Data Commons Open Database License (ODbL)
Additional Map data and styling © Mapbox https://www.mapbox.com/about/maps/

Search and Geocoding

Geocoding is necessary to transform addresses and points of interest into latitude and
longitude points. To do this the Photon API has been used (discussed in Section 4.4.1).
The Photon API allows for search-as-you-type results from OpenStreetMap. An example
of a result from a Photon API query can be seen in Figure 5.13.

The Photon API has been used because it provides a free public API (with fair use).
Additionally the Photon API has the option to self-host which could be used in future
implementations (Komoot, 2024).

Figure 5.13: One of the results returned from the Photon API for the search ”Grand
Central”.

The JavaScript library Algolia autocomplete (https://github.com/algolia/autocomplete)
is used to fetch, display and select the results from the Photon API query.

Additionally, Maki icons (https://labs.mapbox.com/maki-icons/) have been used to
give context to search results and improve the user experience. These icons are CC0
public domain.

Figure 5.14: Search for ”rockefeller” and its autocomplete results.



Time and Day of Week selection

The user can choose the time and day of the week, which was preferred over using a
calendar approach where the user selects a specific month and day. This choice allows for
a larger sample size to be used for each day of the week and time.

Figure 5.15: Time and day of week selection.

Routing

Routing is provided using OpenRouteService (discussed in Section 4.4.1). OpenRouteService
provides 2000 free API requests per day, with the option to upgrade for future solutions.

OpenRouteService takes the latitude and longitude of the origin and destination and
returns a GeoJSON containing the optimal route (not including traffic). This feature
is meant to be an illustration to the user and is not meant to show the actual route to be
taken.

Figure 5.16: Route from Chrysler Building to Grand Central displayed on web application
map.
Map data © OpenStreetMap contributors
Source: https://www.openstreetmap.org/
Licensed under the Open Data Commons Open Database License (ODbL)
Additional Map data and styling © Mapbox https://www.mapbox.com/about/maps/

Locating Taxi Zones

To find which taxi zones the origin and destination of the trip are in, Turf.js
(https://turfjs.org) is used.
A GeoJSON of the taxi zones is loaded and each feature is searched with Turf.js to find
which zone the origin and destination of the trip are in.
The implementation using Turf.js can be seen in Figure 5.17.

Figure 5.17: Code Snippet: Locating Taxi Zones using Turf.js

Mobile Appearance

Considerations have been made for the mobile friendliness of the web application. The
content on the site will scale to fit the current viewport. Additionally, on smaller
viewports, the search will expand to fill the entire page, making the results clearer to the user.
Together these features improve the user friendliness of the web application on mobile
devices.

(a) Mobile Main View (b) Mobile Search

Figure 5.18: Website adjusted for mobile viewports



5.5 Testing
Manual Testing

Manual testing has been performed to ensure that the web application and its inputs
behave as intended.

# | Description | Expected Result | Pass/Fail
1 | Enter two valid addresses in New York City | Fares and route shown | Pass
2 | Enter only one valid address | Submit button should be greyed out | Pass
3 | Enter an address that cannot be navigated to | Alert saying that the trip is not possible | Pass
4 | Enter two identical addresses | Submit button should be greyed out | Pass
5 | Enter an address outside New York City | Results for addresses outside of New York City should not show up in the search | Pass
6 | Pressing submit when there are no inputs | Submit button should be greyed out | Pass

Table 5.1: Manual Tests for Web Application

Functional Requirements

Functional testing has also been used, testing against the requirements defined in
Section 3.1.1.

# | Requirement | Priority | Pass/Fail
1 | Develop and deploy a machine learning model capable of predicting taxi and Uber fares | M | Pass
2 | Implement a web app where inference can be done on the deployed models | M | Pass
3 | Implement on the web app the ability for the user to enter an origin and destination and compare the fares between taxis and Ubers for that trip | M | Pass
4 | Implement on the web app the ability for the user to enter the time of day and day of week for a given trip | M | Pass
5 | Display on the web app an interactive map showing where the origin and destination are | M | Pass
6 | Display on the web app the 'ideal' route to illustrate the user's journey | S | Pass
7 | Display on the web app metrics such as trip time and distance | S | Fail

Table 5.2: Functional Requirements of the project


Chapter 6

Results and discussion

6.1 Model Results


Results for the best performing model for each prediction are shown below.

Prediction & Model | Validation Example Count | Validation RMSE | Validation MAE | Validation R2
Yellow Taxi fare amount (LightGBM) | 932,818 | $3.25 | $1.86 | 0.95
Trip Time (LightGBM) | 932,818 | 300s | 184s | 0.78
Uber fare amount (XGBoost) | 1,439,536 | $9.27 | $4.79 | 0.87
Trip Distance (LightGBM) | 932,818 | 10231m | 29m | -1.17

Table 6.1: Results of the models for the project

6.1.1 Yellow Taxi Fare amount results


The Yellow Taxi fare prediction model using LightGBM achieves a Root Mean Square Error
(RMSE) of $3.25, a Mean Absolute Error (MAE) of $1.86 and an R2 of 0.95 on the validation
set. This level of error makes it suitable for predicting taxi fares in a real world scenario.

6.1.2 Trip Time results


The Trip Time prediction model using LightGBM achieves an RMSE of 300 seconds
(5 minutes) and an R2 of 0.78, making it reasonably reliable for
use in real world scenarios.

6.1.3 Uber fare amount results


The Uber fare prediction model using XGBoost demonstrates acceptable accuracy with
an RMSE of $9.27. However, this error is almost three times that of the Yellow Taxi model,
which limits how precisely the two fares can be compared. Nonetheless, it does provide a
suitable benchmark and could be improved on in future works.


6.1.4 Trip distance results


The Trip Distance model failed to produce accurate estimates, potentially due to discrepancies
in the trip distance values in the dataset. To address this in future works, tools such as the
Open Source Routing Machine (OSRM) (The OSRM Contributors, 2024) could be used to
calculate the trip distance for each record instead.

For the trip distance predictions needed for the Uber fare amount, the Haversine distance
was used instead.
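
As an illustration of the Haversine distance mentioned above, the following is a minimal sketch of the standard great-circle formula. The function name, the mean Earth radius of 6,371,000 m and the approximate landmark coordinates are assumptions made for this example; the project's own implementation may differ (for instance, it may rely on a library routine).

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Approximate coordinates, assumed for illustration only.
empire_state = (40.7484, -73.9857)
central_park = (40.7829, -73.9654)
print(f"{haversine_m(*empire_state, *central_park):.0f} m straight-line distance")
```

The Haversine distance is a straight-line measure over the Earth's surface, so it underestimates the distance actually driven along the street network; a routing engine such as OSRM would give the road distance instead.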

6.1.5 Comparison to External Predictions


As a benchmark, the Yellow Taxi fare can be compared against Curb (Curb Mobility,
2024), a popular app for booking taxis in New York City that quotes fares in-app.
For a trip from the Empire State Building to Central Park at 9am, Curb quotes a fare of
$28 whereas the model predicts $23.52. This difference of $4.48 is somewhat larger than
the validation RMSE of $3.25, but still demonstrates the suitability of the model.

The Uber fare can be compared against the Uber app (Uber Inc., 2023b). For the same
journey from the Empire State Building to Central Park at 9am, Uber quotes a fare of
$28.92 and the model estimates $28.60, a difference of $0.32. Although this is a single
trip, it suggests that the model could be suitable for predictions, despite the high
RMSE.

6.2 Requirements Analysis


Looking at the requirements defined in Chapter 3, the functional and non-functional
requirements can be evaluated against the final implementation.

6.2.1 Non-Functional requirements


All of the non-functional requirements have been achieved, with the implementation
delivering the desired performance.

# Requirement Pass/Fail
1 The web app should have a user-friendly interface Pass
2 The web app should have high performance and return queries in a Pass
reasonable time
3 The web app should be responsive to a variety of screen sizes Pass
4 The web app should be compatible on major browsers Pass

Table 6.2: Success of Non-Functional Requirements of the project



6.2.2 Functional requirements


All of the functional requirements were achieved except number 7.

# Requirement Priority Pass/Fail
7 Display on the web app metrics such as trip time and distance S Fail

Requirement number 7 was not achieved due to time constraints and because the Trip
Distance model did not achieve the desired accuracy. It could have been accomplished
with more time and research given to the implementation and, as stated previously, OSRM
(The OSRM Contributors, 2024) could be used to supply the trip distance.

6.3 Challenges and Further Work


As discussed, this project meets all of its requirements except functional requirement 7
and is a successful implementation. However, further work could be done on this project
to improve its accuracy and functionality.

6.3.1 H3 Spatial Indexes


H3 is a spatial indexing library developed by Uber (Uber Inc., 2024). H3 divides the
world into an indexed hexagonal grid at varying resolutions, with resolution 0 having
110 hexagons and resolution 15 having 569,707,381,193,150 hexagons (each resolution also
contains 12 pentagons).

H3 could improve this work by indexing the latitude and longitude points in built up
areas, for instance a city block. This potentially could allow for better accuracy by
increasing the amount of geospatial detail in a given area.
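
The hexagon counts quoted above can be reproduced with a short calculation. The sketch below assumes the relationship given in the H3 documentation, in which resolution r contains 2 + 120 × 7^r cells in total and exactly 12 of them are pentagons; the function name is hypothetical and the code does not use the H3 library itself.

```python
PENTAGONS_PER_RESOLUTION = 12  # H3 has exactly 12 pentagonal cells at every resolution


def h3_cell_counts(resolution):
    """Return (total_cells, hexagons) for an H3 resolution between 0 and 15."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions range from 0 to 15")
    # Each finer resolution subdivides cells with an aperture of 7.
    total_cells = 2 + 120 * 7 ** resolution
    return total_cells, total_cells - PENTAGONS_PER_RESOLUTION


for res in (0, 15):
    total, hexagons = h3_cell_counts(res)
    print(f"resolution {res}: {total:,} cells, {hexagons:,} hexagons")
# resolution 0: 122 cells, 110 hexagons
# resolution 15: 569,707,381,193,162 cells, 569,707,381,193,150 hexagons
```

Each step up in resolution multiplies the number of cells by roughly seven, so a choice of resolution is a trade-off between spatial detail and the number of trips available per cell for training.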

6.3.2 Progressive Web App


The front-end could be improved by adding Progressive Web App (PWA) functionality.
PWAs bring app-like functionality to a website, allowing features such as installation
and caching.

By creating a PWA, the mobile experience of the implementation could be greatly improved
with better ease of use and user experience.

6.3.3 Live traffic data


Live traffic data could be implemented with the previously discussed DOT Traffic Speeds
NBE (City of New York Department of Transportation, 2023). Alternatively, for a simpler
but more costly solution, the Google Distance Matrix API could be used.

Upon receiving an HTTPS request with the origin and destination of the trip, the Google
Distance Matrix API can return the trip distance in kilometers or miles as well as the
estimated travel time in traffic. This could potentially increase the accuracy of the models.
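
The request and response described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the endpoint, query parameters and response fields follow Google's public Distance Matrix documentation as understood at the time of writing, while the function names, the placeholder API key and the sample response values are hypothetical. No network call is made.

```python
from urllib.parse import urlencode

BASE_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"


def build_request_url(origin, destination, api_key):
    """Build the HTTPS request URL for one origin and one destination (lat, lon)."""
    params = {
        "origins": f"{origin[0]},{origin[1]}",
        "destinations": f"{destination[0]},{destination[1]}",
        "departure_time": "now",  # requests a traffic-aware duration
        "units": "metric",
        "key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def parse_first_element(payload):
    """Return (distance_in_metres, duration_in_seconds) from a decoded JSON response."""
    element = payload["rows"][0]["elements"][0]
    if payload.get("status") != "OK" or element.get("status") != "OK":
        raise ValueError("Distance Matrix request did not succeed")
    # Prefer the traffic-aware duration when the response includes it.
    duration = element.get("duration_in_traffic", element["duration"])
    return element["distance"]["value"], duration["value"]


# Hypothetical response with made-up values, shaped like the documented JSON.
SAMPLE_RESPONSE = {
    "status": "OK",
    "rows": [{"elements": [{
        "status": "OK",
        "distance": {"text": "4.8 km", "value": 4800},
        "duration": {"text": "15 mins", "value": 900},
        "duration_in_traffic": {"text": "21 mins", "value": 1260},
    }]}],
}

print(build_request_url((40.7484, -73.9857), (40.7829, -73.9654), "YOUR_API_KEY"))
print(parse_first_element(SAMPLE_RESPONSE))
```

The returned distance and traffic-aware duration could then be passed to the fare models as features in place of the Haversine distance and the predicted trip time.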
Chapter 7

Conclusions

The aim of this project was to enable users to compare the fares of New York City
taxis and Ubers, whilst also demonstrating a complete implementation of the end-to-end
machine learning life cycle.

As the literature survey in Chapter 2 showed, the comparison of point-to-point Uber
and New York City taxi fares had not been achieved in prior literature. It also showed
that the zone-based identifiers for locations can make developing machine learning
solutions quite complicated.

The requirements and analysis in Chapter 3 showed the importance of choosing appropriate
requirements, with the outlined functional and non-functional requirements guiding the
approach of the implementation.

The design methodology in Chapter 4 showed the importance of choosing a software
methodology to guide the preparation, development and deployment of an end-to-end
machine learning solution. Furthermore, it addressed each stage of the cycle (business
understanding, data understanding, data preparation, modeling, evaluation and deployment)
and the appropriate steps and goals for that stage.

The implementation in Chapter 5 showed how the design methodology was put into practice,
the challenges and roadblocks that followed, and the solutions used to overcome them.

The implementation for estimating Uber fares provides a suitable benchmark for future
work, with the initial results shown here being promising. Additionally, the web
application demonstrated the importance of visualisation, providing a user-friendly and
easy to use way to query the models.

To conclude, the objectives of the project have been achieved:

• Developing and deploying machine learning models capable of predicting New York
City taxi and rideshare fares.

• Creating a user-friendly web application that enables users to perform real-time
inference using the deployed models, presenting fare predictions in a clear and
intuitive manner.


Bibliography

Freedom of information law. https://opengovernment.ny.gov/freedom-information-law,
2023. NYC Open Government, Committee on Open Government.

Openrouteservice. https://openrouteservice.org, 2024. Accessed: May 4, 2024.

R. Alake. Loss functions in machine learning explained.
https://www.datacamp.com/tutorial/loss-function-in-machine-learning, November 2023.

Amazon Web Services. Machine learning best practices. Technical report, Amazon Web
Services, 2023.

C. Antoniades, D. Fadavi, and A. F. Amon. Fare and duration prediction: A study of
New York City taxi rides. 2016. URL https://api.semanticscholar.org/CorpusID:43844792.

Apache Spark. PySpark Overview. https://spark.apache.org/docs/latest/api/python/index.html,
2024. Accessed: May 4, 2024.

M. Armbrust, T. Das, L. Sun, B. Yavuz, S. Zhu, M. Murthy, J. Torres, H. van Hovell,
A. Ionescu, A. Luszczak, M. Świtakowski, M. Szafrański, X. Li, T. Ueshin,
M. Mokhtar, P. Boncz, A. Ghodsi, S. Paranjpye, P. Senster, R. Xin, and M. Zaharia.
Delta Lake: high-performance ACID table storage over cloud object stores. Proc. VLDB
Endow., 13(12):3411–3424, aug 2020. ISSN 2150-8097. doi: 10.14778/3415478.3415560.
URL https://doi.org/10.14778/3415478.3415560.

R. Ashmore, R. Calinescu, and C. Paterson. Assuring the machine learning lifecycle:
Desiderata, methods, and challenges. CoRR, abs/1905.04223, 2019. URL
http://arxiv.org/abs/1905.04223.

J. Bennett. OpenStreetMap. Packt Publishing Ltd, 2010.

C. Bentéjac, A. Csörgő, and G. Martínez-Muñoz. A comparative analysis of gradient
boosting algorithms. Artificial Intelligence Review, 54:1937–1967, 2021. doi:
10.1007/s10462-020-09896-5. URL https://doi.org/10.1007/s10462-020-09896-5.

CBS News. Ride-sharing prices in New York City: Uber, Lyft, and yellow taxis.
https://www.cbsnews.com/newyork/news/ride-sharing-prices-new-york-city-uber-lyft-yellow-taxis/,
2023. Accessed: Nov 15, 2023.


City of New York Department of Transportation. DOT Traffic Speeds NBE, 2023. URL
https://data.cityofnewyork.us/Transportation/DOT-Traffic-Speeds-NBE/i4gi-tjb9.
Accessed: December 2, 2023.

P. Crickard III. Leaflet.js essentials. Packt Publishing Ltd, 2014.

Curb Mobility. Curb mobility. https://www.gocurb.com, 2024. Accessed: May 4, 2024.

Databricks. Databricks AutoML. https://www.databricks.com/product/automl, 2024a.
Accessed: May 4, 2024.

Databricks. Databricks AutoML. https://www.databricks.com/product/automl, 2024b.
Accessed: May 4, 2024.

H. Deng, Y. Zhou, L. Wang, and C. Zhang. Ensemble learning for the early prediction
of neonatal jaundice with genetic features. BMC Medical Informatics and Decision
Making, 21, 12 2021. doi: 10.1186/s12911-021-01701-9.

Esri. What is Geocoding?
https://desktop.arcgis.com/en/arcmap/latest/manage-data/geocoding/what-is-geocoding.htm,
2024. Accessed: May 4, 2024.

L. Etaati. Azure Databricks, pages 159–171. Apress, Berkeley, CA, 2019. ISBN
978-1-4842-3658-1. doi: 10.1007/978-1-4842-3658-1_10. URL
https://doi.org/10.1007/978-1-4842-3658-1_10.

M. Gardner and S. Dorling. Artificial neural networks (the multilayer perceptron)—a
review of applications in the atmospheric sciences. Atmospheric Environment,
32(14):2627–2636, 1998. ISSN 1352-2310. doi:
https://doi.org/10.1016/S1352-2310(97)00447-0. URL
https://www.sciencedirect.com/science/article/pii/S1352231097004470.

A. Gupta, S. Sharma, S. Goyal, and M. Rashid. Novel XGBoost tuned machine learning
model for software bug prediction. pages 376–380, 06 2020. doi:
10.1109/ICIEM48762.2020.9160152.

M. Hartl. Ruby on rails tutorial: learn Web development with rails. Addison-Wesley
Professional, 2015.

X. He, K. Zhao, and X. Chu. AutoML: A survey of the state-of-the-art. Knowledge-Based
Systems, 212:106622, 2021. ISSN 0950-7051. doi:
https://doi.org/10.1016/j.knosys.2020.106622. URL
https://www.sciencedirect.com/science/article/pii/S0950705120307516.

HERE. HERE Routing, 2023. URL https://www.here.com/platform/routing.
Accessed: December 2, 2023.

IBM. Decision trees. https://www.ibm.com/topics/decision-trees, 2023a. Accessed:
November 28, 2023.

IBM. Random forest. https://www.ibm.com/topics/random-forest, 2023b. Accessed:
Nov 25, 2023.

IETF GeoJSON Working Group. The GeoJSON Format (RFC 7946). RFC 7946, RFC
Editor, August 2016. URL https://datatracker.ietf.org/doc/html/rfc7946.

Uber Inc. H3 Geospatial Indexing System. https://h3geo.org, 2024. Accessed: May 2,
2024.

IT Services’ Research and Innovation team. Sheffield HPC Documentation, 2023. URL
https://docs.hpc.shef.ac.uk/en/latest/. Accessed: Dec 2, 2023.

G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu.
LightGBM: A highly efficient gradient boosting decision tree. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 30. Curran Associates,
Inc., 2017. URL
https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.

Komoot. Photon: Search as you type with OpenStreetMap. https://photon.komoot.io,
2024. Accessed: May 4, 2024.

P. Manoharan, M. Malviya, C. Kumar, M. Hamdi, V. Vijayakumar, N. Jamel, and
H. J. Alyamani. New York City taxi trip duration prediction using MLP and XGBoost.
International Journal of System Assurance Engineering and Management, 13:1–12, 07
2021. doi: 10.1007/s13198-021-01130-x.

J. McIntyre. Moscow or Kano - how do you prioritize?
https://www.hotpmo.com/management-models/moscow-kano-prioritize/, October 2016.
Accessed: May 2, 2024.

Microsoft Corporation. LightGBM documentation. https://lightgbm.readthedocs.io/en/stable/,
2023. Accessed: May 2, 2024.

Microsoft Corporation. Azure for Students. https://azure.microsoft.com/en-us/free/students,
2024. Accessed: May 4, 2024.

C. Molnar. Interpretable Machine Learning. 2 edition, 2022. URL
https://christophm.github.io/interpretable-ml-book. Chapter 3, Pages 79-81.

D. Mwiti. A comprehensive guide to ensemble learning.
https://neptune.ai/blog/ensemble-learning-guide, 2021. Accessed: November 28, 2023.

A. Natekin and A. Knoll. Gradient boosting machines, a tutorial. Frontiers in
Neurorobotics, 7, 2013. ISSN 1662-5218. doi: 10.3389/fnbot.2013.00021. URL
https://www.frontiersin.org/articles/10.3389/fnbot.2013.00021.

National Centers for Environmental Information. NY CITY CENTRAL PARK, NY
US Weather Data. URL
https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USW00094728/detail.
Accessed: December 2, 2023.

New York City Department of City Planning. Department of city planning facilities
database (FacDB), 2023. URL
https://www.nyc.gov/site/planning/data-maps/open-data/dwn-selfac.page.
Accessed: December 2, 2023.

A. Noulas, V. Salnikov, D. Hristova, C. Mascolo, and R. Lambiotte. Developing and
deploying a taxi price comparison mobile app in the wild: Insights and challenges.
In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics
(DSAA), pages 424–433, 2018. doi: 10.1109/DSAA.2018.00055.

NYC Taxi and Limousine Commission. NYC Taxi Zones Map, 2023. URL
https://data.cityofnewyork.us/d/d3c5-ddgc?category=Transportation&view_name=NYC-Taxi-Zones.
Accessed: Dec 2, 2023.

NYC Taxi and Limousine Commission. Data dictionary – yellow taxi trip records.
https://data.cityofnewyork.us/api/views/biws-g3hs/files/eb3ccc47-317f-4b2a-8f49-5a684b0b1ecc?download=true&filename=data_dictionary_trip_records_yellow.pdf,
2023a. Accessed: Nov 15, 2023.

NYC Taxi and Limousine Commission. TLC trip record data.
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page, 2023b. Accessed: Nov 15, 2023.

NYC Taxi and Limousine Commission. Taxi fare information.
https://www.nyc.gov/site/tlc/passengers/taxi-fare.page, 2023c. Accessed: Nov 15, 2023.

Office of Technology and Innovation (OTI). Points of interest, 2023. URL
https://data.cityofnewyork.us/City-Government/Points-Of-Interest/rxuy-2muj.
Accessed: December 2, 2023.

Permetrics. Permetrics Documentation - RMSE, 2023. URL
https://permetrics.readthedocs.io/en/latest/pages/regression/RMSE.html.
Accessed: Nov 30, 2023.

Python Software Foundation. Python. https://www.python.org, 2024. Accessed: May
2, 2024.

S. Salloum, R. Dautov, X. Chen, et al. Big data analytics on Apache Spark.
International Journal of Data Science and Analytics, 1:145–164, 2016. doi:
10.1007/s41060-016-0027-9. URL https://doi.org/10.1007/s41060-016-0027-9.

J. Saltz. CRISP-DM is still the most popular framework for executing data
science projects. Data Science Project Management, 2024. URL
https://www.datascience-pm.com/crisp-dm-still-most-popular/.

S. Sayad. Decision tree overfitting.
https://www.saedsayad.com/decision_tree_overfitting.htm, 2023. Accessed on:
November 28, 2023.

T. Schneider. Taxi and ridehailing app usage in New York City.
https://toddwschneider.com/dashboards/nyc-taxi-ridehailing-uber-lyft-data/, 2024.
toddwschneider.com.

T. W. Schneider. NYC taxi data. https://github.com/toddwschneider/nyc-taxi-data,
2023. Last updated on February 10, 2023. Accessed on Nov 28, 2023.

scikit-learn developers. scikit-learn: Haversine distances. scikit-learn Documentation,
2023. BSD License.

The OSRM Contributors. Open Source Routing Machine. https://project-osrm.org,
2024. Accessed: May 4, 2024.

Uber Inc. Estimate price - Uber developers.
https://developer.uber.com/docs/riders/references/api/v1.2/estimates-price-get, 2023a.

Uber Inc. How Uber’s dynamic pricing model works.
https://www.uber.com/en-GB/blog/uber-dynamic-pricing/, 2023b. Accessed: Nov 15, 2023.

UK Government. Data protection act 2018. https://www.gov.uk/data-protection,
2018. Accessed: May 2, 2024.

J. VanderPlas. Python data science handbook: Essential tools for working with data.
O’Reilly Media, Inc., 2016.

S. Woodhouse. NYC taxi cab fares to rise 23% in first increase since 2012. Bloomberg,
November 2022. URL
https://www.bloomberg.com/news/articles/2022-11-15/nyc-taxi-cab-fares-to-rise-23-in-first-increase-since-2012?utm_source=website&utm_medium=share&utm_campaign=copy.

XGBoost Developers. XGBoost Documentation.
https://xgboost.readthedocs.io/en/stable/index.html, 2022.

M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching,
T. Nykodym, P. Ogilvie, M. Parkhe, et al. Accelerating the machine learning lifecycle
with MLflow. IEEE Data Eng. Bull., 41(4):39–45, 2018.
