Acd 21 JB
Jack Burke
May 8, 2024
Declaration
All sentences or passages quoted in this report from other people’s work have been
specifically acknowledged by clear cross-referencing to author, work and page(s). Any
illustrations that are not the work of the author of this report have been used with the
explicit permission of the originator and are specifically acknowledged. I understand that
failure to do this amounts to plagiarism and will be considered grounds for failure in this
project and the degree examination as a whole.
Date: 08/05/2024
Abstract
Taxis and rideshares are key modes of transport across cities, each with its own
pricing structure: taxis use fixed-fare systems that can vary depending on traffic
levels, whereas rideshares use surge pricing that can fluctuate with user demand.
Using the New York City taxi dataset, which has been published monthly since 2009
and aggregates over 3 billion taxi and rideshare trips, this project applies machine learning
to make predictions about trips and provide greater insight into which transport
mode is cheapest at a defined time and day of the week.
The goal is to provide users with a web application so they can access this data easily,
and plan their trips better by comparing both transport modes and their fares.
Contents
1 Introduction
1.1 Background
1.2 Aims
1.3 Objectives
1.4 Overview of the Report
2 Literature Survey
2.1 Limitations of Existing Applications for Estimating and Comparing Journey Fares
2.2 The Dataset
2.2.1 New York City Taxi and Limousine Commission Trip Record Data
2.2.2 New York City Taxi and Uber Fare Models
2.2.3 Advantages of the Dataset
2.2.4 Limitations of the Dataset
2.3 Machine Learning Models Applied on the Dataset
2.3.1 Decision Trees
2.3.2 Random Forest
2.3.3 Gradient Boosting
2.3.4 XGBoost
2.3.5 Multilayer Perceptron (MLP)
2.4 Evaluation
2.4.1 Root Mean Square Error (RMSE)
2.5 Analysis of Machine Learning Models Applied on the Dataset
2.5.1 Comparison of Linear Regression Against Random Forest to Predict Trip Fares and Duration on the Dataset
2.5.2 Comparison of XGBoost Against MLP to Predict Trip Duration on the Dataset
4 Design
4.1 Methodology
4.1.1 Machine Learning Life Cycle
4.1.2 Cross-Industry Standard Process for Data Mining
4.2 Dataset and Features
4.2.1 Feature Selection
4.2.2 Feature Engineering
4.2.3 Handling the Data
4.3 Modeling
4.3.1 Model Frameworks
4.3.2 Model Structure
4.3.3 Model Development Tools
4.4 Web Application and Deployment
4.4.1 Libraries and Frameworks
4.4.2 Application Flow
4.4.3 Wireframe
7 Conclusions
List of Figures
2.1 Map of New York City with TLC Taxi Zones, boroughs and airports labelled
2.2 Taxi Zones Example
2.3 Visualisation of a decision tree.
2.4 Visualisation of a random forest model.
2.5 The architecture of a Gradient Boosting Decision Tree.
List of Tables
Chapter 1
Introduction
1.1 Background
Yellow Taxis are an iconic symbol of New York City and form an integral part of its
transport network, with over a hundred thousand trips taken every day across the city’s five
boroughs (Schneider, 2024). However, rising competition from rideshare apps like Uber
and Lyft has completely changed the market, with these apps taking five or six times as
many trips per day and holding around an 87% market share. Pricing between the two
modes can vary heavily, with rideshares being susceptible to “rush hour, rain or snow”
surge pricing (CBS News, 2023), whereas taxis use a fixed “metered” fare system (NYC
Taxi and Limousine Commission, 2023c).
This project addresses the need for commuters to compare the fares of New York City taxis
against rideshares and allow them to make more informed decisions on their transportation
options.
1.2 Aims
The aim of this project is to enable a user to compare the fares of New York City taxis
against rideshares. To elaborate further, the plan is to create a web application where
the user can input an origin and destination within New York and its vicinity, along with
the time and day of the week. The application will then present estimated fares for both
New York City taxis and rideshares to the user.
1.3 Objectives
The objectives of this project are:
• Develop and deploy machine learning models capable of predicting New York City
taxi and rideshare fares.
Chapter 2 Literature Survey: This chapter surveys all relevant literature and research
for this project, specifically the existing solutions, the dataset used and previous research
on machine learning models applied to this dataset. Additionally, it examines the benefits
and drawbacks of these approaches.
Chapter 3 Requirements and Analysis: This chapter details the requirements for
this project to be deemed successful while also acknowledging potential constraints.
Chapter 4 Design: This chapter details the development methodology and the design
of the dataset, machine learning models and the web application, providing justification
for each approach and the libraries and frameworks used.
Chapter 5 Implementation and testing: This chapter details the development process
and the issues faced, and highlights specific features along with their implementation
and testing.
Chapter 6 Results and discussion: This chapter details the results of the machine
learning models and evaluates the success of the application in meeting its requirements.
Chapter 7 Conclusions: This chapter details what this project has achieved, the lessons
that have been learnt and what could be done in future implementations to build upon
this work.
Chapter 2
Literature Survey
Using an API for this is not a completely effective approach as there are “variations
in travel times driven by urban congestion or alternative routes picked by driver” (Noulas
et al., 2018). This means that for the same trip there can be multiple routes with varying
distances and travel times, directly impacting the cost of the trip. The API they used,
HERE Maps, only predicts the “fastest or shortest journey” (HERE, 2023), which does not
reflect real-world scenarios. Furthermore, since that paper was written Uber has changed
its terms of service to state that “using the Uber API to offer price comparisons with competitive
third party services is in violation of § II B of the API Terms of Use” (Uber Inc., 2023a).
1 CityMapper, New York City: https://citymapper.com/nyc
2 RideGuru: https://ride.guru
The dataset accounts for: Yellow Taxis, which are the iconic street-hailed taxis and can
be hailed from anywhere in NYC; Green Taxis, which are also known as ‘boro taxis’ and
can only be hailed in Upper Manhattan and the outer boroughs (excluding the airports),
although they are able to travel to anywhere in NYC; and High-Volume For-Hire Services
(HVFHS), which are vehicles that can be requested from anywhere in NYC but only through
ride-hailing services such as Uber or Lyft.
The dataset is vast, with Yellow Taxis and HVFHS containing 100-200 million records
per year, and Green Taxis containing about 20 million records per year.
The dataset for Yellow Taxis dates back to 2009, with Green Taxis being added in 2013
and HVFHS in 2019. Each record contains around 20 pieces of data, including:
Table 2.1: Data dictionary for Yellow, Green and HVFHS taxis.
Data Dictionary provided by NYC Taxi and Limousine Commission - Available at (NYC
Taxi and Limousine Commission, 2023a)
* For Yellow and Green Taxis the coordinates of the pick-up and drop-off locations are
provided up until July 2016, after which a zone system is used instead. The zone
system divides New York City into 262 zones, seen in Figure 2.1, which are based on NYC’s
Neighbourhood Tabulation Areas (NTAs) and are meant to closely approximate NYC’s
neighbourhoods. For all HVFHS data only the zone system is used.
Figure 2.1: Map of New York City with TLC Taxi Zones, boroughs and airports labelled
Map data © OpenStreetMap contributors
Source: https://www.openstreetmap.org/
Licensed under the Open Data Commons Open Database License (ODbL)
https://www.openstreetmap.org/copyright
NYC Taxi Zones - Available at (NYC Taxi and Limousine Commission, 2023)
Uber uses a proprietary “dynamic pricing algorithm” to determine its fares. The
algorithm factors in variables such as the time and distance of the route, traffic and
the current rider-to-driver demand. When there is a sudden increase in rider demand due
to factors such as “bad weather, rush hour, and special events”, Uber implements ‘surge
pricing’, where fares are increased to balance supply and demand (Uber Inc., 2023b).
Taxis do not have surge pricing, so during busy times they are often cheaper than
rideshares. However, unlike taxis, Ubers do not charge depending on the speed of the
vehicle.
Handling of the zone based data is a novel problem and has not been discussed in any
prior literature.
[Figure 2.3: Visualisation of a decision tree. A root node splits on a True/False condition into child nodes.]
Decision trees face problems such as overfitting, where the tree develops so many hypotheses
based upon the training data that it performs poorly on the test data (Sayad, 2023). In
addition, they can be high-variance estimators, as slight variations in the data can result
in completely different trees (IBM, 2023a). Lastly, they can be more computationally
costly than other models due to using a ‘greedy search approach’.
Ensemble learning is where multiple machine learning models are combined. In simple
techniques their predictions are combined either by majority voting for classification or
by averaging for regression (Mwiti, 2021).
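As a minimal sketch of the two combination rules just described, the following uses made-up predictions from three hypothetical models. The function names and sample values are illustrative assumptions rather than part of any cited work.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common class label among the model predictions."""
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Return the mean of the model predictions (used for regression)."""
    return mean(values)


# Three hypothetical models predict a class and a fare for the same trip.
class_votes = ["taxi", "rideshare", "taxi"]
fare_estimates = [12.0, 14.0, 13.0]

print(majority_vote(class_votes))
print(average_prediction(fare_estimates))
```

Fare prediction is a regression problem, so averaging is the rule that applies to this project.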
Random forest addresses key problems of decision trees such as overfitting. This is
because random forest uses multiple decision trees and the “averaging of uncorrelated
trees lowers the overall variance and prediction error” (IBM, 2023b).
[Figure 2.4: Visualisation of a random forest model. The dataset feeds multiple trees whose outputs are combined by majority voting or averaging.]
To work out how to improve the model, it attempts to minimise what is called a ‘loss
function’, which is the “difference between the predicted outputs of a machine learning
algorithm and the actual target values.” There are many loss functions that are appropriate
for different problems, with popular ones including Mean Square Error, Mean Absolute
Error and Log Loss (Alake, 2023).
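To make the three named loss functions concrete, the sketch below implements their textbook definitions in plain Python. The function names and the clipping constant `eps` are assumptions made for illustration, and the log loss shown is the binary form.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def log_loss(actual, predicted_prob, eps=1e-15):
    """Binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actual, predicted_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)
```

Mean Square Error and Mean Absolute Error suit regression targets such as fares, whereas Log Loss applies to classification.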
The results from each of these models are then combined and ensemble techniques
are used to determine the final result (Natekin and Knoll, 2013).
2.3.4 XGBoost
XGBoost, or extreme gradient boosting, is an “optimized distributed gradient boosting
library designed to be highly efficient, flexible and portable” (XGBoost Developers, 2022).
It improves on gradient boosting by using parallel tree boosting, which parallelises the
work of building each decision tree. This allows it to offer high accuracy as well as being highly
computationally efficient (Gupta et al., 2020). XGBoost works on both regression and
classification problems.
2.4 Evaluation
Evaluation is used to determine the efficacy of a model. This is done by looking at the
model’s performance on key performance metrics such as Root Mean Square Error:
\[ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \text{Squared Difference}_i} \]
where $\text{Squared Difference}_i = (y_i - \hat{y}_i)^2$, with $y_i$ being the actual value,
$\hat{y}_i$ being the predicted value and $N$ being the number of values.
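The definition above can be sketched directly in Python. The function name `rmse` and the sample fares are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Four hypothetical fares, each predicted with an error of exactly $3.
actual_fares = [10.0, 20.0, 30.0, 40.0]
predicted_fares = [13.0, 17.0, 33.0, 37.0]
print(rmse(actual_fares, predicted_fares))
```

Because the differences are squared before averaging, a few large errors raise the RMSE more than many small ones.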
In their conclusion they propose further study that could improve their results. They
discussed improving the auto-tune on the MLP to achieve better results. Additionally,
they discussed looking at location-based features that could influence trip durations
and incorporating speed-limit-based features to better model traffic. This could be
implemented using the DOT Traffic Speeds NBE (City of New York Department of
Transportation, 2023) discussed earlier.
Furthermore, they mention using weather data from areas like Central Park as “New
Yorkers might take a taxi when they are near Central Park or when the weather condition
is severe.” This would be possible with the publicly available data from the National
Centers for Environmental Information weather station in Central Park (National Centers
for Environmental Information).
Lastly, they propose that the K-Means Clustering algorithm could be enhanced by clustering
around popular areas such as “metro station, number of bars and eateries in a given zone,
etc.” This could be implemented using the publicly available Office of Technology and
Innovation (OTI) Points of Interest dataset (Office of Technology and Innovation (OTI),
2023) which details points of interest such as schools, transport facilities, commercial
buildings, landmarks and more. Additionally, there is the publicly available Department of
City Planning Facilities Database which “aggregates information about 30,000+ facilities
and program sites that are owned, operated, funded, licensed, or certified by a City, State,
or Federal agency in the City of New York” (New York City Department of City Planning,
2023).
Chapter 3
This section outlines the requirements of the project, detailing both the functional and
non-functional requirements needed for project completion. It also specifies the steps
necessary to achieve these requirements.
3.1.1 Functional
# | Requirement | Priority
1 | Develop and deploy a machine learning model capable of predicting taxi and Uber fares | M
2 | Implement a web app where inference can be done on the deployed models | M
3 | Implement on the web app the ability for the user to enter an origin and destination and compare the fares between taxis and Ubers for that trip | M
4 | Implement on the web app the ability for the user to enter the time of day and day of week for a given trip | M
5 | Display on the web app an interactive map showing where the origin and destination are | M
6 | Display on the web app the ‘ideal’ route to illustrate the user’s journey | S
7 | Display on the web app metrics such as trip time and distance | S
CHAPTER 3. REQUIREMENTS AND ANALYSIS 13
3.1.2 Non-functional
# | Requirement
1 | The web app should have a user-friendly interface
2 | The web app should have high performance and return queries in a reasonable time
3 | The web app should be responsive to a variety of screen sizes
4 | The web app should be compatible with major browsers
RMSE has been chosen as the primary metric as it enables effective comparisons to other
works and provides users with a straightforward metric to interpret. For instance, an RMSE
value of 3 indicates that predicted fares typically deviate from the actual fare by around $3,
providing users with a clear understanding of the prediction accuracy.
Design
This section outlines the design approach taken in implementing the project, focusing
on the methodology used and the design of the model and of the front-end and back-end
of the web application.
4.1 Methodology
4.1.1 Machine Learning Life Cycle
The machine learning (ML) life cycle is the iterative process that encompasses the
preparation, development and deployment of an ML solution. This iterative process
is necessary as ML solutions are continually built on and improved (Ashmore et al., 2019).
• Business understanding
• Data understanding
• Data preparation
• Modeling
• Evaluation
• Deployment
In addition, the HVFHS dataset covering July 2023 to July 2024 will be used for the
Uber fare model.
• Restricting the coordinates of records to New York and its immediate vicinity.
• Removing records where the fare is less than $3, as this is the base fare (NYC Taxi
and Limousine Commission, 2023a).
As the data for taxis comes from the July 2014 to July 2016 time period, it is important
to use feature engineering to update it to match current rates.
Table 4.1: Taxi Fare Comparison across 2012-2022 and 2023 onwards
Sources (Woodhouse, 2022; NYC Taxi and Limousine Commission, 2023a)
As reported in Woodhouse (2022), the average fare went up 23% at the 2023 price
increase; this will be reflected in the base fare of each record.
Haversine Distance
As discussed in Manoharan et al. (2021), the Haversine distance had the highest feature
importance. The Haversine distance is the length of the most direct path between two coordinates
on the surface of a sphere (scikit-learn developers, 2023). As the Earth is nearly spherical, it
can be applied with only a 1% margin of error. The Haversine formula is defined as:
\[ D(x, y) = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{x_{Lat} - y_{Lat}}{2} \right) + \cos(y_{Lat}) \cos(x_{Lat}) \sin^2\left( \frac{x_{Lon} - y_{Lon}}{2} \right) } \right) \]
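The formula can be sketched in plain Python as follows. Here r is taken to be a mean Earth radius of 6371 km, which is an assumed constant, and the function name and the convention of passing degrees are illustrative choices. The formula itself expects radians, so the inputs are converted first.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed value for r (mean Earth radius)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

The result is a straight-line distance over the sphere, so it is a lower bound on the distance a vehicle actually drives.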
Apache Spark
Apache Spark is the ‘de facto framework for big data analytics’ (Salloum et al., 2016). It
has features such as an advanced in-memory programming model and upper-level libraries for
scalable machine learning, graph analysis, streaming and structured data processing.
In this project Apache Spark will be used for the feature selection and feature engineering
process. Specifically, the Python language-integrated API, PySpark (Apache Spark,
2024), will be used.
4.3 Modeling
As outlined in the literature survey a gradient boosted regression model is the most
suitable approach for handling the TLC Trip Record dataset. Gradient boosted models
are ideal for this problem due to:
• Their ability to handle the spatial data in the New York City taxi dataset with
high levels of accuracy, as highlighted by Manoharan et al. (2021).
• Their ability to determine feature importance, which is highly useful when performing
feature selection and engineering.
• Their optimisation for distributed workloads, which is useful given that this project is
being developed on HPC clusters.
LightGBM
In the HVFHS dataset the pick-up and drop-off locations use zones with numeric identifiers
based on New York City’s neighbourhoods. Therefore, a reverse engineering approach is
needed when determining fares from latitude and longitude points within these zones.
In the HVFHS dataset, aside from the pick-up and drop-off zones, the only features that
can be used to distinguish trips are the time and distance values.
When introducing new data, obtaining an accurate estimate of these values is therefore
necessary for optimal fare prediction. The chosen approach is to develop two
models trained on the coordinate-based data from the Yellow Taxi dataset prior to the
switch to zone-based identifiers. With these models, estimates for trip time and distance
can be produced. By matching the latitude and longitude of the pick-up
and drop-off locations to their respective zones, and combining these with the estimated trip time and
distance, the Uber fare model can calculate an estimate of the fare.
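The chaining of models described above can be sketched as a single function. Everything named here is hypothetical: the four callables stand in for the trained distance model, the trained duration model, the coordinate-to-zone lookup and the zone-based fare model, and the feature names are assumptions for illustration.

```python
def estimate_rideshare_fare(origin, destination, day_of_week, hour_of_day,
                            distance_model, duration_model,
                            zone_lookup, fare_model):
    """Chain the models: coordinates -> (distance, time) -> zone-based fare."""
    # Step 1: estimate distance and time from the coordinates.
    trip_miles = distance_model(origin, destination, day_of_week, hour_of_day)
    trip_seconds = duration_model(origin, destination, day_of_week, hour_of_day)
    # Step 2: translate each coordinate into its taxi zone identifier.
    features = {
        "pickup_zone": zone_lookup(origin),
        "dropoff_zone": zone_lookup(destination),
        "trip_miles": trip_miles,
        "trip_seconds": trip_seconds,
        "day_of_week": day_of_week,
        "hour_of_day": hour_of_day,
    }
    # Step 3: the fare model only ever sees zone-based features.
    return fare_model(features)
```

Passing the models in as callables keeps the sketch independent of any particular framework, so simple stand-in functions can be used to exercise the flow.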
MLflow
MLflow is an open source platform for managing the end-to-end machine learning life
cycle. It allows for tracking of machine learning experiments, recording the “code
used, parameters, input data, metrics and output.”
Additionally, it allows models to be packaged and deployed, with inference served through
a REST API endpoint, allowing integration into web front-ends (Zaharia et al., 2018).
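As a hedged sketch of what querying such an endpoint might look like, the code below builds (but does not send) a JSON scoring request. It assumes the `dataframe_split` input format accepted by recent MLflow model servers; the URL, token and feature names are placeholders invented for illustration.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, token, columns, rows):
    """Build a POST request carrying one or more feature rows as JSON."""
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.invalid/invocations",  # placeholder endpoint URL
    "PLACEHOLDER_TOKEN",                    # placeholder credential
    ["day_of_week", "hour_of_day", "haversine_km"],  # hypothetical features
    [[2, 17, 4.3]],
)
# Sending it would be done with urllib.request.urlopen(request).
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.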
AutoML
AutoML is the process of automating the iterative tasks of machine learning development
(He et al., 2021).
As this project is focused on the whole machine learning cycle, it is important to be
able to develop models without needing to spend large amounts of time on tasks such as
comparing models and tuning hyperparameters.
Ruby on Rails is a popular web development framework for the Ruby language (Hartl,
2015). Rails is used on sites such as Airbnb, GitHub and Twitter/X.
Rails is the best approach for this project due to it being completely open source,
adopting new web technologies early and having a large community of contributors and gems
(plugins).
OpenStreetMap
Geocoding
Using the open source Photon geocoder (Komoot, 2024) the user can perform search-as-you-type
on OSM data. This allows for real time suggestions of points of interest (POIs) as the
user types, and importantly, the translation of the desired location into a latitude and
longitude point.
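A minimal sketch of this geocoding step is given below. It assumes the public Photon instance at photon.komoot.io with its `q` and `limit` query parameters, and the sample response is hand-written for illustration rather than a real API result.

```python
import json
from urllib.parse import urlencode

PHOTON_URL = "https://photon.komoot.io/api/"  # assumed public instance


def photon_search_url(query, limit=5):
    """Build the search URL for a partial, as-you-type query string."""
    return PHOTON_URL + "?" + urlencode({"q": query, "limit": limit})


def first_coordinate(response_text):
    """Return (lat, lon) of the first feature, or None if there are none."""
    features = json.loads(response_text).get("features", [])
    if not features:
        return None
    lon, lat = features[0]["geometry"]["coordinates"]  # GeoJSON is lon, lat
    return lat, lon


# Hand-written sample in the GeoJSON shape that Photon responds with.
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-73.9772, 40.7527]},
        "properties": {"name": "Grand Central Terminal"},
    }],
})
print(photon_search_url("Grand Central"))
print(first_coordinate(sample))
```

Note that GeoJSON stores longitude before latitude, which is easy to get wrong when passing the point on to other components.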
Routing
Using OpenRouteService (ORS) (ORS, 2024) vehicle routing can be performed using
OpenStreetMap data. The OpenRouteService directions API generates vehicle routes in
a GeoJSON format (IETF GeoJSON Working Group, 2016).
GeoJSON is a format based on JavaScript Object Notation (JSON) that allows for the
storing and transmitting of geographic data, allowing integration into a web front-end.
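To illustrate the format, the sketch below parses a small hand-written GeoJSON route and extracts its points. The coordinates and the `name` property are invented for illustration and do not come from a real routing response.

```python
import json

# A hand-written GeoJSON FeatureCollection holding one LineString route.
route_geojson = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"name": "example route"},
        "geometry": {
            "type": "LineString",
            "coordinates": [[-73.9755, 40.7516], [-73.9772, 40.7527]],
        },
    }],
})


def route_points(geojson_text):
    """Return the (lat, lon) points of the first LineString feature."""
    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] == "LineString":
            return [(lat, lon) for lon, lat in geometry["coordinates"]]
    return []


print(route_points(route_geojson))
```

Since the format is plain JSON, the same document can be produced by a back-end and consumed directly by a JavaScript mapping library.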
Leaflet.js
Additionally, GeoJSONs can be loaded as layers on top of the map, allowing
vehicle routes from services such as OpenRouteService to be displayed.
The web application will be a single page. From here the user can input the origin and
destination of their trip and the time and day of the week. The web application will then
perform an API request to the server with this information.
Using the deployed models the server will return the fare estimates of the taxi and
rideshare options. In addition the server will return the ’ideal’ vehicle route to be displayed
on the map.
4.4.3 Wireframe
Building on the application flow detailed in Figure 4.3, this wireframe shows the desired
appearance of the web application.
With this design the application achieves its desired functional and non-functional requirements,
delivering a clear and user friendly experience.
In this section the project’s development, as well as the associated challenges faced, will be
detailed.
Specifically, it will detail the data preparation, development and deployment of the machine
learning models, in addition to the development of the web application.
This posed a challenge in development as libraries and frameworks that were needed
for the project, such as MLflow and AutoML, were not available or had limited support.
For instance, the ability to serve models with a REST API endpoint in MLflow was
affected, which is an important requirement of this project.
Additionally, the HPC that was used, Stanage, still has very limited graphical session
support, meaning that all work had to be done via the command line, which can be
convoluted at times.
CHAPTER 5. IMPLEMENTATION AND TESTING 23
5.1.2 Databricks
To overcome the issues faced with the University’s HPC, Databricks was used instead.
Firstly, it can ingest and store a significant amount of data utilising Delta Lake (Armbrust
et al., 2020). Delta Lake is a table storage layer over cloud object stores that can store large tabular datasets
whilst maintaining Atomicity, Consistency, Isolation, and Durability (ACID), ensuring
data integrity. In addition, Delta Lake can perform fast metadata operations, which speeds up data
querying and is highly useful for the large datasets used in this project. Moreover, Delta
Lake integrates with Apache Spark, allowing feature selection and engineering
operations to be performed seamlessly.
Additionally, Databricks supports Python notebooks with Apache Spark. This is highly
useful for tasks like feature selection and feature engineering as it provides an interactive
environment for data preparation.
Furthermore, Databricks has native integration with MLflow (Zaharia et al., 2018),
allowing management and tracking of the end-to-end machine learning life cycle and,
importantly, the serving of models for inference with a REST API endpoint.
In addition, Databricks has built-in AutoML functionality (Databricks, 2024b) for model
selection and hyperparameter tuning. This feature significantly speeds up the machine
learning model development process, allowing faster development cycles.
Lastly, Databricks can be run on Microsoft Azure with Azure Databricks (Etaati, 2019).
Microsoft Azure offers a free $100 credit for students (Microsoft Corporation, 2024), which
will adequately fund this project.
For the project, the cluster with the most memory was chosen. This was
necessary as it allows for faster querying of large datasets. A constraint of Databricks
on a student account is that each cluster is restricted to a maximum of 4 cores and 32 GB
of memory. However, this was sufficient for the project.
• Deducting tips from the total fare as this can cause outliers.
• Filtering where the fare amount is greater than $3.00 and is less than $200.
• Filtering trips where the trip time is greater than 60 seconds and is less than 4
hours.
• Dropping features that are not needed: hvfhs_license_num, dispatching_base_num,
originating_base_num, request_datetime, on_scene_datetime, driver_pay, shared_request_flag,
shared_match_flag, access_a_ride_flag, wav_request_flag, wav_match_flag.
• Deducting tips from the total fare as this can cause outliers.
• Filtering where the fare amount is greater than $3.00 and is less than $200.
• Filtering trips where the trip time is greater than 60 seconds and is less than 4
hours.
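The filtering rules listed above can be expressed as simple predicates. This is a plain-Python sketch of the rules only; the function and parameter names are assumptions, and in the project itself such filters would be applied to Spark data rather than to individual values.

```python
MIN_FARE, MAX_FARE = 3.00, 200.00          # dollars, both bounds exclusive
MIN_SECONDS, MAX_SECONDS = 60, 4 * 60 * 60  # 60 seconds to 4 hours


def fare_without_tip(total_amount, tip_amount):
    """Deduct the tip so that generous tips do not appear as fare outliers."""
    return total_amount - tip_amount


def is_valid_trip(fare_amount, trip_seconds):
    """Keep trips whose fare and duration fall inside the stated ranges."""
    fare_ok = MIN_FARE < fare_amount < MAX_FARE
    time_ok = MIN_SECONDS < trip_seconds < MAX_SECONDS
    return fare_ok and time_ok
```

Both bounds are treated as strict here because the text says "greater than" and "less than"; whether the boundaries themselves should be kept is an assumption.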
Firstly, the pick-up datetime has been transformed into three individual features: dayOfWeek
(0-6, representing Mon-Sun), which indicates the day of the week; hourOfDay (0-23),
representing the hour of the day; and minuteOfHour (0-59), representing the minutes of
the hour.
This transformation is done to make all the data uniform and enable the model to
distinguish between various days of the week, as well as the different hours and minutes
within a day.
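The transformation can be sketched with Python's standard `datetime` module, whose `weekday()` method uses the same 0-6, Monday-first convention. The function name is an assumption, and the project itself performs this step on Spark data.

```python
from datetime import datetime


def datetime_features(pickup):
    """Split a pick-up datetime into the three features described above."""
    return {
        "dayOfWeek": pickup.weekday(),  # 0 = Monday ... 6 = Sunday
        "hourOfDay": pickup.hour,       # 0-23
        "minuteOfHour": pickup.minute,  # 0-59
    }


# 15 June 2016 was a Wednesday, so dayOfWeek is 2.
print(datetime_features(datetime(2016, 6, 15, 17, 42)))
```

Dropping the calendar date means that trips from different weeks with the same weekday and time share the same feature values.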
Using the Haversine formula, as discussed in Section 4.2.2, the direct distance between
two points can be calculated. PySpark expressions are used in conjunction with the
formula to transform the data in a distributed and optimized manner.
Figure 5.4: Code Snippet: Haversine Formula on the pick-up and drop-off coordinates
Following on from Section 4.2.2, Yellow Taxi fares have been updated to reflect current
pricing standards.
The new pricing structure includes a $2.50 rush hour surcharge from 4 pm to 8 pm
on weekdays. This calculation can be performed using the transformed datetime features,
as shown in Figure 5.5.
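A plain-Python sketch of the surcharge rule is shown below. It uses the dayOfWeek and hourOfDay features defined earlier and assumes that the window runs from 16:00 up to, but not including, 20:00; the function name is illustrative.

```python
RUSH_HOUR_SURCHARGE = 2.50  # dollars


def rush_hour_surcharge(day_of_week, hour_of_day):
    """Return $2.50 on weekdays (0-4 = Mon-Fri) between 16:00 and 19:59."""
    is_weekday = 0 <= day_of_week <= 4
    is_rush_hour = 16 <= hour_of_day < 20
    return RUSH_HOUR_SURCHARGE if is_weekday and is_rush_hour else 0.0
```

For example, a Wednesday trip at 17:00 attracts the surcharge, while the same trip on a Saturday does not.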
There are four locations where a trip that starts and/or ends there is subject to extra tariffs.
These locations can be viewed in Figure 2.1:
• John F. Kennedy International Airport ($70 flat fare on trips to and from Manhattan,
plus an additional $1.25 for any pick-up at the airport).
• LaGuardia Airport ($5 surcharge for any pick-up or drop-off, plus an additional
$1.25 for any pick-up at the airport).
• New York State Congestion ($2.50 for any pick-up or drop-off that is south of 96th
Street in Manhattan).
To determine when trips have started and/or ended in locations with extra tariffs,
the Shapely Python library (https://shapely.readthedocs.io/en/stable/) has been
used. Shapely has functionality for determining if a point is in a polygon.
The polygons of the tariff-affected areas are generated using the OpenStreetMap Overpass
Turbo website (https://overpass-turbo.eu). These are then stored in GeoJSON format
(explained in Section 4.4.1).
Using the generated GeoJSON polygons, the function as seen in Figure 5.7 is used to
determine if a latitude and longitude point is within the boundaries of the polygon.
Figure 5.7: Code Snippet: Determine if a latitude and longitude point is inside a GeoJSON
polygon.
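The project performs this check with Shapely. As a dependency-free illustration of the same idea, the sketch below swaps in the classic ray casting algorithm and applies it to a GeoJSON Polygon geometry. The function names are assumptions, and this is not the code shown in Figure 5.7.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting test against one linear ring of [lon, lat] pairs."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        crosses = (yi > lat) != (yj > lat)
        if crosses and lon < (xj - xi) * (lat - yi) / (yj - yi) + xi:
            inside = not inside  # each crossing to the right flips the state
        j = i
    return inside


def point_in_geojson_polygon(lon, lat, polygon):
    """True if the point is inside the outer ring and outside every hole."""
    outer, *holes = polygon["coordinates"]
    if not point_in_ring(lon, lat, outer):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in holes)


# A hand-written square with a square hole, in GeoJSON Polygon form.
square = {
    "type": "Polygon",
    "coordinates": [
        [[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]],
        [[1, 1], [3, 1], [3, 3], [1, 3], [1, 1]],
    ],
}
```

Points lying exactly on a boundary are not handled specially in this sketch, which a dedicated geometry library treats more carefully.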
5.3 Modeling
5.3.1 Databricks AutoML
For modeling Databricks AutoML (Databricks, 2024a) has been used. Databricks AutoML
allows the quick generation of machine learning models along with an accompanying
notebook for the model. The Databricks AutoML workflow involves iterative cycles
between this notebook and the model, with the final result being the production model.
Databricks AutoML generates MLflow experiments, with each run representing a different
model with different hyperparameters and dataset split. This allows for extensive exploration
and comparison of model configurations.
The top performing model can then be configured for inference and deployed to a REST
API endpoint. This enables the front-end to query the deployed model.
Through testing, it was established that the maximum number of records that Databricks
AutoML can handle without crashing is about ten million. This is due to the
constraints of the student account cluster. Therefore, the PySpark sample() function has
been used to sample a subset of the dataset.
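Since PySpark's `sample()` takes a fraction rather than a row count, the fraction has to be derived from the dataset size. The helper below is a small sketch of that calculation; its name and the default target of ten million rows are assumptions based on the limit described above.

```python
def sampling_fraction(total_rows, target_rows=10_000_000):
    """Fraction of rows to keep so that roughly target_rows remain."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return min(1.0, target_rows / total_rows)


# With PySpark the result would be passed on along the lines of:
#   sampled = df.sample(fraction=sampling_fraction(df.count()), seed=42)
print(sampling_fraction(40_000_000))
```

The sampled size is approximate rather than exact, so choosing a target slightly below the limit leaves some headroom.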
Following from the design plan, the front-end was developed using HTML, CSS and
JavaScript. Leaflet.js (discussed in Section 4.4.1) provides the map tiles and interactivity.
Additionally, Bootstrap (https://getbootstrap.com) was used to improve the styling of
the page elements, and Mapbox (https://www.mapbox.com) was used to improve
the look of the map tiles over the default OpenStreetMap tiles.
The website is a single-page application, so all interactions are performed through
this page. From this page the user can enter their origin and destination as well as the time
and day of the week of the trip. The website will then show the estimated fares for both
Yellow Taxi and Uber.
Figure 5.12: The web application showing a trip from the Empire State Building to
Central Park.
Map data © OpenStreetMap contributors
Source: https://www.openstreetmap.org/
Licensed under the Open Data Commons Open Database License (ODbL)
Additional Map data and styling © Mapbox https://www.mapbox.com/about/maps/
Geocoding is necessary to transform addresses and points of interest into latitude and
longitude points. To do this the Photon API has been used (discussed in Section 4.4.1).
The Photon API allows for search-as-you-type results from OpenStreetMap. An example
of a result from a Photon API query can be seen in Figure 5.13.
The Photon API has been used because it provides a free public API (with fair use).
Additionally the Photon API has the option to self-host which could be used in future
implementations (Komoot, 2024).
Figure 5.13: One of the results returned from the Photon API for the search “Grand
Central”.
The user can choose the time and day of the week, which was preferred over a
calendar approach where the user selects a specific month and day. This choice allows
a larger sample size to be used for each day of the week and time.
Routing
OpenRouteService takes the latitude and longitude of the origin and destination and
returns a GeoJSON containing the optimal route (not accounting for traffic). This feature
is meant as an illustration for the user and is not meant to show the actual route to be
taken.
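A rough sketch of the data handling around this request is given below. It assumes the
OpenRouteService directions endpoint accepts a JSON body with a coordinates list in
[longitude, latitude] order and returns a GeoJSON FeatureCollection whose first feature is
a LineString; the function names are hypothetical and the HTTP call itself is omitted.

```python
def build_route_request(origin: tuple, destination: tuple) -> dict:
    """Build the request body from (lat, lon) pairs.

    The service is assumed to expect [lon, lat] ordering, as in GeoJSON.
    """
    return {
        "coordinates": [
            [origin[1], origin[0]],
            [destination[1], destination[0]],
        ]
    }


def route_to_latlon(route_geojson: dict) -> list:
    """Extract the route as (lat, lon) tuples, the order Leaflet expects."""
    geometry = route_geojson["features"][0]["geometry"]
    if geometry["type"] != "LineString":
        raise ValueError("expected a LineString route geometry")
    # Each position may carry an optional elevation value, which is ignored.
    return [(position[1], position[0]) for position in geometry["coordinates"]]
```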
Figure 5.16: Route from Chrysler Building to Grand Central displayed on web application
map.
Map data © OpenStreetMap contributors
Source: https://www.openstreetmap.org/
Licensed under the Open Data Commons Open Database License (ODbL)
Additional Map data and styling © Mapbox https://www.mapbox.com/about/maps/
To find which taxi zones the origin and destination of the trip are in, Turf.js
(https://turfjs.org) is used.
Turf.js loads a GeoJSON of the taxi zones and searches each feature to find out what
zone the origin and destination of the trip are in.
The implementation of Turf.js can be seen in Figure 5.17.
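The zone lookup described above is a point-in-polygon search over GeoJSON features. The
project performs it with Turf.js in JavaScript; the sketch below is a language-neutral
illustration in Python using the standard ray-casting test, not the project's code. The
function names and the idea of returning the feature's properties are assumptions.

```python
def point_in_ring(lon: float, lat: float, ring: list) -> bool:
    """Ray-casting test: count edge crossings of a ray going east from the point."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon: float, lat: float, polygon: list) -> bool:
    """A GeoJSON polygon is an exterior ring followed by optional holes."""
    if not point_in_ring(lon, lat, polygon[0]):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in polygon[1:])


def find_zone(lon: float, lat: float, zones: dict):
    """Return the properties of the first zone containing the point, or None."""
    for feature in zones["features"]:
        geometry = feature["geometry"]
        if geometry["type"] == "Polygon":
            polygons = [geometry["coordinates"]]
        elif geometry["type"] == "MultiPolygon":
            polygons = geometry["coordinates"]
        else:
            continue  # skip non-area geometries
        if any(point_in_polygon(lon, lat, polygon) for polygon in polygons):
            return feature["properties"]
    return None
```

A linear scan over every feature is adequate for a few hundred zones; a spatial index would
only be worth adding for much larger zone sets.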
Mobile Appearance
Considerations have been made for the mobile friendliness of the web application. The
content on the site scales to fit the current viewport. Additionally, on smaller
viewports, the search expands to fill the entire page, making the results clearer to the user.
Together, these features improve the user experience and user-friendliness of the web
application.
5.5 Testing
Manual Testing
Manual testing has been performed to ensure that the web application and its inputs
behave as intended.
Functional Requirements
Functional testing has also been used, testing against the requirements defined in
Section 3.1.1.
Chapter 6
Results and Discussion
For the trip distance predictions needed for the Uber fare amount, the Haversine distance
was used instead.
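For reference, the Haversine (great-circle) distance between two latitude and longitude
points can be computed as in the sketch below. The use of miles and the mean Earth radius
constant are assumptions made for illustration; the function name is hypothetical.

```python
import math

# Mean Earth radius in miles (assumed constant for this sketch).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

This is a straight-line distance over the Earth's surface, so it will generally understate
the distance actually driven along a street grid.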
The Uber fare estimate can be compared against the Uber app (Uber Inc., 2023b). For
the same journey from the Empire State Building to Central Park at 9am, Uber quotes a
fare of $28.92 and the model estimates $28.60, a difference of $0.32 (about 1.1%). This
single comparison suggests that the model could be suitable for predictions, despite the
high RMSE.
#   Requirement                                                                       Pass/Fail
1   The web app should have a user-friendly interface                                 Pass
2   The web app should have high performance and return queries in a reasonable time  Pass
3   The web app should be responsive to a variety of screen sizes                     Pass
4   The web app should be compatible on major browsers                                Pass
Requirement number 7 was not achieved due to time constraints and because the model for
Trip Distance did not achieve the desired accuracy. This could have been accomplished had
more time and research been given to the implementation and, as stated previously, OSRM
(The OSRM Contributors, 2024) could be used.
H3 could improve this work by indexing the latitude and longitude points in built-up
areas, for instance at the level of a city block. This could potentially allow for better
accuracy by increasing the amount of geospatial detail in a given area.
By creating a Progressive Web App (PWA), the mobile experience of the implementation could
be greatly improved, with better ease of use and user experience.
Upon receiving an HTTPS request with the origin and destination of the trip, the Google
Distance Matrix API can return the trip distance in kilometres or miles, as well as the
estimated travel time in traffic. This could potentially increase the accuracy of the models.
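A hedged sketch of how such a request might be formed and its response read is shown
below. It assumes the JSON endpoint of the Distance Matrix API with the origins,
destinations, units, departure_time and key parameters, and a response containing
rows and elements with distance and duration values. The function names and the API key
placeholder are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

DISTANCE_MATRIX_URL = "https://maps.googleapis.com/maps/api/distancematrix/json"


def build_distance_matrix_url(origin, destination, api_key, units="imperial"):
    """Build a request URL from (lat, lon) pairs; api_key is a placeholder."""
    params = {
        "origins": f"{origin[0]},{origin[1]}",
        "destinations": f"{destination[0]},{destination[1]}",
        "units": units,
        "departure_time": "now",  # asks for a traffic-aware duration
        "key": api_key,
    }
    return f"{DISTANCE_MATRIX_URL}?{urlencode(params)}"


def parse_distance_matrix(response: dict):
    """Return distance (metres) and duration (seconds) for a single trip."""
    element = response["rows"][0]["elements"][0]
    if element.get("status") != "OK":
        return None  # no route found or invalid input
    # Prefer the traffic-aware duration when the response includes it.
    duration = element.get("duration_in_traffic", element["duration"])
    return {
        "distance_m": element["distance"]["value"],
        "duration_s": duration["value"],
    }
```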
Chapter 7
Conclusions
The aim of this project was to enable users to compare the fares of New York City
taxis and Ubers, whilst also demonstrating a complete implementation of the end-to-end
machine learning life cycle.
As the literature survey in Chapter 2 showed, the comparison of point-to-point Uber
and New York City taxi fares had not been achieved in prior literature. As shown,
the zone-based identifiers for locations can make developing machine learning solutions
quite complicated.
The requirements and analysis in Chapter 3 showed the importance of choosing appropriate
requirements, with the outlined functional and non-functional requirements guiding the
approach of the implementation.
The implementation for estimating Uber fares provides a suitable benchmark for future
work, and the initial results shown here are promising. Additionally, the web application
demonstrated the importance of visualisation, providing a user-friendly and easy-to-use
way to query the models.
• Developing and deploying machine learning models capable of predicting New York
City taxi and rideshare fares.
inference using the deployed models, presenting fare predictions in a clear and
intuitive manner
Bibliography
Amazon Web Services. Machine learning best practices. Technical report, Amazon Web
Services, 2023.
City of New York Department of Transportation. DOT Traffic Speeds NBE, 2023. URL
https://data.cityofnewyork.us/Transportation/DOT-Traffic-Speeds-NBE/i4gi-tjb9.
Accessed: December 2, 2023.
H. Deng, Y. Zhou, L. Wang, and C. Zhang. Ensemble learning for the early prediction
of neonatal jaundice with genetic features. BMC Medical Informatics and Decision
Making, 21, 12 2021. doi: 10.1186/s12911-021-01701-9.
L. Etaati. Azure Databricks, pages 159–171. Apress, Berkeley, CA, 2019. ISBN
978-1-4842-3658-1. doi: 10.1007/978-1-4842-3658-1_10. URL
https://doi.org/10.1007/978-1-4842-3658-1_10.
A. Gupta, S. Sharma, S. Goyal, and M. Rashid. Novel xgboost tuned machine learning
model for software bug prediction. pages 376–380, 06 2020. doi: 10.1109/ICIEM48762.
2020.9160152.
M. Hartl. Ruby on rails tutorial: learn Web development with rails. Addison-Wesley
Professional, 2015.
IETF GeoJSON Working Group. The GeoJSON Format (RFC 7946). RFC 7946, RFC
Editor, August 2016. URL https://datatracker.ietf.org/doc/html/rfc7946.
IT Services’ Research and Innovation team. Sheffield HPC Documentation, 2023. URL
https://docs.hpc.shef.ac.uk/en/latest/. Accessed: Dec 2, 2023.
G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu.
LightGBM: A highly efficient gradient boosting decision tree. In I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 30. Curran Associates,
Inc., 2017. URL
https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
New York City Department of City Planning. Department of City Planning Facilities
Database (FacDB), 2023. URL
https://www.nyc.gov/site/planning/data-maps/open-data/dwn-selfac.page.
Accessed: December 2, 2023.
NYC Taxi and Limousine Commission. NYC Taxi Zones Map, 2023. URL
https://data.cityofnewyork.us/d/d3c5-ddgc?category=Transportation&view_name=NYC-Taxi-Zones.
Accessed: Dec 2, 2023.
NYC Taxi and Limousine Commission. Data dictionary – yellow taxi trip records.
https://data.cityofnewyork.us/api/views/biws-g3hs/files/eb3ccc47-317f-4b2a-8f49-5a684b0b1ecc?download=true&filename=data_dictionary_trip_records_yellow.pdf,
2023a. Accessed: Nov 15, 2023.
NYC Taxi and Limousine Commission. TLC trip record data.
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page, 2023b. Accessed: Nov 15, 2023.
Office of Technology and Innovation (OTI). Points of interest, 2023. URL
https://data.cityofnewyork.us/City-Government/Points-Of-Interest/rxuy-2muj.
Accessed: December 2, 2023.
J. Saltz. CRISP-DM is still the most popular framework for executing data
science projects. Data Science Project Management, 2024. URL
https://www.datascience-pm.com/crisp-dm-still-most-popular/.
T. Schneider. Taxi and ridehailing app usage in New York City.
https://toddwschneider.com/dashboards/nyc-taxi-ridehailing-uber-lyft-data/, 2024.
toddwschneider.com.
J. VanderPlas. Python data science handbook: Essential tools for working with data.
O’Reilly Media, Inc., 2016.
S. Woodhouse. NYC taxi cab fares to rise 23% in first increase since 2012. Bloomberg,
November 2022. URL
https://www.bloomberg.com/news/articles/2022-11-15/nyc-taxi-cab-fares-to-rise-23-in-first-increase-since-2012?utm_source=website&utm_medium=share&utm_campaign=copy.