TEAM 5 Report
REPORT
Team Members:
Soumya Sharma
Vedant Sarode
Tejas Waidande
Title
Efficient Data Pipeline Implementation for NYC Green Taxi Trip Records.
Abstract
Introduction:
This project establishes an automated data pipeline for the
extraction, processing, and analysis of NYC Green Taxi trip data
for the year 2022. The pipeline streamlines data handling to
support efficient and scalable analysis.
Application Problem:
The 2022 datasets held by green taxi Technology Service Providers (TSPs)
run into hundreds of gigabytes, which presents challenges in data
ingestion, cleaning, and transformation for meaningful analysis.
Chapter 1
Introduction
1.1 Description of Application Domain
New York City’s transportation network is one of the most complex
in the world, with a diverse array of transportation modes, including
taxis. Among these, the green taxis, which serve the outer boroughs
and northern Manhattan, represent a crucial component of the city's
efforts to ensure comprehensive transportation coverage. The
availability and analysis of green taxi trip data provide valuable
insights into travel patterns, demand fluctuations, and service
efficiency, making it an essential dataset for urban planners,
policymakers, and researchers.
1.4 Problem on Technical Level
On the technical level, the primary challenges include:
Chapter 2 - Related Work
Recently, both the research community and industry have shown
strong interest in designing automated data pipelines for
managing and analysing large volumes of transportation data, and
researchers have proposed a range of techniques for processing
such data efficiently. This section critically reviews related work
on automated pipelines for transportation data management.
transportation data within the AWS setting. In this study, S3 was used
for data storage, while EC2 instances were leveraged for executing
the processing tasks. It enabled scalable and flexible data processing
capabilities to accommodate the fluctuating demands in
transportation data analysis.
In another study, Gupta et al. (2019) built a data pipeline for analysing
public transit data for San Francisco, California. The pipeline aimed
to improve service reliability and passenger satisfaction by providing
insight into transit operations and areas for improvement. By
delivering comprehensive and actionable data, it helped
decision-makers make better decisions. This shows the importance
of a data-driven methodology for solving problems in urban
transportation.
These studies highlight the vital role that automated data
pipelines play in urban transport planning and policymaking.
Because these pipelines supply detailed and accurate data,
policymakers can make informed decisions that improve the
efficiency and effectiveness of urban transport systems.
Conclusion
Chapter 3 – Dataset
This chapter describes the dataset used in the study: its source,
structure, and key variables. Our dataset is the NYC Green Taxi Trip
Records, public information available on the City of New York's
open data portal. It is a valuable resource for urban transportation
analysis and planning because it offers rich insights into green taxi
trips across New York City.
Data Source
Our data comes from the New York City open data portal, which
encourages transparency and innovation by publishing a wide range of
datasets about the city. Our specific dataset was retrieved from
https://data.cityofnewyork.us/resource/8nfn-ifaj.json. The portal
publishes this data to benefit research and development and to
support policymakers.
Overview of Data
The NYC Green Taxi Trip Records dataset provides trip-level details
for all trips by green taxis, also called Boro Taxis, which are allowed
to pick up passengers in the outer boroughs and northern Manhattan.
The dataset spans several years and so provides an opportunity to
analyse taxi operations over time.
Each record describes a single taxi trip, and the key attributes
include:
Surcharge: This refers to additional fees, like the New York State
Congestion Surcharge.
Tip Amount: It refers to how much money the driver was tipped for
this ride.
Tolls Amount: It refers to what has been paid in terms of tolls during
the ride.
Total Amount: It refers to the total paid amount, which includes fare,
surcharges, tips, and tolls.
Payment Type: How the fare was paid (e.g., credit card, cash).
Trip Type: Whether the trip was dispatched by a base or was a
street-hail.
The size of the dataset and the inconsistencies in how data is entered
make it necessary to assess data quality. This involves handling
missing values, detecting and treating outliers, and correcting
inconsistencies. For instance, trips with negative distances or
fares have to be verified and potentially removed. We also check
that geographic coordinates are correct, that is, that they fall
within the area of New York City.
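The checks described above can be sketched in Python as follows. This is a minimal illustration, not the project's actual cleaning code: the field names (`trip_distance`, `fare_amount`, `pickup_latitude`, `pickup_longitude`) and the approximate New York City bounding box are assumptions made for the example, and the sketch simply drops failing records rather than flagging them for manual verification.

```python
# Approximate bounding box for New York City (assumed values for illustration).
NYC_LAT_RANGE = (40.4774, 40.9176)
NYC_LON_RANGE = (-74.2591, -73.7004)


def to_float(value):
    """Convert a raw field to float, returning None if it is missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def is_valid_trip(record):
    """Return True if the record passes the basic quality checks."""
    distance = to_float(record.get("trip_distance"))
    fare = to_float(record.get("fare_amount"))
    # Reject missing or malformed values and negative distances or fares.
    if distance is None or fare is None or distance < 0 or fare < 0:
        return False
    # Coordinates are checked only when the record carries them.
    lat = record.get("pickup_latitude")
    lon = record.get("pickup_longitude")
    if lat is not None and lon is not None:
        lat, lon = to_float(lat), to_float(lon)
        if lat is None or lon is None:
            return False
        if not (NYC_LAT_RANGE[0] <= lat <= NYC_LAT_RANGE[1]):
            return False
        if not (NYC_LON_RANGE[0] <= lon <= NYC_LON_RANGE[1]):
            return False
    return True


def clean_trips(records):
    """Keep only the records that pass is_valid_trip."""
    return [record for record in records if is_valid_trip(record)]
```

Keeping the rule in a single predicate means the same check can be reused later in the pipeline, for example as a filter step.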
Conclusion
The New York City Green Taxi Trip Records dataset is available
through the City of New York Open Data platform and provides a
rich resource for understanding the dynamics of urban
transport. Using this dataset, we aim to uncover valuable
insights into taxi usage, fare structures, and travel behaviours in
New York City. Building on this, our project developed a robust data
pipeline for efficient data management and analysis that supports
data-driven decision-making for urban transport systems.
Chapter 4 – Solution
In this chapter, we'll take you behind the scenes of our project,
where we built a data pipeline for NYC green taxi trip data using the
powerful tools of Google Cloud Platform (GCP). Our goal was to
create a seamless flow of data from extraction to visualization, and
we're excited to share the journey with you. From tapping into the
data sources to transforming it into actionable insights, we'll walk
you through each stage of the process. You'll get to see the specific
technologies and tools we used to overcome challenges and
achieve our objectives.
Data Extraction
We identified the source of the data, which is the NYC Green Taxi
Trip Records dataset available on the NYC Open Data Portal.
We then wrote a Python script to fetch and process data from a
specified URL that serves data in JSON format.
‘fetch_data’ function
The core functionality of the script is encapsulated in the
‘fetch_data’ function, which takes a URL as its parameter.
This function is designed to be flexible and reusable for different
URLs that provide data in JSON format.
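A function of this shape might look like the sketch below. It is an assumption-laden illustration rather than the project's script: it uses Python's standard-library `urllib` instead of whichever HTTP client the original used, and the `timeout` parameter is added for the example.

```python
import json
import urllib.request


def fetch_data(url, timeout=30):
    """Fetch a JSON document from `url` and return the parsed Python object."""
    # urlopen raises urllib.error.URLError on connection problems.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = response.read().decode("utf-8")
    # json.loads raises json.JSONDecodeError if the body is not valid JSON.
    return json.loads(payload)
```

Because the URL is the only required argument, the same function can be pointed at any other endpoint that returns JSON.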
Data Ingestion
A Google Cloud Storage (GCS) bucket stores the raw data produced by
the ingestion process.
The bucket acts as a centralized repository for large volumes of raw
data, with storage that is both scalable and durable.
Data is ingested into the bucket through APIs and GCP services
such as Cloud Functions.
Extracted data is uploaded to the bucket, where it is kept in its
native format.
This separates ingestion from the later processing and analysis
stages, which keeps the pipeline flexible and scalable.
In addition, GCS offers reliable, secure storage, so our data is
safe and easily retrievable for further processing and analysis.
With GCS and Cloud Functions, we built an effective data ingestion
pipeline that can handle large volumes of data from various
sources.
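The upload step could be sketched as follows, assuming the `google-cloud-storage` client library. The helper names, the object naming scheme, and the choice of newline-delimited JSON (convenient for line-oriented readers such as `beam.io.ReadFromText()`) are illustrative assumptions, not details taken from the project.

```python
import json
from datetime import date


def to_ndjson(records):
    """Serialize a list of dicts as newline-delimited JSON (one record per line)."""
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)


def raw_object_name(run_date, prefix="raw/green_taxi"):
    """Build a dated object name such as raw/green_taxi/2022-01-31.json."""
    return f"{prefix}/{run_date.isoformat()}.json"


def upload_raw(bucket_name, object_name, payload):
    """Upload a text payload to a GCS bucket without modifying it."""
    # Imported here so the helpers above work without the GCP client installed.
    from google.cloud import storage

    client = storage.Client()  # uses the environment's default credentials
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_string(payload, content_type="application/x-ndjson")
```

A call such as `upload_raw("example-bucket", raw_object_name(date(2022, 1, 31)), to_ndjson(records))` would then store one dated object per run, where `example-bucket` is a placeholder name.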
Data Transformation
In our project, the data transformation phase is the crux of
the data pipeline, where data is cleaned, enriched, and shaped correctly
for analysis. The following outlines the detailed steps and
transformations applied:
Tools and Frameworks Used
Google Cloud Platform (GCP) Dataflow API: Used for scalable
data processing.
Apache Beam: Integrated with Python to build and run the
Dataflow pipeline.
Beam Transformations: Specifically, ‘beam.Map()’,
‘beam.io.ReadFromText()’, ‘beam.Filter()’, and
‘beam.io.WriteToBigQuery()’.
5. Writing to BigQuery
Used ‘beam.io.WriteToBigQuery()’ to write the transformed data to
a BigQuery table.
These transformations are crucial for ensuring that the data is clean,
enriched, and correctly formatted for efficient storage and analysis in
BigQuery. This part of the pipeline is executed as a job in Dataflow,
providing a scalable solution for handling large volumes of transportation
data.
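A pipeline built from the listed Beam transforms might be wired together as in the sketch below. The individual steps, field names (`lpep_pickup_datetime`, `trip_distance`, `total_amount`), and BigQuery schema are assumptions chosen for illustration; the project's actual transformations may differ.

```python
import json


def parse_line(line):
    """Parse one newline-delimited JSON line into a dict."""
    return json.loads(line)


def has_valid_amounts(record):
    """Keep records whose distance and total amount are present and non-negative."""
    try:
        return float(record["trip_distance"]) >= 0 and float(record["total_amount"]) >= 0
    except (KeyError, TypeError, ValueError):
        return False


def to_row(record):
    """Shape a record into the typed row written to BigQuery."""
    return {
        "pickup_datetime": record.get("lpep_pickup_datetime"),
        "trip_distance": float(record["trip_distance"]),
        "total_amount": float(record["total_amount"]),
    }


def run(input_path, output_table, pipeline_args=None):
    """Build and run the pipeline; Dataflow settings are passed via pipeline_args."""
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Parse" >> beam.Map(parse_line)
            | "FilterInvalid" >> beam.Filter(has_valid_amounts)
            | "ToRow" >> beam.Map(to_row)
            | "Write" >> beam.io.WriteToBigQuery(
                output_table,
                # The timestamp is kept as a string in this sketch.
                schema="pickup_datetime:STRING,trip_distance:FLOAT,total_amount:FLOAT",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

Running the same code on Dataflow rather than locally is a matter of the options passed in `pipeline_args`, such as the runner, project, and region.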
DATA ANALYSIS & VISUALIZATION
User Story 1: Calculate Daily Earnings
As a data analyst, I want to calculate the total daily earnings for green
taxi trips, so that I can provide insights into the daily revenue trends for
taxi operators.
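The calculation behind this user story can be sketched in plain Python as below. The field names `lpep_pickup_datetime` and `total_amount` are assumed, and in the project itself the aggregation may well have been done as a query over the BigQuery table instead.

```python
from collections import defaultdict
from datetime import datetime


def daily_earnings(trips):
    """Sum total_amount per pickup date; returns {date: total} sorted by date."""
    totals = defaultdict(float)
    for trip in trips:
        # Group by the calendar date of the pickup timestamp.
        pickup = datetime.fromisoformat(trip["lpep_pickup_datetime"])
        totals[pickup.date()] += float(trip["total_amount"])
    return dict(sorted(totals.items()))
```

The resulting date-to-total mapping is what a daily earnings chart would plot.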
2. Monthly Variations:
Earnings also seem to fluctuate within each month, meaning specific
days of the month might be either high- or low-demand.
3. Special Events:
There are clear spikes in earnings around certain dates that
probably correspond to holidays, events, or adverse weather
events that drive increased taxi usage.
4. General Growth:
A slight increase in average daily earnings can be observed over
the months, suggesting a possible upward trend.
Analysing total daily earnings from green taxi trips gives insight
into revenue trends and patterns. Understanding these trends helps
taxi operators optimize operations: they can schedule maintenance
during periods of low demand and prepare for days with high demand.
The same insight can help policymakers make informed decisions about
transportation regulations and support for the taxi industry.
Above is a bar chart representing the top 10 high-demand pickup
locations for green taxi trips. The x-axis represents the pickup location
IDs, and the y-axis represents the number of pickups. This chart
represents the demand for taxi services at different locations.
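The ranking behind such a chart can be sketched as follows. The field name `pulocationid` is an assumption about how the pickup location ID is stored, and the function is illustrative rather than the project's own query.

```python
from collections import Counter


def top_pickup_locations(trips, n=10):
    """Return the n most frequent pickup location IDs as (location_id, count) pairs."""
    counts = Counter(
        trip["pulocationid"] for trip in trips if trip.get("pulocationid") is not None
    )
    # most_common sorts by count, highest first.
    return counts.most_common(n)
```

The returned pairs map directly onto the chart: location IDs on the x-axis and pickup counts on the y-axis.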
2. Distribution of Demand:
Two of the top locations, IDs 74.0 and 75.0, see much
higher demand than any others.
From there, demand decreases gradually down the list,
showing that taxi use is unevenly distributed across locations.
4. Service Optimization:
Understanding the demand patterns helps in reallocating
resources effectively.
Taxi operators can deploy more vehicles in high-demand areas
during peak times to meet passenger needs.
Identifying high-demand pickup locations for green taxi trips is
important for optimizing the placement of taxi stands and increasing
service availability. By prioritizing these locations, transport
planners can give passengers better access to taxi services,
reducing waiting times and increasing overall efficiency. This
data-driven approach supports informed decisions about the strategic
allocation of resources.
Chapter 5 - Summary and Outlook
Our Results
Using our data pipeline, the following insights are presented for the
operations of green taxi services in New York City:
Revenue Patterns: The analysis revealed clear patterns of revenue
fluctuation, driven by seasonal and other effects, which help
operators understand the sources of changes in income.
High-Demand Pickup Locations: The ranking of pickup locations
gives practical guidance on how to optimize taxi stand placement
and improve service coverage.
Operational Efficiency: Together, these insights can help taxi
companies allocate resources more cheaply, reduce idle times, and
increase their overall efficiency.
Future Work
There are some very exciting avenues for future work building upon our
results:
Policy Change Impact: Analysis of how policy changes affect taxi
operations in order to make urban transportation planning
sustainable and data driven.
Bibliography
1. https://www.sciencedirect.com/science/article/pii/S2667305322000722
2. https://www.sciencedirect.com/science/article/abs/pii/S0377221720305816
3. https://acp.copernicus.org/articles/19/13519/2019/
4. https://www.sciencedirect.com/science/article/abs/pii/S0377221722003514#preview-section-abstrac