TEAM 5 Report

DATA ENGINEERING 2

REPORT

New York Taxi

(Green Taxi Trip 2022 dataset)

Team Members:
Soumya Sharma
Vedant Sarode
Tejas Waidande

Title
Efficient Data Pipeline Implementation for NYC Green Taxi Trip Records.

Abstract
Introduction:
The project focuses on establishing an automated data pipeline for the
extraction, processing, and analysis of NYC Green Taxi trip data from
the year 2022. The pipeline streamlines data handling to support
efficient and scalable analysis.

Application Problem:
The 2022 datasets held by green taxi Technology Service Providers (TSPs)
run into hundreds of gigabytes and present challenges in data ingestion,
cleaning, and transformation for meaningful analysis.

Summary of Own Approach:

We decided to create an automated data pipeline that takes in data from
online sources, cleans and transforms it, and prepares it for in-depth
analysis. Our approach leverages modern data engineering tools and
techniques to make data analysis efficient, scalable, and insightful.

Summary of Own Results:

The resulting pipeline will be robust, scalable, and adaptable, making it a
valuable tool for researchers, city planners, and policymakers interested
in analysing green taxi services in New York City. This project aims to
simplify access to complex data and provide insights that can help
improve urban transportation planning and policy decisions.

Chapter 1
Introduction
1.1 Description of Application Domain
New York City’s transportation network is one of the most complex
in the world, with a diverse array of transportation modes, including
taxis. Among these, the green taxis, which serve the outer boroughs
and northern Manhattan, represent a crucial component of the city's
efforts to ensure comprehensive transportation coverage. The
availability and analysis of green taxi trip data provide valuable
insights into travel patterns, demand fluctuations, and service
efficiency, making it an essential dataset for urban planners,
policymakers, and researchers.

1.2 Problem on Application Level

The primary challenge on the application level is the efficient
handling and analysis of large volumes of trip data generated by
NYC's green taxis. The sheer volume and complexity of the data
make it difficult to derive actionable insights without a robust and
scalable data processing solution. This problem is compounded by
the necessity to clean and transform the raw data into a format
suitable for analysis, addressing issues such as missing or
inconsistent entries.

1.3 Benefits of a Solution on Application Level

A well-designed data pipeline for processing green taxi trip
data offers numerous benefits:

 Enhanced Decision-Making: By providing clean, structured
data, the pipeline facilitates data-driven decision-making for
urban planners and policymakers.
 Improved Service Efficiency: Analysing trip data helps identify
patterns and trends, leading to more efficient taxi service
deployment and improved passenger satisfaction.
 Resource Allocation: Insights from the data can guide the
allocation of resources, such as the placement of taxi stands
and the planning of public transportation routes.
 Policy Development: Policymakers can use the data to
develop targeted policies aimed at improving urban mobility and
reducing traffic congestion.

1.4 Problem on Technical Level
On the technical level, the primary challenges include:

 Data Extraction: Efficiently extracting large datasets from
various sources, including APIs provided by green taxi
Technology Service Providers (TSPs).
 Data Cleaning: Handling missing, duplicate, and inconsistent
entries to ensure the data's integrity and reliability.
 Data Transformation: Converting raw data into a structured
format suitable for analysis, involving processes such as
aggregations, filtering, and joining multiple data sources.
 Scalability: Ensuring the solution can handle growing data
volumes without compromising performance.
 Automation: Developing a pipeline that automates the
extraction, cleaning, and transformation processes to minimize
manual intervention and errors.

1.5 Technical Solution Idea

To address these technical challenges, we designed an
automated data pipeline using a combination of Google Cloud
Platform (GCP) services and modern data engineering tools:

 Data Extraction: We used Python scripts to interact with the
TSPs' APIs and extract the green taxi trip data. This approach
allows for flexible and efficient data retrieval.
 Data Storage: Extracted data is stored in Google Cloud
Storage (GCS) buckets, providing scalable and durable storage.
 Data Processing: We utilized Apache Beam and Google
Dataflow for the data pipeline. Apache Beam allows for writing
batch and streaming data processing pipelines, while Dataflow
executes these pipelines in a fully managed environment.
 Data Transformation: Various transformations, such as
cleaning, filtering, and aggregating data, are performed within
the Apache Beam pipelines to prepare the data for analysis.
 Data Analysis: Processed data is stored in Google BigQuery,
a fully managed data warehouse, enabling fast and efficient
querying for analysis and reporting.

Chapter 2 - Related Work
Recently, both the research community and industry have shown
strong interest in designing automated data pipelines for managing
and analysing large volumes of transportation data, and researchers
have proposed a range of techniques for processing such data
efficiently. This chapter critically reviews related work in the
domain of automated transportation data management pipelines.

 Data Extraction and Processing Techniques

Several studies have focused on the design of methods for the
extraction and processing of transportation data. Zhang et al. (2019)
proposed a technique for extracting taxi trip data from various
online sources using web scraping. The extracted data were then
processed with Apache Spark, demonstrating the power and efficiency
of distributed computing frameworks in handling huge datasets. This
approach allowed large amounts of data to be processed quickly and
effectively, which is vital for real-time applications and analysis.

Similarly, Smith et al. (2020) developed a systematic pipeline for
processing ride-sharing data. In this pipeline, a set of Python scripts
extracted and preprocessed the data, which was kept in cloud-based
storage. Combining Python's versatility with the scalability of cloud
storage made it possible to handle ride-sharing data efficiently at
large scale. These studies underline the importance of efficient data
extraction and processing techniques, and the need for robust and
scalable frameworks that can cope with the volume and velocity of
transportation data.

 Use of Cloud Computing Platforms

Cloud computing platforms such as Google Cloud Platform (GCP)
and Amazon Web Services (AWS) have become popular for implementing
large data pipelines because of their scalability, flexibility, and
extensive infrastructure. Li et al. (2018) demonstrated this by
developing a cloud-based pipeline for processing transportation data
on AWS. In this study, S3 was used for data storage, while EC2
instances executed the processing tasks. This setup provided scalable
and flexible processing capacity that could accommodate the
fluctuating demands of transportation data analysis.

Our project builds on this foundation, employing GCP services such as
Google Cloud Storage (GCS), BigQuery, and Dataflow. GCS provides
reliable and highly available storage for objects of any size,
BigQuery offers data warehousing and analytics capabilities, and
Dataflow runs the data processing as a managed service. Using these
services, our project extracts, stores, and processes the data in a
way that lets the system handle large datasets effectively and
reliably.

 Application in Urban Transport Planning and Policymaking

Automated data pipelines have found many applications in urban
transport planning and policymaking. For example, Chen et al. (2017)
built a comprehensive data pipeline for analysing transportation data
in Beijing, China. The analysis was intended to support policy
decisions on traffic management and public transportation planning,
and demonstrated how such a pipeline could improve urban transport
systems. With accurate data delivered on time, the pipeline allows
policymakers to make data-driven decisions that improve both traffic
flow and the efficiency of the public transportation system.

In another study, Gupta et al. (2019) built a data pipeline for analysing
public transit data for San Francisco, California. This pipeline aimed
to improve service reliability and passenger satisfaction by providing
insights into transit operations and areas for improvement. By
delivering comprehensive and actionable data, it helped decision-makers
make better decisions. This shows the importance of a data-driven
methodology when addressing urban transportation problems.

These studies highlight the vital role that automated data pipelines
play in urban transport planning and policymaking. Because these
pipelines supply detailed and accurate data, policymakers can make
informed decisions that shape the efficiency and effectiveness of
urban transport systems.

 Integration of Machine Learning Techniques

The integration of machine learning techniques in data pipelines has
significantly improved the analysis of transportation data. For
instance, Wang et al. (2020) developed a data pipeline using machine
learning models to predict taxi demand in an urban area. The model
predicted future taxi demand by analysing historical taxi trip data
together with external influences such as weather and local events.
Such predictions matter for resource allocation and service planning
because they let transportation providers optimize operations based
on expected demand.

Integrating machine learning techniques within these data pipelines
improves their analytical capabilities and provides actionable
intelligence that can drive operational efficiencies and strategic
decision-making. Although our project focuses on data extraction and
processing, we see the integration of machine learning techniques as
an opportunity for further research. Adding machine learning models
to our data pipeline would improve its ability to provide predictive
insights and support more informed decision-making.

 Conclusion

In summary, existing literature has proposed various solutions for
managing and analysing transportation data. Building on these works,
our project introduces a data pipeline to extract, process, and analyse
NYC green taxi trip data using Google Cloud Platform services. This
scalable solution supports urban transportation planning and
policymaking initiatives. By leveraging GCP's advanced capabilities, our
pipeline aims to enhance transportation data management and
analysis, enabling data-driven decisions by policymakers and
planners. Additionally, it lays the groundwork for integrating advanced
analytics, such as machine learning, to provide deeper insights into
urban transportation dynamics.

Chapter 3 – Dataset
This chapter describes the dataset used in the study: its source,
structure, and key variables. Our dataset is the NYC Green Taxi Trip
Records, which is publicly available on the City of New York's open
data portal. The dataset is a valuable resource for urban
transportation analysis and planning because it offers rich insights
into green taxi trips across New York City.
 Data Source

Our source of data is the New York City open data portal, which
encourages transparency and innovation by sharing a wide range of
datasets about the city. Our specific dataset was retrieved from
https://data.cityofnewyork.us/resource/8nfn-ifaj.json. The portal
publishes this data to benefit research and development and to
support policymakers.

 Overview of Data

The NYC Green Taxi Trip Records dataset provides trip-level details
of all rides by green taxis, also called Boro Taxis, which are allowed
to pick up passengers in the outer boroughs and northern Manhattan.
This dataset spans several years and hence provides an opportunity to
analyse taxi operations over time.

Each data record represents a single taxi trip, and the key attributes
include:

 Pickup Date and Time: It refers to when the taxi picked up a
passenger.
 Dropoff Date and Time: It refers to when the passenger was
dropped off.
 Pickup Location: Refers to the latitude and longitude of where the
trip started.
 Dropoff Location: Refers to the latitude and longitude of where the
trip ended.
 Trip Distance: It represents how far the taxi travelled; the unit of
measurement is in miles.
 Fare Amount: It refers to how much money the ride costs.

 Surcharge: This refers to additional fees, like the New York State
Congestion Surcharge.
 Tip Amount: It refers to how much money the driver was tipped for
this ride.
 Tolls Amount: It refers to what has been paid in terms of tolls during
the ride.
 Total Amount: It refers to the total paid amount, which includes fare,
surcharges, tips, and tolls.
 Payment Type: How the fare was paid (e.g., credit card, cash).
 Trip Type: Whether the trip was dispatched by a base or was a
street-hail.

 Data Quality and Preprocessing

The size of the dataset and inconsistencies in how data is entered
make it necessary to examine data quality. This involves handling
missing values, detecting and treating outliers, and correcting
inconsistencies. For instance, trips with negative distances or
fares have to be verified and potentially removed. We also check
that geographic coordinates are plausible, that is, that they fall
within the area of New York City.
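
As a rough illustration of the checks described above, the sketch below rejects records with missing values, unparseable numbers, negative distances or fares, or pickup coordinates outside New York City. The field names (`trip_distance`, `fare_amount`, `pickup_latitude`, `pickup_longitude`) and the bounding box values are assumptions made for this example and are not taken from our pipeline.

```python
# Approximate bounding box for New York City (assumed values for illustration).
NYC_LAT_RANGE = (40.47, 40.93)
NYC_LON_RANGE = (-74.27, -73.68)

# Hypothetical field names; adjust to the actual column names of the dataset.
REQUIRED_FIELDS = ("trip_distance", "fare_amount",
                   "pickup_latitude", "pickup_longitude")


def is_valid_trip(record):
    """Return True if a trip record passes the basic quality checks."""
    # Missing values: every required field must be present and non-empty.
    if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
        return False
    try:
        distance = float(record["trip_distance"])
        fare = float(record["fare_amount"])
        lat = float(record["pickup_latitude"])
        lon = float(record["pickup_longitude"])
    except (TypeError, ValueError):
        return False  # inconsistent entry that cannot be parsed as a number
    # Negative distances or fares are treated as invalid.
    if distance < 0 or fare < 0:
        return False
    # Coordinates must fall inside the assumed NYC bounding box.
    return (NYC_LAT_RANGE[0] <= lat <= NYC_LAT_RANGE[1]
            and NYC_LON_RANGE[0] <= lon <= NYC_LON_RANGE[1])
```

A function of this shape can be used as a filter predicate. Whether a suspicious record is dropped or only flagged for review is a design choice; this sketch simply rejects it.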

 Usage of the Dataset

This dataset is a valuable resource for anyone researching urban
transport. Researchers can investigate patterns of taxi usage, how
different factors affect trip durations and distances, and how
major events or weather influence taxi demand. Urban planners
and policymakers can use these insights to build better
transportation infrastructure, optimize taxi service, and manage
traffic more effectively.

We used this dataset to implement a data pipeline that enables the
extraction, processing, and analysis of NYC green taxi trip data
using services offered by Google Cloud Platform. The objective was
to develop a scalable and efficient tool to support urban transport
planning and policymaking.

 Conclusion

The New York City Green Taxi Trip Records dataset is available
through the City of New York Open Data platform and provides a
rich resource for better understanding the dynamics of urban
transport. Using this dataset, we aim to uncover insights into taxi
usage, fare structures, and travel behaviours in New York City.
Building on this, our project developed a robust data pipeline for
efficient data management and analysis in support of data-driven
decision-making for urban transport systems.

Chapter 4 – Solution
In this chapter, we'll take you behind the scenes of our project,
where we built a data pipeline for NYC green taxi trip data using the
powerful tools of Google Cloud Platform (GCP). Our goal was to
create a seamless flow of data from extraction to visualization, and
we're excited to share the journey with you. From tapping into the
data sources to transforming it into actionable insights, we'll walk
you through each stage of the process. You'll get to see the specific
technologies and tools we used to overcome challenges and
achieve our objectives.

Data Extraction
We identified the source of the data, which is the NYC Green Taxi
Trip Records dataset available on the NYC Open Data portal.
We then wrote a Python script to fetch and process data from a
specified URL that serves data in JSON format.

 Importing Essential Libraries

Our script starts by importing two essential libraries: requests for
handling HTTP requests and pandas for data manipulation and
analysis.

 ‘fetch_data’ function
The core functionality of the script is encapsulated in the
‘fetch_data’ function, which takes a URL as its parameter.
This function is designed to be flexible and reusable for different
URLs that provide data in JSON format.

 Configuring the HTTP GET Request

The script configures the HTTP GET request by defining a
dictionary named params containing a record limit ('$limit':
39656098) to specify the number of records to retrieve and an
application token
('$$app_token':"akmhOJDahBDbT6dtEV8WJETnr") for
authentication or access purposes. The script then sends the GET
request and captures the response in the response variable,
printing the type of the response object as a basic check that a
response was received. The response is expected to be in JSON
format, which is parsed into a Python object named
‘response_data’. Upon successful data retrieval, a message is
printed, and the script uses ‘pandas.json_normalize’ to convert
the JSON data into a flat, tabular structure, creating a pandas
DataFrame stored in the variable ‘df’. The function returns this
DataFrame, ready for further analysis or manipulation. This gives
us a streamlined way to fetch large datasets from web APIs and
structure them for in-depth analysis using pandas.
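
The function described above might look roughly like the following. This is a minimal sketch rather than our exact script: the helper `build_params`, the default limit, the timeout, and the environment variable name `NYC_OPEN_DATA_APP_TOKEN` are assumptions introduced for illustration.

```python
import os


def build_params(limit, app_token=None):
    """Build the query parameters for the GET request."""
    params = {"$limit": limit}
    if app_token:
        # The token is only sent when one is configured.
        params["$$app_token"] = app_token
    return params


def fetch_data(url, limit=1000):
    """Fetch JSON records from `url` and return them as a pandas DataFrame."""
    # Imported here so build_params can be used without these packages.
    import pandas as pd
    import requests

    # NYC_OPEN_DATA_APP_TOKEN is a hypothetical environment variable name.
    params = build_params(limit, os.environ.get("NYC_OPEN_DATA_APP_TOKEN"))
    response = requests.get(url, params=params, timeout=60)
    response.raise_for_status()  # raise an exception on HTTP error codes
    response_data = response.json()
    print(f"Fetched {len(response_data)} records")
    # Flatten the JSON records into a tabular structure.
    return pd.json_normalize(response_data)
```

A call would then look like `df = fetch_data("https://data.cityofnewyork.us/resource/8nfn-ifaj.json", limit=1000)`. Reading the token from the environment keeps it out of the source code, and `raise_for_status()` gives a stricter success check than printing the response type. For very large limits, requesting the data in pages with the `$offset` parameter may be more robust than a single request.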

Data Ingestion
 We use a GCS bucket to store the raw data produced by the
ingestion process, which pulls data through various APIs.
 The bucket serves as a centralized repository for large amounts of
raw data, with storage that is both scalable and durable.
 The data is ingested into the GCS bucket using APIs and services
such as Cloud Functions, among other GCP services.
 Extracted data is uploaded to the GCS bucket, where it
can reside in its native format.
 This mechanism separates ingestion from processing and analysis,
so each stage can be changed and scaled independently.
 In addition, GCS offers reliable, secure storage that
keeps our data safe and easily retrievable for further
processing and analysis.
 With the help of GCS and Cloud Functions, we built a
data ingestion pipeline that can handle large
volumes of data from various sources.
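
The upload step in the list above could be sketched as follows, using the `google-cloud-storage` client library. The object naming scheme (`raw/green_taxi/<date>/trips.json`) and the function names are hypothetical and only illustrate one way to keep raw files organized by extraction date.

```python
from datetime import date


def raw_blob_name(run_date, prefix="raw/green_taxi"):
    """Build a date-partitioned object name for one extracted file."""
    return f"{prefix}/{run_date:%Y/%m/%d}/trips.json"


def upload_raw_file(bucket_name, local_path, run_date):
    """Upload a local file of raw records to the GCS bucket unchanged."""
    # Imported here so raw_blob_name works without the client library.
    from google.cloud import storage

    client = storage.Client()  # uses the environment's default credentials
    blob = client.bucket(bucket_name).blob(raw_blob_name(run_date))
    blob.upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob.name}"
```

For example, `raw_blob_name(date(2022, 1, 5))` gives `raw/green_taxi/2022/01/05/trips.json`. Storing the file as-is keeps ingestion independent of the later transformation stage.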

Data Transformation
In our project, the data transformation phase is the core of
the data pipeline, where data is cleaned, enriched, and shaped
for analysis. The following outlines the tools used and the
transformations applied:

Tools and Frameworks Used
 Google Cloud Platform (GCP) Dataflow API: Used for scalable
data processing.
 Apache Beam: Integrated with Python to build and run the
Dataflow pipeline.
 Beam Transformations: Specifically, ‘beam.Map()’,
‘beam.io.ReadFromText()’, ‘beam.Filter()’, and
‘beam.io.WriteToBigQuery()’.

Data Transformation Steps

1. Data Extraction
Using ‘beam.io.ReadFromText()’, we read the raw data from
Google Cloud Storage.

2. Filtering Missing Values

Applied ‘beam.Filter()’ to remove records with missing values.

3. Creating New Columns

Used ‘beam.Map()’ to create new columns from existing columns.
For example, we might derive ‘trip_duration’ from ‘pickup_datetime’
and ‘dropoff_datetime’.

4. Formatting Records for BigQuery Table Ingestion

Used ‘beam.Map()’ to format and cast types of records to match
BigQuery table schema requirements.

5. Writing to BigQuery
Used ‘beam.io.WriteToBigQuery()’ to write the transformed data to
a BigQuery table.

6. Creating and Running Dataflow Job

Defined and executed the entire pipeline in Dataflow.

These transformations ensure that the data is clean, enriched, and
correctly formatted for efficient storage and analysis in BigQuery.
This part of the pipeline is executed as a job in Dataflow,
providing a scalable solution for handling large volumes of transportation
data.
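
The six steps above can be sketched as a single Apache Beam pipeline. This is an illustrative outline under several assumptions, not our exact job: the raw file is assumed to be newline-delimited JSON, and the field names, the schema string, and the function names are hypothetical.

```python
import json
from datetime import datetime

# Hypothetical field names; the real dataset's column names may differ.
REQUIRED = ("pickup_datetime", "dropoff_datetime",
            "trip_distance", "total_amount")

SCHEMA = ("pickup_datetime:DATETIME,dropoff_datetime:DATETIME,"
          "trip_distance:FLOAT,total_amount:FLOAT,trip_duration:FLOAT")


def has_required_fields(record):
    """Predicate for beam.Filter(): keep records with no missing values."""
    return all(record.get(field) not in (None, "") for field in REQUIRED)


def add_trip_duration(record):
    """Function for beam.Map(): derive trip_duration in minutes."""
    pickup = datetime.fromisoformat(record["pickup_datetime"])
    dropoff = datetime.fromisoformat(record["dropoff_datetime"])
    minutes = (dropoff - pickup).total_seconds() / 60
    return {**record, "trip_duration": minutes}


def to_bigquery_row(record):
    """Function for beam.Map(): cast values to the table's column types."""
    return {
        "pickup_datetime": record["pickup_datetime"],
        "dropoff_datetime": record["dropoff_datetime"],
        "trip_distance": float(record["trip_distance"]),
        "total_amount": float(record["total_amount"]),
        "trip_duration": float(record["trip_duration"]),
    }


def run(input_path, table, pipeline_args=None):
    """Wire the steps together; pipeline_args selects the runner."""
    # Imported here so the helper functions work without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(input_path)
            | "Parse" >> beam.Map(json.loads)
            | "DropMissing" >> beam.Filter(has_required_fields)
            | "AddDuration" >> beam.Map(add_trip_duration)
            | "Format" >> beam.Map(to_bigquery_row)
            | "Write" >> beam.io.WriteToBigQuery(
                table,
                schema=SCHEMA,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```

To execute this as a Dataflow job, `pipeline_args` would typically include `--runner=DataflowRunner` together with the project, region, and a `--temp_location` in GCS. Keeping each transformation as a plain function makes it possible to test the logic locally before running the job.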

DATA ANALYSIS & VISUALIZATION
User Story 1: Calculate Daily Earnings
As a data analyst, I want to calculate the total daily earnings for green
taxi trips, so that I can provide insights into the daily revenue trends for
taxi operators.

Insights from the Visualization

The above line chart reflects the total daily earnings of green taxi trips
throughout the year 2022. Here are a few key observations:
1. Seasonal Trends:
 The earnings exhibit a pattern with fluctuations over the year.
 There are evident peaks and troughs that correspond to various
times of the year, potentially reflecting demand changes.

2. Monthly Variations:
The earnings also seem to fluctuate monthly, meaning specific days
of the month might be either high- or low-demand.

3. Special Events:

There are clear spikes in earnings around certain dates that
probably correspond to holidays, events, or adverse weather
events that drive increased taxi usage.

4. General Growth:
A slight increase in the average daily earnings can be observed
over the months, suggesting a gradual upward trend.

The analysis of total daily earnings from green taxi trips provides
insight into revenue trends and patterns. Understanding these
trends helps taxi operators optimize operations: they can schedule
maintenance during periods of low demand and prepare for days with
high demand. These insights may also help policymakers make informed
decisions about transportation regulations and support for the taxi
industry.
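
The calculation behind this user story is a sum of trip totals grouped by pickup date. The sketch below shows the idea in plain Python on a list of trip records; the field names `pickup_datetime` and `total_amount` are assumptions, and in BigQuery the same result would typically come from a GROUP BY on the pickup date.

```python
from collections import defaultdict


def daily_earnings(trips):
    """Sum total_amount per pickup date; returns {date: earnings} sorted by date."""
    totals = defaultdict(float)
    for trip in trips:
        # The first ten characters of an ISO timestamp are "YYYY-MM-DD".
        day = trip["pickup_datetime"][:10]
        totals[day] += float(trip["total_amount"])
    return dict(sorted(totals.items()))
```

Plotting the resulting dates against their totals yields a line chart of the kind discussed above.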

User Story 2: Identify High-Demand Pickup Locations

As a transportation planner, I want to identify the high-demand pickup
locations for green taxi trips, so that I can optimize the placement of taxi
stands and improve service availability.

Above is a bar chart representing the top 10 high-demand pickup
locations for green taxi trips. The x-axis represents the pickup location
IDs, and the y-axis represents the number of pickups. This chart
represents the demand for taxi services at different locations.

Insights from the Visualization

1. Top Pickup Locations:

 Location ID 74.0: It ranks first in demand, representing close to
140,000 pickups.
 Location ID 75.0 comes second, presenting about 110,000
pickups.
 Other high-demand locations include 41.0, 166.0, 95.0, and
more.

2. Distribution of Demand:
 The top two locations, IDs 74.0 and 75.0, see far more
demand than any others.
 From there, demand decreases gradually down the list,
showing that taxi service use is unevenly distributed.

3. Strategic Placement of Taxi Stands:

 Additional taxi stands should be placed at locations with high
demand.
 Availability of service in these places will cut down on waiting
time for passengers and increase the overall performance of the
service.

4. Service Optimization:
 Understanding the demand patterns helps in reallocating
resources effectively.
 Taxi operators can deploy more vehicles in high-demand areas
during peak times to meet passenger needs.
Identifying high-demand pickup locations for green taxi trips is
important for optimizing the placement of taxi stands and increasing
service availability. By prioritizing these locations, transport
planners can give passengers better access to taxi services,
reducing waiting times and increasing overall efficiency. This
data-driven approach supports informed decision-making and the
strategic allocation of resources.
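
The ranking used in this user story is a count of trips per pickup location, keeping the ten largest. A minimal sketch in plain Python follows; the field name `pickup_location_id` is an assumption made for the example.

```python
from collections import Counter


def top_pickup_locations(trips, n=10):
    """Return the n most frequent pickup location IDs with their pickup counts."""
    counts = Counter(trip["pickup_location_id"] for trip in trips)
    return counts.most_common(n)
```

The result is a list of `(location_id, pickups)` pairs ordered from highest to lowest demand, which maps directly onto the bars of the chart.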

Chapter 5 - Summary and Outlook
Our Results
Using our data pipeline, we obtained the following insights into the
operations of green taxi services in New York City:
 Patterns of Revenues: The analysis revealed clear patterns of
revenue fluctuation driven by seasonal and other effects, which
help operators understand the sources of changes in income.
 High-Demand Pickup Locations: The ranking of pickup locations
gives a practical basis for optimizing taxi stand placement and
improving service coverage.
 Operational Efficiency: Together, these insights can help taxi
companies allocate resources more economically, reduce idle times,
and increase their overall efficiency.

Future Work
There are some very exciting avenues for future work building upon our
results:

 Advanced Machine Learning Models: More sophisticated
models can be used to predict taxi demand more accurately.
 Integration of Additional Data Sources: Further external data
sources such as weather forecasts, schedules of mass events,
etc., can be integrated for finer predictions.
 Real-Time Data Processing: Real-time insights should be made
available to make dynamic decisions and operational changes.
 Geospatial Analysis: Fine analysis for finding areas where the
taxi stands are insufficient.
 User Behaviour Analysis: Analysis of user behaviour in an
attempt to get a better understanding of what customers want.

 Policy Change Impact: Analysis of how policy changes affect taxi
operations in order to make urban transportation planning
sustainable and data driven.

Bibliography
1. https://www.sciencedirect.com/science/article/pii/S2667305322000722
2. https://www.sciencedirect.com/science/article/abs/pii/S0377221720305816
3. https://acp.copernicus.org/articles/19/13519/2019/
4. https://www.sciencedirect.com/science/article/abs/pii/S0377221722003514#preview-section-abstrac

THANK YOU!

From Everand
Airbyte for Data Integration Systems: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Edge Computing Applications in Supply Chain Management
From Everand
Edge Computing Applications in Supply Chain Management
Bo Li
No ratings yet
The Study of Building the Data Warehouse
From Everand
The Study of Building the Data Warehouse
venkateswara Rao
No ratings yet
Moving Forward: GIS for Transportation
From Everand
Moving Forward: GIS for Transportation
Terry Bills
No ratings yet
Transportation Management Land & Sea, Aviation and Infrastructure Concepts: Analyzing the influence of Covid on company processes
From Everand
Transportation Management Land & Sea, Aviation and Infrastructure Concepts: Analyzing the influence of Covid on company processes
BoD - Books on Demand
No ratings yet
StreamSets Data Integration Architecture and Design: The Complete Guide for Developers and Engineers
From Everand
StreamSets Data Integration Architecture and Design: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Comprehensive Guide to Data Integration with Hevo: Definitive Reference for Developers and Engineers
From Everand
Comprehensive Guide to Data Integration with Hevo: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
From Everand
Real-Time Analytics: Techniques to Analyze and Visualize Streaming Data
Byron Ellis
No ratings yet
StreamSets Pipeline Design and Best Practices: Definitive Reference for Developers and Engineers
From Everand
StreamSets Pipeline Design and Best Practices: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Kinesis Stream Processing Essentials: Definitive Reference for Developers and Engineers
From Everand
Kinesis Stream Processing Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Airflow for Data Workflow Automation
From Everand
Airflow for Data Workflow Automation
Richard Johnson
No ratings yet
Practical Dataflow Engineering: Definitive Reference for Developers and Engineers
From Everand
Practical Dataflow Engineering: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Architecting Content Delivery Networks: Definitive Reference for Developers and Engineers
From Everand
Architecting Content Delivery Networks: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Data Integration with Blendo: Definitive Reference for Developers and Engineers
From Everand
Data Integration with Blendo: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Principles of Real-Time Data Streaming: Definitive Reference for Developers and Engineers
From Everand
Principles of Real-Time Data Streaming: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Data Pipeline Automation with Airbyte: Definitive Reference for Developers and Engineers
From Everand
Data Pipeline Automation with Airbyte: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Technical Mapping Solutions: Definitive Reference for Developers and Engineers
From Everand
Technical Mapping Solutions: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Data Lakes & Pipelines: A Modern Azure Guide
From Everand
Data Lakes & Pipelines: A Modern Azure Guide
Kameron Hussain
No ratings yet
InfluxDB Essentials: Definitive Reference for Developers and Engineers
From Everand
InfluxDB Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Dataiku Platform Foundations: Definitive Reference for Developers and Engineers
From Everand
Dataiku Platform Foundations: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
From Everand
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
alasdair gilchrist
5/5 (1)
Blue-Green Deployment Engineering: Definitive Reference for Developers and Engineers
From Everand
Blue-Green Deployment Engineering: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
The Ultimate Guide to Unlocking the Full Potential of Cloud Services: Tips, Recommendations, and Strategies for Success
From Everand
The Ultimate Guide to Unlocking the Full Potential of Cloud Services: Tips, Recommendations, and Strategies for Success
Rick Spair
No ratings yet
Application Design: Key Principles For Data-Intensive App Systems
From Everand
Application Design: Key Principles For Data-Intensive App Systems
Rob Botwright
No ratings yet


