The Making of a Data Pipeline: Harsh Kaushik, Avnish Rai, Gaurav Kapasiya, Jai Prakash Bhati
Abstract
This paper details the development and implementation of a data engineering pipeline designed for the
extraction, transformation, and loading (ETL) of data from a web-based directory. The project involves
using asynchronous web scraping techniques to gather user details from a local business directory,
transforming the data into a structured format, and loading it into a storage solution. The pipeline utilises
Python, the HTTPX library for asynchronous HTTP requests, BeautifulSoup for HTML parsing, and
Amazon S3 for data storage. By leveraging these technologies, the pipeline demonstrates an efficient
approach to handling large-scale web data extraction and processing, significantly reducing the time
required to gather and organise data from multiple web pages. This paper provides insights into the
architecture, implementation, and performance of the ETL pipeline, highlighting the benefits and
challenges of using asynchronous programming in data engineering.
1. Introduction
In today's data-driven world, the ability to extract, transform, and load data from various sources is crucial
for businesses and researchers alike. Data engineering pipelines play a pivotal role in this process, enabling
the efficient collection and processing of vast amounts of data. Web scraping, a method for extracting data
from websites, is particularly useful for gathering publicly available information from the internet.
However, traditional web scraping methods can be time-consuming and resource-intensive, especially
when dealing with large datasets or multiple web pages.
This paper presents the development of a robust ETL pipeline designed to scrape user details from
www.local.ch, a local business directory. The pipeline leverages asynchronous programming techniques
to enhance performance and scalability, making it capable of handling a large number of concurrent web
requests. The use of HTTPX for asynchronous HTTP requests, BeautifulSoup for HTML parsing, and
Amazon S3 for data storage ensures that the pipeline is both efficient and reliable.
The implemented ETL pipeline not only focuses on efficiency but also emphasises data accuracy and
integrity. By integrating advanced error-handling mechanisms and retry strategies, the pipeline minimises
data loss and ensures the completeness of the extracted information.
2. Literature Survey
Data engineering has become an essential discipline in the era of big data, enabling the efficient
processing, management, and transformation of vast amounts of data. Data pipelines are fundamental in
this context, facilitating the flow of data from various sources to storage and analytical systems. This
literature survey explores key contributions and methodologies in data engineering, focusing on notable
data pipelines developed by researchers. The surveyed works span batch processing (MapReduce [2], Spark's
resilient distributed datasets [7]), models that unify batch and stream processing (the Dataflow model [1],
the Lambda architecture [3] and its critique [4]), and tooling for data ingestion and workflow orchestration
(Apache NiFi [5], Apache Airflow [6]).
3. Data Characteristics
In the context of this project, we focus on scraping user details from www.local.ch, a prominent local
business directory. The data extracted from this site exhibits several distinct characteristics that are crucial
for the subsequent processing stages. Understanding these characteristics ensures the development of an
efficient and robust ETL pipeline. Key characteristics of the data are outlined below:
● User Details: The primary focus is on extracting detailed user information, including names,
addresses, and contact numbers. This data is typically structured within HTML elements that need to
be accurately parsed to ensure completeness.
● Data Volume: Given the comprehensive nature of www.local.ch, the volume of data can be
substantial. This necessitates the use of asynchronous programming to handle numerous concurrent
web requests efficiently.
● Data Variability: The data may vary significantly in terms of format and completeness. Different
business listings might present user details in various ways, necessitating flexible parsing methods.
● Frequency of Updates: Business listings on www.local.ch are frequently updated to reflect current
information. This characteristic requires the pipeline to be capable of regularly updating the dataset
without redundancy.
● Data Quality Issues: Common issues include incomplete records, duplicates, and inconsistencies in
formatting. These issues necessitate thorough data validation, deduplication, and transformation
processes.
● HTML Structure: The structure of the HTML pages can vary, and it is essential to develop robust
parsing techniques using BeautifulSoup to navigate these variations effectively.
4. Methodology
The methodology for developing the ETL pipeline to scrape user details from www.local.ch involves a
structured approach encompassing data collection, preprocessing, transformation, feature selection, and
storage. The proposed method is represented in several stages, as detailed below:
A. Data Collection
Data collection is the foundational phase of the ETL pipeline. This involves making asynchronous HTTP
requests to www.local.ch, retrieving the HTML content, and parsing it to extract relevant user details. The
process is implemented using the following steps, illustrated by the code sketch after the list:
1. Setting Up Asynchronous HTTP Requests: Using the HTTPX library, asynchronous HTTP requests
are made to www.local.ch to retrieve HTML pages containing business listings.
2. Navigating URLs: The base URL is dynamically constructed to navigate through multiple pages of
listings, ensuring comprehensive data collection.
3. HTML Parsing: BeautifulSoup is used to parse the HTML content and locate elements containing
user details such as names, addresses, and contact numbers.
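A minimal sketch of this collection stage is given below, using Python with HTTPX and BeautifulSoup as described above. The listing URL pattern, CSS class names, and retry limits are illustrative assumptions, since the actual markup of www.local.ch is not documented in this paper.

```python
import asyncio

import httpx
from bs4 import BeautifulSoup

# Assumed listing-page URL pattern; the real pattern on www.local.ch may differ.
BASE_URL = "https://www.local.ch/en/s/{city}?page={page}"


async def fetch_page(client: httpx.AsyncClient, city: str, page: int) -> str:
    """Fetch one listing page and return its HTML, retrying on transient errors."""
    url = BASE_URL.format(city=city, page=page)
    for attempt in range(3):
        try:
            response = await client.get(url, timeout=10.0)
            response.raise_for_status()
            return response.text
        except httpx.HTTPError:
            await asyncio.sleep(2 ** attempt)  # simple exponential backoff
    return ""  # give up on this page after three attempts


def _text(entry, selector: str):
    """Return the stripped text of the first element matching selector, or None."""
    node = entry.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_listings(html: str) -> list[dict]:
    """Extract name, address, and phone number from one page of listings."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "name": _text(entry, ".listing-name"),      # assumed class names
            "address": _text(entry, ".listing-address"),
            "phone": _text(entry, ".listing-phone"),
        }
        for entry in soup.select("div.listing-entry")    # assumed container class
    ]


async def collect(cities: list[str], pages_per_city: int) -> list[dict]:
    """Issue all page requests concurrently and flatten the parsed results."""
    async with httpx.AsyncClient() as client:
        tasks = [
            fetch_page(client, city, page)
            for city in cities
            for page in range(1, pages_per_city + 1)
        ]
        pages = await asyncio.gather(*tasks)
    return [record for html in pages for record in parse_listings(html)]
```

Because asyncio.gather issues all page requests concurrently, the total collection time is governed by the slowest responses rather than their sum, which is the main reason the asynchronous design shortens the scraping run.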
B. Data Preprocessing
Preprocessing ensures that the raw data collected is cleaned and formatted appropriately for further
processing. This involves the following steps, with a brief sketch after the list:
1. Data Validation: Verifying the presence and correctness of key fields such as phone numbers and
addresses using regular expressions and lookup tables.
2. Deduplication: Identifying and removing duplicate records to maintain a clean dataset. Techniques
like hashing and fuzzy matching are used to detect duplicates.
3. Error Handling: Implementing error-handling mechanisms to manage issues like missing fields or
malformed data entries.
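The sketch below illustrates the validation and deduplication steps. The Swiss phone-number pattern and the choice of hash key over the name, address, and phone fields are illustrative assumptions rather than the exact rules used in the pipeline.

```python
import hashlib
import re

# Assumed pattern: +41 numbers or 0-prefixed national numbers, with optional spaces.
PHONE_RE = re.compile(r"^\+?41[\s\d]{9,12}$|^0\d{2}[\s\d]{7,10}$")


def is_valid(record: dict) -> bool:
    """Keep only records with a name and a plausibly formatted phone number."""
    phone = (record.get("phone") or "").strip()
    return bool(record.get("name")) and bool(PHONE_RE.match(phone))


def deduplicate(records: list[dict]) -> list[dict]:
    """Drop exact duplicates by hashing the normalised name/address/phone triple."""
    seen, unique = set(), []
    for record in records:
        key = hashlib.sha256(
            "|".join(
                (record.get(field) or "").lower().strip()
                for field in ("name", "address", "phone")
            ).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def preprocess(records: list[dict]) -> list[dict]:
    """Validate, then deduplicate the raw scraped records."""
    return deduplicate([r for r in records if is_valid(r)])
```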
C. Data Transformation
Data transformation involves converting the data into a format suitable for analysis and storage. This
includes the following, as sketched in the code after this list:
1. Standardising Formats: Converting phone numbers, addresses, and names to standardised formats.
2. Handling Variability: Addressing variations in data presentation by applying flexible parsing rules
that can adapt to different HTML structures.
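One possible implementation of the standardisation step is sketched below. Normalising Swiss phone numbers to E.164 (+41...) form and collapsing whitespace in names and addresses are assumptions chosen for illustration, not a prescription of the pipeline's exact rules.

```python
import re


def standardise_phone(phone: str) -> str:
    """Normalise a Swiss phone number to +41XXXXXXXXX form."""
    digits = re.sub(r"\D", "", phone or "")
    if digits.startswith("0041"):
        digits = digits[4:]
    elif digits.startswith("41"):
        digits = digits[2:]
    elif digits.startswith("0"):
        digits = digits[1:]
    return f"+41{digits}" if digits else ""


def standardise_text(value: str) -> str:
    """Collapse whitespace and apply title case to names and addresses."""
    return re.sub(r"\s+", " ", (value or "").strip()).title()


def transform(record: dict) -> dict:
    """Return a record with all fields in standardised form."""
    return {
        "name": standardise_text(record.get("name")),
        "address": standardise_text(record.get("address")),
        "phone": standardise_phone(record.get("phone")),
    }
```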
D. Feature Selection
Feature selection focuses on identifying and extracting key attributes that will be stored and analysed. This
includes:
1. Key Attributes: Extracting essential features such as user names, contact numbers, addresses, and any
additional metadata.
2. Attribute Transformation: Transforming attributes into formats suitable for storage and analysis, such as
standardised text fields and phone numbers that can be written directly to the CSV output.
E. Data Storage
Data storage involves saving the transformed data in a reliable and accessible format. This stage includes the following, illustrated by the sketch after the list:
1. Storing Data in Amazon S3: Using Boto3 to upload the cleaned and transformed data to an Amazon
S3 bucket, ensuring scalability and durability.
2. Creating a CSV File: Generating a CSV file of the user details for easy access and analysis.
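The storage stage can be sketched as follows. The bucket name and object key are placeholders, and Boto3 credentials are assumed to come from the standard AWS environment (environment variables, shared configuration, or an IAM role).

```python
import csv
import io

import boto3

FIELDS = ["name", "address", "phone"]


def to_csv(records: list[dict]) -> str:
    """Serialise the transformed records to a CSV string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def upload_to_s3(records: list[dict], bucket: str = "local-ch-scrape",
                 key: str = "user_details.csv") -> None:
    """Upload the CSV rendering of the records to an Amazon S3 bucket."""
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=to_csv(records).encode("utf-8"),
        ContentType="text/csv",
    )
```

put_object writes the whole CSV in a single request, which is sufficient for moderate exports; very large datasets could instead be written with S3 multipart uploads.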
5. Results
The pipeline successfully scrapes user details from multiple city pages on www.local.ch and stores the
data in Amazon S3. The use of asynchronous programming significantly reduced the time required for the
scraping process, allowing the pipeline to handle a large number of requests efficiently.
Furthermore, a visual representation of the saved data was included, enhancing the comprehensibility and
accessibility of the stored information. This addition not only provides a clear snapshot of the dataset but
also facilitates easier interpretation and analysis of the extracted user details.
6. Conclusion
The proposed ETL pipeline is designed to efficiently and accurately scrape user details from
www.local.ch. By leveraging asynchronous programming, robust HTML parsing, and thorough data
preprocessing, the pipeline ensures high-quality data extraction suitable for various applications. The
methodology outlined provides a structured approach to handling the complexities of web scraping and
data processing, ensuring reliability and scalability.
References
1. Akidau, T., Bradshaw, R., Chambers, C., Chernyak, S., Fernández-Moctezuma, R. J., Lax, R., ... &
Whittle, S. (2015). The dataflow model: A practical approach to balancing correctness, latency, and
cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB
Endowment, 8(12), 1792-1803.
2. Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters.
Communications of the ACM, 51(1), 107-113.
3. Marz, N., & Warren, J. (2015). Big Data: Principles and best practices of scalable real-time data
systems. Manning Publications.
4. Kreps, J. (2014). Questioning the Lambda Architecture. [Online Article]. Retrieved from
https://www.oreilly.com/radar/questioning-the-lambda-architecture/
5. Guhathakurta, A., Boyd, C., & Laing, C. (2017). Data Ingestion Using Apache NiFi. Proceedings of
the Practice and Experience on Advanced Research Computing, 1-8.
6. Beauchemin, M. (2015). The rise of Apache Airflow. [Online Article]. Retrieved from
https://airflow.apache.org/
7. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., ... & Stoica, I. (2010). Resilient
distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. Proceedings of the
9th USENIX Conference on Networked Systems Design and Implementation, 15-28.