Data engineering Flow-

Data engineering serves as a bridge between data producers and consumers, facilitating the management of large volumes of data generated by various applications. The end-to-end data pipeline involves data ingestion, ETL processes, and storage solutions, with a focus on both structured and unstructured data. Key responsibilities include effective communication, cost management, data architecture, and ensuring data reliability and security.

Generation → Ingestion → Transformation → Serving.

Data generation applications - MySQL, PostgreSQL, MongoDB, and third-party applications like Stripe, Salesforce, Google Analytics, etc.

Upstream (data producers) → Data manipulation (data engineering) → Downstream (data consumers)

Data engineering is the bridge between data producers and data consumers.
Here, data producers on different platforms generate data, and consumers use it for
analysis and decision making.

Data engineering can be done on streamed data (Kafka & Flink) as well as stored data.

Why is data engineering important?

Many platforms and apps are in use nowadays and produce large amounts of data;
data engineering helps manage that data.

End-to-end data pipeline-

- The process of receiving data and storing it is called data ingestion. A minimal sketch is shown below.
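
A minimal ingestion sketch, assuming the third-party requests package and a hypothetical API endpoint and file name: pull data from a source system and store the raw response for later processing.

import json
import requests  # assumed to be installed; not part of the original notes

def ingest(url, out_path):
    # Receive the data from the upstream source...
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # ...and store it as-is (a raw landing zone for later transformation).
    with open(out_path, "w") as f:
        json.dump(response.json(), f)

ingest("https://api.example.com/orders", "raw_orders.json")  # hypothetical endpoint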

Compute vs Storage:

At the top of the data pipeline is compute and at the bottom is storage.

* MPP - Massively parallel processing.

It can process large amounts of data very quickly by splitting the dataset into smaller chunks and processing them in parallel.
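
A rough single-machine sketch of the MPP idea, using only Python's standard library (the dataset and chunk size are made up for illustration): split the dataset into chunks, process the chunks in parallel, then combine the partial results.

from multiprocessing import Pool

def process_chunk(chunk):
    # Hypothetical per-chunk work: here we just sum the values.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunk_size = 250_000
    # Split the dataset into smaller chunks...
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # ...and process them in parallel across worker processes.
    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_chunk, chunks)

    print(sum(partial_sums))  # combine the partial results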

ETL -
E - Extraction: the process of taking data out of its sources.
T - Transformation: reshaping the data into more usable formats.
L - Loading: loading the data into storage.
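
A minimal ETL sketch using only the standard library; the file names and field names ("orders.csv", "name", "amount") are hypothetical, just to show the three steps.

import csv
import json

def extract(path):
    # E: take the data out of the source (here, a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # T: reshape the data into a more usable format (trim names, cast types).
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
    ]

def load(rows, path):
    # L: load the transformed data into storage (here, a JSON file).
    with open(path, "w") as f:
        json.dump(rows, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("orders.csv")), "orders_clean.json")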

On-premises - the company purchases its own hardware and stores data on its own premises.
• Hadoop - allows data engineers to handle data at terabyte and petabyte scale.

Modern data stack: made up of collections of open-source platforms and third-party tools that connect together.

* Data maturity - determines the complexity of an organisation's data processes and pipelines.

Simply, how data is used by a given organisation as the amount of data grows: Start → Scale → Lead.

Responsibilities of data engineering →

Communicate with both technical and non-technical stakeholders.
Understand how to store and manage data.
Minimise cost and work within a budget.
Create good data architecture.
Build and manage for reliability.
Security, data governance, automation and observability.

Structured data (row-based) -
SQL;
used for BI and ML to detect small features.

Unstructured data (column-based) -
audio, video, images, log files;
deep learning and neural networks to detect large and micro features.

Event streams -

Producers (websites) - produce the data.

Event broker - storage and distribution.

Event consumers - consume the data in real time.
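
A minimal sketch of the producer → broker → consumer pattern, assuming the third-party kafka-python package and a Kafka broker running on localhost:9092 (both assumptions; the topic name and event fields are made up).

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: a website/app emits events to the broker.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("page_views", {"user_id": 42, "page": "/home"})
producer.flush()

# Consumer side: reads events from the broker in (near) real time.
consumer = KafkaConsumer(
    "page_views",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for event in consumer:
    print(event.value)  # e.g. {'user_id': 42, 'page': '/home'}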

Challenges:-
Messages come in asynchronously.
Ordering.
Duplication of data.

Idempotency: an operation is idempotent when the same result comes out no matter how many
times you run it. (Important for managing duplicate data.)
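
A minimal sketch of an idempotent consumer: handling the same message twice leaves the final state unchanged, which is how duplicate deliveries are managed. The message shape and in-memory stores are hypothetical (in practice the processed-IDs set would live in a durable store such as a database or Redis).

processed_ids = set()   # which message IDs have already been applied
balances = {}           # hypothetical downstream state

def handle_payment(message):
    if message["id"] in processed_ids:
        return  # duplicate delivery: already applied, do nothing
    balances[message["user"]] = balances.get(message["user"], 0) + message["amount"]
    processed_ids.add(message["id"])

event = {"id": "evt-1", "user": "alice", "amount": 100}
handle_payment(event)
handle_payment(event)   # re-delivered duplicate is ignored
print(balances)         # {'alice': 100} no matter how many times it is delivered
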
Popular event streaming platforms - AWS SQS, Amazon Kinesis, RabbitMQ, Kafka, Pulsar, Spark.

Storage:- central part of the data pipeline.

HDD - hard disk drive - a traditional magnetic drive with a rotating disk and arm.

SSD - solid state drive - a faster kind of drive (no moving parts).

RAM - faster than SSD in terms of latency (temporary memory).

Networking and cloud storage.

Serialisation: turning data into a byte stream so it can easily be saved and transported.

Data is serialised into a standard format, sent around, and deserialised on the receiving end.

Row-based - XML, JSON, CSV

Column-based - Parquet, ORC
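
A minimal sketch of row-based vs column-based serialisation. The JSON part uses only the standard library; the Parquet part assumes the third-party pyarrow package is installed, and the records and file names are made up.

import json
import pyarrow as pa
import pyarrow.parquet as pq

rows = [
    {"user": "alice", "amount": 100},
    {"user": "bob", "amount": 250},
]

# Row-based: each record is serialised as one unit (JSON lines).
with open("events.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Column-based: values are grouped by column (Parquet), which compresses well
# and lets analytical queries read only the columns they need.
table = pa.Table.from_pylist(rows)
pq.write_table(table, "events.parquet")
print(pq.read_table("events.parquet").to_pylist())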

• Single machine vs distributed storage

Vertical scaling (a bigger machine) and horizontal scaling (more machines).

Strong vs eventual consistency:

Strong consistency - the system doesn't allow read operations until all the nodes with replicated data are updated.

Eventual consistency - user read requests are not halted until all the replicas are updated; instead the update process happens eventually. Some users might receive old data, but eventually all replicas hold the latest data.

ACID vs BASE -
ACID - single machine, strong consistency.
BASE - distributed systems, eventual consistency.

Storage systems:-

File storage - local files, NAS, cloud storage.

Block storage - HDD/SSD, AWS EBS.
Object storage - AWS S3.
Cache storage - RAM, Redis.
Streaming storage - buffering.
