Data Engineering Fundamentals

The document provides an introduction to fundamentals of data engineering. It discusses the evolution of data engineering and types of data including structured, semi-structured, and unstructured data. It also describes databases like RDBMS and NoSQL, data warehousing concepts, data engineering life cycle including source, ingestion, storage, transformation and serving. Additionally, it covers big data technologies like Hadoop, Spark, Kafka and programming languages like Python and Java. It concludes with applications of big data in various domains like retail, healthcare, telecom and finance.

Uploaded by

Paras Bansal
Copyright
© All Rights Reserved

Fundamentals of Data Engineering

Trainer: Nilesh Ghule

Sunbeam Infotech www.sunbeaminfo.com


Introduction

Big Data Fundamentals
Evolution of Data Engineering
Types: Structured / Semi-structured / Unstructured

Databases
RDBMS - ACID, SQL (basic concepts only)
NoSQL - BASE, CAP theorem

Data warehouse - OLAP vs OLTP
Data cleansing, data transformations and data modelling
Data warehouse vs Data mart vs Data lake

Data Engineering Life Cycle
Source, Ingestion, Storage, Transformation, Serving
Ingestion: ETL vs ELT
Storage: Distributed storage, Storage services
Processing: Batch vs Stream

Big Data Technologies
Frameworks: Hadoop, Hive, Spark, Kafka
Programming Languages: Python, Java, Scala
Job profiles: Data engineer, Data architect, Database/DWH engineer
Applications: Retail, Healthcare, Telecom, Finance, Media, etc.


History of Big Data / Data Engineering

Timeline figure. Labels on the figure: Database, Data Warehouse, MPP, Internet, DotCom, Distributed computing, NoSQL, Big Data, Cloud Computing. Year marks: 1970, 1990, 1998, 2000, 2002, 2012.
Big Data characteristics


Types of Data
Structured: fixed schema, e.g. a table with roll, name and marks columns.
Semi-structured: flexible schema, e.g. product records (grocery, clothes, ...) whose fields vary. Common formats: JSON (JavaScript Object Notation) and XML (eXtensible Markup Language).
Unstructured.
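A small sketch of the difference between structured and semi-structured data. The records and field names below are invented for illustration and are not taken from the slides: structured rows share one fixed set of columns, while JSON documents may each carry different fields.

```python
import json

# Structured data: every row has the same fixed columns (roll, name, marks).
COLUMNS = ("roll", "name", "marks")
students = [
    (1, "Asha", 82),
    (2, "Ravi", 74),
]

# Semi-structured data: each JSON document may have its own set of fields.
products_json = """
[
  {"name": "rice", "category": "grocery", "weight_kg": 5},
  {"name": "shirt", "category": "clothes", "size": "M", "color": "blue"}
]
"""
products = json.loads(products_json)


def fields_used(records):
    """Return the sorted union of field names across all records."""
    return sorted({key for record in records for key in record})


def has_fixed_schema(records):
    """True when every record exposes exactly the same field names."""
    return len({frozenset(record) for record in records}) <= 1


student_dicts = [dict(zip(COLUMNS, row)) for row in students]
print(has_fixed_schema(student_dicts))  # structured rows
print(has_fixed_schema(products))       # JSON documents with varying fields
print(fields_used(products))
```

The check is deliberately naive: it only compares field names, not value types.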


RDBMS
DBMS: Database Management System. RDBMS: Relational DBMS.
Every enterprise application needs to manage data, i.e. perform CRUD operations on it (create, read, update/modify, delete). In older days this was done with file I/O.
RDBMS is a relational DBMS that manages structured data.
Data is organized into tables, rows and columns. Rows represent entities; columns are also called fields or attributes.
Tables are related to each other, e.g. a students table (roll, name, addr) related to a marks table, and are combined using joins.
All enterprise RDBMS follow client-server architecture, have built-in relational capabilities and are fully ACID.
Typical data size: 100's of MB to GBs.
DB2, Oracle, MS-SQL, MySQL, PostgreSQL, MS-Access, SQLite, etc.
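A minimal sketch of two related tables and a join, using Python's built-in sqlite3 module. The table layout (students and marks linked by a roll column), the subject names and the sample rows are assumptions made for this example.

```python
import sqlite3

# In-memory SQLite database; table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students(roll INTEGER PRIMARY KEY, name TEXT, addr TEXT)")
conn.execute("CREATE TABLE marks(roll INTEGER, subject TEXT, score INTEGER)")

conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Asha", "Pune"), (2, "Ravi", "Karad")])
conn.executemany("INSERT INTO marks VALUES (?, ?, ?)",
                 [(1, "SQL", 80), (1, "Python", 90), (2, "SQL", 70)])

# Join: relate the two tables through the shared roll column,
# then total the scores per student.
rows = conn.execute(
    """
    SELECT s.name, SUM(m.score) AS total
    FROM students AS s
    JOIN marks AS m ON m.roll = s.roll
    GROUP BY s.roll, s.name
    ORDER BY s.roll
    """
).fetchall()
print(rows)
conn.close()
```

SQLite is used only because it ships with Python; the same SQL would run largely unchanged on a client-server RDBMS.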


SQL (Structured Query Language)
RDBMS data is processed with SQL queries.
Developed at IBM; originally named SEQUEL (Structured English Query Language).
Standardised by ANSI in 1986 and by ISO in 1987.
Five major categories:
DDL: Data Definition Language e.g. CREATE, ALTER, DROP, RENAME.
CREATE TABLE people(id INT, name CHAR(40), birth DATE);
DML: Data Manipulation Language e.g. INSERT, UPDATE, DELETE.
DELETE FROM people WHERE id=1;
DQL: Data Query Language e.g. SELECT.
SELECT * FROM people;
DCL: Data Control Language e.g. CREATE USER, GRANT, REVOKE.
TCL: Transaction Control Language e.g. SAVEPOINT, COMMIT, ROLLBACK.

Transactions (e.g. funds transfer between two savings accounts)
A transaction is a set of DML queries executed as a single unit. Either all will succeed (commit) or all will be discarded (rollback).
Atomic: the result is full/complete, never partial.
Consistent: same result for all clients.
Isolated: multiple transactions are independent of each other.
Durable: all changes must be saved.
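A hedged sketch of a funds transfer run as one transaction, again with the built-in sqlite3 module. The accounts table, the CHECK constraint and the amounts are assumptions for illustration. The `with conn:` block commits when both updates succeed and rolls back when either fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint (an assumption of this sketch) rejects negative balances.
conn.execute(
    "CREATE TABLE accounts(id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 5000), (2, 7000)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


ok1 = transfer(conn, 1, 2, 2000)     # enough funds: committed
after1 = balances(conn)
ok2 = transfer(conn, 1, 2, 10000)    # debit would go negative: rolled back
after2 = balances(conn)
print(ok1, after1)
print(ok2, after2)
```

In the failing transfer the credit to the destination account has already been applied when the debit fails, so the unchanged balances afterwards show the rollback undoing a partial update.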


Scaling
Vertical scaling: move to one bigger machine (e.g. 16 CPU, 64 GB RAM, 32 TB disk). Followed by traditional RDBMS.
Horizontal scaling (distributed): add more machines of modest size (e.g. 4 CPU, 8 GB RAM, 2 TB disk each), connected by a network. Followed by NoSQL and Big Data systems.
Cluster: a set of computers in a network working for a dedicated function.


NoSQL Databases
Stands for Not Only SQL (beyond SQL). The term NoSQL was first used by Carlo Strozzi in 1998.
Manages structured and semi-structured data.
Prioritizes high performance, high availability and scalability.
Typical workload: 10000's of read/write ops/sec, 100's of GB to TBs of data, available 24x7 to many clients.
Designed for horizontal scaling. Reliable, fault tolerant, better performance/speed.
No common declarative query language: each database has its own query language.
Use if: Huge data (TBs), many read/write ops, scalable, flexible schema, no joins.

BASE transactions
BA: Basically Available (24x7).
S: Soft state (data in the cluster can change without user interaction).
E: Eventual consistency (all clients see the same data eventually, after some time).

Based on CAP Theorem
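A toy simulation of eventual consistency, written for this note and not modelled on any particular NoSQL product. The `Replica` and `Cluster` classes are hypothetical: a write is accepted by one replica right away and copied to the others later, so a read from another replica can be stale until the sync runs.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """Toy cluster: a write lands on one replica and is propagated later."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = []  # writes not yet copied to the other replicas

    def write(self, key, value):
        # Accept the write immediately on the first replica (stays available).
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Background replication: state changes without any client request.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


cluster = Cluster(["n1", "n2", "n3"])
cluster.write("balance", 100)
stale = cluster.read(2, "balance")   # this replica has not caught up yet
cluster.sync()
fresh = cluster.read(2, "balance")   # after sync all replicas agree
print(stale, fresh)
```

Real systems replicate asynchronously over the network and must also resolve conflicting writes, which this sketch ignores.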


CAP Theorem
Consistency - Data is consistent after an operation. After an update operation, all clients see the same data.
Availability - System is always on (i.e. service guarantee, 24x7), no downtime.
Partition Tolerance - System continues to function even when the communication among the servers is unreliable.
When a network partition occurs, a distributed system has to choose between consistency and availability.


NoSQL Databases

Key-value databases - e.g. Redis, DynamoDB, Riak, ...
Keys are unique and values can be of any type, e.g. JSON, BLOB, etc.
Implemented as a big distributed hash-table for fast searching.

Wide Column databases - e.g. HBase, Cassandra, Bigtable
Columns are grouped into column families (e.g. col1 and col2 in one family), kept in separate files.
Values of columns are stored contiguously.
Better performance while accessing few columns and aggregations.
Good for data-warehousing, business intelligence, CRM, ...

Graph databases
Graph is a collection of vertices and edges.
Excellent performance while dealing with all relations of an entity (irrespective of size of data).
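A rough sketch of why storing column values contiguously helps when only a few columns are read. It is plain Python with invented records and does not reflect the storage format of any specific database: the rows are pivoted into one list per column, and an aggregation then touches a single list.

```python
# Three records with the same fields; values are invented for the sketch.
rows = [
    {"id": 1, "city": "Pune", "price": 120},
    {"id": 2, "city": "Karad", "price": 80},
    {"id": 3, "city": "Pune", "price": 200},
]


def to_columns(records):
    """Pivot row-wise records into one contiguous list per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregating one column touches only that column's list,
# not every field of every row.
total_price = sum(columns["price"])
print(columns["price"], total_price)
```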


NoSQL Databases

Document oriented databases - e.g. MongoDB, CouchDB
Document contains data as key-value (field-value) pairs in JSON or XML format.
Document schema is flexible; documents are added to a collection for processing.
RDBMS: Row / Table -> Fixed schema
Mongo: Doc / Collection -> Flexible schema

Search databases - e.g. Elasticsearch, Solr, Lucene
For faster search: Text search, Log analysis.
Indexed, Exact/Fuzzy matches, Anomaly detection, Analytics.

Time series databases
Data warehousing
Data warehouse is a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.
Data warehousing is a process of transforming data into information and making it available to users in a timely enough manner to make a difference.
Examples: AWS Redshift, Apache Hive, Oracle, Teradata, ...
Flow: sources -> data warehouse -> business intelligence tools (Tableau, Power BI).


Extract, Transform, Load
Extracting: Extract data from various sources into a staging area.
Conditioning: Conversion of data types (numbers, dates) to fit the warehouse.
House holding: Grouping similar data (e.g. name, email).
Enrichment: Add relevant data from external sources (e.g. city, country).
Scoring: Computation of probability of an event.
Scrubbing: Data cleaning: find duplicate and missing data.
Merging: Merging data from various sources.
De-normalize: Duplicate data to reduce joins.
Loading: Load data into warehouse models like Star, Snowflake, Fact constellation.
Delta Updating: Incremental data uploading.
Partitioning: Dividing data into logical parts to improve performance.
Star schema: one fact table linked to dimension tables (dim1 ... dim4). Fact constellation: several fact tables (fact, fact2) sharing dimensions.
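Three of the transformation steps (scrubbing, conditioning, enrichment) can be sketched in plain Python. The sample rows, the field names and the `city_to_state` lookup are all invented for this example; a real pipeline would read from a staging area and write to the warehouse.

```python
from datetime import date

# Raw rows as they might arrive in a staging area (invented sample data).
staged = [
    {"name": "Asha", "city": "Pune", "price": "120", "sold_on": "2023-01-05"},
    {"name": "Asha", "city": "Pune", "price": "120", "sold_on": "2023-01-05"},
    {"name": "Ravi", "city": "Karad", "price": "", "sold_on": "2023-01-06"},
    {"name": "Mira", "city": "Pune", "price": "200", "sold_on": "2023-01-07"},
]
# Stand-in for an external data source used during enrichment.
city_to_state = {"Pune": "Maharashtra", "Karad": "Maharashtra"}


def scrub(rows):
    """Scrubbing: drop exact duplicates and rows with a missing price."""
    seen, clean = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if row["price"] and key not in seen:
            seen.add(key)
            clean.append(row)
    return clean


def condition(row):
    """Conditioning: convert strings to the types the warehouse expects."""
    return {
        **row,
        "price": int(row["price"]),
        "sold_on": date.fromisoformat(row["sold_on"]),
    }


def enrich(row):
    """Enrichment: add a field looked up from an external source."""
    return {**row, "state": city_to_state.get(row["city"])}


warehouse_rows = [enrich(condition(r)) for r in scrub(staged)]
for row in warehouse_rows:
    print(row)
```

Scrubbing runs first so that the type conversion never sees the row with the empty price.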


OLTP (Database) vs OLAP (Data warehouse)

OLTP (Database) | OLAP (Data warehouse)
Online Transaction Processing | Online Analytical Processing
RDBMS/NoSQL | Data warehouse
Modeled to run the business | Modeled to analyze/optimize business
Detailed/transactional, normalized (no duplicates), real-time data | Summarized/refined, redundant, snapshot data
Transaction (DML) performance | Analytical query (search) performance
Read/Write operations | Mostly Read operations (no update/delete)
Isolated data (application specific) | Integrated data (from all sources)
Limited data (100 MB to 100 GB) | Huge data (100 GB to few TB)


Data lake vs Data warehouse vs Data mart
Data lake: holds raw data; plays a role like the staging area in traditional data warehouse ETL.
Data warehouse: data loaded through ETL and modeled for analytical queries.
Data mart: department-specific data.


Data Engineering Life Cycle
Source (RDBMS, apps, social media, government data) -> Ingestion (loading) -> Storage -> Transformation (grouping, modeling, scoring, scrubbing, ...) -> Serving (making data available to BI, ML)
The stages together form a data pipeline.
https://youtu.be/hZu_87l62J4

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning.
A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.


Traditional ETL vs Hadoop ELT

Traditional ETL
ETL stands for Extract, Transform and Load.
The ETL process typically extracts data from the source/transactional systems, transforms it to fit the model of the data warehouse and finally loads it to the data warehouse.
The transformation process involves cleansing, enriching and applying transformations to create the desired output.
Data is usually dumped to a staging area after extraction.

Hadoop ELT
ELT stands for Extract, Load and Transform.
As opposed to loading just the transformed data in the target systems, the ELT process loads the entire data into the data lake. This results in faster load times.
The load process can also perform some basic validations and data cleansing rules.
The data is then transformed for analytical reporting as per demand.
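A minimal ELT sketch under stated assumptions: an in-memory SQLite database stands in for the data lake, and the `raw_sales` table and its rows are invented. All raw records are loaded untouched, including a bad one, and the transformation happens later as a query inside the target system.

```python
import sqlite3

# Extracted records, all still plain text; the second row has a bad price.
raw_events = [("1", "Pune", "120"), ("2", "Karad", "abc"), ("3", "Pune", "200")]

lake = sqlite3.connect(":memory:")

# Load: store everything as-is, with no transformation up front.
lake.execute("CREATE TABLE raw_sales(id TEXT, city TEXT, price TEXT)")
lake.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_events)
loaded = lake.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]

# Transform: done on demand for a report. Rows whose price is empty or
# contains a non-digit character are filtered out before the cast.
report = lake.execute(
    """
    SELECT city, SUM(CAST(price AS INTEGER)) AS total
    FROM raw_sales
    WHERE price <> '' AND price NOT GLOB '*[^0-9]*'
    GROUP BY city
    ORDER BY city
    """
).fetchall()
print(loaded, report)
lake.close()
```

In an ETL version of the same job, the bad row would be rejected or repaired before anything is written to the target.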


Data storage

Data storage is related to multiple stages in the data engineering life cycle, i.e. ingestion, transformation and serving.
Storage needs to be selected based on read/write requirements, speed, durability, ...
Storage tradeoffs
Local storage vs Distributed storage
Strong consistency vs Eventual consistency
Storage options are: File storage, Local disk storage, Network attached storage (NAS), Cloud file systems (S3/Blob), Block storage, RAID, Storage area network (SAN), Object storage, HDFS, Streaming storage.


Batch processing vs Stream processing

Batch processing
Processing a finite set of data (data at rest), e.g. MB/GB/TB.
Incremental data load is managed by the programmer.
Cluster is planned as per data size. High throughput.
Job runs once per batch (daily, weekly, monthly).

Stream processing
Processing a live stream of data (data in motion).
Data processing is managed by the framework.
Less throughput.
Job is running forever.
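The contrast can be sketched with a plain function for the batch case and a generator for the stream case. The `sensor_feed` source and its values are hypothetical; a real stream would be unbounded and would be handled by a streaming framework.

```python
def batch_total(records):
    """Batch: the whole finite data set is available; process it in one run."""
    return sum(records)


def stream_totals(source):
    """Stream: consume records one at a time and emit a running total."""
    total = 0
    for value in source:
        total += value
        yield total


def sensor_feed():
    """Stand-in for a live source; a real stream would not end."""
    yield from [5, 10, 20]


daily_batch = [5, 10, 20]
batch_result = batch_total(daily_batch)
stream_result = list(stream_totals(sensor_feed()))
print(batch_result)
print(stream_result)
```

The batch job produces one answer after seeing all the data, while the stream job produces an updated answer after every record.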


Big Data & Analytics Spectrum
Data storage
RDBMS & NoSQL databases
Data warehouse

Data Analysis & visualizations


Data Visualizations
Business reports
Artificial Intelligence, Data Science & Data mining
Mathematics, Statistics & Computer algorithms
Machine learning & Deep learning
R Programming, Python
Data Engineering
Hadoop, Hive, Spark, Kafka, BigTable
Parallel processing
Java, Scala, Python.
Infrastructure
Linux, Cloud Computing



Apache Hadoop
Inspired by Google: GFS (2003) and MR, i.e. MapReduce (2004).
Hadoop is developed by Doug Cutting, in Java.
He was building the web crawler Nutch. Distributed computing and storage were needed to process the huge data produced by the crawler.
He joined Yahoo (2007). Hadoop was developed and open sourced under the Apache license.

Hadoop 1.x
Distributed storage: HDFS (Hadoop Distributed File System)
Distributed computing: Map-reduce

Hadoop 2.x / 3.x
Distributed storage: HDFS
Distributed computing: Map-reduce
Cluster manager: YARN (Yet Another Resource Negotiator)

Hadoop is like a Kernel/Platform on which many different applications are built (eco-systems).

Cluster diagram: the master runs RM (ResourceManager), NN (NameNode) and SNN (Secondary NameNode); each worker node (1, 2, 3) runs an NM (NodeManager) and a DN (DataNode).
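The Map-reduce model can be illustrated with a word count in plain Python. This is a single-process sketch that uses no Hadoop API, and the input lines are invented; in Hadoop the map and reduce functions run in parallel across the cluster and the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data lake"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```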


Apache Hive
Used with Hadoop.
Developed by Facebook (2007).
Client software that converts HiveQL queries to MR (Map-reduce) jobs.
HiveQL is similar to SQL with many extended features.
Hive manages structured data.
Hive is a data warehouse (OLAP) built for Hadoop.
Data storage = HDFS
Metadata = RDBMS
Data processing = Map-reduce (older) or Spark or Tez (newer).
Flow: SQL -> Hive -> MR


Apache Spark
Spark is a distributed computing framework that can process huge amounts of data and can work with any storage.
Spark can be used as part of the Hadoop eco-system or as an independent distributed computing framework.
Developed by the AMPLab at UCB (University of California, Berkeley). AMP stands for Algorithms, Machines, People.
Further developed/maintained by Databricks.
Languages: Scala, Python, R, Java

Popular Spark vendors
Databricks, AWS EMR, Cloudera, MapR

Spark stack (top to bottom)
Spark Toolkit: Spark SQL, Spark Streaming, Spark ML, Spark GraphX
Spark High Level API (Dataframes) - 2.x
Spark Core (RDD & DAG)


Apache Kafka

Kafka is a distributed messaging system: it sends messages asynchronously at large scale (huge data, many producers/consumers).
Developed at LinkedIn and open sourced in 2011.
Used by LinkedIn, Twitter, Uber, Airbnb, ...
Advantages
Scalable, Durable, Finite retention
Low latency, Strong ordering (within a partition),
Exactly-once delivery
Applications
Stream processing
Notifications.
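The core idea of an ordered, append-only message log read by independent consumers can be sketched without any Kafka client library. `TopicLog` and `Consumer` are hypothetical classes written for this note and are not part of the Kafka API: producers append messages, and each consumer keeps its own offset into the log.

```python
class TopicLog:
    """Toy append-only log for one partition of a topic."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        """Producer side: add a message and return its offset."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset):
        """Consumer side: return all messages at or after the given offset."""
        return self.messages[offset:]


class Consumer:
    """Each consumer tracks its own offset, so reads are independent."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self):
        batch = self.log.read_from(self.offset)
        self.offset += len(batch)
        return batch


log = TopicLog()
for event in ["login", "click", "logout"]:
    log.append(event)

dashboard = Consumer(log)
alerts = Consumer(log)
first = dashboard.poll()    # reads everything produced so far
log.append("login")
second = dashboard.poll()   # only the new message
late = alerts.poll()        # a slower consumer still sees all messages in order
print(first, second, late)
```

Because messages stay in the log after being read, a slow or newly added consumer can catch up at its own pace, which is what makes the producer and consumers asynchronous.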


Real time dashboard reference architecture


Big Data domains & opportunities

Domains: Healthcare, Retail, Trading/Share market, Finance, Security, Fraud, Search engines, Log Analysis, Telecom, Traffic Control, Manufacturing and a lot more.
Big Data is all about: Think, Collect, Manage, Analyze, Summarize, Visualize, Discover Knowledge and Take Decisions.
Job profiles:
Business Analyst/Intelligence
Database engineer / DWH
Big Data engineer
Data operations
Big Data Architect

The sexiest job in the 21st century requires a mixture of multidisciplinary abilities, and suitable candidates must be prepared to learn and develop constantly.
- Ronald Van Loon

Q&A

Thank you!
Nilesh Ghule

Also refer: Big data webinar - https://www.youtube.com/live/BxwpqnQ6BgQ
