Data Engineering Fundamentals
Databases
Evolution of data management (roughly the 1990s through 2012):
- RDBMS: ACID transactions, SQL (basic concept only).
- Data warehouses & MPP (massively parallel processing) systems.
- Distributed computing.
- Cloud & internet computing.
- NoSQL databases.
Big Data characteristics
The "V"s of Big Data:
- Volume: sheer scale of the data.
- Velocity: speed at which data is generated and must be processed.
- Variety: many different kinds of data, e.g. an e-commerce catalog spanning grocery, clothes and other products.
- Veracity: quality and trustworthiness of the data.
- Value: usefulness of the data to the business.
Semi-structured formats with flexible/extensible schemas:
- JSON: JavaScript Object Notation (flexible schema).
- XML: eXtensible Markup Language.
DBMS / RDBMS
- DBMS: Database Management System; RDBMS: Relational Database Management System.
- Every enterprise application needs to manage data, i.e. perform CRUD (Create, Read, Update, Delete) operations on data for its clients. The older approach was file-based storage.
- An RDBMS organizes data in tables of rows and columns, and tables are related to each other, e.g. a students table (roll, name, addr) related to a marks table by roll number.
- All enterprise RDBMS follow client-server architecture, have built-in relational capabilities (joins) and are fully ACID compliant.
- Examples: DB2, Oracle, MS-SQL, MySQL. Typical data volumes: 100's of GBs.
- RDBMS data is processed with SQL queries; SQL was standardised by ANSI in 1986 and by ISO in 1987.
- Transactions execute multiple actions as one unit, e.g. a funds transfer of 5000 from savings account ac1 to savings account ac2.
SQL sublanguages:
- DDL: Data Definition Language, e.g. CREATE, ALTER, DROP, RENAME.
- DML: Data Manipulation Language, e.g. SELECT, INSERT, UPDATE, DELETE.
- DCL: Data Control Language, e.g. CREATE USER, GRANT, REVOKE.
- TCL: Transaction Control Language, e.g. SAVEPOINT, COMMIT, ROLLBACK (see the funds-transfer sketch below).
ACID properties:
- Atomic: all queries in a transaction are executed as a single unit.
- Consistent: a transaction takes the database from one valid state to another.
- Isolated: concurrent transactions are independent of each other.
- Durable: all committed changes must be saved permanently.
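To tie the SQL sublanguages and ACID properties together, here is a minimal sketch of the funds-transfer example using Python's built-in sqlite3 module (the table name, account numbers and balances are invented for illustration):

```python
import sqlite3

# Throwaway in-memory database for the demo
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: define the schema
cur.execute("CREATE TABLE accounts (acno TEXT PRIMARY KEY, balance REAL)")

# DML: insert two savings accounts
cur.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("ac1", 10000.0), ("ac2", 2000.0)])
conn.commit()

# TCL: transfer 5000 from ac1 to ac2 as one atomic unit
try:
    cur.execute("UPDATE accounts SET balance = balance - 5000 WHERE acno = 'ac1'")
    cur.execute("UPDATE accounts SET balance = balance + 5000 WHERE acno = 'ac2'")
    conn.commit()    # both updates become durable together
except Exception:
    conn.rollback()  # on any failure, neither update is applied

print(cur.execute("SELECT * FROM accounts").fetchall())
# [('ac1', 5000.0), ('ac2', 7000.0)]
```

If either UPDATE fails, the rollback leaves both balances untouched, which is exactly the atomicity the slide describes.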
NoSQL databases
- Scaling approaches: vertical scaling (a bigger dedicated machine, e.g. for 100+ GB of data) is followed by traditional RDBMS; horizontal (distributed) scaling across a cluster is followed by NoSQL and Big Data systems.
- Manages structured and semi-structured data, serving a large number of clients 24x7.
- Prioritizes high performance, high availability and scalability.
- Designed for horizontal scaling: reliable, fault tolerant, better performance/speed.
- Use if: huge data (TBs), many read/write ops, need for scalability, flexible schema, data accessed without joins.
- Data is replicated across the cluster, so the system stays available even when individual nodes are unreliable.
Types of NoSQL databases:
- Key-Value databases: keys are unique and values can be of any type, i.e. JSON, BLOB, etc. (see the sketch after this list).
- Wide Column databases: columns are grouped into column families (a column family holds related columns, e.g. col1, col2), e.g. HBase, Cassandra, Bigtable.
- Graph databases.
- Search databases: e.g. Elasticsearch, Solr, Lucene.
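A key-value store behaves like a giant map of unique keys to arbitrary values. As a rough single-machine illustration in plain Python (the class and key names are invented; this is not a real database client):

```python
import json

class KVStore:
    """Minimal in-memory key-value store: unique keys, values of any type."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # last write wins

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:101", json.dumps({"name": "Asha", "city": "Pune"}))  # JSON value
store.put("avatar:101", b"\x89PNG...binary bytes...")                 # BLOB value
print(json.loads(store.get("user:101"))["name"])                      # Asha
```

Real key-value databases add persistence, replication and partitioning on top of this basic get/put/delete interface.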
Data warehousing
- Data warehousing is the process of transforming data into information and making it available to users in a timely enough manner to make a difference.
- Traditional flow: source systems -> ETL (extract, transform, load) -> staging area -> data warehouse -> BI tools.
- Raw data from the sources is modeled for analytical queries in the warehouse; department-specific subsets are served as data marts.
- Warehouse platforms: AWS Redshift, Apache Hive. BI tools: Tableau, Power BI.

Data life cycle
- Generation at sources (applications, social media, govt. data, ...) -> ingestion (loading) -> storage -> transformation -> serving (BI, analytics, ML), connected together as a data pipeline.
- Reference: https://youtu.be/hZu_87l62J4
Data engineering is the development, implementation, and maintenance of systems and
processes that take in raw data and produce high-quality, consistent information that
supports downstream use cases, such as analysis and machine learning.
A data engineer manages the data engineering lifecycle, beginning with getting data from source
systems and ending with serving data for use cases such as analysis or machine learning.
ETL:
- ETL stands for Extract, Transform and Load.
- The ETL process typically extracts data from the source/transactional systems, transforms it to fit the model of the data warehouse, and finally loads it into the data warehouse.
- The transformation process involves cleansing, enriching and applying transformations to create the desired output.
- Data is usually dumped to a staging area after extraction.

ELT:
- ELT stands for Extract, Load and Transform.
- As opposed to loading just the transformed data into the target systems, the ELT process loads the entire data into the data lake. This results in faster load times.
- The load process can also perform some basic validations and data cleansing rules.
- The data is then transformed for analytical reporting on demand.
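A minimal sketch of the ETL pattern in plain Python using the standard csv and sqlite3 modules (the file, table and field names are invented for illustration):

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from a source CSV file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: cleanse and enrich the rows to fit the warehouse model
    out = []
    for r in rows:
        if not r.get("order_id"):  # drop incomplete rows
            continue
        out.append({
            "order_id": int(r["order_id"]),
            "amount": round(float(r["amount"]), 2),
            "category": r["category"].strip().lower(),
        })
    return out

def load(rows, db_path):
    # Load: write the modeled rows into a warehouse table
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS orders "
                 "(order_id INTEGER, amount REAL, category TEXT)")
    conn.executemany("INSERT INTO orders VALUES (:order_id, :amount, :category)", rows)
    conn.commit()
    conn.close()

# Usage, assuming an orders.csv with order_id, amount, category columns:
# load(transform(extract("orders.csv")), "warehouse.db")
```

An ELT variant would swap the order: dump the raw CSV rows into the target first, and run the transform step inside the target system on demand.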
Data storage
- Data storage is related to multiple stages in the data engineering life cycle, i.e. ingestion, transformation and serving.
- Storage needs to be selected based on read/write requirements, speed, durability, etc.
- Storage tradeoffs:
  - Local storage vs Distributed storage.
  - Strong consistency vs Eventual consistency (illustrated in the sketch below).
- Storage options: file storage, local disk storage, network attached storage (NAS), cloud file systems (S3/Blob), block storage, RAID, storage area network (SAN).
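To make the consistency tradeoff concrete, here is a toy Python model of eventual consistency: a write lands on one replica first and propagates asynchronously, so reads from the other replicas can be stale for a while. The replica list and lag are invented for illustration:

```python
import threading
import time

# Three replicas of the same record; replicas[0] acts as the primary
replicas = [{"x": 0}, {"x": 0}, {"x": 0}]

def write(key, value, lag=0.5):
    replicas[0][key] = value  # the primary accepts the write immediately

    def propagate():
        time.sleep(lag)       # asynchronous replication lag
        for r in replicas[1:]:
            r[key] = value    # followers converge eventually

    threading.Thread(target=propagate).start()

write("x", 42)
print([r["x"] for r in replicas])  # likely [42, 0, 0] -> stale reads possible
time.sleep(1)
print([r["x"] for r in replicas])  # [42, 42, 42] -> replicas have converged
```

A strongly consistent system would instead block the write until every replica acknowledged it, trading write latency for read freshness.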
Batch processing:
- Processes a finite set of data (data at rest), typically MBs/GBs/TBs.
- Incremental data load is managed by the programmer.
- Cluster is planned as per data size. High throughput.
- Job runs once per batch, e.g. daily, weekly or monthly.

Stream processing:
- Processes a live stream of data (data in motion).
- Data processing is managed by the framework.
- Less throughput.
- Job runs forever.
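A toy contrast between the two models in plain Python (the file name and event source are invented for illustration): the batch job reads a finite input once and exits, while the streaming job loops over an unbounded source:

```python
import time

def batch_job(path):
    # Batch: finite input (data at rest), runs once, then exits
    total = 0.0
    with open(path) as f:
        for line in f:
            total += float(line)
    print("batch total:", total)

def live_source():
    # Stand-in for an unbounded feed, e.g. a message queue
    while True:
        yield 1.0
        time.sleep(1)

def stream_job(source):
    # Stream: unbounded input (data in motion), the job runs forever
    running_total = 0.0
    for event in source:
        running_total += event
        print("running total:", running_total)

# batch_job("numbers.txt")    # scheduled once per day/week/month
# stream_job(live_source())   # runs until explicitly stopped
```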
Hadoop
- Hadoop grew out of the Nutch web crawler project: distributed storage and computing were needed to process huge volumes of web data.
- Hadoop 2.x components:
  - Distributed storage: HDFS (NameNode, Secondary NameNode).
  - Distributed computing: MapReduce (see the sketch below).
  - Cluster manager: YARN (Yet Another Resource Negotiator), with the ResourceManager (RM) at its core.
- Hive: SQL-like queries on Hadoop data; queries are executed as MapReduce jobs, and table metadata is kept in an RDBMS (the Hive metastore).
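The MapReduce programming model is easy to sketch in plain Python; this is a single-machine toy, not the Hadoop API. Map emits (key, value) pairs, the shuffle groups them by key, and reduce aggregates each group:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: aggregate all values emitted for one key
    return (word, sum(counts))

def mapreduce_wordcount(lines):
    groups = defaultdict(list)
    for line in lines:                      # map + shuffle (group by key)
        for key, value in map_phase(line):
            groups[key].append(value)
    return [reduce_phase(k, v) for k, v in groups.items()]

print(mapreduce_wordcount(["big data", "big cluster", "data lake"]))
# [('big', 2), ('data', 2), ('cluster', 1), ('lake', 1)]
```

Hadoop runs the same three phases across a cluster: map tasks process HDFS blocks in parallel, and the framework shuffles intermediate pairs to the reduce tasks.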
The sexiest job in the 21st century requires a mixture of multidisciplinary abilities, and
suitable candidates must be prepared to learn and develop constantly.
-Ronald Van Loon
Q&A
Thank you!
Nilesh Ghule <[email protected]>