Lecture 7 - ETL

This lecture discusses Extract, Transform, Load (ETL), a process for loading data from source systems into a data warehouse. It covers the key steps of extract, transform, and load, as well as variations such as ELT and incremental loading.


DATA SCIENCE LAB
Getting Data into the Data Warehouse
CS 537 - Big Data Analytics
Dr. Faisal Kamiran


ETL

• Data transferred from source applications to the DWH or Data Mart


• Done with a process called ETL

[Figure: ETL Process: Source Systems → Extract → Transform → Load → Destination (Data Warehouse)]

Data sources include structured and unstructured data systems

ETL

• Extract

• Transform

• Load


Extract: Collect data from data sources


Transform: Convert extracted data into a correct and common form
Load: Write data to the target Data Warehouse
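
The three steps can be illustrated with a minimal sketch in Python using pandas; the orders.csv source file, its columns, and the SQLite file standing in for the DWH are illustrative assumptions, not part of the lecture:

import sqlite3
import pandas as pd

# Extract: collect data from a source system (a hypothetical CSV export)
raw = pd.read_csv("orders.csv")

# Transform: convert the extracted data into a correct and common form
clean = raw.dropna(subset=["order_id"]).copy()             # drop rows missing the key
clean["order_date"] = pd.to_datetime(clean["order_date"])  # unify the date format
clean = clean.drop_duplicates(subset=["order_id"])         # de-duplicate

# Load: write the data to the target data warehouse (SQLite stands in for the DWH)
with sqlite3.connect("warehouse.db") as dwh:
    clean.to_sql("fact_orders", dwh, if_exists="append", index=False)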

Extract

• Pull data from multiple source systems


• Traditionally done in "batches" (can be hourly, weekly, etc.)
• Raw data is loaded including any existing errors
• Data transferred to a staging area
[Figure: Extraction: data flows from source systems into a staging area]

Batches: Extraction is not a continuous process. It is done at intervals


Raw data is loaded including any errors (Error correction is done in the transform
stage)
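
A sketch of one such batch extract landing in a staging area; the source database, table, and staging path are hypothetical:

import os
import sqlite3
from datetime import date
import pandas as pd

# Extract one batch of raw data from a source system; errors are kept as-is,
# since correction happens later in the transform stage
with sqlite3.connect("source_app.db") as src:
    raw = pd.read_sql_query("SELECT * FROM employees", src)

# Land the batch in the staging area, one file per extraction run
os.makedirs("staging", exist_ok=True)
raw.to_csv(f"staging/employees_{date.today():%Y%m%d}.csv", index=False)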

Extract - Staging Area

• An intermediate storage area between the data sources and DWH


• The initial data is in different formats and may contain errors, so it cannot be transferred directly to the DWH


The staging area acts as a buffer between the data warehouse and the source
data.
Since data may be coming from multiple different sources, it's likely in various
formats, and directly transferring the data to the warehouse may result in
corrupted data. The staging area is used for transforming the data.

Transform

• Convert data from multiple sources into a uniform format


• Performed on the extracted data in the staging area
• Transforms include
• Cleaning
• Filtering
• Joining
• Sorting
• Splitting
• Deduplication
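
A sketch of these transform operations on staged data using pandas; the staged files and their columns are hypothetical:

import pandas as pd

students = pd.read_csv("staging/students.csv")   # assumed columns: StudentID, Name, Major, Campus
majors = pd.read_csv("staging/majors.csv")       # assumed columns: Major, College

students["Name"] = students["Name"].str.strip()                      # cleaning
students[["LastName", "FirstName"]] = (
    students["Name"].str.split(",", n=1, expand=True))               # splitting
business = students[students["Major"] == "Business"]                 # filtering
joined = business.merge(majors, on="Major", how="left")              # joining
ordered = joined.sort_values(["LastName", "FirstName"])              # sorting
clean = ordered.drop_duplicates(subset=["StudentID"])                # de-duplication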

Load

• Final stage in the ETL process


• Involves transferring data into the DWH

ELT
[Figure: ELT: Extract → Load → Transform]

• ELT - Extract, Load, Transform


• Raw data stored in Hadoop HDFS, AWS S3 etc.
• No staging area
• Use big data environment computing power to transform when
needed
• Used in cases where massive amounts of data need to be ingested
quickly


With ELT, data is immediately available.
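
A minimal sketch of the ELT idea; SQLite stands in here for the big data environment (HDFS/S3 plus its engine), and the clickstream file and columns are assumptions:

import sqlite3
import pandas as pd

# Extract and Load: the raw file lands in the target system unchanged (no staging area)
raw = pd.read_csv("clickstream.csv")
with sqlite3.connect("lake.db") as lake:
    raw.to_sql("raw_clickstream", lake, if_exists="append", index=False)

    # Transform later, on demand, using the target environment's own compute (SQL here)
    lake.execute("""
        CREATE TABLE IF NOT EXISTS daily_clicks AS
        SELECT user_id, date(event_time) AS day, COUNT(*) AS clicks
        FROM raw_clickstream
        GROUP BY user_id, date(event_time)
    """)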

ETL and ELT

• Data Warehouses work with relational SQL-like data structures


• Data must be transformed into a relational structure before it can
be loaded into the Data Warehouse

• ETL used in Data Warehouses as transformation must happen before loading


Online Analytical Processing (OLAP) data warehouses, whether they are cloud-based or on-site, need to work with relational SQL-based data structures. Therefore, any data you load into your OLAP data warehouse must be transformed into a relational format before the data warehouse can ingest it. As part of this transformation, data mapping may also be necessary to combine multiple data sources based on correlating information.

Variations of ETL

• Initial
• Incremental

Initial Load ETL

• Done right before the Data Warehouse goes live


• Normally one time only
• Load all relevant data necessary for Analytics
• Redo if Data Warehouse corrupted

Incremental ETL

• Incrementally "refreshes" the data warehouse


• New data: new employees, products, ...
• Modified data: employee promotions, product price change, ...
• Deleted data: employee resigns, customer unsubscribes
• Load only updated data instances
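
A common way to load only updated instances is a watermark on a last-modified column; a sketch under that assumption (source table, column names, and database files are hypothetical):

import sqlite3
import pandas as pd

with sqlite3.connect("source_app.db") as src, sqlite3.connect("warehouse.db") as dwh:
    # Watermark: the most recent change already present in the warehouse
    last_loaded = dwh.execute(
        "SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM dim_employee").fetchone()[0]

    # Extract only rows added or modified in the source since the last run
    changed = pd.read_sql_query(
        "SELECT * FROM employees WHERE updated_at > ?", src, params=[last_loaded])

    # Append shown here for brevity; modified rows would be upserted in practice
    changed.to_sql("dim_employee", dwh, if_exists="append", index=False)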

Incremental ETL Patterns

Append
• New data added at the end
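
A sketch of the append pattern; the table and staged file names are hypothetical:

import sqlite3
import pandas as pd

new_rows = pd.read_csv("staging/new_orders.csv")
with sqlite3.connect("warehouse.db") as dwh:
    # Append: new rows are simply added at the end of the existing table
    new_rows.to_sql("fact_orders", dwh, if_exists="append", index=False)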

Incremental ETL Patterns

In-place update
• Modify existing data (only some
rows)
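
In-place updates are commonly done with an upsert; a sketch using SQLite's INSERT ... ON CONFLICT syntax, with a hypothetical dim_employee table whose employee_id is the primary key:

import sqlite3

with sqlite3.connect("warehouse.db") as dwh:
    # Upsert: insert the row, or modify the existing row if the key already exists
    # (assumes employee_id has a primary key or unique constraint)
    dwh.execute("""
        INSERT INTO dim_employee (employee_id, name, rank)
        VALUES (?, ?, ?)
        ON CONFLICT(employee_id) DO UPDATE SET
            name = excluded.name,
            rank = excluded.rank
    """, (1017, "Janice Bonvoy", "AP"))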

Incremental ETL Patterns

Complete replacement
• Overwrite existing data


Even if only a single row needs to be changed, the entire table is rewritten.
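
Complete replacement can be sketched as a full reload (hypothetical names again); the whole table is overwritten regardless of how little changed:

import sqlite3
import pandas as pd

full_extract = pd.read_csv("staging/products_full.csv")
with sqlite3.connect("warehouse.db") as dwh:
    # Overwrite: drop and rewrite the entire table
    full_extract.to_sql("dim_product", dwh, if_exists="replace", index=False)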

Incremental ETL Patterns

Rolling Append

• Maintain a certain duration of history
• Wipe old data when new data is appended


For example, maintain only four weeks of data. The time window keeps rolling forward.
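
A sketch of a rolling append that keeps only a four-week window, matching the note above; the table, file, and column names are hypothetical:

import sqlite3
import pandas as pd

new_batch = pd.read_csv("staging/sales_latest_week.csv")
with sqlite3.connect("warehouse.db") as dwh:
    # Append the newest data, then wipe anything older than the rolling window
    new_batch.to_sql("fact_sales", dwh, if_exists="append", index=False)
    dwh.execute("DELETE FROM fact_sales WHERE sale_date < date('now', '-28 days')")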

Incremental ETL Patterns

Modern data warehouses use

✓ Append
✓ In-place update
✗ Complete replacement
✗ Rolling append


Complete replacement and rolling append are not used in modern data warehouses; however, they may be found in very old DWHs.

Data Transformation

Goals
• Uniformity in data
• Restructuring

Restructure from raw form into a well-engineered data structure

Data Transformation Models

• Data value unification


• Data type and size unification
• De-duplication
• Dropping columns (vertical slicing)
• Value-based row filtering (horizontal slicing)
• Correcting known errors

Data value unification

• Merge data into a common format

Campus 1 New Faculty
  LastName    FirstName   Rank
  Johnson     Susan       Professor
  Wilson      Robert      Asst. Prof
  Tolleson    Mary        Asst. Prof
  Zimmerman   Todd        Professor
  Marcus      Walter      Lecturer

Campus 2 New Faculty
  LastName    FirstName   Rank
  Adleman     Robert      P
  Bonvoy      Janice      AP
  Clark       William     L
  Douglas     Thomas      AP


Suppose that we have data from two different campuses, which use different formats for the Rank column.

Data value unification
• Choose one uniform format
• Transform other formats

Faculty Master Dimension
  LastName    FirstName   Rank
  Johnson     Susan       P
  Wilson      Robert      AP
  Tolleson    Mary        AP
  Zimmerman   Todd        P
  Marcus      Walter      L
  Adleman     Robert      P
  Bonvoy      Janice      AP
  Clark       William     L
  Douglas     Thomas      AP


The abbreviated format is our standard one, used in our dimension table in the DWH.
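
A sketch of this unification in pandas; the mapping follows the slide's example (Campus 2's abbreviations are the standard), while the staged file names are hypothetical:

import pandas as pd

campus1 = pd.read_csv("staging/campus1_faculty.csv")
campus2 = pd.read_csv("staging/campus2_faculty.csv")

# Transform Campus 1's spelled-out ranks into the abbreviated standard format
rank_map = {"Professor": "P", "Asst. Prof": "AP", "Lecturer": "L"}
campus1["Rank"] = campus1["Rank"].map(rank_map)

# Both sources now share one uniform format and can feed the master dimension
faculty_master = pd.concat([campus1, campus2], ignore_index=True)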

Data type and size unification

• Use one common set of data types and their sizes

Campus 1 New Faculty
  LastName    FirstName   Rank
  CHAR (35)   CHAR (20)   CHAR (20)

Campus 2 New Faculty
  LastName    FirstName   Rank
  CHAR (30)   CHAR (25)   CHAR (3)


Campus 1 and campus 2 used different data sizes for the columns in their source
systems

Data type and size unification

Use one common set of data types and their sizes


Campus 1 New Faculty
  LastName    FirstName   Rank
  CHAR (35)   CHAR (20)   CHAR (20)

Campus 2 New Faculty
  LastName    FirstName   Rank
  CHAR (30)   CHAR (25)   CHAR (3)

Faculty Master Dimension
  LastName    FirstName   Rank
  CHAR (35)   CHAR (25)   CHAR (3)


Since we previously chose Campus 2's abbreviated Rank scheme, the dimension table uses its CHAR (3) size for Rank; for the name columns, the larger of the two source sizes is used.
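
A sketch of the unified table definition, issued through the same SQLite stand-in used earlier (SQLite accepts the CHAR(n) declarations but does not enforce the lengths, so here they document the chosen standard):

import sqlite3

with sqlite3.connect("warehouse.db") as dwh:
    # One common set of data types and sizes for the Faculty Master Dimension
    dwh.execute("""
        CREATE TABLE IF NOT EXISTS faculty_master (
            LastName  CHAR(35),
            FirstName CHAR(25),
            Rank      CHAR(3)
        )
    """)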

De-duplication
• Remove duplicate data (Greta Williams is taking classes on both campuses)

Campus 1 New Students
  LastName    FirstName   Year
  Jackson     Sally       FR
  Thompson    Richard     SO
  Williams    Greta       FR

Campus 2 New Students
  LastName    FirstName   Year
  Young       Ted         FR
  Williams    Greta       FR


Here, Greta Williams is taking classes on both campuses and needs to register at both, so she has a record in the source systems of both campuses.

De-duplication
Campus 1 New Students
  LastName    FirstName   Year
  Jackson     Sally       FR
  Thompson    Richard     SO
  Williams    Greta       FR

Campus 2 New Students
  LastName    FirstName   Year
  Young       Ted         FR
  Williams    Greta       FR

Campus Student Dimension
  LastName    FirstName   Year
  Jackson     Sally       FR
  Thompson    Richard     SO
  Williams    Greta       FR
  Young       Ted         FR


We need to detect that Greta Williams appears in two different systems but is in fact a single student. This can be done by matching on the CNIC or another natural key, and then de-duplicating the record.
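
A sketch of this de-duplication on a natural key, assuming the CNIC mentioned above is present as a column in both staged extracts:

import pandas as pd

campus1 = pd.read_csv("staging/campus1_students.csv")
campus2 = pd.read_csv("staging/campus2_students.csv")

# Union both campuses, then keep a single row per natural key (CNIC)
students = pd.concat([campus1, campus2], ignore_index=True)
student_dim = students.drop_duplicates(subset=["CNIC"], keep="first")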

Dropping columns (Vertical slicing)

Campus 2 New Faculty
  LastName    FirstName   Rank   Column X   Column Y
  Adleman     Robert      P      ABCDEF     XYZABC
  Bonvoy      Janice      AP     RJTKWH     SLSHJS
  Clark       William     L      QWERTY     ASDFGH
  Douglas     Thomas      AP     ZXCVBN     CBNEUY

Data in Column X and Column Y is not needed for analytic purposes


Slicing is done based on the columns
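
A sketch of vertical slicing in pandas; the file name is hypothetical, while Column X and Column Y are the unneeded columns from the slide:

import pandas as pd

faculty = pd.read_csv("staging/campus2_faculty.csv")

# Vertical slicing: drop columns not needed for analytic purposes
faculty = faculty.drop(columns=["Column X", "Column Y"])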


Value-based row filtering (horizontal slicing)

Campus 1 New Students  →  College of Business Data Mart
  LastName     FirstName   Year   Major
  Jackson      Sally       FR     Business
  Thompson     Richard     SO     Business
  Williams     Greta       FR     Business
  Brady        Michele     FR     Engineering   (filtered out)
  Fitzgerald   Scott       SO     Literature    (filtered out)

Students with other majors will be filtered from the source data


Slicing here is done based on the values in certain columns; rows that do not match are filtered out.

We are building a data mart containing information about business students. Only students with business majors will be included; other students will be filtered out.
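
A sketch of horizontal slicing in pandas, keeping only the rows destined for the College of Business data mart (the staged file name is hypothetical):

import pandas as pd

students = pd.read_csv("staging/campus1_students.csv")

# Horizontal slicing: filter rows by value; only Business majors are kept
business_students = students[students["Major"] == "Business"]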

Correcting known errors

• Fix errors in source data before loading

Campus 2 New Faculty
  LastName    FirstName   Rank   Status
  Adleman     Robert      P      F
  Bonvoy      Janice      AP     X    (should be H)
  Clark       William     L      A
  Douglas     Thomas      AP     F

Status permissible values: F = Full-time | H = Half-time | A = Adjunct
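
A sketch of correcting this known error before loading; the replacement rule (X becomes H) comes from the slide, and the staged file name is hypothetical:

import pandas as pd

faculty = pd.read_csv("staging/campus2_faculty.csv")

# Known error: 'X' is not a permissible Status value; per the slide it should be 'H'
faculty["Status"] = faculty["Status"].replace({"X": "H"})

# Flag anything still outside the permissible set (F, H, A) for review
still_bad = faculty[~faculty["Status"].isin(["F", "H", "A"])]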

Correcting known errors

Faculty Master Dimension
  LastName    FirstName   Rank   Status
  Adleman     Robert      P      F
  Bonvoy      Janice      AP     H
  Clark       William     L      A
  Douglas     Thomas      AP     F

Corrected data loaded in the DWH

ETL best practices and guidelines

• Limit amount of incoming data to be processed


For incremental ETL, only process the data that has been updated in the source systems.

If fact tables are processed first, we might try to process a new student whose entries are not yet present in the dimension table. Trying to do so will result in a foreign key error.

ETL best practices and guidelines

• Limit amount of incoming data to be processed


• Process dimension tables before fact tables
• Opportunities for parallel processing

[Figure: dimension tables DIM1, DIM2, DIM3, ... loaded in parallel before fact table FACT A]


It is preferable to incorporate parallel processing: for example, process dimension tables 1, 2, and 3 in parallel, then tables 4, 5, 6, and 7, and then the fact tables.
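
A sketch of this ordering with Python's concurrent.futures; load_table and the table names are hypothetical placeholders for the real per-table ETL jobs:

from concurrent.futures import ThreadPoolExecutor

def load_table(name):
    # Placeholder for a real per-table ETL job (extract, transform, load)
    print(f"loading {name}")

dimension_tables = ["DIM1", "DIM2", "DIM3"]
fact_tables = ["FACT_A"]

# Process dimension tables first, in parallel, so fact rows can find their dimension keys
with ThreadPoolExecutor() as pool:
    list(pool.map(load_table, dimension_tables))

# Only after all dimensions are loaded, process the fact tables
for fact in fact_tables:
    load_table(fact)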
