
Unit 6 ETL and ELT

The document discusses data preprocessing, focusing on the ETL (Extract, Transform, Load) process, which is essential for integrating and preparing data for analysis. It outlines the tasks involved in data preprocessing, including data cleaning, integration, transformation, and reduction, as well as the steps of the ETL process. Additionally, it contrasts ETL with ELT (Extract, Load, Transform), highlighting their differences in data handling and maintenance requirements.

Uploaded by

MURA- NDASI
Copyright
© All Rights Reserved

Data mining and Warehousing

Module Code: CSC5901

NDAYAMBAJE Simeon

6/25/2021 Course Code & Name 1


Unit 6
ETL and ELT

Data preprocessing
Why preprocessing?
Data are generally:
– Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
Tasks in data preprocessing
1. Data cleaning: fill in missing values and smooth noisy data.
2. Data integration: combine data from multiple databases, data cubes, or files.
3. Data transformation: normalization and aggregation.
4. Data reduction: reduce the volume of data while producing the same or similar analytical results.
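As a minimal sketch of the data cleaning task, the snippet below fills in missing values with the mean of the observed values. The record layout, the `fill_missing` helper, and the sample data are assumptions made for illustration; mean imputation is only one of several possible strategies.

```python
from statistics import mean


def fill_missing(records, field):
    """Return copies of the records with None in `field` replaced by the mean."""
    observed = [r[field] for r in records if r[field] is not None]
    fill_value = mean(observed)
    return [
        {**r, field: fill_value if r[field] is None else r[field]}
        for r in records
    ]


# Hypothetical sample data: one customer has no recorded age.
customers = [
    {"name": "A", "age": 30},
    {"name": "B", "age": None},
    {"name": "C", "age": 40},
]
cleaned = fill_missing(customers, "age")
```

The original list is left unchanged, so the raw data stays available for comparison.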
ETL
(Extract, Transform, and Load)

What is ETL?
• ETL stands for Extraction, Transformation, and Loading.
• The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL.
• The ETL process is technically challenging and requires active input from various stakeholders, including developers, analysts, testers, and top executives.

How Does ETL Work?
• ETL consists of three separate phases: extraction, transformation, and loading.
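The three phases can be sketched as three small functions chained together. This is an illustrative outline only: the CSV text, the `sales` table, and the function names are assumptions, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real source would be a file or a database.
SOURCE_CSV = "id,name,amount\n1, alice ,10.50\n2,BOB,7\n"


def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy the names and convert strings to numeric types."""
    return [
        (int(r["id"]), r["name"].strip().title(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT id, name, amount FROM sales ORDER BY id"
).fetchall()
```

Note that the data is transformed before it reaches the warehouse, which is what distinguishes ETL from ELT.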

Extraction

• Extraction is the operation of extracting information from a source system for further use in a data warehouse environment.
• Extraction is the first stage of the ETL process and one of its most time-consuming tasks.
• The data has to be extracted repeatedly, at regular intervals, to keep the warehouse up to date.
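One common way to implement periodic extraction is to pull only the rows that changed since the previous run. The slides do not prescribe a method, so the sketch below is an assumption: it presumes each source row carries a last-modified timestamp, and `extract_since` is a hypothetical helper.

```python
from datetime import datetime

# Hypothetical source rows, each with a last-modified timestamp.
SOURCE = [
    {"id": 1, "updated": datetime(2021, 6, 1)},
    {"id": 2, "updated": datetime(2021, 6, 10)},
    {"id": 3, "updated": datetime(2021, 6, 20)},
]


def extract_since(rows, last_run):
    """Return the rows changed after `last_run` and the new high-water mark."""
    changed = [r for r in rows if r["updated"] > last_run]
    new_mark = max((r["updated"] for r in changed), default=last_run)
    return changed, new_mark


# First periodic run picks up rows 2 and 3; the second finds nothing new.
first_batch, mark = extract_since(SOURCE, datetime(2021, 6, 5))
second_batch, mark_after = extract_since(SOURCE, mark)
```

Storing the high-water mark between runs keeps each extraction small, at the cost of relying on the source to maintain accurate timestamps.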
Transformation
• Data transformation is the process of converting data from a source format (e.g. a database file, XML document, or Excel sheet) into a particular data warehouse format.
• Data transformation is necessary to ensure that data from one application or database is intelligible to other applications and databases.
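To illustrate converting different source formats into one warehouse format, the sketch below reads product records from a CSV source and an XML source and maps both onto the same dictionary layout. The sample inputs, field names, and helper functions are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical inputs describing products in two different source formats.
CSV_SOURCE = "sku,price\nA1,9.99\n"
XML_SOURCE = (
    "<products><product sku='B2'><price>4.50</price></product></products>"
)


def from_csv(text):
    """Map CSV rows onto the common {'sku', 'price'} layout."""
    return [
        {"sku": row["sku"], "price": float(row["price"])}
        for row in csv.DictReader(io.StringIO(text))
    ]


def from_xml(text):
    """Map <product> elements onto the same common layout."""
    root = ET.fromstring(text)
    return [
        {"sku": p.get("sku"), "price": float(p.findtext("price"))}
        for p in root.iter("product")
    ]


# After conversion, both sources can be handled uniformly.
records = from_csv(CSV_SOURCE) + from_xml(XML_SOURCE)
```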
Data Transformation Strategies
1. Smoothing: the process of removing noise from the data.
2. Aggregation: the process of applying summary operations to the data.
3. Generalization: low-level data are replaced with higher-level data by climbing concept hierarchies.
4. Normalization: database normalization is a technique for organizing the data in the database. It serves two main purposes:
– Eliminating redundant (duplicate) data.
– Ensuring data dependencies make sense.
5. Attribute construction: new attributes are constructed from the given set of attributes.
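Several of these strategies can be sketched on a small data set. Everything below is illustrative: the sales records, the city-to-country hierarchy, and the helper names are assumptions, and a moving average is just one possible smoothing method.

```python
from collections import defaultdict

# Hypothetical sales records.
sales = [
    {"city": "Kigali", "units": 10, "unit_price": 2.0},
    {"city": "Huye", "units": 40, "unit_price": 2.0},
    {"city": "Nairobi", "units": 13, "unit_price": 3.0},
]

# Assumed concept hierarchy used for generalization (city -> country).
CITY_TO_COUNTRY = {"Kigali": "Rwanda", "Huye": "Rwanda", "Nairobi": "Kenya"}


def smooth(values, window=3):
    """Smoothing: centred moving average; the window shrinks at the edges."""
    half = window // 2
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        result.append(sum(chunk) / len(chunk))
    return result


def aggregate_by_country(rows):
    """Generalization plus aggregation: total units per country."""
    totals = defaultdict(int)
    for r in rows:
        totals[CITY_TO_COUNTRY[r["city"]]] += r["units"]
    return dict(totals)


def add_revenue(rows):
    """Attribute construction: derive revenue from units and unit price."""
    return [{**r, "revenue": r["units"] * r["unit_price"]} for r in rows]


smoothed = smooth([r["units"] for r in sales])
by_country = aggregate_by_country(sales)
with_revenue = add_revenue(sales)
```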
Generalization

Generalization is the process of extracting shared characteristics from two or more classes and combining them into a generalized superclass.

Shared characteristics can be attributes, associations, or methods.
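A short sketch of this idea: two account classes share an owner, a balance, and a deposit method, so those are pulled up into a superclass. The class names and attributes are hypothetical and chosen only for illustration.

```python
class Account:
    """Generalized superclass holding what both account types share."""

    def __init__(self, owner, balance=0.0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount


class SavingsAccount(Account):
    """Keeps only what is specific to savings accounts."""

    def __init__(self, owner, balance=0.0, rate=0.02):
        super().__init__(owner, balance)
        self.rate = rate

    def add_interest(self):
        # Reuses the shared deposit method inherited from Account.
        self.deposit(self.balance * self.rate)


class CurrentAccount(Account):
    """Keeps only what is specific to current accounts."""

    def __init__(self, owner, balance=0.0, overdraft_limit=100.0):
        super().__init__(owner, balance)
        self.overdraft_limit = overdraft_limit


savings = SavingsAccount("A", 100.0, rate=0.5)
savings.add_interest()
current = CurrentAccount("B")
current.deposit(20.0)
```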
Normalization

First Normal Form
Each attribute must contain only a single value from its pre-defined domain.
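As a rough sketch of bringing a table into First Normal Form, the snippet below splits a multi-valued attribute into one row per value. The student and phone data and the `to_first_normal_form` helper are hypothetical.

```python
# Hypothetical table that violates 1NF: 'phones' holds several values.
unnormalized = [
    {"student": "Alice", "phones": "0781, 0722"},
    {"student": "Bob", "phones": "0733"},
]


def to_first_normal_form(rows):
    """Emit one row per phone so every attribute holds a single value."""
    return [
        {"student": r["student"], "phone": phone.strip()}
        for r in rows
        for phone in r["phones"].split(",")
    ]


normalized = to_first_normal_form(unnormalized)
```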
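Second Normal Form
The relation is in First Normal Form, and every non-key attribute depends on the whole of the key rather than on only part of it.
Third Normal Form
The relation is in Second Normal Form, and no non-key attribute depends transitively on the key through another non-key attribute.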
Loading

Loading is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.

Loading (cont.)
Loading can be carried out in two ways:
• Refresh: Data warehouse data is completely rewritten. This means that the older data is replaced.
• Update: Only the changes applied to the source information are added to the data warehouse. An update is typically carried out without deleting or modifying preexisting data.
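The two loading modes can be contrasted with a small sketch. An in-memory SQLite database stands in for the data warehouse, and the `customer` table and both helper functions are assumptions made for illustration.

```python
import sqlite3


def refresh(conn, rows):
    """Refresh: discard the existing contents and rewrite everything."""
    conn.execute("DELETE FROM customer")
    conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", rows)
    conn.commit()


def update(conn, new_rows):
    """Update: append only the new rows; preexisting rows stay untouched."""
    conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", new_rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

refresh(conn, [(1, "Alice"), (2, "Bob")])
update(conn, [(3, "Carol")])
after_update = conn.execute(
    "SELECT id, name FROM customer ORDER BY id"
).fetchall()

refresh(conn, [(9, "Zed")])
after_refresh = conn.execute(
    "SELECT id, name FROM customer ORDER BY id"
).fetchall()
```

A refresh is simpler to reason about but rewrites everything each time, while an update touches far less data and therefore tends to use fewer resources.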

ELT (Extract, Load and Transform)

• ELT involves extracting data from the source system and loading it directly into the target system, instead of transforming it between the extraction and loading phases.
• Once the data has been copied or loaded into the target system, the transformation takes place there.
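A minimal ELT sketch follows, again with an in-memory SQLite database standing in for the target system. The raw rows are loaded unchanged into a staging table, and the transformation is then done inside the target with SQL. The table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw rows, exactly as extracted (everything is still text).
RAW_ROWS = [("1", " alice ", "10.50"), ("2", "BOB", "7")]

target = sqlite3.connect(":memory:")

# Load: copy the raw data into the target without changing it.
target.execute("CREATE TABLE staging_sales (id TEXT, name TEXT, amount TEXT)")
target.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", RAW_ROWS)

# Transform: clean and convert the data inside the target using SQL.
target.execute(
    """
    CREATE TABLE sales AS
    SELECT CAST(id AS INTEGER) AS id,
           UPPER(TRIM(name)) AS name,
           CAST(amount AS REAL) AS amount
    FROM staging_sales
    """
)
target.commit()

elt_rows = target.execute(
    "SELECT id, name, amount FROM sales ORDER BY id"
).fetchall()
```

Because the raw data remains in the staging table, it can be transformed again later in a different way without going back to the source.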
Difference between ETL and ELT
Process
– ETL: Data is transferred to the ETL server and moved back to the database. High network bandwidth is required.
– ELT: Data remains in the database, except for cross-database loads (e.g. source to object).
Transformation
– ETL: Transformations are performed in the ETL server.
– ELT: Transformations are performed in the target (or in the source).
Maintenance
– ETL: High maintenance, as you need to select the data to load and transform.
– ELT: Low maintenance, as the data is always available.

Analysis

Thank you
