DWDM Unit 1

Uploaded by

Shubham
Copyright
© © All Rights Reserved
Data WareHousing and

Data Mining
BCA VI SEM
UNIT 1
Syllabus
Database
A database is an organized collection of structured data that is stored
electronically. Structured data follows a pre-established format, which
makes it easier to access and analyze. It is typically laid out in tables,
with relationships between the rows and columns.
Many dynamic webpages on the Internet today use databases to store their
content, so that data can be handled, updated, controlled, and organized
effectively. Most databases use Structured Query Language (SQL) both for
creating and for retrieving data. Consider Facebook: it must be able to
store, modify, and display information about users, their contacts,
member actions, communications, ads, and a variety of other things. In
such cases, databases are crucial for storing data efficiently.
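As a minimal sketch of storing, modifying, and retrieving data with SQL, the following uses Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Store (INSERT) and modify (UPDATE) rows.
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Delhi")],
)
conn.execute("UPDATE users SET city = ? WHERE name = ?", ("Mumbai", "Ravi"))

# Retrieve (SELECT) rows.
rows = conn.execute("SELECT name, city FROM users ORDER BY id").fetchall()
print(rows)
```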
DataBase System
A database system is the traditional way of storing and retrieving
data. Its major task is query processing. Such systems are generally
referred to as online transaction processing (OLTP) systems and are
used for the day-to-day operations of an organization.
Introduction to Data Warehousing
➔ The term "Data Warehouse" was first coined by Bill
Inmon in 1990.
➔ According to Inmon, a data warehouse is a subject
oriented, integrated, time-variant, and nonvolatile
collection of data.
➔ A data warehouse refers to a data repository that is
maintained separately from an organization’s
operational databases.
➔ A Data Warehouse (DW) is a relational database that is
designed for query and analysis rather than transaction
processing. It includes historical data derived from
transaction data from single or multiple sources.
Database System vs. Data Warehouse
THE COMPELLING NEED FOR DATA
WAREHOUSING
In the 1990s, as businesses grew more complex,
corporations spread globally, and competition became
fiercer, business executives became desperate for
information to stay competitive and improve the
bottom line. The operational computer systems did
provide information to run the day-to-day operations, but
what the executives needed were different kinds of
information that could be readily used to make strategic
decisions.
Organizations achieve competitive
advantage:
DATA WAREHOUSING—THE ONLY
VIABLE SOLUTION
● The type of information needed for strategic
decision making is different from that available
from operational systems.
● A DW is a subject-oriented, integrated,
time-variant and non-volatile collection of data
in support of management’s decision making
process.
Characteristics/features of DW
Data Warehouse-Subject-Oriented
➔ Organized around major subjects, such as customer,
product, sales.
➔ Focusing on the modeling and analysis of data for
decision makers, not on daily operations or
transaction processing.
➔ Provide a simple and concise view around particular
subject issues by excluding data that are not useful in
the decision support process.
Data Warehouse—Integrated
➔ Constructed by integrating multiple,
heterogeneous data
sources
◆ relational databases, flat files,
on-line transaction records
➔ Data cleaning and data integration
techniques are applied.
◆ Ensure consistency in naming
conventions, encoding structures,
attribute measures, etc. among
different data sources
● E.g., Hotel price: currency, tax,
breakfast covered, etc.
◆ When data is moved to the
warehouse, it is converted
Data Warehouse—Time Variant
➔ The time horizon for the data warehouse is
significantly longer than that of operational systems
◆ Operational database: current value data
◆ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
➔ Every key structure in the data warehouse
◆ Contains an element of time, explicitly or implicitly
◆ But the key of operational data may or may not
contain a "time element"
Data Warehouse-Nonvolatile
➔ A physically separate store of data transformed from the
operational environment
➔ Operational update of data does not occur in the data
warehouse environment
◆ Does not require transaction processing, recovery, and
concurrency control mechanisms
◆ Requires only two operations in data accessing: initial
loading of data and access of data
Data Granularity
Granularity is one of the main elements in the modeling of
DW data.
Granularity of data refers to detail levels. Multiple levels of
detail may be available depending on the requirements. At
least two granular levels exist for many data warehouses.
The relation between detail and granularity is important to
understand. Fine (low) granularity means greater detail in the
data (less summarization). Coarse (high) granularity means
fewer details (greater summarization). Operational data is
stored at the finest level of granularity, that is, at the most
detailed level.
Example of Data Granularity
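A small sketch of two granularity levels, assuming hypothetical daily sales records: the daily records are the fine-grained level, and the monthly totals derived from them are the coarser, summarized level. The data and the function name are illustrative.

```python
from collections import defaultdict

# Fine granularity: one record per individual sale (hypothetical data).
daily_sales = [
    ("2024-01-05", "pen", 10),
    ("2024-01-20", "pen", 5),
    ("2024-02-03", "pen", 7),
    ("2024-02-03", "book", 2),
]

def summarize_by_month(sales):
    """Roll daily records up to one total per (month, product)."""
    totals = defaultdict(int)
    for date, product, qty in sales:
        month = date[:7]  # the "YYYY-MM" prefix of the date string
        totals[(month, product)] += qty
    return dict(totals)

# Coarse granularity: fewer rows, less detail.
monthly_sales = summarize_by_month(daily_sales)
```

The summarized level has three rows instead of four and can no longer tell on which day a sale happened, which is the trade-off between detail and volume.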
DW and Data Marts
A Data Mart contains a subset of corporate-wide data that is of
value to a specific group of users. The scope is confined to
specific selected subjects. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The data
contained in data marts tends to be summarized.
In 1998 Bill Inmon stated, “the most important issue
facing the IT manager is whether to build the data
warehouse first or the data mart first”.
Approaches for designing DW/DM
○ Top-Down Approach (Dependent Data Marts): The data
warehouse is designed first, and then data marts are built on top
of the data warehouse.
○ Bottom-Up Approach (Independent Data Marts): Data marts
are created first to provide reporting and analytics
capability for specific business processes; later, the enterprise
data warehouse is created from these data marts.
Top-Down Approach
Advantages / Disadvantages of Top-Down
Approach
The advantages of this approach are:
● A truly corporate effort, an enterprise view of data
● Inherently architected, not a union of disparate data marts
● Single, central storage of data about the content
● Centralized rules and control
● May see quick results if implemented with iterations
The disadvantages are:
● Takes longer to build even with an iterative method
● High exposure to risk of failure
● Needs high level of cross-functional skills
● High outlay without proof of concept

https://www.geeksforgeeks.org/data-warehouse-architecture/
Bottom-Up Approach
Advantages / Disadvantages of Bottom-up
Approach
The advantages of this approach are:
● Faster and easier implementation of manageable pieces
● Favorable return on investment and proof of concept
● Less risk of failure
● Inherently incremental; can schedule important data marts first
● Allows project team to learn and grow
The disadvantages are:
● Each data mart has its own narrow view of data
● Permeates redundant data in every data mart
● Perpetuates inconsistent and irreconcilable data
● Proliferates unmanageable interfaces
(ETL) Extraction, Transformation, Loading
DW systems use back-end tools and utilities to populate and
refresh their data. These tools and utilities include the
following functions:
Data Extraction: gathers data from multiple sources
Data Cleaning: detects and rectifies errors
Data Transformation: converts data to the warehouse format
Load: sorts, summarizes, consolidates, computes views,
checks integrity, and builds indices and partitions
Refresh: propagates the updates from the data sources to
the warehouse
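The functions listed above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two sources are hard-coded lists, the warehouse is an in-memory SQLite database, and all function, table, and field names are hypothetical. The refresh function is not shown.

```python
import sqlite3

def extract():
    """Gather records from multiple (here, hard-coded) sources."""
    source_a = [{"cust": "Asha ", "amount": "100"}]
    source_b = [{"cust": "ravi", "amount": "250"}, {"cust": "", "amount": "30"}]
    return source_a + source_b

def clean(records):
    """Drop records with a missing customer name."""
    return [r for r in records if r["cust"].strip()]

def transform(records):
    """Convert to the warehouse format: trimmed title-case name, integer amount."""
    return [(r["cust"].strip().title(), int(r["amount"])) for r in records]

def load(conn, rows):
    """Load rows into the warehouse table and build an index."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_customer ON sales (customer)")
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(clean(extract())))
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```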
Introduction to Data Warehousing
⚫ A Data Warehouse is used for reporting and analyzing
of information and stores both historical and current
data.
⚫ The data in DW system is used for Analytical
reporting, which is later used by Business Analysts,
Sales Managers or Knowledge workers for
decision-making.
Information from Data Warehousing
1. Increasing customer focus, which includes the
assessment of customer buying patterns.
2. Repositioning products and managing product
portfolios by comparing the performance of sales by
quarter, by year, and by geographical region.
3. Analyzing operations and looking for sources of
profit.
4. Managing customer relationships, making
environmental corrections, and managing the cost of
corporate assets.
Three-Tier Data Warehouse Architecture
Data Warehouses usually have a three-level (tier) architecture that
includes:
➢ Bottom Tier (Data Warehouse Server)
➢ Middle Tier (OLAP Server)
➢ Top Tier (Front end Tools).
➢ A bottom-tier that consists of the Data Warehouse server,
which is almost always an RDBMS. It may include several
specialized data marts and a metadata repository.
➢ Data from operational databases and external sources (such
as user profile data provided by external consultants) are
extracted using application program interfaces called
gateways. A gateway is provided by the underlying DBMS and
allows client programs to generate SQL code to be
executed at a server.
➢ Examples of gateways include ODBC (Open Database
Connectivity) and OLE DB (Object Linking and Embedding,
Database), by Microsoft, and JDBC (Java Database
Connectivity).
⚫ A middle-tier which consists of an OLAP server for fast
querying of the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended
relational DBMS that maps functions on multidimensional
data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a
special-purpose server that directly implements
multidimensional data and operations.
⚫ A top-tier that contains front-end tools for displaying
results provided by OLAP, as well as additional tools for
data mining of the OLAP-generated data.
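To illustrate how a ROLAP server can map an operation on multidimensional data to a standard relational operation, here is a minimal sketch in which a roll-up along one dimension becomes a SQL GROUP BY. The fact table, its columns, and the roll_up helper are assumptions for illustration, not part of any OLAP product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, quarter TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("North", "Q1", 100), ("North", "Q2", 150),
     ("South", "Q1", 80), ("South", "Q2", 120)],
)

def roll_up(conn, dimension):
    """Aggregate the amount measure along one dimension using GROUP BY."""
    # Column names cannot be bound as SQL parameters, so check a whitelist.
    if dimension not in ("region", "quarter"):
        raise ValueError(f"unknown dimension: {dimension}")
    sql = (
        f"SELECT {dimension}, SUM(amount) FROM sales_fact "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(sql).fetchall()

by_region = roll_up(conn, "region")
by_quarter = roll_up(conn, "quarter")
```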
Goals of Data Warehousing

➔ To help reporting as well as analysis

➔ Maintain the organization's historical information

➔ Be the foundation for decision making.


Benefits of Data Warehouse
➔ Better business analytics: A data warehouse stores the
company's past data and records in one place and supports their
analysis, which improves the company's understanding of its
data.
➔ Faster queries: A data warehouse is designed to handle
large analytical queries, so it runs them faster than an
operational database.
➔ Improved data quality: Data gathered from different sources
is stored and analyzed in the warehouse, which does not alter
or add data by itself, so data quality is maintained. Any data
quality issue that does arise is handled by the data warehouse
team.
➔ Historical insight: The warehouse stores all historical
data about the business, so it can be analyzed at any time to
extract insights.
Disadvantages of Data Warehousing
➔ Cost: Building a data warehouse can be expensive,
requiring significant investments in hardware, software,
and personnel.
➔ Complexity: Data warehousing can be complex, and
businesses may need to hire specialized personnel to
manage the system.
➔ Time-consuming: Building a data warehouse can take a
significant amount of time, requiring businesses to be
patient and committed to the process.
➔ Data integration challenges: Data from different
sources can be challenging to integrate, requiring
significant effort to ensure consistency and accuracy.
➔ Data security: Data warehousing can pose data security
risks, and businesses must take measures to protect
sensitive data from unauthorized access or breaches.
Data Warehouse vs DBMS
| No. | Database | Data Warehouse |
| 1 | It is used for Online Transactional Processing (OLTP) but can be used for other objectives such as data warehousing. It records the data from the clients for history. | It is used for Online Analytical Processing (OLAP). It reads the historical information about customers for business decisions. |
| 2 | The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space. | The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries. |
| 3 | Data is dynamic. | Data is largely static. |
| 4 | Entity-relationship modeling techniques are used for RDBMS database design. | Data modeling techniques are used for the data warehouse design. |
| 5 | Optimized for write operations. | Optimized for read operations. |
| 6 | Performance is low for analytical queries. | High performance for analytical queries. |
| 7 | The database is the place where the data is taken as a base and managed to provide fast and efficient access. | The data warehouse is the place where the application data is handled for analysis and reporting objectives. |
Data warehouse – The building Blocks
Data warehouse – The building Blocks :
Source Data Component
Source data coming into the data warehouses may be grouped into
four broad categories:
1. Production Data: This type of data comes from the different
operational systems of the enterprise. Based on the data
requirements in the data warehouse, we choose segments of the
data from the various operational systems.
2. Internal Data: In each organization, users keep their
"private" spreadsheets, reports, customer profiles, and
sometimes even departmental databases. This is the internal data,
part of which could be useful in a data warehouse.
3. Archived Data: Operational systems are mainly intended to run
the current business. In every operational system, we
periodically take the old data and store it in archived files.
4. External Data: Most executives depend on information from
external sources for a large percentage of the information they
use. They use statistics relating to their industry produced by
external agencies.
Data warehouse – The building Blocks:
Data Staging Component
⚫ After we have extracted data from various operational systems
and external sources, we have to prepare it for storing in the
data warehouse. The extracted data coming from several different
sources needs to be changed, converted, and made ready in a
format that is suitable for querying and analysis.
⚫ We will now discuss the three primary functions that take place
in the staging area.
⚫ Data Extraction: This function has to deal with numerous data
sources. We have to employ the appropriate technique for each
data source.
Data warehouse – The building Blocks :
Data Staging Component
Data Transformation: As we know, data for a data warehouse comes from many different
sources. If data extraction for a data warehouse poses big challenges, data transformation
presents even greater challenges. We perform several individual tasks as part of data
transformation.
➔ First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings, the provision of default values for missing data elements, or the
elimination of duplicates when we bring in the same data from various source systems.
➔ Standardization of data components forms a large part of data transformation. Data
transformation also involves many forms of combining pieces of data from different
sources. We combine data from a single source record or related data parts from many
source records.
➔ On the other hand, data transformation also involves purging source data that is not
useful and separating out source records into new combinations. Sorting and merging of
data take place on a large scale in the data staging area. When the data transformation
function ends, we have a collection of integrated data that is cleaned, standardized, and
summarized.
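The cleaning tasks named above (default values for missing data elements and elimination of duplicates) can be sketched as follows. The record layout, the default value, and the rule that two records are duplicates when name and city match case-insensitively are all assumptions chosen for illustration.

```python
def clean_records(records, defaults):
    """Fill missing fields with defaults and drop duplicate records."""
    seen = set()
    cleaned = []
    for rec in records:
        # Provide default values for missing (None or empty) data elements.
        filled = {k: (rec.get(k) or default) for k, default in defaults.items()}
        # Normalize the key so the same customer from two sources matches.
        key = (filled["name"].strip().lower(), filled["city"].strip().lower())
        if key in seen:
            continue  # duplicate brought in from another source system
        seen.add(key)
        cleaned.append(filled)
    return cleaned

raw = [
    {"name": "Asha", "city": "Pune"},
    {"name": "asha ", "city": "pune"},   # same customer, second source
    {"name": "Ravi", "city": None},      # missing city
]
result = clean_records(raw, {"name": "UNKNOWN", "city": "UNKNOWN"})
```

The first occurrence of a duplicate is kept as-is; a real staging area would usually also decide which source is authoritative.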
Data Loading: Two distinct categories of tasks form the data loading function. When we
complete the design and construction of the data warehouse and go live for the first time,
we do the initial loading of the data into the data warehouse storage. The initial load
moves high volumes of data and takes a substantial amount of time.
Data warehouse – The building Blocks :
Data Storage Component
● Data storage for the data warehouse is a separate
repository.
● The data repositories for the operational systems
generally include only the current data.
● Also, these data repositories hold the data in highly
normalized structures for fast and efficient
processing.
Data warehouse – The building Blocks :
Information Delivery Component
The information delivery element is used to enable the process of
subscribing to data warehouse files and having them transferred to
one or more destinations according to some customer-specified
scheduling algorithm.
Data warehouse – The building Blocks :
Metadata Component
⚫ Metadata in a data warehouse is similar to the data
dictionary or the data catalog in a database
management system.
⚫ In the data dictionary, we keep the data about the
logical data structures, the data about the records and
addresses, the information about the indexes, and so
on.
Data warehouse – The building Blocks :
Data Marts
⚫ It includes a subset of corporate-wide data that is of value
to a specific group of users.
⚫ The scope is confined to particular selected subjects. Data
in a data warehouse should be fairly current, but not
necessarily up to the minute, although developments in the
data warehouse industry have made regular and
incremental data dumps more achievable.
⚫ Data marts are smaller than data warehouses and usually
cover a single part of the organization.
⚫ The current trend in data warehousing is to develop a
data warehouse with several smaller related data marts for
particular kinds of queries and reports.
Data warehouse – The building Blocks :
Management and Control Component
⚫ The management and control elements coordinate the
services and functions within the data warehouse.
⚫ These components control the data transformation and
the data transfer into the data warehouse storage.
⚫ On the other hand, they moderate the data delivery to the
clients.
⚫ They work with the database management system and
ensure that data is correctly saved in the repositories.
⚫ They monitor the movement of data into the
staging area and from there into the data warehouse
storage itself.
What is Data Mart?
A Data Mart is a subset of an organizational data store,
generally oriented to a specific purpose or primary data subject,
which may be distributed to support business needs. Data
Marts are analytical data stores designed to focus on
particular business functions for a specific community within
an organization. Data marts are derived from subsets of data in
a data warehouse, though in the bottom-up data warehouse
design methodology, the data warehouse is created from the
union of organizational data marts.
⚫ The fundamental use of a data mart is Business
Intelligence (BI) applications. BI is used to gather, store,
access, and analyze data. A data mart can be used by smaller
businesses to utilize the data they have accumulated, since it
is less expensive than implementing a data warehouse.
What is Data Mart?
Reasons for creating a data mart
➔ Creates a collective view of data for a group of users
➔ Easy access to frequently needed data
➔ Ease of creation
➔ Improves end-user response time
➔ Lower cost than implementing a complete data
warehouse
➔ Potential clients are more clearly defined than in a
comprehensive data warehouse
➔ It contains only essential business data and is less
cluttered.
Types of Data Marts

There are mainly two approaches to designing data
marts. These approaches are
➔ Dependent Data Marts
➔ Independent Data Marts
Dependent Data Marts
⚫ A dependent data mart is a logical or physical subset
of a larger data warehouse.
⚫ According to this technique, the data marts are treated
as subsets of a data warehouse.
⚫ In this technique, a data warehouse is created first,
from which various data marts can then be created.
⚫ These data marts are dependent on the data warehouse
and extract the essential data from it.
⚫ In this technique, the data marts are created from the data
warehouse; therefore, there is no need for data mart
integration. It is also known as a top-down approach.
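As a rough sketch of a dependent data mart, the following derives a summarized marketing mart from a warehouse table using CREATE TABLE ... AS SELECT in SQLite. The warehouse table, its columns, and the choice of marketing as the subject are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_sales (dept TEXT, customer TEXT, item TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)",
    [
        ("marketing", "Asha", "pen", 10),
        ("marketing", "Asha", "book", 40),
        ("finance", "Ravi", "ledger", 90),
    ],
)

# The data mart is derived from the warehouse: a summarized subset
# that serves one specific group of users.
conn.execute(
    """
    CREATE TABLE marketing_mart AS
    SELECT customer, SUM(amount) AS total_amount
    FROM warehouse_sales
    WHERE dept = 'marketing'
    GROUP BY customer
    """
)
mart_rows = conn.execute(
    "SELECT customer, total_amount FROM marketing_mart"
).fetchall()
```

Because the mart is built from data that is already integrated in the warehouse, no separate integration step between marts is needed.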
Independent Data Marts
⚫ The second approach is Independent data marts (IDM)
⚫ Here, firstly independent data marts are created, and then
a data warehouse is designed using these independent
multiple data marts.
⚫ In this approach, as all the data marts are designed
independently; therefore, the integration of data marts is
required.
⚫ It is also termed as a bottom-up approach as the data
marts are integrated to develop a data warehouse.
Steps in Implementing a Data Mart
The significant steps in implementing a data mart are
to design the schema, construct the physical storage,
populate the data mart with data from source systems,
access it to make informed decisions and manage it
over time. So, the steps are:
➔ Designing
➔ Constructing
➔ Populating
➔ Accessing
➔ Managing
Steps in Implementing a Data Mart
Designing
The design step is the first in the data mart process. This
phase covers all of the functions from initiating the request
for a data mart through gathering data about the
requirements and developing the logical and physical design
of the data mart.
➔ It involves the following tasks:
➔ Gathering the business and technical requirements
➔ Identifying data sources
➔ Selecting the appropriate subset of data
➔ Designing the logical and physical architecture of the
data mart.
Steps in Implementing a Data Mart
Constructing
This step contains creating the physical database and
logical structures associated with the data mart to
provide fast and efficient access to the data. It involves
the following tasks:
➔ Creating the physical database and logical structures
such as tablespaces associated with the data mart.
➔ Creating the schema objects such as tables and indexes
described in the design step.
➔ Determining how best to set up the tables and access
structures.
Steps in Implementing a Data Mart
Populating
This step includes all of the tasks related to the getting
data from the source, cleaning it up, modifying it to the
right format and level of detail, and moving it into the
data mart. It involves the following tasks:
➔ Mapping data sources to target data structures
➔ Extracting data
➔ Cleansing and transforming the information.
➔ Loading data into the data mart
➔ Creating and storing metadata
Steps in Implementing a Data Mart
Accessing
This step involves putting the data to use: querying the
data, analyzing it, creating reports, charts and graphs
and publishing them. It involves the following tasks:
➔ Set up an intermediate layer (Meta Layer) for the
front-end tool to use. This layer translates database
operations and object names into business terms
so that the end-clients can interact with the data mart
using words that relate to the business functions.
➔ Set up and manage database structures like
summarized tables, which help queries submitted through
the front-end tools execute rapidly and efficiently.
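One way to picture the intermediate (meta) layer is as a mapping from business terms to physical column names, which a front-end tool could use to build queries. The mapping, the column names, and the build_query helper below are hypothetical and only sketch the idea.

```python
# Hypothetical mapping from business terms to physical column names.
META_LAYER = {
    "Customer Name": "cust_nm",
    "Sales Amount": "sls_amt",
    "Region": "rgn_cd",
}

def build_query(business_terms, table="sales_summary"):
    """Translate business terms into a SELECT over the underlying columns."""
    # An unknown term raises KeyError rather than producing a bad query.
    columns = [f'{META_LAYER[term]} AS "{term}"' for term in business_terms]
    return f"SELECT {', '.join(columns)} FROM {table}"

query = build_query(["Customer Name", "Sales Amount"])
print(query)
```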
Steps in Implementing a Data Mart
Managing
This step contains managing the data mart over its
lifetime. In this step, management functions are
performed as:
➔ Providing secure access to the data.
➔ Managing the growth of the data.
➔ Optimizing the system for better performance.
➔ Ensuring the availability of data even in the event of system
failures.
Difference between Data Warehouse and Data Mart
| S.No | Data Warehouse | Data Mart |
| 1 | A data warehouse is a centralised system. | A data mart is a decentralised system. |
| 2 | In a data warehouse, light denormalization takes place. | In a data mart, high denormalization takes place. |
| 3 | A data warehouse is a top-down model. | A data mart is a bottom-up model. |
| 4 | Building a warehouse is difficult. | Building a mart is easy. |
| 5 | In a data warehouse, the fact constellation schema is used. | In a data mart, the star schema and snowflake schema are used. |
| 6 | A data warehouse is flexible. | A data mart is not flexible. |
| 7 | A data warehouse is data-oriented in nature. | A data mart is project-oriented in nature. |
| 8 | A data warehouse has a long life. | A data mart has a shorter life than a warehouse. |
| 9 | In a data warehouse, data are contained in detailed form. | In a data mart, data are contained in summarized form. |
| 10 | A data warehouse is vast in size. | A data mart is smaller than a warehouse. |
| 11 | A data warehouse might be somewhere between 100 GB and 1 TB+ in size. | The size of a data mart is less than 100 GB. |
| 12 | The time it takes to implement a data warehouse might range from months to years. | The data mart deployment procedure is time-limited to a few months. |
| 13 | It uses a lot of data and has comprehensive operational data. | Operational data are not present in a data mart. |
| 14 | It collects data from various data sources. | It generally stores data from a data warehouse. |
| 15 | Long time for processing the data because of large data. | Less time for processing the data because of handling only a small amount of data. |
| 16 | Complicated design process of creating schemas and views. | Easy design process of creating schemas and views. |
What is Meta Data in Data Warehousing?
● Metadata is simply defined as data about data. The
data that is used to represent other data is known as
metadata. For example, the index of a book serves as a
metadata for the contents in the book. In other words,
we can say that metadata is the summarized data that
leads us to detailed data. In terms of data warehouse,
we can define metadata as follows.
● Metadata is the road-map to a data warehouse.
● Metadata in a data warehouse defines the warehouse
objects.
● Metadata acts as a directory. This directory helps the
decision support system to locate the contents of a
data warehouse.
What is Meta Data in Data Warehousing?
Metadata includes the following:
● The location and descriptions of warehouse systems and
components.
● Names, definitions, structures, and content of
data-warehouse and end-user views.
● Identification of authoritative data sources.
● Integration and transformation rules used to populate data.
● Integration and transformation rules used to deliver
information to end-user analytical tools.
● Subscription information for information delivery to analysis
subscribers.
● Metrics used to analyze warehouse usage and performance.
● Security authorizations, access control list, etc.
What is Meta Data in Data Warehousing?
● Metadata can be stored in various forms, such as text,
XML, or RDF, and can be organized using metadata
standards and schemas.
● Metadata can be used in a variety of contexts, such as
libraries, museums, archives, and online platforms.
● Metadata can also support data preservation by providing
information about the context, provenance, and
preservation needs of data, and can support data
visualization by providing information about the data’s
structure and content, and by enabling the creation of
interactive and customizable visualizations.
Examples Of Metadata
Metadata is data that provides information about other data.
Here are a few examples of metadata:
● File metadata: This includes information about a file,
such as its name, size, type, and creation date.
● Image metadata: This includes information about an
image, such as its resolution, color depth, and camera
settings.
● Music metadata: This includes information about a piece
of music, such as its title, artist, album, and genre.
● Video metadata: This includes information about a video,
such as its length, resolution, and frame rate.
● Document metadata: This includes information about a
document, such as its author, title, and creation date.
● Database metadata: This includes information about a
database, such as its structure, tables, and fields.
● Web metadata: This includes information about a web
page, such as its title, keywords, and description.
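As a concrete instance of database metadata, the sketch below reads the table names and field definitions of a small SQLite database through its built-in catalog (sqlite_master) and PRAGMA table_info. The customer table is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Table names come from SQLite's built-in catalog table, sqlite_master.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
)]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
fields = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]
```

Neither query touches the customer rows themselves; both describe the data rather than return it.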
Types / Categories of Metadata
Metadata in a data warehouse fall into three major categories:
● Operational Metadata
● Extraction and Transformation Metadata
● End-User Metadata
Operational Metadata: Contains all the information about the
operational data sources.
Extraction and Transformation Metadata: Contains data about the
extraction of data from the source systems, namely, the extraction
frequencies, extraction methods, and business rules for the data
extraction. This category of metadata also contains information about
all the data transformations that take place in the data staging area.
End-User Metadata: The end-user metadata is the navigational map
of the data warehouse. It enables the end-users to find information
from the data warehouse. The end-user metadata allows the end-users
to use their own business terminology and look for information in those
ways in which they normally think of the business.
Role/Importance of Metadata
Metadata has a very important role in a data warehouse. The role of metadata
in a warehouse is different from the warehouse data, yet it plays an important
role. The various roles of metadata are explained below:
● Metadata acts as a directory.
● This directory helps the decision support system to locate the contents of
the data warehouse.
● Metadata helps in decision support system for mapping of data when data
is transformed from operational environment to data warehouse
environment.
● Metadata helps in summarization between current detailed data and highly
summarized data.
● Metadata also helps in summarization between lightly summarized data and
highly summarized data.
● Metadata is used for query tools.
● Metadata is used in extraction and cleansing tools.
● Metadata is used in reporting tools.
● Metadata is used in transformation tools.
● Metadata plays an important role in loading functions.
ETL (Extract, Transform, and Load) Process
● The mechanism of extracting information from source
systems and bringing it into the data warehouse is
commonly called ETL, which stands for Extraction,
Transformation and Loading.
● The ETL process requires active inputs from various
stakeholders, including developers, analysts, testers, top
executives and is technically challenging.
● To maintain its value as a tool for decision-makers, a data
warehouse system needs to change with business
changes. ETL is a recurring activity (daily, weekly,
monthly) of a data warehouse system and needs to be
agile, automated, and well documented.
ETL (Extract, Transform, and Load) Process
How ETL Works?
⚫ ETL consists of three separate phases: extraction,
transformation, and loading.
ETL (Extract, Transform, and Load) Process
Extraction
● Extraction is the operation of extracting information from
a source system for further use in a data warehouse
environment. This is the first stage of the ETL process.
● Extraction process is often one of the most
time-consuming tasks in the ETL.
● The source systems might be complicated and poorly
documented, and thus determining which data needs to
be extracted can be difficult.
● The data has to be extracted several times in a periodic
manner to supply all changed data to the warehouse and
keep it up-to-date.
ETL (Extract, Transform, and Load) Process
Cleansing
● The cleansing stage is crucial in a data warehouse because
it is meant to improve data quality.
● The primary data cleansing features found in ETL tools are
rectification and homogenization.
● They use specific dictionaries to rectify typing mistakes and to
recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between
values.
● The following examples show why data cleansing is essential:
○ If an enterprise wishes to contact its users or its suppliers, a
complete, accurate and up-to-date list of contact addresses, email
addresses and telephone numbers must be available.
○ If a client or supplier calls, the staff responding should be able to
quickly find the person in the enterprise database, but this requires that
the caller's name or his/her company name is listed in the database.
○ If a user appears in the databases with two or more slightly different
names or different account numbers, it becomes difficult to update
the customer's information.
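Dictionary-based rectification, synonym homogenization and a rule-based check can be sketched roughly as below. The dictionaries, the sample records and the six-digit PIN rule are hypothetical and exist only for illustration; real ETL tools rely on much larger dictionaries and rule sets.

```python
# Hypothetical dictionaries, made up for this example.
TYPO_FIXES = {"Dehli": "Delhi", "Mumbay": "Mumbai"}   # rectification
SYNONYMS = {"Bombay": "Mumbai"}                        # homogenization

def cleanse_city(value):
    """Trim whitespace, fix known typos, then map synonyms to one spelling."""
    value = value.strip()
    value = TYPO_FIXES.get(value, value)
    return SYNONYMS.get(value, value)

def violates_rule(record):
    """Rule-based check (assumed domain rule): a PIN code must be 6 digits."""
    pin = record["pin"]
    return not (pin.isdigit() and len(pin) == 6)

records = [
    {"name": "A. Kumar", "city": " Dehli ", "pin": "110001"},
    {"name": "R. Shah", "city": "Bombay", "pin": "4000"},
]

# Apply the dictionaries to every record, then flag rule violations.
cleaned = [dict(r, city=cleanse_city(r["city"])) for r in records]
rejected = [r["name"] for r in cleaned if violates_rule(r)]
```

Records that fail a rule are collected rather than silently dropped, so they can be reviewed and corrected at the source.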
Transformation
● Transformation is the core of the reconciliation phase. It
converts records from their operational source format into a
particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled
data layer.
● The following issues must be rectified in this phase:
○ Loose text may hide valuable information. For example,
"XYZ PVT Ltd" does not explicitly state that this is a private
limited company.
○ Different formats can be used for the same kind of data. For
example, a date can be saved as a string or as three integers.
Following are the main transformation processes aimed
at populating the reconciled data layer:
● Conversion and normalization that operate on both
storage formats and units of measure to make data
uniform.
● Matching that associates equivalent fields in different
sources.
● Selection that reduces the number of source fields and
records.

Cleansing and transformation processes are often closely linked in ETL tools.
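The three transformation processes listed above (conversion and normalization, matching, selection) can be illustrated with a small sketch. The two source layouts, the field names and the choice of kilograms as the target unit are assumptions made up for this example.

```python
from datetime import date

# Hypothetical records from two sources describing the same kind of entity.
source_a = [{"cust_name": "Asha", "joined": "2024-03-15", "weight_lb": 22.0, "fax": "n/a"}]
source_b = [{"customer": "Ravi", "joined": (2023, 11, 2), "weight_kg": 5.0, "fax": "n/a"}]

# Matching: map equivalent source fields to one warehouse field name.
FIELD_MAP = {"cust_name": "name", "customer": "name"}

def to_date(value):
    """Conversion: accept an ISO string or a (year, month, day) tuple."""
    if isinstance(value, str):
        return date.fromisoformat(value)
    return date(*value)

def transform(record):
    out = {}
    for field, value in record.items():
        field = FIELD_MAP.get(field, field)
        if field == "joined":
            out["joined"] = to_date(value)
        elif field == "weight_lb":
            # Normalization: convert pounds to kilograms.
            out["weight_kg"] = round(value * 0.45359237, 2)
        elif field in ("name", "weight_kg"):
            out[field] = value
        # Selection: any other field (here "fax") is dropped.
    return out

reconciled = [transform(r) for r in source_a + source_b]
```

After this step both sources share one field naming scheme, one date type and one unit of measure, which is what the reconciled data layer is meant to provide.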
Loading
● Loading is the process of writing the data into the
target database. During the load step, it is necessary to
ensure that the load is performed correctly and with as
few resources as possible.
● Loading can be carried out in two ways:
○ Refresh: Data warehouse data is completely rewritten.
This means that the older data is replaced. Refresh is usually
used in combination with static extraction to populate a
data warehouse initially.
○ Update: Only the changes applied to the source
information are added to the data warehouse. An
update is typically carried out without deleting or
modifying pre-existing data. This method is used in
combination with incremental extraction to update data
warehouses regularly.
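The difference between the two loading modes can be sketched with an in-memory SQLite database standing in for the warehouse. The `sales` table and its columns are hypothetical, and a real warehouse loader would typically use bulk-load utilities rather than row inserts.

```python
import sqlite3

def refresh(conn, rows):
    """Refresh: rewrite the warehouse table completely."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def update(conn, new_rows):
    """Update: append only the changes, leaving existing rows untouched."""
    conn.executemany("INSERT INTO sales VALUES (?, ?)", new_rows)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

# Initial population from a static (full) extraction.
refresh(warehouse, [(1, 100.0), (2, 250.0)])

# Later run: only the rows produced by incremental extraction are added.
update(warehouse, [(3, 75.0)])

loaded = warehouse.execute(
    "SELECT sale_id, amount FROM sales ORDER BY sale_id"
).fetchall()
```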
Selecting an ETL Tool
● Selecting an appropriate ETL tool is an important
decision when building an ODS (operational data store) or
data warehousing application.
● ETL tools are required to provide coordinated access to
multiple data sources so that relevant data may be
extracted from them.
● An ETL tool generally contains tools for data
cleansing, re-organization, transformation, aggregation,
calculation and automatic loading of information into the
target database.
● An ETL tool should provide a simple user interface that
allows data cleansing and data transformation rules to be
specified using a point-and-click approach.
● When all mappings and transformations have been
defined, the ETL tool should automatically generate the
data extract/transformation/load programs, which
typically run in batch mode.
Advantages of ETL process in data warehousing

● Improved data quality: The ETL process helps ensure that the data in
the data warehouse is accurate, complete, and up-to-date.
● Better data integration: The ETL process helps to integrate data
from multiple sources and systems, making it more accessible
and usable.
● Increased data security: The ETL process can help to improve
data security by controlling access to the data warehouse and
ensuring that only authorized users can access the data.
● Improved scalability: The ETL process can help to improve
scalability by providing a way to manage and analyze large
amounts of data.
● Increased automation: ETL tools and technologies can
automate and simplify the ETL process, reducing the time and
effort required to load and update data in the warehouse.
Disadvantages of ETL process in data warehousing

● High cost: The ETL process can be expensive to implement
and maintain, especially for organizations with limited
resources.
● Complexity: The ETL process can be complex and difficult to
implement, especially for organizations that lack the
necessary expertise or resources.
● Limited flexibility: The ETL process may not be able to
handle unstructured data or real-time data streams.
● Limited scalability: The ETL process may not be able to
handle very large amounts of data.
● Data privacy concerns: The ETL process can raise concerns
about data privacy, as large amounts of data are collected,
stored, and analyzed.
ELT (Extract, Load and Transform)
● ELT stands for Extract, Load and Transform. It is a different
way of looking at data migration or movement.
● ELT extracts data from the source system and loads it into the
target system first, instead of transforming it between the
extraction and loading phases.
● Once the data has been copied or loaded into the target system,
the transformation takes place.
● The extract and load steps can be isolated from the
transformation process. Isolating the load phase from the
transformation process removes an inherent dependency
between these phases.
● In addition to containing the data necessary for the
transformations, the extract and load process can include
data elements that may be needed in the future.
● The load phase could even take the entire source and load it into
the warehouse.
● Separating the phases enables the project to be broken
down into smaller chunks, thus making it more specific
and manageable.
● Performing the data integrity analysis in the staging
area enables a further phase in the process to be
isolated and dealt with at the most appropriate point in
the process. This approach also helps to ensure that only
cleaned and checked information is loaded into the
warehouse for transformation.
● Isolating the transformations from the load steps helps to
encourage a more staged approach to warehouse design and
implementation.
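The ELT ordering (load first, transform later inside the target) can be sketched with SQLite acting as a stand-in for the target warehouse engine. The staging table `stg_orders`, the sample rows and the transformation rules are all hypothetical.

```python
import sqlite3

target = sqlite3.connect(":memory:")

# Extract + Load: copy the source rows unchanged into a staging table.
raw_rows = [("asha", "1200.50", "IN"), ("ravi", "80", "in"), ("meena", "n/a", "US")]
target.execute("CREATE TABLE stg_orders (customer TEXT, amount TEXT, country TEXT)")
target.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

# Transform: runs later, inside the target engine, using plain SQL.
target.execute(
    """
    CREATE TABLE orders AS
    SELECT UPPER(SUBSTR(customer, 1, 1)) || SUBSTR(customer, 2) AS customer,
           CAST(amount AS REAL) AS amount,
           UPPER(country) AS country
    FROM stg_orders
    WHERE amount GLOB '[0-9]*'   -- keep only rows whose amount starts with a digit
    """
)

orders = target.execute(
    "SELECT customer, amount, country FROM orders ORDER BY customer"
).fetchall()
staged = target.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
```

Because the raw rows stay in the staging table, a new transformation can be written later without extracting from the source again.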
Strengths of ELT
● Project Management: Being able to divide the warehouse process into
specific and isolated functions enables a project to be designed on a
smaller function basis, so the project can be broken down into
feasible chunks.
● Flexible & Future Proof: In general, in an ELT implementation, all records
from the sources are loaded into the data warehouse as part of the extract
and load process. This, combined with the isolation of the transformation
phase, means that future requirements can easily be incorporated into the
data warehouse architecture.
● Risk Minimization: Removing the close interdependencies between the
stages of the warehouse build process enables the development of each stage
to be isolated, and the individual process designs can thus also be separated.
This provides a good platform for change, maintenance and management.
● Utilize Existing Hardware: In implementing ELT as a warehouse build
process, the essential tools provided with the database engine can be used.
● Utilize Existing Skill Sets: By using the functionality supported by the
database engine, the existing investment in database skills is re-used
to develop the warehouse. No new skills need to be learned, and the full
weight of the experience in developing the engine's technology is utilized,
further reducing the cost and risk in the development process.
Weaknesses of ELT
● Against the Norm: ELT is a new approach to data
warehouse design and development. While it has
proven itself many times over through its abundant
use in implementations throughout the world, it does
require a change in mentality and design approach
compared with traditional methods.
● Tools Availability: Being an emergent technology
approach, ELT suffers from the limited availability of
tools.
Difference between ETL vs. ELT
● Process
○ ETL: Data is transferred to the ETL server and moved back to the DB. High network bandwidth required.
○ ELT: Data remains in the DB except for cross-database loads (e.g. source to target).
● Transformation
○ ETL: Transformations are performed in the ETL server.
○ ELT: Transformations are performed (in the source or) in the target.
● Code Usage
○ ETL: Typically used for source-to-target transfer, compute-intensive transformations, and small amounts of data.
○ ELT: Typically used for large amounts of data.
● Time-Maintenance
○ ETL: Needs high maintenance, as you need to select the data to load and transform.
○ ELT: Low maintenance, as data is always available.
● Calculations
○ ETL: Overwrites the existing column, or needs to append the dataset and push it to the target platform.
○ ELT: Easily adds the calculated column to the existing table.

Analysis
Defining the business requirements:
Select the business process for which the dimensional model will be
designed. Based on the selection, the requirements for the business
process are gathered. A business process may require more than one
dimensional model. To select a single business process (out of all
of the possible processes that exist in a company), you must prioritize the
business processes according to certain criteria. Criteria might include
business process significance, quality of data in the source systems, and
the feasibility and complexity of the business processes.
When you identify the business processes of a dimensional model, you
collect the following metadata:
● Business requirements for the selected business process for which you will
design the dimensional model
● Business processes
● Owners
● Source systems that will be used
● Data quality issues
● Common terms used across business processes
● Other business-related metadata
Dimensional analysis
One approach to data warehouse design is to develop and
implement a dimensional model. This has given rise to
dimensional analysis (sometimes generalized as
multi-dimensional analysis ).

It was noticed quite early on, when data warehouses started to be
developed, that whenever decision makers were asked to
describe the kinds of questions they would like to get answers to
regarding their organizations, they almost always wanted the
following:
○ Summarized information with the ability to break the
summaries into more detail
○ Analysis of the summarized information across their own
organizational components such as departments or regions
○ Ability to slice and dice the information in any way they
chose
○ Display of the information in both graphical and tabular form
○ Capability to view their information over time
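The kinds of questions listed above (summaries, drill-down into detail, slicing by an organizational component, views over time) can be sketched on a tiny set of fact rows. The data, the dimension names and the helper functions are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, department, sales amount).
facts = [
    (2023, "North", "Retail", 100),
    (2023, "South", "Retail", 150),
    (2024, "North", "Retail", 120),
    (2024, "North", "Online", 80),
    (2024, "South", "Online", 60),
]
DIMENSIONS = {"year": 0, "region": 1, "department": 2}

def summarize(rows, *dims):
    """Total sales grouped by the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)

def slice_rows(rows, dim, value):
    """Keep only the rows where one dimension has a fixed value."""
    return [r for r in rows if r[DIMENSIONS[dim]] == value]

# Summary over time.
by_year = summarize(facts, "year")
# Drill down: break the yearly summary into more detail by region.
by_year_region = summarize(facts, "year", "region")
# Slice: fix region = "North", then summarize by department.
north_by_dept = summarize(slice_rows(facts, "region", "North"), "department")
```

Adding a dimension to the grouping gives more detail, and removing one gives a higher-level summary, which is the core idea behind multi-dimensional analysis.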
Example of Dimension Analysis
Requirements gathering methods
Requirements gathering is a crucial phase in the development of a
data warehouse, as it lays the foundation for understanding and
meeting the needs of stakeholders. There are several methods and
techniques for gathering requirements in the context of a data
warehouse. Here are some commonly used methods:
● Interviews:
○ Conduct one-on-one or group interviews with stakeholders,
including business users, analysts, and IT personnel.
○ Ask open-ended questions to understand their data needs,
reporting requirements, and expectations from the data
warehouse.
● Surveys and Questionnaires:
○ Distribute surveys to a broad audience to collect feedback on
data requirements.
○ Use structured questionnaires to gather specific information
about data sources, data formats, and desired reports.
● Workshops and Focus Groups:
○ Organize workshops or focus group sessions involving key
stakeholders.
○ Facilitate discussions to elicit detailed requirements,
clarify ambiguities, and identify common goals.
● Prototyping:
○ Develop prototype reports or dashboards to visualize data
and gather feedback.
○ Iteratively refine prototypes based on user input, ensuring
that the final solution meets user expectations.
● Document Analysis:
○ Review existing documentation, reports, and business
processes to understand the current state.
○ Identify gaps and opportunities for improvement in data
availability and reporting.
● Use Cases and Scenarios:
○ Develop use cases and scenarios that describe how users
will interact with the data warehouse.
○ Identify specific situations and the corresponding data
requirements for each use case.
● Data Sampling:
○ Analyze a sample of existing data to understand its
structure, quality, and potential issues.
○ Use insights from data sampling to refine requirements
for data cleaning and transformation.
● Joint Application Development (JAD):
○ Bring together stakeholders, end-users, and
development teams for collaborative sessions.
○ Use JAD sessions to define requirements, resolve
conflicts, and establish a shared understanding.
● Benchmarking:
○ Compare the data warehouse requirements with industry
benchmarks and best practices.
○ Identify key performance indicators (KPIs) and metrics relevant
to the organization's goals.
● Observation:
○ Observe how users currently work with data and identify pain
points.
○ Gain insights into user behaviors, preferences, and challenges
related to data access and reporting.
Selecting the appropriate combination of these methods
based on the organization's context, the complexity of the
data warehouse project, and the nature of stakeholders is crucial
for effective requirements gathering. Iterative and collaborative
approaches often lead to more accurate and aligned
requirements for a successful data warehouse implementation.
Requirements Definition: Scope And Content
● Formal documentation is often neglected in computer system projects. The
project team goes through the requirements definition phase.
● They conduct the interviews and group sessions. They review the existing
documentation.
● They gather enough material to support the next phases in the system
development life cycle.
● But they skip the detailed documentation of the requirements definition.
● There are several reasons why you should commit the results of your
requirements definition phase to writing.
● First of all, the requirements definition document is the basis for the next
phases.
● If project team members have to leave the project for any reason at all, the
project will not suffer from people walking away with the knowledge they
have gathered.
● The formal documentation will also validate your findings when reviewed
with the users.
● We will come up with a suggested outline for the formal requirements
definition document.
● Before that, let us look at the types of information this document must
contain.
The Types Of Information This Document Must Contain
Data Sources:
● A data source in the context of a data warehouse refers
to the origin or location from which data is collected
and ingested into the data warehouse.
● These sources can be various databases, systems,
applications, or external providers that hold data
relevant to an organization's business processes.
● The data warehouse consolidates, organizes, and
stores this data in a manner that supports analytical
and reporting activities.
The requirements definition document should include the following information:
● Available data sources
● Data structures within the data sources
● Location of the data sources
● Operating systems, networks, protocols, and client
architectures
● Data extraction procedures
● Availability of historical data
Data Transformation:
● Data transformation is a critical process in a data
warehouse that involves cleaning, aggregating, and
restructuring data from source systems to make it
suitable for analytical and reporting purposes. The goal of
data transformation is to ensure that the data in the data
warehouse is accurate, consistent, and easily accessible
for analysis.
● Data transformation is a crucial step in the ETL process
and ensures that the data in the data warehouse is of high
quality, consistent, and aligned with the business
requirements for analysis and reporting.
Data Storage:
● Data storage in a data warehouse involves the
organization and management of data in a structured
and efficient manner to support analytical queries and
reporting.
● Data storage in a data warehouse should be optimized to ensure
efficient data retrieval, analysis, and maintenance. The
chosen storage strategy should align with the specific
needs and goals of the organization.
Information Delivery:
● Information delivery in a data warehouse refers to the
process of providing users with access to organized,
transformed, and meaningful data for reporting,
analysis, and decision-making purposes. The goal is to
present information in a way that is easily
understandable, accessible, and supports business
intelligence needs.
● Effective information delivery ensures that
stakeholders across an organization have timely access
to accurate and relevant data, empowering them to
make informed decisions. It plays a crucial role in the
success of a data warehouse implementation.
Requirements Definition Document Outline
1. Introduction. State the purpose and scope of the project. Include
broad project justification. Provide an executive summary of each
subsequent section.
2. General requirements descriptions. Describe the source systems
reviewed. Include interview summaries. Broadly state what types of
information requirements are needed in the data warehouse.
3. Specific requirements. Include details of source data needed. List the
data transformation and storage requirements. Describe the types of
information delivery methods needed by the users.
4. Information packages. Provide as much detail as possible for each
information package. Include them in the form of package diagrams.
5. Other requirements. Cover miscellaneous requirements such as data
extract frequencies, data loading methods, and locations to which
information must be delivered.
6. User expectations. State the expectations in terms of problems and
opportunities. Indicate how the users expect to use the data
warehouse.
7. User participation and sign-off. List the tasks and activities in which
the users are expected to participate throughout the development life
cycle.
8. General implementation plan. At this stage, give a high-level plan
for implementation.
