Describe need and architecture for the given data warehouse.
Explain the benefits of data warehousing of the given application.
Describe the given data warehouse models.
To understand Basic Concepts in Data Warehouse and Data Warehousing.
To learn Architecture of Data Warehousing.
To study various Data Warehouse Models (Data Mart, Virtual Warehouse etc.).
To understand ETL Concept in Data Warehouse.
To know Benefits of Data Warehousing.
very complex and tedious. Consequently, a number of methods, techniques and tools were developed
to solve that problem.
These included decentralized processing, extract processing, Executive Information Systems (EIS),
query tools, relational databases, etc. The need for timely and accurate decisions also led to the
development of Decision Support Systems (DSSs).
Data warehousing began to grow explosively starting in the mid-nineties. It is still characterized by
high growth.
Data warehousing is the process of constructing and using a data warehouse. The data warehouse is a basis for
informational processing.
Data warehousing enables easy organization and maintenance of large data in addition to fast
retrieval and analysis in the manner and depth required from time to time.
Data Warehousing, OnLine Analytical Processing (OLAP) and Data Mining represent some of the
latest trends in computing environment and Information Technology (IT) applications to large-scale
processing and analysis of data.
Data mining is the process of discovering new information out of data in a data warehouse, which
cannot be retrieved within the operational system.
Data mining refers to the extraction of useful information from a bulk of data or data warehouses.
It is the computational process of discovering patterns in large data sets involving
methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
A data warehouse is constructed by integrating data from multiple heterogeneous sources that support
analytical reporting, structured and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data consolidation.
Data Warehouse
A database contains information organized in columns, rows and tables that is periodically indexed
to make relevant information easier to access.
Many enterprises and organizations create and manage databases using a Database Management
System (DBMS). Special DBMS software can be used to create and store product inventory and customer
information. Organizations most often use databases for OnLine Transaction Processing (OLTP).
A database is built to store current transactions and enable fast access to specific transactions for
ongoing business processes.
Most enterprises generate huge amounts of data from their OLTP systems, Point-of-Service
(POS) systems, financial ATMs and the Web.
The challenge faced by these enterprises/organizations with regard to the massive data-rich but
information-poor collection is to extract valuable information to be available at a particular time and
place and in the form needed to support the decision-making process.
A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data.
This data helps analysts to take informed decisions in an organization.
A Data Warehouse (DW) is a collection of technologies aimed at enabling the knowledge worker
(executive, manager, and analyst) to make better and faster decisions.
It is expected to have the right information in the right place at the right time with the right cost, in
order to support the right decision.
Data warehouses and databases are both relational data systems, but were built to serve different
purposes. A data warehouse is built to run complex queries across all the data, typically using OnLine
Analytical Processing (OLAP).
Data Warehousing
A data warehouse is a database, which is kept separate from the organization's operational database.
A data warehouse is a collection of data specific to the entire organization.
A data warehouse is a decision-support environment that leverages data stored in different sources,
organizing it and delivering it to decision makers across the enterprise, regardless of their platform
or technical skill level.
It also provides the appropriate tools to extract specific data, convert it into business information,
and monitor for changes and hence, it is possible to use this information to make insightful
decisions.
A data warehouse can provide consistent data.
Data warehousing involves processes that extract data from sources, transform the data, integrate it,
remove any flaws and inconsistencies, store it into a data warehouse, and provide end users with
access to the data so that they can carry out complex data analysis and prediction queries.
A data warehouse ensures the consistency of management rules and conventions applied to the data.
Characteristics of Data Warehouse
The formal definition of a data warehouse by W. H. Inmon is given below:
"A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in
support of the management's decision-making process."
The key characteristics/features of a data warehouse are discussed below:
1. Subject Oriented: A data warehouse is subject oriented because it provides information around a
subject rather than the organization's ongoing operations. These subjects can be products,
customers, suppliers, sales, revenue, etc. A data warehouse does not focus on the ongoing
operations, rather it focuses on modelling and analysis of data for decision making.
2. Integrated: A data warehouse is constructed by integrating data from heterogeneous sources
such as relational databases, flat files, etc. This integration enhances the effective analysis of
data.
3. Time Variant: The data collected in a data warehouse is identified with a particular time period.
The data in a data warehouse provides information from the historical point of view.
4. Non-volatile: Non-volatile means the previous data is not erased when new data is added to it. A
data warehouse is always a physically separated store of data. Due to this separation, a data
warehouse does not require transaction processing, recovery, concurrency control, etc.
A data warehouse is kept separate from the operational database and therefore frequent changes in
the operational database are not reflected in the data warehouse.
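The time-variant and non-volatile characteristics can be illustrated with a small sketch. It assumes a hypothetical `price_history` table and a hypothetical `load_snapshot` helper (neither comes from the text): every load appends rows stamped with a load date, and earlier rows are never updated or deleted, so the history of a subject stays available for analysis.

```python
import sqlite3

# In-memory database standing in for the warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price_history (product TEXT, price REAL, load_date TEXT)")

def load_snapshot(conn, load_date, prices):
    """Append one dated snapshot; earlier rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        [(product, price, load_date) for product, price in prices.items()],
    )

# Two periodic loads: the second one adds rows instead of overwriting the first.
load_snapshot(conn, "2024-01-01", {"pen": 1.00, "book": 9.50})
load_snapshot(conn, "2024-02-01", {"pen": 1.20, "book": 9.50})

# Historical view of one subject (the product "pen") across time periods.
pen_history = conn.execute(
    "SELECT load_date, price FROM price_history "
    "WHERE product = 'pen' ORDER BY load_date"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM price_history").fetchone()[0]
```

An operational system would typically keep only the current price by updating the row in place; here both the January and the February price remain queryable.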
Difference between Operational Database System and Data Warehouse
Operational database management systems, also called OLTP (OnLine Transaction Processing)
databases, are used to manage dynamic data in real-time. Following table shows the differences
between OLTP and data warehouse.
Sr. No. | Operational Database System (OLTP) | Data Warehouse (OLAP)
1. | Operational database systems are designed to support high-volume transaction processing. | Data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
2. | Operational database systems are usually concerned with current data. | Data warehousing systems are usually concerned with historical data.
3. | Data within operational systems are mainly updated regularly according to need. | Non-volatile; new data may be added regularly. Once added, the data is rarely changed.
4. | It is designed for real-time business dealing and processes. | It is designed for analysis of business measures by subject area, categories, and attributes.
5. | It is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table. | It is optimized for extent loads and high, complex, unpredictable queries that access many rows per table.
6. | It is optimized for validation of incoming information during transactions, uses validation data tables. | Loaded with consistent, valid information; requires no real-time validation.
7. | It supports thousands of concurrent clients. | It supports a few concurrent clients relative to OLTP.
8. | Operational database systems are widely functional or process-oriented. | Data warehousing systems are widely subject-oriented.
9. | Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data. | Data warehousing systems are usually optimized to perform fast retrievals of relatively high volumes of data.
10. | Operational database system focuses on Data In. | Data warehousing system focuses on Data Out.
11. | Less number of data accessed. | Large number of data accessed.
12. | Relational databases are created for OLTP. | Data warehouse designed for OLAP.
13. | Data integration in operational database is application based. | Data integration in data warehouse is ... based.
14. | It provides detailed and flat relational view of data. | It provides summarized and multidimensional view of data.
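The difference in access pattern (rows 5 and 11 of the comparison) can be sketched with Python's built-in sqlite3 module. The `sales` table and its values are made up for illustration. The first query is OLTP-style and touches a single row by key; the second is OLAP-style and aggregates over many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "North", 2023, 100.0),
    (2, "South", 2023, 250.0),
    (3, "North", 2024, 300.0),
    (4, "South", 2024, 50.0),
])

# OLTP-style access: retrieve one specific transaction by its key.
one_order = conn.execute(
    "SELECT amount FROM sales WHERE order_id = ?", (3,)
).fetchone()

# OLAP-style access: scan many rows and summarize them by a subject (region).
totals_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

In practice the two workloads usually run on separate systems, which is the separation the table above describes; a single in-memory database is used here only to keep the sketch short.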
Components of Data Warehouse
A data warehouse consists of various components (building blocks) which are arranged in an
optimum way to get the maximum benefit out of it.
The arrangement of these components mainly depends on certain circumstances and
requirements of an organization.
Fig. 1.1: Components of Data Warehouse
The basic components of a data warehouse are explained below:
1. Source Data: The source data comprises the data coming into the data warehouse from different
areas. This source data can be grouped into following four categories:
(i) Production Data: The data in this category comprise different segments of data chosen
from the various operational systems of the enterprise. The data are selected on the basis of
requirements in data warehouse.
(ii) Internal Data: In every organization, users keep their personal data such as private
spreadsheets, documents, customer profiles, departmental databases, etc. All these data
form the internal data which could be useful in a data warehousing environment.
(iii) Archived Data: This is a compiled form of older data which is stored in archived files such
as flat files, etc. As a data warehouse contains data of many years,
depending on the requirement, data warehouse data are archived from time to time.
(iv) External Data: It refers to those data sources which are available outside the organization.
The production, internal and archived data give an insight into what the organization is
doing presently and has done in the past. However, by including external data in the data
warehouse, executives can spot current trends prevailing in the market and can compare the
performance of their organizations with others.
2. Data Staging Component: After extracting data from various operational systems and external
sources, data need to be prepared for storing in the data warehouse. The data staging component
of data warehouse helps in making data ready to be stored in the data warehouse. It comprises
following three main functions:
(i) Data Extraction: This function deals with a large amount of data sources and for each data
source an appropriate technique must be applied. Data extraction is a complex and critical
task, as source data may be from different source machines in various data formats and
with different data models (relational, network or hierarchical). Various data extraction
tools are available in the market today. The organization can use these tools for certain data
sources. However, these tools incur high initial cost. For some other data sources, in-house
programs can also be developed. But, these may incur the development and maintenance cost.
The data may also be extracted into a separate physical environment, for example a staging
relational database, from which moving the data into the data warehouse would be easier.
(ii) Data Transformation: The data sources may contain some minor errors or inconsistencies.
For example, the names are often misspelled, and street, area or city names are incorrectly
formatted or zip codes are entered incorrectly. These incorrect values must be corrected to
minimize the errors and fill in the missing information. The process of correcting and
preprocessing the data is called data cleansing. These errors can be corrected to some
reasonable level by looking up a database containing street names and zip codes in each city.
The approximate matching of data required for this task is referred to as fuzzy lookup. In
some cases, the data managers in the organization want to upgrade their data with the
cleaned data. This process is known as back-flushing. These data are then transformed to
accommodate semantic mismatches.
(iii) Data Loading: The cleaned and transformed data are finally loaded into the warehouse.
Data are partitioned, and indexes or other access paths are built for fast and efficient
retrieval of data. Data loading is a slow process due to the large volume of data. For instance,
loading a terabyte of data sequentially can take weeks and a gigabyte can take hours. Thus,
parallelism is important for loading warehouses. The raw data generated by a transaction
processing system may be too large to store in a data warehouse; therefore some data can be
stored in a summarized form. Thus, additional preprocessing such as sorting and generation
of summarized data is performed at this stage.
This entire process of getting data into the data warehouse is called the Extract, Transform and Load
(ETL) process. Once the data is loaded into a warehouse, it must be periodically refreshed to
reflect the updates on the relations at the data sources and periodically purged of old data.
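The three staging functions can be sketched as a minimal pipeline. Everything named here is an assumption for illustration: the sample `source_rows`, the `KNOWN_CITIES` reference list, and the helper functions. The fuzzy lookup uses `difflib.get_close_matches` from the Python standard library as one possible approximate-matching technique; the text does not prescribe a particular method.

```python
import difflib
import sqlite3

# Hypothetical reference list of valid city names used for the fuzzy lookup.
KNOWN_CITIES = ["Mumbai", "Pune", "Nagpur"]

# Hypothetical source records with typical minor errors (padding, case, misspellings).
source_rows = [
    {"name": " asha ", "city": "Mumbay", "amount": "120.50"},
    {"name": "Ravi", "city": "Pune", "amount": "80"},
    {"name": "Meena", "city": "Nagpurr", "amount": "45.25"},
]

def extract():
    """Extraction: read the rows from the source (here just an in-memory list)."""
    return list(source_rows)

def clean_city(raw):
    """Fuzzy lookup: replace a misspelled city with its closest known name, if any."""
    matches = difflib.get_close_matches(raw, KNOWN_CITIES, n=1, cutoff=0.6)
    return matches[0] if matches else raw

def transform(rows):
    """Transformation: cleanse names and cities, convert amounts to numbers."""
    return [
        (row["name"].strip().title(), clean_city(row["city"]), float(row["amount"]))
        for row in rows
    ]

def load(conn, rows):
    """Loading: write the cleaned rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
loaded = conn.execute("SELECT name, city, amount FROM sales ORDER BY name").fetchall()
```

A production pipeline would add partitioning, index building and parallel loading as described above; the sketch only shows how the three functions hand data to one another.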
3. Data Storage Component: This component consists of a separate repository for storing desired
data in the data warehouse. In the data repository of a warehouse, huge amounts of historical
data are kept along with the current data in specific structures suitable for analysis; however,
these repositories are made read-only in the data warehouse. This is because for analysis, one
must not have the data storage in a state where continual updates are made to it.
4. Information Delivery Component: This component includes different methods for rendering
information to the wide group of data warehouse users. Some common methods include ad hoc
reports, Multi-Dimensional (MD) analysis, statistical analysis, Executive Information System
(EIS) feed and data mining applications.
5. Metadata Component: Metadata is the data about the data. The metadata stores data in a similar
way as the data dictionary or data catalogue does in a DBMS but it also keeps information about
the logical data structures, files, addresses, indexes, etc.
6. Management and Control Component: This component of data warehouse manages and
coordinates the various services and activities within the data warehouse from the beginning to
the end. It also works with the database management system and enables data to be
stored in the repositories. It controls the data transformation into the data warehouse
and moderates information delivery to the users. It also supervises the movement of data into
the staging area and from there into the data warehouse storage itself. While performing these
functions, it interacts with the metadata component as metadata is the source of information
for the management module.
Data warehousing technology is becoming essential for effective business intelligence as it enables
easy organization and maintenance of large data in addition to fast retrieval and analysis in the
manner and depth required from time to time.
Following points show the need and importance of data warehouse:
1. Data warehouse helps business users to access critical data from some sources all in one place.
2. It provides consistent information on various cross-functional activities.
3. Data warehousing helps to integrate many sources of data to reduce stress on the production
system.
4. Data warehousing helps users to reduce total turnaround time for analysis and reporting.
5. Data warehousing helps users to access critical data from different sources in a single place, so it
saves the user's time of retrieving data from multiple sources. We can also access data
from the cloud easily.
6. Data warehousing allows storing a large amount of historical data to analyze different periods
and trends to make future predictions.
7. Data warehousing enhances the value of operational business applications and customer
relationship management systems.
8. It separates analytics processing from transactional databases, improving the performance of
both systems.
9. Data warehousing provides more accurate reports.
Data warehousing provides architectures to systematically organize, understand, and use data
to make strategic decisions.
Data warehouse architecture depends upon the organization's situation. The following architecture
properties are essential for a data warehouse system:
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume,
which has to be managed and processed, and the number of users' requirements, which have to
be met, progressively increase.
3. Extensibility: The architecture should be able to host new applications and technologies without
redesigning the whole system.
4. Security: Monitoring accesses is essential because of the strategic data stored in data
warehouses.
5. Administerability: Data warehouse management should not be overly difficult.
There are following three common types of data warehouse architectures:
1. Basic architecture (single-layer) of data warehouse.
2. Two layer architecture of data warehouse.
3. Three layer architecture of data warehouse.
Single Layer Architecture of Data Warehouse
* Single layer architecture is a basic architecture of data warehouse which is not frequently used
today. Single layer architecture is shown in Fig. 1.2.
* In single layer architecture, end users can access data directly from the various source systems
through the data warehouse.
* The goal of basic architecture of data warehouse is to minimize the amount of data stored and
remove the data redundancies.
* The basic structure lets end users of the warehouse directly access data derived from
source systems and perform analysis, reporting, and mining on that data. This structure is useful
when data sources derive from the same types of database systems.
Fig. 1.2: Basic Architecture of Data Warehouse
* The weakness of basic architecture of data warehouse lies in its failure to meet the
requirement for separation between analytical and transactional processing.
* Analysis queries are submitted to operational data after the middleware interprets them. In this way,
the queries affect regular transactional workloads.
* In addition, although basic architecture of data warehouse can meet the requirement
for integration and correctness of data, it cannot log more data than sources do.
* For these reasons, a virtual approach to data warehouses can be successful only if analysis needs are
particularly restricted and the data volume to analyze is huge.
Two Layer Architecture of Data Warehouse
* The requirement for separation plays a fundamental role in defining the typical architecture of a
data warehouse system, as shown in Fig. 1.3.
* Fig. 1.3 shows two-level (two-layer) architecture of data warehouse to highlight the separation between
physically available sources and data warehouses. This separation process is important to clean and
process operational data. Basically, it consists of following five stages:
1. Data Source: It is a heterogeneous source of data; it might be operational data, flat files, etc.
2. Staging Area: In this area, data stored in sources should be extracted and cleansed to remove
inconsistencies and fill gaps before it is loaded into the warehouse.
3. Warehouse: It is a centralized repository which can be accessed directly, but it can also be used as
a source for creating data marts.
4. Data Marts: A data mart is a partial copy of the organization's data and is designed for a specific
purpose like purchasing, sales, inventory, etc.
5. Users: End users can access the processed reports, analyze them and mine them.
Fig. 1.3: Two-Level Architecture of Data Warehouse
Benefits of a Two-Layer Architecture:
* In data warehouse systems, good quality information is always available, even when access to
sources is denied temporarily for technical or organizational reasons.
* Data warehouse analysis queries do not affect the management of transactions, the reliability of
which is vital for enterprises to work properly at an operational level.
* Data warehouses are logically structured according to the multidimensional model, while
operational sources are generally based on relational or semi-structured models.
Three Tier Data Warehouse Architecture
Generally, a data warehouse adopts a three tier (layer/level) architecture. Fig. 1.4 shows the three
tier architecture of data warehouse.
Following are the three tiers of the data warehouse architecture:
Bottom Tier: The bottom tier of the architecture is the data warehouse database server. It is the
relational database system. We use the back end tools and utilities to feed data into the bottom tier.
These back end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
The following are the functions of data warehouse tools and utilities:
(i) Data Extraction: Involves gathering data from multiple heterogeneous sources.
(ii) Data Cleaning: Involves finding and correcting the errors in data.
(iii) Data Transformation: Involves converting the data from legacy format to warehouse format.
(iv) Data Loading: Involves sorting, summarizing, consolidating, checking integrity, and building
indices and partitions.
(v) Refreshing: Involves propagating updates from data sources to the warehouse.
Middle Tier: In the middle tier, we have the OLAP Server that can be implemented in either of the
following ways:
(i) By Relational OLAP (ROLAP), which is an extended relational database management system.
The ROLAP maps the operations on multidimensional data to standard relational operations.
(ii) By Multidimensional OLAP (MOLAP) model, which directly implements the multidimensional
data and operations.
Top Tier: This tier is the front-end client layer. This layer holds the query tools and reporting
tools, analysis tools and data mining tools.
Fig. 1.4: Three Layer Architecture of Data Warehouse
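The ROLAP idea of mapping multidimensional operations to standard relational operations can be sketched as follows. The `sales_fact` table, its three dimensions (product, city, quarter) and the sample figures are assumptions for illustration. A roll-up becomes a `GROUP BY` over fewer dimensions, and a slice becomes a `WHERE` filter on one dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table: three dimensions and one measure (units).
conn.execute("CREATE TABLE sales_fact (product TEXT, city TEXT, quarter TEXT, units INTEGER)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    ("pen", "Pune", "Q1", 10),
    ("pen", "Mumbai", "Q1", 20),
    ("book", "Pune", "Q1", 5),
    ("pen", "Pune", "Q2", 15),
])

# Roll-up: aggregate away the city dimension, keeping product and quarter.
rollup = conn.execute(
    "SELECT product, quarter, SUM(units) FROM sales_fact "
    "GROUP BY product, quarter ORDER BY product, quarter"
).fetchall()

# Slice: fix one dimension (quarter = 'Q1') and look at the remaining ones.
slice_q1 = conn.execute(
    "SELECT product, city, units FROM sales_fact "
    "WHERE quarter = 'Q1' ORDER BY product, city"
).fetchall()
```

A MOLAP server would instead hold the same data in a multidimensional array structure and answer these operations without translating them to SQL.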
The main advantage of the reconciled data layer is that it creates a common reference data model
for a whole enterprise. At the same time, it sharply separates the problems of source data extraction
and integration from those of data warehouse population.
A data warehouse is an electronic system that gathers data from a wide range of sources within a
company and uses the data to support management decision-making.
Data warehouse architecture is changing with time. There are two types of data warehouse
architectures, namely, Traditional data warehouse and Cloud-based architectures.
Traditional data warehouse architecture employs a three-tier structure (as explained earlier),
composed of the Bottom tier, Middle tier and Top tier.
In a traditional architecture there are three common data warehouse models: virtual warehouse,
data mart, and enterprise data warehouse.
The cloud-based data warehouses do not adhere to the traditional architecture; each data warehouse
offering has a unique architecture.
The view over an operational data warehouse is known as a virtual warehouse.
A virtual data warehouse is a set of separate databases, which can be queried together, so a user can
effectively access all the data as if it was stored in one data warehouse.
The operational data warehouse can be a virtual but complex component of an enterprise data
warehouse (EDW).
It is also a multi-purpose structure that enables transactional and decision support processing.
Because the data originates from multiple sources, the integration often involves cleaning, resolving
redundancy, and checking it against business rules for integrity.
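A virtual warehouse, in the sense of separate databases that can be queried together, can be sketched with SQLite's `ATTACH DATABASE` and a temporary view. The database names `store_db` and `web_db`, the `orders` tables and the `all_orders` view are hypothetical; a real virtual warehouse would sit over independent operational systems, not in-memory databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two separate (in-memory) databases standing in for independent operational systems.
conn.execute("ATTACH DATABASE ':memory:' AS store_db")
conn.execute("ATTACH DATABASE ':memory:' AS web_db")

conn.execute("CREATE TABLE store_db.orders (order_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE web_db.orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO store_db.orders VALUES (?, ?)", [(1, 40.0), (2, 60.0)])
conn.executemany("INSERT INTO web_db.orders VALUES (?, ?)", [(3, 25.0)])

# The view stores no data of its own; it only presents both sources as one table.
conn.execute("""
    CREATE TEMP VIEW all_orders AS
    SELECT 'store' AS channel, order_id, amount FROM store_db.orders
    UNION ALL
    SELECT 'web' AS channel, order_id, amount FROM web_db.orders
""")

total_amount = conn.execute("SELECT SUM(amount) FROM all_orders").fetchone()[0]
orders_by_channel = conn.execute(
    "SELECT channel, COUNT(*) FROM all_orders GROUP BY channel ORDER BY channel"
).fetchall()
```

Because every query on the view runs against the underlying sources, analysis load falls on the operational databases, which is the weakness noted for the single-layer (virtual) approach.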
A data mart is a subset or an aggregation of the data stored in a primary data warehouse. A data
mart includes a set of information pieces relevant to a specific business area, corporate department,
or category of users.
A data mart is a subject-specific data warehouse that is usually set up to meet the information needs
of users of a particular department or functional unit within an organization. The size of a data
mart, therefore, is generally many times smaller than an enterprise data warehouse.
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in
an organization.
In other words, a data mart contains only those data that are specific to a particular group.
For example, the marketing data mart may contain only data related to items, customers and sales.
Data marts are confined to subjects.
A data mart can be called a subset of a data warehouse or a sub-group of corporate-wide
data corresponding to a certain set of users.
Fig. 1.5 shows a graphical representation of data marts.
A data mart can be implemented using a top-down or bottom-up approach. In the former,
which is called a dependent data mart, data is drawn directly from an enterprise data
warehouse. In the latter, which is called an independent data mart, individual data marts are
built by capturing and transforming data from existing local operational databases in a
department or business area.
Fig. 1.5: Graphical Representation of Data Marts
Types of Data Mart:
Depending on the source of data, data marts can be categorized as independent data mart or
dependent data mart.
Independent data marts are sourced from data captured from one or more operational systems or
external information providers, or from data generated locally within a particular department or
geographic area. Dependent data marts are sourced directly from enterprise data warehouses.
* A dependent data mart is one whose source is another data warehouse, and all dependent data
marts within an organization are typically fed by the same source, the enterprise data warehouse.
Fig. 1.6: Dependent Data Mart (Data Mart exists with Data Warehouse)
* An independent data mart is one whose source is directly from transaction systems,
applications, or external data feeds. An independent data mart can collect the data directly from
different sources.
Fig. 1.7: Independent Data Mart
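A dependent data mart can be sketched as a subject-specific subset drawn from the warehouse. The `warehouse_sales` table, the `marketing_mart` table and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical enterprise-wide warehouse table covering several departments.
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, item TEXT, customer TEXT, amount REAL)"
)
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?, ?)", [
    ("marketing", "pen", "Asha", 12.0),
    ("marketing", "book", "Ravi", 30.0),
    ("finance", "audit", "Meena", 500.0),
])

# Dependent data mart: only the rows and columns the marketing group needs,
# sourced directly from the warehouse.
conn.execute("""
    CREATE TABLE marketing_mart AS
    SELECT item, customer, amount
    FROM warehouse_sales
    WHERE department = 'marketing'
""")

mart_rows = conn.execute(
    "SELECT item, customer, amount FROM marketing_mart ORDER BY item"
).fetchall()
```

An independent data mart would be filled by its own extraction from operational systems or external feeds instead of from `warehouse_sales`.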
Advantages of Data Marts:
1. Building a data mart is simpler as compared to implementing a corporate data warehouse.
2. Data marts are small in size.
3. Data marts allow users to break away from profoundly powered machines and still handle
processing of the reports.
4. The cost of implementing a data mart is far less when compared to building a data warehouse.
5. Data marts are flexible.
Disadvantages of Data Marts:
1. Increase in the size of data marts results in performance deterioration, data inconsistency and
creates problems when the data warehouse needs to be upgraded.
2. The data marts are frequently short-term, temporary solutions that are not part of a corporate
architecture.
3. Development can be unorganized, which creates problems when data marts are used as building
blocks for creating an enterprise data warehouse.
4. The process of data access, consolidation and cleaning in data marts becomes very difficult
because data marts focus on individual needs.
5. Their design is not as thorough as with a data warehouse due to limited consideration for an
ultimate upgrade to an enterprise system.
6. They can be expensive in the long-term process as activities such as extraction and processing
can get duplicated. Then, additional persons will be required for maintenance and support.
Difference between Data Warehouse and Data Mart:
Sr. No. | Data Warehouse | Data Mart
1. | Its scope is enterprise-wide. | Its scope is department-wide.
2. | Control and management process of data warehouse is centralized. | Its process is decentralized.
3. | Due to huge amount of data it is complex and difficult to manage and thus, takes long time to produce the result. | Due to fewer amounts of data it is easy to build and manage.
4. | There are many internal and external sources, thus staging design takes much more time. | There are only few internal and external sources and it is self-explanatory; thus it is faster to build.
5. | The data stored inside the data warehouse are always detailed and accurate when compared with data mart. | The data stored inside the data mart is short and limited.
6. | A data warehouse is a large repository of data collected from different organizations. | A data mart is only a subtype of a data warehouse.
7. | Data warehousing includes large area of the corporation which is why it takes a long time to process it. | Data marts are easy to use, design and implement as they handle only small amounts of data.
8. | A data warehouse is a blend of technologies and components which allows the strategic use of data. | A data mart is a simple form of a data warehouse. It is focused on a single subject.
9. | It is designed for a long period of time. | It is built with a given objective, and has a short lifespan.
An enterprise data warehouse (EDW) collects all of the information about subjects spanning the entire
organization. This model sees the data warehouse as the heart of the enterprise's information
system, with integrated data from all business units.
It provides corporate-wide data integration, usually from one or more operational systems or
external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer super
servers, or parallel architecture platforms. It requires extensive business modeling.
The goal of EDW is to provide a complete overview of any particular object in the data model. This is
accomplished by identifying and wrangling the data from different systems into a
consistent and conformed model.
* After all the information is gathered, the EDW has the capability of providing a common
location where different tools can be used to perform analytical functions and
predictions. The research teams can identify new trends or patterns and focus on them to help the
business grow.
The ETL process encompasses data extraction, transformation, and loading.
ETL tools are very important because they help in combining Logic, Raw Data and Schema
and load the information into the Data Warehouse or Data Marts.
Data Source, Data Staging, Data Storage
Fig. 1.8: ETL
Sometimes, ETL loads the data into the Data Marts and then information is consolidated into the Data
Warehouse. This approach is known as the Bottom-up approach.
The approach where ETL loads information to the Data Warehouse directly is known as the Top-down
Approach.
In the Top-down approach, the data warehouse provides a definite and consistent view of
information, as information from the data warehouse is used to create data marts. It is a
strong model and hence preferred by big companies. Time, cost and maintenance is high.
In the Bottom-up approach, reports can be generated easily as data marts are created first,
and it is relatively easy to interact with data marts. The model is not as strong, but the data
warehouse can be extended and the number of data marts can be increased.
There are many reasons for adopting ETL in the organization. Some of them are given below:
1. It helps companies to analyze their business data for taking critical business decisions.
2. Transactional databases cannot answer complex business questions that can be answered by ETL.
3. A data warehouse provides a common data repository.
4. ETL provides a method of moving the data from various sources into a data warehouse.
5. As data sources change, the Data Warehouse will automatically update.
6. A well-designed and documented ETL system is almost essential to the success of a Data
Warehouse project.
7. It allows verification of data transformation, aggregation and calculation rules.
8. ETL process allows sample data comparison between the source and the target system.
9. ETL process can perform complex transformations and requires an extra area to store the data.
10. It helps to migrate data into a Data Warehouse and to convert the various formats and types to
adhere to one consistent system.
11. ETL is a predefined process for accessing and manipulating source data into the target database.
12. ETL offers deep historical context for the business.
13. It helps to improve productivity because it codifies and reuses processes without a need for
technical skills.
ps in getting Data into the Data Warehouse:
1. Extraction:
Extraction is the first step in the process of getting data into the data warehouse environment.
During extraction, data is specifically identified and then taken from many different locations,
referred to as the Source.
The Source can be a variety of things, such as files, spreadsheets, database tables, a pipe, etc.
Relevant data is obtained from sources in the extraction phase.
Extracting means reading and understanding the source data and copying the data needed
for further manipulation. At this point, the data belongs to the data warehouse.
There are two types of data warehouse extraction methods, namely, logical and physical extraction
methods as shown in Fig. 1.9.
(1) Logical Extraction:
* Logical extraction has two methods: Full and Incremental extraction:
(i) Full Extraction: This method is used when the data needs to be extracted and loaded for the
first time. In full extraction, the data from the source is extracted completely.
(ii) Incremental Extraction: In incremental extraction, the changes in source data need to be
tracked since the last successful extraction. Only these changes in data will be extracted and then
loaded. These changes can be detected from the source data which have the last changed
timestamp. Also a change table can be created in the source system, which keeps track of the
changes in the source data.
(2) Physical Extraction:
* Physical extraction has two methods: Online and Offline extraction:
(i) Online Extraction: In this process, the extraction process directly connects to the source system
and extracts the source data.
(ii) Offline Extraction: The data is not extracted directly from the source system but is staged
explicitly outside the original source system.
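The difference between full and incremental extraction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `customers` table, its `last_changed` column, and the function names are hypothetical, and an in-memory SQLite database stands in for the source system (so this is also an example of online extraction, since the code connects to the source directly).

```python
import sqlite3

def full_extract(conn):
    """Full extraction: read every row from the source table."""
    return conn.execute(
        "SELECT id, name, last_changed FROM customers ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_success):
    """Incremental extraction: read only rows changed after the last successful run."""
    return conn.execute(
        "SELECT id, name, last_changed FROM customers "
        "WHERE last_changed > ? ORDER BY id",
        (last_success,),
    ).fetchall()

# Hypothetical source system, held in memory for the example.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, last_changed TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "2024-01-01T09:00:00"), (2, "Ravi", "2024-01-02T10:30:00")],
)

# First run: everything is extracted, and the newest timestamp is remembered.
first_run = full_extract(source)
last_success = max(row[2] for row in first_run)

# Changes made in the source after the first run.
source.execute(
    "UPDATE customers SET name = 'Ravi K', last_changed = '2024-01-05T08:00:00' "
    "WHERE id = 2"
)
source.execute("INSERT INTO customers VALUES (3, 'Meera', '2024-01-06T12:00:00')")

# Later run: only the updated row and the new row are picked up.
changed_rows = incremental_extract(source, last_success)
```

The timestamps are stored as ISO 8601 strings so that plain string comparison orders them correctly. A change table in the source system, as mentioned above, would replace the timestamp filter with a read of that table.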
2. Transformation:
* After data is extracted, it must be physically transported to the target destination and converted
to the appropriate format.
* It converts data from its operational source format into a specific data warehouse format.
* This data transformation may include operations such as cleaning, joining, and validating data or
generating calculated data based on existing values.
* Whether the transformation takes place in the data warehouse or beforehand, there are
common and advanced transformation types that prepare data for analysis.
* Some of these include:
Basic transformations:
(i) Cleaning: Mapping NULL to 0 or "Male" to "M" and "Female" to "F", date format consistency, etc.
(ii) Deduplication: Identifying and removing duplicate records.
(iii) Format Revision: Character set conversion, unit of measurement conversion, date/time
conversion, etc.
(iv) Key Restructuring: Establishing key relationships across tables.
Advanced transformations:
(i) Derivation: Applying business rules to the data that derive new calculated values from existing
data, for example, creating a revenue metric that subtracts taxes.
(ii) Filtering: Selecting only certain rows and/or columns.
(iii) Joining: Linking data from multiple sources, for example, adding ad spend data across
multiple platforms, such as Google Adwords and Facebook Ads.
(iv) Splitting: Splitting a single column into multiple columns.
(v) Data Validation: Simple or complex data validation, for example, if the first three columns in
a row are empty then reject the row from processing.
(vi) Summarization: Values are summarized to obtain total figures which are calculated and stored
at multiple levels as business metrics, for example, adding up all purchases a customer has
made to build a Customer Lifetime Value (CLV) metric.
(vii) Aggregation: Data elements are aggregated from multiple data sources and databases.
(viii) Integration: Give each unique data element one standard name with one standard definition.
Data integration reconciles different data names and values for the same data element.
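Several of the transformations listed above can be chained into a small pipeline. The sketch below is illustrative only: the row layout, the column names and the helper functions are assumptions made for the example, and it treats two identical purchase rows as duplicates, which may not be appropriate for real sales data.

```python
from collections import defaultdict

# Hypothetical extracted rows; column names are assumptions for this example.
raw_rows = [
    {"customer": "C1", "gender": "Male", "amount": 100.0, "tax": 10.0},
    {"customer": "C1", "gender": "Male", "amount": 100.0, "tax": 10.0},  # duplicate
    {"customer": "C2", "gender": "Female", "amount": None, "tax": None},
    {"customer": "C1", "gender": "Male", "amount": 50.0, "tax": 5.0},
    {"customer": None, "gender": None, "amount": None, "tax": None},  # empty row
]

GENDER_CODES = {"Male": "M", "Female": "F"}

def validate(row):
    """Data validation: reject a row whose first three columns are all empty."""
    return not all(row[col] is None for col in ("customer", "gender", "amount"))

def clean(row):
    """Cleaning: map NULL amounts to 0 and spell-out genders to one-letter codes."""
    return {
        "customer": row["customer"],
        "gender": GENDER_CODES.get(row["gender"], row["gender"]),
        "amount": row["amount"] if row["amount"] is not None else 0.0,
        "tax": row["tax"] if row["tax"] is not None else 0.0,
    }

def deduplicate(rows):
    """Deduplication: keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def derive(row):
    """Derivation: revenue is the amount minus taxes."""
    return {**row, "revenue": row["amount"] - row["tax"]}

def summarize(rows):
    """Summarization: total revenue per customer (a simple lifetime value)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["revenue"]
    return dict(totals)

valid = [row for row in raw_rows if validate(row)]
cleaned = [clean(row) for row in valid]
unique = deduplicate(cleaned)
derived = [derive(row) for row in unique]
lifetime_value = summarize(derived)
```

Each step is a separate function so that the order of transformations stays visible and individual rules can be tested on their own.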
3. Loading:
Loading is the final step in the ETL process. It involves
loading the transformed data into the destination target.
This target may be a database or a data warehouse. Loading can be carried out in two ways, namely
full load and incremental load.
(i) The full load method involves an entire data dump that occurs the first time the source is loaded
into the warehouse.
(ii) The incremental load, on the other hand, takes place at regular intervals. These intervals can be
streaming increments (better for smaller data volumes) or batch increments (better for larger
data volumes).
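The two loading methods can be sketched as follows. The `dim_customer` table and both function names are hypothetical, and an in-memory SQLite database stands in for the warehouse. The incremental load here simply overwrites changed rows, which is one possible policy among several; a warehouse that must keep history would need a different one.

```python
import sqlite3

def full_load(conn, rows):
    """Full load: replace the whole target table with a fresh dump."""
    conn.execute("DELETE FROM dim_customer")
    conn.executemany("INSERT INTO dim_customer (id, name) VALUES (?, ?)", rows)
    conn.commit()

def incremental_load(conn, changed_rows):
    """Incremental load: insert new rows and overwrite rows that changed."""
    conn.executemany(
        "INSERT OR REPLACE INTO dim_customer (id, name) VALUES (?, ?)",
        changed_rows,
    )
    conn.commit()

# Hypothetical warehouse, held in memory for the example.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")

# First load: the entire data set.
full_load(warehouse, [(1, "Asha"), (2, "Ravi")])

# Later batch: one changed row and one new row.
incremental_load(warehouse, [(2, "Ravi K"), (3, "Meera")])

loaded = warehouse.execute(
    "SELECT id, name FROM dim_customer ORDER BY id"
).fetchall()
```

After both calls the table holds three customers, with the second one carrying its updated name.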
The two different methods of loading data into a warehouse, ETL and ELT, are compared in the following table:
ETL | ELT
1. Extract, Transform, Load (ETL) first extracts the data from a pool of data sources, which are typically transactional databases. | 1. With Extract, Load, Transform (ELT), data is immediately loaded after being extracted from the source data pools.
2. The data is held in a temporary staging database. Transformation operations are then performed, to structure and convert the data into a suitable form for the target data warehouse system. | 2. There is no staging database, meaning the data is immediately loaded into the single, centralized repository.
3. The structured data is then loaded into the warehouse, ready for analysis. | 3. The data is transformed inside the data warehouse system for use with business intelligence tools and analytics.
The traditional approach for analysis of data is ETL.
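The difference in ordering between ETL and ELT can be made concrete with a small sketch. Everything here is an assumption made for illustration: the `sales` and `raw_sales` tables, the sample rows, and the use of an in-memory SQLite database as the warehouse. In the ETL function a Python list stands in for the staging database.

```python
import sqlite3

# Hypothetical raw source rows: (customer id, name, amount as text).
source_rows = [("c1", " asha ", "100"), ("c2", "RAVI", "250")]

QUERY = "SELECT customer_id, name, amount FROM sales ORDER BY customer_id"

def run_etl(rows):
    """ETL: transform outside the warehouse, then load the structured result."""
    staged = list(rows)  # stand-in for the temporary staging database
    transformed = [
        (cid, name.strip().upper(), int(amount)) for cid, name, amount in staged
    ]
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE sales (customer_id TEXT, name TEXT, amount INTEGER)")
    wh.executemany("INSERT INTO sales VALUES (?, ?, ?)", transformed)
    return wh

def run_elt(rows):
    """ELT: load raw data first, then transform inside the warehouse with SQL."""
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE raw_sales (customer_id TEXT, name TEXT, amount TEXT)")
    wh.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)
    wh.execute(
        "CREATE TABLE sales AS "
        "SELECT customer_id, UPPER(TRIM(name)) AS name, "
        "CAST(amount AS INTEGER) AS amount FROM raw_sales"
    )
    return wh

etl_result = run_etl(source_rows).execute(QUERY).fetchall()
elt_result = run_elt(source_rows).execute(QUERY).fetchall()
```

Both paths end with the same `sales` table. What differs is where the transformation runs and whether the raw data is kept in the warehouse (`raw_sales` exists only in the ELT case).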
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as metadata for the contents in the book.
In other words, we can say that metadata is the summarized data that leads us to the detailed data.
In terms of data warehouse we can define metadata as:
1. Metadata is a road map to data warehouse.
2. Metadata in data warehouse defines the warehouse objects.
3. The metadata acts as a directory. This directory helps the decision support system to locate the
contents of data warehouse.
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system.
In the data dictionary, we keep the information about the logical data structures, the information
about the files and addresses, the information about the indexes, and so on. The data dictionary
contains data about the data in the database.
The metadata can be broadly categorized into following three categories:
1. Business Metadata: This metadata has the data ownership information, business definition, and
changing policies.
2. Technical Metadata: Technical metadata includes database system names, table and column
names and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
3. Operational Metadata: This metadata includes currency of data and data lineage. Currency of
data means whether data is active, archived or purged. Lineage of data means history of data
migrated and transformation applied on it.
In the data warehouse architecture, metadata plays an important role as it specifies the source,
usage, values, and features of data warehouse data. It also defines how data can be changed and
processed. It is closely connected to the data warehouse.
The generation and management of metadata serves two purposes: to minimize the effort for
development and administration of a data warehouse, and to improve the extraction of
information from it.
Metadata in a data warehouse includes the following:
(i) Information on the contents of the data warehouse, their location and their structure.
(ii) Information on the processes that take place in the data warehouse back-stage, concerning
the refreshment of the warehouse with clean, up-to-date, semantically reconciled data.
(iii) Information on the implicit semantics of data (with respect to a common enterprise model),
along with any other kind of data that aids the end-user to exploit the information of the
warehouse.
(iv) Information on the infrastructure and physical characteristics of components and the
sources of the data warehouse.
(v) Information including security, authentication, and usage statistics that aids the
administrator to tune the operation of the data warehouse as appropriate.
Metadata also improves data quality, such as consistency (uniform and no duplicates), completeness,
accuracy (precision and confidence of the data) and timeliness (whether values are
up-to-date), and it improves query, retrieval and answer quality.
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse is different
from that of the warehouse data, yet it is very important. The various roles of metadata are explained below.
The metadata act as a directory.
Metadata are also used for query tools.
Metadata are used in reporting tools.
Metadata are used in extraction and cleansing tools.
Metadata are used in transformation tools.
Metadata also plays important role in loading functions.
The directory helps the decision support system to locate the contents of data warehouse.
Metadata helps in decision support system for mapping of data when data are transformed from
operational environment to data warehouse environment.
Metadata helps in summarization between current detailed data and highly summarized data.
Metadata also helps in summarization between lightly detailed data and highly summarized
data.
A metadata repository is a database of data about data (metadata). A metadata repository is used in the
bottom tier of the data warehousing architecture.
The purpose of the metadata repository is to provide a consistent and reliable means of access to
data.
A metadata repository should contain the following:
1. A description of the data warehouse structure, which includes the warehouse schema, view,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and
contents.
2. Operational metadata, which include data lineage (history of migrated data and the sequence of
transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
3. The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
4. Mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and security (user
authorization and access control).
5. Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of refresh,
update, and replication cycles.
6. Business metadata, which include business terms and definitions, data ownership information,
and charging policies.
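A metadata repository entry can be pictured as a small record that groups technical, operational and business metadata for one warehouse table. The sketch below is a simplified illustration, not a standard schema: every class, field and table name in it is an assumption made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TechnicalMetadata:
    table: str
    columns: dict          # column name -> data type
    primary_key: str

@dataclass
class OperationalMetadata:
    currency: str                                 # "active", "archived" or "purged"
    lineage: list = field(default_factory=list)   # ordered transformations applied

@dataclass
class BusinessMetadata:
    definition: str
    owner: str

@dataclass
class RepositoryEntry:
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata

# The repository itself: table name -> entry.
repository = {}

def register(entry):
    """Store an entry under its table name."""
    repository[entry.technical.table] = entry

def locate(column):
    """Act as a directory: list the tables that contain a given column."""
    return sorted(
        table for table, entry in repository.items()
        if column in entry.technical.columns
    )

register(RepositoryEntry(
    TechnicalMetadata("dim_customer", {"customer_id": "INTEGER", "name": "TEXT"}, "customer_id"),
    OperationalMetadata("active", ["extracted from CRM", "names trimmed"]),
    BusinessMetadata("One row per customer", "Sales department"),
))
register(RepositoryEntry(
    TechnicalMetadata("fact_sales", {"sale_id": "INTEGER", "customer_id": "INTEGER", "amount": "REAL"}, "sale_id"),
    OperationalMetadata("active", ["extracted from billing", "tax subtracted"]),
    BusinessMetadata("One row per sale", "Finance department"),
))

tables_with_customer = locate("customer_id")
```

The `locate` function shows the directory role of metadata described earlier: a query tool can ask the repository where a column lives without scanning the warehouse data itself.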
* Various benefits of data warehousing are given below:
1. Improved Control of Data: Information in the data warehouse is under the control of data
warehouse users so that, even if the source system data is removed over time, the information in
the warehouse can be stored safely for extended periods of time.
2. Better Retrieval of Data: Because they are separate from operational systems, data warehouses
provide retrieval of data without slowing down operational systems.
3. Increased Productivity of Corporate Decision Makers: Data warehousing improves the
productivity of corporate decision makers by creating an integrated database of consistent,
subject-oriented, historical data. It integrates data from multiple incompatible systems into a
form that provides one consistent view of the organization. By transforming data into
meaningful information, a data warehouse allows business managers to perform more
substantive, accurate, and consistent analysis.
4. More Cost-Effective Decision-Making: Data warehousing helps to reduce the overall cost of the
product by reducing the number of channels.
5. Better Enterprise Intelligence: It helps to provide better enterprise intelligence and enhanced
customer service.
6. Potential High Returns on Investment: Return on investment (ROI) refers to the amount of
increased revenue. Implementations of data warehouses and complementary business
intelligence systems have enabled businesses to generate higher amounts of revenue and provide
substantial cost savings.
7. Competitive Advantage: The huge returns on investment for those companies/organizations
that have successfully implemented a data warehouse is evidence of the enormous competitive
advantage that accompanies this technology. The competitive advantage is gained by allowing
decision-makers access to data that can reveal previously unavailable, unknown, and untapped
information on, for example, customers, trends, and demands.
Limitations of Data Warehouse:
1. Hidden Problems with Source Systems: Sometimes hidden problems associated with the source
systems feeding the data warehouse may be identified after years of being undetected. For
example, when entering the details of a new property, certain fields may allow nulls, which may
result in staff entering incomplete property data, even when it is available and applicable.
2. Required Data Not Captured: In some cases the source systems do not capture data that
may be very important for the data warehouse purpose. For example, the date of
registration for the property may not be used in the source system but it may be very important
for analysis purposes.
3. Increased End-User Demands: Once a data warehouse is online, it is often the case that the
number of users and queries increases, together with requests for answers to more and more
complex queries.
4. Data Homogenization: The concept of data warehouse deals with similarity of data formats
between different data sources. This may result in the loss of some important value of the data.
5. High Demand for Resources: The data warehouse requires large amounts of resources, such as
storage, to hold its data.
6. Data Ownership: Data warehousing may change the attitude of end-users to the ownership of
data. Sensitive data that is owned by one department has to be loaded into the data warehouse for
decision-making purposes. Sometimes this results in reluctance from that department, because it
may hesitate to share the data with others.
7. High Maintenance: Data warehouses are high maintenance systems. Any reorganization of the
business processes and the source systems may affect the data warehouse, which results in high
maintenance cost.
8. Long-Duration Projects: The building of a warehouse can take up to three years, which is why
some organizations are reluctant to invest in a data warehouse. Data marts, in contrast,
support only the requirements of a particular department and limit the functionality to that
department or area only.
9. Complexity of Integration: An organization must spend a significant amount of time
determining how well the various different data warehousing tools can be integrated into the
overall solution that is needed. This can be a very difficult task, as there are a number of tools for
every operation of the data warehouse.
10. Under-Estimation of Resources for Data Loading: Sometimes, we underestimate the time
required to extract, clean, and load the data into the warehouse. It may take a significant
proportion of the total development time, although some tools are available to
reduce the time and effort spent on this process.