
Wollega University

Chapter One
Data Warehousing and Data Mining
November 13, 2021
Chapter 1
Overview & Brief Description of Data Mining
• Data warehousing, data mining, and database technology
• Online transaction processing and data mining
• Data Warehousing – Overview
• The term "Data Warehouse" was first coined by Bill Inmon in 1990.
According to Inmon, a data warehouse is a subject oriented,
integrated, time-variant, and non-volatile collection of data.
• This data helps analysts to take informed decisions in an organization.
• An operational database undergoes frequent changes on a daily
basis on account of the transactions that take place.
• Suppose a business executive wants to analyze previous feedback on
data such as a product, a supplier, or any consumer data. The
executive will have no data available to analyze, because the
previous data has been overwritten by subsequent transactions.
• A data warehouse provides generalized and consolidated data in a
multidimensional view. Along with this generalized and consolidated
view of data, a data warehouse also provides Online Analytical
Processing (OLAP) tools.
• These tools help us in interactive and effective analysis of data in a
multidimensional space.
• This analysis results in data generalization and data mining.
• Data mining functions such as association, clustering, classification,
and prediction can be integrated with OLAP operations to enhance the
interactive mining of knowledge at multiple levels of abstraction.
• That is why the data warehouse has now become an important platform
for data analysis and online analytical processing.
• Understanding a Data Warehouse
• A data warehouse is a database that is kept separate from the
organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization
to analyze its business.
• A data warehouse helps executives to organize, understand, and use
their data to take strategic decisions.
• Data warehouse systems help in the integration of application
systems.
• A data warehouse system helps in consolidated historical data
analysis.
• Why a Data Warehouse is Separated from
Operational Databases
• A data warehouse is kept separate from operational databases for
the following reasons −
• An operational database is constructed for well-known tasks and
workloads such as searching particular records, indexing, etc.
• Operational databases support concurrent processing of multiple
transactions.
• Concurrency control and recovery mechanisms are required for
operational databases to ensure robustness and consistency of the
database.
• An operational database query allows read and modify operations,
while an OLAP query needs only read-only access to stored data.
• An operational database maintains current data. On the other hand,
a data warehouse maintains historical data.
Data Mining - Overview
• There is a huge amount of data available in the Information
Industry. This data is of no use until it is converted into useful
information. It is necessary to analyze this huge amount of data
and extract useful information from it.
• Extraction of information is not the only process we need to
perform.
• Data mining also involves other processes such as data cleaning,
data integration, data transformation, pattern evaluation, and
data presentation.
• Once all these processes are over, we would be able to use this
information in many applications such as Fraud Detection, Market
Analysis, Production Control, Science Exploration, etc.
• What is Data Mining?
• Data Mining is defined as extracting information from huge sets of data.
Or:
• Data mining can be defined as “the process of discovering interesting
knowledge from large amounts of data stored in databases, data
warehouses, or other information repositories.”

• In other words, we can say that data mining is the procedure of mining
knowledge from data.
• The information or knowledge extracted in this way can be used in any
of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
• Data Mining - Systems
• There is a large variety of data mining systems available. Data mining
systems may integrate techniques from the following −

• Spatial Data Analysis
• Information Retrieval
• Pattern Recognition
• Image Analysis
• Signal Processing
• Computer Graphics
• Web Technology
• Business
• Bioinformatics
• Data Mining Techniques
• Data mining uses refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets.
• These tools can incorporate statistical models, machine learning
techniques, and mathematical algorithms, such as neural networks or
decision trees.
• In recent data mining projects, various major data mining techniques
have been developed and used, including association, classification,
clustering, prediction, sequential patterns, and regression.
• 1. Classification:
• This technique is used to obtain important and relevant information
about data and metadata. It helps to classify data items into
different classes.
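As an illustration only, here is a minimal classification sketch in Python, assuming scikit-learn is available (the library, the feature columns, and all values below are invented, not taken from the slides):

```python
# Hypothetical example: classify customers by [age, income] into
# "buys" / "does not buy" classes using a decision tree.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 30000], [40, 80000], [35, 60000], [22, 20000]]
y_train = ["no", "yes", "yes", "no"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Predict the class of an object whose class label is unknown.
print(model.predict([[30, 70000]]))
```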
• Data mining techniques can be classified by different criteria, as
follows:
• Classification of data mining frameworks as per the type of data
sources mined:
This classification is as per the type of data handled, for example
multimedia, spatial data, text data, time-series data, World Wide
Web data, and so on.
• Classification of data mining frameworks as per the database
involved:
This classification is based on the data model involved, for example
object-oriented, transactional, or relational databases.
• Classification of data mining frameworks as per the kind of
knowledge discovered:
This classification depends on the types of knowledge discovered or
the data mining functionalities, for example classification, clustering,
characterization, etc.
• Classification of data mining frameworks according to the data
mining techniques used:
This classification is as per the data analysis approach utilized, such as
neural networks, machine learning, genetic algorithms, visualization,
statistics, data warehouse-oriented or database-oriented approaches, etc.
• 2. Clustering:
• Clustering is the division of information into groups of connected
objects.
• In other words, clustering analysis is a data mining technique for
identifying similar data. This technique helps to recognize the
differences and similarities between data items.
• Clustering is very similar to classification, but it involves grouping
chunks of data together based on their similarities.
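For illustration, a minimal clustering sketch with k-means, assuming scikit-learn (the points and the choice of two clusters are invented):

```python
# Group 2-D points into two clusters without any given class labels.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # discovered group of each point
print(kmeans.cluster_centers_)  # center of each discovered group
```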
• 3. Regression:
• Regression analysis is the data mining process used to identify and
analyze the relationship between variables in the presence of
other factors.
• Regression is primarily a form of planning and modeling.
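A minimal regression sketch, again assuming scikit-learn (the variables and numbers are invented for illustration):

```python
# Model the relationship between one variable (e.g. advertising
# spend) and another (e.g. sales) with a fitted line.
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4], [5]]
y = [2.1, 4.0, 6.2, 7.9, 10.1]

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)  # learned slope and intercept
print(reg.predict([[6]]))         # prediction for an unseen value
```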

• 4. Association Rules:
• This data mining technique helps to discover links between two or
more items. It finds hidden patterns in the data set.
• Association rules are if-then statements that describe the probability
of interactions between data items within large data sets in different
types of databases.
• Association rule mining has several applications and is commonly
used to help find sales correlations in transactional data or in medical
data sets.
• There are three major measurement techniques:
• Support: measures how often items A and B are purchased together,
relative to the overall dataset.
Support = (Item A + Item B) / (Entire dataset)
• Confidence: measures how often item B is purchased when item A is
purchased as well.
Confidence = (Item A + Item B) / (Item A)
• Lift: measures the accuracy of the confidence relative to how often
item B is purchased on its own.
Lift = Confidence / ((Item B) / (Entire dataset))
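The sketch below computes all three measures for a hypothetical rule "bread -> milk" over a toy set of market-basket transactions (plain Python, no libraries; the baskets are invented):

```python
# Support, confidence, and lift for the rule A -> B.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
A, B = "bread", "milk"
n = len(transactions)

both   = sum(1 for t in transactions if A in t and B in t)
with_a = sum(1 for t in transactions if A in t)
with_b = sum(1 for t in transactions if B in t)

support    = both / n                   # (Item A + Item B) / entire dataset
confidence = both / with_a              # (Item A + Item B) / Item A
lift       = confidence / (with_b / n)  # confidence / P(Item B)

print(support, confidence, lift)        # 0.5, 0.666..., 0.888...
```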
• 5. Outlier detection:
• This type of data mining technique relates to the observation of data
items in the data set that do not match an expected pattern or
expected behavior.
• This technique may be used in various domains such as intrusion
detection, fraud detection, etc. It is also known as outlier analysis or
outlier mining.
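A minimal outlier-analysis sketch using only the standard library (the values and the two-standard-deviations threshold are an illustrative assumption, not a rule from the slides):

```python
# Flag values that deviate strongly from the general behavior of the data.
import statistics

values = [10, 12, 11, 13, 12, 95, 11, 10]  # 95 does not match the pattern

mean = statistics.mean(values)
sd = statistics.stdev(values)

outliers = [v for v in values if abs(v - mean) > 2 * sd]
print(outliers)  # [95]
```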

• 6. Sequential Patterns:
• Sequential pattern mining is a data mining technique specialized
for evaluating sequential data in order to discover sequential patterns.

• 7. Prediction:
• Prediction uses a combination of other data mining techniques, such
as trend analysis, clustering, classification, etc.
• Data Mining Architecture
• The significant components of data mining systems are a data source,
data mining engine, data warehouse server, the pattern evaluation
module, graphical user interface, and knowledge base.

[Figure: Data Mining System Architecture]
• Data Source:
• The actual sources of data are databases, data warehouses, the World
Wide Web (WWW), text files, and other documents. A huge amount of
historical data is needed for data mining to be successful.
• Organizations typically store data in databases or data warehouses.
Data warehouses may comprise one or more databases. Another
primary source of data is the World Wide Web, or the internet.
• Different processes:
• Before passing the data to the database or data warehouse server,
the data must be cleaned, integrated, and selected.
• As the information comes from various sources and in different
formats, it can't be used directly for the data mining procedure
because the data may not be complete and accurate.
• So, the data first needs to be cleaned and unified.
• These procedures are not as simple as they sound. Several methods may
be performed on the data as part of selection, integration, and
cleaning.
• Database or Data Warehouse Server:
• The database or data warehouse server consists of the original data
that is ready to be processed.
• The server is responsible for retrieving the relevant data, based on
the user's data mining request.
• Data Mining Engine:
• The data mining engine is a major component of any data mining
system.
• It contains several modules for operating data mining tasks, including
association, characterization, classification, clustering, prediction,
time-series analysis, etc.
• Pattern Evaluation Module:
• The pattern evaluation module is primarily responsible for measuring
how interesting a discovered pattern is, using a threshold value.
• It collaborates with the data mining engine to focus the search on
interesting patterns.
• Graphical User Interface:
• The graphical user interface (GUI) module communicates between
the data mining system and the user.
• This module helps the user to easily and efficiently use the system
without knowing the complexity of the process.
• This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.
• Knowledge Base:
• The knowledge base is helpful in the entire process of data mining. It
may be used to guide the search for result patterns.
• The knowledge base may contain user views and data from user
experiences that might be helpful in the data mining process.
• The data mining engine may receive inputs from the knowledge base
to make the result more accurate and reliable.
Data Mining Functionalities
 There are different kinds of data mining functionalities that can be
used to extract various types of patterns from data:
– Concept /class description: Characterization and discrimination
– Association Analysis
– Classification and prediction
– Clustering analysis
– Outlier analysis
– Evolution analysis



1. Concept/class description: Characterization and
discrimination
– Given a class or classes with data that belong to them, describe
each class by making observations of its members.
• Data characterisation is a summarisation of the general features of
objects in a target class, and produces what are called characteristic
rules.
• Data discrimination is a description made by comparative
analysis between the target class and other comparative classes
(contrasting classes).
2. Association Analysis
Association analysis is based on association rules. It studies the
frequency of items occurring together in transactional databases
and, based on a threshold called support, identifies the frequent
item sets.
Another threshold, confidence, which is the conditional probability
that an item appears in a transaction when another item appears, is
used to pinpoint association rules.
Association analysis is commonly used for market basket analysis.
3. Classification and Prediction
– Classification is the process of finding a set of models (or functions) that
describe and distinguish data classes or concepts, for the purpose of
using the model to predict the class of objects whose class label is
unknown.
– Prediction is the process of predicting some missing or unavailable data values
rather than class labels.
4. Cluster analysis
• Similar to classification, clustering is the organisation of data in
classes.
• Unlike classification, it is used to place data elements into related
groups without advance knowledge of the group definitions, i.e.
class labels are unknown and it is up to the clustering algorithm to
discover acceptable classes.
• Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels.
• There are many clustering approaches, all based on the principle of
maximising the similarity between objects in the same class (intra-
class similarity) and minimising the similarity between objects of
different classes (inter-class similarity).
5. Outlier analysis
– A database may contain data objects that do not comply with the
general behavior or model of the data.
– These data objects are outliers.
– Outlier data items are usually considered noise or exceptions in
many data mining applications.
6. Trend and evolution analysis
– Describe and model regularities or trends for objects whose
behavior changes over time.
– It is also referred to as regression analysis, sequential pattern mining,
periodicity analysis, or similarity-based analysis.
• Data Mining System Classification
• A data mining system can be classified according to the following
criteria −
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
• Apart from these, a data mining system can also be classified based
on the kind of (a) databases mined, (b) knowledge mined, (c)
techniques utilized, and (d) applications adapted.
• Classification Based on the Databases Mined
• We can classify a data mining system according to the kind of databases
mined. Database systems can be classified according to different criteria,
such as data models or types of data, and the data mining system can be
classified accordingly.
• For example, if we classify a database according to the data model, then we
may have a relational, transactional, object-relational, or data warehouse
mining system.
• Classification Based on the kind of Knowledge
Mined
• We can classify a data mining system according to the kind of knowledge
mined. It means the data mining system is classified on the basis of
functionalities such as −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
• Classification Based on the Techniques Utilized
• We can classify a data mining system according to the kind of
techniques used.
• We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.

• Classification Based on the Applications Adapted
• We can classify a data mining system according to the applications
adapted. These applications are as follows −
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
• Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web
databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information
systems
• Data Mining - Terminologies
• Data Integration
• Data Integration is a data preprocessing technique that merges
the data from multiple heterogeneous data sources into a
coherent data store.
• Data integration may involve inconsistent data and therefore
needs data cleaning.
• Data Cleaning
• Data cleaning is a technique that is applied to remove the noisy
data and correct the inconsistencies in data.
• Data cleaning involves transformations to correct the wrong data.
• Data cleaning is performed as a data preprocessing step while
preparing the data for a data warehouse.
• Data Selection
• Data Selection is the process where data relevant to the analysis
task are retrieved from the database.
• Sometimes data transformation and consolidation are performed
before the data selection process.
• Clusters
• A cluster refers to a group of objects of a similar kind.
• Cluster analysis refers to forming groups of objects that are very
similar to each other but highly different from the objects in
other clusters.
• Data Transformation
• In this step, data is transformed or consolidated into forms
appropriate for mining, by performing summary or aggregation
operations.
• What is Knowledge Discovery?
• Data mining is an essential step in the process of knowledge
discovery.
• Here is the list of steps involved in the knowledge discovery process −
• Data Cleaning − In this step, noise and inconsistent data are
removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Selection − In this step, data relevant to the analysis task are
retrieved from the database.
• Data Transformation − In this step, data is transformed or
consolidated into forms appropriate for mining by performing
summary or aggregation operations.
• Data Mining − In this step, intelligent methods are applied in order to
extract data patterns.
• Pattern Evaluation − In this step, data patterns are evaluated.
• Knowledge Presentation − In this step, knowledge is represented.
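As an illustration only, the sketch below compresses these steps into a few lines of Python with pandas (an assumed dependency; the tables, column names, and values are hypothetical):

```python
import pandas as pd

# Data integration: combine two sources.
sales = pd.DataFrame({"cust_id": [1, 2, 2, 3],
                      "amount": [10.0, None, 25.0, 40.0]})
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "region": ["east", "west", "east"]})
data = sales.merge(customers, on="cust_id")

# Data cleaning: drop noisy / incomplete records.
data = data.dropna(subset=["amount"])

# Data selection and transformation: relevant columns, aggregated form.
summary = data.groupby("region")["amount"].sum()

# "Data mining" reduced here to one trivial pattern; pattern evaluation
# and knowledge presentation would normally follow.
print("top region:", summary.idxmax(), summary.max())
```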
A KDD Process
• Data mining is the core of knowledge discovery in databases.
• [Figure: the KDD process flow: Databases → Data Cleaning and
Data Integration → Data Warehouse → Selection & Transformation →
Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining in Business Intelligence
• [Figure: the business intelligence stack, with increasing potential to
support business decisions toward the top:
Data Sources (paper, files, web documents, scientific experiments,
database systems; handled by the DBA) →
Data Preprocessing/Integration, Data Warehouses (DBA) →
Data Exploration: statistical summary, querying, and reporting
(data analyst) →
Data Mining: information discovery (data analyst) →
Data Presentation: visualization techniques (business analyst) →
End User Decision Making]
• Data Mining Applications

• Data mining is highly useful in the following domains −
• Market Analysis and Management
• Corporate Analysis & Risk Management
• Fraud Detection

• Apart from these, data mining can also be used in the areas of
production control, customer retention, science exploration,
sports, astrology, and Internet Web Technology.
• Market Analysis and Management
• Listed below are the various fields of the market where data mining
is used −
• Customer Profiling − Data mining helps determine what kind of
people buy what kind of products.
• Identifying Customer Requirements − Data mining helps in
identifying the best products for different customers. It uses
prediction to find the factors that may attract new customers.
• Cross Market Analysis − Data mining performs association and
correlation analysis between product sales.
• Determining Customer Purchasing Patterns − Data mining helps in
determining customer purchasing patterns.
• Providing Summary Information − Data mining provides us
various multidimensional summary reports.
• Corporate Analysis and Risk Management
• Data mining is used in the following fields of the corporate sector −
• Finance Planning and Asset Evaluation − It involves cash flow
analysis and prediction, and contingent claim analysis to evaluate
assets.
• Resource Planning − It involves summarizing and comparing the
resources and spending.
• Competition − It involves monitoring competitors and market
directions.
• Fraud Detection
• Data mining is also used in the fields of credit card services and
telecommunication to detect fraud.
• For fraudulent telephone calls, it helps to analyze the destination of
the call, duration of the call, time of the day or week, etc.
• It also analyzes the patterns that deviate from expected norms.
• Classification and Prediction
• Classification is the process of finding a model that describes the
data classes or concepts.
• The purpose is to be able to use this model to predict the class of
objects whose class label is unknown.
• This derived model is based on the analysis of sets of training
data.
• The derived model can be presented in the following forms −
• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
• The list of functions involved in these processes is as follows −
• Classification − It predicts the class of objects whose class label is
unknown.
• Its objective is to find a derived model that describes and
distinguishes data classes or concepts.
• The derived model is based on data objects whose class labels
are well known.
• Prediction − It is used to predict missing or unavailable numerical
data values rather than class labels. Regression Analysis is
generally used for prediction.
• Outlier Analysis − Outliers may be defined as the data objects
that do not comply with the general model of the data available.
• Evolution Analysis − Evolution analysis refers to the description
for objects whose behavior changes over time.
Challenges in Data Mining
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, stream, and incremental mining methods
 Handling high-dimensionality
 Handling noise, uncertainty, and incompleteness of data
 Incorporation of constraints, expert knowledge, and background knowledge
 Pattern evaluation and knowledge integration

• Data Mining and Database Technology
• We can classify a data mining system according to the kind of
databases mined.
• Database systems can be classified according to different criteria,
such as data models or types of data, and the data mining system can
be classified accordingly.
• For example, if we classify a database according to the data model,
then we may have a relational, transactional, object-relational, or
data warehouse mining system.
• Transaction Databases:
• A transaction database is a set of records representing transactions,
each with a time stamp, an identifier, and a set of items. Descriptive
data for the items may be associated with the transaction file.
• For example, in the case of a video store, the rentals table
represents the transaction database. Each record is a rental contract
with a customer identifier, a date, and the list of items rented (i.e.
video tapes, games, VCR, etc.).
• Transactions are usually stored in flat files or in two normalized
transaction tables, one for the transactions and one for the
transaction items.
• Multimedia Databases:
• Multimedia databases include video, image, audio, and text media.
• They can be stored on extended object-relational or object-oriented
databases, or simply on a file system.
• Multimedia is characterized by its high dimensionality, which makes
data mining even more challenging.
• Data mining from multimedia repositories may require computer
vision, computer graphics, image interpretation, and natural language
processing methodologies.
• Spatial Databases:
• Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional
positioning.
• Such spatial databases present new challenges to data mining
algorithms.
• Time-Series Databases:
• Time-series databases contain time-related data such as stock market
data or logged activities.
• These databases usually have a continuous flow of new data coming
in, which sometimes creates a need for challenging real-time
analysis.
• Data mining in such databases commonly includes the study of trends
and correlations between evolutions of different variables, as well as
the prediction of trends and movements of the variables in time.
• What is OLTP? (Online Transaction Processing and
Data Mining)
• OLTP is an operational system that supports transaction-oriented
applications in a 3-tier architecture. It administers the day-to-day
transactions of an organization.
• OLTP is basically focused on query processing and on maintaining data
integrity in multi-access environments; its effectiveness is
measured by the total number of transactions per second.
• The full form of OLTP is Online Transaction Processing.
• Architecture of OLTP
• Business / Enterprise Strategy: Enterprise strategy deals with the issues
that affect the organization as a whole. In OLTP, it is typically developed at a
high level within the firm, by the board of directors or the top management
• Business Process: OLTP business process is a set of activities and tasks that,
once completed, will accomplish an organizational goal.
• Customers, Orders, and Products: OLTP database store information about
products, orders (transactions), customers (buyers), suppliers (sellers), and
employees.
• ETL Processes: ETL extracts the data from various RDBMS source
systems, then transforms the data and loads the processed data into
the Data Warehouse system.
• Data Mart and Data warehouse: A data mart is a structure/access pattern
specific to data warehouse environments. It is used by OLAP to store
processed data.
• Data Mining, Analytics, and Decision Making: Data stored in the data mart
and data warehouse can be used for data mining, analytics, and decision
making.
Data Warehouse (OLAP) vs. Operational Database (OLTP):
1. OLAP involves historical processing of information; OLTP involves
day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives,
managers, and analysts; OLTP systems are used by clerks, DBAs, or
database professionals.
3. OLAP is used to analyze the business; OLTP is used to run the
business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the Star, Snowflake, and Fact Constellation
schemas; OLTP is based on the Entity Relationship model.
6. OLAP is subject oriented; OLTP is application oriented.
7. OLAP contains historical data; OLTP contains current data.
8. OLAP provides summarized and consolidated data; OLTP provides
primitive and highly detailed data.
9. OLAP provides a summarized and multidimensional view of data;
OLTP provides a detailed and flat relational view.
10. The number of OLAP users is in the hundreds; the number of OLTP
users is in the thousands.
11. The number of records accessed by OLAP is in the millions; by
OLTP, in the tens.
12. OLAP database sizes range from 100 GB to 100 TB; OLTP database
sizes range from 100 MB to 100 GB.
13. OLAP systems are highly flexible; OLTP provides high performance.
• Advantages of OLTP:
• Following are the benefits of an OLTP system:
• OLTP offers accurate forecasts for revenue and expense.
• It provides a solid foundation for a stable business /organization due
to timely modification of all transactions.
• OLTP makes transactions much easier on behalf of the customers.
• OLTP provides support for bigger databases.
• Partition of data for data manipulation is easy.
• We need OLTP for tasks that are frequently performed by the
system, that touch only a small number of records, and that
involve the insertion, update, or deletion of data.
• Disadvantages of OLTP
• Here are the drawbacks of an OLTP system:
• If the OLTP system faces hardware failures, then online transactions
are severely affected.
• OLTP systems allow multiple users to access and change the same
data at the same time, which can create unexpected situations.
• If the server hangs even for seconds, it can affect a large number of
transactions.
• OLTP requires a lot of staff working in groups in order to maintain
inventory.
• Online transaction processing systems do not have proper methods
of transferring products to buyers by themselves.
• Server failure may lead to wiping out large amounts of data from the
database.
• You can perform only a limited number of queries and updates.
• Challenges of an OLTP System
• It allows more than one user to access and change the same data
simultaneously.
• Therefore, it requires concurrency control and recovery techniques in
order to avoid unexpected situations.
• OLTP data are not suitable for decision making. You have to use the
data of OLAP systems for "what if" analysis and decision making.
Wollega University
Chapter Two
Data Warehousing and Data Mining
November 13, 2021
Data Warehouse Concepts
• The basic concept of a data warehouse is to facilitate a single version
of truth for a company, for decision making and forecasting.
• A data warehouse is an information system that contains historical
and cumulative data from single or multiple sources.
• Data Warehouse Concepts simplify the reporting and analysis
process of organizations.
• Types of Data Warehouse
• Information processing, analytical processing, and data mining are
the three types of data warehouse applications that are discussed
below −
• Information Processing − A data warehouse allows us to process the
data stored in it. The data can be processed by means of querying,
basic statistical analysis, and reporting using crosstabs, tables, charts,
or graphs.
• Analytical Processing − A data warehouse supports analytical
processing of the information stored in it. The data can be analyzed
by means of basic OLAP operations, including slice-and-dice, drill
down, drill up, and pivoting.
• Data Mining − Data mining supports knowledge discovery by finding
hidden patterns and associations, constructing analytical models,
performing classification and prediction. These mining results can be
presented using visualization tools.
• Characteristics of Data Warehouse
• A data warehouse can be viewed as an information system with
the following attributes:
– It is a database designed for analytical tasks
– Its content is periodically updated
– It contains current and historical data to provide a historical
perspective of information
• Data Warehouse Features
• The key features of a data warehouse are discussed below −
• Subject Oriented − A data warehouse is subject oriented because it
provides information around a subject rather than the organization's
ongoing operations.
• These subjects can be product, customers, suppliers, sales, revenue, etc.
• A data warehouse does not focus on the ongoing operations, rather it
focuses on modeling and analysis of data for decision making.
• Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc.
• This integration enhances the effective analysis of data.
• Time Variant − The data collected in a data warehouse is identified with a
particular time period.
• The data in a data warehouse provides information from the historical point
of view.
• Non-volatile − Non-volatile means the previous data is not erased when
new data is added to it.
• A data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database are not
reflected in the data warehouse.
• Note − A data warehouse does not require transaction processing,
recovery, and concurrency controls, because it is physically stored
separately from the operational database.
• Data Warehouse Applications
• A data warehouse helps business executives to organize, analyze, and
use their data for decision making.
• A data warehouse serves as a sole part of a plan-execute-assess
"closed-loop" feedback system for enterprise management.
• Data warehouses are widely used in the following fields −
• Financial services
• Banking services
• Consumer goods
• Retail sectors
• Controlled manufacturing
• Functions of Data Warehouse Tools and Utilities

• The following are the functions of data warehouse tools and
utilities −
• Data Extraction − Involves gathering data from multiple
heterogeneous sources.
• Data Cleaning − Involves finding and correcting the errors in data.
• Data Transformation − Involves converting the data from legacy
format to warehouse format.
• Data Loading − Involves sorting, summarizing, consolidating,
checking integrity, and building indices and partitions.
• Refreshing − Involves updating from data sources to warehouse.
• Note − Data cleaning and data transformation are important steps
in improving the quality of data and data mining results.
• Data Warehousing - Terminologies
• In this chapter, we will discuss some of the most commonly used
terms in data warehousing.
• Metadata
• Metadata is simply defined as data about data. Data that is used
to represent other data is known as metadata.
• For example, the index of a book serves as metadata for the
contents of the book. In other words, we can say that metadata is
the summarized data that leads us to the detailed data.
• In terms of a data warehouse, we can define metadata as
follows −
• Metadata is a road-map to the data warehouse.
• Metadata in data warehouse defines the warehouse objects.
• Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
• Metadata Repository
• Metadata repository is an integral part of a data warehouse
system. It contains the following metadata −
• Business metadata − It contains the data ownership information,
business definition, and changing policies.
• Operational metadata − It includes currency of data and data
lineage. Currency of data refers to the data being active, archived,
or purged. Lineage of data means history of data migrated and
transformation applied on it.
• Data for mapping from the operational environment to the data
warehouse − This metadata includes source databases and their
contents, data extraction, data partitioning, cleaning, transformation
rules, and data refresh and purging rules.
• The algorithms for summarization − It includes dimension
algorithms, data on granularity, aggregation, summarizing, etc.
• Data Cube
• A data cube helps us represent data in multiple dimensions. It is
defined by dimensions and facts.
• The dimensions are the entities with respect to which an
enterprise preserves the records.
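As a small illustration, a 2-D cube can be sketched with a pandas pivot table (pandas is an assumed dependency; the dimensions and figures below are invented):

```python
import pandas as pd

# Fact records with two dimensions (quarter, location) and one fact (sales).
facts = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Addis", "Nekemte", "Addis", "Nekemte"],
    "sales":    [100, 80, 120, 90],
})

# One axis per dimension, aggregated sales in the cells.
cube = facts.pivot_table(values="sales", index="quarter",
                         columns="location", aggfunc="sum")
print(cube)
```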

• Data Mart
• Data marts contain a subset of organization-wide data that is
valuable to specific groups of people in an organization.
• In other words, a data mart contains only the data that is
specific to a particular group.
• For example, the marketing data mart may contain only data
related to items, customers, and sales. Data marts are confined to
subjects.
• Points to Remember About Data Marts

• Windows-based or Unix/Linux-based servers are used to
implement data marts. They are implemented on low-cost
servers.
• The implementation cycle of a data mart is measured in short
periods of time, i.e., in weeks rather than months or years.
• The life cycle of data marts may be complex in the long run, if
their planning and design are not organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is a departmentally structured data
warehouse.
• Data marts are flexible.
• [Figure: graphical representation of data marts]
• Other important terminology in data warehouse
• Enterprise Data warehouse: It collects all information about
subjects (customers, products, sales, assets, personnel) that
span the entire organization .
• Decision Support System (DSS): Information technology to
help the knowledge worker (executive, manager, analyst)
make faster and better decisions.
• Designing the Data Warehouse
• A data warehouse is difficult to build due to the following reasons:
• Heterogeneity of data sources
• Use of historical data
• Growing nature of the database
• The data warehouse design approach must be a business-driven,
continuous, and iterative engineering approach.
• In addition to the general considerations, the following
specific points are relevant to data warehouse design:
• Data content :
• The content and structure of the data warehouse are reflected in its
data model.
• The data model is the template that describes how information will
be organized within the integrated warehouse framework.
• The data warehouse data must be detailed data. It must be
formatted, cleaned up, and transformed to fit the warehouse data
model.
• Meta data
• It defines the location and contents of data in the warehouse.
• Meta data is searchable by users to find definitions or subject
areas.
• In other words, it must provide decision-support-oriented pointers
to warehouse data and provide a logical link between warehouse
data and decision support applications.
• Data distribution
• One of the biggest challenges when designing a data warehouse is
the data placement and distribution strategy.
• It becomes necessary to know how the data should be divided
across multiple servers, and which users should get access to which
types of data.
• The data can be distributed based on the subject area, location
(geographical region), or time (current, month, year).
• Tools
• A number of tools are available that are specifically designed to
help in the implementation of the data warehouse.
• All selected tools must be compatible with the given data
warehouse environment and with each other.
• All tools must be able to use a common Meta data repository.
• Technical considerations
• A number of technical issues are to be considered when designing a
data warehouse environment. These issues include:
• The hardware platform that would house the data warehouse
• The dbms that supports the warehouse data
• The communication infrastructure that connects data marts,
operational systems and end users
• The hardware and software to support meta data repository
• The systems management framework that enables admin of the
entire environment
• In general, the warehouse design process consists of the
following steps:
• 1. Choose a business process to model. If the business process is
organizational and involves multiple complex object collections, a data
warehouse model should be followed.
• If the process is departmental and focuses on the analysis of one kind
of business process, a data mart model should be chosen.
• 2. Choose the business process grain, which is the fundamental, atomic
level of data to be represented in the fact table for this process.
• 3. Choose the dimensions that will apply to each fact table record.
Typical dimensions are time, item, customer, supplier, warehouse,
transaction type, and status.
• 4. Choose the measures that will populate each fact table record.
Typical measures are numeric quantities like dollars sold and units
sold.
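As an illustration of what these four choices produce, the sketch below uses pandas DataFrames to stand in for one fact table and two dimension tables (pandas is an assumed dependency; all table contents are invented):

```python
import pandas as pd

# Dimensions chosen in step 3.
dim_time = pd.DataFrame({"time_id": [1, 2], "month": ["Jan", "Feb"]})
dim_item = pd.DataFrame({"item_id": [10, 11], "name": ["pen", "book"]})

# Fact table at the chosen grain (step 2): one row per item sold,
# carrying the measures chosen in step 4.
fact_sales = pd.DataFrame({
    "time_id": [1, 1, 2],
    "item_id": [10, 11, 10],
    "dollars_sold": [5.0, 12.0, 7.5],
    "units_sold":   [5, 2, 8],
})

# Joining facts to dimensions supports analysis queries.
report = (fact_sales.merge(dim_time, on="time_id")
                    .merge(dim_item, on="item_id"))
print(report.groupby("month")["dollars_sold"].sum())
```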
• Process of Data Warehouse Design
• A data warehouse can be built using three approaches:
• 1. A top-down approach
• 2. A bottom-up approach
• 3. A combination of both approaches
• Data warehouses are designed to facilitate reporting and analysis.
• The top-down approach starts with the overall design and planning. It is
useful in cases where the technology is mature and well-known, and
where the business problems that must be solved are clear and well-
understood.
• The bottom-up approach starts with experiments and prototypes. This is
useful in the early stage of business modeling and technology
development. It allows an organisation to move forward at considerably
less expense and to evaluate the benefits of the technology before
making significant commitments.
• In the combined approach, an organisation can exploit the planned and
strategic nature of the top-down approach while retaining the rapid
implementation and opportunistic application of the bottom-up
approach.
• Data Warehouse Design Architecture
• Data warehouse architecture is complex, as it is an information
system that contains historical and cumulative data from multiple
sources.
• There are three approaches for constructing data warehouse layers:
single tier, two tier, and three tier. The three-tier architecture of a
data warehouse is explained below.
• Single-tier architecture
• The objective of a single layer is to minimize the amount of data
stored, by removing data redundancy. This architecture is
not frequently used in practice.
• Two-tier architecture
• Two-tier architecture physically separates the available
sources from the data warehouse itself. This architecture is
not expandable and does not support a large number of
end-users. It also has connectivity problems because of
network limitations.
• Three-Tier Data Warehouse Architecture
• This is the most widely used Architecture of Data Warehouse.
• It consists of the Top, Middle and Bottom Tier.
• Bottom Tier: The database of the data warehouse serves as the
bottom tier. It is usually a relational database system. Data is
cleansed, transformed, and loaded into this layer using back-end
tools.
• Middle Tier: The middle tier in Data warehouse is an OLAP server
which is implemented using either ROLAP or MOLAP model. For a
user, this application tier presents an abstracted view of the
database. This layer also acts as a mediator between the end-user
and the database.
• Top-Tier: The top tier is a front-end client layer. It holds the tools
and APIs used to connect to and get data out of the data
warehouse, such as query tools, reporting tools, managed query
tools, analysis tools, and data mining tools.
• The Data Warehouse is based on an RDBMS server which is a
central information repository that is surrounded by some key Data
Warehousing components to make the entire environment
functional, manageable and accessible.
• There are mainly five Data Warehouse Components:

• Data Warehouse Database
• The central database is the foundation of the data warehousing
environment. This database is implemented on the RDBMS
technology.
• This kind of implementation is constrained by the fact that
traditional RDBMS systems are optimized for transactional database
processing and not for data warehousing.
• For instance, ad-hoc queries, multi-table joins, and aggregates are
resource-intensive and slow down performance.
• Hence, alternative database approaches are used, as listed
below −
• In a data warehouse, relational databases are deployed in parallel
to allow for scalability.
• Parallel relational databases also allow shared memory or shared
nothing model on various multiprocessor configurations or
massively parallel processors.
• New index structures are used to bypass relational table scans and
improve speed.
• Multidimensional databases (MDDBs) are used to overcome
limitations imposed by the relational data warehouse models.
Example: Essbase from Oracle.
• Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
• The data sourcing, transformation, and migration tools are used for
performing all the conversions, summarizations, and all the changes
needed to transform data into a unified format in the data
warehouse.
• They are also called Extract, Transform and Load (ETL) Tools.
• Their functionality includes:
• Eliminating unwanted data in operational databases from being
loaded into the data warehouse.
• Searching for and replacing common names and definitions for data
arriving from different sources.
• Calculating summaries and derived data.
• Populating missing data with defaults.
• De-duplicating repeated data arriving from multiple data sources.
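A minimal sketch of these ETL functions in Python with pandas (an assumed dependency; the source rows, column names, and the cut-off used to drop unwanted data are invented):

```python
import pandas as pd

# Hypothetical extract from an operational source.
raw = pd.DataFrame({
    "CustName": ["Abebe", "Abebe", "Sara", "Tola"],
    "amt":      [10.0, 10.0, None, 99999.0],
})

transformed = (raw
    .rename(columns={"CustName": "customer_name",  # common names and
                     "amt": "amount"})             # definitions
    .fillna({"amount": 0.0})                       # defaults for missing data
    .drop_duplicates())                            # de-duplicate repeats

# Eliminate unwanted rows (here: an implausible amount) before loading.
load_ready = transformed[transformed["amount"] < 1000]
print(load_ready)
```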
• Note:
• These Extract, Transform, and Load tools may generate background
jobs, Cobol programs, shell scripts, etc. that regularly update data in
data warehouse. These tools are also helpful to maintain the
Metadata.
• These ETL Tools have to deal with challenges of Database & Data
heterogeneity.
• Metadata
• The name Meta Data suggests some high-level technological concept,
but it is quite simple.
• Metadata is data about data which defines the data warehouse. It
is used for building, maintaining and managing the data warehouse.
• In the Data Warehouse Architecture, meta-data plays an important
role as it specifies the source, usage, values, and features of data
warehouse data.
• It also defines how data can be changed and processed. It is closely
connected to the data warehouse.
• Metadata helps to answer the following questions
• What tables, attributes, and keys does the Data Warehouse
contain?
• Where did the data come from?
• How many times do data get reloaded?
• What transformations were applied with cleansing?

• Metadata can be classified into the following categories:
• Technical Meta Data: This kind of metadata contains information
about the warehouse which is used by data warehouse designers and
administrators.
• Business Meta Data: This kind of metadata contains detail that
gives end-users an easy way to understand the information stored in
the data warehouse.
• Data Warehousing Query Tools
• One of the primary objectives of data warehousing is to provide
information to businesses for making strategic decisions.
• Query tools allow users to interact with the data warehouse
system. These tools fall into four different categories:
• Query tools allow users to interact with the data warehouse
system. These tools fall into four different categories:
• Query and reporting tools
• Application Development tools
• Data mining tools
• OLAP tools
• 1. Query and reporting tools:
• Query and reporting tools can be further divided into
• Reporting tools
• Managed query tools
• Reporting tools:
• Reporting tools can be further divided into production reporting
tools and desktop report writers.
• Report writers: These reporting tools are designed for end-users
to do their own analysis.
• Production reporting: These tools allow organizations to generate
regular operational reports. They also support high-volume batch jobs
like printing and calculating.
• Some popular reporting tools are Brio, Business Objects, Oracle, Power
Soft, SAS Institute.
• Managed query tools:
• These access tools help end users avoid snags in SQL and
database structure by inserting a meta-layer between users and the
database.
• Examples of types of reporting tools include:
• 1. Business intelligence tools: These are software applications that
simplify the process of development and production of business
reports based on data warehouse data.
• 2. Executive information systems (more widely known as business
dashboards): These are software applications that are
used to display complex business metrics and information in a
graphical way to allow rapid understanding.
• 3. OLAP Tools: OLAP tools form data into logical multi-dimensional
structures and allow users to select which dimensions to view data
by.
• 4. Data Mining: Data mining tools are software tools that allow users
to perform detailed mathematical and statistical calculations on
detailed data warehouse data to detect trends, identify patterns,
and analyze data.
• 5. Application development tools:
• Sometimes built-in graphical and analytical tools do not satisfy the
analytical needs of an organization. In such cases, custom reports
are developed using Application development tools.
• Data warehouse Bus Architecture
• Data warehouse Bus determines the flow of data in your
warehouse.
• The data flow in a data warehouse can be categorized as inflow,
upflow, downflow, outflow, and meta flow.
• While designing a Data Bus, one needs to consider the shared
dimensions and facts across data marts.
• Data Marts
• A data mart is an access layer which is used to get data out to the
users. It is presented as an option for a large-size data warehouse,
as it takes less time and money to build.
• There is no standard definition of a data mart; it differs from
person to person.
• In simple words, a data mart is a subsidiary of a data warehouse. The
data mart is used to partition data for a specific group of users.
Data marts can be created in the same database as the data
warehouse or in a physically separate database.
• Operations
• A data warehouse operation comprises the processes of
loading, manipulating, and extracting data from the data
warehouse.
• Operations also cover user management, security, capacity
management and related functions.
• Roll-up: The roll-up operation (also called the drill-up operation)
performs aggregation on a data cube, either by climbing up a
concept hierarchy for a dimension or by dimension reduction.
• When roll-up is performed by dimension reduction, one or more
dimensions are removed from the given cube.
• Drill-down: Drill-down is the reverse of roll-up. It navigates from
less detailed data to more detailed data.
• Drill-down can be realized by either stepping down a concept
hierarchy for a dimension or introducing additional dimensions.
• Slice and dice: The slice operation performs a selection on one
dimension of the given cube, resulting in a subcube.
• The dice operation defines a subcube by performing a selection on
two or more dimensions.
• Pivot (rotate): Pivot (also called rotate) is a visualization operation
that rotates the data axes in view to provide an alternative data
presentation.
• An example is a pivot operation where the item and location axes in
a 2-D slice are rotated.
• Other examples include rotating the axes in a 3-D cube, or
transforming a 3-D cube into a series of 2-D planes.
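For illustration, the sketch below imitates slice, dice, roll-up, and pivot on a small cube held as a pandas DataFrame (pandas assumed; the records are invented):

```python
import pandas as pd

cube = pd.DataFrame({
    "year":     [2020, 2020, 2021, 2021],
    "quarter":  ["Q1", "Q2", "Q1", "Q2"],
    "location": ["Addis", "Nekemte", "Addis", "Nekemte"],
    "sales":    [100, 120, 90, 130],
})

sliced = cube[cube["year"] == 2020]              # slice: fix one dimension
diced = cube[(cube["year"] == 2020) &            # dice: select on two or
             (cube["location"] == "Addis")]      # more dimensions
rolled_up = cube.groupby("year")["sales"].sum()  # roll-up: drop a dimension
pivoted = cube.pivot_table(values="sales", index="quarter",
                           columns="location", aggfunc="sum")  # pivot axes
print(rolled_up)
```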
• Major Issues in Data Warehousing
• Building a data warehouse is difficult and challenging; but when data
warehouses work properly, they are magnificently useful, huge fun,
and unbelievably rewarding.
• Some of the major issues involved in building a data warehouse are
discussed below:
• General Issues: It includes but is not limited to following issues:
• What kind of analysis do the business users want to perform?
• Do you currently collect the data required to support that analysis?
• How clean is data?
• Are there multiple sources for similar data?
• What structure is best for the core data warehouse (i.e., dimensional
or relational)?
• Technical Issues: It includes but is not limited to following issues
• How much data are you going to ship around your network, and will
it be able to cope?
• How much disk space will be needed?
• How fast does the disk storage need to be?
• Are you going to use SSDs to store “hot” data (i.e., frequently
accessed information)?
• What database and data management technology expertise already
exists within the company?
• Cultural Issues: It includes but is not limited to following issues
• How do data definitions differ between your operational systems?
Different departments and business units often use their own
definitions of terms like “customer,” “sale” and “order” within
systems.
• So you’ll need to standardize the definitions and add prefixes such
as “all sales,” “recent sales,” “commercial sales” and so on.
• What’s the process for gathering business requirements? Some
people will not want to spend time for you. Instead, they will
expect you to use your telepathic powers to divine their
warehousing and data analysis needs.
