
Wollega University

Chapter One
Data Warehousing and Data Mining
November 13, 2021
Chapter 1: Overview

• Brief description of data mining
• Data warehousing, data mining, and database technology
• Online transaction processing and data mining
• Data Warehousing – Overview
• The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data.
• This data helps analysts make informed decisions in an organization.
• An operational database undergoes frequent changes on a daily basis on account of the transactions that take place.
• Suppose a business executive wants to analyze previous feedback on data such as a product, a supplier, or any consumer data; the executive will have no data available to analyze, because the previous data has been updated by ongoing transactions.
• A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this view, a data warehouse also provides Online Analytical Processing (OLAP) tools.
• These tools help in the interactive and effective analysis of data in a multidimensional space.
• This analysis results in data generalization and data mining.
• Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction.
• That is why the data warehouse has become an important platform for data analysis and online analytical processing.
• Understanding a Data Warehouse
• A data warehouse is a database that is kept separate from the organization's operational database.
• There is no frequent updating done in a data warehouse.
• It possesses consolidated historical data, which helps the organization analyze its business.
• A data warehouse helps executives organize, understand, and use their data to make strategic decisions.
• Data warehouse systems help in the integration of application
systems.
• A data warehouse system helps in consolidated historical data
analysis.
• Why a Data Warehouse is Separated from Operational Databases
• A data warehouse is kept separate from operational databases for the following reasons −
• An operational database is constructed for well-known tasks and workloads such as searching particular records, indexing, etc.
• Operational databases support the concurrent processing of multiple transactions.
• Concurrency control and recovery mechanisms are required for operational databases to ensure robustness and consistency of the database.
• An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
• An operational database maintains current data; a data warehouse maintains historical data.
Data Mining - Overview

• There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information, so it is necessary to analyze this huge amount of data and extract useful information from it.
• Extraction of information is not the only process we need to perform.
• Data mining also involves other processes such as Data Cleaning, Data Integration, Data Transformation, Pattern Evaluation, and Data Presentation.
• Once all these processes are over, we can use this information in many applications such as Fraud Detection, Market Analysis, Production Control, Science Exploration, etc.
• What is Data Mining?

• Data Mining is defined as extracting information from huge sets of data.


Or
• Data mining can be defined as “the process of discovering interesting
knowledge from large amounts of data stored in databases, data
warehouses, or other information repositories.”

• In other words, we can say that data mining is the procedure of mining knowledge from data.
• The information or knowledge extracted in this way can be used for any of the following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
• Data Mining - Systems
• There is a large variety of data mining systems available. Data mining
systems may integrate techniques from the following −

• Spatial Data Analysis


• Information Retrieval
• Pattern Recognition
• Image Analysis
• Signal Processing
• Computer Graphics
• Web Technology
• Business
• Bioinformatics
• Data Mining Techniques

• Data mining uses sophisticated data analysis tools to find previously unknown, valid patterns and relationships in huge data sets.
• These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms such as neural networks or decision trees.
• In recent data mining projects, various major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.
• 1. Classification:
• This technique is used to obtain important and relevant information about data and metadata. It helps to classify data into different classes.
• Data mining techniques can themselves be classified by different criteria, as follows:
• Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
• Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example object-oriented, transactional, or relational databases.
• Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example classification, clustering, characterization, etc.
• Classification of data mining frameworks according to the data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine learning, genetic algorithms, visualization, statistics, or data warehouse-oriented or database-oriented approaches.
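As a concrete (if toy) illustration of the classification technique, the sketch below implements a one-nearest-neighbour classifier in Python; the customer features and segment labels are invented for the example:

```python
# Minimal 1-nearest-neighbour classifier: assign each new record the
# class label of its closest labelled training example.
import math

def classify(point, training_data):
    """Return the label of the training example nearest to `point`."""
    nearest = min(training_data, key=lambda ex: math.dist(point, ex[0]))
    return nearest[1]

# (feature vector, class label) pairs -- e.g. (age, income) -> segment
training = [((25, 30000), "budget"),
            ((45, 90000), "premium"),
            ((30, 40000), "budget"),
            ((50, 120000), "premium")]

print(classify((28, 35000), training))   # a new customer near the "budget" group
```

Real systems use richer models (decision trees, neural networks), but the idea is the same: learn class boundaries from labelled data, then predict labels for unseen records.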
• 2. Clustering:
• Clustering is a division of information into groups of connected objects.
• In other words, clustering analysis is a data mining technique for identifying similar data. It helps to recognize the differences and similarities between data.
• Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
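The grouping-by-similarity idea can be sketched with a bare-bones k-means loop; this is an illustrative toy, not a production algorithm, and the points are made up:

```python
# Bare-bones k-means: repeatedly assign points to the nearest centroid,
# then move each centroid to the mean of its assigned points.
import math

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = {i: [] for i in range(len(centroids))}
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # update each centroid to the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(coord) / len(pts) for coord in zip(*pts)) if pts
            else centroids[i]
            for i, pts in clusters.items()
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(clusters)  # two groups: points near (1, 1) and points near (8.5, 9)
```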
• 3. Regression:
• Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors.
• Regression is primarily a form of planning and modeling.
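A minimal example of regression as modeling: fitting a straight line with the closed-form least-squares formulas (the spend-versus-sales numbers are hypothetical):

```python
# Simple least-squares linear regression, y = slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# advertising spend vs. sales (invented numbers)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 0.0 -- sales rise by 2 units per unit of spend
```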

• 4. Association Rules:

• This data mining technique helps to discover a link between two or more items and finds hidden patterns in the data set.
• Association rules are if-then statements that express the probability of interactions between data items within large data sets in different types of databases.
• Association rule mining has several applications and is commonly used to help find sales correlations in transactional or medical data sets.
• There are three major measurement techniques:

• Support: Measures how often the items are purchased together, relative to the overall dataset.
                  Support(A → B) = (transactions containing both A and B) / (entire dataset)

• Confidence: Measures how often item B is purchased when item A is purchased as well.
                  Confidence(A → B) = (transactions containing both A and B) / (transactions containing A)

• Lift: Measures the strength of the rule by comparing its confidence with how often item B is purchased overall.
                  Lift(A → B) = Confidence(A → B) / Support(B)
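The three measures can be computed directly from a transaction list; the market-basket data below is invented for illustration:

```python
# Support, confidence, and lift for a rule A -> B over a small set of
# made-up market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    return support(a | b) / support(a)

def lift(a, b):
    return confidence(a, b) / support(b)

A, B = {"bread"}, {"milk"}
print(support(A | B))    # 3/5 of transactions contain both bread and milk
print(confidence(A, B))  # 3/4 of bread transactions also contain milk
print(lift(A, B))        # confidence divided by milk's overall support
```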
• 5. Outlier detection:
• This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior.
• This technique may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining.
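One simple way to realize outlier detection is a z-score rule; this sketch (with made-up transaction amounts) flags values that lie far from the mean:

```python
# Flag outliers with a z-score rule: a value is suspicious when it lies
# more than `threshold` standard deviations from the mean.
import statistics

def zscore_outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# daily transaction amounts; 980 is far outside the normal range
amounts = [52, 48, 50, 51, 49, 980, 47, 53]
print(zscore_outliers(amounts))  # [980]
```

Fraud-detection systems use far more robust statistics, but this shows the core idea: model "normal" behavior, then surface what deviates from it.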

• 6. Sequential Patterns:
• The sequential pattern is a data mining technique specialized
for evaluating sequential data to discover sequential patterns.
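A minimal sketch of sequential pattern mining: count consecutive item pairs across purchase sequences and keep those meeting a minimum support count (the sequences are hypothetical):

```python
# Count how often each consecutive pair of items appears across
# customers' purchase sequences, then keep frequent pairs.
from collections import Counter

sequences = [
    ["laptop", "mouse", "keyboard"],
    ["laptop", "mouse", "webcam"],
    ["phone", "case"],
    ["laptop", "mouse"],
]

pair_counts = Counter(
    (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
)
min_support = 2
frequent = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent)  # {('laptop', 'mouse'): 3}
```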

• 7. Prediction:
• Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc.
• Data Mining Architecture
• The significant components of a data mining system are the data source, the database or data warehouse server, the data mining engine, the pattern evaluation module, the graphical user interface, and the knowledge base.

Data Mining System Architecture


• Data Source:
• The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. A huge amount of historical data is needed for data mining to be successful.
• Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases. Another primary source of data is the World Wide Web, or the internet.
• Different processes:
• Before passing the data to the database or data warehouse server, the
data must be cleaned, integrated, and selected.
• As the information comes from various sources and in different
formats, it can't be used directly for the data mining procedure
because the data may not be complete and accurate.
• So, the data first needs to be cleaned and unified.
• These procedures are not as easy as they sound. Several methods may be performed on the data as part of selection, integration, and cleaning.
• Database or Data Warehouse Server:
• The database or data warehouse server contains the original data that is ready to be processed.
• The server is responsible for retrieving the relevant data, based on the user's data mining request.
• Data Mining Engine:
• The data mining engine is a major component of any data mining
system.
• It contains several modules for operating data mining tasks, including
association, characterization, classification, clustering, prediction,
time-series analysis, etc.
• Pattern Evaluation Module:
• The pattern evaluation module is primarily responsible for measuring the interestingness of a pattern using a threshold value.
• It collaborates with the data mining engine to focus the search on interesting patterns.
• Graphical User Interface:
• The graphical user interface (GUI) module communicates between
the data mining system and the user.
• This module helps the user to easily and efficiently use the system
without knowing the complexity of the process.
• This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.
• Knowledge Base:
• The knowledge base is helpful in the entire process of data mining. It
might be helpful to guide the search result patterns.
• The knowledge base may contain user views and data from user
experiences that might be helpful in the data mining process.
• The data mining engine may receive inputs from the knowledge base
to make the result more accurate and reliable.
Data Mining Functionalities

 There are different kinds of data mining functionalities that can be used to extract various types of patterns from data:
– Concept/class description: characterization and discrimination
– Association analysis
– Classification and prediction
– Clustering analysis
– Outlier analysis
– Evolution analysis

1. Concept/class description: Characterization and
discrimination
– Given a class/classes with data that belongs to the class, describe
the class by making observation of its members.
• Data characterisation is a summarisation of general features of
objects in a target class, and produces what is called characteristic
rules
• Data discrimination is description made by making comparative
analysis between the target class with the other comparative class
(contrasting classes)
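A toy sketch of characterization and discrimination: summarize the general features of a target class, then contrast them with a comparative class (all records are invented):

```python
# Characterization: summarize average features of a target class.
# Discrimination: compare the target class against a contrasting class.
import statistics

customers = [
    {"segment": "big_spender", "age": 42, "monthly_spend": 900},
    {"segment": "big_spender", "age": 38, "monthly_spend": 1100},
    {"segment": "occasional",  "age": 23, "monthly_spend": 60},
    {"segment": "occasional",  "age": 27, "monthly_spend": 90},
]

def characterize(segment):
    rows = [c for c in customers if c["segment"] == segment]
    return {
        "avg_age": statistics.mean(r["age"] for r in rows),
        "avg_spend": statistics.mean(r["monthly_spend"] for r in rows),
    }

target = characterize("big_spender")    # basis for a characteristic rule
contrast = characterize("occasional")   # the contrasting class
print(target, contrast)                 # discrimination compares the two summaries
```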
2. Association Analysis
Association analysis is based on association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets.
Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules.
Association analysis is commonly used for market basket analysis.
3. Classification and Prediction
– Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown.
– Prediction is the process of predicting some missing or unavailable data values rather than class labels.
4. Cluster analysis
• Similar to classification, clustering is the organisation of data in
classes.
• Unlike classification, it is used to place data elements into related groups without advance knowledge of the group definitions, i.e. class labels are unknown and it is up to the clustering algorithm to discover acceptable classes.
• Clustering is also called unsupervised classification, because the classification is not dictated by given class labels.
• There are many clustering approaches, all based on the principle of maximising the similarity between objects in the same class (intra-class similarity) and minimising the similarity between objects of different classes (inter-class similarity).
5. Outlier analysis
– Database may contain data object that do not comply with the
general behavior or model of the data.
– These data objects are outliers.
– Usually outlier data items are considered as noise or exception in
many data mining applications
6. Trend and evolution analysis
– Describe and model regularities or trends for objects whose
behavior changes over time.
– It is also referred to as regression analysis, sequential pattern mining, periodicity analysis, or similarity-based analysis.
• Data Mining System Classification
• A data mining system can be classified according to the following
criteria −
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
• Apart from these, a data mining system can also be classified based
on the kind of (a) databases mined, (b) knowledge mined, (c)
techniques utilized, and (d) applications adapted.
• Classification Based on the Databases Mined
• We can classify a data mining system according to the kind of databases mined.
Database system can be classified according to different criteria such as data
models, types of data, etc. And the data mining system can be classified
accordingly.
• For example, if we classify a database according to the data model, then we may
have a relational, transactional, object-relational, or data warehouse mining
system.
• Classification Based on the kind of Knowledge Mined
• We can classify a data mining system according to the kind of knowledge mined. It
means the data mining system is classified on the basis of functionalities such as −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
• Classification Based on the Techniques Utilized
• We can classify a data mining system according to the kind of
techniques used.
• We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.

• Classification Based on the Applications Adapted


• We can classify a data mining system according to the applications
adapted. These applications are as follows −
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
• Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web
databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
• Data Mining - Terminologies
• Data Integration
• Data Integration is a data preprocessing technique that merges
the data from multiple heterogeneous data sources into a
coherent data store.
• Data integration may involve inconsistent data and therefore
needs data cleaning.
• Data Cleaning
• Data cleaning is a technique that is applied to remove the noisy
data and correct the inconsistencies in data.
• Data cleaning involves transformations to correct the wrong data.
• Data cleaning is performed as a data preprocessing step while
preparing the data for a data warehouse.
• Data Selection
• Data Selection is the process where data relevant to the analysis
task are retrieved from the database.
• Sometimes data transformation and consolidation are performed
before the data selection process.
• Clusters
• A cluster refers to a group of similar objects.
• Cluster analysis refers to forming groups of objects that are very similar to each other but are highly different from the objects in other clusters.
• Data Transformation
• In this step, data is transformed or consolidated into forms
appropriate for mining, by performing summary or aggregation
operations.
• What is Knowledge Discovery?
• Data mining is an essential step in the process of knowledge discovery.
• Here is the list of steps involved in the knowledge discovery process −
• Data Cleaning − In this step, the noise and inconsistent data is
removed.
• Data Integration − In this step, multiple data sources are combined.
• Data Selection − In this step, data relevant to the analysis task are
retrieved from the database.
• Data Transformation − In this step, data is transformed or consolidated
into forms appropriate for mining by performing summary or
aggregation operations.
• Data Mining − In this step, intelligent methods are applied in order to
extract data patterns.
• Pattern Evaluation − In this step, data patterns are evaluated.
• Knowledge Presentation − In this step, knowledge is represented.
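The knowledge discovery steps above can be sketched as a tiny pipeline over made-up records:

```python
# KDD steps as a toy pipeline: clean -> integrate -> select -> transform -> mine.
sales_db = [("milk", 4), ("bread", 2), (None, 9), ("milk", 3)]
web_logs = [("bread", 1), ("milk", 2)]

# Data Cleaning: drop noisy/incomplete records
cleaned = [r for r in sales_db if r[0] is not None]

# Data Integration: combine the two sources
integrated = cleaned + web_logs

# Data Selection: keep only records relevant to the analysis task
selected = [r for r in integrated if r[0] == "milk"]

# Data Transformation: aggregate into a form suitable for mining
total_milk = sum(qty for _, qty in selected)

# Data Mining / Pattern Evaluation: apply a (trivial) threshold rule
pattern = "milk is a high-volume item" if total_milk >= 5 else "no pattern"
print(total_milk, pattern)
```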
A KDD Process
• Data mining is the core of knowledge discovery in databases.
• (Figure) The KDD pipeline: Databases → Data Integration → Data Cleaning → Data Warehouse → Selection & Transformation → Task-relevant Data → Data Mining → Pattern Evaluation
Data Mining in Business Intelligence

(Figure) Layers of a business intelligence stack, with increasing potential to support business decisions toward the top, and the typical user of each layer:
• Decision Making — End User
• Data Presentation (Visualization Techniques) — Business Analyst
• Data Mining (Information Discovery) — Data Analyst
• Data Exploration (Statistical Summary, Querying, and Reporting) — Data Analyst
• Data Preprocessing/Integration, Data Warehouses — DBA
• Data Sources (Paper, Files, Web documents, Scientific experiments, Database Systems) — DBA
• Data Mining Applications

• Data mining is highly useful in the following domains −


• Market Analysis and Management
• Corporate Analysis & Risk Management
• Fraud Detection

• Apart from these, data mining can also be used in the areas of
production control, customer retention, science exploration,
sports, astrology, and Internet Web Technology.
• Market Analysis and Management
• Listed below are the various fields of market where data mining is
used
• Customer Profiling − Data mining helps determine what kind of
people buy what kind of products.
• Identifying Customer Requirements − Data mining helps in
identifying the best products for different customers. It uses
prediction to find the factors that may attract new customers.
• Cross Market Analysis − Data mining identifies associations/correlations between product sales.
• Determining Customer Purchasing Patterns − Data mining helps in determining customer purchasing patterns.
• Providing Summary Information − Data mining provides us
various multidimensional summary reports.
• Corporate Analysis and Risk Management

• Data mining is used in the following fields of the Corporate Sector.

• Finance Planning and Asset Evaluation − It involves cash flow


analysis and prediction, contingent claim analysis to evaluate
assets.
• Resource Planning − It involves summarizing and comparing the
resources and spending.
• Competition − It involves monitoring competitors and market
directions.
• Fraud Detection

• Data mining is also used in the fields of credit card services and
telecommunication to detect frauds.
• For fraudulent telephone calls, it helps to find the destination of the call, its duration, the time of day or week, etc.
• It also analyzes the patterns that deviate from expected norms.
• Classification and Prediction
• Classification is the process of finding a model that describes the
data classes or concepts.
• The purpose is to be able to use this model to predict the class of
objects whose class label is unknown.
• This derived model is based on the analysis of sets of training
data.
• The derived model can be presented in the following forms −
• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
• The list of functions involved in these processes is as follows −
• Classification − It predicts the class of objects whose class label is
unknown.
• Its objective is to find a derived model that describes and
distinguishes data classes or concepts.
• The derived model is based on data objects whose class labels are well known.
• Prediction − It is used to predict missing or unavailable numerical
data values rather than class labels. Regression Analysis is
generally used for prediction.
• Outlier Analysis − Outliers may be defined as the data objects
that do not comply with the general model of the data available.
• Evolution Analysis − Evolution analysis refers to the description
for objects whose behavior changes over time.
Challenges in Data Mining
 Efficiency and scalability of data mining algorithms

 Parallel, distributed, stream, and incremental mining methods


 Handling high-dimensionality
 Handling noise, uncertainty, and incompleteness of data

 Incorporation of constraints, expert knowledge, and background


knowledge
 Pattern evaluation and knowledge integration
• Data Mining and Database Technology
• We can classify a data mining system according to the kind of
databases mined.
• Database system can be classified according to different criteria such
as data models, types of data, etc. And the data mining system can be
classified accordingly.
• For example, if we classify a database according to the data model,
then we may have a relational, transactional, object-relational, or
data warehouse mining system.
• Transaction Databases:
• A transaction database is a set of records representing transactions, each with a time stamp, an identifier, and a set of items. Associated with the transaction files there may also be descriptive data for the items.
• For example, in the case of a video store, the rentals table represents the transaction database. Each record is a rental contract with a customer identifier, a date, and the list of items rented (i.e. video tapes, games, VCR, etc.).
• Transactions are usually stored in flat files or in two normalized transaction tables, one for the transactions and one for the transaction items.
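The two-table normalized layout described above can be sketched with Python's standard-library sqlite3 module (the table and column names are illustrative, not a prescribed schema):

```python
# Two normalized transaction tables: one for transactions, one for items.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE transactions (
        txn_id   INTEGER PRIMARY KEY,
        customer TEXT,
        txn_date TEXT
    );
    CREATE TABLE transaction_items (
        txn_id INTEGER REFERENCES transactions(txn_id),
        item   TEXT
    );
""")
con.execute("INSERT INTO transactions VALUES (1, 'cust-42', '2021-11-13')")
con.executemany("INSERT INTO transaction_items VALUES (?, ?)",
                [(1, "video tape"), (1, "game"), (1, "VCR")])

# Reassemble one rental contract by joining the two tables
rows = con.execute("""
    SELECT t.customer, t.txn_date, i.item
    FROM transactions t JOIN transaction_items i ON t.txn_id = i.txn_id
""").fetchall()
print(rows)
```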
• Multimedia Databases:

• Multimedia databases include video, images, audio and text media.


• They can be stored on extended object-relational or object-oriented
databases, or simply on a file system.
• Multimedia is characterized by its high dimensionality, which makes
data mining even more challenging.
• Data mining from multimedia repositories may require computer
vision, computer graphics, image interpretation, and natural language
processing methodologies.
• Spatial Databases:
• Spatial databases are databases that, in addition to usual data, store
geographical information like maps, and global or regional
positioning.
• Such spatial databases present new challenges to data mining
algorithms.
• Time-Series Databases:
• Time-series databases contain time-related data such as stock market data or logged activities.
• These databases usually have a continuous flow of new data coming
in, which sometimes causes the need for a challenging real time
analysis.
• Data mining in such databases commonly includes the study of trends
and correlations between evolutions of different variables, as well as
the prediction of trends and movements of the variables in time.
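A minimal example of trend analysis over a time series: a 3-point moving average smooths (invented) daily prices so the upward trend becomes visible:

```python
# Smooth a time series with a simple 3-point moving average.
def moving_average(series, window=3):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

prices = [10, 12, 11, 13, 15, 14, 16]  # hypothetical daily closing prices
print(moving_average(prices))  # [11.0, 12.0, 13.0, 14.0, 15.0] -- a rising trend
```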
• What is OLTP? (Online Transaction Processing and Data Mining)

• OLTP is an operational system that supports transaction-oriented applications in a 3-tier architecture. It administers the day-to-day transactions of an organization.
• OLTP is basically focused on query processing and on maintaining data integrity in multi-access environments, with effectiveness measured by the total number of transactions per second.
• The full form of OLTP is Online Transaction Processing.
• Architecture of OLTP
• Business / Enterprise Strategy: Enterprise strategy deals with the issues that
affect the organization as a whole. In OLTP, it is typically developed at a high
level within the firm, by the board of directors or the top management
• Business Process: OLTP business process is a set of activities and tasks that,
once completed, will accomplish an organizational goal.
• Customers, Orders, and Products: OLTP database store information about
products, orders (transactions), customers (buyers), suppliers (sellers), and
employees.
• ETL Processes: ETL extracts the data from various RDBMS source systems, then transforms the data and loads the processed data into the data warehouse system.
• Data Mart and Data warehouse: A data mart is a structure/access pattern
specific to data warehouse environments. It is used by OLAP to store
processed data.
• Data Mining, Analytics, and Decision Making: Data stored in the data mart
and data warehouse can be used for data mining, analytics, and decision
making.
Data Warehouse (OLAP) vs. Operational Database (OLTP)

1. OLAP involves historical processing of information; OLTP involves day-to-day processing.
2. OLAP systems are used by knowledge workers such as executives, managers, and analysts; OLTP systems are used by clerks, DBAs, or database professionals.
3. OLAP is used to analyze the business; OLTP is used to run the business.
4. OLAP focuses on information out; OLTP focuses on data in.
5. OLAP is based on the Star, Snowflake, and Fact Constellation schemas; OLTP is based on the Entity Relationship model.
6. OLAP is subject oriented; OLTP is application oriented.
7. OLAP contains historical data; OLTP contains current data.
8. OLAP provides summarized and consolidated data; OLTP provides primitive and highly detailed data.
9. OLAP provides a summarized, multidimensional view of data; OLTP provides a detailed, flat relational view of data.
10. The number of OLAP users is in the hundreds; the number of OLTP users is in the thousands.
11. The number of records accessed by an OLAP query is in the millions; by an OLTP query, in the tens.
12. An OLAP database size ranges from 100 GB to 100 TB; an OLTP database size ranges from 100 MB to 100 GB.
13. OLAP systems are highly flexible; OLTP systems provide high performance.
• Advantages of OLTP:

• Following are the benefits of an OLTP system:
• OLTP offers accurate forecasts for revenue and expenses.
• It provides a solid foundation for a stable business/organization due to timely modification of all transactions.
• OLTP makes transactions much easier on behalf of customers.
• OLTP provides support for bigger databases.
• Partitioning of data for data manipulation is easy.
• We need OLTP for tasks that are frequently performed by the system,
• when we need only a small number of records, and
• for tasks that include insertion, updating, or deletion of data.
• Disadvantages of OLTP
• Here are the drawbacks of an OLTP system:
• If the OLTP system faces hardware failures, online transactions are severely affected.
• OLTP systems allow multiple users to access and change the same data at the same time, which can sometimes create unprecedented situations.
• If the server hangs for even seconds, it can affect a large number of transactions.
• OLTP requires a lot of staff working in groups in order to maintain inventory.
• Online transaction processing systems do not have proper methods of transferring products to buyers by themselves.
• Server failure may lead to wiping out large amounts of data from the database.
• You can perform only a limited number of queries and updates.
• Challenges of an OLTP System

• An OLTP system allows more than one user to access and change the same data simultaneously.
• Therefore, it requires concurrency control and recovery techniques in order to avoid any unprecedented situations.
• OLTP data are not suitable for decision making; you have to use OLAP system data for "what if" analysis or decision making.
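The concurrency-control point can be illustrated with a small Python sketch: several threads updating the same balance race unless the read-modify-write is protected by a lock (a stand-in for a DBMS's locking mechanism):

```python
# Several threads increment a shared balance; the lock ensures each
# read-modify-write completes atomically, so no updates are lost.
import threading

balance = 0
lock = threading.Lock()

def deposit(times):
    global balance
    for _ in range(times):
        with lock:  # without this lock, concurrent updates can be lost
            balance += 1

threads = [threading.Thread(target=deposit, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance)  # 40000 -- the lock keeps the total consistent
```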
