Chapter 1
Data Warehousing and Data Mining
November 13, 2021
• In other words, we can say that data mining is the procedure of mining
knowledge from data.
• The information or knowledge so extracted can be used for any of the
following applications −
• Market Analysis
• Fraud Detection
• Customer Retention
• Production Control
• Science Exploration
• Data Mining - Systems
• There is a large variety of data mining systems available. Data mining
systems may integrate techniques from the following −
• 4. Association Rules:
• 6. Sequential Patterns:
• Sequential pattern mining is a data mining technique specialized
for evaluating sequential data to discover sequential patterns.
• 7. Prediction:
• Prediction uses a combination of other data mining techniques such
as trends, clustering, classification, etc.
• Data Mining Architecture
• The significant components of data mining systems are a data source,
data mining engine, data warehouse server, the pattern evaluation
module, graphical user interface, and knowledge base.
1. Concept/class description: Characterization and
discrimination
– Given one or more classes and the data that belong to them, describe
each class by observing its members.
• Data characterisation is a summarisation of the general features of
objects in a target class, and produces what are called characteristic
rules.
• Data discrimination is a description produced by comparing the
target class with one or more contrasting classes.
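As a rough illustration of the two ideas above, the sketch below summarises a target class (characterisation) and then compares that summary with a contrasting class (discrimination). The records, the field choices, and the function names `characterize` and `discriminate` are hypothetical and not taken from the slides.

```python
from statistics import mean

# Hypothetical customer records: (class label, age, yearly spend).
records = [
    ("frequent", 34, 1200.0),
    ("frequent", 41, 1500.0),
    ("frequent", 38, 1350.0),
    ("occasional", 23, 200.0),
    ("occasional", 29, 350.0),
]

def characterize(rows, label):
    """Summarise the general features of one target class."""
    members = [r for r in rows if r[0] == label]
    return {
        "count": len(members),
        "mean_age": mean(r[1] for r in members),
        "mean_spend": mean(r[2] for r in members),
    }

def discriminate(rows, target, contrast):
    """Compare the target class summary with a contrasting class."""
    t = characterize(rows, target)
    c = characterize(rows, contrast)
    return {key: t[key] - c[key] for key in ("mean_age", "mean_spend")}

print(characterize(records, "frequent"))
print(discriminate(records, "frequent", "occasional"))
```

Real characterisation usually produces rules or generalised relations rather than plain averages; the averages here only stand in for "general features".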
2. Association Analysis
Association analysis is based on association rules. It studies the
frequency of items occurring together in transactional databases and,
based on a threshold called support, identifies the frequent item sets.
Another threshold, confidence, which is the conditional probability
that an item appears in a transaction when another item appears, is
used to pinpoint association rules.
Association analysis is commonly used for market basket analysis.
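The two thresholds can be made concrete with a small sketch. It assumes support is measured as the fraction of transactions containing an item set, and confidence as support of the combined item set divided by support of the antecedent. The transactions and the function names `support` and `confidence` are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, txns):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(antecedent, consequent, txns):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, txns) / support(antecedent, txns)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # about 0.667
```

A rule such as bread → milk would be kept only if both values exceed the chosen minimum support and minimum confidence.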
3. Classification and Prediction
– Classification is the process of finding a set of models (or functions) that
describe and distinguish data classes or concepts, so that the model can be
used to predict the class of objects whose class label is unknown.
– Prediction is the process of predicting missing or unavailable data values
rather than class labels.
4. Cluster analysis
• Similar to classification, clustering is the organisation of data into
classes.
• Unlike classification, it places data elements into related
groups without advance knowledge of the group definitions, i.e. class
labels are unknown and it is up to the clustering algorithm to
discover acceptable classes.
• Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels.
• There are many clustering approaches, all based on the principle of
maximising the similarity between objects in the same class (intra-class
similarity) and minimising the similarity between objects of different
classes (inter-class similarity).
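One common approach that follows this principle is k-means, which the slides do not name; it is used here only as an example. The sketch below is a minimal one-dimensional version with caller-supplied starting centroids, and the data and the function name `kmeans_1d` are assumptions.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny k-means on 1-D data with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

ages = [21, 22, 23, 40, 42, 44]
centroids, clusters = kmeans_1d(ages, centroids=[20, 50])
print(centroids)  # [22.0, 42.0]
print(clusters)   # [[21, 22, 23], [40, 42, 44]]
```

No class labels are given to the function; the two groups emerge purely from how close the values are to one another.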
5. Outlier analysis
– A database may contain data objects that do not comply with the
general behavior or model of the data.
– These data objects are outliers.
– Outliers are usually treated as noise or exceptions in
many data mining applications.
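A simple way to flag such objects, assuming the "general model" is just the mean and standard deviation of a numeric attribute, is a z-score test. This is one of many possible techniques, not one prescribed by the slides; the threshold of 2.0, the sample amounts, and the function name `zscore_outliers` are illustrative choices.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one unusual value.
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 500]
print(zscore_outliers(amounts))  # [500]
```

In fraud detection the flagged value is the interesting result rather than noise, which is why outliers are not always discarded.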
6. Trend and evolution analysis
– Describes and models regularities or trends for objects whose
behavior changes over time.
– It includes regression analysis, sequential pattern mining,
periodicity analysis, and similarity-based analysis.
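As a small illustration of modeling a trend over time, the sketch below smooths a series with a moving average. The slides do not mention this technique specifically; it is chosen here because it is short. The sales figures and the function name `moving_average` are hypothetical.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values to smooth the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical monthly sales figures.
monthly_sales = [10, 12, 11, 15, 18, 17, 21]
trend = moving_average(monthly_sales, window=3)
print(trend)
```

The smoothed values rise steadily even though the raw series dips in places, which is the kind of regularity trend analysis tries to expose.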
• Data Mining System Classification
• A data mining system can be classified according to the following
criteria −
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines
• Apart from these, a data mining system can also be classified based
on the kind of (a) databases mined, (b) knowledge mined, (c)
techniques utilized, and (d) applications adapted.
• Classification Based on the Databases Mined
• We can classify a data mining system according to the kind of databases mined.
Database systems can be classified according to different criteria such as data
models, types of data, etc., and the data mining system can be classified
accordingly.
• For example, if we classify a database according to the data model, then we may
have a relational, transactional, object-relational, or data warehouse mining
system.
• Classification Based on the kind of Knowledge Mined
• We can classify a data mining system according to the kind of knowledge mined. It
means the data mining system is classified on the basis of functionalities such as −
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
• Classification Based on the Techniques Utilized
• We can classify a data mining system according to the kind of
techniques used.
• We can describe these techniques according to the degree of user
interaction involved or the methods of analysis employed.
[Figure: Databases, Data Integration, Data Cleaning, Task-relevant Data, Data Mining]
Data Mining in Business Intelligence
[Figure: increasing potential to support business decisions. Recoverable labels: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]
• Apart from these, data mining can also be used in the areas of
production control, customer retention, science exploration,
sports, astrology, and Internet Web Technology.
• Market Analysis and Management
• Listed below are the various fields of the market where data mining is
used −
• Customer Profiling − Data mining helps determine what kind of
people buy what kind of products.
• Identifying Customer Requirements − Data mining helps in
identifying the best products for different customers. It uses
prediction to find the factors that may attract new customers.
• Cross Market Analysis − Data mining finds
associations and correlations between product sales.
• Determining Customer Purchasing Patterns − Data mining helps in
determining customer purchasing patterns.
• Providing Summary Information − Data mining provides
various multidimensional summary reports.
• Corporate Analysis and Risk Management
• Data mining is also used in the fields of credit card services and
telecommunication to detect fraud.
• For fraudulent telephone calls, it helps to identify the destination of the
call, the duration of the call, the time of day or week, etc.
• It also analyzes patterns that deviate from expected norms.
• Classification and Prediction
• Classification is the process of finding a model that describes the
data classes or concepts.
• The purpose is to be able to use this model to predict the class of
objects whose class label is unknown.
• This derived model is based on the analysis of sets of training
data.
• The derived model can be presented in the following forms −
• Classification (IF-THEN) Rules
• Decision Trees
• Mathematical Formulae
• Neural Networks
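To show what the first of these forms might look like in practice, the sketch below writes a derived model as an ordered list of IF-THEN rules and applies it to objects whose class label is unknown. The rules, attribute names, class labels, and the function name `classify` are all hypothetical; in a real system the rules would be learned from training data rather than written by hand.

```python
# Hypothetical rule set, written by hand for illustration only.
# Each rule is (condition, class label); rules are tried in order.
rules = [
    (lambda c: c["income"] == "high" and c["age"] < 30, "buys"),
    (lambda c: c["income"] == "low", "does_not_buy"),
]
DEFAULT_CLASS = "does_not_buy"  # used when no rule fires

def classify(customer):
    """Return the label of the first rule whose IF part matches."""
    for condition, label in rules:
        if condition(customer):
            return label
    return DEFAULT_CLASS

print(classify({"income": "high", "age": 25}))  # buys
print(classify({"income": "high", "age": 45}))  # does_not_buy (default)
```

The same model could equally be drawn as a decision tree, with each rule corresponding to one path from the root to a leaf.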
• The list of functions involved in these processes is as follows −
• Classification − It predicts the class of objects whose class label is
unknown.
• Its objective is to find a derived model that describes and
distinguishes data classes or concepts.
• The derived model is based on data objects whose class labels
are known.
• Prediction − It is used to predict missing or unavailable numerical
data values rather than class labels. Regression Analysis is
generally used for prediction.
• Outlier Analysis − Outliers may be defined as the data objects
that do not comply with the general model of the data available.
• Evolution Analysis − Evolution analysis refers to the description
and modeling of regularities or trends for objects whose behavior changes over time.
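The Prediction item above mentions regression analysis. A minimal sketch of that idea, assuming a single numeric input and an ordinary least-squares straight-line fit, is given below. The data and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: years of experience and salary (in thousands).
years_experience = [1, 2, 3, 4, 5]
salary_k = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years_experience, salary_k)
predicted = slope * 6 + intercept  # predict the unavailable value for 6 years
print(slope, intercept, predicted)  # 5.0 25.0 55.0
```

Unlike classification, the output here is a numeric value rather than a class label.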
Challenges in Data Mining
Efficiency and scalability of data mining algorithms
Comparison of OLAP (data warehouse) and OLTP (operational database), rows 9 to 13:
9. OLAP − It provides a summarized and multidimensional view of data. OLTP − It provides a detailed and flat relational view of data.
10. OLAP − The number of users is in hundreds. OLTP − The number of users is in thousands.
11. OLAP − The number of records accessed is in millions. OLTP − The number of records accessed is in tens.
12. OLAP − The database size is from 100 GB to 100 TB. OLTP − The database size is from 100 MB to 100 GB.
13. OLAP − These systems are highly flexible. OLTP − It provides high performance.
• Advantages of OLTP:
• It allows more than one user to access and change the same data
simultaneously.
• Therefore, it requires concurrency control and recovery techniques
to avoid unexpected situations.
• OLTP system data are not suitable for decision making. You have to
use data from OLAP systems for “what if” analysis or decision
making.