UNIT-04: Introduction To Data Mining: Data Mining Techniques KDD Process Association Rules.

This document discusses various data mining techniques including definition, predictive modeling, classification, regression, time series analysis, prediction, descriptive modeling, clustering, summarization, and association rules. It provides examples for each technique to illustrate how they are used in applications such as credit risk analysis, airport security screening, savings prediction, customer catalog targeting, university rankings, and grocery store sales analysis.

Uploaded by

Kuntal Gupta


UNIT-04

INTRODUCTION TO DATA MINING:

• Definition
• Data mining Techniques
• KDD Process
• Association rules.
• (http://index-of.co.uk/Data-Mining/Dunham%20-%20Data%20Mining.pdf)
Definition
• In simple words, data mining is a process used to extract usable
information from a larger set of raw data. It involves analyzing patterns in
large batches of data using one or more software tools.
• It is one of the steps in the KDD process.
• Data mining is often defined as finding hidden information in a database.
Alternatively, it has been called exploratory data analysis, data driven
discovery, and deductive learning.
Predictive Model
• A predictive model makes a prediction about values of data using known
results found from different data.
• A prediction may be based on other historical data, not only on the record
in question. For example, use of a credit card might be refused not because of
the user's own credit history, but because the current purchase is similar to
earlier purchases that were subsequently found to be made with stolen cards.
• Example 1.1 uses predictive modeling to predict the credit risk.
Predictive model/Supervised Techniques tasks
Predictive model data mining tasks include:
• Classification.
• Regression.
• Time series analysis.
• Prediction
Classification
• Classification maps data into predefined groups or classes.
• It is often referred to as supervised learning because the classes are determined before
examining the data.
• Two examples of classification applications are determining whether to make a bank loan
and identifying credit risks. Classification algorithms require that the classes be defined
based on data attribute values.
• They often describe these classes by looking at the characteristics of data already known to
belong to the classes.
• Pattern recognition is a type of classification where an input pattern is classified into one of
several classes based on its similarity to these predefined classes.
Example:2
• An airport security screening station is used to determine whether
passengers are potential terrorists or criminals.
• To do this, the face of each passenger is scanned and its basic pattern
(distance between eyes, size and shape of mouth, shape of head, etc.) is
identified.
• This pattern is compared to entries in a database to see if it matches any
patterns that are associated with known offenders.
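The pattern matching in this example can be sketched as a nearest-neighbor classifier. This is only a minimal illustration: the feature vectors, the labels, and the function name classify_nearest are hypothetical and do not come from the source.

```python
import math

def classify_nearest(pattern, labeled_patterns):
    """Return the label of the stored pattern closest to `pattern`."""
    best_label, best_dist = None, math.inf
    for features, label in labeled_patterns:
        dist = math.dist(pattern, features)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical feature vectors: (eye distance, mouth width, head width) in cm.
known = [
    ((6.2, 5.0, 15.1), "known offender"),
    ((5.8, 4.6, 14.0), "no match"),
    ((6.6, 5.4, 16.0), "no match"),
]

label = classify_nearest((6.1, 5.1, 15.0), known)
print(label)
```

The classes are fixed before any new passenger is examined, which is what makes this a classification task rather than clustering.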
Regression
• Regression is used to map a data item to a real valued prediction variable. In
actuality, regression involves the learning of the function that does this
mapping.
• Regression assumes that the target data fit into some known type of function
(e.g., linear, logistic, etc.) and then determines the best function of this type
that models the given data.
• Some type of error analysis is used to determine which function is "best."
Example
• EXAMPLE 1.3 A college professor wishes to reach a certain level of
savings before her retirement. Periodically, she predicts what her retirement
savings will be based on their current value and several past values.
• She uses a simple linear regression formula to predict this value by fitting
past behavior to a linear function and then using this function to predict the
values at points in the future. Based on these values, she then alters her
investment portfolio.
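A simple linear regression of this kind can be sketched with ordinary least squares. The savings figures below are invented for illustration, and fit_line is a hypothetical helper name, not something defined in the source.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past savings (in thousands) observed at years 0..4.
years = [0, 1, 2, 3, 4]
savings = [100.0, 112.0, 119.0, 131.0, 138.0]

slope, intercept = fit_line(years, savings)
predicted_year_10 = slope * 10 + intercept  # extrapolate the fitted line
print(slope, intercept, predicted_year_10)
```

Here the "error analysis" is the sum of squared differences between the line and the observed values, which the least-squares formulas minimize.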
Time Series Analysis
• With time series analysis, the value of an attribute is examined as it varies over time. The
values are usually obtained at evenly spaced time points (hourly, daily, weekly, etc.).
• A time series plot (Figure 1.3) is used to visualize the time series.
• In this figure you can easily see that the plots for Y and Z have similar behavior, while X
appears to have less volatility.
• There are three basic functions performed in time series analysis:
• First, distance measures are used to determine the similarity between different time series.
• Second, the structure of the line is examined to determine (and perhaps classify) its behavior.
• Third, the historical time series plot is used to predict future values.
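Two of these functions, similarity and prediction, can be sketched as follows. The series values are made up to mimic the behavior described for X, Y, and Z, and the moving-average forecast is just one simple choice among many; neither is taken from the source.

```python
import math

def series_distance(a, b):
    """Euclidean distance between two equally long, evenly spaced series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def moving_average_forecast(series, window=3):
    """Predict the next value as the mean of the last `window` values."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Hypothetical series: Y and Z move together, X is much less volatile.
x = [10, 10, 11, 10, 11]
y = [10, 14, 9, 15, 8]
z = [11, 15, 10, 16, 9]

print(series_distance(y, z), series_distance(x, y))
print(moving_average_forecast(x))
```

A smaller distance means more similar behavior, so Y and Z come out closer to each other than either is to X.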
Prediction
• Many real-world data mining applications can be seen as predicting future data states based
on past and current data.
• Prediction can be viewed as a type of classification. (Note: This is a data mining task that is
different from the predictive model, although the prediction task is a type of predictive
model.)
• The difference is that prediction is predicting a future state rather than a current state.
• Here we are referring to a type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flooding, speech recognition,
machine learning, and pattern recognition. Although future values may be predicted using
time series analysis or regression techniques, other approaches may be used as well.
Descriptive model
• A descriptive model identifies patterns or relationships in data. Unlike the
predictive model, a descriptive model serves as a way to explore the
properties of the data examined, not to predict new properties. Clustering,
summarization, association rules, and sequence discovery are usually viewed
as descriptive in nature.
Descriptive/unsupervised techniques
• Clustering
• Summarization
• Association Rules.
• Sequence Pattern Discovery
Clustering
• Clustering is similar to classification except that the groups are not
predefined, but rather defined by the data alone.
• Clustering is alternatively referred to as unsupervised learning or
segmentation.
• It can be thought of as partitioning or segmenting the data into groups that
might or might not be disjoint.
• The clustering is usually accomplished by determining the similarity among
the data on predefined attributes.
• The most similar data are grouped into clusters. Example 1.6 provides a
simple clustering example. Since the clusters are not predefined, a domain
expert is often required to interpret the meaning of the created clusters.
EXAMPLE
• A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location, and physical
characteristics of potential customers (age, height, weight, etc.).
• To determine the target mailings of the various catalogs and to assist in the creation
of new, more specific catalogs, the company performs a clustering of potential
customers based on the determined attribute values.
• The results of the clustering exercise are then used by management to create special
catalogs and distribute them to the correct target population based on the cluster
for that catalog.
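The grouping step can be sketched with a small k-means routine. The source does not say which clustering algorithm the company uses, so k-means is an assumption, as are the customer records and the naive choice of the first k points as starting centroids.

```python
import math

def k_means(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical customers: (age in years, income in thousands).
customers = [(25, 30), (27, 32), (24, 28), (55, 90), (58, 95), (60, 88)]

centroids, clusters = k_means(customers, k=2)
print(clusters)
```

The algorithm returns groups but no names for them. Deciding that one group is, say, "young, lower income" is the interpretation step left to a domain expert. In practice the attributes would usually be rescaled first so that one of them does not dominate the distance.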
Summarization
• Summarization maps data into subsets with associated simple descriptions.
Summarization is also called characterization or generalization. It extracts or
derives representative information about the database. This may be
accomplished by actually retrieving portions of the data. Alternatively,
summary type information (such as the mean of some numeric attribute) can
be derived from the data. The summarization succinctly characterizes the
contents of the database. Example 1.7 illustrates this process.
EXAMPLE
• One of the many criteria used to compare universities by the U.S. News &
World Report is the average SAT or ACT score [GM99]. This is a
summarization used to estimate the type and intellectual level of the student
body.
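Deriving summary information of this kind amounts to reducing many records to one descriptive number per group. A minimal sketch, with invented university names and scores:

```python
from statistics import mean

def summarize(scores_by_university):
    """Map each university to the average of its students' scores."""
    return {name: mean(scores) for name, scores in scores_by_university.items()}

# Hypothetical SAT scores; not real data.
scores = {
    "University A": [1200, 1300, 1250],
    "University B": [1100, 1150, 1050],
}

summary = summarize(scores)
print(summary)
```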
Association Rules

• Link analysis, alternatively referred to as affinity analysis or association, refers to the
data mining task of uncovering relationships among data.
• The best example of this type of application is to determine association rules.
• An association rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community to identify items that
are frequently purchased together.
• Example 1.8 illustrates the use of association rules in market basket analysis. Here
the data analyzed consist of information about what items a customer purchases.
• Associations are also used in many other applications such as predicting the failure
of telecommunication switches.
• Users of association rules must be cautioned that these are not causal relationships.
• They do not represent any relationship inherent in the actual data (as is true with
functional dependencies) or in the real world.
• There probably is no relationship between bread and pretzels that causes them to
be purchased together. And there is no guarantee that this association will apply in
the future.
• However, association rules can be used to assist retail store management in
effective advertising, marketing, and inventory control.
EXAMPLE
• A grocery store retailer is trying to decide whether to put bread on sale.
• To help determine the impact of this decision, the retailer generates association
rules that show what other products are frequently purchased with bread.
• He finds that 60% of the time that bread is sold, pretzels are also sold, and
that 70% of the time jelly is also sold.
• Based on these facts, he tries to capitalize on the association between bread,
pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the
bread is placed.
• In addition, he decides not to place either of these items on sale at the same time.
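The percentages in this example correspond to the confidence of the rules bread → pretzels and bread → jelly. The sketch below computes support and confidence over a hypothetical set of baskets constructed to reproduce the 60% and 70% figures; the baskets themselves and the function names are assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(antecedent <= t for t in transactions)
    n_both = sum(both <= t for t in transactions)
    return n_both / n_antecedent

# Hypothetical baskets: 10 contain bread, 2 do not.
baskets = (
    [{"bread", "pretzels", "jelly"}] * 4
    + [{"bread", "pretzels"}] * 2
    + [{"bread", "jelly"}] * 3
    + [{"bread"}] * 1
    + [{"milk", "eggs"}] * 2
)

print(confidence(baskets, {"bread"}, {"pretzels"}))
print(confidence(baskets, {"bread"}, {"jelly"}))
print(support(baskets, {"bread"}))
```

Confidence only measures co-occurrence in the recorded baskets, which is why such rules cannot be read as causal.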
Sequence Discovery
• Sequential analysis or sequence discovery is used to determine sequential patterns in data.
• These patterns are based on a time sequence of actions. These patterns are similar to
associations in that data (or events) are found to be related, but the relationship is based on
time.
• Unlike a market basket analysis, which requires the items to be purchased at the same time,
in sequence discovery the items are purchased over time in some order.
• Example 1.9 illustrates the discovery of some simple patterns. A similar type of discovery
can be seen in the sequence in which items are purchased. For example, most people who
purchase CD players may be found to purchase CDs within one week. As we will see,
temporal association rules really fall into this category.
EXAMPLE
• The Webmaster at the XYZ Corp. periodically analyzes the Web log data to
determine how users access XYZ's Web pages.
• He is interested in determining what sequences of pages are frequently
accessed.
• He determines that 70 percent of the users of page A follow one of the
following patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C).
• He then determines to add a link directly from page A to page C.
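The Webmaster's analysis can be sketched by counting the exact page paths of sessions that begin at page A. The session log below is hypothetical and was built so that 70 percent of those sessions follow one of the three patterns; path_shares is an illustrative name.

```python
from collections import Counter

def path_shares(sessions, start):
    """Share of each exact page path among sessions that begin at `start`."""
    from_start = [tuple(s) for s in sessions if s and s[0] == start]
    counts = Counter(from_start)
    return {path: n / len(from_start) for path, n in counts.items()}

# Hypothetical Web log: each session is the ordered list of pages visited.
sessions = (
    [["A", "B", "C"]] * 3
    + [["A", "D", "B", "C"]] * 2
    + [["A", "E", "B", "C"]] * 2
    + [["A", "F"]] * 3
    + [["G", "C"]] * 2
)

shares = path_shares(sessions, "A")
reach_c = sum(share for path, share in shares.items() if path[-1] == "C")
print(shares)
print(reach_c)
```

Unlike the market basket case, the order of the pages matters here: a session that visits C before A would not count toward these patterns.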
DATA MINING ISSUES
1. Human interaction
2. Overfitting
3. Outliers
4. Interpretation of results
5. Visualization of results
6. Large datasets
7. High dimensionality
8. Multimedia data
9. Missing data
10. Irrelevant data
11. Noisy data
12. Changing data
KDD PROCESS
Knowledge Discovery in Databases (KDD)
• Data mining, often used as a synonym for Knowledge Discovery in Databases,
refers to the nontrivial extraction of implicit, previously unknown, and
potentially useful information from data stored in databases.
• More precisely, data mining is one step in the KDD process. The steps are:
1. Data Cleaning: Data cleaning is the removal of noisy and irrelevant data
from the collection.
• Cleaning in the case of missing values.
• Cleaning noisy data, where noise is a random error or variance in the data.
• Cleaning with data discrepancy detection and data transformation tools.
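One common way to handle missing values is to replace them with the mean of the known values. The source does not prescribe a particular strategy, so mean imputation here is an assumption, and the data is invented.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

# Hypothetical attribute column with two missing entries.
ages = [30, None, 40, 50, None]

cleaned = fill_missing(ages)
print(cleaned)
```

Other options include dropping the incomplete records or filling with a median or a most frequent value; which one is appropriate depends on the data and the mining task.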
2. Data Integration: Data integration is the combining of heterogeneous data
from multiple sources into a common source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.
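The idea of combining heterogeneous sources can be sketched as a tiny extract, transform, and load step. Both source layouts, all field names, and the integrate function are hypothetical; a real warehouse load would involve far more validation.

```python
def integrate(crm_rows, billing_rows):
    """Merge two differently shaped sources into one record per customer."""
    merged = {}
    # Source 1 keys customers by "customer_id" and stores "full_name".
    for row in crm_rows:
        merged[row["customer_id"]] = {"name": row["full_name"]}
    # Source 2 keys customers by "cust" and stores "amount_due".
    for row in billing_rows:
        record = merged.setdefault(row["cust"], {})
        record["balance"] = row["amount_due"]
    return merged

# Hypothetical extracts from two systems.
crm = [
    {"customer_id": 1, "full_name": "Ada"},
    {"customer_id": 2, "full_name": "Grace"},
]
billing = [
    {"cust": 1, "amount_due": 120.0},
    {"cust": 3, "amount_due": 15.5},
]

warehouse = integrate(crm, billing)
print(warehouse)
```

Customers that appear in only one source end up with partial records, which is one way the integration step feeds back into data cleaning.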
3. Data Selection: Data selection is the process in which data relevant to the
analysis is identified and retrieved from the data collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using Naive Bayes.
• Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is the process of transforming
data into the form required by the mining procedure. It is a two-step
process:
• Data mapping: Assigning elements from the source base to the destination to
capture transformations.
• Code generation: Creation of the actual transformation program.
5. Data Mining: Data mining is the application of clever techniques to
extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is the identification of the truly
interesting patterns representing knowledge, based on given measures.
• Finds the interestingness score of each pattern.
• Uses summarization and visualization to make the data understandable by the
user.
7. Knowledge Representation: Knowledge representation is a technique that
uses visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules,
etc.
