2020 - UNIT 2 Chapter 1
1.1 Why Data Mining
– process measuring,
– scientific experiments,
– system performance,
– environmental surveillance.
Extracting useful information is extremely challenging.
Traditional data analysis tools and techniques cannot be used because of the
massive size of the data sets.
Additional data analysis tools are required for in-depth analysis, such as data
classification, clustering, and the characterization of data that changes over
time.
Data mining is a technology that blends traditional data analysis methods with
sophisticated algorithms for processing large volumes of data.
1.2 What Is Data Mining?
1.2 What Is Data Mining?: Data Mining Process
1.2 What Is Data Mining?: What Is Data?
1.2 What Is Data Mining?: What Is Information?
What is Knowledge?
– Patterns of relationships in data and information that exhibit a
high degree of certainty
1.2 What Is Data Mining?: Knowledge Discovery from Data(KDD)
Many treat data mining as a synonym for another popularly used term,
knowledge discovery from data (KDD).
– KDD is the process of discovering useful knowledge from a collection of
data.
1.2 What Is Data Mining?: KDD Process
The knowledge discovery process is an iterative sequence of steps:
1.3 What Kinds of Data can be Mined?
Data mining can be applied to any kind of data as long as the data are meaningful for a target
application.
However, algorithms and approaches may differ when applied to different types of data.
1.3 What Kinds of Data can be Mined?: Database Data
1.3 What Kinds of Data can be Mined?: Transactional Data
1.4 What Kinds of Patterns Can Be Mined?:Data mining tasks
– Classification [Predictive]
– Regression [Predictive]
– Outlier Analysis [Predictive]
– Clustering [Descriptive]
– Association Rule Discovery [Descriptive]
– Sequential Pattern Discovery [Descriptive]
Classification [Predictive]
– Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts and maps data
into predefined classes or groups.
– The derived model is based on the analysis of a set of training
data (i.e., data objects whose class label is known).
Classification
– Decision trees
– Neural network
Classification
IF – THEN
– IF (Attendance = 75) AND (IA = 10) THEN class = ‘F’
– IF (Attendance = 85) AND (IA = 25) THEN class = ‘D’
– IF (Attendance = 75) AND (IA = 45) THEN class = ‘S’
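The IF-THEN rules above can be read as a small rule-based classifier. The following is a minimal sketch, assuming the equality tests are meant exactly as written on the slide; the names `RULES` and `classify` and the `default` fallback are hypothetical and not part of the original material.

```python
# Each rule pairs exact attribute values with a class label, mirroring the
# three IF-THEN rules on the slide. Equality tests are kept as written there.
RULES = [
    ({"Attendance": 75, "IA": 10}, "F"),
    ({"Attendance": 85, "IA": 25}, "D"),
    ({"Attendance": 75, "IA": 45}, "S"),
]


def classify(record, rules=RULES, default=None):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label
    # No rule fired; the fallback value is a hypothetical design choice.
    return default


print(classify({"Attendance": 75, "IA": 45}))
```

A record that matches none of the rules gets the `default` value, which makes visible how little of the attribute space three exact-match rules cover.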
Classification
Decision Tree
– A decision tree is a flow-chart-like tree structure,
– where each node denotes a test on an attribute value,
– each branch represents an outcome of the test, and
– tree leaves represent classes or class distributions.
– Decision trees can easily be converted to classification rules.
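Example decision tree (reconstructed from the slide figure):
– Attendance < 75: class F
– Attendance > 75: test IA
• IA 45 - 50: class S
• IA 38 - 44: class A
• IA 0 - 24: class F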
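The statement that decision trees can easily be converted to classification rules can be illustrated by walking every root-to-leaf path. This is a sketch only: the tree encoding, the name `tree_to_rules`, and the pairing of the IA ranges with the leaves S, A and F are assumptions read off the flattened slide figure.

```python
# Internal nodes are dicts with an attribute and its outcome branches;
# leaves are plain class labels. The branch-to-leaf pairing is an assumption.
TREE = {
    "attribute": "Attendance",
    "branches": {
        "< 75": "F",
        "> 75": {
            "attribute": "IA",
            "branches": {"in 45-50": "S", "in 38-44": "A", "in 0-24": "F"},
        },
    },
}


def tree_to_rules(node, path=()):
    """Collect one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):  # reached a leaf: emit the rule for this path
        conditions = " AND ".join(f"({attr} {outcome})" for attr, outcome in path)
        return [f"IF {conditions} THEN class = '{node}'"]
    rules = []
    for outcome, child in node["branches"].items():
        rules.extend(tree_to_rules(child, path + ((node["attribute"], outcome),)))
    return rules


for rule in tree_to_rules(TREE):
    print(rule)
```

Each leaf yields exactly one rule, so this tree produces four rules. IA values that fall outside the three listed ranges are not covered, just as in the figure.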
Application
– Credit card fraud detection
– Telecom fraud detection
– Medical analysis
Clustering
Clustering is the process of partitioning a set of data (or objects)
into a set of meaningful sub-classes, called clusters.
Given a set of data points, each having a set of attributes, and
a similarity measure among them, find clusters such that
– data points in the same cluster are similar to one another and can
collectively be treated as a group, and
– data points in different clusters are sufficiently different from one another.
Clustering
Similarity measures
– Euclidean distance
Types of clustering
– Group based clustering
– Hierarchical clustering
Application
– Market segmentation
– Document clustering
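To make the role of the similarity measure concrete, the sketch below computes Euclidean distance and assigns each point to its nearest cluster centre. It shows a single assignment step, not a complete clustering algorithm; the sample points, the centres, and the function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_clusters(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = {i: [] for i in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


# Hypothetical two-dimensional data with two obvious groups.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids = [(1, 1), (9, 9)]

print(euclidean((0, 0), (3, 4)))  # 5.0
print(assign_to_clusters(points, centroids))
```

Points that are close under the distance measure end up in the same group, which is the "similar within, different between" property described for clusters.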
Association Rule Discovery
Given a set of transactions, each of which contains some
number of items from a given collection
– Produce dependency rules which will predict the occurrence
of an item based on the occurrences of other items in the
transaction
Transaction  Items
T1           Bread, Jelly, Jam
T2           Bread, Jam
T3           Bread, Milk, Jam
T4           Coffee, Bread
T5           Milk, Coffee
Rules Discovered:
– {Bread} → {Jam}
– {Jelly} → {Bread}
– {Jelly} → {Jam}
– {Jelly} → {Milk}
Rule form:
Body → Head [support, confidence]
Example
– When a customer buys a shirt, in 70% of cases he or
she will also buy a tie (confidence = 70%).
– This happens in 13.5% of all purchases (support = 13.5%).
Rules Discovered (support and confidence to be worked out from transactions T1 to T5):
– {Bread} → {Jam}: support = ___, confidence = ___
– {Jelly} → {Bread}: support = ___, confidence = ___
– {Jelly} → {Jam}: support = ___, confidence = ___
– {Jelly} → {Milk}: support = ___, confidence = ___
Rules Discovered:
– {Bread} → {Jam}: support = 60%, confidence = 75%
– {Jelly} → {Bread}: support = 20%, confidence = 100%
– {Jelly} → {Jam}: support = 20%, confidence = 100%
– {Jelly} → {Milk}: support = 0%
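The support and confidence figures can be checked with a short computation. The sketch assumes the usual definitions: support of Body → Head is the fraction of all transactions that contain every item of both Body and Head, and confidence is the fraction of transactions containing Body that also contain Head. The function names are hypothetical, and returning 0.0 when Body never occurs is a convention chosen here.

```python
# The five transactions from the slide, with item names lowercased.
TRANSACTIONS = {
    "T1": {"bread", "jelly", "jam"},
    "T2": {"bread", "jam"},
    "T3": {"bread", "milk", "jam"},
    "T4": {"coffee", "bread"},
    "T5": {"milk", "coffee"},
}


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(body, head, transactions=TRANSACTIONS):
    """Fraction of all transactions containing both body and head."""
    return count(body | head, transactions) / len(transactions)


def confidence(body, head, transactions=TRANSACTIONS):
    """Fraction of body-containing transactions that also contain head."""
    body_count = count(body, transactions)
    if body_count == 0:
        return 0.0  # convention chosen for this sketch
    return count(body | head, transactions) / body_count


for body, head in [({"bread"}, {"jam"}), ({"jelly"}, {"bread"}),
                   ({"jelly"}, {"jam"}), ({"jelly"}, {"milk"})]:
    print(sorted(body), "->", sorted(head),
          f"support={support(body, head):.0%}",
          f"confidence={confidence(body, head):.0%}")
```

Bread occurs in four transactions and bread with jam in three, which gives the 60% support (3 of 5) and 75% confidence (3 of 4) listed for the first rule.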
60% of customers who buy Intro to Visual C and C++ Primer also buy
Perl for Dummies and Tcl/Tk within a month
– Athletic Apparel Store:
2.5 Which Technologies are used
As a highly application-driven domain, data mining has incorporated
many techniques
Data Mining Applications
– Relationship Marketing
– Customer Profiling
– Customer Segmentation
Relationship Marketing
– Usually customers have a lifetime value, not just the value of a single sale.
Customer Profiling
– It is the process of using relevant and available information to describe
the characteristics of a group of customers
– Profiling can help an enterprise identify its most valuable customers so
that the enterprise may differentiate their needs and values
– Customer profiles may include information on how customers spend
money, where and what they tend to buy, who are the most profitable
customers and so on
Customer Segmentation
– Essentially, it is a process of finding sub-groups of similar
people within a data set and can be useful in marketing.
– Furthermore, data mining may be used to understand and
predict customer behavior and profitability, to develop
new products and services, and to effectively market new
offerings.
Potential Application
– Sports Scouting
– Web Advertising
– Recommendation Systems
– Sports
– Weather forecasting
Major Issues in Data Mining
Data mining is not an easy task: the algorithms used are very complex,
and the data are not always available in one place. They need to be integrated
from various heterogeneous data sources.
The major issues in data mining can be divided into the following categories:
1. Mining methodology and user interaction
2. Performance issues
Mining knowledge at multiple levels: one may find not only high-level knowledge,
such as “milk and bread are likely to be purchased together”, but also lower-level
knowledge, such as “a particular brand of milk and a particular brand of bread are
purchased together”.
Discovering knowledge at multiple levels extends the scope of knowledge discovery.
Performance Issues
– Efficiency and scalability of data mining algorithms: In
order to effectively extract information from the huge
amounts of data in databases, data mining algorithms must
be efficient and scalable.
Data Mining and Society
– How does data mining impact society?
Discuss whether or not each of the following activities is a data mining task.
a) Dividing the customers of a company according to their gender.
e) Predicting the outcomes of tossing a (fair) pair of dice.
No. Since the die is fair, this is a probability calculation. If the die were not fair, and
we needed to estimate the probabilities of each outcome from the data, then this is
more like the problems considered by data mining. However, in this specific case,
solutions to this problem were developed by mathematicians a long time ago, and
thus, we wouldn’t consider it to be data mining.
f) Predicting the future stock price of a company using historical records.
Yes. We would attempt to create a model that can predict the continuous value of
the stock price. This is an example of the area of data mining known as predictive
modelling. We could use regression for this modelling, although researchers in
many fields have developed a wide variety of techniques for predicting time series.
g) Monitoring the heart rate of a patient for abnormalities.
Yes. We would build a model of the normal behavior of heart rate and
raise an alarm when an unusual heart behavior occurred. This would
involve the area of data mining known as anomaly detection. This could
also be considered as a classification problem if we had examples of both
normal and abnormal heart behavior.
h) Monitoring seismic waves for earthquake activities.
Yes. In this case, we would build a model of different types of seismic
wave behavior associated with earthquake activities and raise an alarm
when one of these different types of seismic activity was observed. This
is an example of the area of data mining known as classification.
i) Extracting the frequencies of a sound wave.
No. This is signal processing.
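The answer to item g), building a model of normal heart rate and raising an alarm on unusual behavior, can be sketched as a simple anomaly detector. This is one possible approach under stated assumptions: the "model" is just the mean and standard deviation of a baseline sample, the three-standard-deviation threshold is an arbitrary choice, and all readings and names below are hypothetical.

```python
import statistics


def fit_normal_model(baseline):
    """Summarize normal behavior as (mean, sample standard deviation)."""
    return statistics.mean(baseline), statistics.stdev(baseline)


def find_anomalies(readings, model, threshold=3.0):
    """Return (index, value) for readings far from the modelled normal range."""
    mean, std = model
    return [(i, r) for i, r in enumerate(readings)
            if abs(r - mean) > threshold * std]


# Hypothetical resting heart-rate readings in beats per minute.
baseline = [72, 75, 71, 74, 73, 76, 70, 74]
model = fit_normal_model(baseline)

# Readings far outside the baseline range are flagged for an alarm.
print(find_anomalies([73, 75, 130, 72, 40], model))
```

If labelled examples of both normal and abnormal behavior were available, the same task could instead be treated as classification, as the answer to g) notes.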
UNIT – I
Data Warehouse Modeling: Data Cube and OLAP : Data Cube: A Multidimensional
Data Model, Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional
Data Models, The Role of Concept Hierarchies, Typical OLAP Operations, OLAP Systems
versus Statistical Databases
Data Warehouse Implementation : Efficient Data Cube Computation, DMQL,