
UNIT-04: Introduction To Data Mining: Data Mining Techniques KDD Process Association Rules.

This document discusses various data mining techniques including definition, predictive modeling, classification, regression, time series analysis, prediction, descriptive modeling, clustering, summarization, and association rules. It provides examples for each technique to illustrate how they are used in applications such as credit risk analysis, airport security screening, savings prediction, customer catalog targeting, university rankings, and grocery store sales analysis.

Uploaded by

Kuntal Gupta
Copyright: © All Rights Reserved

UNIT-04

INTRODUCTION TO DATA MINING:


• Definition
• Data mining Techniques
• KDD Process
• Association rules.
• Reference: Dunham, Data Mining (http://index-of.co.uk/Data-Mining/Dunham%20-%20Data%20Mining.pdf)
Definition
• In simple words, data mining is the process of extracting usable information
from a larger set of raw data. It involves analyzing patterns in large batches
of data using one or more software tools.
• It is one of the steps in the KDD process.
• Data mining is often defined as finding hidden information in a database.
Alternatively, it has been called exploratory data analysis, data driven
discovery, and deductive learning.
Predictive Model
• A predictive model makes a prediction about values of data using known
results found from different data.
• Predictive modeling may be made based on the use of other historical data.
For example, a credit card use might be refused not because of the user's
own credit history, but because the current purchase is similar to earlier
purchases that were subsequently found to be made with stolen cards.
• Example 1.1 uses predictive modeling to predict the credit risk.
Predictive model/Supervised techniques: tasks
Predictive model data mining tasks include:
• Classification.
• Regression.
• Time series analysis.
• Prediction
Classification
• Classification maps data into predefined groups or classes.
• It is often referred to as supervised learning because the classes are determined before
examining the data.
• Two examples of classification applications are determining whether to make a bank loan
and identifying credit risks. Classification algorithms require that the classes be defined
based on data attribute values.
• They often describe these classes by looking at the characteristics of data already known to
belong to the classes.
• Pattern recognition is a type of classification where an input pattern is classified into one of
several classes based on its similarity to these predefined classes.
Example 2
• An airport security screening station is used to determine if passengers
are potential terrorists or criminals.
• To do this, the face of each passenger is scanned and its basic pattern
(distance between eyes, size and shape of mouth, shape of head, etc.) is
identified.
• This pattern is compared to entries in a database to see if it matches any
patterns that are associated with known offenders.
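The matching step in this example can be sketched as a nearest-neighbor comparison against stored patterns. This is only an illustration: the feature tuples, the `KNOWN_PATTERNS` table, the `classify` function, and the distance threshold are all hypothetical, and real face recognition systems are far more involved.

```python
import math

# Hypothetical stored patterns: (eye distance, mouth width, head width) in cm.
KNOWN_PATTERNS = {
    "entry_A": (6.1, 5.0, 15.2),
    "entry_B": (6.8, 4.4, 14.1),
}

def classify(face, database=KNOWN_PATTERNS, threshold=0.3):
    """Assign the scanned pattern to one of two predefined classes.

    Returns ("match", name) when the closest stored pattern lies within
    `threshold` (Euclidean distance), otherwise ("no match", None).
    """
    best_name, best_dist = None, math.inf
    for name, pattern in database.items():
        dist = math.dist(face, pattern)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist <= threshold:
        return ("match", best_name)
    return ("no match", None)
```

The two classes ("match" and "no match") are fixed before any passenger is scanned, which is what makes this a classification task rather than clustering.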
Regression
• Regression is used to map a data item to a real valued prediction variable. In
actuality, regression involves the learning of the function that does this
mapping.
• Regression assumes that the target data fit into some known type of function
(e.g., linear, logistic, etc.) and then determines the best function of this type
that models the given data.
• Some type of error analysis is used to determine which function is "best."
Example
• EXAMPLE 1.3 A college professor wishes to reach a certain level of
savings before her retirement. Periodically, she predicts what her retirement
savings will be based on its current value and several past values.
• She uses a simple linear regression formula to predict this value by fitting
past behavior to a linear function and then using this function to predict the
values at points in the future. Based on these values, she then alters her
investment portfolio.
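The professor's calculation can be sketched with an ordinary least-squares line fit. The savings figures below are invented for illustration, and `fit_line` is a hypothetical helper name, not something taken from the example.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past savings (in thousands) observed at years 0..4.
years = [0, 1, 2, 3, 4]
savings = [100.0, 110.0, 120.0, 130.0, 140.0]

slope, intercept = fit_line(years, savings)
predicted_year_10 = intercept + slope * 10  # extrapolate to a future point
```

Here the "error analysis" is the least-squares criterion: the chosen line is the one that minimizes the sum of squared differences between the observed and fitted values.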
Time Series Analysis
• With time series analysis, the value of an attribute is examined as it varies over time. The
values usually are obtained as evenly spaced time points (daily, weekly, hourly, etc.).
• A time series plot (Figure 1.3) is used to visualize the time series.
• In this figure you can easily see that the plots for Y and Z have similar behavior, while X
appears to have less volatility.
• There are three basic functions performed in time series analysis:
• In one case, distance measures are used to determine the similarity between different time
series. In the second case, the structure of the line is examined to determine (and perhaps
classify) its behavior. A third application would be to use the historical time series plot to
predict future values.
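Two of these functions, similarity and prediction, can be sketched as follows. The series values are invented so that Y and Z behave alike while X is less volatile; Euclidean distance and a moving-average forecast are assumed here as simple stand-ins, since the slide does not name a specific distance measure or forecasting method.

```python
import math

def euclidean_distance(a, b):
    """Distance between two equally long, equally spaced time series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def moving_average_forecast(series, window=3):
    """Predict the next value as the mean of the last `window` values."""
    return sum(series[-window:]) / window

# Hypothetical daily values for three attributes.
x = [10, 10, 11, 10, 11]
y = [10, 14, 9, 15, 8]
z = [11, 15, 10, 16, 9]

dist_yz = euclidean_distance(y, z)   # small: Y and Z move together
dist_xy = euclidean_distance(x, y)   # larger: X behaves differently
next_x = moving_average_forecast(x)  # naive forecast for X
```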
Prediction
• Many real-world data mining applications can be seen as predicting future data states based
on past and current data.
• Prediction can be viewed as a type of classification. (Note: This is a data mining task that is
different from the predictive model, although the prediction task is a type of predictive
model.)
• The difference is that prediction is predicting a future state rather than a current state.
• Here we are referring to a type of application rather than to a type of data mining modeling
approach, as discussed earlier. Prediction applications include flood forecasting, speech recognition,
machine learning, and pattern recognition. Although future values may be predicted using
time series analysis or regression techniques, other approaches may be used as well.
Descriptive model
• A descriptive model identifies patterns or relationships in data. Unlike the
predictive model, a descriptive model serves as a way to explore the
properties of the data examined, not to predict new properties. Clustering,
summarization, association rules, and sequence discovery are usually viewed
as descriptive in nature.
Descriptive/unsupervised techniques
• Clustering
• Summarization
• Association Rules.
• Sequence Pattern Discovery
Clustering
• Clustering is similar to classification except that the groups are not
predefined, but rather defined by the data alone.
• Clustering is alternatively referred to as unsupervised learning or
segmentation.
• It can be thought of as partitioning or segmenting the data into groups that
might or might not be disjoint.
• The clustering is usually accomplished by determining the similarity among
the data on predefined attributes.
• The most similar data are grouped into clusters. Example 1.6 provides a
simple clustering example. Since the clusters are not predefined, a domain
expert is often required to interpret the meaning of the created clusters.
EXAMPLE
• A certain national department store chain creates special catalogs targeted to various
demographic groups based on attributes such as income, location, and physical
characteristics of potential customers (age, height, weight, etc.).
• To determine the target mailings of the various catalogs and to assist in the creation
of new, more specific catalogs, the company performs a clustering of potential
customers based on the determined attribute values.
• The results of the clustering exercise are then used by management to create special
catalogs and distribute them to the correct target population based on the cluster
for that catalog.
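The clustering step could be sketched with k-means, which is assumed here only because it is a common choice; the example does not say which algorithm the company uses. The customer tuples, the starting centroids, and the `kmeans` helper are hypothetical, and a real analysis would normally scale the attributes first so that income does not dominate the distance.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical potential customers: (age, income in thousands).
customers = [(25, 30), (27, 32), (24, 28), (55, 90), (58, 95), (60, 88)]

centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
```

The algorithm only returns groups of similar customers. Deciding that one group is, say, "young, lower income" and deserves its own catalog is the interpretation step left to management or a domain expert.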
Summarization
• Summarization maps data into subsets with associated simple descriptions.
Summarization is also called characterization or generalization. It extracts or
derives representative information about the database. This may be
accomplished by actually retrieving portions of the data. Alternatively,
summary type information (such as the mean of some numeric attribute) can
be derived from the data. The summarization succinctly characterizes the
contents of the database. Example 1.7 illustrates this process.
EXAMPLE
• One of the many criteria used to compare universities by the U.S. News &
World Report is the average SAT or ACT score [GM99]. This is a
summarization used to estimate the type and intellectual level of the student
body.
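As a minimal sketch, a summary of this kind reduces each university's list of scores to a single representative number. The university names and scores below are invented for illustration.

```python
from statistics import mean

# Hypothetical SAT scores of admitted students at two universities.
scores = {
    "University A": [1200, 1350, 1280, 1410],
    "University B": [1050, 1120, 990, 1180],
}

# Summarization: one derived value (the mean) describes each subset.
summary = {name: mean(values) for name, values in scores.items()}
```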
Association Rules

• Link analysis, alternatively referred to as affinity analysis or association, refers to the
data mining task of uncovering relationships among data.
• The best example of this type of application is to determine association rules.
• An association rule is a model that identifies specific types of data associations.
These associations are often used in the retail sales community to identify items that
are frequently purchased together.
• Example 1.8 illustrates the use of association rules in market basket analysis. Here
the data analyzed consist of information about what items a customer purchases.
• Associations are also used in many other applications such as predicting the failure
of telecommunication switches.
• Users of association rules must be cautioned that these are not causal relationships.
• They do not represent any relationship inherent in the actual data (as is true with
functional dependencies) or in the real world.
• There probably is no relationship between bread and pretzels that causes them to
be purchased together. And there is no guarantee that this association will apply in
the future.
• However, association rules can be used to assist retail store management in
effective advertising, marketing, and inventory control.
EXAMPLE
• A grocery store retailer is trying to decide whether to put bread on sale.
• To help determine the impact of this decision, the retailer generates association
rules that show what other products are frequently purchased with bread.
• He finds that 60% of the time that bread is sold, pretzels are sold as well,
and that 70% of the time jelly is also sold.
• Based on these facts, he tries to capitalize on the association between bread,
pretzels, and jelly by placing some pretzels and jelly at the end of the aisle where the
bread is placed.
• In addition, he decides not to place either of these items on sale at the same time.
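Percentages like these correspond to the confidence of a rule such as bread → pretzels. The sketch below computes support and confidence over a handful of invented baskets, so its numbers do not reproduce the 60% and 70% of the example exactly; the `support` and `confidence` helper names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "pretzels", "jelly"},
    {"bread", "jelly"},
    {"bread", "pretzels"},
    {"bread", "pretzels", "jelly"},
    {"bread", "jelly"},
    {"milk"},
]

conf_pretzels = confidence(baskets, {"bread"}, {"pretzels"})  # 3 of 5 bread baskets
conf_jelly = confidence(baskets, {"bread"}, {"jelly"})        # 4 of 5 bread baskets
```

A high confidence only says the items co-occur in past baskets. As noted above, it does not show that buying bread causes anyone to buy pretzels.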
Sequence Discovery
• Sequential analysis or sequence discovery is used to determine sequential patterns in data.
• These patterns are based on a time sequence of actions. These patterns are similar to
associations in that data (or events) are found to be related, but the relationship is based on
time.
• Unlike a market basket analysis, which requires the items to be purchased at the same time,
in sequence discovery the items are purchased over time in some order.
• Example 1.9 illustrates the discovery of some simple patterns. A similar type of discovery
can be seen in the sequence in which items are purchased. For example, most people who
purchase CD players may be found to purchase CDs within one week. As we will see,
temporal association rules really fall into this category.
EXAMPLE
• The Webmaster at the XYZ Corp. periodically analyzes the Web log data to
determine how users access XYZ's Web pages.
• He is interested in determining what sequences of pages are frequently
accessed.
• He determines that 70 percent of the users of page A follow one of the
following patterns of behavior: (A, B, C) or (A, D, B, C) or (A, E, B, C).
• He then determines to add a link directly from page A to page C.
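A minimal sketch of this analysis counts the ordered page paths in the sessions that start at page A. The sessions are invented, so the resulting share differs from the 70 percent in the example, and `path_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical sessions reconstructed from a Web log (pages in visit order).
sessions = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["A", "E", "B", "C"],
    ["A", "B", "C"],
    ["A", "F"],
    ["B", "C"],
]

def path_counts(sessions, start):
    """Count the full navigation paths of sessions that begin at `start`."""
    return Counter(tuple(s) for s in sessions if s and s[0] == start)

counts = path_counts(sessions, "A")
total = sum(counts.values())
ending_in_c = sum(n for path, n in counts.items() if path[-1] == "C")
share_a_to_c = ending_in_c / total  # share of A-sessions that end at C
```

Unlike the market basket case, the order of the pages matters: the sessions are kept as sequences rather than sets.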
DATA MINING ISSUES
1. Human interaction
2. Overfitting
3. Outliers
4. Interpretation of results
5. Visualization of results
6. Large datasets
7. High dimensionality
DATA MINING ISSUES CONTD…
8. Multimedia data
9. Missing data
10. Irrelevant data
11. Noisy data
12. Changing data
KDD PROCESS
Knowledge Discovery in Databases (KDD)
• Data mining, a term often used interchangeably with Knowledge Discovery in
Databases, refers to the nontrivial extraction of implicit, previously unknown,
and potentially useful information from data stored in databases.
• Strictly speaking, data mining is one step of the KDD process, which consists
of the following steps.
1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant
data from the collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random error or variance in a measured value.
• Cleaning with Data discrepancy detection and Data transformation tools.
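One common way to handle the missing-value case is to replace each missing entry with the mean of the known values. This is a sketch under that assumption; the slide does not prescribe a particular strategy, and `fill_missing` is a hypothetical helper.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

# Hypothetical attribute column with two missing readings.
ages = [20, None, 30, 40, None]
cleaned = fill_missing(ages)
```

Other options include dropping the affected records or filling with the median, which is less sensitive to outliers.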
KDD contd..
2. Data Integration: Data integration is defined as combining heterogeneous data
from multiple sources into a common source (data warehouse).
• Data integration using Data Migration tools.
• Data integration using Data Synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.
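An ETL flow can be sketched in three small functions: extract rows from the sources, transform them into a common schema, and load them into the warehouse. Everything here is hypothetical: the two in-memory "sources", their field names, and the list standing in for a data warehouse.

```python
def extract():
    """Pull raw rows from two hypothetical sources with different schemas."""
    crm = [{"cust_id": 1, "name": "Ann"}, {"cust_id": 2, "name": "Bob"}]
    sales = [
        {"customer": 1, "amount_usd": "19.99"},
        {"customer": 1, "amount_usd": "5.01"},
        {"customer": 2, "amount_usd": "7.50"},
    ]
    return crm, sales

def transform(crm, sales):
    """Convert amounts to numbers and combine both sources per customer."""
    totals = {}
    for row in sales:
        key = row["customer"]
        totals[key] = totals.get(key, 0.0) + float(row["amount_usd"])
    return [
        {"cust_id": c["cust_id"], "name": c["name"],
         "total_usd": round(totals.get(c["cust_id"], 0.0), 2)}
        for c in crm
    ]

def load(rows, warehouse):
    """Append the integrated rows to the (in-memory) warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(*extract()), [])
```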
KDD contd..
3. Data Selection: Data selection is defined as the process where data relevant
to the analysis is decided and retrieved from the data collection.
• Data selection using Neural network.
• Data selection using Decision Trees.
• Data selection using Naive Bayes.
• Data selection using Clustering, Regression, etc.
KDD contd..
4. Data Transformation: Data transformation is defined as the process of
transforming data into the form required by the mining procedure. Data
transformation is a two-step process:
• Data mapping: Assigning elements from the source to the destination to capture
transformations.
• Code generation: Creation of the actual transformation program.
KDD contd..
5. Data Mining: Data mining is defined as the application of clever techniques
to extract potentially useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model using classification or characterization.
KDD contd..
6. Pattern Evaluation: Pattern evaluation is defined as identifying the truly
interesting patterns representing knowledge, based on given interestingness measures.
• Finds the interestingness score of each pattern.
• Uses summarization and visualization to make the data understandable to the
user.
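As one possible interestingness score, the sketch below ranks candidate rules by lift, which compares how often two items occur together with how often they would if they were independent. Lift is assumed here for illustration only, since no specific measure is named, and the counts are invented.

```python
def lift(n_total, n_a, n_b, n_ab):
    """Lift of the rule A -> B from raw transaction counts.

    Values above 1 mean A and B occur together more often than expected
    under independence; values near 1 suggest an uninteresting rule.
    """
    return (n_ab * n_total) / (n_a * n_b)

# Hypothetical counts out of 100 transactions: (n_a, n_b, n_ab).
candidates = {
    "bread -> jelly": (50, 40, 30),
    "bread -> milk": (50, 80, 40),
}

scores = {rule: lift(100, *counts) for rule, counts in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # most interesting first
```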
KDD contd..
7. Knowledge representation: Knowledge representation is defined as a
technique that uses visualization tools to represent data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules,
etc.
