Lecturenotes Data Mining

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. It involves discovering patterns and relationships within large datasets. Common techniques include classification, clustering, association rule mining, and prediction. Decision trees and clustering are popular algorithms. The CRISP-DM methodology provides a standardized process for conducting a data mining project through phases of business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Uploaded by

tanyah Lloyd
Copyright
© All Rights Reserved

DATA MINING

• It is the process of analyzing data from different
perspectives and summarizing it into useful
information: information that can be used to
increase revenue, cut costs, or both.
(http://www.anderson.ucla.edu)
• Also defined as the process of extracting valid,
previously unknown, comprehensible, and actionable
information from large databases and using it to
make crucial business decisions (Connolly & Begg,
2005)
• Technically, it is a process of discovering
meaningful patterns and relationships that lie
hidden within very large databases (Seidman,
2001)
• Refers to the mining or discovery of new
information in terms of patterns or rules from
vast amounts of data
• Keyword here is patterns:
So what is a pattern?
• A set of events that occur with enough frequency
in the dataset to reveal a relationship between
them. Revealing the relationship is usually an
inductive reasoning process
THE MATHEMATICS OF DATA MINING

• Mathematicians have provided an ideal
framework within which to conduct data mining,
called the “EUCLIDEAN SPACE”; the
mathematical theory describing it is known as
linear algebra
• So what is the Euclidean space?
GOALS OF DATA MINING
• Prediction
• Classification
• Optimization
• Identification
STYLES OF DATA MINING
• Directed data mining: takes the form of predictive
modelling, where we know exactly what we want to
predict
• It classifies data for use in making predictions or
estimates, with the goal of deriving target values
• E.g., banks may use it to predict defaulters on loans;
businesses may use it to decide whom to market their
products to
• Uses popular data mining algorithms such as
decision trees (discussed in detail later)
• Undirected data mining: finds patterns
in the data and leaves it up to the user to
determine whether or not these patterns are
important
• Data is placed in a format that makes it easier
for us to make sense of it
• The most commonly used algorithm is clustering,
which groups data together based on
common characteristics (discussed in detail later)
• One can then take one of the derived clusters
and apply the decision tree algorithm to it, in
order to focus on a particular segment of the
cluster
DATA MINING METHODOLOGY
DATA MINING ALGORITHMS

• A data mining algorithm is a well-defined
procedure that takes data as input and produces
models or patterns as output
DECISION TREES

• This algorithm analyzes the data and creates a
repeating series of branches until no more
relevant branches can be made
• The end result is a binary tree structure where the
splits in the branches can be followed along
specific criteria to find the most desired result
• Decision Tree (DT):
– Tree where the root and each internal node is labeled
with a question.
– The arcs represent each possible answer to the
associated question.
– Each leaf node represents a prediction of a solution to
the problem.
• Popular technique for classification; the leaf node
indicates the class to which the corresponding tuple
belongs.
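The structure described above (questions at internal nodes, answers on arcs, classes at leaves) can be sketched in a few lines of Python. This is only an illustration: the tree is written by hand rather than learned from data, and the attribute names `has_steady_income` and `missed_payment_before` are hypothetical, loosely based on the loan-default example mentioned earlier.

```python
# Hypothetical, hand-written tree; attribute names and classes are illustrative only.
# Internal node: {"question": attribute name, "branches": {answer: subtree}}
# Leaf node: a plain string naming the predicted class.
tree = {
    "question": "has_steady_income",
    "branches": {
        "yes": {
            "question": "missed_payment_before",
            "branches": {"yes": "default", "no": "repay"},
        },
        "no": "default",
    },
}


def classify(node, record):
    """Follow the arc matching each answer until a leaf (class label) is reached."""
    while isinstance(node, dict):
        answer = record[node["question"]]
        node = node["branches"][answer]
    return node


applicant = {"has_steady_income": "yes", "missed_payment_before": "no"}
print(classify(tree, applicant))  # repay
```

A real decision tree algorithm would choose the questions and their order from the training data; the sketch only shows how a finished tree is followed to classify one tuple.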
CLUSTERING
• This algorithm groups data into clusters
• The goal of clustering is to place records into
groups, such that records in a group are similar
to each other and dissimilar to records in other
groups
• An important facet of clustering is the
similarity function that is used
• The Euclidean distance (the ordinary or straight-line
distance between two points) can be used
to measure similarity
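As a rough sketch of how Euclidean distance can serve as the similarity function, the Python below computes the distance between two numeric records and then places each record with the nearest of a few given centers. The records, the centers, and the helper name `assign_to_nearest` are assumptions made for illustration; this is only the assignment step, not a complete clustering algorithm.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(records, centers):
    """Group each record with the center it is closest to."""
    clusters = {i: [] for i in range(len(centers))}
    for record in records:
        nearest = min(
            range(len(centers)),
            key=lambda i: euclidean_distance(record, centers[i]),
        )
        clusters[nearest].append(record)
    return clusters


# Hypothetical records (age, income in thousands) and two assumed centers.
records = [(25, 30), (27, 32), (60, 90), (62, 95)]
centers = [(26, 31), (61, 92)]
print(assign_to_nearest(records, centers))
```

Records that end up in the same group are close to each other and far from records in the other group, which is the goal stated above.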
ASSOCIATION RULE MINING

• It is an important data mining model, initially
used for Market Basket Analysis to find how
items purchased by customers are related
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

Example of Association Rules (from market-basket transactions)

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
DEFINITION: ASSOCIATION RULE
● Association Rule
– An implication expression of the form
X → Y, where X and Y are itemsets
– Example:
{Milk, Diaper} → {Beer}

● Rule Evaluation Metrics
– Support (s)
◆ Fraction of transactions that contain
both X and Y
– Confidence (c)
◆ Measures how often items in Y
appear in transactions that
contain X
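The two metrics can be computed directly from their definitions. The sketch below is illustrative only: the slide's transaction table is not present in this text, so the five transactions are an assumed market-basket dataset, chosen because it gives the support and confidence values quoted in the rule examples of this document.

```python
# Assumed market-basket data: the original transaction table is not in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions that contain X, the fraction that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)


x, y = {"Milk", "Diaper"}, {"Beer"}
print(round(support(x | y, transactions), 2))    # 0.4
print(round(confidence(x, y, transactions), 2))  # 0.67
```

With this data, {Milk, Diaper, Beer} appears in 2 of 5 transactions (s = 0.4), and {Milk, Diaper} appears in 3, so c = 2/3 ≈ 0.67.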
MINING ASSOCIATION RULES
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
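The observations above can be checked with a short Python sketch that enumerates every binary partition of one itemset. The support is computed once for the whole itemset, while the confidence is recomputed for each left-hand side. The transactions are an assumed dataset (the original table is not in this text), and `rules_from` is a hypothetical helper name.

```python
from itertools import combinations

# Assumed market-basket data: the original transaction table is not in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset):
    """Yield (X, Y, support, confidence) for each binary partition X -> Y.

    Assumes the itemset occurs in at least one transaction.
    """
    total = count(itemset)
    s = total / len(transactions)  # identical for every rule from this itemset
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            x = frozenset(lhs)
            yield x, itemset - x, s, total / count(x)


for x, y, s, c in rules_from(frozenset({"Milk", "Diaper", "Beer"})):
    print(sorted(x), "->", sorted(y), f"s={s:.2f}", f"c={c:.2f}")
```

A three-item itemset yields six rules, all with s = 0.40 and with confidences ranging from 0.50 to 1.00 on this data.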
CROSS INDUSTRY STANDARD PROCESS FOR DATA MINING (CRISP-DM)
CRISP-DM: OVERVIEW

• CRISP-DM is a comprehensive data mining
methodology and process model that provides
anyone, from novices to data mining experts,
with a complete blueprint for conducting a data
mining project.
• CRISP-DM breaks down the life cycle of a data
mining project into six phases.
CRISP-DM: PHASES

Business Understanding
• Understanding project objectives and
requirements; Data mining problem definition
Data Understanding
• Initial data collection and familiarization; Identify
data quality issues; Initial, obvious results
Data Preparation
• Record and attribute selection; Data cleansing
Modeling
• Run the data mining tools
Evaluation
• Determine if results meet business objectives;
Identify business issues that should have been
addressed earlier
Deployment
• Put the resulting models into practice; Set up for
continuous mining of the data
