Data Mining
Data types, Input and Output of Data Mining Algorithms – Decision Trees – Constructing
Classification Trees – CHAID – CART – Regression Trees – Pruning – Model Estimation.
CHAID:
CHAID (Chi-square Automatic Interaction Detector)
Market research is an essential activity for every business and helps you to identify and analyze
market demand, market size, market trends and the strength of your competition. It also enables
you to assess the viability of a potential product or service before taking it to market. It is a field
that recognizes the importance of utilizing data to make evidence-based decisions, and many
statistical and analytical methods have become popular in the field of quantitative market
research.
• CHAID distinguishes 3 types of indep. variables
- Monotonic
- Free
- Floating
• Basic components of CHAID analysis
1. A categorical dep. var.
2. A set of categorical indep. variables
3. Settings for various CHAID parameters
4. Analyze subgroups and identify “best” indep. var
What is it for?
CHAID (Chi-square Automatic Interaction Detector) analysis is an algorithm used for
discovering relationships between a categorical response variable and other categorical predictor
variables. It is useful when looking for patterns in datasets with lots of categorical variables and
is a convenient way of summarizing the data as the relationships can be easily visualized.
In practice, CHAID is often used in direct marketing to understand how different groups of
customers might respond to a campaign based on their characteristics. So suppose, for example,
that we run a marketing campaign and are interested in understanding what customer
characteristics (e.g., gender, socio-economic status, geographic location, etc.) are associated with
the response rate achieved. We build a CHAID “tree” showing the effects of different customer
characteristics on the likelihood of response.
What statistical techniques are used?
As indicated in the name, CHAID uses Pearson's chi-square tests of independence, which test for
an association between two categorical variables. A statistically significant result indicates that
the two variables are not independent, i.e., there is a relationship between them.
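To make the test concrete, here is a minimal sketch (made-up counts, assuming SciPy is available) of a Pearson chi-square test of independence between gender and campaign response:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Made-up contingency table: rows = gender (male, female),
    # columns = response to the campaign (no, yes)
    table = np.array([[180, 120],
                      [150, 250]])

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(chi2, p_value)  # a small p-value suggests gender and response are not independent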
Alternative methods
CHAID is sometimes used as an exploratory method prior to predictive modeling. An advantage of
a full modeling approach (such as regression) is that we are able to analyze the data all in one go
rather than splitting the data into subgroups and performing multiple tests. In particular, where a
continuous response variable is of interest or there are a number of continuous predictors to
consider, we would recommend performing a multiple regression analysis instead.
Example: Opening of a Cinema / Children's Park / Exhibition Center
• To find consumer responses to the opening of a Cinema, Children's Park or Exhibition Center
• 903 respondents were asked to rate each alternative on a 5 point scale: 1(v.low) to 5 (v.high)
• The analyst also collected demographic data on the respondents
• Dependent var - % of positive responses
• Indep variables (with coding in parenthesis)
Gender: Male (1), Female (2)
Age: 16-20 (1)
21-24 (2)
25-34 (3)
35-44 (4)
45-54 (5)
55-64 (6)
65+ (7)
Socio-economic group had 6 categories: A(1), B(2), C1(3), C2(4), etc.
• CHAID is a dependence method.
• For a given dep. var. we want a technique that can
1. Indicate the indep. variables that most affect the dep. var.
2. Identify mkt. segments that differ most on these important indep. variables
• An early interaction detection method is AID
AID employs a hierarchical binary splitting algorithm
• General procedure
1. First select indep. var. whose subgroups differ most w.r.t dep. var.
2. Each subgroup of this var. is further divided into subgroups on remaining variables
3. These subgroups are tested for differences on dep. var.
4. Var. with greatest difference is selected next
5. Continue until subgroups are too small
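A minimal sketch of step 1 above (illustrative only, assuming pandas and SciPy; real CHAID also merges categories and applies significance corrections): pick the independent variable whose subgroups differ most, i.e. whose chi-square p-value against the dependent variable is smallest.

    import pandas as pd
    from scipy.stats import chi2_contingency

    def best_split_variable(df, dep_var, indep_vars):
        # Return the categorical predictor most strongly associated with dep_var
        best_var, best_p = None, None
        for var in indep_vars:
            table = pd.crosstab(df[var], df[dep_var])  # subgroup counts
            _, p, _, _ = chi2_contingency(table)
            if best_p is None or p < best_p:
                best_var, best_p = var, p
        return best_var, best_p

    # Made-up illustrative data
    df = pd.DataFrame({"gender":   ["M", "F", "F", "M", "F", "M", "F", "M"],
                       "age_band": ["1", "2", "2", "3", "1", "3", "2", "1"],
                       "response": ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]})
    print(best_split_variable(df, "response", ["gender", "age_band"]))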
Classification Trees: where the target variable is categorical and the tree is used to identify the
"class" into which the target variable would most likely fall.
Regression Trees: where the target variable is continuous and the tree is used to predict its value.
The CART algorithm is structured as a sequence of questions, the answers to which determine
what the next question, if any, should be. The result of these questions is a tree-like structure
where the ends are terminal nodes at which point there are no more questions. A simple example
of a decision tree is as follows.
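As an illustrative stand-in (made-up data, assuming scikit-learn is available), a tiny tree printed with export_text shows the question-and-answer structure:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Made-up data: each row is [age, income]; class 1 = responds, 0 = does not
    X = [[22, 20], [25, 30], [47, 60], [52, 80], [46, 40], [56, 90]]
    y = [0, 0, 1, 1, 0, 1]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=["age", "income"]))
    # Each internal node asks a question (e.g. "income <= 50.0?");
    # the leaves are terminal nodes where no more questions are asked.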
The main elements of CART (and any decision tree algorithm) are:
1. Rules for splitting data at a node based on the value of one variable;
2. Stopping rules for deciding when a branch is terminal and can be split no more; and
3. Finally, a prediction for the target variable in each terminal node.
One of the questions that arises in a decision tree algorithm is the optimal size of the final tree. A
tree that is too large risks overfitting the training data and poorly generalizing to new samples. A
small tree might not capture important structural information about the sample space. However,
it is hard to tell when a tree algorithm should stop because it is impossible to tell if the addition
of a single extra node will dramatically decrease error. This problem is known as the horizon
effect. A common strategy is to grow the tree until each node contains a small number of
instances then use pruning to remove nodes that do not provide additional information.[1]
Pruning should reduce the size of a learning tree without reducing predictive accuracy as
measured by a cross-validation set. There are many techniques for tree pruning that differ in the
measurement that is used to optimize performance.
Pruning can occur in a top down or bottom up fashion. A top down pruning will traverse nodes
and trim subtrees starting at the root, while a bottom up pruning will start at the leaf nodes.
Below are several popular pruning algorithms.
One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node
is replaced with its most popular class. If the prediction accuracy is not affected then the change
is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and
speed.
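A minimal, illustrative sketch of reduced error pruning (the data structures here are assumptions, not any library's API): the tree is a nested dict, leaves are plain class labels, and a node is replaced by its majority class whenever that does not hurt accuracy on a hold-out set.

    def predict(node, x):
        # Walk down until a leaf (a plain label) or a pruned node is reached
        while isinstance(node, dict) and not node.get("pruned"):
            node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        return node["majority"] if isinstance(node, dict) else node

    def accuracy(root, X, y):
        return sum(predict(root, xi) == yi for xi, yi in zip(X, y)) / len(y)

    def reduced_error_prune(node, root, X_val, y_val):
        if not isinstance(node, dict):
            return node                               # already a leaf
        node["left"] = reduced_error_prune(node["left"], root, X_val, y_val)
        node["right"] = reduced_error_prune(node["right"], root, X_val, y_val)
        before = accuracy(root, X_val, y_val)
        node["pruned"] = True                         # tentatively replace with its majority class
        if accuracy(root, X_val, y_val) >= before:    # accuracy not affected: keep the change
            return node["majority"]
        node["pruned"] = False                        # otherwise restore the subtree
        return node

    # Illustrative usage: prune a hand-built tree against a made-up validation set
    tree = {"feature": 0, "threshold": 20, "majority": "sad",
            "left": "happy",
            "right": {"feature": 0, "threshold": 30, "majority": "sad",
                      "left": "sad", "right": "sad"}}
    X_val, y_val = [[15], [25], [35], [45]], ["happy", "sad", "sad", "sad"]
    tree = reduced_error_prune(tree, tree, X_val, y_val)
    print(tree)   # the redundant right-hand subtree collapses to the single leaf "sad"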
Cost complexity pruning generates a series of trees T0 . . . Tm where T0 is the initial tree and Tm is
the root alone. At step i the tree is created by removing a subtree from tree i-1 and replacing it
with a leaf node whose value is chosen as in the tree-building algorithm. The subtree that is removed
is chosen as follows. Define the error rate of tree T over data set S as err(T, S). The subtree that
minimizes (err(prune(T, t), S) - err(T, S)) / (|leaves(T)| - |leaves(prune(T, t))|) is chosen for
removal, where prune(T, t) denotes the tree obtained by replacing the subtree rooted at node t with a leaf.
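In scikit-learn this idea is exposed through cost complexity pruning; a hedged sketch using standard library calls on an illustrative dataset:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    clf = DecisionTreeClassifier(random_state=0)
    path = clf.cost_complexity_pruning_path(X, y)   # effective alphas for the series T0 ... Tm
    print(path.ccp_alphas)

    # Larger ccp_alpha removes more subtrees, giving a smaller tree
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
    print(pruned.get_n_leaves())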
Discretization
Pre-processing of numerics.
Not every learner can handle numerics
Not every learner can handle numerics very well
Discretization: convert numbers to a finite number of bins.
Simple schemes (like n-bins) are surprisingly effective.
E.g. really helps for NaiveBayes.
Supervised Discretization
Separates the numerics according to the class variable
E.g. find a cliff where suddenly everything switches from one class to another
o e.g. class=happy if age under 20; class=sad if age over 20
o 20 is the cliff
Best left to the learner
Some details on cliff learning: per numeric attribute, find the cut point (cliff) that best separates the classes, as in the sketch below.
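A minimal sketch of that idea in plain Python (illustrative, not from the notes): try every possible cut point and keep the one with the fewest misclassified instances on either side.

    def best_cliff(values, labels):
        # Return the cut point that best separates the classes
        pairs = sorted(zip(values, labels))
        best_cut, best_errors = None, len(pairs) + 1
        for i in range(1, len(pairs)):
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            left = [l for v, l in pairs if v <= cut]
            right = [l for v, l in pairs if v > cut]
            # errors = items that do not belong to the majority class on their side
            errors = (len(left) - left.count(max(set(left), key=left.count)) +
                      len(right) - right.count(max(set(right), key=right.count)))
            if errors < best_errors:
                best_cut, best_errors = cut, errors
        return best_cut

    ages = [16, 18, 19, 21, 25, 30]
    moods = ["happy", "happy", "happy", "sad", "sad", "sad"]
    print(best_cliff(ages, moods))   # 20.0, the "cliff" from the age example above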
Unsupervised Discretization
Ignores class symbols
e.g. EqualIntervalDiscretization vs. EqualFrequencyDiscretization on the sorted values
0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,2,2,3,4,5,10,20,40,80,100
o EqualIntervalDiscretization splits the value range into equal-width bins, so on skewed data like this almost every value lands in the first bin and the upper bins are nearly empty.
o EqualFrequencyDiscretization instead puts (roughly) the same number of values in each bin, so the bin boundaries crowd together where the data are dense.
Variants: ProportionalKIntervalDiscretization
o EqualFrequencyDiscretization where the number of bins is
sqrt(numberOfInstances)
o Best for NaiveBayes
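A hedged sketch of both unsupervised schemes with NumPy, using the sorted values from the example above (the bin counts are what matter):

    import numpy as np

    values = np.array([0]*8 + [1]*7 + [2]*4 + [3, 4, 5, 10, 20, 40, 80, 100])
    k = 5

    # EqualIntervalDiscretization: split the value RANGE into k equal-width bins
    width_edges = np.linspace(values.min(), values.max(), k + 1)
    width_bins = np.digitize(values, width_edges[1:-1])

    # EqualFrequencyDiscretization: each bin holds roughly the same NUMBER of values
    freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))
    freq_bins = np.digitize(values, freq_edges[1:-1])

    print(np.bincount(width_bins, minlength=k))  # skewed data: almost everything in the first bin
    print(np.bincount(freq_bins, minlength=k))   # roughly balanced counts (ties at 0 shift the boundaries)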
GMDH (Group Method of Data Handling):
GMDH builds models of the form Y(x1, ..., xn) = a0 + a1·f1 + a2·f2 + ... + am·fm,
where f are elementary functions dependent on different sets of inputs, a are coefficients and m is
the number of the base function components.
Algorithms:
Combinatorial (COMBI)
Multilayered Iterative (MIA)
GN
Objective System Analysis (OSA)
Harmonical
Two-level (ARIMAD)
Multiplicative-Additive (MAA)
Objective Computer Clusterization (OCC);
Pointing Finger (PF) clusterization algorithm;
Analogues Complexing (AC)
Harmonical Rediscretization
Algorithm on the base of Multilayered Theory of Statistical Decisions (MTSD)
Group of Adaptive Models Evolution (GAME)
Cluster Analysis:
A cluster is a group of objects that belong to the same class. In other words, similar objects are
grouped in one cluster and dissimilar objects are grouped in another cluster.
What is Clustering?
Clustering is the process of making a group of abstract objects into classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into groups based on data similarity
and then assign the labels to the groups.
The main advantage of clustering over classification is that it is adaptable to changes and helps
single out useful features that distinguish different groups.
Applications of Cluster Analysis
Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
Clustering can also help marketers discover distinct groups in their customer base. And they can
characterize their customer groups based on the purchasing patterns.
In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.
Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house type,
value, and geographic location.
Clustering also helps in classifying documents on the web for information discovery.
Clustering is also used in outlier detection applications such as detection of credit card fraud.
As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to observe characteristics of each cluster.
k-means:
What does it do? k-means creates groups from a set of objects so that the members of a group
are more similar to each other than to objects outside the group. It's a popular cluster analysis
technique for exploring a dataset.
What's cluster analysis? Cluster analysis is a family of algorithms designed to form groups such
that the group members are more similar to each other than to non-group members. Clusters and
groups are synonymous in the world of cluster analysis.
How does k-means do this? k-means has lots of variations to optimize for certain
types of data.
At a high level, they all do something like this (imagine each object is a patient described by a
vector of measurements):
1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are
called centroids.
2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the
same one, so they’ll form a cluster around their nearest centroid.
3. What we have are k clusters, and each patient is now a member of a cluster.
4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using
the patient vectors!).
5. This center becomes the new centroid for the cluster.
6. Since the centroid is in a different place now, patients might now be closer to other centroids. In
other words, they may change cluster membership.
7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships
stabilize. This is called convergence.
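A hedged sketch of the loop above using scikit-learn on made-up patient vectors (the library runs steps 1 to 7 internally until convergence):

    import numpy as np
    from sklearn.cluster import KMeans

    # Made-up patients: each row is a vector of measurements, e.g. [age, cholesterol]
    X = np.array([[25, 180], [27, 190], [30, 170],
                  [55, 250], [60, 260], [58, 240]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)            # cluster membership for each patient
    print(kmeans.cluster_centers_)   # the final centroids after convergence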
Why use k-means? I don’t think many will have an issue with this:
The key selling point of k-means is its simplicity. Its simplicity means it’s generally faster and
more efficient than other algorithms, especially over large datasets.
It gets better:
k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster
analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and explore
whether there are overlooked patterns or relationships in the dataset.
Drawbacks:
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial
choice of centroids. One final thing to keep in mind is k-means is designed to operate on
continuous data — you’ll need to do some tricks to get it to work on discrete data.
Where is it used? A ton of implementations for k-means clustering are available online:
Apache Mahout, Julia, R, SciPy, Weka, MATLAB, SAS.
Apriori:
What does it do? The Apriori algorithm learns association rules and is applied to a database
containing a large number of transactions.
What are association rules? Association rule learning is a data mining technique for learning
correlations and relations among variables in a database.
What’s an example of Apriori? Let’s say we have a database full of supermarket transactions.
You can think of a database as a giant spreadsheet where each row is a customer transaction and
every column represents a different grocery item.
For example, one row might contain chips, dip and soda, another might contain chips and soda, and so on.
You can probably quickly see that chips + dip and chips + soda seem to frequently occur
together. These are called 2-itemsets. With a large enough dataset, it will be much harder to
“see” the relationships especially when you’re dealing with 3-itemsets or more. That’s precisely
what Apriori helps with!
You might be wondering how Apriori works. Before getting into the nitty-gritty of the algorithm,
you'll need to define 3 things:
1. The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 3-itemset, etc.?
2. The second is your support or the number of transactions containing the itemset divided by the
total number of transactions. An itemset that meets the support is called a frequent itemset.
3. The third is your confidence, or the conditional probability of some item given you have certain
other items in your itemset. A good example: given chips in your itemset, there is a 67%
confidence of also having soda in the itemset.
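A hedged sketch of support and confidence on made-up transactions (chosen so that the chips-to-soda confidence works out to about 67%, as in the example above):

    transactions = [
        {"chips", "dip", "soda"},
        {"chips", "soda"},
        {"chips", "dip"},
        {"bread", "milk"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in the itemset
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        # Conditional probability of the consequent given the antecedent
        return support(set(antecedent) | set(consequent)) / support(antecedent)

    print(support({"chips", "soda"}))       # 0.5  -> {chips, soda} appears in 2 of 4 transactions
    print(confidence({"chips"}, {"soda"}))  # 0.67 -> given chips, soda is present 2 times out of 3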
Why use Apriori? Apriori is well understood, easy to implement and has many derivatives.
On the downside, the algorithm can be quite memory-, space- and time-intensive when generating itemsets.
Where is it used? Plenty of implementations of Apriori are available. Some popular ones are
ARtool, Weka, and Orange.
k-Nearest Neighbor:
What does it do? kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs
from the classifiers previously described because it’s a lazy learner.
What’s a lazy learner? A lazy learner doesn’t do much during the training process other than
store the training data. Only when new unlabeled data is input does this type of learner look to
classify.
On the other hand, an eager learner builds a classification model during training. When new
unlabeled data is input, this type of learner feeds the data into the classification model.
How do C4.5, SVM and AdaBoost fit into this? Unlike kNN, they are all eager learners.
Here’s why:
1. C4.5 builds a decision tree classification model during training.
2. SVM builds a hyperplane classification model during training.
3. AdaBoost builds an ensemble classification model during training.
So what does kNN do? kNN builds no such classification model. Instead, it just stores the
labeled training data.
When new unlabeled data comes in, kNN operates in 2 basic steps:
1. First, it looks at the closest labeled training data points — in other words, the k-nearest
neighbors.
2. Second, using the neighbors’ classes, kNN gets a better idea of how the new data should be
classified.
You might be wondering…
How does kNN figure out what’s closer? For continuous data, kNN uses a distance metric
like Euclidean distance. The choice of distance metric largely depends on the data. Some even
suggest learning a distance metric based on the training data. There’s tons more details and
papers on kNN distance metrics.
For discrete data, the idea is to transform it into continuous data. 2 examples of this are:
1. Using Hamming distance as a metric for the “closeness” of 2 text strings.
2. Transforming discrete data into binary features.
These 2 Stack Overflow threads have some more suggestions on dealing with discrete data:
KNN classification with categorical data, Using k-NN in R with categorical values
How does kNN classify new data when neighbors disagree? kNN has an easy time when all
neighbors are the same class. The intuition is if all the neighbors agree, then the new data point
likely falls in the same class.
I’ll bet you can guess where things get hairy…
How does kNN decide the class when neighbors don’t have the same class?
2 common techniques for dealing with this are:
1. Take a simple majority vote from the neighbors. Whichever class has the greatest number of
votes becomes the class for the new data point.
2. Take a similar vote except give a heavier weight to those neighbors that are closer. A simple way
to do this is to use reciprocal distance e.g. if the neighbor is 5 units away, then weight its vote
1/5. As the neighbor gets further away, the reciprocal distance gets smaller and smaller… exactly
what we want!
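A hedged sketch of both voting schemes with scikit-learn on made-up data (weights='uniform' is the simple majority vote; weights='distance' weights each neighbor's vote by reciprocal distance):

    from sklearn.neighbors import KNeighborsClassifier

    # Made-up labeled training data: [feature1, feature2] -> class
    X_train = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
    y_train = ["red", "red", "red", "blue", "blue", "blue"]

    majority = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_train, y_train)
    weighted = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_train, y_train)

    print(majority.predict([[2, 2]]))   # simple majority vote among the 3 nearest neighbors
    print(weighted.predict([[5, 5]]))   # closer neighbors get heavier (1/distance) votes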
Is this supervised or unsupervised? This is supervised learning, since kNN is provided a
labeled training dataset.
Why use kNN? Ease of understanding and implementing are 2 of the key reasons to use kNN.
Depending on the distance metric, kNN can be quite accurate.
But that’s just part of the story…
Here are 5 things to watch out for:
1. kNN can get very computationally expensive when trying to determine the nearest neighbors on
a large dataset.
2. Noisy data can throw off kNN classifications.
3. Features with a larger range of values can dominate the distance metric relative to features that
have a smaller range, so feature scaling is important.
4. Since data processing is deferred, kNN generally has greater storage requirements than
eager classifiers.
5. Selecting a good distance metric is crucial to kNN’s accuracy.
Naive Bayes:
What does it do? Naive Bayes is not a single algorithm, but a family of classification algorithms
that share one common assumption:
Every feature of the data being classified is independent of all other features given the class.
What does independent mean? 2 features are independent when the value of one feature has no
effect on the value of another feature.
For example:
Let's say you have a patient dataset containing features like pulse, cholesterol level, weight,
height and zip code. All features would be independent if the value of one feature had no effect
on any other. For this dataset, it's reasonable to assume that the patient's height and zip code are
independent, since a patient's height has little to do with their zip code.
But let’s not stop there, are the other features independent?
Sadly, the answer is no. Here are 3 feature relationships which are not independent:
If height increases, weight likely increases.
If cholesterol level increases, weight likely increases.
If cholesterol level increases, pulse likely increases as well.
In my experience, the features of a dataset are generally not all independent.
And that ties in with the next question…
Why is it called naive? The assumption that all features of a dataset are independent is precisely
why it’s called naive — it’s generally not the case that all features are independent.
What's Bayes? Thomas Bayes was an English statistician after whom Bayes' Theorem is named.
In a nutshell, the theorem allows us to predict the class given a set of features using probability.
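Reconstructed from the description below, for two features the equation reads:

    P(Class A | Feature 1, Feature 2) =
        [ P(Feature 1 | Class A) × P(Feature 2 | Class A) × P(Class A) ]
        / [ P(Feature 1) × P(Feature 2) ]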
What does the equation mean? The equation finds the probability of Class A given Features 1
and 2. In other words, if you see Features 1 and 2, this is the probability the data is Class A.
The equation reads: The probability of Class A given Features 1 and 2 is a fraction.
The fraction’s numerator is the probability of Feature 1 given Class A multiplied by the
probability of Feature 2 given Class A multiplied by the probability of Class A.
The fraction’s denominator is the probability of Feature 1 multiplied by the probability of
Feature 2.
What is an example of Naive Bayes? Below is a great example taken from a Stack Overflow
thread (Ram’s answer).
Here’s the deal:
We have a training dataset of 1,000 fruits.
The fruit can be a Banana, Orange or Other (these are the classes).
The fruit can be Long, Sweet or Yellow (these are the features).
The earlier steps (not reproduced here) build frequency tables showing how often each feature
occurs within each class, and use them to compute the numerator of Bayes' equation for the
Banana class.
Step 3: Ignore the denominator, since it'll be the same for all the other calculations.
Step 4: Do a similar calculation for the other classes.
Since the score for Banana is greater than the scores for Orange and Other, Naive Bayes would
classify this long, sweet and yellow fruit as a banana.
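A hedged sketch of the arithmetic in plain Python, with illustrative counts in the spirit of the Stack Overflow example (the exact figures here are assumptions):

    # Illustrative training counts for 1,000 fruits (assumed numbers)
    totals = {"Banana": 500, "Orange": 300, "Other": 200}
    feature_counts = {   # how many fruits of each class are Long / Sweet / Yellow
        "Banana": {"Long": 400, "Sweet": 350, "Yellow": 450},
        "Orange": {"Long": 0,   "Sweet": 150, "Yellow": 300},
        "Other":  {"Long": 100, "Sweet": 150, "Yellow": 50},
    }

    def score(cls, features):
        # Numerator of Bayes' theorem: P(f1|cls) * P(f2|cls) * ... * P(cls); denominator ignored
        s = totals[cls] / sum(totals.values())
        for f in features:
            s *= feature_counts[cls][f] / totals[cls]
        return s

    scores = {cls: score(cls, ["Long", "Sweet", "Yellow"]) for cls in totals}
    print(scores)                       # Banana's score is the largest
    print(max(scores, key=scores.get))  # -> Banana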
Is this supervised or unsupervised? This is supervised learning, since Naive Bayes is provided
a labeled training dataset in order to construct the tables.
Why use Naive Bayes? As you could see in the example above, Naive Bayes involves simple
arithmetic. It’s just tallying up counts, multiplying and dividing.
Once the frequency tables are calculated, classifying an unknown fruit just involves calculating
the probabilities for all the classes, and then choosing the highest probability.
Despite its simplicity, Naive Bayes can be surprisingly accurate. For example, it’s been found to
be effective for spam filtering.
Where is it used? Implementations of Naive Bayes can be found in Orange, scikit-
learn, Weka and R.
CART:
What does it do? CART stands for classification and regression trees. It is a decision tree
learning technique that outputs either classification or regression trees; when the target is
categorical, CART acts as a classifier.
Is a classification tree like a decision tree? A classification tree is a type of decision tree. The
output of a classification tree is a class.
For example, given a patient dataset, you might attempt to predict whether the patient will get
cancer. The class would either be “will get cancer” or “won’t get cancer.”
What’s a regression tree? Unlike a classification tree which predicts a class, regression trees
predict a numeric or continuous value e.g. a patient’s length of stay or the price of a smartphone.
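A hedged sketch contrasting the two in scikit-learn on made-up data (scikit-learn's trees are CART-style, as noted below):

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X = [[1], [2], [3], [10], [11], [12]]   # made-up single-feature data

    # Classification tree: the target is a class label
    clf = DecisionTreeClassifier(random_state=0).fit(X, ["short", "short", "short", "long", "long", "long"])
    print(clf.predict([[2.5]]))             # -> a class, e.g. "short"

    # Regression tree: the target is a numeric value
    reg = DecisionTreeRegressor(random_state=0).fit(X, [2.0, 2.5, 3.0, 9.5, 10.0, 11.0])
    print(reg.predict([[11.5]]))            # -> a number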
Why use CART? As a decision tree learning technique, CART offers the usual advantages of ease
of interpretation and explanation. CART is quite fast, quite popular and the output is
human readable.
Where is it used? scikit-learn implements CART in their decision tree classifier. R’s tree
package has an implementation of CART. Weka and MATLAB also have implementations.
Finally, Salford Systems has the only implementation of the original proprietary CART code
based on the theory introduced by world-renowned statisticians at Stanford University and the
University of California at Berkeley.