Module 4 BDA NOTES
www.vtupulse.com
Caselet: Predicting Heart Attacks
Using Decision Trees
• A study was done at UC San Diego concerning heart disease patient data. The
patients had been diagnosed with a heart attack based on chest pain, EKG results,
high enzyme levels in their heart muscles, and so on. The objective was to predict
which of these patients was at risk of dying from a second heart attack within
the next 30 days. The prediction would determine the treatment plan, such as
whether to keep the patient in intensive care or not. For each patient, more than
100 variables were collected, including demographics, medical history, and lab
data. Using that data and the CART algorithm, a decision tree was constructed.
• The decision tree showed that if blood pressure (BP) was low (≤90), the chance of
another heart attack was very high (70 percent).
• If the patient’s BP was OK, the next question to ask was the patient’s age. If the
age was low (≤62), then the patient’s survival was almost guaranteed (98
percent).
• If the age was higher, then the next question to ask was about sinus problems.
• If their sinus was OK, the chances of survival were 89 percent. Otherwise, the
chance of survival dropped to 50 percent. This decision tree predicts 86.5 percent
of the cases correctly.
Decision Tree Problem
• Imagine a conversation between a doctor and a patient. The doctor asks
questions to determine the cause of the ailment. The doctor would
continue to ask questions, till he or she is able to arrive at a reasonable
decision. If nothing seems plausible, he or she might recommend some
tests to generate more data and options.
• This is how experts in any field solve problems. They use decision trees or
decision rules. For every question they ask, the potential answers create
separate branches for further questioning. For each branch, the expert
would know how to proceed. The process continues until the end of the
tree, that is, a leaf node, is reached.
• Human experts learn from past experiences or data points. Similarly, a
machine can be trained to learn from the past data points and extract
some knowledge or rules from it.
Decision Tree Problem
A decision tree would have a predictive accuracy based on how often it makes
correct decisions.
1. The more training data is provided, the more accurate its knowledge
extraction will be, and thus, it will make more accurate decisions.
2. The more variables the tree can choose from, the better the tree will
come out, with higher accuracy.
3. In addition, a good decision tree should also be frugal so that it takes the
least number of questions, and thus, the least amount of effort, to get to
the right decision.
Decision Tree Exercise
• Here is an exercise to create a decision tree that helps make
decisions about approving the play of an outdoor game. The
objective is to predict the play decision given the atmospheric
conditions out there. The decision is: Should the game be allowed
or not? Here is the decision problem.
Decision Tree Exercise
If there were a row for sunny/hot/normal/windy condition in the data
table, it would match the current problem, and the decision from that
situation could be used to answer the problem today. However, there is
no such past instance in this case. There are three disadvantages of
looking up the data table:
1. As mentioned earlier, how does one decide when there is no row that
corresponds to the exact situation today? If no matching instance is
available in the database, past experience cannot guide the decision.
2. Searching through the entire past database may be time consuming,
depending on the number of variables and the organization of the
database.
3. What if the data values are not available for all the variables? In this
instance, if the humidity value were not available, looking up the past
data would not help.
Decision Tree Construction
Determining root node of the tree
In this example, there are four choices of questions based on the four
variables: what is the outlook, what is the temperature, what is the
humidity, and what is the wind speed?
Decision Tree Construction
• Start with any variable, in this case outlook. It can take three
values: sunny, overcast, and rainy.
• Start with the sunny value of outlook. There are five instances
where the outlook is sunny. In two of the five instances, the
play decision was yes, and in the other three, the decision was
no. Thus, if the decision rule was that outlook: sunny → no,
then three out of five decisions would be correct, while two
out of five such decisions would be incorrect. There are two
errors out of five. This can be recorded in Row 1.
Similarly, the rule outlook: rainy → yes makes two errors out of five (2/5).
Decision Tree Construction
• A similar analysis can be done for the other three variables. At the end of
that analytical exercise, the following error table will be constructed.
Attribute    Rules             Error   Total Error
Outlook      Sunny → No        2/5     4/14
             Overcast → Yes    0/4
             Rainy → Yes       2/5
Temp         Hot → No          2/4     5/14
             Mild → Yes        2/6
             Cool → Yes        1/4
Humidity     High → No         3/7     4/14
             Normal → Yes      1/7
Windy        False → Yes       2/8     5/14
             True → No         3/6
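The error counts for the outlook variable can be checked in code. Below is a minimal Python sketch, assuming the classic 14-instance weather (play) dataset this exercise is based on; only the outlook column is shown:

```python
from collections import Counter

# Outlook values and play decisions from the classic 14-instance weather
# dataset (an assumption; the slides' data table is not reproduced here).
outlook = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
           ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
           ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
           ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
           ("overcast", "yes"), ("rainy", "no")]

def rule_errors(rows):
    """For each attribute value, predict the majority class; count the errors."""
    by_value = {}
    for value, label in rows:
        by_value.setdefault(value, []).append(label)
    errors = {}
    for value, labels in by_value.items():
        majority_count = Counter(labels).most_common(1)[0][1]
        errors[value] = (len(labels) - majority_count, len(labels))
    return errors

errs = rule_errors(outlook)
# errs["sunny"] == (2, 5), errs["overcast"] == (0, 4), errs["rainy"] == (2, 5);
# the errors add up to 4 out of 14 instances.
```

The same function, run on the other three columns, reproduces the rest of the error table.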
Regression
• Regression is a well-known statistical technique to model
the predictive relationship between several independent
variables (IVs) and one dependent variable (DV).
• The objective is to find the best-fitting curve for a
dependent variable in a multidimensional space, with
each independent variable being a dimension.
• The curve could be a straight line, or it could be a
nonlinear curve.
• The quality of fit of the curve to the data can be
measured by the coefficient of correlation (r), whose
square (r²) is the proportion of variance explained by the
curve.
Regression
The key steps for regression are simple:
Correlations and Relationships
• Statistical relationships are about which elements
of data hang together, and which ones hang
separately.
• It is about categorizing variables that have a
relationship with one another, and categorizing
variables that are distinct and unrelated to other
variables.
• It is about describing significant positive
relationships and significant negative relationships.
Correlations
• The first and foremost measure of the strength of
a relationship is correlation. The strength of a
correlation is a quantitative measure in a
normalized range between 0 (zero) and 1.
• A correlation of 1 indicates a perfect relationship,
where the two variables are in perfect sync.
• A correlation of 0 indicates that there is no
relationship between the variables.
Relationships
• The relationship can be positive, or it can be an inverse
relationship, that is, the variables may move together
in the same direction or in the opposite direction.
• Therefore, a good measure is the correlation
coefficient, the signed square root of the proportion
of variance explained (r²).
• This coefficient, called r, can thus range from −1 to +1.
An r value of 0 signifies no relationship.
• An r value of 1 shows perfect relationship in the same
direction, and an r value of −1 shows a perfect
relationship but moving in opposite directions.
Correlations and Relationships
• Given two numeric variables x and y, the
coefficient of correlation r is mathematically
computed by the following equation, where x̄ is the
mean of x and ȳ is the mean of y:

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² )
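The formula can be computed directly. A short sketch in plain Python; the sample vectors are made up for illustration:

```python
import math

def correlation(xs, ys):
    """Pearson coefficient of correlation r between two numeric variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ 1.0 (perfect, same direction)
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1.0 (perfect, inverse)
```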
Visual Look at Relationships
Regression Exercise
• The regression model is described as a linear
equation that follows. y is the dependent variable,
that is, the variable being predicted. x is the
independent variable, or the predictor variable.
There could be many predictor variables (such as x1,
x2, . . .) in a regression equation. However, there can
be only one dependent variable (y) in the regression
equation.
A simple example of a regression equation would be to predict a house price
from the size of the house. Here are sample house data:
• Visually, one can see a positive correlation between house price and size
(sqft). However, the relationship is not perfect. Running a regression
model between the two variables produces the following output
(truncated).
Regression Statistics
  Multiple r     0.891
  r²             0.794

Coefficients
  Intercept     -54,191
  Size (sqft)    139.48
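Given those coefficients, the fitted line is Price = 139.48 × Size − 54,191. A quick sketch of using it for prediction; the 2,000 sqft house is a made-up example, not a row from the data:

```python
INTERCEPT = -54_191     # from the regression output above
SLOPE = 139.48          # dollars per square foot

def predict_price(sqft):
    """Predicted house price from the fitted linear equation."""
    return INTERCEPT + SLOPE * sqft

print(round(predict_price(2000)))  # 139.48 * 2000 - 54191 = 224769
```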
A simple example of a regression equation would be to predict a house price
from the size of the house. Here are the sample house data, with an extra
variable for the number of rooms:
Nonlinear Regression Exercise
• The relationship between the variables may
also be curvilinear. For example, given past
data from electricity consumption (kWh) and
temperature (temp), the objective is to predict
the electrical consumption from the
temperature value. Here are a dozen past
observations.
Nonlinear Regression Exercise
kWh       Temp (F)
12,530 46.8
10,800 52.1
10,180 55.1
9,730 59.2
9,750 61.9
10,230 66.2
11,160 69.9
13,910 76.8
15,690 79.3
15,110 79.7
17,020 80.2
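One way to capture this U-shaped relationship is to fit a quadratic, y = a·temp² + b·temp + c, by least squares. A self-contained sketch in plain Python using the observations above; the normal equations are solved by Gaussian elimination:

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c via the normal equations."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)                 # power sums of x
    t = lambda p: sum(y * x ** p for x, y in zip(xs, ys))
    # Augmented 3x3 system for the coefficients (a, b, c).
    A = [[s(4), s(3), s(2), t(2)],
         [s(3), s(2), s(1), t(1)],
         [s(2), s(1), n,    t(0)]]
    for i in range(3):                                    # forward elimination
        pivot = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            A[r] = [v - f * w for v, w in zip(A[r], A[i])]
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                                   # back substitution
        coef[i] = (A[i][3] - sum(A[i][j] * coef[j] for j in range(i + 1, 3))) / A[i][i]
    return coef                                           # [a, b, c]

temps = [46.8, 52.1, 55.1, 59.2, 61.9, 66.2, 69.9, 76.8, 79.3, 79.7, 80.2]
kwatts = [12530, 10800, 10180, 9730, 9750, 10230, 11160, 13910, 15690, 15110, 17020]
a, b, c = fit_quadratic(temps, kwatts)
# a > 0: the fitted parabola opens upward, with minimum consumption
# near -b / (2*a) degrees, around the middle of the temperature range.
```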
Cluster Analysis
• Cluster analysis is used for automatic identification of natural
groupings of things. It is also known as the segmentation technique.
• In this technique, data instances that are similar to (or near) each
other are categorized into one cluster.
• Similarly, data instances that are very different (or far away) from
each other are moved into different clusters.
• Clustering is an unsupervised learning technique as there is no
output or dependent variable for which a right or wrong answer can
be computed.
• The correct number of clusters or the definition of those clusters is
not known ahead of time.
• Clustering techniques can only suggest to the user how many
clusters would make sense from the characteristics of the data.
Applications of Cluster Analysis
• Market segmentation: Categorizing customers
according to their similarities, for example by
their common wants and needs and propensity to
pay, can help with targeted marketing.
• Product portfolio: People of similar sizes can be
grouped together to make small, medium, and
large sizes for clothing items.
• Text mining: Clustering can help organize a given
collection of text documents according to their
content similarities into clusters of related topics.
Clustering Techniques
• Cluster analysis is a machine-learning technique. The quality of a
clustering result depends on the algorithm, the distance function,
and the application.
• First, consider the distance function. Most cluster analysis methods
use a distance measure to calculate the closeness between pairs of
items.
• There are two major measures of distance: Euclidean distance is
the most intuitive measure.
• The other popular measure is the Manhattan (rectilinear) distance,
where one can go only in orthogonal directions.
• In either case, the key objective of the clustering algorithm is the
same:
• Inter-cluster distances ⇒ maximized
• Intra-cluster distances ⇒ minimized
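The two distance measures can be sketched in a few lines; the points are arbitrary examples:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Rectilinear distance: movement only along orthogonal directions."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
```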
Here is the generic pseudocode for
clustering
1. Pick an arbitrary number of groups/segments to
be created.
2. Start with some initial randomly chosen center
values for groups.
3. Classify instances to closest groups.
4. Compute new values for the group centers.
5. Repeat Steps 3 and 4 till groups converge.
6. If clusters are not satisfactory, go to Step 1 and
pick a different number of groups/segments.
Clustering Exercise
X Y
2 4
2 6
5 6
4 7
8 3
6 6
5 2
5 7
6 3
4 4
K-Means Algorithm for Clustering
Here is the pseudocode for implementing a K-means
algorithm.
Algorithm K-Means (K number of clusters, D list of data
points)
1. Choose K number of random data points as initial
centroids (cluster centers).
2. Repeat till cluster centers stabilize:
a. Allocate each point in D to the nearest of K
centroids.
b. Compute centroid for the cluster using all points in
the cluster.
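The pseudocode translates into a short Python sketch, run here on the ten (X, Y) points from the clustering exercise above; the two starting centers are an arbitrary choice (an assumption, not from the slides):

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Steps a and b from the pseudocode, repeated till the centers stabilize."""
    for _ in range(max_iter):
        # Step a: allocate each point to the nearest of the K centroids.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step b: recompute each centroid as the mean of its cluster's points.
        new = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:        # centers have stabilized
            break
        centroids = new
    return clusters, centroids

# The ten (X, Y) points from the clustering exercise.
pts = [(2, 4), (2, 6), (5, 6), (4, 7), (8, 3), (6, 6), (5, 2), (5, 7), (6, 3), (4, 4)]
clusters, centers = kmeans(pts, centroids=[(2, 4), (8, 3)])
# Converges to a 7-point cluster centered near (4, 5.7) and a
# 3-point cluster centered near (6.3, 2.7).
```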
Advantages and Disadvantages of K-Means Algorithm
There are many advantages of the K-means algorithm:
1. The K-means algorithm is simple, easy to understand, and easy to
implement.
2. It is also efficient: the time taken to cluster rises
linearly with the number of data points.
3. In general, no other clustering algorithm performs better than
K-means.
Data about height and weight for a few volunteers is available.
Create a set of clusters for the following data, to decide how
many sizes of T-shirts should be ordered.
Height Weight
71 165
68 165
72 180
67 113
72 178
62 101
70 150
69 172
72 185
63 149
69 132
61 115
Artificial Neural Networks
• Artificial neural networks (ANNs) are inspired by the
information processing model of the mind/brain.
• The human brain consists of billions of neurons that
link with one another in an intricate pattern.
• Every neuron receives information from many other
neurons, processes it, gets excited or not, and passes
its state information to other neurons.
Business Applications of ANN
1. They are used in stock price prediction where the
rules of the game are extremely complicated, and a
lot of data needs to be processed very quickly.
2. They are used for character recognition, as in
recognizing handwritten text, or damaged or mangled
text. They are used in recognizing fingerprints. These
are complicated patterns and are unique for each
person. Layers of neurons can progressively clarify the
pattern.
3. They are also used in traditional classification
problems, such as approving a loan application.
Developing an ANN
The steps required to build an ANN are as follows:
1. Gather data: Divide the data into training data and test data. The
training data needs to be further divided into training, validation,
and testing subsets.
2. Select the network architecture, such as feedforward network.
3. Select the algorithm, such as Multilayer Perceptron.
4. Set network parameters.
5. Train the ANN with training data.
6. Validate the model with validation data.
7. Freeze the weights and other parameters.
8. Test the trained network with test data.
9. Deploy the ANN when it achieves good predictive accuracy.
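As a toy illustration of the train-then-test idea in steps 5 and 8, here is a single artificial neuron (a perceptron) learning the AND function. This is a minimal sketch of one neuron, not the full multilayer workflow:

```python
import random

def train_perceptron(samples, epochs=20, lr=0.1):
    """Train one neuron: nudge the weights after every wrong prediction."""
    random.seed(0)                                # reproducible starting weights
    w = [random.uniform(-0.5, 0.5) for _ in range(2)]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out                    # 0 when the prediction is correct
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

train_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(train_data)
predict = lambda x1, x2: 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
# After training, predict reproduces AND on all four inputs.
```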
Training an ANN: Training data is split into three parts

Training set: This data set is used to adjust the weights on the
neural network (∼60%).

Validation set: This data set is used to minimize overfitting and
verify accuracy (∼20%).

Testing set: This data set is used only for testing the final solution
in order to confirm the actual predictive power of the network (∼20%).

k-fold cross-validation: This approach means that the data is divided
into k equal pieces, and the learning process is repeated k times, with
each piece in turn serving as the testing set. This process leads to
less bias and more accuracy, but is more time consuming.
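The k-fold idea can be sketched in a few lines of Python; the ten-element data list is just an illustration:

```python
def k_fold_splits(data, k):
    """Divide data into k pieces; each piece serves once as the held-out set."""
    folds = [data[i::k] for i in range(k)]        # k roughly equal pieces
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out

data = list(range(10))
splits = list(k_fold_splits(data, k=5))
# 5 rounds; every element is held out exactly once across the rounds.
```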
Advantages of Using ANNs
1. ANNs impose very little restrictions on their use.
2. There is no need to program ANNs, as they learn
from examples. They get better with use, without much programming
effort.
3. ANN can handle a variety of problem types, including classification,
clustering, associations, and so on.
4. ANNs are tolerant of data quality issues, and they do not restrict the
data to follow strict normality and/or independence assumptions.
5. ANN can handle both numerical and categorical variables.
6. ANNs can be much faster than other techniques.
7. Most importantly, ANN usually provide better results (prediction
and/or clustering) compared to statistical counterparts, once they
have been trained enough.
Disadvantages of Using ANNs
1. They are deemed to be black-box solutions, lacking
explainability.
2. Optimal design of ANN is still an art: It requires
expertise and extensive experimentation.
3. It can be difficult to handle a large number of
variables (especially the rich nominal attributes)
with an ANN.
4. It takes large data sets to train an ANN.
Association Rule Mining
• Association rule mining is a popular,
unsupervised learning technique, used in
business to help identify shopping patterns.
• It is also known as market basket analysis. It
helps find interesting relationships (affinities)
between variables (items or events).
• Thus, it can help cross-sell related items and
increase the size of a sale.
Association Rule Mining
• All data used in this technique is categorical.
• There is no dependent variable. It uses machine-learning
algorithms. The fascinating “relationship between sales of
diapers and beers” is how it is often explained in popular
literature.
• This technique accepts as input the raw point-of-sale
transaction data.
• The output produced is the description of the most
frequent affinities among items.
• An example of an association rule would be, “a Customer
who bought a laptop computer and virus protection
software also bought an extended service plan 70 percent
of the time.”
Business Applications of Association
Rules
• In business environments a pattern or knowledge can be used for many
purposes.
• In sales and marketing, it is used for e-commerce site design, online
advertising optimization, product pricing, and sales/promotion
configurations. This analysis can suggest not to put one item on sale at a
time, and instead to create a bundle of products promoted as a package
to sell other non-selling items.
• In retail environments, it can be used for store design. Strongly
associated items can be kept close together for customer convenience.
Or they could be placed far from each other so that the customer has to
walk the aisles and by doing so is potentially exposed to other items.
• In medicine, this technique can be used for relationships between
symptoms and illnesses; diagnosis and patient characteristics /
treatments; genes and their functions; and so on.
Representing Association Rules
• A generic rule is represented between a set X and
Y: X ⇒ Y [S%, C%]
• X, Y: products and/or services
• X: Left-hand-side (LHS or Antecedent)
• Y: Right-hand-side (RHS or Consequent)
• S: Support: how often X and Y go together in the
total transaction set
• C: Confidence: how often Y occurs in transactions that contain X
• Example: {Laptop Computer, Antivirus Software}
⇒ {Extended Service Plan} [30%, 70%]
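These definitions are easy to compute directly. A sketch with a hypothetical five-transaction set (L = laptop computer, A = antivirus software, E = extended service plan; the data is made up, not the 30%/70% example above):

```python
def support_confidence(transactions, lhs, rhs):
    """Support of X∪Y and confidence of the rule X ⇒ Y over a transaction set."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence

txns = [{"L", "A", "E"}, {"L", "A"}, {"L", "E"}, {"B"}, {"L", "A", "E", "B"}]
s, c = support_confidence(txns, {"L", "A"}, {"E"})
# s == 0.4  (2 of 5 transactions contain all of L, A, E)
# c ≈ 0.667 (2 of the 3 transactions containing {L, A} also contain E)
```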
Algorithms for Association Rule
• Not all association rules are interesting and useful, only
those that are strong rules and also those that occur
frequently.
• In association rule mining, the goal is to find all rules that
satisfy the user-specified minimum support and minimum
confidence.
• The resulting sets of rules are all the same irrespective of
the algorithm used, that is, given a transaction data set T, a
minimum support and a minimum confidence, the set of
association rules existing in T is uniquely determined.
• The most popular algorithms are Apriori, Eclat, and FP-
growth, along with various derivatives and hybrids of the
three.
Apriori Algorithm
• This is the most popular algorithm used for association rule
mining.
• A frequent itemset is an itemset whose support is greater than or
equal to minimum support threshold.
• The Apriori property is a downward closure property, which means
that any subsets of a frequent itemset are also frequent itemsets.
• Thus, if (A,B,C,D) is a frequent itemset, then any subset such as
(A,B,C) or (B,D) is also a frequent itemset.
• This uses a bottom-up approach; and the size of frequent subsets
is gradually increased, from one-item subsets to two-item subsets,
then three-item subsets, and so on.
• Groups of candidates at each level are tested against the data for
minimum support.
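The bottom-up search can be sketched as follows; the five transactions are a made-up mini example, not the exercise data:

```python
from itertools import combinations

def apriori_frequent(transactions, min_support):
    """Find all frequent itemsets, growing candidates one item at a time."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    frequent = {}                                  # itemset -> support
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    k = 1
    while level:
        # Test this level's candidates against the data for minimum support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        kept = {c: v / n for c, v in counts.items() if v / n >= min_support}
        frequent.update(kept)
        # Downward closure: a (k+1)-item candidate is viable only if
        # all of its k-item subsets are frequent.
        level = []
        for a, b in combinations(list(kept), 2):
            cand = a | b
            if (len(cand) == k + 1 and cand not in level
                    and all(frozenset(s) in kept for s in combinations(cand, k))):
                level.append(cand)
        k += 1
    return frequent

txns = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread", "butter"},
        {"milk", "bread", "butter"}, {"milk", "cookies"}]
freq = apriori_frequent(txns, min_support=0.5)
# Frequent: {milk}, {bread}, {butter}, {milk, bread}, {bread, butter};
# {milk, butter} appears in only 2 of 5 transactions and is pruned.
```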
Association Rules Exercise
• Here are a dozen sales transactions.
• There are six products being sold: Milk, Bread, Butter, Eggs,
Cookies, and Ketchup.
• Transaction #1 sold Milk, Eggs, Bread, and Butter. Transaction #2
sold Milk, Butter, Eggs, and Ketchup. And so on.
• The objective is to use this transaction data to find affinities
between products, that is, which products sell together often.
• The support level will be set at 33 percent; the confidence level
will be set at 50 percent. That means that we have decided to
consider rules from only those itemsets that occur at least 33
percent of the time in the total set of transactions.
• Confidence level means that within those itemsets, the rules of
the form X → Y should be such that there is at least a 50 percent
chance of Y occurring given that X occurs.
Association Rules Exercise
Transactions List
7 Milk Cookies
Association Rules Exercise
• The second step is to go for the next level of
itemsets using items selected earlier: 2-item
itemsets.
2-item Sets Freq
Milk, Bread 7
Milk, Butter 7
Milk, Cookies 3
Bread, Butter 9
Butter, Cookies 3
Bread, Cookies 4
Association Rules Exercise
• Thus, (Milk, Bread) sell together 7 times out of 12. (Milk, Butter) sell
together 7 times, (Bread, Butter) sell together 9 times, and (Bread, Cookies)
sell together 4 times. However, only four of these itemsets meet the minimum
support level of 33 percent (at least 4 of the 12 transactions).
Association Rules Exercise
• The next step is to go for the next higher level of itemsets: 3-item
itemsets.