Unit-3 Data Analytics Material
Unit-III
Q. No-1 What are association rules? Explain
Association Analysis: It discovers the probability of occurrence of items in a
collection. It helps in finding some interesting relationships in large datasets.
A dataset contains data objects, each containing a set of attributes. An attribute is
also called a dimension, feature, or variable representing a data object's
characteristic features.
For example: Height, qualification, color, etc.
Association rule mining: The below figure shows the general logic behind association
rules. Given a large collection of transactions, in which each transaction consists of
one or more items, association rules go through the items being purchased to see
what items are frequently bought together and to discover a list of rules that
describe the purchasing behavior. The goal of association rules is to discover
interesting relationships among the items. The interesting relationships depend on
the business context and the nature of the algorithm being used for the discovery.
Each of the uncovered rules is of the form X→Y, meaning that when item X is
observed, item Y is also observed. In this case, the left-hand side (L.H.S.) of the rule
is X and the right-hand side (R.H.S.) of the rule is Y.
Using association rules, patterns can be discovered from the data that allow
the association rules algorithms to disclose rules of related product purchases. The
uncovered rules are listed on the right side of the figure. The first three rules suggest
that when bread is purchased, 90% of the time milk is also purchased; when eggs are
purchased, 40% of the time bread is also purchased; and when milk is purchased,
23% of the time eggs are also purchased.
In the example of a retail store, association rules are used over transactions
that consist of one or more items. In fact, because of their popularity in mining customer
transactions, association rules are sometimes referred to as market basket analysis.
Each transaction can be viewed as the shopping basket of a customer that contains
one or more items; this is also known as an itemset. The term itemset refers to a
collection of items or individual entities that contain some kind of relationship. This
could be a set of hyperlinks clicked on by one user in a single session, or a set of tasks
done in one day. An itemset containing k items is called a k-itemset, denoted by
{item 1, item 2, …, item k}.
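The percentages in rules such as "when bread is purchased, 90% of the time milk is also purchased" are confidence values, and the frequency of an itemset across all transactions is its support. As a minimal illustration, the Python sketch below computes both measures; the baskets and item names are invented for the example.

```python
# A minimal sketch (toy data invented for illustration): computing the
# support of an itemset and the confidence of a rule X -> Y.
baskets = [
    {"bread", "milk"},
    {"bread", "eggs", "milk"},
    {"eggs", "milk"},
    {"bread", "milk"},
    {"bread"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in the itemset
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(X, Y):
    # Of the transactions containing X, the fraction that also contain Y
    return support(X | Y) / support(X)

print(support({"bread"}))               # fraction of baskets with bread
print(confidence({"bread"}, {"milk"}))  # when bread is bought, how often milk is too
```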
Q. No-2 What is the Apriori algorithm? Explain.
Ans: The Apriori algorithm takes a bottom-up iterative approach to uncover the
frequent itemsets by first determining all the possible items (or 1-itemsets, for example
{bread}, {eggs}, {milk}) and then identifying which among them are frequent.
Assuming the minimum support threshold is set to 0.5, the algorithm identifies and
retains those itemsets that appear in at least 50% of all transactions and discards the
itemsets that have support less than 0.5, that is, those appearing in fewer than 50% of
the transactions.
In the next iteration of the Apriori algorithm, the identified frequent 1-itemsets are
paired into 2-itemsets (for example, {bread, eggs}, {milk, bread}, …) and again
evaluated to identify the frequent 2-itemsets among them.
At each iteration, the algorithm checks whether the support criterion can be met; if it
can, the algorithm grows the itemset, repeating the process until it runs out of
support or until the itemsets reach a predefined length. Let variable Ck be the set of
candidate k-itemsets and variable Lk be the set of k-itemsets that satisfy the minimum
support. Given a transaction database D, a minimum support threshold δ, and an
optional parameter N indicating the maximum length an itemset could reach, Apriori
iteratively computes frequent itemsets Lk+1 based on Lk.
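As a hedged illustration of this bottom-up iteration (this is not the text's own pseudocode; the candidate-generation helper and the toy baskets are invented), a minimal Apriori sketch in Python:

```python
# A minimal Apriori sketch: returns every frequent itemset with its support.
from itertools import combinations

def apriori(transactions, min_support):
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    L = {s for s in items if support(s) >= min_support}
    frequent = {s: support(s) for s in L}

    k = 2
    while L:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Keep candidates that meet the minimum support threshold
        L = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in L})
        k += 1
    return frequent

baskets = [{"bread", "milk"}, {"bread", "eggs"},
           {"bread", "milk", "eggs"}, {"milk"}]
print(apriori(baskets, min_support=0.5))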
Besides market basket analysis, association rules are commonly used
for recommender systems and clickstream analysis.
Many online service providers such as Amazon and Netflix use recommender
systems. Recommender systems can use association rules to discover related
products or identify customers who have similar interests. For example, association
rules may suggest that customers who have bought products A, B, and C have
similar interests, so products purchased by those customers can be recommended to
one another. These findings provide opportunities for retailers to cross-sell their
products.
Clickstream analysis refers to the analytics on data related to web browsing and
user clicks, which are stored on the client or the server side. Web usage log files
generated on web servers contain huge amounts of information, and association
rules can potentially give useful knowledge to web usage data analysis. For example,
association rules may suggest that website visitors who land on page X click on links
A, B, and C much more often than links D, E, and F. This observation provides valuable
insight on how to better personalize and recommend content to site visitors.
Q. No-4 What is clustering? Give an overview
Ans: In general, clustering is the use of unsupervised techniques for grouping similar
objects. In machine learning, unsupervised refers to the problem of finding hidden
structure within unlabeled data. Clustering techniques are unsupervised in the
sense that the data scientist does not determine, in advance, the labels to apply to
the clusters. The structure of the data describes the objects of interest and determines
how best to group the objects.
Clustering is a method often used for exploratory analysis of the data. In clustering,
there are no predictions made. Rather, clustering methods find the similarities
between objects according to the object attributes and group the similar objects into
clusters. Clustering techniques are utilized in marketing, economics, and various
branches of science. A popular clustering method is K-means.
Q. No-5 What is the K-means algorithm? How does K-means clustering
work? Explain.
K-Means Clustering is an unsupervised machine learning algorithm that groups
the unlabeled dataset into different clusters.
Unsupervised machine learning
is the process of teaching a computer to use unlabeled, unclassified data and
enabling the algorithm to operate on that data without supervision. Without any
previous training on the data, the machine's job in this case is to organize unsorted
data according to similarities, patterns, and differences.
K-means clustering works by assigning data points to one of the K clusters depending
on their distance from the center of the clusters. It starts by randomly placing the
cluster centroids in the space. Then each data point is assigned to one of the
clusters based on its distance from the centroid of the cluster. After assigning each
point to one of the clusters, new cluster centroids are computed. This process runs
iteratively until it finds good clusters. In the analysis, we assume that the number of
clusters is given in advance and we have to put points in one of the groups.
In some cases, K is not clearly defined, and we have to think about the optimal
number of K. K-means clustering performs best when the data is well separated;
when data points overlap, this clustering is not suitable. K-means is faster as compared
to other clustering techniques. It provides a strong coupling between the data points.
K-means clusters do not provide clear information regarding the quality of clusters.
Different initial assignments of cluster centroids may lead to different clusters.
Also, the K-means algorithm is sensitive to noise and may get stuck in local minima.
How does k-means clustering work?
We are given a data set of items, with certain features, and values for these
features (like a vector). The task is to categorize those items into groups. To achieve
this, we will use the K-means algorithm, an unsupervised learning algorithm. ‘K’ in
the name of the algorithm represents the number of groups/clusters we want to
classify our items into.
(It will help if you think of items as points in an n-dimensional space). The algorithm
will categorize the items into k groups or clusters of similarity. To calculate that
similarity, we will use the Euclidean distance as a measurement.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s
coordinates, which are the averages of the items categorized in that
cluster so far.
3. We repeat the process for a given number of iterations and at the end,
we have our clusters.
The “points” mentioned above are called means because they are the mean values
of the items categorized in them. To initialize these means, we have a lot of
options. An intuitive method is to initialize the means at random items in the data
set. Another method is to initialize the means at random values between the
boundaries of the data set (if for a feature x the items have values in [0,3], we will
initialize the means with values for x in [0,3]). In each iteration, each mean is then
updated by shifting it to the average of the items currently assigned to its cluster.
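As a minimal sketch of these three steps in plain Python (the points and the fixed iteration count are invented for illustration; a fuller version would stop once the means no longer move):

```python
# A minimal k-means sketch (illustrative; assumes points are tuples of numbers).
import math
import random

def kmeans(points, k, iterations=100):
    # Step 1: initialize the means at k random items from the data set
    means = random.sample(points, k)
    for _ in range(iterations):
        # Step 2: categorize each item to its closest mean (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        # Step 3: update each mean to the average of the items in its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                means[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return means, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.2)]
centroids, groups = kmeans(pts, k=2)
print(centroids)
```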
Key Points:
1. A Decision Tree is a supervised learning algorithm that can be used for solving
both classification and regression problems
2. It can solve problems for both categorical and numerical data
3. Decision Tree regression builds a tree-like structure in which each internal
node represents the “test” for an attribute, each branch represents the result
of the test, and each leaf node represents the final decision or result
4. A decision tree is constructed starting from the root node (parent node),
which splits into left and right child nodes. These child nodes are further
divided into their own child nodes, themselves becoming the parent nodes of
those nodes. Consider the below image:
The above image shows an example of decision tree regression; here, the
model is trying to predict the choice of a person between sports cars and
luxury cars.
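A hedged sketch of such a model using scikit-learn (assumed available; the features, values, and labels below are invented for illustration). Because the predicted choice here is categorical, the sketch uses a classifier:

```python
# A minimal decision tree sketch with scikit-learn; toy data is invented.
from sklearn.tree import DecisionTreeClassifier

# Features: [age, income]; label: 0 = sports car, 1 = luxury car
X = [[22, 40], [25, 45], [48, 90], [52, 120], [30, 60], [60, 150]]
y = [0, 0, 1, 1, 0, 1]

# Each internal node tests an attribute; each leaf holds a final decision
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree.predict([[28, 50]]))  # predicted choice for a new person
```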
1. Random forest is one of the most powerful supervised learning algorithms,
capable of performing regression as well as classification tasks.
2. Random forest regression is an ensemble learning method that combines
multiple decision trees and predicts the final output based on the average
of each tree's output. The individual decision trees are called base models,
and the prediction can be represented more formally as:
ŷ = (1/N) × (f1(x) + f2(x) + … + fN(x))
where f1, …, fN are the N individual trees (see the sketch after this list).
3. Random forest uses the bootstrap aggregation (bagging) technique of ensemble
learning, in which the aggregated decision trees run in parallel and do not
interact with each other.
4. With the help of random forest regression, we can prevent overfitting in
the model by creating random subsets of the dataset.
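A hedged sketch of random forest regression with scikit-learn's RandomForestRegressor (assumed available; the toy data is invented):

```python
# A minimal random forest regression sketch; toy data is invented.
from sklearn.ensemble import RandomForestRegressor

X = [[1], [2], [3], [4], [5], [6]]   # toy feature values
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # toy targets

# 10 bootstrap-aggregated trees; the prediction is the average of tree outputs
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
print(forest.predict([[3.5]]))
```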
Bayes’ theorem:
1. Bayes' theorem is also known as Bayes' rule or Bayes' law. It is used to
determine the probability of a hypothesis with prior knowledge, and it depends
on conditional probability.
2. The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)

Where:
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Problem: If the weather is sunny, should the player play or not?

Dataset (Weather vs. Play):

Day   Weather    Play
0     Rainy      Yes
1     Sunny      Yes
2     Overcast   Yes
3     Overcast   Yes
4     Sunny      No
5     Rainy      Yes
6     Sunny      Yes
7     Overcast   Yes
8     Rainy      No
9     Sunny      No
10    Sunny      Yes
11    Rainy      No
12    Overcast   Yes
13    Overcast   Yes

Frequency table:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4

Likelihood table:

Weather     No            Yes            P(Weather)
Overcast    0             5              5/14 = 0.35
Rainy       2             2              4/14 = 0.29
Sunny       2             3              5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71   1
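Applying Bayes' theorem now answers the question (a worked completion consistent with the tables above):

P(Yes|Sunny) = P(Sunny|Yes) × P(Yes) / P(Sunny) = (3/10 × 10/14) / (5/14) = 0.60
P(No|Sunny) = P(Sunny|No) × P(No) / P(Sunny) = (2/4 × 4/14) / (5/14) = 0.40

Since P(Yes|Sunny) > P(No|Sunny), the player should play when the weather is sunny.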
Q. No-11 What is attribute selection? What are various methods used for this?
(OR)
How to choose the best attribute at each node in a decision tree?
Ans: An attribute selection measure is a technique used in the data mining
process for data reduction. Data reduction is necessary for better analysis
and prediction of the target variable.
1. Gini index:
The measure of the degree of probability of a particular variable being wrongly
classified when it is randomly chosen is called the Gini index or Gini impurity. A
Gini index of 0 means all elements of a partition belong to a single class, while
higher values mean the elements are more evenly distributed across classes.

Mathematical formula:

Gini(D) = 1 − Σ (pi)², summed over the m classes,

where pi is the probability that a tuple in D belongs to class Ci. When the Gini index
is used as the criterion for the algorithm to select the feature for the root node, the
feature with the least Gini index is selected.
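For example, for the weather data used earlier (10 Yes and 4 No tuples out of 14), a worked computation of the Gini index of the whole partition is:

Gini(D) = 1 − ((10/14)² + (4/14)²) = 1 − (0.510 + 0.082) ≈ 0.41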
• Splitting criterion is called the best when after splitting, each partition will be
pure.
• A partition is called pure when all the tuples that fall into the partition
belong to the same class.
• Attribute selection measures are also known as splitting rules because they
determine how the tuples at a given node are to be split.
• First, a rank is provided for each attribute that describes the training tuples.
The attribute having the best score for the measure is chosen as the splitting
attribute for the given tuples.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond
directly to the known values of A, and a branch is created for each known
value of A.
2. A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A ≤ split point and A > split point,
respectively, where the split point is returned by the attribute
selection method as part of the splitting criterion.
3. If A is discrete-valued and a binary tree must be produced, then the test is of
the form A ∈ SA, where SA is the splitting subset for A.
• According to the algorithm, the tree node created for partition D is labeled
with the splitting criterion, and the tuples are partitioned accordingly [also
shown in the figure].
• Information gain:
The attribute with the highest information gain is chosen as the splitting attribute.
This attribute minimizes the information needed to classify the tuples in the
resulting partitions.
Let D, the data partition, be a training set of class-labeled tuples.
Suppose the class label attribute has m distinct values defining m distinct classes,
Ci (for i = 1, …, m). Let Ci,D be the set of tuples of class Ci in D, and let |D| and
|Ci,D| denote the number of tuples in D and Ci,D, respectively.

The expected information needed to classify a tuple in D is:

Info(D) = − Σ pi log2(pi), summed over i = 1, …, m,

where pi = |Ci,D| / |D| is the probability that a tuple in D belongs to class Ci. If the
tuples are partitioned on attribute A into v partitions {D1, …, Dv}, the expected
information still needed to classify a tuple after partitioning on attribute A is:

InfoA(D) = Σ (|Dj| / |D|) × Info(Dj), summed over j = 1, …, v.

• The term |Dj| / |D| acts as the weight of the jth partition. InfoA(D) is the
expected information required to classify a tuple from D based on the
partitioning by A, and the information gain is Gain(A) = Info(D) − InfoA(D).
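As a worked illustration on the weather data used earlier (10 Yes and 4 No tuples; splitting on Weather gives Overcast 5 Yes/0 No, Rainy 2 Yes/2 No, Sunny 3 Yes/2 No):

Info(D) = −(10/14) log2(10/14) − (4/14) log2(4/14) ≈ 0.863
InfoWeather(D) = (5/14)(0) + (4/14)(1.0) + (5/14)(0.971) ≈ 0.632
Gain(Weather) = 0.863 − 0.632 ≈ 0.231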
Q. No-3 Discuss the importance of the selection of K in the K-means algorithm and
describe the methods to choose an appropriate value of K.
Selecting the appropriate number of clusters K is crucial for the performance of the
K-means algorithm. If K is too small, distinct groups may be merged; if K is too large,
the algorithm may find meaningless clusters. The methods to choose K include:
1. Elbow Method: Plot the WCSS (within-cluster sum of squares) against the
number of clusters and look for an "elbow" point where the rate of decrease
sharply slows (see the sketch after this list).
2. Silhouette Analysis: Measures how similar an object is to its own cluster
compared to other clusters. The average silhouette score can help
determine the best K.
3. Cross-Validation: Use techniques like cross-validation to evaluate the
performance of different K values based on predefined criteria.
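A hedged sketch of the elbow method using scikit-learn (assumed available; the blob data is synthetic), inspecting the WCSS, which scikit-learn exposes as inertia_, for a range of K values:

```python
# Elbow method sketch: WCSS for k = 1..9 on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

# Look for the "elbow" k where the drop in WCSS levels off
for k in range(1, 10):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wcss, 1))
```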
Q. No-4 Describe how you would evaluate the quality of clusters produced by the K-
means algorithm.
The quality of clusters produced by the K-means algorithm can be evaluated using
several methods:
1. Within-Cluster Sum of Squares (WCSS): Measures the compactness of the
clusters; lower WCSS indicates tighter clusters.
2. Between-Cluster Sum of Squares (BCSS): Measures the separation of the
clusters; higher BCSS indicates well-separated clusters.
3. Silhouette Score: Measures the cohesion and separation of clusters. Values
range from −1 to +1, with higher values indicating better-defined clusters
(see the sketch after this list).
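A minimal sketch computing the silhouette score with scikit-learn (assumed available; the data is synthetic):

```python
# Silhouette score sketch on synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # closer to +1 means better-defined clusters
```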
Linear regression is widely used in real-world data analytics for predicting and
understanding relationships between variables. An example application is in the
housing market:
1. Problem: Predicting house prices based on various features like square footage,
number of bedrooms, location, and age of the house.
2. Application: A linear regression model can be built where the dependent variable
is the house price and the independent variables are the features mentioned.
Steps:
1. Data collection: Gather data on house prices and features.
2. Data preprocessing: Clean the data, handle missing values, and encode
categorical variables.
3. Model building: Fit a linear regression model to the data.
4. Evaluation: Use R-squared, residual analysis, and cross-validation to evaluate the
model.
5. Prediction: Use the model to predict prices of new houses based on their
features (a sketch of steps 3-5 follows this list).
6. Outcome: The model helps real estate agents and buyers estimate house prices
based on various factors, aiding in better decision-making and pricing strategies.
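A hedged sketch of steps 3-5 with scikit-learn (assumed available); the houses, features, and prices below are invented toy values:

```python
# Linear regression sketch: model building, evaluation, and prediction.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Features: [square footage, bedrooms, age]; target: price (toy values)
X = [[1400, 3, 20], [1600, 3, 15], [1700, 4, 10], [1875, 4, 5], [2350, 5, 2]]
y = [245000, 312000, 329000, 360000, 500000]

model = LinearRegression().fit(X, y)        # step 3: model building
print(r2_score(y, model.predict(X)))        # step 4: evaluation (R-squared)
print(model.predict([[1500, 3, 12]]))       # step 5: predict a new house's price
```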
1. Overfitting: Decision trees can easily overfit the training data, especially if they
are deep and complex.
2. Instability: Small changes in the data can lead to significantly different trees.
3. Bias: Decision trees can be biased towards features with more levels or categories.
4. Noise sensitivity: They are sensitive to noisy data, which can lead to incorrect splits.
Q. No-7 How can the performance of a Naïve Bayes classifier be evaluated, and
what metrics are commonly used for this purpose?
The performance of a Naïve Bayes classifier can be evaluated using several metrics
commonly used in classification tasks:
1. Accuracy: The proportion of correctly classified instances among the total
instances
2. Precision: The proportion of true positive predictions among the total predicted
positives, indicating the accuracy of positive predictions.
3. Recall: The proportion of true positive predictions among the actual positives,
indicating the model’s ability to identify positive instances.
4. F1 Score: The harmonic mean of precision and recall, providing a balance
between them.
5. Confusion Matrix: A table showing the true vs. predicted classifications, useful for
calculating other metrics.
6. ROC-AUC Score: The area under the Receiver Operating Characteristic curve,
measuring the model's ability to distinguish between classes (see the sketch
after this list).
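A minimal sketch computing these metrics with scikit-learn (assumed available); the true labels, predictions, and probabilities below are invented for illustration:

```python
# Classification metrics sketch; labels and scores are toy values.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = spam, 0 = not spam
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier's hard predictions
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]  # predicted P(spam)

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))   # true vs. predicted counts
print(roc_auc_score(y_true, y_prob))      # uses probabilities, not hard labels
```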
Steps:
1. Data collection: Gather a dataset of emails labeled as spam or not spam.