Data Mining Handouts 1
Definition:
Data mining is the process of discovering patterns, correlations, and useful insights from large volumes of data using statistical, machine learning, and database management techniques. It is a critical step in the Knowledge Discovery in Databases (KDD) process, which involves identifying meaningful information hidden within raw data.
Data Warehousing
Definition:
A data warehouse is a centralized repository designed for storing, managing, and querying large volumes of structured data. It integrates data from multiple sources and formats for analysis and reporting.
Introduction to Data Mining
1. Insight Generation:
Helps organizations uncover hidden patterns, trends, and relationships in data, leading to informed decision-making.
2. Predictive Analytics:
Uses historical data to anticipate future trends and behavior.
Data mining works with many kinds of data, including structured and semi-structured data, and supports applications ranging from product recommendations to manufacturing and supply chain management.
Real-World Example
A retail company like Amazon uses data mining to analyze purchase histories and browsing
behaviors to recommend products tailored to individual customers. This improves the shopping
experience while boosting sales and customer loyalty.
Market Basket Analysis (MBA)
Definition:
Market Basket Analysis (MBA) is a data mining technique used to identify patterns and relationships between different products or items purchased together by customers.
It helps businesses in recommendation systems, sales strategies, and inventory management.
Key Concepts:
1. Support:
The fraction of all transactions that contain an item (or itemset):
Support(A) = (Number of transactions containing A) / (Total number of transactions)
2. Confidence:
The likelihood that item B is purchased when item A is purchased:
Confidence(A → B) = Support(A ∩ B) / Support(A)
Where:
Support(A ∩ B) = Number of transactions containing both A and B.
Support(A) = Number of transactions containing A.
3. Lift:
Measures how much more likely item B is bought when item A is bought, compared to random chance:
Lift(A → B) = Confidence(A → B) / Support(B)
It shows the strength of the association between items A and B:
o Lift > 1: Strong association (buying A increases the chance of buying B).
o Lift = 1: No association. A and B are independent (no effect on each other).
o Lift < 1: Negative association (buying A reduces the chance of buying B).
(A short code sketch of these three measures follows below.)
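To make these measures concrete, here is a minimal plain-Python sketch (no external libraries; the toy transactions mirror the Bread/Milk/Butter example worked later in this handout):

# Toy transactions for illustration
transactions = [
    {"Bread", "Milk", "Butter"},
    {"Bread", "Milk"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
    {"Bread", "Milk", "Butter", "Eggs"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    # Confidence(A -> B) = Support(A and B) / Support(A)
    return support(a | b, transactions) / support(a, transactions)

def lift(a, b, transactions):
    # Lift(A -> B) = Confidence(A -> B) / Support(B)
    return confidence(a, b, transactions) / support(b, transactions)

print(support({"Bread"}, transactions))               # 0.8
print(confidence({"Bread"}, {"Milk"}, transactions))  # 0.75
print(lift({"Bread"}, {"Milk"}, transactions))        # 0.9375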
Algorithms:
• Apriori Algorithm:
The most widely used algorithm for MBA; it finds frequent itemsets by repeatedly extending smaller frequent itemsets.
• FP-Growth Algorithm:
A more efficient alternative that builds a compact tree structure (an FP-tree) instead of generating candidate itemsets.
Use Cases:
• Retail:
Product recommendations based on customer purchases (e.g., "People who
bought X also bought Y").
• E-commerce:
Personalized marketing campaigns.
• Inventory Management:
Understanding product demand relationships to optimize stock levels.
Tools Commonly Used in Data Mining:
1. R:
A popular programming language for data analysis and visualization.
2. Python:
A versatile programming language with libraries like Pandas and Scikit-learn.
3. SQL:
A standard language for managing relational databases.
4. Data Mining Software:
Tools like IBM SPSS Modeler, SAS Enterprise Miner, and Oracle Data Miner.
# Retail Industry
1. Diapers and Beer: A classic example of market basket analysis is the discovery that men
who buy diapers often also buy beer. This insight led to retailers placing beer near the diaper
section.
2. Coffee and Creamer: A coffee shop chain analyzed customer purchases and found that
customers who bought coffee often also bought creamer. They began offering a discount for
customers who purchased both together.
# E-commerce
1. Amazon's "Frequently Bought Together": Amazon uses market basket analysis to
recommend products that are frequently bought together. For example, if a customer buys a
camera, Amazon might recommend a memory card and a camera bag.
2. Netflix's Recommendation Engine: Netflix uses market basket analysis to recommend TV
shows and movies based on a user's viewing history. If a user watches a romantic comedy,
Netflix might recommend other romantic comedies.
# Grocery Industry
1. Peanut Butter and Jelly: A grocery store chain analyzed customer purchases and found
that customers who bought peanut butter often also bought jelly. They began placing peanut
butter and jelly near each other in the store.
2. Bread and Milk: A grocery store chain found that customers who bought bread often also
bought milk. They began offering a discount for customers who purchased both together.
# Telecommunications
1. Phone and Internet Plans: A telecom company analyzed customer data and found that
customers who bought phone plans often also bought internet plans. They began offering
discounted bundles for customers who purchased both.
2. Streaming Services: A telecom company found that customers who subscribed to
streaming services often also upgraded to faster internet plans. They began offering promotions
for streaming services with internet plan upgrades.
These examples illustrate how market basket analysis can help businesses identify patterns in
customer behavior and make data-driven decisions to drive revenue growth and improve
customer satisfaction.
Worked Example: Market Basket Analysis
A retail store wants to analyze customer purchasing behavior to determine which products are
frequently bought together. The store can use this information for marketing strategies such as
product bundling, cross-selling, and inventory management.
Dataset:
A simplified transaction dataset is shown below:
Transaction ID Items Purchased
1 Bread, Milk, Butter
2 Bread, Milk
3 Milk, Butter
4 Bread, Butter
5 Bread, Milk, Butter, Eggs
Step 1: Convert Transactions into a Binary Matrix
We create a matrix where each row represents a transaction, and each column represents an item.
Transaction ID | Bread | Milk | Butter | Eggs
1              | 1     | 1    | 1      | 0
2              | 1     | 1    | 0      | 0
3              | 0     | 1    | 1      | 0
4              | 1     | 0    | 1      | 0
5              | 1     | 1    | 1      | 1
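This conversion can be scripted. A minimal sketch, assuming pandas is installed:

import pandas as pd

# Transactions from the dataset above
transactions = [
    ["Bread", "Milk", "Butter"],
    ["Bread", "Milk"],
    ["Milk", "Butter"],
    ["Bread", "Butter"],
    ["Bread", "Milk", "Butter", "Eggs"],
]

items = sorted({item for t in transactions for item in t})
binary_matrix = pd.DataFrame(
    [[1 if item in t else 0 for item in items] for t in transactions],
    columns=items,
    index=range(1, len(transactions) + 1),  # Transaction IDs 1 to 5
)
print(binary_matrix)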
Step 2: Compute Item Support
Support(Bread) = 4/5 = 0.8, Support(Milk) = 4/5 = 0.8, Support(Butter) = 4/5 = 0.8, Support(Eggs) = 1/5 = 0.2
Step 3: Find Frequent Item Pairs
Support(Bread, Milk) = 3/5 = 0.6, Support(Bread, Butter) = 3/5 = 0.6, Support(Milk, Butter) = 3/5 = 0.6
Step 4: Compute Confidence
Confidence(Bread → Milk) = Support(Bread, Milk) / Support(Bread) = 0.6 / 0.8 = 0.75
If the confidence is above a chosen threshold (e.g., 0.7), the rule is considered strong.
Step 5: Compute Lift
Lift measures how much more likely two items are bought together compared to being bought independently.
Lift(Bread → Milk) = Confidence(Bread → Milk) / Support(Milk) = 0.75 / 0.8 ≈ 0.94
Since 0.94 is close to 1, there is little to no strong correlation between Bread and Milk.
• Bread and Milk are frequently bought together, so placing them near each other in the
store can increase sales.
• Bread and Butter also show a strong association, making them good candidates for
promotions or discounts.
• The store can offer bundled discounts on these products to encourage more purchases.
This is a basic example, but real-world market basket analysis involves large datasets and
automated tools like Apriori or FP-Growth algorithms.
Class Assignment:
Compute the Lift (Diapers → Beer) from the following dataset:
Dataset:
A grocery store records transactions from customers. Below is a sample dataset:
Transaction ID Items Purchased
1 Diapers, Beer, Chips
2 Diapers, Beer
3 Diapers, Chips
4 Beer, Chips
5 Diapers, Beer, Chips, Milk
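The display code below appears to be the tail of a Python example whose setup is not included in the handout. A minimal reconstruction of that setup, assuming the mlxtend library is installed and reusing the Bread/Milk/Butter/Eggs transactions from the worked example above (not the Diapers/Beer assignment data), could look like this:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ["Bread", "Milk", "Butter"],
    ["Bread", "Milk"],
    ["Milk", "Butter"],
    ["Bread", "Butter"],
    ["Bread", "Milk", "Butter", "Eggs"],
]

# One-hot encode the transactions into a binary matrix
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)

# Find frequent itemsets with a minimum support of 0.6
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

# Generate association rules with a confidence threshold of 0.7
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)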
# Display Results
print("Frequent Itemsets:\n", frequent_itemsets)
print("\nAssociation Rules:\n", rules[['antecedents', 'consequents',
'support', 'confidence', 'lift']])
Example Output
Frequent Itemsets
   support               itemsets
0      0.8                (Bread)
1      0.8               (Butter)
2      0.8                 (Milk)
3      0.6        (Bread, Butter)
4      0.6          (Bread, Milk)
5      0.6         (Butter, Milk)
6      0.6  (Bread, Butter, Milk)

Association Rules
  antecedents consequents  support  confidence  lift
0     (Bread)    (Butter)      0.6        0.75  1.25
1      (Milk)    (Butter)      0.6        0.75  1.25
2    (Butter)      (Milk)      0.6        0.75  1.25
Key Insights
• Rule 1: Customers who buy Bread are 1.25 times more likely to buy Butter.
• Rule 2: Customers who buy Milk are 1.25 times more likely to buy Butter.
• Rule 3: Customers who buy Butter are 1.25 times more likely to buy Milk.
Slide (2)
DATA MINING TECHNIQUES (OVERVIEW)
Data mining involves extracting valuable patterns, insights, and knowledge from
large datasets.
Classification
Assigns data points to predefined categories or classes.
📌 Example:
Classifying emails as spam or not spam.
🛠 Algorithms:
• Decision Trees
• Naïve Bayes
• Support Vector Machines (SVM)
• Random Forest
Clustering
Groups similar data points together without predefined labels.
📌 Example:
Customer segmentation in marketing (a brief code sketch follows the algorithm list below).
🛠 Algorithms:
• K-Means Clustering
• DBSCAN
• Hierarchical Clustering
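As an illustration of K-Means clustering for customer segmentation, here is a minimal scikit-learn sketch (the customer features and their values are invented purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual spend, visits per month] (illustrative values)
customers = np.array([
    [200, 2], [220, 3], [250, 2],       # low-spend, infrequent shoppers
    [900, 10], [950, 12], [1000, 11],   # high-spend, frequent shoppers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)                    # cluster assignment for each customer
print(kmeans.cluster_centers_)   # the two segment centers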
Association Rule Mining (Market Basket Analysis)
Finds items that frequently appear together in transactions.
📌 Example:
Amazon recommending "Customers who bought X also bought Y".
🛠 Algorithms:
• Apriori Algorithm
• FP-Growth Algorithm
Anomaly Detection (Outlier Detection)
Identifies rare items or patterns that do not follow the expected behavior.
📌 Example:
Fraud detection in banking transactions.
🛠 Algorithms:
• Isolation Forest
• One-Class SVM
• DBSCAN
Regression Analysis
Predicts continuous values based on input data.
📌 Example:
Predicting the price of a house.
🛠 Algorithms:
• Linear Regression
• Polynomial Regression
• Decision Tree Regression
Dimensionality Reduction
Reduces the number of features while retaining key information.
🛠 Techniques:
• Principal Component Analysis (PCA)
• t-Distributed Stochastic Neighbor Embedding (t-SNE)
Text Mining (Natural Language Processing - NLP)
Extracts patterns and insights from unstructured text data.
🛠 Techniques:
• Tokenization
• Named Entity Recognition (NER)
• Latent Dirichlet Allocation (LDA) for topic modeling
Time Series Analysis
Analyzes data collected over time to identify trends and make forecasts.
📌 Example:
Stock price prediction.
Deep Learning (Neural Networks)
🛠 Architectures:
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN)
Ensemble Learning
Combines multiple models to improve performance.
📌 Example:
Boosting algorithms in Kaggle competitions.
🛠 Techniques:
• Bagging (e.g., Random Forests)
• Boosting (e.g., Gradient Boosting)
Slide (3)
Decision Tree
A decision tree is a supervised machine learning algorithm used for both classification and regression tasks.
It works by splitting the data into subsets based on the value of input features, creating a tree-like model of decisions.
1. Root Node:
The topmost node that represents the entire dataset.
2. Internal Nodes:
Nodes that split the data based on a
feature.
3. Leaf Nodes:
Terminal nodes that provide the final
decision or prediction.
4. Branches:
The connections between nodes that represent the outcome of a decision and show the path followed.
2. Splitting Criteria
• Numerical Features:
For numerical features, the splitting condition usually involves a threshold.
For example, "Age < 25?" splits the data into two groups: those younger than 25 and
those 25 or older.
• Categorical Features:
For categorical features, the splitting condition can be based on the values
of the feature.
For example,
"Favorite Genre = Action?" splits the data into groups based on their favorite genre.
o Categorical data is a type of data that consists of labels or
categories rather than numerical values. It represents qualitative
characteristics of an object or event.
Example: colors such as Red, Blue, or Green, or movie genres such as Action or Comedy.
Handling Different Data Types
• Categorical Data:
Decision trees can handle categorical data directly by
creating branches for each category.
• Numerical Data:
Numerical data can be used directly or discretized into categories.
For example, age can be divided into age groups like "young," "middle-
aged," and "old."
• Missing Values:
Decision trees can handle
missing values by assigning
them to the most likely branch or creating a separate
branch for missing values.
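A minimal pandas sketch of the preprocessing just described: discretizing a numerical feature into age groups and encoding a categorical feature (the column names and values are illustrative only; note that libraries such as scikit-learn expect numerical input, which is why the categorical column is one-hot encoded here):

import pandas as pd

# Illustrative data with one numerical and one categorical feature
df = pd.DataFrame({
    "Age": [15, 22, 37, 45, 63],
    "FavoriteGenre": ["Action", "Comedy", "Action", "Drama", "Comedy"],
})

# Discretize age into the groups mentioned above
df["AgeGroup"] = pd.cut(df["Age"], bins=[0, 25, 50, 120],
                        labels=["young", "middle-aged", "old"])

# One-hot encode the categorical features (one column per category)
encoded = pd.get_dummies(df[["AgeGroup", "FavoriteGenre"]])
print(encoded)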
• Advantages:
o Interpretability:
Decision trees are easy to understand and visualize,
making them useful for explaining decisions.
o Versatility:
They can handle both classification and regression tasks,
as well as different data types.
o Minimal Data Preprocessing:
Decision trees require less data preprocessing compared to
some other machine learning algorithms.
• Disadvantages:
o Overfitting:
Decision trees are prone to overfitting, especially when they are
very complex.
o Instability:
Small changes in the data can lead to significant changes in the tree
structure.
o Bias:
Decision trees can be biased towards features with more
levels or categories.
6. Applications (Expanded)
7. Ensemble Methods
• Random Forests:
Create multiple decision trees on different subsets of
the data and combine their predictions through averaging
or voting.
• Gradient Boosting:
Build trees sequentially, where each tree tries to correct the errors of the previous trees.
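As a rough illustration of both ensemble methods, here is a minimal scikit-learn sketch on a small synthetic dataset (not the tennis example used later):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Small synthetic classification problem for illustration
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: many trees on different subsets, predictions combined by voting
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Gradient Boosting: trees built sequentially, each correcting the previous ones' errors
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Random Forest accuracy:", rf.score(X_test, y_test))
print("Gradient Boosting accuracy:", gb.score(X_test, y_test))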
• Classification Trees:
Predict categories (e.g., "likes movie" or "dislikes movie").
• Regression Trees:
Predict continuous values (e.g., the price of a house).
Note: ID3 is a foundational algorithm. More advanced decision tree algorithms like C4.5 and CART address some of its limitations, such as handling numerical features and pruning.
Example Problem
We will use the ID3 Algorithm to build a decision tree to determine if a person will play
tennis based on these features:
Outlook Temperature Humidity Windy Play Tennis?
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
First, we compute the entropy of the whole dataset (9 "Yes" and 5 "No", entropy ≈ 0.94) and the information gain (IG) of Outlook. Similarly, we compute IG for Temperature, Humidity, and Windy and choose the feature with the highest gain.
Information Gains (IGs)
• Outlook: 0.247
• Temperature: 0.029
• Humidity: 0.151
• Windy: 0.048
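These information gains can be reproduced with a short plain-Python sketch over the table above (values are rounded; small rounding differences are possible):

import math
from collections import Counter

# (Outlook, Temperature, Humidity, Windy, PlayTennis) rows from the table above
data = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]
FEATURES = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}
TARGET = 4

def entropy(rows):
    # Entropy of the PlayTennis labels in the given rows
    counts = Counter(row[TARGET] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature):
    # Entropy of the whole set minus the weighted entropy after splitting on the feature
    idx = FEATURES[feature]
    total = len(rows)
    remainder = 0.0
    for value in set(row[idx] for row in rows):
        subset = [row for row in rows if row[idx] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(rows) - remainder

for feature in FEATURES:
    print(feature, round(information_gain(data, feature), 3))
# Expected output (rounded): Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048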
Making Predictions
Now, we can use the tree to classify new data.
Example:
For a new day with Outlook = Sunny, Humidity = Normal, and Windy = False, the tree checks Outlook first (Sunny), then Humidity (Normal), and predicts Play Tennis = Yes.
Conclusion
The ID3 algorithm repeatedly chooses the feature with the highest information gain to split the data, and keeps splitting until every branch reaches a clear decision or a stopping rule is met.
Simplified Version
1. What Is Data Mining and Data Warehousing?
Data Mining is like searching through a giant box of information to find hidden
patterns and useful details.
It uses computer methods (like statistics and machine learning) to look at big sets of data
so that companies can learn more about trends and relationships.
Data Warehousing means collecting and storing all of a company’s data in one
large, organized storage space.
This makes it easy to combine data from different sources and run analyses or
reports on it.
A common method (the Apriori Algorithm) finds groups of products that appear
frequently together in many transactions.
Another method, the FP-Growth Algorithm, does this more efficiently by using a
compact tree structure.
5. Real-World Examples:
Retail: Stores use MBA to suggest products (“Customers who bought this also
bought…”).
E-commerce: Online stores like Amazon use it to recommend items based on what others
have purchased together.
Other Fields: Banking, healthcare, and even entertainment use these ideas to predict
behavior and improve services.
First, you convert the transactions into a table (a binary matrix) showing which products
are bought in each transaction.
Next, you calculate the support (how many times each item or pair of items appears).
Then, you compute confidence (if someone buys bread, how likely are they to also buy
milk or butter).
Finally, you calculate lift to see how strong the connection is between items. A lift close
to 1 means there’s not a strong relationship.
Second slide
Overview of Data Mining Techniques
Data mining is the process of extracting valuable patterns, insights, and knowledge from large datasets.
1. Classification
o What It Does:
Assigns data points to predefined categories or classes.
o Algorithms:
o Decision Trees,
o Naïve Bayes,
o Support Vector Machines (SVM),
o Random Forest.
2. Clustering
o What It Does:
Groups similar data points together without predefined labels.
o Example:
Customer segmentation in marketing.
o Algorithms:
o K-Means Clustering,
o DBSCAN,
o Hierarchical Clustering.
3. Association Rule Mining (Market Basket Analysis)
o What It Does:
Finds items that frequently appear together in transactions.
o Example:
Amazon recommending "Customers who bought X also bought Y".
o Algorithms:
o Apriori Algorithm,
o FP-Growth Algorithm.
4. Anomaly Detection (Outlier Detection)
o What It Does:
Identifies rare items or patterns that do not follow the expected behavior.
o Example:
Fraud detection in banking transactions.
o Algorithms:
o Isolation Forest,
o One-Class SVM,
o DBSCAN.
5. Regression Analysis
o What It Does:
Predicts continuous values based on input data.
o Example:
Predicting the price of a house.
o Algorithms:
o Linear Regression,
o Polynomial Regression,
o Decision Tree Regression.
6. Dimensionality Reduction
o What It Does:
Reduces the number of features in the data while keeping the key
information.
o Techniques:
o Principal Component Analysis (PCA),
o t-Distributed Stochastic Neighbor Embedding (t-SNE).
7. Text Mining (Natural Language Processing - NLP)
o What It Does:
Extracts patterns and insights from unstructured text data.
o Techniques:
o Tokenization,
o Named Entity Recognition (NER),
o Latent Dirichlet Allocation (LDA) for topic modeling.
8. Time Series Analysis
o What It Does:
Analyzes data collected over time to identify trends and make forecasts.
o Example:
Stock price prediction.
9. Deep Learning (Neural Networks)
o Architectures:
o Convolutional Neural Networks (CNN),
o Recurrent Neural Networks (RNN).
10. Ensemble Learning
o What It Does:
Combines multiple models to improve performance.
o Example:
Boosting algorithms in Kaggle competitions.
o Techniques:
Bagging (e.g., Random Forests) and Boosting (e.g., Gradient Boosting).
Slide 3
1. Decision Tree Overview
What is a Decision Tree?
A decision tree is a type of machine learning tool used to make decisions or
predictions.
It can be used when you want to classify items into groups (like “yes” or
“no”) or when you want to predict a number (like a price).
Key parts of a decision tree:
1. Root Node: The topmost node that represents the entire dataset.
2. Internal Nodes: Nodes that split the data based on a feature.
3. Leaf Nodes:
The end points that give you the final answer or prediction.
4. Branches:
The lines connecting the nodes that show the decision path.
How it works:
1. Feature Selection:
o The tree looks at all the features (like age, temperature, or favorite
color) and picks the best one for splitting the data.
o It uses measurements like Gini impurity, information gain, or similar criteria to decide which feature best separates the data.
2. Splitting:
o Once a feature is chosen, the data is divided into smaller groups based
on that feature’s values.
3. Recursion:
o The splitting process is repeated
on each new group.
o This continues until a stopping condition is met (for
example, when the tree reaches a certain depth or there
are very few samples left in a group).
4. Prediction:
o When new data comes in, it is sent through the tree from the
root to one of the leaf nodes, and the leaf gives the
prediction.
How the CART algorithm works:
o Instead of using entropy, CART uses the Gini index.
o The Gini index measures the impurity of the data (a lower value means
the data in the split is more pure or similar).
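A minimal sketch of the Gini impurity calculation in plain Python:

from collections import Counter

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2) over the class proportions p_i
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini(["Yes", "Yes", "Yes", "Yes"]))   # 0.0   -> perfectly pure split
print(gini(["Yes", "Yes", "No", "No"]))     # 0.5   -> maximally mixed (two classes)
print(gini(["Yes", "Yes", "Yes", "No"]))    # 0.375 -> mostly pure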
4. Splitting Criteria
When making splits in the tree, the method depends on the type of data:
Numerical Features:
o For numbers (like age or temperature), a common approach is to pick a
threshold value.
o For example,
the question might be “Is Age less than 25?” to split the data into two groups:
one with ages below 25 and one with ages 25 or above.
Categorical Features:
o For data that falls into different groups or labels (like colors or genres),
the tree can split based on these specific categories.
o For example,
Asking “Is the favorite genre Action?” creates groups based on whether the
answer is yes or no.
o Categorical data means data that is not numerical but is based on names or labels
(e.g., Red, Blue, Green).
Pruning:
To prevent overfitting, the tree can be “pruned” (simplified) by removing parts of
the tree that do not add much value.
This can be done by:
o Limiting how deep the tree goes.
o Requiring a minimum number of samples at each node.
o Using statistical tests to see if a branch is important.
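In practice, such pruning controls are often set as model parameters. A minimal scikit-learn sketch (the iris dataset is used only for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=3,          # limit how deep the tree goes
    min_samples_leaf=5,   # require at least 5 samples in each leaf
    ccp_alpha=0.01,       # cost-complexity pruning strength
    random_state=0,
).fit(X, y)

print("Number of leaves after pruning controls:", tree.get_n_leaves())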
Advantages
Interpretability:
o Decision trees are easy to understand and can be visualized.
o This makes it simpler to explain why a decision was made.
Versatility:
o They work for both classification (categories) and regression (predicting
numbers).
Minimal Data Preprocessing:
o They usually require less work to clean or transform the data
compared to some other methods.
Disadvantages
Overfitting:
o If not controlled, they can become too complex and not perform well on new
data.
Instability:
o Small changes in the data can lead to a very different tree.
Bias:
o They may favor features that have more distinct categories or levels, even if
those features are not the most important.
9. Ensemble Methods
To get better predictions and make the model more robust, multiple decision
trees can be combined. Two common methods are:
Random Forests:
o This method creates many decision trees using different parts of the
data. Their predictions are combined (by voting or averaging) to give a final
answer.
Gradient Boosting:
o Here, trees are built one after the other. Each new tree tries to fix the
mistakes of the trees that came before it.
10. Types of Decision Trees
Classification Trees:
o Used when the prediction is a category (for example, whether
someone likes or dislikes a movie).
Regression Trees:
o Used when the prediction is a number (for example, predicting the
price of a house).
11. The ID3 Algorithm (Iterative Dichotomiser 3) – Step by
Step
The ID3 algorithm is one of the first methods used to build decision trees. It works by finding the
best attribute (or feature) to split the data at each step using entropy and information gain.
The worked example uses four features, Outlook, Temperature, Humidity, and Windy, to predict "Play Tennis?" (see the table in the previous section).
Steps Illustrated:
First, calculate the overall uncertainty (entropy) for the decision “Play Tennis?”
Then, calculate how much each feature (like Outlook) decreases this uncertainty by
computing its information gain.
In our example, the feature Outlook might have the highest information gain.
Based on the Outlook:
o If it is “Overcast,” then the decision is automatically “Yes” (play tennis).
o For “Sunny” and “Rainy,” further questions are asked:
For Sunny: Look at the Humidity (if high → “No”, if normal → “Yes”).
For Rainy: Look at Windy (if not windy → “Yes”, if windy → “No”).
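The tree just described can be written out as a small Python function (a sketch of the learned rules only, not a general ID3 implementation):

def play_tennis(outlook, humidity, windy):
    # Decision tree learned by ID3 on the tennis dataset above
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"
    if outlook == "Rainy":
        return "No" if windy else "Yes"
    return None  # unseen Outlook value

print(play_tennis("Sunny", "Normal", False))   # Yes
print(play_tennis("Rainy", "High", True))      # No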
13. Conclusion
Summary of the ID3 Process:
The algorithm begins by measuring how mixed up (impure) the data is using entropy.
It then checks each feature to see which one, when used to split the data, will make the
groups purer (measured by information gain).
The feature that best reduces uncertainty is chosen first (forming the root node).
The process continues by splitting the data further until every branch reaches a clear, final
decision or a stopping rule is met.