
Data Mining

Definition:
Data mining is the process of discovering patterns, correlations, and useful insights from large volumes of data using statistical, machine learning, and database management techniques.

It is a critical step in the Knowledge Discovery in Databases (KDD) process, which involves identifying meaningful information hidden within raw data.
Data Warehousing
Definition:
A data warehouse is a centralized repository designed for storing, managing, and querying large volumes of structured data. It integrates data from multiple sources and formats for analysis and reporting.
Introduction to Data Mining
Definition (Repeat)

Data mining is the process of discovering patterns, correlations, and useful insights from large volumes of
data using statistical, machine learning, and database management techniques. It is a critical step in the
Knowledge Discovery in Databases (KDD) process, which involves identifying meaningful information
hidden within raw data.

Importance of Data Mining

1. Insight Generation:
Helps organizations uncover hidden patterns, trends, and relationships in data, leading to informed decision-making.
2. Predictive Analytics:
Provides tools to predict future events or behaviors, such as customer churn, market trends, or sales forecasts.
3. Data-Driven Strategy:
Assists in creating targeted business strategies by leveraging insights derived from historical data.
4. Efficiency Improvement:
Automates the analysis of complex datasets, saving time and reducing manual effort.
5. Competitive Advantage:
Equips businesses with a deeper understanding of their operations, customers, and market, enhancing their competitive edge.
6. Scalability:
Enables the processing of large-scale datasets from multiple sources, including structured, semi-structured, and unstructured data.

Applications of Data Mining


1. Retail and E-commerce:
o Market Basket Analysis (e.g., identifying products frequently bought together).
o Customer segmentation for personalized marketing.
o Sales forecasting and inventory management.
2. Finance and Banking:
o Fraud detection using anomaly detection techniques.
o Credit risk assessment and loan approval processes.
o Stock market prediction and investment analysis.
3. Healthcare:
o Disease diagnosis and treatment recommendation.
o Patient clustering for customized care.
o Drug discovery and genomics research.
4. Telecommunications:
o Customer churn prediction and retention strategies.
o Optimizing network performance and capacity planning.
5. Education:
o Identifying student performance patterns and drop-out risks.
o Adaptive learning and personalized course recommendations.
6. Manufacturing and Supply Chain:
o Fault detection and quality control in production.
o Demand forecasting for better inventory management.
7. Social Media and Web Analytics:
o Sentiment (emotion) analysis for brand reputation management.
o Clickstream analysis for website optimization.
o Content recommendation on platforms like YouTube or Netflix.
8. Energy and Utilities:
o Load forecasting and optimizing energy distribution.
o Identifying inefficiencies in energy consumption.

Real-World Example
A retail company like Amazon uses data mining to analyze purchase histories and browsing
behaviors to recommend products tailored to individual customers. This improves the shopping
experience while boosting sales and customer loyalty.

Market Basket Analysis (MBA)

Definition:
Market Basket Analysis (MBA) is a data mining technique used to identify patterns and relationships between different products or items purchased together by customers.
It helps businesses with recommendation systems, sales strategies, and inventory management.

Key Concepts:
1. Support:
The proportion of transactions that contain a particular item or itemset:

Support(A) = (Number of transactions containing A) / (Total number of transactions)

2. Confidence:
The likelihood that item B is bought if item A is bought:

Confidence(A → B) = Support(A ∩ B) / Support(A)

Where:

 Support(A ∩ B) = Number of transactions containing both A and B.
 Support(A) = Number of transactions containing A.

3. Lift:
Measures how much more likely item B is bought when item A is bought, compared to random chance:

Lift(A → B) = Confidence(A → B) / Support(B) = Support(A ∩ B) / (Support(A) × Support(B))

It shows the strength of the association between items A and B:

o Lift > 1: Positive association (buying A increases the chance of buying B).
o Lift = 1: No association; A and B are independent (no effect on each other).
o Lift < 1: Negative association (buying A reduces the chance of buying B).
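To make these three measures concrete, here is a minimal Python sketch that computes them directly from a list of transactions (plain Python, no libraries; the sample transactions and helper functions are made up for illustration and are not part of the handout):

# Minimal sketch: support, confidence and lift computed by hand
transactions = [
    {'Bread', 'Milk'},
    {'Bread', 'Butter'},
    {'Bread', 'Milk', 'Butter'},
    {'Milk', 'Butter'},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b):
    # Likelihood that B is bought given that A is bought
    return support(a | b) / support(a)

def lift(a, b):
    # Confidence(A -> B) divided by Support(B); 1 means A and B are independent
    return confidence(a, b) / support(b)

print(support({'Bread'}))               # 0.75
print(confidence({'Bread'}, {'Milk'}))  # about 0.67
print(lift({'Bread'}, {'Milk'}))        # about 0.89 (slightly below 1)

With these four toy transactions, buying Bread makes Milk slightly less likely than chance (lift below 1); a value above 1 would indicate a genuinely useful rule.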

How MBA Works:

The common algorithms for MBA are:

• Apriori Algorithm:
The most widely used algorithm for MBA. It works by finding frequent itemsets in the transaction database and using them to generate association rules.

• FP-Growth Algorithm:
A more efficient algorithm than Apriori; it uses a compact tree structure (the Frequent Pattern tree) to find frequent itemsets without generating candidate itemsets.

Use Cases:
• Retail:
Product recommendations based on customer purchases (e.g., "People who bought X also bought Y").
• E-commerce:
Personalized marketing campaigns.
• Inventory Management:
Understanding product demand relationships to optimize stock levels.

Tools and Techniques:

1. R:
A popular programming language for data analysis and visualization.
2. Python:
A versatile programming language with libraries like Pandas and Scikit-learn.
3. SQL:
A standard language for managing relational databases.
4. Data Mining Software:

Tools like IBM SPSS Modeler, SAS Enterprise Miner, and Oracle Data Miner.

Real-World Applications of MBA


Here are some examples of market basket analysis:

# Retail Industry
1. Diapers and Beer: A classic example of market basket analysis is the discovery that men
who buy diapers often also buy beer. This insight led to retailers placing beer near the diaper
section.
2. Coffee and Creamer: A coffee shop chain analyzed customer purchases and found that
customers who bought coffee often also bought creamer. They began offering a discount for
customers who purchased both together.

# E-commerce
1. Amazon's "Frequently Bought Together": Amazon uses market basket analysis to
recommend products that are frequently bought together. For example, if a customer buys a
camera, Amazon might recommend a memory card and a camera bag.
2. Netflix's Recommendation Engine: Netflix uses market basket analysis to recommend TV
shows and movies based on a user's viewing history. If a user watches a romantic comedy,
Netflix might recommend other romantic comedies.
# Grocery Industry
1. Peanut Butter and Jelly: A grocery store chain analyzed customer purchases and found
that customers who bought peanut butter often also bought jelly. They began placing peanut
butter and jelly near each other in the store.
2. Bread and Milk: A grocery store chain found that customers who bought bread often also
bought milk. They began offering a discount for customers who purchased both together.

# Banking and Finance


1. Credit Card and Loan Offers: A bank analyzed customer data and found that customers
who had a credit card often also took out loans. They began offering loan promotions to credit
card customers.
2. Investment and Retirement Accounts: A financial institution found that customers who
invested in stocks often also opened retirement accounts. They began offering bundled
investment and retirement account packages.

# Telecommunications
1. Phone and Internet Plans: A telecom company analyzed customer data and found that
customers who bought phone plans often also bought internet plans. They began offering
discounted bundles for customers who purchased both.
2. Streaming Services: A telecom company found that customers who subscribed to
streaming services often also upgraded to faster internet plans. They began offering promotions
for streaming services with internet plan upgrades.

These examples illustrate how market basket analysis can help businesses identify patterns in
customer behavior and make data-driven decisions to drive revenue growth and improve
customer satisfaction.

Solved Example: Market Basket Analysis using Apriori Algorithm


Problem Statement:

A retail store wants to analyze customer purchasing behavior to determine which products are
frequently bought together. The store can use this information for marketing strategies such as
product bundling, cross-selling, and inventory management.

Dataset:
A simplified transaction dataset is shown below:
Transaction ID Items Purchased
1 Bread, Milk, Butter
2 Bread, Milk
3 Milk, Butter
4 Bread, Butter
5 Bread, Milk, Butter, Eggs
Step 1: Convert Transactions into a Binary Matrix

We create a matrix where each row represents a transaction, and each column represents an item.
Transaction ID Bread Milk Butter Eggs
1 1 1 1 0
2 1 1 0 0
3 0 1 1 0
4 1 0 1 0
5 1 1 1 1
Step 2: Compute Item Support

Support(X) = (Number of transactions containing X) / (Total number of transactions)

Item Support Calculation Support Value
Bread 4/5 0.8
Milk 4/5 0.8
Butter 4/5 0.8
Eggs 1/5 0.2

Step 3: Compute Item Pair Support

We now compute the support for pairs of items.


Itemset Support Calculation Support Value
{Bread, Milk} 3/5 0.6
{Bread, Butter} 3/5 0.6
{Milk, Butter} 3/5 0.6
Step 4: Compute Confidence for Association Rules

Confidence measures how often items are bought together.


Example Rules:

1. Bread → Milk: Confidence = Support(Bread ∩ Milk) / Support(Bread) = 0.6 / 0.8 = 0.75
2. Milk → Bread: Confidence = 0.6 / 0.8 = 0.75
3. Bread → Butter: Confidence = 0.6 / 0.8 = 0.75
4. Butter → Bread: Confidence = 0.6 / 0.8 = 0.75

If the confidence is above a chosen threshold (e.g., 0.7), the rule is considered strong.
Step 5: Compute Lift

Lift measures how much more likely two items are bought together compared to being bought independently:

Lift(A → B) = Confidence(A → B) / Support(B)

For example, Lift(Bread → Milk) = 0.75 / 0.8 ≈ 0.94 (the other pairs give the same value, since every single item has support 0.8).

Conclusion & Business Insights

Since the lift of about 0.94 is close to 1, there is little to no correlation between Bread and Milk beyond what chance would predict (each item already appears in 80% of transactions).

• Bread and Milk still appear together in 60% of transactions, so placing them near each other in the store is convenient for shoppers.
• Bread and Butter show the same support and confidence, making them reasonable candidates for promotions or discounts.
• The store can offer bundled discounts on these products to encourage more purchases, while keeping in mind that the lift values show these pairings are not much stronger than chance.

This is a basic example, but real-world market basket analysis involves large datasets and
automated tools like Apriori or FP-Growth algorithms.
Class Assignment:
Compute the Lift (Diapers → Beer) from the following dataset:

Dataset:
A grocery store records transactions from customers. Below is a sample dataset:
Transaction ID Items Purchased
1 Diapers, Beer, Chips
2 Diapers, Beer
3 Diapers, Chips
4 Beer, Chips
5 Diapers, Beer, Chips, Milk

Here's a Python implementation of Market Basket Analysis (MBA) using the Apriori Algorithm from the mlxtend library. This script processes transactional data and extracts association rules. 🚀

Steps in the Implementation:

1. Install necessary libraries (mlxtend, pandas).


2. Prepare sample transactional data.
3. Convert data into a format suitable for the Apriori algorithm.
4. Apply Apriori to find frequent item sets.
5. Extract association rules using support, confidence, and lift.

Python Code for MBA using Apriori


# Install necessary libraries (if not already installed)
!pip install mlxtend pandas

# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Step 1: Sample Transaction Data (each row represents a transaction)
data = [
    ['Bread', 'Butter', 'Milk'],
    ['Bread', 'Butter'],
    ['Bread', 'Milk'],
    ['Butter', 'Milk'],
    ['Bread', 'Butter', 'Milk']
]

# Step 2: Convert Transaction Data into a one-hot encoded DataFrame
# Get unique items across all transactions
all_items = sorted(set(item for transaction in data for item in transaction))
# Each cell is True if the item appears in that transaction
df = pd.DataFrame([{item: (item in transaction) for item in all_items}
                   for transaction in data])

# Step 3: Apply the Apriori Algorithm (minimum support = 40%)
frequent_itemsets = apriori(df, min_support=0.4, use_colnames=True)

# Step 4: Generate Association Rules
# Note: in this small sample every single item appears in 80% of transactions,
# so all lifts come out slightly below 1; a threshold of 1.0 would return no rules.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=0.9)

# Display Results
print("Frequent Itemsets:\n", frequent_itemsets)
print("\nAssociation Rules:\n",
      rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Explanation of the Code

✅ Step 1: Defines sample transactional data (e.g., grocery store transactions).


✅ Step 2: Converts transactions into a one-hot encoded DataFrame where True means the item
was purchased in a transaction.
✅ Step 3: Uses Apriori Algorithm to find frequent itemsets with a minimum support threshold
of 40%.
✅ Step 4: Extracts association rules using lift, support, and confidence.

Example Output
Frequent Itemsets
   support               itemsets
0      0.8                (Bread)
1      0.8               (Butter)
2      0.8                 (Milk)
3      0.6        (Bread, Butter)
4      0.6          (Bread, Milk)
5      0.6         (Butter, Milk)
6      0.4  (Bread, Butter, Milk)
Association Rules (every pair rule has the same numbers for this sample data)
  antecedents consequents  support  confidence    lift
0     (Bread)    (Butter)      0.6        0.75  0.9375
1    (Butter)     (Bread)      0.6        0.75  0.9375
2     (Bread)      (Milk)      0.6        0.75  0.9375
3      (Milk)     (Bread)      0.6        0.75  0.9375
4    (Butter)      (Milk)      0.6        0.75  0.9375
5      (Milk)    (Butter)      0.6        0.75  0.9375
Key Insights
• Each pair (Bread-Butter, Bread-Milk, Butter-Milk) appears in 60% of transactions, and a customer who buys one item of a pair buys the other 75% of the time (confidence = 0.75).
• The lift of about 0.94 is just below 1: because every single item already appears in 80% of transactions, the pairings are roughly what chance would predict.
• On larger, more varied datasets the interesting rules are those with lift clearly above 1, since they indicate items that genuinely pull each other's sales.

Applications of This Model

🔹 Retail & E-commerce: Product recommendations (e.g., Amazon's "Frequently Bought Together").
🔹 Grocery Stores: Optimize store layouts and product placements.
🔹 Online Streaming: Recommending movies/songs based on past preferences.

Slide (2)
DATA MINING TECHNIQUES (OVERVIEW)
Data mining involves extracting valuable patterns, insights, and knowledge from
large datasets.

Here are the key techniques used in data mining:

Classification

Used to categorize data into predefined classes.

📌 Example:

Spam email detection (Spam or Not Spam).

🛠 Algorithms:

• Decision Trees
• Naïve Bayes
• Support Vector Machines (SVM)
• Random Forest

Clustering
Groups similar data points together without predefined labels.

📌 Example:
Customer segmentation in marketing.

🛠 Algorithms:
• K-Means Clustering
• DBSCAN
• Hierarchical Clustering
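A minimal scikit-learn clustering sketch (scikit-learn is assumed to be installed; the customer figures below are invented purely for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual spend, number of store visits]
X = np.array([
    [200, 4], [220, 5], [250, 6],        # low spenders
    [500, 10], [520, 12],                # mid spenders
    [900, 20], [950, 22], [1000, 25],    # high spenders
])

# K-Means groups the customers into 3 segments around learned centroids
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                    # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # centre (average spend/visits) of each segment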

Association Rule Mining (Market Basket Analysis)


Finds relationships between items in a dataset.

📌 Example:
Amazon recommending "Customers who bought X also bought Y".

🛠 Algorithms:

• Apriori Algorithm
• FP-Growth Algorithm

Anomaly Detection (Outlier Detection)

Identifies rare items or patterns that don't conform to expected behavior.

📌 Example:
Fraud detection in banking transactions.

🛠 Algorithms:
• Isolation Forest
• One-Class SVM
• DBSCAN
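A minimal anomaly-detection sketch using scikit-learn's Isolation Forest (the transaction amounts are made up; the contamination value is an assumed outlier fraction, not a fixed rule):

import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly ordinary card transactions, plus two unusually large amounts
amounts = np.array([[25], [30], [22], [28], [35], [27], [5000], [31], [26], [4200]])

model = IsolationForest(contamination=0.2, random_state=0)
flags = model.fit_predict(amounts)   # -1 = flagged as anomaly, 1 = normal

print(flags)   # the 5000 and 4200 transactions should come back as -1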

Regression Analysis
Predicts continuous values based on input data.

📌 Example:

Predicting house prices based on square footage.

🛠 Algorithms:

• Linear Regression
• Polynomial Regression
• Decision Tree Regression
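A minimal regression sketch with scikit-learn (the square-footage and price figures are invented to follow a roughly linear trend):

import numpy as np
from sklearn.linear_model import LinearRegression

# House size (sq ft) vs. sale price
sqft  = np.array([[800], [1000], [1200], [1500], [1800], [2000]])
price = np.array([140000, 170000, 200000, 245000, 290000, 320000])

model = LinearRegression().fit(sqft, price)

print(model.coef_, model.intercept_)   # learned slope and intercept
print(model.predict([[1600]]))         # predicted price for a 1,600 sq ft house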

Dimensionality Reduction
Reduces the number of features while retaining key
information.
📌 Example:

Compressing high-dimensional image data for facial recognition.

🛠 Techniques:

• Principal Component Analysis (PCA)


• t-Distributed Stochastic Neighbor Embedding (t-SNE)
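A minimal PCA sketch (random data stands in for high-dimensional image features; the component count is an assumption chosen for illustration):

import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 50 correlated features; the data really varies in only ~5 directions
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 50))

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)   # compressed 100 x 5 representation

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())   # close to 1: five components keep almost all the variance
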
Text Mining (Natural Language Processing - NLP)
Extracts meaningful patterns from text data.

📌 Example:

Sentiment analysis of customer reviews.

🛠 Techniques:
• Tokenization
• Named Entity Recognition (NER)
• Latent Dirichlet Allocation (LDA) for topic modeling

Time Series Analysis


Analyzes sequential data over time to find trends and
patterns.

📌 Example:
Stock price prediction.

🛠 Techniques:

• Autoregressive Integrated Moving Average (ARIMA)


• Long Short-Term Memory (LSTM) networks

Neural Networks & Deep Learning


Mimics human brain structure to detect complex patterns.
📌 Example:
Image recognition, speech-to-text conversion.

🛠 Architectures:
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN)

Ensemble Learning
Combines multiple models to improve performance.

📌 Example:
Boosting algorithms in Kaggle competitions.

🛠 Techniques:

• Bagging (e.g., Random Forest)


• Boosting (e.g., XGBoost, AdaBoost)

Slide (3)
Decision Tree
A decision tree is a supervised machine learning algorithm used for both classification and regression tasks. It works by splitting the data into subsets based on the value of input features, creating a tree-like structure of decisions.

Each internal node represents a decision based on a feature,
each branch represents the outcome of that decision,
and each leaf node represents a final outcome (a class label or a continuous value).

Key Components of a Decision Tree:

1. Root Node:
The topmost node that represents the entire dataset.
2. Internal Nodes:
Nodes that split the data based on a feature.
3. Leaf Nodes:
Terminal nodes that provide the final decision or prediction.
4. Branches:
Paths from the root to the leaves, representing decision rules.

How a Decision Tree Works:


1. Feature Selection:
The algorithm selects the best feature to split the
data based on criteria like Gini impurity,
information gain, or variance reduction.
2. Splitting:
The dataset is divided into subsets based on
the selected feature.
3. Recursion:
The process is repeated for each subset until a
stopping condition is met (e.g., maximum depth,
minimum samples per leaf).
4. Prediction:
For a new data point, the tree is traversed
from the root to a leaf node to make a prediction.
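The four steps above map directly onto a few lines of scikit-learn; this is a minimal sketch with made-up data (the ages, incomes, and "buys the product" labels are invented for illustration):

from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: [age, income] -> buys the product (1) or not (0)
X = [[22, 20000], [25, 32000], [47, 55000], [52, 60000], [46, 28000], [56, 75000]]
y = [0, 0, 1, 1, 0, 1]

# Gini impurity chooses the splits; max_depth acts as the stopping condition
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))   # the learned decision rules
print(tree.predict([[30, 65000]]))   # a new case is routed from the root down to a leaf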

How Decision Trees are Built


Decision trees are constructed using algorithms that aim to find the most
informative features to split the data at each node.

Here are two common approaches:

• ID3 (Iterative Dichotomiser 3):


This algorithm uses entropy and information gain to select the best feature for
splitting.
o Entropy:
Measures the impurity or randomness of a set of
data.
A set with equal
proportions of different classes has
high entropy, while a set with only one class has low
entropy.
o Information Gain:
Measures the reduction in entropy achieved by splitting the
data on a particular feature.
The feature with the highest information gain is chosen for
the split.
• CART (Classification and Regression Trees):
This algorithm uses the Gini index to select the best feature for splitting.
o Gini Index:
Measures the impurity of a set of data, similar to
entropy. A lower Gini index indicates higher purity.

2. Splitting Criteria
• Numerical Features:
For numerical features, the splitting condition usually involves a threshold.
For example, "Age < 25?" splits the data into two groups: those younger than 25 and
those 25 or older.
• Categorical Features:
For categorical features, the splitting condition can be based on the values
of the feature.
For example,
"Favorite Genre = Action?" splits the data into groups based on their favorite genre.
o Categorical data is a type of data that consists of labels or
categories rather than numerical values. It represents qualitative
characteristics of an object or event.

Example:

o Colors: {Red, Blue, Green}

3. Overfitting and Pruning


Overfitting:
Decision trees can become very complex and capture noise in the
data, leading to poor performance on unseen data.
This is called overfitting.
Pruning:
To avoid overfitting, we
can prune the tree by removing
branches or nodes that do not contribute significantly to the
prediction accuracy. Pruning can be done by limiting the depth of
the tree, setting a minimum number of samples required at a
node, or using statistical measures to evaluate the importance
of branches.
4. Handling Different Data Types

• Categorical Data:
Decision trees can handle categorical data directly by
creating branches for each category.
• Numerical Data:
Numerical data can be used directly or discretized into categories.
For example, age can be divided into age groups like "young," "middle-
aged," and "old."
• Missing Values:
Decision trees can handle
missing values by assigning
them to the most likely branch or creating a separate
branch for missing values.

5. Advantages and Disadvantages (Expanded)

• Advantages:
o Interpretability:
Decision trees are easy to understand and visualize,
making them useful for explaining decisions.
o Versatility:
They can handle both classification and regression tasks,
as well as different data types.
o Minimal Data Preprocessing:
Decision trees require less data preprocessing compared to
some other machine learning algorithms.
• Disadvantages:
o Overfitting:
Decision trees are prone to overfitting, especially when they are
very complex.
o Instability:
Small changes in the data can lead to significant changes in the tree
structure.
o Bias:
Decision trees can be biased towards features with more
levels or categories.

6. Applications (Expanded)

• Customer Relationship Management (CRM):


Predicting customer churn, identifying potential customers, and
personalizing marketing campaigns.
• Risk Assessment:
Assessing credit risk, predicting loan defaults, and evaluating insurance
applications.
• Medical Diagnosis:
Diagnosing diseases based on symptoms and medical history, predicting
patient outcomes, and personalizing treatment plans.
• Fraud Detection:
Identifying fraudulent transactions in financial systems, detecting
suspicious activities in online platforms.

7. Ensemble Methods

To improve the performance and robustness of decision trees, ensemble methods can be used. These methods combine multiple decision trees to make predictions.

Two popular ensemble methods are:

• Random Forests:
Create multiple decision trees on different subsets of the data and combine their predictions through averaging or voting.
• Gradient Boosting:
Build trees sequentially, where each tree tries to correct the errors of the previous trees.
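A minimal sketch of both ensemble methods using scikit-learn (the dataset is synthetic, generated only so there is something to fit):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data stands in for a real problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: many trees on random subsets of the data, predictions combined by voting
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Gradient Boosting: trees built one after another, each correcting the previous trees' errors
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Random Forest accuracy:", rf.score(X_test, y_test))
print("Gradient Boosting accuracy:", gb.score(X_test, y_test))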

Types of Decision Trees:

• Classification Trees:
Predict categories (e.g., "likes movie" or "dislikes movie").
• Regression Trees:
Predict continuous values (e.g., the price of a house).

ID3 Algorithm (Iterative Dichotomiser 3) - Step by Step

The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm developed by Ross Quinlan and used for classification tasks. It builds a tree by selecting the attribute with the highest Information Gain at each step, using Entropy and Information Gain to determine the best attribute to split the data on. Information Gain measures how much a feature reduces the uncertainty (entropy) in the dataset.

Note: ID3 is a foundational algorithm. More advanced decision tree algorithms like C4.5 and CART address some of its limitations.

Step-by-Step Explanation of ID3 Algorithm

1. Start with the Entire Dataset:


o Begin with the complete dataset and all available features.
2. Calculate the Entropy of the Target Attribute:
o Entropy measures the impurity or uncertainty in the dataset. For a binary classification problem, entropy is calculated as:
Entropy(S) = -p_yes * log2(p_yes) - p_no * log2(p_no)
where p_yes and p_no are the proportions of positive and negative examples in S.

3. Calculate Information Gain for Each Feature:
o Information gain measures how much a feature reduces the entropy. It is calculated as:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) * Entropy(S_v)
where S_v is the subset of S for which feature A takes the value v.


4. Select the Feature with the Highest Information Gain:


o Choose the feature that maximizes information gain as the
splitting criterion.
5. Split the Dataset:
o Split the dataset into subsets based on the selected
feature's values.
6. Repeat the Process Recursively:
o Repeat steps 2–5 for each subset until:
 All instances in a subset belong to the same class (no further splitting needed).
 No more features are left to split on.
 A predefined stopping condition is met (e.g., maximum tree depth).
7. Create the Decision Tree:
o The splits form the internal nodes of the tree, and the leaf
nodes represent the class labels.

Example Problem
We will use the ID3 Algorithm to build a decision tree to determine if a person will play
tennis based on these features:
Outlook Temperature Humidity Windy Play Tennis?
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No

Compute Entropy for the Target Variable (Play Tennis?)

The dataset has 14 records: 9 "Yes" and 5 "No".

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.41 + 0.53 = 0.94

So, the entropy of the dataset is 0.94.



Compute Information Gain for Each Attribute
We now calculate the Information Gain for Outlook, Temperature, Humidity, and Windy.

Information Gain for "Outlook"


Outlook Total Play Tennis: Yes Play Tennis: No
Sunny 5 2 3
Overcast 4 4 0
Rainy 5 3 2
First, compute the entropy of each Outlook subset:
• Sunny (2 Yes, 3 No): Entropy = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.971
• Overcast (4 Yes, 0 No): Entropy = 0
• Rainy (3 Yes, 2 No): Entropy ≈ 0.971

Now, compute the Information Gain (IG) for Outlook:
IG(Outlook) = 0.94 - [ (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ] ≈ 0.94 - 0.693 ≈ 0.247

Similarly, we compute IG for Temperature, Humidity, and Windy and choose the highest one.
Information Gains (IGs)

• Outlook: 0.247
• Temperature: 0.029
• Humidity: 0.151
• Windy: 0.048
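These values can be checked with a few lines of Python; this sketch simply hard-codes the Yes/No counts from the tables above (a verification aid, not part of the original handout):

from math import log2

def entropy(pos, neg):
    # Entropy of a set containing pos "Yes" and neg "No" examples
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

e_total = entropy(9, 5)            # whole dataset: 9 Yes, 5 No
print(round(e_total, 3))           # 0.94

# Outlook subsets: Sunny (2 Yes, 3 No), Overcast (4, 0), Rainy (3, 2)
subsets = [(2, 3), (4, 0), (3, 2)]
weighted = sum((p + n) / 14 * entropy(p, n) for p, n in subsets)
print(round(e_total - weighted, 3))   # 0.247 -> Information Gain for Outlook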

Select the Feature with the Highest Information Gain:


Since Outlook has the highest Information Gain, we split on it:

Recursively Build the Decision Tree:


• For the Overcast subset (all "Yes"), create a leaf node labeled
"Yes".
• For the Sunny and Rainy subsets, repeat the process to find the
best feature to split on.

Final Decision Tree

• If Overcast, Play Tennis = Yes.


• If Sunny, check Humidity:
o If High, Play Tennis = No.
o If Normal, Play Tennis = Yes.
• If Rainy, check Windy:
o If False, Play Tennis = Yes.
o If True, Play Tennis = No.

Making Predictions
Now, we can use the tree to classify new data.
Example:

• Outlook = Rainy, Windy = False → Play Tennis = Yes


• Outlook = Sunny, Humidity = High → Play Tennis = No

Conclusion
The ID3 algorithm:

1. Calculates Entropy for the dataset.


2. Finds Information Gain for each attribute.
3. Selects the attribute with the highest IG as the root node.
4. Recursively splits the dataset until all nodes are pure or other stopping criteria are
met.

Simplified Version
1. What Is Data Mining and Data Warehousing?

 Data Mining is like searching through a giant box of information to find hidden
patterns and useful details.
 It uses computer methods (like statistics and machine learning) to look at big sets of data
so that companies can learn more about trends and relationships.
 Data Warehousing means collecting and storing all of a company’s data in one
large, organized storage space.
 This makes it easy to combine data from different sources and run analyses or
reports on it.

2. Why Are They Important?


 They help businesses discover insights (like hidden trends) that lead to better decisions.
 Data mining can predict future events (for example, whether a customer might stop using
a service).
 They allow businesses to create strategies based on real data rather than guesswork,
saving time and improving efficiency.

3. What Is Market Basket Analysis (MBA)?


 MBA is a specific type of data mining that finds out which products are often bought
together. For example, if people buy bread, they might also buy butter.
 Key Terms:
o Support: How often a product (or a set of products) appears in transactions.
o Confidence: The chance that when one item is bought, another item is also
bought.
o Lift: A number that tells you if two items are bought together more (or less) than
by random chance. If lift is more than 1, the items are strongly linked.

4. How Does MBA Work?

 A common method (the Apriori Algorithm) finds groups of products that appear
frequently together in many transactions.
 Another method, the FP-Growth Algorithm, does this more efficiently by using a
compact tree structure.

5. Real-World Examples:

 Retail: Stores use MBA to suggest products (“Customers who bought this also
bought…”).
 E-commerce: Online stores like Amazon use it to recommend items based on what others
have purchased together.
 Other Fields: Banking, healthcare, and even entertainment use these ideas to predict
behavior and improve services.

6. A Simple Example of MBA:


Imagine a small store with these transactions:

 Transaction 1: Bread, Milk, Butter


 Transaction 2: Bread, Milk
 Transaction 3: Milk, Butter
 Transaction 4: Bread, Butter
 Transaction 5: Bread, Milk, Butter, Eggs

Here’s what happens in the analysis:

 First, you convert the transactions into a table (a binary matrix) showing which products
are bought in each transaction.
 Next, you calculate the support (how many times each item or pair of items appears).
 Then, you compute confidence (if someone buys bread, how likely are they to also buy
milk or butter).
 Finally, you calculate lift to see how strong the connection is between items. A lift close
to 1 means there’s not a strong relationship.

Second slide
Overview of Data Mining Techniques
Data mining is the process of extracting
valuable patterns,
insights, and knowledge from large datasets.

1. Classification
o What It Does:

Sorts data into predefined classes.


o Example: Detecting if an email is Spam or Not Spam.

Algorithms:
o Decision Trees,
o Naïve Bayes,
o Support Vector Machines (SVM),
o Random Forest.
2. Clustering
o What It Does:

Groups similar data points together without using predefined labels.

o Example:

Segmenting customers in marketing.

Algorithms:

o K-Means Clustering,
o DBSCAN,
o Hierarchical Clustering.
3. Association Rule Mining (Market Basket Analysis)
o What It Does:

Finds relationships between items in a dataset.

o Example:

Recommending products on Amazon (for instance, "Customers who bought X also bought Y").
Algorithms:

o Apriori Algorithm,
o FP-Growth Algorithm.
4. Anomaly Detection (Outlier Detection)
o What It Does:

Identifies rare items or patterns that do not follow the expected behavior.

o Example:

Detecting fraud in banking transactions.

Algorithms:

o Isolation Forest,
o One-Class SVM,
o DBSCAN.
5. Regression Analysis
o What It Does:

Predicts continuous values based on input data.

o Example:

Predicting house prices based on square footage.

Algorithms:

o Linear Regression,
o Polynomial Regression,
o Decision Tree Regression.
6. Dimensionality Reduction
o What It Does:

Reduces the number of features in the data while keeping the key
information.

Example:

Compressing high-dimensional image data for facial recognition.

Techniques:
o Principal Component Analysis (PCA),
o t-Distributed Stochastic Neighbor Embedding (t-SNE).
7. Text Mining (Natural Language Processing - NLP)
o What It Does:

Extracts meaningful patterns from text data.

o Example:

Performing sentiment analysis of customer reviews.

Techniques:

o Tokenization,
o Named Entity Recognition (NER),
o Latent Dirichlet Allocation (LDA) for topic modeling.
8. Time Series Analysis
o What It Does:

Analyzes sequential data over time to find trends and patterns.

o Example:

Predicting stock prices.

Techniques:

o Autoregressive Integrated Moving Average (ARIMA),


o Long Short-Term Memory (LSTM) networks.
9. Neural Networks & Deep Learning
o What It Does:

Mimics the human brain structure to detect complex patterns.

o Example:

Image recognition or converting speech to text.

o Architectures:
o Convolutional Neural Networks (CNN),
o Recurrent Neural Networks (RNN).
10. Ensemble Learning
o What It Does:
Combines multiple models to improve performance.

o Example:

Using boosting algorithms in Kaggle competitions.

o Techniques:

Bagging (e.g., Random Forest)

and Boosting (e.g., XGBoost, AdaBoost).

Slide 3
1. Decision Tree Overview
What is a Decision Tree?
A decision tree is a type of machine learning tool used to make decisions or
predictions.

It can be used when you want to classify items into groups (like “yes” or
“no”) or when you want to predict a number (like a price).
How it works:

o It splits your data into smaller groups based on questions about the data.
o The tree is built by making a series of decisions;
o Each decision is based on one piece of information (a
feature).
o The very first split is made at the root node, and then the tree branches
out until the final decisions (called leaf nodes) are reached.
Key Components:

1. Root Node:

The top part that represents all your data.


2. Internal Nodes:
Points in the tree where the data is split by asking a question (e.g., “Is the temperature
high?”).

3. Leaf Nodes:
The end points that give you the final answer or prediction.

4. Branches:
The lines connecting the nodes that show the decision path.

2. How a Decision Tree Works


Steps in Building a Decision Tree:

1. Feature Selection:
o The tree looks at all the features (like age, temperature, or favorite
color) and picks the best one for splitting the data.
o It uses measurements like Gini impurity, information gain, or variance reduction to decide which feature best divides the data.

2. Splitting:
o Once a feature is chosen, the data is divided into smaller groups based
on that feature’s values.
3. Recursion:
o The splitting process is repeated
on each new group.
o This continues until a stopping condition is met (for
example, when the tree reaches a certain depth or there
are very few samples left in a group).
4. Prediction:
o When new data comes in, it is sent through the tree from the
root to one of the leaf nodes, and the leaf gives the
prediction.

3. How Decision Trees Are Built


Decision trees are made by using special
algorithms that help decide the best
questions to ask about the data.
Two common methods are:

ID3 (Iterative Dichotomiser 3)


 How it works:
o It uses a measure called entropy to see how mixed the classes are (for
example, how much “yes” and “no” are mixed together).
o It then calculates information gain for each feature. Information gain
shows how much the uncertainty (or impurity) decreases when you
split the data using that feature.
o The feature that gives the highest information gain is chosen
for the split.
CART (Classification and Regression Trees)

 How it works:
o Instead of using entropy, CART uses the Gini index.
o The Gini index measures the impurity of the data (a lower value means
the data in the split is more pure or similar).

4. Splitting Criteria
When making splits in the tree, the method depends on the type of data:

 Numerical Features:
o For numbers (like age or temperature), a common approach is to pick a
threshold value.
o For example,

the question might be “Is Age less than 25?” to split the data into two groups:
one with ages below 25 and one with ages 25 or above.

 Categorical Features:
o For data that falls into different groups or labels (like colors or genres),
the tree can split based on these specific categories.
o For example,

Asking “Is the favorite genre Action?” creates groups based on whether the
answer is yes or no.

o Categorical data means data that is not numerical but is based on names or labels
(e.g., Red, Blue, Green).

5. Overfitting and Pruning


Overfitting:

 This happens when a decision tree becomes too complex.


 It might learn the details and “noise” in the training data so well that it does
not work well with new data.

Pruning:
 To prevent overfitting, the tree can be “pruned” (simplified) by removing parts of
the tree that do not add much value.
 This can be done by:
o Limiting how deep the tree goes.
o Requiring a minimum number of samples at each node.
o Using statistical tests to see if a branch is important.

6. Handling Different Data Types


Decision trees can work with different kinds of data:
 Categorical Data:
o They naturally handle data that comes in categories by creating branches for each category.
 Numerical Data:
o Numbers can be used directly or turned into groups (for example, splitting “age” into
“young,” “middle-aged,” and “old”).
 Missing Values:
o If some data is missing, the tree can either assign these cases to the branch that
fits best or make a special branch for missing data.

7. Advantages and Disadvantages


Advantages

 Interpretability:
o Decision trees are easy to understand and can be visualized.
o This makes it simpler to explain why a decision was made.
 Versatility:
o They work for both classification (categories) and regression (predicting
numbers).
 Minimal Data Preprocessing:
o They usually require less work to clean or transform the data
compared to some other methods.

Disadvantages

 Overfitting:
o If not controlled, they can become too complex and not perform well on new
data.
 Instability:
o Small changes in the data can lead to a very different tree.
 Bias:
o They may favor features that have more distinct categories or levels, even if
those features are not the most important.

8. Applications of Decision Trees


Decision trees are used in many areas such as:

 Customer Relationship Management (CRM):


o To predict if a customer might stop using a service (customer churn) and to help
tailor marketing efforts.
 Risk Assessment:
o To evaluate risks like credit risk, loan defaults, or insurance
applications.
 Medical Diagnosis:
o To help diagnose diseases using symptoms and medical history, or
to predict patient outcomes.
 Fraud Detection:
o To find and flag fraudulent activities in financial systems or online
platforms.

9. Ensemble Methods
To get better predictions and make the model more robust, multiple decision
trees can be combined. Two common methods are:

 Random Forests:
o This method creates many decision trees using different parts of the
data. Their predictions are combined (by voting or averaging) to give a final
answer.
 Gradient Boosting:
o Here, trees are built one after the other. Each new tree tries to fix the
mistakes of the trees that came before it.

10. Types of Decision Trees


There are two main kinds of decision trees:

 Classification Trees:
o Used when the prediction is a category (for example, whether
someone likes or dislikes a movie).
 Regression Trees:
o Used when the prediction is a number (for example, predicting the
price of a house).
11. The ID3 Algorithm (Iterative Dichotomiser 3) – Step by
Step
The ID3 algorithm is one of the first methods used to build decision trees. It works by finding the
best attribute (or feature) to split the data at each step using entropy and information gain.

Detailed Steps of ID3:

1. Start with the Entire Dataset:


o Use all your data and all the features available.
2. Calculate the Entropy of the Target Attribute:
o Entropy is a measure of how mixed up the classes are in your data.
o For example, if the data has an equal mix of “yes” and “no,” the entropy is high
(meaning the data is very uncertain). If almost all data points belong to one class,
the entropy is low.
3. Calculate Information Gain for Each Feature:
o Information Gain tells you how much the uncertainty decreases when
you split the data based on a feature.
o It is computed by comparing the entropy before the split and the entropy
after the split.
4. Select the Feature with the Highest Information Gain:
o The feature that reduces the uncertainty the most is chosen as the best way to
split the data at that point.
5. Split the Dataset:
o Divide the data into groups according to the values of the selected
feature.
6. Repeat the Process Recursively:
o For each new group (or branch), repeat the steps of calculating entropy and
information gain.
o This continues until one of these conditions is met:
 All data points in a group belong to the same class (perfect split).
 There are no more features left to use for splitting.
 A stopping condition is reached (like a maximum depth for the tree).

7. Create the Decision Tree:
o The points where you made splits become the internal nodes.
o The final groups that do not split any further become the leaf nodes with the final
decision (the class label).

Example: Deciding Whether to Play Tennis


Imagine you want to decide if someone will play tennis based on the weather conditions. You
have features such as:

 Outlook
 Temperature
 Humidity
 Windy

Steps Illustrated:

 First, calculate the overall uncertainty (entropy) for the decision “Play Tennis?”
 Then, calculate how much each feature (like Outlook) decreases this uncertainty by
computing its information gain.
 In our example, the feature Outlook might have the highest information gain.
 Based on the Outlook:
o If it is “Overcast,” then the decision is automatically “Yes” (play tennis).
o For “Sunny” and “Rainy,” further questions are asked:
 For Sunny: Look at the Humidity (if high → “No”, if normal → “Yes”).
 For Rainy: Look at Windy (if not windy → “Yes”, if windy → “No”).

12. Making Predictions with the Decision Tree


Once the tree is built, you use it for predictions:

 To predict a new case, start at the root of the tree.


 Follow the branches based on the answers to the questions (the features) until you reach a
leaf node.
 The leaf node tells you the prediction (for example, “Yes” or “No”).

13. Conclusion
Summary of the ID3 Process:

 The algorithm begins by measuring how mixed up (impure) the data is using entropy.
 It then checks each feature to see which one, when used to split the data, will make the
groups purer (measured by information gain).
 The feature that best reduces uncertainty is chosen first (forming the root node).
 The process continues by splitting the data further until every branch reaches a clear, final
decision or a stopping rule is met.
