
UNIT 4

PAPER 1:

Q3 (a): Types of Analytics in Big Data [9 Marks]

Analytics in big data is typically categorized into four major types:

1. Descriptive Analytics
◦ Definition: This type answers the question "What happened?"

◦ Purpose: To summarize historical data and understand trends and patterns.

◦ Example: Monthly sales reports, website traffic analysis, or average customer spending.

2. Diagnostic Analytics
◦ Definition: This answers the question "Why did it happen?"

◦ Purpose: To find the root causes of past outcomes or behaviors.

◦ Tools Used: Drill-down, data discovery, correlations.

◦ Example: Analyzing why a campaign underperformed by comparing different user segments.

3. Predictive Analytics
◦ Definition: This answers the question "What is likely to happen?"

◦ Purpose: To forecast future outcomes using statistical models and machine learning.

◦ Techniques: Regression, classification, time-series analysis.

◦ Example: Predicting customer churn or sales forecasting.

4. Prescriptive Analytics
◦ Definition: This answers the question "What should be done?"

◦ Purpose: To recommend actions based on predictive insights.

◦ Techniques: Optimization, simulation, decision analysis.

◦ Example: Suggesting the best pricing strategy or delivery route.

Each type builds upon the previous one, with increasing value and complexity. Together, they help
in making data-driven decisions.
Q3 (b): Support and Confidence Calculation [9 Marks]

Transaction Table:

TID | Items Bought
1   | Onion, Potato, Cold drink
2   | Onion, Burger, Cold drink
3   | Eggs, Onion, Cold drink
4   | Potato, Milk, Eggs
5   | Potato, Burger, Cold drink, Milk, Eggs

Step 1: Calculate Support

Support of an itemset =
(Number of transactions containing the itemset) / (Total transactions)

Total transactions = 5

Let’s compute support for single items and frequent pairs:

Single Items:

• Onion: 3/5

• Potato: 3/5

• Cold drink: 4/5

• Burger: 2/5

• Eggs: 3/5

• Milk: 2/5

Pairs:

• {Onion, Cold drink}: 3/5

• {Potato, Cold drink}: 2/5

• {Burger, Cold drink}: 2/5

• {Eggs, Milk}: 2/5

• {Eggs, Cold drink}: 2/5

• {Potato, Milk}: 2/5

• {Onion, Burger}: 1/5


• {Burger, Eggs}: 1/5

• {Potato, Eggs}: 2/5

Step 2: Calculate Confidence

Confidence of rule A → B = Support(A ∪ B) / Support(A)

Example Rules:

1. Onion → Cold drink


◦ Support(Onion ∪ Cold drink) = 3/5

◦ Support(Onion) = 3/5

◦ Confidence = (3/5) / (3/5) = 1.0

2. Burger → Cold drink


◦ Support(Burger ∪ Cold drink) = 2/5

◦ Support(Burger) = 2/5

◦ Confidence = (2/5) / (2/5) = 1.0

3. Potato → Milk
◦ Support(Potato ∪ Milk) = 2/5

◦ Support(Potato) = 3/5

◦ Confidence = (2/5) / (3/5) ≈ 0.67

4. Eggs → Milk
◦ Support(Eggs ∪ Milk) = 2/5

◦ Support(Eggs) = 3/5

◦ Confidence = (2/5) / (3/5) ≈ 0.67

5. Milk → Eggs
◦ Support(Milk ∪ Eggs) = 2/5

◦ Support(Milk) = 2/5

◦ Confidence = (2/5) / (2/5) = 1.0


Conclusion:
You calculate support by counting how often an itemset appears, and confidence as the conditional probability of the rule, Support(A ∪ B) / Support(A). These metrics are essential in association rule mining, especially in algorithms like Apriori.
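The figures above can be checked with a short plain-Python sketch (standard library only; the transaction data is copied from the table above):

from itertools import combinations

# Transactions from the table above
transactions = [
    {"Onion", "Potato", "Cold drink"},
    {"Onion", "Burger", "Cold drink"},
    {"Eggs", "Onion", "Cold drink"},
    {"Potato", "Milk", "Eggs"},
    {"Potato", "Burger", "Cold drink", "Milk", "Eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Onion", "Cold drink"}))        # 0.6  (3/5)
print(confidence({"Potato"}, {"Milk"}))        # 0.666... (rule Potato -> Milk)
print(confidence({"Burger"}, {"Cold drink"}))  # 1.0  (rule Burger -> Cold drink)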

Q4 a) Explain the use of logistic function in logistic regression in detail. [9 Marks]

Logistic Regression is a statistical method used for binary classification problems where the output variable is categorical and takes one of two possible values (e.g., Yes/No, 1/0, True/False).

Role of the Logistic Function:

The logistic function, also known as the sigmoid function, is crucial in logistic regression. It
transforms the linear combination of input variables into a value between 0 and 1, which can be
interpreted as a probability.

1. Logistic (Sigmoid) Function Formula:

\sigma(z) = \frac{1}{1 + e^{-z}}

Where:

z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n

2. Why Logistic Function is Used:

• Probability Interpretation: The output lies in the range [0, 1], which makes it ideal to represent probabilities.

• Non-linearity: Although logistic regression is a linear model, the sigmoid function introduces non-linearity, helping it model complex relationships.

• Threshold-based classification: Based on a threshold (usually 0.5), the output probability can be converted to a class label (e.g., if P > 0.5, then class 1; else class 0).

3. Decision Boundary:

• The logistic function helps in drawing a decision boundary between the two classes by
setting a threshold.

• The decision boundary is linear in the feature space.

4. Advantages:
• Handles binary classification efficiently.

• Provides well-calibrated probabilities.

• Simple and interpretable model.

Conclusion:

The logistic function plays a central role in converting the linear regression output into a bounded
probability value, enabling the logistic regression model to classify data into distinct categories
effectively.
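A minimal NumPy sketch (coefficient values are made up for illustration) showing how the sigmoid turns a linear score into a probability and then a class label:

import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one input vector (values chosen for illustration)
beta = np.array([-1.0, 0.8, 2.5])   # beta_0, beta_1, beta_2
x = np.array([1.0, 1.2, 0.4])       # leading 1.0 multiplies the intercept beta_0

z = beta @ x                        # linear combination beta_0 + beta_1*x_1 + beta_2*x_2
p = sigmoid(z)                      # probability of class 1
label = int(p > 0.5)                # threshold-based classification

print(f"z = {z:.2f}, P(y=1) = {p:.3f}, predicted class = {label}")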

Q4 b) Write short notes on:

i) Removing Duplicates from Data Set

Duplicate records can introduce bias and lead to incorrect conclusions.

Why Remove Duplicates:

• They inflate the size of the dataset.

• They distort statistical metrics like mean and standard deviation.

• Can lead to overfitting in machine learning models.

Methods:

• Use tools like drop_duplicates() in pandas (Python).

• Compare rows on key columns to identify duplicates.

• Use hashing or fingerprinting techniques in large datasets.

Best Practices:

• Identify whether duplicates are exact or partial.

• Check if duplicates are meaningful (e.g., valid repeated measurements).

• Perform removal cautiously, especially in transactional data.

ii) Handling Missing Data

Missing data is a common issue in datasets that can affect model performance and insights.

Types of Missing Data:

1. MCAR (Missing Completely at Random)

2. MAR (Missing at Random)


3. MNAR (Missing Not at Random)

Techniques to Handle Missing Data:

• Deletion: Remove rows or columns with too many missing values.

• Imputation:

◦ Mean/Median/Mode imputation

◦ Predictive modeling (e.g., k-NN, regression)

• Using Algorithms that Support Missing Values: Some algorithms like XGBoost can
handle missing data.

Best Practices:

• Analyze the reason for missingness.

• Avoid dropping large portions of data unless necessary.

• Validate the imputation method with domain knowledge.
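A brief pandas sketch (column names and values are illustrative) combining deletion and simple imputation along the lines described above:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, np.nan, 30, 28],
    "income": [50000, 60000, np.nan, 55000],
    "city": ["Pune", "Mumbai", None, "Pune"],
})

# Deletion: drop columns where more than half the values are missing
df = df.dropna(axis=1, thresh=len(df) // 2 + 1)

# Imputation: mean/median for numeric columns, mode for the categorical column
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)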

iii) Data Transformation

Data transformation involves converting data into a suitable format for analysis or modeling.

Types of Data Transformation:

1. Normalization/Scaling: Rescale data to a common scale (e.g., Min-Max scaling, Z-score


normalization).

2. Encoding: Convert categorical data into numerical format (e.g., one-hot encoding, label
encoding).

3. Log/Power Transformations: Apply mathematical functions to reduce skewness and


stabilize variance.

4. Binning: Convert continuous variables into discrete bins.

Purpose:

• Improves model performance.

• Makes data compatible with algorithms.

• Reduces computational complexity.

Tools: Libraries like scikit-learn, pandas, and numpy provide efficient transformation functions.
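A small pandas/scikit-learn sketch (toy data invented for illustration) showing Min-Max scaling, a log transform, and one-hot encoding:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "salary": [30000, 45000, 120000, 60000],
    "department": ["IT", "HR", "IT", "Sales"],
})

# Normalization: rescale salary to the [0, 1] range
df["salary_scaled"] = MinMaxScaler().fit_transform(df[["salary"]])

# Log transformation: reduce skewness of the raw salary values
df["salary_log"] = np.log1p(df["salary"])

# Encoding: one-hot encode the categorical department column
df = pd.get_dummies(df, columns=["department"])

print(df)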

PAPER 2:
Q3 (a): Why Decision Trees Are Used and Explanation with Diagram [9 Marks]

Why Decision Trees Are Used

Decision Trees are widely used in data mining and machine learning for the following reasons:

1. Simple and Easy to Understand


◦ Decision trees mimic human decision-making and present data in a tree-like
structure, making them intuitive and easy to interpret.

2. No Need for Data Normalization


◦ They do not require data to be scaled or normalized, unlike other algorithms such as
SVM or KNN.

3. Handles Both Types of Data


◦ Can work with both categorical and numerical data.

4. Feature Selection is Built-in


◦ Decision trees automatically perform feature selection based on metrics like Gini
Index, Information Gain, or Gain Ratio.

5. Useful for Classification and Regression Tasks

Sample Decision Tree and Explanation

Example: Predict whether a person will buy a laptop based on Age and Income.

                [Age?]
               /      \
           <=30        >30
            |              \
        [Income?]       Buy = Yes
         /      \
       Low      High
        |          |
   Buy = No    Buy = Yes

Explanation of Parts:

1. Root Node
◦ The top node: represents the first attribute used for splitting.

◦ In this example: Age is the root node.

2. Decision Nodes
◦ Nodes that split the data into subsets.

◦ Example: Income is a decision node under Age ≤ 30.

3. Leaf Nodes
◦ Final outcomes or decisions (class labels).

◦ Example: “Buy = Yes” or “Buy = No”.

4. Branches
◦ Show decision paths from parent to child nodes based on attribute values.
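As a rough illustration (toy data invented for this sketch; scikit-learn assumed available), a tree like the one above can be fitted and its root, decision, and leaf nodes printed as follows:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative dataset mirroring the Age/Income example above
data = pd.DataFrame({
    "age": [25, 28, 35, 40, 22, 45],
    "income_high": [0, 1, 1, 0, 0, 1],   # 1 = High income, 0 = Low income
    "buys_laptop": [0, 1, 1, 1, 0, 1],
})

X = data[["age", "income_high"]]
y = data["buys_laptop"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned splits (root node, decision nodes, leaf nodes)
print(export_text(tree, feature_names=["age", "income_high"]))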

Q3 (b): How Apriori Algorithm Works (with Example) [9 Marks]

What is Apriori Algorithm?

Apriori is an algorithm used for frequent itemset mining and association rule learning in market
basket analysis. It works by identifying frequent itemsets in a dataset and then generating
association rules.

Steps in Apriori Algorithm:

1. Set Minimum Support and Confidence Threshold

2. Generate Frequent 1-itemsets

3. Generate Candidate Itemsets (k+1) from Frequent k-itemsets

4. Prune Non-frequent Itemsets (based on support)

5. Generate Association Rules (based on confidence)

Example:

Transactions:

TID | Items Bought
1   | Milk, Bread, Butter
2   | Bread, Butter
3   | Milk, Bread
4   | Milk, Butter
5   | Bread, Butter

Let Minimum Support = 0.6 (i.e., 60%), and Confidence = 0.7 (70%)
Step 1: Frequent 1-itemsets

Item   | Support Count | Support
Milk   | 3             | 3/5 = 0.6 ✅
Bread  | 4             | 4/5 = 0.8 ✅
Butter | 4             | 4/5 = 0.8 ✅

All 1-itemsets are frequent.

Step 2: Generate 2-itemsets

• {Milk, Bread} → Support = 2/5 = 0.4 ❌

• {Milk, Butter} → 2/5 = 0.4 ❌

• {Bread, Butter} → 3/5 = 0.6 ✅

Only {Bread, Butter} is frequent.

Step 3: Association Rule Generation

From {Bread, Butter}:

• Rule: Bread → Butter

◦ Confidence = Support(Bread ∪ Butter) / Support(Bread) = (3/5) / (4/5) = 0.75 ✅

• Rule: Butter → Bread

◦ Confidence = Support(Butter ∪ Bread) / Support(Butter) = (3/5) / (4/5) = 0.75 ✅

Both rules meet the confidence threshold.

Conclusion:

• Apriori finds frequent itemsets by scanning data iteratively and eliminates combinations that do not meet the support threshold.

• It then derives association rules from frequent itemsets with high confidence.
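A plain-Python sketch (standard library only) that reproduces the frequent-itemset and rule calculations of this example:

from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
]
min_support, min_confidence = 0.6, 0.7
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Frequent 1-itemsets (L1) and candidate 2-itemsets (C2) built from them
items = sorted(set().union(*transactions))
L1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
C2 = [a | b for a, b in combinations(L1, 2)]
L2 = [c for c in C2 if support(c) >= min_support]
print("Frequent 2-itemsets:", [set(c) for c in L2])   # [{'Bread', 'Butter'}]

# Association rules from each frequent 2-itemset that meet the confidence threshold
for itemset in L2:
    for item in itemset:
        antecedent, consequent = frozenset([item]), itemset - {item}
        conf = support(itemset) / support(antecedent)
        if conf >= min_confidence:
            print(f"{item} -> {set(consequent)}: confidence = {conf:.2f}")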
Q4 a) What is data preprocessing? Explain in detail about handling missing data and
transformation of data. [9 Marks]

1. What is Data Preprocessing?

Data preprocessing is a data mining technique that involves transforming raw data into a clean,
organized format suitable for machine learning and data analysis.

Importance:

• Raw data is often incomplete, inconsistent, and noisy.

• Quality of data directly impacts the performance of models.

• Ensures reliable, accurate, and efficient outcomes.

2. Handling Missing Data

Missing data occurs when no value is stored for a variable in an observation.

Types of Missing Data:

• MCAR (Missing Completely at Random) – no relation to any variable.

• MAR (Missing at Random) – depends on observed data but not on the missing value itself.

• MNAR (Missing Not at Random) – depends on the missing value itself.

Techniques to Handle Missing Data:

• Deletion:

◦ Remove rows with missing values (listwise deletion)

◦ Remove specific columns with excessive missingness

• Imputation:

◦ Replace missing values with:

▪ Mean/Median/Mode (for numeric/categorical data)

▪ Forward/Backward Fill (for time series)

▪ Predictive Imputation using regression, k-NN, etc.

• Flagging Missing Data:

◦ Add a binary variable to indicate missingness.

Best Practices:
• Always explore the reason and pattern of missing data.

• Choose imputation based on data type and context.

• When imputing test data, reuse statistics learned from the training set rather than recomputing them from the test set.
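A scikit-learn sketch (toy arrays; the add_indicator flag adds a binary missingness column) showing mean imputation fitted on training data and reused on test data, as recommended above:

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[25.0], [np.nan], [30.0], [28.0]])
X_test = np.array([[np.nan], [27.0]])

# add_indicator=True appends a binary column flagging which values were missing
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)   # statistics learned from training data
X_test_imp = imputer.transform(X_test)         # the same statistics reused on test data

print(X_train_imp)
print(X_test_imp)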

3. Transformation of Data

Data transformation converts data into a format appropriate for analysis or modeling.

Common Techniques:

• Normalization: Rescales features to a range [0, 1].

• Standardization: Centers data around the mean with unit variance (z-score).

• Log/Power Transformations: Handle skewed data.

• Encoding Categorical Variables:

◦ Label Encoding: Assign integer values.

◦ One-Hot Encoding: Create binary variables for each category.

• Binning/Discretization: Group continuous values into categories.

• Date-Time Transformation: Extract year, month, day, etc., from timestamps.

Benefits:

• Enhances model accuracy.

• Ensures faster convergence.

• Improves interpretability and consistency.

Q4 b) Explain Naïve Bayes’ classifier and its applications. [9 Marks]

1. What is Naïve Bayes Classifier?

Naïve Bayes is a probabilistic classification algorithm based on Bayes’ Theorem, assuming independence among features.

Bayes’ Theorem:

P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}

Where:

• P(C|X): Posterior probability of class C given feature vector X

• P(X|C): Likelihood of the features given the class

• P(C): Prior probability of the class

• P(X): Marginal probability of the features

"Naïve" Assumption: All features are independent given the class, which simplifies the computation of P(X|C).

2. Types of Naïve Bayes Classifiers:

• Gaussian Naïve Bayes: Assumes normal distribution of numeric features.

• Multinomial Naïve Bayes: Used for discrete counts (e.g., word counts in text).

• Bernoulli Naïve Bayes: Works with binary/boolean features.

3. Advantages:

• Simple and fast.

• Performs well with large datasets and high-dimensional data (e.g., text).

• Effective with relatively small amounts of training data.

4. Limitations:

• Assumes feature independence, which may not hold in real-world data.

• Struggles with highly correlated features.

5. Applications of Naïve Bayes:

Application             | Description
Spam Detection          | Classify emails as spam or not based on keywords and patterns.
Sentiment Analysis      | Identify sentiment (positive/negative) in reviews or tweets.
Document Categorization | Classify documents into categories like sports, tech, health.
Medical Diagnosis       | Predict disease based on symptoms.
Credit Scoring          | Assess creditworthiness of an individual.
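A short scikit-learn sketch (toy texts and labels invented for illustration) applying Multinomial Naïve Bayes to a spam-detection style task:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: 1 = spam, 0 = not spam (illustrative labels)
texts = [
    "free offer win money now",
    "meeting agenda for monday",
    "claim your free prize offer",
    "project status report attached",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)      # word-count features (discrete counts)

model = MultinomialNB()
model.fit(X, labels)

new_mail = vectorizer.transform(["free money offer"])
print(model.predict(new_mail))           # likely [1] -> spam
print(model.predict_proba(new_mail))     # posterior probabilities per class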

Note: Q3 in Paper 1 and Paper 3 is exactly the same.

PAPER 3:

Q4 (a): Need for Logistic Regression and Its Types [9 Marks]

Need for Logistic Regression:

Logistic Regression is used when the dependent variable is categorical, typically for binary classification problems such as:

• Spam vs Not Spam

• Fraudulent vs Genuine Transaction

• Disease vs No Disease

Why is Logistic Regression needed?

1. Classification Tasks

◦ Unlike linear regression (which predicts a continuous output), logistic regression predicts the probability of class membership.

2. Interpretable Model

◦ Coefficients can be interpreted to understand the influence of each predictor.

3. Probability Outputs

◦ It gives probabilities (0 to 1), which helps in threshold-based decisions.

4. Simple and Efficient

◦ Works well on linearly separable data and is computationally efficient.

Types of Logistic Regression:

1. Binary Logistic Regression


◦ Used when: Output variable has 2 classes (Yes/No, 0/1).

◦ Example: Will a customer buy a product? (Yes or No)

2. Multinomial Logistic Regression

◦ Used when: Output has 3 or more unordered categories.

◦ Example: Classifying fruits as Apple, Orange, or Banana.

3. Ordinal Logistic Regression

◦ Used when: Output has ordered categories.

◦ Example: Customer rating: Poor, Average, Good, Excellent.

Equation of Logistic Regression:

P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n)}}

Where:

• P(Y = 1) is the probability of the positive class

• \beta_0, \beta_1, \ldots, \beta_n are the model coefficients

• X_1, X_2, \ldots, X_n are the input variables
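A brief scikit-learn sketch (synthetic data; class counts and features are illustrative only) contrasting binary and multi-class (multinomial-style) logistic regression:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Binary logistic regression: two classes (0/1)
X_bin = rng.normal(size=(100, 2))
y_bin = (X_bin[:, 0] + X_bin[:, 1] > 0).astype(int)
binary_model = LogisticRegression().fit(X_bin, y_bin)
print(binary_model.predict_proba(X_bin[:3]))    # columns: P(class 0), P(class 1)

# Multi-class logistic regression: three unordered classes
X_multi = rng.normal(size=(150, 2))
y_multi = rng.integers(0, 3, size=150)          # labels 0, 1, 2
multi_model = LogisticRegression(max_iter=1000).fit(X_multi, y_multi)
print(multi_model.predict(X_multi[:5]))         # one of the three classes per sample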

Q4 (b): Data Preprocessing Concepts [9 Marks]

i) Removing Duplicates from Dataset

Definition:

Duplicate records are identical or nearly identical rows that appear more than once in a dataset,
which can lead to bias, redundancy, and inaccurate model training.

Why Remove Duplicates?

• Prevents overfitting (model learning the same data multiple times)

• Improves model accuracy and training efficiency

Example in Python (Pandas):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})
df_cleaned = df.drop_duplicates()
Before:

Name Age
Alice 25
Bob 30
Alice 25
After removing duplicates:

Name Age
Alice 25
Bob 30

ii) Handling Missing Data

Definition:

Missing data occurs when no value is stored for a variable in a record. This can cause errors in
analysis and reduce model accuracy.

Ways to Handle Missing Data:

1. Remove Rows with Missing Values

◦ When few rows are missing and can be ignored.
df.dropna()

2. Impute Missing Values

◦ Mean/Median/Mode Imputation:
df['Age'].fillna(df['Age'].mean(), inplace=True)

◦ Forward/Backward Fill:
df.fillna(method='ffill')  # forward fill

3. Use Algorithms that Handle Missing Data

◦ Some ML models like XGBoost can work with missing values.

Example Dataset:

Name Age
Alice 25
Bob NaN
Eve 30
After Mean Imputation (mean = 27.5):
Name Age
Alice 25
Bob 27.5
Eve 30

Conclusion:

• Removing duplicates and handling missing data are critical preprocessing steps to ensure
clean, reliable, and accurate data for analysis and machine learning.

PAPER 4:

Q3 a) What is logistic regression, and how does it differ from linear regression? What is the
sigmoid function, and what role does it play in logistic regression? [9 Marks]

1. Logistic Regression:

• Logistic regression is a classification algorithm used to predict the probability of a categorical dependent variable, typically binary outcomes (e.g., spam/not spam, 0/1).

• It models the probability that a given input belongs to a particular class.

2. Difference Between Logistic and Linear Regression:

Aspect         | Linear Regression                    | Logistic Regression
Purpose        | Predicts continuous output           | Predicts probability of a categorical outcome (classification)
Output         | Any real number (-∞ to +∞)           | Value between 0 and 1 (probability)
Model          | Linear equation: y = β0 + β1x1 + ... | Applies the logistic (sigmoid) function to a linear combination of inputs
Loss function  | Mean Squared Error (MSE)             | Log-loss (cross-entropy) loss
Interpretation | Predicts an exact value              | Predicts likelihood / class membership

3. Sigmoid Function:

The sigmoid function is also known as the logistic function and is defined as:

\sigma(z) = \frac{1}{1 + e^{-z}}

Where z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots

Properties:

• Maps any real number z into the range [0, 1].

• Produces an S-shaped curve.

4. Role of Sigmoid Function in Logistic Regression:

• Converts the linear combination of input features into a probability score.

• Enables the model to output a probability rather than an unrestricted numeric value.

• The output probability is then compared to a threshold (usually 0.5) to decide the class label.

• Makes logistic regression suitable for classification tasks.

Q3 b) Calculate the probability that a new email with "offer"=1 and "free"=1 is spam using
Naive Bayes classifier. [9 Marks]

Given:

Email | Offer | Free | Spam
1     | 1     | 0    | No
2     | 0     | 1    | Yes
3     | 1     | 1    | Yes
4     | 0     | 1    | No
5     | 1     | 1    | Yes

Step 1: Calculate Prior Probabilities

P(Spam) and P(Not Spam):

• Total emails = 5
• Spam = 3 (Emails 2, 3, 5)

• Not Spam = 2 (Emails 1, 4)

P(Spam) = \frac{3}{5} = 0.6, \quad P(Not\ Spam) = \frac{2}{5} = 0.4

Step 2: Calculate Likelihoods

We need:

P(Offer=1 | Spam), P(Free=1 | Spam), P(Offer=1 | Not\ Spam), P(Free=1 | Not\ Spam)

For Spam (Emails 2, 3, 5):

Email | Offer | Free
2     | 0     | 1
3     | 1     | 1
5     | 1     | 1

P(Offer=1 | Spam) = \frac{2}{3} (Emails 3 and 5)

P(Free=1 | Spam) = \frac{3}{3} = 1

For Not Spam (Emails 1, 4):

Email | Offer | Free
1     | 1     | 0
4     | 0     | 1

P(Offer=1 | Not\ Spam) = \frac{1}{2} (Email 1)

P(Free=1 | Not\ Spam) = \frac{1}{2} (Email 4)

Step 3: Calculate Posterior Probabilities using Naive Bayes formula:

P(Spam | Offer=1, Free=1) \propto P(Spam) \times P(Offer=1|Spam) \times P(Free=1|Spam)

= 0.6 \times \frac{2}{3} \times 1 = 0.6 \times 0.6667 = 0.4

P(Not\ Spam | Offer=1, Free=1) \propto P(Not\ Spam) \times P(Offer=1|Not\ Spam) \times
P(Free=1|Not\ Spam)
= 0.4 \times \frac{1}{2} \times \frac{1}{2} = 0.4 \times 0.25 = 0.1

Step 4: Normalize to get final probabilities

P(Spam | features) = \frac{0.4}{0.4 + 0.1} = \frac{0.4}{0.5} = 0.8

P(Not\ Spam | features) = \frac{0.1}{0.5} = 0.2

Final Answer:

The probability that the new email with "offer"=1 and "free"=1 is Spam is 0.8 (80%).
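This hand calculation can be reproduced with a short standard-library Python sketch (data copied from the table above):

from fractions import Fraction

# (offer, free, spam) rows from the table above
emails = [(1, 0, "No"), (0, 1, "Yes"), (1, 1, "Yes"), (0, 1, "No"), (1, 1, "Yes")]

def prior(label):
    return Fraction(sum(s == label for _, _, s in emails), len(emails))

def likelihood(feature_index, value, label):
    rows = [e for e in emails if e[2] == label]
    return Fraction(sum(e[feature_index] == value for e in rows), len(rows))

# Unnormalized posteriors for offer=1, free=1
score_spam = prior("Yes") * likelihood(0, 1, "Yes") * likelihood(1, 1, "Yes")
score_not = prior("No") * likelihood(0, 1, "No") * likelihood(1, 1, "No")

p_spam = score_spam / (score_spam + score_not)
print(float(score_spam), float(score_not), float(p_spam))   # 0.4 0.1 0.8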

Q4 (a): Apriori Algorithm for Discovering Frequent Itemsets and Role of Support &
Confidence [9 Marks]

How Apriori Algorithm Discovers Frequent Itemsets

• The Apriori algorithm is used to find frequent itemsets (sets of items that appear together frequently) in transactional datasets.

• It works on the principle that all subsets of a frequent itemset must also be frequent
(Apriori property).

Step-by-step process:

1. Set Minimum Support Threshold:

◦ User defines the minimum support, i.e., the minimum frequency an itemset must have to be considered frequent.

2. Generate Frequent 1-itemsets (L1):

◦ Count occurrences of each item.


◦ Keep items with support ≥ minimum support.

3. Generate Candidate 2-itemsets (C2):

◦ Create all possible pairs from frequent 1-itemsets.

4. Count Support for Candidate 2-itemsets:

◦ Count how many transactions contain each candidate.

◦ Keep candidates meeting minimum support → frequent 2-itemsets (L2).

5. Repeat for k-itemsets (k≥3):

◦ Generate candidate k-itemsets (Ck) from frequent (k-1)-itemsets (Lk-1).

◦ Count support and prune those below threshold.

◦ Continue until no more frequent itemsets are found.

Role of Support and Confidence

• Support:

◦ Measures how frequently an itemset appears in the dataset.

◦ Support of itemset A = (Number of transactions containing A) / (Total number of transactions)

◦ Ensures that the rules are relevant to a significant portion of the data.

• Confidence:

◦ Measures the reliability of an association rule A → B.

◦ Confidence = Support(A ∪ B) / Support(A)

◦ Reflects the conditional probability that B occurs when A occurs.

Summary:

• Apriori uses support to identify frequent itemsets.

• Confidence is used to generate strong association rules from these frequent itemsets.
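A generic plain-Python sketch (standard library only; transactions and thresholds are illustrative) of the iterative generate-count-prune loop described above:

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return every frequent itemset with its support (illustrative sketch)."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets (L1)
    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Join: candidate k-itemsets from frequent (k-1)-itemsets
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must already be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

transactions = [frozenset(t) for t in (
    {"Milk", "Bread", "Butter"}, {"Bread", "Butter"},
    {"Milk", "Bread"}, {"Milk", "Butter"}, {"Bread", "Butter"})]
for itemset, sup in apriori_frequent_itemsets(transactions, min_support=0.6).items():
    print(set(itemset), round(sup, 2))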

Q4 (b): Process of Building a Decision Tree and Splitting Criteria [9 Marks]

Process of Building a Decision Tree

1. Start with Entire Dataset (Root Node)

◦ The dataset is placed at the root.

2. Select the Best Attribute to Split On

◦ Choose the attribute that best separates the data into classes based on a splitting
criterion.

3. Split the Dataset

◦ Divide data into subsets according to values of the chosen attribute.

4. Repeat Recursively

◦ For each subset, select the best attribute and split further.

5. Stopping Criteria

◦ Stop splitting when:

▪ All records in the subset belong to the same class (pure node).

▪ No remaining attributes.

▪ Data cannot be split further (e.g., minimum sample size reached).


6. Assign Class Labels to Leaf Nodes

◦ Final nodes assign the class based on majority voting or class purity.

Criteria Used for Splitting Nodes

1. Information Gain (ID3 Algorithm)

◦ Measures the reduction in entropy (uncertainty) after the split.

◦ Choose attribute with highest information gain.

2. Gain Ratio (C4.5 Algorithm)

◦ Adjusts information gain by the intrinsic information of a split to avoid bias towards
attributes with many values.

3. Gini Index (CART Algorithm)

◦ Measures impurity; lower Gini index means better split.

◦ Choose attribute with the lowest Gini impurity after splitting.
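A small NumPy sketch (class labels invented for illustration) computing the entropy, information gain, and Gini impurity used by these splitting criteria:

import numpy as np

def entropy(labels):
    """Entropy of a class-label array (basis of information gain in ID3)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity of a class-label array (used by CART)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = np.array(["Yes", "Yes", "Yes", "No", "No"])
left, right = np.array(["Yes", "Yes", "Yes"]), np.array(["No", "No"])

# Information gain of the split = parent entropy - weighted child entropy
weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("Information gain:", entropy(parent) - weighted_child)   # 0.971 for this pure split
print("Parent Gini:", gini(parent))                            # 0.48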
