Dsbda Unit4
PAPER 1:
1. Descriptive Analytics
◦ Definition: This type answers the question "What happened?"
2. Diagnostic Analytics
◦ Definition: This answers the question "Why did it happen?"
3. Predictive Analytics
◦ Definition: This answers the question "What is likely to happen?"
◦ Purpose: To forecast future outcomes using statistical models and machine learning.
4. Prescriptive Analytics
◦ Definition: This answers the question "What should be done?"
Each type builds upon the previous one, with increasing value and complexity. Together, they help
in making data-driven decisions.
Q3 (b): Support and Confidence Calculation [9 Marks]
Transaction Table:
TID | Items Bought
1   | Onion, Potato, Cold drink
2   | Onion, Burger, Cold drink
3   | Eggs, Onion, Cold drink
4   | Potato, Milk, Eggs
5   | Potato, Burger, Cold drink, Milk, Eggs
Support of an itemset =
(Number of transactions containing the itemset) / (Total transactions)
Total transactions = 5
Single Items:
• Onion: 3/5
• Potato: 3/5
• Burger: 2/5
• Eggs: 3/5
• Milk: 2/5
• Cold drink: 4/5
Pairs:
• Support(Onion ∪ Burger) = 1/5
• Support(Potato ∪ Milk) = 2/5
• Support(Eggs ∪ Milk) = 2/5
Confidence of a rule A → B = Support(A ∪ B) / Support(A)
Example Rules:
1. Onion → Burger
◦ Support(Onion ∪ Burger) = 1/5
◦ Support(Onion) = 3/5
◦ Confidence = (1/5) / (3/5) = 1/3 ≈ 0.33
2. Burger → Onion
◦ Support(Burger ∪ Onion) = 1/5
◦ Support(Burger) = 2/5
◦ Confidence = (1/5) / (2/5) = 0.5
3. Potato → Milk
◦ Support(Potato ∪ Milk) = 2/5
◦ Support(Potato) = 3/5
◦ Confidence = (2/5) / (3/5) = 2/3 ≈ 0.67
4. Eggs → Milk
◦ Support(Eggs ∪ Milk) = 2/5
◦ Support(Eggs) = 3/5
◦ Confidence = (2/5) / (3/5) = 2/3 ≈ 0.67
5. Milk → Eggs
◦ Support(Milk ∪ Eggs) = 2/5
◦ Support(Milk) = 2/5
◦ Confidence = (2/5) / (2/5) = 1.0
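These supports and confidences can be reproduced with a short Python sketch. This is a minimal illustration using the five transactions from the table above; the support and confidence helpers are ours, not from any library:

# Transactions from the table above
transactions = [
    {"Onion", "Potato", "Cold drink"},
    {"Onion", "Burger", "Cold drink"},
    {"Eggs", "Onion", "Cold drink"},
    {"Potato", "Milk", "Eggs"},
    {"Potato", "Burger", "Cold drink", "Milk", "Eggs"},
]

def support(itemset):
    # Fraction of transactions containing every item in itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support(A ∪ B) / Support(A) for the rule A → B
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Potato", "Milk"}))        # 0.4  (2/5)
print(confidence({"Potato"}, {"Milk"}))   # 0.666... (2/3)
print(confidence({"Milk"}, {"Eggs"}))     # 1.0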
Logistic Regression is a statistical method used for binary classification problems where the
output variable is categorical and takes one of two possible values (e.g., Yes/No, 1/0, True/False).
The logistic function, also known as the sigmoid function, is crucial in logistic regression. It
transforms the linear combination of input variables into a value between 0 and 1, which can be
interpreted as a probability.
• Threshold-based classification: Based on a threshold (usually 0.5), the output probability
can be converted to a class label (e.g., if P > 0.5, then class 1; else class 0).
3. Decision Boundary:
• The logistic function helps in drawing a decision boundary between the two classes by
setting a threshold.
4. Advantages:
• Handles binary classification efficiently.
Conclusion:
The logistic function plays a central role in converting the linear regression output into a bounded
probability value, enabling the logistic regression model to classify data into distinct categories
effectively.
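As a rough illustration of this threshold-based classification, here is a minimal scikit-learn sketch; the tiny hours-studied dataset is invented for demonstration and is not part of the original answer:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns P(class 0) and P(class 1) for each sample
probs = model.predict_proba(np.array([[3.5]]))[0]
label = 1 if probs[1] > 0.5 else 0   # threshold-based classification
print(probs, label)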
Handling Missing Data:
Missing data is a common issue in datasets that can affect model performance and insights.
Methods:
• Imputation:
◦ Mean/Median/Mode imputation
• Using Algorithms that Support Missing Values: Some algorithms, like XGBoost, can
handle missing data natively.
Best Practices:
• Explore the reason and pattern of missing data before choosing a handling method.
Transformation of Data:
Data transformation involves converting data into a suitable format for analysis or modeling.
Techniques:
1. Scaling/Normalization: Bring numeric features onto a common scale.
2. Encoding: Convert categorical data into numerical format (e.g., one-hot encoding, label
encoding).
Purpose: To make data consistent, comparable, and suitable for modeling algorithms.
Tools: Libraries like scikit-learn, pandas, and numpy provide efficient transformation
functions.
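A minimal pandas sketch of one-hot encoding; the City column is an invented example:

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"City": ["Pune", "Mumbai", "Pune", "Delhi"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["City"])
print(encoded)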
PAPER 2:
Q3 (a): Why Decision Trees Are Used and Explanation with Diagram [9 Marks]
Decision Trees are widely used in data mining and machine learning because they are easy to interpret, handle both numerical and categorical attributes, and require little data preparation.
Example: Predict whether a person will buy a laptop based on Age and Income.
              [Age?]
             /      \
          <=30      >30
           /           \
     [Income?]      Buy = Yes
       /    \
     Low    High
     /         \
   No          Yes
Explanation of Parts:
1. Root Node
◦ The top node: represents the first attribute used for splitting.
2. Decision Nodes
◦ Nodes that split the data into subsets.
3. Leaf Nodes
◦ Final outcomes or decisions (class labels).
4. Branches
◦ Show decision paths from parent to child nodes based on attribute values.
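A small scikit-learn sketch of building such a tree; the training rows are invented to mirror the Age/Income example above, and the learned splits (root node, decision nodes, leaves) are printed with export_text:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data consistent with the diagram above
data = pd.DataFrame({
    "Age":    [25, 28, 35, 40, 22, 50, 45],
    "Income": [0, 1, 1, 0, 0, 0, 1],   # 0 = Low, 1 = High
    "Buy":    [0, 1, 1, 1, 0, 1, 1],
})

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(data[["Age", "Income"]], data["Buy"])

# Print the learned splits as text
print(export_text(tree, feature_names=["Age", "Income"]))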
Apriori is an algorithm used for frequent itemset mining and association rule learning in market
basket analysis. It works by identifying frequent itemsets in a dataset and then generating
association rules.
Example:
Transactions:
TID | Items Bought
1   | Milk, Bread, Butter
2   | Bread, Butter
3   | Milk, Bread
4   | Milk, Butter
5   | Bread, Butter
Let Minimum Support = 0.6 (i.e., 60%), and Minimum Confidence = 0.7 (70%)
Step 1: Frequent 1-itemsets
Item   | Support Count | Support
Milk   | 3             | 3/5 = 0.6 ✅
Bread  | 4             | 4/5 = 0.8 ✅
Butter | 4             | 4/5 = 0.8 ✅
All three items meet the minimum support of 0.6.
Step 2: Frequent 2-itemsets
Itemset         | Support Count | Support
{Milk, Bread}   | 2             | 2/5 = 0.4 ❌
{Milk, Butter}  | 2             | 2/5 = 0.4 ❌
{Bread, Butter} | 3             | 3/5 = 0.6 ✅
Only {Bread, Butter} is frequent, so no candidate 3-itemsets can be generated.
Step 3: Association rules from {Bread, Butter}
• Bread → Butter: Confidence = Support(Bread ∪ Butter) / Support(Bread) = 0.6 / 0.8 = 0.75 ✅ (≥ 0.7)
• Butter → Bread: Confidence = 0.6 / 0.8 = 0.75 ✅ (≥ 0.7)
Conclusion:
• Apriori finds frequent itemsets by scanning data iteratively and eliminates combinations that
do not meet the support threshold.
• It then derives association rules from frequent itemsets with high confidence.
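A minimal Python sketch of the first two Apriori passes on this example; plain Python, no library, and the support helper is ours:

from itertools import combinations

# Transactions from the example above
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
]
min_support = 0.6
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / n

# Pass 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
print("L1:", frequent1)

# Pass 2: candidate 2-itemsets built only from frequent 1-itemsets (Apriori property)
candidates = [a | b for a, b in combinations(frequent1, 2)]
frequent2 = [c for c in candidates if support(c) >= min_support]
print("L2:", frequent2)   # [frozenset({'Bread', 'Butter'})]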
Q4 a) What is data preprocessing? Explain in detail about handling missing data and
transformation of data. [9 Marks]
Data preprocessing is a data mining technique that involves transforming raw data into a clean,
organized format suitable for machine learning and data analysis.
Importance: Preprocessing improves data quality, which directly affects model accuracy and the reliability of insights.
2. Handling Missing Data
Types of missing data:
• MCAR (Missing Completely at Random) – missingness is unrelated to any of the data.
• MAR (Missing at Random) – depends on observed data but not on the missing value itself.
• MNAR (Missing Not at Random) – depends on the missing value itself.
Methods:
• Deletion: remove rows or columns containing missing values.
• Imputation: fill missing values, e.g., with the mean, median, or mode.
Best Practices:
• Always explore the reason and pattern of missing data.
• When imputing test data, use statistics computed from the training set, not the test set.
3. Transformation of Data
Data transformation converts data into a format appropriate for analysis or modeling.
Common Techniques:
• Standardization: Centers data around the mean with unit variance (z-score).
Benefits: Puts features on comparable scales, which helps many models converge faster and perform better.
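A quick numpy sketch of z-score standardization; the feature values are invented:

import numpy as np

# Hypothetical feature values
x = np.array([10.0, 20.0, 30.0, 40.0])

# z-score standardization: center on the mean, scale to unit variance
z = (x - x.mean()) / x.std()
print(z)   # mean ≈ 0, std = 1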
Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem, assuming
independence among features.
Bayes' Theorem:
P(C|X) = \frac{P(X|C) \times P(C)}{P(X)}
Where:
• P(C|X): Posterior probability of class C
• P(X|C): Likelihood of features given class
• P(C): Prior probability of class
• P(X): Marginal probability of features
"Naïve" Assumption: All features are independent given the class, which simplifies the
computation of P(X|C).
2. Types:
• Multinomial Naïve Bayes: Used for discrete counts (e.g., word counts in text).
3. Advantages:
• Performs well with large datasets and high-dimensional data (e.g., text).
4. Limitations:
• Assumes feature independence, which rarely holds exactly in practice and can reduce accuracy.
5. Applications:
Application             | Description
Spam Detection          | Classify emails as spam or not based on keywords and patterns.
Sentiment Analysis      | Identify sentiment (positive/negative) in reviews or tweets.
Document Categorization | Classify documents into categories like sports, tech, health.
Medical Diagnosis       | Predict disease based on symptoms.
Credit Scoring          | Assess creditworthiness of an individual.
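A minimal scikit-learn sketch of one such application, spam detection with Multinomial Naïve Bayes; the four training texts and labels are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini corpus: 1 = spam, 0 = not spam
texts = ["free offer now", "meeting at noon", "free prize offer", "project status update"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)   # word-count features

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["free offer prize"])))   # likely [1]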
PAPER 3:
Logistic Regression is used when the dependent variable is categorical, typically for binary
classification problems such as:
• Disease vs No Disease
Key strengths:
• Interpretable Model
• Probability Outputs
The probability of the positive class is modeled as:
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n)}}
Where:
• P(Y=1) is the probability of the positive class
• \beta_0, \beta_1, ..., \beta_n are the model coefficients and X_1, ..., X_n are the input features
Removing Duplicate Records
Definition:
Duplicate records are identical or nearly identical rows that appear more than once in a dataset,
which can lead to bias, redundancy, and inaccurate model training.
• Prevents overfitting (model learning the same data multiple times)
import pandas as pd

# Build a small DataFrame with one duplicated row ('Alice', 25)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})

# drop_duplicates() keeps the first occurrence of each row
df_cleaned = df.drop_duplicates()
Before:
Name Age
Alice 25
Bob 30
Alice 25
After removing duplicates:
Name Age
Alice 25
Bob 30
Handling Missing Data
Definition:
Missing data occurs when no value is stored for a variable in a record. This can cause errors in
analysis and reduce model accuracy.
Methods:
1. Drop Missing Values:
df.dropna()
2. Impute Missing Values:
◦ Mean/Median/Mode Imputation:
df['Age'].fillna(df['Age'].mean(), inplace=True)
◦ Forward/Backward Fill:
df.fillna(method='ffill')  # forward fill
3. Use Algorithms that Handle Missing Data:
Some models (e.g., XGBoost) handle missing values natively.
Example Dataset:
Name Age
Alice 25
Bob NaN
Eve 30
After Mean Imputation (mean = 27.5):
Name Age
Alice 25
Bob 27.5
Eve 30
Conclusion:
• Removing duplicates and handling missing data are critical preprocessing steps to ensure
clean, reliable, and accurate data for analysis and machine learning.
PAPER 4:
Q3 a) What is logistic regression, and how does it differ from linear regression? What is the
sigmoid function, and what role does it play in logistic regression? [9 Marks]
1. Logistic Regression:
Logistic regression is a classification technique that models the probability of a binary outcome.
2. Difference from Linear Regression:
Linear regression predicts a continuous, unbounded value; logistic regression passes the linear combination of inputs through the sigmoid so the output is a bounded probability.
3. Sigmoid Function:
The sigmoid function is also known as the logistic function and is defined as:
\sigma(z) = \frac{1}{1 + e^{-z}}
Properties:
• Maps any real number z into the range [0, 1].
• Enables the model to output a probability rather than an unrestricted numeric value.
• The output probability is then compared to a threshold (usually 0.5) to decide the class label.
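A tiny numpy sketch of these properties; the sample inputs are arbitrary:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-5), sigmoid(0), sigmoid(5))   # ~0.0067, 0.5, ~0.9933
print(sigmoid(2.1) > 0.5)                    # True -> class 1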
Q3 b) Calculate the probability that a new email with "offer"=1 and "free"=1 is spam using
Naive Bayes classifier. [9 Marks]
Given:
• Total emails = 5
• Spam = 3 (Emails 2, 3, 5), so P(Spam) = 3/5 = 0.6
• Not Spam = 2 (Emails 1, 4), so P(Not Spam) = 2/5 = 0.4
We need the likelihoods:
• P(Offer=1 | Spam) and P(Free=1 | Spam), estimated from the spam emails (2, 3, 5)
• P(Offer=1 | Not Spam) and P(Free=1 | Not Spam), estimated from the non-spam emails (1, 4)
For Spam:
P(Spam | Offer=1, Free=1) \propto P(Spam) \times P(Offer=1|Spam) \times P(Free=1|Spam)
= 0.6 \times \frac{2}{3} = 0.4 (the two spam likelihoods multiply to 2/3)
For Not Spam:
P(Not\ Spam | Offer=1, Free=1) \propto P(Not\ Spam) \times P(Offer=1|Not\ Spam) \times
P(Free=1|Not\ Spam)
= 0.4 \times \frac{1}{2} \times \frac{1}{2} = 0.4 \times 0.25 = 0.1
Normalizing:
P(Spam | Offer=1, Free=1) = \frac{0.4}{0.4 + 0.1} = 0.8
Final Answer:
The probability that the new email with "offer"=1 and "free"=1 is Spam is 0.8 (80%).
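A quick arithmetic check of this result in Python; the likelihood products (2/3 for spam, 1/4 for not spam) are taken from the worked answer above:

# Class priors from the question
p_spam, p_not_spam = 0.6, 0.4

# Unnormalized posterior scores: prior * product of likelihoods
spam_score = p_spam * (2 / 3)             # = 0.4
ham_score = p_not_spam * (1 / 2) * (1 / 2)  # = 0.1

# Normalize to get P(Spam | Offer=1, Free=1)
p = spam_score / (spam_score + ham_score)
print(round(p, 2))   # 0.8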
Q4 (a): Apriori Algorithm for Discovering Frequent Itemsets and Role of Support &
Confidence [9 Marks]
• The Apriori algorithm is used to find frequent itemsets (sets of items that appear together
frequently) in transactional datasets.
• It works on the principle that all subsets of a frequent itemset must also be frequent
(Apriori property).
Step-by-step process:
1. Set minimum support:
◦ User defines the minimum support, i.e., the minimum frequency an itemset must
have to be considered frequent.
2. Find frequent 1-itemsets by scanning the data.
3. Generate candidate (k+1)-itemsets from frequent k-itemsets and prune any candidate with an infrequent subset (Apriori property).
4. Repeat until no new frequent itemsets are found.
• Support:
◦ Support of itemset A = (Number of transactions containing A) / (Total number of transactions)
◦ Ensures that the rules are relevant to a significant portion of the data.
• Confidence:
◦ Confidence of a rule A → B = Support(A ∪ B) / Support(A)
◦ Measures how often B occurs when A occurs.
Summary:
• Support is used to identify frequent itemsets.
• Confidence is used to generate strong association rules from these frequent itemsets.
3. Select the Best Attribute
◦ Choose the attribute that best separates the data into classes based on a splitting
criterion.
4. Repeat Recursively
◦ For each subset, select the best attribute and split further.
5. Stopping Criteria
▪ All records in the subset belong to the same class (pure node).
▪ No remaining attributes.
◦ Final nodes assign the class based on majority voting or class purity.
Gain Ratio:
◦ Adjusts information gain by the intrinsic information of a split to avoid bias towards
attributes with many values.
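A short Python sketch of gain ratio, assuming the usual entropy-based definitions; the helper names and the tiny example split are ours:

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, subsets):
    # Information gain divided by the intrinsic information of the split
    n = len(parent_labels)
    gain = entropy(parent_labels) - sum(len(s) / n * entropy(s) for s in subsets)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets)
    return gain / split_info if split_info else 0.0

# Example: a perfectly separating split of [Yes, Yes, No, No]
parent = ["Yes", "Yes", "No", "No"]
print(gain_ratio(parent, [["Yes", "Yes"], ["No", "No"]]))   # 1.0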