Dsbda Unit4
PAPER 1:
1. Descriptive Analytics
◦ Definition: This type answers the question "What happened?"
2. Diagnostic Analytics
◦ Definition: This answers the question "Why did it happen?"
3. Predictive Analytics
◦ Definition: This answers the question "What is likely to happen?"
◦ Purpose: To forecast future outcomes using statistical models and machine learning.
4. Prescriptive Analytics
◦ Definition: This answers the question "What should be done?"
Each type builds upon the previous one, with increasing value and complexity. Together, they help
in making data-driven decisions.
Q3 (b): Support and Confidence Calculation [9 Marks]
Transaction Table:
TID | Items Bought
1   | Onion, Potato, Cold drink
2   | Onion, Burger, Cold drink
3   | Eggs, Onion, Cold drink
4   | Potato, Milk, Eggs
5   | Potato, Burger, Cold drink, Milk, Eggs
Support of an itemset =
(Number of transactions containing the itemset) / (Total transactions)
Total transactions = 5
Single Items:
• Onion: 3/5
• Potato: 3/5
• Burger: 2/5
• Eggs: 3/5
• Milk: 2/5
• Cold drink: 4/5
Pairs:
• Support(Onion ∪ Burger) = 1/5
• Support(Potato ∪ Milk) = 2/5
• Support(Eggs ∪ Milk) = 2/5
Confidence of a rule A → B = Support(A ∪ B) / Support(A)
Example Rules:
1. Onion → Burger
◦ Support(Onion ∪ Burger) = 1/5
◦ Support(Onion) = 3/5
◦ Confidence = (1/5) / (3/5) = 1/3 ≈ 0.33
2. Burger → Onion
◦ Support(Burger ∪ Onion) = 1/5
◦ Support(Burger) = 2/5
◦ Confidence = (1/5) / (2/5) = 0.5
3. Potato → Milk
◦ Support(Potato ∪ Milk) = 2/5
◦ Support(Potato) = 3/5
◦ Confidence = (2/5) / (3/5) = 2/3 ≈ 0.67
4. Eggs → Milk
◦ Support(Eggs ∪ Milk) = 2/5
◦ Support(Eggs) = 3/5
◦ Confidence = (2/5) / (3/5) = 2/3 ≈ 0.67
5. Milk → Eggs
◦ Support(Milk ∪ Eggs) = 2/5
◦ Support(Milk) = 2/5
◦ Confidence = (2/5) / (2/5) = 1.0
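These supports and confidences can be reproduced with a short Python sketch. This is a minimal illustration using the five transactions from the table above; the support and confidence helpers are ours, not from any library:

# Transactions from the table above
transactions = [
    {"Onion", "Potato", "Cold drink"},
    {"Onion", "Burger", "Cold drink"},
    {"Eggs", "Onion", "Cold drink"},
    {"Potato", "Milk", "Eggs"},
    {"Potato", "Burger", "Cold drink", "Milk", "Eggs"},
]

def support(itemset):
    # Fraction of transactions containing every item in itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support(A ∪ B) / Support(A) for the rule A → B
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Potato", "Milk"}))        # 0.4  (2/5)
print(confidence({"Potato"}, {"Milk"}))   # 0.666... (2/3)
print(confidence({"Milk"}, {"Eggs"}))     # 1.0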
Logistic Regression is a statistical method used for binary classification problems where the
output variable is categorical and takes one of two possible values (e.g., Yes/No, 1/0, True/False).
The logistic function, also known as the sigmoid function, is crucial in logistic regression. It
transforms the linear combination of input variables into a value between 0 and 1, which can be
interpreted as a probability.
• Threshold-based classification: Based on a threshold (usually 0.5), the output probability
can be converted to a class label (e.g., if P > 0.5, then class 1; else class 0).
3. Decision Boundary:
• The logistic function helps in drawing a decision boundary between the two classes by
setting a threshold.
4. Advantages:
• Handles binary classification efficiently.
Conclusion:
The logistic function plays a central role in converting the linear regression output into a bounded
probability value, enabling the logistic regression model to classify data into distinct categories
effectively.
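As a rough illustration of this threshold-based classification, here is a minimal scikit-learn sketch; the tiny hours-studied dataset is invented for demonstration and is not part of the original answer:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns P(class 0) and P(class 1) for each sample
probs = model.predict_proba(np.array([[3.5]]))[0]
label = 1 if probs[1] > 0.5 else 0   # threshold-based classification
print(probs, label)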
Handling Missing Data:
Missing data is a common issue in datasets that can affect model performance and insights.
Methods:
• Imputation:
◦ Mean/Median/Mode imputation
• Using Algorithms that Support Missing Values: Some algorithms, like XGBoost, can
handle missing data natively.
Best Practices:
• Explore the reason and pattern of missing data before choosing a handling method.
Transformation of Data:
Data transformation involves converting data into a suitable format for analysis or modeling.
Techniques:
1. Scaling/Normalization: Bring numeric features onto a common scale.
2. Encoding: Convert categorical data into numerical format (e.g., one-hot encoding, label
encoding).
Purpose: To make data consistent, comparable, and suitable for modeling algorithms.
Tools: Libraries like scikit-learn, pandas, and numpy provide efficient transformation
functions.
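A minimal pandas sketch of one-hot encoding; the City column is an invented example:

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"City": ["Pune", "Mumbai", "Pune", "Delhi"]})

# One-hot encoding: one binary column per category
encoded = pd.get_dummies(df, columns=["City"])
print(encoded)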
PAPER 2:
Q3 (a): Why Decision Trees Are Used and Explanation with Diagram [9 Marks]
Decision Trees are widely used in data mining and machine learning because they are easy to interpret, handle both numerical and categorical attributes, and require little data preparation.
Example: Predict whether a person will buy a laptop based on Age and Income.
              [Age?]
             /      \
          <=30      >30
           /           \
     [Income?]      Buy = Yes
       /    \
     Low    High
     /         \
   No          Yes
Explanation of Parts:
1. Root Node
◦ The top node: represents the first attribute used for splitting.
2. Decision Nodes
◦ Nodes that split the data into subsets.
3. Leaf Nodes
◦ Final outcomes or decisions (class labels).
4. Branches
◦ Show decision paths from parent to child nodes based on attribute values.
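A small scikit-learn sketch of building such a tree; the training rows are invented to mirror the Age/Income example above, and the learned splits (root node, decision nodes, leaves) are printed with export_text:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data consistent with the diagram above
data = pd.DataFrame({
    "Age":    [25, 28, 35, 40, 22, 50, 45],
    "Income": [0, 1, 1, 0, 0, 0, 1],   # 0 = Low, 1 = High
    "Buy":    [0, 1, 1, 1, 0, 1, 1],
})

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(data[["Age", "Income"]], data["Buy"])

# Print the learned splits as text
print(export_text(tree, feature_names=["Age", "Income"]))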
Apriori is an algorithm used for frequent itemset mining and association rule learning in market
basket analysis. It works by identifying frequent itemsets in a dataset and then generating
association rules.
Example:
Transactions:
TID | Items Bought
1   | Milk, Bread, Butter
2   | Bread, Butter
3   | Milk, Bread
4   | Milk, Butter
5   | Bread, Butter
Let Minimum Support = 0.6 (i.e., 60%), and Minimum Confidence = 0.7 (70%)
Step 1: Frequent 1-itemsets
Item   | Support Count | Support
Milk   | 3             | 3/5 = 0.6 ✅
Bread  | 4             | 4/5 = 0.8 ✅
Butter | 4             | 4/5 = 0.8 ✅
All three items meet the minimum support of 0.6.
Step 2: Frequent 2-itemsets
Itemset         | Support Count | Support
{Milk, Bread}   | 2             | 2/5 = 0.4 ❌
{Milk, Butter}  | 2             | 2/5 = 0.4 ❌
{Bread, Butter} | 3             | 3/5 = 0.6 ✅
Only {Bread, Butter} is frequent, so no candidate 3-itemsets can be generated.
Step 3: Association rules from {Bread, Butter}
• Bread → Butter: Confidence = Support(Bread ∪ Butter) / Support(Bread) = 0.6 / 0.8 = 0.75 ✅ (≥ 0.7)
• Butter → Bread: Confidence = 0.6 / 0.8 = 0.75 ✅ (≥ 0.7)
Conclusion:
• Apriori finds frequent itemsets by scanning data iteratively and eliminates combinations that
do not meet the support threshold.
• It then derives association rules from frequent itemsets with high confidence.
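A minimal Python sketch of the first two Apriori passes on this example; plain Python, no library, and the support helper is ours:

from itertools import combinations

# Transactions from the example above
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
]
min_support = 0.6
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / n

# Pass 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
print("L1:", frequent1)

# Pass 2: candidate 2-itemsets built only from frequent 1-itemsets (Apriori property)
candidates = [a | b for a, b in combinations(frequent1, 2)]
frequent2 = [c for c in candidates if support(c) >= min_support]
print("L2:", frequent2)   # [frozenset({'Bread', 'Butter'})]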
Q4 a) What is data preprocessing? Explain in detail about handling missing data and
transformation of data. [9 Marks]
Data preprocessing is a data mining technique that involves transforming raw data into a clean,
organized format suitable for machine learning and data analysis.
Importance: Preprocessing improves data quality, which directly affects model accuracy and the reliability of insights.
2. Handling Missing Data
Types of missing data:
• MCAR (Missing Completely at Random) – missingness is unrelated to any of the data.
• MAR (Missing at Random) – depends on observed data but not on the missing value itself.
• MNAR (Missing Not at Random) – depends on the missing value itself.
Methods:
• Deletion: remove rows or columns containing missing values.
• Imputation: fill missing values, e.g., with the mean, median, or mode.
Best Practices:
• Always explore the reason and pattern of missing data.
• When imputing test data, use statistics computed from the training set, not the test set.
3. Transformation of Data
Data transformation converts data into a format appropriate for analysis or modeling.
Common Techniques:
• Standardization: Centers data around the mean with unit variance (z-score).
Benefits: Puts features on comparable scales, which helps many models converge faster and perform better.
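A quick numpy sketch of z-score standardization; the feature values are invented:

import numpy as np

# Hypothetical feature values
x = np.array([10.0, 20.0, 30.0, 40.0])

# z-score standardization: center on the mean, scale to unit variance
z = (x - x.mean()) / x.std()
print(z)   # mean ≈ 0, std = 1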
Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem, assuming
independence among features.
Bayes' Theorem:
P(C|X) = \frac{P(X|C) \times P(C)}{P(X)}
Where:
• P(C|X): Posterior probability of class C
• P(X|C): Likelihood of features given class
• P(C): Prior probability of class
• P(X): Marginal probability of features
"Naïve" Assumption: All features are independent given the class, which simplifies the
computation of P(X|C).
2. Types:
• Multinomial Naïve Bayes: Used for discrete counts (e.g., word counts in text).
3. Advantages:
• Performs well with large datasets and high-dimensional data (e.g., text).
4. Limitations:
• Assumes feature independence, which rarely holds exactly in practice and can reduce accuracy.
5. Applications:
Application             | Description
Spam Detection          | Classify emails as spam or not based on keywords and patterns.
Sentiment Analysis      | Identify sentiment (positive/negative) in reviews or tweets.
Document Categorization | Classify documents into categories like sports, tech, health.
Medical Diagnosis       | Predict disease based on symptoms.
Credit Scoring          | Assess creditworthiness of an individual.
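A minimal scikit-learn sketch of one such application, spam detection with Multinomial Naïve Bayes; the four training texts and labels are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini corpus: 1 = spam, 0 = not spam
texts = ["free offer now", "meeting at noon", "free prize offer", "project status update"]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(texts)   # word-count features

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["free offer prize"])))   # likely [1]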
PAPER 3:
Logistic Regression is used when the dependent variable is categorical, typically for binary
classification problems such as:
• Disease vs No Disease
Key strengths:
• Interpretable Model
• Probability Outputs
The probability of the positive class is modeled as:
P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n)}}
Where:
• P(Y=1) is the probability of the positive class
• \beta_0, \beta_1, ..., \beta_n are the model coefficients and X_1, ..., X_n are the input features
Removing Duplicate Records
Definition:
Duplicate records are identical or nearly identical rows that appear more than once in a dataset,
which can lead to bias, redundancy, and inaccurate model training.
• Prevents overfitting (model learning the same data multiple times)
import pandas as pd

# Build a small DataFrame with one duplicated row ('Alice', 25)
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25]
})

# drop_duplicates() keeps the first occurrence of each row
df_cleaned = df.drop_duplicates()
Before:
Name Age
Alice 25
Bob 30
Alice 25
After removing duplicates:
Name Age
Alice 25
Bob 30
Handling Missing Data
Definition:
Missing data occurs when no value is stored for a variable in a record. This can cause errors in
analysis and reduce model accuracy.
Methods:
1. Drop Missing Values:
df.dropna()
2. Impute Missing Values:
◦ Mean/Median/Mode Imputation:
df['Age'].fillna(df['Age'].mean(), inplace=True)
◦ Forward/Backward Fill:
df.fillna(method='ffill')  # forward fill
3. Use Algorithms that Handle Missing Data:
Some models (e.g., XGBoost) handle missing values natively.
Example Dataset:
Name Age
Alice 25
Bob NaN
Eve 30
After Mean Imputation (mean = 27.5):
Name Age
Alice 25
Bob 27.5
Eve 30
Conclusion:
• Removing duplicates and handling missing data are critical preprocessing steps to ensure
clean, reliable, and accurate data for analysis and machine learning.
PAPER 4:
Q3 a) What is logistic regression, and how does it differ from linear regression? What is the
sigmoid function, and what role does it play in logistic regression? [9 Marks]
1. Logistic Regression:
Logistic regression is a classification technique that models the probability of a binary outcome.
2. Difference from Linear Regression:
Linear regression predicts a continuous, unbounded value; logistic regression passes the linear combination of inputs through the sigmoid so the output is a bounded probability.
3. Sigmoid Function:
The sigmoid function is also known as the logistic function and is defined as:
\sigma(z) = \frac{1}{1 + e^{-z}}
Properties:
• Maps any real number z into the range [0, 1].
• Enables the model to output a probability rather than an unrestricted numeric value.
• The output probability is then compared to a threshold (usually 0.5) to decide the class label.
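A tiny numpy sketch of these properties; the sample inputs are arbitrary:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-5), sigmoid(0), sigmoid(5))   # ~0.0067, 0.5, ~0.9933
print(sigmoid(2.1) > 0.5)                    # True -> class 1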
Q3 b) Calculate the probability that a new email with "offer"=1 and "free"=1 is spam using
Naive Bayes classifier. [9 Marks]
Given:
• Total emails = 5
• Spam = 3 (Emails 2, 3, 5), so P(Spam) = 3/5 = 0.6
• Not Spam = 2 (Emails 1, 4), so P(Not Spam) = 2/5 = 0.4
We need the likelihoods:
• P(Offer=1 | Spam) and P(Free=1 | Spam), estimated from the spam emails (2, 3, 5)
• P(Offer=1 | Not Spam) and P(Free=1 | Not Spam), estimated from the non-spam emails (1, 4)
For Spam:
P(Spam | Offer=1, Free=1) \propto P(Spam) \times P(Offer=1|Spam) \times P(Free=1|Spam)
= 0.6 \times \frac{2}{3} = 0.4 (the two spam likelihoods multiply to 2/3)
For Not Spam:
P(Not\ Spam | Offer=1, Free=1) \propto P(Not\ Spam) \times P(Offer=1|Not\ Spam) \times
P(Free=1|Not\ Spam)
= 0.4 \times \frac{1}{2} \times \frac{1}{2} = 0.4 \times 0.25 = 0.1
Normalizing:
P(Spam | Offer=1, Free=1) = \frac{0.4}{0.4 + 0.1} = 0.8
Final Answer:
The probability that the new email with "offer"=1 and "free"=1 is Spam is 0.8 (80%).
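A quick arithmetic check of this result in Python; the likelihood products (2/3 for spam, 1/4 for not spam) are taken from the worked answer above:

# Class priors from the question
p_spam, p_not_spam = 0.6, 0.4

# Unnormalized posterior scores: prior * product of likelihoods
spam_score = p_spam * (2 / 3)             # = 0.4
ham_score = p_not_spam * (1 / 2) * (1 / 2)  # = 0.1

# Normalize to get P(Spam | Offer=1, Free=1)
p = spam_score / (spam_score + ham_score)
print(round(p, 2))   # 0.8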
Q4 (a): Apriori Algorithm for Discovering Frequent Itemsets and Role of Support &
Confidence [9 Marks]
• The Apriori algorithm is used to find frequent itemsets (sets of items that appear together
frequently) in transactional datasets.
• It works on the principle that all subsets of a frequent itemset must also be frequent
(Apriori property).
Step-by-step process:
1. Set minimum support:
◦ User defines the minimum support, i.e., the minimum frequency an itemset must
have to be considered frequent.
2. Find frequent 1-itemsets by scanning the data.
3. Generate candidate (k+1)-itemsets from frequent k-itemsets and prune any candidate with an infrequent subset (Apriori property).
4. Repeat until no new frequent itemsets are found.
• Support:
◦ Support of itemset A = (Number of transactions containing A) / (Total number of transactions)
◦ Ensures that the rules are relevant to a significant portion of the data.
• Confidence:
◦ Confidence of a rule A → B = Support(A ∪ B) / Support(A)
◦ Measures how often B occurs when A occurs.
Summary:
• Support is used to identify frequent itemsets.
• Confidence is used to generate strong association rules from these frequent itemsets.
3. Select the Best Attribute
◦ Choose the attribute that best separates the data into classes based on a splitting
criterion.
4. Repeat Recursively
◦ For each subset, select the best attribute and split further.
5. Stopping Criteria
▪ All records in the subset belong to the same class (pure node).
▪ No remaining attributes.
◦ Final nodes assign the class based on majority voting or class purity.
Gain Ratio:
◦ Adjusts information gain by the intrinsic information of a split to avoid bias towards
attributes with many values.
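A short Python sketch of gain ratio, assuming the usual entropy-based definitions; the helper names and the tiny example split are ours:

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p * log2(p)) over class proportions
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent_labels, subsets):
    # Information gain divided by the intrinsic information of the split
    n = len(parent_labels)
    gain = entropy(parent_labels) - sum(len(s) / n * entropy(s) for s in subsets)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets)
    return gain / split_info if split_info else 0.0

# Example: a perfectly separating split of [Yes, Yes, No, No]
parent = ["Yes", "Yes", "No", "No"]
print(gain_ratio(parent, [["Yes", "Yes"], ["No", "No"]]))   # 1.0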