BA Notes (End Sem)
Relational analysis examines whether two variables are associated with each other and whether changes in one variable affect another.
Correlation - a statistical measure that expresses the extent to which two variables are linearly related (how strongly related two variables are). It indicates whether an increase or decrease in one variable corresponds to an increase or decrease in another variable.
Covariance is scale dependent, which is why covariance is not used for relational analysis; correlation is scale independent, which is why correlation is used for relational analysis.
Covariance vs Correlation:
o Definition: Covariance measures the direction of the relationship between two variables; correlation measures the strength and direction of the relationship between two variables.
o Range: Covariance has no fixed range; values can be very large or very small. Correlation is always between -1 and 1.
o Scale Dependence: Covariance is affected by the units of the variables (e.g., height in cm vs. inches changes covariance); correlation is unit-free and not affected by measurement scale.
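A minimal Python sketch of the scale-dependence point above (assuming NumPy is installed; the height/weight values are made up purely for illustration): rescaling a variable changes the covariance but leaves the correlation unchanged.

```python
import numpy as np

# Made-up data: height (cm) and weight (kg) for five people
height_cm = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
weight_kg = np.array([52.0, 60.0, 63.0, 70.0, 78.0])

# Rescale height from centimetres to inches
height_in = height_cm / 2.54

# Covariance depends on the units of measurement...
print(np.cov(height_cm, weight_kg)[0, 1], np.cov(height_in, weight_kg)[0, 1])

# ...while correlation is unit-free and always lies in [-1, 1]
print(np.corrcoef(height_cm, weight_kg)[0, 1], np.corrcoef(height_in, weight_kg)[0, 1])
```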
Regression: Regression is a statistical technique used to model and analyze the relationship between
a dependent variable (Y) and one or more independent variables (X).
Equation: y = mx + c
Where: y is the dependent variable, x is the independent variable, m is the slope (the change in y for a unit change in x), and c is the intercept.
Types of Regression
1. Simple Linear Regression
A statistical method used to predict the value of a dependent variable (Y) based on a single independent variable (X).
Equation:
y = mx + c
Where: y is the predicted value of the dependent variable, x is the single independent variable, m is the slope, and c is the intercept.
2. Multiple Linear Regression
A statistical method used to predict the value of a dependent variable (Y) based on two or more independent variables (X₁, X₂, ... Xₙ).
Equation:
Y = b₀ + b₁X₁ + b₂X₂ + ... + bₙXₙ
Where: b₀ is the intercept and b₁, b₂, ... bₙ are the coefficients of the independent variables X₁, X₂, ... Xₙ.
Example:
Predicting House Prices (Y) based on Size (X₁), Location (X₂), and Number of Rooms (X₃).
Location Score (X₂): A score (1-10) indicating how desirable the location is.
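A minimal sketch of multiple linear regression for this house-price example, assuming scikit-learn is available; the sizes, location scores, room counts, and prices below are made-up illustrative values, not data from the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [size in sq ft, location score (1-10), number of rooms]
X = np.array([
    [1500, 7, 3],
    [2000, 8, 4],
    [1200, 5, 2],
    [1800, 6, 3],
    [2500, 9, 5],
])
y = np.array([300, 420, 220, 340, 520])  # price in 1000s (made-up values)

model = LinearRegression().fit(X, y)
print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2, b3):", model.coef_)

# Predict the price of a new house: 1600 sq ft, location score 7, 3 rooms
print("Predicted price:", model.predict([[1600, 7, 3]]))
```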
Start studying 😊 - Residual Standard Error (RSE) and R² (R-Squared) are key metrics used to evaluate the performance of a regression model.
Residual Sum of Squares (RSS)
📖 Definition:
The Residual Sum of Squares (RSS) measures the total squared difference between the observed
values and the values predicted by the regression model. It quantifies the amount of variation in
the dependent variable (Y) that remains unexplained by the regression model.
Not imp just read - ✅ Explanation:
The formula is RSS = Σ(yᵢ - ŷᵢ)², where:
1. yᵢ (Observed Value): The actual value of the dependent variable for observation i.
2. ŷᵢ (Predicted Value): The value predicted by the regression model for the corresponding independent variable(s).
3. (yᵢ - ŷᵢ): The difference between the actual and predicted values is called the residual.
4. (yᵢ - ŷᵢ)²: Squaring each residual ensures all deviations are positive and penalizes larger errors more heavily.
5. Σ: Summing up all the squared residuals gives the Residual Sum of Squares (RSS).
Model Fit: A lower RSS indicates a better-fitting model, as the predictions are closer to the
actual values.
Error Measurement: It measures the variance in Y left unexplained by the regression model.
Optimization: In regression analysis, the goal is often to minimize RSS to improve model
accuracy.
🚀 Key Takeaways:
3. Minimizing RSS: Minimizing RSS is the primary goal of most regression model fitting, as it improves accuracy.
1️⃣ Residual Standard Error (RSE)
📖 Definition:
Residual Standard Error measures the average deviation of the observed values from the regression line. It quantifies how well the regression model fits the data.
🧠 Formula:
RSE = √(RSS / (n − p − 1)), where n is the number of observations and p is the number of predictors (for simple linear regression this simplifies to √(RSS / (n − 2))).
✅ Interpretation:
Low RSE: The model fits the data well (small average error).
High RSE: The model doesn’t fit the data well (large average error).
Example:
If the RSE is 2.5, it means the average deviation of observed data points from the predicted
regression line is approximately 2.5 units.
2️⃣ R² (R-Squared)
📖 Definition:
R² measures the proportion of variance in the dependent variable (Y) that can be explained by the
independent variables (X). It indicates the goodness of fit of the regression model.
✅ Interpretation:
R² ranges from 0 to 1. Values close to 1 mean the independent variables explain most of the variation in Y; values close to 0 mean they explain very little.
Example:
If R² = 0.85, it means 85% of the variation in the dependent variable (Y) is explained by the
independent variables (X). The remaining 15% is due to random error or unobserved factors.
RSE tells you how far off your predictions are, on average.
R² tells you how well your independent variables explain the variability in the dependent
variable.
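A short sketch tying RSS, RSE, and R² together, assuming NumPy; the observed and predicted values below are made up for illustration.

```python
import numpy as np

# Made-up observed values and model predictions for illustration
y = np.array([10.0, 12.0, 15.0, 18.0, 20.0])     # observed (y_i)
y_hat = np.array([9.5, 12.5, 14.0, 18.5, 21.0])  # predicted (y_hat_i)

residuals = y - y_hat
rss = np.sum(residuals ** 2)           # Residual Sum of Squares

n, p = len(y), 1                       # n observations, p predictors
rse = np.sqrt(rss / (n - p - 1))       # Residual Standard Error

tss = np.sum((y - y.mean()) ** 2)      # Total Sum of Squares
r_squared = 1 - rss / tss              # proportion of variance explained

print(f"RSS = {rss:.3f}, RSE = {rse:.3f}, R² = {r_squared:.3f}")
```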
Unsupervised Learning is used to find patterns and relationships in data without pre-labeled
outcomes (e.g., customer segmentation and anomaly detection).
Supervised Learning
Supervision: The training data (observations, measurements, etc.) are accompanied by labels
indicating the class of the observations
Unsupervised Learning
In unsupervised learning, the algorithm is given data without labels and must find hidden patterns or
relationships in the data.
The algorithm is given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data.
Let's dive deeper into K-Nearest Neighbors (KNN), Decision Trees, Naive Bayes, and Random Forest, including how to write rules for predictions and interpret metrics like accuracy, sensitivity, and the confusion matrix.
Classification Techniques:
1. K-Nearest Neighbors (KNN)
KNN is a classification algorithm that assigns a class to a data point based on the majority class of its nearest neighbors. It assumes that similar data points are located near each other and can be grouped based on their proximity.
1. Select the number of neighbors (k): Choose the number of nearest neighbors (e.g., k = 3).
2. Compute the distance: Calculate the distance between the test point and all training points (using Euclidean distance, Manhattan distance, etc.).
3. Sort the distances: Order the training points from nearest to farthest.
4. Pick the k nearest neighbors: Keep the k training points closest to the test point.
5. Vote for the class: The class assigned to the test point is determined by the majority vote. If there's a tie, you can choose the class that appears first.
6. Make the prediction: Label the test point with the winning class.
Suppose you're classifying whether a customer will default on a loan based on their credit score and
income.
Data:
Stepwise Prediction:
1. Calculate the distance from the test point to all training points.
3. Neighbors for test point: Customer 4 (Default), Customer 1 (No Default), Customer 2 (No Default).
4. Majority vote: two of the three neighbors are No Default, so the predicted class is No Default.
Metrics:
Confusion Matrix:
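A minimal sketch of the loan-default KNN example, assuming scikit-learn; since the original data table is not reproduced in these notes, the credit-score/income values and labels below are hypothetical. It also prints the accuracy and confusion matrix referred to above.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical training data: [credit score, income in 1000s]
X_train = np.array([[700, 60], [720, 75], [580, 30], [600, 35]])
y_train = np.array(["No Default", "No Default", "Default", "Default"])

# k = 3 nearest neighbors, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Classify a new customer
print(knn.predict([[650, 45]]))

# Evaluate on a small hypothetical test set
X_test = np.array([[710, 70], [590, 32]])
y_test = np.array(["No Default", "Default"])
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```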
2. Decision Trees
A decision tree is a hierarchical model that splits data into subsets based on feature values, forming a
tree structure. Each node represents a feature, and each branch represents a decision rule.
1. Select the feature to split on: At each node, select the feature that best splits the data.
Common methods include Gini impurity or Information Gain (Entropy).
2. Split the data: Based on the feature selected, divide the dataset into branches that represent
possible outcomes.
3. Repeat the process: Continue splitting recursively until a stopping condition is met (e.g., all
points belong to the same class, or the tree reaches a predefined depth).
4. Assign class labels: At each leaf node, assign the most frequent class from the data points in
that leaf.
5. Prediction: Starting from the root, follow the decision rules until a leaf node is reached. The
class label at the leaf node is the prediction.
Example:
Consider a decision tree to predict whether a customer will buy insurance, based on age and income.
Data:
Stepwise Prediction:
1. The root node splits the data on Age (Age ≤ 30 vs. Age > 30).
2. For customers with Age ≤ 30, predict "No Buy" (majority class).
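A minimal decision-tree sketch for this insurance example, assuming scikit-learn; the age/income values and Buy/No Buy labels are hypothetical since the original data table is not included here. export_text prints the learned decision rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data: [age, income in 1000s]
X = np.array([[22, 30], [28, 40], [35, 70], [45, 90], [50, 60], [26, 35]])
y = np.array(["No Buy", "No Buy", "Buy", "Buy", "Buy", "No Buy"])

# Gini impurity is the default splitting criterion; entropy is also available
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned decision rules and predict for a new customer
print(export_text(tree, feature_names=["Age", "Income"]))
print(tree.predict([[29, 45]]))
```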
3. Naive Bayes
Naïve Bayes is a probabilistic classification technique based on Bayes' theorem, assuming independence between features. It is widely used for classification tasks, such as spam detection, sentiment analysis, and risk prediction.
1. Calculate prior probabilities: The probability of each class, based on its frequency in the training data.
2. Calculate likelihoods: The probability of each feature value given each class.
3. Apply Bayes' Theorem/Calculate posterior probability: Use Bayes' formula to compute the posterior probability for each class, and predict the class with the highest posterior probability.
Example:
Prior probabilities from past emails: Spam: 60%, Not Spam: 40%.
The likelihood of certain words appearing in each type of email (spam or not) is calculated. If the
word "free" appears in a new email, the probability of that email being spam is computed using
Bayes' Theorem.
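A small worked sketch of the "free" example using Bayes' theorem directly; the 60% prior comes from the example above, while the likelihood values are assumed purely for illustration.

```python
# Hypothetical probabilities for the spam example above
p_spam = 0.60                  # prior P(Spam)
p_not_spam = 0.40              # prior P(Not Spam)
p_free_given_spam = 0.30       # assumed likelihood P("free" | Spam)
p_free_given_not_spam = 0.05   # assumed likelihood P("free" | Not Spam)

# Bayes' theorem: P(Spam | "free") = P("free" | Spam) * P(Spam) / P("free")
p_free = p_free_given_spam * p_spam + p_free_given_not_spam * p_not_spam
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(f"P(Spam | 'free') = {p_spam_given_free:.2f}")  # 0.90 with these assumed numbers
```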
U can do this if u want – explanation of Sir’s example attached in ppt – read through –
4. Random Forest
Random Forest is an ensemble learning method that builds multiple decision trees and combines
their predictions to improve accuracy.
1. Bootstrap Sampling: Create multiple datasets by randomly sampling from the original
training set (with replacement).
2. Train multiple decision trees: Train a decision tree on each of the bootstrap datasets.
3. Predict using individual trees: Each tree makes a prediction for a test point.
4. Combine the predictions: For classification, use a majority vote to decide the final class
prediction.
5. Prediction: The class chosen by the majority of the trees is the predicted class.
Example:
Similar to the decision tree example above, but you would train multiple trees on different subsets of
data. Each tree might make slightly different predictions, but the overall prediction is the majority
vote.
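A minimal Random Forest sketch, reusing the hypothetical insurance data from the decision-tree sketch above (assuming scikit-learn); the final class is the majority vote across the trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Same hypothetical insurance data as the decision-tree sketch above
X = np.array([[22, 30], [28, 40], [35, 70], [45, 90], [50, 60], [26, 35]])
y = np.array(["No Buy", "No Buy", "Buy", "Buy", "Buy", "No Buy"])

# 100 trees, each trained on a bootstrap sample with a random subset of features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The final prediction is decided by the ensemble of trees
print(forest.predict([[29, 45]]))
print(forest.predict_proba([[29, 45]]))  # averaged class probabilities across the trees
```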
The slide provides an overview of Random Forest, a popular ensemble learning method introduced
by Leo Breiman in 2001. Below is a detailed explanation of its key concepts:
Random Forest is an ensemble learning technique that builds multiple decision tree classifiers and
aggregates their predictions to improve accuracy and reduce overfitting. The main idea behind
Random Forest is:
1. Each decision tree in the ensemble is built using a random subset of attributes at each node.
2. During classification, each tree provides a vote, and the majority class is chosen as the final
prediction.
o The CART (Classification and Regression Trees) methodology is used to grow the
decision trees.
o Instead of using the original attributes directly, this method creates new attributes.
o This reduces the correlation between decision trees, making the model more diverse
and robust.
Accuracy & Robustness: Comparable to AdaBoost, but more resistant to noise and outliers.
Efficiency: Works well even if many attributes are used, and it is faster than bagging or
boosting.
Feature Selection: Insensitive to the number of attributes chosen for each split, making it
efficient in high-dimensional datasets.
In summary, Random Forest is a powerful and widely used machine learning algorithm that balances
accuracy, robustness, and efficiency while reducing overfitting compared to single decision trees.
What is Clustering?
Clustering is an unsupervised learning technique used to group a set of objects or data points
into clusters (groups) based on similarity. The goal is to ensure that data points within the same
cluster are similar to each other, while data points from different clusters are as dissimilar as
possible.
K-Means is one of the most popular and widely used clustering algorithms. It divides data points
into K distinct clusters based on their features.
Suppose we have a dataset with customer information, such as age and income, and we want to segment customers into K = 2 groups for targeted marketing.
Dataset:
Customer ID   Age   Income (in 1000s)
1             25    50
2             45    60
3             30    45
4             50    90
5             35    55
6             40    85
1. Choose the number of clusters: K = 2.
2. Initialize Centroids: Let's say the initial centroids are randomly chosen as (25, 50) and (45, 90).
3. Assign Points to Nearest Centroid: Calculate the Euclidean distance between each customer's data point and the centroids. Customers 1, 2, 3 and 5 are closest to (25, 50); customers 4 and 6 are closest to (45, 90).
4. Update Centroids: After assignment, update the centroids by calculating the mean of all points in each cluster:
o Cluster 1: Average of (25, 50), (45, 60), (30, 45), (35, 55) → New Centroid (33.75, 52.5)
o Cluster 2: Average of (50, 90), (40, 85) → New Centroid (45, 87.5)
5. Repeat: Reassign points to the nearest centroid and recalculate centroids until the centroids do not change. With these centroids the assignments stay the same, so the algorithm stops.
Final Clusters:
Cluster 1: Customers 1, 2, 3, 5 with Centroid (33.75, 52.5) → Likely younger, lower-income customers.
Cluster 2: Customers 4, 6 with Centroid (45, 87.5) → Likely older, higher-income customers.
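A minimal K-Means sketch on the six customers from the table above, assuming scikit-learn; with K = 2 and multiple restarts it converges to the centroids worked out above.

```python
import numpy as np
from sklearn.cluster import KMeans

# The six customers from the table above: [age, income in 1000s]
X = np.array([[25, 50], [45, 60], [30, 45], [50, 90], [35, 55], [40, 85]])

# K = 2 clusters; n_init repeats the algorithm with different starting centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster labels:", kmeans.labels_)
print("Final centroids:\n", kmeans.cluster_centers_)  # close to (33.75, 52.5) and (45, 87.5)
```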
Selecting the optimal number of clusters (K) can be tricky. A common method for determining K is the Elbow Method:
1. Plot the WCSS (within-cluster sum of squares) for different values of K (e.g., from 1 to 10).
2. Look for the "elbow" point in the graph, where the reduction in WCSS begins to level off. This
is often considered the optimal value for K.
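A quick sketch of the Elbow Method on the same customer data, assuming scikit-learn; KMeans exposes the WCSS as the inertia_ attribute, so plotting these values against K reveals the elbow.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[25, 50], [45, 60], [30, 45], [50, 90], [35, 55], [40, 85]])

# WCSS (inertia_) for K = 1..5; plot these values and look for the "elbow"
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```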
Strengths:
Weaknesses:
o Requires k in advance.
Sensitive to Outliers:
o Solution: Use K-Medoids instead, as it selects actual data points (medoids) instead
of mean values.
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
PAM (Partitioning Around Medoids): Starts from an initial set of medoids and iteratively replaces one of the medoids with one of the non-medoids if it improves the total distance of the resulting clustering.
PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity)
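A tiny PAM-style sketch on the same customer data, assuming only NumPy; this is a simplified illustration of the medoid-swap idea, not an optimized implementation.

```python
import numpy as np
from itertools import product

# Same customer data as above: [age, income in 1000s]
X = np.array([[25, 50], [45, 60], [30, 45], [50, 90], [35, 55], [40, 85]], dtype=float)

def total_cost(medoid_idx):
    # Sum of distances from every point to its nearest medoid
    dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

medoids = [0, 3]  # initial medoids (indices of actual data points)
improved = True
while improved:
    improved = False
    # Try swapping each medoid with each non-medoid; keep swaps that lower the cost
    for m_pos, candidate in product(range(len(medoids)), range(len(X))):
        if candidate in medoids:
            continue
        trial = medoids.copy()
        trial[m_pos] = candidate
        if total_cost(trial) < total_cost(medoids):
            medoids, improved = trial, True

print("Final medoids:", X[medoids])
```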
Hierarchical Clustering
o Agglomerative (Bottom-Up): Start with each point as its own cluster and iteratively merge the closest clusters until only one cluster remains. At each step, the two clusters with the least dissimilarity (most similarity) are merged, and this process continues in a non-descending fashion until all points belong to the same cluster.
o Divisive (Top-Down): Start with all points in one cluster and recursively split clusters into smaller ones.
Illustration: The three images on the slide show how the clustering progresses, with larger and larger clusters forming as the merging continues.
Dendrogram:
o Data objects (points) are decomposed into different levels of nested partitions.
o To obtain the final clusters, we can cut the dendrogram at a desired level.
o The connected components (subtrees) after cutting form the final clusters.
Illustration:
The dendrogram diagram at the bottom visually represents the merging process.
If we cut at a specific horizontal level, the remaining branches form separate clusters.
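A minimal agglomerative-clustering sketch on the same customer data, assuming SciPy; linkage builds the merge hierarchy and fcluster "cuts" the dendrogram into a chosen number of flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Same customer data as above: [age, income in 1000s]
X = np.array([[25, 50], [45, 60], [30, 45], [50, 90], [35, 55], [40, 85]])

# Agglomerative clustering: 'ward' merges the pair of clusters with the
# smallest increase in within-cluster variance at each step
Z = linkage(X, method="ward")

# "Cut" the dendrogram into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the dendrogram with matplotlib
```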
Common Use Cases: Customer segmentation, image compression, anomaly detection.
Business Problem:
An insurance company wants to predict whether an insurance claim is fraudulent or not, based on
various features of the claim and the claimant. This helps reduce the number of fraudulent claims
the company processes.
Features:
Policy Type: Type of insurance policy held by the customer (e.g., health, auto)
Geographic Location: Location of the customer (can help detect regional fraud trends)
Claim Time: The time it took to process the claim (a large delay might indicate suspicious
behavior)
Target Class: Fraudulent or Legitimate.
1. Prepare the data: Collect data from past claims, where the fraudulent claims are labeled.
The data should contain information like claim amount, policy type, claim history, etc.
2. Train the model: Use random forest, which is an ensemble of decision trees, to train the
model. Each decision tree will make a classification (fraud or legitimate), and the final
prediction will be the majority vote from all the trees.
3. Predict new claims: For new claims, the trained random forest model will predict whether
the claim is likely fraudulent or legitimate.
Interpretation:
Confusion Matrix: Will show how many fraudulent claims were detected, how many
legitimate claims were mistakenly identified as fraudulent, etc.
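A minimal Random Forest sketch for this fraud-detection case, assuming scikit-learn; the claim features below are hypothetical, already-encoded values invented for illustration, not real claims data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Hypothetical, already-encoded claim features:
# [claim amount in 1000s, policy type code, location code, processing time in days]
X = np.array([
    [12, 0, 2, 5],
    [85, 1, 1, 40],
    [7, 0, 3, 3],
    [60, 1, 2, 35],
    [15, 0, 1, 6],
    [90, 1, 3, 50],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = fraudulent, 0 = legitimate (made-up labels)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Predict a new claim and inspect the confusion matrix on the training data
print(clf.predict([[70, 1, 2, 45]]))
print(confusion_matrix(y, clf.predict(X)))
```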
Business Problem:
A telecom company wants to predict whether a customer is likely to churn (i.e., leave the company)
based on their account activity. This allows the company to take preventive action, such as offering
promotions or discounts.
Features:
Customer Tenure: How long the customer has been with the company
Monthly Charges: The amount the customer is charged monthly for their telecom services
Service Type: Whether the customer uses mobile, internet, or bundle services
Customer Support Calls: The number of customer support calls made by the customer in the
past month
Payment Method: Whether the customer pays via direct debit, credit card, or other methods
Target Class: Churn or No Churn.
1. Prepare the data: Gather data from past customers, including those who churned and those
who stayed. Label each customer accordingly.
2. Train the model: Use a decision tree algorithm to learn patterns in the features that are
correlated with churn. The tree will split customers into different branches based on features
like tenure, service type, etc.
3. Prediction: For new customers, the decision tree will predict whether they are likely to churn
based on the features in their profile.
Interpretation:
Accuracy: The percentage of correct predictions made by the decision tree (i.e., correctly
identifying churners and non-churners).
Sensitivity: How well the model detects customers who will churn.
Confusion Matrix: Helps to understand how well the model is distinguishing between
customers who churn and those who stay.
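A minimal decision-tree sketch for the churn case, assuming scikit-learn; the tenure/charges/support-call values and churn labels are hypothetical, and accuracy, sensitivity (recall for churners), and the confusion matrix are computed as described above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

# Hypothetical encoded customer data: [tenure in months, monthly charges, support calls]
X = np.array([
    [2, 80, 5], [36, 60, 0], [5, 95, 4], [48, 55, 1], [12, 70, 3], [60, 50, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = churned, 0 = stayed (made-up labels)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
y_pred = tree.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))
print("Sensitivity (recall for churners):", recall_score(y, y_pred))
print("Confusion matrix:\n", confusion_matrix(y, y_pred))
```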
Here’s a clear comparison in tabular format outlining the fundamental differences between
Classification vs Regression, Supervised vs Unsupervised Learning, and Labeled vs
Unlabeled Data:
Type of Problem:
o Classification: Categorical problem.
o Regression: Numerical problem.
Model Types:
o Supervised learning: Linear Regression, Logistic Regression, SVM, Decision Trees, Neural Networks.
o Unsupervised learning: K-means, DBSCAN, Hierarchical Clustering, Principal Component Analysis (PCA).
Usage:
o Labeled data: Used in supervised learning for both classification and regression tasks.
o Unlabeled data: Used in unsupervised learning for clustering, dimensionality reduction, etc.
Process:
o Supervised learning: The model is trained using the labeled data to learn the relationship between inputs and outputs.
o Unsupervised learning: The model tries to infer the structure of the data by identifying patterns or similarities in the input data.
Summary:
Classification vs Regression: Classification deals with categorical outputs, while regression
predicts continuous numerical values.
Labeled vs Unlabeled Data: Labeled data has target labels associated with it, whereas
unlabeled data lacks target labels and is often used in unsupervised learning methods.