AI and DS QB1
c) SVM
- Support Vector Machine (SVM) is a supervised machine learning
algorithm used for classification and regression tasks.
- The algorithm maximizes the margin between the closest points
of different classes.
- The different types of SVMs are:
- Linear SVM: Linear SVMs use a linear decision boundary to
separate the data points of different classes. When the data can
be precisely linearly separated, linear SVMs are very suitable.
This means that a single straight line (in 2D) or a hyperplane (in
higher dimensions) can entirely divide the data points into their
respective classes.
- Non-Linear SVM: Non-Linear SVM can be used to classify data
when it cannot be separated into two classes by a straight line (in
the case of 2D). By using kernel functions, nonlinear SVMs can
handle nonlinearly separable data. The original input data is
transformed by these kernel functions into a higher-dimensional
feature space, where the data points can be linearly separated. The
linear boundary found in this transformed space corresponds to a
nonlinear decision boundary in the original input space.
- Advantages of SVM:
1. High-Dimensional Performance: SVM excels in high-
dimensional spaces, making it suitable for image
classification and gene expression analysis
2. Nonlinear Capability: Utilizing kernel functions like RBF and
polynomial, SVM effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to
ignore outliers, enhancing robustness in tasks such as spam detection.
4. Binary and Multiclass Support: SVM is effective for both
binary classification and multiclass classification.
5. Memory Efficiency: SVM focuses on support vectors, making
it memory efficient compared to other algorithms.
- Disadvantages of SVM:
1. Slow training: SVM can be slow for large datasets, affecting
performance
2. Noise Sensitivity: SVM struggles with noisy datasets and
overlapping classes, limiting effectiveness in real-world
scenarios.
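A minimal scikit-learn sketch of the two SVM types described above, assuming scikit-learn is available; the toy dataset and parameter values (C, gamma) are illustrative only:
```python
# Minimal sketch: linear vs. RBF-kernel SVM on a synthetic toy dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Linear SVM: a single hyperplane separates the classes
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Non-linear SVM: the RBF kernel implicitly maps data to a higher-dimensional space
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:", rbf_svm.score(X_test, y_test))
```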
e) Decision Tree:
- A Decision Tree is a supervised learning algorithm used for both
classification and regression tasks. It works by recursively
splitting the dataset based on the best attribute, forming a tree-
like structure where each internal node represents a decision, and
each leaf node represents a final outcome or class label.
- A decision tree consists of different types of nodes:
- Root Node: The first node in the tree, representing the entire
dataset. It is split into child nodes based on the best attribute
- Internal Node: Nodes that represent the decision made on the
attribute values
- Leaf Nodes: Terminal nodes that provide the final output
- Working of Decision tree:
- STEP1: SELECTING THE BEST ATTRIBUTES:
~To build a decision tree, we need to determine the best attribute
at each step. This is done using:
~ Entropy: Measures the randomness or impurity of the data.
The lower the entropy, the purer the data (see the sketch after
these steps)
~ Information Gain: Measures how much an attribute reduces
entropy when used for splitting
- STEP2: CREATING TREE NODES AND BRANCHES
~ The dataset is divided into subsets based on the selected
attribute’s values
~ Each subset forms a child node, and the process repeats
recursively
- STEP3: STOPPING CONDITIONS
~ All instances in a subset belong to the same class (pure node).
~ No more attributes remain for splitting (majority class is
assigned to the leaf node).
~ A predefined tree depth or minimum subset size is reached (to
prevent overfitting).
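A short NumPy sketch of how entropy and information gain could be computed for one candidate split; the class counts below are made-up illustrative values:
```python
import numpy as np

def entropy(labels):
    # Entropy = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Illustrative parent node: 9 "yes" and 5 "no" labels, split into two subsets
parent = np.array(["yes"] * 9 + ["no"] * 5)
left = np.array(["yes"] * 6 + ["no"] * 2)    # subset for one attribute value
right = np.array(["yes"] * 3 + ["no"] * 3)   # subset for the other value

weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
information_gain = entropy(parent) - weighted
print(round(information_gain, 3))  # the attribute with the highest gain is chosen
```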
- The different types of decision trees, distinguished by their splitting criteria, are:
- Information Gain (used in ID3): Measures the reduction in
entropy after splitting; the attribute with the highest information
gain is chosen
- Gini Index (used in CART): Measures impurity by calculating the
probability of misclassification; the attribute with the lowest
Gini Index is chosen
- Gain Ratio (used in C4.5): Improves Information Gain by
considering the distribution of attribute values
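As an illustrative sketch, scikit-learn's DecisionTreeClassifier lets you choose between the entropy and Gini criteria (note that scikit-learn's tree is CART-based regardless of criterion; the dataset and max_depth below are illustrative):
```python
# Minimal sketch: decision trees using information gain (entropy) and Gini index
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

entropy_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
gini_tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

print(entropy_tree.score(X, y), gini_tree.score(X, y))
```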
f) K-Means Clustering
- K-means clustering is a technique used to organize data into
groups based on their similarity.
- The algorithm works by first randomly picking some central
points called centroids; each data point is then assigned to the
closest centroid, forming a cluster.
- After all the points are assigned to a cluster, the centroids are
updated by finding the average position of the points in each
cluster.
- This process repeats until the centroids stop changing.
- The goal of clustering is to divide the data points into clusters so
that similar data points belong to the same group.
- The algorithm for K-means works as follows:
1. First, we randomly initialize k points, called means or cluster
centroids
2. We categorize each item to its closest mean, and we update
the mean’s coordinates, which are the averages of the items
categorized in that cluster so far
3. We repeat the process for a given number of iterations and at
the end, we have our clusters
- In conclusion, K-means clustering is a powerful unsupervised
machine learning algorithm for grouping unlabeled datasets. Its
objective is to divide data into clusters, making similar data
points part of the same group. The algorithm initializes cluster
centroids and iteratively assigns data points to the nearest
centroid, updating centroids based on the mean of points in each
cluster.
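A minimal scikit-learn sketch of the steps above; the 2-D points and the choice of k = 2 are illustrative:
```python
# Minimal sketch: K-means clustering on a tiny 2-D toy dataset
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # final centroid positions
```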
g) Hierarchical Clustering
- Hierarchical clustering is a technique used to group similar data
points together based on their similarity creating a hierarchy or
tree-like structure. The key idea is to begin with each data point
as its own separate cluster and then progressively merge or split
them based on their similarity.
- There are two types of hierarchical clustering:
1. Agglomerative Clustering: It is also known as the bottom-up
approach or hierarchical agglomerative clustering (HAC). This
clustering algorithm does not require us to prespecify the
number of clusters. Bottom-up algorithms treat each data point
as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged
into a single cluster that contains all the data.
2. Divisive Clustering: It is also known as the top-down approach.
This algorithm also does not require us to prespecify the number
of clusters. Top-down clustering requires a method for splitting
a cluster that contains the whole data and proceeds by splitting
clusters recursively until each individual data point has been
split into a singleton cluster.
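A minimal SciPy sketch of agglomerative (bottom-up) clustering; the points, the Ward linkage method, and the cut into 2 clusters are illustrative:
```python
# Minimal sketch: agglomerative hierarchical clustering with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method="ward")                     # successively merge the closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the hierarchy into 2 clusters
print(labels)
```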
2. Model-Related Issues
a) Overfitting
- The model learns patterns too well, including noise, making
it perform well on training data but poorly on test data.
- Happens when the model is too complex.
Solution: Use regularization techniques (L1, L2), cross-
validation, dropout in neural networks, and simpler models.
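A minimal sketch of the L2 and L1 regularization mentioned above, using scikit-learn; the synthetic data and alpha values are illustrative:
```python
# Minimal sketch: Ridge (L2) and Lasso (L1) regularization to curb overfitting
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=20, noise=10, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can set some coefficients exactly to zero
print(ridge.coef_[:3], lasso.coef_[:3])
```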
b) Underfitting
- The model is too simple and fails to capture underlying
patterns in data.
- Leads to high bias and poor performance on both training
and test data.
Solution: Use a more complex model, increase training time,
or add more relevant features.
c) Hyperparameter Tuning Issues
- Choosing the right hyperparameters (learning rate, number
of layers, regularization strength) is difficult.
- Poor tuning can lead to poor performance.
Solution: Use Grid Search, Random Search, or Bayesian
Optimization.
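A minimal sketch of hyperparameter tuning with Grid Search; the estimator, parameter grid, and dataset are illustrative:
```python
# Minimal sketch: Grid Search with cross-validation for hyperparameter tuning
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```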
3. Algorithm-Related Issues
a) Computational Complexity
- Some ML algorithms require high computational power,
making them infeasible for large datasets.
- Example: Deep learning models require GPUs/TPUs for
efficient training.
Solution: Use dimensionality reduction (PCA, t-SNE),
optimized algorithms, and cloud computing.
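An illustrative sketch of dimensionality reduction with PCA; the digits dataset and the choice of 10 components are arbitrary examples:
```python
# Minimal sketch: reducing feature dimensionality with PCA
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 features per sample
X_reduced = PCA(n_components=10).fit_transform(X)
print(X.shape, "->", X_reduced.shape)        # (1797, 64) -> (1797, 10)
```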
b) Feature Selection and Engineering
- Too many irrelevant features can slow down training and
reduce accuracy.
- Choosing the right features is critical for model
performance.
Solution: Use feature selection techniques (Recursive Feature
Elimination, Mutual Information) or create new meaningful
features.
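A minimal sketch of the two feature selection techniques named above; the synthetic dataset and the number of features to keep are illustrative:
```python
# Minimal sketch: Recursive Feature Elimination and mutual information scores
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print(rfe.support_)                         # which features are kept
print(mutual_info_classif(X, y).round(2))   # relevance score for each feature
```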
- Example:
Problem: Predict house prices based on location, size, and
number of rooms
Steps:
1- Data Collection: Gather real estate data (price, location, area,
rooms)
2- Preprocessing: Handle missing values, scale numerical
features
3- Model Training: Use Linear Regression or Random Forest to
learn patterns
4- Deployment: Create an API to take inputs and return price
prediction
5- Prediction: User inputs location=Mumbai, sqft=1500,
bedrooms=3, model predicts ₹75,00,000.
6- Monitoring: Retrain the model periodically with new data.
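A minimal end-to-end sketch of steps 1-3 and 5 of this workflow; the tiny dataset, prices, and model choice are made up for illustration, and the deployment and monitoring steps are omitted:
```python
# Minimal sketch: train a model on illustrative house data and predict a price
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Steps 1-2: collect and preprocess a small illustrative dataset
data = pd.DataFrame({
    "location": ["Mumbai", "Pune", "Mumbai", "Delhi"],
    "sqft": [1500, 1200, 900, 1100],
    "bedrooms": [3, 2, 2, 3],
    "price": [7_500_000, 4_000_000, 5_000_000, 6_000_000],
})
preprocess = ColumnTransformer(
    [("location", OneHotEncoder(handle_unknown="ignore"), ["location"])],
    remainder="passthrough",
)

# Step 3: train a Random Forest on the preprocessed features
model = Pipeline([("prep", preprocess), ("rf", RandomForestRegressor(random_state=0))])
model.fit(data[["location", "sqft", "bedrooms"]], data["price"])

# Step 5: predict the price for a new input
new_house = pd.DataFrame({"location": ["Mumbai"], "sqft": [1500], "bedrooms": [3]})
print(model.predict(new_house))
```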
7. Distinguish Between Business Intelligence and Data Science
| Aspect | Business Intelligence | Data Science |
| Definition | Analyzes past and present data for decision-making. | Uses ML and statistical techniques to predict future trends. |
| Purpose | Reporting, monitoring, and performance tracking. | Prediction, automation, and pattern discovery. |
| Data Type | Structured data (SQL databases, spreadsheets). | Structured, semi-structured and unstructured data (text, images, videos). |
| Time Orientation | Historical and current data. | Past, present and future data. |
| Techniques Used | Data visualization, dashboards, SQL, OLAP. | Machine learning, AI, deep learning, statistical modelling. |
| Tools & Technologies | Power BI, Tableau, Excel, Looker, SAP BI. | Python, R, TensorFlow, PyTorch, Scikit-learn. |
| Complexity | Easier; focuses on data summarization and visualization. | More complex; involves coding, modeling, and algorithm development. |
| Users | Business analysts, executives, managers. | Data scientists, AI engineers, researchers. |
| Example Use Cases | Analyzing sales trends and creating dashboards. | Predicting customer churn using machine learning. |
| End Goal | Better decision-making based on past data. | Building AI-driven solutions for automation and efficiency. |
8. Explain the role played by correlation and covariance in EDA
- Covariance:
Definition:
o Measures the direction of the relationship between two
variables
o A positive covariance means both variables increase
together
o A negative covariance means one increases while the other
decreases
Formula: Cov(X, Y) = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / N
Limitations:
o Covariance does not indicate the strength of the
relationship
o Its values are not standardized, making comparisons
difficult
Example in EDA:
If the height and weight of people have a positive
covariance, it suggests taller people tend to weigh more.
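A short NumPy sketch of this height/weight example; the numbers are illustrative:
```python
# Minimal sketch: covariance between height and weight
import numpy as np

height = np.array([150, 160, 165, 170, 180])   # cm
weight = np.array([50, 58, 63, 66, 75])        # kg

print(np.cov(height, weight)[0, 1])  # positive value: the variables increase together
```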
- Correlation:
Definition:
o Measures both direction and strength of the
relationship between two variables
o It is the scaled version of covariance and ranges
from -1 to +1
Types of Correlation:
o Positive Correlation (r > 0): One variable increases,
the other also increases
o Negative Correlation (r<0): One variable increases,
the other decreases
o No correlation ( r ≈ 0): No relationship between
variables
Formula: r = Cov(X, Y) / (σX × σY), where σX and σY are the
standard deviations of X and Y
Advantages:
o Standardized between -1 to +1, making it easier to
interpret
o Helps in feature selection by identifying redundant
variables in ML models
Example In EDA:
If hours studied and exam scores have r = 0.85, it means
they have a strong positive correlation.
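A short NumPy sketch of the hours-studied / exam-score example; the data values are illustrative:
```python
# Minimal sketch: Pearson correlation between hours studied and exam scores
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([40, 50, 55, 65, 70, 82])

r = np.corrcoef(hours, scores)[0, 1]
print(round(r, 2))  # close to +1, indicating a strong positive correlation
```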
- Mean:
The mean is the average of all the values in a dataset.
Formula: Mean (X̄) = ΣX / N
Where:
X = Data Values
N = Number of values
Interpretation in a Histogram:
The mean is affected by outliers and skewed
distributions.
In a normal distribution, the mean is located at the center.
In a right-skewed histogram, the mean is greater than the
median.
In a left-skewed histogram, the mean is less than the
median.
Example: In an exam score histogram, if the mean score is 75,
most students scored around this value
- Median:
The median is the middle value when the data is arranged in
ascending order. It divides the dataset into two equal halves.
Interpretation in a Histogram:
The median is not affected by outliers and is useful for
skewed data.
In a normal distribution, the median is at the center, equal
to the mean.
In a right-skewed histogram, the median is less than the
mean.
In a left-skewed histogram, the median is greater than the
mean.
Example: In an income distribution histogram, if the median
salary is $50,000, half of the people earn below and half earn
above this value.
- Mode:
The mode is the most frequently occurring value in a dataset
Interpretation in a Histogram:
The mode is represented by the tallest bar in a histogram.
A dataset can have:
Unimodal Distribution → One peak (single mode).
Bimodal Distribution → Two peaks (two modes).
Multimodal Distribution → Multiple peaks (more than
two modes)
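A short sketch computing all three measures of central tendency for an illustrative score list:
```python
# Minimal sketch: mean, median and mode of a small set of scores
import numpy as np
from statistics import mode

scores = [45, 50, 50, 60, 75, 90]
print(np.mean(scores))    # average; affected by outliers
print(np.median(scores))  # middle value; robust to outliers
print(mode(scores))       # most frequent value (50)
```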
b) Measures of spread
- Measures of spread (dispersion) describe how much the data
values vary from the central tendency (mean, median, or mode).
These measures help understand the distribution, variability,
and consistency of data.
- The main measures of spread are:
i. Range: The range is the difference between the maximum
and minimum values in a dataset.
Formula: Range = Max Value – Min Value
Example: If the scores in a test are 45,50,60,75,90 then:
Range = 90-45 = 45
ii. Interquartile Range (IQR): The IQR is the range of the
middle 50% of data, found by subtracting the 1st quartile
(Q1) from the 3rd quartile (Q3).
Formula: IQR = Q3 – Q1
Example: If Q1 = 50 and Q3 = 80 then:
IQR = 80 – 50 = 30
iii. Variance: The variance measures the average squared
deviation of each data point from the mean. It shows how
data points spread around the mean.
Formula:
For a population: σ² = Σ(X − μ)² / N
For a sample: s² = Σ(X − X̄)² / (n − 1)
Where:
X = Each data point
μ = Population mean, X̄ = Sample mean
N, n = Number of observations
Example:
If the exam scores are 50, 55, 60, 70, 80, the variance is
calculated as the average of the squared differences from the mean
iv. Standard Deviation: The standard deviation is the square
root of variance. It measures the typical amount by which
data points deviate from the mean.
Formula: s = √Variance = √(Σ(X − X̄)² / (N − 1)) for a sample
Where,
X = Data Values
X̄ = Mean
N = Number of values
s = Standard deviation
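A short NumPy sketch of the four measures of spread for the example scores above:
```python
# Minimal sketch: range, IQR, variance and standard deviation
import numpy as np

scores = np.array([50, 55, 60, 70, 80])

print(scores.max() - scores.min())   # range
q1, q3 = np.percentile(scores, [25, 75])
print(q3 - q1)                       # interquartile range
print(scores.var())                  # population variance (divide by N)
print(scores.var(ddof=1))            # sample variance (divide by n - 1)
print(scores.std(ddof=1))            # sample standard deviation
```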
- Skewness: Skewness measures the asymmetry of a distribution
around its mean.
Example: If exam scores have a right-skewed distribution, most
students scored low, but a few high scores pull the mean higher
than the median.
- Kurtosis: Kurtosis measures how heavy or light the tails of a
distribution are compared to a normal distribution. It helps
detect extreme values (outliers).
- Types of Kurtosis:
| Type | Description | Kurtosis Value | Graph Shape |
| Mesokurtic | Moderate peak, normal tails. | ≈ 3 | Bell-shaped |
| Leptokurtic (Heavy-Tailed) | High peak, thick tails (many outliers). | > 3 | Tall, thin peak |
| Platykurtic (Light-Tailed) | Low peak, thin tails (few outliers). | < 3 | Broad, flat peak |
Example: Daily stock market returns are typically leptokurtic, with heavy tails caused by occasional extreme price movements.
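A short SciPy sketch of measuring skewness and kurtosis; the right-skewed sample is randomly generated for illustration:
```python
# Minimal sketch: skewness and (excess) kurtosis of a right-skewed sample
import numpy as np
from scipy.stats import kurtosis, skew

data = np.random.default_rng(0).exponential(size=1000)  # right-skewed distribution

print(round(skew(data), 2))      # > 0: right-skewed, so mean > median
print(round(kurtosis(data), 2))  # SciPy reports excess kurtosis: 0 ~ mesokurtic, > 0 leptokurtic
```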
Rule 2: Move NOT (¬) Inward
Negation should only apply to individual predicates, not to entire
expressions.
Example: ¬(P ∧ Q) becomes ¬P ∨ ¬Q, and ¬∀x P(x) becomes ∃x ¬P(x).
Rule 3: Rename Variables
Each quantifier should have unique variable names to avoid
confusion
Example: ∀x P(x) ∨ ∃x Q(x) is rewritten as ∀x P(x) ∨ ∃y Q(y).
15. Explain confusion matrix w.r.t. ML. Also explain false positive
and false negative and how are they significant?
- A Confusion Matrix is a table used to evaluate the performance
of a classification model by comparing actual values with
predicted values. It helps in understanding the accuracy and
types of errors the model makes.
- Structure of a Confusion Matrix
| Actual \ Predicted | Predicted Positive (1) | Predicted Negative (0) |
| Actual Positive (1) | True Positive (TP) (Correct Prediction) | False Negative (FN) (Missed Positive) |
| Actual Negative (0) | False Positive (FP) (Wrongly Predicted as Positive) | True Negative (TN) (Correct Prediction) |
- A False Positive (Type I Error) occurs when the model
incorrectly predicts a positive result for a negative case.
- The Significance of False Positive is:
Can lead to unnecessary actions (e.g., unnecessary
medical treatment).
In fraud detection, wrongly blocking a genuine
transaction can inconvenience users.
- Example:
A spam filter marks an important email as spam.
A COVID-19 test wrongly detects a healthy person as
infected.
- A False Negative (Type II Error) occurs when the model
incorrectly predicts a negative result for a positive case.
- The Significance of False Negative is:
Can be more dangerous than false positives in critical
applications like healthcare.
In fraud detection, a fraudulent transaction being marked
as genuine can lead to financial loss.
- Example:
A cancer detection model fails to detect cancer in a
patient.
A security system does not flag a cyberattack.
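A minimal scikit-learn sketch that builds a confusion matrix and reads off the FP and FN counts; the label lists are illustrative:
```python
# Minimal sketch: confusion matrix with TP, TN, FP and FN counts
from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```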