Unit 2
EDA is the process of analysing and summarizing a dataset to uncover patterns, detect
anomalies, find relationships between variables, and check if there are any missing or
incorrect values. It’s an essential step in data analysis because it helps us understand
the story behind the data before making any decisions.
Imagine you are given a dataset without any context. If you immediately try to apply a
machine learning model, it might not work well because the data could have missing
values, incorrect entries, or irrelevant columns. EDA helps us clean and transform the
data so that it makes sense before we move forward. EDA involves both looking at
numbers (statistical analysis) and visualizing the data (graphs and plots).
Exploratory Data Analysis (EDA) can be divided into different types based on how we
analyse the data. Broadly, there are four main types of EDA: Univariate Analysis,
Bivariate Analysis, Multivariate Analysis, and Missing Value & Outlier Analysis.
Let’s go through each of them in simple terms.
"Uni-" means one, so univariate analysis means examining one variable at a time.
This helps us understand the distribution, central tendency (mean, median, mode),
and spread (variance, standard deviation) of that particular variable.
For example, if we have a dataset of students’ exam scores, and we focus only on
the "Math Score" column, we are doing univariate analysis.
How do we analyze? For a single variable, we typically look at summary statistics (mean, median, mode, variance, standard deviation) and plots such as histograms, box plots, or bar charts.
"Bivariate" means two variables. Here, we study how two variables relate to each
other.
For example, we might want to check if "study hours" and "exam scores" have a
relationship. Do students who study more score higher?
How do we analyse?
When both variables are numerical:
o Use scatter plots to visualize the relationship.
o Use correlation coefficients (like Pearson’s correlation) to
measure the strength of the relationship.
When one variable is categorical and the other is numerical:
o Use box plots or violin plots to compare distributions.
When both variables are categorical:
o Use a cross-tabulation table or a heatmap.
Example:
If we have a dataset of employees with "Years of Experience" and "Salary," we can
use a scatter plot to check if more experience leads to a higher salary.
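For instance, the strength of such a relationship can be measured with Pearson's correlation. Here is a minimal sketch using pandas (the library choice and the numbers are illustrative assumptions, not part of the original example):
import pandas as pd

# Hypothetical data: years of experience vs. salary
df = pd.DataFrame({
    "experience": [1, 3, 5, 7, 10],
    "salary": [35000, 45000, 60000, 72000, 95000],
})

# Pearson correlation coefficient between the two numerical variables
correlation = df["experience"].corr(df["salary"])
print(f"Pearson correlation: {correlation:.2f}")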
When we analyze three or more variables at the same time, it’s called
multivariate analysis. This is useful in understanding complex relationships.
For example, we might want to see how "study hours," "sleep hours," and "exam
scores" together affect student performance.
How do we analyze? Common approaches include pair plots, correlation heatmaps, grouped scatter plots, and models such as multiple regression that look at several variables together.
Before performing deep analysis, we need to detect and handle missing values
and outliers because they can mislead our conclusions.
Handling Outliers:
Use Boxplots: If we see extreme values (very high or low points), they might
be outliers.
Use Z-score or IQR method to detect outliers mathematically.
Decide whether to remove or adjust them based on context.
Example:
If we analyze salaries in a company and see that most employees earn between
$30K-$50K but one person earns $1M, that’s an outlier. We need to check if it’s a
genuine data point or a mistake.
1. Histogram
A histogram is applicable for both univariate and bivariate analyses. The hist function in
Matplotlib is used to create histograms. Here's a basic example along with some
explanation:
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
# Plotting histogram
plt.hist(data, bins=5, color='blue', edgecolor='black')
# Adding labels and title
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram Example')
# Display the plot
plt.show()
Output: a histogram with five bars showing how often each value (1 to 5) appears in the data.
Explanation:
plt.hist(data, bins=5, color='blue', edgecolor='black'): this line draws the histogram, grouping the values into 5 bins and drawing blue bars with black edges. plt.xlabel(), plt.ylabel() and plt.title() add the axis labels and title, and plt.show() displays the plot.
2. Scatter Plot
A scatter plot shows the relationship between two numerical variables, so it is mainly used for bivariate analysis. The scatter function in Matplotlib is used to create scatter plots.
Example:
import matplotlib.pyplot as plt
# Sample data
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 6, 8, 10]
# Plotting scatter plot
plt.scatter(x_values, y_values, color='blue', marker='o', label='Data Points')
# Adding Labels and title
plt.xlabel('Independent variable (x-axis)')
plt.ylabel('Dependent Variable (Y-axis)')
plt.title('Scatter Plot Example')
# Adding a Legend (optional)
plt.legend()
# Display the plot
plt.show()
Explanation:
plt.scatter(x_values, y_values, color='blue', marker='o', label='Data Points') : This
line creates a scatter plot. x_values and y_values are your datasets for the X and Y
axes. You can customize the color, marker style, and label as needed.
plt.xlabel('Independent Variable (X-axis)') and plt.ylabel('Dependent Variable (Y-
axis)') : These lines add labels to the x-axis and y-axis, respectively.
plt.title('Scatter Plot Example') : This line adds a title to the plot.
plt.legend(): This line adds a legend if you have labeled your data points.
plt.show(): This function displays the plot.
3. Box Plot
A box plot is also known as a whisker plot, box-and-whisker plot, or simply a box-whisker diagram. It is a graphical representation of the distribution of a dataset. It
displays key summary statistics such as the median, quartiles and potential outliers in a
concise and visual manner. It is versatile and applicable for both univariate and bivariate
analyses.
Seaborn Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
# Creating a box plot using Seaborn
sns.boxplot(x=data)
# Adding Labels and title
plt.xlabel('Variable')
plt.ylabel('values')
plt.title('Box Plot Example')
# Display the plot
plt.show()
Explanation:
plt.boxplot(data): Matplotlib's own boxplot function would create a similar plot; replace 'data' with your actual dataset.
sns.boxplot(x=data): This line creates a box plot using Seaborn. Again, replace 'data' with
your dataset.
plt.xlabel('Variable') and plt.ylabel('Values') : These lines add labels to the x-axis and y-
axis, respectively.
plt.title('Box Plot Example') : This line adds a title to the plot.
plt.show(): This function displays the plot.
4. Handling Outlier and Remove Outlier
Outliers are unusual values in a dataset that are very different from the rest of the
data. Imagine you are looking at the heights of a group of people, and most of them
are between 5 and 6 feet tall. But suddenly, you see a number like 10 feet or 2 feet—
these are outliers because they are very different from the other values.
Outliers can appear for many reasons. Sometimes, they are real values that just
happen to be extreme, like a basketball player who is much taller than others. Other
times, they are mistakes, like someone typing “100” instead of “10” in a database.
They can also occur due to measuring errors or natural variations in the data.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Sample DataFrame with outliers
data = {"values": [1, 2, 2, 3, 3, 3, 4, 4, 4, 20]}
df = pd.DataFrame(data)
# Create a box plot before removing outliers
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(x=df['values'])
plt.title('Box Plot Before Removing Outliers')
# Identify outliers using the interquartile range (IQR)
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1
# Define a threshold to identify outliers
threshold = 1.5
# Filter out outliers
filtered_data = df[(df['values'] >= Q1 - threshold * IQR) & (df['values'] <= Q3 + threshold * IQR)]
# Create a box plot after removing outliers
plt.subplot(1, 2, 2)
sns.boxplot(x=filtered_data['values'])
plt.title('Box Plot After Removing Outliers')
plt.tight_layout()
plt.show()
Descriptive statistics is a way of summarizing and understanding data. Imagine you have
a big list of numbers, like the scores of students in a class. Instead of looking at each
number one by one, descriptive statistics helps you get a clear picture of the data by
organizing and simplifying it. It tells us about the general pattern in the data, what is
typical, and how much the data varies.
There are two main types of descriptive statistics: Measures of Central Tendency and
Measures of Dispersion (Variability).
Measures of central tendency tell us about the middle or average value of the data. The
most common ways to find this are the Mean, Median, and Mode.
Mean (Average): This is the sum of all values divided by the number of values. For
example, if five students scored 80, 85, 90, 95, and 100, the mean score would be:
(80 + 85 + 90 + 95 + 100) / 5 = 450 / 5 = 90
The mean is useful, but if there are extreme values (like one student scoring 10), it
can be misleading.
Median: This is the middle value when all numbers are arranged in order. If there are
an odd number of values, the median is simply the middle one. If there is an even
number of values, the median is the average of the two middle numbers.
For example, in the same set of scores (80, 85, 90, 95, 100), the middle number is
90, so that’s the median. If there were six scores (80, 85, 90, 95, 100, 110), the
median would be:
(90 + 95) / 2 = 92.5
Mode: This is the number that appears the most. If we have test scores of 85, 90,
90, 95, and 100, the mode is 90 because it appears twice. Some datasets may have
more than one mode or no mode at all.
Just knowing the average is not enough; we also need to see how much the data varies.
Two groups of students could have the same average test score, but in one group,
everyone scores around the same number, while in another, scores range from very low
to very high. Measures of dispersion help us understand this spread.
Range: This is the difference between the highest and lowest value. If the highest
test score is 100 and the lowest is 60, the range is: 100−60=40
While the range gives an idea of how spread out the data is, it is affected by extreme
values.
Interquartile Range (IQR): This focuses on the middle 50% of the data. It is
calculated by subtracting the 25th percentile (Q1) from the 75th percentile (Q3). This
method is more reliable than the range because it ignores extreme values.
Variance: Variance tells us how much the data points differ from the mean. If the
variance is high, it means the data points are widely spread.
Standard Deviation: This is the square root of the variance and tells us how much a
typical value differs from the mean. If the standard deviation is small, it means most
values are close to the mean. If it is large, the values are more spread out.
Example:
import numpy as np
from statistics import mode
scores = [80, 85, 90, 95, 100, 85, 90, 95, 85, 100]
# Measures of Central Tendency
mean_value = np.mean(scores)
median_value = np.median(scores)
mode_value = mode(scores)
# Measures of Dispersion
range_value = max(scores) - min(scores)
variance_value = np.var(scores, ddof=1)  # sample variance
std_value = np.std(scores, ddof=1)       # sample standard deviation
# Display Results
print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
print(f"Range: {range_value}")
print(f"Variance: {variance_value}")
print(f"Standard Deviation: {std_value}")
Output:
Mean: 90.5
Median: 90.0
Mode: 85
Range: 20
Variance: 46.94444444444444
Standard Deviation: 6.85160159703148
Hypothesis testing is a way of making decisions based on data. Imagine you are a scientist or a business
analyst, and you want to check if a claim is true or not. Instead of relying on guesses, you use hypothesis
testing to analyze the data and make a conclusion.
For example, suppose a company claims that their new diet pill helps people lose 5 kg in a month. You can
collect data from a group of people who used the pill, analyze their weight loss, and use hypothesis testing to
see if the company's claim is really true or just a coincidence.
To do this, we first state two hypotheses: the null hypothesis (H0), which represents the default claim (for example, the pill has no effect on weight), and the alternative hypothesis (H1), which represents what we want to show (the pill does help people lose weight). Once we define these hypotheses, we collect data and perform a test to determine whether we should reject the null hypothesis or not. If the data strongly supports the alternative hypothesis, we reject the null hypothesis and conclude that the pill likely works.
There are different types of hypothesis tests depending on what we are trying to check. Some common types
include:
1. One-Sample t-Test
This test is used when we want to compare the mean (average) of one sample to a known value. For example, if a manufacturer claims that their batteries last 10 hours, we can test if the average battery life in our sample is really 10 hours or not (a code sketch of this test appears after this list).
2. Two-Sample t-Test
This test is used when we want to compare the averages of two different groups. For
example, if we want to compare the test scores of students who used a new study
method versus those who followed the old method, we use a two-sample t-test to
see if there is a significant difference in scores.
3. Paired t-Test
This test is used when we measure the same group of people before and after some
change. For example, if we measure students' performance before and after a
training program, a paired t-test helps determine if there is a significant
improvement.
5. Chi-Square Test
This test is used for categorical data. For example, if we want to check if gender
(male/female) affects customer preferences for a product, we use a chi-square test
to see if there is a significant relationship between the two variables.
6. Z-Test
This test is similar to the t-test but is used when we have a large sample size
(usually more than 30). It is commonly used in business and social sciences when
analysing population proportions.
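To make the one-sample t-test described above concrete, here is a minimal sketch using scipy.stats.ttest_1samp (SciPy and the battery-life numbers are assumptions for illustration):
from scipy import stats

# Hypothetical battery-life measurements (in hours) for 10 sampled batteries
battery_life = [9.8, 10.1, 9.5, 10.3, 9.9, 9.7, 10.0, 9.6, 9.4, 9.8]

# H0: mean battery life = 10 hours, H1: mean battery life is different from 10 hours
t_stat, p_value = stats.ttest_1samp(battery_life, popmean=10)
print(f"t-statistic: {t_stat:.3f}, p-value: {p_value:.3f}")

# Reject H0 at the 5% significance level only if the p-value is below 0.05
if p_value < 0.05:
    print("Reject the null hypothesis: the mean differs from 10 hours.")
else:
    print("Fail to reject the null hypothesis.")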
Q.5 Machine learning
Machine learning is a type of technology that allows computers to learn and make
decisions without being explicitly programmed. Instead of following a set of fixed
instructions, a machine learning system learns from data and improves over time.
Imagine you are teaching a child to recognize different animals. At first, you show the
child pictures of cats and dogs, telling them which is which. Over time, the child begins
to notice patterns—cats have pointed ears, dogs have different sizes, and so on.
Eventually, the child can recognize a cat or a dog even without your help. Machine
learning works in a similar way, but instead of a child, it is a computer learning from data.
The process of machine learning starts with data. This data can be anything—images,
numbers, words, or even videos. The computer analyses this data and looks for patterns.
Once it finds patterns, it creates a model. This model is like a set of rules that helps the
computer make predictions about new data. For example, if a machine learning model is
trained on thousands of images of cats and dogs, it can eventually recognize whether a
new image is a cat or a dog based on the patterns it has learned.
1. Supervised Learning
Supervised learning is a type of machine learning where a computer is trained using
labelled data. This means that for every example in the training dataset, there is a
correct answer, or label, provided. The goal of supervised learning is to teach the
computer to recognize patterns in the data so that it can make accurate predictions
when it encounters new, unseen data.
Imagine you are teaching a child how to differentiate between apples and oranges.
You show the child a series of fruits and tell them, “This is an apple” or “This is an
orange.” Over time, the child learns to recognize the characteristics of each fruit, such
as the colour, shape, and texture. Eventually, when you show them a new fruit, they
can guess whether it is an apple or an orange based on what they have learned. In
supervised learning, the computer learns in a similar way. It is given a set of examples
with correct labels, and it uses these examples to learn patterns that help it classify
new data correctly.
A. Classification:
Classification is used when the goal is to categorize data into distinct groups or
classes. The computer learns to recognize patterns in labelled examples and then
uses that knowledge to classify new data into one of the predefined categories.
Imagine you have a basket of fruits, and you want the computer to identify whether a
given fruit is an apple or an orange. You provide the system with many pictures of
apples and oranges, each labelled correctly. The computer studies features like
colour, shape, and texture. After training, when you show it a new fruit, it will predict
whether it's an apple or an orange based on what it has learned.
Classification is widely used in real life. For example, email spam filters classify emails
as either "spam" or "not spam" based on their content. In medical diagnosis,
classification models can determine whether an X-ray image shows signs of a disease
or not. In banking, fraud detection systems classify transactions as "fraudulent" or
"legitimate" to prevent financial crimes.
Classification algorithms are the methods used in machine learning to categorize data
into predefined classes. These algorithms learn from labelled data and make
predictions on new, unseen data. Different algorithms use different techniques to
classify data, and choosing the right one depends on factors like dataset size,
complexity, and accuracy needs. Below are some of the most commonly used
classification algorithms explained in simple terms.
1. Support Vector Machine (SVM): Support Vector Machine (SVM) is an algorithm that
finds the best boundary (or line) that separates different classes in the data. This
boundary is called a hyperplane.
2. Decision Tree: It's a flowchart-like tree structure where an internal node represents
a feature(or attribute), a branch represents a decision rule, and each leaf node
represents the outcome.
3. K-Nearest Neighbor (KNN): k-NN is a simple but powerful algorithm that classifies a
new data point based on the majority class of its nearest neighbors.
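To see one of these algorithms in action, here is a minimal classification sketch using scikit-learn's DecisionTreeClassifier (the library and the tiny fruit dataset are assumptions for illustration):
from sklearn.tree import DecisionTreeClassifier

# Toy labelled data: [weight in grams, colour score (0 = green, 1 = orange)]
X = [[150, 0.10], [170, 0.20], [140, 0.15], [130, 0.90], [120, 0.95], [125, 0.85]]
y = ["apple", "apple", "apple", "orange", "orange", "orange"]

# Train a decision tree on the labelled examples
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Classify a new, unseen fruit based on the learned patterns
print(clf.predict([[135, 0.88]]))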
B. Regression
Regression is used when the goal is to predict continuous numerical values instead of
categories. The computer learns from past data and finds patterns to make accurate
numerical predictions.
Imagine you want to predict the price of a house based on its size. You collect data on
various houses, including their sizes and selling prices. The computer studies this
information and learns how size affects price. After training, when you enter a new
house’s size, the model predicts its price based on the pattern it learned.
One key difference between regression and classification is that regression deals with
numbers that can have infinite possible values. For example, a house price could be
$250,000, $251,000, or $250,500, while in classification, the output is limited to
predefined categories like "cat" or "dog."
Regression algorithms are used in machine learning when the goal is to predict
continuous numerical values instead of categories. These algorithms learn patterns
from past data and use them to make predictions about new data points. Different
regression algorithms work in different ways, and choosing the right one depends on
factors like dataset size, complexity, and accuracy requirements. Below are some of
the most common regression algorithms explained in simple terms.
1. Linear Regression
Linear regression is the simplest and most commonly used regression algorithm. It
assumes that there is a straight-line relationship between the input (independent
variable) and the output (dependent variable).
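A minimal sketch of linear regression with scikit-learn (the library and the house-size numbers are assumptions for illustration):
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (sq ft) as input, price as output
X = [[800], [1000], [1200], [1500], [1800]]
y = [160000, 200000, 240000, 300000, 360000]

# Fit a straight line through the data points
model = LinearRegression()
model.fit(X, y)

print("Slope and intercept:", model.coef_, model.intercept_)
print("Predicted price for 1300 sq ft:", model.predict([[1300]]))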
3. Polynomial Regression
Sometimes, the relationship between input and output is not a straight line but a
curve. Polynomial regression helps when the data follows a non-linear trend.
For example, if we are predicting the growth of a plant based on the number of
days, the growth may not be a straight line but a curved pattern. Polynomial
regression fits a curved line instead of a straight one. The equation looks like:
y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n
where the higher-order terms x^2, x^3, ... allow the model to capture curved patterns in the data.
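One simple way to sketch polynomial regression is with numpy.polyfit, fitting a degree-2 curve (the plant-growth numbers are made up for illustration):
import numpy as np

# Hypothetical plant growth (cm) over days, following a curved trend
days = np.array([1, 2, 3, 4, 5, 6, 7])
growth = np.array([0.5, 1.8, 4.1, 7.2, 11.0, 15.9, 21.5])

# Fit a degree-2 polynomial: growth ≈ b2*days^2 + b1*days + b0
coeffs = np.polyfit(days, growth, deg=2)
poly = np.poly1d(coeffs)

print("Coefficients:", coeffs)
print("Predicted growth on day 8:", poly(8))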
A decision tree regression model works by splitting the data into different branches
based on conditions, similar to how a flowchart works.
Imagine you are predicting house prices. The decision tree might ask:
Is the house bigger than 1000 sq ft? → Yes → Is it in a good location? → Yes →
Predict higher price
Is it smaller than 1000 sq ft? → Yes → Predict lower price
The model keeps splitting the data into smaller groups until it reaches the final
prediction.
2. Unsupervised Learning
Unsupervised learning is a type of machine learning where a model learns from data
without labeled answers. Unlike supervised learning, where we provide both input and
output labels, in unsupervised learning, the algorithm only gets the input data and must
find patterns, relationships, or structures on its own.
Imagine you are given a collection of different objects, but you don’t know what they
are. You start grouping them based on similarities, such as color, shape, or size. This is
exactly what unsupervised learning does—it finds patterns in data and organizes it into
meaningful groups without prior knowledge.
Unsupervised learning is mostly used for clustering and association tasks. Clustering is
about grouping similar items together, while association focuses on finding relationships
between variables.
One common use of unsupervised learning is customer segmentation in businesses.
Suppose a company has data on thousands of customers, but it doesn’t know how to
categorize them. Using an unsupervised learning algorithm, the company can
automatically find groups of customers with similar shopping habits. This helps in
targeted advertising, personalized recommendations, and improving customer
experience.
Unsupervised learning algorithms are used when we have data but no predefined labels or categories. These algorithms
help find patterns, relationships, or structures in the data without human supervision. They are mainly used for
clustering, association, and dimensionality reduction. Let’s explore some of the most common unsupervised learning
algorithms in simple terms.
1. K-Means Clustering
K-Means is one of the most popular clustering algorithms. It is used to divide data into
groups (clusters) based on their similarities.
Imagine you own a clothing store and want to categorize customers based on their
shopping habits. You don’t know how many types of customers exist, but you want to
group them. K-Means works by randomly selecting K cluster centres and then assigning
each customer to the nearest cluster. It keeps adjusting the groups until they are well-
defined.
Budget Shoppers
Regular Shoppers
Premium Shoppers
Once the groups are formed, the business can create personalized marketing strategies
for each group.
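A minimal sketch of this idea with scikit-learn's KMeans (the library and the two-feature customer data are assumptions for illustration):
from sklearn.cluster import KMeans

# Hypothetical customers: [annual spend, store visits per month]
customers = [[200, 2], [250, 3], [1200, 8], [1300, 9], [5000, 20], [5200, 22]]

# Ask K-Means for 3 clusters (e.g., budget, regular, premium shoppers)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print("Cluster labels:", labels)
print("Cluster centres:", kmeans.cluster_centers_)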
2. Association
Imagine you run a supermarket and want to know which products are frequently
bought together. For example, if many customers who buy bread also buy butter, you
can place them close to each other in the store or offer a discount when purchased
together. This technique is called Market Basket Analysis, and it is a real-world example
of association algorithms.
Association algorithms work by looking for patterns in data and creating rules that describe these patterns. These rules are written in a format like: "IF a customer buys bread, THEN they are likely to also buy butter."
3. Dimensionality Reduction
Imagine you have a dataset with 1,000 features, such as a customer’s age, income,
shopping history, and hundreds of other factors. Analyzing all these features together
can be difficult, slow, and computationally expensive. Dimensionality reduction
simplifies the data by selecting only the most important features, making the analysis
faster and more efficient.
Think of it like taking a high-definition image and compressing it into a smaller file while
still keeping the key details. You lose some minor details, but the overall picture
remains clear.
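As an illustration, here is a minimal dimensionality-reduction sketch using PCA from scikit-learn (the library and the random stand-in data are assumptions):
import numpy as np
from sklearn.decomposition import PCA

# Stand-in dataset: 100 samples, each with 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Reduce the 10 features down to the 2 components that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # share of variance kept by each component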
Bias and variance are two important concepts that help us understand how well a
machine learning model is learning. To put it simply, bias is when a model is too simple
and makes a lot of mistakes, while variance is when a model is too complicated and
becomes too sensitive to the data.
Understanding Bias
Imagine you are learning how to shoot arrows at a target, and every time you shoot,
your arrows land far from the bullseye in the same spot. No matter how many times
you try, your arrows always go to the wrong place.
This happens because you are using the wrong technique or have a bad understanding
of how to aim properly. In machine learning, this is called high bias.
A model with high bias makes the same kind of mistakes over and over again because
it does not learn enough from the data.
It is too simple and makes strong assumptions.
Because it does not capture the details of the data well, it performs poorly on both
training data and new data.
💡 Example of High Bias in Machine Learning:
Imagine we are trying to predict house prices based on size. If we only use a simple
formula like Price = Size × 1000, we are ignoring other important factors like location,
number of rooms, and condition of the house. Because the model is too simple, it will
always make mistakes, no matter how much data we give it.
This is called underfitting, where the model is not complex enough to learn the true
patterns in the data.
Understanding Variance
Now, imagine another scenario where you shoot arrows, and sometimes they land near
the bullseye, but other times they are scattered all over the place. Each time you shoot,
the arrows land in very different spots.
This means you have learned too much from each individual shot but have not
developed a consistent technique. In machine learning, this is called high variance.
A model with high variance learns too much from the training data, even
memorizing unnecessary details.
It becomes too sensitive to the training data, so when it sees new data, it does not
perform well.
It works well on training data but fails on new data because it focuses too much on
specific details instead of general patterns.
💡 Example of High Variance in Machine Learning:
Imagine a house price model that memorizes every detail of the training data, including things like the owner's name and the street number. If a new house comes in with a different owner or a different street number, the model might fail to predict the price correctly because it relied too much on unnecessary details from the training data.
This is called overfitting, where the model learns too much from the training data and
fails to work on new data.
The goal of machine learning is to find the perfect balance between bias and variance.
If the bias is too high – The model is too simple and makes a lot of mistakes
(underfitting).
If the variance is too high – The model is too complex and fails on new data
(overfitting).
A good model has a balance – It learns from the data but does not memorize
unnecessary details.
When we train a machine learning model, we want it to learn from the data so that it
can make good predictions on new, unseen data. However, sometimes a model does
not learn enough, and other times it learns too much. These two problems are called
underfitting and overfitting.
What is Underfitting?
Imagine you are learning to ride a bicycle. If you only practice for five minutes and then
try to ride on your own, you will probably fall. This is because you did not learn enough
and your skills are too simple to handle different situations.
Underfitting in machine learning happens when a model is too simple and does not
learn enough patterns from the training data. Because it does not capture the
important details, it makes a lot of mistakes both on the training data and on new data.
For example, if we are trying to predict house prices but only use one factor (like just
the size of the house) while ignoring important details like location and number of
rooms, the model will not be accurate. No matter how much data we give it, it will
always make incorrect predictions.
A model that underfits is not useful because it is not learning enough to make good
predictions.
What is Overfitting?
Now, imagine another situation where you practice riding a bicycle, but you only
practice on a smooth road with no obstacles. You learn every small bump and crack in
the road perfectly. But when you try to ride on a different road with traffic and curves,
you struggle because you only learned the details of one road and not the general skill
of cycling.
This is what happens in overfitting. The model learns too much from the training data,
including unnecessary details and noise, instead of understanding the real patterns. As
a result, it performs very well on the training data but fails on new data.
For example, if we build a house price prediction model and include every tiny detail,
such as the color of the house, the name of the street, or the owner’s birth year, the
model will memorize the training data instead of learning useful patterns. When we
give it a new house, it will struggle to predict the price correctly because it was too
focused on small details that do not matter.
A model that overfits is not useful because it does not generalize well to new data.
There are different types of regression techniques, each designed to handle different
types of data and relationships. Let’s explore them in simple terms with examples.
1. Linear Regression
This is the simplest and most commonly used regression technique. It assumes a
straight-line relationship between the input variable (X) and the output variable (Y).
Example: Imagine you are trying to predict house prices based on their size. The bigger
the house, the higher the price. If you plot the data on a graph, you can draw a straight
line that best fits the data points. This line is called the regression line, and it helps us
predict house prices for new houses based on their size.
Formula: Y=mX+b
Where:
Linear regression works best when there is a linear relationship between the input and
output.
2. Multiple Linear Regression
Example:
Imagine predicting the salary of an employee based on experience, education level, and
city. Here, the salary (Y) depends on multiple factors (X1 = experience, X2 = education,
X3 = city, etc.).
Formula: Y=b0+b1X1+b2X2+b3X3+...+bnXn
Multiple regression is useful when many factors influence the output.
3. Polynomial Regression
When the relationship between the input and output is not linear but curved,
polynomial regression is used. Instead of fitting a straight line, it fits a curved line
(polynomial equation) to better capture the relationship.
Example:
Imagine predicting a car's fuel efficiency based on speed. At very low or very high
speeds, fuel efficiency is low, but at moderate speeds, it is high. A straight line would not
capture this pattern well, but a curved line (polynomial regression) can.
Formula: Y = b0 + b1X + b2X^2 + b3X^3 + ... + bnX^n
The higher the polynomial degree, the more flexible the model becomes. However, too
high a degree can lead to overfitting (memorizing instead of learning patterns).
Although it has "regression" in its name, logistic regression is actually used for
classification, not for predicting continuous values. It predicts probabilities of
categories (e.g., spam vs. not spam, disease vs. no disease).
Example:
Imagine predicting whether an email is spam or not based on word frequency. Logistic
regression gives a probability score, helping classify the email as spam or not.
Formula: P = 1 / (1 + e^-(b0 + b1X1 + b2X2 + ... + bnXn))
Where P is the probability of belonging to a category (e.g., spam or not).
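A minimal sketch of logistic regression with scikit-learn (the library and the tiny spam dataset are assumptions; real word frequencies would come from actual emails):
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [frequency of the word "free", frequency of "offer"]
X = [[0.9, 0.8], [0.7, 0.9], [0.1, 0.0], [0.0, 0.1], [0.8, 0.6], [0.05, 0.02]]
y = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

model = LogisticRegression()
model.fit(X, y)

# Probability of each class for a new email, then the predicted class
print(model.predict_proba([[0.6, 0.7]]))  # [P(not spam), P(spam)]
print(model.predict([[0.6, 0.7]]))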
8. Stepwise Regression
This technique automatically selects the best input variables for the model by adding or
removing features one by one based on their importance. It ensures that only relevant
variables are used.
Example:
If you are predicting house prices using ten different variables, stepwise regression will
test each variable and remove the least important ones, improving model efficiency.
When we build a machine learning model, we need to check how good it is at making
predictions. This process is called model evaluation. If a model performs well on
training data but does not work well on new, unseen data, it is not useful. That’s why we
need to test and compare different models to select the best one. This process is called
model selection.
Precision – Precision measures how many of the positive predictions were actually
correct. If a model predicts 10 emails as spam but only 7 are actually spam,
precision is 7/10 = 70%. Precision is useful when false positives (wrong positive
predictions) are costly.
Formula: Precision = True Positives / (True Positives + False Positives)
Recall (Sensitivity) – Recall measures how many actual positive cases the model
correctly predicted. If there were 10 actual spam emails and the model correctly
found 7 of them, recall is 7/10 = 70%. Recall is important when false negatives
(missing actual positive cases) are costly, like in medical diagnoses.
Formula: Recall = True Positives / (True Positives + False Negatives)
F1 Score – This is a balance between precision and recall. If we care about both
avoiding false positives and false negatives, the F1 score helps us find the right
balance. A higher F1 score means better model performance.
Formula: F1 Score = 2*(Precision * Recall) / (Precision + Recall)
ROC Curve and AUC Score – The ROC curve shows how well the model separates
different classes. The AUC (Area under Curve) score measures the overall
performance. A higher AUC means the model is better at distinguishing between
categories.
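These metrics can be computed with scikit-learn, as in this minimal sketch (the library and the made-up spam predictions are assumptions for illustration):
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Made-up ground truth, predicted labels, and predicted probabilities (1 = spam)
y_true   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.85]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1 score: ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_scores))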
Cross-validation is a technique used in machine learning to check how well a model will
perform on new, unseen data. It helps prevent problems like overfitting (when a model
memorizes the training data but doesn't generalize well to new data).
Imagine you are a teacher who wants to test how well students understand a subject. If
you always ask the same questions, students might memorize the answers instead of
actually learning the concepts. To get a better measure of their knowledge, you give
different test questions each time.
Similarly, in machine learning, if we only use one set of data to test a model, it may
perform well just because it memorized that data. Cross-validation ensures that we
evaluate the model more fairly by testing it on multiple subsets of data.
There are different types of cross-validation, each suited for different situations. Below,
I'll explain the main types in an easy-to-understand way.
1. Holdout Method
The holdout method is a basic CV approach in which the original dataset is divided into
two discrete segments:
Training Data - As a reminder, this set is used to fit and train the model.
Test Data - This set is used to evaluate the model.
How it works: The dataset is split once, the model is trained on the training portion, and it is evaluated a single time on the held-out test portion.
2. K-Fold Cross-Validation
In K-Fold Cross-Validation, the dataset is split into K equal parts (folds). The model is trained on K-1 folds and tested on the remaining fold, repeating the process K times so that every fold is used once for testing; the results are then averaged.
Example (K=5):
If you have 100 data points and use 5-Fold Cross-Validation, the dataset is split into 5 equal parts of 20 points each. In each round, the model trains on 80 points and is tested on the remaining 20.
3. Stratified K-Fold Cross-Validation
As seen above, k-fold validation can't be used for imbalanced datasets because data is split into k folds with a uniform probability distribution. Not so with stratified k-fold, which is an enhanced version of the k-fold cross-validation technique. Although it too splits the dataset into k equal folds, each fold has the same ratio of instances of the target variable as the complete dataset. This enables it to work well for imbalanced datasets, but not for time-series data.
How it works:
Similar to K-Fold Cross-Validation but ensures that each fold maintains the same
proportion of class labels (e.g., 70% A, 30% B in each fold).
Useful for datasets where one class appears much more frequently than others.
📌 Example: If you have a dataset where 90% of the data points belong to class "A" and
only 10% to class "B", normal K-Fold might create unbalanced folds. Stratified K-Fold
makes sure each fold has the same percentage of "A" and "B".
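A minimal sketch of stratified k-fold cross-validation with scikit-learn (the library, the classifier, and the generated imbalanced data are assumptions for illustration):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced toy dataset: roughly 90% of samples in class 0, 10% in class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

# Each of the 5 folds keeps roughly the same 90/10 class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())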
To understand how Random Forest works, let's start with decision trees. A decision
tree is a flowchart-like structure where each internal node represents a decision
based on a feature, each branch represents the outcome of that decision, and each
leaf node represents the final output (class label or numerical value). Decision trees
are easy to understand and interpret, but they have a major drawback—they tend
to overfit the training data. Overfitting happens when the model learns patterns
that are too specific to the training data, making it less effective on new, unseen
data.
Random Forest solves the overfitting problem by creating multiple decision trees
instead of just one. Each tree in the forest is trained on a random subset of the
data. Additionally, when splitting nodes within a tree, only a random subset of
features is considered. This randomness ensures that the trees are diverse and
reduces the likelihood of overfitting. Once all the trees are trained, they make
predictions independently, and the final output is determined by combining their
predictions. In classification tasks, the final decision is made using majority voting
(the class with the most votes is chosen). In regression tasks, the final prediction is
the average of all tree outputs.
Let's consider an example of predicting whether a person will buy a car based on
their income, age, and credit score.
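A minimal sketch of this car-buying example with scikit-learn's RandomForestClassifier (the library and the small invented dataset are assumptions for illustration):
from sklearn.ensemble import RandomForestClassifier

# Hypothetical people: [income in thousands, age, credit score]; 1 = buys a car
X = [[30, 25, 600], [80, 45, 750], [50, 35, 680], [120, 50, 800], [25, 22, 580], [95, 40, 720]]
y = [0, 1, 0, 1, 0, 1]

# 100 decision trees, each trained on a random subset of the data and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The majority vote of the trees decides the prediction for a new person
print(forest.predict([[70, 38, 700]]))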
To understand how SVM works, imagine you have a group of students, and you want to
predict whether they will pass or fail based on their hours of study and attendance
percentage. If you plot this data on a graph, you will notice that students who study
more and have better attendance tend to pass, while those with low study hours and
attendance tend to fail. The challenge is to draw a line that clearly separates these two
groups.
SVM finds the best line (or a plane in higher dimensions) that separates the classes in
such a way that the distance between the line and the closest data points from both
classes is as large as possible. This distance is called the margin, and the points that
are closest to the boundary are known as support vectors. These support vectors are
crucial because they help define the position of the boundary.
A key feature of SVM is its ability to handle situations where the data is not perfectly
separable by a straight line. For example, if the data is arranged in a circular pattern
where one class is inside the circle and the other is outside, a simple straight line
cannot separate them. In such cases, SVM uses something called the Kernel Trick,
which transforms the data into a higher dimension where it becomes easier to separate
with a plane or a straight boundary.
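A minimal sketch of an SVM with an RBF kernel in scikit-learn (the library and the study-hours/attendance numbers are assumptions for illustration):
from sklearn.svm import SVC

# Hypothetical students: [hours of study per week, attendance %]
X = [[2, 60], [3, 65], [4, 70], [8, 90], [9, 85], [10, 95]]
y = ["fail", "fail", "fail", "pass", "pass", "pass"]

# The RBF kernel lets the boundary curve when a straight line is not enough
model = SVC(kernel="rbf", C=1.0)
model.fit(X, y)

print(model.predict([[7, 80]]))   # predicted class for a new student
print(model.support_vectors_)     # the points that define the boundary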
Support Vector Machine (SVM) is mainly divided into two types based on the nature
of the data and how the classification is performed. These are:
1. Linear SVM
Linear SVM is used when the data can be separated by a straight line (or a plane in
higher dimensions). This means there exists a clear boundary between the two
classes without overlapping. Example:
Use Cases:
2. Non-Linear SVM
Non-Linear SVM is used when the data cannot be separated by a straight line. In
such cases, SVM uses the Kernel Trick to map the data into a higher-dimensional
space where it becomes separable.
Example:
Use Cases:
Applications of SVM
1. Image Classification
o Used in face recognition, object detection, and handwriting recognition.
2. Medical Diagnosis
o Helps classify diseases based on patient symptoms and medical test
results.
o Used in cancer detection (e.g., identifying if a tumor is benign or
malignant).
3. Text and Sentiment Analysis
o Classifies emails as spam or not spam.
o Determines if a product review is positive or negative.
4. Financial Fraud Detection
o Identifies fraudulent credit card transactions by analyzing transaction
patterns.
5. Stock Market Prediction
o Helps analyse historical stock prices to predict future trends.
Advantages of SVM
✔ Works well with both linear and non-linear data (using kernel tricks).
✔ Effective for high-dimensional data.
✔ Resistant to overfitting, especially in small datasets.
✔ Provides a clear margin for better classification.
Disadvantages of SVM
At its core, an ANN consists of layers of neurons. Each neuron takes input, processes
it using mathematical functions, and passes it to the next layer. These neurons are
connected through weights, which determine the importance of each connection. The
network learns by adjusting these weights based on the data it receives.
Input Layer: The image of the handwritten digit is converted into numerical data (pixel values) and fed into
the network. Each pixel acts as an input neuron.
Hidden Layers: The input data passes through multiple hidden layers, where neurons apply mathematical
operations to extract important patterns. For example, the first layer might detect edges, the next layer might
detect shapes, and another layer might identify numbers.
Weights and Activation Functions: Each neuron in the hidden layers has a weight, which
decides how much importance is given to each input. It also has an activation function, which decides
whether the neuron should be activated based on the processed information.
Output Layer: After processing, the final layer provides an output, which in this case is a number (0-9),
predicting which digit was written.
Learning and Training: The network compares its prediction with the actual answer and adjusts the
weights using a technique called backpropagation. This process is repeated multiple times until the
network learns to make accurate predictions.
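A heavily simplified sketch of this digit-recognition idea, using scikit-learn's MLPClassifier on its small built-in digits dataset (the library and this particular setup are assumptions, not the only way to build such a network):
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 pixel images of handwritten digits, flattened into 64 input values each
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# One hidden layer of 64 neurons; the weights are adjusted by backpropagation
net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X_train, y_train)

print("Test accuracy:", net.score(X_test, y_test))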
There are several types of neural networks, each designed for specific tasks.
1. Feedforward Neural Network (FNN)
This is the simplest type, where information moves in one direction—from the input
layer to the output layer without looping back. It is used for tasks like simple
classification (e.g., spam detection in emails).
2. Convolutional Neural Network (CNN)
CNNs are mainly used for image processing and computer vision tasks like face
recognition and medical imaging. They have special layers called convolutional
layers, which extract features like edges, textures, and shapes from images.
3. Recurrent Neural Network (RNN)
RNNs are used for tasks involving sequential data, such as speech recognition,
language translation, and time series prediction. Unlike FNNs, RNNs have loops that
allow them to remember previous inputs, making them suitable for tasks where past
information matters (e.g., predicting the next word in a sentence).
Artificial Neural Networks (ANNs) are widely used in various real-world applications due
to their ability to recognize patterns, make predictions, and process large amounts of
data efficiently. Some of the most common applications include image and face recognition, speech recognition and language translation, medical diagnosis, and recommendation systems.
However, ANNs also have some limitations:
Requires Large Amounts of Data: ANNs need vast amounts of training data to perform well, which may not always be available or easy to collect.
Difficult to Interpret: Unlike traditional models like decision trees, ANNs work like a
"black box," making it hard to understand how they arrive at specific decisions.
1. Bagging
To understand bagging, imagine you are trying to guess the weight of a fruit by
asking multiple people. Instead of relying on just one person's guess, you ask
several people, take all their estimates, and then calculate the average. This way,
the final guess is more reliable because it reduces individual errors. Bagging works
in a similar way for machine learning models.
1. Creating Multiple Datasets: Instead of using the entire dataset to train one
model, bagging randomly selects different subsets of the data with
replacement (this means the same data points can be chosen multiple times).
This process is called bootstrapping.
2. Training Multiple Models: Each subset is used to train a separate model of
the same type. These models are called weak learners because they may
not be highly accurate on their own.
3. Making Predictions: After training, all models make predictions on new
data.
4. Combining Predictions: The final output is determined by averaging the
predictions (for regression problems) or taking a majority vote (for
classification problems).
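These four steps are what scikit-learn's BaggingClassifier does internally; here is a minimal sketch (the library and the generated toy data are assumptions for illustration):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Toy dataset standing in for a real problem
X, y = make_classification(n_samples=300, random_state=0)

# 10 weak learners (decision trees by default), each trained on a bootstrapped sample
bagging = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0)
bagging.fit(X, y)

print("Training accuracy:", bagging.score(X, y))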
Example of Bagging: Random Forest
2. Boosting
How Boosting Works: Imagine you are learning to ride a bicycle. At first, you
might struggle and fall, but each time you try again, you focus on correcting the
mistakes you made earlier. Over time, you become better and more confident.
Boosting works in a similar way in machine learning—it builds a series of models,
where each new model learns from the errors of the previous ones.
AdaBoost (Adaptive Boosting): This is one of the earliest and simplest boosting
algorithms. It adjusts the importance (weights) of misclassified points and combines
weak models into a stronger final model.
Imagine you move to a new neighborhood and want to make friends. To decide who
you might get along with, you look at the people living closest to you. If most of
your neighbors like the same activities as you, chances are you'll get along with
them. KNN works in a similar way—it predicts the category of a new data point by
looking at its nearest neighbors.
Imagine you are given a dataset of fruits with two features: size and colour. You
need to classify a new fruit as either an "apple" or an "orange."
If K = 3 and the three closest neighbors are 2 apples and 1 orange, the new
fruit is classified as an apple (majority rule).
If K = 5 and the five closest neighbors are 3 oranges and 2 apples, the new
fruit is classified as an orange (majority rule).
If KNN is used to predict a house price based on nearby house prices, it will find the
K nearest houses and take the average price of those houses as the predicted
price for the new house.
If K is too small (e.g., K=1), the model is very sensitive to noise and may
give incorrect predictions.
If K is too large (e.g., K=100), the model may consider too many distant
points, leading to less accurate predictions.
A good approach is to try different values of K and choose the one that works
best for the dataset.
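A minimal sketch that compares a few values of K with scikit-learn's KNeighborsClassifier (the library and the generated toy data are assumptions for illustration):
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset standing in for the fruit example
X, y = make_classification(n_samples=200, random_state=0)

# Compare cross-validated accuracy for several values of K
for k in [1, 3, 5, 15, 51]:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k}: mean accuracy = {score:.3f}")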
Advantages of KNN
Disadvantages of KNN
Slow for large datasets (because it has to calculate the distance for every
new point)
Sensitive to irrelevant features (if there are unnecessary data points, they
may affect the results)