Module 2.2: Sampling Techniques

Course Code: CSA3002

MACHINE LEARNING ALGORITHMS

Course Type: LPC – 2-2-3


Course Objectives
• The objective of the course is to familiarize learners with the
concepts of machine learning algorithms and to develop skills
through experiential learning techniques.
Course Outcomes
At the end of the course, students should be able to
1. Understand training and testing of datasets using machine
learning techniques.
2. Apply optimization and parameter-tuning techniques to machine
learning algorithms.
3. Apply machine learning models to solve various problems using
machine learning algorithms.
4. Apply machine learning algorithms to create models.
Sampling Techniques in ML
• Sampling techniques are methods used in statistics and research to
select a subset of individuals or items from a larger population for the
purpose of making inferences or drawing conclusions about the entire
population.
• Different sampling techniques are employed depending on the
research objectives and the nature of the population.
• Simple Random Sampling:
• Description: In simple random sampling, every member of the
population has an equal chance of being selected. This is typically
done using random number generators or random selection methods.
• Example 1: Suppose you want to estimate the average income of
residents in a city with 1,000,000 people. You assign a unique number
to each resident and use a random number generator to select 1,000
residents. These 1,000 individuals form your random sample.
• Example 2: Random selection of 20 students from a class of 50
students. Each student has an equal chance of being selected.
• On any single draw, the probability of picking a particular
student is 1/50; across the whole sample of 20, each student's
chance of being included is 20/50 = 2/5.
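
A minimal sketch of simple random sampling in Python (the roster of 50 student IDs is an illustrative assumption); `random.sample` gives every student the same chance of being chosen:

```python
import random

# Hypothetical class roster: student IDs 1..50.
students = list(range(1, 51))

# Draw 20 students; every subset of size 20 is equally likely.
sample = random.sample(students, k=20)
print(sorted(sample))
```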
• Stratified Sampling:
• Description: In stratified sampling, the population is divided into
distinct subgroups based on certain characteristics (e.g., age, gender,
income), and then random samples are drawn from each group. This
ensures that each subgroup is adequately represented in the sample.
• Example 1: Imagine you want to study the political preferences of a
population consisting of different age groups (e.g., under 30, 30-50,
over 50). You would first stratify the population into these age groups
and then randomly select a sample from each group based on their
respective sizes.
• We need to have prior
information about the population
to create subgroups.
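
A minimal sketch of stratified sampling, assuming a hypothetical population table with an `age_group` column (the strata and their proportions are invented for illustration); drawing the same fraction from every stratum keeps each age group proportionally represented:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population of 1,000 people tagged with an age-group stratum.
population = pd.DataFrame({
    "person_id": np.arange(1000),
    "age_group": rng.choice(["under_30", "30_50", "over_50"],
                            size=1000, p=[0.40, 0.35, 0.25]),
})

# Draw 10% from each stratum (group-wise sampling needs pandas >= 1.1).
sample = population.groupby("age_group").sample(frac=0.10, random_state=0)
print(sample["age_group"].value_counts())
```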
• Systematic Sampling:
• Description: Here the selection of elements is systematic rather
than random, except for the first element. Elements of the sample are
chosen at regular intervals from the population. All the elements are
first arranged in a sequence; because the starting point is chosen at
random, each element has an equal chance of being selected.
• Example: If you want to survey customers in a store with 500
shoppers and a desired sample size of 50, you'd select every 10th
customer (500/50 = 10) after randomly choosing one of the first 10
shoppers to start with.
• For a sample of size n, we divide our population of size N into
subgroups of k elements each, where k = N/n.
• We select our first element randomly from the first subgroup of
k elements.
• To select the remaining elements of the sample:
• If our first element is n1, then the second element is
n2 = n1 + k, the third is n3 = n2 + k, and so on.
• Taking an example with N = 20 and n = 5:
• The number of elements in each subgroup is k = N/n = 20/5 = 4.
• Randomly select the first element from the first subgroup,
say n1 = 3.
• Then n2 = n1 + k = 3 + 4 = 7
• and n3 = n2 + k = 7 + 4 = 11.
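
The worked example translates directly into a short Python sketch; N = 20 and n = 5 come from the example above, and the only random choice is the starting element:

```python
import random

N, n = 20, 5
k = N // n                     # interval between selected elements: 4
start = random.randint(1, k)   # random first element from the first subgroup
sample = [start + i * k for i in range(n)]
print(sample)                  # e.g. with start = 3: [3, 7, 11, 15, 19]
```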
• Cluster Sampling:
• Description: In cluster sampling, the population is divided into
clusters or groups, and then a random sample of clusters is selected.
All individuals within the chosen clusters are included in the sample.
• Example: To estimate the literacy rate in a country, you might first
divide the country into regions or provinces. Then, randomly select a
few regions, and within those regions, survey everyone of eligible age.
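
A minimal sketch of cluster sampling, assuming a hypothetical dictionary of regions (the clusters) and their residents; a few clusters are chosen at random and everyone inside them is included:

```python
import random

# Hypothetical clusters: 10 regions of 100 residents each (names invented).
regions = {f"region_{i}": [f"region_{i}_person_{j}" for j in range(100)]
           for i in range(10)}

chosen = random.sample(list(regions), k=3)        # randomly select 3 clusters
sample = [p for r in chosen for p in regions[r]]  # survey everyone within them
print(chosen, len(sample))                        # 3 regions, 300 people
```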
• Convenience Sampling:
• Description: Convenience sampling involves selecting individuals who
are readily available and easy to access. This method is quick and
inexpensive but may introduce bias.
• Example: If you want to gather opinions about a new product, you
might approach people in a shopping mall and ask their opinions.
However, this method may lead to a biased sample because mall-goers
may not represent the entire population.
• Researchers often prefer this method during the initial stages of
survey research because it is quick and delivers results easily.
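
As a tiny illustrative sketch (the visitor log is hypothetical), convenience sampling has no random element at all; it simply takes whichever records are easiest to reach:

```python
# Hypothetical log of mall visitors in arrival order.
mall_visitors = [f"visitor_{i}" for i in range(500)]

# Take the first 50 people encountered: quick and cheap, but likely biased.
sample = mall_visitors[:50]
print(len(sample))
```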
• Snowball Sampling:
• Description: Snowball sampling is often used in studies where it's
difficult to identify and locate members of a population. One initial
participant is identified and interviewed, and then that participant
helps identify and recruit others.
• Example: When studying the network of drug users, you might
interview one known user, who then introduces you to others in the
network, and so on.
• This technique is used in situations where the population is
largely unknown and rare.
• Therefore, we take the help of the first element we select from
the population and ask them to recommend other elements who fit
the description of the sample needed.
• This referral process continues, growing the sample like a
snowball.
Snowball Sampling is a non-probability sampling technique commonly used when
the population being studied is hard to reach or not easily identifiable. In this
method, existing study subjects recruit future subjects from among their
acquaintances, thus the sample "snowballs" as it grows.
How Snowball Sampling Works:
Initial Subjects: The researcher identifies a small group of initial subjects, often
referred to as "seeds." These are people who belong to the target population.
Recruitment: The initial subjects then refer others in their network who also fit
the research criteria.
Expansion: Each new participant can further recruit others, expanding the sample
in successive waves.
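
A minimal sketch of these referral waves, assuming a hypothetical acquaintance network stored as a dictionary; starting from the seeds, every new participant's referrals are added in turn:

```python
# Hypothetical acquaintance network: each person maps to those they can refer.
network = {
    "seed_a": ["p1", "p2"],
    "seed_b": ["p2", "p3"],
    "p1": ["p4"], "p2": ["p5"], "p3": [], "p4": ["p6"], "p5": [], "p6": [],
}

sample, frontier = set(), ["seed_a", "seed_b"]  # start from the initial seeds
while frontier:
    person = frontier.pop()
    if person in sample:
        continue                      # skip anyone already recruited
    sample.add(person)
    frontier.extend(network[person])  # each participant refers acquaintances
print(sample)                         # the sample "snowballs" wave by wave
```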
• Exploring Experiences of Homeless People
• A social worker conducting research on homelessness in a city may
use snowball sampling to gather participants:
• Initial Contacts: They start by interviewing a few homeless individuals
who are currently living in shelters or public spaces.
• Further Recruitment: These participants suggest others they know
who are also homeless, expanding the study through social
connections.
Oversampling and Undersampling:
• Description: Oversampling and undersampling are techniques used to
address class imbalance in a dataset, where one class (the minority
class) is significantly underrepresented compared to another class (the
majority class). These techniques are commonly used in machine
learning to prevent models from being biased toward the majority
class.
• Example: In a credit card fraud detection task, where fraudulent
transactions are rare, you might oversample the fraudulent class or
undersample the non-fraudulent class to balance the dataset.
Oversampling:
• In oversampling, you increase the number of instances in the minority
class by creating duplicates of existing instances or generating
synthetic examples. The goal is to balance the class distribution,
making the dataset more equitable for training.
• Example of Oversampling:
• Suppose you are working on a credit card fraud detection task, where
fraudulent transactions are rare compared to legitimate transactions.
Your dataset consists of 1,000 legitimate transactions (class 0) and
only 50 fraudulent transactions (class 1).
• Original Dataset:
• Class 0 (Legitimate transactions): 1,000 samples
• Class 1 (Fraudulent transactions): 50 samples
• Oversampling:
• One way to oversample is to create duplicates of the minority class (class 1) until the
class distribution is balanced. You might create additional synthetic samples using
techniques like SMOTE (Synthetic Minority Over-sampling Technique).
• After oversampling, your dataset might look like this:
• Class 0 (Legitimate transactions): 1,000 samples
• Class 1 (Fraudulent transactions): 1,000 samples (synthetic)
• Now, you have an equal number of samples for both classes, which can help
prevent the model from being biased toward the majority class.
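
A minimal sketch of random oversampling by duplication, mirroring the 1,000-vs-50 example (the four feature columns are synthetic toy data); scikit-learn's `resample` duplicates minority rows with replacement until the classes balance:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Toy data mirroring the example: 1,000 legitimate vs 50 fraudulent rows.
X_legit = rng.normal(size=(1000, 4))
X_fraud = rng.normal(loc=2.0, size=(50, 4))

# Random oversampling: duplicate minority rows (with replacement) to 1,000.
X_fraud_up = resample(X_fraud, replace=True, n_samples=1000, random_state=0)

X = np.vstack([X_legit, X_fraud_up])
y = np.array([0] * 1000 + [1] * 1000)
print(X.shape, np.bincount(y))   # (2000, 4) [1000 1000]

# Undersampling is the mirror image: shrink the majority class to 50 instead.
X_legit_down = resample(X_legit, replace=False, n_samples=50, random_state=0)
```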
• SMOTE (Synthetic Minority Over-sampling Technique):
• Description: SMOTE, which stands for Synthetic Minority Over-
sampling Technique, is a method used to address class imbalance in
machine learning datasets.
• Class imbalance occurs when one class (the minority class) is
significantly underrepresented compared to another class (the majority
class).
• SMOTE works by generating synthetic examples of the minority class
to balance the dataset. This technique helps prevent machine learning
models from being biased toward the majority class.
• Example of SMOTE:
• Suppose you're working on a medical diagnosis task to predict
whether a patient has a rare disease (class 1) based on various medical
measurements. Your dataset is imbalanced, with 100 samples of non-
disease cases (class 0) and only 20 samples of disease cases (class 1).
• Original Dataset:
• Class 0 (Non-disease): 100 samples
• Class 1 (Disease): 20 samples
• SMOTE Implementation:
• You decide to apply SMOTE to balance the dataset. You choose a value for k, which
represents the number of nearest neighbors to consider. Let's say you set k = 5.
• For each sample in the minority class (class 1), you calculate the five nearest neighbors
among other class 1 samples.
• You randomly select one of these neighbors, and for each selected pair (original instance
and chosen neighbor), you generate synthetic samples by linear interpolation along the
feature space. The number of synthetic samples generated for each pair depends on the
desired level of oversampling.
• After applying SMOTE with a desired oversampling factor (e.g., 3x), your dataset might
look like this:
• Class 0 (Non-disease): 100 samples (unchanged)
• Class 1 (Disease): the 20 original samples, plus synthetic samples generated to reach
the desired ratio (e.g., 40 new samples for 60 total at a 3x factor)
• The synthetic samples are generated in such a way that they're close to the original
samples but introduce some variation to expand the minority class's representation.
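
A minimal sketch of the interpolation step just described, assuming a small hypothetical feature matrix for the 20 disease cases; it is not the full reference algorithm (a production pipeline would typically use a library such as imbalanced-learn), but it shows how each synthetic point lies between a minority sample and one of its k nearest minority neighbors:

```python
import numpy as np

def smote_sketch(X_min, k=5, n_synthetic=40, seed=0):
    """Interpolate between minority samples and their k nearest minority
    neighbors to create synthetic rows (simplified SMOTE sketch)."""
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per sample
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))      # pick a minority sample...
        j = nn[i, rng.integers(k)]        # ...and one of its k neighbors
        lam = rng.random()                # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)

# Mirroring the example: 20 disease cases, 4 hypothetical features, k = 5.
X_disease = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote_sketch(X_disease, k=5, n_synthetic=40)
print(X_new.shape)   # (40, 4): with the 20 originals, 60 disease rows (3x)
```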
• Resulting Dataset:
• Class 0 (Non-disease): 100 samples
• Class 1 (Disease): Increased number of samples through SMOTE
• Now, you have a balanced dataset, which you can use for training
machine learning models. The synthetic samples generated by
SMOTE help improve the model's ability to learn the minority class
and make more accurate predictions.
• SMOTE is a powerful technique for addressing class imbalance, but it
should be used with caution. Generating too many synthetic samples
can lead to overfitting, so it's essential to carefully choose the
oversampling ratio and evaluate model performance using appropriate
validation techniques.
