Course Code: CSA3002
MACHINE LEARNING ALGORITHMS
Course Type: LPC – 2-2-3
Course Objectives
• The objective of the course is to familiarize learners with the concepts of machine learning algorithms and to attain skill development through experiential learning techniques.

Course Outcomes
At the end of the course, students should be able to:
1. Understand training and testing of datasets using machine learning techniques.
2. Apply optimization and parameter-tuning techniques to machine learning algorithms.
3. Apply a machine learning model to solve various problems using machine learning algorithms.
4. Apply machine learning algorithms to create models.

Sampling Techniques in ML
• Sampling techniques are methods used in statistics and research to select a subset of individuals or items from a larger population in order to make inferences or draw conclusions about the entire population.
• Different sampling techniques are employed depending on the research objectives and the nature of the population.

Simple Random Sampling
• Description: In simple random sampling, every member of the population has an equal chance of being selected. This is typically done using random number generators or other random selection methods.
• Example 1: Suppose you want to estimate the average income of residents in a city with 1,000,000 people. You assign a unique number to each resident and use a random number generator to select 1,000 residents. These 1,000 individuals form your random sample.
• Example 2: Randomly select 20 students from a class of 50 students. Each student has an equal chance of being selected; the probability of selection in a single draw is 1/50.

Stratified Sampling
• Description: In stratified sampling, the population is divided into distinct subgroups (strata) based on certain characteristics (e.g., age, gender, income), and random samples are then drawn from each subgroup. This ensures that each subgroup is adequately represented in the sample.
• Example: Imagine you want to study the political preferences of a population consisting of different age groups (e.g., under 30, 30-50, over 50). You would first stratify the population into these age groups and then randomly select a sample from each group in proportion to its size.
• Note that we need prior information about the population in order to create the subgroups.

Systematic Sampling
• Description: Here the selection of elements is systematic rather than random, except for the first element. Elements of the sample are chosen at regular intervals from the population. All elements are first arranged in a sequence in which each element has an equal chance of being selected.
• Example: If you want to survey customers in a store with 500 shoppers and a desired sample size of 50, you would select every 10th customer (500/50 = 10) after randomly choosing one of the first 10 shoppers as a starting point.
• For a sample of size n, we divide the population of size N into subgroups of k = N/n elements each.
• We select the first element randomly from the first subgroup of k elements.
• To select the remaining elements: if the first element is n1, then the second element is n2 = n1 + k, the third is n3 = n2 + k, and so on.
• Worked example with N = 20 and n = 5 (see the code sketch below):
  • The number of elements in each subgroup is k = N/n = 20/5 = 4.
  • Randomly select the first element from the first subgroup; say n1 = 3.
  • Then n2 = n1 + k = 3 + 4 = 7, and n3 = n2 + k = 7 + 4 = 11.
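The stepping rule above translates directly into code. Here is a minimal Python sketch (the `systematic_sample` function is our own illustration, not a library routine) that reproduces the N = 20, n = 5 worked example:

```python
import random

def systematic_sample(population, n):
    """Draw a systematic sample of size n from a sequenced population:
    pick a random start within the first subgroup of k = N // n
    elements, then take every k-th element after it."""
    N = len(population)
    k = N // n                       # subgroup size / sampling interval
    start = random.randrange(k)      # random index within the first subgroup
    return [population[start + i * k] for i in range(n)]

# Worked example from the slide: N = 20, n = 5, so k = 4.
population = list(range(1, 21))      # elements numbered 1..20
print(systematic_sample(population, 5))
# If the random start happens to be element 3 (index 2), the sample
# is [3, 7, 11, 15, 19] -- matching n1 = 3, n2 = 7, n3 = 11 above.
```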
Cluster Sampling
• Description: In cluster sampling, the population is divided into clusters or groups, and a random sample of clusters is selected. All individuals within the chosen clusters are included in the sample.
• Example: To estimate the literacy rate in a country, you might first divide the country into regions or provinces, then randomly select a few regions and, within those regions, survey everyone of eligible age.

Convenience Sampling
• Description: Convenience sampling involves selecting individuals who are readily available and easy to access. This method is quick and inexpensive but may introduce bias.
• Example: If you want to gather opinions about a new product, you might approach people in a shopping mall and ask for their opinions. However, this may lead to a biased sample because mall-goers may not represent the entire population.
• Researchers often prefer this method during the initial stages of survey research because it is quick and easy to deliver results.

Snowball Sampling
• Description: Snowball sampling is often used in studies where it is difficult to identify and locate members of a population. One initial participant is identified and interviewed, and that participant then helps identify and recruit others.
• Example: When studying a network of drug users, you might interview one known user, who then introduces you to others in the network, and so on.
• This technique is used when the population is largely unknown and rare. We start with an initial participant who fits the description of the sample and ask them to recommend others who also fit. This referral process continues, growing the sample like a snowball.
• Snowball sampling is a non-probability technique commonly used when the population being studied is hard to reach or not easily identifiable: existing study subjects recruit future subjects from among their acquaintances, so the sample "snowballs" as it grows.

How Snowball Sampling Works
• Initial Subjects: The researcher identifies a small group of initial subjects, often referred to as "seeds." These are people who belong to the target population.
• Recruitment: The initial subjects then refer others in their network who also fit the research criteria.
• Expansion: Each new participant can further recruit others, expanding the sample in successive waves.

Example: Exploring Experiences of Homeless People
• A social worker conducting research on homelessness in a city may use snowball sampling to gather participants:
  • Initial Contacts: They start by interviewing a few homeless individuals who are currently living in shelters or public spaces.
  • Further Recruitment: These participants suggest others they know who are also homeless, expanding the study through social connections.

Oversampling and Undersampling
• Description: Oversampling and undersampling are techniques used to address class imbalance in a dataset, where one class (the minority class) is significantly underrepresented compared to another (the majority class). These techniques are commonly used in machine learning to prevent models from being biased toward the majority class.
• Example: In a credit card fraud detection task, where fraudulent transactions are rare, you might oversample the fraudulent class or undersample the non-fraudulent class to balance the dataset; a minimal sketch of both follows.
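To make the two strategies concrete before the detailed walkthrough, here is a short Python sketch of both naive approaches; the helper functions (`random_oversample`, `random_undersample`) and the toy transaction records are illustrative, not part of any library:

```python
import random

def random_oversample(minority, target_size):
    """Naive oversampling: duplicate randomly chosen minority-class
    samples (with replacement) until the class reaches target_size."""
    extra = [random.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size):
    """Naive undersampling: keep only a random subset of the majority class."""
    return random.sample(majority, target_size)

# Toy fraud-detection data: 1,000 legitimate (class 0), 50 fraudulent (class 1).
legit = [(f"tx{i}", 0) for i in range(1000)]
fraud = [(f"fx{i}", 1) for i in range(50)]

balanced_up   = legit + random_oversample(fraud, 1000)    # 1,000 vs 1,000
balanced_down = random_undersample(legit, 50) + fraud     # 50 vs 50
print(len(balanced_up), len(balanced_down))               # -> 2000 100
```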
Oversampling
• In oversampling, you increase the number of instances in the minority class by duplicating existing instances or generating synthetic examples. The goal is to balance the class distribution, making the dataset more equitable for training.

Example of Oversampling
• Suppose you are working on a credit card fraud detection task, where fraudulent transactions are rare compared to legitimate transactions. Your dataset consists of 1,000 legitimate transactions (class 0) and only 50 fraudulent transactions (class 1).
• Original dataset:
  • Class 0 (legitimate transactions): 1,000 samples
  • Class 1 (fraudulent transactions): 50 samples
• One way to oversample is to duplicate minority-class (class 1) samples until the class distribution is balanced. Alternatively, you can generate synthetic samples using techniques like SMOTE (Synthetic Minority Over-sampling Technique).
• After oversampling, the dataset might look like this:
  • Class 0 (legitimate transactions): 1,000 samples
  • Class 1 (fraudulent transactions): 1,000 samples (including duplicated or synthetic samples)
• Now you have an equal number of samples for both classes, which helps prevent the model from being biased toward the majority class.

SMOTE (Synthetic Minority Over-sampling Technique)
• Description: SMOTE is a method used to address class imbalance in machine learning datasets. Class imbalance occurs when one class (the minority class) is significantly underrepresented compared to another (the majority class).
• SMOTE works by generating synthetic examples of the minority class to balance the dataset. This helps prevent machine learning models from being biased toward the majority class.

Example of SMOTE
• Suppose you are working on a medical diagnosis task to predict whether a patient has a rare disease (class 1) based on various medical measurements. Your dataset is imbalanced, with 100 samples of non-disease cases (class 0) and only 20 samples of disease cases (class 1).
• Original dataset:
  • Class 0 (non-disease): 100 samples
  • Class 1 (disease): 20 samples
• SMOTE implementation:
  • You decide to apply SMOTE to balance the dataset. You choose a value for k, the number of nearest neighbors to consider; say k = 5.
  • For each sample in the minority class (class 1), you find its five nearest neighbors among the other class 1 samples.
  • You randomly select one of these neighbors, and for each selected pair (original instance and chosen neighbor) you generate a synthetic sample by linear interpolation along the line segment joining them in feature space. The number of synthetic samples generated per original sample depends on the desired level of oversampling.
• After applying SMOTE with a desired oversampling factor (e.g., 3x), the dataset contains the original 100 class 0 samples and the original 20 class 1 samples, plus additional synthetic class 1 samples created to reach the desired oversampling ratio.
• The synthetic samples are generated so that they lie close to the original samples while introducing some variation, expanding the minority class's representation.
• Resulting dataset:
  • Class 0 (non-disease): 100 samples
  • Class 1 (disease): an increased number of samples through SMOTE
• Now you have a balanced dataset that you can use for training machine learning models; a runnable sketch follows.
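The walkthrough above can be reproduced with the widely used imbalanced-learn package, assuming it is installed (`pip install imbalanced-learn`); the toy dataset below is generated rather than real medical data:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy medical data: roughly 100 non-disease (class 0) vs 20 disease (class 1).
X, y = make_classification(n_samples=120, n_features=5, n_informative=3,
                           weights=[100 / 120], random_state=42)
print("before:", Counter(y))

# k_neighbors=5 mirrors the k = 5 choice in the walkthrough above. Each
# synthetic point is x + lam * (neighbor - x) for a random lam in [0, 1].
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("after: ", Counter(y_res))   # both classes now have ~100 samples
```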
• The synthetic samples generated by SMOTE improve the model's ability to learn the minority class and make more accurate predictions.
• SMOTE is a powerful technique for addressing class imbalance, but it should be used with caution. Generating too many synthetic samples can lead to overfitting, so it is essential to choose the oversampling ratio carefully and evaluate model performance using appropriate validation techniques, as in the sketch below.
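One common way to follow that advice is to apply SMOTE only inside each training fold of a cross-validation loop, so the validation folds never contain synthetic points. A sketch using imbalanced-learn's Pipeline (assuming scikit-learn and imbalanced-learn are installed; the model and metric choices here are illustrative):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Imbalanced toy dataset: ~90% majority class, ~10% minority class.
X, y = make_classification(n_samples=600, weights=[0.9], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),           # re-fit on training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

# Minority-class F1 is a more informative score than accuracy here.
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print("mean F1 across folds:", scores.mean().round(3))
```

Because the sampler sits inside the pipeline, each cross-validation split resamples only its own training portion, which gives an honest estimate of performance on real, unaugmented data.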