
P.S.N.A. COLLEGE OF ENGINEERING & TECHNOLOGY

(An Autonomous Institution affiliated to Anna University, Chennai)
Kothandaraman Nagar, Muthanampatti (PO), Dindigul – 624 622.
Phone: 0451-2554032, 2554349 Web Link: www.psnacet.org
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Subject Code / Name : OCS353 / DATA SCIENCE FUNDAMENTALS
Year / Semester : IV/ VII ‘A’

SYLLABUS
UNIT III MACHINE LEARNING
The modeling process - Types of machine learning - Supervised learning - Unsupervised learning -
Semi-supervised learning- Classification, regression - Clustering – Outliers and Outlier Analysis
of Data.

THE MODELING PROCESS


Each step in the modeling process is crucial for building an effective and reliable machine
learning model. Ensuring attention to detail at each stage can lead to better performance and more
accurate predictions.
There are 10 steps involved in building a better machine learning model.
1. Problem Definition
2. Data Collection
3. Data Exploration and Preprocessing
4. Feature Selection
5. Model Selection
6. Model Training
7. Model Evaluation
8. Model Tuning
9. Model Deployment
10. Model Maintenance
1. Problem Definition
Objective: Clearly define the problem you are trying to solve. This includes understanding the
business or scientific objectives.
Output : Decide whether it is a classification, regression, clustering, or another type of problem.
2. Data Collection
Data collection is a crucial step in the creation of a machine learning model, as it lays the
foundation for building accurate models. In this phase of machine learning model development, relevant
data is gathered from various sources to train the machine learning model and enable it to make accurate
predictions.
Sources: Gather data from various sources such as databases, online repositories, sensors, etc.
Quality : Ensure data quality by addressing issues like missing values, inconsistencies, and
errors.
3. Data Exploration and Preprocessing
Exploration: Analyze the data to understand its structure, patterns, and anomalies.
● Visualization: Use plots and graphs to visualize data distributions and relationships.
● Statistics: Calculate summary statistics to get a sense of the data.
Preprocessing: Prepare the data for modeling.
● Cleaning: Handle missing values, outliers, and duplicates.
● Transformation: Normalize or standardize data, handle categorical variables, and create
new features if necessary.
● Feature Engineering: Create new features from existing ones to improve model performance.
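A minimal sketch of the cleaning, transformation, and feature-engineering steps above, assuming pandas and scikit-learn; the toy DataFrame, column names, and specific transformations are illustrative, not prescribed by these notes.

```python
# A minimal preprocessing sketch; the data and column names are invented for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age":    [25, 32, None, 45, 29],      # numeric column with a missing value
    "city":   ["A", "B", "A", None, "B"],  # categorical column with a missing value
    "income": [30000, 52000, 41000, 98000, 47000],
})

# Cleaning: fill missing values and drop duplicate rows
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
df = df.drop_duplicates()

# Transformation: one-hot encode the categorical variable
df = pd.get_dummies(df, columns=["city"])

# Feature engineering: create a new feature from existing ones
df["income_per_age"] = df["income"] / df["age"]

# Standardize numeric columns to zero mean and unit variance
num_cols = ["age", "income", "income_per_age"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df.head())
```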
4. Feature Selection
Relevance: Identify and select the features that are most relevant to the problem.
Techniques: Use methods like correlation analysis, mutual information, and feature importance scores.
5. Model Selection
Choose Algorithms: Choose appropriate machine learning algorithms based on the problem type (classification, regression, clustering, etc.).
Baseline Model: Develop a simple model to establish a baseline performance.
Comparison: Compare multiple algorithms using cross-validation to find the best performing one.
6. Model Training
In this phase of building a machine learning model, we have all the necessary ingredients to train
our model effectively. This involves utilizing our prepared data to teach the model to recognize patterns
and make predictions based on the input features. During the training process, we begin by feeding the
preprocessed data into the selected machine-learning algorithm.
Training Data: Split the data into training and testing sets (and sometimes validation sets).
Training Process: Fit the chosen model to the training data, optimizing its parameters.
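A minimal sketch of this step, assuming scikit-learn; the Iris dataset and logistic regression are illustrative choices, not prescribed by the notes.

```python
# Split the data, then fit a chosen model to the training portion.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Split into training and testing sets (a validation set can be split off similarly)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the chosen model to the training data, optimizing its parameters
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Training accuracy:", model.score(X_train, y_train))
```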
7. Model Evaluation
Once you have trained your model, it’s time to assess its performance. There are various metrics used to
evaluate model performance, categorized based on the type of task: regression/numerical or
classification.
1. For regression tasks, common evaluation metrics are:
Mean Absolute Error (MAE): MAE is the average of the absolute differences between
predicted and actual values.
Mean Squared Error (MSE): MSE is the average of the squared differences between predicted
and actual values.
Root Mean Squared Error (RMSE): RMSE is the square root of the MSE, providing a measure of the
average magnitude of error.
R-squared (R2): It is the proportion of the variance in the dependent variable that is predictable
from the independent variables.
2. For classification tasks, common evaluation metrics are:
Accuracy: Proportion of correctly classified instances out of the total instances.
Precision: Proportion of true positive predictions among all positive predictions.
Recall: Proportion of true positive predictions among all actual positive instances.
F1-score: Harmonic mean of precision and recall, providing a balanced measure of model
performance.
Area Under the Receiver Operating Characteristic curve (AUC-ROC): Measure of the
model’s ability to distinguish between classes.
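The metrics listed above can be computed with scikit-learn, as in the sketch below; the prediction vectors are invented purely for illustration.

```python
# Computing the regression and classification metrics on small hand-made vectors.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             accuracy_score, precision_score, recall_score, f1_score)

# Regression metrics
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.0, 6.5])
mse = mean_squared_error(y_true, y_pred)
print("MAE :", mean_absolute_error(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("R2  :", r2_score(y_true, y_pred))

# Classification metrics (binary labels)
y_true_c = [1, 0, 1, 1, 0, 1]
y_pred_c = [1, 0, 0, 1, 0, 1]
print("Accuracy :", accuracy_score(y_true_c, y_pred_c))
print("Precision:", precision_score(y_true_c, y_pred_c))
print("Recall   :", recall_score(y_true_c, y_pred_c))
print("F1-score :", f1_score(y_true_c, y_pred_c))
```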
8. Model Tuning
Tuning and optimizing helps our model to maximize its performance and generalization ability.
This process involves fine-tuning hyperparameters, selecting the best algorithm, and improving features
through feature engineering techniques.



Hyperparameters are parameters that are set before the training process begins and control the
behavior of the machine learning model. Examples include the learning rate and regularization
strength; these should be carefully adjusted.
Techniques: Use grid search, random search, or Bayesian optimization for hyperparameter tuning.
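A minimal grid-search sketch with scikit-learn; the SVC model and the parameter grid are illustrative assumptions, not values recommended by the notes.

```python
# Hyperparameters are set before training; grid search tries each combination with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy    :", search.best_score_)
```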
9. Model Deployment
Deploying the model and making predictions is the final stage in the journey of creating an ML
model. Once a model has been trained and optimized, it is time to integrate it into a production
environment where it can provide real-time predictions on new data.
During model deployment, it’s essential to ensure that the system can handle high user loads,
operate smoothly without crashes, and be easily updated.
Integration: Deploy the model into a production environment where it can make real-time
predictions.
Monitoring: Continuously monitor the model's performance to ensure it remains accurate and reliable.
10. Model Maintenance
Updates: Periodically update the model with new data to maintain its performance.
Retraining: Retrain the model if there are significant changes in the data patterns or if the
model's performance degrades.

TYPES OF MACHINE LEARNING


Machine Learning is the field of study that gives computers the capability to learn without
being explicitly programmed. ML is one of the most exciting technologies one would have ever
come across. As is evident from the name, it gives the computer the ability that makes it more similar
to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more
places than one would expect.
Machine learning is a subset of AI, which enables the machine to automatically learn from data,
improve performance from past experiences, and make predictions. Machine learning contains a set of
algorithms that work on a huge amount of data. Data is fed to these algorithms to train them, and on the
basis of training, they build the model & perform a specific task.

Features of Machine Learning


 Machine learning is a data-driven technology. A large amount of data is generated by
organizations daily, enabling them to identify notable relationships and make better decisions.
 Machines can learn from past data and automatically improve their performance.
 Given a dataset, ML can detect various patterns in the data.
 For large organizations, branding is crucial, and targeting a relatable customer base becomes
easier.
 It is similar to data mining, as both deal with substantial amounts of data.
These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
Based on the methods and way of learning, machine learning is divided into mainly four types,
which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Machine Learning


2.1 Supervised learning
Supervised machine learning is based on supervision. It means in the supervised learning
technique, we train the machines using the "labelled" dataset, and based on the training, the machine
predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the
output. More precisely, we first train the machine with the input and the corresponding output, and
then we ask the machine to predict the output using the test dataset.
The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk Assessment, Fraud
Detection, Spam filtering, etc.
Example:
Let's understand supervised learning with an example. Suppose we have an input dataset of cats
and dog images. So, first, we will provide the training to the machine to understand the images, such as
the shape & size of the tail of cat and dog, Shape of eyes, colour, height (dogs are taller, cats are
smaller), etc. After completion of training, we input the picture of a cat and ask the machine to identify
the object and predict the output. Now, the machine is well trained, so it will check all the features of the
object, such as height, shape, colour, eyes, ears, tail, etc., and find that it's a cat. So, it will put it in the
Cat category. This is the process of how the machine identifies the objects in Supervised Learning.

Steps Involved in Supervised Learning:


1. First, determine the type of training dataset.
2. Collect/gather the labelled training data.
3. Split the dataset into a training set, a test set, and a validation set.
4. Determine the input features of the training dataset, which should carry enough information
for the model to accurately predict the output.
5. Determine a suitable algorithm for the model, such as a support vector machine, decision tree,
etc.
6. Execute the algorithm on the training dataset. Sometimes we need a validation set to tune the
control parameters; this is a subset of the training dataset.
7. Evaluate the accuracy of the model on the test set. If the model predicts the correct outputs,
it is accurate.

2.1.1 TYPES OF SUPERVISED MACHINE LEARNING


Supervised machine learning can be classified into two types of problems, which are given
below:
1. Classification
2. Regression
Classification: Classification algorithms are used to predict a categorical output. For example, a
classification algorithm could be used to predict whether an email is spam or not.
Regression: Regression algorithms are used to predict a continuous numerical output. For example, a
regression algorithm could be used to predict the price of a house based on its size, location, and other
features.

2.1.1.1 CLASSIFICATION:
Classification algorithms are used to solve the classification problems in which the output
variable is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The classification
algorithms predict the categories present in the dataset. Some real-world examples of classification
algorithms are Spam Detection, Email filtering, etc.
What is the Classification Algorithm?
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns from the
given dataset or observations and then classifies new observations into a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels
or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as "Green
or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised learning technique,
it takes labeled input data, which means it contains inputs with the corresponding outputs.
In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):
y = f(x), where y is the categorical output
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset, and
these algorithms are mainly used to predict the output for the categorical data.
Classification can be illustrated with two classes, Class A and Class B. The points within each
class have features that are similar to each other and dissimilar to the points of the other class.



Classification Types
There are two main classification types in machine learning:
i. Binary Classification
In binary classification, the goal is to classify the input into one of two classes or categories. Example –
On the basis of the given health conditions of a person, we have to determine whether the person has a
certain disease or not.
ii. Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several classes or categories. For
example, on the basis of data about different species of flowers, we have to determine which species
our observation belongs to.

Some popular classification algorithms are given below:


1. Random Forest Algorithm
2. Decision Tree Algorithm
3. Logistic Regression Algorithm
4. Support Vector Machine Algorithm
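As a small illustration of the classification idea, the sketch below trains one of the algorithms listed above (a decision tree) on scikit-learn's built-in breast-cancer dataset; the dataset and hyperparameters are illustrative choices, not part of the original notes.

```python
# Binary classification: learn y = f(x) from labelled data, then predict categories for new data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # two classes: malignant / benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)                    # learn from labelled training data

y_pred = clf.predict(X_test)                 # predict the category of new observations
print("Test accuracy:", accuracy_score(y_test, y_pred))
```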



Advantages of Supervised learning Algorithm:
1. Since supervised learning works with a labelled dataset, we can have an exact idea about the
classes of objects.
2. These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages of Supervised learning Algorithm:
1. These algorithms are not able to solve complex tasks.
2. It may predict the wrong output if the test data is different from the training data.
3. It requires lots of computational time to train the algorithm.

2.1.2 POPULAR SUPERVISED LEARNING ALGORITHMS:


1. Logistic Regression
2. Support Vector Machine
3. Random Forest
4. Decision Tree
5. K-Nearest Neighbors (KNN)
6. Naive Bayes

2.1.1.2 REGRESSION:
Regression is a process of finding the correlations between dependent and independent variables.
It helps in predicting the continuous variables such as prediction of Market Trends, prediction of House
prices, etc.
The task of the Regression algorithm is to find the mapping function to map the input variable(x)
to the continuous output variable(y).
What is Regression?
Regression is a statistical approach used to analyze the relationship between a dependent variable
(target variable) and one or more independent variables (predictor variables). The objective is to
determine the most suitable function that characterizes the connection between these variables.
It seeks to find the best-fitting model, which can be utilized to make predictions or draw
conclusions.
Example:
Suppose we want to do weather forecasting, so for this, we will use the Regression algorithm. In
weather prediction, the model is trained on the past data, and once the training is completed, it can easily
predict the weather for future days.
Terminologies Related to Regression Analysis in Machine Learning:
 Response Variable: The primary factor to predict or understand in regression, also known as the
dependent variable or target variable.
 Predictor Variable: Factors influencing the response variable, used to predict its values; also
called independent variables.
 Outliers: Observations with significantly low or high values compared to others, potentially
impacting results and best avoided.
 Multicollinearity: High correlation among independent variables, which can complicate the
ranking of influential variables.
 Underfitting and Overfitting: Overfitting occurs when an algorithm performs well on training
but poorly on testing, while underfitting indicates poor performance on both datasets.



There are three main types of regression:
 Simple Regression
o Used to predict a continuous dependent variable based on a single independent variable.
o Simple linear regression should be used when there is only a single independent variable.
 Multiple Regression
o Used to predict a continuous dependent variable based on multiple independent variables.
o Multiple linear regression should be used when there are multiple independent variables.
 NonLinear Regression
o Relationship between the dependent variable and independent variable(s) follows a
nonlinear pattern.
o Provides flexibility in modeling a wide range of functional forms.
Characteristics of Regression
Here are the characteristics of the regression:
 Continuous Target Variable: Regression deals with predicting continuous target variables that
represent numerical values. Examples include predicting house prices, forecasting sales figures,
or estimating patient recovery times.
 Error Measurement: Regression models are evaluated based on their ability to minimize the
error between the predicted and actual values of the target variable. Common error metrics
include mean absolute error (MAE), mean squared error (MSE), and root mean squared error
(RMSE).
 Model Complexity: Regression models range from simple linear models to more complex
nonlinear models. The choice of model complexity depends on the complexity of the relationship
between the input features and the target variable.
 Overfitting and Underfitting: Regression models are susceptible to overfitting and underfitting.
 Interpretability: The interpretability of regression models varies depending on the algorithm
used. Simple linear models are highly interpretable, while more complex models may be more
difficult to interpret.
Below are some popular Regression algorithms which come under supervised learning:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
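As a brief illustration of regression, here is a minimal sketch using scikit-learn's LinearRegression on invented house-size/price data; the numbers are made up for demonstration and are not from the notes.

```python
# Simple linear regression: predict a continuous target (price) from one feature (size).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[600], [800], [1000], [1200], [1500]])        # house size in square feet
y = np.array([150000, 200000, 240000, 275000, 340000])      # price (continuous target)

reg = LinearRegression().fit(X, y)            # find the best-fitting line y = f(x)
print("Slope    :", reg.coef_[0])
print("Intercept:", reg.intercept_)
print("Predicted price for 1100 sq ft:", reg.predict([[1100]])[0])
```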
The main difference between Regression and Classification algorithms is that Regression
algorithms are used to predict continuous values such as price, salary, age, etc., while Classification
algorithms are used to predict/classify discrete values such as Male or Female, True or False,
Spam or Not Spam, etc.



Difference between Regression and Classification:

Regression: The output variable must be of continuous nature or a real value.
Classification: The output variable must be a discrete value.

Regression: The task of the regression algorithm is to map the input value (x) with the continuous output variable (y).
Classification: The task of the classification algorithm is to map the input value (x) with the discrete output variable (y).

Regression: Regression algorithms are used with continuous data.
Classification: Classification algorithms are used with discrete data.

UNSUPERVISED LEARNING
Unsupervised learning is different from the Supervised learning technique; as its name suggests,
there is no need for supervision. It means, in unsupervised machine learning, the machine is trained
using the unlabeled dataset, and the machine predicts the output without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified nor
labelled, and the model acts on that data without any supervision.



The main goal of unsupervised learning is to find the hidden patterns from the input dataset.

Working of Unsupervised Learning:

Here, we take unlabeled input data, which means it is not categorized and the corresponding
outputs are also not given. This unlabeled input data is fed to the machine learning model in order to
train it. First, the model interprets the raw data to find the hidden patterns, and then a suitable
algorithm such as k-means clustering is applied.
Once the suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
The input to the unsupervised learning models is as follows:
 Unstructured data: May contain noisy(meaningless) data, missing values, or unknown data
 Unlabeled data: Data only contains a value for input parameters, there is no targeted
value(output). It is easy to collect as compared to the labeled one in the Supervised approach.

2.2.1 TYPES OF UNSUPERVISED MACHINE LEARNING


Unsupervised Learning can be further classified into two types, which are given below:
1. Clustering
2. Association



2.2.1.1 CLUSTERING
Clustering in unsupervised machine learning is the process of grouping unlabeled data into
clusters based on their similarities. The goal of clustering is to identify patterns and relationships in the
data without any prior knowledge of the data’s meaning.
Clustering is a technique for exploring raw, unlabeled data and breaking it down into groups (or
clusters) based on similarities or differences. It is used in a variety of applications, including customer
segmentation, fraud detection, and image analysis. Clustering algorithms split data into natural groups
by finding similar structures or patterns in uncategorized data.
Broadly this technique is applied to group data based on different patterns, such as similarities or
differences, our machine model finds. These algorithms are used to process raw, unclassified data
objects into groups.
Cluster analysis finds the commonalities between the data objects and categorizes them as per the
presence and absence of those commonalities.
The task of grouping data points based on their similarity with each other is called Clustering or
Cluster Analysis. This method is defined under the branch of Unsupervised Learning, which aims at
gaining insights from unlabelled data points, that is, unlike supervised learning we don’t have a target
variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It
evaluates the similarity based on a metric like Euclidean distance, Cosine similarity, Manhattan distance,
etc., and then groups the points with the highest similarity score together.
For example, a scatter plot of such data may clearly show three roughly circular clusters
forming on the basis of distance.

It is not necessary that the clusters formed are circular in shape. The shape of clusters can be
arbitrary, and there are many algorithms that work well at detecting arbitrarily shaped clusters.

Some common clustering algorithms


1) K-means Clustering: Partitioning Data into K Clusters
2) Hierarchical Clustering: Building a Hierarchical Structure of Clusters
3) Density-Based Clustering (DBSCAN): Identifying Clusters Based on Density
4) Mean-Shift Clustering: Finding Clusters Based on Mode Seeking
5) Spectral Clustering: Utilizing Spectral Graph Theory for Clustering
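A minimal k-means sketch on a small set of unlabeled 2-D points follows; the data and the choice of k = 3 are illustrative assumptions.

```python
# Clustering unlabeled data: no target variable, only input features.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1.5, 1.8], [1, 1],      # points around (1, 1.5)
              [8, 8], [8.5, 9], [9, 8],        # points around (8.5, 8.3)
              [1, 9], [1.5, 8.5], [2, 9]])     # points around (1.5, 8.8)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels :", kmeans.labels_)       # group id assigned to each point
print("Cluster centres:\n", kmeans.cluster_centers_)
```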

Uses of Clustering
Now before we begin with types of clustering algorithms, we will go through the use cases of
Clustering algorithms. Clustering algorithms are majorly used for:
 Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.
 Market Basket Analysis – Shop owners analyze their sales and figure out which items are
most often bought together by customers. For example, according to a study in the USA, diapers
and beer were usually bought together by fathers.
 Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content recommendations.
 Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like X-
rays.
 Anomaly Detection – To find outliers in a stream of real-time dataset or forecasting fraudulent
transactions we can use clustering to identify them.
 Simplify working with large datasets – Each cluster is given a cluster ID after clustering is
complete. You may then reduce a whole feature set to its cluster ID. Clustering is
effective when it can represent a complicated case with a straightforward cluster ID. Using the
same principle, clustering can make complex datasets simpler.

2.2.1.2 ASSOCIATION
Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps accordingly so that it can be more
profitable. It tries to find some interesting relations or associations among the variables of dataset. It is
based on different rules to discover the interesting relations between variables in the database.
The association rule learning is one of the very important concepts of machine learning, and it is
employed in Market Basket analysis, Web usage mining, continuous production, etc. Here, market
basket analysis is a technique used by various big retailers to discover the associations between items.
We can understand it by taking an example of a supermarket, as in a supermarket, all products that are
purchased together are put together.
For example, shopping stores use algorithms based on this technique to find the relationship
between the sale of one product and the sales of another based on customer behavior. If a customer
buys milk, they may also buy bread, eggs, or butter. Once trained well, such models can be used to
increase sales by planning different offers.
For example, if a customer buys bread, they will most likely also buy butter, eggs, or milk, so
these products are stored on the same shelf or nearby.



Association rule learning can be divided into three types of algorithms:
1. Apriori
2. Eclat
3. F-P Growth Algorithm
How does Association Rule Learning work?
Association rule learning works on the concept of if-then statements, such as "if A then B".

Here, the "if" element (A) is called the antecedent, and the "then" element (B) is called the consequent.
A relationship where we find an association between two items is known as single cardinality.
Association rule learning is all about creating rules, and as the number of items increases, the
cardinality also increases. To measure the associations between thousands of data items, there are
several metrics. These metrics are given below:
o Support
o Confidence
o Lift
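To make these metrics concrete, here is a small hand-rolled sketch that computes support, confidence, and lift for a hypothetical rule "if bread then butter" over an invented list of transactions; the transactions and the rule are illustrative only.

```python
# Support, confidence, and lift for the rule "bread -> butter" over toy transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

support_bread        = sum("bread" in t for t in transactions) / n
support_butter       = sum("butter" in t for t in transactions) / n
support_bread_butter = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_bread_butter / support_bread   # P(butter | bread)
lift       = confidence / support_butter            # > 1 means a positive association

print("Support(bread -> butter)   :", support_bread_butter)
print("Confidence(bread -> butter):", confidence)
print("Lift(bread -> butter)      :", lift)
```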

Some common association rule learning algorithms


1) Apriori Algorithm: A Classic Method for Rule Induction
2) FP-Growth Algorithm: An Efficient Alternative to Apriori
3) Eclat Algorithm: Exploiting Closed Itemsets for Efficient Rule Mining
4) Efficient Tree-based Algorithms: Handling Large Datasets with Scalability
Advantages of Unsupervised Learning:
● These algorithms can be used for complicated tasks compared to the supervised ones because
these algorithms work on the unlabeled dataset.
● Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is easier
as compared to the labelled dataset.
Disadvantages of Unsupervised Learning:
● The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and
the algorithms are not trained with the exact output beforehand.
● Working with Unsupervised learning is more difficult as it works with the unlabelled dataset that
does not map with the output.



Applications of Association Rule Learning
It has various applications in machine learning and data mining. Below are some popular
applications of association rule learning:
o Market Basket Analysis: It is one of the popular examples and applications of association rule
mining. This technique is commonly used by big retailers to determine the association between
items.
o Medical Diagnosis: Association rules help in identifying the probability of illness for a
particular disease, which supports diagnosis and treatment planning.
o Protein Sequence: Association rules help in determining the synthesis of artificial proteins.
o It is also used for Catalog Design, Loss-leader Analysis, and many other
applications.

2.2.2 APPLICATIONS OF UNSUPERVISED LEARNING


Here are some real-world unsupervised learning examples:
 Anomaly detection: Unsupervised clustering can process large datasets and discover data points
that are atypical in a dataset.
 Recommendation engines: Using association rules, unsupervised machine learning can help
explore transactional data to discover patterns or trends that can be used to drive personalized
recommendations for online retailers.
 Customer segmentation: Unsupervised learning is also commonly used to generate buyer
persona profiles by clustering customers’ common traits or purchasing behaviors. These profiles
can then be used to guide marketing and other business strategies.
 Fraud detection: Unsupervised learning is useful for anomaly detection, revealing unusual data
points in datasets. These insights can help uncover events or behaviors that deviate from normal
patterns in the data, revealing fraudulent transactions or unusual behavior like bot activity.
 Natural language processing (NLP): Unsupervised learning is commonly used for various NLP
applications, such as categorizing articles in news sections, text translation and classification, or
speech recognition in conversational interfaces.
 Genetic research: Genetic clustering is another common unsupervised learning example.
Hierarchical clustering algorithms are often used to analyze DNA patterns and reveal
evolutionary relationships.

SEMI-SUPERVISED LEARNING
Semi-Supervised learning is a type of Machine Learning algorithm that represents the
intermediate ground between Supervised and Unsupervised learning algorithms. It uses the combination
of labeled and unlabeled datasets during the training period.
To overcome the drawbacks of supervised learning and unsupervised learning algorithms, the
concept of Semi-supervised learning is introduced. The main aim of semi-supervised learning is to
effectively use all the available data, rather than only labelled data as in supervised learning. Initially,
similar data is clustered using an unsupervised learning algorithm, which then helps to label the
unlabeled data. This is done because labelled data is comparatively more expensive to acquire than
unlabeled data.
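As a hedged illustration of the idea, the sketch below uses scikit-learn's LabelPropagation, hiding most of the Iris labels (marked as -1) to mimic a mixed labeled/unlabeled dataset; the dataset and the masking fraction are illustrative assumptions, not part of the notes.

```python
# Semi-supervised learning: fit on a mix of labeled and unlabeled (-1) samples.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

rng = np.random.RandomState(42)
y_partial = y.copy()
mask = rng.rand(len(y)) < 0.7        # hide about 70% of the labels
y_partial[mask] = -1                 # -1 marks an unlabeled sample

model = LabelPropagation()
model.fit(X, y_partial)              # learns from labeled and unlabeled points together
print("Accuracy on the originally hidden labels:",
      model.score(X[mask], y[mask]))
```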
2.3.1 Assumptions followed by Semi-Supervised Learning:
To work with the unlabeled dataset, there must be a relationship between the objects. To
understand this, semi-supervised learning uses any of the following assumptions:



Continuity Assumption:
As per the continuity assumption, objects near each other tend to share the same group or
label. This assumption is also used in supervised learning, where the classes are separated by decision
boundaries. In semi-supervised learning, the smoothness assumption is added, placing the decision
boundaries in low-density regions.
Cluster assumptions
In this assumption, data are divided into different discrete clusters. Further, the points in the same
cluster share the output label.
Manifold assumptions
This assumption holds that the data lie on a manifold of far fewer dimensions than the input
space, which makes it possible to use distances and densities defined on that manifold.
High-dimensional data is often generated by a process with few degrees of freedom that is hard to
model directly; the manifold assumption becomes practical in such cases.
2.3.2 Applications of Semi-supervised Learning:
1. Speech Analysis
It is the most classic example of a semi-supervised learning application. Since labeling audio data is
a laborious task that requires many human resources, this problem can naturally be overcome by
applying SSL in a semi-supervised learning model.
2. Web content classification
It is practically impossible to label each page on the internet because it would need too much
human intervention. This problem can be reduced through Semi-Supervised learning algorithms.
Further, Google also uses semi-supervised learning algorithms to rank a webpage for a given query.
3. Protein sequence classification
Since DNA strands are large, analyzing them requires active human intervention, so semi-supervised
models have become prominent in this field.
4. Text document classifier
Since it is very difficult to obtain a large amount of labeled text data, semi-supervised
learning is an ideal model to overcome this.

OUTLIER
Outliers in machine learning refer to data points that are significantly different from the majority
of the data. These data points can be anomalous, noisy, or errors in measurement.
An outlier is a data point that significantly deviates from the rest of the data. It can be either
much higher or much lower than the other data points, and its presence can have a significant impact on
the results of machine learning algorithms. They can be caused by measurement or execution errors. The
analysis of outlier data is referred to as outlier analysis or outlier mining.
3.1 TYPES OF OUTLIERS
There are two main types of outliers:
1. Global outliers:
Global outliers are isolated data points that are far away from the main body of the data. They are often
easy to identify and remove.
2. Contextual outliers:
Contextual outliers are data points that are unusual in a specific context but may not be outliers in a
different context. They are often more difficult to identify and may require additional information or
domain knowledge to determine their significance.



3.2 OUTLIER DETECTION METHODS IN MACHINE LEARNING
Outlier detection plays a crucial role in ensuring the quality and accuracy of machine learning
models. By identifying and removing or handling outliers effectively, we can prevent them from biasing
the model, reducing its performance, and hindering its interpretability. Here’s an overview of various
outlier detection methods:
1. Statistical Methods:
● Z-Score:
● Interquartile Range (IQR)
2. Distance-Based Methods:
● K-Nearest Neighbors (KNN)
● Local Outlier Factor (LOF)
3. Clustering-Based Methods:
● Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
● Hierarchical clustering
4. Other Methods:
● Isolation Forest
● One-class Support Vector Machines (OCSVM)
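A small sketch of the two statistical methods listed above (Z-score and IQR), using NumPy on an invented one-dimensional sample; the data and the Z-score cutoff are illustrative.

```python
# Statistical outlier detection with the Z-score and IQR methods.
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 10, 95])   # 95 is an obvious outlier

# Z-score method: flag points far from the mean in standard-deviation units
# (a common cutoff is 3; 2 is used here because the sample is tiny)
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 2])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers    :", data[(data < lower) | (data > upper)])
```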

3.3 TECHNIQUES FOR HANDLING OUTLIERS IN MACHINE LEARNING


Outliers, data points that significantly deviate from the majority, can have detrimental effects on
machine learning models. To address this, several techniques can be employed to handle outliers
effectively:
1. Removal:
This involves identifying and removing outliers from the dataset before training the model. Common
methods include:
● Thresholding: Outliers are identified as data points exceeding a certain threshold (e.g.,
Z-score > 3).
● Distance-based methods: Outliers are identified based on their distance from their
nearest neighbors.
● Clustering: Outliers are identified as points not belonging to any cluster or belonging to
very small clusters.
2. Transformation:
This involves transforming the data to reduce the influence of outliers. Common methods include:
● Scaling: Standardizing or normalizing the data to have a mean of zero and a standard
deviation of one.
● Winsorization: Replacing outlier values with the nearest non-outlier value.



● Log transformation: Applying a logarithmic transformation to compress the data and
reduce the impact of extreme values.
3. Robust Estimation:
This involves using algorithms that are less sensitive to outliers. Some examples include:
● Robust regression: Algorithms like L1-regularized regression or Huber regression are
less influenced by outliers than least squares regression.
● M-estimators: These algorithms estimate the model parameters based on a robust
objective function that down weights the influence of outliers.
● Outlier-insensitive clustering algorithms: Algorithms like DBSCAN are less
susceptible to the presence of outliers than K-means
clustering.
4. Modeling Outliers:
This involves explicitly modeling the outliers as a separate group. This can be done by:
● Adding a separate feature: Create a new feature indicating whether a data point is an
outlier or not.
● Using a mixture model: Train a model that assumes the data comes from a mixture of
multiple distributions, where one distribution
represents the outliers.
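As a brief illustration of the transformation techniques above, the sketch below applies winsorization (percentile capping) and a log transform with NumPy; the sample values and percentile cut-offs are illustrative assumptions.

```python
# Reducing the influence of outliers by capping (winsorization) and log transformation.
import numpy as np

data = np.array([12.0, 15, 14, 13, 400, 16, 14, 13, 350, 15])

# Winsorization: replace values beyond the 5th/95th percentiles with those limits
low, high = np.percentile(data, [5, 95])
winsorized = np.clip(data, low, high)
print("Winsorized     :", winsorized)

# Log transformation: compress the scale so extreme values have less influence
log_transformed = np.log1p(data)
print("Log-transformed:", np.round(log_transformed, 2))
```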

3.4 IMPORTANCE OF OUTLIER DETECTION IN MACHINE LEARNING


Outlier detection is important in machine learning for several reasons:
Biased models: Outliers can bias a machine learning model towards the outlier values, leading to
poor performance on the rest of the data. This can be particularly problematic for algorithms that are
sensitive to outliers, such as linear regression.
Reduced accuracy: Outliers can introduce noise into the data, making it difficult for a machine
learning model to learn the true underlying patterns. This can lead to reduced accuracy and performance.
Increased variance: Outliers can increase the variance of a machine learning model, making it
more sensitive to small changes in the data. This can make it difficult to train a stable and reliable model.
Reduced interpretability: Outliers can make it difficult to understand what a machine learning
model has learned from the data. This can make it difficult to trust the model’s predictions and can
hamper efforts to improve its performance.
Common approaches to detecting outliers include:
1. Visual inspection: using plots to identify outliers
2. Statistical methods: using metrics like the mean, median, and standard deviation to detect outliers
3. Machine learning algorithms: using algorithms like One-Class SVM, Local Outlier Factor
(LOF), and Isolation Forest to detect outliers
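As a small illustration of the algorithmic approach, the sketch below applies scikit-learn's IsolationForest (one of the algorithms named above) to invented 2-D data with two obvious outliers; the data and the contamination setting are illustrative.

```python
# Algorithmic outlier detection with Isolation Forest on synthetic 2-D data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))   # normal points
X_outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])            # two obvious outliers
X = np.vstack([X_inliers, X_outliers])

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)            # +1 = inlier, -1 = outlier
print("Points flagged as outliers:\n", X[labels == -1])
```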
