DSF - UNIT III Notes
SYLLABUS
UNIT III MACHINE LEARNING
The modeling process - Types of machine learning - Supervised learning - Unsupervised learning -
Semi-supervised learning- Classification, regression - Clustering – Outliers and Outlier Analysis
of Data.
2.1.1.1 CLASSIFICATION:
Classification algorithms solve problems in which the output variable is categorical, such as
"Yes" or "No", "Male" or "Female", "Red" or "Blue". They predict which of the categories present
in the dataset a new observation belongs to. Real-world examples include spam detection and
email filtering.
What is the Classification Algorithm?
The Classification algorithm is a Supervised Learning technique that identifies the category of
new observations on the basis of training data. In Classification, a program learns from the given
dataset or observations and then assigns each new observation to one of a number of classes or
groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog. Classes are also called targets,
labels, or categories.
Unlike regression, the output variable of Classification is a category, not a numeric value, such
as "Green or Blue" or "fruit or animal". Because Classification is a Supervised learning technique,
it takes labeled input data, meaning each input comes with its corresponding output.
In a classification algorithm, a function f maps the input variable (x) to a discrete output (y):
y = f(x), where y = categorical output
A familiar example of an ML classification algorithm is an Email Spam Detector.
The main goal of a Classification algorithm is to identify the category of a given observation, so
these algorithms are mainly used to predict categorical outputs.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, Class A and Class B. The points within each class have features that
are similar to each other and dissimilar to those of the other class.
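The idea of two classes whose members resemble each other can be sketched in a few lines of Python. The example below is only an illustration, not a method prescribed by these notes: it assumes a nearest-centroid rule (each class is summarized by the mean of its training points), and the training data and the names fit_centroids and predict are hypothetical.

```python
from math import dist

def fit_centroids(points, labels):
    """Compute the mean feature vector (centroid) of each class."""
    grouped = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in grouped.items()
    }

def predict(centroids, point):
    """Return the label whose centroid is closest to the given point."""
    return min(centroids, key=lambda label: dist(centroids[label], point))

# Hypothetical labeled training data: two features per observation.
train_x = [(1.0, 1.2), (0.8, 0.9), (1.1, 1.0), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["A", "A", "A", "B", "B", "B"]

centroids = fit_centroids(train_x, train_y)
prediction_near_a = predict(centroids, (1.0, 1.0))  # close to the Class A points
prediction_near_b = predict(centroids, (4.0, 4.0))  # close to the Class B points
print(prediction_near_a, prediction_near_b)
```

Here predict plays the role of f in y = f(x): it takes a feature vector and returns a category rather than a number.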
2.1.1.2 REGRESSION:
Regression is a process of finding the relationship between dependent and independent variables.
It helps in predicting continuous variables, such as market trends or house prices.
The task of the Regression algorithm is to find the mapping function to map the input variable(x)
to the continuous output variable(y).
What is Regression?
Regression is a statistical approach used to analyze the relationship between a dependent variable
(target variable) and one or more independent variables (predictor variables). The objective is to
determine the most suitable function that characterizes the connection between these variables.
It seeks to find the best-fitting model, which can be utilized to make predictions or draw
conclusions.
Example:
Suppose we want to forecast the weather, for instance the temperature on a future day. For this we
can use a Regression algorithm: the model is trained on past data, and once training is completed,
it can predict the value for future days.
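As a minimal sketch of this idea, the Python code below fits a straight line y = slope * x + intercept by ordinary least squares and uses it to predict a future value. The choice of a simple linear model, the temperature figures, and the name fit_line are assumptions made for illustration; real weather forecasting uses far richer models.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past data: day index (predictor) and temperature (response).
days = [1, 2, 3, 4, 5]
temps = [20.0, 21.0, 22.0, 23.0, 24.0]

slope, intercept = fit_line(days, temps)
forecast_day_6 = slope * 6 + intercept  # continuous prediction for a future day
print(slope, intercept, forecast_day_6)
```

Because the sample data rises by exactly one degree per day, the fitted slope is 1 and the forecast for day 6 is 25 degrees. The output is a continuous number, which is what distinguishes regression from classification.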
Terminologies Related to Regression Analysis:
Response Variable: The primary factor to predict or understand in regression, also known as the
dependent variable or target variable.
Predictor Variable: Factors influencing the response variable, used to predict its values; also
called independent variables.
Outliers: Observations with very low or very high values compared to the others. They can
distort the fitted model, so they should be identified and handled with care.
Multicollinearity: High correlation among independent variables, which can complicate the
ranking of influential variables.
Underfitting and Overfitting: Overfitting occurs when an algorithm performs well on the training
data but poorly on the test data, while underfitting means poor performance on both datasets.
Regression Algorithm | Classification Algorithm
In Regression, the output variable must be a continuous value. | In Classification, the output variable must be a discrete value.
The task of the regression algorithm is to map the input value (x) to the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) to the discrete output variable (y).
Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data.
UNSUPERVISED LEARNING
Unsupervised learning differs from the Supervised learning technique; as its name suggests,
there is no need for supervision. In unsupervised machine learning, the model is trained on data
that is neither classified nor labelled, and it must find structure in that data without any
supervision.
Working of Unsupervised Learning:
Here, we have taken unlabeled input data, which means it is not categorized and the
corresponding outputs are not given. This unlabeled input data is fed to the machine learning
model in order to train it. First, the model interprets the raw data to find hidden patterns, and
then a suitable algorithm is applied, such as k-means clustering or hierarchical clustering.
Once the algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
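The grouping step described above can be sketched with k-means, one of the algorithms named in this section. The Python code below is a simplified illustration and not a production implementation: the data points are hypothetical, the initial centroids are chosen by hand rather than at random, and the function name k_means is an assumption.

```python
from math import dist

def k_means(points, initial_centroids, max_iterations=100):
    """Group points around centroids by alternating assignment and update steps."""
    centroids = list(initial_centroids)
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep a centroid that lost all points
        if new_centroids == centroids:
            break  # converged: assignments already match these centroids
        centroids = new_centroids
    return centroids, assignments

# Hypothetical unlabeled data: no target value is provided.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, initial_centroids=[points[0], points[3]])
print(assignments)
```

No labels are supplied at any point; the two groups emerge only from the distances between the objects.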
The input to the unsupervised learning models is as follows:
Unstructured data: May contain noisy (meaningless) data, missing values, or unknown data.
Unlabeled data: Data contains values only for the input parameters; there is no target
value (output). It is easier to collect than the labeled data used in the Supervised approach.
The clusters formed need not be circular in shape; the shape of a cluster can be arbitrary.
There are many algorithms that work well at detecting arbitrarily shaped clusters.
For example, in the below given graph we can see that the clusters formed are not circular in
shape.
Uses of Clustering
Before we begin with the types of clustering algorithms, we will go through the use cases of
clustering algorithms. Clustering algorithms are mainly used for:
Market Segmentation – Businesses use clustering to group their customers and use targeted
advertisements to attract more audience.
Market Basket Analysis – Shop owners analyze their sales to figure out which items are
frequently bought together by customers. For example, an often-cited study from the USA
reported that diapers and beer were frequently bought together by fathers.
Social Network Analysis – Social media sites use your data to understand your browsing
behaviour and provide you with targeted friend recommendations or content recommendations.
Medical Imaging – Doctors use Clustering to find out diseased areas in diagnostic images like X-
rays.
Anomaly Detection – Clustering can be used to find outliers in a stream of real-time data, for
example to flag potentially fraudulent transactions.
Simplify working with large datasets – Each cluster is given a cluster ID after clustering is
complete. You may then condense an example's entire feature set into its cluster ID. Clustering
is effective when it can represent a complicated example with a straightforward cluster ID. Using
the same principle, clustering can make complex datasets simpler.
2.2.1.1 ASSOCIATION
Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps these dependencies so that they can be
exploited, for example to increase profit. It tries to find interesting relations or associations
among the variables of a dataset, using different rules to discover them in the database.
Association rule learning is one of the important concepts of machine learning, and it is
employed in Market Basket analysis, Web usage mining, continuous production, etc. Market
basket analysis is a technique used by many big retailers to discover associations between items.
We can understand it by taking the example of a supermarket, where products that are frequently
purchased together are placed together.
For example, shopping stores use algorithms based on this technique to find the relationship
between the sales of one product and the sales of another, based on customer behavior. If a
customer buys milk, then he may also buy bread, eggs, or butter. Once trained well, such models
can be used to increase sales by planning different offers.
Similarly, if a customer buys bread, he is likely to also buy butter, eggs, or milk, so
these products are stored on the same shelf or nearby. Consider the below diagram:
An association rule has the form "If A, then B". Here the If element (A) is called the antecedent,
and the Then element (B) is called the consequent. A relationship in which we find an association
between two items is known as single cardinality. It is all about creating rules, and if the number
of items increases, then the cardinality also increases accordingly. To measure the associations
between thousands of data items, there are several metrics. These metrics are given below:
o Support
o Confidence
o Lift
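The notes list these three metrics without defining them. Under their usual definitions, Support is the fraction of transactions that contain an itemset, Confidence of a rule A -> B is Support(A and B) / Support(A), and Lift is Confidence(A -> B) / Support(B). The Python sketch below computes them for a small basket dataset; the transactions and the function names are hypothetical and chosen only for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in baskets that contain the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

rule_support = support(transactions, {"bread", "butter"})
rule_confidence = confidence(transactions, {"bread"}, {"butter"})
rule_lift = lift(transactions, {"bread"}, {"butter"})
print(rule_support, rule_confidence, rule_lift)
```

For the rule "If bread, then butter" the support is 3/5, the confidence is roughly 0.75, and the lift is roughly 1.25. A lift above 1 suggests that the two items occur together more often than would be expected if they were independent.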
SEMI-SUPERVISED LEARNING
Semi-Supervised learning is a type of Machine Learning that occupies the middle ground
between Supervised and Unsupervised learning. It uses a combination of labeled and unlabeled
data during the training period.
The concept of Semi-supervised learning was introduced to overcome the drawbacks of
supervised and unsupervised learning algorithms. Its main aim is to make effective use of all the
available data, rather than only the labelled data as in supervised learning. Initially, similar
data points are clustered using an unsupervised learning algorithm, and the clusters then help to
assign labels to the unlabeled data. This is worthwhile because labelled data is comparatively
more expensive to acquire than unlabeled data.
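The step of turning unlabeled data into labelled data can be illustrated with a small Python sketch. It does not implement a full clustering algorithm; as a stated simplification, each unlabeled point takes the label of its closest already-labeled point, nearest pairs first, so labels spread outward through groups of similar points. The data, the labels, and the name propagate_labels are hypothetical, and the sketch assumes at least one labeled point is available.

```python
from math import dist

def propagate_labels(labeled, unlabeled):
    """Give each unlabeled point the label of its nearest labeled point.

    labeled:   dict mapping a point (tuple) to its known label
    unlabeled: list of points (tuples) without labels
    """
    known = dict(labeled)
    remaining = list(unlabeled)
    while remaining:
        # Choose the unlabeled point that lies closest to any labeled point.
        point, neighbour = min(
            ((u, k) for u in remaining for k in known),
            key=lambda pair: dist(pair[0], pair[1]),
        )
        known[point] = known[neighbour]  # pseudo-label, not a verified label
        remaining.remove(point)
    return known

# Hypothetical data: two expensive labeled points and four cheap unlabeled ones.
labeled = {(1.0, 1.0): "spam", (8.0, 8.0): "not spam"}
unlabeled = [(1.5, 1.2), (2.5, 2.0), (7.0, 7.5), (6.0, 6.5)]

result = propagate_labels(labeled, unlabeled)
for point in unlabeled:
    print(point, result[point])
```

The labels produced this way are only guesses derived from similarity, so in practice they would usually be checked or weighted before a supervised model is trained on them.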
2.3.1 Assumptions followed by Semi-Supervised Learning:
To work with the unlabeled dataset, there must be a relationship between the objects. To
understand this, semi-supervised learning uses any of the following assumptions:
OUTLIER
Outliers in machine learning are data points that deviate significantly from the majority of
the data, being either much higher or much lower than the other data points. They can be
anomalous, noisy, or the result of measurement or execution errors, and their presence can have a
significant impact on the results of machine learning algorithms. The analysis of outlier data is
referred to as outlier analysis or outlier mining.
3.1 TYPES OF OUTLIERS
There are two main types of outliers:
1. Global outliers:
Global outliers are isolated data points that are far away from the main body of the data. They are often
easy to identify and remove.
2. Contextual outliers:
Contextual outliers are data points that are unusual in a specific context but may not be outliers in a
different context. They are often more difficult to identify and may require additional information or
domain knowledge to determine their significance.
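Global outliers in a single numeric variable can often be found with a simple statistical rule. The Python sketch below uses the interquartile range (IQR) rule, which flags values more than 1.5 * IQR outside the first or third quartile. This rule is one common convention and is an assumption here, not something the notes prescribe; the readings and the name find_global_outliers are hypothetical.

```python
from statistics import quantiles

def find_global_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sensor readings with one value far from the main body of data.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
outliers = find_global_outliers(readings)
print(outliers)
```

Only the reading 95 is flagged. A rule like this cannot detect contextual outliers, because it looks at the values alone and ignores the context (such as time or location) in which each value was observed.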