Unit_6_Machine_Learning_Algorithms
ML algorithms learn from various types of data, including images, text, sensor readings,
and historical records.
Some common ML algorithms include decision trees, neural networks, and support vector
machines.
The applications of ML are vast and diverse. It powers recommendation systems like those
used by Netflix, speech recognition, medical diagnosis, and autonomous vehicles. ML is
also behind chatbots, personalized ads, and fraud detection systems.
Supervised learning
Supervised learning stands out as one of the foundational types of Machine Learning. It is a
powerful approach that allows machines to learn from labeled data, making predictions or
decisions based on that learning. Within supervised learning, two primary types of algorithms
emerge:
1. Regression – works with continuous data
2. Classification – works with discrete data
Regression:
Before we go through regression, we should first learn about correlation.
Correlation:
Correlation is a measure of the strength of the linear relationship between two quantitative
variables: an independent variable (e.g., price) and a dependent variable (e.g., sales). If a change
in the independent variable appears to be accompanied by a change in the dependent variable,
the two variables are said to be correlated, and this interdependence is called correlation.
There are three types of correlation:
1. Positive Correlation: In a positive correlation, both variables move in the same direction.
As one variable increases, the other also tends to increase, and vice versa.
2. Negative Correlation: In a negative correlation, the variables move in opposite directions.
As one variable increases, the other tends to decrease, and vice versa.
3. Zero Correlation: When there is no apparent relationship between two variables, they are
said to have zero correlation. Changes in one variable do not predict changes in the other.
PEARSON’S R
Pearson’s R measures the strength and direction of the linear relationship between two
continuous variables. In the context of regression analysis, a high degree of correlation between
the independent and dependent variables suggests that there may be a meaningful relationship
to explore using regression techniques.
In deviation form, the correlation coefficient is:

r = Σ(x − x̄)(y − ȳ) / √[ Σ(x − x̄)² × Σ(y − ȳ)² ]

OR, equivalently, in raw-score form:

r = [ n Σxy − (Σx)(Σy) ] / √[ (n Σx² − (Σx)²) (n Σy² − (Σy)²) ]

Where:
n is the number of data pairs, x̄ and ȳ are the means of x and y, and Σ denotes summation
over all pairs.
To calculate the Pearson correlation coefficient for a dataset, we first calculate the sums
Σx, Σy, Σxy, Σx², and Σy², and then substitute them into the formula.
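The formula can be applied directly in code. Below is a minimal Python sketch; the (x, y) values are made-up sample data, not taken from the text:

```python
# Pearson's r computed from its definition in plain Python.
# The (x, y) pairs are illustrative values, not from the text.
import math

x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 10, 12]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Numerator: sum of products of deviations from the means
num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# Denominator: square root of the product of the sums of squared deviations
den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) *
                sum((yi - mean_y) ** 2 for yi in y))

r = num / den
print(round(r, 4))  # → 0.9105 (a strong positive correlation)
```

A value of r close to +1 or −1 indicates a strong linear relationship, while values near 0 indicate little or no linear relationship.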
REGRESSION
Regression is a statistical technique used to model the relationship between a dependent
variable and one or more independent variables. In simpler terms, regression helps us
understand how changes in one or more variables are associated with changes in another
variable.
By analyzing the relationship between the independent and dependent variables, regression
allows us to make predictions and understand how changes in one variable may impact the
other. This makes regression a powerful tool for forecasting, prediction, and understanding
complex relationships in various fields such as economics, social sciences, and healthcare.
Regression analysis is particularly useful when dealing with continuous data, where variables can
take on any value within a certain range. For example, variables such as height, temperature,
salary, and time are all continuous, meaning they can be measured along a continuous scale.
An analysis that involves more than one variable, and that focuses on predicting the value of
the variable that depends on the others, is called regression analysis.
Let there be two variables x and y. If y depends on x, the result is a simple regression, and we
name the variables as follows: x is the independent (or explanatory) variable, and y is the
dependent (or response) variable.
The least squares method is commonly employed to find this best-fit line or curve. This method
minimizes the squared differences between observed and predicted values, ensuring that the
regression line captures the overall trend or pattern in the data as accurately as possible.
Through the least squares method, regression analysis yields estimates of the regression
coefficients that define the best-fit relationship between the variables. These coefficients allow
predictions of the dependent variable from the values of the independent variable(s) with
greater accuracy and reliability. As a result, the least squares method is widely used in
regression analysis.
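The least squares calculation itself is short. The following Python sketch uses made-up data chosen to lie exactly on the line y = 2x + 1, so the fitted slope and intercept are known in advance:

```python
# Least-squares fit of the line y = slope*x + intercept.
# The toy data lies exactly on y = 2x + 1, so the fit recovers those values.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# slope = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
# The best-fit line always passes through the point (x̄, ȳ)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # → 2.0 1.0
```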
Example 1
For six people with different ages and weights, let us draw the line of best fit in Excel.
S.No    Age (x)    Weight (y)
1       40         78
2       21         70
3       25         60
4       31         55
5       38         80
6       47         66
Solution:
Step 1: Select the Age and Weight.
Step 2: Insert a scatter chart and make the following changes:
● Trendline: Linear, with Display Equation on Chart checked
● X axis minimum: 20
Step 3: Let us verify the values of slope and intercept using the SLOPE() and INTERCEPT()
functions in Excel.
Step 4: Click on any cell and type =SLOPE(. Select the values of Weight, type a comma, select
the values of Age, close the bracket, and press Enter.
Step 5: Click on any cell and type =INTERCEPT(. Select the values of Weight, type a comma,
select the values of Age, close the bracket, and press Enter.
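The same slope and intercept that Excel's SLOPE() and INTERCEPT() return can be checked outside Excel. A minimal Python sketch using the six Age/Weight pairs from the table:

```python
# Verifying Excel's SLOPE()/INTERCEPT() for the Age (x) vs Weight (y) table.
ages = [40, 21, 25, 31, 38, 47]       # independent variable x
weights = [78, 70, 60, 55, 80, 66]    # dependent variable y

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(weights) / n

# Least-squares slope and intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, weights))
         / sum((x - mean_x) ** 2 for x in ages))
intercept = mean_y - slope * mean_x

print(round(slope, 4), round(intercept, 4))  # → 0.3491 56.4138
```

These values match the trendline equation that Excel displays on the chart.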
Some of the regression algorithms include Linear Regression, Logistic Regression, Decision Tree
Regression, Random Forest Regression. Let us learn about Linear Regression.
[Scatter chart: Age (x) from 15 to 50 on the horizontal axis, Weight (y) from 0 to 90 on the
vertical axis, with the linear trendline f(x) = 0.349095966620306 x + 56.413769123783
displayed on the chart]
Linear Regression:
Linear regression is one of the most basic types of regression in machine learning. The linear
regression model consists of a predictor variable and a dependent variable that are linearly
related to each other. When the data involves more than one independent variable, the model
is called multiple linear regression. Linear regression is further divided into two types:
a) Simple Linear Regression: The dependent variable's value is predicted using a single
independent variable in simple linear regression.
b) Multiple Linear Regression: In multiple linear regression, more than one independent
variable is used to predict the value of the dependent variable.
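As a sketch of multiple linear regression, the example below fits two independent variables by solving the normal equations (XᵀX)β = Xᵀy in plain Python. The data is made up to lie exactly on y = 1 + 2·x1 + 3·x2, so the fitted coefficients are known in advance:

```python
# Multiple linear regression via the normal equations (XᵀX)β = Xᵀy,
# solved with a small Gaussian-elimination routine.

def solve(A, b):
    """Solve A·v = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    v = [0.0] * n
    for i in range(n - 1, -1, -1):
        v[i] = (M[i][n] - sum(M[i][j] * v[j] for j in range(i + 1, n))) / M[i][i]
    return v

# Each row: [1 (intercept term), x1, x2]; targets follow y = 1 + 2*x1 + 3*x2
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1], [1, 1, 2]]
y = [1, 3, 4, 6, 8, 9]

XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]

beta = solve(XtX, Xty)  # [intercept, coefficient of x1, coefficient of x2]
print([round(v, 6) for v in beta])  # → [1.0, 2.0, 3.0]
```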
2. CLASSIFICATION
Classification is the process of sorting data into predefined categories, or classes, and attaching
a class label to each item. For example, suppose you live in a gated housing society that has
separate dustbins for different types of waste: paper waste, plastic waste, food waste and so on.
What you are basically doing here is classifying the waste into different categories and then
labeling each category. In the picture given below, we are assigning the labels ‘paper’, ‘metal’,
‘plastic’, and so on to different types of waste.
● Training Data: The classification model is trained using a dataset known as training data. This
dataset consists of labelled examples, where each data instance is associated with a class
label. The model learns from this data to understand the relationship between the features
and the corresponding class labels.
● Classification Model: An algorithm or technique is used to build the classification model. This
model learns from the training data to predict the class labels of new, unseen data instances.
It aims to generalize from the patterns and relationships in the training data to make accurate
predictions.
● Prediction or Inference: Once trained, the classification model is used to predict the class
labels of new data instances. This process, known as prediction or inference, relies on the
learned patterns and relationships from the training data.
Types of classification
The four main types of classification are:
1) Binary Classification
2) Multi-Class Classification
3) Multi-Label Classification
4) Imbalanced Classification
One widely used classification algorithm is K-Nearest Neighbours (KNN), which assigns a new
data point to the class that is most common among its k closest labelled neighbours. KNN is
particularly useful for classification problems where the decision boundaries are not clearly
defined or the dataset does not have a well-defined structure. It provides a simple yet effective
method for identifying the category or class of a new data point based on its similarity to
existing data points.
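A sketch of how KNN classifies: the pure-Python example below (with made-up points and hypothetical labels "cat" and "dog") assigns a query point to the majority class among its k nearest neighbours by Euclidean distance:

```python
# K-Nearest Neighbours: classify a point by majority vote among
# its k closest labelled neighbours (Euclidean distance).
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; returns the majority label of the k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D training points with hypothetical labels
train = [((1, 1), "cat"), ((1, 2), "cat"), ((2, 1), "cat"),
         ((6, 6), "dog"), ((7, 7), "dog"), ((6, 7), "dog")]

print(knn_predict(train, (2, 2)))  # → cat
print(knn_predict(train, (6, 5)))  # → dog
```

The choice of k matters: a small k is sensitive to noise, while a large k smooths the decision boundary.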
Applications of KNN:
● Image recognition and classification
● Recommendation systems
● Healthcare diagnostics
● Text mining and sentiment analysis
● Anomaly detection
Advantages of KNN:
● Simple to understand and implement
● No explicit training phase: the stored examples themselves act as the model
● Works for both binary and multi-class classification
Unsupervised Learning
Unsupervised learning, on the other hand, deals with unlabelled data: the algorithm tries to
find hidden patterns or structure without explicit guidance. The goal is to explore and discover
inherent structures or relationships within the data, such as clusters or associations. Examples
include k-means clustering, hierarchical clustering, principal component analysis, and
autoencoders.
Clustering
Clustering, or cluster analysis, is a machine learning technique used to group an unlabelled
dataset into clusters based on similarity. Clustering aims to organize data points into groups
where points within the same group are more similar to each other than to those in other
groups.
It is an unsupervised learning method: no supervision is provided to the algorithm, and it works
with an unlabelled dataset. The clustering technique is commonly used for statistical data
analysis.
1) Partitioning Clustering
Partitioning clustering divides the data into a pre-specified number of non-hierarchical groups;
the K-Means algorithm is the best-known example of this type.
2) Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so arbitrarily
shaped clusters can be formed as long as the dense regions can be connected. The algorithm
identifies the different clusters in the dataset by joining areas of high density, which are
separated from each other by sparser areas. These algorithms can struggle to cluster the data
points when the dataset has varying densities and high dimensionality.
3) Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability
that a data point belongs to a particular distribution. The grouping is done by assuming certain
distributions, most commonly the Gaussian distribution. An example of this type is the
Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
4) Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters. In this technique, the dataset is divided into
clusters to create a tree-like structure called a dendrogram. Any number of clusters can then be
obtained by cutting the tree at the appropriate level. The most common example of this method
is the agglomerative hierarchical algorithm.
K- Means clustering
K-Means Clustering is an unsupervised learning algorithm used to solve clustering problems in
machine learning and data science. It is one of the most popular clustering algorithms. It
partitions the samples into different clusters of roughly equal variance, minimizing the variation
within each cluster. The number of clusters must be specified in advance.
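The K-Means procedure can be sketched in plain Python: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat. The data, k, and the naive "first k points" initialization below are illustrative assumptions:

```python
# K-Means sketch: alternate assignment and centroid-update steps.
import math

def kmeans(points, k, iters=20):
    centroids = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious made-up groups of 2-D points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print([tuple(round(v, 2) for v in c) for c in sorted(centroids)])  # → [(1.33, 1.33), (8.33, 8.33)]
```

Real implementations stop once assignments no longer change and pick starting centroids more carefully (e.g. k-means++), since a poor initialization can lead to a poor clustering.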