
Unit 6

Machine Learning Algorithms


Machine Learning (ML) is a part of artificial intelligence (AI) that focuses on teaching computers
to learn from data and make decisions without being explicitly programmed. Unlike traditional
programming where developers provide precise instructions, ML algorithms learn from patterns
and relationships in data. This allows them to generalize and make decisions on new, unseen
data.

● ML algorithms learn from various types of data, including images, text, sensor readings, and historical records.
● Some common ML algorithms include decision trees, neural networks, and support vector machines.
● The applications of ML are vast and diverse. It powers recommendation systems like those used by Netflix, speech recognition, medical diagnosis, and autonomous vehicles. ML is also behind chatbots, personalized ads, and fraud detection systems.

Types of Machine Learning


Machine learning can be divided into three primary categories, each distinguished by its learning approach and the nature of the input data. These categories are: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning
Supervised learning stands out as one of the foundational types of Machine Learning. It is a
powerful approach that allows machines to learn from labeled data, making predictions or
decisions based on that learning. Within supervised learning, two primary types of algorithms
emerge:
1. Regression – works with continuous data
2. Classification – works with discrete data
Regression:
Before we go through regression, we should first learn about correlation.

Correlation:
Correlation is a measure of the strength of the linear relationship between two quantitative variables: an independent variable (e.g., price) and a dependent variable (e.g., sales). If a change in one variable (the independent variable) appears to be accompanied by a change in the other variable (the dependent variable), the two variables are said to be correlated, and this interdependence is called correlation. There are three types of correlation:

1. Positive Correlation: In a positive correlation, both variables move in the same direction. As one variable increases, the other also tends to increase, and vice versa.

2. Negative Correlation: Conversely, in a negative correlation, the variables move in opposite directions. An increase in one variable is associated with a decrease in the other, and vice versa.

3. Zero Correlation: When there is no apparent relationship between two variables, they are said to have zero correlation. Changes in one variable do not predict changes in the other.
PEARSON’S R

Pearson’s R measures the strength and direction of the linear relationship between two
continuous variables. In the context of regression analysis, a high degree of correlation between
the independent and dependent variables suggests that there may be a meaningful relationship
to explore using regression techniques.

Pearson’s r is calculated using the formula:

r = (nΣxy − (Σx)(Σy)) / √([nΣx² − (Σx)²] [nΣy² − (Σy)²])

Where n is the number of data pairs, and x and y are the values of the two variables. The value of r always lies between −1 and +1:

● A value of 0 indicates that there is no association between the two variables.
● A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other variable.
● A value less than 0 indicates a negative association; that is, as the value of one variable increases, the value of the other variable decreases.
Example 1
In the example below of six people with different ages and weights, let us try calculating the value of Pearson’s r.

S.No   Age (x)   Weight (y)
1      40        78
2      21        70
3      25        60
4      31        55
5      38        80
6      47        66

Solution:
To calculate the Pearson correlation coefficient, we first compute the following values:

Σx = 202, Σy = 409, Σxy = 13937, Σx² = 7280, Σy² = 28365

Here the total number of people is 6, so n = 6.

Now the calculation of Pearson’s r is as follows:

r = (n(Σxy) − (Σx)(Σy)) / √([nΣx² − (Σx)²] [nΣy² − (Σy)²])
r = (6 × 13937 − 202 × 409) / √([6 × 7280 − 202²] [6 × 28365 − 409²])
r = (83622 − 82618) / √([43680 − 40804] × [170190 − 167281])
r = 1004 / √(2876 × 2909)
r = 1004 / √8366284
r = 1004 / 2892.452938
r ≈ 0.35

The value of the Pearson correlation coefficient is 0.35, which indicates a positive association between age and weight.
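To make this arithmetic concrete, here is a minimal Python sketch (standard library only) that reproduces the calculation above; the variable names are chosen for illustration.

```python
# Pearson's r for the age/weight data from Example 1.
import math

ages = [40, 21, 25, 31, 38, 47]      # x (independent variable)
weights = [78, 70, 60, 55, 80, 66]   # y (dependent variable)

n = len(ages)                                        # n = 6
sum_x = sum(ages)                                    # Σx  = 202
sum_y = sum(weights)                                 # Σy  = 409
sum_xy = sum(x * y for x, y in zip(ages, weights))   # Σxy = 13937
sum_x2 = sum(x * x for x in ages)                    # Σx² = 7280
sum_y2 = sum(y * y for y in weights)                 # Σy² = 28365

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(round(r, 2))  # 0.35, matching the worked example
```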

REGRESSION
Regression is a statistical technique used to model the relationship between a dependent
variable and one or more independent variables. In simpler terms, regression helps us
understand how changes in one or more variables are associated with changes in another
variable.

By analyzing the relationship between the independent and dependent variables, regression
allows us to make predictions and understand how changes in one variable may impact the
other. This makes regression a powerful tool for forecasting, prediction, and understanding
complex relationships in various fields such as economics, social sciences, and healthcare.

Regression analysis is particularly useful when dealing with continuous data, where variables can
take on any value within a certain range. For example, variables such as height, temperature,
salary, and time are all continuous, meaning they can be measured along a continuous scale.

There are several types of regression analysis; some of them are discussed below.

When we study a distribution in which more than one variable is involved, and the focus is on finding or predicting the value of the variable that depends on the other, such an analysis is called regression analysis.
Let there be two variables x and y. If y depends on x, the result is a simple regression. Furthermore, we name the variables x and y as follows:

x – Independent / Predictor / Explanatory variable: it is used to predict or explain changes in the dependent variable.
y – Regressand / Dependent / Explained variable: it is the variable we want to predict or understand.
Therefore, if we use a simple linear regression model where y depends on x, then the regression
line of y on x is:
y = b + mx + e
In this equation,
● b represents the intercept of the regression line with the y-axis.
● m represents the slope of the regression line, indicating the rate of change in y for a unit
change in x.
● e represents the error or residual, which accounts for the difference between the observed
values of y and the values predicted by the regression equation.

The least squares method is commonly employed to find this best-fit line or curve. This method
minimizes the squared differences between observed and predicted values, ensuring that the
regression line captures the overall trend or pattern in the data as accurately as possible.

Through the least squares method, regression analysis yields estimates of the regression coefficients that define the best-fit relationship between the variables. These coefficients allow us to make predictions about the dependent variable based on the values of the independent variable(s) with greater accuracy and reliability. As a result, this method is widely used in regression analysis.
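To see the least squares method in action, here is a minimal Python sketch that computes the closed-form estimates of the slope and intercept for the same age/weight data; the values it prints should match the Excel trendline in the next example.

```python
# Closed-form least squares fit for a simple linear regression y = b + m*x.
ages = [40, 21, 25, 31, 38, 47]      # x
weights = [78, 70, 60, 55, 80, 66]   # y

n = len(ages)
sum_x, sum_y = sum(ages), sum(weights)
sum_xy = sum(x * y for x, y in zip(ages, weights))
sum_x2 = sum(x * x for x in ages)

# m minimizes the sum of squared residuals; b follows from the means.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(round(m, 4), round(b, 4))   # 0.3491 56.4138

# Use the fitted line to predict the weight of a hypothetical 35-year-old.
print(round(b + m * 35, 1))       # 68.6
```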

Example 2
In the same example of six people with different ages and weights, let us draw the line of best fit in Excel.

S.No   Age (x)   Weight (y)
1      40        78
2      21        70
3      25        60
4      31        55
5      38        80
6      47        66
Solution:
Step 1: Select the Age and Weight columns.
Step 2: Insert a scatter chart and make the following changes:
Trendline: Linear; check "Display Equation on Chart".
X-axis minimum: 20
Step 3: Let us verify the values of the slope and intercept using the SLOPE() and INTERCEPT() functions in Excel.
Step 4: Click on any cell and type =SLOPE(. Select the values of Weight, type a comma, select the values of Age, and press Enter.
Step 5: Click on any cell and type =INTERCEPT(. Select the values of Weight, type a comma, select the values of Age, and press Enter.
[Scatter chart of Weight (y) against Age (x) with a linear trendline; the displayed equation is f(x) = 0.349095966620306 x + 56.413769123783]

Some of the regression algorithms include Linear Regression, Logistic Regression, Decision Tree Regression, and Random Forest Regression. Let us learn about Linear Regression.

Linear Regression:

Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. If the data involves more than one independent variable, the model is called a multiple linear regression model. Linear regression is further divided into two types:
a) Simple Linear Regression: The dependent variable's value is predicted using a single
independent variable in simple linear regression.
b) Multiple Linear Regression: In multiple linear regression, more than one independent
variable is used to predict the value of the dependent variable.
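As an illustration of the two types, here is a short scikit-learn sketch; the second feature in the multiple-regression part (a hypothetical "hours of exercise per week" column) is invented for demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression: one independent variable (age -> weight).
X_simple = np.array([[40], [21], [25], [31], [38], [47]])  # age
y = np.array([78, 70, 60, 55, 80, 66])                     # weight
simple = LinearRegression().fit(X_simple, y)
print(simple.coef_, simple.intercept_)   # slope ~0.3491, intercept ~56.41

# Multiple linear regression: two independent variables predict weight.
X_multi = np.array([[40, 2], [21, 5], [25, 4], [31, 6], [38, 1], [47, 3]])
multi = LinearRegression().fit(X_multi, y)
print(multi.predict([[30, 3]]))          # prediction for a new person
```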

Applications of Linear Regression:


● Market Analysis: Linear regression helps understand how different factors like pricing, sales quantity, advertising, and social media engagement relate to each other in the market.
● Sales Forecasting: It predicts future sales by analyzing past sales data along with factors like marketing spending, seasonal trends, and consumer behavior.
● Predicting Salary Based on Experience: Linear regression estimates a person's salary based on their years of experience, education, and job role, aiding in recruitment and compensation planning.
● Sports Analysis: Linear regression analyzes player and team performance by considering statistics, game conditions, and opponent strength, assisting coaches and team management in decision-making.
● Medical Research: Linear regression examines relationships between factors like age, weight, and health outcomes, helping researchers identify risk factors and evaluate interventions.
Advantages of Linear regression
● Simple technique and easy to implement
● Computationally efficient to train

Disadvantages of Linear regression


1. Sensitivity to outliers, which can significantly impact the analysis.
2. Limited to linear relationships between variables.

2. CLASSIFICATION

Classification is a fundamental concept in artificial intelligence and machine learning that involves categorizing data into predefined classes or categories. The main objective of classification is to assign labels to data instances based on their features or attributes. In classification, the data is typically labeled with class labels or categories, and the goal is to build a model that can accurately assign these labels to new, unseen data instances. This process is a form of supervised learning, where the model learns from labeled training data to make predictions on unseen data.

For example, let us say you live in a gated housing society and your society has separate dustbins for different types of waste: paper waste, plastic waste, food waste, and so on. What you are basically doing here is classifying the waste into different categories and then labeling each category, assigning labels such as ‘paper’, ‘metal’, and ‘plastic’ to the different types of waste.

How Classification Works


In classification tasks within machine learning, the process revolves around categorizing data into
distinct groups or classes based on their features. Here is how it typically works:
● Classes or Categories: Data is divided into different classes or categories, each representing a
specific outcome or group. For example, in a binary classification scenario, there are two
classes: positive and negative.
● Features or Attributes: Each data instance is described by its features or attributes, which
provide information about the instance. These features are crucial for the classification model
to differentiate between different classes. For instance, in email classification, features might
include words in the email text, sender information, and email subject.

● Training Data: The classification model is trained using a dataset known as training data. This
dataset consists of labelled examples, where each data instance is associated with a class
label. The model learns from this data to understand the relationship between the features
and the corresponding class labels.

● Classification Model: An algorithm or technique is used to build the classification model. This
model learns from the training data to predict the class labels of new, unseen data instances.
It aims to generalize from the patterns and relationships in the training data to make accurate
predictions.

● Prediction or Inference: Once trained, the classification model is used to predict the class
labels of new data instances. This process, known as prediction or inference, relies on the
learned patterns and relationships from the training data.
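Putting these pieces together, below is a minimal scikit-learn sketch of the workflow; the tiny "email" dataset (counts of suspicious words and links) and the choice of a decision tree are illustrative assumptions, not part of the original text.

```python
from sklearn.tree import DecisionTreeClassifier

# Training data: labelled examples. Each row is one email, described by
# two features: (number of suspicious words, number of links).
X_train = [[8, 5], [7, 3], [1, 0], [0, 1], [6, 4], [2, 0]]
y_train = [1, 1, 0, 0, 1, 0]   # class labels: 1 = spam, 0 = not spam

# Classification model: learns the relationship between features and labels.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Prediction / inference on new, unseen data instances.
print(model.predict([[5, 4], [0, 0]]))  # expected: [1 0]
```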

Types of classification
The four main types of classification are:
1) Binary Classification
2) Multi-Class Classification
3) Multi-Label Classification
4) Imbalanced Classification

Binary Classification: classification tasks with two class labels.
Examples: email spam detection (spam or not), conversion prediction (buy or not), medical test (cancer detected or not), exam results (pass/fail).

Multi-Class Classification: classification tasks with more than two class labels.
Examples: face classification, plant species classification, optical character recognition, image classification into thousands of classes.

Multi-Label Classification: classification tasks where each example may belong to multiple class labels.
Example: photo classification, where the labels are the objects present in the photo (bicycle, apple, person, etc.).

Imbalanced Classification: classification tasks with unequally distributed class labels, typically with a majority and a minority class.
Examples: fraud detection, outlier detection, medical diagnostic tests.

K-Nearest Neighbour Algorithm (KNN)


The K-Nearest Neighbors algorithm, commonly known as KNN or k-NN, is a versatile non-
parametric supervised learning technique used for both classification and regression tasks. It
operates based on the principle of proximity, making predictions or classifications by considering
the similarity between data points.
Why KNN Algorithm is Needed:

KNN is particularly useful when dealing with classification problems where the decision
boundaries are not clearly defined or when the dataset does not have a well-defined structure. It
provides a simple yet effective method for identifying the category or class of a new data point
based on its similarity to existing data points.

Applications of KNN:
● Image recognition and classification
● Recommendation systems
● Healthcare diagnostics
● Text mining and sentiment analysis
● Anomaly detection

Advantages of KNN:

● Easy to implement and understand.
● No explicit training phase; the model learns directly from the training data.
● Suitable for both classification and regression tasks.
● Reasonably robust to noisy data when k is large enough, since each prediction is based on a vote or average over several neighbours.
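To see KNN in action, here is a minimal scikit-learn sketch; the two-feature toy data and the choice k = 3 are illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [2, 1], [1, 2],    # three points of class 0
           [8, 8], [9, 8], [8, 9]]    # three points of class 1
y_train = [0, 0, 0, 1, 1, 1]

# k = 3: a new point is labelled by a majority vote of its 3 nearest
# neighbours (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[2, 2], [7, 9]]))  # [0 1]
```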

Unsupervised Learning
Unsupervised learning, on the other hand, deals with unlabelled data, where the algorithm tries to find hidden patterns or structure without explicit guidance. The goal is to explore and discover inherent structures or relationships within the data, such as clusters or associations.
Examples include k-means clustering, hierarchical clustering, principal component analysis, and
autoencoders.

Clustering
Clustering, or cluster analysis, is a machine learning technique used to group an unlabeled dataset into clusters based on similarity. Clustering aims to organize data points into groups where points within the same group are more similar to each other than to those in other groups.
It is an unsupervised learning method: no supervision is provided to the algorithm, and it deals with an unlabeled dataset. The clustering technique is commonly used for statistical data analysis.

How Clustering Works:


To cluster data effectively, follow these key steps:
1) Prepare the Data: Select the right features for clustering and make sure the data is ready by
scaling or transforming it as needed.
2) Create Similarity Metrics: Define how similar data points are by comparing their features. This
similarity measure is crucial for clustering.
3) Run the Clustering Algorithm: Apply a clustering algorithm to group the data. Choose one that
works well with your dataset size and characteristics.
4) Interpret the Results: Analyze the clusters to understand what they represent. Since clustering is unsupervised, interpretation is essential for assessing the quality of the clusters.
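As a small illustration of steps 1 and 2, the NumPy sketch below standardizes two invented numeric features and then measures similarity with Euclidean distance, one common choice among many.

```python
import numpy as np

# Invented data: each row is a customer (annual income in $1000s, age).
X = np.array([[35.0, 22], [120.0, 45], [40.0, 25], [110.0, 50]])

# Step 1: prepare the data. Standardize each feature to mean 0 and
# standard deviation 1 so that no feature dominates the distance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: define a similarity measure. Here, a smaller Euclidean distance
# between two (scaled) points means the points are more similar.
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))  # small: similar customers
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # large: dissimilar
```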

Types of Clustering Methods


Some of the common clustering methods used in Machine learning are:
1) Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-Means clustering algorithm. In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. Cluster centers are chosen so that each data point is closer to the centroid of its own cluster than to the centroids of other clusters.

[Figure: illustrations of Partitioning Clustering, Density-Based Clustering, Distribution Model-Based Clustering, and Hierarchical Clustering]

2) Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so arbitrarily shaped clusters can form as long as the dense regions can be connected. The algorithm works by identifying dense regions in the data space and connecting them into clusters; the dense areas are separated from one another by sparser areas. These algorithms can have difficulty clustering data points if the dataset has varying densities and high dimensionality.
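The behaviour described above is exactly that of DBSCAN, the best-known density-based algorithm. Here is a minimal scikit-learn sketch; the data and the eps/min_samples parameters are illustrative choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],   # dense region A
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],   # dense region B
              [4.0, 15.0]])                         # isolated point

# eps is the neighbourhood radius; min_samples is the density threshold.
db = DBSCAN(eps=0.8, min_samples=2).fit(X)
print(db.labels_)  # [0 0 0 1 1 1 -1]; label -1 marks a noise point
```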

3) Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability that each data point belongs to a particular distribution. The grouping is done by assuming the data follows some distribution, most commonly the Gaussian distribution. An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMMs).
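A minimal sketch of this idea using scikit-learn's GaussianMixture (which is fitted by Expectation-Maximization) on invented data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two Gaussian blobs, centred near (0, 0) and (5, 5).
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X[:3]))        # hard cluster assignments
print(gmm.predict_proba(X[:3]))  # soft probabilities of membership
```

Unlike K-Means, the fitted model also reports how confidently each point belongs to each cluster.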

4) Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is organized into a tree-like structure of clusters called a dendrogram. Any number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the agglomerative hierarchical algorithm.
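Here is a minimal SciPy sketch of agglomerative hierarchical clustering on toy points; building the full tree first and cutting it afterwards mirrors the description above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5], [9, 9]])  # toy points

# linkage() builds the complete merge tree (the dendrogram).
Z = linkage(X, method='ward')

# "Cut the tree" afterwards to obtain any desired number of clusters.
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)  # e.g. [1 1 2 2 3]: the three natural groups
```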

K-Means Clustering
K-Means clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science. It is one of the most popular clustering algorithms. It partitions the samples into clusters of roughly equal variance by minimizing the distance between each data point and the centroid of its cluster. The number of clusters must be specified in advance.

Steps involved in K-Means Clustering:

The working of the K-Means algorithm is explained in the steps below; a from-scratch sketch of these steps follows the list.

● Step 1: Select the number K to decide the number of clusters.
● Step 2: Select K random points as the initial centroids (they need not come from the input dataset).
● Step 3: Assign each data point to its closest centroid, forming the K clusters.
● Step 4: Compute a new centroid for each cluster (the mean of the points assigned to it).
● Step 5: Repeat step 3, i.e., reassign each data point to its new closest centroid.
● Step 6: If any reassignment occurred, go back to step 4; otherwise, stop.
● The model is ready.
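Here is a from-scratch NumPy sketch of the steps above; the data, the value of K, and the random seed are illustrative, and for simplicity it does not handle the edge case of an empty cluster.

```python
import numpy as np

rng = np.random.default_rng(42)
# Invented 2-D data: two loose groups, around (0, 0) and (6, 6).
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

K = 2                                                # step 1: choose K
centroids = X[rng.choice(len(X), K, replace=False)]  # step 2: random centroids

for _ in range(100):
    # Step 3 / step 5: assign each point to its closest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned points.
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 6: stop once the assignments (and hence centroids) are stable.
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # should land near (0, 0) and (6, 6)
```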

Applications of K-Means Clustering:


● Market Segmentation: group customers based on similar purchasing behaviours or
demographics for tailored marketing strategies.
● Image Segmentation: partition images into regions of similar colours to aid in tasks like object
detection and compression.
● Document Clustering: categorize documents based on content similarity, aiding in
organization and information retrieval.
● Anomaly Detection: identify outliers by clustering normal data points and detecting
deviations.
● Customer Segmentation: segment customers for targeted marketing and personalized
experiences.

Advantages of K-Means Clustering:


● Easy to implement, making it suitable for users of all levels.
● Handles large datasets with low computational resources.
● Works well with numerous features and data points.
● Results are easy to interpret, aiding decision-making.
● Applicable across various domains and data types.

Limitations of K-Means Clustering:


● Results can vary based on initial centroid placement.
● Assumes clusters are spherical, which is not always true.
● Number of clusters must be known beforehand.
● Outliers can distort clusters due to their influence on centroids.
● May converge to suboptimal solutions instead of the global optimum.
