Unit 1
PART A
Machine Learning, as the name suggests, is about machines learning automatically from data
without being explicitly programmed and without direct human intervention. Formally, a Machine
Learning algorithm is said to learn from experience E with respect to some class of tasks T and
performance measure P if its performance at tasks in T, as measured by P, improves with
experience E.
Entropy:
Entropy is a measurement of the disorder or impurity of a set of occurrences. It determines the
usual amount of information needed to classify a sample taken from the collection.
Entropy is calculated for a set of examples by calculating the probability of each class in the set
and using that information in the entropy calculation.
Information gain:
Information gain is a metric for the entropy reduction brought about by segmenting a set of
instances according to a feature. It gauges the amount of knowledge a characteristic imparts to
the class of an example.
By dividing the collection of instances depending on the feature and calculating the entropies of
the resulting subsets, information gain is determined for each feature. The difference between the
entropy of the original set and the weighted sum of the entropies of the subsets is thus the
information gain.
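The two calculations described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the toy data are assumptions made for the example, and entropy is measured in bits (log base 2).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(labels, feature_values):
    """Entropy of the full set minus the weighted entropy of the subsets
    obtained by splitting the set on one feature."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: four examples and one binary feature.
labels = ["yes", "yes", "no", "no"]
feature = ["a", "a", "b", "b"]
print(entropy(labels))                    # 1.0
print(information_gain(labels, feature))  # 1.0
```

Here the feature separates the two classes perfectly, so splitting on it removes all of the entropy of the original set.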
3. Define Accuracy.
Accuracy is used to measure the performance of the model. It is the ratio of correctly classified
instances to the total number of instances.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
For the above case:
Accuracy = (5+3)/(5+3+1+1) = 8/10 = 0.8
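The same arithmetic can be checked with a short Python sketch. The function name is an assumption; the counts are read off the worked calculation above in the order TP, TN, FP, FN.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts from the worked example: TP=5, TN=3, FP=1, FN=1.
print(accuracy(tp=5, tn=3, fp=1, fn=1))  # 0.8
```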
Parameter | Supervised Learning | Unsupervised Learning
Computational complexity | Less computationally complex | More computationally complex
Training data | Training data is used to infer the model | Training data is not used
Test of model | We can test our model | We cannot test our model
Example | Optical Character Recognition | Find a face in an image
PART B
6. Explain the types of machine learning algorithms with testing parameters.
Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn from
data, improve performance from past experiences, and make predictions. Machine learning
contains a set of algorithms that work on a huge amount of data. Data is fed to these algorithms to
train them, and on the basis of training, they build the model & perform a specific task.
These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.
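Based on the methods and way of learning, machine learning is divided into mainly four types,
which are:
1. Supervised Learning
2. Unsupervised Learning
3. Semi-Supervised Learning
4. Reinforcement Learning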
1. Supervised Learning
Let's understand supervised learning with an example. Suppose we have an input dataset of cat
and dog images. First, we train the machine to understand the images using features such as the
shape and size of the tail, the shape of the eyes, colour, and height (dogs are taller, cats are
smaller). After training is complete, we input the picture of a cat and ask the machine to identify
the object and predict the output. Because the machine is now trained, it checks the features of
the object, such as height, shape, colour, eyes, ears, and tail, and finds that it is a cat, so it puts it
in the Cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x) to the
output variable (y). Some real-world applications of supervised learning are risk assessment,
fraud detection, and spam filtering.
o Classification
o Regression
a) Classification
Classification algorithms are used to solve classification problems in which the output variable
is categorical, such as Yes or No, Male or Female, Red or Blue, etc. The classification
algorithms predict the categories present in the dataset. Some real-world applications of
classification are spam detection and email filtering.
Some popular classification algorithms are given below:
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is
continuous and depends on the input variables. They are used to predict continuous output
variables, such as market trends, weather prediction, etc.
Advantages:
o Since supervised learning works with a labelled dataset, we can have an exact idea about
the classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
Applications of Supervised Learning:
o Image Segmentation: Supervised learning algorithms are used in image segmentation. In this
process, image classification is performed on different image data with pre-defined labels.
o Medical Diagnosis: Supervised algorithms are also used in the medical field for diagnosis
purposes. This is done by using medical images and past data labelled with disease
conditions. With such a process, the machine can identify a disease for new patients.
o Fraud Detection: Supervised learning classification algorithms are used for identifying
fraudulent transactions, fraudulent customers, etc. This is done by using historic data to
identify the patterns that can lead to possible fraud.
o Spam Detection: In spam detection and filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
o Speech Recognition: Supervised learning algorithms are also used in speech recognition.
The algorithm is trained with voice data, and various identifications can be done using it,
such as voice-activated passwords, voice commands, etc.
2. Unsupervised Learning
In unsupervised learning, the models are trained with data that is neither classified nor labelled,
and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorise the unsorted
dataset according to similarities, patterns, and differences. Machines are instructed to find
the hidden patterns in the input dataset.
Let's take an example to understand it more precisely. Suppose there is a basket of fruit images,
and we input it into the machine learning model. The images are totally unknown to the model, and
the task of the machine is to find the patterns and categories of the objects.
The machine will discover patterns and differences on its own, such as differences in colour and
shape, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:
o Clustering
o Association
1) Clustering
The clustering technique is used when we want to find the inherent groups in the data. It is a
way to group objects into clusters such that the objects with the most similarities remain in one
group and have few or no similarities with the objects of other groups. An example of
clustering is grouping customers by their purchasing behaviour.
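As a rough illustration of clustering, the sketch below groups customers by a single spending figure using a plain k-means loop. K-means is chosen here only as one common clustering technique; the function name, the starting centroids, and the spending values are all hypothetical.

```python
def k_means_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = k_means_1d(spend, centroids=[0, 50])
print(centroids)  # [11.0, 100.0]
print(clusters)   # [[10, 12, 11], [95, 100, 105]]
```

No labels are supplied anywhere: the low spenders and high spenders end up in separate groups purely because of their similarity to each other.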
Some of the popular clustering algorithms are given below:
2) Association
Association rule learning is an unsupervised learning technique which finds interesting relations
among variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that the
relationship can be used, for example, to maximise profit. This algorithm is mainly applied in
Market Basket analysis, Web usage mining, continuous production, etc.
Some popular algorithms of Association rule learning are the Apriori algorithm, Eclat, and the
FP-growth algorithm.
Advantages:
o These algorithms can be used for more complicated tasks than the supervised ones
because they work on an unlabelled dataset.
o Unsupervised algorithms are preferable for various tasks, as getting an unlabelled dataset is
easier than getting a labelled dataset.
Disadvantages:
o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled
and the algorithms are not trained with the exact output in advance.
o Working with unsupervised learning is more difficult, as it works with an unlabelled
dataset that is not mapped to an output.
3. Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that lies between supervised and
unsupervised machine learning. It represents the intermediate ground between supervised learning
(with labelled training data) and unsupervised learning (with no labelled training data), and uses
a combination of labelled and unlabelled datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised
learning and operates on data that contains a few labels, the data is mostly unlabelled. Labels are
costly to obtain, so in practice only a small labelled portion may be available. It differs from
supervised and unsupervised learning, which are defined by the presence and absence of labels
respectively.
We can picture these approaches with an example. Supervised learning is like a student who is
under the supervision of an instructor at home and at college. If that student analyses the same
concept alone, without any help from the instructor, it is unsupervised learning. Under
semi-supervised learning, the student revises the concept alone after first studying it under the
guidance of an instructor at college.
Disadvantages:
4. Reinforcement Learning
In reinforcement learning, there is no labelled data as in supervised learning, and agents learn
from their experiences only.
The reinforcement learning process is similar to how a human being learns; for example, a child
learns various things through experience in day-to-day life. An example of reinforcement learning
is playing a game, where the game is the environment, the moves of the agent at each step define
the states, and the goal of the agent is to get a high score. The agent receives feedback in terms
of rewards and punishments.
Due to its way of working, reinforcement learning is employed in different fields such as game
theory, operations research, information theory, and multi-agent systems.
Advantages:
o It helps in solving complex real-world problems which are difficult to solve with general
techniques.
o The learning model of RL is similar to the learning of human beings, so it can reach good
results on tasks that are learned through trial and error.
o It helps in achieving long-term results.
Disadvantages:
o The curse of dimensionality limits reinforcement learning for real physical systems.
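The agent, environment, state, and reward loop described for reinforcement learning can be sketched with tabular Q-learning, which is used here only as one well-known RL algorithm and is not named in the notes. The corridor environment, the constants, and all function names are assumptions made for illustration.

```python
import random

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)    # move left or move right
ALPHA, GAMMA = 0.5, 0.9


def step(state, action):
    """Environment: returns (next_state, reward). Reward is 1 only at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = rng.randrange(2)  # explore with random moves
            next_state, reward = step(state, ACTIONS[a])
            # Q-learning update: move the estimate toward
            # reward + discounted best value of the next state.
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q


q = train()
policy = ["left" if row[0] > row[1] else "right" for row in q[:-1]]
print(policy)
```

The agent is never told which move is correct. It only receives a reward on reaching the goal, and the learned values should end up favouring the move towards the goal in every cell.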
7. Consider the training dataset given in the following table. Use weighted K-NN to
determine the class of the test instance (7.6, 60, 8) with K = 3.
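The procedure can be sketched as follows. The training rows and the Pass/Fail labels below are hypothetical stand-ins, since the table from the question is not reproduced here, and inverse-distance weighting is assumed as the weighting scheme (other schemes exist). Only the test instance (7.6, 60, 8) and K = 3 come from the question.

```python
from collections import defaultdict
from math import dist


def weighted_knn(train, query, k=3):
    """Classify `query` by an inverse-distance-weighted vote of its k nearest neighbours.

    `train` is a list of (feature_tuple, label) pairs.
    """
    # Sort training points by Euclidean distance to the query and keep the k closest.
    neighbours = sorted((dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for d, label in neighbours:
        votes[label] += 1.0 / (d + 1e-9)  # closer neighbours get larger weights
    return max(votes, key=votes.get)


# Hypothetical training rows (NOT the table from the question).
train = [
    ((9.2, 85, 8), "Pass"),
    ((8.0, 80, 7), "Pass"),
    ((8.5, 81, 8), "Pass"),
    ((6.0, 45, 5), "Fail"),
    ((6.5, 50, 4), "Fail"),
    ((5.8, 38, 5), "Fail"),
]
print(weighted_knn(train, (7.6, 60, 8), k=3))
```

With these made-up rows, two of the three nearest neighbours are labelled Fail and are also the closest, so the weighted vote goes to Fail. The answer for the real table depends on its actual rows.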
PART C (15*1=15)
8. Solve this problem using simple linear regression and identify the independent and dependent
variables. Explain the different types of regression.
Xi Yi
1 1.2
2 1.8
3 2.6
4 3.2
5 3.8
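One way to work the problem is the least-squares method, sketched below in Python. Here Xi is treated as the independent variable and Yi as the dependent variable, and the line is written as Y = aX + b; the function name is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b in Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    b = mean_y - a * mean_x
    return a, b


xs = [1, 2, 3, 4, 5]            # independent variable Xi
ys = [1.2, 1.8, 2.6, 3.2, 3.8]  # dependent variable Yi
a, b = fit_line(xs, ys)
print(round(a, 2), round(b, 2))  # 0.66 0.54
```

By hand: the mean of Xi is 3 and the mean of Yi is 2.52, the sum of (Xi - 3)(Yi - 2.52) is 6.6, and the sum of (Xi - 3)^2 is 10, so a = 6.6 / 10 = 0.66 and b = 2.52 - 0.66 * 3 = 0.54. The fitted line is Y = 0.66X + 0.54.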
Types of Regression
There are various types of regression used in data science and machine learning. Each type has
its own importance in different scenarios, but at the core, all regression methods analyse the
effect of the independent variables on the dependent variable. Some important types of
regression are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o For example, a linear regression model can predict the salary of an employee on the basis of
years of experience.
o The mathematical equation for linear regression is:
Y = aX + b
Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients (a is the slope and b is the intercept).
Some popular applications of linear regression are:
Logistic Regression:
o When we provide the input values (data) to the sigmoid (logistic) function, it gives an
S-shaped curve.
o It uses the concept of threshold levels: values above the threshold level are rounded up to 1,
and values below the threshold level are rounded down to 0.
o There are three types of logistic regression:
o Binary (0/1, pass/fail)
o Multi (cats, dogs, lions)
o Ordinal (low, medium, high)
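The S-curve and the threshold rule can be sketched in Python as below. The function names and the default threshold of 0.5 are assumptions made for the example.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_class(z, threshold=0.5):
    """Return 1 if the sigmoid output is at or above the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


print(sigmoid(0))            # 0.5
print(predict_class(2.0))    # 1
print(predict_class(-2.0))   # 0
```

Large positive inputs give outputs close to 1 and large negative inputs give outputs close to 0, which is what produces the S shape.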
Polynomial Regression:
o Polynomial Regression is a type of regression which models a non-linear dataset using a
linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of
x and the corresponding conditional values of y.
o Suppose there is a dataset whose datapoints are distributed in a non-linear fashion. In such a
case, linear regression will not fit those datapoints well, so we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of a given degree and then modelled using a linear model, which means the
datapoints are best fitted using a polynomial curve.
o The equation for polynomial regression is derived from the linear regression equation: the
Linear regression equation Y = b0 + b1x is transformed into the Polynomial regression
equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.
o The model is still linear because it is linear in the coefficients, even though the features
include quadratic and higher-degree terms.
Note: This is different from Multiple Linear regression in that, in Polynomial regression, a single
variable appears with different degrees instead of multiple variables with the same degree.
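The feature transformation can be sketched as follows: a single input x is expanded into its powers, and the prediction is an ordinary weighted sum of those powers. The function names and the coefficient values are hypothetical.

```python
def polynomial_features(x, degree):
    """Expand a single input x into [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]


def predict(x, coefficients):
    """Y = b0 + b1*x + b2*x^2 + ...: a dot product of coefficients and features."""
    features = polynomial_features(x, degree=len(coefficients) - 1)
    return sum(b * f for b, f in zip(coefficients, features))


# Hypothetical coefficients b0, b1, b2 for Y = 1 + 2x + 3x^2.
coefficients = [1.0, 2.0, 3.0]
print(polynomial_features(2, 3))  # [1, 2, 4, 8]
print(predict(2, coefficients))   # 17.0
```

Because the prediction is a plain weighted sum once the powers of x are computed, the coefficients can be estimated with the same machinery as multiple linear regression.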
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression as
well as classification problems. When it is used for regression problems, it is termed Support
Vector Regression (SVR).
Support Vector Regression is a regression algorithm which works for continuous variables. Below
are some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map lower-dimensional data into higher-dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is
a line which helps to predict the continuous variable and covers most of the datapoints.
o Boundary lines: Boundary lines are the two lines on either side of the hyperplane, which
create a margin for datapoints.
o Support vectors: Support vectors are the datapoints which lie on or outside the boundary
lines and determine the position of the hyperplane.
In SVR, we try to determine a hyperplane with a margin such that the maximum number of
datapoints are covered in that margin. The main goal of SVR is to fit as many datapoints as
possible within the boundary lines, with the hyperplane (best-fit line) passing between them.
Decision Tree Regression:
o As an example of Decision Tree regression, a model can try to predict the choice of a person
between a sports car and a luxury car.
Random Forest Regression:
o Random forest is one of the most powerful supervised learning algorithms, capable of
performing regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output as the average of the outputs of the individual
trees. The combined decision trees are called base models, and the prediction can be
represented more formally as:
g(x) = (f1(x) + f2(x) + ... + fN(x)) / N
where f1, ..., fN are the base models (decision trees).
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression, in which a small
amount of bias is introduced so that we can get better long-term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We
compute this penalty term by multiplying lambda by the sum of the squared weights of the
individual features.
o The equation for ridge regression will be:
Cost = Σ (yi - ŷi)^2 + λ Σ wj^2
where yi is the actual value, ŷi is the predicted value, wj is the weight of feature j, and λ is
the penalty coefficient.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of
the model. It is also called L2 regularization.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to Ridge Regression except that the penalty term contains the absolute values
of the weights instead of the squares of the weights.
o Since it takes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression
can only shrink it close to 0.
o It is also called L1 regularization. The equation for Lasso regression will be:
Cost = Σ (yi - ŷi)^2 + λ Σ |wj|
where yi is the actual value, ŷi is the predicted value, wj is the weight of feature j, and λ is
the penalty coefficient.
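The difference between the two penalty terms can be made concrete with a small Python sketch. It assumes the usual formulation in which the penalty is added to the sum of squared errors; the function names and the numbers are hypothetical, and the weights are taken to exclude the intercept.

```python
def squared_error(y_true, y_pred):
    """Sum of squared residuals, the part shared by both cost functions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def ridge_cost(y_true, y_pred, weights, lam):
    """Squared error plus the L2 penalty: lam * sum of squared weights."""
    return squared_error(y_true, y_pred) + lam * sum(w ** 2 for w in weights)


def lasso_cost(y_true, y_pred, weights, lam):
    """Squared error plus the L1 penalty: lam * sum of absolute weights."""
    return squared_error(y_true, y_pred) + lam * sum(abs(w) for w in weights)


# Hypothetical values chosen only to make the arithmetic easy to follow.
y_true = [3.0, 5.0]
y_pred = [2.0, 5.0]
weights = [2.0, -1.0]
print(ridge_cost(y_true, y_pred, weights, lam=0.5))  # 1.0 + 0.5 * 5.0 = 3.5
print(lasso_cost(y_true, y_pred, weights, lam=0.5))  # 1.0 + 0.5 * 3.0 = 2.5
```

The squared penalty punishes large weights much more heavily than small ones, while the absolute-value penalty applies the same pressure at every size, which is why Lasso can push a weight all the way to zero.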