
UNIT - I INTRODUCTION

PART A

1. Define machine learning.

Machine Learning, as the name suggests, is all about machines learning automatically without
being explicitly programmed, i.e., learning without direct human intervention. As a formal
definition, a Machine Learning algorithm is said to learn from experience E with respect to some
class of tasks T and performance measure P if its performance at tasks in T, as measured by P,
improves with experience E.

2. Difference between Entropy and Information Gain.

Entropy:
o Entropy is a measure of the disorder or impurity of a set of examples. It quantifies the average
amount of information needed to classify a sample drawn from the collection.
o Entropy is calculated for a set of examples by computing the probability of each class in the set
and using those probabilities in the entropy formula.

Information Gain:
o Information gain is the reduction in entropy brought about by splitting a set of instances
according to a feature. It gauges how much information a feature provides about the class of an
example.
o Information gain is determined for each feature by splitting the collection of instances on that
feature and calculating the entropies of the resulting subsets. The information gain is the
difference between the entropy of the original set and the weighted sum of the entropies of the
subsets.
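For illustration, here is a minimal Python sketch of both quantities; the small binary dataset and the feature split are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction from splitting the labels by a feature's values."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical example: 10 samples, a binary class and a binary feature
classes = ["yes"] * 6 + ["no"] * 4
feature = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
print(entropy(classes))                    # entropy of the whole set
print(information_gain(classes, feature))  # gain from splitting on the feature
```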

3. Define Accuracy.
Accuracy is used to measure the performance of the model. It is the ratio of correctly classified
instances to the total number of instances.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For example, with TP = 5, TN = 3, FP = 1 and FN = 1:
Accuracy = (5 + 3) / (5 + 3 + 1 + 1) = 8/10 = 0.8
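The same calculation as a minimal Python sketch, using the counts quoted above:

```python
def accuracy(tp, tn, fp, fn):
    """Ratio of correctly classified instances to all instances."""
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=5, tn=3, fp=1, fn=1))  # 0.8
```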

4. What is Bias and variance in machine learning?


The variance specifies how much the prediction would change if different training data were
used. In simple words, variance tells how much a random variable differs from its expected value.
Ideally, a model should not vary too much from one training dataset to another, which means the
algorithm should be good at capturing the hidden mapping between the input and output variables.
Variance errors are either low variance or high variance.
In general, a machine learning model analyses the data, finds patterns in it, and makes
predictions. While training, the model learns these patterns in the dataset and applies them to the
test data for prediction. While making predictions, a difference occurs between the values
predicted by the model and the actual/expected values; this difference is known as the bias error,
or error due to bias. It can be defined as the inability of machine learning algorithms such as
Linear Regression to capture the true relationship between the data points. Each algorithm begins
with some amount of bias, because bias arises from the assumptions in the model that make the
target function simpler to learn. A model has either low bias or high bias.
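An illustrative sketch only (synthetic data, numpy assumed): a very simple model is systematically off at a test point (high bias), while a very flexible one changes a lot from one training sample to another (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)

def fit_and_predict(degree, x_test, n_runs=200, n_train=20):
    """Fit a polynomial of the given degree on many random training sets
    and return the predictions at x_test for each run."""
    preds = []
    for _ in range(n_runs):
        x = rng.uniform(0, np.pi, n_train)
        y = true_fn(x) + rng.normal(0, 0.2, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    return np.array(preds)

x0 = 0.1
for degree in (1, 9):
    p = fit_and_predict(degree, x0)
    bias = p.mean() - true_fn(x0)   # systematic error of the average model
    variance = p.var()              # spread of predictions across training sets
    print(f"degree={degree}: bias^2={bias**2:.4f}, variance={variance:.4f}")
```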

5. What are differences between supervised and unsupervised machine learning?


Parameter | Supervised Learning | Unsupervised Learning
Input Data | Uses known and labelled data as input | Uses unlabelled (unknown) data as input
Computational Complexity | Less computational complexity | More computationally complex
Real-Time Analysis | Uses off-line analysis | Uses real-time analysis of data
Number of Classes | The number of classes is known | The number of classes is not known
Accuracy of Results | Accurate and reliable results | Moderately accurate and reliable results
Output Data | The desired output is given | The desired output is not given
Model | It is not possible to learn larger and more complex models than in unsupervised learning | It is possible to learn larger and more complex models than in supervised learning
Training Data | Training data is used to infer the model | Training data is not used
Another Name | Supervised learning is also called classification | Unsupervised learning is also called clustering
Test of Model | We can test our model | We cannot test our model
Example | Optical Character Recognition | Finding a face in an image

PART B
6. Explain the types of machine learning algorithm with testing parameters.
Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn from
data, improve performance from past experiences, and make predictions. Machine learning
contains a set of algorithms that work on a huge amount of data. Data is fed to these algorithms to
train them, and on the basis of training, they build the model & perform a specific task.

These ML algorithms help to solve different business problems like Regression, Classification,
Forecasting, Clustering, and Associations, etc.

Based on the methods and way of learning, machine learning is divided into mainly four types,
which are:

1. Supervised Machine Learning


2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
In this topic, we will provide a detailed description of the types of Machine Learning along with
their respective algorithms:

1. Supervised Machine Learning


As its name suggests, supervised machine learning is based on supervision. In the supervised
learning technique, we train the machines using a "labelled" dataset, and based on this training,
the machine predicts the output. Here, labelled data means that some of the inputs are already
mapped to the output. More precisely, we first train the machine with the input and corresponding
output, and then we ask the machine to predict the output for the test dataset.

Let's understand supervised learning with an example. Suppose we have an input dataset of cats
and dog images. So, first, we will provide the training to the machine to understand the images,
such as the shape & size of the tail of cat and dog, Shape of eyes, colour, height (dogs are
taller, cats are smaller), etc. After completion of training, we input the picture of a cat and ask the
machine to identify the object and predict the output. Now, the machine is well trained, so it will
check all the features of the object, such as height, shape, colour, eyes, ears, tail, etc., and find that
it's a cat. So, it will put it in the Cat category. This is the process of how the machine identifies the
objects in Supervised Learning.

The main goal of the supervised learning technique is to map the input variable(x) with the
output variable(y). Some real-world applications of supervised learning are Risk Assessment,
Fraud Detection, Spam filtering, etc.

Categories of Supervised Machine Learning


Supervised machine learning can be classified into two types of problems, which are given below:

o Classification
o Regression

a) Classification
Classification algorithms are used to solve the classification problems in which the output variable
is categorical, such as "Yes" or No, Male or Female, Red or Blue, etc. The classification
algorithms predict the categories present in the dataset. Some real-world examples of classification
algorithms are Spam Detection, Email filtering, etc.
Some popular classification algorithms are given below:

o Random Forest Algorithm


o Decision Tree Algorithm
o Logistic Regression Algorithm
o Support Vector Machine Algorithm
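As a minimal classification sketch (scikit-learn assumed; the tiny labelled dataset is invented, loosely following the cat/dog example above):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled data: [height_cm, tail_length_cm] -> "cat" or "dog"
X_train = [[25, 30], [28, 33], [60, 25], [55, 28], [24, 29], [62, 27]]
y_train = ["cat", "cat", "dog", "dog", "cat", "dog"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)        # learn the mapping from labelled examples

print(clf.predict([[26, 31]]))   # expected to be classified as "cat"
```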

b) Regression
Regression algorithms are used to solve regression problems in which there is a linear relationship
between input and output variables. These are used to predict continuous output variables, such as
market trends, weather prediction, etc.

Some popular Regression algorithms are given below:

o Simple Linear Regression Algorithm


o Multivariate Regression Algorithm
o Decision Tree Algorithm
o Lasso Regression

Advantages and Disadvantages of Supervised Learning


Advantages:

o Since supervised learning works with a labelled dataset, we can have an exact idea about the
classes of objects.
o These algorithms are helpful in predicting the output on the basis of prior experience.

Disadvantages:

o These algorithms are not able to solve complex tasks.


o It may predict the wrong output if the test data is different from the training data.
o It requires lots of computational time to train the algorithm.

Applications of Supervised Learning


Some common applications of Supervised Learning are given below:

o Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process, image
classification is performed on different image data with pre-defined labels.
o Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis purposes. This is done
by using medical images and past data labelled with disease conditions. With such a process,
the machine can identify a disease for new patients.
o Fraud Detection - Supervised Learning classification algorithms are used for identifying
fraud transactions, fraud customers, etc. It is done by using historic data to identify the
patterns that can lead to possible fraud.
o Spam detection - In spam detection & filtering, classification algorithms are used. These
algorithms classify an email as spam or not spam. The spam emails are sent to the spam
folder.
o Speech Recognition - Supervised learning algorithms are also used in speech recognition.
The algorithm is trained with voice data, and various identifications can be done using the
same, such as voice-activated passwords, voice commands, etc.

2. Unsupervised Machine Learning


Unsupervised learning is different from the Supervised learning technique; as its name suggests,
there is no need for supervision. It means, in unsupervised machine learning, the machine is trained
using the unlabeled dataset, and the machine predicts the output without any supervision.

In unsupervised learning, the models are trained with the data that is neither classified nor labelled,
and the model acts on that data without any supervision.

The main aim of the unsupervised learning algorithm is to group or categorize the unsorted
dataset according to similarities, patterns, and differences. Machines are instructed to find the
hidden patterns in the input dataset.

Let's take an example to understand it more precisely; suppose there is a basket of fruit images,
and we input it into the machine learning model. The images are totally unknown to the model, and
the task of the machine is to find the patterns and categories of the objects.

So, now the machine will discover its patterns and differences, such as colour difference, shape
difference, and predict the output when it is tested with the test dataset.

Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:

o Clustering
o Association

1) Clustering
The clustering technique is used when we want to find the inherent groups from the data. It is a
way to group the objects into a cluster such that the objects with the most similarities remain in one
group and have fewer or no similarities with the objects of other groups. An example of the
clustering algorithm is grouping the customers by their purchasing behaviour.
Some of the popular clustering algorithms are given below:

o K-Means Clustering algorithm


o Mean-shift algorithm
o DBSCAN Algorithm
o Principal Component Analysis
o Independent Component Analysis
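A minimal clustering sketch with K-Means (scikit-learn assumed; the customer data points are invented):

```python
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual_spend, visits_per_month]
X = [[200, 2], [220, 3], [1500, 12], [1600, 14], [210, 2], [1550, 13]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # group customers by purchasing behaviour

print(labels)                    # e.g. [0 0 1 1 0 1]: two spending groups
print(kmeans.cluster_centers_)   # centre of each discovered group
```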

2) Association
Association rule learning is an unsupervised learning technique, which finds interesting relations
among variables within a large dataset. The main aim of this learning algorithm is to find the
dependency of one data item on another data item and map those variables accordingly so that it
can generate maximum profit. This algorithm is mainly applied in Market Basket analysis, Web
usage mining, continuous production, etc.

Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-growth
algorithm.
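To make the idea concrete, here is a small hand-rolled sketch (not a full Apriori implementation) that computes support and confidence for one candidate rule over invented market-basket transactions:

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Candidate rule: {bread, milk} -> {butter}
antecedent, consequent = {"bread", "milk"}, {"butter"}
confidence = support(antecedent | consequent) / support(antecedent)
print(support(antecedent | consequent), confidence)  # support and confidence of the rule
```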

Advantages and Disadvantages of Unsupervised Learning Algorithm


Advantages:

o These algorithms can be used for complicated tasks compared to the supervised ones
because these algorithms work on the unlabeled dataset.
o Unsupervised algorithms are preferable for various tasks as getting the unlabeled dataset is
easier as compared to the labelled dataset.

Disadvantages:

o The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled
and the algorithms are not trained with the exact output in advance.
o Working with Unsupervised learning is more difficult as it works with the unlabelled
dataset that does not map with the output.

Applications of Unsupervised Learning


o Network Analysis: Unsupervised learning is used in document network analysis of text data
for scholarly articles, for example to identify plagiarism and copyright issues.
o Recommendation Systems: Recommendation systems widely use unsupervised learning
techniques for building recommendation applications for different web applications and e-
commerce websites.
o Anomaly Detection: Anomaly detection is a popular application of unsupervised learning,
which can identify unusual data points within the dataset. It is used to discover fraudulent
transactions.
o Singular Value Decomposition: Singular Value Decomposition or SVD is used to extract
particular information from the database. For example, extracting information of each user
located at a particular location.

3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground between
Supervised (With Labelled training data) and Unsupervised learning (with no labelled training
data) algorithms and uses the combination of labelled and unlabeled datasets during the training
period.

Although semi-supervised learning is the middle ground between supervised and unsupervised
learning and operates on data that contains a few labels, most of the data it uses is unlabeled.
Labels are costly to obtain, so in practice an organisation may have only a few of them. This
setting is distinct from both supervised and unsupervised learning, which are defined by the
presence and absence of labels, respectively.

The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised
and unsupervised learning algorithms. The main aim of semi-supervised learning is to make
effective use of all the available data, rather than only the labelled data as in supervised
learning. Initially, similar data is clustered using an unsupervised learning algorithm, and this
clustering then helps to assign labels to the unlabeled data. This is worthwhile because labelled
data is comparatively more expensive to acquire than unlabeled data.

We can imagine these approaches with an example. Supervised learning is where a student is under
the supervision of an instructor at home and at college. If that student analyses the same concept
on their own without any help from the instructor, it comes under unsupervised learning. Under
semi-supervised learning, the student revises the concept by themselves after studying it under
the guidance of an instructor at college.
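A rough sketch of the idea using scikit-learn's self-training wrapper; the 1-D data is invented, and the convention of marking unlabeled samples with -1 follows that estimator:

```python
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 1-D feature; -1 marks the unlabelled samples
X = np.array([[1.0], [1.2], [5.0], [5.3], [1.1], [4.9], [2.0], [4.0]])
y = np.array([0, 0, 1, 1, -1, -1, -1, -1])     # only 4 of 8 points are labelled

model = SelfTrainingClassifier(DecisionTreeClassifier(random_state=0))
model.fit(X, y)                                # pseudo-labels the unlabelled points

print(model.predict([[1.3], [5.1]]))           # expected: [0 1]
```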

Advantages and disadvantages of Semi-supervised Learning


Advantages:

o It is simple and easy to understand the algorithm.


o It is highly efficient.
o It is used to solve drawbacks of Supervised and Unsupervised Learning algorithms.

Disadvantages:

o The results of the iterations may not be stable.


o We cannot apply these algorithms to network-level data.
o Accuracy is low.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process, in which an AI agent (a software
component) automatically explores its surroundings by trial and error: taking actions, learning
from experience, and improving its performance. The agent gets rewarded for each good action and
punished for each bad action; hence the goal of a reinforcement learning agent is to maximize the
rewards.

In reinforcement learning, there is no labelled data like supervised learning, and agents learn from
their experiences only.

The reinforcement learning process is similar to how a human being learns; for example, a child
learns various things through experience in day-to-day life. An example of reinforcement learning
is playing a game, where the game is the environment, the moves of the agent at each step define
the states, and the goal of the agent is to get a high score. The agent receives feedback in terms
of punishments and rewards.

Due to its way of working, reinforcement learning is employed in different fields such as game
theory, operations research, information theory, and multi-agent systems.

A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an
MDP, the agent constantly interacts with the environment and performs actions; at each action, the
environment responds and generates a new state.
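A toy tabular Q-learning sketch of this feedback loop; the 5-state chain environment, reward values, and learning parameters are invented for illustration:

```python
import random

n_states, n_actions = 5, 2        # chain of 5 states; actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, action):
    """Toy environment: moving right at the last state gives reward 1."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if (state == n_states - 1 and action == 1) else 0.0
    return nxt, reward

state = 0
for _ in range(5000):
    if random.random() < epsilon:
        action = random.randrange(n_actions)                        # explore
    else:
        action = max(range(n_actions), key=lambda a: Q[state][a])   # exploit
    nxt, reward = step(state, action)
    # Q-learning update: move Q towards reward + discounted best future value
    Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
    state = nxt

print(Q)   # states nearer the right end should prefer action 1 (higher values)
```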

Categories of Reinforcement Learning


Reinforcement learning is categorized mainly into two types of methods/algorithms:

o Positive Reinforcement Learning: Positive reinforcement learning specifies increasing


the tendency that the required behaviour would occur again by adding something. It
enhances the strength of the behaviour of the agent and positively impacts it.
o Negative Reinforcement Learning: Negative reinforcement learning works exactly
opposite to the positive RL. It increases the tendency that the specific behaviour would
occur again by avoiding the negative condition.

Real-world Use cases of Reinforcement Learning


o Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve
super-human performance. Some popular systems that use RL algorithms are AlphaGo and AlphaGo
Zero.
o Resource Management:
The "Resource Management with Deep Reinforcement Learning" paper showed how RL can be used in
computer systems to automatically learn to allocate and schedule resources for waiting jobs in
order to minimize average job slowdown.
o Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial and
manufacturing area, and these robots are made more powerful with reinforcement learning.
There are different industries that have their vision of building intelligent robots using AI
and Machine learning technology.
o Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with the help
of Reinforcement Learning by Salesforce company.

Advantages and Disadvantages of Reinforcement Learning


Advantages

o It helps in solving complex real-world problems which are difficult to solve with general
techniques.
o The learning model of RL is similar to the way human beings learn; hence it can produce highly
accurate results.
o It helps in achieving long-term results.

Disadvantage

o RL algorithms are not preferred for simple problems.


o RL algorithms require huge data and computations.
o Too much reinforcement learning can lead to an overload of states which can weaken the
results.

o The curse of dimensionality limits reinforcement learning for real physical systems.

7. Consider the training dataset given in the following table. Use weighted K-NN to determine the
class of the test instance (7.6, 60, 8) with K = 3.
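Since the training table is not reproduced here, the sketch below only shows the weighted K-NN procedure itself on placeholder rows; the training values are purely illustrative, and only the test instance (7.6, 60, 8) and K = 3 come from the question.

```python
import math

# Placeholder training rows (feature1, feature2, feature3, class): NOT the exam table
train = [
    (7.0, 58, 7, "A"),
    (7.8, 62, 9, "A"),
    (6.5, 50, 5, "B"),
    (7.9, 61, 9, "B"),
    (7.4, 59, 8, "A"),
]
test, k = (7.6, 60, 8), 3

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Take the k nearest neighbours and weight each vote by 1/distance
nearest = sorted(train, key=lambda row: euclidean(row[:3], test))[:k]
votes = {}
for row in nearest:
    d = euclidean(row[:3], test)
    votes[row[3]] = votes.get(row[3], 0.0) + 1.0 / (d + 1e-9)  # avoid divide-by-zero

print(max(votes, key=votes.get))   # class with the largest weighted vote
```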
PART C (15*1=15)
8. Solve this problem using simple linear regression and identify the independent and dependent
variables. Explain the different types of regression.

Xi Yi
1 1.2
2 1.8
3 2.6
4 3.2
5 3.8
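Here Xi is the independent variable and Yi the dependent variable. A short least-squares sketch on the given data (plain Python, no libraries assumed); the predicted point at X = 6 is just an illustrative use of the fitted line:

```python
X = [1, 2, 3, 4, 5]
Y = [1.2, 1.8, 2.6, 3.2, 3.8]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

# Least-squares estimates: a = Sxy / Sxx, b = mean_y - a * mean_x
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
a = sxy / sxx                       # slope     = 6.6 / 10 = 0.66
b = mean_y - a * mean_x             # intercept = 2.52 - 0.66 * 3 = 0.54

print(f"Y = {a:.2f}X + {b:.2f}")    # fitted line: Y = 0.66X + 0.54
print(a * 6 + b)                    # predicted Y at X = 6 -> 4.50
```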
Types of Regression
There are various types of regression which are used in data science and machine learning. Each
type has its own importance in different scenarios, but at the core, all the regression methods
analyze the effect of the independent variables on the dependent variable. Here we discuss some
important types of regression, which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression:

Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o It is used for solving the regression problem in machine learning.
o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between the variables in a linear regression model can be illustrated with a
simple example: predicting the salary of an employee on the basis of years of experience.

o Below is the mathematical equation for linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients.

Some popular applications of linear regression are:

o Analyzing trends and sales estimates


o Salary forecasting
o Real estate prediction
o Arriving at ETAs in traffic.
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or No,
True or False, Spam or not spam, etc.
o It is a predictive analysis algorithm which works on the concept of probability.
o Logistic regression is a type of regression, but it differs from the linear regression
algorithm in how it is used.
o Logistic regression uses the sigmoid function (logistic function) to model the data and map
predictions to probabilities. The function can be represented as:

f(x) = 1 / (1 + e^(-x))

o f(x) = output, a value between 0 and 1
o x = input to the function
o e = base of the natural logarithm

When we provide the input values (data) to the function, it produces an S-shaped curve.

o It uses the concept of threshold levels: values above the threshold level are rounded up to 1,
and values below the threshold level are rounded down to 0.
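A small numeric sketch of the sigmoid and the thresholding step (0.5 is a commonly used threshold, assumed here):

```python
import math

def sigmoid(x):
    """Logistic function: maps any real input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def classify(x, threshold=0.5):
    return 1 if sigmoid(x) >= threshold else 0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, round(sigmoid(x), 3), classify(x))   # S-curve values and 0/1 decisions
```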

There are three types of logistic regression:

o Binary(0/1, pass/fail)
o Multi(cats, dogs, lions)
o Ordinal(low, medium, high)

Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of
x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover
such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.

o The equation for polynomial regression is also derived from the linear regression equation;
that is, the linear regression equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x^2 + b3x^3 + ... + bnx^n.
o Here Y is the predicted/target output and b0, b1, ..., bn are the regression coefficients;
x is our independent/input variable.
o The model is still linear because the coefficients b0, b1, ..., bn enter linearly; only the
input feature is raised to higher powers (quadratic, cubic, and so on).
Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the same
degree.
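A minimal polynomial-regression sketch (scikit-learn assumed; the roughly quadratic toy data is invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical non-linear data: y roughly follows x^2
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 4.2, 8.9, 16.3, 24.8])

poly = PolynomialFeatures(degree=2)          # transform x into [1, x, x^2]
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)    # linear model on polynomial features
print(model.predict(poly.transform([[6]])))  # prediction for x = 6
```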
Support Vector Regression:
Support Vector Machine is a supervised learning algorithm which can be used for regression as
well as classification problems. So if we use it for regression problems, then it is termed as Support
Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous variables. Below
are some keywords which are used in Support Vector Regression:

o Kernel: It is a function used to map lower-dimensional data into a higher-dimensional space.
o Hyperplane: In a general SVM, it is a separation line between two classes, but in SVR, it is
the line which helps to predict the continuous variable and covers most of the datapoints.
o Boundary line: Boundary lines are the two lines drawn on either side of the hyperplane, which
create a margin for the datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane and
the margin boundaries.

In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum number
of datapoints is covered within that margin. The main goal of SVR is to include as many datapoints
as possible within the boundary lines, with the hyperplane (best-fit line) passing through the
maximum number of datapoints. In the usual illustration, the middle line is the hyperplane and the
two lines on either side of it are the boundary lines.
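A minimal SVR sketch (scikit-learn assumed; the continuous-valued data and the parameter values are invented):

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical continuous target
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])

# epsilon sets the width of the margin ("boundary lines") around the hyperplane
model = SVR(kernel="rbf", C=10.0, epsilon=0.2)
model.fit(X, y)

print(model.predict([[3.5]]))   # predicted continuous value inside the data range
```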

Decision Tree Regression:


o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o It can solve problems for both categorical and numerical data
o Decision Tree regression builds a tree-like structure in which each internal node represents a
"test" on an attribute, each branch represents the result of the test, and each leaf node
represents the final decision or result.
o A decision tree is constructed starting from the root node/parent node (the dataset), which
splits into left and right child nodes (subsets of the dataset). These child nodes are further
divided into their own child nodes, for which they in turn become parent nodes.

A typical example of decision tree regression is a model that predicts a person's choice between
a sports car and a luxury car.
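A minimal decision-tree-regression sketch (scikit-learn assumed; the feature, target, and values are invented):

```python
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: [engine_size_litres] -> price; the numbers are made up
X = [[1.0], [1.2], [2.0], [3.0], [3.5], [4.0]]
y = [5.0, 5.5, 9.0, 18.0, 25.0, 30.0]

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)                # internal nodes test the feature, leaves hold values

print(tree.predict([[2.5]]))  # prediction from the leaf the sample falls into
```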

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms, capable of performing
regression as well as classification tasks.
o Random Forest regression is an ensemble learning method which combines multiple decision trees
and predicts the final output based on the average of each tree's output. The combined decision
trees are called base models, and the model can be represented more formally as:

g(x) = f0(x) + f1(x) + f2(x) + ...

o Random forest uses the Bagging or Bootstrap Aggregation technique of ensemble learning, in
which the aggregated decision trees run in parallel and do not interact with each other.
o With the help of Random Forest regression, we can prevent overfitting in the model by creating
random subsets of the dataset.
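A minimal random-forest-regression sketch (scikit-learn assumed; the same kind of invented data as above):

```python
from sklearn.ensemble import RandomForestRegressor

X = [[1.0], [1.2], [2.0], [3.0], [3.5], [4.0]]   # invented feature values
y = [5.0, 5.5, 9.0, 18.0, 25.0, 30.0]

# n_estimators base trees are trained on bootstrap samples and averaged
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[2.5]]))   # average of the individual tree predictions
```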

Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in which a small
amount of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We can compute
this penalty term by multiplying lambda by the squared weight of each individual feature.
o The cost function for ridge regression can be written as the residual sum of squares plus
lambda times the sum of the squared weights:

Cost = Σ(yi − ŷi)^2 + λ Σ wj^2

o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
o Ridge regression is a regularization technique, which is used to reduce the complexity of
the model. It is also called as L2 regularization.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to Ridge Regression, except that the penalty term contains the absolute values of
the weights instead of their squares.
o Since it takes absolute values, it can shrink a slope all the way to 0, whereas Ridge
Regression can only shrink it towards 0.
o It is also called L1 regularization. The cost function for Lasso regression can be written as:

Cost = Σ(yi − ŷi)^2 + λ Σ |wj|
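A short sketch comparing the two penalties (scikit-learn assumed; alpha plays the role of lambda, and the synthetic data is invented):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
# Only the first two features actually matter; the rest are noise
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks weights towards 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can shrink weights exactly to 0

print(np.round(ridge.coef_, 3))      # all five coefficients non-zero, some small
print(np.round(lasso.coef_, 3))      # irrelevant coefficients typically exactly 0
```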
