Mlall
Mlall
UNIT-I : INTRODUCTION:
Introduction
Machine learning is one of the most exciting technologies that one would have ever come across.
As it is evident from the name, it gives the computer that which makes it more similar to humans:
The ability to learn. Machine learning is actively being used today, perhaps in many more places
than one would expect. We probably use a learning algorithm dozens of time without even
knowing it. Applications of Machine Learning include:
• Web Search Engine: One of the reasons why search engines like google, bing etc work so
well is because the system has learnt how to rank pages through a complex learning algorithm.
• Photo tagging Applications: Be it facebook or any other photo tagging application, the
ability to tag friends makes it even more happening. It is all possible because of a face
recognition algorithm that runs behind the application.
• Spam Detector: Our mail agent like Gmail or Hotmail does a lot of hard work for us in
classifying the mails and moving the spam mails to spam folder. This is again achieved by a
spam classifier running in the back end of mail application.
• Augmentation:Machine learning, which assists humans with their day-to-day tasks,
personally or commercially without having complete control of the output. Such machine
learning is used in different ways such as Virtual Assistant, Data analysis, software solutions.
The primary user is to reduce errors due to human bias.
• Automation:Machine learning, which works entirely autonomously in any field without the
need for any human intervention. For example, robots performing the essential process steps
in manufacturing plants.
• Finance Industry:Machine learning is growing in popularity in the finance industry. Banks
are mainly using ML to find patterns inside the data but also to prevent fraud.
• Government organization:The government makes use of ML to manage public safety and
utilities. Take the example of China with the massive face recognition. The government
uses Artificial intelligence to prevent jaywalker.
• Healthcare industry: Healthcare was one of the first industry to use machine learning with
image detection.
• Marketing:Broad use of AI is done in marketing thanks to abundant access to data. Before
the age of mass data, researchers develop advanced mathematical tools like Bayesian analysis
to estimate the value of a customer. With the boom of data, marketing department relies on AI
to optimize the customer relationship and marketing campaign.
Today, companies are using Machine Learning to improve business decisions,increase
productivity, detect disease, forecast weather, and do many more things. With the exponential
growth of technology, we not only need better tools to understand the data we currently have, but
we also need to prepare ourselves for the data we will have. To achieve this goal we need to build
intelligent machines. We can write a program to do simple things. But for most of times
Hardwiring Intelligence in it is difficult. Best way to do it is to have some way for machines to
learn things themselves. A mechanism for learning – if a machine can learn from input then it
does the hard work for us. This is where Machine Learning comes in action. Some examples of
machine learning are:
• Database Mining for growth of automation: Typical applications include Web-click data
for better UX( User eXperience), Medical records for better automation in healthcare,
biological data and many more.
• Applications that cannot be programmed: There are some tasks that cannot be
programmed as the computers we use are not modelled that way. Examples include
Autonomous Driving, Recognition tasks from unordered data (Face Recognition/ Handwriting
Recognition), Natural language Processing, computer Vision etc.
• Understanding Human Learning: This is the closest we have understood and mimicked the
human brain. It is the start of a new revolution, The real AI. Now, After a brief insight lets
come to a more formal definition of Machine Learning
• Arthur Samuel(1959): “Machine Learning is a field of study that gives computers, the
ability to learn without explicitly being programmed.”Samuel wrote a Checker playing
program which could learn over time. At first it could be easily won. But over time, it learnt
all the board position that would eventually lead him to victory or loss and thus became a better
chess player than Samuel itself. This was one of the most early attempts of defining Machine
Learning and is somewhat less formal.
• Tom Michel(1999): “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E.” This is a more formal and mathematical definition. For the
previous Chess program
• E is number of games.
• T is playing chess against computer.
• P is win/loss by computer.
• Computer vision: Machine learning algorithms can be used to recognize objects, people, and
other elements in images and videos.
• Natural language processing: Machine learning algorithms can be used to understand and
generate human language, including tasks such as translation and text classification.
• Recommendation systems: Machine learning algorithms can be used to recommend
products or content to users based on their past behavior and preferences.
• Fraud detection: Machine learning algorithms can be used to identify fraudulent activity in
areas such as credit card transactions and insurance claims.
• Healthcare: Machine learning algorithms can be used to predict disease outbreaks, identify
potential outbreaks, or predict patient outcomes.
• Finance: Machine learning algorithms can be used to predict stock prices, identify fraudulent
activity, or identify potential investment opportunities.
One simple python code:
• Python3
from sklearn import tree
# Training data
X = [[140, 1], [130, 1], [150, 0], [170, 0]] # [weight, texture] (0: smooth, 1: bumpy)
# Train a classifier
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)
# Make a prediction
print(prediction)
output:
1
If you run the code I provided, the output will be the prediction made by the model. In this case,
the output will be [1], indicating that the model predicts that the fruit with a weight of 160 and a
smooth texture is an orange.
In the Next tutorial we shall classify the types of Machine Learning problems and shall also discuss
about useful packages and setting environment for Machine Learning and how can we use it to
design new projects.
There are many applications of machine learning, some examples include:
1 39
2 44
3 49
4 40
5 22
6 10
7 45
8 38
9 14
10 50
Now, if you want to analyse the standard of achievement of the students. If you arrange them in
ascending or descending order, it will give you a better picture.
Ascending order:
10, 15, 22, 38, 39, 40, 44. 45, 49, 50
Descending order:
50, 49, 45, 44, 40, 39, 38, 22, 15, 10
When the row is placed in ascending or descending order is known as arrayed data.
Types of Graphical Data Representation
Bar Chart
Bar chart helps us to represent the collected data visually. The collected data can be visualized
horizontally or vertically in a bar chart like amounts and frequency. It can be grouped or single.
It helps us in comparing different items. By looking at all the bars, it is easy to say which types
in a group of data influence the other.
Now let us understand bar chart by taking this example
Let the marks obtained by 5 students of class V in a class test, out of 10 according to their
names, be:
7,8,4,9,6
The data in the given form is known as raw data. The above given data can be placed in the bar
chart as shown below:
Name Marks
Akshay 7
Maya 8
Dhanvi 4
Jaslen 9
Muskan 6
Histogram
A histogram is the graphical representation of data. It is similar to the appearance of a bar graph
but there is a lot of difference between histogram and bar graph because a bar graph helps to
measure the frequency of categorical data. A categorical data means it is based on two or more
categories like gender, months, etc. Whereas histogram is used for quantitative data.
For example:
Line Graph
The graph which uses lines and points to present the change in time is known as a line graph.
Line graphs can be based on the number of animals left on earth, the increasing population of the
world day by day, or the increasing or decreasing the number of bitcoins day by day, etc. The
line graphs tell us about the changes occurring across the world over time. In a line graph, we
can tell about two or more types of changes occurring around the world.
For Example:
Pie Chart
Pie chart is a type of graph that involves a structural graphic representation of numerical
proportion. It can be replaced in most cases by other plots like a bar chart, box plot, dot plot, etc.
As per the research, it is shown that it is difficult to compare the different sections of a given pie
chart, or if it is to compare data across different pie charts.
For example:
0 4
1 3
2 1
3 1
Sample Questions
Question 1: Considering the school fee submission of 10 students of class 10th is given
below:
Student Fee
Muskan Paid
Nitin Paid
Dhanvi Paid
Jasleen Paid
Sahil Paid
Solution:
In order to draw the bar graph for the data above, we prepare the frequency table as given
below.
Paid 6
Not paid 4
Now we have to represent the data by using the bar graph. It can be drawn by following the
steps given below:
Step 1: firstly we have to draw the two axis of the graph X-axis and the Y-axis.
The varieties of the data must be put on the X-axis (the horizontal line) and the frequencies of the
data must be put on the Y-axis (the vertical line) of the graph.
Step 2: After drawing both the axis now we have to give the numeric scale to the Y-axis (the
vertical line) of the graph
It should be started from zero and ends up with the highest value of the data.
Step 3: After the decision of the range at the Y-axis now we have to give it a suitable difference
of the numeric scale.
Like it can be 0,1,2,3…….or 0,10,20,30 either we can give it a numeric scale like 0,20,40,60…
Step 4: Now on the X-axis we have to label it appropriately.
Step 5: Now we have to draw the bars according to the data but we have to keep in mind that all
the bars should be of the same length and there should be the same distance between each graph
Question 2: Watch the subsequent pie chart that denotes the money spent by Megha at the
funfair. The suggested colour indicates the quantity paid for each variety. The total value
of the data is 15 and the amount paid on each variety is diagnosed as follows:
Chocolates – 3
Wafers – 3
Toys – 2
Rides – 7
To convert this into pie chart percentage, we apply the formula:
(Frequency/Total Frequency) × 100
Let us convert the above data into a percentage:
Amount paid on rides: (7/15) × 100 = 47%
Amount paid on toys: (2/15) × 100 = 13%
Amount paid on wafers: (3/15) × 100 = 20%
Amount paid on chocolates: (3/15) × 100 = 20 %
Question 3: The line graph given below shows how Devdas’s height changes as he grows.
Given below is a line graph showing the height changes in Devdas’s as he grows. Observe
the graph and answer the questions below.
Machine Learning is one of the most popular sub-fields of Artificial Intelligence. Machine learning
concepts are used almost everywhere, such as Healthcare, Finance, Infrastructure, Marketing, Self-
driving cars, recommendation systems, chatbots, social sites, gaming, cyber security, and many
more.
Currently, Machine Learning is under the development phase, and many new technologies are
continuously being added to Machine Learning. It helps us in many ways, such as analyzing large
chunks of data, data extractions, interpretations, etc. Hence, there are unlimited numbers of uses
of Machine Learning. In this topic, we will discuss various importance of Machine Learning with
examples. So, let's start with a quick introduction to Machine Learning.
What is Machine Learning?
Machine Learning is a branch of Artificial Intelligence that allows machines to learn and improve
from experience automatically. It is defined as the field of study that gives computers the capability
to learn without being explicitly programmed. It is quite different than traditional programming.
Machine Learning is a core form of Artificial Intelligence that enable machine to learn from past
data and make predictions
It involves data exploration and pattern matching with minimal human intervention. There are
mainly four technologies that machine learning used to work:
1. Supervised Learning:
Supervised Learning is a machine learning method that needs supervision similar to the student-
teacher relationship. In supervised Learning, a machine is trained with well-labeled data, which
means some data is already tagged with correct outputs. So, whenever new data is introduced into
the system, supervised learning algorithms analyze this sample data and predict correct outputs
with the help of that labeled data.
o Classification: It deals when output is in the form of a category such as Yellow, blue, right
or wrong, etc.
o Regression: It deals when output variables are real values like age, height, etc.
This technology allows us to collect or produce data output from experience. It works the same
way as humans learn using some labeled data points of the training set. It helps in optimizing the
performance of models using experience and solving various complex computation problems.
2. Unsupervised Learning:
Unlike supervised learning, unsupervised Learning does not require classified or well-labeled data
to train a machine. It aims to make groups of unsorted information based on some patterns and
differences even without any labelled training data. In unsupervised Learning, no supervision is
provided, so no sample data is given to the machines. Hence, machines are restricted to finding
hidden structures in unlabeled data by their own.
o Clustering: It deals when there is a requirement of inherent grouping in training data, e.g.,
grouping students by their area of interest.
o Association: It deals with the rules that help to identify a large portion of data, such as
students who are interested in ML and also interested in AI.
3. Semi-supervised learning:
In the semi-supervised learning method, a machine is trained with labeled as well as unlabeled
data. Although, it involves a few labeled examples and a large number of unlabeled examples.
Speech analysis, web content classification, protein sequence classification, and text documents
classifiers are some most popular real-world applications of semi-supervised Learning.
4. Reinforcement learning:
Reinforcement learning is defined as a feedback-based machine learning method that does not
require labeled data. In this learning method, an agent learns to behave in an environment by
performing the actions and seeing the results of actions. Agents can provide positive feedback for
each good action and negative feedback for bad actions. Since, in reinforcement learning, there is
no training data, hence agents are restricted to learn with their experience only.
Although machine learning is continuously evolving with so many new technologies, it is still used
in various industries.
Machine learning is important because it gives enterprises a view of trends in customer behavior
and operational business patterns, as well as supports the development of new products. Many
of today's leading companies, such as Facebook, Google, and Uber, make machine learning a
central part of their operations. Machine learning has become a significant competitive
differentiator for many companies.
Machine learning has several practical applications that drive the kind of real business results -
such as time and money savings - that have the potential to dramatically impact the future of your
organization. In particular, we see tremendous impact occurring within the customer care industry,
whereby machine learning is allowing people to get things done more quickly and efficiently.
Through Virtual Assistant solutions, machine learning automates tasks that would otherwise need
to be performed by a live agent - such as changing a password or checking an account balance.
This frees up valuable agent time that can be used to focus on the kind of customer care that
humans perform best: high touch, complicated decision-making that is not as easily handled by a
machine. At Interactions, we further improve the process by eliminating the decision of whether a
request should be sent to a human or a machine: unique Adaptive Understanding technology, the
machine learns to be aware of its limitations, and bailout to humans when it has low confidence in
providing the correct solution.
Machine Learning is broadly used in every industry and has a wide range of applications,
especially that involves collecting, analyzing, and responding to large sets of data. The importance
of Machine Learning can be understood by these important applications.
Some important applications in which machine learning is widely used are given below:
Conclusion:
Machine Learning is directly or indirectly involved in our daily routine. We have seen various
machine learning applications that are very useful for surviving in this technical world. Although
machine learning is in the developing phase, it is continuously evolving rapidly. The best thing
about machine learning is its High-value predictions that can guide better decisions and smart
actions in real-time without human intervention. Hence, at the end of this article, we can say that
the machine learning field is very vast, and its importance is not limited to a specific industry or
sector; it is applicable everywhere for analyzing or predicting future events.
Big Data includes huge volume, high velocity, and extensible variety of data. There are 3 types:
Structured data, Semi-structured data, and Unstructured data.
1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been
organized into a formatted repository that is typically a database. It concerns all data which
can be stored in database SQL in a table with rows and columns. They have relational keys
and can easily be mapped into pre-designed fields. Today, those data are most processed in
the development and simplest way to manage information. Example: Relational data.
2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but that has
some organizational properties that make it easier to analyze. With some processes, you can
store them in the relation database (it could be very hard for some kind of semi-structured
data), but Semi-structured exist to ease space. Example: XML data.
3. Unstructured data –
Unstructured data is a data which is not organized in a predefined manner or does not have a
predefined data model, thus it is not a good fit for a mainstream relational database. So for
Unstructured data, there are alternative platforms for storing and managing, it is increasingly
prevalent in IT systems and is used by organizations in a variety of business intelligence and
analytics applications. Example: Word, PDF, Text, Media logs.
Matured transaction
and various No transaction
Transaction concurrency Transaction is adapted from management and
management techniques DBMS not matured no concurrency
Extracting useful information from Introduce algorithm from data as well as from past
1. large amount of data experience
Machine learning has a strong connection with mathematics. Each machine learning algorithm is
based on the concepts of mathematics & also with the help of mathematics, one can choose the
correct algorithm by considering training time, complexity, number of features, etc. Linear
Algebra is an essential field of mathematics, which defines the study of vectors, matrices, planes,
mapping, and lines required for linear transformation.
The term Linear Algebra was initially introduced in the early 18th century to find out the unknowns
in Linear equations and solve the equation easily; hence it is an important branch of mathematics
that helps study data. Also, no one can deny that Linear Algebra is undoubtedly the important and
primary thing to process the applications of Machine Learning. It is also a prerequisite to start
learning Machine Learning and data science.
Linear algebra plays a vital role and key foundation in machine learning, and it enables ML
algorithms to run on a huge number of datasets.
The concepts of linear algebra are widely used in developing algorithms in machine learning.
Although it is used almost in each concept of Machine learning, specifically, it can perform the
following task:
o Optimization of data.
o Applicable in loss functions, regularisation, covariance matrices, Singular Value
Decomposition (SVD), Matrix Operations, and support vector machine classification.
o Implementation of Linear Regression in Machine Learning.
Besides the above uses, linear algebra is also used in neural networks and the data science field.
Basic mathematics principles and concepts like Linear algebra are the foundation of Machine
Learning and Deep Learning systems. To learn and understand Machine Learning or Data Science,
one needs to be familiar with linear algebra and optimization theory. In this topic, we will explain
all the Linear algebra concepts required for machine learning.
Note: Although linear algebra is a must-know part of mathematics for machine learning, it is not
required to get intimate in this. It means it is not required to be an expert in linear algebra; instead,
only good knowledge of these concepts is more than enough for machine learning.
Linear Algebra is just similar to the flour of bakery in Machine Learning. As the cake is based on
flour similarly, every Machine Learning Model is also based on Linear Algebra. Further, the cake
also needs more ingredients like egg, sugar, cream, soda. Similarly, Machine Learning also
requires more concepts as vector calculus, probability, and optimization theory. So, we can say
that Machine Learning creates a useful model with the help of the above-mentioned mathematical
concepts.
Below are some benefits of learning Linear Algebra before Machine learning:
Linear Algebra helps to provide better graphical processing in Machine Learning like Image,
audio, video, and edge detection. These are the various graphical representations supported by
Machine Learning projects that you can work on. Further, parts of the given data set are trained
based on their categories by classifiers provided by machine learning algorithms. These classifiers
also remove the errors from the trained data.
Moreover, Linear Algebra helps solve and compute large and complex data set through a specific
terminology named Matrix Decomposition Techniques. There are two most popular matrix
decomposition techniques, which are as follows:
o Q-R
o L-U
Improved Statistics:
Statistics is an important concept to organize and integrate data in Machine Learning. Also, linear
Algebra helps to understand the concept of statistics in a better manner. Advanced statistical topics
can be integrated using methods, operations, and notations of linear algebra.
Linear Algebra also helps to create better supervised as well as unsupervised Machine Learning
algorithms.
Few supervised learning algorithms can be created using Linear Algebra, which is as follows:
o Logistic Regression
o Linear Regression
o Decision Trees
o Support Vector Machines (SVM)
Further, below are some unsupervised learning algorithms listed that can also be created with the
help of linear algebra as follows:
If you are working on a Machine Learning project, then you must be a broad-minded person and
also, you will be able to impart more perspectives. Hence, in this regard, you must increase the
awareness and affinity of Machine Learning concepts. You can begin with setting up different
graphs, visualization, using various parameters for diverse machine learning algorithms or taking
up things that others around you might find difficult to understand.
Easy to Learn:
Notation:
Notation in linear algebra enables you to read algorithm descriptions in papers, books, and
websites to understand the algorithm's working. Even if you use for-loops rather than matrix
operations, you will be able to piece things together.
Operations:
Working with an advanced level of abstractions in vectors and matrices can make concepts clearer,
and it can also help in the description, coding, and even thinking capability. In linear algebra, it is
required to learn the basic operations such as addition, multiplication, inversion, transposing of
matrices, vectors, etc.
Matrix Factorization:
One of the most recommended areas of linear algebra is matrix factorization, specifically matrix
deposition methods such as SVD and QR.
Each machine learning project works on the dataset, and we fit the machine learning model using
this dataset.
Each dataset resembles a table-like structure consisting of rows and columns. Where each row
represents observations, and each column represents features/Variables. This dataset is handled as
a Matrix, which is a key data structure in Linear Algebra.
Further, when this dataset is divided into input and output for the supervised learning model, it
represents a Matrix(X) and Vector(y), where the vector is also an important concept of linear
algebra.
In machine learning, images/photographs are used for computer vision applications. Each Image
is an example of the matrix from linear algebra because an image is a table structure consisting of
height and width for each pixel.
Moreover, different operations on images, such as cropping, scaling, resizing, etc., are performed
using notations and operations of Linear Algebra.
In machine learning, sometimes, we need to work with categorical data. These categorical
variables are encoded to make them simpler and easier to work with, and the popular encoding
technique to encode these variables is known as one-hot encoding.
In the one-hot encoding technique, a table is created that shows a variable with one column for
each category and one row for each example in the dataset. Further, each row is encoded as a
binary vector, which contains either zero or one value. This is an example of sparse representation,
which is a subfield of Linear Algebra.
4. Linear Regression
Linear regression is a popular technique of machine learning borrowed from statistics. It describes
the relationship between input and output variables and is used in machine learning to predict
numerical values. The most common way to solve linear regression problems using Least Square
Optimization is solved with the help of Matrix factorization methods. Some commonly used matrix
factorization methods are LU decomposition, or Singular-value decomposition, which are the
concept of linear algebra.
5. Regularization
In machine learning, we usually look for the simplest possible model to achieve the best outcome
for the specific problem. Simpler models generalize well, ranging from specific examples to
unknown datasets. These simpler models are often considered models with smaller coefficient
values.
A technique used to minimize the size of coefficients of a model while it is being fit on data is
known as regularization. Common regularization techniques are L1 and L2 regularization. Both
of these forms of regularization are, in fact, a measure of the magnitude or length of the coefficients
as a vector and are methods lifted directly from linear algebra called the vector norm.
Generally, each dataset contains thousands of features, and fitting the model with such a large
dataset is one of the most challenging tasks of machine learning. Moreover, a model built with
irrelevant features is less accurate than a model built with relevant features. There are several
methods in machine learning that automatically reduce the number of columns of a dataset, and
these methods are known as Dimensionality reduction. The most commonly used dimensionality
reductions method in machine learning is Principal Component Analysis or PCA. This technique
makes projections of high-dimensional data for both visualizations and training models. PCA uses
the matrix factorization method from linear algebra.
7. Singular-Value Decomposition
Singular-Value decomposition is also one of the popular dimensionality reduction techniques and
is also written as SVD in short form.
It is the matrix-factorization method of linear algebra, and it is widely used in different applications
such as feature selection, visualization, noise reduction, and many more.
Natural Language Processing or NLP is a subfield of machine learning that works with text and
spoken words.
NLP represents a text document as large matrices with the occurrence of words. For example, the
matrix column may contain the known vocabulary words, and rows may contain sentences,
paragraphs, pages, etc., with cells in the matrix marked as the count or frequency of the number of
times the word occurred. It is a sparse matrix representation of text. Documents processed in this
way are much easier to compare, query, and use as the basis for a supervised machine learning
model.
This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also
known by the name Latent Semantic Indexing or LSI.
9. Recommender System
The development of recommender systems is mainly based on linear algebra methods. We can
understand it as an example of calculating the similarity between sparse customer behaviour
vectors using distance measures such as Euclidean distance or dot products.
Artificial Neural Networks or ANN are the non-linear ML algorithms that work to process the
brain and transfer information from one layer to another in a similar way.
Deep learning studies these neural networks, which implement newer and faster hardware for the
training and development of larger networks with a huge dataset. All deep learning methods
achieve great results for different challenging tasks such as machine translation, speech
recognition, etc. The core of processing neural networks is based on linear algebra data structures,
which are multiplied and added together. Deep learning algorithms also work with vectors,
matrices, tensors (matrix with more than two dimensions) of inputs and coefficients for multiple
dimensions.
Conclusion
In this topic, we have discussed Linear algebra, its role and its importance in machine learning.
For each machine learning enthusiast, it is very important to learn the basic concepts of linear
algebra to understand the working of ML algorithms and choose the best algorithm for a specific
problem.
UNIT-II: SUPERVISED LEARNING:
Supervised learning is the types of machine learning in which machines are trained using well
"labelled" training data, and on basis of that data, machines predict the output. The labelled data
means some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines work as the supervisor that
teaches the machines to predict the output correctly. It applies the same concept as a student learns
in the supervision of the teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on the basis of test
data (a subset of the training set), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle,
and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies
the shape on the bases of a number of sides, and predicts the output.
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised
learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, which means there are
two classes such as Yes-No, Male-Female, True-false, etc.
Spam Filtering,
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Supervised learning is when the model is getting trained on a labelled dataset. A labelled dataset
is one that has both input and output parameters. In this type of learning both training and
validation, datasets are labelled as shown in the figures below.
It is important to remember that the theoretical learning models are abstractions of real-life
problems. Close connections with experimentalists are useful to help validate or modify these
abstractions so that the theoretical results reflect empirical performance. The computational
learning theory research has therefore close connections to machine learning research. Besides
the model’s predictive capability, the computational learning theory also addresses other
important features such as simplicity, robustness to variations in the learning scenario, and an
ability to create insights to empirically observed phenomena.
What is computational learning theory in machine learning?
These are sub-fields of machine learning that a machine learning practitioner does not need to
know in great depth in order to achieve good results on a wide range of problems. Nevertheless,
it is a sub-field where having a high-level understanding of some of the more prominent methods
may provide insight into the broader task of learning from data.
Theoretical results in machine learning mainly deal with a type of inductive learning called
supervised learning. In supervised learning, an algorithm is given samples that are labeled in
some useful way. For example, the samples might be descriptions of mushrooms, and the labels
could be whether or not the mushrooms are edible. The algorithm takes these previously labeled
samples and uses them to induce a classifier. This classifier is a function that assigns labels to
samples, including samples that have not been seen previously by the algorithm. The goal of the
supervised learning algorithm is to optimize some measure of performance such as minimizing
the number of mistakes made on new samples.
In addition to performance bounds, computational learning theory studies the time complexity
and feasibility of learning. In computational learning theory, a computation is considered feasible
if it can be done in polynomial time.
There are two kinds of time complexity results:
Negative results often rely on commonly believed, but yet unproven assumptions, such as:
What is the difference between Computational Learning Theory and Statistical Learning
Theory?
While both frameworks use similar mathematical analysis, the primary difference between CoLT
and SLT is their objectives. CoLT focuses on studying “learnability,” or what functions/features
are necessary to make a given task learnable for an algorithm. Whereas SLT is primarily focused
on studying and improving the accuracy of existing training programs.
What is machine learning theory?
Machine Learning Theory, also known as Computational Learning Theory, aims to understand
the fundamental principles of learning as a computational process. This field seeks to understand
at a precise mathematical level what capabilities and information are fundamentally needed to
learn different kinds of tasks successfully, and to understand the basic algorithmic principles
involved in getting computers to learn from data and to improve performance with feedback. The
goals of this theory are both to aid in the design of better automated learning methods and to
understand fundamental issues in the learning process itself.
Machine Learning Theory draws elements from both the Theory of Computation and
Statistics and involves tasks such as:
• Creating mathematical models that capture key aspects of machine learning, in which one
can analyze the inherent ease or difficulty of different types of learning problems.
• Proving guarantees for algorithms (under what conditions will they succeed, how much
data and computation time is needed) and developing machine learning algorithms that
probably meet desired criteria.
• Mathematically analyzing general issues, such as: “Why is Occam’s Razor a good idea?”,
“When can one be confident about predictions made from limited data?”, “How much
power does active participation add over passive observation for learning?”, and “What
kinds of methods can learn even in the presence of large quantities of distracting
information?”
What is VC Dimension
The Vapnik–Chervonenkis theory (VC Theory) is a theoretical machine learning framework
created by Vladimir Vapnik and Alexey Chervonenkis.
It aims to quantify the capability of a learning algorithm and could be considered to be the main
sub-field of statistical learning theory.
One of the main elements of the VC theory is the Vapnik-chervonenkis dimension (VC
dimension). It quantifies the complexity of hypothesis space. It comes up with an estimation of
the capability or capacity of a classification machine learning algorithm for a particular dataset
(number and dimensionality of examples)
2.4 occma's razor principle and over fitting avoidance heuristic search in inductive
learning
Occam’s razor
Many philosophers throughout history have advocated the idea of parsimony. One of the greatest
Greek philosophers, Aristotle who goes as far as to say, “Nature operates in the shortest way
possible”. It is as a consequence that humans might be biased as well to choose a simpler
explanation given a set of all possible explanations with the same descriptive power. This post
gives a brief overview of Occam’s razor, the relevance of the principle and ends with a note on
the usage of this razor as an inductive bias in machine learning (decision tree learning in
particular).
What is Occam’s razor?
Occam’s razor is a law of parsimony popularly stated as (in William’s words) “Plurality must
never be posited without necessity”. Alternatively, as a heuristic, it can be viewed as, when there
are multiple hypotheses to solve a problem, the simpler one is to be preferred. It is not clear as to
whom this principle can be conclusively attributed to, but William of Occam’s (c. 1287 – 1347)
preference for simplicity is well documented. Hence this principle goes by the name, “Occam’s
razor”. This often means cutting off or shaving away other possibilities or explanations, thus
“razor” appended to the name of the principle. It should be noted that these explanations or
hypotheses should lead to the same result.
Relevance of Occam’s razor.
There are many events that favor a simpler approach either as an inductive bias or a constraint to
begin with. Some of them are :
• Studies like this, where the results have suggested that preschoolers are sensitive to simpler
explanations during their initial years of learning and development.
• Preference for a simpler approach and explanations to achieve the same goal is seen in
various facets of sciences; for instance, the parsimony principle applied to the understanding
of evolution.
• In theology, ontology, epistemology, etc this view of parsimony is used to derive various
conclusions.
• Variants of Occam’s razor are used in knowledge Discovery.
Occam’s razor as an inductive bias in machine learning.
Note: It is highly recommended to read the article on decision tree introduction for an insight on
decision tree building with examples.
• Inductive bias (or the inherent bias of the algorithm) are assumptions that are made by the
learning algorithm to form a hypothesis or a generalization beyond the set of training
instances in order to classify unobserved data.
• Occam’s razor is one of the simplest examples of inductive bias. It involves a preference for
a simpler hypothesis that best fits the data. Though the razor can be used to eliminate other
hypotheses, relevant justification may be needed to do so. Below is an analysis of how this
principle is applicable in decision tree learning.
• The decision tree learning algorithms follow a search strategy to search the hypotheses space
for the hypothesis that best fits the training data. For example, the ID3 algorithm uses a
simple to complex strategy starting from an empty tree and adding nodes guided by the
information gain heuristic to build a decision tree consistent with the training instances.
The information gain of every attribute (which is not already included in the tree) is
calculated to infer which attribute to be considered as the next node. Information gain is the
essence of the ID3 algorithm. It gives a quantitative measure of the information that an
attribute can provide about the target variable i.e, assuming only information of that attribute
is available, how efficiently can we infer about the target. It can be defined as :
• Well, there can be many decision trees that are consistent with a given set of training
examples, but the inductive bias of the ID3 algorithm results in the preference for simper (or
shorter trees) trees. This preference bias of ID3 arises from the fact that there is an ordering
of the hypotheses in the search strategy. This leads to additional bias that attributes high with
information gain closer to the root is preferred. Therefore, there is a definite order the
algorithm follows until it terminates on reaching a hypothesis that is consistent with the
training data.
The above image depicts how the ID3 algorithm chooses the nodes in every iteration. The red
arrow depicts the node chosen in a particular iteration while the black arrows suggest other
decision trees that could have been possible in a given iteration.
• Hence starting from an empty node, the algorithm graduates towards more complex decision
trees and stops when the tree is sufficient to classify the training examples.
• This example pops a question. Does eliminating complex hypotheses bear any consequence
on the classification of unobserved instances? simply put, does the preference for a simpler
hypothesis have an advantage? If two decision trees have slightly different training errors but
the same validation errors, then it is obvious that the simpler tree among the two will be
chosen. As a higher validation error causes overfitting of the data. Complex trees often have
almost zero training error, but the validation errors might be high. This scenario gives a
logical reason for a bias towards simpler trees. In addition to that, a simpler hypothesis might
prove effective in a resource-limited environment.
• What is overfitting? Consider two hypotheses a and b. Let ‘a’ fit the training examples
perfectly, while the hypothesis ‘b’ has a small training error. If over the entire set of data (i.e,
including the unseen instances), if the hypothesis ‘b’ performs better, then ‘a’ is said to
overfit the training data. To best illustrate the problem of over-fitting, consider the figure
below.
Figures A and B depict two decision boundaries. Assuming the green and red points
represent the training examples, the decision boundary in B perfectly fits the data thus
perfectly classifying the instances, while the decision boundary in A does not, though being
simpler than B. In this example the decision boundary in B overfits the data. The reason
being that every instance of the training data affects the decision boundary. The added
relevance is when the training data contains noise. For example, assume in figure B that one
of the red points close to the boundary was a noise point. Then the unseen instances in close
proximity to the noise point might be wrongly classified. This makes the complex hypothesis
vulnerable to noise in the data.
• While the problem of overfitting behaviour of a model can be significantly avoided by
settling for a simpler hypothesis, an extremely simple hypothesis may be too abstract to
deduce any information needed for the task resulting in underfitting. Overfitting and
underfitting are one of the major challenges to be addressed before we zero in on a machine
learning model. Sometimes a complex model might be desired, it’s a choice dependent on the
data available, the results expected and the application domain.
Note: For additional information on the decision tree learning, please refer to Tom M. Mitchell’s
“Machine Learning” book.
2.5 Understanding Generalization Error in Machine Learning
Definition
generalization error (also known as the out-of-sample error) is a measure of how accurately an
algorithm is able to predict outcome values for previously unseen data. wikipedia
Notice that the gap between predictions and observed data is induced by model
inaccuracy, sampling error, and noise. Some of the errors are reducible but some are not.
Choosing the right algorithm and tuning parameters could improve model accuracy, but we will
Bias-variance decomposition
An important way to understand generalization error is bias-variance decomposition.
Intuitively speaking, bias is the error rate in the world of big data. A model has a high bias
when, for example, it fails to capture meaningful patterns in the data. Bias is measured by the
differences between the expected predicted values and the observed values, in the dataset D when
the prediction variables are at the level of x (X=x). In contrast with bias, variance is an
algorithm’s flexibility to learn patterns in the observed data. Variance is the amount that an
algorithm will change if a different dataset is used. A model is of high variance when, for
instance, it tries too hard that it not only captures the pattern of meaningful features but also that
Mathematical Notations
Interpretation
Bias measures the deviation between the expected output of our model and the real values, so it
Variance measures the amount that the outputs of our model will change if a different dataset is
Noise is the irreducible error, the lowest bound of generalization error for the current task that
any model will not be able to get rid of, indicating the difficulty of this task.
These 3 components above determine the model’s ability to react to new unseen data rather than
Bias-Variance Tradeoff
Bias-Variance Tradeoff as a Function of Model Capacity
Generalization error could be measured by MSE. As the model capacity increases, the bias
decreases as the model fits the training datasets better. However, the variance increases, as your
model become sophisticated to fit more patterns of the current dataset, changing datasets (even if
they come from the same distribution) would be impactful. As a data scientist, our challenge lies
in finding the optimal capacity — where both bias and variance are low.
Evaluating the performance of a Machine learning model is one of the important steps while
building an effective ML model. To evaluate the performance or quality of the model, different
metrics are used, and these metrics are known as performance metrics or evaluation
metrics. These performance metrics help us understand how well our model has performed for the
given data. In this way, we can improve the model's performance by tuning the hyper-parameters.
Each ML model aims to generalize well on unseen/new data, and performance metrics help
determine how well the model generalizes on the new dataset.
In machine learning, each task or problem is divided into classification and Regression. Not all
metrics can be used for all types of problems; hence, it is important to know and understand which
metrics should be used. Different evaluation metrics are used for both Regression and
Classification tasks. In this topic, we will discuss metrics used for classification and regression
tasks.
In a classification problem, the category or classes of data is identified based on training data. The
model learns from the given dataset and then classifies the new data into classes or groups based
on the training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not
Spam, etc. To evaluate the performance of a classification model, different metrics are used, and
some of them are as follows:
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
I. Accuracy
The accuracy metric is one of the simplest Classification metrics to implement, and it can be
determined as the number of correct predictions to the total number of predictions.
Firstly, we need to import the accuracy_score function of the scikit-learn library as follows:
Although it is simple to use and implement, it is suitable only for cases where an equal number of
samples belong to each class.
It is good to use the Accuracy metric when the target variable classes in data are approximately
balanced. For example, if 60% of classes in a fruit image dataset are of Apple, 40% are Mango. In
this case, if the model is asked to predict whether the image is of Apple or Mango, it will give a
prediction with 97% of accuracy.
It is recommended not to use the Accuracy measure when the target variable majorly belongs to
one class. For example, Suppose there is a model for a disease prediction in which, out of 100
people, only five people have a disease, and 95 people don't have one. In this case, if our model
predicts every person with no disease (which means a bad prediction), the Accuracy measure will
be 95%, which is not correct.
A typical confusion matrix for a binary classifier looks like the below image(However, it can be
extended to use for classifiers with more than two classes).
o In the matrix, columns are for the prediction values, and rows specify the Actual values.
Here Actual and prediction give two possible classes, Yes or No. So, if we are predicting
the presence of a disease in a patient, the Prediction column with Yes means, Patient has
the disease, and for NO, the Patient doesn't have the disease.
o In this example, the total number of predictions are 165, out of which 110 time predicted
yes, whereas 55 times predicted No.
o However, in reality, 60 cases in which patients don't have the disease, whereas 105 cases
in which patients have the disease.
In general, the table is divided into four terminologies, which are as follows:
1. True Positive(TP): In this case, the prediction outcome is true, and it is true in reality,
also.
2. True Negative(TN): in this case, the prediction outcome is false, and it is false in reality,
also.
3. False Positive(FP): In this case, prediction outcomes are true, but they are false in actuality.
4. False Negative(FN): In this case, predictions are false, and they are true in actuality.
III. Precision
The precision metric is used to overcome the limitation of Accuracy. The precision determines the
proportion of positive prediction that was actually correct. It can be calculated as the True Positive
or predictions that are actually true to the total positive predictions (True Positive and False
Positive).
It is also similar to the Precision metric; however, it aims to calculate the proportion of actual
positive that was identified incorrectly. It can be calculated as True Positive or predictions that are
actually true to the total number of positives, either correctly predicted as positive or incorrectly
predicted as negative (true Positive and false negative).
From the above definitions of Precision and Recall, we can say that recall determines the
performance of a classifier with respect to a false negative, whereas precision gives information
about the performance of a classifier with respect to a false positive.
So, if we want to minimize the false negative, then, Recall should be as near to 100%, and if we
want to minimize the false positive, then precision should be close to 100% as possible.
In simple words, if we maximize precision, it will minimize the FP errors, and if we maximize
recall, it will minimize the FN error.
V. F-Scores
F-score or F1 Score is a metric to evaluate a binary classification model on the basis of predictions
that are made for the positive class. It is calculated with the help of Precision and Recall. It is a
type of single score that represents both Precision and Recall. So, the F1 Score can be calculated
as the harmonic mean of both precision and Recall, assigning equal weight to each of them.
The formula for calculating the F1 score is given below:
As F-score make use of both precision and recall, so it should be used if both of them are important
for evaluation, but one (precision or recall) is slightly more important to consider than the other.
For example, when False negatives are comparatively more important than false positives, or vice
versa.
VI. AUC-ROC
Sometimes we need to visualize the performance of the classification model on charts; then, we
can use the AUC-ROC curve. It is one of the popular and important metrics for evaluating the
performance of the classification model.
Firstly, let's understand ROC (Receiver Operating Characteristic curve) curve. ROC represents a
graph to show the performance of a classification model at different threshold levels. The curve
is plotted between two parameters, which are:
TPR or true Positive rate is a synonym for Recall, hence can be calculated as:
To calculate value at any point in a ROC curve, we can evaluate a logistic regression model
multiple times with different classification thresholds, but this would not be much efficient. So,
for this, one efficient method is used, which is known as AUC.
AUC: Area Under the ROC curve
AUC is known for Area Under the ROC curve. As its name suggests, AUC calculates the two-
dimensional area under the entire ROC curve, as shown below image:
AUC calculates the performance across all the thresholds and provides an aggregate measure. The
value of AUC ranges from 0 to 1. It means a model with 100% wrong prediction will have an AUC
of 0.0, whereas models with 100% correct predictions will have an AUC of 1.0.
AUC should be used to measure how well the predictions are ranked rather than their absolute
values. Moreover, it measures the quality of predictions of the model without considering the
classification threshold.
As AUC is scale-invariant, which is not always desirable, and we need calibrating probability
outputs, then AUC is not preferable.
Further, AUC is not a useful metric when there are wide disparities in the cost of false negatives
vs. false positives, and it is difficult to minimize one type of classification error.
Regression is a supervised learning technique that aims to find the relationships between the
dependent and independent variables. A predictive regression model predicts a numeric or discrete
value. The metrics used for regression are different from the classification metrics. It means we
cannot use the Accuracy metric (explained above) to evaluate a regression model; instead, the
performance of a Regression model is reported as errors in the prediction. Following are the
popular metrics that are used to evaluate the performance of Regression models.
Mean Absolute Error or MAE is one of the simplest metrics, which measures the absolute
difference between actual and predicted values, where absolute means taking a number as Positive.
To understand MAE, let's take an example of Linear Regression, where the model draws a best fit
line between dependent and independent variables. To measure the MAE or error in prediction,
we need to calculate the difference between actual values and predicted values. But in order to find
the absolute error for the complete dataset, we need to find the mean absolute of the complete
dataset.
Here,
Y is the Actual outcome, Y' is the predicted outcome, and N is the total number of data points.
MAE is much more robust for the outliers. One of the limitations of MAE is that it is not
differentiable, so for this, we need to apply different optimizers such as Gradient Descent.
However, to overcome this limitation, another metric can be used, which is Mean Squared Error
or MSE.
Mean Squared error or MSE is one of the most suitable metrics for Regression evaluation. It
measures the average of the Squared difference between predicted values and the actual value
given by the model.
Since in MSE, errors are squared, therefore it only assumes non-negative values, and it is usually
positive and non-zero.
Moreover, due to squared differences, it penalizes small errors also, and hence it leads to over-
estimation of how bad the model is.
Y is the Actual outcome, Y' is the predicted outcome, and N is the total number of data points.
R squared error is also known as Coefficient of Determination, which is another popular metric
used for Regression model evaluation. The R-squared metric enables us to compare our model
with a constant baseline to determine the performance of the model. To select the constant baseline,
we need to take the mean of the data and draw the line at the mean.
The R squared score will always be less than or equal to 1 without concerning if the values are too
large or small.
Adjusted R squared, as the name suggests, is the improved version of R squared error. R square
has a limitation of improvement of a score on increasing the terms, even though the model is not
improving, and it may mislead the data scientists.
To overcome the issue of R square, adjusted R squared is used, which will always show a lower
value than R². It is because it adjusts the values of increasing predictors and only shows
improvement if there is a real improvement.
Here,
Introduction
Statistics has a significant part in the field of data science. It helps us in the collection, analysis
and representation of data either by visualisation or by numbers into a general understandable
format. Generally, we divide statistics into two main branches which are Descriptive Statistics
and Inferential Statistics. In this article, we will discuss the Inferential statistics in detail.
Before discussing the Inferential statistics, let us see the population and sample. Population
contains all the data points from a set of data. It is a group from where we collect the data. While
a sample consists of some observations selected from the population. The sample from the
population should be selected such that it has all the characteristics that a population has.
Population’s measurable characteristics such as mean, standard deviation etc. are called as
parameters while Sample’s measurable characteristic is known as a statistic.
Descriptive statistics describe the important characteristics of data by using mean, median, mode,
variance etc. It summarises the data through numbers and graphs.
In Inferential statistics, we make an inference from a sample about the population. The main aim
of inferential statistics is to draw some conclusions from the sample and generalise them for the
population data. E.g. we have to find the average salary of a data analyst across India. There are
two options.
1. The first option is to consider the data of data analysts across India and ask them their
salaries and take an average.
2. The second option is to take a sample of data analysts from the major IT cities in India
and take their average and consider that for across India.
The first option is not possible as it is very difficult to collect all the data of data analysts across
India. It is time-consuming as well as costly. So, to overcome this issue, we will look into the
second option to collect a small sample of salaries of data analysts and take their average as India
average. This is the inferential statistics where we make an inference from a sample about the
population.
It is a measure of the chance of occurrence of a phenomenon. We will now discuss some terms
which are very important in probability:
Conditional probability is the probability of a particular event Y, given a certain condition which
has already occurred , i.e., X. Then conditional probability, P(Y|X) is defined as,
The mathematical function describing the randomness of a random variable is called probability
distribution. It is a depiction of all possible outcomes of a random variable and their associated
probabilities
F(x) = P {s ε S; X(s) ≤ x}
Or,
F(x) = P {X ≤ x}
E.g. P (X > 7) = 1- P (X ≤ 7)
= 1- {P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4) + P (X = 5) + P (X = 6) + P
(X = 7)}
Sampling Distribution
Probability distribution of statistics of a large number of samples selected from the population is
called sampling distribution. When we increase the size of sample, sample mean becomes more
normally distributed around population mean. The variability of the sample decreases as we
increase sample size.
CLT tells that when we increase the sample size, the distribution of sample means becomes
normally distributed as the sample, whatever be the population distribution shape. This theorem
is particularly true when we have a sample of size greater than 30. The conclusion is that if we
take a greater number of samples and particularly of large sizes, the distribution of sample means
in a graph will look like to follow the normal distribution.
In the above graph we can see that when we increase the value of n i.e. sample size, it is
approaching the shape of normal distribution.
Confidence Interval
Confidence Interval is an interval of reasonable values for our parameters. Confidence intervals
are used to give an interval estimation for our parameter of interest.
The margin of error is found by multiplying the standard error of the mean and the z-score.
Confidence interval having a value of 95% indicates that we are 95% sure that the actual mean is
within our confidence interval.
Hypothesis Testing
Hypothesis testing is a part of statistics in which we make assumptions about the population
parameter. So, hypothesis testing mentions a proper procedure by analysing a random sample of
the population to accept or reject the assumption.
Type of Hypothesis
1. Null hypothesis: Null hypothesis is a type of hypothesis in which we assume that the
sample observations are purely by chance. It is denoted by H0.
1. Alternate hypothesis: The alternate hypothesis is a hypothesis in which we assume that
sample observations are not by chance. They are affected by some non-random situation.
An alternate hypothesis is denoted by H1 or Ha.
Steps of Hypothesis Testing
The process to determine whether to reject a null hypothesis or to fail to reject the null
hypothesis, based on sample data is called hypothesis testing. It consists of four steps:
Technically, we never accept the null hypothesis, we say that either we fail to reject or we reject
the null hypothesis.
Terms in Hypothesis testing
Significance level
The significance level is defined as the probability of the case when we reject the null hypothesis
but in actual it is true. E.g., a 0.05 significance level indicates that there is 5% risk in assuming
that there is some difference when in actual there is no difference. It is denoted by alpha (α).
The above figure shows that the two shaded regions are equidistant from the null hypothesis,
each having a probability of 0.025 and a total of 0.05 which is our significance level. The shaded
region in case of a two-tailed test is called critical region.
P-value
The p-value is defined as the probability of seeing a t-statistic as extreme as the calculated value
if the null hypothesis value is true. Low enough p-value is ground for rejecting the null
hypothesis. We reject the null hypothesis if the p-value is less than the significance level.
We have explained what is hypothesis testing and the steps to do the testing. Now during
performing the hypothesis testing, there might be some errors.
1. Type-1 error: Type 1 error is the case when we reject the null hypothesis but in actual it
was true. The probability of having a Type-1 error is called significance level alpha(α).
2. Type-2 error: Type 2 error is the case when we fail to reject the null hypothesis but
actually it is false. The probability of having a type-2 error is called beta(β).
Therefore,
P= 1- Type-2 error
=1–β
Lesser the type-2 error more the power of the hypothesis test.
Decision –>
/ Reject the null hypothesis Fail to reject the null hypothesis
Actual
Z-test
A Z-test is mainly used when the data is normally distributed. We find the Z-statistic of the
sample means and calculate the z-score. Z-score is given by the formula,
Z-score = (x – µ) / σ
Z-test is mainly used when the population mean and standard deviation are given.
T-test
The t-test is similar to z-test. The only difference is that it is used when we have sample standard
deviation but don’t have population standard, or have a small sample size (n<30).
The one-sample t-test compares the mean of sample data to a known value like if we have to
compare the mean of sample data to the population mean we use the One-Sample T-test.
We can run a one-sample T-test when we do not have the population S.D. or we have a sample
of size less than 30.
We use a two-sample T-test when we want to evaluate whether the mean of the two samples is
different or not. In two-sample T-test we have another two categories:
• Independent Sample T-test: Independent sample means that the two different samples
should be selected from two completely different populations. In other words, we can say
that one population should not be dependent on the other population.
• Paired T-test: If our samples are connected in some way, we have to use paired t-test.
Here connecting means thatThe samples are connected as we are collecting data from the
same group two times e.g. blood test of patients of a hospital before and after medication.
Chi-square test
Chi-square test is used in the case when we have to compare categorical data. Chi-square test is
of two types. Both use chi-square statistics and distribution for different purposes.
ANOVA test is a way to find out if an experiment results are significant or not. It is generally
used when there are more than 2 groups and we have to test the hypothesis that the mean of
multiple populations and variances of multiple populations are equal.
E.g. Students from different colleges take the same exam. We want to see if one college
outperforms others.
1. One-way ANOVA
2. Two-way ANOVA
The test statistic in Anova is given by:
Conclusion
In this article, we studied inferential statistics and the different topics in it like probability,
hypothesis testing, and different types of tests in hypothesis. Also, we discussed the importance
of inferential statistics and how we can make inference about the population by sample data
which in turn is time-consuming and cost-saving.
3.2 Machine Learning for Data Analysis
Over the course of an hour, an unsolicited email skips your inbox and goes straight to spam, a
car next to you auto-stops when a pedestrian runs in front of it, and an ad for the product you
were thinking about yesterday pops up on your social media feed. What do these events all
have in common? It’s artificial intelligence that has guided all these decisions. And the force
behind them all is machine-learning algorithms that use data to predict outcomes.
Now, before we look at how machine learning aids data analysis, let’s explore the
fundamentals of each.
What is Machine Learning?
Machine learning is the science of designing algorithms that learn on their own from data and
adapt without human correction. As we feed data to these algorithms, they build their own
logic and, as a result, create solutions relevant to aspects of our world as diverse as fraud
detection, web searches, tumor classification, and price prediction.
In deep learning, a subset of machine learning, programs discover intricate concepts by
building them out of simpler ones. These algorithms work by exposing multilayered (hence
“deep”) neural networks to vast amounts of data. Applications for machine learning, such
as natural language processing, dramatically improve performance through the use of deep
learning.
What is Data Analysis?
Data analysis involves manipulating, transforming, and visualizing data in order to infer
meaningful insights from the results. Individuals, businesses,and even governments often take
direction based on these insights.
Data analysts might predict customer behavior, stock prices, or insurance claims by using basic
linear regression. They might create homogeneous clusters using classification and regression
trees (CART), or they might gain some impact insight by using graphs to visualize a financial
technology company’s portfolio.
Until the final decades of the 20th century, human analysts were irreplaceable when it came to
finding patterns in data. Today, they’re still essential when it comes to feeding the right kind of
data to learning algorithms and inferring meaning from algorithmic output, but machines can
and do perform much of the analytical work itself.
Why Machine Learning is Useful in Data Analysis
Machine learning constitutes model-building automation for data analysis. When we assign
machines tasks like classification, clustering, and anomaly detection — tasks at the core of
data analysis — we are employing machine learning.
We can design self-improving learning algorithms that take data as input and offer statistical
inferences. Without relying on hard-coded programming, the algorithms make decisions
whenever they detect a change in pattern.
Before we look at specific data analysis problems, let’s discuss some terminology used to
categorize different types of machine-learning algorithms. First, we can think of most
algorithms as either classification-based, where machines sort data into classes, or regression-
based, where machines predict values.
Next, let’s distinguish between supervised and unsupervised algorithms. A supervised
algorithm provides target values after sufficient training with data. In contrast, the information
used to instruct an unsupervised machine-learning algorithm needs no output variable to guide
the learning process.
For example, a supervised algorithm might estimate the value of a home after reviewing the
price (the output variable) of similar homes, while an unsupervised algorithm might look for
hidden patterns in on-the-market housing.
As popular as these machine-learning models are, we still need humans to derive the final
implications of data analysis. Making sense of the results or deciding, say, how to clean the
data remains up to us humans.
Machine-Learning Algorithms for Data Analysis
Now let’s look at six well-known machine-learning algorithms used in data analysis. In
addition to reviewing their structure, we’ll go over some of their real-world applications.
Clustering
At a local garage sale, you buy 70 monochromatic shirts, each of a different color. To avoid
decision fatigue, you design an algorithm to help you color-code your closet. This algorithm
uses photos of each shirt as input and, comparing the color of each shirt to the others, creates
categories to account for every shirt. We call this clustering: an unsupervised learning
algorithm that looks for patterns among input values and groups them accordingly. Here is
a GeeksForGeeks article that provides visualizations of this machine-learning model.
Decision-tree learning
You can think of a decision tree as an upside-down tree: you start at the “top” and move
through a narrowing range of options. These learning algorithms take a single data set and
progressively divide it into smaller groups by creating rules to differentiate the features it
observes. Eventually, they create sets small enough to be described by a specific label. For
example, they might take a general car data set (the root) and classify it down to a make and
then to a model (the leaves).
As you might have gathered, decision trees are supervised learning algorithms ideal for
resolving classification problems in data analysis, such as guessing a person’s blood type.
Check out this in-depth Medium article that explains how decision trees work.
Ensemble learning
Imagine you’re en route to a camping trip with your buddies, but no one in the group
remembered to check the weather. Noting that you always seem dressed appropriately for the
weather, one of your buddies asks you to stand in as a meteorologist. Judging from the time of
year and the current conditions, you guess that it’s going to be 72°F (22°C) tomorrow.
Now imagine that everyone in the group came with their own predictions for tomorrow’s
weather: one person listened to the weatherman; another saw Doppler radar reports online; a
third asked her parents; and you made your prediction based on current conditions.
Do you think you, the group’s appointed meteorologist, will have the most accurate prediction,
or will the average of all four guesses be closer to the actual weather tomorrow? Ensemble
learning dictates that, taken together, your predictions are likely to be distributed around the
right answer. The average will likely be closer to the mark than your guess alone.
In technical terms, this machine-learning model frequently used in data analysis is known as
the random forest approach: by training decision trees on random subsets of data points, and
by adding some randomness into the training procedure itself, you build a forest of diverse
trees that offer a more robust average than any individual tree. For a deeper dive, read
this tutorial on implementing the random forest approach in Python.
Support-vector machine
Have you ever struggled to differentiate between two species — perhaps between alligators
and crocodiles? After a long while, you manage to learn how: alligators have a U-shaped
snout, while crocodiles’ mouths are slender and V-shaped; and crocodiles have a much toothier
grin than alligators do. But on a trip to the Everglades, you come across a reptile that,
perplexingly, has features of both — so how can you tell the difference? Support-vector
machine (SVM) algorithms are here to help you out.
First, let’s draw a graph with one distinguishing feature (snout shape) as the x-axis and another
(grin toothiness) as the y-axis. We’ll populate the graph with plenty of data points for both
species, and then find possible planes (or, in this 2D case, lines) that separate the two classes.
Our objective is to find a single “hyperplane” that divides the data by maximizing the distance
between the dividing plane and each class’s closest points — called support vectors. No more
confusion between crocs and gators: once the SVM finds this hyperplane, you can easily
classify the reptiles in your vacation photos by seeing which side each one lands on.
SVM algorithms can only be used on categorical data, but it’s not always possible to
differentiate between classes with 2D graphs. To resolve this, you can use a kernel: an
established pattern to map data to higher dimensions. By using a combination of kernels and
tweaks to their parameters, you’ll be able to find a non-linear hyperplane and continue on your
way distinguishing between reptiles. This YouTube video does a clear job of visualizing how
kernels integrate with SVM.
Linear regression
If you’ve ever used a scatterplot to find a cause-and-effect relationship between two sets of
data, then you’ve used linear regression. This is a modeling method ideal for forecasting and
finding correlations between variables in data analysis.
For example, say you want to see if there’s a connection between fatigue and the number of
hours someone works. You gather data from a set of people with a wide array of work
schedules and plot your findings. Seeking a relationship between the independent variable
(hours worked) and the dependent variable (fatigue), you notice that a straight line with a
positive slope best models the correlation. You’ve just used linear regression! If you’re
interested in a detailed understanding of linear regression for machine learning, check out
this blog pos from Machine Learning Mastery.
Logistic regression
While linear regression algorithms look for correlations between variables that are continuous
by nature, logistic regression is ideal for classifying categorical data. Our alligator-versus-
crocodile problem is, in fact, a logistic regression problem. Whereas the SVM model can work
with non-linear kernels, logistic regression is limited to (and great for) linear classification. See
this in-depth overview of logistic regression, especially good for lovers of calculus.
Summary
In this article, we looked at how machine learning can automate and scale data analysis. We
summarized a few important machine-learning algorithms and saw their real-life applications.
While machine learning offers precision and scalability in data analysis, it’s important to
remember that the real work of evaluating machine learning results still belongs to humans.
3.3 What Is Descriptive Statistics - Definition, Types, & More
If you work with datasets long enough, you will eventually need to deal with statistics. Ask the
average person what statistics are, and they’ll probably throw around words like “numbers,”
“figures,” and “research.”
Statistics further breaks down into two types: descriptive and inferential. Today, we look at
descriptive statistics, including a definition, the types of descriptive statistics, and the differences
between descriptive statistics and inferential statistics.
Descriptive statistics describe, show, and summarize the basic features of a dataset found in a
given study, presented in a summary that describes the data sample and its measurements. It
helps analysts to understand the data better.
Descriptive statistics represent the available data sample and do not include theories, inferences,
probabilities, or conclusions. That’s a job for inferential statistics.
Descriptive Statistics Examples
If you want a good example of descriptive statistics, look no further than a student’s grade point
average (GPA). A GPA gathers the data points created through a large selection of grades,
classes, and exams then average them together and presents a general idea of the student’s mean
academic performance. Note that the GPA doesn’t predict future performance or present any
conclusions. Instead, it provides a straightforward summary of students’ academic success based
on values pulled from data.
Here’s an even simpler example. Let’s assume a data set of 2, 3, 4, 5, and 6 equals a sum of 20.
The data set’s mean is 4, arrived at by dividing the sum by the number of values (20 divided by 5
equals 4).
Analysts often use charts and graphs to present descriptive statistics. If you stood outside of a
movie theater, asked 50 members of the audience if they liked the film they saw, then put your
findings on a pie chart, that would be descriptive statistics. In this example, descriptive statistics
measure the number of yes and no answers and show how many people in this specific theater
liked or disliked the movie. If you tried to come up with any other conclusions, you would be
wandering into inferential statistics territory, but we'll later cover that issue.
Finally, political polling is considered a descriptive statistic, provided it’s just presenting
concrete facts (the respondents’ answers), without drawing any conclusions. Polls are relatively
straightforward: “Who did you vote for President in the recent election?”
Descriptive statistics break down into several types, characteristics, or measures. Some authors
say that there are two types. Others say three or even four.
Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to
summarize the frequency of every possible value of a variable, rendered in percentages or
numbers. For instance, if you held a poll to determine people’s favorite Beatle, you’d set up one
column with all possible variables (John, Paul, George, and Ringo), and another with the number
of votes.
Measures of central tendency estimate a dataset's average or center, finding the result using three
methods: mean, mode, and median.
Mean: The mean is also known as “M” and is the most common method for finding averages.
You get the mean by adding all the response values together, and dividing the sum by the
number of responses, or “N.” For instance, say someone is trying to figure out how many hours a
day they sleep in a week. So, the data set would be the hour entries (e.g., 6,8,7,10,8,4,9), and the
sum of those values is 52. There are seven responses, so N=7. You divide the value sum of 52 by
N, or 7, to find M, which in this instance is 7.3.
Mode: The mode is just the most frequent response value. Datasets may have any number of
modes, including “zero.” You can find the mode by arranging your dataset's order from the
lowest to highest value and then looking for the most common response. So, in using our sleep
study from the last part: 4,6,7,8,8,9,10. As you can see, the mode is eight.
Median: Finally, we have the median, defined as the value in the precise center of the dataset.
Arrange the values in ascending order (like we did for the mode) and look for the number in the
set’s middle. In this case, the median is eight.
The measure of variability gives the statistician an idea of how spread out the responses are. The
spread has three aspects — range, standard deviation, and variance.
Range: Use range to determine how far apart the most extreme values are. Start by subtracting
the dataset’s lowest value from its highest value. Once again, we turn to our sleep study:
4,6,7,8,8,9,10. We subtract four (the lowest) from ten (the highest) and get six. There’s your
range.
Standard Deviation: This aspect takes a little more work. The standard deviation (s) is your
dataset’s average amount of variability, showing you how far each score lies from the mean. The
larger your standard deviation, the greater your dataset’s variable. Follow these six steps:
9 9-7.3=1.7 2.89
Variance: Variance reflects the dataset’s degree spread. The greater the degree of data spread, the
larger the variance relative to the mean. You can get the variance by just squaring the standard
deviation. Using the above example, we square 1.992 and arrive at 3.971.
Univariate descriptive statistics examine only one variable at a time and do not compare
variables. Rather, it allows the researcher to describe individual variables. As a result, this sort of
statistic is also known as descriptive statistics. The patterns identified in this sort of data may be
explained using the following:
• Data dispersion (standard deviation, variance, range, minimum, maximum, and quartiles)
(standard deviation, variance, range, minimum, maximum, and quartiles)
• Pie graphs
• Bar graphs
When using bivariate descriptive statistics, two variables are concurrently analyzed (compared)
to see whether they are correlated. Generally, by convention, the independent variable is
represented by the columns, and the rows represent the dependent variable.'
There are numerous real-world applications for bivariate data. For example, estimating when a
natural occurrence will occur is quite valuable. Bivariate data analysis is a tool in the
statistician's toolbox. Sometimes, something as simple as projecting one parameter against the
other on a Two-dimensional plane can better understand what the information is trying to
convince you. For example, the scatterplot below demonstrates the link between the period
between eruptions at Old Faithful and the eruption's duration.
Univariate vs. Bivariate
Univariate Bivariate
• Simultaneous analysis of
two variables
What is the Main Purpose of Descriptive Statistics?
Descriptive statistics can be useful for two things: 1) providing basic information about variables
in a dataset and 2) highlighting potential relationships between variables. Graphical/Pictorial
Methods are measures of the three most common descriptive statistics that can be displayed
graphically or pictorially. It is used to summarise data. Descriptive statistics only make
statements about the data set used to calculate them; they never go beyond your data.
Scatter Plots
A scatter plot employs dots to indicate values for two separate numeric variables. Each dot's
location on the horizontal and vertical axes represents a data point's values. Scatter plots are
being used to monitor relationships between variables.
The main purposes of scatter plots are to examine and display relationships between two
numerical variables. The points in a scatter plot document the values of individual points and
trends when the data is obtained as a whole. Identification of correlational links is prevalent with
scatter plots. In these situations, we want to know what a good vertical value prediction would be
given a specific horizontal value.
This can lead to overplotting when there are many data points to plot. When data points are
overlaid to the point where it is difficult to see the connections between them and the variables,
this is known as overplotting. It might be difficult to discern how densely-packed data points are
when lots of them are in a tiny space.
There are a couple simple methods to relieve this issue. One approach is to choose only a subset
of data points: a random sample of points should still offer the basic sense of the patterns in the
whole data. Additionally, we can alter the shape of the dots by increasing transparency to make
overlaps visible or decreasing point size to minimise overlaps.
So, what’s the difference between the two statistical forms? We’ve already touched upon this
when we mentioned that descriptive statistics doesn’t infer any conclusions or predictions, which
implies that inferential statistics do so.
Inferential statistics takes a random sample of data from a portion of the population and
describes and makes inferences about the entire population. For instance, in asking 50 people if
they liked the movie they had just seen, inferential statistics would build on that and assume that
those results would hold for the rest of the moviegoing population in general.
Therefore, if you stood outside that movie theater and surveyed 50 people who had just seen
Rocky 20: Enough Already! and 38 of them disliked it (about 76 percent), you could extrapolate
that 76% of the rest of the movie-watching world will dislike it too, even though you haven’t the
means, time, and opportunity to ask all those people.
Simply put: Descriptive statistics give you a clear picture of what your current data shows.
Inferential statistics makes projections based on that data.
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine the
probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using product rule and conditional probability of event A with
known event B:
The above equation (a) is called as Bayes' rule or Bayes' theorem. This equation is basic of most
modern AI systems for probabilistic inference.
It shows the simple relationship between joint and conditional probabilities. Here,
P(A|B) is known as posterior, which we need to calculate, and it will be read as Probability of
hypothesis A when we have occurred an evidence B.
P(B|A) is called the likelihood, in which we consider that hypothesis is true, then we calculate the
probability of evidence.
P(A) is called the prior probability, probability of hypothesis before considering the evidence
In the equation (a), in general, we can write P (B) = P(A)*P(B|Ai), hence the Bayes' rule can be
written as:
Where A1, A2, A3,........, An is a set of mutually exclusive and exhaustive events.
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This
is very useful in cases where we have a good probability of these three terms and want to determine
the fourth one. Suppose we want to perceive the effect of some unknown cause, and want to
compute that cause, then the Bayes' rule becomes:
Example-1:
Question: what is the probability that a patient has diseases meningitis with a stiff neck?
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs 80%
of the time. He is also aware of some more facts, which are given as follows:
Let a be the proposition that patient has stiff neck and b be the proposition that patient has
meningitis. , so we can calculate the following as:
P(a|b) = 0.8
P(b) = 1/30000
P(a)= .02
Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff neck.
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The probability
that the card is king is 4/52, then calculate posterior probability P(King|Face), which means
the drawn face card is a king card.
Solution:
o It is used to calculate the next step of the robot when the already executed step is given.
o K-NN algorithm assumes the similarity between the new case/data and available cases and
put the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means when new data appears then it can be easily classified into a well
suite category by using K- NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new data, then
it classifies that data into a category that is much similar to the new data.
o Example: Suppose, we have an image of a creature that looks similar to cat and dog, but
we want to know either it is a cat or dog. So for this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the similar
features of the new data set to the cats and dogs images and based on the most similar
features it will put it in either cat or dog category.
Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1, so this data point will lie in which of these categories. To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular dataset. Consider the below diagram:
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
Suppose we have a new data point and we need to put it in the required category. Consider the
below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. It
can be calculated as:
o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.
o Large values for K are good, but it may find some difficulties.
o It is simple to implement.
o Always needs to determine the value of K which may be complex some time.
o The computation cost is high because of calculating the distance between the data points
for all the training samples.
To do the Python implementation of the K-NN algorithm, we will use the same problem and
dataset which we have used in Logistic Regression. But here we will improve the performance of
the model. Below is the problem description:
Problem for K-NN Algorithm: There is a Car manufacturer company that has manufactured a
new SUV car. The company wants to give the ads to the users who are interested in buying that
SUV. So for this problem, we have a dataset that contains multiple user's information through the
social network. The dataset contains lots of information but the Estimated Salary and Age we
will consider for the independent variable and the Purchased variable is for the dependent
variable. Below is the dataset:
Steps to implement the K-NN algorithm:
The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the
code for it:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
5.
6. #importing datasets
7. data_set= pd.read_csv('user_data.csv')
8.
9. #Extracting Independent and dependent Variable
10. x= data_set.iloc[:, [2,3]].values
11. y= data_set.iloc[:, 4].values
12.
13. # Splitting the dataset into training and test set.
14. from sklearn.model_selection import train_test_split
15. x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
16.
17. #feature Scaling
18. from sklearn.preprocessing import StandardScaler
19. st_x= StandardScaler()
20. x_train= st_x.fit_transform(x_train)
21. x_test= st_x.transform(x_test)
By executing the above code, our dataset is imported to our program and well pre-processed. After
feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.
And then we will fit the classifier to the training data. Below is the code for it:
1. #Fitting K-NN classifier to the training set
2. from sklearn.neighbors import KNeighborsClassifier
3. classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2 )
4. classifier.fit(x_train, y_train)
Output: By executing the above code, we will get the output as:
Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
o Predicting the Test Result: To predict the test set result, we will create a y_pred vector
as we did in Logistic Regression. Below is the code for it:
Output:
In above code, we have imported the confusion_matrix function and called it using the variable
cm.
Output: By executing the above code, we will get the matrix as below:
In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say
that the performance of the model is improved by using the K-NN algorithm.
Output:
o As we can see the graph is showing the red point and green points. The green points
are for Purchased(1) and Red Points for not Purchased(0) variable.
o The graph is showing an irregular boundary instead of showing any straight line or
any curve because it is a K-NN algorithm, i.e., finding the nearest neighbor.
o The graph has classified users in the correct categories as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in the
green region.
o The graph is showing good result but still, there are some green points in the red
region and red points in the green region. But this is no big issue as by doing this
model is prevented from overfitting issues.
Output:
The above graph is showing the output for the test data set. As we can see in the graph, the
predicted output is well good as most of the red points are in the red region and most of the green
points are in the green region.
However, there are few green points in the red region and a few red points in the green region. So
these are the incorrect observations that we have observed in the confusion matrix(7 Incorrect
output).
It assumes that the data has a Gaussian distribution, which may not always be the case.
It assumes that the covariance matrices of the different classes are equal, which may not be true
in some datasets.
It assumes that the data is linearly separable, which may not be the case for some datasets.
It may not perform well in high-dimensional feature spaces.
Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function
Analysis is a dimensionality reduction technique that is commonly used for supervised
classification problems. It is used for modelling differences in groups i.e. separating two or more
classes. It is used to project the features in higher dimension space into a lower dimension space.
For example, we have two classes and we need to separate them efficiently. Classes can have
multiple features. Using only a single feature to classify them may result in some overlapping as
shown in the below figure. So, we will keep on increasing the number of features for proper
classification.
Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane,
there’s no straight line that can separate the two classes of the data points completely. Hence, in
this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a 1D
graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories and
hence, reducing the 2D graph into a 1D graph.
But Linear Discriminant Analysis fails when the mean of the distributions are shared, as it
becomes impossible for LDA to find a new axis that makes both the classes linearly separable. In
such cases, we use non-linear discriminant analysis.
Mathematics
Let’s suppose we have two classes and a d- dimensional samples such as x1, x2 … xn, where:
• n1 samples coming from the class (c1) and n2 coming from the class (c2).
If xi is the data point, then its projection on the line represented by unit vector v can be written as
vTxi
Let’s consider u1 and u2 be the means of samples class c1 and c2 respectively before projection
and u1hat denotes the mean of the samples of class after projection and it can be calculated by:
Similarly,
Similarly:
Now, we need to project our data on the line having direction v which maximizes
For maximizing the above equation we need to find a projection vector that maximizes the
difference of means of reduces the scatters of both classes. Now, scatter matrix of s1 and s2 of
classes c1 and c2 are:
and s2
Now, To maximize the above equation we need to calculate differentiation with respect to v
Here, for the maximum value of J(v) we will use the value corresponding to the highest
eigenvalue. This will provide us the best solution for LDA.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used
such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually covariance), moderating the influence of different variables on LDA.
Implementation
• In this implementation, we will perform linear discriminant analysis using the Scikit-learn
library on the Iris dataset.
• Python3
# necessary import
import numpy as np
import pandas as pd
import sklearn
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)
lda = LinearDiscriminantAnalysis(n_components=2)
plt.scatter(
X_train[:,0],X_train[:,1],c=y_train,cmap='rainbow',
alpha=0.7,edgecolors='b'
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(conf_m)
Accuracy : 0.9
[[10 0 0]
[ 0 9 3]
[ 0 0 8]]
Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular
application in which each face is represented by a very large number of pixel values. Linear
discriminant analysis (LDA) is used here to reduce the number of features to a more
manageable number before the process of classification. Each of the new dimensions
generated is a linear combination of pixel values, which form a template. The linear
combinations obtained using Fisher’s linear discriminant are called Fisher’s faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient
disease state as mild, moderate, or severe based upon the patient’s various parameters and
the medical treatment he is going through. This helps the doctors to intensify or reduce the
pace of their treatment.
3. Customer Identification: Suppose we want to identify the type of customers who are most
likely to buy a particular product in a shopping mall. By doing a simple question and
answers survey, we can gather all the features of the customers. Here, a Linear discriminant
analysis will help us to identify and select the features which can describe the characteristics
of the group of customers that are most likely to buy that particular product in the shopping
mall.
Regression analysis is a statistical method to model the relationship between a dependent (target)
and independent (predictor) variables with one or more independent variables. More specifically,
Regression analysis helps us to understand how the value of the dependent variable is changing
corresponding to an independent variable when other independent variables are held fixed. It
predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A, who does various advertisement every year
and get sales on that. The below list shows the advertisement made by the company in the last 5
years and the corresponding sales:
Now, the company wants to do the advertisement of $200 in the year 2019 and wants to know
the prediction about the sales for this year. So to solve such type of prediction problems in
machine learning, we need regression analysis.
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on the one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables.
In Regression, we plot a graph between the variables which best fits the given datapoints, using
this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the datapoints on target-
predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between datapoints and line tells whether a model has
captured a strong relationship or not.
o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable, also
called as a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high value
in comparison to other observed values. An outlier may hamper the result, so it should be
avoided.
o Multicollinearity: If the independent variables are highly correlated with each other than
other variables, then such condition is called Multicollinearity. It should not be present in
the dataset, because it creates problem while ranking the most affecting variable.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is called underfitting.
As mentioned above, Regression analysis helps in the prediction of a continuous variable. There
are various scenarios in the real world where we need some future predictions such as weather
condition, sales prediction, marketing trends, etc., for such case we need some technology which
can make predictions more accurately. So for such case we need Regression analysis which is a
statistical method and used in machine learning and data science. Below are some other reasons
for using Regression analysis:
o Regression estimates the relationship between the target and the independent variable.
o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.
Types of Regression
There are various types of regressions which are used in data science and machine learning. Each
type has its own importance on different scenarios, but at the core, all the regression methods
analyze the effect of the independent variable on dependent variables. Here we are discussing some
important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Ridge Regression
o Lasso Regression:
Linear Regression:
o Linear regression is a statistical regression method which is used for predictive analysis.
o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.
o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.
o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using
the below image. Here we are predicting the salary of an employee on the basis of the year
of experience.
1. Y= aX+b
o Salary forecasting
Logistic Regression:
o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.
o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or
No, True or False, Spam or not spam, etc.
o Logistic regression is a type of regression, but it is different from the linear regression
algorithm in the term how they are used.
o Logistic regression uses sigmoid function or logistic function which is a complex cost
function. This sigmoid function is used to model the data in logistic regression. The
function can be represented as:
When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels, values above the threshold level are rounded up to
1, and values below the threshold level are rounded up to 0.
o Binary(0/1, pass/fail)
Polynomial Regression:
o Polynomial Regression is a type of regression which models the non-linear dataset using
a linear model.
o It is similar to multiple linear regression, but it fits a non-linear curve between the value of
x and corresponding conditional values of y.
o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover
such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.
o The equation for polynomial regression also derived from linear regression equation that
means Linear regression equation Y= b0+ b1x, is transformed into Polynomial regression
equation Y= b0+b1x+ b2x2+ b3x3+.....+ bnxn.
o Here Y is the predicted/target output, b0, b1,... bn are the regression coefficients. x is
our independent/input variable.
o The model is still linear as the coefficients are still linear with quadratic
Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the same
degree.
Support Vector Machine is a supervised learning algorithm which can be used for regression as
well as classification problems. So if we use it for regression problems, then it is termed as Support
Vector Regression.
Support Vector Regression is a regression algorithm which works for continuous variables. Below
are some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.
o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is
a line which helps to predict the continuous variables and cover most of the datapoints.
o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a
margin for datapoints.
o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane
and opposite class.
In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must contain
a maximum number of datapoints. Consider the below image:
Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
o A decision tree is constructed starting from the root node/parent node (dataset), which splits
into left and right child nodes (subsets of dataset). These child nodes are further divided
into their children node, and themselves become the parent node of those nodes. Consider
the below image:
Above image showing the example of Decision Tee regression, here, the model is trying to predict
the choice of a person between Sports cars or Luxury car.
o Random forest is one of the most powerful supervised learning algorithms which is capable
of performing regression as well as classification tasks.
o The Random Forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output based on the average of each tree output. The
combined decision trees are called as base models, and it can be represented more formally
as:
o With the help of Random Forest regression, we can prevent Overfitting in the model by
creating random subsets of the dataset.
Ridge Regression:
o Ridge regression is one of the most robust versions of linear regression in which a small
amount of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as Ridge Regression penalty. We can
compute this penalty term by multiplying with the lambda to the squared weight of each
individual features.
o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model.
o It is similar to the Ridge Regression except that penalty term contains only the absolute
weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression
can only shrink it near to 0.
o It is also called as L1 regularization. The equation for Lasso regression will be:
3.8 Least Square Regression in Machine Learning
Least Square Regression is a statistical method commonly used in machine learning for
analyzing and modelling data. It involves finding the line of best fit that minimizes the sum of
the squared residuals (the difference between the actual values and the predicted values) between
the independent variable(s) and the dependent variable.
We can use Least Square Regression for both simple linear regression, where there is only one
independent variable. Also for, multiple linear regression, where there are several independent
variables. We widely use this method in a variety of fields, such as economics, engineering, and
finance, to model and predict relationships between variables. Before learning least square
regression, let’s understand linear regression.
Linear Regression
Linear regression is one of the basic statistical techniques in regression analysis. People use it
for investigating and modelling the relationship between variables (i.e dependent variable and
one or more independent variables).
Before being promptly adopted into machine learning and data science, linear models were used
as basic tools in statistics to assist prediction analysis and data mining. If the model involves
only one regressor variable (independent variable), it is called simple linear regression and if the
model has more than one regressor variable, the process is called multiple linear regression.
Let’s consider a simple example of an engineer wanting to analyze the product delivery and
service operations for vending machines. He/she wants to determine the relationship between the
time required by a deliveryman to load a machine and the volume of the products delivered. The
engineer collected the delivery time (in minutes) and the volume of the products (in a number of
cases) of 25 randomly selected retail outlets with vending machines. Scatter diagram is the
observations plotted on a graph.
Now, if I consider Y as delivery time (dependent variable), and X as product volume delivered
(independent variable). Then we can represent the linear relationship between these two
variables as
Okay! Now that looks familiar. Its equation is for a straight line, where m is the slope and c is the
y-intercept. Our objective is to estimate these unknown parameters in the regression model,
such that they give the minimal error for the given dataset. Commonly referred to as parameter
estimation or model fitting. In machine learning, the most common method of estimation is
the Least Squares method.
Least squares is a commonly used method in regression analysis for estimating the unknown
parameters by creating a model which will minimize the sum of squared errors between the
observed data and the predicted data.
Basically, it is one of the widely used methods of fitting curves that works by minimizing the
sum of squared errors as small as possible. It helps you draw a line of best fit depending on your
data points.
Given any collection of a pair of numbers and the corresponding scatter graph, the line of best fit
is the straight line that you can draw through the scatter points to best represent the relationship
between them. So, back to our equation of the straight line, we have:
Where,
Y: Dependent Variable
m: Slope
X: Independent Variable
c: y-intercept
Our aim here is to calculate the values of slope, y-intercept, and substitute them in the equation
along with the values of independent variable X, to determine the values of dependent variable
Y. Let’s assume that we have ‘n’ data points, then we can calculate slope using the scary looking
formula below:
Then, y-intercept is calculated using the formula:
Lastly, we substitute these values in the final equation Y = mX + c. Simple enough, right? Now
let’s take a real life example and implement these formulas to find the line of best fit.
Step 1: First step is to calculate the slope ‘m’ using the formula
Step 2: Next step is to calculate the y-intercept ‘c’ using the formula (ymean — m * xmean). By
doing that, the value of c approximately is c = 6.67.
Step 3: Now we have all the information needed for the equation and by substituting the
respective values in Y = mX + c, we get the following table. Using this info you can now plot the
graph.
This way by the least squares regression method provides the closest relationship between the
dependent and independent variables by minimizing the distance between the residuals (or error)
and the trend line (or line of best fit). Therefore, the sum of squares of residuals (or error) is
minimal under this approach.
Now let us master how the least squares method is implemented using Python.
Scenario
Implement a simple linear regression algorithm using Python to build a machine learning model
that studies the relationship between the shear strength of the bond between two propellers and
the age of the sustainer propellant.
Let’s begin!
Steps
# Importing Libraries
import numpy as np
import pandas as pd
Step 2: Next step is to read and load the dataset that we are working on.
# Loading dataset
data = pd.read_csv('PropallantAge.csv')
data.head()
data.info()
Copy code
This gives you a preview of your data and other related information that’s good to know. Our
aim now is to find the relationship between the age of sustainer propellant and the shear
strength of the bond between two propellers.
Step 3 (optional): You can create a scatter plot just to check the relationship between these two
variables.
Step 4: Next step is to assign X and Y as independent and dependent variables respectively.
# Computing X and Y
X = data['Age of Propellant'].values
Y = data['Shear Strength'].values
Copy code
Step 5: As we calculated manually earlier, we need to compute the mean of variables X and Y to
determine the values of slope (m) and y-intercept. Also, let n be the total number of data points.
mean_x = np.mean(X)
mean_y = np.mean(Y)
Step 6: In the next step, we will be calculating the slope and the y-intercept using the formulas
we discussed above.
num = 0
denom = 0
for i in range(n):
m = num / denom
c = mean_y - (m * mean_x)
# Printing coefficients
print("Coefficients")
print(m, c)
Copy code
The above step has given us the values of m and c. Substituting them we get,
maxx_x = np.max(X) + 10
minn_x = np.min(X) - 10
y=c+m*x
plt.legend()
plt.show()
Copy code
Output:
Well! That’s it! We successfully found the line of best fit and fitted it into the data points using
the least square regression method in machine learning. So, now using this I could verify that
there is a strong statistical relationship between the shear strength and the propellant age.
o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
o Logistic Regression is much similar to the Linear Regression except that how they are used.
Linear Regression is used for solving Regression problems, whereas Logistic regression
is used for solving the classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The below
image is showing the logistic function:
Note: Logistic regression uses the concept of predictive modeling as regression; therefore, it is
called logistic regression, but is used to classify samples; Therefore, it falls under the classification
algorithm.
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of the threshold value, which defines the
probability of either 0 or 1. Such as values above the threshold value tends to 1, and a value
below the threshold values tends to 0.
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o In Logistic Regression y can be between 0 and 1 only, so for this let's divide the above
equation by (1-y):
o But we need range between -[infinity] to +[infinity], then take logarithm of the equation it
will become:
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".
To understand the implementation of Logistic Regression in Python, we will use the below
example:
Example: There is a dataset given which contains the information of various users obtained from
the social networking sites. There is a car making company that has recently launched a new SUV
car. So the company wanted to check how many users from the dataset, wants to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic regression algorithm.
The dataset is shown in the below image. In this problem, we will predict the purchased variable
(Dependent Variable) by using age and salary (Independent variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use
the same steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use
it in our code efficiently. It will be the same as we have done in Data pre-processing topic. The
code for this is given below:
By executing the above lines of code, we will get the dataset as the output. Consider the given
image:
Now, we will extract the dependent and independent variables from the given dataset. Below is
the code for it:
In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at
index 4. The output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:
For test
set:
1. #feature Scaling
2. from sklearn.preprocessing import StandardScaler
3. st_x= StandardScaler()
4. x_train= st_x.fit_transform(x_train)
5. x_test= st_x.transform(x_test)
We have well prepared our dataset, and now we will train the dataset using the training set. For
providing training or fitting the model to the training set, we will import
the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the model to the logistic
regression. Below is the code for it:
Output: By executing the above code, we will get the below output:
Out[5]:
1. LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
2. intercept_scaling=1, l1_ratio=None, max_iter=100,
3. multi_class='warn', n_jobs=None, penalty='l2',
4. random_state=0, solver='warn', tol=0.0001, verbose=0,
5. warm_start=False)
Our model is well trained on the training set, so we will now predict the result by using test set
data. Below is the code for it:
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
Now we will create the confusion matrix here to check the accuracy of the classification. To create
it, we need to import the confusion_matrix function of the sklearn library. After importing the
function, we will call it using a new variable cm. The function takes two parameters,
mainly y_true( the actual values) and y_pred (the targeted value return by the classifier). Below
is the code for it:
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. By above
output, we can interpret that 65+24= 89 (Correct Output) and 8+3= 11(Incorrect Output).
In the above code, we have imported the ListedColormap class of Matplotlib library to create the
colormap for visualizing the result. We have created two new variables x_set and y_set to
replace x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid, which has a range of -1(minimum) to 1 (maximum). The pixel points we have
taken are of 0.01 resolution.
To create a filled contour, we have used mtp.contourf command, it will create regions of provided
colors (purple and green). In this function, we have passed the classifier.predict to show the
predicted data points predicted by the classifier.
Output: By executing the above code, we will get the below output:
The graph can be explained in the below points:
o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result
for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.
o The purple point observations are for which purchased (dependent variable) is probably
0, i.e., users who did not purchase the SUV car.
o The green point observations are for which purchased (dependent variable) is probably 1
means user who purchased the SUV car.
o We can also estimate from the graph that the users who are younger with low salary, did
not purchase the car, whereas older users with high estimated salary purchased the car.
o But there are some purple points in the green region (Buying the car) and some green points
in the purple region(Not buying the car). So we can say that younger users with a high
estimated salary purchased the car, whereas an older user with a low estimated salary did
not purchase the car.
We have successfully visualized the training set result for the logistic regression, and our goal for
this classification is to divide the users who purchased the SUV car and who did not purchase the
car. So from the output graph, we can clearly see the two regions (Purple and Green) with the
observation points. The Purple region is for those users who didn't buy the car, and Green Region
is for those users who purchased the car.
Linear Classifier:
As we can see from the graph, the classifier is a Straight line or linear in nature as we have used
the Linear model for Logistic Regression. In further topics, we will learn for non-linear Classifiers.
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we
will use x_test and y_test instead of x_train and y_train. Below is the code for it:
Output:
The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations are
in the purple region. So we can say it is a good prediction and model. Some of the green and purple
data points are in different regions, which can be ignored as we have already calculated this error
using the confusion matrix (11 Incorrect output).
Hence our model is pretty good and ready to make new predictions for this classification problem.
UNIT..4
Multi-layer Perceptron
Multi-layer perception is also known as MLP. It is fully connected dense layers,
which transform any input dimension to the desired dimension. A multi-layer
perception is a neural network that has multiple layers. To create a neural
network we combine neurons together so that the outputs of some neurons are
inputs of other neurons.
A gentle introduction to neural networks and TensorFlow can be found here:
• Neural Networks
• Introduction to TensorFlow
A multi-layer perceptron has one input layer and for each input, there is one
neuron(or node), it has one output layer with a single node for each output and
it can have any number of hidden layers and each hidden layer can have any
number of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is
depicted below.
In the multi-layer perceptron diagram above, we can see that there are
three inputs and thus three input nodes and the hidden layer has three nodes.
The output layer gives two outputs, therefore there are two output nodes. The
nodes in the input layer take input and forward it for further process, in the
diagram above the nodes in the input layer forwards their output to each of the
three nodes in the hidden layer, and in the same way, the hidden layer
processes the information and passes it to the output layer.
Every node in the multi-layer perception uses a sigmoid activation function. The
sigmoid activation function takes real values as input and converts them to
numbers between 0 and 1 using the sigmoid formula.
Stepwise Implementation
Step 1: Import the necessary libraries.
• Python3
# importing modules
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
import matplotlib.pyplot as plt
Output:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-
datasets/mnist.npz
11493376/11490434 [==============================] – 2s 0us/step
Step 3: Now we will convert the pixels into floating-point values.
• Python3
We are converting the pixel values into floating-point values to make the
predictions. Changing the numbers into grayscale values will be beneficial as
the values become small and the computation becomes easier and faster. As
the pixel values range from 0 to 256, apart from 0 the range is 255. So dividing
all the values by 255 will convert it to range from 0 to 1
Step 4: Understand the structure of the dataset
• Python3
Output:
Feature matrix: (60000, 28, 28)
Target matrix: (10000, 28, 28)
Feature matrix: (60000,)
Target matrix: (10000,)
Thus we get that we have 60,000 records in the training dataset and 10,000
records in the test dataset and Every image in the dataset is of the size
28×28.
Step 5: Visualize the data.
• Python3
Output
model = Sequential([
# dense layer 1
Dense(256, activation='sigmoid'),
# dense layer 2
Dense(128, activation='sigmoid'),
# output layer
Dense(10, activation='sigmoid'),
])
• model.compile(optimizer='adam',
• loss='sparse_categorical_crossentropy',
• metrics=['accuracy'])
Compile function is used here that involves the use of loss, optimizers, and
metrics. Here loss function used is sparse_categorical_crossentropy,
optimizer used is adam.
Step 8: Fit the model.
• Python3
What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model
reliable by increasing its generalization.
Backpropagation in neural network is a short form for “backward propagation of errors.” It is a
standard method of training artificial neural networks. This method helps calculate the
gradient of a loss function with respect to all the weights in the network.
The Back propagation algorithm in neural network computes the gradient of the loss function
for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a native
direct computation. It computes the gradient, but it does not define how the gradient is used. It
generalizes the computation in the delta rule.
Inputs X, arrive through the preconnected path
Input is modeled using real weights W. The weights are usually randomly selected.
Calculate the output for every neuron from the input layer, to the hidden layers, to the output
layer.
Calculate the error in the outputs
ErrorB= Actual Output – Desired Output
Travel back from the output layer to the hidden layer to adjust the weights such that the error
is decreased.
Keep repeating the process until the desired output is achieved
Why We Need Backpropagation?
Most prominent advantages of Backpropagation are:
• Backpropagation is fast, simple and easy to program
• It has no parameters to tune apart from the numbers of input
• It is a flexible method as it does not require prior knowledge about the network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be learned.
What is a Feed Forward Network?
A feedforward neural network is an artificial neural network where the nodes never form a
cycle. This kind of neural network has an input layer, hidden layers, and an output layer. It is
the first and simplest type of artificial neural network.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
Static Back-propagation
Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
Recurrent Back propagation in data mining is fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between both of these methods is: that the mapping is rapid in static
back-propagation while it is nonstatic in recurrent backpropagation.
Radial Basis Function (RBF) Networks are a particular type of Artificial Neural Network used
for function approximation problems. RBF Networks differ from other neural networks in their
three-layer architecture, universal approximation, and faster learning speed.
Radial Basis Functions are a special class of feed-forward neural networks consisting of three
layers: an input layer, a hidden layer, and the output layer. This is fundamentally different from
most neural network architectures, which are composed of many layers and bring about
nonlinearity by recurrently applying non-linear activation functions. The input layer receives
input data and passes it into the hidden layer, where the computation occurs. The hidden
layer of Radial Basis Functions Neural Network is the most powerful and very different from
most Neural networks. The output layer is designated for prediction tasks like classification or
regression.
The radial basis function for a neuron consists of a center and a radius (also called the
spread). The radius may vary between different neurons. In DTREG-generated RBF networks,
each dimension's radius can differ. As the spread grows larger, neurons at a distance from a
point have more influence.
RBF Network Architecture
The typical architecture of a radial basis functions neural network consists of an input layer,
Input Layer
The input layer consists of one neuron for every predictor variable. The input neurons pass
the value to each neuron in the hidden layer. N-1 neurons are used for categorical values,
where N denotes the number of categories. The range of values is standardized by
subtracting the median and dividing by the interquartile range.
Hidden Layer
The hidden layer contains a variable number of neurons (the ideal number determined by the
training process). Each neuron comprises a radial basis function centered on a point. The
number of dimensions coincides with the number of predictor variables. The radius or spread
of the RBF function may vary for each dimension.
When an x vector of input values is fed from the input layer, a hidden neuron calculates the
Euclidean distance between the test case and the neuron's center point. It then applies the
kernel function using the spread values. The resulting value gets fed into the summation
layer.
Output Layer or Summation Layer
The value obtained from the hidden layer is multiplied by a weight related to the neuron and
passed to the summation. Here the weighted values are added up, and the sum is presented
as the network's output. Classification problems have one output per target category, the
value being the probability that the case evaluated has that category. The Input Vector
It is the n-dimensional vector that you're attempting to classify. The whole input vector is
presented to each of the RBF neurons.
The RBF Neurons
Every RBF neuron stores a prototype vector (also known as the neuron's center) from
amongst the vectors of the training set. An RBF neuron compares the input vector with its
prototype, and outputs a value between 0 and 1 as a measure of similarity. If an input is the
same as the prototype, the neuron's output will be 1. As the input and prototype difference
grows, the output falls exponentially towards 0. The shape of the response by the RBF neuron
is a bell curve. The response value is also called the activation value.
The Output Nodes
The network's output comprises a set of nodes for each category you're trying to classify.
Each output node computes a score for the concerned category. Generally, we take a
classification decision by assigning the input to the category with the highest score.
The score is calculated based on a weighted sum of the activation values from all RBF
neurons. It usually gives a positive weight to the RBF neuron belonging to its category and a
negative weight to others. Each output node has its own set of weights.
o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
split the tree into subtrees.
o Below diagram explains the general structure of a decision tree:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in step -3. Continue this process until a stage is reached where you cannot further
classify the nodes and called the final node as a leaf node.
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:
ID3:
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words,
we can say that the purity of the node increases with respect to the target variable. The
decision tree splits the nodes on all available variables and then selects the split which results
in most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables. Let us look at some
algorithms used in Decision Trees:
The ID3 algorithm builds decision trees using a top-down greedy search approach through the
space of possible branches with no backtracking. A greedy algorithm, as the name suggests,
always makes the choice that seems to be the best at that moment.
From the above graph, it is quite evident that the entropy H(X) is zero when the probability is
either 0 or 1. The Entropy is maximum when the probability is 0.5 because it projects perfect
randomness in the data and there is no chance if perfectly determining the outcome.
ID3 follows the rule — A branch with an entropy of zero is a leaf node and A brach with
Information Gain
Information gain or IG is a statistical property that measures how well a given attribute
decision tree is all about finding an attribute that returns the highest information gain and the
smallest entropy.
Information Gain
Information gain is a decrease in entropy. It computes the difference between entropy before
split and average entropy after split of the dataset based on given attribute values. ID3
Information Gain
Where “before” is the dataset before the split, K is the number of subsets generated by the
Gini Index
You can understand the Gini index as a cost function used to evaluate splits in the dataset. It
is calculated by subtracting the sum of the squared probabilities of each class from one. It
favors larger partitions and easy to implement whereas information gain favors smaller
Gini Index
Gini Index works with the categorical target variable “Success” or “Failure”. It performs only
Binary splits.
Higher value of Gini index implies higher inequality, higher heterogeneity.
1. Calculate Gini for sub-nodes, using the above formula for success(p) and failure(q) (p²+q²).
2. Calculate the Gini index for split using the weighted Gini score of each node of that split.
CART (Classification and Regression Tree) uses the Gini index method to create split points.
Gain ratio
Information gain is biased towards choosing attributes with a large number of values as root
nodes. It means it prefers the attribute with a large number of distinct values.
C4.5, an improvement of ID3, uses Gain ratio which is a modification of Information gain that
reduces its bias and is usually the best option. Gain ratio overcomes the problem with
information gain by taking into account the number of branches that would result before
making the split. It corrects information gain by taking the intrinsic information of a split into
account.
Gain Ratio
Where “before” is the dataset before the split, K is the number of subsets generated by the
Reduction in Variance
Reduction in variance is an algorithm used for continuous target variables (regression
problems). This algorithm uses the standard formula of variance to choose the best split. The
split with lower variance is selected as the criteria to split the population:
Above X-bar is the mean of the values, X is actual and n is the number of values.
2. Calculate variance for each split as the weighted average of each node variance.
Chi-Square
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the
oldest tree classification methods. It finds out the statistical significance between the
differences between sub-nodes and parent node. We measure it by the sum of squares of
standardized differences between observed and expected frequencies of the target variable.
It works with the categorical target variable “Success” or “Failure”. It can perform two or more
splits. Higher the value of Chi-Square higher the statistical significance of differences between
1. Calculate Chi-square for an individual node by calculating the deviation for Success and
Failure both
2. Calculated Chi-square of Split using Sum of all Chi-square of success and Failure of each
The common problem with Decision trees, especially having a table full of columns, they fit a
lot. Sometimes it looks like the tree memorized the training data set. If there is no limit set on
a decision tree, it will give you 100% accuracy on the training data set because in the worse
case it will end up making 1 leaf for each observation. Thus this affects the accuracy when
2. Random Forest
The splitting process results in fully grown trees until the stopping criteria are reached. But,
the fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data.
In pruning, you trim off the branches of the tree, i.e., remove the decision nodes starting from
the leaf node such that the overall accuracy is not disturbed. This is done by segregating the
actual training set into two sets: training data set, D and validation data set, V. Prepare the
decision tree using the segregated training data set, D. Then continue trimming the tree
In the above diagram, the ‘Age’ attribute in the left-hand side of the tree has been pruned as it
has more importance on the right-hand side of the tree, hence removing overfitting.
Random Forest
A technique known as bagging is used to create an ensemble of trees where multiple training
Then, using a single learning algorithm a model is built on all samples. Later, the resultant
ART( Classification And Regression Tree) is a variation of the decision tree algorithm. It can
handle both classification and regression tasks. Scikit-Learn uses the Classification And
Regression Tree (CART) algorithm to train Decision Trees (also called “growing” trees).
CART Algorithm
CART is a predictive algorithm used in Machine learning and it explains how the target
variable’s values can be predicted based on other matters. It is a decision tree where each
fork is split into a predictor variable and each node has a prediction for the target variable at
the end.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an
attribute. The root node is taken as the training set and is split into two by considering the best
attribute and threshold value. Further, the subsets are also split using the same logic. This
continues till the last pure sub-set is found in the tree or the maximum number of leaves
possible in that growing tree.
The CART algorithm works via the following process:
• The best split point of each input is obtained.
• Based on the best split points of each input in Step 1, the new “best” split point is
identified.
• Split the chosen input according to the “best” split point.
• Continue splitting until a stopping rule is satisfied or no further desirable splitting is
available.
CART algorithm uses Gini Impurity to split the dataset into a decision tree .It does that by
searching for the best homogeneity for the sub nodes, with the help of the Gini index criterion.
• Where 0 depicts that all the elements are allied to a certain class, or only one class exists
there.
• The Gini index of value 1 signifies that all the elements are randomly distributed across
various classes, and
• A value of 0.5 denotes the elements are uniformly distributed into some classes.
Mathematically, we can write Gini Impurity as follows:
Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm is
then used to identify the “Class” within which the target variable is most likely to fall.
Classification trees are used when the dataset needs to be split into classes that belong to the
response variable(like yes or no)
Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is used
to predict its value. Regression trees are used when the response variable is continuous. For
example, if the response variable is the temperature of the day.
• Greedy algorithm: In this The input space is divided using the Greedy method which is
known as a recursive binary spitting. This is a numerical method within which all of the
values are aligned and several other split points are tried and assessed using a cost
function.
• Stopping Criterion: As it works its way down the tree with the training data, the recursive
binary splitting method described above must know when to stop splitting. The most frequent
halting method is to utilize a minimum amount of training data allocated to every leaf node.
If the count is smaller than the specified threshold, the split is rejected and also the node is
considered the last leaf node.
• Tree pruning: Decision tree’s complexity is defined as the number of splits in the tree. Trees
with fewer branches are recommended as they are simple to grasp and less prone to cluster
the data. Working through each leaf node in the tree and evaluating the effect of deleting it
using a hold-out test set is the quickest and simplest pruning approach.
• Data preparation for the CART: No special data preparation is required for the CART
algorithm.
Advantages of CART
• Results are simplistic.
• Classification and regression trees are Nonparametric and Nonlinear.
• Classification and regression trees implicitly perform feature selection.
• Outliers have no meaningful effect on CART.
• It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
• Overfitting.
• High Variance.
• low bias.
• the tree structure may be unstable.
Applications of the CART algorithm
• For quick Data insights.
• In Blood Donors Classification.
• For environmental and ecological data.
• In the financial sectors.
Trained Decision Trees are generally quite intuitive to understand, and easy to interpret.
Unlike most other machine learning algorithms, their entire structure can be easily visualised
in a simple flow chart. I covered the topic of interpreting Decision Trees in a previous post.
2. Robust to Outliers
A well-regularised Decision Tree will be robust to the presence of outliers in the data. This
feature stems from the fact that predictions are generated from an aggregation function (e.g.
mean or mode) over a subsample of the training data. Outliers can start to have a bigger
impact if the tree has overfitted. This topic was covered in a previous post.
The CART algorithm naturally permits the handling of missing values in the data. This enables
us to implement a Decision Tree that does not require any additional preprocessing to treat for
missing values. Most other machine learning algorithms do not have this capability. We
implemented a Decision Tree that can handle missing values in a previous post.
4. Non-Linear
Decision Trees are inherently non-linear models. They are piece-wise functions of various
different features in the feature space. As such, Decision Trees can be applied to a wide
5. Non-Parametric
CART Decision Trees do not make assumptions regarding the underlying distributions in the
data. This means we do not necessarily need to be concerned if the model is applicable to a
given problem, given the assumptions of the algorithm. There are caveats to this, however,
Combinations of features can be used in making predictions. The CART algorithm dictates
that decision rules (which are if-else conditions on the input features) are combined together
via AND relationships as one traverses the tree. This can be easily illustrated if we look at a
The CART algorithm naturally permits the handling of categorical features in the data. This
enables us to implement a Decision Tree that does not require any additional preprocessing
(e.g. One-Hot-Encoding) to treat for categorical values. Most other machine learning
algorithms do not have this capability. We implemented a Decision Tree to handle categorical
Minimal data preparation is required for Decision Trees. Since the training procedure in CART
deals with each input feature independently, at each node in the tree, data scaling and
UNIT..5
A neural network is a reflection of the human brain's behavior. It allows computer programs to
recognize patterns and solve problems in the fields of machine learning, deep learning, and
artificial intelligence. These systems are known as artificial neural networks (ANNs) or
simulated neural networks (SNNs). Google’s search algorithm is a fine example.
Neural networks are subtypes of machine learning and form the core part of deep learning
algorithms. Their structure is designed to resemble the human brain, which makes biological
neurons signal to one another. ANNs contain node layers that comprise input, one or more
hidden layers, and an output layer.
Image source
Each artificial neuron is connected to another and has an associated threshold and weight.
When the output of any node is above the threshold, that node will get activated, sending data
to the next layer. If not above the threshold, no data is passed along to the next node.
Neural networks depend on training data to learn and improve their accuracy over time. Once
these learning algorithms are tuned towards accuracy, they become powerful tools in AI. They
allow us to classify and cluster data at a high velocity. Tasks in image recognition take just
minutes to process compared to manual identification.
Neural network models are of different types and are based on their purpose. Here are some
common varieties.
Single-layer perceptronThe perceptron created by Frank Rosenblatt is the first neural network.
It contains a single neuron and is very simple in structure.
These form the base for natural language processing (NLP). They comprise an input layer, a
hidden layer, and an output layer. It is important to know that MLPs contain sigmoid neurons
and not perceptrons because most real-world problems are non-linear. Data is fed into these
modules to train them.
They are similar to MLPs but are usually used for pattern or image recognition, and computer
vision. These neural networks work with the principles of matrix multiplication to identify
patterns within an image.
They are identified with the help of feedback loops and are used with time-series data for
making predictions, such as stock market predictions.
The working of neural networks is pretty simple and can be analyzed in a few steps as shown
below:
Neurons
A neuron is the base of the neural network model. It takes inputs, does calculations, analyzes
them, and produces outputs. Three main things occur in this phase:
imimport numpy as np
import numpy as np
def sigmoid(x):
return 1 / (1 + np.exp(-x))
class Neuron:
return sigmoid(total)
bias = 4 # b = 0
n = Neuron(weights, bias)
x = np.array([2, 3]) # x1 = 2, x2 = 3
print(n.feedforward(x)) # 0.9990889488055994
Source: https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-
neural-networks-d49f22d238f9
With the help of the activation function, an unbound input is turned into an output that has a
predictable form. The sigmoid function is one such activation function. It only outpu ts the
numbers 0 and 1. The outcome with negative numbers can be 0 and positive can be 1.
A neural network is a bunch of neurons interlinked together. A simple neuron has two inputs,
a hidden layer with two neurons, and an output layer. The inputs are 0 and 1, the hidden
layers are h1 and h2, and the output layer is O1. A hidden layer can be any layer between the
input and the output layer. There can be any number of layers.
A neural network itself can have any number of layers with any number of neurons in it. The
basic principle remains the same: feed the algorithm inputs to produce the desired output.
The neural network is trained and improved upon. Mean squared error loss can be used for
the same. A quick refresher: A loss is when you find a way to quantify the efforts of your
neural network and try to improve it.
Image source
Stochastic gradient descent shows how to change weights to minimize loss. The equation is:
Image source
η is a constant known as the learning rate which governs how quickly you train. Subtract η
∂w1/∂L from w1:
Doing this for each weight in the network will see the loss decrease and improve the network.
It is vital to have a proper training process, such as:
• Choosing one sample from the dataset to make it stochastic gradient descent by only
operating on one sample at a particular time.
Once you have completed the processes above, you’re ready to implement a complete neural
network. The steps mentioned will see loss steadily decrease and accuracy improve. Practice
by running and playing with the code to gain a deeper understanding of how to refine neural
networks.
Classification Of Supervised Learning Algorithms
1. Gradient Descent
2. Stochastic
In this type of learning, the error reduction takes place with the help of weights and the
activation function of the network. The activation function should be differentiable.
The adjustment of weights depends on the error gradient E in this learning. The
backpropagation rule is an example of this type of learning. Thus the weight adjustment i s
defined as
1. Hebbian
2. Competitive
This learning was proposed by Hebb in 1949. It is based on correlative adjustment of weights.
The input and output patterns pairs are associated with a weight matrix, W.
Also known as Delta Rule, it follows gradient descent rule for linear regression.
It updates the connection weights with the difference between the target and the output value.
It is the least mean square learning algorithm falling under the category of the supervised
learning algorithm.
This rule is followed by ADALINE (Adaptive Linear Neural Networks) and MADALINE. Unlike
Perceptron, the iterations of Adaline networks do not stop, but it converges by reducing the
least mean square error. MADALINE is a network of more than one ADALINE.
The motive of the delta learning rule is to minimize the error between the output and the target
vector.The weights in ADALINE networks are updated by:
Least mean square error = (t- yin)2, ADALINE converges when the least mean square error is
reached.Learning rule enhances the Artificial Neural Network’s performance by applying this
rule over the network. Thus learning rule updates the weights and bias levels of a network
when certain conditions are met in the training process. it is a crucial part of the development
of the Neural Network.
• If two neighbor neurons are operating in the same phase at the same period of time,
then the weight between these neurons should increase.
• For neurons operating in the opposite phase, the weight between them should
decrease.
• If there is no signal correlation, the weight does not change, the sign of the weight
between two nodes depends on the sign of the input between those nodes
• When inputs of both the nodes are either positive or negative, it results in a strong
positive weight.
• If the input of one node is positive and negative for the other, a strong negative weight
is present.
Mathematical Formulation:
δw=αxiy
where δw=change in weight,α is the learning rate.xi the input vector,y the output.
Computed as follows:
y=actual output
wo=initial weight
wnew=new weight
δw=change in weight
α=learning rate
actual output(y)=wixi
δw=αxiej
wnew=wo+δw
Now, the output can be calculated on the basis of the input and the activation function applied
over the net input and can be expressed as:
It was developed by Bernard Widrow and Marcian Hoff and It depends on supervised learning
and has a continuous activation function. It is also known as the Least Mean Square method
and it minimizes error over all the training patterns.
It is based on a gradient descent approach which continues forever. It states that the
modification in the weight of a node is equal to the product of the error and the input where
the error is the difference between desired and actual output.
Computed as follows:
y=actual output
wo=initial weight
wnew=new weight
δw=change in weight
Error= ti-y
Learning signal(ej)=(ti-y)y’
δw=αxiej=αxi(ti-y)y’
wnew=wo+δw
The updating of weights can only be done if there is a difference between the target and
actual output(i.e., error) present:
wnew=wo+δw
4. Correlation Learning Rule
The correlation learning rule follows the same similar principle as the Hebbian learning
rule,i.e., If two neighbor neurons are operating in the same phase at the same period of time,
then the weight between these neurons should be more positive. For neurons operating in the
opposite phase, the weight between them should be more negative but unlike the Hebbian
rule, the correlation rule is supervised in nature here, the targeted response is used for the
calculation of the change in weight.
In Mathematical form:
δw=αxitj
where δw=change in weight,α=learning rate,xi=set of the input vector, and tj=target value
Out Star Learning Rule is implemented when nodes in a network are arranged in a layer. Here
the weights linked to a particular node should be equal to the targeted outputs for the nodes
connected through those same weights. Weight change is thus calculated as=δw=α(t-y)
Where α=learning rate, y=actual output, and t=desired output for n layer nodes.
It is also known as the Winner-takes-All rule and is unsupervised in nature. Here all the
output nodes try to compete with each other to represent the input pattern and the winner is
declared according to the node having the most outputs and is given the output 1 while the
rest are given 0.
There are a set of neurons with arbitrarily distributed weights and the activation function is
applied to a subset of neurons. Only one neuron is active at a time. Only the winner has
updated weights, the rest remain unchanged.
Linear Classification
→ Linear Classification refers to categorizing a set of data points into a discrete class based
on a linear combination of its explanatory variables.
→ Some of the classifiers that use linear functions to separate classes are Linear Discriminant
Classifier, Naive Bayes, Logistic Regression, Perceptron, SVM (linear kernel).
→ In the figure above, we have two classes, namely 'O' and '+.' To differentiate between the
two classes, an arbitrary line is drawn, ensuring that both the classes are on distinct sides.
→ Since we can tell one class apart from the other, these classes are called ‘linearly-
separable.’
→ However, an infinite number of lines can be drawn to distinguish the two classes.
→ The exact location of this plane/hyperplane depends on the type of the linear classifier.
Linear Discriminant Classifier
→ Technique - Linear Discriminant Analysis (LDA) is used, which reduced the 2D graph into a
1D graph by creating a new axis. This helps to maximize the distance between the two
classes for differentiation.
→ In the above graph, we notice that a new axis is created, which maximizes the distance
between the mean of the two classes.
→ However, the problem with LDA is that it would fail in case the means of both the classes
are the same. This would mean that we would not be able to generate a new axis for
differentiating the two.
Naive Bayes
→ It is based on the Bayes Theorem and lies in the domain of Supervised Machine Learning.
→ Every feature is considered equal and independent of the others during Classification.
A: event 1
B: event 2
However, in the case of the Naive Bayes classifier, we are concerned only with the maximum
posterior probability, so we ignore the denominator, i.e., the marginal likelihood. Argmax does
not depend on the normalization term.
→ The Naive Bayes classifier is based on two essential assumptions:-
(i) Conditional Independence - All features are independent of each other. This implies that
one feature does not affect the performance of the other. This is the sole reason behind the
‘Naive’ in ‘Naive Bayes.’
(ii) Feature Importance - All features are equally important. It is essential to know all the
features to make good predictions and get the most accurate results.
→ Naive Bayes is classified into three main types: Multinomial Naive Bayes, Bernoulli Naive
Bayes, and Gaussian Bayes.
Logistic Regression
→ The target variable can take only discrete values for a given set of features.
→ The model builds a regression model to predict the probability of a given data entry.
→ Similar to linear regression, logistic regression uses a linear function and, in addition,
makes use of the 'sigmoid' function.
• Binomial - target variable assumes only two values since binary. Example: ‘0’ or ‘1’.
• Multinomial - target variable assumes >= three unordered values since multinomial.
Example: 'Class A,' 'Class B,' and 'Class C.'
• Ordinal - target variable assumes ordered values since ordinal. Example: ‘Very Good’,
‘Good’, ‘Average, ‘poor’, ‘very poor’.
→ This model finds a hyper-plane that creates a boundary between the various data types.
→ In the case of SVM, the classifier with the highest score is chosen as the output of the SVM.
→ SVM works very well with linearly separable data but can work for non-linearly separable
data as well.
Non-Linear Classification
→ Non-Linear Classification refers to categorizing those instances that are not linearly
separable.
→ Some of the classifiers that use non-linear functions to separate classes are Quadratic
Discriminant Classifier, Multi-Layer Perceptron (MLP), Decision Trees, Random Forest, and
K-Nearest Neighbours (KNN).
→ In the figure above, we have two classes, namely 'O' and 'X.' To differentiate between the
two classes, it is impossible to draw an arbitrary straight line to ensure that both the classes
are on distinct sides.
→ We notice that even if we draw a straight line, there would be points of the first-class
present between the data points of the second class.
→ The only difference is that here, we do not assume that the mean and covariance of all
classes are the same.
→ Now, let us visualize the decision boundaries of both LDA and QDA on the iris dataset. This
would give us a clear picture of the difference between the two.
Multi-Layer Perceptron (MLP)
→ This is nothing but a collection of fully connected dense layers. These help transform any
given input dimension into the desired dimension.
→ MLP consists of one input layer(one node belonging to each input), one output layer (one
node belonging to each output), and a few hidden layers (>= one node belonging to each
hidden layer).
→ In the above diagram, we notice three inputs, resulting in 3 nodes belonging to each input.
→ Overall, the nodes belonging to the input layer forward their outputs to the nodes present in
the hidden layer. Once this is done, the hidden layer processes the information passed on to it
and then further passes it on to the output layer.
Decision Tree
→ Instances are classified by sorting them down from the root to some leaf node.
→ An instance is classified by starting at the tree's root node, testing the attribute specified by
this node, then moving down the tree branch corresponding to the attribute's value, as shown
in the above figure.
→ The process is repeated based on each derived subset in a recursive partitioning manner.
→ The above decision tree helps determine whether the person is fit or not.
→ KNN analyzes the 'k' nearest data points and then classifies the new data based on the
same.
→ In detail, to label a new point, the KNN algorithm analyzes the ‘k’ nearest neighbors or ‘k’
nearest data points to the new point. It chooses the label of the new point as the one to which
the majority of the ‘k’ nearest neighbors belong to.
→It is essential to choose an appropriate value of ‘K’ to avoid the overfitting of our model.
Simple linear regression is one of the simplest (hence the name) yet powerful regression
techniques. It has one input ($x$) and one output variable ($y$) and helps us predict the
output from trained samples by fitting a straight line between those variables. For example, we
can predict the grade of a student based upon the number of hours he/she studies using
simple linear regression.
$$y = mx +c$$
$m$ is slope,
The values which when substituted make the equation right, are the solutions. For the above
equation, (-2, 3) is one solution because when we replace x with -2 and y with +3 the
equation holds true and we get 0.
$$3 * -2 + 2 * 3 = 0$$
In simple linear regression, we assume the slope and intercept to be coefficient and bias,
respectively. These act as the parameters that influence the position of the line to be plotted
between the data.
Imagine you plotted the data points in various colors, below is the image that shows the best-
fit line drawn using linear regression.
Imagine you need to predict if a student will pass or fail an exam. We’d consider multiple
inputs like the number of hours he/she spent studying, total number of subjects and hours
he/she slept for the previous night. Since we have multiple inputs and would use multiple
linear regression.
Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction
techniques in machine learning to solve more than two-class classification problems. It is also
known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).
This can be used to project the features of higher dimensional space into lower-dimensional
space in order to reduce resources and dimensional costs. In this topic, "Linear Discriminant
Analysis (LDA) in machine learning”,
Although the logistic regression algorithm is limited to only two-class, linear Discriminant
analysis is applicable for more than two classes of classification problems.
Linear Discriminant analysis is one of the most popular dimensionality reduction techniques
used for supervised classification problems in machine learning. It is also considered a pre-
processing step for modeling differences in ML and applications of pattern classification.
Whenever there is a requirement to separate two or more classes having multiple features
efficiently, the Linear Discriminant Analysis model is considered the most common technique
to solve such classification problems. For e.g., if we have two classes with multiple features
and need to separate them efficiently. When we classify them using a single feature, then it
may show overlapping.
To overcome the overlapping issue in the classification process, we must increase the number
of features regularly.
Let's consider an example where we have two classes in a 2-D plane having an X-Y axis, and
we need to classify them efficiently. As we have already seen in the above example that LDA
enables us to draw a straight line that can completely separate the two classes of the data
points. Here, LDA uses an X-Y axis to create a new axis by separating them using a straight
line and projecting data onto a new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane into
1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.
Why LDA?
Logistic Regression is one of the most popular classification algorithms that perform well for
binary classification but falls short in the case of multiple classification problems with well-
separated classes. At the same time, LDA handles these quite efficiently.
LDA can also be used in data pre-processing to reduce the number of features, just as PCA,
which reduces the computing cost significantly.
LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract useful
data from different faces. Coupled with eigenfaces, it produces effective results.
Although, LDA is specifically used to solve supervised classification problems for two or more
classes which are not possible using logistic regression in machine learning. But LDA also
fails in some cases where the Mean of the distributions is shared. In this case, LDA fails to
create a new axis that makes both the classes linearly separable.
This classifier is designed specifically for linearly separable data, refers to the condition in
which data can be separated linearly using a hyperplane. But, what is linearly separable
data?
However, as shown in the diagram below, there can be an infinite number of hyperplanes that
will classify the linearly separable classes.
Based on the maximum margin, the Maximal-Margin Classifier chooses the optimal
hyperplane. The dotted lines, parallel to the hyperplane in the following diagram are the
margins and the distance between both these dotted lines (Margins) is the Maximum Margin.
A margin passes through the nearest points from each class; to the hyperplane. The angle
between these nearest points and the hyperplane is 90°. These points are referred to as
“Support Vectors”. Support vectors are shown by circles in the diagram below.
This classifier would choose the hyperplane with the maximum margin which is why it is
known as Maximal – Margin Classifier.
Drawbacks:
This classifier is heavily reliant on the support vector and changes as support vectors chang e.
As a result, they tend to overfit.
They can’t be used for data that isn’t linearly separable. Since the majority of real-world data
is non-linear. As a result, this classifier is inefficient.
The maximum margin classifier is also known as a “Hard Margin Classifier” because it
prevents misclassification and ensures that no point crosses the margin. It tends to overfit due
to the hard margin. An extension of the Maximal Margin Classifier, “Suppor t Vector Classifier”
was introduced to address the problem associated with it.
Support Vector Classifier is an extension of the Maximal Margin Classifier. It is less sensitive
to individual data. Since it allows certain data to be misclassified, it’s also known as the “Soft
Margin Classifier”. It creates a budget under which the misclassification allowance is granted.
Also, It allows some points to be misclassified, as shown in the following diagram. The points
inside the margin and on the margin are referred to as “Support Vectors” in this scenario.
Whereas, the points on the margins were referred to as “Support vectors” in the Maximal –
Margin Classifier.
the margin widens as the budget for misclassification increases, while the margin narrows as
the budget decreases.
While building the model, we use a hyperparameter called “Cost”. Here Cost is inverse of
budget means when the budget increases —> Cost decreases and vice versa. It is denoted by
“C”.
The influence of C’s value on the margin is depicted in the diagram below. When the value is
small, for example, C=1, the margin widens, while when the value is high, the margin narrows
down.
Support Vector Machines are an extension of Soft Margin Classifier. It can also be used for
nonlinear classification by using the kernel. As a result, this algorithm performs well in the
majority of real-world problem statements. Since, in the real world, we will mostly find non-
linear separable data, which will necessitate the use of complex classifiers to classify them.
Kernel: It transforms non-linear separable data from lower to higher dimensions to facilitate
linear classification, as illustrated in the figure below. We use the kernel-based technique to
separate non-linear data because separation can be simpler in higher dimensions.
The kernel transforms the data from lower to higher dimensions using mathematical formulas.