
MACHINE LEARNING

UNIT-I : INTRODUCTION:

1.1 Well posed learning problems


Well-Posed Learning Problem – A computer program is said to learn from experience E
with respect to some task T and some performance measure P, if its performance on T, as
measured by P, improves with experience E.
A problem qualifies as a well-posed learning problem if it has three components –
• Task
• Performance Measure
• Experience
Some examples that illustrate well-posed learning problems are –
1. To better filter emails as spam or not
• Task – Classifying emails as spam or not
• Performance Measure – The fraction of emails accurately classified as spam or not spam
• Experience – Observing you label emails as spam or not spam
2. A checkers learning problem
• Task – Playing checkers game
• Performance Measure – percent of games won against an opponent
• Experience – playing practice games against itself
3. Handwriting Recognition Problem
• Task – Recognizing handwritten words within images
• Performance Measure – percent of words accurately classified
• Experience – a database of handwritten words with given classifications
4. A Robot Driving Problem
• Task – driving on public four-lane highways using vision sensors
• Performance Measure – average distance travelled before an error
• Experience – a sequence of images and steering commands recorded while observing a human
driver
5. Fruit Prediction Problem
• Task – recognizing and classifying different fruits
• Performance Measure – the fraction of fruits correctly recognized
• Experience – training the machine with a large dataset of fruit images
6. Face Recognition Problem
• Task – recognizing different faces
• Performance Measure – the fraction of faces correctly recognized
• Experience – training the machine with a large dataset of face images
7. Automatic Translation of documents
• Task – translating a document from one language to another
• Performance Measure – the accuracy and fluency of the translations
• Experience – training the machine with a large dataset of translated documents
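For the spam-filtering example above, the (T, P, E) framing can be made concrete with a deliberately trivial "learner". The messages and the keyword heuristic below are invented purely for illustration; a real filter would use a statistical classifier.

```python
# Task T: classify a message as spam (1) or not spam (0).
# Experience E: a small set of messages labeled by the user (invented toy data).
# Performance P: fraction of messages classified correctly.

emails = [
    ("win a free prize now", 1),
    ("meeting at noon tomorrow", 0),
    ("free money click now", 1),
    ("project report attached", 0),
]

# A trivial "learner": collect words that appear only in spam messages
spam_words, ham_words = set(), set()
for text, label in emails:
    (spam_words if label else ham_words).update(text.split())
spam_only = spam_words - ham_words

def classify(text):
    # Task T: predict spam if the message contains any spam-only word
    return int(any(w in spam_only for w in text.split()))

# Performance measure P: accuracy on the experience itself
accuracy = sum(classify(t) == y for t, y in emails) / len(emails)
print(accuracy)  # 1.0 on this tiny training set
```

The point is not the (very weak) heuristic but the separation of roles: the labeled emails are E, the `classify` function performs T, and `accuracy` is P.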
1.2 Machine Learning – Applications

Introduction
Machine learning is one of the most exciting technologies one is likely to come across.
As is evident from the name, it gives the computer an ability that makes it more similar to
humans: the ability to learn. Machine learning is actively used today, perhaps in many more
places than one would expect. We probably use a learning algorithm dozens of times a day
without even knowing it. Applications of Machine Learning include:
• Web Search Engines: One of the reasons why search engines like Google and Bing work so
well is that their systems have learnt how to rank pages through complex learning algorithms.
• Photo Tagging Applications: Be it Facebook or any other photo tagging application, the
ability to tag friends makes the experience much richer. It is all made possible by a face
recognition algorithm that runs behind the application.
• Spam Detectors: Mail services like Gmail and Hotmail do a lot of hard work for us in
classifying mail and moving spam to the spam folder. This is achieved by a spam classifier
running in the back end of the mail application.
• Augmentation: Machine learning that assists humans with their day-to-day tasks, personally
or commercially, without having complete control of the output. Such machine learning is used
in different ways, such as virtual assistants, data analysis, and software solutions. The
primary use is to reduce errors due to human bias.
• Automation: Machine learning that works entirely autonomously in a field without the need
for any human intervention. For example, robots performing the essential process steps in
manufacturing plants.
• Finance Industry: Machine learning is growing in popularity in the finance industry. Banks
mainly use ML to find patterns inside the data and also to prevent fraud.
• Government Organizations: Governments make use of ML to manage public safety and
utilities. Take the example of China's massive use of face recognition: the government uses
artificial intelligence to deter jaywalkers.
• Healthcare Industry: Healthcare was one of the first industries to use machine learning,
with image detection.
• Marketing: AI is used broadly in marketing thanks to abundant access to data. Before the
age of mass data, researchers developed advanced mathematical tools like Bayesian analysis to
estimate the value of a customer. With the boom of data, marketing departments rely on AI to
optimize customer relationships and marketing campaigns.
Today, companies are using Machine Learning to improve business decisions, increase
productivity, detect disease, forecast weather, and do many more things. With the exponential
growth of technology, we not only need better tools to understand the data we currently have,
but we also need to prepare for the data we will have. To achieve this goal we need to build
intelligent machines. We can write a program to do simple things, but most of the time
hard-wiring intelligence into it is difficult. The best way is to give machines a way to
learn things themselves: a mechanism for learning. If a machine can learn from input, it does
the hard work for us. This is where Machine Learning comes into action. Some examples of
machine learning are:
• Database Mining for growth of automation: Typical applications include web-click data
for better UX (User eXperience), medical records for better automation in healthcare,
biological data, and many more.
• Applications that cannot be programmed directly: There are some tasks that cannot be
hand-programmed because the computers we use are not modelled that way. Examples include
autonomous driving, recognition tasks on unstructured data (face recognition, handwriting
recognition), natural language processing, computer vision, etc.
• Understanding Human Learning: This is the closest we have come to understanding and
mimicking the human brain. It is the start of a new revolution, the real AI. Now, after this
brief insight, let us come to a more formal definition of Machine Learning.
• Arthur Samuel (1959): "Machine Learning is a field of study that gives computers the
ability to learn without being explicitly programmed." Samuel wrote a checkers-playing
program which could learn over time. At first it could be beaten easily, but over time it
learnt the board positions that would eventually lead to victory or loss, and thus became a
better checkers player than Samuel himself. This was one of the earliest attempts at defining
Machine Learning and is somewhat informal.
• Tom Mitchell (1997): "A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E." This is a more formal and mathematical definition. For the
checkers program above:
• E is the number of games played.
• T is playing checkers.
• P is the program's win/loss record.

Machine learning has many applications in a variety of fields.

Some examples of areas where machine learning is used include:

• Computer vision: Machine learning algorithms can be used to recognize objects, people, and
other elements in images and videos.
• Natural language processing: Machine learning algorithms can be used to understand and
generate human language, including tasks such as translation and text classification.
• Recommendation systems: Machine learning algorithms can be used to recommend
products or content to users based on their past behavior and preferences.
• Fraud detection: Machine learning algorithms can be used to identify fraudulent activity in
areas such as credit card transactions and insurance claims.
• Healthcare: Machine learning algorithms can be used to predict disease outbreaks, assist
diagnosis, or predict patient outcomes.
• Finance: Machine learning algorithms can be used to predict stock prices, identify fraudulent
activity, or identify potential investment opportunities.
A simple Python example:
from sklearn import tree

# Training data: [weight, texture] (0: smooth, 1: bumpy)
X = [[140, 1], [130, 1], [150, 0], [170, 0]]
y = [0, 0, 1, 1]  # 0: apple, 1: orange

# Train a classifier
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)

# Make a prediction
prediction = clf.predict([[160, 0]])  # should return 1 (orange)
print(prediction)

Output:
[1]
If you run the code above, the output is the prediction made by the model. In this case the
output is [1], indicating that the model predicts that a fruit with a weight of 160 and a
smooth texture is an orange.
In the next section we shall classify the types of Machine Learning problems, and shall also
discuss useful packages, setting up an environment for Machine Learning, and how we can use it
to design new projects.
There are many applications of machine learning, some examples include:

1. Image and speech recognition


2. Natural language processing
3. Recommender systems
4. Anomaly detection
5. Fraud detection
6. Predictive maintenance
7. Robotics
8. Self-driving cars
9. Healthcare
10. Financial services
11. Marketing
12. Agriculture
13. Energy
14. and many more.

1.3 What are the different ways of Data Representation?


Statistics is the process of collecting data and analyzing it in large quantities. It is a
branch of mathematics dealing with the collection, analysis, interpretation, and presentation
of numerical facts and figures. Statistics is based on two of its concepts:
• Statistical Data
• Statistical Science
Statistics must be expressed numerically and should be collected systematically.
Data Representation
The word data refers to information about people, things, events, or ideas. It can be text, a
number, or any other recorded value. After collecting data, the investigator has to condense
it in tabular form to study its salient features. Such an arrangement is known as the
presentation of data.
Data representation refers to the process of condensing the collected data into tabular or
graphical form. The rows can be placed in different orders: ascending order, descending
order, or alphabetical order.
Example: Let the marks obtained by 10 students of class V in a class test, out of 50,
according to their roll numbers, be:
39, 44, 49, 40, 22, 10, 45, 38, 15, 50
The data in this form is known as raw data. The data above can be placed in serial order as
shown below:
Roll No. Marks
1 39
2 44
3 49
4 40
5 22
6 10
7 45
8 38
9 15
10 50

Now, suppose you want to analyse the standard of achievement of the students. Arranging the
marks in ascending or descending order will give you a better picture.
Ascending order:
10, 15, 22, 38, 39, 40, 44, 45, 49, 50
Descending order:
50, 49, 45, 44, 40, 39, 38, 22, 15, 10
Data arranged in ascending or descending order is known as arrayed data.
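Arranging raw data into arrayed data is a one-line operation in Python; a minimal sketch using the marks from the example above:

```python
# The raw marks from the example
marks = [39, 44, 49, 40, 22, 10, 45, 38, 15, 50]

ascending = sorted(marks)                 # arrayed data, ascending order
descending = sorted(marks, reverse=True)  # arrayed data, descending order

print(ascending)   # [10, 15, 22, 38, 39, 40, 44, 45, 49, 50]
print(descending)  # [50, 49, 45, 44, 40, 39, 38, 22, 15, 10]
```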
Types of Graphical Data Representation

Bar Chart
A bar chart helps us to represent the collected data visually. The data can be visualized
with horizontal or vertical bars showing amounts or frequencies, and the bars can be grouped
or single. A bar chart helps us compare different items: by looking at the bars, it is easy
to see which categories in the data dominate.
Now let us understand bar charts with an example. Let the marks obtained by 5 students of
class V in a class test, out of 10, listed by name, be:
7, 8, 4, 9, 6
The data in this form is known as raw data. It can be placed in a bar chart as shown below:

Name Marks
Akshay 7
Maya 8
Dhanvi 4
Jaslen 9
Muskan 6
Histogram
A histogram is a graphical representation of data. It looks similar to a bar graph, but there
is an important difference between the two: a bar graph measures the frequency of categorical
data (data based on two or more categories, like gender or months), whereas a histogram is
used for quantitative data.
For example:

Line Graph
A graph which uses lines and points to present change over time is known as a line graph.
Line graphs can show, for example, the number of animals left on earth, the day-by-day growth
of the world's population, or the rise and fall of bitcoin prices. Line graphs tell us about
changes occurring over time, and a single line graph can show two or more kinds of change at
once.
For Example:
Pie Chart
A pie chart is a type of graph that represents numerical proportions as sectors of a circle.
It can be replaced in most cases by other plots like a bar chart, box plot, or dot plot.
Research shows that it is difficult to compare the different sections of a given pie chart,
or to compare data across different pie charts.
For example:

Frequency Distribution Table


A frequency distribution table is a chart that summarises the values in the data and their
frequencies. It has two columns: the first column lists the various outcomes in the data,
while the second column lists the frequency of each outcome. Putting data into such a table
makes it easier to understand and analyze.
For example: to create a frequency distribution table for the runs a baseball team scored per
inning, we first list all the outcomes in the data. Here the outcomes are 0 runs, 1 run, 2
runs, and 3 runs, listed in numerical order in the first column. Next we count how many times
each outcome occurred. The team scored 0 runs in the 1st, 4th, 7th, and 8th innings, 1 run in
the 2nd, 5th, and 9th innings, 2 runs in the 6th inning, and 3 runs in the 3rd inning. We
record the frequency of each outcome in the second column. You can see that the table is a
vastly more useful way to show this data.
Baseball Team Runs Per Inning
Number of Runs Frequency
0 4
1 3
2 1
3 1
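The table above can also be built programmatically; a small sketch using Python's `collections.Counter`, with the innings data taken from the example:

```python
from collections import Counter

# Runs scored in each of the nine innings, as listed in the example
runs_per_inning = [0, 1, 3, 0, 1, 2, 0, 0, 1]

# Frequency distribution: outcome -> how many innings produced it
freq = Counter(runs_per_inning)
for outcome in sorted(freq):
    print(outcome, freq[outcome])
```

This prints the same (Number of Runs, Frequency) pairs as the table: 0→4, 1→3, 2→1, 3→1.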

Sample Questions
Question 1: The school fee submission status of 10 students of class 10 is given below.
Represent the data with a bar graph.

Student Fee
Muskan Paid
Kritika Not paid
Anmol Not paid
Raghav Paid
Nitin Paid
Dhanvi Paid
Jasleen Paid
Manas Not paid
Anshul Not paid
Sahil Paid

Solution:
In order to draw the bar graph for the data above, we prepare the frequency table given below.
Fee submission No. of Students
Paid 6
Not paid 4

Now we represent the data using a bar graph, which can be drawn by following the steps below:
Step 1: Draw the two axes of the graph, the X-axis and the Y-axis. The categories of the data
go on the X-axis (the horizontal line) and the frequencies of the data go on the Y-axis (the
vertical line).
Step 2: Give the Y-axis (the vertical line) a numeric scale. It should start from zero and
end at or above the highest value in the data.
Step 3: Having decided the range of the Y-axis, choose a suitable interval for the numeric
scale. It can be 0, 1, 2, 3, … or 0, 10, 20, 30, … or 0, 20, 40, 60, …
Step 4: Label the X-axis appropriately.
Step 5: Draw the bars according to the data, keeping in mind that all the bars should have
the same width and there should be equal spacing between them.
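The steps above can be sketched in Python with matplotlib (assumed to be installed; if it is not, the plotting part is simply skipped). The output filename is hypothetical.

```python
# Frequency table from the solution above
counts = {"Paid": 6, "Not paid": 4}

try:
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, renders to a file
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()),
           width=0.5)                   # equal-width bars (Step 5)
    ax.set_xlabel("Fee submission")     # categories on the X-axis (Steps 1, 4)
    ax.set_ylabel("No. of Students")    # frequencies on the Y-axis (Step 1)
    ax.set_yticks(range(0, 7))          # numeric scale from 0 to 6 (Steps 2, 3)
    fig.savefig("fee_bar_chart.png")    # hypothetical output filename
except ImportError:
    pass  # matplotlib not installed; the frequency table still stands
```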

Question 2: The pie chart below denotes the money spent by Megha at the funfair. Each colour
indicates the amount paid for one category. The total value of the data is 15 and the amount
paid on each category is as follows:
Chocolates – 3
Wafers – 3
Toys – 2
Rides – 7
To convert this into pie-chart percentages, we apply the formula:
(Frequency / Total Frequency) × 100
Converting the data above into percentages:
Amount paid on rides: (7/15) × 100 ≈ 47%
Amount paid on toys: (2/15) × 100 ≈ 13%
Amount paid on wafers: (3/15) × 100 = 20%
Amount paid on chocolates: (3/15) × 100 = 20%
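The percentage formula above can be checked with a few lines of Python:

```python
# Applying (frequency / total frequency) * 100 to Megha's funfair spending
amounts = {"Chocolates": 3, "Wafers": 3, "Toys": 2, "Rides": 7}
total = sum(amounts.values())  # 15

# Rounded to the nearest whole percent, matching the worked answers
percentages = {item: round(paid / total * 100) for item, paid in amounts.items()}
print(percentages)  # {'Chocolates': 20, 'Wafers': 20, 'Toys': 13, 'Rides': 47}
```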
Question 3: Given below is a line graph showing how Devdas's height changed as he grew.
Observe the graph and answer the questions below.
(i) What was Devdas's height at 8 years?
Answer: 65 inches
(ii) What was Devdas's height at 6 years?
Answer: 50 inches
(iii) What was Devdas's height at 2 years?
Answer: 35 inches
(iv) How much did Devdas grow from 2 to 8 years?
Answer: 30 inches
(v) When was Devdas 35 inches tall?
Answer: At 2 years.
1.4 Importance of Machine Learning

Machine Learning is one of the most popular sub-fields of Artificial Intelligence. Machine
learning concepts are used almost everywhere, such as in healthcare, finance, infrastructure,
marketing, self-driving cars, recommendation systems, chatbots, social sites, gaming, cyber
security, and many more.

Machine Learning is still in a development phase, and new techniques are continuously being
added to it. It helps us in many ways, such as analyzing large chunks of data, data
extraction, interpretation, etc. Hence, the uses of Machine Learning are practically
unlimited. In this topic, we will discuss the importance of Machine Learning with examples.
So, let's start with a quick introduction to Machine Learning.
What is Machine Learning?

Machine Learning is a branch of Artificial Intelligence that allows machines to learn and
improve from experience automatically. It is defined as the field of study that gives
computers the capability to learn without being explicitly programmed. It is quite different
from traditional programming.

How Does Machine Learning Work?

Machine Learning is a core form of Artificial Intelligence that enables a machine to learn
from past data and make predictions.

It involves data exploration and pattern matching with minimal human intervention. There are
four main approaches by which machine learning works:

1. Supervised Learning:

Supervised Learning is a machine learning method that needs supervision, similar to the
student-teacher relationship. In supervised learning, a machine is trained with well-labeled
data, which means the data is already tagged with correct outputs. So, whenever new data is
introduced into the system, the supervised learning algorithm analyzes it and predicts the
correct output with the help of the labeled training data.

It is classified into two different categories of algorithms. These are as follows:

o Classification: used when the output is a category, such as yellow or blue, right or
wrong, etc.
o Regression: used when the output variable is a real value, like age or height.

This approach lets us produce outputs from experience. It works much the way humans learn,
using labeled data points from the training set. It helps in optimizing the performance of
models using experience and in solving various complex computational problems.
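A minimal sketch of the two categories using scikit-learn; the toy datasets (and the linear relationship y = 2x for the regression case) are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: the label is a category (0 or 1)
X_cls = [[1], [2], [8], [9]]
y_cls = [0, 0, 1, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[8.5]]))  # a point near 8 and 9 -> expect class 1

# Regression: the label is a real value (here y = 2x, a made-up relationship)
X_reg = [[1], [2], [3], [4]]
y_reg = [2.0, 4.0, 6.0, 8.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))  # expect a value close to 10.0
```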

2. Unsupervised Learning:

Unlike supervised learning, unsupervised learning does not require classified or well-labeled
data to train a machine. It aims to group unsorted information based on patterns and
differences, even without any labelled training data. In unsupervised learning, no
supervision is provided and no labeled sample data is given to the machine. Hence, the
machine is restricted to finding hidden structure in unlabeled data on its own.

It is classified into two different categories of algorithms. These are as follows:

o Clustering: used when an inherent grouping is required in the training data, e.g.,
grouping students by their area of interest.
o Association: deals with rules that identify relationships in a large portion of the data,
e.g., students who are interested in ML also tend to be interested in AI.
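A minimal clustering sketch with scikit-learn's KMeans; the two-dimensional points are invented so that two natural groups exist, and no labels are given to the algorithm:

```python
from sklearn.cluster import KMeans

# Unlabeled points: one natural group near (1, 1), another near (8, 8)
X = [[1, 1], [1, 2], [2, 1],
     [8, 8], [8, 9], [9, 8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
print(labels)  # two groups are found; the cluster numbering is arbitrary
```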

3. Semi-supervised learning:

Semi-supervised learning is defined as the combination of supervised and unsupervised
learning methods. It is used to overcome the drawbacks of both.

In the semi-supervised learning method, a machine is trained with labeled as well as
unlabeled data. Typically, it involves a few labeled examples and a large number of unlabeled
examples.

Speech analysis, web content classification, protein sequence classification, and text
document classifiers are some of the most popular real-world applications of semi-supervised
learning.
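One way to make this concrete is scikit-learn's LabelPropagation, where the label -1 marks an unlabeled example; the toy one-dimensional data below is invented for illustration:

```python
from sklearn.semi_supervised import LabelPropagation

# Six points, only two of which are labeled; -1 means "unlabeled"
X = [[1.0], [1.2], [0.8], [9.0], [9.2], [8.8]]
y = [0, -1, -1, 1, -1, -1]

model = LabelPropagation().fit(X, y)
print(model.transduction_)  # labels inferred for every point, labeled or not
```

The two labeled examples are enough for the algorithm to propagate labels to the four unlabeled points near them.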

4. Reinforcement learning:

Reinforcement learning is defined as a feedback-based machine learning method that does not
require labeled data. In this learning method, an agent learns to behave in an environment by
performing actions and seeing the results of those actions. The agent receives positive
feedback for each good action and negative feedback for bad actions. Since there is no
training data in reinforcement learning, the agent is restricted to learning from its own
experience.
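The agent-environment loop described above can be sketched with tabular Q-learning on a made-up one-dimensional "corridor": states 0 to 4, actions left/right, and a reward of 1 on reaching the goal. All states, actions, and rewards here are invented for illustration, not a standard benchmark.

```python
import random

random.seed(0)
N_STATES = 5
ACTIONS = (0, 1)  # 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    """Environment: move left or right; reward 1.0 on reaching the goal state."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0  # positive feedback at the goal
    return nxt, reward

for _ in range(200):  # episodes of experience
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit what was learned, sometimes explore
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        nxt, r = step(s, a)
        # Q-learning update from the observed result of the action
        Q[s][a] += alpha * (r + gamma * max(Q[nxt]) - Q[s][a])
        s = nxt

# Greedy policy after learning: moving right should be preferred in every state
policy = [max(ACTIONS, key=lambda act: Q[s][act]) for s in range(N_STATES - 1)]
print(policy)
```

No labeled data appears anywhere: the Q-table is shaped purely by the rewards the agent experiences.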

Importance of Machine Learning

Machine learning is continuously evolving, with many new techniques appearing all the time,
and it is already used across many industries.

Machine learning is important because it gives enterprises a view of trends in customer behavior
and operational business patterns, as well as supports the development of new products. Many
of today's leading companies, such as Facebook, Google, and Uber, make machine learning a
central part of their operations. Machine learning has become a significant competitive
differentiator for many companies.

Machine learning has several practical applications that drive the kind of real business results -
such as time and money savings - that have the potential to dramatically impact the future of your
organization. In particular, we see tremendous impact occurring within the customer care industry,
whereby machine learning is allowing people to get things done more quickly and efficiently.
Through Virtual Assistant solutions, machine learning automates tasks that would otherwise need
to be performed by a live agent - such as changing a password or checking an account balance.
This frees up valuable agent time that can be used to focus on the kind of customer care that
humans perform best: high touch, complicated decision-making that is not as easily handled by a
machine. At Interactions, the process is further improved by eliminating the decision of
whether a request should be sent to a human or a machine: with its Adaptive Understanding
technology, the machine learns to be aware of its limitations and hands off to humans when it
has low confidence in providing the correct solution.

Use cases of Machine Learning Technology

Machine Learning is broadly used in every industry and has a wide range of applications,
especially those that involve collecting, analyzing, and responding to large sets of data.
The importance of Machine Learning can be understood through these important applications.

Some important applications in which machine learning is widely used are given below:

1. Healthcare: Machine Learning is widely used in the healthcare industry. It helps
healthcare researchers analyze data points and suggest outcomes. Natural language processing
helps provide accurate insights for better patient results. Further, machine learning has
improved treatment methods by analyzing external data on patients' conditions in the form of
X-rays, ultrasound, CT scans, etc. NLP, medical imaging, and genetic information are key
areas where machine learning improves diagnosis, detection, and prediction in the healthcare
sector.
2. Automation: This is one of the significant applications of machine learning, helping to
make systems automated. It helps machines perform repetitive tasks without human
intervention. A machine learning engineer or data scientist may need a given task solved many
times with no errors, which is not practically possible for humans. Hence machine learning
provides various models to automate the process, capable of performing iterative tasks in
less time.
3. Banking and Finance: Machine learning uses statistical models to make accurate
predictions. In the banking and finance sector, it helps in many ways, such as fraud
detection, portfolio management, risk management, chatbots, document analysis, high-frequency
trading, mortgage underwriting, AML detection, anomaly detection, credit risk scoring, KYC
processing, etc. Hence, machine learning is widely applied in the banking and finance sector
to reduce errors as well as time.
4. Transportation and Traffic Prediction: This is one of the most common applications of
machine learning, used by individuals in their daily routine. It helps ensure highly secure
routes, generate accurate ETAs, predict vehicle breakdowns, drive prescriptive analytics,
etc. Although machine learning has solved many transportation problems, it still requires
more improvement. Statistical machine learning algorithms help build smart transportation
systems, and deep learning models explore the complex interactions of roads, highways,
traffic, environmental elements, crashes, etc. Hence, machine learning technology has
improved daily traffic management as well as the collection of traffic data used to predict
insights about routes and traffic.
5. Image Recognition: This is one of the most common applications of machine learning, used
to identify objects in images. Various social media sites such as Facebook use image
recognition to tag your Facebook friends in images, with a feature named auto friend tagging
suggestion.
Further, nowadays almost all mobile devices come with face detection features. Using this
feature, you can secure your mobile data with face unlocking, so if anyone tries to access
your device, they cannot unlock it without face recognition.
6. Speech Recognition: Speech recognition is one of the biggest achievements of machine
learning applications. It enables users to search for content without writing text, in other
words, to 'search by voice' on platforms such as YouTube, Google, and Amazon.
It is the process of converting voice instructions into text; hence it is also known as
'speech to text' or 'computer speech recognition'. Some important examples of speech
recognition are Google Assistant, Siri, Cortana, Alexa, etc.
7. Product Recommendation: This is one of the biggest achievements of machine learning,
helping e-commerce and entertainment companies like Flipkart, Amazon, Netflix, etc. to
digitally advertise their products over the internet. When anyone searches for a product,
they start seeing advertisements for the same product while surfing the internet in the same
browser.
This is made possible by machine learning algorithms that work on users' interests or past
behavior and recommend products accordingly. For example, when we search for a laptop on
Amazon, we start seeing many other laptops of the same category and criteria. Similarly, when
we use Netflix, we find recommendations for series, movies, etc. This too is done by machine
learning algorithms.
8. Virtual Personal Assistants: These help us in many ways, such as searching for content
using voice instructions, calling a number by voice, searching contacts on your phone,
playing music, opening an email, scheduling an appointment, etc. Nowadays, you have all seen
advertisements like "Alexa! Play the music"; this too is done with the help of machine
learning. Google Assistant, Alexa, Cortana, Siri, etc. are a few common applications of
machine learning. These virtual personal assistants record our voice instructions, send them
to a server in the cloud, decode them using ML algorithms, and act accordingly.
9. Email Spam and Malware Detection & Filtering: Machine learning also helps us filter
emails into different categories such as spam, important, general, etc. In this way, users
can easily identify whether an email is useful or spam. This is made possible by machine
learning algorithms such as the multi-layer perceptron, decision trees, and the Naïve Bayes
classifier. Content filters, header filters, rules-based filters, permission filters, general
blacklist filters, etc. are some important spam filters used by Google.
10. Self-driving cars: This is one of the most exciting applications of machine learning,
which plays a vital role in the development of self-driving cars. Learning methods are used
to train car models to detect people and objects while driving. Tesla and Tata are among the
car manufacturers working on self-driving cars. Hence, this big revolution in the
technological era is also driven by machine learning.
11. Credit card fraud detection: Credit cards have become easy targets for online fraudsters.
As the culture of online/digital payments grows, the risk of credit/debit card fraud grows in
parallel. Machine learning helps developers detect and analyze fraud in online transactions,
for example by building fraud detection methods for streaming transaction data that analyze
customers' past transaction details and extract their behavioral patterns. Cardholders can be
clustered into categories by transaction amount so that the behavioral pattern of each group
can be extracted. Such approaches combine an aggregation strategy with a feedback mechanism.
12. Stock Marketing and Trading: Machine learning also helps in the stock marketing and
trading sector, where it uses historical trends and past experience to predict market risk.
As the share market is inherently risky, machine learning reduces that risk to some extent
and makes predictions in the face of it. Long short-term memory (LSTM) neural networks are
used for predicting stock market trends.
13. Language Translation: The use of machine learning can be seen in language translation.
It uses sequence-to-sequence learning algorithms for translating one language into
another, and image recognition techniques to identify and translate text captured in
images. Google's GNMT (Google Neural Machine Translation) provides this feature: it is a
neural machine translation system that translates text into the user's familiar language,
a process called automatic translation.

Conclusion:

Machine Learning is directly or indirectly involved in our daily routine. We have seen various
machine learning applications that are very useful in today's technical world. Although machine
learning is still in a developing phase, it is evolving rapidly. The best thing about machine
learning is its high-value predictions, which can guide better decisions and smart actions in
real time without human intervention. Hence, at the end of this article, we can say that the
field of machine learning is very vast, and its importance is not limited to a specific industry
or sector; it is applicable everywhere for analyzing or predicting future events.

1.5 Difference between Structured, Semi-structured and Unstructured data

Big Data includes huge volume, high velocity, and extensible variety of data. There are 3 types:
Structured data, Semi-structured data, and Unstructured data.

1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been
organized into a formatted repository, typically a database. It concerns all data that can
be stored in a SQL database in a table with rows and columns. Such data have relational
keys and can easily be mapped into pre-designed fields. Today, structured data is the most
processed form of data and the simplest to manage. Example: Relational data.

2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has
some organizational properties that make it easier to analyze. With some processing, it can
be stored in a relational database (though this can be very hard for some kinds of
semi-structured data), and its flexible structure eases storage. Example: XML data.

3. Unstructured data –
Unstructured data is data that is not organized in a predefined manner and does not have a
predefined data model, so it is not a good fit for a mainstream relational database. For
unstructured data, there are alternative platforms for storing and managing it; it is
increasingly prevalent in IT systems and is used by organizations in a variety of business
intelligence and analytics applications. Example: Word, PDF, text, media logs.

Differences between Structured, Semi-structured and Unstructured data:


Technology:
  o Structured data – based on a relational database table.
  o Semi-structured data – based on XML/RDF (Resource Description Framework).
  o Unstructured data – based on character and binary data.

Transaction management:
  o Structured data – matured transaction management and various concurrency techniques.
  o Semi-structured data – transactions adapted from the DBMS, not matured.
  o Unstructured data – no transaction management and no concurrency.

Version management:
  o Structured data – versioning over tuples, rows, and tables.
  o Semi-structured data – versioning over tuples or graphs is possible.
  o Unstructured data – versioned as a whole.

Flexibility:
  o Structured data – schema-dependent and less flexible.
  o Semi-structured data – more flexible than structured data but less flexible than unstructured data.
  o Unstructured data – very flexible; there is an absence of schema.

Scalability:
  o Structured data – scaling the DB schema is very difficult.
  o Semi-structured data – scaling is simpler than for structured data.
  o Unstructured data – the most scalable.

Robustness:
  o Structured data – very robust.
  o Semi-structured data – newer technology, not very widespread.
  o Unstructured data – not applicable.

Query performance:
  o Structured data – structured queries allow complex joining.
  o Semi-structured data – queries over anonymous nodes are possible.
  o Unstructured data – only textual queries are possible.
1.6 Difference Between Data mining and Machine learning
Data mining:
The process of extracting useful information from a huge amount of data is called Data mining.
Data mining is a tool that is used by humans to discover new, accurate, and useful patterns in
data or meaningful relevant information for the ones who need it.
Machine learning:
The process of discovering algorithms that have improved courtesy of experience derived data is
known as machine learning. It is the algorithm that permits the machine to learn without human
intervention. It’s a tool to make machines smarter, eliminating the human element.

Below is a table of differences between Data Mining and Machine Learning:

1. Data Mining: Extracting useful information from a large amount of data.
   Machine Learning: Introduces algorithms built from data as well as from past experience.

2. Data Mining: Used to understand the data flow.
   Machine Learning: Teaches the computer to learn and understand from the data flow.

3. Data Mining: Works on huge databases with unstructured data.
   Machine Learning: Works on existing data as well as algorithms.

4. Data Mining: Models can be developed using data mining techniques.
   Machine Learning: Machine learning algorithms can be used in decision trees, neural
   networks, and some other areas of artificial intelligence.

5. Data Mining: Human interference is greater.
   Machine Learning: No human effort is required after design.

6. Data Mining: Used in cluster analysis.
   Machine Learning: Used in web search, spam filters, fraud detection, and computer design.

7. Data Mining: Abstracts information from the data warehouse.
   Machine Learning: Reads data from machines.

8. Data Mining: More of a research area, using methods like machine learning.
   Machine Learning: Self-learned; trains the system to do intelligent tasks.

9. Data Mining: Applied in limited areas.
   Machine Learning: Can be used in vast areas.

1.7 Linear Algebra for Machine learning

Machine learning has a strong connection with mathematics. Each machine learning algorithm is
based on the concepts of mathematics & also with the help of mathematics, one can choose the
correct algorithm by considering training time, complexity, number of features, etc. Linear
Algebra is an essential field of mathematics, which defines the study of vectors, matrices, planes,
mapping, and lines required for linear transformation.

The term Linear Algebra was initially introduced in the early 18th century to find out the unknowns
in Linear equations and solve the equation easily; hence it is an important branch of mathematics
that helps study data. Also, no one can deny that Linear Algebra is undoubtedly the important and
primary thing to process the applications of Machine Learning. It is also a prerequisite to start
learning Machine Learning and data science.
Linear algebra plays a vital role and key foundation in machine learning, and it enables ML
algorithms to run on a huge number of datasets.

The concepts of linear algebra are widely used in developing algorithms in machine learning.
Although it is used almost in each concept of Machine learning, specifically, it can perform the
following task:

o Optimization of data.
o Applicable in loss functions, regularisation, covariance matrices, Singular Value
Decomposition (SVD), Matrix Operations, and support vector machine classification.
o Implementation of Linear Regression in Machine Learning.

Besides the above uses, linear algebra is also used in neural networks and the data science field.

Basic mathematics principles and concepts like Linear algebra are the foundation of Machine
Learning and Deep Learning systems. To learn and understand Machine Learning or Data Science,
one needs to be familiar with linear algebra and optimization theory. In this topic, we will explain
all the Linear algebra concepts required for machine learning.

Note: Although linear algebra is a must-know part of mathematics for machine learning, it is not
required to become an expert in it; a good working knowledge of these concepts is more than
enough for machine learning.

Why learn Linear Algebra before learning Machine Learning?

Linear Algebra is like the flour in a bakery for Machine Learning. Just as a cake is based on
flour, every Machine Learning model is based on Linear Algebra. Further, the cake also needs
more ingredients like egg, sugar, and cream; similarly, Machine Learning also requires more
concepts such as vector calculus, probability, and optimization theory. So, we can say that
Machine Learning creates a useful model with the help of the above-mentioned mathematical
concepts.

Below are some benefits of learning Linear Algebra before Machine learning:

o Better Graphic experience


o Improved Statistics
o Creating better Machine Learning algorithms
o Estimating the forecast of Machine Learning
o Easy to Learn
Better Graphics Experience:

Linear Algebra helps to provide better graphical processing in Machine Learning like Image,
audio, video, and edge detection. These are the various graphical representations supported by
Machine Learning projects that you can work on. Further, parts of the given data set are trained
based on their categories by classifiers provided by machine learning algorithms. These classifiers
also remove the errors from the trained data.

Moreover, Linear Algebra helps solve and compute large and complex data sets through matrix
decomposition techniques. The two most popular matrix decomposition techniques are as follows:

o Q-R decomposition
o L-U decomposition
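As a small illustrative sketch (using NumPy, assumed to be available), Q-R decomposition factors a matrix into an orthogonal factor Q and an upper-triangular factor R; L-U decomposition is the analogous lower/upper-triangular factorization (available in SciPy):

```python
import numpy as np

# A small matrix to decompose (arbitrary illustrative values)
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Q-R decomposition: A = Q @ R, with Q orthogonal and R upper-triangular
Q, R = np.linalg.qr(A)

# Multiplying the factors back together reconstructs the original matrix
reconstructed = Q @ R
```

The orthogonality of Q (Q.T @ Q = I) is what makes Q-R numerically stable for solving linear systems and least-squares problems.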

Improved Statistics:

Statistics is an important concept to organize and integrate data in Machine Learning. Also, linear
Algebra helps to understand the concept of statistics in a better manner. Advanced statistical topics
can be integrated using methods, operations, and notations of linear algebra.

Creating better Machine Learning algorithms:

Linear Algebra also helps to create better supervised as well as unsupervised Machine Learning
algorithms.

Few supervised learning algorithms can be created using Linear Algebra, which is as follows:

o Logistic Regression
o Linear Regression
o Decision Trees
o Support Vector Machines (SVM)

Further, below are some unsupervised learning algorithms listed that can also be created with the
help of linear algebra as follows:

o Singular Value Decomposition (SVD)


o Clustering
o Principal Component Analysis
With the help of Linear Algebra concepts, you can also self-customize the various parameters in
the live project and understand in-depth knowledge to deliver the same with more accuracy and
precision.

Estimating the forecast of Machine Learning:

If you are working on a Machine Learning project, you should be broad-minded and able to
consider multiple perspectives. In this regard, you should increase your awareness of and
affinity with Machine Learning concepts. You can begin by setting up different graphs and
visualizations, using various parameters for diverse machine learning algorithms, or taking
up things that others around you might find difficult to understand.

Easy to Learn:

Linear Algebra is an important branch of Mathematics that is easy to understand. It is taken
into consideration whenever there is a requirement for advanced mathematics and its applications.

Minimum Linear Algebra for Machine Learning

Notation:

Notation in linear algebra enables you to read algorithm descriptions in papers, books, and
websites to understand the algorithm's working. Even if you use for-loops rather than matrix
operations, you will be able to piece things together.

Operations:

Working with an advanced level of abstractions in vectors and matrices can make concepts clearer,
and it can also help in the description, coding, and even thinking capability. In linear algebra, it is
required to learn the basic operations such as addition, multiplication, inversion, transposing of
matrices, vectors, etc.

Matrix Factorization:

One of the most recommended areas of linear algebra is matrix factorization, specifically matrix
decomposition methods such as SVD and QR.

Examples of Linear Algebra in Machine Learning

Below are some popular examples of linear algebra in Machine learning:

o Datasets and Data Files


o Linear Regression
o Recommender Systems
o One-hot encoding
o Regularization
o Principal Component Analysis
o Images and Photographs
o Singular-Value Decomposition
o Deep Learning
o Latent Semantic Analysis

1. Datasets and Data Files

Each machine learning project works on the dataset, and we fit the machine learning model using
this dataset.

Each dataset resembles a table-like structure consisting of rows and columns, where each row
represents an observation and each column represents a feature/variable. This dataset is handled
as a Matrix, which is a key data structure in Linear Algebra.

Further, when this dataset is divided into input and output for the supervised learning model, it
represents a Matrix(X) and Vector(y), where the vector is also an important concept of linear
algebra.

2. Images and Photographs

In machine learning, images/photographs are used for computer vision applications. Each image
is an example of a matrix from linear algebra, because an image is a table-like structure of
pixel values with a height and a width.

Moreover, different operations on images, such as cropping, scaling, resizing, etc., are performed
using notations and operations of Linear Algebra.

3. One Hot Encoding

In machine learning, sometimes, we need to work with categorical data. These categorical
variables are encoded to make them simpler and easier to work with, and the popular encoding
technique to encode these variables is known as one-hot encoding.

In the one-hot encoding technique, a table is created that shows a variable with one column for
each category and one row for each example in the dataset. Further, each row is encoded as a
binary vector, which contains either zero or one value. This is an example of sparse representation,
which is a subfield of Linear Algebra.
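The one-hot encoding described above can be sketched in a few lines of plain Python; the colour categories here are a hypothetical example:

```python
# Categorical values to encode (hypothetical colour feature)
categories = ["red", "green", "blue"]
samples = ["green", "red", "blue", "green"]

# Map each category to its column index
index = {cat: i for i, cat in enumerate(categories)}

def one_hot(value):
    """Encode a single value as a binary vector with a 1 in its category slot."""
    vec = [0] * len(categories)
    vec[index[value]] = 1
    return vec

# One binary row vector per example: a sparse representation
encoded = [one_hot(s) for s in samples]
```

Each encoded row contains exactly one 1 and zeros elsewhere, which is why one-hot matrices are sparse.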
4. Linear Regression

Linear regression is a popular technique of machine learning borrowed from statistics. It
describes the relationship between input and output variables and is used in machine learning
to predict numerical values. The most common way to solve linear regression problems, least
squares optimization, is implemented with the help of matrix factorization methods. Some
commonly used matrix factorization methods are LU decomposition and singular-value
decomposition, which are concepts of linear algebra.
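A minimal sketch of least-squares linear regression, using NumPy's `lstsq` (which relies internally on an SVD-based matrix factorization); the data are synthetic, generated from the assumed relationship y = 2x + 1:

```python
import numpy as np

# Design matrix: a column of ones (for the intercept) plus one feature column
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
# Targets generated from y = 2x + 1 (synthetic data for illustration)
y = np.array([1.0, 3.0, 5.0, 7.0])

# Least-squares solution: coefficients minimizing ||X @ coef - y||^2
coef, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
```

Because the synthetic data are exactly linear, the recovered intercept and slope match the generating values.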

5. Regularization

In machine learning, we usually look for the simplest possible model to achieve the best outcome
for the specific problem. Simpler models generalize well, ranging from specific examples to
unknown datasets. These simpler models are often considered models with smaller coefficient
values.

A technique used to minimize the size of coefficients of a model while it is being fit on data is
known as regularization. Common regularization techniques are L1 and L2 regularization. Both
of these forms of regularization are, in fact, a measure of the magnitude or length of the coefficients
as a vector and are methods lifted directly from linear algebra called the vector norm.
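The L1 and L2 vector norms mentioned above can be computed directly; the coefficient vector below is a made-up example:

```python
# Coefficient vector of a hypothetical fitted model
w = [3.0, -4.0, 0.0]

# L1 norm: sum of absolute values (the penalty used by L1/Lasso regularization)
l1 = sum(abs(x) for x in w)

# L2 norm: Euclidean length (the penalty used by L2/Ridge regularization)
l2 = sum(x * x for x in w) ** 0.5
```

Penalizing these norms during fitting pushes coefficients toward smaller values, which is exactly what regularization does.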

6. Principal Component Analysis

Generally, each dataset contains thousands of features, and fitting the model with such a large
dataset is one of the most challenging tasks of machine learning. Moreover, a model built with
irrelevant features is less accurate than a model built with relevant features. There are several
methods in machine learning that automatically reduce the number of columns of a dataset, and
these methods are known as Dimensionality reduction. The most commonly used dimensionality
reductions method in machine learning is Principal Component Analysis or PCA. This technique
makes projections of high-dimensional data for both visualizations and training models. PCA uses
the matrix factorization method from linear algebra.
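A minimal PCA sketch via SVD (using NumPy; the toy dataset is made up): centre the data, factorize it, and project onto the leading principal component:

```python
import numpy as np

# Toy dataset: 4 samples with 2 highly correlated features (made up)
X = np.array([[ 2.0,  1.9],
              [ 0.0,  0.1],
              [-1.0, -0.9],
              [-1.0, -1.1]])

# Centre the data, then factorize with SVD: Xc = U @ diag(S) @ Vt
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; project onto the first one
projected = Xc @ Vt[0]
```

The singular values in S are sorted in decreasing order, so keeping only the first component retains the direction of greatest variance while reducing two columns to one.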

7. Singular-Value Decomposition

Singular-Value decomposition is also one of the popular dimensionality reduction techniques and
is also written as SVD in short form.

It is the matrix-factorization method of linear algebra, and it is widely used in different applications
such as feature selection, visualization, noise reduction, and many more.

8. Latent Semantic Analysis

Natural Language Processing or NLP is a subfield of machine learning that works with text and
spoken words.
NLP represents a text document as large matrices with the occurrence of words. For example, the
matrix column may contain the known vocabulary words, and rows may contain sentences,
paragraphs, pages, etc., with cells in the matrix marked as the count or frequency of the number of
times the word occurred. It is a sparse matrix representation of text. Documents processed in this
way are much easier to compare, query, and use as the basis for a supervised machine learning
model.

This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also
known by the name Latent Semantic Indexing or LSI.
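The term-document count matrix described above can be sketched in plain Python; the two documents are made up for illustration:

```python
# Two tiny documents (made up) turned into a word-count matrix
docs = ["linear algebra for machine learning",
        "machine learning with text data"]

# Vocabulary: one column per known word, in sorted order
vocab = sorted(set(w for d in docs for w in d.split()))

# Rows = documents, cells = word counts (mostly zero, i.e. sparse)
matrix = [[d.split().count(w) for w in vocab] for d in docs]
```

LSA would then apply SVD to a (much larger) matrix like this one to uncover latent topics shared between documents.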

9. Recommender System

A recommender system is a sub-field of machine learning, a predictive modelling problem that


provides recommendations of products. For example, online recommendation of books based on
the customer's previous purchase history, recommendation of movies and TV series, as we see in
Amazon & Netflix.

The development of recommender systems is mainly based on linear algebra methods. We can
understand it as an example of calculating the similarity between sparse customer behaviour
vectors using distance measures such as Euclidean distance or dot products.
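A minimal sketch of comparing customer behaviour vectors with dot products and Euclidean distance; the purchase-count vectors are made up:

```python
import math

# Purchase counts per product for three customers (made-up, sparse-ish vectors)
alice = [2, 0, 1, 0]
bob   = [1, 0, 1, 0]
carol = [0, 3, 0, 2]

def dot(u, v):
    """Dot product: large when two customers buy the same products."""
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    """Euclidean distance: small when two behaviour vectors are similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Alice's behaviour is closer to Bob's than to Carol's by both measures
sim_bob = dot(alice, bob)
sim_carol = dot(alice, carol)
```

A recommender would then suggest to Alice the products bought by her nearest neighbours, here Bob.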

Different matrix factorization methods such as singular-value decomposition are used in


recommender systems to query, search, and compare user data.

10. Deep Learning

Artificial Neural Networks (ANN) are non-linear ML algorithms that process information in a
way loosely inspired by the brain, transferring it from one layer to another.

Deep learning studies these neural networks, exploiting newer and faster hardware for the
training and development of larger networks with huge datasets. Deep learning methods
achieve great results for different challenging tasks such as machine translation, speech
recognition, etc. The core of processing neural networks is based on linear algebra data structures,
which are multiplied and added together. Deep learning algorithms also work with vectors,
matrices, tensors (matrix with more than two dimensions) of inputs and coefficients for multiple
dimensions.
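A single dense neural-network layer is just these linear algebra operations plus a non-linearity; a minimal sketch using NumPy (the weights and input are arbitrary illustrative values):

```python
import numpy as np

# One dense layer: output = activation(W @ x + b), all matrix/vector operations
x = np.array([1.0, 2.0])                  # input vector
W = np.array([[0.5, -0.5],
              [1.0,  1.0]])               # weight matrix (arbitrary values)
b = np.array([0.0, -1.0])                 # bias vector

def relu(z):
    """ReLU activation: the non-linearity applied element-wise."""
    return np.maximum(z, 0.0)

hidden = relu(W @ x + b)
```

Stacking many such layers, with the matrix multiplications batched over tensors, is the core computation of deep learning.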

Conclusion

In this topic, we have discussed Linear algebra, its role and its importance in machine learning.
For each machine learning enthusiast, it is very important to learn the basic concepts of linear
algebra to understand the working of ML algorithms and choose the best algorithm for a specific
problem.
UNIT-II: SUPERVISED LEARNING:

2.1 Supervised Machine Learning

Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled
data means that the input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as a supervisor that
teaches the machines to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).

In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.

How Supervised Learning Works?

In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on
held-out test data (data not seen during training), and then it predicts the output.

The working of Supervised learning can be easily understood by the below example and diagram:

Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle,
and Polygon. Now the first step is that we need to train the model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides then it will be labelled as hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.

The machine is already trained on all types of shapes, and when it encounters a new shape, it
classifies the shape on the basis of its number of sides and predicts the output.

Steps Involved in Supervised Learning:


o First Determine the type of training dataset
o Collect/Gather the labelled training data.
o Split the training dataset into training dataset, test dataset, and validation dataset.
o Determine the input features of the training dataset, which should have enough knowledge
so that the model can accurately predict the output.
o Determine the suitable algorithm for the model, such as support vector machine, decision
tree, etc.
o Execute the algorithm on the training dataset. Sometimes we need validation sets as the
control parameters, which are the subset of training datasets.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the
correct output, our model is accurate.

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised
learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which means the
output falls into classes such as Yes-No, Male-Female, True-False, etc. Spam filtering is a
common example. Below are some popular Classification algorithms which come under supervised
learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines

Note: We will discuss these algorithms in detail in later chapters.

Advantages of Supervised learning:


o With the help of supervised learning, the model can predict the output on the basis of prior
experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.

Disadvantages of supervised learning:


o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

2.2 ML | Types of Learning – Supervised Learning


Let us first discuss what learning means for a machine:
A machine is said to be learning from past Experiences(data feed-in) with respect to some class
of tasks if its Performance in a given Task improves with the Experience. For example, assume
that a machine has to predict whether a customer will buy a specific product let’s say “Antivirus”
this year or not. The machine will do it by looking at the previous knowledge/past experiences i.e
the data of products that the customer had bought every year and if he buys Antivirus every year,
then there is a high probability that the customer is going to buy an antivirus this year as well. This
is how machine learning works at the basic conceptual level.

Supervised learning is when the model is getting trained on a labelled dataset. A labelled dataset
is one that has both input and output parameters. In this type of learning both training and
validation, datasets are labelled as shown in the figures below.

Both the above figures have labelled data set as follows:


• Figure A: It is a dataset of a shopping store that is useful in predicting whether a customer
will purchase a particular product under consideration or not based on his/ her gender, age,
and salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will purchase and 0 means that the
customer won’t purchase it.
• Figure B: It is a Meteorological dataset that serves the purpose of predicting wind speed
based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity, Wind Direction
Output: Wind Speed
Training the system: While training the model, data is usually split in the ratio of 80:20 i.e.
80% as training data and the rest as testing data. In training data, we feed input as well as output
for 80% of data. The model learns from training data only. We use different machine learning
algorithms(which we will discuss in detail in the next articles) to build our model. Learning
means that the model will build some logic of its own.
Once the model is ready then it is good to be tested. At the time of testing, the input is fed from
the remaining 20% of data that the model has never seen before, the model will predict some
value and we will compare it with the actual output and calculate the accuracy.
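The 80:20 split and accuracy calculation described above can be sketched as follows; the records, labels, and model predictions here are made up for illustration:

```python
# 10 labelled examples (stand-in records; features omitted for brevity)
data = list(range(10))
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# 80:20 split: first 80% for training, last 20% held out for testing
split = int(0.8 * len(data))
train_x, test_x = data[:split], data[split:]
train_y, test_y = labels[:split], labels[split:]

# Suppose a (hypothetical) trained model predicts these labels for the test records
predictions = [0, 1]

# Accuracy = fraction of test predictions that match the true labels
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
```

In practice the split is usually randomized (or stratified by class) rather than taking the first 80% of rows, to avoid ordering bias.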

Types of Supervised Learning:


A. Classification: It is a Supervised Learning task where output is having defined labels(discrete
value). For example in above Figure A, Output – Purchased has defined labels i.e. 0 or 1; 1 means
the customer will purchase, and 0 means that the customer won’t purchase. The goal here is to
predict discrete values belonging to a particular class and evaluate them on the basis of accuracy.
It can be either binary or multi-class classification. In binary classification, the model predicts
either 0 or 1; yes or no but in the case of multi-class classification, the model predicts more than
one class. Example: Gmail classifies mails in more than one class like social, promotions, updates,
and forums.
B. Regression: It is a Supervised Learning task where output is having continuous value.
For example in above Figure B, Output – Wind Speed is not having any discrete value but is
continuous in a particular range. The goal here is to predict a value as much closer to the actual
output value as our model can and then evaluation is done by calculating the error value. The
smaller the error the greater the accuracy of our regression model.
Example of Supervised Learning Algorithms:
• Linear Regression
• Logistic Regression
• Nearest Neighbor
• Gaussian Naive Bayes
• Decision Trees
• Support Vector Machine (SVM)
• Random Forest

ML | Types of Learning – Supervised Learning

1. Supervised learning is a type of machine learning in which the algorithm is trained on a


labeled dataset, which means that the output (or target) variable is already known. The goal
of supervised learning is to learn a function that can accurately predict the output variable
based on the input variables. Supervised learning can be further divided into two main
categories:
2. Classification: In classification, the output variable is a categorical variable, and the goal is to
predict the class or category to which a new data point belongs. Examples of classification
problems include image classification, spam detection, and sentiment analysis.
3. Regression: In regression, the output variable is a continuous variable, and the goal is to
predict the value of the output variable based on the input variables. Examples of regression
problems include predicting stock prices, weather forecasting, and sales forecasting.
4. Supervised learning algorithms are widely used in various fields, such as natural language
processing, computer vision, medical diagnosis, speech recognition, and many others. Some
of the popular supervised learning algorithms include: linear regression, logistic regression,
decision trees, random forest, k-nearest neighbors (KNN), support vector machine (SVM),
and neural networks.
5. It’s worth noting that supervised learning is useful when we have labeled data, which is
not always the case. In some scenarios, the data is not labeled or it is too expensive to
label; then unsupervised learning, semi-supervised learning, or self-supervised learning
could be a better approach.

2.3 Computational learning theory

What is computational learning theory?


Computational learning theory (CoLT) is a branch of AI concerned with applying mathematical
methods to the design and analysis of computer learning programs. It involves using
mathematical frameworks for the purpose of quantifying learning tasks and algorithms.
It seeks to use the tools of theoretical computer science to quantify learning problems. This
includes characterizing the difficulty of learning specific tasks.
Computational learning theory can be considered to be an extension of statistical learning theory
or SLT for short, that makes use of formal methods for the purpose of quantifying learning
algorithms.

• Computational Learning Theory (CoLT): Formal study of learning tasks.


• Statistical Learning Theory (SLT): Formal study of learning algorithms.
This division of learning tasks vs. learning algorithms is arbitrary, and in practice, there is quite a
large degree of overlap between these two fields.
Computational learning theory is essentially a sub-field of artificial intelligence (AI) that focuses
on studying the design and analysis of machine learning algorithms.


How important is computational learning theory?


Computational learning theory provides a formal framework in which it is possible to precisely
formulate and address questions regarding the performance of different learning algorithms.
Thus, careful comparisons of both the predictive power and the computational efficiency of
competing learning algorithms can be made. Three key aspects that must be formalized are:

• The way in which the learner interacts with its environment,


• The definition of success in completing the learning task,
• A formal definition of efficiency of both data usage (sample complexity) and processing
time (time complexity).

It is important to remember that the theoretical learning models are abstractions of real-life
problems. Close connections with experimentalists are useful to help validate or modify these
abstractions so that the theoretical results reflect empirical performance. The computational
learning theory research has therefore close connections to machine learning research. Besides
the model’s predictive capability, the computational learning theory also addresses other
important features such as simplicity, robustness to variations in the learning scenario, and an
ability to create insights to empirically observed phenomena.
What is computational learning theory in machine learning?
These are sub-fields of machine learning that a machine learning practitioner does not need to
know in great depth in order to achieve good results on a wide range of problems. Nevertheless,
it is a sub-field where having a high-level understanding of some of the more prominent methods
may provide insight into the broader task of learning from data.
Theoretical results in machine learning mainly deal with a type of inductive learning called
supervised learning. In supervised learning, an algorithm is given samples that are labeled in
some useful way. For example, the samples might be descriptions of mushrooms, and the labels
could be whether or not the mushrooms are edible. The algorithm takes these previously labeled
samples and uses them to induce a classifier. This classifier is a function that assigns labels to
samples, including samples that have not been seen previously by the algorithm. The goal of the
supervised learning algorithm is to optimize some measure of performance such as minimizing
the number of mistakes made on new samples.
In addition to performance bounds, computational learning theory studies the time complexity
and feasibility of learning. In computational learning theory, a computation is considered feasible
if it can be done in polynomial time.
There are two kinds of time complexity results:

• Positive results – Showing that a certain class of functions is learnable in polynomial time.
• Negative results – Showing that certain classes cannot be learned in polynomial time.

Negative results often rely on commonly believed, but yet unproven assumptions, such as:

• Computational complexity – P ≠ NP (the P versus NP problem);
• Cryptographic – One-way functions exist.

What is the difference between Computational Learning Theory and Statistical Learning Theory?
While both frameworks use similar mathematical analysis, the primary difference between CoLT
and SLT is their objectives. CoLT focuses on studying “learnability,” or what functions/features
are necessary to make a given task learnable for an algorithm. Whereas SLT is primarily focused
on studying and improving the accuracy of existing training programs.
What is machine learning theory?
Machine Learning Theory, also known as Computational Learning Theory, aims to understand
the fundamental principles of learning as a computational process. This field seeks to understand
at a precise mathematical level what capabilities and information are fundamentally needed to
learn different kinds of tasks successfully, and to understand the basic algorithmic principles
involved in getting computers to learn from data and to improve performance with feedback. The
goals of this theory are both to aid in the design of better automated learning methods and to
understand fundamental issues in the learning process itself.
Machine Learning Theory draws elements from both the Theory of Computation and
Statistics and involves tasks such as:

• Creating mathematical models that capture key aspects of machine learning, in which one
can analyze the inherent ease or difficulty of different types of learning problems.
• Proving guarantees for algorithms (under what conditions will they succeed, how much
data and computation time is needed) and developing machine learning algorithms that
probably meet desired criteria.
• Mathematically analyzing general issues, such as: “Why is Occam’s Razor a good idea?”,
“When can one be confident about predictions made from limited data?”, “How much
power does active participation add over passive observation for learning?”, and “What
kinds of methods can learn even in the presence of large quantities of distracting
information?”

What is 'Probably Approximately Correct' Learning?


PAC learning, also known as Probably Approximately Correct learning is a theoretical machine
learning framework created by Leslie Valiant. PAC learning aims to quantify the difficulty
involved in a learning task and it might be considered to be the main sub-field of computational
learning theory.
PAC learning is concerned with the amount of computational effort needed to identify a hypothesis (fitted model) that is a close match for the unknown target function.
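To make this concrete, here is a small sketch (not from the original text) of the standard PAC sample-complexity bound for a consistent learner over a finite hypothesis class H: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples suffice for the learner to be, with probability at least 1 − δ, within error ε.

```python
import math

def pac_sample_size(hypothesis_space_size, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite
    hypothesis class to be 'probably (prob. >= 1 - delta)
    approximately (error <= epsilon) correct'."""
    return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 2**10 hypotheses, 5% error allowed, 95% confidence
m = pac_sample_size(2 ** 10, epsilon=0.05, delta=0.05)
print(m)  # 199 examples suffice
```

Note how the bound grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.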

What is VC Dimension
The Vapnik–Chervonenkis theory (VC Theory) is a theoretical machine learning framework
created by Vladimir Vapnik and Alexey Chervonenkis.
It aims to quantify the capability of a learning algorithm and could be considered to be the main
sub-field of statistical learning theory.
One of the main elements of the VC theory is the Vapnik–Chervonenkis dimension (VC dimension). It quantifies the complexity of a hypothesis space and provides an estimate of the capability or capacity of a classification machine learning algorithm for a particular dataset (number and dimensionality of examples).
2.4 Occam's razor principle and overfitting avoidance; heuristic search in inductive learning

Occam’s razor
Many philosophers throughout history have advocated the idea of parsimony. Aristotle, one of the greatest Greek philosophers, went as far as to say, "Nature operates in the shortest way possible." As a consequence, humans might be biased as well to choose the simplest explanation from a set of possible explanations with the same descriptive power. This section gives a brief overview of Occam's razor, the relevance of the principle, and ends with a note on the usage of this razor as an inductive bias in machine learning (decision tree learning in particular).
What is Occam’s razor?
Occam's razor is a law of parsimony popularly stated (in William's words) as "Plurality must never be posited without necessity". Alternatively, as a heuristic, it can be viewed as: when there are multiple hypotheses to solve a problem, the simpler one is to be preferred. It is not clear to whom this principle can be conclusively attributed, but William of Occam's (c. 1287–1347) preference for simplicity is well documented; hence the principle goes by the name "Occam's razor". Applying the principle often means cutting off or shaving away other possibilities or explanations, hence the "razor" in its name. It should be noted that these explanations or hypotheses should lead to the same result.
Relevance of Occam’s razor.
There are many events that favor a simpler approach either as an inductive bias or a constraint to
begin with. Some of them are :
• Developmental studies, whose results have suggested that preschoolers are sensitive to simpler explanations during their initial years of learning and development.
• Preference for a simpler approach and explanations to achieve the same goal is seen in
various facets of sciences; for instance, the parsimony principle applied to the understanding
of evolution.
• In theology, ontology, epistemology, etc this view of parsimony is used to derive various
conclusions.
• Variants of Occam’s razor are used in knowledge Discovery.
Occam’s razor as an inductive bias in machine learning.
Note: It is highly recommended to read the article on decision tree introduction for an insight on
decision tree building with examples.
• Inductive bias (or the inherent bias of the algorithm) are assumptions that are made by the
learning algorithm to form a hypothesis or a generalization beyond the set of training
instances in order to classify unobserved data.
• Occam’s razor is one of the simplest examples of inductive bias. It involves a preference for
a simpler hypothesis that best fits the data. Though the razor can be used to eliminate other
hypotheses, relevant justification may be needed to do so. Below is an analysis of how this
principle is applicable in decision tree learning.
• The decision tree learning algorithms follow a search strategy to search the hypotheses space
for the hypothesis that best fits the training data. For example, the ID3 algorithm uses a
simple to complex strategy starting from an empty tree and adding nodes guided by the
information gain heuristic to build a decision tree consistent with the training instances.
The information gain of every attribute (which is not already included in the tree) is calculated to infer which attribute to consider as the next node. Information gain is the essence of the ID3 algorithm. It gives a quantitative measure of the information that an attribute can provide about the target variable, i.e., assuming only information of that attribute is available, how efficiently we can infer the target. It can be defined as:

Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) · Entropy(Sv), summed over v ∈ Values(A)

where S is the set of training examples, A is an attribute, and Sv is the subset of S for which attribute A has value v.
• Well, there can be many decision trees consistent with a given set of training examples, but the inductive bias of the ID3 algorithm results in a preference for simpler (or shorter) trees. This preference bias of ID3 arises from the fact that there is an ordering of the hypotheses in the search strategy. This leads to the additional bias that attributes with high information gain are preferred closer to the root. Therefore, there is a definite order the algorithm follows until it terminates on reaching a hypothesis that is consistent with the training data.

The above image depicts how the ID3 algorithm chooses the nodes in every iteration. The red
arrow depicts the node chosen in a particular iteration while the black arrows suggest other
decision trees that could have been possible in a given iteration.
• Hence starting from an empty node, the algorithm graduates towards more complex decision
trees and stops when the tree is sufficient to classify the training examples.
• This example raises a question: does eliminating complex hypotheses bear any consequence on the classification of unobserved instances? Simply put, does the preference for a simpler hypothesis have an advantage? If two decision trees have slightly different training errors but the same validation error, it is obvious that the simpler tree among the two should be chosen, since a higher validation error indicates overfitting of the data. Complex trees often have almost zero training error, but their validation errors might be high. This scenario gives a logical reason for a bias towards simpler trees. In addition, a simpler hypothesis might prove effective in a resource-limited environment.
• What is overfitting? Consider two hypotheses a and b. Let 'a' fit the training examples perfectly, while hypothesis 'b' has a small training error. If, over the entire set of data (i.e., including the unseen instances), hypothesis 'b' performs better, then 'a' is said to overfit the training data. To best illustrate the problem of overfitting, consider the figure below.

Figures A and B depict two decision boundaries. Assuming the green and red points
represent the training examples, the decision boundary in B perfectly fits the data thus
perfectly classifying the instances, while the decision boundary in A does not, though being
simpler than B. In this example the decision boundary in B overfits the data. The reason
being that every instance of the training data affects the decision boundary. The added
relevance is when the training data contains noise. For example, assume in figure B that one
of the red points close to the boundary was a noise point. Then the unseen instances in close
proximity to the noise point might be wrongly classified. This makes the complex hypothesis
vulnerable to noise in the data.
• While the overfitting behaviour of a model can be significantly reduced by settling for a simpler hypothesis, an extremely simple hypothesis may be too abstract to deduce any information needed for the task, resulting in underfitting. Overfitting and underfitting are among the major challenges to be addressed before we zero in on a machine learning model. Sometimes a complex model might be desired; it is a choice dependent on the data available, the results expected, and the application domain.
Note: For additional information on the decision tree learning, please refer to Tom M. Mitchell’s
“Machine Learning” book.
2.5 Understanding Generalization Error in Machine Learning

What determines the model’s ability to react to new unseen data?

Definition

Firstly, let’s define “generalization error”.

In supervised learning applications in machine learning and statistical learning theory, generalization error (also known as the out-of-sample error) is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. (Wikipedia)

Notice that the gap between predictions and observed data is induced by model inaccuracy, sampling error, and noise. Some of the errors are reducible but some are not. Choosing the right algorithm and tuning parameters could improve model accuracy, but we will never be able to make our predictions 100% accurate.

Bias-variance decomposition
An important way to understand generalization error is the bias-variance decomposition.

Intuitively speaking, bias is the systematic part of the error: the error that would remain even with unlimited data. A model has a high bias when, for example, it fails to capture meaningful patterns in the data. Bias is measured by the difference between the expected predicted values and the observed values in the dataset D when the prediction variables are at the level x (X = x). In contrast with bias, variance reflects an algorithm's flexibility to learn patterns in the observed data. Variance is the amount by which an algorithm's output will change if a different dataset is used. A model has high variance when, for instance, it tries so hard that it captures not only the pattern of meaningful features but also the meaningless error (overfitting).

Mathematical Notations

Using regression as an example, we have

y = f(x) + ε, where the noise ε has mean 0 and variance σ²

Now we can decompose the generalization error at a point x:

E[(y − ŷ)²] = Bias[ŷ]² + Var[ŷ] + σ²

where Bias[ŷ] = E[ŷ] − f(x) and Var[ŷ] = E[(ŷ − E[ŷ])²].
Interpretation

Bias measures the deviation between the expected output of our model and the real values, so it indicates the fit of our model.

Variance measures the amount that the outputs of our model will change if a different dataset is used. It captures the impact of using different datasets.

Noise is the irreducible error, the lower bound of the generalization error for the current task, which no model will be able to get rid of; it indicates the difficulty of the task.

These three components together determine the model's ability to react to new unseen data rather than just the data that it was trained on.

Bias-Variance Tradeoff
Bias-Variance Tradeoff as a Function of Model Capacity


Generalization error can be measured by MSE. As the model capacity increases, the bias decreases because the model fits the training data better. However, the variance increases: as the model becomes sophisticated enough to fit more patterns of the current dataset, changing datasets (even if they come from the same distribution) becomes more impactful. As data scientists, our challenge lies in finding the optimal capacity, where both bias and variance are low.
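The tradeoff can be illustrated with a small simulation (a sketch with made-up data, not from the original text): we fit polynomials of low and high degree to noisy samples of a known function and estimate bias² and variance over many resampled datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # the true (normally unknown) function
x_test = np.linspace(0, 1, 50)
noise_sd = 0.3

def estimate_bias_variance(degree, n_datasets=200, n_points=30):
    """Estimate bias^2 and variance of a polynomial fit of the
    given degree (= model capacity) over many resampled datasets."""
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = f(x) + rng.normal(0, noise_sd, n_points)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

bias_low, var_low = estimate_bias_variance(degree=1)    # low capacity
bias_high, var_high = estimate_bias_variance(degree=9)  # high capacity
# Expect: bias falls and variance rises as capacity grows
```

With these settings the degree-1 model shows high bias and low variance, while the degree-9 model shows the opposite, matching the curve described above.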

2.6 Performance Metrics in Machine Learning

Evaluating the performance of a Machine learning model is one of the important steps while
building an effective ML model. To evaluate the performance or quality of the model, different
metrics are used, and these metrics are known as performance metrics or evaluation
metrics. These performance metrics help us understand how well our model has performed for the
given data. In this way, we can improve the model's performance by tuning the hyper-parameters.
Each ML model aims to generalize well on unseen/new data, and performance metrics help
determine how well the model generalizes on the new dataset.
In machine learning, each task or problem is divided into classification and regression. Not all metrics can be used for all types of problems; hence, it is important to know and understand which metrics should be used. Different evaluation metrics are used for regression and classification tasks. In this topic, we will discuss metrics used for classification and regression tasks.

1. Performance Metrics for Classification

In a classification problem, the category or class of the data is identified based on the training data. The model learns from the given dataset and then classifies new data into classes or groups based on this training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the performance of a classification model, different metrics are used, and some of them are as follows:

o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC

I. Accuracy

The accuracy metric is one of the simplest classification metrics to implement, and it can be determined as the ratio of the number of correct predictions to the total number of predictions.

It can be formulated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)


To implement an accuracy metric, we can compare ground truth and predicted values in a loop, or
we can also use the scikit-learn module for this.

Firstly, we need to import the accuracy_score function of the scikit-learn library as follows:

from sklearn.metrics import accuracy_score

Here, metrics is the module of scikit-learn that provides this function.

Then we need to pass the ground truth and predicted values to the function to calculate the accuracy:

print(f'Accuracy Score is {accuracy_score(y_test, y_hat)}')

Although it is simple to use and implement, it is suitable only for cases where an equal number of
samples belong to each class.

When to Use Accuracy?

It is good to use the Accuracy metric when the target variable classes in the data are approximately balanced. For example, if 60% of the classes in a fruit image dataset are Apple and 40% are Mango, the classes are roughly balanced; in this case, if the model is asked to predict whether an image is of an Apple or a Mango, an accuracy figure (say, 97%) is a meaningful summary of its performance.

When not to use Accuracy?

It is recommended not to use the Accuracy measure when the target variable majorly belongs to one class. For example, suppose there is a disease-prediction model in which, out of 100 people, only five people have the disease and 95 people don't. In this case, if our model predicts that no person has the disease (a useless model), the Accuracy measure will still be 95%, which is misleading.
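This pitfall is easy to reproduce in a few lines of plain Python (the counts below mirror the hypothetical example above):

```python
# Imbalanced ground truth: 95 healthy (0) and 5 diseased (1) patients
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100   # a useless model that predicts "no disease" for everyone

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.95, despite the model missing every diseased patient
```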

II. Confusion Matrix

A confusion matrix is a tabular representation of prediction outcomes of any binary classifier,


which is used to describe the performance of the classification model on a set of test data when
true values are known.
The confusion matrix is simple to implement, but the terminologies used in this matrix might be
confusing for beginners.

A typical confusion matrix for a binary classifier looks like the table below (however, it can be extended for classifiers with more than two classes). The counts shown are one reconstruction consistent with the totals in the example discussed next:

                 Predicted: No    Predicted: Yes
Actual: No       TN = 50          FP = 10          (60)
Actual: Yes      FN = 5           TP = 100         (105)
                 (55)             (110)            (165)

We can determine the following from the above matrix:

o In the matrix, columns are for the predicted values, and rows specify the actual values. Here Actual and Prediction each have two possible classes, Yes or No. So, if we are predicting the presence of a disease in a patient, the Prediction column with Yes means the patient has the disease, and with No, the patient doesn't have the disease.
o In this example, the total number of predictions is 165, out of which the model predicted Yes 110 times and No 55 times.
o However, in reality, there are 60 cases in which patients don't have the disease and 105 cases in which patients have the disease.

In general, the table is divided into four terminologies, which are as follows:

1. True Positive (TP): The prediction outcome is positive, and it is positive in reality as well.
2. True Negative (TN): The prediction outcome is negative, and it is negative in reality as well.
3. False Positive (FP): The prediction outcome is positive, but it is negative in reality.
4. False Negative (FN): The prediction outcome is negative, but it is positive in reality.
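The four counts can be computed directly from label lists. The data below is a reconstruction of the 165-prediction example that is consistent with its stated totals (the usual version of this example has TP = 100, TN = 50, FP = 10, FN = 5, which is what we assume here):

```python
# Actual: 105 "Yes" and 60 "No"; predictions aligned with the assumed counts
actual    = ["Yes"] * 105 + ["No"] * 60
predicted = ["Yes"] * 100 + ["No"] * 5 + ["Yes"] * 10 + ["No"] * 50

tp = sum(a == "Yes" and p == "Yes" for a, p in zip(actual, predicted))
tn = sum(a == "No"  and p == "No"  for a, p in zip(actual, predicted))
fp = sum(a == "No"  and p == "Yes" for a, p in zip(actual, predicted))
fn = sum(a == "Yes" and p == "No"  for a, p in zip(actual, predicted))
print(tp, tn, fp, fn)  # 100 50 10 5
```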
III. Precision

The precision metric is used to overcome the limitation of accuracy. Precision determines the proportion of positive predictions that were actually correct. It can be calculated as the true positives divided by the total number of positive predictions (true positives plus false positives):

Precision = TP / (TP + FP)

IV. Recall or Sensitivity

It is similar to the precision metric; however, it aims to calculate the proportion of actual positives that were identified correctly. It can be calculated as the true positives divided by the total number of actual positives, whether correctly predicted as positive or incorrectly predicted as negative (true positives plus false negatives).

The formula for calculating Recall is given below:

Recall = TP / (TP + FN)

When to use Precision and Recall?

From the above definitions of Precision and Recall, we can say that recall determines the
performance of a classifier with respect to a false negative, whereas precision gives information
about the performance of a classifier with respect to a false positive.

So, if we want to minimize false negatives, recall should be as close to 100% as possible, and if we want to minimize false positives, precision should be as close to 100% as possible.

In simple words, if we maximize precision, it will minimize the FP errors, and if we maximize
recall, it will minimize the FN error.

V. F-Scores

F-score or F1 score is a metric to evaluate a binary classification model on the basis of predictions that are made for the positive class. It is calculated with the help of precision and recall, and it is a single score that represents both. The F1 score is the harmonic mean of precision and recall, assigning equal weight to each of them. The formula for calculating the F1 score is given below:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

When to use F-Score?

As the F-score makes use of both precision and recall, it should be used when both are important for evaluation, even if one of them (precision or recall) is slightly more important to consider than the other; for example, when false negatives are comparatively more important than false positives, or vice versa.
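Precision, recall, and the F1 score can all be computed from the confusion-matrix counts; the counts below are made up for illustration:

```python
# Hypothetical counts: 100 true positives, 10 false positives, 5 false negatives
tp, fp, fn = 100, 10, 5

precision = tp / (tp + fp)   # TP / (TP + FP)
recall = tp / (tp + fn)      # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

As a harmonic mean, F1 always lies between precision and recall, pulled toward the smaller of the two.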

VI. AUC-ROC

Sometimes we need to visualize the performance of the classification model on charts; then, we
can use the AUC-ROC curve. It is one of the popular and important metrics for evaluating the
performance of the classification model.

Firstly, let's understand the ROC (Receiver Operating Characteristic) curve. A ROC curve is a graph showing the performance of a classification model at different threshold levels. The curve is plotted between two parameters, which are:

o True Positive Rate
o False Positive Rate

TPR, or True Positive Rate, is a synonym for recall and can be calculated as:

TPR = TP / (TP + FN)

FPR, or False Positive Rate, can be calculated as:

FPR = FP / (FP + TN)

To calculate the value at any point on a ROC curve, we could evaluate a logistic regression model multiple times with different classification thresholds, but this would not be very efficient. Instead, an efficient method known as AUC is used.
AUC: Area Under the ROC curve

AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional area under the entire ROC curve, as shown in the image below:

AUC calculates the performance across all the thresholds and provides an aggregate measure. The
value of AUC ranges from 0 to 1. It means a model with 100% wrong prediction will have an AUC
of 0.0, whereas models with 100% correct predictions will have an AUC of 1.0.

When to Use AUC

AUC should be used to measure how well the predictions are ranked rather than their absolute
values. Moreover, it measures the quality of predictions of the model without considering the
classification threshold.

When not to use AUC

AUC is scale-invariant, which is not always desirable: when we need well-calibrated probability outputs, AUC is not preferable.

Further, AUC is not a useful metric when there are wide disparities in the cost of false negatives
vs. false positives, and it is difficult to minimize one type of classification error.
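The ranking interpretation of AUC can be sketched directly: AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (the scores and labels below are made up):

```python
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # model scores, higher = more positive

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]
pairs = [(p, n) for p in pos for n in neg]
# Ties between a positive and a negative score count as half
auc = sum((p > n) + 0.5 * (p == n) for p, n in pairs) / len(pairs)
print(auc)  # 0.888..., i.e. 8 of the 9 positive/negative pairs are ranked correctly
```

Note that rescaling all scores leaves this value unchanged, which is exactly the scale-invariance discussed above.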

2. Performance Metrics for Regression

Regression is a supervised learning technique that aims to find the relationships between the dependent and independent variables. A predictive regression model predicts a continuous numeric value. The metrics used for regression are different from the classification metrics. It means we
cannot use the Accuracy metric (explained above) to evaluate a regression model; instead, the
performance of a Regression model is reported as errors in the prediction. Following are the
popular metrics that are used to evaluate the performance of Regression models.

o Mean Absolute Error


o Mean Squared Error
o R2 Score
o Adjusted R2

I. Mean Absolute Error (MAE)

Mean Absolute Error or MAE is one of the simplest metrics, which measures the absolute
difference between actual and predicted values, where absolute means taking a number as Positive.

To understand MAE, let's take an example of Linear Regression, where the model draws a best fit
line between dependent and independent variables. To measure the MAE or error in prediction,
we need to calculate the difference between actual values and predicted values. But in order to find
the absolute error for the complete dataset, we need to find the mean absolute of the complete
dataset.

The below formula is used to calculate MAE:

MAE = (1/N) Σ |Y − Y'|

Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.

MAE is much more robust to outliers than metrics that square the errors. One of the limitations of MAE is that it is not differentiable at zero, which complicates gradient-based optimization. To overcome this limitation, another metric can be used, which is Mean Squared Error or MSE.
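A minimal sketch of MAE computed from its definition (the values below are made up):

```python
# Actual and predicted outcomes for four data points
y_actual = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]

n = len(y_actual)
mae = sum(abs(y - y_hat) for y, y_hat in zip(y_actual, y_predicted)) / n
print(mae)  # 0.5
```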

II. Mean Squared Error

Mean Squared error or MSE is one of the most suitable metrics for Regression evaluation. It
measures the average of the Squared difference between predicted values and the actual value
given by the model.

Since in MSE the errors are squared, it only assumes non-negative values, and it is usually positive and non-zero.

Moreover, because the differences are squared, large errors are penalized heavily, which can lead to an over-estimation of how bad the model is.

MSE is a much-preferred metric compared to other regression metrics as it is differentiable and hence easier to optimize.

The formula for calculating MSE is given below:

MSE = (1/N) Σ (Y − Y')²

Here, Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.

III. R Squared Score

R squared error is also known as Coefficient of Determination, which is another popular metric
used for Regression model evaluation. The R-squared metric enables us to compare our model
with a constant baseline to determine the performance of the model. To select the constant baseline,
we need to take the mean of the data and draw the line at the mean.

The R squared score will always be less than or equal to 1, regardless of whether the values are large or small.

IV. Adjusted R Squared

Adjusted R squared, as the name suggests, is an improved version of the R squared error. R squared has the limitation that the score improves as more terms are added, even when the model is not actually improving, which may mislead data scientists.

To overcome this issue of R squared, adjusted R squared is used, which will always show a value lower than or equal to R². This is because it adjusts for the number of predictors and only shows improvement when there is a real improvement.

We can calculate the adjusted R squared as follows:

Ra² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]

Here,

n is the number of observations,
k denotes the number of independent variables,
and Ra² denotes the adjusted R².
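Both scores can be computed from their definitions; the data below is made up, and k = 1 independent variable is assumed:

```python
y_actual = [3.0, 5.0, 7.0, 9.0, 11.0]
y_predicted = [2.8, 5.3, 7.1, 8.9, 10.9]

n = len(y_actual)   # number of observations
k = 1               # number of independent variables (assumed)
mean_y = sum(y_actual) / n

ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted))
ss_tot = sum((y - mean_y) ** 2 for y in y_actual)

r2 = 1 - ss_res / ss_tot                       # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # penalizes extra predictors
# adj_r2 is always <= r2, as stated above
```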


UNIT-III: STATISTICAL LEARNING:
3.1 Inferential Statistics – An Overview | Introduction to Inferential Statistics

Introduction

Statistics plays a significant part in the field of data science. It helps us in the collection, analysis and representation of data, either by visualisation or by numbers, in a generally understandable format. Statistics is usually divided into two main branches: Descriptive Statistics and Inferential Statistics. In this article, we will discuss inferential statistics in detail.

Population and Sample

Before discussing inferential statistics, let us look at the population and the sample. A population contains all the data points from a set of data; it is the group from which we collect the data. A sample consists of some observations selected from the population. The sample should be selected such that it has all the characteristics that the population has. A population's measurable characteristics, such as the mean and standard deviation, are called parameters, while a sample's measurable characteristics are known as statistics.

What is Inferential Statistics?

Descriptive statistics describe the important characteristics of data by using mean, median, mode,
variance etc. It summarises the data through numbers and graphs.

In Inferential statistics, we make an inference from a sample about the population. The main aim
of inferential statistics is to draw some conclusions from the sample and generalise them for the
population data. E.g. we have to find the average salary of a data analyst across India. There are
two options.

1. The first option is to consider the data of data analysts across India and ask them their
salaries and take an average.
2. The second option is to take a sample of data analysts from the major IT cities in India
and take their average and consider that for across India.
The first option is not possible as it is very difficult to collect all the data of data analysts across
India. It is time-consuming as well as costly. So, to overcome this issue, we will look into the
second option to collect a small sample of salaries of data analysts and take their average as India
average. This is the inferential statistics where we make an inference from a sample about the
population.

In inferential statistics, we will discuss probability, distributions, and hypothesis testing.


Importance of Inferential Statistics

• Making conclusions from a sample about the population


• To conclude if a sample selected is statistically significant to the whole population or not
• Comparing two models to find which one is more statistically significant as compared to
the other.
• In feature selection, whether adding or removing a variable helps in improving the model
or not.
Probability

It is a measure of the chance of occurrence of a phenomenon. We will now discuss some terms
which are very important in probability:

• Random Experiment: Random experiment or statistical experiment is an experiment in


which all the possible outcomes of the experiments are already known. The experiment
can be repeated numerous times under identical or similar conditions.
• Sample space: Sample space of a random experiment is the collection or set of all the
possible outcomes of a random experiment.
• Event: A subset of sample space is called an event.
• Trial: A trial refers to a special type of experiment in which there are two possible outcomes, success or failure, with a certain success probability.
• Random Variable: A variable whose value is subject to variations due to randomness is
called a random variable. A random variable is of two types: Discrete and Continuous
variable. In a mathematical way, we can say that a real-valued function X: S -> R is
called a random variable where S is probability space and R is a set of real numbers.
Conditional Probability

Conditional probability is the probability of a particular event Y, given that a certain event X has already occurred. The conditional probability P(Y|X) is defined as:

P(Y|X) = N(X∩Y) / N(X); provided N(X) > 0

N(X): total cases favourable to the event X

N(X∩Y): total cases favourable to the simultaneous occurrence of X and Y

Or, we can write it as:

P(Y|X) = P(X∩Y) / P(X); P(X) > 0
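The counting definition P(Y|X) = N(X∩Y) / N(X) can be sketched with made-up outcome counts:

```python
# Each pair records whether events X and Y occurred in one run of the experiment
outcomes = [(1, 1)] * 10 + [(1, 0)] * 30 + [(0, 1)] * 20 + [(0, 0)] * 40

n_x = sum(x for x, _ in outcomes)         # N(X) = 40
n_xy = sum(x and y for x, y in outcomes)  # N(X ∩ Y) = 10
p_y_given_x = n_xy / n_x
print(p_y_given_x)  # 0.25
```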


Probability Distribution and Distribution function

The mathematical function describing the randomness of a random variable is called its probability distribution. It is a depiction of all possible outcomes of a random variable and their associated probabilities.

For a random variable X, CDF (Cumulative Distribution function) is defined as:

F(x) = P {s ε S; X(s) ≤ x}

Or,

F(x) = P {X ≤ x}

E.g. P (X > 7) = 1- P (X ≤ 7)

= 1- {P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4) + P (X = 5) + P (X = 6) + P
(X = 7)}

Sampling Distribution

The probability distribution of a statistic computed from a large number of samples selected from the population is called the sampling distribution. As we increase the sample size, the sample mean becomes more normally distributed around the population mean, and the variability of the sample mean decreases.

Central Limit Theorem

The CLT states that as we increase the sample size, the distribution of sample means approaches
a normal distribution, whatever the shape of the population distribution. The theorem holds
particularly well when the sample size is greater than 30. The conclusion is that if we take a
large number of samples, particularly of large sizes, the distribution of the sample means will
approximately follow the normal distribution.

As we increase the value of n, i.e., the sample size, the distribution of sample means approaches
the shape of the normal distribution.
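A small simulation sketches the CLT at work: sampling from a uniform (clearly non-normal) population, the sample means concentrate around the population mean 0.5 as n grows. All numbers here are illustrative:

```python
import random
import statistics

# CLT sketch: sample means from a decidedly non-normal (uniform)
# population cluster ever more tightly around the population mean 0.5
# as the sample size n grows.
random.seed(0)

def sample_means(n, trials=2000):
    """Means of `trials` samples, each of size n, drawn from Uniform(0, 1)."""
    return [statistics.mean(random.random() for _ in range(n))
            for _ in range(trials)]

means_small = sample_means(n=2)
means_large = sample_means(n=50)

# The spread of the sample means shrinks roughly like sigma / sqrt(n).
print(round(statistics.stdev(means_small), 2))
print(round(statistics.stdev(means_large), 2))
```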

Confidence Interval

Confidence Interval is an interval of reasonable values for our parameters. Confidence intervals
are used to give an interval estimation for our parameter of interest.
The margin of error is found by multiplying the standard error of the mean and the z-score.

Margin of error = (z. σ)/ √n

And the confidence interval is defined as:

Confidence interval = x̄ ± (z. σ)/ √n

A confidence interval of 95% indicates that we are 95% sure that the actual mean is within our
confidence interval.
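A minimal sketch of the margin-of-error formula above, using hypothetical values for x̄, σ, and n, and z = 1.96 for 95% confidence:

```python
import math

# Sketch of a 95% confidence interval for the mean (all numbers
# hypothetical), assuming a known population standard deviation.
sample_mean = 7.43   # x-bar
sigma = 2.0          # population standard deviation
n = 100              # sample size
z = 1.96             # z-score for 95% confidence

margin_of_error = z * sigma / math.sqrt(n)
ci = (sample_mean - margin_of_error, sample_mean + margin_of_error)
print(round(margin_of_error, 3))          # 0.392
print(tuple(round(v, 3) for v in ci))     # (7.038, 7.822)
```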

Hypothesis Testing

Hypothesis testing is a part of statistics in which we make assumptions about a population
parameter. Hypothesis testing specifies a proper procedure for accepting or rejecting the
assumption by analysing a random sample of the population.

Type of Hypothesis

A hypothesis is of two types:

1. Null hypothesis: The null hypothesis is a type of hypothesis in which we assume that the
sample observations occur purely by chance. It is denoted by H0.
2. Alternate hypothesis: The alternate hypothesis is a hypothesis in which we assume that
sample observations are not by chance; they are affected by some non-random situation.
The alternate hypothesis is denoted by H1 or Ha.
Steps of Hypothesis Testing

The process to determine whether to reject a null hypothesis or to fail to reject the null
hypothesis, based on sample data is called hypothesis testing. It consists of four steps:

1. Define the null and alternate hypothesis


2. Define an analysis plan for how to use the sample data to evaluate the null hypothesis
3. Analyse the sample data to compute a single number called the ‘test statistic’
4. Interpret the result by applying the decision rule to check whether the null hypothesis
holds
If the p-value of the test statistic is less than the significance level, we reject the null
hypothesis; otherwise, we fail to reject the null hypothesis.

Technically, we never accept the null hypothesis, we say that either we fail to reject or we reject
the null hypothesis.
Terms in Hypothesis testing

Significance level

The significance level is defined as the probability of the case when we reject the null hypothesis
but in actual it is true. E.g., a 0.05 significance level indicates that there is 5% risk in assuming
that there is some difference when in actual there is no difference. It is denoted by alpha (α).

For a two-tailed test at this level, the two shaded rejection regions of the sampling distribution
are equidistant from the centre, each having a probability of 0.025 and a total of 0.05, which is
our significance level. The shaded region in the case of a two-tailed test is called the critical
region.

P-value

The p-value is defined as the probability of seeing a test statistic as extreme as the calculated
value, assuming the null hypothesis is true. A low enough p-value is grounds for rejecting the
null hypothesis: we reject the null hypothesis if the p-value is less than the significance level.

Errors in hypothesis testing

We have explained what is hypothesis testing and the steps to do the testing. Now during
performing the hypothesis testing, there might be some errors.

We classify these errors in two categories.

1. Type-1 error: A Type-1 error occurs when we reject the null hypothesis when it is
actually true. The probability of a Type-1 error is the significance level alpha (α).
2. Type-2 error: A Type-2 error occurs when we fail to reject the null hypothesis when it is
actually false. The probability of a Type-2 error is called beta (β).
Therefore,

α= P (Null hypothesis rejected | Null hypothesis is true)

β= P (Null hypothesis accepted | Null hypothesis is false)

Power of test is defined as

P= 1- Type-2 error

=1–β
The smaller the Type-2 error, the greater the power of the hypothesis test.

Actual \ Decision              Reject the null hypothesis    Fail to reject the null hypothesis

Null hypothesis is true        Type-1 Error                  Decision is correct

Alternate hypothesis is true   Decision is correct           Type-2 Error

Z-test

A Z-test is mainly used when the data is normally distributed. We compute the z-score of the
sample mean, where for a single observation the z-score is given by the formula,

Z-score = (x – µ) / σ

(for a sample mean, the standard error σ/√n takes the place of σ). The Z-test is mainly used
when the population mean and standard deviation are given.
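A minimal z-test sketch with hypothetical numbers, using the standard error σ/√n for the sample mean and a two-tailed decision rule at the 5% level:

```python
import math

# Minimal z-test sketch with hypothetical numbers: population mean 100,
# population standard deviation 15, sample of size 36 with mean 106.
mu, sigma, n, sample_mean = 100, 15, 36, 106

# z-score of the sample mean: the standard error sigma / sqrt(n)
# takes the place of sigma.
z = (sample_mean - mu) / (sigma / math.sqrt(n))
print(z)  # 2.4

# Two-tailed decision at the 5% significance level (critical value 1.96):
print(abs(z) > 1.96)  # True -> reject the null hypothesis
```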

T-test

The t-test is similar to the z-test. The only difference is that it is used when we have the sample
standard deviation but not the population standard deviation, or when we have a small sample
size (n < 30).

Different types of T-test

One Sample T-test

The one-sample t-test compares the mean of sample data to a known value like if we have to
compare the mean of sample data to the population mean we use the One-Sample T-test.

We can run a one-sample T-test when we do not have the population S.D. or we have a sample
of size less than 30.

Two sample T-test

We use a two-sample T-test when we want to evaluate whether the means of two samples are
different or not. In the two-sample T-test we have another two categories:
• Independent Sample T-test: Independent samples means that the two samples are
selected from two completely different populations. In other words, one population
should not be dependent on the other population.
• Paired T-test: If our samples are connected in some way, we have to use the paired
t-test. Here, connected means that we are collecting data from the same group two
times, e.g., blood tests of patients of a hospital before and after medication.
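The one-sample t-statistic can be sketched directly from its definition, t = (x̄ − µ₀)/(s/√n). The sample and hypothesized mean below are hypothetical:

```python
import math
import statistics

# One-sample t-test sketch: compare a small sample's mean to a
# hypothesized population mean of 7 (sample values are hypothetical).
sample = [6, 8, 7, 10, 8, 4, 9]
mu0 = 7

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)              # sample std. dev. (divides by n - 1)
t_stat = (mean - mu0) / (s / math.sqrt(n))
print(round(t_stat, 2))  # 0.57
```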
Chi-square test

Chi-square test is used in the case when we have to compare categorical data. Chi-square test is
of two types. Both use chi-square statistics and distribution for different purposes.

• Goodness of fit: It determines whether sample data of a categorical variable matches the
population distribution or not.
• Test of Independence: It compares two categorical variables to find whether they are
related to each other or not.
The Chi-square statistic is given by:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

where Oᵢ is the observed count and Eᵢ is the expected count in category i.
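A goodness-of-fit sketch of the statistic χ² = Σ (O − E)²/E, with hypothetical observed and expected counts for a six-sided die:

```python
# Chi-square goodness-of-fit sketch (hypothetical counts): did a
# six-sided die produce the expected uniform counts over 60 rolls?
observed = [8, 9, 12, 11, 6, 14]
expected = [10, 10, 10, 10, 10, 10]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 4.2
```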

ANOVA (Analysis of variance)

The ANOVA test is a way to find out whether experiment results are significant or not. It is
generally used when there are more than 2 groups and we have to test the hypothesis that the
means of multiple populations are equal (the populations are assumed to have equal variances).

E.g., students from different colleges take the same exam, and we want to see if one college
outperforms the others.

There are two types of ANOVA test:

1. One-way ANOVA
2. Two-way ANOVA
The test statistic in ANOVA is given by: F = (variance between the group means) / (variance
within the groups).
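The ANOVA F statistic is the ratio of the between-group mean square to the within-group mean square. A sketch with hypothetical exam scores for three colleges:

```python
import statistics

# One-way ANOVA sketch, F = (variance between group means) /
# (variance within groups), on hypothetical exam scores for 3 colleges.
groups = [[85, 90, 88], [78, 82, 80], [91, 89, 93]]

k = len(groups)                           # number of groups
n = sum(len(g) for g in groups)           # total observations
grand_mean = statistics.mean(x for g in groups for x in g)

# Between-group mean square, k - 1 degrees of freedom
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-group mean square, n - k degrees of freedom
ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
ms_within = ss_within / (n - k)

f_stat = ms_between / ms_within
print(round(f_stat, 2))  # 19.98
```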

Conclusion

In this article, we studied inferential statistics and the different topics in it like probability,
hypothesis testing, and the different types of tests in hypothesis testing. We also discussed the
importance of inferential statistics and how we can make inferences about a population from
sample data, which is both time-saving and cost-saving.
3.2 Machine Learning for Data Analysis
Over the course of an hour, an unsolicited email skips your inbox and goes straight to spam, a
car next to you auto-stops when a pedestrian runs in front of it, and an ad for the product you
were thinking about yesterday pops up on your social media feed. What do these events all
have in common? It’s artificial intelligence that has guided all these decisions. And the force
behind them all is machine-learning algorithms that use data to predict outcomes.

Now, before we look at how machine learning aids data analysis, let’s explore the
fundamentals of each.
What is Machine Learning?
Machine learning is the science of designing algorithms that learn on their own from data and
adapt without human correction. As we feed data to these algorithms, they build their own
logic and, as a result, create solutions relevant to aspects of our world as diverse as fraud
detection, web searches, tumor classification, and price prediction.
In deep learning, a subset of machine learning, programs discover intricate concepts by
building them out of simpler ones. These algorithms work by exposing multilayered (hence
“deep”) neural networks to vast amounts of data. Applications for machine learning, such
as natural language processing, dramatically improve performance through the use of deep
learning.
What is Data Analysis?
Data analysis involves manipulating, transforming, and visualizing data in order to infer
meaningful insights from the results. Individuals, businesses, and even governments often take
direction based on these insights.
Data analysts might predict customer behavior, stock prices, or insurance claims by using basic
linear regression. They might create homogeneous clusters using classification and regression
trees (CART), or they might gain some impact insight by using graphs to visualize a financial
technology company’s portfolio.
Until the final decades of the 20th century, human analysts were irreplaceable when it came to
finding patterns in data. Today, they’re still essential when it comes to feeding the right kind of
data to learning algorithms and inferring meaning from algorithmic output, but machines can
and do perform much of the analytical work itself.
Why Machine Learning is Useful in Data Analysis
Machine learning constitutes model-building automation for data analysis. When we assign
machines tasks like classification, clustering, and anomaly detection — tasks at the core of
data analysis — we are employing machine learning.
We can design self-improving learning algorithms that take data as input and offer statistical
inferences. Without relying on hard-coded programming, the algorithms make decisions
whenever they detect a change in pattern.
Before we look at specific data analysis problems, let’s discuss some terminology used to
categorize different types of machine-learning algorithms. First, we can think of most
algorithms as either classification-based, where machines sort data into classes, or regression-
based, where machines predict values.
Next, let’s distinguish between supervised and unsupervised algorithms. A supervised
algorithm provides target values after sufficient training with data. In contrast, the information
used to instruct an unsupervised machine-learning algorithm needs no output variable to guide
the learning process.
For example, a supervised algorithm might estimate the value of a home after reviewing the
price (the output variable) of similar homes, while an unsupervised algorithm might look for
hidden patterns in on-the-market housing.
As popular as these machine-learning models are, we still need humans to derive the final
implications of data analysis. Making sense of the results or deciding, say, how to clean the
data remains up to us humans.
Machine-Learning Algorithms for Data Analysis
Now let’s look at six well-known machine-learning algorithms used in data analysis. In
addition to reviewing their structure, we’ll go over some of their real-world applications.
Clustering
At a local garage sale, you buy 70 monochromatic shirts, each of a different color. To avoid
decision fatigue, you design an algorithm to help you color-code your closet. This algorithm
uses photos of each shirt as input and, comparing the color of each shirt to the others, creates
categories to account for every shirt. We call this clustering: an unsupervised learning
algorithm that looks for patterns among input values and groups them accordingly. Here is
a GeeksForGeeks article that provides visualizations of this machine-learning model.
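The clustering idea can be sketched with a hand-rolled 1-D k-means (not the linked article's code): hypothetical shirt "hue" values are grouped by alternating nearest-centroid assignment and centroid updates.

```python
import statistics

# Toy 1-D k-means sketch: alternate nearest-centroid assignment and
# centroid updates on hypothetical shirt "hue" values that form two
# obvious colour groups.
def kmeans_1d(points, centroids, iters=10):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [statistics.mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

hues = [8, 10, 12, 11, 88, 90, 92, 91]
centroids, clusters = kmeans_1d(hues, centroids=[0, 100])
print(sorted(centroids))  # [10.25, 90.25]
```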
Decision-tree learning
You can think of a decision tree as an upside-down tree: you start at the “top” and move
through a narrowing range of options. These learning algorithms take a single data set and
progressively divide it into smaller groups by creating rules to differentiate the features it
observes. Eventually, they create sets small enough to be described by a specific label. For
example, they might take a general car data set (the root) and classify it down to a make and
then to a model (the leaves).

As you might have gathered, decision trees are supervised learning algorithms ideal for
resolving classification problems in data analysis, such as guessing a person’s blood type.
Check out this in-depth Medium article that explains how decision trees work.
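The splitting idea can be sketched as a one-level "decision stump" that searches for the single feature threshold with the fewest misclassifications. The toy car data and feature name below are hypothetical:

```python
# A one-level "decision stump": find the single rule (feature threshold)
# that best splits a toy car dataset by class label.
cars = [
    {"weight": 1200, "label": "compact"},
    {"weight": 1300, "label": "compact"},
    {"weight": 2100, "label": "suv"},
    {"weight": 2300, "label": "suv"},
]

def best_threshold(data, feature):
    """Try midpoints between sorted values; pick the split with fewest errors."""
    values = sorted(d[feature] for d in data)
    best = (None, len(data) + 1)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [d["label"] for d in data if d[feature] <= t]
        right = [d["label"] for d in data if d[feature] > t]
        # Count errors if each side predicts its majority label
        errors = sum(1 for side in (left, right)
                     for lbl in side if lbl != max(set(side), key=side.count))
        if errors < best[1]:
            best = (t, errors)
    return best

threshold, errors = best_threshold(cars, "weight")
print(threshold, errors)  # 1700.0 0
```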
Ensemble learning
Imagine you’re en route to a camping trip with your buddies, but no one in the group
remembered to check the weather. Noting that you always seem dressed appropriately for the
weather, one of your buddies asks you to stand in as a meteorologist. Judging from the time of
year and the current conditions, you guess that it’s going to be 72°F (22°C) tomorrow.
Now imagine that everyone in the group came with their own predictions for tomorrow’s
weather: one person listened to the weatherman; another saw Doppler radar reports online; a
third asked her parents; and you made your prediction based on current conditions.
Do you think you, the group’s appointed meteorologist, will have the most accurate prediction,
or will the average of all four guesses be closer to the actual weather tomorrow? Ensemble
learning dictates that, taken together, your predictions are likely to be distributed around the
right answer. The average will likely be closer to the mark than your guess alone.
In technical terms, this machine-learning model frequently used in data analysis is known as
the random forest approach: by training decision trees on random subsets of data points, and
by adding some randomness into the training procedure itself, you build a forest of diverse
trees that offer a more robust average than any individual tree. For a deeper dive, read
this tutorial on implementing the random forest approach in Python.
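The averaging intuition from the camping story can be sketched directly: many noisy, unbiased guesses of the true temperature, averaged, land closer to the truth than a typical individual guess. The Gaussian noise model is hypothetical:

```python
import random
import statistics

# Ensemble sketch of the camping story: many noisy, unbiased guesses of
# tomorrow's temperature (truth: 72°F), averaged.
random.seed(42)
true_temp = 72.0
guesses = [true_temp + random.gauss(0, 5) for _ in range(100)]

mean_individual_error = statistics.mean(abs(g - true_temp) for g in guesses)
ensemble_error = abs(statistics.mean(guesses) - true_temp)

# The averaged prediction lands much closer to the truth than a
# typical individual guess.
print(ensemble_error < mean_individual_error)  # True
```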
Support-vector machine
Have you ever struggled to differentiate between two species — perhaps between alligators
and crocodiles? After a long while, you manage to learn how: alligators have a U-shaped
snout, while crocodiles’ mouths are slender and V-shaped; and crocodiles have a much toothier
grin than alligators do. But on a trip to the Everglades, you come across a reptile that,
perplexingly, has features of both — so how can you tell the difference? Support-vector
machine (SVM) algorithms are here to help you out.
First, let’s draw a graph with one distinguishing feature (snout shape) as the x-axis and another
(grin toothiness) as the y-axis. We’ll populate the graph with plenty of data points for both
species, and then find possible planes (or, in this 2D case, lines) that separate the two classes.
Our objective is to find a single “hyperplane” that divides the data by maximizing the distance
between the dividing plane and each class’s closest points — called support vectors. No more
confusion between crocs and gators: once the SVM finds this hyperplane, you can easily
classify the reptiles in your vacation photos by seeing which side each one lands on.
SVM algorithms are used to classify data into categories, but it’s not always possible to
differentiate between classes with a linear boundary on a 2D graph. To resolve this, you can use
a kernel: an established pattern to map data to higher dimensions. By using a combination of
kernels and tweaks to their parameters, you’ll be able to find a decision boundary that is
non-linear in the original space and continue on your way distinguishing between reptiles. This
YouTube video does a clear job of visualizing how kernels integrate with SVM.
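Once an SVM has found a hyperplane w·x + b = 0, classification reduces to the sign of w·x + b. A sketch with hypothetical weights for the two reptile features (snout V-shapedness, grin toothiness), each scaled to [0, 1]:

```python
# Classifying with a learned hyperplane w·x + b = 0: only the sign of
# the score matters. The weights below are hypothetical, not trained.
w = (1.0, 1.0)
b = -1.0

def classify(point):
    score = w[0] * point[0] + w[1] * point[1] + b
    return "crocodile" if score > 0 else "alligator"

print(classify((0.9, 0.8)))  # crocodile
print(classify((0.1, 0.2)))  # alligator
```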
Linear regression
If you’ve ever used a scatterplot to find a cause-and-effect relationship between two sets of
data, then you’ve used linear regression. This is a modeling method ideal for forecasting and
finding correlations between variables in data analysis.
For example, say you want to see if there’s a connection between fatigue and the number of
hours someone works. You gather data from a set of people with a wide array of work
schedules and plot your findings. Seeking a relationship between the independent variable
(hours worked) and the dependent variable (fatigue), you notice that a straight line with a
positive slope best models the correlation. You’ve just used linear regression! If you’re
interested in a detailed understanding of linear regression for machine learning, check out
this blog post from Machine Learning Mastery.
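The fatigue example can be sketched with the closed-form least-squares solution, slope = cov(x, y)/var(x) and intercept = ȳ − slope·x̄. The fatigue scores below are hypothetical:

```python
import statistics

# Least-squares sketch for the fatigue example: hours worked (x) vs. a
# hypothetical fatigue score (y).
hours = [4, 6, 8, 10, 12]
fatigue = [2, 3, 4, 5, 6]

mean_x, mean_y = statistics.mean(hours), statistics.mean(fatigue)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, fatigue))
var_x = sum((x - mean_x) ** 2 for x in hours)

slope = cov_xy / var_x
intercept = mean_y - slope * mean_x
print(slope, intercept)  # 0.5 0.0
```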
Logistic regression
While linear regression algorithms look for correlations between variables that are continuous
by nature, logistic regression is ideal for classifying categorical data. Our alligator-versus-
crocodile problem is, in fact, a logistic regression problem. Whereas the SVM model can work
with non-linear kernels, logistic regression is limited to (and great for) linear classification. See
this in-depth overview of logistic regression, especially good for lovers of calculus.
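A sketch of the logistic idea: a sigmoid squashes a linear score into a probability, and a 0.5 cutoff yields the class. The weight and bias below are hypothetical, not fitted to data:

```python
import math

# Logistic-regression sketch on a single feature (snout V-shapedness in
# [0, 1]); class 1 = crocodile, class 0 = alligator.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(snout_v_shapedness, weight=4.0, bias=-2.0):
    """Return (P(crocodile), predicted class) from a single feature."""
    p = sigmoid(weight * snout_v_shapedness + bias)
    return p, int(p > 0.5)

p_croc, label_croc = predict(0.9)    # strongly V-shaped snout
print(label_croc)   # 1 -> crocodile
p_gator, label_gator = predict(0.1)  # U-shaped snout
print(label_gator)  # 0 -> alligator
```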
Summary
In this article, we looked at how machine learning can automate and scale data analysis. We
summarized a few important machine-learning algorithms and saw their real-life applications.
While machine learning offers precision and scalability in data analysis, it’s important to
remember that the real work of evaluating machine learning results still belongs to humans.
3.3 What Is Descriptive Statistics - Definition, Types, & More

If you work with datasets long enough, you will eventually need to deal with statistics. Ask the
average person what statistics are, and they’ll probably throw around words like “numbers,”
“figures,” and “research.”

Statistics is the science, or a branch of mathematics, that involves collecting, classifying,


analyzing, interpreting, and presenting numerical facts and data. It is especially handy when
dealing with populations too numerous and extensive for specific, detailed measurements.
Statistics are crucial for drawing general conclusions relating to a dataset from a data sample.

Statistics further breaks down into two types: descriptive and inferential. Today, we look at
descriptive statistics, including a definition, the types of descriptive statistics, and the differences
between descriptive statistics and inferential statistics.

Descriptive Statistics Defined

Descriptive statistics describe, show, and summarize the basic features of a dataset found in a
given study, presented in a summary that describes the data sample and its measurements. It
helps analysts to understand the data better.

Descriptive statistics represent the available data sample and do not include theories, inferences,
probabilities, or conclusions. That’s a job for inferential statistics.
Descriptive Statistics Examples

If you want a good example of descriptive statistics, look no further than a student’s grade point
average (GPA). A GPA gathers the data points created through a large selection of grades,
classes, and exams, averages them together, and presents a general idea of the student’s mean
academic performance. Note that the GPA doesn’t predict future performance or present any
conclusions. Instead, it provides a straightforward summary of a student’s academic success
based on values pulled from data.

Here’s an even simpler example. Let’s assume a data set of 2, 3, 4, 5, and 6 equals a sum of 20.
The data set’s mean is 4, arrived at by dividing the sum by the number of values (20 divided by 5
equals 4).

Analysts often use charts and graphs to present descriptive statistics. If you stood outside of a
movie theater, asked 50 members of the audience if they liked the film they saw, then put your
findings on a pie chart, that would be descriptive statistics. In this example, descriptive statistics
measure the number of yes and no answers and show how many people in this specific theater
liked or disliked the movie. If you tried to come up with any other conclusions, you would be
wandering into inferential statistics territory, but we'll later cover that issue.

Finally, political polling is considered a descriptive statistic, provided it’s just presenting
concrete facts (the respondents’ answers), without drawing any conclusions. Polls are relatively
straightforward: “Who did you vote for President in the recent election?”

Types of Descriptive Statistics

Descriptive statistics break down into several types, characteristics, or measures. Some authors
say that there are two types. Others say three or even four.

Distribution (Also Called Frequency Distribution)

Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to
summarize the frequency of every possible value of a variable, rendered in percentages or
numbers. For instance, if you held a poll to determine people’s favorite Beatle, you’d set up one
column with all possible variables (John, Paul, George, and Ringo), and another with the number
of votes.

Statisticians depict frequency distributions as either a graph or as a table.


Measures of Central Tendency

Measures of central tendency estimate a dataset's average or center, finding the result using three
methods: mean, mode, and median.

Mean: The mean is also known as “M” and is the most common method for finding averages.
You get the mean by adding all the response values together, and dividing the sum by the
number of responses, or “N.” For instance, say someone is trying to figure out how many hours a
day they sleep in a week. So, the data set would be the hour entries (e.g., 6,8,7,10,8,4,9), and the
sum of those values is 52. There are seven responses, so N=7. You divide the value sum of 52 by
N, or 7, to find M, which in this instance is approximately 7.43.

Mode: The mode is just the most frequent response value. Datasets may have any number of
modes, including “zero.” You can find the mode by arranging your dataset's order from the
lowest to highest value and then looking for the most common response. So, in using our sleep
study from the last part: 4,6,7,8,8,9,10. As you can see, the mode is eight.

Median: Finally, we have the median, defined as the value in the precise center of the dataset.
Arrange the values in ascending order (like we did for the mode) and look for the number in the
set’s middle. In this case, the median is eight.
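The three measures above, computed on the sleep-study data with Python's statistics module:

```python
import statistics

# Mean, mode, and median of the sleep-study data from the text.
hours = [6, 8, 7, 10, 8, 4, 9]

print(round(statistics.mean(hours), 2))  # 7.43
print(statistics.mode(hours))            # 8
print(statistics.median(hours))          # 8
```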

Variability (Also Called Dispersion)

The measure of variability gives the statistician an idea of how spread out the responses are. The
spread has three aspects — range, standard deviation, and variance.

Range: Use range to determine how far apart the most extreme values are. Start by subtracting
the dataset’s lowest value from its highest value. Once again, we turn to our sleep study:
4,6,7,8,8,9,10. We subtract four (the lowest) from ten (the highest) and get six. There’s your
range.

Standard Deviation: This aspect takes a little more work. The standard deviation (s) is your
dataset’s average amount of variability, showing you how far each score lies from the mean. The
larger your standard deviation, the more variable your dataset. Follow these six steps:

1. List the scores and their means.

2. Find the deviation by subtracting the mean from each score.

3. Square each deviation.

4. Total up all the squared deviations.


5. Divide the sum of the squared deviations by N-1.

6. Find the result’s square root.

Raw Number/Data   Deviation from Mean   Deviation Squared

4                 4 - 7.43 = -3.43      11.76

6                 6 - 7.43 = -1.43      2.04

7                 7 - 7.43 = -0.43      0.18

8                 8 - 7.43 = 0.57       0.32

8                 8 - 7.43 = 0.57       0.32

9                 9 - 7.43 = 1.57       2.46

10                10 - 7.43 = 2.57      6.60

M = 7.43          Sum ≈ 0               Sum of squares = 23.68


When you divide the sum of the squared deviations by 6 (N-1): 23.68/6, you get approximately
3.95, and the square root of that result is approximately 1.99. As a result, we now know that each
score deviates from the mean by an average of about 1.99 points.
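The six steps above can be sketched directly on the sleep data:

```python
import math

# Standard deviation in six steps, on the sleep data from the text.
scores = [4, 6, 7, 8, 8, 9, 10]
n = len(scores)
mean = sum(scores) / n                     # step 1: the mean

deviations = [x - mean for x in scores]    # step 2: subtract the mean
squared = [d ** 2 for d in deviations]     # step 3: square each deviation
total = sum(squared)                       # step 4: total the squares
variance = total / (n - 1)                 # step 5: divide by N - 1
std_dev = math.sqrt(variance)              # step 6: take the square root

print(round(variance, 2), round(std_dev, 2))  # 3.95 1.99
```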

Variance: Variance reflects the dataset’s degree of spread. The greater the degree of data spread,
the larger the variance relative to the mean. You can get the variance by just squaring the
standard deviation: using the above example, squaring the standard deviation gives
approximately 3.95.

Univariate Descriptive Statistics

Univariate descriptive statistics examine only one variable at a time and do not compare
variables. Rather, they allow the researcher to describe individual variables. The patterns
identified in this sort of data may be described using the following:

• Measures of central tendency (mean, mode, and median)

• Data dispersion (standard deviation, variance, range, minimum, maximum, and quartiles)

• Tables of frequency distribution

• Pie graphs

• Frequency polygon histograms

• Bar graphs

Bivariate Descriptive Statistics

When using bivariate descriptive statistics, two variables are concurrently analyzed (compared)
to see whether they are correlated. Generally, by convention, the independent variable is
represented by the columns, and the rows represent the dependent variable.

There are numerous real-world applications for bivariate data. For example, estimating when a
natural occurrence will happen is quite valuable, and bivariate data analysis is a tool in the
statistician's toolbox. Sometimes, something as simple as projecting one parameter against the
other on a two-dimensional plane gives a better understanding of what the data is trying to tell
you. A classic example is a scatterplot of the interval between eruptions at Old Faithful against
each eruption's duration.
Univariate vs. Bivariate

Univariate:

• Involves only one variable

• Doesn't deal with relationships or causes

• The prime purpose of univariate analysis is describing:

• Central tendency: mean, median, and mode

• Dispersion: variance, range, standard deviation, quartiles, maximum, minimum

• Bar graph, pie chart, histogram, box-and-whisker plot, line graph

Bivariate:

• Involves two variables

• Deals with causes or relationships

• The prime purpose of bivariate analysis is explaining:

• Correlations: comparisons, explanations, causes, relationships

• Dependent and independent variables

• Tables where just one variable is dependent on other variables' values

• Simultaneous analysis of two variables
What is the Main Purpose of Descriptive Statistics?

Descriptive statistics can be useful for two things: 1) providing basic information about variables
in a dataset and 2) highlighting potential relationships between variables. The three most
common descriptive statistics can be displayed graphically or pictorially, which is useful for
summarising data. Descriptive statistics only make statements about the data set used to
calculate them; they never go beyond your data.

Scatter Plots

A scatter plot employs dots to indicate values for two separate numeric variables. Each dot's
location on the horizontal and vertical axes represents a data point's values. Scatter plots are
used to observe relationships between variables.

The main purposes of scatter plots are to examine and display relationships between two
numerical variables. The points in a scatter plot document the values of individual points and
trends when the data is obtained as a whole. Identification of correlational links is prevalent with
scatter plots. In these situations, we want to know what a good vertical value prediction would be
given a specific horizontal value.

This can lead to overplotting when there are many data points to plot. When data points are
overlaid to the point where it is difficult to see the connections between them and the variables,
this is known as overplotting. It might be difficult to discern how densely-packed data points are
when lots of them are in a tiny space.

There are a couple of simple methods to relieve this issue. One approach is to choose only a
subset of data points: a random sample of points should still offer the basic sense of the patterns
in the whole data. Additionally, we can alter the shape of the dots, increasing transparency to
make overlaps visible or decreasing point size to minimise overlaps.

What’s the Difference Between Descriptive Statistics and Inferential Statistics?

So, what’s the difference between the two statistical forms? We’ve already touched upon this
when we mentioned that descriptive statistics doesn’t infer any conclusions or predictions, which
implies that inferential statistics do so.

Inferential statistics takes a random sample of data from a portion of the population and
describes and makes inferences about the entire population. For instance, in asking 50 people if
they liked the movie they had just seen, inferential statistics would build on that and assume that
those results would hold for the rest of the moviegoing population in general.

Therefore, if you stood outside that movie theater and surveyed 50 people who had just seen
Rocky 20: Enough Already! and 38 of them disliked it (about 76 percent), you could extrapolate
that 76% of the rest of the movie-watching world will dislike it too, even though you haven’t the
means, time, and opportunity to ask all those people.

Simply put: Descriptive statistics give you a clear picture of what your current data shows.
Inferential statistics makes projections based on that data.

3.4 Bayes' theorem in Artificial intelligence

Bayes' theorem:

Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which
determines the probability of an event with uncertain knowledge.

In probability theory, it relates the conditional probability and marginal probabilities of two
random events.

Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian
inference is an application of Bayes' theorem, which is fundamental to Bayesian statistics.

It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).

Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.

Example: If cancer corresponds to one's age then by using Bayes' theorem, we can determine the
probability of cancer more accurately with the help of age.

Bayes' theorem can be derived using product rule and conditional probability of event A with
known event B:

As from product rule we can write:

1. P(A ⋀ B)= P(A|B) P(B) or

Similarly, the probability of event B with known event A:

1. P(A ⋀ B)= P(B|A) P(A)


Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B) .......... (a)

The above equation (a) is called Bayes' rule or Bayes' theorem. This equation is the basis of most
modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here,

P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of
hypothesis A given that evidence B has occurred.

P(B|A) is called the likelihood: assuming the hypothesis is true, we calculate the probability of
the evidence.

P(A) is called the prior probability, the probability of the hypothesis before considering the
evidence.

P(B) is called the marginal probability, the pure probability of the evidence.

In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai); hence, Bayes' rule can be
written as:

P(Ai|B) = P(Ai) P(B|Ai) / Σk P(Ak) P(B|Ak)

Where A1, A2, A3,........, An is a set of mutually exclusive and exhaustive events.

Applying Bayes' rule:

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This
is very useful in cases where we have good probabilities for these three terms and want to
determine the fourth one. Suppose we want to perceive the effect of some unknown cause and
want to compute that cause; then Bayes' rule becomes:

P(cause|effect) = P(effect|cause) P(cause) / P(effect)

Example-1:

Question: What is the probability that a patient has the disease meningitis, given a stiff neck?
Given Data:

A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs 80%
of the time. He is also aware of some more facts, which are given as follows:

o The Known probability that a patient has meningitis disease is 1/30,000.

o The Known probability that a patient has a stiff neck is 2%.

Let a be the proposition that the patient has a stiff neck and b be the proposition that the patient
has meningitis, so we can calculate the following:

P(a|b) = 0.8

P(b) = 1/30000

P(a)= .02

Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff neck.
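The arithmetic of this example can be checked with a short Python sketch (the variable names are ours, chosen for readability):

```python
# Bayes' rule: P(b|a) = P(a|b) * P(b) / P(a)
p_a_given_b = 0.8        # P(stiff neck | meningitis)
p_b = 1 / 30000          # prior probability of meningitis
p_a = 0.02               # overall probability of a stiff neck

p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)       # ≈ 0.001333, i.e. about 1 in 750
```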

Example-2:

Question: From a standard deck of playing cards, a single card is drawn. The probability
that the card is king is 4/52, then calculate posterior probability P(King|Face), which means
the drawn face card is a king card.

Solution:

P(King): probability that the card is a King = 4/52 = 1/13

P(Face): probability that a card is a face card = 12/52 = 3/13

P(Face|King): probability of a face card given that it is a king = 1

Putting all values into Bayes' theorem, we get:

P(King|Face) = P(Face|King) P(King) / P(Face) = (1 × 1/13) / (3/13) = 1/3


Application of Bayes' theorem in Artificial intelligence:

Following are some applications of Bayes' theorem:

o It is used to calculate the next step of the robot when the already executed step is given.

o Bayes' theorem is helpful in weather forecasting.

o It can solve the Monty Hall problem.

3.5 K-Nearest Neighbor(KNN) Algorithm for Machine Learning

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.

o K-NN algorithm assumes the similarity between the new case/data and available cases and
puts the new case into the category that is most similar to the available categories.

o K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means that when new data appears, it can be easily classified into a well-
suited category by using the K-NN algorithm.

o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.

o K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.

o It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
o KNN algorithm at the training phase just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to a cat and a dog, but
we want to know whether it is a cat or a dog. For this identification, we can use the KNN
algorithm, as it works on a similarity measure. Our KNN model will find the features of the
new image that are most similar to the cat and dog images, and based on the most similar
features it will put the image in either the cat or the dog category.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data point
x1, so this data point will lie in which of these categories. To solve this type of problem, we need
a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular dataset. Consider the below diagram:
How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of the neighbors

o Step-2: Calculate the Euclidean distance of K number of neighbors

o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

o Step-4: Among these k neighbors, count the number of the data points in each category.

o Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.

o Step-6: Our model is ready.
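The six steps above can be sketched from scratch in a few lines of Python (toy data and function names are ours; `math.dist` requires Python 3.8+):

```python
# A minimal from-scratch sketch of the K-NN steps on made-up 2-D data.
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    # Step 2-3: Euclidean distance to every training point, keep the k nearest
    nearest = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )[:k]
    # Step 4-5: count labels among the k neighbors and take the majority
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (2, 2), k=3))  # prints A: all 3 nearest are category A
```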

Suppose we have a new data point and we need to put it in the required category. Consider the
below image:

o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry. For
two points (x1, y1) and (x2, y2) it can be calculated as:

d = √((x2 − x1)² + (y2 − y1)²)

o By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
o As we can see the 3 nearest neighbors are from category A, hence this new data point must
belong to category A.

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:

o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.

o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers
in the model.

o Larger values of K reduce the effect of noise, but too large a value can blur the class
boundaries and increase the computation.
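One common way to choose K in practice is to cross-validate several candidate values and keep the best one. A sketch assuming scikit-learn and its built-in Iris dataset (the candidate range is our choice):

```python
# Try several K values with 5-fold cross-validation and pick the best.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in (1, 3, 5, 7, 9)
}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```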

Advantages of KNN Algorithm:

o It is simple to implement.

o It is robust to noisy training data.

o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o Always needs to determine the value of K, which may be complex at times.

o The computation cost is high because of calculating the distance between the data points
for all the training samples.

Python implementation of the KNN algorithm

To do the Python implementation of the K-NN algorithm, we will use the same problem and
dataset which we have used in Logistic Regression. But here we will improve the performance of
the model. Below is the problem description:

Problem for K-NN Algorithm: There is a Car manufacturer company that has manufactured a
new SUV car. The company wants to give the ads to the users who are interested in buying that
SUV. So for this problem, we have a dataset that contains multiple users' information from a
social network. The dataset contains lots of information, but we will consider Estimated
Salary and Age as the independent variables and the Purchased variable as the dependent
variable. Below is the dataset:
Steps to implement the K-NN algorithm:

o Data Pre-processing step

o Fitting the K-NN algorithm to the Training set

o Predicting the test result

o Test accuracy of the result(Creation of Confusion matrix)

o Visualizing the test set result.

Data Pre-Processing Step:

The Data Pre-processing step will remain exactly the same as Logistic Regression. Below is the
code for it:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

By executing the above code, our dataset is imported to our program and well pre-processed. After
feature scaling our test dataset will look like:
From the above output image, we can see that our data is successfully scaled.

o Fitting K-NN classifier to the Training data:

Now we will fit the K-NN classifier to the training data. To do this we will import
the KNeighborsClassifier class of the Sklearn Neighbors library. After importing the class,
we will create the classifier object of the class. The parameters of this class will be:

o n_neighbors: To define the required neighbors of the algorithm. Usually, it takes 5.

o metric='minkowski': This is the default parameter and it decides the distance
between the points.

o p=2: It is equivalent to the standard Euclidean metric.

And then we will fit the classifier to the training data. Below is the code for it:
#Fitting K-NN classifier to the training set
from sklearn.neighbors import KNeighborsClassifier
classifier= KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the output as:

Out[10]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')

o Predicting the Test Result: To predict the test set result, we will create a y_pred vector
as we did in Logistic Regression. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

Output:

The output for the above code will be:


o Creating the Confusion Matrix:
Now we will create the Confusion Matrix for our K-NN model to see the accuracy of the
classifier. Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

In the above code, we have imported the confusion_matrix function and stored its result in the
variable cm.

Output: By executing the above code, we will get the matrix as below:

In the above image, we can see there are 64+29= 93 correct predictions and 3+4= 7 incorrect
predictions, whereas, in Logistic Regression, there were 11 incorrect predictions. So we can say
that the performance of the model is improved by using the K-NN algorithm.

o Visualizing the Training set result:


Now, we will visualize the training set result for the K-NN model. The code will remain the
same as in Logistic Regression, except the name of the graph. Below is the code for it:
#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN Algorithm (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

By executing the above code, we will get the below graph:


The output graph is different from the graph which we obtained in Logistic Regression. It
can be understood from the below points:

o As we can see the graph is showing the red point and green points. The green points
are for Purchased(1) and Red Points for not Purchased(0) variable.

o The graph is showing an irregular boundary instead of showing any straight line or
any curve because it is a K-NN algorithm, i.e., finding the nearest neighbor.

o The graph has classified users in the correct categories as most of the users who
didn't buy the SUV are in the red region and users who bought the SUV are in the
green region.

o The graph shows a good result, but still there are some green points in the red
region and red points in the green region. This is not a big issue, as it prevents
the model from overfitting.

o Hence our model is well trained.

o Visualizing the Test set result:


After the training of the model, we will now test the result by putting a new dataset, i.e.,
Test dataset. Code remains the same except some minor changes: such as x_train and
y_train will be replaced by x_test and y_test.
Below is the code for it:
#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
mtp.title('K-NN algorithm (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the output for the test dataset. As we can see in the graph, the
predicted output is quite good, as most of the red points are in the red region and most of the
green points are in the green region.
However, there are a few green points in the red region and a few red points in the green region.
These are the incorrect observations that we observed in the confusion matrix (7 incorrect
predictions).

3.6 Linear Discriminant Analysis

1. Linear Discriminant Analysis (LDA) is a supervised learning algorithm used for
classification tasks in machine learning. It is a technique used to find a linear combination of
features that best separates the classes in a dataset.
2. LDA works by projecting the data onto a lower-dimensional space that maximizes the
separation between the classes. It does this by finding a set of linear discriminants that
maximize the ratio of between-class variance to within-class variance. In other words, it
finds the directions in the feature space that best separate the different classes of data.
3. LDA assumes that the data has a Gaussian distribution and that the covariance matrices of
the different classes are equal. It also assumes that the data is linearly separable, meaning
that a linear decision boundary can accurately classify the different classes.

LDA has several advantages, including:

It is a simple and computationally efficient algorithm.


It can work well even when the number of features is much larger than the number of training
samples.
It can handle multicollinearity (correlation between features) in the data.

However, LDA also has some limitations, including:

It assumes that the data has a Gaussian distribution, which may not always be the case.
It assumes that the covariance matrices of the different classes are equal, which may not be true
in some datasets.
It assumes that the data is linearly separable, which may not be the case for some datasets.
It may not perform well in high-dimensional feature spaces.
Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function
Analysis is a dimensionality reduction technique that is commonly used for supervised
classification problems. It is used for modelling differences in groups i.e. separating two or more
classes. It is used to project the features in higher dimension space into a lower dimension space.
For example, we have two classes and we need to separate them efficiently. Classes can have
multiple features. Using only a single feature to classify them may result in some overlapping as
shown in the below figure. So, we will keep on increasing the number of features for proper
classification.
Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane,
there’s no straight line that can separate the two classes of the data points completely. Hence, in
this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a 1D
graph in order to maximize the separability between the two classes.

Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories and
hence, reducing the 2D graph into a 1D graph.

Two criteria are used by LDA to create a new axis:


1. Maximize the distance between means of the two classes.
2. Minimize the variation within each class.
In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D
graph such that it maximizes the distance between the means of the two classes and minimizes
the variation within each class. In simple terms, this newly generated axis increases the
separation between the data points of the two classes. After generating this new axis using the
above-mentioned criteria, all the data points of the classes are plotted on this new axis and are
shown in the figure given below.

But Linear Discriminant Analysis fails when the mean of the distributions are shared, as it
becomes impossible for LDA to find a new axis that makes both the classes linearly separable. In
such cases, we use non-linear discriminant analysis.
Mathematics
Let us suppose we have two classes and d-dimensional samples x1, x2, ..., xn, where:
• n1 samples come from the class (c1) and n2 come from the class (c2).
If xi is a data point, then its projection on the line represented by unit vector v can be written as
vᵀxi.
Let us consider μ1 and μ2 to be the means of the samples of classes c1 and c2 respectively before
projection, and μ̂1 to denote the mean of the samples of class c1 after projection; it can be
calculated by:

μ̂1 = (1/n1) Σ_{xi ∈ c1} vᵀxi = vᵀμ1

Similarly, μ̂2 = vᵀμ2.

Now, in LDA we need to maximize |μ̂1 − μ̂2|. Let yi = vᵀxi be the projected samples; then the
scatter for the samples of c1 is:

s̃1² = Σ_{yi ∈ c1} (yi − μ̂1)²

Similarly, s̃2² = Σ_{yi ∈ c2} (yi − μ̂2)².

Now, we need to project our data on the line having direction v which maximizes

J(v) = (μ̂1 − μ̂2)² / (s̃1² + s̃2²)

For maximizing the above equation we need to find a projection vector that maximizes the
difference of the projected means while reducing the scatters of both classes. The scatter
matrices S1 and S2 of classes c1 and c2 are:

S1 = Σ_{xi ∈ c1} (xi − μ1)(xi − μ1)ᵀ

and S2 = Σ_{xi ∈ c2} (xi − μ2)(xi − μ2)ᵀ.

Now, we define the scatter within the classes (Sw) and the scatter between the classes (Sb):

Sw = S1 + S2 and Sb = (μ1 − μ2)(μ1 − μ2)ᵀ

In terms of these matrices, s̃1² + s̃2² = vᵀSw v and (μ̂1 − μ̂2)² = vᵀSb v, so:

J(v) = (vᵀSb v) / (vᵀSw v)

Now, to maximize the above equation we need to calculate the differentiation with respect to v,
which leads to the generalized eigenvalue problem:

Sw⁻¹ Sb v = λ v

Here, for the maximum value of J(v) we will use the eigenvector corresponding to the highest
eigenvalue (equivalently, v ∝ Sw⁻¹(μ1 − μ2) for two classes). This will provide us the best
solution for LDA.
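The two-class solution can be computed directly with NumPy on toy data (a small sketch; the data points are made up for illustration and v ∝ Sw⁻¹(μ1 − μ2) is used for the optimal direction):

```python
# Fisher's criterion on toy 2-D data: v proportional to Sw^{-1}(mu1 - mu2)
import numpy as np

c1 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # samples of class c1
c2 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # samples of class c2

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
s1 = (c1 - mu1).T @ (c1 - mu1)    # scatter matrix of class c1
s2 = (c2 - mu2).T @ (c2 - mu2)    # scatter matrix of class c2
sw = s1 + s2                      # within-class scatter Sw

v = np.linalg.solve(sw, mu1 - mu2)   # direction maximizing J(v)
v /= np.linalg.norm(v)
print(v)
```

Projecting both classes onto `v` (i.e. `c1 @ v` and `c2 @ v`) yields two non-overlapping sets of scalars for this toy data, which is exactly the separation LDA aims for.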
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used
such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually covariance), moderating the influence of different variables on LDA.
Implementation
• In this implementation, we will perform linear discriminant analysis using the Scikit-learn
library on the Iris dataset.

# necessary import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# read dataset from URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cls = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=cls)

# divide the dataset into features and target variable
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# Preprocess the dataset and divide into train and test
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# apply Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# plot the scatterplot
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='rainbow',
            alpha=0.7, edgecolors='b')
plt.show()

# classify using random forest classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# print the accuracy and confusion matrix
print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))
conf_m = confusion_matrix(y_test, y_pred)
print(conf_m)

LDA 2 -variable plot

Accuracy : 0.9

[[10 0 0]
[ 0 9 3]
[ 0 0 8]]
Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular
application in which each face is represented by a very large number of pixel values. Linear
discriminant analysis (LDA) is used here to reduce the number of features to a more
manageable number before the process of classification. Each of the new dimensions
generated is a linear combination of pixel values, which form a template. The linear
combinations obtained using Fisher’s linear discriminant are called Fisher’s faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient
disease state as mild, moderate, or severe based upon the patient’s various parameters and
the medical treatment he is going through. This helps the doctors to intensify or reduce the
pace of their treatment.
3. Customer Identification: Suppose we want to identify the type of customers who are most
likely to buy a particular product in a shopping mall. By doing a simple question and
answers survey, we can gather all the features of the customers. Here, a Linear discriminant
analysis will help us to identify and select the features which can describe the characteristics
of the group of customers that are most likely to buy that particular product in the shopping
mall.

3.7 Regression Analysis in Machine learning

Regression analysis is a statistical method to model the relationship between a dependent (target)
variable and one or more independent (predictor) variables. More specifically, regression analysis
helps us to understand how the value of the dependent variable changes corresponding to an
independent variable when the other independent variables are held fixed. It predicts
continuous/real values such as temperature, age, salary, price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, which runs various advertisements every year
and gets sales accordingly. The below list shows the advertisements made by the company in the
last 5 years and the corresponding sales:
Now, the company wants to do the advertisement of $200 in the year 2019 and wants to know
the prediction about the sales for this year. So to solve such type of prediction problems in
machine learning, we need regression analysis.

Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on the one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables.

In Regression, we plot a graph between the variables which best fits the given datapoints; using
this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through the datapoints on the target-
predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between the datapoints and the line tells whether a
model has captured a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors

o Determining Market trends

o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

o Dependent Variable: The main factor in Regression analysis which we want to predict or
understand is called the dependent variable. It is also called target variable.

o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable, also
called as a predictor.

o Outliers: Outlier is an observation which contains either very low value or very high value
in comparison to other observed values. An outlier may hamper the result, so it should be
avoided.

o Multicollinearity: If the independent variables are highly correlated with each other, then
such a condition is called Multicollinearity. It should not be present in the dataset, because
it creates problems while ranking the most affecting variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our algorithm
does not perform well even with training dataset, then such problem is called underfitting.

Why do we use Regression Analysis?

As mentioned above, Regression analysis helps in the prediction of a continuous variable. There
are various scenarios in the real world where we need future predictions, such as weather
conditions, sales, marketing trends, etc. For such cases we need a technique that can make
predictions accurately, and that technique is Regression analysis, a statistical method used in
machine learning and data science. Below are some other reasons for using Regression analysis:

o Regression estimates the relationship between the target and the independent variable.

o It is used to find the trends in data.

o It helps to predict real/continuous values.

o By performing the regression, we can confidently determine the most important factor,
the least important factor, and how each factor is affecting the other factors.

Types of Regression

There are various types of regressions which are used in data science and machine learning. Each
type has its own importance on different scenarios, but at the core, all the regression methods
analyze the effect of the independent variable on dependent variables. Here we are discussing some
important types of regression which are given below:

o Linear Regression

o Logistic Regression

o Polynomial Regression

o Support Vector Regression

o Decision Tree Regression

o Random Forest Regression

o Ridge Regression
o Lasso Regression:

Linear Regression:

o Linear regression is a statistical regression method which is used for predictive analysis.

o It is one of the very simple and easy algorithms which works on regression and shows the
relationship between the continuous variables.

o It is used for solving the regression problem in machine learning.

o Linear regression shows the linear relationship between the independent variable (X-axis)
and the dependent variable (Y-axis), hence called linear regression.

o If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
o The relationship between variables in the linear regression model can be explained using
the below image. Here we are predicting the salary of an employee on the basis of the years
of experience.

o Below is the mathematical equation for Linear regression:

Y = aX + b

Here, Y = dependent variable (target variable),
X = independent variable (predictor variable),
a and b are the linear coefficients (slope and intercept).

Some popular applications of linear regression are:

o Analyzing trends and sales estimates

o Salary forecasting

o Real estate prediction


o Arriving at ETAs in traffic.
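Fitting Y = aX + b takes only a few lines with scikit-learn. A sketch on a tiny made-up salary dataset (the numbers are illustrative, not from the text):

```python
# Simple linear regression: salary (in thousands) vs. years of experience.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[1], [2], [3], [4], [5]])   # X: years of experience
salary = np.array([30, 35, 40, 45, 50])       # Y: exactly 5 * years + 25

model = LinearRegression().fit(years, salary)
print(model.coef_[0], model.intercept_)       # a = 5.0, b = 25.0
print(model.predict([[6]]))                   # [55.]
```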

Logistic Regression:

o Logistic regression is another supervised learning algorithm which is used to solve the
classification problems. In classification problems, we have dependent variables in a
binary or discrete format such as 0 or 1.

o Logistic regression algorithm works with the categorical variable such as 0 or 1, Yes or
No, True or False, Spam or not spam, etc.

o It is a predictive analysis algorithm which works on the concept of probability.

o Logistic regression is a type of regression, but it is different from the linear regression
algorithm in the term how they are used.

o Logistic regression uses the sigmoid function or logistic function to map predicted values
to probabilities. This sigmoid function is used to model the data in logistic regression. The
function can be represented as:

f(x) = 1 / (1 + e⁻ˣ)

o f(x) = Output between the 0 and 1 value.

o x = input to the function

o e = base of natural logarithm.

When we provide the input values (data) to the function, it gives the S-curve as follows:
o It uses the concept of threshold levels: values above the threshold level are rounded up to
1, and values below the threshold level are rounded down to 0.

There are three types of logistic regression:

o Binary(0/1, pass/fail)

o Multi(cats, dogs, lions)

o Ordinal(low, medium, high)
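The sigmoid mapping described above is easy to verify directly (a minimal sketch; the function name is ours):

```python
# Sigmoid: f(x) = 1 / (1 + e^{-x}) maps any real input into (0, 1).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5 -- the usual 0.5 threshold sits at x = 0
print(sigmoid(4))    # ≈ 0.982, rounded up to class 1
print(sigmoid(-4))   # ≈ 0.018, rounded down to class 0
```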

Polynomial Regression:

o Polynomial Regression is a type of regression which models the non-linear dataset using
a linear model.

o It is similar to multiple linear regression, but it fits a non-linear curve between the value of
x and corresponding conditional values of y.

o Suppose there is a dataset which consists of datapoints which are present in a non-linear
fashion, so for such case, linear regression will not best fit to those datapoints. To cover
such datapoints, we need Polynomial regression.
o In Polynomial regression, the original features are transformed into polynomial
features of given degree and then modeled using a linear model. Which means the
datapoints are best fitted using a polynomial line.

o The equation for polynomial regression is also derived from the linear regression equation,
which means the linear regression equation Y = b0 + b1x is transformed into the polynomial
regression equation Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.

o Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.

o The model is still linear because the coefficients are linear; only the input feature is raised
to higher powers (quadratic, cubic, and so on).

Note: This is different from Multiple Linear regression in such a way that in Polynomial
regression, a single element has different degrees instead of multiple variables with the same
degree.

Support Vector Regression:

Support Vector Machine is a supervised learning algorithm which can be used for regression as
well as classification problems. So if we use it for regression problems, then it is termed as Support
Vector Regression.

Support Vector Regression is a regression algorithm which works for continuous variables. Below
are some keywords which are used in Support Vector Regression:
o Kernel: It is a function used to map a lower-dimensional data into higher dimensional data.

o Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is
a line which helps to predict the continuous variables and cover most of the datapoints.

o Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a
margin for datapoints.

o Support vectors: Support vectors are the datapoints which are nearest to the hyperplane
and opposite class.

In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must contain
a maximum number of datapoints. Consider the below image:

Here, the blue line is called hyperplane, and the other two lines are known as boundary lines.
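A minimal SVR sketch with scikit-learn (the data is made up and roughly linear; `epsilon` sets the width of the margin around the hyperplane within which errors are ignored):

```python
# Support Vector Regression on near-linear toy data.
import numpy as np
from sklearn.svm import SVR

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])   # roughly y = x

model = SVR(kernel='linear', epsilon=0.2).fit(X, y)
print(model.predict([[7]]))    # close to 7 for this near-linear data
```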

Decision Tree Regression:

o Decision Tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.

o It can solve problems for both categorical and numerical data


o Decision Tree regression builds a tree-like structure in which each internal node represents
the "test" for an attribute, each branch represent the result of the test, and each leaf node
represents the final decision or result.

o A decision tree is constructed starting from the root/parent node (the dataset), which splits
into left and right child nodes (subsets of the dataset). These child nodes are further divided
into their own child nodes, becoming parent nodes of those nodes in turn. Consider the below
image:

The image above shows an example of Decision Tree regression; here, the model is trying to predict
a person's choice between a sports car and a luxury car.
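A minimal numeric sketch of the same idea (toy one-feature data with assumed values): each leaf of a fitted DecisionTreeRegressor predicts the mean target of the training samples that reach it.

```python
# Illustrative DecisionTreeRegressor on a tiny assumed dataset.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.9])

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# internal nodes test the feature; each leaf returns the mean of its samples
preds = tree.predict([[2.5], [5.5]])
print(preds)   # low prediction for the left group, high for the right group
```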

Random Forest Regression:

o Random forest is one of the most powerful supervised learning algorithms, capable of
performing both regression and classification tasks.

o The Random Forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output based on the average of each tree's output. The
combined decision trees are called base models, and the ensemble can be represented more
formally as:

g(x) = f0(x) + f1(x) + f2(x) + ...

o Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in
which the aggregated decision trees run in parallel and do not interact with each other.

o With the help of Random Forest regression, we can prevent Overfitting in the model by
creating random subsets of the dataset.
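A small sketch of this averaging behaviour (synthetic data; n_estimators is an arbitrary choice): the forest prediction equals the mean of its individual trees' predictions.

```python
# Illustrative Random Forest regression built from bagged decision trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = 2.0 * X.ravel() + rng.randn(200)          # noisy linear target

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# g(x) is the average of the base-model outputs f0(x), f1(x), ...
tree_mean = np.mean([t.predict([[5.0]])[0] for t in forest.estimators_])
print(forest.predict([[5.0]])[0], tree_mean)  # the two values agree
```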

Ridge Regression:

o Ridge regression is one of the most robust versions of linear regression in which a small
amount of bias is introduced so that we can get better long term predictions.
o The amount of bias added to the model is known as the Ridge Regression penalty. We can
compute this penalty term by multiplying lambda by the squared weight of each individual
feature.

o The equation (cost function) for ridge regression will be:

Cost = Σ(y − ŷ)² + λ Σ w²

o A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.

o Ridge regression is a regularization technique, which is used to reduce the complexity of
the model. It is also called L2 regularization.

o It helps to solve the problems if we have more parameters than samples.
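A sketch of Ridge on highly collinear features (synthetic data; scikit-learn exposes the lambda penalty weight as `alpha`, and the values here are assumptions):

```python
# Illustrative Ridge (L2) regression with two nearly identical features.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = x1 + 0.01 * rng.randn(100)       # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.randn(100)

ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(ridge.coef_)  # the penalty splits the weight stably across both features
```

Without the penalty, near-duplicate columns make ordinary least squares numerically unstable; the L2 term keeps both coefficients small and similar.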

Lasso Regression:

o Lasso regression is another regularization technique to reduce the complexity of the model.

o It is similar to the Ridge Regression except that penalty term contains only the absolute
weights instead of a square of weights.

o Because it takes absolute values, it can shrink the slope exactly to 0, whereas Ridge
Regression can only shrink it close to 0.

o It is also called L1 regularization. The equation (cost function) for Lasso regression will be:

Cost = Σ(y − ŷ)² + λ Σ |w|
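To illustrate the contrast described above (synthetic data; the alpha values are arbitrary assumptions), Lasso's absolute-value penalty drives irrelevant coefficients exactly to 0, while Ridge only shrinks them toward 0:

```python
# Illustrative comparison of L1 (Lasso) and L2 (Ridge) shrinkage.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 4.0 * X[:, 0] + 0.1 * rng.randn(100)   # only the first feature matters

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

print(lasso.coef_)   # irrelevant coefficients become exactly 0.0
print(ridge.coef_)   # irrelevant coefficients are small but nonzero
```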
3.8 Least Square Regression in Machine Learning

Least Square Regression is a statistical method commonly used in machine learning for
analyzing and modelling data. It involves finding the line of best fit that minimizes the sum of
the squared residuals (the difference between the actual values and the predicted values) between
the independent variable(s) and the dependent variable.

We can use Least Square Regression for simple linear regression, where there is only one
independent variable, and for multiple linear regression, where there are several independent
variables. The method is widely used in fields such as economics, engineering, and finance to
model and predict relationships between variables. Before learning least square regression,
let's understand linear regression.

Linear Regression

Linear regression is one of the basic statistical techniques in regression analysis. It is used
for investigating and modelling the relationship between variables (a dependent variable and
one or more independent variables).

Before being promptly adopted into machine learning and data science, linear models were used
as basic tools in statistics to assist prediction analysis and data mining. If the model involves
only one regressor variable (independent variable), it is called simple linear regression and if the
model has more than one regressor variable, the process is called multiple linear regression.

Equation of Straight Line

Let’s consider a simple example of an engineer wanting to analyze the product delivery and
service operations for vending machines. He/she wants to determine the relationship between the
time required by a deliveryman to load a machine and the volume of the products delivered. The
engineer collected the delivery time (in minutes) and the volume of the products (in a number of
cases) of 25 randomly selected retail outlets with vending machines. A scatter diagram is
obtained by plotting these observations on a graph.

Now, if we consider Y as the delivery time (dependent variable) and X as the product volume
delivered (independent variable), then we can represent the linear relationship between these
two variables as

Y = mX + c

That is the equation of a straight line, where m is the slope and c is the y-intercept. Our
objective is to estimate these unknown parameters in the regression model so that they give the
minimal error for the given dataset. This is commonly referred to as parameter estimation or
model fitting. In machine learning, the most common method of estimation is the Least Squares
method.

What is the Least Square Regression Method?

Least squares is a commonly used method in regression analysis for estimating the unknown
parameters by creating a model which will minimize the sum of squared errors between the
observed data and the predicted data.
Basically, it is one of the most widely used curve-fitting methods: it works by making the sum
of squared errors as small as possible, and it helps you draw the line of best fit through your
data points.

Finding the Line of Best Fit Using Least Square Regression

Given any collection of a pair of numbers and the corresponding scatter graph, the line of best fit
is the straight line that you can draw through the scatter points to best represent the relationship
between them. So, back to our equation of the straight line, we have:

Where,

Y: Dependent Variable

m: Slope

X: Independent Variable

c: y-intercept

Our aim here is to calculate the values of slope, y-intercept, and substitute them in the equation
along with the values of independent variable X, to determine the values of dependent variable
Y. Let's assume that we have 'n' data points; then we can calculate the slope 'm' using the
formula below:

m = Σ(X − Xmean)(Y − Ymean) / Σ(X − Xmean)²

Then, the y-intercept is calculated using the formula:

c = Ymean − m * Xmean

Lastly, we substitute these values in the final equation Y = mX + c. Simple enough, right? Now
let’s take a real life example and implement these formulas to find the line of best fit.

Least Squares Regression Example

Let us take a simple dataset to demonstrate least squares regression method.

Step 1: First step is to calculate the slope ‘m’ using the formula

After substituting the respective values in the formula, m = 4.70 approximately.

Step 2: Next step is to calculate the y-intercept 'c' using the formula c = ymean − m * xmean.
By doing so, the value of c is approximately 6.67.
Step 3: Now we have all the information needed for the equation and by substituting the
respective values in Y = mX + c, we get the following table. Using this info you can now plot the
graph.

In this way, the least squares regression method provides the closest relationship between the
dependent and independent variables by minimizing the distance between the residuals (or errors)
and the trend line (or line of best fit). The sum of squares of the residuals is therefore
minimal under this approach.
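The closed-form formulas above can be checked numerically on a small assumed dataset; NumPy's degree-1 polyfit should produce the same slope and intercept:

```python
# Verifying the least-squares formulas against np.polyfit on assumed data.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

# m = sum((x - xmean)(y - ymean)) / sum((x - xmean)^2), c = ymean - m * xmean
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()
print(m, c)              # slope and intercept of the line of best fit

m_np, c_np = np.polyfit(x, y, 1)
print(m_np, c_np)        # matches the closed-form result
```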

Now let us master how the least squares method is implemented using Python.

Least Squares Regression in Python

Scenario

A rocket motor is manufactured by combining an igniter propellant and a sustainer propellant
inside a strong metal housing. It was noticed that the shear strength of the bond between the
two propellants is strongly dependent on the age of the sustainer propellant.
Problem Statement

Implement a simple linear regression algorithm using Python to build a machine learning model
that studies the relationship between the shear strength of the bond between the two propellants
and the age of the sustainer propellant.

Let’s begin!

Steps

Step 1: Import the required Python libraries.

# Importing Libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt



Step 2: Next step is to read and load the dataset that we are working on.

# Loading dataset

data = pd.read_csv('PropallantAge.csv')

data.head()

data.info()
This gives you a preview of your data and other related information that's good to know. Our
aim now is to find the relationship between the age of the sustainer propellant and the shear
strength of the bond between the two propellants.

Step 3 (optional): You can create a scatter plot just to check the relationship between these two
variables.

# Plotting the data

plt.scatter(data['Age of Propellant'],data['Shear Strength'])



Step 4: Next step is to assign X and Y as independent and dependent variables respectively.

# Computing X and Y

X = data['Age of Propellant'].values

Y = data['Shear Strength'].values

Step 5: As we calculated manually earlier, we need to compute the mean of variables X and Y to
determine the values of slope (m) and y-intercept. Also, let n be the total number of data points.

# Mean of variables X and Y

mean_x = np.mean(X)

mean_y = np.mean(Y)

# Total number of data values


n = len(X)

Step 6: In the next step, we will be calculating the slope and the y-intercept using the formulas
we discussed above.

# Calculating 'm' and 'c'

num = 0
denom = 0

for i in range(n):
    num += (X[i] - mean_x) * (Y[i] - mean_y)
    denom += (X[i] - mean_x) ** 2

m = num / denom
c = mean_y - (m * mean_x)

# Printing coefficients

print("Coefficients")
print(m, c)

The above step has given us the values of m and c. Substituting them we get,

Shear Strength = 2627.822359001296 + (-37.15359094490524) * Age of Propellant


Step 7: The above equation represents our linear regression model. Now, let’s plot this
graphically.

# Plotting Values and Regression Line

maxx_x = np.max(X) + 10
minn_x = np.min(X) - 10

# line values for x and y
x = np.linspace(minn_x, maxx_x, 1000)
y = c + m * x

# Plotting Regression Line
plt.plot(x, y, color='#58b970', label='Regression Line')

# Plotting Scatter Points
plt.scatter(X, Y, c='#ef5423', label='Scatter Plot')

plt.xlabel('Age of Propellant (in years)')
plt.ylabel('Shear Strength')

plt.legend()
plt.show()

Output:

Well! That's it! We successfully found the line of best fit and fitted it to the data points using
the least square regression method in machine learning. Using this model, we can verify that
there is a strong statistical relationship between the shear strength and the propellant age.

3.9 Logistic Regression in Machine Learning

o Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.

o Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc.
However, instead of giving the exact values 0 and 1, it gives probabilistic values which lie
between 0 and 1.

o Logistic Regression is much like Linear Regression except in how the two are used: Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for
solving classification problems.

o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).

o The curve from the logistic function indicates the likelihood of something such as whether
the cells are cancerous or not, a mouse is obese or not based on its weight, etc.

o Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.

o Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The below
image is showing the logistic function:
Note: Logistic regression uses a predictive modelling approach similar to regression, which is why
it is called logistic regression; however, it is used to classify samples and therefore falls under
classification algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.

o It maps any real value into another value within a range of 0 and 1.

o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.

o In logistic regression, we use the concept of a threshold value, which defines the
probability of either 0 or 1: values above the threshold tend to 1, and values below the
threshold tend to 0.
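A minimal sketch of the sigmoid mapping and the threshold rule described above (the input values are arbitrary examples):

```python
# The logistic (sigmoid) function maps any real value into (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])
p = sigmoid(z)
print(p)                         # probabilities strictly between 0 and 1
print((p >= 0.5).astype(int))    # thresholding at 0.5 yields class labels
```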

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.

o The independent variables should not have multi-collinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

o We know the equation of the straight line can be written as:

y = b0 + b1x1 + b2x2 + ... + bnxn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation
by (1 − y):

y / (1 − y); this is 0 for y = 0 and infinity for y = 1

o But we need a range between −[infinity] and +[infinity], so taking the logarithm of the
equation, it becomes:

log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn

The above equation is the final equation for Logistic Regression.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial Logistic regression, there can be 3 or more possible


unordered types of the dependent variable, such as "cat", "dogs", or "sheep"

o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "low", "Medium", or "High".

Python Implementation of Logistic Regression (Binomial)

To understand the implementation of Logistic Regression in Python, we will use the below
example:

Example: There is a given dataset which contains the information of various users obtained from
social networking sites. A car-making company has recently launched a new SUV, so the company
wants to check how many users from the dataset want to purchase the car.

For this problem, we will build a Machine Learning model using the Logistic regression algorithm.
The dataset is shown in the below image. In this problem, we will predict the purchased variable
(Dependent Variable) by using age and salary (Independent variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will use
the same steps as we have done in previous topics of Regression. Below are the steps:

o Data Pre-processing step

o Fitting Logistic Regression to the Training set

o Predicting the test result

o Test accuracy of the result(Creation of Confusion matrix)

o Visualizing the test set result.

1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can use
it in our code efficiently. It will be the same as we have done in Data pre-processing topic. The
code for this is given below:

#Data Pre-processing Step

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#importing datasets
data_set= pd.read_csv('user_data.csv')

By executing the above lines of code, we will get the dataset as the output. Consider the given
image:
Now, we will extract the dependent and independent variables from the given dataset. Below is
the code for it:

#Extracting Independent and dependent Variable
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at
index 4. The output will be:

Now we will split the dataset into a training set and test set. Below is the code for it:

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)
The output for this is given below:

For test
set:

For training set:


In logistic regression, we will do feature scaling because we want an accurate result of
predictions. Here we will only scale the independent variables, because the dependent variable
has only 0 and 1 values. Below is the code for it:

#feature Scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

The scaled output is given below:


2. Fitting Logistic Regression to the Training set:

We have well prepared our dataset, and now we will train the dataset using the training set. For
providing training or fitting the model to the training set, we will import
the LogisticRegression class of the sklearn library.

After importing the class, we will create a classifier object and use it to fit the model to the logistic
regression. Below is the code for it:

#Fitting Logistic Regression to the training set
from sklearn.linear_model import LogisticRegression
classifier= LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

Output: By executing the above code, we will get the below output:

Out[5]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Hence our model is well fitted to the training set.

3. Predicting the Test Result

Our model is well trained on the training set, so we will now predict the result by using test set
data. Below is the code for it:

#Predicting the test set result
y_pred= classifier.predict(x_test)

In the above code, we have created a y_pred vector to predict the test set result.

Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.

4. Test Accuracy of the result

Now we will create the confusion matrix to check the accuracy of the classification. To create
it, we need to import the confusion_matrix function of the sklearn library. After importing the
function, we will call it and store the result in a new variable cm. The function takes two main
parameters: y_true (the actual values) and y_pred (the values predicted by the classifier).
Below is the code for it:

#Creating the Confusion matrix
from sklearn.metrics import confusion_matrix
cm= confusion_matrix(y_test, y_pred)

Output:

By executing the above code, a new confusion matrix will be created. Consider the below image:

We can find the accuracy of the predicted result by interpreting the confusion matrix. From the
above output, we can interpret that 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are
incorrect.
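The accuracy can be computed directly from the confusion matrix; the sketch below uses the counts reported above (65 and 24 correct, 8 and 3 incorrect; the exact placement of the off-diagonal cells is an assumption):

```python
# Deriving accuracy from a 2x2 confusion matrix.
import numpy as np

cm = np.array([[65, 3],
               [8, 24]])

correct = int(np.trace(cm))      # diagonal entries are correct predictions
total = int(cm.sum())
accuracy = correct / total
print(correct, total - correct, accuracy)
```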

5. Visualizing the training set result


Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:

#Visualizing the training set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_train, y_train
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Training set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

In the above code, we have imported the ListedColormap class of Matplotlib library to create the
colormap for visualizing the result. We have created two new variables x_set and y_set to
replace x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid that spans from each feature's minimum value minus 1 to its maximum value plus
1, with pixel points at a resolution of 0.01.

To create a filled contour, we have used mtp.contourf command, it will create regions of provided
colors (purple and green). In this function, we have passed the classifier.predict to show the
predicted data points predicted by the classifier.

Output: By executing the above code, we will get the below output:
The graph can be explained in the below points:

o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.

o All these data points are the observation points from the training set, which shows the result
for purchased variables.

o This graph is made by using two independent variables i.e., Age on the x-
axis and Estimated salary on the y-axis.

o The purple point observations are for which purchased (dependent variable) is probably
0, i.e., users who did not purchase the SUV car.

o The green point observations are for which purchased (dependent variable) is probably 1
means user who purchased the SUV car.

o We can also estimate from the graph that the users who are younger with low salary, did
not purchase the car, whereas older users with high estimated salary purchased the car.

o But there are some purple points in the green region (buying the car) and some green points
in the purple region (not buying the car). These are observations the model misclassifies;
this error has already been quantified with the confusion matrix.

The goal of the classifier:

We have successfully visualized the training set result for the logistic regression, and our goal for
this classification is to divide the users who purchased the SUV car and who did not purchase the
car. So from the output graph, we can clearly see the two regions (Purple and Green) with the
observation points. The Purple region is for those users who didn't buy the car, and Green Region
is for those users who purchased the car.

Linear Classifier:

As we can see from the graph, the classifier is a Straight line or linear in nature as we have used
the Linear model for Logistic Regression. In further topics, we will learn for non-linear Classifiers.

Visualizing the test set result:

Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we
will use x_test and y_test instead of x_train and y_train. Below is the code for it:

#Visualizing the test set result
from matplotlib.colors import ListedColormap
x_set, y_set = x_test, y_test
x1, x2 = nm.meshgrid(nm.arange(start = x_set[:, 0].min() - 1, stop = x_set[:, 0].max() + 1, step = 0.01),
                     nm.arange(start = x_set[:, 1].min() - 1, stop = x_set[:, 1].max() + 1, step = 0.01))
mtp.contourf(x1, x2, classifier.predict(nm.array([x1.ravel(), x2.ravel()]).T).reshape(x1.shape),
             alpha = 0.75, cmap = ListedColormap(('purple', 'green')))
mtp.xlim(x1.min(), x1.max())
mtp.ylim(x2.min(), x2.max())
for i, j in enumerate(nm.unique(y_set)):
    mtp.scatter(x_set[y_set == j, 0], x_set[y_set == j, 1],
                c = ListedColormap(('purple', 'green'))(i), label = j)
mtp.title('Logistic Regression (Test set)')
mtp.xlabel('Age')
mtp.ylabel('Estimated Salary')
mtp.legend()
mtp.show()

Output:

The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). And Green observations are in the green region, and Purple observations are
in the purple region. So we can say it is a good prediction and model. Some of the green and purple
data points are in different regions, which can be ignored as we have already calculated this error
using the confusion matrix (11 Incorrect output).

Hence our model is pretty good and ready to make new predictions for this classification problem.

UNIT-IV
Multi-layer Perceptron
A multi-layer perceptron is also known as an MLP. It consists of fully connected dense layers,
which transform any input dimension to the desired dimension. A multi-layer
perceptron is a neural network that has multiple layers. To create a neural
network we combine neurons together so that the outputs of some neurons are
inputs of other neurons.
A multi-layer perceptron has one input layer with one neuron (or node) for each input, one
output layer with a single node for each output, and any number of hidden layers, each with
any number of nodes. A schematic diagram of a Multi-Layer Perceptron (MLP) is
depicted below.

In the multi-layer perceptron diagram above, we can see that there are
three inputs and thus three input nodes and the hidden layer has three nodes.
The output layer gives two outputs, therefore there are two output nodes. The
nodes in the input layer take input and forward it for further process, in the
diagram above the nodes in the input layer forwards their output to each of the
three nodes in the hidden layer, and in the same way, the hidden layer
processes the information and passes it to the output layer.
Every node in the multi-layer perceptron uses a sigmoid activation function. The
sigmoid activation function takes real values as input and converts them to
numbers between 0 and 1 using the sigmoid formula.

Stepwise Implementation
Step 1: Import the necessary libraries.

# importing modules
import tensorflow as tf
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Activation
import matplotlib.pyplot as plt

Step 2: Download the dataset.


TensorFlow allows us to read the MNIST dataset and we can load it directly in
the program as a train and test dataset.

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

Output:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] – 2s 0us/step
Step 3: Now we will convert the pixels into floating-point values.

# Cast the records into float values
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

# normalize image pixel values by dividing by 255
gray_scale = 255
x_train /= gray_scale
x_test /= gray_scale

We convert the pixel values into floating-point values to make the predictions. Scaling the
values down is beneficial because the values become small and the computation becomes easier
and faster. Since the pixel values range from 0 to 255, dividing all the values by 255
converts them to the range 0 to 1.
Step 4: Understand the structure of the dataset

print("Feature matrix (train):", x_train.shape)
print("Feature matrix (test):", x_test.shape)
print("Target vector (train):", y_train.shape)
print("Target vector (test):", y_test.shape)

Output:
Feature matrix (train): (60000, 28, 28)
Feature matrix (test): (10000, 28, 28)
Target vector (train): (60000,)
Target vector (test): (10000,)
Thus we have 60,000 records in the training dataset and 10,000 records in the test dataset,
and every image in the dataset is of size 28×28.
Step 5: Visualize the data.


fig, ax = plt.subplots(10, 10)
k = 0
for i in range(10):
    for j in range(10):
        ax[i][j].imshow(x_train[k].reshape(28, 28), aspect='auto')
        k += 1
plt.show()

Output

Step 6: Form the Input, hidden, and output layers.



model = Sequential([
    # flatten each 28 x 28 image into a vector of 28*28 values
    Flatten(input_shape=(28, 28)),

    # dense layer 1
    Dense(256, activation='sigmoid'),

    # dense layer 2
    Dense(128, activation='sigmoid'),

    # output layer
    Dense(10, activation='sigmoid'),
])

Some important points to note:


• The Sequential model allows us to create models layer-by-layer as we
need in a multi-layer perceptron and is limited to single-input, single-output
stacks of layers.
• Flatten flattens the input provided without affecting the batch size. For
example, If inputs are shaped (batch_size,) without a feature axis, then
flattening adds an extra channel dimension and output shape is (batch_size,
1).
• Activation is for using the sigmoid activation function.
• The first two Dense layers are used to make a fully connected model and
are the hidden layers.
• The last Dense layer is the output layer which contains 10 neurons that
decide which category the image belongs to.
Step 7: Compile the model.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
The compile function is used here; it involves the use of a loss function, an optimizer, and
metrics. The loss function used is sparse_categorical_crossentropy and the
optimizer used is adam.
Step 8: Fit the model.

model.fit(x_train, y_train, epochs=10,
          batch_size=2000,
          validation_split=0.2)
Output:
Step 9: Find Accuracy of the model.
results = model.evaluate(x_test, y_test, verbose = 0)
print('test loss, test acc:', results)
Output:
test loss, test acc: [0.27210235595703125, 0.9223999977111816]
Using model.evaluate() on the test samples, we find that our model reaches about 92%
accuracy.

What is Backpropagation?
Backpropagation is the essence of neural network training. It is the method of fine-tuning the
weights of a neural network based on the error rate obtained in the previous epoch (i.e.,
iteration). Proper tuning of the weights allows you to reduce error rates and make the model
reliable by increasing its generalization.
Backpropagation in neural network is a short form for “backward propagation of errors.” It is a
standard method of training artificial neural networks. This method helps calculate the
gradient of a loss function with respect to all the weights in the network.

How Backpropagation Algorithm Works?

The back propagation algorithm in a neural network computes the gradient of the loss function
for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a
naive direct computation. It computes the gradient, but it does not define how the gradient is
used. It generalizes the computation in the delta rule.
• Inputs X arrive through the preconnected path.
• The input is modeled using real weights W. The weights are usually selected randomly.
• Calculate the output of every neuron, from the input layer through the hidden layers to the
output layer.
• Calculate the error in the outputs: Error = Actual Output – Desired Output.
• Travel back from the output layer to the hidden layers to adjust the weights such that the
error is decreased.
• Keep repeating the process until the desired output is achieved.
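The steps above can be sketched in NumPy for a single hidden layer (toy XOR-style data; the layer sizes, learning rate, and epoch count are arbitrary assumptions, and bias terms are omitted for brevity):

```python
# Minimal backpropagation sketch: forward pass, output error, backward updates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)       # desired outputs

W1 = rng.randn(2, 4)      # input -> hidden weights, randomly selected
W2 = rng.randn(4, 1)      # hidden -> output weights
lr = 0.5                  # assumed learning rate

loss0 = np.mean((sigmoid(sigmoid(X @ W1) @ W2) - t) ** 2)

for epoch in range(5000):
    # forward: input layer -> hidden layer -> output layer
    h = sigmoid(X @ W1)
    y = sigmoid(h @ W2)
    # error at the output, propagated backward via the chain rule
    d_out = (y - t) * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # adjust the weights so that the error decreases
    W2 -= lr * (h.T @ d_out)
    W1 -= lr * (X.T @ d_hid)

loss1 = np.mean((sigmoid(sigmoid(X @ W1) @ W2) - t) ** 2)
print(loss0, loss1)       # the squared error shrinks as the weights are tuned
```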
Why We Need Backpropagation?
Most prominent advantages of Backpropagation are:
• Backpropagation is fast, simple and easy to program
• It has no parameters to tune apart from the number of inputs
• It is a flexible method as it does not require prior knowledge about the network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be learned.
What is a Feed Forward Network?
A feedforward neural network is an artificial neural network where the nodes never form a
cycle. This kind of neural network has an input layer, hidden layers, and an output layer. It is
the first and simplest type of artificial neural network.
Types of Backpropagation Networks
Two Types of Backpropagation Networks are:
Static Back-propagation
Recurrent Backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for static
output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After
that, the error is computed and propagated backward.
The main difference between these two methods is that the mapping is static in static
back-propagation, while it is non-static in recurrent backpropagation.

Radial Basis Functions

Radial Basis Function (RBF) Networks are a particular type of Artificial Neural Network used
for function approximation problems. RBF Networks differ from other neural networks in their
three-layer architecture, universal approximation, and faster learning speed.

What Are Radial Basis Functions?

Radial Basis Function networks are a special class of feed-forward neural networks consisting
of three layers: an input layer, a hidden layer, and an output layer. This is fundamentally
different from most neural network architectures, which are composed of many layers and
bring about nonlinearity by repeatedly applying non-linear activation functions. The input layer
receives input data and passes it into the hidden layer, where the computation occurs. The
hidden layer of an RBF network is its most powerful part and differs greatly from that of most
neural networks. The output layer is designated for prediction tasks like classification or
regression.

The radial basis function for a neuron consists of a center and a radius (also called the
spread). The radius may vary between different neurons. In DTREG-generated RBF networks,
each dimension's radius can differ. As the spread grows larger, neurons at a distance from a
point have more influence.
RBF Network Architecture

The typical architecture of a radial basis functions neural network consists of an input layer,

hidden layer, and summation layer.

Input Layer

The input layer consists of one neuron for every predictor variable. The input neurons pass
the value to each neuron in the hidden layer. N-1 neurons are used for categorical values,
where N denotes the number of categories. The range of values is standardized by
subtracting the median and dividing by the interquartile range.
Hidden Layer
The hidden layer contains a variable number of neurons (the ideal number determined by the
training process). Each neuron comprises a radial basis function centered on a point. The
number of dimensions coincides with the number of predictor variables. The radius or spread
of the RBF function may vary for each dimension.
When an x vector of input values is fed from the input layer, a hidden neuron calculates the
Euclidean distance between the test case and the neuron's center point. It then applies the
kernel function using the spread values. The resulting value gets fed into the summation
layer.
Output Layer or Summation Layer
The value obtained from the hidden layer is multiplied by a weight related to the neuron and
passed to the summation. Here the weighted values are added up, and the sum is presented
as the network's output. Classification problems have one output per target category, the
value being the probability that the case evaluated has that category.
The Input Vector
It is the n-dimensional vector that you're attempting to classify. The whole input vector is
presented to each of the RBF neurons.
The RBF Neurons
Every RBF neuron stores a prototype vector (also known as the neuron's center) from
amongst the vectors of the training set. An RBF neuron compares the input vector with its
prototype, and outputs a value between 0 and 1 as a measure of similarity. If an input is the
same as the prototype, the neuron's output will be 1. As the input and prototype difference
grows, the output falls exponentially towards 0. The shape of the response by the RBF neuron
is a bell curve. The response value is also called the activation value.
The Output Nodes
The network's output comprises a set of nodes for each category you're trying to classify.
Each output node computes a score for the concerned category. Generally, we take a
classification decision by assigning the input to the category with the highest score.
The score is calculated based on a weighted sum of the activation values from all RBF
neurons. It usually gives a positive weight to the RBF neuron belonging to its category and a
negative weight to others. Each output node has its own set of weights.
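The bell-shaped response of a single RBF neuron described above can be sketched with a Gaussian kernel; the center (prototype) and spread values below are illustrative:

```python
import numpy as np

# One RBF neuron: Gaussian response around a stored prototype (center).
def rbf_activation(x, center, spread):
    # Euclidean distance between the input vector and the neuron's center,
    # passed through a bell-shaped kernel; the output lies in (0, 1].
    dist = np.linalg.norm(x - center)
    return np.exp(-(dist ** 2) / (2 * spread ** 2))

center = np.array([1.0, 2.0])
print(rbf_activation(np.array([1.0, 2.0]), center, 1.0))  # input == prototype -> 1.0
print(rbf_activation(np.array([4.0, 2.0]), center, 1.0))  # distant input -> near 0
```

As the input moves away from the prototype, the activation falls exponentially toward 0, matching the similarity measure described above.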

Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
o Below diagram explains the general structure of a decision tree:

Why use Decision Trees?


There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in step-3. Continue this process until a stage is reached where you cannot further
classify the nodes; the final nodes are called leaf nodes.

Attribute Selection Measures


While implementing a decision tree, the main issue is how to select the best attribute
for the root node and for sub-nodes. To solve such problems there is a technique
called the Attribute Selection Measure, or ASM. By this measurement, we can easily select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

1. Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]


Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness
in data. Entropy can be calculated as:

Entropy(S) = −P(yes)·log2 P(yes) − P(no)·log2 P(no)


Where,
o S= Total number of samples
o P(yes)= probability of yes
o P(no)= probability of no
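The entropy formula above can be evaluated directly. A small sketch, using an illustrative sample set with 9 "yes" and 5 "no" examples:

```python
from math import log2

# Entropy(S) = -P(yes)·log2 P(yes) - P(no)·log2 P(no)
def entropy(p_yes, p_no):
    # a probability of 0 contributes nothing (and would break log2)
    return -sum(p * log2(p) for p in (p_yes, p_no) if p > 0)

# Illustrative set: 9 "yes" and 5 "no" samples out of 14
print(round(entropy(9 / 14, 5 / 14), 3))  # close to the maximum of 1
print(entropy(1.0, 0.0))                  # a pure set has zero entropy
```

Information gain is then obtained by subtracting the weighted average of the sub-node entropies from the entropy of the parent set.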

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − ∑j Pj²
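A quick sketch of the formula (the class probabilities below are illustrative):

```python
# Gini Index = 1 - sum of squared class probabilities
def gini(probs):
    return 1 - sum(p ** 2 for p in probs)

print(gini([1.0]))       # pure node: 0.0
print(gini([0.5, 0.5]))  # evenly mixed two-class node: 0.5
```

A pure node scores 0, so CART prefers the split whose sub-nodes have the lowest Gini index.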


Pruning: Getting an Optimal Decision tree
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A too-large tree increases the risk of overfitting, and a small tree may not capture all the
important features of the dataset. Therefore, a technique that decreases the size of the learning
tree without reducing accuracy is known as Pruning. There are mainly two types of
tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning.
Advantages of the Decision Tree
o It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

ID3:
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words,
we can say that the purity of the node increases with respect to the target variable. The
decision tree splits the nodes on all available variables and then selects the split which results
in most homogeneous sub-nodes.
The algorithm selection is also based on the type of target variables. Let us look at some
algorithms used in Decision Trees:

ID3 (Iterative Dichotomiser 3)

The ID3 algorithm builds decision trees using a top-down greedy search approach through the
space of possible branches, with no backtracking. A greedy algorithm, as the name suggests,
always makes the choice that seems to be the best at that moment.

Steps in ID3 algorithm:


o It begins with the original set S as the root node.
o On each iteration of the algorithm, it iterates through every unused attribute of the set
S and calculates the entropy (H) and information gain (IG) of this attribute.
o It then selects the attribute which has the smallest entropy or largest information gain.
o The set S is then split by the selected attribute to produce a subset of the data.
o The algorithm continues to recur on each subset, considering only attributes never
selected before.
Attribute Selection Measures
If the dataset consists of N attributes then deciding which attribute to place at the root or at
different levels of the tree as internal nodes is a complicated step. By just randomly selecting
any node to be the root can’t solve the issue. If we follow a random approach, it may give us
bad results with low accuracy.
For solving this attribute selection problem, researchers worked and devised some solutions.
They suggested using some criteria like :
Entropy, Information gain, Gini index, Gain Ratio, Reduction in Variance, Chi-Square.
These criteria will calculate values for every attribute. The values are sorted, and attributes
are placed in the tree by following the order i.e, the attribute with a high value(in case of
information gain) is placed at the root.
While using Information Gain as a criterion, we assume attributes to be categorical, and for
the Gini index, attributes are assumed to be continuous.
Entropy
Entropy is a measure of the randomness in the information being processed. The higher the
entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an
example of an action that provides information that is random.

The entropy H(X) is zero when the probability is either 0 or 1. Entropy is maximum when the
probability is 0.5, because that represents perfect randomness in the data and there is no
chance of perfectly determining the outcome.

ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with
entropy more than zero needs further splitting.

Mathematically, entropy for one attribute is represented as:

Entropy(S) = ∑i −Pi log2(Pi)

where S is the current state and Pi is the probability of an event i of state S (the percentage of
class i in a node of state S).

Mathematically, entropy for multiple attributes is represented as:

Entropy(T, X) = ∑c∈X P(c) · Entropy(c)

where T is the current state and X is the selected attribute.

Information Gain

Information gain (IG) is a statistical property that measures how well a given attribute
separates the training examples according to their target classification. Constructing a
decision tree is all about finding an attribute that returns the highest information gain and the
smallest entropy.

Information gain is a decrease in entropy. It computes the difference between the entropy
before the split and the average entropy after the split of the dataset, based on given attribute
values. The ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain.
Mathematically, IG is represented as:

IG(T, X) = Entropy(T) − Entropy(T, X)

In a much simpler way, we can conclude that:

Information Gain = Entropy(before) − ∑j=1..K Entropy(j, after)

Where "before" is the dataset before the split, K is the number of subsets generated by the
split, and (j, after) is subset j after the split.

Gini Index

You can understand the Gini index as a cost function used to evaluate splits in the dataset. It
is calculated by subtracting the sum of the squared probabilities of each class from one. It
favors larger partitions and is easy to implement, whereas information gain favors smaller
partitions with distinct values.

Gini Index = 1 − ∑j Pj²

The Gini index works with the categorical target variable "Success" or "Failure". It performs
only binary splits. A higher value of the Gini index implies higher inequality and higher
heterogeneity.

Steps to calculate the Gini index for a split:

1. Calculate Gini for the sub-nodes, using the above formula for success (p) and failure (q): (p² + q²).

2. Calculate the Gini index for the split using the weighted Gini score of each node of that split.

CART (Classification and Regression Tree) uses the Gini index method to create split points.
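The two steps above can be sketched as follows, using the text's convention that a node's Gini score is p² + q²; the node counts are illustrative:

```python
# Step 1: Gini score of a sub-node from its success (p) and failure (q) rates.
def node_gini(successes, failures):
    n = successes + failures
    p, q = successes / n, failures / n
    return p ** 2 + q ** 2

# Step 2: weighted Gini score of a candidate split, weighted by node size.
def split_gini(nodes):
    total = sum(s + f for s, f in nodes)
    return sum((s + f) / total * node_gini(s, f) for s, f in nodes)

# One pure sub-node (10, 0) and one evenly mixed sub-node (5, 5)
print(split_gini([(10, 0), (5, 5)]))  # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```

Under this convention a higher score means purer sub-nodes, so the candidate split with the highest weighted score is chosen.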

Gain Ratio

Information gain is biased towards choosing attributes with a large number of values as root
nodes; that is, it prefers attributes with a large number of distinct values.

C4.5, an improvement of ID3, uses the gain ratio, a modification of information gain that
reduces this bias and is usually the best option. The gain ratio overcomes the problem with
information gain by taking into account the number of branches that would result before
making the split. It corrects information gain by taking the intrinsic information of a split into
account:

Gain Ratio = Information Gain / Split Info,
where Split Info = −∑j=1..K (|j, after| / |before|) · log2(|j, after| / |before|)

Where "before" is the dataset before the split, K is the number of subsets generated by the
split, and (j, after) is subset j after the split.

Reduction in Variance

Reduction in variance is an algorithm used for continuous target variables (regression
problems). It uses the standard formula of variance to choose the best split; the split with
lower variance is selected as the criterion to split the population:

Variance = ∑(X − X̄)² / n

where X̄ is the mean of the values, X is an actual value, and n is the number of values.

Steps to calculate variance:

1. Calculate the variance for each node.

2. Calculate the variance for each split as the weighted average of each node's variance.
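Both steps can be sketched directly; the target values below are illustrative:

```python
# Step 1: variance of the target values within one node.
def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

# Step 2: variance of a split as the weighted average of node variances.
def split_variance(nodes):
    total = sum(len(node) for node in nodes)
    return sum(len(node) / total * variance(node) for node in nodes)

# A candidate split separating small and large target values
left, right = [10.0, 10.0, 12.0], [30.0, 32.0]
print(round(split_variance([left, right]), 3))  # low: the split groups similar values
```

A split that groups similar target values into the same node yields a low weighted variance and is therefore preferred.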

Chi-Square

The acronym CHAID stands for Chi-squared Automatic Interaction Detector. It is one of the
oldest tree classification methods. It finds the statistical significance of the differences
between sub-nodes and the parent node, measured by the sum of squares of the
standardized differences between the observed and expected frequencies of the target
variable.

It works with the categorical target variable "Success" or "Failure" and can perform two or
more splits. The higher the value of Chi-square, the higher the statistical significance of the
differences between a sub-node and its parent node. The tree it generates is called CHAID
(Chi-square Automatic Interaction Detector).

Mathematically, Chi-square is represented as:

Chi-square = √((Actual − Expected)² / Expected)

Steps to calculate Chi-square for a split:

1. Calculate Chi-square for an individual node by calculating the deviation for both Success
and Failure.

2. Calculate the Chi-square of the split using the sum of all Chi-square values for Success
and Failure of each node of the split.
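A sketch of both steps; the parent's success rate and the node counts below are illustrative:

```python
from math import sqrt

# Step 1: Chi-square of one node: the standardized deviation
# sqrt((actual - expected)^2 / expected), computed for both Success and
# Failure, with expected counts taken from the parent's success rate.
def node_chi2(successes, failures, parent_success_rate):
    n = successes + failures
    exp_s = n * parent_success_rate
    exp_f = n - exp_s
    dev_s = sqrt((successes - exp_s) ** 2 / exp_s)
    dev_f = sqrt((failures - exp_f) ** 2 / exp_f)
    return dev_s + dev_f

# Step 2: Chi-square of the split: sum over every node of the split.
def split_chi2(nodes, parent_success_rate):
    return sum(node_chi2(s, f, parent_success_rate) for s, f in nodes)

# Parent is 50% Success; the split yields an (8, 2) and a (2, 8) node
print(round(split_chi2([(8, 2), (2, 8)], 0.5), 3))
```

The larger this sum, the more significantly the sub-nodes differ from the parent, so CHAID prefers that split.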

How to avoid/counter Overfitting in Decision Trees?

The common problem with decision trees, especially when the table has many columns, is
that they overfit. Sometimes it looks like the tree memorized the training data set. If there is
no limit set on a decision tree, it will give you 100% accuracy on the training data set,
because in the worst case it will end up making one leaf for each observation. This affects the
accuracy when predicting samples that are not part of the training set.

Here are two ways to remove overfitting:

1. Pruning decision trees.

2. Random Forest

Pruning Decision Trees

The splitting process results in fully grown trees until the stopping criteria are reached. But a
fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data.

In pruning, you trim off the branches of the tree, i.e., remove decision nodes starting from the
leaf nodes, such that the overall accuracy is not disturbed. This is done by segregating the
actual training set into two sets: a training data set D and a validation data set V. Prepare the
decision tree using the segregated training data set D, then continue trimming the tree to
optimize the accuracy on the validation data set V.

Pruning

In the above diagram, the 'Age' attribute on the left-hand side of the tree has been pruned as
it has more importance on the right-hand side of the tree, hence removing overfitting.

Random Forest

Random Forest is an example of ensemble learning, in which we combine multiple machine
learning models to obtain better predictive performance.

Why the name "Random"? Two key concepts give it the name:

1. Random sampling of the training data set when building trees.

2. Random subsets of features considered when splitting nodes.

A technique known as bagging is used to create an ensemble of trees in which multiple
training sets are generated by sampling with replacement.

In the bagging technique, a data set is divided into N samples using randomized sampling.
Then, using a single learning algorithm, a model is built on each sample. Later, the resulting
predictions are combined using voting or averaging in parallel.

Random Forest in action
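A minimal usage sketch with scikit-learn's RandomForestClassifier (the Iris dataset and all parameters are illustrative assumptions, not from the text):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample drawn with replacement
# (bagging) and a random subset of features at each split; the class
# predictions of the trees are combined by majority vote.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(iris.data, iris.target)
print(forest.predict(iris.data[-1:]))
```

Because each tree sees a different random sample and feature subset, the ensemble averages out the high variance of individual decision trees.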

CART (Classification And Regression Tree) is a variation of the decision tree algorithm. It can
handle both classification and regression tasks. Scikit-Learn uses the Classification And
Regression Tree (CART) algorithm to train Decision Trees (also called "growing" trees).
CART Algorithm
CART is a predictive algorithm used in Machine learning and it explains how the target
variable’s values can be predicted based on other matters. It is a decision tree where each
fork is split into a predictor variable and each node has a prediction for the target variable at
the end.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an
attribute. The root node is taken as the training set and is split into two by considering the best
attribute and threshold value. Further, the subsets are also split using the same logic. This
continues till the last pure sub-set is found in the tree or the maximum number of leaves
possible in that growing tree.
The CART algorithm works via the following process:
• The best split point of each input is obtained.
• Based on the best split points of each input in Step 1, the new “best” split point is
identified.
• Split the chosen input according to the “best” split point.
• Continue splitting until a stopping rule is satisfied or no further desirable splitting is
available.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by
searching for the best homogeneity for the sub-nodes, with the help of the Gini index criterion.

Gini index/Gini impurity


The Gini index is a metric for the classification tasks in CART. It stores the sum of squared
probabilities of each class. It computes the degree of probability of a specific variable that is
wrongly being classified when chosen randomly and a variation of the Gini coefficient. It works
on categorical variables, provides outcomes either “successful” or “failure” and hence conducts
binary splitting only.

The degree of the Gini index varies from 0 to 1,

• Where 0 depicts that all the elements are allied to a certain class, or only one class exists
there.
• The Gini index of value 1 signifies that all the elements are randomly distributed across
various classes, and
• A value of 0.5 denotes the elements are uniformly distributed into some classes.
Mathematically, we can write Gini impurity as follows:

Gini = 1 − ∑i (pi)²

where pi is the probability of an object being classified to a particular class.

Classification tree
A classification tree is an algorithm where the target variable is categorical. The algorithm is
then used to identify the “Class” within which the target variable is most likely to fall.
Classification trees are used when the dataset needs to be split into classes that belong to the
response variable(like yes or no)

Regression tree
A Regression tree is an algorithm where the target variable is continuous and the tree is used
to predict its value. Regression trees are used when the response variable is continuous. For
example, if the response variable is the temperature of the day.

Pseudo-code of the CART algorithm

d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0

while endtree < 1
    if Node(2^d - 1) + Node(2^d) + ... + Node(2^(d+1) - 2) = 2 - 2^(d+1)
        endtree = 1
    else
        do i = 2^d - 1, 2^d, ..., 2^(d+1) - 2
            if Node(i) > -1
                Split tree
            else
                Node(2i+1) = -1
                Node(2i+2) = -1
            end if
        end do
    end if
    d = d + 1
end while
CART model representation
CART models are formed by picking input variables and evaluating split points on those
variables until an appropriate tree is produced.

Steps to create a Decision Tree using the CART algorithm:

• Greedy algorithm: In this The input space is divided using the Greedy method which is
known as a recursive binary spitting. This is a numerical method within which all of the
values are aligned and several other split points are tried and assessed using a cost
function.
• Stopping Criterion: As it works its way down the tree with the training data, the recursive
binary splitting method described above must know when to stop splitting. The most frequent
halting method is to utilize a minimum amount of training data allocated to every leaf node.
If the count is smaller than the specified threshold, the split is rejected and also the node is
considered the last leaf node.
• Tree pruning: Decision tree’s complexity is defined as the number of splits in the tree. Trees
with fewer branches are recommended as they are simple to grasp and less prone to cluster
the data. Working through each leaf node in the tree and evaluating the effect of deleting it
using a hold-out test set is the quickest and simplest pruning approach.
• Data preparation for the CART: No special data preparation is required for the CART
algorithm.
Advantages of CART
• Results are simplistic.
• Classification and regression trees are Nonparametric and Nonlinear.
• Classification and regression trees implicitly perform feature selection.
• Outliers have no meaningful effect on CART.
• It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
• Overfitting.
• High variance.
• Low bias.
• The tree structure may be unstable.
Applications of the CART algorithm
• For quick Data insights.
• In Blood Donors Classification.
• For environmental and ecological data.
• In the financial sectors.

Advantages of Decision Trees

1. Relatively Easy to Interpret

Trained Decision Trees are generally quite intuitive to understand, and easy to interpret.

Unlike most other machine learning algorithms, their entire structure can be easily visualised

in a simple flow chart. I covered the topic of interpreting Decision Trees in a previous post.

2. Robust to Outliers

A well-regularised Decision Tree will be robust to the presence of outliers in the data. This

feature stems from the fact that predictions are generated from an aggregation function (e.g.
mean or mode) over a subsample of the training data. Outliers can start to have a bigger

impact if the tree has overfitted. This topic was covered in a previous post.

3. Can Deal with Missing Values

The CART algorithm naturally permits the handling of missing values in the data. This enables

us to implement a Decision Tree that does not require any additional preprocessing to treat for

missing values. Most other machine learning algorithms do not have this capability. We

implemented a Decision Tree that can handle missing values in a previous post.

4. Non-Linear

Decision Trees are inherently non-linear models. They are piece-wise functions of various

different features in the feature space. As such, Decision Trees can be applied to a wide

range of complex problems, where linearity cannot be assumed.

5. Non-Parametric

CART Decision Trees do not make assumptions regarding the underlying distributions in the

data. This means we do not necessarily need to be concerned if the model is applicable to a

given problem, given the assumptions of the algorithm. There are caveats to this, however,

that will be discussed below (see point 4 in the disadvantages section).

6. Combining Features to Make Predictions

Combinations of features can be used in making predictions. The CART algorithm dictates

that decision rules (which are if-else conditions on the input features) are combined together

via AND relationships as one traverses the tree. This can be easily illustrated if we look at a

Decision Tree trained on the Iris dataset:

7. Can Deal with Categorical Values

The CART algorithm naturally permits the handling of categorical features in the data. This

enables us to implement a Decision Tree that does not require any additional preprocessing
(e.g. One-Hot-Encoding) to treat for categorical values. Most other machine learning

algorithms do not have this capability. We implemented a Decision Tree to handle categorical

features in a previous post.

8. Minimal Data Preparation

Minimal data preparation is required for Decision Trees. Since the training procedure in CART

deals with each input feature independently, at each node in the tree, data scaling and

normalization are not required.


Disadvantages of Decision Trees
1. Prone to Overfitting
CART Decision Trees are prone to overfit on the training data, if their growth is not restricted
in some way. Typically this problem is handled by pruning the tree, which in effect regularises
the model. Care needs to be taken to ensure the pruned tree performs as we want on unseen
data.
2. Unstable to Changes in the Data
Significantly different trees can be produced from training, if small changes occur in the data
3. Unstable to Noise
Similar to the previous point, Decision Trees are also sensitive to the presence of noise in the
data.
4. Non-Continuous
Decision trees are piece-wise functions, not smooth or continuous. This piece-wise
approximation approaches a continuous function the deeper & more complex the tree gets.
This however yields problems with overfitting (see point 1 above). Because of this, Decision
Tree regressors tend to have limited performance, and are not good at extrapolation
5. Unbalanced Classes
Decision Tree classifiers can be biased if the training data is highly dominated by certain
classes. Therefore, in situations where we are working with an unbalanced dataset, an
additional preprocessing step will be needed to balance the data for training. Alternatively, if
the implementation you are working with permits it, you can adjust weights within the model to
account for class imbalances. The scikit-learn implementation supports this through
the class_weight parameter.
6. Greedy Algorithm
CART follows a greedy algorithm, that finds only locally optimal solutions at each node in the
tree. As such, the tree produced by the CART algorithm is a non-optimal global solution.

UNIT..5

What are neural networks?

A neural network is a reflection of the human brain's behavior. It allows computer programs to
recognize patterns and solve problems in the fields of machine learning, deep learning, and
artificial intelligence. These systems are known as artificial neural networks (ANNs) or
simulated neural networks (SNNs). Google’s search algorithm is a fine example.

Neural networks are subtypes of machine learning and form the core part of deep learning
algorithms. Their structure is designed to resemble the human brain, which makes biological
neurons signal to one another. ANNs contain node layers that comprise input, one or more
hidden layers, and an output layer.

Image source

Each artificial neuron is connected to another and has an associated threshold and weight.
When the output of any node is above the threshold, that node will get activated, sending data
to the next layer. If not above the threshold, no data is passed along to the next node.

Neural networks depend on training data to learn and improve their accuracy over time. Once
these learning algorithms are tuned towards accuracy, they become powerful tools in AI. They
allow us to classify and cluster data at a high velocity. Tasks in image recognition take just
minutes to process compared to manual identification.

Types of neural networks

Neural network models come in different types based on their purpose. Here are some
common varieties.

Single-layer perceptron

The perceptron, created by Frank Rosenblatt, was the first neural network. It contains a single
neuron and is very simple in structure.

Multilayer perceptrons (MLPs)

These form the base for natural language processing (NLP). They comprise an input layer, a
hidden layer, and an output layer. It is important to know that MLPs contain sigmoid neurons
and not perceptrons because most real-world problems are non-linear. Data is fed into these
modules to train them.

Convolutional neural networks (CNNs)

They are similar to MLPs but are usually used for pattern or image recognition, and computer
vision. These neural networks work with the principles of matrix multiplication to identify
patterns within an image.

Recurrent neural networks (RNNs)

They are identified with the help of feedback loops and are used with time-series data for
making predictions, such as stock market predictions.

How neural networks function

The working of neural networks is pretty simple and can be analyzed in a few steps as shown
below:

Neurons
A neuron is the base unit of the neural network model. It takes inputs, performs calculations
on them, and produces an output. Three main things occur in this phase:

• Each input is multiplied by its weight.

• The weighted inputs are summed together.

• A bias b is added to the sum.

import numpy as np

def sigmoid(x):
    # Our activation function: f(x) = 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # Weight the inputs, add the bias, then use the activation function
        total = np.dot(self.weights, inputs) + self.bias
        return sigmoid(total)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)

x = np.array([2, 3])        # x1 = 2, x2 = 3
print(n.feedforward(x))     # 0.9990889488055994
Source: https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-
neural-networks-d49f22d238f9

With the help of the activation function, an unbounded input is turned into an output with a
predictable form. The sigmoid function is one such activation function: it only outputs
numbers between 0 and 1. Large negative inputs produce outputs close to 0, and large
positive inputs produce outputs close to 1.

Combining neurons into a network

A neural network is a bunch of neurons linked together. A simple network might have two
inputs, a hidden layer with two neurons (h1 and h2), and an output layer with one neuron
(o1). A hidden layer is any layer between the input and the output layer, and there can be any
number of them.

A neural network itself can have any number of layers with any number of neurons in it. The
basic principle remains the same: feed the algorithm inputs to produce the desired output.
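As a concrete sketch of such a network, here is a minimal two-input, two-hidden-neuron, one-output feedforward pass in numpy. The weight and bias values below are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def feedforward(x, W1, b1, W2, b2):
    # Hidden layer: neurons h1 and h2 computed in one matrix product
    h = sigmoid(W1 @ x + b1)
    # Output layer: a single neuron o1
    return sigmoid(W2 @ h + b2)

# Arbitrary example weights for a 2-input, 2-hidden, 1-output network
W1 = np.array([[0.5, -0.6],
               [0.1,  0.8]])   # shape (2 hidden neurons, 2 inputs)
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, -1.0]])   # shape (1 output neuron, 2 hidden)
b2 = np.array([0.0])

x = np.array([2.0, 3.0])
out = feedforward(x, W1, b1, W2, b2)
print(out)  # a single value in (0, 1)
```

Because every layer is just a matrix product followed by the activation, stacking more layers only repeats the same two operations.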

Training the neural network

The neural network is trained and improved upon. Mean squared error loss can be used for
this. A quick refresher: a loss function quantifies how well your neural network is doing, so
that you can try to improve it.

The mean squared error loss is defined as:

MSE = (1/N) * Σ (y_true − y_pred)²

In the above formula,

• N is the number of inputs

• Y is the variable used for the prediction

• Y_true is the true value of the predictor variable

• Y_pred is the predicted value of the variable or the output.


Here, (y_true - y_pred)^2 is the squared error, and the loss function takes the average of it
over all samples. Think of loss as a function of the weights: the better you predict, the lower
the loss. The goal, then, is to train the network by minimizing the loss. You can change the
network weights to influence predictions; label each weight in the network and write the loss
as a multivariate function of those weights.
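A minimal implementation of this loss might look like the following; the example labels are illustrative, showing a network that predicts all zeros:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean of the squared differences over all N samples
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

# Example: true labels vs. a network that predicts all zeros
print(mse_loss([1, 0, 0, 1], [0, 0, 0, 0]))  # 0.5
```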

Stochastic gradient descent shows how to change the weights to minimize loss. The update
equation is:

w1 ← w1 − η * ∂L/∂w1

η is a constant known as the learning rate which governs how quickly you train. Subtracting
η * ∂L/∂w1 from w1 has the following effect:

• When ∂L/∂w1 is positive, w1 will decrease and make L decrease.

• When it's negative, w1 will increase and make L decrease.

Doing this for each weight in the network will see the loss decrease and improve the network.
It is vital to have a proper training process, such as:

• Choosing one sample from the dataset at a time; operating on only one sample per
update is what makes it stochastic gradient descent.

• Calculating all the derivatives of loss concerning the weights.

• Using the update equation to update each weight.

• Going back to step 1 and moving forward.
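The steps above can be sketched for a single sigmoid neuron with a squared-error loss. The toy dataset, learning rate, and epoch count below are illustrative assumptions, not values from the text:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
# Toy dataset: label is 1 when the sum of the two inputs is positive
X = rng.normal(size=(50, 2))
y = (X.sum(axis=1) > 0).astype(float)

w = np.zeros(2)
b = 0.0
eta = 0.5  # learning rate

def loss(w, b):
    preds = sigmoid(X @ w + b)
    return np.mean((y - preds) ** 2)

before = loss(w, b)
for epoch in range(100):
    for xi, yi in zip(X, y):          # one sample at a time -> stochastic
        pred = sigmoid(xi @ w + b)
        # dL/dw for the squared error of this single sample
        grad = 2 * (pred - yi) * pred * (1 - pred)
        w -= eta * grad * xi          # update each weight
        b -= eta * grad
after = loss(w, b)
print(before, '->', after)  # the loss decreases as training proceeds
```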

Once you have completed the processes above, you’re ready to implement a complete neural
network. The steps mentioned will see loss steadily decrease and accuracy improve. Practice
by running and playing with the code to gain a deeper understanding of how to refine neural
networks.
Classification Of Supervised Learning Algorithms

1. Gradient Descent

2. Stochastic

#1) Gradient Descent Learning

In this type of learning, the error reduction takes place with the help of weights and the
activation function of the network. The activation function should be differentiable.

The adjustment of weights depends on the gradient of the error E in this learning. The
backpropagation rule is an example of this type of learning. Thus the weight adjustment is
defined as:

Δw = −η * ∂E/∂w, where η is the learning rate.

#2) Stochastic Learning

In this learning, the weights are adjusted in a probabilistic fashion.

Classification Of Unsupervised Learning Algorithms

1. Hebbian

2. Competitive

#1) Hebbian Learning

This learning was proposed by Hebb in 1949. It is based on correlative adjustment of weights.
The input and output pattern pairs are associated with a weight matrix, W.

The transpose of the output is taken for weight adjustment.

#2) Competitive Learning


It is a winner takes all strategy. In this type of learning, when an input pattern is sent to the
network, all the neurons in the layer compete and only the winning neurons have weight
adjustments.

Widrow-Hoff Learning Algorithm

Also known as Delta Rule, it follows gradient descent rule for linear regression.

It updates the connection weights with the difference between the target and the output value.
It is the least mean square learning algorithm falling under the category of the supervised
learning algorithm.

This rule is followed by ADALINE (Adaptive Linear Neural Networks) and MADALINE. Unlike
Perceptron, the iterations of Adaline networks do not stop, but it converges by reducing the
least mean square error. MADALINE is a network of more than one ADALINE.

The motive of the delta learning rule is to minimize the error between the output and the target
vector. The weights in ADALINE networks are updated by:

w_new = w_old + α(t − y_in)x

Least mean square error = (t − y_in)², and ADALINE converges when the least mean square
error is reached.

A learning rule enhances an Artificial Neural Network's performance by updating the weights
and bias levels of the network when certain conditions are met in the training process. It is a
crucial part of the development of a neural network.

Types Of Learning Rules in ANN:


1. Hebbian Learning Rule

Donald Hebb developed it in 1949 as an unsupervised learning algorithm in the neural
network. We can use it to improve the weights of nodes of a network. The following
phenomena occur:

• If two neighbor neurons are operating in the same phase at the same period of time,
then the weight between these neurons should increase.

• For neurons operating in the opposite phase, the weight between them should
decrease.

• If there is no signal correlation, the weight does not change, the sign of the weight
between two nodes depends on the sign of the input between those nodes

• When inputs of both the nodes are either positive or negative, it results in a strong
positive weight.

• If the input of one node is positive and negative for the other, a strong negative weight
is present.

Mathematical Formulation:

δw = α·xi·y

where δw is the change in weight, α is the learning rate, xi is the input vector, and y is the
output.
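A minimal sketch of this update in numpy; the input, output, and learning-rate values are illustrative:

```python
import numpy as np

def hebbian_update(w, x, y, alpha=0.1):
    # delta_w = alpha * x_i * y : the weight grows when input and output
    # have the same sign, and shrinks toward negative when they differ
    return w + alpha * x * y

w = np.zeros(3)
x = np.array([1.0, -1.0, 0.0])
y = 1.0
w = hebbian_update(w, x, y)
print(w)  # [ 0.1 -0.1  0. ]
```

Note how the zero input leaves its weight unchanged, matching the "no signal correlation" case above.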

2. Perceptron Learning Rule

It was introduced by Rosenblatt. It is an error-correcting rule for a single-layer feedforward
network. It is supervised in nature: it calculates the error between the desired and actual
output, and weights are adjusted only when an error is present.

Computed as follows:

Assume (x1,x2,x3……………………….xn) –>set of input vectors


and (w1,w2,w3…………………..wn) –>set of weights

y=actual output

wo=initial weight

wnew=new weight

δw=change in weight

α=learning rate

actual output (y) = f(Σ wi·xi)

learning signal(ej)=ti-y (difference between desired and actual output)

δw=αxiej

wnew=wo+δw

Now, the output can be calculated on the basis of the input and the activation function applied
over the net input and can be expressed as:

y=1, if net input>=θ

y=0, if net input<θ
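Putting the rule together, here is a sketch of perceptron training on a toy AND-gate dataset; the data, learning rate, epoch count, and the constant-1 bias input are illustrative assumptions:

```python
import numpy as np

def perceptron_train(X, t, alpha=0.1, epochs=20, theta=0.0):
    # w_new = w_old + alpha * x_i * (t_i - y), with a step activation:
    # y = 1 if net input >= theta, else 0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, ti in zip(X, t):
            y = 1.0 if xi @ w >= theta else 0.0   # step activation
            w = w + alpha * xi * (ti - y)          # error-correcting update
    return w

# Toy data: logical AND, with a constant 1 input acting as a bias weight
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)

w = perceptron_train(X, t)
preds = (X @ w >= 0).astype(float)
print(preds)  # [0. 0. 0. 1.]
```

Because AND is linearly separable, the rule converges and the weights stop changing once every sample is classified correctly.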

3. Delta Learning Rule

It was developed by Bernard Widrow and Marcian Hoff. It depends on supervised learning
and has a continuous activation function. It is also known as the Least Mean Square method,
and it minimizes error over all the training patterns.
It is based on a gradient descent approach which continues forever. It states that the
modification in the weight of a node is equal to the product of the error and the input where
the error is the difference between desired and actual output.

Computed as follows:

Assume (x1,x2,x3……………………….xn) –>set of input vectors

and (w1,w2,w3…………………..wn) –>set of weights

y=actual output

wo=initial weight

wnew=new weight

δw=change in weight

Error= ti-y

Learning signal(ej)=(ti-y)y’

y = f(net input) = f(Σ wi·xi)

δw=αxiej=αxi(ti-y)y’

wnew=wo+δw

The updating of weights can only be done if there is a difference between the target and
actual output(i.e., error) present:

case I: when t=y

then there is no change in weight

case II: else

wnew=wo+δw
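A minimal sketch of repeated delta-rule updates using a sigmoid activation; the input, target, learning rate, and iteration count are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def delta_update(w, x, t, alpha=0.5):
    # y = f(sum(w_i * x_i)); delta_w = alpha * x * (t - y) * y'
    y = sigmoid(x @ w)
    y_prime = y * (1 - y)              # derivative of the sigmoid
    return w + alpha * x * (t - y) * y_prime

w = np.array([0.0, 0.0])
x = np.array([1.0, 1.0])
t = 1.0

err_before = abs(t - sigmoid(x @ w))
for _ in range(50):
    w = delta_update(w, x, t)
err_after = abs(t - sigmoid(x @ w))
print(err_before, '->', err_after)  # the error shrinks toward 0
```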
4. Correlation Learning Rule

The correlation learning rule follows a principle similar to the Hebbian learning rule: if two
neighboring neurons operate in the same phase at the same period of time, the weight
between them should become more positive, and for neurons operating in the opposite
phase, the weight between them should become more negative. Unlike the Hebbian rule,
however, the correlation rule is supervised in nature: the targeted response is used for the
calculation of the change in weight.

In Mathematical form:

δw=αxitj

where δw=change in weight,α=learning rate,xi=set of the input vector, and tj=target value

5. Out Star Learning Rule

It was introduced by Grossberg and is a supervised training procedure.

Out Star Learning Rule is implemented when nodes in a network are arranged in a layer. Here
the weights linked to a particular node should be equal to the targeted outputs for the nodes
connected through those same weights. Weight change is thus calculated as=δw=α(t-y)

Where α=learning rate, y=actual output, and t=desired output for n layer nodes.

6. Competitive Learning Rule

It is also known as the Winner-Takes-All rule and is unsupervised in nature. Here all the
output nodes compete with each other to represent the input pattern; the node with the
highest output is declared the winner, is given the output 1, and the rest are given 0.
There are a set of neurons with arbitrarily distributed weights and the activation function is
applied to a subset of neurons. Only one neuron is active at a time. Only the winner has
updated weights, the rest remain unchanged.

Linear Classification

→ Linear Classification refers to categorizing a set of data points into a discrete class based
on a linear combination of its explanatory variables.

→ Some of the classifiers that use linear functions to separate classes are Linear Discriminant
Classifier, Naive Bayes, Logistic Regression, Perceptron, SVM (linear kernel).

→ In the figure above, we have two classes, namely 'O' and '+.' To differentiate between the
two classes, an arbitrary line is drawn, ensuring that both the classes are on distinct sides.

→ Since we can tell one class apart from the other, these classes are called ‘linearly-
separable.’

→ However, an infinite number of lines can be drawn to distinguish the two classes.

→ The exact location of this plane/hyperplane depends on the type of the linear classifier.
Linear Discriminant Classifier

→ It is a dimensionality reduction technique in the domain of Supervised Machine Learning.

→ It is crucial in modeling differences between two groups, i.e., classes.

→ It helps project features in a high dimensions space in a lower-dimensional space.

→ Technique - Linear Discriminant Analysis (LDA) is used, which reduces the 2D graph into a
1D graph by creating a new axis. This helps to maximize the distance between the two
classes for differentiation.

→ In the above graph, we notice that a new axis is created, which maximizes the distance
between the mean of the two classes.

→ As a result, variation within each class is also minimized.

→ However, the problem with LDA is that it would fail in case the means of both the classes
are the same. This would mean that we would not be able to generate a new axis for
differentiating the two.
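A minimal numpy sketch of how LDA finds the new axis, using Fisher's criterion w = Sw⁻¹(m1 − m2), where Sw is the within-class scatter. The two Gaussian classes are synthetic illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two Gaussian classes in 2-D with different means
X1 = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
X2 = rng.normal(loc=[2, 2], scale=0.5, size=(100, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter matrix (sum of the per-class covariances)
Sw = np.cov(X1.T) + np.cov(X2.T)
# Fisher's discriminant direction: maximizes the distance between class
# means relative to the variation within each class
w = np.linalg.solve(Sw, m1 - m2)

# Project both classes onto the new 1-D axis
p1, p2 = X1 @ w, X2 @ w
print(p1.mean(), p2.mean())  # projected class means are well separated
```

If the class means coincide, m1 − m2 is (near) zero and no useful axis exists, which is exactly the failure case described above.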
Naive Bayes

→ It is based on the Bayes Theorem and lies in the domain of Supervised Machine Learning.

→ Every feature is considered equal and independent of the others during Classification.

→ Naive Bayes computes the likelihood of occurrence of an event given that another event
has occurred, i.e., conditional probability, using Bayes' theorem:

P(A|B) = P(B|A) * P(A) / P(B)

A: event 1

B: event 2

P(A|B): Probability of A being true given B is true - posterior probability

P(B|A): Probability of B being true given A is true - the likelihood

P(A): Probability of A being true - prior

P(B): Probability of B being true - marginalization

However, in the case of the Naive Bayes classifier, we are concerned only with the maximum
posterior probability, so we ignore the denominator, i.e., the marginal likelihood. Argmax does
not depend on the normalization term.
→ The Naive Bayes classifier is based on two essential assumptions:-

(i) Conditional Independence - All features are independent of each other. This implies that
one feature does not affect the performance of the other. This is the sole reason behind the
‘Naive’ in ‘Naive Bayes.’

(ii) Feature Importance - All features are equally important. It is essential to know all the
features to make good predictions and get the most accurate results.

→ Naive Bayes is classified into three main types: Multinomial Naive Bayes, Bernoulli Naive
Bayes, and Gaussian Naive Bayes.
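As an illustration of the maximum-posterior idea, here is a minimal Gaussian Naive Bayes classifier written from scratch; the synthetic two-class data and the class of the same name are illustrative, not a real library API:

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes: each feature is modeled as an
    independent normal distribution per class (the 'naive' assumption)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.stats = {}
        for c in self.classes:
            Xc = X[y == c]
            # per-class prior, feature means, and feature variances
            self.stats[c] = (len(Xc) / len(X),
                             Xc.mean(axis=0),
                             Xc.var(axis=0) + 1e-9)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            best, best_score = None, -np.inf
            for c in self.classes:
                prior, mu, var = self.stats[c]
                # log posterior up to a constant: argmax ignores P(B)
                log_lik = -0.5 * np.sum(np.log(2 * np.pi * var)
                                        + (x - mu) ** 2 / var)
                score = np.log(prior) + log_lik
                if score > best_score:
                    best, best_score = c, score
            preds.append(best)
        return np.array(preds)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
acc = (GaussianNB().fit(X, y).predict(X) == y).mean()
print(acc)  # near 1.0 on this well-separated toy data
```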

Logistic Regression

→ It is a very popular supervised machine learning algorithm.

→ The target variable can take only discrete values for a given set of features.

→ The model builds a regression model to predict the probability of a given data entry.
→ Similar to linear regression, logistic regression uses a linear function and, in addition,
makes use of the 'sigmoid' function.

→ Logistic regression can be further classified into three categories:-

• Binomial - the target variable assumes only two values. Example: '0' or '1'.

• Multinomial - the target variable assumes three or more unordered values.
Example: 'Class A', 'Class B', and 'Class C'.

• Ordinal - the target variable assumes ordered values. Example: 'Very Good',
'Good', 'Average', 'Poor', 'Very Poor'.
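A minimal sketch of binomial logistic regression trained by gradient descent on synthetic data; the dataset, learning rate, and epoch count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    # Gradient descent on the cross-entropy loss of sigmoid(Xw + b)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y)   # gradient w.r.t. weights
        grad_b = np.mean(p - y)           # gradient w.r.t. bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
acc = (preds == y).mean()
print(acc)  # high accuracy on this linearly separable toy data
```

The linear function X @ w + b provides the decision boundary; the sigmoid turns its output into a probability, exactly as described above.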

Support Vector Machine (linear kernel)

→ It is a straightforward supervised machine learning algorithm used for
regression/classification.

→ This model finds a hyper-plane that creates a boundary between the various data types.

→ It can be used for binary Classification as well as multinomial classification problems.


→ A binary classifier can be created for each class to perform multi-class Classification.

→ In the case of SVM, the classifier with the highest score is chosen as the output of the SVM.

→ SVM works very well with linearly separable data but can work for non-linearly separable
data as well.
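One common way to fit a linear SVM is subgradient descent on the regularized hinge loss; the sketch below uses that approach on synthetic data. The hyperparameters are illustrative, and this is only one of several training methods:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.01):
    """Sketch of a linear SVM trained by subgradient descent on the
    regularized hinge loss; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (xi @ w + b)
            if margin < 1:
                # point inside the margin: the hinge loss is active
                w = w - lr * (lam * w - yi * xi)
                b = b + lr * yi
            else:
                w = w - lr * lam * w   # only the regularizer pulls on w
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.8, (50, 2)), rng.normal(2, 0.8, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

w, b = train_linear_svm(X, y)
acc = (np.sign(X @ w + b) == y).mean()
print(acc)  # high accuracy on this linearly separable toy set
```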

Non-Linear Classification

→ Non-Linear Classification refers to categorizing those instances that are not linearly
separable.

→ Some of the classifiers that use non-linear functions to separate classes are Quadratic
Discriminant Classifier, Multi-Layer Perceptron (MLP), Decision Trees, Random Forest, and
K-Nearest Neighbours (KNN).

→ In the figure above, we have two classes, namely 'O' and 'X.' To differentiate between the
two classes, it is impossible to draw an arbitrary straight line to ensure that both the classes
are on distinct sides.
→ We notice that even if we draw a straight line, there would be points of the first-class
present between the data points of the second class.

→ In such cases, piece-wise linear or non-linear classification boundaries are required to
distinguish the two classes.

Quadratic Discriminant Classifier

→ This technique is similar to LDA(Linear Discriminant Analysis) discussed above.

→ The only difference is that here, we do not assume that the covariance matrices of all
classes are the same.

→ We get the quadratic discriminant function as the following:

δ_k(x) = −(1/2) ln|Σ_k| − (1/2)(x − μ_k)ᵀ Σ_k⁻¹ (x − μ_k) + ln π_k

where μ_k, Σ_k, and π_k are the mean, covariance matrix, and prior of class k.

→ Now, let us visualize the decision boundaries of both LDA and QDA on the iris dataset. This
would give us a clear picture of the difference between the two.
Multi-Layer Perceptron (MLP)

→ This is nothing but a collection of fully connected dense layers. These help transform any
given input dimension into the desired dimension.

→ It is nothing but simply a neural network.

→ MLP consists of one input layer(one node belonging to each input), one output layer (one
node belonging to each output), and a few hidden layers (>= one node belonging to each
hidden layer).

→ In the above diagram, we notice three inputs, resulting in 3 nodes belonging to each input.

→ There is one hidden layer consisting of 3 nodes.

→ There is an output layer consisting of 2 nodes, indicating two outputs.

→ Overall, the nodes belonging to the input layer forward their outputs to the nodes present in
the hidden layer. Once this is done, the hidden layer processes the information passed on to it
and then further passes it on to the output layer.
Decision Tree

→ It is considered to be one of the most valuable and robust models.

→ Instances are classified by sorting them down from the root to some leaf node.

→ An instance is classified by starting at the tree's root node, testing the attribute specified by
this node, then moving down the tree branch corresponding to the attribute's value, as shown
in the above figure.

→ The process is repeated based on each derived subset in a recursive partitioning manner.

→ For a better understanding, see the diagram below.

→ The above decision tree helps determine whether the person is fit or not.

→ Similarly, Random Forest, a collection of Decision Trees, is a non-linear classifier too.


K-Nearest Neighbours

→ KNN is a supervised machine learning algorithm used for classification problems. Since it
is a supervised algorithm, it uses labeled data to make predictions.

→ KNN analyzes the 'k' nearest data points and then classifies the new data based on the
same.

→ In detail, to label a new point, the KNN algorithm analyzes the ‘k’ nearest neighbors or ‘k’
nearest data points to the new point. It chooses the label of the new point as the one to which
the majority of the ‘k’ nearest neighbors belong.

→It is essential to choose an appropriate value of ‘K’ to avoid the overfitting of our model.

→ For better understanding, have a look at the diagram below.
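A minimal sketch of the KNN voting procedure on a toy version of the 'O' vs 'X' data; the points and the value of k are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Distances from the new point to every labeled training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Labels of the k nearest neighbours
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote decides the label of the new point
    return Counter(nearest.tolist()).most_common(1)[0][0]

# Toy labeled data: one cluster of 'O' points and one of 'X' points
X_train = np.array([[0, 0], [0, 1], [1, 0],
                    [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array(['O', 'O', 'O', 'X', 'X', 'X'])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 'O'
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 'X'
```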

Simple Linear Regression

Simple linear regression is one of the simplest (hence the name) yet powerful regression
techniques. It has one input ($x$) and one output variable ($y$) and helps us predict the
output from trained samples by fitting a straight line between those variables. For example, we
can predict the grade of a student based upon the number of hours he/she studies using
simple linear regression.

Mathematically, this is represented by the equation:

$$y = mx +c$$

where $x$ is the independent variable (input),

$y$ is the dependent variable (output),

$m$ is slope,

and $c$ is an intercept.

The above mathematical representation is called a linear equation.

Example: Consider a linear equation with two variables, 3x + 2y = 0.

The values which when substituted make the equation right, are the solutions. For the above
equation, (-2, 3) is one solution because when we replace x with -2 and y with +3 the
equation holds true and we get 0.

$$3 * -2 + 2 * 3 = 0$$

A linear equation is always a straight line when plotted on a graph.

In simple linear regression, we assume the slope and intercept to be coefficient and bias,
respectively. These act as the parameters that influence the position of the line to be plotted
between the data.

Imagine you plotted the data points in various colors, below is the image that shows the best-
fit line drawn using linear regression.
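The slope m and intercept c of the best-fit line can be computed in closed form with least squares; the hours-vs-grade numbers below are purely illustrative:

```python
import numpy as np

def fit_line(x, y):
    # Least-squares estimates of slope m and intercept c for y = mx + c
    m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    c = y.mean() - m * x.mean()
    return m, c

# Hypothetical data: exam grade vs. hours studied (illustrative numbers)
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
grade = np.array([52.0, 58.0, 61.0, 67.0, 72.0])

m, c = fit_line(hours, grade)
print(m, c)                 # slope and intercept of the best-fit line
predicted = m * 6 + c       # predicted grade after 6 hours of study
```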

Multiple Linear Regression in Machine Learning


This is similar to simple linear regression, but there is more than one independent variable.
Every value of the independent variable x is associated with a value of the dependent variable
y. As it’s a multi-dimensional representation, the best-fit line is a plane.

Mathematically, it’s expressed by:

$$y = b_0 + b_1x_1 + b_2x_2 + \dots + b_nx_n$$

Imagine you need to predict if a student will pass or fail an exam. We’d consider multiple
inputs like the number of hours he/she spent studying, the total number of subjects, and the
hours he/she slept the previous night. Since we have multiple inputs, we would use multiple
linear regression.
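A sketch of fitting such a plane with ordinary least squares. The three inputs and their coefficients are illustrative, noise-free assumptions, so the fit recovers them exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical inputs: hours studied, number of subjects, hours slept
X = rng.uniform(0, 10, size=(100, 3))
true_b = np.array([2.0, -1.0, 0.5])   # illustrative coefficients
y = 5.0 + X @ true_b                  # b0 = 5; no noise for clarity

# Prepend a column of ones so the intercept b0 is fitted too
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # recovers [b0, b1, b2, b3] = [5, 2, -1, 0.5]
```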

Linear Discriminant Analysis (LDA) in Machine Learning

Linear Discriminant Analysis (LDA) is one of the commonly used dimensionality reduction
techniques in machine learning to solve more than two-class classification problems. It is also
known as Normal Discriminant Analysis (NDA) or Discriminant Function Analysis (DFA).

This can be used to project the features of a higher-dimensional space into a lower-
dimensional space in order to reduce resources and dimensional costs.

What is Linear Discriminant Analysis (LDA)?

Although the logistic regression algorithm is limited to two-class problems, Linear
Discriminant Analysis is applicable to classification problems with more than two classes.

Linear Discriminant analysis is one of the most popular dimensionality reduction techniques
used for supervised classification problems in machine learning. It is also considered a pre-
processing step for modeling differences in ML and applications of pattern classification.
Whenever there is a requirement to separate two or more classes having multiple features
efficiently, the Linear Discriminant Analysis model is considered the most common technique
to solve such classification problems. For example, if we have two classes with multiple
features and need to separate them efficiently, classifying them using a single feature may
show overlapping.

To overcome the overlapping issue in the classification process, we must consider more than
a single feature.

How Linear Discriminant Analysis (LDA) works?

Linear Discriminant analysis is used as a dimensionality reduction technique in machine
learning, using which we can easily transform a 2-D or 3-D graph into a 1-dimensional plane.

Let's consider an example where we have two classes in a 2-D plane having an X-Y axis, and
we need to classify them efficiently. As we have already seen in the above example that LDA
enables us to draw a straight line that can completely separate the two classes of the data
points. Here, LDA uses an X-Y axis to create a new axis by separating them using a straight
line and projecting data onto a new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D plane into
1-D.

To create a new axis, Linear Discriminant Analysis uses the following criteria:

It maximizes the distance between means of two classes.

It minimizes the variance within the individual class.


Using the above two conditions, LDA generates a new axis in such a way that it can maximize
the distance between the means of the two classes and minimize the variation within each
class.

In other words, we can say that the new axis will increase the separation between the data
points of the two classes and plot them onto the new axis.

Why LDA?

Logistic Regression is one of the most popular classification algorithms that perform well for
binary classification but falls short in the case of multiple classification problems with well-
separated classes. At the same time, LDA handles these quite efficiently.

LDA can also be used in data pre-processing to reduce the number of features, just as PCA,
which reduces the computing cost significantly.

LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract useful
data from different faces. Coupled with eigenfaces, it produces effective results.

Drawbacks of Linear Discriminant Analysis (LDA)

Although LDA is specifically used to solve supervised classification problems for two or more
classes, which is not possible using logistic regression in machine learning, LDA also fails in
cases where the means of the distributions are shared. In such a case, LDA cannot create a
new axis that makes both the classes linearly separable.

To overcome such problems, we use non-linear Discriminant analysis in machine learning.

1. Maximal Margin Classifier

This classifier is designed specifically for linearly separable data, i.e., data that can be
separated using a linear hyperplane. But what is linearly separable data?

Linearly separable and non-linearly separable data


Linear and non-linear separable data are described in the diagram below. Linearly separable
data is data that is populated in such a way that it can be easily classified with a straight line
or a hyperplane. Non-linearly separable data, on the other hand, is described as data that
cannot be separated using a simple straight line (requires a complex classifier).

However, as shown in the diagram below, there can be an infinite number of hyperplanes that
will classify the linearly separable classes.

How do we choose the hyperplane that we really need?

Based on the maximum margin, the Maximal-Margin Classifier chooses the optimal
hyperplane. The dotted lines, parallel to the hyperplane in the following diagram are the
margins and the distance between both these dotted lines (Margins) is the Maximum Margin.

A margin passes through the nearest points from each class to the hyperplane; these points
lie at a perpendicular (90°) distance from the hyperplane and are referred to as “Support
Vectors”. Support vectors are shown by circles in the diagram below.

This classifier would choose the hyperplane with the maximum margin which is why it is
known as Maximal – Margin Classifier.

Drawbacks:

This classifier is heavily reliant on the support vectors and changes as the support vectors
change. As a result, it tends to overfit.

It can’t be used for data that isn’t linearly separable. Since the majority of real-world data is
non-linear, this classifier is inefficient in practice.

The maximum margin classifier is also known as a “Hard Margin Classifier” because it
prevents misclassification and ensures that no point crosses the margin. It tends to overfit due
to the hard margin. An extension of the Maximal Margin Classifier, the “Support Vector
Classifier”, was introduced to address this problem.

2. Support Vector Classifier

Support Vector Classifier is an extension of the Maximal Margin Classifier. It is less sensitive
to individual data. Since it allows certain data to be misclassified, it’s also known as the “Soft
Margin Classifier”. It creates a budget under which the misclassification allowance is granted.

Also, It allows some points to be misclassified, as shown in the following diagram. The points
inside the margin and on the margin are referred to as “Support Vectors” in this scenario.
Whereas, the points on the margins were referred to as “Support vectors” in the Maximal –
Margin Classifier.

The margin widens as the budget for misclassification increases, while the margin narrows as
the budget decreases.

While building the model, we use a hyperparameter called “Cost”. Here Cost is the inverse of
the budget: when the budget increases, the Cost decreases, and vice versa. It is denoted by
“C”.

The influence of C’s value on the margin is depicted in the diagram below. When the value is
small, for example, C=1, the margin widens, while when the value is high, the margin narrows
down.

3. Support Vector Machines

Support Vector Machines are an extension of Soft Margin Classifier. It can also be used for
nonlinear classification by using the kernel. As a result, this algorithm performs well in the
majority of real-world problem statements. Since, in the real world, we will mostly find non-
linear separable data, which will necessitate the use of complex classifiers to classify them.

Kernel: It transforms non-linear separable data from lower to higher dimensions to facilitate
linear classification, as illustrated in the figure below. We use the kernel-based technique to
separate non-linear data because separation can be simpler in higher dimensions.

The kernel transforms the data from lower to higher dimensions using mathematical formulas.
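A sketch of this idea with an explicit quadratic feature map; the φ below is the map associated with a degree-2 polynomial kernel, and the circular toy data is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Circular data: the class depends on distance from the origin, so no
# straight line in the original 2-D space can separate the classes
X = rng.normal(size=(200, 2))
r2 = (X ** 2).sum(axis=1)
y = (r2 > 1.0).astype(int)

# Explicit feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), the lift
# implicitly used by the quadratic polynomial kernel
phi = np.column_stack([X[:, 0] ** 2,
                       np.sqrt(2) * X[:, 0] * X[:, 1],
                       X[:, 1] ** 2])

# In the lifted space, the first and third components sum to
# x1^2 + x2^2 = r^2, so a simple plane (a threshold on that sum)
# separates the classes perfectly
preds = ((phi[:, 0] + phi[:, 2]) > 1.0).astype(int)
acc = (preds == y).mean()
print(acc)  # 1.0
```

In practice the kernel trick computes inner products in the lifted space without ever constructing φ(x) explicitly; the explicit map here is only for illustration.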
