
Unit 1

Well-Posed Learning Problem – A computer program is said to learn from experience E with respect to some task T and
some performance measure P, if its performance on T, as measured by P, improves with experience E.
Any problem can be identified as a well-posed learning problem if it has three traits –
● Task
● Performance Measure
● Experience
Some examples that illustrate well-posed learning problems are –
1. To better filter emails as spam or not
● Task – Classifying emails as spam or not
● Performance Measure – The fraction of emails accurately classified as spam or not spam
● Experience – Observing you label emails as spam or not spam
2. A checkers learning problem
● Task – Playing checkers game
● Performance Measure – percent of games won against opponents
● Experience – playing practice games against itself
3. Handwriting Recognition Problem
● Task – recognizing handwritten words within images
● Performance Measure – percent of words accurately classified
● Experience – a database of handwritten words with given classifications
4. A Robot Driving Problem
● Task – driving on public four-lane highways using vision sensors
● Performance Measure – average distance traveled before an error
● Experience – a sequence of images and steering commands recorded while observing a human driver
5. Fruit Prediction Problem
● Task – recognizing different types of fruits
● Performance Measure – ability to predict the maximum variety of fruits
● Experience – training the machine with large datasets of fruit images
6. Face Recognition Problem
● Task – predicting different types of faces
● Performance Measure – ability to predict the maximum number of face types
● Experience – training the machine with large datasets of different face images
7. Automatic Translation of documents
● Task – translating one language used in a document into another language
● Performance Measure – ability to convert one language to another efficiently
● Experience – training the machine with a large dataset of documents in different languages
Topic

1.3. EXAMPLES OF APPLICATIONS IN DIVERSE FIELDS
Machine learning is a growing technology used to mine knowledge from data (popularly known as the data mining field (Section 1.8)).
Wherever data exists, things can be learned from it. Whenever there is an excess of data, the mechanics of learning must be
automated. Machine learning is now applied in many diverse fields, for example:

● Traffic Alerts
● Social Media
● Transportation and Commuting
● Products Recommendations
● Virtual Personal Assistants
● Self Driving Cars
● Dynamic Pricing
● Google Translate
● Online Video Streaming
● Fraud Detection

Topic

In implementing most machine learning algorithms, we represent each data point with a feature vector as the input. A
vector is basically an array of numbers, or, in physics, an object with magnitude and direction. How do we represent our business
data in terms of a vector?

Primitive Feature Vector


Whether the data are measured observations, images (pixels), free text, factors, or shapes, they can be categorized into the
following four types:

1. Categorical data
2. Binary data
3. Numerical data

4. Graphical data
The most primitive representation of a feature vector looks like this:

[Figure: A typical feature vector. Source: https://www.researchgate.net/publication/318740904_Chat_Detection_in_an_Intelligent_Assistant_Combining_Task-oriented_and_Non-task-oriented_Spoken_Dialogue_Systems]

Numerical Data
Numerical data can be represented as individual elements above (like Tweet GRU, Query GRU), and I am not going to talk too
much about it.

Categorical Data
However, for categorical data, how do we represent them? The first basic way is to use one-hot encoding:

[Figure: One-hot encoding of categorical data. Source: https://developers.google.com/machine-learning/data-prep/transform/transform-categorical]

For each type of categorical data, each category has an integer code. In the figure above, each color has a code (0 for red, 1 for
orange, etc.), and each value is eventually transformed into the feature vector on the right, whose length is the total number of
categories found in the data; the element corresponding to that category is filled with 1. This gives a natural way of dealing
with missing data (all elements 0) and multi-category data (multiple non-zero elements).
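
As a concrete illustration (not from the original article), here is a minimal sketch of one-hot encoding with scikit-learn; the color data are made up, and the sparse_output argument assumes scikit-learn 1.2 or newer (older versions use sparse= instead).

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical "color" feature with three categories
colors = np.array([["red"], ["orange"], ["yellow"], ["orange"]])

# handle_unknown="ignore" maps unseen categories to an all-zero row,
# mirroring the natural treatment of missing data described above
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
one_hot = encoder.fit_transform(colors)

print(encoder.categories_)  # the learned category-to-column mapping
print(one_hot)              # one row per sample, one column per category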


In natural language processing, the bag-of-words model is often used to represent free-text data; it is the one-hot encoding
above with words as the categories. It is a good representation as long as the order of the words does not matter.

Binary Data
Binary data can easily be represented by a single element, either 1 or 0.

Graphical Data
Graphical data are best represented in terms of the graph Laplacian and the adjacency matrix. Refer to a previous blog article for more
information.
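
As an illustration of the idea (a sketch with a made-up graph, not taken from the referenced article), the adjacency matrix and graph Laplacian can be built with plain NumPy:

import numpy as np

# Adjacency matrix A of a small undirected graph with edges 0-1, 0-2, 1-2, 2-3
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# Degree matrix D holds each node's degree on the diagonal
D = np.diag(A.sum(axis=1))

# Graph Laplacian L = D - A, a common starting point for graph feature representations
L = D - A
print(L)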

Shortcomings
A feature vector can be a concatenation of various features in terms of all these types except graphical data.

However, such representation that concatenates all the categorical, binary, and numerical fields has a lot of shortcomings:

1. Data in different categories are treated as orthogonal, i.e., perfectly dissimilar. This ignores the correlation between
different variables, which is a very strong assumption.

2. The weights of different fields are not considered.


3. Sometimes, if the numerical values are very large, they outweigh the other categorical data in terms of influence on the computation.

4. The data are very sparse, which wastes a lot of memory and computing time.
5. It is unknown whether some of the data are irrelevant.

Modifying Feature Vectors


In light of these shortcomings, there are three main ways of modifying the feature vectors:

1. Rescaling: rescaling all or some of the elements, or reweighting them, to adjust the influence of different variables.

2. Embedding: condensing the information into vectors of smaller length.

3. Sparse coding: deliberately extending the vectors to a larger length.

Rescaling
Rescaling means rescaling all or some of the elements in the vectors. Usually there are two ways:

1. Normalization: normalizing all the categories of one feature so that they sum to 1.
2. Term frequency-inverse document frequency (tf-idf): weighting the elements so that a term gets a heavier weight when its frequency
is high but it appears in relatively few documents or class labels (a short sketch follows below).
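
Here is a minimal sketch of both ideas on a tiny made-up corpus, assuming scikit-learn: raw counts normalized so each document sums to 1, and tf-idf weighting.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

# Plain counts, then normalize each document so that its elements sum to 1
counts = CountVectorizer().fit_transform(docs).toarray()
normalized = counts / counts.sum(axis=1, keepdims=True)

# tf-idf: terms frequent in a document but rare across documents get heavier weights
tfidf = TfidfVectorizer().fit_transform(docs).toarray()

print(normalized.round(2))
print(tfidf.round(2))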

Embedding
Embedding means condensing a sparse vector to a smaller vector. Many sparse elements disappear and information is encoded
inside the elements. There are rich amount of work on this.

1. Topic models: finding topic models (latent Dirichlet allocation (LDA), structural topic models (STM), etc.) and encoding
the vectors with topics instead;

2. Global dimensionality reduction algorithms: reducing the dimensions by retaining the principal components of the vectors of
all the data, e.g., principal component analysis (PCA), independent component analysis (ICA), multi-dimensional scaling
(MDS) etc;

3. Local dimensionality reduction algorithms: same as the global, but these are good for finding local patterns, where examples
include t-Distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and
Projection (UMAP);

4. Representation learned from deep neural networks: embeddings learned from encoding using neural networks, such as auto-
encoders, Word2Vec, FastText, BERT etc.

5. Mixture Models: Gaussian mixture models (GMM), Dirichlet multinomial mixture (DMM) etc.
6. Others: Tensor decomposition (Schmidt decomposition, Jennrich algorithm etc.), GloVe etc.

Sparse Coding
Sparse coding is good for finding basis vectors for dense vectors.
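
A minimal sketch of the idea, assuming scikit-learn's DictionaryLearning and a made-up random data set: each dense vector is re-expressed as a sparse combination of learned basis vectors (atoms).

import numpy as np
from sklearn.decomposition import DictionaryLearning

X = np.random.RandomState(0).randn(20, 8)      # 20 dense vectors of length 8

# Learn an overcomplete dictionary of 12 atoms and encode X sparsely against it
coder = DictionaryLearning(n_components=12,
                           transform_algorithm="lasso_lars",
                           transform_alpha=0.1,
                           random_state=0)
codes = coder.fit_transform(X)                 # longer (12-dim) but mostly-zero codes

print(codes.shape, "fraction of zeros:", (codes == 0).mean())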

Topic..

1.5. DOMAIN KNOWLEDGE FOR PRODUCTIVE USE OF MACHINE LEARNING
The productive use of machine learning is not just a matter of finding some data and then blindly applying machine learning
algorithms to it. Available commercial workbenches make that easy to do, even with little apparent understanding of what these
algorithms do. The usefulness of such results is questionable.
Topic

What is structured data?

Structured data — typically categorized as quantitative data — is highly organized and easily decipherable by machine learning
algorithms. Developed by IBM in 1974, structured query language (SQL) is the programming language used to manage structured
data. By using a relational (SQL) database, business users can quickly input, search and manipulate structured data.
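
To make the idea concrete, here is a minimal sketch using Python's built-in sqlite3 module; the customers table and its rows are made up for illustration.

import sqlite3

# Structured data lives in tables with a fixed schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customers (name, city) VALUES (?, ?)",
                 [("Asha", "Chennai"), ("Ravi", "Hyderabad")])

# SQL lets business users quickly input, search and manipulate the data
for row in conn.execute("SELECT name FROM customers WHERE city = 'Hyderabad'"):
    print(row)
conn.close()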

Pros and cons of structured data


Examples of structured data include dates, names, addresses, credit card numbers, etc. Their benefits are tied to ease of use and
access, while liabilities revolve around data inflexibility:

Pros

● Easily used by machine learning (ML) algorithms: The specific and organized architecture of
structured data eases manipulation and querying of ML data.
● Easily used by business users: Structured data does not require an in-depth understanding of different
types of data and how they function. With a basic understanding of the topic relative to the data, users can easily access
and interpret the data.
● Accessible by more tools: Since structured data predates unstructured data, there are more tools available
for using and analyzing structured data.

Cons

● Limited usage: Data with a predefined structure can only be used for its intended purpose, which limits its flexibility
and usability.
● Limited storage options: Structured data is generally stored in data storage systems with rigid schemas (e.g.,
“data warehouses”). Therefore, changes in data requirements necessitate an update of all structured data, which leads to
a massive expenditure of time and resources.

Structured data tools

● OLAP: Performs high-speed, multidimensional data analysis from unified, centralized data stores.
● SQLite: Implements a self-contained, serverless, zero-configuration, transactional relational database engine.
● MySQL: Embeds data into mass-deployed software, particularly mission-critical, heavy-load production systems.
● PostgreSQL: Supports SQL and JSON querying as well as high-tier programming languages (C/C++, Java, Python, etc.).

Use cases for structured data

● Customer relationship management (CRM): CRM software runs structured data through analytical
tools to create datasets that reveal customer behavior patterns and trends.
● Online booking: Hotel and ticket reservation data (e.g., dates, prices, destinations, etc.) fits the “rows and
columns” format indicative of the pre-defined data model.
● Accounting: Accounting firms or departments use structured data to process and record financial transactions.

What is unstructured data?

Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed via conventional data tools and
methods. Since unstructured data does not have a predefined data model, it is best managed in non-relational (NoSQL) databases.
Another way to manage unstructured data is to use data lakes to preserve it in raw form.

The importance of unstructured data is rapidly increasing. Recent projections indicate that unstructured data is over 80% of all
enterprise data, while 95% of businesses prioritize unstructured data management.

Pros and cons of unstructured data


Examples of unstructured data include text, mobile activity, social media posts, Internet of Things (IoT) sensor data, etc. Their
benefits involve advantages in format, speed and storage, while liabilities revolve around expertise and available resources:

Pros

● Native format: Unstructured data, stored in its native format, remains undefined until needed. Its adaptability
increases file formats in the database, which widens the data pool and enables data scientists to prepare and analyze
only the data they need.
● Fast accumulation rates: Since there is no need to predefine the data, it can be collected quickly and easily.
● Data lake storage: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases
scalability.

Cons

● Requires expertise: Due to its undefined/non-formatted nature, data science expertise is required to prepare and
analyze unstructured data. This is beneficial to data analysts but alienates unspecialized business users who may not fully
understand specialized data topics or how to utilize their data.
● Specialized tools: Specialized tools are required to manipulate unstructured data, which limits product choices for
data managers.

Unstructured data tools

● MongoDB: Uses flexible documents to process data for cross-platform applications and services.
● DynamoDB: Delivers single-digit millisecond performance at any scale via built-in security, in-memory caching and
backup and restore.
● Hadoop: Provides distributed processing of large data sets using simple programming models and no formatting
requirements.
● Azure: Enables agile cloud computing for creating and managing apps through Microsoft’s data centers.

Use cases for unstructured data

● Data mining: Enables businesses to use unstructured data to identify consumer behavior, product sentiment, and
purchasing patterns to better accommodate their customer base.
● Predictive data analytics: Alerts businesses to important activity ahead of time so they can properly plan and adjust
accordingly to significant market shifts.
● Chatbots: Perform text analysis to route customer questions to the appropriate answer sources.

What are the key differences between structured and unstructured data?
While structured (quantitative) data gives a “birds-eye view” of customers, unstructured (qualitative) data provides a deeper
understanding of customer behavior and intent. Let’s explore some of the key areas of difference and their implications:

● Sources: Structured data is sourced from GPS sensors, online forms, network logs, web server logs, OLTP systems,
etc., whereas unstructured data sources include email messages, word-processing documents, PDF files, etc.
● Forms: Structured data consists of numbers and values, whereas unstructured data consists of sensor data, text files,
audio and video files, etc.
● Models: Structured data has a predefined data model and is formatted to a set data structure before being placed in
data storage (e.g., schema-on-write), whereas unstructured data is stored in its native format and not processed until it is
used (e.g., schema-on-read).
● Storage: Structured data is stored in tabular formats (e.g., excel sheets or SQL databases) that require less storage
space. It can be stored in data warehouses, which makes it highly scalable. Unstructured data, on the other hand, is
stored as media files or NoSQL databases, which require more space. It can be stored in data lakes which makes it
difficult to scale.
● Uses: Structured data is used in machine learning (ML) and drives its algorithms, whereas unstructured data is used
in natural language processing (NLP) and text mining.

What is semi-structured data?


Semi-structured data (e.g., JSON, CSV, XML) is the “bridge” between structured and unstructured data. It does not have a
predefined data model and is more complex than structured data, yet easier to store than unstructured data.

Semi-structured data uses “metadata” (e.g., tags and semantic markers) to identify specific data characteristics and scale data into
records and preset fields. Metadata ultimately enables semi-structured data to be better cataloged, searched and analyzed than
unstructured data.

● Example of metadata usage: An online article displays a headline, a snippet, a featured image, image alt-
text, slug, etc., which helps differentiate one piece of web content from similar pieces.
● Example of semi-structured data vs. structured data: A tab-delimited file containing customer
data versus a database containing CRM tables.
● Example of semi-structured data vs. unstructured data: A tab-delimited file versus a list of
comments from a customer’s Instagram.

The future of data

Recent developments in artificial intelligence (AI) and machine learning (ML) are driving the future wave of data, which is enhancing
business intelligence and advancing industrial innovation. In particular, the data formats and models covered in this article are
helping business users to do the following:

● Analyze digital communications for compliance: Pattern recognition and email threading analysis
software that can search email and chat data for potential noncompliance.
● Track high-volume customer conversations in social media: Text analytics and sentiment
analysis that enables monitoring of marketing campaign results and identifying online threats.
● Gain new marketing intelligence: ML analytics tools that can quickly cover massive amounts of data to
help businesses analyze customer behavior.
Furthermore, smart and efficient usage of data formats and models can help you with the following:

● Understand customer needs at a deeper level to better serve them


● Create more focused and targeted marketing campaigns
● Track current metrics and create new ones
● Create better product opportunities and offerings
● Reduce operational costs

Structured and unstructured data and IBM


Whether you are a seasoned data expert or a novice business owner, being able to handle all forms of data is conducive to your
success. By leveraging structured, semi-structured and unstructured data options, you can perform optimal data management that
will ultimately benefit your mission.

To better understand data storage options for whatever kind of data best serves you, check out IBM Cloud Databases.

Topic….

Data Mining vs Machine Learning


Data Mining relates to extracting information from a large quantity of data. Data mining is a technique for discovering different kinds of
patterns that are inherent in the data set and which are precise, new, and useful. Data mining works as a subset of business
analytics and is similar to experimental studies. Data mining's origins are databases and statistics.

Machine learning includes algorithms that automatically improve through data-based experience. Machine learning is a way to find a
new algorithm from experience. Machine learning includes the study of algorithms that can automatically extract the data. Machine
learning utilizes data mining techniques and other learning algorithms to construct models of what is happening behind certain
information so that it can predict future results.

Data Mining and Machine learning are areas that have been influenced by each other; although they have many things in common, they
have different ends.

Data Mining is performed on certain data sets by humans to find interesting patterns between the items in the data set. Data mining uses
techniques created by machine learning for predicting results, while machine learning is the capability of a computer to learn from a
given data set.

Machine learning algorithms take the information that represents the relationships between items in data sets and create models in order
to predict future results. These models are nothing more than actions that will be taken by the machine to achieve a result.

What is Data Mining?


Data Mining is the method of extracting data or previously unknown data patterns from huge sets of data. Hence, as the phrase suggests,
we 'mine for specific data' from the large data set. Data mining, also called the Knowledge Discovery Process, is a field of science used
to determine the properties of data sets. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in
Databases" (KDD) in 1989, and the term "data mining" appeared in the database community around 1990. Huge sets of data collected from data
warehouses or complex data sets such as time series, spatial data, etc. are mined in order to extract interesting correlations and patterns
between the data items. For machine learning algorithms, the output of the data mining algorithm is often used as input.

What is Machine learning?


Machine learning is related to the development and design of a machine that can learn by itself from a specified set of data to obtain a
desirable result without being explicitly coded. Hence, machine learning implies 'a machine which learns on its own'. Arthur
Samuel, an American pioneer in the areas of computer gaming and artificial intelligence, coined the term machine learning in 1959. He said
that "it gives computers the ability to learn without being explicitly programmed."

Machine learning is a technique that creates complex algorithms for large data processing and provides outcomes to its users. It utilizes
complex programs that can learn through experience and make predictions.

The algorithms enhance themselves through frequent input of training data. The aim of machine learning is to understand data
and to build models from data that can be understood and used by humans.

Machine learning algorithms are divided into two types:

o Unsupervised Learning
o Supervised Learning

1. Unsupervised Machine Learning:

Unsupervised learning does not depend on labeled (trained) data sets to predict the results; instead, it utilizes techniques such as clustering and
association in order to predict the results. Trained data sets are defined as the input for which the output is known.

2. Supervised Machine Learning:



As the name implies, supervised learning involves the presence of a supervisor acting as a teacher. Supervised learning is a learning process in
which we teach or train the machine using data which is well labeled, meaning that some data is already tagged with the correct responses.
After that, the machine is provided with new sets of data so that the supervised learning algorithm analyzes the training data and gives
an accurate result from the labeled data.

Major Difference between Data mining and Machine learning


1. Two components are used to introduce data mining techniques: the first is the database, and the second is machine learning. The
database provides data management techniques, while machine learning provides methods for data analysis. Machine learning methods, in
turn, are introduced through algorithms.

2. Data Mining utilizes more data to obtain helpful information, and that specific data will help to predict some future results. For example,
a marketing company may use last year's data to predict this year's sales, but machine learning does not depend as much on data; it uses
algorithms. Many transportation companies, such as OLA and UBER, use machine learning techniques to calculate the ETA (Estimated Time of Arrival)
for rides based on this technique.

3. Data mining is not capable of self-learning. It follows predefined guidelines. It will provide the answer to a specific problem,
but machine learning algorithms are self-defined and can alter their rules according to the situation; they find the solution for a specific
problem and resolve it in their own way.

4. The main and most important difference between data mining and machine learning is that data mining cannot work without the involvement of
humans, whereas in machine learning human effort is involved only when the algorithm is defined; after that, it works everything out on its own.
Once implemented, it can be used repeatedly, which is not possible in the case of data mining.

5. As machine learning is an automated process, the results produced by machine learning will be more precise than those of data mining.

6. Data mining utilizes the database, data warehouse server, data mining engine, and pattern assessment techniques to obtain useful
information, whereas machine learning utilizes neural networks, predictive models, and automated algorithms to make the decisions.

Data Mining Vs Machine Learning

Factors | Data Mining | Machine Learning

Origin | Traditional databases with unstructured data. | It has an existing algorithm and data.

Meaning | Extracting information from a huge amount of data. | Introducing new information from data as well as previous experience.

History | In 1930, it was known as knowledge discovery in databases (KDD). | The first program, i.e., Samuel's checker-playing program, was established in 1950.

Responsibility | Data mining is used to obtain the rules from the existing data. | Machine learning teaches the computer how to learn and comprehend the rules.

Abstraction | Data mining abstracts from the data warehouse. | Machine learning reads machines.

Applications | Compared to machine learning, data mining can produce outcomes on a lesser volume of data. It is also used in cluster analysis. | It needs a large amount of data to obtain accurate results. It has various applications, used in web search, spam filters, credit scoring, computer design, etc.

Nature | It involves more human interference and is more toward the manual. | It is automated; once designed and implemented, there is no need for human effort.

Techniques involved | Data mining is more of research using techniques like machine learning. | It is a self-learned and trained system that does the task precisely.

Scope | Applied in limited fields. | It can be used in a vast area.

Topic…

Linear Algebra for Machine learning


Machine learning has a strong connection with mathematics. Each machine learning algorithm is based on the concepts of mathematics
& also with the help of mathematics, one can choose the correct algorithm by considering training time, complexity, number of features,
etc. Linear Algebra is an essential field of mathematics, which defines the study of vectors, matrices, planes, mapping, and lines
required for linear transformation.

The term Linear Algebra was initially introduced in the early 18th century to find the unknowns in linear equations and solve them
easily; hence it is an important branch of mathematics that helps in studying data. Also, no one can deny that Linear Algebra is
undoubtedly an important and primary tool for the applications of Machine Learning. It is also a prerequisite for starting to learn
Machine Learning and data science.

Linear algebra plays a vital role and key foundation in machine learning, and it enables ML algorithms to run on a huge number of
datasets.

The concepts of linear algebra are widely used in developing algorithms in machine learning. Although it is used almost in each concept
of Machine learning, specifically, it can perform the following task:

o Optimization of data.
o Applicable in loss functions, regularisation, covariance matrices, Singular Value Decomposition (SVD), Matrix Operations, and
support vector machine classification.
o Implementation of Linear Regression in Machine Learning.

Besides the above uses, linear algebra is also used in neural networks and the data science field.

Basic mathematics principles and concepts like Linear algebra are the foundation of Machine Learning and Deep Learning systems. To
learn and understand Machine Learning or Data Science, one needs to be familiar with linear algebra and optimization theory. In this
topic, we will explain all the Linear algebra concepts required for machine learning.

Note: Although linear algebra is a must-know part of mathematics for machine learning, it is not required to master it. It means it is
not required to be an expert in linear algebra; instead, a good knowledge of these concepts is more than enough for machine learning.

Why learn Linear Algebra before learning Machine Learning?


Linear Algebra is to Machine Learning what flour is to a bakery. Just as a cake is based on flour, every Machine Learning
model is based on Linear Algebra. Further, just as a cake needs more ingredients, such as egg, sugar, and cream, Machine
Learning also requires more concepts, such as vector calculus, probability, and optimization theory. So, we can say that Machine Learning
creates a useful model with the help of the above-mentioned mathematical concepts.

Below are some benefits of learning Linear Algebra before Machine learning:

o Better Graphic experience


o Improved Statistics
o Creating better Machine Learning algorithms
o Estimating the forecast of Machine Learning
o Easy to Learn

Better Graphics Experience:


Linear Algebra helps to provide better graphical processing in Machine Learning like Image, audio, video, and edge detection. These are
the various graphical representations supported by Machine Learning projects that you can work on. Further, parts of the given data set
are trained based on their categories by classifiers provided by machine learning algorithms. These classifiers also remove the errors
from the trained data.

Moreover, Linear Algebra helps to solve and compute large and complex data sets through matrix decomposition techniques. The two most
popular matrix decomposition techniques are as follows (a short sketch is given after this list):

o QR decomposition
o LU decomposition
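
The sketch below shows both decompositions on a small made-up matrix, assuming NumPy for QR and SciPy for LU.

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

# QR decomposition: A = Q @ R, with Q orthogonal and R upper triangular
Q, R = np.linalg.qr(A)

# LU decomposition: A = P @ L @ U, with L lower and U upper triangular
P, L, U = lu(A)

print(np.allclose(A, Q @ R), np.allclose(A, P @ L @ U))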

Improved Statistics:
Statistics is an important concept to organize and integrate data in Machine Learning. Also, linear Algebra helps to understand the concept
of statistics in a better manner. Advanced statistical topics can be integrated using methods, operations, and notations of linear algebra.

Creating better Machine Learning algorithms:


Linear Algebra also helps to create better supervised as well as unsupervised Machine Learning algorithms.

A few supervised learning algorithms that can be created using Linear Algebra are as follows:

o Logistic Regression
o Linear Regression
o Decision Trees
o Support Vector Machines (SVM)

Further, below are some unsupervised learning algorithms that can also be created with the help of linear algebra:

o Singular Value Decomposition (SVD)


o Clustering
o Components Analysis

With the help of Linear Algebra concepts, you can also self-customize the various parameters in the live project and understand in-depth
knowledge to deliver the same with more accuracy and precision.

Estimating the forecast of Machine Learning:


If you are working on a Machine Learning project, you must be broad-minded and able to take in multiple perspectives. Hence, in this
regard, you should increase your awareness of and affinity for Machine Learning concepts. You can begin by setting up different graphs
and visualizations, using various parameters for diverse machine learning algorithms, or taking up things that others around you might
find difficult to understand.

Easy to Learn:
Linear Algebra is an important branch of Mathematics that is easy to understand. It is taken into consideration whenever there is a
requirement of advanced mathematics and its applications.

Minimum Linear Algebra for Machine Learning


Notation:
Notation in linear algebra enables you to read algorithm descriptions in papers, books, and websites to understand the algorithm's
working. Even if you use for-loops rather than matrix operations, you will be able to piece things together.

Operations:
Working with an advanced level of abstractions in vectors and matrices can make concepts clearer, and it can also help in the description,
coding, and even thinking capability. In linear algebra, it is required to learn the basic operations such as addition, multiplication, inversion,
transposing of matrices, vectors, etc.

Matrix Factorization:
One of the most recommended areas of linear algebra is matrix factorization, specifically matrix decomposition methods such as SVD and
QR.

Examples of Linear Algebra in Machine Learning


Below are some popular examples of linear algebra in Machine learning:

o Datasets and Data Files



o Linear Regression
o Recommender Systems
o One-hot encoding
o Regularization
o Principal Component Analysis
o Images and Photographs
o Singular-Value Decomposition
o Deep Learning
o Latent Semantic Analysis

1. Datasets and Data Files


Each machine learning project works on the dataset, and we fit the machine learning model using this dataset.

Each dataset resembles a table-like structure consisting of rows and columns. Where each row represents observations, and each
column represents features/Variables. This dataset is handled as a Matrix, which is a key data structure in Linear Algebra.

Further, when this dataset is divided into input and output for the supervised learning model, it represents a Matrix(X) and Vector(y),
where the vector is also an important concept of linear algebra.

2. Images and Photographs


In machine learning, images and photographs are used for computer vision applications. Each image is an example of a matrix from linear
algebra because an image is a grid of pixels with a height and a width.

Moreover, different operations on images, such as cropping, scaling, resizing, etc., are performed using notations and operations of Linear
Algebra.

3. One Hot Encoding


In machine learning, sometimes, we need to work with categorical data. These categorical variables are encoded to make them simpler
and easier to work with, and the popular encoding technique to encode these variables is known as one-hot encoding.

In the one-hot encoding technique, a table is created that shows a variable with one column for each category and one row for each
example in the dataset. Further, each row is encoded as a binary vector, which contains either zero or one value. This is an example of
sparse representation, which is a subfield of Linear Algebra.

4. Linear Regression
Linear regression is a popular technique of machine learning borrowed from statistics. It describes the relationship between input and
output variables and is used in machine learning to predict numerical values. Linear regression problems are most commonly solved using
least-squares optimization, which in turn is solved with the help of matrix factorization methods. Some commonly used matrix factorization
methods are LU decomposition and singular-value decomposition, which are concepts of linear algebra.
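
As a minimal sketch with made-up data: NumPy's least-squares solver returns the regression coefficients directly, and it relies on a matrix factorization (SVD) internally.

import numpy as np

# Made-up data: y is roughly 2*x + 1 plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Design matrix with a column of ones for the intercept term
X = np.column_stack([x, np.ones_like(x)])

# Solve the least-squares problem min ||Xw - y||^2
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
print("slope and intercept:", w)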

5. Regularization
In machine learning, we usually look for the simplest possible model to achieve the best outcome for the specific problem. Simpler
models generalize well, ranging from specific examples to unknown datasets. These simpler models are often considered models with
smaller coefficient values.

A technique used to minimize the size of coefficients of a model while it is being fit on data is known as regularization. Common
regularization techniques are L1 and L2 regularization. Both of these forms of regularization are, in fact, a measure of the magnitude or
length of the coefficients as a vector and are methods lifted directly from linear algebra called the vector norm.
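
The sketch below computes the two vector norms on a made-up coefficient vector; L1 regularization (lasso) penalizes the first quantity and L2 regularization (ridge) penalizes the square of the second.

import numpy as np

coef = np.array([0.5, -1.2, 0.0, 3.0])   # made-up model coefficients

l1_norm = np.sum(np.abs(coef))           # penalized by L1 (lasso) regularization
l2_norm = np.sqrt(np.sum(coef ** 2))     # its square is penalized by L2 (ridge)

print(l1_norm, l2_norm)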

6. Principal Component Analysis


Generally, each dataset contains thousands of features, and fitting the model with such a large dataset is one of the most challenging
tasks of machine learning. Moreover, a model built with irrelevant features is less accurate than a model built with relevant features.
There are several methods in machine learning that automatically reduce the number of columns of a dataset, and these methods are
known as Dimensionality reduction. The most commonly used dimensionality reductions method in machine learning is Principal
Component Analysis or PCA. This technique makes projections of high-dimensional data for both visualizations and training models. PCA
uses the matrix factorization method from linear algebra.
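
A minimal sketch, assuming scikit-learn: PCA projects the built-in 4-feature Iris dataset down to 2 principal components for visualization or model training.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)        # 150 samples, 4 features

pca = PCA(n_components=2)                # keep only the top 2 principal components
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)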

7. Singular-Value Decomposition
Singular-Value decomposition is also one of the popular dimensionality reduction techniques and is also written as SVD in short form.

It is the matrix-factorization method of linear algebra, and it is widely used in different applications such as feature selection, visualization,
noise reduction, and many more.
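
A short NumPy sketch of the decomposition itself, on a made-up matrix: A is factored into U, the singular values, and V-transpose, and can be reconstructed from them.

import numpy as np

A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# Singular-value decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

reconstructed = U @ np.diag(S) @ Vt
print(np.allclose(A, reconstructed), S)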

8. Latent Semantic Analysis


Natural Language Processing or NLP is a subfield of machine learning that works with text and spoken words.

NLP represents a text document as large matrices with the occurrence of words. For example, the matrix column may contain the known
vocabulary words, and rows may contain sentences, paragraphs, pages, etc., with cells in the matrix marked as the count or frequency
of the number of times the word occurred. It is a sparse matrix representation of text. Documents processed in this way are much easier
to compare, query, and use as the basis for a supervised machine learning model.

This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also known by the name Latent Semantic
Indexing or LSI.
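
Here is a minimal sketch of the pipeline described above, assuming scikit-learn and a tiny made-up corpus: build the sparse word-occurrence matrix and reduce it with truncated SVD, the factorization at the heart of LSA.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

docs = ["cats chase mice",
        "dogs chase cats",
        "stock markets rise",
        "markets fall on bad news"]

# Sparse document-term matrix -> low-dimensional latent "topic" space
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
doc_vectors = lsa.fit_transform(docs)

print(doc_vectors.shape)   # (4 documents, 2 latent dimensions)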

9. Recommender System
A recommender system is a sub-field of machine learning, a predictive modelling problem that provides recommendations of products.
For example, online recommendation of books based on the customer's previous purchase history, recommendation of movies and TV
series, as we see in Amazon & Netflix.

The development of recommender systems is mainly based on linear algebra methods. We can understand it as an example of calculating
the similarity between sparse customer behaviour vectors using distance measures such as Euclidean distance or dot products.

Different matrix factorization methods such as singular-value decomposition are used in recommender systems to query, search, and
compare user data.
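
A minimal sketch of that idea with a made-up user-item matrix: similarity between customer behaviour vectors is computed with normalized dot products (cosine similarity), which is the linear-algebra core of many recommenders.

import numpy as np

# Rows = users, columns = items (1 = purchased/watched); made-up data
user_item = np.array([
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
])

# Cosine similarity between users via normalized dot products
normalized = user_item / np.linalg.norm(user_item, axis=1, keepdims=True)
similarity = normalized @ normalized.T

print(similarity.round(2))   # users 0 and 1 are the most similar pair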

10. Deep Learning


Artificial Neural Networks (ANN) are non-linear ML algorithms that process information in a way loosely inspired by the brain, transferring
information from one layer to another.

Deep learning studies these neural networks, which implement newer and faster hardware for the training and development of larger
networks with a huge dataset. All deep learning methods achieve great results for different challenging tasks such as machine translation,
speech recognition, etc. The core of processing neural networks is based on linear algebra data structures, which are multiplied and
added together. Deep learning algorithms also work with vectors, matrices, tensors (matrix with more than two dimensions) of inputs
and coefficients for multiple dimensions.

Conclusion
In this topic, we have discussed Linear algebra, its role and its importance in machine learning. For each machine learning enthusiast, it
is very important to learn the basic concepts of linear algebra to understand the working of ML algorithms and choose the best algorithm
for a specific problem.

Unit 2

Machine learning is a large field of study that overlaps with and inherits ideas from many related fields such as artificial intelligence.

The focus of the field is learning, that is, acquiring skills or knowledge from experience. Most commonly, this means synthesizing useful
concepts from historical data.

As such, there are many different types of learning that you may encounter as a practitioner in the field of machine learning: from whole
fields of study to specific techniques.

In this post, you will discover a gentle introduction to the different types of learning that you may encounter in the field of machine learning.

After reading this post, you will know:

● Fields of study, such as supervised, unsupervised, and reinforcement learning.


● Hybrid types of learning, such as semi-supervised and self-supervised learning.
● Broad techniques, such as active, online, and transfer learning.

Let’s get started.

Types of Learning in Machine Learning


Photo by Lenny K Photography, some rights reserved.

Types of Learning
Given that the focus of the field of machine learning is “learning,” there are many types that you may encounter as a practitioner.
Some types of learning describe whole subfields of study comprised of many different types of algorithms such as “supervised learning.”
Others describe powerful techniques that you can use on your projects, such as “transfer learning.”
There are perhaps 14 types of learning that you must be familiar with as a machine learning practitioner; they are:

Learning Problems
● 1. Supervised Learning
● 2. Unsupervised Learning
● 3. Reinforcement Learning
Hybrid Learning Problems
● 4. Semi-Supervised Learning
● 5. Self-Supervised Learning
● 6. Multi-Instance Learning
Statistical Inference
● 7. Inductive Learning
● 8. Deductive Inference
● 9. Transductive Learning
Learning Techniques
● 10. Multi-Task Learning
● 11. Active Learning
● 12. Online Learning
● 13. Transfer Learning
● 14. Ensemble Learning
In the following sections, we will take a closer look at each in turn.


Topic

Bias is a phenomenon that skews the result of an algorithm in favor or against an idea.

Bias is considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML
process.

Technically, we can define bias as the error between average model prediction and the ground truth. Moreover, it describes how
well the model matches the training data set:

● A model with a higher bias would not match the data set closely.
● A low bias model will closely match the training data set.

Characteristics of a high bias model include:

● Failure to capture proper data trends


● Potential towards underfitting
● More generalized/overly simplified
● High error rate

What is variance in machine learning?


Variance refers to the changes in the model when using different portions of the training data set.

Simply stated, variance is the variability in the model prediction—how much the ML function can adjust depending on the given
data set. Variance comes from highly complex models with a large number of features.

● Models with high bias will have low variance.


● Models with high variance will have a low bias.

All of these contribute to the flexibility of the model. For instance, a high-bias model that does not match the data set closely will
be an inflexible model with low variance, resulting in a suboptimal machine learning model.

Characteristics of a high variance model include:

● Noise in the data set


● Potential towards overfitting
● Complex models
● Trying to put all data points as close as possible

Underfitting & overfitting


The terms underfitting and overfitting refer to how the model fails to match the data. The fitting of a model directly correlates to
whether it will return accurate predictions from a given data set.

● Underfitting occurs when the model is unable to match the input data to the target data. This happens when the model is
not complex enough to match all the available data and performs poorly with the training dataset.
● Overfitting relates to instances where the model tries to match non-existent data. This occurs when dealing with highly
complex models where the model will match almost all the given data points and perform well in training datasets.
However, the model would not be able to generalize the data point in the test data set to predict the outcome accurately.

Bias vs variance: A trade-off


Bias and variance are inversely connected. It is impossible to have an ML model with a low bias and a low variance.

When a data engineer modifies the ML algorithm to better fit a given data set, it will lead to low bias—but it will increase
variance. This way, the model will fit with the data set while increasing the chances of inaccurate predictions.
The same applies when creating a low variance model with a higher bias. While it will reduce the risk of inaccurate predictions,
the model will not properly match the data set.

It's a delicate balance between bias and variance. Importantly, however, having a higher variance does not indicate a bad
ML algorithm. Machine learning algorithms should be able to handle some variance.

We can tackle the trade-off in multiple ways…

Increasing the complexity of the model to account for bias and variance decreases the overall bias while increasing the
variance to an acceptable level. This aligns the model with the training dataset without incurring significant variance errors.
Increasing the training data set can also help to balance this trade-off, to some extent. This is the preferred method when
dealing with overfitting models. Furthermore, with a large data set, this allows users to increase the complexity without variance errors
polluting the model.
A large data set offers more data points for the algorithm to generalize the data easily. However, the major issue with increasing the
training data set is that underfitting, or low-bias, models are not that sensitive to the training data set. Therefore, increasing data is
the preferred solution when it comes to dealing with high-variance and high-bias models.

This table lists common algorithms and their expected behavior regarding bias and variance:

Algorithm | Bias | Variance

Linear Regression | High Bias | Less Variance

Decision Tree | Low Bias | High Variance

Bagging | Low Bias | High Variance (less than Decision Tree)

Random Forest | Low Bias | High Variance (less than Decision Tree and Bagging)

Bias & variance calculation example


Let’s put these concepts into practice—we’ll calculate bias and variance using Python.
The simplest way to do this would be to use a library called mlxtend (machine learning extensions), which is targeted at data
science tasks. This library offers a function called bias_variance_decomp that we can use to calculate bias and variance.
We will be using the Iris dataset included in mlxtend as the base data set and carry out bias_variance_decomp using two
algorithms: Decision Tree and Bagging.

Decision tree example



from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split

# Get Data Set
X, y = iris_data()
X_train_ds, X_test_ds, y_train_ds, y_test_ds = train_test_split(X, y,
                                                                test_size=0.3,
                                                                random_state=123,
                                                                shuffle=True,
                                                                stratify=y)

# Define Algorithm
tree = DecisionTreeClassifier(random_state=123)

# Get Bias and Variance - bias_variance_decomp function
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    tree, X_train_ds, y_train_ds, X_test_ds, y_test_ds,
    loss='0-1_loss',
    random_seed=123,
    num_rounds=1000)

# Display Bias and Variance
print(f'Average Expected Loss: {round(avg_expected_loss, 4)}\n')
print(f'Average Bias: {round(avg_bias, 4)}')
print(f'Average Variance: {round(avg_var, 4)}')

Result:

Bagging example

from mlxtend.evaluate import bias_variance_decomp
from sklearn.tree import DecisionTreeClassifier
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier

# Get Data Set
X, y = iris_data()
X_train_ds, X_test_ds, y_train_ds, y_test_ds = train_test_split(X, y,
                                                                test_size=0.3,
                                                                random_state=123,
                                                                shuffle=True,
                                                                stratify=y)

# Define Algorithm
tree = DecisionTreeClassifier(random_state=123)
# Note: recent scikit-learn versions rename base_estimator to estimator
bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=100,
                        random_state=123)

# Get Bias and Variance - bias_variance_decomp function
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    bag, X_train_ds, y_train_ds, X_test_ds, y_test_ds,
    loss='0-1_loss',
    random_seed=123,
    num_rounds=1000)

# Display Bias and Variance
print(f'Average Expected Loss: {round(avg_expected_loss, 4)}\n')
print(f'Average Bias: {round(avg_bias, 4)}')
print(f'Average Variance: {round(avg_var, 4)}')

Result:

Each of the above functions will run 1,000 rounds (num_rounds=1000) before calculating the average bias and variance values.
There, we can see that a bagging classifier reduces the variance without affecting the bias: the higher the algorithm complexity,
the lower the variance.

In the following example, we will have a look at three different linear regression models, least-squares, ridge, and lasso,
using the sklearn library. Since they are all linear regression algorithms, their main difference is the coefficient values.
We can see that these different algorithms lead to different outcomes in the ML process (bias and variance).


from sklearn import linear_model
import numpy as np
from sklearn.metrics import mean_squared_error

def calculate_bias_variance(xTest, ytest, model):
    # Tiny training set: x = [1, 2, 3], y = [2, 4, 6]
    ar = np.array([[[1], [2], [3]], [[2], [4], [6]]])
    y = ar[1, :]
    x = ar[0, :]

    if model == 1:
        reg = linear_model.LinearRegression()
        reg.fit(x, y)
        print(f'\nLeast Square Coefficients: {reg.coef_}')
    if model == 2:
        reg = linear_model.Ridge(alpha=0.1)
        reg.fit(x, y)
        print(f'\nRidged Coefficients: {reg.coef_}')
    if model == 3:
        reg = linear_model.Lasso(alpha=0.1)
        reg.fit(x, y)
        print(f'\nLasso Coefficients: {reg.coef_}')

    preds = reg.predict(xTest)
    er = []
    for i in range(len(ytest)):
        print("Actual=", ytest[i], " Preds=", preds[i])
        er.append((ytest[i] - preds[i]) ** 2)

    variance_value = np.var(er)
    print(f"Variance {round(variance_value, 2)}")
    print(f"Bias: {round(mean_squared_error(ytest, preds), 2)}")

dateset_a = np.array([[4], [5], [6]])
dateset_b = np.array([[8.8], [14], [17]])

# Least Square Coefficients
calculate_bias_variance(dateset_a, dateset_b, 1)

# Ridged Coefficients
calculate_bias_variance(dateset_a, dateset_b, 2)

# Lasso Coefficients
calculate_bias_variance(dateset_a, dateset_b, 3)

Result:

Considering bias & variance is crucial


Bias and variance are two key components that you must consider when developing any good, accurate machine learning model.

● Bias creates consistent errors in the ML model, which represents a simpler ML model that is not suitable for a specific
requirement.
● On the other hand, variance creates errors that lead to incorrect predictions, seeing trends or data points that do not exist.

Users need to consider both of these factors when creating an ML model. Generally, your goal is to keep bias as low as possible
while introducing acceptable levels of variance. This can be done either by increasing the complexity or by increasing the training
data set.

In this balanced way, you can create an acceptable machine learning model.


Topic.,

Computational learning theory, or statistical learning theory, refers to mathematical frameworks for quantifying learning tasks and algorithms.

These are sub-fields of machine learning that a machine learning practitioner does not need to know in great depth in order to achieve good
results on a wide range of problems. Nevertheless, it is a sub-field where having a high-level understanding of some of the more prominent
methods may provide insight into the broader task of learning from data.

In this post, you will discover a gentle introduction to computational learning theory for machine learning.

After reading this post, you will know:



● Computational learning theory uses formal methods to study learning tasks and learning algorithms.
● PAC learning provides a way to quantify the computational difficulty of a machine learning task.
● VC Dimension provides a way to quantify the computational capacity of a machine learning algorithm.
Let’s get started.

A Gentle Introduction to Computational Learning Theory


Photo by someone10x, some rights reserved.

Tutorial Overview
This tutorial is divided into three parts; they are:

1. Computational Learning Theory


2. PAC Learning (Theory of Learning Problems)
3. VC Dimension (Theory of Learning Algorithms)
Computational Learning Theory
Computational learning theory, or CoLT for short, is a field of study concerned with the use of formal mathematical methods applied to
learning systems.
It seeks to use the tools of theoretical computer science to quantify learning problems. This includes characterizing the difficulty of learning
specific tasks.

Computational learning theory may be thought of as an extension or sibling of statistical learning theory, or SLT for short, that uses formal
methods to quantify learning algorithms.
● Computational Learning Theory (CoLT): Formal study of learning tasks.
● Statistical Learning Theory (SLT): Formal study of learning algorithms.

As a machine learning practitioner, it can be useful to know about computational learning theory and some of the main areas of investigation.
The field provides a useful grounding for what we are trying to achieve when fitting models on data, and it may provide insight into the
methods.

There are many subfields of study, although perhaps two of the most widely discussed areas of study from computational learning theory
are:

● PAC Learning.
● VC Dimension.
Tersely, we can say that PAC Learning is the theory of machine learning problems and VC dimension is the theory of machine learning
algorithms.

You may encounter the topics as a practitioner and it is useful to have a thumbnail idea of what they are about. Let’s take a closer look at
each.

If you would like to dive deeper into the field of computational learning theory, I recommend the book:

PAC Learning (Theory of Learning Problems)


Probably approximately correct learning, or PAC learning, refers to a theoretical machine learning framework developed by Leslie Valiant.

PAC learning seeks to quantify the difficulty of a learning task and might be considered the premier sub-field of computational learning
theory.

Consider that in supervised learning, we are trying to approximate an unknown underlying mapping function from inputs to outputs. We don’t
know what this mapping function looks like, but we suspect it exists, and we have examples of data produced by the function.

PAC learning is concerned with how much computational effort is required to find a hypothesis (fit model) that is a close match for the
unknown target function.
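For intuition, a classical PAC result for a consistent learner searching a finite hypothesis space H gives a concrete sample-complexity bound (stated here as a standard illustration):

$m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$

With at least $m$ training examples, any hypothesis that is consistent with the data has, with probability at least $1 - \delta$, true error at most $\epsilon$. The "probably" corresponds to $1 - \delta$ and the "approximately correct" corresponds to error at most $\epsilon$.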

VC Dimension (Theory of Learning Algorithms)


Vapnik–Chervonenkis theory, or VC theory for short, refers to a theoretical machine learning framework developed by Vladimir
Vapnik and Alexey Chervonenkis.
VC theory learning seeks to quantify the capability of a learning algorithm and might be considered the premier sub-field of statistical learning
theory.

VC theory is comprised of many elements, most notably the VC dimension.


The VC dimension quantifies the complexity of a hypothesis space, e.g. the models that could be fit given a representation and learning
algorithm.

One way to consider the complexity of a hypothesis space (space of models that could be fit) is based on the number of distinct hypotheses
it contains and perhaps how the space might be navigated. The VC dimension is a clever approach that instead measures the number of
examples from the target problem that can be discriminated by hypotheses in the space.

The VC dimension measures the complexity of the hypothesis space […] by the number of distinct instances from X that can be completely
discriminated using H.
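As a brief worked example of this idea: a linear classifier (a straight line) in the two-dimensional plane can shatter, i.e. correctly separate under every possible labeling, any 3 points that are not collinear, but no set of 4 points can be shattered (the XOR-style labeling always fails). Its VC dimension is therefore 3, and more generally a linear classifier in d dimensions has VC dimension d + 1.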

Topic..,

Occam’s razor
Many philosophers throughout history have advocated the idea of parsimony. Aristotle, one of the greatest Greek philosophers, went as far as to say, “Nature operates in the shortest way possible”. As a consequence, humans may be biased toward choosing a simpler explanation from a set of possible explanations with the same descriptive power. This post gives a brief overview of Occam’s razor, the relevance of the principle, and ends with a note on the usage of this razor as an inductive bias in machine learning (decision tree learning in particular).

What is Occam’s razor?


Occam’s razor is a law of parsimony popularly stated (in William’s words) as “Plurality must never be posited without necessity”. Alternatively, as a heuristic, it can be viewed as: when there are multiple hypotheses that solve a problem equally well, the simpler one is to be preferred. It is not clear to whom this principle can be conclusively attributed, but William of Occam’s (c. 1287 – 1347) preference for simplicity is well documented. Hence this principle goes by the name “Occam’s razor”. The name reflects the idea of cutting off or shaving away other possibilities or explanations, hence the “razor”. It should be noted that these explanations or hypotheses should lead to the same result.
Relevance of Occam’s razor.
There are many events that favor a simpler approach either as an inductive bias or a constraint to begin with. Some of them
are :
● Developmental studies have suggested that preschoolers are sensitive to simpler explanations during their initial years of learning and development.
● Preference for a simpler approach and explanations to achieve the same goal is seen in various facets of sciences; for
instance, the parsimony principle applied to the understanding of evolution.
● In theology, ontology, epistemology, etc this view of parsimony is used to derive various conclusions.
● Variants of Occam’s razor are used in knowledge discovery.
Occam’s razor as an inductive bias in machine learning.
Note: It is highly recommended to read the article on decision tree introduction for an insight on decision tree building with
examples.
● Inductive bias (or the inherent bias of the algorithm) is the set of assumptions made by the learning algorithm to form a hypothesis or a generalization beyond the set of training instances in order to classify unobserved data.
● Occam’s razor is one of the simplest examples of inductive bias. It involves a preference for a simpler hypothesis that
best fits the data. Though the razor can be used to eliminate other hypotheses, relevant justification may be needed to do
so. Below is an analysis of how this principle is applicable in decision tree learning.

● The decision tree learning algorithms follow a search strategy to search the hypotheses space for the hypothesis that
best fits the training data. For example, the ID3 algorithm uses a simple to complex strategy starting from an empty tree
and adding nodes guided by the information gain heuristic to build a decision tree consistent with the training instances.
The information gain of every attribute (which is not already included in the tree) is calculated to infer which attribute should be considered as the next node. Information gain is the essence of the ID3 algorithm. It gives a quantitative measure of the information that an attribute can provide about the target variable, i.e., assuming only information of that attribute is available, how efficiently we can infer the target. It can be defined as:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where S is the set of training instances, A is a candidate attribute, and S_v is the subset of S for which attribute A has value v.
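As an illustration of how this quantity is computed in practice, the short sketch below (written for this note with a made-up toy attribute, not taken from the original ID3 description) calculates entropy and information gain in plain Python:

# Sketch: entropy and information gain for a candidate attribute (toy example).
from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum_c p_c * log2(p_c)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
    total = len(labels)
    gain = entropy(labels)
    for v in set(attribute_values):
        subset = [lab for lab, a in zip(labels, attribute_values) if a == v]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: target label and one attribute ("Wind") per instance.
play = ['no', 'no', 'yes', 'yes', 'yes', 'no']
wind = ['strong', 'strong', 'weak', 'weak', 'weak', 'strong']
print(information_gain(play, wind))  # an attribute that splits the classes well scores higher

An attribute with the highest information gain is chosen as the next node, which is how ID3 grows a short tree that is consistent with the data.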

Generalization error

For supervised learning applications in machine learning and statistical learning theory, generalization error[1] (also known as
the out-of-sample error[2] or the risk) is a measure of how accurately an algorithm is able to predict outcome values for previously
unseen data. Because learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive
to sampling error. As a result, measurements of prediction error on the current data may not provide much information about
predictive ability on new data. Generalization error can be minimized by avoiding overfitting in the learning algorithm. The
performance of a machine learning algorithm is visualized by plots that show values of estimates of the generalization error through
the learning process, which are called learning curves.


Definition
See also: Statistical learning theory

In a learning problem, the goal is to develop a function $f_n(\vec{x})$ that predicts output values $y$ for each input datum $\vec{x}$. The subscript $n$ indicates that the function is developed based on a data set of $n$ data points. The generalization error or expected loss or risk $I[f]$ of a particular function $f$ over all possible values of $\vec{x}$ and $y$ is:[3]

$I[f] = \int_{X \times Y} V(f(\vec{x}), y)\, \rho(\vec{x}, y)\, d\vec{x}\, dy,$

where $V$ denotes a loss function and $\rho(\vec{x}, y)$ is the unknown joint probability distribution for $\vec{x}$ and $y$.

Without knowing the joint probability distribution $\rho$, it is impossible to compute $I[f]$. Instead, we can compute the error on sample data, which is called the empirical error (or empirical risk). Given $n$ data points, the empirical error of a candidate function $f$ is:

$I_n[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(\vec{x}_i), y_i)$

An algorithm is said to generalize if:

$\lim_{n \to \infty} \left( I[f] - I_n[f] \right) = 0$

Of particular importance is the generalization error $I[f_n]$ of the data-dependent function $f_n$ that is found by a learning algorithm based on the sample. Again, for an unknown probability distribution, $I[f_n]$ cannot be computed. Instead, the aim of many problems in statistical learning theory is to bound or characterize the difference of the generalization error and the empirical error in probability:

$P_G = P\left( I[f_n] - I_n[f_n] \le \epsilon \right) \ge 1 - \delta_n$

That is, the goal is to characterize the probability $1 - \delta_n$ that the generalization error is less than the empirical error plus some error bound $\epsilon$ (generally dependent on $\delta$ and $n$). For many types of algorithms, it has been shown that an algorithm has generalization bounds if it meets certain stability criteria. Specifically, if an algorithm is symmetric (the order of inputs does not affect the result), has bounded loss and meets two stability conditions, it will generalize. The first stability condition, leave-one-out cross-validation stability, says that to be stable, the prediction error for each data point when leave-one-out cross-validation is used must converge to zero as $n \to \infty$. The second condition, expected-to-leave-one-out error stability (also known as hypothesis stability if operating in the $L_1$ norm) is met if the prediction on a left-out datapoint does not change when a single data point is removed from the training dataset.[4]

These conditions can be formalized as:

Leave-one-out cross-validation Stability

An algorithm $L$ has $CVloo$ stability if for each $n$ there exist a $\beta_{CV}^{(n)}$ and a $\delta_{CV}^{(n)}$ such that:

$\forall i \in \{1, \dots, n\}: \quad \mathbb{P}_S\{\,|V(f_{S^{|i}}, z_i) - V(f_S, z_i)| \le \beta_{CV}^{(n)}\,\} \ge 1 - \delta_{CV}^{(n)},$

and $\beta_{CV}^{(n)}$ and $\delta_{CV}^{(n)}$ go to zero as $n$ goes to infinity.[4] (Here $S^{|i}$ denotes the training set $S$ with the $i$th point removed, $z_i = (\vec{x}_i, y_i)$, and $f_{S^{|i}}$ is the function learned from $S^{|i}$.)

Expected-leave-one-out error Stability

An algorithm $L$ has $Eloo_{err}$ stability if for each $n$ there exist a $\beta_{EL}^{(n)}$ and a $\delta_{EL}^{(n)}$ such that:

$\forall i \in \{1, \dots, n\}: \quad \mathbb{P}_S\left\{\,\left|I[f_S] - \frac{1}{n}\sum_{i=1}^{n} V(f_{S^{|i}}, z_i)\right| \le \beta_{EL}^{(n)}\,\right\} \ge 1 - \delta_{EL}^{(n)},$

with $\beta_{EL}^{(n)}$ and $\delta_{EL}^{(n)}$ going to zero for $n \to \infty$.

For leave-one-out stability in the $L_1$ norm, this is the same as hypothesis stability:

$\mathbb{E}_{S,z}\left[\,|V(f_S, z) - V(f_{S^{|i}}, z)|\,\right] \le \beta_H^{(n)},$

with $\beta_H^{(n)}$ going to zero as $n$ goes to infinity.[4]

Algorithms with proven stability

A number of algorithms have been proven to be stable and as a result have bounds on their generalization error. A list of these algorithms, and the papers that proved their stability, is available in the literature.

Relation to overfitting
See also: Overfitting

This figure illustrates the relationship between overfitting and the generalization error $I[f_n] - I_n[f_n]$. Data points were generated from the relationship $y = x$ with white noise added to the $y$ values. In the left column, a set of training points is shown in blue. A seventh-order polynomial function was fit to the training data. In the right column, the function is tested on data sampled from the underlying joint probability distribution of $x$ and $y$. In the top row, the function is fit on a sample dataset of 10 data points. In the bottom row, the function is fit on a sample dataset of 100 data points. As we can see, for small sample sizes and complex functions, the error on the training set is small but the error on the underlying distribution of data is large, and we have overfit the data. As a result, the generalization error is large. As the number of sample points increases, the prediction error on training and test data converges, and the generalization error goes to 0.
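The experiment described above is easy to reproduce. The following sketch (an independent reconstruction, not the original figure's code) fits a seventh-order polynomial to noisy samples of y = x and compares the training error with the error on a fresh sample, so the gap approximates the generalization error:

# Sketch: training error vs. error on fresh data for a 7th-order polynomial fit.
import numpy as np

rng = np.random.default_rng(0)

def experiment(n_train, degree=7, noise=0.5, n_test=10000):
    # data generated from y = x with white noise added
    x_train = rng.uniform(-1, 1, n_train)
    y_train = x_train + rng.normal(0, noise, n_train)
    coeffs = np.polyfit(x_train, y_train, degree)        # fit the polynomial
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    # a fresh sample from the same joint distribution approximates the expected error
    x_test = rng.uniform(-1, 1, n_test)
    y_test = x_test + rng.normal(0, noise, n_test)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for n in (10, 100, 1000):
    tr, te = experiment(n)
    print(f"n={n}: train error={tr:.3f}, test error={te:.3f}, gap={te - tr:.3f}")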

The concepts of generalization error and overfitting are closely related. Overfitting occurs when the learned function $f_S$ becomes sensitive to the noise in the sample. As a result, the function will perform well on the training set but not perform well on other data from the joint probability distribution of $\vec{x}$ and $y$. Thus, the more overfitting occurs, the larger the generalization error.

The amount of overfitting can be tested using cross-validation methods, which split the sample into simulated training samples and testing samples. The model is then trained on a training sample and evaluated on the testing sample. The testing sample is previously unseen by the algorithm and so represents a random sample from the joint probability distribution of $\vec{x}$ and $y$. This test sample allows us to approximate the expected error and as a result approximate a particular form of the generalization error.
Many algorithms exist to prevent overfitting. The minimization algorithm can penalize more
complex functions (known as Tikhonov regularization), or the hypothesis space can be
constrained, either explicitly in the form of the functions or by adding constraints to the
minimization function (Ivanov regularization).
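For instance, ridge regression is a familiar example of Tikhonov regularization: it adds a squared-norm penalty on the coefficients to the minimization, discouraging overly complex functions. A minimal sketch using scikit-learn (the data and settings here are illustrative only):

# Sketch: L2 (Tikhonov) regularization via ridge regression on polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (20, 1))
y = X.ravel() + rng.normal(0, 0.5, 20)   # y = x plus noise

unregularized = make_pipeline(PolynomialFeatures(7), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(7), Ridge(alpha=1.0))  # alpha controls the penalty

unregularized.fit(X, y)
regularized.fit(X, y)

X_new = rng.uniform(-1, 1, (1000, 1))
y_new = X_new.ravel() + rng.normal(0, 0.5, 1000)
for name, model in [("no penalty", unregularized), ("ridge", regularized)]:
    err = np.mean((model.predict(X_new) - y_new) ** 2)
    print(name, "held-out MSE:", round(err, 3))

The penalized model typically achieves a lower held-out error because the penalty keeps the high-order coefficients small.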
The approach to finding a function that does not overfit is at odds with the goal of finding a function that is sufficiently complex to capture the particular characteristics of the data. This is known as the bias–variance tradeoff. Keeping a function simple to avoid overfitting may introduce a bias in the resulting predictions, while allowing it to be more complex leads to overfitting and a higher variance in the predictions. It is impossible to minimize both simultaneously.

Regression refers to predictive modeling problems that involve predicting a numeric value.

It is different from classification that involves predicting a class label. Unlike classification, you cannot use classification accuracy to evaluate
the predictions made by a regression model.

Instead, you must use error metrics specifically designed for evaluating predictions made on regression problems.

In this tutorial, you will discover how to calculate error metrics for regression predictive modeling projects.
After completing this tutorial, you will know:

● Regression predictive modeling problems are those that involve predicting a numeric value.
● Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
● How to calculate and report mean squared error, root mean squared error, and mean absolute error.
Let’s get started.

Tutorial Overview
This tutorial is divided into three parts; they are:

1. Regression Predictive Modeling


2. Evaluating Regression Models
3. Metrics for Regression
1. Mean Squared Error
2. Root Mean Squared Error
3. Mean Absolute Error
Regression Predictive Modeling
Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the
answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output
variables (y). This is called the problem of function approximation.

The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

For more on approximating functions in applied machine learning, see the post:

● How Machine Learning Algorithms Work


Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable
(y).
Regression is different from classification, which involves predicting a category or class label.

For more on the difference between classification and regression, see the tutorial:

● Difference Between Classification and Regression in Machine Learning


A continuous output variable is a real value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.

For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.

● A regression problem requires the prediction of a quantity.


● A regression can have real-valued or discrete input variables.
● A problem with multiple input variables is often called a multivariate regression problem.
● A regression problem where input variables are ordered by time is called a time series forecasting problem.
Now that we are familiar with regression predictive modeling, let’s look at how we might evaluate a regression model.

Evaluating Regression Models


A common question by beginners to regression predictive modeling projects is:

How do I calculate accuracy for my regression model?

Accuracy (e.g. classification accuracy) is a measure for classification, not regression.

We cannot calculate accuracy for a regression model.


The skill or performance of a regression model must be reported as an error in those predictions.

This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t want to know if the
model predicted the value exactly (this might be intractably difficult in practice); instead, we want to know how close the predictions were to
the expected values.

Error addresses exactly this and summarizes on average how close predictions were to their expected values.

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:

● Mean Squared Error (MSE).


● Root Mean Squared Error (RMSE).
● Mean Absolute Error (MAE)
There are many other metrics for regression, although these are the most commonly used. You can see the full list of regression metrics
supported by the scikit-learn Python machine learning library here:

● Scikit-Learn API: Regression Metrics.


In the next section, let’s take a closer look at each in turn.

Metrics for Regression


In this section, we will take a closer look at the popular metrics for regression models and how to calculate them for your predictive modeling
project.

Mean Squared Error


Mean Squared Error, or MSE for short, is a popular error metric for regression problems.
It is also an important loss function for algorithms fit or optimized using the least squares framing of a regression problem. Here “least
squares” refers to minimizing the mean squared error between predictions and expected values.
The MSE is calculated as the mean or average of the squared differences between predicted and expected target values in a dataset.

● MSE = 1 / N * sum for i to N (y_i – yhat_i)^2


Where y_i is the i’th expected value in the dataset and yhat_i is the i’th predicted value. The difference between these two values is squared,
which has the effect of removing the sign, resulting in a positive error value.
The squaring also has the effect of inflating or magnifying large errors. That is, the larger the difference between the predicted and expected
values, the larger the resulting squared positive error. This has the effect of “punishing” models more for larger errors when MSE is used as a
loss function. It also has the effect of “punishing” models by inflating the average error score when used as a metric.
We can create a plot to get a feeling for how the change in prediction error impacts the squared error.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1
increments. The squared error between each prediction and expected value is calculated and plotted to show the quadratic increase in
squared error.

...
# calculate error
err = (expected[i] - predicted[i])**2

The complete example is listed below.

# example of increase in mean squared error
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
    # calculate error
    err = (expected[i] - predicted[i])**2
    # store error
    errors.append(err)
    # report error
    print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Mean Squared Error')
pyplot.show()

Running the example first reports the expected value, predicted value, and squared error for each case.

We can see that the error rises quickly, faster than linear (a straight line).

>1.0, 1.0 = 0.000
>1.0, 0.9 = 0.010
>1.0, 0.8 = 0.040
>1.0, 0.7 = 0.090
>1.0, 0.6 = 0.160
>1.0, 0.5 = 0.250
>1.0, 0.4 = 0.360
>1.0, 0.3 = 0.490
>1.0, 0.2 = 0.640
>1.0, 0.1 = 0.810
>1.0, 0.0 = 1.000

A line plot is created showing the curved or super-linear increase in the squared error value as the difference between the expected and
predicted value is increased.

The curve is not a straight line as we might naively assume for an error metric.
Line Plot of the Increase in Squared Error With Predictions


The individual error terms are averaged so that we can report the performance of a model with regard to how much error the model makes
generally when making predictions, rather than specifically for a given example.

The units of the MSE are squared units.

For example, if your target value represents “dollars,” then the MSE will be “squared dollars.” This can be confusing for stakeholders;
therefore, when reporting results, often the root mean squared error is used instead (discussed in the next section).
The mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the
scikit-learn library.
The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.

...
# calculate errors
errors = mean_squared_error(expected, predicted)

The example below gives an example of calculating the mean squared error between a list of contrived expected and predicted values.

# example of calculating the mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted)
# report error
print(errors)

Running the example calculates and prints the mean squared error.

0.35000000000000003

A perfect mean squared error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MSE is relative to your specific dataset.

It is a good idea to first establish a baseline MSE for your dataset using a naive predictive model, such as predicting the mean target value
from the training dataset. A model that achieves an MSE better than the MSE for the naive model has skill.
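A minimal sketch of such a baseline (an illustration added here with made-up target values) predicts the mean of the training targets for every test example and reports the resulting MSE; any useful model should beat this number:

# Sketch: baseline MSE from a naive model that always predicts the training mean.
from sklearn.metrics import mean_squared_error

y_train = [2.1, 3.5, 3.9, 5.0, 4.4]   # illustrative training targets
y_test = [2.8, 4.1, 3.6]              # illustrative test targets
naive_prediction = sum(y_train) / len(y_train)
baseline = mean_squared_error(y_test, [naive_prediction] * len(y_test))
print('Baseline MSE: %.3f' % baseline)

The same baseline is available directly in scikit-learn via DummyRegressor(strategy='mean').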

Root Mean Squared Error


The Root Mean Squared Error, or RMSE, is an extension of the mean squared error.
Importantly, the square root of the error is calculated, which means that the units of the RMSE are the same as the original units of the target
value that is being predicted.

For example, if your target variable has the units “dollars,” then the RMSE error score will also have the unit “dollars” and not “squared
dollars” like the MSE.
As such, it may be common to use MSE loss to train a regression predictive model, and to use RMSE to evaluate and report its performance.

The RMSE can be calculated as follows:



● RMSE = sqrt(1 / N * sum for i to N (y_i – yhat_i)^2)


Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value, and sqrt() is the square root function.
We can restate the RMSE in terms of the MSE as:

● RMSE = sqrt(MSE)
Note that the RMSE cannot be calculated as the average of the square root of the mean squared error values. This is a common error made
by beginners and is an example of Jensen’s inequality.
You may recall that the square root is the inverse of the square operation. MSE uses the square operation to remove the sign of each error
value and to punish large errors. The square root reverses this operation, although it ensures that the result remains positive.
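A quick numeric check (added for illustration) makes the distinction clear: averaging the square roots of the individual squared errors gives the MAE, not the RMSE, and the two values generally differ:

# Sketch: sqrt of the mean squared error is not the mean of the per-sample square roots.
from math import sqrt

expected = [1.0, 1.0, 1.0, 1.0]
predicted = [0.9, 0.7, 0.5, 0.1]
squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
rmse = sqrt(sum(squared_errors) / len(squared_errors))                          # correct: sqrt of the mean
mean_of_roots = sum(sqrt(se) for se in squared_errors) / len(squared_errors)    # this is the MAE
print('RMSE:', rmse, ' mean of roots (MAE):', mean_of_roots)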

The root mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from
the scikit-learn library.
By default, the function calculates the MSE, but we can configure it to calculate the square root of the MSE by setting the “squared”
argument to False.
The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.

...
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)

The example below gives an example of calculating the root mean squared error between a list of contrived expected and predicted values.

# example of calculating the root mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)
# report error
print(errors)

Running the example calculates and prints the root mean squared error.

0.5916079783099616

A perfect RMSE value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good RMSE is relative to your specific dataset.

It is a good idea to first establish a baseline RMSE for your dataset using a naive predictive model, such as predicting the mean target value
from the training dataset. A model that achieves an RMSE better than the RMSE for the naive model has skill.

Mean Absolute Error


Mean Absolute Error, or MAE, is a popular metric because, like RMSE, the units of the error score match the units of the target value that is
being predicted.
Unlike the RMSE, the changes in MAE are linear and therefore intuitive.

That is, MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score. This is due to the square
of the error value. The MAE does not give more or less weight to different types of errors and instead the scores increase linearly with
increases in error.

As its name suggests, the MAE score is calculated as the average of the absolute error values. Absolute or abs() is a mathematical function
that simply makes a number positive. Therefore, the difference between an expected and predicted value may be positive or negative and is
forced to be positive when calculating the MAE.
The MAE can be calculated as follows:

● MAE = 1 / N * sum for i to N abs(y_i – yhat_i)


Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value and abs() is the absolute function.
We can create a plot to get a feeling for how the change in prediction error impacts the MAE.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1
increments. The absolute error between each prediction and expected value is calculated and plotted to show the linear increase in error.

...
# calculate error
err = abs(expected[i] - predicted[i])

The complete example is listed below.

# plot of the increase of mean absolute error with prediction error
from matplotlib import pyplot
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
    # calculate error
    err = abs(expected[i] - predicted[i])
    # store error
    errors.append(err)
    # report error
    print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Mean Absolute Error')
pyplot.show()

Running the example first reports the expected value, predicted value, and absolute error for each case.

We can see that the error rises linearly, which is intuitive and easy to understand.

>1.0, 1.0 = 0.000
>1.0, 0.9 = 0.100
>1.0, 0.8 = 0.200
>1.0, 0.7 = 0.300
>1.0, 0.6 = 0.400
>1.0, 0.5 = 0.500
>1.0, 0.4 = 0.600
>1.0, 0.3 = 0.700
>1.0, 0.2 = 0.800
>1.0, 0.1 = 0.900
>1.0, 0.0 = 1.000

A line plot is created showing the straight line or linear increase in the absolute error value as the difference between the expected and
predicted value is increased.
Line Plot of the Increase in Absolute Error With Predictions


The mean absolute error between your expected and predicted values can be calculated using the mean_absolute_error() function from the
scikit-learn library.
The function takes a one-dimensional array or list of expected values and predicted values and returns the mean absolute error value.

...
# calculate errors
errors = mean_absolute_error(expected, predicted)

The example below gives an example of calculating the mean absolute error between a list of contrived expected and predicted values.

# example of calculating the mean absolute error
from sklearn.metrics import mean_absolute_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_absolute_error(expected, predicted)
# report error
print(errors)

Running the example calculates and prints the mean absolute error.

0.5

A perfect mean absolute error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MAE is relative to your specific dataset.

It is a good idea to first establish a baseline MAE for your dataset using a naive predictive model, such as predicting the mean target value
from the training dataset. A model that achieves a MAE better than the MAE for the naive model has skill.

Summary
In this tutorial, you discovered how to calculate error for regression predictive modeling projects.

Unit..4
ML | Linear Discriminant Analysis
Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function Analysis is a dimensionality
reduction technique that is commonly used for supervised classification problems. It is used for modelling differences in
groups i.e. separating two or more classes. It is used to project the features in higher dimension space into a lower
dimension space.

For example, we have two classes and we need to separate them efficiently. Classes can have multiple features. Using only
a single feature to classify them may result in some overlapping as shown in the below figure. So, we will keep on increasing
the number of features for proper classification.

Example:
Suppose we have two sets of data points belonging to two different classes that we want to classify. As shown in the given
2D graph, when the data points are plotted on the 2D plane, there’s no straight line that can separate the two classes of the
data points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a
1D graph in order to maximize the separability between the two classes.

Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data onto a new axis in a
way to maximize the separation of the two categories and hence, reducing the 2D graph into a 1D graph.

Two criteria are used by LDA to create a new axis:

1. Maximize the distance between means of the two classes.


2. Minimize the variation within each class.

In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph such that it maximizes
the distance between the means of the two classes and minimizes the variation within each class. In simple terms, this
newly generated axis increases the separation between the data points of the two classes. After generating this new axis
using the above-mentioned criteria, all the data points of the classes are plotted on this new axis and are shown in the figure
given below.

But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.

Mathematics
Let’s suppose we have two classes and d-dimensional samples x1, x2, …, xn, where:

● n1 samples come from the class (c1) and n2 come from the class (c2).
If xi is a data point, then its projection on the line represented by unit vector v can be written as $v^T x_i$.

Let $\mu_1$ and $\mu_2$ be the means of the samples of class c1 and c2 respectively before projection, and let $\widetilde{\mu_1}$ denote the mean of the samples of class c1 after projection; it can be calculated by:

$\widetilde{\mu_1} = \frac{1}{n_1} \sum_{x_i \in c_1} v^T x_i = v^T \mu_1$

Similarly,

$\widetilde{\mu_2} = v^T \mu_2$

Now, in LDA we need to normalize $|\widetilde{\mu_1} - \widetilde{\mu_2}|$. Let $y_i = v^T x_i$ be the projected samples; then the scatter for the samples of c1 is:

$\widetilde{s_1}^2 = \sum_{y_i \in c_1} (y_i - \widetilde{\mu_1})^2$

Similarly:

$\widetilde{s_2}^2 = \sum_{y_i \in c_2} (y_i - \widetilde{\mu_2})^2$

Now, we need to project our data on the line having direction v which maximizes

$J(v) = \frac{(\widetilde{\mu_1} - \widetilde{\mu_2})^2}{\widetilde{s_1}^2 + \widetilde{s_2}^2}$

For maximizing the above equation we need to find a projection vector that maximizes the difference of means and reduces the scatters of both classes. Now, the scatter matrices s1 and s2 of classes c1 and c2 are:

$s_1 = \sum_{x_i \in c_1} (x_i - \mu_1)(x_i - \mu_1)^T$

and s2

$s_2 = \sum_{x_i \in c_2} (x_i - \mu_2)(x_i - \mu_2)^T$

After simplifying the above equation, we get:

$\widetilde{s_1}^2 + \widetilde{s_2}^2 = v^T (s_1 + s_2)\, v = v^T s_w v$

Now, we define scatter within the classes (sw) and scatter b/w the classes (sb):

$s_w = s_1 + s_2, \qquad s_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$

Now, we try to simplify the numerator part of J(v):

$(\widetilde{\mu_1} - \widetilde{\mu_2})^2 = (v^T \mu_1 - v^T \mu_2)^2 = v^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T v = v^T s_b v,$

so that $J(v) = \frac{v^T s_b v}{v^T s_w v}$.

Now, to maximize the above equation we need to calculate the differentiation with respect to v, which leads to the generalized eigenvalue problem:

$s_w^{-1} s_b\, v = \lambda v$

Here, for the maximum value of J(v) we will use the eigenvector corresponding to the highest eigenvalue. This will provide us the best solution for LDA.
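To make the derivation concrete, the following short NumPy sketch (an illustration written for this note, using randomly generated data) computes the within-class and between-class scatter matrices and recovers the Fisher direction from the leading eigenvector of $s_w^{-1} s_b$:

# Sketch: Fisher's linear discriminant direction from the scatter matrices.
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, (50, 2))     # samples of class c1
X2 = rng.normal([3, 3], 1.0, (50, 2))     # samples of class c2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# within-class scatter: s_w = s_1 + s_2
s_w = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
# between-class scatter: s_b = (mu1 - mu2)(mu1 - mu2)^T
diff = (mu1 - mu2).reshape(-1, 1)
s_b = diff @ diff.T

# the direction v maximizing J(v) is the eigenvector of s_w^{-1} s_b with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(s_w) @ s_b)
v = eigvecs[:, np.argmax(eigvals.real)].real
print("Fisher direction:", v / np.linalg.norm(v))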

Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are
multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually
covariance), moderating the influence of different variables on LDA.
Implementation
● In this implementation, we will perform linear discriminant analysis using the Scikit-learn library on the Iris dataset.

# necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# read dataset from URL
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cls = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=cls)

# divide the dataset into features and target variable
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# preprocess the dataset and divide into train and test
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# apply Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# plot the scatterplot of the two discriminant components
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='rainbow',
            alpha=0.7, edgecolors='b')
plt.show()

# classify using random forest classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# print the accuracy and confusion matrix
print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))
conf_m = confusion_matrix(y_test, y_pred)
print(conf_m)
LDA 2-variable plot (scatterplot of the two discriminant components)

Accuracy : 0.9

[[10 0 0]
[ 0 9 3]
[ 0 0 8]]
Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular application in which each face is
represented by a very large number of pixel values. Linear discriminant analysis (LDA) is used here to reduce the
number of features to a more manageable number before the process of classification. Each of the new dimensions
generated is a linear combination of pixel values, which form a template. The linear combinations obtained using Fisher’s
linear discriminant are called Fisher’s faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient disease state as mild, moderate,
or severe based upon the patient’s various parameters and the medical treatment he is going through. This helps the
doctors to intensify or reduce the pace of their treatment.
3. Customer Identification: Suppose we want to identify the type of customers who are most likely to buy a particular product in a shopping mall. By doing a simple question-and-answer survey, we can gather all the features of the customers. Here, Linear Discriminant Analysis will help us to identify and select the features which can describe the characteristics of the group of customers that are most likely to buy that particular product in the shopping mall.

Topic…
Perceptron in Machine Learning
In Machine Learning and Artificial Intelligence, Perceptron is one of the most commonly used terms. It is the primary step to learning Machine Learning and Deep Learning technologies, and it consists of a set of weights, input values or scores, and a threshold. Perceptron is a building block of an Artificial Neural Network. In 1957, Frank Rosenblatt invented the Perceptron for performing certain calculations to detect input data capabilities or business intelligence. Perceptron is a linear Machine Learning algorithm used for supervised learning of various binary classifiers. This algorithm enables neurons to learn elements and process them one by one during preparation. In this tutorial, "Perceptron in Machine Learning," we will discuss in-depth knowledge of Perceptron and its basic functions in brief. Let's start with the basic introduction of Perceptron.

What is the Perceptron model in Machine Learning?


Perceptron is a Machine Learning algorithm for supervised learning of various binary classification tasks. Further, Perceptron is also understood as an Artificial Neuron or neural network unit that helps to detect certain input data computations in business intelligence.

Perceptron model is also treated as one of the best and simplest types of Artificial Neural networks. However, it is a supervised learning
algorithm of binary classifiers. Hence, we can consider it as a single-layer neural network with four main parameters, i.e., input values,
weights and Bias, net sum, and an activation function.

What is Binary classifier in Machine Learning?


In Machine Learning, binary classifiers are defined as the function that helps in deciding whether input data can be represented as vectors
of numbers and belongs to some specific class.

Binary classifiers can be considered as linear classifiers. In simple words, we can understand it as a classification algorithm that can
predict linear predictor function in terms of weight and feature vectors.

Basic Components of Perceptron


Mr. Frank Rosenblatt invented the perceptron model as a binary classifier which contains three main components. These are as follows:

o Input Nodes or Input Layer:

This is the primary component of Perceptron which accepts the initial data into the system for further processing. Each input node contains
a real numerical value.

o Weight and Bias:

Weight parameter represents the strength of the connection between units. This is another most important parameter of Perceptron
components. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, Bias can be
considered as the line of intercept in a linear equation.

o Activation Function:

These are the final and important components that help to determine whether the neuron will fire or not. Activation Function can be
considered primarily as a step function.

Types of Activation functions:

o Sign function
o Step function, and
o Sigmoid function

The data scientist uses the activation function to take a subjective decision based on various problem statements and form the desired outputs. The choice of activation function (e.g., Sign, Step, or Sigmoid) may differ between perceptron models depending on whether the learning process is slow or suffers from vanishing or exploding gradients.

How does Perceptron work?


In Machine Learning, Perceptron is considered as a single-layer neural network that consists of four main parameters named input values
(Input nodes), weights and Bias, net sum, and an activation function. The perceptron model begins with the multiplication of all input values
and their weights, then adds these values together to create the weighted sum. Then this weighted sum is applied to the activation function
'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is mapped between required values (0,1) or (-1,1). It is
important to note that the weight of input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift the
activation function curve up or down.

Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values with their corresponding weight values and then add the products to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:

∑ wi*xi = x1*w1 + x2*w2 + … + xn*wn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.

∑wi*xi + b

Step-2

In the second step, an activation function is applied with the above-mentioned weighted sum, which gives us output either in binary form
or a continuous value as follows:

Y = f(∑wi*xi + b)
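The two steps above translate directly into code. The sketch below (an illustrative toy implementation written for this note, not taken from the original text) computes the weighted sum plus bias, applies a step activation, and uses the classic perceptron weight-update rule on a small AND-gate dataset:

# Sketch: single-layer perceptron with a step activation, trained on the AND function.

def predict(x, w, b):
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b   # step 1: sum(wi*xi) + b
    return 1 if weighted_sum > 0 else 0                       # step 2: step activation f

# training data for logical AND
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b, lr = [0.0, 0.0], 0.0, 0.1

for epoch in range(10):
    for x, target in data:
        error = target - predict(x, w, b)
        # perceptron learning rule: w <- w + lr * error * x, b <- b + lr * error
        w = [wi + lr * error * xi for wi, xi in zip(w, x)]
        b = b + lr * error

print(w, b, [predict(x, w, b) for x, _ in data])   # expected outputs: [0, 0, 0, 1]

Because AND is linearly separable, the learning rule converges to weights that classify all four inputs correctly, illustrating the statement below that a single-layer perceptron can learn only linearly separable patterns.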

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as follows:

1. Single-layer Perceptron Model


2. Multi-layer Perceptron model

Single Layer Perceptron Model:


This is one of the easiest types of Artificial Neural Networks (ANN). A single-layered perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.

In a single-layer perceptron model, the algorithm does not have previously recorded data, so it begins with randomly allocated initial weight parameters. Further, it sums up all the weighted inputs. If the total sum of all inputs is more than a pre-determined value, the model gets activated and shows the output value as +1.

If the outcome matches the pre-determined or threshold value, the performance of the model is considered satisfactory, and the weights are not changed. However, this model produces some discrepancies when multiple weighted input values are fed into it. Hence, to find the desired output and minimize errors, some changes to the weights are necessary.

"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:


Like a single-layer perceptron model, a multi-layer perceptron model also has the same model structure but has a greater number of
hidden layers.

The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes in two stages as follows:

o Forward Stage: Activation functions start from the input layer in the forward stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. In this stage, the error between the actual and desired output is propagated backward, starting at the output layer and ending at the input layer.

Hence, a multi-layered perceptron model can be considered as multiple artificial neural network layers in which the activation function does not remain linear, unlike in a single-layer perceptron model. Instead of linear, the activation function can be sigmoid, TanH, ReLU, etc., for deployment.

A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns. Further, it can also
implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.

Advantages of Multi-Layer Perceptron:

o A multi-layered perceptron model can be used to solve complex non-linear problems.


o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

o In Multi-layer perceptron, computations are difficult and time-consuming.


o In multi-layer Perceptron, it is difficult to predict how much the dependent variable affects each independent variable.
o The model functioning depends on the quality of the training.

Perceptron Function
Perceptron function ''f(x)'' can be achieved as output by multiplying the input 'x' with the learned weight coefficient 'w'.

Mathematically, we can express it as follows:

f(x)=1; if w.x+b>0

otherwise, f(x)=0

o 'w' represents real-valued weights vector


o 'b' represents the bias
o 'x' represents a vector of input x values.

Characteristics of Perceptron
The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning of binary classifiers.


2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision is made whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the weight function is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between the two linearly separable classes +1 and -1.

6. If the added sum of all input values is more than the threshold value, it must have an output signal; otherwise, no output will be
shown.

Limitations of Perceptron Model


A perceptron model has limitations as follows:

o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of input vectors. If input vectors are non-linear, it is not easy
to classify them properly.

Future of Perceptron
The future of the Perceptron model is much bright and significant as it helps to interpret data by building intuitive patterns and applying
them in the future. Machine learning is a rapidly growing technology of Artificial Intelligence that is continuously evolving and in the
developing phase; hence the future of perceptron technology will continue to support and facilitate analytical behavior in machines that
will, in turn, add to the efficiency of computers.

The perceptron model is continuously becoming more advanced and working efficiently on complex problems with the help of artificial
neurons.

Conclusion:
In this article, you have learned how Perceptron models are the simplest type of artificial neural network which carries input and their
weights, the sum of all weighted input, and an activation function. Perceptron models are continuously contributing to Artificial Intelligence
and Machine Learning, and these models are becoming more advanced. Perceptron enables the computer to work more efficiently on
complex problems using various Machine Learning technologies. The Perceptrons are the fundamentals of artificial neural networks, and
everyone should have in-depth knowledge of perceptron models to study deep neural networks.

Topic…

Support vector machines: The linearly separable case

Figure 15.1: The support vectors are the 5 points right up against the margin of the classifier.

For two-class, separable training data sets, such as the one in Figure 14.8,
there are lots of possible linear separators. Intuitively, a decision boundary drawn in
the middle of the void between data items of the two classes seems better than one
which approaches very close to examples of one or both classes. While some learning
methods such as the perceptron algorithm find just
any linear separator, others, like Naive Bayes, search for the best linear separator
according to some criterion. The SVM in particular defines the criterion to be looking
for a decision surface that is maximally far away from any data point. This distance
from the decision surface to the closest data point determines the margin of the
classifier. This method of construction necessarily means that the decision function
for an SVM is fully specified by a (usually small) subset of the data which defines the
position of the separator. These points are referred to as the support vectors (in a
vector space, a point can be thought of as a vector between the origin and that point).
Figure 15.1 shows the margin and support vectors for a sample problem. Other data
points play no part in determining the decision surface that is chosen.

Figure 15.2: An intuition for large-margin classification. Insisting on a large margin reduces the capacity of the model: the range of angles at which the fat decision surface can be placed is smaller than for a decision hyperplane.

Maximizing the margin seems good because points near the decision surface represent
very uncertain classification decisions: there is almost a 50% chance of the classifier
deciding either way. A classifier with a large margin makes no low certainty
classification decisions. This gives you a classification safety margin: a slight error in
measurement or a slight document variation will not cause a misclassification.
Another intuition motivating SVMs is shown in Figure 15.2 . By construction, an
SVM classifier insists on a large margin around the decision boundary. Compared to a
decision hyperplane, if you have to place a fat separator between classes, you have
fewer choices of where it can be put. As a result of this, the memory capacity of the
model has been decreased, and hence we expect that its ability to correctly generalize
to test data is increased (cf. the discussion of the bias-variance tradeoff in
Chapter 14 , page 14.6 ).
Let us formalize an SVM with algebra. A decision hyperplane can be defined by an intercept term $b$ and a decision hyperplane normal vector $\vec{w}$ which is perpendicular to the hyperplane. This vector is commonly referred to in the machine learning literature as the weight vector. To choose among all the hyperplanes that are perpendicular to the normal vector, we specify the intercept term $b$. Because the hyperplane is perpendicular to the normal vector, all points $\vec{x}$ on the hyperplane satisfy $\vec{w}^T\vec{x} = -b$. Now suppose that we have a set of training data points $D = \{(\vec{x}_i, y_i)\}$, where each member is a pair of a point $\vec{x}_i$ and a class label $y_i$ corresponding to it. For SVMs, the two data classes are always named $+1$ and $-1$ (rather than 1 and 0), and the intercept term is always explicitly represented as $b$ (rather than being folded into the weight vector $\vec{w}$ by adding an extra always-on feature). The math works out much more cleanly if you do things this way, as we will see almost immediately in the definition of functional margin. The linear classifier is then:

$f(\vec{x}) = \mathrm{sign}(\vec{w}^T\vec{x} + b)$    (165)

A value of $-1$ indicates one class, and a value of $+1$ the other class.

We are confident in the classification of a point if it is far away from the decision boundary. For a given data set and decision hyperplane, we define the functional margin of the $i$th example $\vec{x}_i$ with respect to a hyperplane $\langle \vec{w}, b \rangle$ as the quantity $y_i(\vec{w}^T\vec{x}_i + b)$. The functional margin of a data set with respect to a decision surface is then twice the functional margin of any of the points in the data set with minimal functional margin (the factor of 2 comes from measuring across the whole width of the margin, as in Figure 15.3). However, there is a problem with using this definition as is: the value is underconstrained, because we can always make the functional margin as big as we wish by simply scaling up $\vec{w}$ and $b$. For example, if we replace $\vec{w}$ by $5\vec{w}$ and $b$ by $5b$ then the functional margin $y_i(5\vec{w}^T\vec{x}_i + 5b)$ is five times as large. This suggests that we need to place some constraint on the size of the $\vec{w}$ vector. To get a sense of how to do that, let us look at the actual geometry.
Figure 15.3: The geometric margin of a point ($r$) and a decision boundary ($\rho$).

What is the Euclidean distance from a point $\vec{x}$ to the decision boundary? In Figure 15.3, we denote this distance by $r$. We know that the shortest distance between a point and a hyperplane is perpendicular to the plane, and hence, parallel to $\vec{w}$. A unit vector in this direction is $\vec{w}/|\vec{w}|$. The dotted line in the diagram is then a translation of the vector $r\vec{w}/|\vec{w}|$. Let us label the point on the hyperplane closest to $\vec{x}$ as $\vec{x}'$. Then:

$\vec{x}' = \vec{x} - y r \frac{\vec{w}}{|\vec{w}|}$    (166)

where multiplying by $y$ just changes the sign for the two cases of $\vec{x}$ being on either side of the decision surface. Moreover, $\vec{x}'$ lies on the decision boundary and so satisfies $\vec{w}^T\vec{x}' + b = 0$. Hence:

$\vec{w}^T\left(\vec{x} - y r \frac{\vec{w}}{|\vec{w}|}\right) + b = 0$    (167)

Solving for $r$ gives:

$r = y\, \frac{\vec{w}^T\vec{x} + b}{|\vec{w}|}$    (168)

Again, the points closest to the separating hyperplane are support vectors.
The geometric margin of the classifier is the maximum width of the band that can be drawn separating the support vectors of the two classes. That is, it is twice the minimum value over data points of $r$ given in Equation 168, or, equivalently, the maximal width of one of the fat separators shown in Figure 15.2. The geometric margin is clearly invariant to scaling of parameters: if we replace $\vec{w}$ by $5\vec{w}$ and $b$ by $5b$, then the geometric margin is the same, because it is inherently normalized by the length of $\vec{w}$. This means that we can impose any scaling constraint we wish on $\vec{w}$ without affecting the geometric margin. Among other choices, we could use unit vectors, as in Chapter 6, by requiring that $|\vec{w}| = 1$. This would have the effect of making the geometric margin the same as the functional margin.
Since we can scale the functional margin as we please, for convenience in solving large SVMs, let us choose to require that the functional margin of all data points is at least 1 and that it is equal to 1 for at least one data vector. That is, for all items in the data:

$y_i(\vec{w}^T\vec{x}_i + b) \ge 1$    (169)

and there exist support vectors for which the inequality is an equality. Since each example's distance from the hyperplane is $r_i = y_i(\vec{w}^T\vec{x}_i + b)/|\vec{w}|$, the geometric margin is $\rho = 2/|\vec{w}|$. Our desire is still to maximize this geometric margin. That is, we want to find $\vec{w}$ and $b$ such that:

● $\rho = 2/|\vec{w}|$ is maximized

● For all $(\vec{x}_i, y_i) \in D$, $y_i(\vec{w}^T\vec{x}_i + b) \ge 1$

Maximizing $2/|\vec{w}|$ is the same as minimizing $|\vec{w}|/2$. This gives the final standard formulation of an SVM as a minimization problem:

Find $\vec{w}$ and $b$ such that $\frac{1}{2}\vec{w}^T\vec{w}$ is minimized, and for all $\{(\vec{x}_i, y_i)\}$, $y_i(\vec{w}^T\vec{x}_i + b) \ge 1$.

We are now optimizing a quadratic function subject to linear constraints. Quadratic


optimization problems are a standard, well-known class of mathematical optimization
problems, and many algorithms exist for solving them. We could in principle build
our SVM using standard quadratic programming (QP) libraries, but there has been
much recent research in this area aiming to exploit the structure of the kind of QP that
emerges from an SVM. As a result, there are more intricate but much faster and more
scalable libraries available especially for building SVMs, which almost everyone uses
to build models. We will not present the details of such algorithms here.

However, it will be helpful to what follows to understand the shape of the solution of such an optimization problem. The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with each constraint $y_i(\vec{w}^T\vec{x}_i + b) \ge 1$ in the primal problem:

Find $\alpha_1, \dots, \alpha_N$ such that $\sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \vec{x}_i^T\vec{x}_j$ is maximized, subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$ for all $1 \le i \le N$.

The solution is then of the form:

$\vec{w} = \sum_i \alpha_i y_i \vec{x}_i$ and $b = y_k - \vec{w}^T\vec{x}_k$ for any $\vec{x}_k$ such that $\alpha_k \ne 0$.

In the solution, most of the $\alpha_i$ are zero. Each non-zero $\alpha_i$ indicates that the corresponding $\vec{x}_i$ is a support vector. The classification function is then:

$f(\vec{x}) = \mathrm{sign}\left(\sum_i \alpha_i y_i \vec{x}_i^T\vec{x} + b\right)$    (170)

Both the term to be maximized in the dual problem and the classifying function involve a dot product between pairs of points ($\vec{x}_i$ and $\vec{x}_j$, or $\vec{x}_i$ and $\vec{x}$), and that is the only way the data are used - we will return to the significance of this later.

To recap, we start with a training data set. The data set uniquely defines the best separating hyperplane, and we feed the data through a quadratic optimization procedure to find this plane. Given a new point $\vec{x}$ to classify, the classification function $f(\vec{x})$ in either Equation 165 or Equation 170 is computing the projection of the point onto the hyperplane normal. The sign of this function determines the class to assign to the point. If the point is within the margin of the classifier (or another confidence threshold that we might have determined to minimize classification mistakes) then the classifier can return ``don't know'' rather than one of the two classes. The value of $f(\vec{x})$ may also be transformed into a probability of classification; fitting a sigmoid to transform the values is standard (Platt, 2000). Also, since the margin is constant, if the model includes dimensions from various sources, careful rescaling of some dimensions may be required. However, this is not a problem if our documents (points) are on the unit hypersphere.

Figure 15.4: A tiny 3 data point training set for an SVM.

Worked example. Consider building an SVM over the (very little) data set shown in Figure 15.4. Working geometrically, for an example like this, the maximum margin weight vector will be parallel to the shortest line connecting points of the two classes, that is, the line between (1, 1) and (2, 3), giving a weight vector of (1, 2). The optimal decision surface is orthogonal to that line and intersects it at the halfway point. Therefore, it passes through (1.5, 2). So, the SVM decision boundary is:

x1 + 2·x2 − 5.5 = 0   (171)

Working algebraically, with the standard constraint that y_i (w·x_i + b) ≥ 1, we seek to minimize |w|. This happens when this constraint is satisfied with equality by the two support vectors. Further we know that the solution is w = (a, 2a) for some a. So we have that:

a + 2a + b = −1
2a + 6a + b = 1

Therefore, a = 2/5 and b = −11/5. So the optimal hyperplane is given by w = (2/5, 4/5) and b = −11/5.

The margin is ρ = 2/|w| = 2/√(4/25 + 16/25) = 2/(2/√5) = √5 ≈ 2.24. This answer can be confirmed geometrically by examining Figure 15.4.
End worked example.
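As a quick numerical check (not part of the original text), the worked example can be reproduced with scikit-learn's SVC using a linear kernel and a very large C to approximate a hard margin; the expected values in the comments come from the calculation above.

# Rough sketch (assumes scikit-learn is installed): reproduce the worked
# example numerically with a nearly hard-margin linear SVM.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 3.0],   # positive example
              [2.0, 0.0],   # negative example
              [1.0, 1.0]])  # negative example (support vector together with (2, 3))
y = np.array([1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard margin
clf.fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w)                           # expected roughly (0.4, 0.8) = (2/5, 4/5)
print("b =", b)                           # expected roughly -2.2 = -11/5
print("margin =", 2 / np.linalg.norm(w))  # expected roughly 2.24 = sqrt(5)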
Exercises.

● What is the minimum number of support vectors that there can be for a data
set (which contains instances of each class)?
● The basis of being able to use kernels in SVMs (see Section 15.2.3) is that the classification function can be written in the form of Equation 170 (where, for large problems, most α_i are 0). Show explicitly how the classification function could be written in this form for the data set from small-svm-eg. That is, write f(x) as a function in which the data points appear and the only variable is x.
● Install an SVM package such as SVMlight (http://svmlight.joachims.org/),
and build an SVM for the data set discussed in small-svm-eg. Confirm that the
program gives the same solution as the text. For SVMlight, or another package
that accepts the same training data format, the training file would be:

+1 1:2 2:3

-1 1:2 2:0

-1 1:1 2:1

The training command for SVMlight is then:

svm_learn -c 1 -a alphas.dat train.dat model.dat

The -c 1 option is needed to turn off use of the slack variables that we discuss
in Section 15.2.1 . Check that the norm of the weight vector agrees with what
we found in small-svm-eg. Examine the file alphas.dat which contains the α_i values, and check that they agree with your answers in Exercise 15.1.

Topic…….

4.5. LINEAR SOFT MARGIN CLASSIFIER FOR OVERLAPPING CLASSES
The linear hard-margin classifier gives a simple SVM when the samples are linearly separable. In practice, however, the training data are almost always linearly nonseparable because of random errors arising from various sources; for instance, certain instances may be wrongly labeled. The labels could be different even for two…

Topic

Working in a kernel-defined feature space means that we are not able to explicitly represent points. For example, the image of an input point x is φ(x), but we do not have access to the components of this vector, only to the evaluation of inner products between this point and the images of other points.

Topic

Non-Linear Classification
→ Non-Linear Classification refers to categorizing those instances that are not linearly separable.

→ Some of the classifiers that use non-linear functions to separate classes are Quadratic Discriminant Classifier, Multi-Layer
Perceptron (MLP), Decision Trees, Random Forest, and K-Nearest Neighbours (KNN).

→ In the figure above, we have two classes, namely 'O' and 'X.' It is impossible to draw a single straight line that separates the two classes so that each class lies entirely on its own side.

→ We notice that even if we draw a straight line, there would be points of the first class lying among the data points of the second class.

→ In such cases, piece-wise linear or non-linear classification boundaries are required to distinguish the two classes.

Quadratic Discriminant Classifier

→ This technique is similar to LDA (Linear Discriminant Analysis) discussed above.

→ The only difference is that here, we do not assume that all classes share the same covariance matrix; each class has its own mean and covariance estimate.

→ We get the quadratic discriminant function for class k as the following: δ_k(x) = −(1/2) log|Σ_k| − (1/2) (x − μ_k)^T Σ_k^{-1} (x − μ_k) + log π_k, where μ_k, Σ_k and π_k are the mean, covariance matrix and prior probability of class k.

→ Now, let us visualize the decision boundaries of both LDA and QDA on the iris dataset. This would give us a clear picture of the
difference between the two.
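A rough sketch of how such a visualization could be produced with scikit-learn is given below; only the first two iris features are used so that the decision regions can be drawn, and the plotting details are illustrative rather than prescriptive.

# Sketch (assumes scikit-learn and matplotlib): compare LDA and QDA
# decision regions on the first two features of the iris dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # keep two features so the decision regions can be plotted

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
models = [LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()]
titles = ["LDA (linear boundaries)", "QDA (quadratic boundaries)"]
for ax, model, title in zip(axes, models, titles):
    model.fit(X, y)
    Z = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)        # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=15)  # training points
    ax.set_title(title)
plt.show()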

Multi-Layer Perceptron (MLP)

→ This is nothing but a collection of fully connected dense layers. These help transform any given input dimension into the desired
dimension.

→ It is, in essence, simply a neural network.

→ MLP consists of one input layer (one node per input), one output layer (one node per output), and one or more hidden layers (each containing at least one node).

→ In the above diagram, there are three inputs, so the input layer has 3 nodes (one per input).

→ There is one hidden layer consisting of 3 nodes.



→ There is an output layer consisting of 2 nodes, indicating two outputs.

→ Overall, the nodes belonging to the input layer forward their outputs to the nodes present in the hidden layer. Once this is done,
the hidden layer processes the information passed on to it and then further passes it on to the output layer.
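As a small illustration of the architecture just described (three inputs, one hidden layer of three nodes, two output classes), here is a minimal scikit-learn sketch; the toy data and hyperparameters are made up, and note that scikit-learn represents a two-class output with a single sigmoid unit rather than two explicit output nodes.

# Sketch (assumes scikit-learn): an MLP with 3 inputs, one hidden layer of
# 3 nodes and 2 output classes. The toy data is invented for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # 200 samples, 3 input features
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)   # 2 classes from a non-linear rule

mlp = MLPClassifier(hidden_layer_sizes=(3,),  # one hidden layer with 3 nodes
                    activation="relu",
                    max_iter=2000,
                    random_state=0)
mlp.fit(X, y)
print("training accuracy:", mlp.score(X, y))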

Decision Tree

→ It is considered to be one of the most valuable and robust models.

→ Instances are classified by sorting them down from the root to some leaf node.

→ An instance is classified by starting at the tree's root node, testing the attribute specified by this node, then moving down the tree
branch corresponding to the attribute's value, as shown in the above figure.

→ The process is repeated based on each derived subset in a recursive partitioning manner.

→ For a better understanding, see the diagram below.

→ The above decision tree helps determine whether the person is fit or not.

→ Similarly, a Random Forest, which is an ensemble of Decision Trees, is a non-linear classifier too (a short sketch of both follows below).
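A minimal scikit-learn sketch of both classifiers on a made-up "fit or not fit" dataset; the ages, weekly exercise hours and labels are invented purely for illustration.

# Sketch (assumes scikit-learn): a decision tree and a random forest fitted
# to an invented fitness dataset (age, hours of exercise per week).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = np.array([[25, 5], [40, 0], [30, 3], [55, 1], [22, 7], [60, 4]])  # toy data
y = np.array([1, 0, 1, 0, 1, 1])  # 1 = fit, 0 = not fit (illustrative labels)

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(tree.predict([[35, 2]]))    # classify a new person by walking the tree
print(forest.predict([[35, 2]]))  # majority vote over many trees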

K-Nearest Neighbours

→ KNN is a supervised machine learning algorithm used for classification problems. Since it is supervised, it uses labeled data to make predictions.

→ KNN analyzes the 'k' nearest data points and then classifies the new data based on the same.

→ In detail, to label a new point, the KNN algorithm analyzes the ‘k’ nearest neighbors or ‘k’ nearest data points to the new point. It
chooses the label of the new point as the one to which the majority of the ‘k’ nearest neighbors belong to.

→It is essential to choose an appropriate value of ‘K’ to avoid the overfitting of our model.

→ For better understanding, have a look at the diagram below.
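In addition to such a diagram, a minimal scikit-learn sketch with k = 3 makes the idea concrete; the toy coordinates and labels are invented only for illustration.

# Sketch (assumes scikit-learn): KNN labels a new point by the majority
# class among its k nearest labelled neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 2], [6, 6], [7, 6], [6, 7]])  # toy points
y = np.array([0, 0, 0, 1, 1, 1])                                # two classes

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3
knn.fit(X, y)
print(knn.predict([[2, 1]]))   # nearest neighbours are class 0
print(knn.predict([[6, 5]]))   # nearest neighbours are class 1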

Topic…..

Support Vector Regression Tutorial for Machine Learning

Alakh Sethi — Published On March 27, 2020 and Last Modified On April 1st, 2020


Unlocking a New World with the Support Vector Regression Algorithm


Support Vector Machines (SVM) are popularly and widely used for classification problems in machine learning. I’ve
often relied on this not just in machine learning projects but when I want a quick result in a hackathon.

But SVM for regression analysis? I hadn’t even considered the possibility for a while! And even now when I bring up
“Support Vector Regression” in front of machine learning beginners, I often get a bemused expression. I understand –
most courses and experts don’t even mention Support Vector Regression (SVR) as a machine learning algorithm.

But SVR has its uses as you’ll see in this tutorial. We will first quickly understand what SVM is, before diving into the
world of Support Vector Regression and how to implement it in Python!

Note: You can learn about Support Vector Machines and Regression problems in course format here (it’s free!):

● Support Vector Machine (SVM) in Python and R


● Fundamentals of Regression Analysis

Here’s what we’ll cover in this Support Vector Regression tutorial:

● What is a Support Vector Machine (SVM)?


● Hyperparameters of the Support Vector Machine Algorithm
● Introduction to Support Vector Regression (SVR)
● Implementing Support Vector Regression in Python

What is a Support Vector Machine (SVM)?


So what exactly is Support Vector Machine (SVM)? We’ll start by understanding SVM in simple terms. Let’s say we
have a plot of two label classes as shown in the figure below:

Can you decide what the separating line will be? You might have come up with this:

The line fairly separates the classes. This is what SVM essentially does – simple class separation. Now, what if the data was like this:

Here, we don’t have a simple line separating these two classes. So we’ll extend our dimension and introduce a new
dimension along the z-axis. We can now separate these two classes:

When we transform this line back to the original plane, it maps to the circular boundary as I’ve shown here:

This is exactly what SVM does! It tries to find a line/hyperplane (in multidimensional space) that separates the two classes. Then it classifies a new point according to whether it lies on the positive or negative side of the hyperplane.

Hyperparameters of the Support Vector Machine (SVM) Algorithm


There are a few important parameters of SVM that you should be aware of before proceeding further:

● Kernel: A kernel helps us find a hyperplane in the higher dimensional space without increasing the computational
cost. Usually, the computational cost will increase if the dimension of the data increases. This increase in dimension
is required when we are unable to find a separating hyperplane in a given dimension and are required to move in a
higher dimension:

● Hyperplane: This is basically a separating line between two data classes in SVM. But in Support Vector Regression,
this is the line that will be used to predict the continuous output
● Decision Boundary: A decision boundary can be thought of as a demarcation line (for simplification) on one side of
which lie positive examples and on the other side lie the negative examples. On this very line, the examples may be
classified as either positive or negative. This same concept of SVM will be applied in Support Vector Regression as
well

To understand SVM from scratch, I recommend this tutorial: Understanding Support Vector Machine(SVM)
algorithm from examples.

Introduction to Support Vector Regression (SVR)


Support Vector Regression (SVR) uses the same principle as SVM, but for regression problems. Let’s spend a few
minutes understanding the idea behind SVR.

The Idea Behind Support Vector Regression


The problem of regression is to find a function that approximates mapping from an input domain to real numbers on
the basis of a training sample. So let’s now dive deep and understand how SVR works actually.

Consider these two red lines as the decision boundary and the green line as the hyperplane. Our objective, when
we are moving on with SVR, is to basically consider the points that are within the decision boundary line. Our
best fit line is the hyperplane that has a maximum number of points.

The first thing that we’ll understand is what is the decision boundary (the danger red line above!). Consider these
lines as being at any distance, say ‘a’, from the hyperplane. So, these are the lines that we draw at distance ‘+a’ and
‘-a’ from the hyperplane. This ‘a’ in the text is basically referred to as epsilon.

Assuming that the equation of the hyperplane is as follows:

Y = wx+b (equation of hyperplane)


Then the equations of decision boundary become:
wx + b = +a

wx + b = -a
Thus, any hyperplane that satisfies our SVR should satisfy:

-a < Y − (wx + b) < +a


Our main aim here is to decide a decision boundary at ‘a’ distance from the original hyperplane such that data points
closest to the hyperplane or the support vectors are within that boundary line.

Hence, we are going to take only those points that are within the decision boundary and have the least error rate, or
are within the Margin of Tolerance. This gives us a better fitting model.

Implementing Support Vector Regression (SVR) in Python


Time to put on our coding hats! In this section, we’ll understand the use of Support Vector Regression with the help of
a dataset. Here, we have to predict the salary of an employee given a few independent variables. A classic HR
analytics project!

Step 1: Importing the libraries

Step 2: Reading the dataset

Step 3: Feature Scaling

A real-world dataset contains features that vary in magnitudes, units, and range. I would suggest performing
normalization when the scale of a feature is irrelevant or misleading.

Feature Scaling basically helps to normalize the data within a particular range. Many estimator classes perform feature scaling automatically, but the SVR class does not, so we have to perform feature scaling ourselves in Python.
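A minimal sketch of Steps 1–3 follows; it assumes a CSV file such as Position_Salaries.csv with the position level in the second column and the salary in the last column, and both the file name and the column layout are assumptions made only for illustration.

# Steps 1-3 sketch (assumed file name and column layout): import libraries,
# read the dataset and scale both the feature and the target.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Step 2: read the dataset (the file name is an assumption for illustration)
data = pd.read_csv("Position_Salaries.csv")
X = data.iloc[:, 1:2].values                 # position level, kept 2-D
y = data.iloc[:, -1].values.reshape(-1, 1)   # salary

# Step 3: feature scaling - the SVR class does not scale data internally
sc_X = StandardScaler()
sc_y = StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y)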

Step 4: Fitting SVR to the dataset

The kernel is the most important choice here. There are many types of kernels – linear, polynomial, Gaussian (RBF), etc. – and which one to use depends on the dataset. To learn more about this, read this: Support Vector Machine (SVM) in Python and R
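Continuing from the Step 1–3 sketch above, a sketch of Step 4; the RBF (Gaussian) kernel is used as a common choice, not necessarily the only sensible one.

# Step 4 sketch (continues the previous sketch): fit SVR on the scaled data.
from sklearn.svm import SVR

regressor = SVR(kernel="rbf")             # kernel choice depends on the dataset
regressor.fit(X_scaled, y_scaled.ravel())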

Step 5. Predicting a new result



So, the prediction for a position level of 6.5 will be about 170,370.
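Continuing from the sketches above, Step 5 must scale the input before predicting and invert the scaling of the prediction to get back to salary units.

# Step 5 sketch (continues the previous sketches): predict for level 6.5.
level = np.array([[6.5]])
pred_scaled = regressor.predict(sc_X.transform(level))
y_pred = sc_y.inverse_transform(pred_scaled.reshape(-1, 1))
print(y_pred)   # the article above reports a prediction of about 170,370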

Step 6. Visualizing the SVR results (for higher resolution and smoother curve)

This is what we get as output – the best-fit line that contains the maximum number of points. Quite accurate!
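Continuing from the sketches above, Step 6 can be sketched by evaluating the fitted model on a fine grid of position levels and plotting the curve on the original (unscaled) axes.

# Step 6 sketch (continues the previous sketches): plot the SVR fit.
import matplotlib.pyplot as plt

X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)
y_grid = sc_y.inverse_transform(
    regressor.predict(sc_X.transform(X_grid)).reshape(-1, 1))

plt.scatter(X, y, color="red", label="data")
plt.plot(X_grid, y_grid, color="blue", label="SVR fit")
plt.xlabel("Position level")
plt.ylabel("Salary")
plt.legend()
plt.show()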

End Notes
We can think of Support Vector Regression as the counterpart of SVM for regression problems. SVR acknowledges
the presence of non-linearity in the data and provides a proficient prediction model.

I would love to hear your thoughts and ideas around using SVR for regression analysis. Connect with me in the
comments section below and let’s ideate!


topic..

Cognitive Machine Learning

Zhongzhi Shi
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
DOI: 10.4236/ijis.2019.94007

Abstract
Cognitive machine learning refers to the combination of machine learning and brain cognitive mechanism, specifically, combining
machine learning with mind model CAM. Three research directions are proposed in this paper, that is, emergency of learning,
complementary learning system and evolution of learning.
Keywords
Cognitive Machine Learning, Emergency of Learning, Complementary Learning System, Evolution of Learning, Intelligence
Science, Mind Model CAM
Share and Cite:
Shi, Z. (2019) Cognitive Machine Learning. International Journal of Intelligence Science, 9, 111-121. doi: 10.4236/ijis.2019.94007.

1. Introduction
On July 1, 2005, in commemoration of the 125th anniversary of Science, scientists summed up 125 questions, of which the 94th was “What are the limitations of learning through machines?”, which was interpreted as “Computers can beat the best chess players in the world, and they can grab rich information on the Internet. But abstract reasoning still goes beyond any machine”.
Learning ability is the basic characteristic of human intelligence. From birth, people learn from the objective environment and from their own experience. Human cognitive ability and wisdom are gradually formed, developed and perfected through lifelong learning.
In 1983, Simon gave a definition of learning: learning is a long-term change that a system produces in order to adapt to its environment, such that the system can perform the same or similar work more effectively the next time. Learning is a change taking place in a system; it can be either a permanent improvement in the system's performance or a permanent change in the behavior of an organism. On December 12, 2015, Science magazine published a paper showing human-level concept learning through probabilistic program induction [1]. In a complicated system, the change brought about by learning has many causes; that is to say, there are many forms of learning process in the same system.
AlphaGo is a computer program developed by Google DeepMind in London to play the board game Go. In October 2015, it had
beaten a professional named Fan Hui, the European champion. In March 2016, it had beaten Lee Sedol, who is the strongest Go
player in the world, winning the five-game match by 4 to 1. AlphaGo’s victories are a major milestone in artificial intelligence research.
AlphaGo’s algorithm uses a Monte Carlo tree search to find its moves based on knowledge previously “learned” by machine
learning, specifically by a deep neural network and reinforcement learning [2] .
Learning theory is concerned with the essence of the learning process, its rules and constraints, and with studying the various conditions under which learning occurs in order to explore and explain it. Learning is a kind of process in which individuals produce lasting changes in behavior through training. There are various learning theories. For over 100 years, psychologists have proposed many schools of learning theory, differing in their philosophical foundations, theoretical backgrounds and research methods. These schools mainly include the behavioral school, the cognitive school and the humanism school [3].
In recent years, artificial intelligence has made great progress, mainly focusing on statistics and big data. In order to solve the problem of giving machines the ability of abstract reasoning and letting computers learn to evolve, cognitive machine learning is presented in this paper.
2. What Is Cognitive Machine Learning
Cognitive machine learning refers to the combination of machine learning and brain cognitive mechanisms, specifically, combining the achievements of machine learning we have studied for many years with the mind model CAM [3]. Figure 1 shows cognitive machine learning.
Figure 1. Cognitive machine learning.

Cognitive machine learning mainly studies the following three aspects:
1) The emergency of learning: In the process of human cognition, the first step is to begin to contact with the outside world, which belongs to the stage of perception. The second step is to sort out and transform the materials of comprehensive perception, which belongs to the stage of
concept, judgment and reasoning. We raise perceptual knowledge through visual, auditory and tactile senses to rational knowledge.
After acquiring a lot of perceptual knowledge, a new concept has been formed in the human brain, which is the emergence of
learning.
2) Complementary learning system: how to construct the complementary learning system between short-term memory and
semantic memory?
3) Evolution of Learning: As we all know, after hundreds of thousands of years of evolution, human brain capacity is still changing, and language plays an important role in this. So the evolution of learning means not only adapting to changes in the outside world, but also changing one's own structure. We consider this ability to change its own structure the most important of all.
3. The Emergency of Learning
The emergency of learning is the rise from perceptual awareness to rational knowledge, that is, conceptual learning. Conceptual learning is a learning method as well as a form of critical thinking in which individuals master the ability to categorize and organize data by creating logic-based mental structures. This process requires both knowledge construction and acquisition, because individuals first identify key attributes that would make certain subjects fall in the same category or concept. Knowledge construction is a constructive learning process in which individuals use what is familiar or what they have experienced to understand another subject matter, while knowledge acquisition is a learning process wherein a person acquires knowledge from an acknowledged expert.

The conceptual learning can be divided into two types: first-order concept generation, which is based on the similarity-recognition
process, and high-order concept generation, which is based on the dissimilarity-recognition process. The first-order concept
generation is related to the problem driven phase and high-order concept generation is related to the inner sense driven phase.
So far a lot of learning methods and algorithms have been proposed. Various learning methods from statistics, neural networks, fuzzy logic and deep learning can be applied to conceptual learning and pattern recognition. Here the convolutional generative stochastic model (CGSM) is used as an example to illustrate the principle of the emergency of learning.
Convolutional neural networks were originally based on the neocognitron (a neurocognitive machine) introduced by Fukushima et al. into the computer field [4]; they were later improved by Yann LeCun [5] and successfully applied to image detection and segmentation, object recognition and other fields.
Generally, a convolutional neural network (CNN) consists of one or more convolution layers and a fully connected layer at the top (corresponding to a classical neural network), and also includes non-linear mappings and some local or global pooling layers. The convolutional neural network can exploit the two-dimensional structure of the input data. Compared with other deep feed-forward neural networks, convolutional neural networks need fewer parameters to estimate, which has made them an attractive structure in deep learning. A CNN mainly consists of a feature extractor and a classifier. The feature extractor contains multiple convolutional layers and sub-sampling layers; the classifier consists of one or two fully connected layers.
However, convolutional networks show weaker performance when dealing with data noise (such as local loss or blurring). Generative stochastic networks, by contrast, are strongly robust to data noise (such as local loss, blurring, distortion, etc.) [6]. Because the generative stochastic network has strong noise robustness and can adopt a flexible framework and noise form, while the convolutional network has the advantages of multi-layer invariance and spatial local correlation when extracting visual features from images, and conforms to the principle of the biological visual perception channel, we consider introducing the convolutional network model into the generative stochastic network and propose a convolutional generative stochastic model (CGSM), shown in Figure 2.
Figure 2(a) describes a typical convolution generation stochastic network model architecture, which consists of three feature layers.
There is no detailed description of the convolution and pooling process in each feature layer. A hidden layer may contain
convolution and pooling sub-layers (e.g. h1 and h2), only h3 contains convolution sub-layers. Figure 2(b) shows the computational
graph. It is clear that the output of each layer is injected with no more than 50% random noise or Gaussian noise (the lightning symbol) through C(X̃ | X). Then, a local supervised learning method is used to train a function that reconstructs X from the degraded sample X̃ with the highest possible accuracy. In this way, not only can noisy data be fully learned, but the complex task of directly modeling the data-generating distribution P(X) can also be transformed into the more tractable task of learning the reconstruction distribution P(X | X̃). The random elements in each layer are back-propagated, where w′_i denotes the transformation of w_i, and each X_i (i > 0) is sampled from the reconstruction distribution P_θ2(X_i | H_i); the sum of the log-likelihoods of the reconstruction distributions is used as the training objective function of the network.

(a) (b)
Figure 2. A typical CGSM architecture and computational graph.

In the convolutional generative stochastic model, the input x is a three-dimensional array composed of n1 two-dimensional feature maps of size n2 × n3, and the output y is a three-dimensional array composed of m1 two-dimensional feature maps of size m2 × m3. The convolution layer consists of K trainable filters of size l1 × l2. If the two-dimensional convolution is performed in Valid mode, taking the boundary effect into account, then m2 = n2 − l1 + 1 and m3 = n3 − l2 + 1. In the convolution layer, the output y_i is calculated from the input feature map x_i as:

y_{i,k} = σ( x̃_i ∗ w_{i,k} + b_{i,k} )   (1)

Here, the symbol ∗ represents the two-dimensional convolution operation. The reconstructed output x′_i of x_i is then calculated as follows:

x′_i = σ( Σ_k y_{i,k} ∗ w′_{i,k} + b′_{i,k} )   (2)

In the above equation, x̃_i is obtained by degrading x_i through C(x̃_i | x_i), where C(x̃_i | x_i) is injected with an independent random variable z ~ P(Z). In this way, x_i is converted into the random variable x̃_i; σ(·) is an activation function, generally tanh(·) applied point by point, although more complex non-linearities have recently been adopted, for example for natural image processing.
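As a rough numerical illustration of Equation (1) only (not the authors' implementation), the sketch below corrupts one input feature map with Bernoulli dropout noise, convolves it with a made-up filter in 'valid' mode and applies tanh; the array sizes, filter values and 30% noise level are assumptions chosen to mirror the description above.

# Sketch of Equation (1) with invented values: degrade x through C(x~|x),
# convolve in 'valid' mode, add a bias and apply tanh.
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
x = rng.random((28, 28))                 # one n2 x n3 input feature map
w = rng.normal(size=(5, 5))              # one l1 x l2 trainable filter (made up)
b = 0.1

mask = rng.random(x.shape) > 0.3         # drop ~30% of pixels (Bernoulli noise)
x_tilde = x * mask                       # degraded sample x~ drawn from C(x~|x)

y = np.tanh(convolve2d(x_tilde, w, mode="valid") + b)
print(y.shape)                           # (24, 24) = (28 - 5 + 1, 28 - 5 + 1)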
A convolutional generative stochastic model usually consists of multiple convolutional generative stochastic layers stacked together. In order to obtain high-level feature expressions, pooling is used after each convolution layer to reduce the size of the feature maps. Common pooling techniques are Mean-Pooling and Max-Pooling; in this paper, Mean-Pooling is mainly used to realize the Downward Pass operation. Similar to the depth-constrained Boltzmann machine network, a layer-by-layer sampling operation is also used to learn the convolutional generative stochastic network. The computational graph shown in Figure 2(b) describes this process of layer-by-layer reconstruction, including the sampling operation.
Softmax is usually used as the output of the last layer of the model when the convolutional generative stochastic model is used for classification and recognition tasks. There are obvious differences between the training process of this model and that of a traditional convolutional network. Training is mainly divided into two stages: first, a pre-training stage, in which the WalkBack algorithm is used to learn the reconstruction distribution P(X | X̃) by sampling layer by layer, approximating the actual data distribution P(X) in order to obtain better robustness; second, a global fine-tuning stage, in which the back-propagation algorithm optimizes the whole stochastic generative model globally to achieve the target prediction or recognition performance.
In order to verify that the convolutional generative stochastic model has performance advantages over an ordinary convolutional neural network and other perception models in complex noise environments, two experiments were designed: one in a noise-free environment and one in a noisy environment. The noise-free experiment directly recognizes handwritten digits from the mnist0_3 data set, while the noisy experiment applies noise to them, such as partial blurring, missing regions and so on.
From the recognition rate column in Table 1, it can be seen that MLP, CNN, CGSM and SVMs all maintain a high recognition rate when recognizing a single object, with CGSM the highest. For object sequence recognition, CGSM shows a higher recognition rate than MLP, CNN and SVMs, with an object recognition accuracy of 93.86%. In the experiment with noise, the input data of each perception model is converted from the original input X to the noisy random variable X̃ through C(X̃ | X). The noisy dataset in Table 2 is obtained by injecting 30% Bernoulli random variables, and the convolutional feature layers of CGSM are also injected with 30% Bernoulli random variables during the experiment.
4. Complemental Learning System
In recent years, rapid progress has been made in the related fields of artificial intelligence. The benefits to developing artificial
intelligence of closely examining biological intelligence are two-fold [7] . First, neuroscience provides a rich source

Table 1. Recognition rate of noise-free object sequence.

Table 2. Recognition rate of noisy object sequence.


of inspiration for new types of algorithms and architectures, independent of and complementary to the mathematical and logic-based methods and ideas that have largely dominated traditional approaches to AI. Second, neuroscience can provide validation of AI techniques that already exist. If a known algorithm is subsequently found to be implemented in the brain, then that is strong support for its plausibility as an integral component of an overall general intelligence system.
Artificial intelligence has been revolutionized over the past few years by dramatic advances in deep learning methods. As the field
of deep learning evolved out of parallel distributed processing research into a core area within artificial intelligence, it was bolstered
by new ideas, such as the development of deep belief networks and convolutional neural networks.
In 1995, McClelland et al. proposed a complementary learning system in the neocortex and hippocampus [8]. Effective learning in the brain
requires two complementary systems: one, located in the neocortex, serves as the basis for the gradual acquisition of structured
knowledge about the environment, while the other, centered on the hippocampus, allows rapid learning of the specifics of individual
items and experiences. In 2016, Kumaran et al. extended the complementary learning system by showing that recurrent activation of
hippocampal traces can support some forms of generalization and that neocortical learning can be rapid for information that is
consistent with known structure [9] .
Following McClelland’s idea, it is an interesting topic how to develop a complementary learning system in the mind model CAM, one that operates between short-term memory and semantic memory.
According to the temporal length of memory operation, there are three types of human memory: sensory memory, short-term memory and long-term memory. The relationship among the three can be illustrated by Figure 3 [10]. First of all, information from the environment reaches sensory memory. If the information is attended to, it enters short-term memory. It is in short-term memory that the individual restructures the information, uses it and responds to it. In order to analyze the information held in short-term memory, knowledge stored in long-term memory is retrieved. At the same time, information preserved in short-term memory can, if necessary, be transferred into long-term memory through rehearsal. In Figure 3, the arrows indicate the direction in which information flows among the three memory stores.
In Figure 3, rehearsal refers to the psychological process in which an individual repeats the material he has previously memorized
through speech in order to consolidate his memory. It is an effective method of short-term memory information storage, which can
prevent short-term memory information from being disturbed by irrelevant stimuli and forgetting. After repetition, learning
materials are kept in short-term memory and transferred to long-term memory, in particular semantic memory.

Figure 3. Human memory system.

5. Evolution of Learning
Evolution, through which an organism adapts itself to the outside world and changes its structure, is one of the most important mechanisms in the world. Evolution, then learning, then more advanced evolution and the evolution of learning produce goals; this is actually the key point: a random, aimless machine can discover its own goals through learning. Darwin founded the theory of biological evolution in the mid-19th century. Through heredity, variation and natural selection, organisms evolve and develop from low to high, from simple to complex, and from few to many.
For intelligence, the so-called evolution refers to the learning of learning, which differs from the learning of software in that the structure changes along with it. This is very important: the structural changes record the results of learning and improve the learning method. Moreover, storage and operation are integrated, which is difficult for present-day computers to achieve. The study of computational models of learning evolution in this area is probably a new topic, and it deserves great attention.
Studies of fossils of ancient human skulls reveal the development of the human brain, which has tripled in size over the course of two million years of evolution. With the rapid development of human intelligence, many unique cortical centers emerged in this period, such as the motor language center, the writing center, the auditory language center and so on. At the same time, centers for appreciating music and painting also appeared in the cerebral cortex, and these centers have obvious localization characteristics. Especially with the development of human abstract thinking, the frontal lobe of the human brain expanded rapidly. Thus, the modern human brain is still evolving.
In order to make machines have human-level intelligence and break through the limitation of learning through computers, it is
necessary to make machines have the function of learning evolution. Through learning, not only knowledge is increased, but also
the memory structure of the machine is changed.
We consider that without evolution of learning the goal of achieving human-level general intelligence is far from completion. Here
we review the principles underlying the evolution of learning, as the most fundamental to human-level machine learning.
Cognitive structure refers to the organizational form and operation mode of cognitive activities, including a series of operational
processes, such as the interaction of components and components in cognitive activities, namely the mechanism of psychological
activities. Cognitive structure theory focuses on cognitive structure, emphasizing the nature of cognitive structure construction, the
interaction between cognitive structure and learning [11] .
Throughout the theoretical development of cognitive structure, there are Piaget’s schema theory, Gestalt’s insight theory, Tolman’s
cognitive map theory, Bruner’s classification theory, Ausubel’s cognitive assimilation theory and so on. Cognitive structure theory

holds that the cognitive structure existing in human mind is always in the process of change and construction, and the learning
process is the process of continuous change and reorganization of cognitive structure, in which the environment and individual
characteristics of learners are the decisive factors. Piaget used assimilation, adaptation and balance to characterize the mechanism
of cognitive structure construction. He emphasized the importance of the external environment as a whole. He believed that the
rich and good multiple stimulation provided by the environment for learners was the fundamental condition for the improvement
and change of cognitive structure. Modern cognitive psychologist Neisser believes that cognitive process is constructive, which
includes two processes: the process of individual response to external stimuli and the process of learners’ conscious control,
transformation and construction of ideas and images [12] . Cognitive structure is a gradual process of self-construction under the
combination of external stimulation and individual characteristics of learners.
Piaget’s formalization work of intelligence development can be divided into two stages: early structuralism and later post-
structuralism. The former is also called classical theory, and the latter is called the new theoretical stage. Piaget’s new formal
theory basically abandoned the operation structure theory and replaced it with morphism-category theory. The development series
of traditional theory is from perceptual motion schema to representation schema, and from intuitive thinking schema to operational thinking schema. Piaget’s new formal theory has become the development series of intramorphic level, intermorphic level and extramorphic level [13].
The first stage is called intramorphic level. Psychologically, it’s just a simple correspondence, no combination. Common features are based on correct or incorrect observations, especially visible predictions. This is only an empirical comparison, depending on simple
state transitions.
The second stage, called intermorphic level, marks the beginning of systematic combinatorial construction. Intermorphic level
combination construction only occurs locally and gradually and finally does not construct a closed general system.
The last stage is extramorphic level. The main body compares morphisms by means of operation tools. Among them, the arithmetic
tool is precisely to explain and summarize the content of the previous morphism.
Topos is used to describe morphism-category theory. Around 1963, Bill Lawvere decided to figure out new foundations for
mathematics, based on category theory. His idea was to figure out what was so great about sets, strictly from the category-
theoretic point of view. In the spring of 1966 Lawvere encountered the work of Alexander Grothendieck, who had invented a
concept of “Topos” in his work on algebraic geometry. The word “Topos” means “place” in Greek. In algebraic geometry we are
often interested not just in wh

Topic….
Neural Network Models Explained
Artificial neural network models are behind many of the most complex applications of machine learning.
Classification, regression problems, and sentiment analysis are some of the ways artificial neural networks are being leveraged
today. As an emerging field, there are many different types of artificial neural networks. They vary for a variety of reasons, such
as complexity, network architecture, density, and the flow of data. But the different types share a common goal of modelling and
attempting to replicate the behaviour of neurons to improve machine learning.

Artificial neural networks have a wide range of uses in machine learning. Each type of artificial neural network model has
different strengths and use cases. Overall, they are mainly used to solve more complex problems than would be possible with
more traditional machine learning techniques. Examples may include complex natural language processing and machine
learning-powered language translation, which all rely on artificial neural networks. Recurrent neural networks are often utilised for sentiment analysis or translating text too. The depth and scale of the neural architecture mean a non-linear decision-making process can be achieved.

Artificial neural networks are used in the deep learning form of machine learning. It’s called deep learning as models use the
‘deep’, multi-layered architecture of an artificial neural network. As each layer of an artificial neural network can process data,
models can build an abstract understanding of the data. This architecture means models can perform increasingly complex tasks,
for example understanding natural language or categorising complex file types.

Artificial neural networks are already used in machine learning to power:

● Recommendation systems for customers, users and consumers in products like streaming services or e-commerce.
● To power virtual assistance and speech recognition software.
● Complex image, audio and document classification models, for example in facial recognition software.
● In automatic feature extraction from raw, unlabelled data.

There are different types of artificial neural networks which vary in complexity. This guide explores the different types of
artificial neural networks, including what they are and how they’re used.

What are artificial neural networks?


Artificial neural networks are designed to replicate the behaviour of neural networks found in human or animal brains. By
mirroring and modelling the behaviour of neurons, machine learning gains the model architecture to process increasingly
complex data. There are a variety of different types of artificial neural networks, with many early iterations seeming simple in
comparison to emerging techniques. For example, artificial neural networks are used as the architecture for complex deep
learning models.

Artificial neurons or nodes are modelled as a simplified version of neurons found in the brain. Each artificial neuron is connected
to other nodes, though the density and amount of connections differ with each type of artificial neural network. The network is
usually grouped into layers of nodes, which exist between the input and output layer. This multi-layered network architecture is
also known as a deep neural network because of the depth of these layers. These different layers in the artificial neural network
models can learn different features of data. Hidden hierarchical layers allow the understanding of complex concepts or patterns
from processed data.

The structure of artificial neural networks represents a simplified reflection of the complexity of the human or animal brain. A web of interconnected artificial nodes mimics the behaviour of neurons within a nervous system. These artificial neural networks
are much less complex than a human brain, but are still incredibly powerful at performing tasks such as classification. Data starts
in the input layer and leaves from the output layer. But with the more complex artificial neural networks, data will move between
many different layers in a non-linear way.

Complex artificial neural networks are developed so that models can mirror the nonlinear decision-making process of the human
brain. This means models can be trained to make complex decisions or understand abstract concepts and objects. The model will
build from low-level features to complex features, understanding complex concepts. Each node within the network is weighted
depending on its influence on other artificial neural network nodes.

Like other machine learning models, optimisation of artificial neural networks is based on a loss function. This is the difference
between a predicted and actual output. The weighting of each node and layer is adjusted by the model to achieve a minimum loss.
Artificial neural network models can understand multiple levels of data features, and any hierarchical relationship between
features. So when used for a classification problem, an artificial neural network model can understand complex concepts by
processing multiple layers of features.

5 types of neural network models explained


There are many different types of artificial neural networks, varying in complexity. They share the intended goal of mirroring the
function of the human brain to solve complex problems or tasks. The structure of each type of artificial neural network in some
way mirrors neurons and synapses. However, they differ in terms of complexity, use cases, and structure. Differences also
include how artificial neurons are modelled within each type of artificial neural network, and the connections between each node.
Other differences include how the data may flow through the artificial neural network, and the density of the nodes.

5 examples of the different types of artificial neural network include:

● Feedforward artificial neural networks


● Perceptron and Multilayer Perceptron neural networks
● Radial basis function artificial neural networks
● Recurrent neural networks
● Modular neural networks

Feedforward artificial neural networks


As the name suggests, a Feedforward artificial neural network is when data moves in one direction between the input and output
nodes. Data moves forward through layers of nodes, and won’t cycle backwards through the same layers. Although there may be
many different layers with many different nodes, the one-way movement of data makes Feedforward neural networks relatively
simple. Feedforward artificial neural network models are mainly used for simplistic classification problems. Models will perform
beyond the scope of a traditional machine learning model, but don’t meet the level of abstraction found in a deep learning model.

Perceptron and Multilayer Perceptron neural networks


A perceptron is one of the earliest and simplest models of a neuron. A Perceptron model is a binary classifier, separating data into
two different classifications. As a linear model it is one of the simplest examples of a type of artificial neural network.
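As a brief sketch of the idea (not tied to any particular library), a perceptron can be trained with the classic error-driven update rule; the toy data below is generated from an invented linear rule purely for illustration.

# Sketch: a single perceptron as a binary linear classifier, trained with the
# classic perceptron update rule on a toy linearly separable problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # labels from an invented linear rule

w = np.zeros(2)
b = 0.0
for _ in range(20):                          # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:    # misclassified point
            w += yi * xi                     # nudge the boundary towards it
            b += yi

predictions = np.sign(X @ w + b)
print("training accuracy:", np.mean(predictions == y))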

Multilayer Perceptron artificial neural networks add complexity and density, with the capacity for many hidden layers between
the input and output layer. Each individual node on a specific layer is connected to every node on the next layer. This means
Multilayer Perceptron models are fully connected networks, and can be leveraged for deep learning.

They’re used for more complex problems and tasks such as complex classification or voice recognition. Because of the model’s
depth and complexity, processing and model maintenance can be resource and time-consuming.

Radial basis function artificial neural networks


Radial basis function neural networks usually have an input layer, a layer with radial basis function nodes with different
parameters, and an output layer. Models can be used to perform classification, regression for time series, and to control systems.
Radial basis functions calculate the absolute value between a centre point and a given point. In the case of classification, a radial
basis function calculates the distance between an input and a learned classification. If the input is closest to a specific tag, it is
classified as such.

A common use for radial basis function neural networks is in system control, such as systems that control power restoration after
a power cut. The artificial neural network can understand the priority order for restoring power, prioritising repairs that serve the greatest number of people or core services.

Recurrent neural networks


Recurrent neural networks are powerful tools when a model is designed to process sequential data. The model will move data
forward and loop it backwards to previous steps in the artificial neural network to best achieve a task and improve predictions.
The layers between the input and output layers are recurrent, in that relevant information is looped back and retained. Memory of
outputs from a layer is looped back to the input where it is held to improve the process for the next input.

The flow of data is similar to Feedforward artificial neural networks, but each node will retain information needed to improve
each step. Because of this, models can better understand the context of an input and refine the prediction of an output. For
example, a predictive text system may use memory of a previous word in a string of words to better predict the outcome of the
next word. A recurrent artificial neural network would be better suited to understand the sentiment behind a whole sentence
compared to more traditional machine learning models.

Recurrent neural networks are also used within sequence to sequence models, which are used for natural language processing.
Two recurrent neural networks are used within these models, which consist of an encoder and a decoder working in tandem. These models are used for reactive chatbots, translating language, or summarising documents.

Modular neural networks


A Modular artificial neural network consists of a series of networks or components that work together (though independently) to
achieve a task. A complex task can therefore be broken down into smaller components. If applied to data processing or the
computing process, the speed of the processing will be increased as smaller components can work in tandem.

Each component network performs a different subtask which, when combined with the others, completes the overall task and output. This type of artificial neural network is beneficial as it can make complex processes more efficient, and can be applied to a range of environments.
