Well-Posed Learning Problem – A computer program is said to learn from experience E with respect to some task T and
some performance measure P if its performance on T, as measured by P, improves with experience E.
A problem can be described as a well-posed learning problem if it has three traits –
● Task
● Performance Measure
● Experience
Some examples that illustrate well-posed learning problems are –
1. To better filter emails as spam or not
● Task – Classifying emails as spam or not
● Performance Measure – The fraction of emails accurately classified as spam or not spam
● Experience – Observing you label emails as spam or not spam
2. A checkers learning problem
● Task – Playing checkers game
● Performance Measure – percent of games won against opponents
● Experience – playing practice games against itself
3. Handwriting Recognition Problem
● Task – recognizing handwritten words within images
● Performance Measure – percent of words accurately classified
● Experience – a database of handwritten words with given classifications
4. A Robot Driving Problem
● Task – driving on public four-lane highways using vision sensors
● Performance Measure – average distance traveled before an error
● Experience – a sequence of images and steering commands recorded while observing a human driver
5. Fruit Prediction Problem
● Task – recognizing and predicting different types of fruits
● Performance Measure – the fraction of fruit varieties correctly predicted
● Experience – training the machine with a large dataset of fruit images
6. Face Recognition Problem
● Task – recognizing different types of faces
● Performance Measure – the fraction of faces correctly recognized
● Experience – training the machine with a large dataset of different face images
7. Automatic Translation of documents
● Task – translating a document from one language into another
● Performance Measure – the fraction of text correctly translated from one language to the other
● Experience – training the machine with a large dataset of documents and their translations in different languages
Topic – Applications of Machine Learning
● Traffic Alerts
● Social Media
● Transportation and Commuting
● Products Recommendations
● Virtual Personal Assistants
● Self Driving Cars
● Dynamic Pricing
● Google Translate
● Online Video Streaming
● Fraud Detection
Topic
In implementing most of the machine learning algorithms, we represent each data point with a feature vector as the input. A
vector is basically an array of numbers, or in physics, an object with magnitude and direction. How do we represent our business
data in terms of a vector?
1. Categorical data
2. Binary data
3. Numerical data
4. Graphical data
The most primitive representation of a feature vector is simply a flat array of numbers, one element per feature.
Numerical Data
Numerical data can be represented as individual elements above (like Tweet GRU, Query GRU), and I am not going to talk too
much about it.
Categorical Data
However, for categorical data, how do we represent them? The first basic way is to use one-hot encoding:
For each type of categorical data, each category has an integer code. For example, each color has a code (0 for red, 1 for
orange etc.) and they will eventually be transformed to the feature vector on the right, with vector length being the total number
of categories found in the data, and the element will be filled with 1 if it is of that category. This allows a natural way of dealing
with missing data (with all elements 0) and multi-category (with multiple non-zeros).
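As a minimal sketch of how this could look in practice (the color values here are made up for illustration and are not from the original example), pandas can produce exactly this kind of one-hot vector:

import pandas as pd

# Three observations of one categorical feature: color
df = pd.DataFrame({"color": ["red", "orange", "red"]})

# One column per category; the row's own category is marked 1/True
one_hot = pd.get_dummies(df["color"])
print(one_hot)

A missing value would simply leave all category columns at 0, matching the behaviour described above.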
In natural language processing, the bag-of-words model is often used to represent free-text data, which is the one-hot encoding
above with words as the categories. It is a good way as long as the order of the words does not matter.
Binary Data
For binary data, it can be easily represented by one element, either 1 or 0.
Unit….1
Graphical Data
Graphical data are best represented in terms of graph Laplacian and adjacency matrix. Refer to a previous blog article for more
information.
Shortcomings
A feature vector can be a concatenation of various features in terms of all these types except graphical data.
However, such a representation that concatenates all the categorical, binary, and numerical fields has a number of shortcomings:
1. Data with different categories are treated as orthogonal, i.e., perfectly dissimilar. This ignores the correlation between
different variables, which is a very big assumption.
2. The data are very sparse, wasting a lot of memory and computing time.
3. It is unknown whether some of the data are irrelevant.
Two common ways to address these issues are rescaling (reweighing elements to adjust the influence of different variables) and embedding.
Rescaling
Rescaling means rescaling all or some of the elements in the vectors. Usually there are two ways:
1. Normalization: normalizing all the categories of one feature so that they sum to 1.
2. Term frequency-inverse document frequency (tf-idf): weighing the elements so that the weights are heavier if the frequency
is higher and it appears in relatively few documents or class labels.
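A brief sketch of this tf-idf weighting with scikit-learn (the toy documents below are illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: three short documents
docs = [
    "machine learning with vectors",
    "vectors and matrices",
    "machine learning with data",
]

# Each document becomes a tf-idf weighted feature vector
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))

Words that appear in every document receive lower weights than words that are distinctive to a single document.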
Embedding
Embedding means condensing a sparse vector into a smaller, dense vector. Many sparse elements disappear and the information is encoded
inside the remaining elements. There is a rich body of work on this; a small sketch using PCA follows the list below.
1. Topic models: finding the topic models (latent Dirichlet allocation (LDA), structural topic models (STM) etc.) and encode
the vectors with topics instead;
2. Global dimensionality reduction algorithms: reducing the dimensions by retaining the principal components of the vectors of
all the data, e.g., principal component analysis (PCA), independent component analysis (ICA), multi-dimensional scaling
(MDS) etc;
3. Local dimensionality reduction algorithms: same as the global, but these are good for finding local patterns, where examples
include t-Distributed Stochastic Neighbor Embedding (tSNE) and Uniform Manifold Approximation and
Projection (UMAP);
4. Representation learned from deep neural networks: embeddings learned from encoding using neural networks, such as
autoencoders, Word2Vec, FastText, BERT etc.
5. Mixture Models: Gaussian mixture models (GMM), Dirichlet multinomial mixture (DMM) etc.
6. Others: Tensor decomposition (Schmidt decomposition, Jennrich algorithm etc.), GloVe etc.
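As promised above, here is a minimal sketch of the dimensionality-reduction idea using PCA (the random data and the choice of 5 components are arbitrary, for illustration only):

import numpy as np
from sklearn.decomposition import PCA

# 100 samples with 50 mostly-sparse features
rng = np.random.default_rng(0)
X = rng.random((100, 50)) * (rng.random((100, 50)) > 0.8)

# Condense each 50-dimensional vector into 5 dense components
pca = PCA(n_components=5)
X_embedded = pca.fit_transform(X)

print(X.shape, "->", X_embedded.shape)  # (100, 50) -> (100, 5)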
Sparse Coding
Sparse coding is good for finding basis vectors for dense vectors.
Topic..
Structured data — typically categorized as quantitative data — is highly organized and easily decipherable by machine learning
algorithms. Developed by IBM in 1974, structured query language (SQL) is the programming language used to manage structured
data. By using a relational (SQL) database, business users can quickly input, search and manipulate structured data.
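As an illustrative sketch (the table and column names are made up for this example), Python's built-in sqlite3 module shows how structured data fits a rigid schema and can be queried with SQL:

import sqlite3

# Create an in-memory relational database with a fixed schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id INTEGER PRIMARY KEY, city TEXT, price REAL)")
conn.executemany(
    "INSERT INTO bookings (city, price) VALUES (?, ?)",
    [("Paris", 120.0), ("Tokyo", 200.0), ("Paris", 95.5)],
)

# Business users can query and aggregate structured data directly
for row in conn.execute("SELECT city, AVG(price) FROM bookings GROUP BY city"):
    print(row)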
Pros
● Easily used by machine learning (ML) algorithms: The specific and organized architecture of
structured data eases manipulation and querying of ML data.
● Easily used by business users: Structured data does not require an in-depth understanding of different
types of data and how they function. With a basic understanding of the topic relative to the data, users can easily access
and interpret the data.
● Accessible by more tools: Since structured data predates unstructured data, there are more tools available
for using and analyzing structured data.
Cons
● Limited usage: Data with a predefined structure can only be used for its intended purpose, which limits its flexibility
and usability.
● Limited storage options: Structured data is generally stored in data storage systems with rigid schemas (e.g.,
“data warehouses”). Therefore, changes in data requirements necessitate an update of all structured data, which leads to
a massive expenditure of time and resources.
Common tools used to manage and analyze structured data include:
● OLAP: Performs high-speed, multidimensional data analysis from unified, centralized data stores.
● SQLite: Implements a self-contained, serverless, zero-configuration, transactional relational database engine.
● MySQL: Embeds data into mass-deployed software, particularly mission-critical, heavy-load production systems.
● PostgreSQL: Supports SQL and JSON querying as well as high-tier programming languages (C/C++, Java, Python, etc.).
Common use cases for structured data include:
● Customer relationship management (CRM): CRM software runs structured data through analytical
tools to create datasets that reveal customer behavior patterns and trends.
● Online booking: Hotel and ticket reservation data (e.g., dates, prices, destinations, etc.) fits the “rows and
columns” format indicative of the pre-defined data model.
● Accounting: Accounting firms or departments use structured data to process and record financial transactions.
Unstructured data, typically categorized as qualitative data, cannot be processed and analyzed via conventional data tools and
methods. Since unstructured data does not have a predefined data model, it is best managed in non-relational (NoSQL) databases.
Another way to manage unstructured data is to use data lakes to preserve it in raw form.
The importance of unstructured data is rapidly increasing. Recent projections indicate that unstructured data is over 80% of all
enterprise data, while 95% of businesses prioritize unstructured data management.
Pros
● Native format: Unstructured data, stored in its native format, remains undefined until needed. Its adaptability
increases file formats in the database, which widens the data pool and enables data scientists to prepare and analyze
only the data they need.
● Fast accumulation rates: Since there is no need to predefine the data, it can be collected quickly and easily.
● Data lake storage: Allows for massive storage and pay-as-you-use pricing, which cuts costs and eases
scalability.
Cons
● Requires expertise: Due to its undefined/non-formatted nature, data science expertise is required to prepare and
analyze unstructured data. This is beneficial to data analysts but alienates unspecialized business users who may not fully
understand specialized data topics or how to utilize their data.
● Specialized tools: Specialized tools are required to manipulate unstructured data, which limits product choices for
data managers.
Common tools used to manage unstructured data include:
● MongoDB: Uses flexible documents to process data for cross-platform applications and services.
● DynamoDB: Delivers single-digit millisecond performance at any scale via built-in security, in-memory caching and
backup and restore.
● Hadoop: Provides distributed processing of large data sets using simple programming models and no formatting
requirements.
● Azure: Enables agile cloud computing for creating and managing apps through Microsoft’s data centers.
Common use cases for unstructured data include:
● Data mining: Enables businesses to use unstructured data to identify consumer behavior, product sentiment, and
purchasing patterns to better accommodate their customer base.
● Predictive data analytics: Alerts businesses to important activity ahead of time so they can properly plan and
adjust accordingly to significant market shifts.
● Chatbots: Perform text analysis to route customer questions to the appropriate answer sources.
What are the key differences between structured and unstructured data?
While structured (quantitative) data gives a “bird's-eye view” of customers, unstructured (qualitative) data provides a deeper
understanding of customer behavior and intent. Let’s explore some of the key areas of difference and their implications:
● Sources: Structured data is sourced from GPS sensors, online forms, network logs, web server logs, OLTP systems,
etc., whereas unstructured data sources include email messages, word-processing documents, PDF files, etc.
● Forms: Structured data consists of numbers and values, whereas unstructured data consists of sensor data, text files,
audio and video files, etc.
● Models: Structured data has a predefined data model and is formatted to a set data structure before being placed in
data storage (e.g., schema-on-write), whereas unstructured data is stored in its native format and not processed until it is
used (e.g., schema-on-read).
● Storage: Structured data is stored in tabular formats (e.g., excel sheets or SQL databases) that require less storage
space. It can be stored in data warehouses, which makes it highly scalable. Unstructured data, on the other hand, is
stored as media files or NoSQL databases, which require more space. It can be stored in data lakes which makes it
difficult to scale.
● Uses: Structured data is used in machine learning (ML) and drives its algorithms, whereas unstructured data is used
in natural language processing (NLP) and text mining.
Semi-structured data uses “metadata” (e.g., tags and semantic markers) to identify specific data characteristics and scale data into
records and preset fields. Metadata ultimately enables semi-structured data to be better cataloged, searched and analyzed than
unstructured data.
● Example of metadata usage: An online article displays a headline, a snippet, a featured image, image alt-
text, slug, etc., which helps differentiate one piece of web content from similar pieces.
● Example of semi-structured data vs. structured data: A tab-delimited file containing customer
data versus a database containing CRM tables.
● Example of semi-structured data vs. unstructured data: A tab-delimited file versus a list of
comments from a customer’s Instagram.
Recent developments in artificial intelligence (AI) and machine learning (ML) are driving the future wave of data, which is enhancing
business intelligence and advancing industrial innovation. In particular, the data formats and models covered in this article are
helping business users to do the following:
● Analyze digital communications for compliance: Pattern recognition and email threading analysis
software that can search email and chat data for potential noncompliance.
● Track high-volume customer conversations in social media: Text analytics and sentiment
analysis that enables monitoring of marketing campaign results and identifying online threats.
● Gain new marketing intelligence: ML analytics tools that can quickly cover massive amounts of data to
help businesses analyze customer behavior.
Furthermore, smart and efficient usage of these data formats and models can bring similar benefits to many other parts of a business.
To better understand data storage options for whatever kind of data best serves you, check out IBM Cloud Databases.
Topic….
Machine learning involves algorithms that automatically improve through data-based experience; it is a way of learning new behaviour from
experience. Machine learning includes the study of algorithms that can automatically extract information from data. Machine
learning utilizes data mining techniques and other learning algorithms to construct models of what is happening behind certain
data so that it can predict future results.
Data Mining and Machine Learning are areas that have influenced each other; although they have many things in common, they
have different ends.
Data Mining is performed on certain data sets by humans to find interesting patterns between the items in the data set. Data Mining uses
techniques created by machine learning for predicting results, while machine learning is the capability of a computer to learn from a
given data set.
Machine learning algorithms take the information that represents the relationship between items in data sets and creates models in order
to predict future results. These models are nothing more than actions that will be taken by the machine to achieve a result.
Machine learning is a technique that creates complex algorithms for large data processing and provides outcomes to its users. It utilizes
complex programs that can learn through experience and make predictions.
The algorithms improve by themselves through the frequent input of training data. The aim of machine learning is to understand information
and build models from data that can be understood and used by humans.
o Unsupervised Learning
o Supervised Learning
Unsupervised learning does not depend on trained data sets to predict the results, but it utilizes direct techniques such as clustering and
association in order to predict the results. Trained data sets are defined as the input for which the output is known.
As the name implies, supervised learning refers to the presence of a supervisor acting as a teacher. Supervised learning is a learning process in
which we teach or train the machine using data that is well labeled, which means that some data is already tagged with the correct answers.
After that, the machine is provided with new sets of data so that the supervised learning algorithm analyzes the training data and produces
accurate results from the labeled data.
2. Data Mining utilizes more data to obtain helpful information, and that specific data will help to predict some future results. For example,
a marketing company may use last year's data to predict this year's sales. Machine learning, in contrast, does not depend as much on raw
data; it relies on algorithms. Many transportation companies, such as OLA and UBER, use machine learning techniques to calculate the ETA
(Estimated Time of Arrival) for rides.
3. Data mining is not capable of self-learning; it follows predefined guidelines and provides the answer to a specific problem. Machine
learning algorithms, by contrast, are self-adjusting: they can alter their rules according to the situation, find the solution to a specific
problem, and resolve it in their own way.
4. The main and most important difference between data mining and machine learning is that data mining cannot work without the involvement
of humans, whereas in machine learning human effort is only involved when the algorithm is defined; after that, the algorithm works everything
out on its own. Once implemented, it can be used indefinitely, which is not possible with data mining.
5. Because machine learning is an automated process, the results it produces tend to be more precise than those of data mining.
6. Data mining utilizes the database, data warehouse server, data mining engine, and pattern assessment techniques to obtain useful
information, whereas machine learning utilizes neural networks, predictive models, and automated algorithms to make the decisions.
Topic…
The term Linear Algebra was initially introduced in the early 18th century to find the unknowns in linear equations and solve them
easily; hence it is an important branch of mathematics that helps in studying data. No one can deny that Linear Algebra is
undoubtedly an important and primary prerequisite for the applications of Machine Learning, and for starting to learn
Machine Learning and data science.
Linear algebra plays a vital role and is a key foundation in machine learning; it enables ML algorithms to run on huge datasets.
The concepts of linear algebra are widely used in developing algorithms in machine learning. Although it is used almost in each concept
of Machine learning, specifically, it can perform the following task:
o Optimization of data.
o Applicable in loss functions, regularisation, covariance matrices, Singular Value Decomposition (SVD), Matrix Operations, and
support vector machine classification.
o Implementation of Linear Regression in Machine Learning.
Besides the above uses, linear algebra is also used in neural networks and the data science field.
Basic mathematics principles and concepts like Linear algebra are the foundation of Machine Learning and Deep Learning systems. To
learn and understand Machine Learning or Data Science, one needs to be familiar with linear algebra and optimization theory. In this
topic, we will explain all the Linear algebra concepts required for machine learning.
Note: Although linear algebra is a must-know part of mathematics for machine learning, you do not need to be intimately familiar with it. It is
not required to be an expert in linear algebra; a good working knowledge of these concepts is more than enough for machine learning.
Below are some benefits of learning Linear Algebra before Machine learning:
Moreover, Linear Algebra helps solve and compute large and complex data sets through specific techniques known as Matrix
Decomposition Techniques. The two most popular matrix decomposition techniques, illustrated in the short sketch after this list, are as follows:
o QR decomposition
o LU decomposition
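A minimal sketch of both decompositions on a small matrix, using NumPy and SciPy (the matrix values are arbitrary examples):

import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

# QR decomposition: A = Q @ R, with Q orthogonal and R upper triangular
Q, R = np.linalg.qr(A)
print(np.allclose(A, Q @ R))  # True

# LU decomposition: A = P @ L @ U, with L lower and U upper triangular
P, L, U = lu(A)
print(np.allclose(A, P @ L @ U))  # True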
Improved Statistics:
Statistics is an important concept to organize and integrate data in Machine Learning. Also, linear Algebra helps to understand the concept
of statistics in a better manner. Advanced statistical topics can be integrated using methods, operations, and notations of linear algebra.
A few supervised learning algorithms that can be created using Linear Algebra are as follows:
o Logistic Regression
o Linear Regression
o Decision Trees
o Support Vector Machines (SVM)
Further, some unsupervised learning algorithms, such as clustering, principal component analysis, and singular value decomposition (SVD), can also be created with the help of linear algebra.
With the help of Linear Algebra concepts, you can also self-customize the various parameters in the live project and understand in-depth
knowledge to deliver the same with more accuracy and precision.
Easy to Learn:
Linear Algebra is an important branch of mathematics that is relatively easy to understand. It is used whenever there is a
requirement for advanced mathematics and its applications.
Operations:
Working with an advanced level of abstractions in vectors and matrices can make concepts clearer, and it can also help in the description,
coding, and even thinking capability. In linear algebra, it is required to learn the basic operations such as addition, multiplication, inversion,
transposing of matrices, vectors, etc.
Matrix Factorization:
One of the most recommended areas of linear algebra is matrix factorization, specifically matrix decomposition methods such as SVD and
QR.
Below are some popular examples of linear algebra applications in machine learning:
o Linear Regression
o Recommender Systems
o One-hot encoding
o Regularization
o Principal Component Analysis
o Images and Photographs
o Singular-Value Decomposition
o Deep Learning
o Latent Semantic Analysis
Each dataset resembles a table-like structure consisting of rows and columns, where each row represents an observation and each
column represents a feature or variable. This dataset is handled as a matrix, which is a key data structure in linear algebra.
Further, when this dataset is divided into input and output for the supervised learning model, it represents a Matrix(X) and Vector(y),
where the vector is also an important concept of linear algebra.
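A small illustrative sketch of this split (the numbers below are made up):

import numpy as np

# A tiny dataset: rows are observations, columns are features plus a target
data = np.array([[5.1, 3.5, 0],
                 [4.9, 3.0, 0],
                 [6.2, 3.4, 1]])

# Input matrix X (feature columns) and output vector y (last column)
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)  # (3, 2) (3,)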
Moreover, different operations on images, such as cropping, scaling, resizing, etc., are performed using notations and operations of Linear
Algebra.
In the one-hot encoding technique, a table is created that shows a variable with one column for each category and one row for each
example in the dataset. Further, each row is encoded as a binary vector, which contains either zero or one value. This is an example of
sparse representation, which is a subfield of Linear Algebra.
4. Linear Regression
Linear regression is a popular machine learning technique borrowed from statistics. It describes the relationship between input and
output variables and is used in machine learning to predict numerical values. The most common way to solve linear regression problems
is Least Squares Optimization, which is computed with the help of matrix factorization methods. Commonly used matrix factorization
methods include LU decomposition and Singular-Value Decomposition, both of which are concepts of linear algebra.
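As a minimal sketch (the data values are arbitrary), the least-squares solution can be obtained directly from the design matrix with NumPy, which factorizes the matrix internally:

import numpy as np

# Design matrix X (with a bias column of ones) and targets y
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.1, 3.9, 6.2])

# Solve the least-squares problem min ||X @ w - y||^2
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [intercept, slope]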
5. Regularization
In machine learning, we usually look for the simplest possible model that achieves the best outcome for the specific problem. Simpler
models generalize well from specific examples to unknown data. These simpler models are often the ones with
smaller coefficient values.
A technique used to minimize the size of coefficients of a model while it is being fit on data is known as regularization. Common
regularization techniques are L1 and L2 regularization. Both of these forms of regularization are, in fact, a measure of the magnitude or
length of the coefficients as a vector and are methods lifted directly from linear algebra called the vector norm.
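A short sketch of the two norms behind L1 and L2 regularization (the coefficient vector is an arbitrary example):

import numpy as np

# A vector of model coefficients
w = np.array([0.5, -1.2, 3.0, 0.0])

# L1 norm (sum of absolute values) drives the lasso penalty
l1 = np.linalg.norm(w, ord=1)
# L2 norm (Euclidean length) drives the ridge penalty
l2 = np.linalg.norm(w, ord=2)

print(l1, l2)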
7. Singular-Value Decomposition
Singular-Value decomposition is also one of the popular dimensionality reduction techniques and is also written as SVD in short form.
It is the matrix-factorization method of linear algebra, and it is widely used in different applications such as feature selection, visualization,
noise reduction, and many more.
NLP represents a text document as large matrices with the occurrence of words. For example, the matrix column may contain the known
vocabulary words, and rows may contain sentences, paragraphs, pages, etc., with cells in the matrix marked as the count or frequency
of the number of times the word occurred. It is a sparse matrix representation of text. Documents processed in this way are much easier
to compare, query, and use as the basis for a supervised machine learning model.
This form of data preparation is called Latent Semantic Analysis, or LSA for short, and is also known by the name Latent Semantic
Indexing or LSI.
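A brief sketch of LSA with scikit-learn (the toy documents are illustrative): a sparse term-document matrix of word counts is built and then reduced with a truncated SVD.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "linear algebra for machine learning",
    "machine learning with large datasets",
    "matrix factorization and linear algebra",
]

# Sparse document-term matrix of word counts
counts = CountVectorizer().fit_transform(docs)

# Truncated SVD condenses each document into 2 latent dimensions
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(counts)
print(doc_topics.shape)  # (3, 2)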
9. Recommender System
A recommender system is a sub-field of machine learning, a predictive modelling problem that provides recommendations of products.
For example, online recommendation of books based on the customer's previous purchase history, recommendation of movies and TV
series, as we see in Amazon & Netflix.
The development of recommender systems is mainly based on linear algebra methods. We can understand it as an example of calculating
the similarity between sparse customer behaviour vectors using distance measures such as Euclidean distance or dot products.
Different matrix factorization methods such as singular-value decomposition are used in recommender systems to query, search, and
compare user data.
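A small sketch of the similarity idea (the sparse behaviour vectors below are made-up counts of items viewed by two users):

import numpy as np

# Sparse customer-behaviour vectors over the same item catalogue
user_a = np.array([3, 0, 0, 1, 0, 2])
user_b = np.array([1, 0, 0, 0, 0, 4])

# Dot product and cosine similarity, both core linear algebra operations
dot = user_a @ user_b
cosine = dot / (np.linalg.norm(user_a) * np.linalg.norm(user_b))

print(dot, round(cosine, 3))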
Deep learning is the study of artificial neural networks, which exploit newer and faster hardware for the training and development of larger
networks on huge datasets. Deep learning methods achieve great results on challenging tasks such as machine translation,
speech recognition, etc. The core processing of neural networks is based on linear algebra data structures, which are multiplied and
added together. Deep learning algorithms also work with vectors, matrices, and tensors (matrices with more than two dimensions) of inputs
and coefficients across multiple dimensions.
Conclusion
In this topic, we have discussed Linear algebra, its role and its importance in machine learning. For each machine learning enthusiast, it
is very important to learn the basic concepts of linear algebra to understand the working of ML algorithms and choose the best algorithm
for a specific problem.
Unit..2
Machine learning is a large field of study that overlaps with and inherits ideas from many related fields such as artificial intelligence.
The focus of the field is learning, that is, acquiring skills or knowledge from experience. Most commonly, this means synthesizing useful
concepts from historical data.
As such, there are many different types of learning that you may encounter as a practitioner in the field of machine learning: from whole
fields of study to specific techniques.
In this post, you will discover a gentle introduction to the different types of learning that you may encounter in the field of machine learning.
Types of Learning
Given that the focus of the field of machine learning is “learning,” there are many types that you may encounter as a practitioner.
Some types of learning describe whole subfields of study comprised of many different types of algorithms such as “supervised learning.”
Others describe powerful techniques that you can use on your projects, such as “transfer learning.”
There are perhaps 14 types of learning that you must be familiar with as a machine learning practitioner; they are:
Learning Problems
● 1. Supervised Learning
● 2. Unsupervised Learning
● 3. Reinforcement Learning
Hybrid Learning Problems
● 4. Semi-Supervised Learning
● 5. Self-Supervised Learning
● 6. Multi-Instance Learning
Statistical Inference
● 7. Inductive Learning
● 8. Deductive Inference
● 9. Transductive Learning
Learning Techniques
● 10. Multi-Task Learning
● 11. Active Learning
● 12. Online Learning
● 13. Transfer Learning
● 14. Ensemble Learning
In the following sections, we will take a closer look at each in turn.
Topic
Bias is a phenomenon that skews the result of an algorithm in favor or against an idea.
Bias is considered a systematic error that occurs in the machine learning model itself due to incorrect assumptions in the ML
process.
Technically, we can define bias as the error between average model prediction and the ground truth. Moreover, it describes how
well the model matches the training data set:
● A model with a higher bias would not match the data set closely.
● A low bias model will closely match the training data set.
Simply stated, variance is the variability in the model prediction—how much the ML function can adjust depending on the given
data set. Variance comes from highly complex models with a large number of features.
All these contribute to the flexibility of the model. For instance, a high-bias model that does not match the data set well is an
inflexible model with low variance, which results in a suboptimal machine learning model.
● Underfitting occurs when the model is unable to match the input data to the target data. This happens when the model is
not complex enough to match all the available data and performs poorly with the training dataset.
● Overfitting relates to instances where the model tries to match non-existent data. This occurs when dealing with highly
complex models where the model will match almost all the given data points and perform well in training datasets.
However, the model would not be able to generalize the data point in the test data set to predict the outcome accurately.
When a data engineer modifies the ML algorithm to better fit a given data set, it will lead to low bias—but it will increase
variance. This way, the model will fit with the data set while increasing the chances of inaccurate predictions.
The same applies when creating a low variance model with a higher bias. While it will reduce the risk of inaccurate predictions,
the model will not properly match the data set.
It's a delicate balance between bias and variance. Importantly, however, having a higher variance does not indicate a bad
ML algorithm. Machine learning algorithms should be able to handle some variance.
One approach is increasing the complexity of the model to account for bias and variance, thus decreasing the overall bias while increasing the
variance to an acceptable level. This aligns the model with the training dataset without incurring significant variance errors.
Increasing the training data set can also help to balance this trade-off, to some extent. This is the preferred method when
dealing with overfitting models. Furthermore, a larger data set allows users to increase the complexity without variance errors polluting the
model.
A large data set offers more data points for the algorithm to generalize from easily. However, the major issue with increasing the
training data set is that underfitting or high-bias models are not that sensitive to the training data set. Therefore, increasing data is
the preferred solution when it comes to dealing with high-variance models.
This table lists common algorithms and their expected behavior regarding bias and variance:

Algorithm            Bias        Variance
Linear Regression    High Bias   Less Variance
from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Get the data set and split it
X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, shuffle=True, stratify=y)

# Define Algorithm
tree = DecisionTreeClassifier(random_state=123)

# Decompose the average 0-1 loss into bias and variance
avg_loss, avg_bias, avg_var = bias_variance_decomp(
    tree, X_train, y_train, X_test, y_test,
    loss='0-1_loss', random_seed=123, num_rounds=1000)

print(avg_loss, avg_bias, avg_var)
Result:
Bagging example
from mlxtend.evaluate import bias_variance_decomp
from mlxtend.data import iris_data
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Get the data set and split it
X, y = iris_data()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, shuffle=True, stratify=y)

# Define Algorithm: bagging over 100 decision trees
tree = DecisionTreeClassifier(random_state=123)
bag = BaggingClassifier(base_estimator=tree,
                        n_estimators=100,
                        random_state=123)

# Decompose the average 0-1 loss into bias and variance
avg_loss, avg_bias, avg_var = bias_variance_decomp(
    bag, X_train, y_train, X_test, y_test,
    loss='0-1_loss', random_seed=123, num_rounds=1000)

print(avg_loss, avg_bias, avg_var)
Result:
Each of the above functions will run 1,000 rounds (num_rounds=1000) before calculating the average bias and variance values.
Here, we can reduce the variance without affecting bias by using a bagging classifier: averaging many high-variance trees lowers the
overall variance of the ensemble.
In the following example, we will have a look at three different linear regression models (least squares, ridge, and lasso)
using the sklearn library. Since they are all linear regression algorithms, their main difference is how the coefficient values are computed.
We can see that the different algorithms lead to different outcomes in the ML process (bias and variance).
from sklearn import linear_model
import numpy as np

def calculate_bias_variance(xTest, ytest, model):
    # Tiny training set: x = [1, 2, 3], y = [2, 4, 6]
    ar = np.array([[[1], [2], [3]], [[2], [4], [6]]])
    y = ar[1, :]
    x = ar[0, :]
    # Pick the regressor: 1 = least squares, 2 = ridge, 3 = lasso
    if model == 1:
        reg = linear_model.LinearRegression()
    if model == 2:
        reg = linear_model.Ridge(alpha=1.0)
    if model == 3:
        reg = linear_model.Lasso(alpha=1.0)
    reg.fit(x, y)
    preds = reg.predict(xTest)
    # Collect the prediction errors on the test points
    er = []
    for i in range(len(ytest)):
        er.append(preds[i] - ytest[i])
    print('bias:', np.mean(er), 'variance:', np.var(er))

dateset_a = np.array([[4], [5], [6]])
dateset_b = np.array([[8.8], [14], [17]])

# Least-squares coefficients
calculate_bias_variance(dateset_a, dateset_b, 1)
# Ridged coefficients
calculate_bias_variance(dateset_a, dateset_b, 2)
# Lasso coefficients
calculate_bias_variance(dateset_a, dateset_b, 3)
Result:
● Bias creates consistent errors in the ML model; high bias corresponds to a simpler ML model that is not suitable for the specific
requirement.
● On the other hand, variance creates errors that lead to incorrect predictions: the model sees trends or data points that do
not exist.
Users need to consider both these factors when creating an ML model. Generally, your goal is to keep bias as low as possible
while introducing acceptable levels of variances. This can be done either by increasing the complexity or increasing the training
data set.
In this balanced way, you can create an acceptable machine learning model.
Topic.,
Computational learning theory, or statistical learning theory, refers to mathematical frameworks for quantifying learning tasks and algorithms.
These are sub-fields of machine learning that a machine learning practitioner does not need to know in great depth in order to achieve good
results on a wide range of problems. Nevertheless, it is a sub-field where having a high-level understanding of some of the more prominent
methods may provide insight into the broader task of learning from data.
In this post, you will discover a gentle introduction to computational learning theory for machine learning.
● Computational learning theory uses formal methods to study learning tasks and learning algorithms.
● PAC learning provides a way to quantify the computational difficulty of a machine learning task.
● VC Dimension provides a way to quantify the computational capacity of a machine learning algorithm.
Kick-start your project with my new book Probability for Machine Learning, including step-by-step tutorials and the Python source code files
for all examples.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
Computational learning theory may be thought of as an extension or sibling of statistical learning theory, or SLT for short, that uses formal
methods to quantify learning algorithms.
● Computational Learning Theory (CoLT): Formal study of learning tasks.
● Statistical Learning Theory (SLT): Formal study of learning algorithms.
As a machine learning practitioner, it can be useful to know about computational learning theory and some of the main areas of investigation.
The field provides a useful grounding for what we are trying to achieve when fitting models on data, and it may provide insight into the
methods.
There are many subfields of study, although perhaps two of the most widely discussed areas of study from computational learning theory
are:
● PAC Learning.
● VC Dimension.
Tersely, we can say that PAC Learning is the theory of machine learning problems and VC dimension is the theory of machine learning
algorithms.
You may encounter the topics as a practitioner and it is useful to have a thumbnail idea of what they are about. Let’s take a closer look at
each.
If you would like to dive deeper into the field of computational learning theory, I recommend the book:
PAC learning seeks to quantify the difficulty of a learning task and might be considered the premier sub-field of computational learning
theory.
Consider that in supervised learning, we are trying to approximate an unknown underlying mapping function from inputs to outputs. We don’t
know what this mapping function looks like, but we suspect it exists, and we have examples of data produced by the function.
PAC learning is concerned with how much computational effort is required to find a hypothesis (fit model) that is a close match for the
unknown target function.
One way to consider the complexity of a hypothesis space (space of models that could be fit) is based on the number of distinct hypotheses
it contains and perhaps how the space might be navigated. The VC dimension is a clever approach that instead measures the number of
examples from the target problem that can be discriminated by hypotheses in the space.
The VC dimension measures the complexity of the hypothesis space […] by the number of distinct instances from X that can be completely
discriminated using H.
Topic..,
Occam’s razor
Many philosophers throughout history have advocated the idea of parsimony. One of the greatest Greek philosophers,
Aristotle, went as far as to say, “Nature operates in the shortest way possible.” As a consequence, humans
might be biased as well to choose the simplest explanation from a set of all possible explanations with the same descriptive
power. This post gives a brief overview of Occam’s razor, the relevance of the principle, and ends with a note on the usage
of this razor as an inductive bias in machine learning (decision tree learning in particular).
● The decision tree learning algorithms follow a search strategy to search the hypotheses space for the hypothesis that
best fits the training data. For example, the ID3 algorithm uses a simple to complex strategy starting from an empty tree
and adding nodes guided by the information gain heuristic to build a decision tree consistent with the training instances.
The information gain of every attribute (not already included in the tree) is calculated to decide which attribute should be
considered as the next node. Information gain is the essence of the ID3 algorithm. It gives a quantitative measure of the
information that an attribute can provide about the target variable, i.e., assuming only the information of that attribute is
available, how efficiently we can infer the target. It can be defined as:
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
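A minimal sketch of computing entropy and the information gain of one candidate attribute (the label counts below are invented for illustration):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy split: the attribute partitions the examples into two subsets
parent = ["yes", "yes", "no", "no", "no", "yes"]
subsets = [["yes", "yes", "no"], ["no", "no", "yes"]]

# Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v))
gain = entropy(parent) - sum(
    len(s) / len(parent) * entropy(s) for s in subsets)
print(round(gain, 4))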
Generalization error
For supervised learning applications in machine learning and statistical learning theory, generalization error[1] (also known as
the out-of-sample error[2] or the risk) is a measure of how accurately an algorithm is able to predict outcome values for previously
unseen data. Because learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive
to sampling error. As a result, measurements of prediction error on the current data may not provide much information about
predictive ability on new data. Generalization error can be minimized by avoiding overfitting in the learning algorithm. The
performance of a machine learning algorithm is visualized by plots that show values of estimates of the generalization error through
the learning process, which are called learning curves.
Definition
See also: Statistical learning theory
In a learning problem, the goal is to develop a function f_n(x) that predicts output values y for each input datum x. The
subscript n indicates that the function f_n is developed based on a data set of n data points. The generalization
error or expected loss or risk, I[f], of a particular function f over all possible values of x and y is:[3]
I[f] = \int_{X \times Y} V(f(x), y) \, \rho(x, y) \, dx \, dy
where V denotes a loss function and \rho(x, y) is the unknown joint probability distribution for x and y.
Without knowing the joint probability distribution \rho, it is impossible to compute I[f]. Instead, we can compute the error
on sample data, which is called the empirical error (or empirical risk). Given n data points, the empirical error of a
candidate function f is:
I_n[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)
Of particular importance is the generalization error I[f_n] of the data-dependent function f_n that is found by a
learning algorithm based on the sample. Again, for an unknown probability distribution, I[f_n] cannot be computed.
Instead, the aim of many problems in statistical learning theory is to bound or characterize the difference of the
generalization error and the empirical error in probability:
P_G = P( I[f_n] - I_n[f_n] \le \epsilon ) \ge 1 - \delta_n
That is, the goal is to characterize the probability 1 - \delta_n that the generalization error is less than the empirical
error plus some error bound \epsilon (generally dependent on \delta and n). For many types of
algorithms, it has been shown that an algorithm has generalization bounds if it meets certain stability criteria.
Specifically, if an algorithm is symmetric (the order of inputs does not affect the result), has bounded loss and
meets two stability conditions, it will generalize. The first stability condition, leave-one-out cross-
validation stability, says that to be stable, the prediction error for each data point when leave-one-out cross-
validation is used must converge to zero as n \to \infty. The second condition, expected-leave-one-out error
stability (also known as hypothesis stability if operating in the L_1 norm) is met if the prediction on a left-out
datapoint does not change when a single data point is removed from the training dataset.[4]
These conditions can be formalized as:
An algorithm L has CVloo stability if for each n, there exists a \beta_{CV}^{(n)} and a \delta_{CV}^{(n)} such that:
\forall i \in \{1, \dots, n\}: \; \mathbb{P}_S \{ |V(f_{S^{|i}}, z_i) - V(f_S, z_i)| \le \beta_{CV}^{(n)} \} \ge 1 - \delta_{CV}^{(n)}
with \beta_{CV}^{(n)} and \delta_{CV}^{(n)} going to zero as n \to \infty.
An algorithm L has expected-leave-one-out error stability if for each n there exists a \beta_{EL}^{(n)} and a \delta_{EL}^{(n)} such that:
\forall i \in \{1, \dots, n\}: \; \mathbb{P}_S \{ | I[f_S] - \frac{1}{n} \sum_{i=1}^{n} V(f_{S^{|i}}, z_i) | \le \beta_{EL}^{(n)} \} \ge 1 - \delta_{EL}^{(n)}
with \beta_{EL}^{(n)} and \delta_{EL}^{(n)} going to zero as n \to \infty.
For leave-one-out stability in the L_1 norm, this is the same as hypothesis stability:
\mathbb{E}_{S,z} [ |V(f_S, z) - V(f_{S^{|i}}, z)| ] \le \beta_H^{(n)}
with \beta_H^{(n)} going to zero as n \to \infty.
Relation to overfitting
See also: Overfitting
This figure illustrates the relationship between overfitting and the generalization error I[f_n] - I_n[f_n]. Data points were generated
from the relationship y = x with white noise added to the y values. In the left column, a set of training points is shown in blue.
A seventh order polynomial function was fit to the training data. In the right column, the function is tested on data sampled
from the underlying joint probability distribution of x and y. In the top row, the function is fit on a sample dataset of 10
datapoints. In the bottom row, the function is fit on a sample dataset of 100 datapoints. As we can see, for small sample sizes
and complex functions, the error on the training set is small but error on the underlying distribution of data is large and we
have overfit the data. As a result, generalization error is large. As the number of sample points increases, the prediction error
on training and test data converges and generalization error goes to 0.
The concepts of generalization error and overfitting are closely related. Overfitting occurs
when the learned function becomes sensitive to the noise in the sample. As a result,
the function will perform well on the training set but not perform well on other data from the
joint probability distribution of x and y. Thus, the more overfitting occurs, the larger
the generalization error.
The amount of overfitting can be tested using cross-validation methods that split the sample
into simulated training samples and testing samples. The model is then trained on a training
sample and evaluated on the testing sample. The testing sample is previously unseen by the
algorithm and so represents a random sample from the joint probability distribution
of x and y. This test sample allows us to approximate the expected error and as a
result approximate a particular form of the generalization error.
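As a minimal sketch of this idea (the noise level, polynomial degree, and split sizes below are illustrative choices echoing the figure described above), we can compare training and test error of an over-flexible model:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# y = x with white noise added, as in the figure above
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100).reshape(-1, 1)
y = x.ravel() + rng.normal(0, 0.1, 100)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

# A 7th-order polynomial can overfit a small noisy sample
model = make_pipeline(PolynomialFeatures(degree=7), LinearRegression())
model.fit(x_train, y_train)

train_err = mean_squared_error(y_train, model.predict(x_train))
test_err = mean_squared_error(y_test, model.predict(x_test))
print(train_err, test_err)  # the gap approximates the generalization error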
Many algorithms exist to prevent overfitting. The minimization algorithm can penalize more
complex functions (known as Tikhonov regularization), or the hypothesis space can be
constrained, either explicitly in the form of the functions or by adding constraints to the
minimization function (Ivanov regularization).
The approach to finding a function that does not overfit is at odds with the goal of finding a
function that is sufficiently complex to capture the particular characteristics of the data. This is
known as the bias–variance tradeoff. Keeping a function simple to avoid overfitting may
introduce a bias in the resulting predictions, while allowing it to be more complex leads to
overfitting and a higher variance in the predictions.
Regression refers to predictive modeling problems that involve predicting a numeric value.
It is different from classification that involves predicting a class label. Unlike classification, you cannot use classification accuracy to evaluate
the predictions made by a regression model.
Instead, you must use error metrics specifically designed for evaluating predictions made on regression problems.
In this tutorial, you will discover how to calculate error metrics for regression predictive modeling projects.
After completing this tutorial, you will know:
● Regression predictive modeling involves problems that require predicting a numeric value.
● Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
● How to calculate and report mean squared error, root mean squared error, and mean absolute error.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output
variables (y). This is called the problem of function approximation.
The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.
For more on approximating functions in applied machine learning, see the post:
For more on the difference between classification and regression, see the tutorial:
For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.
This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t want to know if the
model predicted the value exactly (this might be intractably difficult in practice); instead, we want to know how close the predictions were to
the expected values.
Error addresses exactly this and summarizes on average how close predictions were to their expected values.
There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:
● Mean Squared Error (MSE)
● Root Mean Squared Error (RMSE)
● Mean Absolute Error (MAE)
The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1
increments. The squared error between each prediction and expected value is calculated and plotted to show the quadratic increase in
squared error.
# calculate and plot the squared error for each prediction
from matplotlib import pyplot
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
    # calculate squared error
    err = (expected[i] - predicted[i])**2
    # store error
    errors.append(err)
    # report error
    print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Squared Error')
pyplot.show()
Running the example first reports the expected value, predicted value, and squared error for each case.
We can see that the error rises quickly, faster than linear (a straight line).
A line plot is created showing the curved or super-linear increase in the squared error value as the difference between the expected and
predicted value is increased.
The curve is not a straight line as we might naively assume for an error metric.
For example, if your target value represents “dollars,” then the MSE will be “squared dollars.” This can be confusing for stakeholders;
therefore, when reporting results, often the root mean squared error is used instead (discussed in the next section).
The mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the
scikit-learn library.
The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.
...
# calculate errors
errors = mean_squared_error(expected, predicted)
The example below gives an example of calculating the mean squared error between a list of contrived expected and predicted values.
# example of calculating the mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted)
# report error
print(errors)
Running the example calculates and prints the mean squared error.
0.35000000000000003
A perfect mean squared error value is 0.0, which means that all predictions matched the expected values exactly.
This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.
It is a good idea to first establish a baseline MSE for your dataset using a naive predictive model, such as predicting the mean target value
from the training dataset. A model that achieves an MSE better than the MSE for the naive model has skill.
For example, if your target variable has the units “dollars,” then the RMSE error score will also have the unit “dollars” and not “squared
dollars” like the MSE.
As such, it may be common to use MSE loss to train a regression predictive model, and to use RMSE to evaluate and report its performance.
● RMSE = sqrt(MSE)
Note that the RMSE cannot be calculated as the average of the square root of the mean squared error values. This is a common error made
by beginners and is an example of Jensen’s inequality.
You may recall that the square root is the inverse of the square operation. MSE uses the square operation to remove the sign of each error
value and to punish large errors. The square root reverses this operation, although it ensures that the result remains positive.
The root mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from
the scikit-learn library.
By default, the function calculates the MSE, but we can configure it to calculate the square root of the MSE by setting the “squared”
argument to False.
The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.
...
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)
The example below gives an example of calculating the root mean squared error between a list of contrived expected and predicted values.
# example of calculating the root mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)
# report error
print(errors)
Running the example calculates and prints the root mean squared error.
0.5916079783099616
A perfect RMSE value is 0.0, which means that all predictions matched the expected values exactly.
This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.
It is a good idea to first establish a baseline RMSE for your dataset using a naive predictive model, such as predicting the mean target value
from the training dataset. A model that achieves an RMSE better than the RMSE for the naive model has skill.
That is, MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score. This is due to the square
of the error value. The MAE does not give more or less weight to different types of errors and instead the scores increase linearly with
increases in error.
As its name suggests, the MAE score is calculated as the average of the absolute error values. Absolute or abs() is a mathematical function
that simply makes a number positive. Therefore, the difference between an expected and predicted value may be positive or negative and is
forced to be positive when calculating the MAE.
The MAE can be calculated as follows:
● MAE = 1/N * Σ |expected_i - predicted_i|, averaged over the N examples.
The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1
increments. The absolute error between each prediction and expected value is calculated and plotted to show the linear increase in error.
# calculate and plot the absolute error for each prediction
from matplotlib import pyplot
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
    # calculate absolute error
    err = abs(expected[i] - predicted[i])
    # store error
    errors.append(err)
    # report error
    print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Absolute Error')
pyplot.show()
Running the example first reports the expected value, predicted value, and absolute error for each case.
We can see that the error rises linearly, which is intuitive and easy to understand.
A line plot is created showing the straight line or linear increase in the absolute error value as the difference between the expected and
predicted value is increased.
...
# calculate errors
errors = mean_absolute_error(expected, predicted)
The example below gives an example of calculating the mean absolute error between a list of contrived expected and predicted values.
# example of calculating the mean absolute error
from sklearn.metrics import mean_absolute_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_absolute_error(expected, predicted)
# report error
print(errors)
Running the example calculates and prints the mean absolute error.
0.5
A perfect mean absolute error value is 0.0, which means that all predictions matched the expected values exactly.
This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.
It is a good idea to first establish a baseline MAE for your dataset using a naive predictive model, such as predicting the mean target value
from the training dataset. A model that achieves a MAE better than the MAE for the naive model has skill.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Unit..4
ML | Linear Discriminant Analysis
Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function Analysis is a dimensionality
reduction technique that is commonly used for supervised classification problems. It is used for modelling differences in
groups i.e. separating two or more classes. It is used to project the features in higher dimension space into a lower
dimension space.
For example, we have two classes and we need to separate them efficiently. Classes can have multiple features. Using only
a single feature to classify them may result in some overlapping as shown in the below figure. So, we will keep on increasing
the number of features for proper classification.
Example:
Suppose we have two sets of data points belonging to two different classes that we want to classify. As shown in the given
2D graph, when the data points are plotted on the 2D plane, there’s no straight line that can separate the two classes of the
data points completely. Hence, in this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a
1D graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and projects data onto a new axis in a
way to maximize the separation of the two categories and hence, reducing the 2D graph into a 1D graph.
In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D graph such that it maximizes
the distance between the means of the two classes and minimizes the variation within each class. In simple terms, this
newly generated axis increases the separation between the data points of the two classes. After generating this new axis
using the above-mentioned criteria, all the data points of the classes are plotted on this new axis and are shown in the figure
given below.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to
find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
Mathematics
Suppose we have two classes and d-dimensional samples x_1, x_2, …, x_n, where:
● n_1 samples come from class c_1 and n_2 samples come from class c_2.
If x_i is a data point, then its projection on the line represented by the unit vector v can be written as v^T x_i.
Let \mu_1 and \mu_2 be the means of the samples of classes c_1 and c_2 respectively before projection, and let \widetilde{\mu_1} denote the
mean of the samples of class c_1 after projection. It can be calculated by:
\widetilde{\mu_1} = \frac{1}{n_1} \sum_{x_i \in c_1} v^T x_i = v^T \mu_1
Similarly, \widetilde{\mu_2} = v^T \mu_2.
Now, in LDA we need to normalize |\widetilde{\mu_1} - \widetilde{\mu_2}|. Let y_i = v^{T}x_i be the projected samples; then the
scatter for the samples of c_1 is:
\widetilde{s_1}^2 = \sum_{y_i \in c_1} (y_i - \widetilde{\mu_1})^2
Similarly, \widetilde{s_2}^2 = \sum_{y_i \in c_2} (y_i - \widetilde{\mu_2})^2.
Now, we need to project our data on the line having direction v which maximizes
J(v) = \frac{(\widetilde{\mu_1} - \widetilde{\mu_2})^2}{\widetilde{s_1}^2 + \widetilde{s_2}^2}
To maximize this quantity we need to find a projection vector that maximizes the difference of the means while reducing
the scatter of both classes. The scatter matrices s_1 and s_2 of classes c_1 and c_2 are:
s_1 = \sum_{x_i \in c_1} (x_i - \mu_1)(x_i - \mu_1)^T   and   s_2 = \sum_{x_i \in c_2} (x_i - \mu_2)(x_i - \mu_2)^T
Now, we define the scatter within the classes (s_w) and the scatter between the classes (s_b):
s_w = s_1 + s_2,   s_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T
so that J(v) = \frac{v^T s_b v}{v^T s_w v}.
To maximize this expression we differentiate with respect to v and set the derivative to zero, which leads to the
generalized eigenvalue problem s_w^{-1} s_b v = \lambda v.
Here, for the maximum value of J(v) we will use the eigenvector corresponding to the highest eigenvalue. This provides the
best solution for LDA.
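For two classes, the maximizing direction reduces to v proportional to s_w^{-1}(\mu_1 - \mu_2). The sketch below is not part of the original article; the Gaussian toy data is an assumption, used only to show the computation in NumPy.
# compute the two-class LDA direction for toy 2-D data
import numpy as np

rng = np.random.default_rng(0)
c1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # samples of class c1
c2 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))   # samples of class c2

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)

# within-class scatter s_w = s_1 + s_2
s_w = (c1 - mu1).T @ (c1 - mu1) + (c2 - mu2).T @ (c2 - mu2)

# two-class LDA direction: v proportional to s_w^{-1} (mu1 - mu2)
v = np.linalg.solve(s_w, mu1 - mu2)
v = v / np.linalg.norm(v)

# class means are well separated after projecting onto v
print("direction:", v)
print("projected class means:", (c1 @ v).mean(), (c2 @ v).mean())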
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or covariance when there are
multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of the variance (actually
covariance), moderating the influence of different variables on LDA.
Implementation
● In this implementation, we will perform linear discriminant analysis using the Scikit-learn library on the Iris dataset.
# necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# load the Iris dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cols = ['sepal length', 'sepal width', 'petal length', 'petal width', 'Class']
dataset = pd.read_csv(url, names=cols)
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# scale the features and encode the class labels
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# reduce the data to 2 linear discriminants
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# visualise the training data in the reduced space
plt.scatter(
    X_train[:, 0], X_train[:, 1], c=y_train, cmap='rainbow',
    alpha=0.7, edgecolors='b'
)
plt.show()

# classify in the reduced space with a random forest
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# evaluate the classifier
conf_m = confusion_matrix(y_test, y_pred)
print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))
print(conf_m)
Accuracy : 0.9
[[10 0 0]
[ 0 9 3]
[ 0 0 8]]
Applications:
1. Face Recognition: In the field of Computer Vision, face recognition is a very popular application in which each face is
represented by a very large number of pixel values. Linear discriminant analysis (LDA) is used here to reduce the
number of features to a more manageable number before the process of classification. Each of the new dimensions
generated is a linear combination of pixel values, which form a template. The linear combinations obtained using Fisher’s
linear discriminant are called Fisher’s faces.
2. Medical: In this field, Linear discriminant analysis (LDA) is used to classify the patient disease state as mild, moderate,
or severe based upon the patient’s various parameters and the medical treatment he is going through. This helps the
doctors to intensify or reduce the pace of their treatment.
3. Customer Identification: Suppose we want to identify the type of customers who are most likely to buy a particular
product in a shopping mall. By doing a simple question and answers survey, we can gather all the features of the
customers. Here, a Linear discriminant analysis will help us to identify and select the features which can describe the
characteristics of the group of customers that are most likely to buy that particular product in the shopping mall.
Topic…
Perceptron in Machine Learning
In Machine Learning and Artificial Intelligence, the Perceptron is one of the most commonly used terms. It is a primary step in learning
Machine Learning and Deep Learning technologies, and it consists of a set of weights, input values or scores, and a threshold. The Perceptron
is a building block of an Artificial Neural Network. Frank Rosenblatt invented the Perceptron in the mid-20th century (1957) for
performing certain calculations to detect features in input data. The Perceptron is a linear Machine Learning algorithm
used for supervised learning of various binary classifiers. The algorithm enables neurons to learn elements and process them one by
one during training. In this tutorial, "Perceptron in Machine Learning," we will discuss Perceptron and its basic
functions in brief. Let's start with a basic introduction to the Perceptron.
Perceptron model is also treated as one of the best and simplest types of Artificial Neural networks. However, it is a supervised learning
algorithm of binary classifiers. Hence, we can consider it as a single-layer neural network with four main parameters, i.e., input values,
weights and Bias, net sum, and an activation function.
Binary classifiers can be considered as linear classifiers. In simple words, we can understand it as a classification algorithm that can
predict linear predictor function in terms of weight and feature vectors.
o Input Nodes or Input Layer:
This is the primary component of the Perceptron, which accepts the initial data into the system for further processing. Each input node contains
a real numerical value.
o Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another important parameter of the Perceptron's
components. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, bias can be
considered as the intercept term in a linear equation.
o Activation Function:
These are the final and important components that help to determine whether the neuron will fire or not. Activation Function can be
considered primarily as a step function.
o Sign function
o Step function, and
o Sigmoid function
The data scientist chooses the activation function based on the problem statement and the desired form of the
outputs. The activation function may differ (e.g., sign, step, or sigmoid) between perceptron models, depending on whether the learning process is
slow or suffers from vanishing or exploding gradients.
This step function or Activation function plays a vital role in ensuring that output is mapped between required values (0,1) or (-1,1). It is
important to note that the weight of input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift the
activation function curve up or down.
Step-1
In the first step first, multiply all input values with corresponding weight values and then add them to determine the weighted sum.
Mathematically, we can calculate the weighted sum as follows:
Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum, which gives us output either in binary form
or a continuous value as follows:
Y = f(∑wi*xi + b)
A single-layer perceptron model does not start from recorded data, so it begins with randomly allocated values for the weight
parameters. It then sums up all the weighted inputs. If the total sum is more than a pre-determined threshold value,
the model is activated and shows the output value as +1.
If the outcome matches the expected behaviour at the threshold, the performance of the model is considered satisfactory and the weights
are not changed. However, this model runs into difficulties when multiple weighted input values are fed into it.
Hence, to obtain the desired output and minimize errors, some changes to the input weights are necessary.
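A minimal sketch of this single-layer update rule is given below. It is not from the original tutorial; the AND-gate data, learning rate and epoch count are assumptions chosen for illustration.
# single-layer perceptron trained on the linearly separable AND problem
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # inputs
y = np.array([0, 0, 0, 1])                       # AND targets

w = np.zeros(2)      # weights start at zero (often they are random)
b = 0.0              # bias term
lr = 0.1             # learning rate

def predict(x):
    # weighted sum plus bias, passed through a step activation
    return 1 if np.dot(w, x) + b > 0 else 0

for epoch in range(10):
    for xi, target in zip(X, y):
        error = target - predict(xi)
        # perceptron rule: adjust weights only when the prediction is wrong
        w += lr * error * xi
        b += lr * error

print([predict(xi) for xi in X])   # expected [0, 0, 0, 1]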
The multi-layer perceptron model is trained with the Backpropagation algorithm, which executes in two stages as follows:
o Forward Stage: Activation functions are evaluated from the input layer forward, terminating at the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. The
error between the actual and desired output is propagated backward from the output layer towards the input layer.
Hence, a multi-layer perceptron model can be considered as an artificial neural network with multiple layers, in which the activation
function does not have to remain linear as in a single-layer perceptron model. Instead of a linear function, the activation can be sigmoid,
TanH, ReLU, etc., for deployment.
A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns. Further, it can also
implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.
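As a small illustration of that last point, the sketch below is not from the original tutorial; the weights and thresholds are hand-picked for this example. It wires up a two-layer perceptron that computes XOR, something a single perceptron cannot represent.
# a hand-wired two-layer perceptron implementing XOR (hidden units: OR and NAND)
import numpy as np

def step(z):
    # hard-limit activation
    return 1 if z > 0 else 0

def xor(a, b):
    x = np.array([a, b])
    h1 = step(np.dot([1, 1], x) - 0.5)        # OR gate
    h2 = step(np.dot([-1, -1], x) + 1.5)      # NAND gate
    return step(h1 + h2 - 1.5)                # AND of the two hidden units

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]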
Perceptron Function
The perceptron function f(x) is obtained by multiplying the input x with the learned weight coefficient w and adding the bias b:
f(x) = 1 if w.x + b > 0
otherwise, f(x) = 0
Characteristics of Perceptron
The perceptron model has the following characteristics.
o If the added sum of all input values is more than the threshold value, the perceptron produces an output signal; otherwise, no output is
shown.
o The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of input vectors. If input vectors are non-linear, it is not easy
to classify them properly.
Future of Perceptron
The future of the Perceptron model is bright and significant, as it helps to interpret data by building intuitive patterns and applying
them in the future. Machine learning is a rapidly growing technology of Artificial Intelligence that is continuously evolving and in the
developing phase; hence the future of perceptron technology will continue to support and facilitate analytical behavior in machines that
will, in turn, add to the efficiency of computers.
The perceptron model is continuously becoming more advanced and working efficiently on complex problems with the help of artificial
neurons.
Conclusion:
In this article, you have learned how Perceptron models are the simplest type of artificial neural network which carries input and their
weights, the sum of all weighted input, and an activation function. Perceptron models are continuously contributing to Artificial Intelligence
and Machine Learning, and these models are becoming more advanced. Perceptron enables the computer to work more efficiently on
complex problems using various Machine Learning technologies. The Perceptrons are the fundamentals of artificial neural networks, and
everyone should have in-depth knowledge of perceptron models to study deep neural networks.
Topic…
Figure 15.1: The support vectors are the 5 points right up against the margin of the classifier.
For two-class, separable training data sets, such as the one in Figure 14.8,
there are lots of possible linear separators. Intuitively, a decision boundary drawn in
the middle of the void between data items of the two classes seems better than one
which approaches very close to examples of one or both classes. While some learning
methods such as the perceptron algorithm find just
any linear separator, others, like Naive Bayes, search for the best linear separator
according to some criterion. The SVM in particular defines the criterion to be looking
for a decision surface that is maximally far away from any data point. This distance
from the decision surface to the closest data point determines the margin of the
classifier. This method of construction necessarily means that the decision function
for an SVM is fully specified by a (usually small) subset of the data which defines the
position of the separator. These points are referred to as the support vectors (in a
vector space, a point can be thought of as a vector between the origin and that point).
Figure 15.1 shows the margin and support vectors for a sample problem. Other data
points play no part in determining the decision surface that is chosen.
Figure 15.2: An intuition for large-margin classification. Insisting on a large margin reduces the capacity of the
model: the range of angles at which the fat decision surface can be placed is smaller
than for a decision hyperplane.
Maximizing the margin seems good because points near the decision surface represent
very uncertain classification decisions: there is almost a 50% chance of the classifier
deciding either way. A classifier with a large margin makes no low certainty
classification decisions. This gives you a classification safety margin: a slight error in
measurement or a slight document variation will not cause a misclassification.
Another intuition motivating SVMs is shown in Figure 15.2 . By construction, an
SVM classifier insists on a large margin around the decision boundary. Compared to a
decision hyperplane, if you have to place a fat separator between classes, you have
fewer choices of where it can be put. As a result of this, the memory capacity of the
model has been decreased, and hence we expect that its ability to correctly generalize
to test data is increased (cf. the discussion of the bias-variance tradeoff in
Chapter 14, Section 14.6).
Let us formalize an SVM with algebra. A decision hyperplane can be defined by an intercept term b and a
normal vector w which is perpendicular to the hyperplane; this vector is commonly called the weight vector. To choose among
all the hyperplanes that are perpendicular to the normal vector, we specify the intercept term b. Because the
hyperplane is perpendicular to the normal vector, all points x on the hyperplane satisfy w^T x = -b. For SVMs, the two data classes are
named +1 and -1 (rather than 1 and 0), and the intercept term is always explicitly
represented as b (rather than being folded into the weight vector by adding an extra
always-on feature). The math works out much more cleanly if you do things this way,
as we will see almost immediately in the definition of functional margin. The linear
classifier is then:
f(x) = sign(w^T x + b)   (165)
We are confident in the classification of a point if it is far away from the decision
boundary. For a given data set and decision hyperplane, we define the functional
margin of the i-th example x_i with respect to the hyperplane (w, b) as the quantity y_i(w^T x_i + b). However, the functional
margin is underconstrained: we can make it as large as we please simply by scaling up w and b. What we really want is the
geometric margin, the Euclidean distance r of a point from the decision surface, which requires normalizing by the length of
the vector. To get a sense of how to do that, let us look at the actual geometry.
The shortest distance between a point x and the hyperplane is perpendicular to the plane, and hence parallel
to w. A unit vector in this direction is w/|w|. The dotted line in the diagram is then a
translation of the vector r w/|w|. Let us label the point on the hyperplane closest
to x as x'. Then:
x' = x - y r (w/|w|)   (166)
where multiplying by y just changes the sign for the two cases of x being on either
side of the decision surface. Moreover, x' lies on the decision boundary and so
satisfies w^T x' + b = 0. Hence:
w^T (x - y r w/|w|) + b = 0   (167)
Solving for r gives the geometric margin of the point:
r = y (w^T x + b)/|w|   (168)
Again, the points closest to the separating hyperplane are support vectors.
The geometric margin of the classifier is the maximum width of the band that can be
drawn separating the support vectors of the two classes. That is, it is twice the
minimum value of r over the data points, as given in Equation 168, or, equivalently, the
maximal width of one of the fat separators shown in Figure 15.2. The geometric
margin is clearly invariant to scaling of parameters: if we replace w by 5w and b by 5b, the geometric margin is unchanged, because it is
inherently normalized by the length of w. This means that we can impose any scaling constraint we wish on w without affecting it.
Since we can scale the functional margin as we please, for convenience in solving
large SVMs, let us choose to require that the functional margin of all data points is at
least 1 and that it is equal to 1 for at least one data vector. That is, for all items in the
data:
y_i(w^T x_i + b) >= 1   (169)
and there exist support vectors for which the inequality is an equality. Since each
example's distance from the hyperplane is r_i = y_i(w^T x_i + b)/|w|, the geometric
margin is 2/|w|. Our desire is still to maximize this geometric margin. That is, we want to find w and b such that:
● 2/|w| is maximized
● For all (x_i, y_i) in the data, y_i(w^T x_i + b) >= 1
Maximizing 2/|w| is the same as minimizing |w|/2 subject to these constraints, which is a standard quadratic
optimization problem. However, it will be helpful to what follows to understand the shape of the solution of
such an optimization problem. The solution involves constructing a dual problem in which a Lagrange
multiplier α_i is associated with each constraint y_i(w^T x_i + b) >= 1 of the primal problem.
In the solution, most of the α_i are zero. Each non-zero α_i indicates that the corresponding x_i is a
support vector. The classification function is then:
f(x) = sign( Σ_i α_i y_i x_i^T x + b )   (170)
Both the term to be maximized in the dual problem and the classifying function
involve a dot product between pairs of points (x and x_i, or x_i and x_j), and that is the
only way the data are used - we will return to the significance of this later.
To recap, we start with a training data set. The data set uniquely defines the best
separating hyperplane, and we feed the data through a quadratic optimization
procedure to find this plane. Given a new point x to classify, the classification
function f(x) from Equation 170 computes the projection of the point onto the hyperplane normal; the sign of this
projection determines the class. Because the solution is sensitive to the relative scale of the input features,
careful rescaling of some dimensions may be required. However, this is not a problem
if our documents (points) are on the unit hypersphere.
Worked example. Consider building an SVM over the (very little) data set shown in
Figure 15.4. Working geometrically, for an example like this, the maximum margin
weight vector will be parallel to the shortest line connecting points of the two classes,
that is, the line between (1, 1) and (2, 3), giving a weight vector of (1, 2). The optimal
decision surface is orthogonal to that line and intersects it at the halfway point (1.5, 2).
Therefore, the SVM decision boundary is:
y = x_1 + 2x_2 - 5.5   (171)
Working algebraically, with the standard constraint that y_i(w^T x_i + b) >= 1, we
seek to minimize |w|. This happens when this constraint is satisfied with equality by
the two support vectors. Further, we know that the solution is w = (a, 2a) for some a.
So we have that:
a + 2a + b = -1
2a + 6a + b = 1
Therefore, a = 2/5 and b = -11/5, so the optimal hyperplane is given
by w = (2/5, 4/5) and b = -11/5. The margin is 2/|w| = 2/√(4/25 + 16/25) = √5.
● What is the minimum number of support vectors that there can be for a data
set (which contains instances of each class)?
● The basis of being able to use kernels in SVMs (see Section 15.2.3 ) is that the
classification function can be written in the form of Equation 170 (where, for
large problems, most α_i are 0). Show explicitly how the classification function
could be written in this form for the data set from small-svm-eg. That is,
write f(x) as a function in which the data points appear and the only variable is x.
● Install an SVM package such as SVMlight (http://svmlight.joachims.org/),
and build an SVM for the data set discussed in small-svm-eg. Confirm that the
program gives the same solution as the text. For SVMlight, or another package
that accepts the same training data format, the training file would be:
+1 1:2 2:3
-1 1:2 2:0
-1 1:1 2:1
The -c 1 option is needed to turn off use of the slack variables that we discuss
in Section 15.2.1. Check that the norm of the weight vector agrees with what we found in the worked example above.
Examine the support vectors and their α_i values, and check that they agree with your answers in Exercise 15.1.
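As a quick check of the worked example, here is a sketch that is not part of the original text: it uses scikit-learn's SVC with a very large C to approximate the hard-margin SVM, rather than SVMlight.
# verify the worked example with scikit-learn's linear SVC
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 3], [2, 0], [1, 1]])   # the three training points above
y = np.array([1, -1, -1])                # their class labels

# a very large C approximates the hard-margin SVM discussed in the text
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

print(clf.coef_)                          # expected roughly [[0.4 0.8]] = (2/5, 4/5)
print(clf.intercept_)                     # expected roughly [-2.2] = -11/5
print(clf.support_vectors_)               # expected (1, 1) and (2, 3)
print(2 / np.linalg.norm(clf.coef_))      # margin, expected sqrt(5) ≈ 2.236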
Topic…….
Topic
Working in a kernel-defined feature space means that we are not able to explicitly represent points. For example the image of an
input point x is φ(x), but we do not have access to the components of this vector, only to the evaluation of inner products between
this point and the images of other points.
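A small sketch of this idea follows; it is not from the original text, and the polynomial kernel and 2-D inputs are chosen just for illustration. The kernel k(x, z) = (x·z)^2 is evaluated directly on the inputs, and the explicit feature map φ appears only to verify that k really computes an inner product in feature space.
# kernel evaluation versus the explicit feature map
import numpy as np

def k(x, z):
    # inner product in feature space, computed directly on the inputs
    return (x @ z) ** 2

def phi(x):
    # explicit feature map for 2-D inputs, shown only for verification
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(k(x, z))            # 16.0, without ever forming phi(x)
print(phi(x) @ phi(z))    # 16.0, the same inner product computed explicitly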
Topic
Non-Linear Classification
→ Non-Linear Classification refers to categorizing those instances that are not linearly separable.
→ Some of the classifiers that use non-linear functions to separate classes are Quadratic Discriminant Classifier, Multi-Layer
Perceptron (MLP), Decision Trees, Random Forest, and K-Nearest Neighbours (KNN).
→ In the figure above, we have two classes, namely 'O' and 'X.' To differentiate between the two classes, it is impossible to draw an
arbitrary straight line to ensure that both the classes are on distinct sides.
→ We notice that even if we draw a straight line, there would be points of the first-class present between the data points of the
second class.
→ In such cases, piece-wise linear or non-linear classification boundaries are required to distinguish the two classes.
→ The Quadratic Discriminant Classifier works much like LDA; the only difference is that here, we do not assume that the mean and covariance of all classes are the same.
→ Now, let us visualize the decision boundaries of both LDA and QDA on the iris dataset. This would give us a clear picture of the
difference between the two.
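A possible sketch of that visualization is given below; it is not from the original text, and the choice of the two sepal features and the plotting details are assumptions.
# plot LDA and QDA decision boundaries on two iris features
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # sepal length and sepal width only, so we can plot in 2-D

# grid of points covering the feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
models = [LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()]
titles = ['LDA (linear boundaries)', 'QDA (quadratic boundaries)']
for ax, model, title in zip(axes, models, titles):
    model.fit(X, y)
    zz = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3, cmap='rainbow')   # decision regions
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='rainbow', edgecolors='k')
    ax.set_title(title)
plt.show()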
Multi-Layer Perceptron (MLP)
→ This is nothing but a collection of fully connected dense layers. These help transform any given input dimension into the desired
dimension.
→ MLP consists of one input layer(one node belonging to each input), one output layer (one node belonging to each output), and a
few hidden layers (>= one node belonging to each hidden layer).
→ In the above diagram, we notice three inputs, resulting in 3 nodes belonging to each input.
→ Overall, the nodes belonging to the input layer forward their outputs to the nodes present in the hidden layer. Once this is done,
the hidden layer processes the information passed on to it and then further passes it on to the output layer.
Decision Tree
→ Instances are classified by sorting them down from the root to some leaf node.
→ An instance is classified by starting at the tree's root node, testing the attribute specified by this node, then moving down the tree
branch corresponding to the attribute's value, as shown in the above figure.
→ The process is repeated based on each derived subset in a recursive partitioning manner.
→ The above decision tree helps determine whether the person is fit or not.
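As a small illustration of this root-to-leaf sorting, the sketch below is not from the original text; the iris data and tree depth are arbitrary choices. scikit-learn can print the learned attribute tests for each branch.
# fit a small decision tree and print its root-to-leaf rules
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# each printed branch is one sequence of attribute tests used to classify an instance
print(export_text(tree, feature_names=list(iris.feature_names)))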
K-Nearest Neighbours
→ KNN is a supervised machine learning algorithm. It is used for classification problems. Since it is a supervised machine learning
algorithm, it uses labeled data to make predictions.
→ KNN analyzes the 'k' nearest data points and then classifies the new data based on the same.
→ In detail, to label a new point, the KNN algorithm analyzes the ‘k’ nearest neighbors or ‘k’ nearest data points to the new point. It
chooses the label of the new point as the one to which the majority of the ‘k’ nearest neighbors belong to.
→ It is essential to choose an appropriate value of ‘K’ to avoid overfitting our model.
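A minimal sketch of this follows; it is not from the original text, and the data split and the candidate values of k are arbitrary choices made to show the effect of k.
# evaluate KNN for a few values of k
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 3, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # each test point is labelled by a majority vote of its k nearest neighbours
    print(k, knn.score(X_test, y_test))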
Topic…..
But SVM for regression analysis? I hadn’t even considered the possibility for a while! And even now when I bring up
“Support Vector Regression” in front of machine learning beginners, I often get a bemused expression. I understand –
most courses and experts don’t even mention Support Vector Regression (SVR) as a machine learning algorithm.
But SVR has its uses as you’ll see in this tutorial. We will first quickly understand what SVM is, before diving into the
world of Support Vector Regression and how to implement it in Python!
Can you decide what the separating line will be? You might have come up with this:
The line fairly separates the classes. This is what SVM essentially does – simple class separation. Now, what if the
data was like this:
Here, we don’t have a simple line separating these two classes. So we’ll extend our dimension and introduce a new
dimension along the z-axis. We can now separate these two classes:
When we transform this line back to the original plane, it maps to the circular boundary as I’ve shown here:
This is exactly what SVM does! It tries to find a line/hyperplane (in multidimensional space) that separates these two
classes. Then it classifies a new point depending on whether it lies on the positive or negative side of the
hyperplane.
● Kernel: A kernel helps us find a hyperplane in the higher dimensional space without increasing the computational
cost. Usually, the computational cost will increase if the dimension of the data increases. This increase in dimension
is required when we are unable to find a separating hyperplane in a given dimension and are required to move in a
higher dimension:
● Hyperplane: This is basically a separating line between two data classes in SVM. But in Support Vector Regression,
this is the line that will be used to predict the continuous output
● Decision Boundary: A decision boundary can be thought of as a demarcation line (for simplification) on one side of
which lie positive examples and on the other side lie the negative examples. On this very line, the examples may be
classified as either positive or negative. This same concept of SVM will be applied in Support Vector Regression as
well
To understand SVM from scratch, I recommend this tutorial: Understanding Support Vector Machine(SVM)
algorithm from examples.
Consider these two red lines as the decision boundary and the green line as the hyperplane. Our objective, when
we are moving on with SVR, is to basically consider the points that are within the decision boundary line. Our
best fit line is the hyperplane that has a maximum number of points.
The first thing that we’ll understand is what the decision boundary is (the danger red line above!). Consider these
lines as being at any distance, say ‘a’, from the hyperplane. So, these are the lines that we draw at distances ‘+a’ and
‘-a’ from the hyperplane. This ‘a’ in the text is basically referred to as epsilon.
Assuming the hyperplane is wx + b = 0, the equations of the decision boundaries become:
wx + b = +a
wx + b = -a
Thus, any hyperplane that satisfies our SVR should satisfy:
-a < y - (wx + b) < +a
Hence, we are going to take only those points that are within the decision boundary and have the least error rate, or
are within the Margin of Tolerance. This gives us a better fitting model.
A real-world dataset contains features that vary in magnitudes, units, and range. I would suggest performing
normalization when the scale of a feature is irrelevant or misleading.
Feature Scaling basically helps to normalize the data within a particular range. Many estimator classes apply feature scaling
automatically, but the SVR class does not, so we should perform feature scaling ourselves in Python.
Kernel is the most important feature. There are many types of kernels – linear, Gaussian, etc. Each is used
depending on the dataset. To learn more about this, read this: Support Vector Machine (SVM) in Python and R
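Putting the last few points together, here is a minimal sketch that is not from the original tutorial; the synthetic data and the C and epsilon values are assumptions. It scales the features and fits an epsilon-SVR with an RBF kernel.
# scale the features and fit an epsilon-SVR with an RBF kernel
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)   # noisy sine curve

# epsilon plays the role of 'a' above: errors inside the tube are ignored
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=0.1))
model.fit(X, y)

print(model.predict([[2.5]]))   # prediction for a new input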
Step 6. Visualizing the SVR results (for higher resolution and smoother curve)
This is what we get as output: the best-fit line, the one that has the maximum number of points. Quite accurate!
End Notes
We can think of Support Vector Regression as the counterpart of SVM for regression problems. SVR acknowledges
the presence of non-linearity in the data and provides a proficient prediction model.
I would love to hear your thoughts and ideas around using SVR for regression analysis. Connect with me in the
comments section below and let’s ideate!
topic..
Zhongzhi Shi
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
DOI: 10.4236/ijis.2019.94007
Abstract
Cognitive machine learning refers to the combination of machine learning and brain cognitive mechanisms, specifically, combining
machine learning with the mind model CAM. Three research directions are proposed in this paper: the emergence of learning, the
complementary learning system, and the evolution of learning.
Keywords
Cognitive Machine Learning, Emergence of Learning, Complementary Learning System, Evolution of Learning, Intelligence
Science, Mind Model CAM
Share and Cite:
Shi, Z. (2019) Cognitive Machine Learning. International Journal of Intelligence Science, 9, 111-121. doi: 10.4236/ijis.2019.94007.
1. Introduction
On July 1, 2005, in commemoration of the 125th anniversary of Science, scientists summed up 125 questions, of which question 94 was
“What are the limitations of learning through machines?”, interpreted as “Computers can beat the best chess players in
the world, and they can grab rich information on the Internet. But abstract reasoning still goes beyond any machine”.
Learning ability is the basic characteristic of human intelligence. From birth, people have been learning from the objective environment
and their own experience. Human cognitive ability and wisdom ability are gradually formed, developed and perfected in lifelong
learning.
In 1983, Simon gave a better definition of learning: a long-term change that a system produces in order to adapt to its
environment, which enables the system to finish the same or similar work more effectively the next time. Learning is a change taking place
in a system; it can be either a permanent improvement in the system's performance or a permanent change in the behavior of an
organism. On December 12, 2015, Science magazine published a paper showing human-level concept learning through probabilistic
program induction [1]. In a complicated system, the change due to learning has many causes; that is to say, there are
many forms of learning process in the same system.
AlphaGo is a computer program developed by Google DeepMind in London to play the board game Go. In October 2015, it had
beaten a professional named Fan Hui, the European champion. In March 2016, it beat Lee Sedol, one of the strongest Go
players in the world, winning a five-game match 4 to 1. AlphaGo’s victories are a major milestone in artificial intelligence research.
AlphaGo’s algorithm uses a Monte Carlo tree search to find its moves based on knowledge previously “learned” by machine
learning, specifically by a deep neural network and reinforcement learning [2] .
Learning theory concerns the essence of the learning process, its rules and constraints, and explores the various
conditions under which learning occurs in order to explain it. Learning is a kind of process in which individuals produce lasting changes in
their behavior through training. There are various learning theories. For over 100 years, psychologists have proposed many
schools of learning theory, which differ in their philosophical foundations, theoretical backgrounds and research
methods. These schools mainly include the behavioral school, the cognitive school and the humanist school [3].
In recent years, artificial intelligence has made great progress, mainly focusing on statistics and big data. In order to solve the
problem of giving machines the ability of abstract reasoning and allowing computers to learn and evolve, cognitive machine learning is
presented in this paper.
2. What Is Cognitive Machine Learning
Cognitive machine learning refers to the combination of machine learning and brain cognitive mechanisms, specifically, combining
the achievements of machine learning we have studied for many years with the mind model CAM [3]. Figure 1 shows cognitive
machine learning.
Cognitive machine learning mainly studies the following three aspects:
1) The emergence of learning: In the process of human cognition, the first step is to begin to contact the outside world, which
belongs to the stage of perception. The second step is to sort out and transform the materials of comprehensive perception, which belongs to the stage of
concept, judgment and reasoning. We raise perceptual knowledge, acquired through the visual, auditory and tactile senses, to rational knowledge.
After acquiring a lot of perceptual knowledge, a new concept is formed in the human brain; this is the emergence of
learning.
2) Complementary learning system: how do we construct a complementary learning system between short-term memory and
semantic memory?
3) Evolution of learning: As we all know, after hundreds of thousands of years of evolution, human brain capacity is still changing,
and language plays an important role in this. So the evolution of learning is not only about adapting to changes in the outside world, but also about
changing one's own structure. We think that changing its own structure is the most important capability.
3. The Emergence of Learning
The emergence of learning is the rise from perceptual awareness to rational knowledge, that is, conceptual learning: a learning
method, as well as a form of critical thinking, in which individuals master the ability to categorize and organize data by creating
logic-based mental structures. This process requires both knowledge construction and acquisition, because individuals first identify key
attributes that would make certain subjects fall into the same category or concept. Knowledge construction is a constructive learning
process in which individuals use what is familiar or what they have experienced to understand another subject matter, while
knowledge acquisition is a learning process wherein a person acquires knowledge from an acknowledged expert.
Conceptual learning can be divided into two types: first-order concept generation, which is based on the similarity-recognition
process, and high-order concept generation, which is based on the dissimilarity-recognition process. First-order concept
generation is related to the problem-driven phase and high-order concept generation is related to the inner-sense-driven phase.
So far a lot of learning methods and algorithms have been proposed. Various learning methods from statistics, neural networks,
fuzzy logic and deep learning can be applied to conceptual learning and pattern recognition. Here the convolutional generative
stochastic model (CGSM) is used as an example to illustrate the principle of the emergence of learning.
Convolutional neural networks were originally based on the neocognitron introduced by Fukushima et al. [4]; they were later
improved by Yann LeCun [5] and successfully applied to image detection and segmentation, object
recognition and other fields.
Generally, a convolution neural network (CNN) consists of one or more convolution layers and a fully connected layer at the top
(corresponding to the classical neural network), also includes non-linear mappings and some local or global pooling layers. The
convolutional neural network can utilize the two-dimensional structure of input data. Compared with other deep feed forward neural
networks, convolutional neural networks need fewer parameters to estimate. They have become an attractive structure in deep
learning. A CNN mainly consists of a feature extractor and a classifier. The feature extractor contains multiple convolutional
layers and sub-sampling layers. The classifier consists of one or two fully connected layers.
However, when convolutional networks deal with data noise (such as local loss or blurring), they are weaker. Generative
stochastic networks, on the other hand, are strongly robust to data noise (such as local loss, blurring, distortion, etc.) [6]. Because the
generative stochastic network has strong noise robustness and can adopt a flexible framework and noise form, while the convolutional network
has the advantages of multi-layer invariance and spatial local correlation when extracting visual image features, and conforms to
the principle of the biological visual perception channel, we consider introducing the convolutional network model into the generative stochastic
network, and propose a convolutional generative stochastic model (CGSM), shown in Figure 2.
Figure 2(a) describes a typical convolutional generative stochastic model architecture, which consists of three feature layers.
The convolution and pooling process within each feature layer is not described in detail. A hidden layer may contain
convolution and pooling sub-layers (e.g. h1 and h2), while h3 contains only a convolution sub-layer. Figure 2(b) shows the computational
graph. The output of each layer is injected with no more than 50% random noise or Gaussian noise (lightning
symbol) through C(X̃|X). Then, a local supervised learning method is used to train a function that reconstructs X from the
degraded sample X̃ with the highest possible accuracy. In this way, not only can noisy data be fully learned, but the complex task
of directly modelling the data-generating distribution P(X) is transformed into the more tractable task of learning the reconstruction
distribution P(X|X̃). The random elements in each layer are trained by back-propagation, where w′_i denotes the transformation of w_i,
each X_i (i > 0) is sampled from the reconstruction distribution P_θ2(X_i | H_i), and the sum of the log-likelihoods of the
reconstruction distributions is used as the training objective function of the network.
Figure 2. A typical CGSM architecture (a) and computational graph (b).
In the convolutional generative stochastic model, the input x is a three-dimensional array composed of n1 two-dimensional feature maps
of size n2 × n3, and the output y is a three-dimensional array composed of m1 two-dimensional feature maps of size m2 × m3. The
convolution layer consists of K trainable filters of size l1 × l2. If the two-dimensional convolution is performed in Valid mode,
taking the boundary effect into account, then m2 = n2 − l1 + 1 and m3 = n3 − l2 + 1. In the convolution layer, the output y_{i,k} is
calculated from the input feature map x_i as:
y_{i,k} = σ(x̃_i ∗ w_{i,k} + b_{i,k})    (1)
Here, the symbol ∗ represents the two-dimensional convolution operation. The reconstructed output x′_i of x_i is then calculated as follows:
x′_i = σ( Σ_k y_{i,k} ∗ w′_{i,k} + b′_{i,k} )    (2)
In the above equations, x̃_i is obtained by degrading x_i through C(x̃_i | x_i), where C(x̃_i | x_i) is injected with an
independent random variable z ~ P(Z). In this way, x_i is converted to the random variable x̃_i; σ(·) is an activation
function, generally tanh() applied point by point, although more complex non-linear implementations have
recently been adopted, such as for natural image processing.
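As a rough illustration of equations (1)-(2), the snippet below is a sketch, not from the paper; the feature-map size, filter size and the 50% dropout-style corruption are assumptions. It corrupts an input map, convolves it in Valid mode and confirms the output size m2 = n2 − l1 + 1.
# corrupt an input feature map, apply a valid-mode convolution and an activation
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
n2, n3 = 8, 8          # input feature map size
l1, l2 = 3, 3          # filter size

x = rng.normal(size=(n2, n3))                    # input feature map x_i
mask = rng.random((n2, n3)) > 0.5                # C(x~|x): drop roughly 50% of entries
x_tilde = x * mask                               # corrupted sample x~_i

w = rng.normal(size=(l1, l2))                    # trainable filter w_{i,k}
b = 0.1                                          # bias b_{i,k}

y = np.tanh(convolve2d(x_tilde, w, mode='valid') + b)   # equation (1)
print(y.shape)          # (6, 6) == (n2 - l1 + 1, n3 - l2 + 1)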
A convolutional generative stochastic model usually consists of multiple stacked convolutional generative stochastic layers. In order to obtain
high-level feature expressions, pooling is applied after each convolution layer to reduce the size of the feature maps. Common
pooling techniques are Mean-Pooling and Max-Pooling; in this paper, Mean-Pooling is mainly used to realize the Downward
Pass operation. Similar to the deep Boltzmann machine network, layer-by-layer sampling is also used to learn the
convolutional generative stochastic network. The computational graph shown
in Figure 2(b) describes the process of layer-by-layer reconstruction, including the sampling operation.
Softmax is usually used as the output of the last layer of the model when the convolutional generative stochastic model is used for
classification and recognition tasks. There are obvious differences between the training process of this model and that of a traditional
convolutional network. The training process of the convolutional generative stochastic model is divided into two stages. First, in the
pre-training stage, the WalkBack algorithm is used to learn the reconstruction distribution P(X|X̃) by sampling
layer by layer, approximating the actual data distribution P(X) in order to obtain better robustness. Second, the back-propagation
algorithm is used to optimize the whole stochastic generative model globally to achieve the target prediction or recognition
performance.
In order to verify that the convolutional generative stochastic model has performance advantages over an ordinary convolutional neural
network and other perception models in complex noisy environments, two experiments were designed: a
noise-free environment and a noisy environment. The noise-free environment directly uses handwritten digits from the mnist0_3 data set for
recognition, while the noisy environment applies noise processing such as partial blurring and missing pixels.
From the recognition-rate column in Table 1, it can be seen that the implementations of MLP, CNN, CGSM and SVMs all
maintain a high recognition rate when recognizing a single object, with CGSM achieving the highest rate. For object
sequence recognition, CGSM shows a higher recognition rate than MLP, CNN and SVMs. The object recognition accuracy is 93.86%.
In the experiment with noise, the input data of each perception model is converted from the original input X into
the noisy random variable X̃ through C(X̃|X). The noisy dataset in Table 2 is obtained by injecting 30% Bernoulli random
variables, and the convolutional feature layer of CGSM is also injected with 30% Bernoulli random variables during the experiment.
4. Complementary Learning System
In recent years, rapid progress has been made in the related fields of artificial intelligence. The benefits to developing artificial
intelligence of closely examining biological intelligence are two-fold [7] . First, neuroscience provides a rich source
5. Evolution of Learning
Evolution, by which a system adapts itself to the outside world and changes its own structure, is one of the most important mechanisms in the world.
Evolution enables learning, and learning in turn enables more advanced evolution; the evolution of learning produces goals. This is actually the key
point: a random, aimless machine can discover its own goals through learning. Darwin founded the theory of biological evolution in the mid-19th century. Through
heredity, variation and natural selection, organisms evolve and develop from low to high, from simple to complex, and from few to
many.
For intelligence, so-called evolution refers to learning how to learn, which is different from ordinary software learning in that the
structure changes along with it. This is very important: the structural changes record the results of learning and improve the learning
method. Moreover, storage and computation are integrated, which is difficult for present computers to achieve. The study of
computer models of learning evolution is probably a new topic, and it deserves great attention.
Studies of fossils of ancient human skulls reveal the development of the human brain, which has tripled in size over the course of
two million years of evolution. With the rapid development of human intelligence, many unique cortical centers emerged in this
period, such as the motor speech center, the writing center, the auditory speech center and so on. At the same time,
centers for appreciating music and painting have also appeared in the cerebral cortex; these centers have obvious localization characteristics.
Especially with the development of human abstract thinking, the frontal lobe of the human brain expanded rapidly. Thus, the modern
human brain is still evolving continuously.
In order to make machines have human-level intelligence and break through the limitation of learning through computers, it is
necessary to make machines have the function of learning evolution. Through learning, not only knowledge is increased, but also
the memory structure of the machine is changed.
We consider that without evolution of learning the goal of achieving human-level general intelligence is far from completion. Here
we review the principles underlying the evolution of learning, as the most fundamental to human-level machine learning.
Cognitive structure refers to the organizational form and operation mode of cognitive activities, including a series of operational
processes such as the interactions among the components of cognitive activities, namely the mechanism of psychological
activities. Cognitive structure theory focuses on cognitive structure, emphasizing the constructive nature of cognitive structure and the
interaction between cognitive structure and learning [11].
Throughout the theoretical development of cognitive structure, there are Piaget’s schema theory, Gestalt’s insight theory, Tolman’s
cognitive map theory, Bruner’s classification theory, Ausubel’s cognitive assimilation theory and so on. Cognitive structure theory
holds that the cognitive structure existing in human mind is always in the process of change and construction, and the learning
process is the process of continuous change and reorganization of cognitive structure, in which the environment and individual
characteristics of learners are the decisive factors. Piaget used assimilation, adaptation and balance to characterize the mechanism
of cognitive structure construction. He emphasized the importance of the external environment as a whole. He believed that the
rich and good multiple stimulation provided by the environment for learners was the fundamental condition for the improvement
and change of cognitive structure. Modern cognitive psychologist Neisser believes that cognitive process is constructive, which
includes two processes: the process of individual response to external stimuli and the process of learners’ conscious control,
transformation and construction of ideas and images [12] . Cognitive structure is a gradual process of self-construction under the
combination of external stimulation and individual characteristics of learners.
Piaget’s formalization work of intelligence development can be divided into two stages: early structuralism and later post-
structuralism. The former is also called classical theory, and the latter is called the new theoretical stage. Piaget’s new formal
theory basically abandoned the operation structure theory and replaced it with morphism-category theory. The development series
of traditional theory is from perceptual motion schema to representation schema, intuitive thinking schema to operational thinking
schema. Piaget’s new formal theory has become the development series of intramorphic level, intermorphic level and extramorphic
level [13].
The first stage is called the intramorphic level. Psychologically, it’s just a simple correspondence, no combination. Common features are
based on correct or incorrect observations, especially visible predictions. This is only an empirical comparison, depending on simple
state transitions.
The second stage, called intermorphic level, marks the beginning of systematic combinatorial construction. Intermorphic level
combination construction only occurs locally and gradually and finally does not construct a closed general system.
The last stage is extramorphic level. The main body compares morphisms by means of operation tools. Among them, the arithmetic
tool is precisely to explain and summarize the content of the previous morphism.
Topos is used to describe morphism-category theory. Around 1963, Bill Lawvere decided to figure out new foundations for
mathematics, based on category theory. His idea was to figure out what was so great about sets, strictly from the category-
theoretic point of view. In the spring of 1966 Lawvere encountered the work of Alexander Grothendieck, who had invented a
concept of “Topos” in his work on algebraic geometry. The word “Topos” means “place” in Greek. In algebraic geometry we are
often interested not just in wh
Topic….
Neural Network Models Explained
Artificial neural network models are behind many of the most complex applications of machine learning.
Classification, regression problems, and sentiment analysis are some of the ways artificial neural networks are being leveraged
today. As an emerging field, there are many different types of artificial neural networks. They vary for a variety of reasons, such
as complexity, network architecture, density, and the flow of data. But the different types share a common goal of modelling and
attempting to replicate the behaviour of neurons to improve machine learning.
Artificial neural networks have a wide range of uses in machine learning. Each type of artificial neural network model has
different strengths and use cases. Overall, they are mainly used to solve more complex problems than would be possible with
more traditional machine learning techniques. Examples may include complex natural language processing and machine
learning-powered language translation, which all rely on artificial neural networks. Recurrent neural networks are often utilised for
sentiment analysis or translating text too. The depth and scale of the neural architecture means a non-linear decision making
process can be achieved.
Artificial neural networks are used in the deep learning form of machine learning. It’s called deep learning as models use the
‘deep’, multi-layered architecture of an artificial neural network. As each layer of an artificial neural network can process data,
models can build an abstract understanding of the data. This architecture means models can perform increasingly complex tasks,
for example understanding natural language or categorising complex file types.
● Recommendation systems for customers, users and consumers in products like streaming services or e-commerce.
● To power virtual assistance and speech recognition software.
● Complex image, audio and document classification models, for example in facial recognition software.
● In automatic feature extraction from raw, unlabelled data.
There are different types of artificial neural networks which vary in complexity. This guide explores the different types of
artificial neural networks, including what they are and how they’re used.
Artificial neurons or nodes are modelled as a simplified version of neurons found in the brain. Each artificial neuron is connected
to other nodes, though the density and amount of connections differ with each type of artificial neural network. The network is
usually grouped into layers of nodes, which exist between the input and output layer. This multi-layered network architecture is
also known as a deep neural network because of the depth of these layers. These different layers in the artificial neural network
models can learn different features of data. Hidden hierarchical layers allow the understanding of complex concepts or patterns
from processed data.
The structure of artificial neural networks represents a simplified reflection of the complexity of the human or animal brain. A
web of interconnected artificial nodes mimic the behaviour of neurons within a nervous system. These artificial neural networks
are much less complex than a human brain, but are still incredibly powerful at performing tasks such as classification. Data starts
in the input layer and leaves from the output layer. But with the more complex artificial neural networks, data will move between
many different layers in a non-linear way.
Complex artificial neural networks are developed so that models can mirror the nonlinear decision-making process of the human
brain. This means models can be trained to make complex decisions or understand abstract concepts and objects. The model will
build from low-level features to complex features, understanding complex concepts. Each node within the network is weighted
depending on its influence on other artificial neural network nodes.
Like other machine learning models, optimisation of artificial neural networks is based on a loss function. This is the difference
between a predicted and actual output. The weighting of each node and layer is adjusted by the model to achieve a minimum loss.
Artificial neural network models can understand multiple levels of data features, and any hierarchical relationship between
features. So when used for a classification problem, an artificial neural network model can understand complex concepts by
processing multiple layers of features.
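As a small illustration of this loss-driven weight adjustment, the sketch below is not part of the original article; the synthetic data and network size are arbitrary choices. scikit-learn's MLPClassifier records the training loss after each iteration.
# train a small multilayer perceptron and watch the loss shrink
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0)
clf.fit(X, y)

# loss_curve_ holds the training loss after each iteration; it should decrease
print(clf.loss_curve_[:3], '...', clf.loss_curve_[-3:])
print('training accuracy:', clf.score(X, y))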
Multilayer Perceptron artificial neural networks add complexity and density, with the capacity for many hidden layers between
the input and output layer. Each individual node on a specific layer is connected to every node on the next layer. This means
Multilayer Perceptron models are fully connected networks, and can be leveraged for deep learning.
They’re used for more complex problems and tasks such as complex classification or voice recognition. Because of the model’s
depth and complexity, processing and model maintenance can be resource and time-consuming.
A common use for radial basis function neural networks is in system control, such as systems that control power restoration after
a power cut. The artificial neural network can understand the priority order for restoring power, prioritising repairs to the greatest
number of people or core services.
The flow of data is similar to Feedforward artificial neural networks, but each node will retain information needed to improve
each step. Because of this, models can better understand the context of an input and refine the prediction of an output. For
example, a predictive text system may use memory of a previous word in a string of words to better predict the outcome of the
next word. A recurrent artificial neural network would be better suited to understand the sentiment behind a whole sentence
compared to more traditional machine learning models.
Recurrent neural networks are also used within sequence to sequence models, which are used for natural language processing.
Two recurrent neural networks are used within these models, which consists of a simultaneous encoder and decoder. These
models are used for reactive chatbots, translating language, or to summarise documents.
Each component network performs a different subtask which, when combined, completes the overall task and output. This
type of artificial neural network is beneficial as it can make complex processes more efficient, and can be applied to a range of
environments.