Anatomy of Deep Learning Principles (2023)
Brief introduction
This book introduces the basic principles of deep learning and the process of implementing them in an accessible way, building its own deep learning library from scratch with Python's numpy library instead of using existing deep learning frameworks. After covering the necessary basics of Python programming, calculus, and probability and statistics, it introduces the core topics of deep learning (regression models, neural networks, convolutional neural networks, recurrent neural networks, and generative networks) in the order in which the field developed. Alongside a plain-language analysis of each principle, it provides a detailed code implementation. The approach is less like teaching you how to use weapons and mobile phones than teaching you how to make them yourself: this book is not a tutorial on existing deep learning libraries, but an analysis of how to develop a deep learning library from zero. Combining principles from zero with code implementation enables readers to better understand both the basic principles of deep learning and the design ideas of popular deep learning libraries.
Preface
Ever since computers were invented, it has been the goal of computer scientists to give machines human-like intelligence. Since the concept of "artificial intelligence" was proposed in 1956, AI research has gone through many cycles of peaks and troughs. Over the course of its development, from rule-based reasoning grounded in mathematical logic to state-space search, from expert systems to statistical learning, from swarm intelligence algorithms to machine learning, and from neural networks to support vector machines, different artificial intelligence techniques have each had their turn leading the field.
In the past six years, deep learning based on deep neural networks has advanced by leaps and bounds. Its successful applications, such as AlphaGo defeating the human Go champion, autonomous driving, machine translation, speech recognition, and deep-fake face swapping, continue to attract attention. As a branch of machine learning, deep learning has brought traditional neural network technology back to life and established itself as the dominant technique in modern artificial intelligence.
With the help of deep learning platforms such as TensorFlow, PyTorch, and Caffe, even a primary school student can easily use a deep learning library for applications such as face recognition and speech recognition: all they do is call the platform's APIs to define the structure of a deep neural network model and tune the training parameters. These platforms have made deep learning very easy and brought it into ordinary homes; artificial intelligence is no longer mysterious. From universities to enterprises, people from all walks of life are using deep learning for research and applications.
The author believes that platform tutorial books are time-sensitive: a book's publication cycle is usually as long as a year, by which time the platform's interface may have changed, perhaps substantially. For a changing platform, such books quickly lose their value. Books on principles should be easy to understand and avoid complex, esoteric mathematics where possible; but completely abandoning the classical mathematics developed over centuries and describing derivatives with elementary school mathematics is not the best choice for readers who already know some higher mathematics. What the market particularly lacks are accessible books that explain the principles and show how to implement deep learning from the bottom up, without relying on existing deep learning libraries.
To accommodate readers who struggle with mathematics, the first chapter of this book introduces not only the necessary Python programming but also, as plainly as possible, the required calculus and probability. On this basis, the book proceeds from the simplest regression model to neural network models, from shallow to deep, moving from problem to concept to explain the basic ideas and principles in an easy-to-understand way. Avoiding both long-windedness and excessive terseness, it uses simple examples and concise, plain language to analyze the core principles of the models and algorithms. Once a principle is understood, it is then implemented from scratch with Python's numpy library, so that readers grasp both principle and implementation. By following along step by step, readers can build a deep learning library from zero without any deep learning platform. Finally, for comparison, the book introduces the use of the deep learning platform PyTorch, so that readers can quickly learn to use such a platform; this also helps readers understand the design ideas of these platforms more deeply and thus master them better.
This book is suitable not only for beginners without any deep learning background, but also for practitioners who have experience with deep learning libraries and want to understand their underlying implementation. It is especially suitable as a deep learning textbook for colleges and universities.
The English version of this book was translated from the Chinese version using Google Translate. We will continue to improve the quality of the translation, and we hope readers will help us correct errors.
My email: [email protected]
1. Objects
3. Type conversion
4. Notes
5. Variables
6. input() function
1.1.3 Operation
String formatting
1. if statement
2. while statement
3. for statement
index
slice
2. tuple (tuple)
3. set (collection)
4. dict (dictionary)
1.1.6 Functions
subplot()
Axes objects
mplot3d
display image
1.2 tensor library numpy
1 vector
2 Matrix
3 dimensional tensor
1. array()
3. asarray()
9. Append, Repeat & Tile, Merge & Split, Edge Padding, Add Axis & Swap Axis
Repeat repeat()
Tile tile()
Merge concatenate()
Stack stack()
Split split()
Edge Padding
Add Axis
Swap axes
1. Element-by-element calculation
Hadamard Product
2. Cumulative calculation
3. Dot Product
4 Broadcast Broadcasting
1.3 Calculus
1.3.1 Functions
Arithmetic
Composite
3. Derivatives of functions
1.3.8 Integral
1.4.1 Probability
1. Machine Learning
3.1.8 Prediction
2. Fitting plane
3.3 Regularization
- The loss function of adding the regular term becomes
1. Generate data
4. Decision curve
5. Prediction accuracy
Summary
1. Perceptron
2. Neurons
2. Tanh function
4. ReLU function
4.1.5 Output
Gradient Validation
5.1.2 Normalization
2 Whitening
5.4.2 Dropout
6.1 Convolution
Stride
Stride
6.1.5 Pooling
Gradient Test
6.5.1 LeNet-5
6.5.2 AlexNet
6.5.3 VGG
2. Language Model
predict
predict
Gradient Test
Text generation
predict
8.2 Autoencoders
8.2.1 Autoencoder
2. Loss function
3. Training process
5. Training GAN
4. Training model
On Windows, you can download and run the Python 3 installer from the official website https://www.python.org/downloads. Check "Add Python3.8 to Path" during the installation process, and the installer will automatically add the path of the Python interpreter to the system path.
C:\Users\hwdon>python
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10)
[MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more
information.
>>>
You can enter the following command to install the jupyter environment:
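For example, with pip (assuming pip is available on the system path):
pip install jupyter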
Note: If you use Anaconda to install Python and its packages, you don't
need to install the Python interpreter separately.
1. Objects
All values in Python (such as the integer 2) exist as objects. An object contains its value (the content of the object), its data type, and an id (equivalent to an address). Python is a dynamically typed high-level language: "dynamic typing" means that Python automatically infers the type of an object from its value. The type of a value can be queried with the built-in function type(). For example:
type("http://hwdong-net.github.io")
str
type(2)
int
type(3.14)
float
type(False)
bool
id(3)
140705546753760
id(3.14)
1857260776048
Note: The Boolean type bool has only 2 values True and False, which are
used to represent the truth or falsehood of logical propositions.
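The following outputs presumably come from print() calls like these (a reconstruction; the original code was lost in conversion):
print(2, 3.14)
print("youtube channel: hwdong", True)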
2 3.14
youtube channel: hwdong True
The print() function has a keyword parameter end, which specifies the character appended after the output. Its default value is "\n", meaning a newline follows the output. The following code passes a space " " as the end argument, so a space instead of a newline is output after each print.
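A minimal sketch consistent with the output below:
for i in range(1, 7):
    print(i, end=" ")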
1 2 3 4 5 6
3. Type conversion
For built-in primitive types, the type name can be used to convert an object of another type into an object of that type.
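A sketch consistent with the outputs below (the original code was lost in conversion):
a = int("3")          # convert the string "3" to an integer
print(a)
print(type(a))
print(type(str(3)))   # convert the integer 3 to a string
b = float("3.14")     # convert the string "3.14" to a float
print(b)
print(type(b))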
3
<class 'int'>
<class 'str'>
3.14
<class 'float'>
4. Comment
A line of text beginning with # is called a comment. A comment is not a program statement but a description of the program code.
5. Variables
You can use the operator = to give an object a name, called the variable name of the object; we also say that the variable refers to the object.
pi = 3.14
print(pi)
print(2*pi)
A variable name can be rebound to another object at any time; variable names are not fixed once and for all, unlike in languages such as C.
a = 3.14                    # a refers to the object 3.14
b = a                       # b and a refer to the same object 3.14
a = "hwdong-net.github.io"  # a now refers to the new string object "hwdong-net.github.io"
print(a)
print(b)
hwdong-net.github.io
3.14
In this code, the variable name a first refers to the object 3.14, and then
refers to the string "hwdong-net.github.io". As shown in Figure 1-1.
Figure 1-1 The left picture is the result after executing b=a, and the right
picture is the result after executing a= "hwdong-net.github.io"
6. input() function
Used to accept input from the keyboard, the input is a string. input() can
accept a "prompt string". like:
name = input()
print("name: ",name)
score = input("Please enter your score:")
print("Name: ",name,"Score: ",score)
type(score)
Wang An
name: Wang An
Please enter your score: 56.8
Name: Wang An Score: 56.8
str
input() always returns a str object, but type conversion can be used to convert the input string to other basic types:
float
1.1.3 Operation
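The outputs below presumably come from code like the following sketch, with x = 15 and y = 2:
x = 15
y = 2
print('x + y =', x + y)
print('x - y =', x - y)
print('x * y =', x * y)
print('x / y =', x / y)    # true division
print('x % y =', x % y)    # remainder
print('x // y =', x // y)  # integer (floor) division
print('x ** y =', x ** y)  # exponentiation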
x + y = 17
x - y = 13
x * y = 30
x/y = 7.5
x % y = 1
x // y = 7
x ** y = 225
The comparison operators ==, !=, >, <, >=, and <= compare two values, meaning equal to, not equal to, greater than, less than, greater than or equal to, and less than or equal to, respectively. The result of a comparison is a bool value (True or False). For example:
x = 15
y = 2
print('x > y is',x>y)
print('x < y is',x<y)
print('x == y is',x==y)
print('x != y is',x!=y)
print('x >= y is',x>=y)
print('x <= y is',x<=y)
x > y is True
x < y is False
x == y is False
x != y is True
x >= y is True
x <= y is False
The logical operators and, or, and not represent logical AND, logical OR, and logical NOT, respectively. In logical operations, True, non-zero numbers, and non-empty objects count as true, while False, 0, and empty objects count as false (False).
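One possible reconstruction of the missing example (hypothetical, chosen to match the outputs below; or and and return one of their operands):
print(3 or 2, 3 and 2, not 0, 1 and 2)
print(0 or 2, 0 and 2, not 2)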
3 2 True 2
2 0 False
Python also has bitwise operators (bitwise AND &, bitwise OR |, XOR ^, negation ~, left shift <<, right shift >>) and other operators. Interested readers can look up the details.
x = 3
print(id(x))
x+=2 # x = x+2
print(id(x))
"x+=2" is equivalent to "x = x+2", which means that the original x object is
added to 2 to get an object, and then the variable name x refers to the result
object of this addition. Therefore, the variable name x before and after
represents 2 different objects.
True
False
subscript operator []
You can access an element of a container object by giving subscript
operator [] a subscript, such as:
s = "hwdong"
print(s[0], s[1], s[2], s[3], s[4], s[5])
print(s[-6],s[-5],s[-4],s[-3],s[-2],s[-1])
Subscripts can also be negative integers: -1 refers to the last character and -n refers to the first character.
String formatting
Use the format operator % to format data into a string, creating a new string.
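A sketch of the example (reconstructed from the explanation that follows):
s = '%s %s %f' % ("The score", "of LiPing is: ", 78.5)
print(s)    # The score of LiPing is:  78.500000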
The %s and %f in the format string '%s %s %f' indicate that the first two of the three output items ("The score", "of LiPing is: ", 78.5) are strings, while the last is a real number.
The format() method of str formats a string by replacing the {} placeholders in the string with the arguments of format() in turn.
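A sketch, reusing the data from the % example above:
s = "{} {} {}".format("The score", "of LiPing is: ", 78.5)
print(s)    # The score of LiPing is:  78.5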
1. if statement
if expression:
program block
This means that if the expression after the if keyword evaluates to True or non-zero, the program block under it is executed, for example:
score = float(input())
if score>=60:
print("Congratulations!")
print("Passed the exam.")
60.5
Congratulations!
Passed the exam.
Note: code belonging to the same program block in Python must be indented consistently; otherwise the Python interpreter reports an error. For example:
score = float(input())
if score>=60:
print("Congratulations!")
print("Passed the exam.")
if expression:
code block 1
else:
code block 2
like:
This means: if "expression 1" is True, execute "block 1" and skip the remaining blocks; otherwise, if "expression 2" is True, execute "block 2"; otherwise, if "expression 3" is True, execute "block 3"; if all the preceding expressions are False, execute the block in the else clause.
like:
score = float(input("Please enter the student's score: "))
if score < 60:     # if score < 60, execute this if block
    print("Failure")
elif score < 70:   # otherwise, if score < 70, execute this elif block
    print("Pass")
elif score < 80:   # otherwise, if score < 80, execute this elif block
    print("Medium")
elif score < 90:   # otherwise, if score < 90, execute this elif block
    print("Good")
else:              # otherwise (all other cases), execute this else block
    print("Excellent")
2. while statement
The format of the while statement is as follows
while expression:
code block
That is, while the "expression" after the keyword while is True, the block under it is executed repeatedly. For example:
i = 1
s = 0
while i<=100:
s = s+i; #equivalent to s += i
i+=1
print(s)
5050
3. for statement
The for keyword also represents a loop statement, which means iterative
access to each element in a container object or iterable object. The format
is:
for e in container:
code block
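For example, the loop below (a reconstruction consistent with the output that follows) iterates over the string "hwdong":
for ch in "hwdong":
    print(ch, end=",")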
h,w,d,o,n,g,
Each element of the string, i.e., each character, is visited in turn, and the character ch is printed with the print() function.
1. list (list)
A list is an ordered sequence of data elements (objects). A list object is defined with a pair of square brackets [ ], with the data elements separated by commas. For example:
a = [2,5,8]
print(a)
type(a)
[2, 5, 8]
list
The data elements in a list can be of different types, and can even be other list objects, such as:
my_list =[2, 3.14,True,[3,6,9],'python']
print(my_list)
print(type(my_list)) # Print the type of my_list,
that is, the list type
Another example:
a = [[1,2,3],[4,5,6]]
print(a)
index
As with strings, an element can be accessed by subscripting:
print("my_list[0]:",my_list[0])
print("my_list[3]:",my_list[3])
print("my_list[-2]:",my_list[-1])
my_list[0]: 2
my_list[3]: [3, 6, 9]
my_list[-1]: python
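The assignment presumably executed here (lost in conversion) makes a list element refer to a new object:
my_list[-2] = [8, 9]
print(my_list)    # [2, 3.14, True, [8, 9], 'python']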
After this assignment, the element at subscript -2 points to the new object [8, 9], as shown in Figure 1-2:
Figure 1-2 Assigning different objects to list elements makes the list
elements point to different objects
slice
You can use [start:end:step] to access a sublist of the elements of a list object selected by the start subscript, the end subscript, and the step size. This way of accessing list objects is called slicing.
print(my_list)
print(my_list[2:4])
print(my_list[0:4:2])
[2, 3.14, True, [8, 9], 'python']
[True, [8, 9]]
[2, True]
The default step is 1. If the start subscript is not specified, it defaults to 0; if the end subscript is not specified, it defaults to the position after the last element. For example:
list_2 = my_list[:] # all elements
print(list_2)
my_list[:] returns a list of all the elements. Note: the slice operation creates a new list object, so the following code outputs different id values:
print(id(my_list))
print(id(list_2))
1535447525504
1535447326144
If the slice operation is placed on the left side of the assignment statement,
it means to modify the content of the sublist corresponding to the slice, such
as:
print(my_list)
my_list[2:4] = [13, 9]
print(my_list)
You can even use a for loop over a container or iterable inside [ ] to create a new list object (a list comprehension). For example:
alist = [e**2 for e in [0,1,2,3,4,5]]
print(alist)
[0, 1, 4, 9, 16, 25]
The built-in function range(n) returns an iterable object that yields the integers from 0 to n (not including n). It is not a container, but you can still use for to traverse its elements.
for e in range(6):
print(e, end = ' ')
print()
0 1 2 3 4 5
2. tuple (tuple)
Like a list, a tuple is an ordered sequence of data elements (objects); each element has a unique subscript. A tuple is defined with parentheses instead of square brackets. For example:
t = ('python',[2,5],37,3.14,"https://hwdong.net")
print(type(t))
print(t[1:4])
print(t[-1:-4:-1])
<class 'tuple'>
([2, 5], 37, 3.14)
('https://hwdong.net', 3.14, 37)
The elements of a tuple cannot be modified, just as the characters of a string cannot be modified.
t[1]=22
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-70d00e4ef536> in <module>
----> 1 t[1]=22
TypeError: 'tuple' object does not support item assignment
3. set (collection)
A set is an unordered collection with no duplicate elements. A set is written as a group of comma-separated elements surrounded by curly braces {}; the element types can differ. For example:
s = {5,5,3.14,2,'python',8}
print(type(s))
print(s)
<class 'set'>
{2, 3.14, 5, 8, 'python'}
You can use the add() and remove() methods to add and delete an element of a set. A list object uses append() or insert() to add or insert elements; pop() deletes the last element, and remove() deletes the first element with the specified value.
s.add("hwdong")
print(s)
s.remove("hwdong")
print(s)
alist.append("hwdong")
print(alist)
alist.insert(2,"net")
print(alist)
alist.pop()
print(alist)
alist.remove("net")
print(alist)
But immutable objects such as tuples have no functions like append() or insert() for adding elements. The following code is an error:
t.append("hwdong")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-34fd50c7f43a> in <module>
----> 1 t.append("hwdong")
AttributeError: 'tuple' object has no attribute 'append'
You can also create a set object with a {} comprehension, such as:
nums = {x**2 for x in range(6)}
print(nums)
4. dict (dictionary)
A dict is an unordered collection of "key-value" pairs; each element is stored in the form "key: value". For example:
d = {1:'value', 'key':2, 'hello': [4,7]}
print(type(d))
print(d)
<class 'dict'>
{1: 'value', 'key': 2, 'hello': [4, 7]}
To access the value of an element in a dict, you pass the corresponding key (also called the keyword). For example:
d['hello']
[4, 7]
If no element with a given key exists, it is illegal to access an element through that key, such as:
d[3]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-35-0acadf17a380> in <module>
----> 1 d[3]
KeyError: 3
But you can assign a value to a key that does not yet exist; a new "key-value" pair is then added to the dict. For example:
d[3] = "python"
print(d)
print(d[3])
You can define a dict object that represents student information and uses
name as a key:
students={"LiPing":[21,"Compu01",15370203152],"ZhangWei":
[20,"Compu02",17331203312]
,"ZhaoSi":[22,"mecha03",16908092516]}
print(students)
print(students["ZhangWei"])
Of course, you can also use a {} comprehension to create a dictionary object.
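A sketch of such a dict comprehension (a hypothetical example):
squares = {x: x**2 for x in range(4)}
print(squares)    # {0: 0, 1: 1, 2: 4, 3: 9}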
1.1.6 Functions
Python defines functions with the keyword def, which gives a program block a name; the code in the function body can then be called and executed through the function name. For example:
def hwdong():
    print("My youtube channel is: ","hwdong")   # call the built-in function print()
    print("My station B number is: hw-dong")
    print("My blog is: https://hwdong-net.github.io")
hwdong()
print()   # call the built-in function print()
hwdong()
print()   # call the built-in function print()
hwdong()
My youtube channel is:  hwdong
My station B number is: hw-dong
My blog is: https://hwdong-net.github.io
Functions can have parameters, so that corresponding arguments can be passed to the function when it is called, such as the following function that computes x^n:
def pow(x,n):
    ret = 1
    for i in range(n):  # 0,1,2,...,n-1
        ret *= x        # ret = ret*x
    return ret          # "return ret" returns the value of the function
Calling this function passes two actual arguments to the formal parameters x and n of the function.
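For example (a reconstruction consistent with the outputs below):
print(pow(3, 2))
print(pow(2, 4))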
9
16
The return value of the pow() call is passed to print() to be printed. Function parameters can have default values: if the corresponding argument is not provided when the function is called, the parameter takes its default value.
def pow(x,n=2):
ret = 1
for i in range(n):
ret *=x
return ret
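Calls consistent with the outputs below (a reconstruction):
print(pow(3.5))      # n takes its default value 2
print(pow(3.5, 3))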
12.25
42.875
def fact(n):
    if n==1:                # if n equals 1, return the value 1 directly
        return 1
    return n * fact(n - 1)  # if n is greater than 1, the result is the product of n and fact(n-1)
fact(4) # output: 24
24
math package
Many mathematical functions are defined in the math package. To use them, you need to import the package:
print(math.sqrt(2))
1.4142135623730951
import math
def circle(r):
area = math.pi*r**2
perimeter = 2*math.pi*r
return area,perimeter
area,p = circle(2.5)
print("The area and circumference of a circle with a
radius of 2.5 are: %5.2f,%5.2f"%(area,p))
area,p=circle(3.5)
print("The area and circumference of a circle with a
radius of 3.5 are: %5.2f,%5.2f"%(area,p))
A function can return multiple values, which are actually returned as a tuple
object.
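A sketch consistent with the outputs below:
result = circle(2.5)
print(type(result))           # the two return values are packed into a tuple
print(result[0], result[1])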
<class 'tuple'>
19.634954084936208 15.707963267948966
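The next two output lines ("3 5" and "6") come from the scope example analyzed below; a reconstruction:
global_x = 6
def f():
    x = 3
    global_x = 5    # defines a *local* variable, not the global one
    print(x, global_x)
f()
print(global_x)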
3 5
6
The statement "global_x = 5" inside the function does not modify the
external global variable global_x but defines a local variable global_x
pointing to object 5, which has no effect on the external global variable
global_x. Therefore, after the function f() is executed, the internal function
The local variable is destroyed, and the global variable global_x is still 6.
If you want to access global variables inside the function, you need to use
the keyword global to declare a variable as a global variable inside the
function, such as:
global_x = 6
def f():
global global_x
x = 3
global_x = 5
print(x,global_x)
f()
print(global_x)
3 5
5
Here global_x inside the function is declared to refer to the external global variable global_x, so modifying it modifies the external global variable. Therefore the print statement after calling f() outputs 5 instead of 6.
global_x = 6
a = [1,2,3]
def f(y,z):
x = 3
y = 5
z[0] = 10
print(y)
print(z)
f(global_x,a)
print(global_x)
print(a)
5
[10, 2, 3]
6
[10, 2, 3]
An anonymous function can be defined with the keyword lambda, such as: lambda x: x ** 2. Usually the = operator is used to give the lambda function a name, such as:
double = lambda x: x ** 2
double(3.5)
12.25
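The print_msg() function used below is not shown above; a reconstruction based on the explanation that follows:
def print_msg(msg):
    def printer():    # nested function; it can access the enclosing msg
        print(msg)
    return printer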
another = print_msg("Hello")
another()
Hello
Nested functions can access variables in the enclosing scope, for example,
the printer() function can access the local variables of the print_msg()
function (including the parameter msg).
def make_pow(n):
    def pow(x):
        return x ** n  # pow() can access the variables of make_pow() (i.e., n)
    return pow
The function object pow() returned by make_pow() can access the variable
(ie n) of make_pow().
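The function objects pow3 and pow5 used below are presumably created like this:
pow3 = make_pow(3)   # pow3(x) computes x ** 3
pow5 = make_pow(5)   # pow5(x) computes x ** 5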
print(pow3(9))
print(pow5(3))
print(pow5(pow3(2)))
729
243
32768
def infinite_sequence():
num = 0
while True:
yield num
num += 1
Calling the generator function infinite_sequence() returns an iterable (generator) object whose values are produced by yield:
iterator = infinite_sequence()
print(next(iterator))
print(next(iterator))
0
1
The variable name iterator refers to the generator object returned by the call; the next() function fetches its next value. You can also traverse this iterable object with for:
for i in infinite_sequence():
print(i, end=" ")
if i>5:
break
0 1 2 3 4 5 6
The preceding str, list, etc. are all classes. Generally, an object of a class can be created through the class name; such an object is a specific instance. For example:
s = str("http://hwdong.net")
print(type(s)) #str
location = s.find("hwdong")  # query whether there is a substring with str's find() method,
                             # and return the location of the substring
print(location)
alist = list(range(6))       # [0,1,2,3,4,5]
blist = alist.copy()         # create a copy of alist with list's copy() method
blist[2] = 20
print(alist)
print(blist)
<class 'str'>
7
[0, 1, 2, 3, 4, 5]
[0, 1, 20, 3, 4, 5]
As you can see, the member access operator . can be used to call class methods on an object to perform operations on it (read information, modify the object, or create a new object). For example, s.find() searches s for a substring equal to "hwdong" and returns the substring's position, while alist.copy() creates a list object with the same content as alist and makes blist refer to the newly created list object.
A class is defined in Python with the keyword class. In order to describe the
common attributes of all students, a Student class can be defined.
class Student:
def __init__(self, name, score):
self.name = name
self.score = score
def print(self):
print(self.name,",",self.score)
The following code defines two objects s1 and s2 of the Student class and calls the print() method of the class through them. Student's print() method calls the built-in function print() to output the name and score of the object referred to by self.
s1 = Student("LiPing",67)
s2 = Student("WangQiang",83)
s1.print()
s2.print()
LiPing , 67
WangQiang , 83
Each object has its own separate instance properties, and changing the
instance properties of one object will not affect the instance properties of
other objects. In addition to instance attributes, you can also define class
attributes for a class, which are attributes shared by all objects of the class.
A class attribute is an attribute defined outside a method of a class.
For example, the modified Student class adds a class attribute count, which
indicates how many specific class objects have been created from this class.
Its initial value is 0. Whenever a class object is created, its count is
increased.
class Student:
count=0
def __init__(self, name, score):
self.name = name
self.score = score
Student.count +=1
def print(self):
print(self.name,",",self.score)
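The outputs 0, 1, 2 below presumably come from code like this sketch:
print(Student.count)          # 0: no object has been created yet
s1 = Student("LiPing", 67)
print(Student.count)          # 1
s2 = Student("WangQiang", 83)
print(Student.count)          # 2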
0
1
2
The plot() function of the pyplot module can directly plot 2D data, such as:
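A sketch consistent with the output and explanation below (the original code was lost in conversion):
import matplotlib.pyplot as plt
y = [0.5*i for i in range(10)]
print(y)
plt.plot(y)   # only the vertical-axis coordinates are given
plt.show()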
[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
Although only the array y of vertical-axis coordinates is given, the plot() function automatically generates horizontal-axis coordinates starting from 0 by default. Of course, you can pass 2 arrays representing the x and y coordinates, such as:
The plot() function can also accept some parameters to customize the style
of the drawn graphics, such as:
import math
import matplotlib.pyplot as plt
x = [i*0.2 for i in range(50)]
y = [math.sin(xi) for xi in x]
y2 = [math.cos(xi) for xi in x]
y3 = [0.2*xi for xi in x]
plt.plot(x, y,'r-')
plt.plot(x, y2,'bo')
plt.plot(x, y3,'g:')
plt.legend(['sin(x)', 'cos(x)','0.2x'])
plt.show()
Here, the r in 'r-' means red and - means a solid line; the b in 'bo' means blue and o means circular dots; the g in 'g:' means green and : means a dotted line.
In addition to the plot() function that can be drawn in the pyplot module,
there are other functions for drawing other types of graphs, such as scatter()
for drawing scattered point graphs. like:
import math
import matplotlib.pyplot as plt
x = [i*0.2 for i in range(50)]
y = [math.sin(xi) for xi in x]
y2 = [math.cos(xi) for xi in x]
y3 = [0.2*xi for xi in x]
plt.scatter(x, y, c='r', s=6, alpha=0.2)
plt.scatter(x, y2,c='g', s=18, alpha=0.9)
plt.scatter(x, y3,c='b', s=3, alpha=0.4)
plt.legend(['sin(x)', 'cos(x)','0.2x'])
plt.show()
The parameter c represents the color, its values 'r', 'g', and 'b' represent red,
green, and blue respectively, the s parameter represents the size of the point,
and alpha represents the transparency of the graph.
subplot()
A figure (the window object used to display plots) can be divided into multiple sub-regions that display different plots. The subplot() function specifies in which sub-plot region to draw.
import math
import matplotlib.pyplot as plt
x = [i*0.2 for i in range(50)]
y = [math.sin(xi) for xi in x]
y2 = [math.cos(xi) for xi in x]
y3 = [0.2*xi for xi in x]
fig = plt.gcf()
fig.set_size_inches(12, 4, forward=True)
plt.subplot(1, 2, 1)
plt.plot(x, y,'r-')
plt.plot(x, y2,'bo')
plt.title('sin(x) and cos(x)')
plt.legend(['sin(x)', 'cos(x)'])
plt.subplot(1, 2, 2)
plt.plot(x, y3,'g:')
plt.title('0.2x')
plt.show()
The above code first obtains the figure object of the current drawing window through fig = plt.gcf() and assigns it to the variable fig, and then modifies the figure's default width and height by calling its set_size_inches() method; forward=True means the window size is updated immediately.
Axes objects
The subplot() function returns an axes object. We can use this to specify
which subplot is active at any time:
# http://www.math.buffalo.edu/~badzioch/MTH337/PT/PT-
matplotlib_subplots/PT-matplotlib_subplots.html
from math import pi
plt.figure(figsize=(8,4))
plt.subplots_adjust(hspace=0.4)
plt.show()
mplot3d
As with other axes, use the projection='3d' keyword to create an Axes3D object: create a matplotlib.figure.Figure and add an axes of type Axes3D:
mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.gca(projection='3d')
plt.show()
def randrange(n, vmin, vmax):
    # helper used by this example (as in the matplotlib gallery demo):
    # n random values uniformly drawn from [vmin, vmax)
    return (vmax - vmin) * np.random.rand(n) + vmin

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
n = 100
# For each set of style and range settings, plot n random points in the box
# defined by x in [23, 32], y in [0, 100], z in [zlow, zhigh].
for c, m, zlow, zhigh in [('r', 'o', -50, -25), ('b', '^', -30, -5)]:
    xs = randrange(n, 23, 32)
    ys = randrange(n, 0, 100)
    zs = randrange(n, zlow, zhigh)
    ax.scatter(xs, ys, zs, c=c, marker=m)
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.show()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plt.show()
120
[[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]
...
[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]]
fig = plt.figure()
ax = fig.gca(projection='3d')
# Make data.
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
ax.plot_surface(X, Y, Z)  # draw the surface (this call appears to have been lost in conversion)
plt.show()
https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html
display image
You can use imshow() to display an image; before that, you can use the imread() function of the io module of the skimage library to read the image. The image is returned as a multidimensional array (ndarray) object of the numpy library. numpy has many convenient functions for processing multidimensional arrays; for example, uint8() converts numpy arrays with other element types into uint8, an unsigned integer type with value range [0, 255].
import numpy as np
import matplotlib.pyplot as plt
import skimage.io

img = skimage.io.imread('../imgs/lenna.png')  # original image
img_tinted = img * [1, 0.95, 0.9]             # 3 color channel values multiplied by different coefficients
plt.subplot(1, 2, 1)
plt.imshow(img)
plt.subplot(1, 2, 2)
plt.imshow(np.uint8(img_tinted))              # convert the real-valued img_tinted image to uint8 unsigned integers
plt.show()
There are more detailed Python tutorials on the author's blog site
(https://hwdong-net.github.io).
1.2 tensor library numpy
A tensor, also called a multidimensional array, is a regular arrangement of multiple values. Tensor operations are the most important operations in deep learning. This section introduces the Python tensor library numpy.
1 Vector
In physics, a vector is a quantity with both magnitude and direction; force and velocity are such quantities. By decomposing it along coordinate axes, a vector can be expressed as an ordered set of numbers, such as (2, 5, 8), so that it can be studied mathematically. In linear algebra, a vector is defined as an ordered set of numbers, that is, a one-dimensional array (1D tensor). For example, a student's regular grade, experiment grade, final grade, and overall grade can be expressed as the vector (regular grade, experiment grade, final grade, overall grade). The coefficients and unknowns of an equation $a x_1 + b x_2 + c x_3 = d$ can likewise be expressed as vectors. A vector can also be written vertically, such as:

$$\begin{pmatrix} 2 \\ 3 \\ 5 \end{pmatrix}$$
Vectors of this form are called column vectors.
The row and column forms of the same vector are transposes of each other; that is, the transpose of the row vector $(x_1, x_2, \cdots, x_n)$ is the column vector:

$$(x_1, x_2, \cdots, x_n)^T = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

and conversely,

$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}^T = (x_1, x_2, \cdots, x_n)$$
The length of a vector represents the distance from the coordinate point to the origin. This Euclidean distance is usually denoted $\|v\|_2$, namely:

$$\|v\|_2 = \sqrt{x^2 + y^2}$$

For example, $\|OA\|_2 = \sqrt{3^2 + 1^2} = \sqrt{10}$ and $\|OB\|_2 = \sqrt{1^2 + 3^2} = \sqrt{10}$, so $\|OA\|_2 = \|OB\|_2$; that is, the vectors (3, 1) and (1, 3) have the same length. In three-dimensional space: $\|v\|_2 = \sqrt{x^2 + y^2 + z^2}$.
The general extension of the 2-norm is the p-norm: for a positive integer p, the p-norm is defined as

$$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$$

That is, the absolute values of the vector's elements are raised to the p-th power and summed, and then the 1/p-th power of the sum is taken. The p-norm characterizes the size of a vector in different senses. For example, the 1-norm (p = 1) is the sum of the absolute values of all elements of the vector:

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

and the infinity norm is the largest absolute value of the elements:

$$\|x\|_\infty = \max_i |x_i|$$

The usual convention: the 0-norm of a vector is the number of its non-zero elements.
2 Matrix
A matrix in algebra is a rectangular arrangement of scalars; for example, scalars arranged in 3 rows and 4 columns form a 3×4 matrix. A matrix can be viewed as a column vector whose data elements are row vectors:

$$A_{mn} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = \begin{bmatrix} a_{1,:} \\ a_{2,:} \\ \vdots \\ a_{m,:} \end{bmatrix}$$

where $a_{i,:}$ denotes the i-th row vector of the matrix. A matrix can also be viewed as a row vector whose data elements are column vectors:

$$A_{mn} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = [\, a_{:,1} \;\; a_{:,2} \;\; \cdots \;\; a_{:,n} \,]$$
b = [[1,2,3],[4,5,6]]
b[0][2] = 20
print(b)
print(b[0])
print(b[1])
3 Three-dimensional tensors
A color image is composed of three channel images (red, green, and blue), that is, three matrices.
Figure 1-6 A color image is synthesized from the three color matrices red, green, and blue
1. array()
The array() function of numpy is the most commonly used way to create ndarray objects. It can create a multidimensional array (ndarray) object from a sequence or iterable object. For example, the following code creates a 1D tensor (vector) and a 2D tensor (matrix), respectively:
import numpy as np
a = np.array([1,3,2])            # create a one-dimensional vector (tensor) a
print(a)
print(a.shape)
b = np.array([[1,3,2],[4,5,6]])  # create a two-dimensional tensor (matrix) b
print(b)
print(b.shape)                   # axis=0
[1 3 2]
(3,)
[[1 3 2]
[4 5 6]]
(2, 3)
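For reference, the signature of array() is approximately:
numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)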
The first parameter, object, is required: the ndarray object is created from it; for example, np.array([1,3,2]) creates a one-dimensional array from the list object [1,3,2]. dtype specifies the data type of the array elements; the default None means the same type as the elements of object. copy defaults to True, meaning a copy is created that does not share data storage with object. order indicates the order in which the elements of object are arranged in the created array; the default is 'K', which keeps the source's element order.
a = np.array([1,2,3,4])
print(a.dtype)
print(a)
b = np.array([1,2,3,4], dtype=np.float64)
print(b.dtype)
print(b)
int32
[1 2 3 4]
float64
[1. 2. 3. 4.]
2. Multidimensional array type ndarray
The ndarray class of numpy is used to represent multidimensional arrays.
The following are some common attributes of ndarray class objects:
ndarray.ndim
The number of axes (dimensions) of the array, that is, the rank of the
array.
ndarray.shape
The shape of the array is a tuple of integers, and each integer in the
tuple represents the length (number of data elements) of the
corresponding dimension (axis) of the array.
ndarray.size
The total number of array elements is equal to the product of the tuple
elements in the shape attribute.
ndarray.dtype
The data type of the elements in the array.
ndarray.itemsize
The size in bytes of each element in the array. For example, an array whose element type is float64 has an itemsize value of 8 (= 64/8).
ndarray.data
Stores the memory address of the actual array element, usually you
don't need to use this attribute, because you can always access the
elements in the array by subscript.
a= np.array([1.,2.,3.])
print(a.ndim,a.shape,a.size,a.dtype,a.itemsize,a.data)
b= np.array([[1,2,3],[4,5,6]])
print(b.ndim,b.shape,b.size,b.dtype,b.itemsize,b.data)
The shape attribute of an ndarray represents the shape of the tensor: (3,) indicates that a is a one-dimensional tensor with 3 elements, and (2, 3) indicates that b is a two-dimensional tensor whose first dimension (rows) has 2 elements and whose second dimension (columns) has 3 elements, i.e., b is a matrix with 2 rows and 3 columns:
$$b = \begin{bmatrix} 1 & 3 & 2 \\ 4 & 5 & 6 \end{bmatrix}$$
ndim is the dimension (number of axes) of the tensor (array). Axes are numbered from 0; the above b has axis=0 and axis=1, as shown in Figure 1-8:
Figure 1-8 Tensor axis: axis=0 means the first dimension (axis), axis=1
means the second dimension (axis),
print(a[2])
print(b[1,2])
3.0
6
3. asarray()
array() copies by default; that is, the newly created ndarray does not share data storage with the incoming object. If you do not need a copy and simply want to convert the incoming object into an ndarray, you can use asarray(), a simplified wrapper of array(). asarray() does not copy the original data and has fewer parameters:
numpy.asarray(a, dtype = None, order = None)
asarray() simply calls the array() function to create a new ndarray object that shares data storage with the incoming a. Of course, if the incoming a is an iterable object rather than an array, the storage cannot be shared, and the new object's data points to a newly allocated block of memory holding all the elements.
d = np.asarray(range(5))
print(d)
e = np.asarray([1,2,3,4,5])  # asarray can also create an ndarray array object
                             # from a sequence or iterable object
print(e)
print(type(e))
[0 1 2 3 4]
[1 2 3 4 5]
<class 'numpy.ndarray'>
[ 1 2 20 4 5]
[ 1 2 20 4 5]
<class 'list'>
[[1, 2, 3], [4, 5, 6]]
c = a.astype(np.float64)
print(a.dtype,c.dtype)
a[0][0] = 100
print(a)
print(c)
int32 float64
[[100 2 3]
[ 4 5 6]]
[[1. 2. 3.]
[4. 5. 6.]]
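The three outputs below presumably come from code like this sketch:
a = np.arange(6)
print(a)
b = a.reshape(2, 3)
print(b)
print(b.astype(np.float64))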
[0 1 2 3 4 5]
[[0 1 2]
[3 4 5]]
[[0. 1. 2.]
[3. 4. 5.]]
arange() produces an arithmetic sequence: the initial value is start and the common difference is step (also called the step size), up to stop (but not including stop). The element type can be specified with dtype. start defaults to 0, step defaults to 1, and dtype defaults to None; all of them may be omitted. Note that the resulting array does not contain the stop value.
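For reference, the signature of arange() is approximately:
numpy.arange([start, ]stop[, step], dtype=None)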
print(np.arange(5)) #Only specify end, start and step
default to 0 and 1
print(np.arange(2,5))
print(np.arange(2,7,2))
[0 1 2 3 4]
[2 3 4]
[2 4 6]
For example:
np.logspace(2.0, 3.0, num=5)
The full() function fills an array of a given shape with a given value, for example:
np.full((2, 3),np.inf)
np.full((2, 3),3.5)
Similar to the full() function, numpy's empty(), zeros(), ones(), and eye()
respectively create uninitialized arrays with a value of 0, a value of 1, and 1
on the diagonal and 0 on the rest.
numpy.empty(shape, dtype = float, order = 'C')
numpy.zeros(shape, dtype = float, order = 'C')
numpy.ones(shape, dtype = None, order = 'C')
numpy.eye(N, M=None, k=0, dtype=<class 'float'>, order='C')
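The outputs below presumably come from code like this sketch:
a = np.empty((2, 3))   # uninitialized values, not printed here
b = np.zeros((2, 3))
c = np.ones((1, 2))
d = np.eye(2)
print(b)
print(c)
print(d)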
[[0. 0. 0.]
[0. 0. 0.]]
[[1. 1.]]
[[1. 0.]
[0. 1.]]
print(a.shape,b.shape,c.shape,d.shape)
Both the standard normal distribution and the general normal distribution describe the probability of a random variable taking different values (probability background is introduced later). A random variable x obeying a normal distribution has the probability density function

$$N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

For the standard normal distribution, $\mu = 0$ and $\sigma = 1$, and the density reduces to $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.
Some random number functions also have aliases. For example, the
random() function has an alias function random_sample(), that is, both are
the same function, and both generate uniformly sampled random numbers
in the [0,1] interval. If you want to generate uniformly sampled random
numbers in the [a,b] interval, you only need to do a simple linear
transformation on the generated array. For example: (b - a) *
random_sample() + a can generate an array of random numbers in the
interval [a,b]. The following code creates an array of random numbers
between [2,7].
5 * np.random.random_sample((2, 3)) +2
9. Append, Repeat & Tile, Merge & Split, Edge Padding, Add Axis & Swap Axis
We saw earlier that ndarray's astype() and reshape() create new tensors (arrays) by changing the element type or the shape of a tensor. numpy also has many functions and methods that create new ndarray objects from existing arrays by appending, repeating, tiling, merging, splitting, padding edges, adding axes, swapping axes, and so on.
Append
numpy's append() adds content to an existing array to create a new array; its signature is:
numpy.append(arr, values, axis=None)
It appends the content of values after the array arr; axis indicates along which axis to append. The default is None, in which case a flattened one-dimensional array is created.
a = np.array([1,2,3])
b= np.append(a,4)
print(a)
print(b)
np.append([1, 2, 3], [[4, 5, 6], [7, 8, 9]])
[1 2 3]
[1 2 3 4]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Repeat repeat()
The repeat function repeat() creates a new ndarray by repeating the elements of an array along an axis.
a = np.array([[1,2],[3,4]])
np.repeat(a, 2)  # create a flattened (one-dimensional) array in which
                 # each element of a is repeated twice
array([1, 1, 2, 2, 3, 3, 4, 4])
np.repeat(a, 2,axis=0)
array([[1, 2],
[1, 2],
[3, 4],
[3, 4]])
np.repeat(a, 2, axis=0) repeats each element (i.e., each row) twice along axis 0, and np.repeat(a, 2, axis=1) repeats each column twice along axis 1.
np.repeat(a, 2,axis=1)
array([[1, 1, 2, 2],
[3, 3, 4, 4]])
Tile tile()
Unlike repeat(), which repeats elements along an axis, the tiling function tile(A, reps) copies the entire array vertically or horizontally, like laying tiles.
numpy.tile(A, reps)
A is the array to be tiled, and reps gives the number of repetitions along each axis. If the length of reps is less than A.ndim (for example, A's shape is (2, 3, 4, 5) and reps=(2, 2)), reps is padded with 1s in front, becoming (1, 1, 2, 2). If A.ndim is less than the length of reps, A is promoted to an array with as many dimensions as reps: if a one-dimensional tensor A has shape (3,) and reps=(2, 2), A is treated as a two-dimensional tensor of shape (1, 3).
a = np.array([1, 2,3])
b = np.tile(a, 2)  # a is repeated 2 times to create a new array
print(a)
print(b)
[1 2 3]
[1 2 3 1 2 3]
When tiling array a with reps (2, 2), the one-dimensional tensor [1, 2, 3] is first promoted to the two-dimensional tensor [[1, 2, 3]] and then tiled:
np.tile(a, (2, 2)) # a is tiled in 2 rows and 2 columns,
creating a new array
array([[1, 2, 3, 1, 2, 3],
[1, 2, 3, 1, 2, 3]])
In the following example, the shape of c is (2, 2) and reps=2 has length 1, so reps first becomes (1, 2) and the array is then tiled: axis 0 is repeated once (the row direction is unchanged) and axis 1 is repeated twice.
c = np.array([[1, 2], [3, 4]])
print(c)
np.tile(c, 2)  # reps first becomes (1,2), meaning the first axis repeats once
               # and the second axis repeats twice
[[1 2]
[3 4]]
array([[1, 2, 1, 2],
[3, 4, 3, 4]])
The following is just the opposite:
np.tile(c, (2, 1))
array([[1, 2],
[3, 4],
[1, 2],
[3, 4]])
Note: repeat() copies elements along one axis of the array (if no axis is specified, it copies each element into a flat array), while tile() copies the entire array.
Merge concatenate()
The concatenation function concatenate() and the stacking function stack() create new arrays by merging multiple arrays. axis specifies the axis along which to merge; the default is 0. If axis is None, the arrays are merged into a flat one-dimensional array. out defaults to None; if it is not None, the merged result is placed in out.
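For reference, the signature is approximately:
numpy.concatenate((a1, a2, ...), axis=0, out=None)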
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
print(b.T)
c = np.concatenate((a, b), axis=0) # Merge along
the axis=0 axis
d = np.concatenate((a, b.T), axis=1) # Merge along
the axis=1 axis
e = np.concatenate((a, b), axis=None) # Merge into a
flat array
print(c)
print(d)
print(e)
[[5]
[6]]
[[1 2]
[3 4]
[5 6]]
[[1 2 5]
[3 4 6]]
[1 2 3 4 5 6]
Stack stack()
The stacking function stack(arrays, axis=0, out=None) stacks a sequence of arrays along the axis direction into a new array. axis defaults to 0, the first axis; axis=-1 means the last axis.
a = np.array([1, 2])
b = np.array([3, 4])
c = np.array([5, 6])
np.stack((a, b,c))
array([[1, 2],
[3, 4],
[5, 6]])
np.stack((a,b,c),axis=1)
array([[1, 3, 5],
[2, 4, 6]])
array([[1, 4],
[2, 5],
[3, 6]])
(3, 1) (3, 1)
array([[1, 4],
[2, 5],
[3, 6]])
(3,) (3,)
array([1, 2, 3, 4, 5, 6])
(1, 3)
array([[1, 2, 3],
[4, 5, 6]])
array([[1],
[2],
[3],
[4],
[5],
[6]])
vstack() treats a one-dimensional array of shape (N,) as having shape (1, N) when stacking. For example:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack((a,b))  # the shape (3,) of 1D array a is treated as (1,3); the same is true for b
array([[1, 2, 3],
[4, 5, 6]])
Split split()
Splitting is the opposite of merging. The function split() splits an array along an axis (axis defaults to 0). If a list of indices such as [2, 3] is passed, the result consists of the sub-arrays:
ary[:2]
ary[2:3]
ary[3:]
x = np.arange(9.0)
print(x)
np.split(x, 3) # Split into 3 sub-arrays of equal length
[0. 1. 2. 3. 4. 5. 6. 7. 8.]
[array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7., 8.])]
hsplit() and vsplit() are the splitting functions corresponding to the merging
operations hstack() and vstack() respectively, and split the array along the
horizontal (axis=1) and vertical direction (axis=0) respectively. Both of
these split functions are special cases of the split function split().
x = np.arange(16.0).reshape(4, 4)
print(x)
np.hsplit(x, 2) # Split into 2 equal sub-arrays
along the horizontal direction (column direction)
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]]
np.split(x, [1,2])
Edge Padding
The np.pad() function pads the edges of each axis (dimension) of an array, i.e., fills values at the front and back edge positions of each axis (dimension).
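For reference, the signature is approximately:
numpy.pad(array, pad_width, mode='constant', **kwargs)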
array is the input array, pad_width is the padding width (the number of elements), and mode is the padding method: 'constant' means padding with a constant, and constant_values gives the value of the constant. For example:
a = [7,8 ,9 ]
b =np.pad(a, (2, 3), mode='constant', constant_values=(4,
6))
print(a)
print(b)
[7, 8, 9]
[4 4 7 8 9 6 6 6]
(2, 3) means that 2 elements are padded before array a and 3 elements after it; mode='constant' means the padding values are constants, and constant_values=(4, 6) means the constant padded before is 4 and the one padded after is 6. The following sets mode to 'edge', which pads with the values of the edge elements:
array([7, 7, 7, 8, 9, 9, 9, 9])
mode='minimum' pads with minimum values computed from the array. For a multidimensional array, the padding width at the beginning and end of each dimension must be specified, as in the sketch below.
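A reconstruction consistent with the output below, assuming the 2×2 input array [[2, 5], [7, 9]]:
a = np.array([[2, 5], [7, 9]])
np.pad(a, ((1, 2), (2, 3)), mode='minimum')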
array([[2, 2, 2, 5, 2, 2, 2],
[2, 2, 2, 5, 2, 2, 2],
[7, 7, 7, 9, 7, 7, 7],
[2, 2, 2, 5, 2, 2, 2],
[2, 2, 2, 5, 2, 2, 2]])
Here ((1, 2), (2, 3)) pads 1 row before and 2 rows after along the first axis (rows) of a, i.e., 1 row is added above and 2 rows below the two-dimensional array, and pads 2 columns before and 3 columns after along the second axis (columns); each padded entry is a minimum taken along the corresponding axis.
Add Axis
numpy.expand_dims(a, axis) expands the shape of an array by inserting a new axis at position axis. For example:
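A sketch consistent with the outputs below:
x = np.array([3, 5])
print(x.shape)
print(x)
y = np.expand_dims(x, axis=0)   # insert a new axis at position 0
print(y.shape)
print(y)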
(2,)
[3 5]
(1, 2)
[[3 5]]
y = x[np.newaxis,:]
print(y.shape)
print(y)
(1, 2)
[[3 5]]
y = x[:,np.newaxis]
print(y.shape)
print(y)
(2, 1)
[[3]
[5]]
Swap axes
Sometimes the axes of an array need to be swapped. For example, when a color image is read, its color channel may be the third axis (axis=2), while some programs need the color channel on the first axis. Several functions can swap the axes of an array. numpy.swapaxes(a, axis1, axis2) swaps axes axis1 and axis2. For example:
A = np.random.random((2,3,4,5))
print(A.shape)
B = np.swapaxes(A,0,2)  # swap the axes axis=0 and axis=2
print(B.shape)
(2, 3, 4, 5)
(4, 3, 2, 5)
(4, 2, 3, 5)
(4, 3, 2, 5)
(4, 2, 3, 5)
(4, 3, 2, 5)
numpy.transpose(a, axes=None) rearranges the axes of the array according to the order given in axes. The default None means the axes are arranged in reverse order. This is a more general and flexible function. For example:
A = np.random.random((2,4))
print(A)
B = np.transpose(A)
print(B)
C = np.random.random((2,4,3,5))
D = np.transpose(C,(2,0,3,1))
print(D.shape)
Unlike Python lists, indexing and slicing a numpy array does not create a new array but a view (window) onto the original array; the sliced sub-array is part of the original array. Therefore, modifying the slice through a variable that refers to it actually modifies the original array. For example:
import numpy as np
a = np.array([1,2,3,4,5])  # create an array of rank 1, i.e., a one-dimensional array
print(a[0], a[1], a[2])    # access the elements of a with the subscript operator [];
                           # output: 1 2 3
a[0] = 5                   # modify the value of element a[0] with subscript 0
print(a)                   # print the entire array; output: [5 2 3 4 5]
b = a[1:4]                 # a[1:4] returns a slice consisting of the elements from
                           # subscript 1 up to (but not including) subscript 4
print(b)
b[0] = 40                  # slice b is part of a; modifying b modifies the elements of a
print(b)
print(a)
print(a)
1 2 3
[5 2 3 4 5]
[2 3 4]
[40 3 4]
[ 5 40 3 4 5]
Figure 1-10 Array slices with subscripts from 1 to 4 (but not including 4)
Indexing and slicing work the same for multidimensional arrays, i.e. you
can index or slice any dimension.
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)
print(a[2,1])
print(a[2])
print(a[:,1])
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
10
[9 10 11 12]
[ 2 6 10 ]
a[2, 1] is the element in the 3rd row and 2nd column; a[2] is the 3rd row along the 1st axis; a[:, 1] is the 2nd column, where the : for the first axis (axis=0) means all row subscripts, combined with the column subscript 1. As shown in Figure 1-11:
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[[4 3 2]
[8 7 6]]
:2 selects the subscripts (0, 1) of the first axis (axis=0), and -1:-4:-1 selects subscripts of the second axis (axis=1) starting from -1 with step -1, i.e., the subscripts (-1, -2, -3). As shown in Figure 1-12, the column subscripts are in reverse order.
[[1 2 3 4]
[5 6 7 8]]
[6 10]
Similarly, modifying either the array itself or the slice changes both, because the slice's data is part of the original array; the slice is a window onto the original array.
a[0,3]=100
print(a)
print(b)
[[ 1 2 3 100]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[[100 3 2]
[ 8 7 6]]
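The three-dimensional array printed below presumably comes from code like:
a = np.arange(27).reshape(3, 3, 3)
print(a)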
[[[ 0 1 2]
[ 3 4 5 ]
[ 6 7 8]]
[[ 9 10 11]
[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]
[24 25 26]]]
print(a[1, 2])
[15 16 17]
Given the two subscripts 1 and 2 for the 1st axis (axis=0) and 2nd axis (axis=1), the third subscript defaults to :, meaning all subscript values of the third axis are taken. The indexing process is shown in Figure 1-13:
Similarly, a[0, :, 1] selects the first element (the first plane) of the first axis and the second element (column) of the third axis.
print(a[0,:,1])
[1 4 7]
a[:, 1, 2] selects the 2nd element of the 2nd axis and the 3rd element of the 3rd axis, taking all subscript values of the 1st axis.
print(a[:,1,2])
[ 5 14 23 ]
When a numpy array is indexed with slices, the resulting array view is always a sub-array of the original array; that is, the elements of the sub-array are consecutive elements of the original array, because the index values in each dimension are consecutive (for example, 1:3 actually covers the index values 1 and 2). The sub-array obtained by slicing is a window onto the original array and shares data storage with that window region of the original array.
When indexing, you can also pass discrete integer values for each dimension, i.e., an array of integers per dimension. Integer array indexing constructs a new array. For example:
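A reconstruction consistent with the outputs below:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)
print(a[[0, 2], [1, 3]])   # selects the elements at (0,1) and (2,3)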
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[ 2 12 ]
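The next outputs show that integer-array indexing creates a new array that does not share storage with a; a reconstruction:
b = a[[0, 2], [1, 3]]
b[0] = 111        # modifying b leaves a unchanged
print(a)
print(b)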
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[111 12]
Passing integer arrays as indices pairs up the subscripts of each axis into index tuples; above, the two index tuples (0, 1) and (2, 3) select two elements, as shown in Figure 1-14:
Figure 1-14 Indexing with an integer array
As with integer array indexing, Boolean array indexing is used to select the elements of an array that satisfy some condition, creating a new array object that does not share storage. For example:
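The Boolean index array used below is presumably created like this:
bool_idx = a > 2   # a Boolean array: True where the element is greater than 2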
print(a[bool_idx])  # select the elements where the Boolean value is True
print(a[a > 2])     # the two steps above can be combined into one
1. Element-by-element calculation
"Element-by-element" operations such as +, -, *, /, and % can be performed on two multidimensional arrays to produce a new array. For example:
a = np.array([[1,2,3],[4,5,6]])
b = np.array([[7,8,9],[10,11,12]])
print(a+b)
print(a*b)
print(b%a)
[[ 8 10 12]
[14 16 18]]
[[ 7 16 27]
[40 55 72]]
[[0 0 0]
[2 1 0]]
print(np.add(a,b))
print(np.subtract(a,b))
print(np.multiply(a,b))
print(np.divide(a,b))
[[ 8 10 12]
[14 16 18]]
[[-6 -6 -6]
[-6 -6 -6]]
[[ 7 16 27]
[40 55 72]]
[[0.14285714 0.25 0.33333333]
[0.4 0.45454545 0.5 ]]
Hadamard Product
The element-wise product is also known as the Hadamard product or Schur product. The Hadamard product of two vectors is the vector formed by the products of their corresponding elements. For example:
$$\begin{pmatrix} 1 \\ 2 \end{pmatrix} \odot \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 * 3 \\ 2 * 4 \end{pmatrix} = \begin{pmatrix} 3 \\ 8 \end{pmatrix}$$
2. Cumulative calculation
You can use numpy functions or ndarray methods to perform aggregate calculations on an ndarray object, such as the sum (sum()), minimum and maximum (min(), max()), mean (mean()), and standard deviation (std()).
a = np.array([[1,2,3],[4,5,6]])
print(np.max(a),a.max())
print(np.min(a),a.min())
print(np.sum(a),a.sum())
print(np.mean(a),a.mean())
print(np.std(a),a.std())
6 6
1 1
21 21
3.5 3.5
1.707825127659933 1.707825127659933
These functions can also specify which axis of the array to operate on, such
as:
print(a)
print(np.max(a,axis=0),a.max(axis=1))   # np.max(a,axis=0) finds the maximum along the direction of the 0th axis (1st dimension)
print(np.min(a,axis=0),a.min(axis=1))
print(np.sum(a,axis=0),a.sum(axis=1))
print(np.mean(a,axis=0),a.mean(axis=0))
print(np.std(a,axis=0),a.std(axis=0))
[[1 2 3]
[4 5 6]]
[4 5 6] [3 6]
[1 2 3] [1 4]
[5 7 9] [ 6 15]
[2.5 3.5 4.5] [2.5 3.5 4.5]
[1.5 1.5 1.5] [1.5 1.5 1.5]
3. Dot Product
The Hadamard product is an element-wise product, while the dot product of
tensors is a generalization of the vector dot product and the matrix product.
Geometrically, the dot product of two vectors is the product of their lengths
and the cosine of the angle between them, as shown in Figure 1-15:
x ⋅ y = ∥x∥₂ ∥y∥₂ cos(θ)
Therefore, for two vectors of fixed length: if the angle is 0, their dot
product is largest; if the angle is π, which is 180 degrees, the
dot product is smallest (a negative number); and if the angle is π/2,
which is 90 degrees, the dot product is 0.
The dot product of two vectors also equals the length of the projection
of one vector onto the other vector, multiplied by the length of the other vector.
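As a quick illustration of this geometric meaning (a sketch, not the book's code; the two vectors are arbitrary choices), the angle between two vectors can be recovered from their dot product and lengths:
import numpy as np
x = np.array([1.0, 1.0])
y = np.array([1.0, 0.0])
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # 45.0, the angle between x and y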
Matrix product:
If the number of columns of a matrix A (m×n) equals the number of rows of a matrix B (n×l), these two
matrices can be multiplied to give a product matrix C (m×l), where the element cij is the dot
product of the i-th row vector of matrix A and the j-th column vector of matrix
B. That is:
cij = ∑ₖ aik bkj
As shown in Figure 1-16:
Figure 1-16 The element in row 2 and column 1 of the product matrix is the
dot product of the vector in row 2 of the first matrix and the vector in
column 1 of the second matrix
The product of a matrix A (m×n) and a column vector x (n×1):
Ax = ⎡a1,:⎤ x = ⎡a1,: x⎤ = ⎡a11 x1 + a12 x2 + ⋯ + a1n xn⎤
     ⎢a2,:⎥     ⎢a2,: x⎥   ⎢a21 x1 + a22 x2 + ⋯ + a2n xn⎥
     ⎢ ⋮  ⎥     ⎢  ⋮   ⎥   ⎢             ⋮              ⎥
     ⎣am,:⎦     ⎣am,: x⎦   ⎣am1 x1 + am2 x2 + ⋯ + amn xn⎦
You can use numpy's dot() function or the ndarray dot() method to calculate the
dot product of vectors and the product of matrices. The dot() function of
numpy accepts 2 multidimensional arrays and performs the dot product
(multiplication) operation on them:
numpy.dot(a, b, out=None)
a= np.array([1,3])
b= np.array([2,5])
print("a*b:",a*b)
print("dot(a,b):",np.dot(a,b)) #The dot product of
two vectors is a value (scalar)
a*b: [ 2 15]
dot(a,b): 17
a= np.array([[1,2,3],[4,5,6]])
b = np.array([2,5])
c = np.array([2,5,3])
print("a.shape:",a.shape)
print("b.shape:",b.shape)
print("c.shape:",c.shape)
#print("dot(a,b):",np.dot(a,b))
print("dot(b,a):",np.dot(b,a))
print("dot(a,c):",np.dot(a,c))
a.shape: (2, 3)
b.shape: (2,)
c.shape: (3,)
dot(b,a): [22 29 36]
dot(a,c): [21 51]
a= np.array([[1,2,3],[4,5,6]])
b= np.array([[2,5],[1,3],[4,5]])
print("a.shape:",a.shape) # 2*3 matrix
print("b.shape:",b.shape) # 3*2 matrix
print("dot(a,b):",np.dot(a,b))
print("matmul(a,b):",np.matmul(a,b))
print("a@b:",a@b)
For 1-D vectors, np.dot, np.matmul, and the @ operator all give the same scalar result (17 for the vectors a and b used earlier). For the 2-D matrices here, all three compute the matrix product:
a.shape: (2, 3)
b.shape: (3, 2)
dot(a,b): [[16 26]
[37 65]]
matmul(a,b): [[16 26]
[37 65]]
a@b: [[16 26]
[37 65]]
4. Broadcasting
Broadcasting is a powerful mechanism that enables numpy to perform
arithmetic operations on arrays of different shapes. For example, when a
number and an array are combined in an operation, it is as if the number
were first turned into an array of the same size, and the operation then
performed element by element. For example, a+3 below is equivalent to
a + np.array([[3,3],[3,3]]). As shown in Figure 1-17:
a = np.array([[1,2],[3,4]])
print(a)
print(a+3)
print(a+ np.array([[3,3],[3,3]]))
[[1 2]
[3 4]]
[[4 5]
[6 7]]
[[4 5]
[6 7]]
print(a*3)
print(a/3)
[[ 3 6]
[ 9 12]]
[[0.33333333 0.66666667]
[1. 1.33333333]]
b = np.array([1,2])
print(a+b)
[[2 4]
 [4 6]]
In a+b, the axis=0 of b has only 1 element (1 row) while the axis=0 of a
has 2 elements (2 rows), so a+b is computed as if b were repeatedly
stacked along axis=0 into an array of the same size as a before the
element-wise operation. As shown in Figure 1-18.
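The code for the next example is not shown in this excerpt; a sketch that reproduces the following output (the arrays are assumptions read off the output):
a = np.array([[1], [2], [3]])   # shape (3,1)
b = np.array([4, 5])            # shape (2,)
print(a)
print(b)
print(a + b)                    # broadcast to shape (3,2)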
[[1]
[2]
[3]]
[4 5]
[[5 6]
[6 7]
[7 8]]
Figure 1-19 The addition of two arrays of shapes (3,1) and (1,2) is
promoted to the addition of two (3,2) two-dimensional arrays
When two arrays are combined in an operation, the rules for broadcasting are:
If the ranks (numbers of dimensions) of the arrays differ, prepend axes of
length 1 to the array with the smaller rank until the two arrays have the same
rank. For example, a number (rank 0) operated with an array of nonzero rank
is expanded to the same shape as the array.
1.3.1 Functions
Regarding functions, there are different definitions and descriptions. For
example, the area s of a square is a function of its side length e: a
mapping relationship e → s, that is, the side length e is mapped to the area
s. It can also be regarded as an input-output transformation s(e), that is,
x → f → f(x)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-3, 3, 0.1) #
y = 2*x+1
plt.scatter(x, y, s=6)
plt.legend(['f(x)=2x+1'])
plt.show()
f(x) = x² is a quadratic function; all points (x, f(x)) describe a parabola. Other
common basic functions include the exponential function f(x) = eˣ and the sine function
f(x) = sin(x). The following code plots these curves (Figure 1-21).
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-3, 3, 0.1) #
y = np.sin(x)
y0 = np.full(x.shape, 2)
y1 = 2*x
y2 = x**2
y3 = np.exp(x)
fig = plt.gcf()
fig.set_size_inches(20, 4, forward=True)
plt.subplot(1, 5, 1)
plt.scatter(x, y, s=6)
plt.legend(['sin(x)'])
plt.subplot(1, 5, 2)
plt.scatter(x, y0, s=6)
plt.legend(['$2$'])
plt.subplot(1, 5, 3)
plt.scatter(x, y1, s=6)
plt.legend(['$2x$'])
plt.subplot(1, 5, 4)
plt.scatter(x, y2, s=6)
plt.legend(['$x^2$'])
plt.subplot(1, 5, 5)
plt.scatter(x, y3, s=6)
plt.legend(['$e^x$'])
plt.axis('equal')
plt.show()
Figure 1-21 f(x) = sin(x), f(x) = 2, f(x) = 2x, f(x) = x², f(x) = eˣ
Both the linear function y = 2x and the exponential function y = eˣ have
function values (dependent variables) that increase as x increases, but the
exponential function grows very fast. When people say a quantity grows
exponentially, they mean that the quantity grows very quickly.
Arithmetic
The four arithmetic operations refer to constructing a new function by
performing addition, subtraction, multiplication, or division on two
functions. If there are 2 functions f(x), g(x), these 2 functions
transform x into f(x) and g(x) respectively:
f : x → f(x)
g : x → g(x)
If a new transformation relation is defined that transforms each x to
f(x) + g(x), then this is a new functional relation:
x → f(x) + g(x)
This new function x → f(x) + g(x) is called the sum function of the
original 2 functions.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-2, 2, 0.1) #
y = x**2
y2 = np.exp(x)
y3 = x**2 + np.exp(x)
plt.plot(x, y)
plt.plot(x, y2)
plt.plot(x, y3)
plt.legend(['$x^2$','$e^x$','$x^2+e^x$'])
fig = plt.gcf()
fig.set_size_inches(4, 4, forward=True)
#plt.axis('equal')
plt.xlim([-3,3])
plt.show()
Figure 1-22 The curves of the functions x² and eˣ and their sum function x² + eˣ
Similarly, the difference function f(x) − g(x), the product function
f(x)g(x), and the quotient function f(x)/g(x) can be defined.
Composite
Since a function is a transformation, an input-output device, inputting
a quantity x into a function g generates an output g(x); using this
g(x) as the input of another function f produces f(g(x)):
x → g → g(x) → f → f(g(x))
Using the output of one function as the input of another function constitutes
a new transformation, a new function. This new function, formed by
concatenating the original two functions g and f, is called a composite
function, written f ∘ g : x → f(g(x)),
namely f ∘ g(x) = f(g(x)).
For example, the sigmoid function
σ(x) = 1/(1 + e⁻ˣ)
is built from simple functions: the denominator 1 + e⁻ˣ can be regarded as
the sum of the constant function 1 and the function e⁻ˣ; the function e⁻ˣ can
be regarded as the composite of eᵘ and u = −x; and σ(x) itself is the
composite of 1/u and u = 1 + e⁻ˣ.
The following code plots the σ(x) function curve:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-7, 7, 0.1)
y = 1/(1+ np.exp(-x) )
plt.plot(x, y)
plt.legend([r'$\frac{1}{1+e^{-x}}$'])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
It can be seen that the value of the function lies in the [0,1] interval, and the
function value increases with x; that is, the function is monotonically
increasing: for any two numbers x1 < x2, f(x1) < f(x2). At x = 0,
σ(0) = 1/(1 + e⁰) = 1/2 = 0.5; when x → +∞ the value approaches 1, and when
x → −∞ it approaches 0.
1. Limit of a sequence
Consider the sequence {1, 1/2, 1/3, ⋯, 1/n, ⋯}. As n increases infinitely,
the corresponding number 1/n becomes smaller and smaller,
getting closer and closer to the value 0, that is, infinitely approaching 0:
the sequence gradually converges to 0 as n increases infinitely. This 0 is
called the limit of this sequence. The so-called infinite approximation
means that as long as n is sufficiently large, the distance between 1/n and its
limit value 0 is sufficiently small. In other words, for an arbitrarily small
number such as ε = 0.001, an n can always be found so that the distance
between all numbers in the sequence after the n-th term and the limit is
smaller than this ε. For example, after n = 1000 terms, |1/n − 0| < ε.
For another example, it can be proved that the limit of the sequence
{3 − 1, 3 − 1/2, 3 − 1/3, ⋯ , 3 − 1/n, ⋯} is 3.
lim_{n→∞} 1/n = 0
This formula says that as n tends to infinity (n → ∞), the limit value of the
sequence on the left is 0.
A sequence may not have a limit; but if a limit exists, it must be unique.
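A numerical illustration of this convergence (a sketch, not the book's code): for ε = 0.001, every term after n = 1000 is within ε of the limit 0:
import numpy as np
n = np.arange(1, 100001)
seq = 1.0 / n                              # the sequence 1, 1/2, 1/3, ...
eps = 0.001
print(np.all(np.abs(seq[1000:]) < eps))    # True: beyond n = 1000, every term is within eps of 0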
2. Limit and continuity of function
Similarly, the limit lim_{x→x₀} f(x) of a function f(x) at a point x₀ can be defined: it means that when x
is sufficiently close to x₀, the value f(x) is sufficiently close to the limit. For example:
lim_{x→3} x² = 9
means that when the independent variable x is sufficiently close to 3, the value of the dependent variable f(x)
is sufficiently close to 9, that is, the limit of f(x) is 9. For example, for the sequence of independent
variables {3 − 1, 3 − 1/2, 3 − 1/3, ⋯} approaching 3, the limit of the sequence of function values
{(3 − 1)², (3 − 1/2)², (3 − 1/3)², ⋯} is 9.
If the limit value of the function at x₀ exists and is equal to the function value at x₀, i.e.:
lim_{x→x₀} f(x) = f(x₀)
then the function is said to be continuous at the point x₀. Intuitively speaking, the curve corresponding to f(x) is
not broken at x₀.
As shown in Figure 1-26, f(x) = x² is continuous at every independent variable x, which means the function
curve is connected without breaks, so the whole function is continuous. The function f(x) = |x| is
also continuous everywhere. But f(x) = sign(x), defined below, is discontinuous at x = 0:
f(x) = sign(x) = ⎧  1, x > 0
                 ⎨  0, x = 0
                 ⎩ −1, x < 0
Let Δx₀ = x − x₀ and Δf(x₀) = f(x) − f(x₀); then lim_{x→x₀} f(x) = f(x₀) can be expressed as: lim_{Δx₀→0} Δf(x₀) = 0.
The continuity of the function at the independent variable x₀ means that as Δx₀ tends to 0, Δf(x₀) also
tends to 0.
3. Derivatives of functions
The continuity of a function y = f(x) at a point describes whether the dependent variable y = f(x)
changes continuously as the independent variable varies near that point. Sometimes it
is necessary to examine further how quickly the dependent variable changes with the independent variable. For
example, let t represent time, and s the distance traveled by a moving object. Obviously
s changes continuously with t; it will not suddenly jump from one point to another at some moment.
For a moving object, we often care more about how fast it moves, that is, its speed. The average speed
during a period can be expressed as the distance traveled divided by the elapsed time: from time t₀ to time t₀ + Δt,
the elapsed time is (t₀ + Δt) − t₀ = Δt and the distance traveled is s(t₀ + Δt) − s(t₀); their ratio represents the
average speed during this period. To obtain the exact speed at time t₀, compute the limit of this average speed as
Δt tends to 0, that is, the limit value as Δt → 0, and use this limit value as the exact velocity at time t₀:
lim_{Δt→0} (s(t₀ + Δt) − s(t₀))/Δt
In calculus, this limit value is called the derivative of the function s(t) at t₀, written s′(t₀) or
ds/dt|_{t₀}, namely:
s′(t₀) = ds/dt|_{t₀} = lim_{Δt→0} (s(t₀ + Δt) − s(t₀))/Δt
In the same way, the derivative of any function y = f(x) at a point x₀ can be defined as the limit of the ratio of
the increment of the dependent variable to the increment of the independent variable. This derivative
characterizes how fast the dependent variable y changes with the independent variable x at the point x₀: the
larger the absolute value of f′(x₀), the faster y changes.
For example, for f(x) = x²:
f′(3) = lim_{Δx→0} (f(3 + Δx) − f(3))/Δx = lim_{Δx→0} ((3 + Δx)² − 3²)/Δx = lim_{Δx→0} (6 + Δx) = 6
f′(1) = lim_{Δx→0} (f(1 + Δx) − f(1))/Δx = lim_{Δx→0} ((1 + Δx)² − 1²)/Δx = lim_{Δx→0} (2 + Δx) = 2
f′(0) = lim_{Δx→0} (f(0 + Δx) − f(0))/Δx = lim_{Δx→0} ((0 + Δx)² − 0²)/Δx = lim_{Δx→0} Δx = 0
This shows that at x = 0, when x has a small increment Δx, the increment of y is about 0 times Δx: y hardly
changes. At x = 1, when x increases by a small increment Δx, the increment of y is about 2 times that,
i.e. 2Δx. And at x = 3, when x has a small increment Δx, the increment of y is about 6 times it, i.e. 6Δx.
The derivative therefore characterizes how quickly the dependent variable y changes relative to the independent
variable x. The larger the absolute value of the derivative, the more a small increment of x causes a drastic
change in y; the smaller the absolute value (for example, close to 0), the smaller the change in y caused by a
small increment of x, that is, y changes slowly relative to x, just like a moving object that has almost stopped
as time passes.
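These limit values can be checked numerically with a central-difference approximation (a sketch, not the book's code):
import numpy as np

def numeric_derivative(f, x, h=1e-6):
    # central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**2
for x0 in [0.0, 1.0, 3.0]:
    print(x0, numeric_derivative(f, x0))   # approximately 0, 2, 6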
a) The ratio Δy/Δx between the two points (1, 1³) and (1.8, 1.8³) is the rate of change (speed) of the dependent
variable with respect to the change of the independent variable; this ratio is called the slope of the line through
the two points. b) Δy/Δx for Δx = 0.8, 0.6, 0.4 at x = 1. c) As Δx → 0, this ratio (slope) converges to the slope
of the tangent to the curve at x = 1.
If a function f(x) has a derivative at every point x, the function is said to be differentiable everywhere; each x
then corresponds to a derivative value f′(x), so the mapping x → f′(x) is itself a functional relationship.
Such a function is called the derivative function of the original function, denoted f′(x). For example, for f(x) = x²:
f′(x) = lim_{Δx→0} ((x + Δx)² − x²)/Δx = lim_{Δx→0} (2x + Δx) = 2x
According to the definition, it is easy to find the derivative functions of the following elementary functions:
(1) (C)′ = 0;                    (2) (xⁿ)′ = n xⁿ⁻¹ (n ∈ Q)
(3) (sin x)′ = cos x;            (4) (cos x)′ = −sin x
(5) (aˣ)′ = aˣ ln a;             (6) (eˣ)′ = eˣ
(7) (log_a x)′ = (1/x) log_a e;  (8) (ln x)′ = 1/x
1.3.4 The Four Arithmetic Operations of Derivatives and the Chain Derivation Rule
It is unrealistic to compute the derivative function from the limit definition of the derivative for every
function that may be encountered. Fortunately, since all kinds of functions can be constructed from simpler ones
through the four arithmetic operations and function composition, it is easy to prove that the derivative of a
function constructed this way can be computed from the derivatives of the functions that construct it.
For example, (f(x) + g(x))′ = f′(x) + g′(x), that is, the derivative of the sum function is the sum of the
derivatives of the original two functions. According to the limit definition of the derivative, it is easy to prove
the following formulas for the derivatives of functions constructed by the four arithmetic operations:
(f(x) + g(x))′ = f′(x) + g′(x)
(f(x) − g(x))′ = f′(x) − g′(x)
(f(x)g(x))′ = f′(x)g(x) + f(x)g′(x)
(f(x)/g(x))′ = (f′(x)g(x) − f(x)g′(x)) / g(x)²
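A quick numerical sanity check of the product rule at one point (a sketch, not the book's code; the functions sin, exp and the point 0.7 are arbitrary choices):
import numpy as np

def d(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)   # central difference

x0 = 0.7
lhs = d(lambda t: np.sin(t) * np.exp(t), x0)                      # (f g)'(x0)
rhs = d(np.sin, x0) * np.exp(x0) + np.sin(x0) * d(np.exp, x0)     # f'g + f g'
print(lhs, rhs)   # the two values agree to high precision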
Because the derivative of a constant function f(x) = C is 0, the derivative of the product Cf(x) of a
constant C and a function f(x) is:
(Cf(x))′ = C′f(x) + Cf′(x) = Cf′(x)
Thus (f(x)/C)′ = ((1/C)f(x))′ = (1/C)f′(x). Likewise, the derivative of the sum C + f(x) of a constant C and a
function f(x) is:
(C + f(x))′ = C′ + f′(x) = f′(x)
Similarly, the composite function f(g(x)) formed by combining two functions f(x) and g(x) has a derivative
related to the derivatives of the original functions by:
(f(g(x)))′ = f′(g(x))g′(x)
This derivation formula for composite functions is called the chain rule. To differentiate f(g(x)), first find the
derivative f′(g) of f with respect to g, then the derivative g′(x) of g with respect to x, and multiply the two.
For a composite function, an input variable x is always evaluated along the composition of the
function "from inside to outside": first compute g(x), then compute f(g(x)). The process of
computing the derivative of the final value f(g(x)) with respect to the input x runs in reverse: first compute
f′(g), then g′(x), and then multiply the two together. That is to say, the differentiation process finds
the derivative of each function in turn "from the outside to the inside".
For example, with g = x²:
(sin(x²))′ = sin′(g)g′(x) = sin′(g)(x²)′ = cos(g)(2x) = 2x cos(x²)
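The same chain-rule result can be spot-checked numerically (a sketch, not the book's code):
import numpy as np

def d(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.2
print(d(lambda t: np.sin(t**2), x0))   # numeric derivative of sin(x^2) at x0
print(2 * x0 * np.cos(x0**2))          # the chain-rule formula 2x cos(x^2)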
Similarly, for the sigmoid function σ(x) = 1/(1 + e⁻ˣ), the chain rule gives:
σ′(x) = e⁻ˣ/(1 + e⁻ˣ)² = (1/(1 + e⁻ˣ))(1 − 1/(1 + e⁻ˣ)) = σ(x)(1 − σ(x))
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def sigmoid(x):
return 1/(1+np.exp(-x))
x = np.arange(-7, 7, 0.1)
y = sigmoid(x)
dy = sigmoid(x)*(1-sigmoid(x))
plt.plot(x, y)
plt.plot(x, dy)
plt.legend(['$\sigma(x)$','$\sigma\'(x)$'])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
It can be seen that the function curve of σ′(x) is a bell-shaped curve; for all x, σ′(x) > 0, and the derivative
value is largest at x = 0, where σ′(0) = σ(0)(1 − σ(0)) = 0.5 ∗ 0.5 = 0.25. As x tends to infinity, the derivative
value tends to 0. The magnitude of the derivative value shows how quickly the function value changes with the
independent variable. Therefore, σ(x) changes fastest at x = 0, and as x tends to infinity the change of the
function value becomes slower and slower.
The above four arithmetic rules and the chain derivation rule of compound functions can be easily proved
according to the definition of derivatives. Interested readers can prove it by themselves or refer to calculus
textbooks.
As shown in Figure 1-30, the forward calculation proceeds from x through the function g to get
g(x) = x², and then through the function f to get f(g) = sin(g) = sin(x²):
x → g → g(x) = x² → f → f(g) = sin(g) = sin(x²)
The derivative is computed in the reverse order:
f′(x) = f′(g)g′(x) = cos(g)(x²)′ = cos(g)·2x = cos(x²)·2x
Backpropagation (reverse-mode) differentiation is the core and most critical foundation of neural networks and
deep learning. If you understand reverse differentiation, you can easily understand the algorithmic principles of
deep learning.
1.3.6 Partial derivatives and gradients of multivariable functions
Sometimes the independent variable x is a vector composed of multiple components instead of a single value, that
is, x = (x1, x2, ⋯, xn) contains multiple components xj. A function f(x) that maps such an independent
variable x to a single numerical dependent variable is called a multivariate function and can be written
f : Rⁿ → R.
The derivative of f(x) with respect to a single component xj of x is called the partial derivative, denoted
∂f/∂xj; it reflects the rate of change of f(x) with respect to this component xj.
That is, the partial derivative treats the other variables as constants and xj as the variable, so the function
becomes a univariate function of xj, and its derivative with respect to xj is called the partial derivative of the
original function with respect to xj.
For example, if f(x, y) = 2x + y², its argument contains 2 components x and y; this function is a multivariate
function mapping the argument (x, y) to the function value f(x, y), that is, f : (x, y) → 2x + y². Its partial
derivatives are:
∂f/∂x = ∂(2x + y²)/∂x = d(2x)/dx = 2
∂f/∂y = ∂(2x + y²)/∂y = d(y²)/dy = 2y
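The partial derivatives can be approximated numerically by perturbing one component at a time (a sketch, not the book's code):
import numpy as np

def f(x, y):
    return 2 * x + y**2

x0, y0, h = 1.0, 3.0, 1e-6
df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)   # approximately 2
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)   # approximately 2*y0 = 6
print(df_dx, df_dy)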
The gradient of f(x) with respect to x, ∇x f(x), is the vector of the partial derivatives of f(x) with respect to
each component xj:
∇x f(x) = df/dx = (∂f/∂x1, ⋯, ∂f/∂xj, ⋯, ∂f/∂xn) ∈ Rⁿ
The gradient gives a first-order approximation of the change of the function value:
f(x + Δx) − f(x) ≃ ∇f(x) ⋅ Δx
If f(x)'s independent variable x and the increment Δx are written in the form of column vectors:
x = ⎡x1⎤   Δx = ⎡Δx1⎤
    ⎢x2⎥        ⎢Δx2⎥
    ⎢⋮ ⎥        ⎢ ⋮ ⎥
    ⎣xn⎦        ⎣Δxn⎦
then, with the gradient written as a row vector, the dot product of the gradient vector and the increment vector can
be written in the form of a matrix product:
f(x + Δx) − f(x) ≃ ∇f(x)Δx = Δxᵀ∇f(x)ᵀ
If the gradient is also written as a column vector, the dot product of the gradient vector and the increment vector
can be written as the matrix product:
f(x + Δx) − f(x) ≃ ∇f(x)ᵀΔx = Δxᵀ∇f(x)
If the independent variable is written in the form of a row vector, and the gradient is written in the form of a
column vector, that is:
x = (x1, x2, ⋯, xn)
Δx = (Δx1, Δx2, ⋯, Δxn)
∇f(x) = (∂f/∂x1, ∂f/∂x2, ⋯, ∂f/∂xn)ᵀ
then:
f(x + Δx) − f(x) ≃ Δx∇f(x)
If the gradient ∇x f(x), f(x), and x are all written in row-vector form, then:
f(x + Δx) − f(x) ≃ ∇f(x)Δxᵀ = Δx∇f(x)ᵀ
For a multivariate function f(x), if the independent variable x is written in the form of a matrix, then the
gradient of f(x) with respect to x, although it is a vector, is sometimes also written in the form of a matrix of
the same shape as x, so it is easy to see which partial derivative corresponds to which variable, namely:
f′(x) = dy/dx = ⎡∂y/∂x11 ∂y/∂x21 ⋯ ∂y/∂xn1⎤
                ⎢∂y/∂x12 ∂y/∂x22 ⋯ ∂y/∂xn2⎥
                ⎢   ⋮       ⋮    ⋱    ⋮   ⎥
                ⎣∂y/∂x1n ∂y/∂x2n ⋯ ∂y/∂xnn⎦
Whether you write gradients, independent variables, and dependent variables as row vectors, column vectors, or
matrices depends entirely on which form is more helpful for deriving the related formulas. If x is written in
matrix form, writing the gradient in matrix form as well looks more consistent.
For example, if w and x are written in the form of column vectors, the gradients in the form of row vectors, and
y = wᵀx = xᵀw, then ∂y/∂x = wᵀ and ∂y/∂w = xᵀ.
It can be proved that the four arithmetic rules of derivatives and the chain rule also hold for gradients. Let f, g
be two real-valued functions from Rⁿ to R; then:
Linear rule:
(αf + βg)′(x) = ∇(αf + βg)(x) = αf′(x) + βg′(x) = α∇f(x) + β∇g(x)
Product rule:
(fg)′(x) = ∇(fg)(x) = f′(x)g(x) + f(x)g′(x) = g(x)∇f(x) + f(x)∇g(x)
Chain rule:
Let g be a real-valued function from Rⁿ to R and f a real-valued function from R to R. For some x ∈ Rⁿ, let
z = g(x). If x is a column vector and the gradient is in row-vector form, then:
(f ∘ g)′(x) = ∇(f ∘ g)(x) = f′(z)g′(x) = f′(z)∇g(x)
If instead the gradient is written as a column vector, then (f ∘ g)′(x) = ∇g(x)f′(z).
That is, the order of the two different forms of the chain rule is exactly the opposite.
Example 3: If g((x1, x2)) = 3x1 + 2x2³ and f(z) = z², then (f ∘ g)((x1, x2)) = (3x1 + 2x2³)². So:
(f ∘ g)′(x) = f′(z)∇g(x) = 2z ∗ (3, 6x2²) = 2(3x1 + 2x2³) ∗ (3, 6x2²) = (18x1 + 12x2³, 36x1x2² + 24x2⁵)
If the variable is a row vector and the gradient a column vector, with g(x1, x2) = 3x1 + 2x2³ and f(z) = z², then:
(f ∘ g)′(x) = ∇g(x)f′(z) = ⎡ 3  ⎤ 2z = ⎡ 3  ⎤ 2(3x1 + 2x2³) = ⎡18x1 + 12x2³   ⎤
                           ⎣6x2²⎦      ⎣6x2²⎦                 ⎣36x1x2² + 24x2⁵⎦
Example 4: Suppose y, ŷ are 2 vectors in Rⁿ. The square of their Euclidean distance can be used to define the
error (distance) between these 2 vectors, for example:
E(y, ŷ) = ½∥y − ŷ∥₂² = ½((y1 − ŷ1)² + (y2 − ŷ2)² + ⋯ + (yn − ŷn)²)
If there are m functions f1, f2, ⋯, fm of the same independent variable x:
f1 : x → f1(x)
f2 : x → f2(x)
⋮
fm : x → fm(x)
then they can be combined into a single function whose value is a vector:
f(x) = ⎡f1(x)⎤
       ⎢f2(x)⎥
       ⎢  ⋮  ⎥
       ⎣fm(x)⎦
These combined functions are called vector-valued functions. Input an x, and each function produces a function
value fi(x); these function values constitute the vector on the right side above. For example:
f(x) = ⎡ax⎤ with f1(x) = ax, f2(x) = x², f3(x) = eˣ, and f(3) = ⎡3a⎤
       ⎢x²⎥                                                     ⎢9 ⎥
       ⎣eˣ⎦                                                     ⎣e³⎦
The vector-valued function composed of m univariate functions is a mapping (transformation)
f(x) : R → Rᵐ of the real number set R to Rᵐ. If at a certain point x the derivative of each function fi(x)
with respect to x exists, these derivatives are stacked into a vector, which is called the derivative of the
vector-valued function with respect to the independent variable x, written Df(x):
Df(x) = f′(x) = df/dx = ⎡df1/dx⎤ ∈ R^(m×1)
                        ⎢df2/dx⎥
                        ⎢  ⋮   ⎥
                        ⎣dfm/dx⎦
If the independent variable x of a vector-valued function is a vector of more than one variable, such a vector-
valued function is called a multivariate vector-valued function. Let the number of independent variables be n and
the number of functions be m. This is a mapping (transformation) f : Rⁿ → Rᵐ of Rⁿ to Rᵐ: input n values of the
independent variables and output m real numbers.
Each function fi(x) has a gradient with respect to x = (x1, x2, ⋯, xn); stacking these gradient vectors gives a
matrix, called the Jacobian matrix:
Df(x) = f′(x) = df/dx = ⎡∂f1/∂x1 ∂f1/∂x2 ⋯ ∂f1/∂xn⎤ ∈ R^(m×n)
                        ⎢∂f2/∂x1 ∂f2/∂x2 ⋯ ∂f2/∂xn⎥
                        ⎢   ⋮       ⋮    ⋱    ⋮   ⎥
                        ⎣∂fm/∂x1 ∂fm/∂x2 ⋯ ∂fm/∂xn⎦
where f(x) = (f1(x), f2(x), ⋯, fm(x)). If instead each gradient is written as a column, the stacked matrix is the
transpose, in R^(n×m).
As a special case, the derivative of a vector x = (x1, x2, ⋯, xn) with respect to itself is an identity matrix I:
dx/dx = ⎡1 0 ⋯ 0⎤ = I
        ⎢0 1 ⋯ 0⎥
        ⎢⋮ ⋮ ⋱ ⋮⎥
        ⎣0 0 ⋯ 1⎦
Typically, the independent variable and vector-valued functions are written as column vectors, and the gradient of
each function as a row vector; as noted above, this book writes independent variables and vector-valued functions
in whichever form makes the derivation at hand clearest.
The Jacobian matrix is stacked from the gradient vectors of several multivariable real-valued functions;
therefore, the four arithmetic rules and the chain rule of the gradient also apply to it.
Let g be a vector-valued function from Rⁿ to Rᵏ and f a vector-valued function from Rᵏ to Rᵐ, and for some x let
z = g(x). If the vector-valued functions and arguments are in column-vector form, the chain rule reads:
(f ∘ g)′(x) = D(f ∘ g)(x) = Df(z)Dg(x) = f′(z)g′(x)
If the vector-valued functions and arguments, etc. are all in the form of row vectors, then:
(f ∘ g)′(x) = D(f ∘ g)(x) = Dg(x)Df(z) = g′(x)f′(z)
According to the four arithmetic rules of derivatives, for a vector x and a constant vector b, ∇x(αx + βb) = αI.
In the above Example 4, if E(y, ŷ) is regarded as a function of y, this function can be regarded as the composite
of the two functions z = y − ŷ and E(z) = ½∥z∥₂². The gradient of E(y, ŷ) with respect to y is:
E′(z) = zᵀ,  z′(y) = (y − ŷ)′ = I
∇y E(y, ŷ) = E′(z)z′(y) = zᵀz′(y) = zᵀI = zᵀ = (y − ŷ)ᵀ
Example 5: Let
z(x) = ⎡z1(x)⎤ = ⎡2x1 + 4x2 + 7x3⎤
       ⎣z2(x)⎦   ⎣3x1 + 5x2 + 4x3⎦
be a function of x, and
y = 4z1 + 3z2
be a function of z. Then f(x) = y(z(x)) is the composite function of y(z) and z(x); according to the
derivation rules of composite functions:
f′(x) = y′(z)z′(x) = (4, 3)⎡2 4 7⎤ = (17, 31, 40)
                           ⎣3 5 4⎦
Indeed, f(x) = 4(2x1 + 4x2 + 7x3) + 3(3x1 + 5x2 + 4x3) = 17x1 + 31x2 + 40x3.
If instead it is agreed to write the gradient in the form of a column vector, the chain rule is written in the
reverse order:
y′(z) = ⎡4⎤   z′(x) = ⎡2 3⎤   f′(x) = z′(x)y′(z) = ⎡2 3⎤⎡4⎤ = ⎡17⎤
        ⎣3⎦           ⎢4 5⎥                        ⎢4 5⎥⎣3⎦   ⎢31⎥
                      ⎣7 4⎦                        ⎣7 4⎦     ⎣40⎦
When deriving these formulas in the future, we must pay attention to whether the vectors such as gradients are
column vectors or row vectors.
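Example 5 can be reproduced directly with numpy (a small sketch, not the book's code; the arrays are just the Jacobian and gradient written out above):
import numpy as np
dz_dx = np.array([[2, 4, 7],
                  [3, 5, 4]])   # Jacobian of z with respect to x
dy_dz = np.array([4, 3])        # gradient of y with respect to z, row form
print(dy_dz @ dz_dx)            # [17 31 40]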
Example 6: Set z = xW, where x is a row vector and W is an m×n matrix, namely:
z = [z1 z2 ⋯ zn] = [x1 x2 ⋯ xm] ⋅ ⎡w11 w12 … w1n⎤
                                  ⎢w21 w22 … w2n⎥
                                  ⎢ ⋮   ⋮  ⋱  ⋮ ⎥
                                  ⎣wm1 wm2 … wmn⎦
Because ∂zi/∂xj = wji, the vector of partial derivatives ∂zi/∂x is w.i, the i-th column of W, so the Jacobian
matrix is dz/dx = W.
If y is a scalar function of z with gradient dy/dz = (∂y/∂z1, ∂y/∂z2, ⋯, ∂y/∂zn)ᵀ, consider the derivative of y
with respect to an entry wij of W. Since ∂zj/∂wij = xi and ∂zk/∂wij = 0 for k ≠ j:
dy/dwij = (∂y/∂z1)(∂z1/∂wij) + (∂y/∂z2)(∂z2/∂wij) + ⋯ + (∂y/∂zn)(∂zn/∂wij) = (∂y/∂zj) xi
Collecting these entries into a matrix of the same shape as W:
dy/dW = ⎡x1⎤ (∂y/∂z1, ∂y/∂z2, ⋯, ∂y/∂zn) = xᵀ(dy/dz)ᵀ ∈ R^(m×n)
        ⎢x2⎥
        ⎢⋮ ⎥
        ⎣xm⎦
Now set z = Wx + b, with x and b as column vectors, namely:
z = ⎡z1⎤ = ⎡w11 w12 … w1n⎤⎡x1⎤ + ⎡b1⎤
    ⎢z2⎥   ⎢w21 w22 … w2n⎥⎢x2⎥   ⎢b2⎥
    ⎢⋮ ⎥   ⎢ ⋮   ⋮  ⋱  ⋮ ⎥⎢⋮ ⎥   ⎢⋮ ⎥
    ⎣zm⎦   ⎣wm1 wm2 … wmn⎦⎣xn⎦   ⎣bm⎦
The Jacobian matrix of z = Wx + b with respect to x is dz/dx = W ∈ R^(m×n), and with respect to b it is the
identity matrix: f′(b) = dz/db = I ∈ R^(m×m).
If z is regarded as a function of the entries W = (w11, w12, ⋯, wmn), it is a multivariate vector-valued function,
and its Jacobian dz/dW ∈ R^(m×(m×n)) is sparse: the row for zi contains (x1, x2, ⋯, xn) in the positions of the
i-th row of W and zeros elsewhere:
dz/dW = ⎡x1 x2 ⋯ xn  0 ⋯ 0    ⋯⋯   0 ⋯ 0 ⎤
        ⎢0 ⋯ 0   x1 x2 ⋯ xn   ⋯⋯   0 ⋯ 0 ⎥
        ⎢                ⋮               ⎥
        ⎣0 ⋯ 0    0 ⋯ 0    ⋯⋯  x1 x2 ⋯ xn⎦
For easy identification, the derivative of z with respect to W, or the Jacobian matrix, can also be written in the
same form as W.
Finally, let L = f(z(x)) = ½∥z∥₂² be regarded as the composite function of f(z) = ½∥z∥₂² and z(x) = Wx − b. The
gradient of L with respect to x is:
∇x L = f′(z)z′(x) = zᵀW = (Wx − b)ᵀW
Writing Wi for the i-th row of W, so that zi = Wi x − bi, the gradient of L with respect to W is:
∇W L = f′(z)z′(W) = zᵀ dz/dW = ((W1 x − b1)x1, (W1 x − b1)x2, ⋯, (W1 x − b1)xn, ⋯⋯, (Wm x − bm)x1, ⋯, (Wm x − bm)xn)
This is the row-vector form of the gradient. Written in the same shape as W, it is:
∇W L = ⎡(W1 x − b1)x1  (W1 x − b1)x2  ⋯  (W1 x − b1)xn⎤   ⎡W1 x − b1⎤
       ⎢(W2 x − b2)x1  (W2 x − b2)x2  ⋯  (W2 x − b2)xn⎥ = ⎢W2 x − b2⎥ [x1 x2 ⋯ xn] = (Wx − b)xᵀ = zxᵀ
       ⎢      ⋮              ⋮        ⋱        ⋮      ⎥   ⎢    ⋮    ⎥
       ⎣(Wm x − bm)x1  (Wm x − bm)x2  ⋯  (Wm x − bm)xn⎦   ⎣Wm x − bm⎦
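The formula ∇W L = (Wx − b)xᵀ can be checked against a finite-difference gradient (a sketch with small random data, not the book's code):
import numpy as np

np.random.seed(0)
m, n = 3, 4
W = np.random.randn(m, n)
x = np.random.randn(n)
b = np.random.randn(m)

def L_of(W):
    z = W @ x - b
    return 0.5 * np.sum(z**2)        # L = 1/2 ||Wx - b||^2

analytic = np.outer(W @ x - b, x)    # the formula (Wx - b) x^T
numeric = np.zeros_like(W)
h = 1e-6
for i in range(m):
    for j in range(n):
        Wp = W.copy(); Wp[i, j] += h
        Wm = W.copy(); Wm[i, j] -= h
        numeric[i, j] = (L_of(Wp) - L_of(Wm)) / (2 * h)
print(np.allclose(analytic, numeric, atol=1e-5))   # True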
1.3.8 Integral
For the function f(x) in Figure 1-31, how to find the area of the shaded part below it?
The area under the curve can be approximated by accumulating the areas of small rectangles erected at points xi
evenly distributed on the interval, that is, ∑i f(xi) ∗ Δx, where Δx is the length of the small subinterval
containing xi. According to the idea of limits, as long as Δx is sufficiently small, the error between the above
cumulative sum and the real area will be small; that is, the real area S is the following limit value:
S = lim_{Δx→0} ∑i f(xi) ∗ Δx
This limit value is called the definite integral (integral) of the function f(x) on this interval. In calculus, a
special symbol ∫ₐᵇ f(x)dx is used to represent this limit value, where the meaning of dx is the differential of the
independent variable x.
Similarly, ∫ₐˣ f(x)dx represents the area on the interval [a, x]. When the value of x is constantly changing,
this value also changes, thus forming a function F : x → ∫ₐˣ f(x)dx. That is:
F(x) = ∫ₐˣ f(x)dx
So what is the derivative of F(x)? According to the definition of the derivative:
F′(x) = lim_{Δx→0} (F(x + Δx) − F(x))/Δx = lim_{Δx→0} (f(x) ∗ Δx)/Δx = f(x)
Of course, the second equality here is a bit imprecise; interested readers can consult a calculus textbook.
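A numerical illustration (a sketch, not the book's code): approximating the definite integral of f(x) = x² on [0, 1], whose true value is 1/3, by a Riemann sum with ever smaller Δx:
import numpy as np
for N in [10, 100, 1000, 10000]:
    dx = 1.0 / N
    xi = np.arange(N) * dx          # left endpoints of the N subintervals
    print(N, np.sum(xi**2) * dx)    # the Riemann sum approaches 1/3 as dx shrinks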
1.4 Probability Basics
This section introduces the basics of probability theory such as probability, random variables, expectation,
variance, etc.
1.4.1 Probability
Probability refers to the likelihood of an event occurring. Probability is a real number
between 0 and 1. If the probability of an event is 0, the event cannot happen, such as "the sun rises
from the west" or "people can live forever". If the probability of an event is 1, it is an inevitable
event, such as "a person will eventually die". Therefore, events with probability 1 or 0 are deterministic events:
they must happen or must not happen.
However, whether many events occur, and how likely they are to occur, is often uncertain; these are random events.
For example, the event "buying a lottery ticket and winning the jackpot" may or may not happen. A randomly
tossed coin may come up "heads" or "tails". Rolling a die, the number that appears may be any of 1, 2, 3, 4,
5, or 6. "Winning the jackpot by buying a lottery ticket" is a small-probability event, that is, its probability
is a small real number close to 0.
If a coin is fair, the chances, or probabilities, of heads and tails in a toss of the coin are the same. A random
experiment (such as a "coin toss") may have many different outcomes (random events), each possibly with a
different probability, but one of all these outcomes must occur; that is, the sum of the probabilities of all
outcomes equals 1. Therefore, if the probabilities of "heads" and "tails" in a coin toss are both p, then
2p = 1, that is, p = 1/2 = 0.5. Similarly, if a die with 6 numbers is perfectly symmetric and uniform in density,
the probability of each number appearing in a roll of the die is 1/6.
The capital letter P is usually used to denote probability; the probabilities of the 2 possible events in a coin
toss are P(heads) = 1/2 and P(tails) = 1/2, and the probability of a number i appearing in a die roll is:
P(the number i appears) = 1/6, i ∈ {1, 2, 3, 4, 5, 6}
Call "flipping a coin at random" a random experiment. The possible results (events) of a random experiment are
called sample points, and all possible results of a random experiment, that is, the collection of all sample points
are called sample space. A randomized experiment is usually denoted by a capital E, while a sample space is
usually denoted by a capital letter S, Ω, or U .
For the coin toss experiment, its sample space = {"heads", "tails"}. And for rolling a die,
its sample space = {"number 1 appears", "number 2 appears", "number 3 appears", "number 4
appears", "number 5 appears", "number 6 appears"}.
If the random trial is "randomly rolling the die 2 times", its sample space = {"number 1 appears the 1st
time, number 1 appears the 2nd time", "number 1 appears the 1st time, number 2 appears the 2nd
time", ⋯, "number 6 appears the 1st time, number 6 appears the 2nd time"}; that is, there
are a total of 36 possible results. Assuming that the probability of each number appearing on each roll is the
same, the probability of each result is equal, namely 1/36.
Let the random experiment E be "randomly draw a card from 52 playing cards and observe the rank of the card";
then its sample space is {A, 2, 3, ..., J, Q, K}, 13 sample points in total. If the random experiment E is
"randomly draw a card from 52 playing cards and observe its suit", then its sample space is {spades, hearts, clubs,
diamonds}, a total of 4 sample points. If the random experiment E is "randomly draw a card from 52 playing cards
and observe which card it is", the result must record both the rank and the suit, and the sample space is the
Cartesian product of the above two sample spaces: {(A, spades), (A, hearts), (A, clubs),
(A, diamonds), (2, spades), (2, hearts), (2, clubs), (2, diamonds), ..., (K, spades), (K, hearts), (K, clubs),
(K, diamonds)}, a total of 13 × 4 = 52 sample points.
The sample points of the sample space are called basic (elementary) events. A collection of several sample points
is also an event. For example, in the random experiment "rolling a die" there are 6 basic events, and these basic
events can be combined into other events; the event "the number appearing is no more than 3" = {"number 1
appears", "number 2 appears", "number 3 appears"} is the union of 3 basic events.
Among all random events, there are 2 special events: the event corresponding to the empty set, which is
written ∅, and the event corresponding to the full set (including all sample points), still represented by the
symbol Ω.
The possibility of a random event A is represented by a real number between 0 and 1, usually written with the
symbol P(A), that is, 0 ≤ P(A) ≤ 1. Obviously P(∅) = 0 and P(Ω) = 1; ∅ is called the impossible event and Ω the
inevitable (certain) event.
Mutually exclusive events (also called incompatible events): A and B cannot occur at the same time, that is, A
and B have no common sample points: A ∩ B = ∅.
Opposite events: a special case of mutually exclusive events; A and B cannot happen at the same time, but one
of A and B must happen. In set language: A ∩ B = ∅ and A ∪ B = Ω.
Classical probability model (classical probability): the sample space is finite, and every sample point has the
same probability of appearing. The probability of an event in classical probability = the number of sample points
contained in the event / the total number of sample points in the sample space.
For example, for "rolling a die", the total number of points in the sample space is 6, and the event "the number
that appears is less than 3" contains only 2 sample points (numbers 1 and 2). Therefore the probability P of this
event ("the number that appears is less than 3") = 2/6.
Of course, for general random experiments, the probabilities of the sample points (basic events) are usually not
equal. How do we determine the probability of an event? Usually, statistical methods are used: the random
experiment is repeated many times, say n times; if event A occurs k times in these experiments, the frequency of
event A is said to be k/n. Repeating such random experiments, when n is very large, according to the law of large
numbers in probability theory, this frequency approaches the true probability. That is:
P(A) = lim_{n→∞} k/n
For example, the following code simulates such trials (n coin tosses) with a function one_coin_test(n), which
returns the frequency of heads. It can be seen that as n increases, the frequency approaches the probability 0.5.
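The definition of one_coin_test is not shown in this excerpt; a minimal sketch consistent with its use below:
import numpy as np

def one_coin_test(n):
    # toss a fair coin n times (1 = heads) and return the frequency of heads
    return np.random.randint(0, 2, n).mean()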
for n in range(10,50000,2000):
print(one_coin_test(n),end=', ')
0.7, 0.47562189054726367, 0.4845386533665835, 0.4945091514143095, 0.5013732833957553,
0.5031968031968032, 0.49467110741049125, 0.5007137758743755, 0.5033104309806371,
0.4999444752915047, 0.49485257371314345, 0.5070422535211268, 0.5001665972511453,
0.5036908881199539, 0.5008568368439843, 0.5007664111962679, 0.5004998437988128,
0.4972655101440753, 0.4980560955290197, 0.5011312812417785, 0.5004498875281179,
0.4991906688883599, 0.5021586003181095, 0.49734840252119106, 0.49989585503020206,
A random experiment has many possible events, and each event has a probability. Mathematically, the probability
of these events is defined as a mapping from event to probability.
Assume that the entire set of all measurable events in the sample space Ω is F, and the probability P is a mapping
from F to the real number interval [0,1], that is, P : F → [0, 1]. The mapping must satisfy:
1. Normalization: P(Ω) = 1
2. Non-negativity: for any event A, P(A) ≥ 0
3. Additivity: for mutually exclusive events A1, A2, ⋯, P(A1 ∪ A2 ∪ ⋯) = P(A1) + P(A2) + ⋯
To give a medical example, let A denote "has hepatitis B" and B denote "the surface antibody test is positive".
The conditional probability P(B|A) is the probability that B occurs given that A has occurred. P(A) is the
probability that a randomly selected person "has hepatitis B"; P(B) is the probability that a randomly selected
person tests positive for the surface antibody. Then P(B|A) indicates the probability of "the surface antibody
being positive in the case of hepatitis B". Obviously, the prior probability P(B) and the conditional probability
P(B|A) are not equal, because "a randomly chosen person tests positive" is different from "a person who has
hepatitis B tests positive"; the latter should be more likely.
Joint probability P(A, B): the probability of A and B occurring at the same time; that is, the probability that a
randomly selected person both "has hepatitis B" and "tests positive". The joint probability P(A, B) is sometimes
also written P(A ∩ B), which indicates the probability that A and B occur at the same time, i.e. that their
intersection occurs.
The conditional probability can be computed from the joint probability and the prior probability:
P(A|B) = P(A, B)/P(B)
You can use "throwing a sieve" to help understand this formula. Let A mean "the number is greater than 3", B
means "the number is even", then (A, B) means "the number is greater than 3 and is even", and (B|A) means "the
number is even if the number is greater than 3".
"The number is greater than 3 and is even" has only 2 sample points {4,6} in the sample point space {1, 2, 3, 4, 5,
6}, so: P (A, B) = 2/6
Similarly, the probability of "the number is greater than 3" P (A) = 3/6
The sample point space set of "when the number is greater than 3" is {4, 5, 6}, that is, there are 3 sample points, of
which the even number is 2 {4, 6}, therefore, P (B|A) = 2/3
It can be verified:
P (A,B) 2/6
= = 2/3 = P (B|A)
P (A) 3/6
The joint probability can also be written as the product of a conditional probability and a prior probability:
P(A, B) = P(B|A)P(A) = P(A|B)P(B)
Two events are independent if and only if: P(A, B) = P(A)P(B).
Two events being mutually exclusive means that they cannot occur at the same time; for example, "heads" and
"tails" in a single coin toss are mutually exclusive. For mutually exclusive events A and B, the probability of A
and B occurring at the same time is obviously 0, that is, P(A, B) = 0.
For the sets A, B, their intersection and union have the relationship A ∪ B = A + B − (A ∩ B); therefore, if two
events A, B are not mutually exclusive, P(A ∪ B) = P(A) + P(B) − P(A, B), as shown in Figure 1-32. If A
and B are mutually exclusive, P(A ∪ B) = P(A) + P(B) − P(A, B) = P(A) + P(B).
If n events A1, A2, ⋯, An are mutually exclusive, and their union is the entire sample space, that is,
A1 ∪ A2 ∪ ⋯ ∪ An = Ω, then the total probability formula holds:
P(B) = ∑ᵢ₌₁ⁿ P(B|Aᵢ)P(Aᵢ)
According to the total probability formula, the conditional probability P(Aᵢ|B) can be calculated as follows
(Bayes' formula):
P(Aᵢ|B) = P(Aᵢ, B)/P(B) = P(B|Aᵢ)P(Aᵢ) / ∑ᵢ₌₁ⁿ P(B|Aᵢ)P(Aᵢ)
For example, let P(A) = 0.001 be the prior probability that an ordinary person has hepatitis B, and
P(Aᶜ) = 0.999 the prior probability of not having hepatitis B. Let
P(B|A) = 0.99 represent the probability of "testing positive for the surface antibody given hepatitis B", and
P(B|Aᶜ) = 0.01 the probability of "testing positive for the surface antibody without hepatitis B". Now, if a
person's surface antibody test is positive (B), what is the probability (possibility) that he has hepatitis B?
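Plugging these numbers into the Bayes formula above (a small computation, not from the book's code):
p_A = 0.001          # prior probability of having hepatitis B
p_notA = 0.999
p_B_given_A = 0.99   # positive test given the disease
p_B_given_notA = 0.01
p_A_given_B = p_B_given_A * p_A / (p_B_given_A * p_A + p_B_given_notA * p_notA)
print(p_A_given_B)   # about 0.09: still quite unlikely despite the positive test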
A random variable is, simply put, a mapping from the sample space Ω to the real number set R; that is, for each
sample point (basic event), there is a real number corresponding to it, as shown below:
Figure 1-33 A random variable is a mapping from a sample space to a set of real numbers
For example, the sample space of the random experiment "toss a coin and observe which side comes up" has only 2
sample points: heads and tails. A random variable X(ω) can be defined that maps the basic events "heads" and
"tails" to the two values 3 and 4 respectively, which can be written as:
X(ω) = { 3, ω = heads
       { 4, ω = tails
For the same experiment, different random variables can be defined. For example, two dice are randomly rolled, and
the entire sample space consists of 36 elements:
ω = {(i, j) | i = 1, …, 6; j = 1, …, 6}
You can define a random variable (mapping) X as the sum of the points of the two dice; this random variable X
can take 11 integer values:
X(ω) = X(i, j) := i + j, x = 2, 3, …, 12
You can also define a random variable (mapping) Y as the difference between the points of the two
dice; this random variable Y can take 6 integer values.
For another example, someone is waiting for a bus that arrives every 5 minutes. If the time the person arrives at
the stop is random, then the time he waits for the bus can be represented by a random
variable X(ω). If the sample space S = {waiting time}, the sample point itself is a real number, and the random
variable X(ω) is simply:
X(ω) = ω, ω ∈ Ω
If the value range of the random variable X(ω) is countable, X(ω) is called a discrete random variable,
otherwise, it is called a non-discrete random variable . Among non-discrete random variables, if the value range
is composed of some intervals, it is called continuous random variable. The random variable X(ω) is often
abbreviated as X, that is, the sample point ω is omitted.
P(x1), P(x2), ⋯, P(xn) is called the probability distribution column of the random variable. The functional
relationship of the random variable X from each of its possible values xᵢ to the corresponding probability P(xᵢ)
is the probability distribution law of the random variable.
For example, if a business can be reviewed as excellent, good, medium, or poor, a random variable X can be
used to map this group of sample points to 0, 1, 2, and 3. If it is already known from many previous reviews that
the probabilities of the merchant being rated excellent, good, medium and poor are 0.5, 0.3, 0.1, 0.1, then the
probability distribution law of the random variable X is:
P(X = 0) = 0.5, P(X = 1) = 0.3, P(X = 2) = 0.1, P(X = 3) = 0.1
The following code plots this probability distribution for the X values 0, 1, 2, 3. It can be seen that except at
these 4 integers, the probability of X taking any other value is 0.
import matplotlib.pyplot as plt
%matplotlib inline
x = [0,1,2,3]
p = [0.5,0.3,0.1,0.1]
plt.vlines(0, 0, 0.5,color="red")
plt.vlines(1, 0, 0.3,color="red")
plt.vlines(2, 0, 0.1,color="red")
plt.vlines(3, 0, 0.1,color="red")
plt.scatter(x,p)
plt.show()
If a discrete random variable takes only two values such as 0 and 1, the distribution of this binary random
variable is called a two-point distribution (also known as the 0-1 distribution or Bernoulli distribution). The
following formula gives the probabilities of the values 1 and 0 of such a binary random variable X:
P(X = 1) = ϕ, P(X = 0) = 1 − ϕ
It describes a random experiment with only 2 different basic outcomes, for example the problem of "tossing a coin
and getting heads or tails". In the binary classification problem of machine learning, this two-point distribution
is used to represent the probabilities that an object belongs to each of two classes.
Binomial distribution
Ask a question: what is the probability that "randomly tossing a coin n times, it comes up heads k times"? This
can be described by the binomial distribution. Each toss in the experiment "randomly flipping a coin n times"
follows the two-point distribution above, and any two coin flips are independent of each other. Use A to represent
the event "heads appears in the first k tosses, and all the following are tails", which is a joint event of n
independent events: "heads on the 1st toss" (denoted A1), "heads on the 2nd toss", ⋯, "heads on the k-th toss",
"tails on the (k+1)-th toss" (denoted B_{k+1}), ⋯, "tails on the n-th toss":
A = (A1, A2, ⋯, Ak, B_{k+1}, ⋯, Bn)
If the probability of heads in one toss is p, the probability of this joint event is pᵏ(1 − p)ⁿ⁻ᵏ.
According to the combination principle, in the event "n coin tosses, heads appearing k times", the number of ways
of choosing which k tosses come up heads is Cₙᵏ = n!/(k!(n−k)!). Therefore, according to the additivity of
probability, if a random variable X maps "n coin flips, k heads" to the integer k, then the probability
distribution of this discrete random variable X, called the binomial distribution, is:
P(X = k) = Cₙᵏ pᵏ(1 − p)ⁿ⁻ᵏ
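A small check of this formula (a sketch, not the book's code): for n = 10 tosses of a fair coin (p = 0.5), the probabilities P(X = k) sum to 1:
from math import comb
n, p = 10, 0.5
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(pmf[5])     # probability of exactly 5 heads: 252/1024, about 0.246
print(sum(pmf))   # 1.0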
For a continuous random variable, it is impossible to enumerate the probability of each possible value, and doing
so would be meaningless, just as it is infeasible and meaningless to measure the mass (weight) of a single point
inside an object, or to ask for the length of a single point on a real number interval.
Since it is meaningless to define the probability of a continuous random variable taking a single value, how can
the probability of the random variable taking different values be measured? Just as density is used to measure the
mass near a certain point inside a material, probability density can be used to measure the possibility of a
random variable taking a value near a certain value.
Just as the density of a substance is the limit value of the ratio of its mass to volume: for a point p inside the
object, its density is defined as:
ρ(p) = lim_{Δp→0} Δm/Δp
where Δp, Δm represent the volume and mass of a small region containing p. Their ratio reflects the average
density of the small region, and the limit value of the ratio as this small region tends to 0 precisely
characterizes the density at this point p (for a point, strictly speaking, it is the mass density).
Similarly, the probability (strictly speaking, probability density) of a continuous random variable at a certain
point x can be expressed similarly as:
p(x) = lim_{Δx→0} ΔP/Δx = lim_{Δx→0} (P(x + Δx) − P(x))/Δx
where Δx is a small interval containing x, and ΔP is the probability that the random variable falls in this small
interval; their ratio represents the average probability per unit length on this small interval. The limit of this
ratio as Δx tends to 0 accurately describes the probability density of the random variable at this point x.
Therefore, for some x, the probability P([x − dx, x + dx]) of a random variable X taking a value on
[x − dx, x + dx] can be approximated by 2dx ∗ p(x).
Example: assume the random variable X takes values uniformly on the interval [a, b]; then P([a, b]) = 1, and its
probability density at each point x is p(x) = 1/(b − a). That is, the probability density is the same value at
every point of [a, b]: the random variable is uniformly distributed on the interval [a, b].
Gaussian distribution
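The Gaussian (normal) probability density with mean μ and standard deviation σ is p(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). The plotting code below calls a helper gaussian(x, mu, sigma) that is not defined in this excerpt; a minimal sketch of it:
import numpy as np

def gaussian(x, mu, sigma):
    # Gaussian probability density with mean mu and standard deviation sigma
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))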
The following code plots the Gaussian probability density for different values of μ, σ:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-5, 5, 100)
plt.plot(x, gaussian(x,0,0.5))
plt.plot(x, gaussian(x,-2,0.7))
plt.plot(x, gaussian(x,0,1))
plt.plot(x, gaussian(x,1,2.3))
plt.legend(['$\mu=0,\sigma=0.5$','$\mu=-2,\sigma=0.7$','$\mu=0,\sigma=1$','$\mu=1,\sigma=2.3$'])
#plt.axis('equal')
plt.xlabel('x')
plt.ylabel('p(x)')
plt.show()
Figure 1-35 Gaussian curves with different means and standard deviations
This is the famous "bell-shaped curve". It can be seen that the probability density is largest at μ, and the
farther from μ, the smaller the probability density. That is, the probability of the random variable taking a
value near μ is greatest, and values that deviate from it are less likely to be taken. The smaller σ is, the
narrower the curve, indicating that the values of the random variable are more concentrated near μ. The Gaussian
distribution with μ = 0, σ = 1 is called the standard normal distribution.
The distribution function describes the probability that a random variable falls in the interval (−∞, x].
For example, the distribution function corresponding to the random variable X of the coin toss above (which takes
the value 3 with probability 0.3 and the value 4 with probability 0.7) can be calculated as a step function F(x):
F(x) = { 0,   x < 3
       { 0.3, 3 ≤ x < 4
       { 1,   x ≥ 4
The random variable cannot be less than 3, so the probability of falling in (−∞, x) for x < 3 is 0. For
3 ≤ x < 4, only the value 3 falls in (−∞, x], so the probability is 0.3. Since the values 3 and 4 both fall in
(−∞, x] for x ≥ 4, F(x) = 1 there.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
lines = [(-2, 3), (0, 0),'r',(3, 4), (0.3, 0.3),'g',(4, 10), (1, 1),'b']
plt.plot(*lines)
plt.scatter(3,0, s=50, facecolors='none', edgecolors='r')
plt.scatter(4,0.3, s=50, facecolors='none', edgecolors='g')
Figure 1-36 The distribution function corresponding to the random variable X of the coin toss is a stepped function
For a continuous random variable, if its probability density is p(x), the distribution function is the definite
integral of the probability density function p(x) over the interval (−∞, x):
F(x) = ∫_{−∞}^{x} p(t)dt
In turn, the probability density is the derivative of the distribution function, i.e. p(x) = F′(x).
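The plotting code below relies on a helper vec_gaussion_dist that is not defined in this excerpt; judging from the [0] indexing of its result, it is a vectorized quad integration of the Gaussian density. A minimal sketch, assuming the gaussian function defined above:
import numpy as np
from scipy.integrate import quad

def gaussion_dist(x, mu, sigma):
    # distribution function: integrate the density from -inf to x
    return quad(gaussian, -np.inf, x, args=(mu, sigma))

vec_gaussion_dist = np.vectorize(gaussion_dist)   # returns (values, errors)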
The following code plots the distribution function corresponding to the Gaussian probability density of the above
figure:
from scipy.integrate import quad
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-5, 5, 100)
plt.plot(x, vec_gaussion_dist(x,0,0.5)[0])
plt.plot(x, vec_gaussion_dist(x,-2,0.7)[0])
plt.plot(x, vec_gaussion_dist(x,0,1)[0])
plt.plot(x, vec_gaussion_dist(x,1,2.3)[0])
plt.legend(['$\mu=0,\sigma=0.5$','$\mu=-2,\sigma=0.7$','$\mu=0,\sigma=1$','$\mu=1,\sigma=2.3$'])
plt.xlabel('x')
plt.ylabel('p(x)')
plt.show()
Figure 1-37 Gaussian distribution functions with different means and standard deviations
19.833333333333332
This average age is the mean of their ages; the mean is the average of a set of numbers. If this set of
numbers is (x1, x2, ⋯, xn), then the mean of this set of numbers is:
(x1 + x2 + ⋯ + xn)/n = (1/n) ∑ᵢ₌₁ⁿ xᵢ
If (x1, x2, ⋯, xn) are all the possible values of a random variable X, and the probability of the random
variable taking each of these values is the same, namely 1/n, this mean can be written as:
(1/n)(x1 + x2 + ⋯ + xn) = (1/n)x1 + (1/n)x2 + ⋯ + (1/n)xn
That is, the mean is the sum over all values of the probability of each value of the random variable multiplied by
that value. This mean is called the mathematical expectation of this random variable, or expectation for short:
the value of the random variable expected from an average point of view.
If the probabilities of the values of the random variable are not equal, for example the probability of taking xᵢ
is pᵢ, then the mean is:
p1x1 + p2x2 + ⋯ + pnxn
The letter E is usually used to represent the expectation of a random variable; the expectation of the random
variable X is written E[X]. That is:
E[X] = p1x1 + p2x2 + ⋯ + pnxn = ∑ᵢ₌₁ⁿ pᵢxᵢ
Assuming that the freshmen are only 18, 19, 20, or 21 years old, their age class can be represented by a random
variable X taking the values 0, 1, 2, or 3. The probability pᵢ = P(X = i), i = 0, 1, 2, 3 of the random variable
indicates the probability (percentage) that a student belongs to each age. The expectation of this random variable
X is:
E[X] = p0 ∗ 0 + p1 ∗ 1 + p2 ∗ 2 + p3 ∗ 3
If p0, p1, p2, p3 are 0.2, 0.4, 0.3, 0.1 respectively, then E[X] = 0.2 ∗ 0 + 0.4 ∗ 1 + 0.3 ∗ 2 + 0.1 ∗ 3 = 1.3.
If a function f(X) maps the random variable X representing the age class to the student's actual age, that
is, x = 0, 1, 2, 3 is mapped to the ages 18, 19, 20, 21, then, because the random variable X takes the values
0, 1, 2, 3 with probabilities p0, p1, p2, p3 respectively, the function value f(X) is also a random variable,
taking the values f(0), f(1), f(2), f(3) with the same probabilities p0, p1, p2, p3. Thus the expectation
E[f(X)] of the random variable f(X) can be calculated:
E[f(X)] = p0 ∗ f(0) + p1 ∗ f(1) + p2 ∗ f(2) + p3 ∗ f(3) = 0.2 ∗ 18 + 0.4 ∗ 19 + 0.3 ∗ 20 + 0.1 ∗ 21 = 19.3
Therefore, if a random variable X takes its values x1, ⋯, xᵢ, ⋯, xn with probabilities p1, ⋯, pᵢ, ⋯, pn, the
function value f(X) of the random variable X is also a random variable, and the expectation of f(X) is:
E[f(X)] = p1f(x1) + p2f(x2) + ⋯ + pnf(xn) = ∑ᵢ₌₁ⁿ pᵢf(xᵢ)
If X is a continuous random variable whose probability density of taking the value x is p(x), and f(X) is a
random variable dependent on X, then the expectations of X and of f(X) can be calculated by integrals:
E[X] = ∫ x p(x)dx,  E[f(X)] = ∫ f(x)p(x)dx
Let f(X) = X be the identity mapping; then the first formula can be obtained from the second, so the former can
be regarded as a special case of the latter.
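For the discrete case, the expectation is a one-line computation (a sketch, not the book's code, using the age example above):
import numpy as np
p = np.array([0.2, 0.4, 0.3, 0.1])   # probabilities of x = 0, 1, 2, 3
x = np.array([0, 1, 2, 3])
ages = np.array([18, 19, 20, 21])    # f maps x to the actual age
print(np.sum(p * x))                 # E[X] = 1.3
print(np.sum(p * ages))              # E[f(X)] = 19.3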
The average squared error between a random variable and its expectation can describe how concentrated or spread
out the values of the random variable are. The specific method is to square the error between each value and the
expectation, then average. For example, for the ages (18, 19, 20, 21, 22), whose mean is 20:
((18 − 20)² + (19 − 20)² + (20 − 20)² + (21 − 20)² + (22 − 20)²)/5 = (4 + 1 + 0 + 1 + 4)/5 = 2
while for the much more scattered numbers (1, 6, 18, 10, 65), whose mean is also 20:
((1 − 20)² + (6 − 20)² + (18 − 20)² + (10 − 20)² + (65 − 20)²)/5 = (361 + 196 + 4 + 100 + 2025)/5 = 537.2
This average of the squared errors is called the mean square error, or variance. It describes the
degree of divergence of the random variable from its expected value: the larger the variance, the more divergent
the values; the smaller the variance, the more concentrated the values of the random variable are near the
expected value.
For an equally probable random variable X whose values are $(x_1, x_2, \cdots, x_n)$, if its expectation (mean) is written $\mu = E(X)$, then the variance Var(X) is:

$$Var(X) = \frac{1}{n}\left((x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2\right) = \frac{\sum_{i=1}^n (x_i-\mu)^2}{n}$$
If the probabilities of the values of the random variable X are $p_1, \cdots, p_i, \cdots, p_n$ respectively, then the variance is:

$$Var(X) = p_1(x_1-\mu)^2 + p_2(x_2-\mu)^2 + \cdots + p_n(x_n-\mu)^2$$
If X is a continuous random variable whose probability density of taking the value x is p(x), then the variance Var(X) of X is:

$$Var(X) = E_{X\sim p}\left[(X - E[X])^2\right] = \int p(x)(x - E[X])^2\,dx$$
Therefore, the variance is itself an expectation: the expectation of the random variable $(X - E[X])^2$, i.e. the expectation (mean) of the squared error. By the linearity of expectation, it can be derived that:

$$Var(X) = E_{X\sim p}\left[(X - E[X])^2\right] = E[X^2] - E[X]^2$$
The variance is the average of the squared errors, and the square root of the variance is called the standard deviation, which can be represented by std(X). That is: $std(X) = \sqrt{Var(X)}$.

The symbols μ and σ are commonly used to represent the expectation and standard deviation, while the variance is represented by the symbol $\sigma^2$.
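The first numerical example above can be reproduced with numpy (a minimal sketch; the names are illustrative):

x = np.array([18, 19, 20, 21, 22])
mu = x.mean()                  # expectation (mean): 20.0
var = np.mean((x - mu)**2)     # variance: 2.0
print(mu, var, np.sqrt(var))   # standard deviation: sqrt(2) ≈ 1.414
print(x.var(), x.std())        # numpy's built-ins give the same values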
The dot product of two vectors characterizes their correlation. For example, for a = (1, 1), b = (−1, 1), the dot product $a \cdot b = 1 \cdot (-1) + 1 \cdot 1 = 0$ indicates that they are perpendicular to each other, i.e. uncorrelated. For a = (1, 1), b = (1, 1), the dot product $a \cdot b = 1 \cdot 1 + 1 \cdot 1 = 2$; these two vectors coincide, i.e. they are correlated. For a = (1, 1), b = (1, 0), the dot product $a \cdot b = 1 \cdot 1 + 1 \cdot 0 = 1$; the angle between the two vectors is 45 degrees, and their degree of correlation lies between the previous two cases.

Generally, for two vectors $x = (x_1, x_2, \cdots, x_n)$ and $y = (y_1, y_2, \cdots, y_n)$, their dot product can be used to measure their correlation.
Covariance is a measure of the correlation between two random variables. If two random variables X, Y take the values $(x_1, x_2, \cdots, x_n)$ and $(y_1, y_2, \cdots, y_n)$, the covariance Cov(X, Y) between them is defined as:

$$Cov(X, Y) = \frac{(x_1-\mu_X)(y_1-\mu_Y) + (x_2-\mu_X)(y_2-\mu_Y) + \cdots + (x_n-\mu_X)(y_n-\mu_Y)}{n}$$
That is, a dot product is taken of the values after their expectations have been subtracted, and the result is averaged. This form is very similar to the variance of a single random variable:

$$Var(X) = \frac{(x_1-\mu_X)(x_1-\mu_X) + (x_2-\mu_X)(x_2-\mu_X) + \cdots + (x_n-\mu_X)(x_n-\mu_X)}{n}$$
But the meanings of the two are different. V ar(X) describes the degree of divergence of the random variable from
the expected value, and Cov(X, Y ) describes the correlation between two random variables.
By the linearity of expectation, the following derivation can be made:

$$\begin{aligned} Cov(X, Y) &= E[(X-\mu_X)(Y-\mu_Y)] \\ &= E[XY - \mu_XY - \mu_YX + \mu_X\mu_Y] \\ &= E[XY] - \mu_X\mu_Y - \mu_X\mu_Y + \mu_X\mu_Y \\ &= E[XY] - \mu_X\mu_Y \\ &= E[XY] - E[X]E[Y] \end{aligned}$$
In machine learning, a sample may have multiple features. For example, a house may have features such as area, number of rooms, location, and number of floors. Each feature can be seen as a random variable, and some features may be correlated. Correlated features can interfere with each other's influence on a machine learning algorithm, so eliminating the correlation between features, or selecting features with low correlation, helps improve the performance of the algorithm. Correlation analysis can be performed on these features to help select good features, or to transform the original data to eliminate the correlation between features.
If a sample has 3 features, and their corresponding random variables are represented by $X_1, X_2, X_3$ respectively, the correlation between them can be analyzed by calculating their pairwise covariances. These covariance values can be expressed in the form of a matrix, called the covariance matrix, usually represented by the symbol Σ:

$$\Sigma = \begin{pmatrix} Cov(X_1,X_1) & Cov(X_1,X_2) & Cov(X_1,X_3) \\ Cov(X_2,X_1) & Cov(X_2,X_2) & Cov(X_2,X_3) \\ Cov(X_3,X_1) & Cov(X_3,X_2) & Cov(X_3,X_3) \end{pmatrix}$$

This is a symmetric matrix. If all the observed values of each random variable are lined up as a column, the values of all the random variables can be represented as a matrix:

$$X = (X_1, X_2, X_3)$$
If the values are equally probable, the covariance matrix can be calculated by centering X and averaging:

$$\bar{X} = X - E[X], \qquad \Sigma = \frac{1}{n}\bar{X}^T\bar{X}$$

X = X - np.mean(X, axis=0)
Sigma = np.dot(X.transpose(), X) / len(X)    # divide by n to average
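This calculation can be checked against numpy's built-in np.cov (a sketch; note that np.cov treats rows as variables by default and divides by n−1, hence rowvar=False and bias=True):

X = np.random.randn(100, 3)                # 100 samples with 3 features
Xc = X - np.mean(X, axis=0)                # center each feature
Sigma = Xc.transpose() @ Xc / len(X)       # covariance matrix as above
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True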
Chapter 2 Gradient descent method
The core task of deep learning is to train a function model from sample data, that is, to find an optimal function that represents or describes these sample data. Solving for the best function model comes down to a mathematical optimization problem, more precisely, the problem of finding the extreme (optimal) value of a certain loss function. In deep learning, the gradient descent method is used to solve this optimization problem, i.e. to solve for the model parameters.
This chapter introduces the theoretical basis, algorithm principle and code implementation of the gradient
descent algorithm starting from the necessary conditions for the extreme value of the function, and
introduces different optimization strategies for updating the solution variables (parameters) in the gradient
descent method.
The function y = f(x) attains a minimum at a certain point $x_0$: this means that there is some positive number ϵ such that every x in the interval $(x_0-\epsilon, x_0+\epsilon)$ satisfies $f(x_0) \le f(x)$. $x_0$ is called a minimum point of the function, and $f(x_0)$ is called a minimum value of the function.

The function y = f(x) attains a maximum at a certain point $x_0$: this means that there is some positive number ϵ such that every x in the interval $(x_0-\epsilon, x_0+\epsilon)$ satisfies $f(x) \le f(x_0)$. $x_0$ is called a maximum point of the function, and $f(x_0)$ is called a maximum value of the function.
The minimum and maximum values are collectively referred to as extreme values, and minimum and maximum points are collectively referred to as extreme points.

If all x in the domain of the function f(x) satisfy $f(x_0) \le f(x)$, then $x_0$ is called the global minimum point of the function; if all x in the domain satisfy $f(x) \le f(x_0)$, then $x_0$ is called the global maximum point. That is, the global minimum is the smallest value over the whole domain, and the global maximum is the largest value over the whole domain; they are collectively referred to as the optimal values, and their points as the optimal points.
Necessary condition for a function extremum: if $x_0$ is an extreme point of the function f(x), and the function is differentiable at $x_0$, then $f'(x_0) = 0$ must hold, i.e. the derivative value at an extreme point must be 0.
For example, the previous function $f(x) = x^2$ attains its minimum at x = 0 (which is of course also a local minimum) and is differentiable there, so at x = 0 its derivative value $f'(0) = 2 \times 0 = 0$ must be 0.
This proposition is easy to prove. If $x_0$ is, say, a minimum point of the function f(x), there is an interval around $x_0$ on which $f(x_0+\Delta x) \ge f(x_0)$, so the numerator of the difference quotient $\frac{f(x_0+\Delta x) - f(x_0)}{\Delta x}$ is non-negative. When x tends to $x_0$ from the right, Δx is a positive number, so the limit of the quotient must be ≥ 0; when x tends to $x_0$ from the left, Δx is a negative number, so the limit must be ≤ 0. Since this limit exists (the function is differentiable at $x_0$), its value can only be 0.
From the limit formula, a rule can also be found: if the derivative at $x_0$ is a positive number, the function f(x) is monotonically increasing around this point, that is, if $x_1 < x_2$ then $f(x_1) < f(x_2)$; f(x) increases as x increases, or equivalently, if Δx is a positive number then Δy is also a positive number. For example, the derivative of $y = f(x) = x^2$ is $f'(x) = 2x$. When x is greater than 0 the derivative is positive, so the function curve is monotonically increasing; when x is less than 0 the derivatives are all negative, so the function curve is monotonically decreasing, that is, if $x_1 < x_2$ then $f(x_1) > f(x_2)$.
For the function $f(x) = x^3 - 3x^2 - 9x + 2$, setting its derivative $f'(x) = 3x^2 - 6x - 9 = 3(x+1)(x-3)$ to 0 gives two points $x_1 = -1$, $x_2 = 3$ where the derivative is 0. The monotonic behavior of this function can be read off from the sign of the derivative: on the interval $(-\infty, -1]$, f'(x) is positive, so f(x) is monotonically increasing; on the interval (−1, 3), f'(x) is negative, so f(x) is monotonically decreasing; and on the interval $[3, \infty)$, f'(x) is positive, so f(x) is monotonically increasing.
The following code draws the curve of this function and its derivative function, which can more intuitively
see the monotonic change and extreme point situation.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-3, 4, 0.01)
f_x = np.power(x,3)-3*x**2-9*x+2
df_x = 3*x**2-6*x-9
plt.plot(x,f_x)
plt.plot(x,df_x)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.legend(['f(x)', "df(x)"])
plt.axvline(x=0, color='k')
plt.axhline(y=0, color='k')
plt.show()
Figure 2-2 $f(x) = x^3 - 3x^2 - 9x + 2$ and the curve of its derivative $f'(x)$
Note that the above proposition only states a necessary condition at an extreme point of a function, not a sufficient condition. That is, the derivative $f'(x_0) = 0$ at some point $x_0$ does not mean that $x_0$ must be an extreme point. For example, the derivative $f'(0)$ of $f(x) = x^3$ at x = 0 is also 0, but this point is not an extreme point of the function. In fact, this function is a monotonically increasing curve, as shown in Figure 2-3.
x = np.arange(-3, 3, 0.01)
f_x = np.power(x,3)
plt.plot(x,f_x)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.axvline(x=0, color='k')
plt.axhline(y=0, color='k')
plt.show()
Obviously, the necessary condition for a function extremum can be extended to multivariate functions: for a multivariate function $f(x_1, x_2, \cdots, x_n)$, if the function attains an extreme value at a certain point $x^* = (x_1^*, x_2^*, \cdots, x_n^*)$ and the gradient at this point exists (that is, all partial derivatives exist), then the gradient at this point must be 0 (that is, each partial derivative value is 0). That is:

$$\left.\frac{\partial f(x_1, x_2, \cdots, x_n)}{\partial x_i}\right|_{x^*} = 0, \quad i = 1, 2, \cdots, n$$
2.2 Gradient descent method (gradient descent)
For a one-variable function f(x), if there is a small change Δx near a certain point x, then the change f(x+Δx) − f(x) of f(x) can be expressed in the following differential form:

$$f(x+\Delta x) - f(x) \simeq f'(x)\Delta x$$

That is, near x, if Δx and f'(x) have the same sign, then f'(x)Δx, i.e. f(x+Δx) − f(x), is a positive number; if Δx and f'(x) have opposite signs, then f'(x)Δx, i.e. f(x+Δx) − f(x), is a negative number. In particular, if $\Delta x = -\alpha f'(x)$ (where α is a small positive number), then $f(x+\Delta x) - f(x) \simeq -\alpha f'(x)^2$ is a negative number, that is, the value of f(x+Δx) will be smaller than f(x). In other words, if x moves by Δx along the direction $-f'(x)$ opposite to the derivative f'(x), the function value f(x+Δx) is smaller than the original f(x).
As shown in Figure 2-4, the function $f(x) = x^2 + 0.2$ at x = 1.5 has the function value f(x) = 2.45, and the derivative value f'(x) = 3.0, a positive number pointing in the positive direction of the x axis (the domain of f(x)), as shown by the long arrow in the figure.

Let α = 0.15, so $\Delta x = -\alpha f'(x) = -0.45$. Moving x by this Δx (in the direction of the blue arrow in the figure) gives $x_{new} = x + \Delta x = 1.05$, and the function value f(1.05) at the new x = 1.05 is 1.3025, which is the y coordinate of the blue point on the curve in the figure. Because Δx and f'(x) have opposite signs (one negative, one positive), this f(1.05) must be smaller than the original f(1.5).
Just keep repeating this process, that is, move x along the direction $-f'(x)$ opposite to its derivative f'(x); the function value at each new x must be smaller than the previous function value f(x). As x approaches the x value of the minimum point $x^*$, the derivative f'(x) also approaches 0 (because the derivative $f'(x^*) = 0$ at the extreme point $x^*$), so the increment Δx of each move gets closer and closer to 0.
This is the idea of the gradient descent method: starting from an initial x, the value of x is continuously updated with the following formula:

$$x = x - \alpha f'(x)$$
For the current x, moving x along its negative derivative (gradient) direction (i.e. $-f'(x)$) makes f(x) keep getting smaller. Ideally, the x minimizing f(x) is reached, where f'(x) = 0; then iterating the update no longer changes the value of x. As shown in Figure 2-5, x is updated iteratively and constantly approaches the extreme point.
Of course, the step of this movement (i.e. $-\alpha f'(x)$) cannot be too large, because according to the definition of the derivative, the above approximation formula only applies near x. If the step is too large, the optimal value of x may be skipped over, making the value of x oscillate back and forth, as shown in Figure 2-6.
The gradient descent method finds an approximate optimal solution. In order to avoid iterating endlessly, the following checks can be used to decide whether the solution is close enough to optimal:

The derivative value f'(x) is close enough to 0.

The number of iterations has reached the preset maximum number of iterations.
The following is the code of the gradient descent method, where the parameter df is a function computing the derivative f'(x) of the function f(x), x is the initial value of the variable, alpha is the learning rate, iterations is the number of iterations, and epsilon is used to check whether the value of df(x) = f'(x) is close to 0. This gradient descent function saves all the updated x values of the iteration process in a Python list object history and returns this object.
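The function listing itself does not appear at this point in the text; the following is a minimal sketch consistent with the description above and with how it is called below:

def gradient_descent(df, x, alpha=0.01, iterations=100, epsilon=1e-8):
    history = [x]                    # record every updated x
    for i in range(iterations):
        if abs(df(x)) < epsilon:     # stop when the derivative is close to 0
            break
        x = x - alpha*df(x)          # move x against the derivative direction
        history.append(x)
    return history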
For the above function $f(x) = x^3 - 3x^2 - 9x + 2$, its derivative is $f'(x) = 3x^2 - 6x - 9$. To find the minimum of f(x) near x = 1, this gradient_descent() function can be called:
df = lambda x: 3*x**2-6*x-9
path = gradient_descent(df,1.,0.01,200)
print(path[-1])
Get the extreme point x=2.999999999256501 of f (x). The points on the curve corresponding to x in the
iteration process can be drawn:
f = lambda x: np.power(x,3)-3*x**2-9*x+2
x = np.arange(-3, 4, 0.01)
y= f(x)
plt.plot(x,y)
Among them, the quiver function of matplotlib can use arrows to draw velocity vectors, and its function
format is:
quiver([X, Y], U, V, [C], **kw)
Where X, Y are 1D or 2D arrays, indicating the position of the arrow, and U, V are the same 1D or 2D
arrays, indicating the speed (vector) of the arrow. For other parameters, please refer to the official
documentation.
For multivariable functions, the principle of the gradient descent method is the same, except that the gradient is used instead of the derivative.

The global minimum of this function is at (3, 0.5). The function value can be calculated with the following python code:
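The definition of the function does not appear in the text; from the gradient code below, it is the Beale function, so a consistent definition is:

f = lambda x, y: ((1.5 - x + x*y)**2
                  + (2.25 - x + x*y**2)**2
                  + (2.625 - x + x*y**3)**2)
print(f(3, 0.5))   # 0.0: the minimum value at the global minimum point (3, 0.5)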
To draw this surface, first take some evenly distributed coordinate values on the x and y axes:
xmin, xmax, xstep = -4.5, 4.5, .2
ymin, ymax, ystep = -4.5, 4.5, .2
x_list = np.arange(xmin, xmax + xstep, xstep)
y_list = np.arange(ymin, ymax + ystep, ystep)
Then use the np.meshgrid() function to get the grid points (x, y) at their intersections according to the above
x_list and y_list, and calculate the function values corresponding to these grid coordinate points:
x, y = np.meshgrid(x_list, y_list)
z = f(x, y)
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')   # 3D axes (setup implied by the ax calls below)
ax.plot_surface(x, y, z, cmap='jet')         # draw the surface (a sketch; the original call is not shown)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_zlabel('$z$')
ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))
plt.show()
The gradient directions at these grid points can be plotted on a 2D coordinate plane using matplotlib's quiver
function.
df_x = lambda x, y: 2*(1.5 - x + x*y)*(y-1) + 2*(2.25 - x + x*y**2)*(y**2-1) + 2*(2.625 - x + x*y**3)*(y**3-1)
df_y = lambda x, y: 2*(1.5 - x + x*y)*x + 2*(2.25 - x + x*y**2)*(2*x*y) + 2*(2.625 - x + x*y**3)*(3*x*y**2)
dz_dx = df_x(x, y)
dz_dy = df_y(x, y)
fig, ax = plt.subplots(figsize=(8, 8))       # (figure setup and quiver call implied by the text; a sketch)
ax.quiver(x, y, dz_dx, dz_dy, alpha=.5)      # gradient direction at each grid point
ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))
plt.show()
Figure 2-9 Contour lines of the function f(x, y) in the domain plane and the gradient directions at the grid points
In order to directly use the previous gradient descent method code, x in the previous gradient descent
method code can be represented by a numpy vector, and
if abs(df(x))<epsilon:
change into:
if np.max(np.abs(df(x)))<epsilon:
First combine the separated x and y coordinate arrays into one array (e.g. by stacking the flattened grids; the exact line is not shown in the text):

print(x.shape)
print(y.shape)
xy = np.vstack([x.ravel(), y.ravel()])    # one column per grid point
print(xy.shape)

(46, 46)
(46, 46)
(2, 2116)
You can define a gradient function df for this vectorized coordinate point x. The following code gives the vectorized version of the gradient function:

df = lambda x: np.array([2*(1.5 - x[0] + x[0]*x[1])*(x[1]-1)
                         + 2*(2.25 - x[0] + x[0]*x[1]**2)*(x[1]**2-1)
                         + 2*(2.625 - x[0] + x[0]*x[1]**3)*(x[1]**3-1),
                         2*(1.5 - x[0] + x[0]*x[1])*x[0]
                         + 2*(2.25 - x[0] + x[0]*x[1]**2)*(2*x[0]*x[1])
                         + 2*(2.625 - x[0] + x[0]*x[1]**3)*(3*x[0]*x[1]**2)])
The following code starts from x0=(3., 4.) to solve the extreme point of this surface:
x0=np.array([3., 4.])
print("initial point",x0,"gradient",df(x0))
path = gradient_descent(df,x0,0.000005,300000)
print("Extreme point:",path[-1])
Because the gradient at the initial x is very large, the learning rate α must be a very small number (such as 0.000005), otherwise the iteration will oscillate or blow up to infinity. The iteration finally converges to [2.70735828 0.41689171], but this is not the optimal point. This situation can be seen more intuitively by drawing the change of x during the iteration process.
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))
path = np.asarray(path)
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
Figure 2-10 During the iteration process, the gradient value becomes smaller and smaller, and the
convergence becomes slower and slower
During the iterative process, the gradient value becomes smaller and smaller, and the same learning rate
makes the update of x very slow. Even after 100,000 iterations, it still fails to approach the optimal solution.
A natural approach is to use an adaptive learning rate, i.e. increase the learning rate when the gradient
becomes small. As an exercise, the reader can try to modify the gradient descent algorithm to get to the
optimal solution better and faster.
In order to ensure that the optimal solution can be approached better and faster, many improvements to the
gradient descent method have been proposed. These improvements use a changing learning rate or strategy
to update the solution variables (also called parameters). The update strategies or methods for variables
(parameters) include: Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam,
AdaMax, Nadam, AMSGrad, etc.
It should be noted that the function may be a multi-variable function, therefore, its variable x can be a vector
x composed of multiple values. Only some of the commonly used optimization strategies are described
below.
The basic gradient descent method updates x using only the currently calculated gradient, while the Momentum method's update vector considers not only the current gradient but also the previous update vector, that is, the update vector is considered to have inertia. Assuming that $v_{t-1}$ is the vector used for the last update, the current update vector is:

$$v_t = \gamma v_{t-1} + \alpha\nabla f(x)$$
$$x = x - v_t$$
This vector used to update x is called momentum. The momentum method regards the update vector as the velocity of a moving object, and velocity has inertia. Combining the previous update vector with the current gradient smooths out sharp changes in the gradient between iterations, i.e. the motion keeps its previous inertia: where the gradient is small there is still substantial motion, and the speed does not overshoot when the gradient suddenly increases. This method is like a heavy ball rolling downhill, maintaining a certain inertia while looking for the steepest descent path; ordinary gradient descent, by contrast, determines the speed only by the local steepness, rushing fast in steep places and hardly moving in flat places.
That is, v is a tensor with the same shape as x with an initial value of 0. In the iterative process, v is updated
first, and then the parameter x of the function is updated:
v = gamma*v+alpha* df(x)
x = x-v
The following is the gradient descent method based on the momentum method:
def gradient_descent_momentum(df, x, alpha=0.01, gamma = 0.8, iterations = 100,
epsilon = 1e-6):
history=[x]
v= np.zeros_like(x) # momentum
for i in range(iterations):
if np.max(np.abs(df(x)))<epsilon:
print("The gradient is small enough!")
break
v = gamma*v+alpha* df(x) # update momentum
x = x-v # Update variables (parameters)
history.append(x)
return history
Use the gradient descent method of this momentum method to solve the above problem:
path = gradient_descent_momentum(df,x0,0.000005,0.8,300000)
print(path[-1])
path = np.asarray(path)
[2.96324633 0.49067782]
It can be seen that the solution of the momentum method is very close to the optimal solution, as shown in the figure.
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
In the basic gradient descent method, the variable update is the product of the learning rate and the gradient, $\alpha\nabla f(x)$; a gradient that is too large or too small, or a learning rate that is too large or too small, will affect the convergence of the algorithm.

For a multivariate function, the magnitudes of the partial derivatives with respect to the different variables can vary widely. For example, the absolute values of the partial derivatives $\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}$ of a function $f(x_1, x_2)$ of two variables at a certain point $(x_1, x_2)$ may differ greatly, and it is inappropriate to use the same learning rate for both: a learning rate appropriate for one component is too large or too small for the other, resulting in oscillation or stagnation. That is, it is inappropriate to update directly with the following formula:

$$x_1 = x_1 - \alpha\frac{\partial f}{\partial x_1}, \qquad x_2 = x_2 - \alpha\frac{\partial f}{\partial x_2}$$
The Adagrad method, whose name means "adaptive (ada) gradient (grad)", divides each gradient component by a historical cumulative value of that component, which eliminates the problem of unbalanced gradient sizes across components. For the two components $(x_1, x_2)$, if the historical cumulative values $(G_1, G_2)$ of the respective components are calculated, the update becomes:

$$x_1 = x_1 - \alpha\frac{1}{G_1}\frac{\partial f}{\partial x_1}, \qquad x_2 = x_2 - \alpha\frac{1}{G_2}\frac{\partial f}{\partial x_2}$$
∂f
Use the notation g = ∇ f (x ) to represent the partial derivative
t,i θ t,i of the component x in the t-th
∂xi
i
iteration, the component gradient of all rounds from t'=1 to t'=t can be calculated as follows:
t 2
Gt,i = √ ∑ ′ g ′
t =1 t ,i
Dividing $g_{t,i}$ by $G_{t,i}$, the component update is:

$$x_{t+1,i} = x_{t,i} - \alpha\frac{1}{\sqrt{\sum_{t'=1}^{t} g_{t',i}^2}}\, g_{t,i}$$
In order to prevent the divisor from being 0, a small positive number ϵ can be added to the denominator, so that the parameter update formula of AdaGrad becomes:

$$x_{t+1,i} = x_{t,i} - \alpha\frac{1}{\sqrt{\sum_{t'=1}^{t} g_{t',i}^2} + \epsilon}\, g_{t,i}$$
It can be seen that the AdaGrad method eliminates the imbalance between the component gradient sizes. The parameter update formula of AdaGrad can be written in vector form:

$$x_{t+1} = x_t - \alpha\frac{1}{\sqrt{\sum_{t'=1}^{t} g_{t'}^2} + \epsilon} \odot g_t$$
The accumulated sum $\sum_{t'} g_{t'}^2$ can be recorded with a variable gl with an initial value of 0. In each round of iteration:

gl += df(x)**2
x = x - alpha*df(x)/(np.sqrt(gl)+epsilon)
The main advantage of the AdaGrad method is that it eliminates the influence of differing gradient magnitudes, so the learning rate can be set to a fixed value (typically 0.01) without continual adjustment during the iteration. The main disadvantage is that as the iteration proceeds, the cumulative sum $\sum_{t'=1}^{t} g_{t'}^2$ becomes larger and larger, because each term is a positive number. This can make learning slower and slower, or even come to a standstill. Also, forcing every component gradient to advance at a consistent pace may not be realistic, and can divert the direction of progress away from the direction of the optimal solution.
The code of the gradient descent method based on the Adagrad parameter update method is as follows:
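The function listing itself does not appear in the text; the following is a minimal sketch consistent with the AdaGrad update above and with the style of the other optimizers in this chapter:

def gradient_descent_Adagrad(df, x, alpha=0.01, iterations=100, epsilon=1e-8):
    history = [x]
    gl = np.zeros_like(x)            # cumulative sum of squared gradients
    for i in range(iterations):
        if np.max(np.abs(df(x))) < epsilon:
            print("The gradient is small enough!")
            break
        grad = df(x)
        gl += grad**2
        x = x - alpha*grad/(np.sqrt(gl)+epsilon)
        history.append(x)
    return history

# the call producing the result below is not shown in the text; presumably of the form:
# path = gradient_descent_Adagrad(df, x0, alpha, iterations); print(path[-1])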
[-0.69240717 1.76233766]
It can be seen that, due to the equalization of the component gradients, the direction of the variable updates deviates from the direction toward the optimal solution, and the iteration converges to another local optimum.
The Adadelta method improves on AdaGrad. In AdaGrad, the update vector is

$$\Delta x_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t, \qquad x_{t+1} = x_t + \Delta x_t$$

Here $G_t = \sum_{t'=1}^{t} g_{t'}^2$ is the historical sum of squares of $g_t$. As the iteration proceeds, this value $G_t$ gets bigger and bigger, so $\Delta x_t$ gets smaller and smaller, and the convergence gets slower and slower. The solution is to replace the sum of squares $G_t$ with the mean of squares $E[g^2]_t = \frac{G_t}{t}$. This $E[g^2]_t$ can be calculated with the moving average method, that is, as an average of the historical value and the current squared gradient:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2$$
The Adadelta method goes a step further and applies the same moving-average treatment to the update vector itself, to make the change of the update vector smoother:

$$E[\Delta x^2]_t = \gamma E[\Delta x^2]_{t-1} + (1-\gamma)\Delta x_t^2$$

The final update vector is:

$$\Delta x_t = -\sqrt{\frac{E[\Delta x^2]_{t-1} + \epsilon}{E[g^2]_t + \epsilon}}\, g_t$$

Writing $RMS[\cdot]$ for the square root of these moving averages, this is:

$$\Delta x_t = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t}\, g_t, \qquad x_{t+1} = x_t + \alpha\Delta x_t$$
The decay rate parameter ρ of the Adadelta method is usually set to 0.9, and the initial values of $\Delta x_t$, $E[\Delta x^2]_t$, $E[g^2]_t$ are all 0. The code of the gradient descent method based on the Adadelta parameter update is as follows:
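The function listing does not appear in the text; the following is a minimal sketch consistent with the formulas above and the call below:

def gradient_descent_Adadelta(df, x, alpha=1.0, rho=0.9, iterations=100, epsilon=1e-8):
    history = [x]
    Eg2 = np.zeros_like(x)       # moving average of squared gradients E[g^2]
    Edx2 = np.zeros_like(x)      # moving average of squared updates E[Δx^2]
    for i in range(iterations):
        if np.max(np.abs(df(x))) < epsilon:
            print("The gradient is small enough!")
            break
        grad = df(x)
        Eg2 = rho*Eg2 + (1-rho)*grad**2
        dx = -np.sqrt(Edx2 + epsilon)/np.sqrt(Eg2 + epsilon)*grad
        Edx2 = rho*Edx2 + (1-rho)*dx**2
        x = x + alpha*dx
        history.append(x)
    return history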
path = gradient_descent_Adadelta(df,x0,1.0,0.9,300000,1e-8)
print(path[-1])
path = np.asarray(path)
[2.9386002 0.45044889]
It can be seen that the Adadelta method can also converge close to the optimal solution.
The idea of RMSprop is to divide each component of the gradient by an estimate of its length (the absolute value of that component), i.e. to normalize it to roughly unit length, so that the parameter x is always updated with an approximately fixed step size α. In order to estimate the length of each gradient component, RMSprop, similarly to the momentum method, keeps a moving average of the squared gradient values, i.e. of $f'(x)^2$.
The python code for updating model parameters by the RMSprop method is as follows:
v= np.ones_like(x)
#...
grad = df(x)
v = beta*v+(1-beta)* grad**2
x = x-alpha*(1/(np.sqrt(v)+epsilon))*grad
The code of the gradient descent method based on the RMSprop parameter update method is as follows:
def gradient_descent_RMSprop(df,x,alpha=0.01,beta = 0.9, iterations = 100,epsilon
= 1e-8):
history=[x]
v= np.ones_like(x)
for i in range(iterations):
if np.max(np.abs(df(x)))<epsilon:
print("The gradient is small enough!")
break
grad = df(x)
v = beta*v+(1-beta)*grad**2
x = x-alpha*grad/(np.sqrt(v)+epsilon)
history.append(x)
return history
[2.70162562 0.41500366]
This result is still some distance from the optimal solution; the number of iterations can be increased:
path = gradient_descent_RMSprop(df,x0,0.000005,0.99999999999,900000,1e-8)
print(path[-1])
path = np.asarray(path)
[2.9082809 0.47616156]
It can be seen that the basic convergence is close to the optimal solution, as shown in Figure 2-14:
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
The Adam method maintains moving averages of both the gradient and the squared gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$

They are equivalent to the first-order and second-order moments of the gradient. Because their initial values are 0, Adam's authors observed that they are biased toward zero, especially in the early stages of the iteration and when the decay rates are small (i.e. $\beta_1, \beta_2$ close to 1). To correct this problem, the authors use the following correction formulas:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

The parameters are then updated with (as in the code below):

$$x_{t+1} = x_t - \alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$
#https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
def gradient_descent_Adam(df,x,alpha=0.01,beta_1 = 0.9,beta_2 = 0.999, iterations = 100,epsilon = 1e-8):
    history=[x]
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, iterations+1):
        if np.max(np.abs(df(x)))<epsilon:
            print("The gradient is small enough!")
            break
        grad = df(x)
        m = beta_1*m+(1-beta_1)*grad        # first-order moment (moving average of the gradient)
        v = beta_2*v+(1-beta_2)*grad**2     # second-order moment (moving average of the squared gradient)
        m_1 = m/(1-np.power(beta_1, t))     # bias-corrected first moment
        v_1 = v/(1-np.power(beta_2, t))     # bias-corrected second moment
        x = x-alpha*m_1/(np.sqrt(v_1)+epsilon)
        history.append(x)
    return history
For the above problem, execute the gradient descent algorithm gradient_descent_Adam:
path = gradient_descent_Adam(df,x0,0.001,0.9,0.8,100000,1e-8)
#path = gradient_descent_Adam(df,x0,0.000005,0.9,0.9999,300000,1e-8)
print(path[-1])
path = np.asarray(path)
#plt.plot(path)
[2.99999653 0.50000329]
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
Figure 2-15 The Adam method can also converge to a near optimal solution
$$f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$$

That is, the quotient on the right side of the formula is used to approximate the derivative (gradient) of f(x) at x. If ϵ is small enough, this numerical derivative (gradient) should be close enough to the analytical derivative (gradient) on the left side.
Therefore, before training the model with the gradient descent method, the numerically calculated gradient
and the analytical gradient can be compared to verify that the analytical gradient is calculated correctly.
For example, for the previous function of two variables $f(x, y) = \frac{1}{16}x^2 + 9y^2$ used in the gradient descent method, the function value and the analytical gradient at a point $x = (x_0, x_1)$ are calculated by the following code.
f = lambda x: (1/16)*x[0]**2+9*x[1]**2
df = lambda x: np.array( ((1/8)*x[0],18*x[1]))
The following code snippet compares the errors of the analytical and numerical gradients at the point
x = [2., 3.]:
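The helper df_approx() is not listed in the text; a minimal sketch consistent with the call below, using the central-difference formula above:

def df_approx(x, eps):
    dx0 = (f([x[0]+eps, x[1]]) - f([x[0]-eps, x[1]]))/(2*eps)   # ∂f/∂x0
    dx1 = (f([x[0], x[1]+eps]) - f([x[0], x[1]-eps]))/(2*eps)   # ∂f/∂x1
    return dx0, dx1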
x = [2.,3.]
eps = 1e-8
grad = df(x)
grad_approx = df_approx(x,eps)
print(grad)
print(grad_approx)
print(abs(grad-grad_approx))
[ 0.25 54. ]
(0.2500001983207767, 54.00000020472362)
[1.98320777e-07 2.04723619e-07]
It can be seen that as long as the small increment eps used to compute the numerical gradient is small enough, the numerical gradient is close enough to the analytical gradient; this follows from the definition of the derivative. If the error between the two turns out to be relatively large, it means there may be a problem with the calculation of the analytical gradient, the function value, or the numerical gradient; in most cases the error lies in the calculation of the analytical gradient or the function value.
Before using the gradient descent method to solve for the optimal solution, this gradient verification method should be used to ensure that the calculation of the analytical gradient and the function value is correct. On this basis, the hyperparameters of the gradient descent method, such as the learning rate or momentum parameters, can then be tuned.
The parameter f accepted by this function is the function whose gradient is to be calculated, and params holds the parameters of that function; because f may have multiple parameters, params is a collection of them (such as a Python list or tuple). To be general, assume that each element x of params is a multidimensional array containing multiple elements.

In the inner loop, for the element x[idx] pointed to by each subscript idx of x, a small increment is applied to obtain x[idx] + eps and x[idx] − eps, the corresponding function values f() are computed, and then the difference approximation formula of the derivative is used to calculate the partial derivative corresponding to this x[idx], which is assigned to grad[idx]. Note: after each modification, x[idx] must be restored to its original value, otherwise it will affect the calculation of the other partial derivatives, and affect the value of params after exiting this function.
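A minimal sketch of numerical_gradient() consistent with this description (the actual version ships in the book's util.py, as noted below):

import numpy as np

def numerical_gradient(f, params, eps=1e-6):
    numerical_grads = []
    for x in params:
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            old_value = x[idx]
            x[idx] = old_value + eps          # f(..., x[idx]+eps, ...)
            f_plus = f()
            x[idx] = old_value - eps          # f(..., x[idx]-eps, ...)
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2*eps)
            x[idx] = old_value                # restore the original value!
            it.iternext()
        numerical_grads.append(grad)
    return numerical_grads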
You can use this general numerical gradient computation function to compute the numerical gradient of the
previous function:
x = np.array([2.,3.])
param = np.array(x) # The parameter param of numerical_gradient must be a
numpy array
numerical_grads = numerical_gradient(lambda:f(param),[param],1e-6)
print(numerical_grads[0])
[ 0.25 54.00000001]
Note that the first parameter f of numerical_gradient must point to a function object rather than the result of
a function call. It is wrong to write lambda:f(param) above as f(param) .
For a function f that contains some parameters such as param, usually the above lambda expression or the
following wrapper function fun can be used to return a function object that performs calculations on the
parameter param.
def fun():
return f(param)
numerical_grads = numerical_gradient(fun,[param],1e-6)
print(numerical_grads[0])
[ 0.25 54.00000001]
In the following chapters, this general numerical gradient calculation function numerical_gradient() will be
used to calculate the numerical gradient of the model function. This function and others are included in the
book's source code file util.py.
The various parameter update strategies can be wrapped behind a unified optimizer interface, whose base class is:

class Optimizator:
    def __init__(self,params):
        self.params = params
    def step(self,grads):
        pass
    def parameters(self):
        return self.params
params is a list of variables (parameters), and step() is used to update these parameters params according to
the gradient grads. For example, the parameter optimizer class SGD that defines the parameter optimization
strategy using the basic gradient descent method can be derived on the basis of this class:
class SGD(Optimizator):
def __init__(self,params,learning_rate):
super().__init__(params)
self.lr = learning_rate
def step(self,grads):
for i in range(len(self.params)):
self.params[i] -= self.lr*grads[i]
return self.params
Similarly, other parameter optimizers can be defined, such as SGD_Momentum of the momentum method:
class SGD_Momentum(Optimizator):
def __init__(self,params,learning_rate,gamma):
super().__init__(params)
self.lr = learning_rate
self.gamma= gamma
self.v = []
for param in params:
self.v.append(np.zeros_like(param) )
def step(self,grads):
for i in range(len(self.params)):
self.v[i] = self.gamma*self.v[i]+self.lr* grads[i]
self.params[i] -= self.v[i]
return self.params
This is a bowl-shaped surface, as shown in Figure 2-16. Its minimum is at the bottom of the bowl, that is, (0, 0) is the minimum point of the entire function, and the minimum value is 0.

Figure 2-16 The surface of $f(x, y) = \frac{1}{16}x^2 + 9y^2$
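The driver gradient_descent_() is not listed in the text; a minimal sketch of a descent loop driven by an Optimizator, consistent with the call below:

def gradient_descent_(df, optimizator, iterations=100, epsilon=1e-8):
    params = optimizator.parameters()
    history = [np.copy(params[0])]
    for i in range(iterations):
        grads = [df(p) for p in params]        # one gradient per parameter
        if np.max(np.abs(grads[0])) < epsilon:
            print("The gradient is small enough!")
            break
        params = optimizator.step(grads)       # delegate the update strategy
        history.append(np.copy(params[0]))
    return history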
optimizator = SGD([x0],0.1)
path = gradient_descent_(df,optimizator,100)
print(path[-1])
path = np.asarray(path)
path = path.transpose()
[-8.26638332e-06 2.46046384e-98]
The first column in the data set is the population of each city, and the second column is the profit of a food truck in that city; both quantities are in units of 10,000. The following Python code reads the dataset from a text file and outputs the first 5 rows:
x , y = [] ,[]
with open('food_truck_data.txt') as A:
for eachline in A:
s = eachline.split(',')
x.append(float(s[0]))
y.append(float(s[1]))
for i in range(5):
print(x[i],y[i])
6.1101 17.592
5.5277 9.1302
8.5186 13.662
7.0032 11.854
5.8598 6.8233
The urban population and the food truck profit are regarded as the x and y coordinates on the two-dimensional coordinate plane, that is, each data sample is regarded as a coordinate point on the two-dimensional plane, as shown in Figure 3-1, and the data set can be displayed on the plane:
fig, ax = plt.subplots()
ax.scatter(x, y, marker="x", c="red")
plt.title("Food Truck Dataset", fontsize=16)
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.axis([4, 25, -5, 25])
plt.show()
Figure 3-1. Data point set for food truck profits
The goal of the "dining car profit problem" is how to predict the profit of a
dining car for a new urban population based on the existing data of these
urban populations and their corresponding profits.
1. Machine Learning
The "food truck profit problem" is a classic machine learning problem.
Machine learning is to discover certain statistical laws contained in these
data based on empirical data, and use the learned laws to judge or predict
new data in the future.
Machine learning can obtain, from these (urban population, corresponding food truck profit) data, a data model that reflects the relationship between the two, i.e. the functional relationship from urban population to food truck profit. If x is used to represent the urban population and y the profit of the food truck, machine learning is to find a function f(x) such that y satisfies y = f(x). The process of solving this functional relationship or model is called machine learning or model training. With this mathematical (model) function f(x), a new city population x can be substituted into f(x) to predict that city's food truck profit.
In machine learning, the data used to train the model is called sample data, the sample set, or the training set. The sample set contains multiple samples; each sample consists of sample features (such as the urban population) and a sample label (such as the food truck profit), which correspond to the independent variable x and the dependent variable y of the learned function y = f(x) respectively. Sample labels are often also referred to as "true values" or "target values". The ultimate goal of learning a model is to predict the target value or label from the sample features according to this model.
There are many similar problems, such as: "predicting the price of a stock", "identifying who is in a face photo", "determining the text corresponding to a piece of speech", "judging whether an email is spam", "deciding the next move in a chess position", "how the recommendation system of an e-commerce website recommends products a user may be interested in", "autonomous driving", etc.
So-called supervised learning means that for the data used for learning, not only the data features but also the target values are known; for a sample here, not only the population of the city but also the profit of that city's food truck is known. If it is assumed that the functional relationship y = f(x) is satisfied between the sample feature x and the target value y, then from multiple known samples $(x^{(i)}, y^{(i)})$ the best hypothesis function can be solved so that it fits these samples as well as possible. This kind of machine learning, which solves for the best hypothesis function based on multiple data samples $(x^{(i)}, y^{(i)})$ with known target values, is called supervised learning.
Supervised learning is currently the most widely used and most successful
machine learning method for artificial intelligence. For example, image
classification recognition uses a large number of images of known image
categories to identify which category a new image belongs to. The postal
code recognition system for letters can automatically recognize handwritten
postal codes. There are also AlphaGo, which defeated the human Go
champion, and AlphaFold, which defeated all human experts and
successfully predicted the 3D shape of the protein based on the gene
sequence, etc.
Supervised learning relies on training data with known target values, but in many cases manually specifying the target value of each data sample is time-consuming and labor-intensive. For example, for face detection problems, the locations of 68 landmark points need to be annotated for each face; if there are millions of faces to be annotated, how big a project is that? Similarly, it is laborious to label the categories of millions of images.
Is it possible to learn some laws between these data without knowing the
true value of the sample? Unsupervised learning is a machine learning
method without knowing the true value. For example, a clustering
algorithm can analyze data samples to determine which cluster center they
belong to. Principal components analysis (PCA) can determine the principal
components of the data, and then use it to reduce the dimension of the data,
that is, express the high-dimensional data into a low-dimensional form. The
autoencoder takes the data itself as the target value, that is, it uses $(x^{(i)}, x^{(i)})$ as training samples.

Back to the food truck profit problem: the relationship between the feature x and the target value y can be assumed to be a linear function, i.e. the hypothesis function of the training model is:

$$y = f(x) = wx + b$$

Model training is to find the best parameters w, b of this hypothesis function. So what is "best"?
A good model should make the predicted value $f^{(i)} = f(x^{(i)})$ as close as possible to the true value $y^{(i)}$. For the food truck profit problem, the squared error $(f^{(i)} - y^{(i)})^2$ can be used to represent the prediction error of a single sample, and the following mean squared error can be used to represent the prediction error over all samples (the factor 1/2 is included for convenience when differentiating):

$$L = \frac{1}{2m}\sum_{i=1}^m \left(f^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^m \left(wx^{(i)} + b - y^{(i)}\right)^2$$

Its partial derivative with respect to b, for example, is:

$$\frac{\partial L}{\partial b} = \frac{1}{2m}\frac{\partial\left(\sum_i (wx^{(i)} + b - y^{(i)})^2\right)}{\partial b} = \frac{1}{m}\sum_i \left(wx^{(i)} + b - y^{(i)}\right)$$
The necessary condition that the minimum of the function L(w, b) must satisfy is that the gradient, i.e. the partial derivatives of L(w, b) with respect to the variables w and b, equals 0, namely:

$$\frac{\partial L}{\partial w} = \frac{1}{m}\sum_i \left(wx^{(i)} + b - y^{(i)}\right)x^{(i)} = 0$$

$$\frac{\partial L}{\partial b} = \frac{1}{m}\sum_i \left(wx^{(i)} + b - y^{(i)}\right) = 0$$
Let

$$X = \begin{pmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{pmatrix}, \qquad W = \begin{pmatrix} b \\ w \end{pmatrix}, \qquad y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$

Dropping the coefficient $\frac{1}{m}$, the above equations can be written as:

$$X^T(XW - y) = 0$$

Moving $X^Ty$ to the right side of the equation:

$$X^TXW = X^Ty$$

Multiplying both sides by the inverse matrix $(X^TX)^{-1}$ of $X^TX$:

$$W = (X^TX)^{-1}X^Ty$$

Thus, W = (b, w) is obtained. This formula is the normal equation for solving W.
The code for solving the above "food truck profit problem" using the normal equation method is as follows:

import numpy as np
train_x, train_y = np.array(x), np.array(y)   # the population and profit columns read earlier
X = np.ones(shape=(len(train_x), 2))
X[:, 1] = train_x
y = train_y
XT = X.transpose()
XTy = XT @ y
w = np.linalg.inv(XT@X) @ XTy
print(w)

[-3.89578088 1.19303364]

4.6*w[1]+w[0]   # predicted profit for a city with a population of 46,000

1.5921738849602525
Let

$$x = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{pmatrix}, \qquad y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}, \qquad b = \begin{pmatrix} b \\ b \\ \vdots \\ b \end{pmatrix}$$

Then the partial derivatives can be written with vectorized (element-wise) operations:

$$\frac{\partial L}{\partial w} = \mathrm{np.mean}((wx + b - y) \odot x)$$

$$\frac{\partial L}{\partial b} = \mathrm{np.mean}((wx + b - y))$$
The coefficient 1/m can be included in the learning rate, and it is easy to
write the code to calculate the gradient with numpy's vectorization
operation:
X = train_x
w,b = 0.,0.
dw = np.mean((w*X+b-y)*X)
db = np.mean((w*X+b-y))
print(dw)
print(db)
-65.32884974555671
-5.839135051546393
Therefore, the code for the gradient descent algorithm for solving linear
regression can be written:
def gradient_descent(x,y,w,b,alpha=0.01, iterations =
100,epsilon = 1e-9):
history=[]
for i in range(iterations):
dw = np.mean((w*x+b-y)*x)
db = np.mean((w*x+b-y))
if abs(dw) < epsilon and abs(db) < epsilon:
break;
#Update w: w = w - alpha * gradient
w -= alpha*dw
b -= alpha*db
history.append([w,b])
return history
Calling the above gradient descent method with the learning rate alpha = 0.02 and 1000 iterations, the parameters of the hypothesis function can be found:
alpha = 0.02
iterations=1000
history = gradient_descent(X,y,w,b,alpha,iterations)
print(len(history))
print(history[-1])
1000
[1.1822480052540145, -3.7884192615511796]
History records the model parameters of each step in the iterative process,
and the last parameter is the optimal parameter.
How to judge whether the gradient descent method converges to the optimal solution?

For a hypothesis line function f(x) whose input variable and output variable are both single values, convergence can be observed visually by drawing the hypothesis line on the two-dimensional plane of the sample points. To do this, write a function that draws the straight line of the hypothesis function corresponding to the model parameters (w, b), sketched below:
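draw_line() is not listed in the text; a minimal sketch consistent with the call draw_line(plt, w, b, X, 6) below, where the last argument is taken to be the starting x value of the line:

def draw_line(plt, w, b, X, x_start):
    x_line = np.linspace(x_start, np.max(X), 50)   # x range of the line
    plt.plot(x_line, w*x_line + b)                 # the hypothesis line f(x) = wx + b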
#fig, ax = plt.subplots()
plt.scatter(X, y, marker="x", c="red")
plt.title("Food Truck Dataset", fontsize=16)
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.axis([4, 25, -5, 25])
w,b = history[-1]
draw_line(plt,w,b,X,6)
plt.show()
def loss(x,y,w,b):
    m = len(y)
    return np.mean((x*w+b-y)**2)/2

# an equivalent loop implementation of the same loss:
def loss_loop(x,y,w,b):
    m = len(y)
    cost = 0
    for i in range(m):
        f = x[i]*w+b
        cost += (f-y[i])**2
    cost /= (2*m)
    return cost
print(loss(X,y,1,-3))
4.983860697569072
Use the loss() function to calculate the loss corresponding to all parameters (w, b) recorded during the iteration, and draw this loss curve (Figure 3-4):
costs = [loss(X,y,w,b) for w,b in history]
plt.axis([0, len(costs), 4, 6])
plt.plot(costs)
Of course, for this linear regression with one independent variable, the loss
function is a function of two parameters, so the surface corresponding to the
loss function can be drawn (Figure 3-5), and the change of unknown
parameters during the iteration process can be drawn:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=figsize)
ax = fig.add_subplot(111, projection='3d')
ws,bs,z = plot_history(X,y,history)
#plt.axis([result_w-1,result_w+1,result_b-1,result_b+1])
plt.show()
Bad results (Figure 3-8)! It shows that the learning rate is too large.
Although the gradient descent method can always advance in the correct
direction, because the learning rate is too large, the step forward is too
large, and the optimal solution is crossed, so the cost diverges rather than
converges. Figure 3-9 shows the iterative process on the loss surface.
For this problem, the learning rate of 0.02 is more appropriate: it can converge (minimize the objective function value) within fewer iterations.
ws,bs,z = plot_history(X,y,history)
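The analytical gradient of the loss can be verified numerically, as in Section 2.4. The df_approx() used below is not listed in the text; a minimal sketch based on central differences of the loss() function defined above:

def df_approx(X, y, w, b, eps):
    dw = (loss(X, y, w+eps, b) - loss(X, y, w-eps, b))/(2*eps)   # ∂L/∂w
    db = (loss(X, y, w, b+eps) - loss(X, y, w, b-eps))/(2*eps)   # ∂L/∂b
    return dw, db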
w =1.0
b = -2.
eps = 1e-8
dw = np.mean((w*X+b-y)*X)
db = np.mean((w*X+b-y))
grad = np.array([dw,db])
grad_approx = df_approx(X,y,w,b,eps)
print(grad)
print(grad_approx)
print(abs(grad-grad_approx))
[-0.24450692 0.32066495]
(-0.24450690361277339, 0.3206649612508272)
[1.98820717e-08 1.27972190e-08]
It can be seen that the results of the two calculations are consistent. This
allows the analytical gradient to be used with confidence in the gradient
descent method.
Of course, the numerical gradient of the loss function can also be calculated
using the general numerical gradient function in Section 2.4.
3.1.8 Prediction
Once the parameters w, b of the specific hypothesis function
f (x; w, b) = xw + b are determined, a new data (such as urban population)
can be substituted into the hypothesis function to get the predicted value
(such as dining car profit).
For example, all X[i] in the training set X can be substituted into this
hypothesis function to get the predicted value f (X[i]; w, b) = X[i]w + b.
The following code computes predicted values for all samples in x and uses
these predicted values to plot the data points corresponding to those
predicted values (Figure 2-10).
#Use the obtained w to calculate the predicted value of
the sample in X
m=len(X)
predictions = [0]*m
for i in range(m):
predictions[i] = X[i]*w+b
The relationship between the housing features $x = (x_1, x_2)$ and the house price y can be expressed as:

$$y = f(x) = w_1x_1 + w_2x_2 + b$$

Sometimes, in order to better describe the relationship between x and y, some higher-order features such as $x_1^2$, $x_2^2$ can be constructed on the basis of the original features; using the original features and the higher-order features together as new features, the relationship between the new features and the true value is represented by the following function:

$$y = f(x) = w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^2 + b$$
The function f(x) is a nonlinear function of the features $x_1, x_2$, but it is still a linear function of the parameters. Considering $x_1^2$ and $x_2^2$ as two new features $x_3$ and $x_4$, the function is also a linear function of $x_1, x_2, x_3, x_4$.
Generally, if a sample contains K features, the hypothesis function of linear regression is:

$$f(x) = w_1x_1 + w_2x_2 + \cdots + w_Kx_K + b = \sum_{i=1}^K w_ix_i + b$$
A row vector can be used to represent all the features of a sample, $x = (x_1, x_2, \ldots, x_K)$, and a column vector can be used to represent the coefficients of these features in the hypothesis function, $w = (w_1, w_2, \ldots, w_K)^T$, so that the hypothesis function can be written as:

$$f(x) = (x_1, x_2, \cdots, x_K)\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_K \end{pmatrix} + b = xw + b \tag{3-13}$$
The larger the absolute value of the coefficient $w_i$ of $x_i$, the greater the influence of $x_i$ on the output value of f(x); therefore $w_i$ is often called a weight, while b, which is not associated with any feature $x_i$, is called the bias. If a constant feature $x_0 = 1$ is added to every sample and b is written as $w_0$, the hypothesis function can be written more compactly as:

$$f_w(x) = w_1x_1 + w_2x_2 + \cdots + w_Kx_K + w_0 = (x_0, x_1, x_2, \cdots, x_K)\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_K \end{pmatrix} = xw$$

All m samples can be stacked into a matrix with one sample (row vector) per row:

$$X = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{pmatrix} = \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_K^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_K^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_K^{(m)} \end{pmatrix}$$

Because each sample produces one output, the output of the function for all samples can be written in vector (matrix) form:

$$f_w(X) = Xw \tag{3-14}$$

This matrix product can be easily computed with numpy, i.e. with np.dot(X, w) or X.dot(w) or X@w. For example, X below contains 2 samples, each with 3 features, and w is the weight vector corresponding to the 3 features; f_w(X) can then be computed directly:

import numpy as np
X = np.array([[1,8,3],[1,7,5]])
w = np.array([1.3, 2.4,0.5])
X@w

array([22. , 20.6])

When expressing operations with vectors (matrices), be sure to check that the dimensions are consistent. For Xw above, X is 2 × 3 and w is 3 × 1, so $f_w(X) = Xw$ is 2 × 1.

Model training is to use a set of samples $\{x^{(i)}, y^{(i)}\}$ with known target values to find the best hypothesis function.
Like the univariate hypothesis function, for the multivariate hypothesis function the loss function based on the mean squared error can also be used to measure the error between the predicted values and the true values of the model:

$$L(w) = \frac{1}{2m}\sum_{i=1}^m \left(f_w(x^{(i)}) - y^{(i)}\right)^2$$
L(w) is a function of the unknown parameter w; since w contains multiple components, this is a multivariate function of w. The model training of linear regression is to find the parameter $w = (w_0, w_1, \ldots)$ that minimizes the value of the loss function. If L(w) takes a minimum at $w^*$, the gradient (the partial derivatives) at that point should be 0:

$$\nabla L(w^*) = 0$$
In order to better see the derivation of the partial derivatives on the left side of the equation, some auxiliary notation is introduced:

$$f^{(i)} = f_w(x^{(i)}) = x^{(i)}w = w_1x_1^{(i)} + w_2x_2^{(i)} + \cdots + w_Kx_K^{(i)} + w_0 \cdot 1$$

$$\delta^{(i)} = f^{(i)} - y^{(i)}$$

$$L(w) = \frac{1}{2m}\sum_{i=1}^m \left(\delta^{(i)}\right)^2$$

L(w) can be regarded as the average of the m values $(\delta^{(i)})^2$, divided by 2, where $\delta^{(i)}$ is the prediction error $f^{(i)} - y^{(i)}$ of the i-th sample.
Using the linearity of differentiation (the derivative of a sum is the sum of the derivatives of each term) and the chain rule for composite functions:

$$\frac{\partial L(w)}{\partial \delta^{(i)}} = \frac{1}{2m}\,2\delta^{(i)} = \frac{\delta^{(i)}}{m}$$

$$\frac{\partial \delta^{(i)}}{\partial f^{(i)}} = 1$$

$$\frac{\partial f^{(i)}}{\partial w_j} = x_j^{(i)}$$

$$\frac{\partial L(w)}{\partial w_j} = \sum_{i=1}^m \frac{\partial L(w)}{\partial \delta^{(i)}} \times \frac{\partial \delta^{(i)}}{\partial f^{(i)}} \times \frac{\partial f^{(i)}}{\partial w_j} = \frac{1}{m}\sum_{i=1}^m \delta^{(i)} \times 1 \times x_j^{(i)} = \frac{1}{m}\sum_{i=1}^m \left(f^{(i)} - y^{(i)}\right)x_j^{(i)} = \frac{1}{m}\sum_{i=1}^m \left(f_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
The sum in the rightmost formula (apart from the coefficient) can be regarded as the dot product of two vectors:

$$\sum_{i=1}^m \left(f_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} = X_{:,j}^T(Xw - y), \qquad Xw - y = \begin{pmatrix} f_w(x^{(1)}) - y^{(1)} \\ f_w(x^{(2)}) - y^{(2)} \\ \vdots \\ f_w(x^{(m)}) - y^{(m)} \end{pmatrix}, \qquad X_{:,j} = \begin{pmatrix} x_j^{(1)} \\ x_j^{(2)} \\ \vdots \\ x_j^{(m)} \end{pmatrix}$$

where $X_{:,j}$ is the jth column of the matrix X (i.e. the jth feature of all samples). Therefore, the partial derivative can be written in vector form:

$$\frac{\partial L(w)}{\partial w_j} = \frac{1}{m}X_{:,j}^T(Xw - y)$$

Collecting all the partial derivatives gives the gradient in vector form:

$$\nabla L(w) = \left(\frac{\partial L(w)}{\partial w_1}, \cdots, \frac{\partial L(w)}{\partial w_j}, \cdots\right)^T = \frac{1}{m}X^T(Xw - y)$$

You can check whether the dimensions of the matrix multiplication are consistent: $X^T$ is n × m and (Xw − y) is m × 1, so ∇L(w) is n × 1.

For the above example, this formula can be used directly to calculate the gradient of the loss:

y = np.array([2.3,1.7])
(1/len(y))*X.transpose() @ (X@w-y)

Setting the gradient to 0 gives:

$$X^T(Xw - y) = 0$$

According to the normal equation, w can be obtained:

$$w = (X^TX)^{-1}X^Ty$$
2. Fitting plane
The following code generates a set of data point samples from the plane z = ax + by + c (here a = 3, b = 2, c = 5); each data sample has the features (x, y), and its target value is the z value on the plane plus noise:

n_points = 20
a = 3
b = 2
c = 5
x_range = 5
y_range = 5
noise = 3
xs = np.random.uniform(-x_range,x_range,n_points)
ys = np.random.uniform(-y_range,y_range,n_points)
zs = xs*a+ys*b+ c+ np.random.normal(scale=noise,size=n_points)   # size added so each point gets its own noise
plt.show()
Figure 3-11 Data set sampled from a plane in three-dimensional space
Use the above sample points to solve the normal equation to fit a plane, then use the fitted function to calculate the predicted values zs2 of the original data points (xs, ys), and display the original data points and the fitted data points, as well as the original plane and the fitted plane. It can be seen that the fitting effect is very good.

# fit a plane
X = np.hstack((np.ones((len(xs), 1), dtype=xs.dtype),xs[:, None],ys[:, None]))
y = zs
w = np.linalg.inv(X.T@X) @ (X.T@y)   # solve the normal equation (solving step implied by the text)
zs2 = X @ w                          # predicted values for the original data points
When the number of samples is large or there are many sample features, the normal equation requires computing an inverse matrix, which is time-consuming. Therefore, an iterative method is generally used to solve the system of equations, and the most typical iterative method is the gradient descent method.
The following is the gradient descent algorithm implemented by numpy vector operations:
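The function definition is not reproduced here; the following sketch is consistent with the description after the plot (feature 1 is added to each sample inside the function) and with the second call below, which unpacks a parameter history. (Note the first call below unpacks two values, so the book's actual version apparently also returns a cost history.)

def gradient_descent_vec(X, y, alpha=0.01, iterations=100):
    X = np.hstack((np.ones((len(X), 1)), X))    # add feature 1 to every sample
    w = np.zeros(X.shape[1])
    history = []
    for i in range(iterations):
        grad = X.transpose() @ (X @ w - y)/len(y)   # (1/m) X^T (Xw - y)
        w = w - alpha*grad
        history.append(w.copy())
    return history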
learning_rate = 0.02
num_iters = 100
X = np.hstack((xs[:, None],ys[:, None]))
w,cost_history = gradient_descent_vec(X, y,learning_rate, num_iters)
print("w:",w)
print(cost_history[:5])
plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with learning rate = " + str(learning_rate),
fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.grid()
plt.show()
In the above code, feature 1 is added to each data feature X [i] of the input data X . Therefore,
when calling this function, you only need to pass the characteristics of the input data itself. The
following code tests the vector version of the gradient descent method to fit the above-mentioned
plane data:
learning_rate = 0.02
num_iters = 100
X = np.hstack((xs[:, None],ys[:, None]))
history = gradient_descent_vec(X, y,learning_rate, num_iters)
print("w:",history[-1])
According to the model parameters in the iterative process recorded in history, calculate the
average loss of the hypothetical function corresponding to the model parameters in each iterative
process on the training data set:
def compute_loss_history(X,y,w_history):
loss_history = []
for w in w_history:
errors = X@w[1:]+w[0]-y
loss_history.append((errors**2).mean()/2)
return loss_history
loss_history = compute_loss_history(X,y,history)
print(loss_history[:-1:10])
plt.plot(loss_history, linewidth=2)
plt.title("Gradient descent with learning rate = " + str(learning_rate),
fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.grid()
plt.show()
It can be seen that the same good fitting results have been obtained.
If the temperature and pressure data are placed in a csv format file 'data.csv', the read_csv() function of the pandas package can be used to read the data from the file:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('data.csv')    # read the table (this line is implied by the use of data below)
X= data.values[:,1:2]
y= data.values[:,2]
print(X)
print(y)
[[  0.]
 [ 20.]
 [ 40.]
 [ 60.]
 [ 80.]
 [100.]]
[2.0e-04 1.2e-03 6.0e-03 3.0e-02 9.0e-02 2.7e-01]
history = gradient_descent_vec(X,y,0.00005,50)
w = history[-1]
print("w:",w)
w: [-0.00016333 0.00165672]
Draw the cost curve and the predicted values of the test:

def plot_history_predict(X,y,w,loss_history,fig_size=(12,4)):
    fig = plt.gcf()
    fig.set_size_inches(fig_size[0], fig_size[1], forward=True)
    plt.subplot(1, 2, 1)
    plt.plot(loss_history)
    predicts = X @ w[1:] + w[0]      # w[0] is the bias, as in compute_loss_history above
    x = X[:, 0]                      # the original feature, used as the horizontal axis
    plt.subplot(1, 2, 2)
    plt.scatter(x, predicts) #, marker="x", c="red")
    indices = x.argsort()
    sorted_x = x[indices[::-1]]
    sorted_predicts = predicts[indices[::-1]]
    plt.plot(sorted_x, sorted_predicts)   # connect the sorted predictions as a curve
    plt.show()
Plot the loss function curve and the predicted values for the training samples:
loss_history = compute_loss_history(X,y,history)
plot_history_predict(X,y,w,loss_history)
Although the best linear model is obtained to fit the relationship between temperature and
pressure, from the figure, pressure and temperature are not a linear relationship, and the linear
hypothesis function is not the best choice. There should be a nonlinear relationship between
them. Naturally, one would think of using a polynomial function such as a 3-degree polynomial
to represent this nonlinear relationship between pressure y and temperature x.
$$f(x) = w_3x^3 + w_2x^2 + w_1x + w_0 = (1, x, x^2, x^3)(w_0, w_1, w_2, w_3)^T$$

From the original feature x, the new features $x^2, x^3$ are artificially constructed, and 1 is also used as a feature, so that the hypothesis function is:

$$f(x; w) = (1, x, x^2, x^3)(w_0, w_1, w_2, w_3)^T = xw$$
X2 = np.hstack((X,X**2,X**3))
print(X2)
Then the gradient descent method was implemented, but it was found that the loss function
continued to increase rapidly until infinity, and did not converge.
history = gradient_descent_vec(X2,y,0.00005,50)
print("w:",history[-1])
D:\Programs\Anaconda3\lib\site-packages\ipykernel_launcher.py:8:
RuntimeWarning: overflow encountered in matmul
D:\Programs\Anaconda3\lib\site-packages\ipykernel_launcher.py:10:
RuntimeWarning: overflow encountered in matmul
# Remove the CWD from sys.path while we load stuff.
D:\Programs\Anaconda3\lib\site-packages\ipykernel_launcher.py:8:
RuntimeWarning: invalid value encountered in matmul
This is because the feature values of the data are all relatively large, which produces a large gradient, so a very small learning rate must be used; but too small a learning rate makes the algorithm converge very slowly. The solution is to normalize the data features so that the feature values lie in a small range (such as [0,1] or [-1,1]).
3.1.10 Normalization of data
Normalizing a feature is very simple: first compute the mean of that feature over all samples, then compute the degree to which the feature deviates from the mean (i.e. its standard deviation), and finally subtract the mean from the feature value of every sample and divide by the standard deviation:

x ← (x − mean(x)) / stddev(x)

where x is a set of numbers, mean(x) is their mean, and stddev(x) is their standard deviation. For example, for the set of feature values {−5, 6, 9, 2, 4}, the mean is:

mean = (−5 + 6 + 9 + 2 + 4)/5 = 3.2
Subtract this mean from all the feature values to get the deviations, and square them:

(−5 − 3.2)^2 = 67.24
(6 − 3.2)^2 = 7.84
(9 − 3.2)^2 = 33.64
(2 − 3.2)^2 = 1.44
(4 − 3.2)^2 = 0.64

The variance is the mean of these squared deviations, (67.24 + 7.84 + 33.64 + 1.44 + 0.64)/5 = 22.16, and the standard deviation is its square root, √22.16 ≈ 4.71.
For the previous X2, you can use the following code to compute the mean (mean) and standard deviation (stddev) of each feature and normalize with them.
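A sketch of that computation (variable names assumed):

means = X2.mean(axis=0)          # per-feature mean
stddevs = X2.std(axis=0)         # per-feature standard deviation
X2 = (X2 - means) / stddevs      # normalize each feature
print(means, stddevs)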
Of course, the above normalization can be simplified to one line of code:

X2 = (X2 - X2.mean(axis=0)) / X2.std(axis=0)
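Gradient descent can then be rerun on the normalized features (the learning rate and iteration count here are illustrative; readers can tune them):

history = gradient_descent_vec(X2, y, 0.1, 10000)
loss_history = compute_loss_history(X2, y, history)
print(loss_history[-1])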
It can be seen that the loss (cost) has dropped from 0.0019130770858765333 to 7.023862601858041e-05. The loss function curve and the fitted polynomial curve can then be drawn; readers can adjust the learning rate and the number of iterations to reduce the loss further.
plot_history_predict(X2,y,history[-1],loss_history)
It can be seen from the final prediction results of the training samples that the model function can
better fit the training data.
Summary of the temperature and pressure problem: an appropriate hypothesis function should be chosen according to the characteristics of the problem's data. If the hypothesis function cannot fit the data well, new features can be constructed from the existing ones. In addition, the feature values of the data should lie in a relatively small, normalized range, for example normalized to mean 0 and variance 1.
3.2 Evaluation of the model
So is it better to use more complex functions to represent the relationship between data features
and target values?
The following code randomly samples some coordinate points (x,y) around a sinusoidal curve:
#https://github.com/ctgk/PRML/blob/master/notebooks/ch01_Introduction.ipynb
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
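# sample() is used throughout this section; a minimal sketch consistent with the
# commented-out lines below:
def sample(n_samples):
    x = np.sort(np.random.uniform(0, 1, n_samples))
    y = np.sin(2*np.pi*x) + np.random.normal(scale=0.25, size=x.shape)
    return x, y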
np.random.seed(896)
n_samples = 10
x,y = sample(n_samples)
#x = np.sort(np.random.uniform(0,1,n_samples))
#y = np.sin(2*np.pi*x) + np.random.normal(scale = 0.25, size=x.shape)
Fit these sample points (x,y) with polynomials of different degrees K, and solve the model
function using the normal equation method:
x_test = np.linspace(0, 1, 100)     # test points for plotting the fitted curves
for i, K in enumerate([0, 1, 3, 9]):
    plt.subplot(2, 2, i + 1)
    X = np.array([np.power(x,k) for k in range(K+1)])
    X = X.transpose()
    XT = X.transpose()
    XTy = XT @ y
    w = np.linalg.inv(XT@X) @ XTy   # normal equation: w = (X^T X)^{-1} X^T y
    #w = np.linalg.pinv(X) @ y
    print("w=:",w)
    y_predict = 0
    for j, wj in enumerate(w):      # evaluate the fitted polynomial on the test points
        y_predict += wj*np.power(x_test,j)
    y_test = np.sin(2*np.pi*x_test)
    plt.plot(x_test, y_test, c="g", label="$\sin(2\pi x)$")
    plt.plot(x_test, y_predict, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
plt.show()
w=: [-0.19410186]
w=: [ 1.167293 -2.40352288]
w=: [ -0.69160733 14.4684786 -40.54048788 27.82130232]
w=: [ 1.04164258e+03 -3.73815312e+04 4.73769000e+05 -3.06523600e+06
1.16099600e+07 -2.72868240e+07 4.03530320e+07 -3.65586800e+07
1.85418720e+07 -4.03263600e+06]
Figure 3-17 The fitting situation of the normal equation method with polynomial functions of
degree 0, degree 1, degree 3 and degree 9 as hypothetical functions
The following code uses the gradient descent method to solve the above hypothetical function:
lr = 0.4
iterations = 10000000
for i, K in enumerate([0, 1, 3, 9]):
    plt.subplot(2, 2, i + 1)
    if i==0: continue               # degree 0 has no feature columns; skip it
    X = np.array([np.power(x,k+1) for k in range(K)])
    X = X.transpose()
    w_history = gradient_descent_vec(X,y,lr,iterations,0.9)
    w = w_history[-1]
    print("w=:",w)
    y_predict = w[0]                # the bias term returned by gradient_descent_vec
    for j in range(1, len(w)):
        y_predict = y_predict + w[j]*np.power(x_test,j)
    y_test = np.sin(2*np.pi*x_test)
    plt.plot(x_test, y_test, c="g", label="$\sin(2\pi x)$")
    plt.plot(x_test, y_predict, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
plt.show()
gradient is small enough!
iterated num is : 2
w=: [-0.19410186 -0.5068096 ]
gradient is small enough!
iterated num is : 16124
w=: [-0.19410186 3.0508158 -9.53751983 6.16440066]
w=: [ -0.19410186 -12.29842075 62.23443929 -94.51757917 -12.68948412
83.66870987 51.44475318 -55.72137559 -94.36419461 71.90960592]
Figure 3-18 The fitting situation of polynomial functions of degree 0, degree 1, degree 3 and
degree 9 based on the gradient descent method as hypothetical functions
In this example, the polynomial of degree 3 fits best. The polynomial of degree 9 has a small fitting error on the training samples, but it is far from the underlying true relationship of the data. When the fitting error on the training set is small but the error on test samples is large, the model is said to be overfitting. Overfitting is caused by the model function being too complex relative to the training sample set. One way to solve overfitting is to use a lower-complexity function, such as the 3-degree polynomial, as the hypothesis function; another is to increase the number of samples in the training data set.
For example, after increasing the number of training samples in the following code, a good fit can also be obtained for the 9-degree polynomial hypothesis function.
n_samples = 100
x,y = sample(n_samples)
K = 9
X = np.array([np.power(x,k) for k in range(K+1)])   # rebuild the design matrix for the new samples
X = X.transpose()
XT = X.transpose()
XTy = XT @ y
w = np.linalg.inv(XT@X) @ XTy
#w = np.linalg.pinv(X) @ y
print("w=:",w)
y_predict = 0
for j, wj in enumerate(w):
    y_predict += wj*np.power(x_test,j)
y_test = np.sin(2*np.pi*x_test)
plt.plot(x_test, y_test, c="g", label="$\sin(2\pi x)$")
plt.plot(x_test, y_predict, c="r", label="fitting")
plt.ylim(-1.5, 1.5)
plt.show()
Figure 3-19 The fitting situation of the 9th degree polynomial function with increasing training
samples
Figure 3-20 The left, middle, and right are underfitting, optimal fitting, and overfitting
respectively
As shown in Figure 3-20, for a group of two-dimensional coordinate points on the plane, using a linear function (straight line), a quadratic function (parabola), and a high-degree polynomial as the model function of linear regression, it can be clearly seen that the left, middle, and right are respectively underfitting, optimal fitting, and overfitting. If the model is too simple it underfits; if it is too complex it overfits; only when the complexity is appropriate does it produce a good fit.
According to this analysis of the causes of underfitting and overfitting, the underfitting problem can be alleviated by the following method:
Add more sample features. For example, constructing additional features such as x^2 and x^3 from a sample feature x increases the expressive power of the model and alleviates the underfitting problem.
The overfitting problem can be addressed by the following method:
Reduce the complexity of the model: limit the complexity of the hypothesis function (model) by choosing a low-complexity hypothesis function or through regularization.
As can be seen from the previous examples, it is often impossible to judge whether the final model function fits the data well only from the declining training loss curve. Even if a function has a small loss on the training samples, it may produce large errors on samples outside the training set, i.e. overfitting occurs. In other words, the "generalization ability" of the model function is insufficient: it cannot express the relationship between the data features and the target values of actual samples. Overfitting arises because training pays too much attention to fitting the training set, so the trained model produces larger errors on data that differ from the training set. Of course, it is difficult to judge whether a model is overfitting or underfitting from the value of the loss function alone; in some cases, even a small loss value may correspond to underfitting.
The purpose of training a model is to use it to predict new data. Even if the model fits the training data well, it is useless if its predictions on new data are poor, just as an athlete who trains to be the best on his own team has no guarantee of performing as well against outside competitors.
In order to help judge whether a model function is overfitting or underfitting, in addition to the
loss function curve, the quality of the model function is usually evaluated with the help of a
sample set different from the training set.
Therefore, in machine learning, in addition to the training set used to train the model, a separate test set is generally used to evaluate the trained model. For a model function, the prediction error on the samples of the test set (the error between the predicted values and the target values) can be computed. If the error on the test set is similar to the error on the training set, it can be preliminarily judged that the model function has good generalization ability. The test set should cover as many different kinds of data as possible, so that it can evaluate whether the trained model generalizes well.
Besides the training set and the training algorithm, the choice of hypothesis function model also affects performance. For example, in the earlier problem of fitting two-dimensional data points with polynomials, models trained with polynomial functions of different degrees perform differently: some underfit and some overfit. In addition, the hyperparameters of the training algorithm (such as the learning rate, batch size, and number of iterations) also have a great influence on the training results. For example, with other conditions unchanged, the number of iterations directly affects the training error: too few iterations may cause underfitting, and too many may cause overfitting.
For example, when training the model, the loss (error) on the training set and on the validation set can be computed at the same time. When the iterations start, both the validation error and the training error decrease. As the number of iterations increases, the validation error may instead become larger, indicating that the generalization ability is getting worse; at this point the iteration can be stopped early. This method is called early stopping: the validation set is used to prevent too many iterations during training. As another example, for the earlier polynomial fitting, without visual help, the training errors and validation errors of polynomial model functions of different degrees can be used to select a polynomial function of appropriate degree.
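A minimal sketch of how the stopping point could be chosen from a recorded validation-loss history (the function and its patience rule are assumptions, not from the book):

def early_stopping_index(valid_losses, patience=5):
    # stop when the validation loss has not improved for `patience` recordings
    best, best_i, wait = float('inf'), 0, 0
    for i, v in enumerate(valid_losses):
        if v < best:
            best, best_i, wait = v, i, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_i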
Thus, a training set is used to train the model, a validation set is used to evaluate and select the
model, and sometimes a separate test set is used to test the resulting model. Sometimes the test
and validation sets are not distinguished.
For the fitting problem of the previous sinusoidal sampling points, the following code samples a
total of three sample sets: training set, verification set, and test set.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
n_pts = 10
x_train,y_train = sample(n_pts)
x_valid,y_valid = sample(n_pts)
x_test,y_test = sample(n_pts)
If you want to choose hypothetical functions of different degrees such as K=0, 1, 2, 3, ... 9, you
can use these hypothetical functions of different degrees to train and calculate the training error
and verification error.
def rmse(a, b):
    return np.sqrt(np.mean(np.square(a - b)))

M = 10
errors_train = []
errors_valid = []
for K in range(M):
    X = np.array([np.power(x_train,k) for k in range(K+1)]).transpose()
    XT = X.transpose()
    XTy = XT @ y_train
    w = np.linalg.inv(XT@X) @ XTy
    predict_train = X @ w
    error_train = rmse(y_train,predict_train)
    errors_train.append(error_train)
    X_val = np.array([np.power(x_valid,k) for k in range(K+1)]).transpose()  # same features for the validation set
    predict_valid = X_val @ w
    error_valid = rmse(y_valid,predict_valid)
    errors_valid.append(error_valid)
It can be seen that when the polynomial degree is lower than 2, both the training error and the validation error are relatively large: the fit on both the training set and the validation set is poor, i.e. the model is underfitting. When the polynomial degree is around 3 to 4, both errors are relatively small. When the degree exceeds 5, the training error continues to decrease but the validation error increases instead, indicating that the generalization ability of the model begins to deteriorate. Therefore, a polynomial function of degree 3 or 4 is a good hypothesis function.
What is the appropriate size (number of samples) for the training set, validation set, and test set? It depends on the actual problem. For some problems the cost of obtaining samples is low and the number of samples can reach hundreds of thousands or millions; for example, shopping websites easily collect large amounts of user shopping behavior data. There the training set can account for more than 90% of all sample data, while the validation set and test set can each be as low as about 5%, because with a very large total the 5% is still a large number of samples. For other problems samples are expensive to obtain and the total number is small; for example, in medical imaging the proportions of the validation set and test set will be relatively high, such as up to 20%, and the proportion of the training set correspondingly lower. For data sets of moderate size, the proportions of training, validation, and test samples are usually set to about 60%, 20%, and 20%. This division is not absolute and should be determined according to the actual problem.
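As a concrete illustration of such a split (a sketch; the helper name and ratios are assumptions):

import numpy as np
def split_dataset(X, y, ratios=(0.6, 0.2, 0.2), seed=0):
    # shuffle the sample indices, then slice into train / validation / test parts
    idx = np.random.RandomState(seed).permutation(len(y))
    n_train = int(ratios[0]*len(y))
    n_valid = int(ratios[1]*len(y))
    tr, va, te = idx[:n_train], idx[n_train:n_train+n_valid], idx[n_train+n_valid:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])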
For the previous problem, you can draw the training loss curve and the validation loss curve for a specific hypothesis function, such as a polynomial of degree 9, and observe how the loss (error) on the training set and the validation set changes.
The loss() function below calculates the loss of the hypothesis function corresponding to given model parameters on a sample set (X, y); learning_curves_trainSize() calculates the training loss and validation loss for training sets of different sizes (trainSize) and draws the training loss curve and validation loss curve.
def loss(w,X,y):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    predictions = X @ w
    errors = predictions - y
    return (errors**2).mean()/2
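learning_curves_trainSize() is called below without its listing; a minimal sketch consistent with how it is used (training on growing subsets and plotting both losses) is:

def learning_curves_trainSize(X_train, y_train, X_valid, y_valid, alpha, iterations):
    train_losses, valid_losses = [], []
    sizes = range(2, len(y_train) + 1)       # start from 2 samples, add 1 each time
    for m in sizes:
        history = gradient_descent_vec(X_train[:m], y_train[:m], alpha, iterations)
        w = history[-1]
        train_losses.append(loss(w, X_train[:m], y_train[:m]))
        valid_losses.append(loss(w, X_valid, y_valid))
    plt.plot(sizes, train_losses, label="training loss")
    plt.plot(sizes, valid_losses, label="validation loss")
    plt.xlabel("training set size")
    plt.ylabel("loss")
    plt.legend()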
#K = 4
K =2
X_train = np.array([np.power(x_train,k+1) for k in range(K)]).transpose()
X_valid = np.array([np.power(x_valid,k+1) for k in range(K)]).transpose()
alpha=0.3
iterations = 50000
learning_curves_trainSize(X_train, y_train, X_valid,
y_valid,alpha,iterations)
plt.ylim(-0.5, 20)
plt.show()
It can be seen that after the training set size exceeds 40, the training loss and verification error
are relatively close. Therefore, for the 2-degree polynomial hypothetical function, the number of
samples in the training set should be greater than 40.
For a certain hypothetical function, it is also possible to observe how many iterations are
appropriate through the iterative learning curve.
For a polynomial hypothesis function of degree 2, the following code plots the learning curves
for the training and validation losses over its iterations:
np.random.seed(89)
n_pts = 100
x_train,y_train = sample(n_pts)
x_valid,y_valid = sample(n_pts)
K = 2
X_train = np.array([np.power(x_train,k+1) for k in range(K)]).transpose()
X_valid = np.array([np.power(x_valid,k+1) for k in range(K)]).transpose()
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
dataset = sio.loadmat("water.mat")
x_train = dataset["X"]
y_train = dataset["y"].flatten()      # the target keys are assumed to be "y"/"yval"/"ytest"
x_val = dataset["Xval"]
y_val = dataset["yval"].flatten()
x_test = dataset["Xtest"]
y_test = dataset["ytest"].flatten()
(12, 1) (12,)
(21, 1) (21,)
(21, 1) (21,)
[[-15.93675813]
[-29.15297922]
[ 36.18954863]
[ 37.49218733]
[-48.05882945]]
[ 2.13431051 1.17325668 34.35910918 36.83795516 2.80896507]
The sample data are divided into a training set, a validation set, and a test set: x_train and y_train are the data features and target values of the training set, x_val and y_val those of the validation set, and x_test and y_test those of the test set. The sample points of the training set and the validation set can be visualized on a two-dimensional plane; the red ones are the training set samples.
plt.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.scatter(x_val, y_val, marker="o", s=40, c='blue')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Training sample", fontsize=16)
plt.show()
Figure 3-24 Changes in reservoir water level and corresponding dam discharge
Observing that the data feature values are not small, it is best to normalize the data features before running gradient descent. The following code computes the mean and standard deviation of the training samples x_train and normalizes them into x_train_1.
train_means = x_train.mean(axis=0)
train_stdevs = np.std(x_train, axis=0, ddof=1)
x_train_1 = (x_train - train_means) / train_stdevs
print(x_train_1[:3])
[[-0.36214078]
[-0.80320484]
[ 1.377467 ]]
Perform gradient descent linear regression on the normalized training samples x_train_1 and their target values y_train, obtain the final model parameters w and the loss (cost) corresponding to each w recorded during the iterations, and output some of the training errors from the iterative process together with the final model parameters and training error.
alpha = 0.3
iterations = 100000
history = gradient_descent_vec(x_train_1,y_train,alpha,iterations)
w = history[-1]
print("w",history[-1])
loss_history = compute_loss_history(x_train_1,y_train,history)
print(loss_history[:-1:len(loss_history)//10])
print(loss_history[-1])
The gradient descent converged after 186 iterations. The loss function curve, the predicted values of the training samples, and the line of the hypothesis function corresponding to the converged w can be drawn to observe the effect of training. The training error of the model is about 22.37. A large training error indicates that the model fits the training data poorly; this is called underfitting, i.e. the model is not expressive enough to describe the relationship between the sample features and the target values, mainly because the model is too simple.
plot_history_predict(x_train_1,y_train,history[-1],loss_history)
Write a function loss() that computes the model error, normalize the validation set and test set with the mean and standard deviation of the training set, and then compute and output the loss (error) on the validation set and the test set.

def loss(w,X,y):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    predictions = X @ w
    errors = predictions - y
    return (errors**2).mean()/2

x_val_1 = (x_val - train_means) / train_stdevs    # normalize with the training-set statistics
x_test_1 = (x_test - train_means) / train_stdevs
print(x_val_1.shape,y_val.shape,w.shape)
loss_val = loss(w,x_val_1,y_val)
loss_test = loss(w,x_test_1,y_test)               # compare against the test targets
print(loss_val,loss_test)
The loss (error) of the model to the validation set is 29.43, which is more than 30% higher than
the training loss (error), indicating that the generalization ability of the model is not very good.
The test set loss (error) is much higher, further indicating that the model generalization ability is
poor.
Start training with 2 samples, and increase 1 sample each time for training. Calculate the training
error and validation error using the model obtained from each training.
plt.title("Learning Curves for Linear Regression", fontsize=16)
learning_curves_trainSize(x_train_1, y_train, x_val_1, y_val,0.3,1000)
plt.show()
It can be seen that as the number of training samples increases, the training loss of the obtained model gradually increases, because with more samples it is harder for the model to fit them all. Once the number of samples grows beyond a certain level, the training loss increases very slowly, indicating that adding more samples contributes little. Looking at the model's loss on the validation set: when the number of training samples is small, the validation error is large, because the fitted model is not accurate enough and its generalization ability is weak. As the number of training samples increases, the validation error gradually decreases, indicating that the generalization ability of the model is getting better and better. When the number of training samples reaches about 10, continuing to increase it no longer improves the validation error. From this learning curve it can be seen that around 10 training samples suffice, and adding more will not further improve the quality of the model, so one can stop increasing the number of samples, that is, early stopping.
For this simple model (a polynomial hypothesis function of degree 1), the learning curve helped determine the size of the training set and the final model, but the training error and validation error are still large because the model is too simple. The way to improve is to use a more complex, more expressive function, such as a 3-degree polynomial, as the hypothesis function, which requires artificially adding the two features x^2 and x^3. The following code obtains a training set x_train_2 with 3 features by adding these 2 features to x_train, which has only one data feature.
x_train_2 =np.hstack((x_train,x_train**2,x_train**3))
train_means = x_train_2.mean(axis=0)
train_stdevs = np.std(x_train_2, axis=0, ddof=1)
x_train_2 = (x_train_2 - train_means) / train_stdevs
print(x_train_2[:3])
output:
[[-3.62140776e-01 -7.55086688e-01 1.82225876e-01]
[-8.03204845e-01 1.25825266e-03 -2.47936991e-01]
[ 1.37746700e+00 5.84826715e-01 1.24976856e+00]]
Use this new feature data to train a new model (polynomial hypothesis function of degree 3) and
plot the loss curve, model curve, and model's training set predictions.
history = gradient_descent_vec(x_train_2,y_train,alpha,iterations)
w = history[-1]
print("w:",w)
loss_history = compute_loss_history(x_train_2,y_train,history)
print(loss_history[:-1:len(loss_history)//10])
plot_history_predict(x_train_2,y_train,history[-1],loss_history)
x_val_2 = (np.hstack((x_val,x_val**2,x_val**3)) - train_means) / train_stdevs    # same features and statistics as the training set
x_test_2 = (np.hstack((x_test,x_test**2,x_test**3)) - train_means) / train_stdevs
loss_val = loss(w,x_val_2,y_val)
loss_test = loss(w,x_test_2,y_test)
print(loss_val,loss_test)
5.768794748026971 170.64341351247012
Compared with the polynomial hypothesis function of degree one, the verification error of this
model is reduced to about 5.7, but the test error is still relatively large.
It can be seen from the figure that this model fits the training data well. So can the fit to the training data be improved further by increasing the polynomial degree? For example, using an 8-degree polynomial as the hypothesis function:
n = 8
x_train_n = np.hstack(tuple(x_train**(i+1) for i in range(n)))   # features x, x^2, ..., x^8
train_means = x_train_n.mean(axis=0)
train_stdevs = np.std(x_train_n, axis=0, ddof=1)
x_train_n = (x_train_n - train_means) / train_stdevs
print(x_train_n[:3])
history = gradient_descent_vec(x_train_n,y_train,alpha,iterations)
w = history[-1]
print("w:",history[-1])
loss_history = compute_loss_history(x_train_n,y_train,history)
plot_history_predict(x_train_n,y_train,history[-1],loss_history)
Figure 3-28 Loss curve and fitted 8th degree polynomial curve
It can be seen that the loss on the training set is further reduced to about 0.03, i.e. the trained model fits the training set very well. But is this model better than the previous one? Does it generalize better?
You can first look at the learning curve of this 8-degree polynomial hypothesis function.
(12, 8)
(9,)
(21, 8)
gradient is small enough!
iterated num is : 148
...
Figure 3-29 Training and Validation Loss Curves
For the different training-set sizes, the training error is almost 0, indicating that the model fits the training set very well, while the validation error first decreases as the training data increase and then, from about 5 samples on, continues to increase as the training set grows. This shows that the generalization ability becomes worse.
The following code builds the validation and test features in the same way, normalizes them with the training-set statistics, and outputs the errors of the model trained on the whole training set:

x_val_n = (np.hstack(tuple(x_val**(i+1) for i in range(n))) - train_means) / train_stdevs
x_test_n = (np.hstack(tuple(x_test**(i+1) for i in range(n))) - train_means) / train_stdevs
loss_val = loss(w,x_val_n,y_val)
loss_test = loss(w,x_test_n,y_test)
print(loss_val,loss_test)
37.35209369572414 226.4483749816359
It can be seen that on both the validation set and the test set the errors exceed those of the model trained with the simplest polynomial hypothesis function. This means the model has overfitted and its generalization ability is worse.
In practice, a sample does not strictly satisfy the functional relationship f(x): there is a random error ϵ between the actual target value y and f(x), and this error is generally assumed to obey the Gaussian distribution N(0, σ^2), i.e. ϵ = y − f(x) ∼ N(0, σ^2), or:

y = f(x) + ϵ

That is, there is an error ϵ between the sampled target value y and the true target value f(x). Machine learning hopes to use a hypothesis function f̂(x) to approximate the real f(x). This is usually done by minimizing the errors (y_i − f̂(x_i))^2 between the actual target values y_i and the predicted values f̂(x_i) of the hypothesis function f̂(x).

For a hypothesis function model f̂(x), the specific functions obtained from different training sets and different machine learning algorithms are all different, and for a given x the error (y − f̂(x))^2 produced by the different f̂(x) also differs. The average (expectation) of this error over all possible f̂(x) is called the expected error, E[(y − f̂(x))^2]. This expected error can be decomposed into three terms:

E[(y − f̂(x))^2] = (Bias[f̂(x)])^2 + Var[f̂(x)] + σ^2

Among them, Bias[f̂(x)] = E[f̂(x)] − f(x) is called the bias; it represents the deviation between the expected prediction of the hypothesis function model f̂(x) and the true value. Var[f̂(x)] = E[f̂(x)^2] − E[f̂(x)]^2 = E[(f̂(x) − E[f̂(x)])^2] is called the variance; it is the mean square error between the different predicted values of f̂(x) and their expected value. Please refer to Wikipedia for the derivation of the formula:
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Suppose the function to be learned is f(x) = x + 2 sin(1.5x). The following code draws the function curve and a set of points {x_i, y_i} sampled from it:
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(0)
f = lambda x: x + 2*np.sin(1.5*x)
def plot_f(pts=50):
    x = np.linspace(0, 10, pts)
    f_ = f(x)
    plt.plot(x, f_)
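# sample_f() is used below but not listed; a minimal sketch (the noise scale is an assumption):
def sample_f(n=20):
    x = np.random.uniform(0, 10, n)
    y = f(x) + np.random.normal(scale=1.0, size=n)
    return x, y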
plot_f()
x,y = sample_f()
plt.scatter(x,y,s=30)#, facecolors='none', edgecolors='r')
Figure 3-30 The function f(x) = x + 2 sin(1.5x) and coordinate points sampled from it with random noise
Consider first the simplest hypothesis function, a constant f̂(x) = b. Minimizing the mean square error Σ_{i=1}^m (y_i − b)^2 over a training set gives:

b = (Σ_{i=1}^m y_i) / m = np.mean(y)
The samples in different training sets are different, so the b obtained from different training sets, that is, the hypothesis function f̂(x) = b, differs as well. The difference between the expected prediction at some point x over all the different training sets and the true value is the bias, and the mean square error between the predictions of all the different hypothesis functions at this point and their expected value is the variance. The following code trains on 100 training sets to obtain 100 hypothesis functions, and then computes the prediction bias and prediction variance of these hypothesis functions at x = 18:
train_set_num = 100
def plot_b(b, pts=50):
    x = np.linspace(0, 10, pts)
    hat_f = [b for i in range(pts)]
    plt.plot(x, hat_f)
bs = []
for i in range(train_set_num):
    x, y = sample_f(20)
    plt.scatter(x, y)
    b = np.mean(y)
    bs.append(b)
    plot_b(b)
plot_f()
plt.show()
x = 18
f_true = f(x)
f_predict_mean = np.mean(bs)
print("real function value:",f_true)
print("Predicted expected value:",f_predict_mean)
print("Predicted deviation:",f_predict_mean - f_true)
print("Predicted variance:",np.std(bs))
Figure 3-31 Different hypothetical functions f^(x) = b obtained from different training sets
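The corresponding experiment for the linear hypothesis function f̂(x) = wx + b is not listed; a sketch consistent with the reported numbers (np.polyfit is used here for the line fit, an assumption) is:

f_predict = []
for i in range(train_set_num):
    x_s, y_s = sample_f(20)
    w1, b1 = np.polyfit(x_s, y_s, 1)   # least-squares fit of a line: slope, intercept
    xs = np.linspace(0, 10, 50)
    plt.plot(xs, w1*xs + b1)           # draw each fitted line
    f_predict.append(w1*18 + b1)       # prediction at x = 18
f_predict = np.array(f_predict)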
plot_f()
plt.show()
x = 18
f_true = f(x)
f_predict_mean = np.mean(f_predict)
print("real function value:",f_true)
print("Predicted expected value:",f_predict_mean)
print("Predicted deviation:",f_predict_mean - f_true)
print("Predicted variance:",np.std(f_predict))
Figure 3-32 Different hypothetical functions f^(x) = wx + b obtained from different training
sets
The bias and variance of the hypothesis function model f̂(x) = b are −14.564125266958722 and 0.7240080347500965 respectively, while those of the model f̂(x) = wx + b are −11.944324952176219 and 10.868072850656494. Simple models are often not complex enough to approximate the real function f(x), so they are prone to underfitting and have larger bias than complex models. Complex models have smaller bias, but because the function varies in a complicated way, its predicted values also tend to vary greatly or diverge, i.e. the variance is relatively large.
The bias and variance of model predictions can be illustrated by the "bull's-eye" diagram shown in Figure 3-33. The bullseye is the true value of the target, and a shooter corresponds to a hypothesis function: training is his practice, and shooting at the bullseye corresponds to making a prediction on a sample. Each time a model is trained (on a different training set) it shoots (predicts) once, so the training models of this hypothesis function produce a set of prediction values. The degree to which these predicted values deviate on average from the true value is the bias. The picture in the upper left corner shows a very large bias: the model is underfitting, i.e. it cannot express the relationship between the independent variables and the target value well. The two graphs in the right column show predictions for the same independent variable that are very scattered, i.e. the variance is large: the values predicted by the different trained models deviate greatly from each other, as if the shooter's level were very unstable. In the two images in the left column, the predicted values are concentrated, i.e. the variance is small: the different models predict almost the same values, as if the shooter's level were stable. The lower left corner has small bias and small variance, indicating that the model fits well (high shooting accuracy) and is very stable. In the lower right corner the deviations are distributed symmetrically, so their expectation is small and the predictions center around the true value; the fit looks accurate on average, but the large variance suggests a possible overfitting problem (unstable shooting ability).
The bias and variance can also be observed intuitively by comparing the learning curves of the training set and the validation set. If the errors of the training set and the validation set are close to each other, the predictions for the two different data sets are similar, which means the variance is small (a low degree of divergence); otherwise the variance is large. The numerical value of the errors themselves reflects the size of the bias.
As shown in Figure 3-34, when the validation error is much larger than the training error (Jcv >> Jtrain) and the training error is small, the model fits the training set well but fits the validation set poorly, indicating an overfitting phenomenon. On the left, when the validation error is similar to the training error and both are large, there may be underfitting.
Figure 3-34 Judging underfitting and overfitting as well as bias and variance based on training and validation loss curves
3.3 Regularization
Overfitting (high variance) is caused by the model being too complex, with too many degrees of freedom. One way to solve overfitting is to increase the number of training samples, but sometimes enough training samples are difficult or laborious to obtain. Another common method is to reduce the complexity of the model. One way to do this is to choose a simpler hypothesis function instead of a complex one, such as using a 3-degree polynomial instead of a 9-degree polynomial as in the previous example. If you do not want to replace the hypothesis function, you can instead limit its complexity through certain techniques. This method of reducing the complexity of the function is called regularization.
The early stopping method seen earlier is one regularization method: observe the training loss and validation loss during the iterations of gradient descent through the learning curves, and select an appropriate number of iterations according to these curves, so that the model function does not become too complicated. For a complex model, the set of hypothesis functions represented by all possible values of the model parameters may be very large; but at the start of training the parameters are initialized to small values (such as 0), and the set of functions corresponding to small parameter values is only a small part of all possible functions. That is, restricting the parameters to a small value range limits the expressive ability of the model function and reduces the complexity of the model.
For a training set of samples (x^{(i)}, y^{(i)}), the mean square error loss describing the fitting error, with a penalty term added, is:

L(w) = (1/2m) Σ_{i=1}^m (x^{(i)}w − y^{(i)})^2 + λ‖w‖^2

in which:

‖w‖^2 = w_0^2 + w_1^2 + ⋯ + w_n^2
This penalty term (the square of the norm of the model parameters) prevents the parameters from taking large values, because a large w_i makes the value of the loss function larger, and the goal of optimization is to make the loss as small as possible. The λ of the new loss function is a hyperparameter that must be tuned for the actual problem; it controls the relative importance of the fitting error term and the penalty term. A larger λ gives the penalty term a larger proportion and a greater effect, and a smaller λ gives it a smaller proportion and a smaller effect.
The gradient of the new loss function becomes:

∇L(w) = (1/m) Σ_{i=1}^m (x^{(i)}w − y^{(i)}) x^{(i)} + 2λw
Therefore, in the gradient descent method one only needs to add the gradient of the penalty term when computing the partial derivatives. Here is the penalized version of the gradient descent method:
def gradient_descent_reg(X, y, reg, alpha, num_iters, gamma=0.8, epsilon=1e-8):
    w_history = []  # record the parameters during the iterations
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    num_features = X.shape[1]
    w = np.zeros(num_features)
    v = np.zeros(num_features)        # momentum term (np.zeros_like(num_features) would create a 0-d array)
    for n in range(num_iters):
        predictions = X @ w           # the predicted values of the hypothesis function, i.e. f(x)
        errors = predictions - y      # the error between the predicted value and the real value
        gradient = X.transpose() @ errors / len(y)  # gradient of the fitting term
        gradient += 2*reg*w           # gradient of the penalty term
        if np.max(np.abs(gradient)) < epsilon:
            print("gradient is small enough!")
            print("iterated num is :", n)
            break
        v = gamma*v + alpha*gradient  # momentum update
        w = w - v
        w_history.append(w)
    return w_history                  # return the parameters recorded during optimization
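compute_loss_history_reg() is called below; a minimal sketch consistent with the regularized loss defined above is:

def compute_loss_history_reg(X, y, w_history, reg):
    loss_history = []
    for w in w_history:
        errors = X @ w[1:] + w[0] - y   # X here is without the column of 1s
        loss_history.append((errors**2).mean()/2 + reg*np.sum(np.square(w)))
    return loss_history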
reg = 0.2
history = gradient_descent_reg(x_train_n,y_train,reg,alpha,iterations)
print("w:",history[-1])
loss_history = compute_loss_history_reg(x_train_n,y_train,history,reg)
plot_history_predict(x_train_n,y_train,history[-1],loss_history)
Figure 3-35 Training loss curve and fitted curve with regularization
(12, 8)
(9,)
(21, 8)
gradient is small enough!
iterated num is : 158
...
Indeed, after penalizing the model parameters through this regularization technique, the same 8-degree polynomial hypothesis function no longer overfits.
Using the sigmoid function σ(x) to transform the predicted value of linear regression gives the hypothesis function of logistic regression:

f_w(x) = 1/(1 + e^{−xw}) = σ(xw)        (3-30)

Using this hypothesis function to model the relationship between the features of a sample and its target value is the so-called logistic regression. The hypothesis function f_w(x) of logistic regression takes values between 0 and 1, so it can be used to represent the probability that x belongs to a certain class; if x is the tumor size in a medical image, f_w(x) can represent the probability that the tumor corresponding to x is malignant.

If 0 and 1 are used to represent the two classes respectively, and f_w(x) is used to represent the probability that x belongs to class 1, then:

P(y = 1|x) = f_w(x) = 1/(1 + e^{−xw}) = σ(xw)

P(y = 0|x) = 1 − f_w(x) = 1 − 1/(1 + e^{−xw}) = 1 − σ(xw)

For a sample (x, y), if y = 1 the probability of its occurrence is P(y = 1|x), and if y = 0 it is P(y = 0|x). The two cases can be written uniformly as P(y = 1|x)^y P(y = 0|x)^{1−y}, or f_w(x)^y (1 − f_w(x))^{1−y}. The probability that m samples appear at the same time is then:

Π_{i=1}^m f_w(x^{(i)})^{y^{(i)}} (1 − f_w(x^{(i)}))^{1−y^{(i)}}

The w that makes this probability value the largest makes these m samples appear with the greatest probability, so logistic regression seeks the w that maximizes it. Because the repeated multiplication can make the value sharply explode to infinity or shrink toward 0, and to make the solution numerically stable and the derivative convenient to compute, the average of the negative logarithm of this probability is usually used as the cost function, namely:
L(w) = −(1/m) Σ_{i=1}^m ( y^{(i)} log(f_w(x^{(i)})) + (1 − y^{(i)}) log(1 − f_w(x^{(i)})) )

The quantity −( y^{(i)} log(f_w(x^{(i)})) + (1 − y^{(i)}) log(1 − f_w(x^{(i)})) ) is called the cross-entropy loss of the sample, denoted L^{(i)}.

For a sample (x^{(i)}, y^{(i)}), if its true target value y^{(i)} is 1 and the predicted value f_w(x^{(i)}) of logistic regression is also 1, then L^{(i)} = −(1 · log 1 + 0 · log 0) = 0 (taking 0 · log 0 = 0); if the true target value is 0 and the predicted value is also 0, L^{(i)} is likewise 0. That is, when the predicted value is consistent with the target value, this value is 0. If not, because y^{(i)}, 1 − y^{(i)}, f_w(x^{(i)}), 1 − f_w(x^{(i)}) are all real numbers in the (0,1) interval, L^{(i)} is a positive number greater than 0. For instance, with y^{(i)} = 1 a predicted probability of 0.9 gives L^{(i)} = −log 0.9 ≈ 0.105, while 0.1 gives −log 0.1 ≈ 2.303, so confident wrong predictions are punished heavily. Therefore, only when the predicted value is completely consistent with the target value does L^{(i)} achieve its minimum value of 0.
The goal of logistic regression is to find the w that minimizes this loss (also called cost) L(w). The solution algorithm is still the gradient descent method, so the gradient of L(w) with respect to w must be computed. Write z^{(i)} = x^{(i)}w and f^{(i)} = σ(z^{(i)}), so that:

L(w) = (1/m) Σ_{i=1}^m L^{(i)}

L(w) is the mean of the m per-sample losses L^{(i)}; each L^{(i)} is a function of f^{(i)}, f^{(i)} is a function of z^{(i)}, and z^{(i)} is a function of w = (w_1, w_2, ⋯, w_k)^T. According to the rules of differentiation (such as the derivative of a sum being the sum of the derivatives) and the chain rule for composite functions:

∂L(w)/∂L^{(i)} = 1/m

∂L^{(i)}/∂f^{(i)} = −( y^{(i)}/f^{(i)} − (1 − y^{(i)})/(1 − f^{(i)}) ) = (f^{(i)} − y^{(i)}) / ( f^{(i)}(1 − f^{(i)}) )

∂f^{(i)}/∂z^{(i)} = σ(z^{(i)})(1 − σ(z^{(i)})) = f^{(i)}(1 − f^{(i)})

∂z^{(i)}/∂w_j = x_j^{(i)}

So there is:

∂L(w)/∂w_j = Σ_{i=1}^m ∂L(w)/∂L^{(i)} × ∂L^{(i)}/∂f^{(i)} × ∂f^{(i)}/∂z^{(i)} × ∂z^{(i)}/∂w_j
           = (1/m) Σ_{i=1}^m (f^{(i)} − y^{(i)}) / ( f^{(i)}(1 − f^{(i)}) ) × f^{(i)}(1 − f^{(i)}) × x_j^{(i)}
           = (1/m) Σ_{i=1}^m (f^{(i)} − y^{(i)}) x_j^{(i)} = (1/m) Σ_{i=1}^m x_j^{(i)} (f_w(x^{(i)}) − y^{(i)})

Because f_w(x^{(i)}) − y^{(i)} is a scalar, it commutes with the multiplication by the vector components, that is, (f_w(x^{(i)}) − y^{(i)}) x_j^{(i)} = x_j^{(i)} (f_w(x^{(i)}) − y^{(i)}).

It can be observed that for a sample (x, y), the gradient (derivative) of L(w) with respect to the weighted sum z = xw is ∂L/∂z = f − y. This has the same form as the gradient of the squared error (1/2)(f − y)^2 with respect to f in linear regression, so the gradient formulas of logistic regression and linear regression are the same.
If each x^{(i)} is written as a row vector, all the x^{(i)} can form a matrix X by row, and the target values and predicted values form the vectors:

X = [ x^{(1)} ]       y = (y^{(1)}, y^{(2)}, ⋯, y^{(m)})^T
    [ x^{(2)} ]
    [    ⋮   ]       f = (f_w(x^{(1)}), f_w(x^{(2)}), ⋯, f_w(x^{(m)}))^T = σ(Xw)
    [ x^{(m)} ]

and the gradient takes the vector form:

∇_w L(w) = (1/m) Σ_{i=1}^m x^{(i)T} (f_w(x^{(i)}) − y^{(i)}) = (1/m) X^T (f − y) = (1/m) X^T (σ(Xw) − y)

Assuming that the number of sample data features is n (plus the constant feature 1), it can be verified that the dimensions of the above matrices and vectors are consistent. Therefore, once the output f of logistic regression is known, the following python code can be used to calculate the gradient of the cross-entropy loss L with respect to the model parameters w:

errors = f - y                               # the error between the predicted value and the real value
gradient = X.transpose() @ errors / len(y)   # calculate the gradient
If instead each x^{(i)} is written as a column vector, all the x^{(i)} form a matrix X by column, and the same gradient is written ∇_w L(w) = (1/m) X (f − y). In component form, for every parameter w_j:

∂L(w)/∂w_j = (1/m) Σ_{i=1}^m x_j^{(i)} (f_w(x^{(i)}) − y^{(i)})

If the gradient is written in the form of a row vector according to the usual practice, then:

∇_w L(w) = [ ∂L(w)/∂w_0, ∂L(w)/∂w_1, ∂L(w)/∂w_2, ⋯, ∂L(w)/∂w_K ] = (1/m)(f − y)^T X = (1/m)(σ(Xw) − y)^T X

i.e. in code: gradient = errors.transpose() @ X / len(y).

Similarly, regularization can also be added to the loss function of logistic regression, that is:

L(w) = −(1/m) Σ_{i=1}^m ( y^{(i)} log(f_w(x^{(i)})) + (1 − y^{(i)}) log(1 − f_w(x^{(i)})) ) + λ‖w‖^2

If each sample x^{(i)} is a row vector and f, y and the model parameters w are column vectors, the gradient of this regularized loss can be written in the vector form:

∇_w L(w) = (1/m) X^T (σ(Xw) − y) + 2λw

1. Generate data

The following code uses np.random.normal() to generate two sets of two-dimensional coordinate point data sets Xa and Xb that obey different normal distributions; each sample represents a coordinate point on a two-dimensional plane. The samples in Xa are normally distributed around the center point (10,12), and the samples in Xb are normally distributed around the center point (5,6). The code colors these samples to distinguish which category they belong to.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
n_pts = 100
D = 2
Xa = np.array([np.random.normal(10, 2, n_pts),
               np.random.normal(12, 2, n_pts)])
Xb = np.array([np.random.normal(5, 2, n_pts),
               np.random.normal(6, 2, n_pts)])
X = np.vstack((Xa.transpose(), Xb.transpose()))        # one sample (coordinate point) per row
y = np.concatenate((np.zeros(n_pts), np.ones(n_pts)))  # labels: 0 for Xa, 1 for Xb
[[13.52810469 15.76630139]
[8.20906688 11.86351679]
[ 4.26163632 3.3869463 ]
[6.04212975 4.47171215]]
[0. 0. 1. 1.]
fig, ax = plt.subplots(figsize=(4,4))
ax.scatter(X[:n_pts,0], X[:n_pts,1], color='lightcoral',
label='$Y = 0$')
ax.scatter(X[n_pts:,0], X[n_pts:,1], color='blue',
label='$Y = 1$')
ax.set_title('Sample Dataset')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.legend(loc=4);
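The gradient descent solver below relies on the sigmoid function; a minimal definition (assumed to match the one introduced with the logistic regression hypothesis) is:

def sigmoid(z):
    return 1/(1 + np.exp(-z))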
The gradient descent function for regularized logistic regression mirrors gradient_descent_reg(), with the sigmoid applied to the linear prediction:

def gradient_descent_logistic_reg(X, y, lambda_, alpha, num_iters, gamma=0.8, epsilon=1e-8):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    w = np.zeros(X.shape[1]); v = np.zeros(X.shape[1]); w_history = []
    for n in range(num_iters):
        f = sigmoid(X @ w)                            # predicted probabilities
        gradient = X.transpose() @ (f - y) / len(y)   # gradient of the cross-entropy loss
        gradient += 2*lambda_*w                       # gradient of the penalty term
        if np.max(np.abs(gradient))<epsilon:
            print("gradient is small enough!")
            print("iterated num is :",n)
            break
        v = gamma*v + alpha*gradient                  # momentum update
        w = w - v
        w_history.append(w)
    return w_history
def loss_logistic(w,X,y,reg=0.):
    f = sigmoid(X @ w[1:]+w[0])
    loss = -np.mean((np.log(f).T * y + np.log(1-f).T * (1-y)))
    loss += reg*(np.sum(np.square(w)))
    return loss

def loss_history_logistic(w_history,X,y,reg=0.):
    loss_history = []
    for w in w_history:
        loss_history.append(loss_logistic(w,X,y,reg))
    return loss_history
reg = 0.0
alpha=0.01
iterations=10000
w_history = gradient_descent_logistic_reg(X,y,reg,alpha,iterations)
w = w_history[-1]
print("w:",w)
loss_history = loss_history_logistic(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
4. Decision curve

The two classes are separated at the probability value f_w(x) = 0.5. Because f_w(x) = σ(xw), for a sample x the condition f_w(x) = 0.5 is equivalent to xw = 0, i.e. the dot product of w and x is 0, that is, w_0 + w_1 x_1 + w_2 x_2 = 0. Taking two values of x_1 and solving x_2 = −w_0/w_2 − x_1 (w_1/w_2) gives two points of this line, and drawing the decision line through them in the (x_1, x_2) coordinate plane shows that the learned model separates the samples of the two categories very well. The drawn cost curve also reflects the gradual convergence of the algorithm.
fig, ax = plt.subplots(1, 2, figsize=(8,4))   # left: decision line, right: cost curve (layout reconstructed)
x1 = np.array([X[:,0].min()-1, X[:,0].max()+1])
x2 = - w.item(0) / w.item(2) + x1 * (- w.item(1) / w.item(2))
ax[0].scatter(X[:n_pts,0], X[:n_pts,1], color='lightcoral', label='$Y = 0$')
ax[0].scatter(X[n_pts:,0], X[n_pts:,1], color='blue', label='$Y = 1$')
ax[0].plot(x1, x2, color='k')                 # the decision line
ax[1].plot(loss_history, color='r')
ax[1].set_ylim(0,ax[1].get_ylim()[1])
ax[1].set_title(r'$J(w)$ vs. Iteration')
ax[1].set_xlabel('Iteration')
ax[1].set_ylabel(r'$J(w)$')
fig.tight_layout()
Figure 3-36 Classification and loss curves of a flat point set for binary classification
5. Prediction accuracy

# Print accuracy
X_1 = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
y_predictions = sigmoid(X_1 @ w)>=0.5
print ('The accuracy of the prediction is: %d ' % float((np.dot(y, y_predictions)
       + np.dot(1 - y,1 - y_predictions)) / float(y.size) * 100) + '%' )
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
scikit_log_reg = LogisticRegression()
scikit_log_reg.fit(X,y)
# Print accuracy
y_predictions = scikit_log_reg.predict(X)
print ('The accuracy of the prediction is: %d ' % float((np.dot(y, y_predictions)
       + np.dot(1 - y,1 - y_predictions)) / float(y.size) * 100) + '%' )
#plot_decision_boundary(lambda x: clf.predict(x), X, Y)
# Plot decision boundary
x1 = np.array([X[:,0].min()-1, X[:,0].max()+1])
x2 = - w.item(0) / w.item(2) + x1 * (- w.item(1) / w.item(2))
fig, ax = plt.subplots(figsize=(4,4))
ax.scatter(X[:n_pts,0], X[:n_pts,1], color='lightcoral', label='$Y = 0$')
ax.scatter(X[n_pts:,0], X[n_pts:,1], color='blue', label='$Y = 1$')
ax.plot(x1, x2, color='k')   # the learned decision line
ax.set_title('Sample Dataset')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
import pandas
import matplotlib.pyplot as plt
import numpy as np
iris = pandas.read_csv("iris.csv")
# shuffle rows
shuffled_rows = np.random.permutation(iris.index)
iris = iris.loc[shuffled_rows,:]
print(iris.head())
print(iris.species.unique())
iris.hist()
plt.show()
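The construction of X and y from the iris table is not listed above; one hypothetical way to set up a binary problem (the column selection and species choice are assumptions, not from the book) is:

two = iris[iris["species"].isin(iris["species"].unique()[:2])]  # keep two species only
X = two.drop(columns=["species"]).values[:, :2].astype(float)   # two feature columns
y = (two["species"] == two["species"].unique()[1]).astype(float).values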
reg = 0.0
alpha=0.0001
iterations=10000
w_history = gradient_descent_logistic_reg(X,y,reg,alpha,iterations)
w = w_history[-1]
print("w:",w)
loss_history = loss_history_logistic(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
# Print accuracy
X_1 = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) #Add a
column of features 1
y_predictions = sigmoid(X_1 @ w)>=0.5
Of course, no validation set or test set was used here to evaluate the trained model. Readers can try dividing the original data set into training, validation, and test sets, computing the errors on the validation and test sets, drawing the corresponding learning curves, and observing the fitting effect of the trained model.

3.6 Softmax regression

Unlike the hypothesis function of logistic regression, which outputs a single value indicating which of two classes the data belongs to, softmax regression is an extension of logistic regression to multiple classes: its hypothesis function outputs as many values as there are categories, each indicating the probability that the data belongs to the corresponding class.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(100)
def gen_spiral_dataset(N=100, D=2, K=3):
    # N: number of points per class, D: dimensionality, K: number of classes
    X = np.zeros((N*K, D))            # data matrix (each row = single example)
    y = np.zeros(N*K, dtype='uint8')  # class labels
    for j in range(K):
        ix = range(N*j, N*(j+1))
        r = np.linspace(0.0, 1, N)    # radius
        t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2  # theta
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        y[ix] = j
    return X, y
X_spiral,y_spiral = gen_spiral_dataset()
# lets visualize the data:
plt.scatter(X_spiral[:, 0], X_spiral[:, 1], c=y_spiral, s=20,
cmap=plt.cm.spring) #s=40, cmap=plt.cm.Spectral)
plt.show()
The softmax function of 3 independent variables (z_1, z_2, z_3) produces a vector with the same number of components:

softmax(z_1, z_2, z_3) = ( e^{z_1}/(e^{z_1}+e^{z_2}+e^{z_3}), e^{z_2}/(e^{z_1}+e^{z_2}+e^{z_3}), e^{z_3}/(e^{z_1}+e^{z_2}+e^{z_3}) )
                       = ( e^{z_1}/Σ_{i=1}^3 e^{z_i}, e^{z_2}/Σ_{i=1}^3 e^{z_i}, e^{z_3}/Σ_{i=1}^3 e^{z_i} )
Obviously, the value of each component of the output vector of the softmax() function is
located in [0, 1], and the sum of all component values is 1, so each component can be
regarded as a probability value.
The function softmax() below is the code to calculate the softmax function:
import numpy as np
def softmax(x):
    e_x = np.exp(x)
    return e_x / e_x.sum()
Input a 3-dimensional vector z, the softmax function outputs a 3-dimensional vector, each
component of which represents a probability, that is, the values of these components are
between [0, 1], and their sum is equal to 1.
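For example (the printed values are approximate):

z = np.array([3.0, 1.0, 0.2])
print(softmax(z))   # -> [0.836 0.113 0.051] approximately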
Note that when softmax() acts on z = [3.0, 1.0, 0.2], the resulting values keep the same size relationship as the inputs 3.0, 1.0, 0.2. However, for a component x with a large value, e^x will exceed the range of values that the floating-point type can represent:

z = [100,1000]
softmax(z)

array([ 0., nan])
Since dividing both the numerator and the denominator of a fraction by the same number leaves the fraction unchanged, we have:

e^{z_j} / Σ_i e^{z_i} = (e^{z_j}/e^a) / (Σ_i e^{z_i}/e^a) = e^{z_j − a} / Σ_i e^{z_i − a}

Therefore, you can first find the maximum a of all the z_i and then use z_i − a to compute the value of the softmax() function:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
print(softmax(z))
z = [500,1000]
softmax(z)

[0. 1.]
array([7.12457641e-218, 1.00000000e+000])
The above code works for a single input vector such as z = (z_1, z_2, z_3); can it be used for a matrix whose rows are multiple samples? No: np.max() and sum() would then operate over all elements of the matrix rather than over each row, so the outputs no longer sum to 1 per row. For example, a row [1, 2, 3] of such a matrix may produce the output [0.00548473, 0.01490905, 0.04052699], which does not satisfy the normalization condition. In order to compute the softmax() values of multiple samples at the same time, the softmax vector should be computed separately for each sample (each row). To do this, rewrite the above code as:
def softmax(x):
    a = np.max(x, axis=-1, keepdims=True)
    e_x = np.exp(x - a)
    return e_x / np.sum(e_x, axis=-1, keepdims=True)
softmax(z)
The parameter axis=-1 of the numpy functions np.max() and np.sum() indicates that the maximum (max) and summation (sum) operations are performed along the last axis, i.e. within each row, and keepdims=True keeps the dimensions of the result array the same as the original array so that broadcasting works. The code first finds the maximum of each row vector, subtracts it from that row and exponentiates, and finally computes the softmax values row by row, so each input row produces a corresponding output vector of probabilities. The code can be further simplified:
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)
softmax(z)
In general, for C classes, the i-th component of the softmax function is:

f_i = e^{z_i} / Σ_{k=1}^C e^{z_k}

where Σ_{i=1}^C f_i = 1. To prevent overflow in the computation, the maximum of the components can be subtracted from each component, exactly as above.
In order to find the gradient of f(z) = softmax(z) with respect to z, introduce the intermediate variables a_i = e^{z_i} and b = Σ_{k=1}^C e^{z_k}, so that:

f_i = a_i / b

∂a_i/∂z_i = ∂e^{z_i}/∂z_i = e^{z_i}

∂a_i/∂z_j = 0    (j ≠ i)

∂b/∂z_j = ∂(Σ_{k=1}^C e^{z_k})/∂z_j = e^{z_j}

By the quotient rule, for j ≠ i:

∂f_i/∂z_j = ( (∂a_i/∂z_j) b − a_i (∂b/∂z_j) ) / b^2 = (0 − a_i a_j) / b^2 = −f_i f_j

and for j = i:

∂f_i/∂z_i = ( e^{z_i} b − a_i e^{z_i} ) / b^2 = f_i (1 − f_i)
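These formulas can be verified numerically (a sketch, assuming the softmax() defined above): perturb one component of z and compare the finite-difference slope with the analytic Jacobian column.

z = np.array([0.5, -1.0, 2.0])
f = softmax(z)
eps = 1e-6
z_eps = z.copy(); z_eps[0] += eps
numeric = (softmax(z_eps) - f) / eps                             # d f / d z_0, numerically
analytic = np.array([f[0]*(1 - f[0]), -f[1]*f[0], -f[2]*f[0]])   # first column of the Jacobian
print(np.allclose(numeric, analytic, atol=1e-5))                 # expected: True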
Collecting these partial derivatives, the gradient (Jacobian matrix) of f = (f_1, f_2, ⋯, f_C) = softmax(z) with respect to z is:

∂f/∂z = [ f_1(1−f_1)   −f_1 f_2     ⋯   −f_1 f_C   ]
        [ −f_2 f_1     f_2(1−f_2)   ⋯   −f_2 f_C   ]
        [     ⋮             ⋮       ⋱       ⋮      ]
        [ −f_C f_1     −f_C f_2     ⋯   f_C(1−f_C) ]

that is, ∂f/∂z = diag(f) − f f^T. If L is some other variable (such as a loss) whose gradient with respect to f is ∂L/∂f = (∂L/∂f_1, ∂L/∂f_2, ⋯, ∂L/∂f_C), then by the chain rule the gradient of L with respect to z is:

∇_z L = (∂f/∂z)^T (∂L/∂f)

Using df to represent the gradient of some other variable L with respect to f, the single-sample Jacobian can be computed in python as -np.outer(f, f) + np.diag(f.flatten()), and the gradient of L with respect to z as df @ that Jacobian (this is what softmax_backward() below computes). For multiple samples, the following code can be used to calculate the gradient of softmax:

def softmax_gradient(z):
    f = softmax(z)
    if len(f) == 1:                # a single sample: return its Jacobian
        return -np.outer(f, f) + np.diag(f.flatten())
    else:                          # one Jacobian per sample
        grads = []
        for i in range(len(f)):
            fi = f[i]
            grad = -np.outer(fi, fi) + np.diag(fi.flatten())
            grads.append(grad)
        return np.array(grads)

A quick test:

x = np.array([[1, 2]])
print(softmax_gradient(x))
df = np.array([1, 3])
print(softmax_backward(x, df))

[[ 0.19661193 -0.19661193]
 [-0.19661193  0.19661193]]
[-0.39322387  0.39322387]
You can use np.einsum() to perform the multi-sample outer product operation, that is, write the following vectorized code:

def softmax_gradient(Z, isF=False):
    if isF:       # Z already holds the softmax output F
        F = Z
    else:
        F = softmax(Z)
    D = []        # diag(f) for each sample
    for i in range(F.shape[0]):
        f = F[i]
        D.append(np.diag(f.flatten()))
    # diag(f) - outer(f, f), computed for all samples at once
    grads = D - np.einsum('ij,ik->ijk', F, F)
    return grads

print(softmax_gradient(x))
If you know the gradient dF of a function (such as a loss function) with respect to the softmax output value F, you can use a softmax_backward() function to find the gradient of the function with respect to the softmax input Z. For a two-sample input, the result has one gradient row per sample, for example:

[[-0.39322387  0.39322387]
 [-0.09035332  0.09035332]]
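The listing of softmax_backward() is not shown above; a minimal sketch consistent with the Jacobians computed by softmax_gradient() (the einsum contraction is our reconstruction):

def softmax_backward(Z, dF):
    # for each sample i: dZ[i] = dF[i] @ Jacobian[i]
    grads = softmax_gradient(Z)     # shape (m, C, C)
    dF = np.atleast_2d(dF)
    return np.einsum('ij,ijk->ik', dF, grads)

For example, with x = np.array([[1, 2]]) and df = np.array([[1, 3]]), softmax_backward(x, df) gives [[-0.39322387  0.39322387]].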
As shown in the figure, for a 3-classification problem, 3 linear regression functions can be used to generate 3 outputs, and these 3 outputs can be passed through the softmax function to generate 3 values $f_i$ ($i = 1, 2, 3$), which respectively represent the probability that the sample belongs to class $i$.
Figure 3-42 The hypothetical function of softmax regression consists of 3 weighted sums
and a softmax function
$$f(x) = (f_1, f_2, f_3)$$
For the 10-category problem of handwritten digit recognition, you can use 10 linear regression functions $xW_{:,i}$ to generate 10 output values $z_i$ from the input features $x$, and then use the softmax function to convert these 10 output values into 10 probability values $f_i$. As before, the constant 1 is also used as a feature of the input $x$, that is, $x = (1, x_1, x_2, x_3)$. Then the hypothesis function above can be written as:

$$f(x) = \mathrm{softmax}(xW_{:,1}, xW_{:,2}, xW_{:,3}) = \mathrm{softmax}(xW)$$
$$\mathrm{softmax}(z_1, z_2, z_3) = \left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) = \left(\frac{e^{z_1}}{\sum_{i=1}^{3} e^{z_i}}, \frac{e^{z_2}}{\sum_{i=1}^{3} e^{z_i}}, \frac{e^{z_3}}{\sum_{i=1}^{3} e^{z_i}}\right)$$
For a sample $x$, $f(x) = \mathrm{softmax}(xW)$ is a vector, each component of which represents the probability that $x$ belongs to the corresponding category; for example, $f_j$ represents the probability that $x$ belongs to the $j$th category. If the true value $y^{(i)}$ of a sample $(x^{(i)}, y^{(i)})$ is 2, the probability that the sample belongs to the second category is $f_2^{(i)}$.
Multi-sample form
The weighted sum of the data features of m samples X W is a two-dimensional matrix, and
each row of the matrix represents the weighted sum of the data features of a sample. This
matrix is denoted by the letter Z :
$$Z = \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(m)} \end{bmatrix} = \begin{bmatrix} x^{(1)}W \\ x^{(2)}W \\ \vdots \\ x^{(m)}W \end{bmatrix}$$
The weighted sum $z^{(i)}$ corresponding to each sample $x^{(i)}$ is itself a vector, and the softmax function acts on this vector to generate a vector $\mathrm{softmax}(z^{(i)})$, which represents the probabilities that the sample belongs to the different categories. Use $f^{(i)}$ to represent $\mathrm{softmax}(z^{(i)})$; the outputs of all m samples then form a matrix $F$:

$$F = \begin{bmatrix} f^{(1)} \\ f^{(2)} \\ \vdots \\ f^{(m)} \end{bmatrix} = \begin{bmatrix} \mathrm{softmax}(z^{(1)}) \\ \mathrm{softmax}(z^{(2)}) \\ \vdots \\ \mathrm{softmax}(z^{(m)}) \end{bmatrix} = \begin{bmatrix} \frac{e^{z_1^{(1)}}}{\sum_i e^{z_i^{(1)}}} & \cdots & \frac{e^{z_C^{(1)}}}{\sum_i e^{z_i^{(1)}}} \\ \vdots & & \vdots \\ \frac{e^{z_1^{(m)}}}{\sum_i e^{z_i^{(m)}}} & \cdots & \frac{e^{z_C^{(m)}}}{\sum_i e^{z_i^{(m)}}} \end{bmatrix}$$

The target values (labels) of the samples can be represented by a one-dimensional vector $y$, where each element is the integer corresponding to the true target category of a sample, that is:

$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Use $F_y$ to denote the vector whose i-th element is the predicted probability $f^{(i)}_{y^{(i)}}$ of the i-th sample's true category:

$$F_y = \begin{bmatrix} f^{(1)}_{y^{(1)}} \\ f^{(2)}_{y^{(2)}} \\ \vdots \\ f^{(m)}_{y^{(m)}} \end{bmatrix}$$

that is, the probabilities corresponding to all the samples' target categories.
3.6.4 Multi-classification cross-entropy loss
For a sample $(x^{(i)}, y^{(i)})$, its data feature $x^{(i)}$ is mapped by the softmax regression model to the probabilities that the sample belongs to each class:

$$f_1^{(i)}, f_2^{(i)}, \cdots, f_C^{(i)}$$

The probability that the sample belongs to its target category $y^{(i)}$ is $f^{(i)}_{y^{(i)}}$; this probability indicates how likely the sample is to appear together with its target category. Similarly, the probability that the m samples $(x^{(i)}, y^{(i)})$ appear simultaneously with their corresponding target categories is:

$$\prod_{i=1}^{m} f^{(i)}_{y^{(i)}}$$
The $W$ that maximizes this probability value makes these m samples appear with the greatest probability; therefore, softmax regression seeks the model parameters $W$ that maximize it. Because a product of many factors can quickly grow huge or vanish toward 0, to keep the solution algorithm numerically stable, the average of the negative logarithm of this probability is usually used as the cost function, namely:
$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right)$$

where $-\log(f^{(i)}_{y^{(i)}})$ is called the cross-entropy loss of sample i. The problem of maximizing $\prod_{i=1}^{m} f^{(i)}_{y^{(i)}}$ then becomes the problem of minimizing this cross-entropy loss.
For a 3-category problem, the values 0, 1, and 2 of $y^{(i)}$ represent the 3 different categories to which a sample can belong. If the true category of a sample is the third category, that is, $y^{(i)} = 2$, and the predicted value $f^{(i)}$ for the sample is a vector indicating the probability that the sample belongs to each category, with third component 0.2, then the cross-entropy loss of this sample is $-\log(f^{(i)}_2) = -\log(0.2)$.
For multiple samples, if there are 2 samples (m = 2) whose probability matrix is $F$ and whose target value vector is:

$$y = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

then $F_y$ is:

$$F_y = \begin{bmatrix} 0.3 \\ 0.6 \end{bmatrix}$$
which indicates the probability that each sample belongs to its corresponding target class. The cross-entropy loss of all samples can then be vectorized as:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right) = -\frac{1}{m}\,\mathrm{sum}(\log F_y)$$
For the above 2-sample example, this average cross-entropy loss is:

$$L(W) = -\frac{1}{2}\left(\log(0.3) + \log(0.6)\right)$$
print(-1/2*(np.log(0.3)+np.log(0.6)))
print(cross_entropy(F,Y))
0.8573992140459634
0.8573992140459634
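The cross_entropy() called above is not listed here; a minimal sketch consistent with the printed value (the name and signature are assumptions):

def cross_entropy(F, y):
    # mean negative log probability of each sample's true class
    m = len(F)
    return -np.sum(np.log(F[range(m), y])) / m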
Sometimes, instead of using an integer value to represent the category of a sample, a so-called one-hot vector $y^{(i)} = (y^{(i)}_1, y^{(i)}_2, \cdots, y^{(i)}_C)$ is used to indicate the category to which a sample belongs, where C is the total number of categories. Only one component of this vector has the value 1, and the other components have the value 0.
For example, for a 3-category problem, if the category of a certain sample is 3, its one-hot vector is
(0,0,1), that is, the third component value is 1, and the other component values are all 0.
For a sample, if the jth component of the corresponding $y^{(i)}$ satisfies $y^{(i)}_j = 1$, that is, the sample belongs to the jth class, then the cross-entropy loss of the sample can be written as:

$$-\log(f^{(i)}_j) = -y^{(i)}_j \log(f^{(i)}_j) = -\sum_{j=1}^{C} y^{(i)}_j \log(f^{(i)}_j) = -y^{(i)} \cdot \log(f^{(i)})$$
That is, the cross-entropy loss of this sample is the negative of the dot product of the vectors $y^{(i)}$ and $\log(f^{(i)})$.
Therefore, for target values expressed in one-hot form, the cross-entropy loss of all samples can be written as:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)} \cdot \log(f^{(i)}) = -\frac{1}{m}\,\mathrm{np.sum}(Y \odot \log(F))$$
For the above $F$ and the one-hot encoded matrix $Y$:

$$Y = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

these two matrices of the same shape can be multiplied using the Hadamard product, that is, the element-wise product $Y \odot \log(F)$. Adding all the elements of the resulting matrix and dividing by the number of samples gives the total cross-entropy:

$$L(W) = -\frac{1}{2}\,\mathrm{np.sum}(Y \odot \log(F)) = -\frac{1}{2}\left(\log(0.3) + \log(0.6)\right)$$
print(cross_entropy_one_hot(F,Y))
0.8573992140459634
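cross_entropy_one_hot() is likewise used without its listing; a minimal sketch (name and signature assumed):

def cross_entropy_one_hot(F, Y):
    # the Hadamard product selects the log probability of each true class
    m = len(F)
    return -np.sum(Y * np.log(F)) / m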
If the target labels are integers, the following code computes the cross-entropy loss directly from the weighted sum Z:

def softmax_cross_entropy(Z,y):
    m = len(Z)
    F = softmax(Z)
    log_Fy = -np.log(F[range(m),y])
    return np.sum(log_Fy) / m
output:
31.500003072148047
If the target labels are in one-hot vector form, the following code computes the cross-entropy loss from
the weighted sum:
def softmax_cross_entropy_one_hot(Z, y):
F = softmax(Z)
loss = -np.sum(y*np.log(F),axis=1)
return np.mean(loss)
31.500003072148047
The solution algorithm is still the gradient descent method. For this, it is necessary to calculate the gradient of $L(W)$ with respect to $W$, that is, the partial derivative with respect to each $W_{jk}$. $L(W)$ is the composite of the functions $z = (z_1, z_2, z_3) = (xW_{:,1}, xW_{:,2}, xW_{:,3})$ and $f(z) = \mathrm{softmax}(z) = \mathrm{softmax}(z_1, z_2, z_3)$.
For the cross-entropy loss $L = -\log f_y = -\log\frac{a_y}{a_1 + a_2 + a_3}$ of a single sample, where $a_i = e^{z_i}$:

$$\frac{\partial L}{\partial z_i} = -\frac{1}{a_y}\frac{\partial a_y}{\partial z_i} + \frac{1}{a_1 + a_2 + a_3}e^{z_i} = -\frac{1}{a_y}\,1(y{==}i)\,e^{z_y} + \frac{1}{a_1 + a_2 + a_3}e^{z_i} = -1(y{==}i) + \frac{e^{z_i}}{\sum_{k=1}^{3} e^{z_k}} = f_i - 1(y{==}i)$$

Here the symbol $1(y{==}i)$ has the value 1 when $y == i$ holds and the value 0 otherwise.
If y = 1, then:

$$\nabla_z L = (f_1 - 1, f_2, f_3)$$
That is to say, for any C-classification problem, if the classification $y$ of a sample is $i$, then:

$$\nabla_z L = (f_1, f_2, \cdots, f_i - 1, \cdots, f_C) = f - I_i$$

The notation $I_i$ represents a one-hot vector whose i-th component is 1 and whose other components are 0. If the target value $y$ of the sample is represented by this one-hot vector, that is, $y = I_i$, then:

$$\nabla_z L = f - y$$
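As a quick numerical sanity check of this formula (a sketch; the input values are our choice), the analytical gradient $f - I_y$ can be compared against a centered-difference approximation of $-\partial \log f_y / \partial z_i$:

import numpy as np

def stable_softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
y = 1
f = stable_softmax(z)
grad = f.copy()
grad[y] -= 1                 # analytical gradient: f - I_y

eps = 1e-6
num = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    num[i] = (-np.log(stable_softmax(zp)[y]) + np.log(stable_softmax(zm)[y])) / (2*eps)

print(grad)
print(num)   # the two printed vectors should agree to about 6 decimal places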
This is surprisingly consistent with the gradient formula of the linear regression loss $\frac{1}{2}\|f - y\|_2^2$ with respect to $f$, and with that of the logistic regression cross-entropy loss $-(y\log(f) + (1 - y)\log(1 - f))$ with respect to the weighted sum $z$. But there is still a difference: for linear regression it is the gradient of the loss with respect to the output $f$, while for logistic regression and softmax regression it is with respect to the weighted sum $z$ rather than $f$. The gradient is $f - y$ in each case, but the probability of logistic regression is computed by $f = \sigma(z)$, while the probability of softmax regression is computed by $f = \mathrm{softmax}(z)$.
Therefore, for the matrix $Z$ constructed from multiple sample features, using $L$ to represent the total loss of all samples, the gradient of $L$ with respect to the weighted sum $Z$ is:

$$\nabla_Z L = F - I_y$$

where each row of $I_y$ is the one-hot vector of the corresponding sample's label, or, with one-hot targets $Y$:

$$\nabla_Z L = F - Y$$

This form is the same as the gradient of the logistic regression loss function with respect to the weighted sum, and is also consistent with the gradient of the linear regression loss $\frac{1}{2}\|F - Y\|_2^2$ with respect to the output $F$.
According to formula (12), the code for calculating the gradient of cross entropy with respect to Z is as
follows:
def grad_softmax_crossentropy(Z,y):
F = softmax(Z)
I_i = np.zeros_like(Z)
I_i[np.arange(len(Z)),y] = 1
return (F - I_i) / Z.shape[0]
def grad_softmax_cross_entropy(Z,y):
m = len(Z)
F = softmax(Z)
F[range(m),y] -= 1
return F/m
To ensure that the calculation of the analytical gradient contains no errors, the general numerical gradient function from Section 1.4 can be used to compute the numerical gradient of the cross-entropy with respect to Z and compare it with the analytical gradient above:
import util

def loss_f():
    return softmax_cross_entropy(Z,y)

Z = Z.astype(float)   # Note: the integer array must be converted to float type
print("num_grad",util.numerical_gradient(loss_f,[Z]))
If the sample target is represented by a one-hot vector, the code for calculating the gradient of cross
entropy with respect to Z is as follows:
def grad_softmax_crossentropy_one_hot(Z, y):   # y is represented by a one-hot vector
    F = softmax(Z)
    return (F - y)/Z.shape[0]
2. The gradient of the cross-entropy loss with respect to the weight parameter
Having obtained the gradient of the loss function with respect to the weighted sum $z$, the gradient with respect to the model parameters $W$ can be obtained further. Because the gradient of $z_i = xW_{:,i}$ with respect to $W_{:,i}$ is $x$, and with respect to any other column $W_{:,j}$ is 0:

$$\frac{\partial L}{\partial W_{:,1}} = (f_1 - 1(y{==}1))\,x$$

$$\frac{\partial L}{\partial W_{:,2}} = (f_2 - 1(y{==}2))\,x$$

$$\frac{\partial L}{\partial W_{:,3}} = (f_3 - 1(y{==}3))\,x$$

Each $\frac{\partial L}{\partial W_{:,j}}$ is a row vector; to write $\frac{\partial L}{\partial W}$ in the form of a matrix with the same shape as $W$:

$$\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial W_{:,1}}^T, \frac{\partial L}{\partial W_{:,2}}^T, \cdots, \frac{\partial L}{\partial W_{:,C}}^T\right) = x^T\,\big(f_1 - 1(y{==}1),\ f_2 - 1(y{==}2),\ \cdots,\ f_C - 1(y{==}C)\big)$$
If the one-hot vector is used to represent the target value (label) $y$, this can be written as a more concise formula:

$$\frac{\partial L}{\partial W} = x^T (f - y)$$
Here it is assumed that $x$, $f$, $y$ are all row vectors, so $x^T(f - y)$ is a matrix of the same shape as $W$. If C is the number of categories and n is the number of data features, then $x$ is a $1 \times n$ vector, $x^T$ is an $n \times 1$ vector, and $W$ is an $n \times C$ matrix; because $z = xW$ and $f = \mathrm{softmax}(z)$, both $f$ and $y$ are $1 \times C$ vectors.

For m samples, the features form a matrix $X$:

$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix}$$
$F$, $Y$ are the matrices of predicted values and target values corresponding to these samples:

$$F = \begin{bmatrix} f^{(1)} \\ f^{(2)} \\ \vdots \\ f^{(m)} \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Then the vector form of the gradient of the loss function with respect to the weights $W$ is:

$$\frac{\partial L}{\partial W} = \frac{1}{m} X^T (F - Y)$$

Similarly, a regularization term can be added to the cross-entropy loss of softmax regression. If the target value is represented by an integer, the loss function becomes:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right) + \lambda \|W\|_2^2$$

If the target value is represented by a one-hot vector, the loss function becomes:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)} \cdot \log(f^{(i)}) + \lambda \|W\|_2^2$$

Then the gradient of the loss function with respect to the weights $W$ is:

$$\frac{\partial L}{\partial W} = \frac{1}{m} X^T (F - Y) + 2\lambda W$$
According to the gradient formula, it is easy to write the code for calculating the gradient of the loss function with respect to $W$. In the following code, X represents the data feature matrix of multiple samples, y the target value vector, and reg the regularization parameter $\lambda$; loss_softmax() and gradient_softmax() respectively calculate the loss function value and its gradient with respect to W:
#def loss_gradient(W,X,y,lambda_):
def gradient_softmax(W,X,y,reg):
m = len(X)
Z= np.dot(X,W)
I_i = np.zeros_like(Z)
I_i[np.arange(len(Z)),y] = 1
F = softmax(Z)
#F = np.exp(Z) / np.exp(Z).sum(axis=-1,keepdims=True)
grad = (1 / m) * np.dot(X.T,F - I_i)
grad = grad +2*reg*W
return grad
def loss_softmax(W,X,y,reg):
    m = len(X)
    Z = np.dot(X,W)
    Z_i_y_i = Z[np.arange(len(Z)),y]   # weighted sum of each sample's true class
    # cross-entropy via log-sum-exp: mean of log(sum(exp(z))) - z_y, plus the
    # regularization term (these lines complete the truncated listing)
    loss = np.mean(np.log(np.sum(np.exp(Z),axis=1)) - Z_i_y_i)
    return loss + reg*np.sum(W*W)

reg = 0.2
print(gradient_softmax(W,X,y,reg))
print(loss_softmax(W,X,y,reg))
If the target value of each sample is represented by a one-hot vector, and y is the matrix composed of the target values of multiple samples, then loss_softmax_onehot() and gradient_softmax_onehot() below calculate the loss function value and its gradient with respect to W:
def gradient_softmax_onehot(W,X,y,reg):
    m = len(X)        # number of samples
    nC = W.shape[1]   # number of categories
    #y_one_hot = np.eye(nC)[y[:,0]]
    y_one_hot = y
    Z = np.dot(X,W)
    F = softmax(Z)
    # X^T (F - Y) / m plus the regularization gradient
    # (these lines complete the truncated listing)
    return (1/m)*np.dot(X.T, F - y_one_hot) + 2*reg*W

def loss_softmax_onehot(W,X,y,reg):
    m = len(X)        # number of training examples
    nC = W.shape[1]
    #y_one_hot = np.eye(nC)[y[:,0]]
    y_one_hot = y
    Z = np.dot(X,W)
    F = softmax(Z)
    loss = -(1/m)*np.sum(y_one_hot*np.log(F))
    return loss + reg*np.sum(W*W)

reg = 0.2
print(gradient_softmax_onehot(W,X,y,reg))
print(loss_softmax_onehot(W,X,y,reg))
[[ 0.30213245 -1.75779321 1.69566076]
[ 0.5254108 -2.19194012 2.22652932]]
2.0863049636282662
def gradient_descent_softmax(w,X,y,reg,alpha,iterations):
    # the head of this listing is truncated; the loop below is reconstructed
    # to match the surviving tail and the call sites that follow
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))  # add bias feature
    w_history = []
    for i in range(iterations):
        gradient = gradient_softmax(w,X,y,reg)
        w = w - (alpha * gradient)
        #v = gamma*v+alpha* gradientz
        #w= w-v
        #losses.append(loss)
        w_history.append(w)
    return w_history
For a set of samples (X,y), the following auxiliary function calculates the model loss corresponding to
each model parameter in the history record w_history:
def compute_loss_history(w_history,X,y,reg=0.,OneHot=False):
loss_history=[]
X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))
if OneHot:
for w in w_history:
loss_history.append(loss_softmax_onehot(w,X,y,reg))
else:
for w in w_history:
loss_history.append(loss_softmax(w,X,y,reg))
return loss_history
alpha = 1e-0
iterations =200
reg = 1e-3
w = np.zeros([X.shape[1]+1,len(np.unique(y))])
w_history = gradient_descent_softmax(w,X,y,reg,alpha,iterations)
w = w_history[-1]
print("w: ",w)
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
plt.plot(loss_history, color='r')
The following function can calculate the prediction accuracy of the trained model on a batch of data (X,
y):
def getAccuracy(w,X,y):
X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))
probs = softmax(np.dot(X,w))
predicts = np.argmax(probs,axis=1)
accuracy = sum(predicts == y)/(float(len(y)))
return accuracy
Use getAccuracy() to calculate the prediction accuracy of the trained softmax model for the data set just
now.
getAccuracy(w,X_spiral,y_spiral)
0.5366666666666666
The following code plots the classification boundaries of the softmax model for this problem:
# plot the resulting classifier
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = np.dot(np.c_[np.ones(xx.size),xx.ravel(), yy.ravel()], w)
Z = np.argmax(Z, axis=1)
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
#fig.savefig('spiral_linear.png')
(-1.908218802050246, 1.9517811979497575)
It can be seen that the softmax regression is still a linear function model in essence, and the dividing lines
are all straight lines on the graph. It is difficult to segment the data nonlinearly. The accuracy of the model
is 0.5366666666666666.
The following code reads the MNIST handwritten digit training set:
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
"mnist.pkl.gz")
The training set has 50,000 samples, and the validation and test sets each have 10,000 samples. Each pixel value of an image is a float in the interval [0, 1], indicating the grayscale intensity of the pixel, and the label value indicates the digit class corresponding to the image, represented by 0, 1, 2, ..., 9.
Next, output a few of the pixel values (data features) of one sample:
print(train_X.shape)
print(train_X[9][200:250])
(50000, 784)
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.75 0.984375
0.73046875 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.2421875 0.72265625 0.0703125 0. 0. 0.
0. 0.34765625 0.921875 0.84765625 0.18359375 0.
0. 0. ]
alpha =1e-2
iterations =1000
reg = 1e-3
w_history=[]
w = np.zeros([train_X.shape[1]+1,len(np.unique(train_y))])
batch = 10000   # batch size; assumed so that 5 batches cover the 50,000 training samples
for i in range(5):
    s = i*batch
    X = train_X[s :s+batch,:]
    y = train_y[s :s+batch]
w_history_batch = gradient_descent_softmax(w,X,y,reg,alpha,iterations)
w = w_history_batch[-1]
w_history.extend(w_history_batch)
print("w: ",w)
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
Calculate the accuracy of the model function on the training set, validation set, and test set respectively:
print("Accuracy on the training set:",getAccuracy(w,train_X,train_y))
print("Validation set accuracy:",getAccuracy(w,valid_X,valid_y))
print("Accuracy on test set:",getAccuracy(w,test_X,test_y))
Iterative loss learning curves for training and validation sets can be plotted:
loss_history_valid =
compute_loss_history(w_history,valid_X[0:1000,:],valid_y[0:1000],reg)
plt.plot(loss_history, color='r')
plt.plot(loss_history_valid, color='b')
plt.ylim(0,5)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.title('iterative learning curve')
plt.legend(['train', 'valid'])
plt.ylim(-0.2,3)
plt.show()
1. Rearrange the order of the samples in the original training set, that is, shuffle the samples.
2. For the rearranged training set, starting from the beginning, take a small batch of samples in turn, use this batch to compute the gradient of the model's loss, and update the model parameters.
3. Steps 1) and 2) together complete one traversal of (almost) all the samples in the training set, updating the model parameters with a different small batch of samples at each step; the process of 1) and 2) is therefore called an epoch, and step 3) means executing multiple epochs.
Shuffle the order of a list using the numpy.random.shuffle() function, for example:
m=5
indices = list(range(m))
print(indices)
np.random.shuffle(indices)
print(indices)
[0, 1, 2, 3, 4]
[2, 1, 4, 3, 0]
Corresponding to a data set (X, y), an iterator function data_iter() can be defined to shuffle the order of the
original data set and return a small batch of training samples of batchsize size from the data set each time:
def data_iter(X,y,batch_size,shuffle=False):
m = len(X)
indices = list(range(m))
if shuffle: # shuffle is True to shuffle the order
np.random.shuffle(indices)
for i in range(0, m - batch_size + 1, batch_size):
batch_indices = np.array(indices[i: min(i + batch_size, m)])
yield X.take(batch_indices,axis=0), y.take(batch_indices,axis=0)
The following is the code implementation of the batch gradient descent method:
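The full listing is not shown here; a minimal sketch of batch_gradient_descent_softmax() consistent with how it is called below (the momentum update with gamma is an assumption, based on the commented-out lines in gradient_descent_softmax()):

def batch_gradient_descent_softmax(w,X,y,epochs,batch_size,shuffle,reg,alpha,gamma):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))  # add bias feature
    w_history = []
    v = np.zeros_like(w)                  # momentum accumulator
    for epoch in range(epochs):
        for X_b, y_b in data_iter(X, y, batch_size, shuffle):
            gradient = gradient_softmax(w, X_b, y_b, reg)
            v = gamma*v + alpha*gradient  # momentum update
            w = w - v
            w_history.append(w)
    return w_history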
On the Mnist handwritten digit recognition training set, perform this batch gradient descent method:
import matplotlib.pyplot as plt
%matplotlib inline
batchsize = 50
epochs = 5
shuffle = True
alpha = 0.01
reg = 1e-3
gamma = 0.8
X,y = train_X,train_y
w = np.zeros([X.shape[1]+1,len(np.unique(y))])
w_history = batch_gradient_descent_softmax(w,train_X,train_y,epochs,batchsize,
shuffle,reg,alpha,gamma)
w = w_history[-1]
print("w: ",w)
X,y = train_X[0:1000,:],train_y[0:1000]
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
The batch gradient descent algorithm can be used for training with a small batch of samples, which
improves the speed of the algorithm without reducing the accuracy of the model. The accuracy of the
model on different sample sets is output below.
print("Accuracy on the training set:",getAccuracy(w,train_X,train_y))
print("Validation set accuracy:",getAccuracy(w,valid_X,valid_y))
print("Accuracy on test set:",getAccuracy(w,test_X,test_y))
plt.plot(loss_history, color='r')
plt.plot(loss_history_valid, color='b')
plt.ylim(0,5)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.title('iterative learning curve')
plt.legend(['train', 'valid'])
plt.ylim(-0.2,3)
plt.show()
Figure 3-46 Training loss curve
The model parameter matrix $W$ is an $n \times C$ matrix, each column of which corresponds to a classifier similar to logistic regression; the weights in a column extract, from the data, the features relevant for judging that column's class. For the model parameters $W$ of MNIST image classification, the 784 weight parameters of a given column (corresponding to a given class) can be displayed as an image. For example, the following code displays the weight parameters of column 0 (the class corresponding to the digit 0), converting this column vector into an image matrix of size 28 × 28:
c = 0
plt.imshow(w[1:,c].reshape((28,28)))
Convert the value represented by the byte in [0,255] to a value in the [0,1] interval.
train_X = X_train.astype('float32')/255.0
test_X = X_test.astype('float32')/255.0
print(train_X.shape,y_train.shape)
print(test_X.shape,y_test.shape)
print(test_X.dtype,y_test.dtype)
print(np.mean(train_X[0:1000,:]))
print(np.mean(test_X[0:1000,:]))
train_y = y_train
Start training:
import matplotlib.pyplot as plt
%matplotlib inline
batchsize = 50
epochs = 5
shuffle = True
alpha = 0.01
reg = 1e-3
gamma = 0.8
w = np.zeros([train_X.shape[1]+1,len(np.unique(train_y))])
w_history = batch_gradient_descent_softmax(w,train_X,train_y,epochs,batchsize,
shuffle,reg,alpha,gamma)
w = w_history[-1]
print("w: ",w)
X,y = train_X[0:1000,:],train_y[0:1000]
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
plt.plot(loss_history, color='r')
plt.plot(loss_history_valid, color='b')
plt.ylim(0,5)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.title('iterative learning curve')
plt.legend(['train', 'valid'])
plt.ylim(-0.2,3)
plt.show()
Summary
Linear regression, logistic regression, and softmax regression are essentially linear classifiers. Linear regression linearly weights the input, and its regression function is a linear function (a straight line). Logistic regression is essentially the same as linear regression, except that the value of this linear weighted sum is compressed from the interval $(-\infty, \infty)$ to the interval $(0, 1)$ so that the output value has a probabilistic meaning; its decision curve for binary classification is still a straight line. Softmax regression simply regards multiple linear weighted sums as the scores of the respective classes and then converts these values into the probabilities that a sample belongs to the multiple categories; its decision boundaries for multi-classification problems are likewise nothing more than several straight lines.
Chapter 4 Neural Networks
1. Perceptron
The perceptron is a simple function with binary classification capability. The logistic regression function uses the sigmoid function to convert the input's linear weighted sum into a probability $\sigma(xw)$, while the perceptron passes the weighted sum through a threshold function:

$$\mathrm{sign}_b(z) = \begin{cases} 1 & \text{if } z \ge b \\ 0 & \text{else} \end{cases}$$
The perceptron calculates the weighted sum x w of the input x through the
weight vector w , and then outputs a value of 1 or 0 according to whether the
weighted sum exceeds a certain threshold b. For example, a perceptron with 3
input values is calculated as:
$$f_{w,b}(x) = \mathrm{sign}_b\left(\sum_{j=1}^{3} w_j x_j\right)$$
Figure 4-1 Perceptron accepts 3 input values and produces an output value f
A neuron usually has multiple dendrites, which mainly receive incoming information, but only one axon; at the end of the axon are many axon terminals that can transmit information to other neurons. An axon terminal connects with a dendrite of another neuron to transmit signals, and the location of this connection is called a "synapse" in biology.
Each neuron accepts multiple input signals, and the weight of each input
signal to the neuron is also different. If the weighted sum of all input signals
exceeds the "threshold" inside the neuron, an output signal will be generated.
Use $f_w(x)$ to represent the perceptron function, that is, $f_w(x) = \mathrm{sign}_b(xw)$, where $x$ is the input and $w$ is the weight vector, that is:
$$f_w(x) = \begin{cases} 1 & \text{if } \sum_j w_j x_j \ge b \\ 0 & \text{else} \end{cases}$$

Moving the threshold to the left side of the comparison:

$$f_w(x) = \begin{cases} 1 & \text{if } \sum_j w_j x_j - b \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Since $b$ can be positive or negative, using $b$ to represent $-b$, the above formula can be written as:

$$f_w(x) = \begin{cases} 1 & \text{if } \sum_j w_j x_j + b \ge 0 \\ 0 & \text{else} \end{cases}$$
The perceptron can directly express the most basic logical functions: "AND", "OR", and "NAND". Figure 4-3 shows the functions of the "AND", "NAND", "OR", and "XOR" gates of a logic circuit.

Figure 4-3 The functions of the "AND", "NAND", "OR", and "XOR" gates of a logic circuit
Many parameter triples $(w_1, w_2, b)$ can produce the AND gate function, such as: $(0.5, 0.5, -0.6)$, $(0.5, 0.5, -0.9)$, $(1, 1, -1)$, $(1, 1, -1.5)$, etc. For example, for the perceptron with parameters $(0.5, 0.5, -0.6)$, inputting $(1, 1)$ gives the weighted sum $0.5 + 0.5 - 0.6 = 0.4 \ge 0$, so the output value is 1; that is, the perceptron implements the "AND" gate function of the logic circuit. Similarly, many parameters generate the "NAND" gate function, such as: $(-0.5, -0.5, 0.6)$, $(-0.5, -0.5, 0.9)$, $(-1, -1, 1)$, $(-1, -1, 1.5)$, etc., and many generate the "OR" function, such as: $(1, 1, 0)$, $(1, 1, -0.5)$, $(0.5, 0.5, -0.3)$. A quick check is sketched below.
For the perceptron, different weights $w_j$ and bias $b$ generate different specific functions (such as the logical "AND", "OR", and "NAND" functions). Like the logistic regression function, a single perceptron still essentially represents a linearly separable function. For example, the perceptron represented by the parameters $(1, 1, -0.5)$ divides the plane into 2 half-spaces separated by the straight line $-0.5 + x_1 + x_2 = 0$. A single perceptron therefore cannot represent a function such as XOR, whose two classes cannot be separated by a single straight line; only a nonlinear curve, as shown in the figure below, can produce such a division.

Figure 4-5 Only a nonlinear curve division can represent the XOR gate function
More complex functions can be composed from simple "AND", "OR", and "NAND" perceptrons; as shown in Figure 4-7, the XOR gate can be expressed as such a combination:

Figure 4-7 The "exclusive or" (XOR) function represented by a combination of three perceptrons implementing the "AND", "OR", and "NAND" functions
2. Neurons
A neuron is a function (or vector-valued function) that applies a linear or nonlinear transformation to the linear weighted sum of multiple input values to produce one or more output values. That is, a neuron accepts multiple input values, computes their weighted sum, and then produces its output value(s) through a linear or nonlinear function. The function that transforms the weighted sum in a neuron is called an activation function.
$$a = g\left(\sum_j W_j x_j\right)$$
Figure 4-9 Artificial neuron that weights the input and generates output
through the activation function
Sometimes the bias of the neuron is also expressed, that is, the neuron
function is written as follows:
$$a = g\left(\sum_j W_j x_j + b\right)$$
Figure 4-10 The artificial neuron that generates the output through the
activation function after weighting and biasing the input
Neurons are also often represented by the more simplified circles in Figure 4-
11 below:
This indicates that the neuron accepts 3 inputs $x_1, x_2, x_3$ (plus a fixed input feature 1 corresponding to the bias). The step function and its derivative (zero almost everywhere) can be computed as:
def sign(x):
    return np.array(x > 0, dtype=int)   # use the builtin int; np.int is deprecated
def grad_sign(x):
    return np.zeros_like(x)
The graph of the step function sign(x) can be drawn with the following code:
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
x = np.arange(-5.0,5.0, 0.1)
plt.ylim(-0.1, 1.1)   # specify the range of the y-axis
plt.plot(x, sign(x),label="sign")
plt.plot(x, grad_sign(x),label="derivative")
plt.legend(loc="upper right", frameon=False)
plt.show()
2. Tanh function
tanh function:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Its derivative is:

$$\tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)$$
Numpy provides the calculation function tanh() for calculating tanh. The
following code calculates tanh'(x) and draws the function curve of tanh(x) and
tanh'(x):
def grad_tanh(x):
a = np.tanh(x)
return 1 - a**2
Figure 4-13 Graph of tanh(x) function and its derivative function tanh'(x)
4. ReLU function
The ReLU function f(x) outputs x directly when x is greater than 0, otherwise,
outputs 0:
$$\mathrm{Relu}(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$$

Its derivative:

$$\mathrm{Relu}'(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases}$$
The following code calculates Relu(x), Relu'(x), and draws the function curve
of Relu(x), Relu'(x):
def relu(x):
return np.maximum(0, x)
def grad_relu(x):
return 1. * (x > 0)
Figure 4-14 The graph of the Relu(x) function and its derivative function Relu'(x)
There are also some variants of the Relu function, such as the LeakRelu function:

$$\mathrm{LeakRelu}(x) = \begin{cases} x & (x > 0) \\ kx & (x \le 0) \end{cases}$$

Its derivative:

$$\mathrm{LeakRelu}'(x) = \begin{cases} 1 & (x > 0) \\ k & (x \le 0) \end{cases}$$
The following code calculates LeakRelu(x) and LeakRelu'(x) and draws their function curves:

import numpy as np
def leakRelu(x,k=0.2):
    y = np.copy( x )
    y[ y < 0 ] *= k
    return y
def grad_leakRelu(x,k=0.2):
    grad = np.ones_like(x, dtype=float)
    grad[x < 0] = k
    return grad
Figure 4-15 The graph of the LeakRelu(x) function and its derivative function LeakRelu'(x)
The softmax function $f = \mathrm{softmax}(z)$ with 3 input values and 3 output values $(f_1, f_2, f_3)$ can be regarded as 3 neurons:

$$f_1 = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}, \qquad f_2 = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}, \qquad f_3 = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
These 3 neurons are different from general neurons: they do not compute a weighted sum of the input vector $z = (z_1, z_2, z_3)$ but directly transform it into the output probabilities. In a network, the weighted sums $z = (z_1, z_2, z_3)$ output by the previous layer's neurons are input to the 3 neurons of the softmax function to get the final output $f$.
In the above neural network, the left column of circles represents the multiple features of an input. For a two-dimensional plane coordinate point, a sample has only 2 features, namely its horizontal and vertical coordinate values $x = (x_1, x_2)$. The middle column of neurons computes the weighted sums $z_i$, which are directly output to the rightmost column, the 3 neurons of the softmax function; each produces an output value $f_i$, and together they constitute the final output.
Data is input from the left and output from the right: the output of the neurons in the left column is the input of the neurons on the right; there are no connections between neurons in the same column, and neurons on the right do not feed back to neurons on the left. That is, this is a calculation process in which data advances in the "left to right" direction without going back. Such a neural network is called a feedforward neural network.
All the neurons in one column of a feedforward neural network are called a layer of the network. The features of the input data are usually called the input layer, the last column of neurons that produces the final output is called the output layer, and each column of neurons in the middle is called a hidden layer. In a feedforward neural network, the data flows layer by layer from the input layer through each hidden layer in turn, and the output layer finally produces the output value of the neural network function. Some books count the input layer when counting the layers of a neural network, so the above network is called a 3-layer neural network; other books do not count the input layer and call the above network a 2-layer neural network.
The neurons of the above neural network are a bit special: the neurons of the softmax output layer do not compute a weighted sum of the previous layer's outputs, and the hidden layer outputs its weighted sums directly without a nonlinear activation function; that is, the hidden-layer neurons are linear regression functions.
The neurons in the hidden layers of a general neural network are similar to logistic regression: the weighted sum is computed first and then transformed by a nonlinear activation function. The output layer neurons can be softmax neurons, linear regression neurons, or logistic regression neurons. Because the neurons of the softmax layer have no parameters such as weights but apply the fixed softmax() function, when designing a neural network for multi-classification problems the softmax() function need not be treated as a separate layer; instead, the layer preceding the softmax can directly be used as the output layer. That is, softmax regression can be represented by the following neural network (as shown in Figure 4-19):
Figure 4-19 Softmax regression without the softmax layer, regarded as a neural network composed of only 3 weighted-sum neurons
That is, there are only the input layer and the output layer, and there is no
hidden layer. The output value of the output layer represents the score that the
input data belongs to each category. This score can then pass through
softmax() to output the probability that the data belongs to each category.
That is, the softmax function for calculating the classification probability can
be omitted in the figure.
An actual neural network contains at least one hidden layer and usually contains many. With the development of hardware computing power, represented by GPUs, and the availability of large-scale data, modern neural networks usually contain many hidden layers. The number of layers of a neural network is called its depth; the depth of a modern neural network can be very large, even reaching hundreds of hidden layers. Deeper neural networks are called deep neural networks, and machine learning based on deep neural networks is called deep learning.
Figure 4-20 2-layer neural network, the leftmost column is the data, the
middle column of neurons constitutes the hidden layer, and the rightmost
column of neurons constitutes the output layer
Both the hidden layer and the output layer are neurons similar to the logistic
regression function, that is, each neuron uses its own weight vector to
calculate a weighted sum z, and then generates an output value a through its
own activation function.
This 2-layer neural network with only one hidden layer defines a function $f: \mathbb{R}^D \to \mathbb{R}^K$, where D is the size of the input vector $x$ and K is the size of the output vector. Use $z_i^{[l]}$ to represent the weighted sum of the i-th neuron in the l-th layer, and $a_i^{[l]}$ to represent the activation (output) value of the i-th neuron in the l-th layer. When l = 0, $a_i^{(0)}$ is the i-th input feature.
i
Assuming that the activation function of all neurons in the first layer is the
same function g , each neuron accepts an input x = (x , x ) and produces
[1]
1 2
an output value, these outputs The values (activation values) also form a
[1] [1] [1] [1]
vector a = (a , a , a , a ):
[1]
1 2 3 4
That is, each column of $W^{[1]}$ and $b^{[1]}$ holds the weights or bias of one neuron, and the calculation of the first layer of neurons can be expressed as:

$$a^{[1]} = g^{[1]}(x W^{[1]} + b^{[1]})$$

where $x$ is $1 \times 2$, $W^{[1]}$ is $2 \times 4$, and $a^{[1]}$ and $b^{[1]}$ are $1 \times 4$.
Assume the activation function of all neurons in the 2nd layer (the output layer) is the same function $g^{[2]}$; they accept the output values of the hidden layer and compute:

$$a_1^{[2]} = g^{[2]}(a_1^{[1]}W_{11}^{[2]} + a_2^{[1]}W_{21}^{[2]} + a_3^{[1]}W_{31}^{[2]} + a_4^{[1]}W_{41}^{[2]} + b_1^{[2]})$$

$$a_2^{[2]} = g^{[2]}(a_1^{[1]}W_{12}^{[2]} + a_2^{[1]}W_{22}^{[2]} + a_3^{[1]}W_{32}^{[2]} + a_4^{[1]}W_{42}^{[2]} + b_2^{[2]})$$

$$a_3^{[2]} = g^{[2]}(a_1^{[1]}W_{13}^{[2]} + a_2^{[1]}W_{23}^{[2]} + a_3^{[1]}W_{33}^{[2]} + a_4^{[1]}W_{43}^{[2]} + b_3^{[2]})$$

that is, $a^{[2]} = g^{[2]}(a^{[1]}W^{[2]} + b^{[2]})$, where $a^{[2]}$ is $1 \times 3$, $a^{[1]}$ is $1 \times 4$, $W^{[2]}$ is $4 \times 3$, and $b^{[2]}$ is $1 \times 3$.
Use $a^{(0)}$ to represent the input data $x$; then the weighted sum of the first layer can be written as follows:

$$z^{[1]} = a^{(0)} W^{[1]} + b^{[1]}$$

It can be seen that the calculations of the weighted sums and activation values of the first and second layers are completely analogous. For a general neural network, the weighted sum and activation value of layer $l$ are computed as:

$$z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}(z^{[l]})$$

That is, layer $l$ accepts the input $a^{[l-1]}$ from layer $l-1$, computes the weighted sum $z^{[l]}$, and applies its activation function to obtain $a^{[l]} = g^{[l]}(z^{[l]})$.
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
g1 = sigmoid
g2 = sigmoid
# x and W1, b1
x = np.array([1.0, 0.5])              # input x: 1x2 row vector
W1 = np.array([[0.1, 0.3,0.5,0.2],
               [0.4,0.6,0.7,0.1]])    # W1: 2x4 matrix
b1 = np.array([0.1, 0.2, 0.3,0.4])    # bias b1: 1x4 row vector
print("x.shape",x.shape)     # (2,)
print("W1.shape",W1.shape)   # (2, 4)
print("b1.shape",b1.shape)   # (4,)
z1 = np.dot(x,W1)+b1         # weighted sum of layer 1
a1 = g1(z1)                  # activation values of layer 1
print("z1",z1)
print("a1",a1)
# a1, W2, b2
W2 = np.array([[0.1, 1.4,0.2],[2.5, 0.6, 0.3],
               [1.1,0.7,0.8],[0.3,1.5,2.1]])   # W2: 4x3 matrix
b2 = np.array([0.1, 2,0.3])
print("a1.shape",a1.shape)   # (4,)
print("W2.shape",W2.shape)   # (4, 3)
print("b2.shape",b2.shape)   # (3,)
z2 = np.dot(a1,W2)+b2        # weighted sum of layer 2
a2 = g2(z2)                  # activation values of layer 2 (the output)
print("z2",z2)
print("a2",a2)

x.shape (2,)
W1.shape (2, 4)
b1.shape (4,)
z1 [0.4  0.8  1.15 0.65]
a1 [0.59868766 0.68997448 0.75951092 0.65701046]
a1.shape (4,)
W2.shape (4, 3)
b2.shape (3,)
z2 [2.91737012 4.76932075 2.61406058]
a2 [0.94869845 0.99158527 0.93176103]
4.1.4 Forward calculation of multiple samples

The data features of multiple samples, say m samples $x^{(i)}$, can form a matrix $X$:

$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix}$$

Let $z^{[i](l)}$ and $a^{[i](l)}$ denote the weighted sum and activation values of the l-th layer for the i-th sample; they form the i-th rows of the matrices $Z^{[l]}$ and $A^{[l]}$:

$$Z^{[l]} = \begin{bmatrix} z^{[1](l)} \\ z^{[2](l)} \\ \vdots \\ z^{[m](l)} \end{bmatrix}, \qquad A^{[l]} = \begin{bmatrix} a^{[1](l)} \\ a^{[2](l)} \\ \vdots \\ a^{[m](l)} \end{bmatrix}$$

The forward calculation of multiple samples can be written in vector (matrix) form as follows:

$$Z^{[l]} = \begin{bmatrix} a^{[1](l-1)} W^{[l]} + b^{[l]} \\ a^{[2](l-1)} W^{[l]} + b^{[l]} \\ \vdots \\ a^{[m](l-1)} W^{[l]} + b^{[l]} \end{bmatrix}, \qquad A^{[l]} = \begin{bmatrix} g^{[l]}(z^{[1](l)}) \\ g^{[l]}(z^{[2](l)}) \\ \vdots \\ g^{[l]}(z^{[m](l)}) \end{bmatrix}$$

Because numpy arrays have a broadcast function, the above formula can be simplified; therefore, for a general layer l, the vectorized formula for the forward computation is:

$$Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]})$$

Similarly, you can write the python code for the forward calculation in vector (matrix) form for multiple samples:
X = np.array([[1.0, 2.],[3.0,4.0]])
W1 = np.array([[0.1, 0.3,0.5,0.2],
               [0.4,0.6,0.7,0.1]])    # W1: 2x4 matrix
b1 = np.array([0.1, 0.2, 0.3,0.4])    # bias b1: 1x4 row vector
print("X.shape",X.shape)     # (2, 2)
print("W1.shape",W1.shape)   # (2, 4)
print("b1.shape",b1.shape)   # (4,)
Z1 = np.dot(X,W1)+b1         # broadcasting adds b1 to every row
A1 = g1(Z1)
print("Z1:",Z1)
print("A1:",A1)
print("A1.shape",A1.shape)   # (2, 4)
print("W2.shape",W2.shape)   # (4, 3)
print("b2.shape",b2.shape)   # (3,)
Z2 = np.dot(A1,W2)+b2
A2 = g2(Z2)
print("Z2:",Z2)
print("A2:",A2)

X.shape (2, 2)
W1.shape (2, 4)
b1.shape (4,)
Z1: [[1.  1.7 2.2 0.8]
 [2.  3.5 4.6 1.4]]
A1: [[0.73105858 0.84553473 0.90024951 0.68997448]
 [0.88079708 0.97068777 0.9900482  0.80218389]]
A1.shape (2, 4)
W2.shape (4, 3)
b2.shape (3,)
Z2: [[3.4842095  5.19593923 2.86901816]
 [3.94450732 5.71183814 3.24399047]]
A2: [[0.97023513 0.9944915  0.94629347]
 [0.98100697 0.99670431 0.96245657]]
If instead each sample $x$ is treated as a column vector, the first layer's weighted sum is written as:

$$\begin{bmatrix} z_1^{[1]} \\ \vdots \\ z_4^{[1]} \end{bmatrix} = \begin{bmatrix} W_{11}^{[1]} & W_{12}^{[1]} \\ \vdots & \vdots \\ W_{41}^{[1]} & W_{42}^{[1]} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_1^{[1]} \\ \vdots \\ b_4^{[1]} \end{bmatrix}$$

Because each $x^{(i)}$ is then a column vector, the m samples $x^{(i)}$ form a matrix $X$ whose columns are the samples.
4.1.5 Output
When the neural network is used to solve regression problems, similar to linear regression, the output is any real number on the real number axis, and there can be one output or several. For example, in target localization it is necessary to output the position of the target, such as its vertical and horizontal coordinates in the sample image; for the detection of face landmarks, as shown in Figure 4-21, it is necessary to output the coordinates of a number of facial feature points in the face image. For these problems, the output values can be arbitrary real numbers.
When solving binary classification problems, similar to logistic regression, the output is the probability that the sample belongs to one of the two categories; the $\sigma$ function compresses a real number from the interval $(-\infty, \infty)$ to the interval [0, 1] so that it represents a probability. When solving multi-classification problems, the output is the probability that the sample belongs to each category; the softmax function compresses the same number of real numbers to the [0, 1] interval so that they represent probabilities.
For classification problems, even if the real numbers on the real number axis are not compressed to the [0, 1] probability interval, the sample category can still be judged from the relative sizes of the real values. For example, for a 3-category problem, if the outputs are the 3 real numbers 219, 18, and 564, then the sample belongs to the category with the largest value, the 3rd category, rather than the 1st or 2nd. The purpose of converting arbitrary real numbers into probabilities is to define the cross-entropy loss of binary or multi-classification in a probabilistic sense.
Therefore, for classification problems, the output of the neural network can be a real number belonging to the real
number axis, indicating which class the sample belongs to or the score of multiple different classes. It can also be
the probability of converting the score through the σ and softmax functions.
When designing the neural network structure, if a neuron containing the $\sigma$ function or softmax function is used as the final output layer, the output probabilities can directly enter the cross-entropy loss against the target values. If the network layer that outputs the scores is used as the output layer, then whether for classification or regression the output is an arbitrary real number; for classification problems this score output is further transformed by the $\sigma$ or softmax function to compute the cross-entropy loss against the target values.
Regardless of whether the $\sigma$ function or the softmax function is used as the output layer, the neurons in the other layers of the network are similar to the logistic regression function: each neuron accepts the input of the previous layer $a^{[l-1]}$, uses its weight vector to compute the weighted sum $z^{[l]} = a^{[l-1]} W^{[l]}$, and then generates an output $a^{[l]} = g^{[l]}(z^{[l]})$ through an activation function $g^{[l]}$. If the layer that outputs the scores is used as the output layer, the activation function $g^{[L]}$ of this layer is usually the identity function; otherwise, the output layer is a special neuron such as the $\sigma$ function or the softmax function. To avoid distinguishing between such special neurons and general logistic-regression-like neurons, this book uses the network layer that outputs the scores as the output layer.
If the predicted value and true value of a sample are $f^{(i)}$ and $y^{(i)}$ respectively, the error of the sample is $L(f^{(i)}, y^{(i)})$. When training or validating a neural network model, the overall average error is usually calculated for a set of samples; the error of m samples can be taken as the average of their errors, that is:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L(f^{(i)}, y^{(i)})$$

For a given sample, its error (loss) $L(f^{(i)}, y^{(i)})$ varies with the model parameters, because different model parameters produce different predicted values $f^{(i)}$, and likewise for the average error $L(f, y)$ of multiple samples. That is, the error can be regarded as a function of the predicted values and hence of the model parameters, called the loss function. By minimizing the training loss, the model parameters corresponding to the minimum of the loss function can be found.
The commonly used loss functions in neural networks are the three types in Chapter 3: mean square error loss,
binary cross-entropy loss, and multi-class cross-entropy loss.
$$L(F, Y) = \frac{1}{m}\sum_{i=1}^{m} \left\|f^{(i)} - y^{(i)}\right\|_2^2$$

Multiplying by a constant does not change the extreme point of this loss function; sometimes, to make the derived gradient look cleaner, the mean square error loss is divided by 2, namely:

$$L(F, Y) = \frac{1}{2m}\sum_{i=1}^{m} \left\|f^{(i)} - y^{(i)}\right\|_2^2$$
For a sample $(f^{(i)}, y^{(i)})$, the calculation code of $\frac{1}{2}\left\|f^{(i)} - y^{(i)}\right\|_2^2$ is as follows:
import numpy as np
f = np.array([0.1, 0.2,0.5])
y = np.array([0.3, 0.4,0.2])
loss = np.sum((f - y) ** 2)/2
print(loss)
0.084999999999999999
The average loss of m samples:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}), \qquad L(F, Y) = \frac{1}{2m}\left\|F - Y\right\|_2^2$$

can be computed as:

m = len(F)
loss = np.sum((F - Y) ** 2)/(2*m)
# loss = (np.square(F-Y)).mean()
print(loss)

0.08499999999999999

mse_loss(F,Y,True)

0.08499999999999999
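The mse_loss() called above is not listed; a minimal sketch matching the printed values (the flag name is an assumption):

def mse_loss(F, Y, halve=False):
    # mean squared error over m samples; halve=True divides by 2m instead of m
    m = len(F)
    loss = np.sum((F - Y) ** 2) / m
    if halve:
        loss /= 2
    return loss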
The mean square error is used for regression problems; for classification problems, the cross-entropy loss is generally used. The binary classification cross-entropy loss of m samples is:

$$L(f, y) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f^{(i)}) + (1 - y^{(i)})\log(1 - f^{(i)})\right]$$

where $y^{(i)}$ has the value 1 or 0, indicating the category to which the sample belongs, and $f^{(i)}$ indicates the probability that the sample belongs to the category with value 1. The binary classification cross-entropy loss can be calculated with the following code:

- (1./m)*np.sum(np.multiply(y,np.log(f)) + np.multiply((1 - y), np.log(1 - f)))

For example:

#https://towardsdatascience.com/neural-net-from-scratch-using-numpy-71a31f6e3675
f = np.array([0.1, 0.2,0.5])   # the probabilities of category 1 for three samples
y = np.array([0, 1, 0])        # the classifications of the 3 samples
m = y.shape[0]

0.8026485362172906

To prevent f or 1-f from having the value 0 and making the log() function abnormal, a small quantity $\epsilon$ can be added inside the logarithm, giving a binary cross-entropy loss function whose value here is:

0.8026485091802541
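A minimal sketch of such an epsilon-protected loss function (the function name and the value eps = 1e-8 are assumptions, chosen to be consistent with the printed value above):

def binary_cross_entropy(f, y, eps=1e-8):
    # eps keeps the logarithms away from log(0)
    m = len(y)
    return -(1./m)*np.sum(y*np.log(f+eps) + (1-y)*np.log(1-f+eps))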
For multi-classification, $y^{(i)}$ represents the sample target value. According to the softmax regression of Chapter 3, the cross-entropy loss of multiple samples is:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{c=1}^{C} y_c^{(i)}\log(f_c^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)} \cdot \log(f^{(i)})$$
If the true classification of a sample is the third category and the predicted value indicates the probabilities that the sample belongs to the three categories, then the cross-entropy loss of this sample depends only on the term corresponding to the true classification.
Therefore, if the target values of all samples are one-hot vectors, then for m samples $L(f, y)$ can be written as a vectorized Hadamard product:

$$L(f, y) = -\frac{1}{m}\,\mathrm{sum}(y \odot \log(f))$$
For example, for 2 samples with m = 2, the output F of the softmax and the target value (one-hot vector) matrix Y of the samples are as follows:

$$F = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.4 & 0.3 & 0.3 \end{bmatrix}, \qquad Y = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
F = np.array([[0.2,0.5,0.3],[0.4,0.3,0.3]])
Y = np.array([[0,0,1],[1,0,0]])
cross_entropy_loss_onehot(F,Y)
1.0601317681000455
If the target value of each sample is not represented by a one-hot vector but by an integer indicating which category the sample belongs to: for C-classification problems, the integer values $0, 1, 2, \cdots, C-1$ indicate the class, e.g. the integer 2 indicates that the sample belongs to the third category. In this case the cross-entropy loss of the sample is the negative log of the corresponding component of $f^{(i)}$ (that is, the component $f^{(i)}_2$ with subscript 2), namely $-\log f^{(i)}_2$.

For target classifications represented by integers, the multi-class cross-entropy loss is:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right)$$

where $y^{(i)}$ is the integer value (subscript) of the category to which the i-th sample belongs.
def cross_entropy_loss(F,Y,onehot=False):
    m = len(F)   # F.shape[0], the number of samples
    if onehot:
        return -(1./m)*np.sum(np.multiply(Y, np.log(F)))
    else:
        # the log value of the component of F[i] corresponding to category Y[i]
        return -(1./m)*np.sum(np.log(F[range(m),Y]))
In the following code, F is the output for 2 samples; the output vector of each sample has 3 components, indicating the probabilities that the sample belongs to the 3 categories, and the i-th component of the target Y is the subscript value (such as 0, 1, 2) of the category of the i-th sample.
cross_entropy_loss(F,Y)
1.0601317681000455
[[0. 0. 1.]
 [1. 0. 0.]]

cross_entropy_loss_onehot(F,one_hot_y)

1.0601317681000455
Of course, to prevent overfitting, a regularization term can be added on the basis of the above loss, imposing a larger penalty on the model parameters with large absolute values, to prevent the absolute values of the model parameters from being too large:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}) + \lambda\sum_{l=1}^{L}\left\|W^{[l]}\right\|_2^2$$
Like the regression problem, a neural network function with a certain structure is completely determined by the
parameters of the neurons (weight parameters and bias parameters), and different parameters represent a different
specific neural network function.
For a set of samples, we hope to find the neural network parameters that best fit them, that is, to determine the neural network function that best reflects the relationship between the sample features and the target values. The process of finding the best neural network parameters is the same as training any machine learning model: find the model parameters that minimize a certain loss. Specifically, the model parameters of the neural network are determined by solving the minimization problem of the loss function; this process is called neural network training.
The training of the neural network is the same as the training of the regression model, which uses the gradient
descent method to iteratively update the model parameters until the algorithm converges enough or reaches the
maximum number of iterations. The gradient descent method needs to calculate the partial derivative of the loss
function with respect to the model parameters. The neural network usually contains many layers, each layer has
many neurons, and each neuron contains many model parameters. Calculating the partial derivative is more
complicated than the regression problem.
The forward calculation of the neural network and the calculation of the loss function are discussed above, and the
numerical gradient can be used in the gradient descent algorithm to approximate the analytical gradient. Next,
implement a complete neural network training and prediction algorithm for the previous 2-layer neural network.
https://www.bogotobogo.com/python/scikit-learn/Artificial-Neural-Network-ANN-5-Checking-Gradient.php
In linear regression, the model parameters are usually initialized to 0, but if the weight parameters of a neural network model are initialized to 0, all neurons in a layer will evolve identically, that is, their parameters will remain the same, so a layer with multiple neurons becomes equivalent to a single neuron; this greatly degrades the expressive ability of the neural network and makes it difficult to obtain a satisfactory model. Therefore, the weights are generally initialized randomly, and researchers have provided a variety of methods for initializing the weights of neural networks.
Usually, the biases of a neural network are initialized to 0, and the weight parameters are randomly sampled from a distribution such as a Gaussian distribution. If the number of neurons in layer $l$ is $n^{(l)}$, and the number of output values of the previous layer is $n^{(l-1)}$, then $W^{[l]}$ is an $n^{(l-1)} \times n^{(l)}$ matrix, which can be initialized with:

Wl = np.random.randn(n_l_1,n_l)* 0.01

That is, random values from a standard normal distribution are multiplied by 0.01.
Assuming that the number of input features of the above two-layer neural network is n_x, and the number of
neurons in the middle layer and output layer are n_h and n_o respectively, the function initialize_parameters()
completes the initialization of all model parameters and returns a dictionary object.
import numpy as np
def initialize_parameters(n_x, n_h, n_o):
    W1 = np.random.randn(n_x,n_h)* 0.01
    b1 = np.zeros((1,n_h))
    W2 = np.random.randn(n_h,n_o) * 0.01
    b2 = np.zeros((1,n_o))
    parameters = [W1,b1,W2,b2]
    return parameters
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_propagation(X, parameters):
    # the middle of this listing is truncated; the hidden layer is assumed
    # to use the sigmoid defined above, and the output layer is linear scores
    W1, b1, W2, b2 = parameters
    Z1 = np.dot(X, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    assert(Z2.shape == (X.shape[0],3))
    return Z2

Z2 = forward_propagation(X, parameters)
print(Z2)
The forward calculation function outputs the score Z belonging to each class, and the score can be converted into
the probability of belonging to each class with the softmax() function, and then calculate the multi-category cross-
entropy loss with the real target value. The function softmax_cross_entropy() and function
softmax_cross_entropy_reg() calculate the cross-entropy loss based on the output score value Z and the real value
y, which includes the regular term loss (reg is the regular term coefficient).
def softmax(Z):
exp_Z = np.exp(Z-np.max(Z,axis=1,keepdims=True))
return exp_Z/np.sum(exp_Z,axis=1,keepdims=True)
1.098427770814438
It is generally hoped that the neural network can calculate the loss function value by inputting a set of data X and
the corresponding target value y. Therefore, the forward calculation and the separate cross-entropy loss calculation
function are combined:
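The combined function is not listed above; a minimal sketch of compute_loss_reg() and softmax_cross_entropy_reg() consistent with how they are called below (both names appear later; their bodies and exact signatures are assumptions):

def softmax_cross_entropy_reg(Z, y, parameters, reg):
    # cross-entropy on the scores Z plus L2 regularization of the weights
    m = len(Z)
    F = softmax(Z)
    loss = -np.sum(np.log(F[range(m), y])) / m
    W1, b1, W2, b2 = parameters
    return loss + reg*(np.sum(W1*W1) + np.sum(W2*W2))

def compute_loss_reg(forward, loss_fn, X, y, parameters, reg):
    Z = forward(X, parameters)   # forward pass: class scores
    return loss_fn(Z, y, parameters, reg)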
1.098427770814438
Define a function f() that returns the loss value and pass it, together with the model parameters, to the general numerical gradient calculation function from section 2.4 to compute the numerical gradients of the neural network:
import util
def f():
return compute_loss_reg(forward_propagation,\
softmax_cross_entropy_reg, X, y, parameters,reg)
num_grads = util.numerical_gradient(f,parameters)
print(num_grads[0])
print(num_grads[3])
Now we can modify the previous gradient descent method to train the neural network model. A minimal sketch of gradient_descent_ANN(), consistent with the call below and using the numerical gradient (the convergence test with max_abs() is an assumption):

def max_abs(grads):
    return max([np.max(np.abs(grad)) for grad in grads])

def gradient_descent_ANN(f,X,y,parameters,reg,alpha,iterations=100):
    # X, y are unused here because f() closes over them
    losses = []
    for i in range(iterations):
        grads = util.numerical_gradient(f,parameters)   # numerical gradients
        for param, grad in zip(parameters, grads):
            param -= alpha*grad                         # update in place
        if max_abs(grads) < 1e-8:                       # converged
            break
        loss = f()
        losses.append(loss)
    return parameters,losses
np.random.seed(100)
def gen_spiral_dataset(N=100,D=2,K=3):
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
r = np.linspace(0.0,1,N) # radius
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j
return X,y
X_spiral,y_spiral = gen_spiral_dataset()
# lets visualize the data:
plt.scatter(X_spiral[:, 0], X_spiral[:, 1], c=y_spiral, s=40, cmap=plt.cm.Spectral)
plt.show()
X = X_spiral
y = y_spiral
n_x, n_h, n_o = 2,5,3
parameters = initialize_parameters(n_x, n_h, n_o)
alpha = 1e-0
iterations =1000
lambda_ = 1e-3
parameters,losses = gradient_descent_ANN(f,X,y,parameters,lambda_, alpha, iterations)
for param in parameters:
print(param)
print(losses[:-1:len(losses)//10])
plt.plot(losses, color='r')
Figure 4-22. The loss curve of the three-class neural network for the spiral data set
The following function calculates the prediction accuracy of the model on the sample set (X, y) by comparing the
prediction result with the target value:
def getAccuracy(X,y,parameters):
predicts = forward_propagation(X,parameters)
predicts = np.argmax(predicts,axis=1)
accuracy = sum(predicts == y)/(float(len(y)))
return accuracy
getAccuracy(X,y,parameters)
0.9433333333333334
The prediction accuracy of the model on the training set reached 0.943, while the prediction accuracy of the
original softmax regression model was only 0.516. Draw the decision region again with code similar to the
previous one:
# plot the resulting classifier
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
XX = np.c_[xx.ravel(), yy.ravel()]
Z = forward_propagation(XX,parameters)
Z = np.argmax(Z, axis=1)
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
#fig.savefig('spiral_linear.png')
(-1.9355521912329907, 1.8444478087670126)
Figure 4-23 The decision regions of the three-class neural network classifier
It can be seen that the decision boundary of the 2-layer neural network model is no longer a straight line but can be an arbitrarily curved curve.
Not until 2012, when a team from the University of Toronto in Canada won the ImageNet competition with a deep convolutional neural network, did neural network models shine again. The success of modern neural networks is mainly due to high-performance parallel computing hardware and large-scale data, especially high-performance GPUs such as Nvidia's CUDA GPUs, which can perform data-intensive large-scale parallel computing.
For example, one forward pass of 500 samples with 784 features through a layer of 100 neurons requires the multiplication of a 500 × 784 matrix and a 784 × 100 matrix. The amount of calculation is therefore proportional to the number of samples, the number of features per sample, and the number of neurons in the neural network. Calculating the numerical partial derivative of each model parameter is independent of all the others, and each requires two independent forward calculations (including the calculation of the loss function value). Therefore, the cost of numerical derivation is very large, and it is not feasible for deep and large-scale neural networks. In fact, in deep learning, parallel computing hardware such as GPUs is used to accelerate the forward calculation and the gradient calculation of the neural network.
Therefore, the actual training of a neural network computes the analytical gradient (derivative) of the loss function with respect to the model parameters through the chain rule of derivation: the loss of the model is calculated in the forward direction, and then, starting from the gradient of the loss with respect to the final output value, the gradients of the model parameters of each layer are calculated along the reverse direction of the forward calculation.
The chain rule of derivation tells us that the calculation of a derivative runs exactly opposite to the calculation of the function value. For example, a variable x passes through the function g to give the value g(x); this value enters the function h to give h(g(x)); and that value is in turn input into the function k, giving the final output f(x) = k(h(g(x))). The calculation process is as follows:

x → g(x) → h(g(x)) → k(h(g(x))) = f(x)

f(x) = k(h(g(x))) is obtained from the series of functions g(x), h(g), k(h) by function composition. Given an argument x, the value f(x) is calculated step by step following this composition, from the innermost independent variable through a series of intermediate values, until the final f(x) is obtained. This "from inside to outside" process is called forward calculation. If g, h, and k are regarded as the functions of the layers of a neural network, this calculation is exactly the propagation of calculations from the input layer, layer by layer, to the next layer. Therefore, in a neural network the forward calculation is called forward propagation.
According to the chain rule, the derivative of f with respect to x can be calculated as follows:

f′(x) = k′(h) h′(g) g′(x)

That is, the calculation of f′(x) is decomposed into a series of steps: first calculate the derivative of f with respect to h, i.e. f′(h) = k′(h), then calculate the derivative of h with respect to g, i.e. h′(g), and finally calculate the derivative g′(x) of g with respect to x. As shown below, the calculation of the derivative f′(x) runs in the direction opposite to the calculation of the function value f(x): it proceeds "from outside to inside", in the reverse of the function composition, that is, along the reverse of the forward calculation:

f′(h) = k′(h) → f′(g) = k′(h) h′(g) → f′(x) = k′(h) h′(g) g′(x)

This process of calculating the derivative of a compound function in reverse is called reverse derivation. The calculations of the derivatives f′(h) = k′(h), h′(g), g′(x) are not independent of each other. If f′(h) is obtained first, there is no need to recompute it when calculating f′(g) = f′(h) h′(g). That is, if the derivative f′(h) of f with respect to an intermediate variable such as h is saved along the direction of reverse derivation, it can be directly multiplied by h′(g) to obtain f′(g). This avoids recalculating f′(h) when calculating f′(g).
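To make the saving of intermediate derivatives concrete, here is a small self-contained sketch (with arbitrarily chosen functions g, h, k, not from the book) that computes f′(x) by reverse derivation, reusing the saved f′(h), and checks it against a numerical derivative:

import numpy as np
g  = lambda x: 2*x + 1;     dg = lambda x: 2.0          # g and its derivative
h  = lambda g_: g_**2;      dh = lambda g_: 2*g_
k  = lambda h_: np.sin(h_); dk = lambda h_: np.cos(h_)
x = 1.5
gx = g(x); hx = h(gx); fx = k(hx)   # forward calculation, saving intermediates
df_dh = dk(hx)                      # f'(h), computed once and saved
df_dg = df_dh * dh(gx)              # f'(g) reuses the saved f'(h)
df_dx = df_dg * dg(x)               # f'(x) reuses the saved f'(g)
eps = 1e-6                          # numerical check
num = (k(h(g(x+eps))) - k(h(g(x-eps)))) / (2*eps)
print(np.isclose(df_dx, num))       # True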
If x is regarded as the input of the neural network, g, h as the outputs of the hidden layer and the output layer, and f(h) = k(h) as the loss function, then f′(h) = k′(h) represents the gradient (derivative) of the loss function with respect to the output h of the neural network. Starting from f′(h) and moving in the reverse direction of the neural network, from the output layer to the hidden layer, the gradients of the loss function with respect to the hidden-layer output g and the input x can be calculated in turn.
If the gradient of the loss function with respect to the output of a layer, such as f′(h), is known, the gradients of that layer's model parameters can be obtained. For example, suppose the input of a neural network layer is x and its output is a = σ(xw + b) = σ(z); if the derivative L′(a) of the loss function L with respect to a is known, the derivatives with respect to z, w, and b follow from the chain rule.
For the neural network model, the calculation of the gradients of the loss function with respect to each layer's output, intermediate variables, and input runs exactly opposite to the forward propagation: first calculate the gradient of the loss function with respect to the output of the output layer, and then, layer by layer toward the input, compute the gradients of the loss function with respect to each layer's intermediate variables, model parameters, and inputs. This calculation process is called backward propagation.
A compound function is built from simple functions not only by function composition but also by ordinary addition, subtraction, multiplication, and division. No matter which operations are used to construct the compound function, the forward calculation of the function value from the independent variables and the reverse derivation of the derivatives (gradients) with respect to the intermediate and independent variables proceed in the same way. A simple example further illustrates forward calculation and reverse derivation.
For two independent variables x, y, the function f(x, y) = (2x + 3y)² + (x − 4y)² can be regarded as f(x, y) = s + t, where s, t can be regarded as s = u², t = v², and u, v can be regarded as u = 2x + 3y, v = x − 4y.

The reverse derivation process of the partial derivative of the function f(x, y) with respect to x is as follows:

f′(s) = 1, f′(t) = 1 → f′(u) = f′(s) s′(u), f′(v) = f′(t) t′(v) → f′(x) = f′(u) u′(x) + f′(v) v′(x)
The forward calculation of the function value proceeds according to the composition process, as shown in Figure 4-24. The reverse derivation proceeds in just the opposite order, as shown in Figure 4-25:
Figure 4-25 The reverse derivation calculation diagram of the function f(x, y) = (2x + 3y)² + (x − 4y)²
It can be seen from this reverse calculation diagram that f′(x) comes from the accumulation of the partial derivatives along two paths: one is the reverse derivation from u, the other is the reverse derivation from v, that is, f′(x) = f′(u) u′(x) + f′(v) v′(x), while f′(u) receives a contribution only from the reverse derivative of s, i.e. f′(u) = f′(s) s′(u), and similarly f′(v) = f′(t) t′(v). Therefore:

f′(x) = f′(s) s′(u) u′(x) + f′(t) t′(v) v′(x)

That is, first find f′(s), then find s′(u), then find u′(x), and their product gives f′(s) s′(u) u′(x); the derivative is calculated along the reverse direction of the forward calculation, and f′(t) t′(v) v′(x) is obtained in the same way. Since f′(s) = 1, f′(t) = 1, s′(u) = 2u, t′(v) = 2v, u′(x) = 2, v′(x) = 1, the final f′(x) is:

f′(x) = 1 ∗ 2u ∗ 2 + 1 ∗ 2v ∗ 1 = 4u + 2v
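As a quick check, the following few lines (a self-contained sketch with concrete values chosen for illustration) carry out this forward and reverse calculation and compare the result with a numerical derivative:

import numpy as np
x, y = 1.0, 2.0
u, v = 2*x + 3*y, x - 4*y          # forward: inner variables u = 8, v = -7
s, t = u**2, v**2
f = s + t
df_ds, df_dt = 1.0, 1.0            # reverse: from outside to inside
df_du, df_dv = df_ds*2*u, df_dt*2*v
df_dx = df_du*2 + df_dv*1          # accumulate the two paths into x
eps = 1e-6                         # numerical check by central difference
num = ((2*(x+eps)+3*y)**2 + ((x+eps)-4*y)**2
       - (2*(x-eps)+3*y)**2 - ((x-eps)-4*y)**2) / (2*eps)
print(df_dx, num)                  # both equal 4u + 2v = 32 - 14 = 18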
The feed-forward neural network is composed of multiple layers of neurons. Each layer of neurons accepts the input a^[l−1] of the previous layer and, with the layer's own model parameters W^[l], b^[l], calculates the weighted sum z^[l] = a^[l−1]W^[l] + b^[l]. An activation function g then produces the layer's own output a^[l], which is used as the input of the next layer of neurons; in this way the calculation results are passed layer by layer to the next layer, until the output layer, where the output and the target value produce some loss L(a^[L], y).

For the above 2-layer neural network, the weighted sums z^[1], z^[2] and the output values a^[1], a^[2] of each layer of neurons can be calculated in the following order, finally giving the loss function value L(a^[2], y):

z^[1] = W^[1]x + b^[1] → a^[1] = σ(z^[1]) → z^[2] = W^[2]a^[1] + b^[2] → a^[2] = σ(z^[2]) → L(a^[2], y)
The neural network function can be regarded as a function of its model parameters and intermediate variables. The process of calculating the gradient (derivative) of the loss function with respect to these parameters and variables is the same as the reverse derivation of any compound function: starting "from outside to inside" at the outermost loss function, the gradients of the intermediate variables and parameters are calculated sequentially along the reverse direction of the forward calculation of the neural network function value.

The process is: first find the gradient of the loss function with respect to the output of the output layer. If the gradient ∂L/∂a^[l] of the loss function with respect to the output of some layer l is known, then, according to the activation function of the neurons of this layer, the gradient ∂L/∂z^[l] of the loss with respect to this layer's weighted sum z^[l] can be found; knowing ∂L/∂z^[l], the gradients of the loss function with respect to this layer's model parameters ∂L/∂W^[l], ∂L/∂b^[l] and with respect to the previous layer's output ∂L/∂a^[l−1] can be found.
For the above 2-layer neural network, the reverse derivation process is as follows:

∂L/∂a^[2] → ∂L/∂z^[2] → (∂L/∂W^[2], ∂L/∂b^[2], ∂L/∂a^[1]) → ∂L/∂z^[1] → (∂L/∂W^[1], ∂L/∂b^[1])
The gradients of the loss function with respect to the intermediate variables and parameters of each layer depend on results of the forward calculation, such as a^[l]. In order to avoid recomputing these values, the forward calculation can store these results in the corresponding layers of the neural network, and the reverse derivation can then use the stored results directly, avoiding repeated calculation and improving efficiency. From the viewpoint of the calculation graph, these intermediate results can be saved in the corresponding nodes of the graph. Modern deep learning platforms express the forward propagation and reverse derivation of a neural network by means of a calculation graph and save the relevant intermediate results on its nodes. Therefore, the calculation graph not only guarantees the correct order of the forward and reverse calculations but also saves intermediate results to improve calculation efficiency.
4.2.3 The gradient of the loss function with respect to the output
Reverse derivation first calculates the gradient of the loss function with respect to the output of the final output layer, and then calculates the gradients of the loss function with respect to the intermediate variables and parameters of each layer, from the output layer along the reverse direction of forward propagation until the input layer.
The definitions of loss functions for different problems (regression, classification) are different. The following
discusses how to calculate the gradient of the loss function with respect to the output layer for several common
loss functions.
1. The gradient of the binary cross-entropy loss function on the output
According to Section 3.5), the derivative of the binary-classification cross-entropy loss L(f, y) = −(y log(f) + (1 − y) log(1 − f)) with respect to f is:

∂L/∂f = −(y/f − (1−y)/(1−f)) = (f − y)/(f(1−f))
For binary classification problems, the cross-entropy of multiple samples is the mean of the cross-entropies of the single samples:

L(F, Y) = (1/m) Σᵢ Lᵢ(y^(i), f^(i)) = −(1/m) Σᵢ [y^(i) log(f^(i)) + (1 − y^(i)) log(1 − f^(i))]

Its gradient with respect to F is:

∂L/∂F = (1/m) (F − Y)/(F(1 − F))

Since F = σ(Z):

∂F/∂Z = σ(Z)(1 − σ(Z)) = F(1 − F)

therefore:

∂L/∂Z = (∂L/∂F)(∂F/∂Z) = (1/m)(F − Y)
That is:

     ⎡ f^(1) ⎤        ⎡ y^(1) ⎤                   ⎡ f^(1) − y^(1) ⎤
F =  ⎢   ⋮   ⎥,  Y =  ⎢   ⋮   ⎥,  ∂L/∂Z = (1/m)  ⎢       ⋮       ⎥
     ⎣ f^(m) ⎦        ⎣ y^(m) ⎦                   ⎣ f^(m) − y^(m) ⎦

It should be noted that, because each sample corresponds to a different z^(i), the cross-entropy loss L is derived separately with respect to each z^(i); these derivatives are not added up. In addition, the vector multiplications and divisions in the above formulas are all element-wise operations.

According to whether the output of the neural network is the score given by the weighted sum or the output probability of the σ function, the following function calculates the binary cross-entropy loss and the corresponding gradient (the function body is restored around the surviving lines of the original; the exact flag semantics are an assumption):

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def binary_cross_entropy_loss_grad(out, y, sigmoid_out=True):
    # sigmoid_out=True: out is the probability f = sigmoid(z), and the
    # gradient with respect to f is returned; sigmoid_out=False: out is the
    # raw score z, and the gradient with respect to z is returned
    if sigmoid_out:
        f = out
        grad = (f-y)/(f*(1-f))/len(y)
    else:
        f = sigmoid(out)
        grad = (f-y)/(len(y))
    loss = -np.mean(y*np.log(f)+(1-y)*np.log(1-f))
    return loss, grad

z = np.array([-4, 5, 2])
f = sigmoid(z)
y = np.array([0, 1, 0])
loss,grad = binary_cross_entropy_loss_grad(f,y)
print(loss,grad)
loss,grad = binary_cross_entropy_loss_grad(z,y,False)
print(loss,grad)
2. The gradient of the mean square error loss function on the output
For regression problems, the output layer consists of one or more linear regression neurons: each neuron directly outputs the weighted sum z of its inputs, and the values output by the K neurons of the output layer form an output vector z = (z_1, z_2, ..., z_K), which is the output of the entire output layer, f = z. For K > 1 the target value is also a vector.
For a sample, half of the squared Euclidean distance between the output vector f^(i) and the target value vector y^(i) can be used as the error:

L(f^(i), y^(i)) = ½ Σ_k (f^(i)_k − y^(i)_k)² = ½ ‖f^(i) − y^(i)‖²

The factor ½ is introduced to make the derivative (gradient) look more concise; the gradient of this error with respect to f^(i) is (f^(i) − y^(i)).
For matrices F, Y composed of multiple samples, the mean square error is L(F, Y) = (1/2m) Σᵢ ‖f^(i) − y^(i)‖², and its gradient with respect to F is (1/m)(F − Y). Because F = Z, namely:

∂L/∂Z = ∂L/∂F = (1/m)(F − Y)

The function mse_loss_grad() computes this loss and gradient (the body is restored around the surviving lines):

def mse_loss_grad(f,y):
    m = len(f)
    loss = np.sum((f-y)**2)/(2*m)
    grad = (f-y)/m
    return loss, grad

3. The gradient of the multi-class cross-entropy loss function on the output
For multi-classification problems, the neural network converts the output of the previous layer into probabilities with an intuitive meaning through a final softmax function, indicating the probability that the sample belongs to each class. Since the softmax neurons contain no model parameters, sometimes softmax is not used as the last layer of the neural network; instead, the layer before it serves as the last output layer. No matter which scheme is adopted, the multi-class cross-entropy loss of softmax regression is usually calculated at the end. The latter scheme is usually adopted, i.e., it is assumed that the output layer outputs scores instead of probabilities: the neurons of this output layer are linear regression neurons whose weighted sums are directly output as the activation values, f = a^(L) = z^(L) (assuming an L-layer neural network, this output layer has serial number L).
Let the output z of the output layer generate an output f through the softmax function, and then calculate the multi-class cross-entropy loss. For multiple samples, the output of the output layer can be written as a matrix Z = (z^(1), ⋯, z^(i), ⋯, z^(m))ᵀ, and the output generated by the softmax function is a probability matrix F = (f^(1), ⋯, f^(i), ⋯, f^(m))ᵀ. According to Section 3.6), if each target value y^(i) is represented by a one-hot vector, then Y is also a matrix, and the gradient of the multi-class cross-entropy loss L(F, Y) with respect to Z is (1/m)(F − Y).
If each target value y_i is instead an integer representing the index of the class the sample belongs to, then the gradient of the multi-class cross-entropy loss L(F, Y) with respect to Z is (1/m)(F − I), where each row of I is the one-hot vector obtained by converting the sample's integer label; this matrix I is the same as the one-hot matrix Y above.
The following python code converts a target vector of integer values into a matrix of one-hot vectors:
I_i = np.zeros_like(Z)
I_i[np.arange(len(Z)),Y] = 1
It can be seen that the gradients with respect to the output-layer Z of the Euclidean loss of regression, the cross-entropy loss of binary classification, and the cross-entropy loss of multi-classification are surprisingly consistent: all are (1/m)(F − Y).
Given a multi-sample output layer weighted sum Z and a target value Y, the following code computes the gradient
of the multiclass cross-entropy with respect to Z (see Section 3.6):
def softmax(x):
    a = np.max(x,axis=-1,keepdims=True)
    e_x = np.exp(x - a)
    return e_x/np.sum(e_x,axis=-1,keepdims=True)
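The gradient function itself does not survive in this excerpt; the following is a minimal sketch consistent with the formula (1/m)(F − I) and with the ways cross_entropy_grad is called later in this chapter (the optional flags are assumptions):

def cross_entropy_grad(Z, y, onehot=False, softmax_out=False):
    # gradient of the mean multi-class cross-entropy with respect to Z:
    # (F - Y)/m; onehot says whether y is already a one-hot matrix,
    # softmax_out whether Z is already a probability matrix
    F = Z if softmax_out else softmax(Z)
    m = len(Z)
    if onehot:
        Y = y
    else:
        Y = np.zeros_like(F)       # convert integer labels to one-hot rows
        Y[np.arange(m), y] = 1
    return (F - Y) / m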
Next we discuss how, knowing the gradient ∂L/∂z^[l] of the loss with respect to a layer's weighted sum z^[l], to find the gradients of the loss with respect to the variables W^[l], b^[l], a^[l−1] of that layer.

Consider the above 2-layer neural network and suppose ∂L/∂z^[2] is known. How can ∂L/∂W^[2], ∂L/∂b^[2], ∂L/∂a^[1] be obtained?

Look first at a single weight entry such as W^[2]_11. Because W^[2]_11 is related only to z^[2]_1 (the other components z^[2]_2, z^[2]_3 do not depend on it), the corresponding partial derivatives vanish, and the chain rule gives:

∂L/∂W^[2]_11 = (∂L/∂z^[2]_1)(∂z^[2]_1/∂W^[2]_11) + 0 + 0 = (∂L/∂z^[2]_1) a^[1]_1

This is because the i-th column of W^[2] contributes only to z^[2]_i, or equivalently, z^[2]_i depends only on the i-th column of W^[2], so in general:

∂L/∂W^[2]_ji = (∂L/∂z^[2]_i) a^[1]_j

Collecting all entries into a matrix, with a^[1] written as a row vector:

∂L/∂W^[2] = (a^[1])ᵀ (∂L/∂z^[2])

Obviously, since ∂z^[2]_i/∂b^[2]_i = 1:

∂L/∂b^[2] = ∂L/∂z^[2]

Because z^[2] = a^[1]W^[2] + b^[2], every component a^[1]_j feeds into every component of z^[2], so the partial derivatives with respect to a^[1]_j must be accumulated over all components:

∂L/∂a^[1]_j = Σ_i (∂L/∂z^[2]_i) W^[2]_ji,   i.e.   ∂L/∂a^[1] = (∂L/∂z^[2]) (W^[2])ᵀ

Because a^[1] = g(z^[1]) is an element-wise activation, ∂L/∂z^[1] = ∂L/∂a^[1] ⊙ g′(z^[1]); and by exactly the same reasoning as for the second layer, with a^[0] = x:

∂L/∂W^[1] = (a^[0])ᵀ (∂L/∂z^[1]) = xᵀ (∂L/∂z^[1]),   ∂L/∂b^[1] = ∂L/∂z^[1]

At this point, the gradients of the loss function with respect to all variables W^[2], b^[2], a^[1], z^[1], W^[1], b^[1] have been obtained. That is, starting from the gradient of the loss function with respect to the weighted sum of the output layer, the gradients of the loss function with respect to the relevant variables of each layer are obtained according to the reverse derivation process:

∂L/∂z^[2] → ∂L/∂W^[2] = (a^[1])ᵀ ∂L/∂z^[2],  ∂L/∂b^[2] = ∂L/∂z^[2],  ∂L/∂a^[1] = ∂L/∂z^[2] (W^[2])ᵀ
→ ∂L/∂z^[1] = ∂L/∂a^[1] ⊙ g′(z^[1]) → ∂L/∂W^[1] = xᵀ ∂L/∂z^[1],  ∂L/∂b^[1] = ∂L/∂z^[1]
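The following self-contained sketch implements these single-sample formulas for a tiny 2-layer network with a sigmoid hidden activation and a squared-error loss (all sizes, values, and the loss are chosen arbitrarily for the check, and are not from the book), and verifies one weight gradient numerically:

import numpy as np
np.random.seed(0)
x = np.random.randn(1, 2)                    # one sample as a row vector
W1, b1 = np.random.randn(2, 5), np.zeros((1, 5))
W2, b2 = np.random.randn(5, 3), np.zeros((1, 3))
t = np.random.randn(1, 3)                    # an arbitrary target
sigmoid = lambda z: 1/(1 + np.exp(-z))
def loss():
    a1 = sigmoid(np.dot(x, W1) + b1)
    z2 = np.dot(a1, W2) + b2
    return 0.5*np.sum((z2 - t)**2)
# forward calculation, saving the intermediate values
z1 = np.dot(x, W1) + b1
a1 = sigmoid(z1)
z2 = np.dot(a1, W2) + b2
# reverse derivation following the formulas above
dz2 = z2 - t                                 # dL/dz2 of the squared error
dW2 = np.dot(a1.T, dz2)
db2 = dz2
da1 = np.dot(dz2, W2.T)
dz1 = da1 * a1*(1 - a1)                      # element-wise g'(z1) of the sigmoid
dW1 = np.dot(x.T, dz1)
db1 = dz1
# numerical check of one entry of dW1
h = 1e-6
W1[0, 0] += h; lp = loss()
W1[0, 0] -= 2*h; lm = loss()
W1[0, 0] += h
print(dW1[0, 0], (lp - lm)/(2*h))            # the two values should agree closely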
2. Multi-sample vectorized representation of reverse derivation
Like general machine learning, when training a neural network the model parameters are usually solved by minimizing the error (loss) between the predicted values and the true values of multiple samples. The loss is a function of the model parameters as well as of intermediate variables such as a^[l], z^[l].
For the non-parameter quantities of each layer, such as the intermediate variables a^[l], z^[l], different samples have different values, and they are all different variables; for example, a^[l](1) and a^[l](2) are different variables produced by two different samples at layer l. Assuming these variables are all written in the form of row vectors, the values of these variables for all samples can be stacked in rows to form a matrix, each row of the matrix corresponding to one sample. The symbols A^[l], Z^[l] can be used to represent these matrices:

         ⎡ a^[l](1) ⎤            ⎡ z^[l](1) ⎤
A^[l] =  ⎢ a^[l](2) ⎥,   Z^[l] = ⎢ z^[l](2) ⎥
         ⎢     ⋮    ⎥            ⎢     ⋮    ⎥
         ⎣ a^[l](m) ⎦            ⎣ z^[l](m) ⎦
Where A^[0] is the matrix X composed of the input features of all samples, namely:

             ⎡ x^(1) ⎤
A^[0] = X =  ⎢ x^(2) ⎥
             ⎢   ⋮   ⎥
             ⎣ x^(m) ⎦
That is, different samples produce different intermediate variables in the forward propagation of each layer, but the same model parameters W^[l], b^[l] are used by all of them. Since the loss of multiple samples is the mean of the losses of all samples, the gradient of the multi-sample loss with respect to the model parameters is the mean of the per-sample gradients. For example, for the weight parameter W, with m samples:

∂L/∂W = (1/m) Σᵢ ∂L^(i)/∂W

Usually, when the gradient ∂L/∂z^[L] of the loss function with respect to the output layer is calculated, the mean factor 1/m has already been multiplied in. Therefore, the gradients of the model parameters can be directly accumulated:

∂L/∂W = Σᵢ ∂L^(i)/∂W
therefore:

∂L/∂W^[2] = Σᵢ (a^[1](i))ᵀ (∂L/∂z^[2](i)) = (A^[1])ᵀ ∂L/∂Z^[2]

Similarly, for the bias, the partial derivatives of all single samples can be accumulated to get:

∂L/∂b^[2] = Σᵢ ∂L/∂z^[2](i) = np.sum(∂L/∂Z^[2], axis=0, keepdims=True)

Different from the model parameters, the intermediate variables of different samples are different (not shared). Therefore, the gradients of the loss function with respect to the intermediate variables of different samples are independent of each other. If the gradient of each sample's intermediate variable is written as a row vector, all these gradients can be stacked into a matrix in which each row represents the gradient of one sample. Therefore:

∂L/∂A^[1] = ∂L/∂Z^[2] (W^[2])ᵀ,   ∂L/∂Z^[1] = ∂L/∂A^[1] ⊙ g′(Z^[1])

So the multi-sample gradients have the same formulas as the single-sample gradients:

∂L/∂W^[2] = (A^[1])ᵀ ∂L/∂Z^[2]      ∂L/∂b^[2] = np.sum(∂L/∂Z^[2], axis=0, keepdims=True)
∂L/∂A^[1] = ∂L/∂Z^[2] (W^[2])ᵀ      ∂L/∂Z^[1] = ∂L/∂A^[1] ⊙ g′(Z^[1])
∂L/∂W^[1] = (A^[0])ᵀ ∂L/∂Z^[1]      ∂L/∂b^[1] = np.sum(∂L/∂Z^[1], axis=0, keepdims=True)
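A quick sanity check (a sketch with arbitrary shapes, not from the book) that the vectorized weight gradient (A^[1])ᵀ ∂L/∂Z^[2] indeed accumulates the per-sample outer products:

import numpy as np
np.random.seed(0)
A1 = np.random.randn(5, 4)                   # 5 samples, 4 hidden outputs (rows)
dZ2 = np.random.randn(5, 3)                  # per-sample gradients dL/dz2 (rows)
dW2 = np.dot(A1.T, dZ2)                      # vectorized multi-sample form
dW2_sum = sum(np.outer(A1[i], dZ2[i]) for i in range(5))
print(np.allclose(dW2, dW2_sum))             # True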
3. Gradient calculation formula in column vector form
If the samples, intermediate variables, and their gradients are all in column-vector form, i.e., x, a^[1], z^[1], b^[1], a^[2], z^[2], b^[2] are all column vectors, then each row of W^[1], W^[2] corresponds to all the weights of one neuron, and z^[l] = W^[l]a^[l−1] + b^[l].
One-sample form:

∂L/∂W^[2] = (∂L/∂z^[2]) (a^[1])ᵀ      ∂L/∂b^[2] = ∂L/∂z^[2]      ∂L/∂a^[1] = (W^[2])ᵀ ∂L/∂z^[2]
∂L/∂z^[1] = ∂L/∂a^[1] ⊙ g′(z^[1])     ∂L/∂W^[1] = (∂L/∂z^[1]) (a^[0])ᵀ      ∂L/∂b^[1] = ∂L/∂z^[1]

Multi-sample form:

∂L/∂W^[2] = (∂L/∂Z^[2]) (A^[1])ᵀ      ∂L/∂b^[2] = np.sum(∂L/∂Z^[2], axis=1, keepdims=True)      ∂L/∂A^[1] = (W^[2])ᵀ ∂L/∂Z^[2]
∂L/∂Z^[1] = ∂L/∂A^[1] ⊙ g′(Z^[1])     ∂L/∂W^[1] = (∂L/∂Z^[1]) (A^[0])ᵀ      ∂L/∂b^[1] = np.sum(∂L/∂Z^[1], axis=1, keepdims=True)

For a loss function that contains a regularization term, the partial derivative of the regularization term with respect to each model parameter must also be calculated when computing the gradient. If the regularization term is λ‖W‖² = λ Σ_l Σ_ij (W^[l]_ij)², then its partial derivative with respect to W^[l]_ij is 2λW^[l]_ij, written in matrix form as 2λW^[l].
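The regularization gradient 2λW can itself be confirmed with a one-entry numerical check (a small sketch, not from the book):

import numpy as np
np.random.seed(0)
W, lam = np.random.randn(3, 2), 1e-3
h = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += h; Wm[0, 0] -= h
num = (lam*np.sum(Wp**2) - lam*np.sum(Wm**2)) / (2*h)   # central difference
print(np.isclose(num, 2*lam*W[0, 0]))                   # True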
For the above 2-layer neural network, on the basis of the forward calculation (that is, A0 and A1 from the forward calculation are known), the following code gives the reverse derivation process, assuming the activation function of the first layer is ReLU (the lines around the surviving ones are restored from the formulas above):

def dRelu(x):
    return 1. * (x > 0)

dW2 = np.dot(A1.T, dZ2)
db2 = np.sum(dZ2, axis=0, keepdims=True)
dA1 = np.dot(dZ2, W2.T)
#dZ1 = dA1*dRelu(A1)
dA1[A1 <= 0] = 0
dZ1 = dA1
dW1 = np.dot(X.T, dZ1)
db1 = np.sum(dZ1, axis=0, keepdims=True)
def max_abs(s):
    max_value = 0
    for x in s:
        max_value_ = np.max(np.abs(x))
        if max_value_ > max_value:
            max_value = max_value_
    return max_value
class TwoLayerNN:
    # (parts of this class were lost in extraction; the missing pieces are
    # restored here as a sketch consistent with the surviving lines)
    def __init__(self, input_units, hidden_units, output_units):
        # initialize parameters randomly
        n, h, K = input_units, hidden_units, output_units
        self.W1 = 0.01 * np.random.randn(n, h)
        self.b1 = np.zeros((1, h))
        self.W2 = 0.01 * np.random.randn(h, K)
        self.b2 = np.zeros((1, K))
    def train(self, X, y, reg=1e-3, iterations=10000, learning_rate=1e-0, epsilon=1e-8):
        W1, b1, W2, b2 = self.W1, self.b1, self.W2, self.b2
        for i in range(iterations):
            # forward
            Z1 = np.dot(X, W1) + b1
            A1 = np.maximum(0, Z1)          # ReLU activation
            Z2 = np.dot(A1, W2) + b2
            data_loss = softmax_cross_entropy(Z2, y)
            reg_loss = reg*np.sum(W1*W1) + reg*np.sum(W2*W2)
            loss = data_loss + reg_loss
            if i % 1000 == 0:
                print("iteration %d: loss %f" % (i, loss))
            # backward
            dZ2 = cross_entropy_grad(Z2, y)
            dW2 = np.dot(A1.T, dZ2) + 2*reg*W2
            db2 = np.sum(dZ2, axis=0, keepdims=True)
            dA1 = np.dot(dZ2, W2.T)
            dA1[A1 <= 0] = 0
            dZ1 = dA1
            #dZ1 = dA1*dRelu(A1)
            #dZ1 = np.multiply(dA1, dRelu(A1))
            dW1 = np.dot(X.T, dZ1) + 2*reg*W1
            db1 = np.sum(dZ1, axis=0, keepdims=True)
            if max_abs([dW2, db2, dW1, db1]) < epsilon:
                print("gradient is small enough at iter : ", i)
                break
            # gradient descent update (restored; lost in extraction)
            W1 += -learning_rate * dW1; b1 += -learning_rate * db1
            W2 += -learning_rate * dW2; b2 += -learning_rate * db2
    def predict(self, X):
        Z1 = np.dot(X, self.W1) + self.b1
        A1 = np.maximum(0, Z1)              # ReLU activation
        Z2 = np.dot(A1, self.W2) + self.b2
        return Z2
The constructor __init__() of TwoLayerNN accepts the numbers of neurons in the input, hidden, and output layers as parameters and initializes the model parameters of the 2-layer neural network. train() trains the neural network model: according to the training samples, it uses the gradient descent algorithm to find the best model parameters, i.e., those that minimize the cross-entropy loss on the training samples. The parameters of train() include a set of training samples (X, y), the regularization coefficient reg, and the hyperparameters of the gradient descent method (such as the number of iterations, the learning rate learning_rate, and the convergence error). In each iteration of gradient descent, train() first calculates the sample outputs and their intermediate variables (Z1, A1, Z2) in the forward direction, uses softmax to convert the scores into probabilities and calculates the multi-class cross-entropy loss (data_loss), then calculates the gradient dZ2 of the cross-entropy loss with respect to the output-layer output, and backpropagates to find the gradients with respect to the intermediate variables and model parameters (the gradients of the model parameters include the gradients of the regularization term, 2*reg*W2 and 2*reg*W1).
The prediction function predict() predicts the target value of the input data X according to the trained neural network model; it is simply a forward propagation calculation.
The data features and target values of the spiral data set in Section 2.7) can be modeled with the above 2-layer neural network. First generate the dataset, then construct and train the model:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import data_set as ds
np.random.seed(89)
X,y = ds.gen_spiral_dataset()
nn = TwoLayerNN(2, 100, 3)   # (restored: the construction and training calls
nn.train(X, y)               #  were lost; 100 hidden units is an assumption)
Output the accuracy of the trained model with the following code:
# evaluate training set accuracy
#A1 = np.maximum(0, np.dot(X, W1) + b1)
#Z2 = np.dot(A1, W2) + b2
Z2 = nn.predict(X)
predicted_class = np.argmax(Z2, axis=1)
print ('training accuracy: %.2f' % (np.mean(predicted_class == y)))
It can be seen that the model trained with analytical gradients is more accurate, reaching 99%. The following code visualizes its decision boundary, which is also better than that of the model trained with numerical gradients:
# plot the resulting classifier
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
XX = np.c_[xx.ravel(), yy.ravel()]
Z = nn.predict(XX)
Z = np.argmax(Z, axis=1)
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, cmap=plt.cm.spring)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
#fig.savefig('spiral_net.png')
(-1.9124776305480737, 1.9275223694519297)
Figure 4-27 Classification decision region for a spiral data point set
In general, for any layer l, knowing the gradient ∂L/∂a^[l] of the loss function with respect to this layer's output, the gradient with respect to the weighted sum follows from the element-wise activation, ∂L/∂z^[l] = ∂L/∂a^[l] ⊙ g′(z^[l]), and then the gradients of the loss function with respect to the parameters W^[l], b^[l] and the input a^[l−1] of the layer can be obtained:

∂L/∂W^[l] = (a^[l−1])ᵀ ∂L/∂z^[l]
∂L/∂b^[l] = np.sum(∂L/∂z^[l], axis=0, keepdims=True)
∂L/∂a^[l−1] = ∂L/∂z^[l] (W^[l])ᵀ

and in multi-sample form:

∂L/∂W^[l] = (A^[l−1])ᵀ ∂L/∂Z^[l]
∂L/∂b^[l] = np.sum(∂L/∂Z^[l], axis=0, keepdims=True)
∂L/∂A^[l−1] = ∂L/∂Z^[l] (W^[l])ᵀ

In the following, the gradient of the loss function with respect to the intermediate variables and model parameters is derived in column-vector form, i.e., assuming the input data x = a^[0] and the intermediate variables z^[l], a^[l] and their gradients are represented by column vectors. With column vectors, each row (instead of each column) of the weight matrix represents the weight parameters of one neuron, the weighted sum is z^[l] = W^[l]a^[l−1] + b^[l], and the weighted sum passed through the activation function g is of course also a column vector, a^[l] = g(z^[l]).

For simplicity of derivation, use the symbol δ^[l] to denote the partial derivative of the loss function with respect to the weighted sum z^[l], that is, δ^[l] = ∂L/∂z^[l].

The j-th neuron of the l-th layer outputs the weighted sum z^[l]_j = Σᵢ W^[l]_ji a^[l−1]_i + b^[l]_j, which is related only to this neuron and has nothing to do with the other neurons of this layer. Therefore, the weight parameter W^[l]_jk of this neuron contributes only to z^[l]_j (only z^[l]_j depends on W^[l]_jk), so:

∂L/∂W^[l]_jk = (∂L/∂z^[l]_j)(∂z^[l]_j/∂W^[l]_jk) = δ^[l]_j a^[l−1]_k

Collecting all entries, the gradient of the weight matrix is the outer product of δ^[l] and a^[l−1], and the gradient of the bias is δ^[l] itself:

∂L/∂W^[l] = δ^[l] (a^[l−1])ᵀ,   ∂L/∂b^[l] = δ^[l]

Figure 4-28 Each output value of each neuron in layer l−1 becomes the input to each neuron in layer l

As shown in the figure, each output a^[l−1]_k of a neuron in layer l−1 becomes an input to every neuron in layer l. Therefore, when calculating the partial derivative of the loss function with respect to a^[l−1]_k, the partial derivatives through all the z^[l]_j must be accumulated, namely:

∂L/∂a^[l−1]_k = Σ_j (∂L/∂z^[l]_j)(∂z^[l]_j/∂a^[l−1]_k) = Σ_j δ^[l]_j W^[l]_jk

That is, the dot product of δ^[l] with the k-th column of W^[l]. If W^[l]_,k denotes the k-th column, the above formula can be written as a matrix product, and stacking all components k gives:

∂L/∂a^[l−1]_k = (W^[l]_,k)ᵀ δ^[l],   ∂L/∂a^[l−1] = (W^[l])ᵀ δ^[l]

The key to the problem is how to find δ^[l]_j = ∂L/∂z^[l]_j. For the output layer L, as in the section "The gradient of the loss function with respect to the output", the loss function directly gives the gradient of the output layer. For any other layer l−1 < L, once ∂L/∂a^[l−1] and the derivative of the activation function of the neurons of this layer are known (without loss of generality, let the activation functions of the neurons of this layer all be g, that is, a^[l−1]_i = g(z^[l−1]_i)), the chain rule of derivation gives:

δ^[l−1]_i = ∂L/∂z^[l−1]_i = (∂L/∂a^[l−1]_i) g′(z^[l−1]_i)

The symbol g′(.) can be seen as broadcasting, i.e., acting element-wise on an array:

g′(z^[l]) = (g′(z^[l]_1), g′(z^[l]_2), ⋯, g′(z^[l]_n))ᵀ

so that:

δ^[l−1] = ∂L/∂z^[l−1] = ∂L/∂a^[l−1] ⊙ g′(z^[l−1])

Substituting ∂L/∂a^[l−1] = (W^[l])ᵀ δ^[l] into the above formula gives:

δ^[l−1] = ((W^[l])ᵀ δ^[l]) ⊙ g′(z^[l−1])

That is, in the reverse derivation process it is not necessary to calculate the gradient of the intermediate layer output ∂L/∂a^[l−1] explicitly: the δ of a layer follows directly from the δ of the next layer.

Finally, it should be noted that if the output layer does not directly output the weighted sum z^(L) but passes it through an activation function, a^(L) = f(z^(L)), then for the variance loss ½‖a^(L) − y‖² the gradient is ∂L/∂a^(L) = a^(L) − y and δ^(L) = (∂L/∂a^(L)) ⊙ f′(z^(L)). Here f is the identity function for regression problems (assuming the variance loss ½‖z^(L) − y‖²); for the binary classification problem, f is the σ function, and for the multi-classification problem, f is the softmax function, and the gradient of the cross-entropy loss of a^(L) and the target value y with respect to the weighted sum z^(L) is a^(L) − y (for multi-classification, this y is in one-hot vector form). Of course, for multiple samples, the gradient is:

∂L/∂Z^(L) = (1/m)(A^(L) − Y) = (1/m)(f(Z^(L)) − Y)

The gradient formula of the loss function with respect to the output layer, together with the following three formulas, are called the four major formulas of reverse derivation:

δ^[l−1] = ((W^[l])ᵀ δ^[l]) ⊙ g′(z^[l−1])
∂L/∂W^[l] = δ^[l] (a^[l−1])ᵀ
∂L/∂b^[l] = δ^[l]

The vector form for multiple samples is:

∂L/∂W^[l] = (∂L/∂Z^[l]) (A^[l−1])ᵀ
∂L/∂b^[l] = np.sum(∂L/∂Z^[l], axis=1, keepdims=True)
∂L/∂Z^[l−1] = ((W^[l])ᵀ ∂L/∂Z^[l]) ⊙ g′(Z^[l−1])
Prepare data: prepare the sample data set for training the model, which besides the training set may include a validation set and a test set;
Determine the neural network structure: design an appropriate neural network model for the specific problem. If the model is too large, training takes a long time and is difficult; if the model is too small, its expressive power may be insufficient. A suitable network structure must be chosen according to the actual problem. The network structure also includes which activation functions to choose and which error (loss) criterion, i.e., which loss function, to define;
Train the model: this includes random initialization of the model parameters and the gradient descent method for finding the optimal solution. A validation set may be needed to help select appropriate models and hyperparameters, to avoid overfitting and underfitting.
Like the regression model, the neural network also uses the gradient descent algorithm to train the model and find the most suitable model parameters. Each iteration of the gradient descent algorithm can be divided into three steps:

1. Forward calculation:
1.1 Starting from the first layer, the intermediate variables and activation output values of each subsequent layer are calculated sequentially until the output layer:

Z^[l] = A^[l−1]W^[l] + b^[l],   A^[l] = g(Z^[l])   (with A^[0] = X)

1.2 Calculate the loss function value according to the chosen loss criterion:

L = L(A^(L), y)

2. Reverse derivation: calculate the gradient of the loss function with respect to the output-layer output, i.e. δ^[L] = ∂L/∂Z^[L], and then, from the output layer L all the way to layer 1, calculate the gradients of the loss function with respect to W, b, and the layer inputs, i.e. ∂L/∂W^[l], ∂L/∂b^[l], ∂L/∂A^[l−1], ∂L/∂Z^[l−1]:

∂L/∂W^[l] = (A^[l−1])ᵀ ∂L/∂Z^[l]
∂L/∂b^[l] = np.sum(∂L/∂Z^[l], axis=0, keepdims=True)
∂L/∂A^[l−1] = ∂L/∂Z^[l] (W^[l])ᵀ
∂L/∂Z^[l] = ∂L/∂A^[l] ⊙ g′(Z^[l])

3. Update the parameters along the negative gradient direction:

W^[l] = W^[l] − α ∂L/∂W^[l],   b^[l] = b^[l] − α ∂L/∂b^[l]
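As an illustration of step 2 (this is a sketch, not the book's own class, which follows below), the row-vector formulas can be written as one compact backward pass over a list of weight matrices, assuming ReLU hidden layers:

def backward_pass(As, Ws, dZ):
    # As = [A0(=X), A1, ..., A(L-1)]: the cached input of each layer from the
    # forward pass; Ws = [W1, ..., WL]; dZ = dL/dZ[L] from the loss function
    dWs, dbs = [], []
    for l in reversed(range(len(Ws))):
        dWs.insert(0, np.dot(As[l].T, dZ))
        dbs.insert(0, np.sum(dZ, axis=0, keepdims=True))
        if l > 0:
            dA = np.dot(dZ, Ws[l].T)
            dZ = dA * (As[l] > 0)   # element-wise g'(Z): for ReLU, A > 0 iff Z > 0
    return dWs, dbs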
class Layer:
    def __init__(self):
        pass
    def forward(self, x):
        raise NotImplementedError
On the basis of the Layer class, a derived class Dense can be defined to represent a fully connected layer. A fully connected layer is one in which each neuron accepts all the outputs of the previous layer as input. The parameters input_units, output_units, and activation of the constructor __init__() of the Dense class represent the size of the input, the size of the output, and the activation function, respectively. The forward calculation function forward() calculates the weighted sum Z from the input x, the weight W, and the bias b, and then feeds it to the activation function g to calculate the output value A. That is:

Z^[l] = A^[l−1]W^[l] + b^[l] (with A^[0] = X),   A^[l] = g(Z^[l])
Backward calculation (reverse derivation) accepts the gradient ∂L/∂A^[l] of the loss function with respect to the output value A^[l], and calculates the gradients of the loss function with respect to W, b, and x, i.e. ∂L/∂Z^[l], ∂L/∂W^[l], ∂L/∂b^[l], ∂L/∂A^[l−1]. Because partial derivative or gradient symbols cannot be typed in code, dA^[l], dZ^[l], dW^[l], db^[l] are used to represent ∂L/∂A^[l], ∂L/∂Z^[l], ∂L/∂W^[l], ∂L/∂b^[l]:

dW^[l] = (A^[l−1])ᵀ dZ^[l]
db^[l] = np.sum(dZ^[l], axis=0, keepdims=True)
dA^[l−1] = dZ^[l] (W^[l])ᵀ
class Dense(Layer):
    def __init__(self, input_dim, out_dim, activation=None):
        super().__init__()
        self.W = np.random.randn(input_dim, out_dim) * 0.01  #0.01 * np.random.randn
        self.b = np.zeros((1,out_dim))                       #np.zeros(out_dim)
        self.activation = activation
        self.A = None
    def g(self, z):
        if self.activation=='relu':
            return np.maximum(0, z)
        elif self.activation=='sigmoid':
            return 1 / (1 + np.exp(-z))
        else:
            return z
    def forward(self, x):
        # (restored: this method was lost in extraction) weighted sum, then activation
        self.x = x
        Z = np.dot(x, self.W) + self.b
        self.A = self.g(Z)
        return self.A
    def dZ_(self, dA_out):
        if self.activation=='relu':
            grad_g_z = 1. * (self.A > 0)  # should actually be 1. * (self.Z > 0), but both are equivalent
            return np.multiply(dA_out, grad_g_z)
        elif self.activation=='sigmoid':
            grad_g_z = self.A*(1-self.A)
            return np.multiply(dA_out, grad_g_z)
        else:
            return dA_out
    def backward(self, dA_out):
        # (restored: this method was lost in extraction) gradients of W, b and the input x
        dZ = self.dZ_(dA_out)
        self.dW = np.dot(self.x.T, dZ)
        self.db = np.sum(dZ, axis=0, keepdims=True)
        return np.dot(dZ, self.W.T)
You can test the forward() function of this neural network layer Dense:
import numpy as np
np.random.seed(1)
x = np.random.randn(3,48)   # 3 samples, each a vector of 48 features
dense = Dense(48,10,'none')
o = dense.forward(x)
print(o.shape)
print(o)
(3, 10)
[[-0.03953509 -0.00214997 0.00743433 -0.16926214 -0.05162853 0.06734225
-0.00221485 -0.11710758 -0.07046456 0.02609659]
[ 0.00848392 0.08259757 -0.09858177 0.0374092 -0.08303008 0.04151241
-0.01407859 -0.02415486 0.04236149 0.0648261 ]
[-0.13877363 -0.04122276 -0.00984716 -0.03461381 0.11513754 0.1043094
0.00170353 -0.00449278 -0.0057236 -0.01403174]]
The following code assumes that f is a function of a multivariate parameter p; that is, given p, the function value f(p) can be calculated. If the gradient ∂L/∂f of a loss function L with respect to f is known, then the gradient of the loss with respect to p is:

∂L/∂p = (∂L/∂f)(∂f/∂p) = df · (∂f/∂p)

If f contains multiple output values, that is, f(p) = (f_1(p), f_2(p), ⋯, f_n(p))ᵀ is a vector-valued function of the multivariate parameter p, and the gradient of the loss function L with respect to f is known, the gradient of L with respect to p can also be calculated according to the chain rule, i.e., the partial derivative with respect to each parameter p_j:

∂L/∂p_j = Σᵢ (∂L/∂f_i)(∂f_i/∂p_j) = Σᵢ df_i (∂f_i/∂p_j)

Each ∂f_i/∂p_j can be approximated with a numerical (central-difference) derivative, namely:

∂L/∂p_j = Σᵢ (∂L/∂f_i) (f_i(p_j+ϵ) − f_i(p_j−ϵ))/(2ϵ) = df · (f(p_j+ϵ) − f(p_j−ϵ))/(2ϵ)

Here, f is the forward() output of the network layer dense. If f = dense.forward(x) represents the calculation of this function, and this calculation depends on some parameter p, then the derivative of the loss function with respect to p can be implemented with the following function:
def numerical_gradient_from_df(f, p, df, h=1e-5):
    grad = np.zeros_like(p)
    it = np.nditer(p, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        oldval = p[idx]
        p[idx] = oldval + h
        pos = f()   # recall f() to calculate its output after the dependent parameter p[idx] changes
        p[idx] = oldval - h
        neg = f()   # and again after the change in the other direction
        p[idx] = oldval
        grad[idx] = np.sum((pos - neg) * df) / (2*h)  # (restored) chain rule: accumulate over all outputs
        it.iternext()                                 # (restored)
    return grad                                       # (restored)
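The helper diff_error used below does not appear in this excerpt; a plausible sketch of a relative-error measure (the exact definition is an assumption):

def diff_error(x, y):
    # maximum relative error between two arrays
    return np.max(np.abs(x - y) / np.maximum(1e-8, np.abs(x) + np.abs(y)))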
The following code first simulates the gradient df of a loss function with respect to the dense output, then calls dense.backward(df) to reverse-derive the gradients of the dense model parameters; the return value dx is the gradient of the dense input x. The numerical gradient function numerical_gradient_from_df above is then used to calculate the numerical gradient dx_num with respect to x, and the error between dx and dx_num is compared:
df = np.random.randn(3, 10)
dx = dense.backward(df)
dx_num = numerical_gradient_from_df(lambda :dense.forward(x),x,df)
print(diff_error(dx,dx_num))   # (restored print, matching the output below)
2.1851062625977136e-12
The error between the numerical gradient and the analytical gradient is very small, indicating that the analytical
gradient and numerical gradient calculated by backward() are almost the same. We can also compare whether the
gradient of the dense model parameters is consistent. The following code checks whether the gradient of the dense
model parameter W is consistent:
dW_num = numerical_gradient_from_df(lambda :dense.forward(x),dense.W,df)
print(diff_error(dense.dW,dW_num))
2.2715163083830703e-12
The numerical and analytical gradients of the model parameters are also very close. Therefore, it can be judged that the analytical-gradient code is basically correct.
class NeuralNetwork:
    # (the first half of this class was lost in extraction; __init__,
    # add_layer, forward, backward and predict are restored as a sketch
    # consistent with the surviving methods and with how nn is used below)
    def __init__(self):
        self._layers = []
    def add_layer(self, layer):
        self._layers.append(layer)
    def forward(self, X):
        for layer in self._layers:
            X = layer.forward(X)
        return X
    def backward(self, loss_grad, reg=0.):
        for i in reversed(range(len(self._layers))):
            loss_grad = self._layers[i].backward(loss_grad)
        for i in range(len(self._layers)):
            self._layers[i].dW += 2*reg * self._layers[i].W
    def predict(self, X):
        p = self.forward(X)   # multiple samples
        return np.argmax(p, axis=1)
    def reg_loss(self,reg):
        loss = 0
        for i in range(len(self._layers)):
            loss += reg*np.sum(self._layers[i].W*self._layers[i].W)
        return loss
    def update_parameters(self,learning_rate):
        for i in range(len(self._layers)):
            self._layers[i].W += -learning_rate * self._layers[i].dW
            self._layers[i].b += -learning_rate * self._layers[i].db
    def parameters(self):
        params = []
        for i in range(len(self._layers)):
            params.append(self._layers[i].W)
            params.append(self._layers[i].b)
        return params
    def grads(self):
        grads = []
        for i in range(len(self._layers)):
            grads.append(self._layers[i].dW)
            grads.append(self._layers[i].db)
        return grads
With the network layer class Layer and the neural network class NeuralNetwork, a 2-layer neural network model can be defined for practical problems such as classifying a set of points in the 2D plane:
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100, 'relu'))
nn.add_layer(Dense(100, 3, 'softmax'))
For multi-classification problems, the previous softmax_cross_entropy() and cross_entropy_grad() can be used to calculate the multi-class cross-entropy loss and its gradient with respect to the weighted sum:
X_temp = np.random.randn(2,2)
y_temp = np.random.randint(3, size=2)
F = nn.forward(X_temp)
loss = softmax_cross_entropy(F,y_temp)
loss_grad = cross_entropy_grad(F,y_temp)
print(loss,np.mean(loss_grad))
1.098695480580774 -9.25185853854297e-18
4.3.5 Gradient test of neural network
To ensure that the forward computation, loss function computation, and backward derivative of the neural network
are computed correctly, the numerical gradient can be compared with the analytical gradient.
import util
# Calculate the gradient of the model parameters according to the gradient loss_grad
# of the loss function on the output
nn.backward(loss_grad)
grads = nn.grads()
def loss_fun():
    F = nn.forward(X_temp)
    return softmax_cross_entropy(F,y_temp)
params = nn.parameters()
numerical_grads = util.numerical_gradient(loss_fun,params,1e-6)
for i in range(len(params)):
    print(numerical_grads[i].shape,grads[i].shape)
diff_errors(numerical_grads,grads)
The error between the numerical gradient and the analytical gradient is very small, indicating that the analytical
gradient is basically correct. Here is the code for the gradient descent algorithm:
def cross_entropy_grad_loss(F,y,softmax_out=False,onehot=False):
    if softmax_out:
        loss = cross_entropy_loss(F,y,onehot)
    else:
        loss = softmax_cross_entropy(F,y,onehot)
    loss_grad = cross_entropy_grad(F,y,onehot,softmax_out)
    return loss,loss_grad

def train(nn, X, y, loss_function, epochs=10000, learning_rate=1e-0, reg=1e-3, print_n=100):
    # (the loop header and forward/loss lines are restored around the
    # surviving lines, following the description below)
    for epoch in range(epochs):
        f = nn.forward(X)
        loss, loss_grad = loss_function(f, y)
        loss += nn.reg_loss(reg)
        nn.backward(loss_grad,reg)
        nn.update_parameters(learning_rate);
        if epoch % print_n == 0:
            print("iteration %d: loss %f" % (epoch, loss))
For the training samples (X, y), each iteration of the gradient descent method first computes the output f = nn.forward(X), then computes the loss and the gradient of the loss function with respect to the output, loss, loss_grad = loss_function(f, y), then uses reverse derivation starting from this gradient to compute the gradients of the model parameters, nn.backward(loss_grad, reg), and finally updates the model parameters with nn.update_parameters(learning_rate).
Use the above data training set to train the model and output the accuracy of the model prediction:
import data_set as ds
np.random.seed(89)
X,y = ds.gen_spiral_dataset()
epochs=10000
learning_rate=1e-0
reg = 1e-4
print_n = epochs//10
train(nn,X,y,loss_gradient_softmax_crossentropy,epochs,learning_rate,reg,print_n)
print(np.mean(nn.predict(X)==y))
The above train() function trains with all the samples of the training set at once. Usually the mini-batch gradient descent algorithm train_batch() is used instead: each iteration takes a part of the samples from the training set for training, using the iterator function data_iter defined in Section 2.6 (the code is in the python file data_set.py). Retrain with train_batch():
def data_iter(X,y,batch_size,shuffle=False):
    m = len(X)
    indices = list(range(m))
    if shuffle:  # shuffle is True to shuffle the order
        np.random.shuffle(indices)
    for i in range(0, m - batch_size + 1, batch_size):
        batch_indices = np.array(indices[i: min(i + batch_size, m)])
        yield X.take(batch_indices,axis=0), y.take(batch_indices,axis=0)

def train_batch(nn,XX,YY,loss_function,epochs=10000,batch_size=50,learning_rate=1e-0,reg=1e-3,print_n=10):
    iter = 0
    for epoch in range(epochs):
        for X,y in data_iter(XX,YY,batch_size,True):
            f = nn.forward(X)
            loss,loss_grad = loss_function(f,y)
            loss += nn.reg_loss(reg)
            nn.backward(loss_grad,reg)
            nn.update_parameters(learning_rate);
            if iter % print_n == 0:
                print("iteration %d: loss %f" % (iter, loss))
            iter += 1
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100, 'relu'))
nn.add_layer(Dense(100, 3))
epochs=1000
batch_size=50
learning_rate=1e-0
reg = 1e-4
print_n = epochs*len(X)//batch_size//10
train_batch(nn,X,y,cross_entropy_grad_loss,epochs,batch_size,learning_rate,reg,print_n)
print(np.mean(nn.predict(X)==y))
4.3.6 MNIST data handwritten digit recognition based on deep learning framework
Next, test the MNIST data set. First, download the MNIST data set, in which each digit image has already been converted into a 784-dimensional vector.
#%%time
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
    # Load the dataset
    urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
                               "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    # (restored: these lines were lost) each set is an (images, labels) pair
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
train_X, train_y = train_set
valid_X, valid_y = valid_set
print(train_X.dtype)
print(train_X.shape)
print(valid_X.shape)
float32
(50000, 784)
(10000, 784)
digit = train_set[0][9].reshape(28,28)
plt.imshow(digit,cmap='gray')
plt.colorbar()
plt.show()
The neural network model defined in Figure 4-30 is used as the classifier function for handwritten digit image recognition.
nn = NeuralNetwork()
nn.add_layer(Dense(784, 200, 'relu'))
nn.add_layer(Dense(200, 100, 'relu'))
nn.add_layer(Dense(100, 10))
epochs = 25
batch_size = 32
learning_rate = 0.1
reg = 1e-3
print_n = 25*len(train_X)//32//10
train_batch(nn,train_X,train_y,cross_entropy_grad_loss,epochs,batch_size,learning_rate,reg,print_n)
print(np.mean(nn.predict(valid_X)==valid_y))
print(nn.predict(valid_X[9:10]),valid_y[9])
4.3.7 Improved general neural network framework: separate weighted sum and
activation function
The Dense layer of the above neural network framework combines the weighted sum and the activation function: the Dense class contains the forward and reverse calculation of both. To increase flexibility, the weighted sum and the activation function can be decomposed into two classes, so that they are computed separately and new activation functions can easily be added later.
The Layer class adds a member variable params to save the parameters of the model, and adds a method reg_loss_grad() that adds the gradient of the regularization term of the loss function to the gradients of the model parameters.
The new Dense class only performs the weighted-sum calculation. Its constructor accepts a parameter describing how to randomly initialize the weight parameters, and initializes them according to the chosen initialization method. The data feature accepted by the Dense class is not necessarily a vector; it may also be a multi-channel two-dimensional image (for example, a color image contains red, green, and blue channels, each a two-dimensional array). Therefore, both the forward() and backward() methods first flatten the multi-channel input data of each sample into a one-dimensional vector.
class Layer:
    def __init__(self):
        self.params = None
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad):
        raise NotImplementedError
    def reg_grad(self, reg):
        pass
    def reg_loss(self, reg):
        return 0.
class Dense(Layer):
    # Z = XW+b
    def __init__(self, input_dim, out_dim, init_method=('random',0.01)):
        super().__init__()
        random_method_name, random_value = init_method
        if random_method_name == "random":
            self.W = np.random.randn(input_dim, out_dim) * random_value  #0.01 * np.random.randn
            self.b = np.random.randn(1,out_dim) * random_value
        elif random_method_name == "he":
            self.W = np.random.randn(input_dim, out_dim)*np.sqrt(2/input_dim)
            #self.b = np.random.randn(1,out_dim)* random_value
            self.b = np.zeros((1,out_dim))
        elif random_method_name == "xavier":
            self.W = np.random.randn(input_dim, out_dim)*np.sqrt(1/input_dim)
            self.b = np.random.randn(1,out_dim) * random_value
        elif random_method_name == "zeros":
            self.W = np.zeros((input_dim, out_dim))
            self.b = np.zeros((1,out_dim))
        else:
            self.W = np.random.randn(input_dim, out_dim) * random_value
            self.b = np.zeros((1,out_dim))
        self.params = [self.W, self.b]
        self.grads = [np.zeros_like(self.W), np.zeros_like(self.b)]
    def forward(self, x):
        # (restored: lost in extraction) flatten each sample to a row vector
        self.x = x
        return np.dot(x.reshape(x.shape[0], -1), self.W) + self.b
    def backward(self, grad_output):
        # (restored: lost in extraction) gradients of W, b and the input x
        x_flat = self.x.reshape(self.x.shape[0], -1)
        self.grads[0][...] = np.dot(x_flat.T, grad_output)
        self.grads[1][...] = np.sum(grad_output, axis=0, keepdims=True)
        dx = np.dot(grad_output, self.W.T).reshape(self.x.shape)
        return dx
    def reg_grad(self, reg):
        # (restored) add the gradient of the L2 regularization term
        self.grads[0] += 2*reg * self.W
    def reg_loss(self, reg):
        return reg*np.sum(self.W**2)
    def reg_loss_grad(self, reg):
        self.grads[0] += 2*reg * self.W
        return reg*np.sum(self.W**2)
If x is 3 samples, each sample having 3 channels and each channel being a 4×4 image, the following code performs the forward calculation with these 3 samples as input:
import numpy as np
np.random.seed(1)
x = np.random.randn(3,3,4, 4) #3 samples, 3 channels, each channel is a 4x4 image
dense = Dense(3*4*4,10,('no',0.01))
o = dense.forward(x)
print(o.shape)
print(o)
(3, 10)
[[-0.03953509 -0.00214997 0.00743433 -0.16926214 -0.05162853 0.06734225
-0.00221485 -0.11710758 -0.07046456 0.02609659]
[ 0.00848392 0.08259757 -0.09858177 0.0374092 -0.08303008 0.04151241
-0.01407859 -0.02415486 0.04236149 0.0648261 ]
[-0.13877363 -0.04122276 -0.00984716 -0.03461381 0.11513754 0.1043094
0.00170353 -0.00449278 -0.0057236 -0.01403174]]
Gradient Validation
As before, to verify whether the reverse derivation of dense is correct, you can simulate the gradient do of a loss function with respect to the dense output vector, then use dense.backward() to perform the reverse derivation, and compare the result against the gradient computed by the numerical gradient function numerical_gradient_from_df. Because the size of the dense output vector is 10, the following code passes the input x of 3 samples through this dense layer to produce an output o of shape 3×10; the simulated gradient do is therefore a multidimensional array of the same shape.
If the gradient of the loss function with respect to the output vector is known, the gradients of the model parameters and of intermediate variables such as x can be computed in reverse from this gradient. backward() returns the gradient dx of the dense-layer input x, which is compared against the numerical gradient dx_num. Similarly, for the weight parameter dense.params[0] of the model, the error between the analytical gradient dense.grads[0] and the numerical gradient dW_num is compared:
do = np.random.randn(3, 10)
dx = dense.backward(do)
dx_num = numerical_gradient_from_df(lambda :dense.forward(x),x,do)
print(diff_error(dx,dx_num))                 # (restored prints, matching the
dW_num = numerical_gradient_from_df(lambda :dense.forward(x),dense.params[0],do)
print(diff_error(dense.grads[0],dW_num))     #  outputs below)
print(dense.grads[0][:3])                    # (restored; the two arrays agree)
print(dW_num[:3])
3.638244314951079e-09
1.3450414982951384e-11
[[ 1.77463167 0.11663492 1.87794917 0.27986781 1.27243915 -2.44375556
-2.1266117 0.99629747 -0.73720237 -0.68570287]
[-0.69807196 0.22547472 -0.93721649 0.3286185 -1.0421723 0.66487528
1.33111205 0.25677848 -0.58451408 0.71015412]
[ 0.12251147 -0.4041516 0.57764614 0.89962639 -0.35195022 0.77829011
-0.01618803 -0.62209694 -1.28543176 -0.37554316]]
[[ 1.77463167 0.11663492 1.87794917 0.27986781 1.27243915 -2.44375556
-2.1266117 0.99629747 -0.73720237 -0.68570287]
[-0.69807196 0.22547472 -0.93721649 0.3286185 -1.0421723 0.66487528
1.33111205 0.25677848 -0.58451408 0.71015412]
[ 0.12251147 -0.4041516 0.57764614 0.89962639 -0.35195022 0.77829011
-0.01618803 -0.62209694 -1.28543176 -0.37554316]]
You can also connect a loss function to the Dense layer to compare the analytical gradient and numerical gradient
of the loss function with respect to the parameters of the Dense model:
import util
x = np.random.randn(3,3,4, 4)
y = np.random.randn(3,10)
dense = Dense(3*4*4,10,('no',0.01))
f = dense.forward(x)
loss,do = mse_loss_grad(f,y)
dx = dense.backward(do)
def loss_f():
    f = dense.forward(x)
    loss = mse_loss(f,y)
    return loss
dW_num = util.numerical_gradient(loss_f,dense.params[0],1e-6)
print(diff_error(dense.grads[0],dW_num))
print(dense.grads[0][:2])
print(dW_num[:2])
2.0148860313259954e-07
[[ 0.47568681 -0.06324119 -0.29294422 -0.76304343 -0.09660146 0.62794569
1.16087896 0.06261028 -0.6611078 -0.02940735]
[-0.10777785 -1.47174583 0.63258553 1.22381944 -0.35702633 0.4409597
-2.42444873 -0.28804741 -1.33377026 0.66775208]]
[array([ 0.47568681, -0.06324119, -0.29294422, -0.76304343, -0.09660146,
         0.62794569,  1.16087896,  0.06261028, -0.6611078 , -0.02940735]),
 array([-0.10777785, -1.47174583,  0.63258553,  1.22381944, -0.35702633,
         0.4409597 , -2.42444873, -0.28804741, -1.33377026,  0.66775208])]
Because the Dense layer only calculates the weighted sum, it no longer needs to compute the value or the derivative of an activation function, and it becomes very simple. Each activation function can be implemented individually as an activation-function layer class. The following code defines the activation layers corresponding to the most commonly used activation functions in neural networks:
class Relu(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, x):
        self.x = x
        return np.maximum(0, x)
    def backward(self, grad_output):
        # If x>0, the derivative is 1, otherwise 0
        x = self.x
        relu_grad = x > 0
        return grad_output * relu_grad

class Sigmoid(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, x):
        self.x = x
        return 1.0/(1.0 + np.exp(-x))
    def backward(self, grad_output):
        x = self.x
        a = 1.0/(1.0 + np.exp(-x))
        return grad_output * a*(1-a)

class Tanh(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, x):
        self.x = x
        self.a = np.tanh(x)
        return self.a
    def backward(self, grad_output):
        d = (1-np.square(self.a))
        return grad_output * d

class Leaky_relu(Layer):
    def __init__(self,leaky_slope):
        super().__init__()
        self.leaky_slope = leaky_slope
    def forward(self, x):
        self.x = x
        return np.maximum(self.leaky_slope*x,x)
    def backward(self, grad_output):
        x = self.x
        d = np.zeros_like(x)
        d[x<=0] = self.leaky_slope
        d[x>0] = 1
        return grad_output * d
An activation layer has no model parameters; it simply transforms the input x to produce an output, and the input and output tensors have the same shape. Likewise, numerical gradients can be used to check that the analytical gradients of the activation layers are correct. The following code checks the error between the analytical and numerical gradient of each activation layer above, using a simulated gradient do of the loss function with respect to the activation layer's output:
import numpy as np
np.random.seed(1)
x = np.random.randn(3,3,4, 4)
do = np.random.randn(3,3,4, 4)
relu = Relu()
relu.forward(x)
dx = relu.backward(do)
dx_num = numerical_gradient_from_df(lambda :relu.forward(x),x,do)
print(diff_error(dx,dx_num))
leaky_relu = Leaky_relu(0.1)
leaky_relu.forward(x)
dx = leaky_relu.backward(do)
dx_num = numerical_gradient_from_df(lambda :leaky_relu.forward(x),x,do)
print(diff_error(dx,dx_num))
tanh = Tanh()
tanh.forward(x)
dx = tanh.backward(do)
dx_num = numerical_gradient_from_df(lambda :tanh.forward(x),x,do)
print(diff_error(dx,dx_num))
sigmoid = Sigmoid()
sigmoid.forward(x)
dx = sigmoid.backward(do)
dx_num = numerical_gradient_from_df(lambda :sigmoid.forward(x),x,do)
print(diff_error(dx,dx_num))
3.2756345281587516e-12
7.43892997215858e-12
5.170019175240593e-11
3.282573028416693e-11
Since the analytical and numerical gradients of these activation layers agree to within tiny errors, one can be fairly confident that the analytical gradient code is correct.
Based on the dense layer and each activation layer, a class NeuralNetwork representing a neural network can be
defined:
class NeuralNetwork:
    def __init__(self):
        self._layers = []
        self._params = []
    def reg_loss(self,reg):
        reg_loss = 0
        for i in range(len(self._layers)):
            reg_loss += self._layers[i].reg_loss(reg)
        return reg_loss
    def parameters(self):
        return self._params
    def zero_grad(self):
        for i,_ in enumerate(self._params):
            self._params[i][1][:] = 0  # each element of _params is [w, dw]
    def get_parameters(self):
        return self._params
add_layer() in this class adds a layer to the neural network; forward() accepts input data and produces the corresponding output; __call__() makes the object callable, so that for a NeuralNetwork object nn and input X, nn(X) is equivalent to nn.forward(X); backward() accepts the gradient of the loss function with respect to the network output and performs the backward computation to obtain the gradients of the loss with respect to the model parameters and intermediate variables.
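Since these methods are not shown in the listing above, here is a minimal sketch of how they might look; it is an assumption consistent with the usage in this chapter (each layer exposes params and grads lists, and parameterized layers accept a regularization coefficient in backward()), not the book's exact code:

    def add_layer(self, layer):
        self._layers.append(layer)
        if layer.params:  # collect [parameter, gradient] pairs
            for p, g in zip(layer.params, layer.grads):
                self._params.append([p, g])
    def forward(self, X):
        for layer in self._layers:
            X = layer.forward(X)
        return X
    def __call__(self, X):
        return self.forward(X)
    def backward(self, grad, reg=None):
        for layer in reversed(self._layers):
            if layer.params and reg is not None:
                grad = layer.backward(grad, reg)
            else:
                grad = layer.backward(grad)
        return grad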
To make sure forward() and backward() are correct, they can again be checked against numerical gradients. The following code defines a simple neural network and, with a set of randomly generated samples (x, y), computes and compares the analytical gradients calculated by backward() with the numerical gradients obtained from the general numerical gradient function of Section 1.4, to see whether the two calculations agree.
import util
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100,('no',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 3,('no',0.01)))
x = np.random.randn(5,2)
y = np.random.randint(3, size=5)
f = nn.forward(x)
dZ = cross_entropy_grad(f,y)  # util.grad_softmax_cross_entropy(f,y)
nn.zero_grad()  # reset gradients to zero
reg = 0.1
dx = nn.backward(dZ,reg)
def loss_fn():
    f = nn.forward(x)
    loss = softmax_cross_entropy(f,y)  # util.softmax_cross_entropy(f,y)
    return loss+nn.reg_loss(reg)
params = nn.parameters()
numerical_grads = util.numerical_gradient(loss_fn,params,1e-6)
for i in range(len(numerical_grads)):
    print(diff_error(params[i][1],numerical_grads[i]))
1.892395698905401e-06
1.7651393552515298e-06
2.306498772862026e-06
2.3545204992835373e-10
It can be seen that the numerical and analytical gradients are very close, so it can be tentatively concluded that there is nothing wrong with the model's forward() and backward().
class SGD():
    def __init__(self,model_params,learning_rate=0.01,momentum=0.9):
        self.params,self.lr,self.momentum = model_params,learning_rate,momentum
        self.vs = [np.zeros_like(p) for p,grad in self.params]
    def zero_grad(self):
        for i,_ in enumerate(self.params):
            self.params[i][1].fill(0)
    def step(self):
        # momentum update: v = momentum*v - lr*grad; p += v
        for i,_ in enumerate(self.params):
            p,grad = self.params[i]
            self.vs[i] = self.momentum*self.vs[i] - self.lr*grad
            p += self.vs[i]
    def scale_learning_rate(self,scale):
        self.lr *= scale
The model_params parameter of the SGD optimizer's constructor is a Python list; each element is itself a list containing a model parameter and its gradient. If a model has 2 parameters W and b with corresponding gradients dW and db, then model_params has the form:
[[W,dW],[b,db]]
The other two constructor parameters are the learning rate learning_rate of the gradient descent algorithm and the momentum parameter of the momentum optimization strategy. Setting momentum to 0 is equivalent to the most basic gradient update without momentum.
The zero_grad() method of SGD resets the gradients of all parameters to 0, and step() updates the model parameters according to the gradients and the optimization strategy. Because gradient descent sometimes needs to adjust the learning rate during iteration, scale_learning_rate() is provided to rescale it.
The gradient descent method can update the model parameters by defining an optimizer object optimizer of the
SGD class:
learning_rate = 1e-1
momentum = 0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
Similarly, other optimizer classes can also be defined, such as the following Adam optimizer:
class Adam():
    def __init__(self,model_params,learning_rate=0.01,beta_1=0.9,beta_2=0.999,epsilon=1e-8):
        self.params,self.lr = model_params,learning_rate
        self.beta_1,self.beta_2,self.epsilon = beta_1,beta_2,epsilon
        self.ms = []
        self.vs = []
        self.t = 0
        for p,grad in self.params:
            self.ms.append(np.zeros_like(p))
            self.vs.append(np.zeros_like(p))
    def zero_grad(self):
        for i,_ in enumerate(self.params):
            self.params[i][1].fill(0)
    def step(self):
        beta_1,beta_2,lr = self.beta_1,self.beta_2,self.lr
        self.t += 1
        t = self.t
        for i,_ in enumerate(self.params):
            p,grad = self.params[i]
            self.ms[i] = beta_1*self.ms[i]+(1-beta_1)*grad
            self.vs[i] = beta_2*self.vs[i]+(1-beta_2)*grad**2
            # bias-corrected moment estimates and parameter update
            m_hat = self.ms[i]/(1-beta_1**t)
            v_hat = self.vs[i]/(1-beta_2**t)
            p -= lr*m_hat/(np.sqrt(v_hat)+self.epsilon)
    def scale_learning_rate(self,scale):
        self.lr *= scale
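As a usage sketch (the constructor defaults above are assumed), an Adam optimizer is created the same way as SGD:

optimizer = Adam(nn.parameters(), learning_rate=0.001)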
More optimizers and the following training function train_nn are included in train.py.
The following training function train_nn() accepts a data iterator and takes a batch of training samples (input, target) from it each time. For each batch, it first runs forward() to compute the output, then uses the loss function to compute both the loss and the gradient loss_grad of the loss with respect to the output; loss_grad is propagated backward through backward() to obtain the gradients of the model parameters and intermediate variables, and finally the optimizer's step() updates the model parameters.
def train_nn(nn,X,y,optimizer,loss_fn,epochs=100,batch_size=50,reg=1e-3,print_n=10):
    iter = 0
    losses = []
    for epoch in range(epochs):
        for X_batch,y_batch in data_iter(X,y,batch_size):
            optimizer.zero_grad()
            f = nn(X_batch)  # nn.forward(X_batch)
            loss,loss_grad = loss_fn(f,y_batch)
            nn.backward(loss_grad,reg)
            loss += nn.reg_loss(reg)
            optimizer.step()
            losses.append(loss)
            if iter%print_n==0:
                print(iter,"iter:",loss)
            iter += 1
    return losses
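The helper data_iter() used above is assumed to yield shuffled mini-batches; its definition is not shown here, so the following is a minimal sketch:

def data_iter(X, y, batch_size):
    # yield (X_batch, y_batch) pairs over a random permutation of the samples
    idx = np.random.permutation(len(X))
    for i in range(0, len(X), batch_size):
        j = idx[i:i + batch_size]
        yield X[j], y[j]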
Now this neural network can be used to train the earlier 3-class classification problem:
import data_set as ds
import util
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100,('no',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 3,('no',0.01)))
X,y = ds.gen_spiral_dataset()
epochs=5000
batch_size = len(X)
reg = 0.5e-3
print_n=480
learning_rate = 1e-1
momentum = 0.5
optimizer = SGD(nn.parameters(),learning_rate,momentum)
losses = train_nn(nn,X,y,optimizer,cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
0 iter: 1.0985916677722303
480 iter: 0.7056240023920841
960 iter: 0.6422407772314334
1440 iter: 0.5246104670488081
1920 iter: 0.4186441561530432
2400 iter: 0.37118840941018727
2880 iter: 0.34583485668931857
3360 iter: 0.32954842747580104
3840 iter: 0.31961537369884196
4320 iter: 0.3124394704919282
4800 iter: 0.30620107113884415
The complete code of the improved neural network framework is in the file NeuralNetwork.py. It can be used, for example, to train a classifier on the Fashion-MNIST dataset, which the following code reads:
import mnist_reader
X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')
print(X_train.shape,y_train.shape)
print(X_train.dtype,y_train.dtype)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
trainX = X_train.reshape(-1,28,28)
print(trainX.shape)
# plot the first few images
for i in range(9):
    # define subplot
    plt.subplot(330 + 1 + i)
    # plot raw pixel data
    plt.imshow(trainX[i], cmap=plt.get_cmap('gray'))
# show the figure
plt.show()
Looking at the data values, you can see that the original values are integers from 0-255, which can be converted into real numbers between 0 and 1 by dividing by 255:
train_X = trainX.astype('float32')/255.0
print(np.mean(trainX),np.mean(train_X))
72.94035223214286 0.2860402
nn = NeuralNetwork()
nn.add_layer(Dense(784, 500))
nn.add_layer(Relu())
nn.add_layer(Dense(500, 200))
nn.add_layer(Relu())
nn.add_layer(Dense(200, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
Start training:
epochs=8
batch_size = 64
reg = 0  #1e-3
print_n=1000
optimizer = SGD(nn.parameters(),learning_rate,momentum)  # assumed: the optimizer creation appears to have been lost in extraction
losses = train_nn(nn,train_X,y_train,optimizer,cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
plt.plot(losses)
0 iter: 2.3016755298047347
1000 iter: 1.1510374540057933
2000 iter: 0.47471113470221005
3000 iter: 0.5333139450988945
4000 iter: 0.259167391843765
5000 iter: 0.3629363583454308
6000 iter: 0.3486191552507917
7000 iter: 0.4914253677369693
0.87965
0.8585
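The two numbers above are presumably the training and test accuracies; the code that printed them was lost in extraction. A minimal sketch of how they might be computed, assuming the network outputs one score per class and that the Dense layer flattens image-shaped input as shown earlier:

def accuracy(nn, X, y):
    # fraction of samples whose highest-scoring class matches the label
    f = nn(X)
    return np.mean(np.argmax(f, axis=1) == y)

print(accuracy(nn, train_X, y_train))
print(accuracy(nn, X_test.astype('float32')/255.0, y_test))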
To save and load a trained model, save_parameters() and load_parameters() methods can be added to the NeuralNetwork class:
    def reg_loss(self,reg):
        reg_loss = 0
        for i in range(len(self._layers)):
            reg_loss += self._layers[i].reg_loss(reg)
        return reg_loss
    def parameters(self):
        return self._params
    def zero_grad(self):
        for i,_ in enumerate(self._params):
            self._params[i][1][:] = 0
    def get_parameters(self):
        return self._params
    def save_parameters(self,filename):
        params = {}
        for i in range(len(self._layers)):
            if self._layers[i].params:
                params[i] = self._layers[i].params
        np.save(filename, params)
    def load_parameters(self,filename):
        params = np.load(filename,allow_pickle = True)
        count = 0
        for i in range(len(self._layers)):
            if self._layers[i].params:
                layer_params = params.item().get(i)
                self._layers[i].params = layer_params
                for j in range(len(layer_params)):
                    self._params[count][0] = layer_params[j]
                    count += 1
The following code tests the model's parameter saving and loading functions:
from NeuralNetwork import *
nn = NeuralNetwork()
nn.add_layer(Dense(3, 2,('xavier',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(2, 4,('xavier',0.01)))
nn.add_layer(Relu())
def print_nn_parameters(params,print_grad=False):
    for p,grad in params:
        print("p",p)
        if print_grad:
            print("grad",grad)
        print()
print_nn_parameters(nn.get_parameters())
nn.save_parameters('model_params.npy')
nn.load_parameters('model_params.npy')
print_nn_parameters(nn.get_parameters())
p [[ 0.0027318   0.00063939]
 [-0.00144845  0.00138133]
 [-0.01521812  0.0023785 ]]
p [[0. 0.]]
p [[0. 0. 0. 0.]]
p [[ 0.0027318   0.00063939]
 [-0.00144845  0.00138133]
 [-0.01521812  0.0023785 ]]
p [[0. 0.]]
p [[0. 0. 0. 0.]]
Chapter 5 Basic Techniques for Improving Neural Network
Performance
Data and algorithms are the two core elements of machine learning; both high-quality data and good algorithms can improve its performance. Acquiring high-quality data comes at a cost, so knowing how to improve the quality and quantity of data through processing and augmentation of existing data, and how to improve the performance of a model and its training algorithm through various practical techniques, are essential skills for deep learning practice.
The following code can read an image using the io module of skimage:
import numpy as np
import matplotlib.pyplot as plt
from skimage import io, transform
image = io.imread('cat.png')
print(image.shape)
plt.imshow(image)
plt.show()
(403, 544, 3)
Figure 5-1 Original image
The height and width of the image are 403 and 544 respectively, and a color image consists of three channel images: red (R), green (G), and blue (B); that is, the pixel at each position is composed of a red, a green, and a blue component.
Various transformations can be performed on the image through numpy array operations, such as image[:,::-1,
:], which can perform horizontal mirror flip on the image:
img = image[:,::-1, :]
plt.imshow(img)
plt.show()
from skimage import color
# assumed: the definition of convert() was lost in extraction; it appears to convert RGB to YUV
def convert(rgbimg):
    yuvimg = color.rgb2yuv(rgbimg)*255
    return yuvimg.astype(np.uint8)
img = convert(image)
plt.imshow(img)
plt.show()
np.invert() inverts the colors of the image:
img = np.invert(image)
plt.imshow(img)
plt.show()
Different modules of skimage such as util, transform, etc. provide functions for different transformations of
images, such as adding noise to images with the random_noise() function of util:
from skimage import util
img = util.random_noise(image)
plt.imshow(img)
plt.show()
Convert a color image with multiple color channels to a single-channel grayscale image (black and white image):
from skimage import color
img = color.rgb2gray(image)
print(img.shape)
plt.imshow(img,cmap='gray')
plt.show()
(403, 544)
Many other Python packages can also be used to process image data; for example, scipy's image processing module ndimage can blur an image:
from scipy import ndimage
img = ndimage.uniform_filter(image, size=(11, 11, 1))
plt.imshow(img)
plt.show()
Like image data, other data such as text and audio can be augmented in various ways, and public data-processing packages can be used to make this efficient. Data augmentation can multiply the total amount of data, which helps reduce overfitting; although the augmented data is correlated with the original data, it avoids the cost of acquiring brand-new data.
5.1.2 Normalization
As with simple regression, data with very large absolute values can cause numerical overflow in the network, make gradient descent very slow, and give differently scaled features different influence on the algorithm ("feature bias"), all of which make training hard to converge. Therefore, data that is not normalized should be normalized before training the neural network. Usually each feature is normalized separately: for each feature $x_i$, compute the mean $x_i\_mean$ and standard deviation $x_i\_std$ of that feature over the training set, and then normalize the feature of all samples into a small range around 0 such as [0, 1], [−0.5, 0.5] or [−1, 1], usually with the formula:

$$\frac{x_i - x_i\_mean}{x_i\_std}$$

This normalization can be implemented with the following Python code:
X -= np.mean(X, axis = 0)
X /= np.std(X, axis = 0)
If all features take values in roughly the same range, they can also be standardized together: a single mean x_mean and standard deviation x_std is computed from all features of all training data, and every feature is then normalized with this shared mean and standard deviation. For example, for an image whose pixel color values are one-byte non-negative integers in the range [0, 255], the image can simply be divided by 255 to map these values into [0, 1], without computing a mean and variance for each feature.
Note that the samples in the validation and test sets must not be normalized with their own statistics; otherwise they would not share the same normalization standard as the training set, and predictions made on them by the trained model would be meaningless. That is, when predicting on validation or test samples, they are normalized with the same normalization parameters (mean and standard deviation) as the training set.
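As a short illustration (a sketch; X_train and X_test are assumed here to be feature matrices with one sample per row), the training-set statistics are reused for the test set:

mu = np.mean(X_train, axis=0)
sigma = np.std(X_train, axis=0)
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma  # same mu and sigma as the training set, not recomputed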
Feature engineering means discovering and extracting, from raw data, good features that help machine learning. It is one of the most fundamental problems in traditional machine learning, and different fields often use hand-crafted features specific to the field. Feature engineering includes many concrete techniques such as data preprocessing (e.g. normalization), dimensionality reduction, feature selection, manual feature design, and feature learning.
Principal Component Analysis (PCA) is a classic dimensionality-reduction technique in machine learning. It represents the data as a linear combination of principal components, which removes the correlation between data features, and then uses a small number of principal components to represent the original data, thereby reducing the dimensionality of the data, i.e. the number of features. For example, a 256*256-pixel color face image requires 256*256*3 = 196608 values, i.e. its dimension is 196608, yet a face image expressed as a linear combination of 23 principal components can retain about 97% of the information of the original image, so a face then needs only 23 values.
For the data points of the following two-dimensional plane, each data point is represented by 2 coordinates:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Generate random sample points near the line y=x+2
np.random.seed(1)
pts = 25
x = np.random.randn(pts,1)  # Randomly sample some x coordinates
y = x+2
y = y + np.random.randn(pts,1)*0.2  # Add random noise to y
plt.plot(x,y,'o')
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.show()
Figure 5-13 Randomly sampled two-dimensional data points near the line y=x+2
Taking each point's coordinates as a row, all point coordinates can be placed in a matrix X; the following code builds X and displays the first 3 points:
X = np.stack((x.flatten(), y.flatten()), axis=-1)
print(X.shape)
print(X[:3])
(25, 2)
[[ 1.62434536 3.48759979]
[-0.61175641 1.36366554]
[-0.52817175 1.28467436]]
PCA can be used to reduce the dimension so that each point is represented by a single value instead of 2. The first step of PCA is to center each dimension (axis), that is, to subtract from each dimension's components the mean of all components of that dimension:
X -= np.mean(X, axis = 0)
print(X[:3])
plt.plot(X[:,0],X[:,1],'o')
plt.axis('equal')
plt.show()
[[ 1.63707525 1.50798964]
[-0.59902653 -0.61594461]
[-0.51544186 -0.69493579]]
Figure 5-14 Data centering: each dimension (feature) of the data minus the mean of all components of that dimension, so that each feature is centered at 0
Suppose a matrix A consists of a set of three-dimensional coordinate points, with each row holding the three coordinates of one point:

$$A = \begin{pmatrix} 1 & 3 & 2 \\ -4 & 2 & 6 \\ 2 & 6 & 4 \\ -3 & 0 & 1 \end{pmatrix}$$

That is, A has 4 samples, each with 3 features (the x, y, z coordinates). Are the features of these samples correlated? The matrix $A^T A$ (the covariance matrix, up to a scale factor) measures the degree of correlation between the features:

$$A^T A = \begin{pmatrix} 1 & -4 & 2 & -3 \\ 3 & 2 & 6 & 0 \\ 2 & 6 & 4 & 1 \end{pmatrix} \begin{pmatrix} 1 & 3 & 2 \\ -4 & 2 & 6 \\ 2 & 6 & 4 \\ -3 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 30 & 7 & -17 \\ 7 & 49 & 42 \\ -17 & 42 & 57 \end{pmatrix}$$
A = np.array([[1,3,2],[-4,2,6],[2,6,4],[-3,0,1]])
print(A)
print("A^TA:\n",np.dot(A.transpose(),A))
[[ 1  3  2]
 [-4  2  6]
 [ 2  6  4]
 [-3  0  1]]
A^TA:
[[ 30 7 -17]
[ 7 49 42]
[-17 42 57]]
From the entries of this covariance matrix, the correlation value between x and y is 7 while that between y and z is 42, indicating that y and z are relatively strongly correlated and x and y only weakly. Usually the covariance matrix is divided by the number of samples to remove the influence of the sample count on its values. For the sample matrix X above, its covariance matrix is computed as follows:
cov = np.dot(X.T, X) / X.shape[0] # covariance matrix
Applying SVD to the covariance matrix yields the principal components (eigenvectors) U and the singular values S; for a covariance matrix the singular values equal the variances (eigenvalues) along the principal directions and indicate the degree of spread:
U,S,V = np.linalg.svd(cov)
print(U)
print(S)
print(S[0]/(S[0]+S[1]))
[[-0.68302064 -0.73039907]
[-0.73039907 0.68302064]]
[2.46815362 0.01168714]
0.995287139793862
Each column of U is a principal component, representing a main direction of variation of the data (a principal axis direction), as shown in Figure 5-15:
plt.plot(X[:,0],X[:,1],'o')
plt.plot([0,U[0,0]], [0,U[1,0]])
plt.plot([0,U[0,1]], [0,U[1,1]])
plt.axis('equal')
plt.show()
Figure 5-15 SVD decomposition of the covariance matrix yields the principal directions (eigenvectors) and the variances along them (eigenvalues)
S[0] and S[1] indicate how much of the data's variation lies along each principal direction; clearly the first principal component accounts for the larger share. By projecting the data onto the axes defined by the principal components U, the data can be expressed in terms of principal-component coordinates.
Xrot = np.dot(X, U)
print(Xrot[:5])
[[-2.21959042 -0.16573019]
[ 0.85903285 0.01682553]
[ 0.85963789 -0.09817723]
[ 1.53210054 0.01886974]
[-1.32424593 0.03607588]]
Display the points given by these principal-component coordinates on the principal axes:
plt.plot(Xrot[:,0],Xrot[:,1],'o')
plt.axis('equal')
plt.show()
Figure 5-16 Rotating the data aligns the principal components with the coordinate axes
Converting the data into its principal-component representation removes the correlation between the new features; its covariance matrix shows that the off-diagonal values, which relate different features, become 0.
print(np.dot(Xrot.transpose(),Xrot))
[[6.17038405e+01 9.38138456e-15]
[9.38138456e-15 2.92178571e-01]]
Using only the coordinate of the first principal component to represent these samples, the information loss is about (1 − 0.995287139793862)*100% = 0.472%, which is negligible. Representing data samples as a linear combination of a few principal components is called dimensionality reduction; for this example, the dimension of the sample data can be reduced from 2 features to 1, achieving the goal of reducing the number of sample features.
Xrot_reduced = np.dot(X, U[:,:1])  # Xrot_reduced becomes an [N x 1] array
print(Xrot_reduced[:3])
plt.plot(Xrot_reduced[:],[0]*pts,'o')
plt.axis('equal')
plt.show()
[[-2.21959042]
 [ 0.85903285]
 [ 0.85963789]]
Figure 5-17 Data dimensionality reduction: representing data samples as a linear combination of a few principal components
The projected and dimensionally reduced data can be back-projected onto the main axis of the original data.
X_temp = np.c_[Xrot_reduced, np.zeros(pts) ]
reProjX = np.dot(X_temp, U.transpose())
plt.plot(reProjX[:,0],reProjX[:,1],'o')
plt.axis('equal')
plt.show()
Figure 5-18 The projected and dimensionally reduced data is back-projected onto the main axis of the original
data, retaining the main characteristics of the original data
It can be seen that the data after dimensionality reduction still retains the main information of the original data.
2 Whitening
Data samples may have multiple features whose variances differ greatly, that is, different features are spread to different degrees, so they influence a machine learning algorithm differently. Features are also often correlated, and correlated features pull the learning algorithm in several directions at once, like a person being tugged different ways who no longer knows where to go. The whitening operation reduces the correlation between sample features and gives the features the same variance: PCA projection removes the correlation between features, and dividing each feature by its standard deviation equalizes the variances. Whitening therefore usually combines the two techniques, first performing the PCA projection and then dividing each feature by its standard deviation; this combination is called PCA whitening.
Like normalization, whitening can improve the performance of machine learning algorithms. The earlier code already projected the original data X to obtain Xrot, whose features are mutually independent, so the whitening is completed by the following division by the standard deviation. Since the original data is only 2-dimensional, the data is not dimensionally reduced here, to make the effect of whitening visible.
Xwhite = Xrot / np.sqrt(S + 1e-5)  # Whitening: divide the data features by the
                                   # standard deviation, so all features have similar variances
plt.plot(Xwhite[:,0],Xwhite[:,1],'o')
plt.axis('equal')
plt.show()
Figure 5-19 Results of the whitening operation
After the whitening operation, the components of the two main axes have the same variance. You can add data
points to further observe the effect of the whitening operation, as shown in the following code:
pts = 1000
x = np.random.randn(pts,1) # Randomly sample some x-coordinates
y = x+2+ np.random.randn(pts,1)*0.2
X = np.stack((x.flatten(), y.flatten()), axis=-1)
fig = plt.gcf()
fig.set_size_inches(12, 4, forward=True)
plt.subplot(1,2,1)
plt.plot(X[:,0],X[:,1],'o')
plt.axis('equal')
X -= np.mean(X, axis = 0)
cov = np.dot(X.T, X) / X.shape[0]
U,S,V = np.linalg.svd(cov)
Xrot = np.dot(X, U)
Xwhite = Xrot / np.sqrt(S + 1e-5)
reProjX = np.dot(Xwhite, U.transpose())
plt.subplot(1,2,2)
plt.plot(reProjX[:,0],reProjX[:,1],'o')
plt.axis('equal')
plt.show()
Whitening gives all sample features the same variance, so that machine learning is not biased toward a particular feature because of variance differences, which can improve the performance of machine learning algorithms.
Figure 5-21 The neural network degenerates into a linear sequence with only one neuron in each layer
Take the 2-layer neural network shown in Figure 5-22 as an example. Because the initial weights are all 0, the input weights of all neurons in the hidden layer and the output layer are 0; assume the neurons in the same layer use the same activation function:

$$a^{[2]} = g(a^{[1]} W^{[2]} + b^{[2]})$$

$$\frac{\partial L}{\partial W^{[2]}} = {A^{[1]}}^T \frac{\partial L}{\partial z^{[2]}} = {A^{[1]}}^T \frac{\partial L}{\partial a^{[2]}}\, g'(z^{[2]})$$
Thus the gradient dW of every neuron's parameters in the same layer is the same. When the parameters are updated with W = W − lr ∗ dW, every parameter W also remains exactly the same; on the next iteration the outputs of all neurons in the same layer are again identical, so the backward computation again produces identical gradients dW. No matter how many iterations are run, all neurons in the same layer keep the same weight parameters, i.e. they represent the same function: they are symmetric. Obviously the expressive power of such a network is very limited, and having multiple neurons per layer becomes meaningless. A neural network should break this symmetry so that each neuron extracts different features from the input; the solution is to give the weights random initial values. The backward formulas above also show that the bias parameter b has no effect on the gradients of the model parameters or of the input, so it is usually enough to randomly initialize only the weights and set the biases to 0. A simple neural network like the one above can initialize its model parameters as follows:
W1 = np.random.randn(2,2)*0.01
b1 = np.zeros((1,2))
W2 = np.random.randn(2,1)*0.01
b2 = np.zeros((1,1))
Multiplying the weights by a small number plays a role similar to normalizing the data: it keeps the neuron's weighted-sum output from being too large. For a large value x, the activation function sigmoid(x) or tanh(x) is saturated, i.e. its derivative (gradient) there is close to 0; such a tiny activation gradient makes the parameter gradients in the backward computation smaller and smaller, so the parameters update very slowly. This is the gradient vanishing problem. Conversely, values that are too large can also make the gradients in the backward computation very large, i.e. gradient explosion.
Should the initial weights then be as small as possible? No: the gradient of a neuron's input (such as $\partial L / \partial a^{[1]}$) is proportional to the weights ($W^{[2]}$), so weights that are too small also make the gradient with respect to the input too small, again causing gradient vanishing during the backward computation. An initial weight too close to 0 also partially reproduces the neuron symmetry problem described above. Therefore, in general, the weight parameters are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01.
With the initialization above, the variance of a neuron's output grows with the number of its inputs, but it should not depend on the number of inputs; otherwise the variance grows larger and larger as the number of layers increases. To prevent this, the weights can be divided by the square root of the number of inputs, which normalizes the variance of the neuron's output to 1, i.e. the weights are initialized with the code w = np.random.randn(n)/sqrt(n), where n is the number of inputs of the neuron.
If $x = (x_1, \cdots, x_i, \cdots, x_n)$ is an input sample, $n$ is the number of its feature values, and $z = \sum_i^n w_i x_i$ is the output value of the neuron, then the variance of $z$ relates to the variance of $x$ as follows:

$$Var(z) = Var\left(\sum_i^n w_i x_i\right) = \sum_i^n Var(w_i x_i) = \sum_i^n [E(w_i)]^2 Var(x_i) + [E(x_i)]^2 Var(w_i) + Var(x_i) Var(w_i)$$

The last equality uses the property of variance that if two random variables X, Y are independent, then:

$$Var(XY) = [E(X)]^2 Var(Y) + [E(Y)]^2 Var(X) + Var(X) Var(Y)$$

Assuming that the inputs and weights have zero mean, i.e. $E[x_i] = E[w_i] = 0$, then:

$$Var(z) = \sum_i^n Var(x_i) Var(w_i) = (n\, Var(w))\, Var(x)$$
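As a quick numerical sanity check of the variance identity above (a sketch with arbitrarily chosen example distributions, not from the book):

import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(1.0, 2.0, 1_000_000)  # E(X)=1, Var(X)=4
Y = rng.normal(3.0, 0.5, 1_000_000)  # E(Y)=3, Var(Y)=0.25
lhs = np.var(X * Y)
rhs = 1**2 * 0.25 + 3**2 * 4 + 4 * 0.25  # = 37.25
print(lhs, rhs)  # both should be close to 37.25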
The variance Var(z) of the output is thus proportional not only to the input variance Var(x) and the weight variance Var(w), but also to the number n of input values $x_i$. To keep the output variance equal to the input variance, i.e. Var(z) = Var(x), so that passing through the neuron neither inflates nor shrinks the variance, we need $n\,Var(w) = 1$. If w is sampled from the standard normal distribution, then Var(w) = 1; multiplying w by the constant $a = \frac{1}{\sqrt{n}}$ gives $Var(aw) = a^2 Var(w) = \frac{1}{n}$, so that $n\,Var(aw) = 1$. Therefore, the weights can be initialized with the following code:
w = np.random.randn(n) * sqrt(1.0/n)
Following a similar analysis, other initialization methods have been proposed. Glorot et al. suggested multiplying weights sampled from the standard normal distribution by $\sqrt{2/(n_{in} + n_{out})}$, so that $Var(w) = 2/(n_{in} + n_{out})$, where $n_{in}$ and $n_{out}$ are the sizes of the input and output vectors of the network layer; the purpose is to keep the variance of the gradients from changing during the backward computation as well. Of course, the forward and backward requirements interact, so in practice the variances in both directions still change somewhat.
The weights can also be drawn from a uniform distribution: $w \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\right]$. This weight initialization method proposed by Glorot et al. is called Xavier initialization, and its code is implemented as follows:
import numpy as np
import math
def calculate_fan_in_and_fan_out(tensor):
    if len(tensor.shape) < 2:
        raise ValueError("tensor with fewer than 2 dimensions")
    if len(tensor.shape) == 2:
        fan_in,fan_out = tensor.shape
    else:  # F,C,kH,kW
        num_input_fmaps = tensor.shape[1]   # size(1) of F,C,H,W
        num_output_fmaps = tensor.shape[0]  # size(0)
        receptive_field_size = tensor[0][0].size
        fan_in = num_input_fmaps * receptive_field_size
        fan_out = num_output_fmaps * receptive_field_size
    return fan_in, fan_out
The function calculate_fan_in_and_fan_out() computes the number of input features and the number of output features of a network layer (neuron); gain is an optional scaling factor for the weights, with default value 1.
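The xavier_uniform() and xavier_normal() functions called later in this section do not appear in the extracted text; the following is a plausible reconstruction based on the formulas above (the gain parameter is an assumption consistent with the description):

def xavier_uniform(tensor, gain=1.0):
    # uniform Xavier initialization: U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
    tensor[:] = np.random.uniform(-bound, bound, tensor.shape)

def xavier_normal(tensor, gain=1.0):
    # normal Xavier initialization: std = sqrt(2/(fan_in+fan_out))
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    tensor[:] = np.random.normal(0., std, tensor.shape)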
For neurons using Relu as the activation function, the weight initialization method of Kaiming He is now more commonly used: the weights sampled from the standard normal distribution are multiplied by $\sqrt{2/n}$, with code as follows:
w = np.random.randn(n) * sqrt(2.0/n)
For a network layer with the Relu activation function, it is sometimes recommended to set the bias b to a small non-zero constant such as 0.01, so that the activation function affects the gradients at the very start of training; however, whether a non-zero bias really improves performance is unclear.
def kaiming_uniform(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    fan = fan_in if mode == 'fan_in' else fan_out
    gain = calculate_gain(nonlinearity, a)
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std  # calculate uniform bound from the standard deviation
    tensor[:] = np.random.uniform(-bound, bound, (tensor.shape))

def kaiming_normal(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    fan = fan_in if mode == 'fan_in' else fan_out
    gain = calculate_gain(nonlinearity, a)
    std = gain / math.sqrt(fan)
    tensor[:] = np.random.normal(0, std, (tensor.shape))
calculate_gain() returns the scaling coefficient used in Kaiming He's method (also called the kaiming or he method); for example, its value for Relu is $\sqrt{2}$ and for tanh is 5.0/3. kaiming_uniform() and kaiming_normal() are the kaiming methods using uniform and Gaussian random values, respectively.
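calculate_gain() itself is not shown in the extracted text; the following is a sketch consistent with the values quoted above (the leaky_relu case follows the common $\sqrt{2/(1+a^2)}$ formula and is an assumption):

def calculate_gain(nonlinearity, a=0):
    if nonlinearity == 'relu':
        return math.sqrt(2.0)
    if nonlinearity == 'tanh':
        return 5.0 / 3
    if nonlinearity == 'leaky_relu':
        return math.sqrt(2.0 / (1 + a ** 2))  # assumption
    return 1.0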
The following function kaiming() selects the kaiming_uniform() or kaiming_normal() method according to the
parameters:
def kaiming(tensor,method_params=None):
    method_type,a,mode,nonlinearity = 'uniform',0,'fan_in','leaky_relu'
    if method_params:
        method_type = method_params.get('type', "uniform")
        a = method_params.get('a', 0)
        mode = method_params.get('mode','fan_in')
        nonlinearity = method_params.get('nonlinearity', 'leaky_relu')
    if method_type=="uniform":  # fixed: the source mistakenly compared method_params here
        kaiming_uniform(tensor,a,mode,nonlinearity)
    else:
        kaiming_normal(tensor,a,mode,nonlinearity)
w = np.empty((2, 3))  # assumed: the line creating w was lost in extraction
print(w)
xavier_uniform(w)
print("xavier_uniform:",w)
xavier_normal(w)
print("xavier_normal:",w)
kaiming_uniform(w)
print("kaiming_uniform:",w)
kaiming_normal(w)
print("kaiming_normal:",w)
output:
[[17.2 17.2 17.2]
[17.2 17.2 24.2]]
xavier_uniform: [[ 0.026289 -1.09114298 -0.48792212]
[-0.3313437 -0.47333989 -0.90713322]]
xavier_normal: [[ 0.93298795 0.07044394 -0.00270454]
[ 0.44167298 -1.01942638 0.45699115]]
kaiming_uniform: [[-1.21534711 -1.27523387  0.80492134]
 [ 0.81222595 -1.11076413 -0.29943563]]
kaiming_normal: [[-0.98492851 0.24745387 0.53676485]
[ 1.27654978 1.52143405 0.87124828]]
In addition, an auxiliary method apply(self, init_params_fn) can be added to the NeuralNetwork class to initialize the parameters of all its layers at once, which is convenient for multi-layer networks, e.g. initializing the parameters of every layer with kaiming_normal():
    def apply(self,init_params_fn):
        for layer in self._layers:
            init_params_fn(layer)
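A hypothetical usage, assuming each Dense layer stores its weight matrix as params[0]:

def init_fn(layer):
    if isinstance(layer, Dense):
        kaiming_normal(layer.params[0])
nn.apply(init_fn)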
In addition to the learning rate, different parameter-optimization strategies should also be tried, such as the momentum method, RMSprop, Adam and other well-known optimizers. These methods may have their own hyperparameters similar to the learning rate, which can be tuned in a similar way.
For batch gradient descent, different batch sizes can be tried, choosing one with a good trade-off between time efficiency and model quality.
Some special techniques (such as the Dropout technique below) may also have hyperparameters that need tuning. Hyperparameter tuning (including network structure parameters) is a craft that requires long practice and experience; it also pays to learn from others' published experience with tuning neural networks rather than exploring blindly.
Batch normalization (BN) inserts a normalization operation between a layer's weighted sum $z = xW + b$ and its activation function $\phi$:

$$\phi(BN(z)) = \phi(BN(xW + b))$$
As before, the weighted sum is regarded as a separate layer (that is, the fully connected layer), and the activation
function is regarded as a separate activation layer. The BN operation can be regarded as a separate batch
normalization layer inserted between them (BN layer).
Simply normalizing z to the standard normal distribution N(0, 1) would limit the expressiveness of the model, because no matter how the preceding layers transform the data, the output of this layer would always follow the standard normal distribution. The BN operation therefore introduces learnable parameters β and γ, representing a new mean and standard deviation, and transforms the features normalized to N(0, 1) into the distribution N(β, γ). Since β and γ are learnable, the loss of expressiveness is avoided.
The BN layer accepts the weighted-sum output z of the fully connected layer as its input, computes the mean and variance of each feature of these z, normalizes each feature of z to an N(0, 1) standard normal distribution using that mean and variance, and then transforms it into the N(β, γ) distribution with the learnable parameters β, γ.
Where no confusion arises, the letter x denotes the z that requires BN normalization. For a batch of samples $B = \{x_1, x_2, \cdots, x_m\}$, BN normalization first computes the mean $\mu_B$ and variance $\sigma_B^2$ of this batch of samples, and then scales and shifts the normalized values with the learnable parameters γ, β:

$$\mu_B = \frac{1}{m}\sum_{i=1}^m x_i$$

$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i = \gamma \odot \hat{x}_i + \beta$$
If each row of the matrix X represents one data point, the following code computes the mean and variance of each feature of the data:
mean = X.mean(axis=0)
var = ((X - mean) ** 2).mean(axis=0)
That is, the mean and variance are computed over the samples (along each column). Of course, numpy's var() function can also be used to compute the variance:
var = np.var(X, axis=0)
It is assumed here that each data point $x_i$ is a vector (one-dimensional array), i.e. X is a two-dimensional array (matrix). Later we will see that each data point may be a multi-dimensional array, such as a multi-channel image; whether $x_i$ is one- or multi-dimensional, each of its elements is treated as a feature. A multi-dimensional $x_i$ can be flattened into a one-dimensional array (vector) with numpy's reshape() function, so that X is still a two-dimensional array (matrix). The following code ensures each $x_i$ is flattened into a one-dimensional vector:
n_X = X.shape[0]
X_flat = X.ravel().reshape(n_X,-1)  # X_flat = X.reshape(n_X,-1)
Since neural networks are usually trained with mini-batch gradient descent, each descent step uses a small batch of samples to update the parameters. Batch normalization therefore does not compute the mean and variance over the entire training set, but over the current mini-batch during training; hence the name batch normalization.
At prediction time, the forward computation must still pass through the BN layer, but the batch statistics must not be computed again. Instead, the mean, variance and the parameters β, γ determined during training are used for the transformation. However, the mean and variance computed at each training iteration differ, and the statistics used for prediction should not depend on a single iteration; therefore, the means and variances over all iteration steps are averaged, usually as running averages computed with a moving-average scheme.
If running_mu and running_var denote the moving averages of the mean and variance during training, they are updated as follows:
running_mu = momentum * running_mu + (1 - momentum) * mu
running_var = momentum * running_var + (1 - momentum) * var
That is, the current moving averages and the statistics of the current batch are combined in a weighted average, where the momentum parameter controls the weight of the moving average.
Forward computation at prediction time transforms the input with the moving averages and the learned parameters:
X_flat = X.ravel().reshape(X.shape[0],-1)
# normalization
X_hat = (X_flat - running_mean) / np.sqrt(running_var + eps)
# scale and shift
out = self.gamma * X_hat + self.beta
From $z_i = \gamma \odot \hat{x}_i + \beta$, we can get:

$$\frac{\partial f}{\partial \beta} = \sum_{i=1}^m \frac{\partial f}{\partial z_i}\frac{\partial z_i}{\partial \beta} = \sum_{i=1}^m \frac{\partial f}{\partial z_i}$$

$$\frac{\partial f}{\partial \gamma} = \sum_{i=1}^m \frac{\partial f}{\partial z_i}\frac{\partial z_i}{\partial \gamma} = \sum_{i=1}^m \frac{\partial f}{\partial z_i} \cdot \hat{x}_i$$

$$\frac{\partial f}{\partial \hat{x}_i} = \frac{\partial f}{\partial z_i}\cdot\frac{\partial z_i}{\partial \hat{x}_i} = \frac{\partial f}{\partial z_i} \cdot \gamma$$

Because $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\mu$ and $\sigma^2$ are also functions of $x_i$:

$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i}\cdot\frac{\partial \hat{x}_i}{\partial x_i} + \frac{\partial f}{\partial \mu}\cdot\frac{\partial \mu}{\partial x_i} + \frac{\partial f}{\partial \sigma^2}\cdot\frac{\partial \sigma^2}{\partial x_i}$$

And from:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},\quad \sigma^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu)^2,\quad \mu = \frac{1}{m}\sum_{i=1}^m x_i$$

we can get:

$$\frac{\partial \hat{x}_i}{\partial x_i} = \frac{1}{\sqrt{\sigma^2+\epsilon}},\quad \frac{\partial \mu}{\partial x_i} = \frac{1}{m},\quad \frac{\partial \sigma^2}{\partial x_i} = \frac{2(x_i-\mu)}{m}$$

Therefore:

$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i}\left(\frac{1}{\sqrt{\sigma^2+\epsilon}}\right) + \frac{\partial f}{\partial \mu}\left(\frac{1}{m}\right) + \frac{\partial f}{\partial \sigma^2}\left(\frac{2(x_i-\mu)}{m}\right)$$

where:

$$\frac{\partial f}{\partial \sigma^2} = \sum_{j=1}^m \frac{\partial f}{\partial \hat{x}_j}\cdot\frac{\partial \hat{x}_j}{\partial \sigma^2} = -0.5\sum_{j=1}^m \frac{\partial f}{\partial \hat{x}_j}(x_j-\mu)(\sigma^2+\epsilon)^{-1.5}$$

$$\frac{\partial f}{\partial \mu} = \left(\sum_{i=1}^m \frac{\partial f}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma^2+\epsilon}}\right) + \left(\frac{\partial f}{\partial \sigma^2}\cdot\frac{1}{m}\sum_{i=1}^m -2(x_i-\mu)\right)$$
According to the above forward and backward formulas, the BN layer for vector inputs can be implemented as the following class:
class BatchNorm_1d(Layer):
    def __init__(self,num_features,gamma_beta_method = None,eps = 1e-8,momentum = 0.9):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        if not gamma_beta_method:
            self.gamma = np.ones((1, num_features))
            self.beta = np.zeros((1, num_features))
        else:
            self.gamma = np.random.randn(1, num_features)
            self.beta = np.random.randn(1, num_features)
        self.params = [self.gamma,self.beta]
        self.grads = [np.zeros_like(self.gamma),np.zeros_like(self.beta)]
    def forward(self, X):  # the method header was lost in extraction
        self.n_X = X.shape[0]
        self.X_flat = X.ravel().reshape(self.n_X,-1)
        self.mu = np.mean(self.X_flat,axis=0)
        self.var = np.var(self.X_flat, axis=0)  # var = 1/N * np.sum((x - mu)**2, axis=0)
        self.X_hat = (self.X_flat - self.mu)/np.sqrt(self.var + self.eps)
        out = self.gamma * self.X_hat + self.beta
        return out
    def __call__(self,X):
        return self.forward(X)
    def backward(self,dout):
        eps = self.eps
        dout = dout.ravel().reshape(dout.shape[0],-1)
        X_mu = self.X_flat - self.mu
        var_inv = 1./np.sqrt(self.var + eps)
        dbeta = np.sum(dout,axis=0)
        dgamma = np.sum(dout * self.X_hat, axis=0)
        # gradient with respect to the input, following the derivation above
        dX_hat = dout * self.gamma
        dvar = np.sum(dX_hat * X_mu, axis=0) * -0.5 * (self.var + eps)**-1.5
        dmu = np.sum(dX_hat * -var_inv, axis=0) + dvar * np.mean(-2.*X_mu, axis=0)
        dX = dX_hat * var_inv + dvar * 2*X_mu/self.n_X + dmu/self.n_X
        self.grads[0] += dgamma
        self.grads[1] += dbeta
        return dX
For this BatchNorm_1d class, the following code uses numerical gradients to check that the analytical gradients are correct:
# diff_error = lambda x, y: np.max(np.abs(x - y))
from util import *
import numpy as np
np.random.seed(231)
N, D = 100, 5
x = 3 * np.random.randn(N, D) + 5
bn = BatchNorm_1d(D,"no")
x_norm = bn(x)
do = np.random.randn(N, D)+0.5
dx = bn.backward(do)
dx_num = numerical_gradient_from_df(lambda :bn.forward(x),x,do)
print(diff_error(dx,dx_num))
if False:
    dx_gamma = numerical_gradient_from_df(lambda :bn.forward(x),bn.gamma,do)
    print(diff_error(bn.grads[0],dx_gamma))
7.684454184087031e-10
In the convolutional neural networks discussed later, an input sample (such as a color image) is often a three-dimensional tensor C × H × W, where C, H, W are the number of channels (e.g. of a color image), the height, and the width; a batch of samples is then a 4-dimensional tensor N × C × H × W, where N is the number of samples. The code above can be rewritten to handle such 4D tensor input; the following version performs batch normalization per channel instead of per (pixel) feature:
class BatchNorm(Layer):
    def __init__(self,num_features,gamma_beta_method = None,eps = 1e-5,momentum = 0.9,std = 0.02):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        if not gamma_beta_method:
            self.gamma = np.ones((1, num_features))
            self.beta = np.zeros((1, num_features))
        else:
            self.gamma = np.random.normal(1,std,(1, num_features))
            self.beta = np.zeros((1, num_features))
        self.params = [self.gamma,self.beta]
        self.grads = [np.zeros_like(self.gamma),np.zeros_like(self.beta)]
        self.running_mu = np.zeros((1, num_features))   # moving averages used at prediction time
        self.running_var = np.ones((1, num_features))
    def forward(self,X,training=True):  # the method header was lost in extraction
        self.X_shape = X.shape
        if len(self.X_shape)>2:
            N,C,H,W = X.shape
            X = np.moveaxis(X,1,3)  # move C to last axis: N,H,W,C
            X_flat = X.reshape(-1,X.shape[3])
        else:
            X_flat = X
        if training:
            NHW = X_flat.shape[0]
            self.n_X = NHW
            mu = np.mean(X_flat,axis=0)
            var = 1 / float(NHW) * np.sum((X_flat - mu) ** 2, axis=0)  # np.var(X_flat, axis=0)
            X_hat = (X_flat - mu)/np.sqrt(var + self.eps)
            out = self.gamma * X_hat + self.beta
            self.mu,self.var,self.X_flat,self.X_hat = mu,var,X_flat,X_hat
            # update the moving averages of the mean and variance
            self.running_mu = self.momentum*self.running_mu + (1-self.momentum)*mu
            self.running_var = self.momentum*self.running_var + (1-self.momentum)*var
        else:
            # normalization with the moving averages
            X_hat = (X_flat - self.running_mu) / np.sqrt(self.running_var + self.eps)
            # scale and shift
            out = self.gamma * X_hat + self.beta
        if len(self.X_shape)>2:
            out = out.reshape(N,H,W,C)
            out = np.moveaxis(out,3,1)
        return out
    def __call__(self,X):
        return self.forward(X)
    def backward(self,dout):
        if len(dout.shape)>2:
            dout = np.moveaxis(dout,1,3)
            dout = dout.reshape(-1,dout.shape[3])
        eps = self.eps
        dbeta = np.sum(dout,axis=0)
        dgamma = np.sum(dout * self.X_hat, axis=0)
        # gradient with respect to the input, as in BatchNorm_1d
        X_mu = self.X_flat - self.mu
        var_inv = 1./np.sqrt(self.var + eps)
        dX_hat = dout * self.gamma
        dvar = np.sum(dX_hat * X_mu, axis=0) * -0.5 * (self.var + eps)**-1.5
        dmu = np.sum(dX_hat * -var_inv, axis=0) + dvar * np.mean(-2.*X_mu, axis=0)
        dX = dX_hat * var_inv + dvar * 2*X_mu/self.n_X + dmu/self.n_X
        if len(self.X_shape)>2:
            N,C,H,W = self.X_shape
            dX = dX.reshape(N,H,W,C)
            dX = np.moveaxis(dX,3,1)
        self.grads[0] += dgamma
        self.grads[1] += dbeta
        return dX
To observe the effect of BN on network performance, a batch normalization (BN) layer is inserted between the weighted sum and the activation function of the first two hidden layers of the network model trained on the Fashion-MNIST dataset in Section 4.3.9. Because BN keeps the weight parameters from becoming extreme, it also acts as a regularization technique, so the weight-decay regularization can be removed from the code (reg=0):
import numpy as np
import util
from NeuralNetwork import *
from train import *
import mnist_reader
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Dense(784, 500))
nn.add_layer(Relu())
nn.add_layer(Dense(500, 200))
nn.add_layer(BatchNorm_1d(200))
nn.add_layer(Relu())
nn.add_layer(Dense(200, 100))
nn.add_layer(BatchNorm_1d(100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
learning_rate = 0.01
momentum = 0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=8
batch_size = 64
reg = 0#1e-3
print_n=1000
losses = train_nn(nn,train_X,y_train,optimizer,cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
plt.plot(losses)
[ 1, 1] loss: 2.291
[ 1001, 2] loss: 0.416
[2001, 3] loss: 0.261
[ 3001, 4] loss: 0.342
[ 4001, 5] loss: 0.222
[ 5001, 6] loss: 0.196
[ 6001, 7] loss: 0.157
[ 7001, 8] loss: 0.295
0.9066833333333333
0.8766
It can be seen that with batch normalization the prediction accuracy of the trained model improves somewhat.
5.4 Regularization
When the model is complex (e.g. has many parameters), regularization is the basic technique for preventing overfitting. Besides the direct weight regularization used in regression, deep learning often uses a regularization technique called Dropout.
The total loss is the data loss plus the regularization term:

$$L_{data} + R_W$$

For a weight w, its $L_2$ regularization term is $R_W = \lambda w^2$, where λ controls the weight of the regularization term relative to the data loss: the larger λ is, the stronger the regularization and the better it prevents overfitting; the smaller λ is, the weaker both effects are. The $L_2$ term pushes the parameters toward smaller values close to 0.
The $L_1$ regularization term is $R_W = \lambda |w|$. Its effect is similar to $L_2$ but slightly different: $L_2$ shrinks all values uniformly, while $L_1$ makes the weights sparse, i.e. many weights become close to 0 and only a few remain non-zero. $L_1$ therefore makes machine learning tend to select a few good features rather than use all of them, which helps feature selection. Sparsity is an important topic in machine learning, but due to space limitations this book does not discuss it.
The $L_1$ and $L_2$ terms can also be combined into the so-called elastic net regularization: $R_W = \lambda_1 |w| + \lambda_2 w^2$, whose effect lies between, or elastically combines, $L_1$ and $L_2$.
Figure 5-23 is a schematic diagram of these three common weight regularization functions:
In gradient descent, especially in deep learning, the backward computation multiplies gradients layer by layer, so as the number of layers grows the gradients can vanish or explode. The max-norm constraint can prevent gradient explosion by limiting the weights about to be updated to a certain range, i.e. by clipping the weight vector so that some norm, such as the $L_2$ norm, does not exceed a threshold: $||w||_2 < c$. Typical values of c are 3 or 4, and some studies report that this max-norm constraint on the weights improves the convergence of the algorithm. It is commonly used to prevent gradient explosion, especially in the recurrent neural networks studied later.
import numpy as np
def max_norm_constraints(w,c,epsilon = 1e-8):
    norms = np.sqrt(np.sum(np.square(w), keepdims=True))
    desired = np.clip(norms, 0, c)
    w *= (desired / (epsilon + norms))
    return w

w = np.random.randn(2,5)*10
print(w)
w = max_norm_constraints(w,2)
print(w)
If grads contains the gradients of multiple weight parameters, the following code limits their overall (global) norm to at most c by rescaling:
import math
def grad_clipping(grads,c):
    norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
    if norm > c:
        ratio = c / norm
        for i in range(len(grads)):
            grads[i] *= ratio
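A hypothetical usage, assuming the [parameter, gradient] pair layout of nn.parameters() described earlier:

grads = [grad for p, grad in nn.parameters()]
grad_clipping(grads, 5.0)  # rescale so the global gradient norm is at most 5
optimizer.step()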
5.4.2 Dropout
https://deepnotes.io/dropout
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks
Dropout (English for "discarding") is a regularization technique proposed by Srivastava et al. During training, each neuron is active (i.e. produces its activation value) only with a certain probability, and is otherwise inactive (produces no output). For a given layer, Dropout makes each neuron inactive with a probability drop_p between 0 and 1, and hence active with probability 1−drop_p. If the layer has 100 neurons and drop_p=0.2, then in expectation 100*0.2 of them will be inactive and 100*0.8 active. drop_p is called the drop rate, the probability with which a neuron is inactive; conversely, (1 − drop_p) is called the survival or retention rate, the probability with which a neuron is active.
Figure 5-24 Dropout: Some neurons are randomly inactive during each forward and reverse process
As shown in Figure 5-24, in the network on the left all neurons are active, while the one on the right uses Dropout. Dropout defines a different network function by deactivating certain neurons; because each gradient descent iteration randomly deactivates a different set, different iterations optimize different functions. As a result the network cannot rely too heavily on a small number of neurons, just as a group's decisions should not depend on a few individuals but give everyone a chance to participate, preventing the bias of over-reliance on a few. Dropout is similar in spirit to data normalization: without normalization, features with large values dominate the algorithm while others barely matter. It is also similar to weight regularization, which uses penalty terms to keep all weights small and prevent a few weights from becoming too large.
Dropping out a layer's neurons with a certain probability makes the expected total output of that layer smaller: if the original expected total output is e and the drop probability is drop_p, the expectation becomes e ∗ (1 − drop_p). To avoid affecting subsequent layers, the output of each active neuron in a layer using dropout is therefore usually divided by (1 − drop_p); that is, if a neuron's activation output is a, it is modified to output a/(1 − drop_p).
Because each training iteration drops a different random set of neurons, each iteration effectively trains a different neural network function; the loss being minimized is thus not the loss of one fixed function but of a different function at each iteration. The final trained function can be regarded as the average of the different functions produced by these iterations, and a better model can be obtained by averaging multiple different models, just as voting by many people gives better results than voting by a few. This, again, is the basic idea of machine learning founded on statistical learning: Dropout effectively avoids overfitting by averaging multiple functions.
In addition, Dropout is only used while training the model; the trained model function should be deterministic. The final trained neural network is therefore the function in which all neurons are active, and Dropout must no longer be applied when validating or testing the model.
Dropout can act on the output of any hidden layer. If the layer output is x, the Dropout operation can be written as:
x = D ⊙ x
where D is an array with the same shape as x whose elements are 1 or 0, indicating whether the corresponding neuron is active. D is a mask array computed from the drop rate (or survival rate). The Dropout operation can be performed with the following code:
retain_p = 1-drop_p
mask = (np.random.rand(*x.shape) < retain_p) / retain_p
x *= mask
Here drop_p and retain_p = 1−drop_p are the drop rate and retention rate respectively, and mask is the mask array marking the active neurons (already scaled by the retention rate). In the backward computation, the gradient dx_output of the loss function with respect to the Dropout output is simply multiplied by the same mask:
dx = dx_output * self._mask
where dx_output is the gradient of the loss function with respect to x propagated backward.
The Dropout operation can be implemented as a separate Dropout layer, the code is as follows:
from Layers import *
class Dropout(Layer):
def __init__(self, drop_p):
super().__init__()
self.retain_p = 1- drop_p
where x is the output of the layer preceding the Dropout layer. For an input X, the following code uses dropout.forward(X) to compute the Dropout output, and in the reverse derivation uses dropout.backward(dx_output) to obtain the gradient with respect to X from the incoming gradient dx_output:
np.random.seed(1)
dropout = Dropout(0.5)
X = np.random.rand(2, 4)
print(X)
print(dropout.forward(X))
dx_output = np.random.rand(2, 4)
print(dx_output)
print(dropout.backward(dx_output))
Dropout is a technique for reducing the complexity of the function: the more parameters a model has, the more complex it is. Different network layers have different numbers of parameters and thus different complexity. Dropout can be added after any hidden layer; the drop rate can be higher for hidden layers with more parameters and lower otherwise. For network layers with few parameters, the Dropout layer may be omitted.
Dropout prevents overfitting, but because it makes each iteration train a different function, the loss function loses its clear meaning and the training parameters are hard to debug with debugging tools. The usual practice is to first turn Dropout off (regularization terms can still be added to prevent overfitting), tune the parameters, and then turn Dropout on to further improve the quality of the model.
Dropout is a regularization technique. When the network is relatively small compared to the data set, regularization is usually not required because the complexity of the model is already low; adding regularization would reduce the model's representation ability and hurt learning performance. Moreover, Dropout obviously should not be placed immediately before or after the output layer, because near the classification output the network cannot "correct" the errors caused by dropping neurons. In addition, batch normalization is now widely used in practice instead of Dropout.
For example, if the network trained on the Fashion-MNIST dataset in Section 4.3.9 adds Dropout after the first network layer, the weight-decay regularization can be removed (reg=0):
import numpy as np
import util
from NeuralNetwork import *
from train import *
import mnist_reader
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(1)
# X_train, y_train are assumed to have been loaded with mnist_reader,
# e.g. X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
trainX = X_train.reshape(-1, 28, 28)
train_X = trainX.astype('float32')/255.0
nn = NeuralNetwork()
nn.add_layer(Dense(784, 500))
nn.add_layer(Relu())
nn.add_layer(Dropout(0.25))
nn.add_layer(Dense(500, 200))
nn.add_layer(Relu())
nn.add_layer(Dropout(0.2))
nn.add_layer(Dense(200, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
learning_rate = 0.01
momentum = 0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=8
batch_size = 64
reg = 0#1e-3
print_n=1000
losses = train_nn(nn, train_X, y_train, optimizer, cross_entropy_grad_loss, epochs,
                  batch_size, reg, print_n)
plt.plot(losses)
[ 1, 1] loss: 2.307
[ 1001, 2] loss: 0.661
[2001, 3] loss: 0.322
[ 3001, 4] loss: 0.509
[ 4001, 5] loss: 0.316
[ 5001, 6] loss: 0.344
[ 6001, 7] loss: 0.355
[ 7001, 8] loss: 0.434
0.8872333333333333
0.8667
Compared with the previous chapter, using Dropout also improves accuracy. Of course, the Dropout hyperparameters need further tuning to improve the effect. In current practice, batch normalization is generally used instead of Dropout.
5.4.3 Early stopping method (Early stopping)
As shown in Figure 5-25, during training, with the help of the validation set, the iteration is stopped as soon as the validation loss no longer decreases (or even starts to increase), to prevent overfitting from reducing the generalization ability of the model.
Figure 5-25. Stopping the training iteration when the validation error no longer decreases or starts to increase
Chapter 6 Convolutional Neural Network CNN
In the neural networks so far, each input sample is a one-dimensional tensor, and each layer of neurons receives a one-dimensional tensor from the previous layer to generate an output. This kind of neural network is called a fully connected neural network. For two-dimensional or three-dimensional tensors such as image data, each image is flattened into a one-dimensional tensor before being input to the network. The flattened one-dimensional tensor loses the inherent spatial structure of the image (such as the adjacency relationships between pixels); exchanging the element order of the flattened tensors has no effect on the training of the network function, that is, as long as all tensors are permuted the same way, changing the element order still produces an equivalent network function. Yet for an image it is obviously unreasonable to arrange all pixels randomly and still recognize the same object, because the pixels of an image are meaningful only when arranged according to a certain spatial structure; otherwise they are a meaningless jumble.
6.1 Convolution
Summing a group of numbers x_1, x_2, ⋯, x_n and dividing by n gives the average of the group. You can also multiply each number x_i by a weight w_i and sum the products:
w_1 ∗ x_1 + w_2 ∗ x_2 + w_3 ∗ x_3 + ⋯ + w_n ∗ x_n
If the weights sum to 1, this weighted sum is called a weighted average. Weight values can also be negative; for example, a company's debt ratio or profitability ratio can be negative or positive.
For example, to compute a student's grade in a course, different proportions can be given to the usual grade, the experimental grade, and the final grade, such as 0.2, 0.3, and 0.5 respectively; the total grade is then 0.2*usual grade + 0.3*experimental grade + 0.5*final grade. The weighted sum of a set of numbers extracts a certain feature from the set, such as the "total grade" feature extracted by the weighted sum of grades.
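For instance, with illustrative grade values (the numbers here are made up):
grades = [85, 90, 78]            # usual, experimental, final grades
weights = [0.2, 0.3, 0.5]
total = sum(w*g for w, g in zip(weights, grades))
print(total)                     # 83.0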
For example, consider the following group of 10 numbers:
4 15 16 7 23 17 10 9 5 8
and 3 weights (1.2, 0.3, 0.5). Because the number of weights is less than the number of values, these 3 weights can be used in turn to compute a weighted sum with every 3 adjacent numbers in the group. First align the 3 weights with the first 3 numbers (4, 15, 16):
Figure 6-1 Align the weight vector (1.2,0.3,0.5) to the first three numbers
(4,15,16)
Figure 6-3 Align the weight vector (1.2,0.3,0.5) to the last 3 numbers
(9,5,8)
This process of sliding a weight vector shorter than the data along the data, aligning it with successive windows and computing a weighted sum to obtain a new group of data, is called convolution.
The length of the resulting vector is n−K+1, where n is the input length and K the convolution kernel width. For example, if the input data length is 5 and the kernel width is 3, the length of the resulting convolution vector is 5−3+1 = 3, as shown in Figure 6-5 below:
Figure 6-5 valid convolution: the data length is 5, the convolution kernel width is 3, and the length of the resulting convolution vector is 5-3+1 = 3
This convolution method is called "valid convolution". The above
summation can be expressed in python code as:
K = w.size
z[i] = np.sum(x[i:i+K]*w)
import numpy as np
np.random.seed(5)
x = np.random.randint(low=1, high=30, size=10,dtype='l')
print(x)
w = np.array([1.2,0.3,0.5])
n = x.size
K = w.size
z = np.zeros(n-K+1)
for i in range(n-K+1):
z[i] = np.sum(x[i:i+K]*w)
print(w)
print(z)
[ 4 15 16 7 23 17 10 9 5 8]
[1.2 0.3 0.5]
[17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3]
In order to generate result data with the same length as the original data, zeros can be padded before and after the original data before convolving. As shown in Figure 6-6, for a convolution kernel of length 3, after one 0 is padded on each side, the left and right ends generate 2 new values: 1.2*0 + 0.3*4 + 0.5*15 = 8.7 and 1.2*5 + 0.3*8 + 0.5*0 = 8.4.
Figure 6-6 same convolution: the convolution width is K, and (K-1)/2 0s are
filled on the left and right sides of the original data of length n, and the
length of the convolution result vector is n
The number of 0s padded before and after the data is (K−1)/2 each, so the data length becomes n + 2(K−1)/2 = n+K−1, and the length of the convolution result is n+K−1−K+1 = n = 10; that is, the result has the same length as the original data. This convolution method is called "same convolution". Of course, if K is not an odd number, the result length is n−1.
Figure 6-7 full convolution: the convolution width is K, and K-1 zeros are
filled on the left and right sides of the original data of length n, and the
length of the convolution result vector is n+K-1
Generally, if the length of the original data is n, the length of the convolution kernel is K, and the sum of the left and right padding lengths is P, the length of the convolution result is n+P−K+1. For example, if P = 0 (no padding), the original data length is 3, and the kernel length is also 3, the result length is 3−3+1 = 1.
The convolution with padding can be implemented as follows:
def conv1d(x, w, pad=0):
    n, K = x.size, w.size
    n_o = n + 2*pad - K + 1
    x_pad = np.pad(x, pad, mode='constant') if pad > 0 else x
    y = np.zeros(n_o)                    # convolution result
    for i in range(n_o):
        y[i] = np.sum(x_pad[i:i+K]*w)
    return y
y1 = conv1d(x, w, 1)                     # same convolution
print(x.size, w.size, y1.size)
print("same: ", y1)
y2 = conv1d(x, w, 2)                     # full convolution
print(x.size, w.size, y2.size)
print("full: ", y2)
10 3 10
same: [ 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3
8.4]
10 3 12
full: [ 2. 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3
8.4 9.6]
numpy provides ready-made functions for one-dimensional convolution: np.correlate() and np.convolve(). Their first parameter is the data to be convolved, the second parameter is the weight vector, and the third parameter selects one of three convolution methods: "full", "same", or "valid".
w0 = np.array([1.2, 0.3, 0.5])
# the cross-correlation function np.correlate is what deep learning
# calls convolution
x_valid = np.correlate(x, w0, 'valid')
x_same = np.correlate(x, w0, 'same')
x_full = np.correlate(x, w0, 'full')
print(x_valid)
print(x_same)
print(x_full)
w = np.array([0.5, 0.3, 1.2])
# np.convolve first reverses the kernel and then performs the deep
# learning style convolution
x_valid = np.convolve(x, w, 'valid')
x_same = np.convolve(x, w, 'same')
x_full = np.convolve(x, w, 'full')
print(x_valid)
print(x_same)
print(x_full)
[17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3]
[8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4]
[ 2. 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4
9.6]
[17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3]
[8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4]
[ 2. 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4
9.6]
Span (stride)
Typically, the convolution kernel slides along the convolved data element by element; hence the valid convolution of data of length n with a kernel of length K has length n−K+1. Sliding one element at a time makes the result almost as long as the original data. The number of elements the kernel slides along the data each time is called the span or stride. Sometimes, to produce smaller convolution results, the kernel is slid with a span greater than 1, as shown in Figure 6-9.
If the span of the convolution kernel is S, a kernel of length K can slide (n−K)/S times over data of length n (using integer division). Counting the initial position, each slide allows one more convolution operation, so a total of (n−K)/S+1 convolution operations can be performed. For example, with n=10, K=3, S=2, the number of convolution operations is (10−3)/2+1 = 4.
If the original data length is n and the sum of the left and right padding lengths is P, the padded data length is n+P. The number of convolution operations that can be performed is therefore (n+P−K)/S+1, that is, the final result length is (n+P−K)/S+1.
def conv1d(x, w, pad=0, s=1):
    n = x.size
    K = w.size
    n_o = (n + 2*pad - K)//s + 1
    y = np.zeros(n_o)            # convolution result
    if not pad == 0:
        x_pad = np.pad(x, [(pad, pad)], mode='constant')
    else:
        x_pad = x
    for i in range(n_o):         # slide the kernel with stride s
        y[i] = np.sum(x_pad[i*s:i*s+K]*w)
    return y
For example, the following generates two sets of numbers x and y: x is 100 numbers uniformly spaced on [0, 2π], and y consists of values near the corresponding sinusoid sin(x) (i.e., y is a noisy sample of the sinusoid):
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
plt.plot(x,y)
plt.show()
Figure 6-10 Random noise sampling of a sinusoid
[-6 -5 -4 -3 -2 -1 0 1 2 3 4 5]
['0.00', '0.00', '0.01', '0.05', '0.12', '0.20',
'0.23', '0.20', '0.12', '0.05', '0.01', '0.00']
Figure 6-11 A set of weight parameters sampled according to the normal
distribution
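The code that produced the values above is not shown; a minimal sketch that reproduces such weights, assuming they are normal-density values at integer offsets normalized to sum to 1 (the width sigma = 1.75 is a guess):
import numpy as np
t = np.arange(-6, 6)                 # the integer offsets printed above
sigma = 1.75                         # assumed width of the bell curve
w = np.exp(-t**2/(2*sigma**2))
w = w / w.sum()                      # normalize so the weights sum to 1
print(t)
print(['%.2f' % v for v in w])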
The middle values of this weight vector w are large, the values on both sides are small, and the sum of all weight values is 1. Using this weight vector w to convolve y:
#w = np.array([0.1,0.2, 0.5, 0.2, 0.1])
yhat = np.correlate(y, w,"same")
plt.plot(x,yhat, color='red')
Figure 6-12 Use the weight vector of Gaussian distribution to weight the
original sinusoidal sampling data, which plays a smooth (smooth) effect
This weight vector computes a weighted average of the values in y. As the weight vector slides along y, each computed value in yhat is the weighted average of the y value at the center of the sliding window and the y values around it; the weight corresponding to the center is largest, and the weights of values farther from the center are smaller. The resulting vector is thus an averaged, or smoothed, version of the original data. As Figure 6-12 shows, the curve through the convolved points (x, yhat) becomes smoother, that is, the noise in the original data y is reduced to a certain extent.
Of course, computer digital images can also use real numbers in the [0,1] interval to represent pixel values. The following code uses the io module of the skimage package to read a color image into a numpy multidimensional array img, uses rgb2gray from the skimage.color module to convert the color image into a black-and-white (grayscale) image, displays the two images, and prints the pixel values of a 5×5 window in the middle of each.
from skimage import io, transform
from skimage.color import rgb2gray
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
img = io.imread('image.jpg')
gray_img = rgb2gray(img)   # or io.imread('image.jpg', as_gray=True)
plt.subplot(1, 2, 1)
plt.imshow(img)
plt.subplot(1, 2, 2)
plt.imshow(gray_img, cmap='gray')
# img = io.imread('./imgs/lenna.png', as_gray=True)  # load the image as grayscale
# plt.imshow(img, cmap='gray')
print('image matrix size: ', img.shape)         # print the size of the image
print('image matrix size: ', gray_img.shape)    # print the size of the image
print('\n First 5 rows and columns of the color image matrix: \n',
      img[150:155, 110:115])
print('\n First 5 rows and columns of the gray image matrix: \n',
      gray_img[150:155, 110:115])
[[108 78 68]
[106 76 66]
[107 77 67]
[101 71 61]
[ 92 62 52]]
As can be seen, the color image is read into a three-dimensional numpy array whose third dimension represents the three color channels of the image, namely the R, G, B channels. Each channel is a two-dimensional array (matrix), so a color image can be viewed as 3 matrices. The rgb2gray() function converts a 3-channel color image into a one-channel grayscale image, i.e., a two-dimensional array (matrix). The value of each grayscale pixel is computed as a weighted sum of the red (R), green (G), and blue (B) pixel values:
Y = 0.2125 R + 0.7154 G + 0.0721 B
As the output shows, the RGB color values are converted from integer values in the [0,255] interval to real values in the [0,1] interval.
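This conversion can be checked manually; a small sketch applying the weighted sum above to the [0,1]-scaled color channels (approximate comparison, since the library may round slightly differently):
# manual grayscale conversion with the weights above (img is the uint8 RGB image)
coeffs = np.array([0.2125, 0.7154, 0.0721])
gray_manual = (img/255.0) @ coeffs
print(np.allclose(gray_manual, gray_img, atol=1e-3))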
A color value can also be converted back from a real value in [0,1] to an integer in [0,255]; the following code converts the grayscale pixel values to integers in the [0,255] interval.
gray_img2 = gray_img*255
gray_imgs= gray_img2.astype(np.uint8)
print('The values of the first 5 rows and 5 columns of the
grayscale matrix: \n', gray_imgs[150:155,110:115])
As shown in Figure 6-16, suppose the left side is a 6 × 6 matrix (image) and the middle is a 3 × 3 matrix representing a convolution kernel. Slide the kernel matrix along the image "from top to bottom, from left to right"; for each image window encountered, a weighted sum with the kernel produces one value. For example, the weighted sum of the kernel with the window in the upper-left corner of the image is:
2*(-1) + 3*0 + 0*1 + 6*(-2) + 0*0 + 4*2 + 8*(-1) + 1*0 + 0*1 = -14
Continue to move the convolution kernel pixel by pixel to the right, and
generate new values in turn:
Figure 6-16 Valid convolution of 6x6 two-dimensional matrix data with 3x3
convolution kernel produces a 4x4 size matrix
Use w_{m,n} to denote the weight in row m and column n of the convolution kernel, and a_{i,j} to denote the element in row i and column j of the result matrix. The convolution formula for a two-dimensional matrix is:
a_{i,j} = ∑_{m=0}^{F_h−1} ∑_{n=0}^{F_w−1} w_{m,n} x_{i+m,j+n}
That is, the convolution kernel window is aligned with position (i, j) of the data matrix and a weighted sum is computed with the corresponding data window. For example, a_{1,1} in the example above is calculated as:
a_{1,1} = ∑_{m=0}^{F_h−1} ∑_{n=0}^{F_w−1} w_{m,n} x_{1+m,1+n}
Suppose the data matrix X has h rows and w columns, and the convolution kernel K has F_h rows and F_w columns; then the result matrix of a valid convolution has h−F_h+1 rows and w−F_w+1 columns. For the matrices below:
X: [[2 3 0 7 9 5]
[6 0 4 7 2 3]
[8 1 0 3 2 6]
[7 6 1 5 2 8]
[9 5 1 8 3 7]
[2 4 1 8 6 5]]
K: [[-1 0 1]
[-2 0 2]
[-1 0 1]]
array([[-14., 20., 7., -7.],
[-24., 10., 3., 5.],
[-28., 3., 6., 8.],
[-23., 9., 10., -2.]])
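The result above can be reproduced with a few lines of numpy; a minimal sketch of valid 2D convolution (the loop mirrors the formula above):
import numpy as np
X = np.array([[2,3,0,7,9,5],[6,0,4,7,2,3],[8,1,0,3,2,6],
              [7,6,1,5,2,8],[9,5,1,8,3,7],[2,4,1,8,6,5]])
K = np.array([[-1,0,1],[-2,0,2],[-1,0,1]])
H, W = X.shape
Fh, Fw = K.shape
out = np.zeros((H-Fh+1, W-Fw+1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(X[i:i+Fh, j:j+Fw]*K)
print(out)    # out[0, 0] is -14, matching the hand calculation above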
Figure 6-17 The result image obtained by convolving the image with the
convolution kernel with the function of "vertical edge extraction"
As can be seen, the vertical features of the resulting image are accentuated, indicating that this convolution kernel performs "vertical edge extraction".
In order to generate an image of the same size as the original, "same convolution" can also be used, that is, zeros are padded around the image. If the size of the weight matrix is F_h × F_w, the number of 0s padded on the left and right of the original image is (F_w − 1)/2, and the number padded on the top and bottom is (F_h − 1)/2; usually the weight matrix is a square matrix with equal height and width. As shown in Figure 6-18, using a 3×3 convolution kernel to perform same convolution on a 6×6 matrix, the original matrix is padded with (3−1)/2 = 1 ring of 0 values, and a 6×6 result matrix is obtained.
Figure 6-18 Using the 3x3 convolution kernel to perform the same
convolution on a 6x6 matrix results in a 6x6 matrix
The following code pads P_h and P_w 0 values on the top/bottom and left/right of the image:
H,W = X.shape
P_h,P_w = 1,2
X_padded = np.zeros((H + 2*P_h, W +2*P_w))
X_padded[P_h:-P_h, P_w:-P_w] = X
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 2. 3. 0. 7. 9. 5. 0. 0.]
[0. 0. 6. 0. 4. 7. 2. 3. 0. 0.]
[0. 0. 8. 1. 0. 3. 2. 6. 0. 0.]
[0. 0. 7. 6. 1. 5. 2. 8. 0. 0.]
[0. 0. 9. 5. 1. 8. 3. 7. 0. 0.]
[0. 0. 2. 4. 1. 8. 6. 5. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
The above code creates the padded image X_padded according to the padding sizes. In fact, numpy provides a ready-made pad() function that pads the front and back of each axis of a multidimensional array:
np.pad(x, [(1, 0), (1, 2)], mode='constant', constant_values=0)
The second parameter [(1, 0), (1, 2)] gives the number of elements to pad before and after each axis of the numpy array x: the first tuple (1, 0) pads 1 element before and 0 after the first axis (axis=0), and the second tuple (1, 2) pads 1 element before and 2 after the second axis (axis=1). mode='constant' means padding with a constant value, and constant_values=0 sets that constant to 0; these two parameters can be omitted.
The following code pads one row of 0s before the first row of array a, one column of 0s before the first column, and two columns of 0s after the last column.
import numpy as np
a = np.array([[ 1., 1., 1.],
[ 1., 1., 1.]])
b = np.pad(a, [(1, 0), (1, 2)], mode='constant')
print(a)
print(b)
[[1. 1. 1.]
[1. 1. 1.]]
[[0. 0. 0. 0. 0. 0.]
[0. 1. 1. 1. 0. 0.]
[0. 1. 1. 1. 0. 0.]]
The following code first pads (K_h-1)//2 and (K_w-1)//2 elements on the top/bottom and left/right of the image according to the height K_h and width K_w of the convolution kernel, and then convolves the padded image. The corresponding python code is as follows:
def convolve2d_same(X, K):
    H, W = X.shape
    K_h, K_w = K.shape
    P_h, P_w = (K_h-1)//2, (K_w-1)//2
    X_padded = np.pad(X, [(P_h, P_h), (P_w, P_w)], mode='constant')
    Y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            Y[i, j] = np.sum(X_padded[i:i+K_h, j:j+K_w]*K)
    return Y
image = gray_img     # the grayscale image loaded earlier
kernel = np.array([[-1,-1,-1],[-1,8,-1],[-1,-1,-1]])
edges = convolve2d_same(image, kernel)
plt.imshow(edges, cmap=plt.cm.gray)
Figure 6-22 The smoothing convolution kernel smoothes the image, and the
resulting image becomes blurred
Span (stride)
The span of the convolution above is 1: the kernel moves pixel by pixel "from top to bottom, from left to right", and the resulting image size is close to the original. To generate a smaller convolution image, such as one half the size of the original, the kernel can slide 2 pixels at a time in the horizontal and vertical directions; that is, the convolution is performed with a span of 2. Span is also sometimes referred to as "stride".
In general, if the input image is H × W, the kernel is F_h × F_w, the total numbers of elements padded top-and-bottom and left-and-right are P_h and P_w, and the vertical and horizontal strides are S_h and S_w, the output two-dimensional feature map has size:
(H − F_h + P_h)/S_h + 1
(W − F_w + P_w)/S_w + 1
As shown in Figure 6-23, for a 6×6 input image, a 3×3 convolution kernel, a stride of 2, and 1 element of padding on each side, the resulting output image has size ((6+2−3)/2+1) × ((6+2−3)/2+1) = 3×3 (using integer division).
def convolve2d(X, K, pad=(0,0), stride=(1,1)):
    H, W = X.shape
    K_h, K_w = K.shape
    P_h, P_w = pad
    S_h, S_w = stride
    h = (H - K_h + 2*P_h)//S_h + 1
    w = (W - K_w + 2*P_w)//S_w + 1
    Y = np.zeros((h, w))
    if P_h != 0 or P_w != 0:
        X_padded = np.pad(X, [(P_h, P_h), (P_w, P_w)], mode='constant')
    else:
        X_padded = X
    for i in range(Y.shape[0]):
        hs = i*S_h
        for j in range(Y.shape[1]):
            ws = j*S_w
            Y[i, j] = (X_padded[hs:hs+K_h, ws:ws+K_w]*K).sum()
    return Y
image = gray_img
kernel = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])
image_filtered = convolve2d(image,kernel,(1,1),(2,2))
plt.imshow(image_filtered, cmap=plt.cm.gray)
print("Original image size:", image.shape)
print("Result image size:", image_filtered.shape)
The convolution above handles a single channel. For a multi-channel input X of shape (C, H, W) convolved with one multi-channel kernel K of shape (C, F_h, F_w), the computation is similar; the definition enclosing this fragment was truncated in the source and is reconstructed minimally here:
def convolve2d_multi_in(X, K, pad=(0,0), stride=(1,1)):
    C, H, W = X.shape
    _, F_h, F_w = K.shape
    P_h, P_w = pad
    S_h, S_w = stride
    h = (H + 2*P_h - F_h)//S_h + 1
    w = (W + 2*P_w - F_w)//S_w + 1
    Y = np.zeros((h, w))    # convolution output
    if P_h != 0 or P_w != 0:
        X_padded = np.pad(X, [(0,0),(P_h,P_h),(P_w,P_w)], mode='constant')
    else:
        X_padded = X
    for i in range(h):
        hs = i*S_h
        for j in range(w):
            ws = j*S_w
            Y[i, j] = (X_padded[:, hs:hs+F_h, ws:ws+F_w]*K).sum()
    return Y
Figure 6-29 A 3 × 3 window performs pooling on a 6 × 6 matrix (image), producing a resulting image of size
4 × 4
Of course, the average of the values in the pooling window can also be taken as the output; this is called average pooling. Average pooling works like max pooling except that the elements in the data window are averaged instead of maximized.
Like the convolution window, the pooling window is usually square. The span of the pooling operation above is 1, i.e., the window moves one pixel at a time; in practice the span of a pooling operation is usually equal to the length or width of the pooling window.
As shown in Figure 6-30, the window length and span of the pooling operation are both 3, resulting in a result
image of 2 × 2, while the original image size is 6x6.
Figure 6-30 3 × 3 pooling window, performing a pooling operation with a span of 3 on 6 × 6, resulting in a result
image of 2 × 2
The main goal of pooling is to alleviate the excessive sensitivity of the convolution operation to position. The pooling layer retains the main features of the original image, and a pooling operation with span greater than 1 reduces the image size multiplicatively, producing a smaller feature map; this reduces the amount of computation in subsequent layers and improves efficiency.
Unlike the convolution operation, which convolves all channels of the input with one kernel, pooling is usually performed on each channel separately. The output therefore has as many channels as the input, as shown in Figure 6-31: if the input has 64 channels, the output also has 64 channels, i.e., each input channel generates one output channel.
Figure 6-31 Pooling is for each channel, and the pooling operation does not change the number of channels
Like convolution, a pooling operation has a span and may pad the original image before pooling. Analogous to the convolution code, the following pool2d() performs pooling on single-channel input data X:
def pool2d(X, pool, stride=(1,1), padding=(0,0), mode='max'):
    pool_h, pool_w = pool
    S_h, S_w = stride
    P_h, P_w = padding
    # fill
    if P_h or P_w:
        X_padded = np.pad(X, [(P_h,P_h),(P_w,P_w)], mode='constant')
    else:
        X_padded = X
    h = (X.shape[0] - pool_h + 2*P_h)//S_h + 1
    w = (X.shape[1] - pool_w + 2*P_w)//S_w + 1
    Y = np.zeros((h, w), dtype=X.dtype)
    for i in range(h):
        hs = i*S_h
        for j in range(w):
            ws = j*S_w
            win = X_padded[hs:hs+pool_h, ws:ws+pool_w]
            Y[i, j] = win.max() if mode == 'max' else win.mean()
    return Y
For the two-dimensional matrix shown in Figure 6-30 (X is the 6 × 6 matrix above), max pooling and average pooling with a span of 3 and a 3 × 3 window give:
pool2d(X,(3,3),(3,3),(0,0),mode ='max')
array([[8, 9],
       [9, 8]])
pool2d(X,(3,3),(3,3),(0,0),mode ='avg')
array([[2, 4],
[4, 5]])
Similarly, for multi-channel input, the single-channel pooling operation is simply performed on each channel. The following pool() function for multi-channel input data X adds an outer loop over the channels (for c in range(Y.shape[0])) around the original pooling loops:
def pool(X, pool, stride=(1,1), padding=(0,0), mode='max'):
    pool_h, pool_w = pool
    S_h, S_w = stride
    P_h, P_w = padding
    if P_h or P_w:
        X_padded = np.pad(X, [(0,0),(P_h,P_h),(P_w,P_w)], mode='constant')
    else:
        X_padded = X
    Y_h = (X.shape[1]-pool_h+2*P_h)//S_h+1
    Y_w = (X.shape[2]-pool_w+2*P_w)//S_w+1
    Y = np.zeros((X.shape[0], Y_h, Y_w), dtype=X.dtype)
    for c in range(Y.shape[0]):
        for i in range(Y.shape[1]):
            hs = i*S_h
            for j in range(Y.shape[2]):
                ws = j*S_w
                if mode == 'max':
                    Y[c,i, j] = X_padded[c, hs:hs+pool_h, ws:ws+pool_w].max()
                elif mode == 'avg':
                    Y[c,i, j] = X_padded[c, hs:hs+pool_h, ws:ws+pool_w].mean()
    return Y
(2, 3, 3)
(2, 2, 2)
array([[[ 4, 5],
[ 7, 8]],
[[11, 16],
[71, 16]]])
Using the pool() function with a 5 × 5 window and a span of (2,2) to pool the image, the resulting image is only about half the size of the original:
img = np.moveaxis(lenna_img, -1, 0)   # np.rollaxis(lenna_img, 2, 0)
pooled_img = pool(img, [5,5], (2,2))
pooled_img = np.moveaxis(pooled_img, 0, -1)  # move the channel axis 0 back to the last axis -1
plt.imshow(pooled_img, cmap=plt.cm.gray)
print("Original image size:", img.shape)
print("Result image size:", pooled_img.shape)
Figure 6-32 Pooling the image with the stride (2,2), the resulting image is only half the size of the original image
Unlike fully connected neurons, a convolutional neuron uses a convolution kernel to convolve the input sample, and the number of kernel parameters is often much smaller than the number of features in the sample. For example, for a 3 × 64 × 64 color image, a convolutional neuron with a 3 × 4 × 4 convolution kernel has only 48 parameters. This small number of weight parameters, compared with fully connected neurons, helps prevent overfitting. Moreover, a fully connected neuron produces a single output value, so a fully connected layer needs many neurons to extract enough features, whereas a convolutional neuron produces a feature map of many output values; a convolutional layer composed of convolutional neurons therefore needs only a small number of neurons.
As shown in Figure 6-33, unlike a fully connected neuron that outputs a single value, the convolution kernel moves along the input "from top to bottom, from left to right". Each time the kernel window aligns with a data window it generates one output value, so moving the kernel along the input produces many output values arranged as regularly as the original data. These regularly arranged output values are called a feature map: a multi-channel image is input and a single-channel feature map is output. The convolution operation preserves and captures the inherent spatial structure between adjacent data elements (pixels) of the original data, so the inherent characteristics of the data are captured better, improving the quality of the neural network function.
Figure 6-33 The data window aligned with the kernel window of a convolutional neuron generates one output value; as the kernel window moves along the input data, the series of regularly arranged output values constitutes an output feature map. Multiple convolution kernels generate multiple feature maps.
For a multi-channel input tensor, the operation of the convolutional neuron can be expressed by the following
formula:
a_{i′,j′} = g(∑_{k=0}^{F_k−1} ∑_{i=0}^{F_h−1} ∑_{j=0}^{F_w−1} w_{i,j,k} x_{i+i′,j+j′,k} + b)
Like the neurons seen before, each convolutional neuron also has a bias b and an activation function g, and the result of the convolution is output after passing through the activation function. Although the convolution kernel of a convolutional neuron has the same number of channels as the input, its spatial size is usually much smaller than that of the input image.
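As an illustration, a minimal numpy sketch of a single convolutional neuron with a 3-channel 3 × 3 kernel, a bias, and a Relu activation (all values random, for shape-checking only):
import numpy as np
def relu(z): return np.maximum(0, z)
np.random.seed(0)
x = np.random.randn(3, 5, 5)      # 3-channel input
w = np.random.randn(3, 3, 3)      # one convolutional neuron's kernel
b = 0.1
oH, oW = 5-3+1, 5-3+1             # valid convolution output size
a = np.zeros((oH, oW))
for i in range(oH):
    for j in range(oW):
        a[i, j] = relu((w * x[:, i:i+3, j:j+3]).sum() + b)
print(a.shape)                    # (3, 3): a single-channel feature map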
If a convolutional layer has multiple neurons, neuron k′ computes:
a_{i′,j′,k′} = g_{k′}(∑_{k=0}^{F_k−1} ∑_{i=0}^{F_h−1} ∑_{j=0}^{F_w−1} w_{i,j,k,k′} x_{i+i′,j+j′,k} + b_{k′})
For a multi-channel input, each convolutional neuron (with the same number of channels as the input) produces one output feature map. If a convolutional layer has k′ neurons, k′ feature maps, i.e. k′ output channels, are generated: the neurons of a convolutional layer generate the same number of feature maps. As shown in Figure 6-34, for 3-channel input data, 2 convolutional neurons output a 2-channel feature map (each circle represents one convolutional neuron). Just as the output of one fully connected layer can be the input of the next, the multi-channel feature map output by one convolutional layer can be the multi-channel input of the next convolutional layer.
Figure 6-34 Each convolutional neuron is a 3 × 3 × 3 convolution kernel; for a 3 × 3 × 3 local data window of the 3-channel input, the 2 convolutional neurons produce 2 output values. For 3-channel input, each convolutional neuron produces a single-channel feature map, and 2 convolutional neurons produce a 2-channel feature map.
A convolutional layer is usually followed by a pooling layer, which performs a simple pooling operation such as max pooling or average pooling. The pooling layer pools each input feature map separately to generate a new feature map, so pooling does not change the number of feature maps: inputting 3 feature maps generates 3 output feature maps, i.e., the number of output channels equals the number of input channels.
The role of the pooling layer is to reduce the size of the feature maps output by the convolutional layer and improve training efficiency; the pooling layer contains no model parameters. As shown in Figure 6-35, the input is a single-channel 10 × 10 feature map (matrix or image); a convolutional layer of 6 3 × 3 convolutional neurons with a stride of 1 produces 6 8 × 8 feature maps, and a subsequent pooling layer with a 2 × 2 window and a span of 2 produces 6 4 × 4 feature maps.
Figure 6-35 A single-channel 1 × 10 × 10 input feature map passes through 6 3 × 3 convolutional neurons with a span of 1, producing 6 8 × 8 feature maps (6 × 8 × 8), and then through a pooling layer with a 2 × 2 window and a span of 2, producing 6 feature maps of size 4 × 4
A neural network containing convolutional layers is called a convolutional neural network. Its network layers include both convolutional layers and fully connected layers: usually the initial layers are convolutional, and the last layers close to the output are fully connected.
Figure 6-36 shows a typical convolutional neural network structure. The input is a single-channel 28 × 28 image (feature map). A convolutional layer of 8 5 × 5 convolutional neurons with a stride of 1 produces 8 24 × 24 feature maps (8 output channels); a pooling layer with a 2 × 2 window and a span of 2 then outputs 8 channels of 12 × 12 feature maps. A convolutional layer of 16 5 × 5 convolutional neurons with a span of 1 then generates 16 8 × 8 feature maps, and another pooling layer with a 2 × 2 window and a span of 2 outputs 16 channels of 4 × 4 feature maps. A fully connected layer of 64 neurons then produces 64 output values, and these pass through a final fully connected layer of 10 neurons that outputs 10 values.
Figure 6-36 A single-channel 1 × 28 × 28 input passes through 8 5 × 5 convolutional neurons with a span of 1 to output 8 24 × 24 feature maps; a pooling layer with span 2 and a 2 × 2 window outputs 8 12 × 12 feature maps; 16 5 × 5 convolutional neurons with a span of 1 output 16 8 × 8 feature maps; a pooling layer with span 2 and a 2 × 2 window outputs 16 4 × 4 feature maps; a flattening operation then converts these into a vector of length 256, a fully connected layer outputs a vector of length 64, and a final fully connected layer outputs a vector of length 10
When the feature maps of the last convolutional (or pooling) layer are passed to the fully connected layers, they are flattened, i.e., converted into a one-dimensional vector, which the fully connected neurons then process and output. Note that in Figure 6-36 the convolutional layers themselves are not drawn; only their input and output feature maps are shown.
Convolutional neural networks were originally used mainly for computer vision and image processing problems, such as classifying an input image. For image data, convolutional neural networks usually contain multiple "convolution + pooling" stages that extract image features from low level to high level while pooling reduces the image size. In the last stage of the network, the small multi-channel feature map is expanded into a one-dimensional vector, the so-called feature map flattening (flatten) operation, and this one-dimensional feature vector is further transformed by fully connected layers composed of fully connected neurons.
The three most commonly used layers of a convolutional neural network are therefore the convolutional layer, the pooling layer (usually max pooling), and the fully connected layer, usually abbreviated conv, pool, and fc. For example, a network structure can be described with the following shorthand:
INPUT -> [[CONV -> Relu]*N -> POOL?]*M -> [FC -> Relu]*K -> FC
Here *N means that [CONV -> Relu] is repeated N times, *M means that the combination [[CONV -> Relu]*N -> POOL?] is repeated M times, and similarly *K means that the fully connected block [FC -> Relu] is repeated K times, i.e., there are K such fully connected layers. Relu indicates that the activation function is Relu. Convolutional layers generally use the Relu activation function because functions such as σ(x) have derivatives that become very small as the absolute value of x grows, so the gradient cannot be transmitted effectively during the reverse derivation; this is the "vanishing gradient" problem, which becomes more serious as the network depth increases, and the Relu function does not have this problem.
Convolution kernels with different weights can extract different features. As shown in Figure 6-37, a convolution
layer uses multiple convolution kernels to extract different feature maps.
Figure 6-37 Convolution kernels with different weights extract different features, and one convolution layer can
use multiple convolution kernels to obtain different feature maps
Applying convolution over multiple layers generates hierarchical convolution images, extracting features of different granularity from low-level features to high-level features. Figure 6-38 shows convolution images of features at different levels obtained by applying convolution.
Figure 6-38 Multiple convolutional layers extract features from low level to high level: the convolutional layer close to the input extracts image edge or color features, subsequent layers extract edge intersections or color shading, and further convolutional layers extract meaningful structures or objects
It can be seen that the convolutional layers close to the input find edge or color features of the image; subsequent layers build more complex structures on this basis, such as intersections of edges or shaded colors; and later layers combine all of these to recognize meaningful structures or objects in the image, gradually extracting ever higher-level features. This extraction process, from the lowest-level edge features to high-level shape features, resembles the way humans observe the world.
6.2.3 Reverse derivation and code implementation of convolutional layer and pooling
layer
A convolutional neural network differs from a fully connected one in its convolutional (and pooling) layers; adding convolutional and pooling layers to the fully connected neural network implemented earlier is enough to realize a convolutional neural network. Section 6.1 has already implemented the forward calculation of the convolutional and pooling layers; the key is how to implement their reverse derivation.
Consider first the one-dimensional convolution z_i = w_0 x_i + w_1 x_{i+1} + ⋯ + w_{K−1} x_{i+K−1} + b. Because z_i is a weighted sum of the window x[i : i+K], we have
∂z_i/∂w = (x_i, x_{i+1}, ⋯, x_{i+K−1})
therefore:
dw = ∂L/∂w = ∑_i (∂L/∂z_i)(∂z_i/∂w) = ∑_i (∂L/∂z_i)(x_i, x_{i+1}, ⋯, x_{i+K−1})
For example, for K = 3:
z0 = x0 w0 + x1 w1 + x2 w2 + b
z1 = x1 w0 + x2 w1 + x3 w2 + b
therefore:
(∂L/∂z_0)(∂z_0/∂w) = ((∂L/∂z_0)x_0, (∂L/∂z_0)x_1, (∂L/∂z_0)x_2)
(∂L/∂z_1)(∂z_1/∂w) = ((∂L/∂z_1)x_1, (∂L/∂z_1)x_2, (∂L/∂z_1)x_3)
therefore,
∂L/∂w = (∂L/∂w_0, ∂L/∂w_1, ∂L/∂w_2) = ∑_i (∂L/∂z_i)(∂z_i/∂w)
      = (∂L/∂z_0)(x_0, x_1, x_2) + (∂L/∂z_1)(x_1, x_2, x_3) + ⋯ + (∂L/∂z_7)(x_7, x_8, x_9)
Figure 6-39 shows how (∂L/∂z_0)(x_0, x_1, x_2), (∂L/∂z_1)(x_1, x_2, x_3), ⋯ accumulate into (∂L/∂w_0, ∂L/∂w_1, ∂L/∂w_2).
And ∂L/∂b = ∑_i (∂L/∂z_i)(∂z_i/∂b) = ∑_i ∂L/∂z_i, that is, all the ∂L/∂z_i are added up. Using dw, dz, db to represent ∂L/∂w, ∂L/∂z, ∂L/∂b, the following python code computes dw and db:
for i in range(z.size):
    dw += x[i:i+K]*dz[i]
db = dz.sum()
How is the gradient of L with respect to the input x computed? Since z_i depends only on x_i, ⋯, x_{i+K−1}, we have ∂z_i/∂x_j = 0 for all other j, i.e.:
∂z_i/∂x = (0, ⋯, 0, ∂z_i/∂x_i, ⋯, ∂z_i/∂x_{i+K−1}, 0, ⋯) = (0, ⋯, 0, w_0, ⋯, w_{K−1}, 0, ⋯)
Therefore, z_i contributes to the partial derivatives of the loss L only with respect to x_i, x_{i+1}, ⋯, x_{i+K−1}, and:
∂L/∂x[i : i+K] += (∂L/∂z_i) w
For example, for z_0:
(∂L/∂x_0, ∂L/∂x_1, ∂L/∂x_2) += ((∂L/∂z_0)w_0, (∂L/∂z_0)w_1, (∂L/∂z_0)w_2)
that is:
(∂L/∂x_0, ∂L/∂x_1, ∂L/∂x_2) += (∂L/∂z_0) w
The corresponding code accumulates w*dz[i] into the window of dx:
for i in range(z.size):
    dx[i:i+K] += w*dz[i]
For a convolution with span S, each z_i is the weighted sum of the data window x[i*S : i*S+K], so the formulas above extend to padded and strided convolution:
∂L/∂w = ∑_{i=0}^{(n−K)//S} (∂L/∂z_i) x[i*S : i*S+K]
∂L/∂x[i*S : i*S+K] += (∂L/∂z_i) w
Of course, for convolution with padding, x must be padded before convolving, and the same padding applies in the reverse derivation. The reverse derivation of a convolution with stride and padding can therefore be handled with the following python code:
def conv_backward(dz, x, w, p=0, s=1):
    n, K = len(x), len(w)
    o_n = 1 + (n + 2*p - K) // s
    assert o_n == len(dz)
    x_pad = np.pad(x, p, mode='constant') if p > 0 else x
    dx_pad = np.zeros(n + 2*p)
    dw = np.zeros_like(w)
    db = dz.sum()
    for i in range(o_n):
        start = i * s
        dw += x_pad[start:start+K]*dz[i]
        dx_pad[start:start+K] += w*dz[i]
    dx = dx_pad[p:-p] if p > 0 else dx_pad
    return dx, dw, db
# simulated upstream gradient; with stride 1 and pad 1 the output
# length is 1 + (n + 2 - K)
dz = np.random.randn(1 + (x.size + 2*1 - w.size))
print(dz)
dx, dw, db = conv_backward(dz, x, w, 1)
print(dx)
print(dw)
print(db)
In the same way, the reverse derivation of one-dimensional convolution extends to two-dimensional convolution with multiple input and output channels. Figure 6-39 is a schematic diagram of the gradient computation for a single input channel and a single output channel.
Figure 6-39 z00 = x00·w00 + x01·w01 + x10·w10 + x11·w11; its gradients with respect to w00, w01, w10, w11 are x00, x01, x10, x11 respectively, and its gradients with respect to x00, x01, x10, x11 are w00, w01, w10, w11
For the two-dimensional case:
∂L/∂w = ∑_{ij} (∂L/∂z_{ij})(∂z_{ij}/∂w)
Each z_{ij} is the dot product of a data window and the kernel w, namely z_{ij} = x[i : i+K_h, j : j+K_w] ⋅ w, and
∂z_{ij}/∂w_{u,v} = x_{i+u,j+v}
Therefore ∂z_{ij}/∂w can be written as a matrix with the same shape as w:
∂z_{ij}/∂w = x[i : i+K_h, j : j+K_w]
therefore
∂L/∂w = ∑_{ij} (∂L/∂z_{ij}) x[i : i+K_h, j : j+K_w]
Similarly, because ∂z_{ij}/∂b = 1:
∂L/∂b = ∑_{ij} (∂L/∂z_{ij})(∂z_{ij}/∂b) = ∑_{ij} ∂L/∂z_{ij}
because
∂L/∂x = ∑_{ij} (∂L/∂z_{ij})(∂z_{ij}/∂x)
and z_{ij} = x[i : i+K_h, j : j+K_w] ⋅ w depends only on the data window starting at x_{ij}, namely x[i : i+K_h, j : j+K_w], with:
∂z_{ij}/∂x_{i+u,j+v} = w_{u,v}
therefore:
∂z_{ij}/∂x[i : i+K_h, j : j+K_w] = w
Therefore, (∂L/∂z_{ij}) w just needs to be added to the window [i : i+K_h, j : j+K_w] of ∂L/∂x, namely:
∂L/∂x[i : i+K_h, j : j+K_w] += (∂L/∂z_{ij}) w
For convolution with padding and span, x must first be padded before the reverse derivation, and the data window corresponding to each z_{ij} found according to the span S; the formulas become:
∂L/∂b = ∑_{ij} ∂L/∂z_{ij}
∂L/∂x[i*S : i*S+K_h, j*S : j*S+K_w] += (∂L/∂z_{ij}) w
The above are the formulas for single-channel input and single-channel output. For multi-channel input x, the weight tensor w of the convolution kernel is a 3D kernel; the gradient formulas are the same except for the extra channel axis:
∂L/∂w = ∑_{ij} (∂L/∂z_{ij}) x[:, i*S : i*S+K_h, j*S : j*S+K_w]
∂L/∂x[:, i*S : i*S+K_h, j*S : j*S+K_w] += (∂L/∂z_{ij}) w
For multi-channel output, w in the formulas above is replaced by the weight tensor w^f of each output channel f. Because x contributes to the feature map z^f of every output channel, the gradient with respect to x accumulates over the output channels:
∂L/∂x[:, i*S : i*S+K_h, j*S : j*S+K_w] += ∑_f (∂L/∂z^f_{ij}) w^f
∂L/∂w^f = ∑_{ij} (∂L/∂z^f_{ij}) x[:, i*S : i*S+K_h, j*S : j*S+K_w]
∂L/∂b^f = ∑_{ij} ∂L/∂z^f_{ij}
For a batch of samples, the gradients ∂L/∂w^f and ∂L/∂b^f are accumulated over all samples, whereas the inputs x of different samples are independent of each other, so ∂L/∂x is computed per sample and not accumulated across samples.
On the basis of the previous Layer class, the following code defines a Conv class representing the convolutional layer, which performs forward calculation (forward) and reverse derivation (backward) for multiple samples, multiple input channels, and multiple output channels. Conv's constructor accepts the number of input and output channels and the parameters of the convolution operation (kernel size, stride, padding). The forward() method accepts a multi-channel input X and generates the multi-channel output Z of the convolution. The backward() method accepts the gradient dZ of the loss function with respect to the output Z of the convolutional layer and computes the gradients of the loss with respect to the convolution parameters (W, b) and the input X.
import numpy as np
from init_weights import *

class Layer:
    def __init__(self):
        self.params = None
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad):
        raise NotImplementedError
    def reg_grad(self, reg):
        pass
    def reg_loss(self, reg):
        return 0.
    def reg_loss_grad(self, reg):
        return 0
class Conv(Layer):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.C = in_channels
        self.F = out_channels
        self.K = kernel_size
        self.S = stride
        self.P = padding
        # the filters W form a 4d array of shape (F, C, K, K);
        # Xavier initialization could also be used
        self.W = np.random.randn(self.F, self.C, self.K, self.K)  # /(self.K*self.K)
        self.b = np.random.randn(out_channels,)
        self.params = [self.W, self.b]
        self.grads = [np.zeros_like(self.W), np.zeros_like(self.b)]
        self.X = None
        self.reset_parameters()
    def reset_parameters(self):
        kaiming_uniform(self.W, a=math.sqrt(5))
        if self.b is not None:
            # fan_in, _ = calculate_fan_in_and_fan_out(self.K)
            fan_in = self.C
            bound = 1 / math.sqrt(fan_in)
            self.b[:] = np.random.uniform(-bound, bound, (self.b.shape))
    def __call__(self, X):
        return self.forward(X)
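    # The forward() listing is missing at this point in the source; below is a
    # minimal naive sketch (an assumption), consistent with what backward()
    # expects: it stores self.X and produces Z of shape (N, F, Z_h, Z_w).
    def forward(self, X):
        self.X = X
        N, C, X_h, X_w = X.shape
        F, _, F_h, F_w = self.W.shape
        pad, S = self.P, self.S
        Z_h = (X_h + 2*pad - F_h)//S + 1
        Z_w = (X_w + 2*pad - F_w)//S + 1
        X_pad = np.pad(X, [(0,0),(0,0),(pad,pad),(pad,pad)], mode='constant')
        Z = np.zeros((N, F, Z_h, Z_w))
        for n in range(N):
            for f in range(F):
                for i in range(Z_h):
                    hs = i*S
                    for j in range(Z_w):
                        ws = j*S
                        Z[n,f,i,j] = (X_pad[n,:,hs:hs+F_h,ws:ws+F_w]*self.W[f]).sum() + self.b[f]
        return Z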
    def backward(self, dZ):
        """A naive implementation of the backward pass for a convolutional layer.
        Input: dZ - upstream gradient of the loss with respect to the output Z.
        Accumulates the gradients with respect to W and b into self.grads and
        returns dX, the gradient with respect to the input X."""
        N, F, Z_h, Z_w = dZ.shape
        N, C, X_h, X_w = self.X.shape
        F, _, F_h, F_w = self.W.shape
        pad = self.P
        X_pad = np.pad(self.X, [(0,0),(0,0),(pad,pad),(pad,pad)], mode='constant')
        dX_pad = np.zeros_like(X_pad)
        dW = np.zeros_like(self.W)
        db = np.zeros_like(self.b)
        for n in range(N):
            for f in range(F):
                db[f] += dZ[n, f].sum()
                for i in range(Z_h):
                    hs = i * self.S
                    for j in range(Z_w):
                        ws = j * self.S
                        # W[f] aligned with the padded input window
                        dW[f] += X_pad[n, :, hs:hs+F_h, ws:ws+F_w]*dZ[n, f, i, j]
                        dX_pad[n, :, hs:hs+F_h, ws:ws+F_w] += self.W[f]*dZ[n, f, i, j]
        # "unpad"
        dX = dX_pad[:, :, pad:pad+X_h, pad:pad+X_w]
        self.grads[0] += dW
        self.grads[1] += db
        return dX
    def reg_loss(self, reg):
        return reg*np.sum(self.W**2)
    def reg_loss_grad(self, reg):
        self.grads[0] += 2*reg * self.W
        return reg*np.sum(self.W**2)
Here N is the number of samples, C the number of input channels, and F the number of output channels. The reverse derivation loops over each sample (for n in range(N)) and each output channel (for f in range(F)), computing db_f = ∂L/∂b_f, dW_f = ∂L/∂w_f, and dX[n] = ∂L/∂x[n].
Conv's forward calculation method forward() can be tested with randomly generated input data, printing the values of the first channel of the first sample:
np.random.seed(1)
x = np.random.randn(4, 3, 5, 5)
conv = Conv(3,2,3,1,1)
f = conv.forward(x)
print(f.shape)
print(f[0,0],"\n")
(4, 2, 5, 5)
[[ 0.46362714 -0.83578144 0.40298519 -0.32152652 0.56616046]
[-0.47878018 1.02346756 0.20004975 0.59663092 0.25253169]
[-0.39733747 -0.08368194 0.52454712 0.54133918 -0.32698456]
[0.47703053 -0.01967369 1.13655418 0.22321357 0.77693417]
[-0.23944267 0.62971182 -0.38411731 0.42818679 -0.07566246]]
The backward() method accepts the gradient of the loss function with respect to the output f and computes the gradients of the loss with respect to the convolution parameters (W, b) and the input X. To test the method, the following code feeds it a simulated gradient, denoted df = ∂L/∂f:
df = np.random.randn(4, 2, 5, 5)
dx= conv.backward(df)
print(df[0,0],"\n")
print(dx[0,0],"\n")
print(conv.grads[0][0,0],"\n")
print(conv.grads[1],"\n")
[11.528173 7.46555585]
The reverse derivation of the convolutional layer is comparatively complicated. The numerical gradient function numerical_gradient_from_df() in the earlier util.py can be used to compute numerical gradients and compare them with the analytical gradients of the reverse derivation to check its correctness.
import util
def f():
    return conv.forward(x)
dw_num = util.numerical_gradient_from_df(f, conv.W, df)
db_num = util.numerical_gradient_from_df(f, conv.b, df)
dx_num = util.numerical_gradient_from_df(f, x, df)
diff_error = lambda x, y: np.max(np.abs(x - y))
print(diff_error(conv.grads[0], dw_num))
print(diff_error(conv.grads[1], db_num))
print(diff_error(dx, dx_num))
6.533440455314121e-11
3.7474023883987684e-11
3.998808228988793e-11
The numerical and analytical gradients of the loss function with respect to the model parameters w, b and the input x agree.
Next consider the reverse derivation of the pooling layer. The forward calculation of max pooling is:
z_{ij} = max(x[i : i+K_h, j : j+K_w])
For example, if the maximum of the shaded data window is x_11 = z_00, then ∂L/∂x_11 = (∂L/∂z_00)(∂z_00/∂x_11) = ∂L/∂z_00 is nonzero, while ∂L/∂x_{ij} = 0 for all ij ≠ 11.
Figure 6-40 The result z00 of the shaded data window produced by the pool operation is equal to x11, so only ∂L/∂x11 ≠ 0
Therefore, the gradient calculation of the max pooling layer is simple. For z_{ij} = max(x[i : i+K_h, j : j+K_w]), let x_{i+u,j+v} be the data value equal to z_{ij} in the window x[i : i+K_h, j : j+K_w]; simply add each ∂L/∂z_{ij} to the partial derivative ∂L/∂x_{i+u,j+v} corresponding to this x_{i+u,j+v}.
The pooling layer can be implemented as a Pool class (the class head was truncated in the source and is reconstructed minimally here):
class Pool(Layer):
    def __init__(self, pool):
        super().__init__()
        # pool = (pool_h, pool_w, stride)
        self.pool_h, self.pool_w, self.stride = pool
    def forward(self, x):
        self.x = x
        N, C, H, W = x.shape
        pool_h, pool_w, stride = self.pool_h, self.pool_w, self.stride
        h_out = 1 + (H - pool_h) // stride
        w_out = 1 + (W - pool_w) // stride
        out = np.zeros((N, C, h_out, w_out))
        for n in range(N):
            for c in range(C):
                for i in range(h_out):
                    si = stride*i
                    for j in range(w_out):
                        sj = stride*j
                        x_win = x[n, c, si:si+pool_h, sj:sj+pool_w]
                        out[n,c,i,j] = np.max(x_win)
        return out
    def backward(self, dout):
        x = self.x
        N, C, H, W = x.shape
        kH, kW, stride = self.pool_h, self.pool_w, self.stride
        oH = 1 + (H - kH) // stride
        oW = 1 + (W - kW) // stride
        dx = np.zeros_like(x)
        for k in range(N):
            for l in range(C):
                for i in range(oH):
                    si = stride * i
                    for j in range(oW):
                        sj = stride * j
                        slice = x[k, l, si:si+kH, sj:sj+kW]
                        slice_max = np.max(slice)
                        # route the gradient only to the max element(s)
                        dx[k, l, si:si+kH, sj:sj+kW] += (slice_max == slice)*dout[k, l, i, j]
        return dx
Similarly, numerical gradients can be used to verify the correctness of the analytical gradient of the Pool class:
x = np.random.randn(3, 2, 8, 8)
df = np.random.randn(3, 2, 4, 4)
pool = Pool((2,2,2))
f = pool.forward(x)
dx = pool.backward(df)
dx_num = util.numerical_gradient_from_df(lambda: pool.forward(x), x, df)
print(np.max(np.abs(dx - dx_num)))
1.680655614677562e-11
The convolutional-layer neurons implemented above omit the activation function. The activation function can be added to the Conv class just as for fully connected neurons, or (as in Chapter 4) the weighted sum z of the convolutional and fully connected layers can be passed through an activation function implemented as a separate class that outputs the activation value a:
a = g(z)
∂L/∂z = (∂L/∂a) g′(z)
The following is the implementation of the forward calculation and reverse derivation corresponding to the
activation function Relu:
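The Relu listing itself is missing at this point in the source; below is a minimal sketch consistent with the Layer interface used above:
class Relu(Layer):
    def forward(self, x):
        self.x = x
        return np.maximum(0, x)
    def backward(self, grad):
        # g'(z) is 1 where z > 0 and 0 elsewhere
        return grad * (self.x > 0)
The layers are assembled by a NeuralNetwork container class: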
class NeuralNetwork:
    def __init__(self):
        self._layers = []
        self._params = []
    def add_layer(self, layer):
        # each entry of _params is a [parameter, gradient] pair, as assumed
        # by zero_grad() below (the source listing is truncated here)
        self._layers.append(layer)
        if layer.params:
            for i, _ in enumerate(layer.params):
                self._params.append([layer.params[i], layer.grads[i]])
    def reg_loss(self, reg):
        reg_loss = 0
        for i in range(len(self._layers)):
            reg_loss += self._layers[i].reg_loss(reg)
        return reg_loss
    def parameters(self):
        return self._params
    def zero_grad(self):
        for i, _ in enumerate(self._params):
            self._params[i][1] *= 0.
To test the convolutional layer, first read the training set of MNIST handwritten digits:
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
    # Load the dataset
    urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
                               "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
train_X, train_y = train_set
print(train_X.shape)
train_X = train_X.reshape(-1, 1, 28, 28)   # (N, C, H, W) for the Conv layer
print(train_X.shape)
(50000, 784)
(50000, 1, 28, 28)
The convolutional neural network defined below classifies and trains on MNIST handwritten digit recognition:
import train
#from NeuralNetwork import *
import time
np.random.seed(1)
#nn = ConvNetwork()
nn = NeuralNetwork()
nn.add_layer(Conv(1,2,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Conv(2,4,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Dense(64, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
epochs = 1
batch_size = 64
reg = 1e-3
print_n = 100
# the optimizer definition is missing in the source; assumed to follow the
# earlier Fashion-MNIST example
learning_rate, momentum = 0.01, 0.9
optimizer = SGD(nn.parameters(), learning_rate, momentum)
start = time.time()
X, y = train_X, train_y
losses = train.train_nn(nn, X, y, optimizer, util.loss_gradient_softmax_crossentropy,
                        epochs, batch_size, reg, print_n)
done = time.time()
elapsed = done - start
print(elapsed)
print(np.mean(nn.predict(X)==y))
[ 1, 1] loss: 2.303
[ 101, 1] loss: 2.293
[ 201, 1] loss: 2.302
[ 301, 1] loss: 2.251
[ 401, 1] loss: 2.149
[ 501, 1] loss: 1.684
[ 601, 1] loss: 0.749
[ 701, 1] loss: 0.711
2535.1755859851837
0.84184
The output of a fully connected neuron is a simple vector dot product xw. If there are K neurons in a fully connected layer, their column weight vectors can be combined into a matrix W = (w_1, w_2, ⋯, w_K). For a single input x, the layer produces the outputs:
xW = (xw_1, xw_2, ⋯, xw_K)
If there are m input samples and each input sample is a row of a matrix, an m-row matrix X = (x_1; x_2; ⋯; x_m) is formed, and these m inputs pass through the layer as the matrix product XW.
That is, the weighted sum of a fully connected layer is easily realized as a matrix product. Although the convolution of a convolutional neuron can also be regarded as a tensor dot product between the kernel and the corresponding data window, it cannot be directly expressed as a vector dot product or matrix product, and input with multiple neurons and multiple channels even less so. The earlier convolution code realizes the convolutional layer through many nested loops; such multi-layer loop code cannot directly exploit vector parallelism, making the convolution inefficient.
6.3.1 Matrix multiplication of 1D sample convolution
Consider convolving the data vector (1, 2, 3, 4, 5) with the kernel (−1, 2, 1): sliding the kernel along the data vector and computing the weighted sum of each aligned data window with the kernel yields one value at a time:
(1, 2, 3) ⋅ (−1, 2, 1) = 6
(2, 3, 4) ⋅ (−1, 2, 1) = 8
(3, 4, 5) ⋅ (−1, 2, 1) = 10
If the data in each window is used as a row of a matrix, recorded as x_row, and the convolution kernel is converted into a column vector, recorded as K_col, then the convolution result tensor can be expressed as the product of the two matrices:
z_row = x_row K_col = [[1, 2, 3],
                       [2, 3, 4],
                       [3, 4, 5]] ⋅ [−1, 2, 1]^T = [6, 8, 10]^T
If the input tensor length is n, the kernel length is k, the span is s, and p zeros are padded before and after, the length of the result tensor of the convolution is
o = (n − k + 2p)/s + 1
For the above example, o = (5 − 3 + 0)/1 + 1 = 3.
If there are 2 samples, the same flattening operation is applied to each sample in turn. If x contains the following 2 samples:
x = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10]]
then x_row is a matrix of 6 rows, and the convolution can be expressed as:
z_row = x_row K_col = [[1, 2, 3],
                       [2, 3, 4],
                       [3, 4, 5],
                       [6, 7, 8],
                       [7, 8, 9],
                       [8, 9, 10]] ⋅ [−1, 2, 1]^T = [6, 8, 10, 16, 18, 20]^T
Reshaping z_row by sample gives:
z = [[6, 8, 10],
     [16, 18, 20]]
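A minimal numpy sketch of this flattening (the names x_row and K_col follow the text):
import numpy as np
x = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10]])
w = np.array([-1, 2, 1])
K = w.size
n_o = x.shape[1] - K + 1                  # valid convolution length
# every sliding window of every sample becomes one row
x_row = np.array([s[i:i+K] for s in x for i in range(n_o)])
K_col = w.reshape(-1, 1)
z = (x_row @ K_col).reshape(x.shape[0], n_o)
print(z)    # [[ 6  8 10], [16 18 20]]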
6.3.2 Matrix multiplication of 2D sample convolution
If the input data has only one sample with one channel, the input is a tensor of shape (1, 1, H, W), where H, W are the resolution of this 2D sample, like the (1, 1, 3, 3) sample X below:
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
For a convolution with span S = 1 and surrounding padding P = 1, the original data must first be padded; the padded data X_pad is:
X_pad = [[0, 0, 0, 0, 0],
         [0, 1, 2, 3, 0],
         [0, 4, 5, 6, 0],
         [0, 7, 8, 9, 0],
         [0, 0, 0, 0, 0]]
Slide the (1, 1, 2, 2) convolution kernel along X_pad and compute the weighted sum of each aligned data window with the corresponding kernel elements. As the kernel slides "from top to bottom, from left to right", a 4 × 4 feature map is generated. The 16 data windows weighted and summed with the kernel are:
X0 = [[0,0],[0,1]]   X1 = [[0,0],[1,2]]   X2 = [[0,0],[2,3]]   X3 = [[0,0],[3,0]]
X4 = [[0,1],[0,4]]   X5 = [[1,2],[4,5]]   X6 = [[2,3],[5,6]]   X7 = [[3,0],[6,0]]
X8 = [[0,4],[0,7]]   X9 = [[4,5],[7,8]]   X10 = [[5,6],[8,9]]  X11 = [[6,0],[9,0]]
X12 = [[0,7],[0,0]]  X13 = [[7,8],[0,0]]  X14 = [[8,9],[0,0]]  X15 = [[9,0],[0,0]]
Turn each window data block into a row of a matrix; the rows of all these data blocks form a matrix, recorded as X_row, and the convolution kernel (here [[1, 2], [3, 4]]) is flattened into a column vector K_col:
X_row = [[0, 0, 0, 1],
         [0, 0, 1, 2],
         [0, 0, 2, 3],
         [0, 0, 3, 0],
         [0, 1, 0, 4],
         ⋮           ]   (16×4)
K_col = [1, 2, 3, 4]^T   (4×1)
The convolution operation can then be expressed as the product $X_{row} K_{col}$ of these two matrices. If the input sample has multiple channels, such as a sample of shape (1, 2, 3, 3), its two channels are recorded as $X_0$ and $X_1$:
$$X_0 = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}_{3 \times 3} \qquad X_1 = \begin{bmatrix} 11 & 12 & 13 \\ 14 & 15 & 16 \\ 17 & 18 & 19 \end{bmatrix}_{3 \times 3}$$
The convolution kernel should also be a tensor with the same number of channels, such as a tensor $K$ of shape (1, 2, 2, 2), whose two channels are recorded as $K_0$ and $K_1$:

$$K_0 = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}_{2 \times 2} \qquad K_1 = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}_{2 \times 2}$$
If a convolution with a span of 1 and a padding of 0 is performed, each time the kernel slides it computes a weighted sum with a 2-channel data block of shape $2 \times 2 \times 2$. Each such $2 \times 2 \times 2$ data block is flattened into one row; the rows corresponding to all sliding windows form a matrix $X_{row}$, and the convolution kernel is flattened into a column vector $K_{col}$ of length 8, as follows:

$$X_{row} = \begin{bmatrix} 1 & 2 & 4 & 5 & 11 & 12 & 14 & 15 \\ 2 & 3 & 5 & 6 & 12 & 13 & 15 & 16 \\ 4 & 5 & 7 & 8 & 14 & 15 & 17 & 18 \\ 5 & 6 & 8 & 9 & 15 & 16 & 18 & 19 \end{bmatrix}_{4 \times 8} \qquad K_{col} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \end{bmatrix}_{8 \times 1}$$
Multiplying the two matrices yields a 4 × 1 convolution result matrix. This matrix
can be reshaped into a convolution result tensor of (1, 1, 2, 2), that is, a single-
sample single-channel feature map.
In general, suppose the convolution kernel tensor has shape (F, C, kH, kW), where the four dimensions are the number of kernels, number of channels, height, and width, respectively. That is, the convolution layer has F convolution kernels, and the shape of each convolution kernel is (C, kH, kW).

Each sample is a tensor of shape (C, H, W); convolving it with one convolution kernel generates a feature map whose shape is recorded as (oH, oW), where oH, oW are the height and width of the feature map, satisfying:

$$oH = (H + 2P - kH)//S + 1, \quad oW = (W + 2P - kW)//S + 1$$

For N samples and F kernels, the output tensor of the convolution layer has shape (N, F, oH, oW).
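For example, these formulas can be evaluated directly (a quick check with assumed sizes):

H, W, kH, kW, S, P = 5, 5, 3, 3, 1, 1
oH = (H + 2*P - kH)//S + 1   # 5
oW = (W + 2*P - kW)//S + 1   # 5
print(oH, oW)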
As shown in Figure 6-42, each convolution kernel is flattened into a column vector of the same length $C \times kH \times kW$. The F flattened kernels form a matrix $K_{col}$ with F columns:

$$K_{col} = \begin{bmatrix} K_{col}^{(1)} & K_{col}^{(2)} & \cdots & K_{col}^{(F)} \end{bmatrix}$$
Figure 6-42 Each convolution kernel is flattened into a column vector, and each
data block corresponding to the size of the convolution kernel is flattened into a
row vector
Similarly, each data block with the same shape (C, kH, kW) as a convolution kernel is flattened into a row vector; one sample is flattened into $oH \times oW$ rows, and the rows of all N samples are stacked:

$$X_{row} = \begin{bmatrix} X_{row}^{(1)} \\ X_{row}^{(2)} \\ \vdots \\ X_{row}^{(N \times oH \times oW)} \end{bmatrix}$$
Each data block is reshaped into a row vector. If there is only one sample (i.e. N = 1), the row vector of the block at position (h, w) is put into the h*oW+w-th row of the result matrix:

X_row[h*oW+w,:] = np.reshape(patch,-1)

For N samples, the code reshapes each data block of the N samples into shape (N,-1) and puts it into the corresponding rows with step size oSize = oH × oW. The complete flattening function im2row() is as follows:
def im2row(x, kH, kW, S=1):
    N,C,H,W = x.shape          # x is assumed to be already padded
    oH = (H - kH)//S + 1
    oW = (W - kW)//S + 1
    oSize = oH*oW              # number of data blocks per sample
    row = np.zeros((N*oSize, C*kH*kW))
    for h in range(oH):
        hS = h * S
        hS_kH = hS + kH
        h_start = h*oW
        for w in range(oW):
            wS = w*S
            patch = x[:,:,hS:hS_kH,wS:wS+kW]
            row[h_start+w::oSize,:] = np.reshape(patch,(N,-1))
    return row
The function can first be tested with a single 2-channel sample:

x = np.arange(18).reshape(1,2,3,3)
print(x)
x_row = im2row(x,2,2)
print(x_row)

[[[[ 0  1  2]
   [ 3  4  5]
   [ 6  7  8]]
  [[ 9 10 11]
   [12 13 14]
   [15 16 17]]]]
[[ 0.  1.  3.  4.  9. 10. 12. 13.]
 [ 1.  2.  4.  5. 10. 11. 13. 14.]
 [ 3.  4.  6.  7. 12. 13. 15. 16.]
 [ 4.  5.  7.  8. 13. 14. 16. 17.]]
x = np.arange(36).reshape(2,2,3,3)
print(x)
x_row = im2row(x,2,2)
print(x_row)
[[[[ 0 1 2]
[ 3 4 5 ]
[ 6 7 8]]
[[ 9 10 11]
[12 13 14]
[15 16 17]]]
[[[18 19 20]
[21 22 23]
[24 25 26]]
[[27 28 29]
[30 31 32]
[33 34 35]]]]
[[ 0. 1. 3. 4. 9. 10. 12. 13.]
[ 1. 2. 4. 5. 10. 11. 13. 14.]
[ 3. 4. 6. 7. 12. 13. 15. 16.]
[ 4. 5. 7. 8. 13. 14. 16. 17.]
[18. 19. 21. 22. 27. 28. 30. 31.]
[19. 20. 22. 23. 28. 29. 31. 32.]
[21. 22. 24. 25. 30. 31. 33. 34.]
[22. 23. 25. 26. 31. 32. 34. 35.]]
The product $Z_{row} = X_{row} K_{col}$ is a matrix of shape $(N \times oH \times oW, F)$. It is reshaped into an $(N, oH, oW, F)$ tensor first, and then the 4th axis (axis=3) where F is located is exchanged to the 2nd axis position, transforming it into a tensor of shape (N, F, oH, oW):

Z = Z.reshape(N,oH,oW,-1)
Z = Z.transpose(0,3,1,2)
To sum up, the convolution operation of the convolution layer can be realized by
matrix multiplication, the code is as follows:
def conv_forward(X, K, S=1, P=0):
    N,C, H, W = X.shape
    F,C, kH,kW = K.shape
    if P==0:
        X_pad = X
    else:
        X_pad = np.pad(X, ((0, 0), (0, 0),(P, P), (P, P)), 'constant')
    X_row = im2row(X_pad, kH, kW, S)              # flatten data blocks into rows
    K_col = K.reshape(K.shape[0],-1).transpose()  # flatten kernels into columns
    Z_row = np.dot(X_row, K_col)
    oH = (X_pad.shape[2] - kH) // S + 1
    oW = (X_pad.shape[3] - kW) // S + 1
    Z = Z_row.reshape(N,oH,oW,-1)
    Z = Z.transpose(0,3,1,2)
    return Z
x = np.arange(9).reshape(1,1,3,3)+1
k = np.arange(4).reshape(1,1,2,2)+1
print(x)
print(k)
z = conv_forward(x,k)
print(z.shape)
print(z)
[[[[1 2 3]
[4 5 6]
[7 8 9]]]]
[[[[1 2]
[3 4]]]]
(1, 1, 2, 2)
[[[[37. 47.]
   [67. 77.]]]]
Another test with 2 samples, 2 channels, and 2 convolution kernels:

x = np.arange(36).reshape(2,2,3,3)
k = np.arange(16).reshape(2,2,2,2)
z = conv_forward(x,k)
print(z.shape)
print(z)

(2, 2, 2, 2)
[[[[ 268.  296.]
   [ 352.  380.]]
  [[ 684.  776.]
   [ 960. 1052.]]]
 [[[ 772.  800.]
   [ 856.  884.]]
  [[2340. 2432.]
   [2616. 2708.]]]]
Consider the reverse derivation, again in the 1D case, where the input and kernel are flattened into the matrices $x_{row}$ and $K_{col}$ and the convolution is written as their product. Let the convolution result matrix be $z_{row}$, that is:

$$z_{row} = \begin{bmatrix} z_0 \\ z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} x_0 & x_1 & x_2 \\ x_1 & x_2 & x_3 \\ x_2 & x_3 & x_4 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$
If the gradient of the loss function with respect to $z$ is $dz$, whose corresponding matrix form is $dz_{row}$, then:

$$dx_{row} = dz_{row} K_{col}^T, \qquad dK_{col} = x_{row}^T \, dz_{row}$$
$$dx_{row} = dz_{row} K_{col}^T = \begin{bmatrix} dz_0 \\ dz_1 \\ dz_2 \end{bmatrix} \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} = \begin{bmatrix} dz_0 w_0 & dz_0 w_1 & dz_0 w_2 \\ dz_1 w_0 & dz_1 w_1 & dz_1 w_2 \\ dz_2 w_0 & dz_2 w_1 & dz_2 w_2 \end{bmatrix}$$
This $dx_{row}$ is the flattened form of $dx$: each of its rows is the gradient of one output element $z_i$ with respect to the data block it depends on, and this data block has the same shape and size as the convolution kernel. As shown in Figure 6-44, the first row of $dx_{row}$ holds the gradients contributed by the output component $z_0$ to the 3 input components $x_0, x_1, x_2$ of its data block, all of which depend on $dz_0$:

$$dx_0 = dz_0 w_0, \quad dx_1 = dz_0 w_1, \quad dx_2 = dz_0 w_2$$

Similarly, each of the other rows contributes to the gradient of a different (possibly overlapping) input data block, as shown in Figure 6-45:

Figure 6-45 Each $dz_i$ contributes to the gradient of the elements $x_j$ of the data block on which $z_i$ depends
It can be seen that the forward calculation of convolution accumulates a weighted sum over each data block to obtain an output value $z_i$, while the reverse derivation distributes the gradient of each $z_i$ onto the elements of the data block it depends on. The distribution process of the reverse derivation is exactly the reverse of the accumulation process of the forward calculation.
Just as the input $x$ was flattened into $x_{row}$, the gradient $dx$ must be recovered from $dx_{row}$ by the inverse of the flattening process: each row of $dx_{row}$ is converted back into the gradient of one data block. Because different data blocks overlap, their gradients also overlap, and during this reverse flattening the overlapping gradients must be accumulated, as shown in Figure 6-45. Accumulating the gradient of each row onto the position of its corresponding original data block yields the final $dx$.
Let $X$ be the input and $K$, $b$ be the weight and bias of a convolutional layer; the output tensor $Z$ can be expressed as:

$$Z = \mathrm{conv}(X, K) + b$$

where $\mathrm{conv}(X, K)$ represents the convolution of the input $X$ with the convolution kernel weights $K$. In flattened matrix form this is $Z_{row} = X_{row} K_{col} + b$, where $X_{row}$, $K_{col}$, $Z_{row}$ are the input, weights, and output flattened into matrix form.
If the gradient $dZ$ of the loss function with respect to the output $Z$ is known, then according to formula (6-33), the gradient $db$ of the loss function with respect to $b$ is derived exactly as for the fully connected layer, namely db = np.sum(dZ, axis=(0,2,3)). That is, the gradient of the bias $b_k$ of each output channel is the accumulation of the gradients $dz_{i,k,h,w}$ over all pixel positions of all samples.
According to formula (6-34), and just as in the reverse derivation of the fully connected layer, the gradients of the loss function with respect to the flattened $X_{row}$ and $K_{col}$ can be obtained from the gradient $dZ_{row}$:

$$dX_{row} = dZ_{row} K_{col}^T$$
$$dK_{col} = X_{row}^T \, dZ_{row}$$
Because $K_{col}$ flattens each kernel of $K$ into one column vector channel by channel, each column of $dK_{col}$ only needs to be reshaped back to the shape of the corresponding kernel; that is, $dK$ is easily recovered by reversing the flattening of $K$.
$dX_{row}$ is a matrix of the same shape as the flattened matrix $X_{row}$ of $X$; each of its rows represents the gradient of a data block with the same shape (C, kH, kW) as a convolution kernel, and the data blocks represented by different rows of $X_{row}$ may overlap in $X$. Thus different rows of $dX_{row}$ represent gradients of possibly overlapping data blocks. Therefore, when restoring $dX$ from $dX_{row}$ by the inverse of the flattening process, these overlapping gradients must be accumulated. This process is exactly the same as in the previous 1D case.
def row2im(dx_row,oH,oW,kH,kW,S):
    nRow,K2C = dx_row.shape[0],dx_row.shape[1]
    C = K2C//(kH*kW)
    N = nRow//(oH*oW)   # number of samples
    oSize = oH*oW       # number of data blocks per sample
    H = (oH - 1) * S + kH
    W = (oW - 1) * S + kW
    dx = np.zeros([N,C,H,W])
    for i in range(oSize):
        row = dx_row[i::oSize,:]   # the i-th data block of all N samples
        h_start = (i // oW) * S
        w_start = (i % oW) * S
        dx[:,:,h_start:h_start+kH,w_start:w_start+kW] += row.reshape((N,C,kH,kW))
    return dx
Here oSize = oH × oW is the size of one feature map of $Z$ and also the number of data blocks that one input sample is divided into; oH and oW are the height and width of the grid of data blocks. The index i numbers the data blocks in the order the kernel slides "from top to bottom, from left to right". From i, the block subscripts (i // oW, i % oW) are obtained, and from these subscripts and the span S the height and width subscripts h_start, w_start of this data block in the original data matrix are computed; the i-th rows of dx_row are then accumulated at this position. Because there are N samples, the rows of adjacent samples at the same block position differ by oSize in the flattened matrix, so dx_row[i::oSize,:] fetches the gradients of the same data block of all N samples. The corresponding position in the original gradient tensor dx is dx[:,:,h_start:h_start+kH,w_start:w_start+kW].

Equivalently, row2im() can be written by looping directly over the output positions, mirroring the loop structure of im2row():
def row2im(dx_row,oH,oW,kH,kW,S):
nRow,K2C = dx_row.shape[0],dx_row.shape[1]
C = K2C//(kH*kW)
N = nRow//(oH*oW) # number of samples
oSize = oH*oW
H = (oH - 1) * S + kH
W = (oW - 1) * S + kW
dx = np.zeros([N,C,H,W])
for h in range(oH):
hS = h * S
hS_kH = hS + kH
h_start = h*oW
for w in range(oW):
wS = w*S
row =dx_row[h_start+w::oSize,:]
dx[:,:,hS:hS_kH,wS:wS+kW] += row.reshape(N,C,kH,kW)
return dx
You can test the above function row2im() with the following code:
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P = 1,2,1,0
nRow = oH*oW*N
K2C = C*kH*kW
a = np.arange(nRow*K2C).reshape(nRow,K2C)
#dx_row = np.arange(nRow*K2C).reshape(nRow,K2C)
dx_row = np.vstack((a,a))
print("dx_row",dx_row)
print(dx_row.shape)
dx = row2im(dx_row,oH,oW,kH,kW,S)
print(dx.shape)
print("dx[0,0,:,:]:",dx[0,0,:,:])
dx_row [[ 0  1  2  3  4  5  6  7]
[ 8 9 10 11 12 13 14 15]
[16 17 18 19 20 21 22 23]
[24 25 26 27 28 29 30 31]
[32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47]
[48 49 50 51 52 53 54 55]
[56 57 58 59 60 61 62 63]
[64 65 66 67 68 69 70 71]
[ 0 1 2 3 4 5 6 7]
[ 8 9 10 11 12 13 14 15]
[16 17 18 19 20 21 22 23]
[24 25 26 27 28 29 30 31]
[32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47]
[48 49 50 51 52 53 54 55]
[56 57 58 59 60 61 62 63]
[64 65 66 67 68 69 70 71]]
(18, 8)
(2, 2, 4, 4)
dx[0,0,:,:]: [[ 0. 9. 25. 17.]
[ 26. 70. 102. 60.]
[ 74. 166. 198. 108.]
[ 50. 109. 125. 67.]]
Based on the above discussion, the reverse derivation code of the convolutional layer is as follows (the flattened input X_row computed in the forward pass is passed in as a parameter):

def conv_backward(dZ,X_row,K,oH,oW,kH,kW,S=1,P=0):
    # Flatten dZ into a matrix with the same shape as Z_row
    F = dZ.shape[1]
    dZ_row = dZ.transpose(0,2,3,1).reshape(-1,F)  # (N,F,oH,oW) -> (N*oH*oW,F)
    # Gradient of the loss function with respect to the kernel parameters
    dK_col = np.dot(X_row.T,dZ_row)   # X_row.T @ dZ_row
    dK_col = dK_col.transpose(1,0)    # move the F axis to axis=0
    dK = dK_col.reshape(K.shape)
    db = np.sum(dZ,axis=(0,2,3))
    db = db.reshape(-1,F)
    # Gradient of the loss function with respect to the input
    K_col = K.reshape(K.shape[0],-1).transpose()
    dX_row = np.dot(dZ_row,K_col.T)
    dX_pad = row2im(dX_row,oH,oW,kH,kW,S)
    if P == 0:
        return dX_pad,dK,db
    return dX_pad[:, :, P:-P, P:-P],dK,db
The following code tests the convolution reverse derivation function conv_backward() above:

H,W = 4,4
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P,F = 1,3,1,0,4
dZ = np.arange(N*F*oH*oW).reshape(N,F,oH,oW)
X = np.arange(N*C*H*W).reshape(N,C,H,W)
if P==0:
    X_pad = X
else:
    X_pad = np.pad(X, ((0, 0), (0, 0),(P, P), (P, P)), 'constant')
K = np.arange(F*C*kH*kW).reshape(F,C,kH,kW)
X_row = im2row(X_pad,kH,kW,S)
dX,dK,db = conv_backward(dZ,X_row,K,oH,oW,kH,kW,S,P)
print(dX.shape)
print("dX[0,0,:,:]:",dX[0,0,:,:])
print(dK.shape)
print("dW[0,0,:,:]:",dK[0,0,:,:])
print(db.shape)
print("db:",db)
(1, 3, 4, 4)
dX[0,0,:,:]: [[1512. 3150. 3298. 1718.]
[3348. 6968. 7280. 3788.]
[3804. 7904. 8216. 4268.]
[2100. 4358. 4522. 2346.]]
(4, 3, 2, 2)
dW[0,0,:,:]: [[258. 294.]
[402. 438.]]
(1, 4)
db: [[ 36 117 198 279]]
The convolution operation moves the convolution kernel in the order "from top to bottom, from left to right" with the span as the step size, and at each position computes the weighted sum of the kernel with the corresponding (multi-channel) window data block, yielding one element of the output feature map.

The earlier flattening of the original data tensor into a matrix lays out the multi-channel data blocks as row vectors, in the same "top-to-bottom, left-to-right" order as the convolution weighted sums.

If the span differs from the size (height and width) of the convolution kernel, the data blocks visited in sequence overlap one another. Arranging these blocks in their calculation order produces a new tensor whose data blocks no longer overlap.
For example, for the following 4 × 4 tensor with one sample and one channel,

$$X = \begin{bmatrix} x_{00} & x_{01} & x_{02} & x_{03} \\ x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \end{bmatrix}$$

a 2 × 2 convolution kernel with span 1 visits 3 × 3 = 9 overlapping data blocks. Arranged in the calculation order "from top to bottom, from left to right", they form the extended tensor:

$$\begin{bmatrix}
\begin{bmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{bmatrix} &
\begin{bmatrix} x_{01} & x_{02} \\ x_{11} & x_{12} \end{bmatrix} &
\begin{bmatrix} x_{02} & x_{03} \\ x_{12} & x_{13} \end{bmatrix} \\
\begin{bmatrix} x_{10} & x_{11} \\ x_{20} & x_{21} \end{bmatrix} &
\begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} &
\begin{bmatrix} x_{12} & x_{13} \\ x_{22} & x_{23} \end{bmatrix} \\
\begin{bmatrix} x_{20} & x_{21} \\ x_{30} & x_{31} \end{bmatrix} &
\begin{bmatrix} x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix} &
\begin{bmatrix} x_{22} & x_{23} \\ x_{32} & x_{33} \end{bmatrix}
\end{bmatrix}$$

Each data block, which is exactly the same size as the convolution kernel, can be flattened into a row vector, giving the matrix:

$$\begin{bmatrix} x_{00} & x_{01} & x_{10} & x_{11} \\ x_{01} & x_{02} & x_{11} & x_{12} \\ \vdots & \vdots & \vdots & \vdots \\ x_{22} & x_{23} & x_{32} & x_{33} \end{bmatrix}_{9 \times 4}$$
If there are multiple channels, each data block is a three-dimensional tensor (cuboid), and the process is similar.
Such as a tensor with 2 channels:
$$X_0 = \begin{bmatrix} x_{000} & x_{001} & x_{002} & x_{003} \\ x_{010} & x_{011} & x_{012} & x_{013} \\ x_{020} & x_{021} & x_{022} & x_{023} \\ x_{030} & x_{031} & x_{032} & x_{033} \end{bmatrix} \qquad X_1 = \begin{bmatrix} x_{100} & x_{101} & x_{102} & x_{103} \\ x_{110} & x_{111} & x_{112} & x_{113} \\ x_{120} & x_{121} & x_{122} & x_{123} \\ x_{130} & x_{131} & x_{132} & x_{133} \end{bmatrix}$$
According to the convolution calculation process, these data blocks are visited in order:

Figure 6-47 Each data block of the convolution calculation is a three-dimensional 2 × 2 × 2 block composed of the 2 × 2 windows of the 2 channels
That is, the matrix blocks at corresponding positions of the two channel matrices together constitute one data block. For channel 0 the blocks, arranged in calculation order, are:

$$\begin{bmatrix}
\begin{bmatrix} x_{000} & x_{001} \\ x_{010} & x_{011} \end{bmatrix} &
\begin{bmatrix} x_{001} & x_{002} \\ x_{011} & x_{012} \end{bmatrix} &
\begin{bmatrix} x_{002} & x_{003} \\ x_{012} & x_{013} \end{bmatrix} \\
\begin{bmatrix} x_{010} & x_{011} \\ x_{020} & x_{021} \end{bmatrix} &
\begin{bmatrix} x_{011} & x_{012} \\ x_{021} & x_{022} \end{bmatrix} &
\begin{bmatrix} x_{012} & x_{013} \\ x_{022} & x_{023} \end{bmatrix} \\
\begin{bmatrix} x_{020} & x_{021} \\ x_{030} & x_{031} \end{bmatrix} &
\begin{bmatrix} x_{021} & x_{022} \\ x_{031} & x_{032} \end{bmatrix} &
\begin{bmatrix} x_{022} & x_{023} \\ x_{032} & x_{033} \end{bmatrix}
\end{bmatrix}$$

and for channel 1 the arrangement is exactly analogous, with the leading channel subscript 1 (i.e. $x_{100}, x_{101}, \cdots, x_{133}$).
All these data blocks of the same size as the convolution kernel, arranged in the calculation order "from top to bottom, from left to right", form an extended tensor whose data comes from the original data tensor. From the subscripts of the extended tensor shown in Figure 6-47, it can be seen which subscripts of the original tensor each element comes from. In other words, as long as the subscript of each element of the extended data tensor in the original data tensor is known, the extended tensor can be generated directly from the original data tensor.

First look at the single-channel case; removing the letter x makes the subscripts of the extended tensor's elements in the original tensor easy to see:
$$\begin{bmatrix}
\begin{bmatrix} 00 & 01 \\ 10 & 11 \end{bmatrix} &
\begin{bmatrix} 01 & 02 \\ 11 & 12 \end{bmatrix} &
\begin{bmatrix} 02 & 03 \\ 12 & 13 \end{bmatrix} \\
\begin{bmatrix} 10 & 11 \\ 20 & 21 \end{bmatrix} &
\begin{bmatrix} 11 & 12 \\ 21 & 22 \end{bmatrix} &
\begin{bmatrix} 12 & 13 \\ 22 & 23 \end{bmatrix} \\
\begin{bmatrix} 20 & 21 \\ 30 & 31 \end{bmatrix} &
\begin{bmatrix} 21 & 22 \\ 31 & 32 \end{bmatrix} &
\begin{bmatrix} 22 & 23 \\ 32 & 33 \end{bmatrix}
\end{bmatrix}$$
Indexing the original data tensor with these subscripts generates the tensor composed of these data blocks. Observing the subscripts, all of them can be obtained from the initial upper-left block by moving "from top to bottom, from left to right" with the span as the step size. The subscript block in the upper left corner is:
$$\begin{bmatrix} 00 & 01 \\ 10 & 11 \end{bmatrix}$$
"From left to right" moves a span each time, that is, the column subscript increases by 1, and the subscripts of the
three data blocks in the first row can be obtained:
$$\begin{bmatrix} 00 & 01 \\ 10 & 11 \end{bmatrix} \quad
\begin{bmatrix} 01 & 02 \\ 11 & 12 \end{bmatrix} \quad
\begin{bmatrix} 02 & 03 \\ 12 & 13 \end{bmatrix}$$
Moving the first row of blocks "from top to bottom" by the span increases the row subscripts, which yields in turn the subscripts of all data blocks in the remaining 2 rows.

For the data block in the upper left corner, the row and column subscripts are i = 0, 1 and j = 0, 1 respectively, as shown in Figure 6-48:
Figure 6-48 Row and column subscripts of the data block in the upper left corner
Therefore, the row and column subscript combinations [(0,0),(0,1),(1,0),(1,1)] of the 4 elements of the upper-left data block can be obtained from the row subscript vector [0,0,1,1] and the column subscript vector [0,1,0,1]. Similarly, for any kH × kW convolution kernel, the row and column subscripts i0 and j0 of the upper-left data block's elements in the original tensor can be generated with python code:
import numpy as np
kH,kW = 2,2
i0 = np.repeat(np.arange(kH), kW)  # row subscripts [0,1] repeated along the column direction: [0,0,1,1]
print(i0)
j0 = np.tile(np.arange(kW), kH)    # column subscripts [0,1] tiled along the row direction: [0,1,0,1]
print(j0)

[0 0 1 1]
[0 1 0 1]
The elements of the data block in the upper left corner can be obtained by using the combined index of i0 and j0. To display such subscripts conveniently, the following helper builds a matrix of subscript strings:

def idx_matrix(H,W):
    a = np.empty((H,W), dtype='object')
    for i in range(H):
        for j in range(W):
            a[i,j] = str(i)+str(j)   # the subscript string "ij" of element (i,j)
    return a
For a multi-channel data block, for each channel, the row and column subscripts of the corresponding elements of
the data block are the same. As shown in Figure 6-49, for the data block in the upper left corner of the 2-channel,
the row and column subscripts are:
Figure 6-49 Row and column subscripts of the data block in the upper left corner of channel 2
Generally, for the data block in the upper left corner with the number of channels as C, the row and column
subscripts of its elements can be generated with the following code:
i0 = np.repeat(np.arange(kH), kW)
i0 = np.tile(i0, C)
j0 = np.tile(np.arange(kW), kH * C)
[0 0 1 1 0 0 1 1] #2 channel subscripts
[0 1 0 1 0 1 0 1] #2 channel subscripts
To generate the coordinates of the elements of all data blocks in the original data tensor, it is not enough to know the row and column subscripts of each element relative to its block's upper-left corner (0,0); an offset determined by the span S must also be added to obtain the final row and column coordinates. If a feature map is divided into oH × oW data blocks, the offsets of these blocks relative to the upper-left block can be called span coordinates. For example, when a feature map is divided into 3 × 3 = 9 data blocks and the span is S=1, the row (height) and column (width) span coordinates of these 9 blocks are (0,0), (0,1), (0,2), (1,0), ..., (2,2).
Similarly, these span coordinates can be generated using code that generates the row and column coordinates of
elements within a data block:
oH,oW=3,3
i1 = S * np.repeat(np.arange(oH), oW)
j1 = S * np.tile(np.arange(oW), oH)
print(i1)
print(j1)
[0 0 0 1 1 1 2 2 2]
[0 1 2 0 1 2 0 1 2]
The row and column coordinates within the upper-left data block and the span coordinates of the data blocks are added (via broadcasting) to obtain the row and column coordinates of all data block elements in the original data tensor, namely:

i = i0.reshape(-1,1) + i1.reshape(1,-1)
print("i0:",i0)
print("i1:",i1)
print(i)
i0: [0 0 1 1 0 0 1 1]
i1: [0 0 0 1 1 1 2 2 2]
[[0 0 0 1 1 1 2 2 2]
[0 0 0 1 1 1 2 2 2]
[1 1 1 2 2 2 3 3 3]
[1 1 1 2 2 2 3 3 3]
[0 0 0 1 1 1 2 2 2]
[0 0 0 1 1 1 2 2 2]
[1 1 1 2 2 2 3 3 3]
[1 1 1 2 2 2 3 3 3]]
Each column is a row subscript of a data block. The first 3 columns are the row subscripts of the 3 data blocks
when the span row coordinate is 0.
The following is the code to combine the span coordinates and the element subscripts in the data block to get the
row and column subscripts of all data blocks in the original input (single-channel) tensor:
C,S = 1,1
oH,oW = 3,3
kH,kW = 2,2
i0 = np.repeat(np.arange(kH), kW)
i0 = np.tile(i0, C)
j0 = np.tile(np.arange(kW), kH * C)
i1 = S * np.repeat(np.arange(oH), oW)
j1 = S * np.tile(np.arange(oW), oH)
i = i0.reshape(-1,1) + i1.reshape(1,-1)
j = j0.reshape(-1,1) + j1.reshape(1,-1)
print(i)
print(j)
[[0 0 0 1 1 1 2 2 2]
[0 0 0 1 1 1 2 2 2]
[1 1 1 2 2 2 3 3 3]
[1 1 1 2 2 2 3 3 3]]
[[0 1 2 0 1 2 0 1 2]
[1 2 3 1 2 3 1 2 3]
[0 1 2 0 1 2 0 1 2]
[1 2 3 1 2 3 1 2 3]]
The row subscripts of the elements in the upper left corner of all data blocks are shown in Figure 6-51:
Figure 6-51 Row subscripts of elements in the upper left corner of all data blocks
The row subscripts of the elements in the lower right corner of all data blocks are shown in Figure 6-52:
Figure 6-52 Row subscripts of elements in the lower right corner of all data blocks
It can be observed that each column of this index matrix corresponds to one data block. If instead the index subscripts of each data block should form a row of the matrix, then whether the within-block subscripts and the span subscripts are arranged by row or by column must be swapped, that is, the reshape code is modified:
C,S = 1,1
oH,oW=3,3
kH,kW = 2,2
i0 = np.repeat(np.arange(kH), kW)
i0 = np.tile(i0, C)
j0 = np.tile(np.arange(kW), kH * C)
i1 = S * np.repeat(np.arange(oH), oW)
j1 = S * np.tile(np.arange(oW), oH)
i = i0.reshape(1,-1) + i1.reshape(-1,1)
j = j0.reshape(1,-1) + j1.reshape(-1,1)
print(i)
print(j)
[[0 0 1 1]
[0 0 1 1]
[0 0 1 1]
[1 1 2 2]
[1 1 2 2]
[1 1 2 2]
[2 2 3 3]
[2 2 3 3]
[2 2 3 3]]
[[0 1 0 1]
[1 2 1 2]
[2 3 2 3]
[0 1 0 1]
[1 2 1 2]
[2 3 2 3]
[0 1 0 1]
[1 2 1 2]
[2 3 2 3]]
The above discussion gives the image row and column coordinates of each data block element within its channel. Indexing each data element must also take the channel coordinate into account. If there are C channels, the channel coordinate of a single element is one of (0, 1, 2, ⋯, C − 1), as shown in Figure 6-53. Each data block has kH × kW elements per channel, so a data block of shape C × kH × kW has a total of C × kH × kW (channel, row, column) coordinate triples.
Figure 6-53 Channel coordinates, the channel coordinates of all elements in channel i are i
C=2
k = np.repeat(np.arange(C), kH * kW).reshape(1,-1) #(-1, 1)
print(k)
[[0 0 0 0 1 1 1 1]]
If the shape of the original data tensor input to the convolutional layer is (N, C, H, W) and the convolutional layer has F convolution kernels of shape (C, kH, kW), i.e. the kernel stack has shape (F, C, kH, kW), and a convolution with span S and edge padding P is performed, then according to the above analysis the channel coordinate k, row subscript i, and column subscript j in the original data tensor can be obtained for all elements of the extended tensor composed of the data blocks participating in the convolution. The function get_im2row_indices() returns these mapping coordinates (k, i, j):
import numpy as np
def get_im2row_indices(x_shape, kH, kW, S=1,P=0):
    N, C, H, W = x_shape
    assert (H + 2 * P - kH) % S == 0
    assert (W + 2 * P - kW) % S == 0
    oH = (H + 2 * P - kH) // S + 1
    oW = (W + 2 * P - kW) // S + 1
    i0 = np.repeat(np.arange(kH), kW)
    i0 = np.tile(i0, C)
    i1 = S * np.repeat(np.arange(oH), oW)
    j0 = np.tile(np.arange(kW), kH * C)
    j1 = S * np.tile(np.arange(oW), oH)
    # each data block's subscripts form a row (not a column) of the index matrices
    i = i0.reshape(1,-1) + i1.reshape(-1,1)
    j = j0.reshape(1,-1) + j1.reshape(-1,1)
    k = np.repeat(np.arange(C), kH * kW).reshape(1,-1)
    return (k, i, j)
H,W = 4,4
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P,F = 2,2,1,0,4
k, i, j = get_im2row_indices((N,C,H,W),kH,kW,S,P)
print(k.shape)
print(i.shape)
print(j.shape)
(1, 8)
(9, 8)
(9, 8)
With this helper function, it is easy to generate a row-flattened tensor of data blocks from the original data tensor:
def im2row_indices(x, kH, kW, S=1,P=0):
    x_padded = np.pad(x, ((0, 0), (0, 0), (P, P), (P, P)), mode='constant')
    k, i, j = get_im2row_indices(x.shape, kH, kW, S,P)
    rows = x_padded[:, k, i, j]   # all data blocks of each sample
    C = x.shape[1]
    # all data blocks of the 1st sample, then all data blocks of the 2nd sample, ...
    rows = rows.reshape(-1,kH * kW * C)
    return rows
X = np.arange(N*C*H*W).reshape(N,C,H,W)
X_row = im2row_indices(X,kH,kW,S,P)
print(X)
print(X_row)
[[[[ 0 1 2 3]
[ 4 5 6 7 ]
[ 8 9 10 11]
[12 13 14 15]]
[[16 17 18 19]
[20 21 22 23]
[24 25 26 27]
[28 29 30 31]]]
[[[32 33 34 35]
[36 37 38 39]
[40 41 42 43]
[44 45 46 47]]
[[48 49 50 51]
[52 53 54 55]
[56 57 58 59]
[60 61 62 63]]]]
[[ 0 1 4 5 16 17 20 21]
[ 1 2 5 6 17 18 21 22]
[ 2 3 6 7 18 19 22 23]
[ 4 5 8 9 20 21 24 25]
[ 5 6 9 10 21 22 25 26]
[ 6 7 10 11 22 23 26 27]
[ 8 9 12 13 24 25 28 29]
[ 9 10 13 14 25 26 29 30]
[10 11 14 15 26 27 30 31]
[32 33 36 37 48 49 52 53]
[33 34 37 38 49 50 53 54]
[34 35 38 39 50 51 54 55]
[36 37 40 41 52 53 56 57]
[37 38 41 42 53 54 57 58]
[38 39 42 43 54 55 58 59]
[40 41 44 45 56 57 60 61]
[41 42 45 46 57 58 61 62]
[42 43 46 47 58 59 62 63]]
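As a quick consistency check (assuming both the loop-based im2row() and im2row_indices() above are in scope), the two flattening implementations should produce the same matrix:

x = np.random.randn(2, 3, 8, 8)
assert np.allclose(im2row(x, 3, 3), im2row_indices(x, 3, 3))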
Or conversely, the tensor of data blocks flattened by rows can be converted back into the shape of the original data tensor, accumulating overlapping gradients (np.add.at performs unbuffered accumulation at repeated indices):

def row2im_indices(rows, x_shape, kH, kW, S=1,P=0):
    N, C, H, W = x_shape
    H_pad, W_pad = H + 2 * P, W + 2 * P
    x_pad = np.zeros((N, C, H_pad, W_pad), dtype=rows.dtype)
    k, i, j = get_im2row_indices(x_shape, kH, kW, S,P)
    rows_reshaped = rows.reshape(N,-1,C * kH * kW)
    # accumulate overlapping block gradients back to their original positions
    np.add.at(x_pad, (slice(None), k, i, j), rows_reshaped)
    if P == 0:
        return x_pad
    return x_pad[:, :, P:-P, P:-P]
H,W = 4,4
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P = 2,2,1,0
nRow = oH*oW*N
K2C = C*kH*kW
X = np.arange(N*C*H*W).reshape(N,C,H,W)
dx_row = im2row_indices(X,kH,kW,S,P)   # flatten X itself, then restore it
print("dx_row.shape",dx_row.shape)
dx = row2im(dx_row,oH,oW,kH,kW,S)
print("dx.shape",dx.shape)
print("dx[0,0,:,:]",dx[0,0,:,:])
dX = row2im_indices(dx_row,(N,C,H,W),kH,kW,S,P)
print("dX.shape",dX.shape)
print("dX[0,0,:,:]",dX[0,0,:,:])
print(dX)
dx_row.shape(18, 8)
dx.shape(2, 2, 4, 4)
dx[0,0,:,:] [[ 0. 2. 4. 3.]
[ 8. 20. 24. 14.]
[16. 36. 40. 22.]
[12. 26. 28. 15.]]
dX.shape(2, 2, 4, 4)
dX[0,0,:,:] [[ 0 2 4 3]
[ 8 20 24 14 ]
[16 36 40 22]
[12 26 28 15]]
[[[[ 0 2 4 3]
[ 8 20 24 14 ]
[ 16 36 40 22]
[ 12 26 28 15]]
[[ 16 34 36 19]
[ 40 84 88 46 ]
[ 48 100 104 54]
[ 28 58 60 31]]]
[[[ 32 66 68 35]
[ 72 148 152 78]
[ 80 164 168 86]
[ 44 90 92 47]]
[[ 48 98 100 51]
[104 212 216 110]
[112 228 232 118]
[ 60 122 124 63]]]]
With the helper functions above that directly flatten multidimensional tensors into matrices (in the file im2row.py), a convolution layer based on the fast convolution operation can be written:
from Layers import *
from im2row import *
class Conv_fast():
def __init__(self, in_channels, out_channels, kernel_size, stride=1,padding=0):
super().__init__()
self.C = in_channels
self.F = out_channels
self.kH = kernel_size
self.kW = kernel_size
self.S = stride
self.P = padding
# filters is a 3d array with dimensions (num_filters, self.K, self.K)
# you can also use Xavier Initialization.
#self.K = np.random.randn(self.F, self.C, self.kH, self.kW)
#/(self.K*self.K)
self.K = np.random.normal(0,1,(self.F, self.C, self.kH, self.kW))
self.b = np.zeros((1,self.F)) #,1))
self.params = [self.K,self.b]
self.grads = [np.zeros_like(self.K),np.zeros_like(self.b)]
self.X = None
self.reset_parameters()
def reset_parameters(self):
kaiming_uniform(self.K, a=math.sqrt(5))
if self.b is not None:
#fan_in, _ = calculate_fan_in_and_fan_out(self.K)
fan_in = self.C
bound = 1 / math.sqrt(fan_in)
self.b[:] = np.random.uniform(-bound,bound,(self.b.shape))
def forward(self,X):
    # Convert to multi-channel (N,C,H,W) form
    self.X = X
    if len(X.shape)==1:
        X = X.reshape(X.shape[0],1,1,1)
    elif len(X.shape)==2:
        X = X.reshape(X.shape[0],X.shape[1],1,1)
    self.N,_,self.H,self.W = X.shape
    self.oH = (self.H + 2*self.P - self.kH)//self.S + 1
    self.oW = (self.W + 2*self.P - self.kW)//self.S + 1
    self.X_row = im2row_indices(X,self.kH,self.kW,S=self.S,P=self.P)
    K_col = self.K.reshape(self.F,-1).transpose()
    Z_row = self.X_row @ K_col + self.b
    Z = Z_row.reshape(self.N,self.oH,self.oW,-1)
    Z = Z.transpose(0,3,1,2)
    return Z
def __call__(self,x):
return self.forward(x)
def backward(self,dZ):
    if len(dZ.shape)<=2:
        dZ = dZ.reshape(dZ.shape[0],-1,self.oH,self.oW)
    K = self.K
    # flatten dZ into a matrix with the same shape as Z_row
    F = dZ.shape[1]   # Convert (N,F,oH,oW) to (N,oH,oW,F)
    assert(F==self.F)
    dZ_row = dZ.transpose(0,2,3,1).reshape(-1,F)
    # Gradient of the loss function with respect to the kernel parameters
    dK_col = np.dot(self.X_row.T,dZ_row)   # X_row.T @ dZ_row
    dK_col = dK_col.transpose(1,0)   # move the F axis from axis=1 to axis=0
    dK = dK_col.reshape(self.K.shape)
    db = np.sum(dZ,axis=(0,2,3))
    db = db.reshape(-1,F)
    # Gradient of the loss function with respect to the input of the layer
    K_col = K.reshape(self.F,-1).transpose()
    dX_row = np.dot(dZ_row,K_col.T)
    X_shape = (self.N,self.C,self.H,self.W)
    dX = row2im_indices(dX_row,X_shape,self.kH,self.kW,S=self.S,P=self.P)
    dX = dX.reshape(self.X.shape)
    self.grads[0] += dK
    self.grads[1] += db
    return dX
def reg_loss(self,reg):
return reg*np.sum(self.K**2)
def reg_loss_grad(self,reg):
self.grads[0]+= 2*reg * self.K
return reg*np.sum(self.K**2)
Gradient Test
Similarly, the gradient test can be used to check whether the code is correct for this convolutional layer:
import util
np.random.seed(1)
N,C,H,W = 4,3,5,5
F,kH,kW = 6,3,3
oH,oW = 3,3
x = np.random.randn(N,C,H,W)
y = np.random.randn(N,F,oH,oW)
conv = Conv_fast(C,F,kH,1,0)
f = conv.forward(x)
loss,do = util.mse_loss_grad(f,y)
dx = conv.backward(do)
def loss_f():
f = conv.forward(x)
loss,do = util.mse_loss_grad(f,y)
return loss
dW_num = util.numerical_gradient(loss_f,conv.params[0],1e-6)
print(np.max(np.abs(conv.grads[0] - dW_num)))   # the exact error measure printed is assumed

4.198542114313848e-07
import time
#N,C,H,W = 64,256,64,64
#F,kH = 128,5
N,C,H,W = 128,16,64,64
F,kH = 32,5
x = np.random.randn(N,C,H,W)
oH = H-kH+1
do = np.random.randn(N,F,oH,oH)
start = time.time()
conv = Conv(C,F,kH)
f = conv(x)
conv.backward(do)
done = time.time()
elapsed = done - start
print(elapsed)
start = time.time()
conv = Conv_fast(C,F,kH)
f = conv(x)
conv.backward(do)
done = time.time()
elapsed = done - start
print(elapsed)
476.4419822692871
29.02124047279358
The original convolution takes 476 seconds, while the fast convolution takes only 29 seconds.
Replace the convolution of the previous convolutional neural network for MNist handwritten digit classification
with fast convolution, and look at the time efficiency:
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
"mnist.pkl.gz")
import gzip
# loading code reconstructed from the shapes printed below
with gzip.open("mnist.pkl.gz", "rb") as f:
    train_set, valid_set, test_set = pickle.load(f, encoding="latin1")
train_X, train_y = train_set
print(train_X.shape)
train_X = train_X.reshape(-1,1,28,28)   # to (N,C,H,W) form for convolution
print(train_X.shape)

(50000, 784)
(50000, 1, 28, 28)
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Conv_fast(1,2,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Conv_fast(2,4,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Dense(64, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
epochs=1
batch_size = 64
reg = 1e-3
print_n=100
learning_rate = 0.1   # assumed value; the original hyperparameter is not shown
momentum = 0.9        # assumed value
optimizer = SGD(nn.parameters(),learning_rate,momentum)
start = time.time()
X,y = train_X,train_y
losses = train.train_nn(nn,X,y,optimizer,util.cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
done = time.time()
elapsed = done - start
print(elapsed)
print(np.mean(nn.predict(X)==y))
[ 1, 1] loss: 2.383
[ 101, 1] loss: 2.316
[ 201, 1] loss: 2.283
[ 301, 1] loss: 2.160
[ 401, 1] loss: 1.675
[ 501, 1] loss: 1.091
[ 601, 1] loss: 0.514
[ 701, 1] loss: 0.659
690.5078177452087
0.83894
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(losses)
Figure 6-54. Training loss curve of a convolutional neural network for classifying Mnist handwritten digits
6.5 Typical convolutional neural network structure
In 1989, Yann LeCun, the inventor of the convolutional neural network (CNN) structure, used the backpropagation (BP) algorithm to train a multi-layer neural network to recognize handwritten postal codes. The network he used was later, in 1994, called LeNet; although the paper mentions neither convolution nor convolutional neural networks, saying only that adjacent 5 × 5 areas are used as receptive fields, it was an early convolutional network. In 1998, LeCun proposed the famous LeNet-5 network, marking the real birth of the convolutional neural network. Due to the hardware limitations of the time, training convolutional neural networks consumed a great deal of machine resources and time, so the CNN model did not become popular.
It was not until 2012 that Alex Krizhevsky implemented a deep convolutional neural
network called AlexNet with a GPU and won the championship of the ImageNet image
recognition competition that deep learning represented by deep convolutional neural
networks began to develop rapidly. Subsequently, various neural network structures were
proposed, such as VGG, GoogLeNet, ResNet, Inception, etc.
6.5.1 LeNet-5
Figure 6-55 shows the network structure of LeNet-5. A 32×32×1 image passes through six 5×5 convolution kernels with a stride of 1 and padding of 0, generating 6 feature maps of size 28×28; an average pooling operation with stride 2 and size 2 then produces six 14×14 feature maps, that is, the height and width of the image are halved. Sixteen 5×5 convolution kernels with stride 1 and padding 0 then generate 16 output feature maps of size 10×10, and another average pooling with stride 2 and size 2 produces 16 feature maps of size 5×5.
Then comes a fully connected layer with 120 neurons; each neuron receives all 400 feature values from the previous layer's 16 5×5 feature maps and produces one output value, so the 120 neurons generate a vector of 120 outputs, which is fed into the next fully connected layer of 84 neurons. That layer feeds its 84 outputs into the final output layer; for 10-class classification the output layer contains 10 neurons, each outputting the score of a sample belonging to the corresponding class. The model can be trained by passing these 10 scores through the softmax function and computing the multi-class cross-entropy loss against the true target values.

LeNet-5 is a classic convolutional network structure. It adopts the pattern of "convolution first, then pooling" to turn multi-channel input into multi-channel output while reducing the image size. As in the example above, the single-channel 32×32×1 input passes through a series of "convolution + pooling" transformations to produce 16 5×5 feature maps, which finally pass through several fully connected layers to produce the output.
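As a rough sketch, a LeNet-5-like structure could be assembled from the layers developed in this chapter (this sketch substitutes Relu and the Pool layer used earlier for the original sigmoid activations and average pooling, so it is not LeCun's exact design):

lenet = NeuralNetwork()
lenet.add_layer(Conv_fast(1, 6, 5))     # 32x32x1 -> 28x28x6
lenet.add_layer(Pool((2,2,2)))          # -> 14x14x6
lenet.add_layer(Conv_fast(6, 16, 5))    # -> 10x10x16
lenet.add_layer(Pool((2,2,2)))          # -> 5x5x16
lenet.add_layer(Dense(400, 120))        # 16*5*5 = 400 features
lenet.add_layer(Relu())
lenet.add_layer(Dense(120, 84))
lenet.add_layer(Relu())
lenet.add_layer(Dense(84, 10))          # 10 class scores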
6.5.2 AlexNet
AlexNet is a CNN network structure proposed by Alex Krizhevsky et al. It won first place in the 2012 ImageNet image classification competition, reducing the top-5 error rate by more than 10 percentage points at a stroke. The authors implemented a parallel neural network training algorithm on CUDA GPUs, making it possible to train deep neural networks in a reasonable time. Its network structure is shown in Figure 6-56.
AlexNet is very similar to LeNet, but its depth and scale are much larger: LeNet has about 60,000 parameters, while AlexNet has about 60 million. The most important improvement of AlexNet over LeNet is the use of the Relu activation function, which alleviates the "vanishing gradient" problem of deep neural networks. Another performance improvement is the Dropout technique: in a hidden layer, with a certain probability, the output of some neurons is set to 0 and does not participate in network propagation, so each iteration effectively trains a different function. This regularization technique in effect represents the trained model as a combination of simpler network functions. In addition, "Local Response Normalization" (LRN) was proposed, which normalizes the values of all channels at a given position of the feature maps in a layer; it was later found that LRN has little effect.
The success of AlexNet refocused the computer vision and artificial intelligence communities on neural networks, especially deep convolutional networks, which had been dormant for many years. Networks could become deeper, and deep neural networks built on simple principles could surpass mathematically sophisticated artificial intelligence techniques. Deep learning began to become the most important branch of machine learning; modern artificial intelligence mainly refers to deep learning.
6.5.3 VGG
The VGG-16 network is a simplified convolutional network structure proposed by Oxford's Visual Geometry Group (VGG for short). Its main contribution is to demonstrate that increasing the depth of a network can, to a certain extent, improve its final performance. In a typical convolutional network, different convolution layers use kernels of different sizes, whereas in the VGG network all convolution kernels have the same size, e.g. 3 × 3 with a stride of 1; the same holds for the pooling layers, which all use 2 × 2 max pooling with a stride of 2. This simplifies the structure of the convolutional neural network. As long as a VGG-16 network is deep enough, it can achieve performance equal to or better than more complicated network structures. The "16" in VGG-16 means that the convolutional and fully connected layers total 16 layers. Figure 6-57 shows the network structure of VGG-16.
The convolution kernels are all 3 × 3 with a stride of 1, and the pooling kernels are all 2 × 2 with a stride of 2. The number of output channels of the first convolutional layer is 64, and the subsequent convolutional layers double it in the order 128, 256, 512; once 512 is reached, the number of output channels no longer increases, as 512 channels is considered large enough. The VGG network structure is very regular, but the amount of training data required is large. VGG-19 was proposed later, but there is no significant difference in performance between VGG-19 and VGG-16.
When a value is repeatedly multiplied by numbers whose absolute values are less than 1, as in $c_i \cdots c_L y$, it tends to 0; similarly, when a value is repeatedly multiplied by numbers whose absolute values are greater than 1, it grows toward infinity.

Consider a simplified neural network in which each neuron computes $z = xw$. If there are L layers, the forward calculation is:

$$z_L = w_L w_{L-1} \cdots w_1 x$$

If the gradient of the loss function with respect to the last $z_L$ is $dz_L$, then the gradient with respect to $z_i$ is $dz_i = w_{i+1} \cdots w_L \, dz_L$, and the gradient of the loss function with respect to $w_i$ is $dw_i = dz_i z_{i-1} = w_{i+1} \cdots w_L \, dz_L \, z_{i-1}$.
If $\|w_i\| < \rho < 1$, then $dz_i$ decays exponentially as $L - i$ increases, and the larger $L - i$ is, the faster the decay, so $dw_i$ may become very small; with such a small gradient the update of the parameter $w_i$ almost stagnates and convergence becomes very slow. Similarly, if $\|w_i\| > \rho > 1$, then $dz_i$ grows exponentially with $L - i$ and becomes very large, making the parameter updates blow up and the training unstable.
As neural networks become deeper, gradient explosion and gradient decay are hard to avoid, which makes training deep neural networks very difficult. To prevent gradient explosion, the technique of gradient clipping can be used, that is, the magnitude of the gradient is limited to a predetermined range. Let $g$ be the gradient and $\theta$ be the clipping threshold; the gradient is clipped according to the formula:

$$g \leftarrow \min\left(\frac{\theta}{\|g\|}, 1\right) g$$
If grads contains the gradients of multiple weight parameters, the following code limits their global norm to at most c:
import math
def grad_clipping(grads,c):
norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
if norm > c:
ratio = c / norm
for i in range(len(grads)):
grads[i]*=ratio
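For example, a small usage sketch:

grads = [np.array([3.0, 4.0]), np.array([1.0, 2.0])]   # global norm sqrt(30) > 2
grad_clipping(grads, 2.0)
print(math.sqrt(sum((g**2).sum() for g in grads)))     # now 2.0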
Gradient clipping can solve the problem of gradient explosion to a certain extent, but it
cannot solve the problem of gradient disappearance.
The skip connections of a residual network are usually very regular; Figure 6-58 is a schematic diagram of the residual network structure:

Figure 6-58 Schematic diagram of the residual network structure; the upper figure is a residual network, the lower figure an ordinary neural network

Because of the short-circuit connection, the gradient of the reverse derivation can be fed back directly from the layer at the tail of the arc to the layer at its head through the skip connection, so the gradient is neither attenuated nor exploded by passing through multiple intermediate layers. The residual network was invented by the Chinese scholar Kaiming He and others, who found that as long as such skip connections are established between different layers, much deeper neural networks can be trained; with the residual structure one can even easily train networks with more than 1000 layers.
The residual network has a periodic structure and can be regarded as composed of residual blocks of the same form; Figure 6-60 shows the structure of one residual block:

Figure 6-60 The residual block is the structural unit of the residual network

This residual block is composed of 2 convolutional blocks, each of which first computes a weighted sum and then the output of an activation function. Before the activation function of the second convolutional block is computed, the residual block adds the input of the first convolutional block to the weighted-sum output of the second, and the sum then passes through the activation function. That is, the input x of the first convolutional block passes through its weighted sum and activation into the second convolutional block; the weighted sum F(x) of the second convolutional block is added to the input x of the first, giving F(x) + x, which is then fed into the activation function of the second convolutional block.
The function $x \to F(x) + x$ represented by this residual block adds an identity function $x \to x$ to the original functional relationship $x \to F(x)$. This forces $x \to F(x)$ to stay as close to 0 as possible, i.e. it restricts $x \to F(x)$ to a small subspace of functions, similar to the way regularization restricts the range of the weights. In addition, during reverse derivation, the gradient at the output of the second convolutional block is passed directly through this identity function to the input of the first convolutional block, thereby avoiding the vanishing-gradient problem. Therefore, the residual network both prevents the gradient from vanishing and acts as a regularizer preventing the function from being too complex.
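As a minimal sketch of this computation (the layer objects conv1, conv2 and the relu function here are illustrative assumptions, not the book's final implementation):

def res_block_forward(x, conv1, conv2, relu):
    f = relu(conv1(x))   # first convolutional block: weighted sum + activation
    f = conv2(f)         # second convolutional block: weighted sum only
    return relu(f + x)   # add the shortcut input, then apply the activation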
A residual network can thus be written as a composition of residual blocks $ResBlock_i$, whereas an ordinary neural network is a composition of plain layers. Of course, a residual block may consist of multiple convolutional blocks, and each convolutional block may contain batch normalization layers, pooling layers, dropout layers, and so on; that is, it may contain several or even a dozen network layers of various kinds.
Each convolution kernel of an Inception module accepts the same input and uses the "same" convolution method to generate feature maps with different numbers of output channels; these feature maps are concatenated into one final feature map. If the numbers of channels output by the 1 × 1, 3 × 3, 5 × 5, 7 × 7 kernels are 96, 16, 32, 64, then the concatenation produces 96+16+32+64 feature maps as output, as shown in Figure 6-62.
Of course, the Inception module can also include a pooling layer. The pooling layer will
reduce the size of the feature map. In order to generate a pooled output feature map of the
same size as the original feature map, a "same" pooling operation with padding is used.
Using the Inception module to replace the ordinary convolutional layer and pooling layer
can automatically learn the appropriate model parameters through training, thereby
automatically selecting the appropriate convolution kernel (pooling window) size.
The Inception module described above leads to a large amount of computation. For example, if the input is a 28 × 28 × 192 tensor, the output tensor of 32 5 × 5 × 192 convolution kernels is 28 × 28 × 32; each element of the output tensor is a weighted sum over 5 × 5 × 192 values, so there are in total 5 × 5 × 192 × 28 × 28 × 32 = 120,422,400 multiplications.
In order to reduce the amount of calculation, you can insert a 1 × 1 convolution kernel with
a relatively small number of output channels before 3 × 3, 5 × 5, 7 × 7 these convolution
kernels with a size greater than 1, as shown in the figure 6-63 shows:
If the number of output channels of the 1 × 1 convolution kernel is 16, i.e. 1 × 1 × 16, then although an extra layer is inserted in the middle, the amount of computation is greatly reduced. For example, inserting 1 × 1 × 16 before the above 5 × 5 × 32 gives 1 × 1 × 192 × 28 × 28 × 16 + 5 × 5 × 16 × 28 × 28 × 32 = 12,443,648 multiplications, about a factor of 10 fewer. Because a pooling layer always produces the same number of channels as its input, in order to make the pooling branch output fewer channels, its 1 × 1 convolution kernel is added after it.
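The two multiplication counts above can be checked directly:

direct = 5*5*192 * 28*28 * 32                      # 120422400
bottleneck = 1*1*192*28*28*16 + 5*5*16*28*28*32    # 12443648
print(direct // bottleneck)                        # about 9x fewer multiplications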
A neural network composed of such Inception modules in place of ordinary convolution and pooling layers is called an Inception network. Figure 6-64 shows the famous GoogLeNet, the Inception v1 network. Based on Inception v1, improved versions such as Inception v2, v3, and v4 were proposed, some even combined with residual networks.
Figure 6-65 Network in Network (NiN): the linear convolutional layer is replaced by a small network

In addition to the convolutional layer, the authors added 2 fully connected layers to this small network, arguing that this increases the nonlinear capability of the convolutional layer. The paper also uses global mean pooling to replace the traditional fully connected layer: global mean pooling is performed on each feature map, so that each feature map produces exactly one output value. This greatly reduces the excessive number of parameters caused by flattening feature maps into traditional fully connected layers and helps avoid overfitting. Since each feature map produces one output value, the number of input feature maps of the global pooling layer must equal the number of categories; for 10 categories, there must be 10 feature maps.
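Global mean pooling itself is a one-line operation; a minimal sketch for an input of shape (N, C, H, W):

def global_avg_pool(x):
    # average each feature map (channel) to a single value; result shape (N, C)
    return x.mean(axis=(2, 3))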
For the neural networks discussed so far, the prediction for a sample depends only on its own input $x^{(i)}$ and has nothing to do with the inputs and outputs of other samples $(x^{(j)}, y^{(j)}), j \neq i$; such a network treats its samples as independent of one another.
But some problems involve data with a sequential relationship: a video is composed of images generated in temporal order, a text or sentence is a sequence of words, a piece of music is a series of notes, a protein is a sequence of amino acids, and a stock curve contains a price at each moment. Judging or predicting a single element of a sequence in isolation is unreliable: understanding a word of an article or a paragraph in isolation is meaningless, and judging the motion of an object from a single frame of a video, such as whether the car in an image is stationary, moving forward, or moving backward, is not feasible. In machine translation, translating each word of a sentence separately, word for word, obviously does not work either.
Just as the convolutional neural network captures the spatial correlation between the features of a data sample, the Recurrent Neural Network (RNN) is a neural network structure for sequence data with sequential (temporal) relationships. A recurrent neural network is a network with state memory, which can memorize historical information along the time dimension. Specifically, at a certain time t, in addition to the input data (element) $x_t$ at the current time, there is a hidden state $h_{t-1}$ that memorizes information from before time t. Therefore, the input at time t includes the current data $x_t$ and the historical memory state $h_{t-1}$, and the output at time t includes the prediction $y_t$ and the new historical memory state $h_t$. The hidden state $h_t$ is propagated along the sequence and can in theory contain the historical information of all previous moments. The recurrent neural network predicts the current sequence element through this internal hidden-state memory of the earlier sequence, so it can make better predictions for the current moment.
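A single recurrent step can be sketched as follows; the weight names W_xh, W_hh, b and the tanh activation are illustrative assumptions here, and the concrete recurrent layer is developed later in this chapter:

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # the new hidden state mixes the current input with the previous memory
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)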
Recurrent neural networks can be used for problems with sequence dependencies between data, such as natural
language processing (such as machine translation, text generation, part-of-speech tagging, text sentiment
analysis), speech processing (recognition, synthesis), music generation, protein sequence analysis, Video
understanding and analysis, stock forecasting, etc.
Let $x_t$ represent the data features at time t and $y_t$ the target value to be predicted at time t. As in any supervised machine learning, predicting sequence data means learning a mapping or function $f : (x_1, x_2, \cdots, x_t) \to y_t$, that is, predicting the target value $y_t$ at time t from the data features $(x_1, x_2, \cdots, x_t)$ of all moments up to time t.
If $x_t$ and $y_t$ are the same type of data, for example $x_t$ is the stock price at time t and the target predicted at time t is the price at time t+1, i.e. $y_t = x_{t+1}$, then such a sequence data prediction problem is called an autoregressive problem.
7.1.1 Stock Price Prediction Problem
Predicting the stock price based on the historical information data of a stock is a typical sequence data prediction
problem. The following code uses the pandas package to read the stock data of the csv format file 'sp500.csv':
import pandas as pd
data = pd.read_csv('sp500.csv')
data.head()
output:
Date Open High Low Close Volume
0 03-01-00 1469.250000 1478.000000 1438.359985 1455.219971 931800000
1 04-01-00 1455.219971 1455.219971 1397.430054 1399.420044 1009000000
2 05-01-00 1399.420044 1413.270020 1377.680054 1402.109985 1085500000
3 06-01-00 1402.109985 1411.900024 1392.099976 1403.449951 1092300000
4 07-01-00 1403.449951 1441.469971 1400.729980 1441.469971 1225200000
Among them, each column represents: date, opening price, highest price, lowest price, closing price, trading
volume. In order to facilitate the training of machine learning algorithms, the data needs to be normalized. The
following code normalizes data other than dates:
data = data.iloc[:,1:6]
data = data.values.astype(float)
data = pd.DataFrame(data)
data = data.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
print(data[:3])
The above code for reading and normalizing the stock data can be wrapped into a function:

import pandas as pd
def read_stock(filename):
    data = pd.read_csv(filename)
    data = data.iloc[:,1:6]               # drop the date column
    data = data.values.astype(float)
    data = pd.DataFrame(data)
    data = data.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
    return data

data = read_stock('sp500.csv')
print(data[:3])
0 1 2 3 4
0 -0.005973 -0.005916 -0.015676 -0.012310 -0.191184
1 -0.012266 -0.016172 -0.034017 -0.037249 -0.184230
2 -0.037292 -0.035058 -0.042867 -0.036047 -0.177338
Stock price forecasting predicts the next day's price, such as the closing price, from each day's historical stock data: the opening price, highest price, lowest price, closing price, and trading volume. For this sequence prediction problem, the data $x_t$ at each moment contains these features, and the target value $y_t$ to be predicted is the closing price of the stock on the next day.

If the data $x_t$ at each moment contains only the closing-price feature and the $y_t$ to be predicted is also the closing price, i.e. they are the same type of data, then this stock price prediction problem is an autoregressive problem. The following code plots the curve of closing prices:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.array(data.iloc[:,-2])
print(x.shape)
plt.plot(x)
output:
(4697,)
$$y_t \sim p(y_t|x_1, \ldots, x_t)$$

That is, based on all the sequence data $(x_1, x_2, \cdots, x_t)$ up to time t, the probability of each possible target value $y_t$ is predicted. There are usually many, even infinitely many, possible target values, and the prediction problem is to determine the probability of each possible value of $y_t$ being the target. A model that predicts the probability of the target value from sequence data is called a probabilistic sequence model. For autoregressive problems, this probabilistic sequence model is expressed as:

$$x_t \sim p(x_t|x_1, \ldots, x_{t-1})$$
2. Language Model

The basis of natural language processing is the language model, which models the probability of a sentence, that is, determines how likely a sentence is to appear. For example, the probability of "I am Chinese" is obviously greater than that of its scrambled variant "Chinese am I". A sentence is a series of words, i.e. an ordered word sequence; "I am Chinese" is such an ordered sequence of words. Suppose a sentence S is composed of the words $w_1, w_2, w_3, \cdots, w_n$, and its probability is denoted $P(w_1, w_2, w_3, \cdots, w_n)$. According to probability theory, this probability can be expressed as:

$$P(w_1, w_2, w_3, \cdots, w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) \cdots P(w_n|w_1, w_2, \cdots, w_{n-1})$$

It is the product of the conditional probabilities of the successive words: $w_1$ appears first with probability $P(w_1)$, $w_2$ appears after $w_1$ with probability $P(w_2|w_1)$, and in general $w_n$ follows the preceding words with probability $P(w_n|w_1, w_2, \cdots, w_{n-1})$.
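For example, with made-up conditional probabilities for a 3-word sentence $w_1 w_2 w_3$:

P_w1 = 0.2             # P(w1)
P_w2_given_w1 = 0.5    # P(w2 | w1)
P_w3_given_w1w2 = 0.1  # P(w3 | w1, w2)
P_sentence = P_w1 * P_w2_given_w1 * P_w3_given_w1w2   # 0.01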
According to this formula (7-3), if these conditional probabilities are known, i.e. the probability $P(w_i|w_1, w_2, \cdots, w_{i-1})$ of the next word $w_i$ appearing given the known words $w_1, w_2, \cdots, w_{i-1}$, then the probability of any sentence composed of a series of words can be computed. Therefore, the language model predicts, from the existing word sequence, the next word or the probability of each word in the vocabulary: $P(x_t|x_1, \cdots, x_{t-1})$.
If the true data $x_t$ depends only on the data $(x_{t-\tau}, \cdots, x_{t-1})$ of the previous fixed number $\tau$ of moments, the sequence data is said to satisfy the Markov property. Such autoregressive models are also known as Markov models.

The simplest autoregressive model assumes that $x_{t-\tau}, \ldots, x_{t-1}$ and $x_t$ satisfy a linear relationship:

$$x_t = a_0 + a_1 x_{t-1} + \cdots + a_\tau x_{t-\tau} + \epsilon$$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def gen_seq_data_from_function(f,ts):
return f(ts)
T =5000
x = gen_seq_data_from_function(lambda ts:np.sin(ts*0.1)+np.cos(ts*0.2),\
np.arange(0, T))
plt.plot(x[:500])
plt.show()
Figure 7-2 Autoregressive Data Generated from Function Values
But such data has obvious periodicity, while real sequence data such as stock prices does not. The autoregressive model of formula (7-4) can instead be used to generate non-periodic sequence data from some initial data. The steps are: (1) generate a set of stable model coefficients; (2) randomly generate the initial $\tau$ values; (3) repeatedly apply formula (7-4) with a noise term to generate the subsequent values.

Research on autoregressive models shows that the generated sequence is stable only when all roots of the characteristic equation $x^\tau - a_1 x^{\tau-1} - a_2 x^{\tau-2} - \cdots - a_\tau = 0$ have absolute value less than 1.

The following function init_coefficients() generates the coefficients of a stable autoregressive model:
The following function init_coefficients() generates the coefficients of a stable autoregressive model:
np.random.seed(5)
def init_coefficients(n):
    while True:
        a = np.random.random(n) - 0.5
        coefficients = np.append(1, -a)
        if np.max(np.abs(np.roots(coefficients))) < 1:
            return a
init_coefficients(3)
The following function generate_data() generates autoregressive data according to the above 3 steps. Because the distribution of the initially generated data is very different from the stable distribution and also affects the subsequent data, the series only becomes truly stable after some time; it is therefore necessary to discard some of the initially generated data, such as the first 3n values.
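generate_data() itself is not shown in the text; a minimal sketch consistent with this description (interpreting the call generate_data(5, 100) below as: model order τ=5, keep 100 points; the noise scale is an arbitrary assumption):

def generate_data(tau, n):
    # a stable AR(tau) model: x_t = a_1*x_{t-1} + ... + a_tau*x_{t-tau} + noise
    a = init_coefficients(tau)
    total = n + 3*tau                      # generate 3*tau extra points
    x = np.zeros(total)
    x[:tau] = np.random.randn(tau)         # random initial values
    for t in range(tau, total):
        x[t] = np.dot(a, x[t-tau:t][::-1]) + 0.1*np.random.randn()
    return x[3*tau:], a                    # discard the unstable initial part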
x,_ = generate_data(5,100)
plt.plot(x[:80])
plt.show()
For example, in a table tennis game, it is impossible to judge the motion of the ball from its position at a single moment; that is, the velocity v^(t) of the ball cannot be predicted from the position x^(t) alone. However, if the positions at several consecutive moments around time t are combined into a single data feature x^(t), an estimate v̂^(t) of the ball's velocity can be predicted from this data feature.

This method of using a fixed-length subsequence around the current position as the data feature of the sample at the current position is called the time window method. Time windows allow sequence data to be processed directly with "one-to-one" neural networks. The time window is a traditional method for dealing with time series: for example, predicting a day's stock price from the stock information of the 60 consecutive days before it, or predicting the probability of the next word in a language model from the k known preceding words.
The time window method transforms the prediction problem of sequence model into the supervised learning
problem of non-sequence data, so that the prediction problem of sequence data can be modeled by the existing
supervised learning method of non-sequence data. The application of the time window method is illustrated
below with the forecasting problem of autoregressive sequence data.
A subsequence of length T can be regarded as a single data feature for supervised learning, and the element following it can be used as the target value, so that the sequence prediction problem is transformed into a supervised learning problem on non-sequential data. The problem can then be modeled and trained with the supervised machine learning methods seen earlier, such as the acyclic neural networks of previous chapters. To do this, training data must be prepared for the model.

From a sequence of data, a subsequence x[i : i+T+1] of length T+1 can be taken from any position i to constitute one supervised learning sample: x[i : i+T] makes up the data feature x_i of the sample, and x[i+T] is the target value y_i. For sequence data of length n, the valid range of i is [0, n−(T+1)].
i
The set data_set composed of these samples can be divided into a training set (x_train, y_train) and a test set (x_test, y_test) in a certain proportion. The following code samples training data from the sequence according to the time window width T, using a helper function gen_data_set() (whose sketch is given next):
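gen_data_set() is not defined in the text; a minimal sketch, assuming it slides a window of width T over the series and splits the samples into training and test sets in a fixed proportion (the 90/10 split here is an assumption):

def gen_data_set(x, T, train_ratio=0.9):
    xs, ys = [], []
    for i in range(len(x) - T):
        xs.append(x[i:i+T])    # a window of T values forms the data feature
        ys.append(x[i+T])      # the next value is the target
    xs, ys = np.array(xs), np.array(ys)
    n_train = int(len(xs) * train_ratio)
    return xs[:n_train], ys[:n_train], xs[n_train:], ys[n_train:]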
x = gen_seq_data_from_function(lambda ts:np.sin(ts*0.1)+np.cos(ts*0.2),\
np.arange(0, 5000))
x_train, y_train, x_test, y_test = gen_data_set(x, 50)
y_train = y_train.reshape(-1,1)
print(x_train.shape,y_train.shape)
hidden_dim = 50
n = x_train.shape[1]
print("n",n)
nn = NeuralNetwork()
nn.add_layer(Dense(n, hidden_dim)) #('xavier',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(hidden_dim, 1)) #('xavier',0.01)))
learning_rate = 1e-2
momentum = 0.8 #0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=20
batch_size = 200 # len(train_x) #200
reg = 1e-1
print_n=100
losses = train_nn(nn,x_train,y_train,optimizer,
util.mse_loss_grad,epochs,batch_size,reg,print_n)
#print(losses[::len(losses)//50])
plt.plot(losses)
n 50
0 iter: 3.144681992803935
100 iter: 0.3332809082102651
200 iter: 0.13722749233747686
300 iter: 0.10941419118718776
400 iter: 0.10108511745662195
Figure 7-5 Training loss curve of 2-layer fully connected neural network
There are two ways to use the trained model for forecasting. One starts from an initial sequence of real data x_0, x_1, ⋯, x_{T−1}: x_T is predicted from it, then x_{T+1} is predicted from x_1, x_2, ⋯, x_T, x_{T+2} from x_2, x_3, ⋯, x_{T+1}, and so on. That is, the initial sequence x_0, x_1, ⋯, x_{T−1} is used to predict many subsequent moments. This type of forecast is called a long-term forecast. Since each prediction is not perfectly accurate, using predicted values as if they were real values to predict further values becomes more and more inaccurate: as time goes on, the error between the predicted and real values grows larger and larger. The other way is short-term forecasting; in the extreme case, each moment always uses the real data in the time window of that moment (such as this moment and the previous T−1 moments) to predict only the data of the next moment. Because the input data are all real values and only one step ahead is predicted, the prediction quality is good, but the data used for prediction at each moment must be real data rather than previously predicted values.
The following code adopts long-term forecasting to predict the data of a series of subsequent time points from the real data sample at the initial time point, and visually compares these predicted values with the corresponding target values of the test set to observe the predictive performance of the model.
x = x_test[0].copy()
x = x.reshape(1,-1)
ys = []
for i in range(400):
    y = nn.forward(x)
    ys.append(y[0][0])
    x = np.delete(x, 0, 1)                       # drop the oldest value
    x = np.append(x, y.reshape(1,-1), axis=1)    # append the prediction
ys = ys[:]
plt.plot(ys[:400])
plt.plot(y_test[:400])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-6 Long-term forecasting with a trained model with time window length T=50
It can be seen that the predicted results are very close to the real target values, because this is a periodic curve and the window T=50 roughly covers one period of the curve (50×0.1 = 5 is close to 2π ≈ 6.28). If the time window is shorter, such as T=10, the prediction becomes poor, as shown in Figure 7-7.

It can also be seen that prediction accuracy drops the further into the future we predict: since predicted values are fed back in place of real data to predict later values, the errors accumulate and grow larger and larger.
Figure 7-7 Long-term forecasting with a trained model with time window length T=10
The following code is to use the trained neural network for short-term prediction, that is, each time the real data
is used to predict the data value of the next moment:
ys = []
for i in range(400):
    x = x_test[i].copy()
    x = x.reshape(1,-1)
    y = nn.forward(x)
    ys.append(y[0][0])
ys = ys[:]
plt.plot(ys[:400])
plt.plot(y_test[:400])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-8 Short-term forecasting with a trained model with time window length T=50

The same time window method can be applied to the stock closing-price data, this time with a window length of T=100 (the data preparation is analogous to the sine example above). The printed shapes of the raw series, the reshaped series, and the training set are:
(4697,)
(4697, 1)
(4136, 100, 1) (4136, 1)
learning_rate = 0.1
momentum = 0.8 #0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=60
batch_size = 500 # len(train_x) #200
reg = 1e-6
print_n=50
losses = train_nn(nn,x_train,y_train,optimizer,
                  util.mse_loss_grad,epochs,batch_size,reg,print_n)
plt.plot(losses)
n 100
0 iter: 0.04027576839624083
50 iter: 0.0005585708338086856
100 iter: 0.0004103264701123903
150 iter: 0.0003765723130633676
200 iter: 0.0003516184170804334
250 iter: 0.00035039658640954825
300 iter: 0.00030599817269094394
350 iter: 0.00031335621767437775
400 iter: 0.000308409636035205
450 iter: 0.0003134471927653575
Figure 7-9 Network model training loss curve of stock data with time window length T=100
Use the first sample of the test set as a starting point for long-term prediction, that is, to continuously use the
predicted value to construct new data features to predict the stock price of the next day:
x = x_test[0].copy()
x = x.reshape(1,-1)
ys =[]
num = 400
for i in range(num):
    y = nn.forward(x)
    ys.append(y[0][0])
    x = np.delete(x, 0, 1)                       # drop the oldest value
    x = np.append(x, y.reshape(1,-1), axis=1)    # append the prediction
ys = ys[:]
plt.plot(ys[:num])
plt.plot(y_test[:num])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-10 Long-term forecasting with a trained model with time window length T=100
The results show that for sequence data with as little regularity as stock prices, even with a large time window (100), the long-term prediction results are not ideal. The following code uses short-term forecasting instead, that is, it always uses the real data of the previous 100 days to predict the stock price of the 101st day:

ys = []
num = 400
for i in range(num):
    x = x_test[i].copy()   # the window is re-read from the real test data each time
    x = x.reshape(1,-1)
    y = nn.forward(x)
    ys.append(y[0][0])
ys = ys[:]
plt.plot(ys[:num])
plt.plot(y_test[:num])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-11 Short-term forecasting with a trained model with time window length T=100
According to the formula P(A|B) = P(A∩B) / P(B), the conditional probability P(w_i|w_1, w_2, ⋯, w_{i−1}) can be expressed as:

P(w_i|w_1, w_2, ⋯, w_{i−1}) = P(w_1, w_2, ⋯, w_{i−1}, w_i) / Σ_w P(w_1, w_2, ⋯, w_{i−1}, w) = P(w_1, w_2, ⋯, w_{i−1}, w_i) / P(w_1, w_2, ⋯, w_{i−1})

To calculate this conditional probability, the joint probabilities P(w_1, w_2, ⋯, w_{i−1}) and P(w_1, w_2, ⋯, w_{i−1}, w_i) are needed. These probabilities can be calculated by statistical methods that approximate probabilities with frequencies. For example, to estimate the joint probability P("China", "person") of the two words w_1 = "China" and w_2 = "person" appearing together, one can count in a corpus (such as a large collection of texts) the number n of occurrences of "China person", and the number m of occurrences of all pairs of arbitrary words (such as "Hello", "playing ball", "Chinese dream"); the frequency n/m then approximates the probability P("China", "person").
But if i is relatively large, this calculation is obviously impractical. There are two problems:

The number of possible word sequences grows exponentially with i, so the number of joint probabilities (model parameters) that must be estimated is far too large.

It is very likely that the sequence w_1, w_2, w_3, ⋯, w_{i−1}, w_i never appears in the corpus at all, so the estimated probability P(w_1, w_2, w_3, ⋯, w_{i−1}, w_i) is 0.
In order to solve this problem of the conditional probability depending on too many parameters, the Markov assumption is usually introduced: the probability of a word is assumed to be related only to a limited number of the words appearing before it. In the extreme case, the appearance of a word is assumed to be independent of its surrounding words, that is, its probability does not depend on other words at all. This language model is called a unigram language model. The probability of the sentence S = w_1, w_2, w_3, ⋯, w_n then becomes very simple:

P(w_1, w_2, w_3, ⋯, w_n) = P(w_1) ∗ P(w_2) ∗ P(w_3) ∗ ⋯ ∗ P(w_n)
But this language model is obviously unreasonable, because the appearance of words in the text will not be
independent of each other, and there is a dependency relationship. If a language model assumes that the
probability of a word appearing only depends on a word that appeared before it, this language model is called a
2-gram language model (bigram).
By analogy, if a language model assumes that the probability of a word only depends on the k-1 words in front of
it, this language model is called k-gram language model (k-gram). The k-gram language model is a specific
application of the time window method on the language model, that is, the probability of the next word is
predicted by using the first k-1 words.
Obviously, the larger k is, the higher the prediction accuracy. For example, in a 2-gram language model, if the current word is "China", there are many possible next words and it is hard to predict which will come; but in a 4-gram model, if the words that have appeared in sequence are "I", "am", "China", then the probability that the next word is "person" is very high. However, the larger k is, the more serious the two problems above become. To balance these difficulties, traditional language models generally use 3-gram or 4-gram models.
If such a k-element language model is constructed, the probability that each word in the word list will appear as
the next word can be predicted according to the k-1 words that have appeared before, so as to predict the
probability that the entire sentence will appear.
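As an illustration of the frequency-counting approach, a tiny 2-gram (bigram) model can be built from word-pair counts; the toy corpus and the helper p_next() below are made up for this sketch:

from collections import Counter

corpus = "I am Chinese . I am a person . I am happy".split()   # toy corpus
unigram = Counter(corpus)                          # count(w)
bigram = Counter(zip(corpus[:-1], corpus[1:]))     # count(w, w_next)

def p_next(w, w_next):
    # P(w_next | w) approximated by frequency: count(w, w_next) / count(w)
    return bigram[(w, w_next)] / unigram[w]

print(p_next("I", "am"))   # 1.0: in this corpus "I" is always followed by "am"
print(p_next("am", "a"))   # 0.333...: "am" is followed by "a" once out of 3 times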
Language models are the basis for a variety of natural language processing problems. For example, a language model can be used for long-term prediction from a few initial words: the next word is sampled according to the probabilities given by the language model, and the process is repeated to generate a series of subsequent words, thereby automatically generating text such as articles, novels, poems, essays, or reviews.
The k-gram language model has some obvious limitations in predicting with fixed-length "time window" data.
There are mainly two problems:
The length of the time window is difficult to determine. If the time window is too short, it causes short-sightedness. In machine translation, for example, the meaning of the current word often has to be understood from a long preceding context, and in text generation the next word can only be predicted correctly from a long preceding text. For example, in "On the way to school, Old Zhang's son saw an old lady who had fallen down. ...", whether the next word should be "he" or "she" must be determined from the earlier word "son". On the other hand, the computation of the algorithm is proportional to the length of the sample data, so if the time window is too long, it takes more time; for a language model, a long time window also makes the probabilities very difficult to estimate. In addition, short sequence samples (such as short sentences) must be padded with many blank elements, wasting space. For many sequence data problems, the length of the dependencies differs from moment to moment, and it is difficult to determine one appropriate time window length.
When an ordinary neural network is used to model sequence prediction, the scale of the model parameters grows with the time window. Compared with processing each original data sample separately, a time window of length 3 triples the length of the input sample; to capture the features of this larger input, the number of neurons in each layer grows accordingly, which multiplies the number of model parameters. This not only consumes more computing resources but also increases the complexity of the model function and easily leads to overfitting.
The time window provides only short-term memory, but when people understand things they use not only short-term memory but all of their past memory. In order to handle such variable-length sequence data, researchers imitated human long-term memory and invented the Recurrent Neural Network (RNN). An RNN adds a storage/memory unit to the neurons of a traditional neural network, so that historical computation information can be saved. In other words, the neurons have a memory function: the computation at each moment depends not only on the current input but also on the stored historical information, so data and computation results are passed along the time dimension, and a recurrent neural network can in theory memorize arbitrarily long sequences. Just as a convolutional neural network extracts features along the spatial dimensions, a recurrent neural network passes information along the time dimension; it is the extension of the acyclic neural network into the time dimension.

The function represented by an acyclic neural network is like a function without memory in a programming language, such as a C function without static variables or an ordinary Python function. For example:
def f(x):
    y = 0
    y += x*x
    return y
print(f(2),'\t',f(3))
print(f(3),'\t',f(2))
4 9
9 4
Whether f(2) is computed before f(3) or f(3) before f(2), the results of f(2) and f(3) depend only on their respective inputs 2 and 3 and are independent of the order of execution.
If an acyclic neural network is used to predict the probability of the next word from the current word, this prediction is likewise independent of the order in which the words are processed. As shown in Figure 7-12, suppose the language has only 3 words, "good", "drink", and "wine"; for each word of the input word sequence "good drink", the neural network outputs the probability of each word being the next word.

Figure 7-12 The probabilities of "good", "drink", and "wine" as the next word predicted from the word "good", and those predicted from the word "drink", are independent of each other. The probabilities predicted from the word "good" are the same regardless of whether the input sequence is "good drink", "wine good drink", or "drink good wine".

No matter in what order these three words appear in a sentence, such as "wine good drink" or "drink good wine", for each word the network predicts the probability of each word being its next word from that word alone, and the prediction is always the same. Using such a neural network as a language model, the output for each word depends only on that word and has nothing to do with the other words, which is obviously unreasonable.
Without loss of generality, suppose the neural network has only one neuron or one layer of neurons, namely:

y = f(x) = g(xW + b)

Assuming the nonlinear activation function is the sigmoid function, this neural network can be represented by Python code:

class FNN:
    # ...
    def forward(self, x):
        y = sigmoid(np.dot(x, self.W) + self.b)
        return y
nn = FNN()
y1 = nn.forward(x1)
y2 = nn.forward(x2)
y3 = nn.forward(x3)
The order of the three prediction statements has no effect on the prediction results y1, y2, y3.
As shown in Figure 7-12, in order to input a word into the neural network as a sample, each word must be vectorized, that is, converted into a vector of fixed length. Because this language model has only 3 words, a one-hot vector of length 3 can distinguish them: each word corresponds to a different one-hot vector, such as good (1,0,0), drink (0,1,0), wine (0,0,1).
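A minimal sketch of this one-hot encoding (the helper one_hot() is introduced here just for illustration):

import numpy as np
vocab = ["good", "drink", "wine"]   # the 3-word vocabulary of this example
def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v
print(one_hot(0, len(vocab)))   # "good"  -> [1. 0. 0.]
print(one_hot(1, len(vocab)))   # "drink" -> [0. 1. 0.]
print(one_hot(2, len(vocab)))   # "wine"  -> [0. 0. 1.]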
7.2.2 Recurrent neural network with memory function
Unlike the acyclic neural network, the recurrent neural network has a memory function and can be expressed as y = f(x, h): in addition to the input x, there is a hidden state variable h that records the history of the computation. Recurrent neural networks are similar to functions or classes with memory in programming languages, such as C functions containing static local variables, or class objects in languages such as C++, Java, and Python. For example, the following class rf uses a data attribute h to record the intermediate results (state) of its computations:

class rf:
    def __init__(self):
        self.h = 0
    def forward(self, x):
        self.h += 2*x
        return self.h + x*x
    def __call__(self, x):
        return self.forward(x)
f = rf()
print(f(2),'\t',f(3))
print(f(3),'\t',f(2))
8 19
25 24
For an input value x, the forward() method of the rf class computes its output not only from x but also from the information h saved during previous calls. Therefore, the outputs of f(2) and f(3) depend on the order in which they are executed. The variable h, which records the intermediate results of earlier computations, is called the state.
Like the class above, the recurrent neural network also has a variable h that records/memorizes information about the computation; in recurrent neural networks this variable is called the hidden state (variable). At any time t, the recurrent neural network computes the current output f⟨t⟩ and the current state h⟨t⟩ from the current input data x⟨t⟩ and the previous state variable h⟨t−1⟩. The state h⟨t⟩ at time t is then used as an input at time t+1 to participate in the computation at time t+1:

y⟨t+1⟩, h⟨t+1⟩ = f(x⟨t+1⟩, h⟨t⟩)

The state variable h⟨t⟩, which changes over time, stores/memorizes the historical information. Based on the history represented by this state variable and the data at the current moment, better predictions can be made.
The cyclic neural network is usually represented by the diagram shown in Figure 7-13 a). The difference
between it and the ordinary neural network is that the hidden state calculated at the current moment will
be used as the input at the next moment, so it is drawn as an arc pointing to itself. Hidden state variables
are used as both the output calculated at the current moment and the input calculated at the next moment.
Figure 7-13 b) is an expanded representation of the calculation process in the time dimension of Figure
7-13 a). It can be seen that the state variables at the previous moment are used as the input at the current
moment to calculate the output and state variables at the current moment. This state variable is used as
the input for the calculation at the next moment.
Figure 7-13 a) Representation of a recurrent neural network, in which a hidden state variable is both an output of the current computation and an input to the next computation. b) The computation of a) expanded along the time dimension: the state variable of the previous moment is used as an input at the current moment to compute the current output and state variable, and the current state variable is in turn an input to the computation at the next moment.
At the initial moment t=0, the input state variable h⟨−1⟩ is a vector with initial value 0. For the above 3-word language model, the input data is the feature vector x⟨0⟩ = (1,0,0) corresponding to the word "good", and the network computes the current output y⟨0⟩ and state variable h⟨0⟩ from x⟨0⟩ and h⟨−1⟩ using the formulas below.

Consider the simplest recurrent neural network with only one neuron or one layer of neurons, and use x⟨t⟩, h⟨t⟩, f⟨t⟩ to denote the input data, the state variable, and the output respectively. The computation of the recurrent neural network is almost the same as that of an ordinary neural network with one neuron or one layer of neurons:

h⟨t⟩ = g_h(h⟨t−1⟩ W_h + x⟨t⟩ W_x + b_h)    (1)

f⟨t⟩ = g_f(h⟨t⟩ W_f + b_f)    (2)

where g_h and g_f are the activation functions for computing the current state and the output respectively. If g_h is the tanh function and g_f is the sigmoid function, the computation can be expressed in Python code as follows:
class RNN:
    # ...
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.h, self.W_hh) + np.dot(x, self.W_hx) + self.b)
        # compute the output vector
        y = sigmoid(np.dot(self.h, self.W_hy) + self.b2)
        return y
For time series data x⟨1⟩, x⟨2⟩, x⟨3⟩, the RNN computes the outputs as follows:

rnn = RNN()
y1 = rnn.step(x1)  # also computes the hidden state h1
y2 = rnn.step(x2)  # also computes the hidden state h2
y3 = rnn.step(x3)  # also computes the hidden state h3
It can be seen that the structure of the recurrent neural network is still similar to that of an ordinary neural network; the only difference is that the computation uses the saved hidden state to compute the hidden state and output of the current moment. Expanding Figure 7-13 a) into the structure of Figure 7-13 b) along the time dimension helps in understanding the iterative computation, but the network is not multiple copies of a neural network along time: a single state variable h⟨t⟩ is added to the neural network (neuron) to save the result of the previous moment's computation. Therefore, the model parameters of a recurrent neural network do not grow as time expands, and arbitrarily long sequences can be processed by repeatedly calling rnn.step(). A time-window neural network can only handle sequences of fixed length, and its model parameters grow with the window length; the recurrent neural network neatly solves both of these problems.
An RNN stores historical information in the state variable h⟨t⟩, which gives it a memory function, just like playing a game that saves progress: each session builds on the points and abilities of the previous sessions. Playing a game without saved progress, every session is a fresh start, unrelated to the previous ones.
As with the one-to-one neural network, two auxiliary variables z_h⟨t⟩, z_f⟨t⟩ can be introduced for the weighted sums in formulas (1) and (2); the computation of the recurrent neural network can then be expressed as the following 4 formulas:

z_h⟨t⟩ = x⟨t⟩ W_x + h⟨t−1⟩ W_h + b_h

h⟨t⟩ = g_h(z_h⟨t⟩)

z_f⟨t⟩ = h⟨t⟩ W_f + b_f

f⟨t⟩ = g_f(z_f⟨t⟩)
The forward computation of an RNN with only one neuron or a single network layer is shown in Figure 7-14.

Figure 7-14 The forward computation of the recurrent neural network with the intermediate weighted-sum variables z_h⟨t⟩, z_f⟨t⟩: first z_h⟨t⟩ is computed from the input and the hidden state of the previous moment, then h⟨t⟩ through the activation function, and then z_f⟨t⟩ and f⟨t⟩.
Like ordinary one-to-one neural networks, recurrent neural networks can also have multiple layers, and
the output of the previous layer is used as the input of the next layer. At the same time, the neurons in
each layer have their own state variables. Figure 7-15 is a three-layer recurrent neural network:
A recurrent network structure that first processes the entire input sequence and then generates the output sequence is called a Sequence-to-Sequence (Seq2seq) structure.
Of course, there is also a many-to-one RNN, such as classifying a text of a word sequence (such as
performing sentiment analysis on a text, and analyzing the quality of reviews from product reviews), as
shown in Figure 7-17.
There is also a neural network with a one-to-many structure, as shown in Figure 7-18, which produces an
output sequence given an input. For example, given a word, automatically generate a series of text
composed of words, and automatically generate all notes of a musical score from a note.
Taking the synchronous many-to-many RNN as an example, and assuming the RNN has only one layer of neurons (multi-layer networks are analogous), each moment has a predicted value f⟨t⟩ and a target value y⟨t⟩, and therefore each moment has a loss L⟨t⟩, as shown in Figure 7-19:

Figure 7-19 For a synchronous many-to-many RNN network, each moment has a predicted value f⟨t⟩ and a loss L⟨t⟩.

The total loss is the sum of the losses between the predicted and target values at all moments:

L = Σ_{t=1}^{T} L⟨t⟩
If this is a one-way RNN, that is, the prediction at each moment only depends on the state at the previous
moment, the solid line in the figure represents the forward calculation process unfolded according to
time, while the dotted line represents the reverse derivation according to the reverse time process. The
loss function at each moment is a function of the variables (hidden state, input) and model parameters at
the previous moment. In the reverse derivation, it is necessary to find the gradient of the loss function
with respect to the variables and model parameters at the previous moment.
For any time t, its forward and reverse calculation process is shown in Figure 7-20:
Figure 7-20 At any time t, the forward and reverse calculation process of the cyclic neural network, the
gradient of the model parameters in the reverse derivation process includes the gradient of the current
loss and the subsequent loss with respect to the model parameters.
At any time, it is necessary to calculate the gradient of the loss at the current moment with respect to the
model parameters, and also calculate the gradient of the model parameters at the current moment
contributed by the hidden state gradient at the next moment. That is, the gradient of the model
parameters at the current moment includes the gradient of the loss at the current moment and the loss at
the subsequent moment with respect to the model parameters.
Introducing the intermediate variables z_f⟨t⟩, z_h⟨t⟩ simplifies the computation of the model parameter gradients. Figure 7-21 shows the forward computation and reverse derivation processes involving these variables.
Figure 7-21 Forward calculation and reverse derivation process including intermediate variables
Suppose the gradient ∂L⟨t⟩/∂f⟨t⟩ of the loss at the current moment with respect to f⟨t⟩ and the gradient ∂L/∂h⟨t⟩ of the losses at subsequent moments with respect to h⟨t⟩ are known. On this basis, the gradients of the loss at moment t with respect to the model parameters W_f, W_h, W_x and the hidden state h⟨t−1⟩ of the previous moment can be found.

Because the output f⟨t⟩ at time t contributes only to the loss L⟨t⟩ at time t (that is, only L⟨t⟩ depends on it), the gradient of the total loss L with respect to f⟨t⟩ is simply:

∂L/∂f⟨t⟩ = ∂L⟨t⟩/∂f⟨t⟩
Note that the model parameters W_f, W_h, W_x are shared at all moments; for example, for the model parameter W_f, the gradient of the total loss with respect to it is the sum of the gradients at all moments. The loss L⟨t⟩ at time t measures the error between the predicted value f⟨t⟩ and the true value y⟨t⟩: L⟨t⟩ depends on f⟨t⟩, f⟨t⟩ depends on z_f⟨t⟩, and z_f⟨t⟩ depends on W_f. The gradient of the loss L⟨t⟩ at time t with respect to W_f is therefore:

∂L⟨t⟩/∂W_f = (h⟨t⟩)ᵀ · ∂L⟨t⟩/∂z_f⟨t⟩ = (h⟨t⟩)ᵀ · (∂L⟨t⟩/∂f⟨t⟩ ⊙ g_f′(z_f⟨t⟩))

The gradient of the total loss with respect to W_f is obtained by accumulating the gradients of all moments:

∂L/∂W_f = Σ_{t=1}^{T} (h⟨t⟩)ᵀ · (∂L⟨t⟩/∂f⟨t⟩ ⊙ g_f′(z_f⟨t⟩))
How can ∂L/∂h⟨t⟩ be found? The hidden state h⟨t⟩ output at time t is, on the one hand, used to compute the output f⟨t⟩ and, on the other hand, serves as the hidden state input of the next moment. That is, it affects not only the loss L⟨t⟩ of the current moment through f⟨t⟩, but also, as the hidden state input of the next moment, the losses L⟨t′⟩ at all subsequent moments t′ > t. Therefore, the gradient of the loss with respect to h⟨t⟩ can be split into two parts:
Let L⟨t−⟩ denote the sum of the losses at time t and all subsequent moments, L⟨t−⟩ = Σ_{t′=t}^{T} L⟨t′⟩, and let L⟨t+1−⟩ denote the sum of the losses at all moments after t, L⟨t+1−⟩ = Σ_{t′=t+1}^{T} L⟨t′⟩. Then:

∂L⟨t−⟩/∂h⟨t⟩ = ∂L⟨t⟩/∂h⟨t⟩ + ∂L⟨t+1−⟩/∂h⟨t⟩

The second term, ∂L⟨t+1−⟩/∂h⟨t⟩, is the gradient with respect to the hidden state h⟨t⟩ output at time t that flows back from time t+1.
Of course, at the last moment T, ∂L⟨T−⟩/∂h⟨T⟩ = ∂L⟨T⟩/∂h⟨T⟩, that is, the gradient with respect to the hidden state h⟨T⟩ comes only from the loss at this last moment:

∂L⟨T⟩/∂h⟨T⟩ = ∂L⟨T⟩/∂z_f⟨T⟩ · W_fᵀ
Because h⟨t⟩ = g_h(z_h⟨t⟩), once ∂L⟨t−⟩/∂h⟨t⟩ is known, the gradient of the loss function with respect to z_h⟨t⟩ can be obtained:

∂L⟨t−⟩/∂z_h⟨t⟩ = ∂L⟨t−⟩/∂h⟨t⟩ ⊙ g_h′(z_h⟨t⟩)
Further, the gradients of the loss function with respect to the model parameters W_h, W_x and the hidden state h⟨t−1⟩ output at the previous moment can be obtained:

∂L⟨t−⟩/∂h⟨t−1⟩ = ∂L⟨t−⟩/∂z_h⟨t⟩ · W_hᵀ

∂L/∂W_h = Σ_{t=1}^{T} (h⟨t−1⟩)ᵀ · ∂L⟨t−⟩/∂z_h⟨t⟩

∂L/∂W_x = Σ_{t=1}^{T} (x⟨t⟩)ᵀ · ∂L⟨t−⟩/∂z_h⟨t⟩
Assume that the RNN has only one hidden layer, f⟨t⟩ is the output at moment t, and y⟨t⟩ is the true value at moment t. For a multi-classification problem, y⟨t⟩ can be the integer index of the true class; the gradient dzf of the multi-class cross-entropy loss L⟨t⟩ at moment t with respect to z_f⟨t⟩ is then the softmax output with 1 subtracted at the true class. The gradient dh with respect to the hidden state at moment t adds the gradient coming through the output to the gradient dh_next flowing back from moment t+1:

dh = np.dot(dzf, Wf.T) + dh_next
Knowing these two gradients, according to the above formula, the gradient of the loss function with
respect to other variables can be obtained. The following is the code for reverse derivation at time t:
dzf = np.copy(f[t])
dzf[y[t]] -= 1
dWf += np.dot(h[t].T,dzf)
dbf += dzf
dh = np.dot(dzf, Wf.T) + dh_next
dzh = (1 - h[t] * h[t]) * dh
dbh += dzh
dWx += np.dot(x[t].T,dzh)
dWh += np.dot(h[t-1].T,dzh)
dh_pre = np.dot(dzh,Wh.T)
Among them, dWf, dWx, dWh, dbh, and dbf are the gradients of the loss function with respect to the model parameters; dh_next is the gradient of the loss function with respect to the hidden state at time t+1; dh is the gradient with respect to the hidden state at the current moment; and dh_pre is the gradient of the loss function with respect to the hidden state output at the previous moment, ∂L⟨t−⟩/∂h⟨t−1⟩.
Assume the input x⟨t⟩ is a vector of length input_dim, the hidden state is a vector of length hidden_dim, and the output f⟨t⟩ is a vector of length output_dim; then the following function initializes the model parameters of the RNN:

import numpy as np
np.random.seed(1)
def rnn_params_init(input_dim, hidden_dim, output_dim, scale=0.01):
    Wx = np.random.randn(input_dim, hidden_dim)*scale   # input to hidden
    Wh = np.random.randn(hidden_dim, hidden_dim)*scale  # hidden to hidden
    bh = np.zeros((1, hidden_dim))                      # hidden bias
    Wf = np.random.randn(hidden_dim, output_dim)*scale  # hidden to output
    bf = np.zeros((1, output_dim))                      # output bias
    return [Wx, Wh, bh, Wf, bf]
In addition to the model parameters, the hidden state vector of the RNN must also be initialized: each input sample x corresponds to one hidden state vector h. When training the model, if a batch of samples X = (x^(1), x^(2), ⋯, x^(m))ᵀ is input, it corresponds to a batch of hidden state vectors H = (h^(1), h^(2), ⋯, h^(m))ᵀ. The following function initializes the hidden state vectors H for a batch of samples:
def rnn_hidden_state_init(batch_size, hidden_dim):
    return np.zeros((batch_size, hidden_dim))

The forward computation over a whole input sequence Xs can then be written as the function rnn_forward():

def rnn_forward(params, Xs, H_):
    Wx, Wh, bh, Wf, bf = params
    H = H_
    Fs = []
    Hs = {}
    Hs[-1] = np.copy(H)
    for t in range(len(Xs)):
        X = Xs[t]
        H = np.tanh(np.dot(X, Wx) + np.dot(H, Wh) + bh)
        F = np.dot(H, Wf) + bf
        Fs.append(F)
        Hs[t] = H
    return Fs, Hs
Where params is the model parameter, H_ is the hidden state input at t=0 (usually the initial value is 0).
Assuming that each sequence element is a one-dimensional vector, then Xs is a three-dimensional tensor,
that is, there are 3 axes representing sequence length, batch size, and input data length, respectively. As
shown in Figure 7-22,
Figure 7-22 xs is a three-dimensional tensor, and its three axes represent sequence length T, batch size
batch_dim, and input data length input_dim
Hs is represented by a dictionary: Hs[-1] is the input state at t=0, and Hs[t] is the output state at time t, so len(Hs) = len(Xs)+1 and Hs[len(Hs)-2] is the state at the last moment len(Xs)-1. Each Hs[t] is a two-dimensional tensor whose first axis is the batch size and whose second axis is the state vector length hidden_dim. Similarly, Fs holds the output values of all moments and can be represented as a three-dimensional tensor or a list; each Fs[t] is a two-dimensional tensor whose first axis is the batch size and whose second axis is the output vector size output_dim.
The forward calculation process at each moment can be written as a separate function
rnn_forward_step():
def rnn_forward_step(params, X, preH):
    Wx, Wh, bh, Wf, bf = params
    H = np.tanh(np.dot(X, Wx) + np.dot(preH, Wh) + bh)
    F = np.dot(H, Wf) + bf
    return F, H
Among them, X is the input at a certain moment, and preH is the hidden state at the previous moment,
and they are both two-dimensional tensors. The forward calculation process at all times can be written as
a function rnn_forward_():
def rnn_forward_(params, Xs, H_):
    Wx, Wh, bh, Wf, bf = params
    H = H_
    Fs = []
    Hs = {}
    Hs[-1] = np.copy(H)
    for t in range(len(Xs)):
        X = Xs[t]
        F, H = rnn_forward_step(params, X, H)
        Fs.append(F)
        Hs[t] = H
    return Fs, Hs
Assume Ys[t] is the target value at moment t; the loss at moment t is L_t, and the total loss accumulated over all moments is L = Σ_{t=1}^{T} L_t. Depending on the problem, L_t can be the mean squared error of a regression problem or the cross-entropy loss of a classification problem. The following function rnn_loss_grad() uses the function-object parameter loss_fn() to compute the loss loss_t at each moment and the gradient dF_t of that loss with respect to the output Fs[t], and collects the gradients of all moments in a dictionary dFs.
import util
def rnn_loss_grad(Fs, Ys, loss_fn=util.loss_gradient_softmax_crossentropy,
                  flatten=True):
    loss = 0
    dFs = {}
    for t in range(len(Fs)):
        F = Fs[t]
        Y = Ys[t]
        if flatten and Y.ndim >= 2:
            Y = Y.flatten()
        loss_t, dF_t = loss_fn(F, Y)
        loss += loss_t
        dFs[t] = dF_t
    return loss, dFs
import math
def grad_clipping(grads, alpha):
    norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
    if norm > alpha:
        ratio = alpha / norm
        for i in range(len(grads)):
            grads[i] *= ratio
The reverse derivation over a whole sequence can be written as the function rnn_backward(); its body below follows the single-moment gradient code above, accumulating the gradients backward through time:

def rnn_backward(params, Xs, Hs, dFs):
    Wx, Wh, bh, Wf, bf = params
    dWx, dWh, dbh = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(bh)
    dWf, dbf = np.zeros_like(Wf), np.zeros_like(bf)
    dh_next = np.zeros_like(Hs[0])
    h = Hs
    x = Xs
    for t in reversed(range(len(Xs))):
        dZ = dFs[t]
        dWf += np.dot(h[t].T, dZ)
        dbf += np.sum(dZ, axis=0, keepdims=True)
        dh = np.dot(dZ, Wf.T) + dh_next
        dzh = (1 - h[t] * h[t]) * dh           # tanh'(z) = 1 - h^2
        dbh += np.sum(dzh, axis=0, keepdims=True)
        dWx += np.dot(x[t].T, dzh)
        dWh += np.dot(h[t-1].T, dzh)
        dh_next = np.dot(dzh, Wh.T)
    grads = [dWx, dWh, dbh, dWf, dbf]
    grad_clipping(grads, 5.)
    return grads
Because the cyclic network expands over time, like the deep neural network, there will be gradient
explosion and gradient disappearance problems. In order to solve the gradient explosion problem, the
gradient clipping method can be used to prevent the gradient explosion. As in the above code backward()
function, the gradient is clipped grad_clipping(grads,5.) at the end.
Similarly, the reverse derivation at each moment can be written as a separate function
rnn_backward_step():
def rnn_backward_step(params, dZ, X, H, H_, dh_next):
    # H is the hidden state at the current moment, H_ at the previous moment
    Wx, Wh, bh, Wf, bf = params
    dWf = np.dot(H.T, dZ)
    dbf = np.sum(dZ, axis=0, keepdims=True)
    dh = np.dot(dZ, Wf.T) + dh_next
    dzh = (1 - H * H) * dh
    dbh = np.sum(dzh, axis=0, keepdims=True)
    dWx = np.dot(X.T, dzh)
    dWh = np.dot(H_.T, dzh)
    dh_next = np.dot(dzh, Wh.T)
    return dWx, dWh, dbh, dWf, dbf, dh_next
And functions that perform reverse differentiation on sequence data can call this single-moment reverse
differentiation function:
dWx_, dWh_, dbh_, dWf_, dbf_, dh_next = \
    rnn_backward_step(params, dZ, X, H, H_, dh_next)
for grad, grad_t in zip([dWx, dWh, dbh, dWf, dbf],
                        [dWx_, dWh_, dbh_, dWf_, dbf_]):
    grad += grad_t
import numpy as np
np.random.seed(1)
# Generate a batch of sequence samples Xs and their targets Ys
# Define an RNN model whose input, hidden, and output sizes are 4, 10, and 4
if True:
    T = 5
    input_dim, hidden_dim, output_dim = 4, 10, 4
    batch_size = 1
    seq_len = 5
    Xs = np.random.rand(seq_len, batch_size, input_dim)
    #Ys = np.random.randint(input_dim, size=(seq_len, batch_size, output_dim))
    Ys = np.random.randint(input_dim, size=(seq_len, batch_size))
    #Ys = Ys.reshape(Ys.shape[0], Ys.shape[1])
else:
    input_size, hidden_size, output_size = 4, 3, 4
    batch_size = 1
    vocab_size = 4
    inputs = [0, 1, 2, 2]   # hello
    targets = [1, 2, 2, 3]
    Xs = []
    Ys = []
    for t in range(len(inputs)):
        X = np.zeros((1, vocab_size))  # encode in 1-of-k representation
        X[0, inputs[t]] = 1
        Xs.append(X)
        Ys.append(targets[t])
    print(Xs)
    print(Ys)
The following code calculates the analytical gradient for the above sample:
# -------- check gradient -------------
params = rnn_params_init(input_dim, hidden_dim,output_dim)
H_0 = rnn_hidden_state_init(batch_size,hidden_dim)
Fs,Hs = rnn_forward(params,Xs,H_0)
loss_function = rnn_loss_grad
print(Fs[0].shape,Ys[0].shape)
loss,dFs = loss_function(Fs,Ys)
grads = rnn_backward(params,Xs,Hs,dFs)
(1, 4) (1,)
The following code defines the auxiliary function rnn_loss() for computing the RNN loss, then calls the general numerical gradient function numerical_gradient() in util to compute the numerical gradients of the RNN model parameters, compares their errors against the analytical gradients above, and also prints the first rows of one of the parameter gradients:
def rnn_loss():
    H_0 = np.zeros((1, hidden_dim))
    H = np.copy(H_0)
    Fs, Hs = rnn_forward(params, Xs, H)
    loss_function = rnn_loss_grad
    loss, dFs = loss_function(Fs, Ys)
    return loss
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
#rnn_numerical_gradient(rnn_loss,params,1e-10)
#diff_error = lambda x, y: np.max(np.abs(x - y))
diff_error = lambda x, y: np.max( np.abs(x - y)/(np.maximum(1e-8, np.abs(x) +
np.abs(y))))
print("loss",loss)
print("[dWx, dWh, dbh,dWf, dbf]")
for i in range(len(grads)):
    print(diff_error(grads[i], numerical_grads[i]))
print("grads",grads[1][:2])
print("numerical_grads",numerical_grads[1][:2])
loss 6.931604253116049
[dWx, dWh, dbh, dWf, dbf]
4.30868739852771e-06
0.00014321848390554473
8.225164888798296e-08
2.030282934604882e-07
1.155121982079175e-10
grads [[-2.39049602e-04 8.14220495e-05 1.57776751e-04 5.67414815e-05
-2.52527076e-04 7.67751376e-05 8.81253550e-05 2.07270381e-04
-6.92579913e-05 5.33532921e-05]
[-1.59775181e-04 8.33693576e-05 7.68434971e-05 4.16925859e-05
-1.31768112e-04 1.87065893e-05 3.02967764e-05 1.17071893e-04
-3.32692578e-05 2.22690120e-05]]
numerical_grads [[-2.39049225e-04 8.14224244e-05 1.57776459e-04 5.67408343e-
05
-2.52526444e-04 7.67759190e-05 8.81255069e-05 2.07270645e-04
-6.92583768e-05 5.33533218e-05]
[-1.59774860e-04 8.33693115e-05 7.68434205e-05 4.16924273e-05
-1.31767930e-04 1.87068139e-05 3.02966541e-05 1.17071686e-04
-3.32689432e-05 2.22684093e-05]]
By comparing the errors, it can be judged that the calculation of the analytical gradient is basically
correct.
The model parameters are updated with an optimizer, such as the momentum SGD optimizer used earlier, whose initialization mirrors that of the AdaGrad optimizer below:

class SGD():
    def __init__(self, model_params, learning_rate=0.01, momentum=0.9):
        self.params, self.lr, self.momentum = model_params, learning_rate, momentum
        self.vs = []
        for p in self.params:
            v = np.zeros_like(p)
            self.vs.append(v)
    def step(self, grads):
        for i in range(len(self.params)):
            grad = grads[i]
            self.vs[i] = self.momentum*self.vs[i] + self.lr*grad
            self.params[i] -= self.vs[i]
    def scale_learning_rate(self, scale):
        self.lr *= scale
Of course, other parameter optimizers are also available, such as the AdaGrad optimizer:
class AdaGrad():
    def __init__(self, model_params, learning_rate=0.01):
        self.params, self.lr = model_params, learning_rate
        self.vs = []
        self.delta = 1e-7
        for p in self.params:
            v = np.zeros_like(p)
            self.vs.append(v)
    def step(self, grads):
        for i in range(len(self.params)):
            grad = grads[i]
            self.vs[i] += grad**2
            self.params[i] -= self.lr * grad / (self.delta + np.sqrt(self.vs[i]))
    def scale_learning_rate(self, scale):
        self.lr *= scale
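Its usage is the same as SGD's; for example:

optimizer = AdaGrad(params, learning_rate=0.01)
optimizer.step(grads)   # update the parameters in place with the current gradients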
The following training function rnn_train_epoch() traverses the training data through the data iterator data_iter to complete one pass of training. Each time, it obtains a batch of sequence training samples from data_iter; each sample sequence (Xs, Ys) is composed of the samples of multiple moments, and start indicates whether this sample sequence is connected end-to-end with the previous one. For each sample sequence (Xs, Ys), rnn_forward(params, Xs, H) first computes the outputs Zs and states Hs at every moment; the loss function loss_function(Zs, Ys) then computes the model loss and the gradient dzs of the loss with respect to the outputs; reverse derivation rnn_backward(params, Xs, Hs, dzs) computes the gradients of the loss with respect to the model parameters; and finally the model parameters are updated. iterations is the maximum number of iterations, to prevent an infinite loop, and print_n is the interval for printing information.
def rnn_train_epoch(params, data_iter, optimizer, iterations, loss_function,
                    print_n=100):
    # the loop body follows the description above
    Wx, Wh, bh, Wf, bf = params
    hidden_size = Wh.shape[0]
    losses = []
    iter = 0
    H = None
    for Xs, Ys, start in data_iter:
        batch_size = Xs[0].shape[0]
        if start:
            H = rnn_hidden_state_init(batch_size, hidden_size)
        Zs, Hs = rnn_forward(params, Xs, H)
        H = Hs[len(Xs)-1]          # carry the last state over to the next sequence
        loss, dzs = loss_function(Zs, Ys)
        grads = rnn_backward(params, Xs, Hs, dzs)
        optimizer.step(grads)
        losses.append(loss)
        if False:
            print("Z.shape", Zs[0].shape)
            print("Y.shape", Ys[0].shape)
            print("H", H.shape)
        if iter % print_n == 0:
            print('iter %d, loss: %f' % (iter, loss))
        iter += 1
        if iter > iterations: break
    return losses, H
For autoregressive sequence data {x_t}, the target value y_t at time t is the next element x_{t+1} of the sequence, as in stock price data or text data where the next word is predicted. For such special sequence data, where y_t is x_{t+1}, sampling a sequence sample of length seq_len = T means taking T+1 consecutive elements: the inputs are x_τ, ⋯, x_{τ+T−1} and the targets are x_{τ+1}, ⋯, x_{τ+T}. That is, the output corresponding to the input x_τ at moment τ is x_{τ+1}, and the output corresponding to the input x_{τ+1} at moment τ+1 is x_{τ+2}, as shown in Figure 7-23.

Figure 7-23 The input at moment τ is x_τ and the corresponding output is x_{τ+1}; at moment τ+1 the input is x_{τ+1} and the output is x_{τ+2}.
In order to train the RNN model, many sequence samples can be sampled from the original sequence as
the training set of the model. If the two adjacent sequence samples of these sequence samples are
connected at the beginning and end, this sampling method is called sequential sampling , otherwise it is
called random sampling. For example, for the following sequence:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
If random sampling is used, the sequence of sampled samples might look like this:
([0,1,2],[1,2,3])、([2,3,4],[3,4,5])、([12,13,14],[13,14,15])、([7,8,9],
[8,9,10])、...
The lengths of all the sequence samples sampled above are the same (both are 3). In fact, the lengths of
the sequence samples can be different, but for the sake of simplicity, the sequence samples of the same
length are sampled.
When training the RNN model on a sequence sample, the input hidden state H_{−1} at the initial moment is usually initialized to 0, indicating that there is no historical information. With sequential sampling, however, the last moment of one sequence sample is immediately followed by the first moment of the next, so the final hidden state of the previous sequence sample can be used directly as the input hidden state of the next one instead of re-initializing it to 0. In this way the historical information of earlier sequence samples helps in processing the current one, and in theory later sequence samples can exploit the history contained in all previous ones.
Let data be the original sequence data, and the length of all sampled sequence samples is T. The
following iterator function uses sequential sampling to generate sequence samples, that is, sequence
samples generated sequentially are connected end to end:
import numpy as np
def seg_data_iter_consecutive_one(data, T, start_range=0, repeat=False):
    n = len(data)
    if start_range > 0:
        start = np.random.randint(0, start_range)
    else:
        start = 0
    end = n - T
    while True:
        for p in range(start, end, T):
            # pick a training sample
            X = data[p:p+T]
            Y = data[p+1:p+T+1]
            if p == start:
                yield X, Y, True
            else:
                yield X, Y, False
        if not repeat:
            return
The parameter start_range determines the initial sampling position start (the default 0 means sampling always starts from the beginning of the sequence); a positive value makes each pass start from a random position, so the sampled sequence samples are more random. repeat indicates whether the original sequence should be sampled repeatedly; the default False means the sequence is traversed only once. The third returned value indicates whether the sequence sample is the first one.
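The output below was presumably produced by a loop like the following (with start_range=5, the random starting offset here happened to be 4):

data = list(range(20))
for X, Y, start in seg_data_iter_consecutive_one(data, 3, start_range=5):
    print(X, Y)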
[4, 5, 6] [5, 6, 7]
[7, 8, 9] [8, 9, 10]
[10, 11, 12] [11, 12, 13]
[13, 14, 15] [14, 15, 16]
[16, 17, 18] [17, 18, 19]
Random sampling does not need to ensure that the two sequence samples sampled in sequence are
connected end to end, and its implementation is simpler, such as the following random sampling iterator
function:
import numpy as np
import random
def seg_data_iter_random_one(data, T, repeat=False):
    while True:
        end = len(data) - T
        indices = list(range(0, end))
        random.shuffle(indices)
        for i in range(end):
            p = indices[i]
            X = data[p:p+T]
            Y = data[p+1:p+T+1]
            yield X, Y
        if not repeat:
            return
When training a neural network, each iteration usually uses not a single sequence sample but a batch of samples; batch sampling is likewise divided into sequential and random sampling. If a batch contains batch_size samples, the function above can simply be called batch_size times to obtain them. However, this naive batch sampling has a problem: the samples of one batch may be highly correlated or even be the same sequence sample. If all sequence samples in a batch were identical, the batch would be equivalent to a single sample, and training on a batch would lose its meaning.
For random sampling, it suffices that the starting positions within one batch differ. The function above only needs a slight modification: take batch_size consecutive subscripts from the subscript array indices each time as the starting positions of the sequence samples. Because random.shuffle(indices) has already shuffled the subscript array before the for loop, the sequence samples within a batch start at randomly scattered positions. This gives the function seg_data_iter_random(), which randomly draws a batch of sequence samples:
import numpy as np
import random
def seg_data_iter_random(data, T, batch_size, repeat=False):
    while True:
        end = len(data) - T
        indices = list(range(0, end))
        random.shuffle(indices)
        for i in range(0, end, batch_size):
            batch_indices = indices[i:(i+batch_size)]
            X = [data[p:p+T] for p in batch_indices]
            Y = [data[p+1:p+T+1] for p in batch_indices]
            yield X, Y
        if not repeat:
            return
The recurrent neural network keeps a separate hidden state for each input sample; different sequence samples of the same batch correspond to different hidden states. If two consecutive batches are connected end to end, the later batch can directly reuse the hidden states of the earlier batch instead of re-initializing them, so that more historical information is used. Sequential sampling must therefore ensure that the corresponding samples of consecutive batches are connected end to end.

As shown in Figure 7-24, if each batch contains 2 sequences, the first sequence of the second batch should be connected end to end with the first sequence of the first batch, and likewise the second sequence of the second batch with the second sequence of the first batch. The first sequences of all batches then make up one long sequence sample, and the second sequences of all batches make up another.
Figure 7-24 The data of the first sequence sample (red) of the batch sample is end-to-end, and the data of
the second sequence sample (blue) is end-to-end
How to ensure that all batches are end-to-end? A simple solution is to divide the original data into
batch_size sub-parts, and use sequential sampling to sample a sequence sample in each sub-part, which
naturally ensures that the batch_size sequence samples are connected end to end, and different samples
in each batch come from different parts. For the above data, set batch_size=2, the original sequence data
is divided into 2 parts:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
The following code can divide the data sequence into batch_size subparts:
batch_size = 2
data= np.array(data)
data = data.reshape(batch_size,-1)
print(data)
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]]
The sequence sample x1=[0,1,2] can be taken from the first part, and the sequence sample x2 =
[10,11,12] can be taken from the second part to form a batch sequence sample. Take the sequence
sample x1 = [3,4,5] from part 1, and take the sequence sample x2 = [13,14,15] from part 2 to
form another batch of sequence samples. Take the sequence sample x1=[6,7,8] from the first part, and
take the sequence sample x2 = [16,17,18] from the second part to form another batch sequence
sample.
However, in addition to the input, each sequence sample should also contain the target sequence, and the
target sequence is exactly one position behind the input sequence. Therefore, the following code can be
used to generate 2*batch_size sub-blocks:
data = np.array(range(20))
print(data)
batch_size = 2
block_len = (len(data)-1)//2
print(block_len)
data_x = data[0:block_len*batch_size]
data_x = data_x.reshape(batch_size,-1)
print(data_x)
data_y = data[1:1+block_len*batch_size]
data_y = data_y.reshape(batch_size,-1)
print(data_y)
data_x consists of batch_size sub-blocks used to generate the input sequence samples, and data_y consists of batch_size sub-blocks, staggered one position after data_x, used to form the target sequence samples.
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
9
[[ 0 1 2 3 4 5 6 7 8]
[ 9 10 11 12 13 14 15 16 17]]
[[ 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18]]
Now a sequence can be taken from the first row of data_x and data_y as an input sequence and target sequence respectively, and a second sample from their second rows, forming the first batch of sequence samples:

x1 = [0,1,2], y1 = [1,2,3]
x2 = [10,11,12], y2 = [11,12,13]

In the same way, the second batch of sequence samples can be taken:

x1 = [3,4,5], y1 = [4,5,6]
x2 = [13,14,15], y2 = [14,15,16]
Following this method, the batch sequential sampling function rnn_data_iter_consecutive() can be written; a first, minimal version (the loop that yields successive batches is filled in here to match the demonstration below):

def rnn_data_iter_consecutive(data, batch_size, seq_len, start_range=1):
    start = np.random.randint(0, start_range) if start_range > 0 else 0
    block_len = (len(data)-start-1) // batch_size
    Xs = data[start:start+block_len*batch_size]
    Xs = Xs.reshape(batch_size, -1)
    Ys = data[start+1:start+block_len*batch_size+1]
    Ys = Ys.reshape(batch_size, -1)
    for i in range(0, block_len-seq_len+1, seq_len):
        X = Xs[:, i:i+seq_len]
        Y = Ys[:, i:i+seq_len]
        yield X, Y, i == 0
data = list(range(20))
print(data[:20])
data_it = rnn_data_iter_consecutive(np.array(data[:20]),2,3,1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
X: [[ 0 1 2]
[ 9 10 11]]
Y: [[ 1 2 3]
[10 11 12]]
X: [[ 3 4 5]
[12 13 14]]
Y: [[ 4 5 6]
[13 14 15]]
X: [[ 6 7 8]
[15 16 17]]
Y: [[ 7 8 9]
[16 17 18]]
Each X of the batch sequence samples sampled above is a two-dimensional tensor with the batch size on the first axis and the sequence length on the second. The recurrent neural network above instead assumes that the first axis of a sequence sample is the sequence length; the two axes can be exchanged:
X = np.swapaxes(X,0,1)
The X above assumes that each data element is a scalar of length 1, but in actual problems each data element may be a vector of multiple features or even a multi-dimensional tensor (such as an image). If each data element is a feature vector, X is a three-dimensional tensor. The two-dimensional sequence sample X above can therefore be converted into a three-dimensional tensor:
X = X.reshape(X.shape[0],X.shape[1],-1)
That is, a 3rd axis is added. Combine the two lines of code:
x1 = np.swapaxes(X,0,1)
x1 = x1.reshape(x1.shape[0],x1.shape[1],-1)
print(x1)
[[[ 6]
[15]]
[[ 7]
[16]]
[[ 8]
[17]]]
Therefore, the function above can be rewritten with an added to_3D parameter that determines whether to convert the samples to 3D tensors (again, the yield loop is completed here to match the usage below):

import numpy as np
def rnn_data_iter_consecutive(data, batch_size, seq_len, start_range=10,
                              to_3D=True):
    # sample from data[start:] each time, so that the training samples
    # differ from epoch to epoch
    start = np.random.randint(0, start_range)
    block_len = (len(data)-start-1) // batch_size
    Xs = data[start:start+block_len*batch_size]
    Ys = data[start+1:start+block_len*batch_size+1]
    Xs = Xs.reshape(batch_size, -1)
    Ys = Ys.reshape(batch_size, -1)
    for i in range(0, Xs.shape[1]-seq_len+1, seq_len):
        X = Xs[:, i:i+seq_len]
        Y = Ys[:, i:i+seq_len]
        if to_3D:
            X = np.swapaxes(X, 0, 1)
            X = X.reshape(X.shape[0], X.shape[1], -1)
            Y = np.swapaxes(Y, 0, 1)
            Y = Y.reshape(Y.shape[0], Y.shape[1], -1)
        yield X, Y, i == 0
The data iterator yields samples (Xs, Ys) together with a flag indicating whether to reset the RNN hidden state; if the flag is True, the hidden state H is re-initialized.
data = np.array(list(range(20))).reshape(-1,1)
data_it = rnn_data_iter_consecutive(data,2,3,2)
i = 0
for X, Y, _ in data_it:
    print("X:", X)
    print("Y:", Y)
    i += 1
    if i == 2: break
X: [[[ 0]
[ 9]]
[[ 1]
[10]]
[[ 2]
[11]]]
Y: [[[ 1]
[10]]
[[ 2]
[11]]
[[ 3]
[12]]]
X: [[[ 3]
[12]]
[[ 4]
[13]]
[[ 5]
[14]]]
Y: [[[ 4]
[13]]
[[ 5]
[14]]
[[ 6]
[15]]]
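Next, an RNN model is trained on the autoregressive sine data. The preparation of the series and of the regression loss is not shown in the text, but was presumably something like the following (the lambda wrapping rnn_loss_grad around util.mse_loss_grad is an assumption):

data = gen_seq_data_from_function(lambda ts: np.sin(ts*0.1)+np.cos(ts*0.2),
                                  np.arange(0, 5000))
print(data.shape)
# per-sequence regression loss: MSE at every moment, accumulated by rnn_loss_grad
loss_function = lambda Fs, Ys: rnn_loss_grad(Fs, Ys, util.mse_loss_grad,
                                             flatten=False)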
batch_size = 3
input_dim = 1
output_dim= 1
hidden_size=100
seq_length = 50
params = rnn_params_init(input_dim, hidden_size,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_size)
data_it = rnn_data_iter_consecutive(data,batch_size,seq_length,2)
x,y,_ = next(data_it)
print("X:",x.shape,"Y:",y.shape,"H:",H.shape)
Zs,Hs = rnn_forward(params,x,H)
print("Z:",Zs[0].shape,"H:",Hs[0].shape)
loss,dzs = loss_function(Zs,y)
print(dzs[0].shape)
epoches = 10
learning_rate = 5e-4
iterations =200
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
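The loop that drives training over multiple epochs is also not shown; a sketch, assuming each epoch calls rnn_train_epoch() with a fresh sequential data iterator (which explains why "iter 0" is printed once per epoch below):

for epoch in range(epoches):
    data_it = rnn_data_iter_consecutive(data, batch_size, seq_length, 2)
    losses_epoch, H = rnn_train_epoch(params, data_it, optimizer, iterations,
                                      loss_function, print_n=100)
    losses += losses_epoch
plt.plot(losses)

The printed shapes and per-epoch logs are: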
(5000,)
X: (50, 3, 1) Y: (50, 3, 1) H: (3, 100)
Z: (3, 1) H: (3, 100)
(3, 1)
iter 0, loss: 52.575362
iter 0, loss: 41.488531
iter 0, loss: 2.666009
iter 0, loss: 1.424797
iter 0, loss: 0.849381
iter 0, loss: 0.723504
iter 0, loss: 0.581355
iter 0, loss: 0.938593
iter 0, loss: 1.019344
iter 0, loss: 0.297335
Figure 7-25 Training loss curve of RNN model for autoregressive data
Prediction
The following code uses the trained RNN model to predict the output of the next 500 moments from the
data at a certain moment:
H = rnn_hidden_state_init(1, hidden_size)
start = 3
x = data[start:start+1].copy()
x = x.reshape(x.shape[0], 1, -1)   # reshape to 3D just to inspect the shape
print(x.shape)
x = x.reshape(1, -1)               # back to (1, input_dim) for the step function
ys = []
print(x.flatten())
for i in range(500):
    F, H = rnn_forward_step(params, x, H)
    x = F                          # feed the prediction back as the next input
    ys.append(F[0, 0])
print(len(ys))
plt.plot(ys[:500])
plt.plot(data[start+1:start+1+500])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y', 'y_real'])
plt.show()
(1, 1, 1)
[1.12085582]
500
Figure 7-26 Comparison of long-term forecast data and real data of the rnn autoregressive model
This long-term prediction is not very accurate. Alternatively, one can predict only the data of the next moment from the current moment, that is, predict data[t+1] from data[t]. The following code uses this short-term prediction method to predict the value at the next moment from the true value at each moment in data[start:start+500], that is, to predict data[start+1:start+1+500]:
H = rnn_hidden_state_init(1, hidden_size)
start = 3
ys = []
for i in range(500):
    x = data[start+i:start+i+1].copy()   # the true value at each moment
    x = x.reshape(1, -1)
    F, H = rnn_forward_step(params, x, H)
    ys.append(F[0, 0])
plt.plot(ys[:500])
plt.plot(data[start+1:start+501])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y', 'y_real'])
plt.show()
Figure 7-27 Comparison of short-term forecast data and real data of the rnn autoregressive model
The result of the short-term prediction at the next moment completely coincides with the real data,
indicating that the short-term prediction is very good. The relevant code for the above RNN is in the
rnn.py file in the code for this book.
The same method applies to real sequence data. The following code reads the S&P 500 stock data and takes the closing-price column as an autoregressive series:

data = read_stock('sp500.csv')
data = np.array(data.iloc[:,-2]).reshape(-1,1)

For such autoregressive sequence data, the training and prediction code above can be used directly. With a learning rate of 1e-4 and 40 epochs of batch gradient descent, the training loss curve is as follows:
Figure 7-28. Training Loss Curves for an Autoregressive Model of Stock Closing Prices
The long-term forecast and short-term forecast are shown in the figure respectively:
Figure 7-29. Long-Run Forecast of Stock Closing Price Autoregressive Model
The above only uses the historical data of the stock closing price to predict the future stock closing price.
The following code uses all the indicators of the stock (opening price, highest price, lowest price, closing
price, trading volume) to predict the closing price of the stock. First, the RNN model is also trained:
import pandas as pd
import numpy as np
data = read_stock('sp500.csv')
stock_data = np.array(data)
print("stock_data.shape",stock_data.shape)
print("stock_data[:3]\n",stock_data[:3])
def stock_data_iter(data,seq_length):
feature_n = data.shape[1]
num = (len(data)-1)//seq_length
while True:
for i in range(num):
#Select a training sample
p = i*seq_length
inputs = data[p:p+seq_length]
targets = data[p+1:p+seq_length+1][:,-2]
inputs = np.expand_dims(inputs, axis=1)
targets = targets.reshape(-1,1)
if i==0:
yield inputs,targets,True
else:
yield inputs,targets,False
batch_size = 1
input_dim= stock_data.shape[1]
hidden_dim = 100
output_dim=1
params = rnn_params_init(input_dim, hidden_dim,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_dim)
# hyperparameters
epoches = 2
learning_rate = 1e-4
iterations =2000
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
stock_data.shape (4697, 5)
stock_data[:3]
[[-0.00597324 -0.00591629 -0.01567558 -0.01231037 -0.19118446]
[-0.01226569 -0.01617188 -0.03401657 -0.03724877 -0.1842296 ]
[-0.0372919 -0.03505779 -0.04286668 -0.03604657 -0.17733781]]
(100, 1, 5) (100, 1)
iter 0, loss: 0.105906
iter 200, loss: 0.092861
iter 400, loss: 0.561419
iter 600, loss: 0.061234
iter 800, loss: 0.447817
iter 1000, loss: 2.762900
iter 1200, loss: 0.713906
iter 1400, loss: 0.022479
iter 1600, loss: 0.004160
iter 1800, loss: 0.011423
iter 2000, loss: 0.033837
The above sequence data is not autoregressive: the stock data at each moment is a vector of multiple features, while the predicted stock price at the next moment is a single value. Since the model's input is a multi-feature vector but its output is a single value, the prediction cannot be fed back as the next input, so no long-term forecasts can be made with this model. The following code performs short-term prediction with the trained RNN:
H = rnn_hidden_state_init(1, hidden_dim)
start = 3
data = stock_data[start:, :]
ys = []
for i in range(len(data)):
    x = data[i, :].copy()
    x = x.reshape(1, -1)
    f, H = rnn_forward_step(params, x, H)
    ys.append(f[0, 0])
plt.plot(ys[:500])
plt.plot(data[:500, -2])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y', 'y_real'])
Figure 7-32 The short-term prediction effect of the training model of stock data
According to this probability, a word is sampled as the next word; repeating this process continuously generates new words from the initial word, producing a sequence of words, that is, a text. This process of automatically generating a large piece of text from one or a few initial words based on a language model is called text generation.

Text generation relies on a trained language model. To train a language model, sequence data in units of words must be sampled from existing texts such as one or more novels or prose pieces. The original texts used to sample the word sequences are called the corpus. The corpus is usually first divided into word sequences, after which the sequence samples used for training the RNN model can be drawn with the sequence-sampling method described earlier.
For English text, the original text can be divided into word sequences using spaces and punctuation marks, while for Chinese text, special word segmentation techniques are needed to extract words. In any language the number of words is very large. For simplicity, each character can be regarded as a word; such a language model is called a character language model. A character language model does not need to extract words from the text, and the number of characters in a language is often much smaller than the number of words: English has only 26 letters and a small number of punctuation marks, while the number of English words is huge.
Whether it is a character language model or an ordinary word language model, the principle is the same. Before training the language model with an RNN, the basic unit of the language model (the word or character) must be vectorized, that is, converted into a numeric vector. The first step in vectorizing words (characters) is to build a word table (character table).

If the corpus contains only one text file 'input.txt' containing Shakespeare's plays, the following code reads the text content into data; set(data) constructs the set of all distinct characters, which is then put into a list object chars (chars = list(set(data))). This list object is the character table of all characters.
filename = 'input.txt'
data = open(filename, 'r').read()
chars = list(set(data))
Output the total number of characters in the text, the length of the character table, the first 10 characters of the character table, and the first 148 characters of the text:
The total number of characters is 1115394, and the length of the character
list is 65 unique.
First 10 characters of character table:
['t', 'z', 'A', 'Y', 'm', ' ', 'B', 'g', 'r', '.']
First 148 characters:
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
Each character in the character table corresponds to a subscript, and two dictionaries can be used to
represent the mapping relationship between characters to subscripts and subscripts to characters:
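For example (these are the char_to_idx and idx_to_char dictionaries used by the prediction code later in this section):

char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}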
With the character table, each character can be vectorized. The simplest way is to represent a character by a one-hot vector according to its subscript in the character table: the vector's length equals the length of the character table, and it is 0 everywhere except for a 1 at the subscript corresponding to the character. Figure 7-33 shows the case of a character table with only four characters:
Figure 7-33 There are only 4 characters in the character table: 'h','e','l','o'
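Such a one-hot encoding can be produced by a small helper; the name one_hot_idx matches the auxiliary function used by predict_rnn() below, while the exact signature here is an assumption:

import numpy as np
def one_hot_idx(idx, vocab_size):
    # a (1, vocab_size) one-hot row vector: all zeros except a 1 at position idx
    x = np.zeros((1, vocab_size))
    x[0, idx] = 1
    return x

print(one_hot_idx(2, 4))   # [[0. 0. 1. 0.]]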
import numpy as np
def character_seq_data_iter_consecutive(data, batch_size, seq_len, start_range=10):
    # sampling in data[start:] each time makes the training samples of each
    # epoch different
    start = np.random.randint(0, start_range)
    block_len = (len(data)-start-1) // batch_size
    num_batches = block_len // seq_len   # the maximum number of batches that can
                                         # be sampled consecutively in each block
    bs = np.array(range(0, block_len*batch_size, block_len))   # starting position of each block
    while True:
        for i in range(num_batches):
            X = np.array([[data[start + j + i*seq_len + t] for j in bs]
                          for t in range(seq_len)])
            Y = np.array([[data[start + j + i*seq_len + t + 1] for j in bs]
                          for t in range(seq_len)])
            yield X, Y, (i == 0)
data_it = character_seq_data_iter_consecutive(data, 2, 3)
i = 0
for x, y, _ in data_it:
    print("x:", x)
    print("y", y)
    i += 1
    if i == 2: break
x: [['L' 'r']
['i' 'e']
[',' ' ']]
y [['i' 'e']
[',' ' ']
['w' 'y']]
x: [['w' 'y']
['h' 'o']
['e' 'u']]
y [['h' 'o']
['e' 'u']
['r' ' ']]
The characters returned by the function need to be further vectorized, such as converting each character
into a one-hot vector form. For this, modify the above function:
def character_seq_data_iter_consecutive(data, batch_size, seq_len, vocab_size,
                                         start_range=10):
    # sampling in data[start:] each time makes the training samples of each
    # epoch different; data is assumed to be a sequence of character indices
    start = np.random.randint(0, start_range)
    block_len = (len(data)-start-1) // batch_size
    num_batches = block_len // seq_len   # the maximum number of batches that can
                                         # be sampled consecutively in each block
    bs = np.array(range(0, block_len*batch_size, block_len))
    eye = np.eye(vocab_size, dtype=int)
    while True:
        for i in range(num_batches):
            X = np.array([[eye[data[start + j + i*seq_len + t]] for j in bs]
                          for t in range(seq_len)])
            Y = np.array([[[data[start + j + i*seq_len + t + 1]] for j in bs]
                          for t in range(seq_len)])
            yield X, Y, (i == 0)
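The call that produced the output below is not preserved; a plausible reconstruction, assuming the corpus has already been converted to a list of character indices, is:

vocab_size = len(chars)
corpus_indices = [char_to_idx[ch] for ch in data]
data_it = character_seq_data_iter_consecutive(corpus_indices, 2, 3, vocab_size)
i = 0
for x, y, _ in data_it:
    print("x:", x)
    print("y", y)
    i += 1
    if i == 2: break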
x: [[[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]]
y [[[62]
[54]]
[[51]
[10]]
[[49]
[12]]]
x: [[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]]
y [[[ 6]
[4]]
[[54]
[20]]
[[56]
[10]]]
7.5.3 RNN model training and prediction
If the length of the character table is vocab_size, the one-hot vector of each character has length vocab_size; that is, the length input_dim of the input data at each moment is vocab_size. The prediction at each moment is the probability of every character being the next character, so the size output_dim of the output vector is also vocab_size. Together with the length hidden_size of the RNN hidden state vector and the batch size batch_size of each training sample, an RNN model can be initialized:
batch_size = 1
input_dim = vocab_size
output_dim= vocab_size
hidden_size=100
params = rnn_params_init(input_dim, hidden_size,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_size)
predict
For the character-language RNN model above, given an initial character (or character sequence), the RNN model can continuously predict the next character, thereby generating a text composed of many characters.

The following function predict_rnn() accepts the RNN model parameters params and an initial string prefix (which may contain only one character), and then generates a sequence of characters following the prefix. It first feeds each character of the prefix as the input at successive moments, producing an output z at each step. Once the prefix is exhausted, it computes from the output z of the previous moment the probability p of each character being the next character, and samples one character according to p as the input at the next moment. The auxiliary function one_hot_idx produces the one-hot vector of a character from its subscript. The expected target character of each moment is recorded in the list output: the beginning part consists of the characters of the prefix, followed by the characters sampled according to the predicted probabilities.
def predict_rnn(params, prefix, n):
    Wx, Wh, bh, Wf, bf = params
    vocab_size, hidden_size = Wx.shape[0], Wh.shape[1]
    h = rnn_hidden_state_init(1, hidden_size)
    output = [char_to_idx[prefix[0]]]
    for t in range(n + len(prefix) - 1):
        x = one_hot_idx(output[-1], vocab_size)
        z, h = rnn_forward_step(params, x, h)
        if t < len(prefix) - 1:
            # still consuming the prefix: the next target character is known
            output.append(char_to_idx[prefix[t + 1]])
        else:
            # sample the next character from the predicted distribution
            p = np.exp(z) / np.sum(np.exp(z))
            # idx = int(p.argmax(axis=1))
            idx = np.random.choice(range(vocab_size), p=p.ravel())
            output.append(idx)
    return ''.join([idx_to_char[i] for i in output])
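Calling the function with the still-untrained parameters (the call itself is not preserved; it presumably mirrored the one used after training):

print(predict_rnn(params, "he", 200))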
heokIX..ytE:JhMjGN:AXpNH;MZZZ&prP?I;,N;!
U,zu-&veMgvasx;!VBx3BYSYVljxozYjgiQcMbIHYISWpGTlkZcFjclR-n??
T&mRhnHe;ewTNZLyLOkNizPuWliTtTX&&dGHtBm$VFWVgT
KBF!aOiHM-!TzrhwXW
gEiG?f,kEqipDQJ3yQIKwXkcptNhJ&CTmke
Since the initial RNN model parameters are random, the predictions are random too, and the generated text is gibberish. The RNN model can be trained on sequence samples from the text corpus, as in the following code:
batch_size = 3
input_dim = vocab_size
output_dim= vocab_size
hidden_size=100
params = rnn_params_init(input_dim, hidden_size,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_size)
seq_length = 25
epoches = 3
learning_rate = 1e-2
iterations =10000
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
Figure 7-34 The training loss curve of the character language model
str = predict_rnn(params,"he",200)
print(str)
her creatuep I wikes spiines corvantle coulling go, your fear him hole.
No, ay no linged siffate too,
come, my wise altes in by is beays friond, and we within; beems
You jores fad lealene,
ine holl i w
It can be seen that the output text is already similar to normal text. Character language models can be used
not only to generate text, but also to other problems, such as generating musical scores.
The bias and the input are ignored and only the hidden state vector $h$ is considered; that is, the hidden state is

$$h_t = \sigma(w h_{t-1})$$

By the chain rule, the gradient of the state at moment $t$ with respect to the state at an earlier moment $t'$ is

$$\frac{\partial h_t}{\partial h_{t'}} = \prod_{k=1}^{t-t'} w\,\sigma'(w h_{t-k}) = w^{t-t'} \prod_{k=1}^{t-t'} \sigma'(w h_{t-k})$$

If the weight $w$ is not equal to 0, then when $0 < |w| < 1$ this expression decays exponentially to 0 as $t - t'$ grows, and when $|w| > 1$ it grows exponentially to infinity; that is, the gradient $\frac{\partial h_t}{\partial h_{t'}}$ decays to 0 or explodes. According to formula (7-17),

$$\frac{\partial L}{\partial w} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_T}\,\frac{\partial h_T}{\partial h_t}\,\frac{\partial h_t}{\partial w}$$

so $\frac{\partial L}{\partial w}$ likewise decays to 0 or explodes to infinity, which causes the model parameter $w$ to oscillate back and forth or to hardly move during training; that is, training cannot converge. The longer the sample sequence, the more likely the gradient is to decay or explode.
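A quick numerical check makes this exponential behavior concrete:

print(0.9**50)   # ~0.005: over 50 time steps the gradient contribution all but vanishes
print(1.1**50)   # ~117.4: over 50 time steps the gradient contribution explodes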
Clipping gradients can handle exploding gradients, but not decaying gradients.
LSTM introduces a cell state $C_t$ that is distinct from the hidden state $h_t$. The cell states $C_{t-1}$ and $C_t$ at consecutive moments are related additively rather than multiplicatively:

$$C_t = i \odot \tilde{C}_t + f \odot C_{t-1}$$

The gradient $\frac{\partial L}{\partial C_t}$ is therefore also propagated additively:

$$\frac{\partial L}{\partial C_{t-1}} = \cdots + f \odot \frac{\partial L}{\partial C_t}$$

Since $f$ is a value close to 1, $\frac{\partial L}{\partial C_t}$ can be kept stable, so the gradient does not vanish, and gradient explosion is also alleviated (although it can still occur).
LSTM extends the traditional RNN by adding a cell state $C_t$ that specifically remembers historical information. The cell state $C_t$ is the cumulative memory of all historical information and can flow from one cell to the next. The original hidden state $h_t$ determines to what extent the information in $C_t$ is used in updating the cell information at the next moment. Think of $C_t$ as the mighty long river of history and $h_t$ as the part of that historical information which affects contemporary social activities: for example, Confucianism may have had greater influence in one era, while Taoism had greater influence in another.
Inside the cell there is a current memory unit (also called the candidate memory unit) that computes the contribution $\tilde{C}_t$ of the current input to the total historical information $C_t$, also known as the activation value. The activation value is like the contribution of contemporary social activities to history. As shown in Figure 7-36, the current memory unit computes the activation value at the current moment as

$$\tilde{C}_t = \tanh(x_t W_{xc} + h_{t-1} W_{hc} + b_c)$$
where $W_{xc} \in \mathbb{R}^{d \times h}$ and $W_{hc} \in \mathbb{R}^{h \times h}$ are the weight parameters and $b_c \in \mathbb{R}^{1 \times h}$ is the bias parameter. Here $h$ is the vector length of the hidden state $h_t$ and the cell state $C_t$, and $d$ is the number of features of the input sample. The activation value $\tilde{C}_t$ at the current moment depends not only on the input at the current moment but also on the hidden state $h_{t-1}$ passed from the previous moment.
Figure 7-36 The current memory unit accepts the hidden state $h_{t-1}$ of the previous moment and the input $x_t$ of the current moment, and outputs the activation value $\tilde{C}_t$ of the current moment
In addition to the current memory unit, the cell contains three gates: the input gate, the output gate, and the forget gate. A gate is a mechanism that determines whether, and to what degree, information can flow through. It multiplies the output $f$ of a sigmoid function $\sigma$ element-wise with an input, determining how much of the input passes through the gate, $out = f * in$, as shown in Figure 7-37.

Figure 7-37 The gate multiplies the output $f$ of the sigmoid function $\sigma$ element-wise with the input $in$, determining how much of $in$ is output, that is, $out = f * in$

Let the input of the sigmoid function $\sigma$ be $x$; its value $\sigma(x)$ lies between 0 and 1. If $\sigma(x) = 0$, multiplying it by some input $c$ means that $c$ produces no output at all; if $\sigma(x) = 1$, multiplying it by $c$ means that $c$ is output completely.
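A tiny numpy sketch of this gate mechanism:

import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))
c_in = np.array([2.0, -1.0, 0.5])
f = sigmoid(np.array([-10.0, 0.0, 10.0]))   # ~[0, 0.5, 1]
print(f * c_in)   # ~[0, -0.5, 0.5]: blocked, half passed, fully passed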
As shown in Figure 7-38, the forget gate controls how much of the total information $C_{t-1}$ of the previous moment is forgotten (and, conversely, how much is remembered). It accepts the input data $x_t$ and the previous state $h_{t-1}$, and outputs through the $\sigma$ function a value $f_t$ between 0 and 1, which is multiplied element-wise with the previous cell state, $f_t \odot C_{t-1}$, indicating how much of $C_{t-1}$ is kept. Its mathematical formula, by the same convention as the current memory unit, is:

$$f_t = \sigma(x_t W_{xf} + h_{t-1} W_{hf} + b_f)$$

Figure 7-38 The output $f_t$ of the forget gate is multiplied element-wise with the historical information $C_{t-1}$ of the previous moment, determining how much of $C_{t-1}$ is forgotten
As shown in Figure 7-39, the input gate accepts the input data $x_t$ and the previous state $h_{t-1}$, and outputs through the $\sigma$ function a value $i_t$ between 0 and 1; the element-wise product $i_t \odot \tilde{C}_t$ determines how much of $\tilde{C}_t$ participates in the update. Its formula is:

$$i_t = \sigma(x_t W_{xi} + h_{t-1} W_{hi} + b_i)$$

Figure 7-39 The output $i_t$ of the input gate is multiplied element-wise with the current activation value $\tilde{C}_t$; $i_t \odot \tilde{C}_t$ determines how much of the activation value enters the aggregated historical information
As shown in Figure 7-40, adding the historical information $f_t \odot C_{t-1}$ of the previous moment retained by the forget gate and the activation information $i_t \odot \tilde{C}_t$ of the current moment admitted by the input gate yields the new historical information of the current state:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Figure 7-40 The sum of the historical information $f_t \odot C_{t-1}$ retained by the forget gate and the current activation information $i_t \odot \tilde{C}_t$ admitted by the input gate is the new historical information $C_t$
As shown in Figure 7-41, the output gate determines how much of the new historical information $C_t$ of the current moment participates in the cell computation at the next moment, that is, it determines the state $h_t$ output to the next moment. It accepts the input data $x_t$ and the previous state $h_{t-1}$:

$$o_t = \sigma(x_t W_{xo} + h_{t-1} W_{ho} + b_o)$$

Figure 7-41 The output $o_t$ of the output gate determines how much of the cell state of the current moment is output and participates, as the hidden vector, in the computation at the next moment

As shown in Figure 7-42, multiplying the output value $o_t$ of the output gate element-wise with the information of the current state gives the output value $h_t$ of the cell:

$$H_t = O_t \odot \tanh(C_t)$$

Figure 7-42 The output gate determines how much of the new historical information $C_t$ of the current moment is output
As shown in Figure 7-43, the cell is composed of the current memory unit and the forget, input, and output gates. The current memory unit computes the activation value $\tilde{C}_t$ of the current moment, determined by the input data and the hidden state of the previous moment; the forget gate determines how much of $C_{t-1}$ is retained; the input gate determines how much of the current activation value $\tilde{C}_t$ is recorded into the total historical information $C_t$; and the output gate determines how much of the historical memory $C_t$ of the current moment participates in the computation at the next moment.

Figure 7-43 The cell is composed of the current memory unit and the forget gate, input gate and output gate

Finally, the cell also computes the current output value $Z_t$ from $H_t$:

$$Z_t = H_t W_y + b_y$$

Formulas (7-24) through (7-30) constitute the computation process of one LSTM cell.
Assuming the gradient $\frac{\partial L}{\partial Z_t}$ of the loss function with respect to the output $Z_t$ at the current moment is known, the gradients of the loss function with respect to $H_t$, $W_y$, $b_y$ can be obtained. The gradient of the loss function with respect to $H_t$ also includes the gradient coming from the subsequent moments:

$$\frac{\partial L}{\partial H_t} = \frac{\partial L_t}{\partial Z_t} W_y^{\top} + \frac{\partial L^{t-}}{\partial H_t}$$

where $L^{t-}$ denotes the part of the loss contributed by the moments after $t$.
Similarly, the gradient of the loss function with respect to $C_t$ also splits into two parts: one part comes from formula (7-29) through the output $H_t$, and the other is the gradient of $C_t$'s own contribution to the next moment:

$$\frac{\partial L}{\partial C_t} = O_t \odot \tanh'(C_t) \odot \frac{\partial L}{\partial H_t} + \frac{\partial L^{t-}}{\partial C_t}$$
From $\frac{\partial L}{\partial H_t}$ and formula (7-29), the gradient of the loss function with respect to $O_t$ is:

$$\frac{\partial L}{\partial O_t} = \frac{\partial L}{\partial H_t} \odot \tanh(C_t)$$

From $\frac{\partial L}{\partial C_t}$ and formula (7-27), the gradients of the loss function with respect to $I_t$, $F_t$, $\tilde{C}_t$ and $C_{t-1}$ are:

$$\frac{\partial L}{\partial I_t} = \frac{\partial L}{\partial C_t} \odot \tilde{C}_t \qquad \frac{\partial L}{\partial F_t} = \frac{\partial L}{\partial C_t} \odot C_{t-1}$$

$$\frac{\partial L}{\partial \tilde{C}_t} = \frac{\partial L}{\partial C_t} \odot I_t \qquad \frac{\partial L}{\partial C_{t-1}} = \frac{\partial L}{\partial C_t} \odot F_t$$
Let $Z_{I_t} = (X_t, H_{t-1}) W_i + b_i$, $Z_{F_t} = (X_t, H_{t-1}) W_f + b_f$, $Z_{O_t} = (X_t, H_{t-1}) W_o + b_o$; then:

$$\frac{\partial L}{\partial Z_{I_t}} = \sigma'(Z_{I_t}) \frac{\partial L}{\partial I_t} = I_t(1-I_t)\frac{\partial L}{\partial I_t}$$

$$\frac{\partial L}{\partial Z_{F_t}} = \sigma'(Z_{F_t}) \frac{\partial L}{\partial F_t} = F_t(1-F_t)\frac{\partial L}{\partial F_t}$$

$$\frac{\partial L}{\partial Z_{O_t}} = \sigma'(Z_{O_t}) \frac{\partial L}{\partial O_t} = O_t(1-O_t)\frac{\partial L}{\partial O_t}$$

With $\frac{\partial L}{\partial Z_{I_t}}$, $\frac{\partial L}{\partial Z_{F_t}}$, $\frac{\partial L}{\partial Z_{O_t}}$ known, the gradients of the loss function with respect to $W_i$, $W_f$, $W_o$, $X_t$, $H_{t-1}$ can be found in the same way. Please refer to Section 4.2.
The cell finally computes the current output value $y_t$ from $H_t$. The model parameters therefore include the weights and biases of the three gates and the current memory unit, plus the output parameters $W_y$ and $b_y$:

def lstm_params_init(input_dim, hidden_dim, output_dim, scale=0.01):
    normal = lambda m, n: np.random.randn(m, n)*scale
    two = lambda: (normal(input_dim+hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
    Wi, bi = two()                        # input gate
    Wf, bf = two()                        # forget gate
    Wo, bo = two()                        # output gate
    Wc, bc = two()                        # current (candidate) memory unit
    Wy = normal(hidden_dim, output_dim)   # output weights
    by = np.zeros((1, output_dim))        # output bias
    return [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by]

def lstm_state_init(batch_size, hidden_dim):
    return (np.zeros((batch_size, hidden_dim)), np.zeros((batch_size, hidden_dim)))

The forward computation iterates over all moments of the input sequence, storing the states of every moment (the initial states are kept under key -1):

def lstm_forward(params, Xs, HC):
    [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by] = params
    H, C = HC
    Hs, Cs, Zs = {}, {}, []
    Hs[-1] = np.copy(H)
    Cs[-1] = np.copy(C)
    Is, Fs, Os, C_tildas = [], [], [], []
    for t in range(len(Xs)):
        X = Xs[t]
        XH = np.column_stack((X, H))
        I = sigmoid(np.dot(XH, Wi)+bi)
        F = sigmoid(np.dot(XH, Wf)+bf)
        O = sigmoid(np.dot(XH, Wo)+bo)
        C_tilda = np.tanh(np.dot(XH, Wc)+bc)
        C = F * C + I * C_tilda
        H = O*np.tanh(C)        # output state
        Y = np.dot(H, Wy) + by  # output
        Zs.append(Y)
        Hs[t] = H
        Cs[t] = C
        Is.append(I)
        Fs.append(F)
        Os.append(O)
        C_tildas.append(C_tilda)
    return Zs, Hs, Cs, (Is, Fs, Os, C_tildas)
Similarly, the forward calculation at a certain moment can also be used as a separate function:
def lstm_forward_step(params,X,H,C):
[Wi, bi,Wf, bf, Wo,bo,Wc,bc,Wy,by] = params
XH = np.column_stack((X, H))
I = sigmoid(np.dot(XH, Wi)+bi)
F = sigmoid(np.dot(XH, Wf)+bf)
O = sigmoid(np.dot(XH, Wo)+bo)
C_tilda = np.tanh(np.dot(XH, Wc)+bc)
C = F * C + I * C_tilda
H = O*np.tanh(C) #O * tanh(C) #Output status
Y = np.dot(H, Wy) + by # output
return Y,H,C,(I,F,O,C_tilda)
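A quick shape check of one forward step, using the initialization helpers above:

params = lstm_params_init(4, 3, 4)   # input_dim, hidden_dim, output_dim
H, C = lstm_state_init(2, 3)         # batch_size, hidden_dim
X = np.random.randn(2, 4)
Y, H, C, _ = lstm_forward_step(params, X, H, C)
print(Y.shape, H.shape, C.shape)     # (2, 4) (2, 3) (2, 3)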
Reverse derivation:
import math
def dsigmoid(x):
return sigmoid(x) * (1 - sigmoid(x))
def dtanh(x):
return 1 - np.tanh(x) * np.tanh(x)
def grad_clipping(grads,alpha):
norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
if norm > alpha:
ratio = alpha / norm
for i in range(len(grads)):
grads[i]*=ratio
def lstm_backward(params, Xs, Hs, Cs, dZs, cache):
    [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by] = params
    grads = [np.zeros_like(p) for p in params]
    dWi, dbi, dWf, dbf, dWo, dbo, dWc, dbc, dWy, dby = grads
    Is, Fs, Os, C_tildas = cache
    dH_next = np.zeros_like(Hs[0])
    dC_next = np.zeros_like(Cs[0])
    input_dim = Xs[0].shape[1]
    T = len(Xs)
    for t in reversed(range(T)):
        I, F, O, C_tilda = Is[t], Fs[t], Os[t], C_tildas[t]
        H, X, C = Hs[t], Xs[t], Cs[t]
        H_pre = Hs[t-1]
        C_prev = Cs[t-1]
        XH_ = np.column_stack((X, H_pre))
        dZ = dZs[t]
        # Z = H W_y + b_y
        dWy += np.dot(H.T, dZ)
        dby += np.sum(dZ, axis=0, keepdims=True)
        dH = np.dot(dZ, Wy.T) + dH_next          # from the output and the next moment
        dC = O*(1-np.tanh(C)**2)*dH + dC_next    # H = O*tanh(C)
        dO = np.tanh(C)*dH
        dOZ = O*(1-O)*dO                         # O = sigma(Z_o)
        dWo += np.dot(XH_.T, dOZ)                # Z_o = (X,H_)W_o + b_o
        dbo += np.sum(dOZ, axis=0, keepdims=True)
        # di
        di = C_tilda*dC
        diZ = I*(1-I)*di
        dWi += np.dot(XH_.T, diZ)
        dbi += np.sum(diZ, axis=0, keepdims=True)
        # df
        df = C_prev*dC
        dfZ = F*(1-F)*df
        dWf += np.dot(XH_.T, dfZ)
        dbf += np.sum(dfZ, axis=0, keepdims=True)
        # dC_tilda
        dC_tilda = I*dC                               # C = F*C_prev + I*C_tilda
        dC_tilda_Z = (1-np.square(C_tilda))*dC_tilda  # C_tilda = tanh(C_tilda_Z)
        dWc += np.dot(XH_.T, dC_tilda_Z)              # C_tilda_Z = (X,H_)W_c + b_c
        dbc += np.sum(dC_tilda_Z, axis=0, keepdims=True)
        # gradients flowing to the previous moment
        dXH = (np.dot(dOZ, Wo.T) + np.dot(diZ, Wi.T)
               + np.dot(dfZ, Wf.T) + np.dot(dC_tilda_Z, Wc.T))
        dH_prev = dXH[:, input_dim:]
        dC_prev = F*dC
        dC_next = dC_prev
        dH_next = dH_prev
    return grads
Gradient Test
T = 3
input_dim, hidden_dim,output_dim = 4,3,4
batch_size = 2
Xs = np.random.randn(T,batch_size,input_dim)
Ys = np.random.randint(output_dim, size=(T,batch_size))
print("Xs",Xs)
print("Ys",Ys)
# check gradient
params = lstm_params_init(input_dim, hidden_dim,output_dim)
HC = lstm_state_init(batch_size,hidden_dim)
Zs,Hs,Cs,cache = lstm_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
grads = lstm_backward(params,Xs,Hs,Cs,dZs,cache)
def rnn_loss():
HC = lstm_state_init(batch_size,hidden_dim)
Zs,Hs,Cs,cache= lstm_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
return loss
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
#rnn_numerical_gradient(rnn_loss,params,1e-10)
#diff_error = lambda x, y: np.max(np.abs(x - y)/(np.maximum(1e-8, np.abs(x) + np.abs(y))))
diff_error = lambda x, y: np.max(np.abs(x - y))
print("loss",loss)
print("[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by] ")
for i in range(len(grads)):
print(diff_error(grads[i],numerical_grads[i]))
print("grads",grads[0])
print("numerical_grads",numerical_grads[0])
def lstm_train(params, data_it, loss_function, optimizer, iterations, print_n=100):
    # the function name is assumed; the original listing preserves only the loop body
    Wy = params[-2]
    hidden_size = Wy.shape[0]
    batch_size = None
    losses = []
    iter = 0
    HC = None
    for Xs, Ys, reset in data_it:
        if reset or batch_size != Xs[0].shape[0]:
            batch_size = Xs[0].shape[0]
            HC = lstm_state_init(batch_size, hidden_size)
        Zs, Hs, Cs, cache = lstm_forward(params, Xs, HC)
        loss, dZs = loss_function(Zs, Ys)
        grads = lstm_backward(params, Xs, Hs, Cs, dZs, cache)
        grad_clipping(grads, 5)   # clip the gradients to keep training stable
        optimizer.step(grads)
        losses.append(loss)
        if iter % print_n == 0:
            print('iter %d, loss: %f' % (iter, loss))
        iter += 1
        if iter > iterations:
            break
        HC = (Hs[len(Xs)-1], Cs[len(Xs)-1])   # carry the state across batches
    return losses, HC
Text generation
Use LSTM instead of ordinary RNN to train the character language model.
filename = 'input.txt'
data = open(filename, 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('Total number of characters %d, length of character table %d unique.'
      % (data_size, vocab_size))
epoches = 3
learning_rate = 1e-2
iterations =10000
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
predict
Similar to the ordinary RNN, a prediction function can be defined:
def predict_lstm(params, prefix, n):
    Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by = params
    vocab_size, hidden_dim = Wi.shape[0]-Wy.shape[0], Wy.shape[0]
    h, c = lstm_state_init(1, hidden_dim)
    output = [char_to_idx[prefix[0]]]
    for t in range(n + len(prefix) - 1):
        x = one_hot_idx(output[-1], vocab_size)
        z, h, c, _ = lstm_forward_step(params, x, h, c)
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            p = np.exp(z) / np.sum(np.exp(z))
            # idx = int(p.argmax(axis=1))
            idx = np.random.choice(range(vocab_size), p=p.ravel())
            output.append(idx)
    return ''.join([idx_to_char[i] for i in output])
str = predict_lstm(params,"he",200)
print(str)
he done!
GLOUCESTER:
Why was I being your houghcessing in lord?
CARILLO:
How, or your his dessent;
Come his false, what comon:
HASTINGS:
Put she with your howiring act a both,
But long and you have
In this figure, "peepholes" are added to all gates; that is, $f_t$, $i_t$, $o_t$ can also see the corresponding cell states $C_{t-1}$ and $C_t$. Some papers, however, add peepholes to only some of the gates.
Considering that the LSTM cell is rather complex, in 2014 Kyunghyun Cho et al. proposed a simplified LSTM variant, the Gated Recurrent Unit (GRU). GRU merges the forget gate and the input gate into a single "update gate", merges the cell state and the hidden state, and introduces some other changes. The resulting model is simpler than the standard LSTM, often performs as well as or better than the classic LSTM, and has therefore become more and more popular.
There are also some other models, such as Depth Gated RNNs proposed by Yao, et al. (2015). At the same
time, there are many completely different ways to solve the long-term dependency problem, such as
Clockwork RNNs proposed by Koutnik, et al. (2014).
Which of the different models is best? Does the difference really matter? Greff, et al. (2015) did a
comparison of popular variants and found that they are basically the same. Jozefowicz et al. (2015) tested
more than 10,000 RNN structures and found that some of them perform better than LSTMs on specific
tasks.
Ordinary RNN neurons use the historical information $H_{t-1}$ and the current input data $X_t$ to compute the information involved in the calculation of the next moment; like a simple RNN, GRU uses only a hidden state $H_t$ to represent all historical information. Like LSTM, GRU also has a forget gate, here called the reset gate, which represents the effect of the memorized information on the computation of the current moment, and an update gate that combines the current activation value $\tilde{H}_t$ and the historical information $H_{t-1}$ to update the historical information $H_t$ at the current moment. As shown in Figure 7-43, there are two gates in the GRU:

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r) \qquad U_t = \sigma(X_t W_{xu} + H_{t-1} W_{hu} + b_u)$$

The reset gate indicates how much historical memory is forgotten, or conversely preserved, in this computation; its output value $R_t$ is multiplied element-wise with the historical memory:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$

This $\tilde{H}_t$ is the activation value at the current moment, also called the current candidate memory.

Figure 7-44 The current working unit of GRU outputs the activation value at the current moment

The weighted average of the current candidate memory $\tilde{H}_t$ and the historical memory $H_{t-1}$, weighted by the update gate, is used as the hidden state at the current moment:

$$H_t = U_t \odot H_{t-1} + (1 - U_t) \odot \tilde{H}_t$$

Figure 7-45 The output value $U_t$ of the update gate performs the weighted average of the historical memory and the current candidate memory
The reverse derivation of GRU is similar to LSTM. After the gradient dZ of the loss function with respect
to the GRU output is known, the reverse derivation calculates the gradient of the loss function with respect
to the model parameters and intermediate transformations. Readers can imitate the reverse derivation of
LSTM to derive the reverse derivation formula of GRU.
Like LSTM, GRU can maintain long-term memory and prevent gradient explosion and disappearance. Its
performance is also comparable to LSTM, and even better than LSTM on some problems. Its
implementation is simpler and more computationally efficient than LSTM. Therefore, in actual use, GRU is
usually used instead of traditional LSTM.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_init_params(input_dim, hidden_dim, output_dim, scale=0.01):
    normal = lambda m, n: np.random.randn(m, n)*scale
    three = lambda: (normal(input_dim, hidden_dim),
                     normal(hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
    Wxu, Whu, bu = three()   # update gate parameters
    Wxr, Whr, br = three()   # reset gate parameters
    Wxh, Whh, bh = three()   # candidate memory parameters
    Wy = normal(hidden_dim, output_dim)
    by = np.zeros((1, output_dim))
    params = [Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by]
    return params
def gru_forward(params, Xs, H):
    Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = params
    Hs = {}
    Ys = []
    Hs[-1] = np.copy(H)
    Rs, Us, H_tildas = [], [], []
    for t in range(len(Xs)):
        X = Xs[t]
        U = sigmoid(np.dot(X, Wxu) + np.dot(H, Whu) + bu)
        R = sigmoid(np.dot(X, Wxr) + np.dot(H, Whr) + br)
        H_tilda = np.tanh(np.dot(X, Wxh) + np.dot(R * H, Whh) + bh)
        H = U * H + (1 - U) * H_tilda
        Y = np.dot(H, Wy) + by
        Hs[t] = H
        Ys.append(Y)
        Rs.append(R)
        Us.append(U)
        H_tildas.append(H_tilda)
    return Ys, Hs, (Rs, Us, H_tildas)
def gru_backward(params, Xs, Hs, dZs, cache):
    Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = params
    (dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby) = \
        [np.zeros_like(p) for p in params]
    Rs, Us, H_tildas = cache
    dH_next = np.zeros_like(Hs[0])
    T = len(Xs)
    for t in reversed(range(T)):
        R, U, H, X = Rs[t], Us[t], Hs[t], Xs[t]
        H_tilda = H_tildas[t]
        H_pre = Hs[t-1]
        dZ = dZs[t]
        # gradient of the output parameters: Y = H Wy + by
        dWy += np.dot(H.T, dZ)
        dby += np.sum(dZ, axis=0, keepdims=True)
        dH = np.dot(dZ, Wy.T) + dH_next
        # H = U*H_pre + (1-U)*H_tilda
        dH_tilda = dH*(1-U)
        dH_pre = dH*U
        dU = H_pre*dH - H_tilda*dH
        # H_tilda = tanh(X Wxh + (R*H_pre) Whh + bh)
        dH_tildaZ = (1-np.square(H_tilda))*dH_tilda
        dWxh += np.dot(X.T, dH_tildaZ)
        dWhh += np.dot((R*H_pre).T, dH_tildaZ)
        dbh += np.sum(dH_tildaZ, axis=0, keepdims=True)
        dR = np.dot(dH_tildaZ, Whh.T)*H_pre
        dH_pre += np.dot(dH_tildaZ, Whh.T)*R
        # U = \sigma(UZ)   R = \sigma(RZ)
        dUZ = U*(1-U)*dU
        dRZ = R*(1-R)*dR
        dWxu += np.dot(X.T, dUZ)
        dWhu += np.dot(H_pre.T, dUZ)
        dbu += np.sum(dUZ, axis=0, keepdims=True)
        dWxr += np.dot(X.T, dRZ)
        dWhr += np.dot(H_pre.T, dRZ)
        dbr += np.sum(dRZ, axis=0, keepdims=True)
        dH_pre += np.dot(dUZ, Whu.T) + np.dot(dRZ, Whr.T)
        dH_next = dH_pre
    return [dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby]
Check that the analytical and numerical gradients are consistent with the following code:
T = 3
input_dim, hidden_dim,output_dim = 4,3,4
batch_size = 1
Xs = np.random.randn(T,batch_size,input_dim)
Ys = np.random.randint(output_dim, size=(T,batch_size))
print("Xs",Xs)
print("Ys",Ys)
# check gradient
params = gru_init_params(input_dim, hidden_dim,output_dim)
HC = gru_state_init(batch_size,hidden_dim)
Zs,Hs,cache = gru_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
grads = gru_backward(params,Xs,Hs,dZs,cache)
def rnn_loss():
HC = gru_state_init(batch_size,hidden_dim)
Zs,Hs,cache= gru_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
return loss
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
#rnn_numerical_gradient(rnn_loss,params,1e-10)
#diff_error = lambda x, y: np.max(np.abs(x - y)/(np.maximum(1e-8, np.abs(x) + np.abs(y))))
diff_error = lambda x, y: np.max(np.abs(x - y))
print("loss",loss)
print("[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by] ")
for i in range(len(grads)):
print(diff_error(grads[i],numerical_grads[i]))
print("grads",grads[0])
print("numerical_grads",numerical_grads[0])
The LSTM forward and reverse computations above can also be encapsulated, together with the model parameters and states, in a class:
class LSTM(object):
    def __init__(self, input_dim, hidden_dim, output_dim, scale=0.01):
        self.input_dim, self.hidden_dim, self.output_dim = input_dim, hidden_dim, output_dim
        normal = lambda m, n: np.random.randn(m, n)*scale
        two = lambda: (normal(input_dim+hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
        Wi, bi = two()   # input gate
        Wf, bf = two()   # forget gate
        Wo, bo = two()   # output gate
        Wc, bc = two()   # current memory unit
        Wy = normal(hidden_dim, output_dim)
        by = np.zeros((1, output_dim))
        self.params = [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by]
        self.grads = [np.zeros_like(param) for param in self.params]
        self.H, self.C = None, None
    def reset_state(self, batch_size):
        self.H, self.C = (np.zeros((batch_size, self.hidden_dim)),
                          np.zeros((batch_size, self.hidden_dim)))
def forward(self,Xs):
[Wi, bi,Wf, bf, Wo,bo,Wc,bc,Wy,by] = self.params
if self.H is None or self.C is None:
self.reset_state(Xs[0].shape[0])
H, C = self.H,self.C
Hs = {}
Cs = {}
Zs = []
Hs[-1] = np.copy(H)
Cs[-1] = np.copy(C)
Is = []
Fs = []
Os = []
C_tildas = []
for t in range(len(Xs)):
X = Xs[t]
XH = np.column_stack((X, H))
I = sigmoid(np.dot(XH, Wi)+bi)
F = sigmoid(np.dot(XH, Wf)+bf)
O = sigmoid(np.dot(XH, Wo)+bo)
C_tilda = np.tanh(np.dot(XH, Wc)+bc)
C = F * C + I * C_tilda
            H = O*np.tanh(C)          # output state
            Y = np.dot(H, Wy) + by    # output
            Zs.append(Y)
Hs[t] = H
Cs[t] = C
Is.append(I)
Fs.append(F)
Os.append(O)
C_tildas.append(C_tilda)
        self.Zs, self.Hs, self.Cs = Zs, Hs, Cs
        self.Is, self.Fs, self.Os, self.C_tildas = Is, Fs, Os, C_tildas
        self.Xs = Xs
return Zs,Hs
    def backward(self, dZs):
        [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by] = self.params
        Xs, Hs, Cs = self.Xs, self.Hs, self.Cs
        Is, Fs, Os, C_tildas = self.Is, self.Fs, self.Os, self.C_tildas
        dWi, dbi, dWf, dbf, dWo, dbo, dWc, dbc, dWy, dby = \
            [np.zeros_like(p) for p in self.params]
        dH_next = np.zeros_like(Hs[0])
        dC_next = np.zeros_like(Cs[0])
        input_dim = Xs[0].shape[1]
        T = len(Xs)
        for t in reversed(range(T)):
            I, F, O, C_tilda = Is[t], Fs[t], Os[t], C_tildas[t]
            H, X, C = Hs[t], Xs[t], Cs[t]
            H_pre, C_prev = Hs[t-1], Cs[t-1]
            XH_ = np.column_stack((X, H_pre))
            dZ = dZs[t]
            dWy += np.dot(H.T, dZ)
            dby += np.sum(dZ, axis=0, keepdims=True)
            dH = np.dot(dZ, Wy.T) + dH_next
            dC = O*(1-np.tanh(C)**2)*dH + dC_next    # H = O*tanh(C)
            dO = np.tanh(C)*dH
            dOZ = O*(1-O)*dO
            dWo += np.dot(XH_.T, dOZ)
            dbo += np.sum(dOZ, axis=0, keepdims=True)
            #di
            di = C_tilda*dC
            diZ = I*(1-I)*di
            dWi += np.dot(XH_.T, diZ)
            dbi += np.sum(diZ, axis=0, keepdims=True)
            #df
            df = C_prev*dC
            dfZ = F*(1-F)*df
            dWf += np.dot(XH_.T, dfZ)
            dbf += np.sum(dfZ, axis=0, keepdims=True)
            # dC_tilda
            dC_tilda = I*dC                               # C = F*C_prev + I*C_tilda
            dC_tilda_Z = (1-np.square(C_tilda))*dC_tilda  # C_tilda = tanh(...)
            dWc += np.dot(XH_.T, dC_tilda_Z)
            dbc += np.sum(dC_tilda_Z, axis=0, keepdims=True)
            dXH = (np.dot(dOZ, Wo.T) + np.dot(diZ, Wi.T)
                   + np.dot(dfZ, Wf.T) + np.dot(dC_tilda_Z, Wc.T))
            dH_prev = dXH[:, input_dim:]
            dC_prev = F*dC
            dC_next = dC_prev
            dH_next = dH_prev
        grads = [dWi, dbi, dWf, dbf, dWo, dbo, dWc, dbc, dWy, dby]
        for i, _ in enumerate(self.grads):
            self.grads[i] += grads[i]
        return self.grads
def parameters(self):
return self.params
lstm = LSTM(input_dim, hidden_dim, output_dim)
lstm.reset_state(batch_size)
Zs, Hs = lstm.forward(Xs)
loss_function = rnn_loss_grad
loss, dZs = loss_function(Zs, Ys)
grads = lstm.backward(dZs)
def rnn_loss():
lstm.reset_state(batch_size)
Zs,Hs = lstm.forward(Xs)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
return loss
params = lstm.parameters()
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
diff_error = lambda x, y: np.max( np.abs(x - y))
print("loss",loss)
print("[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by] ")
for i in range(len(grads)):
print(diff_error(grads[i],numerical_grads[i]))
print("grads",grads[0])
print("numerical_grads",numerical_grads[0])
loss 4.15897570534243
[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by]
4.0983714987404213e-10
4.804842887035274e-10
5.574688488332363e-10
5.962706955096197e-10
4.786088983281455e-10
3.3010982580892407e-10
5.250774498359589e-10
7.762481196021964e-10
5.116074152863859e-10
4.973363854077206e-08
grads [[-1.40953185e-06 1.39633673e-05 3.77862529e-05]
[-2.05605688e-06 -6.94901972e-06 -9.72150550e-06]
[-1.97703294e-06 2.14765528e-05 -6.23417436e-07]
[ 2.38579566e-06 3.03502478e-05 5.32372144e-06]
[-2.43351424e-10 -1.73915908e-09 -1.49094729e-08]
[ 1.89104848e-08 1.69377027e-07 1.08468341e-07]
[-6.11087686e-09 -6.70921838e-08 -7.03528265e-09]]
numerical_grads [[-1.40953915e-06 1.39630529e-05 3.77866627e-05]
[-2.05613304e-06 -6.94910796e-06 -9.72155689e-06]
[-1.97708516e-06 2.14761542e-05 -6.23501251e-07]
[ 2.38564724e-06 3.03503889e-05 5.32374145e-06]
[-4.44089210e-10 -1.77635684e-09 -1.46549439e-08]
[ 1.86517468e-08 1.69197989e-07 1.08357767e-07]
[-5.77315973e-09 -6.70574707e-08 -7.10542736e-09]]
The following GRU class implements a recurrent neural network with the GRU structure:
class GRU(object):
    def __init__(self, input_dim, hidden_dim, output_dim, scale=0.01):
        super(GRU, self).__init__()
        self.input_dim, self.hidden_dim, self.output_dim, self.scale = \
            input_dim, hidden_dim, output_dim, scale
        normal = lambda m, n: np.random.randn(m, n)*scale
        three = lambda: (normal(input_dim, hidden_dim),
                         normal(hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
        Wxu, Whu, bu = three()   # update gate
        Wxr, Whr, br = three()   # reset gate
        Wxh, Whh, bh = three()   # candidate memory
        Wy = normal(hidden_dim, output_dim)
        by = np.zeros((1, output_dim))
        self.params = [Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by]
        self.grads = [np.zeros_like(param) for param in self.params]
        self.H = None
    def reset_state(self, batch_size):
        self.H = np.zeros((batch_size, self.hidden_dim))
    def forward_step(self, X):
        Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = self.params
        H = self.H   # previous state
        U = sigmoid(np.dot(X, Wxu) + np.dot(H, Whu) + bu)
        R = sigmoid(np.dot(X, Wxr) + np.dot(H, Whr) + br)
        H_tilda = np.tanh(np.dot(X, Wxh) + np.dot(R * H, Whh) + bh)
        H = U * H + (1 - U) * H_tilda
        Y = np.dot(H, Wy) + by
        self.H = H
        return Y, H, (R, U, H_tilda)
def forward(self,Xs):
Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy,by = self.params
if self.H is None:
self.reset_state(Xs[0].shape[0])
H = self.H
Hs = {}
Ys = []
Hs[-1] = np.copy(H)
Rs = []
Us = []
H_tildas = []
for t in range(len(Xs)):
X = Xs[t]
U = sigmoid(np.dot(X, Wxu) + np.dot(H, Whu) + bu)
R = sigmoid(np.dot(X, Wxr) + np.dot(H, Whr) + br)
H_tilda = np.tanh(np.dot(X, Wxh) + np.dot(R * H, Whh) + bh)
H = U * H + (1 - U) * H_tilda
Y = np.dot(H, Wy) + by
Hs[t] = H
Ys.append(Y)
Rs.append(R)
Us.append(U)
H_tildas.append(H_tilda)
        self.Ys, self.Hs, self.Rs, self.Us, self.H_tildas = Ys, Hs, Rs, Us, H_tildas
        self.Xs = Xs
        self.H = H
        return Ys, Hs   # return Ys,Hs,(Rs,Us,H_tildas)
    def backward(self, dZs):
        Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = self.params
        Xs, Hs = self.Xs, self.Hs
        Rs, Us, H_tildas = self.Rs, self.Us, self.H_tildas
        (dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby) = \
            [np.zeros_like(p) for p in self.params]
        dH_next = np.zeros_like(Hs[0])
        T = len(Xs)
        for t in reversed(range(T)):
            R, U, H, X = Rs[t], Us[t], Hs[t], Xs[t]
            H_tilda = H_tildas[t]
            H_pre = Hs[t-1]
            dZ = dZs[t]
            # gradient of the output parameters: Y = H Wy + by
            dWy += np.dot(H.T, dZ)
            dby += np.sum(dZ, axis=0, keepdims=True)
            dH = np.dot(dZ, Wy.T) + dH_next
            # H = U*H_pre + (1-U)*H_tilda
            dH_tilda = dH*(1-U)
            dH_pre = dH*U
            dU = H_pre*dH - H_tilda*dH
            # H_tilda = tanh(X Wxh + (R*H_pre) Whh + bh)
            dH_tildaZ = (1-np.square(H_tilda))*dH_tilda
            dWxh += np.dot(X.T, dH_tildaZ)
            dWhh += np.dot((R*H_pre).T, dH_tildaZ)
            dbh += np.sum(dH_tildaZ, axis=0, keepdims=True)
            dR = np.dot(dH_tildaZ, Whh.T)*H_pre
            dH_pre += np.dot(dH_tildaZ, Whh.T)*R
            # U = \sigma(UZ)   R = \sigma(RZ)
            dUZ = U*(1-U)*dU
            dRZ = R*(1-R)*dR
            dWxu += np.dot(X.T, dUZ)
            dWhu += np.dot(H_pre.T, dUZ)
            dbu += np.sum(dUZ, axis=0, keepdims=True)
            dWxr += np.dot(X.T, dRZ)
            dWhr += np.dot(H_pre.T, dRZ)
            dbr += np.sum(dRZ, axis=0, keepdims=True)
            dH_pre += np.dot(dUZ, Whu.T) + np.dot(dRZ, Whr.T)
            dH_next = dH_pre
        grads = [dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby]
        for i, _ in enumerate(self.grads):
            self.grads[i] += grads[i]
        return self.grads
def get_states(self):
return self.Hs
def get_outputs(self):
return self.Ys
def parameters(self):
return self.params
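A minimal smoke test of the GRU class (shapes only):

np.random.seed(1)
gru = GRU(4, 3, 4)               # input_dim, hidden_dim, output_dim
gru.reset_state(2)               # batch_size = 2
Xs = np.random.randn(5, 2, 4)    # (seq_len, batch_size, input_dim)
Ys_out, Hs = gru.forward(Xs)
print(Ys_out[0].shape, Hs[4].shape)   # (2, 4) (2, 3)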
Each type of recurrent unit computes, from the input $x$ of the current time step and the state $h$ of the previous time step, the new state $h'$ passed to the next time step (for LSTM, also the current memory storage $c'$). For example, for a simple RNN, the forward calculation formula is:

$$h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})$$

Here the original bias $b_h$ is split into two terms, $b_{ih}$ and $b_{hh}$, denoting the bias of the weighted sum of the data input and the bias of the weighted sum of the hidden state, respectively. Similarly, for LSTM, the single bias of each original weighted sum can be split into two biases; the LSTM cell computes:

$$c' = f * c + i * g \qquad h' = o * \tanh(c')$$

Similarly, the GRU neural network unit computes:

$$h' = (1 - z) * n + z * h$$

A common base class can be used to represent the common properties of these 3 different neural network units:
import numpy as np
import math
class RNNCellBase(object):
    __constants__ = ['input_size', 'hidden_size']
    def __init__(self, input_size, hidden_size, bias, num_chunks):
        super(RNNCellBase, self).__init__()
        self.input_size, self.hidden_size = input_size, hidden_size
        self.bias = bias
        self.W_ih = np.empty((input_size, num_chunks*hidden_size))   # input to hidden
        self.W_hh = np.empty((hidden_size, num_chunks*hidden_size))  # hidden to hidden
        if bias:
            self.b_ih = np.zeros((1, num_chunks*hidden_size))
            self.b_hh = np.zeros((1, num_chunks*hidden_size))
            self.params = [self.W_ih, self.W_hh, self.b_ih, self.b_hh]
        else:
            self.b_ih = None
            self.b_hh = None
            self.params = [self.W_ih, self.W_hh]
        self.grads = [np.zeros_like(p) for p in self.params]   # gradient buffers
        self.reset_parameters()
    def reset_parameters(self):
        # a simple default initialization in [-1/sqrt(h), 1/sqrt(h)]
        # (the original body of this method is not shown)
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for p in self.params:
            p[:] = np.random.uniform(-stdv, stdv, p.shape)
    def check_forward_hidden(self, input, h, hidden_label=''):
        if h.shape[1] != self.hidden_size:
            raise RuntimeError(
                "hidden{} has inconsistent hidden_size: got {}, expected {}".format(
                    hidden_label, h.shape[1], self.hidden_size))
The constructor parameters input_size and hidden_size give the sizes of the input data and the state, and num_chunks gives the number of gate computations packed side by side into each unit's weight matrices: 1 for the basic RNN, 4 for LSTM, and 3 for GRU. check_forward_input and check_forward_hidden are auxiliary methods that check whether the sizes of the input data and the hidden state match the unit's model parameters. A concrete type of neural network unit can then be defined on the basis of the base class RNNCellBase. The following code defines the class RNNCell representing a simple RNN cell:
def relu(x):
    return x * (x > 0)

class RNNCell(RNNCellBase):
    """ h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})"""
    __constants__ = ['input_size', 'hidden_size', 'nonlinearity']
    def __init__(self, input_size, hidden_size, bias=True, nonlinearity="tanh"):
        super(RNNCell, self).__init__(input_size, hidden_size, bias, num_chunks=1)
        self.nonlinearity = nonlinearity
    def __call__(self, x, h):
        # one forward time step
        Zh = np.dot(x, self.W_ih) + np.dot(h, self.W_hh)
        if self.bias:
            Zh += self.b_ih + self.b_hh
        return np.tanh(Zh) if self.nonlinearity == "tanh" else relu(Zh)
    def backward(self, dh, H, X, H_pre):
        if self.nonlinearity == "tanh":
            dZh = (1 - H * H) * dh   # backprop through the tanh nonlinearity
        else:
            dZh = (H > 0) * dh       # backprop through the relu nonlinearity
        db_hh = np.sum(dZh, axis=0, keepdims=True)
        db_ih = np.sum(dZh, axis=0, keepdims=True)
        dW_ih = np.dot(X.T, dZh)
        dW_hh = np.dot(H_pre.T, dZh)
        dh_pre = np.dot(dZh, self.W_hh.T)
        dx = np.dot(dZh, self.W_ih.T)
        grads = (dW_ih, dW_hh, db_ih, db_hh)
        for a, b in zip(self.grads, grads):
            a += b
        return dx, dh_pre, grads
The following code demonstrates the forward and reverse calculation of a time step of RNNCell, where x is
the input data with a batch size of 3, and h is the state corresponding to a batch size of 3:
import numpy as np
np.random.seed(1)
x = np.random.randn(3, 10) #(batch_size,input_dim)
h = np.random.randn(3, 20) #(batch_size,hidden_dim)
rnn = RNNCell(10, 20) #(input_dim,hidden_dim)
h_ = rnn(x, h)
print("h_:",h_)
dh_ = np.random.randn(*h.shape)
dx,dh,_ = rnn.backward(dh_,h_,x,h)
print("dh:",dh)
# forward and backward through a sequence of 6 time steps
x = np.random.randn(6, 3, 10)   # (seq_len, batch_size, input_dim); re-created as a sequence
h_0 = h.copy()
hs = []
for i in range(6):
    h = rnn(x[i], h)
    hs.append(h)
print("h:", hs[0])
dh = np.random.randn(*h.shape)
for i in reversed(range(6)):
    if i == 0:
        dx, dh, _ = rnn.backward(dh, hs[i], x[i], h_0)
    else:
        dx, dh, _ = rnn.backward(dh, hs[i], x[i], hs[i-1])
print("dh:", dh)
Similarly, LSTMCell and GRUCell of LSTM and GRU types can be defined. The code of LSTMCell is as
follows:
def sigmoid(x):
return (1 / (1 + np.exp(-x)))
def lstm_cell(x, hc,w_ih, w_hh,b_ih, b_hh):
h,c = hc[0],hc[1]
hidden_size = w_ih.shape[1]//4
ifgo_Z = np.dot(x,w_ih) + b_ih + np.dot(h,w_hh) + b_hh
i = sigmoid(ifgo_Z[:,:hidden_size])
f = sigmoid(ifgo_Z[:,hidden_size:2*hidden_size])
g = np.tanh(ifgo_Z[:,2*hidden_size:3*hidden_size])
o = sigmoid(ifgo_Z[:,3*hidden_size:])
c_ = f*c+i*g
h_ = o*np.tanh(c_)
return (h_,c_),np.column_stack((i,f,g,o))
def lstm_cell_backward(dh_, dc_next, c, h_pre, x, ifgo, w_ih, w_hh):
    # reconstructed opening: dh_ and dc_next are the gradients w.r.t. the outputs
    # h' and c'; c and h_pre are the previous cell and hidden states; ifgo stacks
    # the gate activations returned by lstm_cell (the signature is an assumption)
    hidden_size = w_ih.shape[1]//4
    i, f = ifgo[:, :hidden_size], ifgo[:, hidden_size:2*hidden_size]
    g, o = ifgo[:, 2*hidden_size:3*hidden_size], ifgo[:, 3*hidden_size:]
    c_ = f*c + i*g                               # recompute the new cell state
    do = np.tanh(c_)*dh_                         # h' = o*tanh(c')
    dc_ = o*(1-np.tanh(c_)**2)*dh_ + dc_next     # total gradient w.r.t. c'
    di, df, dg = g*dc_, c*dc_, i*dc_             # c' = f*c + i*g
    diz = i*(1-i)*di
    dfz = f*(1-f)*df
    dgz = (1-np.square(g))*dg
    doz = o*(1-o)*do
    dZ = np.column_stack((diz, dfz, dgz, doz))
    dW_ih = np.dot(x.T, dZ)
    dW_hh = np.dot(h_pre.T, dZ)
    db_hh = np.sum(dZ, axis=0, keepdims=True)
    db_ih = np.sum(dZ, axis=0, keepdims=True)
    dx = np.dot(dZ, w_ih.T)
    dh_pre = np.dot(dZ, w_hh.T)
    dc = dc_*f                                   # gradient w.r.t. the previous cell state
    return dx, (dh_pre, dc), (dW_ih, dW_hh, db_ih, db_hh)
class LSTMCell(RNNCellBase):
    """ \begin{array}{ll}
    i = \sigma(W_{ii} x + b_{ii} + W_{hi} h + b_{hi}) \\
    f = \sigma(W_{if} x + b_{if} + W_{hf} h + b_{hf}) \\
    g = \tanh(W_{ig} x + b_{ig} + W_{hg} h + b_{hg}) \\
    o = \sigma(W_{io} x + b_{io} + W_{ho} h + b_{ho}) \\
    c' = f * c + i * g \\
    h' = o * \tanh(c') \\
    \end{array}
    """
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTMCell, self).__init__(input_size, hidden_size, bias, num_chunks=4)
    def init_hidden(self, batch_size):
        zeros = np.zeros((batch_size, self.hidden_size))
        return (zeros, zeros)
    def __call__(self, x, hc):
        return lstm_cell(x, hc, self.W_ih, self.W_hh, self.b_ih, self.b_hh)
    def backward(self, dhc, ifgo, x, hc_pre):
        dh_, dc_next = dhc
        h_pre, c = hc_pre[0], hc_pre[1]
        dx, dhc_pre, grads = lstm_cell_backward(dh_, dc_next, c, h_pre, x,
                                                ifgo, self.W_ih, self.W_hh)
        for a, b in zip(self.grads, grads):
            a += b
        return dx, dhc_pre, grads
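The GRU cell follows the same pattern. The forward-step helper itself is missing from the text; a sketch consistent with the backward fragment below, packing the (r, u, n) blocks side by side as in lstm_cell:

def gru_cell(x, h, w_ih, w_hh, b_ih, b_hh):
    # one GRU forward step; u plays the role of the update gate z
    hidden_size = w_ih.shape[1]//3
    Z_ih = np.dot(x, w_ih) + b_ih
    Z_hh = np.dot(h, w_hh) + b_hh
    r = sigmoid(Z_ih[:, :hidden_size] + Z_hh[:, :hidden_size])
    u = sigmoid(Z_ih[:, hidden_size:2*hidden_size] + Z_hh[:, hidden_size:2*hidden_size])
    n = np.tanh(Z_ih[:, 2*hidden_size:] + r*Z_hh[:, 2*hidden_size:])
    h_ = u*h + (1-u)*n
    return h_, np.column_stack((r, u, n))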
def gru_cell_backward(dh, x, h_pre, run, w_ih, w_hh, b_hh):
    # reconstructed opening: dh is the gradient w.r.t. h'; run stacks the
    # activations (r, u, n) returned by gru_cell (the signature is an assumption)
    hidden_size = w_ih.shape[1]//3
    r, u = run[:, :hidden_size], run[:, hidden_size:2*hidden_size]
    n = run[:, 2*hidden_size:]
    # h' = u*h_pre + (1-u)*n
    dn = dh*(1-u)
    dh_pre = dh*u
    du = h_pre*dh - n*dh
    dnz = (1-np.square(n))*dn   # n = tanh(Z_in + r*Z_hn)
    Z_hn = np.dot(h_pre, w_hh[:, 2*hidden_size:]) + b_hh[:, 2*hidden_size:]
    dr = dnz*Z_hn
    dZ_ih_n = dnz
    dZ_hh_n = dnz*r
    duz = u*(1-u)*du
    dZ_ih_u = duz
    dZ_hh_u = duz
    drz = r*(1-r)*dr
    dZ_ih_r = drz
    dZ_hh_r = drz
    dZ_ih = np.column_stack((dZ_ih_r, dZ_ih_u, dZ_ih_n))
    dZ_hh = np.column_stack((dZ_hh_r, dZ_hh_u, dZ_hh_n))
    dW_ih = np.dot(x.T, dZ_ih)
    dW_hh = np.dot(h_pre.T, dZ_hh)
    db_ih = np.sum(dZ_ih, axis=0, keepdims=True)
    db_hh = np.sum(dZ_hh, axis=0, keepdims=True)
    dh_pre += np.dot(dZ_hh, w_hh.T)
    dx = np.dot(dZ_ih, w_ih.T)
    return dx, dh_pre, (dW_ih, dW_hh, db_ih, db_hh)
class GRUCell(RNNCellBase):
    """ \begin{array}{ll}
    r = \sigma(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \\
    z = \sigma(W_{iz} x + b_{iz} + W_{hz} h + b_{hz}) \\
    n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \\
    h' = (1 - z) * n + z * h
    \end{array}
    """
    def __init__(self, input_size, hidden_size, bias=True):
        super(GRUCell, self).__init__(input_size, hidden_size, bias, num_chunks=3)
    def __call__(self, x, h):
        return gru_cell(x, h, self.W_ih, self.W_hh, self.b_ih, self.b_hh)
    def backward(self, dh, run, x, h_pre):
        dx, dh_pre, grads = gru_cell_backward(dh, x, h_pre, run,
                                              self.W_ih, self.W_hh, self.b_hh)
        for a, b in zip(self.grads, grads):
            a += b
        return dx, dh_pre, grads
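The effect of num_chunks on the packed weight shapes can be checked directly:

print(RNNCell(10, 20).W_ih.shape)    # (10, 20)  num_chunks=1
print(LSTMCell(10, 20).W_ih.shape)   # (10, 80)  num_chunks=4
print(GRUCell(10, 20).W_ih.shape)    # (10, 60)  num_chunks=3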
Different recurrent neural network models can be implemented with neural network units.
7.10 Multilayer, Bidirectional Recurrent Neural Network
..., the last recurrent layer can be used as the output layer of the whole network, or can be followed by one or more non-recurrent layers.

Figure 7-46 Multi-layer recurrent neural network: the first hidden layer accepts the data input and generates a hidden state $H^{(1)}$, which serves as the input of the second hidden layer; the last recurrent layer can be used as the output layer of the entire network or can be followed by one or more non-recurrent layers
At time $t$, the neurons of layer 1 receive the data input $X_t$ and the state input $H^{(1)}_{t-1}$, and compute the hidden state of the first layer:

$$H^{(1)}_t = f_1(X_t, H^{(1)}_{t-1})$$

The state $H^{(1)}_t$ of the RNN units (neurons) of the first layer serves as the data input of the RNN units of the second layer, which together with the second layer's own previous state $H^{(2)}_{t-1}$ computes the state $H^{(2)}_t$. That state in turn is the data input used to compute the hidden state $H^{(3)}_t$ of the third layer. In general, the $l$-th hidden layer accepts its own previous hidden state $H^{(l)}_{t-1}$ and the output of the layer below at time $t$ (usually the hidden state $H^{(l-1)}_t$), and outputs the hidden state $H^{(l)}_t$ at time $t$; its computation can be expressed by the formula:

$$H^{(l)}_t = f_l(H^{(l-1)}_t, H^{(l)}_{t-1})$$
Except for the first layer, whose data input is the original input $X_t$, the data input of every other recurrent layer is the hidden state output $H^{(l-1)}_t$ of the previous recurrent layer.

The state variable of the last layer of the multi-layer recurrent network can be output directly as the model output, $F_t = H^{(L)}_t$, or passed through an activation function:

$$F_t = g(H^{(L)}_t)$$
If the last recurrent layer is the output layer of the entire network, this $F_t$ is the output of the entire network; if some non-recurrent layers follow, $F_t$ becomes their input. In a multi-layer recurrent network, the size of the initial data input $X_t$ and the size of the hidden state $H$ are usually different, while the size of $H$ is the same in every recurrent layer. The data input size of the first layer therefore usually differs from that of the other recurrent layers, so the weight shapes of all recurrent layers are the same except for the first layer. Of course, different recurrent layers could use hidden states of different sizes, but in practice this is usually not done.
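The shape difference between the first layer and the later layers can be verified directly with the cells defined above, for example:

first = LSTMCell(5, 8)    # the first layer maps the data input (size 5)
other = LSTMCell(8, 8)    # later layers take the hidden state (size 8) as input
print(first.W_ih.shape, other.W_ih.shape)   # (5, 32) (8, 32)
print(first.W_hh.shape, other.W_hh.shape)   # (8, 32) (8, 32)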
The recurrent neural network units above can be used to construct a multi-layer recurrent neural network. The following code builds a base class RNNBase representing a multi-layer recurrent network on top of the unit classes:
from Layers import *
class RNNBase(Layer):
    def __init__(self, mode, input_size, hidden_size, n_layers, bias=True):
        super(RNNBase, self).__init__()
        self.mode = mode
        if mode == 'RNN_TANH':
            self.cells = [RNNCell(input_size, hidden_size, bias, nonlinearity="tanh")]
            self.cells += [RNNCell(hidden_size, hidden_size, bias, nonlinearity="tanh")
                           for i in range(n_layers-1)]
        elif mode == 'RNN_RELU':
            self.cells = [RNNCell(input_size, hidden_size, bias, nonlinearity="relu")]
            self.cells += [RNNCell(hidden_size, hidden_size, bias, nonlinearity="relu")
                           for i in range(n_layers-1)]
        elif mode == 'LSTM':
            self.cells = [LSTMCell(input_size, hidden_size, bias)]
            self.cells += [LSTMCell(hidden_size, hidden_size, bias)
                           for i in range(n_layers-1)]
        elif mode == 'GRU':
            self.cells = [GRUCell(input_size, hidden_size, bias)]
            self.cells += [GRUCell(hidden_size, hidden_size, bias)
                           for i in range(n_layers-1)]
        self.input_size, self.hidden_size = input_size, hidden_size
        self.n_layers = n_layers
        self.flatten_parameters()
        self._params = None
def flatten_parameters(self):
self.params = []
self.grads = []
for i in range(self.n_layers):
rnn = self.cells[i]
for j,p in enumerate(rnn.params):
self.params.append(p)
self.grads.append(rnn.grads[j])
    def forward(self, input, h):
        # the setup below reconstructs the truncated opening of this method:
        # input has shape (seq_len, batch_size, input_size); h holds the
        # initial states of all layers
        mode, n_layers = self.mode, self.n_layers
        seq_len = input.shape[0]
        self.h = h
        hs = [[] for i in range(n_layers)]   # hidden states of every layer
        zs = [[] for i in range(n_layers)]   # cached gate activations (LSTM/GRU)
        x = input
        for i in range(n_layers):
            cell = self.cells[i]
            if i != 0:
                x = hs[i-1]   # output h of the previous layer
                if mode == 'LSTM':
                    x = np.array([h_ for h_, c in x])
            hi = h[i]
            if mode == 'LSTM':
                hi = (h[0][i], h[1][i])
            for t in range(seq_len):
                hi = cell(x[t], hi)
                if isinstance(hi, tuple):
                    hi, z = hi[0], hi[1]
                    zs[i].append(z)
                hs[i].append(hi)
        self.hs = np.array(hs)   # (layer_size, seq_size, batch_size, hidden_size)
        if len(zs[0]) > 0:
            self.zs = np.array(zs)
        else:
            self.zs = None
        output = np.array([h_[0] if isinstance(h_, tuple) else h_ for h_ in hs[-1]])
        hn = [hs[i][-1] for i in range(n_layers)]   # final state of each layer
        return output, hn
    def backward(self, dhs, input):   #,hs):
        if self.hs is None:
            self.hs, _ = self.forward(input)
        hs = self.hs
        zs = self.zs if self.zs is not None else hs
        seq_len, batch_size = input.shape[0], input.shape[1]
        dinput = [None for t in range(seq_len)]
        #----dhidden--------
        dhidden = [None for i in range(self.n_layers)]
        for layer in reversed(range(self.n_layers)):
            layer_hs = hs[layer]
            layer_zs = zs[layer]
            cell = self.cells[layer]
            if layer == 0:
                layer_input = input
            else:
                if self.mode == 'LSTM':
                    layer_input = self.hs[layer-1]
                    layer_input = [h for h, c in layer_input]
                else:
                    layer_input = self.hs[layer-1]
            h_0 = self.h[layer]
            dh = np.zeros_like(dhs[0])   # gradient from the next moment
            if self.mode == 'LSTM':
                h_0 = (self.h[0][layer], self.h[1][layer])
                dc = np.zeros_like(dhs[0])
            for t in reversed(range(seq_len)):
                dh += dhs[t]   # gradient of the next moment + gradient of the current moment
                h_pre = h_0 if t == 0 else layer_hs[t-1]
                if self.mode == 'LSTM':
                    dhc = (dh, dc)
                    dx, dhc, _ = cell.backward(dhc, layer_zs[t], layer_input[t], h_pre)
                    dh, dc = dhc
                else:
                    dx, dh, _ = cell.backward(dh, layer_zs[t], layer_input[t], h_pre)
                if layer > 0:
                    dhs[t] = dx
                else:
                    dinput[t] = dx
                #----dhidden--------
                if t == 0:
                    if self.mode == 'LSTM':
                        dhidden[layer] = dhc
                    else:
                        dhidden[layer] = dh
        return np.array(dinput), np.array(dhidden)
def parameters(self):
if self._params is None:
self._params = []
for i, _ in enumerate(self.params):
self._params.append([self.params[i],self.grads[i]])
return self._params
On the basis of this base class, concrete types of multi-layer recurrent networks can be implemented. The following classes RNN, LSTM, and GRU implement a multi-layer simple recurrent network, a multi-layer LSTM, and a multi-layer GRU, respectively:
class RNN(RNNBase):
    def __init__(self, *args, **kwargs):
        if 'nonlinearity' in kwargs:
            if kwargs['nonlinearity'] == 'tanh':
                mode = 'RNN_TANH'
            elif kwargs['nonlinearity'] == 'relu':
                mode = 'RNN_RELU'
            else:
                raise ValueError("Unknown nonlinearity '{}'".format(kwargs['nonlinearity']))
            del kwargs['nonlinearity']
        else:
            mode = 'RNN_TANH'
        super(RNN, self).__init__(mode, *args, **kwargs)
class LSTM(RNNBase):
def __init__(self,*args, **kwargs):
super(LSTM, self).__init__('LSTM', *args, **kwargs)
class GRU(RNNBase):
def __init__(self,*args, **kwargs):
super(GRU, self).__init__('GRU', *args, **kwargs)
These multilayer recurrent neural networks can be tested with the following code:
import numpy as np
from rnn import *
np.random.seed(1)
num_layers= 2
batch_size,input_size,hidden_size= 3,5,8
seg_len = 6
test_RNN = "LSTM"
if test_RNN == "rnnTANH":
rnn = RNN(input_size,hidden_size,num_layers )
elif test_RNN == "rnnRELU":
rnn = RNN(input_size,hidden_size, num_layers,nonlinearity=
'relu')
elif test_RNN == "GRU":
rnn = GRU(input_size,hidden_size, num_layers)
elif test_RNN == "LSTM":
rnn = LSTM(input_size,hidden_size, num_layers)
input = np.random.randn(seg_len, batch_size, input_size)   # (seq_len, batch, input)
h_0 = np.random.randn(num_layers, batch_size, hidden_size)
c_0 = np.random.randn(num_layers, batch_size, hidden_size)
print("input.shape",input.shape)
print("h_0.shape",h_0.shape)
print("c_0.shape",c_0.shape)
if test_RNN == "LSTM":
output, hn = rnn(input, (h_0,c_0))
else:
output, hn = rnn(input, h_0)
print("output.shape",output.shape)
print("output",output)
print("hn",hn)
#------test backward---
do = np.random.randn(*output.shape)
dinput,dhidden = rnn.backward(do,input)#,rnn.hs)#output)
print("dinput.shape:",dinput.shape)
print("dinput:",dinput)
print("dhidden:",dhidden)
class LSTM_Model(object):
    # the class name and constructor are assumptions; the original listing is
    # truncated and only a few methods survive
    def __init__(self, input_size, hidden_size, output_dim, num_layers):
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.lstm = LSTM(input_size, hidden_size, num_layers)
        self.linear = Dense(hidden_size, output_dim)   # a fully-connected layer;
                                                       # the class name from Layers is assumed
        self.layers = [self.lstm, self.linear]
        self._params = None
    def init_hidden(self, batch_size):
        # This is what we'll initialise our hidden state as
        self.h_0 = (np.zeros((self.num_layers, batch_size, self.hidden_size)),
                    np.zeros((self.num_layers, batch_size, self.hidden_size)))
        return self.h_0
    def __call__(self, input):
        batch_size = input.shape[1]
        hs_out, hn = self.lstm(input, self.h_0)
        y_pred = self.linear(hs_out[-1].reshape(batch_size, -1))
        return y_pred
    def backward(self, dZs, input):
        dhs = self.linear.backward(dZs)
        # only the last time step feeds the linear layer, so the gradient at
        # the other steps is zero
        seq_len, batch_size = input.shape[0], input.shape[1]
        dhs_seq = np.zeros((seq_len, batch_size, self.hidden_size))
        dhs_seq[-1] = dhs.reshape(batch_size, -1)
        dinput = self.lstm.backward(dhs_seq, input)
        return dinput
    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
The code below models autoregressive data with the above multi-layer recurrent neural network. The ARData class, from https://github.com/jessicayung/blog-code-snippets/blob/master/lstm-pytorch/generate_data.py, is used to generate the autoregressive training data.
import util
from train import *
from generate_data import *
import matplotlib.pyplot as plt
%matplotlib inline

input_size = 20
# Data params
noise_var = 0
num_datapoints = 100
test_size = 0.2
num_train = int((1 - test_size) * num_datapoints)
hidden_size = 32
lstm_input_size = input_size
output_dim = 1
num_layers = 2
batch_size = num_train  # 80
loss_fn = util.mse_loss_grad  # returns (loss, grad) for (f, y); plays the role of torch.nn.MSELoss(size_average=False)
learning_rate = 1e-3
momentum = 0.9
#optimizer = SGD(model.parameters(), learning_rate, momentum)
optimizer = Adam(model.parameters(), learning_rate)
num_epochs = 500
print(X_train.shape)
hist = np.zeros(num_epochs)
for t in range(num_epochs):
    model.hidden = model.init_hidden(batch_size)
    y_pred = model(X_train)  # Forward pass

plt.plot(y_pred, label="Preds")
plt.plot(y_train, label="Data")
plt.legend()
plt.show()
plt.plot(hist, label="Training loss")
plt.legend()
plt.show()
(20, 80, 1)
(20, 80, 1)
(1, 80, 20)
Epoch 0 MSE: 0.030292062696899477
Epoch 100 MSE: 0.013801384758457096
Epoch 200 MSE: 0.013244797126843889
Epoch 300 MSE: 0.013052903618001023
Epoch 400 MSE: 0.012934439762440214
Figure 7-47 Prediction and real data of the 2-layer LSTM network trained on autoregressive data
Figure 7-48 The training loss curve of the 2-layer LSTM network trained on autoregressive data
$$\overrightarrow{H}_t = \phi(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)})$$

$$\overleftarrow{H}_t = \phi(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)})$$
←
Among them, H → t, Ht represent the forward and backward state variables,
(f ) (f ) (b) (b)
respectively. W xh
∈ R
d×h
,W
hh
h×h
∈ R ,W
xh
d×h
∈ R ,W
hh
h×h
∈ R is the
(f ) (b)
weight parameter of the model, b ∈ R , b ∈ R
h
1×h
is the bias parameter.
h
1×h
(f ), (b) is used to mark whether the model parameters are forward or backward.
The state variable of the last layer $L$ of a multi-layer bidirectional recurrent network can be output directly as the output of the model, $F_t = H_t^{(L)}$, or passed through an activation function, or fed into further non-recurrent network layers before being output, e.g.

$$F_t = H_t^{(L)} W_{hf} + b_f$$

where $H_t^{(L)}$ is the vector formed by concatenating $\overrightarrow{H}_t^{(L)}$ and $\overleftarrow{H}_t^{(L)}$.
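To make the two recurrences concrete, here is a small plain-numpy sketch (with made-up dimensions; the names Wxh_f, Whh_b, etc. simply mirror the symbols above and are not from the book's library) that computes the forward and backward state sequences and concatenates them per time step:
import numpy as np

def bidirectional_states(X, Wxh_f, Whh_f, bh_f, Wxh_b, Whh_b, bh_b, phi=np.tanh):
    # X: (T, n, d); returns the concatenated states (T, n, 2h) per the formulas above
    T, n, _ = X.shape
    h = Whh_f.shape[0]
    Hf = np.zeros((T, n, h))
    Hb = np.zeros((T, n, h))
    for t in range(T):                        # forward direction: t = 0 .. T-1
        prev = Hf[t-1] if t > 0 else np.zeros((n, h))
        Hf[t] = phi(X[t] @ Wxh_f + prev @ Whh_f + bh_f)
    for t in reversed(range(T)):              # backward direction: t = T-1 .. 0
        nxt = Hb[t+1] if t < T-1 else np.zeros((n, h))
        Hb[t] = phi(X[t] @ Wxh_b + nxt @ Whh_b + bh_b)
    return np.concatenate([Hf, Hb], axis=2)

T, n, d, h = 4, 2, 3, 5
rng = np.random.default_rng(0)
H = bidirectional_states(rng.standard_normal((T, n, d)),
                         rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h),
                         rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h))
print(H.shape)  # (4, 2, 10): the forward and backward states spliced per time step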
As in the previous section, a bidirectional recurrent neural network can be built directly from neural network units; alternatively, a single bidirectional recurrent layer can be encapsulated in its own class, and these single-layer bidirectional layers can then be stacked to construct a (multi-layer) bidirectional recurrent neural network.
        output = []
        zs = []
        hs = []
        steps = range(seq_len - 1, -1, -1) if self.reverse else range(seq_len)
        for t in steps:
            h = self.cell(input[t], h)
            if isinstance(h, tuple):
                h, z = h[0], h[1]
                if mode == 'LSTM' or mode == 'GRU':
                    zs.append(z)
            hs.append(h)
        self.hs = np.array(hs)
        output = [h[0] if isinstance(h, tuple) else h for h in self.hs]
        if mode == 'LSTM' or mode == 'GRU':
            self.zs = np.array(zs)
        return np.array(output), h
        zs = self.zs if self.zs is not None else hs
        if len(dhs) == len(hs):  # (seq, batch, hidden)
            dinput = [None for i in range(seq_len)]
            steps = range(seq_len) if self.reverse else range(seq_len - 1, -1, -1)
            t0 = seq_len - 1 if self.reverse else 0
            dh = np.zeros_like(dhs[0])  # gradient flowing in from the next time step
            for t in steps:
                dh += dhs[t]  # gradient from the next time step + gradient at the current time step
                h_pre = self.h if t == t0 else hs[t-1]
                dx, dh, _ = cell.backward(dh, zs[t], input[t], h_pre)
                dinput[t] = dx
            return dinput
#test_LSTM = "LSTM"
test_LSTM = "GRU"
reverse = True
np.random.seed(1)
seq_len, batch_size, input_size, hidden_size = 5, 3, 4, 6
if test_LSTM == "RNN_TANH":
    rnn_ = RNNLayer("RNN_TANH", input_size, hidden_size, reverse=reverse)
elif test_LSTM == "GRU":
    rnn_ = RNNLayer('GRU', input_size, hidden_size, reverse=reverse)
else:
    rnn_ = RNNLayer('LSTM', input_size, hidden_size, reverse=reverse)
input = np.random.randn(seq_len, batch_size, input_size)
if reverse:
    input = input[::-1]
h0 = np.random.randn(batch_size, hidden_size)
c0 = np.random.randn(batch_size, hidden_size)
if test_LSTM == "LSTM":
    output, hn = rnn_(input, (h0, c0))
else:
    output, hn = rnn_(input, h0)
print("output", output)
print("hn", hn)
#------test backward---
do = np.random.randn(*output.shape)
dinput = rnn_.backward(do, input)
print("dinput:", dinput)
        if False:
            if mode == 'LSTM':
                gate_size = 4 * hidden_size
            elif mode == 'GRU':
                gate_size = 3 * hidden_size
            elif mode == 'RNN_TANH':
                gate_size = hidden_size
            elif mode == 'RNN_RELU':
                gate_size = hidden_size
            else:
                raise ValueError("Unrecognized RNN mode: " + mode)
        self.layers = []
        self.params = []
        self.grads = []
        self._all_weights = []
        for layer in range(num_layers):
            layer_input_size = input_size if layer == 0 else hidden_size
            for direction in range(num_directions):
                if direction == 0:
                    rnnlayer = RNNLayer(mode, layer_input_size, hidden_size, reverse=False)
                else:
                    rnnlayer = RNNLayer(mode, layer_input_size, hidden_size, reverse=True)
                self.layers.append(rnnlayer)
                self.params += rnnlayer.cell.params
                self.grads += rnnlayer.cell.grads

    def init_hidden(self, batch_size):
        num_layers, num_directions = self.num_layers, self.num_directions
        self.h0 = []
        for layer in self.layers:
            h0 = layer.init_hidden(batch_size)
            self.h0.append(h0)
        return self.h0

    def backward(self, dhs, input):
        # (the body of backward is elided in this excerpt)
        return dhs
import numpy as np
np.random.seed(1)
reverse = False
num_layers = 2
seq_len, batch_size, input_size, hidden_size = 5, 3, 4, 6
input = np.random.randn(seq_len, batch_size, input_size)
test_LSTM = 'GRU'
if test_LSTM == "RNN_TANH":
    rnn = RNNBase_("RNN_TANH", input_size, hidden_size, num_layers)
elif test_LSTM == "GRU":
    rnn = RNNBase_('GRU', input_size, hidden_size, num_layers)
else:
    rnn = RNNBase_('LSTM', input_size, hidden_size, num_layers)
do = np.random.randn(*output.shape)
dinput = rnn.backward(do, input)
print("dinput:", dinput)
Both the encoder and decoder use a recurrent neural network (RNN) to process sequence inputs and outputs of varying lengths. The input sequence is fed to the encoder to generate a state variable, also called the context variable (context vector); the decoder takes this context variable as its initial state variable and from it generates an output sequence. For example, in machine translation, the encoder takes input sentences (sequences of words) in one language and the decoder outputs sentences (sequences of words) in another language. The Seq2Seq model was quickly extended to other problems similar to machine translation, such as dialogue, image captioning, text summarization, and couplet generation.
Machine translation
Machine translation is the conversion (translation) of sentences (sequences of words) in one language into sentences (sequences of words) in another language. This sequence-to-sequence conversion problem can be modeled with a Seq2Seq model composed of an encoder and a decoder, as shown in Figure 7-50. The encoder accepts a sentence (i.e., a sequence of words) in a certain language; the sequence can be arbitrarily long, and the encoder RNN processes each word of the input sequence in turn until it encounters the end word. The encoder outputs a context vector that encodes the input sentence. This context vector can be the output at the last moment (such as the final hidden state) or the outputs at all moments (such as the hidden states at every time step).
The decoder takes this encoded context and a special start word, and produces a sequence of words in turn until it produces a special end word. The start word and end word of the decoder are artificially chosen tokens, for example the letter sequences "SOS" and "EOS" as the start word and end word, respectively. In machine translation, such special start and end words are usually added to both the input sentence and the translated sentence.
In the training phase, the encoder and decoder are trained on the error loss between the predicted word sequence and the target word sequence. In the inference phase, the decoder predicts (or samples) the next word from the current word at each step until it has produced the final output word sequence.
The simplest encoder is a recurrent neural network that reads a sequence of data (such as a sequence of words) and computes contextual information representing the content of the sequence. For the simplest encoders, this contextual information is the hidden state at the last time step. At each moment, the encoder accepts the input data and the hidden state from the previous moment, and computes the hidden state at the current moment as the current output. The calculation process is shown in the left subfigure of Figure 7-51:
Figure 7-51 Left: the encoder accepts the one-hot vector of the input word at the current moment and the hidden state from the previous moment, and computes the hidden state at the current moment as the current output. Right: the decoder accepts the one-hot vector of the input word at the current moment and the hidden state from the previous moment, computes the hidden state at the current moment, and passes this hidden state through a linear layer to output a vector holding, for each word in the word list, its score as the next word
    def word2vec(self, word_indices_input):
        return one_hot(self.input_size, word_indices_input, True)

    def initHidden(self, batch_size=1):
        return np.zeros((self.num_layers, batch_size, self.hidden_size))

    def parameters(self):
        return self.gru.parameters()

    def backward(self, dhs):
        dinput, dhidden = self.gru.backward(dhs, self.encode_input)
The simplest decoder is a recurrent neural network plus an output layer. The decoder accepts the one-hot vector of the input word at the current moment and the hidden state from the previous moment, and computes the hidden state at the current moment. This hidden state is passed through a linear layer to output a vector holding the score of each word in the word list; the calculation process is shown in the right subfigure of Figure 7-51.
        self.gru = GRU(input_size, hidden_size, num_layers)
        self.out = Dense(hidden_size, output_size)
        self.layers = [self.gru, self.out]
        self._params = None

    def initHidden(self, batch_size=1):
        self.h_0 = np.zeros((self.num_layers, batch_size, self.hidden_size))

    def word2vec(self, input_t):
        return one_hot(self.input_size, input_t, True)

    def forward(self, input_tensor, hidden):
        teacher_forcing_ratio = self.teacher_forcing_ratio
        use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
        self.input = []
        output_hs = []
        output = []
        hidden_t = hidden
        h_0 = hidden.copy()
        input_t = np.array([SOS_token])
        hs = []
        zs = []
        target_length = input_tensor.shape[0]
        for t in range(target_length):
            output_t, hidden_t, output_hs_t = self.forward_step(input_t, hidden_t)
            # Save the calculation results at each moment
            hs.append(self.gru.hs)  # hidden states
            zs.append(self.gru.zs)  # intermediate variables
            output_hs.append(output_hs_t)
            output.append(output_t)
            if use_teacher_forcing:
                input_t = input_tensor[t]  # teacher forcing
            else:
                input_t = np.argmax(output_t)  # word with maximum probability
                if input_t == EOS_token:
                    break
                input_t = np.array([input_t])
        output = np.array(output)
        self.output_hs = np.array(output_hs)
        self.h_0 = h_0
        self.hs = np.concatenate(hs, axis=1)
        self.zs = np.concatenate(zs, axis=1)
        return output
        return decoded_word_indices
    def backward(self, dZs):
        dhs = []
        output_hs = self.output_hs
        input = np.concatenate(self.input, axis=0)
        for i in range(len(input)):
            self.out.x = output_hs[i]
            dh = self.out.backward(dZs[i])
            dhs.append(dh)
        dhs = np.array(dhs)
        self.gru.hs = self.hs
        self.gru.zs = self.zs
        self.gru.h = self.h_0
        dinput, dhidden = self.gru.backward(dhs, input)
        return dinput, dhidden

    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
DecoderRNN contains a GRU recurrent neural network self.gru; the output of self.gru passes through the linear output layer self.out, which produces, for each word in the word list, its score as the next word. Since the word at each moment is fed into self.gru as a one-hot vector, and the output of self.out is likewise a vector of the same length as the word list holding each word's score, the lengths of the input vector of self.gru and the output vector of self.out both equal the length of the word list.
The forward() method accepts the input word sequence input_tensor, starts from the special start-word index SOS_token, processes each input word input_t in turn, and saves the intermediate state of the GRU computation, such as self.gru.hs and self.gru.zs, because the reverse derivation at each moment depends on these intermediate variables.
At each moment the word input_t is fed in and a prediction vector output_t is produced. The input_t at the next moment can be either the word with the highest score in output_t, or the corresponding word from the output sentence of the training sample. If the flag use_teacher_forcing is True, input_t uses the word from the training sample's output sentence; otherwise it uses the word with the highest predicted score. Using the word from the training sample's output sentence as the next input is called "teacher forcing".
For example, suppose the target sequence of the decoder is 'hello'. The input at the initial moment is the special token 'SOS', and its target output should be the character 'h', but the probability of 'h' in the output vector at the initial moment may not be the largest. If 'o' is the predicted character with the highest probability, then without teacher forcing this 'o' is used as the input at the next moment; with teacher forcing, the predicted 'o' is discarded and the actual target output 'h' is used as the next input instead.
Teacher forcing leads to faster convergence, but the network may over-learn the information in the training samples, resulting in poor generalization, i.e., unstable behavior at prediction time. Therefore, teacher forcing can be enabled randomly, for example with a 50% chance on each sequence.
evaluate() uses the trained decoder for prediction. It accepts the context vector hidden output by the encoder and the maximum number of output words max_length. Its process is similar to the forward() function, but because it is used for prediction, only the single start token 'SOS' is fed in at the initial moment. It therefore works without teacher forcing, i.e., it always feeds the word with the highest prediction score back in as the next input (alternatively, to generate more varied output, the next word can be sampled according to the probabilities derived from the scores). Starting from the context vector output by the encoder and the start word 'SOS' at the initial moment, words are produced one after another until the end token 'EOS' is encountered or the number of words (characters) reaches max_length. The final output is a vector of word-list indices for all produced words.
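The body of evaluate() is not reproduced in this excerpt; a minimal sketch of its greedy decoding loop, assuming the same forward_step() helper that forward() above uses, could look like this:
    def evaluate(self, hidden, max_length):
        # greedy decoding: always feed the highest-scoring word back as the next input
        input_t = np.array([SOS_token])
        hidden_t = hidden
        decoded_word_indices = []
        for t in range(max_length):
            output_t, hidden_t, _ = self.forward_step(input_t, hidden_t)
            idx = np.argmax(output_t)        # word with the largest score
            if idx == EOS_token:
                break
            decoded_word_indices.append(idx)
            input_t = np.array([idx])
        return decoded_word_indices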
The backward() method accepts the gradient dZs of the loss function with respect to the output-layer outputs. It first computes, for each moment, the gradient of the output layer with respect to the hidden state at that moment, and then uses the hidden-state gradients dhs at all moments, together with the GRU's input, to run the reverse derivation through the GRU recurrent network.
The parameters() functions of the encoder and decoder return all of their model parameters, which are used to construct the optimizer objects.
The following function train_step() accepts a pair of input and output sequences input_tensor and target_tensor, the encoder and decoder together with their optimizers (encoder, decoder, encoder_optimizer, decoder_optimizer), the function loss_fn for computing the model loss, and the regularization coefficient reg.
train_step() performs one training update of the model parameters. It first computes the encoder outputs encoder_output and encoder_hidden from input_tensor and, depending on the last_hidden flag, feeds either the last-moment hidden state encoder_hidden or the full encoder_output into the decoder, which together with target_tensor produces the decoder's final predicted output. It then computes the cross-entropy loss and the gradient grad of the loss with respect to output from the predicted output and the target, runs decoder.backward(grad) to back-propagate through the decoder (whose output is the gradient dhidden with respect to the encoder's output encoder_hidden), and continues back-propagating through the encoder with this gradient. Finally, the model parameters are updated. Before the update, clip_grad_norm_nn can be used to clip the gradients to prevent gradient explosion.
    loss = 0
    encode_input = input_tensor
    encoder_output, encoder_hidden = encoder(encode_input, None)
    if last_hidden:
        output = decoder(target_tensor, encoder_hidden)
    else:
        output = decoder(target_tensor, encoder_output)
    target = target_tensor.reshape(-1, 1)
    if output.shape[0] != target.shape[0]:
        target = target[:output.shape[0], :]
    loss, grad = loss_fn(output, target)
    loss /= output.shape[0]
    if last_hidden:
        dinput, dhidden = decoder.backward(grad)
        encoder.backward(dhidden[0])
    else:
        dinput, d_encoder_outputs = decoder.backward(grad)
        encoder.backward(d_encoder_outputs)
    util.clip_grad_norm_nn(encoder_optimizer.parameters(), clip, None)
    util.clip_grad_norm_nn(decoder_optimizer.parameters(), clip, None)
    encoder_optimizer.step()
    decoder_optimizer.step()
    return loss
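clip_grad_norm_nn itself is not shown in this excerpt. A standard global-norm clipping routine over the [param, grad] pairs returned by parameters() might look like the following sketch (the signature is assumed from the call above and is not necessarily the book's exact implementation):
def clip_grad_norm_nn(params, max_norm, norm_type=None):
    # global L2 norm over all gradients; rescale them all if it exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(grad ** 2) for _, grad in params))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for _, grad in params:
            grad *= scale  # in place, so the optimizer sees the clipped values
    return total_norm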
The function trainIters() iteratively calls train_step() to update the model parameters. During the iteration it can report intermediate training results, such as the training error and the validation error.
import numpy as np
import time
import math
import matplotlib.pyplot as plt
%matplotlib inline

def timeSince(start):
    now = time.time()
    s = now - start
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

training_pairs = train_pairs
loss_fn = util.rnn_loss_grad

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
            plt.plot(plot_losses)
            valid_losses.append(validation_loss(encoder, decoder, valid_pairs,
                                                encoder_output_all, 20, reg))
            plt.plot(valid_losses)
            plt.legend(["train_losses", "valid_losses"])
            plt.show()
        target = target_tensor.reshape(-1, 1)
        if output.shape[0] != target.shape[0]:
            target = target[:output.shape[0], :]
        loss, grad = loss_fn(output, target)
        loss /= output.shape[0]
        total_loss += loss
    decoder.teacher_forcing_ratio = teacher_forcing_ratio
    return total_loss / len(valid_pairs)
Here validation_loss() uses the model being trained to compute the validation error. It takes a small random subset of the validation set, generates output through the encoder and decoder, and computes the loss of the decoder's predicted output in the same way as during training.
The paired bilingual sentences used below can be downloaded from https://www.manythings.org/anki/.
Like the earlier RNN text generation, the input and output sentences of machine translation can be treated as sequences of words or sequences of characters. As long as a character table is established for all the characters of a language, each character in a sentence can be converted into a one-hot vector. Figure 7-52 shows a Seq2Seq model that treats sentences as character sequences.
Figure 7-52 The character-level Seq2Seq model for machine translation; the input at each moment is a character
class ChVerb:
    def __init__(self, name):
        self.name = name

import numpy as np
import random
import re
import unicodedata
random.seed(1)

def unicodeToAscii(sentence):
    return ''.join(
        c for c in unicodedata.normalize('NFD', sentence)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_sentence(sentence):
    sentence = unicodeToAscii(sentence.lower().strip())
    sentence = re.sub(r"([.!?])", r" \1", sentence)
    sentence = re.sub(r"[^a-zA-Z.!?]+", r" ", sentence)
    return sentence
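For example, running the helper on an accented French sentence strips the diacritics, lower-cases the text, and pads the punctuation with a space:
print(normalize_sentence("Ça va très bien!"))   # -> 'ca va tres bien !'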
normalize_sentence() preprocesses the characters of a sentence: it converts Unicode to ASCII, converts uppercase characters to lowercase, and removes non-alphabetic characters. The sentence pairs read in are then filtered, for example by limiting the sentence length:
MAX_LENGTH = 20
def filterPair(p):
    return len(p[0]) < MAX_LENGTH and \
        len(p[1]) < MAX_LENGTH
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
Using the read and filtered sentence pairs as training samples, first construct the character word lists of the two languages:
def prepareCharPairs(lang2lang_file, reverse=False):
    pairs = readLangs(lang2lang_file, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    for pair in pairs:
        in_verb.addChars(pair[0])
        out_verb.addChars(pair[1])
    return in_verb, out_verb, pairs

lang2lang_file = './data/eng-fra.txt'
in_verb = ChVerb("fra")
out_verb = ChVerb("eng")
in_verb, out_verb, pairs = prepareCharPairs(lang2lang_file, True)
Reading lines...
Read 170651 sentence pairs
Trimmed to 9194 sentence pairs
Read 9194 sentence pairs
Counted chars:
fra 32
eng 32
['tom a dit bonjour .', 'tom said hi .']
['je suis creve .', 'i am tired .']
['prends une douche !', 'take a shower .']
['je suis detendu .', 'i m relaxed .']
['tu es endurant .', 'you re resilient .']
['cours !', 'run !']
The following code converts the character words of these training-sample sentences to and from word-list indices:
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(in_verb, pair[0])
    target_tensor = tensorFromSentence(out_verb, pair[1])
    return (input_tensor, target_tensor)

print(pairs[3])
en_input, de_target = tensorsFromPair(pairs[3])  # random.choice(pairs)
print(en_input.shape)
print(de_target.shape)
print(en_input)
print(de_target)
hidden_size = 50  # 256
num_layers = 1
clip = 5.  # 50.
learning_rate = 0.1
decoder_learning_ratio = 1.0
teacher_forcing_ratio = 0.5
momentum = 0.5
decay_every = 1000
encoder_optimizer = SGD(encoder.parameters(), learning_rate, momentum, decay_every)
decoder_optimizer = SGD(decoder.parameters(), learning_rate*decoder_learning_ratio,
                        momentum, decay_every)
reg = None  # 1e-2
if True:
    pairs = pairs[:80000]
np.random.shuffle(pairs)
train_n = (int)(len(pairs)*0.98)
train_pairs = pairs[:train_n]
valid_pairs = pairs[train_n:]
n_iters = 50000
print_every, plot_every = 100, 100
idx_train_pairs = [tensorsFromPair(random.choice(train_pairs)) for i in range(n_iters)]
idx_valid_pairs = [tensorsFromPair(pair) for pair in valid_pairs]
trainIters(encoder, decoder, encoder_optimizer, decoder_optimizer, idx_train_pairs,
           idx_valid_pairs, True, print_every, plot_every, reg)
Figure 8-54 Loss curve of the Seq2Seq model for character sequences
The separation between the training loss curve and the validation loss curve shows that training is unstable; the loss curve flattens out and rises slightly after 40,000 iterations.
The trained model can be used for language translation: the word sequence (sentence) of the language to be translated is fed to the encoder, which produces context information, which is in turn fed to the decoder to produce the translated word sequence (sentence).
def evaluate(encoder, decoder, in_vocab, out_vocab, sentence,
             max_length=MAX_LENGTH, last_Hidden=True):
    encode_input = tensorFromSentence(in_vocab, sentence)
    encoder_output, encoder_hidden = encoder(encode_input, None)
    if last_Hidden:
        output_sentence = decoder.evaluate(encoder_hidden, max_length)
    else:
        output_sentence = decoder.evaluate(encoder_output, max_length)
    output_sentence = indexToSentence(out_vocab, output_sentence)
    return output_sentence
Here last_Hidden indicates whether the input of the decoder is the encoder's output (hidden vector) at the last moment or the outputs at all moments. Randomly select several input sentences and use evaluate() to predict their translations:
indices = np.random.randint(len(pairs), size=3)
for i in indices:
    pair = pairs[i]
    print(pair)
    sentence = pair[0]
    sentence = evaluate(encoder, decoder, in_verb, out_verb, sentence, MAX_LENGTH)
    print(sentence)
Judging from the results, the prediction quality is not ideal. When the encoder's last-moment output is used as the context vector that passes information from encoder to decoder, this single vector bears the burden of encoding the entire sentence and may not hold complete information. If the outputs at all moments are used as the context, the information is more complete, but the variable-length encoder outputs cannot be used directly as the decoder's input at each moment.
The attention mechanism introduced later lets the decoder network "focus" on different parts of the encoder output at each step of its own output. This handles the variable-length encoder output and avoids inflating the decoder's hidden state vector.
- Space is wasted: the vector representing a word is very large, with only one component equal to 1 and all the others 0.
- It cannot express relationships between words, such as synonymy or correlation; the words of a language are not independent of each other, and there is often some correlation between them.
In natural language processing, word vectorization methods better than one-hot are used; these methods are collectively called Word2Vec (word vectorization). Word vectorization can be viewed as mapping words from the space of the word list (the one-hot vector space) to a low-dimensional space, much as an autoencoder maps a high-dimensional vector to a low-dimensional one. Word vectorization is likewise a model trained on a corpus (such as a body of text) with supervised machine learning methods, but because it requires no manual labeling of words and instead samples its own supervised training pairs, some people also call it unsupervised learning.
These two methods (CBOW and Skip-gram) learn the word vector of a word through a 2-layer neural network similar to an autoencoder, i.e., they map a high-dimensional one-hot vector to a low-dimensional hidden vector. As shown in Figure 8-55, if the length of the word list is $V$, i.e., there are $V$ different words, then the one-hot vector $x$ of a word is a vector of length $V$. After passing through the encoder's $V \times N$ weight matrix $W_{V\times N}$ (where $N$ is usually an integer much smaller than $V$), the weighted sum of $x$ produces a low-dimensional hidden vector $h_N = x W_{V\times N}$; this hidden vector $h_N$ is the vectorized representation of the word with index $k$.
Figure 8-55 Both CBOW and Skip-gram word vectorization use a 2-layer weighted-sum neural network to learn the vectorized representation of words
Because $x$ is a row vector with only the $k$-th component equal to 1 and all others 0, the product $x W_{V\times N}$ is exactly the $k$-th row of the matrix, so no multiplication is actually needed: just take out the $k$-th row of the matrix.
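A two-line numpy check of this shortcut (with illustrative sizes only):
import numpy as np
V, N, k = 6, 3, 2
W = np.random.randn(V, N)
x = np.zeros(V); x[k] = 1            # one-hot row vector of word k
print(np.allclose(x @ W, W[k]))      # True: the product is just row k of W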
In order to obtain a suitable latent vector representation of words that reflects the relationships between words (such as synonymy), this weight matrix must be trained like an autoencoder. As in an autoencoder, the hidden vector is converted through a matrix $W_{N\times V}$ into an output vector of the same length as the word list. Each component $p_i$ of this output vector represents the score of the $i$-th word, and the scores can be converted into probabilities by the softmax function. To train this neural network model, CBOW and Skip-gram use different methods to generate training samples from a corpus composed of many sentences.
Both the encoder and the decoder are fully connected layers without bias or activation function, i.e., each is just a weight matrix. This 2-layer neural network is therefore 2 weight matrices, which can be represented by the simplified Figure 8-56:
Figure 8-56 Simplified word vectorization neural network; the encoder and decoder are each just a weight matrix
Its working process is similar to that of an autoencoder. The first fully connected linear layer is a bias-free weight matrix $W_1$, which converts the one-hot vector $x$ of an input word into a low-dimensional embedded representation $h = x W_1$; this $h$ is the vectorized representation of the word $x$. In order to train this weight matrix $W_1$, $h$ is passed through another fully connected linear layer without bias or activation, namely the weight matrix $W_2$, which outputs a vector of scores for all words, $f = h W_2$. During training, this $f$ is compared with the target word to obtain a loss error, and the model parameters are updated by back-propagating this loss error.
As shown in Figure 8-57, for a word in a sentence, denoted $w_t$ (also called the center word or target word), CBOW uses its context, i.e., the surrounding words, as input. For example, with a context window of $C = 5$, the input consists of the words at positions $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$, i.e., the two words before and the two words after the center word $w_t$. Given the context words $(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})$ of a center word $w_t$ as input, the word $w_p$ that CBOW is expected to predict is the target word $w_t$. The network is trained by computing the cross-entropy loss between the predicted word $w_p$ and the target word $w_t$.
Figure 8-57 CBOW takes the context of a word (the surrounding words) as the input of the encoder; among the vocabulary scores output by the decoder, the target word should have the highest score
Contrary to CBOW, which uses the context of a word in a sentence to predict the word itself, Skip-gram uses the word $w_t$ to predict its context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$, as shown in Figure 8-58. That is, from the one-hot vector of the input word $w_t$, the hidden vector $h_N$ is obtained through the encoder, and the decoder outputs a vector of the same length as the word list holding the score of each word. The scores can again be converted into probabilities through a softmax function, the context words serve as the target words for the cross-entropy loss, and the encoder and decoder are trained accordingly.
Figure 8-58 Skip-gram takes a word as the input of the encoder; among the vocabulary scores output by the decoder, the context words of this word should have the highest scores, i.e., the context words are the target words
Both CBOW and Skip-gram use the sentences of the corpus to generate training samples. For Skip-gram, each word of a sentence serves in turn as the center word, and each of its context words serves as a target word, forming one training sample.
For example, for the sentence "Seq2Seq is a general purpose encoder decoder framework", the context words of the first word "Seq2Seq" are "is" and "a", giving the 2 training samples (Seq2Seq, is) and (Seq2Seq, a), as shown in Figure 8-59. Similarly, for the second word "is", the context words are "Seq2Seq", "a", and "general", giving the three samples (is, Seq2Seq), (is, a), and (is, general). By analogy, the last word "framework" yields the samples (framework, encoder) and (framework, decoder).
Figure 8-59 The training samples that Skip-gram generates for the sentence "Seq2Seq is a general purpose encoder decoder framework"
CBOW and Skip-gram each have advantages and disadvantages: CBOW trains faster and represents frequent words well, while Skip-gram works well even with smaller corpora and represents rare words better.
The following code uses Skip-gram as an example to show how to implement the training process of this model. For Skip-gram, the input is the current center word and the targets are its context words; because there are multiple context words, i.e., multiple targets, each target word contributes one cross-entropy loss computed against the score vector $f = h W_2$.
Similarly, define a word list representing all the words of a language, which can be constructed from the sentences of a corpus:
class Vocab:
    def __init__(self, corpus):
        wordset = set()
        for sentence in corpus:
            if isinstance(sentence, str):
                for word in sentence.split(' '):
                    wordset.add(word)
            else:
                for word in sentence:
                    wordset.add(word)
        wordlist = list(wordset)
        self.word2index = dict([(word, i) for i, word in enumerate(wordset)])
        self.index2word = dict([(i, word) for i, word in enumerate(wordset)])
        self.n_words = len(wordset)
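For example, a word list built from a one-sentence corpus contains each distinct word once:
vocab = Vocab([["i", "am", "from", "china"]])
print(vocab.n_words)   # 4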
All the sentences of a corpus file can be read to build a word list, and the training samples for the Word2Vec model can be generated from the word list and the corpus sentences. The function generate_training_data() samples the training pairs for the Word2Vec model from the word list vocab, the corpus corpus, and the sampling window size window.
def generate_training_data(vocab, corpus, window=2):
    training_data = []
    for sentence in corpus:  # for each sentence
        sent_len = len(sentence)
        for i, word in enumerate(sentence):  # for each word in the sentence
            w_target = vocab.word2index[sentence[i]]
            w_context = []
            for j in range(i - window, i + window + 1):
                if j != i and j <= sent_len - 1 and j >= 0:
                    w_context.append(vocab.word2index[sentence[j]])
            training_data.append([w_target, w_context])
    return np.array(training_data)
corpus = [["i","am","from","china"]]
generate_training_data(vocab,corpus)
A Word2Vec model can be trained on the basis of the word list and corpus; the following class Word2Vec is its code implementation:
class Word2Vec():
    def __init__(self, corpus, hidden_n, window, learning_rate=0.01, epochs=5000):
        self.hidden_n = hidden_n
        self.window = window
        self.lr = learning_rate
        self.epochs = epochs
        self.vocab = Vocab(corpus)
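The rest of the class is not reproduced in this excerpt; a compact sketch of a Skip-gram training loop consistent with the constructor above, assuming W1 ($V \times N$) and W2 ($N \times V$) as the encoder and decoder matrices and one cross-entropy term per context word, might look like this (a sketch, not necessarily the book's exact implementation):
    def train(self, training_data):
        V, N = self.vocab.n_words, self.hidden_n
        self.W1 = 0.1 * np.random.randn(V, N)        # encoder matrix (assumed name)
        self.W2 = 0.1 * np.random.randn(N, V)        # decoder matrix (assumed name)
        for epoch in range(self.epochs):
            loss = 0.0
            for w_t, w_c in training_data:
                h = self.W1[w_t]                     # one-hot lookup: h = x W1
                f = h @ self.W2                      # scores over the word list
                p = np.exp(f - f.max()); p /= p.sum()  # softmax
                dE = len(w_c) * p                    # summed softmax gradients
                for c in w_c:
                    loss -= np.log(p[c])
                    dE[c] -= 1
                dW2 = np.outer(h, dE)                # dL/dW2
                dh = self.W2 @ dE                    # dL/dh
                self.W2 -= self.lr * dW2
                self.W1[w_t] -= self.lr * dh         # only row w_t is updated
            if epoch % 1000 == 0:
                print(epoch, loss)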
hidden_n = 5
window_size = 2
min_count = 0 # minimum word count
epochs = 5000 # number of training epochs
learning_rate = 0.01 # learning rate
np.random.seed(0) # set the seed for reproducibility
corpus = ["Neural Machine Translation using word level seq2seq model".split(' ')]
MAX_LENGTH = 10
eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)
def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
lang2lang_file = './data/eng-fra.txt'
pairs = read_pairs(lang2lang_file,True)
print(random.choice(pairs))
Reading lines...
Read 170651 sentence pairs
Trimmed to 12761 sentence pairs
['je ne le vendrai pas .', 'i m not going to sell it .']
From the previously read machine-translation corpus, i.e., the paired sentences, the following code builds the sentence corpora of the input and output languages for training:
if True:
    pairs = pairs[:80000]
in_corpus = []
out_corpus = []
for pair in pairs:
    in_corpus.append(pair[0].split(' '))
    out_corpus.append(pair[1].split(' '))
print(in_corpus[:2])
print(out_corpus[:2])
hidden_n = 150
window_size = 2
min_count = 0 # minimum word count
epochs = 1 # number of training epochs
learning_rate = 0.01 # learning rate
np.random.seed(0) # set the seed for reproducibility
Training this model takes a very long time. Consider using an existing Word2Vec training library such as the multi-threaded gensim, whose use of low-level Fortran/C linear algebra libraries can yield a speedup of hundreds of times. Install it with:
pip install --upgrade gensim
For example, the following builds a Word2Vec model from a small corpus test_corpus of 2 sentences. gensim.models.Word2Vec() constructs the model, and the vectorized representation of a word can then be looked up as model.wv['am']:
import gensim
hidden_n = 8
model = gensim.models.Word2Vec(test_corpus, size=hidden_n, window=2, min_count=1,
                               workers=10, iter=10)  # note: in gensim >= 4.0 these are vector_size= and epochs=
print('am:', model.wv['am'])
The following code uses gensim to train the Word2Vec models in_vocab and out_vocab for the input and output languages:
import gensim
hidden_n = 150
window_size = 2
in_vocab = gensim.models.Word2Vec(in_corpus, size=hidden_n, window=window_size,
                                  min_count=1, workers=10, iter=10)
out_vocab = gensim.models.Word2Vec(out_corpus, size=hidden_n, window=window_size,
                                   min_count=1, workers=10, iter=10)
Because the trained Word2Vec models do not contain the special tokens "SOS", "EOS", and "UNK", these three special words are added to the word list to obtain an extended word list, and the word-vector length becomes hidden_n + 3. For these 3 special words, their vector representations can simply be random vectors:
import numpy as np
SEU_count = 3
in_SEU = np.random.rand(3, hidden_n + SEU_count)
out_SEU = np.random.rand(3, hidden_n + SEU_count)
The code below defines some helper functions for obtaining index sequences and word-vector representations from a literal sentence. indexesFromSentence() converts the words of a sentence into word-list indices. Because gensim's model word list does not contain the 3 special tokens, for an ordinary word the gensim index vocab.wv.vocab[word].index must be offset by SEU_count = 3 to obtain its index in the extended word list.
vocab_word2vec() obtains the vector representation for each word index idx of the extended word list (which includes the special tokens) from gensim's Word2Vec model. For ordinary words, the index must likewise be shifted back to the gensim word-list index, vocab.wv.index2word[idx - SEU_count]; for the special tokens, their vector representations are taken directly as SEU[idx].
SOS_token = 0
EOS_token = 1
UNK_token = 2
def tensorFromSentence(vocab, sentence):
    indexes = indexesFromSentence(vocab, sentence)
    indexes.append(EOS_token)
    return np.array(indexes).reshape(-1, 1)
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(in_vocab, pair[0])
    target_tensor = tensorFromSentence(out_vocab, pair[1])
    return (input_tensor, target_tensor)
tensorFromSentence() and tensorsFromPair() convert a sentence or a pair of sentences from strings to index sequences; as before, an end token is appended to each sentence. indexToSentence() converts a sentence from a sequence of word indices back into a string.
To replace the one-hot vectors with Word2Vec, the encoder and decoder code must be modified; derived classes can be defined:
class EncoderRNN_w2v(EncoderRNN):
    def __init__(self, input_size, hidden_size, vocab, num_layers=1):
        super(EncoderRNN_w2v, self).__init__(input_size, hidden_size, num_layers)
        self.vocab = vocab
    def word2vec(self, word_indices_input):
        return vocab_word2vec(self.vocab, word_indices_input, in_SEU, True)

class DecoderRNN_w2v(DecoderRNN):  # class name assumed; the excerpt only shows the overridden method
    def word2vec(self, word_indices_input):
        return vocab_word2vec(self.vocab, word_indices_input, out_SEU, True)
hidden_size = 256
num_layers = 1
clip = 5.  # 50.
learning_rate = 0.1
decoder_learning_ratio = 1.0
teacher_forcing_ratio = 0.5
n_iters = 70000
print_every, plot_every = 100, 100
momentum = 0.3
decay_every = 1000
encoder_optimizer = SGD(encoder.parameters(), learning_rate, momentum, decay_every)
decoder_optimizer = SGD(decoder.parameters(), learning_rate*decoder_learning_ratio,
                        momentum, decay_every)
reg = None  # 1e-2
np.random.shuffle(pairs)
train_n = (int)(len(pairs)*0.98)
train_pairs = pairs[:train_n]
valid_pairs = pairs[train_n:]
n_iters = 40000
idx_train_pairs = [tensorsFromPair(random.choice(train_pairs)) for i in range(n_iters)]
idx_valid_pairs = [tensorsFromPair(pair) for pair in valid_pairs]
trainIters(encoder, decoder, encoder_optimizer, decoder_optimizer, idx_train_pairs,
           idx_valid_pairs, True, print_every, plot_every, reg)
output:
Use the trained model for translation prediction:
From the results, the word-level Seq2Seq model predicts better than the character-level Seq2Seq model. Readers can increase the number of training iterations and tune the hyperparameters to obtain more satisfactory results.
Word embedding (Embedding) refers to building word vectorization into the model for a specific problem, i.e., adding an embedding layer in front of the problem's network model. The parameter of this embedding layer is the word-vectorization matrix, which maps a word index to its word vector; the matrix is initialized randomly and learned during model training. That is, the word vectorization and the problem-specific model are trained together.
The embedding layer is a fully connected linear layer without activation function or bias, i.e., a simplified linear layer. The code is as follows:
class Embedding():
    def __init__(self, num_embeddings, embedding_dim, _weight=None):
        super().__init__()
        if _weight is None:
            self.W = np.empty((num_embeddings, embedding_dim))
            self.reset_parameters()
            self.preTrained = False
        else:
            self.W = _weight
            self.preTrained = True
        self.params = [self.W]
        self.grads = [np.zeros_like(self.W)]
    def reset_parameters(self):
        self.W[:] = np.random.randn(*self.W.shape)
    def __call__(self, indices):
        return self.forward(indices)
    def forward(self, indices):
        # look up rows of W: equivalent to one_hot(indices) @ W (the excerpt omits these two methods)
        self.x = indices
        return self.W[indices]
    def backward(self, dout):
        # scatter-add the gradient onto the rows that were looked up
        np.add.at(self.grads[0], self.x, dout)
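A quick shape check of the layer (using the forward lookup sketched above):
emb = Embedding(10, 4)
idx = np.array([1, 3, 3])
out = emb(idx)           # rows 1, 3, 3 of W
print(out.shape)         # (3, 4)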
Figure 7-53 The input word (the one-hot vector corresponding to its index) is transformed by the embedding layer into a low-dimensional numerical vector embedded, which together with the hidden state is used as the input of the recurrent network unit to compute the output and hidden vector
As the figure shows, the embedding output takes the place of the one-hot vector as the unit's data input. For a simple encoder, output and hidden can be the same vector.
class EncoderRNN_Embed(object):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size, self.hidden_size = input_size, hidden_size
        self.embedding = Embedding(input_size, hidden_size)
        self.gru = GRU(hidden_size, hidden_size, 1)

    def forward(self, input, hidden):
        # embed each word and collect the per-moment embeddings
        # (this loop is elided in the original excerpt; sketched per the text below)
        self.embedded_x = []
        self.embedded_out = []
        for t in range(len(input)):
            embedded = self.embedding(input[t])
            self.embedded_x.append(self.embedding.x)
            self.embedded_out.append(embedded.reshape(1, *embedded.shape))
        self.embedded_out = np.concatenate(self.embedded_out, axis=0)
        output, hidden = self.gru(self.embedded_out, hidden)
        return output, hidden

    def initHidden(self):
        return np.zeros((1, 1, self.hidden_size))

    def parameters(self):
        return self.gru.parameters()

    def backward(self, dhs):
        dinput, dhidden = self.gru.backward(dhs, self.embedded_out)
        T = dinput.shape[0]
        for t in range(T):
            dinput_t = dinput[t]
            self.embedding.x = self.embedded_x[t]  # recover the original x
            self.embedding.backward(dinput_t)
Because the weight of the embedding layer is also a learnable model parameter, the reverse derivation must also compute the gradient of the loss function with respect to the embedding-layer weight, i.e., self.embedding.backward(dinput_t). The reverse derivation reduces to a derivation at each moment t, which requires the embedding layer's input self.embedding.x at that moment. Therefore, during the forward computation the embedding input at each moment must be saved, i.e., self.embedded_x.append(self.embedding.x).
As shown in Figure 7-54, the decoder also uses the output vector of the word-embedding layer as the input of the RNN unit:
Figure 7-54 The input word (the one-hot vector corresponding to its index) is transformed by the embedding layer into a low-dimensional numerical vector embedded; the output of the relu activation function and the hidden state together serve as the input of the recurrent network unit to compute the output and hidden vectors
Forward computation and reverse derivation must likewise be performed on the embedding layer; the code is as follows:
class DecoderRNN_Embed(object):
    def __init__(self, hidden_size, output_size, num_layers=1, teacher_forcing_ratio=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = 1
        self.teacher_forcing_ratio = teacher_forcing_ratio

    def initHidden(self, batch_size):
        self.h_0 = np.zeros((self.num_layers, batch_size, self.hidden_size))

    # excerpt from forward_step(): the relu output is saved as the GRU input at this moment
    #     relu_output = output.reshape(1, output.shape[0], -1)
    #     self.input.append(relu_output)  # input of gru

    def forward(self, input_tensor, hidden):
        self.input = []
        target_length = input_tensor.shape[0]
        teacher_forcing_ratio = self.teacher_forcing_ratio
        use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
        output_hs = []
        output = []
        hidden_t = hidden
        h_0 = hidden.copy()
        input_t = np.array([SOS_token])
        hs = []
        zs = []
        self.embedded_x = []
        self.relu_x = []
        for t in range(target_length):
            output_t, hidden_t, output_hs_t = self.forward_step(input_t, hidden_t)
            # save per-moment results (as in DecoderRNN; these appends are elided in the excerpt)
            hs.append(self.gru.hs)
            zs.append(self.gru.zs)
            output_hs.append(output_hs_t)
            output.append(output_t)
            if use_teacher_forcing:
                input_t = input_tensor[t]  # teacher forcing
            else:
                input_t = np.argmax(output_t)  # maximum probability
                if input_t == EOS_token:
                    break
                input_t = np.array([input_t])
        output = np.array(output)
        self.output_hs = np.array(output_hs)
        self.h_0 = h_0
        self.hs = np.concatenate(hs, axis=1)
        self.zs = np.concatenate(zs, axis=1)
        return output

    def backward(self, dZs):
        dhs = []
        output_hs = self.output_hs
        input = np.concatenate(self.input, axis=0)
        for i in range(len(input)):
            self.linear.x = output_hs[i]
            dh = self.linear.backward(dZs[i])
            dhs.append(dh)
        dhs = np.array(dhs)
        self.gru.hs = self.hs
        self.gru.zs = self.zs
        self.gru.h = self.h_0
        dinput, dhidden = self.gru.backward(dhs, input)
        for i in range(len(input)):
            dinput_t = dinput[i]
            d_embeded = self.relu.backward(dinput_t)
            self.embedding.x = self.embedded_x[i]  # recover the original x
            self.embedding.backward(d_embeded)
        return dinput, dhidden

    def backward_dh(self, dZ):
        dh = self.linear.backward(dZ)
        return dh

    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
Again, input and output word lists must be built, along with helper functions that convert between the string form and the indexed form of a sentence. The word-list class Vocab is redefined so that it contains the special start and end words "SOS" and "EOS", and words occurring fewer than min_count times are treated as the unknown word "UNK":
import numpy as np
from collections import defaultdict

SOS_token = 0
EOS_token = 1
UNK_token = 2

class Vocab:
    def __init__(self, min_count=1, corpus=None):
        self.min_count = min_count
        self.word2count = {}
        self.word2index = {"SOS": 0, "EOS": 1, "UNK": 2}
        self.index2word = {0: "SOS", 1: "EOS", 2: "UNK"}
        self.n_words = 3  # count SOS, EOS and UNK
        if corpus is not None:
            for sentence in corpus:
                self.addSentence(sentence)
            self.build()
    def build(self):
        for word in self.word2count:
            if self.word2count[word] < self.min_count:
                self.word2index[word] = UNK_token
            else:
                self.word2index[word] = self.n_words
                self.index2word[self.n_words] = word
                self.n_words += 1
vocab = Vocab()
vocab.addSentence("i am from china")
vocab.build()
print(vocab.word2index["i"])
print(vocab.index2word[4])
3
am
Create the word-list objects in_vocab and out_vocab for the input and output languages:
in_vocab = Vocab()
out_vocab = Vocab()
lang2lang_file = './data/eng-fra.txt'
pairs = read_pairs(lang2lang_file, True)
for pair in pairs:
    in_vocab.addSentence(pair[0])
    out_vocab.addSentence(pair[1])
in_vocab.build()
out_vocab.build()

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(in_vocab, pair[0])
    target_tensor = tensorFromSentence(out_vocab, pair[1])
    return (input_tensor, target_tensor)
The training process of the Seq2Seq model based on word embedding is similar to the previous one:
from train import *
from Layers import *
from rnn import *
import util

hidden_size = 256
num_layers = 1
clip = 5.  # 50.
learning_rate = 0.03
decoder_learning_ratio = 1.0
teacher_forcing_ratio = 0.5
momentum = 0.3
decay_every = 1000
encoder_optimizer = SGD(encoder.parameters(), learning_rate, momentum, decay_every)
decoder_optimizer = SGD(decoder.parameters(), learning_rate*decoder_learning_ratio,
                        momentum, decay_every)
reg = None  # 1e-2
np.random.shuffle(pairs)
train_n = (int)(len(pairs)*0.98)
train_pairs = pairs[:train_n]
valid_pairs = pairs[train_n:]
print_every, plot_every = 100, 100
n_iters = 40000
idx_train_pairs = [tensorsFromPair(random.choice(train_pairs)) for i in range(n_iters)]
idx_valid_pairs = [tensorsFromPair(pair) for pair in valid_pairs]
trainIters(encoder, decoder, encoder_optimizer, decoder_optimizer, idx_train_pairs,
           idx_valid_pairs, True, print_every, plot_every, reg)
Figure 8-55 The training and verification loss curves of the Seq2Seq model of the word embedding layer
Make predictions:
indices = np.random.randint(len(train_pairs), size=3)
for i in indices:
    pair = pairs[i]
    print(pair)
    sentence = pair[0]
    sentence = evaluate(encoder, decoder, in_vocab, out_vocab, sentence, MAX_LENGTH)
    print(sentence)
['c est une vraie commere .', 'she is a confirmed gossip .']
she is a total . .
['nous sommes meilleures qu elles .', 'we re better than they are .']
we re better than they are .
['tu es curieux hein ?', 'you are curious aren t you ?']
you are curious right ?
Using the hidden state or output at the last moment as the context vector may not capture the complete information of the input sequence, especially for long inputs. From the behavior of the previous Seq2Seq models it can be seen that the longer the sequence, the worse the prediction. If the hidden states at all moments are concatenated into one context vector, it can hold the complete input information; but because the input length varies, this context vector obviously cannot be used directly as the hidden state of the decoder, and some transformation is needed to turn it into a fixed-length vector. On the other hand, different parts of the input sequence have different influence at each moment of the decoder, and each decoder moment should pay a different degree of attention to different parts of the input. As shown in Figure 7-56, suppose the input sequence is a source-language sentence meaning "knowledge is power" and the output target sequence is the English sentence "knowledge is power". When the decoder is producing "knowledge", the source word meaning "knowledge" has a greater influence than the other two words, and when producing "is", the source word meaning "is" matters more. Thus, when the decoder makes predictions, different words of the input sequence affect different words of the output sequence differently.
Figure 7-56 Different parts of the input sequence have different predictive effects on the output sequence at different moments
The attention (Attention) mechanism means that at each moment the decoder dynamically selects the part of the input sequence most relevant to the current prediction. A weight vector is computed by comparing the decoder's input information at the current moment (the hidden state from the previous moment and the current data input) with the encoder's outputs (or hidden states) at all moments; this weight vector is then used to weight the encoder's outputs at all moments, yielding a context vector specific to the current moment. That is, the decoder has a different encoder context vector at each moment, and this context vector participates, together with the hidden state and the data input at that moment, in the decoder's computation at the current moment.
The computation at each moment $i$ of the recurrent network of the previous Seq2Seq decoder can be expressed as:

$$h_i = RNN(h_{i-1}, x_i)$$

The computation at each moment $i$ of a Seq2Seq decoder using the attention mechanism can be expressed as:

$$h_i = RNN(h_{i-1}, x_i, c_i)$$

That is, at each moment there is an additional content vector $c_i$ specific to that moment. This $c_i$ depends not only on $h_{i-1}$ and $x_i$, but also on the encoder's outputs (or hidden states) at all moments. If the encoder's outputs at all moments are the hidden states $\bar{h}_t$, then $c_i$ depends on all $\bar{h}_t,\ t = 1, 2, \cdots, T$, where $T$ is the encoder's last moment. At each moment $i$, the decoder first computes, from the encoder outputs $\bar{h} = (\bar{h}_1, \bar{h}_2, \cdots, \bar{h}_T)$ and the decoder's information at moment $i$ (such as the input hidden state $h_{i-1}$), a weight vector $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \cdots, \alpha_{iT})$, and uses this weight vector to take a weighted sum of the encoder outputs $\bar{h}$:

$$c_i = \sum_{j=1}^{T} \alpha_{ij} \bar{h}_j, \qquad \sum_{j=1}^{T} \alpha_{ij} = 1,\ \alpha_{ij} > 0$$

That is, the input context vector $c_i$ at decoder moment $i$ is the weighted average of the encoder's outputs (or hidden states).
These $\alpha_{ij}$ are computed from a set of so-called score (also called energy) values $e_{ij}$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

Each $e_{ij}$ can be computed by some function $a$ from the decoder's input hidden state $h_{i-1}$ at moment $i$ and the encoder's $\bar{h}_j$ at moment $j$:

$$e_{ij} = a(h_{i-1}, \bar{h}_j)$$
Of course, $e_{ij}$ can also depend on the data input $x_i$ at the current moment. Different choices of the function $a$ give different ways of computing the score. As shown in Figure 7-57, $h_{i-1}$ and $\bar{h}_j$ can be fed through a neural network layer with a single neuron and a tanh activation as the score function.
Figure 7-57 The score function as a neural network layer with a single neuron and a tanh activation
Here the parameter $W_a$ is also a parameter to be learned. Common choices of score function also include the dot product and its scaled variant $\bar{h}_s^{\top} h_t / \sqrt{n}$ (Vaswani 2017).
Here $\bar{h}_s$ and $h_t$ denote the hidden states of the input sequence (at position $s$) and the output sequence (at position $t$), respectively, and $v_a$, $W_a$ are learnable weight parameters. Note that although the decoder's hidden state $h_t$ is written with a unified symbol, its meaning differs slightly between papers: in the Bahdanau attention paper it is the hidden state at the previous moment, while in Luong attention it is $h_t$ at the current moment.
As shown in Figure 7-58, at each moment the decoder uses this dynamically computed context together with the data input and the hidden state from the previous moment.
Figure 7-58 At each moment the decoder computes dynamic weights and uses them to take the weighted average of the encoder's outputs (or hidden vectors) at all moments, obtaining a context vector used in the decoder's computation at the current moment
Luong et al. also proposed local attention, which differs from the usual global attention as follows: the model first predicts an alignment position for the current target word in the input sequence, and then computes the context vector over a window centered on that position, as shown in the right part of Figure 7-59.
Figure 7-59 Global attention uses all the encoder's outputs (hidden states) to compute the context vector, while local attention first finds the position in the input sequence corresponding to the target position and then computes the context vector from the encoder outputs (hidden states) within a window centered on that position
As shown in Figure 7-60, the decoder calculates an attention weight vector attn_weights from the hidden state prev_hidden at the previous moment and the encoder outputs encoder_outputs at all moments; it then uses this weight vector to weight encoder_outputs and obtain an attention content vector. This vector and the embedding vector embedded of the input data are combined through a fully connected layer attn_combine; after a relu activation function, the result is input together with prev_hidden into the recurrent neural network unit gru, and the output of gru passes through a fully connected layer out to produce the final output.
In other words: an attention weight vector attn is calculated from the current input data input and the hidden state prev_hidden at the previous moment; it is used to weight the encoder's hidden state outputs encoder_outputs to obtain attn_applied, which is combined with the input embedding embedded by attn_combine; after the activation function, the result serves as the current-moment data input of the recurrent neural network unit (GRU).
Figure 7-60 The calculation process of the attention mechanism: the input and the hidden state are used to calculate an attention weight vector, which forms a weighted sum of the encoder outputs; this weighted sum is combined with the input and fed as new input data to the recurrent neural network layers, which produce the final output.
The forward-calculation and reverse-derivation code for computing the attention weight vector from the hidden state prev_hidden and the encoder output content encoder_outputs is as follows:
def attn_forward(hidden, encoder_outputs):
    # hidden: (B,D), encoder_outputs: (T,B,D)
    energies = np.sum(hidden * encoder_outputs, axis=2)  # dot-product scores, (T,B)
    energies = energies.T                                # (B,T)
    alphas = util.softmax(energies)                      # attention weights, (B,T)
    return alphas, energies

def attn_backward(d_alpha, energies, hidden, encoder_outputs):
    # hidden: (B,D), encoder_outputs: (T,B,D), d_alpha and energies: (B,T)
    d_energies = softmax_backward_2(energies, d_alpha, False)  # gradient through the softmax
    d_energies = d_energies.T                            # (T,B)
    d_energies = np.expand_dims(d_energies, axis=2)      # (T,B,1)
    d_encoder_outputs = d_energies * hidden              # (T,B,1)*(B,D) -> (T,B,D)
    d_hidden = np.sum(d_energies * encoder_outputs, axis=0)  # sum over T -> (B,D)
    return d_encoder_outputs, d_hidden
The following is the code for the forward calculation and reverse derivation of the weighted sum of encoder_outputs by the weights attn_weights:
def bmm(alphas, encoder_outputs):
    # alphas: (B,T), encoder_outputs: (T,B,D)
    encoder_outputs = np.transpose(encoder_outputs, (1, 0, 2))     # (T,B,D) -> (B,T,D)
    # batched weighted sum: context[b] = sum_j alphas[b,j] * encoder_outputs[b,j]
    context = np.einsum("bj, bjk -> bk", alphas, encoder_outputs)  # (B,T),(B,T,D) -> (B,D)
    return context

def bmm_backward(d_context, alphas, encoder_outputs):
    encoder_outputs = np.transpose(encoder_outputs, (1, 0, 2))     # (T,B,D) -> (B,T,D)
    d_alphas = np.einsum("bjk, bk -> bj", encoder_outputs, d_context)  # (B,T,D),(B,D) -> (B,T)
    d_encoder_outputs = np.einsum("bi, bj -> bij", alphas, d_context)  # (B,T),(B,D) -> (B,T,D)
    d_encoder_outputs = np.transpose(d_encoder_outputs, (1, 0, 2))     # (B,T,D) -> (T,B,D)
    return d_alphas, d_encoder_outputs
First, implement the weighted-sum operation bmm() for a batch of sequences. Let T, B, and D be the sequence length, the number of samples, and the data length at each moment, respectively. bmm() accepts a weight matrix of shape (B, T), one row of which is the weight vector of one sample; encoder_outputs is the encoder output content of shape (T, B, D), which is first transposed into a tensor of shape (B, T, D), after which np.einsum() computes, for each sample, the weighted sum of its output content with its weight vector, yielding a vector of length D.
einsum() uses string instructions to control flexible dot-product (matrix multiplication) operations. For example, in "bj, bjk -> bk" the two tensors on the left ("bj" and "bjk") are multiplied to produce the two-dimensional tensor on the right ("bk"), where each axis of a tensor is represented by a letter (instead of 0, 1, 2). This multiplication can be simulated with the following code:
# Loop over each element of the result tensor (subscript bk)
for b in range(...):
    for k in range(...):
        C[b, k] = 0
        for j in range(...):
            C[b, k] += A[b, j] * B[b, j, k]
The calculation of the weight vector and of the weighted sum of the encoder output content can be combined into an attention layer Atten, whose __call__ simply delegates to forward():

    def __call__(self, hidden, encoder_outputs):
        return self.forward(hidden, encoder_outputs)
The following code implements the decoder for the simple attention mechanism above:
from Layers import *
from rnn import *
import util

class DecoderRNN_Atten(object):
    def __init__(self, hidden_size, output_size, num_layers=1, teacher_forcing_ratio=0.5,
                 dropout_p=0.1, max_length=MAX_LENGTH):
        super(DecoderRNN_Atten, self).__init__()
        # self.layers = [self.embedding, self.attn, self.attn_combine, self.gru, self.out]
        self.layers = [self.embedding, self.attn_combine, self.gru, self.out]
        self._params = None
        self.use_dropout = False
        if training:
            self.embedded_x.append(self.embedding.x)
            if self.use_dropout:
                self.dropout_mask.append(self.dropout._mask)
            self.attn_x.append((self.attn.alphas, self.attn.energies, self.attn.hidden,
                                self.attn.encoder_outputs))
            self.attn_combine_x.append(self.attn_combine.x)
            self.relu_x.append(self.relu.x)
            self.gru_x.append((relu_out, self.gru.h))
            self.gru_hs.append(self.gru.hs)  # keep the hidden states of the middle layer
            self.gru_zs.append(self.gru.zs)  # keep the calculation results of the middle layer
            self.out_x.append(self.out.x)
        return output, hidden, output_hs_t
        hidden_t = encoder_outputs[-1].reshape(1, encoder_outputs[-1].shape[0],
                                               encoder_outputs[-1].shape[1])
        output = []
        output_hs = []
        self.gru_x = []  # gru inputs
        self.gru_hs = []
        self.gru_zs = []
        self.dropout_mask = []
        self.embedded_x = []
        self.relu_x = []
        self.attn_x = []
        self.attn_combine_x = []
        self.attn_weights_seq = []
        self.out_x = []
        # encoder_outputs = np.pad(self.encoder_outputs,
        #     ((0, self.max_length - self.encoder_outputs.shape[0]), (0, 0), (0, 0)), 'constant')
        for t in range(target_length):
            output_t, hidden_t, output_hs_t = self.forward_step(input_t, hidden_t,
                                                                encoder_outputs)
            output_hs.append(output_hs_t)
            output.append(output_t)
            if use_teacher_forcing:
                input_t = input_tensor[t]  # teacher forcing: feed the ground-truth token
            else:
                input_t = np.argmax(output_t)  # feed the token with maximum probability
                if input_t == EOS_token:
                    break
                input_t = np.array([input_t])
            self.relu.x = self.relu_x[i]
            d_relu_x = self.relu.backward(drelu_out)
            d_attn_combine_out = d_relu_x
            self.attn_combine.x = self.attn_combine_x[i]
            d_attn_combine_x = self.attn_combine.backward(d_attn_combine_out)
            d_embedded, d_attn_out = d_attn_combine_x[:, :self.hidden_size], \
                                     d_attn_combine_x[:, self.hidden_size:]
            self.attn.alphas, self.attn.energies, self.attn.hidden, \
                self.attn.encoder_outputs = self.attn_x[i]
            dprev_hidden_2, d_encoder_outputs_2 = self.attn.backward(d_attn_out)
            if self.use_dropout:
                self.dropout._mask = self.dropout_mask[i]
                d_embedding = self.dropout.backward(d_embedded)
            else:
                d_embedding = d_embedded
            dprev_hidden += dprev_hidden_2
            d_encoder_outputs += d_encoder_outputs_2  # must be accumulated at every moment
        # d_encoder_outputs[input_T-1] += dprev_hidden[0]
        d_encoder_outputs[-1] += dprev_hidden[0]
        return dprev_hidden, d_encoder_outputs
    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
The results do not seem to have improved much. Interested readers can try increasing the number of iterations, tuning the learning parameters, and especially using different attention mechanisms to obtain better results.
Chapter 8 Generative Models
Data is the basis of machine learning and modern artificial intelligence. The more data, the better the performance of machine learning algorithms. It is precisely because of their huge amounts of data that large companies can develop high-performance artificial intelligence products such as search engines, recommendation systems, and intelligent games. It is often said that "whoever owns the data owns the future." Big data is also one of the key factors in the re-emergence of neural networks and the development of deep learning.
For many problems, manually obtaining data (such as medical imaging data) is usually very difficult and expensive. For example, to improve the performance of face-related algorithms such as face recognition, a large amount of face image data is required, and collecting these face images not only requires user authorization but also comes at a certain price. If face images that are difficult to distinguish from real faces can be generated automatically, costs can be saved and face-related research and applications promoted. As another example, video games and film and television works contain a large number of two-dimensional and three-dimensional scenes; designing and producing these scenes requires a lot of manpower, material, and financial resources, making the cost of shooting a movie or making a game very high. If the high-quality scenes in these works could be generated automatically, a great deal of money and manpower could be saved, letting product developers focus on more creative work.
The generative model in machine learning specializes in how to use computers to automatically generate data similar to real data; that is, a generative model can automatically generate fake data that is indistinguishable from real data. The language model in natural language understanding in Chapter 7 is a generative model: a good language model can generate fluent sentences for applications such as machine translation, chat dialogue, and article generation. Typically, once trained, a recurrent neural network can be used to generate a steady stream of sequence data.
Therefore, automatically generated data can remedy the lack of data for many research problems, not only improving the performance of machine learning algorithms for related problems but also contributing to the development of various application products. For example, automatic face generation technology is used in face applications such as video face replacement (such as DeepFake), automatic image generation can produce images of various styles, and automatic speech synthesis can synthesize voices similar to real people, as well as automatic composition, and so on.
This chapter mainly discusses the two most popular generative model technologies based on deep neural networks (deep learning): Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN).
Every face image in the world is different, but no matter how different, people can tell at a glance that an image is a human face rather than a cat, a dog, or a plant. Suppose all face images are represented by tensors of the same shape, such as three-dimensional red-green-blue image tensors. For example, if a face image is represented by a 3 × 1024 × 768 tensor, it contains 1024 × 768 pixels, each pixel consisting of red, green, and blue color values, so a face image contains 3 × 1024 × 768 variable values. If x denotes this tensor, then x is a data point in a 3 × 1024 × 768 dimensional linear space, and each face image corresponds to a coordinate point in this linear space. The data points corresponding to all face images are not random; they are usually located in a small subspace of this space, just as points on a straight line in a two-dimensional plane are all distributed along that line. x is a random variable: the coordinate points of all face images in this large linear space obey some specific probability distribution law, i.e. some probability density, but this probability distribution cannot be expressed by an analytical mathematical expression.
If a model can automatically generate face images that are indistinguishable from real face images, then these automatically generated images must obey the underlying probability distribution of real face images. Thus, generative modeling is all about generating artificial data whose probability distribution is the same as (or as similar as possible to) that of the real data. For example, if the real data are all data points on a circle in the plane, and the generated data points are also located on this circle, we say the generated data points satisfy the distribution of the circle. If the real data are real numbers on the number axis satisfying a certain probability distribution, for example uniformly distributed on the interval [0,1], and the real numbers produced by the generative model are also uniformly distributed on [0,1], then the generated real numbers and the real real numbers have the same probability distribution, and the two sets of numbers are indistinguishable. But usually we only have these real numbers and do not know their underlying probability distribution. How can we generate real numbers with the same probability distribution as these real real numbers? This is the problem generative models are designed to solve.
A generative model is usually a parametric model, such as a parameterized neural network function. In order to obtain a generative model that can generate fake data obeying the same distribution as the real data, it is necessary to learn the parameters of the parameterized generative model from the real data, just as the parameters of a regression model are learned from real data. Once the parameters of the parametric generative model are determined, fake data obeying the distribution of the real data can be generated automatically by the resulting model, which means these fake data and the real data are indistinguishable.
Of course, the distribution of data generated by the generative model cannot be exactly the same as the
distribution of real data. The closer the two distributions are, the harder it is to distinguish the generated data
from the real data.
Suppose there is a set of real numbers, that is, each real datum is a real number located on the number axis, but their distribution on the axis is unknown. How can we generate fake real numbers that are difficult to distinguish from these real real numbers, that is, fakes that obey the distribution of the real ones? For example, suppose these real numbers are the heights of people in Hainan Province. If the generated height data does not obey the distribution of these height data, it can easily be identified.
For low-dimensional data such as a set of real numbers, the frequency can be used to approximate the probability distribution of the data through simple statistical calculations. For example, below is a set of real numbers from the file "real_values.npy"; this set {x^{(i)}} constitutes the real data.
import numpy as np
x = np.load('real_values.npy')
print(x.shape)
print(x[:5])
(10000,)
[4.88202617 4.2000786 4.48936899 5.1204466 4.933779 ]
How is this set of real numbers distributed in the real number space R? In other words, what probability distribution do they have? The real number axis can be divided into many small intervals, and the frequency of real numbers falling in each small interval can be counted. As long as there is enough data, this frequency is close enough to the probability, so the probability distribution of these real numbers in R can be understood. The code below shows the frequency distribution approximating the probabilities for this set of data, in histogram and curve form:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
draw_hist(plt,x,26)
plt.show()
Through this histogram, it can be observed that the distribution of these real numbers is close to the Gaussian
distribution, and the center point (mean value) of the Gaussian distribution is about 4.0. It is also easy to
calculate that the standard deviation of this set of real numbers is about 0.5.
For this set of real numbers, you can also use the kdeplot() function of the seaborn library to draw the probability
density, which is simpler:
import seaborn as sns
sns.set(color_codes=True)
sns.kdeplot(x.flatten(), shade=True, label='Probability Density')
In fact, these real numbers were indeed sampled from a normal distribution with mean 4 and standard deviation 0.5; they were generated with the following code:
import numpy as np
np.random.seed(0)
mu = 4
sigma = 0.5
M = 10000
x = np.random.normal(mu, sigma, M)
print(x[:5])
np.save('real_values.npy', x)
That is, this set of real numbers {x^{(i)}} obeys the Gaussian distribution N(4, 0.5) with a mean of 4 and a standard deviation of 0.5, shown in Figure 8-3.

Figure 8-3 Gaussian distribution (normal distribution) N(4, 0.5) with a mean of 4 and a standard deviation of 0.5
It can be seen that the real numbers near the mean value 4 have a higher probability of being sampled, and the
real numbers farther away from 4 have a lower probability of being sampled. Therefore, this group of real
numbers is the data in the one-dimensional real number space R, and they satisfy the Gaussian distribution with
a mean of 4 and a standard deviation of 0.5. If the distribution law is found, this probability distribution law can
be directly used to generate real numbers that conform to this distribution law.
For high-dimensional data, using the above frequency-counting method to find the distribution of real data in the high-dimensional data space is not only computationally intensive but also unrealistic. For example, a real face dataset is a collection of face images. If each face image contains 256×256 pixels and each pixel is represented by 3 colors (red, green, and blue), then each image has 256×256×3 = 196608 values; that is, the dimension of the image is 196608, and all these face images live in a 196608-dimensional space, each face image being a data point in this high-dimensional space. How are these face images distributed in this space? Directly estimating the probability (density) distribution p(x_1, x_2, ⋯, x_196608) is an impossible task, because the dimension is very large.
For high-dimensional data, it is necessary to learn a parameterized generative model from the real data, so that data similar to the real data can be generated from this model. Some generative models directly represent the probability distribution or allow it to be calculated directly, while others do not represent the probability distribution themselves, but the distribution of the data they generate is very close to the distribution of the real data; that is, the model is used directly to generate data rather than to calculate the probability distribution of the real data. The generative models discussed below (VAE and GAN) directly generate data.
From a mathematical point of view, a generative model learns a parameterized generative model function G(z|θ) from a set of real data (such as a set of real numbers or a set of faces). Once the parameter θ is determined, this function is determined. This function maps a hidden variable z to a piece of data, and the space where z lives is usually a low-dimensional linear space with a much lower dimension than the real data; for example, z may be a short vector while the real data are multi-megapixel images. Different z produce different G(z). If the probability distribution p_fake satisfied by G(z) is close to the distribution p_real of the real data, such a generative model function can be used to generate fake data.
Therefore, building a generative model means finding a generative model function G(z) such that data G(z) similar to the real data can be generated from a random variable (vector) z. Different random variables z produce different generated data G(z), and the distribution law of these G(z) should be very close to that of the real data x.
As shown in Figure 8-4, many real face images can be used to learn a generative model function of a face image.
Using this function to sample in the latent space (such as a vector) can generate a fake face image.
Figure 8-4 From many real face images, learn a generative model function that maps hidden vectors to face images; sampling a hidden vector in the latent space and feeding it to this generative model function yields a realistic face image
There are three main types of generative models that use neural networks (deep learning) as the generative model function: Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), and autoregressive models (such as PixelRNN).
For example, the following is a face image generated by a GAN (ThisPersonDoesNotExist.com); it can be seen that the generated face image is difficult to distinguish from a real face image.

Figure 8-5 Counterfeit face image generated by a generative adversarial network (GAN)
8.2 Autoencoders
Before introducing the variational autoencoder, let us first introduce the related autoencoder; understanding the autoencoder helps in understanding the variational autoencoder.
8.2.1 Autoencoder
A neural network for classification or regression problems maps an input x to an output y; that is, the neural network is a mapping y = f(x) from x to y, where x is the data feature and y is a target different from x. What happens if y and x are the same, i.e. the network is an identity mapping x = f(x)? If the number of neurons in each layer is the same as the number of features of x, then each neuron can simply output one of the feature components through the identity mapping, as shown in Figure 8-6.
If the number of neurons in the middle hidden layer is different from the number of features, specifically fewer than the number of features, as shown in Figure 8-7, then the input features must pass through this "bottleneck" before being output. If such a neural network can still reconstruct the original input (that is, the output of the network is the same as the input), this indicates that the activation output vector of the bottleneck layer contains all the information of the input; that is, the activation output of the bottleneck layer is actually a compressed representation of the input data, just as a compressed file contains virtually all the information of the original file. In other words, the representation of the bottleneck layer captures the intrinsic relationships (intrinsic structure) between the features of the input data. It also shows that the features of the original data are not independent but correlated. For example, adjacent pixels of an image have similar colors, i.e. these adjacent pixels are correlated. It is precisely because the pixels in an image are correlated that an image compression algorithm can compress the image into smaller data and restore the original image by decompression.
Figure 8-7 For a 2-layer neural network whose output can reconstruct the input, the output of a hidden layer with fewer neurons than input features contains all the information of the input needed to reconstruct it; that is, the activation output vector of the hidden layer is actually a compressed representation of the input data
If the features of a piece of data are independent of each other, these features cannot be fully captured by the bottleneck layer's compressed representation; many input features will inevitably be lost, so that the input cannot be reconstructed.
The hidden layer of the neural network is a transformation of data features. The output of the hidden layer whose
number of neurons is less than the number of original data features is a compressed representation of the original
data. The neural network can automatically learn the intrinsic characteristics of the data, so it is also called
feature learning.
Data such as images are often high-dimensional data, while their essential features are generally low-
dimensional. Representing raw data with low-dimensional data features can improve the efficiency and
performance of machine learning algorithms, such as reducing memory consumption and computation, and
speeding up algorithm convergence. For example, a face image may contain millions of pixels, but in machine
learning, the low-dimensional features of the face are often used to represent the face, such as using PCA
dimensionality reduction technology to represent the face as a vector of dozens of values.
Data can be represented by different features. For example, a circle can be represented by the many pixels (points) on it; this kind of pixel map representing a circle is called a bitmap. It can also be represented by many straight line segments approximating the circle. Both representations require many values to represent a high-quality circle. A circle can also be expressed as just three values: the radius of the circle and the coordinates of its center; the coordinates and radius are the intrinsic characteristics of the circle. These three ways of representing circles describe the characteristics of circles from different angles. Similarly, any other kind of data can have multiple representations, and different representations of the data describe different characteristics of the data from different angles.
Selecting the appropriate feature representation of data is the key to determining the success of machine
learning. One of the main goals of machine learning efforts in recent decades is how to find low-dimensional and
more essential feature representations from the high-dimensional feature representations in the original form of
data. Finding its low-dimensional feature representation from high-dimensional data is called feature
engineering. Designing various artificial features has been the main research goal of researchers in the field of
artificial intelligence in the past few decades. For different problem data, people have proposed various feature
dimensionality reduction techniques and designed various artificial features. With the rise of deep learning, using
neural networks to automatically learn features frees researchers from time-consuming and laborious manual
feature engineering, so that they can focus on more innovative work.
An autoencoder (AE) is a technology that uses a neural network with a bottleneck layer to automatically learn data features. When training the neural network, the target of each sample is the data itself. When the output of the neural network can reconstruct the input, the bottleneck layer of the network is a low-dimensional feature or low-dimensional representation of the data. As shown in Figure 8-8, this neural network is regarded as two functions: the part from the data input layer to the bottleneck layer is regarded as one function, called the encoder; the encoder accepts input data and produces a vector of lower dimension than the input, called the hidden vector. The part from the bottleneck layer to the reconstructed output layer is regarded as another function, called the decoder; the decoder accepts the hidden vector as input and produces an output with the same shape as the input data, and this output should reconstruct the input data as closely as possible. The error between the decoder's output and the encoder's input constitutes the loss of the autoencoder, called the reconstruction loss. By minimizing this loss, the output of the decoder and the input of the encoder are made as equal as possible, i.e. the output of the decoder can reconstruct the input. For an input datum, the hidden vector output by the encoder is a low-dimensional compressed representation of that datum, which captures some of its inherent essential characteristics.
Figure 8-8 The structure of the automatic encoder: a digital image is input to the encoder, the hidden vector
output by the encoder is used as the input of the decoder, and the digital image output by the decoder
reconstructs the input of the encoder.
Therefore, the encoder can encode a high-dimensional datum x into a low-dimensional vector z, and the decoder can map this low-dimensional vector z back to the original data space to obtain a datum x′ very close to x. The output z of x through the encoder is called the hidden vector, and the linear space formed by all possible hidden vectors is called the hidden space.

Let the encoder function be z = q_θ(x), which maps an input x to a latent vector z, and the decoder function be x′ = p_α(z), which maps a hidden vector z to a datum x′ of the same shape as the encoder input x. x′ should be as equal to x as possible; of course, x and x′ cannot be exactly the same, and there will be some error. θ and α are the model parameters of the encoder and decoder respectively; once θ and α are determined, the encoder and decoder functions are determined.
For a trained autoencoder, the decoder is a generator function that can generate (produce) data similar to real data from a hidden vector.

For example, as shown in Figure 8-8, an autoencoder for MNIST handwritten digit images can be trained. A handwritten digit image of shape 28 × 28 is input to the encoder directly or after being flattened into a vector of length 784. The encoder outputs a hidden vector z of a certain length (for example, 10), which is input to the decoder, which in turn outputs a vector of length 784 or an image of size 28 × 28.
The main function of the autoencoder is to compress the data: the hidden vector has a lower dimension than the input data. A data sample is mapped to the hidden vector by the encoding function and then mapped back to (a reconstruction of) itself by the decoding function. The encoding and decoding process of an autoencoder is similar to data compression: compression software compresses a file (folder) into a smaller file and then restores the original file (folder) by decompression. The difference between the decompressed file and the original file is the compression error. If the compressed-then-decompressed file is exactly the same as the original, the compression is lossless; otherwise it is lossy.
The encoding and decoding of the autoencoder is a kind of lossy compression: x is encoded into a hidden vector z, and the x′ decoded from z is not exactly the same as x, but very close.
In order to learn the parameters θ, α of the encoder and decoder functions, all real data x are used to form supervised learning training samples (x, x) (that is, the target value of a sample is the input data itself) to train the encoder-decoder model. The loss function of the autoencoder is:

L(x, x̂) + L_regularizer

That is, it includes the reconstruction error and a regularization term L_regularizer that prevents overfitting.
The autoencoder can also be used to denoise data: when training the autoencoder, simply use the noisy version and the clean version of the data as the data feature and the target value of each training sample respectively, that is, samples (x_noise, x_denoise), where x_noise and x_denoise are the noisy and noise-free data respectively.
Besides using a bottleneck layer, a sparsity penalty such as the L1 penalty ∑_i |a_i^{(h)}| can be added to the loss, where a_i^{(h)} is the activation output of the hidden layer. This penalty term forces these values to be as close to 0 as possible, that is, it has the effect of "making non-zero values as small as possible" (sparseness). The sparsity constraint plays a similar role to the bottleneck layer. Autoencoders that employ sparsity constraints are called sparse autoencoders.
Another commonly used sparsity constraint is the KL divergence constraint. Let ρ̂_j = (1/m) ∑_i [a_j^{(h)}(x^{(i)})] denote the average activation value of the j-th hidden unit over the m samples; it can be regarded as the parameter of a Bernoulli random variable, so that the difference between the ideal distribution ρ and the observed distribution ρ̂_j can be represented by the KL divergence:

L(x, x̂) + ∑_j KL(ρ || ρ̂_j)
def read_mnist():
    if not os.path.isfile("mnist.pkl.gz"):
        # download the dataset if it is not cached locally
        urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
                                   "mnist.pkl.gz")

def draw_mnists(plt, X, indices):
    for i, index in enumerate(indices):
        plt.subplot(1, 10, i+1)
        plt.imshow(X[index].reshape(28, 28), cmap='Greys')
        plt.axis('off')
print(train_X.dtype)
print(train_X.shape)
print(valid_X.shape)
print(np.mean(train_X[0]))
draw_mnists(plt,train_X,range(10))
plt.show()
float32
(50000, 784)
(10000, 784)
0.13714226
Then define an autoencoder neural network, and use the samples in the training set train_X both as data inputs and as target values to train this neural network:

import util
import train
np.random.seed(100)
nn = NeuralNetwork()
nn.add_layer(Dense(784, 32))
nn.add_layer(Relu())        # alternatives: Leaky_relu(0.01), Sigmoid()
nn.add_layer(Dense(32, 784))
nn.add_layer(Sigmoid())

X = train_X
epochs = 5
print_n = 150
losses = train_nn(nn, X, X, optimizer, loss_fn, epochs, batch_size, reg, print_n)
0 iter: 181.4754917881575
195 iter: 37.86314183909435
390 iter: 26.37453076661517
585 iter: 23.174562871581397
780 iter: 18.48867272781079
975 iter: 17.106892623912394
1170 iter: 14.298662482564286
1365 iter: 13.615108972766208
1560 iter: 12.110143611597861
1755 iter: 11.548596796674369
What is the result of reconstructing some digit images with the following code?
def draw_predict_mnists(plt, X, indices):
    for i, index in enumerate(indices):
        aimg = train_X[index]
        aimg = aimg.reshape(1, -1)
        aimg_out = nn(aimg)            # reconstruct the image with the autoencoder
        plt.subplot(2, 10, i+1)
        plt.imshow(aimg.reshape(28, 28), cmap='gray')
        plt.axis('off')
        plt.subplot(2, 10, i+11)
        plt.imshow(aimg_out.reshape(28, 28), cmap='gray')
        plt.axis('off')

draw_predict_mnists(plt, train_X, range(10))
plt.show()
Figure 8-11 Target images and reconstructed images of the autoencoder trained with learning rate 0.001 and epochs=100; the upper row shows the target images and the lower row the reconstructed images

It can be seen that the output images almost reconstruct the input images. Of course, the parameters and training process of the network can be tuned to produce better results.
As an exercise, the reader can add noise to the inputs of the training samples; the training code does not need to be modified to give the network an image-denoising capability, as sketched below. In addition, the fully connected neural network here can also be replaced by a convolutional neural network; an autoencoder using a convolutional network is called a convolutional autoencoder.
VAE is an enhancement of the traditional autoencoder (AE). Figure 8-12 shows the working process of VAE:

Figure 8-12 Variational autoencoder. The encoder outputs the parameters of a probability distribution; a hidden vector sampled according to this distribution is used as the input of the decoder, which outputs data with the same shape as the encoder input. The error between the two serves as the loss function value
Unlike AE, which maps a datum (such as an image) to a fixed vector in the latent space, VAE maps data to a probability distribution (actually, to the parameters of a probability distribution); for example, it maps an image x to Gaussian distribution parameters, outputting the mean parameter μ and variance parameter σ² of the Gaussian distribution.
If a hidden-vector data point is randomly sampled from this probability distribution, then, by the nature of the Gaussian distribution, this data point will be concentrated near μ; for example, the sampled data point is z = μ + σ·ϵ, where ϵ is a random perturbation. Each such sampling point z will be mapped by the decoder to an image x′; because these z surround μ, the decoded x′ are also very similar images. That is, a continuously varying hidden vector z generates continuously changing data x′, which makes the data more structured in the hidden space, so that the hidden vector can be edited meaningfully and the data can be changed and controlled as needed.
The encoder and decoder functions are determined by the neural network parameters ϕ and θ. The output of the encoder network gives the parameters of the probability density of the hidden vector z (assumed to be the parameters μ, σ² of a Gaussian distribution); that is, for each input x, the q_ϕ(z|x) output by the encoder is a parameterized probability distribution, which expresses the likelihood (probability) that x maps to different hidden variables z. The decoder function p_θ(x|z) maps the hidden variable z to an output with the same shape as the input x, and this output can also be a probability. For example, if x is a 28 × 28 handwritten digit image where the value of each pixel is 1 or 0, then the output of the decoder p_θ(x′|z) is also a 28 × 28 tensor, meaning that each position gives the probability of the corresponding value (such as 1 or 0) of the input x.
Input an x into the encoder-decoder pipeline of the VAE; the perfect reconstruction output is x itself, and the actual output x′ should be as close to x as possible. In order to reconstruct the input x as well as possible, the probability that the output is x should be as large as possible, that is, p_θ(x|z) should be maximized. Since the decoder of the VAE outputs not the data itself but the probability p_θ(x|z) of different data x, the reconstruction loss corresponding to maximizing this probability is the negative log-likelihood −log(p_θ(x|z)) to be minimized, plus a regularization term; the loss for sample x^{(i)} is:
L_i(θ, ϕ) = −E_{z∼q_ϕ(z|x^{(i)})}[log p_θ(x^{(i)}|z)] + KL(q_ϕ(z|x^{(i)}) ∥ p(z))
The first term is the reconstruction term. For each input x^{(i)}, the encoder output q_ϕ(z|x^{(i)}) is a distribution over the random variable z, and for each z the probability that the output is x^{(i)} is p_θ(x^{(i)}|z). The expected log probability E_{z∼q_ϕ(z|x^{(i)})}[log p_θ(x^{(i)}|z)] is therefore the expected log probability that the reconstruction output is x^{(i)}; maximizing the probability of reconstructing x^{(i)} means maximizing this expected log probability, that is, minimizing the negative expected log probability −E_{z∼q_ϕ(z|x^{(i)})}[log p_θ(x^{(i)}|z)].
The second term of the loss function is the regularization term, which uses the Kullback-Leibler divergence to represent the distance between the distribution q_ϕ(z|x^{(i)}) of z and the standard normal distribution p(z) = N(0, 1), i.e. to describe their similarity. Using this term as a regularizer (penalty term) pushes the probability distribution of z to be as close to the standard normal distribution as possible, just as weight regularization pushes the weight parameters of a neural network to be as close to 0 as possible. On the one hand, any probability distribution can be approximated by a multivariate normal distribution; on the other hand, any normal distribution can be converted into a standard normal distribution by a transformation. Therefore, z can be assumed to follow a standard normal distribution.
For m samples x^{(i)}, the total loss function is the sum of the per-sample losses, i.e. ∑_{i=1}^{m} L_i.

The KL divergence between two multivariate Gaussian distributions N_0 = N(μ_0, Σ_0) and N_1 = N(μ_1, Σ_1) has a closed form:

D_KL(N_0 ∥ N_1) = ½ { tr(Σ_1^{−1} Σ_0) + (μ_1 − μ_0)^T Σ_1^{−1} (μ_1 − μ_0) − k + ln(|Σ_1| / |Σ_0|) }
where k is the dimensionality of the vector space. The KL divergence describes how similar two distributions are. Assuming that the mean vector and covariance matrix of the Gaussian distribution of the hidden variable z of the variational autoencoder are μ(z) and Σ(z), the KL divergence D_KL[N(μ(z), Σ(z)) ∥ N(0, 1)] between this distribution and the standard normal distribution can be expressed as:

D_KL[N(μ(z), Σ(z)) ∥ N(0, 1)] = ½ (tr(Σ(z)) + μ(z)^T μ(z) − k − log det(Σ(z)))
Here k is the dimension of the Gaussian distribution, tr(Σ(z)) is the trace of the covariance matrix Σ(z) (the sum of its diagonal elements), and det(Σ(z)) is the value of its determinant. Any multivariate Gaussian distribution can always be transformed by a linear change of variables into a Gaussian distribution whose covariance matrix is diagonal, i.e. Σ(z) can be regarded as a diagonal matrix. Thus the above formula can be simplified as:
D_KL[N(μ(z), Σ(z)) ∥ N(0, 1)] = ½ (∑_j σ_j² + ∑_j μ_j² − ∑_j 1 − log ∏_j σ_j²)
  = ½ (∑_j σ_j² + ∑_j μ_j² − ∑_j 1 − ∑_j log σ_j²)
  = ½ ∑_j (σ_j² + μ_j² − 1 − log σ_j²)
  = −½ ∑_{j=1}^{k} (1 + log(σ_j²) − μ_j² − σ_j²)
Here σ_j² is the j-th diagonal element of the diagonal matrix Σ(z). In practice, using log σ_j² instead of σ_j² is more numerically stable, because the logarithm log is more stable than the exponential exp and is not prone to overflow. Therefore, what the encoder actually outputs is not the variance σ_j² itself but its logarithm log σ_j².

Given the μ and log σ² of this probability distribution, how is a latent variable z obtained from this multivariate Gaussian distribution? Only by inputting a hidden variable z into the decoder can a decoder output be obtained. This requires sampling the Gaussian distribution to get a sample z, which is then fed into the decoder. However, the sampling operation on a probability distribution cannot be differentiated. For this reason, the paper uses a "reparameterization trick" that transforms sampling from a general Gaussian distribution z ∼ N(μ, Σ) into sampling from the standard normal distribution u ∼ N(0, 1), because between z and u there is a simple linear transformation:
z = μ + Σ^{1/2} u
According to this transformation, as long as the standard normal distribution u ∼ N(0, 1) is sampled to obtain a sample value ϵ, a sample from the general normal distribution z ∼ N(μ, Σ) is obtained as:

z = μ + Σ^{1/2} ϵ = μ + σ ϵ = μ + (e^{½ log σ²}) ϵ
Randomly sampling the standard normal distribution N(0, 1) makes the sampling operation itself no longer depend on μ and log σ², so no derivatives of the sampling are needed; that is, ϵ does not depend on μ and log σ². Writing E = log σ², we have z = μ + e^{E/2} ϵ, so the gradient of the reconstruction loss with respect to μ is du = dz, and with respect to E it is dE = dz × ϵ × ½ e^{E/2}. Knowing du and dE, the model parameters of the encoder can be derived in reverse; the process is the same as the usual backward derivation of a neural network.

The gradients of the KL loss −½ ∑_j (1 + E_j − μ_j² − e^{E_j}) with respect to μ and E are:

du = μ
dE = −½ (1 − e^E)
In order to avoid overly long training times, only certain handwritten digit images (such as the digits 1, 2, 7) may be selected for training. The auxiliary function choose_numbers() extracts from the training set (X, Y) those digit images whose label Y is in numbers; for example, choose_numbers(train_X, train_y, [1,2,7]) extracts from train_X the digit images whose labels are 1, 2, or 7.
def choose_numbers(X, Y, numbers):
    X_ = []
    for i in range(len(X)):
        if Y[i] in numbers:       # keep only images whose label is in numbers
            X_.append(X[i])
    return np.array(X_)
#X = choose_numbers(train_X, train_y,[1,2,7])
X = train_X
The VAE's encoder and decoder are two neural networks:
from NeuralNetwork import *
from util import *
np.random.seed(100)

input_dim = 784
hidden = 256
nz = 2          # dimension of the latent Gaussian

encoder = NeuralNetwork()
encoder.add_layer(Dense(input_dim, hidden))
encoder.add_layer(Relu())
encoder.add_layer(Dense(hidden, hidden))
encoder.add_layer(Relu())
encoder.add_layer(Dense(hidden, 2*nz))   # outputs both mu and logvar

decoder = NeuralNetwork()
decoder.add_layer(Dense(nz, hidden))
decoder.add_layer(Relu())
decoder.add_layer(Dense(hidden, hidden))
decoder.add_layer(Relu())
decoder.add_layer(Dense(hidden, input_dim))
decoder.add_layer(Sigmoid())
Here nz is the dimension of the Gaussian latent space; for example, nz=2 means a two-dimensional multivariate Gaussian distribution.

The VAE model is composed of an encoder and a decoder. The following VAE class contains an encoder and a decoder. Its method forward() passes the input x through the encoder to produce the outputs μ (mu) and log σ² (logvar); after reparameterized sampling, the sampled z is obtained, and the output out is then produced by the decoder. The method backward() uses the loss function specified by the parameter loss_fn to first calculate the reconstruction loss loss_fn(out, x) between the input x and the output out. From the gradient loss_grad of this loss with respect to out, it calls the decoder's backward() to compute the gradients of the reconstruction loss with respect to the decoder's model parameters and the gradient dz with respect to the sampled z. From dz, the gradients du and dE with respect to the encoder outputs are computed, giving the gradient vector duE of the reconstruction loss with respect to the encoder output; adding the gradients of the KL loss with respect to u and E and calling the encoder's backward() computes the gradients of the reconstruction loss and KL loss with respect to the encoder parameters. The method train_VAE_epoch() traverses the dataset dataset once, i.e. performs one training epoch.
class VAE:
    def __init__(self, encoder, decoder, e_optimizer, d_optimizer):
        self.encoder, self.decoder = encoder, decoder
        self.e_optimizer, self.d_optimizer = e_optimizer, d_optimizer

    def encode(self, x):
        e_out = self.encoder(x)
        mu, logvar = np.split(e_out, 2, axis=1)   # first half is mu, second half logvar
        return mu, logvar

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        # reparameterized sampling: z = mu + sigma*eps with eps ~ N(0, I)
        self.rand_sample = np.random.randn(*mu.shape)
        z = mu + np.exp(logvar * .5) * self.rand_sample
        out = self.decode(z)
        return out, mu, logvar

    def __call__(self, X):
        return self.forward(X)

    # backpropagation
    def backward(self, x, loss_fn=BCE_loss_grad):
        out, mu, logvar = self.forward(x)
        # reconstruction loss and its gradient with respect to out
        loss, loss_grad = loss_fn(out, x)
        dz = self.decoder.backward(loss_grad)
        du = dz                                                 # dz/dmu = 1
        dE = dz * np.exp(logvar * .5) * .5 * self.rand_sample   # dz/dlogvar
        duE = np.hstack([du, dE])
        # KL loss and its gradients with respect to mu and logvar
        kl_loss = -0.5*np.sum(1 + logvar - mu**2 - np.exp(logvar))
        loss += kl_loss/(len(out))
        kl_du = mu
        kl_dE = -0.5*(1 - np.exp(logvar))
        kl_duE = np.hstack([kl_du, kl_dE])
        kl_duE /= len(out)
        self.encoder.backward(duE + kl_duE)
        return loss
    def train_VAE_epoch(self, dataset, loss_fn=BCE_loss_grad, print_fn=None):
        losses = []
        iter = 0
        for x in dataset:
            loss = self.backward(x, loss_fn)
            self.e_optimizer.step()
            self.d_optimizer.step()
            losses.append(loss)
            if print_fn:
                print_fn(losses)
            iter += 1
        return losses
    def save_parameters(self, en_filename, de_filename):
        self.encoder.save_parameters(en_filename)
        self.decoder.save_parameters(de_filename)

    def load_parameters(self, en_filename, de_filename):
        self.encoder.load_parameters(en_filename)
        self.decoder.load_parameters(de_filename)
The following code creates a VAE object vae and repeatedly calls its training method train_VAE_epoch() to train on the dataset provided by the iterator data_it:
lr = 0.001
beta_1, beta_2 = 0.9, 0.999
e_optimizer = Adam(encoder.parameters(), lr, beta_1, beta_2)
d_optimizer = Adam(decoder.parameters(), lr, beta_1, beta_2)
loss_fn = mse_loss_grad    # or BCE_loss_grad
batch_size = 64
vae = VAE(encoder, decoder, e_optimizer, d_optimizer)

start = time.time()
epochs = 30
print_n = 1
epoch_losses = []
for epoch in range(epochs):
    data_it = data_iterator_X(X, batch_size)
    epoch_loss = vae.train_VAE_epoch(data_it, loss_fn)
    epoch_loss = np.array(epoch_loss).mean()
    if epoch % print_n == 0:
        print('Epoch {}, Training loss {:.2f}'.format(epoch, epoch_loss))
    epoch_losses.append(epoch_loss)
end = time.time()
print('Time elapsed: {:.2f}s'.format(end - start))
# vae.save_parameters("vae_en.npy", "vae_de.npy")
Figure 8-13 Training loss curve of the variational autoencoder on the MNIST handwritten digits
The following code uses the trained VAE to encode and decode MNIST: a handwritten digit image is input, in the hope of reconstructing it:

def draw_predict_mnists(plt, vae, X, n):
    f, axarr = plt.subplots(2, n)
    for i in range(n):
        out, mu, logvar = vae(X[i].reshape(1, -1))   # encode then decode one image
        axarr[0, i].imshow(X[i].reshape((28, 28)), cmap='Greys')
        axarr[1, i].imshow(out.reshape((28, 28)), cmap='Greys')
        if i == 0:
            axarr[1, i].set_title('reconstruction')

draw_predict_mnists(plt, vae, test_X, 10)
plt.show()
Figure 8-14 Encoding and decoding results of the variational autoencoder: some digit images are reconstructed correctly and some are not; the network model structure and training parameters need further improvement

It can be seen that some digit images are reconstructed correctly while others are not; further improving the network model and tuning the parameters can improve the reconstruction quality.
"Generative Adversarial Networks is the most interesting idea in the last ten years in machine learning." (Yann LeCun)
As a data generation technology, GAN can generate fake images, text, voice, and other data. As shown in Figure 8-15, of the two face images (from the website http://www.whichfaceisreal.com/), one is a real face image and the other is an image generated by a GAN. Is it difficult to distinguish them?
Figure 8-15 Face generated by GAN (left) and real face (right)
Generating images was the main goal of early GANs. Figure 8-16 shows images that can be generated with BigGAN.
Figure 8-16 Images generated with BigGAN, from the paper Large Scale GAN Training for High Fidelity
Natural Image Synthesis
GAN can be used not only to generate images but also for image enhancement, image super-resolution, image restoration, image-to-image translation, style transfer, and other applications. As shown in Figure 8-17, GAN-based image inpainting technology (Image Inpainting for Irregular Holes Using Partial Convolutions) can restore the original image content from a damaged or mosaicked image.

Figure 8-17 GAN-based image inpainting: masked images and the corresponding restoration results, from Image Inpainting for Irregular Holes Using Partial Convolutions
As shown in Figure 8-18, style transfer (Image-to-Image Translation with Conditional Adversarial Nets) can transfer the style of one image onto another image.

Figure 8-18 Style transfer, from the paper Image-to-Image Translation with Conditional Adversarial Nets
In addition to synthesizing images, GAN can also be used to synthesize music (such as GANSynth), voice, and text. Figure 8-19 shows text generated by different GAN technologies.
Figure 8-19 English text generated by IWGAN and TextKD-GAN, from the paper TextKD-GAN: Text
Generation using Knowledge
Distillation and Generative Adversarial Networks
GAN-based data synthesis and reconstruction technology has spawned a variety of innovative applications, such
as the famous GAN-based face-changing application DeepFake, which can replace a face in a video with another
face (as shown in Figure 8-20).
Virtual try-on can change the clothes on a person: as shown in Figure 8-21, given a photo of a person, the clothes can be changed, and the posture can also be changed (from the paper Down to the Last Detail: Virtual Try-on with Detail Carving).
Figure 8-21 Virtual try-on: given a photo of a person, different clothes can be put on, and the posture can also be changed
For more GAN technologies and applications, you can search the web; for example, many GAN technical papers are collected in the GAN Zoo.
The working principle of GAN is similar to the process of counterfeiting: counterfeiters hope to manufacture (generate) fake works (banknotes, calligraphy and paintings, antiques), while authenticators, as their opponents, try to identify whether the works are genuine (for example, bank staff or banknote detectors identify counterfeit banknotes, and cultural-relics experts authenticate relics). At first, the fakes produced by the counterfeiters are easily identified by the authenticators; as the counterfeiters' technology continues to improve, their counterfeits become harder and harder to identify, and the two sides keep confronting each other. Finally, when the counterfeits can no longer be recognized by the authenticators, the confrontation between the two sides has reached a balance, and the counterfeits can fool the authenticators.
This process of confrontation between counterfeiter and authenticator is a so-called "minimax game": the authenticator hopes to maximize its ability to distinguish true from false, while the counterfeiter hopes to minimize the authenticator's ability to distinguish true from false. When this game reaches an equilibrium state, it is called a "Nash equilibrium".

A GAN consists of two parts:
A Generator function (network), which generates (produces) synthetic data from random noise (called latent variable) inputs.
A Discriminator function (network), which is a binary classification function that identifies whether a datum is real data or not.
As shown in Figure 8-22, in a GAN that generates face images, the generator (Generator) and the discriminator (Discriminator) are two functions represented by deep neural networks. The generator can generate a face image from a random noise vector, and the discriminator is a simple binary-classification neural network that accepts a face image and outputs the probability that the image is a real face. The discriminator accepts both real face images and fake face images produced by the generator to train its discriminative ability.
Figure 8-22 GAN model for generating faces: the generator represented by a neural network generates a face image from a random noise vector, and the discriminator is a simple binary-classification neural network that takes a face image as input and outputs the probability that the image is a real face
In GAN, both the discriminator and the generator are functions represented by neural networks, denoted D(x|θ_D) and G(z|θ_G) respectively, where θ_D and θ_G are the model parameters of the two neural networks. The generator function G(z|θ_G) maps a noise hidden variable z to a datum G(z|θ_G), and the discriminator D(x|θ_D) is used to judge the probability that x is real data.
GAN needs to train these two neural network functions so that the discriminator D can identify true and false data as correctly as possible, that is, the probability D(x) for real data is as large as possible and the probability D(G(z)) that generated data is judged as real is as small as possible. On the other hand, the generator G must also be trained so that the data it generates can deceive the discriminator as much as possible, that is, so that generated data is judged as real by the discriminator D with a probability D(G(z)) that is as large as possible. During GAN training, the generator and the discriminator improve their respective abilities through this mutual confrontation, so that in the end the discriminator cannot distinguish between real data and the fake data produced by the generator.
Initially, the distribution of the data G(z|θ_G) generated by the generator G will not be consistent with the underlying distribution of the real data, and the discriminator D(x|θ_D) has not yet learned enough ability to distinguish true from false data. GAN trains them with an adversarial process that alternately trains the discriminator and the generator, i.e., it repeatedly performs the following adversarial training:
Training the discriminator: the discriminator D accepts a set of real data x_real and fake data x_fake = G(z) from the generator as samples. The discriminator function should make the output value D(x_real|θ_D) for real data as large as possible (probability as close to 1 as possible) and the output value D(x_fake|θ_D) = D(G(z)|θ_D) as small as possible (probability close to 0); therefore, the sample labels of real data and fake data are 1 and 0 respectively. The training process is exactly the same as the usual neural network training process.
Training the generator: the generator G accepts a set of random noise z and feeds the data it generates, i.e. the output value G(z), to the discriminator D, hoping to fool the discriminator as much as possible, that is, to make the discriminator's output value D(G(z)) as large as possible (probability close to 1).
The above process is repeated until the discriminator cannot distinguish real data from fake data; in mathematical terms, until the distribution of the data generated by the generator is very close to the distribution of the real data.
If the real data are some real numbers obeying a certain distribution such as a normal distribution, then, as shown in Figure 8-23, the black dots represent the probability density (distribution) of these real numbers. Assume we have only these real numbers and do not know the true distribution. The generator can generate some real numbers x by mapping the noise latent variable z into the real number space; the green solid line indicates the distribution of these generated real numbers. At the beginning, the distribution of the generated real numbers is inconsistent with the distribution of the real real numbers. As the training process iterates, the distribution of the generated data G(z) gradually approaches the distribution of the real data x, and the probability that the discriminator recognizes the generated data as real data continues to increase; finally the distribution of generated data and the distribution of real data are almost identical. At this point the discriminator can no longer distinguish real from generated data, that is, regardless of whether the data is real or generated, the probability of it being judged as real data is close to 0.5.
In Figure 8-23, the black dots represent the probability density (distribution) corresponding to the real data (real
numbers), and the green dots represent the distributions that the generated real numbers obey. Initially, the two
distributions are not similar. As the confrontation process progresses, the two become more and more the closer.
2. Loss function
The goals of the discriminator and the generator are different. The discriminator wants D(x) to be as large as
possible and D(G(z)) to be as small as possible; the generator wants D(G(z)) to be as large as possible. As in
logistic regression and multi-classification problems, in order to improve numerical stability, log D(x) and
log D(G(z)) are usually used instead of D(x) and D(G(z)), and the average loss (expected loss) over a batch of
(multiple) samples is used as the loss function value.
If p_z, p_r, and p_g denote the distributions of the latent variable z, the real data x, and the generated data G(z)
respectively, the discriminator D wants the expectation (average value) of log D(x) over real data,
E_{x∼p_r(x)}[log D(x)], to be as large as possible, and the expectation of log D(G(z)) over generated data to be
as small as possible. The generator G hopes to deceive the discriminator D, that is, it hopes that
E_{z∼p_z(z)}[log(D(G(z)))] is as large as possible, or equivalently that E_{z∼p_z(z)}[log(1 − D(G(z)))] is as
small as possible. That is:
Therefore, these two loss functions can be expressed as a single unified loss function:

min_G max_D L(G, D) = E_{x∼p_r(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
That is, the discriminator D wants to maximize this loss, while the generator G wants to minimize it (for G, the
first term of this loss has nothing to do with G); the generator and the discriminator are playing a "max-min"
adversarial game. Although this is how the unified formula is written, in actual programming one still separately
optimizes max_D E_{x∼p_r(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))] and
min_G E_{z∼p_z(z)}[log(1 − D(G(z)))]. According to practical experience, transforming
min log(1 − D(G(z))) into max log(D(G(z))), i.e., into min(−log(D(G(z)))), is more conducive to the stability
of training.
3. Training process
The training of GAN is nothing more than the training of two ordinary neural networks. GAN trains the
discriminator and the generator in an alternating manner: first train the discriminator, then the generator, then
the discriminator again, and so on. The training process can be described by the following pseudocode:
repeat (one adversarial iteration):
    for k steps:    # discriminator training
        sample a batch of real data x and a batch of noise z;
        compute the gradient of the discriminator loss with respect to its model parameters;
        update the model parameters of the discriminator using the gradient ascent method;
    for l steps:    # generator training
        sample a batch of noise z;
        compute the gradient of the generator loss with respect to its model parameters;
        update the model parameters of the generator using the gradient descent method;
In the authors' original paper, the generator performs only one gradient update per adversarial iteration, that is,
l = 1, while the number of discriminator updates k is an adjustable hyperparameter. By adjusting k or l, the
training intensity of the discriminator and the generator can be balanced, preventing one side from being
overtrained and leaving the other side too weak. Like the learning rate and the network structure and its
parameters, these are hyperparameters that must be tuned by experience; their settings directly affect the
performance of the algorithm, and because GAN training is adversarial, it is harder to tune.
The discriminator and the generator need to fight each other, but if one becomes too strong, the other becomes
weaker. How to balance the training of the two (that is, how to adjust these hyperparameters) is the difficulty of
GAN training.
The D_train() function performs one pass of gradient updates for the discriminator, and the G_train() function
performs one pass of gradient updates for the generator.
The discriminator is a binary classification neural network trained on real data and on fake data produced by the
generator; the label value of real data is 1 and that of fake data is 0. The D_train() function computes the binary
cross-entropy loss on these real and fake samples, computes the gradients through reverse derivation, and then
updates the model parameters:
from util import *

def D_train(D, D_optimizer, x_real, x_fake, loss_fn, reg=1e-3):
    D_optimizer.zero_grad()                  # 1. reset gradients to 0
    y_real = np.ones((len(x_real), 1))       # real samples are labeled 1
    y_fake = np.zeros((len(x_fake), 1))      # fake samples are labeled 0
    # 2. loss and gradient on the real samples
    f_real = D(x_real)
    real_loss, real_loss_grad = loss_fn(f_real, y_real)
    D.backward(real_loss_grad, reg)
    loss = real_loss + D.reg_loss(reg)
    # 3. loss and gradient on the fake samples
    f_fake = D(x_fake)
    fake_loss, fake_loss_grad = loss_fn(f_fake, y_fake)
    D.backward(fake_loss_grad, reg)
    loss += (fake_loss + D.reg_loss(reg))
    D_optimizer.step()                       # 4. update the discriminator parameters
    return loss
Among them, D and D_optimizer are the discriminator neural network and its optimizer, x_real and x_fake are
the real data and fake data respectively, and loss_fn is the binary cross-entropy function.
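The BCE_loss_grad function used later as loss_fn is not shown here; a minimal sketch of what such a function could look like, assuming it takes raw scores f (D has no Sigmoid output layer) and labels y, and returns both the mean binary cross-entropy and its gradient with respect to f:
def BCE_loss_grad(f, y):
    # sigmoid turns raw scores into probabilities
    p = 1 / (1 + np.exp(-f))
    eps = 1e-12                       # avoid log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = (p - y) / len(f)           # gradient of the mean BCE w.r.t. the scores f
    return loss, grad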
G_train() performs one pass of gradient updates for the generator. It accepts a set of random noise vectors z and
produces x_fake through the generator. In order to deceive the discriminator, the data label of x_fake is set to 1;
these generated samples are fed to the discriminator as if they were real samples, and the model parameters of
the generator are then updated by reverse derivation through the discriminator's binary classification loss:
#=================== One training pass of the generator ===========================#
def G_train(D, G, G_optimizer, z, loss_fn, reg=1e-3, hack=False):
    G_optimizer.zero_grad()                  # 1. reset gradients to 0
    x_fake = G(z)                            # 2. generate fake data
    loss, grad = loss_fn(D(x_fake), np.ones((len(z), 1)))  # fakes are labeled 1
    grad = D.backward(grad)                  # 3. backpropagate through D (D is not updated)
    G.backward(grad, reg)                    #    and then through G
    G_optimizer.step()                       # 4. update only the generator parameters
    return loss
Among them, D, G, and G_optimizer are the discriminator network, the generator network, and the generator's
optimizer, respectively, and z is randomly sampled noise. When training the generator, the model parameters of
the discriminator are kept fixed; only the parameters of the generator are updated, that is, only
G_optimizer.step() is called.
As the overall process of GAN training, the GAN_train() function first executes D_train() to train and update the
discriminator in each iteration, and then executes G_train() to train and update the generator. d_steps and g_steps
indicate how many times D_train() and G_train() are executed in each iteration of GAN_train(), because
sometimes multiple gradient updates are needed to learn better model parameters. Together with parameters such
as the learning rates of the two optimizers, they balance the learning intensity of the two networks and prevent
either the discriminator or the generator from becoming too strong.
def GAN_train(D, G, D_optimizer, G_optimizer, real_dataset, noise_z, loss_fn, iterations=10000,
              reg=1e-3, show_result=None, d_steps=1, g_steps=1, print_n=100):
    iter = 0
    D_losses, G_losses = [], []
    while iter < iterations:
        # training discriminator
        for d_index in range(d_steps):
            x_real = next(real_dataset)
            x_fake = G(next(noise_z))
            D_loss = D_train(D, D_optimizer, x_real, x_fake, loss_fn, reg)
        # training generator
        for g_index in range(g_steps):
            G_loss = G_train(D, G, G_optimizer, next(noise_z), loss_fn, reg)
        if iter % print_n == 0:
            print(iter, "iter:", "D_loss", D_loss, "G_loss", G_loss)
            if show_result:
                show_result(D_losses, G_losses)
        D_losses.append(D_loss)
        G_losses.append(G_loss)
        iter += 1
    return D_losses, G_losses
These real numbers are used as the real data, and assuming their distribution is not known, how can we generate
new real numbers that follow the same probability distribution? This problem can be solved with GAN: GAN
uses these real numbers as real data to train its discriminator and generator functions, and the trained generator
can then produce real numbers (that is, fake data) with the same distribution as the real ones.
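The real data here can, for instance, be a set of samples drawn from a one-dimensional Gaussian distribution. A minimal sketch, where the mean mu and standard deviation sigma are assumed values chosen for illustration:
import numpy as np

mu, sigma = 4.0, 0.5                        # assumed parameters of the real-data distribution
x = mu + sigma * np.random.randn(10000, 1)  # 10000 real samples of shape (10000, 1)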
hidden = 4
D = NeuralNetwork()
D.add_layer(Dense(1, hidden))
D.add_layer(Leaky_relu(0.2))  # or Relu()
D.add_layer(Dense(hidden, 1))
#D.add_layer(Sigmoid())
G = NeuralNetwork()
z_dim = 1  # dimension of the hidden (latent) variable
G.add_layer(Dense(z_dim, hidden))
G.add_layer(Leaky_relu(0.2))
G.add_layer(Dense(hidden, 1))
batch_size = 64
def data_iterator_X(X, batch_size, shuffle=True, repeat=True):
    m = len(X)
    indices = list(range(m))
    while True:
        if shuffle:
            np.random.shuffle(indices)
        for i in range(0, m, batch_size):
            if i + batch_size > m:
                break
            j = np.array(indices[i: i + batch_size])
            yield X.take(j, axis=0)
        if not repeat:   # a repeat flag lets later code stop after one epoch
            break
data_it = data_iterator_X(x, batch_size)
x0 = next(data_it)
print(x0.shape)
print(x0[:10].transpose())
10000
(64, 1)
[[4.39069056 4.20482386 4.14997364 4.65636703 4.36363908 3.75927793
3.34646553 4.64355828 4.45063574 3.49191287]]
The input of the generator is random noise. The following function generates a batch of m random noise vectors
(each noise vector has length z_dim):
def sample_z(m, z_dim=1):
    return np.random.randn(m, z_dim)
(64, 1)
[[ 0.72956978 0.14262128 -0.29800486 1.78637966 0.27740342 -0.61411045
-0.68236473 1.61341108 0.41862218 -0.89009973]]
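The training code below draws noise batches from an iterator created by noise_z_iterator(); its definition is not shown above, but a minimal sketch consistent with how it is used would be:
def noise_z_iterator(batch_size, z_dim=1):
    # endlessly yield batches of random noise vectors
    while True:
        yield sample_z(batch_size, z_dim)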
def draw_loss(ax, D_losses=None, G_losses=None):
    ax.clear()
    if D_losses:
        i = np.arange(len(D_losses))
        ax.plot(i, D_losses, '-')
    if G_losses:
        ax.plot(i, G_losses, '-')
    ax.legend(['D_losses', 'G_losses'])
def show_result_gauss(D_losses=None, G_losses=None, m=600):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    draw_loss(ax1, D_losses, G_losses)
    ax2.clear()
    xmin, xmax = np.min(x), np.max(x)
    x_values = np.linspace(xmin, xmax, 100)
    ax2.plot(x_values, gaussian(x_values, mu, sigma), label='real data')
    # density of the generated samples and the decision curve of the discriminator
    x_fake = G(sample_z(m))
    ax2.hist(x_fake.flatten(), bins=50, density=True, alpha=0.5, label='generated data')
    ax2.plot(x_values, sigmoid(D(x_values.reshape(-1, 1))), label='decision curve')
    ax2.legend()
    plt.show()
The left plot of show_result_gauss() draws the loss curves of D and G; the right plot draws the distribution of the
real data (blue curve), the distribution of the generated data (green shaded curve), and the decision curve (that is,
the probability predicted by the model that a real number in the interval is real data). Execute the function:
show_result_gauss()
Outputs the following graphics:
Figure 8-24 The blue curve is the true distribution of real data, the green line in the shaded area is the
distribution of the data generated by the generator, and the middle line is the decision curve
Because no training has been performed yet, the left plot is blank. The right plot shows that the distribution of the
real numbers produced by the generator is still very different from the distribution of the real data.
5. Training GAN
Call GAN's training function GAN_train(), passing in the discriminator and generator and their optimizers (D, G,
D_optimizer, G_optimizer), the data iterator data_it, the noise iterator noise_it, the binary cross-entropy function
BCE_loss_grad, and the training hyperparameters (iterations, reg), and start training D and G:
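A minimal sketch of this call; the optimizer construction and the hyperparameter values below are assumptions for illustration, not the author's exact settings:
# assumed optimizer API (SGD with learning rate) and assumed hyperparameter values
D_optimizer = SGD(D.parameters(), lr=1e-4)
G_optimizer = SGD(G.parameters(), lr=1e-4)
noise_it = noise_z_iterator(batch_size, z_dim)
d_steps, g_steps, reg, print_n = 1, 1, 1e-3, 500
iterations = 100000
D_losses, G_losses = GAN_train(D, G, D_optimizer, G_optimizer, data_it, noise_it,
                               BCE_loss_grad, iterations, reg,
                               show_result_gauss, d_steps, g_steps, print_n)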
During training, the loss curves of the intermediate model, the distributions of real and generated data, and the
decision curve are output every print_n = 500 iterations. Figure 8-25 shows the output of some of the iterative
steps:

2000 iter: D_loss 0.40058914689196146 G_loss 1.4307126427753376
4000 iter: D_loss 1.3457534057336877 G_loss 0.7859707963138415
7000 iter: D_loss 1.3266855320062865 G_loss 0.8275752348295592
13000 iter: D_loss 1.3860751575316943 G_loss 0.7022555897704553
45000 iter: D_loss 1.3859107668070463 G_loss 0.6946582988497471
95000 iter: D_loss 1.386978807433111 G_loss 0.694197914623761
Figure 8-25 Some intermediate results during the training process of a set of real GAN models
From these intermediate results it can be seen that the discriminator and the generator are engaged in an
adversarial process, and adjusting the training parameters so that they stay balanced is difficult. Incorrect
parameters make the training oscillate and fail to converge; if the generator is much stronger than the
discriminator, mode collapse can occur, that is, the generator cannot produce diverse data and almost always
generates the same data. Readers can lower the regularization parameter reg, modify the learning rate, or modify
the number of discriminator updates d_steps per iteration to observe these cases of non-convergence and mode
collapse.
x = c_x + a·cos(α)
y = c_y + b·sin(α)
where (c_x, c_y) is the center of the ellipse, a and b are the lengths of the major and minor axes, and α is the
angle parameter that determines the point (x, y) on the ellipse. The following function sample_ellipse() can
uniformly sample a set of coordinate points on the elliptic curve:
import numpy as np
import math
def sample_ellipse(m, a, b, cx=0, cy=0):
    alpha = np.random.uniform(0, 2*math.pi, m)
    x, y = cx + a*np.cos(alpha), cy + b*np.sin(alpha)
    x = x.reshape(m, 1)
    y = y.reshape(m, 1)
    return np.hstack((x, y))
Using the ellipse sampling function above, the following code samples 100 coordinate points from the ellipse
centered at (4, 4) with major and minor axis lengths 5 and 3, and then draws these points:
Figure 8-26 A group of sampled coordinate points on the ellipse centered at (4, 4) with major and minor axis
lengths 5 and 3
2. Real data iterator, noise iterator
You can use the sample_ellipse() function to define a data iterator to sample a set of coordinate points from the
ellipse:
a, b, cx, cy = 5, 3, 4, 4
batch_size = 64
def data_iterator_ellipse(batch_size):
    while True:
        yield sample_ellipse(batch_size, a, b, cx, cy)
data_it = data_iterator_ellipse(batch_size)
x = next(data_it)
print(x[:3])
output:
[[1.09671815 6.44244624]
[8.71292461 5.00189969]
[1.99319665 6.74776027]]
Still using the noise_z_iterator() function from before, define a noise iterator noise_it that generates noise
vectors:
batch_size = 64
z_dim = 2
noise_it = noise_z_iterator(batch_size, z_dim)
z = next(noise_it)
print(z[:3])
[[-0.12580991 -2.49903308]
[-0.36232861 0.95614813]
[-0.45110849 -1.30580063]]
The GAN modeling and training process is the same as for generating a set of real numbers; only a problem-
specific generator and discriminator need to be defined. Of course, for different problems the training parameters
of the generator and discriminator must also be adjusted accordingly (that is, parameter tuning).
from NeuralNetwork import *
#from util import *
from train import *
np.random.seed(0)
G_hidden,D_hidden = 10,10
z_dim = 2 # Dimensions of hidden variables
G = NeuralNetwork()
G.add_layer(Dense(z_dim, G_hidden))
G.add_layer(Leaky_relu(0.2)) #Relu()) #
G.add_layer(Dense(G_hidden, 2))
D = NeuralNetwork()
D.add_layer(Dense(2, D_hidden))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(D_hidden, 1))
def draw_loss(ax, D_losses=None, G_losses=None):
    ax.clear()
    if D_losses:
        i = np.arange(len(D_losses))
        ax.plot(i, D_losses, '-')
    if G_losses:
        ax.plot(i, G_losses, '-')
    ax.legend(['D_losses', 'G_losses'])
def show_ellipse_gan(D_losses=None, G_losses=None, m=100):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    draw_loss(ax1, D_losses, G_losses)
    ax2.clear()
    if True:      # scatter a fresh sample of real points
        data = sample_ellipse(100, a, b, cx, cy)
        ax2.scatter(data[:, 0], data[:, 1])
    else:         # or draw the elliptic curve itself
        alpha = np.linspace(0, 2*math.pi, 100)
        x, y = cx + a*np.cos(alpha), cy + b*np.sin(alpha)
        ax2.plot(x, y, label='real data')
    # scatter the generated coordinate points for comparison
    x_fake = G(sample_z(m, z_dim))
    ax2.scatter(x_fake[:, 0], x_fake[:, 1])
    plt.show()
Then define the optimizers D_optimizer and G_optimizer for the discriminator and the generator, with the
learning rate still 1e-4; set the numbers of updates d_steps and g_steps per iteration of the discriminator and
generator to 12 and 1 respectively, and set the regularization parameter reg = 1e-4. Start training:
from util import *

7000 iter: D_loss 1.2152731903087477 G_loss 0.9141720546508202
30000 iter: D_loss 1.0173900698057503 G_loss 1.1880948654376398
160000 iter: D_loss 1.3094760434222943 G_loss 0.9307732117439997
299500 iter: D_loss 1.350800595219167 G_loss 0.8432568162317724
Figure 8-27 Some intermediate results during the training process of a set of two-dimensional coordinate point
GAN models
# train_X: the MNIST images loaded earlier via the data_set module,
# flattened to shape (50000, 784) with pixel values in [0, 1]
print(np.min(train_X[0]), np.max(train_X[0]))
train_X = (train_X - 0.5)*2   # rescale pixel values from [0, 1] to [-1, 1]
print(np.min(train_X[0]), np.max(train_X[0]))
ds.draw_mnists(plt, train_X, range(10))
plt.show()
float32
(50000, 784)
int64
(50000,)
0.0 0.99609375
-1.0 0.9921875
batch_size = 32
data_it = data_iterator_X(train_X,batch_size,shuffle = True,repeat=True) #
noise_it = noise_z_iterator(batch_size, z_dim)
image_dim = 784
g_hidden_dim = 256
d_hidden_dim = 256
d_output_dim = 1
G = NeuralNetwork()
G.add_layer(Dense(z_dim, g_hidden_dim))
G.add_layer(Relu()) # Leaky_relu(0.2)) #
G.add_layer(Dense(g_hidden_dim, g_hidden_dim))
G.add_layer(Relu()) # Leaky_relu(0.2)) #
G.add_layer(Dense(g_hidden_dim, image_dim))
G.add_layer(Tanh())
D = NeuralNetwork()
D.add_layer(Dense(image_dim, d_hidden_dim))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(d_hidden_dim, d_hidden_dim))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(d_hidden_dim, d_output_dim))
4. Training the model
Define an auxiliary function show_result_mnist() that displays intermediate results:
def show_result_mnist(D_losses=None, G_losses=None, m=10):
    if D_losses and G_losses:
        i = np.arange(len(D_losses))
        plt.plot(i, D_losses, '-')
        plt.plot(i, G_losses, '-')
        plt.legend(['D_losses', 'G_losses'])
        plt.show()
    # draw m generated digit images
    z = np.random.randn(m, z_dim)
    x_fake = G(z)
    plot_images(x_fake, subplot_shape=[1, 10])
    plt.show()
import time
show_result = show_result_mnist
print_n = 500
start = time.time()
D_losses, G_losses = GAN_train(D, G, D_optimizer, G_optimizer, data_it, noise_it,
                               BCE_loss_grad, iterations, reg,
                               show_result, d_steps, g_steps, print_n)
done = time.time()
elapsed = done - start
print("Training time: %d seconds" % (elapsed))
The higher the number of iterations, the longer the training time.
1. Normalize the input data. If the image values are normalized to between -1 and 1, use tanh as the final
   output activation function of the generator.
2. Modify the loss function: in the original GAN paper the generator is trained by minimizing
   log(1 - D(G(z))); it is suggested to instead maximize log(D(G(z))).
3. Sample the generator's input noise from a Gaussian distribution instead of a uniform distribution.
4. Apply batch normalization (BatchNorm) to real data and generated data separately; do not batch-normalize
   data that mixes real and generated samples.
5. Avoid sparse gradients, that is, avoid activation functions or network layers that produce sparse gradients,
   such as ReLU and MaxPool; LeakyReLU is recommended instead.
6. Use soft or noisy labels. When training the discriminator, the real-data label can be a random number
   between 0.7 and 1.2 instead of 1, and the generated-data label a random number between 0.0 and 0.3
   instead of 0; occasionally flip the label of the generated data, e.g., from 0 to 1 (a small sketch follows the
   link below).
7. Regarding the Adam optimizer: it is recommended to use the SGD optimizer for the discriminator and the
   Adam optimizer for the generator.
8. When the loss of D tends to 0, D is too strong; when the variance of D's loss is large, training is not
   converging. When the loss of G keeps decreasing, G is too strong and mode collapse is likely. During
   training, the gradients of the model parameters can be checked; if their absolute values exceed 100,
   training will not converge.
https://github.com/soumith/ganhacks
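As an illustration of tip 6, soft and noisy labels take only a few lines; the 5% flip rate below is an assumed value:
# soft labels: real in [0.7, 1.2), fake in [0.0, 0.3)
y_real = np.random.uniform(0.7, 1.2, (batch_size, 1))
y_fake = np.random.uniform(0.0, 0.3, (batch_size, 1))
# occasionally flip some fake labels from 0 to 1 (assumed 5% flip rate)
flip = np.random.rand(batch_size) < 0.05
y_fake[flip] = 1.0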
8.6 GAN loss function and its probability explanation
The essence of GAN's adversarial networks is to make the distribution of generated data as consistent as possible
with the distribution of real data through adversarial learning, so that the distance between the two distributions is
as small as possible; the loss function of GAN is essentially a measure of the similarity of the two distributions.
Two common measures of the similarity between distributions are the Kullback-Leibler divergence (KL
divergence) and the Jensen-Shannon divergence (JS divergence).
Among them, p_r(x) and p_g(x) are the distributions of real data and generated data respectively. Introduce the
notation:

x̃ = D(x),  A = p_r(x),  B = p_g(x)

so that the integrand of the discriminator loss can be written as

f(x̃) = p_r(x) log(D(x)) + p_g(x) log(1 − D(x)) = A log x̃ + B log(1 − x̃)

Letting df(x̃)/dx̃ = 0, the extreme point of the discriminator's loss L(G, D) is obtained:

D*(x) = x̃* = A/(A + B) = p_r(x)/(p_r(x) + p_g(x)) ∈ [0, 1]
When the generator is optimal, that is, when the distribution of the generated data is exactly the same as that of
the real data, p_g = p_r, the extreme point of the discriminator's output is 1/2. At this time,

L(G, D*) = ∫_x (p_r(x) log(D*(x)) + p_g(x) log(1 − D*(x))) dx
         = log(1/2) ∫_x p_r(x) dx + log(1/2) ∫_x p_g(x) dx
         = −2 log 2
KL divergence describes the degree to which a probability distribution p deviates from q. For an x, if
p(x) = q(x), then log(p(x)/q(x)) = log 1 = 0; if p(x) ≠ q(x), then log(p(x)/q(x)) ≠ 0. When p and q are equal
everywhere, D_KL(p ∥ q) = 0; otherwise it can be proved that D_KL(p ∥ q) > 0. Therefore, when the two
distributions are exactly the same or almost the same (p(x) = q(x) almost everywhere), the KL divergence attains
its minimum value 0.
For example, for the 2 discrete probability distributions shown in Figure 8-28:
Figure 8-28 The discrete probability distributions of the left and right graphs are (0.36, 0.48, 0.16) and (0.333,
0.333, 0.333) respectively
That is, the probability distributions of p and q are (0.36, 0.48, 0.16) and (0.333, 0.333, 0.333) respectively. Their
KL divergence is:

D_KL(p ∥ q) = Σ_{x∈X} p(x) log(p(x)/q(x))
            = 0.36 log(0.36/0.333) + 0.48 log(0.48/0.333) + 0.16 log(0.16/0.333) = 0.0863
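This value is easy to verify numerically:
import numpy as np

p = np.array([0.36, 0.48, 0.16])
q = np.array([0.333, 0.333, 0.333])
print(np.sum(p * np.log(p / q)))   # ≈ 0.0863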
KL divergence is asymmetrical and can lead to erroneous results when measuring the similarity between two
equally important distributions.
The integrand of D_KL(p ∥ q) is shown in the right subgraph of Figure 8-29; the KL divergence is the sum of the
positive and negative areas of the shaded part:
Figure 8-29 D_KL(p ∥ q) is the sum of the positive and negative areas of the shaded part
For two Gaussian distributions N(μ_1, σ_1²) and N(μ_2, σ_2²), the KL divergence has a closed form:

D_KL(N(μ_1, σ_1²) ∥ N(μ_2, σ_2²)) = (1/2) log(2πσ_2²) + (σ_1² + (μ_1 − μ_2)²)/(2σ_2²) − (1/2)(1 + log 2πσ_1²)
                                  = log(σ_2/σ_1) + (σ_1² + (μ_1 − μ_2)²)/(2σ_2²) − 1/2
If one distribution is fixed, say q(x) = N(0, σ²) with σ = 2, and p(x) = N(μ, σ²) varies freely with μ, the
following code plots the KL divergence value for different values of μ:
import math
import matplotlib.pyplot as plt
import numpy as np
# if using a jupyter notebook
%matplotlib inline
def KL(mu1, sigma1, mu2, sigma2):
    return math.log(sigma2/sigma1) + (sigma1**2 + (mu1-mu2)**2)/(2*sigma2**2) - 1/2
mus = np.arange(-12, 12, 0.1)
kl_values = [KL(mu, 2, 0, 2) for mu in mus]
plt.plot(mus, kl_values)
plt.xlabel('$\mu$')
plt.ylabel('KL')
plt.legend(['KL Value'], loc='upper center')
plt.show()
It can be seen that when μ = 0, that is, when p(x) and q(x) are the same distribution, the KL divergence attains its
minimum.
Unlike the KL divergence, the JS divergence is symmetric in p and q, that is, p and q are equally important, and it
is smoother than the KL divergence. It is defined via the average distribution m = (p + q)/2:

D_JS(p ∥ q) = (1/2) D_KL(p ∥ m) + (1/2) D_KL(q ∥ m)
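A quick numerical check of the symmetry, reusing the discrete distributions p and q from above:
def KL_discrete(p, q):
    return np.sum(p * np.log(p / q))

def JS(p, q):
    m = (p + q) / 2
    return 0.5 * KL_discrete(p, m) + 0.5 * KL_discrete(q, m)

print(JS(p, q), JS(q, p))   # the two values are equal: JS is symmetric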
Figure 8-31 p and q are the Gaussian distributions N(0, 1) and N(1, 1) respectively, and m = (p + q)/2 is the
average of the two distributions. The KL divergence D_KL is asymmetric, but the JS divergence D_JS is
symmetric. The upper left corner shows the two probability distributions p(x) and q(x); the upper right corner
shows the integrands of KL(p ∥ q) and KL(q ∥ p); the lower left corner shows the integrands of KL(p ∥ m) and
KL(q ∥ m); and the lower right corner shows the integrand of JS(p ∥ q).
D_JS(p_r ∥ p_g) = (1/2)(log 2 + ∫_x p_r(x) log (p_r(x)/(p_r(x) + p_g(x))) dx)
                + (1/2)(log 2 + ∫_x p_g(x) log (p_g(x)/(p_r(x) + p_g(x))) dx)
                = (1/2)(log 4 + L(G, D*))
It can be seen that when the discriminator is optimal, the JS divergence D_JS(p_r ∥ p_g) and L(G, D*) differ
only by a constant log 4. Therefore, the loss function of GAN quantifies the similarity between the generated data
distribution and the real data distribution:

L(G, D*) = 2 D_JS(p_r ∥ p_g) − 2 log 2

When the generator is optimal, that is, when the generated data distribution is exactly the same as the real data
distribution, the first term is 0, and the loss function value when both the generator and the discriminator are
optimal is:

L(G*, D*) = 0 − 2 log 2 = −2 log 2
8.6.3 Maximum likelihood interpretation of GAN
The above explains, from the requirement that the distributions of real and generated data be as consistent as
possible, the relationship between the GAN loss function and the JS divergence, where the JS divergence is the
sum of two KL divergences. The relationship between the GAN loss function and the KL divergence can also be
derived from the perspective of maximum likelihood estimation.
For a set of real data x_1, x_2, ⋯, x_n obeying the distribution p_r, the probabilities that the generator G(θ)
generates these real data are p_θ(x_1), p_θ(x_2), ⋯, p_θ(x_n). If G(θ) is optimal, it should generate these real
data with the maximum probability (likelihood), that is, we look for the generator parameter θ that attains the
following maximum:

arg max_θ p(θ; x_1, …, x_n) = arg max_θ ∏_{i=1}^n p_θ(x_i)
Similarly, in order to improve the stability of the calculation, the logarithm of the above probability product can be
used to replace the product, and the extreme point of the function will not be changed. So the problem boils down
to finding:
arg max_θ log p(θ; x_1, …, x_n) = arg max_θ log ∏_{i=1}^n p_θ(x_i) = arg max_θ Σ_{i=1}^n log p_θ(x_i)
Because these x_i are real data, they obey the real data distribution p_r. Assuming n tends to infinity, the sum on
the right of the above formula can be expressed in integral form:

arg max_θ Σ_{i=1}^n log p_θ(x_i) = arg max_θ ∫_x p_r(x) log p_θ(x) dx

arg max_θ ∫_x p_r(x) log p_θ(x) dx = arg max_θ (−∫_x p_r(x) log p_r(x) dx + ∫_x p_r(x) log p_θ(x) dx)
                                   = arg min_θ ∫_x p_r(x) log (p_r(x)/p_θ(x)) dx
Therefore, letting the generator G(θ) maximize the likelihood of the real data is equivalent to minimizing the KL
divergence between the real data distribution p_r and the generated data distribution p_θ above.
With high probability, the distribution generated by a GAN does not overlap with the real data distribution; this is
one of the reasons why the original GAN is difficult to train. The authors of WGAN proposed replacing the JS
divergence of the original GAN with the Wasserstein distance to characterize the distance between the two
distributions. The Wasserstein distance is also known as the Earth Mover's distance (EM distance), or "bulldozer
distance". The EM distance does not directly measure the difference between the probability densities of the
random variables of two distributions; instead, it measures the energy consumed to transform one distribution
into another. If a distribution p_1 can be transformed into the target distribution q with less energy than another
distribution p_2, then the EM distance between p_1 and q is smaller than the EM distance between p_2 and q.
Think of the distributions p(x) and q(y) as two piles of soil: the bulldozer distance measures the work needed to
transform the pile shaped like p(x) into the pile shaped like q(y) through some movement scheme.
As shown in Figure 8-32 a), the probabilities of the colored random variable at x = 1 and x = 8 are 3/4 and 1/4
respectively; this probability mass can be regarded as piles of soil or bricks at x = 1 and x = 8. The probabilities of
the white random variable at y = 5 and y = 6 are 2/4 and 2/4 respectively; this can likewise be regarded as piles of
soil or bricks at y = 5 and y = 6. The table below Figure 8-32 a) shows the distance ∥x − y∥ for each combination
of x and y.
Figure 8-32 a) The probabilities of the colored random variable at x = 1 and 8 are 3/4 and 1/4 respectively, and
the probabilities of the white random variable at y = 5 and 6 are 2/4 and 2/4 respectively; the table below shows
the moving distance ∥x − y∥. b) Moving plan γ1 c) Moving plan γ2
To transform p(x) into q(y), this pile of soil or bricks must be moved. For example, the soil of p(x) can be
transformed into the target q(y) according to the moving plan in Figure 8-32 b), or according to the moving plan
in Figure 8-32 c), whose moving cost is:

2/4 · (6 − 1) + 1/4 · (5 − 1) + 1/4 · (8 − 5) = (10 + 4 + 3)/4 = 17/4
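The cost of a moving plan is easy to check numerically; a minimal sketch for plan c) above:
import numpy as np

xs = np.array([1., 1., 8.])            # source positions of the moved mass
ys = np.array([6., 5., 5.])            # target positions
mass = np.array([2/4, 1/4, 1/4])       # amount of mass moved in each entry of the plan
print(np.sum(mass * np.abs(xs - ys)))  # 4.25 = 17/4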
Different moving plans have different costs. Use γ to denote a moving plan: γ(x, y) is the amount of soil
transported from x to y, ∥x − y∥ is the moving distance, and γ(x, y) · ∥x − y∥ is the cost of transporting the amount
γ(x, y) from x to y. The cost of the whole moving plan is the sum of all these terms:

Σ γ(x, y) ∥x − y∥

γ(x, y) can be expressed as the percentage of the transported amount relative to the total amount, and this
percentage is equivalent to a probability: every γ(x, y) is greater than or equal to 0, and the sum Σ_{x,y} γ(x, y) is
1, satisfying the conditions of a probability. That is, γ(x, y) is a joint probability distribution of the random
variables (x, y), and the transport distance ∥x − y∥ is a function of (x, y). For a moving plan γ, the moving cost is
then the mathematical expectation of ∥x − y∥ under the probability γ(x, y):

Σ γ(x, y) ∥x − y∥ = E_{(x,y)∼γ}[∥x − y∥]
The bulldozer distance from p(x) to q(y) is defined as the minimum moving cost over all possible moving plans;
the more precise mathematical term is the infimum of the moving costs of all possible moving plans, whose
mathematical symbol is inf. Therefore, the bulldozer distance is defined as:

W(p, q) = inf_{γ∈Π(p,q)} E_{(x,y)∼γ}[∥x − y∥]

where Π(p, q) is the set of all possible moving plans, and γ ∈ Π(p, q) is one plan that moves p(x) into q(y).
Let p_r and p_g be the probability densities of real data and generated data respectively, and let γ be a moving
plan that transforms the distribution p_r into p_g; then the bulldozer distance between these two distributions is:

W(p_r, p_g) = inf_{γ∈Π(p_r, p_g)} E_{(x,y)∼γ}[∥x − y∥]

It represents the minimum cost required to transform the distribution p_r into the distribution p_g. The bulldozer
distance is symmetric, so it also represents the minimum cost of transforming the distribution p_g into p_r.
Computing this distance directly is infeasible because the infinitely many moving plans γ cannot be enumerated.
Through a sophisticated mathematical derivation called Kantorovich-Rubinstein duality, it is transformed into the
following distance calculation:

W(p_r, p_θ) = sup_{∥f∥_L ≤ 1} E_{x∼p_r}[f(x)] − E_{x∼p_θ}[f(x)]    (8-29)

Among them, sup is the supremum, that is, the least upper bound of all values, and f is a 1-Lipschitz function,
that is, a function satisfying the following condition:

|f(x_1) − f(x_2)| ≤ ∥x_1 − x_2∥
For such a function f, E_{x∼p_r} f(x) is the expectation (mean value) of f(x) when x obeys the real data
distribution p_r, and E_{x∼p_θ} f(x) is the expectation of f(x) when x obeys the generated data distribution p_θ.
Therefore, E_{x∼p_r} f(x) can be estimated simply by the average of f(x) over some real data, and
E_{x∼p_θ} f(x) by the average of f(x) over some generated data:

E_{x∼p_r} f(x) ≈ (1/m) Σ_{i=1}^m f(real_x_i)
E_{x∼p_θ} f(x) ≈ (1/m) Σ_{i=1}^m f(fake_x_i)

where real_x_i and fake_x_i are some real and generated data respectively. Therefore, the bulldozer distance can
be estimated by the difference between these two averages.
In WGAN training, f(x) is the neural network function of the discriminator (critic), and it must be guaranteed to
satisfy the 1-Lipschitz condition above. In the original WGAN paper this is ensured by the practical trick of
weight clipping, which limits the weight parameters to a range [−c, c]. Usually c = 0.01; c = 0.1 or c = 0.001 are
also used, that is, c is another parameter that needs tuning.
For the discriminator, approaching the supremum in formula (8-29) means maximizing
E_{x∼p_r} f(x) − E_{x∼p_θ} f(x); its parameters (such as w) are updated by the gradient ascent method, making
the Wasserstein distance estimate as large as possible so as to improve the ability to distinguish real data from
generated data. The generator wants to minimize the Wasserstein distance, so its parameters (such as θ) are
updated by the gradient descent method. For the generator, the first term of (8-29) has nothing to do with it, so it
only needs to minimize −E_{x∼p_θ} f(x).
The loss function of WGAN is simply the mean of f(x) or −f(x), so the gradient of the loss with respect to f(x) is
the constant 1 (or −1), divided by the batch size when averaging. Therefore, the WGAN loss function and its
gradient are simpler to compute: only the code that computes the loss and its gradient with respect to f(x) in the
GAN code needs slight modification. The pseudocode of the WGAN algorithm is shown in Figure 8-33.
Later, improved WGANs were proposed, such as Improved WGAN (WGAN-GP), which adds a gradient penalty
term to the loss function instead of clipping the weight parameters:

L(p_r, p_g) = E_{x̃∼p_g}[f(x̃)] − E_{x∼p_r}[f(x)] + E_{x̂∼p_x̂}[(∥∇_x̂ f(x̂)∥_2 − 1)²]

Among them, E_{x̃∼p_g}[f(x̃)] − E_{x∼p_r}[f(x)] is the negative Wasserstein distance −W(p_r, p_g), and
E_{x̂∼p_x̂}[(∥∇_x̂ f(x̂)∥_2 − 1)²] is the gradient penalty term, which pushes the norm of the gradient toward unit
length as much as possible, thereby preventing the gradient from exploding or vanishing.
Weight clipping mainly prevents the weight parameters from becoming too large, thereby ensuring that the neural
network function remains (approximately) a 1-Lipschitz function. The gradient penalty term is similar to the
parameter regularization term seen earlier: it also prevents gradient explosion from driving the parameters up
during gradient updates, and limiting the gradient to a certain range likewise keeps the model parameters within a
certain range.
Recent practice has shown that WGAN and WGAN-GP are not actually superior to GAN, so in practice people
are still accustomed to using the original GAN.
The per-pass WGAN training functions only slightly modify D_train() and G_train(); a completed sketch (the
wrapper name WGAN_D_train and the parameters() accessor used for weight clipping are assumptions):

def WGAN_D_train(D, D_optimizer, x_real, x_fake, clip_value, reg=1e-3):
    D_optimizer.zero_grad()
    m = len(x_real)
    f_real = D(x_real)
    real_loss = np.mean(f_real)
    D.backward(-(1/m)*np.ones(f_real.shape), reg)   # ascend on E[f(real)]
    f_fake = D(x_fake)
    assert(f_fake.size == f_real.size)
    fake_loss = np.mean(f_fake)
    fake_grad = (1/m)*np.ones(f_fake.shape)
    D.backward(fake_grad, reg)                      # descend on E[f(fake)]
    loss = (real_loss - fake_loss)
    # 4. Update the gradient, then clip the weights to [-c, c] (1-Lipschitz)
    D_optimizer.step()
    for p in D.parameters():                        # parameters() accessor is assumed
        np.clip(p, -clip_value, clip_value, out=p)
    return loss

def WGAN_G_train(D, G, G_optimizer, z, clip_value, reg=1e-3):
    G_optimizer.zero_grad()
    f_fake = D(G(z))
    grad = D.backward(-(1/len(z))*np.ones(f_fake.shape))  # minimize -E[f(G(z))]
    G.backward(grad, reg)
    G_optimizer.step()
    return -np.mean(f_fake)
def WGAN_train(D, G, D_optimizer, G_optimizer, real_dataset, noise_z, iterations=10000, reg=1e-3,
               clip_value=0.01, n_critic=4, show_result=None, print_n=20):
    iter = 0
    D_losses = []
    G_losses = []
    while iter < iterations:
        # training discriminator (critic)
        x_real = next(real_dataset)
        x_fake = G(next(noise_z))
        D_loss = WGAN_D_train(D, D_optimizer, x_real, x_fake, clip_value, reg)
        # training generator once every n_critic critic updates
        if iter % n_critic == 0:
            G_loss = WGAN_G_train(D, G, G_optimizer, next(noise_z), clip_value, reg)
        if iter % print_n == 0:
            print(iter, "iter:", "D_loss", D_loss, "G_loss", G_loss)
            D_losses.append(D_loss)
            G_losses.append(G_loss)
            if show_result:
                show_result(D, G, D_losses, G_losses)
        iter += 1
    return D_losses, G_losses
For the GAN model on a set of real numbers from Section 8.5.1, this WGAN loss function can be used to train
the model; the code is as follows:
np.random.seed(0)
hidden = 4
D = NeuralNetwork()
D.add_layer(Dense(1, hidden))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(hidden, 1))
#D.add_layer(Sigmoid())
G = NeuralNetwork()
z_dim = 1 #Dimensionality of hidden variables
G.add_layer(Dense(z_dim, hidden))
G.add_layer(Leaky_relu(0.2)) #Relu())
G.add_layer(Dense(hidden, 1))
...
500 iter: D_loss 0.0011615003991261863 G_loss -0.01023799847202126
27000 iter: D_loss -8.578426744787482e-07 G_loss -0.009488053321470099
...
90000 iter: D_loss 1.3109091936969186e-11 G_loss -0.009999930339860416
...
199500 iter: D_loss 4.4971589611975116e-09 G_loss -0.01000896809522164
Figure 8-33 The results of WGAN training for a set of real number problems
D_loss is the estimated Wasserstein distance between the distributions of generated and real data; it converges to
0 with the iterations, indicating that the two distributions gradually coincide.
The discriminator is a binary classification function that can be represented by a convolutional neural network:
through convolution (and pooling) operations it progressively reduces the resolution of the image, until a final
fully connected layer produces a score representing the binary classification. But how does the generator
transform a low-dimensional one-dimensional hidden vector into a high-dimensional multi-channel image
(feature map)?
An ordinary convolution operation is a kind of downsampling: it can convert high-resolution feature maps into
low-resolution ones, but it cannot convert a low-dimensional hidden vector into a high-dimensional image. The
transposed convolution (transposed convolutions) operation does the opposite: it is an upsampling (upscaling)
operation that can convert low-resolution feature maps into high-resolution ones. Transposed convolution is also
called fractionally-strided convolution, and some literature calls it deconvolution, although this "deconvolution"
is not the same concept as deconvolution in mathematics; therefore, the first two terms are generally used. As
shown in Figure 8-35, four transposed convolutional layers are used to convert a vector of length 100 into a
3 × 64 × 64 color image.
Figure 8-35. Four transposed convolutional layers transform a vector of length 100 into a color image of
3 × 64 × 64.
To make training more stable, the DCGAN paper also made several improvements: the fully connected layers are
removed from the discriminator network and pooling is replaced by strided convolution; both the generator and
the discriminator use batch normalization (batchnorm); the generator uses the ReLU activation in all layers
except the output layer, which uses tanh; and all discriminator layers use the LeakyReLU activation.
In order to implement DCGAN, transposed convolution must be implemented first. The principle and
implementation of transposed convolution are discussed below.
For an input vector x = (x_0, x_1, x_2, x_3, x_4) of length 5, perform a 1D convolution with a kernel of width 3,
a stride of 1, and a padding of 0; the process is shown in Figure 8-36:
Figure 8-36 A 1D convolution on an input vector x = (x_0, x_1, x_2, x_3, x_4) of length 5 with kernel width 3,
stride 1, and padding 0
If the input length is n, the kernel width is k, the stride is s, and the left and right padding is p, the length of the
result tensor produced by the convolution is o = (n − k + 2p)/s + 1. For the example above, the resulting length is
o = (5 − 3 + 0)/1 + 1 = 3. That is, without padding, the convolution output is usually shorter than the input.
A convolution computes one output value by multiplying and accumulating a data block of the same shape as the
convolution kernel with the kernel, that is, one data block and the kernel together produce one output value. The
kernel moves along the data by strides, producing one output value for each data block it encounters.
Transposed convolution is just the opposite: for each element of the input tensor, that element is multiplied by
each element of the convolution kernel, producing an output of the same shape as the kernel; that is, each input
element produces as many output values as there are kernel elements, as shown in Figure 8-37:
Figure 8-37 Transposed convolution: each element of the input tensor is multiplied by each element of the
convolution kernel, producing an output of the same shape as the convolution kernel
The convolution kernel of width 3 is aligned with the first element x_0 of the input x = (x_0, x_1, x_2), and x_0
is multiplied by each element of the kernel; since the kernel width is 3, this produces 3 output values. In a
transposed convolution with a stride of 1, the kernel then slides to x_1 and again produces 3 output values, and so
on until the last element of the input, as shown in Figure 8-38:
Figure 8-38 A convolution kernel with a width of 3 acts on an input one-dimensional tensor with a length of three
to generate a one-dimensional tensor composed of five elements
Figure 8-39 Transposed convolution process of convolution kernel (1,2,-1) and input one-dimensional tensor
(5,15,12)
As shown in Figure 8-40, in the transposed convolution operation, the three values output by element-by-element
multiplication of each element and the convolution kernel are accumulated to the output vector elements at the
corresponding positions.
Figure 8-40 The three values output by multiplying each element and the convolution kernel element by element
are accumulated to the elements of the output tensor at the corresponding position
If the output z of the convolution in Figure 8-36 is used as the input of a transposed convolution, and x is used as
the output of the transposed convolution, the transposed computation process is shown in Figure 8-41:
Figure 8-41 Transposed convolution is the reverse process of convolution. The output z of the convolution in
Figure 8-36 is used as the input of the transposed convolution, and the transposed convolution produces an output
with the same shape as the convolution input.
It can be seen that the computation of transposed convolution is the reverse of the convolution process, just as
the reverse derivation of convolution is. Therefore, the computation of transposed convolution is completely
analogous to the reverse derivation of convolution: each input value is distributed and accumulated into the
output vector through the convolution kernel.
From the relationship between a convolution's output and its input length, stride, and padding, the corresponding
relationship for transposed convolution can be derived: if the input tensor length is o, the stride is s, and the left
and right padding is p, the length of the result tensor produced by the transposed convolution is
n = (o − 1)·s + k − 2p. For the transposed convolution in the example above, the resulting length is
(3 − 1)·1 + 3 − 0 = 5.
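Both length formulas are easy to check in code; a minimal helper:
def conv_out_len(n, k, s=1, p=0):
    # ordinary convolution: o = (n - k + 2p)/s + 1
    return (n - k + 2*p) // s + 1

def convT_out_len(o, k, s=1, p=0):
    # transposed convolution: n = (o - 1)*s + k - 2p
    return (o - 1)*s + k - 2*p

print(conv_out_len(5, 3))    # 3
print(convT_out_len(3, 3))   # 5, recovering the original input length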
Convolution can use matrix multiplication to realize its forward calculation and reverse derivation. Therefore,
transposed convolution can also use matrix multiplication to realize its forward calculation and reverse derivation.
The forward calculation of transposed convolution is completely similar to the reverse derivation of convolution,
and the reverse derivation of transposed convolution is similar to the forward calculation of convolution.
Looking back at the reverse derivation of 1D convolution in Section 6.3.3, the calculation formula is:

dx_row = dz_row K_col^T

from which the matrix multiplication formula for the forward calculation of transposed convolution is obtained:

z_row = x_row K_col^T

where x_row is the (flattened) input of the transposed convolution, z_row is its output, and K_col is the
column-vector representation of the convolution kernel. For the specific example above, the calculation process
is:
z_row = x_row k_col^T = [x_0; x_1; x_2] [k_0 k_1 k_2] = [5; 15; 12] [1 2 −1] =
⎡  5  10  −5 ⎤
⎢ 15  30 −15 ⎥
⎣ 12  24 −12 ⎦
As in the reverse derivation of convolution, each row of this flattened z_row represents one allocation
computation, and each row must be accumulated into the corresponding positions of the final output z (as shown
in Figure 8-40). The transformation of z_row into z can be completed with the function row2im(), the same
function that the convolution reverse derivation uses to transform dx_row into dx.
Therefore, the forward calculation of the transposed convolution is generally performed according to the reverse
derivation process of the convolution. Similarly, the reverse derivation of the transposed convolution can be
performed according to the forward calculation process of the convolution. As an exercise, readers can try to
write the code for the forward calculation and reverse derivation of 1D transposed convolution.
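As a starting point for that exercise, here is a minimal sketch of a 1D transposed convolution using the scatter-and-accumulate view described above (the function name is hypothetical):
import numpy as np

def conv1d_transpose(x, k, s=1, p=0):
    # each input element scatters a scaled copy of the kernel into the output
    n, kw = len(x), len(k)
    out = np.zeros((n - 1) * s + kw)
    for i, xi in enumerate(x):
        out[i*s : i*s + kw] += xi * k
    # padding removes p elements from each end of the full output
    return out[p:len(out)-p] if p > 0 else out

x = np.array([5., 15., 12.])
k = np.array([1., 2., -1.])
print(conv1d_transpose(x, k))   # [  5.  25.  37.   9. -12.]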
Let's look at some transposed convolutions with different strides and paddings. Figure 8-42 shows a transposed
convolution with an input length of 3, a convolution kernel length of 3, a stride of 2, and a padding of 0:
Figure 8-42 Transposed convolution with input length 3, convolution kernel length 3, stride 2, and padding 0
Figure 8-43 shows a transposed convolution with an input length of 3, a convolution kernel length of 3, a stride
of 2, and a left and right padding of 1:
Figure 8-43 Transposed convolution with an input length of 3, a convolution kernel length of 3, a stride of 2, and
a left and right padding length of 1
It can be seen that when the padding length is 1, the leftmost and rightmost elements of the output are not
counted in the output tensor. By viewing transposed convolution as the inverse process of convolution, its
calculation process including stride and padding can be understood.
As shown in Figure 8-44, a 4 × 4 input convolved with a 3 × 3 kernel (stride 1, padding 0) produces an output of
shape 2 × 2. The same figure can also be read with the 2 × 2 matrix above as the input and the matrix below as
the output: a 2 × 2 input goes through a transposed convolution with a 3 × 3 kernel, a stride of 1, and a padding
of 0 to produce a 4 × 4 output.
Figure 8-44 2D transposed convolution is the reverse process of 2D convolution: the input is the 2 × 2 matrix
above, and after the 3 × 3 convolution kernel, the 4 × 4 matrix below is output
The matrix multiplication implementation of 2D transposed convolution is the same as that of 1D transposed
convolution: the reverse derivation and forward calculation processes of the corresponding 2D convolution are
used to implement the forward calculation and reverse derivation of the 2D transposed convolution, respectively.
Therefore, it suffices to modify the previously implemented convolution class Conv_fast, turning the backward()
and forward() methods of Conv_fast into the forward() and backward() methods of the transposed convolution
class Conv_transpose. For example, the input x can be regarded as the gradient of a loss function with respect to
a convolution output; it is first flattened into a matrix of shape (N·oH·oW, F), whose second axis is the number
of channels and whose first axis enumerates the elements, giving the matrix form x_row. The code is:
X_row = X.transpose(0,2,3,1).reshape(-1,F)
Then, according to the reverse derivation formula, the flattened matrix Z_row of the output of this transposed
convolution can be calculated, namely:
Z_row = np.dot(X_row,K_col.T)
Finally, following the reverse derivation process of convolution, each row of Z_row must be accumulated into
the final output Z. This can be done directly with the row2im() or row2im_indices() function:
Similarly, the reverse derivation of transposed convolution is similar to the forward calculation of convolution.
The gradient dZ of the loss function with respect to Z is first flattened by the im2row() or im2row_indices()
function into a matrix dZ_row, each row of which represents one data block; then the gradient with respect to the
transposed convolution's input X is computed:
dX_row = dZ_row @ K_col
This yields the flattened matrix dX_row of the gradient with respect to X; finally, this flattened matrix is
reshaped into a four-dimensional tensor of the same shape as X, namely:
dX = dX_row.reshape(N,self.H,self.W,self.C)
dX = dX.transpose(0,3,1,2)
Similarly, the gradient dK_col for the kernel K is computed in the same way, and it is then reshaped into the
same shape as K:
dK_col = self.X_row.T@dZ_row
dK = dK_col.reshape(self.K.shape)
According to the above analysis, the following transposed convolution class Conv_transpose can be written:
class Conv_transpose():
def __init__(self, in_channels, out_channels, kernel_size, stride=1,padding=0):
super().__init__()
self.C = in_channels
self.F = out_channels
self.kH = kernel_size
self.kW = kernel_size
self.S = stride
self.P = padding
# filters is a 3d array with dimensions (num_filters, self.K, self.K)
# you can also use Xavier Initialization.
#self.K = np.random.randn(self.F, self.C, self.kH, self.kW)
#/(self.K*self.K)
# self.K = np.random.randn(self.C, self.F, self.kH, self.kW)
#/(self.K*self.K)
self.K = np.random.normal(0,1,(self.C, self.F, self.kH, self.kW))
self.b = np.zeros((1,self.F)) #,1))
self.params = [self.K,self.b]
self.grads = [np.zeros_like(self.K),np.zeros_like(self.b)]
self.X = None
self.reset_parameters()
def reset_parameters(self):
kaiming_uniform(self.K, a=math.sqrt(5))
if self.b is not None:
fan_in, _ = calculate_fan_in_and_fan_out(self.K)
#fan_in = self.F
bound = 1 / math.sqrt(fan_in)
self.b[:] = np.random.uniform(-bound,bound,(self.b.shape))
def forward(self,X):
'''
X: (N,C,H,W)
K: (F,C,kH,kW)
Z: (N,F,oH,oW)
X_row: (N*oH*oW, C*kH*kW)
K_col: (C*kH*kW, F)
Z_row = X_row*K_col: (N*oH*oW, C*kH*kW)*(C*kH*kW, F) = (N*oH*oW, F)
K = self.K
# Convert (N,F,oH,oW) to (N,oH,oW,F) and flatten to (-1,F)
F = X.shape[1]
#assert(F==self.F)
X_row = X.transpose(0,2,3,1).reshape(-1,F) #(N*oH*oW,F)
K_col = K.reshape(K.shape[0],-1).transpose() #Flattening
Z_row = np.dot(X_row,K_col.T)
Z_shape = (self.N,self.F,self.oH,self.oW)
Z = row2im_indices(Z_row,Z_shape,self.kH,self.kW,S =self.S,P = self.P)
self.b = self.b.reshape(1,self.F,1,1)
Z+= self.b
self.X_row = X_row
return Z
def __call__(self,X):
return self.forward(X)
def backward(self,dZ):
N,F,oH,oW = dZ.shape[0], dZ.shape[1],dZ.shape[2], dZ.shape[3]
S,P,kH,kW = self.S, self.P,self.kH,self.kW
dZ_row = im2row_indices(dZ,self.kH,self.kW,S=self.S,P=self.P)
K_col = self.K.reshape(self.K.shape[0],-1).transpose() #Flattening
db = np.sum(dZ,axis=(0,2,3))
db = db.reshape(-1,F)
# (N*H*W, C)
dX = dX_row.reshape(N,self.H,self.W,self.C)
dX = dX.transpose(0,3,1,2)
self.grads[0] += dK
self.grads[1] += db
return dX
def reg_loss(self,reg):
return reg*np.sum(self.K**2)
def reg_loss_grad(self,reg):
self.grads[0]+= 2*reg * self.K
return reg*np.sum(self.K**2)
Note: People sometimes simulate the computation of transposed convolution with an ordinary convolution
operation, but this simulation is complicated and computationally expensive, so it has little practical significance.
Interested readers can refer to the following URL:
http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
First, still read the MNIST handwritten digit set as training samples:
import data_set as ds
import matplotlib.pyplot as plt
%matplotlib inline
# train_X: MNIST images loaded via the data_set module (shape (50000, 784), values in [0, 1])
ds.draw_mnists(plt, train_X, range(10))
plt.show()
train_X = train_X.reshape(train_X.shape[0], 1, 28, 28)
print(train_X.shape)
float32
(50000, 784)
int64
(50000,)
Then use the transposed convolution class and the convolution class and other network layer classes to define the
neural networks G and D representing the generator and discriminator respectively:
np.random.seed(100)
random_name = 'no'
random_value = 0.01
G = NeuralNetwork()
z_dim = 100
ngf=28
ndf=28
nc=1
def weights_init(layer):
    classname = layer.__class__.__name__
    if classname.find('Conv') != -1:
        W = layer.params[0]
        W[:] = np.random.normal(0.0, 0.02, (W.shape))
    elif classname.find('BatchNorm') != -1:
        W = layer.params[0]
        W[:] = np.random.normal(1.0, 0.02, (W.shape))
        b = layer.params[1]
        b[:] = 0
G.apply(weights_init)
Finally, train the DCGAN network model with the previous GAN training process, where show_result_mnist() is
an auxiliary function to display intermediate results.
def show_result_mnist(D_losses=None, G_losses=None, m=10):
    #fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    #ax1.clear()
    if D_losses and G_losses:
        i = np.arange(len(D_losses))
        plt.plot(i, D_losses, '-')
        plt.plot(i, G_losses, '-')
        plt.legend(['D_losses', 'G_losses'])
        plt.show()
    ##ax2.clear()
    z = np.random.randn(m, z_dim)
    x_fake = G(z)
    ds.draw_mnists(plt, x_fake, range(m))
    plt.show()
#noise_it = iter(Noise_z(batch_size,z_dim))
noise_it = noise_z_iterator(batch_size, z_dim)
iterations = 1500
start = time.time()
loss_fn = BCE_loss_grad
n_epoch = 20 #200
print_n = 20
for epoch in range(1, n_epoch+1):
    D_losses, G_losses = [], []
    data_it = data_iterator_X(train_X, batch_size, shuffle=True, repeat=False)
    for batch_idx, x_real in enumerate(data_it):
        x_fake = G(next(noise_it))
        D_loss = D_train(D, D_optimizer, x_real, x_fake, loss_fn, reg)
        G_loss = G_train(D, G, G_optimizer, next(noise_it), loss_fn, reg)
        D_losses.append(D_loss)
        G_losses.append(G_loss)
        #if batch_idx>10: break
        if batch_idx % print_n == 0:
            print('[%d/%d]: loss_d: %.3f, loss_g: %.3f' % (
                batch_idx, epoch, np.mean(np.array(D_losses)),
                np.mean(np.array(G_losses))))
    show_result_mnist(D_losses, G_losses)
    D.save_parameters('MNIST_DCGAN_D_params.npy')
    G.save_parameters('MNIST_DCGAN_G_params.npy')
    print('[%d/%d]: loss_d: %.3f, loss_g: %.3f' % (
        epoch, n_epoch, np.mean(np.array(D_losses)),
        np.mean(np.array(G_losses))))
done = time.time()
elapsed = done - start
print("Training time: %d seconds" % (elapsed))
Please download the complete code from the author's blog (https://hwdong-net.github.io).
Figure 8-45 shows intermediate results from the 11th epoch of the training process.
References:
[1] Saito Yasuhiro. Introduction to Deep Learning: Theory and Implementation Based on Python [M]. Beijing:
People's Posts and Telecommunications Press, 2018.
[2] Nielsen, Michael A. Neural Networks and Deep Learning [M]. San Francisco, CA: Determination Press,
2015. Website: http://neuralnetworksanddeeplearning.com.
[3] Aston Zhang, Mu Li, Zachary C. Lipton, Alexander J. Smola. Dive into Deep Learning [M]. Website:
https://zh.d2l.ai/d2l-zh.pdf. 2020.
[5] Stanford University. CS231n: Convolutional Neural Networks for Visual Recognition. URL:
http://cs231n.stanford.edu/. 2019.
[6] Andrew Ng. Unsupervised Feature Learning and Deep Learning Tutorial. Website:
http://ufldl.stanford.edu/tutorial/StarterCode/. 2018.
[7] Rosenblatt, Frank. The Perceptron: A Probabilistic Model for Information Storage and Organization in the
Brain [J]. Psychological Review, 1958, 65(6): 386-408.
[8] Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities [C].
Proc. Natl. Acad. Sci. U.S.A., 1982, 79(8): 2554-2558.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition [C]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[11] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. ImageNet classification with deep convolutional
neural networks [J]. Communications of the ACM, 2017, 60(6): 84-90.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-
Level Performance on ImageNet Classification [C]. 2015 IEEE International Conference on Computer Vision
(ICCV), 2015.
[13] Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift [J]. 2015, arXiv:1502.03167.
[14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. Dropout: A
Simple Way to Prevent Neural Networks from Overfitting [J]. Journal of Machine Learning Research, 2014,
15(56): 1929-1958.
[15] Afshine Amidi, Shervine Amidi. Deep Learning Tips and Tricks cheatsheet. URL:
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks. 2019.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren. Deep Residual Learning. 2015, arXiv:1512.03385.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition [C].
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] Sepp Hochreiter, Jürgen Schmidhuber. Long short-term memory [J]. Neural Computation, 1997, 9(8):
1735-1780.
[19] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk,
Holger; Bengio, Yoshua. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine
Translation. 2014, arXiv:1406.1078.
[21] Diederik P Kingma, Max Welling. Auto-Encoding Variational Bayes. 2013, arXiv:1312.6114.
[22] Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil;
Courville, Aaron; Bengio, Yoshua. Generative Adversarial Nets [C]. Proceedings of the International Conference
on Neural Information Processing Systems, 2014: 2672-2680.
[23] Martin Arjovsky, Soumith Chintala, Léon Bottou. Wasserstein GAN. 2017, arXiv:1701.07875.
[25] Alec Radford, Luke Metz, Soumith Chintala. Unsupervised Representation Learning with Deep
Convolutional Generative Adversarial Networks. 2015, arXiv:1511.06434.
[26] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros. Unpaired Image-to-Image Translation using
Cycle-Consistent Adversarial Networks. 2017, arXiv:1703.10593.