Anatomy of Deep Learning Principles (2023)
Brief introduction
This book introduces the basic principles of deep learning and the process of implementing them in an accessible way, building its own deep learning library from scratch with Python's numpy library instead of using existing deep learning frameworks. After covering the necessary basics of Python programming, calculus, and probability and statistics, it introduces the core topics of deep learning (regression models, neural networks, convolutional neural networks, recurrent neural networks, and generative networks) in the order in which the field developed. Alongside a plain-language analysis of each principle, it provides a detailed code implementation. The approach is less like teaching you how to use weapons and mobile phones than teaching you how to make them yourself: this book is not a tutorial on existing deep learning libraries, but an analysis of how to develop a deep learning library from zero. Combining principles from zero with code implementation enables readers to better understand both the basic principles of deep learning and the design ideas of popular deep learning libraries.
Preface
Ever since computers were invented, it has been the goal of computer scientists to give machines human-like intelligence. Since the concept of "artificial intelligence" was proposed in 1956, AI research has gone through many cycles of peaks and troughs. Over the course of its development, from rule-based reasoning grounded in mathematical logic to state-space search, from expert systems to statistical learning, from swarm intelligence algorithms to machine learning, and from neural networks to support vector machines, different artificial intelligence techniques have each had their turn leading the field.
In the past six years, deep learning based on deep neural networks has advanced by leaps and bounds. Its successful applications, such as AlphaGo defeating the human Go champion, autonomous driving, machine translation, speech recognition, and deep-fake face swapping, continue to attract attention. As a branch of machine learning, deep learning has brought traditional neural network technology back to life and established itself as the dominant technique in modern artificial intelligence.
With the help of deep learning platforms such as TensorFlow, PyTorch, and Caffe, even a primary school student can easily use a deep learning library for applications such as face recognition and speech recognition: all they do is call the platform's APIs to define the structure of a deep neural network model and tune the training parameters. These platforms have made deep learning very easy and brought it into ordinary homes; artificial intelligence is no longer mysterious. From universities to enterprises, people from all walks of life are using deep learning for research and applications.
The author believes that platform tutorial books are time-sensitive: a book's publication cycle is usually as long as a year, by which time the platform's interface may have changed, perhaps substantially. For a changing platform, such books quickly lose their value. Books on principles should be easy to understand and avoid complex, esoteric mathematics where possible; but completely abandoning the classical mathematics developed over centuries and describing derivatives with elementary school mathematics is not the best choice for readers who already know some higher mathematics. What the market particularly lacks are accessible books that explain the principles and show how to implement deep learning from the bottom up, without relying on existing deep learning libraries.
To accommodate readers who struggle with mathematics, the first chapter of this book introduces not only the necessary Python programming but also, as plainly as possible, the required calculus and probability. On this basis, the book proceeds from the simplest regression model to neural network models, from shallow to deep, moving from problem to concept to explain the basic ideas and principles in an easy-to-understand way. Avoiding both long-windedness and excessive terseness, it uses simple examples and concise, plain language to analyze the core principles of the models and algorithms. Once a principle is understood, it is then implemented from scratch with Python's numpy library, so that readers grasp both principle and implementation. By following along step by step, readers can build a deep learning library from zero without any deep learning platform. Finally, for comparison, the book introduces the use of the deep learning platform PyTorch, so that readers can quickly learn to use such a platform; this also helps readers understand the design ideas of these platforms more deeply and thus master them better.
This book is suitable not only for beginners without any deep learning background, but also for practitioners who have experience with deep learning libraries and want to understand their underlying implementation. It is especially suitable as a deep learning textbook for colleges and universities.
The English version of this book was translated from the Chinese version using Google Translate. We will continue to improve the quality of the translation, and we hope readers will help us correct errors.
My email: [email protected]
1. Objects
3. Type conversion
4. Notes
5. Variables
6. input() function
1.1.3 Operation
String formatting
1. if statement
2. while statement
3. for statement
index
slice
2. tuple (tuple)
3. set (collection)
4. dict (dictionary)
1.1.6 Functions
subplot()
Axes objects
mplot3d
display image
1.2 tensor library numpy
1 vector
2 Matrix
3 dimensional tensor
1. array()
3. asarray()
9. Append, Repeat & Tile, Merge & Split, Edge Padding, Add Axis & Swap Axis
Repeat repeat()
Tile tile()
Merge concatenate()
Stack stack()
Split split()
Edge Padding
Add Axis
Swap axes
1. Element-by-element calculation
Hadamard Product
2. Cumulative calculation
3. Dot Product
4 Broadcast Broadcasting
1.3 Calculus
1.3.1 Functions
Arithmetic
Composite
3. Derivatives of functions
1.3.8 Integral
1.4.1 Probability
1. Machine Learning
3.1.8 Prediction
2. Fitting plane
3.3 Regularization
- The loss function of adding the regular term becomes
1. Generate data
4. Decision curve
5. Prediction accuracy
Summary
1. Perceptron
2. Neurons
2. Tanh function
4. ReLU function
4.1.5 Output
Gradient Validation
5.1.2 Normalization
2 Whitening
5.4.2 Dropout
6.1 Convolution
Stride
Stride
6.1.5 Pooling
Gradient Test
6.5.1 LeNet-5
6.5.2 AlexNet
6.5.3 VGG
2. Language Model
predict
predict
Gradient Test
Text generation
predict
8.2 Autoencoders
8.2.1 Autoencoder
2. Loss function
3. Training process
5. Training GAN
4. Training model
On Windows, you can download and run the Python 3 installer from the official website https://www.python.org/downloads. Check "Add Python3.8 to Path" during the installation process, and the installer will automatically add the path of the Python interpreter to the system path.
C:\Users\hwdon>python
Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10)
[MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more
information.
>>>
You can enter the following command to install the jupyter environment:
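For example, with pip (assuming pip is available on the system path):
pip install jupyter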
Note: If you use Anaconda to install Python and its packages, you don't
need to install the Python interpreter separately.
1. Objects
All values in Python (such as the integer 2) exist as objects. An object contains its value (the content of the object), its data type, and an id (equivalent to an address). Python is a dynamically typed high-level language: "dynamic typing" means that Python automatically infers the type of an object from its value. The type of a value can be queried with the built-in function type(). For example:
type("http://hwdong-net.github.io")
str
type(2)
int
type(3.14)
float
type(False)
bool
id(3)
140705546753760
id(3.14)
1857260776048
Note: The Boolean type bool has only 2 values True and False, which are
used to represent the truth or falsehood of logical propositions.
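The following outputs presumably come from print() calls like these (a reconstruction; the original code was lost in conversion):
print(2, 3.14)
print("youtube channel: hwdong", True)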
2 3.14
youtube channel: hwdong True
The print() function has a keyword parameter end, which specifies the character appended after the output. Its default value is "\n", meaning a newline follows the output. The following code passes a space " " as the end argument, so a space instead of a newline is output after each print.
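A minimal sketch consistent with the output below:
for i in range(1, 7):
    print(i, end=" ")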
1 2 3 4 5 6
3. Type conversion
For built-in primitive types, the type name can be used to convert an object of another type into an object of that type.
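A sketch consistent with the outputs below (the original code was lost in conversion):
a = int("3")          # convert the string "3" to an integer
print(a)
print(type(a))
print(type(str(3)))   # convert the integer 3 to a string
b = float("3.14")     # convert the string "3.14" to a float
print(b)
print(type(b))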
3
<class 'int'>
<class 'str'>
3.14
<class 'float'>
4. Comment
A line of text beginning with # is called a comment. A comment is not a program statement but a description of the program code.
5. Variables
You can use the operator = to give an object a name, called the variable name of the object; we also say that the variable refers to the object.
pi = 3.14
print(pi)
print(2*pi)
A variable name can be rebound to another object at any time; variable names are not fixed once and for all, unlike in languages such as C.
a = 3.14                    # a refers to the object 3.14
b = a                       # b and a refer to the same object 3.14
a = "hwdong-net.github.io"  # a now refers to the new string object "hwdong-net.github.io"
print(a)
print(b)
hwdong-net.github.io
3.14
In this code, the variable name a first refers to the object 3.14, and then
refers to the string "hwdong-net.github.io". As shown in Figure 1-1.
Figure 1-1 The left picture is the result after executing b=a, and the right
picture is the result after executing a= "hwdong-net.github.io"
6. input() function
Used to accept input from the keyboard, the input is a string. input() can
accept a "prompt string". like:
name = input()
print("name: ",name)
score = input("Please enter your score:")
print("Name: ",name,"Score: ",score)
type(score)
Wang An
name: Wang An
Please enter your score: 56.8
Name: Wang An Score: 56.8
str
input() always returns a str object, but type conversion can be used to convert the input string to other basic types:
float
1.1.3 Operation
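The outputs below presumably come from code like the following sketch, with x = 15 and y = 2:
x = 15
y = 2
print('x + y =', x + y)
print('x - y =', x - y)
print('x * y =', x * y)
print('x / y =', x / y)    # true division
print('x % y =', x % y)    # remainder
print('x // y =', x // y)  # integer (floor) division
print('x ** y =', x ** y)  # exponentiation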
x + y = 17
x - y = 13
x * y = 30
x/y = 7.5
x % y = 1
x // y = 7
x ** y = 225
The comparison operators ==, !=, >, <, >=, and <= compare two values, meaning equal to, not equal to, greater than, less than, greater than or equal to, and less than or equal to, respectively. The result of a comparison is a bool value (True or False). For example:
x = 15
y = 2
print('x > y is',x>y)
print('x < y is',x<y)
print('x == y is',x==y)
print('x != y is',x!=y)
print('x >= y is',x>=y)
print('x <= y is',x<=y)
x > y is True
x < y is False
x == y is False
x != y is True
x >= y is True
x <= y is False
The logical operators and, or, and not represent logical AND, logical OR, and logical NOT, respectively. In logical operations, True, non-zero numbers, and non-empty objects count as true, while False, 0, and empty objects count as false (False).
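One possible reconstruction of the missing example (hypothetical, chosen to match the outputs below; or and and return one of their operands):
print(3 or 2, 3 and 2, not 0, 1 and 2)
print(0 or 2, 0 and 2, not 2)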
3 2 True 2
2 0 False
Python also has bitwise operators (bitwise AND &, bitwise OR |, XOR ^, negation ~, left shift <<, right shift >>) and other operators. Interested readers can look up the details.
x = 3
print(id(x))
x+=2 # x = x+2
print(id(x))
"x+=2" is equivalent to "x = x+2", which means that the original x object is
added to 2 to get an object, and then the variable name x refers to the result
object of this addition. Therefore, the variable name x before and after
represents 2 different objects.
True
False
subscript operator []
You can access an element of a container object by giving subscript
operator [] a subscript, such as:
s = "hwdong"
print(s[0], s[1], s[2], s[3], s[4], s[5])
print(s[-6],s[-5],s[-4],s[-3],s[-2],s[-1])
Subscripts can also be negative integers: -1 refers to the last character and -n refers to the first character.
String formatting
Use the format operator % to format data into a string, creating a new string.
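A sketch of the example (reconstructed from the explanation that follows):
s = '%s %s %f' % ("The score", "of LiPing is: ", 78.5)
print(s)    # The score of LiPing is:  78.500000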
The %s and %f in the format string '%s %s %f' indicate that the first two of the three output items ("The score", "of LiPing is: ", 78.5) are strings, while the last is a real number.
The format() method of str formats a string by replacing the {} placeholders in the string with the arguments of format() in turn.
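A sketch, reusing the data from the % example above:
s = "{} {} {}".format("The score", "of LiPing is: ", 78.5)
print(s)    # The score of LiPing is:  78.5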
1. if statement
if expression:
program block
This means that if the expression after the if keyword evaluates to True or non-zero, the program block under it is executed, for example:
score = float(input())
if score>=60:
print("Congratulations!")
print("Passed the exam.")
60.5
Congratulations!
Passed the exam.
Note: code belonging to the same program block in Python must be indented consistently; otherwise the Python interpreter reports an error. For example:
score = float(input())
if score>=60:
print("Congratulations!")
print("Passed the exam.")
if expression:
code block 1
else:
code block 2
like:
This means: if "expression 1" is True, execute "block 1" and skip the remaining blocks; otherwise, if "expression 2" is True, execute "block 2"; otherwise, if "expression 3" is True, execute "block 3"; if all the preceding expressions are False, execute the block in the else clause.
like:
score = float(input("Please enter the student's score: "))
if score < 60:     # if score < 60, execute this if block
    print("Failure")
elif score < 70:   # otherwise, if score < 70, execute this elif block
    print("Pass")
elif score < 80:   # otherwise, if score < 80, execute this elif block
    print("Medium")
elif score < 90:   # otherwise, if score < 90, execute this elif block
    print("Good")
else:              # otherwise (all other cases), execute this else block
    print("Excellent")
2. while statement
The format of the while statement is as follows
while expression:
code block
That is, while the "expression" after the keyword while is True, the block under it is executed repeatedly. For example:
i = 1
s = 0
while i<=100:
s = s+i; #equivalent to s += i
i+=1
print(s)
5050
3. for statement
The for keyword also represents a loop statement, which means iterative
access to each element in a container object or iterable object. The format
is:
for e in container:
code block
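For example, the loop below (a reconstruction consistent with the output that follows) iterates over the string "hwdong":
for ch in "hwdong":
    print(ch, end=",")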
h,w,d,o,n,g,
Each element of the string, i.e., each character, is visited in turn, and the character ch is printed with the print() function.
1. list (list)
A list is an ordered sequence of data elements (objects). A list object is defined with a pair of square brackets [ ], with the data elements separated by commas. For example:
a = [2,5,8]
print(a)
type(a)
[2, 5, 8]
list
The data elements in a list can be of different types, and can even be other list objects, such as:
my_list =[2, 3.14,True,[3,6,9],'python']
print(my_list)
print(type(my_list)) # Print the type of my_list,
that is, the list type
Another example:
a = [[1,2,3],[4,5,6]]
print(a)
index
As with strings, an element can be accessed by subscripting:
print("my_list[0]:",my_list[0])
print("my_list[3]:",my_list[3])
print("my_list[-2]:",my_list[-1])
my_list[0]: 2
my_list[3]: [3, 6, 9]
my_list[-1]: python
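The assignment presumably executed here (lost in conversion) makes a list element refer to a new object:
my_list[-2] = [8, 9]
print(my_list)    # [2, 3.14, True, [8, 9], 'python']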
After this assignment, the element at subscript -2 points to the new object [8, 9], as shown in Figure 1-2:
Figure 1-2 Assigning different objects to list elements makes the list
elements point to different objects
slice
You can use [start:end:step] to access a sublist of the elements of a list object selected by the start subscript, the end subscript, and the step size. This way of accessing list objects is called slicing.
print(my_list)
print(my_list[2:4])
print(my_list[0:4:2])
[2, 3.14, True, [8, 9], 'python']
[True, [8, 9]]
[2, True]
The default step is 1. If the start subscript is not specified, it defaults to 0; if the end subscript is not specified, it defaults to the position after the last element. For example:
list_2 = my_list[:] # all elements
print(list_2)
my_list[:] returns a list of all the elements. Note: the slice operation creates a new list object, so the following code outputs different id values:
print(id(my_list))
print(id(list_2))
1535447525504
1535447326144
If the slice operation is placed on the left side of the assignment statement,
it means to modify the content of the sublist corresponding to the slice, such
as:
print(my_list)
my_list[2:4] = [13, 9]
print(my_list)
You can even use a for loop over a container or iterable inside [ ] to create a new list object (a list comprehension). For example:
alist = [e**2 for e in [0,1,2,3,4,5]]
print(alist)
[0, 1, 4, 9, 16, 25]
The built-in function range(n) returns an iterable object that yields the integers from 0 to n (not including n). It is not a container, but you can still use for to traverse its elements.
for e in range(6):
print(e, end = ' ')
print()
0 1 2 3 4 5
2. tuple (tuple)
Like a list, a tuple is an ordered sequence of data elements (objects); each element has a unique subscript. A tuple is defined with parentheses instead of square brackets. For example:
t = ('python',[2,5],37,3.14,"https://hwdong.net")
print(type(t))
print(t[1:4])
print(t[-1:-4:-1])
<class 'tuple'>
([2, 5], 37, 3.14)
('https://hwdong.net', 3.14, 37)
The elements of a tuple cannot be modified, just as the characters of a string cannot be modified.
t[1]=22
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-70d00e4ef536> in <module>
----> 1 t[1]=22
TypeError: 'tuple' object does not support item assignment
3. set (collection)
A set is an unordered collection with no duplicate elements. A set is written as a group of comma-separated elements surrounded by curly braces {}; the element types can differ. For example:
s = {5,5,3.14,2,'python',8}
print(type(s))
print(s)
<class 'set'>
{2, 3.14, 5, 8, 'python'}
You can use the add() and remove() methods to add and delete an element of a set. A list object uses append() or insert() to add or insert elements; pop() deletes the last element, and remove() deletes the first element with the specified value.
s.add("hwdong")
print(s)
s.remove("hwdong")
print(s)
alist.append("hwdong")
print(alist)
alist.insert(2,"net")
print(alist)
alist.pop()
print(alist)
alist.remove("net")
print(alist)
But immutable objects such as tuples have no functions like append() or insert() for adding elements. The following code is an error:
t.append("hwdong")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-34fd50c7f43a> in <module>
----> 1 t.append("hwdong")
AttributeError: 'tuple' object has no attribute 'append'
You can also create a set object with a {} comprehension, such as:
nums = {x**2 for x in range(6)}
print(nums)
4. dict (dictionary)
A dict is an unordered collection of "key-value" pairs; each element is stored in the form "key: value". For example:
d = {1:'value', 'key':2, 'hello': [4,7]}
print(type(d))
print(d)
<class 'dict'>
{1: 'value', 'key': 2, 'hello': [4, 7]}
To access the value of an element in a dict, you pass the corresponding key (also called the keyword). For example:
d['hello']
[4, 7]
If no element with a given key exists, it is illegal to access an element through that key, such as:
d[3]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-35-0acadf17a380> in <module>
----> 1 d[3]
KeyError: 3
But you can assign a value to a key that does not yet exist; a new "key-value" pair is then added to the dict. For example:
d[3] = "python"
print(d)
print(d[3])
You can define a dict object that represents student information and uses
name as a key:
students={"LiPing":[21,"Compu01",15370203152],"ZhangWei":
[20,"Compu02",17331203312]
,"ZhaoSi":[22,"mecha03",16908092516]}
print(students)
print(students["ZhangWei"])
Of course, you can also use a {} comprehension to create a dictionary object.
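A sketch of such a dict comprehension (a hypothetical example):
squares = {x: x**2 for x in range(4)}
print(squares)    # {0: 0, 1: 1, 2: 4, 3: 9}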
1.1.6 Functions
Python defines functions with the keyword def, which gives a program block a name; the code in the function body can then be called and executed through the function name. For example:
def hwdong():
    print("My youtube channel is: ","hwdong")   # call the built-in function print()
    print("My station B number is: hw-dong")
    print("My blog is: https://hwdong-net.github.io")
hwdong()
print()   # call the built-in function print()
hwdong()
print()   # call the built-in function print()
hwdong()
My youtube channel is:  hwdong
My station B number is: hw-dong
My blog is: https://hwdong-net.github.io
Functions can have parameters, so that corresponding arguments can be passed to the function when it is called, such as the following function that computes x^n:
def pow(x,n):
    ret = 1
    for i in range(n):  # 0,1,2,...,n-1
        ret *= x        # ret = ret*x
    return ret          # "return ret" returns the value of the function
Calling this function passes two actual arguments to the formal parameters x and n of the function.
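For example (a reconstruction consistent with the outputs below):
print(pow(3, 2))
print(pow(2, 4))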
9
16
The return value of the pow() call is passed to print() to be printed. Function parameters can have default values: if the corresponding argument is not provided when the function is called, the parameter takes its default value.
def pow(x,n=2):
ret = 1
for i in range(n):
ret *=x
return ret
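Calls consistent with the outputs below (a reconstruction):
print(pow(3.5))      # n takes its default value 2
print(pow(3.5, 3))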
12.25
42.875
def fact(n):
    if n==1:                # if n equals 1, return the value 1 directly
        return 1
    return n * fact(n - 1)  # if n is greater than 1, the result is the product of n and fact(n-1)
fact(4) # output: 24
24
math package
Many mathematical functions are defined in the math package. To use them, you need to import the package:
print(math.sqrt(2))
1.4142135623730951
import math
def circle(r):
area = math.pi*r**2
perimeter = 2*math.pi*r
return area,perimeter
area,p = circle(2.5)
print("The area and circumference of a circle with a
radius of 2.5 are: %5.2f,%5.2f"%(area,p))
area,p=circle(3.5)
print("The area and circumference of a circle with a
radius of 3.5 are: %5.2f,%5.2f"%(area,p))
A function can return multiple values, which are actually returned as a tuple
object.
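A sketch consistent with the outputs below:
result = circle(2.5)
print(type(result))           # the two return values are packed into a tuple
print(result[0], result[1])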
<class 'tuple'>
19.634954084936208 15.707963267948966
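The next two output lines ("3 5" and "6") come from the scope example analyzed below; a reconstruction:
global_x = 6
def f():
    x = 3
    global_x = 5    # defines a *local* variable, not the global one
    print(x, global_x)
f()
print(global_x)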
3 5
6
The statement "global_x = 5" inside the function does not modify the
external global variable global_x but defines a local variable global_x
pointing to object 5, which has no effect on the external global variable
global_x. Therefore, after the function f() is executed, the internal function
The local variable is destroyed, and the global variable global_x is still 6.
If you want to access global variables inside the function, you need to use
the keyword global to declare a variable as a global variable inside the
function, such as:
global_x = 6
def f():
global global_x
x = 3
global_x = 5
print(x,global_x)
f()
print(global_x)
3 5
5
Here global_x inside the function is declared to refer to the external global variable global_x, so modifying it modifies the external global variable. Therefore the print statement after calling f() outputs 5 instead of 6.
global_x = 6
a = [1,2,3]
def f(y,z):
x = 3
y = 5
z[0] = 10
print(y)
print(z)
f(global_x,a)
print(global_x)
print(a)
5
[10, 2, 3]
6
[10, 2, 3]
An anonymous function can be defined with the keyword lambda, such as: lambda x: x ** 2. Usually the = operator is used to give the lambda function a name, such as:
double = lambda x: x ** 2
double(3.5)
12.25
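The print_msg() function used below is not shown above; a reconstruction based on the explanation that follows:
def print_msg(msg):
    def printer():    # nested function; it can access the enclosing msg
        print(msg)
    return printer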
another = print_msg("Hello")
another()
Hello
Nested functions can access variables in the enclosing scope, for example,
the printer() function can access the local variables of the print_msg()
function (including the parameter msg).
def make_pow(n):
    def pow(x):
        return x ** n  # pow() can access the variables of make_pow() (i.e., n)
    return pow
The function object pow() returned by make_pow() can access the variable
(ie n) of make_pow().
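The function objects pow3 and pow5 used below are presumably created like this:
pow3 = make_pow(3)   # pow3(x) computes x ** 3
pow5 = make_pow(5)   # pow5(x) computes x ** 5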
print(pow3(9))
print(pow5(3))
print(pow5(pow3(2)))
729
243
32768
def infinite_sequence():
num = 0
while True:
yield num
num += 1
Calling the generator function infinite_sequence() returns an iterable (generator) object whose values are produced by yield:
iterator = infinite_sequence()
print(next(iterator))
print(next(iterator))
0
1
The variable name iterator refers to the generator object returned by the call; the next() function fetches its next value. You can also traverse this iterable object with for:
for i in infinite_sequence():
print(i, end=" ")
if i>5:
break
0 1 2 3 4 5 6
The preceding str, list, etc. are all classes. Generally, an object of a class can be created through the class name; such an object is a specific instance. For example:
s = str("http://hwdong.net")
print(type(s)) #str
location = s.find("hwdong")  # query whether there is a substring with str's find() method,
                             # and return the location of the substring
print(location)
alist = list(range(6))       # [0,1,2,3,4,5]
blist = alist.copy()         # create a copy of alist with list's copy() method
blist[2] = 20
print(alist)
print(blist)
<class 'str'>
7
[0, 1, 2, 3, 4, 5]
[0, 1, 20, 3, 4, 5]
As you can see, the member access operator . can be used to call class methods on an object to perform operations on it (read information, modify the object, or create a new object). For example, s.find() searches s for a substring equal to "hwdong" and returns the substring's position, while alist.copy() creates a list object with the same content as alist and makes blist refer to the newly created list object.
A class is defined in Python with the keyword class. In order to describe the
common attributes of all students, a Student class can be defined.
class Student:
def __init__(self, name, score):
self.name = name
self.score = score
def print(self):
print(self.name,",",self.score)
The following code defines two objects s1 and s2 of the Student class and calls the print() method of the class through them. Student's print() method calls the built-in function print() to output the name and score of the object referred to by self.
s1 = Student("LiPing",67)
s2 = Student("WangQiang",83)
s1.print()
s2.print()
LiPing , 67
WangQiang , 83
Each object has its own separate instance properties, and changing the
instance properties of one object will not affect the instance properties of
other objects. In addition to instance attributes, you can also define class
attributes for a class, which are attributes shared by all objects of the class.
A class attribute is an attribute defined outside a method of a class.
For example, the modified Student class adds a class attribute count, which
indicates how many specific class objects have been created from this class.
Its initial value is 0. Whenever a class object is created, its count is
increased.
class Student:
count=0
def __init__(self, name, score):
self.name = name
self.score = score
Student.count +=1
def print(self):
print(self.name,",",self.score)
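The outputs 0, 1, 2 below presumably come from code like this sketch:
print(Student.count)          # 0: no object has been created yet
s1 = Student("LiPing", 67)
print(Student.count)          # 1
s2 = Student("WangQiang", 83)
print(Student.count)          # 2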
0
1
2
The plot() function of the pyplot module can directly plot 2D data, such as:
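A sketch consistent with the output and explanation below (the original code was lost in conversion):
import matplotlib.pyplot as plt
y = [0.5*i for i in range(10)]
print(y)
plt.plot(y)   # only the vertical-axis coordinates are given
plt.show()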
[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
Although only the array y of vertical-axis coordinates is given, the plot() function automatically generates horizontal-axis coordinates starting from 0 by default. Of course, you can pass 2 arrays representing the x and y coordinates, such as:
The plot() function can also accept some parameters to customize the style
of the drawn graphics, such as:
import math
import matplotlib.pyplot as plt
x = [i*0.2 for i in range(50)]
y = [math.sin(xi) for xi in x]
y2 = [math.cos(xi) for xi in x]
y3 = [0.2*xi for xi in x]
plt.plot(x, y,'r-')
plt.plot(x, y2,'bo')
plt.plot(x, y3,'g:')
plt.legend(['sin(x)', 'cos(x)','0.2x'])
plt.show()
Here, the r in 'r-' means red and - means a solid line; the b in 'bo' means blue and o means circular dots; the g in 'g:' means green and : means a dotted line.
In addition to the plot() function that can be drawn in the pyplot module,
there are other functions for drawing other types of graphs, such as scatter()
for drawing scattered point graphs. like:
import math
import matplotlib.pyplot as plt
x = [i*0.2 for i in range(50)]
y = [math.sin(xi) for xi in x]
y2 = [math.cos(xi) for xi in x]
y3 = [0.2*xi for xi in x]
plt.scatter(x, y, c='r', s=6, alpha=0.2)
plt.scatter(x, y2,c='g', s=18, alpha=0.9)
plt.scatter(x, y3,c='b', s=3, alpha=0.4)
plt.legend(['sin(x)', 'cos(x)','0.2x'])
plt.show()
The parameter c represents the color, its values 'r', 'g', and 'b' represent red,
green, and blue respectively, the s parameter represents the size of the point,
and alpha represents the transparency of the graph.
subplot()
A figure (the window object used to display plots) can be divided into multiple sub-regions that display different plots. The subplot() function specifies in which sub-plot region to draw.
import math
import matplotlib.pyplot as plt
x = [i*0.2 for i in range(50)]
y = [math.sin(xi) for xi in x]
y2 = [math.cos(xi) for xi in x]
y3 = [0.2*xi for xi in x]
fig = plt.gcf()
fig.set_size_inches(12, 4, forward=True)
plt.subplot(1, 2, 1)
plt.plot(x, y,'r-')
plt.plot(x, y2,'bo')
plt.title('sin(x) and cos(x)')
plt.legend(['sin(x)', 'cos(x)'])
plt.subplot(1, 2, 2)
plt.plot(x, y3,'g:')
plt.title('0.2x')
plt.show()
The above code first obtains the figure object of the current drawing window through fig = plt.gcf() and assigns it to the variable fig, and then modifies the figure's default width and height by calling its set_size_inches() method; forward=True means the window size is updated immediately.
Axes objects
The subplot() function returns an axes object. We can use this to specify
which subplot is active at any time:
# http://www.math.buffalo.edu/~badzioch/MTH337/PT/PT-
matplotlib_subplots/PT-matplotlib_subplots.html
from math import pi
plt.figure(figsize=(8,4))
plt.subplots_adjust(hspace=0.4)
plt.show()
mplot3d
As with other axes, use the projection='3d' keyword to create an Axes3D object: create a matplotlib.figure.Figure and add an axes of type Axes3D:
mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.gca(projection='3d')
plt.show()
def randrange(n, vmin, vmax):
    # helper used by this example (as in the matplotlib gallery demo):
    # n random values uniformly drawn from [vmin, vmax)
    return (vmax - vmin) * np.random.rand(n) + vmin

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
n = 100
# For each set of style and range settings, plot n random points in the box
# defined by x in [23, 32], y in [0, 100], z in [zlow, zhigh].
for c, m, zlow, zhigh in [('r', 'o', -50, -25), ('b', '^', -30, -5)]:
    xs = randrange(n, 23, 32)
    ys = randrange(n, 0, 100)
    zs = randrange(n, zlow, zhigh)
    ax.scatter(xs, ys, zs, c=c, marker=m)
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.show()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
plt.show()
120
[[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]
...
[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]
[-30. -29.5 -29. ... 28.5 29. 29.5]]
fig = plt.figure()
ax = fig.gca(projection='3d')
# Make data.
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
ax.plot_surface(X, Y, Z)  # draw the surface (this call appears to have been lost in conversion)
plt.show()
https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html
display image
You can use imshow() to display an image; before that, you can use the imread() function of the io module of the skimage library to read the image. The image is returned as a multidimensional array (ndarray) object of the numpy library. numpy has many convenient functions for processing multidimensional arrays; for example, uint8() converts numpy arrays with other element types into uint8, an unsigned integer type with value range [0, 255].
import numpy as np
import matplotlib.pyplot as plt
import skimage.io

img = skimage.io.imread('../imgs/lenna.png')  # original image
img_tinted = img * [1, 0.95, 0.9]             # 3 color channel values multiplied by different coefficients
plt.subplot(1, 2, 1)
plt.imshow(img)
plt.subplot(1, 2, 2)
plt.imshow(np.uint8(img_tinted))              # convert the real-valued img_tinted image to uint8 unsigned integers
plt.show()
There are more detailed Python tutorials on the author's blog site
(https://hwdong-net.github.io).
1.2 tensor library numpy
A tensor, also called a multidimensional array, is a regular arrangement of multiple values. Tensor operations are the most important operations in deep learning. This section introduces the Python tensor library numpy.
1 Vector
In physics, a vector is a quantity with both magnitude and direction; force and velocity are such quantities. By decomposing it along coordinate axes, a vector can be expressed as an ordered set of numbers, such as (2, 5, 8), so that it can be studied mathematically. In linear algebra, a vector is defined as an ordered set of numbers, that is, a one-dimensional array (1D tensor). For example, a student's regular grade, experiment grade, final grade, and overall grade can be expressed as the vector (regular grade, experiment grade, final grade, overall grade). The coefficients and unknowns of an equation $a x_1 + b x_2 + c x_3 = d$ can likewise be expressed as vectors. A vector can also be written vertically, such as:

$$\begin{pmatrix} 2 \\ 3 \\ 5 \end{pmatrix}$$
Vectors of this form are called column vectors.
The row and column forms of the same vector are transposes of each other; that is, the transpose of the row vector $(x_1, x_2, \cdots, x_n)$ is the column vector:

$$(x_1, x_2, \cdots, x_n)^T = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

and conversely,

$$\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}^T = (x_1, x_2, \cdots, x_n)$$
The length of a vector represents the distance from the coordinate point to the origin. This Euclidean distance is usually denoted $\|v\|_2$, namely:

$$\|v\|_2 = \sqrt{x^2 + y^2}$$

For example, $\|OA\|_2 = \sqrt{3^2 + 1^2} = \sqrt{10}$ and $\|OB\|_2 = \sqrt{1^2 + 3^2} = \sqrt{10}$, so $\|OA\|_2 = \|OB\|_2$; that is, the vectors (3, 1) and (1, 3) have the same length. In three-dimensional space: $\|v\|_2 = \sqrt{x^2 + y^2 + z^2}$.
The general extension of the 2-norm is the p-norm: for a positive integer p, the p-norm is defined as

$$\|x\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}$$

That is, the absolute values of the vector's elements are raised to the p-th power and summed, and then the 1/p-th power of the sum is taken. The p-norm characterizes the size of a vector in different senses. For example, the 1-norm (p = 1) is the sum of the absolute values of all elements of the vector:

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

and the infinity norm is the largest absolute value of the elements:

$$\|x\|_\infty = \max_i |x_i|$$

The usual convention: the 0-norm of a vector is the number of its non-zero elements.
2 Matrix
A matrix in algebra is a rectangular arrangement of scalars; for example, scalars arranged in 3 rows and 4 columns form a 3×4 matrix. A matrix can be viewed as a column vector whose data elements are row vectors:

$$A_{mn} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = \begin{bmatrix} a_{1,:} \\ a_{2,:} \\ \vdots \\ a_{m,:} \end{bmatrix}$$

where $a_{i,:}$ denotes the i-th row vector of the matrix. A matrix can also be viewed as a row vector whose data elements are column vectors:

$$A_{mn} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} = [\, a_{:,1} \;\; a_{:,2} \;\; \cdots \;\; a_{:,n} \,]$$
b = [[1,2,3],[4,5,6]]
b[0][2] = 20
print(b)
print(b[0])
print(b[1])
3 Three-dimensional tensors
A color image is composed of three channel images (red, green, and blue), that is, three matrices.
Figure 1-6 A color image is synthesized from the three color matrices red, green, and blue
1. array()
The array() function of numpy is the most commonly used way to create ndarray objects. It can create a multidimensional array (ndarray) object from a sequence or iterable object. For example, the following code creates a 1D tensor (vector) and a 2D tensor (matrix), respectively:
import numpy as np
a = np.array([1,3,2])            # create a one-dimensional vector (tensor) a
print(a)
print(a.shape)
b = np.array([[1,3,2],[4,5,6]])  # create a two-dimensional tensor (matrix) b
print(b)
print(b.shape)                   # axis=0
[1 3 2]
(3,)
[[1 3 2]
[4 5 6]]
(2, 3)
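For reference, the signature of array() is approximately:
numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)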
The first parameter, object, is required: the ndarray object is created from it; for example, np.array([1,3,2]) creates a one-dimensional array from the list object [1,3,2]. dtype specifies the data type of the array elements; the default None means the same type as the elements of object. copy defaults to True, meaning a copy is created that does not share data storage with object. order indicates the order in which the elements of object are arranged in the created array; the default is 'K', which keeps the source's element order.
a = np.array([1,2,3,4])
print(a.dtype)
print(a)
b = np.array([1,2,3,4], dtype=np.float64)
print(b.dtype)
print(b)
int32
[1 2 3 4]
float64
[1. 2. 3. 4.]
2. Multidimensional array type ndarray
The ndarray class of numpy is used to represent multidimensional arrays.
The following are some common attributes of ndarray class objects:
ndarray.ndim
The number of axes (dimensions) of the array, that is, the rank of the
array.
ndarray.shape
The shape of the array is a tuple of integers, and each integer in the
tuple represents the length (number of data elements) of the
corresponding dimension (axis) of the array.
ndarray.size
The total number of array elements is equal to the product of the tuple
elements in the shape attribute.
ndarray.dtype
The data type of the elements in the array.
ndarray.itemsize
The size in bytes of each element in the array. For example, an array whose element type is float64 has an itemsize value of 8 (= 64/8).
ndarray.data
Stores the memory address of the actual array element, usually you
don't need to use this attribute, because you can always access the
elements in the array by subscript.
a= np.array([1.,2.,3.])
print(a.ndim,a.shape,a.size,a.dtype,a.itemsize,a.data)
b= np.array([[1,2,3],[4,5,6]])
print(b.ndim,b.shape,b.size,b.dtype,b.itemsize,b.data)
The shape attribute of an ndarray represents the shape of the tensor: (3,) indicates that a is a one-dimensional tensor with 3 elements, and (2, 3) indicates that b is a two-dimensional tensor whose first dimension (rows) has 2 elements and whose second dimension (columns) has 3 elements, i.e., b is a matrix with 2 rows and 3 columns:
$$b = \begin{bmatrix} 1 & 3 & 2 \\ 4 & 5 & 6 \end{bmatrix}$$
ndim is the dimension (number of axes) of the tensor (array). Axes are numbered from 0; the above b has axis=0 and axis=1, as shown in Figure 1-8:
Figure 1-8 Tensor axis: axis=0 means the first dimension (axis), axis=1
means the second dimension (axis),
print(a[2])
print(b[1,2])
3.0
6
3. asarray()
array() copies by default; that is, the newly created ndarray does not share data storage with the incoming object. If you do not need a copy and simply want to convert the incoming object into an ndarray, you can use asarray(), a simplified wrapper of array(). asarray() does not copy the original data and has fewer parameters:
numpy.asarray(a, dtype = None, order = None)
asarray() simply calls the array() function to create a new ndarray object that shares data storage with the incoming a. Of course, if the incoming a is an iterable object rather than an array, the storage cannot be shared, and the new object's data points to a newly allocated block of memory holding all the elements.
d = np.asarray(range(5))
print(d)
e = np.asarray([1,2,3,4,5])  # asarray can also create an ndarray array object
                             # from a sequence or iterable object
print(e)
print(type(e))
[0 1 2 3 4]
[1 2 3 4 5]
<class 'numpy.ndarray'>
[ 1 2 20 4 5]
[ 1 2 20 4 5]
<class 'list'>
[[1, 2, 3], [4, 5, 6]]
c = a.astype(np.float64)
print(a.dtype,c.dtype)
a[0][0] = 100
print(a)
print(c)
int32 float64
[[100 2 3]
[ 4 5 6]]
[[1. 2. 3.]
[4. 5. 6.]]
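The three outputs below presumably come from code like this sketch:
a = np.arange(6)
print(a)
b = a.reshape(2, 3)
print(b)
print(b.astype(np.float64))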
[0 1 2 3 4 5]
[[0 1 2]
[3 4 5]]
[[0. 1. 2.]
[3. 4. 5.]]
arange() produces an arithmetic sequence: the initial value is start and the common difference is step (also called the step size), up to stop (but not including stop). The element type can be specified with dtype. start defaults to 0, step defaults to 1, and dtype defaults to None; all of them may be omitted. Note that the resulting array does not contain the stop value.
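For reference, the signature of arange() is approximately:
numpy.arange([start, ]stop[, step], dtype=None)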
print(np.arange(5)) #Only specify end, start and step
default to 0 and 1
print(np.arange(2,5))
print(np.arange(2,7,2))
[0 1 2 3 4]
[2 3 4]
[2 4 6]
For example:
np.logspace(2.0, 3.0, num=5)
The full() function fills an array of a given shape with a given value, for example:
np.full((2, 3),np.inf)
np.full((2, 3),3.5)
Similar to the full() function, numpy's empty(), zeros(), ones(), and eye()
respectively create uninitialized arrays with a value of 0, a value of 1, and 1
on the diagonal and 0 on the rest.
numpy.empty(shape, dtype = float, order = 'C')
numpy.zeros(shape, dtype = float, order = 'C')
numpy.ones(shape, dtype = None, order = 'C')
numpy.eye(N, M=None, k=0, dtype=<class 'float'>, order='C')
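The outputs below presumably come from code like this sketch:
a = np.empty((2, 3))   # uninitialized values, not printed here
b = np.zeros((2, 3))
c = np.ones((1, 2))
d = np.eye(2)
print(b)
print(c)
print(d)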
[[0. 0. 0.]
[0. 0. 0.]]
[[1. 1.]]
[[1. 0.]
[0. 1.]]
print(a.shape,b.shape,c.shape,d.shape)
Both the standard normal distribution and the general normal distribution describe the probability of a random variable taking different values (probability background is introduced later). A random variable x obeying a normal distribution has the probability density function

$$N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

For the standard normal distribution, $\mu = 0$ and $\sigma = 1$, and the density reduces to $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$.
Some random number functions also have aliases. For example, the
random() function has an alias function random_sample(), that is, both are
the same function, and both generate uniformly sampled random numbers
in the [0,1] interval. If you want to generate uniformly sampled random
numbers in the [a,b] interval, you only need to do a simple linear
transformation on the generated array. For example: (b - a) *
random_sample() + a can generate an array of random numbers in the
interval [a,b]. The following code creates an array of random numbers
between [2,7].
5 * np.random.random_sample((2, 3)) +2
9. Append, Repeat & Tile, Merge & Split, Edge Padding, Add Axis & Swap Axis
We saw earlier that ndarray's astype() and reshape() create new tensors (arrays) by changing the element type or the shape of a tensor. numpy also has many functions and methods that create new ndarray objects from existing arrays by appending, repeating, tiling, merging, splitting, padding edges, adding axes, swapping axes, and so on.
Append
numpy's append() adds content to an existing array to create a new array; its signature is:
numpy.append(arr, values, axis=None)
It appends the content of values after the array arr; axis indicates along which axis to append. The default is None, in which case a flattened one-dimensional array is created.
a = np.array([1,2,3])
b= np.append(a,4)
print(a)
print(b)
np.append([1, 2, 3], [[4, 5, 6], [7, 8, 9]])
[1 2 3]
[1 2 3 4]
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
Repeat repeat()
The repeat function repeat() creates a new ndarray by repeating the elements of an array along an axis.
a = np.array([[1,2],[3,4]])
np.repeat(a, 2)  # create a flattened (one-dimensional) array in which
                 # each element of a is repeated twice
array([1, 1, 2, 2, 3, 3, 4, 4])
np.repeat(a, 2,axis=0)
array([[1, 2],
[1, 2],
[3, 4],
[3, 4]])
np.repeat(a, 2, axis=0) repeats each element (i.e., each row) twice along axis 0, and np.repeat(a, 2, axis=1) repeats each column twice along axis 1.
np.repeat(a, 2,axis=1)
array([[1, 1, 2, 2],
[3, 3, 4, 4]])
Tile tile()
Unlike repeat(), which repeats elements along an axis, the tiling function tile(A, reps) copies the entire array vertically or horizontally, like laying tiles.
numpy.tile(A, reps)
A is the array to be tiled, and reps gives the number of repetitions along each axis. If the length of reps is less than A.ndim (for example, A's shape is (2, 3, 4, 5) and reps=(2, 2)), reps is padded with 1s in front, becoming (1, 1, 2, 2). If A.ndim is less than the length of reps, A is promoted to an array with as many dimensions as reps: if a one-dimensional tensor A has shape (3,) and reps=(2, 2), A is treated as a two-dimensional tensor of shape (1, 3).
a = np.array([1, 2,3])
b = np.tile(a, 2)  # a is repeated 2 times to create a new array
print(a)
print(b)
[1 2 3]
[1 2 3 1 2 3]
When tiling array a with reps (2, 2), the one-dimensional tensor [1, 2, 3] is first promoted to the two-dimensional tensor [[1, 2, 3]] and then tiled:
np.tile(a, (2, 2)) # a is tiled in 2 rows and 2 columns,
creating a new array
array([[1, 2, 3, 1, 2, 3],
[1, 2, 3, 1, 2, 3]])
In the following example, the shape of c is (2, 2) and reps=2 has length 1, so reps first becomes (1, 2) and the array is then tiled: axis 0 is repeated once (the row direction is unchanged) and axis 1 is repeated twice.
c = np.array([[1, 2], [3, 4]])
print(c)
np.tile(c, 2)  # reps first becomes (1,2), meaning the first axis repeats once
               # and the second axis repeats twice
[[1 2]
[3 4]]
array([[1, 2, 1, 2],
[3, 4, 3, 4]])
The following is just the opposite:
np.tile(c, (2, 1))
array([[1, 2],
[3, 4],
[1, 2],
[3, 4]])
Note: repeat() copies elements along one axis of the array (if no axis is specified, it copies each element into a flat array), while tile() copies the entire array.
Merge concatenate()
The concatenation function concatenate() and the stacking function stack() create new arrays by merging multiple arrays. axis specifies the axis along which to merge; the default is 0. If axis is None, the arrays are merged into a flat one-dimensional array. out defaults to None; if it is not None, the merged result is placed in out.
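For reference, the signature is approximately:
numpy.concatenate((a1, a2, ...), axis=0, out=None)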
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
print(b.T)
c = np.concatenate((a, b), axis=0) # Merge along
the axis=0 axis
d = np.concatenate((a, b.T), axis=1) # Merge along
the axis=1 axis
e = np.concatenate((a, b), axis=None) # Merge into a
flat array
print(c)
print(d)
print(e)
[[5]
[6]]
[[1 2]
[3 4]
[5 6]]
[[1 2 5]
[3 4 6]]
[1 2 3 4 5 6]
Stack stack()
The stacking function stack(arrays, axis=0, out=None) stacks a sequence of arrays along the axis direction into a new array. axis defaults to 0, the first axis; axis=-1 means the last axis.
a = np.array([1, 2])
b = np.array([3, 4])
c = np.array([5, 6])
np.stack((a, b,c))
array([[1, 2],
[3, 4],
[5, 6]])
np.stack((a,b,c),axis=1)
array([[1, 3, 5],
[2, 4, 6]])
array([[1, 4],
[2, 5],
[3, 6]])
(3, 1) (3, 1)
array([[1, 4],
[2, 5],
[3, 6]])
(3,) (3,)
array([1, 2, 3, 4, 5, 6])
(1, 3)
array([[1, 2, 3],
[4, 5, 6]])
array([[1],
[2],
[3],
[4],
[5],
[6]])
vstack() treats a one-dimensional array of shape (N,) as having shape (1, N) when stacking. For example:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack((a,b))  # the shape (3,) of 1D array a is treated as (1,3); the same is true for b
array([[1, 2, 3],
[4, 5, 6]])
Split split()
Splitting is the opposite of merging. The function split() splits an array along an axis (axis defaults to 0). If a list of indices such as [2, 3] is passed, the result consists of the sub-arrays:
ary[:2]
ary[2:3]
ary[3:]
x = np.arange(9.0)
print(x)
np.split(x, 3) # Split into 3 sub-arrays of equal length
[0. 1. 2. 3. 4. 5. 6. 7. 8.]
[array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7., 8.])]
hsplit() and vsplit() are the splitting functions corresponding to the merging
operations hstack() and vstack() respectively, and split the array along the
horizontal (axis=1) and vertical direction (axis=0) respectively. Both of
these split functions are special cases of the split function split().
x = np.arange(16.0).reshape(4, 4)
print(x)
np.hsplit(x, 2) # Split into 2 equal sub-arrays
along the horizontal direction (column direction)
[[ 0. 1. 2. 3.]
[ 4. 5. 6. 7.]
[ 8. 9. 10. 11.]
[12. 13. 14. 15.]]
np.split(x, [1,2])
Edge Padding
The np.pad() function pads the edges of each axis (dimension) of an array, i.e., fills values at the front and back edge positions of each axis (dimension).
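For reference, the signature is approximately:
numpy.pad(array, pad_width, mode='constant', **kwargs)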
array is the input array, pad_width is the padding width (the number of elements), and mode is the padding method: 'constant' means padding with a constant, and constant_values gives the value of the constant. For example:
a = [7,8 ,9 ]
b =np.pad(a, (2, 3), mode='constant', constant_values=(4,
6))
print(a)
print(b)
[7, 8, 9]
[4 4 7 8 9 6 6 6]
(2, 3) means that 2 elements are padded before array a and 3 elements after it; mode='constant' means the padding values are constants, and constant_values=(4, 6) means the constant padded before is 4 and the one padded after is 6. The following sets mode to 'edge', which pads with the values of the edge elements:
array([7, 7, 7, 8, 9, 9, 9, 9])
mode='minimum' pads with minimum values computed from the array. For a multidimensional array, the padding width at the beginning and end of each dimension must be specified, as in the sketch below.
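A reconstruction consistent with the output below, assuming the 2×2 input array [[2, 5], [7, 9]]:
a = np.array([[2, 5], [7, 9]])
np.pad(a, ((1, 2), (2, 3)), mode='minimum')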
array([[2, 2, 2, 5, 2, 2, 2],
[2, 2, 2, 5, 2, 2, 2],
[7, 7, 7, 9, 7, 7, 7],
[2, 2, 2, 5, 2, 2, 2],
[2, 2, 2, 5, 2, 2, 2]])
Here ((1, 2), (2, 3)) pads 1 row before and 2 rows after along the first axis (rows) of a, i.e., 1 row is added above and 2 rows below the two-dimensional array, and pads 2 columns before and 3 columns after along the second axis (columns); each padded entry is a minimum taken along the corresponding axis.
Add Axis
numpy.expand_dims(a, axis) expands the shape of an array by inserting a new axis at position axis. For example:
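A sketch consistent with the outputs below:
x = np.array([3, 5])
print(x.shape)
print(x)
y = np.expand_dims(x, axis=0)   # insert a new axis at position 0
print(y.shape)
print(y)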
(2,)
[3 5]
(1, 2)
[[3 5]]
y = x[np.newaxis,:]
print(y.shape)
print(y)
(1, 2)
[[3 5]]
y = x[:,np.newaxis]
print(y.shape)
print(y)
(2, 1)
[[3]
[5]]
Swap axes
Sometimes the axes of an array need to be swapped. For example, when a color image is read, its color channel may be the third axis (axis=2), while some programs need the color channel on the first axis. Several functions can swap the axes of an array. numpy.swapaxes(a, axis1, axis2) swaps axes axis1 and axis2. For example:
A = np.random.random((2,3,4,5))
print(A.shape)
B = np.swapaxes(A,0,2)  # swap the axes axis=0 and axis=2
print(B.shape)
(2, 3, 4, 5)
(4, 3, 2, 5)
(4, 2, 3, 5)
(4, 3, 2, 5)
(4, 2, 3, 5)
(4, 3, 2, 5)
numpy.transpose(a, axes=None) rearranges the axes of the array according to the order given in axes. The default None means the axes are arranged in reverse order. This is a more general and flexible function. For example:
A = np.random.random((2,4))
print(A)
B = np.transpose(A)
print(B)
C = np.random.random((2,4,3,5))
D = np.transpose(C,(2,0,3,1))
print(D.shape)
Unlike Python lists, indexing and slicing a numpy array does not create a new array but a view (window) onto the original array; the sliced sub-array is part of the original array. Therefore, modifying the slice through a variable that refers to it actually modifies the original array. For example:
import numpy as np
a = np.array([1,2,3,4,5])  # create an array of rank 1, i.e., a one-dimensional array
print(a[0], a[1], a[2])    # access the elements of a with the subscript operator [];
                           # output: 1 2 3
a[0] = 5                   # modify the value of element a[0] with subscript 0
print(a)                   # print the entire array; output: [5 2 3 4 5]
b = a[1:4]                 # a[1:4] returns a slice consisting of the elements from
                           # subscript 1 up to (but not including) subscript 4
print(b)
b[0] = 40                  # slice b is part of a; modifying b modifies the elements of a
print(b)
print(a)
print(a)
1 2 3
[5 2 3 4 5]
[2 3 4]
[40 3 4]
[ 5 40 3 4 5]
Figure 1-10 Array slices with subscripts from 1 to 4 (but not including 4)
Indexing and slicing work the same for multidimensional arrays, i.e. you
can index or slice any dimension.
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)
print(a[2,1])
print(a[2])
print(a[:,1])
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
10
[9 10 11 12]
[ 2 6 10 ]
a[2, 1] is the element in the 3rd row and 2nd column; a[2] is the 3rd row along the 1st axis; a[:, 1] is the 2nd column, where the : for the first axis (axis=0) means all row subscripts, combined with the column subscript 1. As shown in Figure 1-11:
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[[4 3 2]
[8 7 6]]
:2 selects the subscripts (0, 1) of the first axis (axis=0), and -1:-4:-1 selects subscripts of the second axis (axis=1) starting from -1 with step -1, i.e., the subscripts (-1, -2, -3). As shown in Figure 1-12, the column subscripts are in reverse order.
[[1 2 3 4]
[5 6 7 8]]
[6 10]
Similarly, modifying either the array itself or the slice changes both, because the slice's data is part of the original array; the slice is a window onto the original array.
a[0,3]=100
print(a)
print(b)
[[ 1 2 3 100]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[[100 3 2]
[ 8 7 6]]
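The three-dimensional array printed below presumably comes from code like:
a = np.arange(27).reshape(3, 3, 3)
print(a)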
[[[ 0 1 2]
[ 3 4 5 ]
[ 6 7 8]]
[[ 9 10 11]
[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]
[24 25 26]]]
print(a[1, 2])
[15 16 17]
Given the two subscripts 1 and 2 for the 1st axis (axis=0) and 2nd axis (axis=1), the third subscript defaults to :, meaning all subscript values of the third axis are taken. The indexing process is shown in Figure 1-13:
Similarly, a[0, :, 1] selects the first element (the first plane) of the first axis and the second element (column) of the third axis.
print(a[0,:,1])
[1 4 7]
a[:, 1, 2] selects the 2nd element of the 2nd axis and the 3rd element of the 3rd axis, taking all subscript values of the 1st axis.
print(a[:,1,2])
[ 5 14 23 ]
When a numpy array is indexed with slices, the resulting array view is always a sub-array of the original array; that is, the elements of the sub-array are consecutive elements of the original array, because the index values in each dimension are consecutive (for example, 1:3 actually covers the index values 1 and 2). The sub-array obtained by slicing is a window onto the original array and shares data storage with that window region of the original array.
When indexing, you can also pass discrete integer values for each dimension, i.e., an array of integers per dimension. Integer array indexing constructs a new array. For example:
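A reconstruction consistent with the outputs below:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(a)
print(a[[0, 2], [1, 3]])   # selects the elements at (0,1) and (2,3)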
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[ 2 12 ]
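The next outputs show that integer-array indexing creates a new array that does not share storage with a; a reconstruction:
b = a[[0, 2], [1, 3]]
b[0] = 111        # modifying b leaves a unchanged
print(a)
print(b)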
[[ 1 2 3 4]
[ 5 6 7 8 ]
[ 9 10 11 12]]
[111 12]
Passing integer arrays as indices pairs up the subscripts of each axis into index tuples; above, the two index tuples (0, 1) and (2, 3) select two elements, as shown in Figure 1-14:
Figure 1-14 Indexing with an integer array
As with integer array indexing, Boolean array indexing is used to select the elements of an array that satisfy some condition, creating a new array object that does not share storage. For example:
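The Boolean index array used below is presumably created like this:
bool_idx = a > 2   # a Boolean array: True where the element is greater than 2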
print(a[bool_idx])  # select the elements where the Boolean value is True
print(a[a > 2])     # the two steps above can be combined into one
1. Element-by-element calculation
"Element-by-element" operations such as +, -, *, /, and % can be performed on two multidimensional arrays to produce a new array. For example:
a = np.array([[1,2,3],[4,5,6]])
b = np.array([[7,8,9],[10,11,12]])
print(a+b)
print(a*b)
print(b%a)
[[ 8 10 12]
[14 16 18]]
[[ 7 16 27]
[40 55 72]]
[[0 0 0]
[2 1 0]]
print(np.add(a,b))
print(np.subtract(a,b))
print(np.multiply(a,b))
print(np.divide(a,b))
[[ 8 10 12]
[14 16 18]]
[[-6 -6 -6]
[-6 -6 -6]]
[[ 7 16 27]
[40 55 72]]
[[0.14285714 0.25 0.33333333]
[0.4 0.45454545 0.5 ]]
Hadamard Product
The element-wise product is also known as the Hadamard product or Schur product. The Hadamard product of two vectors is the vector formed by the products of their corresponding elements. For example:
$$\begin{pmatrix} 1 \\ 2 \end{pmatrix} \odot \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 1 * 3 \\ 2 * 4 \end{pmatrix} = \begin{pmatrix} 3 \\ 8 \end{pmatrix}$$
2. Cumulative calculation
You can use numpy functions or ndarray methods to perform aggregate calculations on an ndarray object, such as the sum (sum()), minimum and maximum (min(), max()), mean (mean()), and standard deviation (std()).
a = np.array([[1,2,3],[4,5,6]])
print(np.max(a),a.max())
print(np.min(a),a.min())
print(np.sum(a),a.sum())
print(np.mean(a),a.mean())
print(np.std(a),a.std())
6 6
1 1
21 21
3.5 3.5
1.707825127659933 1.707825127659933
These functions can also specify which axis of the array to operate on, such
as:
print(a)
print(np.max(a,axis=0),a.max(axis=1))   # np.max(a,axis=0) finds the maximum along the direction of the 0th axis (1st dimension)
print(np.min(a,axis=0),a.min(axis=1))
print(np.sum(a,axis=0),a.sum(axis=1))
print(np.mean(a,axis=0),a.mean(axis=0))
print(np.std(a,axis=0),a.std(axis=0))
[[1 2 3]
[4 5 6]]
[4 5 6] [3 6]
[1 2 3] [1 4]
[5 7 9] [ 6 15]
[2.5 3.5 4.5] [2.5 3.5 4.5]
[1.5 1.5 1.5] [1.5 1.5 1.5]
3. Dot Product
The Hadamard product is an element-wise product, while the dot product of
tensors is a generalization of the vector dot product and the matrix product.
Geometrically, the dot product of two vectors is the product of their lengths
and the cosine of the angle between them, as shown in Figure 1-15:
x ⋅ y = ∥x∥₂ ∥y∥₂ cos(θ)
Therefore, for two vectors of fixed length: if the angle is 0, their dot
product is largest; if the angle is π, which is 180 degrees, the
dot product is smallest (a negative number); and if the angle is π/2,
which is 90 degrees, the dot product is 0.
The dot product of two vectors also equals the length of the projection
of one vector onto the other vector, multiplied by the length of the other vector.
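As a quick illustration of this geometric meaning (a sketch, not the book's code; the two vectors are arbitrary choices), the angle between two vectors can be recovered from their dot product and lengths:
import numpy as np
x = np.array([1.0, 1.0])
y = np.array([1.0, 0.0])
cos_theta = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.degrees(np.arccos(cos_theta)))   # 45.0, the angle between x and y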
Matrix product:
If the number of columns of a matrix A (m×n) equals the number of rows of a matrix B (n×l), these two
matrices can be multiplied to give a product matrix C (m×l), where the element cij is the dot
product of the i-th row vector of matrix A and the j-th column vector of matrix
B. That is:
cij = ∑ₖ aik bkj
As shown in Figure 1-16:
Figure 1-16 The element in row 2 and column 1 of the product matrix is the
dot product of the vector in row 2 of the first matrix and the vector in
column 1 of the second matrix
The product of a matrix A (m×n) and a column vector x (n×1):
Ax = ⎡a1,:⎤ x = ⎡a1,: x⎤ = ⎡a11 x1 + a12 x2 + ⋯ + a1n xn⎤
     ⎢a2,:⎥     ⎢a2,: x⎥   ⎢a21 x1 + a22 x2 + ⋯ + a2n xn⎥
     ⎢ ⋮  ⎥     ⎢  ⋮   ⎥   ⎢             ⋮              ⎥
     ⎣am,:⎦     ⎣am,: x⎦   ⎣am1 x1 + am2 x2 + ⋯ + amn xn⎦
You can use numpy's dot() function or the ndarray dot() method to calculate the
dot product of vectors and the product of matrices. The dot() function of
numpy accepts 2 multidimensional arrays and performs the dot product
(multiplication) operation on them:
numpy.dot(a, b, out=None)
a= np.array([1,3])
b= np.array([2,5])
print("a*b:",a*b)
print("dot(a,b):",np.dot(a,b)) #The dot product of
two vectors is a value (scalar)
a*b: [ 2 15]
dot(a,b): 17
a= np.array([[1,2,3],[4,5,6]])
b = np.array([2,5])
c = np.array([2,5,3])
print("a.shape:",a.shape)
print("b.shape:",b.shape)
print("c.shape:",c.shape)
#print("dot(a,b):",np.dot(a,b))
print("dot(b,a):",np.dot(b,a))
print("dot(a,c):",np.dot(a,c))
a.shape: (2, 3)
b.shape: (2,)
c.shape: (3,)
dot(b,a): [22 29 36]
dot(a,c): [21 51]
a= np.array([[1,2,3],[4,5,6]])
b= np.array([[2,5],[1,3],[4,5]])
print("a.shape:",a.shape) # 2*3 matrix
print("b.shape:",b.shape) # 3*2 matrix
print("dot(a,b):",np.dot(a,b))
print("matmul(a,b):",np.matmul(a,b))
print("a@b:",a@b)
For 1-D vectors, np.dot, np.matmul, and the @ operator all give the same scalar result (17 for the vectors a and b used earlier). For the 2-D matrices here, all three compute the matrix product:
a.shape: (2, 3)
b.shape: (3, 2)
dot(a,b): [[16 26]
[37 65]]
matmul(a,b): [[16 26]
[37 65]]
a@b: [[16 26]
[37 65]]
4. Broadcasting
Broadcasting is a powerful mechanism that enables numpy to perform
arithmetic operations on arrays of different shapes. For example, when a
number and an array are combined in an operation, it is as if the number
were first turned into an array of the same size, and the operation then
performed element by element. For example, a+3 below is equivalent to
a + np.array([[3,3],[3,3]]). As shown in Figure 1-17:
a = np.array([[1,2],[3,4]])
print(a)
print(a+3)
print(a+ np.array([[3,3],[3,3]]))
[[1 2]
[3 4]]
[[4 5]
[6 7]]
[[4 5]
[6 7]]
print(a*3)
print(a/3)
[[ 3 6]
[ 9 12]]
[[0.33333333 0.66666667]
[1. 1.33333333]]
b = np.array([1,2])
print(a+b)
[[2 4]
 [4 6]]
In a+b, the axis=0 of b has only 1 element (1 row) while the axis=0 of a
has 2 elements (2 rows), so a+b is computed as if b were repeatedly
stacked along axis=0 into an array of the same size as a before the
element-wise operation. As shown in Figure 1-18.
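The code for the next example is not shown in this excerpt; a sketch that reproduces the following output (the arrays are assumptions read off the output):
a = np.array([[1], [2], [3]])   # shape (3,1)
b = np.array([4, 5])            # shape (2,)
print(a)
print(b)
print(a + b)                    # broadcast to shape (3,2)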
[[1]
[2]
[3]]
[4 5]
[[5 6]
[6 7]
[7 8]]
Figure 1-19 The addition of two arrays of shapes (3,1) and (1,2) is
promoted to the addition of two (3,2) two-dimensional arrays
When two arrays are combined in an operation, the rules for broadcasting are:
If the ranks (numbers of dimensions) of the arrays differ, prepend axes of
length 1 to the array with the smaller rank until the two arrays have the same
rank. For example, a number (rank 0) operated with an array of nonzero rank
is expanded to the same shape as the array.
1.3.1 Functions
Regarding functions, there are different definitions and descriptions. For
example, the area s of a square is a function of its side length e: a
mapping relationship e → s, that is, the side length e is mapped to the area
s. It can also be regarded as an input-output transformation s(e), that is,
x → f → f(x)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-3, 3, 0.1) #
y = 2*x+1
plt.scatter(x, y, s=6)
plt.legend(['f(x)=2x+1'])
plt.show()
f(x) = x² is a quadratic function; all points (x, f(x)) describe a parabola. Other
common basic functions include the exponential function f(x) = eˣ and the sine function
f(x) = sin(x). The following code plots these curves (Figure 1-21).
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-3, 3, 0.1) #
y = np.sin(x)
y0 = np.full(x.shape, 2)
y1 = 2*x
y2 = x**2
y3 = np.exp(x)
fig = plt.gcf()
fig.set_size_inches(20, 4, forward=True)
plt.subplot(1, 5, 1)
plt.scatter(x, y, s=6)
plt.legend(['sin(x)'])
plt.subplot(1, 5, 2)
plt.scatter(x, y0, s=6)
plt.legend(['$2$'])
plt.subplot(1, 5, 3)
plt.scatter(x, y1, s=6)
plt.legend(['$2x$'])
plt.subplot(1, 5, 4)
plt.scatter(x, y2, s=6)
plt.legend(['$x^2$'])
plt.subplot(1, 5, 5)
plt.scatter(x, y3, s=6)
plt.legend(['$e^x$'])
plt.axis('equal')
plt.show()
Figure 1-21 f(x) = sin(x), f(x) = 2, f(x) = 2x, f(x) = x², f(x) = eˣ
Both the linear function y = 2x and the exponential function y = eˣ have
function values (dependent variables) that increase as x increases, but the
exponential function grows very fast. When people say a quantity grows
exponentially, they mean that the quantity grows very quickly.
Arithmetic
The four arithmetic operations refer to constructing a new function by
performing addition, subtraction, multiplication, or division on two
functions. If there are 2 functions f(x), g(x), these 2 functions
transform x into f(x) and g(x) respectively:
f : x → f(x)
g : x → g(x)
If a new transformation relation is defined that transforms each x to
f(x) + g(x), then this is a new functional relation:
x → f(x) + g(x)
This new function x → f(x) + g(x) is called the sum function of the
original 2 functions.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-2, 2, 0.1) #
y = x**2
y2 = np.exp(x)
y3 = x**2 + np.exp(x)
plt.plot(x, y)
plt.plot(x, y2)
plt.plot(x, y3)
plt.legend(['$x^2$','$e^x$','$x^2+e^x$'])
fig = plt.gcf()
fig.set_size_inches(4, 4, forward=True)
#plt.axis('equal')
plt.xlim([-3,3])
plt.show()
Figure 1-22 The curves of the functions x² and eˣ and their sum function x² + eˣ
Similarly, the difference function f(x) − g(x), the product function
f(x)g(x), and the quotient function f(x)/g(x) can be defined.
Composite
Since a function is a transformation, an input-output device, inputting
a quantity x into a function g generates an output g(x); using this
g(x) as the input of another function f produces f(g(x)):
x → g → g(x) → f → f(g(x))
Using the output of one function as the input of another function constitutes
a new transformation, a new function. This new function, formed by
concatenating the original two functions g and f, is called a composite
function, written f ∘ g : x → f(g(x)),
namely f ∘ g(x) = f(g(x)).
For example, the sigmoid function
σ(x) = 1/(1 + e⁻ˣ)
is built from simple functions: the denominator 1 + e⁻ˣ can be regarded as
the sum of the constant function 1 and the function e⁻ˣ; the function e⁻ˣ can
be regarded as the composite of eᵘ and u = −x; and σ(x) itself is the
composite of 1/u and u = 1 + e⁻ˣ.
The following code plots the σ(x) function curve:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-7, 7, 0.1)
y = 1/(1+ np.exp(-x) )
plt.plot(x, y)
plt.legend([r'$\frac{1}{1+e^{-x}}$'])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
It can be seen that the value of the function lies in the [0,1] interval, and the
function value increases with x; that is, the function is monotonically
increasing: for any two numbers x1 < x2, f(x1) < f(x2). At x = 0,
σ(0) = 1/(1 + e⁰) = 1/2 = 0.5; when x → +∞ the value approaches 1, and when
x → −∞ it approaches 0.
1. Limit of a sequence
Consider the sequence {1, 1/2, 1/3, ⋯, 1/n, ⋯}. As n increases infinitely,
the corresponding number 1/n becomes smaller and smaller,
getting closer and closer to the value 0, that is, infinitely approaching 0:
the sequence gradually converges to 0 as n increases infinitely. This 0 is
called the limit of this sequence. The so-called infinite approximation
means that as long as n is sufficiently large, the distance between 1/n and its
limit value 0 is sufficiently small. In other words, for an arbitrarily small
number such as ε = 0.001, an n can always be found so that the distance
between all numbers in the sequence after the n-th term and the limit is
smaller than this ε. For example, after n = 1000 terms, |1/n − 0| < ε.
For another example, it can be proved that the limit of the sequence
{3 − 1, 3 − 1/2, 3 − 1/3, ⋯ , 3 − 1/n, ⋯} is 3.
lim_{n→∞} 1/n = 0
This formula says that as n tends to infinity (n → ∞), the limit value of the
sequence on the left is 0.
A sequence may not have a limit; but if a limit exists, it must be unique.
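A numerical illustration of this convergence (a sketch, not the book's code): for ε = 0.001, every term after n = 1000 is within ε of the limit 0:
import numpy as np
n = np.arange(1, 100001)
seq = 1.0 / n                              # the sequence 1, 1/2, 1/3, ...
eps = 0.001
print(np.all(np.abs(seq[1000:]) < eps))    # True: beyond n = 1000, every term is within eps of 0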
2. Limit and continuity of function
Similarly, the limit lim_{x→x₀} f(x) of a function f(x) at a point x₀ can be defined: it means that when x
is sufficiently close to x₀, the value f(x) is sufficiently close to the limit. For example:
lim_{x→3} x² = 9
means that when the independent variable x is sufficiently close to 3, the value of the dependent variable f(x)
is sufficiently close to 9, that is, the limit of f(x) is 9. For example, for the sequence of independent
variables {3 − 1, 3 − 1/2, 3 − 1/3, ⋯} approaching 3, the limit of the sequence of function values
{(3 − 1)², (3 − 1/2)², (3 − 1/3)², ⋯} is 9.
If the limit value of the function at x₀ exists and is equal to the function value at x₀, i.e.:
lim_{x→x₀} f(x) = f(x₀)
then the function is said to be continuous at the point x₀. Intuitively speaking, the curve corresponding to f(x) is
not broken at x₀.
As shown in Figure 1-26, f(x) = x² is continuous at every independent variable x, which means the function
curve is connected without breaks, so the whole function is continuous. The function f(x) = |x| is
also continuous everywhere. But f(x) = sign(x), defined below, is discontinuous at x = 0:
f(x) = sign(x) = ⎧  1, x > 0
                 ⎨  0, x = 0
                 ⎩ −1, x < 0
Let Δx₀ = x − x₀ and Δf(x₀) = f(x) − f(x₀); then lim_{x→x₀} f(x) = f(x₀) can be expressed as: lim_{Δx₀→0} Δf(x₀) = 0.
The continuity of the function at the independent variable x₀ means that as Δx₀ tends to 0, Δf(x₀) also
tends to 0.
3. Derivatives of functions
The continuity of a function y = f(x) at a point describes whether the dependent variable y = f(x)
changes continuously as the independent variable varies near that point. Sometimes it
is necessary to examine further how quickly the dependent variable changes with the independent variable. For
example, let t represent time, and s the distance traveled by a moving object. Obviously
s changes continuously with t; it will not suddenly jump from one point to another at some moment.
For a moving object, we often care more about how fast it moves, that is, its speed. The average speed
during a period can be expressed as the distance traveled divided by the elapsed time: from time t₀ to time t₀ + Δt,
the elapsed time is (t₀ + Δt) − t₀ = Δt and the distance traveled is s(t₀ + Δt) − s(t₀); their ratio represents the
average speed during this period. To obtain the exact speed at time t₀, compute the limit of this average speed as
Δt tends to 0, that is, the limit value as Δt → 0, and use this limit value as the exact velocity at time t₀:
lim_{Δt→0} (s(t₀ + Δt) − s(t₀))/Δt
In calculus, this limit value is called the derivative of the function s(t) at t₀, written s′(t₀) or
ds/dt|_{t₀}, namely:
s′(t₀) = ds/dt|_{t₀} = lim_{Δt→0} (s(t₀ + Δt) − s(t₀))/Δt
In the same way, the derivative of any function y = f(x) at a point x₀ can be defined as the limit of the ratio of
the increment of the dependent variable to the increment of the independent variable. This derivative
characterizes how fast the dependent variable y changes with the independent variable x at the point x₀: the
larger the absolute value of f′(x₀), the faster y changes.
For example, for f(x) = x²:
f′(3) = lim_{Δx→0} (f(3 + Δx) − f(3))/Δx = lim_{Δx→0} ((3 + Δx)² − 3²)/Δx = lim_{Δx→0} (6 + Δx) = 6
f′(1) = lim_{Δx→0} (f(1 + Δx) − f(1))/Δx = lim_{Δx→0} ((1 + Δx)² − 1²)/Δx = lim_{Δx→0} (2 + Δx) = 2
f′(0) = lim_{Δx→0} (f(0 + Δx) − f(0))/Δx = lim_{Δx→0} ((0 + Δx)² − 0²)/Δx = lim_{Δx→0} Δx = 0
This shows that at x = 0, when x has a small increment Δx, the increment of y is about 0 times Δx: y hardly
changes. At x = 1, when x increases by a small increment Δx, the increment of y is about 2 times that,
i.e. 2Δx. And at x = 3, when x has a small increment Δx, the increment of y is about 6 times it, i.e. 6Δx.
The derivative therefore characterizes how quickly the dependent variable y changes relative to the independent
variable x. The larger the absolute value of the derivative, the more a small increment of x causes a drastic
change in y; the smaller the absolute value (for example, close to 0), the smaller the change in y caused by a
small increment of x, that is, y changes slowly relative to x, just like a moving object that has almost stopped
as time passes.
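These limit values can be checked numerically with a central-difference approximation (a sketch, not the book's code):
import numpy as np

def numeric_derivative(f, x, h=1e-6):
    # central difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**2
for x0 in [0.0, 1.0, 3.0]:
    print(x0, numeric_derivative(f, x0))   # approximately 0, 2, 6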
a) The ratio Δy/Δx between the two points (1, 1³) and (1.8, 1.8³) is the rate of change (speed) of the dependent
variable with respect to the change of the independent variable; this ratio is called the slope of the line through
the two points. b) Δy/Δx for Δx = 0.8, 0.6, 0.4 at x = 1. c) As Δx → 0, this ratio (slope) converges to the slope
of the tangent to the curve at x = 1.
If a function f(x) has a derivative at every point x, the function is said to be differentiable everywhere; each x
then corresponds to a derivative value f′(x), so the mapping x → f′(x) is itself a functional relationship.
Such a function is called the derivative function of the original function, denoted f′(x). For example, for f(x) = x²:
f′(x) = lim_{Δx→0} ((x + Δx)² − x²)/Δx = lim_{Δx→0} (2x + Δx) = 2x
According to the definition, it is easy to find the derivative functions of the following elementary functions:
(1) (C)′ = 0;                    (2) (xⁿ)′ = n xⁿ⁻¹ (n ∈ Q)
(3) (sin x)′ = cos x;            (4) (cos x)′ = −sin x
(5) (aˣ)′ = aˣ ln a;             (6) (eˣ)′ = eˣ
(7) (log_a x)′ = (1/x) log_a e;  (8) (ln x)′ = 1/x
1.3.4 The Four Arithmetic Operations of Derivatives and the Chain Derivation Rule
It is unrealistic to compute the derivative function from the limit definition of the derivative for every
function that may be encountered. Fortunately, since all kinds of functions can be constructed from simpler ones
through the four arithmetic operations and function composition, it is easy to prove that the derivative of a
function constructed this way can be computed from the derivatives of the functions that construct it.
For example, (f(x) + g(x))′ = f′(x) + g′(x), that is, the derivative of the sum function is the sum of the
derivatives of the original two functions. According to the limit definition of the derivative, it is easy to prove
the following formulas for the derivatives of functions constructed by the four arithmetic operations:
(f(x) + g(x))′ = f′(x) + g′(x)
(f(x) − g(x))′ = f′(x) − g′(x)
(f(x)g(x))′ = f′(x)g(x) + f(x)g′(x)
(f(x)/g(x))′ = (f′(x)g(x) − f(x)g′(x)) / g(x)²
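A quick numerical sanity check of the product rule at one point (a sketch, not the book's code; the functions sin, exp and the point 0.7 are arbitrary choices):
import numpy as np

def d(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)   # central difference

x0 = 0.7
lhs = d(lambda t: np.sin(t) * np.exp(t), x0)                      # (f g)'(x0)
rhs = d(np.sin, x0) * np.exp(x0) + np.sin(x0) * d(np.exp, x0)     # f'g + f g'
print(lhs, rhs)   # the two values agree to high precision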
Because the derivative of a constant function f(x) = C is 0, the derivative of the product Cf(x) of a
constant C and a function f(x) is:
(Cf(x))′ = C′f(x) + Cf′(x) = Cf′(x)
Thus (f(x)/C)′ = ((1/C)f(x))′ = (1/C)f′(x). Likewise, the derivative of the sum C + f(x) of a constant C and a
function f(x) is:
(C + f(x))′ = C′ + f′(x) = f′(x)
Similarly, the composite function f(g(x)) formed by combining two functions f(x) and g(x) has a derivative
related to the derivatives of the original functions by:
(f(g(x)))′ = f′(g(x))g′(x)
This derivation formula for composite functions is called the chain rule. To differentiate f(g(x)), first find the
derivative f′(g) of f with respect to g, then the derivative g′(x) of g with respect to x, and multiply the two.
For a composite function, an input variable x is always evaluated along the composition of the
function "from inside to outside": first compute g(x), then compute f(g(x)). The process of
computing the derivative of the final value f(g(x)) with respect to the input x runs in reverse: first compute
f′(g), then g′(x), and then multiply the two together. That is to say, the differentiation process finds
the derivative of each function in turn "from the outside to the inside".
For example, with g = x²:
(sin(x²))′ = sin′(g)g′(x) = sin′(g)(x²)′ = cos(g)(2x) = 2x cos(x²)
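The same chain-rule result can be spot-checked numerically (a sketch, not the book's code):
import numpy as np

def d(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.2
print(d(lambda t: np.sin(t**2), x0))   # numeric derivative of sin(x^2) at x0
print(2 * x0 * np.cos(x0**2))          # the chain-rule formula 2x cos(x^2)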
Similarly, for the sigmoid function σ(x) = 1/(1 + e⁻ˣ), the chain rule gives:
σ′(x) = e⁻ˣ/(1 + e⁻ˣ)² = (1/(1 + e⁻ˣ))(1 − 1/(1 + e⁻ˣ)) = σ(x)(1 − σ(x))
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def sigmoid(x):
return 1/(1+np.exp(-x))
x = np.arange(-7, 7, 0.1)
y = sigmoid(x)
dy = sigmoid(x)*(1-sigmoid(x))
plt.plot(x, y)
plt.plot(x, dy)
plt.legend(['$\sigma(x)$','$\sigma\'(x)$'])
plt.xlabel('x')
plt.ylabel('y')
plt.show()
It can be seen that the function curve of σ′(x) is a bell-shaped curve; for all x, σ′(x) > 0, and the derivative
value is largest at x = 0, where σ′(0) = σ(0)(1 − σ(0)) = 0.5 ∗ 0.5 = 0.25. As x tends to infinity, the derivative
value tends to 0. The magnitude of the derivative value shows how quickly the function value changes with the
independent variable. Therefore, σ(x) changes fastest at x = 0, and as x tends to infinity the change of the
function value becomes slower and slower.
The above four arithmetic rules and the chain derivation rule of compound functions can be easily proved
according to the definition of derivatives. Interested readers can prove it by themselves or refer to calculus
textbooks.
As shown in Figure 1-30, the forward calculation proceeds from x through the function g to get
g(x) = x², and then through the function f to get f(g) = sin(g) = sin(x²):
x → g → g(x) = x² → f → f(g) = sin(g) = sin(x²)
The derivative is computed in the reverse order:
f′(x) = f′(g)g′(x) = cos(g)(x²)′ = cos(g)·2x = cos(x²)·2x
Backpropagation (reverse-mode) differentiation is the core and most critical foundation of neural networks and
deep learning. If you understand reverse differentiation, you can easily understand the algorithmic principles of
deep learning.
1.3.6 Partial derivatives and gradients of multivariable functions
Sometimes the independent variable x is a vector composed of multiple components instead of a single value, that
is, x = (x1, x2, ⋯, xn) contains multiple components xj. A function f(x) that maps such an independent
variable x to a single numerical dependent variable is called a multivariate function and can be written
f : Rⁿ → R.
The derivative of f(x) with respect to a single component xj of x is called the partial derivative, denoted
∂f/∂xj; it reflects the rate of change of f(x) with respect to this component xj.
That is, the partial derivative treats the other variables as constants and xj as the variable, so the function
becomes a univariate function of xj, and its derivative with respect to xj is called the partial derivative of the
original function with respect to xj.
For example, if f(x, y) = 2x + y², its argument contains 2 components x and y; this function is a multivariate
function mapping the argument (x, y) to the function value f(x, y), that is, f : (x, y) → 2x + y². Its partial
derivatives are:
∂f/∂x = ∂(2x + y²)/∂x = d(2x)/dx = 2
∂f/∂y = ∂(2x + y²)/∂y = d(y²)/dy = 2y
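The partial derivatives can be approximated numerically by perturbing one component at a time (a sketch, not the book's code):
import numpy as np

def f(x, y):
    return 2 * x + y**2

x0, y0, h = 1.0, 3.0, 1e-6
df_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)   # approximately 2
df_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)   # approximately 2*y0 = 6
print(df_dx, df_dy)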
The gradient of f(x) with respect to x, ∇x f(x), is the vector of the partial derivatives of f(x) with respect to
each component xj:
∇x f(x) = df/dx = (∂f/∂x1, ⋯, ∂f/∂xj, ⋯, ∂f/∂xn) ∈ Rⁿ
The gradient gives a first-order approximation of the change of the function value:
f(x + Δx) − f(x) ≃ ∇f(x) ⋅ Δx
If f(x)'s independent variable x and the increment Δx are written in the form of column vectors:
x = ⎡x1⎤   Δx = ⎡Δx1⎤
    ⎢x2⎥        ⎢Δx2⎥
    ⎢⋮ ⎥        ⎢ ⋮ ⎥
    ⎣xn⎦        ⎣Δxn⎦
then, with the gradient written as a row vector, the dot product of the gradient vector and the increment vector can
be written in the form of a matrix product:
f(x + Δx) − f(x) ≃ ∇f(x)Δx = Δxᵀ∇f(x)ᵀ
If the gradient is also written as a column vector, the dot product of the gradient vector and the increment vector
can be written as the matrix product:
f(x + Δx) − f(x) ≃ ∇f(x)ᵀΔx = Δxᵀ∇f(x)
If the independent variable is written in the form of a row vector, and the gradient is written in the form of a
column vector, that is:
x = (x1, x2, ⋯, xn)
Δx = (Δx1, Δx2, ⋯, Δxn)
∇f(x) = (∂f/∂x1, ∂f/∂x2, ⋯, ∂f/∂xn)ᵀ
then:
f(x + Δx) − f(x) ≃ Δx∇f(x)
If the gradient ∇x f(x), f(x), and x are all written in row-vector form, then:
f(x + Δx) − f(x) ≃ ∇f(x)Δxᵀ = Δx∇f(x)ᵀ
For a multivariate function f(x), if the independent variable x is written in the form of a matrix, then the
gradient of f(x) with respect to x, although it is a vector, is sometimes also written in the form of a matrix of
the same shape as x, so it is easy to see which partial derivative corresponds to which variable, namely:
f′(x) = dy/dx = ⎡∂y/∂x11 ∂y/∂x21 ⋯ ∂y/∂xn1⎤
                ⎢∂y/∂x12 ∂y/∂x22 ⋯ ∂y/∂xn2⎥
                ⎢   ⋮       ⋮    ⋱    ⋮   ⎥
                ⎣∂y/∂x1n ∂y/∂x2n ⋯ ∂y/∂xnn⎦
Whether you write gradients, independent variables, and dependent variables as row vectors, column vectors, or
matrices depends entirely on which form is more helpful for deriving the related formulas. If x is written in
matrix form, writing the gradient in matrix form as well looks more consistent.
For example, if w and x are written in the form of column vectors, the gradients in the form of row vectors, and
y = wᵀx = xᵀw, then ∂y/∂x = wᵀ and ∂y/∂w = xᵀ.
It can be proved that the four arithmetic rules of derivatives and the chain rule also hold for gradients. Let f, g
be two real-valued functions from Rⁿ to R; then:
Linear rule:
(αf + βg)′(x) = ∇(αf + βg)(x) = αf′(x) + βg′(x) = α∇f(x) + β∇g(x)
Product rule:
(fg)′(x) = ∇(fg)(x) = f′(x)g(x) + f(x)g′(x) = g(x)∇f(x) + f(x)∇g(x)
Chain rule:
Let g be a real-valued function from Rⁿ to R and f a real-valued function from R to R. For some x ∈ Rⁿ, let
z = g(x). If x is a column vector and the gradient is in row-vector form, then:
(f ∘ g)′(x) = ∇(f ∘ g)(x) = f′(z)g′(x) = f′(z)∇g(x)
If instead the gradient is written as a column vector, then (f ∘ g)′(x) = ∇g(x)f′(z).
That is, the order of the two different forms of the chain rule is exactly the opposite.
Example 3: If g((x1, x2)) = 3x1 + 2x2³ and f(z) = z², then (f ∘ g)((x1, x2)) = (3x1 + 2x2³)². So:
(f ∘ g)′(x) = f′(z)∇g(x) = 2z ∗ (3, 6x2²) = 2(3x1 + 2x2³) ∗ (3, 6x2²) = (18x1 + 12x2³, 36x1x2² + 24x2⁵)
If the variable is a row vector and the gradient a column vector, with g(x1, x2) = 3x1 + 2x2³ and f(z) = z², then:
(f ∘ g)′(x) = ∇g(x)f′(z) = ⎡ 3  ⎤ 2z = ⎡ 3  ⎤ 2(3x1 + 2x2³) = ⎡18x1 + 12x2³   ⎤
                           ⎣6x2²⎦      ⎣6x2²⎦                 ⎣36x1x2² + 24x2⁵⎦
Example 4: Suppose y, ŷ are 2 vectors in Rⁿ. The square of their Euclidean distance can be used to define the
error (distance) between these 2 vectors, for example:
E(y, ŷ) = ½∥y − ŷ∥₂² = ½((y1 − ŷ1)² + (y2 − ŷ2)² + ⋯ + (yn − ŷn)²)
If there are m functions f1, f2, ⋯, fm of the same independent variable x:
f1 : x → f1(x)
f2 : x → f2(x)
⋮
fm : x → fm(x)
then they can be combined into a single function whose value is a vector:
f(x) = ⎡f1(x)⎤
       ⎢f2(x)⎥
       ⎢  ⋮  ⎥
       ⎣fm(x)⎦
These combined functions are called vector-valued functions. Input an x, and each function produces a function
value fi(x); these function values constitute the vector on the right side above. For example:
f(x) = ⎡ax⎤ with f1(x) = ax, f2(x) = x², f3(x) = eˣ, and f(3) = ⎡3a⎤
       ⎢x²⎥                                                     ⎢9 ⎥
       ⎣eˣ⎦                                                     ⎣e³⎦
The vector-valued function composed of m univariate functions is a mapping (transformation)
f(x) : R → Rᵐ of the real number set R to Rᵐ. If at a certain point x the derivative of each function fi(x)
with respect to x exists, these derivatives are stacked into a vector, which is called the derivative of the
vector-valued function with respect to the independent variable x, written Df(x):
Df(x) = f′(x) = df/dx = ⎡df1/dx⎤ ∈ R^(m×1)
                        ⎢df2/dx⎥
                        ⎢  ⋮   ⎥
                        ⎣dfm/dx⎦
If the independent variable x of a vector-valued function is a vector of more than one variable, such a vector-
valued function is called a multivariate vector-valued function. Let the number of independent variables be n and
the number of functions be m. This is a mapping (transformation) f : Rⁿ → Rᵐ of Rⁿ to Rᵐ: input n values of the
independent variables and output m real numbers.
Each function fi(x) has a gradient with respect to x = (x1, x2, ⋯, xn); stacking these gradient vectors gives a
matrix, called the Jacobian matrix:
Df(x) = f′(x) = df/dx = ⎡∂f1/∂x1 ∂f1/∂x2 ⋯ ∂f1/∂xn⎤ ∈ R^(m×n)
                        ⎢∂f2/∂x1 ∂f2/∂x2 ⋯ ∂f2/∂xn⎥
                        ⎢   ⋮       ⋮    ⋱    ⋮   ⎥
                        ⎣∂fm/∂x1 ∂fm/∂x2 ⋯ ∂fm/∂xn⎦
where f(x) = (f1(x), f2(x), ⋯, fm(x)). If instead each gradient is written as a column, the stacked matrix is the
transpose, in R^(n×m).
As a special case, the derivative of a vector x = (x1, x2, ⋯, xn) with respect to itself is an identity matrix I:
dx/dx = ⎡1 0 ⋯ 0⎤ = I
        ⎢0 1 ⋯ 0⎥
        ⎢⋮ ⋮ ⋱ ⋮⎥
        ⎣0 0 ⋯ 1⎦
Typically, the independent variable and vector-valued functions are written as column vectors, and the gradient of
each function as a row vector; as noted above, this book writes independent variables and vector-valued functions
in whichever form makes the derivation at hand clearest.
The Jacobian matrix is stacked from the gradient vectors of several multivariable real-valued functions;
therefore, the four arithmetic rules and the chain rule of the gradient also apply to it.
Let g be a vector-valued function from Rⁿ to Rᵏ and f a vector-valued function from Rᵏ to Rᵐ, and for some x let
z = g(x). If the vector-valued functions and arguments are in column-vector form, the chain rule reads:
(f ∘ g)′(x) = D(f ∘ g)(x) = Df(z)Dg(x) = f′(z)g′(x)
If the vector-valued functions and arguments, etc. are all in the form of row vectors, then:
(f ∘ g)′(x) = D(f ∘ g)(x) = Dg(x)Df(z) = g′(x)f′(z)
According to the four arithmetic rules of derivatives, for a vector x and a constant vector b, ∇x(αx + βb) = αI.
In the above Example 4, if E(y, ŷ) is regarded as a function of y, this function can be regarded as the composite
of the two functions z = y − ŷ and E(z) = ½∥z∥₂². The gradient of E(y, ŷ) with respect to y is:
E′(z) = zᵀ,  z′(y) = (y − ŷ)′ = I
∇y E(y, ŷ) = E′(z)z′(y) = zᵀz′(y) = zᵀI = zᵀ = (y − ŷ)ᵀ
Example 5: Let
z(x) = ⎡z1(x)⎤ = ⎡2x1 + 4x2 + 7x3⎤
       ⎣z2(x)⎦   ⎣3x1 + 5x2 + 4x3⎦
be a function of x, and
y = 4z1 + 3z2
be a function of z. Then f(x) = y(z(x)) is the composite function of y(z) and z(x); according to the
derivation rules of composite functions:
f′(x) = y′(z)z′(x) = (4, 3)⎡2 4 7⎤ = (17, 31, 40)
                           ⎣3 5 4⎦
Indeed, f(x) = 4(2x1 + 4x2 + 7x3) + 3(3x1 + 5x2 + 4x3) = 17x1 + 31x2 + 40x3.
If instead it is agreed to write the gradient in the form of a column vector, the chain rule is written in the
reverse order:
y′(z) = ⎡4⎤   z′(x) = ⎡2 3⎤   f′(x) = z′(x)y′(z) = ⎡2 3⎤⎡4⎤ = ⎡17⎤
        ⎣3⎦           ⎢4 5⎥                        ⎢4 5⎥⎣3⎦   ⎢31⎥
                      ⎣7 4⎦                        ⎣7 4⎦     ⎣40⎦
When deriving these formulas in the future, we must pay attention to whether the vectors such as gradients are
column vectors or row vectors.
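Example 5 can be reproduced directly with numpy (a small sketch, not the book's code; the arrays are just the Jacobian and gradient written out above):
import numpy as np
dz_dx = np.array([[2, 4, 7],
                  [3, 5, 4]])   # Jacobian of z with respect to x
dy_dz = np.array([4, 3])        # gradient of y with respect to z, row form
print(dy_dz @ dz_dx)            # [17 31 40]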
Example 6: Set z = xW, where x is a row vector and W is an m×n matrix, namely:
z = [z1 z2 ⋯ zn] = [x1 x2 ⋯ xm] ⋅ ⎡w11 w12 … w1n⎤
                                  ⎢w21 w22 … w2n⎥
                                  ⎢ ⋮   ⋮  ⋱  ⋮ ⎥
                                  ⎣wm1 wm2 … wmn⎦
Because ∂zi/∂xj = wji, the vector of partial derivatives ∂zi/∂x is w.i, the i-th column of W, so the Jacobian
matrix is dz/dx = W.
If y is a scalar function of z with gradient dy/dz = (∂y/∂z1, ∂y/∂z2, ⋯, ∂y/∂zn)ᵀ, consider the derivative of y
with respect to an entry wij of W. Since ∂zj/∂wij = xi and ∂zk/∂wij = 0 for k ≠ j:
dy/dwij = (∂y/∂z1)(∂z1/∂wij) + (∂y/∂z2)(∂z2/∂wij) + ⋯ + (∂y/∂zn)(∂zn/∂wij) = (∂y/∂zj) xi
Collecting these entries into a matrix of the same shape as W:
dy/dW = ⎡x1⎤ (∂y/∂z1, ∂y/∂z2, ⋯, ∂y/∂zn) = xᵀ(dy/dz)ᵀ ∈ R^(m×n)
        ⎢x2⎥
        ⎢⋮ ⎥
        ⎣xm⎦
Now set z = Wx + b, with x and b as column vectors, namely:
z = ⎡z1⎤ = ⎡w11 w12 … w1n⎤⎡x1⎤ + ⎡b1⎤
    ⎢z2⎥   ⎢w21 w22 … w2n⎥⎢x2⎥   ⎢b2⎥
    ⎢⋮ ⎥   ⎢ ⋮   ⋮  ⋱  ⋮ ⎥⎢⋮ ⎥   ⎢⋮ ⎥
    ⎣zm⎦   ⎣wm1 wm2 … wmn⎦⎣xn⎦   ⎣bm⎦
The Jacobian matrix of z = Wx + b with respect to x is dz/dx = W ∈ R^(m×n), and with respect to b it is the
identity matrix: f′(b) = dz/db = I ∈ R^(m×m).
If z is regarded as a function of the entries W = (w11, w12, ⋯, wmn), it is a multivariate vector-valued function,
and its Jacobian dz/dW ∈ R^(m×(m×n)) is sparse: the row for zi contains (x1, x2, ⋯, xn) in the positions of the
i-th row of W and zeros elsewhere:
dz/dW = ⎡x1 x2 ⋯ xn  0 ⋯ 0    ⋯⋯   0 ⋯ 0 ⎤
        ⎢0 ⋯ 0   x1 x2 ⋯ xn   ⋯⋯   0 ⋯ 0 ⎥
        ⎢                ⋮               ⎥
        ⎣0 ⋯ 0    0 ⋯ 0    ⋯⋯  x1 x2 ⋯ xn⎦
For easy identification, the derivative of z with respect to W, or the Jacobian matrix, can also be written in the
same form as W.
Finally, let L = f(z(x)) = ½∥z∥₂² be regarded as the composite function of f(z) = ½∥z∥₂² and z(x) = Wx − b. The
gradient of L with respect to x is:
∇x L = f′(z)z′(x) = zᵀW = (Wx − b)ᵀW
Writing Wi for the i-th row of W, so that zi = Wi x − bi, the gradient of L with respect to W is:
∇W L = f′(z)z′(W) = zᵀ dz/dW = ((W1 x − b1)x1, (W1 x − b1)x2, ⋯, (W1 x − b1)xn, ⋯⋯, (Wm x − bm)x1, ⋯, (Wm x − bm)xn)
This is the row-vector form of the gradient. Written in the same shape as W, it is:
∇W L = ⎡(W1 x − b1)x1  (W1 x − b1)x2  ⋯  (W1 x − b1)xn⎤   ⎡W1 x − b1⎤
       ⎢(W2 x − b2)x1  (W2 x − b2)x2  ⋯  (W2 x − b2)xn⎥ = ⎢W2 x − b2⎥ [x1 x2 ⋯ xn] = (Wx − b)xᵀ = zxᵀ
       ⎢      ⋮              ⋮        ⋱        ⋮      ⎥   ⎢    ⋮    ⎥
       ⎣(Wm x − bm)x1  (Wm x − bm)x2  ⋯  (Wm x − bm)xn⎦   ⎣Wm x − bm⎦
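The formula ∇W L = (Wx − b)xᵀ can be checked against a finite-difference gradient (a sketch with small random data, not the book's code):
import numpy as np

np.random.seed(0)
m, n = 3, 4
W = np.random.randn(m, n)
x = np.random.randn(n)
b = np.random.randn(m)

def L_of(W):
    z = W @ x - b
    return 0.5 * np.sum(z**2)        # L = 1/2 ||Wx - b||^2

analytic = np.outer(W @ x - b, x)    # the formula (Wx - b) x^T
numeric = np.zeros_like(W)
h = 1e-6
for i in range(m):
    for j in range(n):
        Wp = W.copy(); Wp[i, j] += h
        Wm = W.copy(); Wm[i, j] -= h
        numeric[i, j] = (L_of(Wp) - L_of(Wm)) / (2 * h)
print(np.allclose(analytic, numeric, atol=1e-5))   # True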
1.3.8 Integral
For the function f(x) in Figure 1-31, how to find the area of the shaded part below it?
The area under the curve can be approximated by accumulating the areas of small rectangles erected at points xi
evenly distributed on the interval, that is, ∑i f(xi) ∗ Δx, where Δx is the length of the small subinterval
containing xi. According to the idea of limits, as long as Δx is sufficiently small, the error between the above
cumulative sum and the real area will be small; that is, the real area S is the following limit value:
S = lim_{Δx→0} ∑i f(xi) ∗ Δx
This limit value is called the definite integral (integral) of the function f(x) on this interval. In calculus, a
special symbol ∫ₐᵇ f(x)dx is used to represent this limit value, where the meaning of dx is the differential of the
independent variable x.
Similarly, ∫ₐˣ f(x)dx represents the area on the interval [a, x]. When the value of x is constantly changing,
this value also changes, thus forming a function F : x → ∫ₐˣ f(x)dx. That is:
F(x) = ∫ₐˣ f(x)dx
So what is the derivative of F(x)? According to the definition of the derivative:
F′(x) = lim_{Δx→0} (F(x + Δx) − F(x))/Δx = lim_{Δx→0} (f(x) ∗ Δx)/Δx = f(x)
Of course, the second equality here is a bit imprecise; interested readers can consult a calculus textbook.
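A numerical illustration (a sketch, not the book's code): approximating the definite integral of f(x) = x² on [0, 1], whose true value is 1/3, by a Riemann sum with ever smaller Δx:
import numpy as np
for N in [10, 100, 1000, 10000]:
    dx = 1.0 / N
    xi = np.arange(N) * dx          # left endpoints of the N subintervals
    print(N, np.sum(xi**2) * dx)    # the Riemann sum approaches 1/3 as dx shrinks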
1.4 Probability Basics
This section introduces the basics of probability theory such as probability, random variables, expectation,
variance, etc.
1.4.1 Probability
Probability refers to the likelihood of an event occurring. Probability is a real number
between 0 and 1. If the probability of an event is 0, the event cannot happen, such as "the sun rises
from the west" or "people can live forever". If the probability of an event is 1, it is an inevitable
event, such as "a person will eventually die". Therefore, events with probability 1 or 0 are deterministic events:
they must happen or must not happen.
However, whether many events occur, and how likely they are to occur, is often uncertain; these are random events.
For example, the event "buying a lottery ticket and winning the jackpot" may or may not happen. A randomly
tossed coin may come up "heads" or "tails". Rolling a die, the number that appears may be any of 1, 2, 3, 4,
5, or 6. "Winning the jackpot by buying a lottery ticket" is a small-probability event, that is, its probability
is a small real number close to 0.
If a coin is fair, the chances, or probabilities, of heads and tails in a toss of the coin are the same. A random
experiment (such as a "coin toss") may have many different outcomes (random events), each possibly with a
different probability, but one of all these outcomes must occur; that is, the sum of the probabilities of all
outcomes equals 1. Therefore, if the probabilities of "heads" and "tails" in a coin toss are both p, then
2p = 1, that is, p = 1/2 = 0.5. Similarly, if a die with 6 numbers is perfectly symmetric and uniform in density,
the probability of each number appearing in a roll of the die is 1/6.
The capital letter P is usually used to denote probability; the probabilities of the 2 possible events in a coin
toss are P(heads) = 1/2 and P(tails) = 1/2, and the probability of a number i appearing in a die roll is:
P(the number i appears) = 1/6, i ∈ {1, 2, 3, 4, 5, 6}
Call "flipping a coin at random" a random experiment. The possible results (events) of a random experiment are
called sample points, and all possible results of a random experiment, that is, the collection of all sample points
are called sample space. A randomized experiment is usually denoted by a capital E, while a sample space is
usually denoted by a capital letter S, Ω, or U .
For the coin toss experiment, its sample space = {"heads", "tails"}. And for rolling a die,
its sample space = {"number 1 appears", "number 2 appears", "number 3 appears", "number 4
appears", "number 5 appears", "number 6 appears"}.
If the random trial is "randomly rolling the die 2 times", its sample space = {"number 1 appears the 1st
time, number 1 appears the 2nd time", "number 1 appears the 1st time, number 2 appears the 2nd
time", ⋯, "number 6 appears the 1st time, number 6 appears the 2nd time"}; that is, there
are a total of 36 possible results. Assuming that the probability of each number appearing on each roll is the
same, the probability of each result is equal, namely 1/36.
Let the random experiment E be "randomly draw a card from 52 playing cards and observe the rank of the card";
then its sample space is {A, 2, 3, ..., J, Q, K}, 13 sample points in total. If the random experiment E is
"randomly draw a card from 52 playing cards and observe its suit", then its sample space is {spades, hearts, clubs,
diamonds}, a total of 4 sample points. If the random experiment E is "randomly draw a card from 52 playing cards
and observe which card it is", the result must record both the rank and the suit, and the sample space is the
Cartesian product of the above two sample spaces: {(A, spades), (A, hearts), (A, clubs),
(A, diamonds), (2, spades), (2, hearts), (2, clubs), (2, diamonds), ..., (K, spades), (K, hearts), (K, clubs),
(K, diamonds)}, a total of 13 × 4 = 52 sample points.
The sample points of the sample space are called basic (elementary) events. A collection of several sample points
is also an event. For example, in the random experiment "rolling a die" there are 6 basic events, and these basic
events can be combined into other events; the event "the number appearing is no more than 3" = {"number 1
appears", "number 2 appears", "number 3 appears"} is the union of 3 basic events.
Among all random events, there are 2 special events: the event corresponding to the empty set, which is
written ∅, and the event corresponding to the full set (including all sample points), still represented by the
symbol Ω.
The possibility of a random event A is represented by a real number between 0 and 1, usually written with the
symbol P(A), that is, 0 ≤ P(A) ≤ 1. Obviously P(∅) = 0 and P(Ω) = 1; ∅ is called the impossible event and Ω the
inevitable (certain) event.
Mutually exclusive events (also called incompatible events): A and B cannot occur at the same time, that is, A
and B have no common sample points: A ∩ B = ∅.
Opposite events: a special case of mutually exclusive events; A and B cannot happen at the same time, but one
of A and B must happen. In set language: A ∩ B = ∅ and A ∪ B = Ω.
Classical probability model (classical probability): the sample space is finite, and every sample point has the
same probability of appearing. The probability of an event in classical probability = the number of sample points
contained in the event / the total number of sample points in the sample space.
For example, for "rolling a die", the total number of points in the sample space is 6, and the event "the number
that appears is less than 3" contains only 2 sample points (numbers 1 and 2). Therefore the probability P of this
event ("the number that appears is less than 3") = 2/6.
Of course, for general random experiments, the probabilities of the sample points (basic events) are usually not
equal. How do we determine the probability of an event? Usually, statistical methods are used: the random
experiment is repeated many times, say n times; if event A occurs k times in these experiments, the frequency of
event A is said to be k/n. Repeating such random experiments, when n is very large, according to the law of large
numbers in probability theory, this frequency approaches the true probability. That is:
P(A) = lim_{n→∞} k/n
For example, the following code simulates such trials (n coin tosses) with a function one_coin_test(n), which
returns the frequency of heads. It can be seen that as n increases, the frequency approaches the probability 0.5.
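The definition of one_coin_test is not shown in this excerpt; a minimal sketch consistent with its use below:
import numpy as np

def one_coin_test(n):
    # toss a fair coin n times (1 = heads) and return the frequency of heads
    return np.random.randint(0, 2, n).mean()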
for n in range(10,50000,2000):
print(one_coin_test(n),end=', ')
0.7, 0.47562189054726367, 0.4845386533665835, 0.4945091514143095, 0.5013732833957553,
0.5031968031968032, 0.49467110741049125, 0.5007137758743755, 0.5033104309806371,
0.4999444752915047, 0.49485257371314345, 0.5070422535211268, 0.5001665972511453,
0.5036908881199539, 0.5008568368439843, 0.5007664111962679, 0.5004998437988128,
0.4972655101440753, 0.4980560955290197, 0.5011312812417785, 0.5004498875281179,
0.4991906688883599, 0.5021586003181095, 0.49734840252119106, 0.49989585503020206,
A random experiment has many possible events, and each event has a probability. Mathematically, the probability
of these events is defined as a mapping from event to probability.
Assume that the entire set of all measurable events in the sample space Ω is F, and the probability P is a mapping
from F to the real number interval [0,1], that is, P : F → [0, 1]. The mapping must satisfy:
1. Normalization: P(Ω) = 1
2. Non-negativity: for any event A, P(A) ≥ 0
3. Additivity: for mutually exclusive events A1, A2, ⋯, P(A1 ∪ A2 ∪ ⋯) = P(A1) + P(A2) + ⋯
To give a medical example, let A denote "has hepatitis B" and B denote "the surface antibody test is positive".
The conditional probability P(B|A) is the probability that B occurs given that A has occurred. P(A) is the
probability that a randomly selected person "has hepatitis B"; P(B) is the probability that a randomly selected
person tests positive for the surface antibody. Then P(B|A) indicates the probability of "the surface antibody
being positive in the case of hepatitis B". Obviously, the prior probability P(B) and the conditional probability
P(B|A) are not equal, because "a randomly chosen person tests positive" is different from "a person who has
hepatitis B tests positive"; the latter should be more likely.
Joint probability P(A, B): the probability of A and B occurring at the same time; that is, the probability that a
randomly selected person both "has hepatitis B" and "tests positive". The joint probability P(A, B) is sometimes
also written P(A ∩ B), which indicates the probability that A and B occur at the same time, i.e. that their
intersection occurs.
The conditional probability can be computed from the joint probability and the prior probability:
P(A|B) = P(A, B)/P(B)
You can use "throwing a sieve" to help understand this formula. Let A mean "the number is greater than 3", B
means "the number is even", then (A, B) means "the number is greater than 3 and is even", and (B|A) means "the
number is even if the number is greater than 3".
"The number is greater than 3 and is even" has only 2 sample points {4,6} in the sample point space {1, 2, 3, 4, 5,
6}, so: P (A, B) = 2/6
Similarly, the probability of "the number is greater than 3" P (A) = 3/6
The sample point space set of "when the number is greater than 3" is {4, 5, 6}, that is, there are 3 sample points, of
which the even number is 2 {4, 6}, therefore, P (B|A) = 2/3
It can be verified:
P (A,B) 2/6
= = 2/3 = P (B|A)
P (A) 3/6
The joint probability can also be written as the product of a conditional probability and a prior probability:
P(A, B) = P(B|A)P(A) = P(A|B)P(B)
Two events are independent if and only if: P(A, B) = P(A)P(B).
Two events being mutually exclusive means that they cannot occur at the same time; for example, "heads" and
"tails" in a single coin toss are mutually exclusive. For mutually exclusive events A and B, the probability of A
and B occurring at the same time is obviously 0, that is, P(A, B) = 0.
For the sets A, B, their intersection and union have the relationship A ∪ B = A + B − (A ∩ B); therefore, if two
events A, B are not mutually exclusive, P(A ∪ B) = P(A) + P(B) − P(A, B), as shown in Figure 1-32. If A
and B are mutually exclusive, P(A ∪ B) = P(A) + P(B) − P(A, B) = P(A) + P(B).
If n events A1, A2, ⋯, An are mutually exclusive, and their union is the entire sample space, that is,
A1 ∪ A2 ∪ ⋯ ∪ An = Ω, then the total probability formula holds:
P(B) = ∑ᵢ₌₁ⁿ P(B|Aᵢ)P(Aᵢ)
According to the total probability formula, the conditional probability P(Aᵢ|B) can be calculated as follows
(Bayes' formula):
P(Aᵢ|B) = P(Aᵢ, B)/P(B) = P(B|Aᵢ)P(Aᵢ) / ∑ᵢ₌₁ⁿ P(B|Aᵢ)P(Aᵢ)
For example, let P(A) = 0.001 be the prior probability that an ordinary person has hepatitis B, and
P(Aᶜ) = 0.999 the prior probability of not having hepatitis B. Let
P(B|A) = 0.99 represent the probability of "testing positive for the surface antibody given hepatitis B", and
P(B|Aᶜ) = 0.01 the probability of "testing positive for the surface antibody without hepatitis B". Now, if a
person's surface antibody test is positive (B), what is the probability (possibility) that he has hepatitis B?
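Plugging these numbers into the Bayes formula above (a small computation, not from the book's code):
p_A = 0.001          # prior probability of having hepatitis B
p_notA = 0.999
p_B_given_A = 0.99   # positive test given the disease
p_B_given_notA = 0.01
p_A_given_B = p_B_given_A * p_A / (p_B_given_A * p_A + p_B_given_notA * p_notA)
print(p_A_given_B)   # about 0.09: still quite unlikely despite the positive test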
A random variable is, simply put, a mapping from the sample space Ω to the real number set R; that is, for each
sample point (basic event), there is a real number corresponding to it, as shown below:
Figure 1-33 A random variable is a mapping from a sample space to a set of real numbers
For example, the sample space of the random experiment "toss a coin and observe which side comes up" has only 2
sample points: heads and tails. A random variable X(ω) can be defined that maps the basic events "heads" and
"tails" to the two values 3 and 4 respectively, which can be written as:
X(ω) = { 3, ω = heads
       { 4, ω = tails
For the same experiment, different random variables can be defined. For example, two dice are randomly rolled, and
the entire sample space consists of 36 elements:
ω = {(i, j) | i = 1, …, 6; j = 1, …, 6}
You can define a random variable (mapping) X as the sum of the points of the two dice; this random variable X
can take 11 integer values:
X(ω) = X(i, j) := i + j, x = 2, 3, …, 12
You can also define a random variable (mapping) Y as the difference between the points of the two
dice; this random variable Y can take 6 integer values.
For another example, someone is waiting for a bus that arrives every 5 minutes. If the time the person arrives at
the stop is random, then the time he waits for the bus can be represented by a random
variable X(ω). If the sample space S = {waiting time}, the sample point itself is a real number, and the random
variable X(ω) is simply:
X(ω) = ω, ω ∈ Ω
If the value range of the random variable X(ω) is countable, X(ω) is called a discrete random variable,
otherwise, it is called a non-discrete random variable . Among non-discrete random variables, if the value range
is composed of some intervals, it is called continuous random variable. The random variable X(ω) is often
abbreviated as X, that is, the sample point ω is omitted.
P(x1), P(x2), ⋯, P(xn) is called the probability distribution column of the random variable. The functional
relationship of the random variable X from each of its possible values xᵢ to the corresponding probability P(xᵢ)
is the probability distribution law of the random variable.
For example, if a business can be reviewed as excellent, good, medium, or poor, a random variable X can be
used to map this group of sample points to 0, 1, 2, and 3. If it is already known from many previous reviews that
the probabilities of the merchant being rated excellent, good, medium and poor are 0.5, 0.3, 0.1, 0.1, then the
probability distribution law of the random variable X is:
P(X = 0) = 0.5, P(X = 1) = 0.3, P(X = 2) = 0.1, P(X = 3) = 0.1
The following code plots this probability distribution for the X values 0, 1, 2, 3. It can be seen that except at
these 4 integers, the probability of X taking any other value is 0.
import matplotlib.pyplot as plt
%matplotlib inline
x = [0,1,2,3]
p = [0.5,0.3,0.1,0.1]
plt.vlines(0, 0, 0.5,color="red")
plt.vlines(1, 0, 0.3,color="red")
plt.vlines(2, 0, 0.1,color="red")
plt.vlines(3, 0, 0.1,color="red")
plt.scatter(x,p)
plt.show()
If a discrete random variable takes only two values such as 0 and 1, the distribution of this binary random
variable is called a two-point distribution (also known as the 0-1 distribution or Bernoulli distribution). The
following formula gives the probabilities of the values 1 and 0 of such a binary random variable X:
P(X = 1) = ϕ, P(X = 0) = 1 − ϕ
It describes a random experiment with only 2 different basic outcomes, for example the problem of "tossing a coin
and getting heads or tails". In the binary classification problem of machine learning, this two-point distribution
is used to represent the probabilities that an object belongs to each of two classes.
Binomial distribution
Ask a question: what is the probability that "randomly tossing a coin n times, it comes up heads k times"? This
can be described by the binomial distribution. Each toss in the experiment "randomly flipping a coin n times"
follows the two-point distribution above, and any two coin flips are independent of each other. Use A to represent
the event "heads appears in the first k tosses, and all the following are tails", which is a joint event of n
independent events: "heads on the 1st toss" (denoted A1), "heads on the 2nd toss", ⋯, "heads on the k-th toss",
"tails on the (k+1)-th toss" (denoted B_{k+1}), ⋯, "tails on the n-th toss":
A = (A1, A2, ⋯, Ak, B_{k+1}, ⋯, Bn)
If the probability of heads in one toss is p, the probability of this joint event is pᵏ(1 − p)ⁿ⁻ᵏ.
According to the combination principle, in the event "n coin tosses, heads appearing k times", the number of ways
of choosing which k tosses come up heads is Cₙᵏ = n!/(k!(n−k)!). Therefore, according to the additivity of
probability, if a random variable X maps "n coin flips, k heads" to the integer k, then the probability
distribution of this discrete random variable X, called the binomial distribution, is:
P(X = k) = Cₙᵏ pᵏ(1 − p)ⁿ⁻ᵏ
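A small check of this formula (a sketch, not the book's code): for n = 10 tosses of a fair coin (p = 0.5), the probabilities P(X = k) sum to 1:
from math import comb
n, p = 10, 0.5
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(pmf[5])     # probability of exactly 5 heads: 252/1024, about 0.246
print(sum(pmf))   # 1.0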
For a continuous random variable, it is impossible to enumerate the probability of each possible value, and doing
so would be meaningless, just as it is infeasible and meaningless to measure the mass (weight) of a single point
inside an object, or to ask for the length of a single point on a real number interval.
Since it is meaningless to define the probability of a continuous random variable taking a single value, how can
the probability of the random variable taking different values be measured? Just as density is used to measure the
mass near a certain point inside a material, probability density can be used to measure the possibility of a
random variable taking a value near a certain value.
Just as the density of a substance is the limit value of the ratio of its mass to volume: for a point p inside the
object, its density is defined as:
ρ(p) = lim_{Δp→0} Δm/Δp
where Δp, Δm represent the volume and mass of a small region containing p. Their ratio reflects the average
density of the small region, and the limit value of the ratio as this small region tends to 0 precisely
characterizes the density at this point p (for a point, strictly speaking, it is the mass density).
Similarly, the probability (strictly speaking, probability density) of a continuous random variable at a certain
point x can be expressed similarly as:
p(x) = lim_{Δx→0} ΔP/Δx = lim_{Δx→0} (P(x + Δx) − P(x))/Δx
where Δx is a small interval containing x, and ΔP is the probability that the random variable falls in this small
interval; their ratio represents the average probability per unit length on this small interval. The limit of this
ratio as Δx tends to 0 accurately describes the probability density of the random variable at this point x.
Therefore, for some x, the probability P([x − dx, x + dx]) of a random variable X taking a value on
[x − dx, x + dx] can be approximated by 2dx ∗ p(x).
Example: assume the random variable X takes values uniformly on the interval [a, b]; then P([a, b]) = 1, and its
probability density at each point x is p(x) = 1/(b − a). That is, the probability density is the same value at
every point of [a, b]: the random variable is uniformly distributed on the interval [a, b].
Gaussian distribution
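The Gaussian (normal) probability density with mean μ and standard deviation σ is p(x) = (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). The plotting code below calls a helper gaussian(x, mu, sigma) that is not defined in this excerpt; a minimal sketch of it:
import numpy as np

def gaussian(x, mu, sigma):
    # Gaussian probability density with mean mu and standard deviation sigma
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))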
The following code plots the Gaussian probability density for different values of μ, σ:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-5, 5, 100)
plt.plot(x, gaussian(x,0,0.5))
plt.plot(x, gaussian(x,-2,0.7))
plt.plot(x, gaussian(x,0,1))
plt.plot(x, gaussian(x,1,2.3))
plt.legend(['$\mu=0,\sigma=0.5$','$\mu=-2,\sigma=0.7$','$\mu=0,\sigma=1$','$\mu=1,\sigma=2.3$'])
#plt.axis('equal')
plt.xlabel('x')
plt.ylabel('p(x)')
plt.show()
Figure 1-35 Gaussian curves with different means and standard deviations
This is the famous "bell-shaped curve". It can be seen that the probability density is largest at μ, and the
farther from μ, the smaller the probability density. That is, the probability of the random variable taking a
value near μ is greatest, and values that deviate from it are less likely to be taken. The smaller σ is, the
narrower the curve, indicating that the values of the random variable are more concentrated near μ. The Gaussian
distribution with μ = 0, σ = 1 is called the standard normal distribution.
The distribution function describes the probability that a random variable falls in the interval (−∞, x].
For example, the distribution function corresponding to the random variable X of the coin toss above (which takes
the value 3 with probability 0.3 and the value 4 with probability 0.7) can be calculated as a step function F(x):
F(x) = { 0,   x < 3
       { 0.3, 3 ≤ x < 4
       { 1,   x ≥ 4
The random variable cannot be less than 3, so the probability of falling in (−∞, x) for x < 3 is 0. For
3 ≤ x < 4, only the value 3 falls in (−∞, x], so the probability is 0.3. Since the values 3 and 4 both fall in
(−∞, x] for x ≥ 4, F(x) = 1 there.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
lines = [(-2, 3), (0, 0),'r',(3, 4), (0.3, 0.3),'g',(4, 10), (1, 1),'b']
plt.plot(*lines)
plt.scatter(3,0, s=50, facecolors='none', edgecolors='r')
plt.scatter(4,0.3, s=50, facecolors='none', edgecolors='g')
Figure 1-36 The distribution function corresponding to the random variable X of the coin toss is a stepped function
For a continuous random variable, if its probability density is p(x), the distribution function is the definite
integral of the probability density function p(x) over the interval (−∞, x):
F(x) = ∫_{−∞}^{x} p(t)dt
In turn, the probability density is the derivative of the distribution function, i.e. p(x) = F′(x).
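The plotting code below relies on a helper vec_gaussion_dist that is not defined in this excerpt; judging from the [0] indexing of its result, it is a vectorized quad integration of the Gaussian density. A minimal sketch, assuming the gaussian function defined above:
import numpy as np
from scipy.integrate import quad

def gaussion_dist(x, mu, sigma):
    # distribution function: integrate the density from -inf to x
    return quad(gaussian, -np.inf, x, args=(mu, sigma))

vec_gaussion_dist = np.vectorize(gaussion_dist)   # returns (values, errors)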
The following code plots the distribution function corresponding to the Gaussian probability density of the above
figure:
from scipy.integrate import quad
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(-5, 5, 100)
plt.plot(x, vec_gaussion_dist(x,0,0.5)[0])
plt.plot(x, vec_gaussion_dist(x,-2,0.7)[0])
plt.plot(x, vec_gaussion_dist(x,0,1)[0])
plt.plot(x, vec_gaussion_dist(x,1,2.3)[0])
plt.legend(['$\mu=0,\sigma=0.5$','$\mu=-2,\sigma=0.7$','$\mu=0,\sigma=1$','$\mu=1,\sigma=2.3$'])
plt.xlabel('x')
plt.ylabel('p(x)')
plt.show()
Figure 1-37 Gaussian distribution functions with different means and standard deviations
19.833333333333332
This average age is the mean of their ages; the mean is the average of a set of numbers. If this set of
numbers is (x1, x2, ⋯, xn), then the mean of this set of numbers is:
(x1 + x2 + ⋯ + xn)/n = (1/n) ∑ᵢ₌₁ⁿ xᵢ
If (x1, x2, ⋯, xn) are all the possible values of a random variable X, and the probability of the random
variable taking each of these values is the same, namely 1/n, this mean can be written as:
(1/n)(x1 + x2 + ⋯ + xn) = (1/n)x1 + (1/n)x2 + ⋯ + (1/n)xn
That is, the mean is the sum over all values of the probability of each value of the random variable multiplied by
that value. This mean is called the mathematical expectation of this random variable, or expectation for short:
the value of the random variable expected from an average point of view.
If the probabilities of the values of the random variable are not equal, for example the probability of taking xᵢ
is pᵢ, then the mean is:
p1x1 + p2x2 + ⋯ + pnxn
The letter E is usually used to represent the expectation of a random variable; the expectation of the random
variable X is written E[X]. That is:
E[X] = p1x1 + p2x2 + ⋯ + pnxn = ∑ᵢ₌₁ⁿ pᵢxᵢ
Assuming that the freshmen are only 18, 19, 20, or 21 years old, their age class can be represented by a random
variable X taking the values 0, 1, 2, or 3. The probability pᵢ = P(X = i), i = 0, 1, 2, 3 of the random variable
indicates the probability (percentage) that a student belongs to each age. The expectation of this random variable
X is:
E[X] = p0 ∗ 0 + p1 ∗ 1 + p2 ∗ 2 + p3 ∗ 3
If p0, p1, p2, p3 are 0.2, 0.4, 0.3, 0.1 respectively, then E[X] = 0.2 ∗ 0 + 0.4 ∗ 1 + 0.3 ∗ 2 + 0.1 ∗ 3 = 1.3.
If a function f(X) maps the random variable X representing the age class to the student's actual age, that
is, x = 0, 1, 2, 3 is mapped to the ages 18, 19, 20, 21, then, because the random variable X takes the values
0, 1, 2, 3 with probabilities p0, p1, p2, p3 respectively, the function value f(X) is also a random variable,
taking the values f(0), f(1), f(2), f(3) with the same probabilities p0, p1, p2, p3. Thus the expectation
E[f(X)] of the random variable f(X) can be calculated:
E[f(X)] = p0 ∗ f(0) + p1 ∗ f(1) + p2 ∗ f(2) + p3 ∗ f(3) = 0.2 ∗ 18 + 0.4 ∗ 19 + 0.3 ∗ 20 + 0.1 ∗ 21 = 19.3
Therefore, if a random variable X takes its values x1, ⋯, xᵢ, ⋯, xn with probabilities p1, ⋯, pᵢ, ⋯, pn, the
function value f(X) of the random variable X is also a random variable, and the expectation of f(X) is:
E[f(X)] = p1f(x1) + p2f(x2) + ⋯ + pnf(xn) = ∑ᵢ₌₁ⁿ pᵢf(xᵢ)
If X is a continuous random variable whose probability density of taking the value x is p(x), and f(X) is a
random variable dependent on X, then the expectations of X and of f(X) can be calculated by integrals:
E[X] = ∫ x p(x)dx,  E[f(X)] = ∫ f(x)p(x)dx
Let f(X) = X be the identity mapping; then the first formula can be obtained from the second, so the former can
be regarded as a special case of the latter.
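For the discrete case, the expectation is a one-line computation (a sketch, not the book's code, using the age example above):
import numpy as np
p = np.array([0.2, 0.4, 0.3, 0.1])   # probabilities of x = 0, 1, 2, 3
x = np.array([0, 1, 2, 3])
ages = np.array([18, 19, 20, 21])    # f maps x to the actual age
print(np.sum(p * x))                 # E[X] = 1.3
print(np.sum(p * ages))              # E[f(X)] = 19.3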
The average squared error between a random variable and its expectation can describe how concentrated or spread
out the values of the random variable are. The specific method is to square the error between each value and the
expectation, then average. For example, for the ages (18, 19, 20, 21, 22), whose mean is 20:
((18 − 20)² + (19 − 20)² + (20 − 20)² + (21 − 20)² + (22 − 20)²)/5 = (4 + 1 + 0 + 1 + 4)/5 = 2
while for the much more scattered numbers (1, 6, 18, 10, 65), whose mean is also 20:
((1 − 20)² + (6 − 20)² + (18 − 20)² + (10 − 20)² + (65 − 20)²)/5 = (361 + 196 + 4 + 100 + 2025)/5 = 537.2
This average of the squared errors is called the mean square error, or variance. It describes the
degree of divergence of the random variable from its expected value: the larger the variance, the more divergent
the values; the smaller the variance, the more concentrated the values of the random variable are near the
expected value.
For an equally probable random variable X whose values are $(x_1, x_2, \cdots, x_n)$, if its expectation (mean) is written $\mu = E(X)$, then the variance Var(X) is:

$$Var(X) = \frac{1}{n}\left((x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2\right) = \frac{\sum_{i=1}^n (x_i-\mu)^2}{n}$$
If the probabilities of the values of the random variable X are $p_1, \cdots, p_i, \cdots, p_n$ respectively, then the variance is:

$$Var(X) = p_1(x_1-\mu)^2 + p_2(x_2-\mu)^2 + \cdots + p_n(x_n-\mu)^2$$
If X is a continuous random variable whose probability density of taking the value x is p(x), then the variance Var(X) of X is:

$$Var(X) = E_{X\sim p}\left[(X - E[X])^2\right] = \int p(x)(x - E[X])^2\,dx$$
Therefore, the variance is itself an expectation: the expectation of the random variable $(X - E[X])^2$, i.e. the expectation (mean) of the squared error. By the linearity of expectation, it can be derived that:

$$Var(X) = E_{X\sim p}\left[(X - E[X])^2\right] = E[X^2] - E[X]^2$$
The variance is the average of the squared errors, and the square root of the variance is called the standard deviation, which can be represented by std(X). That is: $std(X) = \sqrt{Var(X)}$.

The symbols μ and σ are commonly used to represent the expectation and standard deviation, while the variance is represented by the symbol $\sigma^2$.
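The first numerical example above can be reproduced with numpy (a minimal sketch; the names are illustrative):

x = np.array([18, 19, 20, 21, 22])
mu = x.mean()                  # expectation (mean): 20.0
var = np.mean((x - mu)**2)     # variance: 2.0
print(mu, var, np.sqrt(var))   # standard deviation: sqrt(2) ≈ 1.414
print(x.var(), x.std())        # numpy's built-ins give the same values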
The dot product of two vectors characterizes their correlation. For example, for a = (1, 1), b = (−1, 1), the dot product $a \cdot b = 1 \cdot (-1) + 1 \cdot 1 = 0$ indicates that they are perpendicular to each other, i.e. uncorrelated. For a = (1, 1), b = (1, 1), the dot product $a \cdot b = 1 \cdot 1 + 1 \cdot 1 = 2$; these two vectors coincide, i.e. they are correlated. For a = (1, 1), b = (1, 0), the dot product $a \cdot b = 1 \cdot 1 + 1 \cdot 0 = 1$; the angle between the two vectors is 45 degrees, and their degree of correlation lies between the previous two cases.

Generally, for two vectors $x = (x_1, x_2, \cdots, x_n)$ and $y = (y_1, y_2, \cdots, y_n)$, their dot product can be used to measure their correlation.
Covariance is a measure of the correlation between two random variables. If two random variables X, Y take the values $(x_1, x_2, \cdots, x_n)$ and $(y_1, y_2, \cdots, y_n)$, the covariance Cov(X, Y) between them is defined as:

$$Cov(X, Y) = \frac{(x_1-\mu_X)(y_1-\mu_Y) + (x_2-\mu_X)(y_2-\mu_Y) + \cdots + (x_n-\mu_X)(y_n-\mu_Y)}{n}$$
That is, a dot product is taken of the values after their expectations have been subtracted, and the result is averaged. This form is very similar to the variance of a single random variable:

$$Var(X) = \frac{(x_1-\mu_X)(x_1-\mu_X) + (x_2-\mu_X)(x_2-\mu_X) + \cdots + (x_n-\mu_X)(x_n-\mu_X)}{n}$$
But the meanings of the two are different. V ar(X) describes the degree of divergence of the random variable from
the expected value, and Cov(X, Y ) describes the correlation between two random variables.
By the linearity of expectation, the following derivation can be made:

$$\begin{aligned} Cov(X, Y) &= E[(X-\mu_X)(Y-\mu_Y)] \\ &= E[XY - \mu_XY - \mu_YX + \mu_X\mu_Y] \\ &= E[XY] - \mu_X\mu_Y - \mu_X\mu_Y + \mu_X\mu_Y \\ &= E[XY] - \mu_X\mu_Y \\ &= E[XY] - E[X]E[Y] \end{aligned}$$
In machine learning, a sample may have multiple features. For example, a house may have features such as area, number of rooms, location, and number of floors. Each feature can be seen as a random variable, and some features may be correlated. Correlated features can interfere with each other's influence on a machine learning algorithm, so eliminating the correlation between features, or selecting features with low correlation, helps improve the performance of the algorithm. Correlation analysis can be performed on these features to help select good features, or to transform the original data to eliminate the correlation between features.
If a sample has 3 features, and their corresponding random variables are represented by $X_1, X_2, X_3$ respectively, the correlation between them can be analyzed by calculating their pairwise covariances. These covariance values can be expressed in the form of a matrix, called the covariance matrix, usually represented by the symbol Σ:

$$\Sigma = \begin{pmatrix} Cov(X_1,X_1) & Cov(X_1,X_2) & Cov(X_1,X_3) \\ Cov(X_2,X_1) & Cov(X_2,X_2) & Cov(X_2,X_3) \\ Cov(X_3,X_1) & Cov(X_3,X_2) & Cov(X_3,X_3) \end{pmatrix}$$

This is a symmetric matrix. If all the observed values of each random variable are lined up as a column, the values of all the random variables can be represented as a matrix:

$$X = (X_1, X_2, X_3)$$
If the values are equally probable, the covariance matrix can be calculated by centering X and averaging:

$$\bar{X} = X - E[X], \qquad \Sigma = \frac{1}{n}\bar{X}^T\bar{X}$$

X = X - np.mean(X, axis=0)
Sigma = np.dot(X.transpose(), X) / len(X)    # divide by n to average
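This calculation can be checked against numpy's built-in np.cov (a sketch; note that np.cov treats rows as variables by default and divides by n−1, hence rowvar=False and bias=True):

X = np.random.randn(100, 3)                # 100 samples with 3 features
Xc = X - np.mean(X, axis=0)                # center each feature
Sigma = Xc.transpose() @ Xc / len(X)       # covariance matrix as above
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))   # True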
Chapter 2 Gradient descent method
The core task of deep learning is to train a function model from sample data, that is, to find an optimal function that represents or describes these sample data. Solving for the best function model comes down to a mathematical optimization problem, more precisely, the problem of finding the extreme (optimal) value of a certain loss function. In deep learning, the gradient descent method is used to solve this optimization problem, i.e. to solve for the model parameters.
This chapter introduces the theoretical basis, algorithm principle and code implementation of the gradient
descent algorithm starting from the necessary conditions for the extreme value of the function, and
introduces different optimization strategies for updating the solution variables (parameters) in the gradient
descent method.
The function y = f(x) attains a minimum at a certain point $x_0$: this means that there is some positive number ϵ such that every x in the interval $(x_0-\epsilon, x_0+\epsilon)$ satisfies $f(x_0) \le f(x)$. $x_0$ is called a minimum point of the function, and $f(x_0)$ is called a minimum value of the function.

The function y = f(x) attains a maximum at a certain point $x_0$: this means that there is some positive number ϵ such that every x in the interval $(x_0-\epsilon, x_0+\epsilon)$ satisfies $f(x) \le f(x_0)$. $x_0$ is called a maximum point of the function, and $f(x_0)$ is called a maximum value of the function.
The minimum and maximum values are collectively referred to as extreme values, and minimum and maximum points are collectively referred to as extreme points.

If all x in the domain of the function f(x) satisfy $f(x_0) \le f(x)$, then $x_0$ is called the global minimum point of the function; if all x in the domain satisfy $f(x) \le f(x_0)$, then $x_0$ is called the global maximum point. That is, the global minimum is the smallest value over the whole domain, and the global maximum is the largest value over the whole domain; they are collectively referred to as the optimal values, and their points as the optimal points.
Necessary condition for a function extremum: if $x_0$ is an extreme point of the function f(x), and the function is differentiable at $x_0$, then $f'(x_0) = 0$ must hold, i.e. the derivative value at an extreme point must be 0.
For example, the previous function $f(x) = x^2$ attains its minimum at x = 0 (which is of course also a local minimum) and is differentiable there, so at x = 0 its derivative value $f'(0) = 2 \times 0 = 0$ must be 0.
This proposition is easy to prove. If $x_0$ is, say, a minimum point of the function f(x), there is an interval around $x_0$ on which $f(x_0+\Delta x) \ge f(x_0)$, so the numerator of the difference quotient $\frac{f(x_0+\Delta x) - f(x_0)}{\Delta x}$ is non-negative. When x tends to $x_0$ from the right, Δx is a positive number, so the limit of the quotient must be ≥ 0; when x tends to $x_0$ from the left, Δx is a negative number, so the limit must be ≤ 0. Since this limit exists (the function is differentiable at $x_0$), its value can only be 0.
From the limit formula, a rule can also be found: if the derivative at $x_0$ is a positive number, the function f(x) is monotonically increasing around this point, that is, if $x_1 < x_2$ then $f(x_1) < f(x_2)$; f(x) increases as x increases, or equivalently, if Δx is a positive number then Δy is also a positive number. For example, the derivative of $y = f(x) = x^2$ is $f'(x) = 2x$. When x is greater than 0 the derivative is positive, so the function curve is monotonically increasing; when x is less than 0 the derivatives are all negative, so the function curve is monotonically decreasing, that is, if $x_1 < x_2$ then $f(x_1) > f(x_2)$.
For the function $f(x) = x^3 - 3x^2 - 9x + 2$, setting its derivative $f'(x) = 3x^2 - 6x - 9 = 3(x+1)(x-3)$ to 0 gives two points $x_1 = -1$, $x_2 = 3$ where the derivative is 0. The monotonic behavior of this function can be read off from the sign of the derivative: on the interval $(-\infty, -1]$, f'(x) is positive, so f(x) is monotonically increasing; on the interval (−1, 3), f'(x) is negative, so f(x) is monotonically decreasing; and on the interval $[3, \infty)$, f'(x) is positive, so f(x) is monotonically increasing.
The following code draws the curve of this function and its derivative function, which can more intuitively
see the monotonic change and extreme point situation.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.arange(-3, 4, 0.01)
f_x = np.power(x,3)-3*x**2-9*x+2
df_x = 3*x**2-6*x-9
plt.plot(x,f_x)
plt.plot(x,df_x)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.legend(['f(x)', "df(x)"])
plt.axvline(x=0, color='k')
plt.axhline(y=0, color='k')
plt.show()
Figure 2-2 $f(x) = x^3 - 3x^2 - 9x + 2$ and the curve of its derivative $f'(x)$
Note that the above proposition only states a necessary condition at an extreme point of a function, not a sufficient condition. That is, the derivative $f'(x_0) = 0$ at some point $x_0$ does not mean that $x_0$ must be an extreme point. For example, the derivative $f'(0)$ of $f(x) = x^3$ at x = 0 is also 0, but this point is not an extreme point of the function. In fact, this function is a monotonically increasing curve, as shown in Figure 2-3.
x = np.arange(-3, 3, 0.01)
f_x = np.power(x,3)
plt.plot(x,f_x)
plt.xlabel('x axis label')
plt.ylabel('y axis label')
plt.axvline(x=0, color='k')
plt.axhline(y=0, color='k')
plt.show()
Obviously, the necessary condition for a function extremum can be extended to multivariate functions: for a multivariate function $f(x_1, x_2, \cdots, x_n)$, if the function attains an extreme value at a certain point $x^* = (x_1^*, x_2^*, \cdots, x_n^*)$ and the gradient at this point exists (that is, all partial derivatives exist), then the gradient at this point must be 0 (that is, each partial derivative value is 0). That is:

$$\left.\frac{\partial f(x_1, x_2, \cdots, x_n)}{\partial x_i}\right|_{x^*} = 0, \quad i = 1, 2, \cdots, n$$
2.2 Gradient descent method (gradient descent)
For a one-variable function f(x), if there is a small change Δx near a certain point x, then the change f(x+Δx) − f(x) of f(x) can be expressed in the following differential form:

$$f(x+\Delta x) - f(x) \simeq f'(x)\Delta x$$

That is, near x, if Δx and f'(x) have the same sign, then f'(x)Δx, i.e. f(x+Δx) − f(x), is a positive number; if Δx and f'(x) have opposite signs, then f'(x)Δx, i.e. f(x+Δx) − f(x), is a negative number. In particular, if $\Delta x = -\alpha f'(x)$ (where α is a small positive number), then $f(x+\Delta x) - f(x) \simeq -\alpha f'(x)^2$ is a negative number, that is, the value of f(x+Δx) will be smaller than f(x). In other words, if x moves by Δx along the direction $-f'(x)$ opposite to the derivative f'(x), the function value f(x+Δx) is smaller than the original f(x).
As shown in Figure 2-4, the function $f(x) = x^2 + 0.2$ at x = 1.5 has the function value f(x) = 2.45, and the derivative value f'(x) = 3.0, a positive number pointing in the positive direction of the x axis (the domain of f(x)), as shown by the long arrow in the figure.

Let α = 0.15, so $\Delta x = -\alpha f'(x) = -0.45$. Moving x by this Δx (in the direction of the blue arrow in the figure) gives $x_{new} = x + \Delta x = 1.05$, and the function value f(1.05) at the new x = 1.05 is 1.3025, which is the y coordinate of the blue point on the curve in the figure. Because Δx and f'(x) have opposite signs (one negative, one positive), this f(1.05) must be smaller than the original f(1.5).
Just keep repeating this process, that is, move x along the direction $-f'(x)$ opposite to its derivative f'(x); the function value at each new x must be smaller than the previous function value f(x). As x approaches the x value of the minimum point $x^*$, the derivative f'(x) also approaches 0 (because the derivative $f'(x^*) = 0$ at the extreme point $x^*$), so the increment Δx of each move gets closer and closer to 0.
This is the idea of the gradient descent method: starting from an initial x, the value of x is continuously updated with the following formula:

$$x = x - \alpha f'(x)$$
For the current x, moving x along its negative derivative (gradient) direction (i.e. $-f'(x)$) makes f(x) keep getting smaller. Ideally, the x minimizing f(x) is reached, where f'(x) = 0; then iterating the update no longer changes the value of x. As shown in Figure 2-5, x is updated iteratively and constantly approaches the extreme point.
Of course, the step of this movement (i.e. $-\alpha f'(x)$) cannot be too large, because according to the definition of the derivative, the above approximation formula only applies near x. If the step is too large, the optimal value of x may be skipped over, making the value of x oscillate back and forth, as shown in Figure 2-6.
The gradient descent method finds an approximate optimal solution. In order to avoid iterating endlessly, the following checks can be used to decide whether the solution is close enough to optimal:

The derivative value f'(x) is close enough to 0.

The number of iterations has reached the preset maximum number of iterations.
The following is the code of the gradient descent method, where the parameter df is a function computing the derivative f'(x) of the function f(x), x is the initial value of the variable, alpha is the learning rate, iterations is the number of iterations, and epsilon is used to check whether the value of df(x) = f'(x) is close to 0. This gradient descent function saves all the updated x values of the iteration process in a Python list object history and returns this object.
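The function listing itself does not appear at this point in the text; the following is a minimal sketch consistent with the description above and with how it is called below:

def gradient_descent(df, x, alpha=0.01, iterations=100, epsilon=1e-8):
    history = [x]                    # record every updated x
    for i in range(iterations):
        if abs(df(x)) < epsilon:     # stop when the derivative is close to 0
            break
        x = x - alpha*df(x)          # move x against the derivative direction
        history.append(x)
    return history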
For the above function $f(x) = x^3 - 3x^2 - 9x + 2$, its derivative is $f'(x) = 3x^2 - 6x - 9$. To find the minimum of f(x) near x = 1, this gradient_descent() function can be called:
df = lambda x: 3*x**2-6*x-9
path = gradient_descent(df,1.,0.01,200)
print(path[-1])
Get the extreme point x=2.999999999256501 of f (x). The points on the curve corresponding to x in the
iteration process can be drawn:
f = lambda x: np.power(x,3)-3*x**2-9*x+2
x = np.arange(-3, 4, 0.01)
y= f(x)
plt.plot(x,y)
Among them, the quiver function of matplotlib can use arrows to draw velocity vectors, and its function
format is:
quiver([X, Y], U, V, [C], **kw)
Where X, Y are 1D or 2D arrays, indicating the position of the arrow, and U, V are the same 1D or 2D
arrays, indicating the speed (vector) of the arrow. For other parameters, please refer to the official
documentation.
For multivariable functions, the principle of the gradient descent method is the same, except that the gradient is used instead of the derivative.

The global minimum of this function is at (3, 0.5). The function value can be calculated with the following python code:
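The definition of the function does not appear in the text; from the gradient code below, it is the Beale function, so a consistent definition is:

f = lambda x, y: ((1.5 - x + x*y)**2
                  + (2.25 - x + x*y**2)**2
                  + (2.625 - x + x*y**3)**2)
print(f(3, 0.5))   # 0.0: the minimum value at the global minimum point (3, 0.5)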
To draw this surface, first take some evenly distributed coordinate values on the x and y axes:
xmin, xmax, xstep = -4.5, 4.5, .2
ymin, ymax, ystep = -4.5, 4.5, .2
x_list = np.arange(xmin, xmax + xstep, xstep)
y_list = np.arange(ymin, ymax + ystep, ystep)
Then use the np.meshgrid() function to get the grid points (x, y) at their intersections according to the above
x_list and y_list, and calculate the function values corresponding to these grid coordinate points:
x, y = np.meshgrid(x_list, y_list)
z = f(x, y)
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')   # 3D axes (setup implied by the ax calls below)
ax.plot_surface(x, y, z, cmap='jet')         # draw the surface (a sketch; the original call is not shown)
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_zlabel('$z$')
ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))
plt.show()
The gradient directions at these grid points can be plotted on a 2D coordinate plane using matplotlib's quiver
function.
df_x = lambda x, y: 2*(1.5 - x + x*y)*(y-1) + 2*(2.25 - x + x*y**2)*(y**2-1) + 2*(2.625 - x + x*y**3)*(y**3-1)
df_y = lambda x, y: 2*(1.5 - x + x*y)*x + 2*(2.25 - x + x*y**2)*(2*x*y) + 2*(2.625 - x + x*y**3)*(3*x*y**2)
dz_dx = df_x(x, y)
dz_dy = df_y(x, y)
fig, ax = plt.subplots(figsize=(8, 8))       # (figure setup and quiver call implied by the text; a sketch)
ax.quiver(x, y, dz_dx, dz_dy, alpha=.5)      # gradient direction at each grid point
ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))
plt.show()
Figure 2-9 Contour lines of the function f(x, y) in the domain plane and the gradient directions at the grid points
In order to directly use the previous gradient descent method code, x in the previous gradient descent
method code can be represented by a numpy vector, and
if abs(df(x))<epsilon:
change into:
if np.max(np.abs(df(x)))<epsilon:
First combine the separated x and y coordinate arrays into one array (e.g. by stacking the flattened grids; the exact line is not shown in the text):

print(x.shape)
print(y.shape)
xy = np.vstack([x.ravel(), y.ravel()])    # one column per grid point
print(xy.shape)

(46, 46)
(46, 46)
(2, 2116)
You can define a gradient function df for this vectorized coordinate point x. The following code gives the vectorized version of the gradient function:

df = lambda x: np.array([2*(1.5 - x[0] + x[0]*x[1])*(x[1]-1)
                         + 2*(2.25 - x[0] + x[0]*x[1]**2)*(x[1]**2-1)
                         + 2*(2.625 - x[0] + x[0]*x[1]**3)*(x[1]**3-1),
                         2*(1.5 - x[0] + x[0]*x[1])*x[0]
                         + 2*(2.25 - x[0] + x[0]*x[1]**2)*(2*x[0]*x[1])
                         + 2*(2.625 - x[0] + x[0]*x[1]**3)*(3*x[0]*x[1]**2)])
The following code starts from x0=(3., 4.) to solve the extreme point of this surface:
x0=np.array([3., 4.])
print("initial point",x0,"gradient",df(x0))
path = gradient_descent(df,x0,0.000005,300000)
print("Extreme point:",path[-1])
Because the gradient at the initial x is very large, the learning rate α must be a very small number (such as 0.000005), otherwise the iteration will oscillate or blow up to infinity. The iteration finally converges to [2.70735828 0.41689171], but this is not the optimal point. This situation can be seen more intuitively by drawing the change of x during the iteration process.
ax.set_xlabel('$x$')
ax.set_ylabel('$y$')
ax.set_xlim((xmin, xmax))
ax.set_ylim((ymin, ymax))
path = np.asarray(path)
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
Figure 2-10 During the iteration process, the gradient value becomes smaller and smaller, and the
convergence becomes slower and slower
During the iterative process, the gradient value becomes smaller and smaller, and the same learning rate
makes the update of x very slow. Even after 100,000 iterations, it still fails to approach the optimal solution.
A natural approach is to use an adaptive learning rate, i.e. increase the learning rate when the gradient
becomes small. As an exercise, the reader can try to modify the gradient descent algorithm to get to the
optimal solution better and faster.
In order to ensure that the optimal solution can be approached better and faster, many improvements to the
gradient descent method have been proposed. These improvements use a changing learning rate or strategy
to update the solution variables (also called parameters). The update strategies or methods for variables
(parameters) include: Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam,
AdaMax, Nadam, AMSGrad, etc.
It should be noted that the function may be a multi-variable function, therefore, its variable x can be a vector
x composed of multiple values. Only some of the commonly used optimization strategies are described
below.
The basic gradient descent method updates x using only the currently calculated gradient, while the Momentum method's update vector considers not only the current gradient but also the previous update vector, that is, the update vector is considered to have inertia. Assuming that $v_{t-1}$ is the vector used for the last update, the current update vector is:

$$v_t = \gamma v_{t-1} + \alpha\nabla f(x)$$
$$x = x - v_t$$
This vector used to update x is called momentum. The momentum method regards the update vector as the velocity of a moving object, and velocity has inertia. Combining the previous update vector with the current gradient smooths out sharp changes in the gradient between iterations, i.e. the motion keeps its previous inertia: where the gradient is small there is still substantial motion, and the speed does not overshoot when the gradient suddenly increases. This method is like a heavy ball rolling downhill, maintaining a certain inertia while looking for the steepest descent path; ordinary gradient descent, by contrast, determines the speed only by the local steepness, rushing fast in steep places and hardly moving in flat places.
That is, v is a tensor with the same shape as x with an initial value of 0. In the iterative process, v is updated
first, and then the parameter x of the function is updated:
v = gamma*v+alpha* df(x)
x = x-v
The following is the gradient descent method based on the momentum method:
def gradient_descent_momentum(df, x, alpha=0.01, gamma = 0.8, iterations = 100,
epsilon = 1e-6):
history=[x]
v= np.zeros_like(x) # momentum
for i in range(iterations):
if np.max(np.abs(df(x)))<epsilon:
print("The gradient is small enough!")
break
v = gamma*v+alpha* df(x) # update momentum
x = x-v # Update variables (parameters)
history.append(x)
return history
Use the gradient descent method of this momentum method to solve the above problem:
path = gradient_descent_momentum(df,x0,0.000005,0.8,300000)
print(path[-1])
path = np.asarray(path)
[2.96324633 0.49067782]
It can be seen that the solution of the momentum method is very close to the optimal solution, as shown in the figure.
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
In the basic gradient descent method, the variable update is the product of the learning rate and the gradient, $\alpha\nabla f(x)$; a gradient that is too large or too small, or a learning rate that is too large or too small, will affect the convergence of the algorithm.

For a multivariate function, the magnitudes of the partial derivatives with respect to the different variables can vary widely. For example, the absolute values of the partial derivatives $\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}$ of a function $f(x_1, x_2)$ of two variables at a certain point $(x_1, x_2)$ may differ greatly, and it is inappropriate to use the same learning rate for both: a learning rate appropriate for one component is too large or too small for the other, resulting in oscillation or stagnation. That is, it is inappropriate to update directly with the following formula:

$$x_1 = x_1 - \alpha\frac{\partial f}{\partial x_1}, \qquad x_2 = x_2 - \alpha\frac{\partial f}{\partial x_2}$$
The Adagrad method, whose name means "adaptive (ada) gradient (grad)", divides each gradient component by a historical cumulative value of that component, which eliminates the problem of unbalanced gradient sizes across components. For the two components $(x_1, x_2)$, if the historical cumulative values $(G_1, G_2)$ of the respective components are calculated, the update becomes:

$$x_1 = x_1 - \alpha\frac{1}{G_1}\frac{\partial f}{\partial x_1}, \qquad x_2 = x_2 - \alpha\frac{1}{G_2}\frac{\partial f}{\partial x_2}$$
∂f
Use the notation g = ∇ f (x ) to represent the partial derivative
t,i θ t,i of the component x in the t-th
∂xi
i
iteration, the component gradient of all rounds from t'=1 to t'=t can be calculated as follows:
t 2
Gt,i = √ ∑ ′ g ′
t =1 t ,i
Dividing $g_{t,i}$ by $G_{t,i}$, the component update is:

$$x_{t+1,i} = x_{t,i} - \alpha\frac{1}{\sqrt{\sum_{t'=1}^{t} g_{t',i}^2}}\, g_{t,i}$$
In order to prevent the divisor from being 0, a small positive number ϵ can be added to the denominator, so that the parameter update formula of AdaGrad becomes:

$$x_{t+1,i} = x_{t,i} - \alpha\frac{1}{\sqrt{\sum_{t'=1}^{t} g_{t',i}^2} + \epsilon}\, g_{t,i}$$
It can be seen that the AdaGrad method eliminates the imbalance between the component gradient sizes. The parameter update formula of AdaGrad can be written in vector form:

$$x_{t+1} = x_t - \alpha\frac{1}{\sqrt{\sum_{t'=1}^{t} g_{t'}^2} + \epsilon} \odot g_t$$
The accumulated sum $\sum_{t'} g_{t'}^2$ can be recorded with a variable gl with an initial value of 0. In each round of iteration:

gl += df(x)**2
x = x - alpha*df(x)/(np.sqrt(gl)+epsilon)
The main advantage of the AdaGrad method is that it eliminates the influence of differing gradient magnitudes, so the learning rate can be set to a fixed value (typically 0.01) without continual adjustment during the iteration. The main disadvantage is that as the iteration proceeds, the cumulative sum $\sum_{t'=1}^{t} g_{t'}^2$ becomes larger and larger, because each term is a positive number. This can make learning slower and slower, or even come to a standstill. Also, forcing every component gradient to advance at a consistent pace may not be realistic, and can divert the direction of progress away from the direction of the optimal solution.
The code of the gradient descent method based on the Adagrad parameter update method is as follows:
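The function listing itself does not appear in the text; the following is a minimal sketch consistent with the AdaGrad update above and with the style of the other optimizers in this chapter:

def gradient_descent_Adagrad(df, x, alpha=0.01, iterations=100, epsilon=1e-8):
    history = [x]
    gl = np.zeros_like(x)            # cumulative sum of squared gradients
    for i in range(iterations):
        if np.max(np.abs(df(x))) < epsilon:
            print("The gradient is small enough!")
            break
        grad = df(x)
        gl += grad**2
        x = x - alpha*grad/(np.sqrt(gl)+epsilon)
        history.append(x)
    return history

# the call producing the result below is not shown in the text; presumably of the form:
# path = gradient_descent_Adagrad(df, x0, alpha, iterations); print(path[-1])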
[-0.69240717 1.76233766]
It can be seen that, due to the equalization of the component gradients, the direction of the variable updates deviates from the direction toward the optimal solution, and the iteration converges to another local optimum.
The Adadelta method improves on AdaGrad. In AdaGrad, the update vector is

$$\Delta x_t = -\frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t, \qquad x_{t+1} = x_t + \Delta x_t$$

Here $G_t = \sum_{t'=1}^{t} g_{t'}^2$ is the historical sum of squares of $g_t$. As the iteration proceeds, this value $G_t$ gets bigger and bigger, so $\Delta x_t$ gets smaller and smaller, and the convergence gets slower and slower. The solution is to replace the sum of squares $G_t$ with the mean of squares $E[g^2]_t = \frac{G_t}{t}$. This $E[g^2]_t$ can be calculated with the moving average method, that is, as an average of the historical value and the current squared gradient:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma)g_t^2$$
The Adadelta method goes a step further and applies the same moving-average treatment to the update vector itself, to make the change of the update vector smoother:

$$E[\Delta x^2]_t = \gamma E[\Delta x^2]_{t-1} + (1-\gamma)\Delta x_t^2$$

The final update vector is:

$$\Delta x_t = -\sqrt{\frac{E[\Delta x^2]_{t-1} + \epsilon}{E[g^2]_t + \epsilon}}\, g_t$$

Writing $RMS[\cdot]$ for the square root of these moving averages, this is:

$$\Delta x_t = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t}\, g_t, \qquad x_{t+1} = x_t + \alpha\Delta x_t$$
The decay rate parameter ρ of the Adadelta method is usually set to 0.9, and the initial values of $\Delta x_t$, $E[\Delta x^2]_t$, $E[g^2]_t$ are all 0. The code of the gradient descent method based on the Adadelta parameter update is as follows:
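The function listing does not appear in the text; the following is a minimal sketch consistent with the formulas above and the call below:

def gradient_descent_Adadelta(df, x, alpha=1.0, rho=0.9, iterations=100, epsilon=1e-8):
    history = [x]
    Eg2 = np.zeros_like(x)       # moving average of squared gradients E[g^2]
    Edx2 = np.zeros_like(x)      # moving average of squared updates E[Δx^2]
    for i in range(iterations):
        if np.max(np.abs(df(x))) < epsilon:
            print("The gradient is small enough!")
            break
        grad = df(x)
        Eg2 = rho*Eg2 + (1-rho)*grad**2
        dx = -np.sqrt(Edx2 + epsilon)/np.sqrt(Eg2 + epsilon)*grad
        Edx2 = rho*Edx2 + (1-rho)*dx**2
        x = x + alpha*dx
        history.append(x)
    return history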
path = gradient_descent_Adadelta(df,x0,1.0,0.9,300000,1e-8)
print(path[-1])
path = np.asarray(path)
[2.9386002 0.45044889]
It can be seen that the Adadelta method can also converge close to the optimal solution.
The idea of RMSprop is to divide each component of the gradient by an estimate of its length (the absolute value of that component), i.e. to normalize it to roughly unit length, so that the parameter x is always updated with an approximately fixed step size α. In order to estimate the length of each gradient component, RMSprop, similarly to the momentum method, keeps a moving average of the squared gradient values, i.e. of $f'(x)^2$.
The python code for updating model parameters by the RMSprop method is as follows:
v= np.ones_like(x)
#...
grad = df(x)
v = beta*v+(1-beta)* grad**2
x = x-alpha*(1/(np.sqrt(v)+epsilon))*grad
The code of the gradient descent method based on the RMSprop parameter update method is as follows:
def gradient_descent_RMSprop(df,x,alpha=0.01,beta = 0.9, iterations = 100,epsilon
= 1e-8):
history=[x]
v= np.ones_like(x)
for i in range(iterations):
if np.max(np.abs(df(x)))<epsilon:
print("The gradient is small enough!")
break
grad = df(x)
v = beta*v+(1-beta)*grad**2
x = x-alpha*grad/(np.sqrt(v)+epsilon)
history.append(x)
return history
[2.70162562 0.41500366]
This result is still some distance from the optimal solution; the number of iterations can be increased:
path = gradient_descent_RMSprop(df,x0,0.000005,0.99999999999,900000,1e-8)
print(path[-1])
path = np.asarray(path)
[2.9082809 0.47616156]
It can be seen that the basic convergence is close to the optimal solution, as shown in Figure 2-14:
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
The Adam method maintains moving averages of both the gradient and the squared gradient:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$

They are equivalent to the first-order and second-order moments of the gradient. Because their initial values are 0, Adam's authors observed that they are biased toward zero, especially in the early stages of the iteration and when the decay rates are small (i.e. $\beta_1, \beta_2$ close to 1). To correct this problem, the authors use the following correction formulas:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

The parameters are then updated with (as in the code below):

$$x_{t+1} = x_t - \alpha\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$
#https://towardsdatascience.com/adam-latest-trends-in-deep-learning-optimization-6be9a291375c
def gradient_descent_Adam(df,x,alpha=0.01,beta_1 = 0.9,beta_2 = 0.999, iterations = 100,epsilon = 1e-8):
    history=[x]
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, iterations+1):
        if np.max(np.abs(df(x)))<epsilon:
            print("The gradient is small enough!")
            break
        grad = df(x)
        m = beta_1*m+(1-beta_1)*grad        # first-order moment (moving average of the gradient)
        v = beta_2*v+(1-beta_2)*grad**2     # second-order moment (moving average of the squared gradient)
        m_1 = m/(1-np.power(beta_1, t))     # bias-corrected first moment
        v_1 = v/(1-np.power(beta_2, t))     # bias-corrected second moment
        x = x-alpha*m_1/(np.sqrt(v_1)+epsilon)
        history.append(x)
    return history
For the above problem, execute the gradient descent algorithm gradient_descent_Adam:
path = gradient_descent_Adam(df,x0,0.001,0.9,0.8,100000,1e-8)
#path = gradient_descent_Adam(df,x0,0.000005,0.9,0.9999,300000,1e-8)
print(path[-1])
path = np.asarray(path)
#plt.plot(path)
[2.99999653 0.50000329]
plot_path(path,x,y,z,minima_,xmin, xmax,ymin, ymax)
Figure 2-15 The Adam method can also converge to a near optimal solution
$$f'(x) \approx \frac{f(x+\epsilon) - f(x-\epsilon)}{2\epsilon}$$

That is, the quotient on the right side of the formula is used to approximate the derivative (gradient) of f(x) at x. If ϵ is small enough, this numerical derivative (gradient) should be close enough to the analytical derivative (gradient) on the left side.
Therefore, before training the model with the gradient descent method, the numerically calculated gradient
and the analytical gradient can be compared to verify that the analytical gradient is calculated correctly.
For example, for the previous function of two variables $f(x, y) = \frac{1}{16}x^2 + 9y^2$ used in the gradient descent method, the function value and the analytical gradient at a point $x = (x_0, x_1)$ are calculated by the following code.
f = lambda x: (1/16)*x[0]**2+9*x[1]**2
df = lambda x: np.array( ((1/8)*x[0],18*x[1]))
The following code snippet compares the errors of the analytical and numerical gradients at the point
x = [2., 3.]:
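The helper df_approx() is not listed in the text; a minimal sketch consistent with the call below, using the central-difference formula above:

def df_approx(x, eps):
    dx0 = (f([x[0]+eps, x[1]]) - f([x[0]-eps, x[1]]))/(2*eps)   # ∂f/∂x0
    dx1 = (f([x[0], x[1]+eps]) - f([x[0], x[1]-eps]))/(2*eps)   # ∂f/∂x1
    return dx0, dx1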
x = [2.,3.]
eps = 1e-8
grad = df(x)
grad_approx = df_approx(x,eps)
print(grad)
print(grad_approx)
print(abs(grad-grad_approx))
[ 0.25 54. ]
(0.2500001983207767, 54.00000020472362)
[1.98320777e-07 2.04723619e-07]
It can be seen that as long as the small increment eps used to compute the numerical gradient is small enough, the numerical gradient is close enough to the analytical gradient; this follows from the definition of the derivative. If the error between the two turns out to be relatively large, it means there may be a problem with the calculation of the analytical gradient, the function value, or the numerical gradient; in most cases the error lies in the calculation of the analytical gradient or the function value.
Before using the gradient descent method to solve for the optimal solution, this gradient verification method should be used to ensure that the calculation of the analytical gradient and the function value is correct. On this basis, the hyperparameters of the gradient descent method, such as the learning rate or momentum parameters, can then be tuned.
The parameter f accepted by this function is the function whose gradient is to be calculated, and params holds the parameters of that function; because f may have multiple parameters, params is a collection of them (such as a Python list or tuple). To be general, assume that each element x of params is a multidimensional array containing multiple elements.

In the inner loop, for the element x[idx] pointed to by each subscript idx of x, a small increment is applied to obtain x[idx] + eps and x[idx] − eps, the corresponding function values f() are computed, and then the difference approximation formula of the derivative is used to calculate the partial derivative corresponding to this x[idx], which is assigned to grad[idx]. Note: after each modification, x[idx] must be restored to its original value, otherwise it will affect the calculation of the other partial derivatives, and affect the value of params after exiting this function.
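A minimal sketch of numerical_gradient() consistent with this description (the actual version ships in the book's util.py, as noted below):

import numpy as np

def numerical_gradient(f, params, eps=1e-6):
    numerical_grads = []
    for x in params:
        grad = np.zeros_like(x)
        it = np.nditer(x, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            old_value = x[idx]
            x[idx] = old_value + eps          # f(..., x[idx]+eps, ...)
            f_plus = f()
            x[idx] = old_value - eps          # f(..., x[idx]-eps, ...)
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2*eps)
            x[idx] = old_value                # restore the original value!
            it.iternext()
        numerical_grads.append(grad)
    return numerical_grads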
You can use this general numerical gradient computation function to compute the numerical gradient of the
previous function:
x = np.array([2.,3.])
param = np.array(x) # The parameter param of numerical_gradient must be a
numpy array
numerical_grads = numerical_gradient(lambda:f(param),[param],1e-6)
print(numerical_grads[0])
[ 0.25 54.00000001]
Note that the first parameter f of numerical_gradient must point to a function object rather than the result of
a function call. It is wrong to write lambda:f(param) above as f(param) .
For a function f that contains some parameters such as param, usually the above lambda expression or the
following wrapper function fun can be used to return a function object that performs calculations on the
parameter param.
def fun():
return f(param)
numerical_grads = numerical_gradient(fun,[param],1e-6)
print(numerical_grads[0])
[ 0.25 54.00000001]
In the following chapters, this general numerical gradient calculation function numerical_gradient() will be
used to calculate the numerical gradient of the model function. This function and others are included in the
book's source code file util.py.
The various parameter update strategies can be wrapped behind a unified optimizer interface, whose base class is:

class Optimizator:
    def __init__(self,params):
        self.params = params
    def step(self,grads):
        pass
    def parameters(self):
        return self.params
params is a list of variables (parameters), and step() is used to update these parameters params according to
the gradient grads. For example, the parameter optimizer class SGD that defines the parameter optimization
strategy using the basic gradient descent method can be derived on the basis of this class:
class SGD(Optimizator):
def __init__(self,params,learning_rate):
super().__init__(params)
self.lr = learning_rate
def step(self,grads):
for i in range(len(self.params)):
self.params[i] -= self.lr*grads[i]
return self.params
Similarly, other parameter optimizers can be defined, such as SGD_Momentum of the momentum method:
class SGD_Momentum(Optimizator):
def __init__(self,params,learning_rate,gamma):
super().__init__(params)
self.lr = learning_rate
self.gamma= gamma
self.v = []
for param in params:
self.v.append(np.zeros_like(param) )
def step(self,grads):
for i in range(len(self.params)):
self.v[i] = self.gamma*self.v[i]+self.lr* grads[i]
self.params[i] -= self.v[i]
return self.params
This is a bowl-shaped surface, as shown in Figure 2-16. Its minimum is at the bottom of the bowl, that is, (0, 0) is the minimum point of the entire function, and the minimum value is 0.

Figure 2-16 The surface of $f(x, y) = \frac{1}{16}x^2 + 9y^2$
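The driver gradient_descent_() is not listed in the text; a minimal sketch of a descent loop driven by an Optimizator, consistent with the call below:

def gradient_descent_(df, optimizator, iterations=100, epsilon=1e-8):
    params = optimizator.parameters()
    history = [np.copy(params[0])]
    for i in range(iterations):
        grads = [df(p) for p in params]        # one gradient per parameter
        if np.max(np.abs(grads[0])) < epsilon:
            print("The gradient is small enough!")
            break
        params = optimizator.step(grads)       # delegate the update strategy
        history.append(np.copy(params[0]))
    return history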
optimizator = SGD([x0],0.1)
path = gradient_descent_(df,optimizator,100)
print(path[-1])
path = np.asarray(path)
path = path.transpose()
[-8.26638332e-06 2.46046384e-98]
The first column in the data set is the population of each city, and the second column is the profit of a food truck in that city; both quantities are in units of 10,000. The following Python code reads the dataset from a text file and outputs the first 5 rows:
x , y = [] ,[]
with open('food_truck_data.txt') as A:
for eachline in A:
s = eachline.split(',')
x.append(float(s[0]))
y.append(float(s[1]))
for i in range(5):
print(x[i],y[i])
6.1101 17.592
5.5277 9.1302
8.5186 13.662
7.0032 11.854
5.8598 6.8233
The urban population and the food truck profit are regarded as the x and y coordinates on the two-dimensional coordinate plane, that is, each data sample is regarded as a coordinate point on the two-dimensional plane, as shown in Figure 3-1, and the data set can be displayed on the plane:
fig, ax = plt.subplots()
ax.scatter(x, y, marker="x", c="red")
plt.title("Food Truck Dataset", fontsize=16)
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.axis([4, 25, -5, 25])
plt.show()
Figure 3-1. Data point set for food truck profits
The goal of the "dining car profit problem" is how to predict the profit of a
dining car for a new urban population based on the existing data of these
urban populations and their corresponding profits.
1. Machine Learning
The "food truck profit problem" is a classic machine learning problem.
Machine learning is to discover certain statistical laws contained in these
data based on empirical data, and use the learned laws to judge or predict
new data in the future.
Machine learning can obtain, from these (urban population, corresponding food truck profit) data, a data model that reflects the relationship between the two, i.e. the functional relationship from urban population to food truck profit. If x is used to represent the urban population and y the profit of the food truck, machine learning is to find a function f(x) such that y satisfies y = f(x). The process of solving this functional relationship or model is called machine learning or model training. With this mathematical (model) function f(x), a new city population x can be substituted into f(x) to predict that city's food truck profit.
In machine learning, the data used to train the model is called sample data, the sample set, or the training set. The sample set contains multiple samples; each sample consists of sample features (such as the urban population) and a sample label (such as the food truck profit), which correspond to the independent variable x and the dependent variable y of the learned function y = f(x) respectively. Sample labels are often also referred to as "true values" or "target values". The ultimate goal of learning a model is to predict the target value or label from the sample features according to this model.
There are many similar problems, such as: "predicting the price of a stock", "identifying who is in a face photo", "determining the text corresponding to a piece of speech", "judging whether an email is spam", "deciding the next move in a chess position", "how the recommendation system of an e-commerce website recommends products a user may be interested in", "autonomous driving", etc.
So-called supervised learning means that for the data used for learning, not only the data features but also the target values are known; for a sample here, not only the population of the city but also the profit of that city's food truck is known. If it is assumed that the functional relationship y = f(x) is satisfied between the sample feature x and the target value y, then from multiple known samples $(x^{(i)}, y^{(i)})$ the best hypothesis function can be solved so that it fits these samples as well as possible. This kind of machine learning, which solves for the best hypothesis function based on multiple data samples $(x^{(i)}, y^{(i)})$ with known target values, is called supervised learning.
Supervised learning is currently the most widely used and most successful
machine learning method for artificial intelligence. For example, image
classification recognition uses a large number of images of known image
categories to identify which category a new image belongs to. The postal
code recognition system for letters can automatically recognize handwritten
postal codes. There are also AlphaGo, which defeated the human Go
champion, and AlphaFold, which defeated all human experts and
successfully predicted the 3D shape of the protein based on the gene
sequence, etc.
Supervised learning relies on training data with known target values, but in many cases manually specifying the target value of each data sample is time-consuming and labor-intensive. For example, for face detection problems, the locations of 68 landmark points need to be annotated for each face; if there are millions of faces to be annotated, how big a project is that? Similarly, it is laborious to label the categories of millions of images.
Is it possible to learn some laws between these data without knowing the
true value of the sample? Unsupervised learning is a machine learning
method without knowing the true value. For example, a clustering
algorithm can analyze data samples to determine which cluster center they
belong to. Principal components analysis (PCA) can determine the principal
components of the data, and then use it to reduce the dimension of the data,
that is, express the high-dimensional data into a low-dimensional form. The
autoencoder takes the data itself as the target value, that is, it uses $(x^{(i)}, x^{(i)})$ as training samples.

Back to the food truck profit problem: the relationship between the feature x and the target value y can be assumed to be a linear function, i.e. the hypothesis function of the training model is:

$$y = f(x) = wx + b$$

Model training is to find the best parameters w, b of this hypothesis function. So what is "best"?
A good model should make the predicted value $f^{(i)} = f(x^{(i)})$ as close as possible to the true value $y^{(i)}$. For the food truck profit problem, the squared error $(f^{(i)} - y^{(i)})^2$ can be used to represent the prediction error of a single sample, and the following mean squared error can be used to represent the prediction error over all samples (the factor 1/2 is included for convenience when differentiating):

$$L = \frac{1}{2m}\sum_{i=1}^m \left(f^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^m \left(wx^{(i)} + b - y^{(i)}\right)^2$$

Its partial derivative with respect to b, for example, is:

$$\frac{\partial L}{\partial b} = \frac{1}{2m}\frac{\partial\left(\sum_i (wx^{(i)} + b - y^{(i)})^2\right)}{\partial b} = \frac{1}{m}\sum_i \left(wx^{(i)} + b - y^{(i)}\right)$$
The necessary condition that the minimum of the function L(w, b) must satisfy is that the gradient, i.e. the partial derivatives of L(w, b) with respect to the variables w and b, equals 0, namely:

$$\frac{\partial L}{\partial w} = \frac{1}{m}\sum_i \left(wx^{(i)} + b - y^{(i)}\right)x^{(i)} = 0$$

$$\frac{\partial L}{\partial b} = \frac{1}{m}\sum_i \left(wx^{(i)} + b - y^{(i)}\right) = 0$$
Let

$$X = \begin{pmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{pmatrix}, \qquad W = \begin{pmatrix} b \\ w \end{pmatrix}, \qquad y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$

Dropping the coefficient $\frac{1}{m}$, the above equations can be written as:

$$X^T(XW - y) = 0$$

Moving $X^Ty$ to the right side of the equation:

$$X^TXW = X^Ty$$

Multiplying both sides by the inverse matrix $(X^TX)^{-1}$ of $X^TX$:

$$W = (X^TX)^{-1}X^Ty$$

Thus, W = (b, w) is obtained. This formula is the normal equation for solving W.
The code for solving the above "food truck profit problem" using the normal equation method is as follows:

import numpy as np
train_x, train_y = np.array(x), np.array(y)   # the population and profit columns read earlier
X = np.ones(shape=(len(train_x), 2))
X[:, 1] = train_x
y = train_y
XT = X.transpose()
XTy = XT @ y
w = np.linalg.inv(XT@X) @ XTy
print(w)

[-3.89578088 1.19303364]

4.6*w[1]+w[0]   # predicted profit for a city with a population of 46,000

1.5921738849602525
Let

$$x = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{pmatrix}, \qquad y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}, \qquad b = \begin{pmatrix} b \\ b \\ \vdots \\ b \end{pmatrix}$$

Then the partial derivatives can be written with vectorized (element-wise) operations:

$$\frac{\partial L}{\partial w} = \mathrm{np.mean}((wx + b - y) \odot x)$$

$$\frac{\partial L}{\partial b} = \mathrm{np.mean}((wx + b - y))$$
The coefficient 1/m can be included in the learning rate, and it is easy to
write the code to calculate the gradient with numpy's vectorization
operation:
X = train_x
w,b = 0.,0.
dw = np.mean((w*X+b-y)*X)
db = np.mean((w*X+b-y))
print(dw)
print(db)
-65.32884974555671
-5.839135051546393
Therefore, the code for the gradient descent algorithm for solving linear
regression can be written:
def gradient_descent(x,y,w,b,alpha=0.01, iterations =
100,epsilon = 1e-9):
history=[]
for i in range(iterations):
dw = np.mean((w*x+b-y)*x)
db = np.mean((w*x+b-y))
if abs(dw) < epsilon and abs(db) < epsilon:
break;
#Update w: w = w - alpha * gradient
w -= alpha*dw
b -= alpha*db
history.append([w,b])
return history
Calling the above gradient descent method with the learning rate alpha = 0.02 and 1000 iterations, the parameters of the hypothesis function can be found:
alpha = 0.02
iterations=1000
history = gradient_descent(X,y,w,b,alpha,iterations)
print(len(history))
print(history[-1])
1000
[1.1822480052540145, -3.7884192615511796]
History records the model parameters of each step in the iterative process,
and the last parameter is the optimal parameter.
How to judge whether the gradient descent method converges to the optimal solution?

For a hypothesis line function f(x) whose input variable and output variable are both single values, convergence can be observed visually by drawing the hypothesis line on the two-dimensional plane of the sample points. To do this, write a function that draws the straight line of the hypothesis function corresponding to the model parameters (w, b), sketched below:
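draw_line() is not listed in the text; a minimal sketch consistent with the call draw_line(plt, w, b, X, 6) below, where the last argument is taken to be the starting x value of the line:

def draw_line(plt, w, b, X, x_start):
    x_line = np.linspace(x_start, np.max(X), 50)   # x range of the line
    plt.plot(x_line, w*x_line + b)                 # the hypothesis line f(x) = wx + b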
#fig, ax = plt.subplots()
plt.scatter(X, y, marker="x", c="red")
plt.title("Food Truck Dataset", fontsize=16)
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.axis([4, 25, -5, 25])
w,b = history[-1]
draw_line(plt,w,b,X,6)
plt.show()
def loss(x,y,w,b):
    m = len(y)
    return np.mean((x*w+b-y)**2)/2

# an equivalent loop implementation of the same loss:
def loss_loop(x,y,w,b):
    m = len(y)
    cost = 0
    for i in range(m):
        f = x[i]*w+b
        cost += (f-y[i])**2
    cost /= (2*m)
    return cost
print(loss(X,y,1,-3))
4.983860697569072
Use the loss() function to calculate the loss corresponding to all parameters (w, b) recorded during the iteration, and draw this loss curve (Figure 3-4):
costs = [loss(X,y,w,b) for w,b in history]
plt.axis([0, len(costs), 4, 6])
plt.plot(costs)
Of course, for this linear regression with one independent variable, the loss
function is a function of two parameters, so the surface corresponding to the
loss function can be drawn (Figure 3-5), and the change of unknown
parameters during the iteration process can be drawn:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=figsize)
ax = fig.add_subplot(111, projection='3d')
ws,bs,z = plot_history(X,y,history)
#plt.axis([result_w-1,result_w+1,result_b-1,result_b+1])
plt.show()
Bad results (Figure 3-8)! It shows that the learning rate is too large.
Although the gradient descent method can always advance in the correct
direction, because the learning rate is too large, the step forward is too
large, and the optimal solution is crossed, so the cost diverges rather than
converges. Figure 3-9 shows the iterative process on the loss surface.
For this problem, the learning rate of 0.02 is more appropriate: it can converge (minimize the objective function value) within fewer iterations.
ws,bs,z = plot_history(X,y,history)
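The analytical gradient of the loss can be verified numerically, as in Section 2.4. The df_approx() used below is not listed in the text; a minimal sketch based on central differences of the loss() function defined above:

def df_approx(X, y, w, b, eps):
    dw = (loss(X, y, w+eps, b) - loss(X, y, w-eps, b))/(2*eps)   # ∂L/∂w
    db = (loss(X, y, w, b+eps) - loss(X, y, w, b-eps))/(2*eps)   # ∂L/∂b
    return dw, db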
w =1.0
b = -2.
eps = 1e-8
dw = np.mean((w*X+b-y)*X)
db = np.mean((w*X+b-y))
grad = np.array([dw,db])
grad_approx = df_approx(X,y,w,b,eps)
print(grad)
print(grad_approx)
print(abs(grad-grad_approx))
[-0.24450692 0.32066495]
(-0.24450690361277339, 0.3206649612508272)
[1.98820717e-08 1.27972190e-08]
It can be seen that the results of the two calculations are consistent. This
allows the analytical gradient to be used with confidence in the gradient
descent method.
Of course, the numerical gradient of the loss function can also be calculated
using the general numerical gradient function in Section 2.4.
3.1.8 Prediction
Once the parameters w, b of the specific hypothesis function
f (x; w, b) = xw + b are determined, a new data (such as urban population)
can be substituted into the hypothesis function to get the predicted value
(such as dining car profit).
For example, all X[i] in the training set X can be substituted into this
hypothesis function to get the predicted value f (X[i]; w, b) = X[i]w + b.
The following code computes predicted values for all samples in x and uses
these predicted values to plot the data points corresponding to those
predicted values (Figure 2-10).
#Use the obtained w to calculate the predicted value of
the sample in X
m=len(X)
predictions = [0]*m
for i in range(m):
predictions[i] = X[i]*w+b
The relationship between the housing features $x = (x_1, x_2)$ and the house price y can be expressed as:

$$y = f(x) = w_1x_1 + w_2x_2 + b$$

Sometimes, in order to better describe the relationship between x and y, some higher-order features such as $x_1^2$, $x_2^2$ can be constructed on the basis of the original features; using the original features and the higher-order features together as new features, the relationship between the new features and the true value is represented by the following function:

$$y = f(x) = w_1x_1 + w_2x_2 + w_3x_1^2 + w_4x_2^2 + b$$
The function f(x) is a nonlinear function of the features $x_1, x_2$, but it is still a linear function of the parameters. Considering $x_1^2$ and $x_2^2$ as two new features $x_3$ and $x_4$, the function is also a linear function of $x_1, x_2, x_3, x_4$.
Generally, if a sample contains K features, the hypothesis function of linear regression is:

$$f(x) = w_1x_1 + w_2x_2 + \cdots + w_Kx_K + b = \sum_{i=1}^K w_ix_i + b$$
A row vector can be used to represent all the features of a sample, $x = (x_1, x_2, \ldots, x_K)$, and a column vector can be used to represent the coefficients of these features in the hypothesis function, $w = (w_1, w_2, \ldots, w_K)^T$, so that the hypothesis function can be written as:

$$f(x) = (x_1, x_2, \cdots, x_K)\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_K \end{pmatrix} + b = xw + b \tag{3-13}$$
The larger the absolute value of the coefficient $w_i$ of $x_i$, the greater the influence of $x_i$ on the output value of f(x); therefore $w_i$ is often called a weight, while b, which is not associated with any feature $x_i$, is called the bias. If a constant feature $x_0 = 1$ is added to every sample and b is written as $w_0$, the hypothesis function can be written more compactly as:

$$f_w(x) = w_1x_1 + w_2x_2 + \cdots + w_Kx_K + w_0 = (x_0, x_1, x_2, \cdots, x_K)\begin{pmatrix} w_0 \\ w_1 \\ \vdots \\ w_K \end{pmatrix} = xw$$

All m samples can be stacked into a matrix with one sample (row vector) per row:

$$X = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{pmatrix} = \begin{pmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_K^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_K^{(2)} \\ \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_K^{(m)} \end{pmatrix}$$

Because each sample produces one output, the output of the function for all samples can be written in vector (matrix) form:

$$f_w(X) = Xw \tag{3-14}$$

This matrix product can be easily computed with numpy, i.e. with np.dot(X, w) or X.dot(w) or X@w. For example, X below contains 2 samples, each with 3 features, and w is the weight vector corresponding to the 3 features; f_w(X) can then be computed directly:

import numpy as np
X = np.array([[1,8,3],[1,7,5]])
w = np.array([1.3, 2.4,0.5])
X@w

array([22. , 20.6])

When expressing operations with vectors (matrices), be sure to check that the dimensions are consistent. For Xw above, X is 2 × 3 and w is 3 × 1, so $f_w(X) = Xw$ is 2 × 1.

Model training is to use a set of samples $\{x^{(i)}, y^{(i)}\}$ with known target values to find the best hypothesis function.
Like the univariate hypothesis function, for the multivariate hypothesis function the loss function based on the mean squared error can also be used to measure the error between the predicted values and the true values of the model:

$$L(w) = \frac{1}{2m}\sum_{i=1}^m \left(f_w(x^{(i)}) - y^{(i)}\right)^2$$
L(w) is a function of the unknown parameter w; since w contains multiple components, this is a multivariate function of w. The model training of linear regression is to find the parameter $w = (w_0, w_1, \ldots)$ that minimizes the value of the loss function. If L(w) takes a minimum at $w^*$, the gradient (the partial derivatives) at that point should be 0:

$$\nabla L(w^*) = 0$$
In order to better see the derivation of the partial derivatives on the left side of the equation, some auxiliary notation is introduced:

$$f^{(i)} = f_w(x^{(i)}) = x^{(i)}w = w_1x_1^{(i)} + w_2x_2^{(i)} + \cdots + w_Kx_K^{(i)} + w_0 \cdot 1$$

$$\delta^{(i)} = f^{(i)} - y^{(i)}$$

$$L(w) = \frac{1}{2m}\sum_{i=1}^m \left(\delta^{(i)}\right)^2$$

L(w) can be regarded as the average of the m values $(\delta^{(i)})^2$, divided by 2, where $\delta^{(i)}$ is the prediction error $f^{(i)} - y^{(i)}$ of the i-th sample.
Using the linearity of differentiation (the derivative of a sum is the sum of the derivatives of each term) and the chain rule for composite functions:

$$\frac{\partial L(w)}{\partial \delta^{(i)}} = \frac{1}{2m}\,2\delta^{(i)} = \frac{\delta^{(i)}}{m}$$

$$\frac{\partial \delta^{(i)}}{\partial f^{(i)}} = 1$$

$$\frac{\partial f^{(i)}}{\partial w_j} = x_j^{(i)}$$

$$\frac{\partial L(w)}{\partial w_j} = \sum_{i=1}^m \frac{\partial L(w)}{\partial \delta^{(i)}} \times \frac{\partial \delta^{(i)}}{\partial f^{(i)}} \times \frac{\partial f^{(i)}}{\partial w_j} = \frac{1}{m}\sum_{i=1}^m \delta^{(i)} \times 1 \times x_j^{(i)} = \frac{1}{m}\sum_{i=1}^m \left(f^{(i)} - y^{(i)}\right)x_j^{(i)} = \frac{1}{m}\sum_{i=1}^m \left(f_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
The sum in the rightmost formula (apart from the coefficient) can be regarded as the dot product of two vectors:

$$\sum_{i=1}^m \left(f_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} = X_{:,j}^T(Xw - y), \qquad Xw - y = \begin{pmatrix} f_w(x^{(1)}) - y^{(1)} \\ f_w(x^{(2)}) - y^{(2)} \\ \vdots \\ f_w(x^{(m)}) - y^{(m)} \end{pmatrix}, \qquad X_{:,j} = \begin{pmatrix} x_j^{(1)} \\ x_j^{(2)} \\ \vdots \\ x_j^{(m)} \end{pmatrix}$$

where $X_{:,j}$ is the jth column of the matrix X (i.e. the jth feature of all samples). Therefore, the partial derivative can be written in vector form:

$$\frac{\partial L(w)}{\partial w_j} = \frac{1}{m}X_{:,j}^T(Xw - y)$$

Collecting all the partial derivatives gives the gradient in vector form:

$$\nabla L(w) = \left(\frac{\partial L(w)}{\partial w_1}, \cdots, \frac{\partial L(w)}{\partial w_j}, \cdots\right)^T = \frac{1}{m}X^T(Xw - y)$$

You can check whether the dimensions of the matrix multiplication are consistent: $X^T$ is n × m and (Xw − y) is m × 1, so ∇L(w) is n × 1.

For the above example, this formula can be used directly to calculate the gradient of the loss:

y = np.array([2.3,1.7])
(1/len(y))*X.transpose() @ (X@w-y)

Setting the gradient to 0 gives:

$$X^T(Xw - y) = 0$$

According to the normal equation, w can be obtained:

$$w = (X^TX)^{-1}X^Ty$$
2. Fitting plane
The following code generates a set of data point samples from the plane z = ax + by + c (here a = 3, b = 2, c = 5); each data sample has the features (x, y), and its target value is the z value on the plane plus noise:

n_points = 20
a = 3
b = 2
c = 5
x_range = 5
y_range = 5
noise = 3
xs = np.random.uniform(-x_range,x_range,n_points)
ys = np.random.uniform(-y_range,y_range,n_points)
zs = xs*a+ys*b+ c+ np.random.normal(scale=noise,size=n_points)   # size added so each point gets its own noise
plt.show()
Figure 3-11 Data set sampled from a plane in three-dimensional space
Use the above sample points to solve the normal equation to fit a plane, then use the fitted function to calculate the predicted values zs2 of the original data points (xs, ys), and display the original data points and the fitted data points, as well as the original plane and the fitted plane. It can be seen that the fitting effect is very good.

# fit a plane
X = np.hstack((np.ones((len(xs), 1), dtype=xs.dtype),xs[:, None],ys[:, None]))
y = zs
w = np.linalg.inv(X.T@X) @ (X.T@y)   # solve the normal equation (solving step implied by the text)
zs2 = X @ w                          # predicted values for the original data points
When the number of samples is large or there are many sample features, the normal equation requires computing an inverse matrix, which is time-consuming. Therefore, an iterative method is generally used to solve the system of equations, and the most typical iterative method is the gradient descent method.
The following is the gradient descent algorithm implemented by numpy vector operations:
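The function definition is not reproduced here; the following sketch is consistent with the description after the plot (feature 1 is added to each sample inside the function) and with the second call below, which unpacks a parameter history. (Note the first call below unpacks two values, so the book's actual version apparently also returns a cost history.)

def gradient_descent_vec(X, y, alpha=0.01, iterations=100):
    X = np.hstack((np.ones((len(X), 1)), X))    # add feature 1 to every sample
    w = np.zeros(X.shape[1])
    history = []
    for i in range(iterations):
        grad = X.transpose() @ (X @ w - y)/len(y)   # (1/m) X^T (Xw - y)
        w = w - alpha*grad
        history.append(w.copy())
    return history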
learning_rate = 0.02
num_iters = 100
X = np.hstack((xs[:, None],ys[:, None]))
w,cost_history = gradient_descent_vec(X, y,learning_rate, num_iters)
print("w:",w)
print(cost_history[:5])
plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with learning rate = " + str(learning_rate),
fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.grid()
plt.show()
In the above code, feature 1 is added to each data feature X [i] of the input data X . Therefore,
when calling this function, you only need to pass the characteristics of the input data itself. The
following code tests the vector version of the gradient descent method to fit the above-mentioned
plane data:
learning_rate = 0.02
num_iters = 100
X = np.hstack((xs[:, None],ys[:, None]))
history = gradient_descent_vec(X, y,learning_rate, num_iters)
print("w:",history[-1])
According to the model parameters in the iterative process recorded in history, calculate the
average loss of the hypothetical function corresponding to the model parameters in each iterative
process on the training data set:
def compute_loss_history(X,y,w_history):
loss_history = []
for w in w_history:
errors = X@w[1:]+w[0]-y
loss_history.append((errors**2).mean()/2)
return loss_history
loss_history = compute_loss_history(X,y,history)
print(loss_history[:-1:10])
plt.plot(loss_history, linewidth=2)
plt.title("Gradient descent with learning rate = " + str(learning_rate),
fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.grid()
plt.show()
It can be seen that the same good fitting results have been obtained.
If the temperature and pressure data are placed in a csv format file 'data.csv', the read_csv() function of the pandas package can be used to read the data from the file:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('data.csv')    # read the table (this line is implied by the use of data below)
X= data.values[:,1:2]
y= data.values[:,2]
print(X)
print(y)
[[  0.]
 [ 20.]
 [ 40.]
 [ 60.]
 [ 80.]
 [100.]]
[2.0e-04 1.2e-03 6.0e-03 3.0e-02 9.0e-02 2.7e-01]
history = gradient_descent_vec(X,y,0.00005,50)
w = history[-1]
print("w:",w)
w: [-0.00016333 0.00165672]
Draw the cost curve and the predicted values of the test:

def plot_history_predict(X,y,w,loss_history,fig_size=(12,4)):
    fig = plt.gcf()
    fig.set_size_inches(fig_size[0], fig_size[1], forward=True)
    plt.subplot(1, 2, 1)
    plt.plot(loss_history)
    predicts = X @ w[1:] + w[0]      # w[0] is the bias, as in compute_loss_history above
    x = X[:, 0]                      # the original feature, used as the horizontal axis
    plt.subplot(1, 2, 2)
    plt.scatter(x, predicts) #, marker="x", c="red")
    indices = x.argsort()
    sorted_x = x[indices[::-1]]
    sorted_predicts = predicts[indices[::-1]]
    plt.plot(sorted_x, sorted_predicts)   # connect the sorted predictions as a curve
    plt.show()
Plot the loss function curve and the predicted values for the training samples:
loss_history = compute_loss_history(X,y,history)
plot_history_predict(X,y,w,loss_history)
Although the best linear model is obtained to fit the relationship between temperature and
pressure, from the figure, pressure and temperature are not a linear relationship, and the linear
hypothesis function is not the best choice. There should be a nonlinear relationship between
them. Naturally, one would think of using a polynomial function such as a 3-degree polynomial
to represent this nonlinear relationship between pressure y and temperature x.
$$f(x) = w_3x^3 + w_2x^2 + w_1x + w_0 = (1, x, x^2, x^3)(w_0, w_1, w_2, w_3)^T$$

From the original feature x, the new features $x^2, x^3$ are artificially constructed, and 1 is also used as a feature, so that the hypothesis function is:

$$f(x; w) = (1, x, x^2, x^3)(w_0, w_1, w_2, w_3)^T = xw$$
X2 = np.hstack((X,X**2,X**3))
print(X2)
Then the gradient descent method was implemented, but it was found that the loss function
continued to increase rapidly until infinity, and did not converge.
history = gradient_descent_vec(X2,y,0.00005,50)
print("w:",history[-1])
D:\Programs\Anaconda3\lib\site-packages\ipykernel_launcher.py:8:
RuntimeWarning: overflow encountered in matmul
D:\Programs\Anaconda3\lib\site-packages\ipykernel_launcher.py:10:
RuntimeWarning: overflow encountered in matmul
# Remove the CWD from sys.path while we load stuff.
D:\Programs\Anaconda3\lib\site-packages\ipykernel_launcher.py:8:
RuntimeWarning: invalid value encountered in matmul
This is because the feature values of the data are all relatively large, which produces a large gradient, so a very small learning rate must be used; but too small a learning rate makes the algorithm converge very slowly. The solution is to normalize the data features so that the feature values lie in a small range (such as [0,1] or [-1,1]).
3.1.10 Normalization of data
Normalizing a feature is very simple: first compute the mean of that feature over all samples, then compute the degree to which the feature deviates from the mean (i.e. its standard deviation), and finally subtract the mean from the feature value of every sample and divide by the standard deviation:

x ← (x − mean(x)) / stddev(x)

where x is a set of numbers, mean(x) is their mean, and stddev(x) is their standard deviation. For example, for the set of feature values {−5, 6, 9, 2, 4}, the mean is:

mean = (−5 + 6 + 9 + 2 + 4)/5 = 3.2
Subtract this mean from all the feature values to get the deviations, and square them:

(−5 − 3.2)^2 = 67.24
(6 − 3.2)^2 = 7.84
(9 − 3.2)^2 = 33.64
(2 − 3.2)^2 = 1.44
(4 − 3.2)^2 = 0.64

The variance is the mean of these squared deviations, (67.24 + 7.84 + 33.64 + 1.44 + 0.64)/5 = 22.16, and the standard deviation is its square root, √22.16 ≈ 4.71.
For the previous X2, you can use the following code to compute the mean (mean) and standard deviation (stddev) of each feature and normalize with them.
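A sketch of that computation (variable names assumed):

means = X2.mean(axis=0)          # per-feature mean
stddevs = X2.std(axis=0)         # per-feature standard deviation
X2 = (X2 - means) / stddevs      # normalize each feature
print(means, stddevs)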
Of course, the above normalization can be simplified to one line of code:

X2 = (X2 - X2.mean(axis=0)) / X2.std(axis=0)
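Gradient descent can then be rerun on the normalized features (the learning rate and iteration count here are illustrative; readers can tune them):

history = gradient_descent_vec(X2, y, 0.1, 10000)
loss_history = compute_loss_history(X2, y, history)
print(loss_history[-1])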
It can be seen that the loss (cost) has dropped from 0.0019130770858765333 to 7.023862601858041e-05. The loss function curve and the fitted polynomial curve can then be drawn; readers can adjust the learning rate and the number of iterations to reduce the loss further.
plot_history_predict(X2,y,history[-1],loss_history)
It can be seen from the final prediction results of the training samples that the model function can
better fit the training data.
Summary of the temperature and pressure problem: an appropriate hypothesis function should be chosen according to the characteristics of the problem's data. If the hypothesis function cannot fit the data well, new features can be constructed from the existing ones. In addition, the feature values of the data should lie in a relatively small, normalized range, for example normalized to mean 0 and variance 1.
3.2 Evaluation of the model
So is it better to use more complex functions to represent the relationship between data features
and target values?
The following code randomly samples some coordinate points (x,y) around a sinusoidal curve:
#https://github.com/ctgk/PRML/blob/master/notebooks/ch01_Introduction.ipynb
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
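# sample() is used throughout this section; a minimal sketch consistent with the
# commented-out lines below:
def sample(n_samples):
    x = np.sort(np.random.uniform(0, 1, n_samples))
    y = np.sin(2*np.pi*x) + np.random.normal(scale=0.25, size=x.shape)
    return x, y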
np.random.seed(896)
n_samples = 10
x,y = sample(n_samples)
#x = np.sort(np.random.uniform(0,1,n_samples))
#y = np.sin(2*np.pi*x) + np.random.normal(scale = 0.25, size=x.shape)
Fit these sample points (x,y) with polynomials of different degrees K, and solve the model
function using the normal equation method:
x_test = np.linspace(0, 1, 100)     # test points for plotting the fitted curves
for i, K in enumerate([0, 1, 3, 9]):
    plt.subplot(2, 2, i + 1)
    X = np.array([np.power(x,k) for k in range(K+1)])
    X = X.transpose()
    XT = X.transpose()
    XTy = XT @ y
    w = np.linalg.inv(XT@X) @ XTy   # normal equation: w = (X^T X)^{-1} X^T y
    #w = np.linalg.pinv(X) @ y
    print("w=:",w)
    y_predict = 0
    for j, wj in enumerate(w):      # evaluate the fitted polynomial on the test points
        y_predict += wj*np.power(x_test,j)
    y_test = np.sin(2*np.pi*x_test)
    plt.plot(x_test, y_test, c="g", label="$\sin(2\pi x)$")
    plt.plot(x_test, y_predict, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
plt.show()
w=: [-0.19410186]
w=: [ 1.167293 -2.40352288]
w=: [ -0.69160733 14.4684786 -40.54048788 27.82130232]
w=: [ 1.04164258e+03 -3.73815312e+04 4.73769000e+05 -3.06523600e+06
1.16099600e+07 -2.72868240e+07 4.03530320e+07 -3.65586800e+07
1.85418720e+07 -4.03263600e+06]
Figure 3-17 The fitting situation of the normal equation method with polynomial functions of
degree 0, degree 1, degree 3 and degree 9 as hypothetical functions
The following code uses the gradient descent method to solve the above hypothetical function:
lr = 0.4
iterations = 10000000
for i, K in enumerate([0, 1, 3, 9]):
    plt.subplot(2, 2, i + 1)
    if i==0: continue               # degree 0 has no feature columns; skip it
    X = np.array([np.power(x,k+1) for k in range(K)])
    X = X.transpose()
    w_history = gradient_descent_vec(X,y,lr,iterations,0.9)
    w = w_history[-1]
    print("w=:",w)
    y_predict = w[0]                # the bias term returned by gradient_descent_vec
    for j in range(1, len(w)):
        y_predict = y_predict + w[j]*np.power(x_test,j)
    y_test = np.sin(2*np.pi*x_test)
    plt.plot(x_test, y_test, c="g", label="$\sin(2\pi x)$")
    plt.plot(x_test, y_predict, c="r", label="fitting")
    plt.ylim(-1.5, 1.5)
plt.show()
gradient is small enough!
iterated num is : 2
w=: [-0.19410186 -0.5068096 ]
gradient is small enough!
iterated num is : 16124
w=: [-0.19410186 3.0508158 -9.53751983 6.16440066]
w=: [ -0.19410186 -12.29842075 62.23443929 -94.51757917 -12.68948412
83.66870987 51.44475318 -55.72137559 -94.36419461 71.90960592]
Figure 3-18 The fitting situation of polynomial functions of degree 0, degree 1, degree 3 and
degree 9 based on the gradient descent method as hypothetical functions
In this example, the polynomial of degree 3 fits best. The polynomial of degree 9 has a small fitting error on the training samples, but it is far from the underlying true relationship of the data. When the fitting error on the training set is small but the error on test samples is large, the model is said to be overfitting. Overfitting is caused by the model function being too complex relative to the training sample set. One way to solve overfitting is to use a lower-complexity function, such as the 3-degree polynomial, as the hypothesis function; another is to increase the number of samples in the training data set.
For example, after increasing the number of training samples in the following code, a good fit can also be obtained for the 9-degree polynomial hypothesis function.
n_samples = 100
x,y = sample(n_samples)
K = 9
X = np.array([np.power(x,k) for k in range(K+1)])   # rebuild the design matrix for the new samples
X = X.transpose()
XT = X.transpose()
XTy = XT @ y
w = np.linalg.inv(XT@X) @ XTy
#w = np.linalg.pinv(X) @ y
print("w=:",w)
y_predict = 0
for j, wj in enumerate(w):
    y_predict += wj*np.power(x_test,j)
y_test = np.sin(2*np.pi*x_test)
plt.plot(x_test, y_test, c="g", label="$\sin(2\pi x)$")
plt.plot(x_test, y_predict, c="r", label="fitting")
plt.ylim(-1.5, 1.5)
plt.show()
Figure 3-19 The fitting situation of the 9th degree polynomial function with increasing training
samples
Figure 3-20 The left, middle, and right are underfitting, optimal fitting, and overfitting
respectively
As shown in Figure 3-20, for a group of two-dimensional coordinate points on the plane, using a linear function (straight line), a quadratic function (parabola), and a high-degree polynomial as the model function of linear regression, it can be clearly seen that the left, middle, and right are respectively underfitting, optimal fitting, and overfitting. If the model is too simple it underfits; if it is too complex it overfits; only when the complexity is appropriate does it produce a good fit.
According to this analysis of the causes of underfitting and overfitting, the underfitting problem can be alleviated by the following method:
Add more sample features. For example, constructing additional features such as x^2 and x^3 from a sample feature x increases the expressive power of the model and alleviates the underfitting problem.
The overfitting problem can be addressed by the following method:
Reduce the complexity of the model: limit the complexity of the hypothesis function (model) by choosing a low-complexity hypothesis function or through regularization.
As can be seen from the previous examples, it is often impossible to judge whether the final model function fits the data well only from the declining training loss curve. Even if a function has a small loss on the training samples, it may produce large errors on samples outside the training set, i.e. overfitting occurs. In other words, the "generalization ability" of the model function is insufficient: it cannot express the relationship between the data features and the target values of actual samples. Overfitting arises because training pays too much attention to fitting the training set, so the trained model produces larger errors on data that differ from the training set. Of course, it is difficult to judge whether a model is overfitting or underfitting from the value of the loss function alone; in some cases, even a small loss value may correspond to underfitting.
The purpose of training a model is to use it to predict new data. Even if the model fits the training data well, it is useless if its predictions on new data are poor, just as an athlete who trains to be the best on his own team has no guarantee of performing as well against outside competitors.
In order to help judge whether a model function is overfitting or underfitting, in addition to the
loss function curve, the quality of the model function is usually evaluated with the help of a
sample set different from the training set.
Therefore, in machine learning, in addition to the training set used to train the model, a separate test set is generally used to evaluate the trained model. For a model function, the prediction error on the samples of the test set (the error between the predicted values and the target values) can be computed. If the error on the test set is similar to the error on the training set, it can be preliminarily judged that the model function has good generalization ability. The test set should cover as many different kinds of data as possible, so that it can evaluate whether the trained model generalizes well.
Besides the training set and the training algorithm, the choice of hypothesis function model also affects performance. For example, in the earlier problem of fitting two-dimensional data points with polynomials, models trained with polynomial functions of different degrees perform differently: some underfit and some overfit. In addition, the hyperparameters of the training algorithm (such as the learning rate, batch size, and number of iterations) also have a great influence on the training results. For example, with other conditions unchanged, the number of iterations directly affects the training error: too few iterations may cause underfitting, and too many may cause overfitting.
For example, when training the model, the loss (error) on the training set and on the validation set can be computed at the same time. When the iterations start, both the validation error and the training error decrease. As the number of iterations increases, the validation error may instead become larger, indicating that the generalization ability is getting worse; at this point the iteration can be stopped early. This method is called early stopping: the validation set is used to prevent too many iterations during training. As another example, for the earlier polynomial fitting, without visual help, the training errors and validation errors of polynomial model functions of different degrees can be used to select a polynomial function of appropriate degree.
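A minimal sketch of how the stopping point could be chosen from a recorded validation-loss history (the function and its patience rule are assumptions, not from the book):

def early_stopping_index(valid_losses, patience=5):
    # stop when the validation loss has not improved for `patience` recordings
    best, best_i, wait = float('inf'), 0, 0
    for i, v in enumerate(valid_losses):
        if v < best:
            best, best_i, wait = v, i, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_i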
Thus, a training set is used to train the model, a validation set is used to evaluate and select the
model, and sometimes a separate test set is used to test the resulting model. Sometimes the test
and validation sets are not distinguished.
For the fitting problem of the previous sinusoidal sampling points, the following code samples a
total of three sample sets: training set, verification set, and test set.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
n_pts = 10
x_train,y_train = sample(n_pts)
x_valid,y_valid = sample(n_pts)
x_test,y_test = sample(n_pts)
If you want to choose hypothetical functions of different degrees such as K=0, 1, 2, 3, ... 9, you
can use these hypothetical functions of different degrees to train and calculate the training error
and verification error.
def rmse(a, b):
    return np.sqrt(np.mean(np.square(a - b)))

M = 10
errors_train = []
errors_valid = []
for K in range(M):
    X = np.array([np.power(x_train,k) for k in range(K+1)]).transpose()
    XT = X.transpose()
    XTy = XT @ y_train
    w = np.linalg.inv(XT@X) @ XTy
    predict_train = X @ w
    error_train = rmse(y_train,predict_train)
    errors_train.append(error_train)
    X_val = np.array([np.power(x_valid,k) for k in range(K+1)]).transpose()  # same features for the validation set
    predict_valid = X_val @ w
    error_valid = rmse(y_valid,predict_valid)
    errors_valid.append(error_valid)
It can be seen that when the polynomial degree is lower than 2, both the training error and the validation error are relatively large: the fit on both the training set and the validation set is poor, i.e. the model is underfitting. When the polynomial degree is around 3 to 4, both errors are relatively small. When the degree exceeds 5, the training error continues to decrease but the validation error increases instead, indicating that the generalization ability of the model begins to deteriorate. Therefore, a polynomial function of degree 3 or 4 is a good hypothesis function.
What is the appropriate size (number of samples) for the training set, validation set, and test set? It depends on the actual problem. For some problems the cost of obtaining samples is low and the number of samples can reach hundreds of thousands or millions; for example, shopping websites easily collect large amounts of user shopping behavior data. There the training set can account for more than 90% of all sample data, while the validation set and test set can each be as low as about 5%, because with a very large total the 5% is still a large number of samples. For other problems samples are expensive to obtain and the total number is small; for example, in medical imaging the proportions of the validation set and test set will be relatively high, such as up to 20%, and the proportion of the training set correspondingly lower. For data sets of moderate size, the proportions of training, validation, and test samples are usually set to about 60%, 20%, and 20%. This division is not absolute and should be determined according to the actual problem.
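As a concrete illustration of such a split (a sketch; the helper name and ratios are assumptions):

import numpy as np
def split_dataset(X, y, ratios=(0.6, 0.2, 0.2), seed=0):
    # shuffle the sample indices, then slice into train / validation / test parts
    idx = np.random.RandomState(seed).permutation(len(y))
    n_train = int(ratios[0]*len(y))
    n_valid = int(ratios[1]*len(y))
    tr, va, te = idx[:n_train], idx[n_train:n_train+n_valid], idx[n_train+n_valid:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])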
For the previous problem, you can draw the training loss curve and the validation loss curve for a specific hypothesis function, such as a polynomial of degree 9, and observe how the loss (error) on the training set and the validation set changes.
The loss() function below calculates the loss of the hypothesis function corresponding to given model parameters on a sample set (X, y); learning_curves_trainSize() calculates the training loss and validation loss for training sets of different sizes (trainSize) and draws the training loss curve and validation loss curve.
def loss(w,X,y):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    predictions = X @ w
    errors = predictions - y
    return (errors**2).mean()/2
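learning_curves_trainSize() is called below without its listing; a minimal sketch consistent with how it is used (training on growing subsets and plotting both losses) is:

def learning_curves_trainSize(X_train, y_train, X_valid, y_valid, alpha, iterations):
    train_losses, valid_losses = [], []
    sizes = range(2, len(y_train) + 1)       # start from 2 samples, add 1 each time
    for m in sizes:
        history = gradient_descent_vec(X_train[:m], y_train[:m], alpha, iterations)
        w = history[-1]
        train_losses.append(loss(w, X_train[:m], y_train[:m]))
        valid_losses.append(loss(w, X_valid, y_valid))
    plt.plot(sizes, train_losses, label="training loss")
    plt.plot(sizes, valid_losses, label="validation loss")
    plt.xlabel("training set size")
    plt.ylabel("loss")
    plt.legend()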
#K = 4
K =2
X_train = np.array([np.power(x_train,k+1) for k in range(K)]).transpose()
X_valid = np.array([np.power(x_valid,k+1) for k in range(K)]).transpose()
alpha=0.3
iterations = 50000
learning_curves_trainSize(X_train, y_train, X_valid,
y_valid,alpha,iterations)
plt.ylim(-0.5, 20)
plt.show()
It can be seen that after the training set size exceeds 40, the training loss and verification error
are relatively close. Therefore, for the 2-degree polynomial hypothetical function, the number of
samples in the training set should be greater than 40.
For a certain hypothetical function, it is also possible to observe how many iterations are
appropriate through the iterative learning curve.
For a polynomial hypothesis function of degree 2, the following code plots the learning curves
for the training and validation losses over its iterations:
np.random.seed(89)
n_pts = 100
x_train,y_train = sample(n_pts)
x_valid,y_valid = sample(n_pts)
K = 2
X_train = np.array([np.power(x_train,k+1) for k in range(K)]).transpose()
X_valid = np.array([np.power(x_valid,k+1) for k in range(K)]).transpose()
import numpy as np
import matplotlib.pyplot as plt
import scipy.io as sio
dataset = sio.loadmat("water.mat")
x_train = dataset["X"]
y_train = dataset["y"].flatten()      # the target keys are assumed to be "y"/"yval"/"ytest"
x_val = dataset["Xval"]
y_val = dataset["yval"].flatten()
x_test = dataset["Xtest"]
y_test = dataset["ytest"].flatten()
(12, 1) (12,)
(21, 1) (21,)
(21, 1) (21,)
[[-15.93675813]
[-29.15297922]
[ 36.18954863]
[ 37.49218733]
[-48.05882945]]
[ 2.13431051 1.17325668 34.35910918 36.83795516 2.80896507]
The sample data are divided into a training set, a validation set, and a test set: x_train and y_train are the data features and target values of the training set, x_val and y_val those of the validation set, and x_test and y_test those of the test set. The sample points of the training set and the validation set can be visualized on a two-dimensional plane; the red ones are the training set samples.
plt.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.scatter(x_val, y_val, marker="o", s=40, c='blue')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Training sample", fontsize=16)
plt.show()
Figure 3-24 Changes in reservoir water level and corresponding dam discharge
Observing that the data feature values are not small, it is best to normalize the data features before running gradient descent. The following code computes the mean and standard deviation of the training samples x_train and normalizes them into x_train_1.
train_means = x_train.mean(axis=0)
train_stdevs = np.std(x_train, axis=0, ddof=1)
x_train_1 = (x_train - train_means) / train_stdevs
print(x_train_1[:3])
[[-0.36214078]
[-0.80320484]
[ 1.377467 ]]
Perform gradient descent linear regression on the normalized training samples x_train_1 and their target values y_train, obtain the final model parameters w and the loss (cost) corresponding to each w recorded during the iterations, and output some of the training errors from the iterative process together with the final model parameters and training error.
alpha = 0.3
iterations = 100000
history = gradient_descent_vec(x_train_1,y_train,alpha,iterations)
w = history[-1]
print("w",history[-1])
loss_history = compute_loss_history(x_train_1,y_train,history)
print(loss_history[:-1:len(loss_history)//10])
print(loss_history[-1])
The gradient descent converged after 186 iterations. The loss function curve, the predicted values of the training samples, and the line of the hypothesis function corresponding to the converged w can be drawn to observe the effect of training. The training error of the model is about 22.37. A large training error indicates that the model fits the training data poorly; this is called underfitting, i.e. the model is not expressive enough to describe the relationship between the sample features and the target values, mainly because the model is too simple.
plot_history_predict(x_train_1,y_train,history[-1],loss_history)
Write a function loss() that computes the model error, normalize the validation set and test set with the mean and standard deviation of the training set, and then compute and output the loss (error) on the validation set and the test set.

def loss(w,X,y):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    predictions = X @ w
    errors = predictions - y
    return (errors**2).mean()/2

x_val_1 = (x_val - train_means) / train_stdevs    # normalize with the training-set statistics
x_test_1 = (x_test - train_means) / train_stdevs
print(x_val_1.shape,y_val.shape,w.shape)
loss_val = loss(w,x_val_1,y_val)
loss_test = loss(w,x_test_1,y_test)               # compare against the test targets
print(loss_val,loss_test)
The loss (error) of the model to the validation set is 29.43, which is more than 30% higher than
the training loss (error), indicating that the generalization ability of the model is not very good.
The test set loss (error) is much higher, further indicating that the model generalization ability is
poor.
Start training with 2 samples, and increase 1 sample each time for training. Calculate the training
error and validation error using the model obtained from each training.
plt.title("Learning Curves for Linear Regression", fontsize=16)
learning_curves_trainSize(x_train_1, y_train, x_val_1, y_val,0.3,1000)
plt.show()
It can be seen that as the number of training samples increases, the training loss of the obtained model gradually increases, because with more samples it is harder for the model to fit them all. Once the number of samples grows beyond a certain level, the training loss increases very slowly, indicating that adding more samples contributes little. Looking at the model's loss on the validation set: when the number of training samples is small, the validation error is large, because the fitted model is not accurate enough and its generalization ability is weak. As the number of training samples increases, the validation error gradually decreases, indicating that the generalization ability of the model is getting better and better. When the number of training samples reaches about 10, continuing to increase it no longer improves the validation error. From this learning curve it can be seen that around 10 training samples suffice, and adding more will not further improve the quality of the model, so one can stop increasing the number of samples, that is, early stopping.
For this simple model (a polynomial hypothesis function of degree 1), the learning curve helped determine the size of the training set and the final model, but the training error and validation error are still large because the model is too simple. The way to improve is to use a more complex, more expressive function, such as a 3-degree polynomial, as the hypothesis function, which requires artificially adding the two features x^2 and x^3. The following code obtains a training set x_train_2 with 3 features by adding these 2 features to x_train, which has only one data feature.
x_train_2 =np.hstack((x_train,x_train**2,x_train**3))
train_means = x_train_2.mean(axis=0)
train_stdevs = np.std(x_train_2, axis=0, ddof=1)
x_train_2 = (x_train_2 - train_means) / train_stdevs
print(x_train_2[:3])
output:
[[-3.62140776e-01 -7.55086688e-01 1.82225876e-01]
[-8.03204845e-01 1.25825266e-03 -2.47936991e-01]
[ 1.37746700e+00 5.84826715e-01 1.24976856e+00]]
Use this new feature data to train a new model (polynomial hypothesis function of degree 3) and
plot the loss curve, model curve, and model's training set predictions.
history = gradient_descent_vec(x_train_2,y_train,alpha,iterations)
w = history[-1]
print("w:",w)
loss_history = compute_loss_history(x_train_2,y_train,history)
print(loss_history[:-1:len(loss_history)//10])
plot_history_predict(x_train_2,y_train,history[-1],loss_history)
x_val_2 = (np.hstack((x_val,x_val**2,x_val**3)) - train_means) / train_stdevs    # same features and statistics as the training set
x_test_2 = (np.hstack((x_test,x_test**2,x_test**3)) - train_means) / train_stdevs
loss_val = loss(w,x_val_2,y_val)
loss_test = loss(w,x_test_2,y_test)
print(loss_val,loss_test)
5.768794748026971 170.64341351247012
Compared with the polynomial hypothesis function of degree one, the verification error of this
model is reduced to about 5.7, but the test error is still relatively large.
It can be seen from the figure that this model fits the training data well. So can the fit to the training data be improved further by increasing the polynomial degree? For example, using an 8-degree polynomial as the hypothesis function:
n = 8
x_train_n = np.hstack(tuple(x_train**(i+1) for i in range(n)))   # features x, x^2, ..., x^8
train_means = x_train_n.mean(axis=0)
train_stdevs = np.std(x_train_n, axis=0, ddof=1)
x_train_n = (x_train_n - train_means) / train_stdevs
print(x_train_n[:3])
history = gradient_descent_vec(x_train_n,y_train,alpha,iterations)
w = history[-1]
print("w:",history[-1])
loss_history = compute_loss_history(x_train_n,y_train,history)
plot_history_predict(x_train_n,y_train,history[-1],loss_history)
Figure 3-28 Loss curve and fitted 8th degree polynomial curve
It can be seen that the loss on the training set is further reduced to about 0.03, i.e. the trained model fits the training set very well. But is this model better than the previous one? Does it generalize better?
You can first look at the learning curve of this 8-degree polynomial hypothesis function.
(12, 8)
(9,)
(21, 8)
gradient is small enough!
iterated num is : 148
...
Figure 3-29 Training and Validation Loss Curves
For the different training-set sizes, the training error is almost 0, indicating that the model fits the training set very well, while the validation error first decreases as the training data increase and then, from about 5 samples on, continues to increase as the training set grows. This shows that the generalization ability becomes worse.
The following code builds the validation and test features in the same way, normalizes them with the training-set statistics, and outputs the errors of the model trained on the whole training set:

x_val_n = (np.hstack(tuple(x_val**(i+1) for i in range(n))) - train_means) / train_stdevs
x_test_n = (np.hstack(tuple(x_test**(i+1) for i in range(n))) - train_means) / train_stdevs
loss_val = loss(w,x_val_n,y_val)
loss_test = loss(w,x_test_n,y_test)
print(loss_val,loss_test)
37.35209369572414 226.4483749816359
It can be seen that on both the validation set and the test set the errors exceed those of the model trained with the simplest polynomial hypothesis function. This means the model has overfitted and its generalization ability is worse.
In practice, a sample does not strictly satisfy the functional relationship f(x): there is a random error ϵ between the actual target value y and f(x), and this error is generally assumed to obey the Gaussian distribution N(0, σ^2), i.e. ϵ = y − f(x) ∼ N(0, σ^2), or:

y = f(x) + ϵ

That is, there is an error ϵ between the sampled target value y and the true target value f(x). Machine learning hopes to use a hypothesis function f̂(x) to approximate the real f(x). This is usually done by minimizing the errors (y_i − f̂(x_i))^2 between the actual target values y_i and the predicted values f̂(x_i) of the hypothesis function f̂(x).

For a hypothesis function model f̂(x), the specific functions obtained from different training sets and different machine learning algorithms are all different, and for a given x the error (y − f̂(x))^2 produced by the different f̂(x) also differs. The average (expectation) of this error over all possible f̂(x) is called the expected error, E[(y − f̂(x))^2]. This expected error can be decomposed into three terms:

E[(y − f̂(x))^2] = (Bias[f̂(x)])^2 + Var[f̂(x)] + σ^2

Among them, Bias[f̂(x)] = E[f̂(x)] − f(x) is called the bias; it represents the deviation between the expected prediction of the hypothesis function model f̂(x) and the true value. Var[f̂(x)] = E[f̂(x)^2] − E[f̂(x)]^2 = E[(f̂(x) − E[f̂(x)])^2] is called the variance; it is the mean square error between the different predicted values of f̂(x) and their expected value. Please refer to Wikipedia for the derivation of the formula:
https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
Suppose the function to be learned is f(x) = x + 2 sin(1.5x). The following code draws the function curve and a set of points {x_i, y_i} sampled from it:
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(0)
f = lambda x: x + 2*np.sin(1.5*x)
def plot_f(pts=50):
    x = np.linspace(0, 10, pts)
    f_ = f(x)
    plt.plot(x, f_)
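# sample_f() is used below but not listed; a minimal sketch (the noise scale is an assumption):
def sample_f(n=20):
    x = np.random.uniform(0, 10, n)
    y = f(x) + np.random.normal(scale=1.0, size=n)
    return x, y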
plot_f()
x,y = sample_f()
plt.scatter(x,y,s=30)#, facecolors='none', edgecolors='r')
Figure 3-30 The function f(x) = x + 2 sin(1.5x) and coordinate points sampled from it with random noise
Consider first the simplest hypothesis function, a constant f̂(x) = b. Minimizing the mean square error Σ_{i=1}^m (y_i − b)^2 over a training set gives:

b = (Σ_{i=1}^m y_i) / m = np.mean(y)
The samples in different training sets are different, so the b obtained from different training sets, that is, the hypothesis function f̂(x) = b, differs as well. The difference between the expected prediction at some point x over all the different training sets and the true value is the bias, and the mean square error between the predictions of all the different hypothesis functions at this point and their expected value is the variance. The following code trains on 100 training sets to obtain 100 hypothesis functions, and then computes the prediction bias and prediction variance of these hypothesis functions at x = 18:
train_set_num = 100
def plot_b(b, pts=50):
    x = np.linspace(0, 10, pts)
    hat_f = [b for i in range(pts)]
    plt.plot(x, hat_f)
bs = []
for i in range(train_set_num):
    x, y = sample_f(20)
    plt.scatter(x, y)
    b = np.mean(y)
    bs.append(b)
    plot_b(b)
plot_f()
plt.show()
x = 18
f_true = f(x)
f_predict_mean = np.mean(bs)
print("real function value:",f_true)
print("Predicted expected value:",f_predict_mean)
print("Predicted deviation:",f_predict_mean - f_true)
print("Predicted variance:",np.std(bs))
Figure 3-31 Different hypothetical functions f^(x) = b obtained from different training sets
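The corresponding experiment for the linear hypothesis function f̂(x) = wx + b is not listed; a sketch consistent with the reported numbers (np.polyfit is used here for the line fit, an assumption) is:

f_predict = []
for i in range(train_set_num):
    x_s, y_s = sample_f(20)
    w1, b1 = np.polyfit(x_s, y_s, 1)   # least-squares fit of a line: slope, intercept
    xs = np.linspace(0, 10, 50)
    plt.plot(xs, w1*xs + b1)           # draw each fitted line
    f_predict.append(w1*18 + b1)       # prediction at x = 18
f_predict = np.array(f_predict)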
plot_f()
plt.show()
x = 18
f_true = f(x)
f_predict_mean = np.mean(f_predict)
print("real function value:",f_true)
print("Predicted expected value:",f_predict_mean)
print("Predicted deviation:",f_predict_mean - f_true)
print("Predicted variance:",np.std(f_predict))
Figure 3-32 Different hypothetical functions f^(x) = wx + b obtained from different training
sets
The bias and variance of the hypothesis function model f̂(x) = b are −14.564125266958722 and 0.7240080347500965 respectively, while those of the model f̂(x) = wx + b are −11.944324952176219 and 10.868072850656494. Simple models are often not complex enough to approximate the real function f(x), so they are prone to underfitting and have larger bias than complex models. Complex models have smaller bias, but because the function varies in a complicated way, its predicted values also tend to vary greatly or diverge, i.e. the variance is relatively large.
The bias and variance of model predictions can be illustrated by the "bull's-eye" diagram shown in Figure 3-33. The bullseye is the true value of the target, and a shooter corresponds to a hypothesis function: training is his practice, and shooting at the bullseye corresponds to making a prediction on a sample. Each time a model is trained (on a different training set) it shoots (predicts) once, so the training models of this hypothesis function produce a set of prediction values. The degree to which these predicted values deviate on average from the true value is the bias. The picture in the upper left corner shows a very large bias: the model is underfitting, i.e. it cannot express the relationship between the independent variables and the target value well. The two graphs in the right column show predictions for the same independent variable that are very scattered, i.e. the variance is large: the values predicted by the different trained models deviate greatly from each other, as if the shooter's level were very unstable. In the two images in the left column, the predicted values are concentrated, i.e. the variance is small: the different models predict almost the same values, as if the shooter's level were stable. The lower left corner has small bias and small variance, indicating that the model fits well (high shooting accuracy) and is very stable. In the lower right corner the deviations are distributed symmetrically, so their expectation is small and the predictions center around the true value; the fit looks accurate on average, but the large variance suggests a possible overfitting problem (unstable shooting ability).
The bias and variance can also be observed intuitively by comparing the learning curves of the training set and the validation set. If the errors of the training set and the validation set are close to each other, the predictions for the two different data sets are similar, which means the variance is small (a low degree of divergence); otherwise the variance is large. The numerical value of the errors themselves reflects the size of the bias.
As shown in Figure 3-34, when the validation error is much larger than the training error (Jcv >> Jtrain) and the training error is small, the model fits the training set well but fits the validation set poorly, indicating an overfitting phenomenon. On the left, when the validation error is similar to the training error and both are large, there may be underfitting.
Figure 3-34 Judging underfitting and overfitting as well as bias and variance based on training and validation loss curves
3.3 Regularization
Overfitting (high variance) is caused by the model being too complex, with too many degrees of freedom. One way to solve overfitting is to increase the number of training samples, but sometimes enough training samples are difficult or laborious to obtain. Another common method is to reduce the complexity of the model. One way to do this is to choose a simpler hypothesis function instead of a complex one, such as using a 3-degree polynomial instead of a 9-degree polynomial as in the previous example. If you do not want to replace the hypothesis function, you can instead limit its complexity through certain techniques. This method of reducing the complexity of the function is called regularization.
The early stopping method seen earlier is one regularization method: observe the training loss and validation loss during the iterations of gradient descent through the learning curves, and select an appropriate number of iterations according to these curves, so that the model function does not become too complicated. For a complex model, the set of hypothesis functions represented by all possible values of the model parameters may be very large; but at the start of training the parameters are initialized to small values (such as 0), and the set of functions corresponding to small parameter values is only a small part of all possible functions. That is, restricting the parameters to a small value range limits the expressive ability of the model function and reduces the complexity of the model.
For a training set of samples (x^{(i)}, y^{(i)}), the mean square error loss describing the fitting error, with a penalty term added, is:

L(w) = (1/2m) Σ_{i=1}^m (x^{(i)}w − y^{(i)})^2 + λ‖w‖^2

in which:

‖w‖^2 = w_0^2 + w_1^2 + ⋯ + w_n^2
This penalty term (the square of the norm of the model parameters) prevents the parameters from taking large values, because a large w_i makes the value of the loss function larger, and the goal of optimization is to make the loss as small as possible. The λ of the new loss function is a hyperparameter that must be tuned for the actual problem; it controls the relative importance of the fitting error term and the penalty term. A larger λ gives the penalty term a larger proportion and a greater effect, and a smaller λ gives it a smaller proportion and a smaller effect.
The gradient of the new loss function becomes:

∇L(w) = (1/m) Σ_{i=1}^m (x^{(i)}w − y^{(i)}) x^{(i)} + 2λw
Therefore, in the gradient descent method one only needs to add the gradient of the penalty term when computing the partial derivatives. Here is the penalized version of the gradient descent method:
def gradient_descent_reg(X, y, reg, alpha, num_iters, gamma=0.8, epsilon=1e-8):
    w_history = []  # record the parameters during the iterations
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    num_features = X.shape[1]
    w = np.zeros(num_features)
    v = np.zeros(num_features)        # momentum term (np.zeros_like(num_features) would create a 0-d array)
    for n in range(num_iters):
        predictions = X @ w           # the predicted values of the hypothesis function, i.e. f(x)
        errors = predictions - y      # the error between the predicted value and the real value
        gradient = X.transpose() @ errors / len(y)  # gradient of the fitting term
        gradient += 2*reg*w           # gradient of the penalty term
        if np.max(np.abs(gradient)) < epsilon:
            print("gradient is small enough!")
            print("iterated num is :", n)
            break
        v = gamma*v + alpha*gradient  # momentum update
        w = w - v
        w_history.append(w)
    return w_history                  # return the parameters recorded during optimization
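compute_loss_history_reg() is called below; a minimal sketch consistent with the regularized loss defined above is:

def compute_loss_history_reg(X, y, w_history, reg):
    loss_history = []
    for w in w_history:
        errors = X @ w[1:] + w[0] - y   # X here is without the column of 1s
        loss_history.append((errors**2).mean()/2 + reg*np.sum(np.square(w)))
    return loss_history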
reg = 0.2
history = gradient_descent_reg(x_train_n,y_train,reg,alpha,iterations)
print("w:",history[-1])
loss_history = compute_loss_history_reg(x_train_n,y_train,history,reg)
plot_history_predict(x_train_n,y_train,history[-1],loss_history)
Figure 3-35 Training loss curve and fitted curve with regularization
(12, 8)
(9,)
(21, 8)
gradient is small enough!
iterated num is : 158
...
Indeed, after penalizing the model parameters through this regularization technique, the same 8-degree polynomial hypothesis function no longer overfits.
Using the sigmoid function σ(x) to transform the predicted value of linear regression gives the hypothesis function of logistic regression:

f_w(x) = 1/(1 + e^{−xw}) = σ(xw)        (3-30)

Using this hypothesis function to model the relationship between the features of a sample and its target value is the so-called logistic regression. The hypothesis function f_w(x) of logistic regression takes values between 0 and 1, so it can be used to represent the probability that x belongs to a certain class; if x is the tumor size in a medical image, f_w(x) can represent the probability that the tumor corresponding to x is malignant.

If 0 and 1 are used to represent the two classes respectively, and f_w(x) is used to represent the probability that x belongs to class 1, then:

P(y = 1|x) = f_w(x) = 1/(1 + e^{−xw}) = σ(xw)

P(y = 0|x) = 1 − f_w(x) = 1 − 1/(1 + e^{−xw}) = 1 − σ(xw)

For a sample (x, y), if y = 1 the probability of its occurrence is P(y = 1|x), and if y = 0 it is P(y = 0|x). The two cases can be written uniformly as P(y = 1|x)^y P(y = 0|x)^{1−y}, or f_w(x)^y (1 − f_w(x))^{1−y}. The probability that m samples appear at the same time is then:

Π_{i=1}^m f_w(x^{(i)})^{y^{(i)}} (1 − f_w(x^{(i)}))^{1−y^{(i)}}

The w that makes this probability value the largest makes these m samples appear with the greatest probability, so logistic regression seeks the w that maximizes it. Because the repeated multiplication can make the value sharply explode to infinity or shrink toward 0, and to make the solution numerically stable and the derivative convenient to compute, the average of the negative logarithm of this probability is usually used as the cost function, namely:
L(w) = −(1/m) Σ_{i=1}^m ( y^{(i)} log(f_w(x^{(i)})) + (1 − y^{(i)}) log(1 − f_w(x^{(i)})) )

The quantity −( y^{(i)} log(f_w(x^{(i)})) + (1 − y^{(i)}) log(1 − f_w(x^{(i)})) ) is called the cross-entropy loss of the sample, denoted L^{(i)}.

For a sample (x^{(i)}, y^{(i)}), if its true target value y^{(i)} is 1 and the predicted value f_w(x^{(i)}) of logistic regression is also 1, then L^{(i)} = −(1 · log 1 + 0 · log 0) = 0 (taking 0 · log 0 = 0); if the true target value is 0 and the predicted value is also 0, L^{(i)} is likewise 0. That is, when the predicted value is consistent with the target value, this value is 0. If not, because y^{(i)}, 1 − y^{(i)}, f_w(x^{(i)}), 1 − f_w(x^{(i)}) are all real numbers in the (0,1) interval, L^{(i)} is a positive number greater than 0. For instance, with y^{(i)} = 1 a predicted probability of 0.9 gives L^{(i)} = −log 0.9 ≈ 0.105, while 0.1 gives −log 0.1 ≈ 2.303, so confident wrong predictions are punished heavily. Therefore, only when the predicted value is completely consistent with the target value does L^{(i)} achieve its minimum value of 0.
The goal of logistic regression is to find the w that minimizes this loss (also called cost) L(w). The solution algorithm is still the gradient descent method, so the gradient of L(w) with respect to w must be computed. Write z^{(i)} = x^{(i)}w and f^{(i)} = σ(z^{(i)}), so that:

L(w) = (1/m) Σ_{i=1}^m L^{(i)}

L(w) is the mean of the m per-sample losses L^{(i)}; each L^{(i)} is a function of f^{(i)}, f^{(i)} is a function of z^{(i)}, and z^{(i)} is a function of w = (w_1, w_2, ⋯, w_k)^T. According to the rules of differentiation (such as the derivative of a sum being the sum of the derivatives) and the chain rule for composite functions:

∂L(w)/∂L^{(i)} = 1/m

∂L^{(i)}/∂f^{(i)} = −( y^{(i)}/f^{(i)} − (1 − y^{(i)})/(1 − f^{(i)}) ) = (f^{(i)} − y^{(i)}) / ( f^{(i)}(1 − f^{(i)}) )

∂f^{(i)}/∂z^{(i)} = σ(z^{(i)})(1 − σ(z^{(i)})) = f^{(i)}(1 − f^{(i)})

∂z^{(i)}/∂w_j = x_j^{(i)}

So there is:

∂L(w)/∂w_j = Σ_{i=1}^m ∂L(w)/∂L^{(i)} × ∂L^{(i)}/∂f^{(i)} × ∂f^{(i)}/∂z^{(i)} × ∂z^{(i)}/∂w_j
           = (1/m) Σ_{i=1}^m (f^{(i)} − y^{(i)}) / ( f^{(i)}(1 − f^{(i)}) ) × f^{(i)}(1 − f^{(i)}) × x_j^{(i)}
           = (1/m) Σ_{i=1}^m (f^{(i)} − y^{(i)}) x_j^{(i)} = (1/m) Σ_{i=1}^m x_j^{(i)} (f_w(x^{(i)}) − y^{(i)})

Because f_w(x^{(i)}) − y^{(i)} is a scalar, it commutes with the multiplication by the vector components, that is, (f_w(x^{(i)}) − y^{(i)}) x_j^{(i)} = x_j^{(i)} (f_w(x^{(i)}) − y^{(i)}).

It can be observed that for a sample (x, y), the gradient (derivative) of L(w) with respect to the weighted sum z = xw is ∂L/∂z = f − y. This has the same form as the gradient of the squared error (1/2)(f − y)^2 with respect to f in linear regression, so the gradient formulas of logistic regression and linear regression are the same.
If each x^{(i)} is written as a row vector, all the x^{(i)} can form a matrix X by row, and the target values and predicted values form the vectors:

X = [ x^{(1)} ]       y = (y^{(1)}, y^{(2)}, ⋯, y^{(m)})^T
    [ x^{(2)} ]
    [    ⋮   ]       f = (f_w(x^{(1)}), f_w(x^{(2)}), ⋯, f_w(x^{(m)}))^T = σ(Xw)
    [ x^{(m)} ]

and the gradient takes the vector form:

∇_w L(w) = (1/m) Σ_{i=1}^m x^{(i)T} (f_w(x^{(i)}) − y^{(i)}) = (1/m) X^T (f − y) = (1/m) X^T (σ(Xw) − y)

Assuming that the number of sample data features is n (plus the constant feature 1), it can be verified that the dimensions of the above matrices and vectors are consistent. Therefore, once the output f of logistic regression is known, the following python code can be used to calculate the gradient of the cross-entropy loss L with respect to the model parameters w:

errors = f - y                               # the error between the predicted value and the real value
gradient = X.transpose() @ errors / len(y)   # calculate the gradient
If instead each x^{(i)} is written as a column vector, all the x^{(i)} form a matrix X by column, and the same gradient is written ∇_w L(w) = (1/m) X (f − y). In component form, for every parameter w_j:

∂L(w)/∂w_j = (1/m) Σ_{i=1}^m x_j^{(i)} (f_w(x^{(i)}) − y^{(i)})

If the gradient is written in the form of a row vector according to the usual practice, then:

∇_w L(w) = [ ∂L(w)/∂w_0, ∂L(w)/∂w_1, ∂L(w)/∂w_2, ⋯, ∂L(w)/∂w_K ] = (1/m)(f − y)^T X = (1/m)(σ(Xw) − y)^T X

i.e. in code: gradient = errors.transpose() @ X / len(y).

Similarly, regularization can also be added to the loss function of logistic regression, that is:

L(w) = −(1/m) Σ_{i=1}^m ( y^{(i)} log(f_w(x^{(i)})) + (1 − y^{(i)}) log(1 − f_w(x^{(i)})) ) + λ‖w‖^2

If each sample x^{(i)} is a row vector and f, y and the model parameters w are column vectors, the gradient of this regularized loss can be written in the vector form:

∇_w L(w) = (1/m) X^T (σ(Xw) − y) + 2λw

1. Generate data

The following code uses np.random.normal() to generate two sets of two-dimensional coordinate point data sets Xa and Xb that obey different normal distributions; each sample represents a coordinate point on a two-dimensional plane. The samples in Xa are normally distributed around the center point (10,12), and the samples in Xb are normally distributed around the center point (5,6). The code colors these samples to distinguish which category they belong to.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
n_pts = 100
D = 2
Xa = np.array([np.random.normal(10, 2, n_pts),
               np.random.normal(12, 2, n_pts)])
Xb = np.array([np.random.normal(5, 2, n_pts),
               np.random.normal(6, 2, n_pts)])
X = np.vstack((Xa.transpose(), Xb.transpose()))        # one sample (coordinate point) per row
y = np.concatenate((np.zeros(n_pts), np.ones(n_pts)))  # labels: 0 for Xa, 1 for Xb
[[13.52810469 15.76630139]
[8.20906688 11.86351679]
[ 4.26163632 3.3869463 ]
[6.04212975 4.47171215]]
[0. 0. 1. 1.]
fig, ax = plt.subplots(figsize=(4,4))
ax.scatter(X[:n_pts,0], X[:n_pts,1], color='lightcoral',
label='$Y = 0$')
ax.scatter(X[n_pts:,0], X[n_pts:,1], color='blue',
label='$Y = 1$')
ax.set_title('Sample Dataset')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.legend(loc=4);
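The gradient descent solver below relies on the sigmoid function; a minimal definition (assumed to match the one introduced with the logistic regression hypothesis) is:

def sigmoid(z):
    return 1/(1 + np.exp(-z))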
The gradient descent function for regularized logistic regression mirrors gradient_descent_reg(), with the sigmoid applied to the linear prediction:

def gradient_descent_logistic_reg(X, y, lambda_, alpha, num_iters, gamma=0.8, epsilon=1e-8):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
    w = np.zeros(X.shape[1]); v = np.zeros(X.shape[1]); w_history = []
    for n in range(num_iters):
        f = sigmoid(X @ w)                            # predicted probabilities
        gradient = X.transpose() @ (f - y) / len(y)   # gradient of the cross-entropy loss
        gradient += 2*lambda_*w                       # gradient of the penalty term
        if np.max(np.abs(gradient))<epsilon:
            print("gradient is small enough!")
            print("iterated num is :",n)
            break
        v = gamma*v + alpha*gradient                  # momentum update
        w = w - v
        w_history.append(w)
    return w_history
def loss_logistic(w,X,y,reg=0.):
    f = sigmoid(X @ w[1:]+w[0])
    loss = -np.mean((np.log(f).T * y + np.log(1-f).T * (1-y)))
    loss += reg*(np.sum(np.square(w)))
    return loss

def loss_history_logistic(w_history,X,y,reg=0.):
    loss_history = []
    for w in w_history:
        loss_history.append(loss_logistic(w,X,y,reg))
    return loss_history
reg = 0.0
alpha=0.01
iterations=10000
w_history = gradient_descent_logistic_reg(X,y,reg,alpha,iterations)
w = w_history[-1]
print("w:",w)
loss_history = loss_history_logistic(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
4. Decision curve

The two classes are separated at the probability value f_w(x) = 0.5. Because f_w(x) = σ(xw), for a sample x the condition f_w(x) = 0.5 is equivalent to xw = 0, i.e. the dot product of w and x is 0, that is, w_0 + w_1 x_1 + w_2 x_2 = 0. Taking two values of x_1 and solving x_2 = −w_0/w_2 − x_1 (w_1/w_2) gives two points of this line, and drawing the decision line through them in the (x_1, x_2) coordinate plane shows that the learned model separates the samples of the two categories very well. The drawn cost curve also reflects the gradual convergence of the algorithm.
fig, ax = plt.subplots(1, 2, figsize=(8,4))   # left: decision line, right: cost curve (layout reconstructed)
x1 = np.array([X[:,0].min()-1, X[:,0].max()+1])
x2 = - w.item(0) / w.item(2) + x1 * (- w.item(1) / w.item(2))
ax[0].scatter(X[:n_pts,0], X[:n_pts,1], color='lightcoral', label='$Y = 0$')
ax[0].scatter(X[n_pts:,0], X[n_pts:,1], color='blue', label='$Y = 1$')
ax[0].plot(x1, x2, color='k')                 # the decision line
ax[1].plot(loss_history, color='r')
ax[1].set_ylim(0,ax[1].get_ylim()[1])
ax[1].set_title(r'$J(w)$ vs. Iteration')
ax[1].set_xlabel('Iteration')
ax[1].set_ylabel(r'$J(w)$')
fig.tight_layout()
Figure 3-36 Classification and loss curves of a flat point set for binary classification
5. Prediction accuracy

# Print accuracy
X_1 = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) # add a column of feature 1
y_predictions = sigmoid(X_1 @ w)>=0.5
print ('The accuracy of the prediction is: %d ' % float((np.dot(y, y_predictions)
       + np.dot(1 - y,1 - y_predictions)) / float(y.size) * 100) + '%' )
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
scikit_log_reg = LogisticRegression()
scikit_log_reg.fit(X,y)
# Print accuracy
y_predictions = scikit_log_reg.predict(X)
print ('The accuracy of the prediction is: %d ' % float((np.dot(y, y_predictions)
       + np.dot(1 - y,1 - y_predictions)) / float(y.size) * 100) + '%' )
#plot_decision_boundary(lambda x: clf.predict(x), X, Y)
# Plot decision boundary
x1 = np.array([X[:,0].min()-1, X[:,0].max()+1])
x2 = - w.item(0) / w.item(2) + x1 * (- w.item(1) / w.item(2))
fig, ax = plt.subplots(figsize=(4,4))
ax.scatter(X[:n_pts,0], X[:n_pts,1], color='lightcoral', label='$Y = 0$')
ax.scatter(X[n_pts:,0], X[n_pts:,1], color='blue', label='$Y = 1$')
ax.plot(x1, x2, color='k')   # the learned decision line
ax.set_title('Sample Dataset')
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
import pandas
import matplotlib.pyplot as plt
import numpy as np
iris = pandas.read_csv("iris.csv")
# shuffle rows
shuffled_rows = np.random.permutation(iris.index)
iris = iris.loc[shuffled_rows,:]
print(iris.head())
print(iris.species.unique())
iris.hist()
plt.show()
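The construction of X and y from the iris table is not listed above; one hypothetical way to set up a binary problem (the column selection and species choice are assumptions, not from the book) is:

two = iris[iris["species"].isin(iris["species"].unique()[:2])]  # keep two species only
X = two.drop(columns=["species"]).values[:, :2].astype(float)   # two feature columns
y = (two["species"] == two["species"].unique()[1]).astype(float).values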
reg = 0.0
alpha=0.0001
iterations=10000
w_history = gradient_descent_logistic_reg(X,y,reg,alpha,iterations)
w = w_history[-1]
print("w:",w)
loss_history = loss_history_logistic(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
# Print accuracy
X_1 = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X)) #Add a
column of features 1
y_predictions = sigmoid(X_1 @ w)>=0.5
Of course, no validation set or test set was used here to evaluate the trained model. Readers can try dividing the original data set into training, validation, and test sets, computing the errors on the validation and test sets, drawing the corresponding learning curves, and observing the fitting effect of the trained model.

3.6 Softmax regression

Unlike the hypothesis function of logistic regression, which outputs a single value indicating which of two classes the data belongs to, softmax regression is an extension of logistic regression to multiple classes: its hypothesis function outputs as many values as there are categories, each indicating the probability that the data belongs to the corresponding class.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(100)
def gen_spiral_dataset(N=100, D=2, K=3):
    # N: number of points per class, D: dimensionality, K: number of classes
    X = np.zeros((N*K, D))            # data matrix (each row = single example)
    y = np.zeros(N*K, dtype='uint8')  # class labels
    for j in range(K):
        ix = range(N*j, N*(j+1))
        r = np.linspace(0.0, 1, N)    # radius
        t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2  # theta
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        y[ix] = j
    return X, y
X_spiral,y_spiral = gen_spiral_dataset()
# lets visualize the data:
plt.scatter(X_spiral[:, 0], X_spiral[:, 1], c=y_spiral, s=20,
cmap=plt.cm.spring) #s=40, cmap=plt.cm.Spectral)
plt.show()
The softmax function of 3 independent variables (z_1, z_2, z_3) produces a vector with the same number of components:

softmax(z_1, z_2, z_3) = ( e^{z_1}/(e^{z_1}+e^{z_2}+e^{z_3}), e^{z_2}/(e^{z_1}+e^{z_2}+e^{z_3}), e^{z_3}/(e^{z_1}+e^{z_2}+e^{z_3}) )
                       = ( e^{z_1}/Σ_{i=1}^3 e^{z_i}, e^{z_2}/Σ_{i=1}^3 e^{z_i}, e^{z_3}/Σ_{i=1}^3 e^{z_i} )
Obviously, the value of each component of the output vector of the softmax() function is
located in [0, 1], and the sum of all component values is 1, so each component can be
regarded as a probability value.
The function softmax() below is the code to calculate the softmax function:
import numpy as np
def softmax(x):
    e_x = np.exp(x)
    return e_x / e_x.sum()
Input a 3-dimensional vector z, the softmax function outputs a 3-dimensional vector, each
component of which represents a probability, that is, the values of these components are
between [0, 1], and their sum is equal to 1.
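For example (the printed values are approximate):

z = np.array([3.0, 1.0, 0.2])
print(softmax(z))   # -> [0.836 0.113 0.051] approximately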
Note that when softmax() acts on z = [3.0, 1.0, 0.2], the resulting values keep the same size relationship as the inputs 3.0, 1.0, 0.2. However, for a component x with a large value, e^x will exceed the range of values that the floating-point type can represent:

z = [100,1000]
softmax(z)

array([ 0., nan])
Since dividing both the numerator and the denominator of a fraction by the same number leaves the fraction unchanged, we have:

e^{z_j} / Σ_i e^{z_i} = (e^{z_j}/e^a) / (Σ_i e^{z_i}/e^a) = e^{z_j − a} / Σ_i e^{z_i − a}

Therefore, you can first find the maximum a of all the z_i and then use z_i − a to compute the value of the softmax() function:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()
print(softmax(z))
z = [500,1000]
softmax(z)

[0. 1.]
array([7.12457641e-218, 1.00000000e+000])
The above code works for a single input vector such as z = (z_1, z_2, z_3); can it be used for a matrix whose rows are multiple samples? No: np.max() and sum() would then operate over all elements of the matrix rather than over each row, so the outputs no longer sum to 1 per row. For example, a row [1, 2, 3] of such a matrix may produce the output [0.00548473, 0.01490905, 0.04052699], which does not satisfy the normalization condition. In order to compute the softmax() values of multiple samples at the same time, the softmax vector should be computed separately for each sample (each row). To do this, rewrite the above code as:
def softmax(x):
    a = np.max(x, axis=-1, keepdims=True)
    e_x = np.exp(x - a)
    return e_x / np.sum(e_x, axis=-1, keepdims=True)
softmax(z)
The parameter axis=-1 of the numpy functions np.max() and np.sum() indicates that the maximum (max) and summation (sum) operations are performed along the last axis, i.e. within each row, and keepdims=True keeps the dimensions of the result array the same as the original array so that broadcasting works. The code first finds the maximum of each row vector, subtracts it from that row and exponentiates, and finally computes the softmax values row by row, so each input row produces a corresponding output vector of probabilities. The code can be further simplified:
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)
softmax(z)
In general, for C classes, the i-th component of the softmax function is:

f_i = e^{z_i} / Σ_{k=1}^C e^{z_k}

where Σ_{i=1}^C f_i = 1. To prevent overflow in the computation, the maximum of the components can be subtracted from each component, exactly as above.
In order to find the gradient of f(z) = softmax(z) with respect to z, introduce the intermediate variables a_i = e^{z_i} and b = Σ_{k=1}^C e^{z_k}, so that:

f_i = a_i / b

∂a_i/∂z_i = ∂e^{z_i}/∂z_i = e^{z_i}

∂a_i/∂z_j = 0    (j ≠ i)

∂b/∂z_j = ∂(Σ_{k=1}^C e^{z_k})/∂z_j = e^{z_j}

By the quotient rule, for j ≠ i:

∂f_i/∂z_j = ( (∂a_i/∂z_j) b − a_i (∂b/∂z_j) ) / b^2 = (0 − a_i a_j) / b^2 = −f_i f_j

and for j = i:

∂f_i/∂z_i = ( e^{z_i} b − a_i e^{z_i} ) / b^2 = f_i (1 − f_i)
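These formulas can be verified numerically (a sketch, assuming the softmax() defined above): perturb one component of z and compare the finite-difference slope with the analytic Jacobian column.

z = np.array([0.5, -1.0, 2.0])
f = softmax(z)
eps = 1e-6
z_eps = z.copy(); z_eps[0] += eps
numeric = (softmax(z_eps) - f) / eps                             # d f / d z_0, numerically
analytic = np.array([f[0]*(1 - f[0]), -f[1]*f[0], -f[2]*f[0]])   # first column of the Jacobian
print(np.allclose(numeric, analytic, atol=1e-5))                 # expected: True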
Collecting these partial derivatives, the gradient (Jacobian matrix) of f = (f_1, f_2, ⋯, f_C) = softmax(z) with respect to z is:

∂f/∂z = [ f_1(1−f_1)   −f_1 f_2     ⋯   −f_1 f_C   ]
        [ −f_2 f_1     f_2(1−f_2)   ⋯   −f_2 f_C   ]
        [     ⋮             ⋮       ⋱       ⋮      ]
        [ −f_C f_1     −f_C f_2     ⋯   f_C(1−f_C) ]

that is, ∂f/∂z = diag(f) − f f^T. If L is some other variable (such as a loss) whose gradient with respect to f is ∂L/∂f = (∂L/∂f_1, ∂L/∂f_2, ⋯, ∂L/∂f_C), then by the chain rule the gradient of L with respect to z is:

∇_z L = (∂f/∂z)^T (∂L/∂f)

Using df to represent the gradient of some other variable L with respect to f, the single-sample Jacobian can be computed in python as -np.outer(f, f) + np.diag(f.flatten()), and the gradient of L with respect to z as df @ that Jacobian (this is what softmax_backward() below computes). For multiple samples, the following code can be used to calculate the gradient of softmax:

def softmax_gradient(z):
    f = softmax(z)
    if len(f) == 1:                # a single sample: return its Jacobian
        return -np.outer(f, f) + np.diag(f.flatten())
    else:                          # one Jacobian per sample
        grads = []
        for i in range(len(f)):
            fi = f[i]
            grad = -np.outer(fi, fi) + np.diag(fi.flatten())
            grads.append(grad)
        return np.array(grads)

A quick test:

x = np.array([[1, 2]])
print(softmax_gradient(x))
df = np.array([1, 3])
print(softmax_backward(x, df))

[[ 0.19661193 -0.19661193]
 [-0.19661193  0.19661193]]
[-0.39322387  0.39322387]
You can use np.einsum() to perform the multi-sample outer product operation, that is, write the following vectorized code:

def softmax_gradient(Z, isF=False):
    if isF:       # Z already holds the softmax output F
        F = Z
    else:
        F = softmax(Z)
    D = []        # diag(f) for each sample
    for i in range(F.shape[0]):
        f = F[i]
        D.append(np.diag(f.flatten()))
    # diag(f) - outer(f, f), computed for all samples at once
    grads = D - np.einsum('ij,ik->ijk', F, F)
    return grads

print(softmax_gradient(x))
If you know the gradient dF of a function (such as a loss function) with respect to the softmax output value F, you can use a softmax_backward() function to find the gradient of the function with respect to the softmax input Z. For a two-sample input, the result has one gradient row per sample, for example:

[[-0.39322387  0.39322387]
 [-0.09035332  0.09035332]]
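The listing of softmax_backward() is not shown above; a minimal sketch consistent with the Jacobians computed by softmax_gradient() (the einsum contraction is our reconstruction):

def softmax_backward(Z, dF):
    # for each sample i: dZ[i] = dF[i] @ Jacobian[i]
    grads = softmax_gradient(Z)     # shape (m, C, C)
    dF = np.atleast_2d(dF)
    return np.einsum('ij,ijk->ik', dF, grads)

For example, with x = np.array([[1, 2]]) and df = np.array([[1, 3]]), softmax_backward(x, df) gives [[-0.39322387  0.39322387]].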
As shown in the figure, for a 3-classification problem, 3 linear regression functions can be used to generate 3 outputs, and these 3 outputs can be passed through the softmax function to generate 3 values $f_i$ ($i = 1, 2, 3$), which respectively represent the probability that the sample belongs to class $i$.
Figure 3-42 The hypothetical function of softmax regression consists of 3 weighted sums
and a softmax function
$$f(x) = (f_1, f_2, f_3)$$
For the 10-category problem of handwritten digit recognition, you can use 10 linear regression functions $xW_{:,i}$ to generate 10 output values $z_i$ from the input features $x$, and then use the softmax function to convert these 10 output values into 10 probability values $f_i$. As before, the constant 1 is also used as a feature of the input $x$, that is, $x = (1, x_1, x_2, x_3)$. Then the hypothesis function above can be written as:

$$f(x) = \mathrm{softmax}(xW_{:,1}, xW_{:,2}, xW_{:,3}) = \mathrm{softmax}(xW)$$
$$\mathrm{softmax}(z_1, z_2, z_3) = \left(\frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}, \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}\right) = \left(\frac{e^{z_1}}{\sum_{i=1}^{3} e^{z_i}}, \frac{e^{z_2}}{\sum_{i=1}^{3} e^{z_i}}, \frac{e^{z_3}}{\sum_{i=1}^{3} e^{z_i}}\right)$$
For a sample $x$, $f(x) = \mathrm{softmax}(xW)$ is a vector, each component of which represents the probability that $x$ belongs to the corresponding category; for example, $f_j$ represents the probability that $x$ belongs to the $j$th category. If the true value $y^{(i)}$ of a sample $(x^{(i)}, y^{(i)})$ is 2, the probability that the sample belongs to the second category is $f_2^{(i)}$.
Multi-sample form
The weighted sum of the data features of m samples X W is a two-dimensional matrix, and
each row of the matrix represents the weighted sum of the data features of a sample. This
matrix is denoted by the letter Z :
$$Z = \begin{bmatrix} z^{(1)} \\ z^{(2)} \\ \vdots \\ z^{(m)} \end{bmatrix} = \begin{bmatrix} x^{(1)}W \\ x^{(2)}W \\ \vdots \\ x^{(m)}W \end{bmatrix}$$
The weighted sum $z^{(i)}$ corresponding to each sample $x^{(i)}$ is itself a vector, and the softmax function acts on this vector to generate a vector $\mathrm{softmax}(z^{(i)})$, which represents the probabilities that the sample belongs to the different categories. Use $f^{(i)}$ to represent $\mathrm{softmax}(z^{(i)})$; the outputs of all m samples then form a matrix $F$:

$$F = \begin{bmatrix} f^{(1)} \\ f^{(2)} \\ \vdots \\ f^{(m)} \end{bmatrix} = \begin{bmatrix} \mathrm{softmax}(z^{(1)}) \\ \mathrm{softmax}(z^{(2)}) \\ \vdots \\ \mathrm{softmax}(z^{(m)}) \end{bmatrix} = \begin{bmatrix} \frac{e^{z_1^{(1)}}}{\sum_i e^{z_i^{(1)}}} & \cdots & \frac{e^{z_C^{(1)}}}{\sum_i e^{z_i^{(1)}}} \\ \vdots & & \vdots \\ \frac{e^{z_1^{(m)}}}{\sum_i e^{z_i^{(m)}}} & \cdots & \frac{e^{z_C^{(m)}}}{\sum_i e^{z_i^{(m)}}} \end{bmatrix}$$

The target values (labels) of the samples can be represented by a one-dimensional vector $y$, where each element is the integer corresponding to the true target category of a sample, that is:

$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Use $F_y$ to denote the vector whose i-th element is the predicted probability $f^{(i)}_{y^{(i)}}$ of the i-th sample's true category:

$$F_y = \begin{bmatrix} f^{(1)}_{y^{(1)}} \\ f^{(2)}_{y^{(2)}} \\ \vdots \\ f^{(m)}_{y^{(m)}} \end{bmatrix}$$

that is, the probabilities corresponding to all the samples' target categories.
3.6.4 Multi-classification cross-entropy loss
For a sample $(x^{(i)}, y^{(i)})$, its data feature $x^{(i)}$ is mapped by the softmax regression model to the probabilities that the sample belongs to each class:

$$f_1^{(i)}, f_2^{(i)}, \cdots, f_C^{(i)}$$

The probability that the sample belongs to its target category $y^{(i)}$ is $f^{(i)}_{y^{(i)}}$; this probability indicates how likely the sample is to appear together with its target category. Similarly, the probability that the m samples $(x^{(i)}, y^{(i)})$ appear simultaneously with their corresponding target categories is:

$$\prod_{i=1}^{m} f^{(i)}_{y^{(i)}}$$
The $W$ that maximizes this probability value makes these m samples appear with the greatest probability; therefore, softmax regression seeks the model parameters $W$ that maximize it. Because a product of many factors can quickly grow huge or vanish toward 0, to keep the solution algorithm numerically stable, the average of the negative logarithm of this probability is usually used as the cost function, namely:
$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right)$$

where $-\log(f^{(i)}_{y^{(i)}})$ is called the cross-entropy loss of sample i. The problem of maximizing $\prod_{i=1}^{m} f^{(i)}_{y^{(i)}}$ then becomes the problem of minimizing this cross-entropy loss.
For a 3-category problem, the values 0, 1, and 2 of $y^{(i)}$ represent the 3 different categories to which a sample can belong. If the true category of a sample is the third category, that is, $y^{(i)} = 2$, and the predicted value $f^{(i)}$ for the sample is a vector indicating the probability that the sample belongs to each category, with third component 0.2, then the cross-entropy loss of this sample is $-\log(f^{(i)}_2) = -\log(0.2)$.
For multiple samples, if there are 2 samples (m = 2) whose probability matrix is $F$ and whose target value vector is:

$$y = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

then $F_y$ is:

$$F_y = \begin{bmatrix} 0.3 \\ 0.6 \end{bmatrix}$$
which indicates the probability that each sample belongs to its corresponding target class. The cross-entropy loss of all samples can then be vectorized as:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right) = -\frac{1}{m}\,\mathrm{sum}(\log F_y)$$
For the above 2-sample example, this average cross-entropy loss is:

$$L(W) = -\frac{1}{2}\left(\log(0.3) + \log(0.6)\right)$$
print(-1/2*(np.log(0.3)+np.log(0.6)))
print(cross_entropy(F,Y))
0.8573992140459634
0.8573992140459634
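The cross_entropy() called above is not listed here; a minimal sketch consistent with the printed value (the name and signature are assumptions):

def cross_entropy(F, y):
    # mean negative log probability of each sample's true class
    m = len(F)
    return -np.sum(np.log(F[range(m), y])) / m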
Sometimes, instead of using an integer value to represent the category of a sample, a so-called one-hot vector $y^{(i)} = (y^{(i)}_1, y^{(i)}_2, \cdots, y^{(i)}_C)$ is used to indicate the category to which a sample belongs, where C is the total number of categories. Only one component of this vector has the value 1, and the other components have the value 0.
For example, for a 3-category problem, if the category of a certain sample is 3, its one-hot vector is
(0,0,1), that is, the third component value is 1, and the other component values are all 0.
For a sample, if the jth component of the corresponding $y^{(i)}$ satisfies $y^{(i)}_j = 1$, that is, the sample belongs to the jth class, then the cross-entropy loss of the sample can be written as:

$$-\log(f^{(i)}_j) = -y^{(i)}_j \log(f^{(i)}_j) = -\sum_{j=1}^{C} y^{(i)}_j \log(f^{(i)}_j) = -y^{(i)} \cdot \log(f^{(i)})$$
That is, the cross-entropy loss of this sample is the negative of the dot product of the vectors $y^{(i)}$ and $\log(f^{(i)})$.
Therefore, for target values expressed in one-hot form, the cross-entropy loss of all samples can be written as:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)} \cdot \log(f^{(i)}) = -\frac{1}{m}\,\mathrm{np.sum}(Y \odot \log(F))$$
For the above $F$ and the one-hot encoded matrix $Y$:

$$Y = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$

these two matrices of the same shape can be multiplied using the Hadamard product, that is, the element-wise product $Y \odot \log(F)$. Adding all the elements of the resulting matrix and dividing by the number of samples gives the total cross-entropy:

$$L(W) = -\frac{1}{2}\,\mathrm{np.sum}(Y \odot \log(F)) = -\frac{1}{2}\left(\log(0.3) + \log(0.6)\right)$$
print(cross_entropy_one_hot(F,Y))
0.8573992140459634
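cross_entropy_one_hot() is likewise used without its listing; a minimal sketch (name and signature assumed):

def cross_entropy_one_hot(F, Y):
    # the Hadamard product selects the log probability of each true class
    m = len(F)
    return -np.sum(Y * np.log(F)) / m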
If the target labels are integers, the following code computes the cross-entropy loss directly from the weighted sum Z:

def softmax_cross_entropy(Z,y):
    m = len(Z)
    F = softmax(Z)
    log_Fy = -np.log(F[range(m),y])
    return np.sum(log_Fy) / m
output:
31.500003072148047
If the target labels are in one-hot vector form, the following code computes the cross-entropy loss from
the weighted sum:
def softmax_cross_entropy_one_hot(Z, y):
F = softmax(Z)
loss = -np.sum(y*np.log(F),axis=1)
return np.mean(loss)
31.500003072148047
The solution algorithm is still the gradient descent method. For this, it is necessary to calculate the gradient of $L(W)$ with respect to $W$, that is, the partial derivative with respect to each $W_{jk}$. $L(W)$ is the composite of the functions $z = (z_1, z_2, z_3) = (xW_{:,1}, xW_{:,2}, xW_{:,3})$ and $f(z) = \mathrm{softmax}(z) = \mathrm{softmax}(z_1, z_2, z_3)$.
For the cross-entropy loss $L = -\log f_y = -\log\frac{a_y}{a_1 + a_2 + a_3}$ of a single sample, where $a_i = e^{z_i}$:

$$\frac{\partial L}{\partial z_i} = -\frac{1}{a_y}\frac{\partial a_y}{\partial z_i} + \frac{1}{a_1 + a_2 + a_3}e^{z_i} = -\frac{1}{a_y}\,1(y{==}i)\,e^{z_y} + \frac{1}{a_1 + a_2 + a_3}e^{z_i} = -1(y{==}i) + \frac{e^{z_i}}{\sum_{k=1}^{3} e^{z_k}} = f_i - 1(y{==}i)$$

Here the symbol $1(y{==}i)$ has the value 1 when $y == i$ holds and the value 0 otherwise.
If y = 1, then:

$$\nabla_z L = (f_1 - 1, f_2, f_3)$$
That is to say, for any C-classification problem, if the classification $y$ of a sample is $i$, then:

$$\nabla_z L = (f_1, f_2, \cdots, f_i - 1, \cdots, f_C) = f - I_i$$

The notation $I_i$ represents a one-hot vector whose i-th component is 1 and whose other components are 0. If the target value $y$ of the sample is represented by this one-hot vector, that is, $y = I_i$, then:

$$\nabla_z L = f - y$$
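As a quick numerical sanity check of this formula (a sketch; the input values are our choice), the analytical gradient $f - I_y$ can be compared against a centered-difference approximation of $-\partial \log f_y / \partial z_i$:

import numpy as np

def stable_softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
y = 1
f = stable_softmax(z)
grad = f.copy()
grad[y] -= 1                 # analytical gradient: f - I_y

eps = 1e-6
num = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    num[i] = (-np.log(stable_softmax(zp)[y]) + np.log(stable_softmax(zm)[y])) / (2*eps)

print(grad)
print(num)   # the two printed vectors should agree to about 6 decimal places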
This is surprisingly consistent with the gradient formula of the linear regression loss $\frac{1}{2}\|f - y\|_2^2$ with respect to $f$, and with that of the logistic regression cross-entropy loss $-(y\log(f) + (1 - y)\log(1 - f))$ with respect to the weighted sum $z$. But there is still a difference: for linear regression it is the gradient of the loss with respect to the output $f$, while for logistic regression and softmax regression it is with respect to the weighted sum $z$ rather than $f$. The gradient is $f - y$ in each case, but the probability of logistic regression is computed by $f = \sigma(z)$, while the probability of softmax regression is computed by $f = \mathrm{softmax}(z)$.
Therefore, for the matrix $Z$ constructed from multiple sample features, using $L$ to represent the total loss of all samples, the gradient of $L$ with respect to the weighted sum $Z$ is:

$$\nabla_Z L = F - I_y$$

where each row of $I_y$ is the one-hot vector of the corresponding sample's label, or, with one-hot targets $Y$:

$$\nabla_Z L = F - Y$$

This form is the same as the gradient of the logistic regression loss function with respect to the weighted sum, and is also consistent with the gradient of the linear regression loss $\frac{1}{2}\|F - Y\|_2^2$ with respect to the output $F$.
According to formula (12), the code for calculating the gradient of cross entropy with respect to Z is as
follows:
def grad_softmax_crossentropy(Z,y):
F = softmax(Z)
I_i = np.zeros_like(Z)
I_i[np.arange(len(Z)),y] = 1
return (F - I_i) / Z.shape[0]
def grad_softmax_cross_entropy(Z,y):
m = len(Z)
F = softmax(Z)
F[range(m),y] -= 1
return F/m
To ensure that the calculation of the analytical gradient contains no errors, the general numerical gradient function from Section 1.4 can be used to compute the numerical gradient of the cross-entropy with respect to Z and compare it with the analytical gradient above:
import util

def loss_f():
    return softmax_cross_entropy(Z,y)

Z = Z.astype(float)   # Note: the integer array must be converted to float type
print("num_grad",util.numerical_gradient(loss_f,[Z]))
If the sample target is represented by a one-hot vector, the code for calculating the gradient of cross
entropy with respect to Z is as follows:
def grad_softmax_crossentropy_one_hot(Z, y):   # y is represented by a one-hot vector
    F = softmax(Z)
    return (F - y)/Z.shape[0]
2. The gradient of the cross-entropy loss with respect to the weight parameter
Having obtained the gradient of the loss function with respect to the weighted sum $z$, the gradient with respect to the model parameters $W$ can be obtained further. Because the gradient of $z_i = xW_{:,i}$ with respect to $W_{:,i}$ is $x$, and with respect to any other column $W_{:,j}$ is 0:

$$\frac{\partial L}{\partial W_{:,1}} = (f_1 - 1(y{==}1))\,x$$

$$\frac{\partial L}{\partial W_{:,2}} = (f_2 - 1(y{==}2))\,x$$

$$\frac{\partial L}{\partial W_{:,3}} = (f_3 - 1(y{==}3))\,x$$

Each $\frac{\partial L}{\partial W_{:,j}}$ is a row vector; to write $\frac{\partial L}{\partial W}$ in the form of a matrix with the same shape as $W$:

$$\frac{\partial L}{\partial W} = \left(\frac{\partial L}{\partial W_{:,1}}^T, \frac{\partial L}{\partial W_{:,2}}^T, \cdots, \frac{\partial L}{\partial W_{:,C}}^T\right) = x^T\,\big(f_1 - 1(y{==}1),\ f_2 - 1(y{==}2),\ \cdots,\ f_C - 1(y{==}C)\big)$$
If the one-hot vector is used to represent the target value (label) $y$, this can be written as a more concise formula:

$$\frac{\partial L}{\partial W} = x^T (f - y)$$
Here it is assumed that $x$, $f$, $y$ are all row vectors, so $x^T(f - y)$ is a matrix of the same shape as $W$. If C is the number of categories and n is the number of data features, then $x$ is a $1 \times n$ vector, $x^T$ is an $n \times 1$ vector, and $W$ is an $n \times C$ matrix; because $z = xW$ and $f = \mathrm{softmax}(z)$, both $f$ and $y$ are $1 \times C$ vectors.

For m samples, the features form a matrix $X$:

$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix}$$
$F$, $Y$ are the matrices of predicted values and target values corresponding to these samples:

$$F = \begin{bmatrix} f^{(1)} \\ f^{(2)} \\ \vdots \\ f^{(m)} \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$

Then the vector form of the gradient of the loss function with respect to the weights $W$ is:

$$\frac{\partial L}{\partial W} = \frac{1}{m} X^T (F - Y)$$

Similarly, a regularization term can be added to the cross-entropy loss of softmax regression. If the target value is represented by an integer, the loss function becomes:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right) + \lambda \|W\|_2^2$$

If the target value is represented by a one-hot vector, the loss function becomes:

$$L(W) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)} \cdot \log(f^{(i)}) + \lambda \|W\|_2^2$$

Then the gradient of the loss function with respect to the weights $W$ is:

$$\frac{\partial L}{\partial W} = \frac{1}{m} X^T (F - Y) + 2\lambda W$$
According to the gradient formula, it is easy to write the code for calculating the gradient of the loss function with respect to $W$. In the following code, X represents the data feature matrix of multiple samples, y the target value vector, and reg the regularization parameter $\lambda$; loss_softmax() and gradient_softmax() respectively calculate the loss function value and its gradient with respect to W:
#def loss_gradient(W,X,y,lambda_):
def gradient_softmax(W,X,y,reg):
m = len(X)
Z= np.dot(X,W)
I_i = np.zeros_like(Z)
I_i[np.arange(len(Z)),y] = 1
F = softmax(Z)
#F = np.exp(Z) / np.exp(Z).sum(axis=-1,keepdims=True)
grad = (1 / m) * np.dot(X.T,F - I_i)
grad = grad +2*reg*W
return grad
def loss_softmax(W,X,y,reg):
    m = len(X)
    Z = np.dot(X,W)
    Z_i_y_i = Z[np.arange(len(Z)),y]   # weighted sum of each sample's true class
    # cross-entropy via log-sum-exp: mean of log(sum(exp(z))) - z_y, plus the
    # regularization term (these lines complete the truncated listing)
    loss = np.mean(np.log(np.sum(np.exp(Z),axis=1)) - Z_i_y_i)
    return loss + reg*np.sum(W*W)

reg = 0.2
print(gradient_softmax(W,X,y,reg))
print(loss_softmax(W,X,y,reg))
If the target value of each sample is represented by a one-hot vector, and y is the matrix composed of the target values of multiple samples, then loss_softmax_onehot() and gradient_softmax_onehot() below calculate the loss function value and its gradient with respect to W:
def gradient_softmax_onehot(W,X,y,reg):
    m = len(X)        # number of samples
    nC = W.shape[1]   # number of categories
    #y_one_hot = np.eye(nC)[y[:,0]]
    y_one_hot = y
    Z = np.dot(X,W)
    F = softmax(Z)
    # X^T (F - Y) / m plus the regularization gradient
    # (these lines complete the truncated listing)
    return (1/m)*np.dot(X.T, F - y_one_hot) + 2*reg*W

def loss_softmax_onehot(W,X,y,reg):
    m = len(X)        # number of training examples
    nC = W.shape[1]
    #y_one_hot = np.eye(nC)[y[:,0]]
    y_one_hot = y
    Z = np.dot(X,W)
    F = softmax(Z)
    loss = -(1/m)*np.sum(y_one_hot*np.log(F))
    return loss + reg*np.sum(W*W)

reg = 0.2
print(gradient_softmax_onehot(W,X,y,reg))
print(loss_softmax_onehot(W,X,y,reg))
[[ 0.30213245 -1.75779321 1.69566076]
[ 0.5254108 -2.19194012 2.22652932]]
2.0863049636282662
def gradient_descent_softmax(w,X,y,reg,alpha,iterations):
    # the head of this listing is truncated; the loop below is reconstructed
    # to match the surviving tail and the call sites that follow
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))  # add bias feature
    w_history = []
    for i in range(iterations):
        gradient = gradient_softmax(w,X,y,reg)
        w = w - (alpha * gradient)
        #v = gamma*v+alpha* gradientz
        #w= w-v
        #losses.append(loss)
        w_history.append(w)
    return w_history
For a set of samples (X,y), the following auxiliary function calculates the model loss corresponding to
each model parameter in the history record w_history:
def compute_loss_history(w_history,X,y,reg=0.,OneHot=False):
loss_history=[]
X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))
if OneHot:
for w in w_history:
loss_history.append(loss_softmax_onehot(w,X,y,reg))
else:
for w in w_history:
loss_history.append(loss_softmax(w,X,y,reg))
return loss_history
alpha = 1e-0
iterations =200
reg = 1e-3
w = np.zeros([X.shape[1]+1,len(np.unique(y))])
w_history = gradient_descent_softmax(w,X,y,reg,alpha,iterations)
w = w_history[-1]
print("w: ",w)
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
plt.plot(loss_history, color='r')
The following function can calculate the prediction accuracy of the trained model on a batch of data (X,
y):
def getAccuracy(w,X,y):
X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))
probs = softmax(np.dot(X,w))
predicts = np.argmax(probs,axis=1)
accuracy = sum(predicts == y)/(float(len(y)))
return accuracy
Use getAccuracy() to calculate the prediction accuracy of the trained softmax model for the data set just
now.
getAccuracy(w,X_spiral,y_spiral)
0.5366666666666666
The following code plots the classification boundaries of the softmax model for this problem:
# plot the resulting classifier
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
Z = np.dot(np.c_[np.ones(xx.size),xx.ravel(), yy.ravel()], w)
Z = np.argmax(Z, axis=1)
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
#fig.savefig('spiral_linear.png')
(-1.908218802050246, 1.9517811979497575)
It can be seen that the softmax regression is still a linear function model in essence, and the dividing lines
are all straight lines on the graph. It is difficult to segment the data nonlinearly. The accuracy of the model
is 0.5366666666666666.
The following code reads the MNIST handwritten digit training set:
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
"mnist.pkl.gz")
The training set has 50,000 samples, and the validation and test sets each have 10,000 samples. Each pixel value of an image is a float in the interval [0, 1], indicating the grayscale intensity of the pixel, and the label value indicates the digit class corresponding to the image, represented by 0, 1, 2, ..., 9.
Next, output a few of the pixel values (data features) of one sample:
print(train_X.shape)
print(train_X[9][200:250])
(50000, 784)
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0.75 0.984375
0.73046875 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.2421875 0.72265625 0.0703125 0. 0. 0.
0. 0.34765625 0.921875 0.84765625 0.18359375 0.
0. 0. ]
alpha =1e-2
iterations =1000
reg = 1e-3
w_history=[]
w = np.zeros([train_X.shape[1]+1,len(np.unique(train_y))])
batch = 10000   # batch size; assumed so that 5 batches cover the 50,000 training samples
for i in range(5):
    s = i*batch
    X = train_X[s :s+batch,:]
    y = train_y[s :s+batch]
w_history_batch = gradient_descent_softmax(w,X,y,reg,alpha,iterations)
w = w_history_batch[-1]
w_history.extend(w_history_batch)
print("w: ",w)
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
Calculate the accuracy of the model function on the training set, validation set, and test set respectively:
print("Accuracy on the training set:",getAccuracy(w,train_X,train_y))
print("Validation set accuracy:",getAccuracy(w,valid_X,valid_y))
print("Accuracy on test set:",getAccuracy(w,test_X,test_y))
Iterative loss learning curves for training and validation sets can be plotted:
loss_history_valid =
compute_loss_history(w_history,valid_X[0:1000,:],valid_y[0:1000],reg)
plt.plot(loss_history, color='r')
plt.plot(loss_history_valid, color='b')
plt.ylim(0,5)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.title('iterative learning curve')
plt.legend(['train', 'valid'])
plt.ylim(-0.2,3)
plt.show()
1. Rearrange the order of the samples in the original training set, that is, shuffle the samples.
2. For the rearranged training set, starting from the beginning, take a small batch of samples in turn, use this batch to compute the gradient of the model's loss, and update the model parameters.
3. Steps 1) and 2) together complete one traversal of (almost) all the samples in the training set, updating the model parameters with a different small batch of samples at each step; the process of 1) and 2) is therefore called an epoch, and step 3) means executing multiple epochs.
Shuffle the order of a list using the numpy.random.shuffle() function, for example:
m=5
indices = list(range(m))
print(indices)
np.random.shuffle(indices)
print(indices)
[0, 1, 2, 3, 4]
[2, 1, 4, 3, 0]
Corresponding to a data set (X, y), an iterator function data_iter() can be defined to shuffle the order of the
original data set and return a small batch of training samples of batchsize size from the data set each time:
def data_iter(X,y,batch_size,shuffle=False):
m = len(X)
indices = list(range(m))
if shuffle: # shuffle is True to shuffle the order
np.random.shuffle(indices)
for i in range(0, m - batch_size + 1, batch_size):
batch_indices = np.array(indices[i: min(i + batch_size, m)])
yield X.take(batch_indices,axis=0), y.take(batch_indices,axis=0)
The following is the code implementation of the batch gradient descent method:
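The full listing is not shown here; a minimal sketch of batch_gradient_descent_softmax() consistent with how it is called below (the momentum update with gamma is an assumption, based on the commented-out lines in gradient_descent_softmax()):

def batch_gradient_descent_softmax(w,X,y,epochs,batch_size,shuffle,reg,alpha,gamma):
    X = np.hstack((np.ones((X.shape[0], 1), dtype=X.dtype),X))  # add bias feature
    w_history = []
    v = np.zeros_like(w)                  # momentum accumulator
    for epoch in range(epochs):
        for X_b, y_b in data_iter(X, y, batch_size, shuffle):
            gradient = gradient_softmax(w, X_b, y_b, reg)
            v = gamma*v + alpha*gradient  # momentum update
            w = w - v
            w_history.append(w)
    return w_history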
On the Mnist handwritten digit recognition training set, perform this batch gradient descent method:
import matplotlib.pyplot as plt
%matplotlib inline
batchsize = 50
epochs = 5
shuffle = True
alpha = 0.01
reg = 1e-3
gamma = 0.8
X,y = train_X,train_y
w = np.zeros([X.shape[1]+1,len(np.unique(y))])
w_history = batch_gradient_descent_softmax(w,train_X,train_y,epochs,batchsize,
shuffle,reg,alpha,gamma)
w = w_history[-1]
print("w: ",w)
X,y = train_X[0:1000,:],train_y[0:1000]
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
The batch gradient descent algorithm can be used for training with a small batch of samples, which
improves the speed of the algorithm without reducing the accuracy of the model. The accuracy of the
model on different sample sets is output below.
print("Accuracy on the training set:",getAccuracy(w,train_X,train_y))
print("Validation set accuracy:",getAccuracy(w,valid_X,valid_y))
print("Accuracy on test set:",getAccuracy(w,test_X,test_y))
plt.plot(loss_history, color='r')
plt.plot(loss_history_valid, color='b')
plt.ylim(0,5)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.title('iterative learning curve')
plt.legend(['train', 'valid'])
plt.ylim(-0.2,3)
plt.show()
Figure 3-46 Training loss curve
The model parameter matrix $W$ is an $n \times C$ matrix, each column of which corresponds to a classifier similar to logistic regression; the weights in a column extract, from the data, the features relevant for judging that column's class. For the model parameters $W$ of MNIST image classification, the 784 weight parameters of a given column (corresponding to a given class) can be displayed as an image. For example, the following code displays the weight parameters of column 0 (the class corresponding to the digit 0), converting this column vector into an image matrix of size 28 × 28:
c = 0
plt.imshow(w[1:,c].reshape((28,28)))
Convert the value represented by the byte in [0,255] to a value in the [0,1] interval.
train_X = X_train.astype('float32')/255.0
test_X = X_test.astype('float32')/255.0
print(train_X.shape,y_train.shape)
print(test_X.shape,y_test.shape)
print(test_X.dtype,y_test.dtype)
print(np.mean(train_X[0:1000,:]))
print(np.mean(test_X[0:1000,:]))
train_y = y_train
Start training:
import matplotlib.pyplot as plt
%matplotlib inline
batchsize = 50
epochs = 5
shuffle = True
alpha = 0.01
reg = 1e-3
gamma = 0.8
w = np.zeros([train_X.shape[1]+1,len(np.unique(train_y))])
w_history = batch_gradient_descent_softmax(w,train_X,train_y,epochs,batchsize,
shuffle,reg,alpha,gamma)
w = w_history[-1]
print("w: ",w)
X,y = train_X[0:1000,:],train_y[0:1000]
loss_history = compute_loss_history(w_history,X,y,reg)
print(loss_history[:-1:len(loss_history)//10])
plt.plot(loss_history, color='r')
plt.plot(loss_history_valid, color='b')
plt.ylim(0,5)
plt.xlabel('iterations')
plt.ylabel('loss')
plt.title('iterative learning curve')
plt.legend(['train', 'valid'])
plt.ylim(-0.2,3)
plt.show()
Summary
Linear regression, logistic regression, and softmax regression are essentially linear classifiers. Linear regression linearly weights the input, and its regression function is a linear function (a straight line). Logistic regression is essentially the same as linear regression, except that the value of this linear weighted sum is compressed from the interval $(-\infty, \infty)$ to the interval $(0, 1)$ so that the output value has a probabilistic meaning; its decision curve for binary classification is still a straight line. Softmax regression simply regards multiple linear weighted sums as the scores of the respective classes and then converts these values into the probabilities that a sample belongs to the multiple categories; its decision boundaries for multi-classification problems are likewise nothing more than several straight lines.
Chapter 4 Neural Networks
1. Perceptron
The perceptron is a simple function with binary classification capability. The logistic regression function uses the sigmoid function to convert the input's linear weighted sum into a probability $\sigma(xw)$, while the perceptron passes the weighted sum through a threshold function:

$$\mathrm{sign}_b(z) = \begin{cases} 1 & \text{if } z \ge b \\ 0 & \text{else} \end{cases}$$
The perceptron calculates the weighted sum x w of the input x through the
weight vector w , and then outputs a value of 1 or 0 according to whether the
weighted sum exceeds a certain threshold b. For example, a perceptron with 3
input values is calculated as:
$$f_{w,b}(x) = \mathrm{sign}_b\left(\sum_{j=1}^{3} w_j x_j\right)$$
Figure 4-1 Perceptron accepts 3 input values and produces an output value f
A neuron usually has multiple dendrites, which mainly receive incoming information, but only one axon; at the end of the axon are many axon terminals that can transmit information to other neurons. An axon terminal connects with a dendrite of another neuron to transmit signals, and the location of this connection is called a "synapse" in biology.
Each neuron accepts multiple input signals, and the weight of each input
signal to the neuron is also different. If the weighted sum of all input signals
exceeds the "threshold" inside the neuron, an output signal will be generated.
Use $f_w(x)$ to represent the perceptron function, that is, $f_w(x) = \mathrm{sign}_b(xw)$, where $x$ is the input and $w$ is the weight vector, that is:
$$f_w(x) = \begin{cases} 1 & \text{if } \sum_j w_j x_j \ge b \\ 0 & \text{else} \end{cases}$$

Moving the threshold to the left side of the comparison:

$$f_w(x) = \begin{cases} 1 & \text{if } \sum_j w_j x_j - b \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Since $b$ can be positive or negative, using $b$ to represent $-b$, the above formula can be written as:

$$f_w(x) = \begin{cases} 1 & \text{if } \sum_j w_j x_j + b \ge 0 \\ 0 & \text{else} \end{cases}$$
The perceptron can directly express the most basic logical functions: "AND", "OR", and "NAND". Figure 4-3 shows the functions of the "AND", "NAND", "OR", and "XOR" gates of a logic circuit.

Figure 4-3 The functions of the "AND", "NAND", "OR", and "XOR" gates of a logic circuit
Many parameter triples $(w_1, w_2, b)$ can produce the AND gate function, such as: $(0.5, 0.5, -0.6)$, $(0.5, 0.5, -0.9)$, $(1, 1, -1)$, $(1, 1, -1.5)$, etc. For example, for the perceptron with parameters $(0.5, 0.5, -0.6)$, inputting $(1, 1)$ gives the weighted sum $0.5 + 0.5 - 0.6 = 0.4 \ge 0$, so the output value is 1; that is, the perceptron implements the "AND" gate function of the logic circuit. Similarly, many parameters generate the "NAND" gate function, such as: $(-0.5, -0.5, 0.6)$, $(-0.5, -0.5, 0.9)$, $(-1, -1, 1)$, $(-1, -1, 1.5)$, etc., and many generate the "OR" function, such as: $(1, 1, 0)$, $(1, 1, -0.5)$, $(0.5, 0.5, -0.3)$. A quick check is sketched below.
For the perceptron, different weights $w_j$ and bias $b$ generate different specific functions (such as the logical "AND", "OR", and "NAND" functions). Like the logistic regression function, a single perceptron still essentially represents a linearly separable function. For example, the perceptron represented by the parameters $(1, 1, -0.5)$ divides the plane into 2 half-spaces separated by the straight line $-0.5 + x_1 + x_2 = 0$. A single perceptron therefore cannot represent a function such as XOR, whose two classes cannot be separated by a single straight line; only a nonlinear curve, as shown in the figure below, can produce such a division.

Figure 4-5 Only a nonlinear curve division can represent the XOR gate function
More complex functions can be composed from simple "AND", "OR", and "NAND" perceptrons; as shown in Figure 4-7, the XOR gate can be expressed as such a combination:

Figure 4-7 The "exclusive or" (XOR) function represented by a combination of three perceptrons implementing the "AND", "OR", and "NAND" functions
2. Neurons
A neuron is a function (or vector-valued function) that applies a linear or nonlinear transformation to the linear weighted sum of multiple input values to produce one or more output values. That is, a neuron accepts multiple input values, computes their weighted sum, and then produces its output value(s) through a linear or nonlinear function. The function that transforms the weighted sum in a neuron is called an activation function.
$$a = g\left(\sum_j W_j x_j\right)$$
Figure 4-9 Artificial neuron that weights the input and generates output
through the activation function
Sometimes the bias of the neuron is also expressed, that is, the neuron
function is written as follows:
$$a = g\left(\sum_j W_j x_j + b\right)$$
Figure 4-10 The artificial neuron that generates the output through the
activation function after weighting and biasing the input
Neurons are also often represented by the more simplified circles in Figure 4-
11 below:
This indicates that the neuron accepts 3 inputs $x_1, x_2, x_3$ (plus a fixed input feature 1 corresponding to the bias). The step function and its derivative (zero almost everywhere) can be computed as:
def sign(x):
    return np.array(x > 0, dtype=int)   # use the builtin int; np.int is deprecated
def grad_sign(x):
    return np.zeros_like(x)
The graph of the step function sign(x) can be drawn with the following code:
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
x = np.arange(-5.0,5.0, 0.1)
plt.ylim(-0.1, 1.1)   # specify the range of the y-axis
plt.plot(x, sign(x),label="sign")
plt.plot(x, grad_sign(x),label="derivative")
plt.legend(loc="upper right", frameon=False)
plt.show()
2. Tanh function
tanh function:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Its derivative is:

$$\tanh'(x) = 1 - \frac{(e^x - e^{-x})^2}{(e^x + e^{-x})^2} = 1 - \tanh^2(x)$$
Numpy provides the calculation function tanh() for calculating tanh. The
following code calculates tanh'(x) and draws the function curve of tanh(x) and
tanh'(x):
def grad_tanh(x):
a = np.tanh(x)
return 1 - a**2
Figure 4-13 Graph of tanh(x) function and its derivative function tanh'(x)
4. ReLU function
The ReLU function f(x) outputs x directly when x is greater than 0, otherwise,
outputs 0:
$$\mathrm{Relu}(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}$$

Its derivative:

$$\mathrm{Relu}'(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x \le 0) \end{cases}$$
The following code calculates Relu(x), Relu'(x), and draws the function curve
of Relu(x), Relu'(x):
def relu(x):
return np.maximum(0, x)
def grad_relu(x):
return 1. * (x > 0)
Figure 4-14 The graph of the Relu(x) function and its derivative function Relu'(x)
There are also some variants of the Relu function, such as the LeakRelu function:

$$\mathrm{LeakRelu}(x) = \begin{cases} x & (x > 0) \\ kx & (x \le 0) \end{cases}$$

Its derivative:

$$\mathrm{LeakRelu}'(x) = \begin{cases} 1 & (x > 0) \\ k & (x \le 0) \end{cases}$$
The following code calculates LeakRelu(x) and LeakRelu'(x) and draws their function curves:

import numpy as np
def leakRelu(x,k=0.2):
    y = np.copy( x )
    y[ y < 0 ] *= k
    return y
def grad_leakRelu(x,k=0.2):
    grad = np.ones_like(x, dtype=float)
    grad[x < 0] = k
    return grad
Figure 4-15 The graph of the LeakRelu(x) function and its derivative function LeakRelu'(x)
The softmax function $f = \mathrm{softmax}(z)$ with 3 input values and 3 output values $(f_1, f_2, f_3)$ can be regarded as 3 neurons:

$$f_1 = \frac{e^{z_1}}{e^{z_1}+e^{z_2}+e^{z_3}}, \qquad f_2 = \frac{e^{z_2}}{e^{z_1}+e^{z_2}+e^{z_3}}, \qquad f_3 = \frac{e^{z_3}}{e^{z_1}+e^{z_2}+e^{z_3}}$$
These 3 neurons are different from general neurons: they do not compute a weighted sum of the input vector $z = (z_1, z_2, z_3)$ but directly transform it into the output probabilities. In a network, the weighted sums $z = (z_1, z_2, z_3)$ output by the previous layer's neurons are input to the 3 neurons of the softmax function to get the final output $f$.
In the above neural network, the left column of circles represents the multiple features of an input. For a two-dimensional plane coordinate point, a sample has only 2 features, namely its horizontal and vertical coordinate values $x = (x_1, x_2)$. The middle column of neurons computes the weighted sums $z_i$, which are directly output to the rightmost column, the 3 neurons of the softmax function; each produces an output value $f_i$, and together they constitute the final output.
Data is input from the left and output from the right: the output of the neurons in the left column is the input of the neurons on the right; there are no connections between neurons in the same column, and neurons on the right do not feed back to neurons on the left. That is, this is a calculation process in which data advances in the "left to right" direction without going back. Such a neural network is called a feedforward neural network.
All the neurons in one column of a feedforward neural network are called a layer of the network. The features of the input data are usually called the input layer, the last column of neurons that produces the final output is called the output layer, and each column of neurons in the middle is called a hidden layer. In a feedforward neural network, the data flows layer by layer from the input layer through each hidden layer in turn, and the output layer finally produces the output value of the neural network function. Some books count the input layer when counting the layers of a neural network, so the above network is called a 3-layer neural network; other books do not count the input layer and call the above network a 2-layer neural network.
The neurons of the above neural network are a bit special: the neurons of the softmax output layer do not compute a weighted sum of the previous layer's outputs, and the hidden layer outputs its weighted sums directly without a nonlinear activation function; that is, the hidden-layer neurons are linear regression functions.
The neurons in the hidden layers of a general neural network are similar to logistic regression: the weighted sum is computed first and then transformed by a nonlinear activation function. The output layer neurons can be softmax neurons, linear regression neurons, or logistic regression neurons. Because the neurons of the softmax layer have no parameters such as weights but apply the fixed softmax() function, when designing a neural network for multi-classification problems the softmax() function need not be treated as a separate layer; instead, the layer preceding the softmax can directly be used as the output layer. That is, softmax regression can be represented by the following neural network (as shown in Figure 4-19):
Figure 4-19 Softmax regression without the softmax layer, regarded as a neural network composed of only 3 weighted-sum neurons
That is, there are only the input layer and the output layer, and there is no
hidden layer. The output value of the output layer represents the score that the
input data belongs to each category. This score can then pass through
softmax() to output the probability that the data belongs to each category.
That is, the softmax function for calculating the classification probability can
be omitted in the figure.
An actual neural network contains at least one hidden layer and usually contains many. With the development of hardware computing power, represented by GPUs, and the availability of large-scale data, modern neural networks usually contain many hidden layers. The number of layers of a neural network is called its depth; the depth of a modern neural network can be very large, even reaching hundreds of hidden layers. Deeper neural networks are called deep neural networks, and machine learning based on deep neural networks is called deep learning.
Figure 4-20 2-layer neural network, the leftmost column is the data, the
middle column of neurons constitutes the hidden layer, and the rightmost
column of neurons constitutes the output layer
Both the hidden layer and the output layer are neurons similar to the logistic
regression function, that is, each neuron uses its own weight vector to
calculate a weighted sum z, and then generates an output value a through its
own activation function.
This 2-layer neural network with only one hidden layer defines a function $f: \mathbb{R}^D \to \mathbb{R}^K$, where D is the size of the input vector $x$ and K is the size of the output vector. Use $z_i^{[l]}$ to represent the weighted sum of the i-th neuron in the l-th layer, and $a_i^{[l]}$ to represent the activation (output) value of the i-th neuron in the l-th layer. When l = 0, $a_i^{(0)}$ is the i-th input feature.
i
Assuming that the activation function of all neurons in the first layer is the
same function g , each neuron accepts an input x = (x , x ) and produces
[1]
1 2
an output value, these outputs The values (activation values) also form a
[1] [1] [1] [1]
vector a = (a , a , a , a ):
[1]
1 2 3 4
That is, each column of $W^{[1]}$ and $b^{[1]}$ holds the weights or bias of one neuron, and the calculation of the first layer of neurons can be expressed as:

$$a^{[1]} = g^{[1]}(x W^{[1]} + b^{[1]})$$

where $x$ is $1 \times 2$, $W^{[1]}$ is $2 \times 4$, and $a^{[1]}$ and $b^{[1]}$ are $1 \times 4$.
Assume the activation function of all neurons in the 2nd layer (the output layer) is the same function $g^{[2]}$; they accept the output values of the hidden layer and compute:

$$a_1^{[2]} = g^{[2]}(a_1^{[1]}W_{11}^{[2]} + a_2^{[1]}W_{21}^{[2]} + a_3^{[1]}W_{31}^{[2]} + a_4^{[1]}W_{41}^{[2]} + b_1^{[2]})$$

$$a_2^{[2]} = g^{[2]}(a_1^{[1]}W_{12}^{[2]} + a_2^{[1]}W_{22}^{[2]} + a_3^{[1]}W_{32}^{[2]} + a_4^{[1]}W_{42}^{[2]} + b_2^{[2]})$$

$$a_3^{[2]} = g^{[2]}(a_1^{[1]}W_{13}^{[2]} + a_2^{[1]}W_{23}^{[2]} + a_3^{[1]}W_{33}^{[2]} + a_4^{[1]}W_{43}^{[2]} + b_3^{[2]})$$

that is, $a^{[2]} = g^{[2]}(a^{[1]}W^{[2]} + b^{[2]})$, where $a^{[2]}$ is $1 \times 3$, $a^{[1]}$ is $1 \times 4$, $W^{[2]}$ is $4 \times 3$, and $b^{[2]}$ is $1 \times 3$.
Use $a^{(0)}$ to represent the input data $x$; then the weighted sum of the first layer can be written as follows:

$$z^{[1]} = a^{(0)} W^{[1]} + b^{[1]}$$

It can be seen that the calculations of the weighted sums and activation values of the first and second layers are completely analogous. For a general neural network, the weighted sum and activation value of layer $l$ are computed as:

$$z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]}, \qquad a^{[l]} = g^{[l]}(z^{[l]})$$

That is, layer $l$ accepts the input $a^{[l-1]}$ from layer $l-1$, computes the weighted sum $z^{[l]}$, and applies its activation function to obtain $a^{[l]} = g^{[l]}(z^{[l]})$.
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
g1 = sigmoid
g2 = sigmoid
# x and W1, b1
x = np.array([1.0, 0.5])              # input x: 1x2 row vector
W1 = np.array([[0.1, 0.3,0.5,0.2],
               [0.4,0.6,0.7,0.1]])    # W1: 2x4 matrix
b1 = np.array([0.1, 0.2, 0.3,0.4])    # bias b1: 1x4 row vector
print("x.shape",x.shape)     # (2,)
print("W1.shape",W1.shape)   # (2, 4)
print("b1.shape",b1.shape)   # (4,)
z1 = np.dot(x,W1)+b1         # weighted sum of layer 1
a1 = g1(z1)                  # activation values of layer 1
print("z1",z1)
print("a1",a1)
# a1, W2, b2
W2 = np.array([[0.1, 1.4,0.2],[2.5, 0.6, 0.3],
               [1.1,0.7,0.8],[0.3,1.5,2.1]])   # W2: 4x3 matrix
b2 = np.array([0.1, 2,0.3])
print("a1.shape",a1.shape)   # (4,)
print("W2.shape",W2.shape)   # (4, 3)
print("b2.shape",b2.shape)   # (3,)
z2 = np.dot(a1,W2)+b2        # weighted sum of layer 2
a2 = g2(z2)                  # activation values of layer 2 (the output)
print("z2",z2)
print("a2",a2)

x.shape (2,)
W1.shape (2, 4)
b1.shape (4,)
z1 [0.4  0.8  1.15 0.65]
a1 [0.59868766 0.68997448 0.75951092 0.65701046]
a1.shape (4,)
W2.shape (4, 3)
b2.shape (3,)
z2 [2.91737012 4.76932075 2.61406058]
a2 [0.94869845 0.99158527 0.93176103]
4.1.4 Forward calculation of multiple samples

The data features of multiple samples, say m samples $x^{(i)}$, can form a matrix $X$:

$$X = \begin{bmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{bmatrix}$$

Let $z^{[i](l)}$ and $a^{[i](l)}$ denote the weighted sum and activation values of the l-th layer for the i-th sample; they form the i-th rows of the matrices $Z^{[l]}$ and $A^{[l]}$:

$$Z^{[l]} = \begin{bmatrix} z^{[1](l)} \\ z^{[2](l)} \\ \vdots \\ z^{[m](l)} \end{bmatrix}, \qquad A^{[l]} = \begin{bmatrix} a^{[1](l)} \\ a^{[2](l)} \\ \vdots \\ a^{[m](l)} \end{bmatrix}$$

The forward calculation of multiple samples can be written in vector (matrix) form as follows:

$$Z^{[l]} = \begin{bmatrix} a^{[1](l-1)} W^{[l]} + b^{[l]} \\ a^{[2](l-1)} W^{[l]} + b^{[l]} \\ \vdots \\ a^{[m](l-1)} W^{[l]} + b^{[l]} \end{bmatrix}, \qquad A^{[l]} = \begin{bmatrix} g^{[l]}(z^{[1](l)}) \\ g^{[l]}(z^{[2](l)}) \\ \vdots \\ g^{[l]}(z^{[m](l)}) \end{bmatrix}$$

Because numpy arrays have a broadcast function, the above formula can be simplified; therefore, for a general layer l, the vectorized formula for the forward computation is:

$$Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}, \qquad A^{[l]} = g^{[l]}(Z^{[l]})$$

Similarly, you can write the python code for the forward calculation in vector (matrix) form for multiple samples:
X = np.array([[1.0, 2.],[3.0,4.0]])
W1 = np.array([[0.1, 0.3,0.5,0.2],
               [0.4,0.6,0.7,0.1]])    # W1: 2x4 matrix
b1 = np.array([0.1, 0.2, 0.3,0.4])    # bias b1: 1x4 row vector
print("X.shape",X.shape)     # (2, 2)
print("W1.shape",W1.shape)   # (2, 4)
print("b1.shape",b1.shape)   # (4,)
Z1 = np.dot(X,W1)+b1         # broadcasting adds b1 to every row
A1 = g1(Z1)
print("Z1:",Z1)
print("A1:",A1)
print("A1.shape",A1.shape)   # (2, 4)
print("W2.shape",W2.shape)   # (4, 3)
print("b2.shape",b2.shape)   # (3,)
Z2 = np.dot(A1,W2)+b2
A2 = g2(Z2)
print("Z2:",Z2)
print("A2:",A2)

X.shape (2, 2)
W1.shape (2, 4)
b1.shape (4,)
Z1: [[1.  1.7 2.2 0.8]
 [2.  3.5 4.6 1.4]]
A1: [[0.73105858 0.84553473 0.90024951 0.68997448]
 [0.88079708 0.97068777 0.9900482  0.80218389]]
A1.shape (2, 4)
W2.shape (4, 3)
b2.shape (3,)
Z2: [[3.4842095  5.19593923 2.86901816]
 [3.94450732 5.71183814 3.24399047]]
A2: [[0.97023513 0.9944915  0.94629347]
 [0.98100697 0.99670431 0.96245657]]
If instead each sample $x$ is treated as a column vector, the first layer's weighted sum is written as:

$$\begin{bmatrix} z_1^{[1]} \\ \vdots \\ z_4^{[1]} \end{bmatrix} = \begin{bmatrix} W_{11}^{[1]} & W_{12}^{[1]} \\ \vdots & \vdots \\ W_{41}^{[1]} & W_{42}^{[1]} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_1^{[1]} \\ \vdots \\ b_4^{[1]} \end{bmatrix}$$

Because each $x^{(i)}$ is then a column vector, the m samples $x^{(i)}$ form a matrix $X$ whose columns are the samples.
4.1.5 Output
When the neural network is used to solve regression problems, similar to linear regression, the output is any real number on the real number axis, and there can be one output or several. For example, in target localization it is necessary to output the position of the target, such as its vertical and horizontal coordinates in the sample image; for the detection of face landmarks, as shown in Figure 4-21, it is necessary to output the coordinates of a number of facial feature points in the face image. For these problems, the output values can be arbitrary real numbers.
When solving binary classification problems, similar to logistic regression, the output is the probability that the sample belongs to one of the two categories; the $\sigma$ function compresses a real number from the interval $(-\infty, \infty)$ to the interval [0, 1] so that it represents a probability. When solving multi-classification problems, the output is the probability that the sample belongs to each category; the softmax function compresses the same number of real numbers to the [0, 1] interval so that they represent probabilities.
For classification problems, even if the real numbers on the real number axis are not compressed to the [0, 1] probability interval, the sample category can still be judged from the relative sizes of the real values. For example, for a 3-category problem, if the outputs are the 3 real numbers 219, 18, and 564, then the sample belongs to the category with the largest value, the 3rd category, rather than the 1st or 2nd. The purpose of converting arbitrary real numbers into probabilities is to define the cross-entropy loss of binary or multi-classification in a probabilistic sense.
Therefore, for classification problems, the output of the neural network can be a real number belonging to the real
number axis, indicating which class the sample belongs to or the score of multiple different classes. It can also be
the probability of converting the score through the σ and softmax functions.
When designing the neural network structure, if a neuron containing the $\sigma$ function or softmax function is used as the final output layer, the output probabilities can directly enter the cross-entropy loss against the target values. If the network layer that outputs the scores is used as the output layer, then whether for classification or regression the output is an arbitrary real number; for classification problems this score output is further transformed by the $\sigma$ or softmax function to compute the cross-entropy loss against the target values.
Regardless of whether the $\sigma$ function or the softmax function is used as the output layer, the neurons in the other layers of the network are similar to the logistic regression function: each neuron accepts the input of the previous layer $a^{[l-1]}$, uses its weight vector to compute the weighted sum $z^{[l]} = a^{[l-1]} W^{[l]}$, and then generates an output $a^{[l]} = g^{[l]}(z^{[l]})$ through an activation function $g^{[l]}$. If the layer that outputs the scores is used as the output layer, the activation function $g^{[L]}$ of this layer is usually the identity function; otherwise, the output layer is a special neuron such as the $\sigma$ function or the softmax function. To avoid distinguishing between such special neurons and general logistic-regression-like neurons, this book uses the network layer that outputs the scores as the output layer.
If the predicted value and true value of a sample are $f^{(i)}$ and $y^{(i)}$ respectively, the error of the sample is $L(f^{(i)}, y^{(i)})$. When training or validating a neural network model, the overall average error is usually calculated for a set of samples; the error of m samples can be taken as the average of their errors, that is:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L(f^{(i)}, y^{(i)})$$

For a given sample, its error (loss) $L(f^{(i)}, y^{(i)})$ varies with the model parameters, because different model parameters produce different predicted values $f^{(i)}$, and likewise for the average error $L(f, y)$ of multiple samples. That is, the error can be regarded as a function of the predicted values and hence of the model parameters, called the loss function. By minimizing the training loss, the model parameters corresponding to the minimum of the loss function can be found.
The commonly used loss functions in neural networks are the three types in Chapter 3: mean square error loss,
binary cross-entropy loss, and multi-class cross-entropy loss.
$$L(F, Y) = \frac{1}{m}\sum_{i=1}^{m} \left\|f^{(i)} - y^{(i)}\right\|_2^2$$

Multiplying by a constant does not change the extreme point of this loss function; sometimes, to make the derived gradient look cleaner, the mean square error loss is divided by 2, namely:

$$L(F, Y) = \frac{1}{2m}\sum_{i=1}^{m} \left\|f^{(i)} - y^{(i)}\right\|_2^2$$
For a sample $(f^{(i)}, y^{(i)})$, the calculation code of $\frac{1}{2}\left\|f^{(i)} - y^{(i)}\right\|_2^2$ is as follows:
import numpy as np
f = np.array([0.1, 0.2,0.5])
y = np.array([0.3, 0.4,0.2])
loss = np.sum((f - y) ** 2)/2
print(loss)
0.084999999999999999
The average loss of m samples:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}), \qquad L(F, Y) = \frac{1}{2m}\left\|F - Y\right\|_2^2$$

can be computed as:

m = len(F)
loss = np.sum((F - Y) ** 2)/(2*m)
# loss = (np.square(F-Y)).mean()
print(loss)

0.08499999999999999

mse_loss(F,Y,True)

0.08499999999999999
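The mse_loss() called above is not listed; a minimal sketch matching the printed values (the flag name is an assumption):

def mse_loss(F, Y, halve=False):
    # mean squared error over m samples; halve=True divides by 2m instead of m
    m = len(F)
    loss = np.sum((F - Y) ** 2) / m
    if halve:
        loss /= 2
    return loss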
The mean square error is used for regression problems; for classification problems, the cross-entropy loss is generally used. The binary classification cross-entropy loss of m samples is:

$$L(f, y) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f^{(i)}) + (1 - y^{(i)})\log(1 - f^{(i)})\right]$$

where $y^{(i)}$ has the value 1 or 0, indicating the category to which the sample belongs, and $f^{(i)}$ indicates the probability that the sample belongs to the category with value 1. The binary classification cross-entropy loss can be calculated with the following code:

- (1./m)*np.sum(np.multiply(y,np.log(f)) + np.multiply((1 - y), np.log(1 - f)))

For example:

#https://towardsdatascience.com/neural-net-from-scratch-using-numpy-71a31f6e3675
f = np.array([0.1, 0.2,0.5])   # the probabilities of category 1 for three samples
y = np.array([0, 1, 0])        # the classifications of the 3 samples
m = y.shape[0]

0.8026485362172906

To prevent f or 1-f from having the value 0 and making the log() function abnormal, a small quantity $\epsilon$ can be added inside the logarithm, giving a binary cross-entropy loss function whose value here is:

0.8026485091802541
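A minimal sketch of such an epsilon-protected loss function (the function name and the value eps = 1e-8 are assumptions, chosen to be consistent with the printed value above):

def binary_cross_entropy(f, y, eps=1e-8):
    # eps keeps the logarithms away from log(0)
    m = len(y)
    return -(1./m)*np.sum(y*np.log(f+eps) + (1-y)*np.log(1-f+eps))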
For multi-classification, $y^{(i)}$ represents the sample target value. According to the softmax regression of Chapter 3, the cross-entropy loss of multiple samples is:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{c=1}^{C} y_c^{(i)}\log(f_c^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m} y^{(i)} \cdot \log(f^{(i)})$$
If the true classification of a sample is the third category and the predicted value indicates the probabilities that the sample belongs to the three categories, then the cross-entropy loss of this sample depends only on the term corresponding to the true classification.
Therefore, if the target values of all samples are one-hot vectors, then for m samples $L(f, y)$ can be written as a vectorized Hadamard product:

$$L(f, y) = -\frac{1}{m}\,\mathrm{sum}(y \odot \log(f))$$
For example, for 2 samples with m = 2, the output F of the softmax and the target value (one-hot vector) matrix Y of the samples are as follows:

$$F = \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.4 & 0.3 & 0.3 \end{bmatrix}, \qquad Y = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
F = np.array([[0.2,0.5,0.3],[0.4,0.3,0.3]])
Y = np.array([[0,0,1],[1,0,0]])
cross_entropy_loss_onehot(F,Y)
1.0601317681000455
If the target value of each sample is not represented by a one-hot vector but by an integer indicating which category the sample belongs to: for C-classification problems, the integer values $0, 1, 2, \cdots, C-1$ indicate the class, e.g. the integer 2 indicates that the sample belongs to the third category. In this case the cross-entropy loss of the sample is the negative log of the corresponding component of $f^{(i)}$ (that is, the component $f^{(i)}_2$ with subscript 2), namely $-\log f^{(i)}_2$.

For target classifications represented by integers, the multi-class cross-entropy loss is:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m} \log\left(f^{(i)}_{y^{(i)}}\right)$$

where $y^{(i)}$ is the integer value (subscript) of the category to which the i-th sample belongs.
def cross_entropy_loss(F,Y,onehot=False):
    m = len(F)   # F.shape[0], the number of samples
    if onehot:
        return -(1./m)*np.sum(np.multiply(Y, np.log(F)))
    else:
        # the log value of the component of F[i] corresponding to category Y[i]
        return -(1./m)*np.sum(np.log(F[range(m),Y]))
In the following code, F is the output for 2 samples; the output vector of each sample has 3 components, indicating the probabilities that the sample belongs to the 3 categories, and the i-th component of the target Y is the subscript value (such as 0, 1, 2) of the category of the i-th sample.
cross_entropy_loss(F,Y)
1.0601317681000455
[[0. 0. 1.]
 [1. 0. 0.]]

cross_entropy_loss_onehot(F,one_hot_y)

1.0601317681000455
Of course, to prevent overfitting, a regularization term can be added on the basis of the above loss, imposing a larger penalty on the model parameters with large absolute values, to prevent the absolute values of the model parameters from being too large:

$$L(f, y) = \frac{1}{m}\sum_{i=1}^{m} L_i(y^{(i)}, f^{(i)}) + \lambda\sum_{l=1}^{L}\left\|W^{[l]}\right\|_2^2$$
Like the regression problem, a neural network function with a certain structure is completely determined by the
parameters of the neurons (weight parameters and bias parameters), and different parameters represent a different
specific neural network function.
For a set of samples, we hope to find the neural network parameters that best fit them, that is, to determine the neural network function that best reflects the relationship between the sample features and the target values. The process of finding the best neural network parameters is the same as training any machine learning model: find the model parameters that minimize a certain loss. Specifically, the model parameters of the neural network are determined by solving the minimization problem of the loss function; this process is called neural network training.
The training of the neural network is the same as the training of the regression model, which uses the gradient
descent method to iteratively update the model parameters until the algorithm converges enough or reaches the
maximum number of iterations. The gradient descent method needs to calculate the partial derivative of the loss
function with respect to the model parameters. The neural network usually contains many layers, each layer has
many neurons, and each neuron contains many model parameters. Calculating the partial derivative is more
complicated than the regression problem.
The forward calculation of the neural network and the calculation of the loss function are discussed above, and the
numerical gradient can be used in the gradient descent algorithm to approximate the analytical gradient. Next,
implement a complete neural network training and prediction algorithm for the previous 2-layer neural network.
https://www.bogotobogo.com/python/scikit-learn/Artificial-Neural-Network-ANN-5-Checking-Gradient.php
In linear regression, the model parameters are usually initialized to 0, but if the weight parameters of a neural network model are initialized to 0, all neurons in a layer will evolve identically, that is, their parameters will remain the same, so a layer with multiple neurons becomes equivalent to a single neuron; this greatly degrades the expressive ability of the neural network and makes it difficult to obtain a satisfactory model. Therefore, the weights are generally initialized randomly, and researchers have provided a variety of methods for initializing the weights of neural networks.
Usually, the biases of a neural network are initialized to 0, and the weight parameters are randomly sampled from a distribution such as a Gaussian distribution. If the number of neurons in layer $l$ is $n^{(l)}$, and the number of output values of the previous layer is $n^{(l-1)}$, then $W^{[l]}$ is an $n^{(l-1)} \times n^{(l)}$ matrix, which can be initialized with:

Wl = np.random.randn(n_l_1,n_l)* 0.01

That is, random values from a standard normal distribution are multiplied by 0.01.
Assuming that the number of input features of the above two-layer neural network is n_x, and the number of
neurons in the middle layer and output layer are n_h and n_o respectively, the function initialize_parameters()
completes the initialization of all model parameters and returns a dictionary object.
import numpy as np
def initialize_parameters(n_x, n_h, n_o):
    W1 = np.random.randn(n_x,n_h)* 0.01
    b1 = np.zeros((1,n_h))
    W2 = np.random.randn(n_h,n_o) * 0.01
    b2 = np.zeros((1,n_o))
    parameters = [W1,b1,W2,b2]
    return parameters
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_propagation(X, parameters):
    # the middle of this listing is truncated; the hidden layer is assumed
    # to use the sigmoid defined above, and the output layer is linear scores
    W1, b1, W2, b2 = parameters
    Z1 = np.dot(X, W1) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(A1, W2) + b2
    assert(Z2.shape == (X.shape[0],3))
    return Z2

Z2 = forward_propagation(X, parameters)
print(Z2)
The forward calculation function outputs the score Z belonging to each class, and the score can be converted into
the probability of belonging to each class with the softmax() function, and then calculate the multi-category cross-
entropy loss with the real target value. The function softmax_cross_entropy() and function
softmax_cross_entropy_reg() calculate the cross-entropy loss based on the output score value Z and the real value
y, which includes the regular term loss (reg is the regular term coefficient).
def softmax(Z):
exp_Z = np.exp(Z-np.max(Z,axis=1,keepdims=True))
return exp_Z/np.sum(exp_Z,axis=1,keepdims=True)
1.098427770814438
It is generally hoped that the neural network can calculate the loss function value by inputting a set of data X and
the corresponding target value y. Therefore, the forward calculation and the separate cross-entropy loss calculation
function are combined:
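The combined function is not listed above; a minimal sketch of compute_loss_reg() and softmax_cross_entropy_reg() consistent with how they are called below (both names appear later; their bodies and exact signatures are assumptions):

def softmax_cross_entropy_reg(Z, y, parameters, reg):
    # cross-entropy on the scores Z plus L2 regularization of the weights
    m = len(Z)
    F = softmax(Z)
    loss = -np.sum(np.log(F[range(m), y])) / m
    W1, b1, W2, b2 = parameters
    return loss + reg*(np.sum(W1*W1) + np.sum(W2*W2))

def compute_loss_reg(forward, loss_fn, X, y, parameters, reg):
    Z = forward(X, parameters)   # forward pass: class scores
    return loss_fn(Z, y, parameters, reg)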
1.098427770814438
Define a function f() that returns the loss value and pass it, together with the model parameters, to the general numerical gradient calculation function from section 2.4 to compute the numerical gradients of the neural network:
import util
def f():
return compute_loss_reg(forward_propagation,\
softmax_cross_entropy_reg, X, y, parameters,reg)
num_grads = util.numerical_gradient(f,parameters)
print(num_grads[0])
print(num_grads[3])
Now we can modify the previous gradient descent method to train the neural network model. A minimal sketch of gradient_descent_ANN(), consistent with the call below and using the numerical gradient (the convergence test with max_abs() is an assumption):

def max_abs(grads):
    return max([np.max(np.abs(grad)) for grad in grads])

def gradient_descent_ANN(f,X,y,parameters,reg,alpha,iterations=100):
    # X, y are unused here because f() closes over them
    losses = []
    for i in range(iterations):
        grads = util.numerical_gradient(f,parameters)   # numerical gradients
        for param, grad in zip(parameters, grads):
            param -= alpha*grad                         # update in place
        if max_abs(grads) < 1e-8:                       # converged
            break
        loss = f()
        losses.append(loss)
    return parameters,losses
np.random.seed(100)
def gen_spiral_dataset(N=100,D=2,K=3):
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
for j in range(K):
ix = range(N*j,N*(j+1))
r = np.linspace(0.0,1,N) # radius
t = np.linspace(j*4,(j+1)*4,N) + np.random.randn(N)*0.2 # theta
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
y[ix] = j
return X,y
X_spiral,y_spiral = gen_spiral_dataset()
# lets visualize the data:
plt.scatter(X_spiral[:, 0], X_spiral[:, 1], c=y_spiral, s=40, cmap=plt.cm.Spectral)
plt.show()
X = X_spiral
y = y_spiral
n_x, n_h, n_o = 2,5,3
parameters = initialize_parameters(n_x, n_h, n_o)
alpha = 1e-0
iterations =1000
lambda_ = 1e-3
parameters,losses = gradient_descent_ANN(f,X,y,parameters,lambda_, alpha, iterations)
for param in parameters:
print(param)
print(losses[:-1:len(losses)//10])
plt.plot(losses, color='r')
Figure 4-22. The loss curve of the three-class neural network for the spiral data set
The following function calculates the prediction accuracy of the model on the sample set (X, y) by comparing the
prediction result with the target value:
def getAccuracy(X,y,parameters):
predicts = forward_propagation(X,parameters)
predicts = np.argmax(predicts,axis=1)
accuracy = sum(predicts == y)/(float(len(y)))
return accuracy
getAccuracy(X,y,parameters)
0.9433333333333334
The prediction accuracy of the model on the training set reached 0.943, while the prediction accuracy of the
original softmax regression model was only 0.516. Draw the decision region again with code similar to the
previous one:
# plot the resulting classifier
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
XX = np.c_[xx.ravel(), yy.ravel()]
Z = forward_propagation(XX,parameters)
Z = np.argmax(Z, axis=1)
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
#fig.savefig('spiral_linear.png')
(-1.9355521912329907, 1.8444478087670126)
Figure 4-23 The decision regions of the three-class neural network classifier
It can be seen that the decision boundary of the 2-layer neural network model is no longer a straight line but can be an arbitrarily curved curve.
Not until 2012, when a team from the University of Toronto in Canada won the ImageNet competition with a deep convolutional neural network, did neural network models shine again. The success of modern neural networks is mainly due to high-performance parallel computing hardware and large-scale data, especially high-performance GPUs such as Nvidia's CUDA GPUs, which can perform data-intensive large-scale parallel computing.
For example, one forward pass of 500 samples with 784 features through a layer of 100 neurons requires the multiplication of a 500 × 784 matrix and a 784 × 100 matrix. The amount of calculation is therefore proportional to the number of samples, the number of features per sample, and the number of neurons in the neural network. Calculating the numerical partial derivative of each model parameter is independent of all the others, and each requires two independent forward calculations (including the calculation of the loss function value). Therefore, the cost of numerical derivation is very large, and it is not feasible for deep and large-scale neural networks. In fact, in deep learning, parallel computing hardware such as GPUs is used to accelerate the forward calculation and the gradient calculation of the neural network.
Therefore, the actual training of a neural network computes the analytical gradient (derivative) of the loss function with respect to the model parameters through the chain rule of derivation: the loss of the model is calculated in the forward direction, and then, starting from the gradient of the loss with respect to the final output value, the gradients of the model parameters of each layer are calculated along the reverse direction of the forward calculation.
The chain rule of derivation tells us that the calculation of a derivative runs exactly opposite to the calculation of the function value. For example, a variable x passes through the function g to give the value g(x); this value enters the function h to give h(g(x)); and that value is in turn input into the function k, giving the final output f(x) = k(h(g(x))). The calculation process is as follows:

x → g(x) → h(g(x)) → k(h(g(x))) = f(x)

f(x) = k(h(g(x))) is obtained from the series of functions g(x), h(g), k(h) by function composition. Given an argument x, the value f(x) is calculated step by step following this composition, from the innermost independent variable through a series of intermediate values, until the final f(x) is obtained. This "from inside to outside" process is called forward calculation. If g, h, and k are regarded as the functions of the layers of a neural network, this calculation is exactly the propagation of calculations from the input layer, layer by layer, to the next layer. Therefore, in a neural network the forward calculation is called forward propagation.
According to the chain rule, the derivative of f with respect to x can be calculated as follows:

f′(x) = k′(h) h′(g) g′(x)

That is, the calculation of f′(x) is decomposed into a series of steps: first calculate the derivative of f with respect to h, i.e. f′(h) = k′(h), then calculate the derivative of h with respect to g, i.e. h′(g), and finally calculate the derivative g′(x) of g with respect to x. As shown below, the calculation of the derivative f′(x) runs in the direction opposite to the calculation of the function value f(x): it proceeds "from outside to inside", in the reverse of the function composition, that is, along the reverse of the forward calculation:

f′(h) = k′(h) → f′(g) = k′(h) h′(g) → f′(x) = k′(h) h′(g) g′(x)

This process of calculating the derivative of a compound function in reverse is called reverse derivation. The calculations of the derivatives f′(h) = k′(h), h′(g), g′(x) are not independent of each other. If f′(h) is obtained first, there is no need to recompute it when calculating f′(g) = f′(h) h′(g). That is, if the derivative f′(h) of f with respect to an intermediate variable such as h is saved along the direction of reverse derivation, it can be directly multiplied by h′(g) to obtain f′(g). This avoids recalculating f′(h) when calculating f′(g).
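To make the saving of intermediate derivatives concrete, here is a small self-contained sketch (with arbitrarily chosen functions g, h, k, not from the book) that computes f′(x) by reverse derivation, reusing the saved f′(h), and checks it against a numerical derivative:

import numpy as np
g  = lambda x: 2*x + 1;     dg = lambda x: 2.0          # g and its derivative
h  = lambda g_: g_**2;      dh = lambda g_: 2*g_
k  = lambda h_: np.sin(h_); dk = lambda h_: np.cos(h_)
x = 1.5
gx = g(x); hx = h(gx); fx = k(hx)   # forward calculation, saving intermediates
df_dh = dk(hx)                      # f'(h), computed once and saved
df_dg = df_dh * dh(gx)              # f'(g) reuses the saved f'(h)
df_dx = df_dg * dg(x)               # f'(x) reuses the saved f'(g)
eps = 1e-6                          # numerical check
num = (k(h(g(x+eps))) - k(h(g(x-eps)))) / (2*eps)
print(np.isclose(df_dx, num))       # True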
If x is regarded as the input of the neural network, g, h as the outputs of the hidden layer and the output layer, and f(h) = k(h) as the loss function, then f′(h) = k′(h) represents the gradient (derivative) of the loss function with respect to the output h of the neural network. Starting from f′(h) and moving in the reverse direction of the neural network, from the output layer to the hidden layer, the gradients of the loss function with respect to the hidden-layer output g and the input x can be calculated in turn.
If the gradient of the loss function with respect to the output of a layer, such as f′(h), is known, the gradients of that layer's model parameters can be obtained. For example, suppose the input of a neural network layer is x and its output is a = σ(xw + b) = σ(z); if the derivative L′(a) of the loss function L with respect to a is known, the derivatives with respect to z, w, and b follow from the chain rule.
For the neural network model, the calculation of the gradients of the loss function with respect to each layer's output, intermediate variables, and input runs exactly opposite to the forward propagation: first calculate the gradient of the loss function with respect to the output of the output layer, and then, layer by layer toward the input, compute the gradients of the loss function with respect to each layer's intermediate variables, model parameters, and inputs. This calculation process is called backward propagation.
A compound function is built from simple functions not only by function composition but also by ordinary addition, subtraction, multiplication, and division. No matter which operations are used to construct the compound function, the forward calculation of the function value from the independent variables and the reverse derivation of the derivatives (gradients) with respect to the intermediate and independent variables proceed in the same way. A simple example further illustrates forward calculation and reverse derivation.
For two independent variables x, y, the function f(x, y) = (2x + 3y)² + (x − 4y)² can be regarded as f(x, y) = s + t, where s, t can be regarded as s = u², t = v², and u, v can be regarded as u = 2x + 3y, v = x − 4y.

The reverse derivation process of the partial derivative of the function f(x, y) with respect to x is as follows:

f′(s) = 1, f′(t) = 1 → f′(u) = f′(s) s′(u), f′(v) = f′(t) t′(v) → f′(x) = f′(u) u′(x) + f′(v) v′(x)
The forward calculation of the function value proceeds according to the composition process, as shown in Figure 4-24. The reverse derivation proceeds in just the opposite order, as shown in Figure 4-25:
Figure 4-25 The reverse derivation calculation diagram of the function f(x, y) = (2x + 3y)² + (x − 4y)²
It can be seen from this reverse calculation diagram that f′(x) comes from the accumulation of the partial derivatives along two paths: one is the reverse derivation from u, the other is the reverse derivation from v, that is, f′(x) = f′(u) u′(x) + f′(v) v′(x), while f′(u) receives a contribution only from the reverse derivative of s, i.e. f′(u) = f′(s) s′(u), and similarly f′(v) = f′(t) t′(v). Therefore:

f′(x) = f′(s) s′(u) u′(x) + f′(t) t′(v) v′(x)

That is, first find f′(s), then find s′(u), then find u′(x), and their product gives f′(s) s′(u) u′(x); the derivative is calculated along the reverse direction of the forward calculation, and f′(t) t′(v) v′(x) is obtained in the same way. Since f′(s) = 1, f′(t) = 1, s′(u) = 2u, t′(v) = 2v, u′(x) = 2, v′(x) = 1, the final f′(x) is:

f′(x) = 1 ∗ 2u ∗ 2 + 1 ∗ 2v ∗ 1 = 4u + 2v
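As a quick check, the following few lines (a self-contained sketch with concrete values chosen for illustration) carry out this forward and reverse calculation and compare the result with a numerical derivative:

import numpy as np
x, y = 1.0, 2.0
u, v = 2*x + 3*y, x - 4*y          # forward: inner variables u = 8, v = -7
s, t = u**2, v**2
f = s + t
df_ds, df_dt = 1.0, 1.0            # reverse: from outside to inside
df_du, df_dv = df_ds*2*u, df_dt*2*v
df_dx = df_du*2 + df_dv*1          # accumulate the two paths into x
eps = 1e-6                         # numerical check by central difference
num = ((2*(x+eps)+3*y)**2 + ((x+eps)-4*y)**2
       - (2*(x-eps)+3*y)**2 - ((x-eps)-4*y)**2) / (2*eps)
print(df_dx, num)                  # both equal 4u + 2v = 32 - 14 = 18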
The feed-forward neural network is composed of multiple layers of neurons. Each layer of neurons accepts the input a^[l−1] of the previous layer and, with the layer's own model parameters W^[l], b^[l], calculates the weighted sum z^[l] = a^[l−1]W^[l] + b^[l]. An activation function g then produces the layer's own output a^[l], which is used as the input of the next layer of neurons; in this way the calculation results are passed layer by layer to the next layer, until the output layer, where the output and the target value produce some loss L(a^[L], y).

For the above 2-layer neural network, the weighted sums z^[1], z^[2] and the output values a^[1], a^[2] of each layer of neurons can be calculated in the following order, finally giving the loss function value L(a^[2], y):

z^[1] = W^[1]x + b^[1] → a^[1] = σ(z^[1]) → z^[2] = W^[2]a^[1] + b^[2] → a^[2] = σ(z^[2]) → L(a^[2], y)
The neural network function can be regarded as a function of its model parameters and intermediate variables. The process of calculating the gradient (derivative) of the loss function with respect to these parameters and variables is the same as the reverse derivation of any compound function: starting "from outside to inside" at the outermost loss function, the gradients of the intermediate variables and parameters are calculated sequentially along the reverse direction of the forward calculation of the neural network function value.

The process is: first find the gradient of the loss function with respect to the output of the output layer. If the gradient ∂L/∂a^[l] of the loss function with respect to the output of some layer l is known, then, according to the activation function of the neurons of this layer, the gradient ∂L/∂z^[l] of the loss with respect to this layer's weighted sum z^[l] can be found; knowing ∂L/∂z^[l], the gradients of the loss function with respect to this layer's model parameters ∂L/∂W^[l], ∂L/∂b^[l] and with respect to the previous layer's output ∂L/∂a^[l−1] can be found.
For the above 2-layer neural network, the reverse derivation process is as follows:

∂L/∂a^[2] → ∂L/∂z^[2] → (∂L/∂W^[2], ∂L/∂b^[2], ∂L/∂a^[1]) → ∂L/∂z^[1] → (∂L/∂W^[1], ∂L/∂b^[1])
The gradients of the loss function with respect to the intermediate variables and parameters of each layer depend on results of the forward calculation, such as a^[l]. In order to avoid recomputing these values, the forward calculation can store these results in the corresponding layers of the neural network, and the reverse derivation can then use the stored results directly, avoiding repeated calculation and improving efficiency. From the viewpoint of the calculation graph, these intermediate results can be saved in the corresponding nodes of the graph. Modern deep learning platforms express the forward propagation and reverse derivation of a neural network by means of a calculation graph and save the relevant intermediate results on its nodes. Therefore, the calculation graph not only guarantees the correct order of the forward and reverse calculations but also saves intermediate results to improve calculation efficiency.
4.2.3 The gradient of the loss function with respect to the output
Reverse derivation first calculates the gradient of the loss function with respect to the output of the final output layer, and then calculates the gradients of the loss function with respect to the intermediate variables and parameters of each layer, from the output layer along the reverse direction of forward propagation until the input layer.
The definitions of loss functions for different problems (regression, classification) are different. The following
discusses how to calculate the gradient of the loss function with respect to the output layer for several common
loss functions.
1. The gradient of the binary cross-entropy loss function on the output
According to Section 3.5), the derivative of the binary-classification cross-entropy loss L(f, y) = −(y log(f) + (1 − y) log(1 − f)) with respect to f is:

∂L/∂f = −(y/f − (1−y)/(1−f)) = (f − y)/(f(1−f))
For binary classification problems, the cross-entropy of multiple samples is the mean of the cross-entropies of the single samples:

L(F, Y) = (1/m) Σᵢ Lᵢ(y^(i), f^(i)) = −(1/m) Σᵢ [y^(i) log(f^(i)) + (1 − y^(i)) log(1 − f^(i))]

Its gradient with respect to F is:

∂L/∂F = (1/m) (F − Y)/(F(1 − F))

Since F = σ(Z):

∂F/∂Z = σ(Z)(1 − σ(Z)) = F(1 − F)

therefore:

∂L/∂Z = (∂L/∂F)(∂F/∂Z) = (1/m)(F − Y)
That is:

     ⎡ f^(1) ⎤        ⎡ y^(1) ⎤                   ⎡ f^(1) − y^(1) ⎤
F =  ⎢   ⋮   ⎥,  Y =  ⎢   ⋮   ⎥,  ∂L/∂Z = (1/m)  ⎢       ⋮       ⎥
     ⎣ f^(m) ⎦        ⎣ y^(m) ⎦                   ⎣ f^(m) − y^(m) ⎦

It should be noted that, because each sample corresponds to a different z^(i), the cross-entropy loss L is derived separately with respect to each z^(i); these derivatives are not added up. In addition, the vector multiplications and divisions in the above formulas are all element-wise operations.

According to whether the output of the neural network is the score given by the weighted sum or the output probability of the σ function, the following function calculates the binary cross-entropy loss and the corresponding gradient (the function body is restored around the surviving lines of the original; the exact flag semantics are an assumption):

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def binary_cross_entropy_loss_grad(out, y, sigmoid_out=True):
    # sigmoid_out=True: out is the probability f = sigmoid(z), and the
    # gradient with respect to f is returned; sigmoid_out=False: out is the
    # raw score z, and the gradient with respect to z is returned
    if sigmoid_out:
        f = out
        grad = (f-y)/(f*(1-f))/len(y)
    else:
        f = sigmoid(out)
        grad = (f-y)/(len(y))
    loss = -np.mean(y*np.log(f)+(1-y)*np.log(1-f))
    return loss, grad

z = np.array([-4, 5, 2])
f = sigmoid(z)
y = np.array([0, 1, 0])
loss,grad = binary_cross_entropy_loss_grad(f,y)
print(loss,grad)
loss,grad = binary_cross_entropy_loss_grad(z,y,False)
print(loss,grad)
2. The gradient of the mean square error loss function on the output
For regression problems, the output layer consists of one or more linear regression neurons: each neuron directly outputs the weighted sum z of its inputs, and the values output by the K neurons of the output layer form an output vector z = (z_1, z_2, ..., z_K), which is the output of the entire output layer, f = z. For K > 1 the target value is also a vector.
For a sample, half of the squared Euclidean distance between the output vector f^(i) and the target value vector y^(i) can be used as the error:

L(f^(i), y^(i)) = ½ Σ_k (f^(i)_k − y^(i)_k)² = ½ ‖f^(i) − y^(i)‖²

The factor ½ is introduced to make the derivative (gradient) look more concise; the gradient of this error with respect to f^(i) is (f^(i) − y^(i)).
For matrices F, Y composed of multiple samples, the mean square error is L(F, Y) = (1/2m) Σᵢ ‖f^(i) − y^(i)‖², and its gradient with respect to F is (1/m)(F − Y). Because F = Z, namely:

∂L/∂Z = ∂L/∂F = (1/m)(F − Y)

The function mse_loss_grad() computes this loss and gradient (the body is restored around the surviving lines):

def mse_loss_grad(f,y):
    m = len(f)
    loss = np.sum((f-y)**2)/(2*m)
    grad = (f-y)/m
    return loss, grad

3. The gradient of the multi-class cross-entropy loss function on the output
For multi-classification problems, the neural network converts the output of the previous layer into probabilities with an intuitive meaning through a final softmax function, indicating the probability that the sample belongs to each class. Since the softmax neurons contain no model parameters, sometimes softmax is not used as the last layer of the neural network; instead, the layer before it serves as the last output layer. No matter which scheme is adopted, the multi-class cross-entropy loss of softmax regression is usually calculated at the end. The latter scheme is usually adopted, i.e., it is assumed that the output layer outputs scores instead of probabilities: the neurons of this output layer are linear regression neurons whose weighted sums are directly output as the activation values, f = a^(L) = z^(L) (assuming an L-layer neural network, this output layer has serial number L).
Let the output z of the output layer generate an output f through the softmax function, and then calculate the multi-class cross-entropy loss. For multiple samples, the output of the output layer can be written as a matrix Z = (z^(1), ⋯, z^(i), ⋯, z^(m))ᵀ, and the output generated by the softmax function is a probability matrix F = (f^(1), ⋯, f^(i), ⋯, f^(m))ᵀ. According to Section 3.6), if each target value y^(i) is represented by a one-hot vector, then Y is also a matrix, and the gradient of the multi-class cross-entropy loss L(F, Y) with respect to Z is (1/m)(F − Y).
If each target value y_i is instead an integer representing the index of the class the sample belongs to, then the gradient of the multi-class cross-entropy loss L(F, Y) with respect to Z is (1/m)(F − I), where each row of I is the one-hot vector obtained by converting the sample's integer label; this matrix I is the same as the one-hot matrix Y above.
The following python code converts a target vector of integer values into a matrix of one-hot vectors:
I_i = np.zeros_like(Z)
I_i[np.arange(len(Z)),Y] = 1
It can be seen that the gradients with respect to the output-layer Z of the Euclidean loss of regression, the cross-entropy loss of binary classification, and the cross-entropy loss of multi-classification are surprisingly consistent: all are (1/m)(F − Y).
Given a multi-sample output layer weighted sum Z and a target value Y, the following code computes the gradient
of the multiclass cross-entropy with respect to Z (see Section 3.6):
def softmax(x):
    a = np.max(x,axis=-1,keepdims=True)
    e_x = np.exp(x - a)
    return e_x/np.sum(e_x,axis=-1,keepdims=True)
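The gradient function itself does not survive in this excerpt; the following is a minimal sketch consistent with the formula (1/m)(F − I) and with the ways cross_entropy_grad is called later in this chapter (the optional flags are assumptions):

def cross_entropy_grad(Z, y, onehot=False, softmax_out=False):
    # gradient of the mean multi-class cross-entropy with respect to Z:
    # (F - Y)/m; onehot says whether y is already a one-hot matrix,
    # softmax_out whether Z is already a probability matrix
    F = Z if softmax_out else softmax(Z)
    m = len(Z)
    if onehot:
        Y = y
    else:
        Y = np.zeros_like(F)       # convert integer labels to one-hot rows
        Y[np.arange(m), y] = 1
    return (F - Y) / m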
Next we discuss how, knowing the gradient ∂L/∂z^[l] of the loss with respect to a layer's weighted sum z^[l], to find the gradients of the loss with respect to the variables W^[l], b^[l], a^[l−1] of that layer.

Consider the above 2-layer neural network and suppose ∂L/∂z^[2] is known. How can ∂L/∂W^[2], ∂L/∂b^[2], ∂L/∂a^[1] be obtained?

Look first at a single weight entry such as W^[2]_11. Because W^[2]_11 is related only to z^[2]_1 (the other components z^[2]_2, z^[2]_3 do not depend on it), the corresponding partial derivatives vanish, and the chain rule gives:

∂L/∂W^[2]_11 = (∂L/∂z^[2]_1)(∂z^[2]_1/∂W^[2]_11) + 0 + 0 = (∂L/∂z^[2]_1) a^[1]_1

This is because the i-th column of W^[2] contributes only to z^[2]_i, or equivalently, z^[2]_i depends only on the i-th column of W^[2], so in general:

∂L/∂W^[2]_ji = (∂L/∂z^[2]_i) a^[1]_j

Collecting all entries into a matrix, with a^[1] written as a row vector:

∂L/∂W^[2] = (a^[1])ᵀ (∂L/∂z^[2])

Obviously, since ∂z^[2]_i/∂b^[2]_i = 1:

∂L/∂b^[2] = ∂L/∂z^[2]

Because z^[2] = a^[1]W^[2] + b^[2], every component a^[1]_j feeds into every component of z^[2], so the partial derivatives with respect to a^[1]_j must be accumulated over all components:

∂L/∂a^[1]_j = Σ_i (∂L/∂z^[2]_i) W^[2]_ji,   i.e.   ∂L/∂a^[1] = (∂L/∂z^[2]) (W^[2])ᵀ

Because a^[1] = g(z^[1]) is an element-wise activation, ∂L/∂z^[1] = ∂L/∂a^[1] ⊙ g′(z^[1]); and by exactly the same reasoning as for the second layer, with a^[0] = x:

∂L/∂W^[1] = (a^[0])ᵀ (∂L/∂z^[1]) = xᵀ (∂L/∂z^[1]),   ∂L/∂b^[1] = ∂L/∂z^[1]

At this point, the gradients of the loss function with respect to all variables W^[2], b^[2], a^[1], z^[1], W^[1], b^[1] have been obtained. That is, starting from the gradient of the loss function with respect to the weighted sum of the output layer, the gradients of the loss function with respect to the relevant variables of each layer are obtained according to the reverse derivation process:

∂L/∂z^[2] → ∂L/∂W^[2] = (a^[1])ᵀ ∂L/∂z^[2],  ∂L/∂b^[2] = ∂L/∂z^[2],  ∂L/∂a^[1] = ∂L/∂z^[2] (W^[2])ᵀ
→ ∂L/∂z^[1] = ∂L/∂a^[1] ⊙ g′(z^[1]) → ∂L/∂W^[1] = xᵀ ∂L/∂z^[1],  ∂L/∂b^[1] = ∂L/∂z^[1]
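The following self-contained sketch implements these single-sample formulas for a tiny 2-layer network with a sigmoid hidden activation and a squared-error loss (all sizes, values, and the loss are chosen arbitrarily for the check, and are not from the book), and verifies one weight gradient numerically:

import numpy as np
np.random.seed(0)
x = np.random.randn(1, 2)                    # one sample as a row vector
W1, b1 = np.random.randn(2, 5), np.zeros((1, 5))
W2, b2 = np.random.randn(5, 3), np.zeros((1, 3))
t = np.random.randn(1, 3)                    # an arbitrary target
sigmoid = lambda z: 1/(1 + np.exp(-z))
def loss():
    a1 = sigmoid(np.dot(x, W1) + b1)
    z2 = np.dot(a1, W2) + b2
    return 0.5*np.sum((z2 - t)**2)
# forward calculation, saving the intermediate values
z1 = np.dot(x, W1) + b1
a1 = sigmoid(z1)
z2 = np.dot(a1, W2) + b2
# reverse derivation following the formulas above
dz2 = z2 - t                                 # dL/dz2 of the squared error
dW2 = np.dot(a1.T, dz2)
db2 = dz2
da1 = np.dot(dz2, W2.T)
dz1 = da1 * a1*(1 - a1)                      # element-wise g'(z1) of the sigmoid
dW1 = np.dot(x.T, dz1)
db1 = dz1
# numerical check of one entry of dW1
h = 1e-6
W1[0, 0] += h; lp = loss()
W1[0, 0] -= 2*h; lm = loss()
W1[0, 0] += h
print(dW1[0, 0], (lp - lm)/(2*h))            # the two values should agree closely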
2. Multi-sample vectorized representation of reverse derivation
Like general machine learning, when training a neural network the model parameters are usually solved by minimizing the error (loss) between the predicted values and the true values of multiple samples. The loss is a function of the model parameters as well as of intermediate variables such as a^[l], z^[l].
For the non-parameter quantities of each layer, such as the intermediate variables a^[l], z^[l], different samples have different values, and they are all different variables; for example, a^[l](1) and a^[l](2) are different variables produced by two different samples at layer l. Assuming these variables are all written in the form of row vectors, the values of these variables for all samples can be stacked in rows to form a matrix, each row of the matrix corresponding to one sample. The symbols A^[l], Z^[l] can be used to represent these matrices:

         ⎡ a^[l](1) ⎤            ⎡ z^[l](1) ⎤
A^[l] =  ⎢ a^[l](2) ⎥,   Z^[l] = ⎢ z^[l](2) ⎥
         ⎢     ⋮    ⎥            ⎢     ⋮    ⎥
         ⎣ a^[l](m) ⎦            ⎣ z^[l](m) ⎦
Where A^[0] is the matrix X composed of the input features of all samples, namely:

             ⎡ x^(1) ⎤
A^[0] = X =  ⎢ x^(2) ⎥
             ⎢   ⋮   ⎥
             ⎣ x^(m) ⎦
That is, different samples produce different intermediate variables in the forward propagation of each layer, but the same model parameters W^[l], b^[l] are used by all of them. Since the loss of multiple samples is the mean of the losses of all samples, the gradient of the multi-sample loss with respect to the model parameters is the mean of the per-sample gradients. For example, for the weight parameter W, with m samples:

∂L/∂W = (1/m) Σᵢ ∂L^(i)/∂W

Usually, when the gradient ∂L/∂z^[L] of the loss function with respect to the output layer is calculated, the mean factor 1/m has already been multiplied in. Therefore, the gradients of the model parameters can be directly accumulated:

∂L/∂W = Σᵢ ∂L^(i)/∂W
therefore:

∂L/∂W^[2] = Σᵢ (a^[1](i))ᵀ (∂L/∂z^[2](i)) = (A^[1])ᵀ ∂L/∂Z^[2]

Similarly, for the bias, the partial derivatives of all single samples can be accumulated to get:

∂L/∂b^[2] = Σᵢ ∂L/∂z^[2](i) = np.sum(∂L/∂Z^[2], axis=0, keepdims=True)

Different from the model parameters, the intermediate variables of different samples are different (not shared). Therefore, the gradients of the loss function with respect to the intermediate variables of different samples are independent of each other. If the gradient of each sample's intermediate variable is written as a row vector, all these gradients can be stacked into a matrix in which each row represents the gradient of one sample. Therefore:

∂L/∂A^[1] = ∂L/∂Z^[2] (W^[2])ᵀ,   ∂L/∂Z^[1] = ∂L/∂A^[1] ⊙ g′(Z^[1])

So the multi-sample gradients have the same formulas as the single-sample gradients:

∂L/∂W^[2] = (A^[1])ᵀ ∂L/∂Z^[2]      ∂L/∂b^[2] = np.sum(∂L/∂Z^[2], axis=0, keepdims=True)
∂L/∂A^[1] = ∂L/∂Z^[2] (W^[2])ᵀ      ∂L/∂Z^[1] = ∂L/∂A^[1] ⊙ g′(Z^[1])
∂L/∂W^[1] = (A^[0])ᵀ ∂L/∂Z^[1]      ∂L/∂b^[1] = np.sum(∂L/∂Z^[1], axis=0, keepdims=True)
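A quick sanity check (a sketch with arbitrary shapes, not from the book) that the vectorized weight gradient (A^[1])ᵀ ∂L/∂Z^[2] indeed accumulates the per-sample outer products:

import numpy as np
np.random.seed(0)
A1 = np.random.randn(5, 4)                   # 5 samples, 4 hidden outputs (rows)
dZ2 = np.random.randn(5, 3)                  # per-sample gradients dL/dz2 (rows)
dW2 = np.dot(A1.T, dZ2)                      # vectorized multi-sample form
dW2_sum = sum(np.outer(A1[i], dZ2[i]) for i in range(5))
print(np.allclose(dW2, dW2_sum))             # True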
3. Gradient calculation formula in column vector form
If the samples, intermediate variables, and their gradients are all in column-vector form, i.e., x, a^[1], z^[1], b^[1], a^[2], z^[2], b^[2] are all column vectors, then each row of W^[1], W^[2] corresponds to all the weights of one neuron, and z^[l] = W^[l]a^[l−1] + b^[l].
One-sample form:

∂L/∂W^[2] = (∂L/∂z^[2]) (a^[1])ᵀ      ∂L/∂b^[2] = ∂L/∂z^[2]      ∂L/∂a^[1] = (W^[2])ᵀ ∂L/∂z^[2]
∂L/∂z^[1] = ∂L/∂a^[1] ⊙ g′(z^[1])     ∂L/∂W^[1] = (∂L/∂z^[1]) (a^[0])ᵀ      ∂L/∂b^[1] = ∂L/∂z^[1]

Multi-sample form:

∂L/∂W^[2] = (∂L/∂Z^[2]) (A^[1])ᵀ      ∂L/∂b^[2] = np.sum(∂L/∂Z^[2], axis=1, keepdims=True)      ∂L/∂A^[1] = (W^[2])ᵀ ∂L/∂Z^[2]
∂L/∂Z^[1] = ∂L/∂A^[1] ⊙ g′(Z^[1])     ∂L/∂W^[1] = (∂L/∂Z^[1]) (A^[0])ᵀ      ∂L/∂b^[1] = np.sum(∂L/∂Z^[1], axis=1, keepdims=True)

For a loss function that contains a regularization term, the partial derivative of the regularization term with respect to each model parameter must also be calculated when computing the gradient. If the regularization term is λ‖W‖² = λ Σ_l Σ_ij (W^[l]_ij)², then its partial derivative with respect to W^[l]_ij is 2λW^[l]_ij, written in matrix form as 2λW^[l].
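The regularization gradient 2λW can itself be confirmed with a one-entry numerical check (a small sketch, not from the book):

import numpy as np
np.random.seed(0)
W, lam = np.random.randn(3, 2), 1e-3
h = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += h; Wm[0, 0] -= h
num = (lam*np.sum(Wp**2) - lam*np.sum(Wm**2)) / (2*h)   # central difference
print(np.isclose(num, 2*lam*W[0, 0]))                   # True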
For the above 2-layer neural network, on the basis of the forward calculation (that is, A0 and A1 from the forward calculation are known), the following code gives the reverse derivation process, assuming the activation function of the first layer is ReLU (the lines around the surviving ones are restored from the formulas above):

def dRelu(x):
    return 1. * (x > 0)

dW2 = np.dot(A1.T, dZ2)
db2 = np.sum(dZ2, axis=0, keepdims=True)
dA1 = np.dot(dZ2, W2.T)
#dZ1 = dA1*dRelu(A1)
dA1[A1 <= 0] = 0
dZ1 = dA1
dW1 = np.dot(X.T, dZ1)
db1 = np.sum(dZ1, axis=0, keepdims=True)
def max_abs(s):
    max_value = 0
    for x in s:
        max_value_ = np.max(np.abs(x))
        if max_value_ > max_value:
            max_value = max_value_
    return max_value
class TwoLayerNN:
    # (parts of this class were lost in extraction; the missing pieces are
    # restored here as a sketch consistent with the surviving lines)
    def __init__(self, input_units, hidden_units, output_units):
        # initialize parameters randomly
        n, h, K = input_units, hidden_units, output_units
        self.W1 = 0.01 * np.random.randn(n, h)
        self.b1 = np.zeros((1, h))
        self.W2 = 0.01 * np.random.randn(h, K)
        self.b2 = np.zeros((1, K))
    def train(self, X, y, reg=1e-3, iterations=10000, learning_rate=1e-0, epsilon=1e-8):
        W1, b1, W2, b2 = self.W1, self.b1, self.W2, self.b2
        for i in range(iterations):
            # forward
            Z1 = np.dot(X, W1) + b1
            A1 = np.maximum(0, Z1)          # ReLU activation
            Z2 = np.dot(A1, W2) + b2
            data_loss = softmax_cross_entropy(Z2, y)
            reg_loss = reg*np.sum(W1*W1) + reg*np.sum(W2*W2)
            loss = data_loss + reg_loss
            if i % 1000 == 0:
                print("iteration %d: loss %f" % (i, loss))
            # backward
            dZ2 = cross_entropy_grad(Z2, y)
            dW2 = np.dot(A1.T, dZ2) + 2*reg*W2
            db2 = np.sum(dZ2, axis=0, keepdims=True)
            dA1 = np.dot(dZ2, W2.T)
            dA1[A1 <= 0] = 0
            dZ1 = dA1
            #dZ1 = dA1*dRelu(A1)
            #dZ1 = np.multiply(dA1, dRelu(A1))
            dW1 = np.dot(X.T, dZ1) + 2*reg*W1
            db1 = np.sum(dZ1, axis=0, keepdims=True)
            if max_abs([dW2, db2, dW1, db1]) < epsilon:
                print("gradient is small enough at iter : ", i)
                break
            # gradient descent update (restored; lost in extraction)
            W1 += -learning_rate * dW1; b1 += -learning_rate * db1
            W2 += -learning_rate * dW2; b2 += -learning_rate * db2
    def predict(self, X):
        Z1 = np.dot(X, self.W1) + self.b1
        A1 = np.maximum(0, Z1)              # ReLU activation
        Z2 = np.dot(A1, self.W2) + self.b2
        return Z2
The constructor __init__() of TwoLayerNN accepts the numbers of neurons in the input, hidden, and output layers as parameters and initializes the model parameters of the 2-layer neural network. train() trains the neural network model: according to the training samples, it uses the gradient descent algorithm to find the best model parameters, i.e., those that minimize the cross-entropy loss on the training samples. The parameters of train() include a set of training samples (X, y), the regularization coefficient reg, and the hyperparameters of the gradient descent method (such as the number of iterations, the learning rate learning_rate, and the convergence error). In each iteration of gradient descent, train() first calculates the sample outputs and their intermediate variables (Z1, A1, Z2) in the forward direction, uses softmax to convert the scores into probabilities and calculates the multi-class cross-entropy loss (data_loss), then calculates the gradient dZ2 of the cross-entropy loss with respect to the output-layer output, and backpropagates to find the gradients with respect to the intermediate variables and model parameters (the gradients of the model parameters include the gradients of the regularization term, 2*reg*W2 and 2*reg*W1).
The prediction function predict() predicts the target value of the input data X according to the trained neural network model; it is simply a forward propagation calculation.
The data features and target values of the spiral data set in Section 2.7) can be modeled with the above 2-layer neural network. First generate the dataset, then construct and train the model:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import data_set as ds
np.random.seed(89)
X,y = ds.gen_spiral_dataset()
nn = TwoLayerNN(2, 100, 3)   # (restored: the construction and training calls
nn.train(X, y)               #  were lost; 100 hidden units is an assumption)
Output the accuracy of the trained model with the following code:
# evaluate training set accuracy
#A1 = np.maximum(0, np.dot(X, W1) + b1)
#Z2 = np.dot(A1, W2) + b2
Z2 = nn.predict(X)
predicted_class = np.argmax(Z2, axis=1)
print ('training accuracy: %.2f' % (np.mean(predicted_class == y)))
It can be seen that the model trained with analytical gradients is more accurate, reaching 99%. The following code visualizes its decision boundary, which is also better than that of the model trained with numerical gradients:
# plot the resulting classifier
h = 0.02
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
XX = np.c_[xx.ravel(), yy.ravel()]
Z = nn.predict(XX)
Z = np.argmax(Z, axis=1)
Z = Z.reshape(xx.shape)
fig = plt.figure()
plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, cmap=plt.cm.spring)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
#fig.savefig('spiral_net.png')
(-1.9124776305480737, 1.9275223694519297)
Figure 4-27 Classification decision region for a spiral data point set
In general, for any layer l, knowing the gradient ∂L/∂a^[l] of the loss function with respect to this layer's output, the gradient with respect to the weighted sum follows from the element-wise activation, ∂L/∂z^[l] = ∂L/∂a^[l] ⊙ g′(z^[l]), and then the gradients of the loss function with respect to the parameters W^[l], b^[l] and the input a^[l−1] of the layer can be obtained:

∂L/∂W^[l] = (a^[l−1])ᵀ ∂L/∂z^[l]
∂L/∂b^[l] = np.sum(∂L/∂z^[l], axis=0, keepdims=True)
∂L/∂a^[l−1] = ∂L/∂z^[l] (W^[l])ᵀ

and in multi-sample form:

∂L/∂W^[l] = (A^[l−1])ᵀ ∂L/∂Z^[l]
∂L/∂b^[l] = np.sum(∂L/∂Z^[l], axis=0, keepdims=True)
∂L/∂A^[l−1] = ∂L/∂Z^[l] (W^[l])ᵀ

In the following, the gradient of the loss function with respect to the intermediate variables and model parameters is derived in column-vector form, i.e., assuming the input data x = a^[0] and the intermediate variables z^[l], a^[l] and their gradients are represented by column vectors. With column vectors, each row (instead of each column) of the weight matrix represents the weight parameters of one neuron, the weighted sum is z^[l] = W^[l]a^[l−1] + b^[l], and the weighted sum passed through the activation function g is of course also a column vector, a^[l] = g(z^[l]).

For simplicity of derivation, use the symbol δ^[l] to denote the partial derivative of the loss function with respect to the weighted sum z^[l], that is, δ^[l] = ∂L/∂z^[l].

The j-th neuron of the l-th layer outputs the weighted sum z^[l]_j = Σᵢ W^[l]_ji a^[l−1]_i + b^[l]_j, which is related only to this neuron and has nothing to do with the other neurons of this layer. Therefore, the weight parameter W^[l]_jk of this neuron contributes only to z^[l]_j (only z^[l]_j depends on W^[l]_jk), so:

∂L/∂W^[l]_jk = (∂L/∂z^[l]_j)(∂z^[l]_j/∂W^[l]_jk) = δ^[l]_j a^[l−1]_k

Collecting all entries, the gradient of the weight matrix is the outer product of δ^[l] and a^[l−1], and the gradient of the bias is δ^[l] itself:

∂L/∂W^[l] = δ^[l] (a^[l−1])ᵀ,   ∂L/∂b^[l] = δ^[l]

Figure 4-28 Each output value of each neuron in layer l−1 becomes the input to each neuron in layer l

As shown in the figure, each output a^[l−1]_k of a neuron in layer l−1 becomes an input to every neuron in layer l. Therefore, when calculating the partial derivative of the loss function with respect to a^[l−1]_k, the partial derivatives through all the z^[l]_j must be accumulated, namely:

∂L/∂a^[l−1]_k = Σ_j (∂L/∂z^[l]_j)(∂z^[l]_j/∂a^[l−1]_k) = Σ_j δ^[l]_j W^[l]_jk

That is, the dot product of δ^[l] with the k-th column of W^[l]. If W^[l]_,k denotes the k-th column, the above formula can be written as a matrix product, and stacking all components k gives:

∂L/∂a^[l−1]_k = (W^[l]_,k)ᵀ δ^[l],   ∂L/∂a^[l−1] = (W^[l])ᵀ δ^[l]

The key to the problem is how to find δ^[l]_j = ∂L/∂z^[l]_j. For the output layer L, as in the section "The gradient of the loss function with respect to the output", the loss function directly gives the gradient of the output layer. For any other layer l−1 < L, once ∂L/∂a^[l−1] and the derivative of the activation function of the neurons of this layer are known (without loss of generality, let the activation functions of the neurons of this layer all be g, that is, a^[l−1]_i = g(z^[l−1]_i)), the chain rule of derivation gives:

δ^[l−1]_i = ∂L/∂z^[l−1]_i = (∂L/∂a^[l−1]_i) g′(z^[l−1]_i)

The symbol g′(.) can be seen as broadcasting, i.e., acting element-wise on an array:

g′(z^[l]) = (g′(z^[l]_1), g′(z^[l]_2), ⋯, g′(z^[l]_n))ᵀ

so that:

δ^[l−1] = ∂L/∂z^[l−1] = ∂L/∂a^[l−1] ⊙ g′(z^[l−1])

Substituting ∂L/∂a^[l−1] = (W^[l])ᵀ δ^[l] into the above formula gives:

δ^[l−1] = ((W^[l])ᵀ δ^[l]) ⊙ g′(z^[l−1])

That is, in the reverse derivation process it is not necessary to calculate the gradient of the intermediate layer output ∂L/∂a^[l−1] explicitly: the δ of a layer follows directly from the δ of the next layer.

Finally, it should be noted that if the output layer does not directly output the weighted sum z^(L) but passes it through an activation function, a^(L) = f(z^(L)), then for the variance loss ½‖a^(L) − y‖² the gradient is ∂L/∂a^(L) = a^(L) − y and δ^(L) = (∂L/∂a^(L)) ⊙ f′(z^(L)). Here f is the identity function for regression problems (assuming the variance loss ½‖z^(L) − y‖²); for the binary classification problem, f is the σ function, and for the multi-classification problem, f is the softmax function, and the gradient of the cross-entropy loss of a^(L) and the target value y with respect to the weighted sum z^(L) is a^(L) − y (for multi-classification, this y is in one-hot vector form). Of course, for multiple samples, the gradient is:

∂L/∂Z^(L) = (1/m)(A^(L) − Y) = (1/m)(f(Z^(L)) − Y)

The gradient formula of the loss function with respect to the output layer, together with the following three formulas, are called the four major formulas of reverse derivation:

δ^[l−1] = ((W^[l])ᵀ δ^[l]) ⊙ g′(z^[l−1])
∂L/∂W^[l] = δ^[l] (a^[l−1])ᵀ
∂L/∂b^[l] = δ^[l]

The vector form for multiple samples is:

∂L/∂W^[l] = (∂L/∂Z^[l]) (A^[l−1])ᵀ
∂L/∂b^[l] = np.sum(∂L/∂Z^[l], axis=1, keepdims=True)
∂L/∂Z^[l−1] = ((W^[l])ᵀ ∂L/∂Z^[l]) ⊙ g′(Z^[l−1])
Prepare data: prepare the sample data set for training the model, which besides the training set may include a validation set and a test set;
Determine the neural network structure: design an appropriate neural network model for the specific problem. If the model is too large, training takes a long time and is difficult; if the model is too small, its expressive power may be insufficient. A suitable network structure must be chosen according to the actual problem. The network structure also includes which activation functions to choose and which error (loss) criterion, i.e., which loss function, to define;
Train the model: this includes random initialization of the model parameters and the gradient descent method for finding the optimal solution. A validation set may be needed to help select appropriate models and hyperparameters, to avoid overfitting and underfitting.
Like the regression model, the neural network also uses the gradient descent algorithm to train the model and find the most suitable model parameters. Each iteration of the gradient descent algorithm can be divided into three steps:

1. Forward calculation:
1.1 Starting from the first layer, the intermediate variables and activation output values of each subsequent layer are calculated sequentially until the output layer:

Z^[l] = A^[l−1]W^[l] + b^[l],   A^[l] = g(Z^[l])   (with A^[0] = X)

1.2 Calculate the loss function value according to the chosen loss criterion:

L = L(A^(L), y)

2. Reverse derivation: calculate the gradient of the loss function with respect to the output-layer output, i.e. δ^[L] = ∂L/∂Z^[L], and then, from the output layer L all the way to layer 1, calculate the gradients of the loss function with respect to W, b, and the layer inputs, i.e. ∂L/∂W^[l], ∂L/∂b^[l], ∂L/∂A^[l−1], ∂L/∂Z^[l−1]:

∂L/∂W^[l] = (A^[l−1])ᵀ ∂L/∂Z^[l]
∂L/∂b^[l] = np.sum(∂L/∂Z^[l], axis=0, keepdims=True)
∂L/∂A^[l−1] = ∂L/∂Z^[l] (W^[l])ᵀ
∂L/∂Z^[l] = ∂L/∂A^[l] ⊙ g′(Z^[l])

3. Update the parameters along the negative gradient direction:

W^[l] = W^[l] − α ∂L/∂W^[l],   b^[l] = b^[l] − α ∂L/∂b^[l]
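As an illustration of step 2 (this is a sketch, not the book's own class, which follows below), the row-vector formulas can be written as one compact backward pass over a list of weight matrices, assuming ReLU hidden layers:

def backward_pass(As, Ws, dZ):
    # As = [A0(=X), A1, ..., A(L-1)]: the cached input of each layer from the
    # forward pass; Ws = [W1, ..., WL]; dZ = dL/dZ[L] from the loss function
    dWs, dbs = [], []
    for l in reversed(range(len(Ws))):
        dWs.insert(0, np.dot(As[l].T, dZ))
        dbs.insert(0, np.sum(dZ, axis=0, keepdims=True))
        if l > 0:
            dA = np.dot(dZ, Ws[l].T)
            dZ = dA * (As[l] > 0)   # element-wise g'(Z): for ReLU, A > 0 iff Z > 0
    return dWs, dbs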
class Layer:
    def __init__(self):
        pass
    def forward(self, x):
        raise NotImplementedError
On the basis of the Layer class, a derived class Dense can be defined to represent a fully connected layer. A fully connected layer is one in which each neuron accepts all the outputs of the previous layer as input. The parameters input_units, output_units, and activation of the constructor __init__() of the Dense class represent the size of the input, the size of the output, and the activation function, respectively. The forward calculation function forward() calculates the weighted sum Z from the input x, the weight W, and the bias b, and then feeds it to the activation function g to calculate the output value A. That is:

Z^[l] = A^[l−1]W^[l] + b^[l] (with A^[0] = X),   A^[l] = g(Z^[l])
Backward calculation (reverse derivation) accepts the gradient ∂L/∂A^[l] of the loss function with respect to the output value A^[l], and calculates the gradients of the loss function with respect to W, b, and x, i.e. ∂L/∂Z^[l], ∂L/∂W^[l], ∂L/∂b^[l], ∂L/∂A^[l−1]. Because partial derivative or gradient symbols cannot be typed in code, dA^[l], dZ^[l], dW^[l], db^[l] are used to represent ∂L/∂A^[l], ∂L/∂Z^[l], ∂L/∂W^[l], ∂L/∂b^[l]:

dW^[l] = (A^[l−1])ᵀ dZ^[l]
db^[l] = np.sum(dZ^[l], axis=0, keepdims=True)
dA^[l−1] = dZ^[l] (W^[l])ᵀ
class Dense(Layer):
    def __init__(self, input_dim, out_dim, activation=None):
        super().__init__()
        self.W = np.random.randn(input_dim, out_dim) * 0.01  #0.01 * np.random.randn
        self.b = np.zeros((1,out_dim))                       #np.zeros(out_dim)
        self.activation = activation
        self.A = None
    def g(self, z):
        if self.activation=='relu':
            return np.maximum(0, z)
        elif self.activation=='sigmoid':
            return 1 / (1 + np.exp(-z))
        else:
            return z
    def forward(self, x):
        # (restored: this method was lost in extraction) weighted sum, then activation
        self.x = x
        Z = np.dot(x, self.W) + self.b
        self.A = self.g(Z)
        return self.A
    def dZ_(self, dA_out):
        if self.activation=='relu':
            grad_g_z = 1. * (self.A > 0)  # should actually be 1. * (self.Z > 0), but both are equivalent
            return np.multiply(dA_out, grad_g_z)
        elif self.activation=='sigmoid':
            grad_g_z = self.A*(1-self.A)
            return np.multiply(dA_out, grad_g_z)
        else:
            return dA_out
    def backward(self, dA_out):
        # (restored: this method was lost in extraction) gradients of W, b and the input x
        dZ = self.dZ_(dA_out)
        self.dW = np.dot(self.x.T, dZ)
        self.db = np.sum(dZ, axis=0, keepdims=True)
        return np.dot(dZ, self.W.T)
You can test the forward() function of this neural network layer Dense:
import numpy as np
np.random.seed(1)
x = np.random.randn(3,48)   # 3 samples, each a vector of 48 features
dense = Dense(48,10,'none')
o = dense.forward(x)
print(o.shape)
print(o)
(3, 10)
[[-0.03953509 -0.00214997 0.00743433 -0.16926214 -0.05162853 0.06734225
-0.00221485 -0.11710758 -0.07046456 0.02609659]
[ 0.00848392 0.08259757 -0.09858177 0.0374092 -0.08303008 0.04151241
-0.01407859 -0.02415486 0.04236149 0.0648261 ]
[-0.13877363 -0.04122276 -0.00984716 -0.03461381 0.11513754 0.1043094
0.00170353 -0.00449278 -0.0057236 -0.01403174]]
The following code assumes that f is a function of a multivariate parameter p; that is, given p, the function value f(p) can be calculated. If the gradient ∂L/∂f of a loss function L with respect to f is known, then the gradient of the loss with respect to p is:

∂L/∂p = (∂L/∂f)(∂f/∂p) = df · (∂f/∂p)

If f contains multiple output values, that is, f(p) = (f_1(p), f_2(p), ⋯, f_n(p))ᵀ is a vector-valued function of the multivariate parameter p, and the gradient of the loss function L with respect to f is known, the gradient of L with respect to p can also be calculated according to the chain rule, i.e., the partial derivative with respect to each parameter p_j:

∂L/∂p_j = Σᵢ (∂L/∂f_i)(∂f_i/∂p_j) = Σᵢ df_i (∂f_i/∂p_j)

Each ∂f_i/∂p_j can be approximated with a numerical (central-difference) derivative, namely:

∂L/∂p_j = Σᵢ (∂L/∂f_i) (f_i(p_j+ϵ) − f_i(p_j−ϵ))/(2ϵ) = df · (f(p_j+ϵ) − f(p_j−ϵ))/(2ϵ)

Here, f is the forward() output of the network layer dense. If f = dense.forward(x) represents the calculation of this function, and this calculation depends on some parameter p, then the derivative of the loss function with respect to p can be implemented with the following function:
def numerical_gradient_from_df(f, p, df, h=1e-5):
    grad = np.zeros_like(p)
    it = np.nditer(p, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        oldval = p[idx]
        p[idx] = oldval + h
        pos = f()   # recall f() to calculate its output after the dependent parameter p[idx] changes
        p[idx] = oldval - h
        neg = f()   # and again after the change in the other direction
        p[idx] = oldval
        grad[idx] = np.sum((pos - neg) * df) / (2*h)  # (restored) chain rule: accumulate over all outputs
        it.iternext()                                 # (restored)
    return grad                                       # (restored)
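The helper diff_error used below does not appear in this excerpt; a plausible sketch of a relative-error measure (the exact definition is an assumption):

def diff_error(x, y):
    # maximum relative error between two arrays
    return np.max(np.abs(x - y) / np.maximum(1e-8, np.abs(x) + np.abs(y)))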
The following code first simulates the gradient df of a loss function with respect to the dense output, then calls dense.backward(df) to reverse-derive the gradients of the dense model parameters; the return value dx is the gradient of the dense input x. The numerical gradient function numerical_gradient_from_df above is then used to calculate the numerical gradient dx_num with respect to x, and the error between dx and dx_num is compared:
df = np.random.randn(3, 10)
dx = dense.backward(df)
dx_num = numerical_gradient_from_df(lambda :dense.forward(x),x,df)
print(diff_error(dx,dx_num))   # (restored print, matching the output below)
2.1851062625977136e-12
The error between the numerical gradient and the analytical gradient is very small, indicating that the analytical
gradient and numerical gradient calculated by backward() are almost the same. We can also compare whether the
gradient of the dense model parameters is consistent. The following code checks whether the gradient of the dense
model parameter W is consistent:
dW_num = numerical_gradient_from_df(lambda :dense.forward(x),dense.W,df)
print(diff_error(dense.dW,dW_num))
2.2715163083830703e-12
The numerical and analytical gradients of the model parameters are also very close. Therefore, it can be judged that the analytical-gradient code is basically correct.
class NeuralNetwork:
    # (the first half of this class was lost in extraction; __init__,
    # add_layer, forward, backward and predict are restored as a sketch
    # consistent with the surviving methods and with how nn is used below)
    def __init__(self):
        self._layers = []
    def add_layer(self, layer):
        self._layers.append(layer)
    def forward(self, X):
        for layer in self._layers:
            X = layer.forward(X)
        return X
    def backward(self, loss_grad, reg=0.):
        for i in reversed(range(len(self._layers))):
            loss_grad = self._layers[i].backward(loss_grad)
        for i in range(len(self._layers)):
            self._layers[i].dW += 2*reg * self._layers[i].W
    def predict(self, X):
        p = self.forward(X)   # multiple samples
        return np.argmax(p, axis=1)
    def reg_loss(self,reg):
        loss = 0
        for i in range(len(self._layers)):
            loss += reg*np.sum(self._layers[i].W*self._layers[i].W)
        return loss
    def update_parameters(self,learning_rate):
        for i in range(len(self._layers)):
            self._layers[i].W += -learning_rate * self._layers[i].dW
            self._layers[i].b += -learning_rate * self._layers[i].db
    def parameters(self):
        params = []
        for i in range(len(self._layers)):
            params.append(self._layers[i].W)
            params.append(self._layers[i].b)
        return params
    def grads(self):
        grads = []
        for i in range(len(self._layers)):
            grads.append(self._layers[i].dW)
            grads.append(self._layers[i].db)
        return grads
With the network layer class Layer and the neural network class NeuralNetwork, a 2-layer neural network model can be defined for practical problems such as classifying a set of points in the 2D plane:
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100, 'relu'))
nn.add_layer(Dense(100, 3, 'softmax'))
For multi-classification problems, the previous softmax_cross_entropy() and cross_entropy_grad() can be used to calculate the multi-class cross-entropy loss and its gradient with respect to the weighted sum:
X_temp = np.random.randn(2,2)
y_temp = np.random.randint(3, size=2)
F = nn.forward(X_temp)
loss = softmax_cross_entropy(F,y_temp)
loss_grad = cross_entropy_grad(F,y_temp)
print(loss,np.mean(loss_grad))
1.098695480580774 -9.25185853854297e-18
4.3.5 Gradient test of neural network
To ensure that the forward computation, loss function computation, and backward derivative of the neural network
are computed correctly, the numerical gradient can be compared with the analytical gradient.
import util
# Calculate the gradient of the model parameters according to the gradient loss_grad
# of the loss function on the output
nn.backward(loss_grad)
grads = nn.grads()
def loss_fun():
    F = nn.forward(X_temp)
    return softmax_cross_entropy(F,y_temp)
params = nn.parameters()
numerical_grads = util.numerical_gradient(loss_fun,params,1e-6)
for i in range(len(params)):
    print(numerical_grads[i].shape,grads[i].shape)
diff_errors(numerical_grads,grads)
The error between the numerical gradient and the analytical gradient is very small, indicating that the analytical
gradient is basically correct. Here is the code for the gradient descent algorithm:
def cross_entropy_grad_loss(F,y,softmax_out=False,onehot=False):
    if softmax_out:
        loss = cross_entropy_loss(F,y,onehot)
    else:
        loss = softmax_cross_entropy(F,y,onehot)
    loss_grad = cross_entropy_grad(F,y,onehot,softmax_out)
    return loss,loss_grad

def train(nn, X, y, loss_function, epochs=10000, learning_rate=1e-0, reg=1e-3, print_n=100):
    # (the loop header and forward/loss lines are restored around the
    # surviving lines, following the description below)
    for epoch in range(epochs):
        f = nn.forward(X)
        loss, loss_grad = loss_function(f, y)
        loss += nn.reg_loss(reg)
        nn.backward(loss_grad,reg)
        nn.update_parameters(learning_rate);
        if epoch % print_n == 0:
            print("iteration %d: loss %f" % (epoch, loss))
For the training samples (X, y), each iteration of the gradient descent method first computes the output f = nn.forward(X), then computes the loss and the gradient of the loss function with respect to the output, loss, loss_grad = loss_function(f, y), then uses reverse derivation starting from this gradient to compute the gradients of the model parameters, nn.backward(loss_grad, reg), and finally updates the model parameters with nn.update_parameters(learning_rate).
Use the above data training set to train the model and output the accuracy of the model prediction:
import data_set as ds
np.random.seed(89)
X,y = ds.gen_spiral_dataset()
epochs=10000
learning_rate=1e-0
reg = 1e-4
print_n = epochs//10
train(nn,X,y,loss_gradient_softmax_crossentropy,epochs,learning_rate,reg,print_n)
print(np.mean(nn.predict(X)==y))
The above train() function trains with all the samples of the training set at once. Usually the mini-batch gradient descent algorithm train_batch() is used instead: each iteration takes a part of the samples from the training set for training, using the iterator function data_iter defined in Section 2.6 (the code is in the python file data_set.py). Retrain with train_batch():
def data_iter(X,y,batch_size,shuffle=False):
    m = len(X)
    indices = list(range(m))
    if shuffle:  # shuffle is True to shuffle the order
        np.random.shuffle(indices)
    for i in range(0, m - batch_size + 1, batch_size):
        batch_indices = np.array(indices[i: min(i + batch_size, m)])
        yield X.take(batch_indices,axis=0), y.take(batch_indices,axis=0)

def train_batch(nn,XX,YY,loss_function,epochs=10000,batch_size=50,learning_rate=1e-0,reg=1e-3,print_n=10):
    iter = 0
    for epoch in range(epochs):
        for X,y in data_iter(XX,YY,batch_size,True):
            f = nn.forward(X)
            loss,loss_grad = loss_function(f,y)
            loss += nn.reg_loss(reg)
            nn.backward(loss_grad,reg)
            nn.update_parameters(learning_rate);
            if iter % print_n == 0:
                print("iteration %d: loss %f" % (iter, loss))
            iter += 1
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100, 'relu'))
nn.add_layer(Dense(100, 3))
epochs=1000
batch_size=50
learning_rate=1e-0
reg = 1e-4
print_n = epochs*len(X)//batch_size//10
train_batch(nn,X,y,cross_entropy_grad_loss,epochs,batch_size,learning_rate,reg,print_n)
print(np.mean(nn.predict(X)==y))
4.3.6 MNIST data handwritten digit recognition based on deep learning framework
Next, test the MNIST data set. First, download the MNIST data set, in which each digit image has already been converted into a 784-dimensional vector.
#%%time
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
    # Load the dataset
    urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
                               "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    # (restored: these lines were lost) each set is an (images, labels) pair
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
train_X, train_y = train_set
valid_X, valid_y = valid_set
print(train_X.dtype)
print(train_X.shape)
print(valid_X.shape)
float32
(50000, 784)
(10000, 784)
digit = train_set[0][9].reshape(28,28)
plt.imshow(digit,cmap='gray')
plt.colorbar()
plt.show()
The neural network model defined in Figure 4-30 is used as the classifier function for handwritten digit image recognition.
nn = NeuralNetwork()
nn.add_layer(Dense(784, 200, 'relu'))
nn.add_layer(Dense(200, 100, 'relu'))
nn.add_layer(Dense(100, 10))
epochs = 25
batch_size = 32
learning_rate = 0.1
reg = 1e-3
print_n = 25*len(train_X)//32//10
train_batch(nn,train_X,train_y,cross_entropy_grad_loss,epochs,batch_size,learning_rate,reg,print_n)
print(np.mean(nn.predict(valid_X)==valid_y))
print(nn.predict(valid_X[9:10]),valid_y[9])
4.3.7 Improved general neural network framework: separate weighted sum and
activation function
The Dense layer of the above neural network framework combines the weighted sum and the activation function: the Dense class contains the forward and reverse calculation of both. To increase flexibility, the weighted sum and the activation function can be decomposed into two classes, so that they are computed separately and new activation functions can easily be added later.
The Layer class adds a member variable params to save the parameters of the model, and adds a method reg_loss_grad() that adds the gradient of the regularization term of the loss function to the gradients of the model parameters.
The new Dense class only performs the weighted-sum calculation. Its constructor accepts a parameter describing how to randomly initialize the weight parameters, and initializes them according to the chosen initialization method. The data feature accepted by the Dense class is not necessarily a vector; it may also be a multi-channel two-dimensional image (for example, a color image contains red, green, and blue channels, each a two-dimensional array). Therefore, both the forward() and backward() methods first flatten the multi-channel input data of each sample into a one-dimensional vector.
class Layer:
    def __init__(self):
        self.params = None
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad):
        raise NotImplementedError
    def reg_grad(self, reg):
        pass
    def reg_loss(self, reg):
        return 0.
class Dense(Layer):
    # Z = XW+b
    def __init__(self, input_dim, out_dim, init_method=('random',0.01)):
        super().__init__()
        random_method_name, random_value = init_method
        if random_method_name == "random":
            self.W = np.random.randn(input_dim, out_dim) * random_value  #0.01 * np.random.randn
            self.b = np.random.randn(1,out_dim) * random_value
        elif random_method_name == "he":
            self.W = np.random.randn(input_dim, out_dim)*np.sqrt(2/input_dim)
            #self.b = np.random.randn(1,out_dim)* random_value
            self.b = np.zeros((1,out_dim))
        elif random_method_name == "xavier":
            self.W = np.random.randn(input_dim, out_dim)*np.sqrt(1/input_dim)
            self.b = np.random.randn(1,out_dim) * random_value
        elif random_method_name == "zeros":
            self.W = np.zeros((input_dim, out_dim))
            self.b = np.zeros((1,out_dim))
        else:
            self.W = np.random.randn(input_dim, out_dim) * random_value
            self.b = np.zeros((1,out_dim))
        self.params = [self.W, self.b]
        self.grads = [np.zeros_like(self.W), np.zeros_like(self.b)]
    def forward(self, x):
        # (restored: lost in extraction) flatten each sample to a row vector
        self.x = x
        return np.dot(x.reshape(x.shape[0], -1), self.W) + self.b
    def backward(self, grad_output):
        # (restored: lost in extraction) gradients of W, b and the input x
        x_flat = self.x.reshape(self.x.shape[0], -1)
        self.grads[0][...] = np.dot(x_flat.T, grad_output)
        self.grads[1][...] = np.sum(grad_output, axis=0, keepdims=True)
        dx = np.dot(grad_output, self.W.T).reshape(self.x.shape)
        return dx
    def reg_grad(self, reg):
        # (restored) add the gradient of the L2 regularization term
        self.grads[0] += 2*reg * self.W
    def reg_loss(self, reg):
        return reg*np.sum(self.W**2)
    def reg_loss_grad(self, reg):
        self.grads[0] += 2*reg * self.W
        return reg*np.sum(self.W**2)
If x is 3 samples, each sample having 3 channels and each channel being a 4×4 image, the following code performs the forward calculation with these 3 samples as input:
import numpy as np
np.random.seed(1)
x = np.random.randn(3,3,4, 4) #3 samples, 3 channels, each channel is a 4x4 image
dense = Dense(3*4*4,10,('no',0.01))
o = dense.forward(x)
print(o.shape)
print(o)
(3, 10)
[[-0.03953509 -0.00214997 0.00743433 -0.16926214 -0.05162853 0.06734225
-0.00221485 -0.11710758 -0.07046456 0.02609659]
[ 0.00848392 0.08259757 -0.09858177 0.0374092 -0.08303008 0.04151241
-0.01407859 -0.02415486 0.04236149 0.0648261 ]
[-0.13877363 -0.04122276 -0.00984716 -0.03461381 0.11513754 0.1043094
0.00170353 -0.00449278 -0.0057236 -0.01403174]]
Gradient Validation
As before, to verify whether the reverse derivation of dense is correct, you can simulate the gradient do of a loss function with respect to the dense output vector, then use dense.backward() to perform the reverse derivation, and compare the result against the gradient computed by the numerical gradient function numerical_gradient_from_df. Because the size of the dense output vector is 10, the following code passes the input x of 3 samples through this dense layer to produce an output o of shape 3×10; the simulated gradient do is therefore a multidimensional array of the same shape.
If the gradient of the loss function with respect to the output vector is known, the gradients of the model parameters and of intermediate variables such as x can be computed in reverse from this gradient. backward() returns the gradient dx of the dense-layer input x, which is compared against the numerical gradient dx_num. Similarly, for the weight parameter dense.params[0] of the model, the error between the analytical gradient dense.grads[0] and the numerical gradient dW_num is compared:
do = np.random.randn(3, 10)
dx = dense.backward(do)
dx_num = numerical_gradient_from_df(lambda :dense.forward(x),x,do)
print(diff_error(dx,dx_num))                 # (restored prints, matching the
dW_num = numerical_gradient_from_df(lambda :dense.forward(x),dense.params[0],do)
print(diff_error(dense.grads[0],dW_num))     #  outputs below)
print(dense.grads[0][:3])                    # (restored; the two arrays agree)
print(dW_num[:3])
3.638244314951079e-09
1.3450414982951384e-11
[[ 1.77463167 0.11663492 1.87794917 0.27986781 1.27243915 -2.44375556
-2.1266117 0.99629747 -0.73720237 -0.68570287]
[-0.69807196 0.22547472 -0.93721649 0.3286185 -1.0421723 0.66487528
1.33111205 0.25677848 -0.58451408 0.71015412]
[ 0.12251147 -0.4041516 0.57764614 0.89962639 -0.35195022 0.77829011
-0.01618803 -0.62209694 -1.28543176 -0.37554316]]
[[ 1.77463167 0.11663492 1.87794917 0.27986781 1.27243915 -2.44375556
-2.1266117 0.99629747 -0.73720237 -0.68570287]
[-0.69807196 0.22547472 -0.93721649 0.3286185 -1.0421723 0.66487528
1.33111205 0.25677848 -0.58451408 0.71015412]
[ 0.12251147 -0.4041516 0.57764614 0.89962639 -0.35195022 0.77829011
-0.01618803 -0.62209694 -1.28543176 -0.37554316]]
You can also connect a loss function to the Dense layer to compare the analytical gradient and numerical gradient
of the loss function with respect to the parameters of the Dense model:
import util
x = np.random.randn(3,3,4, 4)
y = np.random.randn(3,10)
dense = Dense(3*4*4,10,('no',0.01))
f = dense.forward(x)
loss,do = mse_loss_grad(f,y)
dx = dense.backward(do)
def loss_f():
    f = dense.forward(x)
    loss = mse_loss(f,y)
    return loss
dW_num = util.numerical_gradient(loss_f,dense.params[0],1e-6)
print(diff_error(dense.grads[0],dW_num))
print(dense.grads[0][:2])
print(dW_num[:2])
2.0148860313259954e-07
[[ 0.47568681 -0.06324119 -0.29294422 -0.76304343 -0.09660146 0.62794569
1.16087896 0.06261028 -0.6611078 -0.02940735]
[-0.10777785 -1.47174583 0.63258553 1.22381944 -0.35702633 0.4409597
-2.42444873 -0.28804741 -1.33377026 0.66775208]]
[array([ 0.47568681, -0.06324119, -0.29294422, -0.76304343, -0.09660146,
         0.62794569,  1.16087896,  0.06261028, -0.6611078 , -0.02940735]),
 array([-0.10777785, -1.47174583,  0.63258553,  1.22381944, -0.35702633,
         0.4409597 , -2.42444873, -0.28804741, -1.33377026,  0.66775208])]
Because the Dense layer only calculates the weighted sum, it no longer needs to compute the value or the derivative of an activation function, and it becomes very simple. Each activation function can be implemented individually as an activation-function layer class. The following code defines the activation layers corresponding to the most commonly used activation functions in neural networks:
class Relu(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, x):
        self.x = x
        return np.maximum(0, x)
    def backward(self, grad_output):
        # If x>0, the derivative is 1, otherwise 0
        x = self.x
        relu_grad = x > 0
        return grad_output * relu_grad

class Sigmoid(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, x):
        self.x = x
        return 1.0/(1.0 + np.exp(-x))
    def backward(self, grad_output):
        x = self.x
        a = 1.0/(1.0 + np.exp(-x))
        return grad_output * a*(1-a)

class Tanh(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, x):
        self.x = x
        self.a = np.tanh(x)
        return self.a
    def backward(self, grad_output):
        d = (1-np.square(self.a))
        return grad_output * d

class Leaky_relu(Layer):
    def __init__(self,leaky_slope):
        super().__init__()
        self.leaky_slope = leaky_slope
    def forward(self, x):
        self.x = x
        return np.maximum(self.leaky_slope*x,x)
    def backward(self, grad_output):
        x = self.x
        d = np.zeros_like(x)
        d[x<=0] = self.leaky_slope
        d[x>0] = 1
        return grad_output * d
An activation layer has no model parameters; it simply transforms the input x to produce an output, and the input and output tensors have the same shape. Likewise, numerical gradients can be used to check that the analytical gradients of the activation layers are correct. The following code checks the error between the analytical and numerical gradient of each activation layer above, using a simulated gradient do of the loss function with respect to the activation layer's output:
import numpy as np
np.random.seed(1)
x = np.random.randn(3,3,4, 4)
do = np.random.randn(3,3,4, 4)
relu = Relu()
relu.forward(x)
dx = relu.backward(do)
dx_num = numerical_gradient_from_df(lambda :relu.forward(x),x,do)
print(diff_error(dx,dx_num))
leaky_relu = Leaky_relu(0.1)
leaky_relu.forward(x)
dx = leaky_relu.backward(do)
dx_num = numerical_gradient_from_df(lambda :leaky_relu.forward(x),x,do)
print(diff_error(dx,dx_num))
tanh = Tanh()
tanh.forward(x)
dx = tanh.backward(do)
dx_num = numerical_gradient_from_df(lambda :tanh.forward(x),x,do)
print(diff_error(dx,dx_num))
sigmoid = Sigmoid()
sigmoid.forward(x)
dx = sigmoid.backward(do)
dx_num = numerical_gradient_from_df(lambda :sigmoid.forward(x),x,do)
print(diff_error(dx,dx_num))
3.2756345281587516e-12
7.43892997215858e-12
5.170019175240593e-11
3.282573028416693e-11
Since the analytical and numerical gradients of these activation layers agree to within tiny errors, one can be fairly confident that the analytical gradient code is correct.
Based on the dense layer and each activation layer, a class NeuralNetwork representing a neural network can be
defined:
class NeuralNetwork:
    def __init__(self):
        self._layers = []
        self._params = []
    def reg_loss(self,reg):
        reg_loss = 0
        for i in range(len(self._layers)):
            reg_loss += self._layers[i].reg_loss(reg)
        return reg_loss
    def parameters(self):
        return self._params
    def zero_grad(self):
        for i,_ in enumerate(self._params):
            self._params[i][1][:] = 0  # each element of _params is [w, dw]
    def get_parameters(self):
        return self._params
add_layer() in this class adds a layer to the neural network; forward() accepts input data and produces the corresponding output; __call__() makes the object callable, so that for a NeuralNetwork object nn and input X, nn(X) is equivalent to nn.forward(X); backward() accepts the gradient of the loss function with respect to the network output and performs the backward computation to obtain the gradients of the loss with respect to the model parameters and intermediate variables.
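Since these methods are not shown in the listing above, here is a minimal sketch of how they might look; it is an assumption consistent with the usage in this chapter (each layer exposes params and grads lists, and parameterized layers accept a regularization coefficient in backward()), not the book's exact code:

    def add_layer(self, layer):
        self._layers.append(layer)
        if layer.params:  # collect [parameter, gradient] pairs
            for p, g in zip(layer.params, layer.grads):
                self._params.append([p, g])
    def forward(self, X):
        for layer in self._layers:
            X = layer.forward(X)
        return X
    def __call__(self, X):
        return self.forward(X)
    def backward(self, grad, reg=None):
        for layer in reversed(self._layers):
            if layer.params and reg is not None:
                grad = layer.backward(grad, reg)
            else:
                grad = layer.backward(grad)
        return grad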
To make sure forward() and backward() are correct, they can again be checked against numerical gradients. The following code defines a simple neural network and, with a set of randomly generated samples (x, y), computes and compares the analytical gradients calculated by backward() with the numerical gradients obtained from the general numerical gradient function of Section 1.4, to see whether the two calculations agree.
import util
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100,('no',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 3,('no',0.01)))
x = np.random.randn(5,2)
y = np.random.randint(3, size=5)
f = nn.forward(x)
dZ = cross_entropy_grad(f,y)  # util.grad_softmax_cross_entropy(f,y)
nn.zero_grad()  # reset gradients to zero
reg = 0.1
dx = nn.backward(dZ,reg)
def loss_fn():
    f = nn.forward(x)
    loss = softmax_cross_entropy(f,y)  # util.softmax_cross_entropy(f,y)
    return loss+nn.reg_loss(reg)
params = nn.parameters()
numerical_grads = util.numerical_gradient(loss_fn,params,1e-6)
for i in range(len(numerical_grads)):
    print(diff_error(params[i][1],numerical_grads[i]))
1.892395698905401e-06
1.7651393552515298e-06
2.306498772862026e-06
2.3545204992835373e-10
It can be seen that the numerical and analytical gradients are very close, so it can be tentatively concluded that there is nothing wrong with the model's forward() and backward().
class SGD():
    def __init__(self,model_params,learning_rate=0.01,momentum=0.9):
        self.params,self.lr,self.momentum = model_params,learning_rate,momentum
        self.vs = [np.zeros_like(p) for p,grad in self.params]
    def zero_grad(self):
        for i,_ in enumerate(self.params):
            self.params[i][1].fill(0)
    def step(self):
        # momentum update: v = momentum*v - lr*grad; p += v
        for i,_ in enumerate(self.params):
            p,grad = self.params[i]
            self.vs[i] = self.momentum*self.vs[i] - self.lr*grad
            p += self.vs[i]
    def scale_learning_rate(self,scale):
        self.lr *= scale
The model_params parameter of the SGD optimizer's constructor is a Python list; each element is itself a list containing a model parameter and its gradient. If a model has 2 parameters W and b with corresponding gradients dW and db, then model_params has the form:
[[W,dW],[b,db]]
The other two constructor parameters are the learning rate learning_rate of the gradient descent algorithm and the momentum parameter of the momentum optimization strategy. Setting momentum to 0 is equivalent to the most basic gradient update without momentum.
The zero_grad() method of SGD resets the gradients of all parameters to 0, and step() updates the model parameters according to the gradients and the optimization strategy. Because gradient descent sometimes needs to adjust the learning rate during iteration, scale_learning_rate() is provided to rescale it.
The gradient descent method can update the model parameters by defining an optimizer object optimizer of the
SGD class:
learning_rate = 1e-1
momentum = 0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
Similarly, other optimizer classes can also be defined, such as the following Adam optimizer:
class Adam():
    def __init__(self,model_params,learning_rate=0.01,beta_1=0.9,beta_2=0.999,epsilon=1e-8):
        self.params,self.lr = model_params,learning_rate
        self.beta_1,self.beta_2,self.epsilon = beta_1,beta_2,epsilon
        self.ms = []
        self.vs = []
        self.t = 0
        for p,grad in self.params:
            self.ms.append(np.zeros_like(p))
            self.vs.append(np.zeros_like(p))
    def zero_grad(self):
        for i,_ in enumerate(self.params):
            self.params[i][1].fill(0)
    def step(self):
        beta_1,beta_2,lr = self.beta_1,self.beta_2,self.lr
        self.t += 1
        t = self.t
        for i,_ in enumerate(self.params):
            p,grad = self.params[i]
            self.ms[i] = beta_1*self.ms[i]+(1-beta_1)*grad
            self.vs[i] = beta_2*self.vs[i]+(1-beta_2)*grad**2
            # bias-corrected moment estimates and parameter update
            m_hat = self.ms[i]/(1-beta_1**t)
            v_hat = self.vs[i]/(1-beta_2**t)
            p -= lr*m_hat/(np.sqrt(v_hat)+self.epsilon)
    def scale_learning_rate(self,scale):
        self.lr *= scale
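As a usage sketch (the constructor defaults above are assumed), an Adam optimizer is created the same way as SGD:

optimizer = Adam(nn.parameters(), learning_rate=0.001)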
More optimizers and the following training function train_nn are included in train.py.
The following training function train_nn() accepts a data iterator and takes a batch of training samples (input, target) from it each time. For each batch, it first runs forward() to compute the output, then uses the loss function to compute both the loss and the gradient loss_grad of the loss with respect to the output; loss_grad is propagated backward through backward() to obtain the gradients of the model parameters and intermediate variables, and finally the optimizer's step() updates the model parameters.
def train_nn(nn,X,y,optimizer,loss_fn,epochs=100,batch_size=50,reg=1e-3,print_n=10):
    iter = 0
    losses = []
    for epoch in range(epochs):
        for X_batch,y_batch in data_iter(X,y,batch_size):
            optimizer.zero_grad()
            f = nn(X_batch)  # nn.forward(X_batch)
            loss,loss_grad = loss_fn(f,y_batch)
            nn.backward(loss_grad,reg)
            loss += nn.reg_loss(reg)
            optimizer.step()
            losses.append(loss)
            if iter%print_n==0:
                print(iter,"iter:",loss)
            iter += 1
    return losses
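The helper data_iter() used above is assumed to yield shuffled mini-batches; its definition is not shown here, so the following is a minimal sketch:

def data_iter(X, y, batch_size):
    # yield (X_batch, y_batch) pairs over a random permutation of the samples
    idx = np.random.permutation(len(X))
    for i in range(0, len(X), batch_size):
        j = idx[i:i + batch_size]
        yield X[j], y[j]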
Now this neural network can be used to train the earlier 3-class classification problem:
import data_set as ds
import util
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Dense(2, 100,('no',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 3,('no',0.01)))
X,y = ds.gen_spiral_dataset()
epochs=5000
batch_size = len(X)
reg = 0.5e-3
print_n=480
learning_rate = 1e-1
momentum = 0.5
optimizer = SGD(nn.parameters(),learning_rate,momentum)
losses = train_nn(nn,X,y,optimizer,cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
0 iter: 1.0985916677722303
480 iter: 0.7056240023920841
960 iter: 0.6422407772314334
1440 iter: 0.5246104670488081
1920 iter: 0.4186441561530432
2400 iter: 0.37118840941018727
2880 iter: 0.34583485668931857
3360 iter: 0.32954842747580104
3840 iter: 0.31961537369884196
4320 iter: 0.3124394704919282
4800 iter: 0.30620107113884415
The complete code of the improved neural network framework is in the file NeuralNetwork.py. It can be used, for example, to train a classifier on the Fashion-MNIST dataset, which the following code reads:
import mnist_reader
X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
X_test, y_test = mnist_reader.load_mnist('data/fashion', kind='t10k')
print(X_train.shape,y_train.shape)
print(X_train.dtype,y_train.dtype)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
trainX = X_train.reshape(-1,28,28)
print(trainX.shape)
# plot the first few images
for i in range(9):
    # define subplot
    plt.subplot(330 + 1 + i)
    # plot raw pixel data
    plt.imshow(trainX[i], cmap=plt.get_cmap('gray'))
# show the figure
plt.show()
Looking at the data values, you can see that the original values are integers from 0-255, which can be converted into real numbers between 0 and 1 by dividing by 255:
train_X = trainX.astype('float32')/255.0
print(np.mean(trainX),np.mean(train_X))
72.94035223214286 0.2860402
nn = NeuralNetwork()
nn.add_layer(Dense(784, 500))
nn.add_layer(Relu())
nn.add_layer(Dense(500, 200))
nn.add_layer(Relu())
nn.add_layer(Dense(200, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
Start training:
epochs=8
batch_size = 64
reg = 0  #1e-3
print_n=1000
optimizer = SGD(nn.parameters(),learning_rate,momentum)  # assumed: the optimizer creation appears to have been lost in extraction
losses = train_nn(nn,train_X,y_train,optimizer,cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
plt.plot(losses)
0 iter: 2.3016755298047347
1000 iter: 1.1510374540057933
2000 iter: 0.47471113470221005
3000 iter: 0.5333139450988945
4000 iter: 0.259167391843765
5000 iter: 0.3629363583454308
6000 iter: 0.3486191552507917
7000 iter: 0.4914253677369693
0.87965
0.8585
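The two numbers above are presumably the training and test accuracies; the code that printed them was lost in extraction. A minimal sketch of how they might be computed, assuming the network outputs one score per class and that the Dense layer flattens image-shaped input as shown earlier:

def accuracy(nn, X, y):
    # fraction of samples whose highest-scoring class matches the label
    f = nn(X)
    return np.mean(np.argmax(f, axis=1) == y)

print(accuracy(nn, train_X, y_train))
print(accuracy(nn, X_test.astype('float32')/255.0, y_test))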
To save and load a trained model, save_parameters() and load_parameters() methods can be added to the NeuralNetwork class:
    def reg_loss(self,reg):
        reg_loss = 0
        for i in range(len(self._layers)):
            reg_loss += self._layers[i].reg_loss(reg)
        return reg_loss
    def parameters(self):
        return self._params
    def zero_grad(self):
        for i,_ in enumerate(self._params):
            self._params[i][1][:] = 0
    def get_parameters(self):
        return self._params
    def save_parameters(self,filename):
        params = {}
        for i in range(len(self._layers)):
            if self._layers[i].params:
                params[i] = self._layers[i].params
        np.save(filename, params)
    def load_parameters(self,filename):
        params = np.load(filename,allow_pickle = True)
        count = 0
        for i in range(len(self._layers)):
            if self._layers[i].params:
                layer_params = params.item().get(i)
                self._layers[i].params = layer_params
                for j in range(len(layer_params)):
                    self._params[count][0] = layer_params[j]
                    count += 1
The following code tests the model's parameter saving and loading functions:
from NeuralNetwork import *
nn = NeuralNetwork()
nn.add_layer(Dense(3, 2,('xavier',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(2, 4,('xavier',0.01)))
nn.add_layer(Relu())
def print_nn_parameters(params,print_grad=False):
    for p,grad in params:
        print("p",p)
        if print_grad:
            print("grad",grad)
        print()
print_nn_parameters(nn.get_parameters())
nn.save_parameters('model_params.npy')
nn.load_parameters('model_params.npy')
print_nn_parameters(nn.get_parameters())
p [[ 0.0027318   0.00063939]
 [-0.00144845  0.00138133]
 [-0.01521812  0.0023785 ]]
p [[0. 0.]]
p [[0. 0. 0. 0.]]
p [[ 0.0027318   0.00063939]
 [-0.00144845  0.00138133]
 [-0.01521812  0.0023785 ]]
p [[0. 0.]]
p [[0. 0. 0. 0.]]
Chapter 5 Basic Techniques for Improving Neural Network
Performance
Data and algorithms are the two core elements of machine learning; both high-quality data and good algorithms can improve its performance. Acquiring high-quality data comes at a cost, so knowing how to improve the quality and quantity of data through processing and augmentation of existing data, and how to improve the performance of a model and its training algorithm through various practical techniques, are essential skills for deep learning practice.
The following code can read an image using the io module of skimage:
import numpy as np
import matplotlib.pyplot as plt
from skimage import io, transform
image = io.imread('cat.png')
print(image.shape)
plt.imshow(image)
plt.show()
(403, 544, 3)
Figure 5-1 Original image
The height and width of the image are 403 and 544 respectively, and a color image consists of three channel images: red (R), green (G), and blue (B); that is, the pixel at each position is composed of a red, a green, and a blue component.
Various transformations can be performed on the image through numpy array operations, such as image[:,::-1,
:], which can perform horizontal mirror flip on the image:
img = image[:,::-1, :]
plt.imshow(img)
plt.show()
from skimage import color
# assumed: the definition of convert() was lost in extraction; it appears to convert RGB to YUV
def convert(rgbimg):
    yuvimg = color.rgb2yuv(rgbimg)*255
    return yuvimg.astype(np.uint8)
img = convert(image)
plt.imshow(img)
plt.show()
np.invert() inverts the colors of the image:
img = np.invert(image)
plt.imshow(img)
plt.show()
Different modules of skimage such as util, transform, etc. provide functions for different transformations of
images, such as adding noise to images with the random_noise() function of util:
from skimage import util
img = util.random_noise(image)
plt.imshow(img)
plt.show()
Convert a color image with multiple color channels to a single-channel grayscale image (black and white image):
from skimage import color
img = color.rgb2gray(image)
print(img.shape)
plt.imshow(img,cmap='gray')
plt.show()
(403, 544)
Many other Python packages can also be used to process image data; for example, scipy's image processing module ndimage can blur an image:
from scipy import ndimage
img = ndimage.uniform_filter(image, size=(11, 11, 1))
plt.imshow(img)
plt.show()
Like image data, other data such as text and audio can be augmented in various ways, and public data-processing packages can be used to make this efficient. Data augmentation can multiply the total amount of data, which helps reduce overfitting; although the augmented data is correlated with the original data, it avoids the cost of acquiring brand-new data.
5.1.2 Normalization
As with simple regression, data with very large absolute values can cause numerical overflow in the network, make gradient descent very slow, and give differently scaled features different influence on the algorithm ("feature bias"), all of which make training hard to converge. Therefore, data that is not normalized should be normalized before training the neural network. Usually each feature is normalized separately: for each feature $x_i$, compute the mean $x_i\_mean$ and standard deviation $x_i\_std$ of that feature over the training set, and then normalize the feature of all samples into a small range around 0 such as [0, 1], [−0.5, 0.5] or [−1, 1], usually with the formula:

$$\frac{x_i - x_i\_mean}{x_i\_std}$$

This normalization can be implemented with the following Python code:
X -= np.mean(X, axis = 0)
X /= np.std(X, axis = 0)
If all features take values in roughly the same range, they can also be standardized together: a single mean x_mean and standard deviation x_std is computed from all features of all training data, and every feature is then normalized with this shared mean and standard deviation. For example, for an image whose pixel color values are one-byte non-negative integers in the range [0, 255], the image can simply be divided by 255 to map these values into [0, 1], without computing a mean and variance for each feature.
Note that the samples in the validation and test sets must not be normalized with their own statistics; otherwise they would not share the same normalization standard as the training set, and predictions made on them by the trained model would be meaningless. That is, when predicting on validation or test samples, they are normalized with the same normalization parameters (mean and standard deviation) as the training set.
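As a short illustration (a sketch; X_train and X_test are assumed here to be feature matrices with one sample per row), the training-set statistics are reused for the test set:

mu = np.mean(X_train, axis=0)
sigma = np.std(X_train, axis=0)
X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma  # same mu and sigma as the training set, not recomputed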
Feature engineering means discovering and extracting, from raw data, good features that help machine learning. It is one of the most fundamental problems in traditional machine learning, and different fields often use hand-crafted features specific to the field. Feature engineering includes many concrete techniques such as data preprocessing (e.g. normalization), dimensionality reduction, feature selection, manual feature design, and feature learning.
Principal Component Analysis (PCA) is a classic dimensionality-reduction technique in machine learning. It represents the data as a linear combination of principal components, which removes the correlation between data features, and then uses a small number of principal components to represent the original data, thereby reducing the dimensionality of the data, i.e. the number of features. For example, a 256*256-pixel color face image requires 256*256*3 = 196608 values, i.e. its dimension is 196608, yet a face image expressed as a linear combination of 23 principal components can retain about 97% of the information of the original image, so a face then needs only 23 values.
For the data points of the following two-dimensional plane, each data point is represented by 2 coordinates:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Generate random sample points near the line y=x+2
np.random.seed(1)
pts = 25
x = np.random.randn(pts,1)  # Randomly sample some x coordinates
y = x+2
y = y + np.random.randn(pts,1)*0.2  # Add random noise to y
plt.plot(x,y,'o')
plt.xlabel('x')
plt.ylabel('y')
plt.axis('equal')
plt.show()
Figure 5-13 Randomly sampled two-dimensional data points near the line y=x+2
Taking each point's coordinates as a row, all point coordinates can be placed in a matrix X; the following code builds X and displays the first 3 points:
X = np.stack((x.flatten(), y.flatten()), axis=-1)
print(X.shape)
print(X[:3])
(25, 2)
[[ 1.62434536 3.48759979]
[-0.61175641 1.36366554]
[-0.52817175 1.28467436]]
PCA can be used to reduce the dimension so that each point is represented by a single value instead of 2. The first step of PCA is to center each dimension (axis), that is, to subtract from each dimension's components the mean of all components of that dimension:
X -= np.mean(X, axis = 0)
print(X[:3])
plt.plot(X[:,0],X[:,1],'o')
plt.axis('equal')
plt.show()
[[ 1.63707525 1.50798964]
[-0.59902653 -0.61594461]
[-0.51544186 -0.69493579]]
Figure 5-14 Data centering: each dimension (feature) of the data minus the mean of all components of that dimension, so that each feature is centered at 0
Suppose a matrix A consists of a set of three-dimensional coordinate points, with each row holding the three coordinates of one point:

$$A = \begin{pmatrix} 1 & 3 & 2 \\ -4 & 2 & 6 \\ 2 & 6 & 4 \\ -3 & 0 & 1 \end{pmatrix}$$

That is, A has 4 samples, each with 3 features (the x, y, z coordinates). Are the features of these samples correlated? The matrix $A^T A$ (the covariance matrix, up to a scale factor) measures the degree of correlation between the features:

$$A^T A = \begin{pmatrix} 1 & -4 & 2 & -3 \\ 3 & 2 & 6 & 0 \\ 2 & 6 & 4 & 1 \end{pmatrix} \begin{pmatrix} 1 & 3 & 2 \\ -4 & 2 & 6 \\ 2 & 6 & 4 \\ -3 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 30 & 7 & -17 \\ 7 & 49 & 42 \\ -17 & 42 & 57 \end{pmatrix}$$
A = np.array([[1,3,2],[-4,2,6],[2,6,4],[-3,0,1]])
print(A)
print("A^TA:\n",np.dot(A.transpose(),A))
[[ 1  3  2]
 [-4  2  6]
 [ 2  6  4]
 [-3  0  1]]
A^TA:
[[ 30 7 -17]
[ 7 49 42]
[-17 42 57]]
From the entries of this covariance matrix, the correlation value between x and y is 7 while that between y and z is 42, indicating that y and z are relatively strongly correlated and x and y only weakly. Usually the covariance matrix is divided by the number of samples to remove the influence of the sample count on its values. For the sample matrix X above, its covariance matrix is computed as follows:
cov = np.dot(X.T, X) / X.shape[0] # covariance matrix
Applying SVD to the covariance matrix yields the principal components (eigenvectors) U and the singular values S; for a covariance matrix the singular values equal the variances (eigenvalues) along the principal directions and indicate the degree of spread:
U,S,V = np.linalg.svd(cov)
print(U)
print(S)
print(S[0]/(S[0]+S[1]))
[[-0.68302064 -0.73039907]
[-0.73039907 0.68302064]]
[2.46815362 0.01168714]
0.995287139793862
Each column of U is a principal component, representing a main direction of variation of the data (a principal axis direction), as shown in Figure 5-15:
plt.plot(X[:,0],X[:,1],'o')
plt.plot([0,U[0,0]], [0,U[1,0]])
plt.plot([0,U[0,1]], [0,U[1,1]])
plt.axis('equal')
plt.show()
Figure 5-15 SVD decomposition of the covariance matrix yields the principal directions (eigenvectors) and the variances along them (eigenvalues)
S[0] and S[1] indicate how much of the data's variation lies along each principal direction; clearly the first principal component accounts for the larger share. By projecting the data onto the axes defined by the principal components U, the data can be expressed in terms of principal-component coordinates.
Xrot = np.dot(X, U)
print(Xrot[:5])
[[-2.21959042 -0.16573019]
[ 0.85903285 0.01682553]
[ 0.85963789 -0.09817723]
[ 1.53210054 0.01886974]
[-1.32424593 0.03607588]]
Display the points given by these principal-component coordinates on the principal axes:
plt.plot(Xrot[:,0],Xrot[:,1],'o')
plt.axis('equal')
plt.show()
Figure 5-16 Rotating the data aligns the principal components with the coordinate axes
Converting the data into its principal-component representation removes the correlation between the new features; its covariance matrix shows that the off-diagonal values, which relate different features, become 0.
print(np.dot(Xrot.transpose(),Xrot))
[[6.17038405e+01 9.38138456e-15]
[9.38138456e-15 2.92178571e-01]]
Using only the coordinate of the first principal component to represent these samples, the information loss is about (1 − 0.995287139793862)*100% = 0.472%, which is negligible. Representing data samples as a linear combination of a few principal components is called dimensionality reduction; for this example, the dimension of the sample data can be reduced from 2 features to 1, achieving the goal of reducing the number of sample features.
Xrot_reduced = np.dot(X, U[:,:1])  # Xrot_reduced becomes an [N x 1] array
print(Xrot_reduced[:3])
plt.plot(Xrot_reduced[:],[0]*pts,'o')
plt.axis('equal')
plt.show()
[[-2.21959042]
 [ 0.85903285]
 [ 0.85963789]]
Figure 5-17 Data dimensionality reduction: representing data samples as a linear combination of a few principal components
The projected and dimensionally reduced data can be back-projected onto the main axis of the original data.
X_temp = np.c_[Xrot_reduced, np.zeros(pts) ]
reProjX = np.dot(X_temp, U.transpose())
plt.plot(reProjX[:,0],reProjX[:,1],'o')
plt.axis('equal')
plt.show()
Figure 5-18 The projected and dimensionally reduced data is back-projected onto the main axis of the original
data, retaining the main characteristics of the original data
It can be seen that the data after dimensionality reduction still retains the main information of the original data.
2 Whitening
Data samples may have multiple features whose variances differ greatly, that is, different features are spread to different degrees, so they influence a machine learning algorithm differently. Features are also often correlated, and correlated features pull the learning algorithm in several directions at once, like a person being tugged different ways who no longer knows where to go. The whitening operation reduces the correlation between sample features and gives the features the same variance: PCA projection removes the correlation between features, and dividing each feature by its standard deviation equalizes the variances. Whitening therefore usually combines the two techniques, first performing the PCA projection and then dividing each feature by its standard deviation; this combination is called PCA whitening.
Like normalization, whitening can improve the performance of machine learning algorithms. The earlier code already projected the original data X to obtain Xrot, whose features are mutually independent, so the whitening is completed by the following division by the standard deviation. Since the original data is only 2-dimensional, the data is not dimensionally reduced here, to make the effect of whitening visible.
Xwhite = Xrot / np.sqrt(S + 1e-5)  # Whitening: divide the data features by the
                                   # standard deviation, so all features have similar variances
plt.plot(Xwhite[:,0],Xwhite[:,1],'o')
plt.axis('equal')
plt.show()
Figure 5-19 Results of the whitening operation
After the whitening operation, the components of the two main axes have the same variance. You can add data
points to further observe the effect of the whitening operation, as shown in the following code:
pts = 1000
x = np.random.randn(pts,1) # Randomly sample some x-coordinates
y = x+2+ np.random.randn(pts,1)*0.2
X = np.stack((x.flatten(), y.flatten()), axis=-1)
fig = plt.gcf()
fig.set_size_inches(12, 4, forward=True)
plt.subplot(1,2,1)
plt.plot(X[:,0],X[:,1],'o')
plt.axis('equal')
X -= np.mean(X, axis = 0)
cov = np.dot(X.T, X) / X.shape[0]
U,S,V = np.linalg.svd(cov)
Xrot = np.dot(X, U)
Xwhite = Xrot / np.sqrt(S + 1e-5)
reProjX = np.dot(Xwhite, U.transpose())
plt.subplot(1,2,2)
plt.plot(reProjX[:,0],reProjX[:,1],'o')
plt.axis('equal')
plt.show()
Whitening gives all sample features the same variance, so that machine learning is not biased toward a particular feature because of variance differences, which can improve the performance of machine learning algorithms.
Figure 5-21 The neural network degenerates into a linear sequence with only one neuron in each layer
Take the 2-layer neural network shown in Figure 5-22 as an example. Because the initial weights are all 0, the input weights of all neurons in the hidden layer and the output layer are 0; assume the neurons in the same layer use the same activation function:

$$a^{[2]} = g(a^{[1]} W^{[2]} + b^{[2]})$$

$$\frac{\partial L}{\partial W^{[2]}} = {A^{[1]}}^T \frac{\partial L}{\partial z^{[2]}} = {A^{[1]}}^T \frac{\partial L}{\partial a^{[2]}}\, g'(z^{[2]})$$
Thus the gradient dW of every neuron's parameters in the same layer is the same. When the parameters are updated with W = W − lr ∗ dW, every parameter W also remains exactly the same; on the next iteration the outputs of all neurons in the same layer are again identical, so the backward computation again produces identical gradients dW. No matter how many iterations are run, all neurons in the same layer keep the same weight parameters, i.e. they represent the same function: they are symmetric. Obviously the expressive power of such a network is very limited, and having multiple neurons per layer becomes meaningless. A neural network should break this symmetry so that each neuron extracts different features from the input; the solution is to give the weights random initial values. The backward formulas above also show that the bias parameter b has no effect on the gradients of the model parameters or of the input, so it is usually enough to randomly initialize only the weights and set the biases to 0. A simple neural network like the one above can initialize its model parameters as follows:
W1 = np.random.randn(2,2)*0.01
b1 = np.zeros((1,2))
W2 = np.random.randn(2,1)*0.01
b2 = np.zeros((1,1))
Multiplying the weights by a small number plays a role similar to normalizing the data: it keeps the neuron's weighted-sum output from being too large. For a large value x, the activation function sigmoid(x) or tanh(x) is saturated, i.e. its derivative (gradient) there is close to 0; such a tiny activation gradient makes the parameter gradients in the backward computation smaller and smaller, so the parameters update very slowly. This is the gradient vanishing problem. Conversely, values that are too large can also make the gradients in the backward computation very large, i.e. gradient explosion.
Should the initial weights then be as small as possible? No: the gradient of a neuron's input (such as $\partial L / \partial a^{[1]}$) is proportional to the weights ($W^{[2]}$), so weights that are too small also make the gradient with respect to the input too small, again causing gradient vanishing during the backward computation. An initial weight too close to 0 also partially reproduces the neuron symmetry problem described above. Therefore, in general, the weight parameters are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01.
With the initialization above, the variance of a neuron's output grows with the number of its inputs, but it should not depend on the number of inputs; otherwise the variance grows larger and larger as the number of layers increases. To prevent this, the weights can be divided by the square root of the number of inputs, which normalizes the variance of the neuron's output to 1, i.e. the weights are initialized with the code w = np.random.randn(n)/sqrt(n), where n is the number of inputs of the neuron.
If $x = (x_1, \cdots, x_i, \cdots, x_n)$ is an input sample, $n$ is the number of its feature values, and $z = \sum_i^n w_i x_i$ is the output value of the neuron, then the variance of $z$ relates to the variance of $x$ as follows:

$$Var(z) = Var\left(\sum_i^n w_i x_i\right) = \sum_i^n Var(w_i x_i) = \sum_i^n [E(w_i)]^2 Var(x_i) + [E(x_i)]^2 Var(w_i) + Var(x_i) Var(w_i)$$

The last equality uses the property of variance that if two random variables X, Y are independent, then:

$$Var(XY) = [E(X)]^2 Var(Y) + [E(Y)]^2 Var(X) + Var(X) Var(Y)$$

Assuming that the inputs and weights have zero mean, i.e. $E[x_i] = E[w_i] = 0$, then:

$$Var(z) = \sum_i^n Var(x_i) Var(w_i) = (n\, Var(w))\, Var(x)$$
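As a quick numerical sanity check of the variance identity above (a sketch with arbitrarily chosen example distributions, not from the book):

import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(1.0, 2.0, 1_000_000)  # E(X)=1, Var(X)=4
Y = rng.normal(3.0, 0.5, 1_000_000)  # E(Y)=3, Var(Y)=0.25
lhs = np.var(X * Y)
rhs = 1**2 * 0.25 + 3**2 * 4 + 4 * 0.25  # = 37.25
print(lhs, rhs)  # both should be close to 37.25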
The variance Var(z) of the output is thus proportional not only to the input variance Var(x) and the weight variance Var(w), but also to the number n of input values $x_i$. To keep the output variance equal to the input variance, i.e. Var(z) = Var(x), so that passing through the neuron neither inflates nor shrinks the variance, we need $n\,Var(w) = 1$. If w is sampled from the standard normal distribution, then Var(w) = 1; multiplying w by the constant $a = \frac{1}{\sqrt{n}}$ gives $Var(aw) = a^2 Var(w) = \frac{1}{n}$, so that $n\,Var(aw) = 1$. Therefore, the weights can be initialized with the following code:
w = np.random.randn(n) * sqrt(1.0/n)
Following a similar analysis, other initialization methods have been proposed. Glorot et al. suggested multiplying weights sampled from the standard normal distribution by $\sqrt{2/(n_{in} + n_{out})}$, so that $Var(w) = 2/(n_{in} + n_{out})$, where $n_{in}$ and $n_{out}$ are the sizes of the input and output vectors of the network layer; the purpose is to keep the variance of the gradients from changing during the backward computation as well. Of course, the forward and backward requirements interact, so in practice the variances in both directions still change somewhat.
The weights can also be drawn from a uniform distribution: $w \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\right]$. This weight initialization method proposed by Glorot et al. is called Xavier initialization, and its code is implemented as follows:
import numpy as np
import math
def calculate_fan_in_and_fan_out(tensor):
    if len(tensor.shape) < 2:
        raise ValueError("tensor with fewer than 2 dimensions")
    if len(tensor.shape) == 2:
        fan_in,fan_out = tensor.shape
    else:  # F,C,kH,kW
        num_input_fmaps = tensor.shape[1]   # size(1) of F,C,H,W
        num_output_fmaps = tensor.shape[0]  # size(0)
        receptive_field_size = tensor[0][0].size
        fan_in = num_input_fmaps * receptive_field_size
        fan_out = num_output_fmaps * receptive_field_size
    return fan_in, fan_out
The function calculate_fan_in_and_fan_out() computes the number of input features and the number of output features of a network layer (neuron); gain is an optional scaling factor for the weights, with default value 1.
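The xavier_uniform() and xavier_normal() functions called later in this section do not appear in the extracted text; the following is a plausible reconstruction based on the formulas above (the gain parameter is an assumption consistent with the description):

def xavier_uniform(tensor, gain=1.0):
    # uniform Xavier initialization: U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
    tensor[:] = np.random.uniform(-bound, bound, tensor.shape)

def xavier_normal(tensor, gain=1.0):
    # normal Xavier initialization: std = sqrt(2/(fan_in+fan_out))
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    tensor[:] = np.random.normal(0., std, tensor.shape)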
For neurons using Relu as the activation function, the weight initialization method of Kaiming He is now more commonly used: the weights sampled from the standard normal distribution are multiplied by $\sqrt{2/n}$, with code as follows:
w = np.random.randn(n) * sqrt(2.0/n)
For a network layer with the Relu activation function, it is sometimes recommended to set the bias b to a small non-zero constant such as 0.01, so that the activation function affects the gradients at the very start of training; however, whether a non-zero bias really improves performance is unclear.
def kaiming_uniform(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    fan = fan_in if mode == 'fan_in' else fan_out
    gain = calculate_gain(nonlinearity, a)
    std = gain / math.sqrt(fan)
    bound = math.sqrt(3.0) * std  # calculate uniform bound from the standard deviation
    tensor[:] = np.random.uniform(-bound, bound, (tensor.shape))

def kaiming_normal(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu'):
    fan_in, fan_out = calculate_fan_in_and_fan_out(tensor)
    fan = fan_in if mode == 'fan_in' else fan_out
    gain = calculate_gain(nonlinearity, a)
    std = gain / math.sqrt(fan)
    tensor[:] = np.random.normal(0, std, (tensor.shape))
calculate_gain() returns the scaling coefficient used in Kaiming He's method (also called the kaiming or he method); for example, its value for Relu is $\sqrt{2}$ and for tanh is 5.0/3. kaiming_uniform() and kaiming_normal() are the kaiming methods using uniform and Gaussian random values, respectively.
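calculate_gain() itself is not shown in the extracted text; the following is a sketch consistent with the values quoted above (the leaky_relu case follows the common $\sqrt{2/(1+a^2)}$ formula and is an assumption):

def calculate_gain(nonlinearity, a=0):
    if nonlinearity == 'relu':
        return math.sqrt(2.0)
    if nonlinearity == 'tanh':
        return 5.0 / 3
    if nonlinearity == 'leaky_relu':
        return math.sqrt(2.0 / (1 + a ** 2))  # assumption
    return 1.0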
The following function kaiming() selects the kaiming_uniform() or kaiming_normal() method according to the
parameters:
def kaiming(tensor,method_params=None):
    method_type,a,mode,nonlinearity = 'uniform',0,'fan_in','leaky_relu'
    if method_params:
        method_type = method_params.get('type', "uniform")
        a = method_params.get('a', 0)
        mode = method_params.get('mode','fan_in')
        nonlinearity = method_params.get('nonlinearity', 'leaky_relu')
    if method_type=="uniform":  # fixed: the source mistakenly compared method_params here
        kaiming_uniform(tensor,a,mode,nonlinearity)
    else:
        kaiming_normal(tensor,a,mode,nonlinearity)
w = np.empty((2, 3))  # assumed: the line creating w was lost in extraction
print(w)
xavier_uniform(w)
print("xavier_uniform:",w)
xavier_normal(w)
print("xavier_normal:",w)
kaiming_uniform(w)
print("kaiming_uniform:",w)
kaiming_normal(w)
print("kaiming_normal:",w)
output:
[[17.2 17.2 17.2]
[17.2 17.2 24.2]]
xavier_uniform: [[ 0.026289 -1.09114298 -0.48792212]
[-0.3313437 -0.47333989 -0.90713322]]
xavier_normal: [[ 0.93298795 0.07044394 -0.00270454]
[ 0.44167298 -1.01942638 0.45699115]]
kaiming_uniform: [[-1.21534711 -1.27523387  0.80492134]
 [ 0.81222595 -1.11076413 -0.29943563]]
kaiming_normal: [[-0.98492851 0.24745387 0.53676485]
[ 1.27654978 1.52143405 0.87124828]]
In addition, an auxiliary method apply(self, init_params_fn) can be added to the NeuralNetwork class to initialize the parameters of all its layers at once, which is convenient for multi-layer networks, e.g. initializing the parameters of every layer with kaiming_normal():
    def apply(self,init_params_fn):
        for layer in self._layers:
            init_params_fn(layer)
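A hypothetical usage, assuming each Dense layer stores its weight matrix as params[0]:

def init_fn(layer):
    if isinstance(layer, Dense):
        kaiming_normal(layer.params[0])
nn.apply(init_fn)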
In addition to the learning rate, different parameter-optimization strategies should also be tried, such as the momentum method, RMSprop, Adam and other well-known optimizers. These methods may have their own hyperparameters similar to the learning rate, which can be tuned in a similar way.
For batch gradient descent, different batch sizes can be tried, choosing one with a good trade-off between time efficiency and model quality.
Some special techniques (such as the Dropout technique below) may also have hyperparameters that need tuning. Hyperparameter tuning (including network structure parameters) is a craft that requires long practice and experience; it also pays to learn from others' published experience with tuning neural networks rather than exploring blindly.
Batch normalization (BN) inserts a normalization operation between a layer's weighted sum $z = xW + b$ and its activation function $\phi$:

$$\phi(BN(z)) = \phi(BN(xW + b))$$
As before, the weighted sum is regarded as a separate layer (that is, the fully connected layer), and the activation
function is regarded as a separate activation layer. The BN operation can be regarded as a separate batch
normalization layer inserted between them (BN layer).
Simply normalizing z to the standard normal distribution N(0, 1) would limit the expressiveness of the model, because no matter how the preceding layers transform the data, the output of this layer would always follow the standard normal distribution. The BN operation therefore introduces learnable parameters β and γ, representing a new mean and standard deviation, and transforms the features normalized to N(0, 1) into the distribution N(β, γ). Since β and γ are learnable, the loss of expressiveness is avoided.
The BN layer accepts the weighted-sum output z of the fully connected layer as its input, computes the mean and variance of each feature of these z, normalizes each feature of z to an N(0, 1) standard normal distribution using that mean and variance, and then transforms it into the N(β, γ) distribution with the learnable parameters β, γ.
Where no confusion arises, the letter x denotes the z that requires BN normalization. For a batch of samples $B = \{x_1, x_2, \cdots, x_m\}$, BN normalization first computes the mean $\mu_B$ and variance $\sigma_B^2$ of this batch of samples, and then scales and shifts the normalized values with the learnable parameters γ, β:

$$\mu_B = \frac{1}{m}\sum_{i=1}^m x_i$$

$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i = \gamma \odot \hat{x}_i + \beta$$
If each row of the matrix X represents one data point, the following code computes the mean and variance of each feature of the data:
mean = X.mean(axis=0)
var = ((X - mean) ** 2).mean(axis=0)
That is, the mean and variance are computed over the samples (along each column). Of course, numpy's var() function can also be used to compute the variance:
var = np.var(X, axis=0)
It is assumed here that each data point $x_i$ is a vector (one-dimensional array), i.e. X is a two-dimensional array (matrix). Later we will see that each data point may be a multi-dimensional array, such as a multi-channel image; whether $x_i$ is one- or multi-dimensional, each of its elements is treated as a feature. A multi-dimensional $x_i$ can be flattened into a one-dimensional array (vector) with numpy's reshape() function, so that X is still a two-dimensional array (matrix). The following code ensures each $x_i$ is flattened into a one-dimensional vector:
n_X = X.shape[0]
X_flat = X.ravel().reshape(n_X,-1)  # X_flat = X.reshape(n_X,-1)
Since neural networks are usually trained with mini-batch gradient descent, each descent step uses a small batch of samples to update the parameters. Batch normalization therefore does not compute the mean and variance over the entire training set, but over the current mini-batch during training; hence the name batch normalization.
At prediction time, the forward computation must still pass through the BN layer, but the batch statistics must not be computed again. Instead, the mean, variance and the parameters β, γ determined during training are used for the transformation. However, the mean and variance computed at each training iteration differ, and the statistics used for prediction should not depend on a single iteration; therefore, the means and variances over all iteration steps are averaged, usually as running averages computed with a moving-average scheme.
If running_mu and running_var denote the moving averages of the mean and variance during training, they are updated as follows:
running_mu = momentum * running_mu + (1 - momentum) * mu
running_var = momentum * running_var + (1 - momentum) * var
That is, the current moving averages and the statistics of the current batch are combined in a weighted average, where the momentum parameter controls the weight of the moving average.
Forward computation at prediction time transforms the input with the moving averages and the learned parameters:
X_flat = X.ravel().reshape(X.shape[0],-1)
# normalization
X_hat = (X_flat - running_mean) / np.sqrt(running_var + eps)
# scale and shift
out = self.gamma * X_hat + self.beta
From $z_i = \gamma \odot \hat{x}_i + \beta$, we can get:

$$\frac{\partial f}{\partial \beta} = \sum_{i=1}^m \frac{\partial f}{\partial z_i}\frac{\partial z_i}{\partial \beta} = \sum_{i=1}^m \frac{\partial f}{\partial z_i}$$

$$\frac{\partial f}{\partial \gamma} = \sum_{i=1}^m \frac{\partial f}{\partial z_i}\frac{\partial z_i}{\partial \gamma} = \sum_{i=1}^m \frac{\partial f}{\partial z_i} \cdot \hat{x}_i$$

$$\frac{\partial f}{\partial \hat{x}_i} = \frac{\partial f}{\partial z_i}\cdot\frac{\partial z_i}{\partial \hat{x}_i} = \frac{\partial f}{\partial z_i} \cdot \gamma$$

Because $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, where $\mu$ and $\sigma^2$ are also functions of $x_i$:

$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i}\cdot\frac{\partial \hat{x}_i}{\partial x_i} + \frac{\partial f}{\partial \mu}\cdot\frac{\partial \mu}{\partial x_i} + \frac{\partial f}{\partial \sigma^2}\cdot\frac{\partial \sigma^2}{\partial x_i}$$

And from:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},\quad \sigma^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu)^2,\quad \mu = \frac{1}{m}\sum_{i=1}^m x_i$$

we can get:

$$\frac{\partial \hat{x}_i}{\partial x_i} = \frac{1}{\sqrt{\sigma^2+\epsilon}},\quad \frac{\partial \mu}{\partial x_i} = \frac{1}{m},\quad \frac{\partial \sigma^2}{\partial x_i} = \frac{2(x_i-\mu)}{m}$$

Therefore:

$$\frac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i}\left(\frac{1}{\sqrt{\sigma^2+\epsilon}}\right) + \frac{\partial f}{\partial \mu}\left(\frac{1}{m}\right) + \frac{\partial f}{\partial \sigma^2}\left(\frac{2(x_i-\mu)}{m}\right)$$

where:

$$\frac{\partial f}{\partial \sigma^2} = \sum_{j=1}^m \frac{\partial f}{\partial \hat{x}_j}\cdot\frac{\partial \hat{x}_j}{\partial \sigma^2} = -0.5\sum_{j=1}^m \frac{\partial f}{\partial \hat{x}_j}(x_j-\mu)(\sigma^2+\epsilon)^{-1.5}$$

$$\frac{\partial f}{\partial \mu} = \left(\sum_{i=1}^m \frac{\partial f}{\partial \hat{x}_i}\cdot\frac{-1}{\sqrt{\sigma^2+\epsilon}}\right) + \left(\frac{\partial f}{\partial \sigma^2}\cdot\frac{1}{m}\sum_{i=1}^m -2(x_i-\mu)\right)$$
According to the above forward and backward formulas, the BN layer for vector inputs can be implemented as the following class:
class BatchNorm_1d(Layer):
    def __init__(self,num_features,gamma_beta_method = None,eps = 1e-8,momentum = 0.9):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        if not gamma_beta_method:
            self.gamma = np.ones((1, num_features))
            self.beta = np.zeros((1, num_features))
        else:
            self.gamma = np.random.randn(1, num_features)
            self.beta = np.random.randn(1, num_features)
        self.params = [self.gamma,self.beta]
        self.grads = [np.zeros_like(self.gamma),np.zeros_like(self.beta)]
    def forward(self, X):  # the method header was lost in extraction
        self.n_X = X.shape[0]
        self.X_flat = X.ravel().reshape(self.n_X,-1)
        self.mu = np.mean(self.X_flat,axis=0)
        self.var = np.var(self.X_flat, axis=0)  # var = 1/N * np.sum((x - mu)**2, axis=0)
        self.X_hat = (self.X_flat - self.mu)/np.sqrt(self.var + self.eps)
        out = self.gamma * self.X_hat + self.beta
        return out
    def __call__(self,X):
        return self.forward(X)
    def backward(self,dout):
        eps = self.eps
        dout = dout.ravel().reshape(dout.shape[0],-1)
        X_mu = self.X_flat - self.mu
        var_inv = 1./np.sqrt(self.var + eps)
        dbeta = np.sum(dout,axis=0)
        dgamma = np.sum(dout * self.X_hat, axis=0)
        # gradient with respect to the input, following the derivation above
        dX_hat = dout * self.gamma
        dvar = np.sum(dX_hat * X_mu, axis=0) * -0.5 * (self.var + eps)**-1.5
        dmu = np.sum(dX_hat * -var_inv, axis=0) + dvar * np.mean(-2.*X_mu, axis=0)
        dX = dX_hat * var_inv + dvar * 2*X_mu/self.n_X + dmu/self.n_X
        self.grads[0] += dgamma
        self.grads[1] += dbeta
        return dX
For this BatchNorm_1d class, the following code uses numerical gradients to check that the analytical gradients are correct:
# diff_error = lambda x, y: np.max(np.abs(x - y))
from util import *
import numpy as np
np.random.seed(231)
N, D = 100, 5
x = 3 * np.random.randn(N, D) + 5
bn = BatchNorm_1d(D,"no")
x_norm = bn(x)
do = np.random.randn(N, D)+0.5
dx = bn.backward(do)
dx_num = numerical_gradient_from_df(lambda :bn.forward(x),x,do)
print(diff_error(dx,dx_num))
if False:
    dx_gamma = numerical_gradient_from_df(lambda :bn.forward(x),bn.gamma,do)
    print(diff_error(bn.grads[0],dx_gamma))
7.684454184087031e-10
In the convolutional neural networks discussed later, an input sample (such as a color image) is often a three-dimensional tensor C × H × W, where C, H, W are the number of channels (e.g. of a color image), the height, and the width; a batch of samples is then a 4-dimensional tensor N × C × H × W, where N is the number of samples. The code above can be rewritten to handle such 4D tensor input; the following version performs batch normalization per channel instead of per (pixel) feature:
class BatchNorm(Layer):
    def __init__(self,num_features,gamma_beta_method = None,eps = 1e-5,momentum = 0.9,std = 0.02):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        if not gamma_beta_method:
            self.gamma = np.ones((1, num_features))
            self.beta = np.zeros((1, num_features))
        else:
            self.gamma = np.random.normal(1,std,(1, num_features))
            self.beta = np.zeros((1, num_features))
        self.params = [self.gamma,self.beta]
        self.grads = [np.zeros_like(self.gamma),np.zeros_like(self.beta)]
        self.running_mu = np.zeros((1, num_features))   # moving averages used at prediction time
        self.running_var = np.ones((1, num_features))
    def forward(self,X,training=True):  # the method header was lost in extraction
        self.X_shape = X.shape
        if len(self.X_shape)>2:
            N,C,H,W = X.shape
            X = np.moveaxis(X,1,3)  # move C to last axis: N,H,W,C
            X_flat = X.reshape(-1,X.shape[3])
        else:
            X_flat = X
        if training:
            NHW = X_flat.shape[0]
            self.n_X = NHW
            mu = np.mean(X_flat,axis=0)
            var = 1 / float(NHW) * np.sum((X_flat - mu) ** 2, axis=0)  # np.var(X_flat, axis=0)
            X_hat = (X_flat - mu)/np.sqrt(var + self.eps)
            out = self.gamma * X_hat + self.beta
            self.mu,self.var,self.X_flat,self.X_hat = mu,var,X_flat,X_hat
            # update the moving averages of the mean and variance
            self.running_mu = self.momentum*self.running_mu + (1-self.momentum)*mu
            self.running_var = self.momentum*self.running_var + (1-self.momentum)*var
        else:
            # normalization with the moving averages
            X_hat = (X_flat - self.running_mu) / np.sqrt(self.running_var + self.eps)
            # scale and shift
            out = self.gamma * X_hat + self.beta
        if len(self.X_shape)>2:
            out = out.reshape(N,H,W,C)
            out = np.moveaxis(out,3,1)
        return out
    def __call__(self,X):
        return self.forward(X)
    def backward(self,dout):
        if len(dout.shape)>2:
            dout = np.moveaxis(dout,1,3)
            dout = dout.reshape(-1,dout.shape[3])
        eps = self.eps
        dbeta = np.sum(dout,axis=0)
        dgamma = np.sum(dout * self.X_hat, axis=0)
        # gradient with respect to the input, as in BatchNorm_1d
        X_mu = self.X_flat - self.mu
        var_inv = 1./np.sqrt(self.var + eps)
        dX_hat = dout * self.gamma
        dvar = np.sum(dX_hat * X_mu, axis=0) * -0.5 * (self.var + eps)**-1.5
        dmu = np.sum(dX_hat * -var_inv, axis=0) + dvar * np.mean(-2.*X_mu, axis=0)
        dX = dX_hat * var_inv + dvar * 2*X_mu/self.n_X + dmu/self.n_X
        if len(self.X_shape)>2:
            N,C,H,W = self.X_shape
            dX = dX.reshape(N,H,W,C)
            dX = np.moveaxis(dX,3,1)
        self.grads[0] += dgamma
        self.grads[1] += dbeta
        return dX
To observe the effect of BN on network performance, a batch normalization (BN) layer is inserted between the weighted sum and the activation function of the first two hidden layers of the network model trained on the Fashion-MNIST dataset in Section 4.3.9. Because BN keeps the weight parameters from becoming extreme, it also acts as a regularization technique, so the weight-decay regularization can be removed from the code (reg=0):
import numpy as np
import util
from NeuralNetwork import *
from train import *
import mnist_reader
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Dense(784, 500))
nn.add_layer(Relu())
nn.add_layer(Dense(500, 200))
nn.add_layer(BatchNorm_1d(200))
nn.add_layer(Relu())
nn.add_layer(Dense(200, 100))
nn.add_layer(BatchNorm_1d(100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
learning_rate = 0.01
momentum = 0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=8
batch_size = 64
reg = 0#1e-3
print_n=1000
losses = train_nn(nn,train_X,y_train,optimizer,cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
plt.plot(losses)
[ 1, 1] loss: 2.291
[ 1001, 2] loss: 0.416
[2001, 3] loss: 0.261
[ 3001, 4] loss: 0.342
[ 4001, 5] loss: 0.222
[ 5001, 6] loss: 0.196
[ 6001, 7] loss: 0.157
[ 7001, 8] loss: 0.295
0.9066833333333333
0.8766
It can be seen that with batch normalization the prediction accuracy of the trained model improves somewhat.
5.4 Regularization
When the model is complex (e.g. has many parameters), regularization is the basic technique for preventing overfitting. Besides the direct weight regularization used in regression, deep learning often uses a regularization technique called Dropout.
The total loss is the data loss plus the regularization term:

$$L_{data} + R_W$$

For a weight w, its $L_2$ regularization term is $R_W = \lambda w^2$, where λ controls the weight of the regularization term relative to the data loss: the larger λ is, the stronger the regularization and the better it prevents overfitting; the smaller λ is, the weaker both effects are. The $L_2$ term pushes the parameters toward smaller values close to 0.
The $L_1$ regularization term is $R_W = \lambda |w|$. Its effect is similar to $L_2$ but slightly different: $L_2$ shrinks all values uniformly, while $L_1$ makes the weights sparse, i.e. many weights become close to 0 and only a few remain non-zero. $L_1$ therefore makes machine learning tend to select a few good features rather than use all of them, which helps feature selection. Sparsity is an important topic in machine learning, but due to space limitations this book does not discuss it.
The $L_1$ and $L_2$ terms can also be combined into the so-called elastic net regularization: $R_W = \lambda_1 |w| + \lambda_2 w^2$, whose effect lies between, or elastically combines, $L_1$ and $L_2$.
Figure 5-23 is a schematic diagram of these three common weight regularization functions:
In gradient descent, especially in deep learning, the backward computation multiplies gradients layer by layer, so as the number of layers grows the gradients can vanish or explode. The max-norm constraint can prevent gradient explosion by limiting the weights about to be updated to a certain range, i.e. by clipping the weight vector so that some norm, such as the $L_2$ norm, does not exceed a threshold: $||w||_2 < c$. Typical values of c are 3 or 4, and some studies report that this max-norm constraint on the weights improves the convergence of the algorithm. It is commonly used to prevent gradient explosion, especially in the recurrent neural networks studied later.
import numpy as np
def max_norm_constraints(w,c,epsilon = 1e-8):
    norms = np.sqrt(np.sum(np.square(w), keepdims=True))
    desired = np.clip(norms, 0, c)
    w *= (desired / (epsilon + norms))
    return w

w = np.random.randn(2,5)*10
print(w)
w = max_norm_constraints(w,2)
print(w)
If grads contains the gradients of multiple weight parameters, the following code limits their overall (global) norm to at most c by rescaling:
import math
def grad_clipping(grads,c):
    norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
    if norm > c:
        ratio = c / norm
        for i in range(len(grads)):
            grads[i] *= ratio
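A hypothetical usage, assuming the [parameter, gradient] pair layout of nn.parameters() described earlier:

grads = [grad for p, grad in nn.parameters()]
grad_clipping(grads, 5.0)  # rescale so the global gradient norm is at most 5
optimizer.step()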
5.4.2 Dropout
https://deepnotes.io/dropout
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks
Dropout (English for "discarding") is a regularization technique proposed by Srivastava et al. During training, each neuron is active (i.e. produces its activation value) only with a certain probability, and is otherwise inactive (produces no output). For a given layer, Dropout makes each neuron inactive with a probability drop_p between 0 and 1, and hence active with probability 1−drop_p. If the layer has 100 neurons and drop_p=0.2, then in expectation 100*0.2 of them will be inactive and 100*0.8 active. drop_p is called the drop rate, the probability with which a neuron is inactive; conversely, (1 − drop_p) is called the survival or retention rate, the probability with which a neuron is active.
Figure 5-24 Dropout: Some neurons are randomly inactive during each forward and reverse process
As shown in Figure 5-24, in the network on the left all neurons are active, while the one on the right uses Dropout. Dropout defines a different network function by deactivating certain neurons; because each gradient descent iteration randomly deactivates a different set, different iterations optimize different functions. As a result the network cannot rely too heavily on a small number of neurons, just as a group's decisions should not depend on a few individuals but give everyone a chance to participate, preventing the bias of over-reliance on a few. Dropout is similar in spirit to data normalization: without normalization, features with large values dominate the algorithm while others barely matter. It is also similar to weight regularization, which uses penalty terms to keep all weights small and prevent a few weights from becoming too large.
Dropping out a layer's neurons with a certain probability makes the expected total output of that layer smaller: if the original expected total output is e and the drop probability is drop_p, the expectation becomes e ∗ (1 − drop_p). To avoid affecting subsequent layers, the output of each active neuron in a layer using dropout is therefore usually divided by (1 − drop_p); that is, if a neuron's activation output is a, it is modified to output a/(1 − drop_p).
Because each training iteration drops a different random set of neurons, each iteration effectively trains a different neural network function; the loss being minimized is thus not the loss of one fixed function but of a different function at each iteration. The final trained function can be regarded as the average of the different functions produced by these iterations, and a better model can be obtained by averaging multiple different models, just as voting by many people gives better results than voting by a few. This, again, is the basic idea of machine learning founded on statistical learning: Dropout effectively avoids overfitting by averaging multiple functions.
In addition, Dropout is only used while training the model; the trained model function should be deterministic. The final trained neural network is therefore the function in which all neurons are active, and Dropout must no longer be applied when validating or testing the model.
Dropout can act on the output of any hidden layer. If the layer output is x, the Dropout operation can be written as:
x = D ⊙ x
where D is an array with the same shape as x whose elements are 1 or 0, indicating whether the corresponding neuron is active. D is a mask array computed from the drop rate (or survival rate). The Dropout operation can be performed with the following code:
retain_p = 1-drop_p
mask = (np.random.rand(*x.shape) < retain_p) / retain_p
x *= mask
Here drop_p and retain_p = 1−drop_p are the drop rate and retention rate respectively, and mask is the mask array marking the active neurons (already scaled by the retention rate). In the backward computation, the gradient dx_output of the loss function with respect to the Dropout output is simply multiplied by the same mask:
dx = dx_output * self._mask
where dx_output is the gradient of the loss function with respect to x propagated backward.
The Dropout operation can be implemented as a separate Dropout layer, the code is as follows:
from Layers import *
class Dropout(Layer):
def __init__(self, drop_p):
super().__init__()
self.retain_p = 1- drop_p
where x is the output of the layer preceding the Dropout layer. For an input X, the following code uses dropout.forward(X) to compute the Dropout output, and in the reverse derivation uses dropout.backward(dx_output) to obtain the gradient with respect to X from the incoming gradient dx_output:
np.random.seed(1)
dropout = Dropout(0.5)
X = np.random.rand(2, 4)
print(X)
print(dropout.forward(X))
dx_output = np.random.rand(2, 4)
print(dx_output)
print(dropout.backward(dx_output))
Dropout is a technique for reducing the complexity of the function: the more parameters a model has, the more complex it is. Different network layers have different numbers of parameters and thus different complexity. Dropout can be added after any hidden layer; the drop rate can be higher for hidden layers with more parameters and lower otherwise. For network layers with few parameters, the Dropout layer may be omitted.
Dropout prevents overfitting, but because it makes each iteration train a different function, the loss function loses its clear meaning and the training parameters are hard to debug with debugging tools. The usual practice is to first turn Dropout off (regularization terms can still be added to prevent overfitting), tune the parameters, and then turn Dropout on to further improve the quality of the model.
Dropout is a regularization technique. When the network is relatively small compared to the data set, regularization is usually not required because the complexity of the model is already low; adding regularization would reduce the model's representation ability and hurt learning performance. Moreover, Dropout obviously should not be placed immediately before or after the output layer, because near the classification output the network cannot "correct" the errors caused by dropping neurons. In addition, batch normalization is now widely used in practice instead of Dropout.
For example, if the network trained on the Fashion-MNIST dataset in Section 4.3.9 adds Dropout after the first network layer, the weight-decay regularization can be removed (reg=0):
import numpy as np
import util
from NeuralNetwork import *
from train import *
import mnist_reader
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(1)
# X_train, y_train are assumed to have been loaded with mnist_reader,
# e.g. X_train, y_train = mnist_reader.load_mnist('data/fashion', kind='train')
trainX = X_train.reshape(-1, 28, 28)
train_X = trainX.astype('float32')/255.0
nn = NeuralNetwork()
nn.add_layer(Dense(784, 500))
nn.add_layer(Relu())
nn.add_layer(Dropout(0.25))
nn.add_layer(Dense(500, 200))
nn.add_layer(Relu())
nn.add_layer(Dropout(0.2))
nn.add_layer(Dense(200, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
learning_rate = 0.01
momentum = 0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=8
batch_size = 64
reg = 0#1e-3
print_n=1000
losses = train_nn(nn, train_X, y_train, optimizer, cross_entropy_grad_loss, epochs,
                  batch_size, reg, print_n)
plt.plot(losses)
[ 1, 1] loss: 2.307
[ 1001, 2] loss: 0.661
[2001, 3] loss: 0.322
[ 3001, 4] loss: 0.509
[ 4001, 5] loss: 0.316
[ 5001, 6] loss: 0.344
[ 6001, 7] loss: 0.355
[ 7001, 8] loss: 0.434
0.8872333333333333
0.8667
Compared with the previous chapter, using Dropout also improves accuracy. Of course, the Dropout hyperparameters need further tuning to improve the effect. In current practice, batch normalization is generally used instead of Dropout.
5.4.3 Early stopping method (Early stopping)
As shown in Figure 5-25, during training, with the help of the validation set, the iteration is stopped as soon as the validation loss no longer decreases (or even starts to increase), to prevent overfitting from reducing the generalization ability of the model.
Figure 5-25. Stopping the training iteration when the validation error no longer decreases or starts to increase
Chapter 6 Convolutional Neural Network CNN
In the neural networks so far, each input sample is a one-dimensional tensor, and each layer of neurons receives a one-dimensional tensor from the previous layer to generate an output. This kind of neural network is called a fully connected neural network. For two-dimensional or three-dimensional tensors such as image data, each image is flattened into a one-dimensional tensor before being input to the network. The flattened one-dimensional tensor loses the inherent spatial structure of the image (such as the adjacency relationships between pixels); exchanging the element order of the flattened tensors has no effect on the training of the network function, that is, as long as all tensors are permuted the same way, changing the element order still produces an equivalent network function. Yet for an image it is obviously unreasonable to arrange all pixels randomly and still recognize the same object, because the pixels of an image are meaningful only when arranged according to a certain spatial structure; otherwise they are a meaningless jumble.
6.1 Convolution
Summing a group of numbers x_1, x_2, ⋯, x_n and dividing by n gives the average of the group. You can also multiply each number x_i by a weight w_i and sum the products:
w_1 ∗ x_1 + w_2 ∗ x_2 + w_3 ∗ x_3 + ⋯ + w_n ∗ x_n
If the weights sum to 1, this weighted sum is called a weighted average. Weight values can also be negative; for example, a company's debt ratio or profitability ratio can be negative or positive.
For example, to compute a student's grade in a course, different proportions can be given to the usual grade, the experimental grade, and the final grade, such as 0.2, 0.3, and 0.5 respectively; the total grade is then 0.2*usual grade + 0.3*experimental grade + 0.5*final grade. The weighted sum of a set of numbers extracts a certain feature from the set, such as the "total grade" feature extracted by the weighted sum of grades.
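For instance, with illustrative grade values (the numbers here are made up):
grades = [85, 90, 78]            # usual, experimental, final grades
weights = [0.2, 0.3, 0.5]
total = sum(w*g for w, g in zip(weights, grades))
print(total)                     # 83.0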
For example, consider the following group of 10 numbers:
4 15 16 7 23 17 10 9 5 8
and 3 weights (1.2, 0.3, 0.5). Because the number of weights is less than the number of values, these 3 weights can be used in turn to compute a weighted sum with every 3 adjacent numbers in the group. First align the 3 weights with the first 3 numbers (4, 15, 16):
Figure 6-1 Align the weight vector (1.2,0.3,0.5) to the first three numbers
(4,15,16)
Figure 6-3 Align the weight vector (1.2,0.3,0.5) to the last 3 numbers
(9,5,8)
This process of sliding a weight vector shorter than the data along the data, aligning it with successive windows and computing a weighted sum to obtain a new group of data, is called convolution.
The length of the resulting vector is n−K+1, where n is the input length and K the convolution kernel width. For example, if the input data length is 5 and the kernel width is 3, the length of the resulting convolution vector is 5−3+1 = 3, as shown in Figure 6-5 below:
Figure 6-5 valid convolution: the data length is 5, the convolution kernel width is 3, and the length of the resulting convolution vector is 5-3+1 = 3
This convolution method is called "valid convolution". The above
summation can be expressed in python code as:
K = w.size
z[i] = np.sum(x[i:i+K]*w)
import numpy as np
np.random.seed(5)
x = np.random.randint(low=1, high=30, size=10,dtype='l')
print(x)
w = np.array([1.2,0.3,0.5])
n = x.size
K = w.size
z = np.zeros(n-K+1)
for i in range(n-K+1):
z[i] = np.sum(x[i:i+K]*w)
print(w)
print(z)
[ 4 15 16 7 23 17 10 9 5 8]
[1.2 0.3 0.5]
[17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3]
In order to generate result data with the same length as the original data, zeros can be padded before and after the original data before convolving. As shown in Figure 6-6, for a convolution kernel of length 3, after one 0 is padded on each side, the left and right ends generate 2 new values: 1.2*0 + 0.3*4 + 0.5*15 = 8.7 and 1.2*5 + 0.3*8 + 0.5*0 = 8.4.
Figure 6-6 same convolution: the convolution width is K, and (K-1)/2 0s are
filled on the left and right sides of the original data of length n, and the
length of the convolution result vector is n
The number of 0s padded before and after the data is (K−1)/2 each, so the data length becomes n + 2(K−1)/2 = n+K−1, and the length of the convolution result is n+K−1−K+1 = n = 10; that is, the result has the same length as the original data. This convolution method is called "same convolution". Of course, if K is not an odd number, the result length is n−1.
Figure 6-7 full convolution: the convolution width is K, and K-1 zeros are
filled on the left and right sides of the original data of length n, and the
length of the convolution result vector is n+K-1
Generally, if the length of the original data is n, the length of the convolution kernel is K, and the sum of the left and right padding lengths is P, the length of the convolution result is n+P−K+1. For example, if P = 0 (no padding), the original data length is 3, and the kernel length is also 3, the result length is 3−3+1 = 1.
The convolution with padding can be implemented as follows:
def conv1d(x, w, pad=0):
    n, K = x.size, w.size
    n_o = n + 2*pad - K + 1
    x_pad = np.pad(x, pad, mode='constant') if pad > 0 else x
    y = np.zeros(n_o)                    # convolution result
    for i in range(n_o):
        y[i] = np.sum(x_pad[i:i+K]*w)
    return y
y1 = conv1d(x, w, 1)                     # same convolution
print(x.size, w.size, y1.size)
print("same: ", y1)
y2 = conv1d(x, w, 2)                     # full convolution
print(x.size, w.size, y2.size)
print("full: ", y2)
10 3 10
same: [ 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3
8.4]
10 3 12
full: [ 2. 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3
8.4 9.6]
numpy provides ready-made functions for one-dimensional convolution: np.correlate() and np.convolve(). Their first parameter is the data to be convolved, the second parameter is the weight vector, and the third parameter selects one of three convolution methods: "full", "same", or "valid".
w0 = np.array([1.2, 0.3, 0.5])
# the cross-correlation function np.correlate is what deep learning
# calls convolution
x_valid = np.correlate(x, w0, 'valid')
x_same = np.correlate(x, w0, 'same')
x_full = np.correlate(x, w0, 'full')
print(x_valid)
print(x_same)
print(x_full)
w = np.array([0.5, 0.3, 1.2])
# np.convolve first reverses the kernel and then performs the deep
# learning style convolution
x_valid = np.convolve(x, w, 'valid')
x_same = np.convolve(x, w, 'same')
x_full = np.convolve(x, w, 'full')
print(x_valid)
print(x_same)
print(x_full)
[17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3]
[8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4]
[ 2. 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4
9.6]
[17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3]
[8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4]
[ 2. 8.7 17.3 26.3 32.8 23.8 37.7 27.9 17.2 16.3 8.4
9.6]
Span (stride)
Typically, the convolution kernel slides along the convolved data element by element; hence the valid convolution of data of length n with a kernel of length K has length n−K+1. Sliding one element at a time makes the result almost as long as the original data. The number of elements the kernel slides along the data each time is called the span or stride. Sometimes, to produce smaller convolution results, the kernel is slid with a span greater than 1, as shown in Figure 6-9.
If the span of the convolution kernel is S, a kernel of length K can slide (n−K)/S times over data of length n (using integer division). Counting the initial position, each slide allows one more convolution operation, so a total of (n−K)/S+1 convolution operations can be performed. For example, with n=10, K=3, S=2, the number of convolution operations is (10−3)/2+1 = 4.
If the original data length is n and the sum of the left and right padding lengths is P, the padded data length is n+P. The number of convolution operations that can be performed is therefore (n+P−K)/S+1, that is, the final result length is (n+P−K)/S+1.
def conv1d(x, w, pad=0, s=1):
    n = x.size
    K = w.size
    n_o = (n + 2*pad - K)//s + 1
    y = np.zeros(n_o)            # convolution result
    if not pad == 0:
        x_pad = np.pad(x, [(pad, pad)], mode='constant')
    else:
        x_pad = x
    for i in range(n_o):         # slide the kernel with stride s
        y[i] = np.sum(x_pad[i*s:i*s+K]*w)
    return y
For example, the following generates two sets of numbers x and y: x is 100 numbers uniformly spaced on [0, 2π], and y consists of values near the corresponding sinusoid sin(x) (i.e., y is a noisy sample of the sinusoid):
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.linspace(0,2*np.pi,100)
y = np.sin(x) + np.random.random(100) * 0.2
plt.plot(x,y)
plt.show()
Figure 6-10 Random noise sampling of a sinusoid
[-6 -5 -4 -3 -2 -1 0 1 2 3 4 5]
['0.00', '0.00', '0.01', '0.05', '0.12', '0.20',
'0.23', '0.20', '0.12', '0.05', '0.01', '0.00']
Figure 6-11 A set of weight parameters sampled according to the normal
distribution
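The code that produced the values above is not shown; a minimal sketch that reproduces such weights, assuming they are normal-density values at integer offsets normalized to sum to 1 (the width sigma = 1.75 is a guess):
import numpy as np
t = np.arange(-6, 6)                 # the integer offsets printed above
sigma = 1.75                         # assumed width of the bell curve
w = np.exp(-t**2/(2*sigma**2))
w = w / w.sum()                      # normalize so the weights sum to 1
print(t)
print(['%.2f' % v for v in w])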
The middle values of this weight vector w are large, the values on both sides are small, and the sum of all weight values is 1. Using this weight vector w to convolve y:
#w = np.array([0.1,0.2, 0.5, 0.2, 0.1])
yhat = np.correlate(y, w,"same")
plt.plot(x,yhat, color='red')
Figure 6-12 Use the weight vector of Gaussian distribution to weight the
original sinusoidal sampling data, which plays a smooth (smooth) effect
This weight vector computes a weighted average of the values in y. As the weight vector slides along y, each computed value in yhat is the weighted average of the y value at the center of the sliding window and the y values around it; the weight corresponding to the center is largest, and the weights of values farther from the center are smaller. The resulting vector is thus an averaged, or smoothed, version of the original data. As Figure 6-12 shows, the curve through the convolved points (x, yhat) becomes smoother, that is, the noise in the original data y is reduced to a certain extent.
Of course, computer digital images can also use real numbers in the [0,1] interval to represent pixel values. The following code uses the io module of the skimage package to read a color image into a numpy multidimensional array img, uses rgb2gray from the skimage.color module to convert the color image into a black-and-white (grayscale) image, displays the two images, and prints the pixel values of a 5×5 window in the middle of each.
from skimage import io, transform
from skimage.color import rgb2gray
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
img = io.imread('image.jpg')
gray_img = rgb2gray(img)   # or io.imread('image.jpg', as_gray=True)
plt.subplot(1, 2, 1)
plt.imshow(img)
plt.subplot(1, 2, 2)
plt.imshow(gray_img, cmap='gray')
# img = io.imread('./imgs/lenna.png', as_gray=True)  # load the image as grayscale
# plt.imshow(img, cmap='gray')
print('image matrix size: ', img.shape)         # print the size of the image
print('image matrix size: ', gray_img.shape)    # print the size of the image
print('\n First 5 rows and columns of the color image matrix: \n',
      img[150:155, 110:115])
print('\n First 5 rows and columns of the gray image matrix: \n',
      gray_img[150:155, 110:115])
[[108 78 68]
[106 76 66]
[107 77 67]
[101 71 61]
[ 92 62 52]]
As can be seen, the color image is read into a three-dimensional numpy array whose third dimension represents the three color channels of the image, namely the R, G, B channels. Each channel is a two-dimensional array (matrix), so a color image can be viewed as 3 matrices. The rgb2gray() function converts a 3-channel color image into a one-channel grayscale image, i.e., a two-dimensional array (matrix). The value of each grayscale pixel is computed as a weighted sum of the red (R), green (G), and blue (B) pixel values:
Y = 0.2125 R + 0.7154 G + 0.0721 B
As the output shows, the RGB color values are converted from integer values in the [0,255] interval to real values in the [0,1] interval.
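This conversion can be checked manually; a small sketch applying the weighted sum above to the [0,1]-scaled color channels (approximate comparison, since the library may round slightly differently):
# manual grayscale conversion with the weights above (img is the uint8 RGB image)
coeffs = np.array([0.2125, 0.7154, 0.0721])
gray_manual = (img/255.0) @ coeffs
print(np.allclose(gray_manual, gray_img, atol=1e-3))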
A color value can also be converted back from a real value in [0,1] to an integer in [0,255]; the following code converts the grayscale pixel values to integers in the [0,255] interval.
gray_img2 = gray_img*255
gray_imgs= gray_img2.astype(np.uint8)
print('The values of the first 5 rows and 5 columns of the
grayscale matrix: \n', gray_imgs[150:155,110:115])
As shown in Figure 6-16, suppose the left side is a 6 × 6 matrix (image) and the middle is a 3 × 3 matrix representing a convolution kernel. Slide the kernel matrix along the image "from top to bottom, from left to right"; for each image window encountered, a weighted sum with the kernel produces one value. For example, the weighted sum of the kernel with the window in the upper-left corner of the image is:
2*(-1) + 3*0 + 0*1 + 6*(-2) + 0*0 + 4*2 + 8*(-1) + 1*0 + 0*1 = -14
Continue to move the convolution kernel pixel by pixel to the right, and
generate new values in turn:
Figure 6-16 Valid convolution of 6x6 two-dimensional matrix data with 3x3
convolution kernel produces a 4x4 size matrix
Use w_{m,n} to denote the weight in row m and column n of the convolution kernel, and a_{i,j} to denote the element in row i and column j of the result matrix. The convolution formula for a two-dimensional matrix is:
a_{i,j} = ∑_{m=0}^{F_h−1} ∑_{n=0}^{F_w−1} w_{m,n} x_{i+m,j+n}
That is, the convolution kernel window is aligned with position (i, j) of the data matrix and a weighted sum is computed with the corresponding data window. For example, a_{1,1} in the example above is calculated as:
a_{1,1} = ∑_{m=0}^{F_h−1} ∑_{n=0}^{F_w−1} w_{m,n} x_{1+m,1+n}
Suppose the data matrix X has h rows and w columns, and the convolution kernel K has F_h rows and F_w columns; then the result matrix of a valid convolution has h−F_h+1 rows and w−F_w+1 columns. For the matrices below:
X: [[2 3 0 7 9 5]
[6 0 4 7 2 3]
[8 1 0 3 2 6]
[7 6 1 5 2 8]
[9 5 1 8 3 7]
[2 4 1 8 6 5]]
K: [[-1 0 1]
[-2 0 2]
[-1 0 1]]
array([[-14., 20., 7., -7.],
[-24., 10., 3., 5.],
[-28., 3., 6., 8.],
[-23., 9., 10., -2.]])
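The result above can be reproduced with a few lines of numpy; a minimal sketch of valid 2D convolution (the loop mirrors the formula above):
import numpy as np
X = np.array([[2,3,0,7,9,5],[6,0,4,7,2,3],[8,1,0,3,2,6],
              [7,6,1,5,2,8],[9,5,1,8,3,7],[2,4,1,8,6,5]])
K = np.array([[-1,0,1],[-2,0,2],[-1,0,1]])
H, W = X.shape
Fh, Fw = K.shape
out = np.zeros((H-Fh+1, W-Fw+1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(X[i:i+Fh, j:j+Fw]*K)
print(out)    # out[0, 0] is -14, matching the hand calculation above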
Figure 6-17 The result image obtained by convolving the image with the
convolution kernel with the function of "vertical edge extraction"
As can be seen, the vertical features of the resulting image are accentuated, indicating that this convolution kernel performs "vertical edge extraction".
In order to generate an image of the same size as the original, "same convolution" can also be used, that is, zeros are padded around the image. If the size of the weight matrix is F_h × F_w, the number of 0s padded on the left and right of the original image is (F_w − 1)/2, and the number padded on the top and bottom is (F_h − 1)/2; usually the weight matrix is a square matrix with equal height and width. As shown in Figure 6-18, using a 3×3 convolution kernel to perform same convolution on a 6×6 matrix, the original matrix is padded with (3−1)/2 = 1 ring of 0 values, and a 6×6 result matrix is obtained.
Figure 6-18 Using the 3x3 convolution kernel to perform the same
convolution on a 6x6 matrix results in a 6x6 matrix
The following code pads P_h and P_w 0 values on the top/bottom and left/right of the image:
H,W = X.shape
P_h,P_w = 1,2
X_padded = np.zeros((H + 2*P_h, W +2*P_w))
X_padded[P_h:-P_h, P_w:-P_w] = X
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 2. 3. 0. 7. 9. 5. 0. 0.]
[0. 0. 6. 0. 4. 7. 2. 3. 0. 0.]
[0. 0. 8. 1. 0. 3. 2. 6. 0. 0.]
[0. 0. 7. 6. 1. 5. 2. 8. 0. 0.]
[0. 0. 9. 5. 1. 8. 3. 7. 0. 0.]
[0. 0. 2. 4. 1. 8. 6. 5. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
The above code creates the padded image X_padded according to the padding sizes. In fact, numpy provides a ready-made pad() function that pads the front and back of each axis of a multidimensional array:
np.pad(x, [(1, 0), (1, 2)], mode='constant', constant_values=0)
The second parameter [(1, 0), (1, 2)] gives the number of elements to pad before and after each axis of the numpy array x: the first tuple (1, 0) pads 1 element before and 0 after the first axis (axis=0), and the second tuple (1, 2) pads 1 element before and 2 after the second axis (axis=1). mode='constant' means padding with a constant value, and constant_values=0 sets that constant to 0; these two parameters can be omitted.
The following code pads one row of 0s before the first row of array a, one column of 0s before the first column, and two columns of 0s after the last column.
import numpy as np
a = np.array([[ 1., 1., 1.],
[ 1., 1., 1.]])
b = np.pad(a, [(1, 0), (1, 2)], mode='constant')
print(a)
print(b)
[[1. 1. 1.]
[1. 1. 1.]]
[[0. 0. 0. 0. 0. 0.]
[0. 1. 1. 1. 0. 0.]
[0. 1. 1. 1. 0. 0.]]
The following code first pads (K_h-1)//2 and (K_w-1)//2 elements on the top/bottom and left/right of the image according to the height K_h and width K_w of the convolution kernel, and then convolves the padded image. The corresponding python code is as follows:
def convolve2d_same(X, K):
    H, W = X.shape
    K_h, K_w = K.shape
    P_h, P_w = (K_h-1)//2, (K_w-1)//2
    X_padded = np.pad(X, [(P_h, P_h), (P_w, P_w)], mode='constant')
    Y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            Y[i, j] = np.sum(X_padded[i:i+K_h, j:j+K_w]*K)
    return Y
image = gray_img     # the grayscale image loaded earlier
kernel = np.array([[-1,-1,-1],[-1,8,-1],[-1,-1,-1]])
edges = convolve2d_same(image, kernel)
plt.imshow(edges, cmap=plt.cm.gray)
Figure 6-22 The smoothing convolution kernel smoothes the image, and the
resulting image becomes blurred
Span (stride)
The span of the convolution above is 1: the kernel moves pixel by pixel "from top to bottom, from left to right", and the resulting image size is close to the original. To generate a smaller convolution image, such as one half the size of the original, the kernel can slide 2 pixels at a time in the horizontal and vertical directions; that is, the convolution is performed with a span of 2. Span is also sometimes referred to as "stride".
In general, if the input image is H × W, the kernel is F_h × F_w, the total numbers of elements padded top-and-bottom and left-and-right are P_h and P_w, and the vertical and horizontal strides are S_h and S_w, the output two-dimensional feature map has size:
(H − F_h + P_h)/S_h + 1
(W − F_w + P_w)/S_w + 1
As shown in Figure 6-23, for a 6×6 input image, a 3×3 convolution kernel, a stride of 2, and 1 element of padding on each side, the resulting output image has size ((6+2−3)/2+1) × ((6+2−3)/2+1) = 3×3 (using integer division).
def convolve2d(X, K, pad=(0,0), stride=(1,1)):
    H, W = X.shape
    K_h, K_w = K.shape
    P_h, P_w = pad
    S_h, S_w = stride
    h = (H - K_h + 2*P_h)//S_h + 1
    w = (W - K_w + 2*P_w)//S_w + 1
    Y = np.zeros((h, w))
    if P_h != 0 or P_w != 0:
        X_padded = np.pad(X, [(P_h, P_h), (P_w, P_w)], mode='constant')
    else:
        X_padded = X
    for i in range(Y.shape[0]):
        hs = i*S_h
        for j in range(Y.shape[1]):
            ws = j*S_w
            Y[i, j] = (X_padded[hs:hs+K_h, ws:ws+K_w]*K).sum()
    return Y
image = gray_img
kernel = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])
image_filtered = convolve2d(image,kernel,(1,1),(2,2))
plt.imshow(image_filtered, cmap=plt.cm.gray)
print("Original image size:", image.shape)
print("Result image size:", image_filtered.shape)
The convolution above handles a single channel. For a multi-channel input X of shape (C, H, W) convolved with one multi-channel kernel K of shape (C, F_h, F_w), the computation is similar; the definition enclosing this fragment was truncated in the source and is reconstructed minimally here:
def convolve2d_multi_in(X, K, pad=(0,0), stride=(1,1)):
    C, H, W = X.shape
    _, F_h, F_w = K.shape
    P_h, P_w = pad
    S_h, S_w = stride
    h = (H + 2*P_h - F_h)//S_h + 1
    w = (W + 2*P_w - F_w)//S_w + 1
    Y = np.zeros((h, w))    # convolution output
    if P_h != 0 or P_w != 0:
        X_padded = np.pad(X, [(0,0),(P_h,P_h),(P_w,P_w)], mode='constant')
    else:
        X_padded = X
    for i in range(h):
        hs = i*S_h
        for j in range(w):
            ws = j*S_w
            Y[i, j] = (X_padded[:, hs:hs+F_h, ws:ws+F_w]*K).sum()
    return Y
Figure 6-29 A 3 × 3 window performs pooling on a 6 × 6 matrix (image), producing a resulting image of size
4 × 4
Of course, the average of the values in the pooling window can also be taken as the output; this is called average pooling. Average pooling works like max pooling except that the elements in the data window are averaged instead of maximized.
Like the convolution window, the pooling window is usually square. The span of the pooling operation above is 1, i.e., the window moves one pixel at a time; in practice the span of a pooling operation is usually equal to the length or width of the pooling window.
As shown in Figure 6-30, the window length and span of the pooling operation are both 3, resulting in a result
image of 2 × 2, while the original image size is 6x6.
Figure 6-30 3 × 3 pooling window, performing a pooling operation with a span of 3 on 6 × 6, resulting in a result
image of 2 × 2
The main goal of pooling is to alleviate the excessive sensitivity of the convolution operation to position. The pooling layer retains the main features of the original image, and a pooling operation with span greater than 1 reduces the image size multiplicatively, producing a smaller feature map; this reduces the amount of computation in subsequent layers and improves efficiency.
Unlike the convolution operation, which convolves all channels of the input with one kernel, pooling is usually performed on each channel separately. The output therefore has as many channels as the input, as shown in Figure 6-31: if the input has 64 channels, the output also has 64 channels, i.e., each input channel generates one output channel.
Figure 6-31 Pooling is for each channel, and the pooling operation does not change the number of channels
Like convolution, a pooling operation has a span and may pad the original image before pooling. Analogous to the convolution code, the following pool2d() performs pooling on single-channel input data X:
def pool2d(X, pool, stride=(1,1), padding=(0,0), mode='max'):
    pool_h, pool_w = pool
    S_h, S_w = stride
    P_h, P_w = padding
    # fill
    if P_h or P_w:
        X_padded = np.pad(X, [(P_h,P_h),(P_w,P_w)], mode='constant')
    else:
        X_padded = X
    h = (X.shape[0] - pool_h + 2*P_h)//S_h + 1
    w = (X.shape[1] - pool_w + 2*P_w)//S_w + 1
    Y = np.zeros((h, w), dtype=X.dtype)
    for i in range(h):
        hs = i*S_h
        for j in range(w):
            ws = j*S_w
            win = X_padded[hs:hs+pool_h, ws:ws+pool_w]
            Y[i, j] = win.max() if mode == 'max' else win.mean()
    return Y
For the two-dimensional matrix shown in Figure 6-30 (X is the 6 × 6 matrix above), max pooling and average pooling with a span of 3 and a 3 × 3 window give:
pool2d(X,(3,3),(3,3),(0,0),mode ='max')
array([[8, 9],
       [9, 8]])
pool2d(X,(3,3),(3,3),(0,0),mode ='avg')
array([[2, 4],
[4, 5]])
Similarly, for multi-channel input, the single-channel pooling operation is simply performed on each channel. The following pool() function for multi-channel input data X adds an outer loop over the channels (for c in range(Y.shape[0])) around the original pooling loops:
def pool(X, pool, stride=(1,1), padding=(0,0), mode='max'):
    pool_h, pool_w = pool
    S_h, S_w = stride
    P_h, P_w = padding
    if P_h or P_w:
        X_padded = np.pad(X, [(0,0),(P_h,P_h),(P_w,P_w)], mode='constant')
    else:
        X_padded = X
    Y_h = (X.shape[1]-pool_h+2*P_h)//S_h+1
    Y_w = (X.shape[2]-pool_w+2*P_w)//S_w+1
    Y = np.zeros((X.shape[0], Y_h, Y_w), dtype=X.dtype)
    for c in range(Y.shape[0]):
        for i in range(Y.shape[1]):
            hs = i*S_h
            for j in range(Y.shape[2]):
                ws = j*S_w
                if mode == 'max':
                    Y[c,i, j] = X_padded[c, hs:hs+pool_h, ws:ws+pool_w].max()
                elif mode == 'avg':
                    Y[c,i, j] = X_padded[c, hs:hs+pool_h, ws:ws+pool_w].mean()
    return Y
(2, 3, 3)
(2, 2, 2)
array([[[ 4, 5],
[ 7, 8]],
[[11, 16],
[71, 16]]])
Using the pool() function with a 5 × 5 window and a span of (2,2) to pool the image, the resulting image is only about half the size of the original:
img = np.moveaxis(lenna_img, -1, 0)   # np.rollaxis(lenna_img, 2, 0)
pooled_img = pool(img, [5,5], (2,2))
pooled_img = np.moveaxis(pooled_img, 0, -1)  # move the channel axis 0 back to the last axis -1
plt.imshow(pooled_img, cmap=plt.cm.gray)
print("Original image size:", img.shape)
print("Result image size:", pooled_img.shape)
Figure 6-32 Pooling the image with the stride (2,2), the resulting image is only half the size of the original image
Unlike fully connected neurons, a convolutional neuron uses a convolution kernel to convolve the input sample, and the number of kernel parameters is often much smaller than the number of features in the sample. For example, for a 3 × 64 × 64 color image, a convolutional neuron with a 3 × 4 × 4 convolution kernel has only 48 parameters. This small number of weight parameters, compared with fully connected neurons, helps prevent overfitting. Moreover, a fully connected neuron produces a single output value, so a fully connected layer needs many neurons to extract enough features, whereas a convolutional neuron produces a feature map of many output values; a convolutional layer composed of convolutional neurons therefore needs only a small number of neurons.
As shown in Figure 6-33, unlike a fully connected neuron that outputs a single value, the convolution kernel moves along the input "from top to bottom, from left to right". Each time the kernel window aligns with a data window it generates one output value, so moving the kernel along the input produces many output values arranged as regularly as the original data. These regularly arranged output values are called a feature map: a multi-channel image is input and a single-channel feature map is output. The convolution operation preserves and captures the inherent spatial structure between adjacent data elements (pixels) of the original data, so the inherent characteristics of the data are captured better, improving the quality of the neural network function.
Figure 6-33 The data window aligned with the kernel window of a convolutional neuron generates one output value; as the kernel window moves along the input data, the series of regularly arranged output values constitutes an output feature map. Multiple convolution kernels generate multiple feature maps.
For a multi-channel input tensor, the operation of the convolutional neuron can be expressed by the following
formula:
a_{i′,j′} = g(∑_{k=0}^{F_k−1} ∑_{i=0}^{F_h−1} ∑_{j=0}^{F_w−1} w_{i,j,k} x_{i+i′,j+j′,k} + b)
Like the neurons seen before, each convolutional neuron also has a bias b and an activation function g, and the result of the convolution is output after passing through the activation function. Although the convolution kernel of a convolutional neuron has the same number of channels as the input, its spatial size is usually much smaller than that of the input image.
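As an illustration, a minimal numpy sketch of a single convolutional neuron with a 3-channel 3 × 3 kernel, a bias, and a Relu activation (all values random, for shape-checking only):
import numpy as np
def relu(z): return np.maximum(0, z)
np.random.seed(0)
x = np.random.randn(3, 5, 5)      # 3-channel input
w = np.random.randn(3, 3, 3)      # one convolutional neuron's kernel
b = 0.1
oH, oW = 5-3+1, 5-3+1             # valid convolution output size
a = np.zeros((oH, oW))
for i in range(oH):
    for j in range(oW):
        a[i, j] = relu((w * x[:, i:i+3, j:j+3]).sum() + b)
print(a.shape)                    # (3, 3): a single-channel feature map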
If a convolutional layer has multiple neurons, neuron k′ computes:
a_{i′,j′,k′} = g_{k′}(∑_{k=0}^{F_k−1} ∑_{i=0}^{F_h−1} ∑_{j=0}^{F_w−1} w_{i,j,k,k′} x_{i+i′,j+j′,k} + b_{k′})
For a multi-channel input, each convolutional neuron (with the same number of channels as the input) produces one output feature map. If a convolutional layer has k′ neurons, k′ feature maps, i.e. k′ output channels, are generated: the neurons of a convolutional layer generate the same number of feature maps. As shown in Figure 6-34, for 3-channel input data, 2 convolutional neurons output a 2-channel feature map (each circle represents one convolutional neuron). Just as the output of one fully connected layer can be the input of the next, the multi-channel feature map output by one convolutional layer can be the multi-channel input of the next convolutional layer.
Figure 6-34 Each convolutional neuron is a 3 × 3 × 3 convolution kernel; for a 3 × 3 × 3 local data window of the 3-channel input, the 2 convolutional neurons produce 2 output values. For 3-channel input, each convolutional neuron produces a single-channel feature map, and 2 convolutional neurons produce a 2-channel feature map.
A convolutional layer is usually followed by a pooling layer, which performs a simple pooling operation such as max pooling or average pooling. The pooling layer pools each input feature map separately to generate a new feature map, so pooling does not change the number of feature maps: inputting 3 feature maps generates 3 output feature maps, i.e., the number of output channels equals the number of input channels.
The role of the pooling layer is to reduce the size of the feature maps output by the convolutional layer and improve training efficiency; the pooling layer contains no model parameters. As shown in Figure 6-35, the input is a single-channel 10 × 10 feature map (matrix or image); a convolutional layer of 6 3 × 3 convolutional neurons with a stride of 1 produces 6 8 × 8 feature maps, and a subsequent pooling layer with a 2 × 2 window and a span of 2 produces 6 4 × 4 feature maps.
Figure 6-35 A single-channel 1 × 10 × 10 input feature map passes through 6 3 × 3 convolutional neurons with a span of 1, producing 6 8 × 8 feature maps (6 × 8 × 8), and then through a pooling layer with a 2 × 2 window and a span of 2, producing 6 feature maps of size 4 × 4
A neural network containing convolutional layers is called a convolutional neural network. Its network layers include both convolutional layers and fully connected layers: usually the initial layers are convolutional, and the last layers close to the output are fully connected.
Figure 6-36 shows a typical convolutional neural network structure. The input is a single-channel 28 × 28 image (feature map). A convolutional layer of 8 5 × 5 convolutional neurons with a stride of 1 produces 8 24 × 24 feature maps (8 output channels); a pooling layer with a 2 × 2 window and a span of 2 then outputs 8 channels of 12 × 12 feature maps. A convolutional layer of 16 5 × 5 convolutional neurons with a span of 1 then generates 16 8 × 8 feature maps, and another pooling layer with a 2 × 2 window and a span of 2 outputs 16 channels of 4 × 4 feature maps. A fully connected layer of 64 neurons then produces 64 output values, and these pass through a final fully connected layer of 10 neurons that outputs 10 values.
Figure 6-36 A single-channel 1 × 28 × 28 input passes through 8 5 × 5 convolutional neurons with a span of 1 to output 8 24 × 24 feature maps; a pooling layer with span 2 and a 2 × 2 window outputs 8 12 × 12 feature maps; 16 5 × 5 convolutional neurons with a span of 1 output 16 8 × 8 feature maps; a pooling layer with span 2 and a 2 × 2 window outputs 16 4 × 4 feature maps; a flattening operation then converts these into a vector of length 256, a fully connected layer outputs a vector of length 64, and a final fully connected layer outputs a vector of length 10
When the feature maps of the last convolutional (or pooling) layer are passed to the fully connected layers, they are flattened, i.e., converted into a one-dimensional vector, which the fully connected neurons then process and output. Note that in Figure 6-36 the convolutional layers themselves are not drawn; only their input and output feature maps are shown.
Convolutional neural networks were originally used mainly for computer vision and image processing problems, such as classifying an input image. For image data, convolutional neural networks usually contain multiple "convolution + pooling" stages that extract image features from low level to high level while pooling reduces the image size. In the last stage of the network, the small multi-channel feature map is expanded into a one-dimensional vector, the so-called feature map flattening (flatten) operation, and this one-dimensional feature vector is further transformed by fully connected layers composed of fully connected neurons.
The three most commonly used layers of a convolutional neural network are therefore the convolutional layer, the pooling layer (usually max pooling), and the fully connected layer, usually abbreviated conv, pool, and fc. For example, a network structure can be described with the following shorthand:
INPUT -> [[CONV -> Relu]*N -> POOL?]*M -> [FC -> Relu]*K -> FC
Here *N means that [CONV -> Relu] is repeated N times, *M means that the combination [[CONV -> Relu]*N -> POOL?] is repeated M times, and similarly *K means that the fully connected block [FC -> Relu] is repeated K times, i.e., there are K such fully connected layers. Relu indicates that the activation function is Relu. Convolutional layers generally use the Relu activation function because functions such as σ(x) have derivatives that become very small as the absolute value of x grows, so the gradient cannot be transmitted effectively during the reverse derivation; this is the "vanishing gradient" problem, which becomes more serious as the network depth increases, and the Relu function does not have this problem.
Convolution kernels with different weights can extract different features. As shown in Figure 6-37, a convolution
layer uses multiple convolution kernels to extract different feature maps.
Figure 6-37 Convolution kernels with different weights extract different features, and one convolution layer can
use multiple convolution kernels to obtain different feature maps
Applying convolution over multiple layers generates hierarchical convolution images, extracting features of different granularity from low-level features to high-level features. Figure 6-38 shows convolution images of features at different levels obtained by applying convolution.
Figure 6-38 Multiple convolutional layers extract features from low level to high level: the convolutional layer close to the input extracts image edge or color features, subsequent layers extract edge intersections or color shading, and further convolutional layers extract meaningful structures or objects
It can be seen that the convolutional layers close to the input find edge or color features of the image; subsequent layers build more complex structures on this basis, such as intersections of edges or shaded colors; and later layers combine all of these to recognize meaningful structures or objects in the image, gradually extracting ever higher-level features. This extraction process, from the lowest-level edge features to high-level shape features, resembles the way humans observe the world.
6.2.3 Reverse derivation and code implementation of convolutional layer and pooling
layer
A convolutional neural network differs from a fully connected one in its convolutional (and pooling) layers; adding convolutional and pooling layers to the fully connected neural network implemented earlier is enough to realize a convolutional neural network. Section 6.1 has already implemented the forward calculation of the convolutional and pooling layers; the key is how to implement their reverse derivation.
Consider first the one-dimensional convolution z_i = w_0 x_i + w_1 x_{i+1} + ⋯ + w_{K−1} x_{i+K−1} + b. Because z_i is a weighted sum of the window x[i : i+K], we have
∂z_i/∂w = (x_i, x_{i+1}, ⋯, x_{i+K−1})
therefore:
dw = ∂L/∂w = ∑_i (∂L/∂z_i)(∂z_i/∂w) = ∑_i (∂L/∂z_i)(x_i, x_{i+1}, ⋯, x_{i+K−1})
For example, for K = 3:
z0 = x0 w0 + x1 w1 + x2 w2 + b
z1 = x1 w0 + x2 w1 + x3 w2 + b
therefore:
(∂L/∂z_0)(∂z_0/∂w) = ((∂L/∂z_0)x_0, (∂L/∂z_0)x_1, (∂L/∂z_0)x_2)
(∂L/∂z_1)(∂z_1/∂w) = ((∂L/∂z_1)x_1, (∂L/∂z_1)x_2, (∂L/∂z_1)x_3)
therefore,
∂L/∂w = (∂L/∂w_0, ∂L/∂w_1, ∂L/∂w_2) = ∑_i (∂L/∂z_i)(∂z_i/∂w)
      = (∂L/∂z_0)(x_0, x_1, x_2) + (∂L/∂z_1)(x_1, x_2, x_3) + ⋯ + (∂L/∂z_7)(x_7, x_8, x_9)
Figure 6-39 shows how (∂L/∂z_0)(x_0, x_1, x_2), (∂L/∂z_1)(x_1, x_2, x_3), ⋯ accumulate into (∂L/∂w_0, ∂L/∂w_1, ∂L/∂w_2).
And ∂L/∂b = ∑_i (∂L/∂z_i)(∂z_i/∂b) = ∑_i ∂L/∂z_i, that is, all the ∂L/∂z_i are added up. Using dw, dz, db to represent ∂L/∂w, ∂L/∂z, ∂L/∂b, the following python code computes dw and db:
for i in range(z.size):
    dw += x[i:i+K]*dz[i]
db = dz.sum()
How is the gradient of L with respect to the input x computed? Since z_i depends only on x_i, ⋯, x_{i+K−1}, we have ∂z_i/∂x_j = 0 for all other j, i.e.:
∂z_i/∂x = (0, ⋯, 0, ∂z_i/∂x_i, ⋯, ∂z_i/∂x_{i+K−1}, 0, ⋯) = (0, ⋯, 0, w_0, ⋯, w_{K−1}, 0, ⋯)
Therefore, z_i contributes to the partial derivatives of the loss L only with respect to x_i, x_{i+1}, ⋯, x_{i+K−1}, and:
∂L/∂x[i : i+K] += (∂L/∂z_i) w
For example, for z_0:
(∂L/∂x_0, ∂L/∂x_1, ∂L/∂x_2) += ((∂L/∂z_0)w_0, (∂L/∂z_0)w_1, (∂L/∂z_0)w_2)
that is:
(∂L/∂x_0, ∂L/∂x_1, ∂L/∂x_2) += (∂L/∂z_0) w
The corresponding code accumulates w*dz[i] into the window of dx:
for i in range(z.size):
    dx[i:i+K] += w*dz[i]
For a convolution with span S, each z_i is the weighted sum of the data window x[i*S : i*S+K], so the formulas above extend to padded and strided convolution:
∂L/∂w = ∑_{i=0}^{(n−K)//S} (∂L/∂z_i) x[i*S : i*S+K]
∂L/∂x[i*S : i*S+K] += (∂L/∂z_i) w
Of course, for convolution with padding, x must be padded before convolving, and the same padding applies in the reverse derivation. The reverse derivation of a convolution with stride and padding can therefore be handled with the following python code:
def conv_backward(dz, x, w, p=0, s=1):
    n, K = len(x), len(w)
    o_n = 1 + (n + 2*p - K) // s
    assert o_n == len(dz)
    x_pad = np.pad(x, p, mode='constant') if p > 0 else x
    dx_pad = np.zeros(n + 2*p)
    dw = np.zeros_like(w)
    db = dz.sum()
    for i in range(o_n):
        start = i * s
        dw += x_pad[start:start+K]*dz[i]
        dx_pad[start:start+K] += w*dz[i]
    dx = dx_pad[p:-p] if p > 0 else dx_pad
    return dx, dw, db
# simulated upstream gradient; with stride 1 and pad 1 the output
# length is 1 + (n + 2 - K)
dz = np.random.randn(1 + (x.size + 2*1 - w.size))
print(dz)
dx, dw, db = conv_backward(dz, x, w, 1)
print(dx)
print(dw)
print(db)
In the same way, the reverse derivation of one-dimensional convolution extends to two-dimensional convolution with multiple input and output channels. Figure 6-39 is a schematic diagram of the gradient computation for a single input channel and a single output channel.
Figure 6-39 z00 = x00·w00 + x01·w01 + x10·w10 + x11·w11; its gradients with respect to w00, w01, w10, w11 are x00, x01, x10, x11 respectively, and its gradients with respect to x00, x01, x10, x11 are w00, w01, w10, w11
For the two-dimensional case:
∂L/∂w = ∑_{ij} (∂L/∂z_{ij})(∂z_{ij}/∂w)
Each z_{ij} is the dot product of a data window and the kernel w, namely z_{ij} = x[i : i+K_h, j : j+K_w] ⋅ w, and
∂z_{ij}/∂w_{u,v} = x_{i+u,j+v}
Therefore ∂z_{ij}/∂w can be written as a matrix with the same shape as w:
∂z_{ij}/∂w = x[i : i+K_h, j : j+K_w]
therefore
∂L/∂w = ∑_{ij} (∂L/∂z_{ij}) x[i : i+K_h, j : j+K_w]
Similarly, because ∂z_{ij}/∂b = 1:
∂L/∂b = ∑_{ij} (∂L/∂z_{ij})(∂z_{ij}/∂b) = ∑_{ij} ∂L/∂z_{ij}
because
∂L/∂x = ∑_{ij} (∂L/∂z_{ij})(∂z_{ij}/∂x)
and z_{ij} = x[i : i+K_h, j : j+K_w] ⋅ w depends only on the data window starting at x_{ij}, namely x[i : i+K_h, j : j+K_w], with:
∂z_{ij}/∂x_{i+u,j+v} = w_{u,v}
therefore:
∂z_{ij}/∂x[i : i+K_h, j : j+K_w] = w
Therefore, (∂L/∂z_{ij}) w just needs to be added to the window [i : i+K_h, j : j+K_w] of ∂L/∂x, namely:
∂L/∂x[i : i+K_h, j : j+K_w] += (∂L/∂z_{ij}) w
For convolution with padding and span, x must first be padded before the reverse derivation, and the data window corresponding to each z_{ij} found according to the span S; the formulas become:
∂L/∂b = ∑_{ij} ∂L/∂z_{ij}
∂L/∂x[i*S : i*S+K_h, j*S : j*S+K_w] += (∂L/∂z_{ij}) w
The above are the formulas for single-channel input and single-channel output. For multi-channel input x, the weight tensor w of the convolution kernel is a 3D kernel; the gradient formulas are the same except for the extra channel axis:
∂L/∂w = ∑_{ij} (∂L/∂z_{ij}) x[:, i*S : i*S+K_h, j*S : j*S+K_w]
∂L/∂x[:, i*S : i*S+K_h, j*S : j*S+K_w] += (∂L/∂z_{ij}) w
For multi-channel output, w in the formulas above is replaced by the weight tensor w^f of each output channel f. Because x contributes to the feature map z^f of every output channel, the gradient with respect to x accumulates over the output channels:
∂L/∂x[:, i*S : i*S+K_h, j*S : j*S+K_w] += ∑_f (∂L/∂z^f_{ij}) w^f
∂L/∂w^f = ∑_{ij} (∂L/∂z^f_{ij}) x[:, i*S : i*S+K_h, j*S : j*S+K_w]
∂L/∂b^f = ∑_{ij} ∂L/∂z^f_{ij}
For a batch of samples, the gradients ∂L/∂w^f and ∂L/∂b^f are accumulated over all samples, whereas the inputs x of different samples are independent of each other, so ∂L/∂x is computed per sample and not accumulated across samples.
On the basis of the previous Layer class, the following code defines a Conv class representing the convolutional layer, which performs forward calculation (forward) and reverse derivation (backward) for multiple samples, multiple input channels, and multiple output channels. Conv's constructor accepts the number of input and output channels and the parameters of the convolution operation (kernel size, stride, padding). The forward() method accepts a multi-channel input X and generates the multi-channel output Z of the convolution. The backward() method accepts the gradient dZ of the loss function with respect to the output Z of the convolutional layer and computes the gradients of the loss with respect to the convolution parameters (W, b) and the input X.
import numpy as np
from init_weights import *

class Layer:
    def __init__(self):
        self.params = None
    def forward(self, x):
        raise NotImplementedError
    def backward(self, x, grad):
        raise NotImplementedError
    def reg_grad(self, reg):
        pass
    def reg_loss(self, reg):
        return 0.
    def reg_loss_grad(self, reg):
        return 0
class Conv(Layer):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.C = in_channels
        self.F = out_channels
        self.K = kernel_size
        self.S = stride
        self.P = padding
        # the filters W form a 4d array of shape (F, C, K, K);
        # Xavier initialization could also be used
        self.W = np.random.randn(self.F, self.C, self.K, self.K)  # /(self.K*self.K)
        self.b = np.random.randn(out_channels,)
        self.params = [self.W, self.b]
        self.grads = [np.zeros_like(self.W), np.zeros_like(self.b)]
        self.X = None
        self.reset_parameters()
    def reset_parameters(self):
        kaiming_uniform(self.W, a=math.sqrt(5))
        if self.b is not None:
            # fan_in, _ = calculate_fan_in_and_fan_out(self.K)
            fan_in = self.C
            bound = 1 / math.sqrt(fan_in)
            self.b[:] = np.random.uniform(-bound, bound, (self.b.shape))
    def __call__(self, X):
        return self.forward(X)
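    # The forward() listing is missing at this point in the source; below is a
    # minimal naive sketch (an assumption), consistent with what backward()
    # expects: it stores self.X and produces Z of shape (N, F, Z_h, Z_w).
    def forward(self, X):
        self.X = X
        N, C, X_h, X_w = X.shape
        F, _, F_h, F_w = self.W.shape
        pad, S = self.P, self.S
        Z_h = (X_h + 2*pad - F_h)//S + 1
        Z_w = (X_w + 2*pad - F_w)//S + 1
        X_pad = np.pad(X, [(0,0),(0,0),(pad,pad),(pad,pad)], mode='constant')
        Z = np.zeros((N, F, Z_h, Z_w))
        for n in range(N):
            for f in range(F):
                for i in range(Z_h):
                    hs = i*S
                    for j in range(Z_w):
                        ws = j*S
                        Z[n,f,i,j] = (X_pad[n,:,hs:hs+F_h,ws:ws+F_w]*self.W[f]).sum() + self.b[f]
        return Z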
    def backward(self, dZ):
        """A naive implementation of the backward pass for a convolutional layer.
        Input: dZ - upstream gradient of the loss with respect to the output Z.
        Accumulates the gradients with respect to W and b into self.grads and
        returns dX, the gradient with respect to the input X."""
        N, F, Z_h, Z_w = dZ.shape
        N, C, X_h, X_w = self.X.shape
        F, _, F_h, F_w = self.W.shape
        pad = self.P
        X_pad = np.pad(self.X, [(0,0),(0,0),(pad,pad),(pad,pad)], mode='constant')
        dX_pad = np.zeros_like(X_pad)
        dW = np.zeros_like(self.W)
        db = np.zeros_like(self.b)
        for n in range(N):
            for f in range(F):
                db[f] += dZ[n, f].sum()
                for i in range(Z_h):
                    hs = i * self.S
                    for j in range(Z_w):
                        ws = j * self.S
                        # W[f] aligned with the padded input window
                        dW[f] += X_pad[n, :, hs:hs+F_h, ws:ws+F_w]*dZ[n, f, i, j]
                        dX_pad[n, :, hs:hs+F_h, ws:ws+F_w] += self.W[f]*dZ[n, f, i, j]
        # "unpad"
        dX = dX_pad[:, :, pad:pad+X_h, pad:pad+X_w]
        self.grads[0] += dW
        self.grads[1] += db
        return dX
    def reg_loss(self, reg):
        return reg*np.sum(self.W**2)
    def reg_loss_grad(self, reg):
        self.grads[0] += 2*reg * self.W
        return reg*np.sum(self.W**2)
Here N is the number of samples, C the number of input channels, and F the number of output channels. The reverse derivation loops over each sample (for n in range(N)) and each output channel (for f in range(F)), computing db_f = ∂L/∂b_f, dW_f = ∂L/∂w_f, and dX[n] = ∂L/∂x[n].
Conv's forward calculation method forward() can be tested with randomly generated input data, printing the values of the first channel of the first sample:
np.random.seed(1)
x = np.random.randn(4, 3, 5, 5)
conv = Conv(3,2,3,1,1)
f = conv.forward(x)
print(f.shape)
print(f[0,0],"\n")
(4, 2, 5, 5)
[[ 0.46362714 -0.83578144 0.40298519 -0.32152652 0.56616046]
[-0.47878018 1.02346756 0.20004975 0.59663092 0.25253169]
[-0.39733747 -0.08368194 0.52454712 0.54133918 -0.32698456]
[0.47703053 -0.01967369 1.13655418 0.22321357 0.77693417]
[-0.23944267 0.62971182 -0.38411731 0.42818679 -0.07566246]]
The backward() method accepts the gradient of the loss function with respect to the output f and computes the gradients of the loss with respect to the convolution parameters (W, b) and the input X. To test the method, the following code feeds it a simulated gradient, denoted df = ∂L/∂f:
df = np.random.randn(4, 2, 5, 5)
dx= conv.backward(df)
print(df[0,0],"\n")
print(dx[0,0],"\n")
print(conv.grads[0][0,0],"\n")
print(conv.grads[1],"\n")
[11.528173 7.46555585]
The reverse derivation of the convolutional layer is comparatively complicated. The numerical gradient function numerical_gradient_from_df() in the earlier util.py can be used to compute numerical gradients and compare them with the analytical gradients of the reverse derivation to check its correctness.
import util
def f():
    return conv.forward(x)
dw_num = util.numerical_gradient_from_df(f, conv.W, df)
db_num = util.numerical_gradient_from_df(f, conv.b, df)
dx_num = util.numerical_gradient_from_df(f, x, df)
diff_error = lambda x, y: np.max(np.abs(x - y))
print(diff_error(conv.grads[0], dw_num))
print(diff_error(conv.grads[1], db_num))
print(diff_error(dx, dx_num))
6.533440455314121e-11
3.7474023883987684e-11
3.998808228988793e-11
The numerical and analytical gradients of the loss function with respect to the model parameters w, b and the input x agree.
Next consider the reverse derivation of the pooling layer. The forward calculation of max pooling is:
z_{ij} = max(x[i : i+K_h, j : j+K_w])
For example, if the maximum of the shaded data window is x_11 = z_00, then ∂L/∂x_11 = (∂L/∂z_00)(∂z_00/∂x_11) = ∂L/∂z_00 is nonzero, while ∂L/∂x_{ij} = 0 for all ij ≠ 11.
Figure 6-40 The result z00 of the shaded data window produced by the pool operation is equal to x11, so only ∂L/∂x11 ≠ 0
Therefore, the gradient calculation of the max pooling layer is simple. For z_{ij} = max(x[i : i+K_h, j : j+K_w]), let x_{i+u,j+v} be the data value equal to z_{ij} in the window x[i : i+K_h, j : j+K_w]; simply add each ∂L/∂z_{ij} to the partial derivative ∂L/∂x_{i+u,j+v} corresponding to this x_{i+u,j+v}.
The pooling layer can be implemented as a Pool class (the class head was truncated in the source and is reconstructed minimally here):
class Pool(Layer):
    def __init__(self, pool):
        super().__init__()
        # pool = (pool_h, pool_w, stride)
        self.pool_h, self.pool_w, self.stride = pool
    def forward(self, x):
        self.x = x
        N, C, H, W = x.shape
        pool_h, pool_w, stride = self.pool_h, self.pool_w, self.stride
        h_out = 1 + (H - pool_h) // stride
        w_out = 1 + (W - pool_w) // stride
        out = np.zeros((N, C, h_out, w_out))
        for n in range(N):
            for c in range(C):
                for i in range(h_out):
                    si = stride*i
                    for j in range(w_out):
                        sj = stride*j
                        x_win = x[n, c, si:si+pool_h, sj:sj+pool_w]
                        out[n,c,i,j] = np.max(x_win)
        return out
    def backward(self, dout):
        x = self.x
        N, C, H, W = x.shape
        kH, kW, stride = self.pool_h, self.pool_w, self.stride
        oH = 1 + (H - kH) // stride
        oW = 1 + (W - kW) // stride
        dx = np.zeros_like(x)
        for k in range(N):
            for l in range(C):
                for i in range(oH):
                    si = stride * i
                    for j in range(oW):
                        sj = stride * j
                        slice = x[k, l, si:si+kH, sj:sj+kW]
                        slice_max = np.max(slice)
                        # route the gradient only to the max element(s)
                        dx[k, l, si:si+kH, sj:sj+kW] += (slice_max == slice)*dout[k, l, i, j]
        return dx
Similarly, numerical gradients can be used to verify the correctness of the analytical gradient of the Pool class:
x = np.random.randn(3, 2, 8, 8)
df = np.random.randn(3, 2, 4, 4)
pool = Pool((2,2,2))
f = pool.forward(x)
dx = pool.backward(df)
dx_num = util.numerical_gradient_from_df(lambda: pool.forward(x), x, df)
print(np.max(np.abs(dx - dx_num)))
1.680655614677562e-11
The convolutional-layer neurons implemented above omit the activation function. The activation function can be added to the Conv class just as for fully connected neurons, or (as in Chapter 4) the weighted sum z of the convolutional and fully connected layers can be passed through an activation function implemented as a separate class that outputs the activation value a:
a = g(z)
∂L/∂z = (∂L/∂a) g′(z)
The following is the implementation of the forward calculation and reverse derivation corresponding to the
activation function Relu:
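The Relu listing itself is missing at this point in the source; below is a minimal sketch consistent with the Layer interface used above:
class Relu(Layer):
    def forward(self, x):
        self.x = x
        return np.maximum(0, x)
    def backward(self, grad):
        # g'(z) is 1 where z > 0 and 0 elsewhere
        return grad * (self.x > 0)
The layers are assembled by a NeuralNetwork container class: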
class NeuralNetwork:
    def __init__(self):
        self._layers = []
        self._params = []
    def add_layer(self, layer):
        # each entry of _params is a [parameter, gradient] pair, as assumed
        # by zero_grad() below (the source listing is truncated here)
        self._layers.append(layer)
        if layer.params:
            for i, _ in enumerate(layer.params):
                self._params.append([layer.params[i], layer.grads[i]])
    def reg_loss(self, reg):
        reg_loss = 0
        for i in range(len(self._layers)):
            reg_loss += self._layers[i].reg_loss(reg)
        return reg_loss
    def parameters(self):
        return self._params
    def zero_grad(self):
        for i, _ in enumerate(self._params):
            self._params[i][1] *= 0.
To test the convolutional layer, first read the training set of MNIST handwritten digits:
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
    # Load the dataset
    urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
                               "mnist.pkl.gz")
with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
train_X, train_y = train_set
print(train_X.shape)
train_X = train_X.reshape(-1, 1, 28, 28)   # (N, C, H, W) for the Conv layer
print(train_X.shape)
(50000, 784)
(50000, 1, 28, 28)
The convolutional neural network defined below classifies and trains on MNIST handwritten digit recognition:
import train
#from NeuralNetwork import *
import time
np.random.seed(1)
#nn = ConvNetwork()
nn = NeuralNetwork()
nn.add_layer(Conv(1,2,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Conv(2,4,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Dense(64, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
epochs = 1
batch_size = 64
reg = 1e-3
print_n = 100
# the optimizer definition is missing in the source; assumed to follow the
# earlier Fashion-MNIST example
learning_rate, momentum = 0.01, 0.9
optimizer = SGD(nn.parameters(), learning_rate, momentum)
start = time.time()
X, y = train_X, train_y
losses = train.train_nn(nn, X, y, optimizer, util.loss_gradient_softmax_crossentropy,
                        epochs, batch_size, reg, print_n)
done = time.time()
elapsed = done - start
print(elapsed)
print(np.mean(nn.predict(X)==y))
[ 1, 1] loss: 2.303
[ 101, 1] loss: 2.293
[ 201, 1] loss: 2.302
[ 301, 1] loss: 2.251
[ 401, 1] loss: 2.149
[ 501, 1] loss: 1.684
[ 601, 1] loss: 0.749
[ 701, 1] loss: 0.711
2535.1755859851837
0.84184
The output of a fully connected neuron is a simple vector dot product xw. If there are K neurons in a fully connected layer, their column weight vectors can be combined into a matrix W = (w_1, w_2, ⋯, w_K). For a single input x, the layer produces the outputs:
xW = (xw_1, xw_2, ⋯, xw_K)
If there are m input samples and each input sample is a row of a matrix, an m-row matrix X = (x_1; x_2; ⋯; x_m) is formed, and these m inputs pass through the layer as the matrix product XW.
That is, the weighted sum of a fully connected layer is easily realized as a matrix product. Although the convolution of a convolutional neuron can also be regarded as a tensor dot product between the kernel and the corresponding data window, it cannot be directly expressed as a vector dot product or matrix product, and input with multiple neurons and multiple channels even less so. The earlier convolution code realizes the convolutional layer through many nested loops; such multi-layer loop code cannot directly exploit vector parallelism, making the convolution inefficient.
6.3.1 Matrix multiplication of 1D sample convolution
Consider convolving the data vector (1, 2, 3, 4, 5) with the kernel (−1, 2, 1): sliding the kernel along the data vector and computing the weighted sum of each aligned data window with the kernel yields one value at a time:
(1, 2, 3) ⋅ (−1, 2, 1) = 6
(2, 3, 4) ⋅ (−1, 2, 1) = 8
(3, 4, 5) ⋅ (−1, 2, 1) = 10
If the data in each window is used as a row of a matrix, recorded as x_row, and the convolution kernel is converted into a column vector, recorded as K_col, then the convolution result tensor can be expressed as the product of the two matrices:
z_row = x_row K_col = [[1, 2, 3],
                       [2, 3, 4],
                       [3, 4, 5]] ⋅ [−1, 2, 1]^T = [6, 8, 10]^T
If the input tensor length is n, the kernel length is k, the span is s, and p zeros are padded before and after, the length of the result tensor of the convolution is
o = (n − k + 2p)/s + 1
For the above example, o = (5 − 3 + 0)/1 + 1 = 3.
If there are 2 samples, the same flattening operation is applied to each sample in turn. If x contains the following 2 samples:
x = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10]]
then x_row is a matrix of 6 rows, and the convolution can be expressed as:
z_row = x_row K_col = [[1, 2, 3],
                       [2, 3, 4],
                       [3, 4, 5],
                       [6, 7, 8],
                       [7, 8, 9],
                       [8, 9, 10]] ⋅ [−1, 2, 1]^T = [6, 8, 10, 16, 18, 20]^T
Reshaping z_row by sample gives:
z = [[6, 8, 10],
     [16, 18, 20]]
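A minimal numpy sketch of this flattening (the names x_row and K_col follow the text):
import numpy as np
x = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10]])
w = np.array([-1, 2, 1])
K = w.size
n_o = x.shape[1] - K + 1                  # valid convolution length
# every sliding window of every sample becomes one row
x_row = np.array([s[i:i+K] for s in x for i in range(n_o)])
K_col = w.reshape(-1, 1)
z = (x_row @ K_col).reshape(x.shape[0], n_o)
print(z)    # [[ 6  8 10], [16 18 20]]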
6.3.2 Matrix multiplication of 2D sample convolution
If the input data has only one sample with one channel, the input is a tensor of shape (1, 1, H, W), where H, W are the resolution of this 2D sample, like the (1, 1, 3, 3) sample X below:
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
For a convolution with span S = 1 and surrounding padding P = 1, the original data must first be padded; the padded data X_pad is:
X_pad = [[0, 0, 0, 0, 0],
         [0, 1, 2, 3, 0],
         [0, 4, 5, 6, 0],
         [0, 7, 8, 9, 0],
         [0, 0, 0, 0, 0]]
Slide the (1, 1, 2, 2) convolution kernel along X_pad and compute the weighted sum of each aligned data window with the corresponding kernel elements. As the kernel slides "from top to bottom, from left to right", a 4 × 4 feature map is generated. The 16 data windows weighted and summed with the kernel are:
X0 = [[0,0],[0,1]]   X1 = [[0,0],[1,2]]   X2 = [[0,0],[2,3]]   X3 = [[0,0],[3,0]]
X4 = [[0,1],[0,4]]   X5 = [[1,2],[4,5]]   X6 = [[2,3],[5,6]]   X7 = [[3,0],[6,0]]
X8 = [[0,4],[0,7]]   X9 = [[4,5],[7,8]]   X10 = [[5,6],[8,9]]  X11 = [[6,0],[9,0]]
X12 = [[0,7],[0,0]]  X13 = [[7,8],[0,0]]  X14 = [[8,9],[0,0]]  X15 = [[9,0],[0,0]]
Turn each window data block into a row of a matrix; the rows of all these data blocks form a matrix, recorded as X_row, and the convolution kernel (here [[1, 2], [3, 4]]) is flattened into a column vector K_col:
X_row = [[0, 0, 0, 1],
         [0, 0, 1, 2],
         [0, 0, 2, 3],
         [0, 0, 3, 0],
         [0, 1, 0, 4],
         ⋮           ]   (16×4)
K_col = [1, 2, 3, 4]^T   (4×1)
The convolution operation can then be expressed as the product $X_{row} K_{col}$ of these two matrices. If the input sample has multiple channels, such as a sample of shape (1, 2, 3, 3), its two channels are recorded as $X_0$ and $X_1$:
$$X_0 = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}_{3 \times 3} \qquad X_1 = \begin{bmatrix} 11 & 12 & 13 \\ 14 & 15 & 16 \\ 17 & 18 & 19 \end{bmatrix}_{3 \times 3}$$
The convolution kernel should also be a tensor with the same number of channels, such as a tensor $K$ of shape (1, 2, 2, 2), whose two channels are recorded as $K_0$ and $K_1$:

$$K_0 = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}_{2 \times 2} \qquad K_1 = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}_{2 \times 2}$$
If a convolution with a span of 1 and a padding of 0 is performed, each time the kernel slides it computes a weighted sum with a 2-channel data block of shape $2 \times 2 \times 2$. Each such $2 \times 2 \times 2$ data block is flattened into one row; the rows corresponding to all sliding windows form a matrix $X_{row}$, and the convolution kernel is flattened into a column vector $K_{col}$ of length 8, as follows:

$$X_{row} = \begin{bmatrix} 1 & 2 & 4 & 5 & 11 & 12 & 14 & 15 \\ 2 & 3 & 5 & 6 & 12 & 13 & 15 & 16 \\ 4 & 5 & 7 & 8 & 14 & 15 & 17 & 18 \\ 5 & 6 & 8 & 9 & 15 & 16 & 18 & 19 \end{bmatrix}_{4 \times 8} \qquad K_{col} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \end{bmatrix}_{8 \times 1}$$
Multiplying the two matrices yields a 4 × 1 convolution result matrix. This matrix
can be reshaped into a convolution result tensor of (1, 1, 2, 2), that is, a single-
sample single-channel feature map.
In general, suppose the convolution kernel tensor has shape (F, C, kH, kW), where the four dimensions are the number of kernels, number of channels, height, and width, respectively. That is, the convolution layer has F convolution kernels, and the shape of each convolution kernel is (C, kH, kW).

Each sample is a tensor of shape (C, H, W); convolving it with one convolution kernel generates a feature map whose shape is recorded as (oH, oW), where oH, oW are the height and width of the feature map, satisfying:

$$oH = (H + 2P - kH)//S + 1, \quad oW = (W + 2P - kW)//S + 1$$

For N samples and F kernels, the output tensor of the convolution layer has shape (N, F, oH, oW).
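For example, these formulas can be evaluated directly (a quick check with assumed sizes):

H, W, kH, kW, S, P = 5, 5, 3, 3, 1, 1
oH = (H + 2*P - kH)//S + 1   # 5
oW = (W + 2*P - kW)//S + 1   # 5
print(oH, oW)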
As shown in Figure 6-42, each convolution kernel is flattened into a column vector of the same length $C \times kH \times kW$. The F flattened kernels form a matrix $K_{col}$ with F columns:

$$K_{col} = \begin{bmatrix} K_{col}^{(1)} & K_{col}^{(2)} & \cdots & K_{col}^{(F)} \end{bmatrix}$$
Figure 6-42 Each convolution kernel is flattened into a column vector, and each
data block corresponding to the size of the convolution kernel is flattened into a
row vector
Similarly, each data block with the same shape (C, kH, kW) as a convolution kernel is flattened into a row vector; one sample is flattened into $oH \times oW$ rows, and the rows of all N samples are stacked:

$$X_{row} = \begin{bmatrix} X_{row}^{(1)} \\ X_{row}^{(2)} \\ \vdots \\ X_{row}^{(N \times oH \times oW)} \end{bmatrix}$$
Each data block is reshaped into a row vector. If there is only one sample (i.e. N = 1), the row vector of the block at position (h, w) is put into the h*oW+w-th row of the result matrix:

X_row[h*oW+w,:] = np.reshape(patch,-1)

For N samples, the code reshapes each data block of the N samples into shape (N,-1) and puts it into the corresponding rows with step size oSize = oH × oW. The complete flattening function im2row() is as follows:
def im2row(x, kH, kW, S=1):
    N,C,H,W = x.shape          # x is assumed to be already padded
    oH = (H - kH)//S + 1
    oW = (W - kW)//S + 1
    oSize = oH*oW              # number of data blocks per sample
    row = np.zeros((N*oSize, C*kH*kW))
    for h in range(oH):
        hS = h * S
        hS_kH = hS + kH
        h_start = h*oW
        for w in range(oW):
            wS = w*S
            patch = x[:,:,hS:hS_kH,wS:wS+kW]
            row[h_start+w::oSize,:] = np.reshape(patch,(N,-1))
    return row
The function can first be tested with a single 2-channel sample:

x = np.arange(18).reshape(1,2,3,3)
print(x)
x_row = im2row(x,2,2)
print(x_row)

[[[[ 0  1  2]
   [ 3  4  5]
   [ 6  7  8]]
  [[ 9 10 11]
   [12 13 14]
   [15 16 17]]]]
[[ 0.  1.  3.  4.  9. 10. 12. 13.]
 [ 1.  2.  4.  5. 10. 11. 13. 14.]
 [ 3.  4.  6.  7. 12. 13. 15. 16.]
 [ 4.  5.  7.  8. 13. 14. 16. 17.]]
x = np.arange(36).reshape(2,2,3,3)
print(x)
x_row = im2row(x,2,2)
print(x_row)
[[[[ 0 1 2]
[ 3 4 5 ]
[ 6 7 8]]
[[ 9 10 11]
[12 13 14]
[15 16 17]]]
[[[18 19 20]
[21 22 23]
[24 25 26]]
[[27 28 29]
[30 31 32]
[33 34 35]]]]
[[ 0. 1. 3. 4. 9. 10. 12. 13.]
[ 1. 2. 4. 5. 10. 11. 13. 14.]
[ 3. 4. 6. 7. 12. 13. 15. 16.]
[ 4. 5. 7. 8. 13. 14. 16. 17.]
[18. 19. 21. 22. 27. 28. 30. 31.]
[19. 20. 22. 23. 28. 29. 31. 32.]
[21. 22. 24. 25. 30. 31. 33. 34.]
[22. 23. 25. 26. 31. 32. 34. 35.]]
The product $Z_{row} = X_{row} K_{col}$ is a matrix of shape $(N \times oH \times oW, F)$. It is reshaped into an $(N, oH, oW, F)$ tensor first, and then the 4th axis (axis=3) where F is located is exchanged to the 2nd axis position, transforming it into a tensor of shape (N, F, oH, oW):

Z = Z.reshape(N,oH,oW,-1)
Z = Z.transpose(0,3,1,2)
To sum up, the convolution operation of the convolution layer can be realized by
matrix multiplication, the code is as follows:
def conv_forward(X, K, S=1, P=0):
    N,C, H, W = X.shape
    F,C, kH,kW = K.shape
    if P==0:
        X_pad = X
    else:
        X_pad = np.pad(X, ((0, 0), (0, 0),(P, P), (P, P)), 'constant')
    X_row = im2row(X_pad, kH, kW, S)              # flatten data blocks into rows
    K_col = K.reshape(K.shape[0],-1).transpose()  # flatten kernels into columns
    Z_row = np.dot(X_row, K_col)
    oH = (X_pad.shape[2] - kH) // S + 1
    oW = (X_pad.shape[3] - kW) // S + 1
    Z = Z_row.reshape(N,oH,oW,-1)
    Z = Z.transpose(0,3,1,2)
    return Z
x = np.arange(9).reshape(1,1,3,3)+1
k = np.arange(4).reshape(1,1,2,2)+1
print(x)
print(k)
z = conv_forward(x,k)
print(z.shape)
print(z)
[[[[1 2 3]
[4 5 6]
[7 8 9]]]]
[[[[1 2]
[3 4]]]]
(1, 1, 2, 2)
[[[[37. 47.]
   [67. 77.]]]]
Another test with 2 samples, 2 channels, and 2 convolution kernels:

x = np.arange(36).reshape(2,2,3,3)
k = np.arange(16).reshape(2,2,2,2)
z = conv_forward(x,k)
print(z.shape)
print(z)

(2, 2, 2, 2)
[[[[ 268.  296.]
   [ 352.  380.]]
  [[ 684.  776.]
   [ 960. 1052.]]]
 [[[ 772.  800.]
   [ 856.  884.]]
  [[2340. 2432.]
   [2616. 2708.]]]]
Consider the reverse derivation, again in the 1D case, where the input and kernel are flattened into the matrices $x_{row}$ and $K_{col}$ and the convolution is written as their product. Let the convolution result matrix be $z_{row}$, that is:

$$z_{row} = \begin{bmatrix} z_0 \\ z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} x_0 & x_1 & x_2 \\ x_1 & x_2 & x_3 \\ x_2 & x_3 & x_4 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$
If the gradient of the loss function with respect to $z$ is $dz$, whose corresponding matrix form is $dz_{row}$, then:

$$dx_{row} = dz_{row} K_{col}^T, \qquad dK_{col} = x_{row}^T \, dz_{row}$$
$$dx_{row} = dz_{row} K_{col}^T = \begin{bmatrix} dz_0 \\ dz_1 \\ dz_2 \end{bmatrix} \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} = \begin{bmatrix} dz_0 w_0 & dz_0 w_1 & dz_0 w_2 \\ dz_1 w_0 & dz_1 w_1 & dz_1 w_2 \\ dz_2 w_0 & dz_2 w_1 & dz_2 w_2 \end{bmatrix}$$
This $dx_{row}$ is the flattened form of $dx$: each of its rows is the gradient of one output element $z_i$ with respect to the data block it depends on, and this data block has the same shape and size as the convolution kernel. As shown in Figure 6-44, the first row of $dx_{row}$ holds the gradients contributed by the output component $z_0$ to the 3 input components $x_0, x_1, x_2$ of its data block, all of which depend on $dz_0$:

$$dx_0 = dz_0 w_0, \quad dx_1 = dz_0 w_1, \quad dx_2 = dz_0 w_2$$

Similarly, each of the other rows contributes to the gradient of a different (possibly overlapping) input data block, as shown in Figure 6-45:

Figure 6-45 Each $dz_i$ contributes to the gradient of the elements $x_j$ of the data block on which $z_i$ depends
It can be seen that the forward calculation of convolution accumulates a weighted sum over each data block to obtain an output value $z_i$, while the reverse derivation distributes the gradient of each $z_i$ onto the elements of the data block it depends on. The distribution process of the reverse derivation is exactly the reverse of the accumulation process of the forward calculation.
Just as the input $x$ was flattened into $x_{row}$, the gradient $dx$ must be recovered from $dx_{row}$ by the inverse of the flattening process: each row of $dx_{row}$ is converted back into the gradient of one data block. Because different data blocks overlap, their gradients also overlap, and during this reverse flattening the overlapping gradients must be accumulated, as shown in Figure 6-45. Accumulating the gradient of each row onto the position of its corresponding original data block yields the final $dx$.
Let $X$ be the input and $K$, $b$ be the weight and bias of a convolutional layer; the output tensor $Z$ can be expressed as:

$$Z = \mathrm{conv}(X, K) + b$$

where $\mathrm{conv}(X, K)$ represents the convolution of the input $X$ with the convolution kernel weights $K$. In flattened matrix form this is $Z_{row} = X_{row} K_{col} + b$, where $X_{row}$, $K_{col}$, $Z_{row}$ are the input, weights, and output flattened into matrix form.
If the gradient $dZ$ of the loss function with respect to the output $Z$ is known, then according to formula (6-33), the gradient $db$ of the loss function with respect to $b$ is derived exactly as for the fully connected layer, namely db = np.sum(dZ, axis=(0,2,3)). That is, the gradient of the bias $b_k$ of each output channel is the accumulation of the gradients $dz_{i,k,h,w}$ over all pixel positions of all samples.
According to formula (6-34), and just as in the reverse derivation of the fully connected layer, the gradients of the loss function with respect to the flattened $X_{row}$ and $K_{col}$ can be obtained from the gradient $dZ_{row}$:

$$dX_{row} = dZ_{row} K_{col}^T$$
$$dK_{col} = X_{row}^T \, dZ_{row}$$
Because $K_{col}$ flattens each kernel of $K$ into one column vector channel by channel, each column of $dK_{col}$ only needs to be reshaped back to the shape of the corresponding kernel; that is, $dK$ is easily recovered by reversing the flattening of $K$.
$dX_{row}$ is a matrix of the same shape as the flattened matrix $X_{row}$ of $X$; each of its rows represents the gradient of a data block with the same shape (C, kH, kW) as a convolution kernel, and the data blocks represented by different rows of $X_{row}$ may overlap in $X$. Thus different rows of $dX_{row}$ represent gradients of possibly overlapping data blocks. Therefore, when restoring $dX$ from $dX_{row}$ by the inverse of the flattening process, these overlapping gradients must be accumulated. This process is exactly the same as in the previous 1D case.
def row2im(dx_row,oH,oW,kH,kW,S):
    nRow,K2C = dx_row.shape[0],dx_row.shape[1]
    C = K2C//(kH*kW)
    N = nRow//(oH*oW)   # number of samples
    oSize = oH*oW       # number of data blocks per sample
    H = (oH - 1) * S + kH
    W = (oW - 1) * S + kW
    dx = np.zeros([N,C,H,W])
    for i in range(oSize):
        row = dx_row[i::oSize,:]   # the i-th data block of all N samples
        h_start = (i // oW) * S
        w_start = (i % oW) * S
        dx[:,:,h_start:h_start+kH,w_start:w_start+kW] += row.reshape((N,C,kH,kW))
    return dx
Here oSize = oH × oW is the size of one feature map of $Z$ and also the number of data blocks that one input sample is divided into; oH and oW are the height and width of the grid of data blocks. The index i numbers the data blocks in the order the kernel slides "from top to bottom, from left to right". From i, the block subscripts (i // oW, i % oW) are obtained, and from these subscripts and the span S the height and width subscripts h_start, w_start of this data block in the original data matrix are computed; the i-th rows of dx_row are then accumulated at this position. Because there are N samples, the rows of adjacent samples at the same block position differ by oSize in the flattened matrix, so dx_row[i::oSize,:] fetches the gradients of the same data block of all N samples. The corresponding position in the original gradient tensor dx is dx[:,:,h_start:h_start+kH,w_start:w_start+kW].

Equivalently, row2im() can be written by looping directly over the output positions, mirroring the loop structure of im2row():
def row2im(dx_row,oH,oW,kH,kW,S):
nRow,K2C = dx_row.shape[0],dx_row.shape[1]
C = K2C//(kH*kW)
N = nRow//(oH*oW) # number of samples
oSize = oH*oW
H = (oH - 1) * S + kH
W = (oW - 1) * S + kW
dx = np.zeros([N,C,H,W])
for h in range(oH):
hS = h * S
hS_kH = hS + kH
h_start = h*oW
for w in range(oW):
wS = w*S
row =dx_row[h_start+w::oSize,:]
dx[:,:,hS:hS_kH,wS:wS+kW] += row.reshape(N,C,kH,kW)
return dx
You can test the above function row2im() with the following code:
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P = 1,2,1,0
nRow = oH*oW*N
K2C = C*kH*kW
a = np.arange(nRow*K2C).reshape(nRow,K2C)
#dx_row = np.arange(nRow*K2C).reshape(nRow,K2C)
dx_row = np.vstack((a,a))
print("dx_row",dx_row)
print(dx_row.shape)
dx = row2im(dx_row,oH,oW,kH,kW,S)
print(dx.shape)
print("dx[0,0,:,:]:",dx[0,0,:,:])
dx_row [[ 0  1  2  3  4  5  6  7]
[ 8 9 10 11 12 13 14 15]
[16 17 18 19 20 21 22 23]
[24 25 26 27 28 29 30 31]
[32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47]
[48 49 50 51 52 53 54 55]
[56 57 58 59 60 61 62 63]
[64 65 66 67 68 69 70 71]
[ 0 1 2 3 4 5 6 7]
[ 8 9 10 11 12 13 14 15]
[16 17 18 19 20 21 22 23]
[24 25 26 27 28 29 30 31]
[32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47]
[48 49 50 51 52 53 54 55]
[56 57 58 59 60 61 62 63]
[64 65 66 67 68 69 70 71]]
(18, 8)
(2, 2, 4, 4)
dx[0,0,:,:]: [[ 0. 9. 25. 17.]
[ 26. 70. 102. 60.]
[ 74. 166. 198. 108.]
[ 50. 109. 125. 67.]]
Based on the above discussion, the reverse derivation code of the convolutional layer is as follows (the flattened input X_row computed in the forward pass is passed in as a parameter):

def conv_backward(dZ,X_row,K,oH,oW,kH,kW,S=1,P=0):
    # Flatten dZ into a matrix with the same shape as Z_row
    F = dZ.shape[1]
    dZ_row = dZ.transpose(0,2,3,1).reshape(-1,F)  # (N,F,oH,oW) -> (N*oH*oW,F)
    # Gradient of the loss function with respect to the kernel parameters
    dK_col = np.dot(X_row.T,dZ_row)   # X_row.T @ dZ_row
    dK_col = dK_col.transpose(1,0)    # move the F axis to axis=0
    dK = dK_col.reshape(K.shape)
    db = np.sum(dZ,axis=(0,2,3))
    db = db.reshape(-1,F)
    # Gradient of the loss function with respect to the input
    K_col = K.reshape(K.shape[0],-1).transpose()
    dX_row = np.dot(dZ_row,K_col.T)
    dX_pad = row2im(dX_row,oH,oW,kH,kW,S)
    if P == 0:
        return dX_pad,dK,db
    return dX_pad[:, :, P:-P, P:-P],dK,db
The following code tests the convolution reverse derivation function conv_backward() above:

H,W = 4,4
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P,F = 1,3,1,0,4
dZ = np.arange(N*F*oH*oW).reshape(N,F,oH,oW)
X = np.arange(N*C*H*W).reshape(N,C,H,W)
if P==0:
    X_pad = X
else:
    X_pad = np.pad(X, ((0, 0), (0, 0),(P, P), (P, P)), 'constant')
K = np.arange(F*C*kH*kW).reshape(F,C,kH,kW)
X_row = im2row(X_pad,kH,kW,S)
dX,dK,db = conv_backward(dZ,X_row,K,oH,oW,kH,kW,S,P)
print(dX.shape)
print("dX[0,0,:,:]:",dX[0,0,:,:])
print(dK.shape)
print("dW[0,0,:,:]:",dK[0,0,:,:])
print(db.shape)
print("db:",db)
(1, 3, 4, 4)
dX[0,0,:,:]: [[1512. 3150. 3298. 1718.]
[3348. 6968. 7280. 3788.]
[3804. 7904. 8216. 4268.]
[2100. 4358. 4522. 2346.]]
(4, 3, 2, 2)
dW[0,0,:,:]: [[258. 294.]
[402. 438.]]
(1, 4)
db: [[ 36 117 198 279]]
The convolution operation moves the convolution kernel in the order "from top to bottom, from left to right" with the span as the step size, and at each position computes the weighted sum of the kernel with the corresponding (multi-channel) window data block, yielding one element of the output feature map.

The earlier flattening of the original data tensor into a matrix lays out the multi-channel data blocks as row vectors, in the same "top-to-bottom, left-to-right" order as the convolution weighted sums.

If the span differs from the size (height and width) of the convolution kernel, the data blocks visited in sequence overlap one another. Arranging these blocks in their calculation order produces a new tensor whose data blocks no longer overlap.
For example, for the following 4 × 4 tensor with one sample and one channel,

$$X = \begin{bmatrix} x_{00} & x_{01} & x_{02} & x_{03} \\ x_{10} & x_{11} & x_{12} & x_{13} \\ x_{20} & x_{21} & x_{22} & x_{23} \\ x_{30} & x_{31} & x_{32} & x_{33} \end{bmatrix}$$

a 2 × 2 convolution kernel with span 1 visits 3 × 3 = 9 overlapping data blocks. Arranged in the calculation order "from top to bottom, from left to right", they form the extended tensor:

$$\begin{bmatrix}
\begin{bmatrix} x_{00} & x_{01} \\ x_{10} & x_{11} \end{bmatrix} &
\begin{bmatrix} x_{01} & x_{02} \\ x_{11} & x_{12} \end{bmatrix} &
\begin{bmatrix} x_{02} & x_{03} \\ x_{12} & x_{13} \end{bmatrix} \\
\begin{bmatrix} x_{10} & x_{11} \\ x_{20} & x_{21} \end{bmatrix} &
\begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} &
\begin{bmatrix} x_{12} & x_{13} \\ x_{22} & x_{23} \end{bmatrix} \\
\begin{bmatrix} x_{20} & x_{21} \\ x_{30} & x_{31} \end{bmatrix} &
\begin{bmatrix} x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix} &
\begin{bmatrix} x_{22} & x_{23} \\ x_{32} & x_{33} \end{bmatrix}
\end{bmatrix}$$

Each data block, which is exactly the same size as the convolution kernel, can be flattened into a row vector, giving the matrix:

$$\begin{bmatrix} x_{00} & x_{01} & x_{10} & x_{11} \\ x_{01} & x_{02} & x_{11} & x_{12} \\ \vdots & \vdots & \vdots & \vdots \\ x_{22} & x_{23} & x_{32} & x_{33} \end{bmatrix}_{9 \times 4}$$
If there are multiple channels, each data block is a three-dimensional tensor (cuboid), and the process is similar.
Such as a tensor with 2 channels:
$$X_0 = \begin{bmatrix} x_{000} & x_{001} & x_{002} & x_{003} \\ x_{010} & x_{011} & x_{012} & x_{013} \\ x_{020} & x_{021} & x_{022} & x_{023} \\ x_{030} & x_{031} & x_{032} & x_{033} \end{bmatrix} \qquad X_1 = \begin{bmatrix} x_{100} & x_{101} & x_{102} & x_{103} \\ x_{110} & x_{111} & x_{112} & x_{113} \\ x_{120} & x_{121} & x_{122} & x_{123} \\ x_{130} & x_{131} & x_{132} & x_{133} \end{bmatrix}$$
According to the convolution calculation process, these data blocks are visited in order:

Figure 6-47 Each data block of the convolution calculation is a three-dimensional 2 × 2 × 2 block composed of the 2 × 2 windows of the 2 channels
That is, the matrix blocks at corresponding positions of the two channel matrices together constitute one data block. For channel 0 the blocks, arranged in calculation order, are:

$$\begin{bmatrix}
\begin{bmatrix} x_{000} & x_{001} \\ x_{010} & x_{011} \end{bmatrix} &
\begin{bmatrix} x_{001} & x_{002} \\ x_{011} & x_{012} \end{bmatrix} &
\begin{bmatrix} x_{002} & x_{003} \\ x_{012} & x_{013} \end{bmatrix} \\
\begin{bmatrix} x_{010} & x_{011} \\ x_{020} & x_{021} \end{bmatrix} &
\begin{bmatrix} x_{011} & x_{012} \\ x_{021} & x_{022} \end{bmatrix} &
\begin{bmatrix} x_{012} & x_{013} \\ x_{022} & x_{023} \end{bmatrix} \\
\begin{bmatrix} x_{020} & x_{021} \\ x_{030} & x_{031} \end{bmatrix} &
\begin{bmatrix} x_{021} & x_{022} \\ x_{031} & x_{032} \end{bmatrix} &
\begin{bmatrix} x_{022} & x_{023} \\ x_{032} & x_{033} \end{bmatrix}
\end{bmatrix}$$

and for channel 1 the arrangement is exactly analogous, with the leading channel subscript 1 (i.e. $x_{100}, x_{101}, \cdots, x_{133}$).
All these data blocks of the same size as the convolution kernel, arranged in the calculation order "from top to bottom, from left to right", form an extended tensor whose data comes from the original data tensor. From the subscripts of the extended tensor shown in Figure 6-47, it can be seen which subscripts of the original tensor each element comes from. In other words, as long as the subscript of each element of the extended data tensor in the original data tensor is known, the extended tensor can be generated directly from the original data tensor.

First look at the single-channel case; removing the letter x makes the subscripts of the extended tensor's elements in the original tensor easy to see:
$$\begin{bmatrix}
\begin{bmatrix} 00 & 01 \\ 10 & 11 \end{bmatrix} &
\begin{bmatrix} 01 & 02 \\ 11 & 12 \end{bmatrix} &
\begin{bmatrix} 02 & 03 \\ 12 & 13 \end{bmatrix} \\
\begin{bmatrix} 10 & 11 \\ 20 & 21 \end{bmatrix} &
\begin{bmatrix} 11 & 12 \\ 21 & 22 \end{bmatrix} &
\begin{bmatrix} 12 & 13 \\ 22 & 23 \end{bmatrix} \\
\begin{bmatrix} 20 & 21 \\ 30 & 31 \end{bmatrix} &
\begin{bmatrix} 21 & 22 \\ 31 & 32 \end{bmatrix} &
\begin{bmatrix} 22 & 23 \\ 32 & 33 \end{bmatrix}
\end{bmatrix}$$
Indexing the original data tensor with these subscripts generates the tensor composed of these data blocks. Observing the subscripts, all of them can be obtained from the initial upper-left block by moving "from top to bottom, from left to right" with the span as the step size. The subscript block in the upper left corner is:
$$\begin{bmatrix} 00 & 01 \\ 10 & 11 \end{bmatrix}$$
"From left to right" moves a span each time, that is, the column subscript increases by 1, and the subscripts of the
three data blocks in the first row can be obtained:
$$\begin{bmatrix} 00 & 01 \\ 10 & 11 \end{bmatrix} \quad
\begin{bmatrix} 01 & 02 \\ 11 & 12 \end{bmatrix} \quad
\begin{bmatrix} 02 & 03 \\ 12 & 13 \end{bmatrix}$$
Moving the first row of blocks "from top to bottom" by the span increases the row subscripts, which yields in turn the subscripts of all data blocks in the remaining 2 rows.

For the data block in the upper left corner, the row and column subscripts are i = 0, 1 and j = 0, 1 respectively, as shown in Figure 6-48:
Figure 6-48 Row and column subscripts of the data block in the upper left corner
Therefore, the row and column subscript combinations [(0,0),(0,1),(1,0),(1,1)] of the 4 elements of the upper-left data block can be obtained from the row subscript vector [0,0,1,1] and the column subscript vector [0,1,0,1]. Similarly, for any kH × kW convolution kernel, the row and column subscripts i0 and j0 of the upper-left data block's elements in the original tensor can be generated with python code:
import numpy as np
kH,kW = 2,2
i0 = np.repeat(np.arange(kH), kW)  # row subscripts [0,1] repeated along the column direction: [0,0,1,1]
print(i0)
j0 = np.tile(np.arange(kW), kH)    # column subscripts [0,1] tiled along the row direction: [0,1,0,1]
print(j0)

[0 0 1 1]
[0 1 0 1]
The elements of the data block in the upper left corner can be obtained by using the combined index of i0 and j0. To display such subscripts conveniently, the following helper builds a matrix of subscript strings:

def idx_matrix(H,W):
    a = np.empty((H,W), dtype='object')
    for i in range(H):
        for j in range(W):
            a[i,j] = str(i)+str(j)   # the subscript string "ij" of element (i,j)
    return a
For a multi-channel data block, for each channel, the row and column subscripts of the corresponding elements of
the data block are the same. As shown in Figure 6-49, for the data block in the upper left corner of the 2-channel,
the row and column subscripts are:
Figure 6-49 Row and column subscripts of the data block in the upper left corner of channel 2
Generally, for the data block in the upper left corner with the number of channels as C, the row and column
subscripts of its elements can be generated with the following code:
i0 = np.repeat(np.arange(kH), kW)
i0 = np.tile(i0, C)
j0 = np.tile(np.arange(kW), kH * C)
[0 0 1 1 0 0 1 1] #2 channel subscripts
[0 1 0 1 0 1 0 1] #2 channel subscripts
To generate the coordinates of the elements of all data blocks in the original data tensor, it is not enough to know the row and column subscripts of each element relative to its block's upper-left corner (0,0); an offset determined by the span S must also be added to obtain the final row and column coordinates. If a feature map is divided into oH × oW data blocks, the offsets of these blocks relative to the upper-left block can be called span coordinates. For example, when a feature map is divided into 3 × 3 = 9 data blocks and the span is S=1, the row (height) and column (width) span coordinates of these 9 blocks are (0,0), (0,1), (0,2), (1,0), ..., (2,2).
Similarly, these span coordinates can be generated using code that generates the row and column coordinates of
elements within a data block:
oH,oW=3,3
i1 = S * np.repeat(np.arange(oH), oW)
j1 = S * np.tile(np.arange(oW), oH)
print(i1)
print(j1)
[0 0 0 1 1 1 2 2 2]
[0 1 2 0 1 2 0 1 2]
The row and column coordinates within the upper-left data block and the span coordinates of the data blocks are added (via broadcasting) to obtain the row and column coordinates of all data block elements in the original data tensor, namely:

i = i0.reshape(-1,1) + i1.reshape(1,-1)
print("i0:",i0)
print("i1:",i1)
print(i)
i0: [0 0 1 1 0 0 1 1]
i1: [0 0 0 1 1 1 2 2 2]
[[0 0 0 1 1 1 2 2 2]
[0 0 0 1 1 1 2 2 2]
[1 1 1 2 2 2 3 3 3]
[1 1 1 2 2 2 3 3 3]
[0 0 0 1 1 1 2 2 2]
[0 0 0 1 1 1 2 2 2]
[1 1 1 2 2 2 3 3 3]
[1 1 1 2 2 2 3 3 3]]
Each column is a row subscript of a data block. The first 3 columns are the row subscripts of the 3 data blocks
when the span row coordinate is 0.
The following is the code to combine the span coordinates and the element subscripts in the data block to get the
row and column subscripts of all data blocks in the original input (single-channel) tensor:
C,S = 1,1
oH,oW = 3,3
kH,kW = 2,2
i0 = np.repeat(np.arange(kH), kW)
i0 = np.tile(i0, C)
j0 = np.tile(np.arange(kW), kH * C)
i1 = S * np.repeat(np.arange(oH), oW)
j1 = S * np.tile(np.arange(oW), oH)
i = i0.reshape(-1,1) + i1.reshape(1,-1)
j = j0.reshape(-1,1) + j1.reshape(1,-1)
print(i)
print(j)
[[0 0 0 1 1 1 2 2 2]
[0 0 0 1 1 1 2 2 2]
[1 1 1 2 2 2 3 3 3]
[1 1 1 2 2 2 3 3 3]]
[[0 1 2 0 1 2 0 1 2]
[1 2 3 1 2 3 1 2 3]
[0 1 2 0 1 2 0 1 2]
[1 2 3 1 2 3 1 2 3]]
The row subscripts of the elements in the upper left corner of all data blocks are shown in Figure 6-51:
Figure 6-51 Row subscripts of elements in the upper left corner of all data blocks
The row subscripts of the elements in the lower right corner of all data blocks are shown in Figure 6-52:
Figure 6-52 Row subscripts of elements in the lower right corner of all data blocks
It can be observed that each column of this index matrix corresponds to one data block. If instead the index subscripts of each data block should form a row of the matrix, then whether the within-block subscripts and the span subscripts are arranged by row or by column must be swapped, that is, the reshape code is modified:
C,S = 1,1
oH,oW=3,3
kH,kW = 2,2
i0 = np.repeat(np.arange(kH), kW)
i0 = np.tile(i0, C)
j0 = np.tile(np.arange(kW), kH * C)
i1 = S * np.repeat(np.arange(oH), oW)
j1 = S * np.tile(np.arange(oW), oH)
i = i0.reshape(1,-1) + i1.reshape(-1,1)
j = j0.reshape(1,-1) + j1.reshape(-1,1)
print(i)
print(j)
[[0 0 1 1]
[0 0 1 1]
[0 0 1 1]
[1 1 2 2]
[1 1 2 2]
[1 1 2 2]
[2 2 3 3]
[2 2 3 3]
[2 2 3 3]]
[[0 1 0 1]
[1 2 1 2]
[2 3 2 3]
[0 1 0 1]
[1 2 1 2]
[2 3 2 3]
[0 1 0 1]
[1 2 1 2]
[2 3 2 3]]
The above discussion gives the image row and column coordinates of each data block element within its channel. Indexing each data element must also take the channel coordinate into account. If there are C channels, the channel coordinate of a single element is one of (0, 1, 2, ⋯, C − 1), as shown in Figure 6-53. Each data block has kH × kW elements per channel, so a data block of shape C × kH × kW has a total of C × kH × kW (channel, row, column) coordinate triples.
Figure 6-53 Channel coordinates, the channel coordinates of all elements in channel i are i
C=2
k = np.repeat(np.arange(C), kH * kW).reshape(1,-1) #(-1, 1)
print(k)
[[0 0 0 0 1 1 1 1]]
If the shape of the original data tensor input to the convolutional layer is (N, C, H, W) and the convolutional layer has F convolution kernels of shape (C, kH, kW), i.e. the kernel stack has shape (F, C, kH, kW), and a convolution with span S and edge padding P is performed, then according to the above analysis the channel coordinate k, row subscript i, and column subscript j in the original data tensor can be obtained for all elements of the extended tensor composed of the data blocks participating in the convolution. The function get_im2row_indices() returns these mapping coordinates (k, i, j):
import numpy as np
def get_im2row_indices(x_shape, kH, kW, S=1,P=0):
    N, C, H, W = x_shape
    assert (H + 2 * P - kH) % S == 0
    assert (W + 2 * P - kW) % S == 0
    oH = (H + 2 * P - kH) // S + 1
    oW = (W + 2 * P - kW) // S + 1
    i0 = np.repeat(np.arange(kH), kW)
    i0 = np.tile(i0, C)
    i1 = S * np.repeat(np.arange(oH), oW)
    j0 = np.tile(np.arange(kW), kH * C)
    j1 = S * np.tile(np.arange(oW), oH)
    # each data block's subscripts form a row (not a column) of the index matrices
    i = i0.reshape(1,-1) + i1.reshape(-1,1)
    j = j0.reshape(1,-1) + j1.reshape(-1,1)
    k = np.repeat(np.arange(C), kH * kW).reshape(1,-1)
    return (k, i, j)
H,W = 4,4
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P,F = 2,2,1,0,4
k, i, j = get_im2row_indices((N,C,H,W),kH,kW,S,P)
print(k.shape)
print(i.shape)
print(j.shape)
(1, 8)
(9, 8)
(9, 8)
With this helper function, it is easy to generate a row-flattened tensor of data blocks from the original data tensor:
def im2row_indices(x, kH, kW, S=1,P=0):
    x_padded = np.pad(x, ((0, 0), (0, 0), (P, P), (P, P)), mode='constant')
    k, i, j = get_im2row_indices(x.shape, kH, kW, S,P)
    rows = x_padded[:, k, i, j]   # all data blocks of each sample
    C = x.shape[1]
    # all data blocks of the 1st sample, then all data blocks of the 2nd sample, ...
    rows = rows.reshape(-1,kH * kW * C)
    return rows
X = np.arange(N*C*H*W).reshape(N,C,H,W)
X_row = im2row_indices(X,kH,kW,S,P)
print(X)
print(X_row)
[[[[ 0 1 2 3]
[ 4 5 6 7 ]
[ 8 9 10 11]
[12 13 14 15]]
[[16 17 18 19]
[20 21 22 23]
[24 25 26 27]
[28 29 30 31]]]
[[[32 33 34 35]
[36 37 38 39]
[40 41 42 43]
[44 45 46 47]]
[[48 49 50 51]
[52 53 54 55]
[56 57 58 59]
[60 61 62 63]]]]
[[ 0 1 4 5 16 17 20 21]
[ 1 2 5 6 17 18 21 22]
[ 2 3 6 7 18 19 22 23]
[ 4 5 8 9 20 21 24 25]
[ 5 6 9 10 21 22 25 26]
[ 6 7 10 11 22 23 26 27]
[ 8 9 12 13 24 25 28 29]
[ 9 10 13 14 25 26 29 30]
[10 11 14 15 26 27 30 31]
[32 33 36 37 48 49 52 53]
[33 34 37 38 49 50 53 54]
[34 35 38 39 50 51 54 55]
[36 37 40 41 52 53 56 57]
[37 38 41 42 53 54 57 58]
[38 39 42 43 54 55 58 59]
[40 41 44 45 56 57 60 61]
[41 42 45 46 57 58 61 62]
[42 43 46 47 58 59 62 63]]
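As a quick consistency check (assuming both the loop-based im2row() and im2row_indices() above are in scope), the two flattening implementations should produce the same matrix:

x = np.random.randn(2, 3, 8, 8)
assert np.allclose(im2row(x, 3, 3), im2row_indices(x, 3, 3))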
Or conversely, the tensor of data blocks flattened by rows can be converted back into the shape of the original data tensor, accumulating overlapping gradients (np.add.at performs unbuffered accumulation at repeated indices):

def row2im_indices(rows, x_shape, kH, kW, S=1,P=0):
    N, C, H, W = x_shape
    H_pad, W_pad = H + 2 * P, W + 2 * P
    x_pad = np.zeros((N, C, H_pad, W_pad), dtype=rows.dtype)
    k, i, j = get_im2row_indices(x_shape, kH, kW, S,P)
    rows_reshaped = rows.reshape(N,-1,C * kH * kW)
    # accumulate overlapping block gradients back to their original positions
    np.add.at(x_pad, (slice(None), k, i, j), rows_reshaped)
    if P == 0:
        return x_pad
    return x_pad[:, :, P:-P, P:-P]
H,W = 4,4
kH,kW = 2,2
oH,oW = 3,3
N,C,S,P = 2,2,1,0
nRow = oH*oW*N
K2C = C*kH*kW
X = np.arange(N*C*H*W).reshape(N,C,H,W)
dx_row = im2row_indices(X,kH,kW,S,P)   # flatten X itself, then restore it
print("dx_row.shape",dx_row.shape)
dx = row2im(dx_row,oH,oW,kH,kW,S)
print("dx.shape",dx.shape)
print("dx[0,0,:,:]",dx[0,0,:,:])
dX = row2im_indices(dx_row,(N,C,H,W),kH,kW,S,P)
print("dX.shape",dX.shape)
print("dX[0,0,:,:]",dX[0,0,:,:])
print(dX)
dx_row.shape(18, 8)
dx.shape(2, 2, 4, 4)
dx[0,0,:,:] [[ 0. 2. 4. 3.]
[ 8. 20. 24. 14.]
[16. 36. 40. 22.]
[12. 26. 28. 15.]]
dX.shape(2, 2, 4, 4)
dX[0,0,:,:] [[ 0 2 4 3]
[ 8 20 24 14 ]
[16 36 40 22]
[12 26 28 15]]
[[[[ 0 2 4 3]
[ 8 20 24 14 ]
[ 16 36 40 22]
[ 12 26 28 15]]
[[ 16 34 36 19]
[ 40 84 88 46 ]
[ 48 100 104 54]
[ 28 58 60 31]]]
[[[ 32 66 68 35]
[ 72 148 152 78]
[ 80 164 168 86]
[ 44 90 92 47]]
[[ 48 98 100 51]
[104 212 216 110]
[112 228 232 118]
[ 60 122 124 63]]]]
With the helper functions above that directly flatten multidimensional tensors into matrices (in the file im2row.py), a convolution layer based on the fast convolution operation can be written:
from Layers import *
from im2row import *
class Conv_fast():
def __init__(self, in_channels, out_channels, kernel_size, stride=1,padding=0):
super().__init__()
self.C = in_channels
self.F = out_channels
self.kH = kernel_size
self.kW = kernel_size
self.S = stride
self.P = padding
# filters is a 3d array with dimensions (num_filters, self.K, self.K)
# you can also use Xavier Initialization.
#self.K = np.random.randn(self.F, self.C, self.kH, self.kW)
#/(self.K*self.K)
self.K = np.random.normal(0,1,(self.F, self.C, self.kH, self.kW))
self.b = np.zeros((1,self.F)) #,1))
self.params = [self.K,self.b]
self.grads = [np.zeros_like(self.K),np.zeros_like(self.b)]
self.X = None
self.reset_parameters()
def reset_parameters(self):
kaiming_uniform(self.K, a=math.sqrt(5))
if self.b is not None:
#fan_in, _ = calculate_fan_in_and_fan_out(self.K)
fan_in = self.C
bound = 1 / math.sqrt(fan_in)
self.b[:] = np.random.uniform(-bound,bound,(self.b.shape))
def forward(self,X):
    # Convert to multi-channel (N,C,H,W) form
    self.X = X
    if len(X.shape)==1:
        X = X.reshape(X.shape[0],1,1,1)
    elif len(X.shape)==2:
        X = X.reshape(X.shape[0],X.shape[1],1,1)
    self.N,_,self.H,self.W = X.shape
    self.oH = (self.H + 2*self.P - self.kH)//self.S + 1
    self.oW = (self.W + 2*self.P - self.kW)//self.S + 1
    self.X_row = im2row_indices(X,self.kH,self.kW,S=self.S,P=self.P)
    K_col = self.K.reshape(self.F,-1).transpose()
    Z_row = self.X_row @ K_col + self.b
    Z = Z_row.reshape(self.N,self.oH,self.oW,-1)
    Z = Z.transpose(0,3,1,2)
    return Z
def __call__(self,x):
return self.forward(x)
def backward(self,dZ):
    if len(dZ.shape)<=2:
        dZ = dZ.reshape(dZ.shape[0],-1,self.oH,self.oW)
    K = self.K
    # flatten dZ into a matrix with the same shape as Z_row
    F = dZ.shape[1]   # Convert (N,F,oH,oW) to (N,oH,oW,F)
    assert(F==self.F)
    dZ_row = dZ.transpose(0,2,3,1).reshape(-1,F)
    # Gradient of the loss function with respect to the kernel parameters
    dK_col = np.dot(self.X_row.T,dZ_row)   # X_row.T @ dZ_row
    dK_col = dK_col.transpose(1,0)   # move the F axis from axis=1 to axis=0
    dK = dK_col.reshape(self.K.shape)
    db = np.sum(dZ,axis=(0,2,3))
    db = db.reshape(-1,F)
    # Gradient of the loss function with respect to the input of the layer
    K_col = K.reshape(self.F,-1).transpose()
    dX_row = np.dot(dZ_row,K_col.T)
    X_shape = (self.N,self.C,self.H,self.W)
    dX = row2im_indices(dX_row,X_shape,self.kH,self.kW,S=self.S,P=self.P)
    dX = dX.reshape(self.X.shape)
    self.grads[0] += dK
    self.grads[1] += db
    return dX
def reg_loss(self,reg):
return reg*np.sum(self.K**2)
def reg_loss_grad(self,reg):
self.grads[0]+= 2*reg * self.K
return reg*np.sum(self.K**2)
Gradient Test
Similarly, the gradient test can be used to check whether the code is correct for this convolutional layer:
import util
np.random.seed(1)
N,C,H,W = 4,3,5,5
F,kH,kW = 6,3,3
oH,oW = 3,3
x = np.random.randn(N,C,H,W)
y = np.random.randn(N,F,oH,oW)
conv = Conv_fast(C,F,kH,1,0)
f = conv.forward(x)
loss,do = util.mse_loss_grad(f,y)
dx = conv.backward(do)
def loss_f():
f = conv.forward(x)
loss,do = util.mse_loss_grad(f,y)
return loss
dW_num = util.numerical_gradient(loss_f,conv.params[0],1e-6)
print(np.max(np.abs(conv.grads[0] - dW_num)))   # the exact error measure printed is assumed

4.198542114313848e-07
import time
#N,C,H,W = 64,256,64,64
#F,kH = 128,5
N,C,H,W = 128,16,64,64
F,kH = 32,5
x = np.random.randn(N,C,H,W)
oH = H-kH+1
do = np.random.randn(N,F,oH,oH)
start = time.time()
conv = Conv(C,F,kH)
f = conv(x)
conv.backward(do)
done = time.time()
elapsed = done - start
print(elapsed)
start = time.time()
conv = Conv_fast(C,F,kH)
f = conv(x)
conv.backward(do)
done = time.time()
elapsed = done - start
print(elapsed)
476.4419822692871
29.02124047279358
The original convolution takes 476 seconds, while the fast convolution takes only 29 seconds.
Replace the convolution of the previous convolutional neural network for MNist handwritten digit classification
with fast convolution, and look at the time efficiency:
import pickle, gzip, urllib.request, json
import numpy as np
import os.path
if not os.path.isfile("mnist.pkl.gz"):
# Load the dataset
urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
"mnist.pkl.gz")
import gzip
# loading code reconstructed from the shapes printed below
with gzip.open("mnist.pkl.gz", "rb") as f:
    train_set, valid_set, test_set = pickle.load(f, encoding="latin1")
train_X, train_y = train_set
print(train_X.shape)
train_X = train_X.reshape(-1,1,28,28)   # to (N,C,H,W) form for convolution
print(train_X.shape)

(50000, 784)
(50000, 1, 28, 28)
np.random.seed(1)
nn = NeuralNetwork()
nn.add_layer(Conv_fast(1,2,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Conv_fast(2,4,5,1,0))
nn.add_layer(Pool((2,2,2)))
nn.add_layer(Dense(64, 100))
nn.add_layer(Relu())
nn.add_layer(Dense(100, 10))
epochs=1
batch_size = 64
reg = 1e-3
print_n=100
learning_rate = 0.1   # assumed value; the original hyperparameter is not shown
momentum = 0.9        # assumed value
optimizer = SGD(nn.parameters(),learning_rate,momentum)
start = time.time()
X,y = train_X,train_y
losses = train.train_nn(nn,X,y,optimizer,util.cross_entropy_grad_loss,epochs,batch_size,reg,print_n)
done = time.time()
elapsed = done - start
print(elapsed)
print(np.mean(nn.predict(X)==y))
[ 1, 1] loss: 2.383
[ 101, 1] loss: 2.316
[ 201, 1] loss: 2.283
[ 301, 1] loss: 2.160
[ 401, 1] loss: 1.675
[ 501, 1] loss: 1.091
[ 601, 1] loss: 0.514
[ 701, 1] loss: 0.659
690.5078177452087
0.83894
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(losses)
Figure 6-54. Training loss curve of a convolutional neural network for classifying Mnist handwritten digits
6.5 Typical convolutional neural network structure
In 1989, Yann LeCun, the inventor of the convolutional neural network (CNN) structure, used the backpropagation (BP) algorithm to train a multi-layer neural network to recognize handwritten postal codes. The network he used was later, in 1994, called LeNet; although the paper mentions neither convolution nor convolutional neural networks, saying only that adjacent 5 × 5 areas are used as receptive fields, it was an early convolutional network. In 1998, LeCun proposed the famous LeNet-5 network, marking the real birth of the convolutional neural network. Due to the hardware limitations of the time, training convolutional neural networks consumed a great deal of machine resources and time, so the CNN model did not become popular.
It was not until 2012 that Alex Krizhevsky implemented a deep convolutional neural
network called AlexNet with a GPU and won the championship of the ImageNet image
recognition competition that deep learning represented by deep convolutional neural
networks began to develop rapidly. Subsequently, various neural network structures were
proposed, such as VGG, GoogLeNet, ResNet, Inception, etc.
6.5.1 LeNet-5
Figure 6-55 shows the network structure of LeNet-5. A 32×32×1 image passes through six 5×5 convolution kernels with a stride of 1 and padding of 0, generating 6 feature maps of size 28×28; an average pooling operation with stride 2 and size 2 then produces six 14×14 feature maps, that is, the height and width of the image are halved. Sixteen 5×5 convolution kernels with stride 1 and padding 0 then generate 16 output feature maps of size 10×10, and another average pooling with stride 2 and size 2 produces 16 feature maps of size 5×5.
Then comes a fully connected layer with 120 neurons; each neuron receives all 400 feature values from the previous layer's 16 5×5 feature maps and produces one output value, so the 120 neurons generate a vector of 120 outputs, which is fed into the next fully connected layer of 84 neurons. That layer feeds its 84 outputs into the final output layer; for 10-class classification the output layer contains 10 neurons, each outputting the score of a sample belonging to the corresponding class. The model can be trained by passing these 10 scores through the softmax function and computing the multi-class cross-entropy loss against the true target values.

LeNet-5 is a classic convolutional network structure. It adopts the pattern of "convolution first, then pooling" to turn multi-channel input into multi-channel output while reducing the image size. As in the example above, the single-channel 32×32×1 input passes through a series of "convolution + pooling" transformations to produce 16 5×5 feature maps, which finally pass through several fully connected layers to produce the output.
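As a rough sketch, a LeNet-5-like structure could be assembled from the layers developed in this chapter (this sketch substitutes Relu and the Pool layer used earlier for the original sigmoid activations and average pooling, so it is not LeCun's exact design):

lenet = NeuralNetwork()
lenet.add_layer(Conv_fast(1, 6, 5))     # 32x32x1 -> 28x28x6
lenet.add_layer(Pool((2,2,2)))          # -> 14x14x6
lenet.add_layer(Conv_fast(6, 16, 5))    # -> 10x10x16
lenet.add_layer(Pool((2,2,2)))          # -> 5x5x16
lenet.add_layer(Dense(400, 120))        # 16*5*5 = 400 features
lenet.add_layer(Relu())
lenet.add_layer(Dense(120, 84))
lenet.add_layer(Relu())
lenet.add_layer(Dense(84, 10))          # 10 class scores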
6.5.2 AlexNet
AlexNet is a CNN network structure proposed by Alex Krizhevsky et al. It won first place in the 2012 ImageNet image classification competition, reducing the top-5 error rate by more than 10 percentage points at a stroke. The authors implemented a parallel neural network training algorithm on CUDA GPUs, making it possible to train deep neural networks in a reasonable time. Its network structure is shown in Figure 6-56.
AlexNet is very similar to LeNet, but its depth and scale are much larger: LeNet has about 60,000 parameters, while AlexNet has about 60 million. The most important improvement of AlexNet over LeNet is the use of the Relu activation function, which alleviates the "vanishing gradient" problem of deep neural networks. Another performance improvement is the Dropout technique: in a hidden layer, with a certain probability, the output of some neurons is set to 0 and does not participate in network propagation, so each iteration effectively trains a different function. This regularization technique in effect represents the trained model as a combination of simpler network functions. In addition, "Local Response Normalization" (LRN) was proposed, which normalizes the values of all channels at a given position of the feature maps in a layer; it was later found that LRN has little effect.
The success of AlexNet refocused the computer vision and artificial intelligence communities on neural networks, especially deep convolutional networks, which had been dormant for many years. Networks could become deeper, and deep neural networks built on simple principles could surpass mathematically sophisticated artificial intelligence techniques. Deep learning began to become the most important branch of machine learning; modern artificial intelligence mainly refers to deep learning.
6.5.3 VGG
The VGG-16 network is a simplified convolutional network structure proposed by Oxford's Visual Geometry Group (VGG for short). Its main contribution is to demonstrate that increasing the depth of a network can, to a certain extent, improve its final performance. In a typical convolutional network, different convolution layers use kernels of different sizes, whereas in the VGG network all convolution kernels have the same size, e.g. 3 × 3 with a stride of 1; the same holds for the pooling layers, which all use 2 × 2 max pooling with a stride of 2. This simplifies the structure of the convolutional neural network. As long as a VGG-16 network is deep enough, it can achieve performance equal to or better than more complicated network structures. The "16" in VGG-16 means that the convolutional and fully connected layers total 16 layers. Figure 6-57 shows the network structure of VGG-16.
The convolution kernels are all 3 × 3 with a stride of 1, and the pooling kernels are all 2 × 2 with a stride of 2. The number of output channels of the first convolutional layer is 64, and the subsequent convolutional layers double it in the order 128, 256, 512; once 512 is reached, the number of output channels no longer increases, as 512 channels is considered large enough. The VGG network structure is very regular, but the amount of training data required is large. VGG-19 was proposed later, but there is no significant difference in performance between VGG-19 and VGG-16.
When a value is repeatedly multiplied by numbers whose absolute values are less than 1, as in $c_i \cdots c_L y$, it tends to 0; similarly, when a value is repeatedly multiplied by numbers whose absolute values are greater than 1, it grows toward infinity.

Consider a simplified neural network in which each neuron computes $z = xw$. If there are L layers, the forward calculation is:

$$z_L = w_L w_{L-1} \cdots w_1 x$$

If the gradient of the loss function with respect to the last $z_L$ is $dz_L$, then the gradient with respect to $z_i$ is $dz_i = w_{i+1} \cdots w_L \, dz_L$, and the gradient of the loss function with respect to $w_i$ is $dw_i = dz_i z_{i-1} = w_{i+1} \cdots w_L \, dz_L \, z_{i-1}$.
If $\|w_i\| < \rho < 1$, then $dz_i$ decays exponentially as $L - i$ increases, and the larger $L - i$ is, the faster the decay, so $dw_i$ may become very small; with such a small gradient the update of the parameter $w_i$ almost stagnates and convergence becomes very slow. Similarly, if $\|w_i\| > \rho > 1$, then $dz_i$ grows exponentially with $L - i$ and becomes very large, making the parameter updates blow up and the training unstable.
As neural networks become deeper, gradient explosion and gradient decay are hard to avoid, which makes training deep neural networks very difficult. To prevent gradient explosion, the technique of gradient clipping can be used, that is, the magnitude of the gradient is limited to a predetermined range. Let $g$ be the gradient and $\theta$ be the clipping threshold; the gradient is clipped according to the formula:

$$g \leftarrow \min\left(\frac{\theta}{\|g\|}, 1\right) g$$
If grads contains the gradients of multiple weight parameters, the following code limits their global norm to at most c:
import math
def grad_clipping(grads,c):
norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
if norm > c:
ratio = c / norm
for i in range(len(grads)):
grads[i]*=ratio
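For example, a small usage sketch:

grads = [np.array([3.0, 4.0]), np.array([1.0, 2.0])]   # global norm sqrt(30) > 2
grad_clipping(grads, 2.0)
print(math.sqrt(sum((g**2).sum() for g in grads)))     # now 2.0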
Gradient clipping can solve the problem of gradient explosion to a certain extent, but it
cannot solve the problem of gradient disappearance.
The skip connections of a residual network are usually very regular; Figure 6-58 is a schematic diagram of the residual network structure:

Figure 6-58 Schematic diagram of the residual network structure; the upper figure is a residual network, the lower figure an ordinary neural network

Because of the short-circuit connection, the gradient of the reverse derivation can be fed back directly from the layer at the tail of the arc to the layer at its head through the skip connection, so the gradient is neither attenuated nor exploded by passing through multiple intermediate layers. The residual network was invented by the Chinese scholar Kaiming He and others, who found that as long as such skip connections are established between different layers, much deeper neural networks can be trained; with the residual structure one can even easily train networks with more than 1000 layers.
The residual network has a periodic structure and can be regarded as composed of residual blocks of the same form; Figure 6-60 shows the structure of one residual block:

Figure 6-60 The residual block is the structural unit of the residual network

This residual block is composed of 2 convolutional blocks, each of which first computes a weighted sum and then the output of an activation function. Before the activation function of the second convolutional block is computed, the residual block adds the input of the first convolutional block to the weighted-sum output of the second, and the sum then passes through the activation function. That is, the input x of the first convolutional block passes through its weighted sum and activation into the second convolutional block; the weighted sum F(x) of the second convolutional block is added to the input x of the first, giving F(x) + x, which is then fed into the activation function of the second convolutional block.
The function $x \to F(x) + x$ represented by this residual block adds an identity function $x \to x$ to the original functional relationship $x \to F(x)$. This forces $x \to F(x)$ to stay as close to 0 as possible, i.e. it restricts $x \to F(x)$ to a small subspace of functions, similar to the way regularization restricts the range of the weights. In addition, during reverse derivation, the gradient at the output of the second convolutional block is passed directly through this identity function to the input of the first convolutional block, thereby avoiding the vanishing-gradient problem. Therefore, the residual network both prevents the gradient from vanishing and acts as a regularizer preventing the function from being too complex.
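As a minimal sketch of this computation (the layer objects conv1, conv2 and the relu function here are illustrative assumptions, not the book's final implementation):

def res_block_forward(x, conv1, conv2, relu):
    f = relu(conv1(x))   # first convolutional block: weighted sum + activation
    f = conv2(f)         # second convolutional block: weighted sum only
    return relu(f + x)   # add the shortcut input, then apply the activation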
A residual network can thus be written as a composition of residual blocks $ResBlock_i$, whereas an ordinary neural network is a composition of plain layers. Of course, a residual block may consist of multiple convolutional blocks, and each convolutional block may contain batch normalization layers, pooling layers, dropout layers, and so on; that is, it may contain several or even a dozen network layers of various kinds.
Each convolution kernel of an Inception module accepts the same input and uses the "same" convolution method to generate feature maps with different numbers of output channels; these feature maps are concatenated into one final feature map. If the numbers of channels output by the 1 × 1, 3 × 3, 5 × 5, 7 × 7 kernels are 96, 16, 32, 64, then the concatenation produces 96+16+32+64 feature maps as output, as shown in Figure 6-62.
Of course, the Inception module can also include a pooling layer. The pooling layer will
reduce the size of the feature map. In order to generate a pooled output feature map of the
same size as the original feature map, a "same" pooling operation with padding is used.
Using the Inception module to replace the ordinary convolutional layer and pooling layer
can automatically learn the appropriate model parameters through training, thereby
automatically selecting the appropriate convolution kernel (pooling window) size.
The Inception module described above leads to a large amount of computation. For example, if the input is a 28 × 28 × 192 tensor, the output tensor of 32 5 × 5 × 192 convolution kernels is 28 × 28 × 32; each element of the output tensor is a weighted sum over 5 × 5 × 192 values, so there are in total 5 × 5 × 192 × 28 × 28 × 32 = 120,422,400 multiplications.
In order to reduce the amount of calculation, you can insert a 1 × 1 convolution kernel with
a relatively small number of output channels before 3 × 3, 5 × 5, 7 × 7 these convolution
kernels with a size greater than 1, as shown in the figure 6-63 shows:
If the number of output channels of the 1 × 1 convolution kernel is 16, i.e. 1 × 1 × 16, then although an extra layer is inserted in the middle, the amount of computation is greatly reduced. For example, inserting 1 × 1 × 16 before the above 5 × 5 × 32 gives 1 × 1 × 192 × 28 × 28 × 16 + 5 × 5 × 16 × 28 × 28 × 32 = 12,443,648 multiplications, about a factor of 10 fewer. Because a pooling layer always produces the same number of channels as its input, in order to make the pooling branch output fewer channels, its 1 × 1 convolution kernel is added after it.
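The two multiplication counts above can be checked directly:

direct = 5*5*192 * 28*28 * 32                      # 120422400
bottleneck = 1*1*192*28*28*16 + 5*5*16*28*28*32    # 12443648
print(direct // bottleneck)                        # about 9x fewer multiplications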
A neural network composed of such Inception modules in place of ordinary convolution and pooling layers is called an Inception network. Figure 6-64 shows the famous GoogLeNet, the Inception v1 network. Based on Inception v1, improved versions such as Inception v2, v3, and v4 were proposed, some even combined with residual networks.
Figure 6-65 Network in Network (NiN): the linear convolutional layer is replaced by a small network

In addition to the convolutional layer, the authors added 2 fully connected layers to this small network, arguing that this increases the nonlinear capability of the convolutional layer. The paper also uses global mean pooling to replace the traditional fully connected layer: global mean pooling is performed on each feature map, so that each feature map produces exactly one output value. This greatly reduces the excessive number of parameters caused by flattening feature maps into traditional fully connected layers and helps avoid overfitting. Since each feature map produces one output value, the number of input feature maps of the global pooling layer must equal the number of categories; for 10 categories, there must be 10 feature maps.
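Global mean pooling itself is a one-line operation; a minimal sketch for an input of shape (N, C, H, W):

def global_avg_pool(x):
    # average each feature map (channel) to a single value; result shape (N, C)
    return x.mean(axis=(2, 3))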
For the neural networks discussed so far, the prediction for a sample depends only on its own input $x^{(i)}$ and has nothing to do with the inputs and outputs of other samples $(x^{(j)}, y^{(j)}), j \neq i$; such a network treats its samples as independent of one another.
But some problems involve data with a sequential relationship: a video is composed of images generated in temporal order, a text or sentence is a sequence of words, a piece of music is a series of notes, a protein is a sequence of amino acids, and a stock curve contains a price at each moment. Judging or predicting a single element of a sequence in isolation is unreliable: understanding a word of an article or a paragraph in isolation is meaningless, and judging the motion of an object from a single frame of a video, such as whether the car in an image is stationary, moving forward, or moving backward, is not feasible. In machine translation, translating each word of a sentence separately, word for word, obviously does not work either.
Just as the convolutional neural network captures the spatial correlation between the features of a data sample, the Recurrent Neural Network (RNN) is a neural network structure for sequence data with sequential (temporal) relationships. A recurrent neural network is a network with state memory, which can memorize historical information along the time dimension. Specifically, at a certain time t, in addition to the input data (element) $x_t$ at the current time, there is a hidden state $h_{t-1}$ that memorizes information from before time t. Therefore, the input at time t includes the current data $x_t$ and the historical memory state $h_{t-1}$, and the output at time t includes the prediction $y_t$ and the new historical memory state $h_t$. The hidden state $h_t$ is propagated along the sequence and can in theory contain the historical information of all previous moments. The recurrent neural network predicts the current sequence element through this internal hidden-state memory of the earlier sequence, so it can make better predictions for the current moment.
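A single recurrent step can be sketched as follows; the weight names W_xh, W_hh, b and the tanh activation are illustrative assumptions here, and the concrete recurrent layer is developed later in this chapter:

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # the new hidden state mixes the current input with the previous memory
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b)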
Recurrent neural networks can be used for problems with sequence dependencies between data, such as natural
language processing (such as machine translation, text generation, part-of-speech tagging, text sentiment
analysis), speech processing (recognition, synthesis), music generation, protein sequence analysis, Video
understanding and analysis, stock forecasting, etc.
Let $x_t$ represent the data features at time t and $y_t$ the target value to be predicted at time t. As in any supervised machine learning, predicting sequence data means learning a mapping or function $f : (x_1, x_2, \cdots, x_t) \to y_t$, that is, predicting the target value $y_t$ at time t from the data features $(x_1, x_2, \cdots, x_t)$ of all moments up to time t.
If $x_t$ and $y_t$ are the same type of data, for example $x_t$ is the stock price at time t and the target predicted at time t is the price at time t+1, i.e. $y_t = x_{t+1}$, then such a sequence data prediction problem is called an autoregressive problem.
7.1.1 Stock Price Prediction Problem
Predicting the stock price based on the historical information data of a stock is a typical sequence data prediction
problem. The following code uses the pandas package to read the stock data of the csv format file 'sp500.csv':
import pandas as pd
data = pd.read_csv('sp500.csv')
data.head()
output:
Date Open High Low Close Volume
0 03-01-00 1469.250000 1478.000000 1438.359985 1455.219971 931800000
1 04-01-00 1455.219971 1455.219971 1397.430054 1399.420044 1009000000
2 05-01-00 1399.420044 1413.270020 1377.680054 1402.109985 1085500000
3 06-01-00 1402.109985 1411.900024 1392.099976 1403.449951 1092300000
4 07-01-00 1403.449951 1441.469971 1400.729980 1441.469971 1225200000
Among them, each column represents: date, opening price, highest price, lowest price, closing price, trading
volume. In order to facilitate the training of machine learning algorithms, the data needs to be normalized. The
following code normalizes data other than dates:
data = data.iloc[:,1:6]
data = data.values.astype(float)
data = pd.DataFrame(data)
data = data.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
print(data[:3])
The above code for reading and normalizing the stock data can be wrapped into a function:

import pandas as pd
def read_stock(filename):
    data = pd.read_csv(filename)
    data = data.iloc[:,1:6]               # drop the date column
    data = data.values.astype(float)
    data = pd.DataFrame(data)
    data = data.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
    return data

data = read_stock('sp500.csv')
print(data[:3])
0 1 2 3 4
0 -0.005973 -0.005916 -0.015676 -0.012310 -0.191184
1 -0.012266 -0.016172 -0.034017 -0.037249 -0.184230
2 -0.037292 -0.035058 -0.042867 -0.036047 -0.177338
Stock price forecasting predicts the next day's price, such as the closing price, from each day's historical stock data: the opening price, highest price, lowest price, closing price, and trading volume. For this sequence prediction problem, the data $x_t$ at each moment contains these features, and the target value $y_t$ to be predicted is the closing price of the stock on the next day.

If the data $x_t$ at each moment contains only the closing-price feature and the $y_t$ to be predicted is also the closing price, i.e. they are the same type of data, then this stock price prediction problem is an autoregressive problem. The following code plots the curve of closing prices:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
x = np.array(data.iloc[:,-2])
print(x.shape)
plt.plot(x)
output:
(4697,)
$$y_t \sim p(y_t|x_1, \ldots, x_t)$$

That is, based on all the sequence data $(x_1, x_2, \cdots, x_t)$ up to time t, the probability of each possible target value $y_t$ is predicted. There are usually many, even infinitely many, possible target values, and the prediction problem is to determine the probability of each possible value of $y_t$ being the target. A model that predicts the probability of the target value from sequence data is called a probabilistic sequence model. For autoregressive problems, this probabilistic sequence model is expressed as:

$$x_t \sim p(x_t|x_1, \ldots, x_{t-1})$$
2. Language Model

The basis of natural language processing is the language model, which models the probability of a sentence, that is, determines how likely a sentence is to appear. For example, the probability of "I am Chinese" is obviously greater than that of its scrambled variant "Chinese am I". A sentence is a series of words, i.e. an ordered word sequence; "I am Chinese" is such an ordered sequence of words. Suppose a sentence S is composed of the words $w_1, w_2, w_3, \cdots, w_n$, and its probability is denoted $P(w_1, w_2, w_3, \cdots, w_n)$. According to probability theory, this probability can be expressed as:

$$P(w_1, w_2, w_3, \cdots, w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) \cdots P(w_n|w_1, w_2, \cdots, w_{n-1})$$

It is the product of the conditional probabilities of the successive words: $w_1$ appears first with probability $P(w_1)$, $w_2$ appears after $w_1$ with probability $P(w_2|w_1)$, and in general $w_n$ follows the preceding words with probability $P(w_n|w_1, w_2, \cdots, w_{n-1})$.
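For example, with made-up conditional probabilities for a 3-word sentence $w_1 w_2 w_3$:

P_w1 = 0.2             # P(w1)
P_w2_given_w1 = 0.5    # P(w2 | w1)
P_w3_given_w1w2 = 0.1  # P(w3 | w1, w2)
P_sentence = P_w1 * P_w2_given_w1 * P_w3_given_w1w2   # 0.01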
According to this formula (7-3), if these conditional probabilities are known, i.e. the probability $P(w_i|w_1, w_2, \cdots, w_{i-1})$ of the next word $w_i$ appearing given the known words $w_1, w_2, \cdots, w_{i-1}$, then the probability of any sentence composed of a series of words can be computed. Therefore, the language model predicts, from the existing word sequence, the next word or the probability of each word in the vocabulary: $P(x_t|x_1, \cdots, x_{t-1})$.
If the true data $x_t$ depends only on the data $(x_{t-\tau}, \cdots, x_{t-1})$ of the previous fixed number $\tau$ of moments, the sequence data is said to satisfy the Markov property. Such autoregressive models are also known as Markov models.

The simplest autoregressive model assumes that $x_{t-\tau}, \ldots, x_{t-1}$ and $x_t$ satisfy a linear relationship:

$$x_t = a_0 + a_1 x_{t-1} + \cdots + a_\tau x_{t-\tau} + \epsilon$$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def gen_seq_data_from_function(f,ts):
return f(ts)
T =5000
x = gen_seq_data_from_function(lambda ts:np.sin(ts*0.1)+np.cos(ts*0.2),\
np.arange(0, T))
plt.plot(x[:500])
plt.show()
Figure 7-2 Autoregressive Data Generated from Function Values
But such data has obvious periodicity, while real sequence data such as stock prices does not. The autoregressive model of formula (7-4) can instead be used to generate non-periodic sequence data from some initial data. The steps are: (1) generate a set of stable model coefficients; (2) randomly generate the initial $\tau$ values; (3) repeatedly apply formula (7-4) with a noise term to generate the subsequent values.

Research on autoregressive models shows that the generated sequence is stable only when all roots of the characteristic equation $x^\tau - a_1 x^{\tau-1} - a_2 x^{\tau-2} - \cdots - a_\tau = 0$ have absolute value less than 1.

The following function init_coefficients() generates the coefficients of a stable autoregressive model:
The following function init_coefficients() generates the coefficients of a stable autoregressive model:
np.random.seed(5)
def init_coefficients(n):
    while True:
        a = np.random.random(n) - 0.5
        coefficients = np.append(1, -a)
        if np.max(np.abs(np.roots(coefficients))) < 1:
            return a
init_coefficients(3)
The following function generate_data() generates autoregressive data according to the above 3 steps. Because the distribution of the initially generated data is very different from the stable distribution and also affects the subsequent data, the series only becomes truly stable after some time; it is therefore necessary to discard some of the initially generated data, such as the first 3n values.
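generate_data() itself is not shown in the text; a minimal sketch consistent with this description (interpreting the call generate_data(5, 100) below as: model order τ=5, keep 100 points; the noise scale is an arbitrary assumption):

def generate_data(tau, n):
    # a stable AR(tau) model: x_t = a_1*x_{t-1} + ... + a_tau*x_{t-tau} + noise
    a = init_coefficients(tau)
    total = n + 3*tau                      # generate 3*tau extra points
    x = np.zeros(total)
    x[:tau] = np.random.randn(tau)         # random initial values
    for t in range(tau, total):
        x[t] = np.dot(a, x[t-tau:t][::-1]) + 0.1*np.random.randn()
    return x[3*tau:], a                    # discard the unstable initial part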
x,_ = generate_data(5,100)
plt.plot(x[:80])
plt.show()
For example, in a table tennis game, it is impossible to judge the motion of the ball from its position at a single moment; that is, the velocity v^(t) of the ball cannot be predicted from the position x^(t) alone. However, if the positions at several consecutive moments around time t are combined into a single data feature x^(t), an estimate v̂^(t) of the ball's velocity can be predicted from this data feature.

This method of using a fixed-length subsequence around the current position as the data feature of the sample at the current position is called the time window method. Time windows allow sequence data to be processed directly with "one-to-one" neural networks. The time window is a traditional method for dealing with time series: for example, predicting a day's stock price from the stock information of the 60 consecutive days before it, or predicting the probability of the next word in a language model from the k known preceding words.
The time window method transforms the prediction problem of sequence model into the supervised learning
problem of non-sequence data, so that the prediction problem of sequence data can be modeled by the existing
supervised learning method of non-sequence data. The application of the time window method is illustrated
below with the forecasting problem of autoregressive sequence data.
A subsequence of length T can be regarded as a single data feature for supervised learning, and the element following it can be used as the target value, so that the sequence prediction problem is transformed into a supervised learning problem on non-sequential data. The problem can then be modeled and trained with the supervised machine learning methods seen earlier, such as the acyclic neural networks of previous chapters. To do this, training data must be prepared for the model.

From a sequence of data, a subsequence x[i : i+T+1] of length T+1 can be taken from any position i to constitute one supervised learning sample: x[i : i+T] makes up the data feature x_i of the sample, and x[i+T] is the target value y_i. For sequence data of length n, the valid range of i is [0, n−(T+1)].
i
The set data_set composed of these samples can be divided into a training set (x_train, y_train) and a test set (x_test, y_test) in a certain proportion. The following code samples training data from the sequence according to the time window width T, using a helper function gen_data_set() (whose sketch is given next):
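gen_data_set() is not defined in the text; a minimal sketch, assuming it slides a window of width T over the series and splits the samples into training and test sets in a fixed proportion (the 90/10 split here is an assumption):

def gen_data_set(x, T, train_ratio=0.9):
    xs, ys = [], []
    for i in range(len(x) - T):
        xs.append(x[i:i+T])    # a window of T values forms the data feature
        ys.append(x[i+T])      # the next value is the target
    xs, ys = np.array(xs), np.array(ys)
    n_train = int(len(xs) * train_ratio)
    return xs[:n_train], ys[:n_train], xs[n_train:], ys[n_train:]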
x = gen_seq_data_from_function(lambda ts:np.sin(ts*0.1)+np.cos(ts*0.2),\
np.arange(0, 5000))
x_train, y_train, x_test, y_test = gen_data_set(x, 50)
y_train = y_train.reshape(-1,1)
print(x_train.shape,y_train.shape)
hidden_dim = 50
n = x_train.shape[1]
print("n",n)
nn = NeuralNetwork()
nn.add_layer(Dense(n, hidden_dim)) #('xavier',0.01)))
nn.add_layer(Relu())
nn.add_layer(Dense(hidden_dim, 1)) #('xavier',0.01)))
learning_rate = 1e-2
momentum = 0.8 #0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=20
batch_size = 200 # len(train_x) #200
reg = 1e-1
print_n=100
losses = train_nn(nn,x_train,y_train,optimizer,
util.mse_loss_grad,epochs,batch_size,reg,print_n)
#print(losses[::len(losses)//50])
plt.plot(losses)
n 50
0 iter: 3.144681992803935
100 iter: 0.3332809082102651
200 iter: 0.13722749233747686
300 iter: 0.10941419118718776
400 iter: 0.10108511745662195
Figure 7-5 Training loss curve of 2-layer fully connected neural network
There are two ways to use the trained model for forecasting. One starts from an initial sequence of real data x_0, x_1, ⋯, x_{T−1}: x_T is predicted from it, then x_{T+1} is predicted from x_1, x_2, ⋯, x_T, x_{T+2} from x_2, x_3, ⋯, x_{T+1}, and so on. That is, the initial sequence x_0, x_1, ⋯, x_{T−1} is used to predict many subsequent moments. This type of forecast is called a long-term forecast. Since each prediction is not perfectly accurate, using predicted values as if they were real values to predict further values becomes more and more inaccurate: as time goes on, the error between the predicted and real values grows larger and larger. The other way is short-term forecasting; in the extreme case, each moment always uses the real data in the time window of that moment (such as this moment and the previous T−1 moments) to predict only the data of the next moment. Because the input data are all real values and only one step ahead is predicted, the prediction quality is good, but the data used for prediction at each moment must be real data rather than previously predicted values.
The following code adopts long-term forecasting to predict the data of a series of subsequent time points from the real data sample at the initial time point, and visually compares these predicted values with the corresponding target values of the test set to observe the predictive performance of the model.
x = x_test[0].copy()
x = x.reshape(1,-1)
ys = []
for i in range(400):
    y = nn.forward(x)
    ys.append(y[0][0])
    x = np.delete(x, 0, 1)                       # drop the oldest value
    x = np.append(x, y.reshape(1,-1), axis=1)    # append the prediction
ys = ys[:]
plt.plot(ys[:400])
plt.plot(y_test[:400])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-6 Long-term forecasting with a trained model with time window length T=50
It can be seen that the predicted results are very close to the real target values, because this is a periodic curve and the window T=50 roughly covers one period of the curve (50×0.1 = 5 is close to 2π ≈ 6.28). If the time window is shorter, such as T=10, the prediction becomes poor, as shown in Figure 7-7.

It can also be seen that prediction accuracy drops the further into the future we predict: since predicted values are fed back in place of real data to predict later values, the errors accumulate and grow larger and larger.
Figure 7-7 Long-term forecasting with a trained model with time window length T=10
The following code is to use the trained neural network for short-term prediction, that is, each time the real data
is used to predict the data value of the next moment:
ys = []
for i in range(400):
    x = x_test[i].copy()
    x = x.reshape(1,-1)
    y = nn.forward(x)
    ys.append(y[0][0])
ys = ys[:]
plt.plot(ys[:400])
plt.plot(y_test[:400])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-8 Short-term forecasting with a trained model with time window length T=50

The same time window method can be applied to the stock closing-price data, this time with a window length of T=100 (the data preparation is analogous to the sine example above). The printed shapes of the raw series, the reshaped series, and the training set are:
(4697,)
(4697, 1)
(4136, 100, 1) (4136, 1)
learning_rate = 0.1
momentum = 0.8 #0.9
optimizer = SGD(nn.parameters(),learning_rate,momentum)
epochs=60
batch_size = 500 # len(train_x) #200
reg = 1e-6
print_n=50
losses = train_nn(nn,x_train,y_train,optimizer,
                  util.mse_loss_grad,epochs,batch_size,reg,print_n)
plt.plot(losses)
n 100
0 iter: 0.04027576839624083
50 iter: 0.0005585708338086856
100 iter: 0.0004103264701123903
150 iter: 0.0003765723130633676
200 iter: 0.0003516184170804334
250 iter: 0.00035039658640954825
300 iter: 0.00030599817269094394
350 iter: 0.00031335621767437775
400 iter: 0.000308409636035205
450 iter: 0.0003134471927653575
Figure 7-9 Network model training loss curve of stock data with time window length T=100
Use the first sample of the test set as a starting point for long-term prediction, that is, to continuously use the
predicted value to construct new data features to predict the stock price of the next day:
x = x_test[0].copy()
x = x.reshape(1,-1)
ys =[]
num = 400
for i in range(num):
    y = nn.forward(x)
    ys.append(y[0][0])
    x = np.delete(x, 0, 1)                       # drop the oldest value
    x = np.append(x, y.reshape(1,-1), axis=1)    # append the prediction
ys = ys[:]
plt.plot(ys[:num])
plt.plot(y_test[:num])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-10 Long-term forecasting with a trained model with time window length T=100
The results show that for sequence data with as little regularity as stock prices, even with a large time window (100), the long-term prediction results are not ideal. The following code uses short-term forecasting instead, that is, it always uses the real data of the previous 100 days to predict the stock price of the 101st day:

ys = []
num = 400
for i in range(num):
    x = x_test[i].copy()   # the window is re-read from the real test data each time
    x = x.reshape(1,-1)
    y = nn.forward(x)
    ys.append(y[0][0])
ys = ys[:]
plt.plot(ys[:num])
plt.plot(y_test[:num])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y','y_real'])
Figure 7-11 Short-term forecasting with a trained model with time window length T=100
According to the formula P(A|B) = P(A∩B) / P(B), the conditional probability P(w_i|w_1, w_2, ⋯, w_{i−1}) can be expressed as:

P(w_i|w_1, w_2, ⋯, w_{i−1}) = P(w_1, w_2, ⋯, w_{i−1}, w_i) / Σ_w P(w_1, w_2, ⋯, w_{i−1}, w) = P(w_1, w_2, ⋯, w_{i−1}, w_i) / P(w_1, w_2, ⋯, w_{i−1})

To calculate this conditional probability, the joint probabilities P(w_1, w_2, ⋯, w_{i−1}) and P(w_1, w_2, ⋯, w_{i−1}, w_i) are needed. These probabilities can be calculated by statistical methods that approximate probabilities with frequencies. For example, to estimate the joint probability P("China", "person") of the two words w_1 = "China" and w_2 = "person" appearing together, one can count in a corpus (such as a large collection of texts) the number n of occurrences of "China person", and the number m of occurrences of all pairs of arbitrary words (such as "Hello", "playing ball", "Chinese dream"); the frequency n/m then approximates the probability P("China", "person").
But if i is relatively large, this calculation is obviously impractical. There are two problems:

The number of possible word sequences grows exponentially with i, so the number of joint probabilities (model parameters) that must be estimated is far too large.

It is very likely that the sequence w_1, w_2, w_3, ⋯, w_{i−1}, w_i never appears in the corpus at all, so the estimated probability P(w_1, w_2, w_3, ⋯, w_{i−1}, w_i) is 0.
In order to solve this problem of the conditional probability depending on too many parameters, the Markov assumption is usually introduced: the probability of a word is assumed to be related only to a limited number of the words appearing before it. In the extreme case, the appearance of a word is assumed to be independent of its surrounding words, that is, its probability does not depend on other words at all. This language model is called a unigram language model. The probability of the sentence S = w_1, w_2, w_3, ⋯, w_n then becomes very simple:

P(w_1, w_2, w_3, ⋯, w_n) = P(w_1) ∗ P(w_2) ∗ P(w_3) ∗ ⋯ ∗ P(w_n)
But this language model is obviously unreasonable, because the appearance of words in the text will not be
independent of each other, and there is a dependency relationship. If a language model assumes that the
probability of a word appearing only depends on a word that appeared before it, this language model is called a
2-gram language model (bigram).
By analogy, if a language model assumes that the probability of a word only depends on the k-1 words in front of
it, this language model is called k-gram language model (k-gram). The k-gram language model is a specific
application of the time window method on the language model, that is, the probability of the next word is
predicted by using the first k-1 words.
Obviously, the larger k is, the higher the prediction accuracy. For example, in a 2-gram language model, if the current word is "China", there are many possible next words and it is hard to predict which will come; but in a 4-gram model, if the words that have appeared in sequence are "I", "am", "China", then the probability that the next word is "person" is very high. However, the larger k is, the more serious the two problems above become. To balance these difficulties, traditional language models generally use 3-gram or 4-gram models.
If such a k-element language model is constructed, the probability that each word in the word list will appear as
the next word can be predicted according to the k-1 words that have appeared before, so as to predict the
probability that the entire sentence will appear.
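As an illustration of the frequency-counting approach, a tiny 2-gram (bigram) model can be built from word-pair counts; the toy corpus and the helper p_next() below are made up for this sketch:

from collections import Counter

corpus = "I am Chinese . I am a person . I am happy".split()   # toy corpus
unigram = Counter(corpus)                          # count(w)
bigram = Counter(zip(corpus[:-1], corpus[1:]))     # count(w, w_next)

def p_next(w, w_next):
    # P(w_next | w) approximated by frequency: count(w, w_next) / count(w)
    return bigram[(w, w_next)] / unigram[w]

print(p_next("I", "am"))   # 1.0: in this corpus "I" is always followed by "am"
print(p_next("am", "a"))   # 0.333...: "am" is followed by "a" once out of 3 times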
Language models are the basis for a variety of natural language processing problems. For example, a language model can be used for long-term prediction from a few initial words: the next word is sampled according to the probabilities given by the language model, and the process is repeated to generate a series of subsequent words, thereby automatically generating text such as articles, novels, poems, essays, or reviews.
The k-gram language model has some obvious limitations in predicting with fixed-length "time window" data.
There are mainly two problems:
The length of the time window is difficult to determine. If the time window is too short, it causes short-sightedness. In machine translation, for example, the meaning of the current word often has to be understood from a long preceding context, and in text generation the next word can only be predicted correctly from a long preceding text. For example, in "On the way to school, Old Zhang's son saw an old lady who had fallen down. ...", whether the next word should be "he" or "she" must be determined from the earlier word "son". On the other hand, the computation of the algorithm is proportional to the length of the sample data, so if the time window is too long, it takes more time; for a language model, a long time window also makes the probabilities very difficult to estimate. In addition, short sequence samples (such as short sentences) must be padded with many blank elements, wasting space. For many sequence data problems, the length of the dependencies differs from moment to moment, and it is difficult to determine one appropriate time window length.
When an ordinary neural network is used to model sequence prediction, the scale of the model parameters grows with the time window. Compared with processing each original data sample separately, a time window of length 3 triples the length of the input sample; to capture the features of this larger input, the number of neurons in each layer grows accordingly, which multiplies the number of model parameters. This not only consumes more computing resources but also increases the complexity of the model function and easily leads to overfitting.
The time window provides only short-term memory, but when people understand things they use not only short-term memory but all of their past memory. In order to handle such variable-length sequence data, researchers imitated human long-term memory and invented the Recurrent Neural Network (RNN). An RNN adds a storage/memory unit to the neurons of a traditional neural network, so that historical computation information can be saved. In other words, the neurons have a memory function: the computation at each moment depends not only on the current input but also on the stored historical information, so data and computation results are passed along the time dimension, and a recurrent neural network can in theory memorize arbitrarily long sequences. Just as a convolutional neural network extracts features along the spatial dimensions, a recurrent neural network passes information along the time dimension; it is the extension of the acyclic neural network into the time dimension.

The function represented by an acyclic neural network is like a function without memory in a programming language, such as a C function without static variables or an ordinary Python function. For example:
def f(x):
    y = 0
    y += x*x
    return y
print(f(2),'\t',f(3))
print(f(3),'\t',f(2))
4 9
9 4
Whether f(2) is computed before f(3) or f(3) before f(2), the results of f(2) and f(3) depend only on their respective inputs 2 and 3 and are independent of the order of execution.
If an acyclic neural network is used to predict the probability of the next word from the current word, this prediction is likewise independent of the order in which the words are processed. As shown in Figure 7-12, suppose the language has only 3 words, "good", "drink", and "wine"; for each word of the input word sequence "good drink", the neural network outputs the probability of each word being the next word.

Figure 7-12 The probabilities of "good", "drink", and "wine" as the next word predicted from the word "good", and those predicted from the word "drink", are independent of each other. The probabilities predicted from the word "good" are the same regardless of whether the input sequence is "good drink", "wine good drink", or "drink good wine".

No matter in what order these three words appear in a sentence, such as "wine good drink" or "drink good wine", for each word the network predicts the probability of each word being its next word from that word alone, and the prediction is always the same. Using such a neural network as a language model, the output for each word depends only on that word and has nothing to do with the other words, which is obviously unreasonable.
Without loss of generality, suppose the neural network has only one neuron or one layer of neurons, namely:

y = f(x) = g(xW + b)

Assuming the nonlinear activation function is the sigmoid function, this neural network can be represented by Python code:

class FNN:
    # ...
    def forward(self, x):
        y = sigmoid(np.dot(x, self.W) + self.b)
        return y
nn = FNN()
y1 = nn.forward(x1)
y2 = nn.forward(x2)
y3 = nn.forward(x3)
The order of the three prediction statements has no effect on the prediction results y1, y2, y3.
As shown in Figure 7-12, in order to input a word into the neural network as a sample, each word must be vectorized, that is, converted into a vector of fixed length. Because this language model has only 3 words, a one-hot vector of length 3 can distinguish them: each word corresponds to a different one-hot vector, such as good (1,0,0), drink (0,1,0), wine (0,0,1).
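A minimal sketch of this one-hot encoding (the helper one_hot() is introduced here just for illustration):

import numpy as np
vocab = ["good", "drink", "wine"]   # the 3-word vocabulary of this example
def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v
print(one_hot(0, len(vocab)))   # "good"  -> [1. 0. 0.]
print(one_hot(1, len(vocab)))   # "drink" -> [0. 1. 0.]
print(one_hot(2, len(vocab)))   # "wine"  -> [0. 0. 1.]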
7.2.2 Recurrent neural network with memory function
Unlike the acyclic neural network, the recurrent neural network has a memory function and can be expressed as y = f(x, h): in addition to the input x, there is a hidden state variable h that records the history of the computation. Recurrent neural networks are similar to functions or classes with memory in programming languages, such as C functions containing static local variables, or class objects in languages such as C++, Java, and Python. For example, the following class rf uses a data attribute h to record the intermediate results (state) of its computations:

class rf:
    def __init__(self):
        self.h = 0
    def forward(self, x):
        self.h += 2*x
        return self.h + x*x
    def __call__(self, x):
        return self.forward(x)
f = rf()
print(f(2),'\t',f(3))
print(f(3),'\t',f(2))
8 19
25 24
For an input value x, the forward() method of the rf class computes its output not only from x but also from the information h saved during previous calls. Therefore, the outputs of f(2) and f(3) depend on the order in which they are executed. The variable h, which records the intermediate results of earlier computations, is called the state.
Like the class above, the recurrent neural network also has a variable h that records/memorizes information about the computation; in recurrent neural networks this variable is called the hidden state (variable). At any time t, the recurrent neural network computes the current output f⟨t⟩ and the current state h⟨t⟩ from the current input data x⟨t⟩ and the previous state variable h⟨t−1⟩. The state h⟨t⟩ at time t is then used as an input at time t+1 to participate in the computation at time t+1:

y⟨t+1⟩, h⟨t+1⟩ = f(x⟨t+1⟩, h⟨t⟩)

The state variable h⟨t⟩, which changes over time, stores/memorizes the historical information. Based on the history represented by this state variable and the data at the current moment, better predictions can be made.
The cyclic neural network is usually represented by the diagram shown in Figure 7-13 a). The difference
between it and the ordinary neural network is that the hidden state calculated at the current moment will
be used as the input at the next moment, so it is drawn as an arc pointing to itself. Hidden state variables
are used as both the output calculated at the current moment and the input calculated at the next moment.
Figure 7-13 b) is an expanded representation of the calculation process in the time dimension of Figure
7-13 a). It can be seen that the state variables at the previous moment are used as the input at the current
moment to calculate the output and state variables at the current moment. This state variable is used as
the input for the calculation at the next moment.
Figure 7-13 a) Representation of a recurrent neural network, in which a hidden state variable is both an output of the current computation and an input to the next computation. b) The computation of a) expanded along the time dimension: the state variable of the previous moment is used as an input at the current moment to compute the current output and state variable, and the current state variable is in turn an input to the computation at the next moment.
At the initial moment t=0, the input state variable h⟨−1⟩ is a vector with initial value 0. For the above 3-word language model, the input data is the feature vector x⟨0⟩ = (1,0,0) corresponding to the word "good", and the network computes the current output y⟨0⟩ and state variable h⟨0⟩ from x⟨0⟩ and h⟨−1⟩ using the formulas below.

Consider the simplest recurrent neural network with only one neuron or one layer of neurons, and use x⟨t⟩, h⟨t⟩, f⟨t⟩ to denote the input data, the state variable, and the output respectively. The computation of the recurrent neural network is almost the same as that of an ordinary neural network with one neuron or one layer of neurons:

h⟨t⟩ = g_h(h⟨t−1⟩ W_h + x⟨t⟩ W_x + b_h)    (1)

f⟨t⟩ = g_f(h⟨t⟩ W_f + b_f)    (2)

where g_h and g_f are the activation functions for computing the current state and the output respectively. If g_h is the tanh function and g_f is the sigmoid function, the computation can be expressed in Python code as follows:
class RNN:
    # ...
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.h, self.W_hh) + np.dot(x, self.W_hx) + self.b)
        # compute the output vector
        y = sigmoid(np.dot(self.h, self.W_hy) + self.b2)
        return y
For time series data x⟨1⟩, x⟨2⟩, x⟨3⟩, the RNN computes the outputs as follows:

rnn = RNN()
y1 = rnn.step(x1)  # also computes the hidden state h1
y2 = rnn.step(x2)  # also computes the hidden state h2
y3 = rnn.step(x3)  # also computes the hidden state h3
It can be seen that the structure of the recurrent neural network is still similar to that of an ordinary neural network; the only difference is that the computation uses the saved hidden state to compute the hidden state and output of the current moment. Expanding Figure 7-13 a) into the structure of Figure 7-13 b) along the time dimension helps in understanding the iterative computation, but the network is not multiple copies of a neural network along time: a single state variable h⟨t⟩ is added to the neural network (neuron) to save the result of the previous moment's computation. Therefore, the model parameters of a recurrent neural network do not grow as time expands, and arbitrarily long sequences can be processed by repeatedly calling rnn.step(). A time-window neural network can only handle sequences of fixed length, and its model parameters grow with the window length; the recurrent neural network neatly solves both of these problems.
An RNN stores historical information in the state variable h⟨t⟩, which gives it a memory function, just like playing a game that saves progress: each session builds on the points and abilities of the previous sessions. Playing a game without saved progress, every session is a fresh start, unrelated to the previous ones.
As with the one-to-one neural network, two auxiliary variables z_h⟨t⟩, z_f⟨t⟩ can be introduced for the weighted sums in formulas (1) and (2); the computation of the recurrent neural network can then be expressed as the following 4 formulas:

z_h⟨t⟩ = x⟨t⟩ W_x + h⟨t−1⟩ W_h + b_h

h⟨t⟩ = g_h(z_h⟨t⟩)

z_f⟨t⟩ = h⟨t⟩ W_f + b_f

f⟨t⟩ = g_f(z_f⟨t⟩)
The forward computation of an RNN with only one neuron or a single network layer is shown in Figure 7-14.

Figure 7-14 The forward computation of the recurrent neural network with the intermediate weighted-sum variables z_h⟨t⟩, z_f⟨t⟩: first z_h⟨t⟩ is computed from the input and the hidden state of the previous moment, then h⟨t⟩ through the activation function, and then z_f⟨t⟩ and f⟨t⟩.
Like ordinary one-to-one neural networks, recurrent neural networks can also have multiple layers, and
the output of the previous layer is used as the input of the next layer. At the same time, the neurons in
each layer have their own state variables. Figure 7-15 is a three-layer recurrent neural network:
A recurrent network structure that first processes the entire input sequence and then generates the output sequence is called a Sequence-to-Sequence (Seq2seq) structure.
Of course, there is also a many-to-one RNN, such as classifying a text of a word sequence (such as
performing sentiment analysis on a text, and analyzing the quality of reviews from product reviews), as
shown in Figure 7-17.
There is also a neural network with a one-to-many structure, as shown in Figure 7-18, which produces an
output sequence given an input. For example, given a word, automatically generate a series of text
composed of words, and automatically generate all notes of a musical score from a note.
Taking the synchronous many-to-many RNN as an example, and assuming the RNN has only one layer of neurons (multi-layer networks are analogous), each moment has a predicted value f⟨t⟩ and a target value y⟨t⟩, and therefore each moment has a loss L⟨t⟩, as shown in Figure 7-19:

Figure 7-19 For a synchronous many-to-many RNN network, each moment has a predicted value f⟨t⟩ and a loss L⟨t⟩.

The total loss is the sum of the losses between the predicted and target values at all moments:

L = Σ_{t=1}^{T} L⟨t⟩
If this is a one-way RNN, that is, the prediction at each moment only depends on the state at the previous
moment, the solid line in the figure represents the forward calculation process unfolded according to
time, while the dotted line represents the reverse derivation according to the reverse time process. The
loss function at each moment is a function of the variables (hidden state, input) and model parameters at
the previous moment. In the reverse derivation, it is necessary to find the gradient of the loss function
with respect to the variables and model parameters at the previous moment.
For any time t, its forward and reverse calculation process is shown in Figure 7-20:
Figure 7-20 At any time t, the forward and reverse calculation process of the cyclic neural network, the
gradient of the model parameters in the reverse derivation process includes the gradient of the current
loss and the subsequent loss with respect to the model parameters.
At any time, it is necessary to calculate the gradient of the loss at the current moment with respect to the
model parameters, and also calculate the gradient of the model parameters at the current moment
contributed by the hidden state gradient at the next moment. That is, the gradient of the model
parameters at the current moment includes the gradient of the loss at the current moment and the loss at
the subsequent moment with respect to the model parameters.
Introducing the intermediate variables z_f⟨t⟩, z_h⟨t⟩ simplifies the computation of the model parameter gradients. Figure 7-21 shows the forward computation and reverse derivation processes involving these variables.
Figure 7-21 Forward calculation and reverse derivation process including intermediate variables
Suppose the gradient ∂L⟨t⟩/∂f⟨t⟩ of the loss at the current moment with respect to f⟨t⟩ and the gradient ∂L/∂h⟨t⟩ of the losses at subsequent moments with respect to h⟨t⟩ are known. On this basis, the gradients of the loss at moment t with respect to the model parameters W_f, W_h, W_x and the hidden state h⟨t−1⟩ of the previous moment can be found.

Because the output f⟨t⟩ at time t contributes only to the loss L⟨t⟩ at time t (that is, only L⟨t⟩ depends on it), the gradient of the total loss L with respect to f⟨t⟩ is simply:

∂L/∂f⟨t⟩ = ∂L⟨t⟩/∂f⟨t⟩
Note that the model parameters W_f, W_h, W_x are shared at all moments; for example, for the model parameter W_f, the gradient of the total loss with respect to it is the sum of the gradients at all moments. The loss L⟨t⟩ at time t measures the error between the predicted value f⟨t⟩ and the true value y⟨t⟩: L⟨t⟩ depends on f⟨t⟩, f⟨t⟩ depends on z_f⟨t⟩, and z_f⟨t⟩ depends on W_f. The gradient of the loss L⟨t⟩ at time t with respect to W_f is therefore:

∂L⟨t⟩/∂W_f = (h⟨t⟩)ᵀ · ∂L⟨t⟩/∂z_f⟨t⟩ = (h⟨t⟩)ᵀ · (∂L⟨t⟩/∂f⟨t⟩ ⊙ g_f′(z_f⟨t⟩))

The gradient of the total loss with respect to W_f is obtained by accumulating the gradients of all moments:

∂L/∂W_f = Σ_{t=1}^{T} (h⟨t⟩)ᵀ · (∂L⟨t⟩/∂f⟨t⟩ ⊙ g_f′(z_f⟨t⟩))
How can ∂L/∂h⟨t⟩ be found? The hidden state h⟨t⟩ output at time t is, on the one hand, used to compute the output f⟨t⟩ and, on the other hand, serves as the hidden state input of the next moment. That is, it affects not only the loss L⟨t⟩ of the current moment through f⟨t⟩, but also, as the hidden state input of the next moment, the losses L⟨t′⟩ at all subsequent moments t′ > t. Therefore, the gradient of the loss with respect to h⟨t⟩ can be split into two parts:
Let L⟨t−⟩ denote the sum of the losses at time t and all subsequent moments, L⟨t−⟩ = Σ_{t′=t}^{T} L⟨t′⟩, and let L⟨t+1−⟩ denote the sum of the losses at all moments after t, L⟨t+1−⟩ = Σ_{t′=t+1}^{T} L⟨t′⟩. Then:

∂L⟨t−⟩/∂h⟨t⟩ = ∂L⟨t⟩/∂h⟨t⟩ + ∂L⟨t+1−⟩/∂h⟨t⟩

The second term, ∂L⟨t+1−⟩/∂h⟨t⟩, is the gradient with respect to the hidden state h⟨t⟩ output at time t that flows back from time t+1.
Of course, at the last moment T, ∂L⟨T−⟩/∂h⟨T⟩ = ∂L⟨T⟩/∂h⟨T⟩, that is, the gradient with respect to the hidden state h⟨T⟩ comes only from the loss at this last moment:

∂L⟨T⟩/∂h⟨T⟩ = ∂L⟨T⟩/∂z_f⟨T⟩ · W_fᵀ
Because h⟨t⟩ = g_h(z_h⟨t⟩), once ∂L⟨t−⟩/∂h⟨t⟩ is known, the gradient of the loss function with respect to z_h⟨t⟩ can be obtained:

∂L⟨t−⟩/∂z_h⟨t⟩ = ∂L⟨t−⟩/∂h⟨t⟩ ⊙ g_h′(z_h⟨t⟩)
Further, the gradients of the loss function with respect to the model parameters W_h, W_x and the hidden state h⟨t−1⟩ output at the previous moment can be obtained:

∂L⟨t−⟩/∂h⟨t−1⟩ = ∂L⟨t−⟩/∂z_h⟨t⟩ · W_hᵀ

∂L/∂W_h = Σ_{t=1}^{T} (h⟨t−1⟩)ᵀ · ∂L⟨t−⟩/∂z_h⟨t⟩

∂L/∂W_x = Σ_{t=1}^{T} (x⟨t⟩)ᵀ · ∂L⟨t−⟩/∂z_h⟨t⟩
Assume that the RNN has only one hidden layer, f⟨t⟩ is the output at moment t, and y⟨t⟩ is the true value at moment t. For a multi-classification problem, y⟨t⟩ can be the integer index of the true class; the gradient dzf of the multi-class cross-entropy loss L⟨t⟩ at moment t with respect to z_f⟨t⟩ is then the softmax output with 1 subtracted at the true class. The gradient dh with respect to the hidden state at moment t adds the gradient coming through the output to the gradient dh_next flowing back from moment t+1:

dh = np.dot(dzf, Wf.T) + dh_next
Knowing these two gradients, according to the above formula, the gradient of the loss function with
respect to other variables can be obtained. The following is the code for reverse derivation at time t:
dzf = np.copy(f[t])
dzf[y[t]] -= 1
dWf += np.dot(h[t].T,dzf)
dbf += dzf
dh = np.dot(dzf, Wf.T) + dh_next
dzh = (1 - h[t] * h[t]) * dh
dbh += dzh
dWx += np.dot(x[t].T,dzh)
dWh += np.dot(h[t-1].T,dzh)
dh_pre = np.dot(dzh,Wh.T)
Among them, dWf, dWx, dWh, dbh, and dbf are the gradients of the loss function with respect to the model parameters; dh_next is the gradient of the loss function with respect to the hidden state at time t+1; dh is the gradient with respect to the hidden state at the current moment; and dh_pre is the gradient of the loss function with respect to the hidden state output at the previous moment, ∂L⟨t−⟩/∂h⟨t−1⟩.
Assume the input x⟨t⟩ is a vector of length input_dim, the hidden state is a vector of length hidden_dim, and the output f⟨t⟩ is a vector of length output_dim; then the following function initializes the model parameters of the RNN:

import numpy as np
np.random.seed(1)
def rnn_params_init(input_dim, hidden_dim, output_dim, scale=0.01):
    Wx = np.random.randn(input_dim, hidden_dim)*scale   # input to hidden
    Wh = np.random.randn(hidden_dim, hidden_dim)*scale  # hidden to hidden
    bh = np.zeros((1, hidden_dim))                      # hidden bias
    Wf = np.random.randn(hidden_dim, output_dim)*scale  # hidden to output
    bf = np.zeros((1, output_dim))                      # output bias
    return [Wx, Wh, bh, Wf, bf]
In addition to the model parameters, the hidden state vector of the RNN must also be initialized: each input sample x corresponds to one hidden state vector h. When training the model, if a batch of samples X = (x^(1), x^(2), ⋯, x^(m))ᵀ is input, it corresponds to a batch of hidden state vectors H = (h^(1), h^(2), ⋯, h^(m))ᵀ. The following function initializes the hidden state vectors H for a batch of samples:
def rnn_hidden_state_init(batch_size, hidden_dim):
    return np.zeros((batch_size, hidden_dim))

The forward computation over a whole input sequence Xs can then be written as the function rnn_forward():

def rnn_forward(params, Xs, H_):
    Wx, Wh, bh, Wf, bf = params
    H = H_
    Fs = []
    Hs = {}
    Hs[-1] = np.copy(H)
    for t in range(len(Xs)):
        X = Xs[t]
        H = np.tanh(np.dot(X, Wx) + np.dot(H, Wh) + bh)
        F = np.dot(H, Wf) + bf
        Fs.append(F)
        Hs[t] = H
    return Fs, Hs
Where params is the model parameter, H_ is the hidden state input at t=0 (usually the initial value is 0).
Assuming that each sequence element is a one-dimensional vector, then Xs is a three-dimensional tensor,
that is, there are 3 axes representing sequence length, batch size, and input data length, respectively. As
shown in Figure 7-22,
Figure 7-22 xs is a three-dimensional tensor, and its three axes represent sequence length T, batch size
batch_dim, and input data length input_dim
Hs is represented by a dictionary: Hs[-1] is the input state at t=0, and Hs[t] is the output state at time t, so len(Hs) = len(Xs)+1 and Hs[len(Hs)-2] is the state at the last moment len(Xs)-1. Each Hs[t] is a two-dimensional tensor whose first axis is the batch size and whose second axis is the state vector length hidden_dim. Similarly, Fs holds the output values of all moments and can be represented as a three-dimensional tensor or a list; each Fs[t] is a two-dimensional tensor whose first axis is the batch size and whose second axis is the output vector size output_dim.
The forward calculation process at each moment can be written as a separate function
rnn_forward_step():
def rnn_forward_step(params, X, preH):
    Wx, Wh, bh, Wf, bf = params
    H = np.tanh(np.dot(X, Wx) + np.dot(preH, Wh) + bh)
    F = np.dot(H, Wf) + bf
    return F, H
Among them, X is the input at a certain moment, and preH is the hidden state at the previous moment,
and they are both two-dimensional tensors. The forward calculation process at all times can be written as
a function rnn_forward_():
def rnn_forward_(params, Xs, H_):
    Wx, Wh, bh, Wf, bf = params
    H = H_
    Fs = []
    Hs = {}
    Hs[-1] = np.copy(H)
    for t in range(len(Xs)):
        X = Xs[t]
        F, H = rnn_forward_step(params, X, H)
        Fs.append(F)
        Hs[t] = H
    return Fs, Hs
Assume Ys[t] is the target value at moment t; the loss at moment t is L_t, and the total loss accumulated over all moments is L = Σ_{t=1}^{T} L_t. Depending on the problem, L_t can be the mean squared error of a regression problem or the cross-entropy loss of a classification problem. The following function rnn_loss_grad() uses the function-object parameter loss_fn() to compute the loss loss_t at each moment and the gradient dF_t of that loss with respect to the output Fs[t], and collects the gradients of all moments in a dictionary dFs.
import util
def rnn_loss_grad(Fs, Ys, loss_fn=util.loss_gradient_softmax_crossentropy,
                  flatten=True):
    loss = 0
    dFs = {}
    for t in range(len(Fs)):
        F = Fs[t]
        Y = Ys[t]
        if flatten and Y.ndim >= 2:
            Y = Y.flatten()
        loss_t, dF_t = loss_fn(F, Y)
        loss += loss_t
        dFs[t] = dF_t
    return loss, dFs
import math
def grad_clipping(grads, alpha):
    norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
    if norm > alpha:
        ratio = alpha / norm
        for i in range(len(grads)):
            grads[i] *= ratio
The reverse derivation over a whole sequence can be written as the function rnn_backward(); its body below follows the single-moment gradient code above, accumulating the gradients backward through time:

def rnn_backward(params, Xs, Hs, dFs):
    Wx, Wh, bh, Wf, bf = params
    dWx, dWh, dbh = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(bh)
    dWf, dbf = np.zeros_like(Wf), np.zeros_like(bf)
    dh_next = np.zeros_like(Hs[0])
    h = Hs
    x = Xs
    for t in reversed(range(len(Xs))):
        dZ = dFs[t]
        dWf += np.dot(h[t].T, dZ)
        dbf += np.sum(dZ, axis=0, keepdims=True)
        dh = np.dot(dZ, Wf.T) + dh_next
        dzh = (1 - h[t] * h[t]) * dh           # tanh'(z) = 1 - h^2
        dbh += np.sum(dzh, axis=0, keepdims=True)
        dWx += np.dot(x[t].T, dzh)
        dWh += np.dot(h[t-1].T, dzh)
        dh_next = np.dot(dzh, Wh.T)
    grads = [dWx, dWh, dbh, dWf, dbf]
    grad_clipping(grads, 5.)
    return grads
Because the cyclic network expands over time, like the deep neural network, there will be gradient
explosion and gradient disappearance problems. In order to solve the gradient explosion problem, the
gradient clipping method can be used to prevent the gradient explosion. As in the above code backward()
function, the gradient is clipped grad_clipping(grads,5.) at the end.
Similarly, the reverse derivation at each moment can be written as a separate function
rnn_backward_step():
def rnn_backward_step(params, dZ, X, H, H_, dh_next):
    # H is the hidden state at the current moment, H_ at the previous moment
    Wx, Wh, bh, Wf, bf = params
    dWf = np.dot(H.T, dZ)
    dbf = np.sum(dZ, axis=0, keepdims=True)
    dh = np.dot(dZ, Wf.T) + dh_next
    dzh = (1 - H * H) * dh
    dbh = np.sum(dzh, axis=0, keepdims=True)
    dWx = np.dot(X.T, dzh)
    dWh = np.dot(H_.T, dzh)
    dh_next = np.dot(dzh, Wh.T)
    return dWx, dWh, dbh, dWf, dbf, dh_next
And functions that perform reverse differentiation on sequence data can call this single-moment reverse
differentiation function:
dWx_, dWh_, dbh_, dWf_, dbf_, dh_next = \
    rnn_backward_step(params, dZ, X, H, H_, dh_next)
for grad, grad_t in zip([dWx, dWh, dbh, dWf, dbf],
                        [dWx_, dWh_, dbh_, dWf_, dbf_]):
    grad += grad_t
import numpy as np
np.random.seed(1)
# Generate a batch of sequence samples Xs and their targets Ys
# Define an RNN model whose input, hidden, and output sizes are 4, 10, and 4
if True:
    T = 5
    input_dim, hidden_dim, output_dim = 4, 10, 4
    batch_size = 1
    seq_len = 5
    Xs = np.random.rand(seq_len, batch_size, input_dim)
    #Ys = np.random.randint(input_dim, size=(seq_len, batch_size, output_dim))
    Ys = np.random.randint(input_dim, size=(seq_len, batch_size))
    #Ys = Ys.reshape(Ys.shape[0], Ys.shape[1])
else:
    input_size, hidden_size, output_size = 4, 3, 4
    batch_size = 1
    vocab_size = 4
    inputs = [0, 1, 2, 2]   # hello
    targets = [1, 2, 2, 3]
    Xs = []
    Ys = []
    for t in range(len(inputs)):
        X = np.zeros((1, vocab_size))  # encode in 1-of-k representation
        X[0, inputs[t]] = 1
        Xs.append(X)
        Ys.append(targets[t])
    print(Xs)
    print(Ys)
The following code calculates the analytical gradient for the above sample:
# -------- check gradient -------------
params = rnn_params_init(input_dim, hidden_dim,output_dim)
H_0 = rnn_hidden_state_init(batch_size,hidden_dim)
Fs,Hs = rnn_forward(params,Xs,H_0)
loss_function = rnn_loss_grad
print(Fs[0].shape,Ys[0].shape)
loss,dFs = loss_function(Fs,Ys)
grads = rnn_backward(params,Xs,Hs,dFs)
(1, 4) (1,)
The following code defines the auxiliary function rnn_loss() for computing the RNN loss, then calls the general numerical gradient function numerical_gradient() in util to compute the numerical gradients of the RNN model parameters, compares their errors against the analytical gradients above, and also prints the first rows of one of the parameter gradients:
def rnn_loss():
    H_0 = np.zeros((1, hidden_dim))
    H = np.copy(H_0)
    Fs, Hs = rnn_forward(params, Xs, H)
    loss_function = rnn_loss_grad
    loss, dFs = loss_function(Fs, Ys)
    return loss
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
#rnn_numerical_gradient(rnn_loss,params,1e-10)
#diff_error = lambda x, y: np.max(np.abs(x - y))
diff_error = lambda x, y: np.max( np.abs(x - y)/(np.maximum(1e-8, np.abs(x) +
np.abs(y))))
print("loss",loss)
print("[dWx, dWh, dbh,dWf, dbf]")
for i in range(len(grads)):
    print(diff_error(grads[i], numerical_grads[i]))
print("grads",grads[1][:2])
print("numerical_grads",numerical_grads[1][:2])
loss 6.931604253116049
[dWx, dWh, dbh, dWf, dbf]
4.30868739852771e-06
0.00014321848390554473
8.225164888798296e-08
2.030282934604882e-07
1.155121982079175e-10
grads [[-2.39049602e-04 8.14220495e-05 1.57776751e-04 5.67414815e-05
-2.52527076e-04 7.67751376e-05 8.81253550e-05 2.07270381e-04
-6.92579913e-05 5.33532921e-05]
[-1.59775181e-04 8.33693576e-05 7.68434971e-05 4.16925859e-05
-1.31768112e-04 1.87065893e-05 3.02967764e-05 1.17071893e-04
-3.32692578e-05 2.22690120e-05]]
numerical_grads [[-2.39049225e-04 8.14224244e-05 1.57776459e-04 5.67408343e-
05
-2.52526444e-04 7.67759190e-05 8.81255069e-05 2.07270645e-04
-6.92583768e-05 5.33533218e-05]
[-1.59774860e-04 8.33693115e-05 7.68434205e-05 4.16924273e-05
-1.31767930e-04 1.87068139e-05 3.02966541e-05 1.17071686e-04
-3.32689432e-05 2.22684093e-05]]
By comparing the errors, it can be judged that the calculation of the analytical gradient is basically
correct.
The model parameters are updated with an optimizer, such as the momentum SGD optimizer used earlier, whose initialization mirrors that of the AdaGrad optimizer below:

class SGD():
    def __init__(self, model_params, learning_rate=0.01, momentum=0.9):
        self.params, self.lr, self.momentum = model_params, learning_rate, momentum
        self.vs = []
        for p in self.params:
            v = np.zeros_like(p)
            self.vs.append(v)
    def step(self, grads):
        for i in range(len(self.params)):
            grad = grads[i]
            self.vs[i] = self.momentum*self.vs[i] + self.lr*grad
            self.params[i] -= self.vs[i]
    def scale_learning_rate(self, scale):
        self.lr *= scale
Of course, other parameter optimizers are also available, such as the AdaGrad optimizer:
class AdaGrad():
    def __init__(self, model_params, learning_rate=0.01):
        self.params, self.lr = model_params, learning_rate
        self.vs = []
        self.delta = 1e-7
        for p in self.params:
            v = np.zeros_like(p)
            self.vs.append(v)
    def step(self, grads):
        for i in range(len(self.params)):
            grad = grads[i]
            self.vs[i] += grad**2
            self.params[i] -= self.lr * grad / (self.delta + np.sqrt(self.vs[i]))
    def scale_learning_rate(self, scale):
        self.lr *= scale
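Its usage is the same as SGD's; for example:

optimizer = AdaGrad(params, learning_rate=0.01)
optimizer.step(grads)   # update the parameters in place with the current gradients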
The following training function rnn_train_epoch() traverses the training data through the data iterator data_iter to complete one pass of training. Each time, it obtains a batch of sequence training samples from data_iter; each sample sequence (Xs, Ys) is composed of the samples of multiple moments, and start indicates whether this sample sequence is connected end-to-end with the previous one. For each sample sequence (Xs, Ys), rnn_forward(params, Xs, H) first computes the outputs Zs and states Hs at every moment; the loss function loss_function(Zs, Ys) then computes the model loss and the gradient dzs of the loss with respect to the outputs; reverse derivation rnn_backward(params, Xs, Hs, dzs) computes the gradients of the loss with respect to the model parameters; and finally the model parameters are updated. iterations is the maximum number of iterations, to prevent an infinite loop, and print_n is the interval for printing information.
def rnn_train_epoch(params, data_iter, optimizer, iterations, loss_function,
                    print_n=100):
    # the loop body follows the description above
    Wx, Wh, bh, Wf, bf = params
    hidden_size = Wh.shape[0]
    losses = []
    iter = 0
    H = None
    for Xs, Ys, start in data_iter:
        batch_size = Xs[0].shape[0]
        if start:
            H = rnn_hidden_state_init(batch_size, hidden_size)
        Zs, Hs = rnn_forward(params, Xs, H)
        H = Hs[len(Xs)-1]          # carry the last state over to the next sequence
        loss, dzs = loss_function(Zs, Ys)
        grads = rnn_backward(params, Xs, Hs, dzs)
        optimizer.step(grads)
        losses.append(loss)
        if False:
            print("Z.shape", Zs[0].shape)
            print("Y.shape", Ys[0].shape)
            print("H", H.shape)
        if iter % print_n == 0:
            print('iter %d, loss: %f' % (iter, loss))
        iter += 1
        if iter > iterations: break
    return losses, H
For autoregressive sequence data {x_t}, the target value y_t at time t is the next element x_{t+1} of the sequence, as in stock price data or text data where the next word is predicted. For such special sequence data, where y_t is x_{t+1}, sampling a sequence sample of length seq_len = T means taking T+1 consecutive elements: the inputs are x_τ, ⋯, x_{τ+T−1} and the targets are x_{τ+1}, ⋯, x_{τ+T}. That is, the output corresponding to the input x_τ at moment τ is x_{τ+1}, and the output corresponding to the input x_{τ+1} at moment τ+1 is x_{τ+2}, as shown in Figure 7-23.

Figure 7-23 The input at moment τ is x_τ and the corresponding output is x_{τ+1}; at moment τ+1 the input is x_{τ+1} and the output is x_{τ+2}.
In order to train the RNN model, many sequence samples can be sampled from the original sequence as
the training set of the model. If the two adjacent sequence samples of these sequence samples are
connected at the beginning and end, this sampling method is called sequential sampling , otherwise it is
called random sampling. For example, for the following sequence:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
If random sampling is used, the sequence of sampled samples might look like this:
([0,1,2],[1,2,3])、([2,3,4],[3,4,5])、([12,13,14],[13,14,15])、([7,8,9],
[8,9,10])、...
The lengths of all the sequence samples sampled above are the same (both are 3). In fact, the lengths of
the sequence samples can be different, but for the sake of simplicity, the sequence samples of the same
length are sampled.
When training the RNN model on a sequence sample, the input hidden state H_{−1} at the initial moment is usually initialized to 0, indicating that there is no historical information. With sequential sampling, however, the last moment of one sequence sample is immediately followed by the first moment of the next, so the final hidden state of the previous sequence sample can be used directly as the input hidden state of the next one instead of re-initializing it to 0. In this way the historical information of earlier sequence samples helps in processing the current one, and in theory later sequence samples can exploit the history contained in all previous ones.
Let data be the original sequence data, and the length of all sampled sequence samples is T. The
following iterator function uses sequential sampling to generate sequence samples, that is, sequence
samples generated sequentially are connected end to end:
import numpy as np
def seg_data_iter_consecutive_one(data, T, start_range=0, repeat=False):
    n = len(data)
    if start_range > 0:
        start = np.random.randint(0, start_range)
    else:
        start = 0
    end = n - T
    while True:
        for p in range(start, end, T):
            # pick a training sample
            X = data[p:p+T]
            Y = data[p+1:p+T+1]
            if p == start:
                yield X, Y, True
            else:
                yield X, Y, False
        if not repeat:
            return
The parameter start_range determines the initial sampling position start (the default 0 means sampling always starts from the beginning of the sequence); a positive value makes each pass start from a random position, so the sampled sequence samples are more random. repeat indicates whether the original sequence should be sampled repeatedly; the default False means the sequence is traversed only once. The third returned value indicates whether the sequence sample is the first one.
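The output below was presumably produced by a loop like the following (with start_range=5, the random starting offset here happened to be 4):

data = list(range(20))
for X, Y, start in seg_data_iter_consecutive_one(data, 3, start_range=5):
    print(X, Y)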
[4, 5, 6] [5, 6, 7]
[7, 8, 9] [8, 9, 10]
[10, 11, 12] [11, 12, 13]
[13, 14, 15] [14, 15, 16]
[16, 17, 18] [17, 18, 19]
Random sampling does not need to ensure that the two sequence samples sampled in sequence are
connected end to end, and its implementation is simpler, such as the following random sampling iterator
function:
import numpy as np
import random
def seg_data_iter_random_one(data, T, repeat=False):
    while True:
        end = len(data) - T
        indices = list(range(0, end))
        random.shuffle(indices)
        for i in range(end):
            p = indices[i]
            X = data[p:p+T]
            Y = data[p+1:p+T+1]
            yield X, Y
        if not repeat:
            return
When training a neural network, each iteration usually uses not a single sequence sample but a batch of samples; batch sampling is likewise divided into sequential and random sampling. If a batch contains batch_size samples, the function above can simply be called batch_size times to obtain them. However, this naive batch sampling has a problem: the samples of one batch may be highly correlated or even be the same sequence sample. If all sequence samples in a batch were identical, the batch would be equivalent to a single sample, and training on a batch would lose its meaning.
For random sampling, it suffices that the starting positions within one batch differ. The function above only needs a slight modification: take batch_size consecutive subscripts from the subscript array indices each time as the starting positions of the sequence samples. Because random.shuffle(indices) has already shuffled the subscript array before the for loop, the sequence samples within a batch start at randomly scattered positions. This gives the function seg_data_iter_random(), which randomly draws a batch of sequence samples:
import numpy as np
import random
def seg_data_iter_random(data, T, batch_size, repeat=False):
    while True:
        end = len(data) - T
        indices = list(range(0, end))
        random.shuffle(indices)
        for i in range(0, end, batch_size):
            batch_indices = indices[i:(i+batch_size)]
            X = [data[p:p+T] for p in batch_indices]
            Y = [data[p+1:p+T+1] for p in batch_indices]
            yield X, Y
        if not repeat:
            return
The recurrent neural network keeps a separate hidden state for each input sample; different sequence samples of the same batch correspond to different hidden states. If two consecutive batches are connected end to end, the later batch can directly reuse the hidden states of the earlier batch instead of re-initializing them, so that more historical information is used. Sequential sampling must therefore ensure that the corresponding samples of consecutive batches are connected end to end.

As shown in Figure 7-24, if each batch contains 2 sequences, the first sequence of the second batch should be connected end to end with the first sequence of the first batch, and likewise the second sequence of the second batch with the second sequence of the first batch. The first sequences of all batches then make up one long sequence sample, and the second sequences of all batches make up another.
Figure 7-24 The data of the first sequence sample (red) of the batch sample is end-to-end, and the data of
the second sequence sample (blue) is end-to-end
How to ensure that all batches are end-to-end? A simple solution is to divide the original data into
batch_size sub-parts, and use sequential sampling to sample a sequence sample in each sub-part, which
naturally ensures that the batch_size sequence samples are connected end to end, and different samples
in each batch come from different parts. For the above data, set batch_size=2, the original sequence data
is divided into 2 parts:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] and [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
The following code can divide the data sequence into batch_size subparts:
batch_size = 2
data= np.array(data)
data = data.reshape(batch_size,-1)
print(data)
[[ 0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]]
The sequence sample x1=[0,1,2] can be taken from the first part, and the sequence sample x2 =
[10,11,12] can be taken from the second part to form a batch sequence sample. Take the sequence
sample x1 = [3,4,5] from part 1, and take the sequence sample x2 = [13,14,15] from part 2 to
form another batch of sequence samples. Take the sequence sample x1=[6,7,8] from the first part, and
take the sequence sample x2 = [16,17,18] from the second part to form another batch sequence
sample.
However, in addition to the input, each sequence sample should also contain the target sequence, and the
target sequence is exactly one position behind the input sequence. Therefore, the following code can be
used to generate 2*batch_size sub-blocks:
data = np.array(range(20))
print(data)
batch_size = 2
block_len = (len(data)-1)//2
print(block_len)
data_x = data[0:block_len*batch_size]
data_x = data_x.reshape(batch_size,-1)
print(data_x)
data_y = data[1:1+block_len*batch_size]
data_y = data_y.reshape(batch_size,-1)
print(data_y)
data_x consists of batch_size sub-blocks used to generate the input sequence samples, and data_y consists of batch_size sub-blocks, staggered one position after data_x, used to form the target sequence samples.
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
9
[[ 0 1 2 3 4 5 6 7 8]
[ 9 10 11 12 13 14 15 16 17]]
[[ 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18]]
Now a sequence can be taken from the first row of data_x and data_y as an input sequence and target sequence respectively, and a second sample from their second rows, forming the first batch of sequence samples:

x1 = [0,1,2], y1 = [1,2,3]
x2 = [10,11,12], y2 = [11,12,13]

In the same way, the second batch of sequence samples can be taken:

x1 = [3,4,5], y1 = [4,5,6]
x2 = [13,14,15], y2 = [14,15,16]
Following this method, the batch sequential sampling function rnn_data_iter_consecutive() can be written; a first, minimal version (the loop that yields successive batches is filled in here to match the demonstration below):

def rnn_data_iter_consecutive(data, batch_size, seq_len, start_range=1):
    start = np.random.randint(0, start_range) if start_range > 0 else 0
    block_len = (len(data)-start-1) // batch_size
    Xs = data[start:start+block_len*batch_size]
    Xs = Xs.reshape(batch_size, -1)
    Ys = data[start+1:start+block_len*batch_size+1]
    Ys = Ys.reshape(batch_size, -1)
    for i in range(0, block_len-seq_len+1, seq_len):
        X = Xs[:, i:i+seq_len]
        Y = Ys[:, i:i+seq_len]
        yield X, Y, i == 0
data = list(range(20))
print(data[:20])
data_it = rnn_data_iter_consecutive(np.array(data[:20]),2,3,1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
X: [[ 0 1 2]
[ 9 10 11]]
Y: [[ 1 2 3]
[10 11 12]]
X: [[ 3 4 5]
[12 13 14]]
Y: [[ 4 5 6]
[13 14 15]]
X: [[ 6 7 8]
[15 16 17]]
Y: [[ 7 8 9]
[16 17 18]]
Each X of the batch sequence samples sampled above is a two-dimensional tensor with the batch size on the first axis and the sequence length on the second. The recurrent neural network above instead assumes that the first axis of a sequence sample is the sequence length; the two axes can be exchanged:
X = np.swapaxes(X,0,1)
The X above assumes that each data element is a scalar of length 1, but in actual problems each data element may be a vector of multiple features or even a multi-dimensional tensor (such as an image). If each data element is a feature vector, X is a three-dimensional tensor. The two-dimensional sequence sample X above can therefore be converted into a three-dimensional tensor:
X = X.reshape(X.shape[0],X.shape[1],-1)
That is, a 3rd axis is added. Combine the two lines of code:
x1 = np.swapaxes(X,0,1)
x1 = x1.reshape(x1.shape[0],x1.shape[1],-1)
print(x1)
[[[ 6]
[15]]
[[ 7]
[16]]
[[ 8]
[17]]]
Therefore, the function above can be rewritten with an added to_3D parameter that determines whether to convert the samples to 3D tensors (again, the yield loop is completed here to match the usage below):

import numpy as np
def rnn_data_iter_consecutive(data, batch_size, seq_len, start_range=10,
                              to_3D=True):
    # sample from data[start:] each time, so that the training samples
    # differ from epoch to epoch
    start = np.random.randint(0, start_range)
    block_len = (len(data)-start-1) // batch_size
    Xs = data[start:start+block_len*batch_size]
    Ys = data[start+1:start+block_len*batch_size+1]
    Xs = Xs.reshape(batch_size, -1)
    Ys = Ys.reshape(batch_size, -1)
    for i in range(0, Xs.shape[1]-seq_len+1, seq_len):
        X = Xs[:, i:i+seq_len]
        Y = Ys[:, i:i+seq_len]
        if to_3D:
            X = np.swapaxes(X, 0, 1)
            X = X.reshape(X.shape[0], X.shape[1], -1)
            Y = np.swapaxes(Y, 0, 1)
            Y = Y.reshape(Y.shape[0], Y.shape[1], -1)
        yield X, Y, i == 0
The data iterator yields samples (Xs, Ys) together with a flag indicating whether to reset the RNN hidden state; if the flag is True, the hidden state H is re-initialized.
data = np.array(list(range(20))).reshape(-1,1)
data_it = rnn_data_iter_consecutive(data,2,3,2)
i = 0
for X, Y, _ in data_it:
    print("X:", X)
    print("Y:", Y)
    i += 1
    if i == 2: break
X: [[[ 0]
[ 9]]
[[ 1]
[10]]
[[ 2]
[11]]]
Y: [[[ 1]
[10]]
[[ 2]
[11]]
[[ 3]
[12]]]
X: [[[ 3]
[12]]
[[ 4]
[13]]
[[ 5]
[14]]]
Y: [[[ 4]
[13]]
[[ 5]
[14]]
[[ 6]
[15]]]
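Next, an RNN model is trained on the autoregressive sine data. The preparation of the series and of the regression loss is not shown in the text, but was presumably something like the following (the lambda wrapping rnn_loss_grad around util.mse_loss_grad is an assumption):

data = gen_seq_data_from_function(lambda ts: np.sin(ts*0.1)+np.cos(ts*0.2),
                                  np.arange(0, 5000))
print(data.shape)
# per-sequence regression loss: MSE at every moment, accumulated by rnn_loss_grad
loss_function = lambda Fs, Ys: rnn_loss_grad(Fs, Ys, util.mse_loss_grad,
                                             flatten=False)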
batch_size = 3
input_dim = 1
output_dim= 1
hidden_size=100
seq_length = 50
params = rnn_params_init(input_dim, hidden_size,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_size)
data_it = rnn_data_iter_consecutive(data,batch_size,seq_length,2)
x,y,_ = next(data_it)
print("X:",x.shape,"Y:",y.shape,"H:",H.shape)
Zs,Hs = rnn_forward(params,x,H)
print("Z:",Zs[0].shape,"H:",Hs[0].shape)
loss,dzs = loss_function(Zs,y)
print(dzs[0].shape)
epoches = 10
learning_rate = 5e-4
iterations =200
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
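The loop that drives training over multiple epochs is also not shown; a sketch, assuming each epoch calls rnn_train_epoch() with a fresh sequential data iterator (which explains why "iter 0" is printed once per epoch below):

for epoch in range(epoches):
    data_it = rnn_data_iter_consecutive(data, batch_size, seq_length, 2)
    losses_epoch, H = rnn_train_epoch(params, data_it, optimizer, iterations,
                                      loss_function, print_n=100)
    losses += losses_epoch
plt.plot(losses)

The printed shapes and per-epoch logs are: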
(5000,)
X: (50, 3, 1) Y: (50, 3, 1) H: (3, 100)
Z: (3, 1) H: (3, 100)
(3, 1)
iter 0, loss: 52.575362
iter 0, loss: 41.488531
iter 0, loss: 2.666009
iter 0, loss: 1.424797
iter 0, loss: 0.849381
iter 0, loss: 0.723504
iter 0, loss: 0.581355
iter 0, loss: 0.938593
iter 0, loss: 1.019344
iter 0, loss: 0.297335
Figure 7-25 Training loss curve of RNN model for autoregressive data
Prediction
The following code uses the trained RNN model to predict the output of the next 500 moments from the
data at a certain moment:
H = rnn_hidden_state_init(1, hidden_size)
start = 3
x = data[start:start+1].copy()
x = x.reshape(x.shape[0], 1, -1)   # reshape to 3D just to inspect the shape
print(x.shape)
x = x.reshape(1, -1)               # back to (1, input_dim) for the step function
ys = []
print(x.flatten())
for i in range(500):
    F, H = rnn_forward_step(params, x, H)
    x = F                          # feed the prediction back as the next input
    ys.append(F[0, 0])
print(len(ys))
plt.plot(ys[:500])
plt.plot(data[start+1:start+1+500])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y', 'y_real'])
plt.show()
(1, 1, 1)
[1.12085582]
500
Figure 7-26 Comparison of long-term forecast data and real data of the rnn autoregressive model
This long-term prediction is not very accurate. Alternatively, one can predict only the data of the next moment from the current moment, that is, predict data[t+1] from data[t]. The following code uses this short-term prediction method to predict the value at the next moment from the true value at each moment in data[start:start+500], that is, to predict data[start+1:start+1+500]:
H = rnn_hidden_state_init(1, hidden_size)
start = 3
ys = []
for i in range(500):
    x = data[start+i:start+i+1].copy()   # the true value at each moment
    x = x.reshape(1, -1)
    F, H = rnn_forward_step(params, x, H)
    ys.append(F[0, 0])
plt.plot(ys[:500])
plt.plot(data[start+1:start+501])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y', 'y_real'])
plt.show()
Figure 7-27 Comparison of short-term forecast data and real data of the rnn autoregressive model
The result of the short-term prediction at the next moment completely coincides with the real data,
indicating that the short-term prediction is very good. The relevant code for the above RNN is in the
rnn.py file in the code for this book.
The same method applies to real sequence data. The following code reads the S&P 500 stock data and takes the closing-price column as an autoregressive series:

data = read_stock('sp500.csv')
data = np.array(data.iloc[:,-2]).reshape(-1,1)

For such autoregressive sequence data, the training and prediction code above can be used directly. With a learning rate of 1e-4 and 40 epochs of batch gradient descent, the training loss curve is as follows:
Figure 7-28. Training Loss Curves for an Autoregressive Model of Stock Closing Prices
The long-term forecast and short-term forecast are shown in the figure respectively:
Figure 7-29. Long-Run Forecast of Stock Closing Price Autoregressive Model
The above only uses the historical data of the stock closing price to predict the future stock closing price.
The following code uses all the indicators of the stock (opening price, highest price, lowest price, closing
price, trading volume) to predict the closing price of the stock. First, the RNN model is also trained:
import pandas as pd
import numpy as np
data = read_stock('sp500.csv')
stock_data = np.array(data)
print("stock_data.shape",stock_data.shape)
print("stock_data[:3]\n",stock_data[:3])
def stock_data_iter(data,seq_length):
feature_n = data.shape[1]
num = (len(data)-1)//seq_length
while True:
for i in range(num):
#Select a training sample
p = i*seq_length
inputs = data[p:p+seq_length]
targets = data[p+1:p+seq_length+1][:,-2]
inputs = np.expand_dims(inputs, axis=1)
targets = targets.reshape(-1,1)
if i==0:
yield inputs,targets,True
else:
yield inputs,targets,False
batch_size = 1
input_dim= stock_data.shape[1]
hidden_dim = 100
output_dim=1
params = rnn_params_init(input_dim, hidden_dim,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_dim)
# hyperparameters
epoches = 2
learning_rate = 1e-4
iterations =2000
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
stock_data.shape (4697, 5)
stock_data[:3]
[[-0.00597324 -0.00591629 -0.01567558 -0.01231037 -0.19118446]
[-0.01226569 -0.01617188 -0.03401657 -0.03724877 -0.1842296 ]
[-0.0372919 -0.03505779 -0.04286668 -0.03604657 -0.17733781]]
(100, 1, 5) (100, 1)
iter 0, loss: 0.105906
iter 200, loss: 0.092861
iter 400, loss: 0.561419
iter 600, loss: 0.061234
iter 800, loss: 0.447817
iter 1000, loss: 2.762900
iter 1200, loss: 0.713906
iter 1400, loss: 0.022479
iter 1600, loss: 0.004160
iter 1800, loss: 0.011423
iter 2000, loss: 0.033837
The above sequence data is not autoregressive: the stock data at each moment is a vector of multiple features, while the predicted stock price at the next moment is a single value. Since the model's input is a multi-feature vector but its output is a single value, the prediction cannot be fed back as the next input, so no long-term forecasts can be made with this model. The following code performs short-term prediction with the trained RNN:
H = rnn_hidden_state_init(1, hidden_dim)
start = 3
data = stock_data[start:, :]
ys = []
for i in range(len(data)):
    x = data[i, :].copy()
    x = x.reshape(1, -1)
    f, H = rnn_forward_step(params, x, H)
    ys.append(f[0, 0])
plt.plot(ys[:500])
plt.plot(data[:500, -2])
plt.xlabel("time")
plt.ylabel("value")
plt.legend(['y', 'y_real'])
Figure 7-32 The short-term prediction effect of the training model of stock data
According to this probability, a word is sampled as the next word; repeating this process continuously generates new words from the initial word, producing a sequence of words, that is, a text. This process of automatically generating a large piece of text from one or a few initial words based on a language model is called text generation.

Text generation relies on a trained language model. To train a language model, sequence data in units of words must be sampled from existing texts such as one or more novels or prose pieces. The original texts used to sample the word sequences are called the corpus. The corpus is usually first divided into word sequences, after which the sequence samples used for training the RNN model can be drawn with the sequence-sampling method described earlier.
For English text, the original text can be divided into word sequences using spaces and punctuation marks, while for Chinese text, special word segmentation techniques are needed to extract words. In any language the number of words is very large. For simplicity, each character can be regarded as a word; such a language model is called a character language model. A character language model does not need to extract words from the text, and the number of characters in a language is often much smaller than the number of words: English has only 26 letters and a small number of punctuation marks, while the number of English words is huge.
Whether it is a character language model or an ordinary word language model, the principle is the same. Before training the language model with an RNN, the basic unit of the language model (the word or character) must be vectorized, that is, converted into a numeric vector. The first step in vectorizing words (characters) is to build a word table (character table).

If the corpus contains only one text file 'input.txt' containing Shakespeare's plays, the following code reads the text content into data; set(data) constructs the set of all distinct characters, which is then put into a list object chars (chars = list(set(data))). This list object is the character table of all characters.
filename = 'input.txt'
data = open(filename, 'r').read()
chars = list(set(data))
Output the total number of characters in the text, the length of the character table, the first 10 characters of the character table, and the first 148 characters of the text:
The total number of characters is 1115394, and the length of the character
list is 65 unique.
First 10 characters of character table:
['t', 'z', 'A', 'Y', 'm', ' ', 'B', 'g', 'r', '.']
First 148 characters:
First Citizen:
Before we proceed any further, hear me speak.
All:
Speak, speak.
First Citizen:
You are all resolved rather to die than to famish?
Each character in the character table corresponds to a subscript, and two dictionaries can be used to
represent the mapping relationship between characters to subscripts and subscripts to characters:
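For example (these are the char_to_idx and idx_to_char dictionaries used by the prediction code later in this section):

char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}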
With the character table, each character can be vectorized. The simplest way is to represent a character by a one-hot vector according to its subscript in the character table: the vector's length equals the length of the character table, and it is 0 everywhere except for a 1 at the subscript corresponding to the character. Figure 7-33 shows the case of a character table with only four characters:
Figure 7-33 There are only 4 characters in the character table: 'h','e','l','o'
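Such a one-hot encoding can be produced by a small helper; the name one_hot_idx matches the auxiliary function used by predict_rnn() below, while the exact signature here is an assumption:

import numpy as np
def one_hot_idx(idx, vocab_size):
    # a (1, vocab_size) one-hot row vector: all zeros except a 1 at position idx
    x = np.zeros((1, vocab_size))
    x[0, idx] = 1
    return x

print(one_hot_idx(2, 4))   # [[0. 0. 1. 0.]]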
import numpy as np
def character_seq_data_iter_consecutive(data, batch_size, seq_len, start_range=10):
    # sampling in data[start:] each time makes the training samples of each
    # epoch different
    start = np.random.randint(0, start_range)
    block_len = (len(data)-start-1) // batch_size
    num_batches = block_len // seq_len   # the maximum number of batches that can
                                         # be sampled consecutively in each block
    bs = np.array(range(0, block_len*batch_size, block_len))   # starting position of each block
    while True:
        for i in range(num_batches):
            X = np.array([[data[start + j + i*seq_len + t] for j in bs]
                          for t in range(seq_len)])
            Y = np.array([[data[start + j + i*seq_len + t + 1] for j in bs]
                          for t in range(seq_len)])
            yield X, Y, (i == 0)
data_it = character_seq_data_iter_consecutive(data, 2, 3)
i = 0
for x, y, _ in data_it:
    print("x:", x)
    print("y", y)
    i += 1
    if i == 2: break
x: [['L' 'r']
['i' 'e']
[',' ' ']]
y [['i' 'e']
[',' ' ']
['w' 'y']]
x: [['w' 'y']
['h' 'o']
['e' 'u']]
y [['h' 'o']
['e' 'u']
['r' ' ']]
The characters returned by the function need to be further vectorized, such as converting each character
into a one-hot vector form. For this, modify the above function:
def character_seq_data_iter_consecutive(data, batch_size, seq_len, vocab_size,
                                         start_range=10):
    # sampling in data[start:] each time makes the training samples of each
    # epoch different; data is assumed to be a sequence of character indices
    start = np.random.randint(0, start_range)
    block_len = (len(data)-start-1) // batch_size
    num_batches = block_len // seq_len   # the maximum number of batches that can
                                         # be sampled consecutively in each block
    bs = np.array(range(0, block_len*batch_size, block_len))
    eye = np.eye(vocab_size, dtype=int)
    while True:
        for i in range(num_batches):
            X = np.array([[eye[data[start + j + i*seq_len + t]] for j in bs]
                          for t in range(seq_len)])
            Y = np.array([[[data[start + j + i*seq_len + t + 1]] for j in bs]
                          for t in range(seq_len)])
            yield X, Y, (i == 0)
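The call that produced the output below is not preserved; a plausible reconstruction, assuming the corpus has already been converted to a list of character indices, is:

vocab_size = len(chars)
corpus_indices = [char_to_idx[ch] for ch in data]
data_it = character_seq_data_iter_consecutive(corpus_indices, 2, 3, vocab_size)
i = 0
for x, y, _ in data_it:
    print("x:", x)
    print("y", y)
    i += 1
    if i == 2: break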
x: [[[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]]
y [[[62]
[54]]
[[51]
[10]]
[[49]
[12]]]
x: [[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]
[[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]]]
y [[[ 6]
[4]]
[[54]
[20]]
[[56]
[10]]]
7.5.3 RNN model training and prediction
If the length of the character table is vocab_size, the one-hot vector of each character has length vocab_size; that is, the length input_dim of the input data at each moment is vocab_size. The prediction at each moment is the probability of every character being the next character, so the size output_dim of the output vector is also vocab_size. Together with the length hidden_size of the RNN hidden state vector and the batch size batch_size of each training sample, an RNN model can be initialized:
batch_size = 1
input_dim = vocab_size
output_dim= vocab_size
hidden_size=100
params = rnn_params_init(input_dim, hidden_size,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_size)
predict
For the character-language RNN model above, given an initial character (or character sequence), the RNN model can continuously predict the next character, thereby generating a text composed of many characters.

The following function predict_rnn() accepts the RNN model parameters params and an initial string prefix (which may contain only one character), and then generates a sequence of characters following the prefix. It first feeds each character of the prefix as the input at successive moments, producing an output z at each step. Once the prefix is exhausted, it computes from the output z of the previous moment the probability p of each character being the next character, and samples one character according to p as the input at the next moment. The auxiliary function one_hot_idx produces the one-hot vector of a character from its subscript. The expected target character of each moment is recorded in the list output: the beginning part consists of the characters of the prefix, followed by the characters sampled according to the predicted probabilities.
def predict_rnn(params, prefix, n):
    Wx, Wh, bh, Wf, bf = params
    vocab_size, hidden_size = Wx.shape[0], Wh.shape[1]
    h = rnn_hidden_state_init(1, hidden_size)
    output = [char_to_idx[prefix[0]]]
    for t in range(n + len(prefix) - 1):
        x = one_hot_idx(output[-1], vocab_size)
        z, h = rnn_forward_step(params, x, h)
        if t < len(prefix) - 1:
            # still consuming the prefix: the next target character is known
            output.append(char_to_idx[prefix[t + 1]])
        else:
            # sample the next character from the predicted distribution
            p = np.exp(z) / np.sum(np.exp(z))
            # idx = int(p.argmax(axis=1))
            idx = np.random.choice(range(vocab_size), p=p.ravel())
            output.append(idx)
    return ''.join([idx_to_char[i] for i in output])
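Calling the function with the still-untrained parameters (the call itself is not preserved; it presumably mirrored the one used after training):

print(predict_rnn(params, "he", 200))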
heokIX..ytE:JhMjGN:AXpNH;MZZZ&prP?I;,N;!
U,zu-&veMgvasx;!VBx3BYSYVljxozYjgiQcMbIHYISWpGTlkZcFjclR-n??
T&mRhnHe;ewTNZLyLOkNizPuWliTtTX&&dGHtBm$VFWVgT
KBF!aOiHM-!TzrhwXW
gEiG?f,kEqipDQJ3yQIKwXkcptNhJ&CTmke
Since the initial RNN model parameters are random, the predictions are random too, and the generated text is gibberish. The RNN model can be trained on sequence samples from the text corpus, as in the following code:
batch_size = 3
input_dim = vocab_size
output_dim= vocab_size
hidden_size=100
params = rnn_params_init(input_dim, hidden_size,output_dim)
H = rnn_hidden_state_init(batch_size,hidden_size)
seq_length = 25
epoches = 3
learning_rate = 1e-2
iterations =10000
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
Figure 7-34 The training loss curve of the character language model
str = predict_rnn(params,"he",200)
print(str)
her creatuep I wikes spiines corvantle coulling go, your fear him hole.
No, ay no linged siffate too,
come, my wise altes in by is beays friond, and we within; beems
You jores fad lealene,
ine holl i w
It can be seen that the output text is already similar to normal text. Character language models can be used
not only to generate text, but also to other problems, such as generating musical scores.
The bias and the input are ignored and only the hidden state vector $h$ is considered; that is, the hidden state is

$$h_t = \sigma(w h_{t-1})$$

By the chain rule, the gradient of the state at moment $t$ with respect to the state at an earlier moment $t'$ is

$$\frac{\partial h_t}{\partial h_{t'}} = \prod_{k=1}^{t-t'} w\,\sigma'(w h_{t-k}) = w^{t-t'} \prod_{k=1}^{t-t'} \sigma'(w h_{t-k})$$

If the weight $w$ is not equal to 0, then when $0 < |w| < 1$ this expression decays exponentially to 0 as $t - t'$ grows, and when $|w| > 1$ it grows exponentially to infinity; that is, the gradient $\frac{\partial h_t}{\partial h_{t'}}$ decays to 0 or explodes. According to formula (7-17),

$$\frac{\partial L}{\partial w} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_T}\,\frac{\partial h_T}{\partial h_t}\,\frac{\partial h_t}{\partial w}$$

so $\frac{\partial L}{\partial w}$ likewise decays to 0 or explodes to infinity, which causes the model parameter $w$ to oscillate back and forth or to hardly move during training; that is, training cannot converge. The longer the sample sequence, the more likely the gradient is to decay or explode.
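A quick numerical check makes this exponential behavior concrete:

print(0.9**50)   # ~0.005: over 50 time steps the gradient contribution all but vanishes
print(1.1**50)   # ~117.4: over 50 time steps the gradient contribution explodes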
Clipping gradients can handle exploding gradients, but not decaying gradients.
LSTM introduces a cell state $C_t$ that is distinct from the hidden state $h_t$. The cell states $C_{t-1}$ and $C_t$ at consecutive moments are related additively rather than multiplicatively:

$$C_t = i \odot \tilde{C}_t + f \odot C_{t-1}$$

The gradient $\frac{\partial L}{\partial C_t}$ is therefore also propagated additively:

$$\frac{\partial L}{\partial C_{t-1}} = \cdots + f \odot \frac{\partial L}{\partial C_t}$$

Since $f$ is a value close to 1, $\frac{\partial L}{\partial C_t}$ can be kept stable, so the gradient does not vanish, and gradient explosion is also alleviated (although it can still occur).
LSTM extends the traditional RNN by adding a cell state $C_t$ that specifically remembers historical information. The cell state $C_t$ is the cumulative memory of all historical information and can flow from one cell to the next. The original hidden state $h_t$ determines to what extent the information in $C_t$ is used in updating the cell information at the next moment. Think of $C_t$ as the mighty long river of history and $h_t$ as the part of that historical information which affects contemporary social activities: for example, Confucianism may have had greater influence in one era, while Taoism had greater influence in another.
Inside the cell there is a current memory unit (also called the candidate memory unit) that computes the contribution $\tilde{C}_t$ of the current input to the total historical information $C_t$, also known as the activation value. The activation value is like the contribution of contemporary social activities to history. As shown in Figure 7-36, the current memory unit computes the activation value at the current moment as

$$\tilde{C}_t = \tanh(x_t W_{xc} + h_{t-1} W_{hc} + b_c)$$
where $W_{xc} \in \mathbb{R}^{d \times h}$ and $W_{hc} \in \mathbb{R}^{h \times h}$ are the weight parameters and $b_c \in \mathbb{R}^{1 \times h}$ is the bias parameter. Here $h$ is the vector length of the hidden state $h_t$ and the cell state $C_t$, and $d$ is the number of features of the input sample. The activation value $\tilde{C}_t$ at the current moment depends not only on the input at the current moment but also on the hidden state $h_{t-1}$ passed from the previous moment.
Figure 7-36 The current memory unit accepts the hidden state $h_{t-1}$ of the previous moment and the input $x_t$ of the current moment, and outputs the activation value $\tilde{C}_t$ of the current moment
In addition to the current memory unit, the cell contains three gates: the input gate, the output gate, and the forget gate. A gate is a mechanism that determines whether, and to what degree, information can flow through. It multiplies the output $f$ of a sigmoid function $\sigma$ element-wise with an input, determining how much of the input passes through the gate, $out = f * in$, as shown in Figure 7-37.

Figure 7-37 The gate multiplies the output $f$ of the sigmoid function $\sigma$ element-wise with the input $in$, determining how much of $in$ is output, that is, $out = f * in$

Let the input of the sigmoid function $\sigma$ be $x$; its value $\sigma(x)$ lies between 0 and 1. If $\sigma(x) = 0$, multiplying it by some input $c$ means that $c$ produces no output at all; if $\sigma(x) = 1$, multiplying it by $c$ means that $c$ is output completely.
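A tiny numpy sketch of this gate mechanism:

import numpy as np
def sigmoid(x):
    return 1/(1+np.exp(-x))
c_in = np.array([2.0, -1.0, 0.5])
f = sigmoid(np.array([-10.0, 0.0, 10.0]))   # ~[0, 0.5, 1]
print(f * c_in)   # ~[0, -0.5, 0.5]: blocked, half passed, fully passed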
As shown in Figure 7-38, the forget gate controls how much of the total information $C_{t-1}$ of the previous moment is forgotten (and, conversely, how much is remembered). It accepts the input data $x_t$ and the previous state $h_{t-1}$, and outputs through the $\sigma$ function a value $f_t$ between 0 and 1, which is multiplied element-wise with the previous cell state, $f_t \odot C_{t-1}$, indicating how much of $C_{t-1}$ is kept. Its mathematical formula, by the same convention as the current memory unit, is:

$$f_t = \sigma(x_t W_{xf} + h_{t-1} W_{hf} + b_f)$$

Figure 7-38 The output $f_t$ of the forget gate is multiplied element-wise with the historical information $C_{t-1}$ of the previous moment, determining how much of $C_{t-1}$ is forgotten
As shown in Figure 7-39, the input gate accepts the input data $x_t$ and the previous state $h_{t-1}$, and outputs through the $\sigma$ function a value $i_t$ between 0 and 1; the element-wise product $i_t \odot \tilde{C}_t$ determines how much of $\tilde{C}_t$ participates in the update. Its formula is:

$$i_t = \sigma(x_t W_{xi} + h_{t-1} W_{hi} + b_i)$$

Figure 7-39 The output $i_t$ of the input gate is multiplied element-wise with the current activation value $\tilde{C}_t$; $i_t \odot \tilde{C}_t$ determines how much of the activation value enters the aggregated historical information
As shown in Figure 7-40, adding the historical information $f_t \odot C_{t-1}$ of the previous moment retained by the forget gate and the activation information $i_t \odot \tilde{C}_t$ of the current moment admitted by the input gate yields the new historical information of the current state:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Figure 7-40 The sum of the historical information $f_t \odot C_{t-1}$ retained by the forget gate and the current activation information $i_t \odot \tilde{C}_t$ admitted by the input gate is the new historical information $C_t$
As shown in Figure 7-41, the output gate determines how much of the new historical information $C_t$ of the current moment participates in the cell computation at the next moment, that is, it determines the state $h_t$ output to the next moment. It accepts the input data $x_t$ and the previous state $h_{t-1}$:

$$o_t = \sigma(x_t W_{xo} + h_{t-1} W_{ho} + b_o)$$

Figure 7-41 The output $o_t$ of the output gate determines how much of the cell state of the current moment is output and participates, as the hidden vector, in the computation at the next moment

As shown in Figure 7-42, multiplying the output value $o_t$ of the output gate element-wise with the information of the current state gives the output value $h_t$ of the cell:

$$H_t = O_t \odot \tanh(C_t)$$

Figure 7-42 The output gate determines how much of the new historical information $C_t$ of the current moment is output
As shown in Figure 7-43, the cell is composed of the current memory unit and the forget, input, and output gates. The current memory unit computes the activation value $\tilde{C}_t$ of the current moment, determined by the input data and the hidden state of the previous moment; the forget gate determines how much of $C_{t-1}$ is retained; the input gate determines how much of the current activation value $\tilde{C}_t$ is recorded into the total historical information $C_t$; and the output gate determines how much of the historical memory $C_t$ of the current moment participates in the computation at the next moment.

Figure 7-43 The cell is composed of the current memory unit and the forget gate, input gate and output gate

Finally, the cell also computes the current output value $Z_t$ from $H_t$:

$$Z_t = H_t W_y + b_y$$

Formulas (7-24) through (7-30) constitute the computation process of one LSTM cell.
Assuming the gradient $\frac{\partial L}{\partial Z_t}$ of the loss function with respect to the output $Z_t$ at the current moment is known, the gradients of the loss function with respect to $H_t$, $W_y$, $b_y$ can be obtained. The gradient of the loss function with respect to $H_t$ also includes the gradient coming from the subsequent moments:

$$\frac{\partial L}{\partial H_t} = \frac{\partial L_t}{\partial Z_t} W_y^{\top} + \frac{\partial L^{t-}}{\partial H_t}$$

where $L^{t-}$ denotes the part of the loss contributed by the moments after $t$.
Similarly, the gradient of the loss function with respect to $C_t$ also splits into two parts: one part comes from formula (7-29) through the output $H_t$, and the other is the gradient of $C_t$'s own contribution to the next moment:

$$\frac{\partial L}{\partial C_t} = O_t \odot \tanh'(C_t) \odot \frac{\partial L}{\partial H_t} + \frac{\partial L^{t-}}{\partial C_t}$$
From $\frac{\partial L}{\partial H_t}$ and formula (7-29), the gradient of the loss function with respect to $O_t$ is:

$$\frac{\partial L}{\partial O_t} = \frac{\partial L}{\partial H_t} \odot \tanh(C_t)$$

From $\frac{\partial L}{\partial C_t}$ and formula (7-27), the gradients of the loss function with respect to $I_t$, $F_t$, $\tilde{C}_t$ and $C_{t-1}$ are:

$$\frac{\partial L}{\partial I_t} = \frac{\partial L}{\partial C_t} \odot \tilde{C}_t \qquad \frac{\partial L}{\partial F_t} = \frac{\partial L}{\partial C_t} \odot C_{t-1}$$

$$\frac{\partial L}{\partial \tilde{C}_t} = \frac{\partial L}{\partial C_t} \odot I_t \qquad \frac{\partial L}{\partial C_{t-1}} = \frac{\partial L}{\partial C_t} \odot F_t$$
Let $Z_{I_t} = (X_t, H_{t-1}) W_i + b_i$, $Z_{F_t} = (X_t, H_{t-1}) W_f + b_f$, $Z_{O_t} = (X_t, H_{t-1}) W_o + b_o$; then:

$$\frac{\partial L}{\partial Z_{I_t}} = \sigma'(Z_{I_t}) \frac{\partial L}{\partial I_t} = I_t(1-I_t)\frac{\partial L}{\partial I_t}$$

$$\frac{\partial L}{\partial Z_{F_t}} = \sigma'(Z_{F_t}) \frac{\partial L}{\partial F_t} = F_t(1-F_t)\frac{\partial L}{\partial F_t}$$

$$\frac{\partial L}{\partial Z_{O_t}} = \sigma'(Z_{O_t}) \frac{\partial L}{\partial O_t} = O_t(1-O_t)\frac{\partial L}{\partial O_t}$$

With $\frac{\partial L}{\partial Z_{I_t}}$, $\frac{\partial L}{\partial Z_{F_t}}$, $\frac{\partial L}{\partial Z_{O_t}}$ known, the gradients of the loss function with respect to $W_i$, $W_f$, $W_o$, $X_t$, $H_{t-1}$ can be found in the same way. Please refer to Section 4.2.
The cell finally computes the current output value $y_t$ from $H_t$. The model parameters therefore include the weights and biases of the three gates and the current memory unit, plus the output parameters $W_y$ and $b_y$:

def lstm_params_init(input_dim, hidden_dim, output_dim, scale=0.01):
    normal = lambda m, n: np.random.randn(m, n)*scale
    two = lambda: (normal(input_dim+hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
    Wi, bi = two()                        # input gate
    Wf, bf = two()                        # forget gate
    Wo, bo = two()                        # output gate
    Wc, bc = two()                        # current (candidate) memory unit
    Wy = normal(hidden_dim, output_dim)   # output weights
    by = np.zeros((1, output_dim))        # output bias
    return [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by]

def lstm_state_init(batch_size, hidden_dim):
    return (np.zeros((batch_size, hidden_dim)), np.zeros((batch_size, hidden_dim)))

The forward computation iterates over all moments of the input sequence, storing the states of every moment (the initial states are kept under key -1):

def lstm_forward(params, Xs, HC):
    [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by] = params
    H, C = HC
    Hs, Cs, Zs = {}, {}, []
    Hs[-1] = np.copy(H)
    Cs[-1] = np.copy(C)
    Is, Fs, Os, C_tildas = [], [], [], []
    for t in range(len(Xs)):
        X = Xs[t]
        XH = np.column_stack((X, H))
        I = sigmoid(np.dot(XH, Wi)+bi)
        F = sigmoid(np.dot(XH, Wf)+bf)
        O = sigmoid(np.dot(XH, Wo)+bo)
        C_tilda = np.tanh(np.dot(XH, Wc)+bc)
        C = F * C + I * C_tilda
        H = O*np.tanh(C)        # output state
        Y = np.dot(H, Wy) + by  # output
        Zs.append(Y)
        Hs[t] = H
        Cs[t] = C
        Is.append(I)
        Fs.append(F)
        Os.append(O)
        C_tildas.append(C_tilda)
    return Zs, Hs, Cs, (Is, Fs, Os, C_tildas)
Similarly, the forward calculation at a certain moment can also be used as a separate function:
def lstm_forward_step(params,X,H,C):
[Wi, bi,Wf, bf, Wo,bo,Wc,bc,Wy,by] = params
XH = np.column_stack((X, H))
I = sigmoid(np.dot(XH, Wi)+bi)
F = sigmoid(np.dot(XH, Wf)+bf)
O = sigmoid(np.dot(XH, Wo)+bo)
C_tilda = np.tanh(np.dot(XH, Wc)+bc)
C = F * C + I * C_tilda
H = O*np.tanh(C) #O * tanh(C) #Output status
Y = np.dot(H, Wy) + by # output
return Y,H,C,(I,F,O,C_tilda)
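A quick shape check of one forward step, using the initialization helpers above:

params = lstm_params_init(4, 3, 4)   # input_dim, hidden_dim, output_dim
H, C = lstm_state_init(2, 3)         # batch_size, hidden_dim
X = np.random.randn(2, 4)
Y, H, C, _ = lstm_forward_step(params, X, H, C)
print(Y.shape, H.shape, C.shape)     # (2, 4) (2, 3) (2, 3)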
Reverse derivation:
import math
def dsigmoid(x):
return sigmoid(x) * (1 - sigmoid(x))
def dtanh(x):
return 1 - np.tanh(x) * np.tanh(x)
def grad_clipping(grads,alpha):
norm = math.sqrt(sum((grad ** 2).sum() for grad in grads))
if norm > alpha:
ratio = alpha / norm
for i in range(len(grads)):
grads[i]*=ratio
def lstm_backward(params, Xs, Hs, Cs, dZs, cache):
    [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by] = params
    grads = [np.zeros_like(p) for p in params]
    dWi, dbi, dWf, dbf, dWo, dbo, dWc, dbc, dWy, dby = grads
    Is, Fs, Os, C_tildas = cache
    dH_next = np.zeros_like(Hs[0])
    dC_next = np.zeros_like(Cs[0])
    input_dim = Xs[0].shape[1]
    T = len(Xs)
    for t in reversed(range(T)):
        I, F, O, C_tilda = Is[t], Fs[t], Os[t], C_tildas[t]
        H, X, C = Hs[t], Xs[t], Cs[t]
        H_pre = Hs[t-1]
        C_prev = Cs[t-1]
        XH_ = np.column_stack((X, H_pre))
        dZ = dZs[t]
        # Z = H W_y + b_y
        dWy += np.dot(H.T, dZ)
        dby += np.sum(dZ, axis=0, keepdims=True)
        dH = np.dot(dZ, Wy.T) + dH_next          # from the output and the next moment
        dC = O*(1-np.tanh(C)**2)*dH + dC_next    # H = O*tanh(C)
        dO = np.tanh(C)*dH
        dOZ = O*(1-O)*dO                         # O = sigma(Z_o)
        dWo += np.dot(XH_.T, dOZ)                # Z_o = (X,H_)W_o + b_o
        dbo += np.sum(dOZ, axis=0, keepdims=True)
        # di
        di = C_tilda*dC
        diZ = I*(1-I)*di
        dWi += np.dot(XH_.T, diZ)
        dbi += np.sum(diZ, axis=0, keepdims=True)
        # df
        df = C_prev*dC
        dfZ = F*(1-F)*df
        dWf += np.dot(XH_.T, dfZ)
        dbf += np.sum(dfZ, axis=0, keepdims=True)
        # dC_tilda
        dC_tilda = I*dC                               # C = F*C_prev + I*C_tilda
        dC_tilda_Z = (1-np.square(C_tilda))*dC_tilda  # C_tilda = tanh(C_tilda_Z)
        dWc += np.dot(XH_.T, dC_tilda_Z)              # C_tilda_Z = (X,H_)W_c + b_c
        dbc += np.sum(dC_tilda_Z, axis=0, keepdims=True)
        # gradients flowing to the previous moment
        dXH = (np.dot(dOZ, Wo.T) + np.dot(diZ, Wi.T)
               + np.dot(dfZ, Wf.T) + np.dot(dC_tilda_Z, Wc.T))
        dH_prev = dXH[:, input_dim:]
        dC_prev = F*dC
        dC_next = dC_prev
        dH_next = dH_prev
    return grads
Gradient Test
T = 3
input_dim, hidden_dim,output_dim = 4,3,4
batch_size = 2
Xs = np.random.randn(T,batch_size,input_dim)
Ys = np.random.randint(output_dim, size=(T,batch_size))
print("Xs",Xs)
print("Ys",Ys)
# check gradient
params = lstm_params_init(input_dim, hidden_dim,output_dim)
HC = lstm_state_init(batch_size,hidden_dim)
Zs,Hs,Cs,cache = lstm_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
grads = lstm_backward(params,Xs,Hs,Cs,dZs,cache)
def rnn_loss():
HC = lstm_state_init(batch_size,hidden_dim)
Zs,Hs,Cs,cache= lstm_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
return loss
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
#rnn_numerical_gradient(rnn_loss,params,1e-10)
#diff_error = lambda x, y: np.max(np.abs(x - y)/(np.maximum(1e-8, np.abs(x) + np.abs(y))))
diff_error = lambda x, y: np.max(np.abs(x - y))
print("loss",loss)
print("[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by] ")
for i in range(len(grads)):
print(diff_error(grads[i],numerical_grads[i]))
print("grads",grads[0])
print("numerical_grads",numerical_grads[0])
def lstm_train(params, data_it, loss_function, optimizer, iterations, print_n=100):
    # the function name is assumed; the original listing preserves only the loop body
    Wy = params[-2]
    hidden_size = Wy.shape[0]
    batch_size = None
    losses = []
    iter = 0
    HC = None
    for Xs, Ys, reset in data_it:
        if reset or batch_size != Xs[0].shape[0]:
            batch_size = Xs[0].shape[0]
            HC = lstm_state_init(batch_size, hidden_size)
        Zs, Hs, Cs, cache = lstm_forward(params, Xs, HC)
        loss, dZs = loss_function(Zs, Ys)
        grads = lstm_backward(params, Xs, Hs, Cs, dZs, cache)
        grad_clipping(grads, 5)   # clip the gradients to keep training stable
        optimizer.step(grads)
        losses.append(loss)
        if iter % print_n == 0:
            print('iter %d, loss: %f' % (iter, loss))
        iter += 1
        if iter > iterations:
            break
        HC = (Hs[len(Xs)-1], Cs[len(Xs)-1])   # carry the state across batches
    return losses, HC
Text generation
Use LSTM instead of ordinary RNN to train the character language model.
filename = 'input.txt'
data = open(filename, 'r').read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('Total number of characters %d, length of character table %d unique.'
      % (data_size, vocab_size))
epoches = 3
learning_rate = 1e-2
iterations =10000
losses = []
#optimizer = AdaGrad(params,learning_rate)
momentum = 0.9
optimizer = SGD(params,learning_rate,momentum)
predict
Similar to the ordinary RNN, a prediction function can be defined:
def predict_lstm(params, prefix, n):
    Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by = params
    vocab_size, hidden_dim = Wi.shape[0]-Wy.shape[0], Wy.shape[0]
    h, c = lstm_state_init(1, hidden_dim)
    output = [char_to_idx[prefix[0]]]
    for t in range(n + len(prefix) - 1):
        x = one_hot_idx(output[-1], vocab_size)
        z, h, c, _ = lstm_forward_step(params, x, h, c)
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            p = np.exp(z) / np.sum(np.exp(z))
            # idx = int(p.argmax(axis=1))
            idx = np.random.choice(range(vocab_size), p=p.ravel())
            output.append(idx)
    return ''.join([idx_to_char[i] for i in output])
str = predict_lstm(params,"he",200)
print(str)
he done!
GLOUCESTER:
Why was I being your houghcessing in lord?
CARILLO:
How, or your his dessent;
Come his false, what comon:
HASTINGS:
Put she with your howiring act a both,
But long and you have
In this figure, "peepholes" are added to all gates; that is, $f_t$, $i_t$, $o_t$ can also see the corresponding cell states $C_{t-1}$ and $C_t$. Some papers, however, add peepholes to only some of the gates.
Considering that the LSTM cell is rather complex, in 2014 Kyunghyun Cho et al. proposed a simplified LSTM variant, the Gated Recurrent Unit (GRU). GRU merges the forget gate and the input gate into a single "update gate", merges the cell state and the hidden state, and introduces some other changes. The resulting model is simpler than the standard LSTM, often performs as well as or better than the classic LSTM, and has therefore become more and more popular.
There are also some other models, such as Depth Gated RNNs proposed by Yao, et al. (2015). At the same
time, there are many completely different ways to solve the long-term dependency problem, such as
Clockwork RNNs proposed by Koutnik, et al. (2014).
Which of the different models is best? Does the difference really matter? Greff, et al. (2015) did a
comparison of popular variants and found that they are basically the same. Jozefowicz et al. (2015) tested
more than 10,000 RNN structures and found that some of them perform better than LSTMs on specific
tasks.
Ordinary RNN neurons use the historical information $H_{t-1}$ and the current input data $X_t$ to compute the information involved in the calculation of the next moment; like a simple RNN, GRU uses only a hidden state $H_t$ to represent all historical information. Like LSTM, GRU also has a forget gate, here called the reset gate, which represents the effect of the memorized information on the computation of the current moment, and an update gate that combines the current activation value $\tilde{H}_t$ and the historical information $H_{t-1}$ to update the historical information $H_t$ at the current moment. As shown in Figure 7-43, there are two gates in the GRU:

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r) \qquad U_t = \sigma(X_t W_{xu} + H_{t-1} W_{hu} + b_u)$$

The reset gate indicates how much historical memory is forgotten, or conversely preserved, in this computation; its output value $R_t$ is multiplied element-wise with the historical memory:

$$\tilde{H}_t = \tanh(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h)$$

This $\tilde{H}_t$ is the activation value at the current moment, also called the current candidate memory.

Figure 7-44 The current working unit of GRU outputs the activation value at the current moment

The weighted average of the current candidate memory $\tilde{H}_t$ and the historical memory $H_{t-1}$, weighted by the update gate, is used as the hidden state at the current moment:

$$H_t = U_t \odot H_{t-1} + (1 - U_t) \odot \tilde{H}_t$$

Figure 7-45 The output value $U_t$ of the update gate performs the weighted average of the historical memory and the current candidate memory
The reverse derivation of GRU is similar to LSTM. After the gradient dZ of the loss function with respect
to the GRU output is known, the reverse derivation calculates the gradient of the loss function with respect
to the model parameters and intermediate transformations. Readers can imitate the reverse derivation of
LSTM to derive the reverse derivation formula of GRU.
Like LSTM, GRU can maintain long-term memory and prevent gradient explosion and disappearance. Its
performance is also comparable to LSTM, and even better than LSTM on some problems. Its
implementation is simpler and more computationally efficient than LSTM. Therefore, in actual use, GRU is
usually used instead of traditional LSTM.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gru_init_params(input_dim, hidden_dim, output_dim, scale=0.01):
    normal = lambda m, n: np.random.randn(m, n)*scale
    three = lambda: (normal(input_dim, hidden_dim),
                     normal(hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
    Wxu, Whu, bu = three()   # update gate parameters
    Wxr, Whr, br = three()   # reset gate parameters
    Wxh, Whh, bh = three()   # candidate memory parameters
    Wy = normal(hidden_dim, output_dim)
    by = np.zeros((1, output_dim))
    params = [Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by]
    return params
def gru_forward(params, Xs, H):
    Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = params
    Hs = {}
    Ys = []
    Hs[-1] = np.copy(H)
    Rs, Us, H_tildas = [], [], []
    for t in range(len(Xs)):
        X = Xs[t]
        U = sigmoid(np.dot(X, Wxu) + np.dot(H, Whu) + bu)
        R = sigmoid(np.dot(X, Wxr) + np.dot(H, Whr) + br)
        H_tilda = np.tanh(np.dot(X, Wxh) + np.dot(R * H, Whh) + bh)
        H = U * H + (1 - U) * H_tilda
        Y = np.dot(H, Wy) + by
        Hs[t] = H
        Ys.append(Y)
        Rs.append(R)
        Us.append(U)
        H_tildas.append(H_tilda)
    return Ys, Hs, (Rs, Us, H_tildas)
def gru_backward(params, Xs, Hs, dZs, cache):
    Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = params
    (dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby) = \
        [np.zeros_like(p) for p in params]
    Rs, Us, H_tildas = cache
    dH_next = np.zeros_like(Hs[0])
    T = len(Xs)
    for t in reversed(range(T)):
        R, U, H, X = Rs[t], Us[t], Hs[t], Xs[t]
        H_tilda = H_tildas[t]
        H_pre = Hs[t-1]
        dZ = dZs[t]
        # gradient of the output parameters: Y = H Wy + by
        dWy += np.dot(H.T, dZ)
        dby += np.sum(dZ, axis=0, keepdims=True)
        dH = np.dot(dZ, Wy.T) + dH_next
        # H = U*H_pre + (1-U)*H_tilda
        dH_tilda = dH*(1-U)
        dH_pre = dH*U
        dU = H_pre*dH - H_tilda*dH
        # H_tilda = tanh(X Wxh + (R*H_pre) Whh + bh)
        dH_tildaZ = (1-np.square(H_tilda))*dH_tilda
        dWxh += np.dot(X.T, dH_tildaZ)
        dWhh += np.dot((R*H_pre).T, dH_tildaZ)
        dbh += np.sum(dH_tildaZ, axis=0, keepdims=True)
        dR = np.dot(dH_tildaZ, Whh.T)*H_pre
        dH_pre += np.dot(dH_tildaZ, Whh.T)*R
        # U = \sigma(UZ)   R = \sigma(RZ)
        dUZ = U*(1-U)*dU
        dRZ = R*(1-R)*dR
        dWxu += np.dot(X.T, dUZ)
        dWhu += np.dot(H_pre.T, dUZ)
        dbu += np.sum(dUZ, axis=0, keepdims=True)
        dWxr += np.dot(X.T, dRZ)
        dWhr += np.dot(H_pre.T, dRZ)
        dbr += np.sum(dRZ, axis=0, keepdims=True)
        dH_pre += np.dot(dUZ, Whu.T) + np.dot(dRZ, Whr.T)
        dH_next = dH_pre
    return [dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby]
Check that the analytical and numerical gradients are consistent with the following code:
T = 3
input_dim, hidden_dim,output_dim = 4,3,4
batch_size = 1
Xs = np.random.randn(T,batch_size,input_dim)
Ys = np.random.randint(output_dim, size=(T,batch_size))
print("Xs",Xs)
print("Ys",Ys)
# check gradient
params = gru_init_params(input_dim, hidden_dim,output_dim)
HC = gru_state_init(batch_size,hidden_dim)
Zs,Hs,cache = gru_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
grads = gru_backward(params,Xs,Hs,dZs,cache)
def rnn_loss():
HC = gru_state_init(batch_size,hidden_dim)
Zs,Hs,cache= gru_forward(params,Xs,HC)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
return loss
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
#rnn_numerical_gradient(rnn_loss,params,1e-10)
#diff_error = lambda x, y: np.max(np.abs(x - y)/(np.maximum(1e-8, np.abs(x) + np.abs(y))))
diff_error = lambda x, y: np.max(np.abs(x - y))
print("loss",loss)
print("[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by] ")
for i in range(len(grads)):
print(diff_error(grads[i],numerical_grads[i]))
print("grads",grads[0])
print("numerical_grads",numerical_grads[0])
The LSTM forward and reverse computations above can also be encapsulated, together with the model parameters and states, in a class:
class LSTM(object):
    def __init__(self, input_dim, hidden_dim, output_dim, scale=0.01):
        self.input_dim, self.hidden_dim, self.output_dim = input_dim, hidden_dim, output_dim
        normal = lambda m, n: np.random.randn(m, n)*scale
        two = lambda: (normal(input_dim+hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
        Wi, bi = two()   # input gate
        Wf, bf = two()   # forget gate
        Wo, bo = two()   # output gate
        Wc, bc = two()   # current memory unit
        Wy = normal(hidden_dim, output_dim)
        by = np.zeros((1, output_dim))
        self.params = [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by]
        self.grads = [np.zeros_like(param) for param in self.params]
        self.H, self.C = None, None
    def reset_state(self, batch_size):
        self.H, self.C = (np.zeros((batch_size, self.hidden_dim)),
                          np.zeros((batch_size, self.hidden_dim)))
def forward(self,Xs):
[Wi, bi,Wf, bf, Wo,bo,Wc,bc,Wy,by] = self.params
if self.H is None or self.C is None:
self.reset_state(Xs[0].shape[0])
H, C = self.H,self.C
Hs = {}
Cs = {}
Zs = []
Hs[-1] = np.copy(H)
Cs[-1] = np.copy(C)
Is = []
Fs = []
Os = []
C_tildas = []
for t in range(len(Xs)):
X = Xs[t]
XH = np.column_stack((X, H))
I = sigmoid(np.dot(XH, Wi)+bi)
F = sigmoid(np.dot(XH, Wf)+bf)
O = sigmoid(np.dot(XH, Wo)+bo)
C_tilda = np.tanh(np.dot(XH, Wc)+bc)
C = F * C + I * C_tilda
            H = O*np.tanh(C)          # output state
            Y = np.dot(H, Wy) + by    # output
            Zs.append(Y)
Hs[t] = H
Cs[t] = C
Is.append(I)
Fs.append(F)
Os.append(O)
C_tildas.append(C_tilda)
        self.Zs, self.Hs, self.Cs = Zs, Hs, Cs
        self.Is, self.Fs, self.Os, self.C_tildas = Is, Fs, Os, C_tildas
        self.Xs = Xs
return Zs,Hs
    def backward(self, dZs):
        [Wi, bi, Wf, bf, Wo, bo, Wc, bc, Wy, by] = self.params
        Xs, Hs, Cs = self.Xs, self.Hs, self.Cs
        Is, Fs, Os, C_tildas = self.Is, self.Fs, self.Os, self.C_tildas
        dWi, dbi, dWf, dbf, dWo, dbo, dWc, dbc, dWy, dby = \
            [np.zeros_like(p) for p in self.params]
        dH_next = np.zeros_like(Hs[0])
        dC_next = np.zeros_like(Cs[0])
        input_dim = Xs[0].shape[1]
        T = len(Xs)
        for t in reversed(range(T)):
            I, F, O, C_tilda = Is[t], Fs[t], Os[t], C_tildas[t]
            H, X, C = Hs[t], Xs[t], Cs[t]
            H_pre, C_prev = Hs[t-1], Cs[t-1]
            XH_ = np.column_stack((X, H_pre))
            dZ = dZs[t]
            dWy += np.dot(H.T, dZ)
            dby += np.sum(dZ, axis=0, keepdims=True)
            dH = np.dot(dZ, Wy.T) + dH_next
            dC = O*(1-np.tanh(C)**2)*dH + dC_next    # H = O*tanh(C)
            dO = np.tanh(C)*dH
            dOZ = O*(1-O)*dO
            dWo += np.dot(XH_.T, dOZ)
            dbo += np.sum(dOZ, axis=0, keepdims=True)
            #di
            di = C_tilda*dC
            diZ = I*(1-I)*di
            dWi += np.dot(XH_.T, diZ)
            dbi += np.sum(diZ, axis=0, keepdims=True)
            #df
            df = C_prev*dC
            dfZ = F*(1-F)*df
            dWf += np.dot(XH_.T, dfZ)
            dbf += np.sum(dfZ, axis=0, keepdims=True)
            # dC_tilda
            dC_tilda = I*dC                               # C = F*C_prev + I*C_tilda
            dC_tilda_Z = (1-np.square(C_tilda))*dC_tilda  # C_tilda = tanh(...)
            dWc += np.dot(XH_.T, dC_tilda_Z)
            dbc += np.sum(dC_tilda_Z, axis=0, keepdims=True)
            dXH = (np.dot(dOZ, Wo.T) + np.dot(diZ, Wi.T)
                   + np.dot(dfZ, Wf.T) + np.dot(dC_tilda_Z, Wc.T))
            dH_prev = dXH[:, input_dim:]
            dC_prev = F*dC
            dC_next = dC_prev
            dH_next = dH_prev
        grads = [dWi, dbi, dWf, dbf, dWo, dbo, dWc, dbc, dWy, dby]
        for i, _ in enumerate(self.grads):
            self.grads[i] += grads[i]
        return self.grads
def parameters(self):
return self.params
lstm = LSTM(input_dim, hidden_dim, output_dim)
lstm.reset_state(batch_size)
Zs, Hs = lstm.forward(Xs)
loss_function = rnn_loss_grad
loss, dZs = loss_function(Zs, Ys)
grads = lstm.backward(dZs)
def rnn_loss():
lstm.reset_state(batch_size)
Zs,Hs = lstm.forward(Xs)
loss_function = rnn_loss_grad
loss,dZs = loss_function(Zs,Ys)
return loss
params = lstm.parameters()
numerical_grads = util.numerical_gradient(rnn_loss,params,1e-6)
diff_error = lambda x, y: np.max( np.abs(x - y))
print("loss",loss)
print("[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by] ")
for i in range(len(grads)):
print(diff_error(grads[i],numerical_grads[i]))
print("grads",grads[0])
print("numerical_grads",numerical_grads[0])
loss 4.15897570534243
[Wi, bi,Wf, bf, Wo,bo,Wc, bc,Wy,by]
4.0983714987404213e-10
4.804842887035274e-10
5.574688488332363e-10
5.962706955096197e-10
4.786088983281455e-10
3.3010982580892407e-10
5.250774498359589e-10
7.762481196021964e-10
5.116074152863859e-10
4.973363854077206e-08
grads [[-1.40953185e-06 1.39633673e-05 3.77862529e-05]
[-2.05605688e-06 -6.94901972e-06 -9.72150550e-06]
[-1.97703294e-06 2.14765528e-05 -6.23417436e-07]
[ 2.38579566e-06 3.03502478e-05 5.32372144e-06]
[-2.43351424e-10 -1.73915908e-09 -1.49094729e-08]
[ 1.89104848e-08 1.69377027e-07 1.08468341e-07]
[-6.11087686e-09 -6.70921838e-08 -7.03528265e-09]]
numerical_grads [[-1.40953915e-06 1.39630529e-05 3.77866627e-05]
[-2.05613304e-06 -6.94910796e-06 -9.72155689e-06]
[-1.97708516e-06 2.14761542e-05 -6.23501251e-07]
[ 2.38564724e-06 3.03503889e-05 5.32374145e-06]
[-4.44089210e-10 -1.77635684e-09 -1.46549439e-08]
[ 1.86517468e-08 1.69197989e-07 1.08357767e-07]
[-5.77315973e-09 -6.70574707e-08 -7.10542736e-09]]
The following GRU class implements a recurrent neural network with the GRU structure:
class GRU(object):
    def __init__(self, input_dim, hidden_dim, output_dim, scale=0.01):
        super(GRU, self).__init__()
        self.input_dim, self.hidden_dim, self.output_dim, self.scale = \
            input_dim, hidden_dim, output_dim, scale
        normal = lambda m, n: np.random.randn(m, n)*scale
        three = lambda: (normal(input_dim, hidden_dim),
                         normal(hidden_dim, hidden_dim), np.zeros((1, hidden_dim)))
        Wxu, Whu, bu = three()   # update gate
        Wxr, Whr, br = three()   # reset gate
        Wxh, Whh, bh = three()   # candidate memory
        Wy = normal(hidden_dim, output_dim)
        by = np.zeros((1, output_dim))
        self.params = [Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by]
        self.grads = [np.zeros_like(param) for param in self.params]
        self.H = None
    def reset_state(self, batch_size):
        self.H = np.zeros((batch_size, self.hidden_dim))
    def forward_step(self, X):
        Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = self.params
        H = self.H   # previous state
        U = sigmoid(np.dot(X, Wxu) + np.dot(H, Whu) + bu)
        R = sigmoid(np.dot(X, Wxr) + np.dot(H, Whr) + br)
        H_tilda = np.tanh(np.dot(X, Wxh) + np.dot(R * H, Whh) + bh)
        H = U * H + (1 - U) * H_tilda
        Y = np.dot(H, Wy) + by
        self.H = H
        return Y, H, (R, U, H_tilda)
def forward(self,Xs):
Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy,by = self.params
if self.H is None:
self.reset_state(Xs[0].shape[0])
H = self.H
Hs = {}
Ys = []
Hs[-1] = np.copy(H)
Rs = []
Us = []
H_tildas = []
for t in range(len(Xs)):
X = Xs[t]
U = sigmoid(np.dot(X, Wxu) + np.dot(H, Whu) + bu)
R = sigmoid(np.dot(X, Wxr) + np.dot(H, Whr) + br)
H_tilda = np.tanh(np.dot(X, Wxh) + np.dot(R * H, Whh) + bh)
H = U * H + (1 - U) * H_tilda
Y = np.dot(H, Wy) + by
Hs[t] = H
Ys.append(Y)
Rs.append(R)
Us.append(U)
H_tildas.append(H_tilda)
        self.Ys, self.Hs, self.Rs, self.Us, self.H_tildas = Ys, Hs, Rs, Us, H_tildas
        self.Xs = Xs
        self.H = H
        return Ys, Hs   # return Ys,Hs,(Rs,Us,H_tildas)
    def backward(self, dZs):
        Wxu, Whu, bu, Wxr, Whr, br, Wxh, Whh, bh, Wy, by = self.params
        Xs, Hs = self.Xs, self.Hs
        Rs, Us, H_tildas = self.Rs, self.Us, self.H_tildas
        (dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby) = \
            [np.zeros_like(p) for p in self.params]
        dH_next = np.zeros_like(Hs[0])
        T = len(Xs)
        for t in reversed(range(T)):
            R, U, H, X = Rs[t], Us[t], Hs[t], Xs[t]
            H_tilda = H_tildas[t]
            H_pre = Hs[t-1]
            dZ = dZs[t]
            # gradient of the output parameters: Y = H Wy + by
            dWy += np.dot(H.T, dZ)
            dby += np.sum(dZ, axis=0, keepdims=True)
            dH = np.dot(dZ, Wy.T) + dH_next
            # H = U*H_pre + (1-U)*H_tilda
            dH_tilda = dH*(1-U)
            dH_pre = dH*U
            dU = H_pre*dH - H_tilda*dH
            # H_tilda = tanh(X Wxh + (R*H_pre) Whh + bh)
            dH_tildaZ = (1-np.square(H_tilda))*dH_tilda
            dWxh += np.dot(X.T, dH_tildaZ)
            dWhh += np.dot((R*H_pre).T, dH_tildaZ)
            dbh += np.sum(dH_tildaZ, axis=0, keepdims=True)
            dR = np.dot(dH_tildaZ, Whh.T)*H_pre
            dH_pre += np.dot(dH_tildaZ, Whh.T)*R
            # U = \sigma(UZ)   R = \sigma(RZ)
            dUZ = U*(1-U)*dU
            dRZ = R*(1-R)*dR
            dWxu += np.dot(X.T, dUZ)
            dWhu += np.dot(H_pre.T, dUZ)
            dbu += np.sum(dUZ, axis=0, keepdims=True)
            dWxr += np.dot(X.T, dRZ)
            dWhr += np.dot(H_pre.T, dRZ)
            dbr += np.sum(dRZ, axis=0, keepdims=True)
            dH_pre += np.dot(dUZ, Whu.T) + np.dot(dRZ, Whr.T)
            dH_next = dH_pre
        grads = [dWxu, dWhu, dbu, dWxr, dWhr, dbr, dWxh, dWhh, dbh, dWy, dby]
        for i, _ in enumerate(self.grads):
            self.grads[i] += grads[i]
        return self.grads
def get_states(self):
return self.Hs
def get_outputs(self):
return self.Ys
def parameters(self):
return self.params
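A minimal smoke test of the GRU class (shapes only):

np.random.seed(1)
gru = GRU(4, 3, 4)               # input_dim, hidden_dim, output_dim
gru.reset_state(2)               # batch_size = 2
Xs = np.random.randn(5, 2, 4)    # (seq_len, batch_size, input_dim)
Ys_out, Hs = gru.forward(Xs)
print(Ys_out[0].shape, Hs[4].shape)   # (2, 4) (2, 3)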
Each type of recurrent unit computes, from the input $x$ of the current time step and the state $h$ of the previous time step, the new state $h'$ passed to the next time step (for LSTM, also the current memory storage $c'$). For example, for a simple RNN, the forward calculation formula is:

$$h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})$$

Here the original bias $b_h$ is split into two terms, $b_{ih}$ and $b_{hh}$, denoting the bias of the weighted sum of the data input and the bias of the weighted sum of the hidden state, respectively. Similarly, for LSTM, the single bias of each original weighted sum can be split into two biases; the LSTM cell computes:

$$c' = f * c + i * g \qquad h' = o * \tanh(c')$$

Similarly, the GRU neural network unit computes:

$$h' = (1 - z) * n + z * h$$

A common base class can be used to represent the common properties of these 3 different neural network units:
import numpy as np
import math
class RNNCellBase(object):
    __constants__ = ['input_size', 'hidden_size']
    def __init__(self, input_size, hidden_size, bias, num_chunks):
        super(RNNCellBase, self).__init__()
        self.input_size, self.hidden_size = input_size, hidden_size
        self.bias = bias
        self.W_ih = np.empty((input_size, num_chunks*hidden_size))   # input to hidden
        self.W_hh = np.empty((hidden_size, num_chunks*hidden_size))  # hidden to hidden
        if bias:
            self.b_ih = np.zeros((1, num_chunks*hidden_size))
            self.b_hh = np.zeros((1, num_chunks*hidden_size))
            self.params = [self.W_ih, self.W_hh, self.b_ih, self.b_hh]
        else:
            self.b_ih = None
            self.b_hh = None
            self.params = [self.W_ih, self.W_hh]
        self.grads = [np.zeros_like(p) for p in self.params]   # gradient buffers
        self.reset_parameters()
    def reset_parameters(self):
        # a simple default initialization in [-1/sqrt(h), 1/sqrt(h)]
        # (the original body of this method is not shown)
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for p in self.params:
            p[:] = np.random.uniform(-stdv, stdv, p.shape)
    def check_forward_hidden(self, input, h, hidden_label=''):
        if h.shape[1] != self.hidden_size:
            raise RuntimeError(
                "hidden{} has inconsistent hidden_size: got {}, expected {}".format(
                    hidden_label, h.shape[1], self.hidden_size))
The constructor parameters input_size and hidden_size give the sizes of the input data and the state, and num_chunks gives the number of gate computations packed side by side into each unit's weight matrices: 1 for the basic RNN, 4 for LSTM, and 3 for GRU. check_forward_input and check_forward_hidden are auxiliary methods that check whether the sizes of the input data and the hidden state match the unit's model parameters. A concrete type of neural network unit can then be defined on the basis of the base class RNNCellBase. The following code defines the class RNNCell representing a simple RNN cell:
def relu(x):
    return x * (x > 0)

class RNNCell(RNNCellBase):
    """ h' = \tanh(W_{ih} x + b_{ih} + W_{hh} h + b_{hh})"""
    __constants__ = ['input_size', 'hidden_size', 'nonlinearity']
    def __init__(self, input_size, hidden_size, bias=True, nonlinearity="tanh"):
        super(RNNCell, self).__init__(input_size, hidden_size, bias, num_chunks=1)
        self.nonlinearity = nonlinearity
    def __call__(self, x, h):
        # one forward time step
        Zh = np.dot(x, self.W_ih) + np.dot(h, self.W_hh)
        if self.bias:
            Zh += self.b_ih + self.b_hh
        return np.tanh(Zh) if self.nonlinearity == "tanh" else relu(Zh)
    def backward(self, dh, H, X, H_pre):
        if self.nonlinearity == "tanh":
            dZh = (1 - H * H) * dh   # backprop through the tanh nonlinearity
        else:
            dZh = (H > 0) * dh       # backprop through the relu nonlinearity
        db_hh = np.sum(dZh, axis=0, keepdims=True)
        db_ih = np.sum(dZh, axis=0, keepdims=True)
        dW_ih = np.dot(X.T, dZh)
        dW_hh = np.dot(H_pre.T, dZh)
        dh_pre = np.dot(dZh, self.W_hh.T)
        dx = np.dot(dZh, self.W_ih.T)
        grads = (dW_ih, dW_hh, db_ih, db_hh)
        for a, b in zip(self.grads, grads):
            a += b
        return dx, dh_pre, grads
The following code demonstrates the forward and reverse calculation of a time step of RNNCell, where x is
the input data with a batch size of 3, and h is the state corresponding to a batch size of 3:
import numpy as np
np.random.seed(1)
x = np.random.randn(3, 10) #(batch_size,input_dim)
h = np.random.randn(3, 20) #(batch_size,hidden_dim)
rnn = RNNCell(10, 20) #(input_dim,hidden_dim)
h_ = rnn(x, h)
print("h_:",h_)
dh_ = np.random.randn(*h.shape)
dx,dh,_ = rnn.backward(dh_,h_,x,h)
print("dh:",dh)
# forward and backward through a sequence of 6 time steps
x = np.random.randn(6, 3, 10)   # (seq_len, batch_size, input_dim); re-created as a sequence
h_0 = h.copy()
hs = []
for i in range(6):
    h = rnn(x[i], h)
    hs.append(h)
print("h:", hs[0])
dh = np.random.randn(*h.shape)
for i in reversed(range(6)):
    if i == 0:
        dx, dh, _ = rnn.backward(dh, hs[i], x[i], h_0)
    else:
        dx, dh, _ = rnn.backward(dh, hs[i], x[i], hs[i-1])
print("dh:", dh)
Similarly, LSTMCell and GRUCell of LSTM and GRU types can be defined. The code of LSTMCell is as
follows:
def sigmoid(x):
return (1 / (1 + np.exp(-x)))
def lstm_cell(x, hc,w_ih, w_hh,b_ih, b_hh):
h,c = hc[0],hc[1]
hidden_size = w_ih.shape[1]//4
ifgo_Z = np.dot(x,w_ih) + b_ih + np.dot(h,w_hh) + b_hh
i = sigmoid(ifgo_Z[:,:hidden_size])
f = sigmoid(ifgo_Z[:,hidden_size:2*hidden_size])
g = np.tanh(ifgo_Z[:,2*hidden_size:3*hidden_size])
o = sigmoid(ifgo_Z[:,3*hidden_size:])
c_ = f*c+i*g
h_ = o*np.tanh(c_)
return (h_,c_),np.column_stack((i,f,g,o))
def lstm_cell_backward(dh_, dc_next, c, h_pre, x, ifgo, w_ih, w_hh):
    # reconstructed opening: dh_ and dc_next are the gradients w.r.t. the outputs
    # h' and c'; c and h_pre are the previous cell and hidden states; ifgo stacks
    # the gate activations returned by lstm_cell (the signature is an assumption)
    hidden_size = w_ih.shape[1]//4
    i, f = ifgo[:, :hidden_size], ifgo[:, hidden_size:2*hidden_size]
    g, o = ifgo[:, 2*hidden_size:3*hidden_size], ifgo[:, 3*hidden_size:]
    c_ = f*c + i*g                               # recompute the new cell state
    do = np.tanh(c_)*dh_                         # h' = o*tanh(c')
    dc_ = o*(1-np.tanh(c_)**2)*dh_ + dc_next     # total gradient w.r.t. c'
    di, df, dg = g*dc_, c*dc_, i*dc_             # c' = f*c + i*g
    diz = i*(1-i)*di
    dfz = f*(1-f)*df
    dgz = (1-np.square(g))*dg
    doz = o*(1-o)*do
    dZ = np.column_stack((diz, dfz, dgz, doz))
    dW_ih = np.dot(x.T, dZ)
    dW_hh = np.dot(h_pre.T, dZ)
    db_hh = np.sum(dZ, axis=0, keepdims=True)
    db_ih = np.sum(dZ, axis=0, keepdims=True)
    dx = np.dot(dZ, w_ih.T)
    dh_pre = np.dot(dZ, w_hh.T)
    dc = dc_*f                                   # gradient w.r.t. the previous cell state
    return dx, (dh_pre, dc), (dW_ih, dW_hh, db_ih, db_hh)
class LSTMCell(RNNCellBase):
    """ \begin{array}{ll}
    i = \sigma(W_{ii} x + b_{ii} + W_{hi} h + b_{hi}) \\
    f = \sigma(W_{if} x + b_{if} + W_{hf} h + b_{hf}) \\
    g = \tanh(W_{ig} x + b_{ig} + W_{hg} h + b_{hg}) \\
    o = \sigma(W_{io} x + b_{io} + W_{ho} h + b_{ho}) \\
    c' = f * c + i * g \\
    h' = o * \tanh(c') \\
    \end{array}
    """
    def __init__(self, input_size, hidden_size, bias=True):
        super(LSTMCell, self).__init__(input_size, hidden_size, bias, num_chunks=4)
    def init_hidden(self, batch_size):
        zeros = np.zeros((batch_size, self.hidden_size))
        return (zeros, zeros)
    def __call__(self, x, hc):
        return lstm_cell(x, hc, self.W_ih, self.W_hh, self.b_ih, self.b_hh)
    def backward(self, dhc, ifgo, x, hc_pre):
        dh_, dc_next = dhc
        h_pre, c = hc_pre[0], hc_pre[1]
        dx, dhc_pre, grads = lstm_cell_backward(dh_, dc_next, c, h_pre, x,
                                                ifgo, self.W_ih, self.W_hh)
        for a, b in zip(self.grads, grads):
            a += b
        return dx, dhc_pre, grads
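The GRU cell follows the same pattern. The forward-step helper itself is missing from the text; a sketch consistent with the backward fragment below, packing the (r, u, n) blocks side by side as in lstm_cell:

def gru_cell(x, h, w_ih, w_hh, b_ih, b_hh):
    # one GRU forward step; u plays the role of the update gate z
    hidden_size = w_ih.shape[1]//3
    Z_ih = np.dot(x, w_ih) + b_ih
    Z_hh = np.dot(h, w_hh) + b_hh
    r = sigmoid(Z_ih[:, :hidden_size] + Z_hh[:, :hidden_size])
    u = sigmoid(Z_ih[:, hidden_size:2*hidden_size] + Z_hh[:, hidden_size:2*hidden_size])
    n = np.tanh(Z_ih[:, 2*hidden_size:] + r*Z_hh[:, 2*hidden_size:])
    h_ = u*h + (1-u)*n
    return h_, np.column_stack((r, u, n))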
def gru_cell_backward(dh, x, h_pre, run, w_ih, w_hh, b_hh):
    # reconstructed opening: dh is the gradient w.r.t. h'; run stacks the
    # activations (r, u, n) returned by gru_cell (the signature is an assumption)
    hidden_size = w_ih.shape[1]//3
    r, u = run[:, :hidden_size], run[:, hidden_size:2*hidden_size]
    n = run[:, 2*hidden_size:]
    # h' = u*h_pre + (1-u)*n
    dn = dh*(1-u)
    dh_pre = dh*u
    du = h_pre*dh - n*dh
    dnz = (1-np.square(n))*dn   # n = tanh(Z_in + r*Z_hn)
    Z_hn = np.dot(h_pre, w_hh[:, 2*hidden_size:]) + b_hh[:, 2*hidden_size:]
    dr = dnz*Z_hn
    dZ_ih_n = dnz
    dZ_hh_n = dnz*r
    duz = u*(1-u)*du
    dZ_ih_u = duz
    dZ_hh_u = duz
    drz = r*(1-r)*dr
    dZ_ih_r = drz
    dZ_hh_r = drz
    dZ_ih = np.column_stack((dZ_ih_r, dZ_ih_u, dZ_ih_n))
    dZ_hh = np.column_stack((dZ_hh_r, dZ_hh_u, dZ_hh_n))
    dW_ih = np.dot(x.T, dZ_ih)
    dW_hh = np.dot(h_pre.T, dZ_hh)
    db_ih = np.sum(dZ_ih, axis=0, keepdims=True)
    db_hh = np.sum(dZ_hh, axis=0, keepdims=True)
    dh_pre += np.dot(dZ_hh, w_hh.T)
    dx = np.dot(dZ_ih, w_ih.T)
    return dx, dh_pre, (dW_ih, dW_hh, db_ih, db_hh)
class GRUCell(RNNCellBase):
    """ \begin{array}{ll}
    r = \sigma(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \\
    z = \sigma(W_{iz} x + b_{iz} + W_{hz} h + b_{hz}) \\
    n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \\
    h' = (1 - z) * n + z * h
    \end{array}
    """
    def __init__(self, input_size, hidden_size, bias=True):
        super(GRUCell, self).__init__(input_size, hidden_size, bias, num_chunks=3)
    def __call__(self, x, h):
        return gru_cell(x, h, self.W_ih, self.W_hh, self.b_ih, self.b_hh)
    def backward(self, dh, run, x, h_pre):
        dx, dh_pre, grads = gru_cell_backward(dh, x, h_pre, run,
                                              self.W_ih, self.W_hh, self.b_hh)
        for a, b in zip(self.grads, grads):
            a += b
        return dx, dh_pre, grads
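The effect of num_chunks on the packed weight shapes can be checked directly:

print(RNNCell(10, 20).W_ih.shape)    # (10, 20)  num_chunks=1
print(LSTMCell(10, 20).W_ih.shape)   # (10, 80)  num_chunks=4
print(GRUCell(10, 20).W_ih.shape)    # (10, 60)  num_chunks=3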
Different recurrent neural network models can be implemented with neural network units.
7.10 Multilayer, Bidirectional Recurrent Neural Network
..., the last recurrent layer can be used as the output layer of the whole network, or can be followed by one or more non-recurrent layers.

Figure 7-46 Multi-layer recurrent neural network: the first hidden layer accepts the data input and generates a hidden state $H^{(1)}$, which serves as the input of the second hidden layer; the last recurrent layer can be used as the output layer of the entire network or can be followed by one or more non-recurrent layers
At time $t$, the neurons of layer 1 receive the data input $X_t$ and the state input $H^{(1)}_{t-1}$, and compute the hidden state of the first layer:

$$H^{(1)}_t = f_1(X_t, H^{(1)}_{t-1})$$

The state $H^{(1)}_t$ of the RNN units (neurons) of the first layer serves as the data input of the RNN units of the second layer, which together with the second layer's own previous state $H^{(2)}_{t-1}$ computes the state $H^{(2)}_t$. That state in turn is the data input used to compute the hidden state $H^{(3)}_t$ of the third layer. In general, the $l$-th hidden layer accepts its own previous hidden state $H^{(l)}_{t-1}$ and the output of the layer below at time $t$ (usually the hidden state $H^{(l-1)}_t$), and outputs the hidden state $H^{(l)}_t$ at time $t$; its computation can be expressed by the formula:

$$H^{(l)}_t = f_l(H^{(l-1)}_t, H^{(l)}_{t-1})$$
Except for the first layer, whose data input is the original input $X_t$, the data input of every other recurrent layer is the hidden state output $H^{(l-1)}_t$ of the previous recurrent layer.

The state variable of the last layer of the multi-layer recurrent network can be output directly as the model output, $F_t = H^{(L)}_t$, or passed through an activation function:

$$F_t = g(H^{(L)}_t)$$
If the last recurrent layer is the output layer of the entire network, this $F_t$ is the output of the entire network; if some non-recurrent layers follow, $F_t$ becomes their input. In a multi-layer recurrent network, the size of the initial data input $X_t$ and the size of the hidden state $H$ are usually different, while the size of $H$ is the same in every recurrent layer. The data input size of the first layer therefore usually differs from that of the other recurrent layers, so the weight shapes of all recurrent layers are the same except for the first layer. Of course, different recurrent layers could use hidden states of different sizes, but in practice this is usually not done.
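The shape difference between the first layer and the later layers can be verified directly with the cells defined above, for example:

first = LSTMCell(5, 8)    # the first layer maps the data input (size 5)
other = LSTMCell(8, 8)    # later layers take the hidden state (size 8) as input
print(first.W_ih.shape, other.W_ih.shape)   # (5, 32) (8, 32)
print(first.W_hh.shape, other.W_hh.shape)   # (8, 32) (8, 32)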
The recurrent neural network units above can be used to construct a multi-layer recurrent neural network. The following code builds a base class RNNBase representing a multi-layer recurrent network on top of the unit classes:
from Layers import *
class RNNBase(Layer):
    def __init__(self, mode, input_size, hidden_size, n_layers, bias=True):
        super(RNNBase, self).__init__()
        self.mode = mode
        if mode == 'RNN_TANH':
            self.cells = [RNNCell(input_size, hidden_size, bias, nonlinearity="tanh")]
            self.cells += [RNNCell(hidden_size, hidden_size, bias, nonlinearity="tanh")
                           for i in range(n_layers-1)]
        elif mode == 'RNN_RELU':
            self.cells = [RNNCell(input_size, hidden_size, bias, nonlinearity="relu")]
            self.cells += [RNNCell(hidden_size, hidden_size, bias, nonlinearity="relu")
                           for i in range(n_layers-1)]
        elif mode == 'LSTM':
            self.cells = [LSTMCell(input_size, hidden_size, bias)]
            self.cells += [LSTMCell(hidden_size, hidden_size, bias)
                           for i in range(n_layers-1)]
        elif mode == 'GRU':
            self.cells = [GRUCell(input_size, hidden_size, bias)]
            self.cells += [GRUCell(hidden_size, hidden_size, bias)
                           for i in range(n_layers-1)]
        self.input_size, self.hidden_size = input_size, hidden_size
        self.n_layers = n_layers
        self.flatten_parameters()
        self._params = None
def flatten_parameters(self):
self.params = []
self.grads = []
for i in range(self.n_layers):
rnn = self.cells[i]
for j,p in enumerate(rnn.params):
self.params.append(p)
self.grads.append(rnn.grads[j])
    def forward(self, input, h):
        # the setup below reconstructs the truncated opening of this method:
        # input has shape (seq_len, batch_size, input_size); h holds the
        # initial states of all layers
        mode, n_layers = self.mode, self.n_layers
        seq_len = input.shape[0]
        self.h = h
        hs = [[] for i in range(n_layers)]   # hidden states of every layer
        zs = [[] for i in range(n_layers)]   # cached gate activations (LSTM/GRU)
        x = input
        for i in range(n_layers):
            cell = self.cells[i]
            if i != 0:
                x = hs[i-1]   # output h of the previous layer
                if mode == 'LSTM':
                    x = np.array([h_ for h_, c in x])
            hi = h[i]
            if mode == 'LSTM':
                hi = (h[0][i], h[1][i])
            for t in range(seq_len):
                hi = cell(x[t], hi)
                if isinstance(hi, tuple):
                    hi, z = hi[0], hi[1]
                    zs[i].append(z)
                hs[i].append(hi)
        self.hs = np.array(hs)   # (layer_size, seq_size, batch_size, hidden_size)
        if len(zs[0]) > 0:
            self.zs = np.array(zs)
        else:
            self.zs = None
        output = np.array([h_[0] if isinstance(h_, tuple) else h_ for h_ in hs[-1]])
        hn = [hs[i][-1] for i in range(n_layers)]   # final state of each layer
        return output, hn
    def backward(self, dhs, input):   #,hs):
        if self.hs is None:
            self.hs, _ = self.forward(input)
        hs = self.hs
        zs = self.zs if self.zs is not None else hs
        seq_len, batch_size = input.shape[0], input.shape[1]
        dinput = [None for t in range(seq_len)]
        #----dhidden--------
        dhidden = [None for i in range(self.n_layers)]
        for layer in reversed(range(self.n_layers)):
            layer_hs = hs[layer]
            layer_zs = zs[layer]
            cell = self.cells[layer]
            if layer == 0:
                layer_input = input
            else:
                if self.mode == 'LSTM':
                    layer_input = self.hs[layer-1]
                    layer_input = [h for h, c in layer_input]
                else:
                    layer_input = self.hs[layer-1]
            h_0 = self.h[layer]
            dh = np.zeros_like(dhs[0])   # gradient from the next moment
            if self.mode == 'LSTM':
                h_0 = (self.h[0][layer], self.h[1][layer])
                dc = np.zeros_like(dhs[0])
            for t in reversed(range(seq_len)):
                dh += dhs[t]   # gradient of the next moment + gradient of the current moment
                h_pre = h_0 if t == 0 else layer_hs[t-1]
                if self.mode == 'LSTM':
                    dhc = (dh, dc)
                    dx, dhc, _ = cell.backward(dhc, layer_zs[t], layer_input[t], h_pre)
                    dh, dc = dhc
                else:
                    dx, dh, _ = cell.backward(dh, layer_zs[t], layer_input[t], h_pre)
                if layer > 0:
                    dhs[t] = dx
                else:
                    dinput[t] = dx
                #----dhidden--------
                if t == 0:
                    if self.mode == 'LSTM':
                        dhidden[layer] = dhc
                    else:
                        dhidden[layer] = dh
        return np.array(dinput), np.array(dhidden)
def parameters(self):
if self._params is None:
self._params = []
for i, _ in enumerate(self.params):
self._params.append([self.params[i],self.grads[i]])
return self._params
On the basis of this base class, concrete types of multi-layer recurrent networks can be implemented. The following classes RNN, LSTM, and GRU implement a multi-layer simple recurrent network, a multi-layer LSTM, and a multi-layer GRU, respectively:
class RNN(RNNBase):
    def __init__(self, *args, **kwargs):
        if 'nonlinearity' in kwargs:
            if kwargs['nonlinearity'] == 'tanh':
                mode = 'RNN_TANH'
            elif kwargs['nonlinearity'] == 'relu':
                mode = 'RNN_RELU'
            else:
                raise ValueError("Unknown nonlinearity '{}'".format(kwargs['nonlinearity']))
            del kwargs['nonlinearity']
        else:
            mode = 'RNN_TANH'
        super(RNN, self).__init__(mode, *args, **kwargs)
class LSTM(RNNBase):
def __init__(self,*args, **kwargs):
super(LSTM, self).__init__('LSTM', *args, **kwargs)
class GRU(RNNBase):
def __init__(self,*args, **kwargs):
super(GRU, self).__init__('GRU', *args, **kwargs)
These multilayer recurrent neural networks can be tested with the following code:
import numpy as np
from rnn import *
np.random.seed(1)
num_layers= 2
batch_size,input_size,hidden_size= 3,5,8
seg_len = 6
test_RNN = "LSTM"
if test_RNN == "rnnTANH":
rnn = RNN(input_size,hidden_size,num_layers )
elif test_RNN == "rnnRELU":
rnn = RNN(input_size,hidden_size, num_layers,nonlinearity=
'relu')
elif test_RNN == "GRU":
rnn = GRU(input_size,hidden_size, num_layers)
elif test_RNN == "LSTM":
rnn = LSTM(input_size,hidden_size, num_layers)
input = np.random.randn(seg_len, batch_size, input_size)   # (seq_len, batch, input)
h_0 = np.random.randn(num_layers, batch_size, hidden_size)
c_0 = np.random.randn(num_layers, batch_size, hidden_size)
print("input.shape",input.shape)
print("h_0.shape",h_0.shape)
print("c_0.shape",c_0.shape)
if test_RNN == "LSTM":
output, hn = rnn(input, (h_0,c_0))
else:
output, hn = rnn(input, h_0)
print("output.shape",output.shape)
print("output",output)
print("hn",hn)
#------test backward---
do = np.random.randn(*output.shape)
dinput,dhidden = rnn.backward(do,input)#,rnn.hs)#output)
print("dinput.shape:",dinput.shape)
print("dinput:",dinput)
print("dhidden:",dhidden)
class LSTM_Model(object):
    # the class name and constructor are assumptions; the original listing is
    # truncated and only a few methods survive
    def __init__(self, input_size, hidden_size, output_dim, num_layers):
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.lstm = LSTM(input_size, hidden_size, num_layers)
        self.linear = Dense(hidden_size, output_dim)   # a fully-connected layer;
                                                       # the class name from Layers is assumed
        self.layers = [self.lstm, self.linear]
        self._params = None
    def init_hidden(self, batch_size):
        # This is what we'll initialise our hidden state as
        self.h_0 = (np.zeros((self.num_layers, batch_size, self.hidden_size)),
                    np.zeros((self.num_layers, batch_size, self.hidden_size)))
        return self.h_0
    def __call__(self, input):
        batch_size = input.shape[1]
        hs_out, hn = self.lstm(input, self.h_0)
        y_pred = self.linear(hs_out[-1].reshape(batch_size, -1))
        return y_pred
    def backward(self, dZs, input):
        dhs = self.linear.backward(dZs)
        # only the last time step feeds the linear layer, so the gradient at
        # the other steps is zero
        seq_len, batch_size = input.shape[0], input.shape[1]
        dhs_seq = np.zeros((seq_len, batch_size, self.hidden_size))
        dhs_seq[-1] = dhs.reshape(batch_size, -1)
        dinput = self.lstm.backward(dhs_seq, input)
        return dinput
    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
The code below models autoregressive data with the above multi-layer recurrent neural network. The ARData class, from https://github.com/jessicayung/blog-code-snippets/blob/master/lstm-pytorch/generate_data.py, is used to generate the autoregressive training data.
import util
from train import *
from generate_data import *
import matplotlib.pyplot as plt
%matplotlib inline

input_size = 20
# Data params
noise_var = 0
num_datapoints = 100
test_size = 0.2
num_train = int((1 - test_size) * num_datapoints)
hidden_size = 32
lstm_input_size = input_size
output_dim = 1
num_layers = 2
batch_size = num_train  # 80
loss_fn = util.mse_loss_grad  # returns (loss, grad) for (f, y); plays the role of torch.nn.MSELoss(size_average=False)
learning_rate = 1e-3
momentum = 0.9
#optimizer = SGD(model.parameters(), learning_rate, momentum)
optimizer = Adam(model.parameters(), learning_rate)
num_epochs = 500
print(X_train.shape)
hist = np.zeros(num_epochs)
for t in range(num_epochs):
    model.hidden = model.init_hidden(batch_size)
    y_pred = model(X_train)  # Forward pass

plt.plot(y_pred, label="Preds")
plt.plot(y_train, label="Data")
plt.legend()
plt.show()
plt.plot(hist, label="Training loss")
plt.legend()
plt.show()
(20, 80, 1)
(20, 80, 1)
(1, 80, 20)
Epoch 0 MSE: 0.030292062696899477
Epoch 100 MSE: 0.013801384758457096
Epoch 200 MSE: 0.013244797126843889
Epoch 300 MSE: 0.013052903618001023
Epoch 400 MSE: 0.012934439762440214
Figure 7-47 Prediction and real data of the 2-layer LSTM network trained on autoregressive data
Figure 7-48 The training loss curve of the 2-layer LSTM network trained on autoregressive data
$$\overrightarrow{H}_t = \phi(X_t W_{xh}^{(f)} + \overrightarrow{H}_{t-1} W_{hh}^{(f)} + b_h^{(f)})$$

$$\overleftarrow{H}_t = \phi(X_t W_{xh}^{(b)} + \overleftarrow{H}_{t+1} W_{hh}^{(b)} + b_h^{(b)})$$
←
Among them, H → t, Ht represent the forward and backward state variables,
(f ) (f ) (b) (b)
respectively. W xh
∈ R
d×h
,W
hh
h×h
∈ R ,W
xh
d×h
∈ R ,W
hh
h×h
∈ R is the
(f ) (b)
weight parameter of the model, b ∈ R , b ∈ R
h
1×h
is the bias parameter.
h
1×h
(f ), (b) is used to mark whether the model parameters are forward or backward.
The state variable of the last layer $L$ of a multi-layer bidirectional recurrent network can be output directly as the output of the model, $F_t = H_t^{(L)}$, or passed through an activation function, or fed into further non-recurrent network layers before being output, e.g.

$$F_t = H_t^{(L)} W_{hf} + b_f$$

where $H_t^{(L)}$ is the vector formed by concatenating $\overrightarrow{H}_t^{(L)}$ and $\overleftarrow{H}_t^{(L)}$.
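To make the two recurrences concrete, here is a small plain-numpy sketch (with made-up dimensions; the names Wxh_f, Whh_b, etc. simply mirror the symbols above and are not from the book's library) that computes the forward and backward state sequences and concatenates them per time step:
import numpy as np

def bidirectional_states(X, Wxh_f, Whh_f, bh_f, Wxh_b, Whh_b, bh_b, phi=np.tanh):
    # X: (T, n, d); returns the concatenated states (T, n, 2h) per the formulas above
    T, n, _ = X.shape
    h = Whh_f.shape[0]
    Hf = np.zeros((T, n, h))
    Hb = np.zeros((T, n, h))
    for t in range(T):                        # forward direction: t = 0 .. T-1
        prev = Hf[t-1] if t > 0 else np.zeros((n, h))
        Hf[t] = phi(X[t] @ Wxh_f + prev @ Whh_f + bh_f)
    for t in reversed(range(T)):              # backward direction: t = T-1 .. 0
        nxt = Hb[t+1] if t < T-1 else np.zeros((n, h))
        Hb[t] = phi(X[t] @ Wxh_b + nxt @ Whh_b + bh_b)
    return np.concatenate([Hf, Hb], axis=2)

T, n, d, h = 4, 2, 3, 5
rng = np.random.default_rng(0)
H = bidirectional_states(rng.standard_normal((T, n, d)),
                         rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h),
                         rng.standard_normal((d, h)), rng.standard_normal((h, h)), np.zeros(h))
print(H.shape)  # (4, 2, 10): the forward and backward states spliced per time step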
As in the previous section, a bidirectional recurrent neural network can be built directly from neural network units; alternatively, a single bidirectional recurrent layer can be encapsulated in its own class, and these single-layer bidirectional layers can then be stacked to construct a (multi-layer) bidirectional recurrent neural network.
        output = []
        zs = []
        hs = []
        steps = range(seq_len - 1, -1, -1) if self.reverse else range(seq_len)
        for t in steps:
            h = self.cell(input[t], h)
            if isinstance(h, tuple):
                h, z = h[0], h[1]
                if mode == 'LSTM' or mode == 'GRU':
                    zs.append(z)
            hs.append(h)
        self.hs = np.array(hs)
        output = [h[0] if isinstance(h, tuple) else h for h in self.hs]
        if mode == 'LSTM' or mode == 'GRU':
            self.zs = np.array(zs)
        return np.array(output), h
        zs = self.zs if self.zs is not None else hs
        if len(dhs) == len(hs):  # (seq, batch, hidden)
            dinput = [None for i in range(seq_len)]
            steps = range(seq_len) if self.reverse else range(seq_len - 1, -1, -1)
            t0 = seq_len - 1 if self.reverse else 0
            dh = np.zeros_like(dhs[0])  # gradient flowing in from the next time step
            for t in steps:
                dh += dhs[t]  # gradient from the next time step + gradient at the current time step
                h_pre = self.h if t == t0 else hs[t-1]
                dx, dh, _ = cell.backward(dh, zs[t], input[t], h_pre)
                dinput[t] = dx
            return dinput
#test_LSTM = "LSTM"
test_LSTM = "GRU"
reverse = True
np.random.seed(1)
seq_len, batch_size, input_size, hidden_size = 5, 3, 4, 6
if test_LSTM == "RNN_TANH":
    rnn_ = RNNLayer("RNN_TANH", input_size, hidden_size, reverse=reverse)
elif test_LSTM == "GRU":
    rnn_ = RNNLayer('GRU', input_size, hidden_size, reverse=reverse)
else:
    rnn_ = RNNLayer('LSTM', input_size, hidden_size, reverse=reverse)
input = np.random.randn(seq_len, batch_size, input_size)
if reverse:
    input = input[::-1]
h0 = np.random.randn(batch_size, hidden_size)
c0 = np.random.randn(batch_size, hidden_size)
if test_LSTM == "LSTM":
    output, hn = rnn_(input, (h0, c0))
else:
    output, hn = rnn_(input, h0)
print("output", output)
print("hn", hn)
#------test backward---
do = np.random.randn(*output.shape)
dinput = rnn_.backward(do, input)
print("dinput:", dinput)
        if False:
            if mode == 'LSTM':
                gate_size = 4 * hidden_size
            elif mode == 'GRU':
                gate_size = 3 * hidden_size
            elif mode == 'RNN_TANH':
                gate_size = hidden_size
            elif mode == 'RNN_RELU':
                gate_size = hidden_size
            else:
                raise ValueError("Unrecognized RNN mode: " + mode)
        self.layers = []
        self.params = []
        self.grads = []
        self._all_weights = []
        for layer in range(num_layers):
            layer_input_size = input_size if layer == 0 else hidden_size
            for direction in range(num_directions):
                if direction == 0:
                    rnnlayer = RNNLayer(mode, layer_input_size, hidden_size, reverse=False)
                else:
                    rnnlayer = RNNLayer(mode, layer_input_size, hidden_size, reverse=True)
                self.layers.append(rnnlayer)
                self.params += rnnlayer.cell.params
                self.grads += rnnlayer.cell.grads

    def init_hidden(self, batch_size):
        num_layers, num_directions = self.num_layers, self.num_directions
        self.h0 = []
        for layer in self.layers:
            h0 = layer.init_hidden(batch_size)
            self.h0.append(h0)
        return self.h0

    def backward(self, dhs, input):
        # (the body of backward is elided in this excerpt)
        return dhs
import numpy as np
np.random.seed(1)
reverse = False
num_layers = 2
seq_len, batch_size, input_size, hidden_size = 5, 3, 4, 6
input = np.random.randn(seq_len, batch_size, input_size)
test_LSTM = 'GRU'
if test_LSTM == "RNN_TANH":
    rnn = RNNBase_("RNN_TANH", input_size, hidden_size, num_layers)
elif test_LSTM == "GRU":
    rnn = RNNBase_('GRU', input_size, hidden_size, num_layers)
else:
    rnn = RNNBase_('LSTM', input_size, hidden_size, num_layers)
do = np.random.randn(*output.shape)
dinput = rnn.backward(do, input)
print("dinput:", dinput)
Both the encoder and decoder use a recurrent neural network (RNN) to process sequence inputs and outputs of varying lengths. The input sequence is fed to the encoder to generate a state variable, also called the context variable (context vector); the decoder takes this context variable as its initial state variable and from it generates an output sequence. For example, in machine translation, the encoder takes input sentences (sequences of words) in one language and the decoder outputs sentences (sequences of words) in another language. The Seq2Seq model was quickly extended to other problems similar to machine translation, such as dialogue, image captioning, text summarization, and couplet generation.
Machine translation
Machine translation is the conversion (translation) of sentences (sequences of words) in one language into sentences (sequences of words) in another language. This sequence-to-sequence conversion problem can be modeled with a Seq2Seq model composed of an encoder and a decoder, as shown in Figure 7-50. The encoder accepts a sentence (i.e., a sequence of words) in a certain language; the sequence can be arbitrarily long, and the encoder RNN processes each word of the input sequence in turn until it encounters the end word. The encoder outputs a context vector that encodes the input sentence. This context vector can be the output at the last moment (such as the final hidden state) or the outputs at all moments (such as the hidden states at every time step).
The decoder takes this encoded context and a special start word, and produces a sequence of words in turn until it produces a special end word. The start word and end word of the decoder are artificially chosen tokens, for example the letter sequences "SOS" and "EOS" as the start word and end word, respectively. In machine translation, such special start and end words are usually added to both the input sentence and the translated sentence.
In the training phase, the encoder and decoder are trained on the error loss between the predicted word sequence and the target word sequence. In the inference phase, the decoder predicts (or samples) the next word from the current word at each step until it has produced the final output word sequence.
The simplest encoder is a recurrent neural network that reads a sequence of data (such as a sequence of words) and computes contextual information representing the content of the sequence. For the simplest encoders, this contextual information is the hidden state at the last time step. At each moment, the encoder accepts the input data and the hidden state from the previous moment, and computes the hidden state at the current moment as the current output. The calculation process is shown in the left subfigure of Figure 7-51:
Figure 7-51 Left: the encoder accepts the one-hot vector of the input word at the current moment and the hidden state from the previous moment, and computes the hidden state at the current moment as the current output. Right: the decoder accepts the one-hot vector of the input word at the current moment and the hidden state from the previous moment, computes the hidden state at the current moment, and passes this hidden state through a linear layer to output a vector holding, for each word in the word list, its score as the next word
    def word2vec(self, word_indices_input):
        return one_hot(self.input_size, word_indices_input, True)

    def initHidden(self, batch_size=1):
        return np.zeros((self.num_layers, batch_size, self.hidden_size))

    def parameters(self):
        return self.gru.parameters()

    def backward(self, dhs):
        dinput, dhidden = self.gru.backward(dhs, self.encode_input)
The simplest decoder is a recurrent neural network plus an output layer. The decoder accepts the one-hot vector of the input word at the current moment and the hidden state from the previous moment, and computes the hidden state at the current moment. This hidden state is passed through a linear layer to output a vector holding the score of each word in the word list; the calculation process is shown in the right subfigure of Figure 7-51.
        self.gru = GRU(input_size, hidden_size, num_layers)
        self.out = Dense(hidden_size, output_size)
        self.layers = [self.gru, self.out]
        self._params = None

    def initHidden(self, batch_size=1):
        self.h_0 = np.zeros((self.num_layers, batch_size, self.hidden_size))

    def word2vec(self, input_t):
        return one_hot(self.input_size, input_t, True)

    def forward(self, input_tensor, hidden):
        teacher_forcing_ratio = self.teacher_forcing_ratio
        use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
        self.input = []
        output_hs = []
        output = []
        hidden_t = hidden
        h_0 = hidden.copy()
        input_t = np.array([SOS_token])
        hs = []
        zs = []
        target_length = input_tensor.shape[0]
        for t in range(target_length):
            output_t, hidden_t, output_hs_t = self.forward_step(input_t, hidden_t)
            # Save the calculation results at each moment
            hs.append(self.gru.hs)  # hidden states
            zs.append(self.gru.zs)  # intermediate variables
            output_hs.append(output_hs_t)
            output.append(output_t)
            if use_teacher_forcing:
                input_t = input_tensor[t]  # teacher forcing
            else:
                input_t = np.argmax(output_t)  # word with maximum probability
                if input_t == EOS_token:
                    break
                input_t = np.array([input_t])
        output = np.array(output)
        self.output_hs = np.array(output_hs)
        self.h_0 = h_0
        self.hs = np.concatenate(hs, axis=1)
        self.zs = np.concatenate(zs, axis=1)
        return output
        return decoded_word_indices
    def backward(self, dZs):
        dhs = []
        output_hs = self.output_hs
        input = np.concatenate(self.input, axis=0)
        for i in range(len(input)):
            self.out.x = output_hs[i]
            dh = self.out.backward(dZs[i])
            dhs.append(dh)
        dhs = np.array(dhs)
        self.gru.hs = self.hs
        self.gru.zs = self.zs
        self.gru.h = self.h_0
        dinput, dhidden = self.gru.backward(dhs, input)
        return dinput, dhidden

    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
DecoderRNN contains a GRU recurrent neural network self.gru; the output of self.gru passes through the linear output layer self.out, which produces, for each word in the word list, its score as the next word. Since the word at each moment is fed into self.gru as a one-hot vector, and the output of self.out is likewise a vector of the same length as the word list holding each word's score, the lengths of the input vector of self.gru and the output vector of self.out both equal the length of the word list.
The forward() method accepts the input word sequence input_tensor, starts from the special start-word index SOS_token, processes each input word input_t in turn, and saves the intermediate state of the GRU computation, such as self.gru.hs and self.gru.zs, because the reverse derivation at each moment depends on these intermediate variables.
At each moment the word input_t is fed in and a prediction vector output_t is produced. The input_t at the next moment can be either the word with the highest score in output_t, or the corresponding word from the output sentence of the training sample. If the flag use_teacher_forcing is True, input_t uses the word from the training sample's output sentence; otherwise it uses the word with the highest predicted score. Using the word from the training sample's output sentence as the next input is called "teacher forcing".
For example, suppose the target sequence of the decoder is 'hello'. The input at the initial moment is the special token 'SOS', and its target output should be the character 'h', but the probability of 'h' in the output vector at the initial moment may not be the largest. If 'o' is the predicted character with the highest probability, then without teacher forcing this 'o' is used as the input at the next moment; with teacher forcing, the predicted 'o' is discarded and the actual target output 'h' is used as the next input instead.
Teacher forcing leads to faster convergence, but the network may over-learn the information in the training samples, resulting in poor generalization, i.e., unstable behavior at prediction time. Therefore, teacher forcing can be enabled randomly, for example with a 50% chance on each sequence.
evaluate() uses the trained decoder for prediction. It accepts the context vector hidden output by the encoder and the maximum number of output words max_length. Its process is similar to the forward() function, but because it is used for prediction, only the single start token 'SOS' is fed in at the initial moment. It therefore works without teacher forcing, i.e., it always feeds the word with the highest prediction score back in as the next input (alternatively, to generate more varied output, the next word can be sampled according to the probabilities derived from the scores). Starting from the context vector output by the encoder and the start word 'SOS' at the initial moment, words are produced one after another until the end token 'EOS' is encountered or the number of words (characters) reaches max_length. The final output is a vector of word-list indices for all produced words.
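The body of evaluate() is not reproduced in this excerpt; a minimal sketch of its greedy decoding loop, assuming the same forward_step() helper that forward() above uses, could look like this:
    def evaluate(self, hidden, max_length):
        # greedy decoding: always feed the highest-scoring word back as the next input
        input_t = np.array([SOS_token])
        hidden_t = hidden
        decoded_word_indices = []
        for t in range(max_length):
            output_t, hidden_t, _ = self.forward_step(input_t, hidden_t)
            idx = np.argmax(output_t)        # word with the largest score
            if idx == EOS_token:
                break
            decoded_word_indices.append(idx)
            input_t = np.array([idx])
        return decoded_word_indices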
The backward() method accepts the gradient dZs of the loss function with respect to the output-layer outputs. It first computes, for each moment, the gradient of the output layer with respect to the hidden state at that moment, and then uses the hidden-state gradients dhs at all moments, together with the GRU's input, to run the reverse derivation through the GRU recurrent network.
The parameters() functions of the encoder and decoder return all of their model parameters, which are used to construct the optimizer objects.
The following function train_step() accepts a pair of input and output sequences input_tensor and target_tensor, the encoder and decoder together with their optimizers (encoder, decoder, encoder_optimizer, decoder_optimizer), the function loss_fn for computing the model loss, and the regularization coefficient reg.
train_step() performs one training update of the model parameters. It first computes the encoder outputs encoder_output and encoder_hidden from input_tensor and, depending on the last_hidden flag, feeds either the last-moment hidden state encoder_hidden or the full encoder_output into the decoder, which together with target_tensor produces the decoder's final predicted output. It then computes the cross-entropy loss and the gradient grad of the loss with respect to output from the predicted output and the target, runs decoder.backward(grad) to back-propagate through the decoder (whose output is the gradient dhidden with respect to the encoder's output encoder_hidden), and continues back-propagating through the encoder with this gradient. Finally, the model parameters are updated. Before the update, clip_grad_norm_nn can be used to clip the gradients to prevent gradient explosion.
    loss = 0
    encode_input = input_tensor
    encoder_output, encoder_hidden = encoder(encode_input, None)
    if last_hidden:
        output = decoder(target_tensor, encoder_hidden)
    else:
        output = decoder(target_tensor, encoder_output)
    target = target_tensor.reshape(-1, 1)
    if output.shape[0] != target.shape[0]:
        target = target[:output.shape[0], :]
    loss, grad = loss_fn(output, target)
    loss /= output.shape[0]
    if last_hidden:
        dinput, dhidden = decoder.backward(grad)
        encoder.backward(dhidden[0])
    else:
        dinput, d_encoder_outputs = decoder.backward(grad)
        encoder.backward(d_encoder_outputs)
    util.clip_grad_norm_nn(encoder_optimizer.parameters(), clip, None)
    util.clip_grad_norm_nn(decoder_optimizer.parameters(), clip, None)
    encoder_optimizer.step()
    decoder_optimizer.step()
    return loss
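clip_grad_norm_nn itself is not shown in this excerpt. A standard global-norm clipping routine over the [param, grad] pairs returned by parameters() might look like the following sketch (the signature is assumed from the call above and is not necessarily the book's exact implementation):
def clip_grad_norm_nn(params, max_norm, norm_type=None):
    # global L2 norm over all gradients; rescale them all if it exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(grad ** 2) for _, grad in params))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for _, grad in params:
            grad *= scale  # in place, so the optimizer sees the clipped values
    return total_norm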
The function trainIters() iteratively calls train_step() to update the model parameters. During the iteration it can report intermediate training results, such as the training error and the validation error.
import numpy as np
import time
import math
import matplotlib.pyplot as plt
%matplotlib inline

def timeSince(start):
    now = time.time()
    s = now - start
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

training_pairs = train_pairs
loss_fn = util.rnn_loss_grad

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
            plt.plot(plot_losses)
            valid_losses.append(validation_loss(encoder, decoder, valid_pairs,
                                                encoder_output_all, 20, reg))
            plt.plot(valid_losses)
            plt.legend(["train_losses", "valid_losses"])
            plt.show()
        target = target_tensor.reshape(-1, 1)
        if output.shape[0] != target.shape[0]:
            target = target[:output.shape[0], :]
        loss, grad = loss_fn(output, target)
        loss /= output.shape[0]
        total_loss += loss
    decoder.teacher_forcing_ratio = teacher_forcing_ratio
    return total_loss / len(valid_pairs)
Here validation_loss() uses the model being trained to compute the validation error. It takes a small random subset of the validation set, generates output through the encoder and decoder, and computes the loss of the decoder's predicted output in the same way as during training.
The paired bilingual sentences used below can be downloaded from https://www.manythings.org/anki/.
Like the earlier RNN text generation, the input and output sentences of machine translation can be treated as sequences of words or sequences of characters. As long as a character table is established for all the characters of a language, each character in a sentence can be converted into a one-hot vector. Figure 7-52 shows a Seq2Seq model that treats sentences as character sequences.
Figure 7-52 The character-level Seq2Seq model for machine translation; the input at each moment is a character
class ChVerb:
    def __init__(self, name):
        self.name = name

import numpy as np
import random
import re
import unicodedata
random.seed(1)

def unicodeToAscii(sentence):
    return ''.join(
        c for c in unicodedata.normalize('NFD', sentence)
        if unicodedata.category(c) != 'Mn'
    )

def normalize_sentence(sentence):
    sentence = unicodeToAscii(sentence.lower().strip())
    sentence = re.sub(r"([.!?])", r" \1", sentence)
    sentence = re.sub(r"[^a-zA-Z.!?]+", r" ", sentence)
    return sentence
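For example, running the helper on an accented French sentence strips the diacritics, lower-cases the text, and pads the punctuation with a space:
print(normalize_sentence("Ça va très bien!"))   # -> 'ca va tres bien !'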
normalize_sentence() preprocesses the characters of a sentence: it converts Unicode to ASCII, converts uppercase characters to lowercase, and removes non-alphabetic characters. The sentence pairs read in are then filtered, for example by limiting the sentence length:
MAX_LENGTH = 20
def filterPair(p):
    return len(p[0]) < MAX_LENGTH and \
        len(p[1]) < MAX_LENGTH
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
Using the read and filtered sentence pairs as training samples, first construct the character word lists of the two languages:
def prepareCharPairs(lang2lang_file, reverse=False):
    pairs = readLangs(lang2lang_file, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    for pair in pairs:
        in_verb.addChars(pair[0])
        out_verb.addChars(pair[1])
    return in_verb, out_verb, pairs

lang2lang_file = './data/eng-fra.txt'
in_verb = ChVerb("fra")
out_verb = ChVerb("eng")
in_verb, out_verb, pairs = prepareCharPairs(lang2lang_file, True)
Reading lines...
Read 170651 sentence pairs
Trimmed to 9194 sentence pairs
Read 9194 sentence pairs
Counted chars:
fra 32
eng 32
['tom a dit bonjour .', 'tom said hi .']
['je suis creve .', 'i am tired .']
['prends une douche !', 'take a shower .']
['je suis detendu .', 'i m relaxed .']
['tu es endurant .', 'you re resilient .']
['cours !', 'run !']
The following code converts the character words of these training-sample sentences to and from word-list indices:
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(in_verb, pair[0])
    target_tensor = tensorFromSentence(out_verb, pair[1])
    return (input_tensor, target_tensor)

print(pairs[3])
en_input, de_target = tensorsFromPair(pairs[3])  # random.choice(pairs)
print(en_input.shape)
print(de_target.shape)
print(en_input)
print(de_target)
hidden_size = 50  # 256
num_layers = 1
clip = 5.  # 50.
learning_rate = 0.1
decoder_learning_ratio = 1.0
teacher_forcing_ratio = 0.5
momentum = 0.5
decay_every = 1000
encoder_optimizer = SGD(encoder.parameters(), learning_rate, momentum, decay_every)
decoder_optimizer = SGD(decoder.parameters(), learning_rate*decoder_learning_ratio,
                        momentum, decay_every)
reg = None  # 1e-2
if True:
    pairs = pairs[:80000]
np.random.shuffle(pairs)
train_n = (int)(len(pairs)*0.98)
train_pairs = pairs[:train_n]
valid_pairs = pairs[train_n:]
n_iters = 50000
print_every, plot_every = 100, 100
idx_train_pairs = [tensorsFromPair(random.choice(train_pairs)) for i in range(n_iters)]
idx_valid_pairs = [tensorsFromPair(pair) for pair in valid_pairs]
trainIters(encoder, decoder, encoder_optimizer, decoder_optimizer, idx_train_pairs,
           idx_valid_pairs, True, print_every, plot_every, reg)
Figure 8-54 Loss curve of the Seq2Seq model for character sequences
The separation between the training loss curve and the validation loss curve shows that training is unstable; the loss curve flattens out and rises slightly after 40,000 iterations.
The trained model can be used for language translation: the word sequence (sentence) of the language to be translated is fed to the encoder, which produces context information, which is in turn fed to the decoder to produce the translated word sequence (sentence).
def evaluate(encoder, decoder, in_vocab, out_vocab, sentence,
             max_length=MAX_LENGTH, last_Hidden=True):
    encode_input = tensorFromSentence(in_vocab, sentence)
    encoder_output, encoder_hidden = encoder(encode_input, None)
    if last_Hidden:
        output_sentence = decoder.evaluate(encoder_hidden, max_length)
    else:
        output_sentence = decoder.evaluate(encoder_output, max_length)
    output_sentence = indexToSentence(out_vocab, output_sentence)
    return output_sentence
Here last_Hidden indicates whether the input of the decoder is the encoder's output (hidden vector) at the last moment or the outputs at all moments. Randomly select several input sentences and use evaluate() to predict their translations:
indices = np.random.randint(len(pairs), size=3)
for i in indices:
    pair = pairs[i]
    print(pair)
    sentence = pair[0]
    sentence = evaluate(encoder, decoder, in_verb, out_verb, sentence, MAX_LENGTH)
    print(sentence)
Judging from the results, the prediction quality is not ideal. When the encoder's last-moment output is used as the context vector that passes information from encoder to decoder, this single vector bears the burden of encoding the entire sentence and may not hold complete information. If the outputs at all moments are used as the context, the information is more complete, but the variable-length encoder outputs cannot be used directly as the decoder's input at each moment.
The attention mechanism introduced later lets the decoder network "focus" on different parts of the encoder output at each step of its own output. This handles the variable-length encoder output and avoids inflating the decoder's hidden state vector.
- Space is wasted: the vector representing a word is very large, with only one component equal to 1 and all the others 0.
- It cannot express relationships between words, such as synonymy or correlation; the words of a language are not independent of each other, and there is often some correlation between them.
In natural language processing, word vectorization methods better than one-hot are used; these methods are collectively called Word2Vec (word vectorization). Word vectorization can be viewed as mapping words from the space of the word list (the one-hot vector space) to a low-dimensional space, much as an autoencoder maps a high-dimensional vector to a low-dimensional one. Word vectorization is likewise a model trained on a corpus (such as a body of text) with supervised machine learning methods, but because it requires no manual labeling of words and instead samples its own supervised training pairs, some people also call it unsupervised learning.
These two methods (CBOW and Skip-gram) learn the word vector of a word through a 2-layer neural network similar to an autoencoder, i.e., they map a high-dimensional one-hot vector to a low-dimensional hidden vector. As shown in Figure 8-55, if the length of the word list is $V$, i.e., there are $V$ different words, then the one-hot vector $x$ of a word is a vector of length $V$. After passing through the encoder's $V \times N$ weight matrix $W_{V\times N}$ (where $N$ is usually an integer much smaller than $V$), the weighted sum of $x$ produces a low-dimensional hidden vector $h_N = x W_{V\times N}$; this hidden vector $h_N$ is the vectorized representation of the word with index $k$.
Figure 8-55 Both CBOW and Skip-gram word vectorization use a 2-layer weighted-sum neural network to learn the vectorized representation of words
Because $x$ is a row vector with only the $k$-th component equal to 1 and all others 0, the product $x W_{V\times N}$ is exactly the $k$-th row of the matrix, so no multiplication is actually needed: just take out the $k$-th row of the matrix.
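A two-line numpy check of this shortcut (with illustrative sizes only):
import numpy as np
V, N, k = 6, 3, 2
W = np.random.randn(V, N)
x = np.zeros(V); x[k] = 1            # one-hot row vector of word k
print(np.allclose(x @ W, W[k]))      # True: the product is just row k of W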
In order to obtain a suitable latent vector representation of words that reflects the relationships between words (such as synonymy), this weight matrix must be trained like an autoencoder. As in an autoencoder, the hidden vector is converted through a matrix $W_{N\times V}$ into an output vector of the same length as the word list. Each component $p_i$ of this output vector represents the score of the $i$-th word, and the scores can be converted into probabilities by the softmax function. To train this neural network model, CBOW and Skip-gram use different methods to generate training samples from a corpus composed of many sentences.
Both the encoder and the decoder are fully connected layers without bias or activation function, i.e., each is just a weight matrix. This 2-layer neural network is therefore 2 weight matrices, which can be represented by the simplified Figure 8-56:
Figure 8-56 Simplified word vectorization neural network; the encoder and decoder are each just a weight matrix
Its working process is similar to that of an autoencoder. The first fully connected linear layer is a bias-free weight matrix $W_1$, which converts the one-hot vector $x$ of an input word into a low-dimensional embedded representation $h = x W_1$; this $h$ is the vectorized representation of the word $x$. In order to train this weight matrix $W_1$, $h$ is passed through another fully connected linear layer without bias or activation, namely the weight matrix $W_2$, which outputs a vector of scores for all words, $f = h W_2$. During training, this $f$ is compared with the target word to obtain a loss error, and the model parameters are updated by back-propagating this loss error.
As shown in Figure 8-57, for a word in a sentence, denoted $w_t$ (also called the center word or target word), CBOW uses its context, i.e., the surrounding words, as input. For example, with a context window of $C = 5$, the input consists of the words at positions $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$, i.e., the two words before and the two words after the center word $w_t$. Given the context words $(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})$ of a center word $w_t$ as input, the word $w_p$ that CBOW is expected to predict is the target word $w_t$. The network is trained by computing the cross-entropy loss between the predicted word $w_p$ and the target word $w_t$.
Figure 8-57 CBOW takes the context of a word (the surrounding words) as the input of the encoder; among the vocabulary scores output by the decoder, the target word should have the highest score
Contrary to CBOW, which uses the context of a word in a sentence to predict the word itself, Skip-gram uses the word $w_t$ to predict its context words $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$, as shown in Figure 8-58. That is, from the one-hot vector of the input word $w_t$, the hidden vector $h_N$ is obtained through the encoder, and the decoder outputs a vector of the same length as the word list holding the score of each word. The scores can again be converted into probabilities through a softmax function, the context words serve as the target words for the cross-entropy loss, and the encoder and decoder are trained accordingly.
Figure 8-58 Skip-gram takes a word as the input of the encoder; among the vocabulary scores output by the decoder, the context words of this word should have the highest scores, i.e., the context words are the target words
Both CBOW and Skip-gram use the sentences of the corpus to generate training samples. For Skip-gram, each word of a sentence serves in turn as the center word, and each of its context words serves as a target word, forming one training sample.
For example, for the sentence "Seq2Seq is a general purpose encoder decoder framework", the context words of the first word "Seq2Seq" are "is" and "a", giving the 2 training samples (Seq2Seq, is) and (Seq2Seq, a), as shown in Figure 8-59. Similarly, for the second word "is", the context words are "Seq2Seq", "a", and "general", giving the three samples (is, Seq2Seq), (is, a), and (is, general). By analogy, the last word "framework" yields the samples (framework, encoder) and (framework, decoder).
Figure 8-59 The training samples that Skip-gram generates for the sentence "Seq2Seq is a general purpose encoder decoder framework"
CBOW and Skip-gram each have advantages and disadvantages: CBOW trains faster and represents frequent words well, while Skip-gram works well even with smaller corpora and represents rare words better.
The following code uses Skip-gram as an example to show how to implement the training process of this model. For Skip-gram, the input is the current center word and the targets are its context words; because there are multiple context words, i.e., multiple targets, each target word contributes one cross-entropy loss computed against the score vector $f = h W_2$.
Similarly, define a word list representing all the words of a language, which can be constructed from the sentences of a corpus:
class Vocab:
    def __init__(self, corpus):
        wordset = set()
        for sentence in corpus:
            if isinstance(sentence, str):
                for word in sentence.split(' '):
                    wordset.add(word)
            else:
                for word in sentence:
                    wordset.add(word)
        wordlist = list(wordset)
        self.word2index = dict([(word, i) for i, word in enumerate(wordset)])
        self.index2word = dict([(i, word) for i, word in enumerate(wordset)])
        self.n_words = len(wordset)
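For example, a word list built from a one-sentence corpus contains each distinct word once:
vocab = Vocab([["i", "am", "from", "china"]])
print(vocab.n_words)   # 4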
All the sentences of a corpus file can be read to build a word list, and the training samples for the Word2Vec model can be generated from the word list and the corpus sentences. The function generate_training_data() samples the training pairs for the Word2Vec model from the word list vocab, the corpus corpus, and the sampling window size window.
def generate_training_data(vocab, corpus, window=2):
    training_data = []
    for sentence in corpus:  # for each sentence
        sent_len = len(sentence)
        for i, word in enumerate(sentence):  # for each word in the sentence
            w_target = vocab.word2index[sentence[i]]
            w_context = []
            for j in range(i - window, i + window + 1):
                if j != i and j <= sent_len - 1 and j >= 0:
                    w_context.append(vocab.word2index[sentence[j]])
            training_data.append([w_target, w_context])
    return np.array(training_data)
corpus = [["i","am","from","china"]]
generate_training_data(vocab,corpus)
A Word2Vec model can be trained on the basis of the word list and corpus; the following class Word2Vec is its code implementation:
class Word2Vec():
    def __init__(self, corpus, hidden_n, window, learning_rate=0.01, epochs=5000):
        self.hidden_n = hidden_n
        self.window = window
        self.lr = learning_rate
        self.epochs = epochs
        self.vocab = Vocab(corpus)
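The rest of the class is not reproduced in this excerpt; a compact sketch of a Skip-gram training loop consistent with the constructor above, assuming W1 ($V \times N$) and W2 ($N \times V$) as the encoder and decoder matrices and one cross-entropy term per context word, might look like this (a sketch, not necessarily the book's exact implementation):
    def train(self, training_data):
        V, N = self.vocab.n_words, self.hidden_n
        self.W1 = 0.1 * np.random.randn(V, N)        # encoder matrix (assumed name)
        self.W2 = 0.1 * np.random.randn(N, V)        # decoder matrix (assumed name)
        for epoch in range(self.epochs):
            loss = 0.0
            for w_t, w_c in training_data:
                h = self.W1[w_t]                     # one-hot lookup: h = x W1
                f = h @ self.W2                      # scores over the word list
                p = np.exp(f - f.max()); p /= p.sum()  # softmax
                dE = len(w_c) * p                    # summed softmax gradients
                for c in w_c:
                    loss -= np.log(p[c])
                    dE[c] -= 1
                dW2 = np.outer(h, dE)                # dL/dW2
                dh = self.W2 @ dE                    # dL/dh
                self.W2 -= self.lr * dW2
                self.W1[w_t] -= self.lr * dh         # only row w_t is updated
            if epoch % 1000 == 0:
                print(epoch, loss)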
hidden_n = 5
window_size = 2
min_count = 0 # minimum word count
epochs = 5000 # number of training epochs
learning_rate = 0.01 # learning rate
np.random.seed(0) # set the seed for reproducibility
corpus = ["Neural Machine Translation using word level seq2seq model".split(' ')]
MAX_LENGTH = 10
eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)
def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
lang2lang_file = './data/eng-fra.txt'
pairs = read_pairs(lang2lang_file,True)
print(random.choice(pairs))
Reading lines...
Read 170651 sentence pairs
Trimmed to 12761 sentence pairs
['je ne le vendrai pas .', 'i m not going to sell it .']
From the previously read machine-translation corpus, i.e., the paired sentences, the following code builds the sentence corpora of the input and output languages for training:
if True:
    pairs = pairs[:80000]
in_corpus = []
out_corpus = []
for pair in pairs:
    in_corpus.append(pair[0].split(' '))
    out_corpus.append(pair[1].split(' '))
print(in_corpus[:2])
print(out_corpus[:2])
hidden_n = 150
window_size = 2
min_count = 0 # minimum word count
epochs = 1 # number of training epochs
learning_rate = 0.01 # learning rate
np.random.seed(0) # set the seed for reproducibility
Training this model takes a very long time. Consider using an existing Word2Vec training library such as the multi-threaded gensim, whose use of low-level Fortran/C linear algebra libraries can yield a speedup of hundreds of times. Install it with:
pip install --upgrade gensim
For example, the following builds a Word2Vec model from a small corpus test_corpus of 2 sentences. gensim.models.Word2Vec() constructs the model, and the vectorized representation of a word can then be looked up as model.wv['am']:
import gensim
hidden_n = 8
model = gensim.models.Word2Vec(test_corpus, size=hidden_n, window=2, min_count=1,
                               workers=10, iter=10)  # note: in gensim >= 4.0 these are vector_size= and epochs=
print('am:', model.wv['am'])
The following code uses gensim to train the Word2Vec models in_vocab and out_vocab for the input and output languages:
import gensim
hidden_n = 150
window_size = 2
in_vocab = gensim.models.Word2Vec(in_corpus, size=hidden_n, window=window_size,
                                  min_count=1, workers=10, iter=10)
out_vocab = gensim.models.Word2Vec(out_corpus, size=hidden_n, window=window_size,
                                   min_count=1, workers=10, iter=10)
Because the trained Word2Vec models do not contain the special tokens "SOS", "EOS", and "UNK", these three special words are added to the word list to obtain an extended word list, and the word-vector length becomes hidden_n + 3. For these 3 special words, their vector representations can simply be random vectors:
import numpy as np
SEU_count = 3
in_SEU = np.random.rand(3, hidden_n + SEU_count)
out_SEU = np.random.rand(3, hidden_n + SEU_count)
The code below defines some helper functions for obtaining index sequences and word-vector representations from a literal sentence. indexesFromSentence() converts the words of a sentence into word-list indices. Because gensim's model word list does not contain the 3 special tokens, for an ordinary word the gensim index vocab.wv.vocab[word].index must be offset by SEU_count = 3 to obtain its index in the extended word list.
vocab_word2vec() obtains the vector representation for each word index idx of the extended word list (which includes the special tokens) from gensim's Word2Vec model. For ordinary words, the index must likewise be shifted back to the gensim word-list index, vocab.wv.index2word[idx - SEU_count]; for the special tokens, their vector representations are taken directly as SEU[idx].
SOS_token = 0
EOS_token = 1
UNK_token = 2
def tensorFromSentence(vocab, sentence):
    indexes = indexesFromSentence(vocab, sentence)
    indexes.append(EOS_token)
    return np.array(indexes).reshape(-1, 1)
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(in_vocab, pair[0])
    target_tensor = tensorFromSentence(out_vocab, pair[1])
    return (input_tensor, target_tensor)
tensorFromSentence() and tensorsFromPair() convert a sentence or a pair of sentences from strings to index sequences; as before, an end token is appended to each sentence. indexToSentence() converts a sentence from a sequence of word indices back into a string.
To replace the one-hot vectors with Word2Vec, the encoder and decoder code must be modified; derived classes can be defined:
class EncoderRNN_w2v(EncoderRNN):
    def __init__(self, input_size, hidden_size, vocab, num_layers=1):
        super(EncoderRNN_w2v, self).__init__(input_size, hidden_size, num_layers)
        self.vocab = vocab
    def word2vec(self, word_indices_input):
        return vocab_word2vec(self.vocab, word_indices_input, in_SEU, True)

class DecoderRNN_w2v(DecoderRNN):  # class name assumed; the excerpt only shows the overridden method
    def word2vec(self, word_indices_input):
        return vocab_word2vec(self.vocab, word_indices_input, out_SEU, True)
hidden_size = 256
num_layers = 1
clip = 5.  # 50.
learning_rate = 0.1
decoder_learning_ratio = 1.0
teacher_forcing_ratio = 0.5
n_iters = 70000
print_every, plot_every = 100, 100
momentum = 0.3
decay_every = 1000
encoder_optimizer = SGD(encoder.parameters(), learning_rate, momentum, decay_every)
decoder_optimizer = SGD(decoder.parameters(), learning_rate*decoder_learning_ratio,
                        momentum, decay_every)
reg = None  # 1e-2
np.random.shuffle(pairs)
train_n = (int)(len(pairs)*0.98)
train_pairs = pairs[:train_n]
valid_pairs = pairs[train_n:]
n_iters = 40000
idx_train_pairs = [tensorsFromPair(random.choice(train_pairs)) for i in range(n_iters)]
idx_valid_pairs = [tensorsFromPair(pair) for pair in valid_pairs]
trainIters(encoder, decoder, encoder_optimizer, decoder_optimizer, idx_train_pairs,
           idx_valid_pairs, True, print_every, plot_every, reg)
output:
Use the trained model for translation prediction:
From the results, the word-level Seq2Seq model predicts better than the character-level Seq2Seq model. Readers can increase the number of training iterations and tune the hyperparameters to obtain more satisfactory results.
Word embedding (Embedding) refers to building word vectorization into the model for a specific problem, i.e., adding an embedding layer in front of the problem's network model. The parameter of this embedding layer is the word-vectorization matrix, which maps a word index to its word vector; the matrix is initialized randomly and learned during model training. That is, the word vectorization and the problem-specific model are trained together.
The embedding layer is a fully connected linear layer without activation function or bias, i.e., a simplified linear layer. The code is as follows:
class Embedding():
    def __init__(self, num_embeddings, embedding_dim, _weight=None):
        super().__init__()
        if _weight is None:
            self.W = np.empty((num_embeddings, embedding_dim))
            self.reset_parameters()
            self.preTrained = False
        else:
            self.W = _weight
            self.preTrained = True
        self.params = [self.W]
        self.grads = [np.zeros_like(self.W)]
    def reset_parameters(self):
        self.W[:] = np.random.randn(*self.W.shape)
    def __call__(self, indices):
        return self.forward(indices)
    def forward(self, indices):
        # look up rows of W: equivalent to one_hot(indices) @ W (the excerpt omits these two methods)
        self.x = indices
        return self.W[indices]
    def backward(self, dout):
        # scatter-add the gradient onto the rows that were looked up
        np.add.at(self.grads[0], self.x, dout)
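A quick shape check of the layer (using the forward lookup sketched above):
emb = Embedding(10, 4)
idx = np.array([1, 3, 3])
out = emb(idx)           # rows 1, 3, 3 of W
print(out.shape)         # (3, 4)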
Figure 7-53 The input word (the one-hot vector corresponding to its index) is transformed by the embedding layer into a low-dimensional numerical vector embedded, which together with the hidden state is used as the input of the recurrent network unit to compute the output and hidden vector
As the figure shows, the embedding output takes the place of the one-hot vector as the unit's data input. For a simple encoder, output and hidden can be the same vector.
class EncoderRNN_Embed(object):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size, self.hidden_size = input_size, hidden_size
        self.embedding = Embedding(input_size, hidden_size)
        self.gru = GRU(hidden_size, hidden_size, 1)

    def forward(self, input, hidden):
        # embed each word and collect the per-moment embeddings
        # (this loop is elided in the original excerpt; sketched per the text below)
        self.embedded_x = []
        self.embedded_out = []
        for t in range(len(input)):
            embedded = self.embedding(input[t])
            self.embedded_x.append(self.embedding.x)
            self.embedded_out.append(embedded.reshape(1, *embedded.shape))
        self.embedded_out = np.concatenate(self.embedded_out, axis=0)
        output, hidden = self.gru(self.embedded_out, hidden)
        return output, hidden

    def initHidden(self):
        return np.zeros((1, 1, self.hidden_size))

    def parameters(self):
        return self.gru.parameters()

    def backward(self, dhs):
        dinput, dhidden = self.gru.backward(dhs, self.embedded_out)
        T = dinput.shape[0]
        for t in range(T):
            dinput_t = dinput[t]
            self.embedding.x = self.embedded_x[t]  # recover the original x
            self.embedding.backward(dinput_t)
Because the weight of the embedding layer is also a learnable model parameter, the reverse derivation must also compute the gradient of the loss function with respect to the embedding-layer weight, i.e., self.embedding.backward(dinput_t). The reverse derivation reduces to a derivation at each moment t, which requires the embedding layer's input self.embedding.x at that moment. Therefore, during the forward computation the embedding input at each moment must be saved, i.e., self.embedded_x.append(self.embedding.x).
As shown in Figure 7-54, the decoder also uses the output vector of the word-embedding layer as the input of the RNN unit:
Figure 7-54 The input word (the one-hot vector corresponding to its index) is transformed by the embedding layer into a low-dimensional numerical vector embedded; the output of the relu activation function and the hidden state together serve as the input of the recurrent network unit to compute the output and hidden vectors
Forward computation and reverse derivation must likewise be performed on the embedding layer; the code is as follows:
class DecoderRNN_Embed(object):
    def __init__(self, hidden_size, output_size, num_layers=1, teacher_forcing_ratio=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = 1
        self.teacher_forcing_ratio = teacher_forcing_ratio

    def initHidden(self, batch_size):
        self.h_0 = np.zeros((self.num_layers, batch_size, self.hidden_size))

    # excerpt from forward_step(): the relu output is saved as the GRU input at this moment
    #     relu_output = output.reshape(1, output.shape[0], -1)
    #     self.input.append(relu_output)  # input of gru

    def forward(self, input_tensor, hidden):
        self.input = []
        target_length = input_tensor.shape[0]
        teacher_forcing_ratio = self.teacher_forcing_ratio
        use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
        output_hs = []
        output = []
        hidden_t = hidden
        h_0 = hidden.copy()
        input_t = np.array([SOS_token])
        hs = []
        zs = []
        self.embedded_x = []
        self.relu_x = []
        for t in range(target_length):
            output_t, hidden_t, output_hs_t = self.forward_step(input_t, hidden_t)
            # save per-moment results (as in DecoderRNN; these appends are elided in the excerpt)
            hs.append(self.gru.hs)
            zs.append(self.gru.zs)
            output_hs.append(output_hs_t)
            output.append(output_t)
            if use_teacher_forcing:
                input_t = input_tensor[t]  # teacher forcing
            else:
                input_t = np.argmax(output_t)  # maximum probability
                if input_t == EOS_token:
                    break
                input_t = np.array([input_t])
        output = np.array(output)
        self.output_hs = np.array(output_hs)
        self.h_0 = h_0
        self.hs = np.concatenate(hs, axis=1)
        self.zs = np.concatenate(zs, axis=1)
        return output

    def backward(self, dZs):
        dhs = []
        output_hs = self.output_hs
        input = np.concatenate(self.input, axis=0)
        for i in range(len(input)):
            self.linear.x = output_hs[i]
            dh = self.linear.backward(dZs[i])
            dhs.append(dh)
        dhs = np.array(dhs)
        self.gru.hs = self.hs
        self.gru.zs = self.zs
        self.gru.h = self.h_0
        dinput, dhidden = self.gru.backward(dhs, input)
        for i in range(len(input)):
            dinput_t = dinput[i]
            d_embeded = self.relu.backward(dinput_t)
            self.embedding.x = self.embedded_x[i]  # recover the original x
            self.embedding.backward(d_embeded)
        return dinput, dhidden

    def backward_dh(self, dZ):
        dh = self.linear.backward(dZ)
        return dh

    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
Again, input and output word lists must be built, along with helper functions that convert between the string form and the indexed form of a sentence. The word-list class Vocab is redefined so that it contains the special start and end words "SOS" and "EOS", and words occurring fewer than min_count times are treated as the unknown word "UNK":
import numpy as np
from collections import defaultdict

SOS_token = 0
EOS_token = 1
UNK_token = 2

class Vocab:
    def __init__(self, min_count=1, corpus=None):
        self.min_count = min_count
        self.word2count = {}
        self.word2index = {"SOS": 0, "EOS": 1, "UNK": 2}
        self.index2word = {0: "SOS", 1: "EOS", 2: "UNK"}
        self.n_words = 3  # count SOS, EOS and UNK
        if corpus is not None:
            for sentence in corpus:
                self.addSentence(sentence)
            self.build()
    def build(self):
        for word in self.word2count:
            if self.word2count[word] < self.min_count:
                self.word2index[word] = UNK_token
            else:
                self.word2index[word] = self.n_words
                self.index2word[self.n_words] = word
                self.n_words += 1
vocab = Vocab()
vocab.addSentence("i am from china")
vocab.build()
print(vocab.word2index["i"])
print(vocab.index2word[4])
3
am
Create the word-list objects in_vocab and out_vocab for the input and output languages:
in_vocab = Vocab()
out_vocab = Vocab()
lang2lang_file = './data/eng-fra.txt'
pairs = read_pairs(lang2lang_file, True)
for pair in pairs:
    in_vocab.addSentence(pair[0])
    out_vocab.addSentence(pair[1])
in_vocab.build()
out_vocab.build()

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(in_vocab, pair[0])
    target_tensor = tensorFromSentence(out_vocab, pair[1])
    return (input_tensor, target_tensor)
The training process of the Seq2Seq model based on word embedding is similar to the previous one:
from train import *
from Layers import *
from rnn import *
import util

hidden_size = 256
num_layers = 1
clip = 5.  # 50.
learning_rate = 0.03
decoder_learning_ratio = 1.0
teacher_forcing_ratio = 0.5
momentum = 0.3
decay_every = 1000
encoder_optimizer = SGD(encoder.parameters(), learning_rate, momentum, decay_every)
decoder_optimizer = SGD(decoder.parameters(), learning_rate*decoder_learning_ratio,
                        momentum, decay_every)
reg = None  # 1e-2
np.random.shuffle(pairs)
train_n = (int)(len(pairs)*0.98)
train_pairs = pairs[:train_n]
valid_pairs = pairs[train_n:]
print_every, plot_every = 100, 100
n_iters = 40000
idx_train_pairs = [tensorsFromPair(random.choice(train_pairs)) for i in range(n_iters)]
idx_valid_pairs = [tensorsFromPair(pair) for pair in valid_pairs]
trainIters(encoder, decoder, encoder_optimizer, decoder_optimizer, idx_train_pairs,
           idx_valid_pairs, True, print_every, plot_every, reg)
Figure 8-55 The training and verification loss curves of the Seq2Seq model of the word embedding layer
Make predictions:
indices = np.random.randint(len(train_pairs), size=3)
for i in indices:
    pair = pairs[i]
    print(pair)
    sentence = pair[0]
    sentence = evaluate(encoder, decoder, in_vocab, out_vocab, sentence, MAX_LENGTH)
    print(sentence)
['c est une vraie commere .', 'she is a confirmed gossip .']
she is a total . .
['nous sommes meilleures qu elles .', 'we re better than they are .']
we re better than they are .
['tu es curieux hein ?', 'you are curious aren t you ?']
you are curious right ?
Using the hidden state or output at the last moment as the context vector may not capture the complete information of the input sequence, especially for long inputs. From the behavior of the previous Seq2Seq models it can be seen that the longer the sequence, the worse the prediction. If the hidden states at all moments are concatenated into one context vector, it can hold the complete input information; but because the input length varies, this context vector obviously cannot be used directly as the hidden state of the decoder, and some transformation is needed to turn it into a fixed-length vector. On the other hand, different parts of the input sequence have different influence at each moment of the decoder, and each decoder moment should pay a different degree of attention to different parts of the input. As shown in Figure 7-56, suppose the input sequence is a source-language sentence meaning "knowledge is power" and the output target sequence is the English sentence "knowledge is power". When the decoder is producing "knowledge", the source word meaning "knowledge" has a greater influence than the other two words, and when producing "is", the source word meaning "is" matters more. Thus, when the decoder makes predictions, different words of the input sequence affect different words of the output sequence differently.
Figure 7-56 Different parts of the input sequence have different predictive effects on the output sequence at different moments
The attention (Attention) mechanism means that at each moment the decoder dynamically selects the part of the input sequence most relevant to the current prediction. A weight vector is computed by comparing the decoder's input information at the current moment (the hidden state from the previous moment and the current data input) with the encoder's outputs (or hidden states) at all moments; this weight vector is then used to weight the encoder's outputs at all moments, yielding a context vector specific to the current moment. That is, the decoder has a different encoder context vector at each moment, and this context vector participates, together with the hidden state and the data input at that moment, in the decoder's computation at the current moment.
The computation at each moment $i$ of the recurrent network of the previous Seq2Seq decoder can be expressed as:

$$h_i = RNN(h_{i-1}, x_i)$$

The computation at each moment $i$ of a Seq2Seq decoder using the attention mechanism can be expressed as:

$$h_i = RNN(h_{i-1}, x_i, c_i)$$

That is, at each moment there is an additional content vector $c_i$ specific to that moment. This $c_i$ depends not only on $h_{i-1}$ and $x_i$, but also on the encoder's outputs (or hidden states) at all moments. If the encoder's outputs at all moments are the hidden states $\bar{h}_t$, then $c_i$ depends on all $\bar{h}_t,\ t = 1, 2, \cdots, T$, where $T$ is the encoder's last moment. At each moment $i$, the decoder first computes, from the encoder outputs $\bar{h} = (\bar{h}_1, \bar{h}_2, \cdots, \bar{h}_T)$ and the decoder's information at moment $i$ (such as the input hidden state $h_{i-1}$), a weight vector $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \cdots, \alpha_{iT})$, and uses this weight vector to take a weighted sum of the encoder outputs $\bar{h}$:

$$c_i = \sum_{j=1}^{T} \alpha_{ij} \bar{h}_j, \qquad \sum_{j=1}^{T} \alpha_{ij} = 1,\ \alpha_{ij} > 0$$

That is, the input context vector $c_i$ at decoder moment $i$ is the weighted average of the encoder's outputs (or hidden states).
These $\alpha_{ij}$ are computed from a set of so-called score (also called energy) values $e_{ij}$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}$$

Each $e_{ij}$ can be computed by some function $a$ from the decoder's input hidden state $h_{i-1}$ at moment $i$ and the encoder's $\bar{h}_j$ at moment $j$:

$$e_{ij} = a(h_{i-1}, \bar{h}_j)$$
Of course, $e_{ij}$ can also depend on the data input $x_i$ at the current moment. Different choices of the function $a$ give different ways of computing the score. As shown in Figure 7-57, $h_{i-1}$ and $\bar{h}_j$ can be fed through a neural network layer with a single neuron and a tanh activation as the score function.
Figure 7-57 The score function as a neural network layer with a single neuron and a tanh activation
Here the parameter $W_a$ is also a parameter to be learned. Common choices of score function also include the dot product and its scaled variant $\bar{h}_s^{\top} h_t / \sqrt{n}$ (Vaswani 2017).
Here $\bar{h}_s$ and $h_t$ denote the hidden states of the input sequence (at position $s$) and the output sequence (at position $t$), respectively, and $v_a$, $W_a$ are learnable weight parameters. Note that although the decoder's hidden state $h_t$ is written with a unified symbol, its meaning differs slightly between papers: in the Bahdanau attention paper it is the hidden state at the previous moment, while in Luong attention it is $h_t$ at the current moment.
As shown in Figure 7-58, at each moment the decoder uses this dynamically computed context together with the data input and the hidden state from the previous moment.
Figure 7-58 At each moment the decoder computes dynamic weights and uses them to take the weighted average of the encoder's outputs (or hidden vectors) at all moments, obtaining a context vector used in the decoder's computation at the current moment
Luong et al. also proposed local attention, which differs from the usual global attention as follows: the model first predicts an alignment position for the current target word in the input sequence, and then computes the context vector over a window centered on that position, as shown in the right part of Figure 7-59.
Figure 7-59 Global attention uses all the encoder's outputs (hidden states) to compute the context vector, while local attention first finds the position in the input sequence corresponding to the target position and then computes the context vector from the encoder outputs (hidden states) within a window centered on that position
As shown in Figure 7-60, the decoder calculates an attention weight vector attn_weights from the hidden state prev_hidden at the previous moment and the encoder outputs encoder_outputs at all moments; it then uses this weight vector to weight encoder_outputs and obtain an attention content vector. This vector and the embedding vector embedded of the input data are combined through a fully connected layer attn_combine; after a relu activation function, the result is input together with prev_hidden into the recurrent neural network unit gru, and the output of gru passes through a fully connected layer out to produce the final output.
In other words: an attention weight vector attn is calculated from the current input data input and the hidden state prev_hidden at the previous moment; it is used to weight the encoder's hidden state outputs encoder_outputs to obtain attn_applied, which is combined with the input embedding embedded by attn_combine; after the activation function, the result serves as the current-moment data input of the recurrent neural network unit (GRU).
Figure 7-60 The calculation process of the attention mechanism: the input and the hidden state are used to calculate an attention weight vector, which forms a weighted sum of the encoder outputs; this weighted sum is combined with the input and fed as new input data to the recurrent neural network layers, which produce the final output.
The forward-calculation and reverse-derivation code for computing the attention weight vector from the hidden state prev_hidden and the encoder output content encoder_outputs is as follows:
def attn_forward(hidden, encoder_outputs):
    # hidden: (B,D), encoder_outputs: (T,B,D)
    energies = np.sum(hidden * encoder_outputs, axis=2)  # dot-product scores, (T,B)
    energies = energies.T                                # (B,T)
    alphas = util.softmax(energies)                      # attention weights, (B,T)
    return alphas, energies

def attn_backward(d_alpha, energies, hidden, encoder_outputs):
    # hidden: (B,D), encoder_outputs: (T,B,D), d_alpha and energies: (B,T)
    d_energies = softmax_backward_2(energies, d_alpha, False)  # gradient through the softmax
    d_energies = d_energies.T                            # (T,B)
    d_energies = np.expand_dims(d_energies, axis=2)      # (T,B,1)
    d_encoder_outputs = d_energies * hidden              # (T,B,1)*(B,D) -> (T,B,D)
    d_hidden = np.sum(d_energies * encoder_outputs, axis=0)  # sum over T -> (B,D)
    return d_encoder_outputs, d_hidden
The following is the code for the forward calculation and reverse derivation of the weighted sum of encoder_outputs by the weights attn_weights:
def bmm(alphas, encoder_outputs):
    # alphas: (B,T), encoder_outputs: (T,B,D)
    encoder_outputs = np.transpose(encoder_outputs, (1, 0, 2))     # (T,B,D) -> (B,T,D)
    # batched weighted sum: context[b] = sum_j alphas[b,j] * encoder_outputs[b,j]
    context = np.einsum("bj, bjk -> bk", alphas, encoder_outputs)  # (B,T),(B,T,D) -> (B,D)
    return context

def bmm_backward(d_context, alphas, encoder_outputs):
    encoder_outputs = np.transpose(encoder_outputs, (1, 0, 2))     # (T,B,D) -> (B,T,D)
    d_alphas = np.einsum("bjk, bk -> bj", encoder_outputs, d_context)  # (B,T,D),(B,D) -> (B,T)
    d_encoder_outputs = np.einsum("bi, bj -> bij", alphas, d_context)  # (B,T),(B,D) -> (B,T,D)
    d_encoder_outputs = np.transpose(d_encoder_outputs, (1, 0, 2))     # (B,T,D) -> (T,B,D)
    return d_alphas, d_encoder_outputs
First, implement the weighted-sum operation bmm() for a batch of sequences. Let T, B, and D be the sequence length, the number of samples, and the data length at each moment, respectively. bmm() accepts a weight matrix of shape (B, T), one row of which is the weight vector of one sample; encoder_outputs is the encoder output content of shape (T, B, D), which is first transposed into a tensor of shape (B, T, D), after which np.einsum() computes, for each sample, the weighted sum of its output content with its weight vector, yielding a vector of length D.
einsum() uses string instructions to control flexible dot-product (matrix multiplication) operations. For example, in "bj, bjk -> bk" the two tensors on the left ("bj" and "bjk") are multiplied to produce the two-dimensional tensor on the right ("bk"), where each axis of a tensor is represented by a letter (instead of 0, 1, 2). This multiplication can be simulated with the following code:
# Loop over each element of the result tensor (subscript bk)
for b in range(...):
    for k in range(...):
        C[b, k] = 0
        for j in range(...):
            C[b, k] += A[b, j] * B[b, j, k]
The calculation of the weight vector and of the weighted sum of the encoder output content can be combined into an attention layer Atten, whose __call__ simply delegates to forward():

    def __call__(self, hidden, encoder_outputs):
        return self.forward(hidden, encoder_outputs)
The following code implements the decoder for the simple attention mechanism above:
from Layers import *
from rnn import *
import util

class DecoderRNN_Atten(object):
    def __init__(self, hidden_size, output_size, num_layers=1, teacher_forcing_ratio=0.5,
                 dropout_p=0.1, max_length=MAX_LENGTH):
        super(DecoderRNN_Atten, self).__init__()
        # self.layers = [self.embedding, self.attn, self.attn_combine, self.gru, self.out]
        self.layers = [self.embedding, self.attn_combine, self.gru, self.out]
        self._params = None
        self.use_dropout = False
        if training:
            self.embedded_x.append(self.embedding.x)
            if self.use_dropout:
                self.dropout_mask.append(self.dropout._mask)
            self.attn_x.append((self.attn.alphas, self.attn.energies, self.attn.hidden,
                                self.attn.encoder_outputs))
            self.attn_combine_x.append(self.attn_combine.x)
            self.relu_x.append(self.relu.x)
            self.gru_x.append((relu_out, self.gru.h))
            self.gru_hs.append(self.gru.hs)  # keep the hidden states of the middle layer
            self.gru_zs.append(self.gru.zs)  # keep the calculation results of the middle layer
            self.out_x.append(self.out.x)
        return output, hidden, output_hs_t
        hidden_t = encoder_outputs[-1].reshape(1, encoder_outputs[-1].shape[0],
                                               encoder_outputs[-1].shape[1])
        output = []
        output_hs = []
        self.gru_x = []  # gru inputs
        self.gru_hs = []
        self.gru_zs = []
        self.dropout_mask = []
        self.embedded_x = []
        self.relu_x = []
        self.attn_x = []
        self.attn_combine_x = []
        self.attn_weights_seq = []
        self.out_x = []
        # encoder_outputs = np.pad(self.encoder_outputs,
        #     ((0, self.max_length - self.encoder_outputs.shape[0]), (0, 0), (0, 0)), 'constant')
        for t in range(target_length):
            output_t, hidden_t, output_hs_t = self.forward_step(input_t, hidden_t,
                                                                encoder_outputs)
            output_hs.append(output_hs_t)
            output.append(output_t)
            if use_teacher_forcing:
                input_t = input_tensor[t]  # teacher forcing: feed the ground-truth token
            else:
                input_t = np.argmax(output_t)  # feed the token with maximum probability
                if input_t == EOS_token:
                    break
                input_t = np.array([input_t])
            self.relu.x = self.relu_x[i]
            d_relu_x = self.relu.backward(drelu_out)
            d_attn_combine_out = d_relu_x
            self.attn_combine.x = self.attn_combine_x[i]
            d_attn_combine_x = self.attn_combine.backward(d_attn_combine_out)
            d_embedded, d_attn_out = d_attn_combine_x[:, :self.hidden_size], \
                                     d_attn_combine_x[:, self.hidden_size:]
            self.attn.alphas, self.attn.energies, self.attn.hidden, \
                self.attn.encoder_outputs = self.attn_x[i]
            dprev_hidden_2, d_encoder_outputs_2 = self.attn.backward(d_attn_out)
            if self.use_dropout:
                self.dropout._mask = self.dropout_mask[i]
                d_embedding = self.dropout.backward(d_embedded)
            else:
                d_embedding = d_embedded
            dprev_hidden += dprev_hidden_2
            d_encoder_outputs += d_encoder_outputs_2  # must be accumulated at every moment
        # d_encoder_outputs[input_T-1] += dprev_hidden[0]
        d_encoder_outputs[-1] += dprev_hidden[0]
        return dprev_hidden, d_encoder_outputs
    def parameters(self):
        if self._params is None:
            self._params = []
            for layer in self.layers:
                for i, _ in enumerate(layer.params):
                    self._params.append([layer.params[i], layer.grads[i]])
        return self._params
The results do not seem to have improved much. Interested readers can try increasing the number of iterations, tuning the learning parameters, and especially using different attention mechanisms to obtain better results.
Chapter 8 Generative Models
Data is the basis of machine learning and modern artificial intelligence. The more data, the better the performance of machine learning algorithms. It is precisely because of their huge amounts of data that large companies can develop high-performance artificial intelligence products such as search engines, recommendation systems, and intelligent games. It is often said that "whoever owns the data owns the future." Big data is also one of the key factors in the re-emergence of neural networks and the development of deep learning.
For many problems, manually obtaining data (such as medical imaging data) is usually very difficult and expensive. For example, to improve the performance of face-related algorithms such as face recognition, a large amount of face image data is required, and collecting these face images not only requires user authorization but also comes at a certain price. If face images that are difficult to distinguish from real faces can be generated automatically, costs can be saved and face-related research and applications promoted. As another example, video games and film and television works contain a large number of two-dimensional and three-dimensional scenes; designing and producing these scenes requires a lot of manpower, material, and financial resources, making the cost of shooting a movie or making a game very high. If the high-quality scenes in these works could be generated automatically, a great deal of money and manpower could be saved, letting product developers focus on more creative work.
The generative model in machine learning specializes in how to use computers to automatically generate data similar to real data; that is, a generative model can automatically generate fake data that is indistinguishable from real data. The language model in natural language understanding in Chapter 7 is a generative model: a good language model can generate fluent sentences for applications such as machine translation, chat dialogue, and article generation. Typically, once trained, a recurrent neural network can be used to generate a steady stream of sequence data.
Therefore, automatically generated data can remedy the lack of data for many research problems, not only improving the performance of machine learning algorithms for related problems but also contributing to the development of various application products. For example, automatic face generation technology is used in face applications such as video face replacement (such as DeepFake), automatic image generation can produce images of various styles, and automatic speech synthesis can synthesize voices similar to real people, as well as automatic composition, and so on.
This chapter mainly discusses the two most popular generative model technologies based on deep neural networks (deep learning): Variational Auto-Encoders (VAE) and Generative Adversarial Networks (GAN).
Every face image in the world is different, but no matter how different, people can tell at a glance that an image is a human face rather than a cat, a dog, or a plant. Suppose all face images are represented by tensors of the same shape, such as three-dimensional red-green-blue image tensors. For example, if a face image is represented by a 3 × 1024 × 768 tensor, it contains 1024 × 768 pixels, each pixel consisting of red, green, and blue color values, so a face image contains 3 × 1024 × 768 variable values. If x denotes this tensor, then x is a data point in a 3 × 1024 × 768 dimensional linear space, and each face image corresponds to a coordinate point in this linear space. The data points corresponding to all face images are not random; they are usually located in a small subspace of this space, just as points on a straight line in a two-dimensional plane are all distributed along that line. x is a random variable: the coordinate points of all face images in this large linear space obey some specific probability distribution law, i.e. some probability density, but this probability distribution cannot be expressed by an analytical mathematical expression.
If a model can automatically generate face images that are indistinguishable from real face images, then these automatically generated images must obey the underlying probability distribution of real face images. Thus, generative modeling is all about generating artificial data whose probability distribution is the same as (or as similar as possible to) that of the real data. For example, if the real data are all data points on a circle in the plane, and the generated data points are also located on this circle, we say the generated data points satisfy the distribution of the circle. If the real data are real numbers on the number axis satisfying a certain probability distribution, for example uniformly distributed on the interval [0,1], and the real numbers produced by the generative model are also uniformly distributed on [0,1], then the generated real numbers and the real real numbers have the same probability distribution, and the two sets of numbers are indistinguishable. But usually we only have these real numbers and do not know their underlying probability distribution. How can we generate real numbers with the same probability distribution as these real real numbers? This is the problem generative models are designed to solve.
A generative model is usually a parametric model, such as a parameterized neural network function. In order to obtain a generative model that can generate fake data obeying the same distribution as the real data, it is necessary to learn the parameters of the parameterized generative model from the real data, just as the parameters of a regression model are learned from real data. Once the parameters of the parametric generative model are determined, fake data obeying the distribution of the real data can be generated automatically by the resulting model, which means these fake data and the real data are indistinguishable.
Of course, the distribution of data generated by the generative model cannot be exactly the same as the
distribution of real data. The closer the two distributions are, the harder it is to distinguish the generated data
from the real data.
Suppose there is a set of real numbers, that is, each real datum is a real number located on the number axis, but their distribution on the axis is unknown. How can we generate fake real numbers that are difficult to distinguish from these real real numbers, that is, fakes that obey the distribution of the real ones? For example, suppose these real numbers are the heights of people in Hainan Province. If the generated height data does not obey the distribution of these height data, it can easily be identified.
For low-dimensional data such as a set of real numbers, the frequency can be used to approximate the probability distribution of the data through simple statistical calculations. For example, below is a set of real numbers from the file "real_values.npy"; this set {x^{(i)}} constitutes the real data.
import numpy as np
x = np.load('real_values.npy')
print(x.shape)
print(x[:5])
(10000,)
[4.88202617 4.2000786 4.48936899 5.1204466 4.933779 ]
How is this set of real numbers distributed in the real number space R? In other words, what probability distribution do they have? The real number axis can be divided into many small intervals, and the frequency of real numbers falling in each small interval can be counted. As long as there is enough data, this frequency is close enough to the probability, so the probability distribution of these real numbers in R can be understood. The code below shows the frequency distribution approximating the probabilities for this set of data, in histogram and curve form:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
draw_hist(plt,x,26)
plt.show()
Through this histogram, it can be observed that the distribution of these real numbers is close to the Gaussian
distribution, and the center point (mean value) of the Gaussian distribution is about 4.0. It is also easy to
calculate that the standard deviation of this set of real numbers is about 0.5.
For this set of real numbers, you can also use the kdeplot() function of the seaborn library to draw the probability
density, which is simpler:
import seaborn as sns
sns.set(color_codes=True)
sns.kdeplot(x.flatten(), shade=True, label='Probability Density')
In fact, these real numbers were indeed sampled from a normal distribution with mean 4 and standard deviation 0.5; they were generated with the following code:
import numpy as np
np.random.seed(0)
mu = 4
sigma = 0.5
M = 10000
x = np.random.normal(mu, sigma, M)
print(x[:5])
np.save('real_values.npy', x)
That is, this set of real numbers {x^{(i)}} obeys the Gaussian distribution N(4, 0.5) with a mean of 4 and a standard deviation of 0.5, shown in Figure 8-3.

Figure 8-3 Gaussian distribution (normal distribution) N(4, 0.5) with a mean of 4 and a standard deviation of 0.5
It can be seen that the real numbers near the mean value 4 have a higher probability of being sampled, and the
real numbers farther away from 4 have a lower probability of being sampled. Therefore, this group of real
numbers is the data in the one-dimensional real number space R, and they satisfy the Gaussian distribution with
a mean of 4 and a standard deviation of 0.5. If the distribution law is found, this probability distribution law can
be directly used to generate real numbers that conform to this distribution law.
For high-dimensional data, using the above frequency-counting method to find the distribution of real data in the high-dimensional data space is not only computationally intensive but also unrealistic. For example, a real face dataset is a collection of face images. If each face image contains 256×256 pixels and each pixel is represented by 3 colors (red, green, and blue), then each image has 256×256×3 = 196608 values; that is, the dimension of the image is 196608, and all these face images live in a 196608-dimensional space, each face image being a data point in this high-dimensional space. How are these face images distributed in this space? Directly estimating the probability (density) distribution p(x_1, x_2, ⋯, x_196608) is an impossible task, because the dimension is very large.
For high-dimensional data, it is necessary to learn a parameterized generative model from the real data, so that data similar to the real data can be generated from this model. Some generative models directly represent the probability distribution or allow it to be calculated directly, while others do not represent the probability distribution themselves, but the distribution of the data they generate is very close to the distribution of the real data; that is, the model is used directly to generate data rather than to calculate the probability distribution of the real data. The generative models discussed below (VAE and GAN) directly generate data.
From a mathematical point of view, a generative model learns a parameterized generative model function G(z|θ) from a set of real data (such as a set of real numbers or a set of faces). Once the parameter θ is determined, this function is determined. This function maps a hidden variable z to a piece of data, and the space where z lives is usually a low-dimensional linear space with a much lower dimension than the real data; for example, z may be a short vector while the real data are multi-megapixel images. Different z produce different G(z). If the probability distribution p_fake satisfied by G(z) is close to the distribution p_real of the real data, such a generative model function can be used to generate fake data.
Therefore, building a generative model means finding a generative model function G(z) such that data G(z) similar to the real data can be generated from a random variable (vector) z. Different random variables z produce different generated data G(z), and the distribution law of these G(z) should be very close to that of the real data x.
As shown in Figure 8-4, many real face images can be used to learn a generative model function of a face image.
Using this function to sample in the latent space (such as a vector) can generate a fake face image.
Figure 8-4 From many real face images, learn a generative model function that maps hidden vectors to face images; sampling a hidden vector in the latent space and feeding it to this generative model function yields a realistic face image
There are three main types of generative models that use neural networks (deep learning) as the generative model function: Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), and autoregressive models (such as PixelRNN).
For example, the following is a face image generated by a GAN (ThisPersonDoesNotExist.com); it can be seen that the generated face image is difficult to distinguish from a real face image.

Figure 8-5 Counterfeit face image generated by a generative adversarial network (GAN)
8.2 Autoencoders
Before introducing the variational autoencoder, let us first introduce the related autoencoder; understanding the autoencoder helps in understanding the variational autoencoder.
8.2.1 Autoencoder
A neural network for classification or regression problems maps an input x to an output y; that is, the neural network is a mapping y = f(x) from x to y, where x is the data feature and y is a target different from x. What happens if y and x are the same, i.e. the network is an identity mapping x = f(x)? If the number of neurons in each layer is the same as the number of features of x, then each neuron can simply output one of the feature components through the identity mapping, as shown in Figure 8-6.
If the number of neurons in the middle hidden layer is different from the number of features, specifically fewer than the number of features, as shown in Figure 8-7, then the input features must pass through this "bottleneck" before being output. If such a neural network can still reconstruct the original input (that is, the output of the network is the same as the input), this indicates that the activation output vector of the bottleneck layer contains all the information of the input; that is, the activation output of the bottleneck layer is actually a compressed representation of the input data, just as a compressed file contains virtually all the information of the original file. In other words, the representation of the bottleneck layer captures the intrinsic relationships (intrinsic structure) between the features of the input data. It also shows that the features of the original data are not independent but correlated. For example, adjacent pixels of an image have similar colors, i.e. these adjacent pixels are correlated. It is precisely because the pixels in an image are correlated that an image compression algorithm can compress the image into smaller data and restore the original image by decompression.
Figure 8-7 For a 2-layer neural network whose output can reconstruct the input, the output of a hidden layer with fewer neurons than input features contains all the information of the input needed to reconstruct it; that is, the activation output vector of the hidden layer is actually a compressed representation of the input data
If the features of a piece of data are independent of each other, these features cannot be fully captured by the bottleneck layer's compressed representation; many input features will inevitably be lost, so that the input cannot be reconstructed.
The hidden layer of the neural network is a transformation of data features. The output of the hidden layer whose
number of neurons is less than the number of original data features is a compressed representation of the original
data. The neural network can automatically learn the intrinsic characteristics of the data, so it is also called
feature learning.
Data such as images are often high-dimensional data, while their essential features are generally low-
dimensional. Representing raw data with low-dimensional data features can improve the efficiency and
performance of machine learning algorithms, such as reducing memory consumption and computation, and
speeding up algorithm convergence. For example, a face image may contain millions of pixels, but in machine
learning, the low-dimensional features of the face are often used to represent the face, such as using PCA
dimensionality reduction technology to represent the face as a vector of dozens of values.
Data can be represented by different features. For example, a circle can be represented by the many pixels (points) on it; this kind of pixel map representing a circle is called a bitmap. It can also be represented by many straight line segments approximating the circle. Both representations require many values to represent a high-quality circle. A circle can also be expressed as just three values: the radius of the circle and the coordinates of its center; the coordinates and radius are the intrinsic characteristics of the circle. These three ways of representing circles describe the characteristics of circles from different angles. Similarly, any other kind of data can have multiple representations, and different representations of the data describe different characteristics of the data from different angles.
Selecting the appropriate feature representation of data is the key to determining the success of machine
learning. One of the main goals of machine learning efforts in recent decades is how to find low-dimensional and
more essential feature representations from the high-dimensional feature representations in the original form of
data. Finding its low-dimensional feature representation from high-dimensional data is called feature
engineering. Designing various artificial features has been the main research goal of researchers in the field of
artificial intelligence in the past few decades. For different problem data, people have proposed various feature
dimensionality reduction techniques and designed various artificial features. With the rise of deep learning, using
neural networks to automatically learn features frees researchers from time-consuming and laborious manual
feature engineering, so that they can focus on more innovative work.
An autoencoder (AE) is a technology that uses a neural network with a bottleneck layer to automatically learn data features. When training the neural network, the target of each sample is the data itself. When the output of the neural network can reconstruct the input, the bottleneck layer of the network is a low-dimensional feature or low-dimensional representation of the data. As shown in Figure 8-8, this neural network is regarded as two functions: the part from the data input layer to the bottleneck layer is regarded as one function, called the encoder; the encoder accepts input data and produces a vector of lower dimension than the input, called the hidden vector. The part from the bottleneck layer to the reconstructed output layer is regarded as another function, called the decoder; the decoder accepts the hidden vector as input and produces an output with the same shape as the input data, and this output should reconstruct the input data as closely as possible. The error between the decoder's output and the encoder's input constitutes the loss of the autoencoder, called the reconstruction loss. By minimizing this loss, the output of the decoder and the input of the encoder are made as equal as possible, i.e. the output of the decoder can reconstruct the input. For an input datum, the hidden vector output by the encoder is a low-dimensional compressed representation of that datum, which captures some of its inherent essential characteristics.
Figure 8-8 The structure of the automatic encoder: a digital image is input to the encoder, the hidden vector
output by the encoder is used as the input of the decoder, and the digital image output by the decoder
reconstructs the input of the encoder.
Therefore, the encoder can encode a high-dimensional datum x into a low-dimensional vector z, and the decoder can map this low-dimensional vector z back to the original data space to obtain a datum x′ very close to x. The output z of x through the encoder is called the hidden vector, and the linear space formed by all possible hidden vectors is called the hidden space.

Let the encoder function be z = q_θ(x), which maps an input x to a latent vector z, and the decoder function be x′ = p_α(z), which maps a hidden vector z to a datum x′ of the same shape as the encoder input x. x′ should be as equal to x as possible; of course, x and x′ cannot be exactly the same, and there will be some error. θ and α are the model parameters of the encoder and decoder respectively; once θ and α are determined, the encoder and decoder functions are determined.
For a trained autoencoder, the decoder is a generator function that can generate (produce) data similar to real data from a hidden vector.

For example, as shown in Figure 8-8, an autoencoder for MNIST handwritten digit images can be trained. A handwritten digit image of shape 28 × 28 is input to the encoder directly or after being flattened into a vector of length 784. The encoder outputs a hidden vector z of a certain length (for example, 10), which is input to the decoder, which in turn outputs a vector of length 784 or an image of size 28 × 28.
The main function of the autoencoder is to compress the data: the hidden vector has a lower dimension than the input data. A data sample is mapped to the hidden vector by the encoding function and then mapped back to (a reconstruction of) itself by the decoding function. The encoding and decoding process of an autoencoder is similar to data compression: compression software compresses a file (folder) into a smaller file and then restores the original file (folder) by decompression. The difference between the decompressed file and the original file is the compression error. If the compressed-then-decompressed file is exactly the same as the original, the compression is lossless; otherwise it is lossy.
The encoding and decoding of the autoencoder is a kind of lossy compression: x is encoded into a hidden vector z, and the x′ decoded from z is not exactly the same as x, but very close.
In order to learn the parameters θ, α of the encoder and decoder functions, all real data x are used to form supervised learning training samples (x, x) (that is, the target value of a sample is the input data itself) to train the encoder-decoder model. The loss function of the autoencoder is:

L(x, x̂) + L_regularizer

That is, it includes the reconstruction error and a regularization term L_regularizer that prevents overfitting.
The autoencoder can also be used to denoise data: when training the autoencoder, simply use the noisy version and the clean version of the data as the data feature and the target value of each training sample respectively, that is, samples (x_noise, x_denoise), where x_noise and x_denoise are the noisy and noise-free data respectively.
Besides using a bottleneck layer, a sparsity penalty such as the L1 penalty ∑_i |a_i^{(h)}| can be added to the loss, where a_i^{(h)} is the activation output of the hidden layer. This penalty term forces these values to be as close to 0 as possible, that is, it has the effect of "making non-zero values as small as possible" (sparseness). The sparsity constraint plays a similar role to the bottleneck layer. Autoencoders that employ sparsity constraints are called sparse autoencoders.
Another commonly used sparsity constraint is the KL divergence constraint. Let ρ̂_j = (1/m) ∑_i [a_j^{(h)}(x^{(i)})] denote the average activation value of the j-th hidden unit over the m samples; it can be regarded as the parameter of a Bernoulli random variable, so that the difference between the ideal distribution ρ and the observed distribution ρ̂_j can be represented by the KL divergence:

L(x, x̂) + ∑_j KL(ρ || ρ̂_j)
def read_mnist():
    if not os.path.isfile("mnist.pkl.gz"):
        # download the dataset if it is not cached locally
        urllib.request.urlretrieve("http://deeplearning.net/data/mnist/mnist.pkl.gz",
                                   "mnist.pkl.gz")

def draw_mnists(plt, X, indices):
    for i, index in enumerate(indices):
        plt.subplot(1, 10, i+1)
        plt.imshow(X[index].reshape(28, 28), cmap='Greys')
        plt.axis('off')
print(train_X.dtype)
print(train_X.shape)
print(valid_X.shape)
print(np.mean(train_X[0]))
draw_mnists(plt,train_X,range(10))
plt.show()
float32
(50000, 784)
(10000, 784)
0.13714226
Then define an autoencoder neural network, and use the samples in the training set train_X both as data inputs and as target values to train this neural network:

import util
import train
np.random.seed(100)
nn = NeuralNetwork()
nn.add_layer(Dense(784, 32))
nn.add_layer(Relu())        # alternatives: Leaky_relu(0.01), Sigmoid()
nn.add_layer(Dense(32, 784))
nn.add_layer(Sigmoid())

X = train_X
epochs = 5
print_n = 150
losses = train_nn(nn, X, X, optimizer, loss_fn, epochs, batch_size, reg, print_n)
0 iter: 181.4754917881575
195 iter: 37.86314183909435
390 iter: 26.37453076661517
585 iter: 23.174562871581397
780 iter: 18.48867272781079
975 iter: 17.106892623912394
1170 iter: 14.298662482564286
1365 iter: 13.615108972766208
1560 iter: 12.110143611597861
1755 iter: 11.548596796674369
What is the result of reconstructing some digit images with the following code?
def draw_predict_mnists(plt, X, indices):
    for i, index in enumerate(indices):
        aimg = train_X[index]
        aimg = aimg.reshape(1, -1)
        aimg_out = nn(aimg)            # reconstruct the image with the autoencoder
        plt.subplot(2, 10, i+1)
        plt.imshow(aimg.reshape(28, 28), cmap='gray')
        plt.axis('off')
        plt.subplot(2, 10, i+11)
        plt.imshow(aimg_out.reshape(28, 28), cmap='gray')
        plt.axis('off')

draw_predict_mnists(plt, train_X, range(10))
plt.show()
Figure 8-11 Target images and reconstructed images of the autoencoder trained with learning rate 0.001 and epochs=100; the upper row shows the target images and the lower row the reconstructed images

It can be seen that the output images almost reconstruct the input images. Of course, the parameters and training process of the network can be tuned to produce better results.
As an exercise, the reader can add noise to the inputs of the training samples; the training code does not need to be modified to give the network an image-denoising capability, as sketched below. In addition, the fully connected neural network here can also be replaced by a convolutional neural network; an autoencoder using a convolutional network is called a convolutional autoencoder.
VAE is an enhancement of the traditional autoencoder (AE). Figure 8-12 shows the working process of VAE:

Figure 8-12 Variational autoencoder. The encoder outputs the parameters of a probability distribution; a hidden vector sampled according to this distribution is used as the input of the decoder, which outputs data with the same shape as the encoder input. The error between the two serves as the loss function value
Unlike AE, which maps a datum (such as an image) to a fixed vector in the latent space, VAE maps data to a probability distribution (actually, to the parameters of a probability distribution); for example, it maps an image x to Gaussian distribution parameters, outputting the mean parameter μ and variance parameter σ² of the Gaussian distribution.
If a hidden-vector data point is randomly sampled from this probability distribution, then, by the nature of the Gaussian distribution, this data point will be concentrated near μ; for example, the sampled data point is z = μ + σ·ϵ, where ϵ is a random perturbation. Each such sampling point z will be mapped by the decoder to an image x′; because these z surround μ, the decoded x′ are also very similar images. That is, a continuously varying hidden vector z generates continuously changing data x′, which makes the data more structured in the hidden space, so that the hidden vector can be edited meaningfully and the data can be changed and controlled as needed.
The encoder and decoder functions are determined by the neural network parameters ϕ and θ. The output of the encoder network gives the parameters of the probability density of the hidden vector z (assumed to be the parameters μ, σ² of a Gaussian distribution); that is, for each input x, the q_ϕ(z|x) output by the encoder is a parameterized probability distribution, which expresses the likelihood (probability) that x maps to different hidden variables z. The decoder function p_θ(x|z) maps the hidden variable z to an output with the same shape as the input x, and this output can also be a probability. For example, if x is a 28 × 28 handwritten digit image where the value of each pixel is 1 or 0, then the output of the decoder p_θ(x′|z) is also a 28 × 28 tensor, meaning that each position gives the probability of the corresponding value (such as 1 or 0) of the input x.
Input an x into the encoder-decoder pipeline of the VAE; the perfect reconstruction output is x itself, and the actual output x′ should be as close to x as possible. In order to reconstruct the input x as well as possible, the probability that the output is x should be as large as possible, that is, p_θ(x|z) should be maximized. Since the decoder of the VAE outputs not the data itself but the probability p_θ(x|z) of different data x, the reconstruction loss corresponding to maximizing this probability is the negative log-likelihood −log(p_θ(x|z)) to be minimized, plus a regularization term; the loss for sample x^{(i)} is:
L_i(θ, ϕ) = −E_{z∼q_ϕ(z|x^{(i)})}[log p_θ(x^{(i)}|z)] + KL(q_ϕ(z|x^{(i)}) ∥ p(z))
The first term is the reconstruction term. For each input x^{(i)}, the encoder output q_ϕ(z|x^{(i)}) is a distribution over the random variable z, and for each z the probability that the output is x^{(i)} is p_θ(x^{(i)}|z). The expected log probability E_{z∼q_ϕ(z|x^{(i)})}[log p_θ(x^{(i)}|z)] is therefore the expected log probability that the reconstruction output is x^{(i)}; maximizing the probability of reconstructing x^{(i)} means maximizing this expected log probability, that is, minimizing the negative expected log probability −E_{z∼q_ϕ(z|x^{(i)})}[log p_θ(x^{(i)}|z)].
The second term of the loss function is the regularization term, which uses the Kullback-Leibler divergence to represent the distance between the distribution q_ϕ(z|x^{(i)}) of z and the standard normal distribution p(z) = N(0, 1), i.e. to describe their similarity. Using this term as a regularizer (penalty term) pushes the probability distribution of z to be as close to the standard normal distribution as possible, just as weight regularization pushes the weight parameters of a neural network to be as close to 0 as possible. On the one hand, any probability distribution can be approximated by a multivariate normal distribution; on the other hand, any normal distribution can be converted into a standard normal distribution by a transformation. Therefore, z can be assumed to follow a standard normal distribution.
For m samples x^{(i)}, the total loss function is the sum of the per-sample losses, i.e. ∑_{i=1}^{m} L_i.

The KL divergence between two multivariate Gaussian distributions N_0 = N(μ_0, Σ_0) and N_1 = N(μ_1, Σ_1) has a closed form:

D_KL(N_0 ∥ N_1) = ½ { tr(Σ_1^{−1} Σ_0) + (μ_1 − μ_0)^T Σ_1^{−1} (μ_1 − μ_0) − k + ln(|Σ_1| / |Σ_0|) }
where k is the dimensionality of the vector space. The KL divergence describes how similar two distributions are. Assuming that the mean vector and covariance matrix of the Gaussian distribution of the hidden variable z of the variational autoencoder are μ(z) and Σ(z), the KL divergence D_KL[N(μ(z), Σ(z)) ∥ N(0, 1)] between this distribution and the standard normal distribution can be expressed as:

D_KL[N(μ(z), Σ(z)) ∥ N(0, 1)] = ½ (tr(Σ(z)) + μ(z)^T μ(z) − k − log det(Σ(z)))
Here k is the dimension of the Gaussian distribution, tr(Σ(z)) is the trace of the covariance matrix Σ(z) (the sum of its diagonal elements), and det(Σ(z)) is the value of its determinant. Any multivariate Gaussian distribution can always be transformed by a linear change of variables into a Gaussian distribution whose covariance matrix is diagonal, i.e. Σ(z) can be regarded as a diagonal matrix. Thus the above formula can be simplified as:
D_KL[N(μ(z), Σ(z)) ∥ N(0, 1)] = ½ (∑_j σ_j² + ∑_j μ_j² − ∑_j 1 − log ∏_j σ_j²)
  = ½ (∑_j σ_j² + ∑_j μ_j² − ∑_j 1 − ∑_j log σ_j²)
  = ½ ∑_j (σ_j² + μ_j² − 1 − log σ_j²)
  = −½ ∑_{j=1}^{k} (1 + log(σ_j²) − μ_j² − σ_j²)
Here σ_j² is the j-th diagonal element of the diagonal matrix Σ(z). In practice, using log σ_j² instead of σ_j² is more numerically stable, because the logarithm log is more stable than the exponential exp and is not prone to overflow. Therefore, what the encoder actually outputs is not the variance σ_j² itself but its logarithm log σ_j².

Given the μ and log σ² of this probability distribution, how is a latent variable z obtained from this multivariate Gaussian distribution? Only by inputting a hidden variable z into the decoder can a decoder output be obtained. This requires sampling the Gaussian distribution to get a sample z, which is then fed into the decoder. However, the sampling operation on a probability distribution cannot be differentiated. For this reason, the paper uses a "reparameterization trick" that transforms sampling from a general Gaussian distribution z ∼ N(μ, Σ) into sampling from the standard normal distribution u ∼ N(0, 1), because between z and u there is a simple linear transformation:
z = μ + Σ^{1/2} u
According to this transformation, as long as the standard normal distribution u ∼ N(0, 1) is sampled to obtain a sample value ϵ, a sample from the general normal distribution z ∼ N(μ, Σ) is obtained as:

z = μ + Σ^{1/2} ϵ = μ + σ ϵ = μ + (e^{½ log σ²}) ϵ
Randomly sampling the standard normal distribution N(0, 1) makes the sampling operation itself no longer depend on μ and log σ², so no derivatives of the sampling are needed; that is, ϵ does not depend on μ and log σ². Writing E = log σ², we have z = μ + e^{E/2} ϵ, so the gradient of the reconstruction loss with respect to μ is du = dz, and with respect to E it is dE = dz × ϵ × ½ e^{E/2}. Knowing du and dE, the model parameters of the encoder can be derived in reverse; the process is the same as the usual backward derivation of a neural network.

The gradients of the KL loss −½ ∑_j (1 + E_j − μ_j² − e^{E_j}) with respect to μ and E are:

du = μ
dE = −½ (1 − e^E)
In order to avoid overly long training times, only certain handwritten digit images (such as the digits 1, 2, 7) may be selected for training. The auxiliary function choose_numbers() extracts from the training set (X, Y) those digit images whose label Y is in numbers; for example, choose_numbers(train_X, train_y, [1,2,7]) extracts from train_X the digit images whose labels are 1, 2, or 7.
def choose_numbers(X, Y, numbers):
    X_ = []
    for i in range(len(X)):
        if Y[i] in numbers:       # keep only images whose label is in numbers
            X_.append(X[i])
    return np.array(X_)
#X = choose_numbers(train_X, train_y,[1,2,7])
X = train_X
The VAE's encoder and decoder are two neural networks:
from NeuralNetwork import *
from util import *
np.random.seed(100)

input_dim = 784
hidden = 256
nz = 2          # dimension of the latent Gaussian

encoder = NeuralNetwork()
encoder.add_layer(Dense(input_dim, hidden))
encoder.add_layer(Relu())
encoder.add_layer(Dense(hidden, hidden))
encoder.add_layer(Relu())
encoder.add_layer(Dense(hidden, 2*nz))   # outputs both mu and logvar

decoder = NeuralNetwork()
decoder.add_layer(Dense(nz, hidden))
decoder.add_layer(Relu())
decoder.add_layer(Dense(hidden, hidden))
decoder.add_layer(Relu())
decoder.add_layer(Dense(hidden, input_dim))
decoder.add_layer(Sigmoid())
Here nz is the dimension of the Gaussian latent space; for example, nz=2 means a two-dimensional multivariate Gaussian distribution.

The VAE model is composed of an encoder and a decoder. The following VAE class contains an encoder and a decoder. Its method forward() passes the input x through the encoder to produce the outputs μ (mu) and log σ² (logvar); after reparameterized sampling, the sampled z is obtained, and the output out is then produced by the decoder. The method backward() uses the loss function specified by the parameter loss_fn to first calculate the reconstruction loss loss_fn(out, x) between the input x and the output out. From the gradient loss_grad of this loss with respect to out, it calls the decoder's backward() to compute the gradients of the reconstruction loss with respect to the decoder's model parameters and the gradient dz with respect to the sampled z. From dz, the gradients du and dE with respect to the encoder outputs are computed, giving the gradient vector duE of the reconstruction loss with respect to the encoder output; adding the gradients of the KL loss with respect to u and E and calling the encoder's backward() computes the gradients of the reconstruction loss and KL loss with respect to the encoder parameters. The method train_VAE_epoch() traverses the dataset dataset once, i.e. performs one training epoch.
class VAE:
    def __init__(self, encoder, decoder, e_optimizer, d_optimizer):
        self.encoder, self.decoder = encoder, decoder
        self.e_optimizer, self.d_optimizer = e_optimizer, d_optimizer

    def encode(self, x):
        e_out = self.encoder(x)
        mu, logvar = np.split(e_out, 2, axis=1)   # first half is mu, second half logvar
        return mu, logvar

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        # reparameterized sampling: z = mu + sigma*eps with eps ~ N(0, I)
        self.rand_sample = np.random.randn(*mu.shape)
        z = mu + np.exp(logvar * .5) * self.rand_sample
        out = self.decode(z)
        return out, mu, logvar

    def __call__(self, X):
        return self.forward(X)

    # backpropagation
    def backward(self, x, loss_fn=BCE_loss_grad):
        out, mu, logvar = self.forward(x)
        # reconstruction loss and its gradient with respect to out
        loss, loss_grad = loss_fn(out, x)
        dz = self.decoder.backward(loss_grad)
        du = dz                                                 # dz/dmu = 1
        dE = dz * np.exp(logvar * .5) * .5 * self.rand_sample   # dz/dlogvar
        duE = np.hstack([du, dE])
        # KL loss and its gradients with respect to mu and logvar
        kl_loss = -0.5*np.sum(1 + logvar - mu**2 - np.exp(logvar))
        loss += kl_loss/(len(out))
        kl_du = mu
        kl_dE = -0.5*(1 - np.exp(logvar))
        kl_duE = np.hstack([kl_du, kl_dE])
        kl_duE /= len(out)
        self.encoder.backward(duE + kl_duE)
        return loss
    def train_VAE_epoch(self, dataset, loss_fn=BCE_loss_grad, print_fn=None):
        losses = []
        iter = 0
        for x in dataset:
            loss = self.backward(x, loss_fn)
            self.e_optimizer.step()
            self.d_optimizer.step()
            losses.append(loss)
            if print_fn:
                print_fn(losses)
            iter += 1
        return losses
    def save_parameters(self, en_filename, de_filename):
        self.encoder.save_parameters(en_filename)
        self.decoder.save_parameters(de_filename)

    def load_parameters(self, en_filename, de_filename):
        self.encoder.load_parameters(en_filename)
        self.decoder.load_parameters(de_filename)
The following code creates a VAE object vae and repeatedly calls its training method train_VAE_epoch() to train on the dataset provided by the iterator data_it:
lr = 0.001
beta_1, beta_2 = 0.9, 0.999
e_optimizer = Adam(encoder.parameters(), lr, beta_1, beta_2)
d_optimizer = Adam(decoder.parameters(), lr, beta_1, beta_2)
loss_fn = mse_loss_grad    # or BCE_loss_grad
batch_size = 64
vae = VAE(encoder, decoder, e_optimizer, d_optimizer)

start = time.time()
epochs = 30
print_n = 1
epoch_losses = []
for epoch in range(epochs):
    data_it = data_iterator_X(X, batch_size)
    epoch_loss = vae.train_VAE_epoch(data_it, loss_fn)
    epoch_loss = np.array(epoch_loss).mean()
    if epoch % print_n == 0:
        print('Epoch {}, Training loss {:.2f}'.format(epoch, epoch_loss))
    epoch_losses.append(epoch_loss)
end = time.time()
print('Time elapsed: {:.2f}s'.format(end - start))
# vae.save_parameters("vae_en.npy", "vae_de.npy")
Figure 8-13 Training loss curve of the variational autoencoder on the MNIST handwritten digits
The following code uses the trained VAE to encode and decode MNIST: a handwritten digit image is input, in the hope of reconstructing it:

def draw_predict_mnists(plt, vae, X, n):
    f, axarr = plt.subplots(2, n)
    for i in range(n):
        out, mu, logvar = vae(X[i].reshape(1, -1))   # encode then decode one image
        axarr[0, i].imshow(X[i].reshape((28, 28)), cmap='Greys')
        axarr[1, i].imshow(out.reshape((28, 28)), cmap='Greys')
        if i == 0:
            axarr[1, i].set_title('reconstruction')

draw_predict_mnists(plt, vae, test_X, 10)
plt.show()
Figure 8-14 Encoding and decoding results of the variational autoencoder: some digit images are reconstructed correctly and some are not; the network model structure and training parameters need further improvement

It can be seen that some digit images are reconstructed correctly while others are not; further improving the network model and tuning the parameters can improve the reconstruction quality.
"Generative Adversarial Networks is the most interesting idea in the last ten years in machine learning." (Yann LeCun)
As a data generation technology, GAN can generate fake images, text, voice, and other data. As shown in Figure 8-15, of the two face images (from the website http://www.whichfaceisreal.com/), one is a real face image and the other is an image generated by a GAN. Is it difficult to distinguish them?
Figure 8-15 Face generated by GAN (left) and real face (right)
Generating images was the main goal of early GANs. Figure 8-16 shows images that can be generated with BigGAN.
Figure 8-16 Images generated with BigGAN, from the paper Large Scale GAN Training for High Fidelity
Natural Image Synthesis
GAN can be used not only to generate images but also for image enhancement, image super-resolution, image restoration, image-to-image translation, style transfer, and other applications. As shown in Figure 8-17, GAN-based image inpainting technology (Image Inpainting for Irregular Holes Using Partial Convolutions) can restore the original image content from a damaged or mosaicked image.

Figure 8-17 GAN-based image inpainting: masked images and the corresponding restoration results, from Image Inpainting for Irregular Holes Using Partial Convolutions
As shown in Figure 8-18, style transfer (Image-to-Image Translation with Conditional Adversarial Nets) can transfer the style of one image onto another image.

Figure 8-18 Style transfer, from the paper Image-to-Image Translation with Conditional Adversarial Nets
In addition to synthesizing images, GAN can also be used to synthesize music (such as GANSynth), voice, and text. Figure 8-19 shows text generated by different GAN technologies.
Figure 8-19 English text generated by IWGAN and TextKD-GAN, from the paper TextKD-GAN: Text
Generation using Knowledge
Distillation and Generative Adversarial Networks
GAN-based data synthesis and reconstruction technology has spawned a variety of innovative applications, such
as the famous GAN-based face-changing application DeepFake, which can replace a face in a video with another
face (as shown in Figure 8-20).
Virtual try-on can change the clothes on a person: as shown in Figure 8-21, given a photo of a person, the clothes can be changed, and the posture can also be changed (from the paper Down to the Last Detail: Virtual Try-on with Detail Carving).
Figure 8-21 Virtual try-on: given a photo of a person, different clothes can be put on, and the posture can also be changed
For more GAN technologies and applications, you can search the web; for example, many GAN technical papers are collected in the GAN Zoo.
The working principle of GAN is similar to the process of counterfeiting: counterfeiters hope to manufacture (generate) fake works (banknotes, calligraphy and paintings, antiques), while authenticators, as their opponents, try to identify whether the works are genuine (for example, bank staff or banknote detectors identify counterfeit banknotes, and cultural-relics experts authenticate relics). At first, the fakes produced by the counterfeiters are easily identified by the authenticators; as the counterfeiters' technology continues to improve, their counterfeits become harder and harder to identify, and the two sides keep confronting each other. Finally, when the counterfeits can no longer be recognized by the authenticators, the confrontation between the two sides has reached a balance, and the counterfeits can fool the authenticators.
This process of confrontation between counterfeiter and authenticator is a so-called "minimax game": the authenticator hopes to maximize its ability to distinguish true from false, while the counterfeiter hopes to minimize the authenticator's ability to distinguish true from false. When this game reaches an equilibrium state, it is called a "Nash equilibrium".

A GAN consists of two parts:
A Generator function (network), which generates (produces) synthetic data from random noise (called latent variable) inputs.
A Discriminator function (network), which is a binary classification function that identifies whether a datum is real data or not.
As shown in Figure 8-22, in a GAN that generates face images, the generator (Generator) and the discriminator (Discriminator) are two functions represented by deep neural networks. The generator can generate a face image from a random noise vector, and the discriminator is a simple binary-classification neural network that accepts a face image and outputs the probability that the image is a real face. The discriminator accepts both real face images and fake face images produced by the generator to train its discriminative ability.
Figure 8-22 GAN model for generating faces: the generator represented by a neural network generates a face image from a random noise vector, and the discriminator is a simple binary-classification neural network that takes a face image as input and outputs the probability that the image is a real face
In GAN, both the discriminator and the generator are functions represented by neural networks, denoted D(x|θ_D) and G(z|θ_G) respectively, where θ_D and θ_G are the model parameters of the two neural networks. The generator function G(z|θ_G) maps a noise hidden variable z to a datum G(z|θ_G), and the discriminator D(x|θ_D) is used to judge the probability that x is real data.
GAN needs to train these two neural network functions so that the discriminator D can identify true and false data as correctly as possible, that is, the probability D(x) for real data is as large as possible and the probability D(G(z)) that generated data is judged as real is as small as possible. On the other hand, the generator G must also be trained so that the data it generates can deceive the discriminator as much as possible, that is, so that generated data is judged as real by the discriminator D with a probability D(G(z)) that is as large as possible. During GAN training, the generator and the discriminator improve their respective abilities through this mutual confrontation, so that in the end the discriminator cannot distinguish between real data and the fake data produced by the generator.
Initially, the distribution of the data G(z|θ_G) generated by the generator G will not be consistent with the underlying distribution of the real data, and the discriminator D(x|θ_D) has not yet learned enough ability to distinguish true from false data. GAN trains them with an adversarial process that alternately trains the discriminator and the generator, i.e., it repeatedly performs the following adversarial training:
Training the discriminator: the discriminator D accepts a set of real data x_real and fake data x_fake = G(z) from the generator as samples. The discriminator function should make the output value D(x_real|θ_D) for real data as large as possible (probability as close to 1 as possible) and the output value D(x_fake|θ_D) = D(G(z)|θ_D) as small as possible (probability close to 0); therefore, the sample labels of real data and fake data are 1 and 0 respectively. The training process is exactly the same as the usual neural network training process.
Training the generator: the generator G accepts a set of random noise z and feeds the data it generates, i.e. the output value G(z), to the discriminator D, hoping to fool the discriminator as much as possible, that is, to make the discriminator's output value D(G(z)) as large as possible (probability close to 1).
The above process is repeated until the discriminator cannot distinguish real data from fake data; in mathematical terms, until the distribution of the data generated by the generator is very close to the distribution of the real data.
If the real data are some real numbers obeying a certain distribution such as a normal distribution, then, as shown in Figure 8-23, the black dots represent the probability density (distribution) of these real numbers. Assume we have only these real numbers and do not know the true distribution. The generator can generate some real numbers x by mapping the noise latent variable z into the real number space; the green solid line indicates the distribution of these generated real numbers. At the beginning, the distribution of the generated real numbers is inconsistent with the distribution of the real real numbers. As the training process iterates, the distribution of the generated data G(z) gradually approaches the distribution of the real data x, and the probability that the discriminator recognizes the generated data as real data continues to increase; finally the distribution of generated data and the distribution of real data are almost identical. At this point the discriminator can no longer distinguish real from generated data, that is, regardless of whether the data is real or generated, the probability of it being judged as real data is close to 0.5.
In Figure 8-23, the black dots represent the probability density (distribution) corresponding to the real data (real
numbers), and the green dots represent the distributions that the generated real numbers obey. Initially, the two
distributions are not similar. As the confrontation process progresses, the two become more and more the closer.
2. Loss function
The goals of the discriminator and the generator are different. The discriminator wants D(x) to be as large as
possible and D(G(z)) to be as small as possible; the generator wants D(G(z)) to be as large as possible. As in
logistic regression and multi-classification problems, in order to improve numerical stability, log D(x) and
log D(G(z)) are usually used instead of D(x) and D(G(z)), and the average loss (expected loss) over a batch of
(multiple) samples is used as the loss function value.
If p_z, p_r, and p_g denote the distributions of the latent variable z, the real data x, and the generated data G(z)
respectively, the discriminator D wants the expectation (average value) of log D(x) over real data,
E_{x∼p_r(x)}[log D(x)], to be as large as possible, and the expectation of log D(G(z)) over generated data to be
as small as possible. The generator G hopes to deceive the discriminator D, that is, it hopes that
E_{z∼p_z(z)}[log(D(G(z)))] is as large as possible, or equivalently that E_{z∼p_z(z)}[log(1 − D(G(z)))] is as
small as possible. That is:
Therefore, these two loss functions can be expressed as a single unified loss function:

min_G max_D L(G, D) = E_{x∼p_r(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
That is, the discriminator D wants to maximize this loss, while the generator G wants to minimize it (for G, the
first term of this loss has nothing to do with G); the generator and the discriminator are playing a "max-min"
adversarial game. Although this is how the unified formula is written, in actual programming one still separately
optimizes max_D E_{x∼p_r(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))] and
min_G E_{z∼p_z(z)}[log(1 − D(G(z)))]. According to practical experience, transforming
min log(1 − D(G(z))) into max log(D(G(z))), i.e., into min(−log(D(G(z)))), is more conducive to the stability
of training.
3. Training process
The training of GAN is nothing more than the training of two ordinary neural networks. GAN trains the
discriminator and the generator in an alternating manner: first train the discriminator, then the generator, then
the discriminator again, and so on. The training process can be described by the following pseudocode:
repeat (one adversarial iteration):
    for k steps:    # discriminator training
        sample a batch of real data x and a batch of noise z;
        compute the gradient of the discriminator loss with respect to its model parameters;
        update the model parameters of the discriminator using the gradient ascent method;
    for l steps:    # generator training
        sample a batch of noise z;
        compute the gradient of the generator loss with respect to its model parameters;
        update the model parameters of the generator using the gradient descent method;
In the authors' original paper, the generator performs only one gradient update per adversarial iteration, that is,
l = 1, while the number of discriminator updates k is an adjustable hyperparameter. By adjusting k or l, the
training intensity of the discriminator and the generator can be balanced, preventing one side from being
overtrained and leaving the other side too weak. Like the learning rate and the network structure and its
parameters, these are hyperparameters that must be tuned by experience; their settings directly affect the
performance of the algorithm, and because GAN training is adversarial, it is harder to tune.
The discriminator and the generator need to fight each other, but if one becomes too strong, the other becomes
weaker. How to balance the training of the two (that is, how to adjust these hyperparameters) is the difficulty of
GAN training.
The D_train() function performs one pass of gradient updates for the discriminator, and the G_train() function
performs one pass of gradient updates for the generator.
The discriminator is a binary classification neural network trained on real data and on fake data produced by the
generator; the label value of real data is 1 and that of fake data is 0. The D_train() function computes the binary
cross-entropy loss on these real and fake samples, computes the gradients through reverse derivation, and then
updates the model parameters:
from util import *

def D_train(D, D_optimizer, x_real, x_fake, loss_fn, reg=1e-3):
    D_optimizer.zero_grad()                  # 1. reset gradients to 0
    y_real = np.ones((len(x_real), 1))       # real samples are labeled 1
    y_fake = np.zeros((len(x_fake), 1))      # fake samples are labeled 0
    # 2. loss and gradient on the real samples
    f_real = D(x_real)
    real_loss, real_loss_grad = loss_fn(f_real, y_real)
    D.backward(real_loss_grad, reg)
    loss = real_loss + D.reg_loss(reg)
    # 3. loss and gradient on the fake samples
    f_fake = D(x_fake)
    fake_loss, fake_loss_grad = loss_fn(f_fake, y_fake)
    D.backward(fake_loss_grad, reg)
    loss += (fake_loss + D.reg_loss(reg))
    D_optimizer.step()                       # 4. update the discriminator parameters
    return loss
Among them, D and D_optimizer are the discriminator neural network and its optimizer, x_real and x_fake are
the real data and fake data respectively, and loss_fn is the binary cross-entropy function.
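The BCE_loss_grad function used later as loss_fn is not shown here; a minimal sketch of what such a function could look like, assuming it takes raw scores f (D has no Sigmoid output layer) and labels y, and returns both the mean binary cross-entropy and its gradient with respect to f:
def BCE_loss_grad(f, y):
    # sigmoid turns raw scores into probabilities
    p = 1 / (1 + np.exp(-f))
    eps = 1e-12                       # avoid log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = (p - y) / len(f)           # gradient of the mean BCE w.r.t. the scores f
    return loss, grad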
G_train() performs one pass of gradient updates for the generator. It accepts a set of random noise vectors z and
produces x_fake through the generator. In order to deceive the discriminator, the data label of x_fake is set to 1;
these generated samples are fed to the discriminator as if they were real samples, and the model parameters of
the generator are then updated by reverse derivation through the discriminator's binary classification loss:
#=================== One training pass of the generator ===========================#
def G_train(D, G, G_optimizer, z, loss_fn, reg=1e-3, hack=False):
    G_optimizer.zero_grad()                  # 1. reset gradients to 0
    x_fake = G(z)                            # 2. generate fake data
    loss, grad = loss_fn(D(x_fake), np.ones((len(z), 1)))  # fakes are labeled 1
    grad = D.backward(grad)                  # 3. backpropagate through D (D is not updated)
    G.backward(grad, reg)                    #    and then through G
    G_optimizer.step()                       # 4. update only the generator parameters
    return loss
Among them, D, G, and G_optimizer are the discriminator network, the generator network, and the generator's
optimizer, respectively, and z is randomly sampled noise. When training the generator, the model parameters of
the discriminator are kept fixed; only the parameters of the generator are updated, that is, only
G_optimizer.step() is called.
As the overall process of GAN training, the GAN_train() function first executes D_train() to train and update the
discriminator in each iteration, and then executes G_train() to train and update the generator. d_steps and g_steps
indicate how many times D_train() and G_train() are executed in each iteration of GAN_train(), because
sometimes multiple gradient updates are needed to learn better model parameters. Together with parameters such
as the learning rates of the two optimizers, they balance the learning intensity of the two networks and prevent
either the discriminator or the generator from becoming too strong.
def GAN_train(D, G, D_optimizer, G_optimizer, real_dataset, noise_z, loss_fn, iterations=10000,
              reg=1e-3, show_result=None, d_steps=1, g_steps=1, print_n=100):
    iter = 0
    D_losses, G_losses = [], []
    while iter < iterations:
        # training discriminator
        for d_index in range(d_steps):
            x_real = next(real_dataset)
            x_fake = G(next(noise_z))
            D_loss = D_train(D, D_optimizer, x_real, x_fake, loss_fn, reg)
        # training generator
        for g_index in range(g_steps):
            G_loss = G_train(D, G, G_optimizer, next(noise_z), loss_fn, reg)
        if iter % print_n == 0:
            print(iter, "iter:", "D_loss", D_loss, "G_loss", G_loss)
            if show_result:
                show_result(D_losses, G_losses)
        D_losses.append(D_loss)
        G_losses.append(G_loss)
        iter += 1
    return D_losses, G_losses
These real numbers are used as the real data, and assuming their distribution is not known, how can we generate
new real numbers that follow the same probability distribution? This problem can be solved with GAN: GAN
uses these real numbers as real data to train its discriminator and generator functions, and the trained generator
can then produce real numbers (that is, fake data) with the same distribution as the real ones.
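The real data here can, for instance, be a set of samples drawn from a one-dimensional Gaussian distribution. A minimal sketch, where the mean mu and standard deviation sigma are assumed values chosen for illustration:
import numpy as np

mu, sigma = 4.0, 0.5                        # assumed parameters of the real-data distribution
x = mu + sigma * np.random.randn(10000, 1)  # 10000 real samples of shape (10000, 1)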
hidden = 4
D = NeuralNetwork()
D.add_layer(Dense(1, hidden))
D.add_layer(Leaky_relu(0.2))  # or Relu()
D.add_layer(Dense(hidden, 1))
#D.add_layer(Sigmoid())
G = NeuralNetwork()
z_dim = 1  # dimension of the hidden (latent) variable
G.add_layer(Dense(z_dim, hidden))
G.add_layer(Leaky_relu(0.2))
G.add_layer(Dense(hidden, 1))
batch_size = 64
def data_iterator_X(X, batch_size, shuffle=True, repeat=True):
    m = len(X)
    indices = list(range(m))
    while True:
        if shuffle:
            np.random.shuffle(indices)
        for i in range(0, m, batch_size):
            if i + batch_size > m:
                break
            j = np.array(indices[i: i + batch_size])
            yield X.take(j, axis=0)
        if not repeat:   # a repeat flag lets later code stop after one epoch
            break
data_it = data_iterator_X(x, batch_size)
x0 = next(data_it)
print(x0.shape)
print(x0[:10].transpose())
10000
(64, 1)
[[4.39069056 4.20482386 4.14997364 4.65636703 4.36363908 3.75927793
3.34646553 4.64355828 4.45063574 3.49191287]]
The input of the generator is random noise. The following function generates a batch of m random noise vectors
(each noise vector has length z_dim):
def sample_z(m, z_dim=1):
    return np.random.randn(m, z_dim)
(64, 1)
[[ 0.72956978 0.14262128 -0.29800486 1.78637966 0.27740342 -0.61411045
-0.68236473 1.61341108 0.41862218 -0.89009973]]
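The training code below draws noise batches from an iterator created by noise_z_iterator(); its definition is not shown above, but a minimal sketch consistent with how it is used would be:
def noise_z_iterator(batch_size, z_dim=1):
    # endlessly yield batches of random noise vectors
    while True:
        yield sample_z(batch_size, z_dim)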
def draw_loss(ax, D_losses=None, G_losses=None):
    ax.clear()
    if D_losses:
        i = np.arange(len(D_losses))
        ax.plot(i, D_losses, '-')
    if G_losses:
        ax.plot(i, G_losses, '-')
    ax.legend(['D_losses', 'G_losses'])
def show_result_gauss(D_losses=None, G_losses=None, m=600):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    draw_loss(ax1, D_losses, G_losses)
    ax2.clear()
    xmin, xmax = np.min(x), np.max(x)
    x_values = np.linspace(xmin, xmax, 100)
    ax2.plot(x_values, gaussian(x_values, mu, sigma), label='real data')
    # density of the generated samples and the decision curve of the discriminator
    x_fake = G(sample_z(m))
    ax2.hist(x_fake.flatten(), bins=50, density=True, alpha=0.5, label='generated data')
    ax2.plot(x_values, sigmoid(D(x_values.reshape(-1, 1))), label='decision curve')
    ax2.legend()
    plt.show()
The left plot of show_result_gauss() draws the loss curves of D and G; the right plot draws the distribution of the
real data (blue curve), the distribution of the generated data (green shaded curve), and the decision curve (that is,
the probability predicted by the model that a real number in the interval is real data). Execute the function:
show_result_gauss()
Outputs the following graphics:
Figure 8-24 The blue curve is the true distribution of real data, the green line in the shaded area is the
distribution of the data generated by the generator, and the middle line is the decision curve
Because no training has been performed yet, the left plot is blank. The right plot shows that the distribution of the
real numbers produced by the generator is still very different from the distribution of the real data.
5. Training GAN
Call GAN's training function GAN_train(), passing in the discriminator and generator and their optimizers (D, G,
D_optimizer, G_optimizer), the data iterator data_it, the noise iterator noise_it, the binary cross-entropy function
BCE_loss_grad, and the training hyperparameters (iterations, reg), and start training D and G:
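A minimal sketch of this call; the optimizer construction and the hyperparameter values below are assumptions for illustration, not the author's exact settings:
# assumed optimizer API (SGD with learning rate) and assumed hyperparameter values
D_optimizer = SGD(D.parameters(), lr=1e-4)
G_optimizer = SGD(G.parameters(), lr=1e-4)
noise_it = noise_z_iterator(batch_size, z_dim)
d_steps, g_steps, reg, print_n = 1, 1, 1e-3, 500
iterations = 100000
D_losses, G_losses = GAN_train(D, G, D_optimizer, G_optimizer, data_it, noise_it,
                               BCE_loss_grad, iterations, reg,
                               show_result_gauss, d_steps, g_steps, print_n)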
During training, the loss curves of the intermediate model, the distributions of real and generated data, and the
decision curve are output every print_n = 500 iterations. Figure 8-25 shows the output of some of the iterative
steps:

2000 iter: D_loss 0.40058914689196146 G_loss 1.4307126427753376
4000 iter: D_loss 1.3457534057336877 G_loss 0.7859707963138415
7000 iter: D_loss 1.3266855320062865 G_loss 0.8275752348295592
13000 iter: D_loss 1.3860751575316943 G_loss 0.7022555897704553
45000 iter: D_loss 1.3859107668070463 G_loss 0.6946582988497471
95000 iter: D_loss 1.386978807433111 G_loss 0.694197914623761
Figure 8-25 Some intermediate results during the training process of a set of real GAN models
From these intermediate results it can be seen that the discriminator and the generator are engaged in an
adversarial process, and adjusting the training parameters so that they stay balanced is difficult. Incorrect
parameters make the training oscillate and fail to converge; if the generator is much stronger than the
discriminator, mode collapse can occur, that is, the generator cannot produce diverse data and almost always
generates the same data. Readers can lower the regularization parameter reg, modify the learning rate, or modify
the number of discriminator updates d_steps per iteration to observe these cases of non-convergence and mode
collapse.
x = c_x + a·cos(α)
y = c_y + b·sin(α)
where (c_x, c_y) is the center of the ellipse, a and b are the lengths of the major and minor axes, and α is the
angle parameter that determines the point (x, y) on the ellipse. The following function sample_ellipse() can
uniformly sample a set of coordinate points on the elliptic curve:
import numpy as np
import math
def sample_ellipse(m, a, b, cx=0, cy=0):
    alpha = np.random.uniform(0, 2*math.pi, m)
    x, y = cx + a*np.cos(alpha), cy + b*np.sin(alpha)
    x = x.reshape(m, 1)
    y = y.reshape(m, 1)
    return np.hstack((x, y))
Using the ellipse sampling function above, the following code samples 100 coordinate points from the ellipse
centered at (4, 4) with major and minor axis lengths 5 and 3, and then draws these points:
Figure 8-26 A group of sampled coordinate points on the ellipse centered at (4, 4) with major and minor axis
lengths 5 and 3
2. Real data iterator, noise iterator
You can use the sample_ellipse() function to define a data iterator to sample a set of coordinate points from the
ellipse:
a, b, cx, cy = 5, 3, 4, 4
batch_size = 64
def data_iterator_ellipse(batch_size):
    while True:
        yield sample_ellipse(batch_size, a, b, cx, cy)
data_it = data_iterator_ellipse(batch_size)
x = next(data_it)
print(x[:3])
output:
[[1.09671815 6.44244624]
[8.71292461 5.00189969]
[1.99319665 6.74776027]]
Still using the noise_z_iterator() function from before, define a noise iterator noise_it that generates noise
vectors:
batch_size = 64
z_dim = 2
noise_it = noise_z_iterator(batch_size, z_dim)
z = next(noise_it)
print(z[:3])
[[-0.12580991 -2.49903308]
[-0.36232861 0.95614813]
[-0.45110849 -1.30580063]]
The GAN modeling and training process is the same as for generating a set of real numbers; only a problem-
specific generator and discriminator need to be defined. Of course, for different problems the training parameters
of the generator and discriminator must also be adjusted accordingly (that is, parameter tuning).
from NeuralNetwork import *
#from util import *
from train import *
np.random.seed(0)
G_hidden,D_hidden = 10,10
z_dim = 2 # Dimensions of hidden variables
G = NeuralNetwork()
G.add_layer(Dense(z_dim, G_hidden))
G.add_layer(Leaky_relu(0.2)) #Relu()) #
G.add_layer(Dense(G_hidden, 2))
D = NeuralNetwork()
D.add_layer(Dense(2, D_hidden))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(D_hidden, 1))
def draw_loss(ax, D_losses=None, G_losses=None):
    ax.clear()
    if D_losses:
        i = np.arange(len(D_losses))
        ax.plot(i, D_losses, '-')
    if G_losses:
        ax.plot(i, G_losses, '-')
    ax.legend(['D_losses', 'G_losses'])
def show_ellipse_gan(D_losses=None, G_losses=None, m=100):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    draw_loss(ax1, D_losses, G_losses)
    ax2.clear()
    if True:      # scatter a fresh sample of real points
        data = sample_ellipse(100, a, b, cx, cy)
        ax2.scatter(data[:, 0], data[:, 1])
    else:         # or draw the elliptic curve itself
        alpha = np.linspace(0, 2*math.pi, 100)
        x, y = cx + a*np.cos(alpha), cy + b*np.sin(alpha)
        ax2.plot(x, y, label='real data')
    # scatter the generated coordinate points for comparison
    x_fake = G(sample_z(m, z_dim))
    ax2.scatter(x_fake[:, 0], x_fake[:, 1])
    plt.show()
Then define the optimizers D_optimizer and G_optimizer for the discriminator and the generator, with the
learning rate still 1e-4; set the numbers of updates d_steps and g_steps per iteration of the discriminator and
generator to 12 and 1 respectively, and set the regularization parameter reg = 1e-4. Start training:
from util import *

7000 iter: D_loss 1.2152731903087477 G_loss 0.9141720546508202
30000 iter: D_loss 1.0173900698057503 G_loss 1.1880948654376398
160000 iter: D_loss 1.3094760434222943 G_loss 0.9307732117439997
299500 iter: D_loss 1.350800595219167 G_loss 0.8432568162317724
Figure 8-27 Some intermediate results during the training process of a set of two-dimensional coordinate point
GAN models
# train_X: the MNIST images loaded earlier via the data_set module,
# flattened to shape (50000, 784) with pixel values in [0, 1]
print(np.min(train_X[0]), np.max(train_X[0]))
train_X = (train_X - 0.5)*2   # rescale pixel values from [0, 1] to [-1, 1]
print(np.min(train_X[0]), np.max(train_X[0]))
ds.draw_mnists(plt, train_X, range(10))
plt.show()
float32
(50000, 784)
int64
(50000,)
0.0 0.99609375
-1.0 0.9921875
batch_size = 32
data_it = data_iterator_X(train_X,batch_size,shuffle = True,repeat=True) #
noise_it = noise_z_iterator(batch_size, z_dim)
image_dim = 784
g_hidden_dim = 256
d_hidden_dim = 256
d_output_dim = 1
G = NeuralNetwork()
G.add_layer(Dense(z_dim, g_hidden_dim))
G.add_layer(Relu()) # Leaky_relu(0.2)) #
G.add_layer(Dense(g_hidden_dim, g_hidden_dim))
G.add_layer(Relu()) # Leaky_relu(0.2)) #
G.add_layer(Dense(g_hidden_dim, image_dim))
G.add_layer(Tanh())
D = NeuralNetwork()
D.add_layer(Dense(image_dim, d_hidden_dim))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(d_hidden_dim, d_hidden_dim))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(d_hidden_dim, d_output_dim))
4. Training the model
Define an auxiliary function show_result_mnist() that displays intermediate results:
def show_result_mnist(D_losses=None, G_losses=None, m=10):
    if D_losses and G_losses:
        i = np.arange(len(D_losses))
        plt.plot(i, D_losses, '-')
        plt.plot(i, G_losses, '-')
        plt.legend(['D_losses', 'G_losses'])
        plt.show()
    # draw m generated digit images
    z = np.random.randn(m, z_dim)
    x_fake = G(z)
    plot_images(x_fake, subplot_shape=[1, 10])
    plt.show()
import time
show_result = show_result_mnist
print_n = 500
start = time.time()
D_losses, G_losses = GAN_train(D, G, D_optimizer, G_optimizer, data_it, noise_it,
                               BCE_loss_grad, iterations, reg,
                               show_result, d_steps, g_steps, print_n)
done = time.time()
elapsed = done - start
print("Training time: %d seconds" % (elapsed))
The higher the number of iterations, the longer the training time.
1. Normalize the input data. If the image values are normalized to between -1 and 1, use tanh as the final
   output activation function of the generator.
2. Modify the loss function: in the original GAN paper the generator is trained by minimizing
   log(1 - D(G(z))); it is suggested to instead maximize log(D(G(z))).
3. Sample the generator's input noise from a Gaussian distribution instead of a uniform distribution.
4. Apply batch normalization (BatchNorm) to real data and generated data separately; do not batch-normalize
   data that mixes real and generated samples.
5. Avoid sparse gradients, that is, avoid activation functions or network layers that produce sparse gradients,
   such as ReLU and MaxPool; LeakyReLU is recommended instead.
6. Use soft or noisy labels. When training the discriminator, the real-data label can be a random number
   between 0.7 and 1.2 instead of 1, and the generated-data label a random number between 0.0 and 0.3
   instead of 0; occasionally flip the label of the generated data, e.g., from 0 to 1 (a small sketch follows the
   link below).
7. Regarding the Adam optimizer: it is recommended to use the SGD optimizer for the discriminator and the
   Adam optimizer for the generator.
8. When the loss of D tends to 0, D is too strong; when the variance of D's loss is large, training is not
   converging. When the loss of G keeps decreasing, G is too strong and mode collapse is likely. During
   training, the gradients of the model parameters can be checked; if their absolute values exceed 100,
   training will not converge.
https://github.com/soumith/ganhacks
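As an illustration of tip 6, soft and noisy labels take only a few lines; the 5% flip rate below is an assumed value:
# soft labels: real in [0.7, 1.2), fake in [0.0, 0.3)
y_real = np.random.uniform(0.7, 1.2, (batch_size, 1))
y_fake = np.random.uniform(0.0, 0.3, (batch_size, 1))
# occasionally flip some fake labels from 0 to 1 (assumed 5% flip rate)
flip = np.random.rand(batch_size) < 0.05
y_fake[flip] = 1.0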
8.6 GAN loss function and its probability explanation
The essence of GAN's adversarial networks is to make the distribution of generated data as consistent as possible
with the distribution of real data through adversarial learning, so that the distance between the two distributions is
as small as possible; the loss function of GAN is essentially a measure of the similarity of the two distributions.
Two common measures of the similarity between distributions are the Kullback-Leibler divergence (KL
divergence) and the Jensen-Shannon divergence (JS divergence).
Among them, p_r(x) and p_g(x) are the distributions of real data and generated data respectively. Introduce the
notation:

x̃ = D(x),  A = p_r(x),  B = p_g(x)

so that the integrand of the discriminator loss can be written as

f(x̃) = p_r(x) log(D(x)) + p_g(x) log(1 − D(x)) = A log x̃ + B log(1 − x̃)

Letting df(x̃)/dx̃ = 0, the extreme point of the discriminator's loss L(G, D) is obtained:

D*(x) = x̃* = A/(A + B) = p_r(x)/(p_r(x) + p_g(x)) ∈ [0, 1]
When the generator is optimal, that is, when the distribution of the generated data is exactly the same as that of
the real data, p_g = p_r, the extreme point of the discriminator's output is 1/2. At this time,

L(G, D*) = ∫_x (p_r(x) log(D*(x)) + p_g(x) log(1 − D*(x))) dx
         = log(1/2) ∫_x p_r(x) dx + log(1/2) ∫_x p_g(x) dx
         = −2 log 2
KL divergence describes the degree to which a probability distribution p deviates from q. For an x, if
p(x) = q(x), then log(p(x)/q(x)) = log 1 = 0; if p(x) ≠ q(x), then log(p(x)/q(x)) ≠ 0. When p and q are equal
everywhere, D_KL(p ∥ q) = 0; otherwise it can be proved that D_KL(p ∥ q) > 0. Therefore, when the two
distributions are exactly the same or almost the same (p(x) = q(x) almost everywhere), the KL divergence attains
its minimum value 0.
For example, for the 2 discrete probability distributions shown in Figure 8-28:
Figure 8-28 The discrete probability distributions of the left and right graphs are (0.36, 0.48, 0.16) and (0.333,
0.333, 0.333) respectively
That is, the probability distributions of p and q are (0.36, 0.48, 0.16) and (0.333, 0.333, 0.333) respectively. Their
KL divergence is:

D_KL(p ∥ q) = Σ_{x∈X} p(x) log(p(x)/q(x))
            = 0.36 log(0.36/0.333) + 0.48 log(0.48/0.333) + 0.16 log(0.16/0.333) = 0.0863
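This value is easy to verify numerically:
import numpy as np

p = np.array([0.36, 0.48, 0.16])
q = np.array([0.333, 0.333, 0.333])
print(np.sum(p * np.log(p / q)))   # ≈ 0.0863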
KL divergence is asymmetrical and can lead to erroneous results when measuring the similarity between two
equally important distributions.
The integrand of D_KL(p ∥ q) is shown in the right subgraph of Figure 8-29; the KL divergence is the sum of the
positive and negative areas of the shaded part:
Figure 8-29 D_KL(p ∥ q) is the sum of the positive and negative areas of the shaded part
For two Gaussian distributions N(μ_1, σ_1²) and N(μ_2, σ_2²), the KL divergence has a closed form:

D_KL(N(μ_1, σ_1²) ∥ N(μ_2, σ_2²)) = (1/2) log(2πσ_2²) + (σ_1² + (μ_1 − μ_2)²)/(2σ_2²) − (1/2)(1 + log 2πσ_1²)
                                  = log(σ_2/σ_1) + (σ_1² + (μ_1 − μ_2)²)/(2σ_2²) − 1/2
If one distribution is fixed, say q(x) = N(0, σ²) with σ = 2, and p(x) = N(μ, σ²) varies freely with μ, the
following code plots the KL divergence value for different values of μ:
import math
import matplotlib.pyplot as plt
import numpy as np
# if using a jupyter notebook
%matplotlib inline
def KL(mu1, sigma1, mu2, sigma2):
    return math.log(sigma2/sigma1) + (sigma1**2 + (mu1-mu2)**2)/(2*sigma2**2) - 1/2
mus = np.arange(-12, 12, 0.1)
kl_values = [KL(mu, 2, 0, 2) for mu in mus]
plt.plot(mus, kl_values)
plt.xlabel('$\mu$')
plt.ylabel('KL')
plt.legend(['KL Value'], loc='upper center')
plt.show()
It can be seen that when μ = 0, that is, when p(x) and q(x) are the same distribution, the KL divergence attains its
minimum.
Unlike the KL divergence, the JS divergence is symmetric in p and q, that is, p and q are equally important, and it
is smoother than the KL divergence. It is defined via the average distribution m = (p + q)/2:

D_JS(p ∥ q) = (1/2) D_KL(p ∥ m) + (1/2) D_KL(q ∥ m)
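A quick numerical check of the symmetry, reusing the discrete distributions p and q from above:
def KL_discrete(p, q):
    return np.sum(p * np.log(p / q))

def JS(p, q):
    m = (p + q) / 2
    return 0.5 * KL_discrete(p, m) + 0.5 * KL_discrete(q, m)

print(JS(p, q), JS(q, p))   # the two values are equal: JS is symmetric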
Figure 8-31 p and q are the Gaussian distributions N(0, 1) and N(1, 1) respectively, and m = (p + q)/2 is the
average of the two distributions. The KL divergence D_KL is asymmetric, but the JS divergence D_JS is
symmetric. The upper left corner shows the two probability distributions p(x) and q(x); the upper right corner
shows the integrands of KL(p ∥ q) and KL(q ∥ p); the lower left corner shows the integrands of KL(p ∥ m) and
KL(q ∥ m); and the lower right corner shows the integrand of JS(p ∥ q).
D_JS(p_r ∥ p_g) = (1/2)(log 2 + ∫_x p_r(x) log (p_r(x)/(p_r(x) + p_g(x))) dx)
                + (1/2)(log 2 + ∫_x p_g(x) log (p_g(x)/(p_r(x) + p_g(x))) dx)
                = (1/2)(log 4 + L(G, D*))
It can be seen that when the discriminator is optimal, the JS divergence D_JS(p_r ∥ p_g) and L(G, D*) differ
only by a constant log 4. Therefore, the loss function of GAN quantifies the similarity between the generated data
distribution and the real data distribution:

L(G, D*) = 2 D_JS(p_r ∥ p_g) − 2 log 2

When the generator is optimal, that is, when the generated data distribution is exactly the same as the real data
distribution, the first term is 0, and the loss function value when both the generator and the discriminator are
optimal is:

L(G*, D*) = 0 − 2 log 2 = −2 log 2
8.6.3 Maximum likelihood interpretation of GAN
The above explains, from the requirement that the distributions of real and generated data be as consistent as
possible, the relationship between the GAN loss function and the JS divergence, where the JS divergence is the
sum of two KL divergences. The relationship between the GAN loss function and the KL divergence can also be
derived from the perspective of maximum likelihood estimation.
For a set of real data x_1, x_2, ⋯, x_n obeying the distribution p_r, the probabilities that the generator G(θ)
generates these real data are p_θ(x_1), p_θ(x_2), ⋯, p_θ(x_n). If G(θ) is optimal, it should generate these real
data with the maximum probability (likelihood), that is, we look for the generator parameter θ that attains the
following maximum:

arg max_θ p(θ; x_1, …, x_n) = arg max_θ ∏_{i=1}^n p_θ(x_i)
Similarly, in order to improve the stability of the calculation, the logarithm of the above probability product can be
used to replace the product, and the extreme point of the function will not be changed. So the problem boils down
to finding:
arg max_θ log p(θ; x_1, …, x_n) = arg max_θ log ∏_{i=1}^n p_θ(x_i) = arg max_θ Σ_{i=1}^n log p_θ(x_i)
Because these x_i are real data, they obey the real data distribution p_r. Assuming n tends to infinity, the sum on
the right of the above formula can be expressed in integral form:

arg max_θ Σ_{i=1}^n log p_θ(x_i) = arg max_θ ∫_x p_r(x) log p_θ(x) dx

arg max_θ ∫_x p_r(x) log p_θ(x) dx = arg max_θ (−∫_x p_r(x) log p_r(x) dx + ∫_x p_r(x) log p_θ(x) dx)
                                   = arg min_θ ∫_x p_r(x) log (p_r(x)/p_θ(x)) dx
Therefore, letting the generator G(θ) maximize the likelihood of the real data is equivalent to minimizing the KL
divergence between the real data distribution p_r and the generated data distribution p_θ above.
With high probability, the distribution generated by a GAN does not overlap with the real data distribution; this is
one of the reasons why the original GAN is difficult to train. The authors of WGAN proposed replacing the JS
divergence of the original GAN with the Wasserstein distance to characterize the distance between the two
distributions. The Wasserstein distance is also known as the Earth Mover's distance (EM distance), or "bulldozer
distance". The EM distance does not directly measure the difference between the probability densities of the
random variables of two distributions; instead, it measures the energy consumed to transform one distribution
into another. If a distribution p_1 can be transformed into the target distribution q with less energy than another
distribution p_2, then the EM distance between p_1 and q is smaller than the EM distance between p_2 and q.
Think of the distributions p(x) and q(y) as two piles of soil: the bulldozer distance measures the work needed to
transform the pile shaped like p(x) into the pile shaped like q(y) through some movement scheme.
As shown in Figure 8-32 a), the probabilities of the colored random variable at x = 1 and x = 8 are 3/4 and 1/4
respectively; this probability mass can be regarded as piles of soil or bricks at x = 1 and x = 8. The probabilities of
the white random variable at y = 5 and y = 6 are 2/4 and 2/4 respectively; this can likewise be regarded as piles of
soil or bricks at y = 5 and y = 6. The table below Figure 8-32 a) shows the distance ∥x − y∥ for each combination
of x and y.
Figure 8-32 a) The probabilities of the colored random variable at x = 1 and 8 are 3/4 and 1/4 respectively, and
the probabilities of the white random variable at y = 5 and 6 are 2/4 and 2/4 respectively; the table below shows
the moving distance ∥x − y∥. b) Moving plan γ1 c) Moving plan γ2
To transform p(x) into q(y), this pile of soil or bricks must be moved. For example, the soil of p(x) can be
transformed into the target q(y) according to the moving plan in Figure 8-32 b), or according to the moving plan
in Figure 8-32 c), whose moving cost is:

2/4 · (6 − 1) + 1/4 · (5 − 1) + 1/4 · (8 − 5) = (10 + 4 + 3)/4 = 17/4
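The cost of a moving plan is easy to check numerically; a minimal sketch for plan c) above:
import numpy as np

xs = np.array([1., 1., 8.])            # source positions of the moved mass
ys = np.array([6., 5., 5.])            # target positions
mass = np.array([2/4, 1/4, 1/4])       # amount of mass moved in each entry of the plan
print(np.sum(mass * np.abs(xs - ys)))  # 4.25 = 17/4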
Different moving plans have different costs. Use γ to denote a moving plan: γ(x, y) is the amount of soil
transported from x to y, ∥x − y∥ is the moving distance, and γ(x, y) · ∥x − y∥ is the cost of transporting the amount
γ(x, y) from x to y. The cost of the whole moving plan is the sum of all these terms:

Σ γ(x, y) ∥x − y∥

γ(x, y) can be expressed as the percentage of the transported amount relative to the total amount, and this
percentage is equivalent to a probability: every γ(x, y) is greater than or equal to 0, and the sum Σ_{x,y} γ(x, y) is
1, satisfying the conditions of a probability. That is, γ(x, y) is a joint probability distribution of the random
variables (x, y), and the transport distance ∥x − y∥ is a function of (x, y). For a moving plan γ, the moving cost is
then the mathematical expectation of ∥x − y∥ under the probability γ(x, y):

Σ γ(x, y) ∥x − y∥ = E_{(x,y)∼γ}[∥x − y∥]
The bulldozer distance from p(x) to q(y) is defined as the minimum moving cost over all possible moving plans;
the more precise mathematical term is the infimum of the moving costs of all possible moving plans, whose
mathematical symbol is inf. Therefore, the bulldozer distance is defined as:

W(p, q) = inf_{γ∈Π(p,q)} E_{(x,y)∼γ}[∥x − y∥]

where Π(p, q) is the set of all possible moving plans, and γ ∈ Π(p, q) is one plan that moves p(x) into q(y).
Let p_r and p_g be the probability densities of real data and generated data respectively, and let γ be a moving
plan that transforms the distribution p_r into p_g; then the bulldozer distance between these two distributions is:

W(p_r, p_g) = inf_{γ∈Π(p_r, p_g)} E_{(x,y)∼γ}[∥x − y∥]

It represents the minimum cost required to transform the distribution p_r into the distribution p_g. The bulldozer
distance is symmetric, so it also represents the minimum cost of transforming the distribution p_g into p_r.
Computing this distance directly is infeasible because the infinitely many moving plans γ cannot be enumerated.
Through a sophisticated mathematical derivation called Kantorovich-Rubinstein duality, it is transformed into the
following distance calculation:

W(p_r, p_θ) = sup_{∥f∥_L ≤ 1} E_{x∼p_r}[f(x)] − E_{x∼p_θ}[f(x)]    (8-29)

Among them, sup is the supremum, that is, the least upper bound of all values, and f is a 1-Lipschitz function,
that is, a function satisfying the following condition:

|f(x_1) − f(x_2)| ≤ ∥x_1 − x_2∥
For such a function f, E_{x∼p_r} f(x) is the expectation (mean value) of f(x) when x obeys the real data
distribution p_r, and E_{x∼p_θ} f(x) is the expectation of f(x) when x obeys the generated data distribution p_θ.
Therefore, E_{x∼p_r} f(x) can be estimated simply by the average of f(x) over some real data, and
E_{x∼p_θ} f(x) by the average of f(x) over some generated data:

E_{x∼p_r} f(x) ≈ (1/m) Σ_{i=1}^m f(real_x_i)
E_{x∼p_θ} f(x) ≈ (1/m) Σ_{i=1}^m f(fake_x_i)

where real_x_i and fake_x_i are some real and generated data respectively. Therefore, the bulldozer distance can
be estimated by the difference between these two averages.
In WGAN training, f(x) is the neural network function of the discriminator (critic), and it must be guaranteed to
satisfy the 1-Lipschitz condition above. In the original WGAN paper this is ensured by the practical trick of
weight clipping, which limits the weight parameters to a range [−c, c]. Usually c = 0.01; c = 0.1 or c = 0.001 are
also used, that is, c is another parameter that needs tuning.
For the discriminator, approaching the supremum in formula (8-29) means maximizing
E_{x∼p_r} f(x) − E_{x∼p_θ} f(x); its parameters (such as w) are updated by the gradient ascent method, making
the Wasserstein distance estimate as large as possible so as to improve the ability to distinguish real data from
generated data. The generator wants to minimize the Wasserstein distance, so its parameters (such as θ) are
updated by the gradient descent method. For the generator, the first term of (8-29) has nothing to do with it, so it
only needs to minimize −E_{x∼p_θ} f(x).
The loss function of WGAN is simply the mean of f(x) or −f(x), so the gradient of the loss with respect to f(x) is
the constant 1 (or −1), divided by the batch size when averaging. Therefore, the WGAN loss function and its
gradient are simpler to compute: only the code that computes the loss and its gradient with respect to f(x) in the
GAN code needs slight modification. The pseudocode of the WGAN algorithm is shown in Figure 8-33.
Later, improved WGANs were proposed, such as Improved WGAN (WGAN-GP), which adds a gradient penalty
term to the loss function instead of clipping the weight parameters:

L(p_r, p_g) = E_{x̃∼p_g}[f(x̃)] − E_{x∼p_r}[f(x)] + E_{x̂∼p_x̂}[(∥∇_x̂ f(x̂)∥_2 − 1)²]

Among them, E_{x̃∼p_g}[f(x̃)] − E_{x∼p_r}[f(x)] is the negative Wasserstein distance −W(p_r, p_g), and
E_{x̂∼p_x̂}[(∥∇_x̂ f(x̂)∥_2 − 1)²] is the gradient penalty term, which pushes the norm of the gradient toward unit
length as much as possible, thereby preventing the gradient from exploding or vanishing.
Weight clipping mainly prevents the weight parameters from becoming too large, thereby ensuring that the neural
network function remains (approximately) a 1-Lipschitz function. The gradient penalty term is similar to the
parameter regularization term seen earlier: it also prevents gradient explosion from driving the parameters up
during gradient updates, and limiting the gradient to a certain range likewise keeps the model parameters within a
certain range.
Recent practice has shown that WGAN and WGAN-GP are not actually superior to GAN, so in practice people
are still accustomed to using the original GAN.
The per-pass WGAN training functions only slightly modify D_train() and G_train(); a completed sketch (the
wrapper name WGAN_D_train and the parameters() accessor used for weight clipping are assumptions):

def WGAN_D_train(D, D_optimizer, x_real, x_fake, clip_value, reg=1e-3):
    D_optimizer.zero_grad()
    m = len(x_real)
    f_real = D(x_real)
    real_loss = np.mean(f_real)
    D.backward(-(1/m)*np.ones(f_real.shape), reg)   # ascend on E[f(real)]
    f_fake = D(x_fake)
    assert(f_fake.size == f_real.size)
    fake_loss = np.mean(f_fake)
    fake_grad = (1/m)*np.ones(f_fake.shape)
    D.backward(fake_grad, reg)                      # descend on E[f(fake)]
    loss = (real_loss - fake_loss)
    # 4. Update the gradient, then clip the weights to [-c, c] (1-Lipschitz)
    D_optimizer.step()
    for p in D.parameters():                        # parameters() accessor is assumed
        np.clip(p, -clip_value, clip_value, out=p)
    return loss

def WGAN_G_train(D, G, G_optimizer, z, clip_value, reg=1e-3):
    G_optimizer.zero_grad()
    f_fake = D(G(z))
    grad = D.backward(-(1/len(z))*np.ones(f_fake.shape))  # minimize -E[f(G(z))]
    G.backward(grad, reg)
    G_optimizer.step()
    return -np.mean(f_fake)
def WGAN_train(D, G, D_optimizer, G_optimizer, real_dataset, noise_z, iterations=10000, reg=1e-3,
               clip_value=0.01, n_critic=4, show_result=None, print_n=20):
    iter = 0
    D_losses = []
    G_losses = []
    while iter < iterations:
        # training discriminator (critic)
        x_real = next(real_dataset)
        x_fake = G(next(noise_z))
        D_loss = WGAN_D_train(D, D_optimizer, x_real, x_fake, clip_value, reg)
        # training generator once every n_critic critic updates
        if iter % n_critic == 0:
            G_loss = WGAN_G_train(D, G, G_optimizer, next(noise_z), clip_value, reg)
        if iter % print_n == 0:
            print(iter, "iter:", "D_loss", D_loss, "G_loss", G_loss)
            D_losses.append(D_loss)
            G_losses.append(G_loss)
            if show_result:
                show_result(D, G, D_losses, G_losses)
        iter += 1
    return D_losses, G_losses
For the GAN model on a set of real numbers from Section 8.5.1, this WGAN loss function can be used to train
the model; the code is as follows:
np.random.seed(0)
hidden = 4
D = NeuralNetwork()
D.add_layer(Dense(1, hidden))
D.add_layer(Leaky_relu(0.2)) #Relu()) #
D.add_layer(Dense(hidden, 1))
#D.add_layer(Sigmoid())
G = NeuralNetwork()
z_dim = 1 #Dimensionality of hidden variables
G.add_layer(Dense(z_dim, hidden))
G.add_layer(Leaky_relu(0.2)) #Relu())
G.add_layer(Dense(hidden, 1))
...
500 iter: D_loss 0.0011615003991261863 G_loss -0.01023799847202126
27000 iter: D_loss -8.578426744787482e-07 G_loss -0.009488053321470099
...
90000 iter: D_loss 1.3109091936969186e-11 G_loss -0.009999930339860416
...
199500 iter: D_loss 4.4971589611975116e-09 G_loss -0.01000896809522164
Figure 8-33 The results of WGAN training for a set of real number problems
D_loss is the estimated Wasserstein distance between the distributions of generated and real data; it converges to
0 with the iterations, indicating that the two distributions gradually coincide.
The discriminator is a binary classification function that can be represented by a convolutional neural network:
through convolution (and pooling) operations it progressively reduces the resolution of the image, until a final
fully connected layer produces a score representing the binary classification. But how does the generator
transform a low-dimensional one-dimensional hidden vector into a high-dimensional multi-channel image
(feature map)?
An ordinary convolution operation is a kind of downsampling: it can convert high-resolution feature maps into
low-resolution ones, but it cannot convert a low-dimensional hidden vector into a high-dimensional image. The
transposed convolution (transposed convolutions) operation does the opposite: it is an upsampling (upscaling)
operation that can convert low-resolution feature maps into high-resolution ones. Transposed convolution is also
called fractionally-strided convolution, and some literature calls it deconvolution, although this "deconvolution"
is not the same concept as deconvolution in mathematics; therefore, the first two terms are generally used. As
shown in Figure 8-35, four transposed convolutional layers are used to convert a vector of length 100 into a
3 × 64 × 64 color image.
Figure 8-35. Four transposed convolutional layers transform a vector of length 100 into a color image of
3 × 64 × 64.
To make training more stable, the DCGAN paper also made several improvements: the fully connected layers are
removed from the discriminator network and pooling is replaced by strided convolution; both the generator and
the discriminator use batch normalization (batchnorm); the generator uses the ReLU activation in all layers
except the output layer, which uses tanh; and all discriminator layers use the LeakyReLU activation.
In order to implement DCGAN, transposed convolution must be implemented first. The principle and
implementation of transposed convolution are discussed below.
For an input vector x = (x_0, x_1, x_2, x_3, x_4) of length 5, perform a 1D convolution with a kernel of width 3,
a stride of 1, and a padding of 0; the process is shown in Figure 8-36:
Figure 8-36 A 1D convolution on an input vector x = (x_0, x_1, x_2, x_3, x_4) of length 5 with kernel width 3,
stride 1, and padding 0
If the input length is n, the kernel width is k, the stride is s, and the left and right padding is p, the length of the
result tensor produced by the convolution is o = (n − k + 2p)/s + 1. For the example above, the resulting length is
o = (5 − 3 + 0)/1 + 1 = 3. That is, without padding, the convolution output is usually shorter than the input.
A convolution computes one output value by multiplying and accumulating a data block of the same shape as the
convolution kernel with the kernel, that is, one data block and the kernel together produce one output value. The
kernel moves along the data by strides, producing one output value for each data block it encounters.
Transposed convolution is just the opposite: for each element of the input tensor, that element is multiplied by
each element of the convolution kernel, producing an output of the same shape as the kernel; that is, each input
element produces as many output values as there are kernel elements, as shown in Figure 8-37:
Figure 8-37 Transposed convolution: each element of the input tensor is multiplied by each element of the
convolution kernel, producing an output of the same shape as the convolution kernel
The convolution kernel of width 3 is aligned with the first element x_0 of the input x = (x_0, x_1, x_2), and x_0
is multiplied by each element of the kernel; since the kernel width is 3, this produces 3 output values. In a
transposed convolution with a stride of 1, the kernel then slides to x_1 and again produces 3 output values, and so
on until the last element of the input, as shown in Figure 8-38:
Figure 8-38 A convolution kernel with a width of 3 acts on an input one-dimensional tensor with a length of three
to generate a one-dimensional tensor composed of five elements
Figure 8-39 Transposed convolution process of convolution kernel (1,2,-1) and input one-dimensional tensor
(5,15,12)
As shown in Figure 8-40, in the transposed convolution operation, the three values output by element-by-element
multiplication of each element and the convolution kernel are accumulated to the output vector elements at the
corresponding positions.
Figure 8-40 The three values output by multiplying each element and the convolution kernel element by element
are accumulated to the elements of the output tensor at the corresponding position
If the output z of the convolution in Figure 8-36 is used as the input of a transposed convolution, and x is used as
the output of the transposed convolution, the transposed computation process is shown in Figure 8-41:
Figure 8-41 Transposed convolution is the reverse process of convolution. The output z of the convolution in
Figure 8-36 is used as the input of the transposed convolution, and the transposed convolution produces an output
with the same shape as the convolution input.
It can be seen that the computation of transposed convolution is the reverse of the convolution process, just as
the reverse derivation of convolution is. Therefore, the computation of transposed convolution is completely
analogous to the reverse derivation of convolution: each input value is distributed and accumulated into the
output vector through the convolution kernel.
From the relationship between a convolution's output and its input length, stride, and padding, the corresponding
relationship for transposed convolution can be derived: if the input tensor length is o, the stride is s, and the left
and right padding is p, the length of the result tensor produced by the transposed convolution is
n = (o − 1)·s + k − 2p. For the transposed convolution in the example above, the resulting length is
(3 − 1)·1 + 3 − 0 = 5.
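Both length formulas are easy to check in code; a minimal helper:
def conv_out_len(n, k, s=1, p=0):
    # ordinary convolution: o = (n - k + 2p)/s + 1
    return (n - k + 2*p) // s + 1

def convT_out_len(o, k, s=1, p=0):
    # transposed convolution: n = (o - 1)*s + k - 2p
    return (o - 1)*s + k - 2*p

print(conv_out_len(5, 3))    # 3
print(convT_out_len(3, 3))   # 5, recovering the original input length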
Convolution can use matrix multiplication to realize its forward calculation and reverse derivation. Therefore,
transposed convolution can also use matrix multiplication to realize its forward calculation and reverse derivation.
The forward calculation of transposed convolution is completely similar to the reverse derivation of convolution,
and the reverse derivation of transposed convolution is similar to the forward calculation of convolution.
Looking back at the reverse derivation of 1D convolution in Section 6.3.3, the calculation formula is:

dx_row = dz_row K_col^T

from which the matrix multiplication formula for the forward calculation of transposed convolution is obtained:

z_row = x_row K_col^T

where x_row is the (flattened) input of the transposed convolution, z_row is its output, and K_col is the
column-vector representation of the convolution kernel. For the specific example above, the calculation process
is:
z_row = x_row k_col^T = [x_0; x_1; x_2] [k_0 k_1 k_2] = [5; 15; 12] [1 2 −1] =
⎡  5  10  −5 ⎤
⎢ 15  30 −15 ⎥
⎣ 12  24 −12 ⎦
As in the reverse derivation of convolution, each row of this flattened z_row represents one allocation
computation, and each row must be accumulated into the corresponding positions of the final output z (as shown
in Figure 8-40). The transformation of z_row into z can be completed with the function row2im(), the same
function that the convolution reverse derivation uses to transform dx_row into dx.
Therefore, the forward calculation of the transposed convolution is generally performed according to the reverse
derivation process of the convolution. Similarly, the reverse derivation of the transposed convolution can be
performed according to the forward calculation process of the convolution. As an exercise, readers can try to
write the code for the forward calculation and reverse derivation of 1D transposed convolution.
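As a starting point for that exercise, here is a minimal sketch of a 1D transposed convolution using the scatter-and-accumulate view described above (the function name is hypothetical):
import numpy as np

def conv1d_transpose(x, k, s=1, p=0):
    # each input element scatters a scaled copy of the kernel into the output
    n, kw = len(x), len(k)
    out = np.zeros((n - 1) * s + kw)
    for i, xi in enumerate(x):
        out[i*s : i*s + kw] += xi * k
    # padding removes p elements from each end of the full output
    return out[p:len(out)-p] if p > 0 else out

x = np.array([5., 15., 12.])
k = np.array([1., 2., -1.])
print(conv1d_transpose(x, k))   # [  5.  25.  37.   9. -12.]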
Let's look at some transposed convolutions with different strides and paddings. Figure 8-42 shows a transposed
convolution with an input length of 3, a convolution kernel length of 3, a stride of 2, and a padding of 0:
Figure 8-42 Transposed convolution with input length 3, convolution kernel length 3, stride 2, and padding 0
Figure 8-43 shows a transposed convolution with an input length of 3, a convolution kernel length of 3, a stride
of 2, and a left and right padding of 1:
Figure 8-43 Transposed convolution with an input length of 3, a convolution kernel length of 3, a stride of 2, and
a left and right padding length of 1
It can be seen that when the padding length is 1, the leftmost and rightmost elements of the output are not
counted in the output tensor. By viewing transposed convolution as the inverse process of convolution, its
calculation process including stride and padding can be understood.
As shown in Figure 8-44, a 4 × 4 input convolved with a 3 × 3 kernel (stride 1, padding 0) produces an output of
shape 2 × 2. The same figure can also be read with the 2 × 2 matrix above as the input and the matrix below as
the output: a 2 × 2 input goes through a transposed convolution with a 3 × 3 kernel, a stride of 1, and a padding
of 0 to produce a 4 × 4 output.
Figure 8-44 2D transposed convolution is the reverse process of 2D convolution: the input is the 2 × 2 matrix
above, and after the 3 × 3 convolution kernel, the 4 × 4 matrix below is output
The matrix multiplication implementation of 2D transposed convolution is the same as that of 1D transposed
convolution: the reverse derivation and forward calculation processes of the corresponding 2D convolution are
used to implement the forward calculation and reverse derivation of the 2D transposed convolution, respectively.
Therefore, it suffices to modify the previously implemented convolution class Conv_fast, turning the backward()
and forward() methods of Conv_fast into the forward() and backward() methods of the transposed convolution
class Conv_transpose. For example, the input x can be regarded as the gradient of a loss function with respect to
a convolution output; it is first flattened into a matrix of shape (N·oH·oW, F), whose second axis is the number
of channels and whose first axis enumerates the elements, giving the matrix form x_row. The code is:
X_row = X.transpose(0,2,3,1).reshape(-1,F)
Then, according to the reverse derivation formula, the flattened matrix Z_row of the output of this transposed
convolution can be calculated, namely:
Z_row = np.dot(X_row,K_col.T)
Finally, following the reverse derivation process of convolution, each row of Z_row must be accumulated into
the final output Z. This can be done directly with the row2im() or row2im_indices() function:
Similarly, the reverse derivation of transposed convolution is similar to the forward calculation of convolution.
The gradient dZ of the loss function with respect to Z is first flattened by the im2row() or im2row_indices()
function into a matrix dZ_row, each row of which represents one data block; then the gradient with respect to the
transposed convolution's input X is computed:
dX_row = dZ_row @ K_col
This yields the flattened matrix dX_row of the gradient with respect to X; finally, this flattened matrix is
reshaped into a four-dimensional tensor of the same shape as X, namely:
dX = dX_row.reshape(N,self.H,self.W,self.C)
dX = dX.transpose(0,3,1,2)
Similarly, the gradient dK_col for the kernel K is computed in the same way, and it is then reshaped into the
same shape as K:
dK_col = self.X_row.T@dZ_row
dK = dK_col.reshape(self.K.shape)
According to the above analysis, the following transposed convolution class Conv_transpose can be written:
class Conv_transpose():
def __init__(self, in_channels, out_channels, kernel_size, stride=1,padding=0):
super().__init__()
self.C = in_channels
self.F = out_channels
self.kH = kernel_size
self.kW = kernel_size
self.S = stride
self.P = padding
# filters is a 3d array with dimensions (num_filters, self.K, self.K)
# you can also use Xavier Initialization.
#self.K = np.random.randn(self.F, self.C, self.kH, self.kW)
#/(self.K*self.K)
# self.K = np.random.randn(self.C, self.F, self.kH, self.kW)
#/(self.K*self.K)
self.K = np.random.normal(0,1,(self.C, self.F, self.kH, self.kW))
self.b = np.zeros((1,self.F)) #,1))
self.params = [self.K,self.b]
self.grads = [np.zeros_like(self.K),np.zeros_like(self.b)]
self.X = None
self.reset_parameters()
def reset_parameters(self):
kaiming_uniform(self.K, a=math.sqrt(5))
if self.b is not None:
fan_in, _ = calculate_fan_in_and_fan_out(self.K)
#fan_in = self.F
bound = 1 / math.sqrt(fan_in)
self.b[:] = np.random.uniform(-bound,bound,(self.b.shape))
def forward(self,X):
'''
X: (N,C,H,W)
K: (F,C,kH,kW)
Z: (N,F,oH,oW)
X_row: (N*oH*oW, C*kH*kW)
K_col: (C*kH*kW, F)
Z_row = X_row*K_col: (N*oH*oW, C*kH*kW)*(C*kH*kW, F) = (N*oH*oW, F)
K = self.K
# Convert (N,F,oH,oW) to (N,oH,oW,F) and flatten to (-1,F)
F = X.shape[1]
#assert(F==self.F)
X_row = X.transpose(0,2,3,1).reshape(-1,F) #(N*oH*oW,F)
K_col = K.reshape(K.shape[0],-1).transpose() #Flattening
Z_row = np.dot(X_row,K_col.T)
Z_shape = (self.N,self.F,self.oH,self.oW)
Z = row2im_indices(Z_row,Z_shape,self.kH,self.kW,S =self.S,P = self.P)
self.b = self.b.reshape(1,self.F,1,1)
Z+= self.b
self.X_row = X_row
return Z
def __call__(self,X):
return self.forward(X)
def backward(self,dZ):
N,F,oH,oW = dZ.shape[0], dZ.shape[1],dZ.shape[2], dZ.shape[3]
S,P,kH,kW = self.S, self.P,self.kH,self.kW
dZ_row = im2row_indices(dZ,self.kH,self.kW,S=self.S,P=self.P)
K_col = self.K.reshape(self.K.shape[0],-1).transpose() #Flattening
db = np.sum(dZ,axis=(0,2,3))
db = db.reshape(-1,F)
# (N*H*W, C)
dX = dX_row.reshape(N,self.H,self.W,self.C)
dX = dX.transpose(0,3,1,2)
self.grads[0] += dK
self.grads[1] += db
return dX
def reg_loss(self,reg):
return reg*np.sum(self.K**2)
def reg_loss_grad(self,reg):
self.grads[0]+= 2*reg * self.K
return reg*np.sum(self.K**2)
Note: People sometimes simulate the computation of transposed convolution with an ordinary convolution
operation, but this simulation is complicated and computationally expensive, so it has little practical significance.
Interested readers can refer to the following URL:
http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html
First, still read the MNIST handwritten digit set as training samples:
import data_set as ds
import matplotlib.pyplot as plt
%matplotlib inline
# train_X: MNIST images loaded via the data_set module (shape (50000, 784), values in [0, 1])
ds.draw_mnists(plt, train_X, range(10))
plt.show()
train_X = train_X.reshape(train_X.shape[0], 1, 28, 28)
print(train_X.shape)
float32
(50000, 784)
int64
(50000,)
Then use the transposed convolution class and the convolution class and other network layer classes to define the
neural networks G and D representing the generator and discriminator respectively:
np.random.seed(100)
random_name = 'no'
random_value = 0.01
G = NeuralNetwork()
z_dim = 100
ngf=28
ndf=28
nc=1
def weights_init(layer):
    classname = layer.__class__.__name__
    if classname.find('Conv') != -1:
        W = layer.params[0]
        W[:] = np.random.normal(0.0, 0.02, (W.shape))
    elif classname.find('BatchNorm') != -1:
        W = layer.params[0]
        W[:] = np.random.normal(1.0, 0.02, (W.shape))
        b = layer.params[1]
        b[:] = 0
G.apply(weights_init)
Finally, train the DCGAN network model with the previous GAN training process, where show_result_mnist() is
an auxiliary function to display intermediate results.
def show_result_mnist(D_losses=None, G_losses=None, m=10):
    #fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    #ax1.clear()
    if D_losses and G_losses:
        i = np.arange(len(D_losses))
        plt.plot(i, D_losses, '-')
        plt.plot(i, G_losses, '-')
        plt.legend(['D_losses', 'G_losses'])
        plt.show()
    ##ax2.clear()
    z = np.random.randn(m, z_dim)
    x_fake = G(z)
    ds.draw_mnists(plt, x_fake, range(m))
    plt.show()
#noise_it = iter(Noise_z(batch_size,z_dim))
noise_it = noise_z_iterator(batch_size, z_dim)
iterations = 1500
start = time.time()
loss_fn = BCE_loss_grad
n_epoch = 20 #200
print_n = 20
for epoch in range(1, n_epoch+1):
    D_losses, G_losses = [], []
    data_it = data_iterator_X(train_X, batch_size, shuffle=True, repeat=False)
    for batch_idx, x_real in enumerate(data_it):
        x_fake = G(next(noise_it))
        D_loss = D_train(D, D_optimizer, x_real, x_fake, loss_fn, reg)
        G_loss = G_train(D, G, G_optimizer, next(noise_it), loss_fn, reg)
        D_losses.append(D_loss)
        G_losses.append(G_loss)
        #if batch_idx>10: break
        if batch_idx % print_n == 0:
            print('[%d/%d]: loss_d: %.3f, loss_g: %.3f' % (
                batch_idx, epoch, np.mean(np.array(D_losses)),
                np.mean(np.array(G_losses))))
    show_result_mnist(D_losses, G_losses)
    D.save_parameters('MNIST_DCGAN_D_params.npy')
    G.save_parameters('MNIST_DCGAN_G_params.npy')
    print('[%d/%d]: loss_d: %.3f, loss_g: %.3f' % (
        epoch, n_epoch, np.mean(np.array(D_losses)),
        np.mean(np.array(G_losses))))
done = time.time()
elapsed = done - start
print("Training time: %d seconds" % (elapsed))
Please download the complete code from the author's blog (https://hwdong-net.github.io).
Figure 8-45 shows intermediate results from the 11th epoch of the training process.
References:
[1] Saito Yasuhiro. Introduction to Deep Learning: Theory and Implementation Based on Python [M]. Beijing:
People's Posts and Telecommunications Press, 2018.
[2] Nielsen, Michael A. Neural Networks and Deep Learning [M]. San Francisco, CA: Determination Press,
2015. Website: http://neuralnetworksanddeeplearning.com.
[3] Aston Zhang, Mu Li, Zachary C. Lipton, Alexander J. Smola. Dive into Deep Learning [M]. Website:
https://zh.d2l.ai/d2l-zh.pdf. 2020.
[5] Stanford University. CS231n: Convolutional Neural Networks for Visual Recognition. URL:
http://cs231n.stanford.edu/. 2019.
[6] Andrew Ng. Unsupervised Feature Learning and Deep Learning Tutorial. Website:
http://ufldl.stanford.edu/tutorial/StarterCode/. 2018.
[7] Rosenblatt, Frank. The Perceptron: A Probabilistic Model for Information Storage and Organization in the
Brain [J]. Psychological Review, 1958, 65(6): 386-408.
[8] Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities [C].
Proc. Natl. Acad. Sci. U.S.A., 1982, 79(8): 2554-2558.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document
recognition [C]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[11] Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. ImageNet classification with deep convolutional
neural networks [J]. Communications of the ACM, 2017, 60(6): 84-90.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Delving Deep into Rectifiers: Surpassing Human-
Level Performance on ImageNet Classification [C]. 2015 IEEE International Conference on Computer Vision
(ICCV), 2015.
[13] Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing
Internal Covariate Shift [J]. 2015, arXiv:1502.03167.
[14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. Dropout: A
Simple Way to Prevent Neural Networks from Overfitting [J]. Journal of Machine Learning Research, 2014,
15(56): 1929-1958.
[15] Afshine Amidi, Shervine Amidi. Deep Learning Tips and Tricks cheatsheet. URL:
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-deep-learning-tips-and-tricks. 2019.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren. Deep Residual Learning. 2015, arXiv:1512.03385.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition [C].
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[18] Sepp Hochreiter, Jürgen Schmidhuber. Long short-term memory [J]. Neural Computation, 1997, 9(8):
1735-1780.
[19] Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk,
Holger; Bengio, Yoshua. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine
Translation. 2014, arXiv:1406.1078.
[21] Diederik P Kingma, Max Welling. Auto-Encoding Variational Bayes. 2013, arXiv:1312.6114.
[22] Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil;
Courville, Aaron; Bengio, Yoshua. Generative Adversarial Nets [C]. Proceedings of the International Conference
on Neural Information Processing Systems, 2014: 2672-2680.
[23] Martin Arjovsky, Soumith Chintala, Léon Bottou. Wasserstein GAN. 2017, arXiv:1701.07875.
[25] Alec Radford, Luke Metz, Soumith Chintala. Unsupervised Representation Learning with Deep
Convolutional Generative Adversarial Networks. 2015, arXiv:1511.06434.
[26] Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros. Unpaired Image-to-Image Translation using
Cycle-Consistent Adversarial Networks. 2017, arXiv:1703.10593.