HCIP-AI-EI Developer V2.0 Training Material

Artificial Intelligence And Applications

• Artificial intelligence is a long-standing goal: to make computers as smart as people.

• Machine learning is one method of achieving that goal.

• Deep learning is one of the methods of machine learning.


• The rule-based approach is like the instinct of a machine. The machine will
always follow the program.
• Machine learning is a capability of the machine, and data is the key to machine
learning.
• Machine learning automatically searches for rules in data, but we need to
divide the entire data set into a training set and a test set.
• Self-supervised learning was proposed by Yann LeCun in 2019 and is often used
for model training in natural language processing, for example Word2Vec and BERT.
Self-supervised methods have also gradually been proposed for images and speech.
Since manual labeling is not required, self-supervised learning will gradually
become the mainstream training method in the future.
• Semi-supervised learning has two main methods: one is unsupervised clustering
first, then supervised learning; the other is self-learning.
• As a model based on unsupervised feature learning and feature hierarchy
learning, deep learning has great advantages in fields such as computer vision,
speech recognition, and natural language processing.

• The two kinds of learning are compared from five aspects.


• In fact, the neural network is not a very new concept. As early as 1957, the
precursor of the neural network, the perceptron, was proposed.
• After 1989, "multilayer perceptron" became almost a dirty word: if a paper
mentioned multilayer perceptrons, it would almost certainly be rejected. So,
people came up with a solution: grander new names, "neural network" and "deep
learning".
• The restricted Boltzmann machine (RBM) was proposed as early as 1986, but it
was not widely known until around 2006, when Hinton proposed a fast learning
algorithm for it. In fact, the promotion of the restricted Boltzmann machine in
2006 is more like a "stone soup" story: the RBM acted as a catalyst that
returned the neural network to the public eye. Subsequent research also showed
that RBM initialization is not essential.
• In the picture, w0 and w5 indicate bias.
• If there is no activation function, stacking many layers is equivalent to using a
single layer, so the hidden layers would have no meaning, as the sketch below shows.
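A minimal numpy sketch (not from the slides) of why this is so: two stacked linear layers with weights W1 and W2 are exactly one linear layer with weight W1 @ W2 (biases omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # batch of 4 inputs, 3 features
W1 = rng.normal(size=(3, 5))     # "layer 1" weights
W2 = rng.normal(size=(5, 2))     # "layer 2" weights

two_layers = x @ W1 @ W2         # no non-linearity in between
one_layer = x @ (W1 @ W2)        # a single equivalent layer

print(np.allclose(two_layers, one_layer))  # True
```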
• The sigmoid function is monotonic, continuous, and easy to differentiate. The
output is bounded, and the network converges easily. However, the derivative of
the sigmoid function is close to 0 away from the central point. When the network
is very deep, more and more backpropagated gradients fall into the saturation
area, so the gradient magnitude becomes smaller and smaller. Generally, once a
sigmoid network has more than about five layers, the gradient decays toward 0
and the network is difficult to train. This phenomenon is the vanishing
gradient. In addition, the output of the sigmoid is not zero-centered.
• Tanh function and sigmoid function have similar shortcomings. The derivative of
the tanh function is nearly 0 at its extremes. However, because the tanh function
is symmetric with respect to the origin, the average of the outputs is closer to 0
than that of the sigmoid function. Therefore, SGD can reduce the required
number of iterations because it is closer to the natural gradient descent.
• Advantages:
▫ Compared with sigmoid and tanh, ReLU supports fast convergence in SGD.
▫ Compared with the sigmoid and tanh functions involving exponentiation,
the ReLU can be implemented more easily.
▫ The vanishing gradient problem can be effectively alleviated.
▫ The ReLU has a good performance during unsupervised pre-training.
• Disadvantages:
▫ There is no upper bound, so training can diverge relatively easily.
▫ The ReLU is not differentiable at x = 0, and a derivative is forcibly defined
at this point.
▫ The surface defined at the zero point is not smooth enough for some
regression problems.
• Reduces the computation workload
▫ When functions such as sigmoid are used, the activation function involves
exponent operation, which requires a large amount of computation. When
the error gradient is calculated through backpropagation, the derivation
involves division and the computation workload is heavy. However, the
ReLU activation function can reduce much of the computation workload.
• Effectively mitigates the vanishing gradient problem.
▫ The ReLU gradient is unsaturated. When the sigmoid function is close to
the saturation area (far from the function center), the transformation is too
slow and the derivative is close to 0. Therefore, in the backpropagation
process, the ReLU function mitigates the vanishing gradient problem, and
parameters of the first several layers of the neural network can be quickly
updated.
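A small numpy sketch (illustrative only) of the saturation argument: the sigmoid derivative is at most 0.25 and decays quickly away from the center, while the ReLU derivative stays at 1 for all positive inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25, near 0 away from the center

def relu_grad(x):
    return float(x > 0)           # exactly 1 for all positive inputs

for x in [0.0, 2.0, 5.0]:
    print(f"x={x}: sigmoid'={sigmoid_grad(x):.4f}, relu'={relu_grad(x):.1f}")

# Ten chained sigmoid layers at best shrink the gradient by 0.25**10:
print(0.25 ** 10)                 # about 1e-6
```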
• So, how do we choose the α value? The SELU activation function gives a definite answer.
• Reference:
https://www.youtube.com/watch?v=1WPjVpwJ88I&list=PLJV_el3uVTsPMxPbjeX7PicgWbY7F8wW9&index=11

• Advantage: The SELU avoids the zero-gradient region of the ReLU and achieves
an effect similar to batch normalization.

• Disadvantage: According to the theoretical derivation, data features need to be
standardized, and the average weight value during training cannot be guaranteed
to be 0. Therefore, an approximate estimate can be made only by initializing the
average value to 0.
• Swish was proposed by Google. It was discovered among many candidate functions
using meta-learning methods. In the paper, Swish performs better than SELU and
GELU on most tasks. However, because Swish was found through meta-learning, there
is not much theoretical justification, and the hyperparameter beta needs to be
tuned.

• Meta-learning: uses another network to predict the hyperparameters of the
target network. Learning to learn.
• Z is the output value of the neuron before it enters the activation function, and is
also the input value of the activation function.
• Through forward propagation, we can get the output of all the neurons.
• Reasons for the vanishing gradient:
▫ First, the network is deep.
▫ Second, an inappropriate activation function, such as the sigmoid, is used.

• Gradient explosion: this problem occurs when the initial weight values are too large
in a deep network.
• The momentum optimizer accelerates training, reduces oscillation, and helps the
optimizer escape local extrema.
• The Adagrad optimizer is similar to playing golf. At the beginning, the ball is far
away from the target point, and a relatively large force is used to reduce
the number of updates. When the ball is near the target point, we need to
use a relatively small force to get the ball into the hole.
• Adam attempts to calculate momentum and an adaptive learning rate for each
parameter, which is useful in complex network structures because different parts
of the network have different sensitivities to weight adjustment. Very sensitive
parts generally require lower learning rates. Manually identifying the sensitive
parts and setting special learning rates for them would be difficult or
cumbersome. Adam is probably the best general-purpose optimizer so far.
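A minimal sketch of the standard Adam update rule (default hyperparameters assumed; the quadratic toy objective is illustrative only):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus a per-parameter adaptive rate (v)."""
    m = beta1 * m + (1 - beta1) * grad           # 1st moment: average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # 2nd moment: average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # sensitive parameters (large v) take smaller steps
    return w, m, v

# Minimize f(w) = w**2, whose gradient is 2w.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)  # close to 0
```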
• The graph shows that when there are only two hidden layers, the performance
improvement of the network is limited regardless of the number of neurons.
Adding one more hidden layer improves the performance of the model
significantly.
• This is similar to cutting a paper window flower: folding the paper a few more
times means fewer cuts are required.
• ResNet suggests that when the network is too deep, performance deteriorates
because of the vanishing gradient problem, and proposes a residual structure
to alleviate it.
• Answers:
▫ BCD
• Image source: stanford.edu CS231n
• 8-bit images are most commonly used in computers.
• YUV: Human eyes are more sensitive to brightness. That is, the UV data can be
compressed in a way that is difficult for human eyes to detect. Therefore, the
first step of a compression algorithm is to convert RGB data into YUV data, then
compress Y less and UV more to balance the image quality and compression ratio.
• Principle of the weighted mean value method: The human eye is most sensitive
to green, followed by red and blue (least sensitive).
• The intensity histogram reflects the frequencies of pixels with different gray
levels in an image. The intensity histogram is a relationship graph, which uses the
gray level as the horizontal coordinate and the frequency as the vertical
coordinate. The intensity histogram is an important feature of an image and
reflects the intensity distribution of the image.

• Using a histogram for image transformation is a method based on probability
theory. By changing the histogram of an image, and thereby the intensity value
of each pixel, the visual effect of the image is enhanced.
• While grayscale transformation operates on single pixels of an image, histogram
transformation considers the intensity value distribution of the entire image.
• Histogram equalization adjusts the histogram of the original image, that is, the
intensity probability distribution, to equalized distribution. In this way, the
intensity is equalized, and the overall contrast of the image can be effectively
enhanced.
• Histogram equalization can automatically calculate the transformation function,
which can adaptively generate the output image with an equalized histogram.
Images that are too dark, too bright, and not clear can be effectively enhanced.

• In common image processing libraries, histogram operations are implemented by
invoking APIs, as in the sketch below.
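For example, a minimal OpenCV sketch (the file names are placeholders):

```python
import cv2

# Read an image in grayscale.
img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Intensity histogram: 256 bins over gray levels 0-255.
hist = cv2.calcHist([img], [0], None, [256], [0, 256])

# Histogram equalization: the transform is computed automatically from the histogram.
equalized = cv2.equalizeHist(img)
cv2.imwrite("equalized.jpg", equalized)
```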
• The upper left figure shows the original image; the lower left figure shows the
histogram of the original image; the upper right figure shows the histogram
equalization result; the lower right figure shows the histogram after processing.
• The upper left figure shows the original image. The lower left figure shows the
specification template, which restricts the intensity value so that the image is not
too bright.

• The upper right figure shows the specification result, and the lower right figure
shows the result histogram. Compared with histogram equalization, the details of
the sky in the image after histogram specification are clearer.
• Grayscale transformation mentioned earlier operates on single pixels, and its
output values are related only to the pixels. Histogram transformation applies to
an image globally and its output values are related to the entire image.

• Different from the preceding two methods, spatial filtering is a local processing
method. The output value of pixel 𝑃 is determined by the pixel values of 𝑃 and its
neighborhood 𝑁 in the input image.
• The calculation method is to perform a template operation on pixel values of
neighborhood 𝑁 and a subimage with the same size as the neighborhood. The
subimage is referred to as a template or filter.
• Common template operations include template convolution and template sorting.
• The two processing methods are commonly used and suitable for different
algorithms. In complex image processing tasks, the two methods are frequently
switched and combined to complete a task.
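A minimal OpenCV sketch of both template operations (the file name is a placeholder): template convolution with a 3x3 mean filter, and template sorting with a 3x3 median filter.

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Template convolution: each output pixel is the weighted sum of P and its neighborhood N.
kernel = np.ones((3, 3), np.float32) / 9.0   # 3x3 mean-filter template
smoothed = cv2.filter2D(img, -1, kernel)

# Template sorting: each output pixel is the median of the 3x3 neighborhood.
median = cv2.medianBlur(img, 3)
```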
• Husky, Golden retriever, and Border collie
• Information can be divided into two categories. One type can be represented by
numbers or a unified structure and is called structured data, such as numbers
and symbols. The other type, such as text, images, sound, and web pages, cannot
be represented by numbers or a uniform structure. We call it unstructured data.

• Pattern recognition: the study of the automatic processing and interpretation of
patterns by computer-based mathematical techniques. We call the environment and
the object together a "pattern".
• A binary image plays a very important role in digital image processing; it can
be used for connected-domain extraction, morphological processing, and ROI
segmentation.

• The binarization of the image reduces the amount of data in the image and
highlights the contour of the object. The set property of the image after
binarization is related only to the position of the point whose pixel value is 0 or
255, and does not involve the multi-level value of the pixel, which makes the
processing simple and reduces the amount of data to be processed in the
calculation process.
• To adapt to unevenly illuminated images, adaptive thresholding also derives
local adaptive threshold segmentation, which calculates thresholds for different
local regions of the image to achieve the best segmentation effect.
• According to the actual problem, selecting or designing an appropriate adaptive
threshold algorithm can greatly improve the segmentation effect and subsequent
processing.

• Bimodal method

▫ The histogram bimodal method is an easy-to-understand adaptive thresholding
algorithm. If the gray histogram of the image is clearly bimodal, the gray
value corresponding to the valley between the two peaks is selected as the
threshold for image segmentation. If the gray histogram of the image is flat
or has multiple peaks, the bimodal method cannot obtain a proper threshold.
• OTSU

▫ The OTSU algorithm divides the image into foreground and background
according to its gray-level features. The greater the inter-class variance
between the foreground and background, the greater the difference between
the two parts that make up the image. If the threshold is chosen incorrectly,
the inter-class variance between foreground and background decreases.
Therefore, OTSU traverses all possible thresholds and selects the segmentation
threshold that maximizes the inter-class variance as the optimal threshold,
as the sketch below shows.
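A minimal OpenCV sketch of OTSU thresholding (the file name is a placeholder):

```python
import cv2

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# OTSU searches all thresholds and picks the one maximizing inter-class variance;
# the threshold argument (0 here) is ignored when THRESH_OTSU is set.
t, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("Otsu threshold:", t)
```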
• First, we divide the image into small connected areas called cell units. Then
we collect the gradient magnitudes and edge directions of the pixels in each
cell unit and accumulate them into a one-dimensional gradient histogram per cell.

• To achieve better invariance to illumination and shadows, contrast normalization
is applied to the histograms over larger regions of the image (called intervals
or blocks). First, we calculate the density of each histogram in the block, and
then normalize the cells in the block based on this density. The normalized
block descriptors are called HOG descriptors, as in the sketch below.
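A minimal scikit-image sketch of extracting a HOG descriptor (the file name, and the 9-bin / 8x8-cell / 2x2-block settings, follow common defaults rather than the slides):

```python
from skimage import io
from skimage.feature import hog

img = io.imread("input.jpg", as_gray=True)

# 9 orientation bins per 8x8-pixel cell, contrast-normalized over 2x2-cell blocks.
features = hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")
print(features.shape)   # one flat HOG descriptor for the image
```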
• Image source: stanford.edu CS231n
• The support vector machine (SVM) is a basic classification algorithm in pattern
recognition. The SVM can be used to complete image classification and object
detection tasks based on the feature vectors extracted using the HOG and LBP.
• The basic idea of the SVM is to find the best classification interval. For linearly
inseparable data, the SVM uses a kernel function to implicitly map data into a
high-dimensional feature space to become linearly separable.
• Adaptive boosting (AdaBoost) is an adaptive boosting algorithm, which can
implement efficient binary classification. The AdaBoost algorithm is used to
combine multiple weak classifiers to form a strong classifier. A weak classifier
generally uses a single-layer decision tree model.
• AdaBoost trains a single weak classifier in each iteration. The adaptation is
embodied as follows: the weights of samples misclassified in the (N-1)th
iteration are increased in the Nth iteration, the weights of correctly
classified samples are decreased, and the re-weighted samples are used to train
the next weak classifier. Each weak classifier has a corresponding weight: a
weak classifier with a small classification error rate gets a large weight and
plays a greater role in the final classification function, while a weak
classifier with a large classification error rate gets a small weight.
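A minimal scikit-learn sketch of AdaBoost with decision stumps as weak classifiers (synthetic data stands in for HOG/LBP feature vectors; the `estimator` keyword assumes scikit-learn >= 1.2):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy binary-classification data standing in for extracted feature vectors.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Weak classifier: a single-layer decision tree (decision stump).
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```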
• The CNN determines the type of an object by its details and contour, and the
result is not affected by the direction, angle, and illumination. The CNN uses local
receptive fields to implement this function.
• Objects in different positions of an image are determined to be of the same type.
The CNN uses weight sharing to implement this function.
• Local perception generally proceeds from local to global. In an image, nearby
pixels are strongly correlated spatially, while distant pixels are only weakly
correlated. Therefore, each neuron does not need to perceive the whole image; it
only needs to perceive local information, and the local information is then
synthesized at a higher level to obtain the global information. The idea of a
partially connected network is also inspired by the structure of the biological
visual system: neurons in the visual cortex receive information locally (that
is, they respond only to stimulation in specific regions).
• Parameter sharing: For an input photo, one or more filters are used to scan the
photo. The parameters of a filter are its weights. If the same filter, with
unchanged parameters, is used to scan the entire image, this is called parameter
sharing. For example, with three filters, each filter scans the entire image
while its parameter values stay fixed; that is, all positions of the image share
the same weights. The sketch below contrasts this with a fully connected layer.
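A minimal PyTorch sketch (sizes are illustrative) contrasting the parameter count of a shared-weight convolution with a fully connected layer over the same input:

```python
import torch.nn as nn

# A 3x3 convolution scanning a 32x32 RGB image with 16 filters:
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
conv_params = sum(p.numel() for p in conv.parameters())
print(conv_params)   # 16*3*3*3 + 16 = 448 weights, reused at every position

# A fully connected layer producing the same output size (16 x 30 x 30):
fc = nn.Linear(3 * 32 * 32, 16 * 30 * 30)
fc_params = sum(p.numel() for p in fc.parameters())
print(fc_params)     # about 44 million: no sharing, one weight per connection
```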
• ILSVRC: ImageNet Large Scale Visual Recognition Challenge
• Machine vision has achieved excellent results in the ILSVRC competition. Its error
rate is already lower than that of human vision. It is pointless to continue to hold
similar competitions. As a result, people's expectations for computer vision have
shifted from the mature image identification technology to the image
understanding technology that has yet to be developed.
• ILSVRC 2017 is the last to be held.
• The input layer, pooling layers, and softmax layer are not counted in the number
of layers: 13 convolutional layers and three fully connected layers.
• The author uses the combination of SENet block and ResNeXt.
• The innovation of the SENet network is to focus on the relationship between
channels. It is hoped that models can automatically learn the importance of
different channel features. To solve this problem, the SENet provides the
Squeeze-and-Excitation (SE) module.
• Essentially, the SE module performs the attention operation on the channel
dimension. This attention mechanism enables the model to focus on the channel
features with the largest amount of information and suppress the unimportant
channel features.
• The SE module is universal, which means that it can be embedded in the existing
network architecture.
• Generate proposals
▫ Use traditional Selective Search to generate candidate areas.
• Generate RoI features and adjust the size
▫ Crop the corresponding area of each proposal on the original image, align
the areas to the same scale, and extract features of each aligned area using
the neural network.
• Post-processing
▫ NMS, Soft NMS, and IoU-guided NMS.


• The preceding figure shows the architecture of Fast R-CNN. The feature extractor
obtains the feature map from the image; the Selective Search algorithm runs on
the original image, and each Region of Interest (RoI) is mapped onto the feature
map. RoI pooling is performed on each RoI to obtain feature vectors of the same
length. Positive and negative samples of the obtained feature vectors are sorted
out (maintaining a certain proportion of positive to negative samples) and
passed in batches to the parallel R-CNN subnetworks for classification and
regression, and the two losses are combined.
• Backbone: Darknet-53. To achieve a better classification effect, the author
designed and trained Darknet-53. Compared with ResNet-152 and ResNet-101,
Darknet-53 has similar classification precision but computes faster.
• To improve the accuracy of small object detection, YOLO v3 uses the upsample
and fusion method similar to FPN. (Three scales are integrated.) The sizes of the
other two scales are 26 x 26 and 52 x 52 respectively. Check the feature maps of
multiple scales.
• YOLO v3 has each grid cell predict three boxes, each detecting a different
receptive field:
▫ 32-fold downsampling gives the largest receptive field, which is suitable for
detecting large objects.
▫ 16-fold suits objects of common size; 8-fold gives the smallest receptive
field and suits small objects.
• Common classical feature detectors include the Canny edge detector, the Harris
corner detector, and the SIFT detector.
• The U-Net is a deep learning model used for biomedical image segmentation. The
network can be trained end-to-end with very few images and is very fast. The
U-Net is a fully convolutional network: the input and output are images, and
there is no fully connected layer.
• Through supervised deep learning training, the network can determine whether a
pixel belongs to the foreground or background pixel by pixel.
• Answer: 1 AB 2 ABC
Speech Processing Theory and Applications

1 Huawei Confidential
Foreword

⚫ Speech technology has gradually changed the way we live and work in
recent years. Voice has become a major way for people to communicate
with some devices, including voice assistants, smart speakers, and robots.
So how does the machine understand what we say and answer our
questions? This chapter will take you to unveil speech processing
technology.

2 Huawei Confidential
Objectives

On completion of this course, you will be able to:


 Understand the basics and applications of speech processing.
 Master basic steps of speech recognition.
 Master main text-to-speech synthesis technologies.
 Understand basic and advanced speech models.

3 Huawei Confidential
Contents

1. Speech Processing
◼ Overview of Speech Processing
 Speech Processing
 Speech Signal Analysis and Feature Extraction

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

4 Huawei Confidential
Overview of Speech Processing (1)
⚫ Speech signal processing is a general term for the study of various processing
technologies such as the speech production process, statistical properties of speech
signals, automatic speech recognition, machine synthesis, and speech perception.
⚫ Since modern speech processing technologies are based on digital computing and are
implemented by microprocessors, signal processors, or general-purpose computers, the
field is also called digital speech signal processing.

5 Huawei Confidential
Overview of Speech Processing (2)
⚫ The study of speech signal processing originated from the simulation of the vocal
organs.
⚫ Language information is mainly contained in the parameters of the speech signal, so
accurately and quickly extracting these parameters is the key.

6 Huawei Confidential
Main Application Scenarios of Speech Processing
⚫ Technology
 Speech preprocessing
 Speech recognition
 Speaker recognition
 Speech translation
 Speech synthesis
 Voiceprint recognition
 Speech coding
⚫ Scenario
 Man-machine interaction
 Security protection
 Smart home
 Smart city
 Elderly care
 Education
 Customer service

7 Huawei Confidential
Linguistics
⚫ There are certain rules followed when communicating with each other to ensure
smooth communication and correct information transmission. For example, we need to
comply with:
 Unified lexical meaning
 Uniform pronunciation
 Unified grammar specifications
 Unified writing mode
 ……
⚫ All these are the linguistic research contents to explore the essential laws of language.

8 Huawei Confidential
Linguistics
⚫ Linguistics is the science that takes language as its object of study. Its
research object is human language; its task is to study and describe the
structure, function, and historical development of language, find out the
essence of language, and explore its laws.
⚫ Phonology, grammar, vocabulary, and writing all focus on the structure of
language itself; they form the center of linguistics, called microlinguistics.
⚫ Texts are used to record and communicate ideas, or to carry language images or
symbols.

9 Huawei Confidential
Phonetics (1)
⚫ Phonetics is a branch of linguistics that studies human speech sounds. It mainly
studies the pronunciation mechanism, phonetic characteristics, and changes in speech.
⚫ Phonetics in a narrow sense focuses on the specific nature of speech sounds and the
methods of producing them. In contrast, phonology studies the abstract rule systems
and the distinguishing features of sounds within a language.
⚫ Phonetics in a broad sense refers to the sum of phonetics and phonology.

(Figure: linguistics branches into phonetics, which covers the essence of speech and
how it is generated, and phonology, which covers speech features and their
operational rules.)
10 Huawei Confidential
Phonetics (2)
⚫ Phonetics can be roughly divided into four main types:
 Articulatory phonetics: the study of how speech sounds are produced by the vocal
organs in the mouth (such as the lips, teeth, tongue, and vocal cords).
 Acoustic phonetics: the study of the acoustic analysis of speech sounds, such as the
frequency, duration, and amplitude of sound waves.
 Auditory phonetics: the study of how the human ear receives sound, i.e., the ear's
auditory perception of speech.
 Linguistic phonetics: the study of how sound combines with social environment,
personal habits, and the laws of language.
◼ People in the same area may pronounce the same word or sentence differently.
◼ People in different areas may pronounce the same word or sentence differently.
◼ The same person's pronunciation of the same word or sentence differs in different
situations and emotions.

11 Huawei Confidential
Contents

1. Speech Processing
 Overview of Speech Processing
◼ Speech Processing
 Speech Signal Analysis and Feature Extraction

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

12 Huawei Confidential
Voice Source
⚫ The vocal organs are divided into three parts: the sublaryngeal part, the larynx, and
the supralaryngeal part.
 The sublaryngeal part runs from the trachea to the lungs. The air flowing out of the
lungs becomes the sound source of speech.

 The laryngeal part consists mainly of the glottis and vocal cords. The vocal cords are
two ligaments that act as valves for the throat, closing and opening the glottis. When
the glottis is open, the air flows smoothly; otherwise the air bursts out and makes the
vocal cords vibrate periodically to produce sound.
 The supralaryngeal part includes the pharynx, mouth, and nasal cavity, which are
mainly used to shape speech.

13 Huawei Confidential
Voice data
⚫ Speech: a voice carrying language information. It is a combination of acoustic and
linguistic information.
⚫ Features of speech signals:
 Speech signals vary with time and are time-varying, non-stationary signals. However,
constrained by the oral muscles, they are relatively stable within a short time range
and can be regarded as a quasi-steady process; that is, speech signals have short-term
stationarity.
 Short-term analysis techniques are therefore used throughout the whole process of
speech signal analysis.
 Speech is produced by the movement of the oral muscles, which move very slowly
compared with the frequency of speech.
 Generally, the short time range is 10 ms to 30 ms.

14 Huawei Confidential
Speech Signal Preprocessing
⚫ Generally, voice files are in .wav format. In the following figure, the horizontal coordinate
indicates the number of sampling points, and the vertical coordinate indicates the amplitude.
⚫ The speech signals need to be preprocessed. The main problems of the speech signals are as
follows:
 The distribution of waveform data is very uneven.
 There are silent segments at the beginning and the end.
 The waveform contains noise.

15 Huawei Confidential
Speech Signal Preprocessing Procedure
⚫ Speech signal preprocessing includes:
 Digitalization: Discretizes analog speech signals collected from sensors into digital signals.
 Pre-emphasis: The purpose of pre-emphasis is to emphasize the high-frequency part of speech, eliminate
the impact of lip radiation, and increase the high-frequency resolution of speech.
 Endpoint detection: Identify and eliminate long-time silence segments from speech signals to reduce
interference from the environment to signals.
 Frame division: Short-term analysis is required because the voice is stable in a short time. That is, signals
are segmented. Each segment is called a frame (generally 10 ms to 30 ms).
 Windowing: The speech signal is divided into frames by weighting with a movable window
of limited length. The purpose of windowing is to reduce the truncation effect of speech
frames. Common windows are the rectangular window, the Hanning window, and the Hamming window.

16 Huawei Confidential
Speech Signal Preprocessing - Pre-emphasis
⚫ Lip radiation: Lip radiation causes energy loss, which is obvious in high frequency bands
and has little impact on low frequency bands. Therefore, we use a high-pass filter to
pre-emphasize the signal and enhance the resolution of the high-frequency part. A sketch follows.
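A minimal numpy sketch of the pre-emphasis filter y[n] = x[n] − α·x[n−1] (α = 0.97 is a typical coefficient, not specified in the slides):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```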

17 Huawei Confidential
Speech Signal Preprocessing – Frame blocking
⚫ Frame blocking: Partitioning the speech signals into fixed-length frames.
 Although a continuous segmentation method may be used for frame division, an overlapping
segmentation method is usually used to ensure smooth transition between frames and maintain
continuity. The overlap between the previous frame and the next frame is called a frame shift.

(Figure: overlapping frames, annotated with window size, frame shift, and overlap.)

18 Huawei Confidential
Speech Signal Preprocessing - Windowing
⚫ Windowing: The framed speech signal is then windowed, which minimizes discontinuities
at the start and end of each frame. The more frames the signal is split into, the
greater the error between the frames and the original signal. Windowing increases the
sharpness of harmonics and removes signal discontinuities by tapering the beginning and
end of each frame toward zero; it also reduces the spectral distortion caused by the overlap.
⚫ Different window functions affect the results of speech signal analysis. The rectangular
window has good smoothness, but details of the waveform are lost and leakage occurs. The
Hamming window can effectively overcome the leakage phenomenon and has the widest range
of application; a sketch follows.
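A minimal numpy sketch of frame blocking plus Hamming windowing (a 25 ms window and 10 ms shift at 16 kHz are assumed; the signal must be at least one frame long):

```python
import numpy as np

def frame_and_window(signal, frame_len=400, frame_shift=160):
    """Split a signal into overlapping frames (25 ms window, 10 ms shift at
    16 kHz) and apply a Hamming window to each frame."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames   # shape: (n_frames, frame_len)
```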

19 Huawei Confidential
Contents

1. Speech Processing
 Overview of Speech Processing
 Speech Processing
◼ Speech Signal Analysis and Feature Extraction

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

20 Huawei Confidential
Speech Features (1)
⚫ When analyzing an audio file, speech features are required that reflect the essence of
the voice and facilitate subsequent processing. Therefore, the design of speech features
is an important part of speech processing.
⚫ A speech feature is the core information describing speech and plays an important role
in building speech models.
⚫ Speech features:
 Contains valid information to distinguish phonemes: time domain resolution and frequency domain
resolution.
 Separate the fundamental frequency F0 and its harmonic components.
 Robustness for different speakers.
 Robustness to noise or channel distortion.
 Good pattern recognition characteristics: low-dimensional features, independent features.

21 Huawei Confidential
Speech Features (2)
⚫ The feature extraction methods are as follows:
 Linear Prediction Coefficients (LPC)
 Linear Prediction Cepstral Coefficients (LPCC)
 Discrete Wavelet Transform (DWT)
 Line Spectral Frequencies (LSF)
 Mel-Frequency Cepstral Coefficients (MFCC)
 Perceptual Linear Prediction (PLP)
⚫ The most commonly used speech feature in speech recognition and speaker
recognition is the Mel-Frequency Cepstral Coefficients (MFCC).

22 Huawei Confidential
Speech Analysis (1)
⚫ Speech signal analysis does not include speech signal preprocessing, but it does
include noise reduction and smoothing; these two processes are counted as speech
signal analysis.
⚫ Importance of speech signal analysis:
 The quality of speech synthesis and the recognition rate depend on the accuracy and precision of speech
signal analysis.
 Speech signal analysis is the basis and prerequisite of speech synthesis, speech recognition, speech
enhancement, and target speech extraction, which can be better used in different service scenarios.

(Figure: speech signal preprocessing → speech analysis → service scenario applications
such as speech synthesis, speech recognition, speech enhancement, and target speech
extraction.)

23 Huawei Confidential
Speech Analysis(2)
⚫ There are many methods for analyzing speech signals according to specific
requirements. Speech analysis can be classified into the following types:
 Time domain analysis
 Frequency domain analysis
 Cepstral (inverted-frequency) analysis
 Wavelet domain analysis
 ……

⚫ The analysis methods are classified into the following two types:
 Model analysis method
 Non-model analysis method

24 Huawei Confidential
Time Domain Speech Analysis
⚫ Time domain analysis is to analyze and extract time domain parameters of speech
signals, which is the earliest and most widely used analysis method (speech signals are
time domain signals). It is usually used for parameter analysis and application such as
speech segmentation, preprocessing, and classification.
⚫ List of Short-Time Features:
 Short-time energy
 Short-time zero-crossing rate (ZCR)
 Short-time auto-correlation
 Short-time amplitude difference
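A minimal numpy sketch of two of these features, short-time energy and the zero-crossing rate, computed per frame (illustrative only):

```python
import numpy as np

def short_time_energy(frame):
    """Sum of squared samples within one frame."""
    return np.sum(frame.astype(float) ** 2)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame))
```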

25 Huawei Confidential
Frequency Domain Speech Analysis
⚫ Frequency domain analysis of speech signals is to analyze and extract frequency
domain parameters of speech signals.
⚫ The most common frequency domain analysis method is Fourier analysis.
Speech signal is a non-stationary process, so short-time Fourier transform is
needed to analyze the spectrum of speech signal. The resonance peak
characteristics, pitch frequency and harmonic frequency of speech signals can
be observed through the spectrum of speech signals.

26 Huawei Confidential
Speech Features
⚫ The most commonly used speech feature in speech recognition and speaker recognition
is Mel-Frequency Cepstral Coefficients(MFCC).
⚫ MFCC processor:
 Pre-emphasis
 Frame Blocking
 Windowing
 Fast Fourier Transform(FFT)
 Mel-Scale filter bank
 Log|·|
 Discrete Cosine Transform (DCT)
Reference: http://speech.ee.ntu.edu.tw/~tlkagk/courses/DLHLP20/ASR%20%28v12%29.pdf
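A minimal sketch of MFCC extraction using the librosa library (the file name, the 16 kHz rate, and the 13 coefficients are assumptions, not from the slides):

```python
import librosa

# Load a mono wav at 16 kHz and extract 13 MFCCs per frame.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)  # 25 ms window, 10 ms shift
print(mfcc.shape)   # (13, number_of_frames)
```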

27 Huawei Confidential
Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

28 Huawei Confidential
Speech - Text

29 Huawei Confidential
Speech Recognition
⚫ Speech recognition is the technology that enables machines to recognize speech
signals and convert them into text or commands.
⚫ Speech recognition technologies involve signal processing, pattern recognition,
probability theory and information theory, the sound generation mechanism, the
auditory mechanism, artificial intelligence, and so on.

(Figure: speech input → recognizer → text output, e.g., "I love you.")

30 Huawei Confidential
History of Speech Recognition

(Timeline: 1952 recognition of spoken digits from a single speaker; 1980 hidden
Markov model (HMM); 1987 continuous speech recognition; 1997 IBM ViaVoice;
2011 Apple Siri; 2012 Google Voice Search; 2013 Google Glass.)
31 Huawei Confidential
Current Situation of Speech Recognition
⚫ Speech recognition is a form of perceptual intelligence in artificial intelligence
and has been applied in various fields, such as home appliances, communications,
automobiles, healthcare, and home services.
⚫ Currently, the recognition rate achieved by some companies on standard data sets or
in quiet near-field environments has reached 97%, but the recognition rate in real
scenarios is still far from the expected level.

32 Huawei Confidential
Difficulties in Speech Recognition
⚫ Difficulties in speech recognition tasks:
 Regional;
 Scenario-specific;
 Physiological.

⚫ To sum up, the difficulty of speech recognition lies in variability: the same word
or sentence may be pronounced differently because of different factors.

33 Huawei Confidential
Speech Recognition - Isolated Word Recognition
⚫ Isolated word recognition: At the early stage of speech processing, only a small
number of isolated words were recognized. The input voice file contains a single
word, and the model identifies which word the file belongs to. The common model
is GMM-HMM.

(Figure: speech input → recognizer → one word output, e.g., "0", "2", …, "9".)

34 Huawei Confidential
Speech Recognition - Continuous Speech Recognition
⚫ Continuous speech recognition: In practice, a few isolated words cannot meet actual
application requirements; most tasks require recognizing consecutive sentences.
Therefore, if isolated-word techniques were still used, the following problems would occur:
 The entire file would need to be split into isolated words, which requires a lot of
manual work and cannot guarantee accuracy, because the pronunciations of many words
run together.
 Even if the vocabulary were perfectly split, the number of words in actual use is so
large that the matching strategy used for isolated word recognition has no advantage
here, and may even become a disadvantage due to the large vocabulary.

“I love you”

“Nice weather!”

“I like eating apple”

……

35 Huawei Confidential
Traditional Speech Recognition Task Process

(Figure: uncompressed voice file → preprocessing → acoustic model → words →
language model → text file.)

36 Huawei Confidential
Speech Recognition Algorithm
⚫ The traditional speech recognition algorithm is GMM-HMM.
⚫ There are many kinds of speech recognition algorithms based on deep learning,
and new models and algorithms keep being proposed. These models can be divided
into two directions:
 Hybrid models
 End-to-end models

37 Huawei Confidential
Speech Recognition Application
⚫ Many speech recognition applications, such as voice interaction and voice-operated
functions, are used daily in apps. In a smart home, the state of an appliance can be
controlled by voice without a remote control, and so on.
 Voice typewriter
 Voice search
 Voice assistant
 Smart Speaker
 Customer service robot

38 Huawei Confidential
Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

39 Huawei Confidential
Text-To-Speech Synthesis
⚫ Speech synthesis, also known as Text-To-Speech(TTS), can convert any text information
into corresponding speech.
⚫ Speech synthesis, which involves acoustics, linguistics, digital signal processing and
computer science, is an advanced technology in information processing.
⚫ In order to synthesize high-quality speech, TTS depends on various rules, including
semantic rules, lexical rules, and phonetic rules; it also needs a good understanding
of the content of the text, which involves natural language understanding.

40 Huawei Confidential
Application Scenarios of TTS
⚫ Service robot
⚫ Customer service system
⚫ Smart home
⚫ Trip navigation
⚫ Reading software

41 Huawei Confidential
TTS System
⚫ A complete TTS system process is to first convert the text sequence into a phonological sequence
and then generate speech waveforms based on the phonological sequence. Where:
 Step 1 involves linguistic processing, for example, particle and pronunciation conversion, and a whole set
of valid rhythm control rules.
 Step 2 requires an advanced speech synthesis technology, which can synthesize high-quality speech
streams in real time based on requirements.

⚫ The research on speech synthesis technologies has a history of more than 200 years.
Truly practical modern speech synthesis developed along with computer technology and
digital signal processing technology, enabling computers to generate high-definition,
highly natural continuous speech.

42 Huawei Confidential
Speech Synthesis Flow

(Figure: text file → text analysis (text normalization, speech analysis, rhythmic
analysis) → phonemic internal representation → waveform synthesis (concatenative
synthesis, formant synthesis, articulatory synthesis) → waveform file → evaluation.)

43 Huawei Confidential
Text Analysis
⚫ The main task of text analysis in speech synthesis is to convert text data into a
phonemic internal representation. The specific content includes:
 Text normalization: Natural text data of all types is preprocessed or normalized,
including word example restoration of sentences and disambiguation of non-
standard words and homonyms.
 Speech analysis: The next step of text normalization is speech analysis. Specific
methods include using a large pronunciation dictionary and word-to-phoneme
conversion rules.
 Rhythmic analysis: Analyzes tone patterns and rhyming rules of text, including the
rhythmic mechanism, rhythmic prominence, and tone.

44 Huawei Confidential
Speech Synthesis Method
⚫ During the development of speech synthesis technologies, early research primarily
adopted the parameter synthesis method. Later, with the development of computer
technology, the waveform concatenation method appeared.
⚫ Parameter synthesis
 During the development of speech synthesis technologies, early research primarily adopted
the parameter synthesis method. Worth mentioning are Holmes' parallel formant synthesizer
(1973) and Klatt's serial/parallel formant synthesizer (1980). Through fine parameter
adjustment, the two synthesizers can synthesize natural speech. However, it is difficult to
accurately extract formant parameters, so the quality of synthesized speech could not meet
practical requirements.

⚫ Waveform concatenation
 From the late 1980s to the present, new progress has been made in speech synthesis
technologies. In particular, the pitch synchronous overlap add (PSOLA) method proposed in
1990 greatly improves the timbre and naturalness of speech synthesized by time-domain
waveform concatenation, exceeding the naturalness achieved with the LPC method or a formant
synthesizer. In addition, a PSOLA-based synthesizer has a simple structure, is easy to
implement, and has great commercial prospects.

45 Huawei Confidential
Speech Synthesis Algorithms
⚫ HMM-based parameter synthesis
⚫ WaveNet
⚫ Tacotron
⚫ Deep Voice 3

46 Huawei Confidential
Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM


◼ GMM
 HMM
 GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

47 Huawei Confidential
GMM Introduction
⚫ GMM is also called Mixture of Gaussians (MoG). A GMM quantifies a
distribution using Gaussian density functions (normal distribution curves),
decomposing it into several component models, each based on a Gaussian
probability density function (PDF).
48 Huawei Confidential
Gaussian Distribution
⚫ Gaussian distribution, also called normal distribution, was obtained the earliest by Abraham De
Moivre from the asymptotic formula for binomial distribution. Gauss derived Gaussian distribution
from another angle when he studied measurement error. Gaussian distribution is an important
probability distribution in the fields such as mathematics, physics, and engineering and has a
great influence in many aspects of statistics.
⚫ If the random variable $X$ follows a normal distribution with mathematical expectation
$\mu$ and variance $\sigma^2$, it is recorded as $N(\mu, \sigma^2)$. Its PDF is the
normal distribution; the expectation $\mu$ determines the position of the curve and the
standard deviation $\sigma$ determines its spread. When $\mu = 0$ and $\sigma = 1$, the
distribution is the standard normal distribution. The formula is:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
49 Huawei Confidential
Gaussian Distribution Curve
⚫ The normal curve is in a bell shape. The two sides are low and the middle is high. The
left and the right are symmetrical and take on a bell shape.
⚫ The higher the standard deviation, the flatter the curve; the lower the standard
deviation, the taller and narrower the curve.

50 Huawei Confidential
SGM
⚫ When the sample data $X$ is one-dimensional, the Gaussian distribution follows the PDF
below:

$$P(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

($\mu$: mean of the data; $\sigma$: standard deviation of the data.)

⚫ When the sample data $X$ is multi-dimensional, the Gaussian distribution follows the PDF
below:

$$P(x \mid \theta) = \frac{1}{(2\pi)^{D/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$

($\mu$: mean of the data; $\Sigma$: covariance matrix; $D$: the data dimension.)
51 Huawei Confidential
Maximum Likelihood
⚫ Maximum likelihood (ML), also called maximum likelihood estimation, is a theoretical
point estimation method. The maximum likelihood estimation is a statistical method. It
is used to solve the parameter of the related PDF for a sample set.
⚫ The basic idea of this method is: After 𝑛 groups of observed sample values are
randomly drawn from a model, the most reasonable parameter estimation should make
the probability of the n groups of observed sample values drawn from the model the
maximum instead of making the model best fit the parameter estimation of sample
data, as provided by the least square estimation.
(Figure: Known — 1. the distribution model followed by the sample; 2. a randomly drawn
sample → maximum likelihood estimation → Unknown — the model parameters.)
52 Huawei Confidential
Maximum Likelihood
⚫ Assume $N$ independent data points follow a distribution $Pr(x; \theta)$. We want to find
a group of parameters $\theta$ that maximizes the probability of generating the data
points. The probability is:

$$\prod_{i=1}^{N} Pr(x_i; \theta)$$

⚫ This is called the likelihood function. Generally, the probability of a single point is
small; after continuous multiplication, the product becomes tiny, which may cause
floating-point underflow. Therefore, the logarithm is used, and the quantity becomes:

$$\sum_{i=1}^{N} \log Pr(x_i; \theta)$$

⚫ This is called the log-likelihood function. Differentiation can then be performed to find
the parameter $\theta$ that maximizes the formula above: the maximum likelihood estimate is
the $\theta$ under which the observed values are most probable.
53 Huawei Confidential
Parameter Learning of SGM
⚫ For an SGM, we can use maximum likelihood to estimate the parameter $\theta$:
$\hat{\theta} = \arg\max_{\theta} L(\theta)$. Here, assume each data point is independent;
the likelihood is given by the PDF:

$$L(\theta) = \prod_{j=1}^{N} P(x_j \mid \theta)$$

⚫ Because the probability of occurrence of each point is small, the product becomes very
small, which is not convenient for calculation or observation, so we often use the maximum
log-likelihood. Since logarithmic functions are monotonic, they do not change the location
of the extremum; in addition, the logarithm maps the small values in the range 0 to 1 to a
scale where changes are much easier to observe:

$$\log L(\theta) = \sum_{j=1}^{N} \log P(x_j \mid \theta)$$
54 Huawei Confidential
Parameter Estimation of Gaussian Distribution Model
⚫ For a sample data set, if we use the Gaussian probability distribution to fit the
data set, the procedure for solving the parameters of the Gaussian distribution
model is as follows:
 Write the probability density function based on the Gaussian probability distribution.
 Obtain the likelihood function from the sample data and the functional form of the
Gaussian probability distribution.
 Convert the likelihood function into the log-likelihood function.
 Take the derivative of the log-likelihood function with respect to the parameters and
set the equations to zero.
 Solve the equations to obtain the optimal parameter values, as the sketch below shows.
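For a single Gaussian this procedure has a closed-form result, the sample mean and the (biased) sample variance; a minimal numpy sketch (synthetic data assumed):

```python
import numpy as np

# Setting the derivative of the log-likelihood to zero gives closed-form
# estimates: the sample mean and the (biased) sample variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)

mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()   # equivalently x.var()
print(mu_hat, np.sqrt(sigma2_hat))        # close to 3.0 and 2.0
```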

55 Huawei Confidential
Gaussian Mixture Model(GMM)
⚫ The GMM is an extension of the single Gaussian probability density function and can
smoothly approximate a density distribution of any shape. Gaussian models come in two
types: the single Gaussian model (SGM) and the Gaussian mixture model (GMM).
⚫ Similar to clustering, each Gaussian component can be considered a category based on
its probability density function (PDF) parameters. Input a sample x, calculate its value
under the PDF, and then use a threshold to determine whether the sample belongs to that
Gaussian model. Obviously, the SGM is suitable for binary classification, while the GMM
is more refined because it has multiple components; it is suitable for multi-class
classification and can be applied to modeling complex objects.

56 Huawei Confidential
GMM
⚫ Probability distribution of the GMM:

$$P(x \mid \theta) = \sum_{k=1}^{K} \alpha_k\, \phi(x \mid \theta_k)$$

⚫ For the GMM, the parameter $\theta = (\tilde{\mu}_k, \tilde{\sigma}_k, \tilde{\alpha}_k)$
collects the expectation, the variance (or covariance), and the occurrence probability of
each sub-model in the GMM.
⚫ Parameter description:
 $x_j$ indicates the $j$th observed data point, $j = 1, 2, 3, \dots, N$. $K$ indicates the
number of Gaussian sub-models in the GMM. $\alpha_k$ indicates the probability that an
observed data point belongs to the $k$th sub-model, with $\alpha_k \ge 0$ and
$\sum_{k=1}^{K} \alpha_k = 1$. $\phi(x \mid \theta_k)$ indicates the Gaussian PDF of the
$k$th sub-model, with $\theta_k = (\mu_k, \sigma_k^2)$.
57 Huawei Confidential
Parameter Learning of GMM
⚫ For a GMM, the log-likelihood function is:

$$\log L(\theta) = \sum_{j=1}^{N} \log P(x_j \mid \theta) = \sum_{j=1}^{N} \log \left( \sum_{k=1}^{K} \alpha_k\, \phi(x_j \mid \theta_k) \right)$$

⚫ How do we calculate the parameters of a GMM? We cannot, as with a single Gaussian model,
use maximum likelihood to derive the parameters that maximize the likelihood directly,
because we do not know which sub-distribution (hidden variable) each observed data point
belongs to: a summation remains inside the log, and the sum of $K$ Gaussian models is not a
Gaussian model. For each sub-model, $\alpha_k$, $\mu_k$, and $\sigma_k$ are unknown and
cannot be obtained through direct differentiation. The parameters must be solved through an
iterative approach.
58 Huawei Confidential
EM Algorithm
⚫ The expectation maximization (EM) algorithm is an iterative algorithm. It is used for the
maximum likelihood estimation(MLE) or maximum a posteriori estimation(MAP) of
probability parameter models that contain hidden variables.
⚫ The EM algorithm is a method proposed by Dempster, Laind, and Rubin in 1977 to solve
maximum likelihood estimation parameters. It can perform maximum likelihood
estimation (MLE) for incomplete data sets. This method can be widely used to process
missing data, truncated data, and incomplete data such as data with noise.

(Figure: Known — a randomly drawn sample → EM algorithm → Unknown — 1. the distribution
each sample comes from; 2. the model parameters.)
59 Huawei Confidential
EM Algorithm Step (1)
⚫ Initialize the parameters.
⚫ E-step: According to the current parameters, calculate the probability that each data
point $j$ comes from sub-model $k$:

$$\gamma_{jk} = \frac{\alpha_k\, \phi(x_j \mid \theta_k)}{\sum_{k=1}^{K} \alpha_k\, \phi(x_j \mid \theta_k)}, \quad j = 1, 2, \dots, N;\; k = 1, 2, \dots, K$$

⚫ M-step: Calculate the model parameters for the new round of iteration:

$$\mu_k = \frac{\sum_{j=1}^{N} \gamma_{jk}\, x_j}{\sum_{j=1}^{N} \gamma_{jk}}, \quad k = 1, 2, \dots, K$$
60 Huawei Confidential
EM Algorithm Step(2)
⚫ M-step: Calculate the model parameters for the new round of iteration:

$$\Sigma_k = \frac{\sum_{j=1}^{N} \gamma_{jk}\, (x_j - \mu_k)(x_j - \mu_k)^T}{\sum_{j=1}^{N} \gamma_{jk}} \quad \text{(use the } \mu_k \text{ updated in this round of iteration)}$$

$$\alpha_k = \frac{\sum_{j=1}^{N} \gamma_{jk}}{N}, \quad k = 1, 2, \dots, K$$

⚫ Iteration: Repeat the E-step and M-step until convergence occurs.
⚫ At this point, parameter learning of the GMM is complete. Note that the EM algorithm is
guaranteed to converge but not to find the global maximum; it may find a local maximum. The
solution is to run the iteration several times from different parameter initializations and
keep the initialization with the best results.
61 Huawei Confidential
EM Algorithm Step (3)
⚫ Specific iteration steps:
 Initialize parameters.
 E-step: Find the expectation.
 M-step: Find the maximum value and calculate the model parameter in a new round
of iteration.
 Perform iteration until convergence occurs.

62 Huawei Confidential
Advantages and Disadvantages of GMM
⚫ Advantages:
 Strong fitting capability
 Maximum probability of speech feature matching

⚫ Disadvantages:
 The sequence factor cannot be processed.
 Linear or approximate linear data cannot be processed.

63 Huawei Confidential
Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM


 GMM
◼ HMM
 GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

64 Huawei Confidential
Cases of Markov Chain
⚫ The publicity for commodities A, B, and C under a category differs. Under the
advertising effect, the probabilities that a customer first selects and buys
commodities A, B, and C are 0.2, 0.4, and 0.4 respectively. The following table
describes the purchase predisposition of customers. Find the probabilities that
customers buy each commodity on the fourth purchase.

                       Second Purchase
                       A      B      C
  First       A        0.8    0.1    0.1
  Purchase    B        0.5    0.1    0.4
              C        0.5    0.3    0.2

65 Huawei Confidential
Cases of Markov Chain
⚫ Three elements:
 Initial probability: $\pi = (0.2, 0.4, 0.4)$
 Transition probabilities: $p_{AA} = 0.8,\; p_{AB} = 0.1,\; p_{AC} = 0.1,\; p_{BA} = 0.5,\; \dots$
 Transition probability matrix: $A = \begin{pmatrix} 0.8 & 0.1 & 0.1 \\ 0.5 & 0.1 & 0.4 \\ 0.5 & 0.3 & 0.2 \end{pmatrix}$
⚫ Solving:
 Model of the Markov chain:
 $P(X_{n+1} = x \mid X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)$
 Third-step probability, e.g. $P_{AAA} = ?$ A numeric sketch follows.
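A minimal numpy sketch of the solution: the distribution over the fourth purchase is the initial distribution propagated through three transitions, π·A³.

```python
import numpy as np

pi = np.array([0.2, 0.4, 0.4])            # initial probabilities for A, B, C
A = np.array([[0.8, 0.1, 0.1],
              [0.5, 0.1, 0.4],
              [0.5, 0.3, 0.2]])           # transition probability matrix

# Distribution over the fourth purchase: three transitions from the initial state.
p4 = pi @ np.linalg.matrix_power(A, 3)
print(p4, p4.sum())                       # a probability vector summing to 1
```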

66 Huawei Confidential
Markov Chain
⚫ Markov chain refers to a random discrete event process with Markov properties in
mathematics. In the process, when current knowledge or information is provided, future
prediction is unrelated to the past and is related to only the current state.
⚫ In each step of Markov chain, the system can change from one state to another
according to probability distribution or maintain the current state. The change of states
is called transition. The probability related to change of different states is called
transition probability.
(Figure: a three-state weather Markov chain with states Cloudy, Rainy, and Fine;
transition probabilities are marked on the arrows.)

67 Huawei Confidential
Principles of Markov Chain
⚫ Principles:
 Markov chain describes a state sequence, in which each state value depends on the finite number of
previous states. Markov chain is a sequence of random variables with Markov properties. The range of the
variables, namely the set of possible values of the variables, is called state space.

⚫ Properties:
 Positive definiteness: Each element in the state transition matrix is a state
transition probability. From probability theory, each state transition probability is
non-negative, which can be expressed using the formula:

$$p_{ij}(k) \ge 0$$

 Finiteness: From probability theory, the elements of each row in the state transition
matrix sum to 1, which can be expressed using the formula:

$$\sum_{j} p_{ij} = 1$$

68 Huawei Confidential
Observable Markov Model
⚫ For a problem, we have the initial distribution $\pi$ and the transition probability
matrix $A$. At any given time $t$, we are in a state $Q_t$. As one state transits to
another over time, an observation sequence is obtained, that is, the state sequence
$O = [q_1, q_2, q_3, q_4, \dots, q_n]$. In the whole problem, there are $n$ observed
states in total.
⚫ The probability of such a sequence is:

$$P(O \mid A, \pi) = P(q_1) \prod_{t=2}^{n} P(q_t \mid q_{t-1})$$

⚫ Therefore, an observable Markov model has a triplet description $(A, \pi, n)$, which
can be abbreviated as $(A, \pi)$.

69 Huawei Confidential
Markov Chain Learning
⚫ Markov Process
 Markov model learning problem refers to learning parameters of Markov model after a series
of observation data is given.
 The learning content includes the initial probability 𝜋 and the transition probability matrix A.
The state set is determined during the problem study and does not require an additional
learning process.

(Figure: observation data such as ABACCACBCBBAC…, BCBCACCBACACB… → learning → model
parameters: initial probability $\pi$; transition probabilities
$p_{AA}, p_{AB}, p_{AC}, p_{BA}, \dots$; state set $S$.)

70 Huawei Confidential
Markov Chain Learning Algorithm-Exhaustion
⚫ Exhaustion method: In Markov model learning, exhaustion is probability approximation
method. The more sample sequence data can be obtained, the more accurate the
obtained parameter is.
𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠 𝑠𝑡𝑎𝑟𝑡𝑖𝑛𝑔 𝑤𝑖𝑡ℎ 𝑠𝑡𝑎𝑡𝑒 𝑖
 Initial probability:𝜋𝑖 = 𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠

𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠 𝑡𝑟𝑎𝑛𝑠𝑖𝑡𝑖𝑛𝑔 𝑡𝑜 𝑠𝑡𝑎𝑡𝑒 𝑗 𝑓𝑟𝑜𝑚 𝑠𝑡𝑎𝑡𝑒 𝑖


 Transition probability:𝑝𝑖𝑗 = 𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑠𝑒𝑞𝑢𝑒𝑛𝑐𝑒𝑠 𝑓𝑟𝑜𝑚 𝑠𝑡𝑎𝑡𝑒 𝑖

⚫ For example, assuming that there is data [red, red, red] [red, red, blue] [red, blue, red]
[blue, red, red], the initial probability and the transition probability may be obtained by
using the exhaustion method:
 Initial probability: π = {0.75, 0.25} (red, blue)
 Transition probability: p(red, blue) = 1/3, p(red, red) = 2/3, p(blue, blue) = 0, p(blue, red) = 1
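A minimal Python sketch of this counting method on the four sequences above (a sketch for illustration; variable names are my own):

from collections import Counter

sequences = [
    ["red", "red", "red"],
    ["red", "red", "blue"],
    ["red", "blue", "red"],
    ["blue", "red", "red"],
]

# Initial probability: fraction of sequences starting with each state.
starts = Counter(seq[0] for seq in sequences)
pi = {s: n / len(sequences) for s, n in starts.items()}

# Transition probability: count (i -> j) pairs, normalize per source state i.
trans, totals = Counter(), Counter()
for seq in sequences:
    for i, j in zip(seq, seq[1:]):
        trans[(i, j)] += 1
        totals[i] += 1
p = {(i, j): n / totals[i] for (i, j), n in trans.items()}

print(pi)  # {'red': 0.75, 'blue': 0.25}
print(p)   # {('red','red'): 0.667, ('red','blue'): 0.333, ('blue','red'): 1.0}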

71 Huawei Confidential
Markov Chain Prediction
⚫ Markov Model Prediction
 The Markov prediction problem: given the Markov model parameters (initial probability π, transition probabilities p_AA, p_AB, p_AC, p_BA, …, and state set S), calculate the probability that a given observation sequence occurs.

⚫ The probability of occurrence of a sequence can be predicted. At the same time, the Markov model can be used to determine which of several sequences is most likely to have been generated.

Model parameters → Prediction → Observation probability

 Initial probability: π            P(AABCCBCAC…)
 Transition probability: p_AA,     P(CBACBCCBBA…)
   p_AB, p_AC, p_BA, …             P(ABACCACBCBBAC…)
 State set: S                      …
72 Huawei Confidential
Markov Chain Prediction Algorithm
⚫ Assume that a Markov model is given, that is, the initial probability π, the transition probabilities p_AA, p_AB, p_AC, p_BA, …, and the state set S are all known. Solve for the probability of occurrence of the following sequence.

 By the Markov property, the state at moment n is related only to the state at moment n−1. Thus the probability of the foregoing sequence can be decomposed with the chain rule.
 Each factor in the resulting product is either an initial probability or a transition probability. These probability values are known, so the probability of the sequence occurring can be easily calculated.
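A minimal Python sketch of this chained calculation, using the π and A values from the purchase example earlier (names are illustrative):

pi = {"A": 0.2, "B": 0.4, "C": 0.4}
A = {"A": {"A": 0.8, "B": 0.1, "C": 0.1},
     "B": {"A": 0.5, "B": 0.1, "C": 0.4},
     "C": {"A": 0.5, "B": 0.3, "C": 0.2}}

def sequence_probability(seq):
    # P(q1, ..., qn) = pi(q1) * product over t of A[q_{t-1}][q_t]
    prob = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= A[prev][cur]
    return prob

print(sequence_probability(["A", "A", "A"]))  # 0.2 * 0.8 * 0.8 = 0.128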
73 Huawei Confidential
Case Study-Hidden Markov Model
⚫ The key difference between Hidden Markov Model and Markov Model lies in this "hidden". The
following uses a simple case to describe what the Hidden Markov Model is.

[Figure: Mood —affects→ Activity]
⚫ Assumption:
 As shown in the figure above, Jimmy has three mood states: "happy", "unhappy", and "just so-so".
 Jimmy's daily activities are "playing football", "listening to music", and "watching television".
 Jimmy's activities are affected by his mood of the day. For example, when he is "happy" he will choose to watch TV; when "unhappy" he is most likely to play football.
 Jimmy's mood changes every day. The mood of the next day is affected only by the mood of the previous day.
74 Huawei Confidential
Case Study-Hidden Markov Model
⚫ In this case, Jimmy’s mood and activity sequence in a period of time may be represented by using
the following schematic diagram, and it can be seen that there are two sequences:
 The mood states are a sequence, and the states of the mood sequence are generated under certain rules.
 Activity is also a sequence, and the sequence of activity is influenced by the mood sequence.

⚫ In this case, Jimmy's mood is hidden (the mood is not observed directly), while his activities can be directly observed. So we usually say:
 The states that cannot be observed directly are called hidden states, such as Jimmy's mood.
 The states that can be observed directly are called observation states, such as Jimmy's activities.

Hidden States

Observation States

75 Huawei Confidential
Case Study-Hidden Markov Model
⚫ In this case:

 There are two states: hidden states and observation states.


 The current hidden state determines the probability distribution of the current observation state.
 The hidden state distribution of the next time step is affected only by the current state.

⚫ The mathematical model for this problem is the hidden Markov model.

76 Huawei Confidential
Case Study-Hidden Markov Model
⚫ In fact, the Hidden Markov Model is an extension of the standard Markov Model obtained by adding the set of observation states to the hidden states.
⚫ While the Markov model has three model elements, the Hidden Markov Model has five. Once these five elements are determined, the Hidden Markov Model is uniquely determined.

Five elements: hidden states S, observation states R, initial probability π, transition probability A, observation probability B

77 Huawei Confidential
Case Study-Hidden Markov Model
⚫ The HMM can be described in the following five elements:
 Hidden states: S = {S₁, S₂, S₃, …, Sₙ}; like Jimmy's mood states "happy", "unhappy", and "just so-so" (the mood cannot be observed directly).

 Observation states: R = {R₁, R₂, R₃, …, Rₘ}; like Jimmy's daily activities.

 Initial probability: $\pi = \{P_{ini}(S_i)\ \forall\ i \in [1, n]\}$; the assumed distribution of Jimmy's mood when the random sequence starts at time 0, not affected by any previous moment.

 Transition probability: $A = \{P_{trans}(S_i \mid S_j)\ \forall\ i, j \in [1, n]\}$; Jimmy's mood transfer relationship, that is, how today's mood affects tomorrow's mood.

 Observation probability: $B = \{P_{obv}(R_k \mid S_j)\ \forall\ j \in [1, n], k \in [1, m]\}$; the relationship between Jimmy's mood and activities, in other words, how mood affects the choice of activity.

⚫ Thus the HMM can be described as λ = (A, B, π, R, S), or abbreviated as λ = (A, B, π).

78 Huawei Confidential
Three Issues of the HMM
⚫ Evaluation:
 Forward algorithm
 Backward algorithm

⚫ Decoding
 Dynamic programming algorithm
 Viterbi algorithm
⚫ Learning
 Supervised algorithm
 Unsupervised Baum-Welch algorithm

79 Huawei Confidential
Hidden Markov Model Evaluation
⚫ Hidden Markov model evaluation
 Given the hidden Markov model, it includes:
◼ Initial probability:𝜋
◼ Transition probability:𝐴
◼ Observation probability:𝐵
◼ Hidden states: S= 𝑆1 , 𝑆2 , 𝑆3,…, 𝑆𝑛

◼ Observation states: 𝑅 = 𝑅1 , 𝑅2 , 𝑅3,…, 𝑅𝑚


◼ The default hidden states and observable states are always known in the case study.

 Given observation states


 The purpose is to compute the probability of observation sequences given HMM.

80 Huawei Confidential
Hidden Markov Model Evaluation
⚫ Example: Calculate the probability that Jimmy's activities in a week are in the
following sequence:
 TV -> Football -> Football -> Music -> TV -> Football -> TV
Known model parameters → Prediction → Observation sequence probability

 Initial probability: π
 Transition probability: A
 Observation probability: B

81 Huawei Confidential
HMM Algorithm - Forward and Backward Algorithm (1)
⚫ Assume that the hidden state sequence generating the observation sequence is Q
 Q = q₁, q₂, q₃, q₄, ……, q_T, where T indicates the sequence length and q_i indicates the hidden state at the i-th position in the sequence, q_i ∈ S

⚫ The forward and backward algorithms are similar. The core improvement of the forward algorithm is to cache and reuse the repeated parts of the calculation.
 Assume O_t = o₁, o₂, o₃, o₄, ……, o_t
 P(O_t, q_t = S_i) indicates the probability that the t-th hidden state is S_i and O_t is generated.
 P(O_t, q_t = S_i) characteristics:

$$P(O_{t+1}, q_{t+1} = S_i) = \sum_{j=1}^{n} P(O_t, q_t = S_j) \cdot P_{trans}(S_i \mid S_j) \cdot P_{obv}(o_{t+1} \mid S_i)$$

$$P(O) = P(O_T) = \sum_{i=1}^{n} P(O_T, q_T = S_i)$$

82 Huawei Confidential
HMM Algorithm - Forward and Backward Algorithm (2)
⚫ Forward algorithm calculation process:
 Traverse sequence positions t ∈ [1, T]:
◼ t = 1: $P(O_1, q_1 = S_i) = P_{ini}(S_i) \cdot P_{obv}(o_1 \mid S_i)$
◼ t > 1: for each hidden state S_i, compute $P(O_t, q_t = S_i) = \sum_{j=1}^{n} P(O_{t-1}, q_{t-1} = S_j) \cdot P_{trans}(S_i \mid S_j) \cdot P_{obv}(o_t \mid S_i)$
 Get $P(O) = P(O_T) = \sum_{i=1}^{n} P(O_T, q_T = S_i)$


 Algorithm complexity: $O(Tn^2)$
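A minimal Python sketch of the forward algorithm as described above (the model dictionaries pi, trans, and obv are illustrative stand-ins for π, A, and B):

def forward(obs, states, pi, trans, obv):
    # alpha[s] holds P(O_t, q_t = s) for the current step t.
    alpha = {s: pi[s] * obv[s][obs[0]] for s in states}          # t = 1
    for o in obs[1:]:                                            # t > 1
        alpha = {s: sum(alpha[j] * trans[j][s] for j in states) * obv[s][o]
                 for s in states}
    return sum(alpha.values())                                   # P(O)

Each step sums over n previous states for each of n current states across T positions, giving the O(Tn²) complexity stated above.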

83 Huawei Confidential
Hidden Markov Model Learning
⚫ The learning problem is the basis of the other two problems of the Hidden Markov Model: the model is assumed to be given in both the evaluation and decoding problems. In practice, however, as with Jimmy's mood and activities, no one knows what the parameters of this hidden Markov model are, so the model parameters need to be learned from the observation data.
⚫ The learning problem of Hidden Markov Model is
 Known observation state sequences and corresponding hidden state sequences.
 Objective: to solve the parameters of Hidden Markov model.
◼ Initial probability :𝜋
◼ Transition probability : 𝐴
◼ Observation probability:𝐵

84 Huawei Confidential
Hidden Markov Model Learning
⚫ For example, learning a hidden Markov model for both Jimmy's mood and activities
given the sequence of Jimmy's activities in a week.

Observation Sequence Model Parameters

…… Learning
 Initial probability :𝜋
……
 Transition probability : 𝐴
……
 Observation probability:𝐵
……

85 Huawei Confidential
HMM Learning Algorithm - Supervision Algorithm
⚫ Application scenario: A large number of observation state sequences and their corresponding hidden state sequences are known.

⚫ Idea: Use the law of large numbers (the limit of frequency is probability) to directly obtain the parameter estimates of the HMM.
⚫ Initial probability π: $P_{ini}(S_i) = \frac{|Seq\_begin\_with\_S_i|}{|Seq|}$
 Seq_begin_with_S_i indicates the hidden state sequences starting with S_i in the sample.
 Seq indicates all hidden state sequences in the sample.

⚫ Transition probability A: $P_{trans}(S_i \mid S_j) = \frac{|trans(S_j, S_i)|}{|trans(S_j)|}$
 trans(S_j) indicates all state transition pairs initiated by S_j in the hidden state sequences.
 trans(S_j, S_i) indicates all state transition pairs from S_j to S_i in the hidden state sequences.

⚫ Observation probability B: $P_{obv}(R_k \mid S_j) = \frac{|emiss(S_j, R_k)|}{|emiss(S_j)|}$
 emiss(S_j, R_k) indicates the emission pairs in which hidden state S_j generates observation state R_k.
 emiss(S_j) indicates the emission pairs of all possible observation states generated by hidden state S_j.
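A minimal Python sketch of this counting-based estimation, given paired hidden/observation sequences (the mood/activity sample data is illustrative):

from collections import Counter

samples = [  # (hidden mood sequence, observed activity sequence)
    (["Happy", "Happy", "Unhappy"], ["TV", "Music", "Football"]),
    (["Unhappy", "Happy", "Happy"], ["Football", "TV", "TV"]),
]

init, trans, emit = Counter(), Counter(), Counter()
for hidden, observed in samples:
    init[hidden[0]] += 1                    # |Seq_begin_with_Si|
    for i, j in zip(hidden, hidden[1:]):
        trans[(i, j)] += 1                  # |trans(Sj, Si)|
    for s, r in zip(hidden, observed):
        emit[(s, r)] += 1                   # |emiss(Sj, Rk)|

def normalize(pairs):
    totals = Counter()
    for (i, _), n in pairs.items():
        totals[i] += n
    return {(i, j): n / totals[i] for (i, j), n in pairs.items()}

pi = {s: n / len(samples) for s, n in init.items()}
A = normalize(trans)   # P_trans(Si | Sj)
B = normalize(emit)    # P_obv(Rk | Sj)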

86 Huawei Confidential
HMM Learning Algorithm - Baum-Welch
⚫ Given only the observation sequences (the hidden state sequences are not provided), estimate the HMM.
⚫ The essence of this algorithm is the EM algorithm. When the observed value X is available and X has a hidden variable Z, the joint probability P(X, Z | λ) under the HMM parameter λ can be calculated.
⚫ Solution steps:
 Find the log-likelihood function of the data
 EM E-step: find the Q function $Q(\lambda, \hat{\lambda})$
 EM M-step: maximize the Q function to find the parameters

87 Huawei Confidential
Hidden Markov Model Decoding
⚫ Hidden Markov Model Decoding
 Given the hidden Markov model, it includes:
◼ Initial probability:𝜋
◼ Transition probability:𝐴
◼ Observation probability:𝐵
◼ Hidden states: S= 𝑆1 , 𝑆2 , 𝑆3,…, 𝑆𝑛

◼ Observation states: 𝑅 = 𝑅1 , 𝑅2 , 𝑅3,…, 𝑅𝑚


◼ The default hidden states and observable states are always known in the case study.

 Given observation states


 The purpose is to choose the optimal hidden state sequence.

88 Huawei Confidential
HMM Decoding Algorithm - Exhaustion
⚫ Assume that a possible hidden state sequence generating the observation sequence is Q
 Q = q₁, q₂, q₃, q₄, ……, q_T, where T indicates the sequence length;
 q_i indicates the hidden state at the i-th position in the sequence, q_i ∈ S

⚫ The idea of exhaustive enumeration:

 Enumerate all possible hidden state sequences Q
 Calculate the probability of each hidden state sequence generating an observation state sequence 𝑃(𝑂, 𝑄)

 𝑄𝑚𝑎𝑥 = 𝑎𝑟𝑔𝑚𝑎𝑥𝑄 𝑃(𝑂, 𝑄)

⚫ Disadvantage of the exhaustion method:

 High algorithm complexity: with n hidden states there are $n^T$ candidate sequences, so the cost grows exponentially with the sequence length T

89 Huawei Confidential
HMM Decoding Algorithm - Viterbi
⚫ Observation sequence 𝑂 = 𝑜1 ,𝑜2 ,𝑜3 ,𝑜4 , … … , 𝑜𝑇 , 𝑄𝑚𝑎𝑥 = 𝑞1 ,𝑞2 ,𝑞3 ,𝑞4 , … … , 𝑞𝑇 ;
⚫ Observation sequence 𝑂𝑡 = [𝑜1 ,𝑜2 ,𝑜3 ,𝑜4 , … … , 𝑜𝑡 ], 𝑄max_𝑡 = 𝑞1 ,𝑞2 ,𝑞3 ,𝑞4 , … … , 𝑞𝑡 .
 𝑂𝑡 is a subsequence of the first 𝑡 elements of 𝑂;𝑄max_𝑡 is a subsequence of the first 𝑡 elements of 𝑄𝑚𝑎𝑥 .

⚫ Idea:
 Forward traversal: for each hidden state at each time step in the sequence, two pieces of information are recorded:
◼ The hidden state at the previous time step on the highest-probability path.
◼ The maximum probability of the observation sequence ending in this state.
 Backtracking:
◼ Starting from the final moment, the hidden state sequence with the maximum probability of the observation states is extracted step by step.

90 Huawei Confidential
HMM Decoding Algorithm - Viterbi
⚫ Assume that a possible hidden state sequence generating the observation sequence is Q
 Q = q₁, q₂, q₃, q₄, ……, q_T, where T indicates the sequence length;
 q_i indicates the hidden state at the i-th position in the sequence, q_i ∈ S

⚫ Definitions:
 Assume O_t = o₁, o₂, o₃, o₄, ……, o_t
 P(O_t, q_t = S_i) indicates the maximum probability, over all state paths ending with the t-th hidden state S_i, of generating O_t.
 P(O_t, q_t = S_i) characteristics:
◼ The best predecessor hidden state at time step t − 1:

$$Pre\_S(t, S_i) = \arg\max_{S_j} P(O_{t-1}, q_{t-1} = S_j) \cdot P_{trans}(S_i \mid S_j)$$

◼ With $S_j = Pre\_S(t, S_i)$, compute $P(O_t, q_t = S_i) = P(O_{t-1}, q_{t-1} = S_j) \cdot P_{trans}(S_i \mid S_j) \cdot P_{obv}(o_t \mid S_i)$

91 Huawei Confidential
HMM Decoding Algorithm - Viterbi
⚫ The Viterbi algorithm consists of forward calculation and backward backtracking.
 Forward calculation, traversing t ∈ [1, T]:
◼ t = 1: $P(O_1, q_1 = S_i) = P_{ini}(S_i) \cdot P_{obv}(o_1 \mid S_i)$
◼ t > 1: for each hidden state S_i
− Check $Pre\_S(t, S_i) = \arg\max_{S_j} P(O_{t-1}, q_{t-1} = S_j) \cdot P_{trans}(S_i \mid S_j)$
− With $S_j = Pre\_S(t, S_i)$, compute $P(O_t, q_t = S_i) = P(O_{t-1}, q_{t-1} = S_j) \cdot P_{trans}(S_i \mid S_j) \cdot P_{obv}(o_t \mid S_i)$
 Backward backtracking, Q_max = q₁, q₂, q₃, q₄, ……, q_T:
◼ Traversing t ∈ [T, 1]:
− t = T: $q_T = \arg\max_{S_j} P(O_T, q_T = S_j)$
− t < T: $q_t = Pre\_S(t + 1, q_{t+1})$
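A minimal Python sketch of the Viterbi algorithm as described above (model dictionaries pi, trans, and obv are illustrative stand-ins for π, A, and B):

def viterbi(obs, states, pi, trans, obv):
    # Forward pass: best[t][s] is the max probability of any path ending in
    # state s at step t; back[t][s] records Pre_S, the best predecessor.
    best = [{s: pi[s] * obv[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        prev, cur, ptr = best[-1], {}, {}
        for s in states:
            j = max(states, key=lambda x: prev[x] * trans[x][s])  # Pre_S
            ptr[s] = j
            cur[s] = prev[j] * trans[j][s] * obv[s][o]
        best.append(cur)
        back.append(ptr)
    # Backtracking: start from the best final state and follow Pre_S.
    q = max(states, key=lambda s: best[-1][s])
    path = [q]
    for ptr in reversed(back[1:]):
        q = ptr[q]
        path.append(q)
    return list(reversed(path))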

92 Huawei Confidential
Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM


 GMM
 HMM
◼ GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

93 Huawei Confidential
Role of GMM - HMM (1)
⚫ Role of GMM:
 A GMM is used to obtain the observation probability of a phoneme state.
 A GMM consists of three to five superimposed Gaussian models.
 A Gaussian model is a normal distribution and represents the probability density of signals.
 In speech recognition, one word consists of multiple phonemes. One phoneme is one state, and one state is associated with a GMM.
 Each GMM has K model parameters.
 (Here, the corresponding model is a GMM, that is, the K Gaussian weights corresponding to each state, together with each Gaussian's mean vector and covariance matrix.)

94 Huawei Confidential
Role of GMM - HMM (2)
⚫ An HMM performs speech modeling:
 An HMM is created for each word, using training samples of the word. The training samples are labeled in advance, that is, each sample corresponds to a section of audio and the audio contains only the pronunciation of the word.
 After multiple training samples of the word are available, the samples are used with the Baum-Welch (EM) algorithm to train all parameters of the GMM-HMM. These parameters include the initial state probability vector, the inter-state transition matrix, and the observation matrix corresponding to each state.

95 Huawei Confidential
Role of GMM – HMM (3)
⚫ HMM in the recognition phase:
 If a section of audio that includes multiple words is input, the audio can be manually separated (considering the simplest method). Then the MFCC feature sequence of each word's audio is extracted and input into each HMM (trained in advance). The forward algorithm is used to obtain the probability of each HMM generating the sequence, and finally the model with the highest probability is selected. The word indicated by that model is the recognition result.

96 Huawei Confidential
GMM - HMM Speech Recognition
⚫ A wav file is obtained. The following shows the process of recognizing a word:
 Cut a waveform into several equal frames and extract the MFCC feature of each frame.
 Run the GMM for the feature of each frame to obtain the probability state in which a frame belongs to a state.
 According to the HMM state transition probability a of each word, calculate the probability of generating the frame in
each state sequence. If the probability where the HMM sequence of a word occurs is the highest, the speech belongs to
the word.

[Figure: HMM for the word "yes" with states sil → y → eh → s → sil; self-loop probabilities 0.6, 0.3, 0.7, 0.5, 0.8, 1.0 and transition probabilities 0.4, 0.5, 0.2]

$$b_{sil}(o_1) \cdot 0.6 \cdot b_{sil}(o_2) \cdot 0.6 \cdot b_{sil}(o_3) \cdot 0.6 \cdot b_{sil}(o_4) \cdot 0.4 \cdot b_{y}(o_5) \cdot 0.3 \cdot b_{y}(o_6) \cdot 0.3 \cdot b_{y}(o_7) \cdot 0.7 \cdots$$

97 Huawei Confidential
Example: Single Word Recognition (1)
⚫ Task : recognize the speech of single word
 Like : /l ai k/
 One: /w ∧ n/

⚫ Like:
 Hidden states: l ai k sil
 Observation states: MFCC features
⚫ One:
 Hidden states: w ∧ n sil
 Observation states: MFCC features

98 Huawei Confidential
Example: Single Word Recognition (2)
⚫ Like :
Hidden state sequences:
  sample 1: l l l ai ai ai ai ai k k k sil
  sample 2: l l l l ai ai ai ai ai k k sil
  ……

Observation state sequences: the corresponding MFCC features

⚫ Target: learn the following HMM-GMM model


 Observation probability: GMM model for every state
 Transition probability: the transition probability between each pair of hidden state
 Initial probability: the initial probability of each hidden state

99 Huawei Confidential
Example: Single Word Recognition (3)
⚫ One :
Hidden state sequences:
  sample 1: w w w w ∧ ∧ ∧ n n sil sil sil
  sample n: w w ∧ ∧ ∧ ∧ ∧ n n n sil sil
  ……

Observation state sequences: the corresponding MFCC features

⚫ Target: learn the following HMM-GMM model


 Observation probability: GMM model for every state
 Transition probability: the transition probability between each pair of hidden state
 Initial probability: the initial probability of each hidden state

100 Huawei Confidential


Example: Single Word Recognition (4)
⚫ Given a speech file:
 Calculate the evaluation probability of each single word based on their HMM-GMM model
 Choose the word with the maximum probability as the word of the speech file

[Figure: for each word model (LIKE and ONE), the speech file is preprocessed and the forward-backward algorithm computes an evaluation probability under that word's HMM-GMM.]

101 Huawei Confidential


Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model

105 Huawei Confidential


Disadvantages of Traditional Speech Recognition Algorithms
⚫ The main disadvantages of traditional speech recognition algorithms GMM-HMM are as
follows:
 In GMM-HMM, each hidden state corresponds to a GMM model. Especially when continuous word recognition is performed, model training is time-consuming and the parameter space is large because of the large number of states.
 When the Hidden Markov Model is used for speech recognition, there is a hidden assumption: the hidden state sequence must satisfy the Markov property, that is, the hidden state at the next moment is related only to the hidden state at the current moment, not to the past or future. This is not entirely reasonable in practice: in most cases, multiple surrounding states jointly determine the state in between.
 Each module is independently optimized and trained, yet the modules are coupled and associated. As a result, the task cannot be processed end-to-end and the entire process cannot be optimized jointly.

106 Huawei Confidential


Speech Recognition Algorithm – Hybrid Model
⚫ Hybrid model refers to replacing the traditional GMM-HMM model with deep neural network.
The replacement networks include DNN, CNN, LSTM, and other variant neural network structures.
 Characteristic: The hybrid model uses a deep neural network instead of the GMM to extract features of speech signals and to process and characterize them in depth.
 Disadvantage: The model still cannot be separated from the HMM, and a trained traditional GMM-HMM model is used for data alignment (sample labeling). Essentially, the traditional GMM-HMM model is still used at first. The entire network architecture is only used to train the acoustic model, so processing is not end-to-end.
 The common models are as follows:
◼ DNN+HMM
◼ CNN+HMM
◼ LSTM+HMM
◼ CLDNN+HMM

107 Huawei Confidential


DNN-HMM
⚫ Acoustic signals are modeled within the HMM framework. The generation probability of each state is estimated by a DNN instead of the original GMM. The output of each unit on the DNN indicates the posterior probability of the corresponding state.
[Figure: DNN-HMM architecture. The HMM (top) has states S₁ … S_K with transition probabilities a_{s_i s_j}; the DNN (bottom) takes a window of feature frames through layers V¹ … V^L and outputs the observation probabilities of the HMM states.]

108 Huawei Confidential


DNN-HMM Training Process
⚫ Train the GMM-HMM model based on the sample data.
⚫ Using a decoding algorithm with the trained GMM-HMM model, deduce the most likely hidden state sequence for the input speech feature sequence.
⚫ Take the correspondence between features and hidden states obtained through decoding in the previous step as training sample data, and train the DNN. The input of the DNN is the speech features, and the output is the probability of each hidden state.
⚫ For an input speech sequence, the trained DNN and HMM are used to decode and search, obtaining the final output text sequence.

109 Huawei Confidential


DNN Superior to GMM
⚫ The DNN is a discrimination model. It can better discriminate annotation
categories.
⚫ The DNN provides outstanding performance with big data. As the data volume
increases continuously, the performance of a GMM becomes saturated within
about 2000 hours. However, the DNN model can still improve performance
when the data volume increases to more than 10,000 hours.
⚫ The DNN model provides stronger robustness for ambient noise. Through noisy
training, the recognition performance of the DNN model can even surpass that
of a GMM using the speech enhanced algorithm.

110 Huawei Confidential


CD-DNN-HMM
⚫ Different from the GMM model, the DNN model introduces context
information (namely front and back feature frame information) and is
therefore called context-dependent DNN-HMM (CD-DNN-HMM).
⚫ The CD-DNN-HMM is composed of three parts:
 A DNN
 An HMM
 A state prior probability distribution

111 Huawei Confidential


Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model


◼ RNN Architecture
 Loss Function
 End2end Model

112 Huawei Confidential


Recurrent Neural Network
⚫ A recurrent neural network (RNN) is a neural network that captures dynamic
information from serialized data through periodic connection of hidden layer nodes. It
can classify serialized data.
⚫ Different from other feedforward neural networks, an RNN can keep a context state and can even store, learn, and express related information in a context window of any length, instead of being limited to the spatial boundaries of traditional neural networks. It extends along the time sequence. Intuitively, connections exist between the hidden layer at one moment and the hidden layer at the next moment.
⚫ An RNN is widely used in scenarios related to sequence, for example, a video consisting
of image frames, an audio consisting of fragments, and a sentence consisting of words.

113 Huawei Confidential


Application Scenarios of RNN
⚫ Scenarios where the related information and the position where it is needed are close together.
⚫ Scenarios where previous short-term information needs to be connected to the current task.

[Figure: RNN with an input layer, a hidden layer, and an output layer.]

114 Huawei Confidential


RNN Unrolling

[Figure: RNN unrolled through time. At each step t, input x_t enters through weights U, the hidden state s_t feeds back through weights W, and output o_t is produced through weights V.]

115 Huawei Confidential


Standard RNN

[Figure: a standard RNN as a chain of repeated modules A, each applying tanh to the current input x_t and the previous hidden state to produce h_t.]

⚫ All RNNs have a chain formed by a repeated neural network module.


⚫ In a standard RNN, the repeated module has a simple structure, for example, the
module has the Tanh activation function.
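A minimal NumPy sketch of one step of this repeated module (weight names follow the unrolling figure; sizes are arbitrary illustration values):

import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(16, 8))    # input -> hidden
W = rng.normal(size=(16, 16))   # hidden -> hidden (the recurrent weights)
V = rng.normal(size=(4, 16))    # hidden -> output

def rnn_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)   # the repeated tanh module
    o_t = V @ s_t                         # optional per-step output
    return s_t, o_t

s = np.zeros(16)
for x_t in rng.normal(size=(5, 8)):       # a length-5 input sequence
    s, o = rnn_step(x_t, s)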

116 Huawei Confidential


BPTT
⚫ After an RNN is unfolded, forward propagation means calculating once along the time sequence, and backpropagation through time (BPTT) means transferring accumulated residual errors back from the last time point. This is essentially similar to the training of a common neural network. The main difference in BPTT is that we add up the gradient at each moment.
⚫ Our goal is to calculate the gradients of the error with respect to parameters U, V, and W and to learn good parameters using the gradient descent method. To calculate the gradients, we need to use the chain rule of differentiation.
[Figure: unrolled RNN with losses E₀ … E₄; the gradient ∂E₃/∂W flows back through the factors ∂s₃/∂s₂, ∂s₂/∂s₁, ∂s₁/∂s₀.]
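Written out with the chain rule, the contribution of W to the loss at step 3 sums over all earlier time steps (a standard BPTT decomposition, shown here for illustration):

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial s_3} \left( \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}$$

The long products of $\partial s_j / \partial s_{j-1}$ factors are what make gradients vanish or explode, as discussed next.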

117 Huawei Confidential


Vanishing and Exploding Gradients of RNN
⚫ During backward training, an RNN must be pushed back horizontally all the way to the place where the sequence starts.
⚫ Both vanishing gradients and exploding gradients are related to the path length being too long. With vanishing gradients, earlier weights remain basically unchanged and receive no training effect.

118 Huawei Confidential


Long Short-Term Memory
⚫ A long short-term memory (LSTM) network is a time-recursive neural network. It is suitable for processing and predicting important events with long intervals and delays in a time sequence.
⚫ An LSTM differs from an RNN in that it adds a "processor" to decide whether information is useful; the structure of this processor is the cell. Three gates are placed in a cell: the input gate, the forget gate, and the output gate. When information enters the LSTM, rules are used to determine whether it is useful. Only information that passes the algorithm's authentication is stored; other information is forgotten through the forget gate.

119 Huawei Confidential


LSTM and Speech Recognition
⚫ After decades of research and development, an HMM-based framework has been established for speech
recognition.
⚫ In recent years, on the basis of HMM, the application of DNN greatly improves the performance of the speech
recognition system. The DNN splices a frame of speech and frames before and after the frame as the input of
the network so as to use context information in the speech sequence. In DNN, the number of frames input
every time is fixed. The final recognition result is affected by the window length.
⚫ The RNN mines context information in a sequence through recursion. This avoids disadvantages of the DNN
to some extent. During training, however, gradients may vanish in the RNN and long-term information
cannot be memorized.
⚫ LSTM enables the error at the current moment to be saved through a specific gating unit and selectively
transfers the error to a specific unit to avoid the problem of vanishing gradients.
 LSTM+CTC
 LSTM+HMM

120 Huawei Confidential


Application Scenarios of LSTM
⚫ Scenarios where the related information is far from the position where it is needed (long gaps).
⚫ Scenarios where previous long-term information needs to be connected to the current task.

ℎ0 ℎ1 ℎ2 ℎt ℎ𝑡+1 ℎ𝑡+2

A A A A A A


𝑥0 𝑥1 𝑥2 𝑥𝑡 𝑥𝑡+1 𝑥𝑡+2

121 Huawei Confidential


Standard LSTM

[Figure: a standard LSTM as a chain of repeated modules, each containing σ, σ, tanh, and σ layers operating on x_t and h_{t−1} to produce h_t. Legend: neural network layer, pointwise operation, vector transfer, concatenate, copy.]

122 Huawei Confidential


LSTM: Initial State
⚫ The key to the LSTM is the cell state: the horizontal line running along the top of the diagram. The cell state is similar to a conveyor belt. It runs straight through the whole chain with only a few linear interactions, so it is easy for information to be transferred along the chain unchanged.

[Figure: LSTM cell showing the cell state C_{t−1} → C_t across the top, with forget gate f_t, input gate i_t, candidate C̃_t, and output gate o_t operating on h_{t−1} and x_t.]

123 Huawei Confidential


LSTM: Forget Gate
⚫ The first step of LSTM is to decide which information will be discarded from the cell.
⚫ The decision is made through a layer called the forget gate. The gate reads h_{t−1} and x_t and outputs a value between 0 and 1 for each entry in the cell state C_{t−1}. The value 1 means the information is completely retained and the value 0 means it is completely discarded.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

[Figure: the forget gate inside the LSTM cell.]
124 Huawei Confidential


LSTM: Input Gate
⚫ This step is to determine what information is stored in the cell state.
⚫ It includes two parts: 1. The sigmoid layer is called the "input gate layer", which decides
the value that will be updated. 2. A candidate value vector is created at the tanh layer
and is added to the state.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

[Figure: the input gate and candidate layer inside the LSTM cell.]

125 Huawei Confidential


LSTM: Information Update
⚫ It is time to update the state of the old cell. 𝐶𝑡−1 is updated as 𝐶𝑡 .
⚫ We multiply the old state by f_t, discarding the information we decided to discard. Then we add i_t * C̃_t, the new candidate values scaled by how much we decided to update each state component.
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$

[Figure: the cell state update inside the LSTM cell.]

126 Huawei Confidential


LSTM: Output Gate
⚫ First of all, we run a sigmoid layer to determine which part of the cell state will be
output. Then we process the cell state through tanh (a value between -1 and 1 is
obtained) and multiply the value by the output of the sigmoid gate. In the end, we
output only the part we determine to output.

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$

$$h_t = o_t * \tanh(C_t)$$

[Figure: the output gate inside the LSTM cell.]
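Putting the three gates together, a minimal NumPy sketch of one LSTM step (weights are illustrative; each W acts on the concatenation [h_{t−1}, x_t] as in the formulas above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)          # forget gate
    i_t = sigmoid(Wi @ z + bi)          # input gate
    c_tilde = np.tanh(Wc @ z + bc)      # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # information update
    o_t = sigmoid(Wo @ z + bo)          # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t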

127 Huawei Confidential


GRU (1)
⚫ GRU is a variant of LSTM. There is also a mix of cell states and hidden states, with
other modifications like combining the forget gate and the input gate into a single
update gate. The final model is simpler than the standard LSTM model and is a very
popular variant.
 The update gate controls how much state information from the previous moment is carried into the current state. A larger update gate value means more previous-state information is brought into the current state.
 The reset gate controls the extent to which state information from the previous moment is ignored. A smaller reset gate value means more state information is ignored.

128 Huawei Confidential


GRU (2)

[Figure: GRU cell showing the update gate, reset gate, state candidate, current state, and output.]
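For reference, the standard GRU equations corresponding to these components, written in the convention of the previous slide (a large update gate z_t keeps more of the previous state):

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t])$$
$$h_t = z_t * h_{t-1} + (1 - z_t) * \tilde{h}_t$$

With reset gate r_t = 1 and update gate z_t = 0, the last two equations reduce to the standard RNN update, as noted on the next slide.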

129 Huawei Confidential


GRU Characteristic
⚫ If the reset gate is set to 1 and the update gate is set to 0, the standard RNN model is obtained again.
⚫ To solve the short-term memory problem, each recurrent unit can adaptively capture the dependencies of different time scales.
⚫ To alleviate vanishing gradients, the hidden-layer output keeps an additive relationship between h_t and h_{t−1} instead of the multiplication + activation function used in the RNN.

130 Huawei Confidential


Comparison of LSTM and GRU
⚫ The standard LSTM and GRU have a better effect than Vanilla RNN.
⚫ Compared with the LSTM, the GRU construction is simpler and has one less gate.
Therefore, the calculation speed of the GRU is faster than that of the LSTM.
⚫ The LSTM can control the reserved memory through its gates, while the GRU offers less control over it.

131 Huawei Confidential


BiRNN
⚫ The basic idea of a bidirectional RNN is that each training sequence is processed by two recurrent neural networks (RNNs), one forward and one backward, and the two RNNs are connected to the same output layer. At each moment t, the input is provided to the two RNNs in opposite directions at the same time, and the output is determined by the states of the two unidirectional RNNs.

132 Huawei Confidential


Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model


 RNN Architecture
◼ Loss Function
 End2end Model

133 Huawei Confidential


Connectionist Temporal Classification Loss Function (1)
⚫ CTC - a probabilistic model 𝑝(𝑌|𝑋), where X = 𝑥1 𝑥2 … 𝑥𝑇 , 𝑌 = 𝑦1 𝑦2 … . 𝑦𝐿 (𝑇 ≥ 𝐿).
Y = "this is"
X = [Figure: spectrogram of the corresponding audio]

134 Huawei Confidential


Connectionist Temporal Classification Loss Function (2)
⚫ The Connectionist Temporal Classification loss, or CTC loss, is for tasks in which alignment between sequences is difficult. It calculates the loss between a continuous (unsegmented) time series and a target sequence by summing over the probabilities of all possible alignments of input to target, producing a loss value that is differentiable with respect to each input node.

135 Huawei Confidential


Connectionist Temporal Classification Loss Function (3)
⚫ CTC realizes a variable-length mapping. This alignment mode has the following features:
 The time-slice mapping between X and Y is monotonic. That is, if X moves forward one time slice, Y remains unchanged or moves forward one time slice.
 The mapping between X and Y is many-to-one. That is, multiple inputs may map to one output, but not the reverse. Therefore, the length of X is greater than or equal to the length of Y.

ℬ(a a ε b b ε c) = (a b c)        ℬ(a ε b ε c c) = (a b c)
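A minimal Python sketch of the collapse operation ℬ (merge consecutive repeats, then drop the blank ε):

BLANK = "ε"

def collapse(path):
    # Merge consecutive identical symbols, then remove blanks.
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return [s for s in merged if s != BLANK]

print(collapse(list("aaεbbεc")))  # ['a', 'b', 'c']
print(collapse(list("aεbεcc")))   # ['a', 'b', 'c']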
136 Huawei Confidential
Voice data alignment (1)
⚫ In a voice dataset, audio files are difficult to align with the output text.
⚫ In the traditional speech recognition model, before training the speech model, the text and speech are often
strictly aligned. There are two disadvantages:
 Strict alignment takes manpower and time.
 After strict alignment, the predicted label is only the result of partial classification, but cannot provide the
output result of the entire sequence. The expected result can be obtained only after post-processing is
performed on the predicted label.
⚫ The CTC (Connectionist Temporal Classification) loss function handles many-to-many sequence mapping without alignment information.

https://distill.pub/2017/ctc/

137 Huawei Confidential


Voice data alignment (2)
⚫ A naive data alignment combines consecutive repeated letters. But then the word "hello" would be mapped to "helo".
⚫ The CTC loss function is an alignment-free algorithm. During data alignment, the blank symbol ε is introduced to indicate pauses in speech. X is mapped to Y by the operation ℬ. Operation ℬ includes:
 Merging consecutive identical symbols;
 Removing blank symbols.

https://distill.pub/2017/ctc/

138 Huawei Confidential


RNN+CTC
⚫ The input is the speech feature sequence, and the output is a letter sequence whose length is the same as the input's.
 How do we evaluate the difference between the output letter sequence and the target letter sequence? Because the speech feature sequence is much longer than the target letter sequence, the output letter sequence is also much longer than the target letter sequence and cannot be aligned with it.
 The difference between the output letter sequence and the target letter sequence with inconsistent lengths is evaluated with the CTC loss function.

139 Huawei Confidential


RNN+CTC Process
⚫ The process of modelling speech recognition is as follows:
 During the training process, the training sample data is used to perform model training based on the model structure and the CTC loss function.
 During prediction, beam search can be used to find an optimal path, which is collapsed into the corresponding letter sequence.

140 Huawei Confidential


Contents

1. Speech Processing

2. Speech Recognition

3. Text-to-Speech Synthesis

4. Traditional Acoustic Model GMM-HMM

5. Hybrid Model DNN-HMM

6. Advanced Speech Model


 RNN Architecture
◼ Loss Function
◼ End2end Model

141 Huawei Confidential


End2End Model
⚫ End2End is the model in which input is speech feature sequences and output is a letter
or coarser-grained (word) sequence.
 Advantage: E2E speech recognition problems do not require a large number of intermediate steps.
 How can the network be trained without forced input/output alignment labels? There are two methods:
◼ Use RNN-architecture networks with the connectionist temporal classification (CTC) loss function to evaluate the difference between the predicted sequence and the target sequence.
◼ Use the Seq2Seq network structure for automatic alignment, with the cross-entropy loss function to evaluate the difference between the predicted sequence and the target sequence.
 The common network models are as follows:
◼ CNN/RNN/LSTM+CTC
◼ Seq2Seq+CE
◼ Seq2Seq+Attention+CE

142 Huawei Confidential


Seq2Seq
⚫ This model is an end2end model. It inputs speech feature sequences and outputs letter sequences.
⚫ When the Seq2Seq model is used, the lengths of the input and the output are different. Therefore, the input
sequence may be speech feature sequence, and the output sequence may be a letter sequence or a character
sequence with a larger granularity.
⚫ Using this model to evaluate the difference between the output letter sequence and the target letter
sequence can be achieved directly by using the cross entropy loss function.

143 Huawei Confidential


Attention
⚫ Seq2Seq does not perform well on overlong speech for the following reasons:
 RNN long-term dependency (solution: use LSTM instead).
 The decoder needs to consider the whole sequence, which contains much irrelevant noise and information, leading to information overload. In fact, most of each output is related to only part of the input sequence, not the whole sequence. Therefore, attention is added to the encoder-decoder model: it attaches more importance to the critical information in the input, weakens other information, and even filters out irrelevant parts. In this way, attention solves the problem of information overload and improves task processing efficiency and accuracy.
⚫ Attention: When we read, we cannot notice all the words. As our eyes move, our attention will also shift and
eventually stay on the words that are considered important.

144 Huawei Confidential


Seq2Seq Improvements
⚫ There are many variants according to the network structure:
 Generally, adding the Attention will make the model more effective.
 The pooling is performed in the stacked bidirectional LSTM to reduce the length of the output
sequence on the encoder.
 More variants can be used for the RNN structure in the encoder and decoder, such as LSTM and bidirectional LSTM.
Attention

145 Huawei Confidential


Quiz

1. Which of the following is not speech recognition preprocessing?


A. Silence removal

B. Noise reduction

C. Standardization

D. Deduplication

146 Huawei Confidential


Summary

⚫ This chapter mainly introduces the theories of speech processing, and


focuses on the evolution and application of speech recognition technologies,
especially the main acoustic models and speech models of speech
recognition.

147 Huawei Confidential


More Information

Huawei Learning Website


 http://support.huawei.com/learning/Index!toTrainIndex

Huawei Support Case Library


 http://support.huawei.com/enterprise/servicecenter?lang=zh

148 Huawei Confidential


Thank you. Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
Theory and Applications of NLP
Objectives

Upon completion of this course, you will be able to:


 Understand the basic knowledge of Natural Language Processing (NLP)
 Master the key tasks of NLP
 Understand the applications of NLP

2 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required

3. Key Tasks

4. Applications

3 Huawei Confidential
What Is NLP ?
⚫ “Natural” languages
 English, Mandarin, French, Swahili, Arabic, Nahuatl, ….
 NOT Java, C++, Perl, …

⚫ Natural language processing (NLP)


 WIKI: NLP is a field of computer science, artificial intelligence, and computational
linguistics concerned with the interactions between computers and human (natural)
languages.
⚫ NLP = NLU + NLG
 NLU = Natural Language Understanding, speech/text → meaning
 NLG = Natural Language Generation, meaning → text/speech

4 Huawei Confidential
Real-word NLP

5 Huawei Confidential
Language Technology
[Figure: status of language technologies. Mostly solved: spam detection, part-of-speech tagging, named entity recognition. Making good progress: sentiment analysis, coreference resolution, word sense disambiguation, parsing, machine translation, information extraction. Still really hard: question answering, paraphrase, summarization, dialog.]

from Dan Jurafsky and Christopher Manning, Stanford University


6 Huawei Confidential
Why NLP is Hard? (1)
⚫ Ambiguity
⚫ Non-standard languages
⚫ Segmentation issues
⚫ Idioms
⚫ Neologisms
⚫ World knowledge
⚫ Tricky entity names
⚫ ……

7 Huawei Confidential
Why NLP is Hard? (2)
⚫ Ambiguity at multiple levels :
 Word senses: bank (finance or river ?)
 Part of speech: chair (noun or verb ?)
 Syntactic structure: I can see a man with a telescope.
 Multiple: I made her duck.

From Diyi Yang, Georgia Institute of Technology

8 Huawei Confidential
Why NLP Is Hard? (3)

from Dan Jurafsky and Christopher Manning, Stanford University


9 Huawei Confidential
History of NLP

Logic-based/rule-based NLP

~1990s

Statistical NLP

~2010s

End-to-end NLP

10 Huawei Confidential
Symbolic and Probabilistic NLP

Logic-based/Rule-based NLP Statistical NLP: Engineered Features/Representations

11 Huawei Confidential
Deep Learning and NLP
output
⚫ Representation Learning (e.g. word embeddings)
⚫ End-to-end Optimization (e.g. NMT)
Deep Learning
⚫ Transfer Learning (e.g. BERT)
⚫ Structure Learning (e.g. Transformer)
input

12 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required
▫ Word Embeddings
◼ Language Models

3. Key Tasks

4. Applications

13 Huawei Confidential
Why Do We Need Word Representation?

https://github.com/yandexdataschool/nlp_course/tree/2020/week01_embeddings

14 Huawei Confidential
Representing words as discrete symbols
⚫ In traditional NLP, we regard words as discrete symbols:

⚫ Words can be represented by one-hot vectors:

hotel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ]
motel = [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ]

⚫ Vector dimension = number of words in vocabulary (e.g., 500,000)

15 Huawei Confidential
Problem with one-hot vectors
⚫ These two vectors are orthogonal.
⚫ There is no natural notion of similarity for one-hot vectors!
⚫ These vectors do not contain information about the meaning of a word.

hotel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 ]
motel = [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 ]

Solution: learn to encode similarity in the vectors themselves.

16 Huawei Confidential
Representing Words By Their Context
⚫ Distributional semantics: A word’s meaning is given by the words that
frequently appear close-by.

“you shall know a word by the company it keeps”(J.R.Firth 1957:11)

17 Huawei Confidential
Dense Word Vector By Contexts
⚫ We will build a dense vector for each word, so that it is similar to the vectors of words that appear in similar contexts.

banking =

18 Huawei Confidential
Visualization

19 Huawei Confidential
Count-based: Co-occurrence Counts + SVD
Example corpus:
I like deep learning
I like nlp
I enjoy flying

20 Huawei Confidential
Word2Vec: A Prediction-Based Method
⚫ take a huge text corpus;
⚫ go over the text with a sliding window, moving one word at a time. At each step, there
is a central word and context words (other words in this window);
⚫ for the central word, compute probabilities of context words (or vice versa);
⚫ adjust the vectors to increase these probabilities.

21 Huawei Confidential
Objective Function: Negative Log-Likelihood
⚫ For each position t = 1, …, T in a text corpus, Word2Vec predicts the context words within an m-sized window given the central word w_t:

https://github.com/yandexdataschool/nlp_course/tree/2020/week01_embeddings
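Written out, the standard skip-gram objective is the average negative log-likelihood over all positions and context offsets:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$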

22 Huawei Confidential
Word2Vec: prediction function

$$p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}$$

The dot product measures the similarity of o and c: a larger dot product means a larger probability. The denominator normalizes over the entire vocabulary.
23 Huawei Confidential
Two Vectors for Each Word

https://github.com/yandexdataschool/nlp_course/tree/2020/week01_embeddings

24 Huawei Confidential
Why Two Vectors ?
⚫ In Word2Vec we train two vectors for each word: one when it is a central word
and another when it is a context word. After training, context vectors are
thrown away.

⚫ When central and context words have different vectors, both the first term and
dot products inside the exponents are linear with respect to the parameters.
Therefore, the gradients are easy to compute.

25 Huawei Confidential
How to Train Word2Vec

https://github.com/yandexdataschool/nlp_course/tree/2020/week01_embeddings

26 Huawei Confidential
Faster Training: Negative Sampling

https://github.com/yandexdataschool/nlp_course/tree/2020/week01_embeddings

27 Huawei Confidential
Word2Vec Variants: Skip-Gram and CBOW

https://github.com/yandexdataschool/nlp_course/tree/2020/week01_embeddings
28 Huawei Confidential
GloVe: Combine Count-based and Direct Prediction methods

$$J(\theta) = \frac{1}{2} \sum_{i,j=1}^{W} f(X_{ij}) \left(u_i^T v_j - \log X_{ij}\right)^2$$

$$X_{final} = U + V$$

X is the co-occurrence counts matrix

29 Huawei Confidential
Fasttext: Word2Vec with N-gram

30 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required
▫ Word Embeddings
◼ Language Models

3. Key Tasks

4. Applications

31 Huawei Confidential
What Is a Language Model ? (1)
⚫ Language Modeling is the task of predicting what word comes next.
I really really like __________   (candidate next words: working, eating, you, …)

⚫ More formally: given a sequence of words $x^{(1)}, x^{(2)}, \ldots, x^{(t)}$, compute the probability distribution of the next word $x^{(t+1)}$:

$$P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)})$$

where 𝑥 (𝑡+1) can be any word in the vocabulary 𝑉 = {𝑤1 , … , 𝑤|𝑉| }


⚫ A system that does this is called a Language Model.
32 Huawei Confidential
What Is a Language Model ? (2)
⚫ We can also think of a Language Model as a system that assigns probability to
a piece of text. For example, if we have some text 𝑥 (1) , … , 𝑥 𝑇 , then the
probability of this text (according to the Language Model) is:

$$P(x^{(1)}, \ldots, x^{(T)}) = P(x^{(1)}) \times P(x^{(2)} \mid x^{(1)}) \times \cdots \times P(x^{(T)} \mid x^{(T-1)}, \ldots, x^{(1)}) = \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)})$$

33 Huawei Confidential
Applications of Language Models

34 Huawei Confidential
N-gram Language Models
⚫ Definition: A n-gram is a chunk of n consecutive words.
I really really like ______
 Unigrams: "I", "really", "really", "like"
 Bigrams: "I really", "really really", "really like"
 Trigrams: "I really really", "really really like"
 4-grams: "I really really like"

35 Huawei Confidential
N-gram Language Models
⚫ First we make a simplifying assumption: $x^{(t+1)}$ depends only on the preceding n−1 words.

$$P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(1)}) = P(x^{(t+1)} \mid x^{(t)}, \ldots, x^{(t-n+2)}) = \frac{P(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)})}{P(x^{(t)}, \ldots, x^{(t-n+2)})} \approx \frac{count(x^{(t+1)}, x^{(t)}, \ldots, x^{(t-n+2)})}{count(x^{(t)}, \ldots, x^{(t-n+2)})}$$

(the ratio of the probability of an n-gram to the probability of the corresponding (n−1)-gram, estimated by counts)

36 Huawei Confidential
Example: N-gram Language Models
Suppose we are learning a 4-gram Language Model.

my dear, I really really like ______
(discard "my dear, I"; condition on the last n − 1 = 3 words "really really like")

$$P(w \mid \text{really really like}) = \frac{count(\text{really really like } w)}{count(\text{really really like})}$$
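A minimal Python sketch of estimating such 4-gram probabilities by counting (the toy corpus is illustrative):

from collections import Counter

corpus = "i really really like you and i really really like this phone".split()

four_grams = Counter(zip(corpus, corpus[1:], corpus[2:], corpus[3:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_next(w, context):  # context is a 3-word tuple
    return four_grams[context + (w,)] / trigrams[context]

print(p_next("you", ("really", "really", "like")))  # 1/2 in this corpus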

37 Huawei Confidential
Problems with n-gram Language Models

Discard too much information

Sparsity: What if the count is 0?

$$P(w \mid \text{really really like}) = \frac{count(\text{really really like } w)}{count(\text{really really like})}$$

Storage: Need to store count for all n-grams you saw in the corpus.

38 Huawei Confidential
Fixed-window Neural Language Model

Output distribution: $\hat{y} = softmax(U h + b_2)$
Hidden layer: $h = f(W e + b_1)$
Concatenated word embedding: e of "I really really like" ($x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}$)

[Figure: fixed-window neural language model predicting the next word (e.g., "working", "you") from the four-word window.]

39 Huawei Confidential
Recurrent Neural Networks (RNN)
⚫ Core idea: Apply the same weights W repeatedly.

outputs
(optional)

hidden states

input sequence
(any length)

40 Huawei Confidential
RNN Language Model
⚫ RNN Advantages:
 Can process any length input.
 Compute for step t can (in theory) use information from many steps back.
 Model size does not increase for longer input.

⚫ RNN Disadvantages:
 Recurrent computation is slow.
 In practice, difficult to access information from many steps back

41 Huawei Confidential
Generating Text with a RNN Language Model

[Figure: generating text with an RNN language model. Starting from "my", each predicted word ("favorite", "season", "is", "spring") is fed back as the next input.]

42 Huawei Confidential
Language Model for Word Embedding
⚫ ELMo, BERT, and GPT use a language model to generate word embeddings dynamically: the same word in different contexts has different representation vectors.

43 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required

3. Key Tasks
▫ Keywords Extraction

▫ Text Classification

▫ Text Generation

▫ Sequence Labeling

▫ Sequence to Sequence

4. Application Systems
44 Huawei Confidential
Keywords Extraction
⚫ Keywords are a group of words that represent the important content of an
article.
⚫ The technology of automatic keyword extraction enables people to browse and retrieve information conveniently, and plays an important role in text clustering, classification, and automatic summarization.

45 Huawei Confidential
TF - IDF Algorithm
⚫ Term Frequency-Inverse Document Frequency (TF-IDF): a statistical calculation
method commonly used to assess the importance of a word to a document in a
fileset.

$$tf_{ij} = \frac{n_{ij}}{\sum_k n_{kj}} = \frac{\text{Number of times word } i \text{ appears in document } j}{\text{Total number of words in document } j}$$

$$idf_i = \log\left(\frac{|D|}{1 + |D_i|}\right)$$

$$tf \times idf(i, j) = tf_{ij} \times idf_i = \frac{n_{ij}}{\sum_k n_{kj}} \times \log\left(\frac{|D|}{1 + |D_i|}\right)$$
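A minimal Python sketch following exactly the formulas above (the corpus reuses the three documents from the bag-of-words example later in this chapter):

import math

docs = [["i", "love", "shanghai"],
        ["i", "love", "hangzhou"],
        ["i", "love", "beijing", "tiananmen"]]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)         # n_ij / sum_k n_kj
    df = sum(1 for d in docs if word in d)  # |D_i|
    idf = math.log(len(docs) / (1 + df))    # log(|D| / (1 + |D_i|))
    return tf * idf

print(tf_idf("shanghai", docs[0], docs))  # rare word: positive score
print(tf_idf("love", docs[0], docs))      # ubiquitous word: score <= 0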

46 Huawei Confidential
TextRank: PageRank on Text
⚫ The basic idea of the TextRank algorithm comes from Google's PageRank algorithm. PageRank is a link analysis algorithm used to evaluate the importance of a web page in a search system. It has two basic ideas:
 Link quantity: a web page is more important if it is linked by more other web pages.
 Link quality: a web page is more important if it is linked by other web pages with higher weights.

Question: How do we build a graph on text?

47 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required

3. Key Tasks
▫ Keywords Extraction

▫ Text Classification

▫ Text Generation

▫ Sequence Labeling

▫ Sequence to Sequence

4. Application Systems
48 Huawei Confidential
Text Classification: Definition
⚫ Input:
 A document 𝑑
 A fixed set of classes 𝐶 = {𝐶1 , 𝐶2 , … , 𝐶𝑗 }

⚫ Output: a predicted class 𝑐 ∈ 𝐶

input: I really really like this phone

𝐶 = {0: 𝑝𝑜𝑠𝑖𝑡𝑖𝑣𝑒, 1: 𝑛𝑒𝑔𝑎𝑡𝑖𝑣𝑒}

output: 0 (positive)

49 Huawei Confidential
Text Classification: Application
⚫ Spam detection
⚫ Authorship identification
⚫ Age/gender identification
⚫ Language Identification
⚫ Sentiment analysis
⚫ ……

50 Huawei Confidential
Text Classification: Method

51 Huawei Confidential
Text Representation – Bag of Words

Corpus:
  d1: I Love Shanghai
  d2: I Love Hangzhou
  d3: I Love Beijing TianAnMen

      Shanghai  Beijing  TianAnMen  I  Hangzhou  Love
  d1     1         0         0      1      0       1
  d2     0         0         0      1      1       1
  d3     0         1         1      1      0       1

52 Huawei Confidential
Text Representation – TF-IDF

Corpus:
  d1: I Love Shanghai
  d2: I Love Hangzhou
  d3: I Love Beijing TianAnMen

      Shanghai  Beijing  TianAnMen   I      Hangzhou  Love
  d1   0.767     0.0       0.0      0.453    0.0      0.453
  d2   0.0       0.0       0.0      0.453    0.767    0.453
  d3   0.0       0.608     0.608    0.359    0.0      0.359

$$w_{i,d} = tf_{i,d} \times \log(n / (1 + df_i))$$

53 Huawei Confidential
Text Representation – LSA
document vector

54 Huawei Confidential
Classifier – Naïve Bayes (1)
⚫ Given document d and a fixed set of classes C = {c₁, c₂}, where x denotes the words of d, compute the class of d:

$$C_{MAP} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c) P(c)}{P(d)} \quad \text{(Bayes rule)}$$

$$= \arg\max_{c \in C} P(d \mid c) P(c) = \arg\max_{c \in C} P(x_1, x_2, \ldots, x_n \mid c) P(c) = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$$
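A minimal bag-of-words Naïve Bayes sketch following this derivation (the two training documents are illustrative; add-one smoothing avoids zero counts):

import math
from collections import Counter

train = [("I really really like this phone", 0),   # 0: positive
         ("I hate this phone", 1)]                  # 1: negative

prior, words, totals = Counter(), {0: Counter(), 1: Counter()}, Counter()
for text, c in train:
    prior[c] += 1
    for w in text.lower().split():
        words[c][w] += 1
        totals[c] += 1
vocab = {w for c in words for w in words[c]}

def predict(text):
    scores = {}
    for c in prior:
        score = math.log(prior[c] / len(train))              # log P(c)
        for w in text.lower().split():
            # Laplace-smoothed log P(x|c)
            score += math.log((words[c][w] + 1) / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("really like this"))  # 0 (positive)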

55 Huawei Confidential
Classifier – Naïve Bayes (2)

from Dan Jurafsky and Christopher Manning, Stanford University


56 Huawei Confidential
Classifier – SVM

SVM SVM + kernel

57 Huawei Confidential
CNN for Text Classification

58 Huawei Confidential
RNN for Text Classification

59 Huawei Confidential
BERT for Text Classification

60 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required

3. Key Tasks
▫ Keywords Extraction

▫ Text Classification

▫ Text Generation

▫ Sequence Labeling

▫ Sequence to Sequence

4. Application Systems
61 Huawei Confidential
Text Generation: Language Model

[Figure: text generation with a language model. Starting from "my", each predicted word ("favorite", "season", "is", "spring") is fed back as the next input.]

62 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required

3. Key Tasks
▫ Keywords Extraction

▫ Text Classification

▫ Text Generation

▫ Sequence Labeling

▫ Sequence to Sequence

4. Application Systems
63 Huawei Confidential
Sequence Labeling
⚫ For each input x_i, there is a corresponding label y_i

[Figure: in text classification, inputs x₁ x₂ x₃ map to a single label y; in sequence labeling, each input x_i maps to its own label y_i.]
64 Huawei Confidential
Part-of-Speech Tagging

Source nlpforhackers

65 Huawei Confidential
Named Entity Recognition (NER)
⚫ Identify token spans of entity mentions in text, and classify them into types of
entity.

From Xiang Ren, USC Computer Science

66 Huawei Confidential
Sequence Labeling: Method

Traditional:
HMM
MEMM
CRF

Deep Learning:
RNN/LSTM
BiLSTM + CRF
BERT

67 Huawei Confidential
BiLSTM + CRF

68 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required

3. Key Tasks
▫ Keywords Extraction

▫ Text Classification

▫ Text Generation

▫ Sequence Labeling

▫ Sequence to Sequence

4. Application Systems
69 Huawei Confidential
Sequence to Sequence (Seq2Seq)

70 Huawei Confidential
Sequence to Sequence: Applications

Machine Translation

Caption Generation

Speech Recognition
71 Huawei Confidential
Machine Translation Using RNN

[Figure: encoder-decoder machine translation: an encoder RNN reads the source sentence and a decoder RNN generates the translation.]

http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2017/Lecture/Attain%20(v5).pdf

72 Huawei Confidential
Problem – Information Bottleneck

The model puts all the information about the sentence into one vector.

73 Huawei Confidential
Attention (1)

http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Seq%20(v2).pdf
74 Huawei Confidential
Attention (2)

http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2018/Lecture/Seq%20(v2).pdf
75 Huawei Confidential
Seq2seq with Attention

76 Huawei Confidential
Transformer: Attention is All You Need (1)

Decoder
Encoder

77 Huawei Confidential
Transformer: Attention is All You Need (2)

78 Huawei Confidential
Self-Attention (1)

79 Huawei Confidential
Self-Attention (2)

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
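A minimal NumPy sketch of this scaled dot-product self-attention (single head; shapes are illustrative):

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # project the inputs to Q, K, V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # scaled dot products
    return softmax(scores) @ V         # weighted sum of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))            # 5 tokens, dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)    # shape (5, 8)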

80 Huawei Confidential
Self-Attention (3)

http://jalammar.github.io/illustrated-transformer/
81 Huawei Confidential
Multi-Head Self-Attention (1)

82 Huawei Confidential
Multi-Head Self-Attention (2)

83 Huawei Confidential
Contents

1. Introduction to NLP

2. Knowledge Required

3. Key Tasks

4. Application Systems

84 Huawei Confidential
Dialogue Systems

85 Huawei Confidential
Quiz

1. (Single) The TF algorithm counts the number of documents in a fileset that contain the same word. The basic idea is that a word that appears in fewer documents better distinguishes the documents. ( )
A. True

B. False

86 Huawei Confidential
Recommendations

⚫ CS224n: Natural Language Processing with Deep Learning


 http://web.stanford.edu/class/cs224n/
⚫ YSDA Natural Language Processing course
 https://github.com/yandexdataschool/nlp_course

87 Huawei Confidential
Thank you. Bring digital to every person, home, and organization for a fully connected, intelligent world.

Copyright©2020 Huawei Technologies Co., Ltd.


All Rights Reserved.

The information in this document may contain predictive


statements including, without limitation, statements regarding
the future financial and operating results, future product
portfolio, new technology, etc. There are a number of factors that
could cause actual results and developments to differ materially
from those expressed or implied in the predictive statements.
Therefore, such information is provided for reference purpose
only and constitutes neither an offer nor an acceptance. Huawei
may change the information at any time without notice.
Overview of Huawei’s AI Development Strategy
Objectives
⚫ On completion of this course, you will be able to:
 Know that AI is a new general purpose technology (GPT).
 Know the 10 changes that will define the future.
 Know Huawei's AI portfolio.
Contents
1. AI: New General Purpose Technology
2. 10 Changes That Will Define the Future
3. Huawei's AI Portfolio
AI: Overall outcome of 60 years of development in ICT
[Figure] AI popularity since 1956, shown against Moore's Law: AI winter I in the 1970s and AI winter II in the 1990s preceded the current rise of the 2010s.
AI is a new general purpose technology (GPT)
⚫ 9000 BC–1000 AD: domestication of plants and animals, smelting of ore, wheel, writing, bronze, iron, water wheel
⚫ 15th–18th century: three-masted sailing ship, printing, factory system, steam engine
⚫ 19th century: railways, iron steamship, internal combustion engine, electricity
⚫ 20th century: automobile, airplane, mass production, computer, lean production, Internet, biotechnology
⚫ 21st century: business virtualization, nanotechnology, artificial intelligence (a set of technologies)
GPTs have multiple uses across the economy, with many technological complementarities and spillovers.
https://www.researchgate.net/publication/227468040_Economic_Transformations_General_Purpose_Technologies_and_Long-Term_Economic_Growth
AI Will Reshape Industries
⚫ Transportation: road-vehicle cooperation, autonomous driving
⚫ Electric power: intelligent booster stations, unmanned patrol, intelligent PV
⚫ Manufacturing: intelligent quality inspection, industrial robots
⚫ Finance: smart branches, contactless identification, financial OCR
⚫ Internet: personalized recommendation, content analysis
AI readiness empowers industries: speech recognition, machine vision, decision and inference, and natural language processing.
AI will change every organization
[Figure] Organizational pyramids before and after AI adoption: leaders at the top; managers/experts, joined by data scientists; junior managers/senior professionals, joined by data science engineers; junior employees at the base.
AI-triggered change has just begun
⚫ Reactions to AI: excitement, urge to act, anxiety, confusion.
[Figure] GPT productivity/adoption curve: Phase 1, small-scale exploration; Phase 2, new tech and society collide (we are here now); Phase 3, tech and society reinforce each other.
Continuous Breakthroughs in AI Algorithms Unlock Boundless Possibilities
⚫ In specific fields, AI is approaching or exceeding human capabilities:
 1. Image classification: ResNet model, top-5 error rate of 3.57%, exceeding the human benchmark (4%)
 2. Speech recognition: DeepSpeech2 model, 95% accuracy, approaching human capability
 3. Game decision-making: AlphaGo defeated Lee Sedol in March 2016 and beat Ke Jie in May 2017
 4. Reading comprehension: BERT model, 87% accuracy, exceeding the human benchmark (82%)
Contents
1. AI: New General Purpose Technology
2. 10 Changes That Will Define the Future
3. Huawei's AI Portfolio
10 changes that will shape the future (as is → to be)
1. Training in days or even months → training in minutes or even seconds
2. Scarce and costly computing power → abundant and affordable computing power
3. AI mostly in the cloud, some at the edge → pervasive AI for all scenarios that respects and protects user privacy
4. Basic algorithms invented before the 1980s → data- and energy-efficient, secure, and explainable algorithms
5. No labor, no intelligence → automated/semi-automated data labeling
6. Models that perform better only in tests → industrial-grade AI that performs excellently in execution
7. Updates not in real time → real-time, closed-loop systems
8. Inadequate integration with other technologies → synergy between AI and cloud, IoT, edge computing, blockchain, big data, databases, etc.
9. Only highly skilled experts can work with AI → AI as a basic skill, supported by one-stop platforms
10. Scarcity of data scientists → data scientists + subject matter experts + data science engineers
Contents
1. AI: New General Purpose Technology
2. 10 Changes That Will Define the Future
3. Huawei's AI Portfolio
Huawei’s Full-Stack, All-Scenario AI Solution
⚫ Application enablement: whole-process services (ModelArts), layered APIs, and pre-integration solutions.
⚫ MindSpore: unified training and inference framework for device, edge, and cloud (independent or collaborative).
⚫ CANN: chip operator library and highly automated operator development tool.
⚫ Ascend (Nano, Tiny, Lite, Mini, Max): a series of AI IPs and chips with a unified and scalable architecture.
⚫ Atlas: various products built on Huawei Ascend AI processors, providing device-edge-cloud AI infrastructure for all scenarios.
⚫ All scenarios: consumer devices, public cloud, hybrid cloud, edge computing, and industry IoT devices.
Atlas AI Computing Portfolio
⚫ Superior computing power, all-scenario deployment, and cloud-edge-device collaboration, built on the Ascend 310 and Ascend 910 AI processors:
 Cloud-device synergy: Atlas 900 AI cluster; Atlas 800 AI server (models 9000/9010 and 3000/3010); Atlas 300 AI accelerator card (models 3000/9000)
 Edge: Atlas 500 AI edge station (model 3000)
 Device: Atlas 200 AI accelerator module (model 3000)
Atlas Accelerates AI Training
⚫ Built on the Ascend 910 AI processor:
 World's most powerful training card: Atlas 300 AI accelerator card (model 9000)
 World's most powerful training server: Atlas 800 AI server (models 9000/9010)
 World's fastest AI training cluster: Atlas 900 AI cluster
Atlas Accelerates AI Inference
⚫ Built on the Ascend 310 AI processor:
 Intelligent devices with 7x higher performance: Atlas 200 AI accelerator module (model 3000)
 Highest density, 64 video inference channels: Atlas 300 AI accelerator card (model 3000)
 Edge intelligence and cloud-edge synergy: Atlas 500 AI edge station (models 3000/3010)
 AI inference platform with ultimate computing power: Atlas 800 AI server (models 3000/3010)
CANN: High-Performance Chip Operator Library and Automated Operator Development Tool
➢ CANN (Compute Architecture for Neural Networks): includes the chip operator library and a highly automated operator development tool, for optimal development efficiency and performance matched to Ascend.
➢ Fusion Engine: fuses operator information in Ascend internal storage, reducing operator calling overheads and memory migrations while improving performance.
➢ CCE operator library: a high-performance operator library (matrix multiplication, convolution, vectors, control flows) based on in-depth collaborative optimization with the Ascend chip.
➢ TBE operator development tool (TIK, TVM): various APIs for custom operator development and automatic optimization, improving operator development efficiency.
➢ CCE compiler: a compiler front end and binary tool set using a heterogeneous hybrid programming language (a C/C++ extension) to optimize performance and programming, enabling Ascend (AI core, AI CPU, CPU) to support all scenarios.
MindSpore: All-Scenario AI Computing Framework
⚫ All-scenario unified APIs for the AI application ecosystem.
⚫ User-friendly development: AI algorithm as code, with automatic differentiation, automatic parallelism, and automatic optimization.
⚫ MindSpore IR computational graph expression.
⚫ Efficient execution: optimized for Ascend and GPU, with on-device execution, pipeline parallelism, and deep graph optimization.
⚫ Flexible deployment: a device-edge-cloud, synergistic, distributed architecture (deployment, scheduling, communication, etc.) for all-scenario on-demand collaboration.
⚫ Processors: Ascend, GPU, and CPU.
1 Platform + 3 Plans Support Ascend Industry Partners and Developers
⚫ CNY 3 billion investment, 3,000 partners, and 1 million developers in 5 years.
⚫ Solution Partner Program for business partners, Developer Enablement Plan for developers, and AI Talent Development Plan for universities.
⚫ Platform for Ascend industry development: industry cooperation, technical support, marketing support, and open source.
Atlas Products: Built on Ascend 310 and Serving Many Industries
⚫ 50+ Atlas industry solutions, for example:
 Finance: smart banks
 Electric power: unmanned inspection of high-voltage lines
 Transportation: free flow at provincial toll stations
 Internet: intelligent recommendations
 Carriers: smart customer service centers
Quiz
1. (Single) Huawei's AI strategy is to invest in basic research and talent development, build a full-stack, all-scenario AI portfolio, and foster an open global ecosystem. ( )
A. TRUE
B. FALSE
Summary
⚫ This course describes AI, a new general purpose technology, and introduces the 10 changes that will shape the future. It also elaborates on Huawei's AI development strategy and AI portfolio.
Thank you.
Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.
Overview of ModelArts
Objectives
⚫ On completion of this course, you will be able to:
 Know about the functions and usage of ModelArts.
Contents
1. Overview of ModelArts
Huawei's Full-Stack, All-Scenario AI Portfolio
⚫ Application enablement: provides end-to-end services (ModelArts), layered APIs, and pre-integrated solutions; HiAI Engine on devices.
⚫ Framework: MindSpore supports a unified training and inference framework that is independent of device, edge, and cloud; TensorFlow, PyTorch, and PaddlePaddle are also supported.
⚫ CANN: a chip operator library and highly automated operator development tool.
⚫ Ascend (Nano, Tiny, Lite, Mini, Max): a series of NPU IPs and chips based on a unified, scalable architecture.
⚫ Atlas: enables an all-scenario AI infrastructure solution oriented to the device, edge, and cloud, based on the Ascend series AI processors and various product forms.
⚫ Huawei's "all AI scenarios" indicate different deployment scenarios for AI, including public clouds, private clouds, edge computing in all forms, industrial IoT devices, and consumer devices.
ModelArts 2.0, an Open Platform for Inclusive AI
⚫ For AI users (application developers, data scientists, AI specialists, AI ops), covering a pipeline of data, data processing, model training, model management, and deployment, with online learning and model updates closing the loop.
⚫ Data: data collection, data filtering, data labeling, version management, public datasets, multi-person labeling, intelligent data filtering, intelligent data labeling, intelligent data analysis, and labeling scoring and evaluation.
⚫ Data processing and model training: online coding, common AI frameworks, built-in algorithms, auto hyperparameter optimization, distributed clusters, model visualization, auto learning (ExeML), Ascend AI chip compatibility, few-shot/semi-supervised learning, data augmentation/hyperparameter/NAS metasearch, graph neural network learning, reinforcement learning, and federated learning.
⚫ Model management: model repository, model traceback, precision tracking, model evaluation, and model conversion.
⚫ Deployment: cloud real-time services, cloud batch services, device-edge-cloud synergy, and auto hard-example mining, feeding AI apps.
⚫ AI market: model transactions, API transactions, built-in model fine-tuning, and open PC SDKs.
5 Huawei Confidential
Innovative Data Processing Reduces Data Preparation

• Intelligent data filtering


Data pre-screening based on
unsupervised learning

• Hybrid intelligent labeling


Hybrid intelligent labeling based on
semi-supervised learning/AET/active
learning

• Data feature analysis


Automatic extraction of distribution
statistics, and data diagnosis based
on data features

• Savings on workforce

50% to 80%

6 Huawei Confidential
Data Feature Mining - Data Enhancement Suggestions
⚫ 20+ features extracted automatically:
 Bounding box features: bounding box number, bounding box size, bounding box aspect ratio
 Quality features: saturation, brightness, clarity
 Image properties: resolution, complexity, colorfulness
ExeML Engine, Automated AI Pipelines
⚫ Zero coding and zero AI experience required; a training job finishes in 20 minutes:
 Step 1: upload and label data
 Step 2: train the model
 Step 3: validate and deploy the model
Interactive Notebook Coding
• Multiple languages: Python 2.7, Python 3.6
• Multiple resources: CPU/GPU/Ascend
• Configurable auto-stop time
• Built-in environments: TensorFlow, PyTorch, Spark MLlib, Scikit-Learn, XGBoost; customization supported
Guided Training Process
1. Algorithm source: built-in algorithms, frameworks with customized code, or custom images
2. Dataset source: built-in datasets, or user datasets on OBS
3. Hyperparameter configuration: input/output directories and running hyperparameters
4. Computing resources: CPU/GPU/NPU and node numbers
5. Start training
Real-Time Observation During Training
⚫ Logs, configurations, resource utilization metrics, visualization, versions, parameter templates, and model traceback.
Model Deployment: Cloud, Edge, Terminal
⚫ Online service (API): high throughput, low latency, elastic scaling
⚫ Batch inference: large-batch data inference with high-efficiency distributed computing
⚫ Model compression (for hardware, frequency, and precision limits): distillation, pruning, estimation, quantization
⚫ Edge inference: integrated with IEF
⚫ Terminal inference: supports Huawei SDC, HiLens Kit, and Atlas 500
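Of the compression techniques listed above, quantization is the simplest to demonstrate. A hedged sketch using PyTorch's dynamic quantization; this is ordinary PyTorch, not the ModelArts API, and the model and sizes are made up for illustration:

# Dynamic quantization sketch in PyTorch (illustrative; not ModelArts-specific).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
# Convert Linear weights to int8; activations are quantized dynamically at
# inference time, shrinking the model and speeding up CPU inference.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear},
                                             dtype=torch.qint8)
print(qmodel(torch.rand(1, 256)).shape)   # torch.Size([1, 10])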
AI Market: Trade, Share, Learn
⚫ Sharing: ModelArts datasets, preprocessors, APIs, IPython notebooks, flows, services, algorithms, models, and Docker images
⚫ Trading: Model Hub models and algorithms, HiLens skills, and HoloSens applications
⚫ Learning: IPython notebooks, AI panels, papers and implementations, AI competitions, and custom-made knowledge
ModelArts: Development Platform Used by HUAWEI CLOUD AI Services
⚫ OCR: receipt, vehicle license plate, and card recognition
⚫ Reverse image search: product search, copyright protection
⚫ Object detection: detection of floating objects in rivers
Autonomous Driving - Data Processing and Model Training
⚫ Unified management platform: sample management, training management, and dataset management, with data quality analysis, model quality analysis, and responsibility analysis; supported by a big data analysis subsystem and a device-cloud synergy application.
⚫ Data processing subsystem (using ModelArts): data collection and receiving, data preprocessing, data cleaning and filtering, and automatic/manual data labeling; daily PB-level data processing.
⚫ Algorithm training subsystem (using ModelArts): perceptive and regulatory model training; multi-team/multi-model synchronous parallel training.
⚫ Application management subsystem: application management, model management, and model archiving.
⚫ Simulation test subsystem: simulation test center.
Quiz
1. (Multiple) Which of the following modules are covered by ModelArts? ( )
A. Data management
B. DevEnviron
C. AI Market
D. Model Management
E. Service Deployment
Thank you.
Copyright © 2020 Huawei Technologies Co., Ltd. All Rights Reserved.