Natural Language Processing Theory and Applications
Objectives
Upon completion of this course, you will be able to:
- Understand the basic knowledge of Natural Language Processing (NLP)
- Master the algorithms of the Recurrent Neural Network (RNN)
Contents
1. Introduction to NLP
2. Knowledge Required
3. Key Tasks
4. Applications
What Is a Natural Language
- A natural language is a symbolic system, embodied externally as speech, that consists of vocabulary and grammar. Text and sound are two attributes of a language.
- A language is a tool for human communication and a carrier of human thinking. In human history, knowledge recorded and spread in the form of language accounts for more than 80% of total human knowledge.
- A natural language is established by convention, unlike artificial languages such as Java, C++, and other programming languages.
- The biggest difference between natural and artificial languages lies in ambiguity. For example: "We give bananas to monkeys because they are hungry." vs. "We give bananas to monkeys because they are ripe." In the first sentence "they" refers to the monkeys; in the second, to the bananas. This is a problem of pronoun reference.
What Is NLP
- Natural Language Processing (NLP) is a technology that uses computers as tools to perform various kinds of processing on human-specific written and spoken natural language information.
  — Feng Zhiwei
- NLP is a branch discipline of artificial intelligence (AI) and linguistics. It studies theories and methods for effective communication between human beings and computers using natural languages.
Basic Methods of NLP (1)
- Capability model
  - A model established on linguistic rules and the hypothesis that there is a universal grammar in the human brain. It follows the idea that languages derive from the innate language capability of the human brain, and a language model is built to simulate this capability through a manually edited set of language rules.
  - It is also known as the "rationalist" language model, with Chomsky and Minsky as representatives.
- Modeling steps:
  - Formalize linguistic knowledge
  - Convert the formalized rules into algorithms
  - Implement the algorithms
- Rationalist: builds language models from manually edited linguistic rules (the capability model above).
- Empirical: learns language models statistically from large-scale corpora.
NLP Research Direction
- NLP is an important research direction in the fields of computer science and AI. It is a cross-disciplinary subject covering linguistics, computer science, mathematics, psychology, information theory, acoustics, and more.
[Figure: natural language understanding involves phonemics, morphology, lexicology, syntax, and pragmatics; natural language generation produces natural language text.]
- Phonemics: studies the sound systems of languages.
- Lexicology: describes the laws of the lexical system and explains the inherent semantic and grammatical characteristics of words.
[Figure: machine translation maps a source sentence to a target sentence.]
- Syntax analysis: At present, there are three mainstream approaches in the industry: the phrase structure grammar system, the dependency grammar system, and deep syntax analysis.
Challenges in NLP (1)
- Lexical ambiguity:
  - Word segmentation: English, written with spaces, is easier to segment than other languages.
  - Part-of-speech tagging: The part of speech of the same word varies with context.
    - I plan/v to take the postgraduate entrance exam
    - I have completed the plan/n
- Human languages are ambiguous, and rules are difficult to obtain. For example, in anaphora resolution, computer languages use explicit variables such as x and y, while natural languages use "this", "that", "he", and so on.
Challenges in NLP (3)
- Semantic ambiguity
  - "At last, a computer that understands you like your mother."
    - Meaning 1: A computer understands you as your mother does.
    - Meaning 2: A computer understands that you like your mother.
    - Meaning 3: A computer understands you as it understands your mother.
Challenges in NLP (4)
- Pragmatic ambiguity
  - "You are so bad"
    - Said to an adult who has done bad things, it is severe blame.
    - Said by a mother to her naughty son, it actually expresses a kind of love.
    - Said by a girl to her boyfriend, it is a playful expression of affection.
Contents
1. Introduction to NLP
2. Knowledge Required
   - Language Models
   - Text Vectorization
   - Common Algorithms
3. Key Tasks
4. Applications
What Is a Language Model
- A language model is an abstraction established on the basis of objective language facts.
- In practical applications, we often need to resolve problems such as:
  - Spelling correction: P(about fifteen minutes from) > P(about fifteen minuets from)
« "Alanguage model assumes that all possible sentences of a language obey a prob
ability distribution, and the occurrence probability of each sentence adds up to 1. T
he task of a language model is to predict the probability of each sentence appeari
ng in the language. For a common sentence in the language, a good language mo
del gets a relatively high probability. For a sentence that is not grammatical, the cal
culated probability approaches zero. If a sentence is regarded as a sequence of wo
rds, the language model can be represented as a calculation model. A language m
odel only models the occurrence probability of a sentence and does not attempt t
o understand the meaning of the sentence.”
« Core of a language model: using scores to make the machine know how to speak
Neural Network Language Model (1)
[Figure: classic neural network language model. The i-th output estimates P(w_t = i | context). Input words are mapped through a shared matrix C to embeddings C(w_{t-n+1}), ..., C(w_{t-1}), followed by a hidden layer and a softmax output layer, where most of the computation occurs.]
- The Bag-of-Words model is the earliest text vectorization method, using the word as the basic processing unit. However, it has the following shortcomings:
  - Curse of dimensionality
  - Semantic gap: words are the basic units of semantic expression, but the Bag-of-Words model only treats words as symbols and does not capture any semantic information.
- Word2vec and other word vector models are based on the distributional hypothesis: words with similar contexts have similar semantics. This rests on the linguistic principle of "iconicity of distance": a word and its context constitute an image, and when similar images are learned from the corpus, their semantics are similar. The neural network language model (NNLM) models the relationship between the context and the target word. Some researchers proposed representing the meaning of a word by the distribution of its contexts, namely the Word Space Model.
Neural Network Language Model (2)
[Figure: RNN-based language model; each input word updates the hidden state, and a softmax layer outputs the probability of the next word.]
- Softmax layer: converts the RNN output state at each step into a probability for each word in the vocabulary.
- Based on counts, an n-gram model estimates:
  P(w_i | w_{i-(n-1)}, ..., w_{i-1}) = count(w_{i-(n-1)}, ..., w_{i-1}, w_i) / count(w_{i-(n-1)}, ..., w_{i-1})
  - When n = 1, this is a unigram model: P(w_1, w_2, ..., w_m) = P(w_1)P(w_2) ... P(w_m)
  - When n = 2, this is a bigram model: P(w_1, w_2, ..., w_m) = P(w_1)P(w_2|w_1) ... P(w_m|w_{m-1})
- For example, given the corpus:
  <s> I am Lily </s>
  <s> Lily I am </s>
  <s> I do not like green eggs and ham </s>
  the bigram estimates are P(I | <s>) = 2/3 ≈ 0.667, P(am | I) = 2/3 ≈ 0.667, and P(Lily | am) = 1/2 = 0.5.
- A larger value of n captures richer order information in the model but increases the calculation workload. At the same time, longer word sequences occur less often in the corpus, so the numerator or denominator of the estimate can be zero. Therefore, a smoothing algorithm, such as Laplacian smoothing, needs to be used together with the n-gram model to solve this problem (see the sketch below). A unigram model completely loses the order information in the sentence and is rarely adequate.
Relationship Between NNLM and Statistical Language Model
- Similarity: Both models treat a sentence as a word sequence and calculate the sentence probability.
- Differences:
  - Manner of probability calculation: Based on the Markov assumption, the n-gram model considers only the previous n-1 words, while the NNLM considers the context of the whole sentence.
  - Manner of model training: The n-gram model estimates its parameters by word-based maximum likelihood estimation; the NNLM trains the model with neural network optimization methods.
- Recurrent neural networks (RNNs) can store context information of any length in a hidden state, not subject to the window limit of the n-gram model (a minimal sketch follows below).
Contents
1. Introduction to NLP
2. Knowledge Required
   - Language Models
   - Text Vectorization
   - Common Algorithms
3. Key Tasks
4. Applications
Text Vectorization (1)
- Text vectorization: represents text by a series of vectors that express the semantics of the text. Common vectorization algorithms are as follows:
  - one-hot
  - TF-IDF
  - word2vec
    - CBOW model
    - Skip-gram model
  - doc2vec/str2vec
    - Distributed Memory (DM)
    - Distributed Bag of Words (DBOW)
[Figure: 2-D projection of word vectors; semantically related words such as "king"/"queen" and "woman" cluster together, as do "cat" and "fish".]
word2vec - CBOW Model
[Figure: CBOW architecture. C context words, each one-hot encoded in V dimensions (C x V-dim input layer), are mapped through a shared weight matrix W (V x N) and averaged into an N-dimensional hidden layer; an output matrix and softmax predict the V-dimensional target word.]
word2vec - Skip-gram Model
[Figure: Skip-gram architecture. The one-hot target word (V-dim input layer) is mapped to an N-dimensional hidden layer and then to C output distributions over the vocabulary (C x V-dim), predicting each context word.]
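Both architectures are exposed through gensim's Word2Vec class; the sketch below shows the switch between them (the corpus and hyperparameter values are illustrative assumptions):

```python
from gensim.models import Word2Vec

# Tokenized toy corpus; real applications train on a large corpus.
sentences = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["the", "cat", "eats", "fish"],
]

# sg=0 selects CBOW (context words predict the target word);
# sg=1 selects skip-gram (the target word predicts its context words).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

vector = model.wv["queen"]                       # learned word embedding
similar = model.wv.most_similar("king", topn=3)  # nearest words in vector space
```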
doc2vec - DM Model
[Figure: DM model. The paragraph id is mapped through a paragraph matrix to a paragraph vector, which is averaged or concatenated with the context word vectors ("the", "cat", "sat") to predict the next word via a classifier.]
doc2vec - DBOW Model
[Figure: DBOW model. The paragraph vector alone, looked up from the paragraph matrix by paragraph id, is used to predict words sampled from the paragraph ("the", "cat", "sat", "on").]
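gensim's Doc2Vec covers both variants via the dm flag; a minimal usage sketch with an assumed toy corpus:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
    TaggedDocument(words=["dogs", "chase", "cats"], tags=["doc1"]),
]

# dm=1 selects DM (paragraph vector + context words predict the next word);
# dm=0 selects DBOW (the paragraph vector alone predicts words from the paragraph).
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40, dm=1)

paragraph_vector = model.dv["doc0"]                            # training document vector
inferred = model.infer_vector(["a", "cat", "on", "a", "mat"])  # vector for unseen text
```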
Contents
1. Introduction to NLP
2. Knowledge Required
   - Language Models
   - Text Vectorization
   - Common Algorithms
3. Key Tasks
4. Applications
HMM (1)
HMM (2)
[Figure: schematic diagram of the Hidden Markov Model (HMM). Hidden states form a chain, each transition goes from one hidden state to the next hidden state, and each hidden state emits an observation.]
HMM (3)
- HMM is a statistical model used to describe Markov processes with hidden parameters. Mathematically, decoding is expressed by the following formula:
  max P(h|w) = max P(h_1 h_2 ... h_n | w_1 w_2 ... w_n)
- Introduction:
  - Maximum entropy model: As the saying goes, do not put all your eggs in one basket. (There may be dozens or even hundreds of factors that affect stock fluctuation, and the maximum entropy method can find a model that satisfies thousands of different conditions at the same time.) To minimize risk, keep all uncertainties.
- Bayesian formula:
  max P(h|w) = max P(w|h)P(h) / P(w)
- P(w) is constant for a given observation sequence, so this is equivalent to:
  max P(w|h)P(h)
- Applying the observation independence hypothesis and the chain rule:
  max P(w|h)P(h) ≈ max P(w_1|h_1)P(h_1) · P(w_2|h_2)P(h_2|h_1) · ... · P(w_n|h_n)P(h_n|h_{n-1})
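The decoding problem above (finding the hidden sequence that maximizes this product of emission and transition probabilities) is commonly solved with the Viterbi algorithm. A minimal NumPy sketch; the toy probabilities at the end are assumptions for illustration:

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most probable hidden-state sequence for an observation sequence.

    obs:     list of observation indices w_1..w_T
    start_p: (S,)   initial probabilities P(h_1)
    trans_p: (S, S) transition probabilities P(h_t | h_{t-1})
    emit_p:  (S, O) emission probabilities  P(w_t | h_t)
    """
    S, T = len(start_p), len(obs)
    score = np.zeros((T, S))            # best log-probability ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery

    score[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(S):
            cand = score[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(cand)
            score[t, s] = cand[back[t, s]] + np.log(emit_p[s, obs[t]])

    path = [int(np.argmax(score[-1]))]  # trace the best path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: 2 hidden states, 3 observation symbols (probabilities assumed).
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], start, trans, emit))
```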
[Figure: LSTM cell structure; source: Colah, 2015, "Understanding LSTM Networks".]
BiRNN
- In a typical RNN, state is transmitted unidirectionally, from front to back. However, in some problems, the output at the current moment depends not only on the previous states but also on subsequent states. For example, predicting a missing word in a sentence requires both the preceding text and the following content. A bi-directional recurrent neural network (BiRNN) resolves such problems: it is composed of two RNNs, one on top of the other, and the output is determined by the states of both. A short illustration follows below.
Contents
1. Introduction to NLP
2. Knowledge Required
3. Key Tasks
   - Word Segmentation
   - Part-of-Speech Tagging
   - Named Entity Recognition
   - Keyword Extraction
   - Syntax Analysis
   - Semantic Analysis
4. Applications
Chinese Word Segmentation
- Chinese word segmentation is the task of splitting Chinese text (a sequence of Chinese characters) into words, that is, dividing a string of written language into its component words.
- For example, a segmented sentence is written with "/" separating the words.
Regular Word Segmentation (2)
- MM (Maximum Matching) method:
[Figure: MM flowchart. Initialization: take the character string S1 to be segmented, the output string S2, and the maximum word length MaxLen; repeatedly take a candidate of up to MaxLen characters from S1 and look it up in the dictionary, appending matched words to S2.]
- MM method:
  - Basic idea: Assume that the longest word in the segmentation dictionary has i Chinese characters. Use the first i characters of the current string as the matching field and look it up in the dictionary. If such an i-character word exists in the dictionary, the match succeeds and the matched field is segmented off as a word. If not, the match fails: the last character of the matching field is removed and the remaining field is matched again. This repeats until a word is segmented off or the remaining field is empty, and then continues with the rest of the string. A runnable sketch follows below.
- RMM method: The basic principle is the same as that of the MM method, but the direction of segmentation is opposite: matching starts from the end of the string.
- Statistical word segmentation:
  - Main idea: Words are composed of characters, the smallest units of a word. If certain characters frequently appear next to each other in different texts, they are likely to form a word. Therefore, the co-occurrence frequency of adjacent characters in the corpus reflects the reliability of a word candidate: when the combination frequency is higher than a certain threshold, the character combination is considered likely to constitute a word.
- Hidden state: Choose an HMM when the hidden state sequence needs to be inferred from an observation sequence (the character sequence).
[Figure: neural word segmentation; characters are mapped to word embeddings and fed to a sequence model that tags segment boundaries.]
Mixed Word Segmentation
- In most practical engineering applications, one word segmentation algorithm is used with the assistance of others. The most common practice is dictionary-based word segmentation assisted by statistical word segmentation algorithms.
- In fact, the segmentation quality of rule-based algorithms, HMM, CRF, and deep learning algorithms differs little on specific tasks.
Contents
1. Introduction to NLP
2. Knowledge Required
3. Key Tasks
   - Word Segmentation
   - Part-of-Speech Tagging
   - Named Entity Recognition
   - Keyword Extraction
   - Syntax Analysis
   - Semantic Analysis
4. Applications
Part-of-Speech Tagging
- Part-of-speech tagging: the process of tagging the correct part of speech for each word in a sentence after word segmentation, that is, determining whether each word is a noun, a verb, an adjective, or another part of speech. For example: march toward/v, be filled with/v, hope/n, of/uj, new/a, century/n.
  - Part of speech: a basic syntactic attribute of a word.
  - Purpose: a pre-processing step for many NLP tasks, such as syntax analysis and information extraction. Part-of-speech tags bring great convenience to downstream processing, though they are not always indispensable.
  - Methods: rule-based, statistics-based, and deep learning-based methods.
- Statistics-based methods: treat tagging as a sequence labeling problem, for example using an HMM in which the part-of-speech tags are the hidden states and the words are the observations.
Named Entity Recognition (1)
- Named Entity Recognition (NER): also known as "proper name recognition", refers to recognizing entities with specific meanings in text, mainly including person names, place names, institution names, and proper nouns. For example: metallurgy/n, ministry of industry/n, Hongkong/n, fireproofing material/l, research institute/n.
  - Classification: Named entities studied by NER are divided into three categories (entity, time, and number) and seven subcategories (person name, place name, institution name, time, date, currency, and percent).
  - Function: Like automatic word segmentation and part-of-speech tagging, named entity recognition is a fundamental task of NLP. It is essential to technologies such as information extraction, information retrieval, machine translation, and question answering systems.
  - Steps:
    - Recognize the entity boundary.
    - Determine the entity category (such as person name, place name, or institution name).
- See https://blog.csdn.net/ZJL0105/article/details/82194610.
Named Entity Recognition (2)
- Difficulties:
  - There are a large number of named entities of various kinds.
  - The composition of named entities is complex.
  - Entities can be nested within one another, which complicates recognition.
  - Entity length is uncertain.
Deep Learning NER
[Figure: neural NER architecture for the sentence "Billy goes to the training center to study": character/word vectors and word embeddings feed a Bi-LSTM, with a CRF layer producing the entity tags.]
- Source: https://blog.csdn.net/DataGrand/article/details/83312169
Contents
1. Introduction to NLP
2. Knowledge Required
3. Key Tasks
   - Word Segmentation
   - Part-of-Speech Tagging
   - Named Entity Recognition
   - Keyword Extraction
   - Syntax Analysis
   - Semantic Analysis
4. Applications
Keyword Extraction
- Keywords are a group of words that represent the important content of an article. In practice, a large amount of text carries no author-assigned keywords, so automatic keyword extraction enables convenient browsing and retrieval of information and plays an important role in text clustering, classification, and automatic summarization.
- Keyword extraction algorithms can be divided into supervised and unsupervised types:
  - Supervised: Supervised keyword extraction is treated as classification. Such algorithms build a comprehensive word list and determine the degree of matching between each word in a document and the word list, extracting keywords in a manner similar to tagging.
  - Unsupervised: Such algorithms require neither a manually generated and maintained word list nor a manually labeled corpus for training. They include TF-IDF, TextRank, and topic model algorithms (such as LSA, LSI, and LDA).
- Supervised algorithms can achieve higher accuracy, but they require a large amount of tagged data, which incurs high labor costs.
TF-IDF Algorithm (1)
- Term Frequency-Inverse Document Frequency (TF-IDF): a statistical method commonly used to assess the importance of a word to a document within a document set.
- For example:
  "On World Blood Donor Day, school groups and blood donation service volunteers can go to the blood center to visit the inspection process. We will publicize the test results, and the price of blood will also be publicized."
  In this text, the words "blood", "the", "donor", and "visit" each occur 4 times. Under the TF algorithm alone, their importance to the document is the same; however, "blood" and "donor" are obviously more important to this document.
TF-IDF Algorithm (2)
- The TF algorithm counts how often a word appears in a document. The basic idea: a word that appears more often in a document better expresses that document.
  tf_ij = n_ij / Σ_k n_kj  (number of times word i appears in document j, divided by the total number of words in document j)
- The IDF algorithm counts the number of documents in the set that contain a given word. The basic idea: a word that appears in fewer documents better distinguishes documents.
  idf_i = log(|D| / (1 + |D_i|))
  where |D| is the total number of documents in the set and |D_i| is the number of documents containing word i.
- TF-IDF algorithm:
  tf×idf(i, j) = tf_ij × idf_i = (n_ij / Σ_k n_kj) × log(|D| / (1 + |D_i|))
- IDF algorithm: Adding 1 to the denominator applies Laplacian smoothing. This avoids a zero denominator when a new word does not appear in the corpus, enhancing the robustness of the algorithm.
- TF-IDF algorithm: the combination of the TF and IDF algorithms. Scholars have studied extensively how to combine the two: whether to add or multiply them, and whether to take the logarithm in the IDF calculation. After much theoretical derivation and experimental research, multiplication has been found to be one of the more effective combinations. A small worked example follows below.
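A small worked implementation of the formulas above; the toy documents loosely echo the blood-donor example and are assumptions:

```python
import math

docs = [
    ["blood", "donor", "day", "blood", "center", "visit", "blood", "blood"],
    ["the", "price", "of", "blood", "will", "be", "publicized"],
    ["school", "groups", "visit", "the", "center"],
]

def tf_idf(word, doc, docs):
    tf = doc.count(word) / len(doc)          # n_ij / sum_k n_kj
    d_i = sum(1 for d in docs if word in d)  # |D_i|
    idf = math.log(len(docs) / (1 + d_i))    # +1 smoothing: denominator never 0
    return tf * idf

# "donor" appears in only one document, so it scores higher than the
# ubiquitous "blood"; with +1 smoothing, very common words can score <= 0.
for word in ["blood", "donor", "visit"]:
    print(word, round(tf_idf(word, docs[0], docs), 4))
```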
TextRank Algorithm (1)
- The basic idea of the TextRank algorithm comes from Google's PageRank algorithm. PageRank is a link analysis algorithm proposed by Google founders Larry Page and Sergey Brin while building an early search engine prototype in 1997. The algorithm evaluates the importance of a web page in the search system based on two ideas:
  - Link quantity: a web page is more important if it is linked to by more other web pages.
  - Link quality: a web page is more important if it is linked to by pages that themselves have higher weight.
[Figure: a directed graph of web pages linking to one another.]
- Difference between TextRank and the other algorithms: TextRank can extract the keywords of an individual document without a corpus.
  - By contrast, TF-IDF must count, over a corpus, how many documents each word appears in, that is, the inverse document frequency.
- The PageRank score of page V_i is computed from the pages linking to it:
  S(V_i) = Σ_{V_j ∈ In(V_i)} (1 / |Out(V_j)|) × S(V_j)
- To avoid the score of an isolated web page being 0, a damping factor d is added:
  S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} (1 / |Out(V_j)|) × S(V_j)
- PageRank operates on a directed unweighted graph, while TextRank for automatic summarization uses a weighted graph, because in addition to the importance of linked sentences, the similarity between two sentences must also be considered. The complete TextRank expression is therefore:
  WS(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} (w_ji / Σ_{V_k ∈ Out(V_j)} w_jk) × WS(V_j)
TextRank Algorithm (3)
- When TextRank is applied to keyword extraction, there are two main differences compared with applying it to automatic summarization (a simplified sketch follows below):
  - The associations between words carry no weight.
  - Not every word is linked to all the other words in the document.
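A simplified sketch of TextRank for keywords under exactly these two differences: an unweighted graph where words are linked only within a small co-occurrence window, scored with the damped iteration shown earlier (the window size and damping factor are common defaults, not values from the course):

```python
from collections import defaultdict

def textrank_keywords(words, window=2, d=0.85, iters=50):
    """Unweighted co-occurrence graph + PageRank-style scoring."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                neighbors[w].add(words[j])  # link words within the window only

    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in neighbors
        }
    return sorted(score, key=score.get, reverse=True)

words = "natural language processing makes machines understand natural language".split()
print(textrank_keywords(words)[:3])  # highest-ranked candidate keywords
```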
LSA, LSI, and LDA Algorithms
- A topic model assumes that there is no direct connection between words and documents; they are connected through another dimension called a topic. Each document corresponds to one or more topics, each topic has a corresponding word distribution, and the word distribution of each document can be obtained through its topics.
- In general, TF-IDF and TextRank satisfy most keyword extraction tasks. However, in some scenarios, extraction based on the document alone is not sufficient, because some keywords may not appear in the document at all. For example, a science article about animal habitats may introduce lions, tigers, and crocodiles without ever containing the word "animal". In such cases the two algorithms above are inapplicable, and a topic model is required.
- The previous two models extract keywords from the relationship between words and documents, using only statistical information in the text. They do not fully exploit the rich semantic information, which is obviously very useful for keyword extraction.
LSA/LSI
- Both Latent Semantic Analysis (LSA) and Latent Semantic Indexing (LSI) analyze the latent semantics of documents; LSI additionally builds an index from the analysis result. LSA and LSI are usually treated as the same algorithm. The analysis steps are as follows:
  - Represent each document as a vector using the BOW model.
  - Combine all document vectors into a word-document matrix (m x n).
  - Perform singular value decomposition (SVD) on the word-document matrix ([m x r], [r x r], [r x n]).
  - According to the SVD result, each word and each document can be represented as a point in a space formed by r topics. The similarity between each word and each document can then be calculated, and the words with the highest similarity are selected as keywords of the document.
- Compared with the conventional vector space model, which makes little use of semantic information, LSA maps words and documents to a low-dimensional semantic space via SVD, mining shallow semantic information and expressing documents more essentially (a sketch follows below).
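The steps above map directly onto scikit-learn; a minimal sketch (the documents are illustrative, and sklearn builds the document-word matrix, the transpose of the slide's word-document matrix, which yields an equivalent decomposition):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "lions and tigers live in the savanna and the jungle",
    "crocodiles live in rivers and swamps",
    "stock prices fluctuate with market conditions",
]

# Steps 1-2: BOW vectors combined into a document-word matrix.
X = CountVectorizer().fit_transform(docs)

# Step 3: truncated SVD keeps r topics (r=2 here, chosen arbitrarily).
svd = TruncatedSVD(n_components=2)
doc_topics = svd.fit_transform(X)  # each document as a point in topic space
word_topics = svd.components_.T    # each word as a point in the same space

print(doc_topics.shape, word_topics.shape)
```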
Contents
1. Introduction to NLP
2. Knowledge Required
3. Key Tasks
   - Word Segmentation
   - Part-of-Speech Tagging
   - Named Entity Recognition
   - Keyword Extraction
   - Syntax Analysis
   - Semantic Analysis
4. Applications
Syntax Analysis
- The main task of syntax analysis is to identify the syntactic components contained in a sentence and the dependencies between these components. It can be divided into syntactic structure analysis and dependency analysis. The result of syntax analysis is represented by a syntax tree.
[Figure: example syntax tree rooted at "want".]
- Machine translation is a major application field of NLP, and the syntax tree is a core data structure in machine translation. Syntax analysis is a core technology of NLP and the basis of deep language understanding.
- With the use of deep learning in NLP, especially the application of LSTM models that implicitly capture syntactic relationships, syntax analysis has become less important. However, it can still play a great role for long sentences with very complex syntactic structures and for tasks with few tagged samples, so the study of syntax analysis remains necessary.
- Reference: https://blog.csdn.net/yu5064/article/details/82151578
Contents
1. Introduction to NLP
2. Knowledge Required
3. Key Tasks
   - Word Segmentation
   - Part-of-Speech Tagging
   - Named Entity Recognition
   - Keyword Extraction
   - Syntax Analysis
   - Semantic Analysis
4. Applications
Semantic Analysis
- In compiler theory, semantic analysis is a logical phase of the compilation process. In NLP, the task of semantic computing is to explain the meaning of the parts of a text (words, phrases, sentences, paragraphs, or chapters) in natural languages.
Importance of Semantic Analysis
- Is it enough to know just the structure of a sentence?
- For example, the syllogism:
  All men are mortal.
  Socrates is a man.
  Therefore, Socrates is mortal.
- Sentences with the same syntactic structure often vary greatly in semantics. In such cases, the analysis cannot proceed without semantic analysis.
Contents
1. Introduction to NLP
2. Knowledge Required
3. Key Tasks
4. Applications
Applications (1)
- Text classification: the process of associating given text with one or more categories, based on the characteristics of the text, under a predefined classification system. Examples: spam detection and sentiment analysis.
- Text clustering: clusters text based on the clustering hypothesis that documents in the same category have high similarity while documents in different categories have low similarity.
- Machine translation: uses computers to translate between different languages. Research and exploration began with the birth of the first computer: memory-based -> instance-based -> statistical machine translation -> neural network translation.
- Question answering system: an information retrieval system that receives questions raised by users in natural language and finds or infers answers from a large amount of heterogeneous data.
- Definition phase: defines the data and the classification system, including the specific category division and the data required.
- Data feature extraction: reduces the dimensionality of the document matrix and extracts the most useful features from the training set. Algorithms include BOW, TF-IDF, and N-gram. A minimal end-to-end sketch follows below.
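As an end-to-end illustration of the definition and feature-extraction phases, here is a minimal, hypothetical spam-detection pipeline assuming scikit-learn (the data and model choice are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset: the "definition phase" fixes the categories.
texts = ["win a free prize now", "meeting at ten tomorrow",
         "free money click here", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

# Feature extraction with TF-IDF (as described above) + a linear classifier.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)
print(classifier.predict(["free prize money"]))  # classify new text
```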
- Information extraction does not attempt to fully understand the entire document; it analyzes only the parts of the document that contain relevant information. Which information is relevant is determined by the predefined scope of the field.
Quiz
1. Which of the following options is NOT one of the three levels of NLP? ( )
   A. Lexical analysis
   B. Syntax analysis
   C. Speech analysis
   D. Semantic analysis
- Answer: 1. C
Summary
- This course introduced the basic knowledge of NLP, described language models and commonly used algorithms, and illustrated the key tasks and applications of NLP.
Thank You
www.huawei.com