
Vietnamese Word Segmentation
Conference Paper · January 2001
Source: DBLP · Citations: 34 · Reads: 835
Available at: https://www.researchgate.net/publication/220706845
All content following this page was uploaded by Dinh Dien (Ho Chi Minh City University of Science) on 28 May 2014.


Vietnamese Word Segmentation
Dinh Dien, Hoang Kiem, Nguyen Van Toan
Faculty of Information Technology
National University of HCM City
227 Nguyen Van Cu, Dist. 5, HCM City, VIETNAM
[email protected]

Abstract

Word segmentation is the first and obligatory task for every NLP system. For inflectional languages like English, French, and Dutch, word boundaries are simply assumed to be whitespaces or punctuation. In many Asian languages, including Chinese and Vietnamese, whitespaces are never used to determine word boundaries, so one must resort to higher levels of information: morphology, syntax, and even semantics and pragmatics. In this paper, we present a model combining the WFST (Weighted Finite-State Transducer) approach and a neural network. This word segmentation system is applied to Vietnamese text-to-speech and Vietnamese POS tagging. We evaluate its performance by comparing its word segmentation results with a manually annotated corpus, and the performance proves to be very good. Our algorithm achieves 97% accuracy on a corpus of Vietnamese electronic textbooks.

1 Introduction

Vietnamese is an isolating language; however, unlike other isolating languages such as Chinese and Thai, Vietnamese is written in extended Latin characters. The treatment of those languages therefore cannot be mechanically applied to Vietnamese, and one of the pending problems in Vietnamese NLP at present is identifying word boundaries.

In linguistics, the word is a basic unit. Therefore, in order to process Vietnamese computationally, the first and foremost task is determining word boundaries, which is still an open problem for Vietnamese. Unlike Indo-European languages, where a "word is a group of letters having meaning, separated by spaces in the sentence" (definition in Webster's Dictionary), in Vietnamese and other Asian languages whitespaces are not used to identify word boundaries. Word segmentation is nevertheless indispensable for such applications as search engines, word processors, spelling checkers, speech processing, etc.

Therefore, this has become an interesting problem for both linguists and computer scientists. The following points must be resolved to proceed with word segmentation:
a. Local ambiguity in compound words.
b. No comprehensive dictionaries.
c. Recognition of proper nouns and names.
d. Morphemes and reduplicatives.

2 General Instructions of Vietnamese linguistics

2.1 "Tiếng" (Vietnamese syllable / morpheme)

Vietnamese has a special linguistic unit called "tiếng" (equivalent to the hanzi of Chinese), which is similar to the traditional morpheme in respect of content and to the traditional syllable in respect of form. Unlike the hanzi of Chinese, each "tiếng" of Vietnamese has one and only one pronunciation. "Tiếng" is the basic unit of Vietnamese, and it is constructed from phonemes under the following structure (Đinh Lê Thư, 1999). For example, "toán" (math / group): initial letter = t, pre-sound = o, main sound = a, final sound = n, tone mark = ´ (acute).

"Tiếng" may be:
• A word, e.g. "chị" (sister), "tôi" (I).
• A morpheme, e.g. "hoa" (flower) and "hồng" (pink) in the word "hoa hồng" (rose).
• A sub-morpheme, e.g. "bù" and "nhìn" in the word "bù nhìn" (puppet).

For simplicity, we can consider "tiếng" as "Vietnamese morpheme", "Vietnamese syllable", or "syllable" in short.
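As an illustration of this structure, one can sketch a toy decomposition of a syllable into the parts above. This is only a sketch under stated assumptions: the tone, initial, and final tables below are tiny hypothetical samples, far from the full Vietnamese inventories.

```python
# Toy tables -- hypothetical samples, not the full Vietnamese inventories.
TONE_MARKS = {"á": ("a", "acute"), "à": ("a", "grave"), "ả": ("a", "hook"),
              "ã": ("a", "tilde"), "ạ": ("a", "dot")}
INITIALS = ("ngh", "ng", "nh", "th", "tr", "ch", "ph", "kh",
            "t", "n", "m", "b", "c", "d", "h", "l", "s", "v")
FINALS = ("ng", "nh", "ch", "n", "m", "p", "t", "c")

def decompose(syllable):
    """Split one "tiếng" into initial letter, pre-sound, main sound,
    final sound, and tone, following the structure described above."""
    tone, base = "level", []
    for ch in syllable:
        if ch in TONE_MARKS:          # strip the tone mark, remember it
            plain, tone = TONE_MARKS[ch]
            base.append(plain)
        else:
            base.append(ch)
    s = "".join(base)
    initial = next((p for p in INITIALS if s.startswith(p)), "")
    rest = s[len(initial):]
    final = next((p for p in FINALS
                  if rest.endswith(p) and len(rest) > len(p)), "")
    nucleus = rest[:len(rest) - len(final)] if final else rest
    pre, main = (nucleus[:-1], nucleus[-1]) if len(nucleus) > 1 else ("", nucleus)
    return {"initial": initial, "pre": pre, "main": main,
            "final": final, "tone": tone}

print(decompose("toán"))
# → {'initial': 't', 'pre': 'o', 'main': 'a', 'final': 'n', 'tone': 'acute'}
```

Run on the paper's example "toán", this recovers initial t, pre-sound o, main sound a, final sound n, and the acute tone.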
2.2 Words

There exist various definitions of Vietnamese words, but all linguists reach unanimous agreement on the following points (Đinh Điền, 2000):
a. Words must be integral in respect of form and meaning, and independent in respect of syntax.
b. They are structured from "tiếng" (Vietnamese morphemes/syllables).
c. They consist of simple words (1 tiếng, monosyllabic) and complex words (n tiếng, n < 5, polysyllabic), e.g. reduplicatives and compounds.
For example: "chị" (sister): simple word; "hoa hồng" (rose): compound word; "chúm chím" (smile slightly): reduplicative word (repeated initial consonants and/or tones); …

3 Previous Works

At present, there are some models for Vietnamese which prove not to be very practical, whereas for other isolating languages such as Chinese and Thai this matter has been resolved rather acceptably with the following approaches.

3.1 Rule-based approach

This approach is well represented by the following models:
a. Longest Matching / Greedy Matching models (Yuen Poowarawan, 1986; Sampan Rarunrom, 1991).
b. Maximal Matching models, divided into "forward maximum match" and "backward maximum match", for which a fully comprehensive dictionary is indispensable. However, it is obvious that no comprehensive dictionary exists, and depending on the context this model requires suitable dictionaries. Examples:
• Thai: Sornlertlamvanich (1993).
• Chinese: Chih-Hao Tsai (1996), MMSeg 2000; 98% accuracy on a corpus of 1300 simple sentences, with no solution for proper nouns and unknown words.

3.2 Statistics-based approach

This approach is based on word context, considering information from neighboring words to make the relevant decisions. Two points must be resolved for this approach: the context width and the applied statistical method. As far as the context width is concerned, the wider the context, the more complex the model. As far as the statistical method is concerned, the hidden first-order Markov model has usually been applied. However, this method greatly depends on its training corpus: a model trained on a political corpus cannot be applied to literary texts. In addition, there are some words of high probability but of syntactic function only, which lessens the usefulness of probability.
a. HMM, based on the Viterbi algorithm (Asanee Kawtraku, 1995; Surapant, 1995).
b. Expectation-Maximization (EM). This method resolves the "chicken and egg" question through iteration (Xianping, 1996).

3.3 Other approaches

Most of these approaches are hybrids combining other linguistic models such as WFST and TBL (Transformation-Based Error-driven Learning) (Julia, 1996). Due to the various manipulations required, processing becomes quite time-consuming, but the accuracy proves to be high. However, the linguistic knowledge that is heavily applied in rule-based models is rarely found in the above models.

4 Our Model

Based on the definition of Vietnamese words in section 2, we suggest a model (Fig. 1) which can meet all the requirements of the above-mentioned definition, as follows.
At first, we input a sentence into the pre-processing stage, where we eliminate all errors of sentence presentation. What is more important is the normalization of accents and of the spelling of y, i, etc. in Vietnamese. Due to the lack of a unanimous standard, there exist Vietnamese syllables with different spellings but identical meaning and pronunciation, e.g.: thời kỳ = thời kì (stage), hoà = hòa (equal), etc.
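A minimal sketch of this normalization step, assuming a hand-built variant table: the two entries below are just the examples from the text, and the choice of canonical side is an assumption.

```python
# Variant spellings are mapped to one canonical form via a lookup table.
# A real table would be much larger; these two entries come from the text.
CANONICAL = {
    "thời kì": "thời kỳ",   # i / y spelling variation
    "hoà": "hòa",           # accent-placement variation
}

def normalize(sentence):
    """Rewrite known variant spellings, trying the longest span first."""
    words = sentence.split()
    out, i = [], 0
    while i < len(words):
        for span in (2, 1):   # variants may cover one or two syllables
            chunk = " ".join(words[i:i + span])
            if i + span <= len(words) and chunk in CANONICAL:
                out.append(CANONICAL[chunk])
                i += span
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

print(normalize("thời kì đổi mới"))
# → thời kỳ đổi mới
```

After this pass, every syllable has a single spelling, so dictionary lookup in the later stages needs only one key per word.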
Figure 1: Flowchart of our model — start → pre-processing → WFST; if the weight margin t < T0, the sentence passes to the neural network, otherwise segmentation ends.

Then the sentence is introduced to the WFST model, where reduplicatives, proper nouns, dates, times, numbers, etc. are further identified (one characteristic of Vietnamese is that all initial letters of proper names must be capitalized). In case of any further ambiguity, the neural network model, which is our new approach, is applied.

4.1 WFST Model

Vietnamese word segmentation can be considered a stochastic transduction problem. We apply the WFST model for Chinese word segmentation (Richard Sproat, 1996) to our task as follows. We represent the dictionary D as a weighted finite-state transducer. Suppose:
H: the set of "tiếng" (syllables);
p: not used, due to the characteristic of "tiếng" (see 2.2.a);
P: the set of grammatical part-of-speech (POS) labels.
Each arc of D maps either from an element of H to an element of H, or from ε (the empty string) to an element of P. Each word in the dictionary D is represented as a sequence of arcs, starting from the initial state of D (labeled with an element S of H) and terminated with a weighted arc labeled with an element of ε × P. The weight represents the estimated cost of the word.
Next, we represent the input sentence as an unweighted finite-state acceptor (FSA) I over H. Let us suppose the existence of a function Id which takes as input an FSA A and produces as output a transducer that maps all and only the strings of symbols accepted by A to themselves. Finally, we define the best segmentation to be the sentence with the smallest weight, i.e. the best path in Id(I) ∘ D. We have improved this WFST model to make Vietnamese word segmentation more convenient, with the following peculiarities.

4.1.1 Dictionaries

The dictionary is arranged as a multiway tree (Fig. 3). Since multiway trees occupy more memory, a binary tree may be used instead to represent the dictionary. In the dictionary, each node represents a Vietnamese letter, and to every word we attach such additional details as POS, word frequency, and syntactic features.
As mentioned above, the dictionary consists of a sequence of nodes and arcs. Each word is terminated with an arc describing a transformation between ε and its POS; an estimated weight is also attached. If the probability of a word were taken directly as the weight, the calculations would be hard to execute because the figures are too small. Therefore, the weight is assigned as the negative logarithm of the probability of a concrete word:

C = -log(f / N)

where f is the word frequency and N is the size of the corpus.
Some words may have more than one POS. Therefore, we represent the POS of a word with an integer in which each bit corresponds to a certain POS (Fig. 2); if a word possesses more than one POS, more than one bit of its representing integer is on.

Figure 2: An integer whose bits (… 0 1 1 1) flag Noun, Verb, Adj, etc., to represent the POS of a word.
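The weight formula and the bit-flag POS encoding above can be sketched together. This is only an illustration: the frequency figure for "bàn" is made up, and the natural logarithm is assumed since the paper does not state the base.

```python
import math

# Bit flags for POS, one bit per tag as in Fig. 2.
NOUN, VERB, ADJ = 1 << 0, 1 << 1, 1 << 2

N = 2_000_000  # corpus size used in the paper

def weight(f):
    """Word cost C = -log(f / N); natural log assumed (base unstated)."""
    return -math.log(f / N)

# "bàn" is both a noun and a verb, so two bits are on: POS = 3,
# matching the struct example later in the paper. Frequency 29 is
# a hypothetical figure for illustration.
entry = {"word": "bàn", "POS": NOUN | VERB, "weight": weight(29)}

assert entry["POS"] == 3                         # N+V
assert entry["POS"] & NOUN and not entry["POS"] & ADJ
print(round(entry["weight"], 2))
# → 11.14
```

Testing membership of a POS is then a single bitwise AND, and adding a new POS to a word is a bitwise OR, with no change to the entry layout.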
At present, the classification of Vietnamese POS is still under debate; however, we choose only the following POS (Hoàng Phê, 2000): noun, verb, adjective, pronoun, preposition, conjunction, adverb, interjection, particle, abbreviation, idiom, and other symbols.
For example, an entry in our dictionary:

struct entry
{
    string word;       // e.g. "bàn"
    int POS;           // e.g. 3 (N+V)
    float frequency;   // e.g. 4.21
    string syn_fea;    // e.g. N_concrete, ...
};

Figure 3: Dictionary tree — each node holds one letter; a path from ROOT spells a word, whose terminal node stores its POS and weight, e.g. "bàn" (Noun, 2.1), "bà con" (Noun, 5.7).

The probability of words is calculated from a corpus of 2,000,000 words. In fact, we have constructed a dictionary of 34,000 words based on the one of the Center of Lexicography (under the National Center of Social Sciences and Humanities). In addition, we have calculated the probabilities and searched for new words based on the 2,000,000-word corpus, consisting of:
• 1.6 MB from the Complete Works of Ho Chi Minh;
• 0.6 MB from Vietnam PC-WORLD magazines;
• 0.9 MB from newspapers in science and technology;
• 0.5 MB from famous works of Vietnamese poets;
• 3.7 MB from Vietnamese literary works.

4.1.2 Identification of proper names

One advantage of Vietnamese compared with such morphosyllabic languages as Chinese and Thai is that all initial characters of proper names are capitalized, which makes their identification easier. However, an ambiguity arises here because the initial letter of a sentence is also capitalized; besides, as mentioned above, since the orthography of Vietnamese is rather complicated and not yet standardized, we also face many difficulties at this stage. For example, the following proper name has three different acceptable forms: Bộ Chính trị, Bộ chính trị, or Bộ Chính Trị (politburo).
We make use of heuristics to attribute appropriate weights to these words and then treat them as conventional words to be processed by the WFST, with a very satisfactory result (see the conclusion). This is also a new approach based on the peculiarity of Vietnamese writing.

4.1.3 Proper names

We have also studied Vietnamese personal names in the large-scale corpus and found some peculiar rules, which have also been applied by Richard Sproat for Chinese:
1. word → name
2. name → Family Given1
3. name → Family Given2
4. Family → Family1
5. Family → Family2
6. Family1 → VN-syllable
7. Family2 → VN-syllable, VN-syllable
8. Given1 → VN-syllable
9. Given2 → VN-syllable, VN-syllable
Examples: Hồ Chí Minh (Family1 + Given2); Nguyễn Du (Family1 + Given1); Lê Nguyễn Trang Đài (Family2 + Given2).

4.1.4 Identification of Reduplicatives

One major feature of Vietnamese is its large number of reduplicatives, which are also frequently coined in the course of communication. No dictionary can be comprehensive enough to cover all these reduplicatives, since no exhaustive statistics exist.
Here, we make use of the rules of morpheme transformation in reduplicatives to identify them, for example:
• lèo tèo (scattered), lẩm bẩm (murmur): only the initial consonant is changed;
• hổn hển (pant), chúm chím (smile slightly): only the main sound is changed;
• other cases: one or many components of the "tiếng" are changed, but at least one component is kept unchanged.

4.1.5 Morphological analyzer

One inevitable fact is that no dictionary is comprehensive. There is also in Vietnamese a class of words which are not available in dictionaries due to their morphological aspect: words that are morphologically derived (R. Sproat, 1996), e.g.:
• cố gắng (attempt, v) → sự cố gắng (attempt, n)
• hiện đại (modern, a) → hiện đại hóa (modernize)
• chủ tịch (president, n) → phó chủ tịch (vice-president), …
As in English, there are prefixes and suffixes, which are however much simpler in Vietnamese morphology. Therefore, we apply further morphological analysis to identify this class of words easily. The critical point is to determine the weight of these derived words (due to their unavailability in the dictionary). The weight of these new words is calculated through the application of the conditional probability of Good-Turing (Baayen). Suppose we need to calculate cost(ABC), in which AB is the radical and C is the suffix. Given p(C), the probability of C, and p(unseen(C) | C), the probability of an unseen radical given that C follows it (Fig. 4):

p(unseen(C)) = p(unseen(C) | C) · p(C)
cost(ABC) = cost(AB) + cost(unseen(C)), with cost(X) = -log p(X)

Figure 4: Diagram of Vietnamese morphology — the probability of C following each POS, e.g. Noun 1.0, Verb 0.5, Adjective 0.7, Others.

Therefore, for the words in the dictionary with prefixes and suffixes (temporarily referred to as C), we further store the probability of C when it is located right after a certain POS.

4.1.6 Method of segmenting sentences into sequences of words

The point here is to reduce the combinatorial explosion in generating the sequences of possible words from a string of syllables in a sentence. For an n-syllable sentence (in fact, the longest word in Vietnamese has fewer than 5 syllables), we have at most 2^(n-1) different word segmentations; and since a Vietnamese sentence has about 24 syllables on average, we would have to tackle roughly 8,000,000 possible segmentations.
The new method we suggest here makes combined use of the dictionary to restrict this combinatorial explosion. On finding that a certain word segmentation is not appropriate (not available in the dictionary, not a reduplicative, not a proper name, …), we eliminate the branches originating from that segmentation by calculating its weight in the sentence in advance. With this method, we restrict the number of possible segmentations to hundreds of cases (in comparison with millions).
In order to avoid too much repeated dictionary consultation, we make use of the above check to create the possible words of a sentence and separately store all their concerned details of POS, probability, etc. for convenient future evaluation. Naturally, this separate storage is of small size (approximately some hundreds of elements), and dictionary consultation terminates at this point.

4.1.7 Method of selecting the best sentence

After achieving a set of possible word segmentations of a sentence, we can usually select the best one through Viterbi-like algorithms. We used a simple method: selecting the word segmentation with the smallest total weight. For example, input = "Tốc độ truyền thông tin sẽ tăng cao." (The speed of information transmission will increase.) In the dictionary, we have:
= "tốc độ" (speed) 8.68
= "truyền" (transmit) 12.31
= "truyền thông" (communicate) 12.31
= "thông tin" (information) 7.24
= "tin" (news / information) 7.33
= "sẽ" (will) 6.09
= "tăng" (increase) 7.43
= "cao" (high) 6.95

Id(I) ∘ D yields, among others:
(1) "Tốc độ # truyền thông # tin # sẽ # tăng # cao." 48.79
(2) "Tốc độ # truyền # thông tin # sẽ # tăng # cao." 48.70
BestPath = "Tốc độ # truyền # thông tin # sẽ # tăng # cao." 48.70
(1): 8.68 + 12.31 + 7.33 + 6.09 + 7.43 + 6.95 = 48.79
(2): 8.68 + 12.31 + 7.24 + 6.09 + 7.43 + 6.95 = 48.70

Applying this model alone, we are able to segment most sentences without any ambiguity, including scientific and technical material. However, there remain large classes of sentences with structural ambiguities which cannot be completely resolved by this model.

4.2 The Neural network model

4.2.1 Role of the Neural network

After word segmentation through the WFST model, we define a threshold value t0 to determine the accuracy of the above segmentation: if the weight difference between the segmented sentences and the one with the smallest weight is more than t0, the segmentation result proves to be quite accurate. Otherwise, the WFST model is not sufficient to determine the word boundaries, and we must further process these sentences through the neural network model.
For example, consider the sentence "Học sinh học sinh học." (Pupils learn biology.) After the WFST processing, only the following three segmentations are left (others are not mentioned here, due to too small a weight difference):
1. Học sinh (N) học (V) sinh học (N) — (pupil | learn | biology)
2. Học sinh (N) học sinh (N) học (V) — (pupil | pupil | learn)
3. Học (V) sinh học (N) sinh học (N) — (learn | biology | biology)
In fact, there are sequences of POS in Vietnamese which cannot stand serially next to each other, as in the case of adjectives and nouns, and applying a rule-based model to resolve such ambiguities might not be appropriate. Even if a rule-based model is used, we face the very tough question of how many rules to apply: with too few rules, some correct sentences might be rejected, and with too many rules, all sentences are accepted as correct.
In order to resolve this problem, we have applied machine learning to the ambiguous sentences through the neural network model. We make use of this model to evaluate the suitability of the sequence of POS in a sentence. Let us examine the above example again, where our suggested neural network model is used to evaluate 3 sequences of POS: NVN, NNV, VNN. When calculating the weight of a word with more than one POS, we examine all the cases.
This model is trained on the ambiguous sentences left after processing with the first model; these ambiguous sentences are manually segmented for training. In order to check the appropriateness of a sequence of POS in a sentence, we use a "k context" for each word in the sentence: a window of k words whose descriptor slides over the sentence from the first to the last word.

Figure 5: MM: descriptor array for one word — a sequence of 16-dimension input vectors feeding a hidden layer and an output node.

Actually, our network model consists of 6 input nodes, 10 hidden nodes, and 1 output node (Fig. 6). Every input node is a 16-dimension vector (described as an integer in the implementation) representing the POS of a word, as mentioned in 4.1.1. All punctuation marks are also considered as POS, and a value is attributed according to the form of punctuation.
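The smallest-total-weight selection of section 4.1.7, together with the dictionary pruning of 4.1.6, can be sketched as a dynamic program. This is our rendering, not the authors' exact enumeration; the lexicon holds only the worked example's weights.

```python
import math

# Toy lexicon: costs C = -log(f/N) precomputed, figures taken from the
# worked example in section 4.1.7.
LEXICON = {
    ("tốc", "độ"): 8.68,
    ("truyền",): 12.31,
    ("truyền", "thông"): 12.31,
    ("thông", "tin"): 7.24,
    ("tin",): 7.33,
    ("sẽ",): 6.09,
    ("tăng",): 7.43,
    ("cao",): 6.95,
}
MAX_LEN = 4  # the longest Vietnamese word has fewer than 5 syllables

def best_segmentation(syllables):
    """best[i] = (cost, words) over the first i syllables. Spans absent
    from the lexicon are pruned, curbing the 2^(n-1) blow-up."""
    n = len(syllables)
    best = [(math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(n):
        if best[i][0] == math.inf:
            continue                      # no valid path reaches i
        for j in range(i + 1, min(i + MAX_LEN, n) + 1):
            word = tuple(syllables[i:j])
            if word in LEXICON:
                cost = best[i][0] + LEXICON[word]
                if cost < best[j][0]:
                    best[j] = (cost, best[i][1] + [" ".join(word)])
    return best[n]

cost, words = best_segmentation(
    ["tốc", "độ", "truyền", "thông", "tin", "sẽ", "tăng", "cao"])
print(" # ".join(words), round(cost, 2))
# → tốc độ # truyền # thông tin # sẽ # tăng # cao 48.7
```

The dynamic program reproduces BestPath (2) with total weight 48.70, preferring "truyền # thông tin" over "truyền thông # tin".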
The input layer of the neural network is fully linked to 10 hidden nodes with a propagation function. These hidden nodes are further fully linked to an output layer. The output node is a real value ranging from 0 to 1, representing the appropriateness of a sequence of POS standing next to each other in a sliding window. When this window slides from the beginning to the end of a sentence, we cumulatively add all the results to be attributed as the weight of the sentence. The sigmoid function

f(h_i) = 1 / (1 + e^(-h_i / T)),

which is quite popular in neural networks, is chosen as the propagation function. The selected sentence is the one with the maximum weight.

4.2.2 The parameters in the Neural network

• The number of hidden nodes:
To determine the optimal number of hidden-layer nodes for learning the sequence of POS in a sentence, so that word segmentations of good syntactic structure can be verified, we tried various sizes of the hidden layer and obtained the statistical results in the following tables.

Table 1: The result for sentences inside the training corpus:

#hidden nodes   wrong #1   correct   wrong #2
 1              2.13       1.64      0.78
 2              0.92       2.04      0.66
 3              0.79       2.21      1.04
 4              1.38       2.41      1.5
 5              1.88       2.07      0.87
 6              1.02       1.63      1.02
 7              1.74       1.89      0.77
 8              2.16       2.96      1.43
 9              1.68       2.41      1.14
10              0.4        3.21      0.61
11              1.66       2.25      0.89
12              1.63       1.62      0.74
13              2.19       2.96      1.48
14              1.49       1.45      0.48
15              2          1.88      0.84

• For sentences inside the training corpus: with 10 hidden nodes, the difference between the correct and wrong sentences is maximal; notably, with 1 hidden node or more than 12, the result is the opposite. In this case, we select the number of training repetitions to be 1000 (Table 1).
• For sentences outside the corpus: likewise, with 10 hidden nodes the difference between the wrong and correct sentences is maximal. In this case, we also select the repetitions to be 1000.

Table 2: The result for sentences outside the training corpus:

#hidden nodes   correct (1)   wrong (2)   (1) - (2)
 1              3.65          1.78        1.87
 2              2.91          1.09        1.82
 3              2.91          1.38        1.53
 4              3.45          1.98        1.47
 5              4.16          1.98        2.18
 6              2.01          1.18        0.83
 7              4.16          1.86        2.3
 8              5.37          2.27        3.1
 9              4.42          1.74        2.68
10              3.98          0.64        3.34
11              4.06          1.47        2.59
12              3.62          1.55        2.07
13              5.36          2.3         3.06
14              3.08          1.01        2.07
15              4.91          1.92        2.99

5 Results

5.1 Experiment

Applying the above model to an unrestricted corpus, we achieved the following results (Table 3).

Table 3: The result of evaluation:

Style            Training sentences   Correct   Incorrect   Ratio (%)
Techno-Science   550                  541       9           98.36
Novels           150                  142       8           94.67

(Note: a correct sentence must have all of its segmentation correct.)
Notably, this model can correctly segment the words in the sentence "Học sinh học sinh học." (Pupils learn biology) as well as its derived sentences. In our experimental results, the neural net improved the original WFST model by approximately 5-10%, depending on the style of text.
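The sliding-window scoring of sections 4.2.1-4.2.2 can be sketched as follows. The trained network is stood in for by a hypothetical window-score table (a real model computes these scores from the 16-bit POS descriptors through the 10-node hidden layer); only the sigmoid squashing and the cumulative windowed sum follow the description above.

```python
import math

T = 1.0  # sigmoid temperature, as in the propagation function above

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h / T))

# Hypothetical stand-in for the trained network: a raw score per window
# of k = 3 POS tags. These values are illustrative only.
WINDOW_SCORES = {("N", "V", "N"): 2.0, ("N", "N", "V"): 0.5,
                 ("V", "N", "N"): -1.0}

def sentence_weight(pos_seq, k=3):
    """Slide a k-word window over the POS sequence and accumulate the
    sigmoid-squashed score of each window, as described in 4.2.1."""
    total = 0.0
    for i in range(len(pos_seq) - k + 1):
        window = tuple(pos_seq[i:i + k])
        total += sigmoid(WINDOW_SCORES.get(window, 0.0))
    return total

# The three candidate POS sequences for "Học sinh học sinh học":
candidates = {"NVN": ["N", "V", "N"], "NNV": ["N", "N", "V"],
              "VNN": ["V", "N", "N"]}
best = max(candidates, key=lambda c: sentence_weight(candidates[c]))
print(best)
# → NVN
```

With these illustrative scores, the maximum-weight rule picks NVN, i.e. segmentation 1 of the worked example.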
5.2 Reasons of mistakes committed

Most of the mistakes committed in applying this model are due to non-exhaustive dictionaries and to global ambiguities; here are some remedies. Upon experimenting with the program, we found that nearly 500 conventional words were still not available in the dictionary; therefore, more entries have been introduced into the dictionary to enhance the reliability of the program. One unavoidable fact is that we cannot fully resolve all global ambiguities in a sentence. For example, input = "Ông già đi nhanh quá."
→ "Ông # già đi # nhanh quá." (The grandfather gets old so fast.)
→ "Ông già # đi # nhanh quá." (The old man walks so fast.)
Both are plausible; the correct one depends on context. Even human beings find it difficult to determine where the word boundary lies, and this matter cannot be resolved until the computer has read through and fully understood the whole paragraph. Additionally, this model sometimes makes unreasonable segmentations due to incorrect morphological analysis.

6 Conclusion

Even though there is still no complete corpus for Vietnamese (in other countries, complete corpora have been created for such research), we have been trying our best to collect a corpus sufficient for this work and for other works. A further difficulty is that since there are still no unanimous, standard norms for words, the results of word segmentation cannot yet satisfy everybody's requirements.
However, judging by the basic norms of words mentioned in section 2.2, the result is quite satisfactory. As for the first requirement, we have made use of a reliable dictionary together with data mining of the corpus to further recognize the concerned words; our model has thus met the first requirement. As for the third requirement, we are also able to determine some reduplicatives through the reduplicative model. The model of compound words has been verified through the probabilistic WFST model; incidentally, this model is helpful in determining the concerned words through the sentence structure and the sequence of words. Therefore, through the application of linguistic knowledge as well as probability (the neural network), we have improved the WFST method to become simpler, more easily understood, and better suited to the concrete background of Vietnamese, obtaining a satisfactory result.
Additionally, we have tried combining the WFST model with a trigram model, but its result turns out not to be as good as the combination of the WFST and neural network models. Finally, this model can be further improved (through proper adjustment of thresholds) to be applied more extensively in various fields. We expect that this model and our program can serve as the first stage of a sound foundation to effectively assist such future Vietnamese processing programs as POS taggers, machine translation, etc.

References

Asanee Kawtraku. 1995. A Lexibase Model for Writing Production Assistant System.
Chih-Hao Tsai. 1996. MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm. www.casper.beckman.uiuc.edu/~c-tsai4/chinese/wordseg/mmseg.html.
Đinh Điền. 2000. Từ tiếng Việt (Vietnamese Words). VNU-HCMC.
Đinh Lê Thư. 1999. Cơ cấu ngữ âm tiếng Việt (Structure of Vietnamese Phonetics). VNU-HCMC.
Hoàng Phê. 2000. Từ điển tiếng Việt (Vietnamese Dictionary). Center of Lexicography, Institute of Linguistics. Đà Nẵng.
Julia Hockenmaier, Chris Brew. 1996. Error-Driven Learning for Chinese Word Segmentation.
Richard Sproat et al. 1996. A Stochastic Finite-State Word-Segmentation Algorithm for Chinese. Computational Linguistics, Vol. 22, No. 3.
Sampan Rarunrom. 1991. Dictionary-based Thai Syllable Separation.
Surapant Meknavin. 1995. Towards 99.99% Accuracy of Thai Word Segmentation.
Xianping Ge, Wanda Pratt, Padhraic Smyth. 1996. Discovering Chinese Words from Unsegmented Text.
Yuen Poowarawan. 1986. Dictionary-based Thai Syllable Separation.
