
Int. J. Computational Science and Engineering, Vol. 15, Nos. 1/2, 2017

The analysis and recognition of Chinese temporal expressions based on a mixtured model using statistics and rules

Zhao Dandan*
School of Computer Science and Technology,
Dalian University of Technology,
No. 2 Linggong Road,
Ganjingzi District, Dalian, China
and
School of Computer Science and Engineering,
Dalian Nationalities University,
No. 18, Liaohe West Road,
Jinzhou New District, Dalian, China
Email: [email protected]
*Corresponding author

Huang Degen
School of Computer Science and Technology,
Dalian University of Technology,
No. 2 Linggong Road,
Ganjingzi District, Dalian, China
Email: [email protected]

Wang Yuzhe
College of Electromechanical and Information Engineering,
Dalian Nationalities University,
No. 18, Liaohe West Road,
Jinzhou New District, Dalian, China
Email: [email protected]

Wu Qiong
School of Computer Science and Technology,
Dalian University of Technology,
No. 2 Linggong Road,
Ganjingzi District, Dalian, China
Email: [email protected]

Zhao Ge
School of Computer Science and Engineering,
Dalian Nationalities University,
No. 18, Liaohe West Road,
Jinzhou New District, Dalian, China
Email: [email protected]

Abstract: As the first step of temporal information understanding, temporal expression recognition directly affects any further use of temporal information. Compared with Western languages, Chinese temporal expressions have many distinctive characteristics in both word morphology and syntax. This paper analyses the classification and construction of Chinese temporal expressions and presents an approach for extracting them from Chinese texts. The model comprises a cascade of rule-based and machine-learning pattern recognition procedures. Conditional random fields (CRFs) are applied to recognise time units rather than whole time expressions, which avoids the boundary localisation problems of Chinese temporal expressions. Rules for temporal expression boundary localisation are formulated based on a time trigger thesaurus and a time affix word thesaurus. The F-measure of temporal expression identification was 95.93% on the TempEval-2010 Chinese corpus. The experimental results show the validity of the proposed approach.

Copyright © 2017 Inderscience Enterprises Ltd.

Keywords: temporal expressions; TEs; conditional random fields; CRFs; time units; TUs; rules;
time trigger; time affix word; Chinese information processing; thesaurus; China.

Reference to this paper should be made as follows: Dandan, Z., Degen, H., Yuzhe, W.,
Qiong, W. and Ge, Z. (2017) ‘The analysis and recognition of Chinese temporal expressions
based on a mixtured model using statistics and rules’, Int. J. Computational Science and
Engineering, Vol. 15, Nos. 1/2, pp.153–161.

Biographical notes: Zhao Dandan is a Lecturer at Dalian Nationalities University. She received her MS in Computer Application from Liaoning University of Petroleum and Chemical Technology in 2003. She is currently a PhD candidate at Dalian University of Technology and her research interests include machine learning, software engineering and natural language processing.

Huang Degen is a Professor at the Dalian University of Technology. His main research interests include natural language processing, machine learning and machine translation. He is currently working at the School of Computer Science and Technology, Dalian University of Technology. He is a senior member of CCF, CAAI, ACM and an Associate Editor of Int. J. Advanced Intelligence.

Wang Yuzhe received his Master's and Doctoral degrees from Harbin Institute of Technology in 2007 and 2013, respectively. He has published more than 20 papers and is currently a Lecturer at Dalian Nationalities University. His research interests include information fusion and extraction, intelligent systems as well as nonlinear control.

Wu Qiong received her Master's degree from Dalian University of Technology. Her research interests include natural language processing and machine translation.

Zhao Ge is a Lecturer at Dalian Nationalities University. He received his Doctoral degree in Pattern Recognition and Intelligent System from the University of Chinese Academy of Sciences in 2014. His research interests include pattern recognition and computer vision.

1 Introduction

Temporal expression (TE) recognition is a foundation of natural language understanding, and it plays an important role in follow-up tasks such as anaphora resolution, relation extraction and event extraction (Wei et al., 2014). It has a very wide range of applications in natural language processing. For example, in automatic QA systems (Caviglione, 2013), TE recognition can help answer ‘when…’ and ‘how long…’ questions; in machine translation systems (Di Modica and Tomarchio, 2015), it can be used to select the proper tense; in multi-document summarisation (Amato et al., 2015), it can be used to order time information; even in electronic commerce systems (Abhishek et al., 2012), it can help improve the performance of e-commerce product selection. Recently, research on TEs has gained more and more attention from domestic and foreign experts. However, owing to the uniqueness of the Chinese language (Bittar, 2009; Zhao et al., 2013) and the writing habits of Chinese people, a single concept of time can be expressed in several or even a dozen ways, which makes the extraction of TEs from Chinese text relatively difficult.

2 Related works

The extraction of temporal information from free text has been a hot topic since 1995, when the recognition of TEs was first proposed as a special and important entity recognition task at the Message Understanding Conference (MUC). In 2004, the Automatic Content Extraction (ACE) program launched the ‘temporal expressions recognition and normalisation (TERN)’ contest, which greatly promoted the study of TE recognition and normalisation. The TempEval evaluation was introduced in 2007 as a SemEval sub-task aimed at evaluating temporal relationships. The temporal relationship collection of TempEval-2010 is the same as that of TempEval-2007, except that the corpus languages were expanded from English alone to Chinese, English, Italian, French, Korean and Spanish (Hongye et al., 2011). Unfortunately, participants took part only in the English and Spanish evaluation tasks, and the subsequent TempEval-2013 provided corpora only in English and Spanish.

Work on TE recognition started relatively late in China. Mingli et al. (2005) from Hong Kong Polytechnic University participated in the ACE-2004 competition and built the CTEMP system with an F-measure of 85.6%, which was the best result for Chinese TE extraction for some time. Ruifang et al. (2007) identified TEs based on

dependency analysis and error-driven learning; Lin et al. (2008) proposed an improved method to confirm the boundary of a TE based on regular expressions; Tong (2010) divided time information into a series of ‘time-based units’, used heuristic rules to extract TEs and pruned the rules with an error-driven method; Zhu et al. (2011) classified Chinese TEs into date and event and recognised them using CRFs; Li et al. (2011) added a semantic role feature to extract TEs; Li et al. (2012) added part-of-speech information to the recognition rules to get TEs.

All the methods above are based either on machine learning or on rules. Machine learning methods often use a statistical model such as conditional random fields (CRFs). The advantage of CRFs is that they can make full use of context information and find the globally optimal solution, yet their recognition results depend on the training corpus and they suffer from data sparsity. In comparison, rule-based methods are simple and can achieve high accuracy (Maqur and Dale, 2007), but it is hard to obtain all the rules, more manual work is needed, and they fail to yield satisfactory performance when adapted to other fields.

After weighing the advantages and disadvantages of the statistics-based method and the rule-based method, this paper proposes a method that combines the merits of the two. Because TUs rather than TEs are recognised by CRFs, the granularity of identification is reduced and the accuracy is improved. Uncommon expressions and boundary problems are solved by rules, which is relatively convenient and effective. First, CRFs are applied to recognise time units (TUs). Second, uncommon words that do not appear in the training corpus are recognised with the help of the time trigger thesaurus. Then the boundaries of TEs are determined by rules according to the time affix word thesaurus. Finally, the best F-measure of 97.32% is achieved on the testing corpus.

3 Definitions

3.1 Temporal expressions

Time Markup Language (TimeML, 2009) is a set of rules for encoding electronic documents. Its goal is to create a standard markup language that effectively addresses the annotation of events, times and temporal relationships in English text. TIMEX is the set of markers used to tag time expressions in TimeML, and the TIMEX version used in TempEval-2010 is TIMEX3. The earlier TE labelling schemes include TIMEX and TIMEX2. TIMEX3 evolved from these earlier schemes, and there are significant changes between TIMEX3 and the earlier versions (Kolomiyets and Moens, 2010). For example, in TIMEX2 the TEs are divided into the following categories: absolute time, relative time, period of time, set of time, time of the event trigger, cultural time and not a specific time. In TIMEX3, the time types are reduced to four categories: dates, times, durations, and sets of dates and times. This tagging scheme is more reasonable and effectively avoids the problems caused by event-triggered time information in TIMEX2. Culture time and not a specific time can be classified into the date or time category by their time noun or noun phrase. The plain idea of TE classification is shown in Figure 1.

Figure 1 Classification of temporal expressions

[Figure 1: tree rooted at ‘Temporal expressions’; first level: Time point, Duration, Set; second level: Date, Time, Culture time, Not a specific time]

Though a TE can be as short as one word, such as ‘today’ or ‘three years’, complex TEs carry much more information, such as ‘three o’clock this afternoon’ or ‘about three years ago’. Each type has a simple and a complex form, and the latter has normalised and un-normalised formats. The framework of TEs is shown in Figure 2.

Figure 2 Framework of temporal expressions

[Figure 2: tree rooted at ‘Temporal expressions (TEs)’ with branches Simple TEs and Complex TEs; further nodes: Time noun or adverb; Normalised TEs (year, month, day; hours, minutes, seconds); Un-normalised TEs (time prefix + simple TEs or normalised TEs + time postfix)]

Normalised TEs are the standard form of date and time. They can be year, month, day, hours, minutes, seconds, or their combinations. For un-normalised TEs, a time prefix or postfix appears before or after the simple or normalised TE.

3.2 Time units

From the analysis of the components of TEs, TUs should be defined. In this paper, a TU is the smallest unit of a TE. For example, ‘8:20 pm, 26th July, 2015’ includes five TUs: year ‘2015’, month ‘7’, day ‘26’, ‘afternoon’ and ‘8:20’.
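As a minimal sketch of this decomposition (not the paper's CRF-based recogniser), the snippet below splits one normalised Chinese TE into TUs with a regular expression. The Chinese surface form `2015年7月26日下午8:20`, the pattern and the function name `split_time_units` are illustrative assumptions; the source gives only the English gloss of the example.

```python
import re

# Hypothetical pattern: one alternative per kind of time unit (TU).
TU_PATTERN = re.compile(
    r"(?P<year>\d{4})年"
    r"|(?P<month>\d{1,2})月"
    r"|(?P<day>\d{1,2})日"
    r"|(?P<period>上午|下午|晚上)"
    r"|(?P<clock>\d{1,2}:\d{2})"
)


def split_time_units(expression):
    """Return (kind, value) pairs for every TU found, in text order."""
    units = []
    for match in TU_PATTERN.finditer(expression):
        kind = match.lastgroup  # name of the alternative that matched
        units.append((kind, match.group(kind)))
    return units


# Assumed Chinese rendering of '8:20 pm, 26th July, 2015'.
units = split_time_units("2015年7月26日下午8:20")
print(units)
```

A pattern like this only covers the normalised form; the un-normalised forms with time prefixes and postfixes are what the thesaurus-based rules of the paper are meant to handle.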

3.3 Time trigger

Time triggers are keywords used to judge whether a phrase is a TE. They are usually time concepts such as year, month and day.

Time triggers are divided into independent triggers and number triggers according to how they are identified. An independent trigger is a TE by itself, such as today, morning and now. Number triggers are divided into number prefix triggers and number suffix triggers. The former are triggers preceded by a number, such as ‘century’ in 21st century. The latter are triggers followed by a number, such as ‘week’ in week 3.

4 Recognition of TEs

In this paper, Chinese TEs are recognised by combining statistics and rules. The identification steps are shown in Figure 3.

Figure 3 Recognition model of Chinese temporal expressions

[Figure 3: flowchart with two stages. Stage ‘Based on CRFs’ contains the boxes: training corpus, testing corpus, word segmentation and part-of-speech tagging (for both corpora), feature extraction, pre-format, train CRFs model, CRFs model, CRFs recognition, time units. Stage ‘Based on rules’ contains the boxes: delete obvious errors, get candidate triggers, evaluation, trigger thesaurus, supplement tagging, affix thesaurus, confirm boundary and combine time units, time expression.]

The TUs are first recognised in the text by CRFs, and then TEs are obtained through rules. CRFs are applied to recognise time units rather than time expressions in order to reduce the granularity and difficulty of identification. The rules are based on the time trigger thesaurus and the time affix word thesaurus, which are initialised manually at the beginning. As the test corpus grows, more candidate triggers are obtained from it, and the candidates are filtered down to the correct time triggers using an evaluation function. The specific steps are as follows:

Step 1 Recognition of TUs based on CRFs

Step 1.1 Corpus preprocessing. Word segmentation, part-of-speech (POS) tagging and pre-formatting of the corpus are done in this step. Some word segmentation errors are corrected, and the text is formatted to meet the requirements of CRFs.

Step 1.2 CRFs training. TUs are tagged in the training corpus. A suitable template is chosen to train a model, which is used in the next step.

Step 1.3 CRFs testing. TUs are not tagged in the testing corpus. The model derived from Step 1.2 is applied to the input corpus, and the TUs are tagged in the output corpus.

Step 2 Recognition of TEs based on rules

Step 2.1 Removal of obviously wrong TU annotations. The TUs produced by the CRFs annotation are taken as input to the rule-based system, and the obviously wrong ones are removed.

Step 2.2 Expansion of the trigger thesaurus. An automatic acquisition program starts from the TUs as a candidate trigger list. An evaluation function is used to remove impurities and sift out the correct trigger words. The trigger words are placed in the independent trigger thesaurus or the digital (number) trigger thesaurus according to the trigger's type.

Step 2.3 Obtainment of TEs. The new trigger thesaurus and the affix thesaurus are used to label additional TUs, confirm the boundaries and combine TUs, which finally yields the TEs.

The two steps are discussed in detail in Sections 4.1 and 4.2.

4.1 Recognition of TUs based on CRFs

4.1.1 Basic principle of CRFs

TE recognition can be converted into a sequence labelling problem. The CRFs model is a kind of conditional probability model (Degen and Yu, 2009). It can calculate

the joint probability distribution of the entire label sequence given an observation sequence.

In the TE recognition system, the problem can be seen as a linear-chain CRF. X = (X1, X2, …, Xn) and Y = (Y1, Y2, …, Yn) are linear-chain sequences of random variables that satisfy equation (1):

$$P(Y_i \mid X, Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n) = P(Y_i \mid X, Y_{i-1}, Y_{i+1}) \qquad (1)$$

In the above equation, i = 1, 2, …, n (when i = 1 or i = n, only one neighbour is taken into consideration). P(Y|X) is a linear-chain CRF. In TE recognition, X is the input observation sequence and Y is the output label sequence or state sequence.

The conditional probability can be obtained from the joint probability distribution. P(Y1, …, Yn, X) can be obtained from equation (2):

$$P(Y_1, \ldots, Y_n, X) = \frac{1}{Z(X)} \exp\left\{ \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(Y_{i-1}, Y_i, X, i) \right\} \qquad (2)$$

In equation (2), Z(X) is the normalisation factor, which is defined as:

$$Z(X) = \sum_{Y} \exp\left\{ \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(Y_{i-1}, Y_i, X, i) \right\} \qquad (3)$$

In equation (3), fj is a characteristic (feature) function: the j-th local function defined on the i-th maximal clique. λj is the weight of fj and is an important measure of the feature. n is the length of the sequence, and m is the number of characteristic functions. The characteristic function fj is defined as:

$$f_j(Y_{i-1}, Y_i, X, i) = \begin{cases} 1 & \text{if } Y_{i-1},\ Y_i \text{ and } X \text{ take the specified values} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

The specified information in X comes from the feature template, which encodes partial linguistic knowledge of the sample, such as words, part of speech and word semantic information. All the feature functions can be derived from the feature template.

4.1.2 The selection of feature template

According to the statistics of Li et al. (2012), 49% of TEs are made up of only one TU, 26% of two TUs, 21% of three TUs, 2.3% of four TUs, and 1.7% of more than five TUs. The feature templates chosen for testing are shown in Table 1.

Table 1 Feature template

Template 1        Template 2
#Unigram          #Unigram
U00:%x[-1,0]      U00:%x[-1,0]
U01:%x[0,0]       U01:%x[0,0]
U02:%x[1,0]       U02:%x[1,0]
U03:%x[-2,0]      U03:%x[-1,1]
U04:%x[2,0]       U04:%x[0,1]
U05:%x[-1,1]      U05:%x[1,1]
U06:%x[0,1]
U07:%x[1,1]
U08:%x[-2,1]
U09:%x[2,1]

Two experiments were run on the 2011 international news corpus, which includes 1,026 TUs. The results show that Template 2 performs better than Template 1. Therefore, only the current word and its part of speech (POS), together with the words immediately before and after the current position and their POS, are used as features when training the CRFs model.

4.1.3 Selection of TUs as annotation target of CRFs

TEs are constructed from TUs. The arrangement of TUs within TEs is relatively loose and varied, but the format of a TU itself is relatively stable. To reduce the identification granularity and avoid having to recognise time affix words, TUs were chosen as the annotation target of CRFs.

Experiments were carried out to compare CRF-based recognition of TUs and of TEs. The results are shown in Figure 4.

Figure 4 Recognition results of TUs and TEs based on CRFs (see online version for colours)

Owing to the diversity and flexibility of time affix words, it is harder for CRFs to recognise the boundaries of TEs. The experiment showed that the accuracy of extracting TUs is higher than that of extracting TEs, so TUs are selected as the annotation target of CRFs recognition in this paper. Compared with TEs, TUs have a smaller granularity, which reduces the identification difficulty to a certain extent and improves the recognition results.
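To make the TU annotation target and the Template 2 features of Table 1 concrete, the sketch below labels one segmented phrase and expands the six unigram macros for a single position. The example tokens, the POS tags, the B/I/O label scheme and the `_PAD` boundary marker are assumptions for illustration; the paper does not state its label set, and CRF++ uses its own boundary markers.

```python
# Rows are (word, POS) pairs, i.e. column 0 and column 1 of a CRF++-style file.
# The segmentation and POS tags are illustrative only.
ROWS = [("大约", "d"), ("三", "m"), ("年", "q"), ("前", "f")]  # "about three years ago"

# Template 2 from Table 1: macro name -> (row offset, column index).
TEMPLATE_2 = {
    "U00": (-1, 0), "U01": (0, 0), "U02": (1, 0),
    "U03": (-1, 1), "U04": (0, 1), "U05": (1, 1),
}


def bio_labels(length, spans, tag):
    """Label positions with B-/I-tag inside spans (end exclusive), O elsewhere."""
    labels = ["O"] * length
    for start, end in spans:
        labels[start] = "B-" + tag
        for k in range(start + 1, end):
            labels[k] = "I-" + tag
    return labels


def expand_template(rows, i, template=TEMPLATE_2, pad="_PAD"):
    """Expand every %x[row, col] macro at position i; out-of-range rows give pad."""
    features = []
    for name, (offset, col) in template.items():
        r = i + offset
        value = rows[r][col] if 0 <= r < len(rows) else pad
        features.append(name + ":" + value)
    return features


tu_labels = bio_labels(len(ROWS), [(1, 3)], "TU")  # TU target: only the core "三 年"
te_labels = bio_labels(len(ROWS), [(0, 4)], "TE")  # TE target also covers the affix words

for (word, pos), label in zip(ROWS, tu_labels):
    print(word, pos, label, sep="\t")
print(expand_template(ROWS, 1))
```

With the TU target only the stable core is labelled, so the affix words are left to the rule-based stage described in Section 4.2.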
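Equations (2) and (3) of Section 4.1.1 can be checked by brute force on a toy sequence. In the sketch below the label set, the indicator features and the weights λj are invented for illustration, and Z(X) is computed by enumerating every label sequence, which is only feasible for very short inputs; practical toolkits use dynamic programming instead.

```python
import itertools
import math

LABELS = ("O", "TU")  # assumed two-label tag set

# Hypothetical weights lambda_j, keyed by the indicator feature f_j they weight.
WEIGHTS = {
    ("trans", "TU", "TU"): 0.8,  # f_j = 1 if Y_{i-1} = TU and Y_i = TU
    ("emit", "三", "TU"): 1.0,   # f_j = 1 if X_i = 三 and Y_i = TU
    ("emit", "年", "TU"): 1.5,
    ("emit", "前", "O"): 1.5,
}


def sequence_score(labels, words, weights=WEIGHTS):
    """Sum over positions i and features j of lambda_j * f_j(Y_{i-1}, Y_i, X, i)."""
    total = 0.0
    previous = "START"
    for word, label in zip(words, labels):
        total += weights.get(("trans", previous, label), 0.0)
        total += weights.get(("emit", word, label), 0.0)
        previous = label
    return total


def partition(words, weights=WEIGHTS):
    """Z(X): sum of exp(score) over all label sequences, as in equation (3)."""
    return sum(
        math.exp(sequence_score(y, words, weights))
        for y in itertools.product(LABELS, repeat=len(words))
    )


def probability(labels, words, weights=WEIGHTS):
    """exp(score) / Z(X), as in equation (2)."""
    return math.exp(sequence_score(labels, words, weights)) / partition(words, weights)


words = ["三", "年", "前"]
candidates = list(itertools.product(LABELS, repeat=len(words)))
best = max(candidates, key=lambda y: probability(y, words))
print(best, probability(best, words))
```

Because Z(X) sums over every candidate labelling, the probabilities of all candidates add up to one, which is the property the normalisation factor is there to guarantee.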

4.2 Recognition of TEs based on rules

The CRFs model converts a complex recognition problem into a simple sequence labelling problem, in which features can be chosen and context information used freely. Yet CRFs have their limitations and are not sufficient in our experiments for the following reasons:

• they are not good at recognising pure digital TEs
• they are not good at recognising TUs that do not appear in the training corpus or that appear only a few times
• they are not good at recognising the boundaries of TEs.

So, the following rules are proposed to remedy the weaknesses mentioned above.

4.2.1 Pure digital TEs identification rules

A pure digital TE is a string of digits, usually representing a year. It is confined to the range from 1753 to 9999, and the following rules are defined:

• ‘of’ + digital + not quantifier
• digital + location
• prepositions + digital + not quantifier.

4.2.2 Supplementary tagging rules

For TUs that are scarce or absent in the training data, the approach taken in this paper is to construct a time trigger thesaurus for supplementary annotation.

The process for identifying new TUs is as follows.

Step 1 Read a word from the text.

Step 2 Look up the independent trigger thesaurus and check whether the word is an independent trigger word. If so, mark it as a time unit.

Step 3 If the word is not in the independent trigger thesaurus, check whether it belongs to the digital affix trigger thesaurus. If it does, check the neighbouring words before and after it. If there is a digit, then it is a time unit. If the preceding word is a quantifier, look one word further back; if that word is a number or a vague word, then it is a time unit. Otherwise, it is not a time unit.

The initial time trigger thesaurus was created manually. The recognition results depend to a certain extent on the size of the time trigger thesaurus, and the set of possible triggers is too large for a manually built thesaurus to be complete. So, transformation rules were used to derive candidate trigger words automatically from the recognised TUs. The candidate trigger list contains many wrong candidates; if the candidates were put directly into the trigger thesaurus, the accuracy of the recognition results would drop greatly. Therefore, the evaluation function Score(Ti) is introduced to evaluate the candidate trigger words, and a threshold value is set as a filter.

The score of each candidate trigger word is calculated from formula (5):

$$\mathrm{Score}(T_i) = \frac{\mathrm{True}(T_i)}{\mathrm{True}(T_i) + \mathrm{False}(T_i)} \qquad (5)$$

In this equation, True(Ti) is the number of times candidate Ti is correct in the test corpus, and False(Ti) is the number of times it is wrong. A threshold value λ is set, and candidates whose score is less than λ are removed from the list.

4.2.3 Obtainment of TEs

In this paper, we use the affix thesaurus to help determine the boundary. Specifically, we check whether the word preceding the current TU is in the prefix thesaurus to determine the upper boundary, and check whether the word following the current TU is in the suffix thesaurus to determine the lower boundary.

Because some affix words are ambiguous, limiting conditions are set for them. If the condition for the word is satisfied, the boundary of the time element is determined; otherwise, the word does not belong to the scope of the time expression.

The TUs are combined according to their spatial proximity. Two neighbouring TUs are merged into a temporal expression. For adjacent TUs, we first check whether there is a conjunction in between; if so, we unite the conjunction with the two TUs to form a time expression.

The purpose of the rule-based processing is to make up for the shortfalls of CRF-based TU recognition, merge TUs and obtain the integrated TEs. Obvious errors are deleted, new TUs missed by the CRFs are labelled, TUs are combined and the boundaries are confirmed, which finally yields the TEs. Complementing the time triggers effectively extends the scale of the trigger thesaurus and improves the recognition results. The boundary confirmation problem is converted into checking whether the preceding and following words are time affix words, which simplifies boundary location. Constraints are set on ambiguous time affix words to reduce the error probability of boundary location.

5 Experimental results and analysis

5.1 Experimental setup and corpus preparation

The CRF model was obtained with the CRF++ 0.54 toolkit. Word segmentation used the ‘Nihao’ segmentation system of Dalian University of Technology. The rules program was written in C++ in Visual Studio 2010.

The training data of the CRFs used in this paper are news taken from The People’s Daily in the year 2000, which contain 240,000 words. The testing corpus is 2011 international news, which contains 180,000 words and 1,954 TEs. After the basic testing, a sampling test was carried out. The sampling corpus was selected from 2012 international news. The contrast experiment was carried out on the

Chinese corpus of TempEval-2010. In order to test the validity of our approach in other domains, new experiments were carried out on the Chinese Emergency Corpus (CEC). All of the raw corpora were tagged by our former program, which extracted Chinese TEs based on LEX.

5.2 Experimental results

5.2.1 Evaluation of the proposed method

In order to evaluate the method, two comparison systems were built: CRFs-R is the system that recognises TEs using only CRFs, and Rules-R is the system that recognises TEs using only rules. C&R-R is the method combining statistics and rules described in this paper.

The test results on the same experimental dataset of 2011 international news are shown in Table 2.

Table 2 Results of different TEs recognition methods

Methods    P         R         F-measure
CRFs-R     91.78%    85.67%    88.62%
Rules-R    93.24%    92.53%    92.88%
C&R-R      97.07%    97.57%    97.32%

From the experimental results shown in Table 2, it can be seen that the combination of CRFs and rules gives better recognition results than either the CRFs method or the rules method alone.

On the basis of the experiments above, a sampling test on a large-scale corpus was carried out. This time, the testing data were selected from 2012 international news, which contained 6,570,000 words. The sampling method was to select a sample of 10,000 words after every 60,000 words, deriving ten samples from the corpus in total. The whole sample size is 100,000 words. Detailed results for the ten samples are shown in Table 3.

Table 3 Results of sampling test

Sample no.    Error identified    Not identified    P         R         F
1             0                   0                 100%      100%      100%
2             1                   1                 99.20%    99.20%    99.20%
3             3                   3                 97.14%    95.33%    96.23%
4             4                   1                 95.88%    97.89%    96.87%
5             3                   0                 97.90%    98.59%    98.24%
6             11                  0                 89.42%    93.94%    91.62%
7             1                   1                 99.06%    99.06%    99.06%
8             3                   0                 96.10%    98.67%    97.37%
9             2                   4                 98.46%    95.52%    96.97%
10            3                   0                 96.15%    94.94%    95.54%
Total         31                  10                97.05%    97.51%    97.28%

The identification results were almost the same as in the basic test. This proved the validity of the method to some extent. But the corpus was all selected from international news, whose language style is consistent. So, contrast and transfer tests were carried out.

5.2.2 Comparisons with another method

We downloaded the Chinese corpus of TempEval-2010 from the internet and used our model to tag its 44 files. The F1-measure is 95.93%, which is better than the former work of Li Junchan on the same corpus. The detailed results are shown in Table 4.

Table 4 Comparison results on TempEval-2010

Methods    File numbers    TEs numbers    P (%)    R (%)    F1 (%)
POS-R      8               131            85.16    83.21    84.17
C&R-R      44              763            96.72    95.16    95.93

POS-R is the recognition method based on part-of-speech tagging of time units. C&R-R is the method combining statistics and rules used in this paper.

5.2.3 Experiments on CEC

CEC was built by the Data Semantic Laboratory at Shanghai University. This corpus is divided into five categories: earthquake, fire, traffic accident, terrorist attack and food poisoning. There are 332 texts in CEC in total, which were collected from the internet and processed in several steps. We first downloaded them from the internet and deleted all the XML tagging from the files, then preprocessed them with LEX and Nihao, and formatted them as input for CRF++. There were 1,378 TEs in the whole corpus. The corpus information is shown in Table 5.

Table 5 Corpus scale of CEC

Type                File number    TEs number    Words number
Fire                75             236           14,026
Food poisoning      61             261           12,385
Terrorist attack    49             258           10,272
Earthquake          62             279           11,987
Traffic accident    85             344           17,601

Using the model trained on the corpus from The People’s Daily in 2000, the different categories were tested; the results are shown in Table 6.

Table 6 Results of news model on CEC

Type                P        R        F
Fire                89.69    98.78    94.02
Food poisoning      98.18    87.16    92.34
Terrorist attack    98.42    86.04    91.81
Earthquake          94.94    81.91    89.97
Traffic accident    88.23    99.39    93.48
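The sketch below recomputes two reported rows under the usual definition of the F-measure as the harmonic mean of precision and recall, F = 2PR/(P + R). The paper does not state its formula, so that definition is an assumption here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the tables in this section.
reported = {
    "Table 2, C&R-R": (97.07, 97.57, 97.32),
    "Table 6, Fire": (89.69, 98.78, 94.02),
}

for name, (p, r, f) in reported.items():
    print(name, "recomputed:", round(f_measure(p, r), 2), "reported:", f)
```

The same function can be applied to any other row to see how a gap between precision and recall pulls the F-measure towards the lower of the two.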

These results were lower than in the former tests. The F-measure of the earthquake category was lower than 90%, and even the best result, the F-measure of the fire category, was lower than 95%. The main reason was the different kind of corpus. In order to bring the characteristics of the test data into the training corpus, a new round of testing was carried out as follows: one category was used as the test set, and the other four categories were added to the former training corpus. The new model is called the self-trained model. The testing results obtained in this way are shown in Table 7.

Table 7 Results of self-trained model on CEC

Type                P        R        F
Fire                98.75    96.17    97.44
Food poisoning      96.98    90.90    93.84
Terrorist attack    97.83    90.57    94.06
Earthquake          98.83    88.37    93.30
Traffic accident    99.57    94.22    96.82

The results were acceptable, and the best F-measure, that of the fire category, was better than the basic test and the average of the sampling test.

5.2.4 Experimental analysis

Analysis of the identification errors in the experiments shows that the errors are generally boundary errors, and that the main cause of boundary errors is time affix words. The method of determining the time boundary in this paper is relatively simple: it decides whether a word is the boundary of a time expression by simply looking at the preceding or following word, without taking the structure of the sentence into account. The same word may have completely different semantics in different sentence structures and contexts. In addition, the time affix word thesaurus could not be constructed automatically, so its scale was limited. Therefore, this method can produce boundary errors in some cases, and it needs to be studied further.

6 Conclusions and future work

This paper analysed the classification and composition of temporal expressions in Chinese text, referred to the TIMEX3 standard, took into account the requirements of using Chinese temporal information, and proposed a novel method that fuses statistics and rules to identify Chinese temporal expressions. The presented method is mainly aimed at relatively clear periods of time and points of time that can be located on the timeline. Although it cannot fully cover all time expressions, it provides relatively accurate data for the subsequent use of time information, which makes practical application of the time information in Chinese text possible.

Building on the recognition of temporal expressions, the type identification and standardisation of temporal expressions will be the focus of further research.

Acknowledgements

The authors sincerely thank the anonymous referees for their valuable remarks and helpful suggestions, which significantly improved the paper.

This work was supported by the National Natural Science Foundation of China (Nos. 61173100, 61173101, 61272375) and by the Fundamental Research Funds for the Central Universities.

References

Abhishek, S., Krishna, R.P. and Maheshwari, S. (2012) ‘Mining special features to improve the performance of e-commerce product selection and resume processing’, International Journal of Computational Science and Engineering, Vol. 7, No. 1, pp.82–95.

Amato, F., Mazzeo, A., Mazzocca, N. and Romano, S. (2015) ‘Semantically driven documents composition in close cloud system’, International Journal of Computational Science and Engineering, Vol. 11, No. 1, pp.68–77.

Bittar, A. (2009) ‘Annotation of events and temporal expressions in French texts’, in Linguistic Annotation Workshop, The Association for Computational Linguistics, pp.48–51.

Caviglione, L. (2013) ‘Extending HTTP models to Web 2.0 applications: the case of online social networks’, International Journal of Computational Science and Engineering, Vol. 9, No. 3, pp.210–218.

Degen, H. and Yu, J. (2009) ‘A distributed strategy for CRFs based Chinese text chunking’, Journal of Chinese Information Processing, Vol. 23, No. 1, pp.16–22.

Di Modica, G. and Tomarchio, O. (2015) ‘A semantic framework to support resource discovery in future cloud markets’, International Journal of Computational Science and Engineering, Vol. 10, Nos. 1/2, pp.1–14.

Hongye, T., Jiaheng, Z. and Jiye, L. (2011) ‘The progress of temporal relation recognition research’, Journal of Chinese Information Processing, Vol. 25, No. 9, pp.44–52.

Kolomiyets, O. and Moens, M-F. (2010) ‘KUL: recognition and normalization of temporal expressions’, Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp.325–328.

Li, J., Tan, H. and Wang, F-e. (2012) ‘Recognition of temporal expressions and their types in Chinese’, Computer Science, Vol. 39, No. 11A, pp.191–211.

Li, L., Zhongshi, H., Xinlai, X. and Xiaoli, M. (2011) ‘Chinese time expression recognition based on semantic role’, Application Research of Computers, Vol. 28, No. 7, pp.2543–2545.

Lin, J., Cao, D. and Yuan, C. (2008) ‘Automatic TIMEX2 tagging of Chinese temporal information’, Journal of Tsinghua University (Science and Technology), Vol. 48, No. 1, pp.117–119.

Maqur, P. and Dale, R. (2007) ‘A rule based approach to temporal expression tagging’, Proceedings of the International Multiconference on Computer Science and Information Technology, pp.293–303.

Mingli, W., Wenjie, L., Qin, L. and Baoli, L. (2005) ‘CTEMP: a Chinese temporal parser for extracting and normalizing temporal information’, in IJC-NLP Proceedings of international Joint Conference on Natural Language Processing, Vol. 3651, pp.694–706.

Ruifang, H., Bing, Q., Ting, L., Yuequn, P. and Sheng, L. (2007) ‘Recognizing the extent of chinese time expressions based on the dependency parsing and error-driven learning’, Journal of Chinese Information Processing, Vol. 21, No. 5, pp.36–40.

TimeML (2009) ‘Guidelines for temporal expression annotation for English for TempEval 2010’, TimeML Working Group [online] http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf.

Tong, W. (2010) Research on Chinese Time Expression Recognition, MS thesis, Department of Computer Science and Technology, FuDan University, Shanghai, China.

Wei, L., Wenjie, X., Dong, W., Xujie, Z. and Zongtian, L. (2014) ‘An extending description logic for action formalism in event ontology’, International Journal of Computational Science and Engineering, Vol. 9, No. 3, pp.205–214.

Zhao, Z., Xu, J., Zhang, Y. and Liu, J. (2013) ‘Japanese time expression recognition by combining rules with statistics’, Journal of Chinese Information Processing, Vol. 27, No. 6, pp.192–200.

Zhu, S., Liu, Z., Fu, J. and Zhu, F. (2011) ‘Chinese temporal phrase recognition based on conditional random fields’, Computer Engineering, Vol. 37, No. 15, pp.164–167.
