Dandan 2017
Zhao Dandan*
School of Computer Science and Technology,
Dalian University of Technology,
No. 2 Linggong Road,
Ganjingzi District, Dalian, China
and
School of Computer Science and Engineering,
Dalian Nationalities University,
No. 18, Liaohe West Road,
Jinzhou New District, Dalian, China
Email: [email protected]
*Corresponding author
Huang Degen
School of Computer Science and Technology,
Dalian University of Technology,
No. 2 Linggong Road,
Ganjingzi District, Dalian, China
Email: [email protected]
Wang Yuzhe
College of Electromechanical and Information Engineering,
Dalian Nationalities University,
No. 18, Liaohe West Road,
Jinzhou New District, Dalian, China
Email: [email protected]
Wu Qiong
School of Computer Science and Technology,
Dalian University of Technology,
No. 2 Linggong Road,
Ganjingzi District, Dalian, China
Email: [email protected]
Zhao Ge
School of Computer Science and Engineering,
Dalian Nationalities University,
No. 18, Liaohe West Road,
Jinzhou New District, Dalian, China
Email: [email protected]
Abstract: As the first step of temporal information understanding, the recognition of temporal
expressions directly affects all further use of temporal information. Compared with Western
languages, Chinese temporal expressions have many distinctive characteristics in both word
morphology and syntax. This paper analyses the classification and construction of Chinese
temporal expressions and presents an approach for extracting them from Chinese texts. The model
comprises a cascade of rule-based and machine-learning pattern recognition procedures.
Conditional random fields (CRFs) are applied to recognise time units rather than whole time
expressions, which avoids the boundary localisation problems of Chinese temporal expressions.
Rules for boundary localisation are formulated on the basis of a time trigger thesaurus and a
time affix word thesaurus. The F-measure of temporal expression identification was 95.93% on the
TempEval-2010 Chinese corpus. The experimental results show the validity of the proposed
approach.
Keywords: temporal expressions; TEs; conditional random fields; CRFs; time units; TUs; rules;
time trigger; time affix word; Chinese information processing; thesaurus; China.
Reference to this paper should be made as follows: Dandan, Z., Degen, H., Yuzhe, W.,
Qiong, W. and Ge, Z. (2017) ‘The analysis and recognition of Chinese temporal expressions
based on a mixtured model using statistics and rules’, Int. J. Computational Science and
Engineering, Vol. 15, Nos. 1/2, pp.153–161.
Biographical notes: Zhao Dandan is a Lecturer at Dalian Nationalities University. She received
her MS in Computer Application from Liaoning University of Petroleum and Chemical
Technology in 2003. She is currently a PhD candidate at Dalian University of Technology and
her research interests include machine learning, software engineering and natural language
processing.
Huang Degen is a Professor at Dalian University of Technology. His main research interests
include natural language processing, machine learning and machine translation. He is currently
working at the School of Computer Science and Technology, Dalian University of Technology.
He is a senior member of CCF, CAAI, ACM and an Associate Editor of Int. J. Advanced
Intelligence.
Wang Yuzhe received his Master's and Doctoral degrees from Harbin Institute of Technology in
2007 and 2013, respectively. He has published more than 20 papers and is currently a Lecturer at
Dalian Nationalities University. His research interests include information fusion and extraction,
intelligent systems as well as nonlinear control.
Wu Qiong received her Master's degree from Dalian University of Technology. Her research
interests include natural language processing and machine translation.
dependency analysis and error-driven learning; Lin et al. (2008) proposed an improved method for
confirming the boundary of TEs based on regular expressions; Tong (2010) divided time
information into a series of ‘time-based units’, used heuristic rules to extract TEs and pruned
the rules with an error-driven method; Zhu et al. (2011) classified Chinese TEs into date and
event and recognised them using CRFs; Li et al. (2011) added a semantic role feature to extract
TEs; Li et al. (2012) added part-of-speech information to the recognition rules to get TEs.
All the methods of recognising TEs above were based either on machine learning or on rules.
Machine learning methods often use a statistical model, such as conditional random fields
(CRFs). The advantage of CRFs is that they make full use of context information and find the
globally optimal solution, yet their recognition result depends on the training corpus and
suffers from data sparsity. In comparison, the rule-based method is simple and can reach a high
correct rate (Maqur and Dale, 2007), but it is hard to collect all the rules, it needs more
manual work, and it fails to yield satisfactory performance when adapted to other fields.
After weighing the advantages and disadvantages of the statistics-based method and the
rule-based method, this paper proposes a method that combines the merits of the two. Because
TUs rather than TEs are recognised by CRFs, the granularity of identification is reduced and the
accuracy is improved. The uncommon expressions and the boundary problems are solved by rules,
which is relatively convenient and effective. First, CRFs are applied to recognise time units
(TUs). Second, uncommon words that do not appear in the training corpus are recognised with the
help of the time trigger thesaurus. Then the boundaries of TEs are determined by rules according
to the time affix word thesaurus. Finally, the best F-measure of 97.32% is achieved on the
testing corpus.

3 Definitions

3.1 Temporal expressions

Time Markup Language (TimeML, 2009) is a set of rules for encoding electronic documents. Its
goal is to provide a standard markup language that effectively addresses the annotation of
events, times and temporal relations in English text. TIMEX is the set of markers used to tag
time expressions in TimeML. The TIMEX version used in TempEval-2010 is TIMEX3. The early TE
labelling schemes include TIMEX and TIMEX2. TIMEX3 evolved from these early schemes, and there
were significant changes between TIMEX3 and the early versions (Kolomiyets and Moens, 2010).
For example, in the TIMEX2 scheme the TEs are divided into the following categories: absolute
time, relative time, period of time, set of time, time of the event trigger, cultural time and
not a specific time. In TIMEX3, the time types are reduced to four categories: dates, times,
durations, and sets of dates and times. This tagging scheme is more reasonable and avoids the
problems caused by event-triggered time information in TIMEX2. Culture time and not a specific
time can be classified into the date or time categories by their time noun or noun phrase. The
basic idea of TE classification is shown in Figure 1.

Figure 1 Classification of temporal expressions
[Tree diagram. Node labels: Temporal expressions; Time point, Duration, Set; Date, Time,
Culture time, Not a specific time]

Though a TE can be a single word or short phrase, such as today or three years, complex TEs
carry much more information, such as three o’clock of this afternoon or about three years ago.
Each type has a simple and a complex form, and the latter has a normalised and an un-normalised
format. The framework of TEs is shown in Figure 2.

Figure 2 Framework of temporal expressions
[Tree diagram. Node labels: Temporal expressions (TEs); Simple TEs, Complex TEs; Time noun or
adverb; Normalised TEs, Un-normalised TEs; Year month day, Hours minutes seconds; Time prefix,
Simple TEs or normalised TEs, Time postfix]

Normalised TEs are the standard mode of date and time. They can be year, month, day, hours,
minutes, seconds, or their combinations. In un-normalised TEs, a time prefix or postfix appears
before or after the simple or normalised TE.

3.2 Time units

The analysis of TE components shows that TUs need to be defined. In this paper, a TU is the
smallest unit of a TE. For example, ‘8:20 pm, 26th July, 2015’ includes five TUs: year ‘2015’,
month ‘7’, day ‘26’, ‘afternoon’, ‘8:20’.
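As a rough illustration of the TU definition, the sketch below splits a normalised Chinese TE into its TUs with regular expressions. The Chinese rendering of the example (‘2015年7月26日下午8:20’), the pattern list and the function name split_time_units are assumptions made for illustration only; in this paper TUs are recognised by CRFs, not by these patterns.

```python
import re

# Hypothetical TU patterns (illustration only, not the paper's method).
# Alternatives are tried in this order at each position of the text.
TU_PATTERNS = [
    ("year", r"\d{4}年"),
    ("month", r"\d{1,2}月"),
    ("day", r"\d{1,2}日"),
    ("period", r"上午|下午|晚上"),
    ("clock", r"\d{1,2}:\d{2}"),
]
TU_REGEX = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TU_PATTERNS))


def split_time_units(expression):
    """Return (type, text) pairs for every TU found in the expression."""
    return [(m.lastgroup, m.group()) for m in TU_REGEX.finditer(expression)]


# Assumed Chinese form of the paper's example '8:20 pm, 26th July, 2015'.
units = split_time_units("2015年7月26日下午8:20")
for kind, text in units:
    print(kind, text)
```

The five matches correspond to the five TUs listed in the example: year, month, day, period of the day and clock time.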
3.3 Time trigger

Time triggers are keywords used to judge whether a phrase is a TE. They are often concepts of
time such as year, month and day.
Time triggers are divided into independent triggers and number triggers according to
identification needs. An independent trigger is a TE by itself, such as today, morning and now.
Number triggers are divided into number prefix triggers and number suffix triggers. The former
take a number before the word, such as ‘century’ in 21st century. The latter take a number
after the word, such as ‘week’ in week 3.

4 Recognition of TEs

In this paper, Chinese TEs are recognised by combining statistics and rules. The steps of the
identification are described in Figure 3.

Figure 3 Recognition model of Chinese temporal expressions
[Flowchart. Part based on CRFs: Training corpus, Participle and part of speech tagging, Feature
extraction, Train CRFs model, CRFs model; Testing corpus, Participle and part of speech tagging,
Pre format, CRFs recognition, Time units. Part based on rules: Get candidate triggers, Delete
obviously error, Evaluation, Supplement tagging, Trigger thesaurus, Confirm boundary and combine
time unit, Affix thesaurus, Time expression]

The TUs are first recognised in the text by CRFs, and then the TEs are obtained through rules.
CRFs are applied to recognise time units rather than time expressions in order to reduce the
granularity and difficulty of identification. The rule-based method relies on the time trigger
thesaurus and the time affix word thesaurus, which must be initialised manually at the
beginning. As the test corpus grew, more candidate triggers were obtained from it and the
candidates were filtered down to the correct time triggers using an evaluation function. The
specific steps are as follows:

Step 1 Recognition of TUs based on CRFs
Step 1.1 Corpus preprocessing
Word segmentation, part of speech (POS) tagging and pre-formatting of the corpus are done in
this step. Some word segmentation errors are corrected, and the text is put into the format
required by CRFs.
Step 1.2 CRFs training
TUs are tagged in the training corpus. A proper template is chosen to train a model, which is
used in the next step.
Step 1.3 CRFs testing
TUs are not tagged in the testing corpus. The model derived from step 1.2 is applied to the
input corpus, and the TUs are tagged in the output corpus.

Step 2 Recognition of TEs based on rules
Step 2.1 Removal of the obviously wrong annotated TUs
The TUs annotated by CRFs are accepted as input to the rule-based system, and the obviously
wrong annotated TUs are removed.
Step 2.2 Expansion of the trigger thesaurus
An automatic acquisition program starts from the TUs as the candidate trigger list. An
evaluation function is used to remove impurities and sift out the correct trigger words. The
trigger words are placed in the independent trigger thesaurus or the digital trigger thesaurus
according to the trigger’s type.
Step 2.3 Obtainment of TEs
The new trigger thesaurus and the affix thesaurus are used to supplement the labelling of new
TUs, confirm the boundaries and combine TUs, and the TEs are finally obtained.

The two steps are discussed in detail in 4.1 and 4.2.

4.1 Recognition of TUs based on CRFs

4.1.1 Basic principle of CRFs

TE recognition can be converted into a sequence labelling problem. The CRFs model is a kind of
conditional probability model (Degen and Yu, 2009). It can calculate
the joint probability distribution of the entire label sequence given an observation sequence.
In the TE recognition system, the problem can be seen as a linear chain CRF. X = (X1, X2, …, Xn)
and Y = (Y1, Y2, …, Yn) are random variable sequences forming a linear chain, and they satisfy
equation (1):

$$P(Y_i \mid X, Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n) = P(Y_i \mid X, Y_{i-1}, Y_{i+1}) \tag{1}$$

In the above equation, i = 1, 2, …, n (when i = 1 or i = n, only the one existing neighbour is
taken into consideration). P(Y|X) is then a linear chain CRF. In TE recognition, X is the input
observation sequence and Y is the output label sequence or state sequence.
The conditional probability can be obtained from the joint probability distribution.
P(Y1, …, Yn, X) can be obtained from equation (2):

$$P(Y_1, \ldots, Y_n, X) = \frac{1}{Z(X)} \exp\left\{ \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(Y_{i-1}, Y_i, X, i) \right\} \tag{2}$$

In equation (2), Z(X) is the normalisation factor, which is defined as follows:

$$Z(X) = \sum_{Y} \exp\left\{ \sum_{i=1}^{n} \sum_{j=1}^{m} \lambda_j f_j(Y_{i-1}, Y_i, X, i) \right\} \tag{3}$$

In equation (3), f_j is a characteristic function. It is the jth local function defined on the
ith maximal clique. λ_j is the weight defined on f_j and is an important measure of the
feature. n is the sample size in the corpus, and m is the number of characteristic functions.
The characteristic function f_j is defined as follows:

$$f_j(Y_{i-1}, Y_i, X, i) = \begin{cases} 1 & \text{if } Y_{i-1}, Y_i \text{ and } X \text{ take specific values} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

The specific information in X comes from the feature template, which includes partial
linguistic knowledge of the sample, such as words, part of speech and word semantic
information. All the feature functions can be derived from the feature template.

Table 1 Feature template

Template 1        Template 2
#Unigram          #Unigram
U00:%x[-1,0]      U00:%x[-1,0]
U01:%x[0,0]       U01:%x[0,0]
U02:%x[1,0]       U02:%x[1,0]
U03:%x[-2,0]      U03:%x[-1,1]
U04:%x[2,0]       U04:%x[0,1]
U05:%x[-1,1]      U05:%x[1,1]
U06:%x[0,1]
U07:%x[1,1]
U08:%x[-2,1]
U09:%x[2,1]

Two experiments were run on the international news corpus of 2011, which includes 1,026 TUs.
The results show that template 2 is better than template 1. Therefore only the current word and
its part of speech (POS), together with the words immediately before and after the current
position and their POS, are used as features in training the CRFs model.

4.1.3 Selection of TUs as annotation target of CRFs

TEs are constructed from TUs. The way TUs combine within TEs is relatively loose and varied,
but the format of the TUs themselves is relatively stable. To reduce the identification
granularity and to avoid having to recognise time affix words, TUs were chosen as the
annotation target of CRFs.
Experiments were carried out to compare the CRF-based recognition of TUs and of TEs. The
results are shown in Figure 4.

Figure 4 Recognition results of TUs and TEs based on CRFs (see online version for colours)
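To make Template 2 of Table 1 concrete, the sketch below expands its six unigram macros for one token. In the CRF++ template syntax, %x[row,col] refers to the token at relative position row and to column col of the input file; here column 0 is assumed to hold the word and column 1 its POS tag. The function expand_template, the toy sentence, its POS tags and the `<PAD>` marker for positions outside the sentence are all assumptions for illustration and are not taken from the paper or from the CRF++ implementation.

```python
import re

# Template 2 from Table 1: words and POS tags in a window of one token.
TEMPLATE_2 = [
    "U00:%x[-1,0]", "U01:%x[0,0]", "U02:%x[1,0]",
    "U03:%x[-1,1]", "U04:%x[0,1]", "U05:%x[1,1]",
]
MACRO = re.compile(r"%x\[(-?\d+),\s*(\d+)\]")


def expand_template(template, rows, i):
    """Expand every %x[row,col] macro for the token at index i.

    rows is a list of (word, pos) tuples; positions outside the
    sentence are replaced by the hypothetical marker '<PAD>'.
    """
    def lookup(match):
        offset, col = int(match.group(1)), int(match.group(2))
        j = i + offset
        return rows[j][col] if 0 <= j < len(rows) else "<PAD>"

    return [MACRO.sub(lookup, line) for line in template]


# Toy segmented and POS-tagged sentence (assumed, for illustration).
rows = [("他", "r"), ("今天", "t"), ("出发", "v")]
features = expand_template(TEMPLATE_2, rows, 1)
first_token_features = expand_template(TEMPLATE_2, rows, 0)
print(features)
```

Each expanded string, paired with a candidate label, plays the role of one characteristic function f_j in equation (4).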
4.2 Recognition of TEs based on rules

The CRFs model converts a complex recognition problem into a simple sequence labelling problem,
in which features can be chosen and context information used freely, so that the recognition
problem is solved properly. Yet CRFs have limitations and fall short in our experiments for the
following reasons:
• it is not good at recognising pure digital TEs
• it is not good at recognising TUs that are not in the training corpus or that appear only a
few times
• it is not good at recognising the boundary of TEs.
So, the following rules are proposed to remedy the weaknesses mentioned above.

4.2.1 Pure digital TEs identification rules

A pure digital TE is a string of digits, usually representing a year. It is confined to the
range from 1753 to 9999, and the following rules are defined:
• ‘of’ + digital + not quantifier
• digital + location
• prepositions + digital + not quantifier.

4.2.2 Supplementary tagging rules

For TUs that are scarce or absent in the training data, the approach taken in this paper is to
construct a time trigger thesaurus for supplementary annotation.
The process for identifying new TUs is as follows.
Step 1 Read a word from the text.
Step 2 Access the independent trigger thesaurus and check whether the word is an independent
trigger word. If so, mark it as a time unit.
Step 3 If the word is not in the independent trigger thesaurus, check whether it belongs to the
digital trigger thesaurus. If it does, check the words before and after it. If there is a
digit, then it is a time unit. If the preceding word is a quantifier, look one word further
back; if that word is a number or a vague word, then it is a time unit. Otherwise, it is not a
time unit.
The initial time trigger thesaurus was created manually. The recognition effect depends to a
certain extent on the size of the time trigger thesaurus, and a manually built thesaurus cannot
be complete. So, transformation rules were used to generate candidate trigger words
automatically from the TUs. The candidate trigger list contains a lot of wrong candidate words;
if they were put directly into the trigger thesaurus, the accuracy of the recognition results
would drop greatly. Therefore, the evaluation function Score(Ti) is introduced to evaluate the
candidate trigger words, and a threshold value is set as a filter.
The score of each candidate trigger word is calculated from formula (5):

$$\mathrm{Score}(T_i) = \frac{\mathrm{True}(T_i)}{\mathrm{True}(T_i) + \mathrm{False}(T_i)} \tag{5}$$

In this equation, True(Ti) is the number of correct occurrences of candidate Ti in the test
corpus, and False(Ti) is the number of errors. A threshold value λ is set, and the candidates
whose score is less than λ are removed from the list.

4.2.3 Obtainment of TEs

In this paper, we use the affix thesaurus to help determine the boundary. Specifically, we check
whether the word before the current TU is in the prefix thesaurus to determine the upper
boundary, and whether the word after the current TU is in the suffix thesaurus to determine the
lower boundary.
Because ambiguous words exist, limit conditions are set. If a word satisfies its condition, the
boundary of the time element is determined; otherwise, the word does not belong to the scope of
time.
The TUs are combined according to their spatial proximity. Two neighbouring TUs are merged into
a temporal expression. For adjacent TUs, we first check whether there is a conjunction between
them; if so, we unite the conjunction with the two TUs to form a time expression.
The purpose of the rule-based processing part is to make up for the shortfalls of CRF-based TU
recognition, merge the TUs and obtain the complete TEs. Obvious errors are deleted, new TUs
missed by the CRFs are labelled, TUs are combined and boundaries are confirmed, and the TEs are
finally obtained. Supplementing the time triggers effectively extends the scale of the trigger
thesaurus and improves recognition. Boundary confirmation is converted into checking whether the
preceding and following words are time affix words, which simplifies boundary location.
Constraints are set on ambiguous time affix words to reduce the probability of boundary errors.

5 Experimental results and analysis

5.1 Experimental setup and corpus preparation

The CRF model was obtained with the CRF++ 0.54 toolkit. Word segmentation used the ‘Nihao’
segmentation system of Dalian University of Technology. The rules program was written in C++ in
Visual Studio 2010.
The training data of the CRFs used in this paper are news taken from The People’s Daily in the
year 2000, containing 240,000 words. The testing corpus is 2011 international news, which
contains 180,000 words and 1,954 TEs. After the basic testing, a sampling test was carried out.
The sampling corpus was selected from 2012 international news. The contrast experiment was
carried out on the
Chinese corpus of TempEval-2010. In order to test the validity of our approach in other areas,
new experiments were carried out on the Chinese Emergency Corpus (CEC). All of the raw corpora
were tagged by our former program, which extracted Chinese TEs based on LEX.

5.2 Experimental results

5.2.1 Evaluation of the proposed method

In order to evaluate the effect of this method, two comparison systems were built: CRFs-R is
the recognition system that recognises TEs using only CRFs, and Rules-R is the system that
recognises TEs using only rules. C&R-R is the method of combining statistics and rules
described in this paper.
Test results on the same experimental dataset of 2011 international news are shown in Table 2.

Table 2 Results of different TEs recognition methods

Methods    P         R         F-measure
CRFs-R     91.78%    85.67%    88.62%
Rules-R    93.24%    92.53%    92.88%
C&R-R      97.07%    97.57%    97.32%

From the experimental results shown in Table 2, it can be seen that recognition with the
combination of CRFs and rules is better than with the CRFs method or the rules method alone.
On the basis of the experiments above, a large-scale corpus sampling test was carried out. This
time, the testing data were selected from 2012 international news, which contained 6,570,000
words. The sampling method was to select a sample of 10,000 words after every 60,000 words,
deriving ten samples from the corpus in total. The whole sample size is 100,000 words. Detailed
results of the ten samples are shown in Table 3.

whose language style is consistent. So, the contrast and transfer test was carried out.

5.2.2 Comparisons with another method

We downloaded the Chinese corpus of TempEval-2010 from the internet and tagged its 44 files
with our model. The F1-measure is 95.93%, which is better than the former work of Li Junchan on
the same corpus. The specific results are shown in Table 4.

Table 4 Comparison results on TempEval-2010

Methods    File numbers    TEs numbers    P (%)    R (%)    F1 (%)
POS-R      8               131            85.16    83.21    84.17
C&R-R      44              763            96.72    95.16    95.93

POS-R is the recognition method based on part of speech tagging of time units. C&R-R is the
method of combining statistics and rules used in this paper.

5.2.3 Experiments on CEC

CEC was built by the Data Semantic Laboratory at Shanghai University. This corpus is divided
into five categories: earthquake, fire, traffic accident, terrorist attack and food poisoning.
There are 332 texts in CEC in total, which were derived from the internet and processed in
several steps. We downloaded them from the internet, deleted all the XML tagging from the
files, pretreated them with LEX and Nihao, and formatted them as CRF++ input. There were 1,378
TEs in the whole corpus. The corpus information is shown in Table 5.

Table 5 Corpus scale of CEC

Type    File number    TEs number    Words number
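The F-measure columns reported in Tables 2 and 4 are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The paper does not state this formula explicitly, so treating the reported values as the balanced F1 is an assumption; the short check below simply recomputes each reported value from its P and R. The function name f_measure and the dictionary layout are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall), in percent."""
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) in percent, copied from Tables 2 and 4.
reported = {
    "CRFs-R (Table 2)": (91.78, 85.67, 88.62),
    "Rules-R (Table 2)": (93.24, 92.53, 92.88),
    "C&R-R (Table 2)": (97.07, 97.57, 97.32),
    "POS-R (Table 4)": (85.16, 83.21, 84.17),
    "C&R-R (Table 4)": (96.72, 95.16, 95.93),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed {f_measure(p, r):.2f}, reported {f:.2f}")
```

Under this assumption, every recomputed value agrees with the reported one to two decimal places.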
These results were lower than in the former test. The F-measure of the earthquake category was
lower than 90%, and the best result, the F-measure of the fire category, was lower than 95%. The
main reason was the different kind of corpus. In order to bring the characteristics of the test
data into the training corpus, a new round of testing was carried out as follows: one category
was used as the test set, and the other four categories were added to the former training
corpus. The new model was called the self-trained model. The testing results obtained in this
way are shown in Table 7.

Acknowledgements

The authors sincerely thank the anonymous referees for their valuable remarks and helpful
suggestions, which significantly improved the paper.
This work was supported by the National Natural Science Foundation of China (Nos. 61173100,
61173101, 61272375) and by the Fundamental Research Funds for the Central Universities.
Ruifang, H., Bing, Q., Ting, L., Yuequn, P. and Sheng, L. (2007) ‘Recognizing the extent of
Chinese time expressions based on the dependency parsing and error-driven learning’, Journal of
Chinese Information Processing, Vol. 21, No. 5, pp.36–40.
TimeML (2009) ‘Guidelines for temporal expression annotation for English for TempEval 2010’,
TimeML Working Group [online]
http://www.timeml.org/tempeval2/tempeval2-trial/guidelines/timex3guidelines-072009.pdf.
Tong, W. (2010) Research on Chinese Time Expression Recognition, MS thesis, Department of
Computer Science and Technology, FuDan University, Shanghai, China.
Wei, L., Wenjie, X., Dong, W., Xujie, Z. and Zongtian, L. (2014) ‘An extending description
logic for action formalism in event ontology’, International Journal of Computational Science
and Engineering, Vol. 9, No. 3, pp.205–214.
Zhao, Z., Xu, J., Zhang, Y. and Liu, J. (2013) ‘Japanese time expression recognition by
combining rules with statistics’, Journal of Chinese Information Processing, Vol. 27, No. 6,
pp.192–200.
Zhu, S., Liu, Z., Fu, J. and Zhu, F. (2011) ‘Chinese temporal phrase recognition based on
conditional random fields’, Computer Engineering, Vol. 37, No. 15, pp.164–167.