D'hondt, J. - Topic Identification Based On Document Coherence and Spectral Analysis

Information Sciences
journal homepage: www.elsevier.com/locate/ins
doi:10.1016/j.ins.2011.04.044
Corresponding author e-mail: [email protected] (J. D'hondt)

Article history: Received 18 March 2010; Received in revised form 16 March 2011; Accepted 27 April 2011; Available online 4 May 2011

Keywords: Topic identification; Spectral theory; Text mining

Abstract

In a world with vast information overload, well-optimized retrieval of relevant information has become increasingly important. Dividing large documents that span multiple topics into sets of coherent subdocuments facilitates the information retrieval process. This paper presents a novel technique to automatically subdivide a textual document into consistent components based on a coherence quantification function. This function is based on stem or term chains linking document entities, such as sentences or paragraphs, based on the reoccurrences of stems or terms. Applying this function to a document results in a coherence graph of the document linking its entities. Spectral graph partitioning techniques are used to divide this coherence graph into a number of subdocuments. A novel technique is introduced to obtain the most suitable number of subdocuments. These subdocuments are an aggregation of (not necessarily adjacent) entities. Performance tests are conducted in test environments based on standardized datasets to prove the algorithm's capabilities. The relevance of these techniques for information retrieval and text mining is discussed.

© 2011 Elsevier Inc. All rights reserved.
1. Introduction
In recent decades, the ever-increasing amount of digital text has boosted research into retrieval techniques able to deal with this vast amount of information. Clustering techniques and text categorization are a few examples of statistical techniques available for this purpose. For these full-text processes, the current computational limitations when dealing with large document datasets complicate the successful completion of these tasks. Digital texts or documents frequently describe more than one subject in their content. The ability to divide full-text documents into components containing coherent parts or topics of these documents can help to bypass the computational limitations of these techniques. Information processing techniques on smaller documents require less memory and processing time, which motivates research into supporting techniques such as text extraction and text summarization. These techniques reduce the size of documents and can be used as an input for further information processing. Topic-based segmentation has shown its usefulness for improved retrieval accuracy and retrieval of meaningful components of text, for document navigation and for text summarization [2,8].
The research introduced in this paper is focused on a novel technique that automatically subdivides a document into multiple topic-based components, based on statistical and spectral properties of the text and the coherence graph of this text. The presented technique is able to discover non-contiguous connections, which differs from linear text segmentation in two main respects. Several of the existing techniques only identify topic boundaries between two adjacent document entities, so no recombination of segments containing similar content is performed. The main difference between the presented technique and existing techniques is shown when these techniques are applied to the 21-paragraph Stargazers document [17], a well-known example in the context of topic identification. Subtopic frontiers were identified in this corpus
by human judges in order to evaluate the TextTiling technique of Hearst. This subtopic structure is indicated below, together
with the paragraph ranges:
While subtopic algorithms try to identify these subtopic frontiers, the proposed technique tries to identify the general topics present in a document, taking the overall content into account. Considering the Stargazers document, the presented technique identifies two general topics in this document: the first topic describes the earth–moon interaction, the second topic concerns the binary/trinary star systems, hence indicating the difference with subtopic identification techniques.
The techniques described in this paper directly resulted in two applications, based on a multi-vector representation of a
document [12]. A document vector in a vector space model [3] integrates all topics into one representation format which
results in less accessible or even irretrievable information. The ability to divide full text documents into their components
based on coherent parts or topics can help to bypass this retrieval issue. Constructing such multi-vector representation of a
document aims to improve the retrieval performance of a search engine, because the information present in the content of
the document is more accessible. The second developed application is the identification of near-duplicate documents based
on the topics present in the content of the documents. As the content is more accessible, and the lexical chains are usable as a
fingerprint of a document, interesting results are obtained [13].
This paper is structured in the following manner: in Section 2, a brief overview of related work in the domain of topic identification is presented. The subsequent section introduces the novel document segmentation technique based on topic identification. The four different steps in this topic identification and document segmentation process are explained in this section. The evaluation process is performed on four different test environments, based on two different scenarios. This procedure is explained in Section 4 and the results are discussed in Section 5. The paper closes with a section drawing conclusions from the obtained results.
2. Related work
As indicated in [10], topic segmentation techniques can be divided into two categories: techniques using statistical information extraction and techniques exploiting lexical cohesion. This classification is, however, not strict; several statistical methods adopt a form of lexical cohesion. Many topic identification algorithms assume that topically coherent subdocuments correspond to text fragments exhibiting a homogeneous lexical distribution (i.e. the usage of words). In the literature, several approaches can be identified, based on different descriptions of this distribution in a document.
The first category is based on lexical cohesion. Several approaches exist to measure this cohesion, such as stem or term repetitions, context vectors, entity repetition, semantic similarity, word distance models and word frequency models. As mentioned in the previous section, text segments describing a similar content contain a similar vocabulary. The re-occurrence of
specific terms can indicate the presence of a common topic. Lexical weighting is one of the most popular approaches in this
type of topic identification [24,17,20,31]. Lexical chains and the extended approach, the so-called weighted lexical links, are
two techniques often used in a huge collection of identification algorithms. The topic unigram language model is the most
frequently used technique [28]. Gathering the number of occurrences of each term for each topic leads to the posterior prob-
ability of a sequence of terms belonging to a certain topic. All terms obtain the same importance, i.e. terms not related to a
topic are equally important as keywords. The so-called Cache model is based on a set of keywords automatically extracted
for each topic. These words are the result of statistical distributions obtained from training corpora [5]. The TFIDF-classifier
allows each topic to be represented as a vector. These vectors contain the vocabulary specific to their related topic. The sim-
ilarity between a topic and a document represented in the Vector Space Model is calculated by the cosine similarity measure.
The highest similarity indicates the topic of this document.
Most techniques based on this approach are linear topic segmentation algorithms. These algorithms place boundaries in-
side a text at positions where a topic shift is identified. This identification process is performed in a (fixed size) sliding win-
dow, examining lexical variations. The lexical variation often results in a drop of an employed similarity measure. As previously indicated, many algorithms fit this generic description. Popular examples are TextTiling [17], C99 [10], Dotplotting [29] or Segmenter [37]. The TextTiling technique segments texts into multiple entities (i.e. sequences of 3 to 5 sentences) or subtopics, the so-called 'tiles', using the cosine similarity between segments. A smoothed curve is cal-
culated expressing the similarity between adjacent entities. Minima in this curve are considered as potential topic
boundaries.
Other statistical approaches exist using global information of the text. Malioutov [22] presents a graph-theoretic frame-
work. The text is converted into a weighted undirected graph in which the nodes represent the sentences and the edges
quantify the relations. The text segmentation is performed by applying the normalized-cut criterion [30]. By using this
criterion, the similarity within each partition is maximized and the dissimilarity across the partitions is minimized. The
graph-based approach extends the local cohesion range of the sliding window by taking into account the long-range lexical
cohesion and distribution in a text. The computational techniques for finding the optimal solution to the minimal cut objec-
tive are however difficult. The minimization of the normalized cut is NP-complete, but, due to the linearity constraint of this
segmentation type, obtaining an exact solution is feasible [22].
All of the described algorithms rely on statistical properties of the text. The other category of techniques is based on Nat-
ural Language Processing techniques. Linguistic methods introduce a set of specific rules based on the corpus and use exter-
nal semantic information such as thesauri and ontologies, possibly combined with one or more statistical methods [23]. This
is the main drawback of this type of identification techniques: the results are dependent on the semantic resources available
for a specific text [35] and therefore the setup is limited to the text. Hidden Markov Models and Neural Networks are used as
part of the learning process in the technique of Amini [1]. A probabilistic sequence framework is proposed to estimate
symbol or term sequences in a text. This framework should enable processing of more complex information retrieval and
extraction tasks. Caillet et al. proposed a machine learning technique based on term clustering [6]. The technique first discovers the different so-called concepts in a text, which are defined as sets of representative terms. The partitioning into coherent paragraphs is performed with the Maximum Likelihood clustering approach [19]. Passonneau and Litman [27] use decision trees in
their algorithm to combine multiple linguistic features extracted from the document content. Other semantical techniques
exist [11] that are able to recombine the segments according to their content.
A comprehensive overview of several statistical and linguistic topic identification techniques can be found in [7].
3. Document segmentation
The underlying idea for the presented segmentation technique is common to the techniques proposed by Hearst [17] and
Choi [10]: when a topic is described in a text, one can assume that a specific set of terms is used in the text fragment describ-
ing this topic. When a topic shift occurs, this set is substantially changed. Therefore re-occurrences of terms indicate the
presence of a certain topic, and topic boundaries can be identified. Identifying and quantifying the relationships between the
so-called document entities (i.e. document parts, such as sentences or paragraphs) based on these terms enables the con-
struction of a coherence graph. This graph represents the topical cohesion of the document, and is the main tool in the iden-
tification and segmentation process. Based on this graph, a matrix representation called the Laplacian matrix can be
constructed, which mathematically represents the connectivity of the graph. Using the spectral properties, the number and
identification of the topically coherent components can be obtained. As these phases are matrix-based, the entire process
is performed in an automatic manner.
The proposed segment identification process is thus composed of the following parts:
1. Reducing the feature space of the original document to a reduced subspace (see Section 3.4)
2. Application of a cluster algorithm in this space to obtain the components
The main idea of this document decomposition process is based on the presence of lexical chains (i.e. a chain of document
entities based on the re-occurrence of a meaningful term in a document [4,24]). The presence of these chains represents the
lexical cohesive structure of the document [24]. The term and the identifiers of the document entities are stored as informa-
tion of the chain. An example of such a lexical chain is given in Fig. 1. In this figure, the sentences are taken as document
entities. Their identifiers are shown between brackets. Several lexical chains are linking the sentences of a BBC article.
The occurrences of the term 'copper' link sentences 1, 2 and 4, while the occurrences of the term 'Honda' link sentences
7, 8 and 10. These two lexical chains are only a subset of several possible lexical chains in this small example. Considering all
chains present in this example, two larger components can be identified. The first component is the set of sentences from 1
to 6, the second component contains the set of sentences from 7 to 11. In the figure, these two components are separated by
a horizontal line.
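As a rough illustration of this chain-building step (not the authors' implementation), the Python sketch below collects, for every informative term, the ordered list of document entities in which it occurs, loosely mirroring the copper/Honda example of Fig. 1. The tokenizer, the tiny stopword list and the function name build_term_chains are assumptions; the actual preprocessing of the paper (stemming, frequency-based filtering) is not reproduced here.

```python
from collections import defaultdict

# Illustrative stopword subset; the paper's actual filtering is more elaborate.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "this", "that"}

def build_term_chains(entities):
    """Map every informative term to the ordered list of entity indices it occurs in.

    `entities` is a list of strings (sentences or paragraphs).  A chain is kept
    only if the term occurs in at least two entities, since a single occurrence
    cannot link anything.
    """
    occurrences = defaultdict(list)
    for idx, entity in enumerate(entities):
        seen = set()
        for token in entity.lower().split():
            token = token.strip(".,;:!?()'\"")
            if token and token not in STOPWORDS and token not in seen:
                seen.add(token)
                occurrences[token].append(idx)
    return {term: ids for term, ids in occurrences.items() if len(ids) > 1}

sentences = [
    "Copper prices rose sharply this week.",
    "Analysts link the copper rally to mining strikes.",
    "Honda reported record quarterly sales.",
    "Copper demand remains strong.",
    "Honda also expanded its hybrid production.",
]
print(build_term_chains(sentences))   # {'copper': [0, 1, 3], 'honda': [2, 4]}
```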
In a lexical chain not all the document entities are related in a similar manner. The first difference is based on the notion
of distance. The notion of distance between document entities in this context is defined as the number of entities they span.
In Fig. 1 the distance between the first and third occurrence of ’copper’ is 4 (sentence 1 to 4). Intuitively, the nearer the re-
lated document entities are, the more chance there is that they are related to a similar content. However, it remains possible
that another distant set of entities is related with the same stem. These entities are also describing a similar content, and
both groups will form one longer chain. However, due to the distance, not all pairwise relationships in this chain will be
equally strong.
A second difference between the entities is the informative value of a word. Not all words in a document are equally
important [25]. Stopwords are an obvious example: these carry almost no content, and can therefore be considered as noise
when creating stem chains. Longer chains are also preferred over shorter ones. This reasoning is based on the importance of
mid-frequency words originating from the Zipf-curve [25]. Considering the possibility that a term occurs multiple times
within a document entity, the longest chains are most likely defined by the most frequently occurring terms in the docu-
ment. This requires an optimal filtering of non-informative terms. Shorter chains can be created by, e.g., typing mistakes or the presence of numerical values.
The previous assumptions are translated into a quantification formula. The linkage $w_{i,j,C}$ between document entities $i$ and $j$ based on stem chain $C$ is defined as

$$w_{i,j,C} = w_C \, Nb_C \, \frac{\|D\|}{1 + (C_i - C_j)}, \qquad (1)$$

where $w_C$ is the weight of the term defining the chain $C$ in the document, such as a raw or normalized frequency [19], $Nb_C$ is the number of occurrences in the term chain $C$, $\|D\|$ is the length of the document expressed in number of document entities, and $C_i - C_j$ is the distance between entities $i$ and $j$ in the text. In a complete document, multiple term chains can be identified and quantified. The overall linkage $w_{i,j}$ between document entities $i$ and $j$ is based on the aggregation of all linkage values for the term chains that connect both entities:

$$w_{i,j} = \sum_C w_{i,j,C}. \qquad (2)$$
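A minimal sketch of Eqs. (1) and (2) is given below. It assumes each chain is supplied with its term weight w_C and its list of entity indices; treating Nb_C as the length of the chain and the distance as the absolute index difference are interpretations of the text, and the helper name linkage_matrix is ours.

```python
import numpy as np

def linkage_matrix(chains, n_entities):
    """Aggregate Eq. (1) over all term chains into a symmetric linkage matrix W (Eq. (2))."""
    W = np.zeros((n_entities, n_entities))
    for chain in chains.values():
        w_c = chain["w_C"]                        # weight of the term defining the chain
        ids = chain["entities"]                   # entity indices in which the term occurs
        nb_c = len(ids)                           # number of occurrences in the chain
        for a in range(nb_c):
            for b in range(a + 1, nb_c):
                i, j = ids[a], ids[b]
                dist = abs(i - j)                 # distance expressed in document entities
                w = w_c * nb_c * n_entities / (1.0 + dist)   # Eq. (1)
                W[i, j] += w                      # Eq. (2): sum over all chains
                W[j, i] += w
    return W

chains = {
    "copper": {"w_C": 1.0, "entities": [0, 1, 3]},
    "honda":  {"w_C": 1.0, "entities": [2, 4]},
}
print(np.round(linkage_matrix(chains, 5), 2))
```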
To illustrate this process, consider the following example in Table 1. This document contains 5 document entities (indicated
as rows), connected by 6 term chains shown in the columns.
Applying this process between all pairs of document entities and for all identified chains, a square linkage matrix W can
be obtained containing all the linkage values. The linkage matrix of the example presented in Table 1 is shown in the follow-
ing matrix
$$W = \begin{bmatrix} 0 & 8.05 & 1.83 & 0 & 1.12 \\ 8.05 & 0 & 0 & 0 & 2.55 \\ 1.83 & 0 & 0 & 3.22 & 0 \\ 0 & 0 & 3.22 & 0 & 0 \\ 1.12 & 2.55 & 0 & 0 & 0 \end{bmatrix}.$$
Based on this matrix, a graph structure can be composed to represent the quantified relationships between the document
entities. This undirected, weighted coherence graph is defined as a graph G = (V, E) comprising a set V of vertices, representing the document entities, together with a set E of edges representing the linkage values. This means that each edge between two vertices vi and vj carries a non-negative weight wi,j.
Table 1
Example of a document containing 5 document entities and 6 lexical chains.
In Fig. 2, the coherence graph of the example presented in Table 1 is shown. The dots in circular form are the different
nodes representing the document entities, the lines represent the coherence between the related pair of document entities.
The weight values are omitted for reasons of clarity.
A real example of a graph structure is given in Fig. 3. The final graph partitioning result is also indicated in this figure.
After applying the process explained in the next sections, three topics can be identified. The cuts are indicated with
a full line.
The previously obtained graph provides a (graphical) representation of the topical coherence of the document. By
employing a graph partitioning technique on this graph, the weakest connections are removed and thus the different topics
are obtained. To obtain a graph partitioning of the coherence graph, spectral graph partitioning techniques can be employed
[36]. The main tools for this spectral graph partitioning are the so-called Laplacian matrices. With every graph such a Laplacian matrix can be constructed, which is a mathematical representation of the graph. In the literature multiple definitions of a Lapla-
cian matrix can be found [36]. The symmetric normalized Laplacian matrix is used for reasons of stability and consistency
[36]. This square matrix, based on the number of document entities, is defined as

$$L = I - D^{-1/2} W D^{-1/2}. \qquad (4)$$
I and D respectively denote the identity matrix and the degree matrix of the graph. W represents the previously obtained
linkage matrix. In this context, the following definition for a degree matrix is used:
Definition 1. A degree matrix D of a weighted, undirected graph is a diagonal matrix restricted to the following conditions:
$$d_{i,j} = \begin{cases} \sum_{k=1}^{n} w_{i,k} & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
Considering the example in Table 1, the obtained degree matrix of the graph is presented in the following matrix:
$$D = \begin{bmatrix} 10.99 & 0 & 0 & 0 & 0 \\ 0 & 10.60 & 0 & 0 & 0 \\ 0 & 0 & 5.05 & 0 & 0 \\ 0 & 0 & 0 & 3.22 & 0 \\ 0 & 0 & 0 & 0 & 3.67 \end{bmatrix}.$$
Using this matrix in Eq. (4), the following Laplacian matrix is obtained:
$$L = \begin{bmatrix} 1 & -0.75 & -0.25 & 0 & -0.18 \\ -0.75 & 1 & 0 & 0 & -0.41 \\ -0.25 & 0 & 1 & -0.80 & 0 \\ 0 & 0 & -0.80 & 1 & 0 \\ -0.18 & -0.41 & 0 & 0 & 1 \end{bmatrix}.$$
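The degree matrix of Definition 1 and the Laplacian of Eq. (4) can be reproduced for this running example with a few lines of numpy. The sketch below only restates the formulas above; small deviations from the printed values are due to rounding of the displayed W.

```python
import numpy as np

W = np.array([
    [0.00, 8.05, 1.83, 0.00, 1.12],
    [8.05, 0.00, 0.00, 0.00, 2.55],
    [1.83, 0.00, 0.00, 3.22, 0.00],
    [0.00, 0.00, 3.22, 0.00, 0.00],
    [1.12, 2.55, 0.00, 0.00, 0.00],
])

degrees = W.sum(axis=1)                              # Definition 1: d_ii = sum_k w_ik
D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt     # Eq. (4)

print(np.round(degrees, 2))   # approximately [10.99, 10.60, 5.05, 3.22, 3.67]
print(np.round(L, 2))         # matches the Laplacian above up to rounding
```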
As stated in [36], the second smallest eigenvalue of the Laplacian matrix explains a significant amount of variance or structure present in the coherence graph. The eigenvector related to this eigenvalue thus indicates the connectivity between the different document entities. If the linkage matrix is sorted in the order proposed by this eigenvector, the reordered matrix provides indications of the different document entities that are related together according to the quantification of the different lexical chains.
Reordering the linkage matrix of the example shown in Table 1 according to the order proposed by this eigenvector, the following matrix is obtained:
$$W = \begin{bmatrix} 1 & 0.41 & 0.74 & 0 & 0 \\ 0.41 & 1 & 0.18 & 0 & 0 \\ 0.74 & 0.18 & 1 & 0.25 & 0 \\ 0 & 0 & 0.25 & 1 & 0.80 \\ 0 & 0 & 0 & 0.80 & 1 \end{bmatrix}.$$
If this matrix is graphically represented, Fig. 4 is obtained. The dots in this figure indicate existing relationships between
the document entities. A more complex example of this reordering process is shown in Fig. 5.
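The reordering step can be sketched as follows: compute the eigenvector of the Laplacian associated with the second smallest eigenvalue and sort the entities by its components. This is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

W = np.array([
    [0.00, 8.05, 1.83, 0.00, 1.12],
    [8.05, 0.00, 0.00, 0.00, 2.55],
    [1.83, 0.00, 0.00, 3.22, 0.00],
    [0.00, 0.00, 3.22, 0.00, 0.00],
    [1.12, 2.55, 0.00, 0.00, 0.00],
])
d = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L = np.eye(len(W)) - d @ W @ d

eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns ascending eigenvalues for the symmetric L
fiedler = eigvecs[:, 1]                # eigenvector of the second smallest eigenvalue
order = np.argsort(fiedler)
print(order)                                   # permutation placing {A, B, E} and {C, D} on opposite ends
print(np.round(W[np.ix_(order, order)], 2))    # reordered linkage matrix with blocks along the diagonal
```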
The reordered matrix thus contains dense segments along the diagonal indicating coherent subgraphs in the coherence
graph. The ideal situation is that the segments have a full square form. These squares indicate the presence of complete
subgraphs.
The identification process of the number of topics based on the coherence graph is based on this property. The number of
components can be determined by identifying the relevant dense segments along the diagonal. In an optimal configuration
these blocks are dense squares, as indicated in this formula:
$$L = \begin{bmatrix} L_1 & 0 & 0 \\ 0 & L_2 & 0 \\ 0 & 0 & L_3 \end{bmatrix}$$
with Li a full dense matrix. In this manner, relationships that are not located in these square regions are considered noise.
For the further identification process, the notion of a square segment along the diagonal is introduced. This shape is de-
fined to have its upper left and lower right corner on the diagonal, and its size is defined as the number of document entities
it spans.
[Fig. 4: dot plot of the reordered linkage matrix of the example; both axes show the document entity ID.]
The automatic identification process is composed of two steps. The first step is to extract a number of square segments
possibly indicating coherent components based on a single threshold parameter. The second step is to retain those segments
covering a substantial number of relationships.
In the first step, the following Algorithm 1 is used to retrieve dense segments. For every row, the sizes of possible seg-
ments are identified. The size of a segment is defined by T, the number of adjacent non-zero values in a row of the matrix
representation. A threshold value THR stipulates how many adjacent zero values, indicated as blank spaces in Fig. 5, can be
taken into account when delineating the segment.
The result of this first step in the process is a list of possible square segments. The segments can overlap with, or be in-
cluded in, other segments. The number of segments involved in every non-overlapping list is a possible answer to the number-of-components problem.
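A possible reading of this segment-extraction step is sketched below: it follows runs of adjacent non-zero values along each row of the reordered matrix and tolerates up to THR consecutive zero values, but the exact boundary handling of Algorithm 1 may differ from this reconstruction.

```python
import numpy as np

def candidate_segments(R, thr=1):
    """Collect candidate square segments (start, size) along the diagonal of a reordered matrix.

    For every row, the run of adjacent non-zero values starting at the diagonal
    is followed; up to `thr` consecutive zero values are tolerated inside a run.
    """
    n = len(R)
    segments = set()
    for i in range(n):
        size, gap = 1, 0
        for j in range(i + 1, n):
            if R[i, j] != 0:
                size, gap = j - i + 1, 0
            else:
                gap += 1
                if gap > thr:
                    break
        segments.add((i, size))
    return sorted(segments)

R = np.array([
    [1.00, 0.41, 0.74, 0.00, 0.00],
    [0.41, 1.00, 0.18, 0.00, 0.00],
    [0.74, 0.18, 1.00, 0.25, 0.00],
    [0.00, 0.00, 0.25, 1.00, 0.80],
    [0.00, 0.00, 0.00, 0.80, 1.00],
])
print(candidate_segments(R))
# [(0, 3), (1, 2), (2, 2), (3, 2), (4, 1)]: the non-overlapping pair (0, 3) and (3, 2)
# covers all five entities and corresponds to two components.
```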
[Fig. 5: dot plot of the reordered linkage matrix of a larger document; both axes show the document entity ID.]
The second step in the identification process uses the notion of coverage ratio of a square segment. This ratio $R_i$ of square segment $i$ is defined as

$$R_i = \frac{S_{i,\mathrm{incl}}}{S_{i,\mathrm{incl}} + S_{i,\mathrm{excl}}}, \qquad (5)$$
Si,incl is the number of relationships that are included in the identified square segment. Si,excl is the number of relationships
that are excluded from the square. In Fig. 6 the coverage percentage of the indicated segment is 0.667. Due to the symmetry
of the linkage matrix, the analysis of half the matrix is sufficient. The upper triangle of the square segment, starting at ele-
ment 2 with a size of 3, contains 6 non-zero values. The total number of non-zero values covered by rows of the square seg-
ment is nine, indicated in Fig. 6 as hatched cells. For every identified square segment, this coverage percentage can be
calculated. As a result, for every non-overlapping list of square segments, the total coverage ratio can be calculated as the average of the coverage ratios of these segments. This score is situated in the range ]0,1]. The higher this fraction, the higher the chance is that the corresponding number of components will represent the actual number that can be distinguished in the document. The reasoning is the following: the larger the group of entities that is covered by a non-overlapping list of square segments, the better this number of square segments will indicate the coherent structure present in the document. These square segments are defined by the variance explained by the Fiedler vector, and therefore are clustered based on common properties. A value of one for the coverage indicates a full coverage of the entities by the related number of square segments; a value near zero indicates the opposite.
Fig. 6. Example of the coverage ratio calculation. The ratio for the hatched region in this example is 0.667.

The algorithm identifies two topics with high probability (ratio score of 1) for the example of Table 1. The noise relationship in the related Fig. 4 is the relationship between entities A and C based on term 4. The ratio score for the second possibility, which is one topic, is 0.6. For the example shown in Fig. 5, a number of square segments can be identified. The SSI-algorithm
returns 4 as the most likely number of components, followed by 6, 2 and 3. Less likely are 16 and 19.
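A rough sketch of the coverage-ratio computation of Eq. (5) and of the scoring of a non-overlapping list of segments follows. The exact counting conventions (for instance how boundary relationships and entity coverage are attributed) are not fully specified in the text, so the numbers this sketch produces may deviate slightly from the worked examples above; the function names are ours.

```python
import numpy as np

def coverage_ratio(R, start, size):
    """Eq. (5): fraction of a segment's relationships that fall inside its square.

    Only the upper triangle is inspected because the linkage matrix is symmetric.
    """
    end = start + size
    incl = excl = 0
    for i in range(start, end):
        for j in range(i + 1, len(R)):
            if R[i, j] != 0:
                if j < end:
                    incl += 1
                else:
                    excl += 1
    return incl / (incl + excl) if incl + excl else 0.0

def total_coverage(R, segments):
    """Average coverage ratio of a non-overlapping list of square segments."""
    return sum(coverage_ratio(R, s, n) for s, n in segments) / len(segments)

R = np.array([
    [1.00, 0.41, 0.74, 0.00, 0.00],
    [0.41, 1.00, 0.18, 0.00, 0.00],
    [0.74, 0.18, 1.00, 0.25, 0.00],
    [0.00, 0.00, 0.25, 1.00, 0.80],
    [0.00, 0.00, 0.00, 0.80, 1.00],
])
print(coverage_ratio(R, 0, 3))   # coverage of the upper-left segment of the reordered example
```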
As already introduced in the previous section, the actual partitioning of the coherence graph is based upon the ’minimal
cut approach‘. This approach results in the partitioning of the graph by removing those edges that have the lowest sum of
coherence quantification in order to obtain the previously obtained number of subgraphs. Given the graph G = (V, E), mini-
mizing a cut of a graph is mathematically similar to the identification of a non-trivial vector x that minimizes the following
function

$$F(x) = \sum_{(i,j) \in E} e_{i,j} (x_i - x_j)^2. \qquad (6)$$

In this objective function x is a bi-partition (A, B) vector with x_i = 1 if i ∈ A and x_i = -1 if i ∈ B. e_{i,j} ∈ E is the weight of the relation between vertices i and j. This objective function can be reformulated as

$$x^T L x \qquad (7)$$
with x the bi-partition vector and L the Laplacian matrix. This formula describes the relationship between the minimal cut of
the graph and its spectral properties. The solution of Eq. (7) is the so-called Fiedler vector [16], which is used in the previous
section. The corresponding eigenvalue is also called the 'algebraic connectivity' of the graph, and is greater than 0 if and only if the graph is
connected. A graph is called connected if every pair of distinct vertices in the graph can be connected through some sequen-
tial path of vertices.
The partitioning process is a two-phase partitioning or clustering process. Starting from the representation of the Lapla-
cian, the spectral or Jordan normal decomposition can be calculated. Selecting the k eigenvectors related to the smallest eigenvalues of the Laplacian matrix, with k equal to the identified number of components, a k-dimensional representation in
an inner product space is obtained [32]. This mapping of the original data into a new metric space expresses the alignment or
coherence of the graph, rather than its structure based upon the original similarity. The second phase is the actual clustering
step. Any clustering algorithm can be employed to obtain a partitioning of the coherence graph. In this research, the algo-
rithm of Ng, Jordan and Weiss [26] is applied to partition the coherence graph in multiple subgraphs. The clustering algo-
rithm used in this phase is the agglomerative hierarchical clustering algorithm using average linkage [19]. This type of
algorithm is known to perform well in a document clustering environment [14]. An overview of this process is given in
[36]. The resulting subgraphs can be converted into the required coherent components by reordering the document entities.
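The two-phase partitioning can be sketched with numpy and scipy as follows. The text above speaks of the eigenvectors related to the largest eigenvalues; for the Laplacian of Eq. (4) the standard Ng-Jordan-Weiss embedding uses the smallest ones, which is what this sketch assumes, together with the row normalization of that algorithm and scipy's average-linkage clustering. It is an illustrative reconstruction under these assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def spectral_partition(W, k):
    """Embed the coherence graph spectrally and cluster the document entities into k components."""
    d = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L = np.eye(len(W)) - d @ W @ d
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                                   # k eigenvectors of the smallest eigenvalues of L
    U = U / np.linalg.norm(U, axis=1, keepdims=True)     # row normalization as in Ng-Jordan-Weiss
    Z = linkage(U, method="average")                     # agglomerative clustering with average linkage
    return fcluster(Z, t=k, criterion="maxclust")

W = np.array([
    [0.00, 8.05, 1.83, 0.00, 1.12],
    [8.05, 0.00, 0.00, 0.00, 2.55],
    [1.83, 0.00, 0.00, 3.22, 0.00],
    [0.00, 0.00, 3.22, 0.00, 0.00],
    [1.12, 2.55, 0.00, 0.00, 0.00],
])
print(spectral_partition(W, 2))   # expected to separate {A, B, E} from {C, D}
```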
Applying the segmentation process on the example shown in Table 1 results in the following ordering: the entities A, B
and E are clustered together, while the entities C and D form the second topic. The relationship between entities A and C based on term 4 was thus not strong enough to cluster them together.
4. Validation
The validation process has two main objectives, in accordance with the two phases in the overall identification process.
The first test environments described in Section 4.3 aim to validate the novel technique of identifying the number of com-
ponents, as well as to compare the retrieval performance of the extracted components considering the known structure of
the documents. The second validation part compares the quality of the presented technique to two well-known subtopic
identification techniques: the TextTiling algorithm of Hearst [17] and the C99 algorithm of Choi [10]. These techniques were chosen based upon their consistent performance under various circumstances [15,33].
This second validation step merely presents the resemblances and the differences between the proposed topic identification technique and the selected subtopic identification techniques.
4.1. Datasets
For the experimental validation of these techniques, the following datasets were used to create the required test
environments:
The Reuters RCV1 test collection [21] is one of the most widely used collections for text categorization, containing Reuters articles assigned to various categories on different levels of detail.
4.1.1. Reuters
Six categories were selected from the Reuters RCV1 dataset with the following labels:
This selection is based upon the size and the number of documents present in each category.
4.1.2. Ohsumed
From the OHSUMED test collection [18], only the documents rated as definitely relevant to one of the queries were retained, as stated in the OHSUMED descriptions. This resulted in a document collection containing 101 topics (five queries returned no documents), with 1985 documents in total. All the documents in this dataset are truncated to a maximum of 250 words.
4.2. Preprocessing
The following steps were automatically performed on each document to obtain a suitable input format:
4.3. Test environments

Two test scenarios were constructed using the previously described standardized datasets OHSUMED and Reuters, resulting in four experimental setups. Both scenarios create artificial documents by concatenating original documents. The goal of these scenarios is to identify the topics present in the artificial documents.
The first test scenario is the sequential adding of a random number of randomly selected original documents. In Fig. 7, the
left document is a graphical representation of this construction process. This artificial document is composed of 5 document
parts, originating from three documents. The different document parts of these documents are added sequentially.
All document entities of the selected original documents are concatenated to form the artificial documents. The document
entities in this test environment are paragraphs delimited by blank lines in the original document. As the complexity of the
complete process is quadratic in the number of sentences, each paragraph is limited to ten sentences; larger paragraphs were split. Every newly constructed document contains a random number of categories of the applied dataset. The number of selected documents varies between two and ten. Duplicates and multiple documents from the same directory are
allowed. For each standardized dataset a collection of these artificial documents is created.
The second test scenario differs in the order in which the document entities are concatenated. In this test scenario the
document entities are concatenated in a random manner to effectively distribute the different topics present in the docu-
ment. This idea is shown in Fig. 7 as the right document. This document is also composed of 5 document parts originating
from three documents. The document parts are however added randomly to the document, in order to have recurring topics
in the content of the artificial documents. The other conditions of the first scenario also apply to this test scenario.
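A minimal sketch of the construction of the artificial documents is given below. The corpus format (category, paragraphs) and the function name are assumptions, and the sampling details (for example how categories are balanced) are simplified compared to the procedure described above.

```python
import random

def build_artificial_document(corpus, shuffle=False, rng=random):
    """Concatenate the paragraphs of 2 to 10 randomly selected documents.

    `corpus` is a list of (category, paragraphs) pairs.  With shuffle=False the
    paragraphs are appended per document in their original order (scenario 1);
    with shuffle=True the concatenated paragraphs are randomly permuted so that
    topics recur throughout the artificial document (scenario 2).
    """
    picks = [rng.choice(corpus) for _ in range(rng.randint(2, 10))]   # duplicates are allowed
    paragraphs = [(category, p) for category, paras in picks for p in paras]
    if shuffle:
        rng.shuffle(paragraphs)
    return paragraphs

corpus = [
    ("earn", ["Quarterly profits rose.", "Dividends were increased."]),
    ("crude", ["Oil prices fell.", "OPEC discussed output quotas."]),
]
print(build_artificial_document(corpus, shuffle=True))
```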
These two test scenarios result in four different test sets, two for each standardized dataset. Each generated dataset con-
tained 1000 artificial documents. The results of the topic identification techniques on these test sets are discussed in Section
5.
Comparing two types of topic identification techniques, identifying topics at different levels of refinement, can pose a
problem. Since the focus of the proposed set of techniques is a reordering of the document entities according to their spectral
properties, an adapted F-measure based validation is used. The original F-measure is a popular measure for evaluating a clus-
tering result [9]. This measure for a cluster i is a weighted harmonic mean of precision and recall of cluster i, and is math-
ematically defined as
$$F_i = \frac{2 \cdot \mathrm{recall}_i \cdot \mathrm{precision}_i}{\mathrm{recall}_i + \mathrm{precision}_i}. \qquad (8)$$
To calculate precision_i and recall_i, each cluster must be assigned to a known topic label. The adapted F-measure assumes a
cluster to represent a topic. The notion of topic in this validation measure is related to the category the document entity orig-
inates from. To obtain this measure the following scenario is used:
Find the largest group of document entities on the same topic within a document.
Label this group of entities as the group of entities covering that topic.
Mark this group as labeled, and the topic label as assigned.
Repeat with the remaining groups and topics that are left.
Check that every group is covering a different topic. If not, the group containing the smaller group of document entities on
the same topic is set to be relabeled to a different topic.
Repeat until all groups are covering a different topic.
Based upon this process, an F-measure value can be obtained for every processed artificial document [9].
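The greedy assignment of clusters to topic labels and the per-cluster F-measure of Eq. (8) can be sketched as follows. The handling of conflicting assignments is simplified compared to the relabeling rule described above, and averaging the per-cluster scores into one document-level value is an assumption; the function name is ours.

```python
from collections import Counter

def adapted_f_measure(clusters, true_labels):
    """Greedy cluster-to-topic assignment followed by the per-cluster F-measure of Eq. (8).

    `clusters` maps a cluster id to the list of entity indices it contains,
    `true_labels` gives the originating category of every document entity.
    """
    candidates = []
    for cid, members in clusters.items():
        topic, count = Counter(true_labels[m] for m in members).most_common(1)[0]
        candidates.append((count, cid, topic))

    assignment, used_topics = {}, set()
    for count, cid, topic in sorted(candidates, key=lambda c: c[0], reverse=True):
        if topic not in used_topics:          # larger groups claim their topic first
            assignment[cid] = topic
            used_topics.add(topic)

    scores = []
    for cid, topic in assignment.items():
        members = clusters[cid]
        tp = sum(1 for m in members if true_labels[m] == topic)
        precision = tp / len(members)
        recall = tp / sum(1 for t in true_labels if t == topic)
        scores.append(2 * recall * precision / (recall + precision) if tp else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

labels = ["earn", "earn", "crude", "crude", "earn"]
print(adapted_f_measure({1: [0, 1, 4], 2: [2, 3]}, labels))   # 1.0 for a perfect segmentation
```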
5. Results
In this section results are presented of the application of the three topic identification techniques to the four test envi-
ronments. For each artificial document of each of these test environments, the initial composition or topics are known.
Therefore comparisons can be made based on the identification of the topic boundaries. These boundaries are obtained from
the three topic identification techniques, thus indicating the functional differences between the two existing techniques and
the newly proposed topic identification technique. This difference is situated on the level of detail on which the presented
techniques identify topics, and the ability to identify non-continuous topics. The clustering technique used in the second
phase of the topic identification process was the unsupervised hierarchical agglomerative clustering method using average
linkage [19]. For every test environment the results are summarized as boxplots [34] in which the five represented numerical
values are defined as:
Smallest observation
First quartile Q1 (25%)
Median Q2
Third quartile Q3 (75%)
Largest observation
Observations that are identified as outliers are also indicated. An observation is considered a weak outlier if it is located between the bounding quartiles and 1.5 times the interquartile range Q3 − Q1. If it is located outside this range, it is considered a strong outlier.
Fig. 8 presents the boxplot of the obtained F-measures for the first Ohsumed test set. The indicator values determining
this boxplot are given in Table 2. A difference in average of 8% can be noted between the proposed technique and the tech-
nique of Choi. The range of obtained F-measures for the proposed technique is significantly smaller compared to the Choi and
TextTiling techniques. However, the boxplots of Choi and the proposed technique are positioned similarly along the range of F-measures, with a slightly better performance for the proposed technique. This indicates that the identification technique of Choi and the proposed technique do not differ significantly for this sequential test environment. Finally, no strong outliers are identified among the obtained F-measures for any of the tested techniques.
The results for the second sequential test environment are summarized in Fig. 9. The related figures determining this box-
plot are given in Table 3. Similar conclusions can be drawn as in the first sequential test environment considering the posi-
tion of the boxplot of the proposed technique compared to TextTiling and Choi. No significant difference in position of the
boxplot is notable between the three techniques. The differences in averages are considerably larger: the average of the proposed technique is 22.71% higher than that of Choi and 10.9% higher than that of TextTiling. Also, the range of the boxplot of the proposed technique is the smallest of the three obtained boxplots. Two observations were considered strong outliers, one each for TextTiling and the proposed technique.
Based on the results of these 2000 artificial documents, sequentially composed of the Reuters and Ohsumed datasets, the
conclusion can be drawn that the proposed technique performs similarly to or better than two well-known topic iden-
tification techniques in the context of identifying the topic boundaries.
The results of the second test scenario, in which artificial documents are composed in a random manner, are presented in this section. Fig. 10 presents the boxplot of the obtained F-measures for the first randomized Ohsumed test set, with the related scores displayed in Table 4.
The first significant difference between the three boxplots is the position of the boxplot of the proposed technique. The boxplot range of 0.4672 to 0.5817 is significantly higher than that of the other two boxplots. Also, the complete ranges of all three boxplots are notably larger than in the first test environments, the largest range being that of the presented technique. This is mainly due to the nature of the technique. Since some documents, chosen randomly out of the datasets without any content knowledge, contain mostly numerical information, the proposed technique fails to identify the expected structure. Choi and TextTiling are less error prone in this situation compared to the proposed technique, as indicated by the ranges of the boxplots. No strong outliers are identified during the identification processes.
[Fig. 8: boxplots of the obtained F-measures for Choi, TextTiling and TID on the sequential Ohsumed test environment.]
Table 2
Quality results of the sequential Ohsumed test environment.
[Fig. 9: boxplots of the obtained F-measures for Choi, TextTiling and TID on the second sequential test environment.]
Table 3
Quality results of the sequential test set.
[Fig. 10: boxplots of the obtained F-measures for Choi, TextTiling and TID on the random Ohsumed test set.]
Table 4
Quality results of the random Ohsumed test set.
The results of the last test environment, composed of Reuters articles, are summarized in Fig. 11, with the related scores of the boxplot shown in Table 5. These results are in line with the results of the previous random test environment: a clear distinction can be noted between the position of the boxplot of the presented technique and the position of the second-highest boxplot, that of Choi.

[Fig. 11: boxplots of the obtained F-measures for Choi, TextTiling and TID on the random Reuters test set.]

Table 5
Quality results of the random Reuters test set.
6. Conclusions
A novel technique was introduced to automatically identify the topics present in a document, based on the presence of
lexical chains. Since every topic can be related to a specific set of words, the coherence of a document can be quantified using
these lexical chains and the proposed similarity measure. The application of spectral graph partitioning techniques on the
coherence graph transforms the original input documents to a new metric space. This transformation has two benefits in
the topic identification process. First, the required number of topics can be identified based on the reordering of the Lapla-
cian matrix of the graph. Second, a clustering technique can be applied on this new metric space in order to obtain the re-
quired number of components, each representing a topic.
This topic identification technique differs from existing approaches because these current techniques identify subtopic
document parts. Two large experiments based on standardized datasets were performed to validate these techniques. The
results indicate that the proposed technique achieved similar or better subtopic identification results in the sequential test scenario, whereas in the randomized test scenario it outperforms the two other subtopic identification techniques.
This last result indicates that the application of these graph techniques enables the identification of non-contiguous topics in
a document.
References
[1] M. Amini, H. Zaragoza, P. Gallinari, Learning for sequence extraction tasks, Content-Based Multimedia Information Access (2000) 476–490.
[2] R. Angheluta, R.D. Busser, M.-F. Moens, The use of topic segmentation for automatic summarization, in: Workshop on Text Summarization in
Conjunction with the ACL 2002 and including the DARPA/NIST sponsored DUC 2002 Meeting on Text Summarization, 2002, pp. 11–12
[3] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
[4] R. Barzilay, M. Elhadad, Using lexical chains for text summarization, in: Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization,
1997, pp. 10–17.
[5] B. Bigi, R. de Mori, M. El-Béze, T. Spriet, A fuzzy decision strategy for topic identification and dynamic selection of language models, Signal Processing
80 (6) (2000) 1085–1097.
[6] M. Caillet, J.-F. Pessiot, M.-R. Amini, P. Gallinari, Unsupervised learning with term clustering for thematic segmentation of texts, in: Proceedings of
Seventh Conference on Recherche d'Information Assistee par Ordinateur, 2004, pp. 648–656.
[7] Y. Chali, Topic detection of unrestricted texts: Approaches and evaluations, Applied Artificial Intelligence 19 (2) (2005) 119–135.
[8] L. Chen, J. Zeng, N. Tokuda, A stereo document representation for textual information retrieval, Journal of the American Society for Information Science and
Technology 57 (6) (2006) 768–774.
[9] T.Y. Chen, F.-C. Kuo, R.G. Merkel, On the statistical properties of the f-measure, QSIC (2004) 146–153.
[10] F.Y.Y. Choi, Advances in domain independent linear text segmentation, Proceedings of NAACL (2000) 26–33.
[11] C. Clifton, R. Cooley, J. Rennie, Topcat: Data mining for topic identification in a text corpus, IEEE Transactions on Knowledge and Data Engineering 16
(8) (2004) 949–964.
[12] J. D’hondt, Clustering Techniques in Knowledge Management: Advances and Applications. Ph.D. thesis, Katholieke Universiteit Leuven, Leuven,
Belgium, 2011.
[13] J. D’hondt, P. Verhaegen, J. Vertommen, D. Cattrysse, J. Duflou, Near-duplicate detection based on text coherence quantification, in: Proceedings of the
10th European Conference on Knowledge Management, 2009, pp. 238–246.
[14] J. D’hondt, J. Vertommen, P.-A. Verhaegen, D. Cattrysse, J.R. Duflou, Pairwise-adaptive dissimilarity measure for document clustering, Information
Science 180 (2010) 2341–2358.
[15] G. Dias, E. Alves, J.G.P. Lopes, Topic segmentation algorithms for text summarization and passage retrieval: an exhaustive evaluation, AAAI’07:
Proceedings of the 22nd National Conference on Artificial Intelligence, AAAI Press, 2007, pp. 1334–1339.
[16] M. Fiedler, Algebraic connectivity of graphs, Czechoslovak Mathematical Journal 23 (1973) 298–305.
[17] M.A. Hearst, Texttiling: segmenting text into multi-paragraph subtopic passages, Computational Linguistics 23 (1) (1997) 33–64.
[18] W. Hersh, C. Buckley, T.J. Leone, D. Hickam, Ohsumed: an interactive retrieval evaluation and new large test collection for research, in: SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag New York, New York, NY, USA, 1994, pp. 192–201.
[19] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Computational Survey 31 (3) (1999) 264–323.
[20] M. Galley, K. McKeown, Discourse segmentation of multi-party conversation, in: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003, pp. 562–569.
[21] D. Lewis, Y. Yang, T. Rose, F. Li, RCV1: a new benchmark collection for text categorization research, Journal of Machine Learning Research 5 (2004) 361–397.
[22] I. Malioutov, R. Barzilay, Minimum cut model for spoken lecture segmentation, in: Proceedings of the 21st International Conference on Computational
Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2006, pp. 25–32.
[23] M.-F. Moens, R.D. Busser, Generic topic segmentation of document texts, SIGIR ’01: Proceedings of the 24th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2001, pp. 418–419.
[24] J. Morris, G. Hirst, Lexical cohesion computed by thesaural relations as an indicator of the structure of text, Computational Linguistics 17 (1) (1991) 21–
48.
[25] M.E.J. Newman, Power laws, pareto distributions and zipf’s law, Contemporary Physics 46 (2005).
[26] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems, vol. 14, MIT Press,
2001, pp. 849–856.
[27] R.J. Passonneau, D.J. Litman, Discourse segmentation by human and automated means, Computational Linguistics 23 (1997) 103–139.
[28] J.M. Ponte, W.B. Croft, A language modeling approach to information retrieval, Proceedings of the 21st Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR ’98, ACM, New York, NY, USA, 1998, pp. 275–281.
[29] J.C. Reynar, Statistical models for topic segmentation, in: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on
Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, 1999, pp. 357–364.
[30] J. Shi, J. Malik, Normalized cuts and image segmentation, Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR ’97),
IEEE Computer Society, Washington, DC, USA, 1997, p. 731.
[31] L. Sitbon, P. Bellot, Topic segmentation using weighted lexical links (wll), SIGIR ’07: Proceedings of the 30th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, 2007, pp. 737–738.
[32] D.B. Skillicorn, Understanding Complex Datasets: Data Mining with Matrix Decompositions, Chapman & Hall, CRC, 2007.
[33] N. Stokes, Spoken and written news story segmentation using lexical chains, in: NAACL ’03: Proceedings of the 2003 Conference of the North American
Chapter of the Association for Computational Linguistics on Human Language Technology, Association for Computational Linguistics, Morristown, NJ,
USA, 2003, pp. 49–54.
[34] J.W. Tukey, Exploratory Data Analysis, Addison-Wesley, 1977.
[35] M. Utiyama, H. Isahara, A statistical model for domain-independent text segmentation, in: Proceedings of the Ninth Conference of the European
Chapter of the Association for Computational Linguistics, 2001, pp. 491–498.
[36] U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (4) (2007) 395–416.
[37] M.-Y. Kan, J.L. Klavans, K.R. McKeown, Linear segmentation and segment significance, in: Proceedings of the 6th International Workshop on Very
Large Corpora, 1998, pp. 197–205.