
Asynchronous Text Sequences Mining

Mrs Sheema Khan

Department of Computer Science and Engineering, Everest Educational Societies College of Engineering & Technology,
Aurangabad, 431001 India.

Abstract
In this paper we try to correlate text sequences that provide common topics for semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for the common topics in the sequences and isolates them with their timestamps. Step two takes the topics and tries to assign a timestamp to each text document. After multiple repetitions of step two, an optimum result is obtained.

Keywords: Correlation, Asynchronous sequences, Timestamps

1. Introduction

In today's scenario, a vast amount of text material is being generated. The sources of text generation are varied: the Internet, application packages, research investigation reports, communication, entertainment, etc. We can easily get lost searching for knowledge in these text sources. Hence a method is required for discovering knowledge from this enormous body of text. In this paper we formulate a method to extract knowledge from the available text sequences. This method of discovering knowledge from text is achieved in two steps. In step one, we try to determine the distribution of word intensity by learning the meaning of the topic to be mined, and in step two, the distribution is refined by using a time distribution method.

In reality, many topics and text sequences are correlated. A semantic topic and the comprehensive topic are discovered by the interaction of multiple sequences, rather than a single or individual stream. According to recent work, the same time distribution is shared by the common topics over different sequences; that is, different sequences are usually assumed to be synchronous in time. But multiple sequences that contain asynchronism are actually very common in practice. For example, it is not certain that articles from different news feeds on the same topic are indexed with the same time stamps, because news agencies differ by delays of hours, newspapers by days, and periodicals by weeks.

We propose an effective method to solve the problem of mining common topics from multiple asynchronous text sequences. We formally define a principled framework and the problem, based on which a unified objective function can be derived. To optimize the objective function, an algorithm is defined; this algorithm exploits the mutual impact between topic discovery and time synchronization.

The key point is to use the correlation between sequences, which is temporal and meaningful, to build up a mutually reinforcing process. First, we extract common topics, using their time stamps, from a given set of sequences. Second, we update the time stamp of each document by matching it to the most closely connected topic, on the basis of the extracted topics and their word distributions. This step reduces the asynchronism among sequences. According to the new time stamps, the common topics are refined after synchronization. These steps are repeated alternately to maximize the unified objective function, which provably converges monotonically.

The main points of our work are:
- Mining common topics from multiple text sequences.
- An objective function which introduces a principled and probabilistic framework to formalize our problem.
- An optimization algorithm, developed to maximize the objective function, which is guaranteed to reach an optimum.

2. Literature Survey

Yuekui Yang et al. [1] parsed every web page as a Dom-Tree. They proposed some rules in the tree aiming at extracting the relationships among different paragraphs, and then presented a new topic-specific web crawler which calculated the prediction score of unvisited URLs based on the web page hierarchy and the text semantic similarity.
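The alternating two-step process described above can be sketched as a short loop. The toy below is an illustration under our own assumptions (document and topic representations, helper names, and smoothing are hypothetical, and the order-preserving constraint introduced later in the paper is omitted): step one builds a topic per time slot, step two moves each document to the slot whose topic explains it best, and the two steps repeat until the log-likelihood objective stops improving.

```python
from collections import Counter, defaultdict
import math

# Toy corpus: two "sequences" covering the same two topics, but the
# second sequence lags one time slot behind the first (asynchronism).
# Each document is (sequence_id, timestamp, words).
docs = [
    (0, 0, ["election", "vote", "poll"]),
    (0, 1, ["match", "goal", "score"]),
    (1, 1, ["election", "vote", "ballot"]),  # lags: same topic as slot 0
    (1, 2, ["match", "goal", "win"]),
]

def extract_topics(docs):
    """Step 1: assuming the time stamps are synchronous, build one
    topic (a smoothed word distribution) per time slot."""
    counts = defaultdict(Counter)
    for _, t, words in docs:
        counts[t].update(words)
    vocab = {w for _, _, ws in docs for w in ws}
    topics = {}
    for t, c in counts.items():
        total = sum(c.values()) + len(vocab)  # add-one smoothing
        topics[t] = {w: (c[w] + 1) / total for w in vocab}
    return topics

def log_lik(words, dist):
    return sum(math.log(dist[w]) for w in words)

def synchronize(docs, topics):
    """Step 2: move each document to the time slot whose topic
    explains its words best."""
    return [(s, max(topics, key=lambda t: log_lik(ws, topics[t])), ws)
            for s, _, ws in docs]

def objective(docs, topics):
    return sum(log_lik(ws, topics[t]) for _, t, ws in docs)

prev = float("-inf")
while True:  # alternate the two steps until the objective converges
    topics = extract_topics(docs)
    docs = synchronize(docs, topics)
    cur = objective(docs, topics)
    if cur <= prev + 1e-9:
        break
    prev = cur

print(sorted(set(t for _, t, _ in docs)))  # → [0, 2]
```

After convergence the lagging documents have been pulled into the same slots as their topical twins, so only two synchronized slots remain in use.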
They calculated the text similarity using the vector space model (VSM), which considers the query or paragraph as a vector in which the terms are independent, and connected different paragraphs in a web page according to their hierarchy in the Dom-Tree. Xiaofeng Liao et al. [2] considered the problem of modeling the topics in a sequence of images with known time stamps. Detecting and tracking temporal data is an important task in multiple applications, such as finding hot research points in scientific literature, news article series analysis, email surveillance, search query log mining, etc. Besides collections of text documents, they also considered mining temporal topic trends from image data sets. Chenguha Lin et al. [3] proposed a joint sentiment-topic (JST) model based on latent Dirichlet allocation [7], which detects sentiment and topic simultaneously from text. JST is equivalent to Reverse-JST without a hierarchical prior. Neustein [4] showed how sequence package analysis is informed by algorithms that can work with, rather than be hindered by, less-than-perfect natural speech for intelligent mining of doctor-patient recordings and blogs. Watts et al. [5] throw light on an organization's knowledge gained through technical conferences. They worked out that there are processes where the knowledge gains are limited to the experiences and communication skills of the individuals attending the conference.

Many conference proceedings are published and provided to attendees in electronic format, such as on CD-ROM, and/or published on the internet, such as IEEE conference proceedings. These proceedings provide a rich repository that can be mined. They compiled the hot topics reflected there as defined by the researchers in the field and delineated the technical approaches being applied. As per their work, R&D profiling can be more fully exploited through the research recorded in conference proceedings to enhance corporate knowledge. They illustrated in their paper the potential in profiling conference proceedings through use of WebQL information retrieval and TechOasis (Vantage Point) text mining software, by showing how tracking research patterns and changes over a sequence of conferences can illuminate R&D trends, map dominant issues, and spotlight key research organizations. Walenz et al. [6] described a sequencer system for the temporal analysis of named entities in news articles, between media-reported stories and user-generated content. They explored the evolution of social contexts with time, which can provide unique insights into human social dynamics.

Wenfeng Li et al. [7] used the user's web browsing history, which can be mined. They presented an innovative method to extract a user's interests from his/her web browsing history. They applied an algorithm to extract useful texts from the web pages in the user's browsed URL sequence [10]. Unlike other works that need a lot of training data to train a model to adopt supervised information, they directly introduced raw supervised information into the procedure of LLDA-TF. In their paper, Fotiadis et al. [8] presented a methodology for biosequence classification, which employs sequential pattern mining and optimization algorithms. In the first stage, a sequential pattern mining algorithm is applied to a set of biological sequences and the sequential patterns are extracted. Then, the score of each pattern with respect to each sequence is calculated using a scoring function, and the score of each class under consideration is estimated. The scores of the patterns and classes are updated, multiplied by a weight. In the second stage, an optimization technique is employed to calculate the weight values that achieve the optimal classification accuracy. The methodology was applied to the protein class and fold prediction problem, and extensive evaluation was carried out using a dataset obtained from the Protein Data Bank. Itoh [9] gave a contextual analysis processing technique, consisting in determining the context understanding, together with coherences in sentences, of concepts and phenomena related to each other, which must be able to simultaneously and accurately interpret a sequence of multiple semantic representations. By applying a semantic analysis of co-occurrence expressions, based on the use of phrases having an absolute evaluation polarity, he developed a system capable of detecting the role relations between words and the relationships of meaning in a sentence, identifying transitions in the topic, anaphora, and endophora, and analyzing even idiomatic expressions and textual emoticons. The system correctly evaluated the "positive" or "negative" nuance for 75.0% of those expressions. Subasic et al. [11] proposed a method and visualization tool for mapping and interacting with stories published in web pages and articles. In contrast to existing approaches, their method concentrated on relational information and on local patterns rather than on the occurrence of individual concepts and global models. They also presented an evaluation framework.
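The vector space model used in [1] can be illustrated in a few lines: each query or paragraph becomes a term-frequency vector whose terms are treated as independent, and similarity is the cosine of the angle between two vectors. This is a generic sketch of the standard VSM, not the cited authors' code.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector; terms are treated as independent."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = tf_vector("topic mining from text")
para1 = tf_vector("mining common topics from text sequences")
para2 = tf_vector("protein fold classification")

# The paragraph sharing more terms with the query scores higher.
assert cosine(query, para1) > cosine(query, para2)
```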
Sekiya et al. [13] proposed that a word sequence can be used to identify context. Both the contexts identified by word sequences and the word sets related to those contexts are shown concretely. They used the confabulation model and five statistical measures as relations. Comparing the measures, they found that cogency and mutual information were the most effective. Creamer et al. [14] analyzed the relationship between asset return, volatility and the centrality indicators of a corporate news network by conducting a longitudinal network analysis. They built a sequence of daily corporate news networks using companies of the STOXX 50 index as nodes, with the weight of each edge being the number of news items on the same topic shared by every pair of companies, as identified by the topic model methodology. They performed the Granger causality test and the Brownian distance covariance test of independence among several measures of centrality, return and volatility. They found that the average eigenvector centrality of the corporate news networks at different points of time has an impact on the return and volatility of the STOXX 50 index. Likewise, the return and volatility of the STOXX 50 index also had an effect on average eigenvector centrality.

Perez et al. [15] proposed an architecture for the integration of a corporate warehouse of structured data with a warehouse of text-rich XML documents. Yanpeng Li et al. [16] presented a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, example-distinguishing features (EDFs) and class-distinguishing features (CDFs), from the original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. EDFs with extreme sparsity in labeled data can thereby be enriched by their co-occurrences with CDFs in unlabeled data, so that the performance of these low-frequency features is greatly boosted and new information from unlabeled data is incorporated. Navigli et al. [17] presented a method, called structural semantic interconnections (SSI), which creates structural specifications of the possible senses for each word in a context and selects the best hypothesis according to a grammar G describing relations between sense specifications. They created sense specifications from several available lexical resources that they integrated in part manually, in part with the help of automatic procedures. The SSI algorithm was applied to different semantic disambiguation problems.

3. Proposed Model

Fig. a: Proposed generative model for asynchronous text mining, relating Document (D), Timestamp (Ts), Topic (Tc) and Word (W), replicated over the number of sequences (Ns) and the number of topics (Nt).

The process as per Fig. a is as follows:
1. Select a document D from its source, like a website, data warehouse, etc.
2. A time stamp Ts for that document is selected. This means that for every document, only one time stamp is associated.
3. The next step is to select the common topic Tc and then draw the word W from that topic.

Conventional methods of topic mining try to maximize the likelihood function L by adjusting the probabilities of topics and words, assuming the probability of the timestamp is known. However, in our work, we need to consider the potential asynchronism among different sequences. Thus, besides finding optimal probabilities of topics and words, we also need to decide the probability of the timestamp in order to further maximize the likelihood function. In other words, we want to assign a document with time stamp Ts to a new time stamp by determining its relevance to the respective topics, so that we can obtain a larger L, or equivalently, topics with better quality. By the term asynchronism, we refer to the time distortion among different sequences. The relative temporal order within each individual sequence is still considered meaningful and generally correct. Therefore, during each synchronization step, we preserve the relative temporal order of documents within each individual sequence: a document with an earlier time stamp before adjustment will never be assigned a later time stamp after adjustment than its successors. This constraint aims to protect the local temporal information within each individual sequence while fixing the asynchronism among different sequences.

3.1 Algorithm

In this section, we show how to optimize our objective function through an alternate (constrained) optimization scheme. Our algorithm has two steps. The first one assumes that the current time stamps of the sequences are synchronous and extracts common topics from them. The second step synchronizes the time stamps of all documents by matching them to their most related topics, respectively. Then, we go back to the first step and iterate until convergence.

3.2 Topic Extraction
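A minimal sketch of this generative process and the likelihood L it induces is given below. The distributions, topic names, and helper functions are illustrative assumptions for the sketch, not the paper's actual parameterization; re-assigning a document to the time stamp that maximizes L is the synchronization move described in the text.

```python
import math
import random

random.seed(0)

# Illustrative parameters (assumed for this sketch, not fitted values):
# p(topic Tc | timestamp Ts) and p(word W | topic Tc).
p_topic_given_ts = {
    0: {"politics": 0.9, "sports": 0.1},
    1: {"politics": 0.2, "sports": 0.8},
}
p_word_given_topic = {
    "politics": {"vote": 0.6, "election": 0.4},
    "sports":   {"goal": 0.7, "match": 0.3},
}

def sample(dist):
    """Draw one item from a {item: probability} dict."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point round-off

def generate_document(ts, n_words=3):
    """Fig. a, steps 1-3: given a document's time stamp Ts, draw a
    topic Tc from p(Tc|Ts), then emit words W from p(W|Tc)."""
    topic = sample(p_topic_given_ts[ts])
    return [sample(p_word_given_topic[topic]) for _ in range(n_words)]

def log_likelihood(words, ts):
    """log L of one document at slot ts: sum_W log sum_Tc p(Tc|ts) p(W|Tc)."""
    return sum(
        math.log(sum(p_tc * p_word_given_topic[tc].get(w, 0.0)
                     for tc, p_tc in p_topic_given_ts[ts].items()))
        for w in words)

doc = generate_document(ts=0)

# The synchronization move: re-assign the document to the time stamp
# that gives it the largest likelihood L.
best_ts = max(p_topic_given_ts, key=lambda ts: log_likelihood(doc, ts))
```

With the parameters above, a document generated at slot 0 is most likely to contain politics words, so the likelihood-maximizing slot recovers its original time stamp.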
We assume the current time stamps of all sequences are already synchronous and extract common topics from them. Our algorithm is summarized below. K is the number of topics, specified by the user. The initial values of the time stamps and the objective function are computed from the original time stamps in the sequences.

3.3 Algorithm: Topic mining with time synchronization

Input: K, time stamps, objective function
Output: words, topics, time stamps

Initialize: topic and word values with random numbers
Repeat
    Update words with time stamps and objective function
    Repeat
        Update word and topic values
    Until convergence
    For m = 1 to M do  (M is the number of steps)
        For u = 1 to T do
            Initialize objective function
            For v = 2 to T do
                For w = 1 to T do
                    Compute objective function
                End
            End
        End
        Update time stamps
    End
Until convergence

3.4 Constraint on Time Synchronization

We assumed asynchronism in the given sequences: the time stamps are distorted, but the sequential information between documents is correct. This assumption is based on observations from real-world applications: news stories published by different news agencies may vary in their absolute time stamps, but their sequential information conforms to the order of the occurrences of the events.

We argue that the second option works better in practice, since real-world data sets are not perfect. Although we assume that the sequential information of the given sequences is correct in general, there will still be a small number of documents that do not conform to our assumption. Our iterative updating process and the relaxed constraint will help recover these outlying documents and assign them to the correct topics.

3.5 Convergence

Our objective function will converge to a local optimum after iterations. Notice that there is a trivial solution to the objective function, which is to assign all documents to an arbitrary time stamp, and our algorithm would terminate at this local optimum. This local optimum is apparently meaningless, since it is equivalent to discarding all temporal information of the text sequences and treating them like a collection of documents. Nevertheless, this trivial solution only exists theoretically. In practice, our algorithm will not converge to this trivial solution, as long as we use the original time stamps of the text sequences as initial values and have a sufficient number of topics.

3.6 The Local Search Strategy

In some real-world applications, we can have a quantitative estimation of the asynchronism among sequences, so it is unnecessary to search the entire time dimension when adjusting the time stamps of documents. This gives us the opportunity to reduce the complexity of the time synchronization step without causing substantial performance loss, by setting an upper bound on the difference between the time stamps of documents before and after adjustment in each iteration. Specifically, given a document D with time stamp Ts, we now look for an optimal topic within the neighborhood of the current topic.

4. Conclusion

Our first aim is to extract common topics from multiple sequences which are asynchronous. We propose a method which extracts common topics while fixing the potential asynchronism among sequences. A self-improvement process is introduced by utilizing the correlation between the semantic and temporal information in the sequences. A unified objective function is optimized by extracting common topics and synchronizing time alternately. The most likely results are guaranteed by our algorithm. Compared with two baseline methods:

From asynchronous text sequences we extract meaningful and discriminative topics.

Quality and quantity are maintained by our method.

The performance of our method is robust against random initialization and parameter settings.

References
