
Scoring Based Unsupervised Approach to Classify Research Papers
Anil Kumar K.M (a), Gagan S G (b), Rajasimha N (c), Anil B (d), Rajath Kumar U (e)
Department of CS & E, Sri Jayachamarajendra College of Engineering, Mysore-570006
(a) [email protected]  (b) [email protected]  (c) [email protected]  (d) [email protected]  (e) [email protected]

Abstract— Platforms for publishing research papers are increasing rapidly and contribute to big data: their volume is enormous and they are unstandardized. Classification of this huge body of data is one of the biggest challenges in Information Retrieval. In this paper we discuss a scoring based unsupervised learning approach to extract relevant features and classify research papers according to their content on a two-class dataset. Feature extraction is carried out by analyzing the sections of a research paper and scoring them, followed by a two-level hybrid classification technique based on title and conceptual summary using graph clustering. Promising experimental results are observed for the dataset using our research paper classification method. We present the experimental results of our proposed algorithm for feature extraction and clustering, and compare them with different approaches reported in the literature.

Keywords – Research Paper (RP), Section Score, Spanning Tree, Markov Graph Clustering, Conceptual Summary

I. INTRODUCTION

A Research Paper is a published academic work that contains original research results or reviews existing results. As a result, research papers contribute to Big Data, since their volume is huge and they are unstandardized. Before a research paper is published, it has to be categorised and reviewed.

"Publish or perish" is a common phrase used to describe the pressure researchers feel to publish their research findings in order to stay relevant and be successful within the academic community. In order to remain successful in academia, each researcher publishes more and more articles every year [1]. Research papers can be submitted through conferences or online to publishers, and are analysed and categorised before they are published. Subject classification of publications is mostly done manually, according to the keywords authors provide. Some journals and proceedings ask authors to select one or more subjects from a list when submitting the paper. Classifying a large collection of scientific resources with regard to a set of subjects is an error-prone and time-consuming task [3]. Furthermore, the automatic classification of a chunk of scientific text is a topical task due to the rapid growth in the volume of published material, as well as the ongoing shift from paper-based publishing to e-publishing [1]. Classification of research papers is of prime importance for retrieving and reviewing the required papers from a huge collection of papers.

A research paper, unlike a normal web document, has some predefined sections and is highly structured. Research papers are usually published in a strict format comprising titles, abstracts and then full texts that are highly structured [1]. However, the template usually varies from one publication to another. Primary sections of a research paper include the Abstract, Introduction, Method, Results, Conclusion and Reference List. The Abstract is a summary which conveys a complete synopsis of the paper. The Introduction consists of the problem outline and how it can be solved. The Method section contains the design and methodology used in experimentation. The Results section presents numerical results and data for the experiments conducted. The Conclusion contains inferences and perspectives for future work. The Reference List contains citations for articles or other materials referenced in the text of the article. We make use of the fact that all research papers have these common sections when conducting our experiments.

A different approach has to be taken when we realize that scientific texts and research papers are structured into different sections. As the titles and abstracts usually present a concise summary of the content of publications, it is natural to use them for classification. On the other hand, full texts provide information that is not present anywhere else, and there is a question of whether and to what extent this could be used [1]. Along with this, we affirm that not all parts of a document are written equal. Comparing titles, abstracts and conclusions: authors write a few words that concisely describe their work as the title, a few paragraphs that outline their work as the abstract, and finally a paragraph or two to summarize the work as the conclusion. These are very different and unequal parts of the same document. Moreover, most research papers contain scientific text, which is rather different from everyday-language text in a number of aspects.

Therefore, there is a need for studies that address these differences and eventually develop methods and algorithms suitable to capture them while classifying research papers. The main focus of this paper is on two aspects:

978-1-5090-5256-1/16/$31.00 ©2016 IEEE  505
1) Analyzing which section in a research paper (henceforth referred to as RP) has the highest influence on the subject of the RP.
2) The classification of RPs collected from a variety of web sources by grouping similar RPs into one class/category.

The rest of this paper is organized as follows. Section 2 discusses the related work on the subject. In Section 3, we explain the methodologies used to solve the problem. We present our experimental setup in Section 4. Section 5 comprises the evaluation and tabulation of results. Section 6 compares our results with the results of other research in the field. Finally, in Section 7 we conclude and discuss future work.

II. RELATED WORK

A considerable amount of research has been done in the past on classifying web documents, reviews and other texts by the Information Retrieval community. In this section we discuss the different approaches researched for classifying Research Papers.

A. General IR Approaches for document clustering:

Standard Information Retrieval systems such as Vector space (Salton, Wong, and Yang, 1975) or probabilistic (Robertson and Sparck-Jones, 1976) approaches, which are extensively used in document clustering, rank/score documents without regard to term location [1]. In other words, a term found in an abstract is treated the same as a term found elsewhere. Moreover, the occurrence of a term in any section is treated the same regardless of the section where it occurs or the frequency of its occurrence in that particular section. The document structure is ignored even though authors write documents with structure.

B. The Importance of Term Location:

Vaidas et al., in their experiments with classification of mathematical papers from the field of probability theory and mathematical statistics [1], point out the above mentioned problem, quoting "By selecting only a part of scientific terms from the text and using stationary distributions P(.) and P(.|.) we ignore the context between the terms as well as the location of these terms in the text", and also suggest a few methods for taking the location of terms in the document into account. A naive choice for obtaining the location could be measuring the distance from the beginning of the article using a sequential number of term, word, sentence and paragraph, and this is the method they employ. However, they opine that the logical or structural part of the article (abstract, main results, proof) is another strong option. But this option was not used in their research because the structural elements of the articles were unidentifiable in the data they used. They state that these shortcomings may be overcome by developing and implementing automatic algorithms for identification of the structural parts of the paper. They present procedures for making use of auxiliary contextual information for the classification; these procedures formalize the heuristic reasoning of different discriminative weight depending on term location [1]. We exploit the location of terms by automatically identifying and segregating their structural parts.

C. Scoring Technique:

Andrew Trotman et al. provide an insight pertaining to the topic of ranking RPs based on location. They observe that existing ranking schemes such as Vector space, probabilistic, and Okapi BM25 ranking treat all term occurrences in a given RP as being of equal influence. Intuitively, terms occurring in some places should have a greater influence than those elsewhere: an occurrence in an abstract may be more important than an occurrence in the body text. Although this observation is not new, there remains the issue of finding good weights for each structure [2]. Andrew et al. focus on a supervised learning technique to address this issue. We develop our own algorithm for scoring the sections dynamically, based on the content of the research paper.

D. Semantic Orientation of Terms:

Turney presents a simple unsupervised method to classify user reviews using the contextual information that adjectives and adverbs provide, by extracting pairs of words (bigrams) containing adjectives/adverbs. We extend this concept of exploiting the context in which a term is used, using Turney phrases or patterns to build our bag of phrases representing each RP [4].

E. Citation Based Approaches:

Bader Aljaber et al. investigate the power of the citation-specific words that are usually present in the references of RPs, and compare them with the original document's textual representation. Their hypothesis is that citation-specific contexts contain useful synonyms and related terms that can be used to boost the accuracy of the similarity calculation between documents. This helps to increase the effectiveness of the bag-of-words representation. They perform the document clustering task on two collections of labeled scientific journal papers from two distinct domains, High Energy Physics and Genomics, and explore a technique for capturing synonymous and related terms. Their results show that using citation terms can improve document clustering accuracy when combined with a traditional full-text representation of the document. This improvement becomes significant when documents are characterised at a general (rather than a specific) level of topic granularity [5].

Mohsen Taheriyan et al. try to find out to which subject(s) a given paper is more related, using a supervised approach for subject classification of scientific articles based on analysis of interrelationship links such as citations, common authors, and common references to assign subjects to papers. They first build a graph of relationships in which nodes represent papers and links represent relations such as citations, common authors, and common references [3]. Our approach, in contrast, is to consider all the sections of the paper in our experimentation.
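As a concrete illustration of the Turney-style bigram extraction that our bag of phrases builds on, the following is a simplified sketch, not the exact extraction rules used in the paper. It assumes tokens have already been POS-tagged with the Penn Treebank tag set (JJ = adjective, NN/NNS/NNP = noun forms):

```python
# Simplified sketch of Turney-style bigram extraction (illustrative only;
# the filtering rules here are an assumption, not the paper's exact rules).

def extract_bigrams(tagged_tokens):
    """Extract Adjective-Noun and Noun-Noun bigrams from (word, POS) pairs."""
    nouns = {"NN", "NNS", "NNP"}
    bigrams = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        # Keep pairs whose second word is a noun and whose first word is
        # an adjective or another noun (e.g. "black hole").
        if t2 in nouns and (t1 == "JJ" or t1 in nouns):
            bigrams.append((w1.lower(), w2.lower()))
    return bigrams

if __name__ == "__main__":
    tagged = [("A", "DT"), ("black", "JJ"), ("hole", "NN"),
              ("is", "VBZ"), ("a", "DT"), ("place", "NN"),
              ("in", "IN"), ("space", "NN")]
    print(extract_bigrams(tagged))  # [('black', 'hole')]
```

In practice the tagging itself would come from a POS tagger such as NLTK's, as described in the Tools section.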

506 2016 2nd International Conference on Contemporary Computing and Informatics (ic3i)
Our approach differs from the above mentioned studies in that we consider all the sections in the RP and rank them based on their influence. We build a bag of phrases (BOP) whose bigrams capture the context of the terms, score each term based on its location, and finally use graph clustering algorithms to analyze and cluster the RPs.

III. METHODOLOGY

A. Procedure:

We apply unsupervised learning approaches for classifying Research Papers based on their content. Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabelled data. Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence.

1) Structuring of the data set: The research papers are processed to eliminate junk characters and words. A script is developed to parse the text and segregate the different sections present in it. The obtained sections are further grouped into two categories: primary and secondary sections. Primary sections are those that have more contextual influence on the topic of discussion than secondary sections. Primary sections consist of Abstract, Introduction, Keywords, Conclusion (and future work) and References. The rest of the sections, such as Literature Survey, Experiments, Results and Appendix, are secondary sections.

2) Feature Extraction:

a. Importance of Adjective-Noun/Noun-Noun pairs (bigrams) in determining the context: Context plays a very important role in rich feature extraction, i.e. in building the bag of phrases (BOP). For example, consider the sentence "However, having a controlled vocabulary of keywords fixed, one can treat the assignment of keywords as the classification of a paper". Here the words "vocabulary" and "controlled" behave very differently when used in different contexts, but when used together they give a holistic view of what the author says: the word "controlled" is an adjective qualifying the word "vocabulary". Past work has demonstrated that adjectives are good indicators of subjective, evaluative sentences [4]. Furthermore, there are Noun-Noun pairs that behave differently in different contexts. For example: "A black hole is a place in space where gravity pulls so much that even light can not get out." Here the noun-noun pair "black hole" prompts us to consider that the phrase is used in a context related to astrophysics, whereas the words "black" and "hole" on their own have completely different behaviors in different contexts. We consider these pairs (bigrams) while scoring sections dynamically and building our BOP.

b. Section Scoring: Section scoring refers to the quantitative measure of a section's influence on the subject of a research paper. This may vary from one paper to another; hence we calculate the section score dynamically for each paper, using a probabilistic frequency based technique.

Algorithm 1: Calculation of Section Score.

Step 1: Apply Part Of Speech (POS) tagging to all research papers to remove the stop words and to extract nouns and adjectives. Nouns and adjectives (henceforth referred to as terms) play an important role in determining the context of the subject matter in the text.
Step 2: Calculate the probability of occurrence of each extracted term (T) as shown below:

    P(T) = (Number of occurrences of term T in the paper) / (Total number of terms)    (1)

Step 3: Obtain the term influence (Ti) on each section of the paper as follows:

    Ti = F(T) × P(T)    (2)

where F(T) is the frequency of the term T in that particular section.
Step 4: The summation of the term influences (Ti) over all terms (T) occurring in a particular section gives the section influence (Si):

    Si = Σ_T [ P(T) × F(T) ]    (3)

Step 5: The section score is calculated as the fraction of the influence of a particular section over the whole RP, scaled up to some integer value:

    SectionScore(Si) = (Section influence Si) / (Σ influence of all sections)    (4)

c. Obtaining the Bag Of Phrases (BOP): We obtain the contextual phrases (CPs) from the RP based on Turney's method, and compute a score for each obtained phrase to get its influence on the RP:

    CP score = (Frequency of CP in a section S) × (section score of S)    (5)

Once the score is calculated for each CP, the next step is to filter the CPs to obtain the BOP. For this, we gather highly associated (co-occurring) CPs. Highly associated CPs are those that occur in the same section and have greater scores.
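The scoring steps of Algorithm 1 and the CP score of Eq. (5) can be sketched as follows. This is a minimal sketch assuming the paper has already been split into sections and POS-filtered into term lists; the dictionary representation and helper names are illustrative, not the authors' code:

```python
# Sketch of Algorithm 1 (section scoring) and Eq. (5) (CP score).
# Assumes sections is a dict: {section name: list of extracted terms}.
from collections import Counter

def section_scores(sections):
    """Return {section name: section score} following Eqs. (1)-(4)."""
    all_terms = [t for terms in sections.values() for t in terms]
    total = len(all_terms)
    # Eq. (1): P(T) over the whole paper.
    p = {t: c / total for t, c in Counter(all_terms).items()}

    # Eqs. (2)-(3): section influence Si = sum over terms of F(T) * P(T).
    influence = {}
    for name, terms in sections.items():
        freq = Counter(terms)
        influence[name] = sum(f * p[t] for t, f in freq.items())

    # Eq. (4): section score as a fraction of the total influence.
    total_influence = sum(influence.values()) or 1.0
    return {name: s / total_influence for name, s in influence.items()}

def cp_score(cp_freq_in_section, score_of_section):
    """Eq. (5): contextual-phrase score."""
    return cp_freq_in_section * score_of_section

if __name__ == "__main__":
    sections = {
        "abstract": ["clustering", "graph", "clustering"],
        "results": ["accuracy"],
    }
    scores = section_scores(sections)
    assert abs(sum(scores.values()) - 1.0) < 1e-9
    print(scores)
```

Because P(T) is computed over the whole paper while F(T) is per-section, a term that is frequent both globally and within a section contributes most, which is what makes the score dynamic per paper.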

3) Title Based Approach: Some RPs have titles in which the topic of the RP is well represented. For this kind of RP we use titles for the clustering approach, which saves the additional methodologies required for conceptual summary based clustering.

To check whether an RP can be used for title based clustering, we need to know whether the title, in some way, represents the topic of interest discussed in that RP. This is achieved by matching the BOP obtained from the RP against its title. Since the BOP represents the topic discussed in the RP, we check whether any term(s) in the BOP match a phrase present in the title. If a match is found, we conclude that the RP is eligible for title based clustering.

After all such titles are extracted, the RPs are clustered using only titles. We extract only nouns and adjectives from the titles using POS tagging.

For example, consider these titles:
"pair approximation models for disease spread"
"Network growth models and genetic regulatory networks"

The tagged versions of the above titles are:
"pair_NN approximation_NN models_NNS for_IN disease_NN spread_NN"
"network_NN growth_NN models_NNS and_CC genetic_JJ regulatory_JJ networks_NNS"

NNP - Proper noun
IN - Preposition or subordinating conjunction
JJ - Adjective
NNS - Noun, plural
NN - Noun, singular or mass

Once we extract only the nouns from the above two titles, we obtain the words:
1) pair, approximation, models, disease and spread
2) network, growth, models and networks

The score for each RP pair is the number of words in the noun lists matching between the two corresponding RPs. For the above mentioned RPs the score is d1,d2 = 1, since only one common word (models) matches. To obtain high RP pair scores so that the RPs can be clustered based on a threshold score, we expand the noun list of the title of an RP by appending the nouns obtained from the titles mentioned in the reference list of that RP. We represent the RP pair scores as a graph in which the nodes represent the RPs and the edges between them carry the scores between the RPs.

4) Conceptual Summary Based Approach: Conceptual similarity refers to a method of measuring the semantic distance between the concepts of any two words according to a given context/summary. In other words, it is used to calculate the nearness of two terms in the BOP based on their common characteristics, using their conceptual summaries.

The RPs that are not eligible for title based clustering are considered for conceptual summary based clustering, along with the unclustered RPs of title based clustering.

Algorithm 2: Calculating the conceptual similarity between two RPs.

Step 1: For each RP not suitable for Title Based Clustering:
Step 2:     For each phrase in the BOP of the RP:
                Generate the summary of the phrase from Wikipedia.
                Extract phrases from the summary using Turney's method.
Step 3:     Append the extracted phrases to the BOP.

For example, consider two RPs with BOPs as shown:
RP1 - approximation ratios, time algorithm, small relative, polynomial time, max clique, general case
RP2 - same reduction, string pointer

We append the Wikipedia phrases for each phrase of the BOPs mentioned above as follows:
RP1 - approximation ratios (similar phrases: computer science, data structure, operation research, approximation solutions, etc.), time algorithm (similar phrases: polynomial time, run time, time complexity, Big notation, etc.)
RP2 - string pointer (similar phrases: computer science, programming language, data structure, run time, computer memory, etc.), same reduction (similar phrases: reduction results, original probability, risk aversion decision, etc.)

We calculate the conceptual similarity score for an RP pair as the number of common phrases between the corresponding RPs: here RP1,RP2 = 4. The RP pairs are represented as a graph and are clustered as before.

5) Spanning Tree Algorithm: A spanning tree is an acyclic subgraph of a graph G which contains all the vertices of G. The maximum spanning tree of a weighted graph is the maximum weight spanning tree of that graph. We use spanning trees for the following purposes [9][10]:

1) To reduce the graph representing the RP pairs by removing the redundant cycles/loops, so that the burden on the clustering algorithm we employ is reduced.
2) To analyze the homogeneity of the RP pairs.

We process the graph obtained in the previous step using the following spanning tree algorithm.

Algorithm 3: Maximum Spanning Tree Algorithm.

Input: The weighted graph (Vi, Vj): Ek, where Vi and Vj represent any two research papers as nodes and Ek is the pairwise score calculated using the BOP.

Output: The maximum spanning tree corresponding to the weighted graph stated above.

Step 1: Sort the entries of the graph in descending order of edge weight and extract the first edge.
Step 2: Add the first entry (V1, V2): E1 to the spanning tree.
Step 3: For each node Vi in the spanning tree, find the edges Ek whose connecting node Vj is not already present in the spanning tree.
Step 4: Take the edge with the maximum weight (max[Ek]) from Step 3 and add its corresponding entry (Vi, Vj): Ek to the spanning tree.
Step 5: Repeat Steps 3 and 4 until all nodes are added to the spanning tree.

TABLE I: Observed Section Influence for Computer Science RPs
Sl no.  Section          Score
1.      Abstract         48%
2.      Introduction     34%
3.      Conclusions      10%
4.      Experimentation   5%
5.      Related work      3%

TABLE II: Observed Section Influence for Astrophysics RPs
Sl no.  Section          Score
1       Discussions      34%
2       Conclusions      24%
3       Abstract         22%
4       Introduction     11%
5       Experimentation   9%
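Algorithm 3 is essentially a Prim-style construction that repeatedly adds the heaviest edge leaving the partial tree. A minimal sketch follows; the edge-dictionary representation is an assumption for illustration, not the authors' code:

```python
# Sketch of Algorithm 3 (maximum spanning tree, Prim-style).
# edges: dict mapping frozenset({u, v}) -> weight (RP-pair score).

def max_spanning_tree(edges):
    """Return the (u, v, weight) edges of a maximum spanning tree."""
    nodes = {n for e in edges for n in e}
    if not nodes:
        return []
    # Steps 1-2: start from the heaviest edge.
    first = max(edges, key=edges.get)
    tree_nodes = set(first)
    u, v = tuple(first)
    tree = [(u, v, edges[first])]
    # Steps 3-5: repeatedly add the heaviest edge leaving the tree.
    while tree_nodes != nodes:
        candidates = [e for e in edges
                      if len(e & tree_nodes) == 1]  # exactly one endpoint inside
        if not candidates:
            break  # graph is disconnected
        best = max(candidates, key=edges.get)
        (new_node,) = best - tree_nodes
        (old_node,) = best & tree_nodes
        tree.append((old_node, new_node, edges[best]))
        tree_nodes.add(new_node)
    return tree

if __name__ == "__main__":
    g = {frozenset({"RP1", "RP2"}): 4,
         frozenset({"RP2", "RP3"}): 2,
         frozenset({"RP1", "RP3"}): 1}
    print(max_spanning_tree(g))  # keeps the two heaviest edges, drops the cycle
```

Dropping the lightest edge of each cycle in this way is what removes the redundant loops before the clustering step.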
6) Markov Graph Clustering Algorithm: We feed the maximum spanning tree obtained from the above algorithm to the Markov Graph Clustering algorithm to cluster the RPs. The main reason for choosing Markov Clustering is as follows: the aim of this clustering method is to dissect a graph into regions with many edges inside and only few edges between regions. In other words, if such regions exist then, once inside one of them, it is relatively difficult to get out, because there are many links within the region itself and only a few going out [8].

B. Tools:

1) Wikipedia Python module: Wikipedia is a Python library that makes it easy to access and parse data from Wikipedia. It wraps the MediaWiki API [13].

2) NLTK Python module: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

3) jsoup: jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag-soup; jsoup will create a sensible parse tree [9].

4) pyenchant module: This package provides a set of Python language bindings for the Enchant spellchecking library.

IV. EXPERIMENT & RESULTS

A. Data Collection:

For our experimentation, we collected around 2766 research paper publications archived at arxiv.org, a large self-archived repository of electronic e-prints of scientific papers in various research fields maintained by Cornell University Library. The data set comprises research papers in the fields of Astrophysics (1854 RPs) and Computer Science (912 RPs) in LaTeX format. We then extract the different sections and subsections and send them for further processing.

B. Structuring of the Data Set:

The data set in LaTeX format contains words/terms that cannot be used for our experimentation, such as LaTeX symbols and junk words. We parse the data set, filter out those symbols and remove the junk words using the standard pyenchant dictionary.

C. Feature Extraction:

Our feature extraction includes scoring each section of the RP to analyze the sections' influence on the content of the RP, and choosing phrases that represent the RP while capturing their context. We observe that the Abstract section (with an influence of 48%) has the highest influence for the RPs relating to the computer science field, and the Discussions section has the highest influence (34%) for the RPs relating to the astrophysics field. The section influences for RPs belonging to the computer science and astrophysics fields are shown in Table I and Table II respectively.

We performed a self-annotation on the bags of phrases obtained after feature extraction on around 100 RPs of different fields, as a preliminary verification of our feature extraction technique. We found that 85% of the time, the phrases in the BOP matched the phrases/concepts that were picked while self-annotating these texts.

D. Clustering the RPs:

1) Title Based Approach: We analyzed the collected data set by comparing the bag of phrases with the nouns and adjectives in the title, to check the number of files that are suitable for title based clustering, i.e. for representing these RPs by the bag of phrases having nouns and adjectives

extracted from their titles. Out of the 2766 RPs we collected as our dataset, around 65% of the RPs were suitable for the title based approach. After appending the extracted terms from the reference list, we chose a suitable threshold for matching the bags of phrases and obtained the RP pair scores.

2) Conceptual Summary Based Approach: We processed the remaining RPs (around 35% of the dataset) that were not suitable for the title based clustering approach by obtaining conceptually similar words using the Python Wikipedia package. We expanded the conceptual meaning of the terms in the BOP by extracting the bigram terms (noun-noun/noun-adjective pairs) from the summary content of the corresponding Wikipedia page for each phrase in the BOP. The majority of the phrases extracted in this way extended the conceptual meaning of the phrases in the BOP. However, some amount of deviation in the concept was also observed. For example:

In the RP titled "Reachability problems for communicating finite state machines", from the computer science domain, for the phrases "new vertex" and "transition diagram" in the BOP, the conceptual summary phrases are ["graph theory", "edge contraction"] and ["state diagram", "computer science", "state diagrams", etc.] respectively. These phrases extend the concept of the BOP and aid clustering.

However, for the RP titled "NGC 300: an extremely faint, outer stellar disk observed to 10 scale lengths", which belongs to the astrophysics domain, one of the phrases in the BOP is "solar composition", but the conceptual summary comprises the following phrases: "musical composition", "studio album", "modern jazz", etc. These phrases clearly cause a deviation in the concept.

3) Maximum Spanning Tree Analysis: We analyzed the homogeneity of the pairwise distances obtained by the aforementioned approaches by constructing a maximum spanning tree. If the constituent node pair of an edge corresponds to RPs of the same category, then we define the edge to be homogeneous. Table III shows the homogeneity percentage we observed for the spanning trees obtained by applying our approaches.

TABLE III: Observed Homogeneity
Sl No.  Approach           Homogeneity
1.      Title Based        93%
2.      Conceptual Based   85%

Fig. 1: Homogeneity Analysis: The Maximum Spanning Tree for a Subset of Dataset 1 (graph plotted using the matplotlib Python library)

4) Markov Graph Clustering: After analyzing the homogeneity and reducing the pairwise RP graph to a spanning tree, we fed the spanning tree to the graph clustering algorithm. We experimented by changing the inflation value of the Markov clustering algorithm. The inflation value decides the granularity of the clusters to be obtained. As we are dealing with a two-class clustering problem, we calibrated the inflation value I so as to obtain a two-cluster granularity.

V. EVALUATION AND TABULATION OF RESULTS

To evaluate how well our methods work, we calculate the precision, recall and accuracy for the clusters found using the Markov graph clustering technique, using a confusion matrix.

A confusion matrix [11] contains information about the actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The entries in the confusion matrix have the following meaning in the context of our study:
False Positive (FP): falsely predicting that an RP belongs to a cluster (e.g. predicting that a computer science RP belongs to the astrophysics class).
False Negative (FN): falsely predicting that the other RP belongs to the other cluster (e.g. predicting that an astrophysics RP belongs to the computer science class).
True Positive (TP): correctly predicting that an RP belongs to a cluster (e.g. predicting that a computer science RP belongs to the computer science class).
True Negative (TN): correctly predicting that an RP belongs to the other cluster (e.g. predicting that an astrophysics RP belongs to the astrophysics cluster).

The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation [12]:

    Accuracy (AC) = (TP + TN) / (TP + TN + FP + FN)    (6)

Table IV presents the accuracy we observed for different inflation values of Markov clustering (MC), different subsets of the data set having RPs from both the computer science and astrophysics domains, and the different approaches used. We observe that the accuracy of clustering RPs which are suitable for our title based approach is superior to the

same for clustering the same set of RPs using our conceptual summary based approach. Since some of the RPs in our data set are not suitable for the title based approach, we use a combination of both approaches, applying the conceptual summary based approach to those files that are not suitable for the title based approach. Table IV shows this observation. The title based approach (first level) greatly simplifies the input to the conceptual summary based approach (second level). We also compare the two-level hybrid approach with the purely conceptual summary based approach; here too, the accuracy of the hybrid approach outperforms the single-level conceptual summary based approach. This is tabulated in Table V, where we get an average accuracy of 74% for the largest dataset (Trial III).

Finally, different approaches have been considered for the classification of RPs by different researchers, using different feature extraction techniques. In Table VI we summarize these aspects and compare our results, the metric used, and the feature extraction and clustering techniques with the other research.

TABLE VI: Comparison of Results
Sl No.  Researchers         Feature Extraction                  Clustering Algorithm          Resulting Metric              Accuracy
1       Vaidas et al. [1]   Combination of title, abstract      KNN, SVM                      Precision and recall          Ranging from 0.481 to 0.665
                            and full text                                                                                   for various cluster methods
2       Vahed et al. [7]    Citations                           Hierarchical agglomeration    Precision and recall          Ranging from 0.333 to 0.851
                                                                algorithm                                                   for various datasets
3       Mohsen et al. [3]   Citations, common authors           Electrical conductance        Precision and recall          Ranging from 0.53 to 0.90
                                                                algorithm                                                   for various data sets
4       Bader et al. [5]    Citation contexts and their         Hierarchical K-means,         Break-Even Point (BEP)        BEP values ranging from
                            synonyms                            Bi-clustering                 metric                        0.190 to 0.295
5       Our approach        Section-influence based BOP +       Markov graph based            Accuracy calculated using     Ranging from 0.74 to 0.86
                            conceptual summary using Wikipedia  clustering                    confusion matrix              for hybrid approach
KNN - K-Nearest Neighbors, SVM - Support Vector Machines

VI. CONCLUSION AND FUTURE WORK

We have used an empirical approach to classify research papers. The section scoring technique shows that primary sections such as Abstract, Discussions and Introduction in RPs have a greater influence in determining the subject of the RP, thus providing logical support for scoring the phrases in our BOP. The title based approach simplifies the dataset to be processed in the subsequent conceptual summary based phase. The greater homogeneity of edges in the spanning tree removes cycles that introduce redundancy into the graph and hence makes clustering easier. We plan to extend the existing algorithm to a multi-category data set. The present work can be improved further by filtering the Wikipedia summaries for each phrase in the BOP, considering conceptually related technical words using synonymy and hypernymy concepts. The accuracy of the clustering technique employed can be enhanced to a greater extent by refining the semantic distances between RPs using higher order probability distributions.

REFERENCES
[1] Vaidas Balys and Rimantas Rudzkis, "Statistical Classification of Scientific Publications", Institute of Mathematics and Informatics, Vilnius.
[2] Andrew Trotman, "Choosing Document Structure Weights", Department of Computer Science, University of Otago, PO Box 56, Dunedin, New Zealand. [email protected]
[3] Mohsen Taheriyan, "Subject Classification of Research Papers Based on Interrelationships Analysis", Computer Science Department, University of Southern California, Los Angeles, CA. [email protected]
[4] Peter Turney, "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews".
[5] Bader Aljaber, Nicola Stokes, James Bailey, and Jian Pei, "Document clustering of scientific texts using citation contexts".
[6] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, "Introduction to Information Retrieval", Cambridge University Press, 2008.
[7] Vahed Qazvinian and Dragomir R. Radev, "Scientific Paper Summarization Using Citation Summary Networks".
needs to be experimented on semi structured textual data like [8] Stijn van Dongen, Graph Clustering by Flow Simula-
project reports, theses etc. tion, PhD thesis, University of Utrecht, May 2000. (
http://www.library.uu.nl/digiarchief/dip/diss/1895620/inhoud.htm )
[9] Oleksandr Grygorash, Yan Zhou and Zach Jorgensen, "Minimum Spanning Tree Based Clustering Algorithms", School of Computer and Information Sciences, University of South Alabama, Mobile, AL 36688, USA.
[10] R. C. Prim, "Shortest Connection Networks and Some Generalizations", Bell System Technical Journal, November 1957.
[11] Sofia Visa, Brian Ramsay, Anca Ralescu and Esther van der Knaap, "Confusion Matrix-based Feature Selection", College of Wooster; Sentry Data Systems, Inc.; University of Cincinnati; Ohio Agricultural Research and Development Center, The Ohio State University.
[12] http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
[13] https://pypi.python.org/pypi/wikipedia/

TABLE IV: Observed Accuracy For Separate Levels:
Trials | Approach | Files in Dataset | I for MCL | Accuracy
Trial I | Title Based | 478 | 1.2 | 0.95
Trial I | CSB | 400 | 1.2 | 0.72
Trial II | Title Based | 1675 | 1.4 | 0.81
Trial II | CSB | 438 | 1.2 | 0.55
Trial III | Title Based | 2050 | 1.2 | 0.76
Trial III | CSB | 670 | 1.2 | 0.68

TABLE V: Comparison Between Conceptual Summary and Hybrid (Title based + Conceptual summary based):
Trials | Approach | Files in Dataset | I for MCL | Accuracy
Trial I | CSB | 878 | 1.4 | 0.68
Trial I | Hybrid | 478 + 400 | 1.4 | 0.86
Trial II | CSB | 2113 | 1.2 | 0.60
Trial II | Hybrid | 1675 + 438 | 1.2 | 0.76
Trial III | CSB | 2720 | 1.2 | 0.56
Trial III | Hybrid | 2050 + 670 | 1.2 | 0.74
CSB - Conceptual Summary Based Approach. I for MCL - Inflation Value for the Markov Clustering Algorithm.
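The inflation value (I for MCL) reported in Tables IV and V controls how sharply the Markov Clustering algorithm [8] separates clusters. The sketch below shows one MCL iteration to illustrate what this parameter does; the toy 3-node graph, the helper names, and the chosen value 1.2 are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of one Markov Clustering (MCL) iteration.
# The toy graph and inflation value are made-up examples.

def normalize_columns(m):
    """Rescale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= s
    return m

def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, r):
    """Inflation: raise entries to the power r and renormalize columns.
    A larger inflation value r sharpens the walk probabilities, which
    tends to yield more, tighter clusters."""
    return normalize_columns([[x ** r for x in row] for row in m])

# Toy adjacency matrix (with self-loops) for a 3-node graph
adj = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0],
       [0.0, 1.0, 1.0]]

m = normalize_columns([row[:] for row in adj])
m = inflate(expand(m), 1.2)  # one iteration with inflation value 1.2
```

In the full algorithm, expansion and inflation alternate until the matrix converges; nodes attracted to the same attractor rows then form one cluster.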
2016 2nd International Conference on Contemporary Computing and Informatics (ic3i) 511