The document proposes a novel method called CQMine for efficient high quality and cohesive topical phrase mining. CQMine improves upon existing approaches by integrating quality phrase mining, a novel topic model, and document clustering into an iterative framework. This allows both phrase quality and topical cohesion to be enhanced. The method is evaluated to show it achieves better performance than state-of-the-art methods.

Uploaded by

phaniteja

An Efficient Method for High Quality and Cohesive Topical Phrase Mining

Abstract

A phrase is a natural, meaningful, and essential semantic unit. In topic modeling,
visualizing phrases for individual topics is an effective way to explore and
understand unstructured text corpora. Usually, the process of topical phrase mining
is twofold: phrase mining and topic modeling. For phrase mining, existing
approaches often suffer from order sensitivity and inappropriate segmentation,
which cause them to extract inferior quality phrases. For topic modeling,
traditional topic models do not fully consider the constraints induced by phrases,
which may weaken topical cohesion. Moreover, existing approaches often lose domain
terminologies since they neglect the impact of domain-level topical distribution.
In this paper, we propose an efficient method for high quality and cohesive topical
phrase mining. A high quality phrase should satisfy the frequency, phraseness,
completeness, and appropriateness criteria. In our framework, we integrate a
quality-guaranteed phrase mining method, a novel topic model incorporating the
constraint of phrases, and a novel document clustering method into an iterative
framework to improve both phrase quality and topical cohesion. We also describe
algorithmic designs that execute these methods efficiently.

Existing System

Topical phrase mining is not only an important step in the established fields of
information retrieval and text analytics, but is also critical to various tasks in
emerging applications, including topic detection and tracking, social event
discovery, news recommendation systems, and document summarization. The process of
topical phrase mining is twofold: phrase mining and topic modeling. These two
stages not only directly affect the quality of discovered phrases and the cohesion
of topics, but may also interact and indirectly affect each other's outcomes. For
example, low quality phrases (incomplete or meaningless) may cause misleading
topical assignment in topic modeling. From the perspectives of phrase quality and
topical cohesion, the outcomes of existing approaches still leave room for
improvement.

NLP-based methods are commonly language-dependent and need texts to comply with
grammar rules, so they are not easy to migrate to other languages and are not
suitable for analyzing some newly emerging and grammar-free text data, such as
tweets, academic papers, and query logs. To overcome the disadvantages of NLP-based
methods, many data-driven approaches have been proposed in this area. A variety of
statistics-based methods have been proposed to improve phrase quality by ranking
candidate phrases.

Proposed System

We propose a novel topical phrase mining method, CQMine. Our method can achieve
better performance than state-of-the-art methods in terms of phrase quality and
topical cohesion. To mine topical phrases effectively and efficiently while
improving phrase quality and topical cohesion, we propose a Cohesive and Quality
Topical Phrase Mining (CQMine) framework, which automatically clusters documents
with a more sensible topic model and improves the quality of phrases by adopting
more accurate and rigorous mining approaches.

We propose effective and efficient quality phrase mining approaches. By
eliminating order sensitivity and avoiding inappropriate segmentation, our
approaches can guarantee the quality of extracted phrases. We also design
effective algorithms to accelerate the processing.

We propose a novel topic model that addresses the topic assignment problem
associated with idiomatic phrases, to improve the cohesion of topical phrases.

Considering that some phrases are only valid in certain domains, we propose an
iterative framework to find domain terminologies more accurately. Experimental
evaluation and a case study demonstrate that our method offers high
interpretability and efficiency compared with the state-of-the-art methods.

Future Work

The existing model only considers the intra-cooccurrence of phrases and regards
the generation of segmentations as an independent process. In contrast, our
methods consider both the intra-cooccurrence of phrases and the isolation of
partition positions. From a technical perspective, the isolation of the "current"
split position depends on the "future" generated split position. Thus, we need to
check every possible new split position to determine the isolation of the current
split position, which makes the computation of optimal segmentations very time
consuming. To address this issue, we adopt a dynamic programming strategy based on
an observation concerning the partition position b_{i+1} and the previous
partition position b_i being the optimal position.
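
As a rough illustration of dynamic programming over split positions, the sketch below segments a token sequence by reusing the best segmentation of every shorter prefix. It is not the CQMine algorithm: the isolation criterion is not modeled, and the names `segment`, `toy_score`, and `PHRASE_SCORES` as well as the score values are hypothetical stand-ins chosen for the example.

```python
def segment(tokens, score, max_len=4):
    """Split tokens into phrases so that the summed phrase score is maximal.

    best[i] holds the best total score for tokens[:i]; back[i] holds the
    split position that precedes the last phrase of that best segmentation.
    """
    n = len(tokens)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        # Try every allowed start position j for the phrase ending at i.
        for j in range(max(0, i - max_len), i):
            candidate = best[j] + score(tuple(tokens[j:i]))
            if candidate > best[i]:
                best[i] = candidate
                back[i] = j
    # Walk the back-pointers from the end to recover the phrases.
    phrases = []
    i = n
    while i > 0:
        j = back[i]
        phrases.append(" ".join(tokens[j:i]))
        i = j
    return phrases[::-1]


# Hypothetical scores: known phrases are rewarded, unknown multi-word
# spans score 0, and single words score 1.
PHRASE_SCORES = {("topic", "model"): 3.0, ("phrase", "mining"): 3.0}


def toy_score(span):
    if len(span) == 1:
        return 1.0
    return PHRASE_SCORES.get(span, 0.0)


result = segment("a topic model for phrase mining".split(), toy_score)
print(result)
```

Because each prefix is solved once and reused, the cost grows with the sequence length times `max_len` rather than with the number of possible segmentations.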

Modules

News Publisher

The news publisher provides news articles on a daily basis, including breaking
news and live news. The news data are stored in a database and offered as a
service to end users. The news recommendation system publishes news articles by
category. The news publisher searches news topics at random to check whether the
displayed articles relate to their category. Users register in the news portal to
view the news articles; after reading an article, a user can comment on it and
share it with others.

Effectiveness Analysis of quality phrase

We examined the effectiveness of our quality phrase mining stage by measuring
phrase quality in two ways: (1) the Wiki-phrases benchmark and (2) expert
evaluation. Wiki-phrases is a collection of popular mentions of entities obtained
by crawling intra-Wiki citations within Wiki content. The Wiki-phrases benchmark
provides good coverage of commonly used phrases and avoids the variance caused by
different human raters. In this evaluation, we regarded Wiki phrases as ground
truth phrases; that is, each extracted phrase is labeled by whether or not it
belongs to the Wiki phrases. To compute precision, only the Wiki phrases are
considered to be positive. For recall, we first merged all the phrases returned by
all methods, including ours, and then took the intersection between the Wiki
phrases and the merged phrases as the evaluation set.
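
The precision and recall definitions above can be sketched as follows. The function names and the toy phrase sets are illustrative assumptions, not data from the evaluation.

```python
def precision(extracted, wiki_phrases):
    """Fraction of extracted phrases that appear in the Wiki phrases."""
    extracted = set(extracted)
    if not extracted:
        return 0.0
    return len(extracted & wiki_phrases) / len(extracted)


def recall(extracted, all_method_outputs, wiki_phrases):
    """Fraction of the evaluation set that one method recovered.

    The evaluation set is the intersection of the Wiki phrases with the
    union of the phrases returned by every compared method.
    """
    merged = set().union(*all_method_outputs)
    evaluation_set = merged & wiki_phrases
    if not evaluation_set:
        return 0.0
    return len(set(extracted) & evaluation_set) / len(evaluation_set)


# Toy data, invented for the example.
wiki = {"topic model", "machine learning", "data mining"}
ours = {"topic model", "data mining", "of the"}
baseline = {"machine learning", "topic model", "in the"}

p = precision(ours, wiki)
r = recall(ours, [ours, baseline], wiki)
print(p, r)
```

Restricting recall to the merged output keeps the denominator to phrases that at least one method could find in the corpus, instead of every Wiki phrase.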

Quality Phrase Mining

In the CQMine framework, the quality phrase mining stage contains three steps.
First, a PhraseTrie is built to count the frequencies of all possible phrases.
Then, a complete phrase mining algorithm is applied to mine complete phrases,
guided by a statistics-based measurement to satisfy the phraseness criterion.
During phrase mining, the mined phrases are stored in the PhraseTrie to avoid
recomputing duplicate phrases. Finally, to guarantee the appropriateness
requirement, CQMine checks whether each document contains overlapping phrases; if
so, it partitions them into non-overlapping phrases using an effective and
efficient overlapping phrase segmentation algorithm. After quality phrase mining,
a document is transformed from a multiset of words (bag-of-words) into a multiset
of phrases (bag-of-phrases), which is taken as the input of topic modeling.
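
A minimal sketch of the first step, counting phrase frequencies in a trie, is shown below. The source names the PhraseTrie but does not describe its layout, so the class design, the `build_trie` helper, and the `max_len` cap on phrase length are assumptions made for illustration.

```python
class PhraseTrie:
    """Trie keyed by words; each node stores how often the word sequence
    that ends at that node was inserted."""

    def __init__(self):
        self.children = {}
        self.count = 0

    def insert(self, words):
        node = self
        for word in words:
            node = node.children.setdefault(word, PhraseTrie())
        node.count += 1

    def frequency(self, words):
        node = self
        for word in words:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count


def build_trie(documents, max_len=3):
    """Insert every contiguous word sequence of up to max_len words."""
    trie = PhraseTrie()
    for doc in documents:
        tokens = doc.split()
        for start in range(len(tokens)):
            stop = min(start + max_len, len(tokens))
            for end in range(start + 1, stop + 1):
                trie.insert(tokens[start:end])
    return trie


docs = ["topic model for phrase mining", "phrase mining with a topic model"]
trie = build_trie(docs)
print(trie.frequency(["topic", "model"]))
print(trie.frequency(["mining", "topic"]))
```

Sequences that share a prefix share trie nodes, so looking up a candidate phrase costs one dictionary access per word.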

Topical phrase mining

Significant progress has been made on topical phrase mining, and existing
approaches can be broadly classified into three types:

(1) Jointly learning phrases and their topic assignment,
(2) Mining phrases posterior to topic inference,
(3) Mining phrases prior to topic inference.

Word sequence segmentation (or phrasal segmentation) is another strategy for
phrase mining. Formally, phrasal segmentation aims at partitioning a word sequence
into a set of disjoint subsequences, each indicating a phrase. It only considers
the intra-cooccurrence of phrases, such as phrase length and words, while ignoring
the inter-isolation between phrases. The second strategy uses a post-processing
step to generate phrases after topics have been inferred by the LDA model. It
recursively merges consecutive words with the same latent topic by a
distribution-free permutation test on an arbitrary-length back-off model until all
significant consecutive words have been merged. The first strategy performs phrase
mining and topic inference simultaneously by incorporating a successive word
sequence assumption into the generative model. Wallach proposed a bigram topic
model based on a hierarchical Dirichlet language model. The bigram model is a
probabilistic generative model that conditions on the previous word and topic when
drawing the next word.

Architecture

Algorithm

The completeness of extracted phrases depends heavily on the merge order. To
obtain the complete phrases, we need to enumerate every possible merge order. A
straightforward algorithm for finding the complete phrases in a document d is to
enumerate all the subsequences of the document first and then verify whether each
one is a complete phrase. The algorithm QBA (q-Chunk Based Approach) first
generates the chunk boundaries. It then computes the local solution of each chunk
using DPBA, keeping track of the left boundary of the current chunk. For each
boundary, QBA checks whether it satisfies the merge condition.

The main processing steps of QBA are as follows:

(1) Partitioning the sequence into a series of q-length chunks;
(2) Performing a top-down search on each chunk to get local solutions;
(3) Checking whether two adjacent chunks need to be merged.

If they do not need to be merged, no phrase can cross the boundary between the two
chunks. Otherwise, the two chunks are merged into a new chunk and QBA finds new
solutions on the new chunk.
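
The chunk-and-merge control flow of steps (1) and (3) might be sketched as below. The text does not spell out the merge condition or the DPBA local search, so `crosses_boundary` is a hypothetical stand-in (a lookup in an invented set of frequent bigrams) and the per-chunk search of step (2) is left out.

```python
def partition(tokens, q):
    """Step (1): cut the token sequence into consecutive q-length chunks.
    The last chunk may be shorter than q."""
    return [tokens[i:i + q] for i in range(0, len(tokens), q)]


def merge_chunks(chunks, crosses_boundary):
    """Step (3): merge adjacent chunks whenever a phrase may cross the
    boundary between them.

    crosses_boundary(left_word, right_word) is an assumed stand-in for
    the merge condition. A merged chunk would then be searched again for
    local solutions; that search is not shown here.
    """
    merged = []
    for chunk in chunks:
        if merged and crosses_boundary(merged[-1][-1], chunk[0]):
            merged[-1] = merged[-1] + chunk
        else:
            merged.append(list(chunk))
    return merged


# Hypothetical merge condition: the two boundary words form a frequent bigram.
FREQUENT_BIGRAMS = {("topic", "model")}


def toy_condition(left_word, right_word):
    return (left_word, right_word) in FREQUENT_BIGRAMS


tokens = "a novel topic model for phrase mining".split()
chunks = partition(tokens, 3)
merged = merge_chunks(chunks, toy_condition)
print(chunks)
print(merged)
```

With q = 3, the phrase "topic model" straddles the first boundary, so the first two chunks are joined, while the final chunk stays separate.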

SYSTEM REQUIREMENTS

Hardware Requirements:

➢ Processor - Pentium IV or Later Version
➢ RAM - 4 GB (min)
➢ Hard Disk - 40 GB
➢ Keyboard - Standard Windows Keyboard
➢ Mouse - Two or Three Button Mouse
➢ Monitor - SVGA

Software Requirements:

➢ Operating System - Windows XP or Later Version
➢ Coding Language - Java/J2EE (JSP, Servlet)
➢ Front End - J2EE
➢ Back End - MySQL
