
Chapter IV
A Modified Weighted Sum Method to Determine the Weight for Domain-Specific Keywords Based on a Domain Repository

4.1 Introduction

This is the most important section of our proposal. Here we describe our thesis proposal, and then we propose a solution to the problems diagnosed in the existing works.

4.2 Thesis Proposal

From our study of the literature we have observed that there is a large body of work on determining keyword weights, with many proposed solutions. However, not many papers determine the weight using domain knowledge in the weighted sum method, and in this line of keyword weighting the following issues have been considered so far.

- Exploration of the linguistic characteristics of a specific domain: A domain is the content of a particular field of knowledge; for example, Art, Philosophy, Information Technology, Geography, and History are all specific domains. In some previous research these characteristics have been formulated as a feature vector. The features are:

  - TF-IDF
  - PoS (part of speech)
  - RPFO (relative position of first occurrence)
  - Chi-square statistic

  and the feature vector is

  $\phi = (F_{TF \times IDF},\ F_{PoS},\ F_{RPFO},\ F_{\chi^2})$

- Finding the weight: after finding the feature vector, the weight of each candidate was computed with the following weighted sum formula:

  $s = \omega^{T} \phi$   ..........(4.1)

  Here, $\omega$ is the weight vector and $\phi$ is the feature vector.

- Extraction of candidate keywords by addition and multiplication: after finding the weight of each candidate keyword, candidate keyword extraction has been performed by addition and multiplication.

- Assigning scores to each candidate word: after extraction of the candidate keywords, scores were assigned to the candidates according to the features below.
1) $F_{TF \times IDF} = \dfrac{E(\text{keywords' } TF \times IDF)}{\sigma_{TF \times IDF}} - \dfrac{E(\text{non-keywords' } TF \times IDF)}{\sigma_{TF \times IDF}}$   ..........(4.2)

2) $F_{PoS} = \dfrac{E(\text{keywords' } PoS)}{\sigma_{PoS}} - \dfrac{E(\text{non-keywords' } PoS)}{\sigma_{PoS}}$   ..........(4.3)

3) $F_{RPFO} = \dfrac{E(\text{keywords' } RPFO)}{\sigma_{RPFO}} - \dfrac{E(\text{non-keywords' } RPFO)}{\sigma_{RPFO}}$   ..........(4.4)

- Sorting the candidate keywords by score: after the scoring process, the candidate keywords were sorted according to their scores.

- Choosing the few top candidates as keywords: finally, from the sorted list, the few top candidate keywords were chosen as keywords.

Those are the features added so far to the previously examined method; a minimal sketch of the resulting pipeline is given after this list of open issues. We have observed that the following issues have not been considered yet:

- Semantic relationships among the weighted keywords in domain-specific keyword extraction with the weighted sum method.

- Structural information in the weighted sum method when calculating the term frequency (tf) and the inverse document frequency (idf), which is important for obtaining better results.
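To make the existing pipeline concrete, the following minimal Python sketch standardizes one feature in the style of Eq. (4.2), scores hypothetical candidates with the weighted sum of Eq. (4.1), sorts them, and keeps the top candidates. Every feature value, training statistic, and weight here is an illustrative assumption, not a value from the reviewed literature.

```python
# Minimal sketch of the weighted-sum keyword scoring pipeline reviewed above.
# All feature values, training statistics, and weights are illustrative
# placeholders, not values from the literature.

def standardized_feature(kw_mean, nonkw_mean, std_dev):
    """Discriminative feature value in the style of Eqs. (4.2)-(4.4):
    keyword mean minus non-keyword mean, each normalized by the std dev."""
    return kw_mean / std_dev - nonkw_mean / std_dev

# e.g. a TF-IDF feature weight computed from (hypothetical) training statistics
f_tfidf = standardized_feature(kw_mean=0.40, nonkw_mean=0.05, std_dev=0.12)

# Hypothetical feature vectors phi = (TF-IDF, PoS, RPFO, chi-square)
candidates = {
    "ontology": (0.42, 0.90, 0.05, 3.1),
    "keyword":  (0.37, 0.80, 0.10, 2.4),
    "the":      (0.02, 0.10, 0.00, 0.2),
}

# Hypothetical weight vector omega; each score is the weighted sum s = omega . phi
omega = (0.5, 0.2, 0.1, 0.2)
scores = {term: sum(w * f for w, f in zip(omega, phi))
          for term, phi in candidates.items()}

# Sort candidates by score and keep the few top ones as keywords
top_n = 2
keywords = sorted(scores, key=scores.get, reverse=True)[:top_n]
print(keywords)  # ['ontology', 'keyword']
```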


4.3 Proposed Solution

Here we provide our proposed solution for finding the semantic relationships among concepts and for establishing a structural relationship among the documents in the set. To determine the weights in the weighted sum method, we will build an ontology, that is, a systematic description of the objective existence of the world in a particular domain. We will build the ontology based on WordNet according to the is-a (child-parent) relationship. Then we will extract the candidate concept set according to Lin's similarity measure:

$sim(x_1, x_2) = \dfrac{2 \times \log P(C_0)}{\log P(C_1) + \log P(C_2)}$   ..........(4.6)

Here $x_1$ and $x_2$ are two concepts in WordNet, and $C_0$ is the deepest common parent of the candidate concepts $C_1$ and $C_2$. $P(C_1)$, $P(C_2)$, and $P(C_0)$ are the probabilities of the emergence of the concepts $C_1$, $C_2$, and $C_0$, respectively. For example, let $C_1$ = Hill and $C_2$ = Coast. The similarity between the concepts Hill and Coast is

$sim(Hill, Coast) = \dfrac{2 \times \log P(Geological\ Formation)}{\log P(Hill) + \log P(Coast)}$   ..........(4.7)

which is equal to 0.59.
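As an illustration, this measure is available off the shelf: NLTK's WordNet interface implements Lin similarity over an information-content corpus. The exact value depends on which corpus supplies the concept probabilities, so it may differ slightly from the 0.59 quoted above.

```python
# Lin similarity over WordNet with NLTK (Eq. 4.6). One-time setup:
#   import nltk; nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # concept probabilities from the Brown corpus

hill = wn.synset('hill.n.01')
coast = wn.synset('coast.n.01')

# 2 * log P(c0) / (log P(c1) + log P(c2)), with c0 the deepest common parent
print(hill.lin_similarity(coast, brown_ic))  # close to the 0.59 quoted above
```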


4.4 Building the Ontology

We have built an ontology based on WordNet. WordNet is composed of synsets (synonym sets); that is, WordNet's concepts are organized in the form of synonym sets. We have constructed the ontology with the is_a relationship. For example, oak and pine are both trees:

tree
 +-- oak  (oak is_a tree)
 +-- pine (pine is_a tree)

Our main target is to find the backbone words from WordNet. We establish an initial ontology as an is-a (child-parent) relationship tree over these terms. For the example above, oak and pine are trees, so we regard oak and pine as child nodes of tree in order to construct a basic concept tree as the initial ontology. After building the initial ontology, we will compare the similarity of concepts in WordNet to the concepts contained in the initial ontology with Lin's semantic similarity algorithm [6]:

$sim_L(c_1, c_2) = \dfrac{2 \times IC(iso(c_1, c_2))}{IC(c_1) + IC(c_2)}$   ..........(4.8)

Here, $sim_L(c_1, c_2)$ is the similarity between two concepts $c_1$ and $c_2$ in WordNet, $P(c_1)$ and $P(c_2)$ are the probabilities of concepts $c_1$ and $c_2$, $iso(c_1, c_2)$ is the deepest common parent of concepts $c_1$ and $c_2$, and IC means information content, $IC(c) = -\log p(c)$. If $sim_L(c_1, c_2) > k$, where $c_1$ is a concept in WordNet, $c_2$ is a concept in the initial ontology, and $k$ is a given threshold, we will add concept $c_1$ into the candidate concept set.
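A minimal sketch of this step, again assuming NLTK's WordNet interface; the backbone synsets, the probe words, and the threshold k = 0.25 are illustrative assumptions rather than choices fixed by the proposal.

```python
# Sketch: build a small is_a initial ontology from backbone terms, then
# admit WordNet concepts whose Lin similarity to an ontology concept
# exceeds a threshold k (Eq. 4.8). Synset identifiers, probe words, and
# the value of k are illustrative assumptions.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

# Initial ontology: oak and pine as child nodes under their is_a (hypernym) parents
backbone = [wn.synset('oak.n.02'), wn.synset('pine.n.01')]
ontology = {child: child.hypernyms() for child in backbone}

k = 0.25                        # hypothetical similarity threshold
candidate_concepts = set()

# Compare some WordNet concepts against the initial ontology
probes = wn.synsets('birch', pos=wn.NOUN) + wn.synsets('chair', pos=wn.NOUN)
for c1 in probes:
    if any(c1.lin_similarity(c2, brown_ic) > k for c2 in backbone):
        candidate_concepts.add(c1)

print(candidate_concepts)       # tree-like senses of 'birch' are likely to pass
```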

4.5 Building the Structural Relationship

After finding the similarity among keywords, we will apply the structural relationship to the term frequency tf with the following formula [8]:

$tf_{ij} = \alpha \times tf_{ij1} + \beta \times tf_{ij2} + \gamma \times tf_{ij3}$   ..........(4.9)
For example, a web page is divided into several sections such as title, head, and body. The same term appearing in different positions of a web page should be given different priorities (weights), so the terms in the title should have higher weights than those in the head and body. In the above equation, $tf_{ijk}$ is the frequency of the term in the k-th area, and $\alpha$, $\beta$, $\gamma$ are factors that can be adjusted according to pre-experiments, where $\alpha > \beta > \gamma \geq 1$. A small sketch of this weighting follows.
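A minimal sketch of Eq. (4.9); the factor values below are illustrative and would be tuned by pre-experiments.

```python
# Sketch of the section-weighted term frequency of Eq. (4.9).
# The factors alpha > beta > gamma >= 1 are illustrative values.
ALPHA, BETA, GAMMA = 3.0, 2.0, 1.0  # title, head, body

def structural_tf(tf_title, tf_head, tf_body):
    """tf_ij = alpha*tf_ij1 + beta*tf_ij2 + gamma*tf_ij3."""
    return ALPHA * tf_title + BETA * tf_head + GAMMA * tf_body

# One occurrence in the title (3.0) outweighs two in the body (2.0)
print(structural_tf(1, 0, 0), structural_tf(0, 0, 2))
```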

Similarly, we will apply the structural relationship to the inverse document frequency idf with the following formula:

$idf_j = \log\left(\dfrac{N \times (\alpha + \beta + \gamma)}{\sum_{i=1}^{K} (\alpha_i + \beta_i + \gamma_i)} + 0.01\right)$   ..........(4.10)

Here, K is the document frequency of term j: each of the documents indexed 1 to K that contain the term contributes its section factors $(\alpha_i, \beta_i, \gamma_i)$ to the sum, and N is the total number of web pages.
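A minimal sketch of this computation, with N, K, and the per-document section factors chosen purely for illustration:

```python
# Sketch of the structural idf of Eq. (4.10).
# N, K, and the per-document section factors are illustrative values.
import math

ALPHA, BETA, GAMMA = 3.0, 2.0, 1.0       # section factors, as in Eq. (4.9)
N = 1000                                  # total number of web pages
docs_with_term = [(3.0, 2.0, 1.0)] * 40   # K = 40 documents contain term j

def structural_idf(n_total, section_factors):
    numerator = n_total * (ALPHA + BETA + GAMMA)
    denominator = sum(a + b + g for a, b, g in section_factors)
    return math.log(numerator / denominator + 0.01)

print(structural_idf(N, docs_with_term))  # ~3.22, i.e. log(1000/40 + 0.01)
```

Note that when every document carries identical section factors, the expression reduces to the familiar $\log(N/K + 0.01)$; the section factors only shift the idf when documents weight their sections differently.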

