
Chapter IV
A Modified Weighted Sum Method to Determine the Weight for Domain-Specific Keywords Based on a Domain Repository

4.1 Introduction

This is the most important section of our proposal. Here we describe our thesis proposal, and then we propose a solution to the problems diagnosed in the existing works.

4.2 Thesis Proposal

From our study of the literature we have observed that there is a large body of work on determining keyword weights, with many proposed solutions. However, not many papers determine the weight using domain knowledge in the weighted sum method, and in this line of keyword weighting the following issues have been considered so far.

- Exploration of the linguistic characteristics of a specific domain: A domain is the content of a particular field of knowledge; for example, Art, Philosophy, Information Technology, Geography, and History are all specific domains. In some previous research these characteristics have been formulated as a feature vector. The features are:

  - TF-IDF
  - PoS (part of speech)
  - RPFO (relative position of first occurrence)
  - Chi-square statistic

  and the feature vector is

  $\phi = (F_{TF \times IDF},\ F_{PoS},\ F_{RPFO},\ F_{\chi^2})$

- Finding the weight: after finding the feature vector, the weight of each candidate was computed with the following weighted sum formula:

  $s = \omega^{T} \phi$   ..........(4.1)

  Here, $\omega$ is the weight vector and $\phi$ is the feature vector.

- Extraction of candidate keywords by addition and multiplication: after finding the weight of each candidate keyword, candidate keyword extraction has been performed by addition and multiplication.

- Assigning scores to each candidate word: after extraction of the candidate keywords, scores were assigned to the candidates according to the features below.
1) $F_{TF \times IDF} = \dfrac{E(\text{keywords' } TF \times IDF)}{\sigma_{TF \times IDF}} - \dfrac{E(\text{non-keywords' } TF \times IDF)}{\sigma_{TF \times IDF}}$   ..........(4.2)

2) $F_{PoS} = \dfrac{E(\text{keywords' } PoS)}{\sigma_{PoS}} - \dfrac{E(\text{non-keywords' } PoS)}{\sigma_{PoS}}$   ..........(4.3)

3) $F_{RPFO} = \dfrac{E(\text{keywords' } RPFO)}{\sigma_{RPFO}} - \dfrac{E(\text{non-keywords' } RPFO)}{\sigma_{RPFO}}$   ..........(4.4)

- Sorting the candidate keywords by score: after the scoring process, the candidate keywords were sorted according to their scores.

- Choosing the few top candidates as keywords: finally, from the sorted list, the few top candidate keywords were chosen as keywords.

Those are the features added so far to the previously examined method; a minimal sketch of the resulting pipeline is given after this list of open issues. We have observed that the following issues have not been considered yet:

- Semantic relationships among the weighted keywords in domain-specific keyword extraction with the weighted sum method.

- Structural information in the weighted sum method when calculating the term frequency (tf) and the inverse document frequency (idf), which is important for obtaining better results.
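To make the existing pipeline concrete, the following minimal Python sketch standardizes one feature in the style of Eq. (4.2), scores hypothetical candidates with the weighted sum of Eq. (4.1), sorts them, and keeps the top candidates. Every feature value, training statistic, and weight here is an illustrative assumption, not a value from the reviewed literature.

```python
# Minimal sketch of the weighted-sum keyword scoring pipeline reviewed above.
# All feature values, training statistics, and weights are illustrative
# placeholders, not values from the literature.

def standardized_feature(kw_mean, nonkw_mean, std_dev):
    """Discriminative feature value in the style of Eqs. (4.2)-(4.4):
    keyword mean minus non-keyword mean, each normalized by the std dev."""
    return kw_mean / std_dev - nonkw_mean / std_dev

# e.g. a TF-IDF feature weight computed from (hypothetical) training statistics
f_tfidf = standardized_feature(kw_mean=0.40, nonkw_mean=0.05, std_dev=0.12)

# Hypothetical feature vectors phi = (TF-IDF, PoS, RPFO, chi-square)
candidates = {
    "ontology": (0.42, 0.90, 0.05, 3.1),
    "keyword":  (0.37, 0.80, 0.10, 2.4),
    "the":      (0.02, 0.10, 0.00, 0.2),
}

# Hypothetical weight vector omega; each score is the weighted sum s = omega . phi
omega = (0.5, 0.2, 0.1, 0.2)
scores = {term: sum(w * f for w, f in zip(omega, phi))
          for term, phi in candidates.items()}

# Sort candidates by score and keep the few top ones as keywords
top_n = 2
keywords = sorted(scores, key=scores.get, reverse=True)[:top_n]
print(keywords)  # ['ontology', 'keyword']
```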


4.3 Proposed Solution

Here we provide our proposed solution for finding the semantic relationships among concepts and for establishing a structural relationship among the documents in the set. To determine the weights in the weighted sum method, we will build an ontology, that is, a systematic description of the objective existence of the world in a particular domain. We will build the ontology based on WordNet according to the is-a (child-parent) relationship. Then we will extract the candidate concept set according to Lin's similarity measure:

$sim(x_1, x_2) = \dfrac{2 \times \log P(C_0)}{\log P(C_1) + \log P(C_2)}$   ..........(4.6)

Here $x_1$ and $x_2$ are two concepts in WordNet, and $C_0$ is the deepest common parent of the candidate concepts $C_1$ and $C_2$. $P(C_1)$, $P(C_2)$, and $P(C_0)$ are the probabilities of the emergence of the concepts $C_1$, $C_2$, and $C_0$, respectively. For example, let $C_1$ = Hill and $C_2$ = Coast. The similarity between the concepts Hill and Coast is

$sim(Hill, Coast) = \dfrac{2 \times \log P(Geological\ Formation)}{\log P(Hill) + \log P(Coast)}$   ..........(4.7)

which is equal to 0.59.
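As an illustration, this measure is available off the shelf: NLTK's WordNet interface implements Lin similarity over an information-content corpus. The exact value depends on which corpus supplies the concept probabilities, so it may differ slightly from the 0.59 quoted above.

```python
# Lin similarity over WordNet with NLTK (Eq. 4.6). One-time setup:
#   import nltk; nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')  # concept probabilities from the Brown corpus

hill = wn.synset('hill.n.01')
coast = wn.synset('coast.n.01')

# 2 * log P(c0) / (log P(c1) + log P(c2)), with c0 the deepest common parent
print(hill.lin_similarity(coast, brown_ic))  # close to the 0.59 quoted above
```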


4.4 Building the Ontology

We have built an ontology based on WordNet. WordNet is composed of synsets (synonym sets); that is, WordNet's concepts are organized in the form of synonym sets. We have constructed the ontology with the is_a relationship. For example, oak and pine are both trees:

tree
 +-- oak  (oak is_a tree)
 +-- pine (pine is_a tree)

Our main target is to find the backbone words from WordNet. We establish an initial ontology as an is-a (child-parent) relationship tree over these terms. For the example above, oak and pine are trees, so we regard oak and pine as child nodes of tree in order to construct a basic concept tree as the initial ontology. After building the initial ontology, we will compare the similarity of concepts in WordNet to the concepts contained in the initial ontology with Lin's semantic similarity algorithm [6]:

$sim_L(c_1, c_2) = \dfrac{2 \times IC(iso(c_1, c_2))}{IC(c_1) + IC(c_2)}$   ..........(4.8)

Here, $sim_L(c_1, c_2)$ is the similarity between two concepts $c_1$ and $c_2$ in WordNet, $P(c_1)$ and $P(c_2)$ are the probabilities of concepts $c_1$ and $c_2$, $iso(c_1, c_2)$ is the deepest common parent of concepts $c_1$ and $c_2$, and IC means information content, $IC(c) = -\log p(c)$. If $sim_L(c_1, c_2) > k$, where $c_1$ is a concept in WordNet, $c_2$ is a concept in the initial ontology, and $k$ is a given threshold, we will add concept $c_1$ into the candidate concept set.
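A minimal sketch of this step, again assuming NLTK's WordNet interface; the backbone synsets, the probe words, and the threshold k = 0.25 are illustrative assumptions rather than choices fixed by the proposal.

```python
# Sketch: build a small is_a initial ontology from backbone terms, then
# admit WordNet concepts whose Lin similarity to an ontology concept
# exceeds a threshold k (Eq. 4.8). Synset identifiers, probe words, and
# the value of k are illustrative assumptions.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

# Initial ontology: oak and pine as child nodes under their is_a (hypernym) parents
backbone = [wn.synset('oak.n.02'), wn.synset('pine.n.01')]
ontology = {child: child.hypernyms() for child in backbone}

k = 0.25                        # hypothetical similarity threshold
candidate_concepts = set()

# Compare some WordNet concepts against the initial ontology
probes = wn.synsets('birch', pos=wn.NOUN) + wn.synsets('chair', pos=wn.NOUN)
for c1 in probes:
    if any(c1.lin_similarity(c2, brown_ic) > k for c2 in backbone):
        candidate_concepts.add(c1)

print(candidate_concepts)       # tree-like senses of 'birch' are likely to pass
```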

4.5 Building the Structural Relationship

After finding the similarity among keywords, we will apply the structural relationship to the term frequency tf with the following formula [8]:

$tf_{ij} = \alpha \times tf_{ij1} + \beta \times tf_{ij2} + \gamma \times tf_{ij3}$   ..........(4.9)
For example, a web page is divided into several sections such as title, head, and body. The same term appearing in different positions of a web page should be given different priorities (weights), so the terms in the title should have higher weights than those in the head and body. In the above equation, $tf_{ijk}$ is the frequency of the term in the k-th area, and $\alpha$, $\beta$, $\gamma$ are factors that can be adjusted according to pre-experiments, where $\alpha > \beta > \gamma \geq 1$. A small sketch of this weighting follows.
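A minimal sketch of Eq. (4.9); the factor values below are illustrative and would be tuned by pre-experiments.

```python
# Sketch of the section-weighted term frequency of Eq. (4.9).
# The factors alpha > beta > gamma >= 1 are illustrative values.
ALPHA, BETA, GAMMA = 3.0, 2.0, 1.0  # title, head, body

def structural_tf(tf_title, tf_head, tf_body):
    """tf_ij = alpha*tf_ij1 + beta*tf_ij2 + gamma*tf_ij3."""
    return ALPHA * tf_title + BETA * tf_head + GAMMA * tf_body

# One occurrence in the title (3.0) outweighs two in the body (2.0)
print(structural_tf(1, 0, 0), structural_tf(0, 0, 2))
```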

Similarly, we will apply the structural relationship to the inverse document frequency idf with the following formula:

$idf_j = \log\left(\dfrac{N \times (\alpha + \beta + \gamma)}{\sum_{i=1}^{K} (\alpha_i + \beta_i + \gamma_i)} + 0.01\right)$   ..........(4.10)

Here, K is the document frequency of term j: each of the documents indexed 1 to K that contain the term contributes its section factors $(\alpha_i, \beta_i, \gamma_i)$ to the sum, and N is the total number of web pages.
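A minimal sketch of this computation, with N, K, and the per-document section factors chosen purely for illustration:

```python
# Sketch of the structural idf of Eq. (4.10).
# N, K, and the per-document section factors are illustrative values.
import math

ALPHA, BETA, GAMMA = 3.0, 2.0, 1.0       # section factors, as in Eq. (4.9)
N = 1000                                  # total number of web pages
docs_with_term = [(3.0, 2.0, 1.0)] * 40   # K = 40 documents contain term j

def structural_idf(n_total, section_factors):
    numerator = n_total * (ALPHA + BETA + GAMMA)
    denominator = sum(a + b + g for a, b, g in section_factors)
    return math.log(numerator / denominator + 0.01)

print(structural_idf(N, docs_with_term))  # ~3.22, i.e. log(1000/40 + 0.01)
```

Note that when every document carries identical section factors, the expression reduces to the familiar $\log(N/K + 0.01)$; the section factors only shift the idf when documents weight their sections differently.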

