Web Search Engine
AGENDA
1 INTRODUCTION
2 RELATED WORK
3 METHOD
4 EXPERIMENTS
5 CONCLUSION
INTRODUCTION
Accurately measuring the semantic similarity between words is an important problem in web mining, information retrieval, and natural language processing.
Semantically related words of a particular word are listed in manually created general-purpose lexical ontologies such as WordNet.
We propose an automatic method to estimate the semantic similarity between words or entities using web search engines.
Page counts and snippets are two useful information sources provided by most web search engines.
The page count of a query is an estimate of the number of pages that contain the query words.
A snippet is a brief window of text extracted by a search engine around the query term in a document.
RELATED WORK
Resnik [8] proposed a similarity measure using information content.
Li et al. [9] combined structural semantic information from a lexical taxonomy and information content from a corpus in a nonlinear model.
Cilibrasi and Vitanyi [12] proposed a distance metric between words using only page counts retrieved from a web search engine.
Sahami and Heilman [2] measured semantic similarity between two queries using snippets returned for those queries by a search engine.
Chen et al. [4] proposed a double-checking model using text snippets returned by a web search engine to compute semantic similarity between words.
In query expansion [18], a user query is modified using synonymous words to improve the relevance of the search.
METHOD
Given two words P and Q, we seek a similarity function sim(P, Q).
If P and Q are highly similar, then sim(P, Q) → 1.
Page Count-Based Co-Occurrence Measures
Example: for the query car AND automobile, the page count is 11,300,000.
We consider four popular co-occurrence measures: Jaccard, Overlap (Simpson), Dice, and pointwise mutual information (PMI). We use the notation H(P) to denote the page count for the query P in a search engine.
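The four measures can be sketched from page counts as follows. This is a minimal illustration using the textbook definitions of Jaccard, Overlap, Dice, and PMI. The function names, the noise threshold c, and the assumed index size n_docs are illustrative assumptions and are not given on the slide; h_pq stands for the page count of the conjunctive query P AND Q.

```python
import math

def web_jaccard(h_p, h_q, h_pq, c=5):
    # Treat very small co-occurrence counts as noise (c is an assumed threshold).
    if h_pq <= c:
        return 0.0
    return h_pq / (h_p + h_q - h_pq)

def web_overlap(h_p, h_q, h_pq, c=5):
    if h_pq <= c:
        return 0.0
    return h_pq / min(h_p, h_q)

def web_dice(h_p, h_q, h_pq, c=5):
    if h_pq <= c:
        return 0.0
    return 2 * h_pq / (h_p + h_q)

def web_pmi(h_p, h_q, h_pq, n_docs=10**10, c=5):
    # n_docs is an assumed number of indexed pages, needed to turn counts
    # into probabilities.
    if h_pq <= c:
        return 0.0
    joint = h_pq / n_docs
    return math.log2(joint / ((h_p / n_docs) * (h_q / n_docs)))
```

For example, with H(P) = 100, H(Q) = 50, and H(P AND Q) = 25, web_jaccard returns 25 / 125 = 0.2 and web_overlap returns 25 / 50 = 0.5.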
Lexical Pattern Extraction
The parameters:
L: the maximum length of a subsequence is L words.
g: do not skip more than g words consecutively.
G: the total number of words skipped in a subsequence should not exceed G.
T: we count the frequency of all generated subsequences and use only subsequences that occur more than T times as lexical patterns.
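The role of the four parameters can be sketched as below. This is a minimal illustration, not the original implementation: it assumes the two query words in each snippet have already been replaced by the placeholders X and Y, that tokens are separated by whitespace, and that only subsequences containing both placeholders are of interest. The function names are hypothetical.

```python
from collections import Counter

def subsequences(tokens, L, g, G):
    """Return all subsequences of at most L words that skip at most g
    consecutive words and at most G words in total."""
    found = set()

    def extend(indices, skipped):
        found.add(tuple(tokens[i] for i in indices))
        if len(indices) == L:
            return
        last = indices[-1]
        # The next word may lie at most g positions past the adjacent word.
        for nxt in range(last + 1, min(last + g + 1, len(tokens) - 1) + 1):
            gap = nxt - last - 1
            if skipped + gap <= G:
                extend(indices + [nxt], skipped + gap)

    for start in range(len(tokens)):
        extend([start], 0)
    return found

def extract_patterns(snippets, L, g, G, T):
    """Count subsequences over all snippets and keep those seen more than T times."""
    counts = Counter()
    for snippet in snippets:
        for seq in subsequences(snippet.split(), L, g, G):
            # Assumption: a lexical pattern must mention both placeholders.
            if "X" in seq and "Y" in seq:
                counts[" ".join(seq)] += 1
    return {pattern for pattern, n in counts.items() if n > T}
```

With the snippets "X is a Y" and "X is a large Y" and the settings L = 4, g = 2, G = 2, T = 1, the patterns that occur in both snippets are kept, while patterns seen only once are dropped.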
Lexical Pattern Clustering
Typically, a semantic relation can be expressed using more than one pattern. For example, "X is a Y" and "X is a large Y" both indicate an is-a relation between X and Y.
Measuring Semantic Similarity
A pair of words (P, Q) is represented by an (N + 4)-dimensional feature vector f_PQ.
The first N features correspond to the N clusters of lexical patterns.
The (N + 1)st, (N + 2)nd, (N + 3)rd, and (N + 4)th features are set, respectively, to WebJaccard, WebOverlap, WebDice, and WebPMI.
We assign a weight w_ij to a pattern a_i that is in a cluster c_j as follows:
Finally, we compute the value of the jth feature in the feature vector for a word pair (P, Q) as follows:
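One way to read the weighting and the cluster features is sketched below. It rests on two assumptions that are not stated on the slides: the weight of a pattern is its share of the total pattern frequency within its cluster, and the jth feature is the weighted sum of how often the pair (P, Q) occurs with each pattern of cluster c_j. All names are hypothetical.

```python
def cluster_features(pair_counts, clusters, total_freq):
    """Compute one feature per pattern cluster for a single word pair.

    pair_counts: pattern -> number of times the pair occurs with that pattern
    clusters:    list of clusters, each a list of patterns
    total_freq:  pattern -> total frequency of the pattern over all word pairs
    """
    features = []
    for cluster in clusters:
        norm = sum(total_freq[a] for a in cluster)
        value = 0.0
        for a in cluster:
            weight = total_freq[a] / norm  # assumed weighting scheme
            value += weight * pair_counts.get(a, 0)
        features.append(value)
    return features
```

Appending the four page-count measures to the returned list would give the full (N + 4)-dimensional vector.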
We train a two-class SVM to detect synonymous and nonsynonymous word pairs, using a labeled training set S = {(P_k, Q_k, y_k)}, where y_k is the class label of the pair (P_k, Q_k).
Training
We randomly select 3,000 nouns from WordNet and extract a pair of synonymous words from a synset of each selected noun.
We extract nonsynonymous word pairs using a random shuffling technique.
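A random shuffling step of this kind might look like the sketch below. The details are assumptions: the second words of the synonymous pairs are shuffled, and any resulting pair that is still a known synonymous pair is discarded. The function name and the fixed seed are hypothetical, chosen only to make the example reproducible.

```python
import random

def shuffle_negatives(synonym_pairs, seed=0):
    """Create candidate nonsynonymous pairs by shuffling the second words."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in synonym_pairs}
    firsts = [p for p, _ in synonym_pairs]
    seconds = [q for _, q in synonym_pairs]
    rng.shuffle(seconds)
    negatives = []
    for p, q in zip(firsts, seconds):
        # Drop pairs that the shuffle left (or made) synonymous.
        if p != q and frozenset((p, q)) not in known:
            negatives.append((p, q))
    return negatives
```

A real pipeline would also need to check the new pairs against WordNet itself, since two shuffled words can share a synset that was not sampled.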
We determine the clustering threshold as follows. Let W denote the set of synonymous word pairs, and let f_W be the centroid vector of all feature vectors representing synonymous word pairs.
Next, we compute the average Mahalanobis distance, D(θ), for a clustering threshold θ.
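The average Mahalanobis distance from a centroid can be sketched in plain Python as below. The sketch uses the standard definition with a square root and the sample covariance of the given vectors; both choices are assumptions, as are the function names. Instead of inverting the covariance matrix, it solves a linear system by Gaussian elimination.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def covariance(vectors):
    """Sample covariance matrix (divides by n - 1)."""
    mu = mean_vector(vectors)
    n, dim = len(vectors), len(mu)
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        d = [x - m for x, m in zip(v, mu)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return cov

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis(u, v, cov):
    d = [a - b for a, b in zip(u, v)]
    return math.sqrt(sum(di * xi for di, xi in zip(d, solve(cov, d))))

def average_distance(vectors):
    """Mean Mahalanobis distance between the centroid and each vector."""
    centroid = mean_vector(vectors)
    cov = covariance(vectors)
    return sum(mahalanobis(centroid, v, cov) for v in vectors) / len(vectors)
```

With an identity covariance matrix the measure reduces to the Euclidean distance, which is a convenient sanity check.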
Finally, we set the optimum value of the clustering threshold to the value that minimizes the average distance D.
Illustration:
index   0        1        2
value   X1 = 3   X2 = 5   X3 = 1
EXPERIMENTS
Benchmark Data Sets
Miller-Charles (MC: 28 pairs, 38 annotators), Pearson correlation
Rubenstein-Goodenough (RG: 65 pairs, 36 annotators), Spearman correlation
WordSimilarity (WS: 353 pairs, 13 annotators), Spearman correlation
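Both correlation coefficients can be computed with a few lines of plain Python. The sketch below uses the standard definitions: Pearson correlation on the raw scores, and Spearman correlation as Pearson correlation on ranks, with tied values receiving their average rank. The function names are illustrative.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))
```

Here xs would hold the human ratings of a benchmark and ys the scores produced by a similarity measure for the same word pairs.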
Semantic Similarity
similarity measures
human ratings
Community Mining
We select 50 personal names from five communities (tennis players, golfers, actors, politicians, and scientists) from the Open Directory Project (DMOZ).
We compute precision, recall, and F-score for each person p. We denote the cluster that p belongs to by C(p), and we use A(p) to denote the affiliation of person p, e.g., A(Tiger Woods) = Golfer.
The F-score of person p is defined as
F(p) = 2 × Precision(p) × Recall(p) / (Precision(p) + Recall(p))
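Per-person scores can be sketched as below. The precision and recall definitions are assumptions based on the usual cluster-evaluation convention: precision is the fraction of people in C(p) who share the affiliation A(p), and recall is the fraction of all people with affiliation A(p) who ended up in C(p). The F-score is the harmonic mean of the two. Apart from Tiger Woods, the names in the usage example are placeholders.

```python
def person_scores(person, clusters, affiliation):
    """Return (precision, recall, F-score) for one person.

    clusters:    list of sets of names
    affiliation: name -> community label, playing the role of A(p)
    """
    cluster = next(c for c in clusters if person in c)  # plays the role of C(p)
    label = affiliation[person]
    same_in_cluster = sum(1 for q in cluster if affiliation[q] == label)
    same_overall = sum(1 for q in affiliation if affiliation[q] == label)
    precision = same_in_cluster / len(cluster)
    recall = same_in_cluster / same_overall
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Usage with placeholder names.
affiliation = {
    "Tiger Woods": "Golfer",
    "golfer_2": "Golfer",
    "tennis_1": "Tennis Player",
    "tennis_2": "Tennis Player",
}
clusters = [{"Tiger Woods", "tennis_1"}, {"golfer_2", "tennis_2"}]
scores = person_scores("Tiger Woods", clusters, affiliation)
```

Averaging the per-person scores over all names would give a single figure for a clustering.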
CONCLUSION
We proposed a semantic similarity measure for two words using both page counts and snippets retrieved from a web search engine.
Both page count-based co-occurrence measures and lexical pattern clusters were used to define features for a word pair.
Experimental results on three benchmark data sets showed that the proposed method outperforms various baselines as well as previously proposed web-based semantic similarity measures.