Graph-Based Text Representations PPT

Graph-based representations provide alternatives to the traditional Bag-of-Words model for text representation. A Graph-of-Words model represents a document as a graph where nodes are terms and edges represent co-occurrence of terms within a sliding window. This model captures word order and dependence, unlike Bag-of-Words. In-degree weighting in the Graph-of-Words representation counts the number of distinct contexts in which a term occurs. Graph-based models have been applied to problems like sentiment analysis of tweets by building a graph connecting unique terms that co-occur within tweets.


Graph-based Text Representations:
Information Retrieval

Fragkiskos D. Malliaros, UC San Diego
Michalis Vazirgiannis, École Polytechnique

Presented by: Saikat Mondal, MSc (20419CMP020)
Banaras Hindu University, Varanasi, India
May 24, 2022

Updated slides at: http://fragkiskosm.github.io/projects/graph_text_tutorial
Overview

•  Text Representation
•  Bag-of-Words Model
•  Graph Semantics
•  Graph-of-Words (GoW) Model
•  In-degree based TW

[Figure: example graph-of-words over the terms "world", "the", "barks", "dog", "paris"]
Text Representation

Since text is unstructured, a document is usually converted into a common representation like the Bag-of-Words model.

Nowadays, the most commonly used text representation model in the area of Information Retrieval is the Vector Space Model (VSM).

[Pipeline figure: Textual Data → Feature Extraction → Term Weighting → Model Learning → Text Categorization → Evaluation]

Applications of Text Mining:

•  Opinion mining (sentiment analysis)
•  Email spam classification
•  Web-page classification
•  …
Bag-of-Words (BoW) – Issues

Example text:
information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources

Bag of words: [(activity, 1), (collection, 1), (information, 4), (relevant, 1), (resources, 2), (retrieval, 1), …]

Assumptions made by the BoW model:

•  Term independence assumption
•  Term frequency weighting
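The bag of words above can be reproduced in a few lines of Python. This is an illustrative sketch, not the tutorial's code: tokenization is plain whitespace splitting, with no stemming or stop-word removal.

```python
from collections import Counter

def bag_of_words(text):
    """Tokenize on whitespace and count term frequencies."""
    return Counter(text.lower().split())

text = ("information retrieval is the activity of obtaining "
        "information resources relevant to an information need "
        "from a collection of information resources")

bow = bag_of_words(text)
print(bow["information"])  # 4
print(bow["resources"])    # 2
```

Note that the counts discard all word order and distance information, which is exactly the limitation the graph-based models below address.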
Graph-based Document Representation

•  Challenge the term independence and term frequency weighting assumptions by taking into account word dependence, order and distance
•  Employ a graph-based document representation capturing the above
•  Graphs have been successfully used in IR to capture relations and propose meaningful weights (e.g., PageRank)
Graph-based Text Representations

Graph Semantics

•  Let G = (V, E) be the graph that corresponds to a document d
•  The nodes can correspond to:
   –  Paragraphs
   –  Sentences
   –  Phrases
   –  Words
   –  Syllables
•  The edges of the graph can capture various types of relationships between two nodes:
   –  Co-occurrence within a window over the text
   –  Syntactic relationship
   –  Semantic relationship
Graph-of-Words (GoW) Model

•  Each document d ∈ D is represented by a graph Gd = (Vd, Ed), where the nodes correspond to the terms t of the document and the edges capture co-occurrence relationships between terms within a fixed-size sliding window of size w
•  Directed vs. undirected graph
   –  Directed graphs are able to preserve the actual flow of the text
   –  In undirected graphs, an edge captures co-occurrence of two terms regardless of their respective order
•  Weighted vs. unweighted graph
   –  The higher the number of co-occurrences of two terms in the document, the higher the weight of the corresponding edge
•  Size w of the sliding window
   –  Edges are added between terms of the document that co-occur within a sliding window of size w
   –  Larger window sizes produce relatively dense graphs
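The construction just described can be sketched without any graph library, storing edges as a weight map. This is a minimal illustrative implementation, assuming tokens are already pre-processed (e.g., lower-cased and stemmed):

```python
from collections import defaultdict

def graph_of_words(tokens, w=3, directed=False):
    """Build a graph-of-words: nodes are terms, and an edge links two
    terms that co-occur within a sliding window of size w. The edge
    weight counts the number of such co-occurrences."""
    edges = defaultdict(int)
    for i, source in enumerate(tokens):
        # connect the current term to the next w-1 terms, i.e. all
        # terms that share a window of size w with it
        for target in tokens[i + 1:i + w]:
            if source == target:
                continue
            pair = (source, target) if directed else tuple(sorted((source, target)))
            edges[pair] += 1
    return dict(edges)

tokens = "information retrieval is the activity".split()
g = graph_of_words(tokens, w=3)
# e.g. g[("information", "retrieval")] == 1
```

With `directed=True` the edge direction follows the flow of the text (from earlier to later term), which is the variant used for in-degree weighting below.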
Example of Unweighted GoW

Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field of data mining and predictive analytics, also known as knowledge discovery and data mining.

[Figure: unweighted, undirected graph-of-words with w = 3 over the stemmed terms: scienc, extract, knowledg, larg, volum, data, structur, unstructur, continu, field, mine, predict, analyt, known, discoveri]
Example of Weighted Undirected GoW

Mathematical aspects of computer-aided share trading. We consider problems of statistical analysis of share prices and propose probabilistic characteristics to describe the price series. We discuss three methods of mathematical modelling of price series with given probabilistic characteristics.

[Figure: weighted, undirected graph-of-words over the stemmed terms mathemat, aspect, trade, share, price, problem, statist, analysi, probabilist, characterist, seri, model, method; edge weights range from 1 to 5]
In-degree based TW

•  The weight of a term in a document is its in-degree in the graph-of-words
•  It represents the number of distinct contexts of occurrence
•  We store the document as a vector of weights in the direct index, and similarly in the inverted index
•  For example:

   information 5
   retrieval 1
   is 2
   the 2
   activity 2
   of 3
   obtaining 2
   resources 3
   relevant 2
   to 2
   an 2
   need 2
   from 2
   a 2
   collection 2

   Bag of words: [(activity, 1), (collection, 1), (information, 4), (relevant, 1), (resources, 2), (retrieval, 1), …]
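In-degree weighting can be illustrated by building the directed graph (edges from earlier to later terms) and counting each term's distinct in-neighbours. This is a sketch under assumed pre-processing; the window size behind the slide's exact numbers is not stated, so none is claimed here.

```python
from collections import defaultdict

def indegree_term_weights(tokens, w=3):
    """Term weight = in-degree in the directed graph-of-words,
    i.e. the number of distinct terms that precede it within
    a sliding window of size w."""
    predecessors = defaultdict(set)   # term -> set of distinct in-neighbours
    for i, source in enumerate(tokens):
        for target in tokens[i + 1:i + w]:
            if source != target:
                predecessors[target].add(source)
    return {t: len(preds) for t, preds in predecessors.items()}

weights = indegree_term_weights("information retrieval is the activity".split(), w=3)
# "is" is preceded by both "information" and "retrieval" within the
# window, so weights["is"] == 2
```

Because only distinct in-neighbours are counted, repeating a term in the same context does not inflate its weight, unlike raw term frequency.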
Continued...

Term t, document d, collection of size N, term weight tw(t, d), document frequency df(t), document length |d|, average document length avdl, asymptotical marginal gain k1 (= 1.2), slope parameter b:

   TW-IDF(t, d) = [ tw(t, d) / (1 − b + b × |d| / avdl) ] × log( (N + 1) / df(t) )

•  In the bag-of-words representation, tw is usually defined as the term frequency, or sometimes just the presence/absence of a term (binary tf)
•  In the graph-of-words representation, tw is the in-degree of the vertex representing the term in the graph
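The formula translates directly into code. A sketch, assuming the pivoted-normalization form above; the default value of the slope parameter b is illustrative and would be tuned in practice (k1 does not appear because the formula does not use it):

```python
import math

def tw_idf(tw, doc_len, avdl, df_t, N, b=0.003):
    """TW-IDF score of a term in a document: pivoted document-length
    normalization of the term weight tw, times an IDF factor."""
    pivoted_tw = tw / (1 - b + b * doc_len / avdl)
    return pivoted_tw * math.log((N + 1) / df_t)

# For a document of average length the normalizer is exactly 1,
# so the score reduces to tw * log((N + 1) / df).
score = tw_idf(tw=5, doc_len=100, avdl=100, df_t=10, N=1000)
```

Substituting in-degree for term frequency is the only change needed to move a TF-IDF ranking pipeline to TW-IDF.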
Graph-based Representation of Tweets

•  A single graph represents all the input tweets
•  Nodes: unique terms
•  Edges: weighted by the number of co-occurrences within a tweet

Example tweets:
1.  Good goal by Neymar
2.  Goal! Neymar scores for brazil
3.  Goal!! Neymar scores again
4.  Watching the game tonight

Dataset: tweets from the 2014 FIFA World Cup in Brazil

[Figure: the graph built from the 4 example tweets]
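The collection-level graph for the four example tweets can be sketched as follows. Illustrative code only: lower-casing and stripping "!" stand in for real tweet pre-processing.

```python
from collections import defaultdict
from itertools import combinations

def tweet_graph(tweets):
    """One graph for the whole collection: nodes are unique terms,
    edge weight = number of tweets in which the two terms co-occur."""
    edges = defaultdict(int)
    for tweet in tweets:
        # crude normalization; a real pipeline would tokenize properly
        terms = set(tweet.lower().replace("!", "").split())
        for u, v in combinations(sorted(terms), 2):
            edges[(u, v)] += 1
    return dict(edges)

tweets = [
    "Good goal by Neymar",
    "Goal! Neymar scores for brazil",
    "Goal!! Neymar scores again",
    "Watching the game tonight",
]
g = tweet_graph(tweets)
# "goal" and "neymar" co-occur in tweets 1-3, so g[("goal", "neymar")] == 3
```

Heavy edges such as (goal, neymar) surface the dominant topic of the collection, which is what makes this representation useful for tweet sentiment and topic analysis.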

THANK YOU
