10212cs214 Data Visualization Unit III 19.02.2024

School of Computing
Department of Computer Science & Engineering

10212CS214 - Data Visualization
Category: Program Elective
UNIT-III

Course Handling Faculty: Dr. M. Kavitha, Professor
8/29/2020 Dr.M.Kavitha Department of Computer Science & Engineering Data 1
Visualization
Course Outcomes

CO3: Explore visualization techniques for trees, graphs, networks, text and documents.
Level of learning domain (based on revised Bloom's taxonomy): K3

CO mapping table, column headings: CO Nos.; Engineering Knowledge; Problem Analysis; Design/Development of Solutions; Conduct Investigations of Complex Problems; Modern Tool Usage; The Engineer & Society; Environment & Sustainability; Ethics; Individual & Team Work; Communication; Project Management & Finance; Life Long Learning; Mathematical Concepts; Software Development; Transferring Skills.

CO3 row: 3 3 3 3 2 1 2 3 3 (empty cells were lost in extraction, so the values cannot be aligned to specific columns)

Course Content

Unit 3: Visualization for Trees, Graphs, Networks and Text (L - 9 Hours)

Displaying Hierarchical Structures - Displaying Arbitrary Graphs/Networks - Issues. Levels of Text Representations - The Vector Space Model - Single Document Visualizations - Document Collection Visualizations - Extended Text Visualizations - Designing Effective Visualizations - Steps in Designing Visualizations - Problems - Comparing and Evaluating Visualization Techniques.

Visualization Techniques for Trees, Graphs, and Networks

▪ Trees or hierarchies are one of the most common structures used to hold relational information.
▪ For this reason, many visualization techniques have been developed for displaying such information.
▪ These techniques can be divided into two classes of algorithms: space-filling and non-space-filling.

Tree Representations

Space Filling Method

▪ Space-filling techniques make maximal use of the display space.
▪ This is accomplished by using juxtaposition to imply relations, as opposed to, for example, conveying relations with edges joining data objects.
▪ The two most common approaches to generating space-filling hierarchies are rectangular and radial layouts.
▪ Treemaps and their many variants are the most popular form of rectangular space-filling layout.
▪ In the basic treemap, a rectangle is recursively divided into slices, alternating horizontal and vertical slicing, based on the populations of the subtrees at a given level.

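The recursive slicing of the basic treemap can be sketched in a few lines of Python. This is only an illustrative sketch, not the pseudocode from the slide: the `Node` class, the `population` helper, and the sample tree are assumptions made for the example, and "population" is taken to mean the summed size of the leaves under a node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; `size` is only meaningful for leaves."""
    name: str
    size: float = 0.0
    children: list = field(default_factory=list)


def population(node):
    """Total size of all leaves below (or at) this node."""
    if not node.children:
        return node.size
    return sum(population(c) for c in node.children)


def treemap(node, x, y, w, h, horizontal=True, out=None):
    """Slice-and-dice layout; returns [(name, x, y, w, h)] for every leaf."""
    if out is None:
        out = []
    if not node.children:
        out.append((node.name, x, y, w, h))
        return out
    total = population(node)
    offset = 0.0  # fraction of the rectangle already handed out
    for child in node.children:
        frac = population(child) / total if total else 0.0
        if horizontal:   # split the width, keep the full height
            treemap(child, x + offset * w, y, w * frac, h, False, out)
        else:            # split the height, keep the full width
            treemap(child, x, y + offset * h, w, h * frac, True, out)
        offset += frac
    return out


tree = Node("root", children=[
    Node("a", size=2),
    Node("b", children=[Node("b1", size=1), Node("b2", size=1)]),
])
rects = treemap(tree, 0, 0, 100, 100)
for rect in rects:
    print(rect)
```

The slicing direction flips at each level, so siblings sit side by side at one depth and stack vertically at the next; the leaf rectangles always tile the full drawing area.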
[Image-only slides: Pseudocode for drawing a hierarchy using a treemap; Tree Map Display; Pseudocode for drawing a hierarchy using a sunburst display; Sunburst display]

Non Space Filling Method

▪ The most common representation used to visualize tree or hierarchical relationships is a node-link diagram.
▪ Organizational charts, family trees, and tournament pairings are just some of the common applications for such diagrams.
▪ The drawing of such trees is influenced the most by two factors: the fan-out degree (i.e., the number of children a parent node can have) and the depth (i.e., the distance from the root to the furthest node).
▪ Trees that are significantly constrained in one or both of these aspects, such as a binary tree or a tree with only three or four levels, tend to be much easier to draw than those with fewer constraints.
▪ When designing an algorithm for drawing any node-link diagram (not just trees), one must consider three categories of often-contradictory guidelines: drawing conventions, constraints, and aesthetics.
▪ Conventions may include restricting edges to be either a single straight line, a series of rectilinear lines, polygonal lines, or curves.
▪ Other conventions might be to place nodes on a fixed grid, or to have all sibling nodes share the same vertical position.

Aesthetics often have a significant impact on the interpretability of a tree or graph drawing, yet often result in conflicting guidelines. Some typical aesthetic rules include:
▪ minimize line crossings
▪ maintain a pleasing aspect ratio
▪ minimize the total area of the drawing
▪ minimize the total length of the edges
▪ minimize the number of bends in the edges
▪ minimize the number of distinct angles or curvatures used
▪ strive for a symmetric structure

For trees, especially balanced ones, it is relatively easy to design algorithms that adhere to many, if not most, of these guidelines. For example, a simple tree drawing procedure is given below:

1. Slice the drawing area into equal-height slabs, based on the depth of the tree.
2. For each level of the tree, determine how many nodes need to be drawn.
3. Divide each slice into equal-sized rectangles based on the number of nodes at that level.
4. Draw each node in the center of its corresponding rectangle.
5. Draw a link from the center-bottom of each node to the center-top of its child node(s).

Many enhancements can be made to this rather basic algorithm in order to improve space utilization and move child nodes closer to their parents.

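The five-step procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the tree is given as a hypothetical `children` dictionary, nodes are treated as points, so links run between node centers rather than from center-bottom to center-top, and nothing is actually drawn; the function only computes coordinates.

```python
def layout_tree(children, root, width, height):
    """children: dict mapping a parent name to a list of child names.

    Returns (positions, links) where positions maps node -> (x, y)
    and links is a list of ((x1, y1), (x2, y2)) parent-to-child segments.
    """
    # Steps 1-2: collect the nodes of each level with a breadth-first walk.
    levels = [[root]]
    while True:
        nxt = [c for n in levels[-1] for c in children.get(n, [])]
        if not nxt:
            break
        levels.append(nxt)

    slab_h = height / len(levels)           # Step 1: equal-height slabs
    pos = {}
    for depth, nodes in enumerate(levels):
        cell_w = width / len(nodes)         # Step 3: equal-sized rectangles
        for i, n in enumerate(nodes):
            # Step 4: place the node at the center of its rectangle.
            pos[n] = ((i + 0.5) * cell_w, (depth + 0.5) * slab_h)

    # Step 5: one link per parent-child pair (center to center here).
    links = [(pos[p], pos[c]) for p in children for c in children[p]]
    return pos, links


children = {"A": ["B", "C"], "B": ["D"]}    # illustrative tree
pos, links = layout_tree(children, "A", 120, 90)
print(pos)
```

Because every level is divided evenly regardless of where the parents sit, children can end up far from their parents; that is the weakness the enhancements on the next slide address.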
Improving Space Utilization

▪ Rather than using even spacing and centering, divide each level based on the number of terminal nodes belonging to each subtree.
▪ Spread terminal nodes evenly across the drawing area and center parent nodes above them.
▪ Add some buffer space between adjacent nonsibling nodes to emphasize relationships.
▪ If possible, reorder the subtrees of a node to achieve more symmetry and balance.
▪ Position the root node in the center of the display and lay out child nodes radially, rather than vertically.

Cone Tree Display

Displaying Arbitrary Graphs/Networks

▪ A tree is a connected, unweighted, acyclic graph.
▪ There are many other possibilities, including graphs with weighted edges, undirected graphs, graphs with cycles, disconnected graphs, and so on.
▪ The graphs considered here are assumed to be undirected, though some of the techniques presented are easily extended to directed graphs.
▪ Two distinct graph drawing approaches are covered: node-link diagrams (building on the material from the previous section) and matrix displays.

• A face is a partition of the plane isolated by a set of connected vertices.
• A neighbor set is a counter-clockwise listing of the vertices incident to a particular vertex.
• A planar embedding is a class of planar graph drawings with the same neighbor sets for each vertex. A planar graph can have an exponential number of such embeddings.
• A cutvertex is any node that causes the graph to be disconnected if it is removed.
• A biconnected graph is one without a cutvertex.
• A block is a maximally biconnected subgraph of a graph.
• A separating pair means two vertices whose removal causes a biconnected graph to become disconnected.
• A triconnected graph is one without a separating pair. A planar triconnected graph has a unique embedding.

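The definitions of cutvertex and biconnected graph translate almost directly into code. The sketch below is a deliberately naive, brute-force check (remove each vertex in turn and test connectivity), not an efficient algorithm; the adjacency-dictionary format and the two sample graphs are assumptions for illustration, and the input graph is assumed to be connected.

```python
def is_connected(adj, skip=None):
    """True if the graph is connected once vertex `skip` is ignored."""
    nodes = [n for n in adj if n != skip]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nb in adj[stack.pop()]:
            if nb != skip and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return len(seen) == len(nodes)


def cutvertices(adj):
    """Vertices whose removal disconnects a (connected) graph."""
    return {v for v in adj if not is_connected(adj, skip=v)}


def is_biconnected(adj):
    """Connected and without any cutvertex."""
    return is_connected(adj) and not cutvertices(adj)


# Two triangles sharing vertex "c": "c" is the only cutvertex.
bowtie = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
# A 4-cycle: removing any single vertex leaves a path, so it is biconnected.
square = {"p": {"q", "s"}, "q": {"p", "r"}, "r": {"q", "s"}, "s": {"r", "p"}}

print(cutvertices(bowtie), is_biconnected(square))
```

In the bowtie example each triangle is a block in the sense defined above.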
Biconnected Graph

• Given a biconnected graph G and a separating cycle C, planarity can be tested as follows:
1. Compute all the pieces of G with respect to C.
2. For each piece P that is not a simple path (i.e., that contains a cycle):
   (a) Create graph G′ consisting of P plus C.
   (b) Create cycle C′ consisting of a path through P plus the section of C joining the ends.
   (c) Apply the algorithm to (G′, C′). If the result is nonplanar, G is nonplanar.
3. Compute the interlacement graph I of the pieces of G.
4. If I is not bipartite, G is nonplanar; else G is planar.

If a graph is nonplanar, we can make it planar using the following strategy:
1. Determine the largest planar subgraph of the graph.
2. For the remaining vertices, place each within a face that minimizes the number of edge crossings.
3. For each edge crossing, break the edges into two parts each, and connect the broken ends to a new dummy vertex.

Six different ways for networks to display complexity:
• structural complexity (edges are tangled),
• network evolution (the network evolves over time),
• connection diversity (weights/directions/signs of edges),
• dynamical complexity (node states can vary with time),
• node diversity (different types of nodes),
• meta-complication.

Matrix Representation

Network Graph

● A network, or graph, is a representation of entities that have relationships among themselves.
● It is made up of a collection of two generic objects:
  (1) node: represents an entity, and
  (2) edge: represents the connection between any two nodes.
● In a complex network, we also have attributes or features associated with each node and edge.
● For example, a person represented as a node may have attributes like age, gender, salary, etc. Similarly, an edge between two persons that represents a ‘friend’ connection may have attributes like friends_since, last_meeting, etc.
● Because of this complex nature, it becomes imperative that we present a network intuitively, such that it showcases as much information as possible.
● Network graphs (or diagrams) are useful for visualizing connections between entities. For example, subway maps are one of the most frequently encountered network graphs.
● Nodes are the labels in the data to be visualized. The relationship between these nodes is expressed by the lines. Therefore, the data to be visualized must be in a format that includes node information and ‘from-to’ data.

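One possible way to hold node information and 'from-to' data with attributes is shown below. The names, ages, and years are made-up sample values, and the dictionary layout is just one plausible format assumed for the example, not a format required by any particular library.

```python
# Node table: one entry per entity, with its attributes (sample values).
nodes = {
    "alice": {"age": 34},
    "bob": {"age": 29},
    "carol": {"age": 41},
}

# Edge table: 'from-to' pairs plus edge attributes (sample values).
edges = [
    {"from": "alice", "to": "bob", "friends_since": 2015},
    {"from": "bob", "to": "carol", "friends_since": 2019},
]


def build_adjacency(nodes, edges):
    """Turn the from-to list into an undirected adjacency structure."""
    adj = {name: {} for name in nodes}
    for e in edges:
        attrs = {k: v for k, v in e.items() if k not in ("from", "to")}
        adj[e["from"]][e["to"]] = attrs
        adj[e["to"]][e["from"]] = attrs   # friendship is symmetric
    return adj


adj = build_adjacency(nodes, edges)
print(sorted(adj["bob"]))                  # neighbours of bob
print(adj["alice"]["bob"]["friends_since"])
```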
Network Graph

● NetworkX has its own drawing module, which provides multiple options for plotting. Using any of them is fairly easy: call the drawing function, pass it the graph variable G, and the package does the rest.
● PyVis is an interactive network visualization Python package that takes a NetworkX graph as input. It also provides multiple styling options to customize the nodes, edges, and even the complete layout. These can be adjusted on the go using a settings pane, where you can play with the various options and export the final settings in the form of a Python dictionary. This dictionary can later be passed as config while calling the function, reproducing the network drawing as it was.

# import pyvis
from pyvis.network import Network

# G is assumed to be an existing NetworkX graph
net = Network(notebook=True)
net.from_nx(G)
net.show("example.html")

Visdcc in Dash
● One major drawback of the previous options is that they are difficult to use with interactive dashboards like Dash. Apart from supporting manual interactions like select and zoom, a package should automatically adjust to programmatic interactions like a change in data or a change in properties. This feature is supported by visdcc, which is a port of visjs to Python. This makes it fairly easy to modify the graph, or even selected properties of the graph, through callbacks, which in Dash can be connected to widgets like buttons or radio select options.

Text and Documents
• The most obvious tasks for text and documents are searching for a word, phrase, or topic.
• For partially structured data, the task is to search for relationships between words, phrases, topics, or documents.
• For structured text or document collections, the key task is most often searching for patterns and outliers within the text or documents.
• A collection of documents is called a corpus (plural corpora). Text visualization deals with objects within corpora.
• The objects can be words, sentences, paragraphs, documents, or even collections of documents. Images and videos may also be considered.
• The objects are considered atomic with respect to the task, analysis, and visualization.
• Text and documents are often minimally structured and may be rich with attributes and metadata, especially when focused on a specific application domain.
• For example, documents have a format and often include metadata about the document (e.g., author, date of creation, date of modification, comments, size).
• Information retrieval systems are used to query corpora, which requires computing the relevance of a document with respect to a query. This requires document preprocessing and interpretation of the semantics of text.
• Statistics about documents can also be computed.

[Image-only slides: A Little Experiment; A Brief History; Text; Text as Visualization; Visualization for “Raw” Text]

Visualizing text (features)

Requires a transformation step: discretization, aggregation, normalization, ...

Structured Text Features

Typical Steps of Processing to derive Text Features

• Large collections require pre-processing of text to extract information and align text. Typical steps are:
• Cleaning (regular expressions)
• Sentence splitting
• Change to lower case
• Stopword removal (most frequent words in a language)
• Stemming - demo: Porter stemmer
• POS tagging (part of speech) - demo
• Noun chunking
• NER (named entity recognition) - demo: OpenCalais
• Deep parsing - try to “understand” text.

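The first five steps of this pipeline (cleaning, sentence splitting, lower-casing, stopword removal, stemming) can be sketched with the standard library alone. Everything here is a simplification assumed for illustration: the stop word list is tiny, the sentence splitter only looks at end punctuation, and `crude_stem` is a naive suffix stripper, not the Porter stemmer. POS tagging, chunking, NER, and deep parsing need dedicated NLP tooling and are not attempted.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def crude_stem(word):
    """Naive suffix stripping; NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, stopword-free tokens per sentence."""
    text = re.sub(r"<[^>]+>", " ", text)             # cleaning: drop markup
    result = []
    for sentence in split_sentences(text):
        words = re.findall(r"[a-z]+", sentence.lower())   # lower case
        words = [w for w in words if w not in STOP_WORDS]  # stopwords
        result.append([crude_stem(w) for w in words])      # stemming
    return result


processed = preprocess("The cats are <b>chasing</b> mice. Dogs barked!")
print(processed)
```

The crude stemmer turns "chasing" into "chas", which shows why real stemmers carry many more rules.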
Text features are complicated

Be aware: text understanding can be hard.
• Toilet out of order. Please use floor below.
• “One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know.”
• Did you ever hear the story about the blind carpenter who picked up his hammer and saw?

Text Units Hierarchy

Levels of Text Representations
The goal is to convert the unstructured text to some form of structured data.
• Lexical Level
  • Transforms a string of characters into a sequence of atomic entities, called tokens.
  • Processes the sequence of characters with a given set of rules into a new sequence of tokens.
  • Tokens can include characters, character n-grams, words, word stems, lexemes, phrases, or word n-grams, all with associated attributes.
  • Finite state machines defined by regular expressions are used to extract tokens.

• Syntactic Level
  • Identifies and tags (annotates) each token’s function.
  • Tokens have attributes such as singular or plural, or their proximity to other tokens.
  • Richer tags include date, money, place, person, organization, and time.
  • The process of extracting these annotations is called named entity recognition (NER).
  • The richness and wide variety of language models and grammars yield a wide variety of approaches.

• Semantic Level
  • Extraction of meaning and relationships between pieces of knowledge derived from the structures identified in the syntactic level.
  • The goal of this level is to define an analytic interpretation of the full text within a specific context, or even independent of context.

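At the lexical level, a regular expression can serve as the token rule, and n-grams follow from simple slicing. The pattern below is an assumption chosen for illustration (numbers, words with an optional apostrophe part, and single punctuation marks); real tokenizers use richer rule sets.

```python
import re

# Assumed token rule: numbers, words (optionally with an apostrophe part),
# or any single character that is neither whitespace, letter, nor digit.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z\d]")


def tokenize(text):
    """Turn a character string into a list of tokens."""
    return TOKEN_RE.findall(text)


def word_ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, n):
    """All runs of n consecutive characters."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


tokens = tokenize("I don't know, 3.5 times.")
print(tokens)
print(word_ngrams(tokens[:3], 2))
print(char_ngrams("text", 3))
```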
Vector Space Model

• Computing term vectors is an essential step for many document and corpus visualization and analysis techniques.
• In the vector space model, a term vector for an object of interest (paragraph, document, or document collection) is a vector in which each dimension represents the weight of a given word in that document.
• Typically, to clean up noise, stop words (such as “the” or “a”) are removed (filtering), and words that share a word stem are aggregated together (stemming).
• https://wordcounter.net/
• The pseudocode below counts occurrences of unique tokens, excluding stop words.
• The input is assumed to be a stream of tokens generated by a lexical analyzer for a single document.
• The terms variable contains a hashtable that maps unique terms to their counts in the document.

Count-Terms(tokenStream)
    terms ← ∅                          // initialize terms to an empty hashtable
    for each token t in tokenStream
        do if t is not a stop word
            then increment (or initialize to 1) terms[t]
    return terms

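A direct Python rendering of Count-Terms might look like this. The stop word set is a small placeholder assumed for the example, and `collections.Counter` plays the role of the hashtable.

```python
from collections import Counter

# Placeholder stop word list (assumption for illustration).
STOP_WORDS = frozenset({"the", "a", "an", "of", "and"})


def count_terms(token_stream, stop_words=STOP_WORDS):
    """Map each non-stop-word token to its number of occurrences."""
    terms = Counter()                 # empty hashtable
    for t in token_stream:
        if t not in stop_words:
            terms[t] += 1             # increment (missing keys start at 0)
    return terms


counts = count_terms(["the", "cat", "and", "the", "cat", "sat"])
print(counts)
```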
Statistical Models
• A document is typically represented by a bag of words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of the same element.
• User specifies a set of desired terms with optional weights:
  • Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 >
  • Unweighted query terms: Q = < database; text; information >
• No Boolean conditions are specified in the query.

Statistical Retrieval
• Retrieval based on similarity between query and documents.
• Output documents are ranked according to similarity to query.
• Similarity based on occurrence frequencies of keywords in query and document.
• Automatic relevance feedback can be supported:
  • Relevant documents “added” to query.
  • Irrelevant documents “subtracted” from query.

Issues for Vector Space Model
• How to determine important words in a document?
  ⮚ Word sense?
  ⮚ Word n-grams (and phrases, idioms, …) as terms
• How to determine the degree of importance of a term within a document and within the entire collection?
• How to determine the degree of similarity between a document and the query?
• In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?

Vector Space Model
• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
  ⮚ Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, w_{ij}.
• Both documents and queries are expressed as t-dimensional vectors:
  ⮚ d_j = (w_{1j}, w_{2j}, …, w_{tj})

[Image-only slides: Graphic Representation; Document Collection]

Term Weights: Term Frequency

• More frequent terms in a document are more important, i.e., more indicative of the topic.
  f_{ij} = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
  tf_{ij} = f_{ij} / max_i{f_{ij}}

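As a small worked example of the normalization, the function below divides every raw count in one document by the largest count in that document. The sample counts are invented for illustration.

```python
def normalized_tf(counts):
    """counts: dict term -> raw frequency f_ij for a single document j.

    Returns tf_ij = f_ij / max_i f_ij, so the most common term gets 1.0.
    """
    if not counts:
        return {}
    max_f = max(counts.values())
    return {term: f / max_f for term, f in counts.items()}


tf = normalized_tf({"data": 4, "tree": 2, "graph": 1})   # sample counts
print(tf)
```

Normalizing this way keeps long documents from dominating simply because every term occurs more often in them.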
[Image-only slides: Term Weights: Inverse Document Frequency; TF-IDF Weighting; Computing TF-IDF -- An Example; Similarity Measure; Similarity Measure - Inner Product; Properties of Inner Product; Inner Product - Example; Example – Similarity Measure; Cosine Similarity Measure; Naïve Implementation; Vector Space Model; Vector Space Model - Issues; Vector Space Model - Exercise]

Computing TF-IDF(Documents)

Compute-TfIdf(documents)
    termFrequencies ← ∅                // looks up term count tables for document names
    documentFrequencies ← ∅            // counts the documents in which a term occurs
    uniqueTerms ← ∅                    // the list of all unique terms
    for each document d in documents
        do docName ← Name(d)                        // extract the name of the document
           tokenStream ← Tokenize(d)                // generate document token stream
           terms ← Count-Terms(tokenStream)         // count the term frequencies
           termFrequencies[docName] ← terms         // store the term frequencies
           for each term t in Keys(terms)
               do increment (or initialize to 1) documentFrequencies[t]
                  uniqueTerms ← uniqueTerms ∪ {t}

    tfIdfVectorTable ← ∅               // looks up tf-idf vectors for document names
    n ← Length(documents)
    for each document name docName in Keys(termFrequencies)
        do tfIdfVector ← create zeroed array of length Length(uniqueTerms)
           terms ← termFrequencies[docName]
           for each term t in Keys(terms)
               do tf ← terms[t]
                  df ← documentFrequencies[t]
                  tfIdf ← tf ∗ log(n/df)
                  tfIdfVector[index of t in uniqueTerms] ← tfIdf
           tfIdfVectorTable[docName] ← tfIdfVector
    return tfIdfVectorTable

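The Compute-TfIdf pseudocode can be sketched in Python as follows. To keep the example short it assumes the documents arrive already tokenized with stop words removed (so the Tokenize and Count-Terms steps collapse into a `Counter`), uses the natural logarithm, and sorts the vocabulary so that vector positions are reproducible; these are choices made for the sketch, not requirements of the pseudocode.

```python
import math
from collections import Counter


def compute_tfidf(documents):
    """documents: dict name -> list of tokens (already cleaned).

    Returns (vocab, table) where table maps each document name to a
    tf-idf vector aligned with the sorted vocabulary.
    """
    term_freqs = {name: Counter(tokens) for name, tokens in documents.items()}

    doc_freqs = Counter()              # number of documents containing a term
    for counts in term_freqs.values():
        doc_freqs.update(counts.keys())

    vocab = sorted(doc_freqs)          # all unique terms, fixed order
    index = {t: i for i, t in enumerate(vocab)}
    n = len(documents)

    table = {}
    for name, counts in term_freqs.items():
        vec = [0.0] * len(vocab)
        for t, tf in counts.items():
            vec[index[t]] = tf * math.log(n / doc_freqs[t])
        table[name] = vec
    return vocab, table


docs = {"d1": ["cat", "cat", "dog"], "d2": ["dog", "fish"]}   # toy corpus
vocab, table = compute_tfidf(docs)
print(vocab)
print(table)
```

In the toy corpus "dog" occurs in both documents, so its weight is tf · log(2/2) = 0: a term that appears everywhere carries no discriminating information.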
Zipf’s Law

▪ The economist Vilfredo Pareto stated that a company’s revenue is inversely proportional to its rank, a classic power law, resulting in the famous 80-20 rule, in which 20% of the population holds 80% of the wealth.
▪ Zipf described the distribution of words in natural language corpora using a discrete power law distribution called a Zipfian distribution.
▪ Zipf’s Law states that in a typical natural language document, the frequency of any word is inversely proportional to its rank in the frequency table.
▪ Plotting the Zipf curve on a log-log scale yields a straight line with a slope of -1.

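The slope of -1 follows directly from the inverse proportionality. Writing f(r) for the frequency of the word of rank r and C for a corpus-dependent constant (both symbols are introduced here only for this derivation):

```latex
f(r) = \frac{C}{r}
\quad\Longrightarrow\quad
\log f(r) = \log C - \log r
```

In the (log r, log f) plane this is a straight line with intercept log C and slope -1.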
Zipf’s Law

Figure: A document view in which named entities are highlighted, color-coded by entity type.

Figure: The distribution of terms in Wikipedia, an example of Zipf’s Law in action. Term frequency is on the y-axis, and frequency rank is on the x-axis.

Tasks Using the Vector Space Model

• The vector space model, when accompanied by some distance metric, allows one to perform many useful tasks.
• tf-idf and the vector space model are used to identify documents of particular interest.
• For example, the vector space model, with the use of some distance metric, will allow us to answer questions such as which documents are similar to a specific one, which documents are relevant to a given collection of documents, or which documents are most relevant to a given search query. All of these are answered by finding the documents whose term vectors are most similar to the given document, the average vector over a document collection, or the vector of a search query.

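One common choice of measure for these tasks is cosine similarity. The sketch below ranks documents against a query vector; the vectors are toy values invented for the example, and cosine similarity is only one of several measures that could be plugged in.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                    # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vectors):
    """Return (name, score) pairs, most similar document first."""
    scores = {name: cosine_similarity(query_vec, vec)
              for name, vec in doc_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


doc_vectors = {"a": [1, 1], "b": [1, 0], "c": [0, 0]}   # toy term vectors
ranking = rank_documents([1, 1], doc_vectors)
print(ranking)
```

The same function answers the other two questions from the slide: pass the term vector of a specific document, or the average vector of a collection, in place of the query vector.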
Single Document Visualizations

Figure: A tag cloud visualization generated by the free service tagCrowd.com. The font size and darkness are proportional to the frequency of the word in the document.

https://tagcrowd.com/

Word Clouds
• Word clouds, also known as text clouds or tag clouds, are layouts of raw tokens, colored and sized by their frequency within a single document.
• Text clouds and their variations, such as a Wordle, are examples of visualizations that use only term frequency vectors and some layout algorithm to create the visualization.

Figure: A Wordle visualization generated by the free service wordle.net. The size of the text corresponds to the frequency of the word in the document.

WordTree
The WordTree visualization is a visual representation of both term frequencies and their context. Size is used to represent the term or phrase frequency. The root of the tree is a user-specified word or phrase of interest, and the branches represent the various contexts in which the word or phrase is used in the document.

Figure: A WordTree visualization generated by the free service ManyEyes. The branches of the tree represent the various contexts following a root word or phrase in the document.

WordTree
['cats are better than dogs'],
['cats eat kibble'],
['cats are better than hamsters'],
['cats are awesome'],
['cats are people too'],
['cats eat mice'],
['cats meowing'],
['cats in the cradle'],
['cats eat mice'],
['cats in the cradle lyrics'],
['cats eat kibble'],
['cats for adoption'],
['cats are family'],
['cats eat mice'],
['cats are better than kittens'],
['cats are evil'],
['cats are weird'],
['cats eat mice'],
WordTree
This word tree depicts a tree of phrases, with the size of the words proportional to their
usage. In this set of phrases, "cats eat mice" occurs four times, and "cats eat" occurs six
times (four times with "mice", and twice with "kibble").

Use case: you have collected a set of phrases about cats (e.g., "cats eat mice",
"cats are better than kittens") and you want to highlight the most important
attributes from the set.
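The counts quoted above can be reproduced with a short sketch. It assumes that the size of a node in an implicit word tree is driven by how many phrases start with that word sequence; the helper name prefix_counts is hypothetical and this is not the charting library's own implementation.

```python
from collections import Counter

phrases = [
    "cats are better than dogs", "cats eat kibble",
    "cats are better than hamsters", "cats are awesome",
    "cats are people too", "cats eat mice", "cats meowing",
    "cats in the cradle", "cats eat mice", "cats in the cradle lyrics",
    "cats eat kibble", "cats for adoption", "cats are family",
    "cats eat mice", "cats are better than kittens", "cats are evil",
    "cats are weird", "cats eat mice",
]


def prefix_counts(phrases):
    """Count how many phrases begin with each word prefix."""
    counts = Counter()
    for phrase in phrases:
        words = phrase.split()
        for end in range(1, len(words) + 1):
            counts[" ".join(words[:end])] += 1
    return counts


counts = prefix_counts(phrases)
print(counts["cats eat mice"])    # 4
print(counts["cats eat kibble"])  # 2
print(counts["cats eat"])         # 6
```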

WordTree
Implicit and explicit Word Trees

There are two ways to create word trees: implicitly and explicitly. The choice is
specified with the wordtree.format option.

format: 'implicit'

The word tree will take a set of phrases, in any order, and construct the tree
according to the frequency of the words and subphrases.

format: 'explicit'

You tell the word tree what connects to what, how big to make each subphrase, and
what colors to use.

The word tree in the previous section was an implicit Word Tree: we just specified an
array of phrases, and the word tree figured out how big to make each word.

In an explicit word tree, the chart creator directly provides information about which
words link to which, their color, and size.
TextArc
TextArc is a visual representation of how terms relate to the lines of text in which
they appear.

Every word of the text is drawn in order around an ellipse as small lines with a
slight offset at its start.

As in a text cloud, more frequently occurring words are drawn larger and
brighter.

Words with higher frequencies are drawn within the ellipse, each pulled toward its
occurrences on the ellipse (similar to RadViz).

The user is able to highlight the underlying text with probing and animate
“reading” the text by visualizing the flow of the text through relevant connected
terms.

https://voyant-tools.org/docs/#!/guide/textualarc
TextArc

http://textarc.org/Stills.html

Arc Diagram
Arc diagrams are a visualization focused on displaying repetition in
text or any sequence.

Repeated subsequences are identified and connected by semicircular arcs.

The thickness of the arcs represents the length of the subsequence,
and the height of the arcs represents the distance between the
subsequences.
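As a rough illustration of how repeats could be turned into arcs, the sketch below finds consecutive, non-overlapping repeats of a fixed length and derives an arc for each pair. This is a deliberate simplification: a real arc diagram selects which repeated subsequences to connect more carefully, and the function names and the fixed-length assumption are illustrative only.

```python
from collections import defaultdict


def repeated_arcs(seq, length):
    """Return (start_a, start_b, length) for consecutive, non-overlapping
    repeats of every subsequence of the given length."""
    positions = defaultdict(list)
    for i in range(len(seq) - length + 1):
        positions[seq[i:i + length]].append(i)
    arcs = []
    for starts in positions.values():
        prev = starts[0]
        for start in starts[1:]:
            if start >= prev + length:  # skip overlapping occurrences
                arcs.append((prev, start, length))
                prev = start
    return sorted(arcs)


def arc_geometry(arc):
    """Thickness follows the subsequence length; a semicircle spanning the
    two start positions has a height equal to half their distance."""
    start_a, start_b, length = arc
    return {"thickness": length, "height": (start_b - start_a) / 2}


arcs = repeated_arcs("abcdabcxabc", 3)
for arc in arcs:
    print(arc, arc_geometry(arc))
```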

An example of this diagram, built with Protovis/D3, is at
http://mbostock.github.io/protovis/ex/arc.html. The input data is in
http://mbostock.github.io/protovis/ex/miserables.js

Arc Diagram
An arc diagram is a special kind of network graph. It is made up of nodes that
represent entities and links that show relationships between entities. In arc
diagrams, nodes are displayed along a single axis and links are represented with
arcs.

Arc Diagram
Arc diagrams are not as good as 2D network charts at conveying the overall
node structure. They have two main advantages, though:
● they can highlight clusters and bridges quite well if the node order is
optimized
● they make it possible to display the label of each node, which is often
impossible in a 2D layout.
Here is an example showing the co-authorship network of a researcher.
Vincent Ranwez is the author of several scientific publications and has
more than 100 co-authors, all represented by a node on the following
chart. If two people have appeared on the same paper, they are linked
by an arc.

Arc Diagram

The figure displays Bach’s Minuet in G Major, visualizing the classic pattern of a
minuet. It contains two parts, each consisting of a long passage played twice. The
parts are loosely related, as shown by the bundle of thin arcs connecting the two
main parts. The overlap of the two main arcs shows that the end of the first passage
is the same as the beginning of the second.

Literature Fingerprinting
▪ Literature fingerprinting is a method of visualizing features used to
characterize text.

▪ Instead of calculating just one feature value or vector for the whole
text (this is what is usually done), a sequence of feature values per text
is calculated and presented to the user as a characteristic fingerprint
of the document.

▪ This allows the user to “look inside” the document and analyze the
development of the values across the text. Moreover, the structural
information of the document is used to visualize the document on
different levels of resolution.

▪ Literature fingerprinting was applied to an authorship attribution
problem to show the discrimination power of the standard measures
that are assumed to capture the writing style of an author.
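The idea of one feature value per text block, rather than one per document, can be sketched as follows. The example assumes that the feature is average sentence length in words, that sentences end at ".", "!" or "?", and that a block is a fixed number of consecutive sentences; all three are simplifying assumptions for illustration.

```python
import re


def sentence_lengths(text):
    """Split on sentence-ending punctuation and return word counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def fingerprint(text, sentences_per_block=2):
    """Return the average sentence length for each consecutive block."""
    lengths = sentence_lengths(text)
    blocks = [
        lengths[i:i + sentences_per_block]
        for i in range(0, len(lengths), sentences_per_block)
    ]
    return [sum(block) / len(block) for block in blocks]


text = (
    "Short one. Another short one. "
    "This sentence is quite a bit longer than those. "
    "And so is this following sentence here."
)
print(fingerprint(text))  # one value per block, to be mapped to a pixel color
```

Each value would then be mapped to the color of one pixel, with pixels grouped by chapter or book.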
Literature Fingerprinting
Literature fingerprinting technique. Here, literature fingerprinting is used to analyze
the ability of several text measures to discriminate between authors. Each pixel
represents a text block, and the pixels are grouped into books. Color is mapped to the
feature value, in this case to the average sentence length. If a measure is able to
discriminate between the two authors, the books in the first row (written by London)
are visually set apart from the remaining books. (Image from [222], © 2007 IEEE.)

Document Collection Visualizations

•In document collection visualizations, the goal is to place similar documents
close to each other and dissimilar ones far apart.

•This is a minimax problem and typically O(n²): we compute the similarity
between all pairs of documents and determine a layout.

•The common approaches are graph spring layouts, multidimensional scaling,
clustering (k-means, hierarchical, EM, support vector), and self-organizing
maps.

•Examples of document collection visualizations include self-organizing maps,
clustermaps, and themescapes.
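The all-pairs similarity step can be sketched as below. It assumes raw term-frequency vectors and cosine similarity as the measure; the document names and texts are made up for illustration. With n documents there are n(n-1)/2 pairs, which is where the quadratic cost comes from.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Build a sparse term-frequency vector from whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = {
    "d1": "cats eat mice",
    "d2": "cats eat kibble",
    "d3": "stocks fell sharply",
}
vectors = {name: tf_vector(text) for name, text in docs.items()}

# One similarity value per unordered pair of documents.
sims = {
    (i, j): cosine(vectors[i], vectors[j])
    for i, j in combinations(docs, 2)
}
for pair, value in sims.items():
    print(pair, round(value, 3))
```

A layout method such as multidimensional scaling or a spring layout would take these similarities as input.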

Self-Organizing Maps

•A self-organizing map is an unsupervised learning algorithm using a collection of
nodes, typically arranged in 2D, where documents will be located.
•Each node has an associated vector of the same dimensionality as the input vectors
(the document vectors) used to train the map.
•We initialize the SOM nodes, typically with random weights.
•We choose a random vector from the input vectors and calculate its distance from
each node.
•We adjust the weights of the closest nodes (within a particular radius), making each
closer to the input vector, with the strongest adjustment applied to the closest
(selected) node.
•As we iterate through the input vectors, the radius gets smaller.
•An example of using SOMs for text data is shown in the figure, which shows a million
documents collected from 83 newsgroups.
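The training steps listed above can be sketched in plain Python as follows. The Gaussian neighborhood function, the linear decay of radius and learning rate, and all parameter names (epochs, lr0, seed) are assumptions chosen for illustration; the slides do not prescribe them. The sketch needs Python 3.8 or later for math.dist.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: initialize every grid node with random weights.
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    radius0 = max(rows, cols) / 2
    steps = epochs * len(data)
    for step in range(steps):
        frac = step / steps
        radius = max(radius0 * (1 - frac), 0.5)  # radius shrinks over time
        lr = lr0 * (1 - frac)                    # learning rate decays too
        # Step 2: pick a random input vector and find its closest node.
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        # Step 3: pull nodes within the radius toward x, the closest most.
        for node, w in weights.items():
            d = math.dist(node, bmu)
            if d <= radius:
                influence = math.exp(-(d * d) / (2 * radius * radius))
                for k in range(dim):
                    w[k] += lr * influence * (x[k] - w[k])
    return weights


data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=3, cols=3)
for vector in data:
    print(vector, "->", best_matching_unit(som, vector))
```

After training, each document vector is placed at the grid position of its best matching unit, so similar documents tend to land on the same or neighboring nodes.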

Self-Organizing Maps
Self-Organising Maps (SOMs) are an unsupervised data visualisation technique that can
be used to visualise high-dimensional data sets in lower (typically 2) dimensional
representations.

Self-Organizing Maps

SOM visualisations are made up of multiple “nodes”. Each node has:

● A fixed position on the SOM grid

● A weight vector of the same dimension as the input space (e.g. if your input data
represented people, it may have variables “age”, “sex”, “height” and “weight”;
each node on the grid will also have values for these variables)

● Associated samples from the input data. Each sample in the input space is
“mapped” or “linked” to a node on the map grid. One node can represent several
input samples.

Themescapes
•Themescapes are summaries of corpora using abstract 3D landscapes in
which height and color are used to represent density of similar documents.
•The example shown in the figure, from Pacific Northwest National Labs,
represents news articles visualized as a themescape.
•The taller mountains represent frequent themes in the document corpus.

A themescape from PNNL that uses height to represent the frequency of
themes in news articles. (Image reprinted with permission of
Springer Science and Business Media.)

Document Cards

The Document Card pipeline. Each step is further explained in the sections
indicated by the number in the top right corner of each box.

Extended Text Visualization
Software Visualization
Eick et al. developed a visualization tool called SeeSoft that
visualizes statistics for each line of code (e.g., age, number of
modifications, programmer, dates).
Dimensions of Software Visualization
▪ Tasks – why is the visualization needed?
▪ Audience – who will use the visualization?
▪ Target – what is the data source to represent?
▪ Representation – how to represent it?
▪ Medium – where to represent the visualization?
SeeSoft - Uses

SeeSoft - Applications

SeeSoft

New Application Areas

Display of large amounts of text.
Visualization of directories and files.

Limitations

Only 50,000 lines of code can be displayed.
Difficult to use with monochrome devices.

Search Result Visualization

▪ Marti Hearst developed a simple query result visualization called TileBars,
foundationally similar to Keim’s pixel displays, which displays a number of
term-related statistics, including frequency and distribution of terms, length of
document, term-based ranking, and strength of ranking.

▪ Each document of the result set is represented by a rectangle,
where width indicates relative length of the document and
stacked squares correspond to text segments.
▪ Each row of the stack represents a set of query terms, and the
darkness of each square indicates the frequency of the corresponding
query terms in that text segment.
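The grid behind a single TileBar can be sketched as a small matrix with one row per query term set and one column per text segment. The sketch assumes fixed-size word windows as segments and a linear count-to-darkness mapping; both are simplifications, and the function names are hypothetical.

```python
def tilebar(words, term_sets, segment_size):
    """Return one row per query term set and one column per text segment,
    where each cell counts the matching words in that segment."""
    segments = [
        words[i:i + segment_size]
        for i in range(0, len(words), segment_size)
    ]
    grid = []
    for terms in term_sets:
        wanted = {t.lower() for t in terms}
        grid.append(
            [sum(1 for w in seg if w.lower() in wanted) for seg in segments]
        )
    return grid


def shade(count, max_count):
    """Map a count to a darkness value in [0, 1]; 1 is the darkest tile."""
    return 0.0 if max_count == 0 else count / max_count


words = "cats eat mice and mice fear cats but dogs eat kibble".split()
grid = tilebar(words, term_sets=[["cats"], ["eat"]], segment_size=4)
peak = max(max(row) for row in grid)
for row in grid:
    print([shade(count, peak) for count in row])
```

The number of columns grows with document length, which matches the role of the rectangle's width described above.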

Search Result Visualization

The TileBars query result visualization. Each large rectangle indicates a document,
and each square within the document represents a text segment. The darker the tile,
the more frequent the query term set. (Image © 1995 Addison-Wesley.)

Temporal Document Collection Visualization

ThemeRiver, also called a stream graph, is a visualization of thematic changes in a
document collection over time.
This visualization assumes that the input data progresses over time. Themes are
visually represented as colored horizontal bands whose vertical thickness at a given
horizontal location represents their frequency at a particular point in time.

A stream graph (ThemeRiver), depicting the election night speeches of several
different candidates for a Canadian election. (Image © 2002 IEEE.)
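The band geometry can be sketched by stacking theme frequencies at each time step. The sketch assumes a baseline centered on the horizontal axis, so the whole river is symmetric around zero; the centering and the theme names are illustrative assumptions, since the description above only requires that band thickness equal frequency.

```python
def stream_layers(series):
    """series maps theme -> list of frequencies per time step.
    Returns theme -> list of (lower, upper) band edges per time step."""
    n_steps = len(next(iter(series.values())))
    layers = {theme: [] for theme in series}
    for t in range(n_steps):
        total = sum(values[t] for values in series.values())
        y = -total / 2  # center the stack around the horizontal axis
        for theme, values in series.items():
            layers[theme].append((y, y + values[t]))
            y += values[t]
    return layers


series = {"economy": [2, 4], "health": [2, 0]}
layers = stream_layers(series)
for theme, bands in layers.items():
    print(theme, bands)
```

Drawing each theme as a filled area between its lower and upper edges, with smoothing between time steps, gives the river shape.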

Temporal Document Collection Visualization

Jigsaw is a tool for visualizing and exploring text corpora [155].
Jigsaw’s calendar view positions document objects on a calendar based
on date entities identified within the text. When the user highlights a
document, the entities that occur within that document are displayed.

Wanner et al. developed a visual analytics tool for conducting
semiautomatic sentiment analysis of large news feeds.

News articles are presented with the Jigsaw calendar view, based on the
extracted date entities.

Representing Relationships
▪ Jigsaw also includes an entity graph view, in which the user can
navigate a graph of related entities and documents.

▪ In Jigsaw, entities are connected to the documents in which they appear.

▪ The Jigsaw graph view does not show the entire document collection,
but it allows the user to incrementally expand the graph by selecting
documents and entities of interest.

A sentiment analysis visualization. News items are plotted along the time axis. Shape and color show to which category
an item belongs, and the vertical position depends on the automatically determined sentiment score of an item. The visual
objects representing news items are painted semi-transparent in order to make overlapping items more easily
distinguishable.
Representing Relationships

The Jigsaw graph view, representing connections between named entities and
documents.
Representing Relationships

A clustered graph view in Jigsaw that filters for documents having specific entities.
Mousing over an entity identifies data about the document. Colors represent token
values.
Representing Relationships

The Jigsaw list view, displaying the connections between people (left), places
(center), and organizations (right)