10212cs214 Data Visualization Unit III 19.02.2024
▪ Treemaps and their many variants are the most popular form of rectangular
space-filling layout.
Pseudocode for drawing a hierarchy as a node-link tree diagram (a runnable
sketch in Python follows the numbered steps):
▪ The drawing of such trees is influenced the most by two factors: the
fan-out degree (i.e., the number of children a node can have) and the
depth (i.e., the distance of the furthest node from the root).
1. Slice the drawing area into equal-height slabs, based on the depth of
the tree.
2. For each level of the tree, determine how many nodes need to be drawn.
3. Divide each slice into equal-sized rectangles based on the number of nodes
at that level.
4. Draw each node in the center of its corresponding rectangle.
5. Draw a link between the center-bottom of each node to the center-top
of its child node(s).
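A minimal sketch of the five steps above in Python (the tree structure, node names, and the use of matplotlib are assumptions for illustration, not part of the slides):

import matplotlib.pyplot as plt

def layered_tree_layout(tree, root):
    """Assign (x, y) positions: one equal-height slab per tree level and
    equal-width rectangles for the nodes within each level."""
    levels, frontier = [], [root]
    while frontier:                                    # group nodes by depth (breadth-first)
        levels.append(frontier)
        frontier = [c for n in frontier for c in tree.get(n, [])]
    positions = {}
    for level, nodes in enumerate(levels):
        y = 1.0 - (level + 0.5) / len(levels)          # center of the slab for this level
        for i, node in enumerate(nodes):
            x = (i + 0.5) / len(nodes)                 # center of the node's rectangle
            positions[node] = (x, y)
    return positions

def draw_tree(tree, root):
    pos = layered_tree_layout(tree, root)
    for parent, children in tree.items():              # links between parents and children
        for child in children:
            (x1, y1), (x2, y2) = pos[parent], pos[child]
            plt.plot([x1, x2], [y1, y2], "k-", zorder=1)
    for node, (x, y) in pos.items():                   # nodes at the rectangle centers
        plt.scatter([x], [y], s=600, c="lightsteelblue", zorder=2)
        plt.text(x, y, node, ha="center", va="center")
    plt.axis("off")
    plt.show()

tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}  # hypothetical hierarchy
draw_tree(tree, "A")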
Many enhancements can be made to this rather basic algorithm in order
to improve space utilization and move child nodes closer to their parents.
▪ Rather than using even spacing and centering, divide each level based on
the number of terminal nodes belonging to each subtree.
▪ Spread terminal nodes evenly across the drawing area and center parent
nodes above them.
▪ Position the root node in the center of the display and lay out child nodes
radially, rather than vertically.
▪ There are many other possibilities, including graphs with weighted edges,
undirected graphs, graphs with cycles, disconnected graphs, and so on.
● NetworkX has its own drawing module, which provides multiple options for plotting. Below are
visualizations produced by some of the drawing functions in the package. Using any of them is
straightforward: call the function, pass it the graph variable G, and the package does the rest.
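As an illustration, a minimal sketch using two of NetworkX's drawing functions; the example graph is an assumption standing in for the slides' graph variable G:

import matplotlib.pyplot as plt
import networkx as nx

G = nx.karate_club_graph()              # hypothetical stand-in for the graph G used above

nx.draw(G, with_labels=True)            # default node-link drawing (spring layout)
plt.show()

nx.draw_circular(G, with_labels=True)   # one of the alternative draw_* variants
plt.show()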
# Interactive rendering of the same NetworkX graph G with pyvis.
from pyvis.network import Network

net = Network(notebook=True)   # render inline in a Jupyter notebook
net.from_nx(G)                 # import nodes and edges from the NetworkX graph G
net.show("example.html")       # write and open the interactive HTML view
• For structured text or document collections, the key task is most often
searching for patterns and outliers within the text or documents.
• Text and documents are often minimally structured and may be rich with
attributes and metadata, especially within a specific application domain.
• For example, documents have a format and often include metadata about the
document (e.g., author, date of creation, date of modification, comments,
size).
• Large collections require pre-processing of the text to extract information and align the texts.
Typical steps (a code sketch follows this list) are:
• Cleaning (regular expressions)
• Sentence splitting
• Conversion to lower case
• Stopword removal (the most frequent words in a language)
• Stemming (e.g., the Porter stemmer)
• POS (part-of-speech) tagging
• Noun chunking
• NER (named entity recognition), e.g., with OpenCalais
• Deep parsing - trying to “understand” the text.
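A minimal sketch of these steps using NLTK (the library choice, the sample text, and the exact step order are assumptions for illustration):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

text = "The documents were visualized. Visualization reveals patterns!"

tokens = []
for sentence in nltk.sent_tokenize(text):               # sentence splitting
    sentence = re.sub(r"[^A-Za-z\s]", " ", sentence)     # cleaning with a regular expression
    tokens.extend(nltk.word_tokenize(sentence.lower()))  # change to lower case + tokenize

print(nltk.pos_tag(tokens))                              # POS tagging

stop = set(stopwords.words("english"))                   # stopword removal
stemmer = PorterStemmer()                                # Porter stemming
print([stemmer.stem(t) for t in tokens if t not in stop])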
Text features are complicated
• Syntactic Level
• Identifies and tags (annotates) each token’s function.
• Tokens have attributes such as singular or plural, or their proximity to other tokens.
• Richer tags include date, money, place, person, organization, and time.
• The process of extracting these annotations is called named entity recognition (NER); a sketch follows this list.
• The richness and wide variety of language models and grammars yield a wide variety of approaches.
• Semantic Level
• Extraction of meaning and relationships between pieces of knowledge derived from the
structures identified in the syntactical level.
• The goal of this level is to define an analytic interpretation of the full text within a
specific context, or even independent of context.
• https://wordcounter.net/
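The NER step mentioned above can be reproduced with an off-the-shelf tagger; the sketch below uses spaCy, and the library choice, model name, and example sentence are assumptions rather than part of the slides:

import spacy

# Requires the model once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Acme Corp. paid $2 million to open an office in Paris in March 2020.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # tags such as ORG, MONEY, GPE, DATE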
Vector Space Model
• The pseudocode below counts occurrences of unique tokens,
excluding stop words.
• The input is assumed to be a stream of tokens generated by a
lexical analyzer for a single document.
• The terms variable contains a hashtable that maps unique terms
to their counts in the document.
Count-Terms(tokenStream)
1 terms ← ∅                ▷ Initialize terms to an empty hashtable.
2 for each token t in tokenStream
3     do if t is not a stop word
4         then increment (or initialize to 1) terms[t]
5 return terms
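A Python equivalent of the Count-Terms pseudocode (the stop-word list and the example token stream are assumptions):

from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "on"}   # hypothetical stop list

def count_terms(token_stream):
    """Map each unique non-stop-word term to its count in one document."""
    terms = Counter()
    for token in token_stream:
        if token not in STOP_WORDS:
            terms[token] += 1        # increment (or initialize to 1)
    return terms

print(count_terms(["the", "cat", "sat", "on", "the", "cat"]))
# Counter({'cat': 2, 'sat': 1})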
Compute-TfIdf(documents)
1 termFrequencies ← ∅           ▷ Looks up term count tables by document name.
2 documentFrequencies ← ∅       ▷ Counts the documents in which a term occurs.
3 uniqueTerms ← ∅               ▷ The list of all unique terms.
4 for each document d in documents
5     do docName ← Name(d)                      ▷ Extract the name of the document.
6        tokenStream ← Tokenize(d)              ▷ Generate the document token stream.
7        terms ← Count-Terms(tokenStream)       ▷ Count the term frequencies.
8        termFrequencies[docName] ← terms       ▷ Store the term frequencies.
9        for each term t in Keys(terms)
10           do increment (or initialize to 1) documentFrequencies[t]
11              uniqueTerms ← uniqueTerms ∪ t
Computing TF-IDF(Documents)
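The pseudocode above stops after building the frequency tables; the sketch below is a self-contained Python approximation that finishes the computation using the common count * log(N/df) weighting (the weighting variant and the toy documents are assumptions, since the slides do not show that step):

import math
from collections import Counter

def compute_tfidf(documents):
    """documents: mapping of document name -> list of tokens."""
    term_frequencies = {name: Counter(tokens) for name, tokens in documents.items()}
    document_frequencies = Counter()
    for terms in term_frequencies.values():
        document_frequencies.update(terms.keys())       # one count per containing document
    n_docs = len(documents)
    return {name: {t: count * math.log(n_docs / document_frequencies[t])
                   for t, count in terms.items()}
            for name, terms in term_frequencies.items()}

docs = {"d1": ["cat", "sat", "cat"], "d2": ["dog", "sat"]}
print(compute_tfidf(docs))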
Zipf’s Law
▪ Plotting the Zipf curve on a log-log scale yields a straight line with a
slope of -1: Zipf’s law states that the r-th most frequent term occurs with a
frequency roughly proportional to 1/r, so log(frequency) ≈ constant - log(rank).
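A quick way to check this claim on a real document (the input file name is hypothetical): rank the term counts and plot frequency against rank on log-log axes.

import matplotlib.pyplot as plt
from collections import Counter

with open("document.txt") as f:                     # hypothetical input file
    counts = Counter(f.read().lower().split())

frequencies = sorted(counts.values(), reverse=True)
ranks = range(1, len(frequencies) + 1)

plt.loglog(ranks, frequencies, marker=".")          # roughly a line of slope -1
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()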
• For example, the vector space model, with the use of some distance
metric, allows us to answer questions such as which documents are similar
to a specific one, which documents are relevant to a given collection of
documents, or which documents are most relevant to a given search query.
All of these are answered by finding the documents whose term vectors are
most similar to the vector of the given document, to the average vector
over a document collection, or to the vector of the search query.
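A minimal sketch of such a comparison, assuming cosine similarity over tf-idf term vectors as the distance metric (the example vectors are made up):

import math

def cosine_similarity(vec_a, vec_b):
    """vec_a, vec_b: dicts mapping term -> tf-idf weight."""
    dot = sum(vec_a[t] * vec_b[t] for t in set(vec_a) & set(vec_b))
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = {"visualization": 1.2, "text": 0.8}
document = {"text": 0.5, "visualization": 0.9, "cloud": 1.1}
print(cosine_similarity(query, document))   # closer to 1.0 means more similar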
A tag cloud visualization generated by the free service tagCrowd.com. The font
size and darkness are proportional to the frequency of the word in the document.
https://tagcrowd.com/
A Wordle visualization generated by the free service wordle.net. The size of
the text corresponds to the frequency of the word in the document.
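A similar frequency-scaled cloud can be produced locally; the sketch below uses the Python wordcloud package rather than the web services named above (the package choice and the input file are assumptions):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = open("document.txt").read()                  # hypothetical input document
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")         # word size reflects word frequency
plt.axis("off")
plt.show()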
A WordTree visualization generated by the free service ManyEyes. The
branches of the tree represent the various contexts following a root word
or phrase in the document.
There are two ways to create word trees: implicitly and explicitly. The choice is
specified with the wordtree.format option.
format: 'implicit'
The word tree will take a set of phrases, in any order, and construct the tree
according to the frequency of the words and subphrases.
format: 'explicit'
You tell the word tree what connects to what, how big to make each subphrase, and
what colors to use.
The word tree shown above is an implicit word tree: only an array of phrases
is specified, and the word tree figures out how big to make each word.
In an explicit word tree, the chart creator directly provides information about
which words link to which, along with their color and size.
TextArc
TextArc is a visual representation of how terms relate to the lines of text in which
they appear.
Every word of the text is drawn in order around an ellipse as small lines with a
slight offset at its start.
As in a text cloud, more frequently occurring words are drawn larger and
brighter.
Words with higher frequencies are drawn within the ellipse, pulled toward
their occurrences on the circle (similar to RadViz).
The user is able to highlight the underlying text with probing and animate
“reading” the text by visualizing the flow of the text through relevant connected
terms.
https://voyant-tools.org/docs/#!/guide/textualarc
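A rough approximation of the placement rule described above (not TextArc’s actual algorithm): spread the words around an ellipse in text order and place each repeated word inside the ellipse at the average of its boundary positions, sized by its frequency.

import math

words = open("document.txt").read().lower().split()          # hypothetical input file

boundary = {}                                                 # word -> its positions on the ellipse
for i, w in enumerate(words):
    angle = 2 * math.pi * i / len(words)
    boundary.setdefault(w, []).append((math.cos(angle), 0.6 * math.sin(angle)))

placement = {}                                                # word -> (x, y, frequency)
for w, points in boundary.items():
    x = sum(p[0] for p in points) / len(points)               # pulled toward its occurrences
    y = sum(p[1] for p in points) / len(points)
    placement[w] = (x, y, len(points))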
TextArc stills: http://textarc.org/Stills.html
▪ Instead of calculating just one feature value or vector for the whole
text (as is usually done), a sequence of feature values per text
is calculated and presented to the user as a characteristic fingerprint
of the document.
▪ This allows the user to “look inside” the document and analyze the
development of the values across the text. Moreover, the structural
information of the document is used to visualize the document on
different levels of resolution.
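A minimal sketch of the fingerprint idea, assuming average word length per fixed-size block as the per-text feature (the feature choice, block size, and input file are assumptions):

def fingerprint(text, block_size=200):
    """Return one feature value per consecutive block of words, forming a
    sequence that can be shown as a color-coded strip for the document."""
    words = text.split()
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    return [sum(len(w) for w in block) / len(block) for block in blocks]

print(fingerprint(open("document.txt").read()))   # hypothetical input file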
SOM visualizations are made up of multiple “nodes”. Each node has:
● A weight vector of the same dimension as the input space (e.g., if the input
data represented people with the variables “age”, “sex”, “height”, and “weight”,
each node on the grid would also have values for these variables).
● Associated samples from the input data. Each sample in the input space is
“mapped” or “linked” to a node on the map grid; one node can represent several
input samples (see the sketch after this list).
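A minimal sketch of mapping input samples to SOM nodes using the MiniSom package (the library choice, grid size, and toy data are assumptions):

import numpy as np
from minisom import MiniSom

data = np.random.rand(100, 4)                   # toy input space: 4 variables per sample

som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5)
som.random_weights_init(data)
som.train_random(data, num_iteration=500)

# Each sample is "mapped" to its best-matching node on the 6x6 grid;
# one node can represent several input samples.
winners = [som.winner(sample) for sample in data]
print(winners[:5])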
The Document Card pipeline. Each step is further explained in the sections
indicated by the number in the top right corner of each box.
A sentiment analysis visualization. News items are plotted along the time axis. Shape and color show to which category
an item belongs, and the vertical position depends on the automatically determined sentiment score of an item. The visual
objects representing news items are painted semi-transparent in order to make overlapping items more easily
distinguishable.
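A hedged sketch of the described encoding with made-up data: x = time, y = automatically determined sentiment score, marker shape and color = category, and semi-transparent points so overlapping items remain distinguishable.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
categories = {"politics": ("o", "tab:blue"), "sports": ("s", "tab:orange")}   # hypothetical categories

for name, (marker, color) in categories.items():
    times = rng.uniform(0, 30, 40)              # day of publication
    sentiment = rng.normal(0, 1, 40)            # sentiment score per news item
    plt.scatter(times, sentiment, marker=marker, c=color, alpha=0.4, label=name)

plt.axhline(0, color="gray", linewidth=0.5)
plt.xlabel("time (days)")
plt.ylabel("sentiment score")
plt.legend()
plt.show()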
Representing Relationships
The Jigsaw graph view, representing connections between named entities and
documents.
A clustered graph view in Jigsaw that filters for documents having specific entities.
Mousing over an entity identifies data about the document. Colors represent token
values.
The Jigsaw list view, displaying the connections between people (left), places
(center), and organizations (right).