Text Visualization
Collections of Documents
Messages (e-mail, blogs, tags, comments)
Social networks (personal profiles)
Academic collaborations (publications)
Example: Health Care Reform
Recent History
Initiatives by President Clinton
Overhaul by President Obama
Text Data
News articles
Speech transcriptions
Legal documents
economix.blogs.nytimes.com/2009/09/09/obama-in-09-vs-clinton-in-93
Bill Clinton 1993
Word Tree: Word Sequences
Gulfs of Evaluation
Many text visualizations do not represent the text
directly. They represent the output of a language
model (word counts, word sequences, etc.).
Modeling Abstraction
Determine your analysis task.
Understand abstraction of your language models.
Match analysis task with appropriate tools and models.
Topics
Text as Data
Visualizing Document Content
Evolving Documents
Visualizing Conversation
Document Collections
Text as Data
Words as nominal data?
2. Stemming
Group together different forms of a word.
Porter stemmer? visualization(s), visualize(s), visually -> visual
Lemmatization? goes, went, gone -> go
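A minimal sketch of the two ideas above, assuming a toy suffix list and a toy lemma table (both hypothetical). This is not the Porter algorithm, which applies several ordered rule phases; it only shows how different surface forms can collapse to one key.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration.
# Suffixes are checked longest first so "izations" wins over "s".
SUFFIXES = ["izations", "ization", "izes", "ize", "ly", "s"]

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# Toy lemma table (hypothetical). Real lemmatizers rely on a dictionary
# and usually on part-of-speech information.
LEMMAS = {"goes": "go", "went": "go", "gone": "go"}

def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma table; unknown words pass through."""
    w = word.lower()
    return LEMMAS.get(w, w)

forms = ["visualization", "visualizations", "visualize", "visualizes", "visually"]
stems = {toy_stem(f) for f in forms}
lemmas = {toy_lemmatize(f) for f in ["goes", "went", "gone"]}
print(stems, lemmas)
```

Stemming works on the string alone, so it cannot map "went" to "go"; that case needs the lookup step.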
Term        Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony                       157             73            0       0        0        0
Brutus                         4            157            0       1        0        0
Caesar                       232            227            0       2        1        1
Calpurnia                      0             10            0       0        0        0
Cleopatra                     57              0            0       0        0        0
mercy                          2              0            3       5        5        1
worser                         2              0            1       1        1        0
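A term-document count matrix like the one above can be built in a few lines. The sketch below assumes a made-up two-document corpus and whitespace tokenization; the document names and texts are hypothetical, not the plays.

```python
from collections import Counter

# Hypothetical mini-corpus; the texts are made up for illustration.
docs = {
    "doc_a": "Caesar praised Brutus and Brutus praised Caesar",
    "doc_b": "mercy for Caesar",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# One Counter per document, then one row per vocabulary term.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
vocab = sorted(set().union(*counts.values()))
matrix = {term: [counts[name][term] for name in docs] for term in vocab}

for term, row in matrix.items():
    print(f"{term:<10}" + "".join(f"{c:>4}" for c in row))
```

Each row treats a word as a nominal category with one count per document; word order is discarded.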
WordCounts (Harris ’04)
http://wordcount.org
Tag Clouds
Strengths
Can help with gisting and initial query formation.
Weaknesses
Sub-optimal visual encoding (size vs. position)
Inaccurate size encoding (long words are bigger)
May not facilitate comparison (unstable layout)
Term frequency may not be meaningful
Does not show the structure of the text
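The size-encoding weakness can be made concrete with a rough calculation. The sketch below assumes a linear count-to-font-size mapping and fixed-width glyphs 0.6 em wide; both are illustrative assumptions, not how any particular tag cloud tool works.

```python
def font_size(count, max_count, min_pt=10.0, max_pt=40.0):
    """Linear map from count to font size in points (one common choice)."""
    return min_pt + (max_pt - min_pt) * count / max_count

def approx_area(word, size_pt, glyph_aspect=0.6):
    """Rough bounding-box area, assuming each glyph is glyph_aspect * size wide."""
    return len(word) * glyph_aspect * size_pt * size_pt

# Two hypothetical words with identical counts.
counts = {"go": 20, "visualization": 20}
max_count = max(counts.values())
areas = {w: approx_area(w, font_size(c, max_count)) for w, c in counts.items()}
ratio = areas["visualization"] / areas["go"]
print(f"area ratio at equal frequency: {ratio:.1f}")
```

Under these assumptions the longer word covers several times the area of the shorter one even though both occur equally often.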
Keyword Weighting
Term Frequency
tf_{t,d} = count(t) in d
Can take log frequency: log(1 + tf_{t,d})
Can normalize to show proportion: tf_{t,d} / Σ_t tf_{t,d}
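The three term-frequency variants listed above, sketched for a single hypothetical document with whitespace tokenization (the sample text is made up):

```python
import math
from collections import Counter

# Hypothetical document, already tokenized by splitting on whitespace.
doc = "the cat sat on the mat the end".split()

# Raw term frequency: how often each term occurs in the document.
tf = Counter(doc)

# Log frequency dampens the effect of very common terms.
log_tf = {t: math.log(1 + c) for t, c in tf.items()}

# Proportion: each count divided by the total number of tokens.
total = sum(tf.values())
prop_tf = {t: c / total for t, c in tf.items()}

for term in sorted(tf, key=tf.get, reverse=True):
    print(f"{term:<5} tf={tf[term]} log={log_tf[term]:.3f} prop={prop_tf[term]:.3f}")
```

The proportions sum to 1 within a document, which makes documents of different lengths comparable.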
log(tf_w) / log(tf_the), i.e. the log frequency of word w relative to that of "the"
http://benfry.com/traces/
Diff
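A word-level diff between two versions of a text can be sketched with Python's standard difflib module. The two revisions below are hypothetical, and this is only one way to compute the changes between versions.

```python
import difflib

# Two hypothetical revisions of the same sentence, split into words.
old = "Text visualization shows word counts".split()
new = "Text visualization shows word sequences and counts".split()

# SequenceMatcher aligns the two word lists; get_opcodes() describes
# how to turn `old` into `new` as equal/insert/delete/replace spans.
matcher = difflib.SequenceMatcher(a=old, b=new)
ops = [
    (tag, old[i1:i2], new[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for tag, removed, added in ops:
    print(tag, removed, added)
```

Only the changed spans are kept; unchanged text is what stays stable from one revision to the next.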
History Flow [Viegas et al.]
Wikipedia History Flow (IBM)
Conversations
Visualizing Conversation
Many dimensions to consider:
Who (senders, receivers)
What (the content of communication)
When (temporal patterns)
Interesting cross-products:
What x When -> Topic “Zeitgeist”
Who x Who -> Social network
Who x Who x What x When -> Information flow
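The cross-products above amount to counting messages grouped by pairs of dimensions. A minimal sketch, assuming a hypothetical message log in which each record already carries a sender, receiver, month, and topic label:

```python
from collections import Counter

# Hypothetical message log: (sender, receiver, month, topic).
messages = [
    ("ann", "bob", "2009-08", "budget"),
    ("bob", "ann", "2009-08", "budget"),
    ("ann", "cat", "2009-09", "travel"),
    ("ann", "bob", "2009-09", "travel"),
]

# Who x Who: directed edge weights of a social network.
who_x_who = Counter((s, r) for s, r, _, _ in messages)

# What x When: how often each topic appears in each month.
what_x_when = Counter((topic, month) for _, _, month, topic in messages)

print(who_x_who.most_common())
print(what_x_when.most_common())
```

In practice the topic label is the hard part: it has to be derived from the message content rather than read off a field.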
Usenet Visualization [Viegas & Smith]
Show correspondence patterns in text forums
Initiate vs. reply; size and duration of discussion
Newsgroup crowds / Authorlines
Email Mountain [Viegas]
Conversation by person over time (who x when).
Themail [Viegas]
Topic modeling
Assume documents are a mixture of topics
Topics are (roughly) a set of co-occurring terms
Latent Semantic Analysis (LSA): reduce term matrix
Latent Dirichlet Allocation (LDA): statistical model
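The mixture assumption can be illustrated without fitting anything. The sketch below uses two hand-made topics and one hand-made document mixture (all names and numbers are hypothetical) and computes the word distribution they imply. It does not implement LSA or LDA, which estimate such quantities from data.

```python
# Hypothetical topics: each is a probability distribution over terms.
topics = {
    "health": {"insurance": 0.5, "care": 0.4, "vote": 0.1},
    "politics": {"insurance": 0.1, "care": 0.1, "vote": 0.8},
}

# Hypothetical document: 75% health, 25% politics.
doc_mixture = {"health": 0.75, "politics": 0.25}

def word_probability(term):
    """P(term | doc) = sum over topics of P(topic | doc) * P(term | topic)."""
    return sum(
        weight * topics[name].get(term, 0.0)
        for name, weight in doc_mixture.items()
    )

vocab = sorted({t for dist in topics.values() for t in dist})
p = {t: word_probability(t) for t in vocab}

for term, prob in p.items():
    print(f"{term:<10} {prob:.3f}")
```

Topic models work in the opposite direction: given only observed word counts, they infer topics and mixtures that plausibly produced them.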
Parallel Tag Clouds [Collins et al.]
Theme River [Havre et al.]
History of Comp. Ling. [Hall et al.]
Tiara [Wei et al.]
Stanford Dissertation Browser
with Jason Chuang, Dan Ramage & Christopher Manning
Oh, the humanities!
Computer Science Music
Modeling Abstraction
Determine your analysis task.
Understand abstraction of your language models.
Match analysis task with appropriate tools and models.