An Introduction To Text Mining: Bettina Berendt
Bettina Berendt
July 9th 2015: I still have to add references to these, but wanted to get the slides out to you already!
Motivation (1)
Motivation (2)
and/or its older brother: Information retrieval
Metadata as output
Metadata as input? Requires different search interfaces!
Going further: What topics exist in a collection of texts, and how do they evolve?
News texts, scientific publications, …
Guiding questions
• Information retrieval:
▫ Given the current user's information need, which are the most relevant documents?
• Text mining:
▫ What do the documents tell us? What's in the texts? What can we learn about the texts, their authors, ...?
▫ Many different subquestions
▫ Summarization (of one text, of many texts) is just one of them
• Cf.
▫ “Distant reading” (Moretti): understanding literature not by studying particular texts, but by aggregating and analyzing massive amounts of data.
▫ “Machine reading” (UCL Machine Reading Group): machines that can read and "understand" this textual information, converting it into interpretable structured knowledge to be leveraged by humans and other machines alike
Speed-reading
(Woody Allen)
I took a course in speed reading and was able to
read War and Peace in twenty minutes.
It's about Russia.
A personal “experiment”
- deliberately a bit silly, more a gentle introduction to a great tool and to some pitfalls of “distant reading”
Note about “said”: Compare Joyce's Dubliners
Double-check in Wikipedia
(method: string search)
• Count Pyotr Kirillovich (Pierre) Bezukhov: The large-bodied, ungainly, and socially
awkward illegitimate son of an old Russian grandee. Pierre, educated abroad, returns
to Russia as a misfit. His unexpected inheritance of a large fortune makes him
socially desirable. Pierre is the central character and often a voice for Tolstoy's own
beliefs or struggles.
• Prince Andrey Nikolayevich Bolkonsky: A strong but skeptical, thoughtful and
philosophical aide-de-camp in the Napoleonic Wars.
▫ Some searching needed ... Andrew ... Andrei ... Andrey
• Countess Natalya Ilyinichna (Natasha) Rostova: A central character, introduced as
"not pretty but full of life" and a romantic young girl, although impulsive and highly
strung, she evolves through trials and suffering and eventually finds happiness. She is
an accomplished singer and dancer.
• ...
• Prince Anatole Vasilyevich Kuragin: Hélène's brother and a very handsome and amoral
pleasure seeker who is secretly married yet tries to elope with Natasha Rostova.
• Vasily Dmitrich Denisov: Nikolai Rostov's friend and brother officer, who proposes to
Natasha.
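The "some searching needed" step for Andrew / Andrei / Andrey can be sketched as a plain string search that counts each spelling variant separately and then adds them up. This is only an illustration: the function name `count_variants` and the sample sentence are invented here, and a real check would run over the full novel or the Wikipedia article text.

```python
import re

def count_variants(text, variants):
    """Count whole-word occurrences of each spelling variant in text."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, variants)) + r")\b")
    counts = {variant: 0 for variant in variants}
    for match in pattern.finditer(text):
        counts[match.group(1)] += 1
    return counts

# Invented sample sentence, not a quotation from any translation.
sample = ("Prince Andrew rode on. Andrei Bolkonsky, whom some translations "
          "call Andrey, followed Andrew.")
counts = count_variants(sample, ["Andrew", "Andrei", "Andrey"])
total_mentions = sum(counts.values())
print(counts, total_mentions)
```

Deciding which spellings belong to the same character is knowledge the analyst brings in; the search itself does not discover it.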
Questions
• How much of this was “really automatic”?
• What existing knowledge (in my head and in others') went into this analysis, and how?
• Can you think of another reason why this (deliberately) turned out silly?
▫ Basic idea:
- Keywords are extracted from texts.
- These keywords describe the (usually) topical content of Web pages and other text contributions.
▫ Based on the vector space model of document collections:
- Each unique word in a corpus of Web pages = one dimension
- Each page(view) is a vector with non-zero weight for each word in that page(view), zero weight for other words
- Words become “features” (in a data-mining sense)
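A minimal sketch of the vector space model described above, using raw word counts as weights (the slides do not fix a weighting scheme; tf-idf is a common alternative). The toy corpus and the helper names `build_vocabulary` and `to_vector` are assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(pages):
    """Map each unique word in the corpus to one dimension (a column index)."""
    words = sorted({word for page in pages for word in page.lower().split()})
    return {word: index for index, word in enumerate(words)}

def to_vector(page, vocabulary):
    """Return the word-count vector of one page over the corpus vocabulary."""
    counts = Counter(page.lower().split())
    # Words that do not occur in this page get weight zero.
    return [counts.get(word, 0) for word in vocabulary]

# Toy corpus, invented for illustration.
pages = ["war and peace", "peace in Russia", "war in Russia and war abroad"]
vocabulary = build_vocabulary(pages)
vectors = [to_vector(page, vocabulary) for page in pages]

print(list(vocabulary))
for vector in vectors:
    print(vector)
```

Each column of the resulting matrix is one word, i.e. one feature that a data-mining algorithm can use.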
https://aeshin.org/textmining/
Data mining
(aka Knowledge Discovery)
1. Application understanding
2. Corpus generation
3. Data understanding
4. Text preprocessing
5. Search for patterns / modelling
▫ Topical analysis
▫ Sentiment analysis / opinion mining
6. Evaluation
7. Deployment
▫ Get help!
Preprocessing (1)
• Data cleaning
▫ Goal: get clean ASCII text
▫ Remove HTML markup*, pictures, advertisements, ...
▫ Automate this: wrapper induction
* Note: HTML markup may carry information too (e.g., <b> or <h1>
marks something important), which can be extracted! (Depends on the
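A minimal sketch of the cleaning step, assuming the input is a single HTML string. It uses Python's standard `html.parser` to drop markup and keep the visible text, while separately recording text inside `<b>` and `<h1>` so that the information carried by the markup is not lost. The class name `TextExtractor`, the helper `clean_html`, and the sample page are invented for illustration; this is hand-written tag stripping, not wrapper induction.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text and, separately, text marked up with <b> or <h1>."""

    IMPORTANT = {"b", "h1"}        # tags treated as marking something important
    SKIPPED = {"script", "style"}  # content that is never visible text

    def __init__(self):
        super().__init__()
        self.text = []
        self.important = []
        self.important_depth = 0
        self.skipped_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.IMPORTANT:
            self.important_depth += 1
        elif tag in self.SKIPPED:
            self.skipped_depth += 1

    def handle_endtag(self, tag):
        if tag in self.IMPORTANT and self.important_depth > 0:
            self.important_depth -= 1
        elif tag in self.SKIPPED and self.skipped_depth > 0:
            self.skipped_depth -= 1

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk or self.skipped_depth > 0:
            return
        self.text.append(chunk)
        if self.important_depth > 0:
            self.important.append(chunk)

def clean_html(html):
    """Return (plain text, list of emphasized text chunks)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text), parser.important

# Invented sample page.
page = ("<html><head><style>p {color: red}</style></head><body>"
        "<h1>Text mining</h1><p>An <b>introduction</b> for DH.</p>"
        "<img src='ad.png'></body></html>")
text, important = clean_html(page)
print(text)
print(important)
```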
Preprocessing (2)
• Goal: get processable lexical / syntactical units
• Tokenize (find word boundaries)
• Lemmatize / stem
▫ ex. buyers, buyer → buyer / buyer, buying, ... → buy
• Remove stopwords
• Find Named Entities (people, places, companies, ...); filtering
• Resolve polysemy and homonymy: word sense disambiguation; “synonym unification”
• Part-of-speech tagging; filtering of nouns, verbs, adjectives, ...
• ...
Do you see a problem here for DH? What implicit assumptions are made?
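The first three steps (tokenize, remove stopwords, stem) can be sketched with the standard library alone. Everything here is a deliberately crude assumption: the stopword list is a tiny illustrative sample, `naive_stem` is a toy suffix stripper rather than a real stemmer such as Porter's, and the tokenizer keeps only runs of the letters a to z, which already builds in the assumption of unaccented English text.

```python
import re

# Tiny illustrative stopword list; real lists contain hundreds of words.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it into runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [naive_stem(token) for token in tokenize(text)
            if token not in STOPWORDS]

stems = preprocess("The buyers are buying; the buyer buys.")
print(stems)
```

Running `tokenize("Hélène")` splits the name into two fragments, a small reminder of how language-specific such preprocessing choices are.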
Preprocessing (3)
Happiness in blogosphere
#gamergate
“GamerGate is a grassroots movement with the goal of supporting ethics in game journalism. Some feminists have claimed it is a hateful, misogynistic movement, but they haven't been able to meet the burden of proof on that.”
http://drunken-peasants-podcast.wikia.com/wiki/GamerGate
Gamergate tweets
• Based on the work of Budac, A., Chartier, R., Suomela, T., Gouglas, S., & Rockwell, G. (see sources at the end of this slideset)
• I received the data for the purposes of this summer school (i.e. also for you)
▫ Condition: we all respect the associated ethics code
▫ This is an interesting document in itself, and we will use it for part 3
• Data post-processed for you: “most retweeted tweets” Oct'14 – Mar'15, in 4 versions (each version assembled into one ZIP file)
▫ 1 document per month, tweet texts ordered by count of retweets (desc.) → Voyant
▫ 1 document per tweet, sorted into 1 folder per month → DocumentAtlas/Ontogen
▫ 1 document overall (→ Weka), with fields
- anonymized user ID
- Month
- Count in that month's dataset
- Tweet text
▫ The same, but with some post-processing that will make your analysis easier
Thank you!
Questions?
References
A good textbook on Text Mining:
• Feldman, R. & Sanger, J. (2007). The Text Mining Handbook. Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.
An introduction similar to this one, but also covering unsupervised learning in some detail, and with lots of pointers to books, materials, etc.:
• Shaw, R. (2012). Text-mining as a Research Tool in the Humanities and Social Sciences. Presentation at the Duke Libraries, September 20, 2012. https://aeshin.org/textmining/
An overview of news and (micro-)blogs mining:
• Berendt, B. (in press). Text mining for news and blogs analysis. To appear in C. Sammut & G.I. Webb (Eds.), Encyclopedia of Machine Learning and Data Mining. Berlin etc.: Springer. http://people.cs.kuleuven.be/~bettina.berendt/Papers/berendt_encyclopedia_2015_with_publication_info.pdf
See http://wiki.esi.ac.uk/Current_Approaches_to_Data_Mining_Blogs for more articles on the subject.
Gamergate sources
• Budac, A., Chartier, R., Suomela, T., Gouglas, S., & Rockwell, G. (2015). #GamerGate: Distant Reading Games Discourse. Paper presented at the CGSA 2015 conference at the HSSFC Congress at University of Ottawa, Ottawa, Ontario, June 2015.
• Rockwell, G. (2015). Appendix 1: Ethics of Twitter Gamergate Research.
• Rockwell, Geoffrey; Suomela, Todd, 2015, "Gamergate Reactions", http://dx.doi.org/10.7939/DVN/10253 V5 [Version].
More sources
• Please find the URLs of pictures and screenshots in the Powerpoint “comment” box
• Thanks to the Internet for them!