Lec 19
TEXT PROCESSING
6.S080 SOFTWARE SYSTEMS FOR DATA SCIENCE
TIM KRASKA
CASE STUDY FOR THIS CLASS
You work at Nickelback Inc.
Nickelback Inc. recently downloaded every song lyric ever written (TBs
of data) to draw inspiration, as they have lately had trouble producing
a number 1 hit.
Now they want to create a system that lets them search through this
large collection of text and helps them write some songs.
YOUR TASK
Task 1: Design a system that efficiently finds all lyrics that contain
certain keywords (e.g., “mountain” and “grass”)
Task 2: Create a simple ranking for the query results and enable
Nickelback to cluster the songs
Task 3: Extend the system to allow search by sentiment (e.g., all
happy songs, sad songs, …)
Task 4: Extend the system to find songs with the right meaning of
“grass” (FYI: Nickelback is a clean band)
GOAL: (EFFICIENT) TECHNIQUES TO PROCESS TEXT
Basic queries:
• How often does word X appear?
• How often do words X and Y appear together?
• …
Search engines:
• Return the top 10 documents for a given query
• Which news items are most relevant to me?
• …
Analytics:
• What are the trending topics on the web?
• How to predict next month's unemployment rate?
• How to predict Walmart's sales numbers before they are released
(e.g., to make a buy or sell recommendation)?
• …
THE BASIC INDEXING PIPELINE
Tokenizer → token stream → Linguistic modules → modified tokens → Indexer
Sec. 2.2.1
TOKENIZATION
Input: “Friends, Romans and Countrymen”
Output: Tokens
• Friends
• Romans
• and
• Countrymen
A token is an instance of a sequence of characters grouped together as
a useful semantic unit for processing
Each such token is now a candidate for an index entry, after further
processing
But what are valid tokens to emit?
CLICKER
Name 3 or more issues with tokenization that could influence the
search result.
Sec. 2.2.1
TOKENIZATION
Issues in tokenization:
• Finland’s capital → Finland? Finlands? Finland’s?
• Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art, lowercase, lower-case, lower case ?
• San Francisco: one token or two?
• How do you decide it is one token?
Numbers/Dates
• Examples:
• Date: 3/20/91 or Mar. 20, 1991 or 20/3/91 or 55 B.C.
• Numbers: My PGP key is 324a3df234cb23e
• Phone numbers: (800) 234-2333
• Older IR systems may not index numbers
• But they are often very useful: think of looking up error
codes/stack traces on the web, or finding a web address
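To see how much the tokenizer's choices matter, here is a minimal Python sketch (illustrative only, not the lecture's code) comparing a naive whitespace split with a simple regex tokenizer on the examples above:

import re

text = "Hewlett-Packard's state-of-the-art PGP key is 324a3df234cb23e; call (800) 234-2333."

# Naive whitespace split: punctuation stays glued to the tokens.
print(text.split())

# Simple regex tokenizer: splits on non-word characters, so it also
# splits hyphenated names, possessives, and phone numbers into pieces.
print(re.findall(r"\w+", text))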
Sec. 2.2.1
THE BASIC INDEXING PIPELINE
Tokenizer
Linguistic modules
Modified tokens.
Indexer
Sec. 2.2.2
STOP WORDS
With a stop list, you exclude from the dictionary entirely the most common
words. Intuition:
• They have little semantic content: the, a, and, to, be
• There are a lot of them: ~30% of postings for top 30 words
Clicker: When should you remove stop words?
a) Always remove them, for both search and analytical tasks
b) Remove them only for search tasks
c) Remove them only for analytical tasks
d) Never remove them
e) Scooby-doo – do not pick this answer :)
Sec. 2.2.3
NORMALIZATION TO TERMS
We need to “normalize” words in indexed text as well as query words into the
same form
• We want to match U.S.A. and USA
Result is terms: a term is a (normalized) word type, which is an entry in our IR
system dictionary
We most commonly implicitly define equivalence classes of terms by, e.g.,
• deleting periods to form a term
• U.S.A., USA → USA
• deleting hyphens to form a term
• anti-discriminatory, antidiscriminatory → antidiscriminatory
Sec. 2.2.3
CASE FOLDING
Reduce all letters to lower case
• exception: upper case in mid-sentence?
• e.g., General Motors
• Fed vs. fed
• Brown vs. brown
• Often best to lower case everything, since users will
use lowercase regardless of ‘correct’ capitalization…
Google example:
• Query C.A.T.
• Even today #1 result is for “cat” not Caterpillar Inc.
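A minimal sketch of equivalence classing by normalization (case folding plus deleting periods and hyphens); the function name is illustrative:

def normalize(token: str) -> str:
    # Case-fold, delete periods (U.S.A. -> usa) and hyphens
    # (anti-discriminatory -> antidiscriminatory).
    return token.lower().replace(".", "").replace("-", "")

for t in ["U.S.A.", "USA", "anti-discriminatory", "General"]:
    print(t, "->", normalize(t))   # U.S.A. and USA map to the same term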
Sec. 2.2.3
NORMALIZATION TO TERMS
An alternative to equivalence classing is to do
asymmetric expansion
An example of where this may be useful
• Enter: window  → Search: window, windows
• Enter: windows → Search: Windows, windows, window
• Enter: Windows → Search: Windows
Potentially more powerful, but less efficient
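A sketch of asymmetric expansion as a query-side lookup table (the table itself is hypothetical):

EXPANSIONS = {
    "window":  ["window", "windows"],
    "windows": ["Windows", "windows", "window"],
    "Windows": ["Windows"],
}

def expand_query(term):
    # Unlike equivalence classing, the index keeps the original forms;
    # only the query is expanded, and the mapping need not be symmetric.
    return EXPANSIONS.get(term, [term])

print(expand_query("windows"))  # ['Windows', 'windows', 'window']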
THESAURI AND SOUNDEX
Do we handle synonyms?
• E.g., by hand-constructed equivalence classes
• car = automobile   color = colour
• We can rewrite to form equivalence-class terms
• When the document contains automobile, index it under car-
automobile (and vice-versa)
• Or we can expand a query
• When the query contains automobile, look under car as well
What about spelling mistakes?
• One approach is soundex, which forms equivalence classes of
words based on phonetic heuristics
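A minimal Soundex sketch (it skips the classic special handling of “h” and “w”, but shows the idea of phonetic equivalence classes):

CODES = {c: d for d, letters in
         {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
          "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

def soundex(word):
    word = word.lower()
    digits = [CODES.get(c, "") for c in word]
    out, prev = [word[0].upper()], digits[0]
    for d in digits[1:]:
        if d and d != prev:   # drop vowels, collapse repeated codes
            out.append(d)
        prev = d
    return ("".join(out) + "000")[:4]  # pad/truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))  # both map to R163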
Sec. 2.2.4
LEMMATIZATION
Reduce inflectional/variant forms to base form
E.g.
• am, are, is → be
• car, cars, car's, cars’ → car
the boy's cars are different colors → the boy car
be different color
Lemmatization implies doing “proper” reduction
to dictionary headword form
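A sketch using NLTK's WordNet lemmatizer (assumes nltk with the wordnet data downloaded):

# One-time setup: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("are", pos="v"))   # -> be
print(lem.lemmatize("cars", pos="n"))  # -> car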
Sec. 2.2.4
STEMMING
Reduce terms to their “roots” before indexing
“Stemming” suggests crude affix chopping
• language dependent
• e.g., automate(s), automatic, automation all reduced to
automat.
PORTER’S ALGORITHM
Common algorithm for stemming English
• Results suggest it’s at least as good as other stemming options
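NLTK ships an implementation; a quick sketch on the slide's example words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", stemmer.stem(w))  # stems need not be dictionary words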
THE BASIC INDEXING PIPELINE
Tokenizer → Linguistic modules → Indexer
Term-document incidence matrix (1 = the term appears in the play):

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Brutus             1                1             0          1        0        0
Caesar             1                1             0          1        1        1
Calpurnia          0                1             0          0        0        0
Cleopatra          1                0             0          0        0        0
Mercy              1                0             1          1        1        1
Worser             1                0             1          1        1        0
Sec. 1.2
INVERTED INDEX
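The slide's figure is lost; as a sketch, an inverted index is just a map from each term to the sorted list of documents containing it (toy data chosen to match the friend/roman/countryman example later in the deck):

from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: list of normalized tokens}
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in tokens:
            index[term].add(doc_id)
    # Keep postings sorted so they can be merge-intersected later.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: ["roman"], 2: ["friend", "roman"], 4: ["friend"],
        13: ["countryman"], 16: ["countryman"]}
print(build_inverted_index(docs))
# {'roman': [1, 2], 'friend': [2, 4], 'countryman': [13, 16]}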
CLICKER: INTERSECTING TWO POSTINGS LISTS (A “MERGE” JOIN)
Brutus → 2, 4, 8, 16, 32, 64, 128
Caesar → 1, 2, 3, 5, 8, 13, 21, 34
Clicker: What does the merge return?
A) 2, 8
B) 41, 128
C) 2, 4, 8, 41, 48, 64, 128
D) 11, 31
E) 1, 2, 3, 8, 11, 17, 21, 31
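The merge walks both sorted lists in lockstep, in O(m + n) time; a minimal sketch (not the lecture's code):

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1       # advance the pointer with the smaller doc ID
        else:
            j += 1
    return answer

print(intersect([2, 4, 8, 16, 32, 64, 128],
                [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]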
Sec. 1.3
CLICKER
What is the best order for query processing?
Consider a query that is an AND of n terms.
For each of the n terms, get its postings, then AND them together.
Brutus    → 2, 4, 8, 16, 32, 64, 128
Caesar    → 1, 2, 3, 5, 8, 16
Calpurnia → 13, 16
Query: Brutus AND Calpurnia AND Caesar
Clicker: What join order should you use?
A) Brutus join Caesar then join Calpurnia
B) Scooby-doo
C) Caesar join Calpurnia then join Brutus
D) Calpurnia join Brutus then join Caesar
Sec. 1.3
Answer: process terms in order of increasing document frequency:
start with the smallest postings list (Calpurnia, 2 docs), intersect it
with the next smallest (Caesar), then with Brutus.
Brutus    → 2, 4, 8, 16, 32, 64, 128
Caesar    → 1, 2, 3, 5, 8, 16
Calpurnia → 13, 16
In relational-algebra terms, this is the classic join-ordering problem
for σ_P(R1 ⋈ R2 ⋈ …): start from the smallest relations.
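In code, the heuristic is simply to sort the postings lists by length before merging (reusing the intersect function sketched above):

def intersect_many(postings_lists):
    # Process the rarest terms first so intermediate results stay small.
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for p in ordered[1:]:
        if not result:        # empty intermediate result: stop early
            break
        result = intersect(result, p)
    return result

brutus    = [2, 4, 8, 16, 32, 64, 128]
caesar    = [1, 2, 3, 5, 8, 16]
calpurnia = [13, 16]
print(intersect_many([brutus, caesar, calpurnia]))  # [16]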
THE BASIC INDEXING PIPELINE
Tokenizer → Linguistic modules → Indexer → inverted index:
friend     → 2, 4
roman      → 1, 2
countryman → 13, 16
Sec. 2.4
PHRASE QUERIES
Want to be able to answer queries such as “computer science” – as a
phrase
Thus the sentence “I worked on my science project on the
computer” is not a match.
• The concept of phrase queries has proven easily understood by
users; one of the few “advanced search” ideas that works
• Many more queries are implicit phrase queries
For this, it no longer suffices to store only
<term: docs> entries
Ideas???
Sec. 2.4.1
First idea: a biword index. Index every pair of consecutive words as
its own dictionary term, and rewrite a longer phrase query as an AND
of biwords.
Clicker:
A) This strategy always works
B) Leads to less recall
C) Leads to less precision
D) Scooby-doo
Sec. 2.4.2
Better idea: a positional index. For each term, store the positions at
which it appears in each document:
<term: doc-frequency; doc: position, position, …>
An example:
<be: 993427;
1: 7, 18, 33, 72, 86, 231;
2: 3, 149;
4: 17, 191, 291, 430, 434;
5: 363, 367, …>
Here “be” appears in 993,427 documents; in document 1 at positions
7, 18, 33, …
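A sketch of answering a two-word phrase query with a positional index (the index layout mirrors the example above; helper names are illustrative):

def phrase_query(index, w1, w2):
    # index: {term: {doc_id: sorted list of positions}}
    docs1 = index.get(w1, {})
    docs2 = index.get(w2, {})
    hits = []
    for doc in docs1.keys() & docs2.keys():
        positions2 = set(docs2[doc])
        # w2 must occur directly after w1 somewhere in the document.
        if any(p + 1 in positions2 for p in docs1[doc]):
            hits.append(doc)
    return sorted(hits)

index = {"computer": {7: [3, 12], 9: [5]},
         "science":  {7: [4],     9: [1]}}
print(phrase_query(index, "computer", "science"))  # [7]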
CASE STUDY FOR THIS CLASS
You work at Nickelback Inc.
Nickelback Inc. recently downloaded every song lyric ever written (TBs of
data) to draw inspiration, as they have lately had trouble producing a
number 1 hit.
Now they want to create a system that lets them search through this large
collection of text and helps them write some songs.
Your task:
Task 1: Design a system that efficiently finds all lyrics that contain certain
keywords (e.g., “mountain” and “grass”)
Task 2: Create a simple ranking for the query results and enable
Nickelback to cluster the songs
Task 3: Extend the system to allow search by sentiment (e.g., all happy
songs, sad songs, …)
Task 4: Extend the system further to find songs with the right meaning of
“grass” (the green stuff in the football stadium)
Task 5: Develop an assistant that helps Nickelback write songs by
predicting the next sentence
TERM-DOCUMENT COUNT INDICES
Idea: create a vector representation of each document and compare the
vectors (e.g., cosine similarity)
Consider the number of occurrences of a term in a document:
• Each document is a count vector in ℕ^|V|: a column below

            Antony&Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony            157               73            0          0        0        0
Brutus              4              157            0          1        0        0
Caesar            232              227            0          2        1        1
Calpurnia           0               10            0          0        0        0
Cleopatra          57                0            0          0        0        0
mercy               2                0            3          5        5        1
worser              2                0            1          1        1        0
the                 ?                ?            ?          ?        ?        ?
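A sketch of comparing two count vectors with cosine similarity, using counts from the table above (the Antony and Cleopatra and Julius Caesar columns):

import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

antony_cleo   = Counter({"antony": 157, "brutus": 4, "caesar": 232,
                         "cleopatra": 57, "mercy": 2, "worser": 2})
julius_caesar = Counter({"antony": 73, "brutus": 157, "caesar": 227,
                         "calpurnia": 10})
print(round(cosine(antony_cleo, julius_caesar), 3))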
LEVERAGE WORDNET
Unsupervised: a sentiment lexicon such as WordNet-Affect or similar
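WordNet-Affect itself is not bundled with NLTK; as a sketch of the same lexicon-based idea, SentiWordNet (which NLTK does ship) scores word senses for positivity/negativity:

# One-time setup: nltk.download("wordnet"); nltk.download("sentiwordnet")
from nltk.corpus import sentiwordnet as swn

def lexicon_sentiment(tokens):
    # Crude song-level score: positive minus negative, first sense only.
    score = 0.0
    for t in tokens:
        senses = list(swn.senti_synsets(t))
        if senses:
            score += senses[0].pos_score() - senses[0].neg_score()
    return score

print(lexicon_sentiment(["happy", "sunshine"]))  # should lean positive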
SUPERVISED
https://github.com/eclipse/deeplearning4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/word2vecsentiment/Word2VecSentimentRNN.java
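The linked DL4J example trains an RNN on word2vec features in Java; a much simpler Python sketch of the same supervised idea (assumes scikit-learn and some word-vector lookup wv, e.g. a trained gensim model's .wv):

import numpy as np
from sklearn.linear_model import LogisticRegression

def song_vector(tokens, wv, dim=100):
    # Represent a song by the average of its word vectors.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Given tokenized songs and labels (1 = happy, 0 = sad):
# X = np.stack([song_vector(toks, wv) for toks in songs])
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
# clf.predict(X_new)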
1. Take a 3-layer neural network (1 input
layer + 1 hidden layer + 1 output layer).
2. Feed it a word and train it to predict its
neighboring words.
3. Remove the last (output) layer and keep
the input and hidden layers.
4. Now, input a word from within the
vocabulary. The output given at the hidden
layer is the “word embedding” of the input
word.
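A minimal training sketch, assuming the gensim library and a toy corpus (real use needs the TBs of lyrics):

from gensim.models import Word2Vec

corpus = [["the", "mountain", "has", "a", "lot", "of", "grass"],
          ["we", "climbed", "the", "mountain"]]
model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                 min_count=1, sg=1)          # sg=1: skip-gram
print(model.wv["mountain"][:5])              # the hidden-layer "embedding"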
IDEA BEHIND WORD2VEC
Rome − Italy + China would return Beijing (the country→capital offset is roughly the same vector in both cases)
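With pretrained vectors (here loaded via gensim's downloader, an assumption; the lecture does not name a toolkit), the analogy is one call:

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # pretrained GloVe vectors
print(wv.most_similar(positive=["rome", "china"],
                      negative=["italy"], topn=3))
# "beijing" should rank at or near the top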
WHAT IS THE PROBLEM WITH WORD EMBEDDINGS?
“The mountain has a lot of grass” vs. “You should never smoke grass”:
word2vec assigns “grass” a single vector independent of context, so the
two senses are indistinguishable.
A LITTLE HISTORY
Semi-Supervised Sequence Learning, Google, 2015
Improving Language Understanding by Generative Pre-Training,
OpenAI, 2018 – based on transformers/attention from “Attention Is All
You Need”, Vaswani et al., 2017
BERT
BERT VS OPENAI GPT VS ELMO
TASKS
http://www.msmarco.org/leaders.aspx
BERT FOR FEATURE EXTRACTION
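A sketch with the HuggingFace transformers library (an assumption; the slides don't prescribe one): run text through BERT and take the final hidden states as contextual features:

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("The mountain has a lot of grass", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)
features = out.last_hidden_state    # shape: [1, num_tokens, 768]
# Unlike word2vec, "grass" here gets a context-dependent vector:
# it would differ in "You should never smoke grass".
print(features.shape)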
MICROSOFT MARCO DATASETS
GOOGLE IS NOW USING BERT