1 Overview

CS 589 is a course on Text Mining and Information Retrieval, taught by Susan Liu at Stevens Institute of Technology. The course covers various topics including information retrieval techniques, text mining, and evaluation methods, with a focus on practical applications like building search engines and using machine learning tools. Students will complete programming assignments, a midterm, and a final project, with prerequisites including fluency in Python and a background in statistics.


CS 589 Fall 2020

Text Mining and Information Retrieval

Instructor: Susan Liu
TA: Huihui Liu

Stevens Institute of Technology
Welcome to CS589

• Instructor: Susan (Xueqing) Liu
• Email: [email protected]
• CAs:
  • Huihui Liu [email protected]
Who am I?

• Assistant professor, joined Jan 2020
• PhD @ UIUC, 2019
• My research:
  • Helping users (especially software developers) search for information more quickly

[Diagram: my research at the intersection of software engineering, security, ML, and text mining/IR]
What is CS589 about?

• Text Mining
  • The study of extracting high-quality information from raw text

• Information retrieval
  • The study of retrieving information/resources/knowledge relevant to an information need
Information Retrieval Techniques

“Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.”

— The History of Information Retrieval Research, Mark Sanderson and Bruce Croft
Information Retrieval Techniques

• How does Google know “cs 589” refers to a course?
• How does Google know stevens = SIT?
• How does Google return results so quickly?
Information Retrieval Techniques

• Getting enough coverage of users’ information need
• Query understanding, personalization, results diversification, result page optimization, etc.
• Making sure the results are returned to users fast
A Brief History of IR

• 300 BC: Callimachus creates the first library catalog
• 1950s: Punch cards, searching at 600 cards/min
• 1958: Cranfield evaluation methodology; word-based indexing
• 1960s: Building IR systems on computers; relevance feedback
• 1970s: TF-IDF; probability ranking principle
• 1980s: TREC; learning to rank; latent semantic indexing
• 1990 - now: Web search; supporting natural language queries
Information need

• Information need: “An individual or group’s desire to locate and obtain information to satisfy a need”, e.g., question answering, program repair, route planning

• Query: A (short) natural language representation of the user’s information need
The Boolean retrieval system

• e.g., SELECT * FROM table_computer WHERE price < $500 AND brand = “Dell”
• The primary commercial retrieval system for 3 decades
• Many systems today still use Boolean retrieval, e.g., faceted search
  • Library catalogs, eCommerce search, etc.

• Advantage: returns exactly what you ask for

• Disadvantage:
  • Queries can only be specified over the pre-defined categories
  • Too few / too many results may be returned
The Boolean retrieval system

• The user may specify a condition that does not exist
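The SELECT-style example above can be sketched as a tiny Boolean filter in Python. The product records and field names below are hypothetical illustrations, not course material:

```python
# A minimal sketch of Boolean (exact-match, AND-semantics) retrieval
# over structured attributes, mirroring the SQL example on the slide.
products = [
    {"name": "Inspiron 15",  "brand": "Dell",   "price": 450},
    {"name": "XPS 13",       "brand": "Dell",   "price": 999},
    {"name": "ThinkPad E14", "brand": "Lenovo", "price": 480},
]

def boolean_search(items, brand, max_price):
    """Return items satisfying ALL conditions exactly (AND semantics)."""
    return [it for it in items if it["brand"] == brand and it["price"] < max_price]

hits = boolean_search(products, brand="Dell", max_price=500)
print([h["name"] for h in hits])  # only items matching every predicate
```

Note the two disadvantages from the slide are visible here: a condition over an attribute the data does not have simply fails, and the result set is either exactly right or empty/overwhelming, with no ranking in between.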
The Cranfield experiment (1958)

• Imagine you need to help users search for literature in a digital library. How would you design such a system?

[Figure: a category tree with “computer science” split into “artificial intelligence” and “bioinformatics”; query = “subject = AI & subject = bioinformatics”]

• System 1: the Boolean retrieval system
The Cranfield experiment (1958)

• Imagine you need to help users search for literature in a digital library. How would you design such a system?

• query = “artificial intelligence”, with each document represented as a bag of words

• System 2: indexing documents by lists of words
The Cranfield experiment (1958)

• Compare system 1 with system 2

• Finding: Boolean retrieval system < word indexing system
Word indexing: vector-space model

• Represent each document/query as a vector

• The similarity = cosine score between the vectors
Term frequency

tf(w, d) = count(w, d)
d_i = [count(w_1, d_i), ..., count(w_n, d_i)]

• d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
• d2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
• d3 = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]

• query = “business intelligence”
• q = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
Vector space model

• To answer the query “business intelligence”:
  • q = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
  • d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
  • d2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
  • d3 = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]

score(q, d) = (q · d) / (‖q‖ · ‖d‖)
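The cosine score can be checked with a short Python sketch using the term-frequency vectors from the slide:

```python
import math

# Term-frequency vectors copied from the slide; the query is
# "business intelligence" (1-entries at the two matching vocabulary positions).
d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
d3 = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
q  = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

def cosine(q, d):
    """score(q, d) = (q . d) / (||q|| * ||d||)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine(q, d), 4))
```

With these vectors, d3 ranks highest because it matches both query terms, d1 matches one, and d2 matches neither.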
TF-only representations are inaccurate

• Documents are dominated by words such as “the” and “a”

• These words do not carry meaning, nor do they discriminate between documents

• q = “the artificial intelligence book”
• d1 = “the cat, the doc, and the book”
• d2 = “business intelligence”

score(q, d1) = 0.8164
score(q, d2) = 0.3535
⇒ score(q, d1) > score(q, d2)
Zipf’s law distribution of words

[Figure: word frequency vs. frequency rank, following Zipf’s law]
Stop words

• Documents are dominated by words such as “the” and “a”

• These words do not discriminate between documents
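One simple remedy is to drop stop words before building term vectors. A minimal sketch, assuming a tiny illustrative stop-word list (real systems use longer curated lists):

```python
# Hypothetical miniature stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def tokenize(text):
    """Naive whitespace tokenizer; real pipelines also strip punctuation."""
    return text.lower().split()

def remove_stop_words(text):
    return [t for t in tokenize(text) if t not in STOP_WORDS]

print(remove_stop_words("the artificial intelligence book"))
# only the content words remain
```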
Desiderata for a good ranking function

• If a word appears everywhere, it should be penalized

• If a word appears in the same document multiple times, its importance should not grow linearly

• q = “artificial intelligence”
• d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
• d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”

• d2 is not twice as relevant as d1
Inverse document frequency

• Inverse document frequency: penalizing a word’s TF based on its document frequency

IDF(w) = log(N / df(w))
weight(d, w) = TF(d, w) × IDF(w)

• q = “the artificial intelligence book”
• d1 = “the cat, the doc, and the book”
• d2 = “business intelligence”

TF-IDF weighting:
score(q, d1) = 0.8164 → 0.2041
score(q, d2) = 0.3535 → 0.3535
⇒ score(q, d1) < score(q, d2)
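The IDF formula can be sketched in Python. d1 and d2 below are from the slide; the third document is a hypothetical addition so that “the” occurs in more than one document of the toy corpus:

```python
import math

# Sketch of IDF(w) = log(N / df(w)) and weight(d, w) = TF(d, w) * IDF(w).
docs = [
    "the cat the doc and the book".split(),   # d1 from the slide
    "business intelligence".split(),          # d2 from the slide
    "the history of the web".split(),         # hypothetical extra document
]

def idf(word, docs):
    """log(N / df): the more documents contain the word, the lower its weight."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(word, doc, docs):
    return doc.count(word) * idf(word, docs)

# "the" appears in 2 of 3 documents, "intelligence" in only 1,
# so "intelligence" gets the larger IDF.
print(round(idf("the", docs), 3), round(idf("intelligence", docs), 3))
```

A word that occurred in every document would get IDF = log(1) = 0, wiping out its contribution entirely.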
Term frequency reweighting

• Term frequency reweighting: penalizing a word’s TF based on the TF itself

• If a word appears in the same document multiple times, its importance should not grow linearly

Max TF normalization: tf(w, d) = α + (1 - α) · count(w, d) / max_v count(v, d)

Log-scale normalization: tf(w, d) = 1 + log count(w, d) if count(w, d) > 0, else 0
Term frequency reweighting

• Logarithmic normalization: tf(w, d) = 1 + log count(w, d) if count(w, d) > 0, else 0

• q = “the artificial intelligence book”
• d1 = “the cat, the doc, and the book”
• d2 = “business intelligence”

score(q, d1) = 0.8164 → 0.7618
score(q, d2) = 0.3535 → 0.3535
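Both reweighting schemes above can be sketched directly from the formulas (the value α = 0.5 is an illustrative choice, not prescribed by the slides):

```python
import math

def max_tf_normalize(count, max_count, alpha=0.5):
    """Max TF normalization: tf = alpha + (1 - alpha) * count / max_v count(v, d)."""
    return alpha + (1 - alpha) * count / max_count

def log_tf(count):
    """Log-scale normalization: 1 + log(count) if count > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0

# Raw counts 1, 2, and 8: log scaling grows far slower than the raw counts,
# so a word repeated 8 times is not weighted 8 times as heavily.
print([round(log_tf(c), 3) for c in [1, 2, 8]])
```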
Document length pivoting

• Another problem with TF-IDF weighting:
  • Longer documents cover more topics, so the query may match a small subset of the vocabulary
  • Longer documents need to be considered differently

• q = “artificial intelligence”
• d1 = “artificial intelligence book”
• d2 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”

score(q, d1) > score(q, d2)


Document length pivoting

• For each query q and each document d, compute their relevance score score(q, d)

• Manually evaluate the relevance between q and d

relevance judgment@l = count(length = l, rel = 1) / count(length = l)

[Figure: relevance score and relevance judgment plotted against document length]
Document length pivoting

• Rotate the relevance score curve, such that it most closely aligns with the relevance judgment curve

[Figure: the score curve rotated about the pivot point, against the line y = x]

pivoted normalization = (1.0 - slope) × pivot + slope × old normalization
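The pivoting formula can be sketched in a few lines; the slope and pivot values below are illustrative assumptions, not values from the course:

```python
# Pivoted length normalization: rotate the normalizer around the pivot.
#   new_norm = (1 - slope) * pivot + slope * old_norm
# A document whose old normalizer equals the pivot is unaffected;
# longer documents are penalized less than under plain normalization.
def pivoted_norm(old_norm, pivot, slope):
    return (1.0 - slope) * pivot + slope * old_norm

pivot, slope = 100.0, 0.75  # illustrative values
print(pivoted_norm(100.0, pivot, slope))  # at the pivot: unchanged
print(pivoted_norm(200.0, pivot, slope))  # long doc: normalizer shrinks from 200
print(pivoted_norm(50.0, pivot, slope))   # short doc: normalizer grows from 50
```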


Document length pivoting

• Rotate the relevance score curve, such that it most closely aligns with the relevance judgment curve

• A similar formulation will be used frequently later
More on retrieval model design heuristics

• Axiomatic thinking in information retrieval [Fang et al., SIGIR 2004]

IR != web search

• The other side of information retrieval techniques


• Recommender systems (users who bought this also bought…)
• Online advertising

IR != web search

• Reasoning-based question answering systems

What about text mining?

[Diagram: Text Mining at the intersection of AI/ML, Data Mining, NLP, Database, and IR; related tasks include document classification, document clustering, information extraction, text summarization, sentiment analysis, and web search & mining]
Syllabus

• Vector space model, TF-IDF
• Probability ranking principle, BM25
• IR evaluation, query completion
• Inverted index, ES, PageRank, HITS
• Relevance feedback, PRF
• Neural IR
• EM algorithm
• RNN/LSTM
• Transformer/BERT
• Frontier topic: recommender systems
• Frontier topic: opinion analysis/mining
• Frontier topic: NMT, program synthesis

Assignment goals

Upon successful completion of this course, students should be able to:

• Evaluate ranking algorithms using information retrieval evaluation techniques, and implement text retrieval models such as TF-IDF and BM25;

• Use Elasticsearch to implement a prototypical search engine on Twitter data;

• Derive inference algorithms for maximum likelihood estimation (MLE) and implement the expectation maximization (EM) algorithm;

• Use state-of-the-art tools such as LSTM/BERT for text classification tasks

Prerequisite

• CS116 is required for undergrads; CS225 (data structures in Java) is recommended

• Fluency in Python is required

• A good knowledge of statistics and probability

• Knowledge of one or more of the following areas is a plus, but not required: Information Retrieval, Machine Learning, Data Mining, Natural Language Processing

• Contact the instructor if you aren’t sure

Format

• Meeting: every Monday 8:15-9:45

• 4 programming assignments
• Submit code + report

• 1 midterm
• in class

• Final project
Final Project

• Oct 19 - Oct 26: Students choose a topic; for each topic, they pick 2-3 coherent papers and write a summary of the papers

• Oct 26 - Nov 16: Students who share the same interest are grouped; each group proposes a novel research topic motivated by their survey

• Dec 14: Deliver a presentation in Week 14

• Dec 20: Submit the implementation (code in Python) as well as an 8-page academic paper as the final project
Grading

• Homework - 40%, Midterm - 30%, Project - 30%

• Late policy
  • Submit within 24 hours of the deadline - 90%; within 48 hours - 70%; 0 if the code does not compile
  • Submissions late by over 48 hours are generally not permitted
  • Medical conditions
  • A sudden increase in family duty
  • Too much workload from other courses
  • The assignment is too difficult
Plagiarism policy

• We have a very powerful plagiarism detection pipeline; do not take the risk

• Cheating case in CS284
  • A student put all his homework in a public GitHub repo
  • In the end, we found that 8+ students had copied his code
Question answering

• Please do not ask questions on Canvas; most questions can be asked on Piazza; otherwise, use email
Question asking protocol

• Regrading requests: email the TA, cc me, titled [CS589 regrading]
• Deadline extension requests: email me, titled [CS589 deadline]
• Dropping: email me, titled [CS589 drop]
• All technical questions: Piazza
  • Homework description clarification
  • Clarification on course materials
• Having trouble with homework: join my office hour directly, no need to email me
  • If you have a time conflict, email me & schedule another time
• Project discussion: join my office hour
• Ask any common questions shared by the class on Piazza
Your workload

[Timeline, Aug through Dec: Lectures/Readings run from the first day of instruction to the last day; Programming Assignments, the Midterm, and the Project are spread across the semester, with Thanksgiving in Nov]
Books

• No textbook

• Recommended readings:
  • Zhai, C., & Massung, S. (2016). Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Association for Computing Machinery and Morgan & Claypool.
  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
