1 Overview
Welcome to CS589
Who am I?
ML / text mining / IR
What is CS589 about?
• Text Mining
• The study of extracting high-quality information from raw text
• Information Retrieval
• The study of retrieving information/resources/knowledge relevant to an information need
Information Retrieval Techniques
“Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.”
Information need
• Information need: “An individual or group's desire to locate and obtain information to satisfy a need”, e.g., question answering, program repair, route planning
• Query: a (short) natural language representation of users’ information need
The Boolean retrieval system
• e.g., SELECT * FROM table_computer WHERE price < $500 AND brand = “Dell” (see the sketch after this list)
• Primary commercial retrieval system for 3 decades
• Many systems today still use Boolean retrieval, e.g., faceted search
• Library catalogs, eCommerce search, etc.
• Disadvantages:
• Can only specify queries based on the pre-defined categories
• Too few / too many results
• The user may specify a condition that does not exist
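Below is a minimal sketch of Boolean/faceted retrieval over an in-memory catalog, mirroring the SQL example above; it is an illustration rather than anything from the slides, and the field names ("price", "brand") and toy data are assumptions.

    # Toy catalog: each document is a dict of facet values (illustrative data).
    catalog = [
        {"id": 1, "brand": "Dell",   "price": 450},
        {"id": 2, "brand": "Dell",   "price": 899},
        {"id": 3, "brand": "Lenovo", "price": 480},
    ]

    def boolean_search(docs, **conditions):
        """Return documents satisfying ALL conditions (a Boolean AND query)."""
        return [doc for doc in docs
                if all(pred(doc.get(field)) for field, pred in conditions.items())]

    # price < 500 AND brand = "Dell"
    results = boolean_search(catalog,
                             price=lambda p: p is not None and p < 500,
                             brand=lambda b: b == "Dell")
    print([doc["id"] for doc in results])  # -> [1]

A query on a facet value that never occurs in the catalog simply returns an empty list, which makes the last disadvantage above concrete.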
The Cranfield experiment (1958)
• Imagine you need to help users search for literature in a digital library; how would you design such a system?
(Figure: two candidate systems, system 1 and system 2, are compared on an example query, “artificial”)
Term frequency
tf(w, d) = count(w, d)
d_i = [count(w_1, d_i), ..., count(w_n, d_i)]
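A minimal sketch (illustrative, not from the slides) of building raw term-frequency vectors over a fixed vocabulary; the toy documents are assumptions.

    from collections import Counter

    docs = ["artificial intelligence was founded as an academic discipline",
            "business intelligence"]

    # Vocabulary: all distinct words in the collection, in a fixed order.
    vocab = sorted({w for d in docs for w in d.lower().split()})

    def tf_vector(doc):
        """d_i = [count(w_1, d_i), ..., count(w_n, d_i)] over the fixed vocabulary."""
        counts = Counter(doc.lower().split())
        return [counts[w] for w in vocab]

    vectors = [tf_vector(d) for d in docs]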
Vector space model
score(q, d) = (q · d) / (||q|| · ||d||)
i.e., the cosine similarity between the query vector and the document vector
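A minimal sketch (an assumption, not from the slides) of this score as plain cosine similarity between two term-frequency vectors, such as the ones built above.

    import math

    def cosine_score(q_vec, d_vec):
        """score(q, d) = (q . d) / (||q|| * ||d||)."""
        dot = sum(qi * di for qi, di in zip(q_vec, d_vec))
        q_norm = math.sqrt(sum(qi * qi for qi in q_vec))
        d_norm = math.sqrt(sum(di * di for di in d_vec))
        if q_norm == 0 or d_norm == 0:
            return 0.0
        return dot / (q_norm * d_norm)

    print(cosine_score([1, 1, 0], [2, 1, 1]))  # -> 0.866...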
TF-only representations are inaccurate
• Highly frequent words (e.g., “the”, “a”) do not carry much meaning, nor do they discriminate between documents
Zipf’s law distribution of words
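For reference (the standard statement of the law, not reproduced on the slide): if words are ranked by frequency, the frequency of the word at rank r falls off roughly as

    f(r) ∝ 1 / r^s,  with s ≈ 1

so a handful of very frequent words account for a large fraction of all tokens, while most words occur only rarely.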
Stop words
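A minimal sketch (illustrative; the stop-word list is a tiny hand-picked assumption, not a standard list) of dropping stop words before counting terms.

    import re

    STOP_WORDS = {"the", "a", "an", "and", "was", "in", "of", "to"}  # toy list (assumption)

    def remove_stop_words(text):
        """Keep only content-bearing tokens; drop very frequent function words."""
        tokens = re.findall(r"[a-z]+", text.lower())
        return [w for w in tokens if w not in STOP_WORDS]

    print(remove_stop_words("the cat, the doc, and the book"))  # -> ['cat', 'doc', 'book']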
Desiderata for a good ranking function
• If a word appears in the same document multiple times, its importance should not grow linearly
• q = “artificial intelligence”
• d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
• d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”
• TF-IDF weighting: TF(d, w) × IDF(w) (see the sketch after this list)
• d1 = “the cat, the doc, and the book”
• d2 = “business intelligence”
• Max TF normalization: tf(w, d) = α + (1 − α) · count(w, d) / max_v count(v, d)
• Log scale normalization: tf(w, d) = 1 + log count(w, d) if count(w, d) > 0, otherwise 0
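A minimal sketch (an assumption, not from the slides) of TF × IDF weighting with raw TF and IDF(w) = log(N / df(w)); real systems use many variants of both factors, and the toy collection below is illustrative.

    import math
    import re
    from collections import Counter

    docs = ["the cat, the doc, and the book",
            "the business intelligence report",
            "the artificial intelligence book"]
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    N = len(tokenized)

    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))

    def tf_idf(word, toks):
        """TF(d, w) * IDF(w) with raw TF and IDF(w) = log(N / df(w))."""
        return toks.count(word) * math.log(N / df[word]) if word in df else 0.0

    # "the" occurs in every document, so its weight is 0; "intelligence" is discriminative.
    print(tf_idf("the", tokenized[0]))           # -> 0.0
    print(tf_idf("intelligence", tokenized[2]))  # -> 0.405...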
Term-frequency reweighing
• Logarithmic normalization (see the sketch below):
  tf(w, d) = 1 + log count(w, d) if count(w, d) > 0, otherwise 0
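A minimal sketch (an assumption, not from the slides) of the two TF reweighing schemes above: max-TF normalization with smoothing weight α, and log-scale normalization; α = 0.4 is just a common default, not a value given in the lecture.

    import math
    from collections import Counter

    def max_tf_normalize(counts, alpha=0.4):
        """tf(w, d) = alpha + (1 - alpha) * count(w, d) / max_v count(v, d)."""
        max_count = max(counts.values())
        return {w: alpha + (1 - alpha) * c / max_count for w, c in counts.items()}

    def log_tf_normalize(counts):
        """tf(w, d) = 1 + log count(w, d) if count(w, d) > 0, otherwise 0."""
        return {w: (1 + math.log(c)) if c > 0 else 0.0 for w, c in counts.items()}

    counts = Counter("the cat the doc and the book".split())
    print(max_tf_normalize(counts))  # 'the' no longer dominates linearly
    print(log_tf_normalize(counts))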
Document length pivoting
• q = “artificial intelligence”
• For each query q and each document d, compute their relevance score score(q, d)
(Figure: relevance score as a function of document length)
Document length pivoting
• Rotate the relevance score curve so that it most closely aligns with the relevance judgement curve (see the sketch below)
• The rotated line passes through the pivot point on y = x: pivot = slope × pivot + intercept, i.e., intercept = (1 − slope) × pivot
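A minimal sketch (an assumption; the slide only gives the line through the pivot) of applying this rotation to a length-dependent quantity, in the spirit of pivoted length normalization.

    def pivoted_scale(x, pivot, slope):
        """y = slope * x + intercept with intercept = (1 - slope) * pivot,
        so the line crosses y = x exactly at x = pivot."""
        intercept = (1.0 - slope) * pivot
        return slope * x + intercept

    # e.g., soften the effect of document length around a pivot of 100 tokens
    for doc_len in (20, 100, 400):
        print(doc_len, pivoted_scale(doc_len, pivot=100, slope=0.75))  # 40.0, 100.0, 325.0

In practice this pivoted quantity replaces the raw length normalizer in the denominator of the TF weight, so long documents are penalized less aggressively and short documents slightly more, bringing retrieval scores closer to the relevance judgements.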
IR != web search
What about text mining?
(Diagram: text mining sits at the intersection of AI/ML, data mining, and IR, covering document classification, document clustering, information extraction, and web search & mining)
Syllabus
• Vector space model, TF-IDF
• Probability ranking principle, BM25
• IR evaluation, query completion
• EM algorithm
• RNN/LSTM
• Transformer/BERT
• Evaluate ranking algorithms using information retrieval evaluation techniques, and implement text retrieval models such as TF-IDF and BM25;
• Derive inference algorithms for maximum likelihood estimation (MLE), and implement the expectation maximization (EM) algorithm;
• Knowledge of one or more of the following areas is a plus, but not required:
Information Retrieval, Machine Learning, Data Mining, Natural Language
Processing
• 4 programming assignments
• Submit code + report
• 1 midterm
• in class
• Final project
Final Project
Oct 19 - Oct 26: Students choose a topic; for each topic, they pick 2-3 coherent papers and write a summary of the papers
Oct 26 - Nov 16: Students who share the same interest are grouped together; each group proposes a novel research topic motivated by their survey
• Late policy
• Submit within 24 hours of the deadline: 90%; within 48 hours: 70%; 0 if the code does not compile
• Submissions more than 48 hours late are generally not permitted
• Medical conditions
• A sudden increase in family duty
• Too much workload from other courses
• The assignment is too difficult
Plagiarism policy
• We have a very powerful plagiarism detection pipeline; do not take the risk
Question asking protocol
• Please do not ask questions on Canvas; most questions can be asked on Piazza, otherwise use email
Lectures/Readings
Programming Assignments
Midterm
Project
Books
• No required textbook
• Recommended readings:
• Zhai, C., & Massung, S. (2016). Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Association for Computing Machinery and Morgan & Claypool.
• Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.