1 Overview

CS 589 is a course on Text Mining and Information Retrieval, taught by Susan Liu at Stevens Institute of Technology. The course covers various topics including information retrieval techniques, text mining, and evaluation methods, with a focus on practical applications like building search engines and using machine learning tools. Students will complete programming assignments, a midterm, and a final project, with prerequisites including fluency in Python and a background in statistics.


CS 589 Fall 2020

Text Mining and Information Retrieval

Instructor: Susan Liu
TA: Huihui Liu

Stevens Institute of Technology
Welcome to CS589

• Instructor: Susan (Xueqing) Liu
• Email: [email protected]
• CAs:
  • Huihui Liu [email protected]
Who am I?

• Assistant professor, joined Jan 2020
• PhD @ UIUC, 2019
• My research:
  • Helping users (especially software developers) search for information more quickly

[Diagram: my research at the intersection of software engineering, security, ML, and text mining/IR]
What is CS589 about?

• Text Mining
  • The study of extracting high-quality information from raw text

• Information retrieval
  • The study of retrieving information/resources/knowledge relevant to an information need
Information Retrieval Techniques

“Because the systems that are accessible today are so easy to use, it is tempting to think the technology behind them is similarly straightforward to build. This review has shown that the route to creating successful IR systems required much innovation and thought over a long period of time.”

— The History of Information Retrieval Research, Mark Sanderson and Bruce Croft
Information Retrieval Techniques

• How does Google know “cs 589” refers to a course?
• How does Google know stevens = SIT?
• How does Google return results so quickly?
Information Retrieval Techniques

• Getting enough coverage of users’ information need
• Query understanding, personalization, results diversification, result page optimization, etc.
• Making sure the results are returned to users fast
A Brief History of IR

• 300 BC: Callimachus creates the first library catalog
• 1950s: Punch cards, searching at 600 cards/min
• 1958: Cranfield evaluation methodology; word-based indexing
• 1960s: Building IR systems on computers; relevance feedback
• 1970s: TF-IDF; probability ranking principle
• 1980s: TREC; learning to rank; latent semantic indexing
• 1990 - now: Web search; supporting natural language queries
Information need

• Information need: “An individual or group’s desire to locate and obtain information to satisfy a need”, e.g., question answering, program repair, route planning

• Query: A (short) natural language representation of the user’s information need
The Boolean retrieval system

• e.g., SELECT * FROM table_computer WHERE price < $500 AND brand = “Dell”
• The primary commercial retrieval system for 3 decades
• Many systems today still use Boolean retrieval, e.g., faceted search
  • Library catalogs, eCommerce search, etc.

• Advantage: returns exactly what you ask for

• Disadvantage:
  • Queries can only be specified over the pre-defined categories
  • Too few / too many results may be returned
The Boolean retrieval system

• The user may specify a condition that does not exist
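The SELECT-style example above can be sketched as a tiny Boolean filter in Python. The product records and field names below are hypothetical illustrations, not course material:

```python
# A minimal sketch of Boolean (exact-match, AND-semantics) retrieval
# over structured attributes, mirroring the SQL example on the slide.
products = [
    {"name": "Inspiron 15",  "brand": "Dell",   "price": 450},
    {"name": "XPS 13",       "brand": "Dell",   "price": 999},
    {"name": "ThinkPad E14", "brand": "Lenovo", "price": 480},
]

def boolean_search(items, brand, max_price):
    """Return items satisfying ALL conditions exactly (AND semantics)."""
    return [it for it in items if it["brand"] == brand and it["price"] < max_price]

hits = boolean_search(products, brand="Dell", max_price=500)
print([h["name"] for h in hits])  # only items matching every predicate
```

Note the two disadvantages from the slide are visible here: a condition over an attribute the data does not have simply fails, and the result set is either exactly right or empty/overwhelming, with no ranking in between.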
The Cranfield experiment (1958)

• Imagine you need to help users search for literature in a digital library. How would you design such a system?

[Figure: a category tree with “computer science” split into “artificial intelligence” and “bioinformatics”; query = “subject = AI & subject = bioinformatics”]

• System 1: the Boolean retrieval system
The Cranfield experiment (1958)

• Imagine you need to help users search for literature in a digital library. How would you design such a system?

• query = “artificial intelligence”, with each document represented as a bag of words

• System 2: indexing documents by lists of words
The Cranfield experiment (1958)

• Compare system 1 with system 2

• Finding: Boolean retrieval system < word indexing system
Word indexing: vector-space model

• Represent each document/query as a vector

• The similarity = cosine score between the vectors
Term frequency

tf(w, d) = count(w, d)
d_i = [count(w_1, d_i), ..., count(w_n, d_i)]

• d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
• d2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
• d3 = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]

• query = “business intelligence”
• q = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
Vector space model

• To answer the query “business intelligence”:
  • q = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
  • d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
  • d2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
  • d3 = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]

score(q, d) = (q · d) / (‖q‖ · ‖d‖)
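The cosine score can be checked with a short Python sketch using the term-frequency vectors from the slide:

```python
import math

# Term-frequency vectors copied from the slide; the query is
# "business intelligence" (1-entries at the two matching vocabulary positions).
d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
d3 = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1]
q  = [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

def cosine(q, d):
    """score(q, d) = (q . d) / (||q|| * ||d||)."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d))
    return dot / norm if norm else 0.0

for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine(q, d), 4))
```

With these vectors, d3 ranks highest because it matches both query terms, d1 matches one, and d2 matches neither.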
TF-only representations are inaccurate

• Documents are dominated by words such as “the” and “a”

• These words do not carry meaning, nor do they discriminate between documents

• q = “the artificial intelligence book”
• d1 = “the cat, the doc, and the book”
• d2 = “business intelligence”

score(q, d1) = 0.8164
score(q, d2) = 0.3535
⇒ score(q, d1) > score(q, d2)
Zipf’s law distribution of words

[Figure: word frequency vs. frequency rank, following Zipf’s law]
Stop words

• Documents are dominated by words such as “the” and “a”

• These words do not discriminate between documents
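One simple remedy is to drop stop words before building term vectors. A minimal sketch, assuming a tiny illustrative stop-word list (real systems use longer curated lists):

```python
# Hypothetical miniature stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}

def tokenize(text):
    """Naive whitespace tokenizer; real pipelines also strip punctuation."""
    return text.lower().split()

def remove_stop_words(text):
    return [t for t in tokenize(text) if t not in STOP_WORDS]

print(remove_stop_words("the artificial intelligence book"))
# only the content words remain
```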
Desiderata for a good ranking function

• If a word appears everywhere, it should be penalized

• If a word appears in the same document multiple times, its importance should not grow linearly

• q = “artificial intelligence”
• d1 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”
• d2 = “Artificial intelligence was founded as an academic discipline in 1955, artificial intelligence”

• d2 is not twice as relevant as d1
Inverse document frequency

• Inverse document frequency: penalizing a word’s TF based on its document frequency

IDF(w) = log(N / df(w))
weight(d, w) = TF(d, w) × IDF(w)

• q = “the artificial intelligence book”
• d1 = “the cat, the doc, and the book”
• d2 = “business intelligence”

TF-IDF weighting:
score(q, d1) = 0.8164 → 0.2041
score(q, d2) = 0.3535 → 0.3535
⇒ score(q, d1) < score(q, d2)
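The IDF formula can be sketched in Python. d1 and d2 below are from the slide; the third document is a hypothetical addition so that “the” occurs in more than one document of the toy corpus:

```python
import math

# Sketch of IDF(w) = log(N / df(w)) and weight(d, w) = TF(d, w) * IDF(w).
docs = [
    "the cat the doc and the book".split(),   # d1 from the slide
    "business intelligence".split(),          # d2 from the slide
    "the history of the web".split(),         # hypothetical extra document
]

def idf(word, docs):
    """log(N / df): the more documents contain the word, the lower its weight."""
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df) if df else 0.0

def tf_idf(word, doc, docs):
    return doc.count(word) * idf(word, docs)

# "the" appears in 2 of 3 documents, "intelligence" in only 1,
# so "intelligence" gets the larger IDF.
print(round(idf("the", docs), 3), round(idf("intelligence", docs), 3))
```

A word that occurred in every document would get IDF = log(1) = 0, wiping out its contribution entirely.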
Term frequency reweighting

• Term frequency reweighting: penalizing a word’s TF based on the TF itself

• If a word appears in the same document multiple times, its importance should not grow linearly

Max TF normalization: tf(w, d) = α + (1 - α) · count(w, d) / max_v count(v, d)

Log-scale normalization: tf(w, d) = 1 + log count(w, d) if count(w, d) > 0, else 0
Term frequency reweighting

• Logarithmic normalization: tf(w, d) = 1 + log count(w, d) if count(w, d) > 0, else 0

• q = “the artificial intelligence book”
• d1 = “the cat, the doc, and the book”
• d2 = “business intelligence”

score(q, d1) = 0.8164 → 0.7618
score(q, d2) = 0.3535 → 0.3535
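Both reweighting schemes above can be sketched directly from the formulas (the value α = 0.5 is an illustrative choice, not prescribed by the slides):

```python
import math

def max_tf_normalize(count, max_count, alpha=0.5):
    """Max TF normalization: tf = alpha + (1 - alpha) * count / max_v count(v, d)."""
    return alpha + (1 - alpha) * count / max_count

def log_tf(count):
    """Log-scale normalization: 1 + log(count) if count > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0

# Raw counts 1, 2, and 8: log scaling grows far slower than the raw counts,
# so a word repeated 8 times is not weighted 8 times as heavily.
print([round(log_tf(c), 3) for c in [1, 2, 8]])
```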
Document length pivoting

• Another problem with TF-IDF weighting:
  • Longer documents cover more topics, so the query may match a small subset of the vocabulary
  • Longer documents need to be considered differently

• q = “artificial intelligence”
• d1 = “artificial intelligence book”
• d2 = “Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism”

score(q, d1) > score(q, d2)


Document length pivoting

• For each query q and each document d, compute their relevance score score(q, d)

• Manually evaluate the relevance between q and d

relevance judgment@l = count(length = l, rel = 1) / count(length = l)

[Figure: relevance score and relevance judgment plotted against document length]
Document length pivoting

• Rotate the relevance score curve, such that it most closely aligns with the relevance judgment curve

[Figure: the score curve rotated about the pivot point, against the line y = x]

pivoted normalization = (1.0 - slope) × pivot + slope × old normalization
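The pivoting formula can be sketched in a few lines; the slope and pivot values below are illustrative assumptions, not values from the course:

```python
# Pivoted length normalization: rotate the normalizer around the pivot.
#   new_norm = (1 - slope) * pivot + slope * old_norm
# A document whose old normalizer equals the pivot is unaffected;
# longer documents are penalized less than under plain normalization.
def pivoted_norm(old_norm, pivot, slope):
    return (1.0 - slope) * pivot + slope * old_norm

pivot, slope = 100.0, 0.75  # illustrative values
print(pivoted_norm(100.0, pivot, slope))  # at the pivot: unchanged
print(pivoted_norm(200.0, pivot, slope))  # long doc: normalizer shrinks from 200
print(pivoted_norm(50.0, pivot, slope))   # short doc: normalizer grows from 50
```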


Document length pivoting

• Rotate the relevance score curve, such that it most closely aligns with the relevance judgment curve

• A similar formulation will be used frequently later
More on retrieval model design heuristics

• Axiomatic thinking in information retrieval [Fang et al., SIGIR 2004]

IR != web search

• The other side of information retrieval techniques


• Recommender systems (users who bought this also bought…)
• Online advertising

IR != web search

• Reasoning-based question answering systems

What about text mining?

[Diagram: Text Mining at the intersection of AI/ML, Data Mining, NLP, Database, and IR; related tasks include document classification, document clustering, information extraction, text summarization, sentiment analysis, and web search & mining]
Syllabus

• Vector space model, TF-IDF
• Probability ranking principle, BM25
• IR evaluation, query completion
• Inverted index, ES, PageRank, HITS
• Relevance feedback, PRF
• Neural IR
• EM algorithm
• RNN/LSTM
• Transformer/BERT
• Frontier topic: recommender systems
• Frontier topic: opinion analysis/mining
• Frontier topic: NMT, program synthesis

Assignment goals

Upon successful completion of this course, students should be able to:

• Evaluate ranking algorithms using information retrieval evaluation techniques, and implement text retrieval models such as TF-IDF and BM25;

• Use Elasticsearch to implement a prototypical search engine on Twitter data;

• Derive inference algorithms for maximum likelihood estimation (MLE) and implement the expectation maximization (EM) algorithm;

• Use state-of-the-art tools such as LSTM/BERT for text classification tasks

Prerequisite

• CS116 is required for undergrads; CS225 (data structures in Java) is recommended

• Fluency in Python is required

• A good knowledge of statistics and probability

• Knowledge of one or more of the following areas is a plus, but not required: Information Retrieval, Machine Learning, Data Mining, Natural Language Processing

• Contact the instructor if you aren’t sure

Format

• Meeting: every Monday 8:15-9:45

• 4 programming assignments
• Submit code + report

• 1 midterm
• in class

• Final project
Final Project

• Oct 19 - Oct 26: Students choose a topic; for each topic, they pick 2-3 coherent papers and write a summary of the papers

• Oct 26 - Nov 16: Students who share the same interest are grouped; each group proposes a novel research topic motivated by their survey

• Dec 14: Deliver a presentation in Week 14

• Dec 20: Submit the implementation (code in Python) as well as an 8-page academic paper as the final project
Grading

• Homework - 40%, Midterm - 30%, Project - 30%

• Late policy
  • Submit within 24 hours of the deadline - 90%; within 48 hours - 70%; 0 if the code does not compile
  • Submissions late by over 48 hours are generally not permitted
  • Medical conditions
  • A sudden increase in family duty
  • Too much workload from other courses
  • The assignment is too difficult
Plagiarism policy

• We have a very powerful plagiarism detection pipeline; do not take the risk

• Cheating case in CS284
  • A student put all his homework in a public GitHub repo
  • In the end, we found that 8+ students had copied his code
Question answering

• Please do not ask questions on Canvas; most questions can be asked on Piazza; otherwise, use email
Question asking protocol

• Regrading requests: email the TA, cc me, titled [CS589 regrading]
• Deadline extension requests: email me, titled [CS589 deadline]
• Dropping: email me, titled [CS589 drop]
• All technical questions: Piazza
  • Homework description clarification
  • Clarification on course materials
• Having trouble with homework: join my office hour directly, no need to email me
  • If you have a time conflict, email me & schedule another time
• Project discussion: join my office hour
• Ask any common questions shared by the class on Piazza
Your workload

[Timeline, Aug through Dec: Lectures/Readings run from the first day of instruction to the last day; Programming Assignments, the Midterm, and the Project are spread across the semester, with Thanksgiving in Nov]
Books

• No textbook

• Recommended readings:
  • Zhai, C., & Massung, S. (2016). Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Association for Computing Machinery and Morgan & Claypool.
  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
