
Clustering and Similarity: Retrieving Documents
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
Retrieving documents of interest



Document retrieval
• Currently reading an article you like
• Goal: Want to find similar articles


Challenges
•  How do we measure similarity?
•  How do we search over articles?



Word count representation for measuring similarity


Word count document representation
• Bag of words model
- Ignore order of words
- Count # of instances of each word in vocabulary

Example: “Carlos calls the sport futbol. Emily calls the sport soccer.”
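A minimal sketch of how these counts could be computed in Python (the whitespace tokenization and lowercasing here are assumptions for illustration; the slides do not specify a tokenizer):

```python
# Sketch: bag-of-words counts for the example sentence.
# Tokenization is deliberately simple: split on whitespace, strip punctuation, lowercase.
from collections import Counter

doc = "Carlos calls the sport futbol. Emily calls the sport soccer."
tokens = [w.strip(".,").lower() for w in doc.split()]
word_counts = Counter(tokens)

print(word_counts)
# e.g. Counter({'calls': 2, 'the': 2, 'sport': 2, 'carlos': 1, 'futbol': 1, ...})
```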


Measuring similarity

Doc 1: [1 0 0 0 5 3 0 0 1 0 0 0 0]
Doc 2: [3 0 0 0 2 0 0 1 0 1 0 0 0]

Similarity = 1*3 + 5*2 = 13


Measuring similarity

Doc 1: [1 0 0 0 5 3 0 0 1 0 0 0 0]
Doc 3: [0 0 1 0 0 0 9 0 0 6 0 4 0]

Similarity = 0
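The similarity used on these slides is the dot product of the word-count vectors; both examples can be reproduced directly, as in this NumPy sketch:

```python
# Sketch: dot-product similarity between the word-count vectors from the slides.
import numpy as np

doc1 = np.array([1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0])
doc2 = np.array([3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0])
doc3 = np.array([0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0])

print(np.dot(doc1, doc2))  # 13 -- some shared words
print(np.dot(doc1, doc3))  # 0  -- no shared words
```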


Issues with word counts – Doc length

Doc 1: [1 0 0 0 5 3 0 0 1 0 0 0 0]   Doc 2: [3 0 0 0 2 0 0 1 0 1 0 0 0]   → Similarity = 13
Doubled: [2 0 0 0 10 6 0 0 2 0 0 0 0]   [6 0 0 0 4 0 0 2 0 2 0 0 0]   → Similarity = 52

Same content, but the longer (doubled) documents get a much larger similarity score.


Solution = normalize

Count vector: [1 0 0 0 5 3 0 0 1 0 0 0 0]
Norm: √(1² + 5² + 3² + 1²) = 6
Normalized vector: [1/6 0 0 0 5/6 3/6 0 0 1/6 0 0 0 0]
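A sketch of the normalization step, using the count vectors from the earlier slides, and of how it removes the document-length effect:

```python
# Sketch: normalize count vectors to unit length so that document length
# does not inflate the similarity score.
import numpy as np

doc1 = np.array([1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0], dtype=float)
doc2 = np.array([3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0], dtype=float)

def normalize(v):
    return v / np.linalg.norm(v)  # divide by sqrt of the sum of squared counts

print(np.linalg.norm(doc1))                              # 6.0
print(np.dot(normalize(doc1), normalize(doc2)))          # ~0.56
print(np.dot(normalize(2 * doc1), normalize(2 * doc2)))  # same ~0.56: doubling no longer inflates it
```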
Prioritizing important words with tf-idf


Issues with word counts – Rare words

Common words in doc (“the”, “player”, “field”, “goal”) dominate rare words like “futbol”, “Messi”.
Document frequency
• What characterizes a rare word?
- Appears infrequently in the corpus
• Emphasize words appearing in few docs
- Equivalently, discount word w based on # of docs containing w in corpus


Important words
• Do we want only rare words to dominate?
• What characterizes an important word?
- Appears frequently in document (common locally)
- Appears rarely in corpus (rare globally)
• Trade-off between local frequency and global rarity


TF-IDF document representation
• Term frequency – inverse document frequency (tf-idf)
• Term frequency
- Same as word counts
• Inverse document frequency
• tf-idf = tf * idf
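A sketch of one common tf-idf weighting. The exact idf formula below (log of #docs over 1 + #docs containing the word) is an assumption for illustration; the handwritten formulas on the original slides are not preserved in the text:

```python
# Sketch: tf-idf weights for a tiny corpus of word-count dictionaries.
# Assumed idf formula: idf(w) = log( #docs / (1 + #docs containing w) )
import math
from collections import Counter

corpus = [
    Counter("carlos calls the sport futbol emily calls the sport soccer".split()),
    Counter("the game ended with a late goal by messi".split()),
    Counter("the markets closed higher on tuesday".split()),
]

num_docs = len(corpus)
doc_freq = Counter()
for doc in corpus:
    doc_freq.update(doc.keys())  # in how many docs does each word appear?

def tf_idf(doc):
    return {w: tf * math.log(num_docs / (1.0 + doc_freq[w])) for w, tf in doc.items()}

print(tf_idf(corpus[0]))
# 'the' (in every doc) is heavily discounted; rare words like 'futbol' keep positive weight
```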
Retrieving similar documents



Nearest neighbor search
•  Query article:

•  Corpus:

•  Specify: Distance metric


•  Output: Set of most similar articles
1 – Nearest neighbor
• Input: Query article
• Output: Most similar article
• Algorithm:
- Search over each article in corpus
• Compute s = similarity(query article, current article)
• If s > Best_s, record the current article as most similar and set Best_s = s
- Return the most similar article
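A minimal sketch of this linear scan (vectors are assumed to be precomputed, e.g. normalized tf-idf vectors, and similarity is taken as the dot product; the names are illustrative):

```python
# Sketch: 1-nearest-neighbor search by linear scan over the corpus.
import numpy as np

def similarity(a, b):
    return np.dot(a, b)  # assumes vectors are already normalized

def nearest_neighbor(query_vec, corpus_vecs):
    best_s, best_idx = -np.inf, None
    for i, doc_vec in enumerate(corpus_vecs):
        s = similarity(query_vec, doc_vec)
        if s > best_s:               # keep the most similar article seen so far
            best_s, best_idx = s, i
    return best_idx, best_s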
k – Nearest neighbor
•  Input: Query article
•  Output: List of k similar articles

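The k-nearest-neighbor variant keeps the k best matches instead of a single one; a sketch under the same assumptions:

```python
# Sketch: k-nearest-neighbor search -- return the k most similar articles.
import numpy as np

def k_nearest_neighbors(query_vec, corpus_vecs, k):
    sims = [np.dot(query_vec, doc_vec) for doc_vec in corpus_vecs]
    # indices of the k largest similarities, most similar first
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
```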


Clustering documents



Structure documents by topic
• Discover groups (clusters) of related articles, e.g., SPORTS, WORLD NEWS


What if some of the labels are known?
• Training set of labeled docs, e.g., SPORTS, WORLD NEWS, ENTERTAINMENT, SCIENCE
Multiclass classification problem
• Assign a new doc to one of several classes: WORLD NEWS, SPORTS, ENTERTAINMENT, SCIENCE, TECHNOLOGY
• Example of supervised learning
Clustering
• No labels provided
• Want to uncover cluster structure
• Input: docs as vectors
• Output: cluster labels
• An unsupervised learning task
What defines a cluster?
• Cluster defined by center & shape/spread
• Assign observation (doc) to cluster (topic label)
- Score under cluster is higher than under others
- Often, just more similar to assigned cluster center than to other cluster centers


k-means
• Assume
- Similarity metric = distance to cluster center (smaller is better)
k-means algorithm
0. Initialize cluster centers
1. Assign observations to closest cluster center
2. Revise cluster centers as mean of assigned observations
3. Repeat 1.+2. until convergence
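A compact sketch of these steps (a basic Lloyd-style k-means on document vectors; initializing centers from randomly chosen documents is an assumption, not prescribed by the slides):

```python
# Sketch: k-means on document vectors (rows of X), following the steps above.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # 0. initialize cluster centers
    for _ in range(n_iters):
        # 1. assign each observation to the closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. revise each center as the mean of its assigned observations
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # 3. repeat until convergence
            break
        centers = new_centers
    return labels, centers
```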




Other examples



Clustering images
•  For search, group as:
- Ocean
- Pink flower
- Dog
- Sunset
- Clouds
- …



Grouping patients by medical condition
• Better characterize subpopulations and diseases


Example: Patients and seizures are diverse
(Figure: recordings across channels over time)


Cluster seizures by observed time courses



Products on Amazon
• Discover product categories from purchase histories (e.g., “furniture”, “baby”)
• Or discover groups of users
Structuring web search results
•  Search terms can have multiple meanings
•  Example: “cardinal”

•  Use clustering to structure output


Discovering similar neighborhoods
• Task 1: Estimate price at a small regional level
• Challenge:
- Only a few (or no!) sales in each region per month
• Solution:
- Cluster regions with similar trends and share information within a cluster

(Figures: estimated global price trend using the seasonality decomposition approach of Cleveland et al. (1990), after adjusting for hedonic effects; map of cluster assignments across tracts in the City of Seattle)
Discovering similar neighborhoods
• Task 2: Forecast violent crimes to better task police
• Again, cluster regions and share information!
• Leads to improved predictions compared to examining each region independently

(Figure: map of Washington, DC)
Summary for clustering and similarity


What you can do now…
• Describe ways to represent a document (e.g., raw word counts, tf-idf, …)
• Measure the similarity between two documents
• Discuss issues related to using raw word counts
- Normalize counts to adjust for document length
- Emphasize important words using tf-idf
• Implement a nearest neighbor search for document retrieval
• Describe the input (unlabeled observations) and output (labels) of a clustering algorithm
• Determine whether a task is supervised or unsupervised
• Cluster documents using k-means (algorithmic details to come…)
• Describe other applications of clustering
