Learning To Rank


DAT630

Learning-to-Rank
Search Engines, Section 7.6

30/10/2016

Krisztian Balog | University of Stavanger


Recap
- Classic retrieval models
- Vector space model, BM25, LM
- Three main components
- Term frequency
- How many times query terms appear in the document
- Document length
- Any term is expected to occur more frequently in a long
document; account for differences in document length
- Document frequency
- How often the term appears in the entire collection
Additional factors
- So far: content-based matching
- Many additional signals, e.g.,
- Document quality
- PageRank
- SPAM?
- …
- Click-based features
- How many times users clicked on a document given a
query
- How many times this particular user clicked on a
document given the query
- …
Machine Learning 101
Machine Learning for IR
- The general idea is to use machine learning to
combine these features to generate the "best"
ranking function (referred to as "Learning to
Rank")
- There can be hundreds of features
- Impossible to tune by hand
- Modern systems (especially on the Web) use a
great number of features
- In 2008, Google was using over 200 features
- The New York Times (2008-06-03)
Machine Learning for IR
- Some example features
- Log frequency of query word in anchor text?
- Query word in color on page?
- # of images on page?
- # of (out) links on page?
- PageRank of page?
- URL length?
- URL contains “~”?
- Page length?
- …
Why not earlier?
- Limited training data
- Especially for other use-cases (i.e., not web search)
- Poor machine learning techniques
- Insufficient customization to IR problem
- Not enough features for ML to show value
Learning-to-Rank (LTR)
- Learn a function automatically to rank items
(documents) effectively
- Training data: (item, query, relevance) triples
- Output: ranking function
Pointwise LTR
- The ranking function is based on features of a
single object: h(q,d) = h(x)
- x is a feature vector
- May be approached as classification (relevant vs.
non-relevant) or regression (relevance score)
- Note: classic retrieval models are also point-
wise: score(q, d)
Classification vs.
Regression
- Classification
- Predict a categorical (unordered) output value
- Regression
- Predict an ordered or continuous output value
Pairwise LTR
- The learning function is based on a pair of
items
- Given two documents, predict partial ranking
- E.g., Ranking SVM
- Classify which of the two documents should be
ranked at a higher position?
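A minimal sketch of how the pairwise setting can be turned into a binary classification problem. Representing each pair by the difference of the two feature vectors is one common choice and is an assumption here, as are the function name `pairwise_examples` and the toy data; the slides do not prescribe a particular transformation.

```python
from itertools import combinations

def pairwise_examples(docs):
    """Build training pairs for one query.

    docs: list of (feature_vector, relevance_label) tuples.
    Returns (difference_vector, label) tuples, where the label is +1 if the
    first document of the pair should be ranked higher and -1 otherwise.
    """
    pairs = []
    for (x_a, y_a), (x_b, y_b) in combinations(docs, 2):
        if y_a == y_b:
            continue  # equally relevant documents express no preference
        diff = [a - b for a, b in zip(x_a, x_b)]
        pairs.append((diff, 1 if y_a > y_b else -1))
    return pairs

# Hypothetical documents for a single query: (features, relevance grade)
docs = [([3.0, 1.0], 2), ([1.0, 0.0], 0), ([2.0, 2.0], 2)]
pairs = pairwise_examples(docs)
```

Any binary classifier can then be trained on these pairs; at ranking time, its preferences are used to order the documents.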
Listwise LTR
- The ranking function is based on a ranked list
of items
- Given two ranked lists of the same items, which is
better?
- Directly optimizes a retrieval metric
- Need a loss function on a list of documents
- Challenge is scale: huge number of potential
lists
- No clear benefits over pairwise ones (so far)
How to?
- Develop a feature set
- The most important step!
- Usually problem dependent
- Choose a good ranking algorithm
- E.g., Random Forests work well for pairwise LTR
- Training, validation, and testing
- Similar to standard machine learning applications
Exercise
- Design features for an email SPAM classifier
- You have access to the user’s mailbox (all emails,
including those that have been classified as SPAM
before)
- For an incoming email, decide whether it is SPAM or
not
Features for document
retrieval
- Query features
- Depend only on the query
- Document features
- Depend only on the document
- Query-document features
- Express the degree of matching between the query
and the document
LETOR
- https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/
Practical considerations
Feature normalization
- Feature values are often normalized to be in
the [0,1] range for a given query
- Esp. matching features that may be on different
scales across queries because of query length
- Min-max normalization:
$$\tilde{x}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$
- $x_1, \ldots, x_n$: original values for a given feature
- $\tilde{x}_i$: normalized value for the ith instance
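The normalization above can be sketched in a few lines of Python. The function name `min_max_normalize` and the example scores are illustrative assumptions; mapping all values to 0 when every candidate has the same value is also an assumption, since the formula is undefined in that case.

```python
def min_max_normalize(values):
    """Scale the values of one feature, for one query, into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: the formula is undefined here, so return all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical retrieval scores of three candidate documents for one query
scores = [12.0, 7.5, 3.0]
normalized = min_max_normalize(scores)
```

The function is applied separately to each feature and each query, so that values are comparable across queries of different length.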
Computation cost
- Implemented as a re-ranking mechanism
- Retrieve top-N candidate documents using a strong
baseline approach (e.g., BM25)
- Create feature vectors and re-rank these top-N
candidates to arrive at the final ranking
- Document features may be computed offline
- Query and query-document features are
computed online
- Avoid using too many expensive features
Class imbalance
- Many more non-relevant than relevant
instances
- Classifiers usually do not handle huge
imbalance well
- Need to address by over or under sampling
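As a rough illustration of undersampling, the sketch below keeps all relevant instances and a random subset of the non-relevant ones. The function name `undersample`, the `ratio` parameter, and the toy data are assumptions made for this example only.

```python
import random

def undersample(instances, ratio=1.0, seed=0):
    """Keep every relevant instance and a random sample of non-relevant ones.

    instances: list of (features, label) tuples; label 1 = relevant, 0 = non-relevant.
    ratio: number of non-relevant instances to keep per relevant instance.
    """
    rng = random.Random(seed)
    relevant = [inst for inst in instances if inst[1] == 1]
    non_relevant = [inst for inst in instances if inst[1] == 0]
    keep = min(len(non_relevant), int(ratio * len(relevant)))
    return relevant + rng.sample(non_relevant, keep)

# Hypothetical training set: 3 relevant and 100 non-relevant instances
data = [([i], 1) for i in range(3)] + [([i], 0) for i in range(100)]
balanced = undersample(data)
```

Oversampling works the other way around: relevant instances are repeated until the two classes are closer in size.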
Deep Learning
- Neural networks are a particular type of
machine learning, inspired by the architecture
of the brain
- Instead of manually engineering features, let
the machine learn the representation
DAT630
Web Search
Search Engines, Sections 3.2, 4.5

18/10/2016

Krisztian Balog | University of Stavanger


So far…
- Representing document content
- Term-doc matrix, document vector, TFIDF weighting
- Retrieval models
- Vector space model, Language models, BM25
- Scoring queries
- Inverted index, term-at-a-time/doc-at-a-time scoring
- Fielded document representations
- Mixture of Language Models, BM25F
- Retrieval evaluation
Web search
- Before the web: search was small scale,
usually focused on libraries
- Web search is a major application that
everyone cares about
- Challenges
- Scalability (users as well as content)
- Ensure high-quality results (fighting SPAM)
- Dynamic nature (constantly changing content)
Some specific techniques
- Crawling
- Focused crawling
- Deep web crawling
- Indexing
- Parallel indexing based on MapReduce
- Retrieval
- SPAM detection
- Link analysis
Web Crawling
Web Crawling
- Finds and downloads web pages automatically
- I.e., provides the collection for searching
- Web is huge and constantly growing
- Web is not under the control of search engine
providers
- Web pages are constantly changing
- Crawlers also used for other types of data
Web Crawler
- Starts with a set of seeds, which are a set of
URLs given to it as parameters
- Seeds are added to a URL request queue
- Crawler starts fetching pages from the request
queue
- Downloaded pages are parsed to find link tags
that might contain other useful URLs to fetch
- New URLs added to the crawler’s request
queue, or frontier
- Continue until no more new URLs or disk full
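The crawling loop described above can be sketched as follows. To keep the example self-contained, pages are not downloaded; `TOY_WEB` is a hypothetical in-memory link graph and `fetch_links` stands in for the download-and-parse step. Politeness, threading, and storage are left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs found in that page's link tags
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}

def crawl(seeds, fetch_links, max_pages=1000):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already queued, to avoid fetching twice
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled

order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```

The `max_pages` limit plays the role of the "disk full" stopping condition.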
Crawling Picture
[Figure: the Web, divided into URLs crawled and parsed, the URL frontier, and the unseen Web; crawling starts from the seed pages]
Web Crawling
- Web crawlers spend a lot of time waiting for
responses to requests
- To reduce this inefficiency, web crawlers use
threads and fetch hundreds of pages at once
- Crawlers could potentially flood sites with
requests for pages
- To avoid this problem, web crawlers use
politeness policies
- e.g., delay between requests to same web server
Web Crawling
- Freshness
- Not possible to constantly check all pages
- Must check important pages (i.e., visited by many
users) and pages that change frequently
- Focused crawling
- Attempts to download only those pages that are
about a particular topic
- Deep Web
- Sites that are difficult for a crawler to find are
collectively referred to as the deep (or hidden) Web
Deep Web Crawling
- Much larger than conventional Web
- Three broad categories:
- Private sites
- no incoming links, or may require log in with a valid
account
- Form results
- Sites that can be reached only after entering some data
into a form
- Scripted pages
- Pages that use JavaScript, Flash, or another client-side
language to generate links
Surfacing the Deep Web
- Pre-compute all interesting form submissions
for each HTML form
- Each form submission corresponds to a
distinct URL
- Add URLs for each form submission into
search engine index
Link Analysis
Link Analysis
- Links are a key component of the Web
- Important for navigation, but also for search
<a href="http://example.com">Example website</a>
(href value: destination link; element content: anchor text)
- Both anchor text and links are used by search
engines
Anchor text
- Aggregated from all incoming links and added
as a separate document field
- Tends to be short, descriptive, and similar to
query text
- Can be thought of as a description of the page "written
by others"
- Has a significant impact on effectiveness for
some types of queries
Example "winter school"
Three pages link to pageX, each with a different anchor text

page1 (anchor text: "winter school")
I’ll be presenting our work at a
<a href="pageX">winter school</a>
in Bressanone, Italy.

page2 (anchor text: "information retrieval")
List of winter schools in 2013:
<ul>
<li><a href="pageX">information
retrieval</a></li>
</ul>

page3 (anchor text: "IR lectures")
The PROMISE Winter School in will
feature a range of <a
href="pageX">IR lectures</a> by
experts from the field.
Fielded Document
Representation
title: Winter School 2013
meta: PROMISE, school, PhD, IR, DB, [...]
  PROMISE Winter School 2013, [...]
headings: PROMISE Winter School 2013
  Bridging between Information Retrieval and Databases
  Bressanone, Italy 4 - 8 February 2013
body: The aim of the PROMISE Winter School 2013 on "Bridging between
  Information Retrieval and Databases" is to give participants a
  grounding in the core topics that constitute the multidisciplinary
  area of information access and retrieval to unstructured,
  semistructured, and structured information. The school is a week-
  long event consisting of guest lectures from invited speakers who
  are recognized experts in the field. [...]
anchors: winter school
  information retrieval
  IR lectures

Anchor text is added as a separate document field

Document Importance on
the Web
- What are web pages that are popular and
useful to many people?
- Use the links between web pages as a way to
measure popularity
- The most obvious measure is to count the
number of inlinks
- Quite effective, but very susceptible to SPAM
PageRank
- Algorithm to rank web pages by popularity
- Proposed by Google founders Sergey Brin and
Larry Page in 1998
- Thesis: A web page is important if it is
pointed to by other important web pages
PageRank
- PageRank is a numeric value that represents
the importance of a page present on the web
- When one page links to another page, it is
effectively casting a vote for the other page
- More votes implies more importance
- Importance of each vote is taken into account
when a page's PageRank is calculated
Random Surfer Model
- PageRank simulates a user navigating on the
Web randomly as follows:
- The user is currently at page a
- She moves to one of the pages linked from a with
probability 1-q
- She jumps to a random webpage with probability q
- Repeat the process for the page she moved to

- The random jump ensures that the user doesn’t "get stuck" on
any given page (e.g., on a page with no outlinks)
PageRank Formula
$$PR(a) = \frac{q}{T} + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{L(p_i)}$$
- PR(a): PageRank of page a
- q: probability of jumping to a random page (q is typically set to 0.15)
- 1 - q: probability of following one of the hyperlinks in the current page
- T: total number of pages in the Web graph
- p_1, ..., p_n: the pages that point to page a
- PR(p_i): PageRank value of page p_i
- L(p_i): number of outgoing links of page p_i
Technical Issues
- This is a recursive formula. PageRank values
need to be computed iteratively
- We don’t know the PageRank values at start. We
can assume equal values (1/T)
- Number of iterations?
- Good approximation already after a small number of
iterations; stop when change in absolute values is
below a given threshold
Example
q=0
(no random jumps)
Example
Iteration 0: assume that the PageRank
values are the same for all pages
q=0 (no random jumps)
[Figure: link graph in which every page has PageRank 0.33]
Example
Iteration 1
q=0 (no random jumps)
PageRank of C depends on the
PageRank values of A and B
$$PR(C) = \frac{PR(A)}{2} + \frac{PR(B)}{1} = \frac{0.33}{2} + \frac{0.33}{1} = 0.5$$
Example
at the end of Iteration 1 q=0
(no random jumps)

0.33 0.17

0.5
Example
Iteration 2
q=0 (no random jumps)
PageRank of C depends on the
PageRank values of A and B
$$PR(C) = \frac{PR(A)}{2} + \frac{PR(B)}{1} = \frac{0.33}{2} + \frac{0.17}{1} = 0.33$$
Example
at the end of Iteration 2 q=0
(no random jumps)

0.5 0.17

0.33
Example
at the end of Iteration 3 q=0
(no random jumps)

0.33 0.25

0.42
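The iterations of this example can be reproduced with a short script. The link structure used below (A links to B and C, B links to C, C links to A) does not appear in the extracted text; it is an assumption inferred from the update rule PR(C) = PR(A)/2 + PR(B)/1 and from the values shown after each iteration. The function names are likewise illustrative. Pages without outlinks (rank sinks) are not handled in this sketch.

```python
def pagerank_step(pr, outlinks, q):
    """One synchronous update of PR(a) = q/T + (1 - q) * sum(PR(p_i) / L(p_i))."""
    total_pages = len(pr)
    new = {page: q / total_pages for page in pr}
    for page, targets in outlinks.items():
        for target in targets:
            new[target] += (1 - q) * pr[page] / len(targets)
    return new

def pagerank(outlinks, q=0.15, iterations=3):
    """Start from equal values (1/T) and apply a fixed number of updates."""
    pr = {page: 1 / len(outlinks) for page in outlinks}
    for _ in range(iterations):
        pr = pagerank_step(pr, outlinks, q)
    return pr

# Assumed link graph: page -> pages it links to
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

after_3 = pagerank(GRAPH, q=0.0, iterations=3)
with_jumps = pagerank(GRAPH, q=0.2, iterations=1)
```

In practice the loop would stop when the change between two iterations falls below a threshold, rather than after a fixed number of iterations.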
Example #2
Iteration 0: assume that the PageRank
values are the same for all pages
q=0.2 (with random jumps)
[Figure: link graph in which every page has PageRank 0.33]
Example #2
Iteration 1
q=0.2 (with random jumps)
$$PR(C) = \frac{0.2}{3} + 0.8 \left( \frac{PR(A)}{2} + \frac{PR(B)}{1} \right) = \frac{0.2}{3} + 0.8 \left( \frac{0.33}{2} + \frac{0.33}{1} \right) = 0.47$$
Exercise #1
Dealing with "rank sinks"
- Handling "dead ends" (or rank sinks), i.e.,
pages that have no outlinks
- Assume that it links to all other pages in the
collection (including itself) when computing
PageRank scores

Rank sink
Exercise #2
Online PageRank Checkers
PageRank Summary
- Important example of query-independent
document ranking
- Web pages with high PageRank are preferred
- It is, however, not as important as the
conventional wisdom holds
- Just one of the many features a modern web search
engine uses
- But it tends to have the most impact on popular
queries
Incorporating Document
Importance (e.g. PageRank)
$$score'(d, q) = score(d) \cdot score(d, q)$$
- score(d): query-independent score ("static" document score)
- score(d, q): query-dependent score ("dynamic" document score)
$$P(d|q) = \frac{P(q|d) P(d)}{P(q)} \propto P(q|d) P(d)$$
- P(d): document prior
Stephen Robertson, SIGIR’17 keynote
Search Engine Optimization
Search Engine
Optimization (SEO)
- A process aimed at making the site appear
high on the list of (organic) results returned by
a search engine
- Considers how search engines work
- Major search engines provide information and
guidelines to help with site optimization
- Google/Bing Webmaster Tools
- Common protocols
- Sitemaps (https://www.sitemaps.org)
- robots.txt
White hat vs. black hat SEO
- White hat
- Conforms to the search engines' guidelines and
involves no deception
- "Creating content for users, not for search engines"
- Black hat
- Disapproved of by search engines, often involves
deception
- Hidden text
- Cloaking: returning a different page, depending on
whether it is requested by a human visitor or a robot
SEO Techniques
- Editing website content and HTML source
- Increase relevance to specific keywords
- Increasing the number of incoming links
("backlinks")
- Focus on long tail queries
- Social media presence
source: http://searchengineland.com/figz/wp-content/seloads/2017/06/2017-SEO_Periodic_Table_1920x1080.png
DAT630
Clustering
Introduction to Data Mining, Chapter 8

16/10/2017

Darío Garigliotti | University of Stavanger


Supervised vs.
Unsupervised Learning
- Supervised learning
- Labeled examples (with target information) are
available
- Unsupervised learning
- Examples are not labeled
Outline
- Clustering
- Two algorithms:
- K-means
- Hierarchical (Agglomerative) Clustering
Clustering
Clustering
- Finding groups of objects such that the objects
in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized
Why?
- For understanding
- E.g., biology (taxonomy of species)
- Business (segmenting customers for additional
analysis and marketing activities)
- Web (clustering search results into subcategories)
- For utility
- Some clustering techniques characterize each
cluster in terms of a cluster prototype
- These prototypes can be used as a basis for a
number of data analysis and processing techniques
How many clusters?

[Figure: the same original data grouped into two clusters, four clusters, and six clusters]
- The notion of a cluster can be ambiguous


Types of Clustering
- Partitional vs. hierarchical
- Partitional: non-overlapping clusters such that each
data object is in exactly one cluster
- Hierarchical: a set of nested clusters organized as a
hierarchical tree
- Exclusive vs. non-exclusive
- Whether points may belong to a single or multiple
clusters
Types of Clustering (2)
- Partial versus complete
- In some cases, we only want to cluster some of the
data
- Fuzzy vs. non-fuzzy
- In fuzzy clustering, a point belongs to every cluster
with some weight between 0 and 1
- Weights must sum to 1
- Probabilistic clustering has similar characteristics
Different Types of Clusters
- Well-Separated Clusters
- A cluster is a set of points such that any point in a
cluster is closer (or more similar) to every other point
in the cluster than to any point not in the cluster
Different Types of Clusters
- Center-based (or prototype-based)
- A cluster is a set of objects such that an object in a
cluster is closer (more similar) to the “center” of a
cluster, than to the center of any other cluster
- The center of a cluster is often a centroid, the
average of all the points in the cluster, or a medoid,
the most “representative” point of a cluster
Different Types of Clusters
- Shared Property or Conceptual Clusters
- Clusters that share some common property or
represent a particular concept
Notation
- x an object (data point)
- m the number of points in the data set
- K the number of clusters
- Ci the ith cluster
- ci the centroid of cluster Ci
- mi the number of points in cluster Ci
K-means Clustering
K-means
- One of the oldest and most widely used
clustering techniques
- Prototype-based clustering
- Clusters are represented by their centroids
- Finds a user-specified number of clusters (K)
Basic K-means Algorithm
1. Select K points as initial centroids
2. repeat
3. Form K clusters by assigning each point to
its closest centroid
4. Recompute the centroid of each cluster
5. until centroids do not change
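A compact sketch of the basic algorithm, with the step numbers from the pseudocode marked in comments. Squared Euclidean distance, the fixed random seed, the iteration cap, and keeping the old centroid when a cluster becomes empty are assumptions made for this example; the names `kmeans`, `dist2`, and `mean` are illustrative.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                 # 1. select K initial centroids
    clusters = []
    for _ in range(max_iter):                         # 2. repeat
        clusters = [[] for _ in range(k)]
        for p in points:                              # 3. assign to closest centroid
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[j].append(p)
        new_centroids = [                             # 4. recompute centroids
            mean(c) if c else centroids[i]            #    (assumption: keep old one if empty)
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                # 5. until centroids do not change
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-dimensional data with two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
```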
1. Choosing Initial
Centroids
- Most commonly: select points (centroids)
randomly
- They may be poor
- Possible solution: perform multiple runs, each with a
different set of randomly chosen centroids
3. Assigning Points to the
Closest Centroid
- We need a proximity measure that quantifies
the notion of "closest"
- Usually chosen to be simple
- Has to be calculated repeatedly
- See distance functions from Lecture 1
- E.g., Euclidean distance
4. Recomputing Centroids
- Objective function is selected
- I.e., what it is that we want to minimize/maximize
- Once the objective function and the proximity
measure are defined, we can define
mathematically the centroid we should choose
- E.g., minimize the squared distance of each point to
its closest centroid
Sum of Squared Error (SSE)
- Measures the quality of clustering in the
Euclidean space
- Calculate the error of each data point (its
Euclidean distance to the closest centroid),
and then compute the total sum of the squared
errors
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(c_i, x)^2$$
- A clustering with lower SSE is better
Minimizing SSE
- It can be shown that the centroid that
minimizes the SSE of the cluster is the mean
- The centroid of the ith cluster

$$c_i = \frac{1}{m_i} \sum_{x \in C_i} x$$
Example
Centroid computation
- What is the centroid of a cluster containing
three 2-dimensional points: (1,1), (2,3), (6,2)?
- Centroid:
((1+2+6)/3, (1+3+2)/3) = (3,2)
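The same computation can be checked in code, together with the SSE of this single cluster. The helper names `centroid` and `sse` are illustrative assumptions; Euclidean distance is used, as in the SSE definition.

```python
def centroid(points):
    """Mean of the points in a cluster, computed per dimension."""
    m = len(points)
    return tuple(sum(coords) / m for coords in zip(*points))

def sse(clusters):
    """Sum of squared Euclidean distances of each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(sum((a - b) ** 2 for a, b in zip(c, x)) for x in cluster)
    return total

cluster = [(1, 1), (2, 3), (6, 2)]
c = centroid(cluster)      # (3.0, 2.0)
error = sse([cluster])     # 5 + 2 + 9 = 16.0
```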
5. Stopping Condition
- Most of the convergence occurs in the early
steps
- "Centroids do not change" is often replaced
with a weaker condition
- E.g., repeat until only 1% of the points change
Exercise
Note
- There are other choices for proximity,
centroids, and objective functions, e.g.,
- Proximity function: Manhattan (L1)
Centroid: median
Objective function: minimize sum of L1 distance of an
object to its cluster centroid
- Proximity function: cosine
Centroid: mean
Objective function: maximize sum of cosine sim. of an
object to its cluster centroid
What is the complexity?
- m number of points, n number of attributes,
K number of clusters
- Space requirements: O(?)
- Time requirements: O(?)
Complexity
- m number of points, n number of attributes,
K number of clusters
- Space requirements: O((m+K)*n)
- Modest, as only the data points and centroids are
stored
- Time requirements: O(I*K*m*n)
- I is the number of iterations required for convergence
- Modest, linear in the number of data points
K-means Issues
- Depending on the initial (random) selection of
centroids, different clusterings can be produced
- Steps 3 and 4 are only guaranteed to find a
local optimum
- Empty clusters may be obtained
- Remedy: replace the centroid with (i) the point farthest
from any other centroid, or (ii) a point chosen from the
cluster with the highest SSE
K-means Issues (2)
- Outliers must sometimes be kept
- E.g., in data compression all points must be
clustered
- In general, outliers may be eliminated to improve
the clustering
- Before clustering: using outlier detection techniques
- After clustering: eliminating (i) points whose SSE is
high, or (ii) small clusters, which are likely outliers
K-means Issues (3)
- Postprocessing to reduce SSE
- Ideally without introducing more clusters
- How? Alternate splitting and merging steps
- Decrease SSE by using more clusters
- Splitting a cluster (e.g., the one with the highest SSE)
- Introducing a new centroid
- Use fewer clusters while trying not to increase SSE
- Dispersing a cluster (e.g., the one with the lowest SSE)
- Merging two clusters (e.g., those with the closest centroids)
Bisecting K-means
- Straightforward extension of the basic K-
means algorithm
- Idea:
- Split the set of data points into two clusters
- Select one of these clusters to split
- Repeat until K clusters have been produced
- The centroids of the resulting clusters are often
used as the initial centroids for the basic K-means
algorithm
Bisecting K-means Alg.
1. Initial cluster contains all data points
2. repeat
3. Select a cluster to split
4. for a number of trials
5. Bisect the selected cluster using basic K-
means
6. end for
7. Select the clusters from the bisection with the
lowest total SSE
8. until we have K clusters
Selecting a Cluster to Split
- There are a number of possible ways
- Largest cluster
- Cluster with the largest SSE
- Combine size and SSE
- Different choices result in different clusters
Hierarchical Clustering
- By recording the sequence of clusterings
produced, bisecting K-means can also
produce a hierarchical clustering
Limitations
- K-means has difficulties detecting clusters
when they have
- differing sizes
- differing densities
- non-spherical shapes
- K-means has problems when the data contains
outliers
Example: differing sizes

Original points K-means (3 clusters)


Example: differing density

Original points K-means (3 clusters)


Example: non-spherical
shapes

Original points K-means (2 clusters)


Overcoming Limitations
- Use larger K values
- Natural clusters will be broken into a number of
sub-clusters
Example: differing sizes

Original points K-means clusters


Example: differing density

Original points K-means clusters


Example: non-spherical
shapes

Original points K-means clusters


Summary
- Efficient and simple
- Provided that K is relatively small (K<<m)
- Bisecting variant is even more efficient and
less susceptible to initialization problems
- Cannot handle certain types of clusters
- Problems can be overcome by generating more
(sub)clusters
- Has trouble with data that contains outliers
- Outlier detection and removal can help
Hierarchical Clustering
Hierarchical Clustering
- Two general approaches
- Agglomerative
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters
- Requires a notion of cluster proximity
- Divisive
- Start with a single, all-inclusive cluster
- At each step, split a cluster, until only singleton
clusters of individual points remain
Agglomerative Hierarchical
Clustering
- Produces a set of nested clusters organized as
a hierarchical tree
- Can be visualized
- Dendrogram
- Nested cluster diagram (only for 2D points)
[Figure: dendrogram (leaf order 1, 3, 2, 5, 4, 6; merge heights between 0 and 0.2) and the corresponding nested cluster diagram for six points]
Strengths
- Do not have to assume any
particular number of clusters
- Any desired number of clusters
can be obtained by cutting the
dendrogram at the proper level (K=4 in the figure)
- They may correspond to
meaningful taxonomies
- E.g., in biological sciences
Basic Agglomerative
Hierarchical Clustering Alg.
1. Compute the proximity matrix
2. repeat
3. Merge the closest two clusters
4. Update the proximity matrix
5. until only one cluster remains
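A naive sketch of the algorithm above, assuming Euclidean distance between points. For brevity it recomputes cluster distances in every iteration instead of maintaining and updating a proximity matrix, so it is slower than the algorithm as stated. The function names and the `linkage` options ("single", "complete", "average") are illustrative assumptions.

```python
from itertools import combinations
from math import dist

def cluster_distance(a, b, linkage):
    """Distance between two clusters under the chosen linkage."""
    pair_dists = [dist(x, y) for x in a for y in b]
    if linkage == "single":      # MIN: closest pair of points
        return min(pair_dists)
    if linkage == "complete":    # MAX: most distant pair of points
        return max(pair_dists)
    return sum(pair_dists) / len(pair_dists)  # group average

def agglomerative(points, linkage="single"):
    """Return the sequence of merges, from singleton clusters to one cluster."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical 2-dimensional points
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerative(points, linkage="single")
```

The recorded merge sequence is the information a dendrogram displays.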
Example
Starting situation
- Start with clusters of individual points and a
proximity matrix
[Figure: the points p1, p2, ..., p12 as individual clusters, and the proximity matrix with one row and one column per point (p1, p2, p3, p4, p5, ...)]
Example
Intermediate situation
- After some merging steps, we have some
clusters
[Figure: clusters C1, C2, C3, C4, C5 over the points p1, ..., p12, and the proximity matrix with one row and one column per cluster]
Example
Intermediate situation
- We want to merge the two closest clusters (C2
and C5) and update the proximity matrix
[Figure: clusters C1, C2, C3, C4, C5 over the points p1, ..., p12, and the proximity matrix with one row and one column per cluster]
Example
After merging
- How do we update the proximity matrix?
[Figure: clusters C1, C2 U C5, C3, C4 over the points p1, ..., p12, and the proximity matrix in which the row and column of C2 U C5 are marked "?"]
Defining the Proximity
between Clusters
- MIN (single link)
- MAX (complete link)
- Group average
- Distance between
centroids
Single link (MIN)
- Proximity of two clusters is based on the two
most similar (closest) points in the different
clusters
- Determined by one pair of points, i.e., by one link in
the proximity graph
Complete link (MAX)
- Proximity of two clusters is based on the two
least similar (most distant) points in the
different clusters
- Determined by all pairs of points in the two clusters
Group average
- Proximity of two clusters is the average of
pairwise proximity between points in the two
clusters
$$proximity(C_i, C_j) = \frac{\sum_{x \in C_i, y \in C_j} proximity(x, y)}{m_i \cdot m_j}$$
- Need to use average connectivity for scalability since
total proximity favors large clusters
Strengths and Weaknesses
- Single link (MIN)
- Strength: can handle non-elliptical shapes
- Weakness: sensitive to noise and outliers
- Complete link (MAX)
- Strength: less susceptible to noise and outliers
- Weakness: tends to break large clusters
- Group average
- Strength: less susceptible to noise and outliers
- Weakness: biased towards globular clusters
Prototype-based methods
- Represent clusters by their centroids
- Calculate the proximity based
on the distance between the
centroids of the clusters
- Ward’s method
- Similarity of two clusters is based on the increase in
SSE when two clusters are merged
- Very similar to group average if distance between points
is distance squared
Exercise
Key Characteristics
- No global objective function that is directly
optimized
- No problems with choosing initial points or
running into local minima
- Merging decisions are final
- Once a decision is made to combine two clusters, it
cannot be undone
What is the complexity?
- m is the number of points
- Space complexity O(?)
- Time complexity O(?)
Complexity
- Space complexity O(m^2)
- Proximity matrix requires the storage of m^2/2
proximities (it’s symmetric)
- Space to keep track of clusters is proportional to the
number of clusters (m-1, excluding singleton clusters)
- Time complexity O(m^3)
- Computing the proximity matrix O(m^2)
- m-1 iterations (Steps 3 and 4)
- It’s possible to reduce the total cost to O(m^2 log m)
by keeping data in a sorted list (or heap)
Summary
- Typically used when the underlying application
requires a hierarchy
- Generally good clustering performance
- Expensive in terms of computation and storage
DAT630
Classification
Alternative Techniques
Introduction to Data Mining, Chapter 5

09/10/2017

Darío Garigliotti | University of Stavanger


Recall

Attribute set (x) → Classification Model → Class label (y)
Outline
- Alternative classification techniques
- Rule-based
- Nearest neighbors
- Naive Bayes
- Ensemble methods
- Class imbalance problem
- Multiclass problem
Rule-based classifier
Rule-based Classifier
- Classifying records using a set of "if… then…"
rules
- Example
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

- R is known as the rule set


Classification Rules
- Each classification rule can be expressed in
the following way
$$r_i : (Condition_i) \rightarrow y_i$$
- (Condition_i): rule antecedent (or precondition)
- y_i: rule consequent
Classification Rules
- A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

Which rules cover the "hawk" and the "grizzly bear"?


Classification Rules
- A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal
Rule Coverage and
Accuracy
- Coverage of a rule
- Fraction of records that satisfy the antecedent of a
rule
- Accuracy of a rule
- Fraction of the records covered by the rule (i.e.,
satisfying the antecedent) that also satisfy the
consequent

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No
Coverage = 40%, Accuracy = 50%
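The two figures can be verified with a small script over the ten records of the table. The tuple layout and the function name `coverage_and_accuracy` are assumptions made for this sketch; accuracy is computed over the covered records, which is what yields the 50% stated on the slide.

```python
# (refund, marital_status, taxable_income_in_K, class_label), Tid 1 to 10
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

def coverage_and_accuracy(records, antecedent, consequent):
    """antecedent: predicate over a record; consequent: predicted class label."""
    covered = [r for r in records if antecedent(r)]
    coverage = len(covered) / len(records)
    correct = [r for r in covered if r[-1] == consequent]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Rule: (Status=Single) -> No
coverage, accuracy = coverage_and_accuracy(RECORDS, lambda r: r[1] == "Single", "No")
```

Four of the ten records are Single (coverage 4/10), and two of those four have class No (accuracy 2/4).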
How does it work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Properties of the Rule Set
- Mutually exclusive rules
- Classifier contains mutually exclusive rules if the
rules are independent of each other
- Every record is covered by at most one rule
- Exhaustive rules
- Classifier has exhaustive coverage if it accounts for
every possible combination of attribute values
- Each record is covered by at least one rule
- These two properties ensure that every record
is covered by exactly one rule
When these Properties are
not Satisfied
- Rules are not mutually exclusive
- A record may trigger more than one rule
- Solution?
- Ordered rule set
- Unordered rule set – use voting schemes

- Rules are not exhaustive


- A record may not trigger any rules
- Solution?
- Use a default class (assign the majority class from the
training records)
Ordered Rule Set
- Rules are rank ordered according to their priority
- An ordered rule set is known as a decision list
- When a test record is presented to the classifier
- It is assigned to the class label of the highest ranked
rule it has triggered
- If none of the rules fired, it is assigned to the default
class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
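A decision list can be sketched as a loop that returns the class of the first rule that fires. The dictionary keys, the function name `classify`, and the default label "Unknown" below are assumptions made for illustration; the slides do not name a default class for this example.

```python
# Ordered rule set R1..R5 from the slide: (name, conditions, class label)
RULES = [
    ("R1", {"give_birth": "no", "can_fly": "yes"}, "Birds"),
    ("R2", {"give_birth": "no", "live_in_water": "yes"}, "Fishes"),
    ("R3", {"give_birth": "yes", "blood_type": "warm"}, "Mammals"),
    ("R4", {"give_birth": "no", "can_fly": "no"}, "Reptiles"),
    ("R5", {"live_in_water": "sometimes"}, "Amphibians"),
]

def classify(record, rules=RULES, default_class="Unknown"):
    """Return (label, rule name) of the highest-ranked rule the record triggers."""
    for name, conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return label, name
    # No rule fired: fall back to the (hypothetical) default class
    return default_class, None

turtle = {"blood_type": "cold", "give_birth": "no",
          "can_fly": "no", "live_in_water": "sometimes"}
shark = {"blood_type": "cold", "give_birth": "yes",
         "can_fly": "no", "live_in_water": "yes"}

print(classify(turtle))  # turtle triggers R4 and R5; R4 is ranked higher
print(classify(shark))   # dogfish shark triggers no rule
```

With this ordering the turtle is labeled by R4, even though R5 also fires, and the dogfish shark falls through to the default class.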
Rule Ordering Schemes
- Rule-based ordering
- Individual rules are ranked based on some quality
measure (e.g., accuracy, coverage)
- Class-based ordering
- Rules that belong to the same class appear together
- Rules are sorted on the basis of their class
information (e.g., total description length)
- The relative order of rules within a class does not
matter
Rule Ordering Schemes

Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
How to Build a Rule-based
Classifier?
- Direct Method
- Extract rules directly from data

- Indirect Method
- Extract rules from other classification models (e.g.
decision trees, neural networks, etc)
From Decision Trees To Rules
[Figure: decision tree. Refund: Yes → NO, No → Marital Status. Marital Status: {Married} → NO, {Single, Divorced} → Taxable Income. Taxable Income: < 80K → NO, > 80K → YES.]

Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive


Rule set contains as much information as the tree
Rules Can Be Simplified
[Figure: decision tree. Refund: Yes → NO, No → Marital Status. Marital Status: {Married} → NO, {Single, Divorced} → Taxable Income. Taxable Income: < 80K → NO, > 80K → YES.]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Summary
- Expressiveness is almost equivalent to that of
a decision tree
- Generally used to produce descriptive models
that are easy to interpret, while giving performance
comparable to decision tree classifiers
- The class-based ordering approach is well
suited for handling data sets with imbalanced
class distributions
Exercise
Nearest Neighbors
So far
- Eager learners
- Decision trees, rule-based classifiers
- Learn a model as soon as the training data becomes
available
[Figure: a learning algorithm induces a model from the training set (Tid 1-10, attributes Attrib1, Attrib2, Attrib3, known Class); the model is then applied by deduction to the test set (Tid 11-15, Class unknown).]
Opposite strategy
- Lazy learners
- Delay the process of modeling the data until it is
needed to classify the test examples

[Figure: training set (Tid 1-10, attributes Attrib1, Attrib2, Attrib3, known Class) and test set (Tid 11-15, Class unknown); modeling is deferred until the test records have to be classified.]
Instance-Based Classifiers
- Store the training records
- Use training records to predict the class label of unseen cases
[Figure: a set of stored cases (attributes Atr1 ... AtrN with class labels A, B, C) is compared with an unseen case (Atr1 ... AtrN).]
Instance Based Classifiers
- Rote-learner
- Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly
- Nearest neighbors
- Uses k “closest” points (nearest neighbors) for
performing classification
Nearest neighbors
- Basic idea
- "If it walks like a duck, quacks like a duck, then it’s
probably a duck"

[Figure: compute the distance from the test record to the training records, then choose the k “nearest” records.]
Nearest-Neighbor
Classifiers
- Requires three things
- The set of stored records
- Distance Metric to compute distance between
records
- The value of k, the number of nearest neighbors to
retrieve
Nearest-Neighbor
Classifiers
- To classify an unknown record
- Compute distance to other training records
- Identify k-nearest neighbors
- Use class labels of nearest neighbors to determine the class
label of unknown record (e.g., by taking majority vote)
[Figure: an unknown record surrounded by labeled training records.]
Definition of Nearest
Neighbor
[Figure: three panels around a record X: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.]
K-nearest neighbors of a record x are data
points that have the k smallest distance to x
Choices to make
- Compute distance between two points
- E.g., Euclidean distance
- See Chapter 2
- Determine the class from nearest neighbor list
- Take the majority vote of class labels among the k-
nearest neighbors
- Weigh the vote according to distance
- Choose the value of k
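The three choices above (distance, vote, k) fit into a few lines. This is a minimal sketch assuming Euclidean distance and an unweighted majority vote; the toy training points and the function name `knn_classify` are made up for illustration.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs.
    """
    # Sort training records by Euclidean distance to the query, keep the k closest
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Unweighted majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-class toy data
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify(train, (2.0, 2.0), k=3))
print(knn_classify(train, (6.5, 6.0), k=3))
```

Every prediction scans the full training set, which is why classification is the expensive step for a lazy learner. A distance-weighted vote would replace the `Counter` with a sum of weights such as 1/d².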
Choosing the value of k
- If k is too small, sensitive to noise points
- If k is too large, neighborhood may include
points from other classes

Summary
- Part of a more general technique called
instance-based learning
- Use specific training instances to make predictions
without having to maintain an abstraction (model)
derived from data
- Because there is no model building, classifying
a test example can be quite expensive
- Nearest-neighbors make their predictions
based on local information
- Susceptible to noise
Bayes Classifier
Bayes Classifier
- In many applications the relationship between
the attribute set and the class variable is
non-deterministic
- The label of the test record cannot be predicted with
certainty even if it was seen previously during training
- A probabilistic framework for solving
classification problems
- Treat X and Y as random variables and capture their
relationship probabilistically using P(Y|X)
Example
- Football game between teams A and B
- Team A won 65% and Team B won 35% of the time
- Among the games Team A won, 30% were hosted by B
- Among the games Team B won, 75% were played at
B's home
- Which team is more likely to win if the game is
hosted by Team B?
Probability Basics
- Conditional probability
$$P(X, Y) = P(X \mid Y)\,P(Y) = P(Y \mid X)\,P(X)$$

- Bayes’ theorem
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
Example
- Probability Team A wins: P(win=A) = 0.65
- Probability Team B wins: P(win=B) = 0.35
- Probability that B hosted, given that Team A won:
P(hosted=B|win=A) = 0.3
- Probability that B hosted, given that Team B won:
P(hosted=B|win=B) = 0.75
- Who wins the next game that is hosted by B?
P(win=B|hosted=B) = ?
P(win=A|hosted=B) = ?
Solution
- Using:
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

- P(win=B|hosted=B) = 0.5738
- P(win=A|hosted=B) = 0.4262

- See book page 229
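The two posteriors follow from Bayes' theorem once the evidence P(hosted=B) is expanded over both outcomes. A minimal sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Priors: P(win=A), P(win=B)
prior = {"A": 0.65, "B": 0.35}
# Class-conditional probabilities: P(hosted=B | win=team)
likelihood_hosted_by_b = {"A": 0.30, "B": 0.75}

# Evidence via the law of total probability:
# P(hosted=B) = sum over teams of P(hosted=B | win=team) * P(win=team)
evidence = sum(likelihood_hosted_by_b[t] * prior[t] for t in prior)

# Bayes' theorem: P(win=team | hosted=B)
posterior = {t: likelihood_hosted_by_b[t] * prior[t] / evidence for t in prior}

for team, p in posterior.items():
    print(f"P(win={team}|hosted=B) = {p:.4f}")
```

Here P(hosted=B) = 0.3 × 0.65 + 0.75 × 0.35 = 0.4575, so Team B is the more likely winner when it hosts.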


Bayes’ Theorem for
Classification
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
- Posterior probability: P(Y|X)
- Class-conditional probability: P(X|Y)
- Two methods: Naive Bayes, Bayesian belief network
- Prior probability: P(Y)
- Can be computed from training data (fraction of records that
belong to each class)
- The evidence: P(X)
- Constant (same for all classes), can be ignored
Naive Bayes
Estimation
- Mind that X is a vector
$$X = \{X_1, \dots, X_n\}$$
- Class-conditional probability
$$P(X \mid Y) = P(X_1, \dots, X_n \mid Y)$$
- "Naive" assumption: attributes are independent
$$P(X \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)$$
Naive Bayes Classifier
- Probability that X belongs to class Y
$$P(Y \mid X) \propto P(Y) \prod_{i=1}^{n} P(X_i \mid Y)$$
- Target label for record X
$$y = \arg\max_{y_j} P(Y = y_j) \prod_{i=1}^{n} P(X_i \mid Y = y_j)$$
Estimating class-
conditional probabilities
- Categorical attributes
- The fraction of training instances in class Y that have
a particular attribute value xi
$$P(X_i = x_i \mid Y = y) = \frac{n_c}{n}$$
- n_c: number of training instances where Xi=xi and Y=y
- n: number of training instances where Y=y
- Continuous attributes
- Discretizing the range into bins
- Assuming a certain probability distribution
Conditional probabilities
for categorical attributes
- The fraction of training instances in class Y that
have a particular attribute value Xi
- P(Status=Married|No)=?
- P(Refund=Yes|Yes)=?

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Evade (class)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Conditional probabilities
for continuous attributes
- Discretize the range into bins, or
- Assume a certain form of probability distribution
- Gaussian (normal) distribution is often used
$$P(X_i = x_i \mid Y = y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
- The parameters of the distribution are estimated from
the training data (from instances that belong to class yj)
- sample mean $\mu_{ij}$ and variance $\sigma_{ij}^2$
Example

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

X={Refund=No, Marital st.=Married, Income=120K}

           P(C)   P(Refund=x|Y)   P(Marital=x|Y)               Ann. income
                  No     Yes      Single  Divorced  Married    mean   var
class=No   7/10   4/7    3/7      2/7     1/7       4/7        110    2975
class=Yes  3/10   3/3    0/3      2/3     1/3       0/3        90     25
Example
classifying a new instance
X={Refund=No, Marital st.=Married, Income=120K}

           P(C)   P(Refund=x|Y)   P(Marital=x|Y)               Ann. income
                  No     Yes      Single  Divorced  Married    mean   var
class=No   7/10   4/7    3/7      2/7     1/7       4/7        110    2975
class=Yes  3/10   3/3    0/3      2/3     1/3       0/3        90     25

P(Class=No|X) ∝ P(Class=No)                   = 7/10
              × P(Refund=No|Class=No)         = 4/7
              × P(Marital=Married|Class=No)   = 4/7
              × P(Income=120K|Class=No)       = 0.0072
Example
classifying a new instance
X={Refund=No, Marital st.=Married, Income=120K}

           P(C)   P(Refund=x|Y)   P(Marital=x|Y)               Ann. income
                  No     Yes      Single  Divorced  Married    mean   var
class=No   7/10   4/7    3/7      2/7     1/7       4/7        110    2975
class=Yes  3/10   3/3    0/3      2/3     1/3       0/3        90     25

P(Class=Yes|X) ∝ P(Class=Yes)                   = 3/10
               × P(Refund=No|Class=Yes)         = 3/3
               × P(Marital=Married|Class=Yes)   = 0/3
               × P(Income=120K|Class=Yes)       = 1.2 × 10^-9
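The worked example can be reproduced from the ten training records. The sketch below assumes the tuple layout shown in the comments, relative-frequency estimates for the categorical attributes, and a Gaussian with sample mean and sample variance for income; the helper names are illustrative.

```python
import math

# (refund, marital_status, income_in_k, evade)
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(refund, status, income, label):
    """Unnormalized posterior P(Y=label) * prod_i P(X_i | Y=label)."""
    rows = [r for r in records if r[3] == label]
    n = len(rows)
    prior = n / len(records)
    p_refund = sum(r[0] == refund for r in rows) / n
    p_status = sum(r[1] == status for r in rows) / n
    incomes = [r[2] for r in rows]
    mean = sum(incomes) / n
    var = sum((v - mean) ** 2 for v in incomes) / (n - 1)  # sample variance
    return prior * p_refund * p_status * gaussian_pdf(income, mean, var)

scores = {label: score("No", "Married", 120, label) for label in ("No", "Yes")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The score for class Yes is exactly zero because no training record with Evade=Yes is married, so the record is assigned to class No.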
Can anything go wrong?
$$P(Y \mid X) \propto P(Y) \prod_{i=1}^{n} P(X_i \mid Y)$$

What if one of the probabilities $P(X_i \mid Y)$ is zero?

- If one of the conditional probabilities is zero, then the
entire expression becomes zero!
Probability estimation
- Original
$$P(X_i = x_i \mid Y = y) = \frac{n_c}{n}$$
- n_c: number of training instances where Xi=xi and Y=y
- n: number of training instances where Y=y

- Laplace smoothing
$$P(X_i = x_i \mid Y = y) = \frac{n_c + 1}{n + c}$$
- c is the number of distinct values that attribute Xi can take


Probability estimation (2)
- M-estimate
$$P(X_i = x_i \mid Y = y) = \frac{n_c + m\,p}{n + m}$$
- p can be regarded as the prior probability
- m is called equivalent sample size which determines
the trade-off between the observed probability nc/n
and the prior probability p
- E.g., p=1/3 and m=3
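The three estimators can be compared on a zero count. The sketch below uses n_c = 0 and n = 3, the counts for Marital=Married among the three class=Yes records of the tax example. Passing c = 3 is an assumption made here (one per marital status), chosen so that Laplace smoothing coincides with the m-estimate for p = 1/3 and m = 3.

```python
def original_estimate(n_c, n):
    """Relative frequency n_c / n."""
    return n_c / n

def laplace_estimate(n_c, n, c):
    """Add-one smoothing: (n_c + 1) / (n + c)."""
    return (n_c + 1) / (n + c)

def m_estimate(n_c, n, m, p):
    """M-estimate: (n_c + m * p) / (n + m), with prior p and equivalent sample size m."""
    return (n_c + m * p) / (n + m)

n_c, n = 0, 3  # no married records among the three class=Yes records

p_original = original_estimate(n_c, n)
p_laplace = laplace_estimate(n_c, n, c=3)   # assumption: c = 3
p_m = m_estimate(n_c, n, m=3, p=1 / 3)

print(p_original, p_laplace, p_m)
```

Both smoothed estimates give 1/6 instead of 0, so a single unseen attribute value no longer forces the whole product to zero.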
Summary
- Robust to isolated noise points
- Handles missing values by ignoring the
instance during probability estimate
calculations
- Robust to irrelevant attributes
- Independence assumption may not hold for
some attributes
Exercise
Ensemble Methods
Ensemble Methods
- Construct a set of classifiers from the training
data
- Predict class label of previously unseen
records by aggregating predictions made by
multiple classifiers
General Idea
Random Forests
Class Imbalance Problem
Class Imbalance Problem
- Data sets with imbalanced class distributions
are quite common in real-world applications
- E.g., credit card fraud detection
- Correct classification of the rare class has
often greater value than a correct classification
of the majority class
- The accuracy measure is not well suited for
imbalanced data sets
- We need alternative measures
Confusion Matrix

                   Predicted class
                   Positive               Negative
Actual  Positive   True Positives (TP)    False Negatives (FN)
class   Negative   False Positives (FP)   True Negatives (TN)
Additional Measures
- True positive rate (or sensitivity)
- Fraction of positive examples predicted correctly
$$TPR = \frac{TP}{TP + FN}$$
- True negative rate (or specificity)
- Fraction of negative examples predicted correctly
$$TNR = \frac{TN}{TN + FP}$$
Additional Measures
- False positive rate
- Fraction of negative examples predicted as positive
$$FPR = \frac{FP}{TN + FP}$$
- False negative rate
- Fraction of positive examples predicted as negative
$$FNR = \frac{FN}{TP + FN}$$
Additional Measures
- Precision
- Fraction of positive records among those that are
classified as positive
$$P = \frac{TP}{TP + FP}$$
- Recall
- Fraction of positive examples correctly predicted
(same as the true positive rate)
$$R = \frac{TP}{TP + FN}$$
Additional Measures
- F1-measure
- Summarizing precision and recall into a single
number
- Harmonic mean between precision and recall
$$F_1 = \frac{2RP}{R + P}$$
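All of these measures derive from the four confusion-matrix counts. The sketch below uses made-up counts for an imbalanced data set (50 positives, 950 negatives) to show how accuracy can look high while precision and F1 stay modest; the function name and the counts are illustrative assumptions.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the measures from the slides out of confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same as the true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "tpr": recall,
        "tnr": tn / (tn + fp),
        "fpr": fp / (tn + fp),
        "fnr": fn / (tp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * recall * precision / (recall + precision),
    }

# Hypothetical counts: 50 actual positives, 950 actual negatives
metrics = classification_metrics(tp=40, fn=10, fp=20, tn=930)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts accuracy is 0.97, while precision is only 2/3 and F1 is about 0.73. The sketch does not guard against empty denominators, which can occur when a classifier predicts no positives.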
Multiclass Problem
Multiclass Classification
- Many of the approaches are originally
designed for binary classification problems
- Many real-world problems require data to be
divided into more than two categories
- Two approaches
- One-against-rest (1-r)
- One-against-one (1-1)
- Predictions need to be combined in both cases
One-against-rest
- Y={y1, y2, … yK} classes
- For each class yi
- Instances that belong to yi are positive examples
- All other instances are negative examples
- Combining predictions
- If an instance is classified positive, the positive class
gets a vote
- If an instance is classified negative, all classes
except for the positive class receive a vote
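Example
- 4 classes, Y={y1, y2, y3, y4}
- Classifying a given test instance
- One binary classifier per class (+ marks the positive class, - the rest)
Classifier 1: y1 +, y2 -, y3 -, y4 -; predicted class: +
Classifier 2: y1 -, y2 +, y3 -, y4 -; predicted class: -
Classifier 3: y1 -, y2 -, y3 +, y4 -; predicted class: -
Classifier 4: y1 -, y2 -, y3 -, y4 +; predicted class: -
[Table: total votes per target class y1, y2, y3, y4]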
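The vote-combination rule for one-against-rest can be traced on this example. A minimal sketch, assuming the four predictions listed on the slide (+ from the y1 classifier, - from the other three); the function name is illustrative.

```python
def one_against_rest_votes(classes, predictions):
    """Tally votes; `predictions` maps each positive class to '+' or '-'."""
    votes = {c: 0 for c in classes}
    for positive_class, outcome in predictions.items():
        if outcome == "+":
            # Positive prediction: the positive class gets a vote
            votes[positive_class] += 1
        else:
            # Negative prediction: every class except the positive one gets a vote
            for c in classes:
                if c != positive_class:
                    votes[c] += 1
    return votes

classes = ["y1", "y2", "y3", "y4"]
predictions = {"y1": "+", "y2": "-", "y3": "-", "y4": "-"}

votes = one_against_rest_votes(classes, predictions)
print(votes)
print(max(votes, key=votes.get))
```

Under this rule y1 collects a vote from every classifier and is chosen as the target class.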
One-against-one
- Y={y1, y2, … yK} classes
- Construct a binary classifier for each pair of
classes (yi, yj)
- K(K-1)/2 binary classifiers in total
- Combining predictions
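- The predicted class receives a vote in each pairwise
comparison

Example
- 4 classes, Y={y1, y2, y3, y4}
- Classifying a given test instance
- One binary classifier per pair of classes (first class is +, second is -)
Classifier (y1 +, y2 -): predicted class: +
Classifier (y1 +, y3 -): predicted class: +
Classifier (y1 +, y4 -): predicted class: -
Classifier (y2 +, y3 -): predicted class: +
Classifier (y2 +, y4 -): predicted class: -
Classifier (y3 +, y4 -): predicted class: +
[Table: total votes per target class y1, y2, y3, y4]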
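The one-against-one tally can be traced the same way. This sketch assumes that the class predicted by each pairwise classifier receives the vote (the + class on a + prediction, the - class otherwise) and uses the six predictions from the slide; names are illustrative.

```python
def one_against_one_votes(classes, pairwise_predictions):
    """Tally votes from K(K-1)/2 pairwise classifiers.

    Each entry is (positive_class, negative_class, outcome), outcome in {'+', '-'}.
    """
    votes = {c: 0 for c in classes}
    for positive, negative, outcome in pairwise_predictions:
        winner = positive if outcome == "+" else negative
        votes[winner] += 1
    return votes

classes = ["y1", "y2", "y3", "y4"]
pairwise_predictions = [
    ("y1", "y2", "+"), ("y1", "y3", "+"), ("y1", "y4", "-"),
    ("y2", "y3", "+"), ("y2", "y4", "-"), ("y3", "y4", "+"),
]

votes = one_against_one_votes(classes, pairwise_predictions)
print(votes)
```

In this example y1 and y4 tie with two votes each, so an additional tie-breaking rule is needed to pick the final class.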
Locality Sensitive Hashing
Vinay Setty

Slides credit: http://mmds.org
Finding Similar Items Problem
‣ Similar Items
‣ Finding similar web pages and news articles
‣ Finding near duplicate images
‣ Plagiarism detection
‣ Duplications in Web crawls
‣ Find nearest-neighbors in high-dimensional space
‣ Nearest neighbors are points that are a small distance
apart

Very similar news articles

Near duplicate images
The Big Picture
Document → Shingling → Min-Hashing → Locality-Sensitive Hashing → Candidate pairs
‣ Shingling produces the set of strings of length k that appear in the document
‣ Min-Hashing produces signatures: short integer vectors that represent the sets, and reflect their similarity
‣ Locality-Sensitive Hashing produces candidate pairs: those pairs of signatures that we need to test for similarity
Three Essential Steps for Similar Docs

1. Shingling: Convert documents to sets


2. Min-Hashing: Convert large sets to short signatures, while
preserving similarity
3. Locality-Sensitive Hashing: Focus on pairs of signatures likely
to be from similar documents
‣ Candidate pairs!

Documents as High-Dim. Data
‣ Step 1: Shingling: Convert documents to sets
‣ Simple approaches:
‣ Document = set of words appearing in document
‣ Document = set of “important” words
‣ Don’t work well for this application. Why?
‣ Need to account for ordering of words!
‣ A different way: Shingles!
Define: Shingles
‣ A k-shingle (or k-gram) for a document is a sequence of k
tokens that appears in the doc
‣ Tokens can be characters, words or something else,
depending on the application
‣ Assume tokens = characters for examples

‣ Example: k=2; document D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
‣ Option: Shingles as a bag (multiset), count ab twice:
S’(D1) = {ab, bc, ca, ab}
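Character-level shingling is a sliding window over the text. A minimal sketch, assuming tokens are characters as in the example; the function names `shingle_set` and `shingle_bag` are illustrative.

```python
from collections import Counter

def shingle_set(text, k):
    """Set of all length-k character substrings of `text`."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def shingle_bag(text, k):
    """Bag (multiset) of k-shingles: repeated shingles are counted."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

s = shingle_set("abcab", 2)
bag = shingle_bag("abcab", 2)
print(sorted(s))
print(dict(bag))
```

For word-level shingles the same window would slide over `text.split()` instead of the raw string.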
Similarity Metric for Shingles

‣ Document D1 is a set of its k-shingles C1=S(D1)
‣ Equivalently, each document is a 0/1 vector in the space of k-shingles
‣ Each unique shingle is a dimension
‣ Vectors are very sparse
‣ A natural similarity measure is the Jaccard similarity:
sim(D1, D2) = |C1∩C2|/|C1∪C2|
Working Assumption

‣ Documents that have lots of shingles in common have similar text, even if the text appears in different
order
‣ Caveat: You must pick k large enough, or most documents will have most shingles
‣ k = 5 is OK for short documents
‣ k = 10 is better for long documents
Motivation for Minhash/LSH

Encoding Sets as Bit Vectors
‣ Many similarity problems can be formalized as finding subsets that
have significant intersection
‣ Encode sets using 0/1 (bit, boolean) vectors
‣ One dimension per element in the universal set
‣ Interpret set intersection as bitwise AND, and set union as bitwise OR
‣ Example: C1 = 10111; C2 = 10011
‣ Size of intersection = 3; size of union = 4
‣ Jaccard similarity (not distance) = 3/4
‣ Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
From Sets to Boolean Matrices
‣ Rows = elements (shingles)
‣ Columns = sets (documents)
‣ 1 in row e and column s if and only if e is a member of s
‣ Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
‣ Typical matrix is sparse!
‣ Each document is a column:

Boolean matrix (rows: D shingles, columns: N documents)
1 1 1 0
1 1 0 1
0 1 0 1
0 0 0 1
1 0 0 1
1 1 1 0
1 0 1 0

‣ Example: sim(C1, C2) = ?
‣ Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6
‣ d(C1,C2) = 1 – (Jaccard similarity) = 3/6
Hashing Columns (Signatures)
‣ Key idea: “hash” each column C to a small signature
h(C), such that:
‣ (1) h(C) is small enough that the signature fits in RAM
‣ (2) sim(C1, C2) is the same as the “similarity” of signatures
h(C1) and h(C2)
‣ Goal: Find a hash function h(·) such that:
‣ If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
‣ If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
‣ Hash docs into buckets. Expect that “most” pairs of near duplicate docs hash into the same bucket!
Min-Hashing
‣ Goal: Find a hash function h(·) such that:
‣ if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
‣ if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)
‣ Clearly, the hash function depends on the similarity metric:
‣ Not all similarity metrics have a suitable
hash function
‣ There is a suitable hash function for
the Jaccard similarity: It is called Min-Hashing
Min-Hashing
‣ Imagine the rows of the boolean matrix permuted under
random permutation π
‣ Define a “hash” function hπ(C) = the index of the first (in the permuted order π) row in which column C has value
‘1’:
$$h_\pi(C) = \min_\pi \pi(C)$$
‣ Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
Example
Input matrix (Shingles x Documents) with three permutations π of its rows

π1  π2  π3    C1  C2  C3  C4
2   4   3     1   0   1   0
3   2   4     1   0   0   1
7   1   7     0   1   0   1
6   3   2     0   1   0   1
1   6   6     0   1   0   1
5   7   1     1   0   1   0
4   5   5     1   0   1   0

Signature matrix M (one row per permutation)
2 1 2 1
2 1 4 1
1 2 1 2

‣ An entry 2 (e.g., π1 and C1): the 2nd element of the permutation is the first to map to a 1
‣ The entry 4 (π2 and C3): the 4th element of the permutation is the first to map to a 1
‣ Note: Another (equivalent) way is to store row indexes:
1 5 1 5
2 3 1 3
6 4 6 4
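The signature matrix of this example can be recomputed from the input matrix and the three permutations. In this sketch each permutation is stored as the list of positions assigned to rows 1 to 7, and the min-hash of a column is the smallest position among the rows where the column has a 1. The function name is an illustrative assumption.

```python
# Input matrix: 7 shingles (rows) x 4 documents (columns)
input_matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

# permutations[p][r] = position of row r in the permuted order p
permutations = [
    [2, 3, 7, 6, 1, 5, 4],
    [4, 2, 1, 3, 6, 7, 5],
    [3, 4, 7, 2, 6, 1, 5],
]

def minhash_signature(matrix, permutations):
    """Return one signature row per permutation.

    Assumes every column contains at least one 1.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    signature = []
    for perm in permutations:
        signature.append([
            min(perm[r] for r in range(n_rows) if matrix[r][c] == 1)
            for c in range(n_cols)
        ])
    return signature

M = minhash_signature(input_matrix, permutations)
for row in M:
    print(row)
```

Explicit permutations are only practical for small examples like this one; for large matrices they are usually simulated with hash functions over row indexes.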
Four Types of Rows
‣ Given cols C1 and C2, rows may be classified as:
Type  C1  C2
A     1   1
B     1   0
C     0   1
D     0   0
‣ a = # rows of type A, etc.
‣ Note: sim(C1, C2) = a/(a +b +c)
‣ Then: Pr[h(C1) = h(C2)] = Sim(C1, C2)
‣ Look down the cols C1 and C2 until we see a 1
‣ If it’s a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
Similarity for Signatures
‣ We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
‣ Now generalize to multiple hash functions - why?
‣ Permuting rows is expensive for large number of rows
‣ Instead we want to simulate the effect of a random
permutation using hash functions
‣ The similarity of two signatures is the fraction of
the hash functions in which they agree
‣ Note: Because of the Min-Hash property, the similarity of columns is the same as the expected similarity of their
signatures
Min-Hashing Example
Input matrix (Shingles x Documents) with permutations π

π1  π2  π3    C1  C2  C3  C4
2   4   3     1   0   1   0
3   2   4     1   0   0   1
7   1   7     0   1   0   1
6   3   2     0   1   0   1
1   6   6     0   1   0   1
5   7   1     1   0   1   0
4   5   5     1   0   1   0

Signature matrix M
2 1 2 1
2 1 4 1
1 2 1 2

Similarities:
         1-3   2-4   1-2   3-4
Col/Col  0.75  0.75  0     0
Sig/Sig  0.67  1.00  0     0
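The Col/Col and Sig/Sig rows of the table can be recomputed as follows. This is a small sketch with the input matrix and signature matrix of the example hardcoded; columns are indexed from 0, so the pair "1-3" becomes (0, 2). Function names are illustrative.

```python
input_matrix = [
    [1, 0, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1],
    [0, 1, 0, 1], [1, 0, 1, 0], [1, 0, 1, 0],
]
signature = [[2, 1, 2, 1], [2, 1, 4, 1], [1, 2, 1, 2]]

def column_similarity(matrix, i, j):
    """Jaccard similarity of columns i and j of a 0/1 matrix."""
    both = sum(1 for row in matrix if row[i] == 1 and row[j] == 1)
    either = sum(1 for row in matrix if row[i] == 1 or row[j] == 1)
    return both / either if either else 0.0

def signature_similarity(signature, i, j):
    """Fraction of signature rows (hash functions) on which columns i and j agree."""
    agree = sum(1 for row in signature if row[i] == row[j])
    return agree / len(signature)

for i, j in [(0, 2), (1, 3), (0, 1), (2, 3)]:
    print(f"{i + 1}-{j + 1}: col/col = {column_similarity(input_matrix, i, j):.2f}, "
          f"sig/sig = {signature_similarity(signature, i, j):.2f}")
```

With only three hash functions the signature estimate is coarse (multiples of 1/3); more hash functions bring it closer to the true Jaccard similarity in expectation.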
Min-Hash Signatures

Min-Hash Signatures Example
[Figure: step-by-step construction of the signature matrix, shown after initialization (Init) and after processing Row 0, Row 1, Row 2, Row 3, and Row 4.]
LSH: First Cut
Signature matrix M (example)
2 1 4 1
1 2 1 2
2 1 2 1

‣ Goal: Find documents with Jaccard similarity at least s (for
some similarity threshold, e.g., s=0.8)
‣ LSH – General idea: Use a function f(x,y) that tells whether x and y are a candidate pair: a pair of elements whose
similarity must be evaluated
‣ For Min-Hash matrices:
‣ Hash columns of signature matrix M to many buckets
‣ Each pair of documents that hashes into the
same bucket is a candidate pair
Candidates from Min-Hash
Signature matrix M (example)
2 1 4 1
1 2 1 2
2 1 2 1

‣ Pick a similarity threshold s (0 < s < 1)
‣ Columns x and y of M are a candidate pair if their
signatures agree on at least fraction s of their rows:
M (i, x) = M (i, y) for at least frac. s values of i
‣ We expect documents x and y to have the same
(Jaccard) similarity as their signatures
Partition M into b Bands
[Figure: signature matrix M (example rows 2 1 4 1 / 1 2 1 2 / 2 1 2 1) divided into b bands of r rows per band; each column is one signature.]
Hashing Bands
[Figure: matrix M divided into b bands of r rows; within a band, each column's portion is hashed to buckets. Columns 2 and 6 land in the same bucket, so they are probably identical in this band (candidate pair). Columns 6 and 7 land in different buckets, so they are guaranteed to be different in this band.]
Partition M into Bands

‣ Divide matrix M into b bands of r rows
‣ For each band, hash its portion of each column to a hash
table with k buckets
‣ Make k as large as possible
‣ Candidate column pairs are those that hash to the same bucket for ≥ 1 band
‣ Tune b and r to catch most similar pairs,
but few non-similar pairs
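The banding step can be sketched with one dictionary per band. As a simplification, the band's tuple of values is used directly as the bucket key, so two columns share a bucket only if they are identical in that band; a real implementation would hash the tuple into a fixed number of buckets. The function name and the small signature matrix (taken from the min-hashing example) are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signature, b, r):
    """Return column pairs that fall into the same bucket in at least one band."""
    if len(signature) != b * r:
        raise ValueError("signature must have exactly b * r rows")
    n_cols = len(signature[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(n_cols):
            # The band's portion of this column serves as the bucket key
            key = tuple(signature[row][col]
                        for row in range(band * r, (band + 1) * r))
            buckets[key].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

signature = [[2, 1, 2, 1], [2, 1, 4, 1], [1, 2, 1, 2]]
print(sorted(lsh_candidate_pairs(signature, b=3, r=1)))
print(sorted(lsh_candidate_pairs(signature, b=1, r=3)))
```

With three bands of one row, columns 0 and 2 become candidates because they agree in two bands; with a single band of three rows only the identical columns 1 and 3 remain.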
Simplifying Assumption

‣ There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a
particular band
‣ Hereafter, we assume that “same bucket” means “identical in that band”
‣ Assumption needed only to simplify analysis, not for correctness of algorithm
b bands, r rows/band

‣ Columns C1 and C2 have similarity s
‣ Pick any band (r rows)
‣ Prob. that all rows in band equal = $s^r$
‣ Prob. that some row in band unequal = $1 - s^r$
‣ Prob. that no band identical = $(1 - s^r)^b$
‣ Prob. that at least one band is identical = $1 - (1 - s^r)^b$
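The final expression translates directly into a function of s, r, and b. A minimal sketch; the function name is illustrative, and the sample values b = 20, r = 5 are chosen here only as an example.

```python
def candidate_probability(s, r, b):
    """Probability that two columns with similarity s agree in at least one of b bands of r rows."""
    all_rows_equal = s ** r            # one particular band is identical
    no_band_identical = (1 - all_rows_equal) ** b
    return 1 - no_band_identical

for s in (0.3, 0.8):
    print(f"s = {s}: P(candidate) = {candidate_probability(s, r=5, b=20):.4f}")
```

The probability rises steeply with s: for these settings a pair with similarity 0.8 is almost always a candidate, while a pair with similarity 0.3 rarely is.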
Example of Bands

Assume the following case:
‣ Suppose 100,000 columns of M (100k docs)
‣ Signatures of 100 integers (rows)
‣ Therefore, signatures take 40 MB (at 4 bytes per integer)
‣ Choose b = 20 bands of r = 5 integers/band
‣ Goal: Find pairs of documents that are at least s = 0.8 similar
C1, C2 are 80% Similar
‣ Find pairs of ≥ s=0.8 similarity, set b=20, r=5
‣ Assume: sim(C1, C2) = 0.8
‣ Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We
want them to hash to at least 1 common bucket (at least one band
is identical)
‣ Probability C1, C2 identical in one particular band: $(0.8)^5 = 0.328$
‣ Probability C1, C2 are not identical in any of the 20 bands:
$(1 - 0.328)^{20} = 0.00035$
‣ i.e., about 1/3000th of the 80%-similar column pairs
are false negatives (we miss them)
‣ We would find $1 - (1 - 0.328)^{20}$ = 99.965% pairs of truly similar
documents
C1, C2 are 30% Similar
‣ Find pairs of ≥ s=0.8 similarity, set b=20, r=5
‣ Assume: sim(C1, C2) = 0.3
‣ Since sim(C1, C2) < s we want C1, C2 to hash to NO
common buckets (all bands should be different)

‣ Probability C1, C2 identical in one particular band: $(0.3)^5 = 0.00243$
‣ Probability C1, C2 identical in at least 1 of 20 bands: $1 - (1 - 0.00243)^{20} = 0.0474$
‣ In other words, approximately 4.74% of the pairs of docs with similarity 0.3 end up becoming candidate pairs
‣ They are false positives: we have to examine them (they are candidate pairs), but it then turns out that their similarity is below the threshold s

LSH Involves a Tradeoff

‣ Pick:
‣ The number of Min-Hashes (rows of M)
‣ The number of bands b, and
‣ The number of rows r per band

to balance false positives/negatives

‣ Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up

Analysis of LSH – What We Want

[Figure: ideal step function. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket. With similarity threshold s: no chance of sharing a bucket if t < s; probability = 1 if t > s.]

What One Band of One Row Gives You

Remember: with a single hash function, the probability of equal hash-values = similarity

[Figure: the probability of sharing a bucket (y-axis) grows linearly with the similarity s = sim(C1, C2) of two sets (x-axis). Annotated regions: false positives and false negatives.]

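What b Bands of r Rows Gives You

‣ All rows of a band are equal: $s^r$
‣ Some row of a band unequal: $1 - s^r$
‣ No bands identical: $(1 - s^r)^b$
‣ At least one band identical (probability of sharing a bucket): $1 - (1 - s^r)^b$
‣ Threshold: $t \approx (1/b)^{1/r}$

[Figure: S-curve of the probability of sharing a bucket (y-axis) against the similarity s = sim(C1, C2) of two sets (x-axis).]
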
Example: b = 20; r = 5

‣ Similarity threshold s
‣ Prob. that at least 1 band is identical: $1 - (1 - s^r)^b$

s     1-(1-s^r)^b
.2    .006
.3    .047
.4    .186
.5    .470
.6    .802
.7    .975
.8    .9996

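The table above can be recomputed directly from the banding formula. The following is a minimal sketch; the function names `candidate_probability` and `approximate_threshold` are illustrative choices, not part of the slides.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s are identical
    in at least one of b bands of r rows each: 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b


def approximate_threshold(r: int, b: int) -> float:
    """Approximate similarity threshold of the S-curve: (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


b, r = 20, 5
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"s = {s:.1f}  P(candidate) = {candidate_probability(s, r, b):.4f}")
print(f"approximate threshold: {approximate_threshold(r, b):.3f}")
```

Changing b and r in this sketch shows the tradeoff: fewer bands lower the curve (fewer false positives, more false negatives), while fewer rows per band raise it.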
LSH Summary

‣ Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
‣ Check in main memory that candidate pairs really do have similar signatures
‣ Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents

References

For LSH, refer to Mining of Massive Datasets, Chapter 3: http://infolab.stanford.edu/~ullman/mmds/book.pdf

LSH slides are borrowed from http://i.stanford.edu/~ullman/cs246slides/LSH-1.pdf

DAT630
Classification
Basic Concepts, Decision Trees, and Model Evaluation
Introduction to Data Mining, Chapter 4

25/09/2017

Darío Garigliotti | University of Stavanger


Basic Concepts
Classification
- Classification is the task of assigning objects
to one of several predefined categories
- Examples
- Credit card transactions: legitimate or fraudulent?
- Emails: SPAM or not?
- Patients: high or low risk?
- Astronomy: star, galaxy, nebula, etc.
- News stories: finance, weather, entertainment,
sports, etc.
Why?
- Descriptive modeling
- Explanatory tool to distinguish between objects of
different classes
- Predictive modeling
- Predict the class label of previously unseen records
- Automatically assign a class label when presented
with the attributes of the record
The task
- Input is a collection of records (instances)
- Each record is characterized by a tuple (x, y)
- x is the attribute set
- y is the class label (category or target attribute)
- Classification is the task of learning a target
function f (classification model) that maps
each attribute set x to one of the predefined
class labels y
[Figure: Attribute set (x) → Classification Model → Class label (y). The attribute set can be nominal, ordinal, interval, or ratio; the class label is nominal.]
General approach

Training Set (records whose class labels are known)

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set (records with unknown class labels)

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

- Induction: the learning algorithm learns a model from the training set
- Deduction: the model is applied to the test set
Objectives for Learning Alg.
- Should fit the input data well (training set, induction)
- Should correctly predict class labels for unseen data (test set, deduction)
Learning Algorithms
- Decision trees
- Rule-based
- Naive Bayes
- Support Vector Machines
- Random forests
- k-nearest neighbors
- …
Machine Learning vs.
Data Mining
- Similar techniques, but different goal
- Machine learning is focused on developing and
designing learning algorithms
- More abstract, e.g., features are given
- Data Mining is applied Machine Learning
- Performed by a person who has a goal in mind and
uses Machine Learning techniques on a specific
dataset
- Much of the work is concerned with data
(pre)processing and feature engineering
Today
- Decision trees
- Binary class labels
- Positive or Negative
Objectives for Learning Alg.
- Should fit the input data well
- Should correctly predict class labels for unseen data
- How to measure this?
Evaluation
- Measuring the performance of a classifier
- Based on the number of records correctly and
incorrectly predicted by the model
- Counts are tabulated in a table called the
confusion matrix
- Compute various performance metrics based
on this matrix
Confusion Matrix

                   Predicted Positive      Predicted Negative
Actual Positive    True Positives (TP)     False Negatives (FN)
Actual Negative    False Positives (FP)    True Negatives (TN)

- Type I Error (FP): raising a false alarm
- Type II Error (FN): failing to raise an alarm
Example
"Is the man innocent?"

                             Predicted Positive (Innocent)   Predicted Negative (Guilty)
Actual Positive (Innocent)   True Positive: Freed            False Negative: Convicted
Actual Negative (Guilty)     False Positive: Freed           True Negative: Convicted

- False Negative: convicting an innocent person (miscarriage of justice)
- False Positive: letting a guilty person go free (error of impunity)
Evaluation Metrics
- Summarizing performance in a single number
- Accuracy
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP + FP + TN + FN}$$
- Error rate
$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{FP + FN}{TP + FP + TN + FN}$$
- We seek high accuracy, or equivalently, low error rate
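As a small illustration of the two metrics, the sketch below computes them from the four confusion-matrix counts. The counts used in the example call are made up for illustration and do not come from the slides.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def error_rate(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of predictions that are wrong."""
    return (fp + fn) / (tp + fp + tn + fn)


# Hypothetical counts: 40 TP, 5 FP, 45 TN, 10 FN (100 predictions in total).
acc = accuracy(tp=40, fp=5, tn=45, fn=10)
err = error_rate(tp=40, fp=5, tn=45, fn=10)
print(f"accuracy = {acc:.2f}, error rate = {err:.2f}")
```

Accuracy and error rate always sum to 1, which is why maximizing one is equivalent to minimizing the other.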
Decision Trees
An Example
How does it work?
- Asking a series of questions about the
attributes of the test record
- Each time we receive an answer, a follow-up
question is asked until we reach a conclusion
about the class label of the record
Decision Tree Model

[Figure: the general-approach diagram (training set → learning algorithm → learn model → apply model → test set), here with a decision tree as the model.]
Decision Tree
- Root node: no incoming edges and zero or more outgoing edges
- Internal node: exactly one incoming edge and two or more outgoing edges
- Leaf (or terminal) nodes: exactly one incoming edge and no outgoing edges
Example Decision Tree

Training Data (Refund: categorical; Marital Status: categorical; Taxable Income: continuous; Cheat: class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

Another Example

Training data: the same ten records (Tid 1 to 10; attributes Refund, Marital Status, Taxable Income; class Cheat)

MarSt?
  Married: NO
  Single, Divorced: Refund?
    Yes: NO
    No: TaxInc?
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
Apply Model to Test Data

[Figure: the general-approach diagram (training set → learning algorithm → learn model → apply model → test set), here focusing on the apply-model (deduction) step.]
Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund?
  Yes: NO
  No: MarSt?
    Married: NO
    Single, Divorced: TaxInc?
      < 80K: NO
      > 80K: YES

- Start from the root of the tree
- Refund = No: follow the No branch to MarSt
- MarSt = Married: follow the Married branch to the leaf NO
- Assign Cheat to “No”
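The traversal above can be written as a chain of attribute tests. This is a minimal sketch of the example tree (Refund, then MarSt, then TaxInc); the dictionary keys and the handling of an income of exactly 80K (treated as the "Yes" side) are assumptions made for illustration.

```python
def classify(record: dict) -> str:
    """Walk the example decision tree and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"                       # leaf under Refund = Yes
    if record["Marital Status"] == "Married":
        return "No"                       # leaf under MarSt = Married
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: an income of exactly 80K goes to the "Yes" side.
    return "No" if record["Taxable Income"] < 80 else "Yes"


test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(classify(test_record))  # the test record from the slide
```

For the test record, the walk stops at the Married leaf, so the taxable income is never examined.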
Decision Tree Induction

[Figure: the general-approach diagram (training set → learning algorithm → learn model → apply model → test set), here focusing on the learn-model (induction) step.]
Tree Induction
- There are exponentially many decision trees
that can be constructed from a given set of
attributes
- Finding the optimal tree is computationally
infeasible (NP-hard)
- Greedy strategies are used
- Grow a decision tree by making a series of locally
optimum decisions about which attribute to use for
splitting the data
Hunt’s algorithm
- Let $D_t$ be the set of training records that reach a node t and $y = \{y_1, \ldots, y_c\}$ the class labels
- General Procedure
- If $D_t$ contains records that all belong to the same class $y_t$, then t is a leaf node labeled as $y_t$
- If $D_t$ is an empty set, then t is a leaf node labeled by the default class, $y_d$
- If $D_t$ contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

(Training data: the ten records with attributes Refund, Marital Status, Taxable Income and class Cheat.)
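The recursive procedure can be sketched in a few lines. The version below makes several simplifying assumptions that are not in the slides: attributes are tested in a fixed, given order (no best-split selection), every split is multi-way on the observed values, and Taxable Income is pre-discretized into "<80K" and ">=80K".

```python
from collections import Counter


def hunt(records, attributes, default):
    """Grow a tree as nested dicts; leaves are class labels.

    records: list of (attribute_dict, class_label) pairs reaching this node.
    """
    if not records:                        # empty D_t: default class
        return default
    labels = [label for _, label in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                    # pure node, or no test left
    attr, rest = attributes[0], attributes[1:]
    # Splitting on observed values only, so subsets are never empty here.
    values = sorted({x[attr] for x, _ in records})
    return {attr: {v: hunt([(x, y) for x, y in records if x[attr] == v],
                           rest, majority)
                   for v in values}}


rows = [("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
        ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
        ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
        ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
        ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes")]
records = [({"Refund": refund, "Marital Status": marital,
             "Taxable Income": "<80K" if income < 80 else ">=80K"}, label)
           for refund, marital, income, label in rows]

tree = hunt(records, ["Refund", "Marital Status", "Taxable Income"], default="No")
print(tree)
```

Because this sketch uses multi-way splits, Single and Divorced end up in separate branches rather than being grouped as in the slide's tree; the predictions on the training data are the same.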
Hunt’s algorithm applied to the ten-record Refund / Marital Status / Taxable Income training data, step by step:

1. A single leaf: Don’t Cheat
2. Split on Refund: Yes → Don’t Cheat; No → Don’t Cheat
3. Under Refund = No, split on Marital Status: Single, Divorced → Cheat; Married → Don’t Cheat
4. Under Single, Divorced, split on Taxable Income: < 80K → Don’t Cheat; >= 80K → Cheat
Tree Induction Issues
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
How to Specify Test
Condition?
- Depends on attribute types
- Nominal
- Ordinal
- Continuous
- Depends on number of ways to split
- 2-way split
- Multi-way split
Splitting Based on Nominal
Attributes
- Multi-way split: use as many partitions as
distinct values
CarType → {Family}, {Sports}, {Luxury}

- Binary split: divide values into two subsets; need to find optimal partitioning

CarType → {Sports, Luxury} vs. {Family}   OR   CarType → {Family, Luxury} vs. {Sports}
Splitting Based on Ordinal
Attributes
- Multi-way split: use as many partitions as
distinct values
Size → {Small}, {Medium}, {Large}

- Binary split: divide values into two subsets; need to find optimal partitioning

Size → {Small, Medium} vs. {Large}   OR   Size → {Medium, Large} vs. {Small}
Splitting Based on
Continuous Attributes
- Different ways of handling
- Discretization to form an ordinal categorical attribute
- Static – discretize once at the beginning
- Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or clustering

- Binary Decision: (A < v) or (A ≥ v)
- consider all possible splits and find the best cut
- can be more compute intensive
Splitting Based on
Continuous Attributes
(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K,25K), [25K,50K), [50K,80K), > 80K
Tree Induction Issues
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
Determining the Best Split
Before Splitting: 10 records of class C0, 10 records of class C1

Own Car?      Yes: C0 6, C1 4       No: C0 4, C1 6
Car Type?     Family: C0 1, C1 3    Sports: C0 8, C1 0    Luxury: C0 1, C1 7
Student ID?   c1 ... c10: C0 1, C1 0 each    c11 ... c20: C0 0, C1 1 each

Which test condition is the best?
Determining the Best Split
- Greedy approach:
- Nodes with homogeneous class distribution are
preferred
- Need a measure of node impurity:
- C0: 5, C1: 5 is non-homogeneous, high degree of impurity
- C0: 9, C1: 1 is homogeneous, low degree of impurity
Impurity Measures
- Measuring the impurity of a node
- P(i|t) = fraction of records belonging to class i at a
given node t
- c is the number of classes
$$\text{Entropy}(t) = -\sum_{i=0}^{c-1} P(i|t) \log_2 P(i|t)$$
$$\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} P(i|t)^2$$
$$\text{Classification error}(t) = 1 - \max_i P(i|t)$$
Entropy
$$\text{Entropy}(t) = -\sum_{i=0}^{c-1} P(i|t) \log_2 P(i|t)$$

- Maximum ($\log_2 n_c$) when records are equally distributed among all classes, implying least information
- Minimum (0.0) when all records belong to one class, implying most information
Exercise
$$\text{Entropy}(t) = -\sum_{i=0}^{c-1} P(i|t) \log_2 P(i|t)$$

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0 (taking 0 log2 0 = 0)

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
GINI
$$\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} P(i|t)^2$$

- Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
Exercise
$$\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} P(i|t)^2$$

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
Classification Error
$$\text{Classification error}(t) = 1 - \max_i P(i|t)$$

- Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
Exercise
$$\text{Classification error}(t) = 1 - \max_i P(i|t)$$

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
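The three exercises can be checked with a short script. This is a minimal sketch that takes the per-class record counts at a node; the function names are illustrative, and terms with P(i|t) = 0 are skipped in the entropy (the usual convention 0 log 0 = 0).

```python
from math import log2


def probabilities(counts):
    """Class fractions P(i|t) from the per-class record counts at a node."""
    total = sum(counts)
    return [c / total for c in counts]


def entropy(counts):
    # Terms with p = 0 are skipped (convention: 0 * log2(0) = 0).
    return sum(-p * log2(p) for p in probabilities(counts) if p > 0)


def gini(counts):
    return 1.0 - sum(p * p for p in probabilities(counts))


def classification_error(counts):
    return 1.0 - max(probabilities(counts))


for counts in [(0, 6), (1, 5), (2, 4)]:
    print(counts,
          f"entropy = {entropy(counts):.3f}",
          f"gini = {gini(counts):.3f}",
          f"error = {classification_error(counts):.3f}")
```

All three measures are 0 for the pure node (0, 6) and grow as the class distribution becomes more even.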
Comparison of
Impurity Measures
For a 2-class problem:
Gain = goodness of a split
Split on A or on B?

- Before splitting: the parent node has class counts C0: N00, C1: N01 and impurity M0
- Split on A? (Yes/No) gives Node N1 (C0: N10, C1: N11; impurity M1) and Node N2 (C0: N20, C1: N21; impurity M2); their combined impurity is M12
- Split on B? (Yes/No) gives Node N3 (C0: N30, C1: N31; impurity M3) and Node N4 (C0: N40, C1: N41; impurity M4); their combined impurity is M34
- N is the number of training instances of class C0/C1 for the given node
- M is an impurity measure (Entropy, Gini, etc.)
- Gain = M0 – M12 vs. M0 – M34
- The split that produces the higher gain is considered the better split
Information Gain
- When Entropy is used as the impurity measure,
it’s called information gain
- Measures how much we gain by splitting a parent node

$$\Delta_{info} = \text{Entropy}(p) - \sum_{j=1}^{k} \frac{N(v_j)}{N} \text{Entropy}(v_j)$$

- k is the number of attribute values
- $N(v_j)$ is the number of records associated with the child node $v_j$
- N is the total number of records at the parent node
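To make the formula concrete, the sketch below computes the information gain of the three candidate splits from the "Determining the Best Split" slide (Own Car, Car Type, Student ID). The data layout, a list of per-child (C0, C1) counts, is an assumption made for illustration.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = (10, 10)  # 10 records of class C0, 10 of class C1
splits = {
    "Own Car": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    "Student ID": [(1, 0)] * 10 + [(0, 1)] * 10,
}
gains = {name: information_gain(parent, children)
         for name, children in splits.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
```

Student ID obtains the largest gain because every child is pure, even though such a split is useless for prediction; this is the weakness that the gain ratio addresses.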
Determining the Best Split
Before Splitting: 10 records of class C0, 10 records of class C1

Own Car?      Yes: C0 6, C1 4       No: C0 4, C1 6
Car Type?     Family: C0 1, C1 3    Sports: C0 8, C1 0    Luxury: C0 1, C1 7
Student ID?   c1 ... c10: C0 1, C1 0 each    c11 ... c20: C0 0, C1 1 each

Which test condition is the best?
Gain Ratio
- Can be used instead of information gain

$$\text{Gain ratio} = \frac{\Delta_{info}}{\text{Split info}}$$
$$\text{Split info} = -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)$$

- If the attribute produces a large number of splits, its split info will also be large, which in turn reduces its gain ratio
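The effect of the split info penalty can be seen on the Own Car / Car Type / Student ID example. The following sketch assumes that $P(v_i)$ is the fraction of parent records sent to child $v_i$; the helper names and the data layout are illustrative.

```python
from math import log2


def entropy(counts):
    """Entropy of a distribution given as non-negative counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


def gain_ratio(parent, children):
    """Information gain divided by the split info of the partition."""
    n = sum(parent)
    sizes = [sum(child) for child in children]
    gain = entropy(parent) - sum(size / n * entropy(child)
                                 for size, child in zip(sizes, children))
    split_info = entropy(sizes)  # entropy of the child-size distribution
    return gain / split_info


parent = (10, 10)
splits = {
    "Own Car": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    "Student ID": [(1, 0)] * 10 + [(0, 1)] * 10,
}
ratios = {name: gain_ratio(parent, children)
          for name, children in splits.items()}
best = max(ratios, key=ratios.get)
print(ratios, best)
```

With the penalty applied, the 20-way Student ID split no longer wins; Car Type has the highest gain ratio in this example.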
Tree Induction Issues
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
Stopping Criteria for Tree
Induction
- Stop expanding a node when all the records
belong to the same class
- Stop expanding a node when all the records
have similar attribute values
- Early termination
- See details in a few slides
Summary Decision Trees
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification
techniques for many simple data sets
Practical Issues of
Classification
Objectives for Learning Alg.
- Should fit the input data well
- Should correctly predict class labels for unseen data
Underfitting and Overfitting
Overfitting

Underfitting: when model is too simple, both training and test errors are large
How to Address Overfitting
- Pre-Pruning (Early Stopping Rule): stop the
algorithm before it becomes a fully-grown tree
- Typical stopping conditions for a node
- Stop if all instances belong to the same class
- Stop if all the attribute values are the same (i.e., belong to
the same split)
- More restrictive conditions
- Stop if number of instances is less than some user-
specified threshold
- Stop if class distribution of instances are independent of
the available features
- Stop if expanding the current node does not improve
impurity measures (e.g., Gini or information gain)
How to Address Overfitting
- Post-pruning: grow decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up
fashion
- If generalization error improves after trimming,
replace sub-tree by a leaf node
- Class label of leaf node is determined from majority
class of instances in the sub-tree
Methods for estimating
performance
- Holdout
- Reserve 2/3 for training and 1/3 for testing
(validation set)
- Cross validation
- Partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining
one
- Leave-one-out: k=n
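The k-fold partitioning can be sketched without any library. The function below is one simple way to do it, assigning record i to fold i mod k; this assignment rule and the name `k_fold_indices` are assumptions for illustration (in practice the data is usually shuffled first).

```python
def k_fold_indices(n: int, k: int):
    """Yield (train, test) index lists for k disjoint folds over range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]  # record i goes to fold i % k
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 5-fold cross validation over 10 records: train on 4 folds, test on 1.
for train, test in k_fold_indices(n=10, k=5):
    print("train:", train, "test:", test)

# Leave-one-out is the special case k = n.
leave_one_out = list(k_fold_indices(n=4, k=4))
```

Each record is used for testing exactly once, and the reported performance is typically the average over the k test folds.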
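Expressivity

[Figure: a two-dimensional data set on the unit square (axes x and y, both from 0 to 1), partitioned by a decision tree with root test x < 0.43? and child tests y < 0.47? and y < 0.33?; the four leaves have class counts 4:0, 0:4, 0:3, and 4:0.]

Expressivity

[Figure: two classes separated by the oblique boundary x + y < 1 (Class = + on one side, Class = − on the other).]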
Exercise
DAT630
Exploring Data
Introduction to Data Mining, Chapter 3

18/09/2017

Darío Garigliotti | University of Stavanger


Data Exploration
- Preliminary investigation of the data in order to
better understand its specific characteristics
- Can aid in selecting the appropriate
preprocessing and data analysis techniques
- Can even address some of the questions
typically answered by data mining
- Finding patterns by visually inspecting the data
Three major topics
- Summary statistics
- Visualization
- On-Line Analytical Processing (OLAP)
The Iris Data Set
The Iris data set
- Introduced in 1936 by Ronald Fisher
- 50 samples from each of three species of Iris
- Iris setosa, Iris virginica, and Iris versicolor
- Four features from each sample
- The length and the width of the sepals and petals
Four Features
Summary Statistics
Summary Statistics
- Quantities that capture various
characteristics of a potentially large set of
values with a single number (or a small set of
numbers)
- Examples
- Average household income
- Fraction of students who complete a BSc in 3 years
Frequency
- The frequency of an attribute value is the
percentage with which the value occurs in the
data set
- For example, given the attribute ‘gender’ and a
representative population of people, the gender
‘female’ occurs about 50% of the time
- x is a categorical attribute that can take values
{v1,…,vk} and there are m objects in total

$$\text{frequency}(v_i) = \frac{\text{number of objects with attribute value } v_i}{m}$$
Mode
- The mode of an attribute is the most frequent
attribute value
- The notion of a mode is only interesting if attribute
values have different frequencies
- The notions of frequency and mode are
typically used with categorical data
Example
- What are the frequencies?
- What is the mode?

Age     Count   Frequency
0-9     3       0.068
10-19   4       0.091
20-29   15      0.341   (mode)
30-39   12      0.273
40-49   8       0.182
50-59   2       0.045
Total   44
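The frequencies and the mode in this example follow directly from the counts. A minimal sketch, with the age groups stored in a plain dictionary (an illustrative choice):

```python
counts = {"0-9": 3, "10-19": 4, "20-29": 15, "30-39": 12, "40-49": 8, "50-59": 2}

total = sum(counts.values())
# frequency(v_i) = number of objects with value v_i / m
frequencies = {group: count / total for group, count in counts.items()}
# The mode is the most frequent attribute value.
mode = max(counts, key=counts.get)

for group, freq in frequencies.items():
    print(f"{group:>6}: {freq:.3f}")
print("mode:", mode)
```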
Percentiles
- For continuous data, the notion of a percentile
is more useful
- Given an ordinal or continuous attribute x and
a number p between 0 and 100, the pth
percentile is a value xp of x such that p% of the
observed values of x are less than xp
- For instance, the 50th percentile is the value x50%
such that 50% of all values of x are less than x50%
- min(x)= x0% max(x)= x100%
Example
- What is the 80th percentile of this data?
- 8, 6, 3, 7, 3, 4, 1, 6, 8, 5
- Sort the data
- 1, 3, 3, 4, 5, 6, 6, 7, 8, 8
- 80% of the values are smaller than 8, so the 80th percentile is 8

Mean and Median
- Most widely used statistics for continuous data
- Let x be an attribute and {x1,…,xm} the values
of the attribute for a set of m objects
- Let $\{x_{(1)}, \ldots, x_{(m)}\}$ be the set of values after sorting
- I.e., $x_{(1)} = \min(x)$ and $x_{(m)} = \max(x)$
$$\text{mean}(x) = \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i$$
Mean and Median (2)
- The middle value if there is an odd number of
values, and the average of the two middle
values if the number of values is even

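$$\text{median}(x) = \begin{cases} x_{(r+1)}, & \text{if } m \text{ is odd (i.e., } m = 2r + 1) \\ \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right), & \text{if } m \text{ is even (i.e., } m = 2r) \end{cases}$$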
Mean vs. Median
- Both indicate the "middle" of the values
- If the distribution of values is skewed, then the
median is a better indicator of the middle
- The mean is sensitive to the presence of
outliers; the median provides a more robust
estimate of the middle
Trimmed Mean
- To overcome problems with the traditional
definition of a mean, the notion of a trimmed
mean is sometimes used
- A percentage p between 0 and 100 is specified; the top and bottom (p/2)% of the data are thrown out; then the mean is calculated in the normal way
- Median is a trimmed mean with p=100%, the
standard mean corresponds to p=0%
Example
- Consider the set of values {1, 2, 3, 4, 5, 90}
- What is the mean? 17.5
- What is the median? (3+4)/2 = 3.5
- What is the trimmed mean with p=40%? 3.5
- Trimmed values (with top-20% and bottom-20% of the values thrown out): {2,3,4,5}
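The three answers can be reproduced with Python's `statistics` module. The `trimmed_mean` helper below is a sketch: it assumes that the number of values dropped at each end is (p/2)% of the data rounded down, and it falls back to the median when everything would be trimmed.

```python
from statistics import mean, median


def trimmed_mean(values, p):
    """Mean after dropping the top and bottom (p/2)% of the sorted values."""
    ordered = sorted(values)
    k = int(len(ordered) * p / 200)      # values dropped at each end (rounded down)
    if 2 * k >= len(ordered):            # everything trimmed: fall back to the median
        return median(ordered)
    return mean(ordered[k:len(ordered) - k])


data = [1, 2, 3, 4, 5, 90]
print(mean(data))              # pulled up by the outlier 90
print(median(data))
print(trimmed_mean(data, 40))  # drops 1 and 90
```

With p = 0 the helper reduces to the ordinary mean, matching the statement that the standard mean corresponds to p = 0%.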
Range and Variance
- To measure the dispersion/spread of a set of
values (for continuous data)
- Range
$$\text{range}(x) = \max(x) - \min(x) = x_{(m)} - x_{(1)}$$
- Variance*
$$\text{variance}(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2$$
- Standard deviation is the square root of variance

*This variant is known as the "bias-corrected sample variance"


Range vs. Variance
- Range can be misleading if the values are
concentrated in a narrow area, but there are
also a relatively small number of extreme
values
- Hence, the variance is preferred as a measure
of spread
Example
- What is the range and variance of the following data?
- 3 24 30 47 43 7 47 13 44 39
- Range: 47-3 = 44
- Variance: 289.57
- mean: 29.7
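These numbers can be verified with the standard library. In the sketch below, `statistics.variance` is used because it computes the sample variance with the m - 1 denominator, which matches the bias-corrected definition on the slide.

```python
from statistics import mean, stdev, variance

data = [3, 24, 30, 47, 43, 7, 47, 13, 44, 39]

value_range = max(data) - min(data)
sample_variance = variance(data)   # sum of squared deviations / (m - 1)
standard_deviation = stdev(data)   # square root of the sample variance

print(f"range = {value_range}")
print(f"mean = {mean(data):.1f}")
print(f"variance = {sample_variance:.2f}")
print(f"standard deviation = {standard_deviation:.2f}")
```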
More Robust Estimates of
Spread
- Variance is particularly sensitive to outliers
- The mean can be distorted by outliers; variance
uses the squared difference between the mean and
other values
- Absolute Average Deviation (AAD)
$$\text{AAD}(x) = \frac{1}{m} \sum_{i=1}^{m} |x_i - \bar{x}|$$
More Robust Estimates of
Spread (2)
- Median Absolute Deviation (MAD)
$$\text{MAD}(x) = \text{median}\left(\{|x_1 - \bar{x}|, \ldots, |x_m - \bar{x}|\}\right)$$
- Interquartile Range (IQR)
$$\text{IQR}(x) = x_{75\%} - x_{25\%}$$
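The three robust measures can be sketched as follows. AAD and MAD follow the definitions above, with deviations taken from the mean. For the IQR, the sketch relies on `statistics.quantiles` with `method="inclusive"`; percentile conventions differ between tools, so this choice is an assumption and other methods may give slightly different quartiles. The data set {1, 2, 3, 4, 5, 90} is reused for illustration.

```python
from statistics import mean, median, quantiles


def aad(values):
    """Absolute average deviation: mean of |x_i - mean|."""
    center = mean(values)
    return mean([abs(x - center) for x in values])


def mad(values):
    """Median absolute deviation, with deviations taken from the mean."""
    center = mean(values)
    return median([abs(x - center) for x in values])


def iqr(values):
    """Interquartile range x_75% - x_25% (inclusive quantile method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 90]
print(f"AAD = {aad(data):.2f}, MAD = {mad(data):.2f}, IQR = {iqr(data):.2f}")
```

On this data the single outlier 90 affects the AAD more than the MAD, and the IQR ignores it entirely.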
More Robust Estimates of
Spread (3)
Example
Exercises
Visualization
Goals and Motivation
- Data visualization is the display of information
in a graphic or tabular format
- The motivation for using visualization is that
people can quickly absorb large amounts of
visual information and find patterns in it
- Visualization is a powerful and appealing
technique for data exploration
- Humans can easily detect general patterns and
trends as well as outliers and unusual patterns
Example
- Sea Surface Temperature (SST) for July 1982
- Tens of thousands of data points are summarized in
a single figure
Outline for this part
- General concepts
- Representation
- Arrangement
- Selection
- Visualization techniques
- Histograms
- Box plots
- Scatter plots
- Contour plots
- …
Representation
- Mapping of information to a visual format
- Data objects, their attributes, and the
relationships among data objects are
translated into graphical elements such as
points, lines, shapes, and colors

- Objects are often represented as points


- Attribute values can be represented as the
position of the points or using color, size, shape, etc.
- Position can express the relationships among
points
Arrangement
- Placement of visual elements within a display
- Can make a large difference in how easy it is to
understand the data

vs.
Selection
- Elimination or the de-emphasis of certain
objects and attributes
- May involve choosing a subset of attributes
- Dimensionality reduction is often used to reduce the
number of dimensions to two or three
- Alternatively, pairs of attributes can be considered
- May also involve choosing a subset of objects
- Visualizing all objects can result in a display that is
too crowded
Outline for this part
- General concepts
- Visualization techniques
- Although techniques are often highly specialized to the data being analyzed, it is possible to group them by some general properties
- For example, visualization techniques for:
- Small number of attributes
- Spatio-temporal data
- High-dimensional data
Outline for this part
- General concepts
- Visualization techniques
- Histograms
- Box plots
- Scatter plots
- Contour plots
- Matrix plots
- Parallel coordinates
- Star plots
- Chernoff faces
Histograms
- Usually shows the distribution of values of a
single variable
- Divide the values into bins and show a bar plot
of the number of objects in each bin.
- The height of each bar indicates the number of
objects
- Shape of histogram depends on the number of
bins
Example
- Petal width

10 bins 20 bins
2D Histograms
- Show the joint distribution of the values of two
attributes
- Each attribute is divided into intervals and the two
sets of intervals define two-dimensional rectangles of
values
- It can show patterns not present in 1D ones
- Visually more complicated, e.g., some columns may
be hidden by others
Example
- Petal width and petal length
Box Plots
- Way of displaying the distribution of data
[Figure: box plot annotated, from top to bottom, with outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, and 10th percentile]
Example
- Comparing attributes
Pie Charts
- Similar to histograms, but typically used with
categorical attributes that have a relatively
small number of values
- Common in popular articles, but used less
frequently in technical publications
- The size of relative areas can be hard to judge
- Histograms are preferred for technical work!
http://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Empirical Cumulative
Distribution Function
Scatter Plots
- Attribute values determine the position
- Two-dimensional scatter plots most common,
but can have three-dimensional scatter plots
- Often additional attributes can be displayed by
using the size, shape, and color of the markers
that represent the objects
Example
Example
- Arrays of scatter plots to summarize the
relationships of several pairs of attributes
Contour Plots
- Useful when a continuous attribute is
measured on a spatial grid
- They partition the plane into regions of similar
values
- The contour lines that form the boundaries of
these regions connect points with equal values
- The most common example is contour maps of
elevation
Example

[Figure: contour plot with a temperature scale in Celsius]
Parallel Coordinates
- Plot the attribute values of high-dimensional
data
- Instead of using perpendicular axes, use a set
of parallel axes
- The attribute values of each object are plotted
as a point on each corresponding coordinate
axis and the points are connected by a line,
i.e., each object is represented as a line
- The ordering of attributes is important
Example
- Different ordering of attributes
Star Plots
- Similar approach to parallel coordinates, but
axes radiate from a central point
- The line connecting the values of an object is a
polygon
Example

Setosa

Versicolour

Virginica
Example
- Other useful information, such as average
values or thresholds, can also be encoded
Chernoff Faces
- Approach created by Herman Chernoff
- Each attribute is associated with a
characteristic of a face
- Size of the face, shape of jaw, shape of forehead, etc.
- The value of the attribute determines the appearance
of the corresponding facial characteristic
- Each object becomes a separate face
- Relies on humans' ability to distinguish faces
Example
Setosa

Versicolour

Virginica
Principles for Visualization
Quality
- Apprehension to perceive relations among variables
- Clarity to distinguish the most important elements
- Consistency with previous, related graphs
- Efficiency to show complex information in simple ways
- Necessity of the graph, vs alternatives
- Truthfulness when using magnitudes, relative to scales
Infographics
OLAP and Multidimensional
Data Analysis
OLAP
- Relational databases put data into tables, while
OLAP uses a multidimensional array
representation
- Such representations of data previously existed in
statistics and other fields
- There are a number of data analysis and data
exploration operations that are easier with
such a data representation
Converting Tabular Data
- Two key steps in converting tabular data into a
multidimensional array
1. Identify which attributes are to be the
dimensions and which attribute is to be the
target attribute
- The attributes used as dimensions must have
discrete values
- The target value is typically a count or continuous
value, e.g., the cost of an item
- Can have no target variable at all except the count of
objects that have the same set of attribute values
Converting Tabular Data (2)
2. Find the value of each entry in the
multidimensional array by summing the values
(of the target attribute) or count of all objects
that have the attribute values corresponding to
that entry
Example
- Petal width and length are discretized to have
categorical values: low, medium, and high
Example
- Each unique tuple of petal width, petal length,
and species type identifies one element of the
array
Example
- Cross-tabulations can be used to show slices
of the multidimensional array
Data Cube
- The key operation of OLAP is the formation
of a data cube
- A data cube is a multidimensional
representation of data, together with all
possible aggregates
- Aggregates that result by selecting a proper subset
of the dimensions and summing over all remaining
dimensions
Example
- Consider a data set that records the sales of
products at a number of company stores at
various dates
- This data can be represented
as a 3 dimensional array
- There are 3 two-dimensional
aggregates, 3 one-dimensional
aggregates, and 1 zero-dimensional
aggregate (the overall total)
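A minimal sketch of how the aggregates of a data cube could be enumerated: for every subset of the dimensions that is kept, sum over all remaining dimensions. The sales records, the dimension order (product, store, date), and the function name `data_cube` are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records: (product, store, date) -> units sold.
sales = {
    ("shirt", "Oslo", "Mon"): 3,
    ("shirt", "Bergen", "Mon"): 1,
    ("shoes", "Oslo", "Tue"): 2,
    ("shoes", "Bergen", "Tue"): 4,
}


def data_cube(cells, num_dims):
    """Return {kept_dims: {key: total}} for every subset of dimensions."""
    cube = {}
    for r in range(num_dims + 1):
        for kept in combinations(range(num_dims), r):
            agg = defaultdict(int)
            for key, value in cells.items():
                # Project the cell onto the kept dimensions, sum the rest away.
                agg[tuple(key[d] for d in kept)] += value
            cube[kept] = dict(agg)
    return cube


cube = data_cube(sales, 3)
print(cube[()])      # zero-dimensional aggregate: the overall total
print(cube[(0,)])    # one-dimensional aggregate: totals per product
print(cube[(0, 1)])  # two-dimensional aggregate: product x store
```

With three dimensions there are 2^3 = 8 subsets: the full array itself, 3 two-dimensional aggregates, 3 one-dimensional aggregates, and the overall total.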
Example
- This table shows one of the two dimensional
aggregates, along with two of the one-
dimensional aggregates, and the overall total
OLAP Operations
- Slicing
- Dicing
- Roll-up
- Drill-down
Slicing and Dicing
- Slicing is selecting a group of cells from the
entire multidimensional array by specifying a
specific value for one or more dimensions
- Dicing involves selecting a subset of cells by
specifying a range of attribute values
- This is equivalent to defining a subarray from the
complete array
- In practice, both operations can also be
accompanied by aggregation over some
dimensions
Example
Roll-up and Drill-down
- Attribute values often have a hierarchical
structure
- Each date is associated with a year, month, and
week
- A location is associated with a continent, country,
state (province, etc.), and city
- Products can be divided into various categories,
such as clothing, electronics, and furniture
- These categories often nest and form a tree
- A year contains months, which contain days
- A country contains states, which contain cities
Example
DAT630
Introduction & Data
Introduction to Data Mining, Chapters 1-2

11/09/2017

Darío Garigliotti | University of Stavanger


Introduction
What is Data Mining?
- (Non-trivial) extraction of implicit, previously
unknown and potentially useful information
from data
- Exploration & analysis, by automatic or
semi-automatic means, of large quantities of
data in order to discover meaningful patterns
- Process to automatically discover useful
information in large data
Motivating challenges
- Availability of large datasets, yet lack of
techniques for extracting useful information.
- Challenges:
- Scalability: by data structures and algorithms
- High dimensionality: affecting effectiveness and
efficiency
- Heterogeneous, complex data
- Integration of distributed data
- Analysis: vs traditional statistical experiments
Typical Workflow
Data Mining Tasks
- Predictive methods
- Use some variables to predict unknown or future
values of other variables
- Descriptive methods
- Find human-interpretable patterns that describe the
data
Classification (predictive)
- Given a collection of records (training set), find
a model that can automatically assign a class
attribute (as a function of the values of other
attributes) to previously unseen records
Training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the training set is used to learn a classifier (model), which is then applied to the test set]
Clustering (descriptive)
- Given a set of data points, each having a set of
attributes, find clusters such that
- Data points in one cluster are more similar to one
another
- Data points in separate clusters are less similar to
one another
Types of Data
What is data?
- Collection of data objects and their attributes
- An attribute (a.k.a. feature, variable, field, component,
etc.) is a property or characteristic of an object
- A collection of attributes describes an object (a.k.a.
record, instance, observation, example, sample, vector)
[Figure: table with columns Tid, Refund, Marital Status, Taxable Income, Cheat and ten rows; the columns are labeled "Attributes" and the rows "Objects"]
Attribute properties
- The type of an attribute depends on which of
the following properties it possesses:

- Distinctness: = !=
- Order: < > <= >=
- Addition: + -
- Multiplication: * /
Types of attributes
- Nominal
- ID numbers, eye color, zip codes
- Ordinal
- Rankings (e.g., taste of potato chips on a scale from
1-10), grades, height in {tall, medium, short}
- Interval
- Calendar dates, temperatures in C or F degrees.
- Ratio
- Temperature in Kelvin, length, time, counts
- Coarser types: categorical and numeric
Attribute types
Attribute type | Description | Examples
Categorical (qualitative)
  Nominal  | Only enough information to distinguish (=, !=) | ID numbers, eye color, zip codes
  Ordinal  | Enough information to order (<, >) | grades {A,B,…F}, street numbers
Numeric (quantitative)
  Interval | The differences between values are meaningful (+, -) | calendar dates, temperature in Celsius or Fahrenheit
  Ratio    | Both differences and ratios between values are meaningful (*, /) | temperature in Kelvin, monetary values, age, length, mass
Transformations
Attribute type | Transformation | Comment
Nominal  | Any permutation of values | If all employee ID numbers were reassigned, would it make any difference?
Ordinal  | An order preserving change: new_value = f(old_value), where f is a monotonic function | {good, better, best} can be represented equally well by the values {1, 2, 3}
Interval | new_value = a * old_value + b, where a and b are constants | The Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree)
Ratio    | new_value = a * old_value | Length can be measured in meters or feet
Discrete vs. continuous
attributes
- Discrete attribute
- Has only a finite or countably infinite set of values
- Examples: zip codes, counts, or the set of words in
a collection of documents
- Often represented as integer variables
- Continuous attribute
- Has real numbers as attribute values
- Examples: temperature, height, or weight.
- Typically represented as floating-point variables
Asymmetric attributes
- Only presence counts (i.e., only non-zero
attribute values)

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Examples
- Time in terms of AM or PM
- Binary, qualitative, ordinal
- Brightness as measured by a light meter
- Continuous, quantitative, ratio
- Brightness as measured by people’s
judgments
- Discrete, qualitative, ordinal
Examples
- Angles as measured in degrees between 0°
and 360°
- Continuous, quantitative, ratio
- Bronze, Silver, and Gold medals as awarded at
the Olympics
- Discrete, qualitative, ordinal
- ISBN numbers for books
- Discrete, qualitative, nominal
Characteristics of
Structured Data
- Dimensionality
- Curse of Dimensionality
- Sparsity
- Only presence counts
- Resolution
- Patterns depend on the scale
Types of data sets
- Record
- Data Matrix
- Document Data
- Transaction Data
- Graph
- Ordered
Record Data
- Consists of a collection of records, each of
which consists of a fixed set of attributes

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
- Data objects have the same fixed set of
numeric attributes
- Can be represented by an m by n matrix
- Data objects can be thought of as points in a multi-
dimensional space, where each dimension
represents a distinct attribute

Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
Document Data
- Documents are represented as term vectors
- each term is a component (attribute) of the vector
- the value of each component is the number of times
the corresponding term occurs in the document

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction Data
- A special type of record data, where each
record (transaction) involves a set of items
- For example, the set of products purchased by a
customer (during one shopping trip) constitute a
transaction, while the individual products that were
purchased are the items
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
- Examples

HTML links Chemical data


Ordered Data
- Sequences of transactions

Items/Events

An element of
the sequence
Ordered Data
- Genomic sequence data

GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
- Spatio-temporal Data

Average Monthly
Temperature of
land and ocean
Non-record Data
- Often converted into record data
- For example: presence of substructures in a set, just
like the transaction items
- Ordered data conversion might lose explicit
representations of relationships
Data Quality
Data Quality Problems
- Data won’t be perfect
- Human error
- Limitations of measuring devices
- Flaws in the data collection process
- Data is of high quality if it is suitable for its
intended use
- Much work in data mining focuses on devising
robust algorithms that produce acceptable
results even when noise is present
Typical Data Quality
Problems
- Noise
- Random component of a measurement error
- For example, distortion of a person’s voice when
talking on a poor phone
- Outliers
- Data objects with characteristics that are
considerably different than most of the other data
objects in the data set
Typical Data Quality
Problems (2)
- Missing values
- Information is not collected
- E.g., people decline to give their age and weight
- Attributes may not be applicable to all cases
- E.g., annual income is not applicable to children

- Solutions
- Eliminate an entire object or attribute
- Estimate them by neighbor values
- Ignore them during analysis
Typical Data Quality
Problems (3)
- Inconsistent data
- Data may have some inconsistencies even among
present, acceptable values
- E.g. Zip code value doesn't correspond to the city value

- Duplicate data
- Data objects that are duplicates, or almost
duplicates of one another
- E.g., Same person with multiple email addresses
Quality Issues from the
Application viewpoint
- Timeliness:
- Aging of data implies aging of patterns on it
- Relevance:
- of the attributes modeling objects
- of the objects as representative of the population
- Knowledge of data:
- Availability of documentation about type of features,
origin, scales, missing values representation
Data Preprocessing
Data Preprocessing
- Different strategies and techniques to make
the data more suitable for data mining
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation
Aggregation
- Combining two or more attributes (or objects)
into a single attribute (or object)
- Purpose
- Data reduction
- Reduce the number of attributes or objects
- Change of scale
- Cities aggregated into regions, states, countries, etc
- More “stable” data
- Aggregated data tends to have less variability
Sampling
- Selecting a subset of the data objects to be
analyzed
- Statisticians sample because obtaining the entire set
of data of interest is too expensive or time
consuming
- Sampling is used in data mining because processing
the entire set of data of interest is too expensive or
time consuming
Sampling
- A sample is representative if it has
approximately the same property (of interest)
as the original set of data
- Key issues: sampling method and sample size
Types of Sampling
- Simple random sampling
- Any particular item is selected with equal probability
- Sampling without replacement
- As each item is selected, it is removed from the
population
- Sampling with replacement
- Objects are not removed from the population as they are
selected (same object can be picked up more than once)

- Stratified sampling
- Split the data into several partitions; then draw
random samples from each partition
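The sampling schemes above can be sketched with Python's standard `random` module. The function names and the toy data are illustrative, and the stratified variant assumes an equal number of draws per partition (drawing proportionally to partition size is another common choice).

```python
import random
from collections import defaultdict


def sample_without_replacement(items, n, rng):
    """Each selected item is removed from the population."""
    return rng.sample(items, n)


def sample_with_replacement(items, n, rng):
    """The same item can be picked more than once."""
    return [rng.choice(items) for _ in range(n)]


def stratified_sample(items, key, n_per_group, rng):
    """Partition items by key(item), then sample within each partition."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n_per_group, len(members))))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
population = [("common", i) for i in range(8)] + [("rare", i) for i in range(2)]
print(sample_without_replacement(population, 3, rng))
print(sample_with_replacement(population, 3, rng))
print(stratified_sample(population, lambda item: item[0], 2, rng))
```

Stratification guarantees that the small "rare" partition is represented, which simple random sampling does not.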
Sample size

8000 points 2000 Points 500 Points


Curse of Dimensionality
- Many types of data analysis become
significantly harder as the dimensionality of the
data increases
- When dimensionality increases, data becomes
increasingly sparse in the space that it occupies
- Definitions of density and distance between points
become less meaningful
Dimensionality Reduction
- Purpose
- Avoid curse of dimensionality
- Reduce amount of time and memory required by
data mining algorithms
- Allow data to be more easily visualized
- May help to eliminate irrelevant features or reduce
noise
- Techniques
- Linear algebra techniques
- Feature subset selection
Linear Algebra Techniques
- Project the data from a high-dimensional space
into a lower-dimensional space
- Principal Component Analysis (PCA)
- Find new attributes (principal components) that are
- linear combinations of the original attributes
- orthogonal to each other
- capture the maximum amount of variation in the data
- See http://setosa.io/ev/principal-component-analysis/
- Singular Value Decomposition (SVD)
Feature Subset Selection
- Redundant features
- Duplicate much or all of the information contained in
one or more other attributes
- Example: purchase price of a product and the
amount of sales tax paid
- Irrelevant features
- Contain no information that is useful for the data
mining task at hand
- Example: students' ID is often irrelevant to the task
of predicting students' GPA
Feature Subset Selection
Approaches
- Brute-force approach
- Try all possible feature subsets as input to data
mining algorithm
- Embedded approaches
- Feature selection occurs naturally as part of the data
mining algorithm
- Filter approaches
- Features are selected before data mining algorithm
is run
- Wrapper approaches
- Use the data mining algorithm as a black box to find
best subset of attributes
Feature Subset Selection
Architecture
- Search
- Tradeoff between complexity and optimality
- Evaluation
- A way to predict goodness of the selection
- Stopping
- E.g. number of iterations; evaluation regarding
threshold; size of feature subset
- Validation
- Comparing performance of the selected subset vs.
other selections (or the full set)
Feature Creation
- Create from the original attributes a new set of
attributes that captures the important
information more effectively
- Feature extraction
- E.g. pixels vs higher-level features in face recognition
- Mapping data to a new space
- E.g. recovering frequencies from noisy time series
- Feature construction
- E.g. constructing density (using given mass and volume)
for material classification
Binarization and
Discretization
- Binarization: converting a categorical attribute
to binary values
- Discretization: transforming a continuous
attribute to a categorical attribute
- Decide how many categories to have
- Determine how to map the values of the continuous
attribute to these categories
- Unsupervised: equal width, equal frequency
- Supervised
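A minimal sketch of the two unsupervised discretization schemes, assuming bins are numbered from 0 and that ties in the equal-frequency variant may be split across neighboring bins. The function names and data are illustrative.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign (roughly) the same number of objects to each of k bins."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(values, 3))
print(equal_frequency_bins(values, 3))
```

With an outlier in the data, equal width puts almost every object into the first bin, while equal frequency still produces balanced categories.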
Discretization Without
Using Class Labels

[Figure: four panels: Data, Equal interval width, Equal frequency, K-means]


Attribute Transformation
- A function that maps the entire set of values of
a given attribute to a new set of replacement
values such that each old value can be
identified with one of the new values
- Simple functions: x^k, log(x), e^x, |x|, sin x, sqrt x,
1/x, …
- Normalization: when different variables are to
be combined in some way
Proximity Measures
Proximity
- Proximity refers to either similarity or
dissimilarity between two objects
- Similarity
- Numerical measure of how alike two data objects are;
higher when objects are more alike
- Often falls in the range [0,1]
- Dissimilarity
- Numerical measure of how different two data
objects are; lower when objects are more alike
- Falls in the interval [0,1] or [0,infinity)
Transformations
- To convert a similarity to a dissimilarity or vice
versa
- To transform a proximity measure to fall within
a certain range (e.g., [0,1])
- Min-max normalization

s' = \frac{s - \min_s}{\max_s - \min_s}
(Dis)similarity for a
Single Attribute
Example
- Objects with a single original attribute that
measures the quality of the product
- {poor, fair, OK, good, wonderful}
- poor=0, fair=1, OK=2, good=3, wonderful=4
- What is the similarity between p="good" and
q="wonderful"?

s = 1 - \frac{|p - q|}{n - 1} = 1 - \frac{|3 - 4|}{5 - 1} = 1 - \frac{1}{4} = 0.75
Dissimilarities between
Data Objects
- Some examples of distances to show the
desired properties of a dissimilarity
- Objects have n attributes; x_k is the k-th attribute
- Euclidean distance
d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}
Minkowski Distance
- Generalization of the Euclidean Distance
d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}

- r=1 City block (Manhattan) distance (L1 norm)
- r=2 Euclidean distance (L2 norm)
- r=∞ Supremum distance (Lmax norm)
- Max difference between any attribute of the objects
d(x, y) = \lim_{r \to \infty} \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}
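A minimal sketch of the Minkowski distance for the three values of r listed above. The function name `minkowski` is illustrative; the r=∞ case is computed directly as the maximum attribute difference rather than as a numerical limit.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance between two equal-length vectors."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Supremum distance: the largest difference on any attribute.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


p1, p2 = (0, 2), (2, 0)
print(minkowski(p1, p2, 1))         # Manhattan (L1)
print(minkowski(p1, p2, 2))         # Euclidean (L2)
print(minkowski(p1, p2, math.inf))  # Supremum (Lmax)
```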
Example
Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix (L2)
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
Example
Manhattan Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix (L1)
     p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0
Example
Supremum Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix (L∞)
     p1  p2  p3  p4
p1   0   2   3   5
p2   2   0   1   3
p3   3   1   0   2
p4   5   3   2   0
Distance Properties
1. Positivity
- d(x,y) >= 0 for all x and y
- d(x,y) = 0 only if x=y
2. Symmetry
- d(x,y) = d(y,x) for all x and y
3. Triangle Inequality
- d(x,z) <= d(x,y) + d(y,z) for all x, y, and z

- A measurement that satisfies these properties
is a metric. A distance is a metric dissimilarity
Similarity Properties

1. s(x,y) = 1 only if x=y
2. s(x,y) = s(y,x) (Symmetry)
- There is no general analog of the triangle
inequality
- Some similarity measures can be converted to
a metric distance
- E.g., Jaccard similarity
Similarity between Binary
Vectors
- Common situation is that objects, p and q, have
only binary attributes
- f01 = the number of attributes where p was 0 and q was 1

- f10 = the number of attributes where p was 1 and q was 0

- f00 = the number of attributes where p was 0 and q was 0

- f11 = the number of attributes where p was 1 and q was 1


Similarity between Binary
Vectors
- Simple Matching Coefficient
- number of matching attribute values divided by the
number of attributes
SMC = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}
- Jaccard Coefficient
- Ignore 0-0 matches

J = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}
SMC versus Jaccard
p= 1000000000
q= 0000001001

f01 = 2 (the number of attributes where p was 0 and q was 1)


f10 = 1 (the number of attributes where p was 1 and q was 0)
f00 = 7 (the number of attributes where p was 0 and q was 0)
f11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (f11 + f00)/(f01 + f10 + f11 + f00) = (0+7) / (2+1+0+7) = 0.7

J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
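The comparison above can be reproduced with a short sketch. The function name `binary_similarities` is illustrative, and returning J = 0 when both vectors are all zeros is an assumption made here to avoid a division by zero.

```python
def binary_similarities(p, q):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    denom = f01 + f10 + f11
    # Assumption: Jaccard is defined as 0 when there are no 1s at all.
    jaccard = f11 / denom if denom else 0.0
    return smc, jaccard


p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_similarities(p, q))
```

Because Jaccard ignores 0-0 matches, it is the better fit for asymmetric attributes, where only presence counts.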


Cosine similarity
- Similarity for real-valued vectors
- Objects have n attributes; x_k is the k-th attribute
cos(x, y) = \frac{x \cdot y}{||x|| \, ||y||}
- Vector dot product: x \cdot y = \sum_{k=1}^{n} x_k y_k
- Length of vector: ||x|| = \sqrt{\sum_{k=1}^{n} x_k^2}, ||y|| = \sqrt{\sum_{k=1}^{n} y_k^2}
Example
     attr 1  attr 2  attr 3  attr 4  attr 5
x    1       0       1       0       3
y    0       2       4       0       1

x \cdot y = 1*0 + 0*2 + 1*4 + 0*0 + 3*1 = 7
||x|| = \sqrt{1^2 + 0^2 + 1^2 + 0^2 + 3^2} = \sqrt{11} = 3.32
||y|| = \sqrt{0^2 + 2^2 + 4^2 + 0^2 + 1^2} = \sqrt{21} = 4.58
cos(x, y) = 7 / (3.32 * 4.58) = 0.46
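A minimal sketch of cosine similarity, checked against the five-attribute vectors x = (1, 0, 1, 0, 3) and y = (0, 2, 4, 0, 1). The function name `cosine` is illustrative, and the sketch assumes neither vector is all zeros.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = [1, 0, 1, 0, 3]
y = [0, 2, 4, 0, 1]
print(round(cosine(x, y), 2))
```

Scaling a vector does not change the result: (1, 2) and (2, 4) have cosine similarity 1 because they point in the same direction.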
Geometric Interpretation
     attr 1  attr 2
x    1       0
y    0       2

cos(x, y) = 0 = cos(90°)
[Figure: x and y plotted on the attr 1 / attr 2 axes, at an angle of 90°]
Geometric Interpretation
     attr 1  attr 2
x    4       2
y    1       3

cos(x, y) = 0.71 = cos(45°)
[Figure: x and y plotted on the attr 1 / attr 2 axes, at an angle of 45°]
Geometric Interpretation
     attr 1  attr 2
x    1       2
y    2       4

cos(x, y) = 1 = cos(0°)
[Figure: x and y plotted on the attr 1 / attr 2 axes, at an angle of 0°]
DAT630
Retrieval Evaluation
Search Engines, Chapter 8

04/09/2017

Krisztian Balog | University of Stavanger


Today

Figure 2.2
Evaluation
- Evaluation is key to building effective and
efficient search engines
- Measurement usually carried out in controlled
laboratory experiments
- Online testing can also be done
- Effectiveness, efficiency and cost are related
- E.g., if we want a particular level of effectiveness and
efficiency, this will determine the cost of the system
configuration
- Efficiency and cost targets may impact effectiveness
Evaluation Corpus
- To ensure repeatable experiments and fair
comparison of results from different systems
- Test collections consist of
- Documents
- Queries
- Relevance judgments
- (Evaluation metrics)
Text REtrieval Conference
(TREC)
- Organized by the US National Institute of
Standards and Technology (NIST)
- Yearly benchmarking cycle
- Development of test collections for various
information retrieval tasks
- Relevance judgments created by retired CIA
information analysts
TREC Assessors at Work
Example Test Collections
Example Collections
ClueWeb09/12 collections

- ClueWeb09
- 1 billion web pages in 10 languages
- 5TB compressed, 25TB uncompressed
- http://lemurproject.org/clueweb09/
- ClueWeb12
- 733 million English web pages
- http://lemurproject.org/clueweb12/
TREC Topic Example
- Short query (like in web search)
- Longer (more precise) version of the query
- Description of the criteria for relevance

Relevance Judgments
- Obtaining relevance judgments is an
expensive, time-consuming process
- Who does it?
- What are the instructions?
- What is the level of agreement?
- TREC judgments
- Depend on task being evaluated
- Generally binary
- Agreement is good because of “narrative”
Pooling
- Exhaustive judgment of all documents in a
collection is not practical
- Pooling technique is used in TREC
- Top k results (for TREC, k varied between 50 and 200)
from the rankings obtained by different search engines
(or retrieval algorithms) are merged into a pool
- Duplicates are removed
- Documents are presented in some random order to
the relevance judges
- Produces a large number of relevance judgments
for each query, although still incomplete
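The pooling steps for a single query can be sketched as follows. The function name `build_pool`, the system rankings, and the document IDs are hypothetical.

```python
import random


def build_pool(rankings, k, rng):
    """Merge the top-k results of several rankings into a shuffled pool."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])  # set union removes duplicates
    pool = sorted(pool)           # fixed starting order, for repeatability
    rng.shuffle(pool)             # judges see documents in random order
    return pool


# Hypothetical rankings from three systems for the same query.
system_a = ["d1", "d2", "d3"]
system_b = ["d2", "d4", "d1"]
system_c = ["d5", "d1", "d6"]

pool = build_pool([system_a, system_b, system_c], k=2, rng=random.Random(42))
print(pool)
```

Documents outside every system's top k (here d3 and d6) are never judged, which is why the resulting judgments are incomplete.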
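Pooling (for a given query)
[Figure: the top k results of System A, System B, and System C are merged into the pooled results, which go to the assessors]
Assessment task
[Figure: for each document in the pool, the assessor is shown the query, its description/narrative, and the document, and provides an assessment]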
Crowdsourcing
- Obtain relevance judgments on a crowdsourcing
platform
- "Microtasks", performed in parallel by large, paid
crowds
- Platforms
- Amazon Mechanical Turk (US)
- Crowdflower (EU)
- https://www.crowdflower.com/use-case/search-relevance/
Example crowdsourcing
task
Query Logs
- Used for both tuning and evaluating search
engines
- Also for various techniques such as query
suggestion
- Typical contents
- User identifier or user session identifier
- Query terms - stored exactly as user entered
- List of URLs of results, their ranks on the result list,
and whether they were clicked on
- Timestamp(s) - records the time of user events such
as query submission, clicks
AOL query log
AOL query log fiasco
Query Logs
- Clicks are not relevance judgments
- Although they are correlated
- Biased by a number of factors such as rank on
result list
- Can use clickthrough data to predict
preferences between pairs of documents
- Appropriate for tasks with multiple levels of
relevance, focused on user relevance
- Various “policies” used to generate preferences
Example Click Policy
- Skip Above and Skip Next
- Given a set of results for a query and a clicked result
at rank position p
- all unclicked results ranked above p are predicted to be
less relevant than the result at p
- unclicked results immediately following a clicked result
are less relevant than the clicked result
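A minimal sketch of the Skip Above and Skip Next policy, assuming the result list is given in rank order and clicks as a set of document IDs. The function name and the IDs are illustrative; each generated pair reads (preferred, less relevant).

```python
def skip_above_and_next(results, clicked):
    """Generate preference pairs from one result list and its clicks."""
    prefs = set()
    for p, doc in enumerate(results):
        if doc not in clicked:
            continue
        # Skip Above: unclicked results ranked above the clicked result.
        for above in results[:p]:
            if above not in clicked:
                prefs.add((doc, above))
        # Skip Next: the unclicked result immediately after the click.
        if p + 1 < len(results) and results[p + 1] not in clicked:
            prefs.add((doc, results[p + 1]))
    return prefs


results = ["d1", "d2", "d3", "d4"]
clicked = {"d3"}
for preferred, other in sorted(skip_above_and_next(results, clicked)):
    print(preferred, ">", other)
```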

click data generated preferences


Query Logs
- Click data can also be aggregated to remove
noise
- Click distribution information
- Can be used to identify clicks that have a higher
frequency than would be expected
- High correlation with relevance
- E.g., using click deviation to filter clicks for
preference-generation policies
Filtering Clicks
- Click deviation CD(d, p) for a result d in
position p:
CD(d, p) = O(d, p) - E(p)
- O(d,p): observed click frequency for a document in a
rank position p over all instances of a given query
- E(p): expected click frequency at rank p averaged
across all queries
[Figure: probability of click, P(i), by rank position i, for ranks 1 to 10]
Comparison
         | Expert judges      | Crowd workers                  | Implicit judgments
Setting  | Artificial         | Artificial                     | Realistic
Quality  | Excellent*         | Good*                          | Noisy
Cost     | Very expensive     | Moderately expensive           | Cheap
Scaling  | Doesn't scale well | Scales to some extent (budget) | Scales very well

* But the quality of the data is only as good as the assessment guidelines
Effectiveness Measures
A is the set of relevant documents,
B is the set of retrieved documents
Recall = \frac{|A \cap B|}{|A|}
Precision = \frac{|A \cap B|}{|B|}
F-measure
- Harmonic mean of recall and precision
F = \frac{2RP}{R + P}
- The harmonic mean emphasizes the importance of small
values, whereas the arithmetic mean is affected
more by outliers that are unusually large
- More general form
F_\beta = \frac{(\beta^2 + 1) RP}{R + \beta^2 P}
- β is a parameter that determines relative importance
of recall and precision
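A minimal sketch of the set-based measures, assuming the standard definitions Precision = |A ∩ B| / |B|, Recall = |A ∩ B| / |A|, and F_β = (β² + 1)RP / (R + β²P), with A the relevant set and B the retrieved set. Function names and document IDs are illustrative; returning 0 for empty sets is a convention chosen here.

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall."""
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of recall and precision (beta=1: plain F)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (recall + b2 * precision)


A = {"d1", "d2", "d3", "d4"}  # relevant documents
B = {"d1", "d2", "d5"}        # retrieved documents
P, R = precision_recall(A, B)
print(P, R, f_measure(P, R))
```

Here the arithmetic mean of P and R would be about 0.58, while the harmonic mean is about 0.57; the gap widens as one of the two values gets small.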
Evaluating Rankings
- Precision and Recall are set-based metrics
- How to evaluate a ranked list?
- Calculate recall and precision values at every rank
position
Ranking Effectiveness
Evaluating Rankings
- Precision and Recall are set-based metrics
- How to evaluate a ranked list?
- Calculate recall and precision values at every rank
position
- Produces a long list of numbers (see previous slide)
- Need to summarize the effectiveness of a ranking
Summarizing a Ranking
- Calculating recall and precision at fixed rank
positions
- Calculating precision at standard recall levels,
from 0.0 to 1.0
- Requires interpolation
- Averaging the precision values from the rank
positions where a relevant document was
retrieved
Fixed Rank Positions
- Compute precision/recall at a given rank
position p
- E.g., precision at 20 (P@20)
- Typically precision at 10 or 20
- This measure does not distinguish between
differences in the rankings at positions 1 to p
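Precision and recall at a fixed rank can be sketched as follows. The ranking and the relevance judgments are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)


ranking = ["d3", "d7", "d1", "d9", "d4"]      # hypothetical system output
relevant = {"d1", "d3", "d4", "d8"}           # hypothetical ground truth

print(precision_at_k(ranking, relevant, 3))
print(precision_at_k(ranking, relevant, 5))
print(recall_at_k(ranking, relevant, 5))
```

Swapping "d3" and "d7" in the ranking leaves P@3 unchanged, which is the insensitivity noted in the last bullet.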
Example
[Figure: example ranking with Precision @5 and Precision @10]
Standard Recall Levels
- Calculating precision at standard recall levels,
from 0.0 to 1.0
- Each ranking is then represented using 11 numbers
- Values of precision at these standard recall levels are
often not available, for example:

- Interpolation is needed
Recall-Precision Graph
[Figure: recall-precision graphs for Query 1 and Query 2]
Interpolation
- To average graphs, calculate precision at
standard recall levels:
$P(R) = \max\{P' : R' \ge R \wedge (R', P') \in S\}$
- where S is the set of observed (R,P) points
- Defines precision at any recall level as the
maximum precision observed in any recall-
precision point at a higher recall level
- Produces a step function
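A minimal sketch of interpolated precision at the 11 standard recall levels, assuming binary relevance flags for a ranked list. The flags and the number of relevant documents are illustrative.

```python
def recall_precision_points(rel_flags, num_relevant):
    """Observed (recall, precision) points, one per rank position."""
    points, hits = [], 0
    for rank, is_rel in enumerate(rel_flags, start=1):
        hits += is_rel
        points.append((hits / num_relevant, hits / rank))
    return points


def interpolated_precision(points, levels=None):
    """Max precision over all observed points whose recall is >= the level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


rel_flags = [1, 0, 1, 0, 0, 1]   # hypothetical ranking: 1 = relevant
points = recall_precision_points(rel_flags, num_relevant=3)
interp = interpolated_precision(points)
print(interp)
```

Because each value is a maximum over points at equal or higher recall, the resulting sequence never increases as recall grows, which gives the step shape.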
Interpolation
Average Precision
- Average the precision values from the rank
positions where a relevant document was
retrieved
- If a relevant document is not retrieved (in the top K
ranks, e.g., K=1000) then its contribution is 0.0
- Single number that is based on the ranking of all the
relevant documents
- The value depends heavily on the highly ranked
relevant documents
Average Precision
$AP = \frac{1}{|Rel|} \sum_{\substack{i = 1, \dots, n \\ d_i \in Rel}} P(i)$
- P(i): precision at rank i
- |Rel|: total number of relevant documents according to the
ground truth
- Only relevant documents contribute to the sum
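Average precision can be sketched directly from the definition. The ranking and judgments below are hypothetical; a relevant document that is never retrieved contributes 0 because the sum is still divided by |Rel|.

```python
def average_precision(ranking, relevant):
    """Sum of P(i) at ranks where a relevant document appears, divided by |Rel|."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank   # precision at this rank
    return total / len(relevant)


ranking = ["d1", "d5", "d2", "d6", "d3"]   # hypothetical system output
relevant = {"d1", "d2", "d3", "d4"}        # d4 is never retrieved
ap = average_precision(ranking, relevant)
print(ap)
```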
Average Precision
Averaging Across Queries
- So far: measuring ranking effectiveness on a
single query
- Need: measure ranking effectiveness on a set
of queries
- Average is computed over the set of queries
Mean Average Precision
(MAP)
- Summarize rankings from multiple queries by
averaging average precision
- Very succinct summary
- Most commonly used measure in research
papers
- Assumes user is interested in finding many
relevant documents for each query
- Requires many relevance judgments
MAP
Recall-Precision Graph
- Give more detail on the effectiveness of the
ranking algorithm at different recall levels
- Graphs for individual queries have very
different shapes and are difficult to compare
- Averaging precision values for standard recall
levels over all queries
Average Recall-Precision
Graph
Graph for 50 Queries
Other Metrics
- Focusing on the top documents
- Using graded relevance judgments
- E.g., web search engine companies often use a 6-
point scale: bad (0) … perfect (5)
Focusing on Top Documents
- Users tend to look at only the top part of the
ranked result list to find relevant documents
- Some search tasks have only one relevant
document
- E.g., navigational search, question answering
- Recall is not appropriate
- Instead need to measure how well the search engine
does at retrieving relevant documents at very high
ranks
Focusing on Top Documents
- Precision at Rank R
- R typically 5, 10, 20
- Easy to compute, average, understand
- Not sensitive to rank positions less than R
- Reciprocal Rank
- Reciprocal of the rank at which the first relevant
document is retrieved
- Mean Reciprocal Rank (MRR) is the average of the
reciprocal ranks over a set of queries
- Very sensitive to rank position
Mean Reciprocal Rank

Reciprocal rank (RR) = 1/1 = 1.0

Reciprocal rank (RR) = 1/2 = 0.5

Mean reciprocal rank (MRR) = (1.0 + 0.5) /2 = 0.75
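The same computation as a short sketch. The two rankings are made up so that the first relevant document appears at rank 1 and rank 2 respectively.

```python
def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranking, relevant set) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3"], {"d1"}),   # first relevant at rank 1
    (["d4", "d5", "d6"], {"d5"}),   # first relevant at rank 2
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```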


Exercise
Discounted Cumulative
Gain
- Popular measure for evaluating web search
and related tasks
- Two assumptions:
- Highly relevant documents are more useful than
marginally relevant documents
- The lower the ranked position of a relevant
document, the less useful it is for the user, since it is
less likely to be examined
Discounted Cumulative
Gain
- Uses graded relevance as a measure of the
usefulness, or gain, from examining a
document
- Gain is accumulated starting at the top of the
ranking and may be reduced, or discounted, at
lower ranks
- Typical discount is 1/log (rank)
- With base 2, the discount at rank 4 is 1/2, and at
rank 8 it is 1/3
Discounted Cumulative
Gain
- DCG is the total gain accumulated at a
particular rank p:
$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$
- rel_i is the graded relevance level of the document
retrieved at rank i
- Alternative formulation:
$DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log(1 + i)}$
- used by some web search companies
- emphasis on retrieving highly relevant documents
DCG Example
- 10 ranked documents judged on 0-3 relevance
scale:
3, 2, 3, 0, 0, 1, 2, 2, 3, 0
- discounted gain:
3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
= 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
- DCG:
3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
Normalized DCG
- DCG numbers are averaged across a set of
queries at specific rank values
- Typically at rank 5 or 10
- E.g., DCG at rank 5 is 6.89 and at rank 10 is 9.61
- DCG values are often normalized by comparing
the DCG at each rank with the DCG value for
the perfect ranking
- Makes averaging easier for queries with different
numbers of relevant documents
NDCG Example
- Perfect ranking:
3, 3, 3, 2, 2, 2, 1, 0, 0, 0
- ideal DCG values:
3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
- NDCG values (divide actual by ideal):
1, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
- NDCG ≤ 1 at any rank position
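The DCG and NDCG examples can be reproduced with a short sketch. It assumes the first formulation (rank 1 undiscounted, a log base 2 discount from rank 2 onward) and takes the ideal ranking to be the same gains sorted in decreasing order.

```python
import math


def dcg(gains):
    """Cumulative DCG value at every rank position."""
    total, values = 0.0, []
    for i, gain in enumerate(gains, start=1):
        total += gain if i == 1 else gain / math.log2(i)
        values.append(total)
    return values


def ndcg(gains):
    """DCG at each rank divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return [a / b if b > 0 else 0.0 for a, b in zip(dcg(gains), ideal)]


gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]   # graded judgments from the example
print([round(v, 2) for v in dcg(gains)])
print([round(v, 2) for v in ndcg(gains)])
```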
Exercise
Significance Testing
- Given the results from a number of queries,
how can we conclude that ranking algorithm A
is better than algorithm B?
- A significance test enables us to reject the null
hypothesis (no difference) in favor of the
alternative hypothesis (B is better than A)
- The power of a test is the probability that the test will
reject the null hypothesis correctly
- Increasing the number of queries in the experiment
also increases power of test
Recipe
Performance Analysis
- Typically, system A (baseline) is compared
against system B (improved version)
- Average numbers can hide important details
about the performance of individual queries
- Important to analyze which queries were
helped and which were hurt
[Figure: distribution of query effectiveness improvements]
[Figure: query-level performance differences]
Efficiency Metrics
- Elapsed indexing time
- Amount of time necessary to build a document
index on a particular system
- Indexing processor time
- CPU seconds used in building a document index
- Similar to elapsed time, but does not count time waiting
for I/O or speed gains from parallelism

- Query throughput
- Number of queries processed per second
Efficiency Metrics
- Query latency
- The amount of time a user must wait after issuing a
query before receiving a response, measured in
milliseconds
- Often measured with the median
- Indexing temporary space
- Amount of temporary disk space used while creating
an index
- Index size
- Amount of storage necessary to store the index files
Summary
- No single measure is the correct one for any
application
- Choose measures appropriate for task
- Use a combination
- Shows different aspects of the system effectiveness
- Use significance tests
- Analyze performance of individual queries
DAT630
Retrieval Models
Search Engines, Chapter 7

28/08/2017

Krisztian Balog | University of Stavanger


So far…
[Figure 2.1]
Today
[Figure 2.2]
Boolean Retrieval
Boolean Retrieval
- Two possible outcomes for query processing
- TRUE and FALSE (relevance is binary)
- “Exact-match” retrieval
- Query usually specified using Boolean operators
- AND, OR, NOT
- Can be extended with wildcard and proximity
operators
- Assumes that all documents in the retrieved set
are equally relevant
Boolean Retrieval
- Many search systems you still use are
Boolean:
- Email, library catalog, …
- Very effective in some specific domains
- E.g., legal search
- E.g., patent search
- Expert users
Boolean View of a Collection
- Each row represents the view of a particular term: What
documents contain this term?
- Like an inverted list
- To execute a query
- Pick out rows corresponding to query terms
- Apply the logic table of the corresponding Boolean
operator

Term  | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8
aid   | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
all   | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0
back  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0
brown | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0
come  | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1
dog   | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0
fox   | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0
good  | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1
jump  | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
lazy  | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0
men   | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1
now   | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1
over  | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1
party | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1
quick | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0
their | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0
time  | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0
Example Queries
Term      | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8
dog       | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0
fox       | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0

dog ∧ fox | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | dog AND fox → Doc 3, Doc 5
dog ∨ fox | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | dog OR fox → Doc 3, Doc 5, Doc 7
dog ¬ fox | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | dog AND NOT fox → empty
fox ¬ dog | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | fox AND NOT dog → Doc 7
Example Query
good AND party AND NOT over

Term      | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8
good      | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1
party     | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1
over      | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1

g ∧ p     | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | good AND party → Doc 6, Doc 8
over      | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1
g ∧ p ¬ o | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | good AND party AND NOT over → Doc 6
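The row-wise evaluation can be sketched with plain bit lists. The three incidence rows are copied from the example; the helper names are illustrative, not part of any library.

```python
# Term incidence rows for Doc 1 .. Doc 8, taken from the example.
incidence = {
    "good":  [0, 1, 0, 1, 0, 1, 0, 1],
    "party": [0, 0, 0, 0, 0, 1, 0, 1],
    "over":  [1, 0, 1, 0, 1, 0, 1, 1],
}


def bool_and(a, b):
    return [x & y for x, y in zip(a, b)]


def bool_or(a, b):
    return [x | y for x, y in zip(a, b)]


def bool_not(a):
    return [1 - x for x in a]


def matching_docs(row):
    """Turn a bit row back into document names."""
    return [f"Doc {i}" for i, bit in enumerate(row, start=1) if bit]


# good AND party AND NOT over
row = bool_and(bool_and(incidence["good"], incidence["party"]),
               bool_not(incidence["over"]))
result = matching_docs(row)
print(result)
```

Every document in the result is treated as equally relevant; the bit rows carry no information that could be used to order them.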
Example of Query
(Re)formulation
lincoln
- Retrieves a large number of documents
- User may attempt to narrow the scope

president AND lincoln


- Also retrieves documents about the management of
the Ford Motor Company and Lincoln cars

Ford Motor Company today announced that Darryl Hazel will
succeed Brian Kelly as president of Lincoln Mercury.
Example of Query
(Re)formulation
- User may try to eliminate documents about cars

president AND lincoln
AND NOT (automobile OR car)
- This would remove any document that contains
even a single mention of "automobile" or "car"
- For example, a sentence in a biography:

Lincoln’s body departs Washington in a nine-car funeral train.


Example of Query
(Re)formulation
- If the retrieved set is too large, the user may try to
further narrow the query by adding additional words
that occur in biographies

president AND lincoln


AND (biography OR life OR birthplace)
AND NOT (automobile OR car)

- This query may do a reasonable job at retrieving a


set containing some relevant documents
- But it does not provide a ranking of documents
Example
- WestLaw.com: Largest commercial (paying
subscribers) legal search service
- Example query:
- What is the statute of limitations in cases involving
the federal tort claims act?

- LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
- ! = wildcard, /3 = within 3 words, /S = in same sentence


Boolean Retrieval
- Advantages
- Results are relatively easy to explain
- Many different features can be incorporated
- Efficient processing since many documents can be
eliminated from search
- We do not miss any relevant document
Boolean Retrieval
- Disadvantages
- Effectiveness depends entirely on user
- Simple queries usually don’t work well
- Complex queries are difficult to create accurately
- No ranking
- No control over result set size: either too many docs
or none
- What about partial matches? Documents that “don’t
quite match” the query may be useful also
Ranked Retrieval
General Scoring Formula

$score(d, q) = \sum_{t \in q} w_{t,d} \cdot w_{t,q}$
- score(d, q): relevance score, computed for each
document d in the collection for a given input query q
- Documents are returned in decreasing order of this score
- It is enough to consider terms in the query
- w_{t,d}: term's weight in the document
- w_{t,q}: term's weight in the query
Example 1:
Term presence/absence
- The score is the number of matching query
terms in the document
$score(d, q) = \sum_{t \in q} w_{t,d} \cdot w_{t,q}$
$w_{t,q} = f_{t,q}$
$w_{t,d} = \begin{cases} 1, & f_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$
- f_{t,d} is the number of occurrences of term t in document d
- f_{t,q} is the number of occurrences of term t in query q
Term Weighting
- Instead of using raw term frequencies, assign a
weight that reflects the term’s importance
Example 2:
Log-frequency Weighting

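$w_{t,d} = \begin{cases} 1 + \log f_{t,d}, & f_{t,d} > 0 \\ 0, & \text{otherwise} \end{cases}$

Raw term frequency f_{t,d} | w_{t,d}
0    | 0
1    | 1
2    | 1.3
10   | 2
1000 | 4

(The values in the table correspond to a base-10 logarithm.)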
Example 2:
Log-frequency Weighting
$score(d, q) = \sum_{t \in q} w_{t,d} \cdot w_{t,q}$
$score(d, q) = \sum_{t \in q} (1 + \log f_{t,d}) \cdot f_{t,q}$
Query Processing
- Strategies for processing the data in the index
for producing query results
- Document-at-a-time
- Calculates complete scores for documents by
processing all term lists, one document at a time
- Term-at-a-time
- Accumulates scores for documents by processing
term lists one at a time
- Both approaches have optimization techniques
that significantly reduce time required to
generate scores
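A minimal sketch of term-at-a-time processing. The postings (document ID, term frequency) are hypothetical, and raw term frequency stands in for the term weight; a real engine would plug in its own weighting and apply the optimizations mentioned above.

```python
from collections import defaultdict

# Hypothetical inverted lists: term -> [(doc_id, term_frequency), ...]
index = {
    "salt":     [(1, 1), (4, 2)],
    "water":    [(1, 1), (2, 1), (4, 1)],
    "tropical": [(1, 2), (2, 2), (3, 1)],
}


def term_at_a_time(query_terms, index):
    """Process one inverted list at a time, adding partial scores to accumulators."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in index.get(term, []):
            accumulators[doc_id] += tf   # partial score for this term
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = term_at_a_time(["salt", "water", "tropical"], index)
print(ranked)
```

Document-at-a-time processing would compute the same scores but walk all three lists in parallel, finishing one document before moving on to the next.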
Document-at-a-Time
[Figure 5.15: inverted lists for "salt", "water", and "tropical", and the collected scores per document]
Term-at-a-Time
[Figure 5.17]
The Vector Space Model
The Vector Space Model
- Basis of most IR research in the 1960s and 70s
- Still used
- Provides a simple and intuitively appealing
framework for implementing
- Term weighting
- Ranking
- Relevance feedback
Representation
- Documents and query represented by a vector
of term weights

- Collection represented by a matrix of term


weights
Bag-of-Words Model
- Vector representation doesn’t consider the
ordering of words in a document
- "John is quicker than Mary" and "Mary is
quicker than John" have the same vectors
Scoring Documents
- Documents “near” the
query’s vector (i.e.,
more similar to the
query) are more likely
to be relevant to the
query
Scoring Documents
- The score for a document is computed using
the cosine similarity of the document and
query vectors
$cosine(d, q) = \frac{\sum_t w_{t,d} \cdot w_{t,q}}{\sqrt{\sum_t w_{t,d}^2} \sqrt{\sum_t w_{t,q}^2}}$
Term Weighting
Zipf’s Law
Weighting Terms
- Intuition
- Terms that appear often in a document should get
high weights
- The more often a document contains the term “dog”, the
more likely that the document is “about” dogs
- Terms that appear in many documents should get
low weights
- E.g., stopword-like words

- How do we capture this mathematically?


- Term frequency
- Inverse document frequency
Term Frequency (TF)
- Reflects the importance of a term in a
document (or query)
- Variants
- binary: $tf_{t,d} \in \{0, 1\}$
- raw frequency: $tf_{t,d} = f_{t,d}$
- normalized: $tf_{t,d} = f_{t,d} / |d|$
- log-normalized: $tf_{t,d} = 1 + \log f_{t,d}$
- …
- f_{t,d} is the number of occurrences of term t in
the document and |d| is the length of d
Inverse Document
Frequency (IDF)
- Reflects the importance of the term in the
collection of documents
- The more documents that a term occurs in, the less
discriminating the term is between documents,
consequently, the less useful for retrieval
$idf_t = \log \frac{N}{n_t}$
- where N is the total number of documents and n_t is
the number of documents that contain term t
- log is used to "dampen" the effect of IDF
Term Weights
- Combine TF and IDF weights by multiplying
them:
$tfidf_{t,d} = tf_{t,d} \cdot idf_t$
- Term frequency weight measures importance in
document
- Inverse document frequency measures importance
in collection
Scoring Documents
- The score for a document is computed using
the cosine similarity of the document and
query vectors
$cosine(d, q) = \frac{\sum_t w_{t,d} \cdot w_{t,q}}{\sqrt{\sum_t w_{t,d}^2} \sqrt{\sum_t w_{t,q}^2}}$
$cosine(d, q) = \frac{\sum_t tfidf_{t,d} \cdot tfidf_{t,q}}{\sqrt{\sum_t tfidf_{t,d}^2} \sqrt{\sum_t tfidf_{t,q}^2}}$
Scoring Documents
- It also fits within our general scoring scheme:
- Note that we only consider terms that are present in
the query
$Score(q, d) = \sum_{t \in q} w_{t,q} \cdot w_{t,d}$
$w_{t,q} = \frac{tfidf_{t,q}}{\sqrt{\sum_t tfidf_{t,q}^2}}$
$w_{t,d} = \frac{tfidf_{t,d}}{\sqrt{\sum_t tfidf_{t,d}^2}}$
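A small sketch of TF-IDF cosine scoring, assuming raw term frequency for TF and log(N / n_t) for IDF. The three toy documents are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection (already tokenized).
docs = {
    "d1": ["quick", "brown", "fox"],
    "d2": ["lazy", "brown", "dog"],
    "d3": ["quick", "quick", "fox", "jumps"],
}


def idf(term, docs):
    n_t = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / n_t) if n_t else 0.0


def tfidf_vector(tokens, docs):
    """Sparse vector: term -> raw term frequency * idf."""
    return {t: f * idf(t, docs) for t, f in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


query_vec = tfidf_vector(["quick", "fox"], docs)
scores = {d: cosine(tfidf_vector(tokens, docs), query_vec)
          for d, tokens in docs.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Only terms shared by the query and the document contribute to the dot product, so the sum can be restricted to query terms as noted above.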
Variations on Term Weighting
- It is possible to use different term weighting for
documents and for queries, for example:

- See also: https://en.wikipedia.org/wiki/Tf-idf
for further variants
Difference from Boolean
Retrieval
- Similarity calculation has two factors that
distinguish it from Boolean retrieval
- Number of matching terms affects similarity
- Weight of matching terms affects similarity
- Documents can be ranked by their similarity
scores
Exercise
BM25
BM25
- BM25 was created as the result of a series of
experiments
- Popular and effective ranking algorithm
- The reasoning behind BM25 is that good term
weighting is based on three principles
- Inverse document frequency
- Term frequency
- Document length normalization
BM25 Scoring
$score(d, q) = \sum_{t \in q} \frac{f_{t,d} \cdot (1 + k_1)}{f_{t,d} + k_1 (1 - b + b \frac{|d|}{avgdl})} \cdot idf_t$

- Parameters
- k1: calibrating term frequency scaling
- b: document length normalization
- Note: several slight variations of BM25 exist!
BM25: An Intuitive View
$score(d, q) = \sum_{t \in q} \frac{f_{t,d} \cdot (1 + k_1)}{f_{t,d} + k_1 (1 - b + b \frac{|d|}{avgdl})} \cdot idf_t$
- Terms common between the document and the query
=> good
- Repetitions of query terms in the document
=> good
- Term saturation: repetition is less important after a while
- $\frac{f_{t,d}}{k + f_{t,d}}$ for some k > 0 asymptotically approaches 1
[Figure: f_{t,d} / (k + f_{t,d}) plotted against f_{t,d}; middle line is k=1, upper line is lower k, lower line is higher k]
- Soft document normalization taking into account document
length: a term occurrence counts for less if the document is
relatively long (w.r.t. average)
- Common terms less important (idf_t)
Parameter Setting
- k1: calibrating term frequency scaling
- 0 corresponds to a binary model
- large values correspond to using raw term frequencies
- k1 is set between 1.2 and 2.0, a typical value is 1.2
- b: document length normalization
- 0: no normalization at all
- 1: full length normalization
- typical value: 0.75
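The BM25 formula and the effect of its two parameters can be sketched as below. This follows the variant given in this section; the IDF values and term counts are hypothetical, and other BM25 variants differ in detail.

```python
def bm25_score(query_terms, doc_tf, doc_len, avgdl, idf, k1=1.2, b=0.75):
    """BM25 as written above: sum over query terms of a saturated TF times idf."""
    norm = 1 - b + b * doc_len / avgdl          # soft length normalization
    score = 0.0
    for term in query_terms:
        f = doc_tf.get(term, 0)
        if f == 0:
            continue                            # non-matching terms add nothing
        score += f * (1 + k1) / (f + k1 * norm) * idf.get(term, 0.0)
    return score


idf = {"fox": 1.5, "quick": 0.8}                # hypothetical idf values
doc_tf = {"fox": 3, "quick": 1}                 # hypothetical term counts

average_len = bm25_score(["quick", "fox"], doc_tf, doc_len=100, avgdl=100, idf=idf)
long_doc = bm25_score(["quick", "fox"], doc_tf, doc_len=300, avgdl=100, idf=idf)
binary = bm25_score(["quick", "fox"], doc_tf, doc_len=100, avgdl=100, idf=idf, k1=0.0)
print(average_len, long_doc, binary)
```

With k1 = 0 the term-frequency factor becomes 1 for every matching term, so the score reduces to a sum of IDF values; with b = 0 the document length no longer affects the score.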
Language Models
Language Models
- Based on the notion of probabilities and
processes for generating text
Uses
- Speech recognition
- “I ate a cherry” is a more likely sentence than “Eye
eight uh Jerry”
- OCR & Handwriting recognition
- More probable sentences are more likely correct
readings
- Machine translation
- More likely sentences are probably better
translations
Uses
- Completion prediction
- Please turn off your cell _____
- Your program does not ______
- Predictive text input systems
can guess what you are
typing and give choices on
how to complete it
Ranking Documents using
Language Models
- Represent each document as a multinomial
probability distribution over terms
- Estimate the probability that the query was
"generated" by the given document
- "How likely is the search query given the language
model of the document?"
Standard Language Modeling
approach
- Rank documents d according to their likelihood
of being relevant given a query q: P(d|q)

$P(d|q) = \frac{P(q|d) P(d)}{P(q)} \propto P(q|d) P(d)$
- P(q|d): query likelihood; probability that query q
was "produced" by document d
- P(d): document prior; probability of the document
being relevant to any query
$P(q|d) = \prod_{t \in q} P(t|\theta_d)^{f_{t,q}}$
Standard Language Modeling
approach (2)
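$P(q|d) = \prod_{t \in q} P(t|\theta_d)^{f_{t,q}}$
- f_{t,q}: number of times t appears in q
- P(t|θ_d): document language model; multinomial probability
distribution over the vocabulary of terms
$P(t|\theta_d) = (1 - \lambda) P(t|d) + \lambda P(t|C)$
- λ: smoothing parameter
- P(t|d): empirical document model (maximum likelihood estimate)
$P(t|d) = \frac{f_{t,d}}{|d|}$
- P(t|C): collection model (maximum likelihood estimate)
$P(t|C) = \frac{\sum_{d'} f_{t,d'}}{\sum_{d'} |d'|}$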
Language Modeling
- Estimate a multinomial probability distribution
from the text
- Smooth the distribution with one estimated from
the entire collection
$P(t|\theta_d) = (1 - \lambda) P(t|d) + \lambda P(t|C)$

Example
In the town where I was born,
Lived a man who sailed to sea,
And he told us of his life,
In the land of submarines,
So we sailed on to the sun,
Till we found the sea green,
And we lived beneath the
waves, In our yellow
submarine,
We all live in yellow
submarine, yellow submarine,
yellow submarine, We all live
in yellow submarine, yellow
submarine, yellow submarine.
Empirical document LM
$P(t|d) = \frac{f_{t,d}}{|d|}$
[Figure: bar chart of P(t|d) for each term of the document]
Alternatively...
Scoring a query
q = {sea, submarine}
$P(q|d) = P(\text{"sea"}|\theta_d) \cdot P(\text{"submarine"}|\theta_d)$

t         | P(t|d) | P(t|C)
submarine | 0.14   | 0.0001
sea       | 0.04   | 0.0002
...       | ...    | ...

With λ = 0.1:
$P(\text{"sea"}|\theta_d) = (1 - \lambda) P(\text{"sea"}|d) + \lambda P(\text{"sea"}|C) = 0.9 \cdot 0.04 + 0.1 \cdot 0.0002 = 0.03602$
$P(\text{"submarine"}|\theta_d) = (1 - \lambda) P(\text{"submarine"}|d) + \lambda P(\text{"submarine"}|C) = 0.9 \cdot 0.14 + 0.1 \cdot 0.0001 = 0.12601$
$P(q|d) = 0.03602 \cdot 0.12601 \approx 0.00454$
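The arithmetic of this example can be checked with a short sketch. The probabilities are the ones listed in the tables; the smoothing weight of 0.1 is read off the 0.9 / 0.1 factors shown in the example.

```python
def smoothed_prob(p_doc, p_coll, lam=0.1):
    """Jelinek-Mercer style mixture: (1 - lambda) * P(t|d) + lambda * P(t|C)."""
    return (1 - lam) * p_doc + lam * p_coll


p_doc = {"sea": 0.04, "submarine": 0.14}        # P(t|d) from the example
p_coll = {"sea": 0.0002, "submarine": 0.0001}   # P(t|C) from the example

p_sea = smoothed_prob(p_doc["sea"], p_coll["sea"])
p_submarine = smoothed_prob(p_doc["submarine"], p_coll["submarine"])
query_likelihood = p_sea * p_submarine          # each term occurs once in q
print(p_sea, p_submarine, query_likelihood)
```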
Smoothing
- Jelinek-Mercer smoothing
$P(t|\theta_d) = (1 - \lambda) P(t|d) + \lambda P(t)$
- Smoothing parameter is λ
- Same amount of smoothing is applied to all documents
- Dirichlet smoothing
$p(t|\theta_d) = \frac{f_{t,d} + \mu \cdot p(t)}{|d| + \mu}$
- Smoothing parameter is μ
- Smoothing is inversely proportional to the document
length
Relation between
Smoothing Methods
- Jelinek-Mercer:
$P(t|\theta_d) = (1 - \lambda) P(t|d) + \lambda P(t)$
- by setting:
$(1 - \lambda) = \frac{|d|}{|d| + \mu}, \quad \lambda = \frac{\mu}{|d| + \mu}$
- Dirichlet:
$p(t|\theta_d) = \frac{f_{t,d} + \mu \cdot p(t)}{|d| + \mu}$
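The relation between the two smoothing methods can be verified numerically. The counts, document length, background probability, and μ below are made-up values.

```python
def jelinek_mercer(f_td, doc_len, p_t, lam):
    """(1 - lambda) * f_td / |d| + lambda * p(t)"""
    return (1 - lam) * (f_td / doc_len) + lam * p_t


def dirichlet(f_td, doc_len, p_t, mu):
    """(f_td + mu * p(t)) / (|d| + mu)"""
    return (f_td + mu * p_t) / (doc_len + mu)


f_td, doc_len, p_t, mu = 3, 120, 0.002, 2000    # hypothetical values

lam = mu / (doc_len + mu)                        # document-dependent lambda
p_jm = jelinek_mercer(f_td, doc_len, p_t, lam)
p_dir = dirichlet(f_td, doc_len, p_t, mu)
print(lam, p_jm, p_dir)
```

Because lam shrinks as the document gets longer, Dirichlet smoothing trusts long documents more than short ones, whereas a fixed λ treats all documents alike.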
Practical Considerations
- Since we are multiplying small probabilities, it's
better to perform computations in the log
space
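$P(q|d) = \prod_{t \in q} P(t|\theta_d)^{f_{t,q}}$
$\log P(q|d) = \sum_{t \in q} \log P(t|\theta_d) \cdot f_{t,q}$
- This fits the general scoring formula
$score(d, q) = \sum_{t \in q} w_{t,d} \cdot w_{t,q}$
with w_{t,d} = log P(t|θ_d) and w_{t,q} = f_{t,q}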
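A brief sketch of why log space matters. The term probabilities are artificial: 200 query-term occurrences, each with probability 0.001, chosen only to force floating-point underflow.

```python
import math

probs = [0.001] * 200        # artificial P(t|theta_d) values

# Direct product: underflows to exactly 0.0 in double precision.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays a well-behaved finite number.
log_score = sum(math.log(p) for p in probs)
print(product, log_score)
```

Since the logarithm is monotonic, ranking documents by the log score gives the same order as ranking by the product.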
Exercise
Exercise
See on GitHub
Exercise
$P(t|\theta_d) = (1 - \lambda) P(t|d) + \lambda P(t|C)$
Document language model computation
Exercise

Scoring a query
$P(q|d) = \prod_{t \in q} P(t|\theta_d)^{f_{t,q}}$
P(q="T2 T1"|D2) = P(T2|D2) * P(T1|D2)
Fielded Variants
Motivation
- Documents are composed of multiple fields
- E.g., title, body, anchors, etc.
- Modeling internal document structure may be
beneficial for retrieval
Example
Unstructured representation
PROMISE Winter School 2013
Bridging between Information Retrieval and Databases
Bressanone, Italy 4 - 8 February 2013
The aim of the PROMISE Winter School 2013 on "Bridging between Information
Retrieval and Databases" is to give participants a grounding in the core
topics that constitute the multidisciplinary area of information access and
retrieval to unstructured, semistructured, and structured information. The
school is a week-long event consisting of guest lectures from invited
speakers who are recognized experts in the field. The school is intended for
PhD students, Masters students or senior researchers such as post-doctoral
researchers form the fields of databases, information retrieval, and related
fields.
[...]
Example
<html>
<head>
<title>Winter School 2013</title>
<meta name="keywords" content="PROMISE, school, PhD, IR, DB, [...]" />
<meta name="description" content="PROMISE Winter School 2013, [...]" />
</head>
<body>
<h1>PROMISE Winter School 2013</h1>
<h2>Bridging between Information Retrieval and Databases</h2>
<h3>Bressanone, Italy 4 - 8 February 2013</h3>
<p>The aim of the PROMISE Winter School 2013 on "Bridging between
Information Retrieval and Databases" is to give participants a grounding
in the core topics that constitute the multidisciplinary area of
information access and retrieval to unstructured, semistructured, and
structured information. The school is a week-long event consisting of
guest lectures from invited speakers who are recognized experts in the
field. The school is intended for PhD students, Masters students or
senior researchers such as post-doctoral researchers form the fields of
databases, information retrieval, and related fields. </p>
[...]
</body>
</html>
Fielded representation
based on HTML markup
title: Winter School 2013

meta: PROMISE, school, PhD, IR, DB, [...]


PROMISE Winter School 2013, [...]

headings: PROMISE Winter School 2013


Bridging between Information Retrieval and Databases
Bressanone, Italy 4 - 8 February 2013

body: The aim of the PROMISE Winter School 2013 on "Bridging between
Information Retrieval and Databases" is to give participants a
grounding in the core topics that constitute the multidisciplinary
area of information access and retrieval to unstructured,
semistructured, and structured information. The school is a week-
long event consisting of guest lectures from invited speakers who
are recognized experts in the field. The school is intended for
PhD students, Masters students or senior researchers such as post-
doctoral researchers form the fields of databases, information
retrieval, and related fields.
Fielded Extension of
Retrieval Models
- BM25 => BM25F
- LM => Mixture of Language Models (MLM)
BM25F
- Extension of BM25 incorporating multiple fields
- The soft normalization and term frequencies
need to be adjusted
- Original BM25:
$score(d, q) = \sum_{t \in q} \frac{f_{t,d} \cdot (1 + k_1)}{f_{t,d} + k_1 \cdot B} \cdot idf_t$
where B is the soft normalization:
$B = (1 - b + b \frac{|d|}{avgdl})$
BM25F
$score(d, q) = \sum_{t \in q} \frac{\tilde{f}_{t,d}}{k_1 + \tilde{f}_{t,d}} \cdot idf_t$
- Combining term frequencies across fields:
$\tilde{f}_{t,d} = \sum_i w_i \frac{f_{t,d_i}}{B_i}$
- w_i: field weight
- B_i: soft normalization for field i; parameter b becomes
field-specific
$B_i = (1 - b_i + b_i \frac{|d_i|}{avgdl_i})$
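A minimal sketch of BM25F as defined above. The field names, weights, lengths, and IDF values are hypothetical; only the structure of the computation follows the formulas.

```python
def bm25f_score(query_terms, field_tf, field_len, avg_field_len,
                idf, weights, b, k1=1.2):
    """Combine length-normalized term frequencies across fields, then saturate."""
    score = 0.0
    for term in query_terms:
        pseudo_tf = 0.0
        for field, w in weights.items():
            f = field_tf.get(field, {}).get(term, 0)
            norm = 1 - b[field] + b[field] * field_len[field] / avg_field_len[field]
            pseudo_tf += w * f / norm
        if pseudo_tf > 0:
            score += pseudo_tf / (k1 + pseudo_tf) * idf.get(term, 0.0)
    return score


# Hypothetical two-field document.
field_tf = {"title": {"school": 1}, "body": {"school": 3, "winter": 1}}
field_len = {"title": 3, "body": 90}
avg_field_len = {"title": 5, "body": 120}
weights = {"title": 2.0, "body": 1.0}
b = {"title": 0.5, "body": 0.75}
idf = {"school": 1.2, "winter": 2.0}

score = bm25f_score(["winter", "school"], field_tf, field_len,
                    avg_field_len, idf, weights, b)
print(score)
```

Term frequencies are combined before saturation is applied, so a term that appears in several fields is saturated once rather than once per field.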
Mixture of Language Models

- Build a separate language model for each field


- Take a linear combination of them
$P(t|\theta_d) = \sum_i \mu_i P(t|\theta_{d_i})$
- P(t|θ_{d_i}): field language model; smoothed with a collection
model built from all document representations of the
same type in the collection
- μ_i: field weights
$\sum_{j=1}^{m} \mu_j = 1$
Field Language Model
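$P(t|\theta_{d_i}) = (1 - \lambda_i) P(t|d_i) + \lambda_i P(t|C_i)$
- λ_i: smoothing parameter
- P(t|d_i): empirical field model (maximum likelihood estimate)
$P(t|d_i) = \frac{f_{t,d_i}}{|d_i|}$
- P(t|C_i): collection field model (maximum likelihood estimate)
$P(t|C_i) = \frac{\sum_{d'} f_{t,d'_i}}{\sum_{d'} |d'_i|}$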
Example
q = {IR, winter, school}
fields = {title, meta, headings, body}
µ = {0.2, 0.1, 0.2, 0.5}

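$P(q|\theta_d) = P(\text{"IR"}|\theta_d) \cdot P(\text{"winter"}|\theta_d) \cdot P(\text{"school"}|\theta_d)$
$P(\text{"IR"}|\theta_d) = 0.2 \cdot P(\text{"IR"}|\theta_{d_{title}}) + 0.1 \cdot P(\text{"IR"}|\theta_{d_{meta}}) + 0.2 \cdot P(\text{"IR"}|\theta_{d_{headings}}) + 0.5 \cdot P(\text{"IR"}|\theta_{d_{body}})$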
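The mixture can be sketched as follows. The field weights are the ones from the example; the smoothed field probabilities are invented purely for illustration.

```python
import math

# Field weights from the example; they sum to 1.
field_weights = {"title": 0.2, "meta": 0.1, "headings": 0.2, "body": 0.5}

# Hypothetical smoothed field language models P(t | theta_{d_i}).
field_models = {
    "title":    {"ir": 0.001, "winter": 0.30, "school": 0.30},
    "meta":     {"ir": 0.05,  "winter": 0.02, "school": 0.10},
    "headings": {"ir": 0.001, "winter": 0.08, "school": 0.08},
    "body":     {"ir": 0.01,  "winter": 0.02, "school": 0.04},
}


def mixture_prob(term, field_models, field_weights):
    """P(t|theta_d) as a weighted sum of the field language models."""
    return sum(w * field_models[f].get(term, 0.0)
               for f, w in field_weights.items())


def query_log_likelihood(query, field_models, field_weights):
    """log P(q|theta_d), assuming every query term has nonzero probability."""
    return sum(math.log(mixture_prob(t, field_models, field_weights))
               for t in query)


p_ir = mixture_prob("ir", field_models, field_weights)
log_p = query_log_likelihood(["ir", "winter", "school"], field_models, field_weights)
print(p_ir, log_p)
```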
Exercise
Parameter Settings
Setting Parameter Values
- Retrieval models often contain parameters that
must be tuned to get best performance for
specific types of data and queries
- For experiments:
- Use training and test data sets
- If less data available, use cross-validation by
partitioning the data into K subsets
Finding Parameter Values
- Many techniques used to find optimal
parameter values given training data
- Standard problem in machine learning
- In IR, often explore the space of possible
parameter values by grid search ("brute force")
- Perform a sweep over the possible parameter values
of each parameter, e.g., from 0 to 1 in 0.1 steps
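A grid search sweep can be sketched as below. The `evaluate` function is a stand-in: in practice it would run retrieval on the training queries and return an effectiveness measure such as MAP, whereas here it is a toy function that peaks at a known value.

```python
import itertools


def grid_search(evaluate, grid):
    """Try every combination of parameter values and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def evaluate(params):
    # Toy stand-in for "effectiveness on the training data"; peaks at lam = 0.3.
    return -(params["lam"] - 0.3) ** 2


steps = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
best_params, best_score = grid_search(evaluate, {"lam": steps})
print(best_params, best_score)
```

The number of evaluations grows multiplicatively with each added parameter, which is why this is described as brute force.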
DAT630
Search Engines
Search Engine Architecture and Indexing

22/09/2017

Krisztian Balog | University of Stavanger


Information Retrieval
Information Retrieval (IR)
“Information retrieval is a field concerned with
the structure, analysis, organization, storage,
searching, and retrieval of information.”
(Salton, 1968)
Modern definition
“Making the right information available to the
right person at the right time.”
Searching in Databases
- Query: records with balance > $50,000 in
branches located in Amherst, MA.
Name          | Branch      | Balance
Sam I. Am     | Amherst, MA | $95,342.11
Patty MacPatty | Amherst, MA | $23,023.23
Bobby de West | Amherst, NY | $78,000.00
Xing O'Boston | Boston, MA  | $50,000.01
Searching in Text
- Query: deadly disease due to diet
- Which are relevant?
Comparing Text
- Comparing the query text to the document text
and determining what is a good match is the
core issue of information retrieval
- Exact matching of words is not enough
- Many different ways to write the same thing in a
“natural language” like English
- E.g., does a news story containing the text “fatal
illnesses caused by your menu” match the query?
- Some documents will be better matches than others
Dimensions of IR
- IR is more than just text, and more than just
web search
- Although these are central
- Content
- Text
- Images
- Video
- Audio
- Scanned documents
Dimensions of IR
- Applications
- Web search
- Vertical search
- Enterprise search
- Mobile search
- Social search
- Desktop search
- Patent search
- …
Dimensions of IR
- Tasks
- Ad-hoc search
- Filtering
- Question answering
Core issues in IR
- Relevance
- Simple (and simplistic) definition: A relevant
document contains the information that a person
was looking for when they submitted a query to the
search engine
- Many factors influence a person’s decision about
what is relevant: e.g., task, context, novelty
- Topical relevance (same topic) vs. user relevance
(everything else)
Core issues in IR
- Relevance
- Retrieval models define a view of relevance
- Ranking algorithms used in search engines are
based on retrieval models
- Most models based on statistical properties of text
rather than linguistic
- I.e., counting simple text features such as words instead
of parsing and analyzing the sentences
Core issues in IR
- Evaluation
- Experimental procedures and measures for
comparing system output with user expectations
- Typically use test collection of documents, queries,
and relevance judgments
- Recall and precision are two examples of
effectiveness measures
Core issues in IR
- Information needs
- Keyword queries are often poor descriptions of
actual information needs
- Interaction and context are important for
understanding user intent
- Query refinement techniques such as query
expansion, query suggestion, relevance feedback
improve ranking
Search Engines in
Operational Environments
- Performance
- Response time, indexing speed, etc.
- Incorporating new data
- Coverage and freshness
- Scalability
- Growing with data and users
- Adaptability
- Tuning for specific applications
Search Engine Architecture
Search Engine Architecture
- A software architecture consists of software
components, the interfaces provided by those
components, and the relationships between them
- Describes a system at a particular level of abstraction
- Architecture of a search engine determined by 2
requirements
- Effectiveness (quality of results)
- Efficiency (response time and throughput)
Indexing Process

Figure 2.1
Indexing Process
Identify and make available
the documents that will be
searched

Figure 2.1
Text Acquisition
- Crawler
- Identifies and acquires documents for search engine
- Many types: web, enterprise, desktop, etc.
- Web crawlers follow links to find documents
- Must efficiently find huge numbers of web pages
(coverage) and keep them up-to-date (freshness)
- Single site crawlers for site search
- Topical or focused crawlers for vertical search
- Document crawlers for enterprise and desktop search
- Follow links and scan directories
Text Acquisition
- Feeds
- Real-time streams of documents
- E.g., web feeds for news, blogs, video, radio, TV
- RSS is common standard
- RSS “reader” can provide new XML documents to
search engine
Text Acquisition
- Documents need to be converted into a
consistent text plus metadata format
- E.g. HTML, XML, Word, PDF, etc. → XML
- Convert text encoding for different languages
- Using a Unicode standard like UTF-8
Indexing Process

Figure 2.1
Document Data Store
- Stores text, metadata, and other related
content for documents
- Metadata is information about document such as
type and creation date
- Other content includes links, anchor text
- Provides fast access to document contents for
search engine components
- E.g. result list generation
- Could use relational database system
- More typically, a simpler, more efficient storage
system is used due to huge numbers of documents
Indexing Process

Transform documents into index terms or features
Figure 2.1
Text Transformation
- Tokenization
- Stopword removal
- Stemming
- Information extraction
- Identify index terms that are more complex than single
words
- E.g., named entity recognizers identify classes such as
people, locations, companies, dates, etc
- Important for some applications
Text Transformation
- Link Analysis
- Makes use of links and anchor text in web pages
- Link analysis identifies popularity and community
information
- E.g., PageRank
- Anchor text can significantly enhance the
representation of pages pointed to by links
- Significant impact on web search
- Less important in other applications
Text Transformation
- Classification
- Identifies class-related metadata for documents or
parts of documents
- Topics, reading levels, sentiment, genre
- Spam vs. non-spam
- Non-content parts of documents, e.g., advertisements
- Use depends on application
Indexing Process
Create indices or data
structures that enable fast
searching

Figure 2.1
Index Creation
- Document Statistics
- Gathers counts and positions of words and other
features
- Used in ranking algorithm
- Weighting
- Computes weights for index terms
- Usually reflect “importance” of term in the document
- Used in ranking algorithm
Index Creation
- Inversion
- Core of indexing process
- Converts document-term information to
term-document for indexing
- Difficult for very large numbers of documents
- Format of inverted file is designed for fast query
processing
- Must also handle updates
- Compression used for efficiency
Index Creation
- Index Distribution
- Distributes indexes across multiple computers and/or
multiple sites
- Essential for fast query processing with large
numbers of documents
- Many variations
- Document distribution, term distribution,
replication
- P2P and distributed IR involve search across multiple
sites
Query Process

Figure 2.2
Query Process
Interface between the person
doing the searching and the
search engine

Figure 2.2
User Interaction
- Accepting the user’s query and transforming it
into index terms
- Taking the ranked list of documents from the
search engine and organizing it into the results
shown to the user
- E.g., generating snippets to summarize documents
- Range of techniques for refining the query (so
that it better represents the information need)
User Interaction
- Query input
- Provides interface and parser for query language
- Query language used to describe complex queries
- Operators indicate special treatment for query text
- Most web search query languages are very simple
- Small number of operators
- There are more complicated query languages
- E.g., Boolean queries, proximity operators
- IR query languages also allow content and structure
specifications, but focus on content
User Interaction
- Query transformation
- Improves initial query, both before and after initial
search
- Includes text transformation techniques used for
documents
- Spell checking and query suggestion provide
alternatives to original query
- Techniques often leverage query logs in web search
- Query expansion and relevance feedback modify the
original query with additional terms
User Interaction
- Results output
- Constructs the display of ranked documents for a
query
- Generates snippets to show how queries match
documents
- Highlights important words and passages
- Retrieves appropriate advertising in many
applications (“related” things)
- May provide clustering and other visualization tools
Query Process
Core of the search engine:
generates a ranked list of
documents for the user’s query

Figure 2.2
Ranking
- Scoring
- Calculates scores for documents using a ranking
algorithm, which is based on a retrieval model
- Core component of search engine
- Basic form of score is $\sum_i q_i d_i$
- $q_i$ and $d_i$ are query and document term weights for term $i$
- Many variations of ranking algorithms and retrieval
models
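The basic score, a sum of products of query and document term weights, can be sketched as below. The dictionaries and the weights in them are made up for illustration; how the weights are computed depends on the retrieval model.

```python
def score(query_weights: dict[str, float], doc_weights: dict[str, float]) -> float:
    """Sum q_i * d_i over the terms that occur in both the query and the document."""
    return sum(q * doc_weights[term]
               for term, q in query_weights.items()
               if term in doc_weights)

# Hypothetical term weights for one query and one document
query = {"tropical": 1.0, "fish": 1.0}
doc = {"tropical": 0.5, "fish": 2.0, "tank": 1.5}

print(score(query, doc))  # 1.0 * 0.5 + 1.0 * 2.0
```

Terms that appear only in the document ("tank") or only in the query contribute nothing, so only the shared terms need to be visited.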
Ranking
- Performance optimization
- Designing ranking algorithms for efficient processing
- Term-at-a-time vs. document-at-a-time processing
- Safe vs. unsafe optimizations

- Distribution
- Processing queries in a distributed environment
- Query broker distributes queries and assembles
results
- Caching is a form of distributed searching
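The difference between term-at-a-time and document-at-a-time processing can be sketched on a toy index. This is a simplified illustration: the index contents are invented, all query term weights are assumed to be 1, and no optimizations are applied.

```python
from collections import defaultdict

# Toy index: term -> postings (docID, weight), sorted by docID.
index = {
    "tropical": [(1, 2.0), (3, 1.0)],
    "fish": [(1, 1.0), (2, 3.0), (3, 1.0)],
}

def term_at_a_time(query_terms, index):
    """Process one postings list completely before moving to the next term."""
    accumulators = defaultdict(float)   # partial score per docID
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)

def document_at_a_time(query_terms, index):
    """Walk all postings lists in parallel, finishing one document at a time."""
    lists = [index.get(term, []) for term in query_terms]
    cursors = [0] * len(lists)
    scores = {}
    while True:
        # Smallest docID that has not been scored yet, across all lists
        candidates = [lst[c][0] for lst, c in zip(lists, cursors) if c < len(lst)]
        if not candidates:
            break
        doc_id = min(candidates)
        total = 0.0
        for i, lst in enumerate(lists):
            if cursors[i] < len(lst) and lst[cursors[i]][0] == doc_id:
                total += lst[cursors[i]][1]
                cursors[i] += 1
        scores[doc_id] = total
    return scores

taat = term_at_a_time(["tropical", "fish"], index)
daat = document_at_a_time(["tropical", "fish"], index)
```

Both strategies produce the same scores. Term-at-a-time needs an accumulator per candidate document, while document-at-a-time produces each final score in one step and relies on the postings lists being sorted by docID.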
Query Process

Measure and monitor effectiveness and efficiency.
Record and analyze usage data
Figure 2.2
Evaluation
- Logging
- Logging user queries and interaction is crucial for
improving search effectiveness and efficiency
- Query logs and clickthrough data used for query
suggestion, spell checking, query caching, ranking,
advertising search, and other components
- Ranking analysis
- Measuring and tuning ranking effectiveness
- Performance analysis
- Measuring and tuning system efficiency
Indexing
Indices
- Indices are data structures designed to make
search faster
- Text search has unique requirements, which leads
to unique data structures
- Most common data structure is the inverted index
- General name for a class of structures
- “Inverted” because documents are associated with
words, rather than words with documents
- Similar to a concordance
Inverted Index
- Each index term is associated with a postings
list (or inverted list)
- Contains lists of documents, or lists of word
occurrences in documents, and other information
- Each entry is called a posting
- The part of the posting that refers to a specific
document or location is called a pointer
- Each document in the collection is given a unique number
(docID)
- The posting can store additional information, called
the payload
- Lists are usually document-ordered (sorted by docID)
Inverted Index

term → posting, posting, posting, …
- Each posting: docID; payload
- docID: points to a specific document
- payload: optionally can store other associated information
(e.g., frequency or position)
Example
Simple Inverted Index
- Posting format: docID
- Each document that contains the term is a posting.
- No additional payload.
Inverted Index with Counts
- Posting format: docID: freq
- The payload is the frequency of the term in the document.
- Supports better ranking algorithms.
Inverted Index with Positions
- Posting format: docID, position
- There is a separate posting for each term occurrence in the
document. The payload is the term position.
- Supports proximity matches.
- E.g., find "tropical" within 5 words of "fish"
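A proximity match over a positional index can be sketched as follows. The index contents are invented, positions are grouped per document for readability (rather than one posting per occurrence), and `within_k_words` is a hypothetical helper that compares all position pairs instead of using an optimized merge.

```python
# Toy positional index: term -> {docID: sorted word positions}.
positional_index = {
    "tropical": {1: [1, 7], 2: [6]},
    "fish": {1: [2, 18, 23], 2: [40], 3: [1]},
}

def within_k_words(index, term_a, term_b, k):
    """Return docIDs where term_a and term_b occur within k words of each other."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    matches = []
    # Only documents containing both terms can match
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(pa - pb) <= k
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            matches.append(doc_id)
    return matches

print(within_k_words(positional_index, "tropical", "fish", 5))
```

Document 1 matches because "tropical" at position 1 is next to "fish" at position 2. Document 2 contains both terms, but they are 34 words apart, and document 3 lacks "tropical" altogether.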
Issues
- Compression
- Inverted lists are very large
- Compression of indexes saves disk and/or memory
space
- Optimization techniques to speed up search
- Read less data from inverted lists
- “Skipping” ahead
- Calculate scores for fewer documents
- Store highest-scoring documents at the beginning of
each inverted list

- Distributed indexing
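One common way to compress a document-ordered inverted list is to store the gaps between docIDs and encode each gap with a variable number of bytes. The slides do not name a specific scheme, so the delta encoding and variable-byte coding below are assumptions chosen for illustration.

```python
def delta_encode(doc_ids):
    """Replace sorted docIDs with the gaps between consecutive docIDs."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def delta_decode(gaps):
    """Rebuild docIDs by summing the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte; the high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)   # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)       # final byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:            # last byte of the current number
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers

postings = [3, 7, 8, 150, 1000]
encoded = vbyte_encode(delta_encode(postings))
decoded = delta_decode(vbyte_decode(encoded))
```

Small gaps fit in a single byte, so the five docIDs above take 7 bytes instead of the 20 bytes needed for five fixed 32-bit integers.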
Exercise
- Draw the inverted index for the following
document collection

Doc 1: new home sales top forecasts
Doc 2: home sales rise in july
Doc 3: increase in home sales in july
Doc 4: july new home sales rise


Solution
new: 1, 4
home: 1, 2, 3, 4
sales: 1, 2, 3, 4
top: 1
forecasts: 1
rise: 2, 4
in: 2, 3
july: 2, 3, 4
increase: 3
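The solution can be checked with a short script that builds a simple inverted index (docIDs only, no payload) from the four documents. Splitting on whitespace is enough here because the exercise text is already lowercase and free of punctuation; `build_index` is a hypothetical helper name.

```python
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

def build_index(docs):
    """Map each term to a document-ordered list of docIDs containing it."""
    index = {}
    for doc_id in sorted(docs):
        for term in docs[doc_id].split():
            postings = index.setdefault(term, [])
            # Add each docID once, even if the term repeats in the document
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

index = build_index(docs)
for term, postings in index.items():
    print(term, *postings)
```

Processing the documents in docID order keeps every postings list sorted, which is the document-ordered layout described earlier.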
Text Preprocessing
Preprocessing Pipeline
raw document → text preprocessing (Tokenization → Stopping → Stemming → …) → sequence of terms
Tokenization
- Parsing a string into individual words (tokens)
- Splitting is usually done along white spaces,
punctuation marks, or other types of content
delimiters (e.g., HTML markup)
- Sounds easy, but can be surprisingly complex,
even for English
- Even worse for many other languages
Tokenization Issues
- Apostrophes can be a part of a word, a part of
a possessive, or just a mistake
- rosie o'donnell, can't, 80's, 1890's, men's straw
hats, master's degree, …
- Capitalized words can have different meaning
from lower case words
- Bush, Apple
- Special characters are an important part of
tags, URLs, email addresses, etc.
- C++, C#, …
Tokenization Issues
- Numbers can be important, including decimals
- nokia 3250, top 10 courses, united 93, quicktime
6.5 pro, 92.3 the beat, 288358
- Periods can occur in numbers, abbreviations,
URLs, ends of sentences, and other situations
- I.B.M., Ph.D., www.uis.no, F.E.A.R.
Common Practice
- First pass is focused on identifying markup or
tags; second pass is done on the appropriate
parts of the document structure
- Treat hyphens, apostrophes, periods, etc. like
spaces
- Ignore capitalization
- Index even single characters
- o’connor => o connor
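The second-pass rules above can be sketched with a single regular expression. This is a simplified tokenizer that assumes markup has already been stripped and that the text is ASCII English; `tokenize` is a hypothetical helper name.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text, treat every non-alphanumeric character as a space,
    and keep all resulting tokens, including single characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("O'Connor"))          # apostrophe acts as a space
print(tokenize("Ph.D. www.uis.no"))  # periods act as spaces
print(tokenize("PG-13"))             # hyphen acts as a space
```

The same simplicity is also the weakness: a term such as "C++" is reduced to the single token "c", which is one of the tokenization issues listed earlier.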
Text Statistics
Top-50 words from AP89
Zipf’s Law
- Distribution of word frequencies is very skewed
- A few words occur very often, many words hardly ever
occur
- E.g., two most common words (“the”, “of”) make up
about 10% of all word occurrences in text documents
- Zipf’s law:
- Frequency of an item or event is inversely
proportional to its frequency rank
- Rank (r) of a word times its frequency (f) is
approximately a constant (k): r · f ≈ k
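The rank-times-frequency product can be computed for any token sequence as sketched below. The sample sentence is made up and far too small for the product to settle near a constant; the point is only to show the computation that would be applied to a real collection.

```python
from collections import Counter

def rank_times_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()   # sorted by descending frequency
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

tokens = "the cat and the dog and the bird".split()
for row in rank_times_frequency(tokens):
    print(row)
```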
Zipf’s law for AP89
Stopword Removal
- Function words that have little meaning apart
from other words: the, a, an, that, those, …
- These are considered stopwords and are
removed
- A stopwords list can be constructed by taking
the top n (e.g., 50) most common words in a
collection
- May be customized for certain domains or applications
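Building a stopword list from the most common words in a collection can be sketched as follows. The three-document collection and the choice of n = 2 are made up for illustration; a real list would use a much larger collection and a larger n.

```python
from collections import Counter

def build_stopword_list(documents, n):
    """Return the n most frequent words in the collection as a set."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return {word for word, _ in counts.most_common(n)}

def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]

collection = [
    "the price of the house",
    "the sale of the car",
    "the end of the road",
]
stopwords = build_stopword_list(collection, n=2)
filtered = remove_stopwords(collection[0].split(), stopwords)
```

In this toy collection the two most frequent words are "the" and "of", so only "price" and "house" remain from the first document.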
Stopword Removal

- There are problematic cases…

"to be or not to be"


Stemming
- Reduce the different forms of a word that
occur to a common stem
- inflectional (plurals, tenses)
- derivational (e.g., turning verbs into nouns)
- In most cases, these have the same or very
similar meanings
- Two basic types of stemmers
- Algorithmic
- Dictionary-based
Stemming
- Suffix-s stemmer
- Assumes that any word ending with an s is plural
- cakes => cake, dogs => dog
- Cannot detect many plural relationships (false
negative)
- centuries => century
- In rare cases it detects a relationship where it does
not exist (false positive)
- is => i
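The suffix-s stemmer is simple enough to write in one line, which also makes both kinds of error easy to reproduce. The function name `suffix_s_stem` is hypothetical.

```python
def suffix_s_stem(word: str) -> str:
    """Strip a single trailing 's', assuming it marks a plural."""
    return word[:-1] if word.endswith("s") else word

for word in ["cakes", "dogs", "centuries", "is"]:
    print(word, "=>", suffix_s_stem(word))
```

"centuries" becomes "centurie", which does not match "century" (a false negative), and "is" becomes "i" (a false positive).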
Stemming
- Porter stemmer
- Most popular algorithmic stemmer
- Consists of 5 steps, each step containing a set of
rules for removing suffixes
- Produces stems not words
- Makes a number of errors and is difficult to modify
Porter Stemmer
- Example step (1 of 5)
Porter Stemmer
[Table: word pairs that should not have the same stem | word pairs that should have the same stem]
Stemming
- Krovetz stemmer
- Hybrid algorithmic-dictionary
- Word checked in dictionary
- If present, either left alone or replaced with exception
stems
- If not present, word is checked for suffixes that could be
removed
- After removal, dictionary is checked again
- Produces words not stems
Stemmer Comparison
Original text
Document will describe marketing strategies carried out by U.S. companies for
their agricultural chemicals, report predictions for market share of such
chemicals, or report market statistics for agrochemicals, pesticide, herbicide,
fungicide, insecticide, fertilizer, predicted sales, market share, stimulate demand,
price cut, volume of sales

Porter stemmer
market strateg carr compan agricultur chemic report predict market share
chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil
predict sale stimul demand price cut volum sale

Krovetz stemmer
marketing strategy carry company agriculture chemical report prediction market
share chemical report market statistic agrochemic pesticide herbicide fungicide
insecticide fertilizer predict sale stimulate demand price cut volume sale
Stemming
- Generally a small (but significant) effectiveness
improvement for English
- Can be crucial for some languages (e.g.,
Arabic, Russian)
Example
First pass extraction
The Transporter (2002)
PG-13 92 min Action, Crime, Thriller 11 October 2002 (USA)

Frank is hired to "transport" packages for unknown clients and
has made a very good living doing so. But when asked to move a
package that begins moving, complications arise.

Tokenization
the transporter 2002
pg 13 92 min action crime thriller 11 october 2002 usa

frank is hired to transport packages for unknown clients and has
made a very good living doing so but when asked to move a
package that begins moving complications arise
Stopwords removal
transporter 2002
pg 13 92 min action crime thriller 11 october 2002 usa

frank hired transport packages unknown clients has made very
good living doing so when asked move package begins moving
complications arise
Stemming (Porter stemmer)
transport 2002
pg 13 92 min action crime thriller 11 octob 2002 usa

frank hire transport packag unknown client ha made veri good
live do so when ask move packag begin move complic aris
