Learning to Rank
Search Engines, Section 7.6
30/10/2016
18/10/2016
(figure: the crawling process - seed URLs are placed in the URL frontier, pages are crawled and parsed, and newly discovered URLs from the unseen Web are added back to the frontier)
Web Crawling
- Web crawlers spend a lot of time waiting for
responses to requests
- To reduce this inefficiency, web crawlers use
threads and fetch hundreds of pages at once
- Crawlers could potentially flood sites with
requests for pages
- To avoid this problem, web crawlers use
politeness policies
- e.g., delay between requests to same web server
Web Crawling
- Freshness
- Not possible to constantly check all pages
- Must check important pages (i.e., visited by many
users) and pages that change frequently
- Focused crawling
- Attempts to download only those pages that are
about a particular topic
- Deep Web
- Sites that are difficult for a crawler to find are
collectively referred to as the deep (or hidden) Web
Deep Web Crawling
- Much larger than conventional Web
- Three broad categories:
- Private sites
- no incoming links, or may require log in with a valid
account
- Form results
- Sites that can be reached only after entering some data
into a form
- Scripted pages
- Pages that use JavaScript, Flash, or another client-side
language to generate links
Surfacing the Deep Web
- Pre-compute all interesting form submissions
for each HTML form
- Each form submission corresponds to a
distinct URL
- Add URLs for each form submission into
search engine index
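A minimal sketch of this pre-computation step, assuming the form's fields and their interesting values have already been identified (the base URL and field names below are made up):

    from itertools import product
    from urllib.parse import urlencode

    # Hypothetical form: base URL and candidate values per field
    base_url = "http://example.com/search"
    field_values = {
        "category": ["books", "music", "movies"],
        "year": ["2015", "2016"],
    }

    # Pre-compute one distinct URL per interesting form submission
    urls = []
    for combo in product(*field_values.values()):
        params = dict(zip(field_values.keys(), combo))
        urls.append(base_url + "?" + urlencode(params))

    # Each URL can now be added to the crawler frontier and, eventually, the index
    for url in urls:
        print(url)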
Link Analysis
Link Analysis
- Links are a key component of the Web
- Important for navigation, but also for search
- Anchor text example:
  <a href="http://example.com">Example website</a>
- page2 links to pageX with anchor text "information retrieval":
  List of winter schools in 2013:
  <ul>
    <li><a href="pageX">information retrieval</a></li>
    …
  </ul>
- page3 links to pageX with anchor text "IR lectures":
  The PROMISE Winter School will feature a range of <a href="pageX">IR lectures</a> by experts from the field.
Fielded Document
Representation
title: Winter School 2013
body: The aim of the PROMISE Winter School 2013 on "Bridging between
Information Retrieval and Databases" is to give participants a
grounding in the core topics that constitute the multidisciplinary
area of information access and retrieval to unstructured,
semistructured, and structured information. The school is a week-
long event consisting of guest lectures from invited speakers who
are recognized experts in the field. [...]
PR(a) = \frac{q}{T} + (1 - q) \sum_{i=1}^{n} \frac{PR(p_i)}{L(p_i)}

- PR(a): PageRank of page a
- T: total number of pages in the Web graph
- p_1…p_n: pages that point to page a
- L(p_i): number of outgoing links of page p_i
Technical Issues
- This is a recursive formula. PageRank values
need to be computed iteratively
- We don’t know the PageRank values at start. We
can assume equal values (1/T)
- Number of iterations?
- Good approximation already after a small number of
iterations; stop when change in absolute values is
below a given threshold
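A minimal sketch of the iterative computation (not the exact notation of the slides); the graph is assumed to be given as a dict mapping each page to the set of pages it links to, with no dead ends:

    def pagerank(out_links, q=0.15, max_iter=100, tol=1e-6):
        """Iterative PageRank; out_links maps each page to its set of outgoing links."""
        pages = list(out_links)
        T = len(pages)
        pr = {p: 1.0 / T for p in pages}          # start from equal values (1/T)
        for _ in range(max_iter):
            new_pr = {}
            for a in pages:
                # sum the PR contributions of all pages that link to a
                s = sum(pr[p] / len(out_links[p]) for p in pages if a in out_links[p])
                new_pr[a] = q / T + (1 - q) * s
            # stop when the change in absolute values is below the threshold
            if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
                return new_pr
            pr = new_pr
        return pr

    # small made-up graph: A -> {B, C}, B -> {C}, C -> {A}
    print(pagerank({"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}, q=0))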
Example (q=0, no random jumps)
- Iteration 0: assume that the PageRank values are the same for all pages:
  PR(A) = PR(B) = PR(C) = 0.33
- Iteration 1:
  PR(C) = \frac{PR(A)}{2} + \frac{PR(B)}{1} = 0.5
  (the PageRank of C depends on the PageRank values of A and B)
- At the end of Iteration 1: PR(A) = 0.33, PR(B) = 0.17, PR(C) = 0.5
- Iteration 2:
  PR(C) = \frac{PR(A)}{2} + \frac{PR(B)}{1} = 0.33
- At the end of Iteration 2: PR(A) = 0.5, PR(B) = 0.17, PR(C) = 0.33
- At the end of Iteration 3: PR(A) = 0.33, PR(B) = 0.25, PR(C) = 0.42
Example #2 (q=0.2, with random jumps)
- Iteration 0: assume that the PageRank values are the same for all pages:
  PR(A) = PR(B) = PR(C) = 0.33
- Iteration 1:
  PR(C) = \frac{0.2}{3} + 0.8 \left( \frac{PR(A)}{2} + \frac{PR(B)}{1} \right) = 0.47
Exercise #1
Dealing with "rank sinks"
- Handling "dead ends" (or rank sinks), i.e.,
pages that have no outlinks
- Assume that it links to all other pages in the
collection (including itself) when computing
PageRank scores
Rank sink
Exercise #2
Online PageRank Checkers
PageRank Summary
- Important example of query-independent
document ranking
- Web pages with high PageRank are preferred
- It is, however, not as important as the
conventional wisdom holds
- Just one of the many features a modern web search
engine uses
- But it tends to have the most impact on popular
queries
Incorporating Document
Importance (e.g. PageRank)
score'(d, q) = score(d) \cdot score(d, q)

P(d|q) = \frac{P(q|d) P(d)}{P(q)} \propto P(q|d) P(d)

where P(d) is the document prior.
Stephen Robertson, SIGIR’17 keynote
Search Engine Optimization
Search Engine
Optimization (SEO)
- A process aimed at making the site appear
high on the list of (organic) results returned by
a search engine
- Considers how search engines work
- Major search engines provide information and
guidelines to help with site optimization
- Google/Bing Webmaster Tools
- Common protocols
- Sitemaps (https://www.sitemaps.org)
- robots.txt
White hat vs. black hat SEO
- White hat
- Conforms to the search engines' guidelines and
involves no deception
- "Creating content for users, not for search engines"
- Black hat
- Disapproved of by search engines, often involve
deception
- Hidden text
- Cloaking: returning a different page, depending on
whether it is requested by a human visitor or a robot
SEO Techniques
- Editing website content and HTML source
- Increase relevance to specific keywords
- Increasing the number of incoming links
("backlinks")
- Focus on long tail queries
- Social media presence
source: http://searchengineland.com/figz/wp-content/seloads/2017/06/2017-SEO_Periodic_Table_1920x1080.png
DAT630
Clustering
Introduction to Data Mining, Chapter 8
16/10/2017
c_i = \frac{1}{m_i} \sum_{x \in C_i} x
Example
Centroid computation
- What is the centroid of a cluster containing
three 2-dimensional points: (1,1), (2,3), (6,2)?
- Centroid:
((1+2+6)/3, (1+3+2)/3) = (3,2)
5. Stopping Condition
- Most of the convergence occurs in the early
steps
- "Centroids do not change" is often replaced
with a weaker condition
- E.g., repeat until only 1% of the points change
Exercise
Note
- There are other choices for proximity,
centroids, and objective functions, e.g.,
- Proximity function: Manhattan (L1)
Centroid: median
Objective function: minimize sum of L1 distance of an
object to its cluster centroid
- Proximity function: cosine
Centroid: mean
Objective function: maximize sum of cosine sim. of an
object to its cluster centroid
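A minimal sketch of the basic algorithm with Euclidean proximity and mean centroids (the point format and the random initialization below are assumptions, not the course's reference implementation):

    import random

    def kmeans(points, k, max_iter=100):
        """Basic K-means for a list of numeric tuples."""
        centroids = random.sample(points, k)
        for _ in range(max_iter):
            # assignment step: put each point in the cluster of its closest centroid
            clusters = [[] for _ in range(k)]
            for x in points:
                i = min(range(k),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
                clusters[i].append(x)
            # update step: recompute each centroid as the mean of its cluster
            new_centroids = []
            for i, c in enumerate(clusters):
                if c:
                    new_centroids.append(tuple(sum(vals) / len(c) for vals in zip(*c)))
                else:
                    new_centroids.append(centroids[i])   # keep old centroid for an empty cluster
            if new_centroids == centroids:               # stopping condition: centroids do not change
                break
            centroids = new_centroids
        return centroids, clusters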
What is the complexity?
- m number of points, n number of attributes,
K number of clusters
- Space requirements: O(?)
- Time requirements: O(?)
Complexity
- m number of points, n number of attributes,
K number of clusters
- Space requirements: O((m+K)*n)
- Modest, as only the data points and centroids are
stored
- Time requirements: O(I*K*m*n)
- I is the number of iterations required for convergence
- Modest, linear in the number of data points
K-means Issues
- Depending on the initial (random) selection of
centroids, different clusterings can be produced
- Steps 3 and 4 are only guaranteed to find a
local optimum
- Empty clusters may be obtained
- replacement of centroid by (i) farthest point to any
other centroid, or (ii) chosen among those in the
cluster with highest SSE
K-means Issues (2)
- Sometimes outliers must be kept
- E.g., in data compression all points must be clustered
- In general, outliers may be addressed by eliminating them to improve clustering
- Before clustering: by outlier detection techniques
- After clustering: by eliminating (i) points whose SSE is high, or (ii) small clusters directly, as likely outliers
K-means Issues (3)
- Postprocessing for reducing SSE
- and ideally not introducing more clusters
- How? Alternating splitting and merging steps
- Decrease SSE by using more clusters
- Splitting (e.g., the cluster with the highest SSE)
- Introducing a new centroid
- Fewer clusters while trying not to increase SSE
- Dispersing a cluster (e.g., the one with the lowest SSE)
- Merging 2 clusters (e.g., the ones with the closest centroids)
Bisecting K-means
- Straightforward extension of the basic K-
means algorithm
- Idea:
- Split the set of data points to two clusters
- Select one of these clusters to split
- Repeat until K clusters have been produced
- The resulting clusters are often used as the
initial centroids for the basic K-means
algorithm
Bisecting K-means Alg.
1. Initial cluster contains all data points
2. repeat
3. Select a cluster to split
4. for a number of trials
5. Bisect the selected cluster using basic K-
means
6. end for
7. Select the clusters from the bisection with the
lowest total SSE
8. until we have K clusters
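A minimal sketch of the algorithm above, reusing the kmeans() sketch shown earlier; selecting the largest cluster to split is just one of the possible choices discussed next:

    def bisecting_kmeans(points, K, trials=5):
        """Sketch of bisecting K-means built on the basic kmeans() sketch."""
        clusters = [points]                       # 1. initial cluster contains all data points
        while len(clusters) < K:                  # 2.-8. repeat until we have K clusters
            cluster = max(clusters, key=len)      # 3. select a cluster to split (here: the largest)
            clusters.remove(cluster)
            best = None
            for _ in range(trials):               # 4.-6. several trial bisections with basic K-means
                centroids, split = kmeans(cluster, 2)
                sse = sum(sum((a - b) ** 2 for a, b in zip(x, centroids[i]))
                          for i, c in enumerate(split) for x in c)
                if best is None or sse < best[0]:
                    best = (sse, split)
            clusters.extend(best[1])              # 7. keep the bisection with the lowest total SSE
        return clusters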
Selecting a Cluster to Split
- Number of possible ways
- Largest cluster
- Cluster with the largest SSE
- Combine size and SSE
- Different choices result in different clusters
Hierarchical Clustering
- By recording the sequence of clusterings
produced, bisecting K-means can also
produce a hierarchical clustering
Limitations
- K-means has difficulties detecting clusters
when they have
- differing sizes
- differing densities
- non-spherical shapes
- K-means has problems when the data contains
outliers
Example: differing sizes
(figure: K-means has difficulties when the clusters have differing sizes)
(figure: hierarchical clustering shown as a nested cluster diagram and as a dendrogram over points 1-6, with merge heights between 0.05 and 0.15)
Strengths
- Do not have to assume any
particular number of clusters
- Any desired number of clusters (e.g., K=4) can be obtained by cutting the dendrogram at the proper level
- They may correspond to
meaningful taxonomies
- E.g., in biological sciences
Basic Agglomerative
Hierarchical Clustering Alg.
1. Compute the proximity matrix
2. repeat
3. Merge the closest two clusters
4. Update the proximity matrix
5. until only one cluster remains
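A minimal sketch using SciPy's hierarchical clustering routines instead of a hand-rolled proximity-matrix update (the toy points are made up):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # toy 2-D data points (assumed for illustration)
    X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                  [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

    Z = linkage(X, method="single")                    # MIN; also "complete" (MAX) or "average"
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
    print(labels)
    # dendrogram(Z) would plot the full merge sequence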
Example: starting situation
- Start with clusters of individual points (p1, p2, …, p12) and a proximity matrix over them
Example: intermediate situation
- After some merging steps, we have some clusters (C1, C2, C3, C4, C5) and a proximity matrix over the clusters
- We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
Example: after merging
- How do we update the proximity matrix for the merged cluster C2 ∪ C5?
Defining the Proximity
between Clusters
- MIN (single link)
- MAX (complete link)
- Group average
- Distance between
centroids
Single link (MIN)
- Proximity of two clusters is based on the two
most similar (closest) points in the different
clusters
- Determined by one pair of points, i.e., by one link in
the proximity graph
Complete link (MAX)
- Proximity of two clusters is based on the two
least similar (most distant) points in the
different clusters
- Determined by all pairs of points in the two clusters
Group average
- Proximity of two clusters is the average of
pairwise proximity between points in the two
clusters:
proximity(C_i, C_j) = \frac{\sum_{x \in C_i, y \in C_j} proximity(x, y)}{m_i \cdot m_j}
- Need to use average connectivity for scalability since
total proximity favors large clusters
Strengths and Weaknesses
- Single link (MIN)
- Strength: can handle non-elliptical shapes
- Weakness: sensitive to noise and outliers
- Complete link (MAX)
- Strength: less susceptible to noise and outliers
- Weakness: tends to break large clusters
- Group average
- Strength: less susceptible to noise and outliers
- Weakness: biased towards globular clusters
Prototype-based methods
- Represent clusters by their centroids
- Calculate the proximity based on the distance between the centroids of the clusters
- Ward’s method
- Similarity of two clusters is based on the increase in
SSE when two clusters are merged
- Very similar to group average if distance between points
is distance squared
Exercise
Key Characteristics
- No global objective function that is directly
optimized
- No problems with choosing initial points or
running into local minima
- Merging decisions are final
- Once a decision is made to combine two clusters, it
cannot be undone
What is the complexity?
- m is the number of points
- Space complexity O(?)
- Time complexity O(?)
Complexity
- Space complexity: O(m^2)
- The proximity matrix requires the storage of m^2/2 proximities (it's symmetric)
- Space to keep track of clusters is proportional to the number of clusters (m-1, excluding singleton clusters)
- Time complexity: O(m^3)
- Computing the proximity matrix: O(m^2)
- m-1 iterations (Steps 3 and 4)
- It's possible to reduce the total cost to O(m^2 log m) by keeping the data in a sorted list (or heap)
Summary
- Typically used when the underlying application
requires a hierarchy
- Generally good clustering performance
- Expensive in terms of computation and storage
DAT630
Classification
Alternative Techniques
Introduction to Data Mining, Chapter 5
09/10/2017
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Coverage and accuracy of a rule
(figure: 10-record training set with Refund, Marital Status, Taxable Income, Evade)
- Example rule: (Status=Single) → No
  Coverage = 40%, Accuracy = 50%
How does it work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Rule Ordering Schemes
- Rule-based ordering
- Individual rules are ranked based on some quality
measure (e.g., accuracy, coverage)
- Class-based ordering
- Rules that belong to the same class appear together
- Rules are sorted on the basis of their class
information (e.g., total description length)
- The relative order of rules within a class does not
matter
Rule Ordering Schemes
- Indirect Method
- Extract rules from other classification models (e.g.
decision trees, neural networks, etc)
From Decision Trees To Rules
(figure: decision tree - Refund {Yes → NO; No → Marital Status}; Marital Status {Single, Divorced → Taxable Income; Married → NO}; Taxable Income {< 80K → NO; > 80K → YES})
Classification Rules:
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
(figure: a model is induced from the Training Set and then applied to the Test Set records with unknown class labels)
Opposite strategy
- Lazy learners
- Delay the process of modeling the data until it is
needed to classify the test examples
Instance-Based Classifiers
- Store the training records
- Use the training records to predict the class label of unseen cases
(figure: a set of stored cases with attributes Atr1…AtrN and class labels, against which an unseen case is compared)
Instance Based Classifiers
- Rote-learner
- Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly
- Nearest neighbors
- Uses k “closest” points (nearest neighbors) for
performing classification
Nearest neighbors
- Basic idea
- "If it walks like a duck, quacks like a duck, then it’s
probably a duck"
(figure: the k nearest neighbors of a test record are found by computing its distance to the stored records)
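A minimal sketch of k-nearest-neighbor classification with Euclidean distance and majority voting (the training records below are made up):

    from collections import Counter
    import math

    def knn_predict(train, test_point, k=3):
        """train: list of (attribute_vector, class_label) pairs."""
        # compute the distance of the test point to every stored training record
        nearest = sorted(train, key=lambda rec: math.dist(rec[0], test_point))
        # majority vote among the k closest training records
        votes = Counter(label for _, label in nearest[:k])
        return votes.most_common(1)[0][0]

    train = [((1.0, 2.0), "duck"), ((1.2, 1.9), "duck"), ((5.0, 5.5), "goose")]
    print(knn_predict(train, (1.1, 2.1), k=3))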
Summary
- Part of a more general technique called
instance-based learning
- Use specific training instances to make predictions
without having to maintain an abstraction (model)
derived from data
- Because there is no model building, classifying
a test example can be quite expensive
- Nearest-neighbors make their predictions
based on local information
- Susceptible to noise
Bayes Classifier
Bayes Classifier
- In many applications the relationship between
the attribute set and the class variable is
non-deterministic
- The label of the test record cannot be predicted with
certainty even if it was seen previously during training
- A probabilistic framework for solving
classification problems
- Treat X and Y as random variables and capture their
relationship probabilistically using P(Y|X)
Example
- Football game between teams A and B
- Team A won 65%, Team B won 35% of the time
- Among the games Team A won, 30% when game
hosted by B
- Among the games Team B won, 75% when B
played home
- Which team is more likely to win if the game is
hosted by Team B?
Probability Basics
- Conditional probability
P (X, Y ) = P (X|Y )P (Y ) = P (Y |X)P (X)
- Bayes' theorem
  P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}
Example
- Probability Team A wins: P(win=A) = 0.65
- Probability Team B wins: P(win=B) = 0.35
- Probability Team A wins when B hosts:
P(hosted=B|win=A) = 0.3
- Probability Team B wins when playing at home:
P(hosted=B|win=B) = 0.75
- Who wins the next game that is hosted by B?
P(win=B|hosted=B) = ?
P(win=A|hosted=B) = ?
Solution
- Using Bayes' theorem: P(Y|X) = \frac{P(X|Y) P(Y)}{P(X)}
- P(win=B|hosted=B) = 0.5738
- P(win=A|hosted=B) = 0.4262
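Worked out (the denominator P(hosted=B) follows from the law of total probability):
P(hosted=B) = P(hosted=B|win=B)·P(win=B) + P(hosted=B|win=A)·P(win=A) = 0.75·0.35 + 0.3·0.65 = 0.4575
P(win=B|hosted=B) = 0.2625 / 0.4575 = 0.5738
P(win=A|hosted=B) = 0.195 / 0.4575 = 0.4262
Team B is therefore more likely to win the game it hosts.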
- For continuous attributes, a normal (Gaussian) distribution can be assumed:
  P(X_i = x_i | Y = y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)
- The parameters of the distribution are estimated from the training data (from instances that belong to class y_j)
  - sample mean \mu_{ij} and variance \sigma_{ij}^2
Example

Tid | Refund | Marital Status | Taxable Income | Evade
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Example
- Classify X = {Refund=No, Marital st.=Married, Income=120K} using the training set above
- Laplace smoothing
  P(X_i = x_i | Y = y) = \frac{n_c + 1}{n + c}
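A minimal sketch of the Laplace-smoothed estimate, assuming n counts the class-y training records, n_c those among them with X_i = x_i, and c the number of distinct values the attribute can take (one common reading of the formula):

    def laplace_estimate(values, target_value, c):
        """P(Xi = xi | Y = y) with Laplace smoothing.
        values: the Xi values of the n training records that belong to class y
        c: assumed here to be the number of distinct values of Xi"""
        n = len(values)
        n_c = sum(1 for v in values if v == target_value)
        return (n_c + 1) / (n + c)

    # e.g., 0 of the 3 class-"Yes" records have Marital Status = Married, 3 distinct statuses
    print(laplace_estimate(["Divorced", "Single", "Single"], "Married", 3))  # (0+1)/(3+3)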
(figure: confusion matrix over the predicted Positive/Negative classes, from which Precision P and Recall R are computed)

F_1 = \frac{2RP}{R + P}
Multiclass Problem
Multiclass Classification
- Many of the approaches are originally
designed for binary classification problems
- Many real-world problems require data to be
divided into more than two categories
- Two approaches
- One-against-rest (1-r)
- One-against-one (1-1)
- Predictions need to be combined in both cases
One-against-rest
- Y={y1, y2, … yK} classes
- For each class yi
- Instances that belong to yi are positive examples
- All other instances are negative examples
- Combining predictions
- If an instance is classified positive, the positive class
gets a vote
- If an instance is classified negative, all classes
except for the positive class receive a vote
(figure: example with K=4 classes - each of the four one-against-rest classifiers labels the instance as + for its own class or - for the rest, and the votes are totaled)
One-against-one
- Y={y1, y2, … yK} classes
- Construct a binary classifier for each pair of
classes (yi, yj)
- K(K-1)/2 binary classifiers in total
- Combining predictions
- The positive class receives a vote in each pairwise
comparison
(figure: example with K=4 classes - each of the K(K-1)/2 = 6 pairwise classifiers (y1 vs y2, y1 vs y3, …, y3 vs y4) casts a vote for its winner, and the votes are totaled)
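A minimal sketch of combining one-against-one predictions, assuming the trained pairwise classifiers are available as callables (the names are illustrative, not a specific library API):

    from collections import Counter
    from itertools import combinations

    def one_against_one_predict(instance, classes, pairwise_classifiers):
        """pairwise_classifiers[(yi, yj)](instance) returns the winning class, yi or yj."""
        votes = Counter()
        for yi, yj in combinations(classes, 2):
            winner = pairwise_classifiers[(yi, yj)](instance)
            votes[winner] += 1                      # winner of each pairwise comparison gets a vote
        return votes.most_common(1)[0][0]           # class with the most total votes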
Locality Sensitive Hashing
Vinay Setty
([email protected])
1
Finding Similar Items Problem
‣ Similar Items
‣ Finding similar web pages and news articles
‣ Finding near duplicate images
‣ Plagiarism detection
‣ Duplications in Web crawls
‣ Find nearest-neighbors in high-dimensional space
‣ Nearest neighbors are points that are a small distance
apart
Very similar news articles (figure)
Near duplicate images (figure)
The Big Picture
Document → Shingling → the set of strings of length k that appear in the document
→ Min Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity
Three Essential Steps for Similar Docs
6
Documents as High-Dim. Data
‣ Step 1: Shingling: Convert documents to sets
‣ Simple approaches:
‣ Document = set of words appearing in document
‣ Document = set of “important” words
‣ Don’t work well for this application. Why?
Define: Shingles
‣ A k-shingle (or k-gram) for a document is a sequence of k
tokens that appears in the doc
‣ Tokens can be characters, words or something else,
depending on the application
‣ Assume tokens = characters for examples
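A minimal sketch of character k-shingling and the Jaccard similarity of the resulting sets (the toy strings are made up):

    def k_shingles(text, k):
        """Return the set of character k-shingles (k-grams) of a document."""
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    d1, d2 = "abcab", "abcabd"
    print(k_shingles(d1, 2))                                  # {'ab', 'bc', 'ca'}
    print(jaccard(k_shingles(d1, 2), k_shingles(d2, 2)))      # 3/4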
Similarity Metric for Shingles
Working Assumption
Motivation for Minhash/LSH
Encoding Sets as Bit Vectors
‣ Many similarity problems can be
formalized as finding subsets that
have significant intersection
‣ Encode sets using 0/1 (bit, boolean) vectors
‣ One dimension per element in the universal set
From Sets to Boolean Matrices
‣ Rows = elements (shingles); columns = sets (documents)
‣ 1 in row e and column s if and only if e is a member of s
‣ Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
‣ Typical matrix is sparse!
‣ Each document is a column. Example matrix (shingles × documents):

  C1 C2 C3 C4
   1  1  1  0
   1  1  0  1
   0  1  0  1
   0  0  0  1
   1  0  0  1
   1  1  1  0
   1  0  1  0

‣ Example: sim(C1, C2) = ?
‣ Size of intersection = 3; size of union = 6; Jaccard similarity (not distance) = 3/6
‣ d(C1, C2) = 1 – (Jaccard similarity) = 3/6
Hashing Columns (Signatures)
‣ Key idea: “hash” each column C to a small signature h(C), such that:
‣ (1) h(C) is small enough that the signature fits in RAM
‣ (2) sim(C1, C2) is the same as the “similarity” of signatures h(C1) and h(C2)
Min-Hashing
‣ Imagine the rows of the boolean matrix permuted under
random permutation π
Example
Input matrix (shingles × documents) with three random permutations π1, π2, π3:

  π1 π2 π3 | C1 C2 C3 C4
   2  4  3 |  1  0  1  0
   3  2  4 |  1  0  0  1
   7  1  7 |  0  1  0  1
   6  3  2 |  0  1  0  1
   1  6  6 |  0  1  0  1
   5  7  1 |  1  0  1  0
   4  5  5 |  1  0  1  0

Signature matrix M (one row per permutation, one column per document):

   2 1 2 1
   2 1 4 1
   1 2 1 2

For example, under π1 the 2nd element of the permutation is the first to map to a 1 in column C1, so its signature value is 2; where the 4th element of a permutation is the first to map to a 1, the signature value is 4.
Note: Another (equivalent) way is to store row indexes:
   1 5 1 5
   2 3 1 3
   6 4 6 4
Four Types of Rows
‣ Given cols C1 and C2, rows may be classified as:
C1 C2
A 1 1
B 1 0
C 0 1
D 0 0
‣ a = # rows of type A, etc.
20
Similarity for Signatures
‣ We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
‣ Now generalize to multiple hash functions - why?
‣ Permuting rows is expensive for large number of rows
‣ Instead we want to simulate the effect of a random permutation using hash functions
‣ The similarity of two signatures is the fraction of the hash functions in which they agree
‣ For the example above:

             1-3    2-4    1-2    3-4
  Col/Col    0.75   0.75   0      0
  Sig/Sig    0.67   1.00   0      0
Min-Hash Signatures
Min-Hash Signatures Example
(figure: the signature matrix is filled in incrementally, row by row, as each row of the input matrix is processed)
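A minimal sketch that simulates random permutations with hash functions of the form h(r) = (a·r + b) mod p, as suggested above (the constants and the example columns are assumptions):

    import random

    def minhash_signatures(column_sets, num_hashes=100, seed=42):
        """column_sets: one set of 1-row indexes per document/column."""
        random.seed(seed)
        p = 2 ** 31 - 1                                   # a prime larger than the number of rows
        hash_funcs = [(random.randrange(1, p), random.randrange(p)) for _ in range(num_hashes)]
        signatures = []
        for rows in column_sets:
            # for each hash function, keep the minimum hash value over the column's 1-rows
            signatures.append([min((a * r + b) % p for r in rows) for a, b in hash_funcs])
        return signatures

    def signature_similarity(s1, s2):
        """Fraction of the hash functions on which the two signatures agree."""
        return sum(x == y for x, y in zip(s1, s2)) / len(s1)

    # columns C1..C4 from the example above, given as sets of 0-based 1-row indexes
    cols = [{0, 1, 5, 6}, {2, 3, 4}, {0, 5, 6}, {1, 2, 3, 4}]
    sigs = minhash_signatures(cols)
    print(round(signature_similarity(sigs[0], sigs[2]), 2))   # ≈ Jaccard(C1, C3) = 0.75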
LSH: First Cut
‣ Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s=0.8)
Candidates from Min-Hash
Partition M into b Bands
‣ The signature matrix M is divided into b bands, with r rows per band (one column of M = one signature)
Hashing Bands
‣ Each band of each column is hashed into a bucket
‣ (figure: columns 2 and 6 of matrix M land in the same bucket for one band - they are probably identical, i.e., a candidate pair)
Partition M into Bands
Simplifying Assumption
b bands, r rows/band
Example of Bands
C1, C2 are 80% Similar
‣ Find pairs of ≥ s=0.8 similarity, set b=20, r=5
‣ Assume: sim(C1, C2) = 0.8
‣ Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We
want them to hash to at least 1 common bucket (at least one band
is identical)
C1, C2 are 30% Similar
‣ Find pairs of ≥ s=0.8 similarity, set b=20, r=5
‣ Assume: sim(C1, C2) = 0.3
‣ Since sim(C1, C2) < s we want C1, C2 to hash to NO
common buckets (all bands should be different)
LSH Involves a Tradeoff
‣ Pick:
‣ The number of Min-Hashes (rows of M)
‣ The number of bands b, and
‣ The number of rows r per band
36
Analysis of LSH – What We Want
(figure: the ideal curve of the probability of sharing a bucket as a function of the similarity t of two sets - probability = 1 if t > s and no chance if t < s, where s is the similarity threshold)
What One Band of One Row Gives You
‣ Remember: with a single hash function, the probability of equal hash-values = similarity
(figure: the probability of sharing a bucket grows linearly with similarity; the regions on either side of the threshold correspond to false positives and false negatives)
What b Bands of r Rows Gives You
‣ Probability of sharing a bucket (at least one band identical), for similarity s = sim(C1, C2) of two sets:
  1 - (1 - s^r)^b
  ‣ s^r: all rows of a band are equal
  ‣ 1 - s^r: some row of a band is unequal
  ‣ (1 - s^r)^b: no bands are identical
‣ The threshold where the curve rises steepest is approximately t ≈ (1/b)^{1/r}
Example: b = 20; r = 5
‣ Similarity threshold s and the probability that at least 1 band is identical:

  s    1-(1-s^r)^b
  .2   .006
  .3   .047
  .4   .186
  .5   .470
  .6   .802
  .7   .975
  .8   .9996
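A minimal sketch that reproduces the table above and can be used to try other values of b and r:

    def candidate_probability(s, r=5, b=20):
        """Probability that two columns with similarity s become a candidate pair."""
        return 1 - (1 - s ** r) ** b

    for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
        print(s, round(candidate_probability(s), 3))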
LSH Summary
DAT630
Classification
Basic Concepts, Decision Trees, and Model Evaluation
Introduction to Data Mining, Chapter 4
25/09/2017
Attribute types: Nominal, Ordinal, Interval, Ratio
General approach
(figure: a learning algorithm induces a model from the Training Set - records whose class labels are known; the model is then applied to the Test Set - records with unknown class labels - to deduce their labels)
Objectives for Learning Alg.
- Should fit the input data (Training Set)
- Should correctly predict class labels for unseen data (Test Set)
Learning Algorithms
- Decision trees
- Rule-based
- Naive Bayes
- Support Vector Machines
- Random forests
- k-nearest neighbors
- …
Machine Learning vs.
Data Mining
- Similar techniques, but different goal
- Machine learning is focused on developing and
designing learning algorithms
- More abstract, e.g., features are given
- Data Mining is applied Machine Learning
- Performed by a person who has a goal in mind and
uses Machine Learning techniques on a specific
dataset
- Much of the work is concerned with data
(pre)processing and feature engineering
Today
- Decision trees
- Binary class labels
- Positive or Negative
Evaluation
- Measuring the performance of a classifier
- Based on the number of records correctly and
incorrectly predicted by the model
- Counts are tabulated in a table called the
confusion matrix
- Compute various performance metrics based
on this matrix
Confusion Matrix

                    Predicted Positive       Predicted Negative
Actual Positive     True Positives (TP)      False Negatives (FN)
                                             (Type II Error: failing to raise an alarm)
Actual Negative     False Positives (FP)     True Negatives (TN)
                    (Type I Error: raising a false alarm)
Example: "Is the man innocent?"
(figure: confusion matrix where Positive = Innocent and Negative = Guilty; letting a guilty person go free is the error of impunity)
Evaluation Metrics
- Summarizing performance in a single number
- Accuracy
  = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP + FP + TN + FN}
- Error rate
  = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{FP + FN}{TP + FP + TN + FN}
Decision Tree
Decision Tree Root node
no incoming edges
zero or more outgoing edges
Decision Tree
Internal node
exactly one incoming edge
two or more outgoing edges
Decision Tree
Apply Model to Test Data
- Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
- Start from the root of the tree and follow the matching branches:
  Refund? → No → MarSt? → Married → NO
- The record is assigned the class label of the leaf reached: "No"
(decision tree: Refund {Yes → NO; No → MarSt}; MarSt {Single, Divorced → TaxInc; Married → NO}; TaxInc {< 80K → NO; > 80K → YES})
Decision Tree Induction
(figure: the learning algorithm induces a decision tree model from the Training Set; the tree is then applied to the Test Set)
Tree Induction
- There are exponentially many decision trees
that can be constructed from a given set of
attributes
- Finding the optimal tree is computationally
infeasible (NP-hard)
- Greedy strategies are used
- Grow a decision tree by making a series of locally
optimum decisions about which attribute to use for
splitting the data
Hunt's algorithm
(figure: the tree is grown recursively on the Refund / Marital Status / Taxable Income training set, with leaves labeled "Cheat" / "Don't Cheat")
Tree Induction Issues
- Determine how to split the records
- How to specify the attribute test condition?
- How to determine the best split?
- Determine when to stop splitting
How to Specify Test
Condition?
- Depends on attribute types
- Nominal
- Ordinal
- Continuous
- Depends on number of ways to split
- 2-way split
- Multi-way split
Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as distinct values
  CarType → {Family}, {Sports}, {Luxury}
- 2-way (binary) split: divide the values into two subsets
  CarType → {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as distinct values
  Size → {Small}, {Medium}, {Large}
- 2-way (binary) split: divide the values into two subsets
  Size → {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}
Splitting Based on
Continuous Attributes
- Different ways of handling
- Discretization to form an ordinal categorical attribute
- Static – discretize once at the beginning
- Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or clustering
Example: a node with class counts C0: 5, C1: 5 is non-homogeneous (high degree of impurity); a node with C0: 9, C1: 1 is homogeneous (low degree of impurity)
Impurity Measures
- Measuring the impurity of a node
- P(i|t) = fraction of records belonging to class i at a
given node t
- c is the number of classes
Entropy(t) = -\sum_{i=0}^{c-1} P(i|t) \log_2 P(i|t)

Gini(t) = 1 - \sum_{i=0}^{c-1} P(i|t)^2
Exercise
- Compute the Entropy, Gini, and Classification error for nodes with the following class counts:
  Node 1: C1 = 0, C2 = 6
  Node 2: C1 = 1, C2 = 5
  Node 3: C1 = 2, C2 = 4
- Entropy(t) = -\sum_{i=0}^{c-1} P(i|t) \log_2 P(i|t)
- Gini(t) = 1 - \sum_{i=0}^{c-1} P(i|t)^2
- Classification error(t) = 1 - \max_i P(i|t)
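A minimal sketch that evaluates the three impurity measures for the class counts in the exercise:

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def gini(counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    def classification_error(counts):
        return 1 - max(counts) / sum(counts)

    for counts in ([0, 6], [1, 5], [2, 4]):        # the three nodes in the exercise
        print(counts, round(entropy(counts), 3), round(gini(counts), 3),
              round(classification_error(counts), 3))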
Split on A or on B?
(figure: before splitting, the node has class counts C0: N00, C1: N01 and impurity M0; splitting on A produces child nodes with impurities M1 and M2 - weighted combination M12; splitting on B produces M3 and M4 - weighted combination M34)
- N is the number of training instances of class C0/C1 for the given node
- M is an impurity measure (Entropy, Gini, etc.)
- Gain = M0 – M12 vs. M0 – M34
- Gain = goodness of a split
Information Gain
- When Entropy is used as the impurity measure, it's called information gain
- Measures how much we gain by splitting a parent node p into k child nodes v_1 … v_k:

  \Delta_{info} = Entropy(p) - \sum_{j=1}^{k} \frac{N(v_j)}{N} Entropy(v_j)

  where N(v_j) is the number of records associated with the child node v_j, N the number of records of the parent, and k the number of attribute values

- Gain ratio = \frac{\Delta_{info}}{\text{Split info}}

  \text{Split info} = -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
How to Address Overfitting
- Pre-Pruning (Early Stopping Rule): stop the
algorithm before it becomes a fully-grown tree
- Typical stopping conditions for a node
- Stop if all instances belong to the same class
- Stop if all the attribute values are the same (i.e., belong to
the same split)
- More restrictive conditions
- Stop if number of instances is less than some user-
specified threshold
- Stop if class distribution of instances are independent of
the available features
- Stop if expanding the current node does not improve
impurity measures (e.g., Gini or information gain)
How to Address Overfitting
- Post-pruning: grow decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up
fashion
- If generalization error improves after trimming,
replace sub-tree by a leaf node
- Class label of leaf node is determined from majority
class of instances in the sub-tree
Methods for estimating
performance
- Holdout
- Reserve 2/3 for training and 1/3 for testing
(validation set)
- Cross validation
- Partition data into k disjoint subsets
- k-fold: train on k-1 partitions, test on the remaining
one
- Leave-one-out: k=n
Expressivity
(figure: a decision tree with test conditions x < 0.43, y < 0.33, and y < 0.47, and the corresponding axis-parallel decision boundaries that partition the unit square between the two classes)
(figure: a dataset whose true class boundary is x + y < 1 separating Class = + from Class = –; because each test condition involves a single attribute, a decision tree can only approximate such an oblique boundary with a staircase of axis-parallel splits)
Exercise
DAT630
Exploring Data
Introduction to Data Mining, Chapter 3
18/09/2017
median(x) = \begin{cases} x_{(r+1)}, & \text{if } m \text{ is odd, i.e., } m = 2r + 1 \\ \frac{1}{2}\left(x_{(r)} + x_{(r+1)}\right), & \text{if } m \text{ is even, i.e., } m = 2r \end{cases}
Mean vs. Median
- Both indicate the "middle" of the values
- If the distribution of values is skewed, then the
median is a better indicator of the middle
- The mean is sensitive to the presence of
outliers; the median provides a more robust
estimate of the middle
Trimmed Mean
- To overcome problems with the traditional
definition of a mean, the notion of a trimmed
mean is sometimes used
- A percentage p between 0 and 100 is
specified; the top and bottom (p/2)% of the
data is thrown out; then mean is calculated the
normal way
- Median is a trimmed mean with p=100%, the
standard mean corresponds to p=0%
Example
- Consider the set of values {1, 2, 3, 4, 5, 90}
- What is the mean?
- What is the median?
- What is the trimmed mean with p=40%?
Example
- Consider the set of values {1, 2, 3, 4, 5, 90}
- What is the mean? 17.5
- What is the median? (3+4)/2 = 3.5
- What is the trimmed mean with p=40%? 3.5
- Trimmed values (with top-20% and bottom-20% of
the values thrown out): {2,3,4,5}
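A minimal sketch reproducing the example (the simple index-based trimming is one possible implementation):

    def trimmed_mean(values, p):
        """Drop the top and bottom (p/2)% of the sorted values, then average the rest."""
        values = sorted(values)
        k = int(len(values) * (p / 100) / 2)      # number of values to drop at each end
        kept = values[k:len(values) - k]
        return sum(kept) / len(kept)

    data = [1, 2, 3, 4, 5, 90]
    print(sum(data) / len(data))          # mean: 17.5
    print(trimmed_mean(data, 40))         # trimmed mean with p=40%: (2+3+4+5)/4 = 3.5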
Range and Variance
- To measure the dispersion/spread of a set of
values (for continuous data)
- Range
  range(x) = \max(x) - \min(x) = x_{(m)} - x_{(1)}
- Variance
  variance(x) = s_x^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2
- Standard deviation is the square root of the variance
Selection
- Elimination or the de-emphasis of certain
objects and attributes
- May involve choosing a subset of attributes
- Dimensionality reduction is often used to reduce the
number of dimensions to two or three
- Alternatively, pairs of attributes can be considered
- May also involve choosing a subset of objects
- Visualizing all objects can result in a display that is
too crowded
Outline for this part
- General concepts
- Visualization techniques
- Although techniques are often specialized to the kind of data being analyzed, it is possible to group them by some general properties
- For example, visualization techniques for:
- Small number of attributes
- Spatio-temporal data
- High-dimensional data
Outline for this part
- General concepts
- Visualization techniques
- Histograms
- Box plots
- Scatter plots
- Contour plots
- Matrix plots
- Parallel coordinates
- Star plots
- Chernoff faces
Histograms
- Usually shows the distribution of values of a
single variable
- Divide the values into bins and show a bar plot
of the number of objects in each bin.
- The height of each bar indicates the number of
objects
- Shape of histogram depends on the number of
bins
Example
- Petal width
10 bins 20 bins
2D Histograms
- Show the joint distribution of the values of two
attributes
- Each attribute is divided into intervals and the two
sets of intervals define two-dimensional rectangles of
values
- It can show patterns not present in 1D ones
- Visually more complicated, e.g., some columns may
be hidden by others
Example
- Petal width and petal length
Box Plots
(figure: a box plot marks the 75th, 50th, 25th, and 10th percentiles; points outside the whiskers are shown as outliers)
Example
- Comparing attributes
Pie Charts
- Similar to histograms, but typically used with
categorical attributes that have a relatively
small number of values
- Common in popular articles, but used less
frequently in technical publications
- The size of relative areas can be hard to judge
- Histograms are preferred for technical work!
http://www.businessinsider.com/pie-charts-are-the-worst-2013-6
Empirical Cumulative
Distribution Function
Scatter Plots
- Attributes values determine the position
- Two-dimensional scatter plots most common,
but can have three-dimensional scatter plots
- Often additional attributes can be displayed by
using the size, shape, and color of the markers
that represent the objects
Example
Example
- Arrays of scatter plots to summarize the
relationships of several pairs of attributes
Contour Plots
(figure: example contour plot; values in Celsius)
Parallel Coordinates
- Plot the attribute values of high-dimensional
data
- Instead of using perpendicular axes, use a set
of parallel axes
- The attribute values of each object are plotted
as a point on each corresponding coordinate
axis and the points are connected by a line,
i.e., each object is represented as a line
- The ordering of attributes is important
Example
- Different ordering of attributes
Star Plots
- Similar approach to parallel coordinates, but
axes radiate from a central point
- The line connecting the values of an object is a
polygon
Example
Setosa
Versicolour
Virginica
Example
- Other useful information, such as average
values or thresholds, can also be encoded
Chernoff Faces
- Approach created by Herman Chernoff
- Each attribute is associated with a
characteristic of a face
- Size of the face, shape of jaw, shape of forhead, etc.
- The value of the attribute determines the appearance
of the corresponding facial characteristic
- Each object becomes a separate face
- Relies on human’s ability to distinguish faces
Example
Setosa
Versicolour
Virginica
Principles for Visualization
Quality
- Apprehension to perceive relations among variables
- Clarity to distinguish the most important elements
- Consistency with previous, related graphs
- Efficiency to show complex information in simple ways
- Necessity of the graph, vs alternatives
- Truthfulness when using magnitudes, relative to scales
Infographics
OLAP and Multidimensional
Data Analysis
OLAP
- Relational databases put data into tables, while
OLAP uses a multidimensional array
representation
- Such representations of data previously existed in
statistics and other fields
- There are a number of data analysis and data
exploration operations that are easier with
such a data representation
Converting Tabular Data
- Two key steps in converting tabular data into a
multidimensional array
1.Identify which attributes are to be the
dimensions and which attribute is to be the
target attribute
- The attributes used as dimensions must have
discrete values
- The target value is typically a count or continuous
value, e.g., the cost of an item
- Can have no target variable at all except the count of
objects that have the same set of attribute values
Converting Tabular Data (2)
2.Find the value of each entry in the
multidimensional array by summing the values
(of the target attribute) or count of all objects
that have the attribute values corresponding to
that entry
Example
- Petal width and length are discretized to have
categorical values: low, medium, and high
Example
- Each unique tuple of petal width, petal length,
and species type identifies one element of the
array
Example
- Cross-tabulations can be used to show slices
of the multidimensional array
Example
- Cross-tabulations can be used to show slices
of the multidimensional array
Data Cube
- The key operation of a OLAP is the formation
of a data cube
- A data cube is a multidimensional
representation of data, together with all
possible aggregates
- Aggregates that result by selecting a proper subset
of the dimensions and summing over all remaining
dimensions
Example
- Consider a data set that records the sales of
products at a number of company stores at
various dates
- This data can be represented
as a 3 dimensional array
- There are 3 two-dimensional
aggregates, 3 one-dimensional
aggregates, and 1 zero-dimensional
aggregate (the overall total)
Example
- This table shows one of the two dimensional
aggregates, along with two of the one-
dimensional aggregates, and the overall total
OLAP Operations
- Slicing
- Dicing
- Roll-up
- Drill-down
Slicing and Dicing
- Slicing is selecting a group of cells from the
entire multidimensional array by specifying a
specific value for one or more dimensions
- Dicing involves selecting a subset of cells by
specifying a range of attribute values
- This is equivalent to defining a subarray from the
complete array
- In practice, both operations can also be
accompanied by aggregation over some
dimensions
Example
Roll-up and Drill-down
- Attribute values often have a hierarchical
structure
- Each date is associated with a year, month, and
week
- A location is associated with a continent, country,
state (province, etc.), and city
- Products can be divided into various categories,
such as clothing, electronics, and furniture
- These categories often nest and form a tree
- A year contains months, which contain days
- A country contains a state which contains a city
Example
DAT630
Introduction & Data
Introduction to Data Mining, Chapters 1-2
11/09/2017
Clustering (descriptive)
- Given a set of data points, each having a set of
attributes, find clusters such that
- Data points in one cluster are more similar to one
another
- Data points in separate clusters are less similar to
one another
Types of Data
What is data?
Attributes
- An attribute is a property or characteristic of an object
(figure: example table of objects with attributes Refund, Marital Status, Taxable Income, Evade)
- A collection of attributes describes an object
- Properties of attribute values:
- Distinctness: = !=
- Order: < > <= >=
- Addition: + -
- Multiplication: * /
Types of attributes
- Nominal
- ID numbers, eye color, zip codes
- Ordinal
- Rankings (e.g., taste of potato chips on a scale from
1-10), grades, height in {tall, medium, short}
- Interval
- Calendar dates, temperatures in C or F degrees.
- Ratio
- Temperature in Kelvin, length, time, counts
- Coarser types: categorical and numeric
Attribute types

Category                    Attribute type   Description                                       Examples
Categorical (qualitative)   Nominal          Only enough information to distinguish (=, !=)    ID numbers, eye color, zip codes
                            Ordinal          Enough information to order (<, >)                grades {A,B,…,F}, street numbers
Examples
- Time in terms of AM or PM
- Binary, qualitative, ordinal
- Brightness as measured by a light meter
- Continuous, quantitative, ratio
- Brightness as measured by people’s
judgments
- Discrete, qualitative, ordinal
Examples
- Angles as measured in degrees between 0◦
and 360◦
- Continuous, quantitative, ratio
- Bronze, Silver, and Gold medals as awarded at
the Olympics
- Discrete, qualitative, ordinal
- ISBN numbers for books
- Discrete, qualitative, nominal
Characteristics of
Structured Data
- Dimensionality
- Curse of Dimensionality
- Sparsity
- Only presence counts
- Resolution
- Patterns depend on the scale
Types of data sets
- Record
- Data Matrix
- Document Data
- Transaction Data
- Graph
- Ordered
Record Data
- Consists of a collection of records, each of
which consists of a fixed set of attributes
Document Data
- Each document is represented as a term-frequency vector over the vocabulary:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
Transaction Data
- A special type of record data, where each
record (transaction) involves a set of items
- For example, the set of products purchased by a
customer (during one shopping trip) constitute a
transaction, while the individual products that were
purchased are the items
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
- Examples
(figure: a sequence of transactions - items/events, with one element of the sequence highlighted)
Ordered Data
- Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
- Spatio-temporal Data
Average Monthly
Temperature of
land and ocean
Non-record Data
- Often converted into record data
- For example: presence of substructures in a set, just
like the transaction items
- Ordered data conversion might lose explicit
representations of relationships
Data Quality
Data Quality Problems
- Data won’t be perfect
- Human error
- Limitations of measuring devices
- Flaws in the data collection process
- Data is of high quality if it is suitable for its
intended use
- Much work in data mining focuses on devising
robust algorithms that produce acceptable
results even when noise is present
Typical Data Quality
Problems
- Noise
- Random component of a measurement error
- For example, distortion of a person’s voice when
talking on a poor phone
- Outliers
- Data objects with characteristics that are
considerably different than most of the other data
objects in the data set
Typical Data Quality
Problems (2)
- Missing values
- Information is not collected
- E.g., people decline to give their age and weight
- Attributes may not be applicable to all cases
- E.g., annual income is not applicable to children
- Solutions
- Eliminate an entire object or attribute
- Estimate them by neighbor values
- Ignore them during analysis
Typical Data Quality
Problems (3)
- Inconsistent data
- Data may have some inconsistencies even among
present, acceptable values
- E.g. Zip code value doesn't correspond to the city value
- Duplicate data
- Data objects that are duplicates, or almost
duplicates of one another
- E.g., Same person with multiple email addresses
Quality Issues from the
Application viewpoint
- Timeliness:
- Aging of data implies aging of patterns on it
- Relevance:
- of the attributes modeling objects
- of the objects as representative of the population
- Knowledge of data:
- Availability of documentation about type of features,
origin, scales, missing values representation
Data Preprocessing
Data Preprocessing
- Different strategies and techniques to make
the data more suitable for data mining
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Feature creation
- Discretization and binarization
- Attribute transformation
Aggregation
- Combining two or more attributes (or objects)
into a single attribute (or object)
- Purpose
- Data reduction
- Reduce the number of attributes or objects
- Change of scale
- Cities aggregated into regions, states, countries, etc
- More “stable” data
- Aggregated data tends to have less variability
Sampling
- Selecting a subset of the data objects to be
analyzed
- Statisticians sample because obtaining the entire set
of data of interest is too expensive or time
consuming
- Sampling is used in data mining because processing
the entire set of data of interest is too expensive or
time consuming
Sampling
- A sample is representative if it has
approximately the same property (of interest)
as the original set of data
- Key issues: sampling method and sample size
Types of Sampling
- Simple random sampling
- Any particular item is selected with equal probability
- Sampling without replacement
- As each item is selected, it is removed from the
population
- Sampling with replacement
- Objects are not removed from the population as they are
selected (same object can be picked up more than once)
- Stratified sampling
- Split the data into several partitions; then draw
random samples from each partition
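A minimal sketch of these sampling schemes in Python (assuming the data is just a list of records and, for stratified sampling, a function that assigns each record to a partition; all names are illustrative):

import random
from collections import defaultdict

def simple_random_sample(data, n, replacement=False):
    """Simple random sampling, with or without replacement."""
    if replacement:
        # the same object can be picked more than once
        return [random.choice(data) for _ in range(n)]
    # each selected item is effectively removed from the population
    return random.sample(data, n)

def stratified_sample(data, n_per_stratum, stratum_of):
    """Split the data into partitions (strata), then draw a random sample from each."""
    strata = defaultdict(list)
    for record in data:
        strata[stratum_of(record)].append(record)
    sample = []
    for records in strata.values():
        sample.extend(random.sample(records, min(n_per_stratum, len(records))))
    return sample

# Example: 90 records of class A, 10 of class B; stratified sampling keeps both represented
data = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]
print(len(simple_random_sample(data, 10)))
print(len(stratified_sample(data, 5, stratum_of=lambda r: r[0])))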
Sample size

(Dis)similarity for a
Single Attribute
- Normalizing to the [0, 1] range:

  s' = (s − min_s) / (max_s − min_s)
Example
- Objects with a single original attribute that
measures the quality of the product
- {poor, fair, OK, good, wonderful}
- poor=0, fair=1, OK=2, good=3, wonderful=4
- What is the similarity between p="good" and
  q="wonderful"?

  s = 1 − |p − q| / (n − 1) = 1 − |3 − 4| / (5 − 1) = 1 − 1/4 = 0.75
Dissimilarities between
Data Objects
- Some examples of distances to show the
desired properties of a dissimilarity
- Objects have n attributes; xk is the kth attribute
- Euclidean distance
d(x, y) = √( Σ_{k=1}^{n} (x_k − y_k)² )
Minkowski Distance
- Generalization of the Euclidean Distance

  d(x, y) = ( Σ_{k=1}^{n} |x_k − y_k|^r )^{1/r}

- r = 1: Manhattan (L1) distance, r = 2: Euclidean (L2) distance, r → ∞: supremum (L∞) distance
Euclidean (L2) distance matrix:

L2     p1      p2      p3      p4
p1     0       2.828   3.162   5.099
p2     2.828   0       1.414   3.162
p3     3.162   1.414   0       2
p4     5.099   3.162   2       0

Distance Matrix
Example
Manhattan Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1     p1    p2    p3    p4
p1     0     4     4     6
p2     4     0     2     4
p3     4     2     0     2
p4     6     4     2     0

Distance Matrix
Example
Supremum Distance

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L∞     p1    p2    p3    p4
p1     0     2     3     5
p2     2     0     1     3
p3     3     1     0     2
p4     5     3     2     0

Distance Matrix
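The three distance matrices above can be reproduced with a short NumPy sketch: a generic Minkowski implementation, where r = 1 gives Manhattan, r = 2 Euclidean, and r → ∞ the supremum distance.

import numpy as np

points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

def minkowski_matrix(X, r):
    """Pairwise Minkowski distances d(x, y) = (sum_k |x_k - y_k|^r)^(1/r)."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(r):
        return diff.max(axis=2)                 # supremum (L-infinity) distance
    return (diff ** r).sum(axis=2) ** (1 / r)

print(np.round(minkowski_matrix(points, 2), 3))  # L2: matches the Euclidean matrix above
print(minkowski_matrix(points, 1))               # L1: Manhattan
print(minkowski_matrix(points, np.inf))          # L-infinity: supremum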
Distance Properties
1. Positivity
- d(x,y) >= 0 for all x and y
- d(x,y) = 0 only if x=y
2. Symmetry
- d(x,y) = d(y,x) for all x and y
3. Triangle Inequality
- d(x,z) <= d(x,y) + d(y,z) for all x, y, and z
J = f11 / (f01 + f10 + f11)
SMC versus Jaccard
p= 1000000000
q= 0000001001
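A small sketch computing both coefficients for the binary vectors above (using the standard SMC definition, (f11 + f00) / (f00 + f01 + f10 + f11), alongside the Jaccard coefficient defined above). For p and q it gives SMC = 0.7 but J = 0, which illustrates why Jaccard is preferred for sparse, asymmetric binary data.

def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for two binary vectors."""
    f11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    smc = (f11 + f00) / (f00 + f01 + f10 + f11)
    jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) > 0 else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(p, q))  # (0.7, 0.0)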
Length of a vector:

||x|| = √( Σ_{k=1}^{n} x_k² ),   ||y|| = √( Σ_{k=1}^{n} y_k² )
Example

        attr 1  attr 2  attr 3  attr 4  attr 5
x       1       0       1       0       3
y       0       2       4       0       1

cos(x, y) = x · y / ( ||x|| · ||y|| )

where x · y = Σ_{k=1}^{n} x_k · y_k, ||x|| = √( Σ_k x_k² ), ||y|| = √( Σ_k y_k² )
Example

        attr 1  attr 2  attr 3  attr 4  attr 5
x       1       0       1       0       3
y       0       2       4       0       1

x · y = 1·0 + 0·2 + 1·4 + 0·0 + 3·1 = 7
||x|| = √(1² + 0² + 1² + 0² + 3²) = √11 = 3.31
||y|| = √(0² + 2² + 4² + 0² + 1²) = √21 = 4.58
cos(x, y) = 7 / (3.31 · 4.58) = 0.46
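The same computation as a short sketch in plain Python (no external libraries):

import math

def cosine(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [1, 0, 1, 0, 3]
y = [0, 2, 4, 0, 1]
print(round(cosine(x, y), 2))  # 0.46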
Geometric Interpretation

        attr 1  attr 2
x       1       0
y       0       2

cos(x, y) = cos(90°) = 0
Geometric Interpretation

        attr 1  attr 2
x       4       2
y       1       3

cos(x, y) = cos(45°) = 0.70
Geometric Interpretation

        attr 1  attr 2
x       1       2
y       2       4

cos(x, y) = cos(0°) = 1
DAT630
Retrieval Evaluation
Search Engines, Chapter 8
04/09/2017
Figure 2.2
Evaluation
- Evaluation is key to building effective and
efficient search engines
- Measurement usually carried out in controlled
laboratory experiments
- Online testing can also be done
- Effectiveness, efficiency and cost are related
- E.g., if we want a particular level of effectiveness and
efficiency, this will determine the cost of the system
configuration
- Efficiency and cost targets may impact effectiveness
Evaluation Corpus
- To ensure repeatable experiments and fair
comparison of results from different systems
- Test collections consist of
- Documents
- Queries
- Relevance judgments
- (Evaluation metrics)
Text REtrieval Conference
(TREC)
- Organized by the US National Institute of
Standards and Technology (NIST)
- Yearly benchmarking cycle
- Development of test collections for various
information retrieval tasks
- Relevance judgments created by retired CIA
information analysts
TREC Assessors at Work
Example Test Collections
Example Collections
ClueWeb09/12 collections
- ClueWeb09
- 1 billion web pages in 10 languages
- 5TB compressed, 25TB uncompressed
- http://lemurproject.org/clueweb09/
- ClueWeb12
- 733 million English web pages
- http://lemurproject.org/clueweb12/
TREC Topic Example

Pooling
- The top-k results from each participating system (System A, B, C, …) are combined into a pool
- Assessors then judge each (query, document) pair in the pool, guided by the topic's description/narrative
Crowdsourcing
- Obtain relevance judgments on a crowdsourcing
platform
- "Microtasks", performed in parallel by large, paid
crowds
- Platforms
- Amazon Mechanical Turk (US)
- Crowdflower (EU)
- https://www.crowdflower.com/use-case/search-relevance/
Example crowdsourcing
task
Query Logs
- Used for both tuning and evaluating search
engines
- Also for various techniques such as query
suggestion
- Typical contents
- User identifier or user session identifier
- Query terms - stored exactly as user entered
- List of URLs of results, their ranks on the result list,
and whether they were clicked on
- Timestamp(s) - records the time of user events such
as query submission, clicks
AOL query log
AOL query log fiasco
Query Logs
- Clicks are not relevance judgments
- Although they are correlated
- Biased by a number of factors such as rank on
result list
- Can use clickthrough data to predict
preferences between pairs of documents
- Appropriate for tasks with multiple levels of
relevance, focused on user relevance
- Various “policies” used to generate preferences
Example Click Policy
- Skip Above and Skip Next
- Given a set of results for a query and a clicked result
at rank position p
- all unclicked results ranked above p are predicted to be
less relevant than the result at p
- unclicked results immediately following a clicked result
are less relevant than the clicked result
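A minimal sketch of how such a policy could be turned into preference pairs from a single query's result list (the ranking and click positions here are made up for illustration):

def click_preferences(ranking, clicked_positions):
    """Skip Above and Skip Next: derive document preference pairs from clicks.

    ranking: list of doc ids, best rank first (position 0 = rank 1)
    clicked_positions: set of 0-based positions that were clicked
    Returns (preferred, over) pairs: 'preferred' is predicted more relevant than 'over'.
    """
    prefs = []
    for p in sorted(clicked_positions):
        clicked_doc = ranking[p]
        # Skip Above: unclicked results ranked above p are less relevant than the clicked one
        for i in range(p):
            if i not in clicked_positions:
                prefs.append((clicked_doc, ranking[i]))
        # Skip Next: an unclicked result immediately following the clicked one is less relevant
        if p + 1 < len(ranking) and (p + 1) not in clicked_positions:
            prefs.append((clicked_doc, ranking[p + 1]))
    return prefs

ranking = ["d1", "d2", "d3", "d4", "d5"]
print(click_preferences(ranking, {2}))  # d3 clicked: preferred over d1, d2 (skip above) and d4 (skip next)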
[Figure: distribution of clicks by rank position i = 1..10; click probability drops sharply with rank]
           Expert judgments (TREC)    Crowdsourcing                      Query logs
Cost       Very expensive             Moderately expensive               Cheap
Scaling    Doesn't scale well         Scales to some extent (budget)     Scales very well

(The quality of the collected data is only as good as the assessment guidelines.)
Effectiveness Measures
- A is the set of relevant documents,
  B is the set of retrieved documents
- Recall = |A ∩ B| / |A|
- Precision = |A ∩ B| / |B|
F-measure
- Harmonic mean of recall and precision
- F = 2 · R · P / (R + P)
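A small sketch computing these set-based measures, plus precision at a rank cutoff (P@k), given a ranked result list and a ground-truth set of relevant documents (the example data is illustrative):

def precision_recall_f(relevant, retrieved):
    """Set-based precision, recall and (balanced) F-measure."""
    A, B = set(relevant), set(retrieved)
    tp = len(A & B)
    precision = tp / len(B) if B else 0.0
    recall = tp / len(A) if A else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f

def precision_at_k(relevant, ranking, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in set(relevant)) / k

relevant = {"d1", "d4", "d7"}
ranking = ["d1", "d2", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11"]
print(precision_recall_f(relevant, ranking[:5]))  # top 5 retrieved: (0.4, 0.667, 0.5)
print(precision_at_k(relevant, ranking, 10))      # 0.3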
Precision @5
Example
Precision @10
Standard Recall Levels
- Calculating precision at standard recall levels,
from 0.0 to 1.0
- Each ranking is then represented using 11 numbers
- Values of precision at these standard recall levels are
often not available, for example:
- Interpolation is needed
Recall-Precision Graph
Query 1
Query 2
Interpolation
- To average graphs, calculate precision at standard recall levels:
  P(R) = max { P' : R' ≥ R, where (R', P') is an observed (recall, precision) point }
- Recall levels are computed against the total number of relevant documents
  according to the ground truth
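A minimal sketch of 11-point interpolated precision, assuming we already have the observed (recall, precision) points of one ranking:

def interpolated_precision(points, levels=None):
    """Interpolate precision at standard recall levels.

    points: observed (recall, precision) pairs for one ranking.
    For each standard recall level R, take the maximum precision
    observed at any recall >= R (0 if none).
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.44), (1.0, 0.5)]
print(interpolated_precision(points))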
Efficiency Metrics
- Elapsed indexing time
- Amount of time necessary to build a document
index on a particular system
- Indexing processor time
- CPU seconds used in building a document index
- Similar to elapsed time, but does not count time waiting
for I/O or speed gains from parallelism
- Query throughput
- Number of queries processed per second
Efficiency Metrics
- Query latency
- The amount of time a user must wait after issuing a
query before receiving a response, measured in
milliseconds
- Often measured with the median
- Indexing temporary space
- Amount of temporary disk space used while creating
an index
- Index size
- Amount of storage necessary to store the index files
Summary
- No single measure is the correct one for any
application
- Choose measures appropriate for task
- Use a combination
- Shows different aspects of the system effectiveness
- Use significance tests
- Analyze performance of individual queries
DAT630
Retrieval Models
Search Engines, Chapter 7
28/08/2017
Figure 2.1
Today
Figure 2.2
Boolean Retrieval
Boolean Retrieval
- Two possible outcomes for query processing
- TRUE and FALSE (relevance is binary)
- “Exact-match” retrieval
- Query usually specified using Boolean operators
- AND, OR, NOT
- Can be extended with wildcard and proximity
operators
- Assumes that all documents in the retrieved set
are equally relevant
Boolean Retrieval
- Many search systems you still use are
Boolean:
- Email, library catalog, …
- Very effective in some specific domains
- E.g., legal search
- E.g., patent search
- Expert users
Boolean View of a
Collection

Term     Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8
aid      0     0     0     1     0     0     0     1
all      0     1     0     1     0     1     0     0
back     1     0     1     0     0     0     1     0
brown    1     0     1     0     1     0     1     0
come     0     1     0     1     0     1     0     1
dog      0     0     1     0     1     0     0     0
fox      0     0     1     0     1     0     1     0
good     0     1     0     1     0     1     0     1
jump     0     0     1     0     0     0     0     0
lazy     1     0     1     0     1     0     1     0
men      0     1     0     1     0     0     0     1
now      0     1     0     0     0     1     0     1
over     1     0     1     0     1     0     1     1
party    0     0     0     0     0     1     0     1
quick    1     0     1     0     0     0     0     0
their    1     0     0     0     1     0     1     0
time     0     1     0     1     0     1     0     0

- Each row represents the view of a particular term: what documents contain this term?
- Like an inverted list
- To execute a query
  - Pick out rows corresponding to query terms
  - Apply the logic table of the corresponding Boolean operator
Example Queries

Term          Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8
dog           0     0     1     0     1     0     0     0
fox           0     0     1     0     1     0     1     0
dog ∧ fox     0     0     1     0     1     0     0     0
dog ∨ fox     0     0     1     0     1     0     1     0
dog ∧ ¬fox    0     0     0     0     0     0     0     0
fox ∧ ¬dog    0     0     0     0     0     0     1     0

dog AND fox → Doc 3, Doc 5
dog OR fox → Doc 3, Doc 5, Doc 7
dog AND NOT fox → empty
fox AND NOT dog → Doc 7
Example Query
good AND party AND NOT over

Term          Doc1  Doc2  Doc3  Doc4  Doc5  Doc6  Doc7  Doc8
good          0     1     0     1     0     1     0     1
party         0     0     0     0     0     1     0     1
over          1     0     1     0     1     0     1     1
g ∧ p         0     0     0     0     0     1     0     1
g ∧ p ∧ ¬o    0     0     0     0     0     1     0     0

good AND party → Doc 6, Doc 8
good AND party AND NOT over → Doc 6
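A sketch of evaluating such queries directly on the term-document incidence rows: each term maps to a tuple of 0/1 values for Doc1..Doc8, exactly as in the table above.

index = {  # term -> incidence over Doc1..Doc8
    "dog":   (0, 0, 1, 0, 1, 0, 0, 0),
    "fox":   (0, 0, 1, 0, 1, 0, 1, 0),
    "good":  (0, 1, 0, 1, 0, 1, 0, 1),
    "party": (0, 0, 0, 0, 0, 1, 0, 1),
    "over":  (1, 0, 1, 0, 1, 0, 1, 1),
}

AND = lambda a, b: tuple(x & y for x, y in zip(a, b))
OR  = lambda a, b: tuple(x | y for x, y in zip(a, b))
NOT = lambda a:    tuple(1 - x for x in a)

def docs(row):
    """Turn an incidence row back into document ids."""
    return ["Doc %d" % (i + 1) for i, x in enumerate(row) if x]

print(docs(AND(index["dog"], index["fox"])))                              # ['Doc 3', 'Doc 5']
print(docs(OR(index["dog"], index["fox"])))                               # ['Doc 3', 'Doc 5', 'Doc 7']
print(docs(AND(AND(index["good"], index["party"]), NOT(index["over"]))))  # ['Doc 6']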
Example of Query
(Re)formulation
lincoln
- Retrieves a large number of documents
- User may attempt to narrow the scope
score(d, q) = Σ_{t∈q} w_{t,d} · w_{t,q}

- Relevance score: computed for each document d in the collection for a given input query q
- It is enough to consider the terms that appear in the query

For example, with log-scaled term frequency in the document and raw term frequency in the query:

score(d, q) = Σ_{t∈q} (1 + log f_{t,d}) · f_{t,q}
Query Processing
- Strategies for processing the data in the index
for producing query results
- Document-at-a-time
- Calculates complete scores for documents by
processing all term lists, one document at a time
- Term-at-a-time
- Accumulates scores for documents by processing
term lists one at a time
- Both approaches have optimization techniques
that significantly reduce time required to
generate scores
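A sketch of term-at-a-time scoring with an accumulator per document, using the simple (1 + log f_{t,d}) · f_{t,q} weighting from above and a toy inverted index (the index structure and its contents are illustrative):

import math
from collections import defaultdict

# Toy inverted index: term -> list of (doc_id, term frequency in that document)
inverted_index = {
    "tropical": [(1, 2), (2, 1), (4, 3)],
    "fish":     [(1, 1), (3, 2), (4, 1)],
}

def term_at_a_time(query_terms, index):
    """Process one posting list at a time, accumulating partial scores per document."""
    accumulators = defaultdict(float)
    for term, f_tq in query_terms.items():
        for doc_id, f_td in index.get(term, []):
            accumulators[doc_id] += (1 + math.log(f_td)) * f_tq
    return sorted(accumulators.items(), key=lambda x: x[1], reverse=True)

print(term_at_a_time({"tropical": 1, "fish": 1}, inverted_index))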
Document-at-a-Time
(Figure 5.15)

Term-at-a-Time
(Figure 5.17)
The Vector Space Model
The Vector Space Model
- Basis of most IR research in the 1960s and 70s
- Still used
- Provides a simple and intuitively appealing
framework for implementing
- Term weighting
- Ranking
- Relevance feedback
Representation
- Documents and query represented by a vector
of term weights
cosine(d, q) = Σ_t ( tfidf_{t,d} · tfidf_{t,q} ) / ( √(Σ_t tfidf²_{t,d}) · √(Σ_t tfidf²_{t,q}) )
Scoring Documents
- It also fits within our general scoring scheme:
- Note that we only consider terms that are present in
the query
Score(q, d) = Σ_{t∈q} w_{t,q} · w_{t,d}

w_{t,q} = tfidf_{t,q} / √( Σ_t tfidf²_{t,q} )        w_{t,d} = tfidf_{t,d} / √( Σ_t tfidf²_{t,d} )
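A compact sketch of this normalized scoring, assuming tf-idf weights have already been computed for a document and a query (the example weights are illustrative):

import math

def normalize(weights):
    """Divide each tf-idf weight by the vector's Euclidean length."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()} if length > 0 else weights

def score(query_tfidf, doc_tfidf):
    """Score(q, d) = sum over query terms of w_{t,q} * w_{t,d}."""
    wq, wd = normalize(query_tfidf), normalize(doc_tfidf)
    return sum(w * wd.get(t, 0.0) for t, w in wq.items())

doc = {"tropical": 1.5, "fish": 0.8, "aquarium": 2.1}
query = {"tropical": 1.2, "fish": 0.9}
print(round(score(query, doc), 3))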
Variations on Term Weighting
- It is possible to use different term weighting for
documents and for queries, for example:
BM25
- Parameters
- k1: calibrating term frequency scaling
- b: document length normalization
- Note: several slight variations of BM25 exist!
BM25: An Intuitive View

score(d, q) = Σ_{t∈q} [ f_{t,d} · (1 + k1) / ( f_{t,d} + k1 · (1 − b + b · |d|/avgdl) ) ] · idf_t

- Repetitions of query terms in the document => good
- Term saturation: repetition is less important after a while
  - of the form f_{t,d} / (k + f_{t,d}) for some k > 0
- Soft document normalization taking into account document length
  - occurrences count for less in a document that is long relative to the average (|d| / avgdl)
- Common terms are less important (idf_t)
Parameter Setting
- k1: calibrating term frequency scaling
- 0 corresponds to a binary model
- large values correspond to using raw term frequencies
- k1 is set between 1.2 and 2.0, a typical value is 1.2
- b: document length normalization
- 0: no normalization at all
- 1: full length normalization
- typical value: 0.75
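A sketch of BM25 scoring with the parameter values above (k1 = 1.2, b = 0.75). The idf formulation varies between BM25 variants, so the one used here (log N/n_t) is just one common choice; the example statistics are made up.

import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, N, df, k1=1.2, b=0.75):
    """BM25 score of one document for a query.

    query_terms: list of query terms
    doc_tf: term -> frequency in the document (f_{t,d})
    doc_len, avgdl: document length and average document length
    N, df: collection size and term -> document frequency
    """
    score = 0.0
    for t in query_terms:
        f_td = doc_tf.get(t, 0)
        if f_td == 0 or t not in df:
            continue
        idf = math.log(N / df[t])                # one common idf variant
        B = 1 - b + b * (doc_len / avgdl)        # soft length normalization
        score += (f_td * (1 + k1)) / (f_td + k1 * B) * idf
    return score

doc_tf = {"tropical": 3, "fish": 2, "tank": 1}
print(round(bm25_score(["tropical", "fish"], doc_tf, doc_len=6, avgdl=10,
                       N=1000, df={"tropical": 50, "fish": 120}), 3))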
Language Models
Language Models
- Based on the notion of probabilities and
processes for generating text
Uses
- Speech recognition
- “I ate a cherry” is a more likely sentence than “Eye
eight uh Jerry”
- OCR & Handwriting recognition
- More probable sentences are more likely correct
readings
- Machine translation
- More likely sentences are probably better
translations
Uses
- Completion prediction
- Please turn off your cell _____
- Your program does not ______
- Predictive text input systems
can guess what you are
typing and give choices on
how to complete it
Ranking Documents using
Language Models
- Represent each document as a multinomial
probability distribution over terms
- Estimate the probability that the query was
"generated" by the given document
- "How likely is the search query given the language
model of the document?"
Standard Language Modeling
approach
- Rank documents d according to their likelihood
of being relevant given a query q: P(d|q)
P(d|q) = P(q|d) · P(d) / P(q) ∝ P(q|d) · P(d)

P(q|d) = Π_{t∈q} P(t|θ_d)^{f_{t,q}}
Standard Language Modeling
approach (2)

P(q|d) = Π_{t∈q} P(t|θ_d)^{f_{t,q}}

- f_{t,q}: number of times t appears in q
- θ_d: the document language model, a smoothed multinomial probability
  distribution over the vocabulary of terms
Empirical (maximum-likelihood) document language model:

P(t|d) = f_{t,d} / |d|

[Figure: term distribution P(t|d) for a document (the lyrics of "Yellow Submarine"); the most frequent terms, "submarine" (≈0.14) and "yellow" (≈0.11), receive the highest probabilities]
Alternatively...
Scoring a query
q = {sea, submarine}
t P(t|d) t P(t|C)
submarine 0,14 submarine 0,0001
sea 0,04 sea 0,0002
... ...
Scoring a query
q = {sea, submarine}

P(q|d) = P("sea"|θ_d) · P("submarine"|θ_d) = 0.03602 · 0.12601 ≈ 0.0045

(using Jelinek-Mercer smoothing with λ = 0.1:
 P("sea"|θ_d) = 0.9 · 0.04 + 0.1 · 0.0002 = 0.03602
 P("submarine"|θ_d) = 0.9 · 0.14 + 0.1 · 0.0001 = 0.12601)

t          P(t|d)   t          P(t|C)
submarine  0.14     submarine  0.0001
sea        0.04     sea        0.0002
...                 ...
Smoothing
- Jelinek-Mercer smoothing

  P(t|θ_d) = (1 − λ) · P(t|d) + λ · P(t)

  - Smoothing parameter is λ
  - Same amount of smoothing is applied to all documents
- Dirichlet smoothing

  P(t|θ_d) = ( f_{t,d} + µ · P(t) ) / ( |d| + µ )

  - Smoothing parameter is µ
  - Smoothing is inversely proportional to the document
    length
Relation between
Smoothing Methods
- Jelinek-Mercer:

  P(t|θ_d) = (1 − λ) · P(t|d) + λ · P(t)

- becomes Dirichlet smoothing by setting:

  (1 − λ) = |d| / (|d| + µ)        λ = µ / (|d| + µ)

- Dirichlet:

  P(t|θ_d) = ( f_{t,d} + µ · P(t) ) / ( |d| + µ )
Practical Considerations
- Since we are multiplying small probabilities, it's
better to perform computations in the log
space
P(q|d) = Π_{t∈q} P(t|θ_d)^{f_{t,q}}

log P(q|d) = Σ_{t∈q} log P(t|θ_d) · f_{t,q}

- This fits the general scoring form
  score(d, q) = Σ_{t∈q} w_{t,d} · w_{t,q}
  with w_{t,d} = log P(t|θ_d) and w_{t,q} = f_{t,q}
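A sketch of query-likelihood scoring in log space with both smoothing methods (P(t) is the collection/background language model; the example counts and probabilities are illustrative, with f_{sea,d} = 4 and f_{submarine,d} = 14 as in the earlier slide):

import math

def lm_score(query_terms, doc_tf, doc_len, p_background, method="jm", lam=0.1, mu=2000):
    """log P(q|d) = sum over query term occurrences of log P(t|theta_d)."""
    score = 0.0
    for t in query_terms:  # repeated query terms contribute once per occurrence (f_{t,q})
        p_td = doc_tf.get(t, 0) / doc_len
        p_t = p_background[t]
        if method == "jm":      # Jelinek-Mercer: (1 - lambda) P(t|d) + lambda P(t)
            p = (1 - lam) * p_td + lam * p_t
        else:                   # Dirichlet: (f_{t,d} + mu P(t)) / (|d| + mu)
            p = (doc_tf.get(t, 0) + mu * p_t) / (doc_len + mu)
        score += math.log(p)
    return score

doc_tf = {"sea": 4, "submarine": 14}
background = {"sea": 0.0002, "submarine": 0.0001}
print(lm_score(["sea", "submarine"], doc_tf, doc_len=100, p_background=background, method="jm"))
print(lm_score(["sea", "submarine"], doc_tf, doc_len=100, p_background=background, method="dirichlet"))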
Exercise
- See on GitHub

Scoring a query

P(q|d) = Π_{t∈q} P(t|θ_d)^{f_{t,q}}

P(q = "T2 T1" | D2) = P(T2|D2) · P(T1|D2)
Fielded Variants
Motivation
- Documents are composed of multiple fields
- E.g., title, body, anchors, etc.
- Modeling internal document structure may be
beneficial for retrieval
Example
Unstructured representation
PROMISE Winter School 2013
Bridging between Information Retrieval and Databases
Bressanone, Italy 4 - 8 February 2013
The aim of the PROMISE Winter School 2013 on "Bridging between Information
Retrieval and Databases" is to give participants a grounding in the core
topics that constitute the multidisciplinary area of information access and
retrieval to unstructured, semistructured, and structured information. The
school is a week-long event consisting of guest lectures from invited
speakers who are recognized experts in the field. The school is intended for
PhD students, Masters students or senior researchers such as post-doctoral
researchers form the fields of databases, information retrieval, and related
fields.
[...]
<html>
<head>
<title>Winter School 2013</title>
Example
<meta name="keywords" content="PROMISE, school, PhD, IR, DB, [...]" />
<meta name="description" content="PROMISE Winter School 2013, [...]" />
</head>
<body>
<h1>PROMISE Winter School 2013</h1>
<h2>Bridging between Information Retrieval and Databases</h2>
<h3>Bressanone, Italy 4 - 8 February 2013</h3>
<p>The aim of the PROMISE Winter School 2013 on "Bridging between
Information Retrieval and Databases" is to give participants a grounding
in the core topics that constitute the multidisciplinary area of
information access and retrieval to unstructured, semistructured, and
structured information. The school is a week-long event consisting of
guest lectures from invited speakers who are recognized experts in the
field. The school is intended for PhD students, Masters students or
senior researchers such as post-doctoral researchers form the fields of
databases, information retrieval, and related fields. </p>
[...]
</body>
</html>
Fielded representation
based on HTML markup
title: Winter School 2013
body: The aim of the PROMISE Winter School 2013 on "Bridging between
Information Retrieval and Databases" is to give participants a
grounding in the core topics that constitute the multidisciplinary
area of information access and retrieval to unstructured,
semistructured, and structured information. The school is a week-
long event consisting of guest lectures from invited speakers who
are recognized experts in the field. The school is intended for
PhD students, Masters students or senior researchers such as post-
doctoral researchers form the fields of databases, information
retrieval, and related fields.
Fielded Extension of
Retrieval Models
- BM25 => BM25F
- LM => Mixture of Language Models (MLM)
BM25F
- Extension of BM25 incorporating multiple fields
- The soft normalization and term frequencies
need to be adjusted
- Original BM25:

  score(d, q) = Σ_{t∈q} [ f_{t,d} · (1 + k1) / ( f_{t,d} + k1 · B ) ] · idf_t

  where B is the soft normalization:

  B = (1 − b + b · |d| / avgdl)
BM25F

score(d, q) = Σ_{t∈q} [ f̃_{t,d} / ( k1 + f̃_{t,d} ) ] · idf_t

Combining term frequencies across fields (w_i is the weight of field i, B_i its soft length normalization):

f̃_{t,d} = Σ_i w_i · f_{t,d_i} / B_i
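A sketch of the pseudo term frequency combination: per-field frequencies are length-normalized, weighted, and summed before the usual BM25 saturation is applied. The field weights, lengths, and parameter values here are illustrative.

def bm25f_tf(term, fields, field_weights, b, avg_field_len):
    """Combine per-field term frequencies into a single pseudo frequency f~_{t,d}.

    fields: field name -> (term frequency dict, field length)
    """
    f_tilde = 0.0
    for name, (tf, length) in fields.items():
        B_i = 1 - b[name] + b[name] * (length / avg_field_len[name])  # per-field soft normalization
        f_tilde += field_weights[name] * tf.get(term, 0) / B_i
    return f_tilde

def bm25f_term_score(f_tilde, idf, k1=1.2):
    return f_tilde / (k1 + f_tilde) * idf

fields = {"title": ({"winter": 1, "school": 1}, 3),
          "body":  ({"winter": 4, "school": 6, "retrieval": 5}, 120)}
weights = {"title": 3.0, "body": 1.0}
b = {"title": 0.5, "body": 0.75}
avg_len = {"title": 5, "body": 150}

f_tilde = bm25f_tf("school", fields, weights, b, avg_len)
print(round(bm25f_term_score(f_tilde, idf=2.0), 3))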
22/09/2017
Figure 2.1
Indexing Process
Identify and make available
the documents that will be
searched
Figure 2.1
Text Acquisition
- Crawler
- Identifies and acquires documents for search engine
- Many types: web, enterprise, desktop, etc.
- Web crawlers follow links to find documents
- Must efficiently find huge numbers of web pages
(coverage) and keep them up-to-date (freshness)
- Single site crawlers for site search
- Topical or focused crawlers for vertical search
- Document crawlers for enterprise and desktop search
- Follow links and scan directories
Text Acquisition
- Feeds
- Real-time streams of documents
- E.g., web feeds for news, blogs, video, radio, TV
- RSS is common standard
- RSS “reader” can provide new XML documents to
search engine
Text Acquisition
- Documents need to be converted into a
consistent text plus metadata format
- E.g. HTML, XML, Word, PDF, etc. → XML
- Convert text encoding for different languages
- Using a Unicode standard like UTF-8
Indexing Process
Figure 2.1
Document Data Store
- Stores text, metadata, and other related
content for documents
- Metadata is information about document such as
type and creation date
- Other content includes links, anchor text
- Provides fast access to document contents for
search engine components
- E.g. result list generation
- Could use relational database system
- More typically, a simpler, more efficient storage
system is used due to huge numbers of documents
Indexing Process
Figure 2.1
Index Creation
- Document Statistics
- Gathers counts and positions of words and other
features
- Used in ranking algorithm
- Weighting
- Computes weights for index terms
- Usually reflect “importance” of term in the document
- Used in ranking algorithm
Index Creation
- Inversion
- Core of indexing process
- Converts document-term information to term-
document for indexing
- Difficult for very large numbers of documents
- Format of inverted file is designed for fast query
processing
- Must also handle updates
- Compression used for efficiency
Index Creation
- Index Distribution
- Distributes indexes across multiple computers and/
or multiple sites
- Essential for fast query processing with large
numbers of documents
- Many variations
- Document distribution, term distribution,
replication
- P2P and distributed IR involve search across multiple
sites
Query Process
Figure 2.2
Query Process
Interface between the person
doing the searching and the
search engine
Figure 2.2
User Interaction
- Accepting the user’s query and transforming it
into index terms
- Taking the ranked list of documents from the
search engine and organizing it into the results
shown to the user
- E.g., generating snippets to summarize documents
- Range of techniques for refining the query (so
that it better represents the information need)
User Interaction
- Query input
- Provides interface and parser for query language
- Query language used to describe complex queries
- Operators indicate special treatment for query text
- Most web search query languages are very simple
- Small number of operators
- There are more complicated query languages
- E.g., Boolean queries, proximity operators
- IR query languages also allow content and structure
specifications, but focus on content
User Interaction
- Query transformation
- Improves initial query, both before and after initial
search
- Includes text transformation techniques used for
documents
- Spell checking and query suggestion provide
alternatives to original query
- Techniques often leverage query logs in web search
- Query expansion and relevance feedback modify the
original query with additional terms
User Interaction
- Results output
- Constructs the display of ranked documents for a
query
- Generates snippets to show how queries match
documents
- Highlights important words and passages
- Retrieves appropriate advertising in many
applications (“related” things)
- May provide clustering and other visualization tools
Query Process
Core of the search engine:
generates a ranked list of
documents for the user’s query
Figure 2.2
Ranking
- Scoring
- Calculates scores for documents using a ranking
algorithm, which is based on a retrieval model
- Core component of search engine
- Basic form of score is Σ_i q_i · d_i
- q_i and d_i are query and document term weights for term i
- Many variations of ranking algorithms and retrieval
models
Ranking
- Performance optimization
- Designing ranking algorithms for efficient processing
- Term-at-a time vs. document-at-a-time processing
- Safe vs. unsafe optimizations
- Distribution
- Processing queries in a distributed environment
- Query broker distributes queries and assembles
results
- Caching is a form of distributed searching
Query Process
Figure 2.2
docID; payload
There is a separate
posting for each term
occurrence in the
document. The payload
is the term position.
Supports proximity
matches.
E.g., find "tropical" within
5 words of "fish"
Issues
- Compression
- Inverted lists are very large
- Compression of indexes saves disk and/or memory
space
- Optimization techniques to speed up search
- Read less data from inverted lists
- “Skipping” ahead
- Calculate scores for fewer documents
- Store highest-scoring documents at the beginning of
each inverted list
- Distributed indexing
Exercise
- Draw the inverted index for the following
document collection
home 1 2 3 4
sales 1 2 3 4
top 1
forecasts 1
rise 2 4
in 2 3
july 2 3 4
increase 3
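A sketch of building such an inverted index (term → sorted list of document ids). The toy sentences below are made up, chosen so that the output reproduces the postings shown above:

from collections import defaultdict

docs = {
    1: "home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july home sales rise",
}

def build_inverted_index(docs):
    """Map each term to the sorted list of documents it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

for term, postings in sorted(build_inverted_index(docs).items()):
    print(term, postings)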
Text Preprocessing
Preprocessing Pipeline
raw document → text preprocessing (Tokenization → Stopping → Stemming) → … → sequence of terms
Tokenization
- Parsing a string into individual words (tokens)
- Splitting is usually done along white spaces,
punctuation marks, or other types of content
delimiters (e.g., HTML markup)
- Sounds easy, but can be surprisingly complex,
even for English
- Even worse for many other languages
Tokenization Issues
- Apostrophes can be a part of a word, a part of
a possessive, or just a mistake
- rosie o'donnell, can't, 80's, 1890's, men's straw
hats, master's degree, …
- Capitalized words can have different meaning
from lower case words
- Bush, Apple
- Special characters are an important part of
tags, URLs, email addresses, etc.
- C++, C#, …
Tokenization Issues
- Numbers can be important, including decimals
- nokia 3250, top 10 courses, united 93, quicktime
6.5 pro, 92.3 the beat, 288358
- Periods can occur in numbers, abbreviations,
URLs, ends of sentences, and other situations
- I.B.M., Ph.D., www.uis.no, F.E.A.R.
Common Practice
- First pass is focused on identifying markup or
tags; second pass is done on the appropriate
parts of the document structure
- Treat hyphens, apostrophes, periods, etc. like
spaces
- Ignore capitalization
- Index even single characters
- o’connor => o connor
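A sketch of a tokenizer along these lines (lowercase everything, treat apostrophes, hyphens, periods, and other punctuation like spaces, and keep even single-character tokens):

import re

def tokenize(text):
    """Lowercase, then split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("O'Connor's Ph.D., 92.3 the beat"))
# ['o', 'connor', 's', 'ph', 'd', '92', '3', 'the', 'beat']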
Text Statistics
Top-50 words from AP89
Zipf’s Law
- Distribution of word frequencies is very skewed
- A few words occur very often, many words hardly ever
occur
- E.g., two most common words (“the”, “of”) make up
about 10% of all word occurrences in text documents
- Zipf’s law:
- Frequency of an item or event is inversely
proportional to its frequency rank
- Rank (r) of a word times its frequency (f) is
approximately a constant (k): r*f~k
Zipf’s law for AP89
Stopword Removal
- Function words that have little meaning apart
from other words: the, a, an, that, those, …
- These are considered stopwords and are
removed
- A stopwords list can be constructed by taking
the top n (e.g., 50) most common words in a
collection
- May be customized for certain domains or applications
Stopword Removal
Porter stemmer
market strateg carr compan agricultur chemic report predict market share
chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil
predict sale stimul demand price cut volum sale
Krovetz stemmer
marketing strategy carry company agriculture chemical report prediction market
share chemical report market statistic agrochemic pesticide herbicide fungicide
insecticide fertilizer predict sale stimulate demand price cut volume sale
Stemming
- Generally a small (but significant) effectiveness
improvement for English
- Can be crucial for some languages (e.g.,
Arabic, Russian)
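A sketch using the Porter stemmer from NLTK (assuming the nltk package is installed); the output may differ slightly from the slide's example, which was produced by a different implementation:

from nltk.stem import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
tokens = ["marketing", "strategies", "agricultural", "chemicals", "predicted", "sales"]
print([stemmer.stem(t) for t in tokens])
# e.g. ['market', 'strategi', 'agricultur', 'chemic', 'predict', 'sale']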
Example
First pass extraction
The Transporter (2002)
PG-13 92 min Action, Crime, Thriller 11 October 2002 (USA)
Tokenization
the transporter 2002
pg 13 92 min action crime thriller 11 october 2002 usa
Stopping
transporter 2002
pg 13 92 min action crime thriller 11 october 2002 usa
Stemming
transport 2002
pg 13 92 min action crime thriller 11 octob 2002 usa