BDI Summary-4
3 MapReduce
3.1 MapReduce Algorithm
3.2 Spark: Extends MapReduce
3.3 Problems Suited for MapReduce
3.4 Cost Measures of Algorithms
6 Clustering
6.1 Hierarchical Clustering
6.2 K-Means
6.3 The BFR Algorithm
6.3.1 The CURE Algorithm
6.4 Dimensionality Reduction
6.4.1 Rank of a Matrix
6.5 Singular Value Decomposition (SVD)
6.6 CUR Decomposition
7 Recommender Systems
7.1 Content-Based Approach
7.2 User-User Collaborative Filtering
7.2.1 Pearson Correlation Coefficient
7.3 Latent Factor Models
8 PageRank
8.1 Introduction
8.2 PageRank: Flow Formulation
8.3 Google PageRank
8.4 Problems with PageRank
8.5 Topic-Specific PageRank
9.3 Filtering Data Streams
9.3.1 Applications of Filtering Data Streams
9.3.2 First Cut Solution
9.3.3 Bloom Filters
9.4 Counting Distinct Elements
10 Lernziele (Learning Objectives)
1 Introduction
Definition
Data Mining: Given large amounts of data, data mining is about discovering patterns and models in that data. Two broad classes of methods exist:
• Descriptive Methods: Find human-interpretable patterns that describe the data
– Example: Clustering
• Predictive Methods: Use some variables to predict unknown or future values of other variables
Definition
Distributed File System: A distributed file system manages data across multiple networked machines,
ensuring reliability and fault tolerance. Since copying data over a network is time-consuming, a robust storage
mechanism is needed to persist data even in the event of node failures. To achieve this, files are replicated
across multiple nodes for redundancy. These systems are designed to handle large-scale data workloads
(hundreds of GB to TB), where in-place updates are rare, and reads or appends are the primary operations.
Examples of such systems include Google File System (GFS) and Hadoop Distributed File System (HDFS).
To process data in these environments, distributed computing frameworks like MapReduce and Apache Spark
are commonly used.
• Chunk servers: Files are split into contiguous chunks (16–64 MB). Each chunk is replicated (usually 2 or 3
times) and kept on different racks (servers are grouped into racks).
• Master node: It is also called Name Node in HDFS and it stores metadata about where files are stored. Master
nodes are typically more robust to hardware failure and run critical cluster services.
• Client library for file access: Talks to master to find chunk servers and connects directly to chunk servers
to access data.
Typically, a reliable distributed file system keeps data in chunks spread across machines, and each chunk is replicated
on different machines to enable seamless recovery from disk or machine failure. Additionally, the phrase "Bring
computation directly to the data!" means that distributed file systems like HDFS or GFS allow data to be processed
where it is stored, rather than transferring large volumes of data across the network to a central processing unit, which
makes processing far more efficient.
3 MapReduce
Definition
MapReduce: An early distributed programming model designed for easy parallel programming, invisible management of hardware and software failures, and easy management of very-large-scale data.
• It has several implementations, including Hadoop, Spark (used here), Flink, or the original Google implementation
just called MapReduce.
• Group by Key: Sort and shuffle, where the system sorts all the key-value pairs by key, and outputs key-[list
of values] pairs.
Workflow (typically the programmer specifies the map and reduce function and input files)
• All (k′, v′) pairs with a given key k′ are sent to the same Reduce process
Example: Word Counting
Consider a huge text document where we have to count the number of times each distinct word appears in the file.
MapReduce can be performed in parallel; a partitioning function determines which record goes to which reducer. A
sketch of the map and reduce functions follows below.
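As an illustration (not from the original notes), here is a minimal pure-Python simulation of the word-count Map and Reduce functions; the small in-memory dictionary stands in for the framework's sort-and-shuffle step, and the documents are made up:

from collections import defaultdict

def map_fn(document):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts for one word."""
    return (word, sum(counts))

# Simulate the group-by-key (sort and shuffle) step the framework normally performs.
documents = ["big data big ideas", "data mining of big data"]
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)

word_counts = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(sorted(word_counts))  # [('big', 3), ('data', 3), ('ideas', 1), ('mining', 1), ('of', 1)]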
MapReduce: Data Flow
• Input and final output are stored on a distributed file system (FS). (A scheduler tries to schedule map tasks
”close” to physical storage location of input data.)
MapReduce: Environment
• Map worker failure: Map tasks completed or in-progress at worker are reset to idle and rescheduled. Reduce
workers are notified when map task is rescheduled on another worker.
• Reduce worker Failure: Only in-progress tasks are reset to idle and the reduce task is restarted.
Generally, MapReduce does not compose well for large applications (many times, chaining multiple map-reduce
steps is required)...
• Performance Bottlenecks: MapReduce incurs substantial overheads due to data replication, disk I/O, and serial-
ization (saving to disk is typically much slower than in-memory work).
• Implementation Difficulty: It’s difficult to program MapReduce directly. Many big data problems/algorithms are
not easily described as map-reduce.
Definition
Spark: Spark is a data-flow system built around the Resilient Distributed Dataset (RDD) abstraction. Higher-level
APIs such as DataFrames and Datasets were introduced in recent versions of Spark; they organize data in a more
structured way, which also enabled the introduction of SQL support.
• Fast data sharing which avoids saving intermediate results to disk and it caches data for repetitive queries (e.g.
for machine learning)
• Richer functions than just map and reduce and compatible with Hadoop
Definition
Spark Resilient Distributed Dataset (RDD): An RDD is a partitioned collection of records (generalization of
key-value pairs). It is spread across the cluster and read-only. It caches the dataset in memory and there
is a fallback to disk possible. It can be created from Hadoop, or by transforming other RDDs. They are best
suited for applications that apply the same operation to all elements of a dataset.
• Transformations (map, filter, join, union, intersection,..) build RDDs through deterministic operations on other
RDDs.
• Actions (count, collect, reduce, ...) can be applied to RDDs; they force the calculation and return a value or export
data (see the sketch below).
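A minimal PySpark sketch (assuming a local Spark installation; the input file name is a placeholder) of how transformations lazily build new RDDs and an action forces the work:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

# Transformations are lazy: they only describe how to build new RDDs.
lines = sc.textFile("input.txt")                # hypothetical input file
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)  # classic word count as an RDD pipeline

# Actions force the computation and return values to the driver.
print(counts.take(5))
print(counts.count())

sc.stop()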
Higher-level APIs like DataFrames and Datasets are different from RDDs (however, they are built on the Spark SQL
engine, and both can be converted back to an RDD).
Definition
DataFrame: Unlike an RDD, data is organized into named columns (like a table in a relational database). This
imposes a structure onto a distributed collection of data, allowing a higher-level abstraction.
Definition
Dataset: Extension of DataFrame API which provides type-safe, object-oriented programming interface
(compile-time error detection)
Spark vs. Hadoop MapReduce
• Spark can process data in-memory (Hadoop persists back to the disk after a map/reduce action) - it is normally
faster
• Spark is easier to program (higher-level APIs) and is more general in terms of data processing
3.3 Problems Suited for MapReduce
• Suppose there is a large web corpus. MapReduce can be used to find, for each host, the total number of bytes,
i.e. the sum of the page sizes for all URLs from that particular host.
• Other examples include: link analysis, graph processing and machine learning algorithms
• MapReduce can be helpful to count the number of times every 5-word sequence (5-grams) occurs in a large corpus
of documents. In this case map can extract (5-gram, count) from the document and reduce combines the counts.
• Compute the natural join R(A, B) ▷◁ S(B, C), where R and S are stored in files and tuples are pairs (a, b) or (b, c).
When applying MapReduce to the join operation, one uses a hash function h from B-values to 1...k. A Map process
turns each input tuple R(a, b) into the key-value pair (b, (a, R)) and each input tuple S(b, c) into the key-value pair
(b, (c, S)), and sends each key-value pair with key b to Reduce process h(b). Each Reduce process matches all pairs
(b, (a, R)) with all (b, (c, S)) and outputs (a, b, c). (A small simulation of this follows below.)
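A small pure-Python simulation of this join, sketched under the assumption that the shuffle can be modeled by an in-memory dictionary; the relations, the number of reducers k, and the hash function h are made-up placeholders:

from collections import defaultdict

R = [(1, "b1"), (2, "b2")]                 # tuples (a, b) of relation R(A, B)
S = [("b1", 10), ("b1", 20), ("b3", 30)]   # tuples (b, c) of relation S(B, C)
k = 4                                      # number of Reduce processes
h = lambda b: hash(b) % k                  # hash function from B-values to 0..k-1

# Map: tag each tuple with its relation and key it by the B-value.
reducers = defaultdict(list)
for a, b in R:
    reducers[h(b)].append((b, (a, "R")))
for b, c in S:
    reducers[h(b)].append((b, (c, "S")))

# Reduce: within each reducer, match R-tuples and S-tuples that share the same b.
joined = []
for pairs in reducers.values():
    r_side, s_side = defaultdict(list), defaultdict(list)
    for b, (val, tag) in pairs:
        (r_side if tag == "R" else s_side)[b].append(val)
    for b in r_side:
        joined += [(a, b, c) for a in r_side[b] for c in s_side.get(b, [])]

print(joined)  # [(1, 'b1', 10), (1, 'b1', 20)]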
3.4 Cost Measures of Algorithms
• (Elapsed) computational cost: Same as the elapsed communication cost, but counting only the running time of
processes.
(Note: Big-O notation is not well suited here, since more machines can always be added.)
• Dominant Cost: Either I/O (communication) cost (the cost of reading and writing data from storage (e.g., disk,
network transfers)) or processing (computation) cost (the CPU time needed to compute results.) Since one of
these usually dominates, we often ignore the other in cost models.
– Example: In a MapReduce job, if reading/writing data to HDFS is much slower than computation, we focus on
I/O cost. Conversely, for CPU-heavy tasks (like cryptographic operations), computation cost is more critical.
• Total Cost: The overall resources used for computation (e.g., CPU hours, memory, storage, bandwidth). In
cloud computing (e.g., AWS, Google Cloud), total cost translates to monetary cost, so what you pay for running
a job.
• Elapsed Cost: This measures the actual time it takes to complete a job. Parallelism helps reduce elapsed time:
instead of one machine working for 10 hours, 10 machines might complete the task in 1 hour.
• Communication cost: input file size + 2×(sum of the sizes of all files passed from Map processes to Reduce
processes) + sum of the output sizes of the reduce processes.
• Elapsed communication cost: The sum of the largest input + output for any map process, plus the same for any
reduce process
When performing a join operation between relations R and S in a MapReduce setting, the cost is primarily determined
by communication (I/O cost) and computation cost.
• Elapsed Communication Cost O(s): We put a limit s on the amount of input or output that any one process can
have; s could be what fits in main memory or what fits on local disk. The elapsed communication cost is then the
actual wall-clock time spent on communication.
• Computation Cost: With proper indexing, the computation cost is linear in the input and output size. This means
that:
O(Computation) = O(|R| + |S| + |R ▷◁ S|)
which is like the communication cost. Without indexes, a naive join might require expensive operations like sorting
or nested loop joins. With indexes, lookups are efficient, and computational cost is directly tied to the size of the
input and output.
4 Association Rule Discovery
4.1 Introduction: Market-basket Model
The goal of the market-basket model is to identify items that are bought together by sufficiently many customers. One
approach can be by processing the sales data collected with barcode scanners to find dependencies among items. A
classic rule is for example if someone buys diaper and milk, then they are likely to buy beer.
Therefore, we have a large set of items (e.g. things sold in a supermarket) and a large set of baskets (a subset of
items e.g. things one customer buys on one day). The task is to discover association rules like people who bought
{x,y,z} tend to buy {w,v}. More generally it is a many-to-many mapping (association) between two kinds of things
and we are interested in connections among items.
• Items = Products (e.g., milk, bread,...); Basket = Sets of products bought in a single trip. Example: Amazon’s
people who bought X (mouse) also bought Y (computer)
• Items = Documents; Baskets = Sentences. Example: If a particular sentence appears in multiple documents (i.e.,
the same item is found in many baskets), this might represent plagiarism
• Items = Drugs & side-effects; Baskets = Individual patients; Example: By analyzing which drugs and side effects
commonly occur together in patient records, we can discover potential drug interactions.
• Example: Support of {Beer, Bread} = 2 means that this combination appears in two baskets.
Definition of Association Rules
• If-then rule: {i1 , i2 , .., ik } → j means: "if a basket contains all of i1 , ..., ik , then it is likely to contain j". The
confidence of the rule is
conf(I → j) = support(I ∪ {j}) / support(I)
• Interesting rules: We only want to look at interesting rules: for example, the rule X → {milk} has high confidence
but is not interesting, because milk is purchased very often regardless of X. Therefore, we use the interest of a rule,
the absolute difference (to capture both positive and negative associations) between its confidence and the fraction of
baskets that contain j:
Interest(I → j) = |conf(I → j) − Pr[j]|
which we typically want to be above 0.5 (this also means that Pr[j] itself is not too high).
• In the example above the support measures how often the itemset {Milk, Beer, Coke} appears in the dataset.
The confidence measures how often Coke appears given that milk and beer were bought. The interest measures
how much this rule is different from the general probability of Coke appearing.
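As a small illustration (the basket contents below are made up), support, confidence, and interest can be computed directly from a list of baskets:

baskets = [
    {"milk", "beer", "coke"}, {"milk", "beer"}, {"milk", "coke"},
    {"beer", "coke"}, {"milk", "beer", "bread"}, {"coke"},
]
n = len(baskets)

def support(itemset):
    """Number of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets)

I, j = {"milk", "beer"}, "coke"
conf = support(I | {j}) / support(I)        # conf(I -> j)
interest = abs(conf - support({j}) / n)     # |conf(I -> j) - Pr[j]|

print(support(I), conf, interest)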
Association Rule Mining is a fundamental technique in data mining, primarily used to discover relationships (associations)
between items in large datasets (e.g., market basket analysis in retail).
1. Step - Find frequent Itemsets:
• Goal: Find all association rules that satisfy: Support ≥ s (Minimum Support Threshold) and Confidence ≥ c
(Minimum Confidence Threshold) (definition for frequency)
• Difficulty: Finding frequent itemsets (groups of items that appear together often). If {i1 , i2 , . . . , ik } → {j}
has high support and confidence, then both {i1 , i2 , . . . , ik } and {i1 , i2 , . . . , ik , j} will be ”frequent”.
conf(I → j) = support(I ∪ {j}) / support(I)
If a dataset has N unique items, there are 2^N − 1 possible itemsets, which grows exponentially. Consequently,
an efficient way to find only the frequent ones is needed.
2. Step - Generate Rules from Itemsets: For each frequent itemset I , generate rules of the form:
A → I \ A where A ⊂ I
To reduce the number of rules, one can post-process them and only output:
• Maximal Frequent Itemsets: An itemset is maximal if none of its immediate supersets are frequent. This
provides aggressive pruning (skipping of parts of workspace that are unlikely to produce useful results) of the
itemset space.
• Closed Frequent Itemsets: An itemset is closed if no immediate superset has the same support (> 0). This
reduces redundancy while preserving exact supports/counts.
Typically, data is kept in flat files (stored on disk and basket-by-basket) rather than in a database system. Baskets are
small but we have many baskets and many items.
Note: To find frequent itemsets, we have to count them. To count them, we have to enumerate them.
Main-Memory Bottleneck
• In frequent-itemset mining, main memory is a critical resource because algorithms must keep track of item
occurrences while reading transactions.
• The primary cost is disk I/O rather than CPU computation, since accessing disk storage is significantly slower
than in-memory operations.
• Many algorithms process data in multiple passes over the dataset, requiring frequent reads from disk. Thus, the
cost is measured in terms of the number of disk passes.
• Memory limitation: The number of different item counts we can store is constrained by available main memory. If
the dataset is large, it may exceed memory capacity.
• Swapping data between memory and disk is inefficient, as it significantly increases processing time. Efficient
algorithms aim to minimize disk I/O and avoid excessive memory swapping.
4.3.1 Finding Frequent Pairs
Finding frequent pairs of items {i1 , i2 } is actually the hardest problem, because pairs are common and frequent
whereas triples are rare (the probability of being frequent drops exponentially with size).
The scenario is the following: we aim to identify frequent pairs and for that we will enumerate all pairs of items. But,
rather than keeping a count for every pair, we hope to discard a lot of pairs and only keep track of the ones that will in
the end turn out to be frequent. One very simple approach to find frequent pairs is the following algorithm:
• Naive Algorithm: This is a simple brute-force approach to identify frequent item pairs in large datasets, such as
market baskets.
Approach:
1. Read the dataset (baskets) once, counting in memory the occurrence of each pair. From each basket b of nb
items, generate its nb (nb − 1)/2 pairs by two nested loops.
2. Use an appropriate data structure to keep track of counts of every pair:
– Approach 1: A triangular matrix where rows and columns represent items. Each pair of items (i, j) with i < j is
stored in a fixed-size triangular array, requiring 4 bytes per pair. The total number of possible pairs is n(n−1)/2,
leading to a total memory usage of O(n²) bytes. This approach becomes infeasible for large datasets
because all possible pairs are stored, even those that never appear in any basket. (See the indexing sketch
after this list.)
– Approach 2: A hash table that stores counts as triples (item1 , item2 , count).
* Instead of storing all possible pairs, we only keep track of pairs that actually appear in the data.
* Each stored pair requires 12 bytes per occurring pair (4 bytes each for two item IDs + 4 bytes for
the count), plus some additional memory for hash table overhead.
* This approach outperforms Approach 1 if fewer than 1/3 of all possible pairs actually occur, as memory
usage is then significantly reduced.
3. At the end of the scan, identify which pairs have high enough support (i.e. appear frequently).
Approach 2 beats Approach 1 if fewer than 1/3 of the possible pairs actually occur.
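A sketch of the triangular-array indexing behind Approach 1, assuming items have been renumbered 0..n−1 (the exact index formula used in the lecture may differ):

n = 5                                   # number of distinct items, renumbered 0..n-1
counts = [0] * (n * (n - 1) // 2)       # one counter per unordered pair {i, j}, i < j

def pair_index(i, j, n):
    """Map a pair (i, j) with i < j to a unique slot in the 1-D triangular array."""
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

counts[pair_index(1, 3, n)] += 1        # count one occurrence of the pair {1, 3}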
Problem: When we have too many items, so all the pairs do not fit into memory (even for the hash-based approach
the hash table gets too large for memory). Hence, another approach is needed!
• A-Priori Algorithm: This algorithm limits the need for main memory through the key idea of monotonicity (if a
set of items I appears at least s times, so does every subset J of I) and its contrapositive for pairs (if item i does
not appear in at least s baskets, then no pair including i can appear in s or more baskets).
General Approach:
– Pass 1: Start with k = 1 (individual items). Read the baskets and count in main memory the number of
occurrences of each individual item (requires memory only proportional to the number of items). Items that
appear at least s times are the frequent items.
– Pass 2: Read the baskets again and keep counts only for those pairs in which both elements are frequent
(from pass 1). This requires memory proportional to the square of the number of frequent items, not the
square of the total number of items. Repeat the scheme for larger k until no new frequent itemsets are found
(a sketch of the two passes follows below).
Figure 8: Main-Memory: Picture of A-Priori
– Ck = candidate k-tuples = those that might be frequent sets (support ≥ s), based on information from the
pass for k − 1
– Lk = the set of truly frequent k-tuples
– Consider the candidate set C1 = {{b}, {c}, {j}, {m}, {n}, {p}}; count the support and prune items below the
minimum support threshold. As a result we get the frequent 1-itemsets L1 = {b, c, j, m} (assuming {n} and
{p} were not frequent enough).
– Form pairs from the frequent 1-itemsets L1 : C2 = {{b, c}, {b, j}, {b, m}, {c, j}, {c, m}, {j, m}}; count the
support of each pair and prune the non-frequent ones. As a result, one gets the frequent 2-itemsets L2 =
{{b, c}, {b, m}, {c, j}, {c, m}} (assuming {j, m} was infrequent and got pruned).
– Form triples from the frequent 2-itemsets L2 : C3 = {{b, c, m}, {b, c, j}, {c, m, j}}; count the support of each
and prune infrequent triples. As a result we get the frequent 3-itemsets L3 = {{b, c, m}} (assuming {b, c, j}
and {c, m, j} were infrequent and got pruned).
One pass for each k (itemset size) needs space in main memory to count each candidate k-tuple. For typical
market-basket data and reasonable support (e.g. 1%), k = 2 requires the most memory.
Note: We generate new candidates by generating Ck from Lk−1 and L1 . But one can be more careful with
candidate generation. For example, in C3 we know {b, m, j} cannot be frequent since {m, j} is not frequent.
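A compact pure-Python sketch of the two A-Priori passes for pairs (k = 2); the toy baskets and the support threshold are made up, and with these values the result happens to match the L2 of the example above:

from collections import Counter
from itertools import combinations

baskets = [{"b", "c", "m"}, {"b", "c", "j"}, {"b", "m"}, {"c", "j", "m"}]
s = 2  # support threshold

# Pass 1: count individual items, keep the frequent ones.
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {i for i, c in item_counts.items() if c >= s}

# Pass 2: count only pairs whose two elements are both frequent.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket & frequent_items), 2):
        pair_counts[pair] += 1

L2 = {pair for pair, c in pair_counts.items() if c >= s}
print(L2)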
• PCY Algorithm (Park-Chen-Yu): This algorithm is an improvement over A-Priori. The problem with A-Priori is that
during pass 1 most of main memory is idle; this unused memory can be exploited to reduce the memory required in pass 2.
General Approach:
– Pass 1: In addition to item counts, maintain a hash table with as many buckets/elements as fit in memory.
Keep a count for each bucket into which pairs of items are hashed. For each bucket, just keep the count, not
the actual pairs that hash to the bucket.
from itertools import combinations

# counts is assumed to be a Counter/defaultdict(int); buckets is a list of num_buckets zeros
for basket in baskets:
    for item in basket:
        counts[item] += 1
    # new step for PCY: hash every pair of items in the basket to a bucket and count the bucket
    for p, q in combinations(sorted(basket), 2):
        buckets[hash((p, q)) % num_buckets] += 1
• Pass 2: Only count pairs that hash to frequent buckets. For that, the bucket counts are replaced by a bit-vector,
where 1 means the bucket count exceeded the support s (→ call it a frequent bucket) and 0 means it did not. The
4-byte integer counts are replaced by bits, so the bit-vector requires only 1/32 of the memory. Also, decide which
items are frequent and list them for the second pass.
Then count all pairs {i, j} that meet both conditions for being a candidate pair:
– i and j are both frequent items, and
– the pair {i, j} hashes to a bucket whose bit is 1 (a frequent bucket).
Both conditions are necessary for the pair to have a chance of being frequent.
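A self-contained sketch of both PCY passes with the bitmap step in between (the baskets, support threshold, and bucket count are placeholders; the modulo-based bucketing matches the pass-1 snippet above):

from collections import Counter
from itertools import combinations

baskets = [{"b", "c", "m"}, {"b", "c", "j"}, {"b", "m"}, {"c", "j", "m"}]
s, num_buckets = 2, 8

# Pass 1: item counts plus hashed bucket counts.
counts, buckets = Counter(), [0] * num_buckets
for basket in baskets:
    counts.update(basket)
    for p, q in combinations(sorted(basket), 2):
        buckets[hash((p, q)) % num_buckets] += 1

# Between the passes: compress bucket counts into a bitmap and list the frequent items.
bitmap = [int(c >= s) for c in buckets]
frequent_items = {i for i, c in counts.items() if c >= s}

# Pass 2: count a pair only if both items are frequent AND it hashes to a frequent bucket.
pair_counts = Counter()
for basket in baskets:
    for p, q in combinations(sorted(basket), 2):
        if p in frequent_items and q in frequent_items and bitmap[hash((p, q)) % num_buckets]:
            pair_counts[(p, q)] += 1

print({pair for pair, c in pair_counts.items() if c >= s})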
Note on the above graphic: Buckets require only a few bytes each (we do not have to count past s), and the number of
buckets is O(main-memory size).
On the second pass, a table of (item, item, count) triples is essential (we cannot use the triangular-matrix approach).
Thus, the hash table must eliminate approximately 2/3 of the candidate pairs for PCY to beat A-Priori.
The random sampling algorithm takes a random sample of the market baskets and runs A-Priori or one of its improvements
in main memory, so we don't pay for disk I/O each time we increase the size of the itemsets. Reduce the support
threshold proportionally to match the sample size (example: if the sample size is 1/100 of the baskets, use s/100 as
the support threshold instead of s).
• To avoid false positives: Optionally, verify that the candidate pairs are truly frequent in the entire data set by a
second pass
• But you don’t catch sets that are frequent in the whole data but not in the sample:
– Smaller threshold, e.g. s/125, helps catch more truly frequent itemsets
– But requires more space - SON algorithm tries to deal with this!
The SON algorithm is a 2-pass algorithm: it repeatedly reads small subsets of the baskets into main memory and
runs an in-memory algorithm to find all frequent itemsets. (Note: we are not sampling, but processing the entire file in
memory-sized chunks.) An itemset becomes a candidate if it is found to be frequent in one or more subsets of the
baskets.
• On a second pass, count all the candidate itemsets and determine which are frequent in the entire set.
• Key “monotonicity” idea: An itemset cannot be frequent in the entire dataset unless it is frequent in at least one
subset (pigeonhole principle)
• However, with the simple sampling algorithm we still don't know whether we found all frequent itemsets (an itemset
may be infrequent in the sample but frequent overall). Toivonen's Algorithm addresses this by adding a negative
border to the candidates from the sample:
• Pass 1:
– Start with the random sample, but lower the threshold slightly for the sample: Example: If the sample is 1 %
of the baskets, use s/125 as the support threshold rather than s/100.
– Find frequent itemsets in the sample
– Add the negative border (an itemset is in the negative border if it is not frequent in the sample, but all of its
immediate subsets are) to the itemsets that are frequent in the sample.
• Pass 2: Count all candidate frequent itemsets from the first pass, and also count sets in their negative border.
• If no itemset from the negative border turns out to be frequent, then we have found all the frequent itemsets.
If something in the negative border is frequent, we must start over again with another sample.
Try to choose the support threshold so the probability of failure is low, while the number of itemsets checked on
the second pass fits in main-memory.
• Theorem: If there is an itemset S that is frequent in the full data, but not frequent in the sample, then the negative
border contains at least one itemset that is frequent in the full data.
• Proof by contradiction (sketch): Suppose no itemset in the negative border is frequent in the full data. Let T be a
smallest itemset that is frequent in the full data but not frequent in the sample. Every immediate subset of T is frequent
in the full data (monotonicity), and by the minimality of T each of these subsets is also frequent in the sample. Hence T
is not frequent in the sample while all of its immediate subsets are, so T lies in the negative border; but T is frequent in
the full data, contradicting the assumption.
5 High Dimensional Data
5.1 Locality Sensitive Hashing
Definition
Locality Sensitive Hashing (LSH): A technique used to efficiently find similar items in large datasets (sim-
ilarity search). Many problems can be expressed as finding similar sets/ near neighbors in high-dimensional
data. For example: pages with similar words (duplicate detection), customers who purchased similar products
(products with similar customer sets), or users who visited similar websites (websites with similar user sets).
• The Scene Completion Problem refers to the task of filling in missing parts of an image by searching for similar
regions from a large image database. The idea is to replace missing parts with visually coherent patches from
other images. Considering a dataset consisting of 2 million images, an algorithm searches for the 10 most similar
patches (10 nearest neighbors) from the dataset. A blending algorithm integrates the best-matching patch into the
original image, making the transition seamless.
• Given a set of data points and some distance function d(x1 , x2 ) to quantify the "distance" between x1 and x2 , the
goal is to find all pairs of data points (xi , xj ) that are within some distance threshold, d(xi , xj ) ≤ s (hence, that are similar).
Before finding similar items, one first needs to define what distance means. For example the Jaccard distance/simi-
larity exists.
• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C1 , C2 ) = |C1 ∩ C2 | / |C1 ∪ C2 |
• Jaccard distance:
d(C1 , C2 ) = 1 − |C1 ∩ C2 | / |C1 ∪ C2 |
• Example: if there are 3 items in the intersection and 8 in the union, the Jaccard similarity is 3/8 and the Jaccard
distance is 5/8.
Task: Finding similar documents
The goal is to find near-duplicate pairs, given a large number (N ∼ millions/billions) of documents. A common application
is detecting mirror websites (sites that are exact copies of another site, just hosted under a different URL) or approximate
mirrors, where we don't want to show both in search results. Another application is clustering similar news articles by
"same story".
For the application of finding similar documents it is not enough to treat a document as a set of (key)words, because
that ignores word order and therefore loses context. A better approach is shingles (n-grams).
Definition
Shingle: A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the docu-
ment. Tokens can be characters, words, or something else, depending on the application.
Example: We assume the tokens are characters, take k = 2, and document d1 = "abcab". Then the set of 2-shingles is
S(D1 ) = {ab, bc, ca}. If we took a bag (multiset) instead, "ab" would be counted twice: S ′ (D1 ) = {ab, bc, ca, ab}.
Solution: We can compress shingles by hashing them to a smaller representation like 4-byte integer (32-bit
hash). For example: h(ab) = 1, h(bc) = 5, h(ca) = 7, then h(D1 ) = {1, 5, 7}. This approach is more efficient and
uses less space and memory.
Measure Similarity using Jaccard Similarity Measure: A document is a set of k-shingles and can be repre-
sented as C1 = S(D1 ). Using bit vector encoding it becomes a set of 0/1 vectors, where each unique element
in the universal set gets a position (dimension) in the bit vector. So, if the entire dataset has unique shingles
{ab, ba, cd, da} for document D1 this means a binary vector of v1 = [1, 1, 0, 0]. These vectors tend to be very
sparse because the total number of possible shingles is usually very large but most documents contain only a
small subset of them. Therefore, an appropriate similarity measure is the Jaccard similarity.
Note: Documents that have lots of shingles in common have similar text, even if the text appears in different order.
The difficulty is to choose the right k to find truly similar documents. k = 5 is OK for short documents, but k = 10
is better for long documents.
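A small sketch of character-level k-shingling and the Jaccard similarity of the resulting shingle sets (hashing shingles to 4-byte integers is omitted; the documents are made up):

def shingles(doc: str, k: int = 2) -> set[str]:
    """Set of all character k-shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

s1, s2 = shingles("abcab"), shingles("abcb")
print(s1)                 # {'ab', 'bc', 'ca'}
print(jaccard(s1, s2))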
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity!
Problem: Using the previously established bit-vector encoding with N = 1 million documents would mean a huge
boolean matrix whose columns are the documents and whose rows are the shingles. Computing unions and intersections
explicitly for all pairs of columns would be very inefficient, taking anywhere from days to years.
Definition
Signatures: Short integer vectors that represent the sets and reflect their similarity. In this case the
signature is a small hash of a column so that it fits in memory. The goal is to find a hash function such
that:
The hash values are then mapped to specific buckets (containers). Documents with the same or similar
hash values are placed in the same bucket.
The hash function clearly depends on the similarity metric (not all similarity matrics have a suitable
hash function), which is in this case Jaccard similarity. The suitable hash function for Jaccard similarity is
Min-Hashing.
Min-Hashing is used to efficiently estimate Jaccard similarity between large sets. The main idea is to:
(a) Permute the Rows Randomly: Apply a random permutation π to the rows of the boolean matrix.
(b) Hash Function Definition: hπ (C) = index of the first row (in the permuted order) where column C has a value of 1.
In other words, hπ (C) gives the position of the first ”1” in the permuted order for column C .
(c) Apply several independent permutations: Since a single permutation may not give accurate similarity
estimates, several independent permutations (e.g., 100 different hash functions) have to be used to generate
100 different hash values (or signatures) for each column (document). The signature vector for a document
then looks like: Signature(C) = [hπ1 (C), hπ2 (C), . . . , hπ100 (C)]. The similarity between signatures becomes
an average similarity over all permutations.
This is why comparing signatures directly is an accurate approximation of comparing the original sets
(this can be proven but is out of scope). This is crucial because signatures are much smaller, making similarity
computation significantly faster and more space-efficient.
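A numpy sketch of Min-Hashing with explicitly materialized random permutations (real implementations usually simulate permutations with random hash functions; the small shingle-document matrix below is made up):

import numpy as np

rng = np.random.default_rng(0)

# Boolean shingle-document matrix: rows = shingles, columns = documents.
M = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 1, 0]])

n_rows, n_docs, n_perm = M.shape[0], M.shape[1], 100

signatures = np.zeros((n_perm, n_docs), dtype=int)
for p in range(n_perm):
    order = rng.permutation(n_rows)          # random permutation of the rows
    permuted = M[order]
    signatures[p] = permuted.argmax(axis=0)  # index of the first 1 in each permuted column

# The fraction of agreeing signature rows estimates the Jaccard similarity of two columns.
est = (signatures[:, 0] == signatures[:, 2]).mean()
true = (M[:, 0] & M[:, 2]).sum() / (M[:, 0] | M[:, 2]).sum()
print(est, true)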
3. Locality-Sensitive Hashing (LSH): Focus on identifying pairs of signatures that are likely to come from
similar documents (candidate pairs).
Goal: Efficiently find document pairs whose Jaccard similarity is at least s (for some similarity threshold, e.g.,
s = 0.8). LSH leverages the idea of using a function f (x, y) that indicates whether x and y form a candidate pair
(i.e., a pair of elements whose similarity needs to be evaluated).
Candidate pairs from Min-Hash matrices are generated by hashing columns of the signature matrix M into
multiple buckets. Each pair of documents that hashes to the same bucket is considered a candidate pair. Since
the columns of the signature matrix M are hashed multiple times, similar columns are likely to hash into the same
bucket with high probability. Given the similarity threshold s such that 0 ≤ s ≤ 1, columns x and y of the
signature matrix M are considered a candidate pair if their signatures agree on at least a fraction s of their rows.
• Logic: Divide matrix M into b bands of r rows. For each band, hash its portion of each column to a hash
table with k buckets. A pair that matches in 1 or more band becomes a candidate pair. If two columns are
similar, they are likely to match in many bands. If two columns are dissimilar, they are unlikely to match in any
band. Tune b and r to catch most similar pairs, but few non-similar pairs.
Tradeoff in LSH: There is an inherent tradeoff between false positives (detected as similar but not similar)
and false negatives (not detected as similar but similar) when tuning the parameters:
The challenge is to find the right balance between detecting similar pairs (true positives) and avoiding false
detections (false positives). For example, if we used only 15 bands of 5 rows, the number of false positives would go
down, but the number of false negatives would go up.
The accompanying plots (omitted here) show what we want (an ideal step function at the similarity threshold), what 1
band of 1 row gives (probability of becoming a candidate = similarity), and the S-curve obtained by picking r and b to
get the best shape, e.g., 50 hash functions with r = 5 and b = 10.
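The S-curve behind these plots comes from the standard banding analysis: two columns with Jaccard similarity t agree in one particular band of r rows with probability t^r, so they become a candidate pair in at least one of the b bands with probability 1 − (1 − t^r)^b. A quick numerical check with the parameter values quoted above:

def candidate_prob(t: float, r: int, b: int) -> float:
    """Probability that a pair with similarity t becomes a candidate pair."""
    return 1 - (1 - t ** r) ** b

for t in (0.2, 0.4, 0.6, 0.8):
    print(t, round(candidate_prob(t, r=5, b=10), 3))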
LSH Summary: Tune the number of Min-Hash functions, b, and r so that most truly similar pairs become candidates
while few dissimilar pairs do; then check in main memory that candidate pairs really do have similar signatures, and
optionally verify the candidates against the original documents.
6 Clustering
Goal: Given a set of points, with a notion of distance between points, group the points into some number of clusters,
such that
• Members of the same cluster are close (i.e., similar) to each other
• Members of different clusters are far apart (i.e., dissimilar)
• Usually: points are in a high-dimensional space, and similarity is defined using a distance measure (e.g., Euclidean,
cosine, Jaccard)
Clustering is hard!
• Clustering in two dimensions, or clustering small amounts of data, is easy. High-dimensional spaces look different:
almost all pairs of points are at about the same distance, or rather, almost all pairs of points are very far from each
other. This is the curse of dimensionality!
6.1 Hierarchical Clustering
• Agglomerative (bottom-up): Initially, each point is a cluster. Repeatedly combine the two nearest clusters into
one.
Point assignment:
• Maintain a set of clusters, where points belong to the nearest cluster. This clustering strategy is the best when
clusters are nice, convex shapes.
Euclidean case:
• When merging clusters, the location of a cluster is represented by its centroid (artificial) point.
• The centroid is the average of the (data) points in the cluster.
• The distance between two clusters is measured as the distance between their centroids.
• At each step, the two clusters with the shortest distance are merged.
Figure 12: Example: Hierarchical Clustering
Non-Euclidean case: Note that a centroid (average point) representation is not always possible, since in some settings
we work in very high dimensions and distances depend on properties of the objects rather than on simple coordinates.
• Intercluster distance = the minimum distance between any two points, one from each cluster
• Or pick a notion of cohesion of clusters, e.g., maximum distance from the clustroid. Merge clusters whose
union is most cohesive. There are different notions of cohesion:
– Use the diameter of the merged cluster = maximum distance between points in the cluster
– Use the average distance between points in the cluster
– Use a density-based approach (take the diameter or average distance, e.g., and divide by the number
of points in the cluster)
• Large-data clustering requires loading one batch of data at a time, cluster them in memory, and keep summaries
of clusters. Example: BFR, CURE
6.2 K-Means
K-Means Clustering
The underlying assumption in K-Means is that samples are drawn from a mixture of Gaussians. The goal is to
estimate the cluster centers µk . This leads to a classic chicken-and-egg problem: To find the cluster centers (µk ), we
need the cluster memberships, and to determine the cluster memberships, we need the cluster centers.
• Given: data points x1 , . . . , xn and the desired number of clusters k
• Clustering Algorithm:
1. Randomly initialize the cluster means µ1 , . . . , µk
2. Repeat until convergence (i.e., until cluster centers no longer change significantly):
(a) Cluster Membership Assignment: For each sample xi , assign it to the closest cluster:
k(i) := arg min_k ∥xi − µk ∥²
(b) Cluster Mean Recalculation: For each cluster k, recompute the mean of the assigned points (a small numpy
sketch of both steps follows after this list):
µk := (1 / |{xi : k(i) = k}|) · Σ_{xi : k(i)=k} xi
• Evaluation Metrics:
– Internal: Bayes (Schwarz) Information Criterion (BIC)
– External: Purity (requires ground truth labels)
• Bayes (Schwarz) Information Criterion (BIC): For K clusters and a corresponding number of model parameters
f(K), choose
K* = arg min_K [ −2 ln P(x1 , . . . , xn | K) + f(K) · log n ]
• Further Analysis:
– Computational Cost: O(n · k · #iterations); can be optimized using the triangle inequality
– Convergence Criterion: Stop when there is no change in cluster assignments (i.e., cluster membership
stabilizes)
– Empty Clusters: If a cluster ends up empty, reinitialize its center randomly
– Is K-Means Deterministic? No — it is a local search algorithm, and results depend on random initialization
– Limitations: K-Means is best suited for isotropic (spherical) Gaussians. It assumes equal variance in
all directions and performs hard assignment (each point belongs to exactly one cluster). For non-isotropic
distributions or soft assignments, other clustering methods are needed.
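A minimal numpy sketch of the two alternating K-Means steps, using random toy data, random initialization, and a fixed iteration budget instead of a proper convergence check (all parameter choices are placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                   # toy data: 200 points in 2-D
k = 3
mu = X[rng.choice(len(X), k, replace=False)]    # random initialization of the means

for _ in range(20):                             # fixed iteration budget
    # (a) membership assignment: closest mean for every point
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # (b) mean recalculation for every non-empty cluster
    for c in range(k):
        if np.any(labels == c):
            mu[c] = X[labels == c].mean(axis=0)

print(mu)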
6.3 The BFR Algorithm
The BFR (Bradley-Fayyad-Reina) algorithm is a variant of k-means, designed to handle very large, disk-resident data
sets. Unlike standard k-means, which can become inefficient with large datasets, BFR efficiently summarizes clusters
without storing all the data points in memory.
Key Characteristics of BFR
• Cluster Assumptions: Clusters are assumed to be normally distributed around a centroid in Euclidean space, with
the axes of each cluster aligned with the coordinate axes.
• Each cluster is summarized by a few statistics rather than by its points:
– N: the number of points
– SUMx: the sum of the point coordinates in each dimension x
– SUMSQx: the sum of the squared coordinates in each dimension x
For a dimension x:
SUMx = Σ_{i=1}^{N} xi ,   SUMSQx = Σ_{i=1}^{N} xi²
From these, the mean µ and the variance σ² are calculated as:
µx = SUMx / N ,   σx² = SUMSQx / N − (SUMx / N)²
Example
Given the points (1, 2), (2, 1), and (1, 1):
N = 3
SUMx = 1 + 2 + 1 = 4,   SUMy = 2 + 1 + 1 = 4
SUMSQx = 1² + 2² + 1² = 6,   SUMSQy = 2² + 1² + 1² = 6
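A quick numerical check of these summary statistics and the mean and variance derived from them (numpy assumed):

import numpy as np

points = np.array([[1, 2], [2, 1], [1, 1]])

N = len(points)
SUM = points.sum(axis=0)             # [4, 4]
SUMSQ = (points ** 2).sum(axis=0)    # [6, 6]

mean = SUM / N
var = SUMSQ / N - (SUM / N) ** 2
print(mean, var)                     # ≈ [1.33 1.33] and [0.22 0.22]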
6.3.1 The CURE Algorithm
Definition
CURE (Clustering Using REpresentatives) is a clustering algorithm that assumes a Euclidean distance metric but
makes no assumptions about the shape, orientation, or distribution of clusters. Unlike k-means, which represents
each cluster by its centroid, CURE uses a set of well-dispersed representative points for each cluster. These
points better capture the geometry and boundaries of arbitrarily shaped clusters. The algorithm requires the
number of clusters k to be specified in advance.
Pass 1
• Pick a random sample of the data and cluster it in main memory using hierarchical clustering.
• For each cluster, pick a set of representative points: well-scattered points that are then moved a fixed fraction (e.g.,
20%) of the distance toward the cluster's centroid.
Pass 2
• Rescan the whole dataset and visit each point p in the data set.
– Normal definition of closest: find the closest representative to p and assign p to that representative's cluster.
6.4 Dimensionality Reduction
Dimensionality reduction asks: how many dimensions do I need to keep in order to preserve the structure of the data?
The matrix rank answers this question mathematically: how many independent dimensions are required to fully describe
the data?
Why do we need dimensionality reduction?
• Remove redundant and noisy features: not all features (e.g., words) are useful
6.4.1 Rank of a Matrix
In linear algebra, the rank of a matrix A is the dimension of the vector space generated (or spanned) by its columns. This
corresponds to the maximal number of linearly independent columns of A. Here is an example of how to find it:
Example
Let’s say you have data points in 3D space:
• A = (1, 2, 1)
• B = (−2, −3, 1)
• C = (−1, −1, 2)
Despite having 3D coordinates, these points actually lie on a plane in 3D. This means the rank of the matrix
would be 2, as you only need two dimensions to describe the data points, even though they exist in a 3D space.
Finding the Rank of the Matrix:
The matrix representing the data points (as rows) is:
M = [  1   2  1
      −2  −3  1
      −1  −1  2 ]
Performing row reduction (Gaussian elimination) gives the row echelon form:
[ 1  2  1
  0  1  3
  0  0  0 ]
Since there are two non-zero rows, the rank of the matrix is 2, meaning the data lies on a 2D plane in 3D space.
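The same rank can be verified numerically (numpy assumed):

import numpy as np

M = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [-1, -1, 2]])

print(np.linalg.matrix_rank(M))  # 2: the three points span only a 2-D plane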
6.5 Singular Value Decomposition (SVD)
We are given a ratings matrix A, where rows are users and columns are movies:
A = [ 5  4  0  0
      5  3  0  0
      4  5  0  0
      5  2  0  0
      0  0  4  5
      0  0  5  2
      0  0  4  5 ]
Users 1–4 like Sci-Fi movies (*Alien*, *Serenity*), users 5–7 prefer Romance (*Casablanca*, *Amélie*).
We apply SVD:
A = UΣVT
Where:
• U: user-to-concept matrix
• VT : concept-to-movie matrix
" #
0.68 −0.59
Sci-Fi axis
T
Σ = diag(12.4, 9.5, . . . ), U≈ .. .. , V ≈
. . Romance axis
The formula
∥A − B∥F = √( Σ_{i,j} (Aij − Bij)² )
is called the Frobenius norm. It measures how different two matrices A and B are, like a distance based on all
squared differences between corresponding entries.
In the context of SVD, we use this to find the best low-rank approximation:
Ak = arg min_{B : rank(B) = k} ∥A − B∥F
This means: among all matrices B of rank k, the matrix Ak (built from the top k singular values of A) is the closest
to A, measured using this distance. So, SVD gives the best possible simplified version of A with rank k.
Definition
Singular Value Decomposition (SVD):
Singular Value Decomposition is a matrix factorization technique used in linear algebra. It decomposes a given
matrix A of size m × n into three matrices:
A = UΣVᵀ
where:
• U is an m × r column-orthonormal matrix of left singular vectors ("concepts"),
• Σ is an r × r diagonal matrix holding the non-negative singular values in decreasing order,
• V is an n × r column-orthonormal matrix of right singular vectors (with r the rank of A).
SVD is widely used in areas such as dimensionality reduction, image compression, and solving linear systems.
Example
Let
A = [ 5  0
      0  2 ]
Since A is already diagonal with non-negative entries, its singular values are simply the diagonal entries:
σ1 = 5,  σ2 = 2   ⇒   Σ = [ 5  0
                             0  2 ]
The left and right singular vectors are just the standard basis vectors (identity matrix), so:
U = V = [ 1  0
          0  1 ]
Complexity: Computing the full SVD of an m × n matrix costs roughly O(min(n m², n² m)); less work is needed if only
the leading k singular values/vectors are required or if the matrix is sparse.
Sparsity in SVD
• In many real-world applications, the matrix A we want to decompose is very sparse (mostly zeros).
• However, the matrices U and V from the SVD of A are usually dense, meaning they contain mostly non-zero
values.
• This destroys sparsity, making storage and computation more expensive, and can reduce interpretability.
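A numpy sketch of computing the SVD of the (reconstructed) example rating matrix above and forming a rank-k approximation; the printed singular values depend on the exact matrix entries:

import numpy as np

A = np.array([[5, 4, 0, 0],
              [5, 3, 0, 0],
              [4, 5, 0, 0],
              [5, 2, 0, 0],
              [0, 0, 4, 5],
              [0, 0, 5, 2],
              [0, 0, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the two strongest "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)                               # singular values, largest first
print(np.linalg.norm(A - A_k, "fro"))  # Frobenius error of the rank-2 approximation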
6.6 CUR Decomposition
Definition
The CUR algorithm approximates a matrix A ∈ Rm×n using actual rows and columns from A, rather than abstract
components like in SVD.
A ≈ CU R
Where:
• C ∈ R^{m×c}: c actual columns of A
• R ∈ R^{r×n}: r actual rows of A
• U ∈ R^{c×r}: a small linking matrix computed such that the product CUR best approximates A
Construction:
1. Select a subset of c columns from A (possibly using randomized or importance sampling) to form C
2. Select a subset of r rows from A in the same way to form R
3. Let W be the intersection of the chosen columns and rows (i.e., W is the submatrix of A at those
row and column indices)
• Let W be the intersection of the selected columns C and rows R from the matrix A.
• Define W+ as the pseudoinverse of W.
• To compute W⁺, perform the SVD of W:
W = X Z Yᵀ
then invert the non-zero entries on the diagonal of Z,
Z⁺ii = 1 / Zii   (and Z⁺ii = 0 whenever Zii = 0),
and set W⁺ = Y Z⁺ Xᵀ.
Advantages of CUR: Because C and R consist of actual (and typically sparse) columns and rows of A, the factors stay
sparse and are easy to interpret, in contrast to the dense singular vectors produced by SVD.
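A rough numpy sketch of the CUR construction with uniform random sampling of rows and columns (practical CUR implementations sample proportionally to squared row/column norms; the random matrix is just a placeholder):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 50))
c = r = 10

cols = rng.choice(A.shape[1], size=c, replace=False)   # sampled column indices
rows = rng.choice(A.shape[0], size=r, replace=False)   # sampled row indices

C = A[:, cols]              # actual columns of A
R = A[rows, :]              # actual rows of A
W = A[np.ix_(rows, cols)]   # intersection of the chosen rows and columns
U = np.linalg.pinv(W)       # linking matrix via the pseudoinverse of W

A_approx = C @ U @ R
print(np.linalg.norm(A - A_approx, "fro") / np.linalg.norm(A, "fro"))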
SVD vs. CUR
Example
Based on a real experiment: we consider a large, sparse matrix built from bibliographic data. The goal was to
compare SVD vs. CUR for dimensionality reduction using accuracy (1 − relative sum of squared errors),
space ratio (#output entries / #input entries), and CPU time (total computation time in seconds).
Results:
• SVD achieves the highest accuracy, but is slow and uses much more space
7 Recommender Systems
Definition
Recommender Systems: The internet enables the near-zero-cost distribution of information about products,
services, and content, which leads to an overwhelming abundance of choices. As a result, users need effective
filters to help them navigate this information overload — this is where recommender systems come in. Rec-
ommender systems aim to provide personalized suggestions by learning user preferences and predicting what
items they might like. Common examples of platforms using recommender systems include Netflix, YouTube,
Amazon, and Spotify.
Definition
Collaborative Filtering: Given a user x, the goal is to identify a set N of other users whose rating behavior is
similar to that of user x. Then, estimate user x’s unknown ratings based on the ratings provided by users in the
set N .
Types of Recommendations: editorial and hand-curated lists, simple aggregates (e.g., top-10 lists, most popular items),
and recommendations tailored to individual users (the focus here).
Key Problems
• Sparsity: Most users interact with only a small fraction of the available items, resulting in a highly sparse matrix
with many missing values.
• Collecting known ratings: Gathering explicit (e.g., asking people to rate items) or implicit (e.g., inferring ratings from
user actions, such as a purchase suggesting a high rating) feedback to populate the matrix can be difficult. Feedback
varies in form (e.g., thumbs-up, 1–5 stars, viewing time) and is not always available for all users or items.
• Predicting unknown ratings: The primary task is to accurately infer missing entries — in particular, to identify
items the user is likely to enjoy. We are generally more interested in high ratings (what the user would like) than
low ones.
• Evaluating recommendations: Measuring the effectiveness of predicted recommendations requires suitable per-
formance metrics. Commonly used metrics include Precision, Recall, and the F1-score, especially when gener-
ating top-N recommendation lists.
7.1 Content-Based Approach
Content-based recommender systems suggest items to users based on the similarity of content between items that
the user has already liked and new items. For example, if a user likes movies with a certain director or genre, the system
will suggest other movies with similar attributes. Similarly, for blogs, articles, or websites, it looks for similar content
topics.
Each item is represented by a vector of features that describes its content. Examples include:
• Text documents: Important words extracted from the text (using TF-IDF).
This is a common method to weigh the importance of words in documents, emphasizing unique words in each document
to capture its essence.
User profiles are built based on the user's rating history or interactions with items. Possible methods include:
• A (weighted) average of the profiles of the items the user has rated
• Variation: weight the item profiles by how much the user's rating differs from the average rating
Prediction Heuristic:
The system estimates the user’s preference for a new item using cosine similarity:
u(x, i) = cos(x, i) = (x · i) / (∥x∥ · ∥i∥)
where x is the User profile vector, i is the Item profile vector, and it calculates how closely the user’s interests align with
the item’s attributes.
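A tiny sketch of this prediction heuristic with made-up feature vectors (e.g., TF-IDF weights); the profile values are placeholders:

import numpy as np

def cosine(x, i):
    return float(x @ i / (np.linalg.norm(x) * np.linalg.norm(i)))

user_profile = np.array([0.9, 0.1, 0.4])   # hypothetical preference weights for 3 features
item_a = np.array([1.0, 0.0, 0.5])         # item profiles (e.g., TF-IDF scores)
item_b = np.array([0.0, 1.0, 0.1])

print(cosine(user_profile, item_a))  # high score: likely to be recommended
print(cosine(user_profile, item_b))  # low score: unlikely to be recommended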
Pros:
• No need for data on other users – avoids the cold-start and sparsity problems.
• Recommendations for users with unique tastes – even if others don’t like similar items, you still get good
suggestions.
• Able to recommend new & unpopular items – it doesn’t rely on others’ feedback.
• Able to provide explanations – the system can explain why it recommends an item based on its features.
Cons:
• Finding appropriate features is hard: For media like movies, images, or music, it is challenging to determine
what features are most important.
• Recommendations for new users: It struggles with new users who haven’t rated enough items (Cold Start
Problem).
• Overspecialization:
– It only suggests items similar to what the user has liked before, potentially missing out on diverse content.
– It cannot leverage the preferences of other users to make broader recommendations.
7.2 User-User Collaborative Filtering
7.2.1 Pearson Correlation Coefficient
Treating missing ratings as zeros and comparing raw rating vectors (e.g., with Jaccard or cosine similarity) is problematic.
Instead, we use the Pearson correlation coefficient, which compares only the ratings for items rated by both users and
normalizes for user rating bias (e.g., some users rate everything high, others low). Let Sxy be the set of items rated by
both users x and y. Then the Pearson similarity is defined as:
sim(x, y) = Σ_{s∈Sxy} (rxs − r̄x)(rys − r̄y) / ( √(Σ_{s∈Sxy} (rxs − r̄x)²) · √(Σ_{s∈Sxy} (rys − r̄y)²) )
where:
• rxs and rys are the ratings of users x and y for item s
• r̄x and r̄y are the mean ratings of users x and y
Example
Pearson Correlation Example
We have three users (A, B, C) and three items (Item 1, Item 2, Item 3). The ratings are represented in the matrix
below, with NaN indicating a missing rating:
        Item 1   Item 2   Item 3
A       4        3        NaN
B       5        NaN      2
C       NaN      2        4
Step 1: Mean-center each user's ratings (mean of A = 3.5, mean of B = 3.5, mean of C = 3):
r′A,1 = 4 − 3.5 = 0.5,   r′A,2 = 3 − 3.5 = −0.5
r′B,1 = 5 − 3.5 = 1.5,   r′B,3 = 2 − 3.5 = −1.5
r′C,2 = 2 − 3 = −1,      r′C,3 = 4 − 3 = 1
Step 2: Compute the pairwise similarities over the co-rated items:
sim(A, B) = (0.5)(1.5) / ( √(0.5²) · √(1.5²) ) = 1
sim(A, C) = (−0.5)(−1) / ( √(0.5²) · √(1²) ) = 1
sim(B, C) = (−1.5)(1) / ( √(1.5²) · √(1²) ) = −1
Step 3: Interpretation
This measure accounts only for items both users have rated (avoids zero-padding issues) and for differences in user
rating scales (e.g., some users rate more harshly).
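A small sketch of the Pearson similarity restricted to co-rated items, reproducing the example above (NaN marks a missing rating):

import numpy as np

ratings = np.array([
    [4.0,    3.0,    np.nan],   # user A
    [5.0,    np.nan, 2.0],      # user B
    [np.nan, 2.0,    4.0],      # user C
])

def pearson(x, y):
    both = ~np.isnan(x) & ~np.isnan(y)     # items rated by both users
    dx = x[both] - np.nanmean(x)           # center by each user's own mean rating
    dy = y[both] - np.nanmean(y)
    return float(dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy)))

print(pearson(ratings[0], ratings[1]))  # sim(A, B) = 1.0
print(pearson(ratings[1], ratings[2]))  # sim(B, C) = -1.0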
Once the most similar users N of user x have been identified, the unknown rating rxi of item i can be predicted from
their ratings:
• Unweighted average:
rxi = (1/k) · Σ_{y∈N} ryi
• Weighted average using the similarity scores:
rxi = Σ_{y∈N} sxy · ryi / Σ_{y∈N} sxy
• Other enhancements exist (however, they are not explored further in this course).
Example
1. Unweighted Average
For user A, we predict Item 3:  rA,3 = (2 + 4) / 2 = 3
For user B, we predict Item 2:  rB,2 = (3 + 2) / 2 = 2.5
For user C, we predict Item 1:  rC,1 = (4 + 5) / 2 = 4.5
2. Weighted Average Using Similarity Scores (see the Pearson example above)
For user A, predicting Item 3:  rA,3 = (1 · 2 + 1 · 4) / (1 + 1) = 3
For user B, predicting Item 2:  rB,2 = (1 · 3 + (−1) · 2) / (1 + (−1)) = 1 / 0 = undefined
(In this case, the similarities cancel out, so we cannot predict using a weighted average.)
For user C, predicting Item 1:  rC,1 = (1 · 4 + (−1) · 5) / (1 + (−1)) = −1 / 0 = undefined
Key Difference:
Content-based Filtering:
• Looks at the content of items and matches them with the user’s history.
Collaborative Filtering:
• More explorative, can suggest unexpected items based on collective preference patterns.
The same idea can also be applied between items (item-item collaborative filtering): to predict a rating, we look at items
similar to item i that user x has already rated. Concretely, the adjusted weighted formula using similarity scores looks as
follows:
rxi = Σ_{j∈N(i;x)} sij · rxj / Σ_{j∈N(i;x)} sij
where sij is the similarity between items i and j, rxj is the rating of user x on item j, and N(i; x) is the set of items rated by
user x that are similar to item i.
Example: Item-Item CF (|N| = 2)
In this example the goal is to predict the rating of movie 1 for user 5 by finding similar movies (neighbours) that user 5 has
rated, and then using those ratings to estimate the missing one.
1. Neighbourhood Selection: Apply item-item collaborative filtering by looking at other movies that are similar to
movie 1 and have already been rated by user 5. In this case user 5 has rated movies 3, 4, and 6.
2. Compute Similarities Between Items: Use the Pearson correlation as similarity by subtracting the mean rating mi of
each movie i from its ratings; for movie 1, m1 = (1 + 3 + 5 + 5 + 4)/5 = 3.6, so the mean-centered ratings of movie 1 are
row1 = [−2.6, 0, −0.6, 0, 0, 1.4, 0, 1.4, 0, 0.4, 0]. Now compute cosine similarities between the mean-centered rows.
• Similarity between movie 1 and movie 3 = 0.41
• Similarity between movie 1 and movie 6 = 0.59
• Similarity between movie 1 and movie 4 = -0.10
3. Select Neighbours (|N| = 2): Choose movies 3 and 6 as neighbours (highest similarity values).
4. Predict: take the similarity-weighted average of user 5's ratings of the two neighbour movies,
r5,1 = (0.41 · r5,3 + 0.59 · r5,6) / (0.41 + 0.59).
Example
We have a user X with the following ratings for three items (Item 3 is unknown and is to be predicted):
Step 1: Known ratings: rx1 = 5, rx2 = 3.
Step 2: Item-item similarities to Item 3: s13 = 0.92, s23 = 0.90, so
rx3 = (0.92 · 5 + 0.90 · 3) / (0.92 + 0.90)
Step 3: Compute the Prediction
The numerator is 0.92 · 5 + 0.90 · 3 = 4.6 + 2.7 = 7.3 and the denominator is 0.92 + 0.90 = 1.82, so
rx3 = 7.3 / 1.82 ≈ 4.01
Result: The predicted rating for Item 3 for user X is approximately rx3 ≈ 4.01.
Evaluating Predictions
• Root-mean-square error (RMSE): Measures the average squared difference between predicted and actual ratings,
RMSE = √( (1/N) · Σ_{x,i} (rxi − r*xi)² )
where rxi is the predicted rating, r*xi is the true rating of user x on item i, and N is the number of known ratings.
• Precision at top 10: Percentage of relevant items ranked in the top 10 recommendations.
• Rank Correlation: Spearman’s correlation between the system’s predicted ranking and the user’s true item rank-
ing.
• Coverage: Number of items or users for which the system is able to make predictions.
• Receiver Operating Characteristic (ROC): Tradeoff curve showing false positive vs. false negative rates for
different threshold settings.
Problems with Error Measures: Traditional error measures, such as RMSE, often place a narrow emphasis on nu-
merical accuracy, which may overlook the true goals of a recommender system. In practice, we are typically interested
only in predicting high ratings — items a user is likely to enjoy. An algorithm that performs well in identifying such high
ratings but poorly elsewhere might be unfairly penalized by RMSE, even though it delivers useful recommendations.
Complexity
In collaborative filtering, the most computationally expensive step is identifying the k most similar users (or items) for
each prediction.
• The complexity of finding the k nearest neighbours of a user is O(|X|), where X is the set of all users, which is
typically too costly to compute at runtime for large-scale systems.
• One solution is to pre-compute the nearest neighbours for each user or item; however, naive pre-computation has
complexity O(k · |X|) per user. Ways to speed this up include:
• Locality-Sensitive Hashing (LSH): Allows fast approximate nearest-neighbor search in high-dimensional spaces.
• Clustering: Group similar users or items together to limit the search to within a cluster.
• Dimensionality Reduction: Techniques like PCA or SVD can reduce the dimensionality of rating vectors, making
similarity comparisons faster and more meaningful.
Pros:
• Works for any kind of item: No item features are required. This makes item-item collaborative filtering broadly
applicable across domains (movies, books, products, etc.).
• Often outperforms user-user CF: Items are more stable than users — they don’t change behavior or preferences.
Once an item has been rated by many users, its rating pattern becomes stable and more reliable for similarity
comparison.
• Denser data per item: Each item typically receives ratings from many users, while users only rate a small subset
of items. This leads to more complete item profiles and makes it easier to compute item-item similarity than
user-user similarity.
Cons:
• Cold start problem: Newly added items cannot be recommended until they have been rated by enough users to
establish meaningful similarity.
• Sparsity: The overall rating matrix is sparse — it can still be difficult to find overlapping ratings between items,
especially in large catalogs.
• First-rater problem: Items that have not yet been rated cannot be recommended. This is especially problematic
for new or niche items.
• Popularity bias: The system tends to favor popular items with many ratings. Users with unique or niche prefer-
ences may receive less relevant recommendations.
7.3 Latent Factor Models
BellKor Recommender System
Definition
BellKor Recommender System: The BellKor Recommender System was one of the leading solutions in the
Netflix Prize competition. It uses a multi-scale modeling approach, combining techniques at different levels of
abstraction to produce highly accurate recommendations.
• Global Effects: Captures overall trends and biases in the data. For example, some users consistently rate
more generously or harshly than average, and some items (like blockbusters) receive consistently higher
ratings than others. This is modeled using a simple baseline estimate
bui = µ + bu + bi
where µ is the overall mean rating, bu the bias of user u, and bi the bias of item i.
• Factorization (Latent Factors): Addresses intermediate or “regional” effects. This is based on matrix
factorization models, also called latent factor models, which capture hidden dimensions of user prefer-
ences and item characteristics:
rui ≈ µ + bu + bi + ⟨pu , qi ⟩
where pu is the latent factor vector of user u, qi the latent factor vector of item i, and ⟨pu , qi ⟩ their dot
product (the user-item interaction term); µ, bu , and bi are the baseline terms from above.
• Collaborative Filtering (Local Patterns): After accounting for global biases and latent factors, the system
applies memory-based collaborative filtering to capture local deviations. This helps refine predictions by
modeling specific patterns of agreement between small groups of users or items.
Traditional similarity-based collaborative filtering methods rely on fixed similarity measures such as cosine similarity or
Pearson correlation. However, these approaches have several limitations:
• Similarity measures are arbitrary: They are hand-crafted and not learned from the data.
• Pairwise similarities ignore context: They treat each pair of items (or users) independently, neglecting global
interdependencies.
• Averaging limits expressiveness: Using a weighted average of neighbor ratings restricts the model’s flexibility:
r̂xi = Σ_{j∈N(i)} sij · rxj / Σ_{j∈N(i)} sij
Interpolation Weights wij
Our ultimate goal is to make good recommendations — i.e., recommend items a user will likely enjoy. The problem
is that we don’t have ground truth to directly optimize on, since the user hasn’t seen the new items yet. Therefore,
we take a practical workaround: rather than using a simple weighted average of ratings based on arbitrary similarity
scores, we use a weighted sum with learned weights to improve prediction accuracy:
  r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
• bxi is the baseline estimate for user x’s rating of item i (e.g., µ + bx + bi )
• wij is the interpolation weight capturing how informative item j is for predicting item i
• N (i; x) is the set of items rated by user x that are most similar to item i
• We aim to minimize prediction error on the training data (i.e., known ratings). This is typically done using Root
Mean Squared Error (RMSE), or equivalently, the Sum of Squared Errors (SSE):
  SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²
• Substituting in the predicted rating formula that uses interpolation weights, the loss function (optimization prob-
lem) becomes:
  J(w) = Σ_{x,i} ( [ b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj) ] − r_xi )²

where the bracketed term is the predicted rating and r_xi is the true rating.
• The weights wij are then learned by minimizing J(w) over the training data. This optimization captures how
influential item j ’s rating is in predicting item i’s rating, specifically for user x.
• Importantly, wij is not hand-defined like cosine or Pearson similarity — it is estimated from data, based on the
interactions of item i, its neighbors j , and the users who rated them.
• The assumption is that if we find weights that explain known ratings well, they will also generalize to unseen
ratings, which aligns with standard machine learning principles.
Now, the task is to solve the optimization problem using Gradient Descent. This means one fixes item i and
iterates over all ratings r_xj for every item j ∈ N (i; x) (computing the gradient) until convergence: w ← w − α ∇_w J, where α is
the learning rate, and ∇_w J is the gradient of the loss function J with respect to the weights w, evaluated on the training
data.
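As a concrete illustration (not the actual BellKor implementation), here is a minimal Python sketch of learning the interpolation weights w_ij for a single item i by gradient descent; the toy ratings, the constant baseline, and the fixed neighborhood N(i; x) are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical toy data: ratings r[x][item], a constant baseline b_xi,
# and a neighborhood N(i; x) that is the same two items for every user.
ratings = {  # user -> {item: rating}
    "u1": {"i": 4.0, "j1": 5.0, "j2": 3.0},
    "u2": {"i": 2.0, "j1": 3.0, "j2": 1.0},
    "u3": {"i": 5.0, "j1": 4.0, "j2": 4.0},
}
baseline = 3.0             # simplistic constant baseline b_xi (assumption)
neighbors = ["j1", "j2"]   # N(i; x), assumed identical for all users here

w = np.zeros(len(neighbors))   # interpolation weights w_ij, learned from data
alpha = 0.01                   # learning rate

for epoch in range(200):
    grad = np.zeros_like(w)
    for user, r in ratings.items():
        dev = np.array([r[j] - baseline for j in neighbors])  # r_xj - b_xj
        err = (baseline + w @ dev) - r["i"]                   # r_hat_xi - r_xi
        grad += 2 * err * dev                                 # gradient of squared error
    w -= alpha * grad                                         # gradient descent step

print("learned weights w_ij:", w)
```

The weights that come out are whatever best explains the known ratings of item i, which is exactly the contrast to a hand-picked cosine or Pearson similarity.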
This layer of the BellKor recommender system captures fine-grained, local interactions, and removes reliance on
arbitrary similarity metrics like cosine or Pearson. It answers: “How important is item j ’s rating for predicting a rating for
item i?”
Latent Factor Models
Still thinking about the Netflix Prize: for the latent factor model we apply an SVD-style factorization to the Netflix data.
We want to approximate the rating matrix R as a product of “thin” matrices, concretely R ≈ QPᵀ. Even though R has
missing entries, we ignore that: we only want the reconstruction error to be small on the known ratings and do not care
about the missing ones.
[Figure: R factored into a tall, thin Q and a short, wide Pᵀ — omitted]
Here R is our input matrix (A in the original SVD), Q holds our left singular vectors (U in the original SVD), and Pᵀ
combines the singular values and right singular vectors (ΣVᵀ in the original SVD). To estimate the missing rating of
user x for item i, we take the dot product of the corresponding latent vectors: r̂_xi = q_i · p_xᵀ.
Note: SVD is not defined when entries are missing; this is a problem for the sparse Netflix rating data. Instead of
using classical SVD (which requires a fully observed matrix), we use a low-rank
matrix factorization model designed to work with missing data.
• P , Q are learned directly from data using optimization methods such as gradient descent.
• The vectors px and qi serve as latent embeddings, capturing abstract user and item traits (e.g., genre affinity,
rating behavior).
• This approach became the most widely used and successful method in the Netflix Prize competition.
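A minimal sketch of such a factorization, assuming plain SGD on the known ratings only and omitting the regularization that production systems would add; the rating triples, factor count k, and learning rate are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known ratings as (user, item, rating) triples; all other entries are missing.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2     # k = number of latent factors (assumption)

P = 0.1 * rng.standard_normal((n_users, k))  # user factor vectors p_x
Q = 0.1 * rng.standard_normal((n_items, k))  # item factor vectors q_i
alpha = 0.02                                 # learning rate

for epoch in range(500):
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]        # error on a known rating only
        px = P[x].copy()
        P[x] += alpha * err * Q[i]   # SGD updates (regularization omitted)
        Q[i] += alpha * err * px

for x, i, r in ratings:
    print(f"user {x}, item {i}: true {r:.1f}, predicted {Q[i] @ P[x]:.2f}")
```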
8 PageRank
8.1 Introduction
There are many types of Graphs:
• Social Networks Ex.: Facebook — Represents connections between users where friendships or interactions are
modeled as edges.
• Media Networks Ex.: Connections between political blogs — Visualizes how blogs of similar political opinions link
to each other, forming clusters.
• Information Networks Ex.: Citation networks — Displays how scientific papers reference each other, indicating
the flow of knowledge.
• Communication Networks Ex.: Internet — Shows routers and computers connected through data links, forming
the backbone of global communication.
• Technological Networks Ex.: Seven Bridges of Königsberg — An early example of graph theory where Euler
studied pathways across bridges.
Definition
Web as directed Graph
• Nodes: Webpages
• Edges: Hyperlinks
Approaches:
Links as Votes
• A link to a page counts as a vote for that page’s importance; in-links from more important pages count as stronger votes.
• This creates a recursive formulation: important pages link to other important pages.
Rank Definition
Definition
• The rank r_j of a page j is the sum of the ranks of the pages linking to it, each divided by its
  out-degree (number of outgoing links):

  r_j = Σ_{i→j} r_i / d_i

• Here: d_i is the out-degree of page i, and the sum runs over all pages i that link to page j.
Example
Example Setup:
• Three pages: A, B, C
• Links: A → B, A → C, B → C, C → A
• Initial ranks: r_A = 1.0, r_B = 1.0, r_C = 1.0
• Out-degrees: d_A = 2, d_B = 1, d_C = 1

PageRank Calculation (first iteration):
• Page A: r_A = r_C / d_C = 1.0 / 1 = 1.0
• Page B: r_B = r_A / d_A = 1.0 / 2 = 0.5
• Page C: r_C = r_A / d_A + r_B / d_B = 0.5 + 1.0 = 1.5

PageRank Values after this iteration:
• r_A = 1.0
• r_B = 0.5
• r_C = 1.5

Intuition:
• Page C is more important, as it receives links from both A and B.
• Page B is less important, since it only receives half the rank of A.
Example
Solving the Equations: the flow equations of the example (r_A = r_C / d_C, r_B = r_A / d_A, r_C = r_A / d_A + r_B / d_B)
can be solved directly as a linear system.
Efficiency Improvement
Gaussian elimination works for such small examples, but we need a better method for large, web-sized graphs.
Definition
Power Iteration Method
• Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.
Definitions:
• M is the column-stochastic link matrix: M_ji = 1/d_i if page i links to page j, and 0 otherwise.
• r is the rank vector; r_i is the importance score of page i, with Σ_i r_i = 1.
• Initialize r^(0) = [1/N, …, 1/N]ᵀ, iterate r^(t+1) = M · r^(t), and stop when |r^(t+1) − r^(t)|₁ < ε.
Example
Graph Setup:
• Pages: A, B, C
• Links: A → B, B → C, C → A

First Iteration:

  r^(1) = M · r^(0) = [ 0 0 1 ; 1 0 0 ; 0 1 0 ] · [ 1/3, 1/3, 1/3 ]ᵀ = [ 1/3, 1/3, 1/3 ]ᵀ

Convergence Criterion: iterate until |r^(t+1) − r^(t)|₁ < ε. Here r^(1) = r^(0), so the rank vector has already converged to [1/3, 1/3, 1/3]ᵀ.
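A minimal Python sketch of power iteration on this three-page graph; the iteration cap and tolerance are arbitrary choices, not values from the lecture.

```python
import numpy as np

# Column-stochastic link matrix M for the graph A -> B, B -> C, C -> A
# (column = source page, M[j, i] = 1/d_i if page i links to page j).
M = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

r = np.full(3, 1.0 / 3.0)                # r^(0): uniform initial ranks
for _ in range(100):
    r_new = M @ r                        # r^(t+1) = M * r^(t)
    if np.abs(r_new - r).sum() < 1e-9:   # L1 convergence check
        break
    r = r_new

print("PageRank vector:", r)   # stays at [1/3, 1/3, 1/3] for this 3-cycle
```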
8.4 Problems with PageRank
The simple flow formulation runs into two problems:
1. Dead Ends
2. Spider Traps
Dead Ends
• A dead end is a page with no out-links, so the random walk has nowhere to go.
• This causes the PageRank calculation to leak importance out of the graph and prevents the algorithm from converging.
Spider Traps
• A spider trap is a group of pages whose out-links all point back into the group.
• If a surfer enters a spider trap, they are stuck indefinitely, causing these pages to absorb all the PageRank weight.
Definition
Solution to Dead Ends and Spider Traps: Teleports
Google’s solution is to introduce teleports, which work as follows:
• At each step, with probability β the surfer follows a random out-link, and with probability 1 − β the surfer
  teleports to a random page (β is typically around 0.8–0.9).
• At a dead end, the surfer always teleports to a random page.
• This prevents the algorithm from getting stuck at dead ends or spider traps, ensuring smooth navigation
  and proper PageRank distribution.
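A minimal sketch of power iteration with teleports, assuming the common damping value β = 0.85 and a graph whose dead-end columns have already been patched; it is an illustration, not Google's implementation.

```python
import numpy as np

def pagerank_with_teleports(M, beta=0.85, eps=1e-9):
    """Power iteration on A = beta*M + (1-beta)*(1/N): follow a link with
    probability beta, otherwise teleport to a random page. M must be
    column-stochastic; fix dead-end columns beforehand (e.g., set every
    entry of such a column to 1/N) so no importance leaks out."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        r_new = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# Same three-page graph as above; with teleports a spider trap could not
# absorb all the weight, because the surfer escapes with probability 1 - beta.
M = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(pagerank_with_teleports(M))
```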
8.5 Topic Specific PageRank
Definition
In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector
is computed, using the link structure of the Web, to capture the relative ”importance” of Web pages, independent
of any particular search query. Topic-Specific PageRank introduces a bias towards a predefined set of topic-relevant
pages, the teleport set S: instead of teleporting to any page uniformly at random, the random surfer teleports only
to pages in S.
9 Data Streams
Examples and characteristics of streaming data:
• Mastodon Status Updates — Real-time social media updates that arrive continuously.
• Non-Stationary: the data’s nature and distribution can change over time (e.g., trending topics, seasonal
searches).
The Stream Model: In the Stream Model, data elements arrive rapidly at one or more input ports, called ”streams”.
The system cannot store the entire stream because the data volume is too large and arrives too quickly.
Key Question: How do we make important calculations on this endless data flow with only a limited amount of memory?
Side Note: SGD is a Streaming Algorithm Stochastic Gradient Descent (SGD) is a classic example of a streaming
algorithm. In machine learning, this is known as online learning.
What is Online Learning?
Instead of learning from a fixed dataset, the model learns and updates continuously from a stream of new data. This
allows the model to adapt over time as new patterns and data distributions emerge.
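As a toy illustration of online learning, the following sketch fits a one-parameter linear model with SGD on a synthetic, endless stream; the data generator, true slope, and learning rate are invented for the example.

```python
import random

# Minimal sketch of online (streaming) SGD for a 1-D linear model y ≈ w*x.
def stream():
    while True:
        x = random.uniform(-1, 1)
        yield x, 3.0 * x + random.gauss(0, 0.1)   # hypothetical true slope 3.0 plus noise

w, alpha = 0.0, 0.05
for t, (x, y) in enumerate(stream()):
    w -= alpha * 2 * (w * x - y) * x   # one gradient step per arriving element
    if t == 10_000:                    # a streaming model never really "finishes"
        break

print("estimated slope:", round(w, 3))   # close to 3.0
```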
Problems on Data Streams — Types of Queries on a Data Stream:
• Sampling data from a stream – Construct a random sample from the data stream.
• Queries over sliding windows – Count the number of items of type x in the last k elements of the stream.
• Filtering a data stream – Select elements that satisfy property x from the stream.
• Counting distinct elements – Compute the number of distinct elements in the last k elements of the stream.
• Estimating moments – Estimate average, standard deviation, or higher moments of the last k elements.
• Finding frequent elements – Identify elements that appear frequently in the stream.

Applications of Data Streams:
• Mining Query Streams – Google wants to identify which queries are more frequent today than they were yesterday.
• Mining Click Streams – Yahoo wants to monitor which of its pages are receiving an unusual number of hits in the past hour.
• Mining Social Network News Feeds – Detect trending topics on platforms like Mastodon, Bluesky, etc.
• Sensor Networks – Data from many sensors are continuously fed into a central controller for real-time monitoring.
• Telephone Call Records – Data is used for generating customer bills and managing settlements between telephone companies.
• IP Packets Monitored at a Switch – Gather information for optimal routing and detect potential denial-of-service (DoS) attacks.
There are two different sampling problems: (1) sample a fixed proportion of the stream’s elements, and (2) maintain a
random sample of fixed size (Reservoir Sampling, discussed below).
• The fixed-size approach maintains a constant-size sample (e.g., 100 elements) regardless of the total stream size.
• It ensures that every element seen so far has an equal probability of being included.
• This is more representative of the data stream over time, but slightly more complex to implement.
Problem Definition:
• We want to sample a fixed proportion (e.g., 10%) from a large search engine query stream.
• Queries are represented as tuples: (user, query, time).
Naive Solution:
• Sample each query independently with probability 1/10 (e.g., draw a random integer in [0, 9] for every arriving query and keep the query if the integer is 0):

  P(Query is sampled) = 1/10 = 0.1
What Happens with Duplicates?
If a query appears twice (a duplicate), each appearance is sampled independently with a probability of 0.1.
To have both copies of the duplicate in the sample, both independent events must occur. The probability is calculated
as follows:
P (Both duplicates are sampled) = 0.1 × 0.1 = 0.01
The Meaning of 1% (1/100):
This means that only 1 out of 100 pairs of duplicates will be fully captured in the sample. If there are 1000 duplicate
queries in the original data, we can expect approximately:

  1000 / 100 = 10 full pairs
Hence, the naive sampling method drastically underestimates the true number of duplicates in the sampled data.
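A quick simulation confirming this arithmetic: sampling each copy of a duplicated query independently with probability 0.1 keeps both copies only about 1% of the time (the pair count and seed are arbitrary choices).

```python
import random

random.seed(1)
n_pairs, kept_pairs = 100_000, 0
for _ in range(n_pairs):
    # Each copy of a duplicated query is sampled independently with p = 0.1.
    first = random.random() < 0.1
    second = random.random() < 0.1
    kept_pairs += first and second

print(kept_pairs / n_pairs)   # ≈ 0.01: only about 1% of duplicate pairs survive intact
```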
Improved Solution: Sample Users Instead of Queries
• Use a hash function to map each user to one of a fixed number of buckets.
• If the user’s hash falls into a selected bucket, all queries from that user are stored.
• For a 30% sample, hash users into 10 buckets and select a query if its user hashes into the first 3 buckets.
Example
Example: Hashing Technique for Sampling
• Hash each tuple’s user into one of 10 buckets and pick the tuple if the hash value is in the first 3 buckets (i.e., 0, 1, or 2).
• Resulting Sample: we effectively keep approximately 30% of the stream by uniformly hashing users and selecting based on bucket
position.
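A minimal sketch of this user-level sampling with hypothetical users and queries; it hashes the user of each (user, query, time) tuple into 10 buckets and keeps the tuple when the bucket index is below 3.

```python
# Minimal sketch: keep all queries of users whose hash falls into buckets 0-2,
# which yields roughly a 30% user sample.
stream = [
    ("alice", "weather", 1), ("bob", "news", 2), ("alice", "weather", 3),
    ("carol", "python", 4), ("dave", "news", 5), ("bob", "news", 6),
]

def keep(user, buckets=10, selected=3):
    # Built-in hash() is consistent within one run, which is all a streaming
    # filter needs; a cross-run deterministic hash (e.g., hashlib) also works.
    return hash(user) % buckets < selected

sample = [t for t in stream if keep(t[0])]
print(sample)   # duplicate queries of a sampled user survive together
```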
Problem Definition:
• We want to maintain a random sample S of fixed size s over a stream of unknown (possibly infinite) length.
• We cannot store all the elements: keeping the whole stream and sampling afterwards is impractical due to memory limits.
Definition
Reservoir Sampling:
• Store the first s elements of the stream in S.
• Suppose we have seen n − 1 elements, and now the n-th element arrives (n > s):
  – With probability s/n, keep the n-th element; otherwise discard it.
  – If we keep the n-th element, it replaces one of the s elements already in S, picked uniformly at random.
Conclusion: after n elements, the sample S contains each element seen so far with probability s/n.
Example
Stream of elements: A, B, C, D, E, F, G, H, I, J
Sample Size: s = 3
– The first s = 3 elements are stored directly: S = [A, B, C].
– D arrives (n = 4):
  Probability of keeping it is 3/4 = 0.75. Assume it is kept. It replaces one of A, B, C at random; let’s say
  it replaces A, so S = [D, B, C].
– E arrives (n = 5):
  Probability of keeping it is 3/5 = 0.6. Assume it is discarded. S remains unchanged.
– F arrives (n = 6):
  Probability of keeping it is 3/6 = 0.5. Assume it is kept. It randomly replaces an element, let’s say B,
  so S = [D, F, C].
– G arrives (n = 7):
  Probability of keeping it is 3/7 ≈ 0.43. Assume it is discarded.
• Final Sample after 7 elements: S = [D, F, C]
Think of it as a lottery:
Every new element gets a ticket to enter the sample. As the stream grows, the chance of getting a ticket de-
creases. But all elements—both old and new—are treated fairly.
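A minimal sketch of Reservoir Sampling as defined above; the stream and sample size mirror the example, but the concrete sample it prints depends on the random draws.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of fixed size s from a stream."""
    S = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            S.append(item)                 # store the first s elements
        elif random.random() < s / n:      # keep the n-th element w.p. s/n
            S[random.randrange(s)] = item  # it replaces a uniformly chosen slot
    return S

print(reservoir_sample("ABCDEFGHIJ", 3))   # e.g. ['D', 'F', 'C']
```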
9.3 Filtering Data Streams
• But: we do not have enough memory to store all of S (the set of keys to filter on) in a hash table, and we might be
processing millions of filters on the same stream.

9.3.1 Applications of Filtering Data Streams
• Email spam filtering:
  – We know 1 billion ”good” email addresses.
  – If an email comes from one of these, it is not spam.
• Publish-subscribe systems

First Cut Solution: Understanding the Basics
Problem: We have a set of keys S to filter and a data stream of elements to check against S.
9.3.2 First Cut Solution
Example
Solution Steps:
1. Create a Bit Array B:
• We create a bit array B with n bits, all initialized to 0. This array will help us quickly check if an element
  might be in S.

  B = [0, 0, 0, 0, . . . , 0]

2. Choose a Hash Function h:
• A hash function takes an element s and maps it to an index in the bit array B. The output of h(s) is
  always in the range [0, n − 1], where n is the size of B. Example of a hash function: any function that
  spreads the keys roughly uniformly over [0, n − 1], e.g., a modular hash.
3. Insert Elements of S:
• For each element s in S, we compute h(s) and set the bit at that index in B to 1:

  B[h(s)] = 1

4. Check Stream Elements:
• For each stream element a, output a if B[h(a)] = 1, otherwise discard it. Elements of S are never missed,
  but elements outside S may slip through (false positives).
• Example: S = {5, 12, 19}, n = 20. After inserting all elements of S with some hash function h, the bit array
  might look like:

  B = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
9.3.3 Bloom Filters
The First Cut Solution was a simple form of a Bloom Filter with only 1 hash function. A proper Bloom Filter uses
multiple hash functions, spreads its bits better, and reduces false positives significantly.
• Consider: a set S of m keys, a bit array B of n bits, and k independent hash functions h₁, …, h_k.
Initialization:
• All bits in B are initially set to 0.
• Insertion: for each s in S, set B[h_i(s)] = 1 for all i = 1, …, k.
• Lookup: a stream element a is reported as (possibly) in S only if B[h_i(a)] = 1 for all i = 1, …, k.
• False Positives Possible: An element not in S might hash to bits set by other elements.
• False positive probability for different k (with n/m = 8 bits per key):
  – k = 1: P ≈ 0.12 (12%)
  – k = 2: P ≈ 0.05 (5%)
  – k = 6: P ≈ 0.02 (2%) — Optimal value
  – k = 20: P ≈ 0.18 (18%)
• Optimal k value:

  k = (n/m) · ln(2) = 8 · ln(2) ≈ 5.54 ≈ 6
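A minimal sketch of a Bloom filter, assuming k hash functions derived from SHA-256 with different prefixes and roughly 8 bits per key; the email addresses are made up, and a real deployment would use a packed bit array rather than one byte per bit.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: n bits, k hash functions derived from SHA-256."""
    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits)   # one byte per bit, for clarity only

    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # False positives are possible, false negatives are not.
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(n_bits=8 * 1000, k=6)     # ~8 bits per key, k = 6
for addr in (f"user{i}@example.com" for i in range(1000)):
    bf.add(addr)
print(bf.might_contain("user42@example.com"))   # True
print(bf.might_contain("spammer@example.com"))  # usually False (~2% false positives)
```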
9.4 Counting Distinct Elements
The Flajolet-Martin (FM) algorithm estimates the number of distinct elements in a data stream using minimal memory. The core
idea is to use probabilistic hashing and trailing zeros in hash values to estimate unique counts.
• Pick a hash function h that maps each of the N possible elements to at least log₂ N bits.
• For each stream element a, let r(a) be the number of trailing 0s in h(a).
• Record R = max_a r(a) over the stream; the estimated number of distinct elements is 2^R.
Example
Step 1: Choose a Hash Function. We choose a simple hash function h(x) that maps elements to numbers, which
we then read in binary. For this example, suppose the stream contains the elements a, b, c, d with h(a) = 12,
h(b) = 6, h(c) = 8, h(d) = 10.
Step 2: Binary Representation and Counting Trailing Zeros. We convert the values to binary and count
the number of trailing zeros:
  h(a) = 12 → 1100₂, r(a) = 2
  h(b) = 6 → 0110₂, r(b) = 1
  h(c) = 8 → 1000₂, r(c) = 3
  h(d) = 10 → 1010₂, r(d) = 1
Step 3: Finding the Maximum Number of Trailing Zeros. Now we determine the maximum of the observed
r values:
  R = max(2, 1, 3, 1) = 3
Step 4: Estimating the Number of Distinct Elements. We estimate the number of distinct elements in the data
stream using the formula:
  D̂ = 2^R
Since R = 3, we get:
  D̂ = 2³ = 8
Step 5: Conclusion of the Estimation. The algorithm estimates that there are approximately 8 distinct ele-
ments in the data stream. This is an approximation and may vary slightly, but the memory usage is extremely
low.
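A minimal sketch of the Flajolet-Martin estimate with a single hash function (here SHA-256, an arbitrary choice); the toy stream is hypothetical, and in practice many hash functions are combined (e.g., by averaging group-wise maxima) to reduce the variance of the power-of-two estimate.

```python
import hashlib

def trailing_zeros(x):
    """Number of trailing 0 bits in x (treat 0 as having 0 trailing zeros here)."""
    if x == 0:
        return 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Flajolet-Martin with a single hash function: estimate = 2^R."""
    R = 0
    for element in stream:
        h = int(hashlib.sha256(str(element).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = ["a", "b", "c", "d", "a", "b", "a"]   # 4 distinct elements
print(fm_estimate(stream))   # a rough power-of-two estimate, e.g. 4 or 8
```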
10 Learning Objectives
• MapReduce and distributed programming
– Understand the MapReduce paradigm and its three main phases (Map, Group by Key, Reduce)
– Distinguish between MapReduce and Spark (RDDs, DataFrames, Datasets)
– Identify problems suited for MapReduce (sequential data processing, large batch jobs)
– Compute communication and computation costs of MapReduce algorithms
• Association Rule Discovery
– Apply the A-Priori and PCY algorithms for frequent itemset detection; pros and cons
– Compute support, confidence, and interest of association rules
– Understand the market-basket model and its applications
– Analyze memory and I/O costs of different approaches (triangular matrix vs. triples)
• Similarity search and LSH
– Choose the right distance measure for each data type:
* Jaccard distance for sets
* Cosine distance for vectors
* Euclidean distance for numerical data
– Implement Locality Sensitive Hashing (LSH) for efficient similarity search
– Carry out the three steps: Shingling → Min-Hashing → LSH
• Clustering algorithms
– Understand different clustering approaches (Hierarchical, K-Means, BFR, CURE)
– Apply K-Means and choose the optimal number of clusters
– Distinguish between Euclidean and non-Euclidean clustering
• Dimensionality reduction
– Explain the differences between SVD and CUR:
* SVD: produces abstract, dense matrices (U, Σ, Vᵀ)
* CUR: uses actual rows/columns, preserves sparsity, and is more interpretable
– Apply SVD for latent factor models in recommender systems
• Recommender Systems
– Implement content-based and collaborative filtering approaches
– Compute the Pearson correlation for user-user and item-item similarity
– Understand latent factor models and matrix factorization
• PageRank and graph algorithms
– Compute PageRank values with the Power Iteration Method
– Solve problems with dead ends and spider traps via teleportation
– Distinguish between standard and topic-specific PageRank
• Data stream processing
– Implement sampling algorithms (Reservoir Sampling)
– Apply Bloom filters for stream filtering
– Estimate the number of distinct elements with Flajolet-Martin