BDI Summary-4
3 MapReduce
3.1 MapReduce Algorithm
3.2 Spark: Extends MapReduce
3.3 Problems Suited for MapReduce
3.4 Cost Measures of Algorithms
6 Clustering
6.1 Hierarchical Clustering
6.2 K-Means
6.3 The BFR Algorithm
6.3.1 The CURE Algorithm
6.4 Dimensionality Reduction
6.4.1 Rank of a Matrix
6.5 Singular Value Decomposition (SVD)
6.6 CUR Decomposition
7 Recommender Systems
7.1 Content-Based Approach
7.2 User-User Collaborative Filtering
7.2.1 Pearson Correlation Coefficient
7.3 Latent Factor Models
8 PageRank
8.1 Introduction
8.2 PageRank: Flow Formulation
8.3 Google PageRank
8.4 Problems with PageRank
8.5 Topic-Specific PageRank
9.3 Filtering Data Streams
9.3.1 Applications of Filtering Data Streams
9.3.2 First Cut Solution
9.3.3 Bloom Filters
9.4 Counting Distinct Elements
10 Lernziele (Learning Objectives)
1 Introduction
Definition
Data Mining: Given large amounts of data, data mining is about discovering patterns and models in that data. Two broad classes of methods exist:
• Descriptive Methods: Find human-interpretable patterns that describe the data
– Example: Clustering
• Predictive Methods: Use some variables to predict unknown or future values of other variables
Definition
Distributed File System: A distributed file system manages data across multiple networked machines,
ensuring reliability and fault tolerance. Since copying data over a network is time-consuming, a robust storage
mechanism is needed to persist data even in the event of node failures. To achieve this, files are replicated
across multiple nodes for redundancy. These systems are designed to handle large-scale data workloads
(hundreds of GB to TB), where in-place updates are rare, and reads or appends are the primary operations.
Examples of such systems include Google File System (GFS) and Hadoop Distributed File System (HDFS).
To process data in these environments, distributed computing frameworks like MapReduce and Apache Spark
are commonly used.
• Chunk servers: Files are split into contiguous chunks (16–64 MB). Each chunk is replicated (usually 2 or 3
times) and kept on different racks (servers are grouped into racks).
• Master node: It is also called Name Node in HDFS and it stores metadata about where files are stored. Master
nodes are typically more robust to hardware failure and run critical cluster services.
• Client library for file access: Talks to master to find chunk servers and connects directly to chunk servers
to access data.
Typically, a reliable distributed file system keeps data in chunks spread across machines, and each chunk is replicated
on different machines to enable seamless recovery from disk or machine failure. Additionally, the phrase "Bring
computation directly to the data!" means that distributed file systems like HDFS or GFS allow data to be processed
where it is stored, rather than transferring large volumes of data across the network to a central processing unit, which
makes processing far more efficient.
3 MapReduce
Definition
MapReduce: An early distributed programming model designed for easy parallel programming, invisible management of hardware and software failures, and easy management of very-large-scale data.
• It has several implementations, including Hadoop, Spark (used here), Flink, or the original Google implementation
just called MapReduce.
• Group by Key: Sort and shuffle, where the system sorts all the key-value pairs by key, and outputs key-[list
of values] pairs.
Workflow (typically the programmer specifies the map and reduce function and input files)
• All (k′, v′) pairs with a given key k′ are sent to the same Reduce process
Example: Word Counting
Consider a huge text document where we have to count the number of times each distinct word appears in the file.
MapReduce can be performed in parallel; a partitioning function determines which record goes to which reducer. A
sketch of the map and reduce functions follows below.
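As an illustration (not from the original notes), here is a minimal pure-Python simulation of the word-count Map and Reduce functions; the small in-memory dictionary stands in for the framework's sort-and-shuffle step, and the documents are made up:

from collections import defaultdict

def map_fn(document):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts for one word."""
    return (word, sum(counts))

# Simulate the group-by-key (sort and shuffle) step the framework normally performs.
documents = ["big data big ideas", "data mining of big data"]
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)

word_counts = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(sorted(word_counts))  # [('big', 3), ('data', 3), ('ideas', 1), ('mining', 1), ('of', 1)]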
MapReduce: Data Flow
• Input and final output are stored on a distributed file system (FS). (A scheduler tries to schedule map tasks
”close” to physical storage location of input data.)
MapReduce: Environment
• Map worker failure: Map tasks completed or in-progress at worker are reset to idle and rescheduled. Reduce
workers are notified when map task is rescheduled on another worker.
• Reduce worker Failure: Only in-progress tasks are reset to idle and the reduce task is restarted.
Generally, MapReduce does not compose well for large applications (many times, chaining multiple map-reduce
steps is required)...
• Performance Bottlenecks: MapReduce incurs substantial overheads due to data replication, disk I/O, and serial-
ization (saving to disk is typically much slower than in-memory work).
• Implementation Difficulty: It’s difficult to program MapReduce directly. Many big data problems/algorithms are
not easily described as map-reduce.
Definition
Spark: Spark is a data-flow system built around the Resilient Distributed Dataset (RDD) abstraction. Higher-level
APIs such as DataFrames and Datasets were introduced in recent versions of Spark; they organize data in a more
structured way, which also enabled the introduction of SQL support.
• Fast data sharing which avoids saving intermediate results to disk and it caches data for repetitive queries (e.g.
for machine learning)
• Richer functions than just map and reduce and compatible with Hadoop
Definition
Spark Resilient Distributed Dataset (RDD): An RDD is a partitioned collection of records (generalization of
key-value pairs). It is spread across the cluster and read-only. It caches the dataset in memory and there
is a fallback to disk possible. It can be created from Hadoop, or by transforming other RDDs. They are best
suited for applications that apply the same operation to all elements of a dataset.
• Transformations (map, filter, join, union, intersection,..) build RDDs through deterministic operations on other
RDDs.
• Actions (count, collect, reduce, ...) can be applied to RDDs; they force the calculation and return a value or export
data (see the sketch below).
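A minimal PySpark sketch (assuming a local Spark installation; the input file name is a placeholder) of how transformations lazily build new RDDs and an action forces the work:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

# Transformations are lazy: they only describe how to build new RDDs.
lines = sc.textFile("input.txt")                # hypothetical input file
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)  # classic word count as an RDD pipeline

# Actions force the computation and return values to the driver.
print(counts.take(5))
print(counts.count())

sc.stop()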
Higher-level APIs like DataFrames and Datasets are different from RDDs (however, they are built on the Spark SQL
engine, and both can be converted back to an RDD).
Definition
DataFrame: Unlike an RDD, data is organized into named columns (like a table in a relational database). This
imposes a structure onto a distributed collection of data, allowing a higher-level abstraction.
Definition
Dataset: Extension of DataFrame API which provides type-safe, object-oriented programming interface
(compile-time error detection)
Spark vs. Hadoop MapReduce
• Spark can process data in-memory (Hadoop persists back to the disk after a map/reduce action) - it is normally
faster
• Spark is easier to program (higher-level APIs) and is more general in terms of data processing
3.3 Problems Suited for MapReduce
• Suppose there is a large web corpus. MapReduce can be used to find, for each host, the total number of bytes,
i.e. the sum of the page sizes for all URLs from that particular host.
• Other examples include: link analysis, graph processing and machine learning algorithms
• MapReduce can be helpful to count the number of times every 5-word sequence (5-grams) occurs in a large corpus
of documents. In this case map can extract (5-gram, count) from the document and reduce combines the counts.
• Compute the natural join R(A, B) ▷◁ S(B, C), where R and S are stored in files and tuples are pairs (a, b) or (b, c).
When applying MapReduce to the join operation, one uses a hash function h from B-values to 1...k. A Map process
turns each input tuple R(a, b) into the key-value pair (b, (a, R)) and each input tuple S(b, c) into the key-value pair
(b, (c, S)), and sends each key-value pair with key b to Reduce process h(b). Each Reduce process matches all pairs
(b, (a, R)) with all (b, (c, S)) and outputs (a, b, c). (A small simulation of this follows below.)
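A small pure-Python simulation of this join, sketched under the assumption that the shuffle can be modeled by an in-memory dictionary; the relations, the number of reducers k, and the hash function h are made-up placeholders:

from collections import defaultdict

R = [(1, "b1"), (2, "b2")]                 # tuples (a, b) of relation R(A, B)
S = [("b1", 10), ("b1", 20), ("b3", 30)]   # tuples (b, c) of relation S(B, C)
k = 4                                      # number of Reduce processes
h = lambda b: hash(b) % k                  # hash function from B-values to 0..k-1

# Map: tag each tuple with its relation and key it by the B-value.
reducers = defaultdict(list)
for a, b in R:
    reducers[h(b)].append((b, (a, "R")))
for b, c in S:
    reducers[h(b)].append((b, (c, "S")))

# Reduce: within each reducer, match R-tuples and S-tuples that share the same b.
joined = []
for pairs in reducers.values():
    r_side, s_side = defaultdict(list), defaultdict(list)
    for b, (val, tag) in pairs:
        (r_side if tag == "R" else s_side)[b].append(val)
    for b in r_side:
        joined += [(a, b, c) for a in r_side[b] for c in s_side.get(b, [])]

print(joined)  # [(1, 'b1', 10), (1, 'b1', 20)]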
3.4 Cost Measures of Algorithms
• (Elapsed) computational cost: Same as the elapsed communication cost, but counting only the running time of
processes.
(Note: Big-O notation is not well suited here, since more machines can always be added.)
• Dominant Cost: Either I/O (communication) cost (the cost of reading and writing data from storage (e.g., disk,
network transfers)) or processing (computation) cost (the CPU time needed to compute results.) Since one of
these usually dominates, we often ignore the other in cost models.
– Example: In a MapReduce job, if reading/writing data to HDFS is much slower than computation, we focus on
I/O cost. Conversely, for CPU-heavy tasks (like cryptographic operations), computation cost is more critical.
• Total Cost: The overall resources used for computation (e.g., CPU hours, memory, storage, bandwidth). In
cloud computing (e.g., AWS, Google Cloud), total cost translates to monetary cost, so what you pay for running
a job.
• Elapsed Cost: This measures the actual time it takes to complete a job. Parallelism helps reduce elapsed time:
instead of one machine working for 10 hours, 10 machines might complete the task in 1 hour.
• Communication cost: input file size + 2×(sum of the sizes of all files passed from Map processes to Reduce
processes) + sum of the output sizes of the reduce processes.
• Elapsed communication cost: The sum of the largest input + output for any map process, plus the same for any
reduce process
When performing a join operation between relations R and S in a MapReduce setting, the cost is primarily determined
by communication (I/O cost) and computation cost.
• Elapsed Communication Cost O(s): We put a limit s on the amount of input or output that any one process can
have; s could be what fits in main memory or what fits on local disk. The elapsed communication cost is then the
actual wall-clock time spent on communication.
• Computation Cost: With proper indexing, the computation cost is linear in the input and output size. This means
that:
O(Computation) = O(|R| + |S| + |R ▷◁ S|)
which is like the communication cost. Without indexes, a naive join might require expensive operations like sorting
or nested loop joins. With indexes, lookups are efficient, and computational cost is directly tied to the size of the
input and output.
4 Association Rule Discovery
4.1 Introduction: Market-basket Model
The goal of the market-basket model is to identify items that are bought together by sufficiently many customers. One
approach can be by processing the sales data collected with barcode scanners to find dependencies among items. A
classic rule is for example if someone buys diaper and milk, then they are likely to buy beer.
Therefore, we have a large set of items (e.g. things sold in a supermarket) and a large set of baskets (a subset of
items e.g. things one customer buys on one day). The task is to discover association rules like people who bought
{x,y,z} tend to buy {w,v}. More generally it is a many-to-many mapping (association) between two kinds of things
and we are interested in connections among items.
• Items = Products (e.g., milk, bread,...); Basket = Sets of products bought in a single trip. Example: Amazon’s
people who bought X (mouse) also bought Y (computer)
• Items = Documents; Baskets = Sentences. Example: If a particular sentence appears in multiple documents (i.e.,
the same item is found in many baskets), this might represent plagiarism
• Items = Drugs & side-effects; Baskets = Individual patients; Example: By analyzing which drugs and side effects
commonly occur together in patient records, we can discover potential drug interactions.
• Example: Support of {Beer, Bread} = 2 means that this combination appears in two baskets.
Definition of Association Rules
• If-then rule: {i1 , i2 , .., ik } → j means: "if a basket contains all of i1 , ..., ik , then it is likely to contain j". The
confidence of the rule is
conf(I → j) = support(I ∪ {j}) / support(I)
• Interesting rules: We only want to look at interesting rules: for example, the rule X → {milk} has high confidence
but is not interesting, because milk is purchased very often regardless of X. Therefore, we use the interest of a rule,
the absolute difference (to capture both positive and negative associations) between its confidence and the fraction of
baskets that contain j:
Interest(I → j) = |conf(I → j) − Pr[j]|
which we typically want to be above 0.5 (this also means that Pr[j] itself is not too high).
• In the example above the support measures how often the itemset {Milk, Beer, Coke} appears in the dataset.
The confidence measures how often Coke appears given that milk and beer were bought. The interest measures
how much this rule is different from the general probability of Coke appearing.
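As a small illustration (the basket contents below are made up), support, confidence, and interest can be computed directly from a list of baskets:

baskets = [
    {"milk", "beer", "coke"}, {"milk", "beer"}, {"milk", "coke"},
    {"beer", "coke"}, {"milk", "beer", "bread"}, {"coke"},
]
n = len(baskets)

def support(itemset):
    """Number of baskets containing every item of the itemset."""
    return sum(itemset <= b for b in baskets)

I, j = {"milk", "beer"}, "coke"
conf = support(I | {j}) / support(I)        # conf(I -> j)
interest = abs(conf - support({j}) / n)     # |conf(I -> j) - Pr[j]|

print(support(I), conf, interest)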
Association Rule Mining is a fundamental technique in data mining, primarily used to discover relationships (associations)
between items in large datasets (e.g., market basket analysis in retail).
1. Step - Find frequent Itemsets:
• Goal: Find all association rules that satisfy: Support ≥ s (Minimum Support Threshold) and Confidence ≥ c
(Minimum Confidence Threshold) (definition for frequency)
• Difficulty: Finding frequent itemsets (groups of items that appear together often). If {i1 , i2 , . . . , ik } → {j}
has high support and confidence, then both {i1 , i2 , . . . , ik } and {i1 , i2 , . . . , ik , j} will be ”frequent”.
conf(I → j) = support(I ∪ {j}) / support(I)
If a dataset has N unique items, there are 2^N − 1 possible itemsets, which grows exponentially. Consequently,
an efficient way to find only the frequent ones is needed.
2. Step - Generate Rules from Itemsets: For each frequent itemset I , generate rules of the form:
A → I \ A where A ⊂ I
To reduce the number of rules, one can post-process them and only output:
• Maximal Frequent Itemsets: An itemset is maximal if none of its immediate supersets are frequent. This
provides aggressive pruning (skipping of parts of workspace that are unlikely to produce useful results) of the
itemset space.
• Closed Frequent Itemsets: An itemset is closed if no immediate superset has the same support (> 0). This
reduces redundancy while preserving exact supports/counts.
Typically, data is kept in flat files (stored on disk and basket-by-basket) rather than in a database system. Baskets are
small but we have many baskets and many items.
Note: To find frequent itemsets, we have to count them. To count them, we have to enumerate them.
Main-Memory Bottleneck
• In frequent-itemset mining, main memory is a critical resource because algorithms must keep track of item
occurrences while reading transactions.
• The primary cost is disk I/O rather than CPU computation, since accessing disk storage is significantly slower
than in-memory operations.
• Many algorithms process data in multiple passes over the dataset, requiring frequent reads from disk. Thus, the
cost is measured in terms of the number of disk passes.
• Memory limitation: The number of different item counts we can store is constrained by available main memory. If
the dataset is large, it may exceed memory capacity.
• Swapping data between memory and disk is inefficient, as it significantly increases processing time. Efficient
algorithms aim to minimize disk I/O and avoid excessive memory swapping.
4.3.1 Finding Frequent Pairs
Finding frequent pairs of items {i1 , i2 } is actually the hardest problem, because pairs are common and frequent
whereas triples are rare (the probability of being frequent drops exponentially with size).
The scenario is the following: we aim to identify frequent pairs and for that we will enumerate all pairs of items. But,
rather than keeping a count for every pair, we hope to discard a lot of pairs and only keep track of the ones that will in
the end turn out to be frequent. One very simple approach to find frequent pairs is the following algorithm:
• Naive Algorithm: This is a simple brute-force approach to identify frequent item pairs in large datasets, such as
market baskets.
Approach:
1. Read the dataset (baskets) once, counting in memory the occurrence of each pair. From each basket b of nb
items, generate its nb (nb − 1)/2 pairs by two nested loops.
2. Use an appropriate data structure to keep track of counts of every pair:
– Approach 1: A triangular matrix where rows and columns represent items. Each pair of items (i, j) with i < j is
stored in a fixed-size triangular array, requiring 4 bytes per pair. The total number of possible pairs is n(n−1)/2,
leading to a total memory usage of O(n²) bytes. This approach becomes infeasible for large datasets
because all possible pairs are stored, even those that never appear in any basket. (See the indexing sketch
after this list.)
– Approach 2: A hash table that stores counts as triples (item1 , item2 , count).
* Instead of storing all possible pairs, we only keep track of pairs that actually appear in the data.
* Each stored pair requires 12 bytes per occurring pair (4 bytes each for two item IDs + 4 bytes for
the count), plus some additional memory for hash table overhead.
* This approach outperforms Approach 1 if fewer than 1/3 of all possible pairs actually occur, as memory
usage is then significantly reduced.
3. At the end of the scan, identify which pairs have high enough support (i.e. appear frequently).
Approach 2 beats Approach 1 if fewer than 1/3 of the possible pairs actually occur.
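A sketch of the triangular-array indexing behind Approach 1, assuming items have been renumbered 0..n−1 (the exact index formula used in the lecture may differ):

n = 5                                   # number of distinct items, renumbered 0..n-1
counts = [0] * (n * (n - 1) // 2)       # one counter per unordered pair {i, j}, i < j

def pair_index(i, j, n):
    """Map a pair (i, j) with i < j to a unique slot in the 1-D triangular array."""
    return i * (2 * n - i - 1) // 2 + (j - i - 1)

counts[pair_index(1, 3, n)] += 1        # count one occurrence of the pair {1, 3}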
Problem: When we have too many items, so all the pairs do not fit into memory (even for the hash-based approach
the hash table gets too large for memory). Hence, another approach is needed!
• A-Priori Algorithm: This algorithm limits the need for main memory through the key idea of monotonicity (if a
set of items I appears at least s times, so does every subset J of I) and its contrapositive for pairs (if item i does
not appear in at least s baskets, then no pair including i can appear in s or more baskets).
General Approach:
– Pass 1: Start with k = 1 (individual items). Read the baskets and count in main memory the number of
occurrences of each individual item (requires memory only proportional to the number of items). Items that
appear at least s times are the frequent items.
– Pass 2: Read the baskets again and keep counts only for those pairs in which both elements are frequent
(from pass 1). This requires memory proportional to the square of the number of frequent items, not the
square of the total number of items. Repeat the scheme for larger k until no new frequent itemsets are found
(a sketch of the two passes follows below).
Figure 8: Main-Memory: Picture of A-Priori
– Ck = candidate k-tuples = those that might be frequent sets (support ≥ s), based on information from the
pass for k − 1
– Lk = the set of truly frequent k-tuples
– Consider the candidate set C1 = {{b}, {c}, {j}, {m}, {n}, {p}}; count the support and prune items below the
minimum support threshold. As a result we get the frequent 1-itemsets L1 = {b, c, j, m} (assuming {n} and
{p} were not frequent enough).
– Form pairs from the frequent 1-itemsets L1 : C2 = {{b, c}, {b, j}, {b, m}, {c, j}, {c, m}, {j, m}}; count the
support of each pair and prune the non-frequent ones. As a result, one gets the frequent 2-itemsets L2 =
{{b, c}, {b, m}, {c, j}, {c, m}} (assuming {j, m} was infrequent and got pruned).
– Form triples from the frequent 2-itemsets L2 : C3 = {{b, c, m}, {b, c, j}, {c, m, j}}; count the support of each
and prune infrequent triples. As a result we get the frequent 3-itemsets L3 = {{b, c, m}} (assuming {b, c, j}
and {c, m, j} were infrequent and got pruned).
One pass for each k (itemset size) needs space in main memory to count each candidate k-tuple. For typical
market-basket data and reasonable support (e.g. 1%), k = 2 requires the most memory.
Note: We generate new candidates by generating Ck from Lk−1 and L1 . But one can be more careful with
candidate generation. For example, in C3 we know {b, m, j} cannot be frequent since {m, j} is not frequent.
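A compact pure-Python sketch of the two A-Priori passes for pairs (k = 2); the toy baskets and the support threshold are made up, and with these values the result happens to match the L2 of the example above:

from collections import Counter
from itertools import combinations

baskets = [{"b", "c", "m"}, {"b", "c", "j"}, {"b", "m"}, {"c", "j", "m"}]
s = 2  # support threshold

# Pass 1: count individual items, keep the frequent ones.
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {i for i, c in item_counts.items() if c >= s}

# Pass 2: count only pairs whose two elements are both frequent.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket & frequent_items), 2):
        pair_counts[pair] += 1

L2 = {pair for pair, c in pair_counts.items() if c >= s}
print(L2)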
• PCY Algorithm (Park-Chen-Yu): This algorithm is an improvement over A-Priori. The problem with A-Priori is that
during pass 1 most of main memory is idle; this unused memory can be exploited to reduce the memory required in pass 2.
General Approach:
– Pass 1: In addition to item counts, maintain a hash table with as many buckets/elements as fit in memory.
Keep a count for each bucket into which pairs of items are hashed. For each bucket, just keep the count, not
the actual pairs that hash to the bucket.
from itertools import combinations

# counts is assumed to be a Counter/defaultdict(int); buckets is a list of num_buckets zeros
for basket in baskets:
    for item in basket:
        counts[item] += 1
    # new step for PCY: hash every pair of items in the basket to a bucket and count the bucket
    for p, q in combinations(sorted(basket), 2):
        buckets[hash((p, q)) % num_buckets] += 1
• Pass 2: Only count pairs that hash to frequent buckets. For that, the bucket counts are replaced by a bit-vector,
where 1 means the bucket count exceeded the support s (→ call it a frequent bucket) and 0 means it did not. The
4-byte integer counts are replaced by bits, so the bit-vector requires only 1/32 of the memory. Also, decide which
items are frequent and list them for the second pass.
Then count all pairs {i, j} that meet both conditions for being a candidate pair:
– i and j are both frequent items, and
– the pair {i, j} hashes to a bucket whose bit is 1 (a frequent bucket).
Both conditions are necessary for the pair to have a chance of being frequent.
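A self-contained sketch of both PCY passes with the bitmap step in between (the baskets, support threshold, and bucket count are placeholders; the modulo-based bucketing matches the pass-1 snippet above):

from collections import Counter
from itertools import combinations

baskets = [{"b", "c", "m"}, {"b", "c", "j"}, {"b", "m"}, {"c", "j", "m"}]
s, num_buckets = 2, 8

# Pass 1: item counts plus hashed bucket counts.
counts, buckets = Counter(), [0] * num_buckets
for basket in baskets:
    counts.update(basket)
    for p, q in combinations(sorted(basket), 2):
        buckets[hash((p, q)) % num_buckets] += 1

# Between the passes: compress bucket counts into a bitmap and list the frequent items.
bitmap = [int(c >= s) for c in buckets]
frequent_items = {i for i, c in counts.items() if c >= s}

# Pass 2: count a pair only if both items are frequent AND it hashes to a frequent bucket.
pair_counts = Counter()
for basket in baskets:
    for p, q in combinations(sorted(basket), 2):
        if p in frequent_items and q in frequent_items and bitmap[hash((p, q)) % num_buckets]:
            pair_counts[(p, q)] += 1

print({pair for pair, c in pair_counts.items() if c >= s})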
Note on the above graphic: Buckets require only a few bytes each (we do not have to count past s), and the number of
buckets is O(main-memory size).
On the second pass, a table of (item, item, count) triples is essential (we cannot use the triangular-matrix approach).
Thus, the hash table must eliminate approximately 2/3 of the candidate pairs for PCY to beat A-Priori.
The random sampling algorithm takes a random sample of the market baskets and runs A-Priori or one of its improvements
in main memory, so we don't pay for disk I/O each time we increase the size of the itemsets. Reduce the support
threshold proportionally to match the sample size (example: if the sample size is 1/100 of the baskets, use s/100 as
the support threshold instead of s).
• To avoid false positives: Optionally, verify that the candidate pairs are truly frequent in the entire data set by a
second pass
• But you don’t catch sets that are frequent in the whole data but not in the sample:
– Smaller threshold, e.g. s/125, helps catch more truly frequent itemsets
– But requires more space - SON algorithm tries to deal with this!
The SON algorithm is a 2-pass algorithm: it repeatedly reads small subsets of the baskets into main memory and
runs an in-memory algorithm to find all frequent itemsets. (Note: we are not sampling, but processing the entire file in
memory-sized chunks.) An itemset becomes a candidate if it is found to be frequent in one or more subsets of the
baskets.
• On a second pass, count all the candidate itemsets and determine which are frequent in the entire set.
• Key “monotonicity” idea: An itemset cannot be frequent in the entire dataset unless it is frequent in at least one
subset (pigeonhole principle)
• However, with the simple sampling algorithm we still don't know whether we found all frequent itemsets (an itemset
may be infrequent in the sample but frequent overall). Toivonen's Algorithm addresses this by adding a negative
border to the candidates from the sample:
• Pass 1:
– Start with the random sample, but lower the threshold slightly for the sample: Example: If the sample is 1 %
of the baskets, use s/125 as the support threshold rather than s/100.
– Find frequent itemsets in the sample
– Add the negative border (an itemset is in the negative border if it is not frequent in the sample, but all of its
immediate subsets are) to the itemsets that are frequent in the sample.
• Pass 2: Count all candidate frequent itemsets from the first pass, and also count sets in their negative border.
• If no itemset from the negative border turns out to be frequent, then we have found all the frequent itemsets.
If something in the negative border is frequent, we must start over again with another sample.
Try to choose the support threshold so the probability of failure is low, while the number of itemsets checked on
the second pass fits in main-memory.
• Theorem: If there is an itemset S that is frequent in the full data, but not frequent in the sample, then the negative
border contains at least one itemset that is frequent in the full data.
• Proof by contradiction (sketch): Suppose no itemset in the negative border is frequent in the full data. Let T be a
smallest itemset that is frequent in the full data but not frequent in the sample. Every immediate subset of T is frequent
in the full data (monotonicity), and by the minimality of T each of these subsets is also frequent in the sample. Hence T
is not frequent in the sample while all of its immediate subsets are, so T lies in the negative border; but T is frequent in
the full data, contradicting the assumption.
5 High Dimensional Data
5.1 Locality Sensitive Hashing
Definition
Locality Sensitive Hashing (LSH): A technique used to efficiently find similar items in large datasets (sim-
ilarity search). Many problems can be expressed as finding similar sets/ near neighbors in high-dimensional
data. For example: pages with similar words (duplicate detection), customers who purchased similar products
(products with similar customer sets), or users who visited similar websites (websites with similar user sets).
• The Scene Completion Problem refers to the task of filling in missing parts of an image by searching for similar
regions from a large image database. The idea is to replace missing parts with visually coherent patches from
other images. Considering a dataset consisting of 2 million images, an algorithm searches for the 10 most similar
patches (10 nearest neighbors) from the dataset. A blending algorithm integrates the best-matching patch into the
original image, making the transition seamless.
• Given a set of data points and some distance function d(x1 , x2 ) to quantify the "distance" between x1 and x2 , the
goal is to find all pairs of data points (xi , xj ) that are within some distance threshold, d(xi , xj ) ≤ s (hence, that are similar).
Before finding similar items, one first needs to define what distance means. For example the Jaccard distance/simi-
larity exists.
• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
sim(C1 , C2 ) = |C1 ∩ C2 | / |C1 ∪ C2 |
• Jaccard distance:
d(C1 , C2 ) = 1 − |C1 ∩ C2 | / |C1 ∪ C2 |
• Example: if there are 3 items in the intersection and 8 in the union, the Jaccard similarity is 3/8 and the Jaccard
distance is 5/8.
Task: Finding similar documents
The goal is to find near-duplicate pairs, given a large number (N ∼ millions/billions) of documents. A common application
is detecting mirror websites (sites that are exact copies of another site, just hosted under a different URL) or approximate
mirrors, where we don't want to show both in search results. Another application is clustering similar news articles by
"same story".
For the application of finding similar documents it is not enough to treat a document as a set of (key)words, because
that ignores word order and therefore loses context. A better approach is shingles (n-grams).
Definition
Shingle: A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the docu-
ment. Tokens can be characters, words, or something else, depending on the application.
Example: We assume the tokens are characters, take k = 2, and document d1 = "abcab". Then the set of 2-shingles is
S(D1 ) = {ab, bc, ca}. If we took a bag (multiset) instead, "ab" would be counted twice: S ′ (D1 ) = {ab, bc, ca, ab}.
Solution: We can compress shingles by hashing them to a smaller representation like 4-byte integer (32-bit
hash). For example: h(ab) = 1, h(bc) = 5, h(ca) = 7, then h(D1 ) = {1, 5, 7}. This approach is more efficient and
uses less space and memory.
Measure Similarity using Jaccard Similarity Measure: A document is a set of k-shingles and can be repre-
sented as C1 = S(D1 ). Using bit vector encoding it becomes a set of 0/1 vectors, where each unique element
in the universal set gets a position (dimension) in the bit vector. So, if the entire dataset has unique shingles
{ab, ba, cd, da} for document D1 this means a binary vector of v1 = [1, 1, 0, 0]. These vectors tend to be very
sparse because the total number of possible shingles is usually very large but most documents contain only a
small subset of them. Therefore, an appropriate similarity measure is the Jaccard similarity.
Note: Documents that have lots of shingles in common have similar text, even if the text appears in different order.
The difficulty is to choose the right k to find truly similar documents. k = 5 is OK for short documents, but k = 10
is better for long documents.
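A small sketch of character-level k-shingling and the Jaccard similarity of the resulting shingle sets (hashing shingles to 4-byte integers is omitted; the documents are made up):

def shingles(doc: str, k: int = 2) -> set[str]:
    """Set of all character k-shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

s1, s2 = shingles("abcab"), shingles("abcb")
print(s1)                 # {'ab', 'bc', 'ca'}
print(jaccard(s1, s2))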
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity!
Problem: Using the previously established bit-vector encoding with N = 1 million documents would mean a huge
boolean matrix whose columns are the documents and whose rows are the shingles. Computing unions and intersections
explicitly for all pairs of columns would be very inefficient, taking anywhere from days to years.
Definition
Signatures: Short integer vectors that represent the sets and reflect their similarity. In this case the
signature is a small hash of a column so that it fits in memory. The goal is to find a hash function such
that:
The hash values are then mapped to specific buckets (containers). Documents with the same or similar
hash values are placed in the same bucket.
The hash function clearly depends on the similarity metric (not all similarity matrics have a suitable
hash function), which is in this case Jaccard similarity. The suitable hash function for Jaccard similarity is
Min-Hashing.
Min-Hashing is used to efficiently estimate Jaccard similarity between large sets. The main idea is to:
(a) Permute the Rows Randomly: Apply a random permutation π to the rows of the boolean matrix.
(b) Hash Function Definition: hπ (C) = index of the first row (in the permuted order) where column C has a value of 1.
In other words, hπ (C) gives the position of the first ”1” in the permuted order for column C .
(c) Apply several independent permutations: Since a single permutation may not give accurate similarity
estimates, several independent permutations (e.g., 100 different hash functions) have to be used to generate
100 different hash values (or signatures) for each column (document). The signature vector for a document
then looks like: Signature(C) = [hπ1 (C), hπ2 (C), . . . , hπ100 (C)]. The similarity between signatures becomes
an average similarity over all permutations.
This is why comparing signatures directly is an accurate approximation of comparing the original sets
(this can be proven but is out of scope). This is crucial because signatures are much smaller, making similarity
computation significantly faster and more space-efficient.
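A numpy sketch of Min-Hashing with explicitly materialized random permutations (real implementations usually simulate permutations with random hash functions; the small shingle-document matrix below is made up):

import numpy as np

rng = np.random.default_rng(0)

# Boolean shingle-document matrix: rows = shingles, columns = documents.
M = np.array([[1, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0],
              [1, 0, 1, 0]])

n_rows, n_docs, n_perm = M.shape[0], M.shape[1], 100

signatures = np.zeros((n_perm, n_docs), dtype=int)
for p in range(n_perm):
    order = rng.permutation(n_rows)          # random permutation of the rows
    permuted = M[order]
    signatures[p] = permuted.argmax(axis=0)  # index of the first 1 in each permuted column

# The fraction of agreeing signature rows estimates the Jaccard similarity of two columns.
est = (signatures[:, 0] == signatures[:, 2]).mean()
true = (M[:, 0] & M[:, 2]).sum() / (M[:, 0] | M[:, 2]).sum()
print(est, true)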
3. Locality-Sensitive Hashing (LSH): Focus on identifying pairs of signatures that are likely to come from
similar documents (candidate pairs).
Goal: Efficiently find document pairs whose Jaccard similarity is at least s (for some similarity threshold, e.g.,
s = 0.8). LSH leverages the idea of using a function f (x, y) that indicates whether x and y form a candidate pair
(i.e., a pair of elements whose similarity needs to be evaluated).
Candidate pairs from Min-Hash matrices are generated by hashing columns of the signature matrix M into
multiple buckets. Each pair of documents that hashes to the same bucket is considered a candidate pair. Since
the columns of the signature matrix M are hashed multiple times, similar columns are likely to hash into the same
bucket with high probability. Given the similarity threshold s such that 0 ≤ s ≤ 1, columns x and y of the
signature matrix M are considered a candidate pair if their signatures agree on at least a fraction s of their rows.
• Logic: Divide matrix M into b bands of r rows. For each band, hash its portion of each column to a hash
table with k buckets. A pair that matches in 1 or more band becomes a candidate pair. If two columns are
similar, they are likely to match in many bands. If two columns are dissimilar, they are unlikely to match in any
band. Tune b and r to catch most similar pairs, but few non-similar pairs.
Tradeoff in LSH: There is an inherent tradeoff between false positives (detected as similar but not similar)
and false negatives (not detected as similar but similar) when tuning the parameters:
The challenge is to find the right balance between detecting similar pairs (true positives) and avoiding false
detections (false positives). For example, if we used only 15 bands of 5 rows, the number of false positives would go
down, but the number of false negatives would go up.
The accompanying plots (omitted here) show what we want (an ideal step function at the similarity threshold), what 1
band of 1 row gives (probability of becoming a candidate = similarity), and the S-curve obtained by picking r and b to
get the best shape, e.g., 50 hash functions with r = 5 and b = 10.
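The S-curve behind these plots comes from the standard banding analysis: two columns with Jaccard similarity t agree in one particular band of r rows with probability t^r, so they become a candidate pair in at least one of the b bands with probability 1 − (1 − t^r)^b. A quick numerical check with the parameter values quoted above:

def candidate_prob(t: float, r: int, b: int) -> float:
    """Probability that a pair with similarity t becomes a candidate pair."""
    return 1 - (1 - t ** r) ** b

for t in (0.2, 0.4, 0.6, 0.8):
    print(t, round(candidate_prob(t, r=5, b=10), 3))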
LSH Summary: Tune the number of Min-Hash functions, b, and r so that most truly similar pairs become candidates
while few dissimilar pairs do; then check in main memory that candidate pairs really do have similar signatures, and
optionally verify the candidates against the original documents.
6 Clustering
Goal: Given a set of points, with a notion of distance between points, group the points into some number of clusters,
such that
• Members of the same cluster are close (i.e., similar) to each other
• Members of different clusters are far apart (i.e., dissimilar)
• Usually: points are in a high-dimensional space, and similarity is defined using a distance measure (e.g., Euclidean,
cosine, Jaccard)
Clustering is hard!
• Clustering in two dimensions, or clustering small amounts of data, is easy. High-dimensional spaces look different:
almost all pairs of points are at about the same distance, or rather, almost all pairs of points are very far from each
other. This is the curse of dimensionality!
6.1 Hierarchical Clustering
• Agglomerative (bottom-up): Initially, each point is a cluster. Repeatedly combine the two nearest clusters into
one.
Point assignment:
• Maintain a set of clusters, where points belong to the nearest cluster. This clustering strategy is the best when
clusters are nice, convex shapes.
Euclidean case:
• When merging clusters, the location of a cluster is represented by its centroid (artificial) point.
• The centroid is the average of the (data) points in the cluster.
• The distance between two clusters is measured as the distance between their centroids.
• At each step, the two clusters with the shortest distance are merged.
Figure 12: Example: Hierarchical Clustering
Non-Euclidean case: Note that a centroid (average point) representation is not always possible, since in some settings
we work in very high dimensions and distances depend on properties of the objects rather than on simple coordinates.
• Intercluster distance = the minimum distance between any two points, one from each cluster
• Or pick a notion of cohesion of clusters, e.g., maximum distance from the clustroid. Merge clusters whose
union is most cohesive. There are different notions of cohesion:
– Use the diameter of the merged cluster = maximum distance between points in the cluster
– Use the average distance between points in the cluster
– Use a density-based approach (take the diameter or average distance, e.g., and divide by the number
of points in the cluster)
• Large-data clustering requires loading one batch of data at a time, cluster them in memory, and keep summaries
of clusters. Example: BFR, CURE
6.2 K-Means
K-Means Clustering
The underlying assumption in K-Means is that samples are drawn from a mixture of Gaussians. The goal is to
estimate the cluster centers µk . This leads to a classic chicken-and-egg problem: To find the cluster centers (µk ), we
need the cluster memberships, and to determine the cluster memberships, we need the cluster centers.
• Given: data points x1 , . . . , xn and the desired number of clusters k
• Clustering Algorithm:
1. Randomly initialize the cluster means µ1 , . . . , µk
2. Repeat until convergence (i.e., until cluster centers no longer change significantly):
(a) Cluster Membership Assignment: For each sample xi , assign it to the closest cluster:
k(i) := arg min_k ∥xi − µk ∥²
(b) Cluster Mean Recalculation: For each cluster k, recompute the mean of the assigned points (a small numpy
sketch of both steps follows after this list):
µk := (1 / |{xi : k(i) = k}|) · Σ_{xi : k(i)=k} xi
• Evaluation Metrics:
– Internal: Bayes (Schwarz) Information Criterion (BIC)
– External: Purity (requires ground truth labels)
• Bayes (Schwarz) Information Criterion (BIC): For K clusters and a corresponding number of model parameters
f(K), choose
K* = arg min_K [ −2 ln P(x1 , . . . , xn | K) + f(K) · log n ]
• Further Analysis:
– Computational Cost: O(n · k · #iterations); can be optimized using the triangle inequality
– Convergence Criterion: Stop when there is no change in cluster assignments (i.e., cluster membership
stabilizes)
– Empty Clusters: If a cluster ends up empty, reinitialize its center randomly
– Is K-Means Deterministic? No — it is a local search algorithm, and results depend on random initialization
– Limitations: K-Means is best suited for isotropic (spherical) Gaussians. It assumes equal variance in
all directions and performs hard assignment (each point belongs to exactly one cluster). For non-isotropic
distributions or soft assignments, other clustering methods are needed.
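A minimal numpy sketch of the two alternating K-Means steps, using random toy data, random initialization, and a fixed iteration budget instead of a proper convergence check (all parameter choices are placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                   # toy data: 200 points in 2-D
k = 3
mu = X[rng.choice(len(X), k, replace=False)]    # random initialization of the means

for _ in range(20):                             # fixed iteration budget
    # (a) membership assignment: closest mean for every point
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # (b) mean recalculation for every non-empty cluster
    for c in range(k):
        if np.any(labels == c):
            mu[c] = X[labels == c].mean(axis=0)

print(mu)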
6.3 The BFR Algorithm
The BFR (Bradley-Fayyad-Reina) algorithm is a variant of k-means, designed to handle very large, disk-resident data
sets. Unlike standard k-means, which can become inefficient with large datasets, BFR efficiently summarizes clusters
without storing all the data points in memory.
Key Characteristics of BFR
• Cluster Assumptions: Clusters are assumed to be normally distributed around a centroid in Euclidean space, with
the axes of each cluster aligned with the coordinate axes.
• Each cluster is summarized by a few statistics rather than by its points:
– N: the number of points
– SUMx: the sum of the point coordinates in each dimension x
– SUMSQx: the sum of the squared coordinates in each dimension x
For a dimension x:
SUMx = Σ_{i=1}^{N} xi ,   SUMSQx = Σ_{i=1}^{N} xi²
From these, the mean µ and the variance σ² are calculated as:
µx = SUMx / N ,   σx² = SUMSQx / N − (SUMx / N)²
Example
Given the points (1, 2), (2, 1), and (1, 1):
N = 3
SUMx = 1 + 2 + 1 = 4,   SUMy = 2 + 1 + 1 = 4
SUMSQx = 1² + 2² + 1² = 6,   SUMSQy = 2² + 1² + 1² = 6
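A quick numerical check of these summary statistics and the mean and variance derived from them (numpy assumed):

import numpy as np

points = np.array([[1, 2], [2, 1], [1, 1]])

N = len(points)
SUM = points.sum(axis=0)             # [4, 4]
SUMSQ = (points ** 2).sum(axis=0)    # [6, 6]

mean = SUM / N
var = SUMSQ / N - (SUM / N) ** 2
print(mean, var)                     # ≈ [1.33 1.33] and [0.22 0.22]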
6.3.1 The CURE Algorithm
Definition
CURE (Clustering Using REpresentatives) is a clustering algorithm that assumes a Euclidean distance metric but
makes no assumptions about the shape, orientation, or distribution of clusters. Unlike k-means, which represents
each cluster by its centroid, CURE uses a set of well-dispersed representative points for each cluster. These
points better capture the geometry and boundaries of arbitrarily shaped clusters. The algorithm requires the
number of clusters k to be specified in advance.
Pass 1
• Pick a random sample of the data and cluster it in main memory using hierarchical clustering.
• For each cluster, pick a set of representative points: well-scattered points that are then moved a fixed fraction (e.g.,
20%) of the distance toward the cluster's centroid.
Pass 2
• Rescan the whole dataset and visit each point p in the data set.
– Normal definition of closest: find the closest representative to p and assign p to that representative's cluster.
6.4 Dimensionality Reduction
Dimensionality reduction asks: how many dimensions do I need to keep in order to preserve the structure of the data?
The matrix rank answers this question mathematically: how many independent dimensions are required to fully describe
the data?
Why do we need dimensionality reduction?
• Remove redundant and noisy features: not all features (e.g., words) are useful
6.4.1 Rank of a Matrix
In linear algebra, the rank of a matrix A is the dimension of the vector space generated (or spanned) by its columns. This
corresponds to the maximal number of linearly independent columns of A. Here is an example of how to find it:
Example
Let’s say you have data points in 3D space:
• A = (1, 2, 1)
• B = (−2, −3, 1)
• C = (−1, −1, 2)
Despite having 3D coordinates, these points actually lie on a plane in 3D. This means the rank of the matrix
would be 2, as you only need two dimensions to describe the data points, even though they exist in a 3D space.
Finding the Rank of the Matrix:
The matrix representing the data points (as rows) is:
M = [  1   2  1
      −2  −3  1
      −1  −1  2 ]
Performing row reduction (Gaussian elimination) gives the row echelon form:
[ 1  2  1
  0  1  3
  0  0  0 ]
Since there are two non-zero rows, the rank of the matrix is 2, meaning the data lies on a 2D plane in 3D space.
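The same rank can be verified numerically (numpy assumed):

import numpy as np

M = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [-1, -1, 2]])

print(np.linalg.matrix_rank(M))  # 2: the three points span only a 2-D plane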
6.5 Singular Value Decomposition (SVD)
We are given a ratings matrix A, where rows are users and columns are movies:
A = [ 5  4  0  0
      5  3  0  0
      4  5  0  0
      5  2  0  0
      0  0  4  5
      0  0  5  2
      0  0  4  5 ]
Users 1–4 like Sci-Fi movies (*Alien*, *Serenity*), users 5–7 prefer Romance (*Casablanca*, *Amélie*).
We apply SVD:
A = UΣVT
Where:
• U: user-to-concept matrix
• VT : concept-to-movie matrix
" #
0.68 −0.59
Sci-Fi axis
T
Σ = diag(12.4, 9.5, . . . ), U≈ .. .. , V ≈
. . Romance axis
The formula
∥A − B∥F = √( Σ_{i,j} (Aij − Bij)² )
is called the Frobenius norm. It measures how different two matrices A and B are, like a distance based on all
squared differences between corresponding entries.
In the context of SVD, we use this to find the best low-rank approximation:
Ak = arg min_{B : rank(B) = k} ∥A − B∥F
This means: among all matrices B of rank k, the matrix Ak (built from the top k singular values of A) is the closest
to A, measured using this distance. So, SVD gives the best possible simplified version of A with rank k.
Definition
Singular Value Decomposition (SVD):
Singular Value Decomposition is a matrix factorization technique used in linear algebra. It decomposes a given
matrix A of size m × n into three matrices:
A = UΣVᵀ
where:
• U is an m × r column-orthonormal matrix of left singular vectors ("concepts"),
• Σ is an r × r diagonal matrix holding the non-negative singular values in decreasing order,
• V is an n × r column-orthonormal matrix of right singular vectors (with r the rank of A).
SVD is widely used in areas such as dimensionality reduction, image compression, and solving linear systems.
Example
Let
A = [ 5  0
      0  2 ]
Since A is already diagonal with non-negative entries, its singular values are simply the diagonal entries:
σ1 = 5,  σ2 = 2   ⇒   Σ = [ 5  0
                             0  2 ]
The left and right singular vectors are just the standard basis vectors (identity matrix), so:
U = V = [ 1  0
          0  1 ]
Complexity: Computing the full SVD of an m × n matrix costs roughly O(min(n m², n² m)); less work is needed if only
the leading k singular values/vectors are required or if the matrix is sparse.
Sparsity in SVD
• In many real-world applications, the matrix A we want to decompose is very sparse (mostly zeros).
• However, the matrices U and V from the SVD of A are usually dense, meaning they contain mostly non-zero
values.
• This destroys sparsity, making storage and computation more expensive, and can reduce interpretability.
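A numpy sketch of computing the SVD of the (reconstructed) example rating matrix above and forming a rank-k approximation; the printed singular values depend on the exact matrix entries:

import numpy as np

A = np.array([[5, 4, 0, 0],
              [5, 3, 0, 0],
              [4, 5, 0, 0],
              [5, 2, 0, 0],
              [0, 0, 4, 5],
              [0, 0, 5, 2],
              [0, 0, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the two strongest "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(s)                               # singular values, largest first
print(np.linalg.norm(A - A_k, "fro"))  # Frobenius error of the rank-2 approximation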
6.6 CUR Decomposition
Definition
The CUR algorithm approximates a matrix A ∈ Rm×n using actual rows and columns from A, rather than abstract
components like in SVD.
A ≈ CU R
Where:
• C ∈ R^{m×c}: c actual columns of A
• R ∈ R^{r×n}: r actual rows of A
• U ∈ R^{c×r}: a small linking matrix computed such that the product CUR best approximates A
Construction:
1. Select a subset of c columns from A (possibly using randomized or importance sampling) to form C
2. Select a subset of r rows from A in the same way to form R
3. Let W be the intersection of the chosen columns and rows (i.e., W is the submatrix of A at those
row and column indices)
• Let W be the intersection of the selected columns C and rows R from the matrix A.
• Define W+ as the pseudoinverse of W.
• To compute W⁺, perform the SVD of W:
W = X Z Yᵀ
then invert the non-zero entries on the diagonal of Z,
Z⁺ii = 1 / Zii   (and Z⁺ii = 0 whenever Zii = 0),
and set W⁺ = Y Z⁺ Xᵀ.
Advantages of CUR: Because C and R consist of actual (and typically sparse) columns and rows of A, the factors stay
sparse and are easy to interpret, in contrast to the dense singular vectors produced by SVD.
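A rough numpy sketch of the CUR construction with uniform random sampling of rows and columns (practical CUR implementations sample proportionally to squared row/column norms; the random matrix is just a placeholder):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 50))
c = r = 10

cols = rng.choice(A.shape[1], size=c, replace=False)   # sampled column indices
rows = rng.choice(A.shape[0], size=r, replace=False)   # sampled row indices

C = A[:, cols]              # actual columns of A
R = A[rows, :]              # actual rows of A
W = A[np.ix_(rows, cols)]   # intersection of the chosen rows and columns
U = np.linalg.pinv(W)       # linking matrix via the pseudoinverse of W

A_approx = C @ U @ R
print(np.linalg.norm(A - A_approx, "fro") / np.linalg.norm(A, "fro"))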
SVD vs. CUR
Example
Based on a real experiment: we consider a large, sparse matrix built from bibliographic data. The goal was to
compare SVD vs. CUR for dimensionality reduction using accuracy (1 − relative sum of squared errors),
space ratio (#output entries / #input entries), and CPU time (total computation time in seconds).
Results:
• SVD achieves the highest accuracy, but is slow and uses much more space
7 Recommender Systems
Definition
Recommender Systems: The internet enables the near-zero-cost distribution of information about products,
services, and content, which leads to an overwhelming abundance of choices. As a result, users need effective
filters to help them navigate this information overload — this is where recommender systems come in. Rec-
ommender systems aim to provide personalized suggestions by learning user preferences and predicting what
items they might like. Common examples of platforms using recommender systems include Netflix, YouTube,
Amazon, and Spotify.
Definition
Collaborative Filtering: Given a user x, the goal is to identify a set N of other users whose rating behavior is
similar to that of user x. Then, estimate user x’s unknown ratings based on the ratings provided by users in the
set N .
Types of Recommendations: editorial and hand-curated lists, simple aggregates (e.g., top-10 lists, most popular items),
and recommendations tailored to individual users (the focus here).
Key Problems
• Sparsity: Most users interact with only a small fraction of the available items, resulting in a highly sparse matrix
with many missing values.
• Collecting known ratings: Gathering explicit (e.g., asking people to rate items) or implicit (e.g., inferring ratings from
user actions, such as a purchase suggesting a high rating) feedback to populate the matrix can be difficult. Feedback
varies in form (e.g., thumbs-up, 1–5 stars, viewing time) and is not always available for all users or items.
• Predicting unknown ratings: The primary task is to accurately infer missing entries — in particular, to identify
items the user is likely to enjoy. We are generally more interested in high ratings (what the user would like) than
low ones.
• Evaluating recommendations: Measuring the effectiveness of predicted recommendations requires suitable per-
formance metrics. Commonly used metrics include Precision, Recall, and the F1-score, especially when gener-
ating top-N recommendation lists.
7.1 Content-Based Approach
Content-based recommender systems suggest items to users based on the similarity of content between items that
the user has already liked and new items. For example, if a user likes movies with a certain director or genre, the system
will suggest other movies with similar attributes. Similarly, for blogs, articles, or websites, it looks for similar content
topics.
Each item is represented by a vector of features that describes its content. Examples include:
• Text documents: Important words extracted from the text (using TF-IDF).
This is a common method to weigh the importance of words in documents, emphasizing unique words in each document
to capture its essence.
User profiles are built based on the user's rating history or interactions with items. Possible methods include:
• A (weighted) average of the profiles of the items the user has rated
• Variation: weight the item profiles by how much the user's rating differs from the average rating
Prediction Heuristic:
The system estimates the user’s preference for a new item using cosine similarity:
u(x, i) = cos(x, i) = (x · i) / (∥x∥ · ∥i∥)
where x is the User profile vector, i is the Item profile vector, and it calculates how closely the user’s interests align with
the item’s attributes.
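A tiny sketch of this prediction heuristic with made-up feature vectors (e.g., TF-IDF weights); the profile values are placeholders:

import numpy as np

def cosine(x, i):
    return float(x @ i / (np.linalg.norm(x) * np.linalg.norm(i)))

user_profile = np.array([0.9, 0.1, 0.4])   # hypothetical preference weights for 3 features
item_a = np.array([1.0, 0.0, 0.5])         # item profiles (e.g., TF-IDF scores)
item_b = np.array([0.0, 1.0, 0.1])

print(cosine(user_profile, item_a))  # high score: likely to be recommended
print(cosine(user_profile, item_b))  # low score: unlikely to be recommended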
Pros:
• No need for data on other users – avoids the cold-start and sparsity problems.
• Recommendations for users with unique tastes – even if others don’t like similar items, you still get good
suggestions.
• Able to recommend new & unpopular items – it doesn’t rely on others’ feedback.
• Able to provide explanations – the system can explain why it recommends an item based on its features.
Cons:
• Finding appropriate features is hard: For media like movies, images, or music, it is challenging to determine
what features are most important.
• Recommendations for new users: It struggles with new users who haven’t rated enough items (Cold Start
Problem).
• Overspecialization:
– It only suggests items similar to what the user has liked before, potentially missing out on diverse content.
– It cannot leverage the preferences of other users to make broader recommendations.
7.2 User-User Collaborative Filtering
7.2.1 Pearson Correlation Coefficient
Treating missing ratings as zeros and comparing raw rating vectors (e.g., with Jaccard or cosine similarity) is problematic.
Instead, we use the Pearson correlation coefficient, which compares only the ratings for items rated by both users and
normalizes for user rating bias (e.g., some users rate everything high, others low). Let Sxy be the set of items rated by
both users x and y. Then the Pearson similarity is defined as:
sim(x, y) = Σ_{s∈Sxy} (rxs − r̄x)(rys − r̄y) / ( √(Σ_{s∈Sxy} (rxs − r̄x)²) · √(Σ_{s∈Sxy} (rys − r̄y)²) )
where:
• rxs and rys are the ratings of users x and y for item s
• r̄x and r̄y are the mean ratings of users x and y
Example
Pearson Correlation Example
We have three users (A, B, C) and three items (Item 1, Item 2, Item 3). The ratings are represented in the matrix
below, with NaN indicating a missing rating:
        Item 1   Item 2   Item 3
A       4        3        NaN
B       5        NaN      2
C       NaN      2        4
Step 1: Mean-center each user's ratings (mean of A = 3.5, mean of B = 3.5, mean of C = 3):
r′A,1 = 4 − 3.5 = 0.5,   r′A,2 = 3 − 3.5 = −0.5
r′B,1 = 5 − 3.5 = 1.5,   r′B,3 = 2 − 3.5 = −1.5
r′C,2 = 2 − 3 = −1,      r′C,3 = 4 − 3 = 1
Step 2: Compute the pairwise similarities over the co-rated items:
sim(A, B) = (0.5)(1.5) / ( √(0.5²) · √(1.5²) ) = 1
sim(A, C) = (−0.5)(−1) / ( √(0.5²) · √(1²) ) = 1
sim(B, C) = (−1.5)(1) / ( √(1.5²) · √(1²) ) = −1
Step 3: Interpretation
This measure accounts only for items both users have rated (avoids zero-padding issues) and for differences in user
rating scales (e.g., some users rate more harshly).
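A small sketch of the Pearson similarity restricted to co-rated items, reproducing the example above (NaN marks a missing rating):

import numpy as np

ratings = np.array([
    [4.0,    3.0,    np.nan],   # user A
    [5.0,    np.nan, 2.0],      # user B
    [np.nan, 2.0,    4.0],      # user C
])

def pearson(x, y):
    both = ~np.isnan(x) & ~np.isnan(y)     # items rated by both users
    dx = x[both] - np.nanmean(x)           # center by each user's own mean rating
    dy = y[both] - np.nanmean(y)
    return float(dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy)))

print(pearson(ratings[0], ratings[1]))  # sim(A, B) = 1.0
print(pearson(ratings[1], ratings[2]))  # sim(B, C) = -1.0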
Once the most similar users N of user x have been identified, the unknown rating rxi of item i can be predicted from
their ratings:
• Unweighted average:
rxi = (1/k) · Σ_{y∈N} ryi
• Weighted average using the similarity scores:
rxi = Σ_{y∈N} sxy · ryi / Σ_{y∈N} sxy
• Other enhancements exist (however, they are not explored further in this course).
Example
1. Unweighted Average
For user A, we predict Item 3:  rA,3 = (2 + 4) / 2 = 3
For user B, we predict Item 2:  rB,2 = (3 + 2) / 2 = 2.5
For user C, we predict Item 1:  rC,1 = (4 + 5) / 2 = 4.5
2. Weighted Average Using Similarity Scores (see the Pearson example above)
For user A, predicting Item 3:  rA,3 = (1 · 2 + 1 · 4) / (1 + 1) = 3
For user B, predicting Item 2:  rB,2 = (1 · 3 + (−1) · 2) / (1 + (−1)) = 1 / 0 = undefined
(In this case, the similarities cancel out, so we cannot predict using a weighted average.)
For user C, predicting Item 1:  rC,1 = (1 · 4 + (−1) · 5) / (1 + (−1)) = −1 / 0 = undefined
Key Difference:
Content-based Filtering:
• Looks at the content of items and matches them with the user’s history.
Collaborative Filtering:
• More explorative, can suggest unexpected items based on collective preference patterns.
The same idea can also be applied between items (item-item collaborative filtering): to predict a rating, we look at items
similar to item i that user x has already rated. Concretely, the adjusted weighted formula using similarity scores looks as
follows:
rxi = Σ_{j∈N(i;x)} sij · rxj / Σ_{j∈N(i;x)} sij
where sij is the similarity between items i and j, rxj is the rating of user x on item j, and N(i; x) is the set of items rated by
user x that are similar to item i.
Example: Item-Item CF (|N| = 2)
In this example the goal is to predict the rating of movie 1 for user 5 by finding similar movies (neighbours) that user 5 has
rated, and then using those ratings to estimate the missing one.
1. Neighbourhood Selection: Apply item-item collaborative filtering by looking at other movies that are similar to
movie 1 and have already been rated by user 5. In this case user 5 has rated movies 3, 4, and 6.
2. Compute Similarities Between Items: Use the Pearson correlation as similarity by subtracting the mean rating mi of
each movie i from its ratings; for movie 1, m1 = (1 + 3 + 5 + 5 + 4)/5 = 3.6, so the mean-centered ratings of movie 1 are
row1 = [−2.6, 0, −0.6, 0, 0, 1.4, 0, 1.4, 0, 0.4, 0]. Now compute cosine similarities between the mean-centered rows.
• Similarity between movie 1 and movie 3 = 0.41
• Similarity between movie 1 and movie 6 = 0.59
• Similarity between movie 1 and movie 4 = -0.10
3. Select Neighbours (|N| = 2): Choose movies 3 and 6 as neighbours (highest similarity values).
4. Predict: take the similarity-weighted average of user 5's ratings of the two neighbour movies,
r5,1 = (0.41 · r5,3 + 0.59 · r5,6) / (0.41 + 0.59).
Example
We have a user X with the following ratings for three items (Item 3 is unknown and is to be predicted):
Step 1: Known ratings: rx1 = 5, rx2 = 3.
Step 2: Item-item similarities to Item 3: s13 = 0.92, s23 = 0.90, so
rx3 = (0.92 · 5 + 0.90 · 3) / (0.92 + 0.90)
Step 3: Compute the Prediction
The numerator is 0.92 · 5 + 0.90 · 3 = 4.6 + 2.7 = 7.3 and the denominator is 0.92 + 0.90 = 1.82, so
rx3 = 7.3 / 1.82 ≈ 4.01
Result: The predicted rating for Item 3 for user X is approximately rx3 ≈ 4.01.
Evaluating Predictions
• Root-mean-square error (RMSE): Measures the average squared difference between predicted and actual ratings,
RMSE = √( (1/N) · Σ_{x,i} (rxi − r*xi)² )
where rxi is the predicted rating, r*xi is the true rating of user x on item i, and N is the number of known ratings.
• Precision at top 10: Percentage of relevant items ranked in the top 10 recommendations.
• Rank Correlation: Spearman’s correlation between the system’s predicted ranking and the user’s true item rank-
ing.
• Coverage: Number of items or users for which the system is able to make predictions.
• Receiver Operating Characteristic (ROC): Tradeoff curve showing false positive vs. false negative rates for
different threshold settings.
Problems with Error Measures: Traditional error measures, such as RMSE, often place a narrow emphasis on nu-
merical accuracy, which may overlook the true goals of a recommender system. In practice, we are typically interested
only in predicting high ratings — items a user is likely to enjoy. An algorithm that performs well in identifying such high
ratings but poorly elsewhere might be unfairly penalized by RMSE, even though it delivers useful recommendations.
Complexity
In collaborative filtering, the most computationally expensive step is identifying the k most similar users (or items) for
each prediction.
• The complexity of finding the k nearest neighbours of a user is O(|X|), where X is the set of all users, which is
typically too costly to compute at runtime for large-scale systems.
• One solution is to pre-compute the nearest neighbours for each user or item; however, naive pre-computation has
complexity O(k · |X|) per user. Ways to speed this up include:
• Locality-Sensitive Hashing (LSH): Allows fast approximate nearest-neighbor search in high-dimensional spaces.
• Clustering: Group similar users or items together to limit the search to within a cluster.
• Dimensionality Reduction: Techniques like PCA or SVD can reduce the dimensionality of rating vectors, making
similarity comparisons faster and more meaningful.
Pros:
• Works for any kind of item: No item features are required. This makes item-item collaborative filtering broadly
applicable across domains (movies, books, products, etc.).
• Often outperforms user-user CF: Items are more stable than users — they don’t change behavior or preferences.
Once an item has been rated by many users, its rating pattern becomes stable and more reliable for similarity
comparison.
• Denser data per item: Each item typically receives ratings from many users, while users only rate a small subset
of items. This leads to more complete item profiles and makes it easier to compute item-item similarity than
user-user similarity.
Cons:
• Cold start problem: Newly added items cannot be recommended until they have been rated by enough users to
establish meaningful similarity.
• Sparsity: The overall rating matrix is sparse — it can still be difficult to find overlapping ratings between items,
especially in large catalogs.
• First-rater problem: Items that have not yet been rated cannot be recommended. This is especially problematic
for new or niche items.
• Popularity bias: The system tends to favor popular items with many ratings. Users with unique or niche prefer-
ences may receive less relevant recommendations.
7.3 Latent Factor Models
BellKor Recommender System
Definition
BellKor Recommender System: The BellKor Recommender System was one of the leading solutions in the
Netflix Prize competition. It uses a multi-scale modeling approach, combining techniques at different levels of
abstraction to produce highly accurate recommendations.
• Global Effects: Captures overall trends and biases in the data. For example, some users consistently rate
more generously or harshly than average, and some items (like blockbusters) receive consistently higher
ratings than others. This is modeled using a simple baseline estimate
bui = µ + bu + bi
where µ is the overall mean rating, bu the bias of user u, and bi the bias of item i.
• Factorization (Latent Factors): Addresses intermediate or “regional” effects. This is based on matrix
factorization models, also called latent factor models, which capture hidden dimensions of user prefer-
ences and item characteristics:
rui ≈ µ + bu + bi + ⟨pu , qi ⟩
where pu is the latent factor vector of user u, qi the latent factor vector of item i, and ⟨pu , qi ⟩ their dot
product (the user-item interaction term); µ, bu , and bi are the baseline terms from above.
• Collaborative Filtering (Local Patterns): After accounting for global biases and latent factors, the system
applies memory-based collaborative filtering to capture local deviations. This helps refine predictions by
modeling specific patterns of agreement between small groups of users or items.
Traditional similarity-based collaborative filtering methods rely on fixed similarity measures such as cosine similarity or
Pearson correlation. However, these approaches have several limitations:
• Similarity measures are arbitrary: They are hand-crafted and not learned from the data.
• Pairwise similarities ignore context: They treat each pair of items (or users) independently, neglecting global
interdependencies.
• Averaging limits expressiveness: Using a weighted average of neighbor ratings restricts the model’s flexibility:
r̂xi = Σ_{j∈N(i)} sij · rxj / Σ_{j∈N(i)} sij
Interpolation Weights wij
Our ultimate goal is to make good recommendations — i.e., recommend items a user will likely enjoy. The problem
is that we don’t have ground truth to directly optimize on, since the user hasn’t seen the new items yet. Therefore,
we take a practical workaround: rather than using a simple weighted average of ratings based on arbitrary similarity
scores, we use a weighted sum with learned weights to improve prediction accuracy:
  r̂_xi = b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj)
• bxi is the baseline estimate for user x’s rating of item i (e.g., µ + bx + bi )
• wij is the interpolation weight capturing how informative item j is for predicting item i
• N (i; x) is the set of items rated by user x that are most similar to item i
• We aim to minimize prediction error on the training data (i.e., known ratings). This is typically done using Root
Mean Squared Error (RMSE), or equivalently, the Sum of Squared Errors (SSE):
  SSE = Σ_{(x,i)∈R} (r̂_xi − r_xi)²
• Substituting in the predicted rating formula that uses interpolation weights, the loss function (optimization prob-
lem) becomes:
  J(w) = Σ_{x,i} ( [ b_xi + Σ_{j∈N(i;x)} w_ij (r_xj − b_xj) ] − r_xi )²

where the bracketed term is the predicted rating and r_xi is the true rating.
• The weights wij are then learned by minimizing J(w) over the training data. This optimization captures how
influential item j ’s rating is in predicting item i’s rating, specifically for user x.
• Importantly, wij is not hand-defined like cosine or Pearson similarity — it is estimated from data, based on the
interactions of item i, its neighbors j , and the users who rated them.
• The assumption is that if we find weights that explain known ratings well, they will also generalize to unseen
ratings, which aligns with standard machine learning principles.
Now, the task is to solve the optimization problem using Gradient Descent. This means one fixes item i and
iterates over all ratings r_xj for every item j ∈ N (i; x) (computing the gradient) until convergence: w ← w − α ∇_w J, where α is
the learning rate, and ∇_w J is the gradient of the loss function J with respect to the weights w, evaluated on the training
data.
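As a concrete illustration (not the actual BellKor implementation), here is a minimal Python sketch of learning the interpolation weights w_ij for a single item i by gradient descent; the toy ratings, the constant baseline, and the fixed neighborhood N(i; x) are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical toy data: ratings r[x][item], a constant baseline b_xi,
# and a neighborhood N(i; x) that is the same two items for every user.
ratings = {  # user -> {item: rating}
    "u1": {"i": 4.0, "j1": 5.0, "j2": 3.0},
    "u2": {"i": 2.0, "j1": 3.0, "j2": 1.0},
    "u3": {"i": 5.0, "j1": 4.0, "j2": 4.0},
}
baseline = 3.0             # simplistic constant baseline b_xi (assumption)
neighbors = ["j1", "j2"]   # N(i; x), assumed identical for all users here

w = np.zeros(len(neighbors))   # interpolation weights w_ij, learned from data
alpha = 0.01                   # learning rate

for epoch in range(200):
    grad = np.zeros_like(w)
    for user, r in ratings.items():
        dev = np.array([r[j] - baseline for j in neighbors])  # r_xj - b_xj
        err = (baseline + w @ dev) - r["i"]                   # r_hat_xi - r_xi
        grad += 2 * err * dev                                 # gradient of squared error
    w -= alpha * grad                                         # gradient descent step

print("learned weights w_ij:", w)
```

The weights that come out are whatever best explains the known ratings of item i, which is exactly the contrast to a hand-picked cosine or Pearson similarity.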
This layer of the BellKor recommender system captures fine-grained, local interactions, and removes reliance on
arbitrary similarity metrics like cosine or Pearson. It answers: “How important is item j ’s rating for predicting a rating for
item i?”
Latent Factor Models
Still thinking about the Netflix Prize: for the latent factor model we apply an SVD-style factorization to the Netflix data.
We want to approximate the rating matrix R as a product of “thin” matrices, concretely R ≈ QPᵀ. Even though R has
missing entries, we ignore that: we only want the reconstruction error to be small on the known ratings and do not care
about the missing ones.
[Figure: R factored into a tall, thin Q and a short, wide Pᵀ — omitted]
Here R is our input matrix (A in the original SVD), Q holds our left singular vectors (U in the original SVD), and Pᵀ
combines the singular values and right singular vectors (ΣVᵀ in the original SVD). To estimate the missing rating of
user x for item i, we take the dot product of the corresponding latent vectors: r̂_xi = q_i · p_xᵀ.
Note: SVD is not defined when entries are missing; this is a problem for the sparse Netflix rating data. Instead of
using classical SVD (which requires a fully observed matrix), we use a low-rank
matrix factorization model designed to work with missing data.
• P , Q are learned directly from data using optimization methods such as gradient descent.
• The vectors px and qi serve as latent embeddings, capturing abstract user and item traits (e.g., genre affinity,
rating behavior).
• This approach became the most widely used and successful method in the Netflix Prize competition.
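A minimal sketch of such a factorization, assuming plain SGD on the known ratings only and omitting the regularization that production systems would add; the rating triples, factor count k, and learning rate are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Known ratings as (user, item, rating) triples; all other entries are missing.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2     # k = number of latent factors (assumption)

P = 0.1 * rng.standard_normal((n_users, k))  # user factor vectors p_x
Q = 0.1 * rng.standard_normal((n_items, k))  # item factor vectors q_i
alpha = 0.02                                 # learning rate

for epoch in range(500):
    for x, i, r in ratings:
        err = r - Q[i] @ P[x]        # error on a known rating only
        px = P[x].copy()
        P[x] += alpha * err * Q[i]   # SGD updates (regularization omitted)
        Q[i] += alpha * err * px

for x, i, r in ratings:
    print(f"user {x}, item {i}: true {r:.1f}, predicted {Q[i] @ P[x]:.2f}")
```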
8 PageRank
8.1 Introduction
There are many types of Graphs:
• Social Networks Ex.: Facebook — Represents connections between users where friendships or interactions are
modeled as edges.
• Media Networks Ex.: Connections between political blogs — Visualizes how blogs of similar political opinions link
to each other, forming clusters.
• Information Networks Ex.: Citation networks — Displays how scientific papers reference each other, indicating
the flow of knowledge.
• Communication Networks Ex.: Internet — Shows routers and computers connected through data links, forming
the backbone of global communication.
• Technological Networks Ex.: Seven Bridges of Königsberg — An early example of graph theory where Euler
studied pathways across bridges.
Definition
Web as directed Graph
• Nodes: Webpages
• Edges: Hyperlinks
Approaches:
Links as Votes
• A link to a page counts as a vote for that page’s importance; in-links from more important pages count as stronger votes.
• This creates a recursive formulation: important pages link to other important pages.
Rank Definition
Definition
• The rank r_j of a page j is the sum of the ranks of the pages linking to it, each divided by its
  out-degree (number of outgoing links):

  r_j = Σ_{i→j} r_i / d_i

• Here: d_i is the out-degree of page i, and the sum runs over all pages i that link to page j.
Example
Example Setup:
• Three pages: A, B, C
• Links: A → B, A → C, B → C, C → A
• Initial ranks: r_A = 1.0, r_B = 1.0, r_C = 1.0
• Out-degrees: d_A = 2, d_B = 1, d_C = 1

PageRank Calculation (first iteration):
• Page A: r_A = r_C / d_C = 1.0 / 1 = 1.0
• Page B: r_B = r_A / d_A = 1.0 / 2 = 0.5
• Page C: r_C = r_A / d_A + r_B / d_B = 0.5 + 1.0 = 1.5

PageRank Values after this iteration:
• r_A = 1.0
• r_B = 0.5
• r_C = 1.5

Intuition:
• Page C is more important, as it receives links from both A and B.
• Page B is less important, since it only receives half the rank of A.
Example
Solving the Equations: the flow equations of the example (r_A = r_C / d_C, r_B = r_A / d_A, r_C = r_A / d_A + r_B / d_B)
can be solved directly as a linear system.
Efficiency Improvement
Gaussian elimination works for such small examples, but we need a better method for large, web-sized graphs.
Definition
Power Iteration Method
• Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.
Definitions:
• M is the column-stochastic link matrix: M_ji = 1/d_i if page i links to page j, and 0 otherwise.
• r is the rank vector; r_i is the importance score of page i, with Σ_i r_i = 1.
• Initialize r^(0) = [1/N, …, 1/N]ᵀ, iterate r^(t+1) = M · r^(t), and stop when |r^(t+1) − r^(t)|₁ < ε.
Example
Graph Setup:
• Pages: A, B, C
• Links: A → B, B → C, C → A

First Iteration:

  r^(1) = M · r^(0) = [ 0 0 1 ; 1 0 0 ; 0 1 0 ] · [ 1/3, 1/3, 1/3 ]ᵀ = [ 1/3, 1/3, 1/3 ]ᵀ

Convergence Criterion: iterate until |r^(t+1) − r^(t)|₁ < ε. Here r^(1) = r^(0), so the rank vector has already converged to [1/3, 1/3, 1/3]ᵀ.
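A minimal Python sketch of power iteration on this three-page graph; the iteration cap and tolerance are arbitrary choices, not values from the lecture.

```python
import numpy as np

# Column-stochastic link matrix M for the graph A -> B, B -> C, C -> A
# (column = source page, M[j, i] = 1/d_i if page i links to page j).
M = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

r = np.full(3, 1.0 / 3.0)                # r^(0): uniform initial ranks
for _ in range(100):
    r_new = M @ r                        # r^(t+1) = M * r^(t)
    if np.abs(r_new - r).sum() < 1e-9:   # L1 convergence check
        break
    r = r_new

print("PageRank vector:", r)   # stays at [1/3, 1/3, 1/3] for this 3-cycle
```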
8.4 Problems with PageRank
The simple flow formulation runs into two problems:
1. Dead Ends
2. Spider Traps
Dead Ends
• A dead end is a page with no out-links, so the random walk has nowhere to go.
• This causes the PageRank calculation to leak importance out of the graph and prevents the algorithm from converging.
Spider Traps
• A spider trap is a group of pages whose out-links all point back into the group.
• If a surfer enters a spider trap, they are stuck indefinitely, causing these pages to absorb all the PageRank weight.
Definition
Solution to Dead Ends and Spider Traps: Teleports
Google’s solution is to introduce teleports, which work as follows:
• At each step, with probability β the surfer follows a random out-link, and with probability 1 − β the surfer
  teleports to a random page (β is typically around 0.8–0.9).
• At a dead end, the surfer always teleports to a random page.
• This prevents the algorithm from getting stuck at dead ends or spider traps, ensuring smooth navigation
  and proper PageRank distribution.
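A minimal sketch of power iteration with teleports, assuming the common damping value β = 0.85 and a graph whose dead-end columns have already been patched; it is an illustration, not Google's implementation.

```python
import numpy as np

def pagerank_with_teleports(M, beta=0.85, eps=1e-9):
    """Power iteration on A = beta*M + (1-beta)*(1/N): follow a link with
    probability beta, otherwise teleport to a random page. M must be
    column-stochastic; fix dead-end columns beforehand (e.g., set every
    entry of such a column to 1/N) so no importance leaks out."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        r_new = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# Same three-page graph as above; with teleports a spider trap could not
# absorb all the weight, because the surfer escapes with probability 1 - beta.
M = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
print(pagerank_with_teleports(M))
```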
8.5 Topic Specific PageRank
Definition
In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector
is computed, using the link structure of the Web, to capture the relative ”importance” of Web pages, independent
of any particular search query. Topic-Specific PageRank introduces a bias towards a predefined set of topic-relevant
pages, the teleport set S: instead of teleporting to any page uniformly at random, the random surfer teleports only
to pages in S.
9 Data Streams
Examples and characteristics of streaming data:
• Mastodon Status Updates — Real-time social media updates that arrive continuously.
• Non-Stationary: the data’s nature and distribution can change over time (e.g., trending topics, seasonal
searches).
The Stream Model: In the Stream Model, data elements arrive rapidly at one or more input ports, called ”streams”.
The system cannot store the entire stream because the data volume is too large and arrives too quickly.
Key Question: How do we make important calculations on this endless data flow with only a limited amount of memory?
Side Note: SGD is a Streaming Algorithm Stochastic Gradient Descent (SGD) is a classic example of a streaming
algorithm. In machine learning, this is known as online learning.
What is Online Learning?
Instead of learning from a fixed dataset, the model learns and updates continuously from a stream of new data. This
allows the model to adapt over time as new patterns and data distributions emerge.
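As a toy illustration of online learning, the following sketch fits a one-parameter linear model with SGD on a synthetic, endless stream; the data generator, true slope, and learning rate are invented for the example.

```python
import random

# Minimal sketch of online (streaming) SGD for a 1-D linear model y ≈ w*x.
def stream():
    while True:
        x = random.uniform(-1, 1)
        yield x, 3.0 * x + random.gauss(0, 0.1)   # hypothetical true slope 3.0 plus noise

w, alpha = 0.0, 0.05
for t, (x, y) in enumerate(stream()):
    w -= alpha * 2 * (w * x - y) * x   # one gradient step per arriving element
    if t == 10_000:                    # a streaming model never really "finishes"
        break

print("estimated slope:", round(w, 3))   # close to 3.0
```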
Problems on Data Streams — Types of Queries on a Data Stream:
• Sampling data from a stream – Construct a random sample from the data stream.
• Queries over sliding windows – Count the number of items of type x in the last k elements of the stream.
• Filtering a data stream – Select elements that satisfy property x from the stream.
• Counting distinct elements – Compute the number of distinct elements in the last k elements of the stream.
• Estimating moments – Estimate average, standard deviation, or higher moments of the last k elements.
• Finding frequent elements – Identify elements that appear frequently in the stream.

Applications of Data Streams:
• Mining Query Streams – Google wants to identify which queries are more frequent today than they were yesterday.
• Mining Click Streams – Yahoo wants to monitor which of its pages are receiving an unusual number of hits in the past hour.
• Mining Social Network News Feeds – Detect trending topics on platforms like Mastodon, Bluesky, etc.
• Sensor Networks – Data from many sensors are continuously fed into a central controller for real-time monitoring.
• Telephone Call Records – Data is used for generating customer bills and managing settlements between telephone companies.
• IP Packets Monitored at a Switch – Gather information for optimal routing and detect potential denial-of-service (DoS) attacks.
There are two different sampling problems: (1) sample a fixed proportion of the stream’s elements, and (2) maintain a
random sample of fixed size (Reservoir Sampling, discussed below).
• The fixed-size approach maintains a constant-size sample (e.g., 100 elements) regardless of the total stream size.
• It ensures that every element seen so far has an equal probability of being included.
• This is more representative of the data stream over time, but slightly more complex to implement.
Problem Definition:
• We want to sample a fixed proportion (e.g., 10%) from a large search engine query stream.
• Queries are represented as tuples: (user, query, time).
Naive Solution:
• Sample each query independently with probability 1/10 (e.g., draw a random integer in [0, 9] for every arriving query and keep the query if the integer is 0):

  P(Query is sampled) = 1/10 = 0.1
What Happens with Duplicates?
If a query appears twice (a duplicate), each appearance is sampled independently with a probability of 0.1.
To have both copies of the duplicate in the sample, both independent events must occur. The probability is calculated
as follows:
P (Both duplicates are sampled) = 0.1 × 0.1 = 0.01
The Meaning of 1% (1/100):
This means that only 1 out of 100 pairs of duplicates will be fully captured in the sample. If there are 1000 duplicate
queries in the original data, we can expect approximately:

  1000 / 100 = 10 full pairs
Hence, the naive sampling method drastically underestimates the true number of duplicates in the sampled data.
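A quick simulation confirming this arithmetic: sampling each copy of a duplicated query independently with probability 0.1 keeps both copies only about 1% of the time (the pair count and seed are arbitrary choices).

```python
import random

random.seed(1)
n_pairs, kept_pairs = 100_000, 0
for _ in range(n_pairs):
    # Each copy of a duplicated query is sampled independently with p = 0.1.
    first = random.random() < 0.1
    second = random.random() < 0.1
    kept_pairs += first and second

print(kept_pairs / n_pairs)   # ≈ 0.01: only about 1% of duplicate pairs survive intact
```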
Improved Solution: Sample Users Instead of Queries
• Use a hash function to map each user to one of a fixed number of buckets.
• If the user’s hash falls into a selected bucket, all queries from that user are stored.
• For a 30% sample, hash users into 10 buckets and select a query if its user hashes into the first 3 buckets.
Example
Example: Hashing Technique for Sampling
• Hash each tuple’s user into one of 10 buckets and pick the tuple if the hash value is in the first 3 buckets (i.e., 0, 1, or 2).
• Resulting Sample: we effectively keep approximately 30% of the stream by uniformly hashing users and selecting based on bucket
position.
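A minimal sketch of this user-level sampling with hypothetical users and queries; it hashes the user of each (user, query, time) tuple into 10 buckets and keeps the tuple when the bucket index is below 3.

```python
# Minimal sketch: keep all queries of users whose hash falls into buckets 0-2,
# which yields roughly a 30% user sample.
stream = [
    ("alice", "weather", 1), ("bob", "news", 2), ("alice", "weather", 3),
    ("carol", "python", 4), ("dave", "news", 5), ("bob", "news", 6),
]

def keep(user, buckets=10, selected=3):
    # Built-in hash() is consistent within one run, which is all a streaming
    # filter needs; a cross-run deterministic hash (e.g., hashlib) also works.
    return hash(user) % buckets < selected

sample = [t for t in stream if keep(t[0])]
print(sample)   # duplicate queries of a sampled user survive together
```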
Problem Definition:
• We want to maintain a random sample S of fixed size s over a stream of unknown (possibly infinite) length.
• We cannot store all the elements: keeping the whole stream and sampling afterwards is impractical due to memory limits.
Definition
Reservoir Sampling:
• Store the first s elements of the stream in S.
• Suppose we have seen n − 1 elements, and now the n-th element arrives (n > s):
  – With probability s/n, keep the n-th element; otherwise discard it.
  – If we keep the n-th element, it replaces one of the s elements already in S, picked uniformly at random.
Conclusion: after n elements, the sample S contains each element seen so far with probability s/n.
Example
Stream of elements: A, B, C, D, E, F, G, H, I, J
Sample Size: s = 3
– The first s = 3 elements are stored directly: S = [A, B, C].
– D arrives (n = 4):
  Probability of keeping it is 3/4 = 0.75. Assume it is kept. It replaces one of A, B, C at random; let’s say
  it replaces A, so S = [D, B, C].
– E arrives (n = 5):
  Probability of keeping it is 3/5 = 0.6. Assume it is discarded. S remains unchanged.
– F arrives (n = 6):
  Probability of keeping it is 3/6 = 0.5. Assume it is kept. It randomly replaces an element, let’s say B,
  so S = [D, F, C].
– G arrives (n = 7):
  Probability of keeping it is 3/7 ≈ 0.43. Assume it is discarded.
• Final Sample after 7 elements: S = [D, F, C]
Think of it as a lottery:
Every new element gets a ticket to enter the sample. As the stream grows, the chance of getting a ticket de-
creases. But all elements—both old and new—are treated fairly.
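A minimal sketch of Reservoir Sampling as defined above; the stream and sample size mirror the example, but the concrete sample it prints depends on the random draws.

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of fixed size s from a stream."""
    S = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            S.append(item)                 # store the first s elements
        elif random.random() < s / n:      # keep the n-th element w.p. s/n
            S[random.randrange(s)] = item  # it replaces a uniformly chosen slot
    return S

print(reservoir_sample("ABCDEFGHIJ", 3))   # e.g. ['D', 'F', 'C']
```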
9.3 Filtering Data Streams
• But: we do not have enough memory to store all of S (the set of keys to filter on) in a hash table, and we might be
processing millions of filters on the same stream.

9.3.1 Applications of Filtering Data Streams
• Email spam filtering:
  – We know 1 billion ”good” email addresses.
  – If an email comes from one of these, it is not spam.
• Publish-subscribe systems

First Cut Solution: Understanding the Basics
Problem: We have a set of keys S to filter and a data stream of elements to check against S.
9.3.2 First Cut Solution
Example
Solution Steps:
1. Create a Bit Array B:
• We create a bit array B with n bits, all initialized to 0. This array will help us quickly check if an element
  might be in S.

  B = [0, 0, 0, 0, . . . , 0]

2. Choose a Hash Function h:
• A hash function takes an element s and maps it to an index in the bit array B. The output of h(s) is
  always in the range [0, n − 1], where n is the size of B. Example of a hash function: any function that
  spreads the keys roughly uniformly over [0, n − 1], e.g., a modular hash.
3. Insert Elements of S:
• For each element s in S, we compute h(s) and set the bit at that index in B to 1:

  B[h(s)] = 1

4. Check Stream Elements:
• For each stream element a, output a if B[h(a)] = 1, otherwise discard it. Elements of S are never missed,
  but elements outside S may slip through (false positives).
• Example: S = {5, 12, 19}, n = 20. After inserting all elements of S with some hash function h, the bit array
  might look like:

  B = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
9.3.3 Bloom Filters
The First Cut Solution was a simple form of a Bloom Filter with only 1 hash function. A proper Bloom Filter uses
multiple hash functions, spreads its bits better, and reduces false positives significantly.
• Consider: a set S of m keys, a bit array B of n bits, and k independent hash functions h₁, …, h_k.
Initialization:
• All bits in B are initially set to 0.
• Insertion: for each s in S, set B[h_i(s)] = 1 for all i = 1, …, k.
• Lookup: a stream element a is reported as (possibly) in S only if B[h_i(a)] = 1 for all i = 1, …, k.
• False Positives Possible: An element not in S might hash to bits set by other elements.
• False positive probability for different k (with n/m = 8 bits per key):
  – k = 1: P ≈ 0.12 (12%)
  – k = 2: P ≈ 0.05 (5%)
  – k = 6: P ≈ 0.02 (2%) — Optimal value
  – k = 20: P ≈ 0.18 (18%)
• Optimal k value:

  k = (n/m) · ln(2) = 8 · ln(2) ≈ 5.54 ≈ 6
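A minimal sketch of a Bloom filter, assuming k hash functions derived from SHA-256 with different prefixes and roughly 8 bits per key; the email addresses are made up, and a real deployment would use a packed bit array rather than one byte per bit.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: n bits, k hash functions derived from SHA-256."""
    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits)   # one byte per bit, for clarity only

    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # False positives are possible, false negatives are not.
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(n_bits=8 * 1000, k=6)     # ~8 bits per key, k = 6
for addr in (f"user{i}@example.com" for i in range(1000)):
    bf.add(addr)
print(bf.might_contain("user42@example.com"))   # True
print(bf.might_contain("spammer@example.com"))  # usually False (~2% false positives)
```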
9.4 Counting Distinct Elements
The Flajolet-Martin (FM) algorithm estimates the number of distinct elements in a data stream using minimal memory. The core
idea is to use probabilistic hashing and trailing zeros in hash values to estimate unique counts.
• Pick a hash function h that maps each of the N possible elements to at least log₂ N bits.
• For each stream element a, let r(a) be the number of trailing 0s in h(a).
• Record R = max_a r(a) over the stream; the estimated number of distinct elements is 2^R.
Example
Step 1: Choose a Hash Function. We choose a simple hash function h(x) that maps elements to numbers, which
we then read in binary. For this example, suppose the stream contains the elements a, b, c, d with h(a) = 12,
h(b) = 6, h(c) = 8, h(d) = 10.
Step 2: Binary Representation and Counting Trailing Zeros. We convert the values to binary and count
the number of trailing zeros:
  h(a) = 12 → 1100₂, r(a) = 2
  h(b) = 6 → 0110₂, r(b) = 1
  h(c) = 8 → 1000₂, r(c) = 3
  h(d) = 10 → 1010₂, r(d) = 1
Step 3: Finding the Maximum Number of Trailing Zeros. Now we determine the maximum of the observed
r values:
  R = max(2, 1, 3, 1) = 3
Step 4: Estimating the Number of Distinct Elements. We estimate the number of distinct elements in the data
stream using the formula:
  D̂ = 2^R
Since R = 3, we get:
  D̂ = 2³ = 8
Step 5: Conclusion of the Estimation. The algorithm estimates that there are approximately 8 distinct ele-
ments in the data stream. This is an approximation and may vary slightly, but the memory usage is extremely
low.
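A minimal sketch of the Flajolet-Martin estimate with a single hash function (here SHA-256, an arbitrary choice); the toy stream is hypothetical, and in practice many hash functions are combined (e.g., by averaging group-wise maxima) to reduce the variance of the power-of-two estimate.

```python
import hashlib

def trailing_zeros(x):
    """Number of trailing 0 bits in x (treat 0 as having 0 trailing zeros here)."""
    if x == 0:
        return 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Flajolet-Martin with a single hash function: estimate = 2^R."""
    R = 0
    for element in stream:
        h = int(hashlib.sha256(str(element).encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

stream = ["a", "b", "c", "d", "a", "b", "a"]   # 4 distinct elements
print(fm_estimate(stream))   # a rough power-of-two estimate, e.g. 4 or 8
```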
10 Learning Objectives
• MapReduce and distributed programming
– Understand the MapReduce paradigm and its three main phases (Map, Group by Key, Reduce)
– Distinguish between MapReduce and Spark (RDDs, DataFrames, Datasets)
– Identify problems suited for MapReduce (sequential data processing, large batch jobs)
– Compute communication and computation costs of MapReduce algorithms
• Association Rule Discovery
– Apply the A-Priori and PCY algorithms for frequent itemset detection; pros and cons
– Compute support, confidence, and interest of association rules
– Understand the market-basket model and its applications
– Analyze memory and I/O costs of different approaches (triangular matrix vs. triples)
• Similarity search and LSH
– Choose the right distance measure for each data type:
* Jaccard distance for sets
* Cosine distance for vectors
* Euclidean distance for numerical data
– Implement Locality Sensitive Hashing (LSH) for efficient similarity search
– Carry out the three steps: Shingling → Min-Hashing → LSH
• Clustering algorithms
– Understand different clustering approaches (Hierarchical, K-Means, BFR, CURE)
– Apply K-Means and choose the optimal number of clusters
– Distinguish between Euclidean and non-Euclidean clustering
• Dimensionality reduction
– Explain the differences between SVD and CUR:
* SVD: produces abstract, dense matrices (U, Σ, Vᵀ)
* CUR: uses actual rows/columns, preserves sparsity, and is more interpretable
– Apply SVD for latent factor models in recommender systems
• Recommender Systems
– Implement content-based and collaborative filtering approaches
– Compute the Pearson correlation for user-user and item-item similarity
– Understand latent factor models and matrix factorization
• PageRank and graph algorithms
– Compute PageRank values with the Power Iteration Method
– Solve problems with dead ends and spider traps via teleportation
– Distinguish between standard and topic-specific PageRank
• Data stream processing
– Implement sampling algorithms (Reservoir Sampling)
– Apply Bloom filters for stream filtering
– Estimate the number of distinct elements with Flajolet-Martin