01 Intro

CS246: Mining Massive Data Sets is a course at Stanford University focused on extracting actionable information from large datasets using data mining, predictive analytics, and machine learning. The course emphasizes scalability and practical applications, covering various data types, computational models, and real-world problems like recommender systems and spam detection. Students are encouraged to engage with the material through lectures, homework, and collaborative study, with a strong emphasis on programming and mathematical skills.

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

CS246: Mining Massive Data Sets

Jure Leskovec, Stanford University
Charilaos Kanatsoulis
http://cs246.stanford.edu

Data contains value and knowledge

• But to extract the knowledge, data needs to be:
▪ Stored (systems)
▪ Managed (databases)
▪ And ANALYZED ← this class

Data Mining ≈ Predictive Analytics ≈ Data Science ≈ Machine Learning ≈ Data-Centric AI

• Extraction of actionable information from (usually) very large datasets is the subject of extreme hype, fear, and interest

• It's not all about machine learning

• But most of it is!

• Emphasis in CS246 is on algorithms that scale
▪ Parallelization often essential

"This class is a must if you want to become a Data Scientist or an ML Engineer."
(anonymous CS246 student)

• Descriptive methods
▪ Find human-interpretable patterns that describe the data
▪ Example: Clustering

• Predictive methods
▪ Use some variables to predict unknown or future values of other variables
▪ Example: Recommender systems

"Definitely take the course if you will be working with massive datasets in the future, either in the industry or in academia."
(anonymous CS246 student)

• This combines the best of machine learning, statistics, artificial intelligence, and databases, but with more stress on:
▪ Scalability (big data)
▪ Algorithms
▪ Computing architectures
▪ Automation for handling large data

[Diagram: CS246 sits at the intersection of Theory/Algorithms, Machine Learning, and Data processing systems]

"The class has a great focus on real-world study cases, so you will learn a lot about realistic ML problems and the solutions being used in practice at places like Netflix, Amazon, Facebook, Pinterest, etc." (anonymous CS246 student)

• We will learn to mine different types of data:
▪ Data is high dimensional
▪ Data is a graph
▪ Data is infinite/never-ending
▪ Data is labeled

• We will learn to use different models of computation:
▪ MapReduce
▪ Streams and online algorithms
▪ Single machine in-memory

• We will learn to solve real-world problems:
▪ Recommender systems
▪ Market Basket Analysis
▪ Spam detection
▪ Data filtering

• We will learn various "tools":
▪ Linear algebra (SVD, Rec. Sys., Communities)
▪ Optimization (stochastic gradient descent)
▪ Dynamic programming (frequent itemsets)
▪ Hashing (LSH, Bloom filters)

An overview of the topics, organized by data type:
▪ High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
▪ Graph data: PageRank, SimRank; Graph Neural Networks; Spam Detection
▪ Infinite data: Filtering data streams, Web advertising, Queries on streams
▪ Machine Learning: Learning Embeddings, Decision Trees, Experimentation
▪ Apps: Recommender systems, Association Rules, Duplicate document detection

Lectures: Tue/Thu 3:00-4:20pm PST
Live in-person (in NVIDIA classroom), recording available on Canvas

• ~70 min lecture:
▪ If you have a clarification question, raise your hand
• ~10 min Q&A:
▪ Ask questions, we will answer and discuss


• Ed:
▪ Use Ed for all questions and public communication
▪ Search the feed before asking a duplicate question
▪ Please tag your posts, and please no one-liners
• For e-mailing course staff, always use: [email protected]
• We will post course announcements to Ed (hence check it regularly!)

Auditors are welcome!
(please send a request to <[email protected]> so we can add you to Canvas)

• High-frequency feedback:
▪ Weekly survey about class morale
▪ We randomly select students to give us feedback on:
▪ Content
▪ Course setup
▪ Anything the teaching team should know/improve
▪ Anything that is confusing to you
▪ ...


• Course website: http://cs246.stanford.edu
▪ Lecture slides (at least 30 min before the lecture)
▪ Homework, solutions, readings posted on Ed/Canvas

• Class textbook: Mining of Massive Datasets by A. Rajaraman, J. Ullman, and J. Leskovec
▪ Sold by Cambridge Uni. Press but available for free at http://mmds.org

• MOOC: www.youtube.com/channel/UC_Oao2FYkLAUlUVkBfze4jg/videos


• Office hours:
▪ TA office hours will be updated on the website http://cs246.stanford.edu by Friday
▪ We start office hours next week!
▪ Office hours will be held on Zoom and use QueueStatus
▪ Links will be posted on Canvas and the course calendar
▪ We will be holding (1) in-person office hours, (2) virtual office hours, and possibly (3) mixed virtual + in-person office hours


• Videos and materials on Canvas
• Spark tutorial:
▪ Details TBA
▪ Follows Colab 0
• Review of basic probability and proof techniques:
▪ Details TBA, handout
• Review of linear algebra:
▪ Details TBA, handout


• 4 longer homeworks: 40%
▪ Four major assignments, involving programming, proofs, and algorithm development
▪ Assignments take lots of time (20+ hours). Start early!!
• How to submit?
▪ Homework write-up:
▪ Submit via Gradescope
▪ Enroll in CS246 on Canvas, and you will be automatically added to the course Gradescope
▪ Homework code:
▪ If the homework requires a code submission, you will find a separate assignment for it on Gradescope, e.g., HW1 (Code)
▪ Forgetting to submit code will result in a point deduction

• Homework schedule:

  Date (23:59 PT)   Out    In
  01/09, Thu        HW1
  01/23, Thu        HW2    HW1
  02/06, Thu        HW3    HW2
  02/20, Thu        HW4    HW3
  03/06, Thu               HW4

▪ Two late periods for HWs for the quarter:
▪ A late period expires on the following Monday 23:59 PST
▪ Can use at most 1 late period per HW

• Short weekly Colab notebooks: 30%
▪ Colab notebooks are posted every Thursday
▪ 10 in total, from 0 to 9, each worth 3%
▪ Due one week later on Thursday 23:59 PST. No late days!
▪ The first 2 Colabs will be posted on Thu, including detailed submission instructions for Gradescope
▪ Colab 0 (Spark Tutorial) is solved step-by-step in the Spark recitation video
▪ Colabs require around 1 hour of work and a few lines of code
▪ "Colab" is a free cloud service from Google, hosting Jupyter notebooks with free access to GPUs and TPUs

• Final exam: 30%
▪ Tentative plan: in-person exam during finals week
▪ Thursday, March 20, 2025, 12:15-3:15pm
• Extra credit: proportional to your contribution (up to 2%)
▪ Course attendance, asking questions, discussion
▪ For participating in Ed discussions
▪ Especially valuable are answers to questions posed by other students
▪ Reporting bugs in course materials


• Programming: Python or Java
• Basic algorithms: CS161 is surely sufficient
• Probability: e.g., CS109 or Stats116
▪ There will be a review session, and a review doc is linked from the class home page
• Linear algebra:
▪ Another review doc + review session is available
• Multivariable calculus
• Database systems (SQL, relational algebra):
▪ CS145 is sufficient but not necessary

• Each of the topics listed is important for a part of the course:
▪ If you are missing an item or two of background, you could consider just-in-time learning of the necessary material.

• The exception is programming:
▪ To do well in this course, you really need to be comfortable with writing code in Python or Java.


• We'll follow the standard CS Dept. approach: you can get help, but you MUST acknowledge the help on the work you hand in

• Failure to acknowledge your sources is a violation of the Honor Code

• We use MOSS to check the originality of your code


• You can talk to others about the algorithm(s) to be used to solve a homework problem,
▪ as long as you then mention their name(s) on the work you submit.
• You should not use others' code or be looking at others' code when you write your own:
▪ Don't search for or post code on GitHub, Copilot, and similar
▪ You can talk to people, but you have to write your own solution/code
▪ If you fail to mention your sources, MOSS will catch it, which will result in an HC violation.

• CS246 is fast paced!
▪ Requires programming maturity
▪ Strong math skills
▪ SCPD students tend to be rusty on math/theory

• Course time commitment:
▪ Homeworks take ~20h
▪ Colab notebooks take about 1h

"The colabs are easy and can be done within an hour, but the homework assignments take a lot more time, so start early!" (CS246 student)

• Form study groups!

CS246 is one of the most useful classes you'll take at Stanford if you want to become a Data Scientist or an ML Engineer.

CS246 is going to be fun and hard work. ☺


• Watch the Colab 0 recitation video
▪ The link will also be posted on Ed
▪ Open OH for Spark questions will be held over the weekend
• Office hours start in the 2nd week
▪ The schedule will be posted on the website
• Upcoming releases:
▪ HW1, Colab 0, and Colab 1 will all be released on Thursday
• Upcoming submissions:
▪ Colab 0 and Colab 1 due on January 18
▪ HW1 due on January 25

• Large-scale computing for data mining problems on commodity hardware
• Challenges:
▪ How do you distribute computation?
▪ How can we make it easy to write distributed programs?
▪ Machines fail:
▪ One server may stay up 3 years (1,000 days)
▪ If you have 1,000 servers, expect to lose 1 per day
▪ With 1M machines, 1,000 machines fail every day!


• Issue: copying data over a network takes time
• Idea:
▪ Bring computation to the data
▪ Store files multiple times for reliability
• Spark/Hadoop address these problems
▪ Storage infrastructure: a file system
▪ Google: GFS. Hadoop: HDFS
▪ Programming model:
▪ MapReduce
▪ Spark

• Problem:
▪ If nodes fail, how do we store data persistently?
• Answer:
▪ A Distributed File System
▪ Provides a global file namespace
• Typical usage pattern:
▪ Huge files (100s of GB to TB)
▪ Data is rarely updated in place
▪ Reads and appends are common


• Chunk servers
▪ Files are split into contiguous chunks
▪ Typically each chunk is 16-64MB
▪ Each chunk is replicated (usually 2x or 3x)
▪ Try to keep replicas in different racks
• Master node
▪ a.k.a. Name Node in Hadoop's HDFS
▪ Stores metadata about where files are stored
▪ Master nodes are typically more robust to hardware failure and run critical cluster services
• Client library for file access
▪ Talks to the master to find chunk servers
▪ Connects directly to chunk servers to access data

• Reliable distributed file system
• Data kept in "chunks" spread across machines
• Each chunk replicated on different machines
▪ Seamless recovery from disk or machine failure

[Diagram: chunks C0, C1, C2, C3, C5, D0, D1 replicated across Chunk servers 1..N. Notation: C2 is chunk no. 2 of file C]

Bring computation directly to the data!
Chunk servers also serve as compute servers

• MapReduce is a style of programming designed for:
1. Easy parallel programming
2. Invisible management of hardware and software failures
3. Easy management of very-large-scale data

• It has several implementations, including Hadoop, Spark (used in this class), Flink, and the original Google implementation, just called "MapReduce"

3 steps of MapReduce
• Map:
▪ Apply a user-written Map function to each input element
▪ A Mapper applies the Map function to a single element
▪ Many mappers are grouped in a Map task (the unit of parallelism)
▪ The output of the Map function is a set of 0, 1, or more key-value pairs
• Group by key: sort and shuffle
▪ The system sorts all the key-value pairs by key and outputs key-(list of values) pairs
• Reduce:
▪ A user-written Reduce function is applied to each key-(list of values)

The outline stays the same; Map and Reduce change to fit the problem

[Diagram: input → Mappers → key-value pairs → Reducers → output]


Example MapReduce task:
• We have a huge text document
• Count the number of times each distinct word appears in the file
• Many applications of this:
▪ Analyze web server logs to find popular URLs
▪ Statistical machine translation:
▪ Need to count the number of times every 5-word sequence occurs in a large corpus of documents


Word count walkthrough:

Big document (read sequentially): "The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need ...'"

MAP (provided by the programmer): read input and produce a set of key-value pairs; the data can be partitioned and processed in parallel:
(The, 1), (crew, 1), (of, 1), (the, 1), (space, 1), (shuttle, 1), (Endeavor, 1), (recently, 1), ...

Group by key: collect all pairs with the same key:
(crew, [1, 1]), (space, [1]), (the, [1, 1, 1]), (shuttle, [1]), (recently, [1]), ...

Reduce (provided by the programmer): collect all values belonging to the key and output:
(crew, 2), (space, 1), (the, 3), (shuttle, 1), (recently, 1), ...

def map(key, value):
    # key: document name; value: text of the document
    for w in value.split():
        emit(w, 1)

def reduce(key, values):
    # key: a word; values: an iterator over counts
    result = 0
    for v in values:
        result += v
    emit(key, result)
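
To make the flow concrete, here is a minimal single-machine sketch of the three phases; this is illustrative only, with the framework's emit replaced by returned lists and the shuffle simulated by a dictionary:

from collections import defaultdict

def map_fn(key, value):
    # Map: emit (word, 1) for every word in the document.
    return [(w, 1) for w in value.split()]

def reduce_fn(key, values):
    # Reduce: sum the counts collected for one word.
    return (key, sum(values))

pairs = map_fn("doc1", "the crew of the space shuttle the crew")

groups = defaultdict(list)   # group by key (sort/shuffle)
for k, v in pairs:
    groups[k].append(v)

counts = [reduce_fn(k, vs) for k, vs in groups.items()]
print(counts)                # [('the', 2), ('crew', 2), ('of', 1), ...]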


[Diagram: the same word count, now distributed across many Map and Reduce tasks]

MAP: read input and produce a set of key-value pairs

Group by key: collect all pairs with the same key
(hash merge, shuffle, sort, partition)

Reduce: collect all values belonging to the key and output


A partitioning function determines which record goes to which reducer.

All phases of MapReduce are distributed, with many tasks doing the work in parallel.
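
A common default is to hash the key, so that all pairs with the same key land on the same reducer; a minimal sketch (frameworks typically let you override this):

def partition(key, num_reducers):
    # Route every pair with the same key to the same reducer.
    return hash(key) % num_reducers
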
The MapReduce environment takes care of:
• Partitioning the input data
• Scheduling the program's execution across a set of machines
• Performing the group-by-key step
▪ In practice this is the bottleneck
• Handling machine failures
• Managing required inter-machine communication


• Map worker failure
▪ Map tasks completed or in-progress at the worker are reset to idle and rescheduled
▪ Reduce workers are notified when a map task is rescheduled on another worker
• Reduce worker failure
▪ Only in-progress tasks are reset to idle, and the reduce task is restarted


MapReduce incurs substantial overheads due to data replication, disk I/O, and serialization
▪ Outputs of the mappers are saved to disk, sorted, and then read again by the reducers (HDFS read, HDFS write)

• Two major limitations of MapReduce:
▪ Difficulty of programming directly in MapReduce
▪ Many big data problems/algorithms aren't easily described as map-reduce
▪ Performance bottlenecks, or batch processing not fitting the use cases
▪ Saving to disk is typically much slower than in-memory work

• In short, MapReduce doesn't compose well for large applications
▪ Many times, one needs to chain multiple map-reduce steps

• MapReduce uses two "ranks" of tasks: one for Map, the second for Reduce
▪ Data flows from the first rank to the second

• Data-flow systems generalize this in two ways:
1. Allow any number of tasks/ranks
2. Allow functions other than Map and Reduce
▪ If data flow is in one direction only (a DAG, directed acyclic graph), we can have the blocking property and allow recovery of individual tasks rather than whole jobs

• Spark is an expressive computing system, not limited to the map-reduce model
• Additions to the MapReduce model:
▪ Fast data sharing
▪ Avoids saving intermediate results to disk
▪ Caches data for repetitive queries (e.g., for machine learning)
▪ General execution graphs (DAGs, directed acyclic graphs)
▪ Richer functions than just map and reduce
• Compatible with Hadoop

• Key construct/idea: Resilient Distributed Dataset (RDD)

• Higher-level APIs: DataFrames & Datasets
▪ Introduced in more recent versions of Spark
▪ Different APIs for aggregate data, which made it possible to introduce SQL support


Key concept: Resilient Distributed Dataset (RDD)
▪ Partitioned collection of records
▪ Generalizes (key-value) pairs
• Spread across the cluster, read-only
• The dataset is cached in memory
▪ Fallback to disk possible
• RDDs can be created from Hadoop, or by transforming other RDDs (you can stack RDDs)
• RDDs are best suited for applications that apply the same operation to all elements of a dataset

• Transformations build RDDs through deterministic operations on other RDDs:
▪ Transformations include map, filter, join, union, intersection, distinct
▪ Lazy evaluation: nothing is computed until an action requires it

• Actions return a value or export data
▪ Actions include count, collect, reduce, save
▪ Actions can be applied to RDDs; actions force calculations and return values
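
As a rough sketch of this split in PySpark (the file name docs.txt and the local master are placeholders), the transformations below build a lineage of RDDs lazily, and only the final action triggers computation:

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.textFile("docs.txt")                   # RDD; nothing is read yet
words = lines.flatMap(lambda line: line.split())  # transformation (lazy)
pairs = words.map(lambda w: (w, 1))               # transformation (lazy)
counts = pairs.reduceByKey(lambda a, b: a + b)    # transformation (lazy)

print(counts.take(5))  # action: forces the whole computation
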
[Diagram: a job as a directed acyclic graph of RDDs A-F split into Stages 1-3, with groupBy, map, filter, and join operations; boxes are RDDs, shaded partitions are cached]

• Supports general directed acyclic task graphs
• Pipelines functions where possible
• Cache-aware data reuse & locality
• Partitioning-aware to avoid shuffles

• DataFrame:
▪ Unlike an RDD, data is organized into named columns, like a table in a relational database
▪ Imposes a structure onto a distributed collection of data, allowing higher-level abstraction
• Dataset:
▪ Extension of the DataFrame API that provides a type-safe, object-oriented programming interface (compile-time error detection)

Both are built on the Spark SQL engine. Both can be converted back to an RDD.
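
A small PySpark sketch of the DataFrame API; the column names and rows are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("df-sketch").getOrCreate()

df = spark.createDataFrame(
    [("amazon.com", 1200), ("stanford.edu", 300), ("amazon.com", 800)],
    ["host", "bytes"],
)

# Named columns enable relational-style operations (and SQL via spark.sql).
df.groupBy("host").sum("bytes").show()
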
• Spark SQL
▪ Scalable processing of relational data
• Spark Streaming
▪ Stream processing of live data streams
• MLlib
▪ Scalable machine learning
• GraphX
▪ Graph manipulation
▪ Extends the Spark RDD with a Graph abstraction: a directed multigraph with properties attached to each vertex and edge

• Performance: Spark is normally faster, but with caveats
▪ Spark can process data in-memory, while Hadoop MapReduce persists back to disk after a map or reduce action
▪ Spark generally outperforms MapReduce, but it often needs lots of memory to perform well; if there are other resource-demanding services or the data can't fit in memory, Spark degrades
▪ MapReduce easily runs alongside other services with minor performance differences, and works well with the 1-pass jobs it was designed for
• Ease of use: Spark is easier to program (higher-level APIs)
• Data processing: Spark is more general

• Suppose we have a large web corpus
• Look at the metadata file
▪ Lines of the form: (URL, size, date, ...)
• For each host, find the total number of bytes
▪ That is, the sum of the page sizes for all URLs from that particular host

• Other examples:
▪ Link analysis and graph processing
▪ Machine learning algorithms
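
In the MapReduce style, a sketch of this task: Map parses a metadata line and emits (host, size); Reduce sums the sizes for each host. The comma-separated line format and the function names are assumptions for illustration:

from urllib.parse import urlparse

def map_fn(line):
    # line: "URL,size,date,..." (assumed format, absolute URLs like http://host/path)
    url, size, *_ = line.split(",")
    return (urlparse(url).netloc, int(size))

def reduce_fn(host, sizes):
    # Total bytes for one host.
    return (host, sum(sizes))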


• Statistical machine translation:
▪ Need to count the number of times every 5-word sequence occurs in a large corpus of documents

• Very easy with MapReduce:
▪ Map:
▪ Extract (5-word sequence, count) from each document
▪ Reduce:
▪ Combine the counts
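
A sketch of the Map side for the 5-word-sequence count (the Reduce side is the same summation as in word count; emit stands for the framework-provided output call):

def map_fn(doc_id, text):
    words = text.split()
    # Emit every consecutive 5-word sequence with a count of 1.
    for i in range(len(words) - 4):
        emit(tuple(words[i:i + 5]), 1)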


• Compute the natural join R(A,B) ⋈ S(B,C)
• R and S are each stored in files
• Tuples are pairs (a,b) or (b,c)

      R             S           R ⋈ S
   A     B       B     C       A     C
   a1    b1      b2    c1      a3    c1
   a2    b1      b2    c2      a3    c2
   a3    b2      b3    c3      a4    c3
   a4    b3


• Use a hash function h from B-values to 1...k
• A Map process turns:
▪ Each input tuple R(a,b) into the key-value pair (b,(a,R))
▪ Each input tuple S(b,c) into (b,(c,S))

• Map processes send each key-value pair with key b to Reduce process h(b)
▪ Hadoop does this automatically; just tell it what k is.
• Each Reduce process matches all the pairs (b,(a,R)) with all (b,(c,S)) and outputs (a,b,c).
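
A minimal single-machine sketch of this join, with the "send to reducer h(b)" step simulated by a dictionary keyed on b; the tags "R" and "S" record which relation a tuple came from:

from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

# Map: key each tuple on its B-value, tagged with its relation.
groups = defaultdict(list)
for a, b in R:
    groups[b].append(("R", a))
for b, c in S:
    groups[b].append(("S", c))

# Reduce: for each b, pair every R-value with every S-value.
joined = []
for b, vals in groups.items():
    r_vals = [x for tag, x in vals if tag == "R"]
    s_vals = [x for tag, x in vals if tag == "S"]
    joined.extend((a, b, c) for a in r_vals for c in s_vals)

print(joined)  # [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]
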
• MapReduce is great for:
▪ Problems that require sequential data access
▪ Large batch jobs (not interactive, real-time)

• MapReduce is inefficient for problems where random (or irregular) access to data is required:
▪ Graphs
▪ Interdependent data
▪ Machine learning
▪ Comparisons of many pairs of items


• In MapReduce we quantify the cost of an algorithm using:
1. Communication cost = total I/O of all processes
2. Elapsed communication cost = max of I/O along any path
3. (Elapsed) computation cost: analogous, but counting only the running time of processes

Note that here big-O notation is not the most useful (adding more machines is always an option)


• For a map-reduce algorithm:
▪ Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes
▪ Q: Why is there a factor of 2?
▪ Elapsed communication cost is the sum of the largest input + output for any Map process, plus the same for any Reduce process
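
As a worked example with made-up numbers: if the input file is 1 TB, the Map processes pass 2 TB of intermediate files to the Reduce processes, and the Reduce output is 0.5 TB, the communication cost is 1 + 2 × 2 + 0.5 = 5.5 TB of total I/O. One way to see the factor of 2: each intermediate file is written once by a Map process and read once by a Reduce process, and total I/O counts both.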


• Either the I/O (communication) cost or the processing (computation) cost dominates
▪ Ignore one or the other

• Total cost tells you what you pay in rent to your friendly neighborhood cloud

• Elapsed cost is wall-clock time using parallelism


• Total communication cost of joining R and S
= O(|R| + |S| + |R ⋈ S|)
• Elapsed communication cost = O(s)
▪ We put a limit s on the amount of input or output that any one process can have; s could be:
▪ What fits in main memory
▪ What fits on local disk
▪ We're going to pick k and the number of Map processes so that the I/O limit s is respected
• With proper indexes, computation cost is linear in the input + output size
▪ So, computation cost is like communication cost
