01 Intro
material useful for giving your own lectures. Feel free to use these slides verbatim, or to
modify them to fit your own needs. If you make use of a significant portion of these slides
in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
Predictive methods
▪ Use some variables to predict unknown
or future values of other variables
▪ Example: Recommender systems
Locality sensitive hashing | PageRank, SimRank | Filtering data streams | Learning Embeddings | Recommender systems
Dimensionality reduction | Spam Detection | Queries on streams | Experimentation | Duplicate document detection
[Figure: Input data → Mappers → Reducers → Output, shown for a big document. The data is read sequentially; only sequential reads are needed.]
Big document:
  The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. "The work we're doing now -- the robotics we're doing -- is what we're going to need …
Mapper output (key, value):
  (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) …
Grouped by key (key, value):
  (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) …
Reducer output (key, value):
  (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) …
1/7/2025 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 42
map(key, value):
    # key: document name; value: text of the document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    # key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
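The pseudocode above can be tried on a single machine with a minimal Python sketch. The names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative assumptions and not part of any MapReduce API. The driver groups pairs in memory, and splitting on whitespace stands in for real tokenization.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of the document
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterable of counts
    yield (key, sum(values))

def run_mapreduce(documents):
    # Hypothetical single-process driver: map, group by key, then reduce.
    groups = defaultdict(list)
    for name, text in documents.items():
        for k, v in map_fn(name, text):
            groups[k].append(v)
    result = {}
    for k, vs in groups.items():
        for out_key, out_value in reduce_fn(k, vs):
            result[out_key] = out_value
    return result

counts = run_mapreduce({
    "doc1": "the crew of the space shuttle",
    "doc2": "the crew",
})
```

In a real deployment the grouping step is done by the framework across machines; only `map_fn` and `reduce_fn` correspond to what the programmer writes.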
Group by key: Collect all pairs with the same key (Hash merge, Shuffle, Sort, Partition)
Reduce: Collect all values belonging to the key and output
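The group-by-key step can be sketched as hash partitioning followed by a sort within each partition. This is a simplified, in-memory illustration; the function names `partition` and `shuffle`, the choice of `zlib.crc32` as the hash, and the number of reducers are all assumptions made for the example.

```python
import zlib
from itertools import groupby

def partition(key, num_reducers):
    # Deterministic hash, so the same key always reaches the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    # Route each (key, value) pair to the bucket of one reducer.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    # Sort each bucket by key, then collect the values of equal keys.
    grouped = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])
        grouped.append([
            (k, [v for _, v in run])
            for k, run in groupby(bucket, key=lambda kv: kv[0])
        ])
    return grouped

pairs = [("the", 1), ("crew", 1), ("the", 1),
         ("space", 1), ("crew", 1), ("the", 1)]
reducer_inputs = shuffle(pairs, num_reducers=2)
```

Each element of `reducer_inputs` is the list of `(key, values)` groups that one reducer would receive. Because partitioning depends only on the key, all values for a key end up at the same reducer.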
[Figure: DAG of RDDs C:, D:, E:, F: connected by groupBy and join operations, with Stage 1 marked. Legend: box = RDD]
Other examples:
▪ Link analysis and graph processing
▪ Machine Learning algorithms
R            S            R ⋈ S
A   B        B   C        A   C
a1  b1       b2  c1       a3  c1
a2  b1       b2  c2       a3  c2
a3  b2       b3  c3       a4  c3
a4  b3
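One common way to express the join above as map and reduce steps is to key every tuple on B and tag it with the relation it came from. The sketch below assumes that approach. The names `map_r`, `map_s`, and `reduce_join` are hypothetical, and the output keeps only the A and C columns to match the result table (a full natural join would also carry B).

```python
from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

def map_r(a, b):
    # Key on the join attribute B; tag the value with its relation.
    return (b, ("R", a))

def map_s(b, c):
    return (b, ("S", c))

def reduce_join(b, tagged):
    # Pair every A value with every C value that shares this B.
    a_values = [v for tag, v in tagged if tag == "R"]
    c_values = [v for tag, v in tagged if tag == "S"]
    return [(a, c) for a in a_values for c in c_values]

# In-memory stand-in for the group-by-key step.
groups = defaultdict(list)
for a, b in R:
    key, value = map_r(a, b)
    groups[key].append(value)
for b, c in S:
    key, value = map_s(b, c)
    groups[key].append(value)

joined = sorted(
    pair for b, tagged in groups.items() for pair in reduce_join(b, tagged)
)
```

Tuples whose B value appears in only one relation (here `b1`) produce no output, which is why `a1` and `a2` are absent from the result.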