Machine Learning &
Big Data @
Andy Sloane
@a1k0n
http://a1k0n.net
Madison Big Data Meetup
Jan 27, 2015
Big data?
60M Monthly Active Users (MAU)
50M tracks in our catalog
...But many are identical copies from different
releases (e.g. US and UK releases of the same
album)
...and only 4M unique songs have been listened to
>500 times
Big data?
Raw material: application logs, delivered via Apache
Kafka
Wake Me Up by Avicii has been played 330M times, by
~6M different users
"EndSong": 500GB / day
...But aggregated per-user play counts for a whole
year fit in ~60GB ("medium data")
Hadoop @ Spotify
900 nodes (all in London datacenter)
34 TB RAM total
~16000 typical concurrent tasks (mappers/reducers)
2GB RAM per mapper/reducer slot
What do we need ML for?
Recommendations
Related Artists
Radio
Recommendations
The Discover page
4M tracks x 60M active users, rebuilt daily
The Discover page
Okay, but how do we come up with recommendations?
Collaborative filtering!
Collaborative filtering
Great, but how does that actually work?
Each time a user plays something, add it to a matrix
Compute similarity, somehow, between items based on
who played what
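A toy sketch of that accumulation (my illustration, not the production pipeline):

from collections import defaultdict

plays = defaultdict(int)  # sparse user x item matrix as a dict

def record_play(user_id, track_id):
    # each play bumps one cell of the (user, item) matrix
    plays[(user_id, track_id)] += 1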
Collaborative filtering
So compute some distance between every pair of rows
and columns
That's just $O(60\mathrm{M}^2/2) = O(1.8 \times 10^{15})$ operations... O_O
We need a better way...
(BTW: Twitter has a decent approximation that can actually make this work, called DIMSUM:
https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum)
I've tried it but don't have results to report here yet :(
Collaborative filtering
Latent factor models
Instead, we use a "small" representation for each user &
item: $f$-dimensional vectors (here, $f = 2$)
and approximate the big matrix with it.
Why vectors?
Very compact representation of musical style or user's
taste
Only like 40-200 elements (2 shown above for
illustration)
Why vectors?
Dot product between items = similarity between items
Dot product between user and item vectors = good/bad
recommendation

user · item:
   2 × 4  =  8
  −4 × 0  =  0
   2 × −2 = −4
  −1 × 5  = −5
  sum     = −1
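The same arithmetic as a numpy one-liner (same made-up vectors as above):

import numpy as np

user = np.array([2, -4, 2, -1])  # made-up user taste vector
item = np.array([4, 0, -2, 5])   # made-up item style vector
print(user @ item)               # 8 + 0 - 4 - 5 = -1: a poor match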
Recommendations via dot products
Another example of tracks in two
dimensions
Implicit Matrix Factorization
Hu, Koren, Volinsky - Collaborative Filtering for Implicit
Feedback Datasets
Tries to predict whether user $u$ listens to item $i$:

$$P = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix} \approx X\,Y^T$$

$Y$ is all item vectors, $X$ is all user vectors
"implicit" because users don't tell us what they like, we
only observe what they do/don't listen to
Implicit Matrix Factorization
Goal: make $x_u \cdot y_i$ close to 1 for things each user has
listened to, 0 for everything else.
$x_u$ — user $u$'s vector
$y_i$ — item $i$'s vector
$p_{ui}$ — 1 if user $u$ played item $i$, 0 otherwise
$c_{ui}$ — "confidence", ad-hoc weight based on number of
times user $u$ played item $i$; e.g., $1 + \alpha \cdot \mathrm{count}$
$\lambda$ — regularization penalty to avoid overfitting

Minimize:

$$\sum_{u,i} c_{ui} \left( p_{ui} - x_u^T y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right)$$
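As a sanity check, the objective is one numpy expression on dense toy matrices (names are mine; the real data is far too large and sparse for this):

import numpy as np

def implicit_mf_loss(C, P, X, Y, lam):
    # C, P: (n_users, n_items) confidences and 0/1 play indicators
    # X: (n_users, f) user vectors; Y: (n_items, f) item vectors
    err = (C * (P - X @ Y.T) ** 2).sum()
    reg = lam * ((X ** 2).sum() + (Y ** 2).sum())
    return err + reg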
Alternating Least Squares
Solution: alternate solving for all users $x_u$:

$$x_u = \left( Y^T Y + Y^T (C^u - I) Y + \lambda I \right)^{-1} Y^T C^u p_{u\cdot}$$

and all items $y_i$:

$$y_i = \left( X^T X + X^T (C^i - I) X + \lambda I \right)^{-1} X^T C^i p_{\cdot i}$$

$Y^T Y$ = $f \times f$ matrix, sum of outer products of all items
$Y^T (C^u - I) Y$ = same, except only over items the user played
$Y^T C^u p_{u\cdot}$ = weighted $f$-dimensional sum of the items the
user played
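A minimal dense numpy sketch of the user half of one iteration (items fixed), directly transcribing the update above; toy sizes and variable names are mine:

import numpy as np

n_users, n_items, f, lam, alpha = 5, 8, 2, 0.1, 40.0
rng = np.random.default_rng(0)

counts = rng.integers(0, 3, size=(n_users, n_items))  # toy play counts
P = (counts > 0).astype(float)      # p_ui: 1 if played, else 0
C = 1.0 + alpha * counts            # c_ui: confidence weights
X = rng.normal(size=(n_users, f))   # user vectors
Y = rng.normal(size=(n_items, f))   # item vectors

YTY = Y.T @ Y                       # f x f, shared by every user
for u in range(n_users):
    Cu = np.diag(C[u])              # C^u as a diagonal matrix
    A = YTY + Y.T @ (Cu - np.eye(n_items)) @ Y + lam * np.eye(f)
    b = Y.T @ Cu @ P[u]
    X[u] = np.linalg.solve(A, b)    # one f x f solve per user
# then hold X fixed and update Y the same way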
Alternating Least Squares
Key point: each iteration is linear in the size of the input, even
though we are solving for all users x all items, and needs
only $f^2$ memory to solve
No learning rates, just a few tunable parameters ($f$, $\lambda$, $\alpha$)
All you do is add stuff up, solve an $f \times f$ matrix problem,
and repeat!
We use $f = 40$ dimensional vectors for
recommendations
Matrix/vector math using numpy in Python, Breeze in
Scala
Alternating Least Squares
Adding lots of stuff up
Problem: any user (60M) can play any item (4M)
thus we may need to add any user's vector to any
item's vector
If we put user vectors in memory, it takes a lot of RAM!
Worst case: 60M users * 40 dimensions * sizeof(float) =
9.6GB of user vectors
...too big to fit in a mapper slot on our cluster
Adding lots of stuff up
Solution: Split the data into a K × L grid of blocks
Most recent run made a 14 x 112 grid
One map shard
Input is a bunch of (user, item, count) tuples
user is the same modulo K for all tuples in the shard
item is the same modulo L for all tuples in the shard
e.g., if K = 4, mapper #1 gets users 1, 5, 9, 13, ...
Add up vectors from every data point
Then flip users ↔items and repeat!
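A sketch of the grid assignment, assuming integer user and item IDs (K and L from the run above):

K, L = 14, 112  # grid shape from the most recent run

def grid_cell(user_id, item_id):
    # every (user, item, count) tuple lands in exactly one cell, so the
    # mapper for cell (k, l) only needs user vectors with user_id % K == k
    return (user_id % K, item_id % L)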
Adding stuff up
Input: (user, item, count) tuples

def mapper(self, input):  # Luigi-style python job
    user, item, count = parse(input)
    conf = AdHocConfidenceFunction(count)  # e.g. 1 + alpha*count
    # add up user vectors from previous iteration
    term1 = conf * self.user_vectors[user]
    term2 = np.outer(self.user_vectors[user],
                     self.user_vectors[user]) * (conf - 1)
    yield item, (term1, term2)

def reducer(self, item, terms):
    # element-wise sums of term1 (f-vector) and term2 (f x f matrix)
    term1, term2 = map(sum, zip(*terms))
    item_vector = np.linalg.solve(
        self.YTY + term2 + self.l2penalty * np.identity(self.dim),
        term1)
    yield item, item_vector
Alternating Least Squares
Implemented in a Java map-reduce framework which
runs other models, too
After about 20 iterations, we converge
Each iteration takes about 20 minutes, so about 7-8
hours total
Recomputed from scratch weekly
User vectors recomputed daily, keeping items fixed
So we have vectors, now what?
60M users x 4M recommendable items
Finding Recommendations
For each user, how do we find the best items given
their vector?
Brute force is O(60M x 4M x 40) ≈ 9.6 peta-operations!
Instead, use an approximation based on locality-sensitive
hashing (LSH)
Approximate Nearest Neighbors /
Locality-Sensitive Hashing
Annoy - github.com/spotify/annoy
Pre-built read-only database of item vectors
Internally, recursively splits the space with random hyperplanes
Nearby points are likely to land on the same side of each random split
Builds several random trees (a forest) for better
approximation
Given an $f$-dimensional query vector, finds similar items
in the database
Index loads via mmap, so all processes on the same
machine share RAM
Queries are very, very fast, but approximate
Python implementation available, Java forthcoming
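A minimal sketch of the Python API (the 'angular' metric argument follows current Annoy; item_vectors is random stand-in data, not real item vectors):

import numpy as np
from annoy import AnnoyIndex

f = 40
item_vectors = np.random.normal(size=(1000, f))  # stand-in item vectors

index = AnnoyIndex(f, 'angular')       # angular distance ~ cosine
for i, vec in enumerate(item_vectors):
    index.add_item(i, vec)
index.build(10)                        # build 10 random trees
index.save('items.ann')                # read-only, mmap-able file

lookup = AnnoyIndex(f, 'angular')
lookup.load('items.ann')               # loads via mmap
neighbors = lookup.get_nns_by_vector(item_vectors[0], 10)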
Generating recommendations
Annoy index for all items is only 1.2GB
I have one on my laptop... Live demo!
Could serve up nearest neighbors at load time, but we
precompute Discover on Hadoop
Generating recommendations in parallel
Ship the Annoy index in the distributed cache and load it via
mmap in each map-reduce process
Reducer loads vectors + user stats, looks up approximate
nearest neighbors, generates recommendations.
Related Artists
Great for music discovery
Essential for finding believable reasons for latent
factor-based recommendations
When generating recommendations, run through a list
of related artists to find potential reasons
Similar items use cosine distance
Cosine is similar to dot product; just add a
normalization step
Helps "factor out" popularity from similarity
Related Artists
How we build it
Similar to user recommendations, but with more
models, not necessarily collaborative filtering based
Implicit Matrix Factorization (shown previously)
"Vector-Exp", similar model but probabilistic in
nature, trained with gradient descent
Google word2vec on playlists
Echo Nest "cultural similarity" — based on scraping
web pages about music!
Query ANNs to generate candidates
Score candidates from all models, combine and rank
Pre-build table of 20 nearest artists to each artist
Radio
ML-wise, exactly the same as Related Artists!
For each track, generate candidates with ANN from
each model
Score w/ all models, rank with ensemble
Store top 250 nearest neighbors in a database
(Cassandra)
User plays radio → load 250 tracks and shuffle
Thumbs up → load more tracks from the thumbed-up
song
Thumbs down → remove that song / re-weight tracks
Upcoming work
Deep learning based item similarity
http://benanne.github.io/2014/08/05/spotify-cnns.html
Upcoming work
Audio fingerprint based
content deduplication
~1500 Echo Nest Musical Fingerprints per track
Min-Hash based matching to accelerate all-pairs
similarity
Fast connected components using the Hash-to-Min
algorithm: O(log d) map-reduce steps
http://arxiv.org/pdf/1203.5387.pdf
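A toy MinHash sketch (my illustration of the general technique, not Spotify's pipeline): tracks whose signatures mostly agree are duplicate candidates, so only those pairs need a full comparison:

import numpy as np

PRIME = 2_147_483_647  # 2^31 - 1

def minhash_signature(fingerprints, n_hashes=32, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.integers(1, PRIME, size=n_hashes, dtype=np.int64)
    b = rng.integers(0, PRIME, size=n_hashes, dtype=np.int64)
    fp = np.array(sorted(fingerprints), dtype=np.int64) % PRIME
    # apply every hash function to every fingerprint; keep the min per row
    return ((a[:, None] * fp[None, :] + b[:, None]) % PRIME).min(axis=1)

sig1 = minhash_signature({101, 205, 377, 905})
sig2 = minhash_signature({101, 205, 377, 906})
jaccard_estimate = (sig1 == sig2).mean()  # fraction of agreeing min-hashes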
Thanks!
I can be reached here:
Andy Sloane
Email: andy@a1k0n.net
Twitter: @a1k0n
http://a1k0n.net
Special thanks to Erik Bernhardsson, whose slides I
plagiarized mercilessly
More Related Content

Collaborative Filtering at Spotify
Algorithmic Music Recommendations at Spotify
Machine Learning and Big Data for Music Discovery at Spotify
Interactive Recommender Systems with Netflix and Spotify
Music recommendations @ MLConf 2014
ML+Hadoop at NYC Predictive Analytics
Scala Data Pipelines for Music Recommendations
CF Models for Music Recommendations At Spotify
What's hot

Recommending and Searching (Research @ Spotify)
From Idea to Execution: Spotify's Discover Weekly
Homepage Personalization at Spotify
Music Personalization At Spotify
DataEngConf: Building a Music Recommender System from Scratch with Spotify Da...
Music Recommendations at Scale with Spark
Personalizing the listening experience
Recommending and searching @ Spotify
Personalized Playlists at Spotify
The Evolution of Hadoop at Spotify - Through Failures and Pain
Building Data Pipelines for Music Recommendations at Spotify
Introduction to Recommendation Systems
Big data and machine learning @ Spotify
Past, present, and future of Recommender Systems: an industry perspective
Personalized Page Generation for Browsing Recommendations
Approximate nearest neighbor methods and vector models – NYC ML meetup
Music Personalization : Real time Platforms.
Spotify Discover Weekly: The machine learning behind your music recommendations
Big Data At Spotify
Artwork Personalization at Netflix
Similar to Machine learning @ Spotify - Madison Big Data Meetup

Introduction to recommender systems
Recsys 2018 overview and highlights
Machine learning
Collaborative filtering at scale
TypeScript and Deep Learning
Scala Data Pipelines @ Spotify
Collaborative Filtering with Spark
D3, TypeScript, and Deep Learning
Standardizing arrays -- Microsoft Presentation
New zealand bloom filter
How to make intelligent web apps
MS CS - Selecting Machine Learning Algorithm
Cite References.Classification in Discriminant Analysis Discussi.docx
Introduction to Python_for_machine_learning.pdf
Project - Deep Locality Sensitive Hashing
Tg noh jeju_workshop
Monads and Monoids by Oleksiy Dyagilev
Recommendation Engine Powered by Hadoop
Recommendation Engine Powered by Hadoop - Pranab Ghosh