SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors

Ardavan Afshar¹, Kejing Yin², Sherry Yan³, Cheng Qian⁴, Joyce C. Ho⁵, Haesun Park¹, Jimeng Sun⁶

¹ Georgia Institute of Technology  ² Hong Kong Baptist University  ³ Sutter Health  ⁴ IQVIA
⁵ Emory University  ⁶ University of Illinois at Urbana-Champaign
Background: CP Tensor Factorization
$$ \mathcal{X} \approx \hat{\mathcal{X}} = \llbracket A^{(1)}, A^{(2)}, \ldots, A^{(N)} \rrbracket = \sum_{r=1}^{R} a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)} $$
1 Tamara G Kolda and Brett W Bader. “Tensor decompositions and applications”. In: SIAM Review (2009).
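As a concrete illustration of the reconstruction above, here is a minimal NumPy sketch for a third-order tensor (the factor sizes and rank are arbitrary and purely illustrative):

```python
import numpy as np

# Hypothetical rank-R factor matrices for a third-order tensor of size 30 x 20 x 10.
R = 5
A1, A2, A3 = np.random.rand(30, R), np.random.rand(20, R), np.random.rand(10, R)

# CP reconstruction: sum over r of the outer products a_r^(1) o a_r^(2) o a_r^(3).
X_hat = np.einsum('ir,jr,kr->ijk', A1, A2, A3)
print(X_hat.shape)  # (30, 20, 10)
```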
Motivation
Existing tensor factorization models assume certain distributions of the input, for example Gaussian for the classical least-squares CP model², Poisson for nonnegative count data³, and other distributions such as Gamma or Bernoulli under generalized CP⁴.
2 J Carroll and J Chang. “Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition”. In: Psychometrika (1970).
3 E Chi and T Kolda. “On tensors, sparsity, and nonnegative factorizations”. In: SIAM Journal on Matrix Analysis and Applications (2012).
4 D Hong, T Kolda, and J Duersch. “Generalized canonical polyadic tensor decomposition”. In: SIAM Review (2020).
Preliminaries: Wasserstein Distance and Optimal Transport
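As background, the standard (unregularized) optimal transport problem between two nonnegative vectors a ∈ R₊ᴹ and b ∈ R₊ᴺ with equal total mass is commonly written as:

```latex
% Standard optimal transport (Wasserstein) problem between histograms a and b:
W_V(a, b) = \min_{T \in U(a, b)} \langle C, T \rangle,
\qquad
U(a, b) = \left\{\, T \in \mathbb{R}_+^{M \times N} \;\middle|\; T \mathbf{1}_N = a,\; T^{\top} \mathbf{1}_M = b \,\right\}
```

Solving this linear program exactly scales poorly with the problem size, which motivates the efficient alternative introduced on the next slide.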
5 Gabriel Peyré, Marco Cuturi, et al. “Computational optimal transport”. In: Foundations and Trends® in Machine Learning (2019).
Preliminaries: Wasserstein Distance and Optimal Transport
An efficient alternative:
Definition (Entropy-regularized OT problem⁶)
The entropy-regularized OT problem is defined as:

$$ W_V(a, b) = \min_{T \in U(a, b)} \langle C, T \rangle - \frac{1}{\rho} E(T), \qquad (2) $$

where $E(T) = -\sum_{i,j=1}^{M,N} t_{ij} \log(t_{ij})$ is the entropy of $T$.
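For concreteness, here is a minimal NumPy sketch of the standard Sinkhorn iterations that solve this entropy-regularized problem (the function name, iteration count, and stopping behavior are our choices, not necessarily SWIFT's actual implementation):

```python
import numpy as np

def sinkhorn_wv(a, b, C, rho, n_iters=500):
    """Entropy-regularized OT between nonnegative histograms a (length M) and
    b (length N) with ground cost C (M x N); the regularization weight is 1/rho."""
    K = np.exp(-rho * C)                 # Gibbs kernel; larger rho -> closer to exact OT
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iters):
        u = a / (K @ v + 1e-300)         # enforce the row marginals:    T 1 = a
        v = b / (K.T @ u + 1e-300)       # enforce the column marginals: T^T 1 = b
    T = u[:, None] * K * v[None, :]      # optimal plan has the form T = diag(u) K diag(v)
    return T, float(np.sum(T * C))       # plan and transport cost <C, T>

# Note: T @ 1 = u * (K @ v), the closed form SWIFT exploits later to avoid
# materializing every transport plan explicitly.
```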
Challenges
1. The Wasserstein distance is not well-defined for tensors: it is well-defined for vectors, yet vectorizing a tensor yields extremely large vectors, making the problem infeasible to solve.
3. Real-world inputs are often large, sparse, and nonnegative: efficient algorithms are possible only when the sparsity structure is fully utilized.
Our Contributions
We first define the Wasserstein distance for matrices by summing the vector Wasserstein distance over their columns:
Definition (Wasserstein Matrix Distance)
Given a cost matrix $C \in \mathbb{R}_+^{M \times M}$, the Wasserstein distance between two matrices $A = [a_1, \ldots, a_P] \in \mathbb{R}_+^{M \times P}$ and $B = [b_1, \ldots, b_P] \in \mathbb{R}_+^{M \times P}$ is denoted by $W_M(A, B)$ and given by:

$$ W_M(A, B) = \sum_{p=1}^{P} W_V(a_p, b_p) = \min_{\mathbf{T} \in U(A, B)} \langle \mathbf{C}, \mathbf{T} \rangle - \frac{1}{\rho} E(\mathbf{T}), \qquad (3) $$

where $\mathbf{C} = [\underbrace{C, \ldots, C}_{P \text{ times}}]$, $\mathbf{T} = [T_1, \ldots, T_p, \ldots, T_P]$, and the feasible set $U(A, B)$ is given by:

$$ U(A, B) = \left\{ \mathbf{T} \in \mathbb{R}_+^{M \times MP} \;\middle|\; T_p \mathbf{1}_M = a_p,\; T_p^{\top} \mathbf{1}_M = b_p \;\; \forall p \right\} = \left\{ \mathbf{T} \in \mathbb{R}_+^{M \times MP} \;\middle|\; \Delta(\mathbf{T}) = A,\; \Psi(\mathbf{T}) = B \right\}, \qquad (4) $$
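A direct, if naive, sketch of eq. (3) in its "sum over columns" form, reusing the same Sinkhorn routine as above (the P column problems are solved independently here; SWIFT instead batches and parallelizes them, as shown later):

```python
import numpy as np

def sinkhorn_cost(a, b, C, rho, n_iters=500):
    """Entropy-regularized OT cost W_V(a, b) (same Sinkhorn sketch as before)."""
    K = np.exp(-rho * C)
    u, v = np.ones(len(a)), np.ones(len(b))
    for _ in range(n_iters):
        u = a / (K @ v + 1e-300)
        v = b / (K.T @ u + 1e-300)
    return float(np.sum((u[:, None] * K * v[None, :]) * C))

def wasserstein_matrix(A, B, C, rho):
    """W_M(A, B): sum of column-wise OT costs between matching columns of A and B."""
    return sum(sinkhorn_cost(A[:, p], B[:, p], C, rho) for p in range(A.shape[1]))
```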
Defining Wasserstein Tensor Distance
Then we define the Wasserstein distance for tensors by summing the matrix Wasserstein distance over the matricizations along each mode of the tensor:
Definition (Wasserstein Tensor Distance)
The Wasserstein distance between an $N$-th order tensor $\mathcal{X} \in \mathbb{R}_+^{I_1 \times \ldots \times I_N}$ and its reconstruction $\hat{\mathcal{X}} \in \mathbb{R}_+^{I_1 \times \ldots \times I_N}$ is denoted by $W_T(\hat{\mathcal{X}}, \mathcal{X})$:

$$ W_T(\hat{\mathcal{X}}, \mathcal{X}) = \sum_{n=1}^{N} W_M\!\left(\hat{X}_{(n)}, X_{(n)}\right) \equiv \sum_{n=1}^{N} \min_{T_n \in U(\hat{X}_{(n)}, X_{(n)})} \langle C_n, T_n \rangle - \frac{1}{\rho} E(T_n), \qquad (5) $$

The Wasserstein distance $W_T(\mathcal{X}, \mathcal{Y})$ defined above is a valid distance and satisfies the metric axioms of positivity, symmetry, and the triangle inequality.
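Continuing the sketch, the tensor distance of eq. (5) simply sums the matrix distance over all mode-n matricizations (the `unfold` below is one standard matricization; the column ordering does not matter as long as both tensors are unfolded the same way):

```python
import numpy as np

def unfold(X, mode):
    """Mode-n matricization X_(n): the mode-n fibers become the columns."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def wasserstein_tensor(X_hat, X, costs, rho):
    """W_T(X_hat, X): sum of Wasserstein matrix distances over all modes,
    with one ground-cost matrix costs[n] per mode (uses wasserstein_matrix
    from the previous sketch)."""
    return sum(
        wasserstein_matrix(unfold(X_hat, n), unfold(X, n), costs[n], rho)
        for n in range(X.ndim)
    )
```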
Defining Wasserstein Tensor Distance
Illustration of the Wasserstein distances. Left: Wasserstein matrix distance; right: Wasserstein tensor distance.
Wasserstein Tensor Factorization
SWIFT minimizes the Wasserstein tensor distance between input and its CP reconstruction:
Optimization problem

$$ \underset{\{A_n \ge 0,\; T_n\}_{n=1}^{N}}{\text{minimize}} \;\; \sum_{n=1}^{N} \langle C_n, T_n \rangle - \frac{1}{\rho} E(T_n) $$
$$ \text{subject to} \quad \hat{\mathcal{X}} = \llbracket A_1, \ldots, A_N \rrbracket, \qquad T_n \in U\!\left(\hat{X}_{(n)}, X_{(n)}\right), \;\; n = 1, \ldots, N $$

Note that $T_n = [T_{n1}, \ldots, T_{nj}, \ldots, T_{nI_{(-n)}}] \in \mathbb{R}_+^{I_n \times I_n I_{(-n)}}$.
The number of optimal transport problems to solve is $I_{(-n)} = I_1 \times \cdots \times I_{n-1} \times I_{n+1} \times \cdots \times I_N$.
Instead, we use the property of the OT solution $T_{nj}^{*} \mathbf{1} = \mathrm{diag}(u_j) K_n v_j = u_j \ast (K_n v_j)$; therefore:
Proposition 2
Efficient Algorithms: 1. Solving for OT Problems (T_n)
(Figure: the scaling updates involving X̂_(n), K_n, X_(n), K_nᵀ, and U_n, with the per-column computations carried out in parallel.)
SWIFT exploits the sparsity structure of the input data X_(n) and drops its all-zero columns, as sketched below.
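A small sketch of that sparsity trick, assuming a SciPy sparse matricization (the actual SWIFT code paths are not shown in the slides):

```python
import numpy as np
import scipy.sparse as sp

def nonzero_column_batch(X_n_sparse, Xhat_n):
    """Keep only the columns of X_(n) that contain at least one nonzero, so the
    per-column OT problems (and the parallel scaling updates) are solved only
    for those columns."""
    X_n = sp.csc_matrix(X_n_sparse)
    nz = np.flatnonzero(np.diff(X_n.indptr))        # columns with nonzero entries
    return nz, np.asarray(X_n[:, nz].todense()), Xhat_n[:, nz]
```

The retained columns can then be processed together as one batch, consistent with the parallel computation illustrated above.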
Efficient Algorithms: 2. Updating CP factors (A_n)
Challenge: A_n is also involved in the Khatri-Rao product A^(−i).
By applying a rearrangement operator to the objective, the Khatri-Rao product term no longer contains the factor matrix A_n.
Efficient Algorithms: 2. Updating CP factors (A_n)
With the rearranged objective function, the factor matrix A_n can be efficiently updated via multiplicative update rules⁸.
8 Daniel D Lee and H Sebastian Seung. “Algorithms for non-negative matrix factorization”. In: Advances in Neural Information Processing Systems. 2001.
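For reference, a minimal sketch of the classic Lee & Seung multiplicative updates cited above, shown here for plain nonnegative matrix factorization under the Frobenius loss; SWIFT applies the same multiply-by-a-nonnegative-ratio mechanism to the gradient terms of its rearranged Wasserstein objective, so the actual numerators and denominators differ:

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iters=200, eps=1e-12):
    """Lee & Seung multiplicative updates for V ~= W @ H with V, W, H >= 0."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(n_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # elementwise ratio keeps H nonnegative
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # same mechanism for W
    return W, H
```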
Experiments: Datasets and Evaluation Metrics
BBC News⁹
• a third-order count tensor of size 400 articles × 100 words × 100 words
• downstream task: article category classification; evaluated by accuracy.
Sutter
• a dataset collected from a large real-world health provider network
• a third-order binary tensor of size 1000 patients × 100 diagnoses × 100 medications
• downstream task: heart failure onset prediction; evaluated by PR-AUC.
We use the pairwise cosine distance to compute the cost matrices for each mode of the two datasets.
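A sketch of how such a cost matrix can be built for one mode (the feature vectors representing each mode's entities, e.g. words or diagnosis codes, are an assumption on our part; the slides only state that pairwise cosine distance is used):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cosine_cost_matrix(features):
    """Pairwise cosine-distance ground-cost matrix C_n for one tensor mode,
    given one feature vector per entity of that mode."""
    C = cdist(features, features, metric="cosine")  # distances in [0, 2]
    np.fill_diagonal(C, 0.0)                        # keeping mass in place costs nothing
    return C
```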
9 Derek Greene and Pádraig Cunningham. “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”. In: International Conference on Machine Learning. 2006.
Experiments: Baselines
We compare against the following tensor factorization models with different loss functions:
Model                 Loss Type    Underlying Distribution Assumption   Reference
CP-ALS                MSE Loss     Gaussian                             (Bader & Kolda 2007)
CP-NMU                MSE Loss     Gaussian                             (Bader & Kolda 2007)
Supervised CP         MSE Loss     Gaussian                             (Kim et al. 2017)
Similarity-based CP   MSE Loss     Gaussian                             (Kim et al. 2017)
CP-Continuous         Gamma Loss   Gamma                                (Hong et al. 2020)
CP-Binary             Log Loss     Bernoulli                            (Hong et al. 2020)
CP-APR                KL Loss      Poisson                              (Chi & Kolda 2012)
Experimental Results: Classification Performance
(Figures: downstream classification performance of SWIFT and all baselines, accuracy on BBC News and PR-AUC on Sutter.)
Experimental Results: Classification Performance on Noisy Data
We inject random noise into the BBC News data and run all models on the noisy data:
(Figure: classification accuracy vs. noise level from 0.00 to 0.30 for CP-ALS, CP-NMU, Supervised CP, Similarity-Based CP, CP-Gamma, CP-Binary, CP-APR, and SWIFT.)
SWIFT outperforms all baselines, especially for medium and high noise levels.
Experimental Results: Scalability of SWIFT
We set R = 40, switch off all parallelization in SWIFT for a fair comparison, and measure the running time of all models.
Experimental Results: Interpretability of SWIFT
We interpret the factor matrices learned from the Sutter dataset. The following are three examples:
Conclusion
• We define the Wasserstein distance between two tensors and propose SWIFT, a Wasserstein
tensor factorization model.
• We derive an efficient learning algorithm by exploiting the sparsity structure and introducing an efficient rearrangement operator.
Thank you!
All questions and comments are greatly appreciated!