
SWIFT: Scalable Wasserstein Factorization for Sparse Nonnegative Tensors

Ardavan Afshar¹  Kejing Yin²  Sherry Yan³  Cheng Qian⁴  Joyce C. Ho⁵  Haesun Park¹  Jimeng Sun⁶

¹ Georgia Institute of Technology   ² Hong Kong Baptist University   ³ Sutter Health   ⁴ IQVIA
⁵ Emory University   ⁶ University of Illinois at Urbana-Champaign
Background: CP Tensor Factorization

CP factorization¹ approximates a tensor \mathcal{X} as the sum of R rank-one tensors:

\mathcal{X} \approx \hat{\mathcal{X}} = [\![ A^{(1)}, A^{(2)}, \dots, A^{(N)} ]\!] = \sum_{r=1}^{R} a_r^{(1)} \circ a_r^{(2)} \circ \dots \circ a_r^{(N)},

• A^{(n)}: the factor matrix for the n-th mode.
• a_r^{(n)}: the r-th column of A^{(n)}.

[Figure: an example of CP factorization applied to a patient-by-diagnosis-by-medication tensor.]

• It is widely used in various applications, e.g., healthcare data analytics.


• It is highly interpretable: each rank-one tensor can be treated as a latent factor.

1 Tamara G Kolda and Brett W Bader. “Tensor decompositions and applications”. In: SIAM Review (2009).
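
To make the notation concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) of assembling a rank-R CP reconstruction from a list of factor matrices:

```python
# Minimal NumPy sketch: assembling a rank-R CP reconstruction from factor matrices.
# Shapes and variable names are illustrative, not taken from the SWIFT code base.
import numpy as np

def cp_reconstruct(factors):
    """Return sum_r a_r^(1) o a_r^(2) o ... o a_r^(N) for factors A^(n) of shape (I_n, R)."""
    R = factors[0].shape[1]
    shape = tuple(A.shape[0] for A in factors)
    X_hat = np.zeros(shape)
    for r in range(R):
        # The outer product of the r-th columns gives one rank-one component.
        component = factors[0][:, r]
        for A in factors[1:]:
            component = np.multiply.outer(component, A[:, r])
        X_hat += component
    return X_hat

# Example: a third-order patient-by-diagnosis-by-medication style tensor with R = 4.
rng = np.random.default_rng(0)
A1, A2, A3 = (rng.random((I, 4)) for I in (30, 10, 12))
print(cp_reconstruct([A1, A2, A3]).shape)  # (30, 10, 12)
```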

Motivation

Existing tensor factorization models assume certain distributions of input, for example:

• Gaussian distribution: \min_{\hat{\mathcal{X}}} \|\mathcal{X} - \hat{\mathcal{X}}\|_F^2 ← MSE loss²
• Poisson distribution: \min_{\hat{\mathcal{X}}} \hat{\mathcal{X}} - \mathcal{X} * \log(\hat{\mathcal{X}}) ← KL divergence³
• Bernoulli distribution: \min_{\hat{\mathcal{X}}} \log(1 + e^{\hat{\mathcal{X}}}) - \mathcal{X} * \hat{\mathcal{X}} ← logit loss⁴

Do we always know the distribution of a given input tensor?


• Real-world data often have very complex distributions.
• We usually do not know the underlying distribution of the input tensor.

2 J Carroll and J Chang. “Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition”. In: Psychometrika (1970).
3 E Chi and T Kolda. “On tensors, sparsity, and nonnegative factorizations”. In: SIAM Journal on Matrix Analysis and Applications (2012).
4 D Hong, T Kolda, and J Duersch. “Generalized canonical polyadic tensor decomposition”. In: SIAM Review (2020).

Motivation

Instead of assuming a specific distribution, the Wasserstein distance can be an alternative.

• a.k.a. Earth Mover's Distance (EMD);
• is a potentially better measure of the difference between two distributions;
• does not assume any particular distribution of the input data; and
• can leverage the correlation structure within each mode through the choice of cost matrix.

Preliminaries: Wasserstein Distance and Optimal Transport

Definition (Wasserstein distance between vectors)

The Wasserstein distance between probability vectors a and b is defined as

W(a, b) = \langle C, T \rangle,    (1)

• C is the cost matrix, where c_{ij} is the cost of moving a_i to b_j.
• T \in U(a, b) is an Optimal Transport (OT) solution between a and b.
• U(a, b) = \{ T \in \mathbb{R}_+^{n \times m} \mid T 1_m = a,\ T^\top 1_n = b \} is the feasible set of the OT problem.

Solving this OT problem is very expensive⁵: it has a complexity of O(n³).

5 Gabriel Peyré, Marco Cuturi, et al. “Computational optimal transport”. In: Foundations and Trends® in Machine Learning (2019).

Preliminaries: Wasserstein Distance and Optimal Transport

An efficient alternative:

Definition (Entropy-regularized OT problem⁶)

The entropy-regularized OT problem is defined as:

W_V(a, b) = \min_{T \in U(a,b)} \langle C, T \rangle - \frac{1}{\rho} E(T),    (2)

where E(T) = -\sum_{i,j=1}^{M,N} t_{ij} \log(t_{ij}) is the entropy of T.

• It is strictly convex with a unique solution.
• It can be tackled with scaling vectors u and v such that \operatorname{diag}(u) \exp(-\rho C) \operatorname{diag}(v) \in U(a, b).
• The optimal u and v can be computed via Sinkhorn's algorithm⁷.
6 Marco Cuturi. “Sinkhorn distances: Lightspeed computation of optimal transport”. In: Advances in Neural Information Processing Systems. 2013.
7 Richard Sinkhorn and Paul Knopp. “Concerning nonnegative matrices and doubly stochastic matrices”. In: Pacific Journal of Mathematics (1967)
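
For concreteness, here is a minimal Sinkhorn sketch for problem (2) under our own variable names: with the regularization written as ⟨C, T⟩ − (1/ρ)E(T), the Gibbs kernel is exp(−ρC), and the optimal plan has the form diag(u) exp(−ρC) diag(v). The value of ρ, the fixed iteration count, and the toy data are illustrative assumptions.

```python
# Sketch of Sinkhorn iterations for the entropy-regularized OT problem (2).
# With the regularization written as <C, T> - (1/rho) E(T), the Gibbs kernel is
# K = exp(-rho * C) and the optimal plan has the form diag(u) K diag(v).
# rho, the fixed iteration count, and the toy data below are illustrative assumptions.
import numpy as np

def sinkhorn(a, b, C, rho=50.0, n_iters=500):
    K = np.exp(-rho * C)                 # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)                  # enforce the row marginal    T 1 = a
        v = b / (K.T @ u)                # enforce the column marginal T^T 1 = b
    T = u[:, None] * K * v[None, :]      # transport plan diag(u) K diag(v)
    return np.sum(T * C), T              # <C, T> and the plan itself

# Toy example: two probability vectors on a 1-D grid with cost |x_i - y_j|.
grid = np.linspace(0.0, 1.0, 20)
C = np.abs(grid[:, None] - grid[None, :])
a = np.ones(20) / 20
b = np.exp(-((grid - 0.7) ** 2) / 0.01)
b /= b.sum()
cost, _ = sinkhorn(a, b, C)
print(round(cost, 4))
```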

Challenges

However, applying the Wasserstein distance to tensor factorization is challenging:

1. The Wasserstein distance is not well-defined for tensors: it is well-defined for vectors, yet vectorizing a tensor yields an extremely large vector, making the resulting OT problem infeasible to solve.

2. The Wasserstein distance is difficult to scale: it requires solving many OT problems in each iteration, which is extremely time-consuming.

3. Real-world inputs are often large, sparse, and nonnegative: efficient algorithms are possible only when the sparsity structure is fully exploited.

Our Contributions

Contribution 1: Defining the Wasserstein Tensor Distance

• SWIFT is the first work to define a Wasserstein distance for tensors.
• It does not assume any particular distribution.
• Therefore, it can handle nonnegative inputs, including binary, count, and real-valued data.

Contribution 2: Formulating Wasserstein Tensor Factorization

• We propose the SWIFT model, which minimizes the Wasserstein distance between the input tensor and its CP reconstruction.

Contribution 3: Efficiently Solving Wasserstein Tensor Factorization

• SWIFT exploits the sparsity structure of the input and reduces the number of OT problems that must be solved.
• It further reduces the computational time by efficiently rearranging its sub-problems.
• As a result, it achieves a 921× speedup over a naive implementation.
Defining Wasserstein Tensor Distance

We first define the Wasserstein distance for matrices by summing the vector Wasserstein distance over their columns:

Definition (Wasserstein Matrix Distance)

Given a cost matrix C \in \mathbb{R}_+^{M \times M}, the Wasserstein distance between two matrices A = [a_1, \dots, a_P] \in \mathbb{R}_+^{M \times P} and B = [b_1, \dots, b_P] \in \mathbb{R}_+^{M \times P} is denoted by W_M(A, B) and given by:

W_M(A, B) = \sum_{p=1}^{P} W_V(a_p, b_p) = \min_{T \in U(A, B)} \langle C, T \rangle - \frac{1}{\rho} E(T),    (3)

where C = [C, \dots, C] (P copies), T = [T_1, \dots, T_p, \dots, T_P], and the feasible set U(A, B) is given by:

U(A, B) = \big\{ T \in \mathbb{R}_+^{M \times MP} \mid T_p 1_M = a_p,\ T_p^\top 1_M = b_p\ \forall p \big\} = \big\{ T \in \mathbb{R}_+^{M \times MP} \mid \Delta(T) = A,\ \Psi(T) = B \big\},    (4)

where \Delta(T) = [T_1 1_M, \dots, T_P 1_M] = T (I_P \otimes 1_M), \Psi(T) = [T_1^\top 1_M, \dots, T_P^\top 1_M], and 1_M is the all-ones vector of size M.
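
As a quick sanity check on the notation in Eq. (4), here is a small NumPy sketch (helper names are ours) of the marginal operators ∆ and Ψ acting on a stacked plan T = [T_1, …, T_P]:

```python
# Small NumPy sketch of the marginal operators in Eq. (4); helper names are ours.
# T_bar stacks P transport plans of size M x M side by side, giving an M x MP matrix.
import numpy as np

def delta(T_bar, M, P):
    """Row marginals: Delta(T_bar) = [T_1 1_M, ..., T_P 1_M] = T_bar (I_P kron 1_M)."""
    return T_bar @ np.kron(np.eye(P), np.ones((M, 1)))

def psi(T_bar, M, P):
    """Column marginals: Psi(T_bar) = [T_1^T 1_M, ..., T_P^T 1_M]."""
    return np.hstack([T_bar[:, p * M:(p + 1) * M].T @ np.ones((M, 1)) for p in range(P)])

M, P = 4, 3
T_bar = np.random.default_rng(1).random((M, M * P))
# Both marginal maps return M x P matrices; for a feasible T_bar they equal A and B.
print(delta(T_bar, M, P).shape, psi(T_bar, M, P).shape)  # (4, 3) (4, 3)
```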

Defining Wasserstein Tensor Distance

Then we can define the Wasserstein distance for tensors by summing the matrix Wasserstein distance over the matricizations along each mode of the tensor:

Definition (Wasserstein Tensor Distance)

The Wasserstein distance between an N-th order tensor \mathcal{X} \in \mathbb{R}_+^{I_1 \times \dots \times I_N} and its reconstruction \hat{\mathcal{X}} \in \mathbb{R}_+^{I_1 \times \dots \times I_N} is denoted by W_T(\hat{\mathcal{X}}, \mathcal{X}):

W_T(\hat{\mathcal{X}}, \mathcal{X}) = \sum_{n=1}^{N} W_M\big( \hat{X}_{(n)}, X_{(n)} \big) \equiv \sum_{n=1}^{N} \min_{T_n \in U(\hat{X}_{(n)}, X_{(n)})} \Big\{ \langle C_n, T_n \rangle - \frac{1}{\rho} E(T_n) \Big\},    (5)

where X_{(n)} \in \mathbb{R}_+^{I_n \times I_{(-n)}} is the mode-n matricization of \mathcal{X}, C_n = [C_n, C_n, \dots, C_n] \in \mathbb{R}_+^{I_n \times I_n I_{(-n)}}, and T_n = [T_{n1}, \dots, T_{nj}, \dots, T_{n I_{(-n)}}] \in \mathbb{R}_+^{I_n \times I_n I_{(-n)}}. Here T_{nj} \in \mathbb{R}_+^{I_n \times I_n} is the transport plan between the columns \hat{X}_{(n)}(:, j) \in \mathbb{R}_+^{I_n} and X_{(n)}(:, j) \in \mathbb{R}_+^{I_n}.

The Wasserstein distance W_T(\mathcal{X}, \mathcal{Y}) defined above is a valid distance and satisfies the metric axioms of positivity, symmetry, and the triangle inequality.
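
Eq. (5) only requires the mode-n matricizations. Below is a small NumPy sketch of one common unfolding convention (the helper name is ours; any fixed column ordering works, provided \hat{\mathcal{X}} and \mathcal{X} are unfolded consistently):

```python
# Sketch of the mode-n matricization used in Eq. (5); the helper name is ours.
# One common convention: bring mode n to the front, then flatten the remaining modes.
import numpy as np

def unfold(X, mode):
    """Return X_(mode), of shape (I_mode, product of the remaining dimensions)."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X = np.arange(2 * 3 * 4).reshape(2, 3, 4)
for n in range(3):
    print(n, unfold(X, n).shape)   # (2, 12), (3, 8), (4, 6)
# W_T in Eq. (5) then sums W_M(unfold(X_hat, n), unfold(X, n)) over the N modes.
```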
Defining Wasserstein Tensor Distance

[Figure: illustration of the Wasserstein distances. Left: Wasserstein matrix distance; right: Wasserstein tensor distance.]

Wasserstein Tensor Factorization

SWIFT minimizes the Wasserstein tensor distance between the input and its CP reconstruction:

Optimization problem

\underset{\{A_n \ge 0,\ T_n\}_{n=1}^{N}}{\text{minimize}} \quad \sum_{n=1}^{N} \Big\{ \langle C_n, T_n \rangle - \frac{1}{\rho} E(T_n) \Big\}

subject to \quad \hat{\mathcal{X}} = [\![ A_1, \dots, A_N ]\!], \qquad T_n \in U(\hat{X}_{(n)}, X_{(n)}),\ n = 1, \dots, N

Constraint relaxation using the generalized KL-divergence

\underset{\{A_n \ge 0,\ T_n\}_{n=1}^{N}}{\text{minimize}} \quad \sum_{n=1}^{N} \Big\{ \langle C_n, T_n \rangle - \frac{1}{\rho} E(T_n) + \lambda \big[ \mathrm{KL}\big( \Delta(T_n) \,\|\, A_n (A^{(-n)})^\top \big) + \mathrm{KL}\big( \Psi(T_n) \,\|\, X_{(n)} \big) \big] \Big\}    (6)

The two KL penalty terms are referred to as parts P2 and P3, respectively.

We alternate between A_n and T_n to solve Eq. (6).


Efficient Algorithms: 1. Solving for OT Problems (Tn )

Note that T_n = [T_{n1}, \dots, T_{nj}, \dots, T_{n I_{(-n)}}] \in \mathbb{R}_+^{I_n \times I_n I_{(-n)}}.
The number of optimal transport problems to solve is I_{(-n)} = I_1 \times \dots \times I_{n-1} \times I_{n+1} \times \dots \times I_N.

Instead, we use the property of the OT solution, T_{nj}^{*} 1 = \operatorname{diag}(u_j) K_n v_j = u_j * (K_n v_j); therefore:

Proposition 2

\Delta(T_n) = [T_{n1} 1, \dots, T_{nj} 1, \dots, T_{n I_{(-n)}} 1] = U_n * (K_n V_n)    (7)

minimizes (6), where K_n = e^{-\rho C_n - 1} \in \mathbb{R}_+^{I_n \times I_n}, U_n = \big( \hat{X}_{(n)} \oslash (K_n V_n) \big)^{\Phi}, V_n = \big( X_{(n)} \oslash (K_n^\top U_n) \big)^{\Phi}, \Phi = \frac{\lambda \rho}{\lambda \rho + 1}, and \oslash denotes element-wise division (the power \Phi is also applied element-wise).
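
A vectorized sketch of our reading of these scaling updates (variable names, the iteration count, and the small epsilon guard are our assumptions): alternate U_n and V_n, then return ∆(T_n) = U_n ∗ (K_n V_n) without ever materializing the individual plans T_{nj}.

```python
# Vectorized sketch of the scaling updates behind Proposition 2 (our reading):
# alternate U_n and V_n, then form Delta(T_n) = U_n * (K_n V_n) without materializing
# the I_(-n) individual transport plans. Names, iteration count, and eps are ours.
import numpy as np

def delta_Tn(Xhat_n, X_n, C_n, rho=50.0, lam=10.0, n_iters=50, eps=1e-300):
    """Xhat_n, X_n: I_n x I_(-n) matricizations; C_n: I_n x I_n cost matrix."""
    phi = lam * rho / (lam * rho + 1.0)
    K = np.exp(-rho * C_n - 1.0)                # K_n = e^(-rho C_n - 1)
    U = np.ones_like(Xhat_n)
    V = np.ones_like(X_n)
    for _ in range(n_iters):
        U = (Xhat_n / (K @ V + eps)) ** phi     # U_n = (Xhat_(n) ./ (K_n V_n))^Phi
        V = (X_n / (K.T @ U + eps)) ** phi      # V_n = (X_(n)  ./ (K_n^T U_n))^Phi
    return U * (K @ V)                          # Delta(T_n) as in Eq. (7)

rng = np.random.default_rng(2)
I_n, I_rest = 6, 20
Xhat_n, X_n = rng.random((I_n, I_rest)), rng.random((I_n, I_rest))
C_n = rng.random((I_n, I_n)); C_n = 0.5 * (C_n + C_n.T); np.fill_diagonal(C_n, 0.0)
print(delta_Tn(Xhat_n, X_n, C_n).shape)         # (6, 20)
```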

Efficient Algorithms: 1. Solving for OT Problems (Tn )

Exploiting the sparsity structure to efficiently compute \Delta(T_n):

• Many columns of X_{(n)} are all zeros; they can be ignored when computing V_n.
• Besides, each column of V_n can be computed in parallel.

[Figure: the column-wise scaling updates involving \hat{X}_{(n)}, K_n, X_{(n)}, K_n^\top, and U_n (with element-wise powers \Phi) are computed in parallel.]
SWIFT exploits the sparsity structure of the input X_{(n)} and drops its all-zero columns.
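
A tiny sketch of that masking step (variable names are ours): only the nonzero columns of X_{(n)} enter the scaling updates, and results are scattered back to the original column positions afterwards.

```python
# Sketch of the sparsity shortcut: all-zero columns of X_(n) contribute nothing to V_n,
# so only the nonzero columns are kept in the scaling updates. Variable names are ours.
import numpy as np

rng = np.random.default_rng(3)
X_n = rng.random((6, 1000))
X_n[:, rng.random(1000) < 0.9] = 0.0          # a sparse unfolding: roughly 90% zero columns

nz = np.flatnonzero(X_n.any(axis=0))          # indices of columns with any nonzero entry
X_nz = X_n[:, nz]                             # dense block actually used in the updates
print(X_n.shape[1], "columns ->", X_nz.shape[1], "nonzero columns")
# After computing V_n (and the Delta/Psi terms) on X_nz, results are scattered back
# to the original column positions via the index array nz.
```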

Efficient Algorithms: 2. Updating CP factors (An )

Sub-problem for the CP factor matrices A_n

\underset{A_n \ge 0}{\text{minimize}} \quad \sum_{i=1}^{N} \mathrm{KL}\big( \Delta(T_i) \,\|\, A_i (A^{(-i)})^\top \big)    (8)

Challenge: A_n is also involved in the Khatri-Rao product A^{(-i)}.

To tackle this, we define a rearranging operator \Pi such that:

Efficient rearranging operation

\Pi\big( A_i (A^{(-i)})^\top, n \big) = A_n (A^{(-n)})^\top \in \mathbb{R}_+^{I_n \times I_{(-n)}} \quad \forall\ i \ne n.    (9)

In this way, the Khatri-Rao product term no longer contains the factor matrix A_n.
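
One way to realize an operator with the property in Eq. (9) is to refold the mode-i layout into the full tensor and unfold it along mode n, since both sides of (9) are unfoldings of the same reconstructed tensor. A self-consistent sketch (helper names are ours; the paper's implementation may differ):

```python
# One way to realize an operator with property (9): A_i (A^(-i))^T and A_n (A^(-n))^T are
# unfoldings of the same reconstructed tensor along different modes, so Pi can refold
# along mode i and unfold along mode n. A self-consistent sketch; the paper's actual
# implementation (and unfolding convention) may differ.
import numpy as np

def unfold(X, mode):
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def fold(M, mode, shape):
    full = [shape[mode]] + [s for k, s in enumerate(shape) if k != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def rearrange(M_i, i, n, shape):
    """Pi(M_i, n): re-express a mode-i unfolding as the corresponding mode-n unfolding."""
    return unfold(fold(M_i, i, shape), n)

shape = (4, 5, 6)
X = np.arange(np.prod(shape)).reshape(shape)
# Sanity check: rearranging the mode-0 unfolding recovers the mode-2 unfolding.
print(np.array_equal(rearrange(unfold(X, 0), 0, 2, shape), unfold(X, 2)))  # True
```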

Efficient Algorithms: 2. Updating CP factors (An )

With this operator, the sub-problem is equivalent to:

Rearranged sub-problem for A_n

\underset{A_n \ge 0}{\text{minimize}} \quad \mathrm{KL}\left( \begin{bmatrix} \Pi(\Delta(T_1), n) \\ \vdots \\ \Pi(\Delta(T_i), n) \\ \vdots \\ \Pi(\Delta(T_N), n) \end{bmatrix} \,\middle\|\, \begin{bmatrix} A_n (A^{(-n)})^\top \\ \vdots \\ A_n (A^{(-n)})^\top \\ \vdots \\ A_n (A^{(-n)})^\top \end{bmatrix} \right)    (10)

With the rearranged objective function, the factor matrix A_n can be efficiently updated via multiplicative update rules⁸.

8 Daniel D Lee and H Sebastian Seung. “Algorithms for non-negative matrix factorization”. In: Advances in Neural Information Processing Systems. 2001.
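
For concreteness, here is a sketch of one Lee and Seung style multiplicative step for objective (10) in our notation, where each target matrix stands in for Π(∆(T_i), n) and B stands for the Khatri-Rao product A^{(−n)}; this is an illustration, not the paper's exact update.

```python
# Sketch of one Lee-and-Seung style multiplicative step for the KL objective (10),
# in our notation: each M_i stands in for Pi(Delta(T_i), n) and B for the Khatri-Rao
# product A^(-n). An illustrative step, not the paper's exact implementation.
import numpy as np

def mu_update_An(A_n, B, targets, eps=1e-12):
    """One multiplicative step for sum_i KL(M_i || A_n B^T) with respect to A_n >= 0."""
    WH = A_n @ B.T + eps                                       # current reconstruction
    numer = sum((M / WH) @ B for M in targets)                 # sum_i (M_i ./ (A_n B^T)) B
    denom = len(targets) * np.ones_like(A_n) * B.sum(axis=0)   # sum_i 1 1^T B
    return A_n * numer / (denom + eps)                         # stays nonnegative

rng = np.random.default_rng(4)
I_n, I_rest, R = 8, 30, 5
A_n, B = rng.random((I_n, R)), rng.random((I_rest, R))
targets = [rng.random((I_n, I_rest)) for _ in range(3)]        # stand-ins for Pi(Delta(T_i), n)
A_n = mu_update_An(A_n, B, targets)
print(A_n.shape, bool((A_n >= 0).all()))                       # (8, 5) True
```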

Experiments: Datasets and Evaluation Metrics

BBC News⁹

• a third-order count tensor of size 400 articles × 100 words × 100 words
• downstream task: article category classification, evaluated by accuracy.

Sutter

• a dataset collected from a large real-world health provider network
• a third-order binary tensor of size 1,000 patients × 100 diagnoses × 100 medications
• downstream task: heart failure onset prediction, evaluated by PR-AUC.

We use the pairwise cosine distance to compute the cost matrices for each mode of the two datasets.
9 Derek Greene and Pádraig Cunningham. “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”. In: International Conference on Machine
learning. 2006.
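
A small sketch of how such a pairwise cosine-distance cost matrix could be assembled for one mode (the random feature vectors are stand-ins, an assumption; in practice the features would be derived from that mode's data):

```python
# Sketch of building a cost matrix from pairwise cosine distances between the entities
# of one mode. The random feature matrix below is a stand-in (an assumption); in practice
# the features would come from the data associated with that mode.
import numpy as np

def cosine_cost(F, eps=1e-12):
    """F: (I_n, d) feature matrix, one row per entity of mode n."""
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)
    C = 1.0 - Fn @ Fn.T                    # cosine distance = 1 - cosine similarity
    np.fill_diagonal(C, 0.0)               # zero cost of keeping mass in place
    return np.maximum(C, 0.0)              # clip tiny negative values from rounding

F = np.random.default_rng(5).random((100, 50))
C = cosine_cost(F)
print(C.shape, bool(C.min() >= 0))         # (100, 100) True
```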

Experiments: Baselines

We compare against the following tensor factorization models with different loss functions:

Model                  Loss Type    Underlying Distribution Assumption   Reference
CP-ALS                 MSE loss     Gaussian                             (Bader & Kolda 2007)
CP-NMU                 MSE loss     Gaussian                             (Bader & Kolda 2007)
Supervised CP          MSE loss     Gaussian                             (Kim et al. 2017)
Similarity-based CP    MSE loss     Gaussian                             (Kim et al. 2017)
CP-Continuous          Gamma loss   Gamma                                (Hong et al. 2020)
CP-Binary              Log loss     Bernoulli                            (Hong et al. 2020)
CP-APR                 KL loss      Poisson                              (Chi & Kolda 2012)

Experimental Results: Classification Performance

SWIFT outperforms all models consistently by a large margin.

Experimental Results: Classification Performance

Comparison against widely adopted classifiers:

                              Accuracy on BBC    PR-AUC on Sutter
Lasso Logistic Regression     .728 ± .013        .308 ± .033
Random Forest                 .628 ± .049        .318 ± .083
Multi-Layer Perceptron        .690 ± .052        .305 ± .054
K-Nearest Neighbor            .596 ± .067        .259 ± .067
SWIFT (R=5)                   .759 ± .013        .364 ± .063
SWIFT (R=40)                  .818 ± .020        .374 ± .044

SWIFT with a rank of 5 already outperforms all of the compared classifiers.

Experimental Results: Classification Performance on Noisy Data

We inject random noise into the BBC News data and run all models on the noisy data:

[Figure: classification accuracy on BBC News versus noise level (0.00–0.30) for CP-ALS, CP-NMU, Supervised CP, Similarity-based CP, CP-Gamma, CP-Binary, CP-APR, and SWIFT.]

SWIFT outperforms all baselines, especially for medium and high noise levels.

Experimental Results: Scalability of SWIFT

We set R = 40, switch off all parallelization in SWIFT for a fair comparison, and measure the running time of all models.

SWIFT is as scalable as other CP factorization models.

Experimental Results: Interpretability of SWIFT

We interpret the factor matrices learned on the Sutter dataset. Three examples follow:

• Each group (phenotype) contains clinically relevant diagnoses and medications.
• The weight indicates the lasso logistic regression coefficient for heart failure (HF) prediction.
• The first two groups are clinically relevant to HF, but the third is not.
• The clinical meaningfulness is endorsed by a medical expert.

SWIFT yields interpretable factor matrices.

Conclusion

• We define the Wasserstein distance between two tensors and propose SWIFT, a Wasserstein
tensor factorization model.

• We derive an efficient learning algorithm by exploiting the sparsity structure and introducing an efficient rearrangement operator.

• Empirical evaluations demonstrate that SWIFT consistently outperforms baselines in downstream prediction tasks, even in the presence of heavy noise.

• SWIFT is also shown to be scalable and interpretable.

Thank you!
All questions and comments are greatly appreciated!

