
Embedding Compression with Hashing for Efficient Representation Learning in Large-Scale Graph

Chin-Chia Michael Yeh, Mengting Gu, Yan Zheng, Huiyuan Chen, Javid Ebrahimi, Zhongfang Zhuang, Junpeng Wang, Liang Wang, Wei Zhang
Visa Research
[email protected]

arXiv:2208.05648v1 [cs.LG] 11 Aug 2022
ABSTRACT
Graph neural networks (GNNs) are deep learning models designed specifically for graph data, and they typically rely on node features as the input to the first layer. When applying such a network on a graph without node features, one can extract simple graph-based node features (e.g., number of degrees) or learn the input node representations (i.e., embeddings) when training the network. While the latter approach, which trains node embeddings, more likely leads to better performance, the number of parameters associated with the embeddings grows linearly with the number of nodes. It is therefore impractical to train the input node embeddings together with GNNs within graphics processing unit (GPU) memory in an end-to-end fashion when dealing with industrial-scale graph data. Inspired by the embedding compression methods developed for natural language processing (NLP) tasks, we develop a node embedding compression method where each node is compactly represented with a bit vector instead of a floating-point vector. The parameters utilized in the compression method can be trained together with GNNs. We show that the proposed node embedding compression method achieves superior performance compared to the alternatives.

CCS CONCEPTS
• Information systems → Data mining; • Computing methodologies → Neural networks.

KEYWORDS
graph neural network, compression, low-bit embeddings

ACM Reference Format:
Chin-Chia Michael Yeh, Mengting Gu, Yan Zheng, Huiyuan Chen, Javid Ebrahimi, Zhongfang Zhuang, Junpeng Wang, Liang Wang, and Wei Zhang. 2022. Embedding Compression with Hashing for Efficient Representation Learning in Large-Scale Graph. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3534678.3539068

1 INTRODUCTION
Graph neural networks (GNNs) are representation learning methods for graph data. They learn the node representation from input node features X and its graph G, where the node features X are used as the input node representations to the first layer of the GNN and the graph G dictates the propagation of information [13, 18, 43]. However, the input node features X may not always be available for certain datasets. To apply GNNs on a graph without node features X, we could either 1) extract simple graph-based node features (e.g., number of degrees) from the graph G or 2) use embedding learning methods to learn the node embeddings as features X [10]. While both approaches are valid, it has been shown that the second approach consistently outperforms the first one with a noticeable margin [10], and most recent methods learn the node embeddings jointly with the parameters of GNNs [14, 15, 37].

Learning node features (or embeddings) X for a small graph can be easily conducted. But, as the size of the embedding matrix X grows linearly with the number of nodes, scalability quickly becomes a problem, especially when attempting to apply such a method to industrial-grade graph data. For example, there are more than 1 billion Visa cards. If 1 billion of these cards are modeled as nodes in a graph, the memory cost for the embedding layer alone will be 238 gigabytes for 64-dimensional single-precision floating-point embeddings. Such memory cost is beyond the capability of common graphics processing units (GPUs). To solve the scalability issue, we adopt the embedding compression idea originally developed for natural language processing (NLP) tasks [29-32].
Particularly, we study the ALONE method proposed by [32]. ALONE represents each word using a randomly generated compositional code vector [32], which significantly reduces the memory cost; then a decoder model uncompresses the compositional code vector into a floating-point vector. The bit size of the compositional code vector is parametrized by c and m, where c is the cardinality of each element in the code vector and m is the length of the code vector.

[Figure 1 contains four panels: GloVe analogy (accuracy), GloVe similarity (Spearman's rho), metapath2vec clustering (NMI), and metapath2vec++ clustering (NMI). Each panel plots the performance measurement against the number of compressed entities for the random, hashing (pre-trained and graph variants where applicable), learn, and raw methods.]

Figure 1: Three coding schemes are tested: 1) random coding (ALONE), 2) hashing-based coding (the proposed method), and 3) a learning-based coding scheme with an autoencoder. For GloVe embeddings, we apply the hashing-based coding method on the pre-trained embeddings. For metapath2vec and metapath2vec++ embeddings, we apply the hashing-based coding method on either the pre-trained embeddings or the adjacency matrix from the graph. The horizontal line labeled "raw" shows the performance of the original embeddings without any compression. The y-axis of each sub-figure is the performance measurement (the higher the better). See Section 5.1 for more details.

For example, if we set c = 4 and m = 6, one valid code vector is [2, 0, 3, 1, 0, 1], where the length of the vector is 6 and each element in the vector is within the set {0, 1, 2, 3}. The code vector can be converted to a bit vector of length m log2 c by representing each element in the code vector as a binary number and concatenating the resulting binary numbers¹. Continuing the example, the code vector [2, 0, 3, 1, 0, 1] can be compactly stored as [10 00 11 01 00 01]. Using this conversion trick, it only requires 48 bits to store each word with the parametrization (c = 64, m = 8, 8 log2 64 = 48 bits) used by [32] in their experiments.

¹ The conversion mechanism is more space-efficient when c is set to a power of 2.
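To make the conversion concrete, the small sketch below packs a compositional code vector into a bit string and recovers it. The helper names are illustrative (they are not part of the paper's released code) and the sketch assumes c is a power of 2.

    # Illustrative sketch of the code-vector <-> bit-vector conversion (assumes c is a power of 2).
    import math

    def pack_code(code, c):
        # Pack a compositional code (ints in [0, c)) into a single integer of len(code)*log2(c) bits.
        bits_per_elem = int(math.log2(c))
        packed = 0
        for value in code:
            packed = (packed << bits_per_elem) | value
        return packed, bits_per_elem * len(code)

    def unpack_code(packed, c, m):
        # Recover the length-m integer code vector from the packed bits.
        bits_per_elem = int(math.log2(c))
        mask = c - 1
        return [(packed >> (bits_per_elem * i)) & mask for i in reversed(range(m))]

    packed, n_bits = pack_code([2, 0, 3, 1, 0, 1], c=4)
    print(f"{packed:0{n_bits}b}")          # 100011010001, i.e., 10 00 11 01 00 01
    print(unpack_code(packed, c=4, m=6))   # [2, 0, 3, 1, 0, 1]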

The coding scheme can uniquely represent up to 2^48 word (or sub-word) tokens, which is far beyond the number of tokens used in conventional NLP models [33, 36]. However, generating the code vectors in a random fashion hinders the model performance. One way to quickly benchmark different embedding compression methods is to evaluate the performance of the reconstructed (or uncompressed) embeddings. As shown in Figure 1, when a model compresses more embeddings, the performance of the uncompressed embeddings drops considerably when ALONE [32] is used (see lines labeled as "random"). The phenomenon is observed in our experiments with GloVe word embeddings [26] on both word analogy/similarity tasks and metapath2vec/metapath2vec++ node embeddings [8] on the node clustering task. Note, the experiments presented in Figure 1 (i.e., reconstruction experiments) are only proxies for the real use scenarios (i.e., node classification and link prediction with GNNs). In the intended use scenarios (Section 5.2 and Section 5.3), we do not use any pre-trained embeddings.

The likely root cause of the performance degradation is the need for a more expensive decoder to model the larger variance (from the increasing number of randomly generated code vectors). To solve this problem, we replace the random vector generation part of ALONE with an efficient random projection hashing algorithm, which better leverages the auxiliary information of the graph G, such as its adjacency matrix. The adopted hashing method is locality-sensitive hashing (LSH) [3], as it hashes entities with similar auxiliary information into similar code vectors. The auxiliary information helps us reduce the variance in the code vectors (compared to that of the randomly generated code vectors), which eliminates the need for an expensive decoder. In Figure 1, our proposed hashing-based coding (see lines labeled as "hashing") outperformed the random coding in all scenarios. Similar to ALONE, the proposed method does not introduce additional training stages. On top of that, the memory footprint is identical to ALONE because the proposed method only replaces the coding scheme.

In addition to the proxy tasks of pre-trained embedding reconstruction, we also compare the effectiveness of different coding schemes when the GNN models and the decoder model are trained together in an end-to-end fashion. Particularly, we trained four different GNN models [13, 18, 38, 39] on five different node classification/link prediction datasets as well as our in-house large-scale transaction dataset. The experiment results confirm the superior performance of the proposed hashing-based coding scheme. To sum up, our contributions include:
• We propose a novel hashing-based coding scheme for large-scale graphs, which is compatible with most GNNs and achieves superior performance compared to existing methods.
• We show how the improved embedding compression method can be applied to GNNs without any pre-training.
• We confirm the effectiveness of our embedding compression method in proxy pre-trained embedding reconstruction tasks, node classification/link prediction tasks with GNNs, and an industrial-scale merchant category identification task.

2 RELATED WORK
The embedding compression problem is extensively studied for NLP tasks because of the memory cost associated with storing the embedding vectors. One of the most popular strategies is parameter sharing [29-32]. For example, Suzuki and Nagata [30] train a small set of sub-vectors shared by all words, called "reference vectors", where each word embedding is constructed by concatenating different sub-vectors together. Both the shared sub-vectors and the sub-vector assignments are optimized during training. The resulting compressed representation is capable of representing each word compactly, but the training process is memory costly as it still needs to train the full embedding matrix to solve the sub-vector assignment problem. Therefore, the method proposed by [30] is not suitable for representing a large set of entities.

Shu and Nakayama [29] train an encoder-decoder model (i.e., an autoencoder) where the encoder converts a pre-trained embedding into the corresponding compositional code representation, while the decoder reconstructs the pre-trained embedding from the compositional code. Once the encoder-decoder is trained, all the pre-trained embeddings are converted to the compact compositional code representation using the encoder; then the decoder can be trained together with the downstream models. Because the memory cost associated with the compositional codes is much smaller than the raw embeddings and the decoder is shared by all words, the method reduces the overall memory consumption associated with representing words. However, since it also requires training the embeddings before training the encoder-decoder, the training process still has the high memory cost associated with conventional embedding training, similar to the previous work. In other words, the method presented in [29] is not applicable to compress a large set of entities either.

Svenstrup et al. [31] represent each word compactly with a unique integer and k floating-point values, where k is much smaller than the dimension of the embedding. To obtain the embedding from the compact representation of a word, k hash functions² are used to hash the word's associated unique integer to an integer in [0, c), where c is much smaller than the number of words. Next, k vectors are extracted from a set of c learnable "component vectors" based on the output of the hash function. The final embedding of the word is generated by computing the weighted sum of the k vectors, where the weights are based on the k learnable floating-point values associated with the word. Similar to our work, Svenstrup et al. [31] also use hash functions in their proposed method, but the role of the hash function is different: Svenstrup et al. [31] use hash functions for reducing cardinality while we use hash functions to perform LSH. On top of that, as the k learnable floating-point values are associated with each word in [31], their method's parameter size grows linearly with the vocabulary size, which makes the method not ideal for our application.

² The hash function used in [31] is the hash function proposed by [2] for hashing integers.

The ALONE method proposed by [32] represents each word with a randomly generated compositional code, and the embedding is obtained by feeding the compositional code to a decoder whose number of learnable parameters is independent of the vocabulary size. The ALONE method satisfies all the requirements for our application; however, its performance suffers when the vocabulary size increases compared to the autoencoder-based approach [29], as demonstrated in Figure 1 (see "random" versus "learn"). In contrast, our proposed method has similar performance compared to the autoencoder-based approach [29] but does not require additional training phases.

Learning-to-hash methods are another set of methods concerned with the compression of data [28, 34]. When learning-to-hash methods are applied to graph data, the binary codes for each node can be generated by learning a hash function with the sign function [34] or the binary regularization loss [34] to binarize the hidden representation. Since our problem focuses on compressing the input embedding rather than the intermediate hidden representation, learning-to-hash methods like the ones proposed by [28, 34] are not applicable in our scenario. Other compression methods for graph data, like the method proposed by [4], require the embedding table (i.e., query matrix) as an input for the training process and are thus also impractical for training on large graphs. To the best of our knowledge, our work is the first to focus on studying the compression of input embeddings for graph data.

3 METHOD
The proposed method consists of two stages: 1) an encoding stage where each node's compositional code is generated with a hashing-based method, and 2) a decoding stage where the decoder is trained in an end-to-end fashion together with the downstream model. Figure 2 shows an example forward pass. The binary code is a node's compositional code generated by the hashing-based method (Section 3.1). After the binary code is converted to an integer code, the decoder model, which mostly consists of m codebooks and a multilayer perceptron (MLP) as described in Section 3.2, generates the corresponding embedding. The memory cost of storing both the compositional codes and the decoder is drastically lower than that of a conventional embedding layer.
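To make the two-stage structure concrete, the toy sketch below wires a miniature encoder and decoder together. The names and sizes are illustrative, the encoder is a condensed version of Algorithm 1, and, for brevity, the toy decoder consumes the bit vector directly instead of using the m codebooks described in Section 3.2.

    # Toy wiring of the two stages (illustrative only; see Algorithm 1 and Section 3.2 for the real components).
    import numpy as np
    import torch

    n, d, m, c, d_emb = 1000, 64, 8, 4, 32
    A = np.random.rand(n, d).astype(np.float32)        # auxiliary information, e.g., adjacency rows

    # Stage 1 (offline): hash every node into an m*log2(c)-bit code via random projection.
    n_bit = m * int(np.log2(c))
    proj = A @ np.random.randn(d, n_bit).astype(np.float32)
    codes = (proj > np.median(proj, axis=0)).astype(np.float32)   # (n, n_bit) binary matrix

    # Stage 2 (trained jointly with the downstream model): decode bits into embeddings.
    decoder = torch.nn.Sequential(torch.nn.Linear(n_bit, 128), torch.nn.ReLU(),
                                  torch.nn.Linear(128, d_emb))
    batch_embeddings = decoder(torch.from_numpy(codes[:16]))      # embeddings for a batch of 16 nodes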

[Figure 2 illustrates the decoder's forward pass: the binary code [10 10 00 11 01] is converted to the integer code [2 2 0 3 1]; one vector is looked up from each of the 5 codebooks (m = 5, c = 4) and the retrieved vectors are summed; if the codebooks are not trainable, the sum is rescaled by an element-wise product with W0; the result is fed through a multilayer perceptron to produce the embedding for the downstream model.]

Figure 2: In this toy example, each codebook has 4 vectors (𝑐 = 4) and there are 5 distinct codebooks (𝑚 = 5). There are two
variants of the adopted decoder models: 1) a light version where the codebooks are NOT trainable and 2) a full version where
the codebooks are trainable. 𝑊0 is a trainable vector for rescaling the intermediate representations (see Section 3.2).

3.1 Hashing-based Coding Scheme
Algorithm 1 outlines the random projection-based hashing method. The first input to our algorithm is a matrix A ∈ R^(n×d) containing the auxiliary information of each node, where n is the number of nodes and d is the length of the auxiliary vector associated with each node. When the adjacency matrix is used as the auxiliary information, d is equal to n, and it is preferred to store A as a sparse matrix in compressed row storage (CRS) format because all the operations on A are row-wise operations. The other inputs are the code cardinality c and the code length m. These two inputs dictate the format (and the memory cost) of the output compositional code X̂. For each node's associated code vector, c controls the cardinality of each element in the code vector and m controls the length of the code vector. The output is the resulting compositional codes X̂ ∈ B^(n×m log2 c) in binary format, where each row contains a node's associated code vector, m log2 c is the number of bits required to store one code vector, and c is a power of 2. We store X̂ in binary format because the binary format is more space-efficient than the integer format; the binary code vector can be converted back to integer format before it is fed to the decoder.

Algorithm 1 Encode with Random Projection
Input: auxiliary information A ∈ R^(n×d), code cardinality c, code length m
Output: compositional code X̂ ∈ B^(n×m log2 c)
 1  function Encode(A, c, m)
 2    n_bit ← m log2 c
 3    X̂ ← GetAllFalseBooleanMatrix(n, n_bit)
 4    for i in [0, n_bit) do
 5      V ← GetRandomVector(d)
 6      U ← GetEmptyVector(n)
 7      for j in [0, n) do
 8        U[j] ← DotProduct(A[j, :], V)
 9      t ← GetMedian(U)
10      for j in [0, n) do
11        if U[j] > t then X̂[j, i] ← True
12    return X̂

In line 2, the number of bits required to store each code vector (i.e., m log2 c) is computed and stored in the variable n_bit. In line 3, a Boolean matrix X̂ of size n × n_bit is initialized for storing the resulting compositional codes; the default value for X̂ is False. From line 4 to 11, the compositional codes are generated bit-by-bit in the outer loop and node-by-node in the inner loops. Generating the compositional codes in this order is a more memory-efficient way to perform the random projection, as it only needs to keep a size-d random vector in each iteration compared to the alternative order. If the inner loop (i.e., lines 7 to 8) were switched with the outer loop (i.e., lines 4 to 11), it would require an R^(n_bit×d) matrix to store all the random vectors for the random projection (i.e., a matrix multiplication).

In line 5, a random vector V ∈ R^d is generated; the vector V is used for performing the random projection. In line 6, a vector U ∈ R^n is initialized for storing the result of the random projection. In lines 7 to 8, each node's associated auxiliary vector is projected using the random vector V and stored in U (i.e., U = AV). Here, the memory footprint could be further reduced if we only load a few rows of A during the loop instead of the entire A before the loop. Such an optimization could be important because the size of A could be too large for systems with limited memory. In line 9, the median of U is identified and stored in t; this is the threshold for binarizing the real values in U. In lines 10 to 11, using both the values in the vector U and the threshold t, the binary code is generated for each node. Lastly, in line 12, the resulting compositional codes X̂ are returned. The resulting X̂ can be used for any downstream task.

Note, we use the median as the threshold instead of the more commonly seen zero because it reduces the number of collisions in the resulting binary codes³. Reducing the number of collisions is important for our case because our goal is to generate a unique code vector to represent each node. To confirm whether using the median as the threshold reduces the number of collisions, we performed an experiment using pre-trained metapath2vec node embeddings [8] as the auxiliary matrix A. Pre-trained embeddings provide a way to quickly benchmark different LSH thresholds; we do not use pre-trained embeddings in the intended use cases (i.e., Section 5.2 and Section 5.3). We generate the compositional codes with random projection-based hashing using either the median or zero as the threshold, and we then count the number of collisions in the generated compositional codes. We repeat the experiment 100 times under two different experimental settings (i.e., 24 bits/32 bits). The experiment results are summarized in Figure 3 with histograms: setting the threshold to the median instead of zero indeed reduces the number of collisions. We also repeat the experiments with metapath2vec++ and GloVe embeddings, and the conclusion remains the same. Please see Appendix A for the experiment setup details and additional experiment results.

³ The threshold used in the LSH method proposed by [3] is zero.

The memory complexity of Algorithm 1 is O(max(nm log2 c, df, nf)), where f is the number of bits required to store a floating-point number. The nm log2 c term is the memory cost of storing X̂, the df term is the memory cost of storing V, and the nf term is the memory cost of storing U. Because f is usually less than m log2 c (i.e., based on the hyper-parameters used in [29, 32]⁴) and d is usually less than or equal to n, the typical memory complexity of Algorithm 1 is O(nm log2 c). In other words, the memory complexity of Algorithm 1 is the same as that of the output matrix X̂, which shows how memory-efficient Algorithm 1 is. The time complexity of Algorithm 1 is O(n · d · m log2 c) for the nested loops⁵.

⁴ If the single-precision format is used for floating-point numbers, f is 32 bits, and m log2 c is commonly set to a number larger than 32 bits in [29, 32].
⁵ The median finding algorithm [1] in line 9 is O(n), which is the same as the inner loops (i.e., lines 7 to 8 and lines 10 to 11).
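For readers who prefer working code, Algorithm 1 translates almost line-for-line into NumPy. The sketch below is an illustrative re-implementation (not the authors' released code) and assumes the auxiliary matrix fits in memory, either dense or as a SciPy sparse matrix that supports matrix-vector products.

    # NumPy sketch of Algorithm 1: random-projection hashing with a median threshold.
    import numpy as np

    def encode(A, c, m, seed=0):
        # A: (n, d) auxiliary matrix (e.g., adjacency rows); returns an (n, m*log2(c)) Boolean matrix.
        rng = np.random.default_rng(seed)
        n = A.shape[0]
        d = A.shape[1]
        n_bit = m * int(np.log2(c))                    # line 2
        X_hat = np.zeros((n, n_bit), dtype=bool)       # line 3
        for i in range(n_bit):                         # line 4
            V = rng.standard_normal(d)                 # line 5: one random direction per bit
            U = A @ V                                  # lines 7-8: project every node at once
            t = np.median(U)                           # line 9: median threshold
            X_hat[:, i] = U > t                        # lines 10-11: binarize
        return X_hat                                   # line 12

    adj = (np.random.rand(500, 500) < 0.05).astype(np.float64)    # toy adjacency matrix
    codes = encode(adj, c=4, m=8)                                  # 8 * log2(4) = 16 bits per node
    print(codes.shape)                                             # (500, 16)

The inner node-by-node loop of Algorithm 1 is vectorized here as a single matrix-vector product; processing A in row blocks would recover the low-memory behavior discussed above.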

[Figure 3 shows, for the 24-bit and 32-bit settings, histograms of the number of collisions for the zero threshold ("hashing-zero") and the median threshold ("hashing-median").]

Figure 3: The experiments are performed on metapath2vec 100 times under two different bit length settings: 24 bits and 32 bits. The distribution of the 100 outcomes (i.e., number of collisions) for each method is shown in the figure. The number of collisions is lower for the median threshold compared to the zero threshold.

3.2 Decoder Model Design
We will use the example forward pass presented in Figure 2 to introduce the decoder design. The input to the decoder is the binary compositional code generated by the hashing-based coding scheme introduced in Section 3.1. The input binary code is first converted to integers, which are used as indices for retrieving the corresponding real vector from each codebook. In our example, the binary vector [10, 10, 00, 11, 01] is converted to the integer vector [2, 2, 0, 3, 1]. Each codebook is an R^(c×d_c) matrix, where c is the number of codes in each codebook (i.e., the code cardinality) and d_c is the length of each real vector in the codebook. There are m codebooks in total, where m is the code length (i.e., the length of the code after being converted from binary to integer format). Because the code length is 5 in the example, there are 5 codebooks in Figure 2. Because the code cardinality is 4 (i.e., the number of possible values in the integer code), each codebook has 4 real vectors.

From each codebook, a real-valued vector is retrieved based on the codebook's corresponding index. In our example, the index 2 vector (purple) is retrieved from the first codebook, the index 2 vector (black) is retrieved from the second codebook, the index 0 vector (green) is retrieved from the third codebook, the index 3 vector (red) is retrieved from the fourth codebook, and the index 1 vector (blue) is retrieved from the last codebook. The real vectors (i.e., the codebooks) can either be non-trainable random vectors or trainable vectors. We refer to the former method as the light method and the latter method as the full method. The former method is lighter because the latter method increases the number of trainable parameters by mcd_c. The full method is desirable if the additional trainable parameters (i.e., memory cost) are allowed by the hardware. Note, despite the full method having a higher memory cost, the number of trainable parameters is still independent of the number of nodes in the graph.

Next, the retrieved real vectors are summed together. The summed vector is handled differently for the light and full methods. Because the codebooks are not trainable for the light method, we compute the element-wise product between the summed vector and a trainable vector W0 ∈ R^(d_c) to rescale each dimension of the summed vector, following [32]. Such a transformation is not needed for the full method because it can capture this kind of transformation with the trainable parameters in the codebooks. The transformed vector is then fed to an MLP with ReLU between the linear layers. The output of the MLP is the embedding corresponding to the input compositional code for the downstream model.

If the number of neurons of the MLP is set to d_m, the number of layers of the MLP is set to l, and the dimension of the output embedding is set to d_e, the light method has mcd_c non-trainable parameters (which can be stored outside of GPU memory) and d_c + d_c·d_m + (l − 2)·d_m^2 + d_m·d_e trainable parameters. The full method has mcd_c + d_c·d_m + (l − 2)·d_m^2 + d_m·d_e trainable parameters. Here, we assume l is greater than or equal to 2. Note, the number of parameters does not grow with the number of nodes for either the light or the full method. The decoder model used in [32] is the light method without the binary-to-integer conversion step.
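A rough PyTorch re-implementation of this decoder is sketched below (illustrative, not the authors' code): setting trainable_codebooks=False gives the light variant with the W0 rescaling, and True gives the full variant.

    # Rough PyTorch sketch of the light/full decoder (illustrative re-implementation).
    import torch
    import torch.nn as nn

    class CodeDecoder(nn.Module):
        def __init__(self, c, m, d_c, d_m, d_e, n_layers=3, trainable_codebooks=False):
            super().__init__()
            # m codebooks, each holding c vectors of length d_c.
            self.codebooks = nn.Parameter(torch.randn(m, c, d_c),
                                          requires_grad=trainable_codebooks)
            self.full = trainable_codebooks
            if not self.full:                          # light variant: trainable rescaling vector W0
                self.W0 = nn.Parameter(torch.ones(d_c))
            layers, d_in = [], d_c
            for _ in range(n_layers - 1):              # n_layers linear layers with ReLU in between
                layers += [nn.Linear(d_in, d_m), nn.ReLU()]
                d_in = d_m
            layers.append(nn.Linear(d_in, d_e))
            self.mlp = nn.Sequential(*layers)

        def forward(self, int_codes):                  # int_codes: (batch, m) integer code vectors
            m = int_codes.shape[1]
            picked = self.codebooks[torch.arange(m), int_codes]    # (batch, m, d_c) lookup
            summed = picked.sum(dim=1)                             # sum the m retrieved vectors
            if not self.full:
                summed = summed * self.W0                          # element-wise rescaling
            return self.mlp(summed)

    decoder = CodeDecoder(c=4, m=5, d_c=32, d_m=64, d_e=16)
    codes = torch.tensor([[2, 2, 0, 3, 1]])            # the integer code from Figure 2
    print(decoder(codes).shape)                        # torch.Size([1, 16])

With n_layers = l, the trainable parameter count of the MLP matches the d_c·d_m + (l − 2)·d_m^2 + d_m·d_e term above (biases aside).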


4 INTEGRATION WITH GNN MODELS
In this section, we show how the proposed method can be integrated with GNN models. Figure 4 depicts an example where we use the proposed method with the GraphSAGE model [13], one of the most prevalent GNN models applied to large-scale (e.g., industrial-level) data. Other GNNs can be integrated with the proposed method in a similar fashion (i.e., by replacing the embedding layer with the proposed method). First, in step 0, a batch of nodes is sampled. In step 1, for each node in the batch, a number of neighboring nodes (i.e., first neighbors) are sampled. Because the example model shown in the figure has 2 layers, the neighbors of neighbors (i.e., second neighbors) are also sampled in step 2. Next, the binary codes associated with each node's first and second neighbors are retrieved in step 3 and decoded in step 4 using the system described in Section 3.2.

After the embeddings for both the first and second neighbors are retrieved, the second neighbor embeddings of each given first neighbor are aggregated with functions like mean or max in the Aggregate 1 layer. Supposing H_i contains the embeddings of the neighboring nodes for a given node i, the aggregate layer computes the output ĥ_i with Aggregate(H_i). Next, in Layer 1, for each first neighbor node i, ĥ_i and x_i (i.e., the embedding for node i) are concatenated and processed with a linear layer plus a non-linearity. The process of Layer 1 can be represented as σ(W · Concatenate(ĥ_i, x_i)), where W is the weight of the linear layer and σ(·) is a non-linear function like ReLU. A similar process is repeated in the Aggregate 2 layer and Layer 2 to generate the final representation of each node in the batch. The final prediction is computed by feeding the learned representation to the output (i.e., linear) layer. All parameters in the model are learned end-to-end using the training data.

[Figure 4 depicts the integration pipeline: 0) a batch of nodes is sampled; 1) first neighbors are sampled by the Neighbor Sampler; 2) second neighbors are sampled; 3) the binary codes of the first and second neighbors are retrieved by the Code Lookup; 4) the codes are decoded into embeddings by the Decoder; 5) the prediction is produced by the Aggregate 1 / Layer 1 / Aggregate 2 / Layer 2 / Output Layer stack.]

Figure 4: The proposed method can be integrated with the GraphSage model. The Code Lookup is used to look up the corresponding binary code for each input node. The Decoder is the system presented in Figure 2 and converts the input binary codes to embeddings.
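The sketch below shows the shape of this integration for a single GraphSAGE-style layer (illustrative only): the usual embedding lookup is replaced by a code lookup plus decoding, a stand-in MLP plays the role of the Section 3.2 decoder, and full two-layer neighbor sampling is omitted for brevity.

    # Minimal sketch of plugging compressed codes into a GraphSAGE-style layer (illustrative).
    import torch
    import torch.nn as nn

    n, n_bit, d_e, d_h, n_cls = 1000, 48, 64, 128, 10
    binary_codes = (torch.rand(n, n_bit) > 0.5).float()      # per-node codes (step 3: code lookup)
    decoder = nn.Sequential(nn.Linear(n_bit, 128), nn.ReLU(), nn.Linear(128, d_e))
    sage_layer = nn.Linear(2 * d_e, d_h)                     # W in sigma(W . concat(h_i, x_i))
    output_layer = nn.Linear(d_h, n_cls)

    batch_nodes = torch.randint(0, n, (32,))                 # step 0: sample a batch of nodes
    neighbors = torch.randint(0, n, (32, 5))                 # step 1: sample 5 neighbors per node

    x_self = decoder(binary_codes[batch_nodes])              # step 4: decode codes into embeddings
    x_neigh = decoder(binary_codes[neighbors].reshape(-1, n_bit)).reshape(32, 5, d_e)
    h_agg = x_neigh.mean(dim=1)                              # Aggregate: mean over neighbors
    h = torch.relu(sage_layer(torch.cat([h_agg, x_self], dim=1)))   # Layer 1
    logits = output_layer(h)                                 # step 5: prediction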

5 EXPERIMENT
We perform three sets of experiments: 1) pre-trained embedding reconstruction, 2) training the decoder jointly with GNN models for node classification and link prediction problems, and 3) an industrial application of the proposed method to the merchant category identification problem [42]. The first set of experiments uses proxy tasks to reveal the difference between the compression capabilities of different methods, while the second set of experiments measures the performance of different methods on common graph problems like node classification and link prediction. The third set of experiments compares the proposed method with the baseline on an industrial problem. Experiments are conducted in Python (see [41]).

5.1 Pre-trained Embedding Reconstruction
In this set of experiments, we compare the compression capability of different compression methods by testing the quality of the reconstructed embeddings. Note, this set of experiments uses proxy tasks as additional test-beds to highlight the difference between the compression methods. In the intended use scenarios (see Section 5.2 and Section 5.3), the pre-trained embeddings do not come with the dataset. The tested methods are the random coding (i.e., the baseline method proposed by [32]), the learning-based coding (i.e., an autoencoder similar to the method proposed by [29]), and the hashing-based coding (i.e., the proposed method). When applying the hashing-based coding method on the graph datasets, we feed either the original pre-trained embeddings (i.e., hashing/pre-trained in Figure 1) or the adjacency matrix from the graph (i.e., hashing/graph in Figure 1) into Algorithm 1. We vary the number of compressed entities when testing the different methods.

5.1.1 Dataset. Three sets of pre-trained embeddings are used in these experiments: 1) the 300-dimension GloVe word embeddings, 2) the 128-dimension metapath2vec node embeddings, and 3) the 128-dimension metapath2vec++ node embeddings. The GloVe embeddings are tested with word analogy and similarity tasks; the performance measurements for the word analogy and similarity tasks are accuracy and Spearman's rank correlation (rho), respectively. The metapath2vec/metapath2vec++ embeddings are tested with node clustering, and the performance measurement is normalized mutual information. Please see Appendix B.1 for more details regarding the datasets.

5.1.2 Implementation. We use the full decoding method in this set of experiments. To train the compression method, we use the mean squared error between the input embeddings and the reconstructed embeddings as the loss function, following [32]. The loss function is optimized with AdamW [22] using the default hyper-parameter settings in PyTorch [25]. Because we want to vary the number of compressed entities when comparing different methods, we need to sample from the available pre-trained embeddings. Similar to [32], we sample based on the frequency⁶. Since different experiments use different numbers of compressed entities, we only evaluate with the same top 5k entities based on frequency, similar to [32], despite there being more than 5k reconstructed embeddings when the number of compressed entities is greater than 5k. In this way, we have the same test data across experiments with different numbers of compressed entities. The detailed hyper-parameter settings are shown in Appendix B.2.

⁶ For GloVe, frequency means the number of times a word occurs in the training data. For metapath2vec and metapath2vec++, frequency means the number of times a node occurs in the sampled metapaths.
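The reconstruction training described in Section 5.1.2 boils down to a short loop. The sketch below uses random tensors as stand-ins for the pre-trained embeddings and the compositional codes, and a plain MLP as the decoder, so the shapes and names are illustrative.

    # Sketch of the Section 5.1.2 reconstruction training (illustrative stand-in data).
    import torch
    import torch.nn as nn

    n, n_bit, dim = 5000, 48, 300
    pretrained = torch.randn(n, dim)                     # stand-in for GloVe/metapath2vec vectors
    codes = (torch.rand(n, n_bit) > 0.5).float()         # compositional codes from Algorithm 1
    decoder = nn.Sequential(nn.Linear(n_bit, 512), nn.ReLU(), nn.Linear(512, dim))
    optim = torch.optim.AdamW(decoder.parameters())      # default hyper-parameters

    for step in range(1000):
        idx = torch.randint(0, n, (256,))                # the paper samples by frequency; uniform here for brevity
        loss = nn.functional.mse_loss(decoder(codes[idx]), pretrained[idx])
        optim.zero_grad()
        loss.backward()
        optim.step()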

5.1.3 Result. The experiment results are summarized in Figure 1. Note, we use "random" to denote the baseline method (i.e., ALONE). When the number of compressed entities is low, the reconstructed embeddings from all compression methods perform similarly to the raw embeddings (i.e., the original pre-trained embeddings). As the number of compressed entities increases, the reconstructed embeddings' performance decreases. The decreasing performance is likely caused by the fact that the decoder model's size does not grow with the number of compressed entities; in other words, the compression ratio increases as the number of compressed entities increases (see Table 4). When comparing the different compression methods, we observe that the quality of the reconstructed embeddings from the random coding method drops sharply compared to the other methods (i.e., hashing-based coding and learning-based coding). It is surprising that the hashing-based coding method works as well as the learning-based coding method, even though the learning-based coding method uses additional parameters to learn the coding function. When we compare the two variants of the proposed coding method (i.e., hashing with pre-trained embeddings and hashing with the graph/adjacency matrix), the performance is very similar. This shows that the adjacency matrix from the graph is a valid choice for applying the proposed hashing-based coding method. We have also tested other settings of c and m (see Table 5 and the accompanying text); the conclusion stays the same. Overall, the hashing-based coding method outperforms the baseline ALONE method.

5.2 Node Classification and Link Prediction
To examine the difference between the compression methods on graph-related tasks, we perform node classification and link prediction experiments where the decoder is trained together with GNNs, e.g., GraphSAGE [13], Graph Convolutional Network (i.e., GCN) [18], Simplifying Graph Convolutional Network (i.e., SGC) [38], and Graph Isomorphism Network (i.e., GIN) [39]. Because we assume there are no node features or pre-trained embeddings available in our experiment setup, the autoencoder-based method proposed by [29] is not applicable. We compare the proposed hashing-based coding method (using adjacency matrices and Algorithm 1 to generate the codes) with two baseline methods: the random coding method and the raw embedding method. The raw embedding method explicitly learns the embeddings together with the GNN model and can be treated as the upper bound in terms of accuracy because the embeddings are not compressed.

5.2.1 Dataset. The experiments are performed on the ogbn-arxiv, ogbn-mag, ogbn-products, ogbl-collab, and ogbl-ddi datasets from the Open Graph Benchmark [16]. As we are more interested in evaluating our model performance on attribute-less graphs, we use only the graph structure information of these datasets. We convert all the directed graphs to undirected graphs by making the adjacency matrix symmetric. The ogbn-mag dataset is a heterogeneous graph, and we only use the citing relation between paper nodes because the labels are associated with paper nodes. The performance measurement is accuracy for the node classification datasets, hits@50 for ogbl-collab, and hits@20 for ogbl-ddi.

5.2.2 Implementation. We use the PyTorch implementation of the GraphSAGE model with the mean pooling aggregator [13, 17]. We use the PyG library [12] to implement GCN [18], SGC [38], and GIN [39]. The model parameters are optimized by AdamW [22] with the default hyper-parameter settings. The detailed hyper-parameter settings are shown in Appendix C.1.

5.2.3 Result. The experimental results are shown in Table 1. The results show that the hashing-based coding method outperforms the random coding method (i.e., ALONE) in most tested scenarios. One possible reason for the random coding method's less impressive performance compared to the results reported on NLP tasks by [32] is the number of entities compressed by the compression method. In NLP models, embeddings typically represent sub-words instead of words [36]. For example, the transformer model for machine translation adopted by [32] has 32,000 sub-words, which is much smaller than most tested graph datasets (i.e., ogbn-arxiv, ogbn-mag, ogbn-products, and ogbl-collab all have more than 150,000 nodes). In other words, the proposed hashing-based coding method is more effective than the baseline random coding method for representing a larger set of entities in compressed space.

Table 1: The proposed hashing-based coding almost always outperforms the baseline random coding with different GNNs for both node classification and link prediction. It also achieves close to, and occasionally outperforms, the non-compressed method. We use NC to denote the non-compressed (embedding learning without compression) method, Rand to denote the random coding method (i.e., ALONE), and Hash to denote the proposed hashing coding method.

GraphSage
  node classification  ogbn-arxiv (acc.):      NC 0.6228  Rand 0.6045  Hash 0.6259
  node classification  ogbn-mag (acc.):        NC 0.3192  Rand 0.2989  Hash 0.3387
  node classification  ogbn-products (acc.):   NC 0.7486  Rand 0.6327  Hash 0.6414
  link prediction      ogbl-collab (hits@50):  NC 0.2740  Rand 0.1966  Hash 0.1956
  link prediction      ogbl-ddi (hits@20):     NC 0.3277  Rand 0.3043  Hash 0.3429
GCN
  node classification  ogbn-arxiv (acc.):      NC 0.5251  Rand 0.4957  Hash 0.5437
  node classification  ogbn-mag (acc.):        NC 0.1815  Rand 0.1146  Hash 0.3466
  node classification  ogbn-products (acc.):   NC 0.4719  Rand 0.3594  Hash 0.4914
  link prediction      ogbl-collab (hits@50):  NC 0.2316  Rand 0.1647  Hash 0.1898
  link prediction      ogbl-ddi (hits@20):     NC 0.3697  Rand 0.3399  Hash 0.3319
SGC
  node classification  ogbn-arxiv (acc.):      NC 0.6690  Rand 0.5491  Hash 0.5809
  node classification  ogbn-mag (acc.):        NC 0.3523  Rand 0.1839  Hash 0.3657
  node classification  ogbn-products (acc.):   NC 0.7686  Rand 0.3767  Hash 0.4966
  link prediction      ogbl-collab (hits@50):  NC 0.5589  Rand 0.4790  Hash 0.5116
  link prediction      ogbl-ddi (hits@20):     NC 0.4841  Rand 0.5575  Hash 0.5941
GIN
  node classification  ogbn-arxiv (acc.):      NC 0.5546  Rand 0.3736  Hash 0.5263
  node classification  ogbn-mag (acc.):        NC 0.2728  Rand 0.2011  Hash 0.3414
  node classification  ogbn-products (acc.):   NC 0.6423  Rand 0.4396  Hash 0.5706
  link prediction      ogbl-collab (hits@50):  NC 0.2614  Rand 0.2086  Hash 0.2475
  link prediction      ogbl-ddi (hits@20):     NC 0.3216  Rand 0.3536  Hash 0.3876

The proposed method is also compared to the "without compression" baseline (i.e., NC). The NC baseline outperforms the proposed method in 10 out of 20 experiments, as expected, since the compression used in our method is lossy. One possible reason for the ten unexpected outcomes (i.e., the proposed method outperforms the NC baseline) is that the lossy compression may sometimes remove the "correct" noise from the data. Because the focus of this paper is scalability when comparing to the NC baseline, we leave the study of such phenomena for future work.

In terms of memory usage, the compression method achieves a considerably good compression ratio. For example, since the ogbn-products dataset has 1,871,031 nodes, it requires 456.79 MB to store the raw embeddings in GPU memory. In contrast, it only takes the proposed method 28.55 MB to store the binary codes in CPU memory, and the corresponding decoder model only costs 9.13 MB of GPU memory. The compression ratio is 43.75 for the proposed method's less memory-efficient setup (i.e., the full model) if we only consider GPU memory usage. For the total memory usage, the compression ratio is 11.74 for the same setup. The complete memory cost breakdown is shown in Table 2. The unit for memory is megabytes (MB), and the column label "ratio" stands for "compression ratio".

Table 2: The memory cost (MB) for models on the ogbn-products dataset. "Ratio" stands for compression ratio.

Method     | CPU: Binary Code | CPU: Decoder | CPU: Total | GPU: Decoder or Embedding | GPU: GNN | GPU: Total | GPU Ratio | CPU+GPU: Total | CPU+GPU: Ratio
Raw        | 0.00             | 0.00         | 0.00       | 456.79                    | 1.35     | 458.14     | 1.00      | 458.14         | 1.00
Hash-Light | 28.55            | 8.00         | 36.55      | 1.13                      | 1.35     | 2.47       | 185.34    | 39.02          | 11.74
Hash-Heavy | 28.55            | 0.00         | 28.55      | 9.13                      | 1.35     | 10.47      | 43.75     | 39.02          | 11.74
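The storage entries in Table 2 can be reproduced with back-of-the-envelope arithmetic, assuming 64-dimensional single-precision embeddings and 128-bit codes (i.e., c = 256 and m = 16). These settings are not stated in this section and are inferred here only because they reproduce the table's figures when MB is read as 2^20 bytes; the exact hyper-parameters are listed in Appendix C.1.

    # Back-of-the-envelope check of the Table 2 storage numbers (assumed settings: 64-dim
    # float32 embeddings and 16-byte codes per node; MB = 2**20 bytes).
    n_nodes = 1_871_031                            # ogbn-products
    raw_mb = n_nodes * 64 * 4 / 2**20              # float32 embedding table
    code_mb = n_nodes * 16 / 2**20                 # m * log2(c) = 128 bits = 16 bytes per node
    print(f"raw embeddings: {raw_mb:.2f} MB, binary codes: {code_mb:.2f} MB")
    # raw embeddings: 456.79 MB, binary codes: 28.55 MB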

5.3 Merchant Category Identification
In this section, we use a real-world application to evaluate the effectiveness of the proposed embedding compression method compared to the baseline. In this case, we apply our model to a large-scale transaction graph. The transaction volume of credit and debit card payments has proliferated in recent years with the rapid growth of small businesses and online retailers. When processing these payment transactions, recognizing each merchant's real identity (i.e., merchant category or business type) is vital to ensure the integrity of payment processing systems, as a merchant could falsify its identity by registering in an incorrect merchant category with payment processing companies. For example, a high-risk merchant may pretend to be in a low-risk merchant category by reporting a fake merchant category to avoid the higher processing fees associated with risky categories. Specific business types (e.g., gambling) are only permitted in some regions and territories, so a merchant could report a false merchant category to avoid scrutiny from banks and regulators. A merchant may also report the wrong merchant category by mistake.

We use the system depicted in Figure 5 to identify merchants with possibly faulty merchant categories. The merchant category identification system monitors the transactions of each merchant and notifies the investigation team whenever the identified merchant category mismatches the merchant's self-reported category. We represent the transaction data as a consumer-merchant bipartite graph and use the GNN-based classification model to identify a merchant's category. The performance of the classification model dictates the success of the overall identification system. As there are millions of consumers and merchants using the payment network, we want to build a scalable and accurate GNN model for the identification system. To achieve this goal, we compare the proposed hashing-based coding scheme with the baseline random coding scheme using real transaction data.

[Figure 5 depicts the system: transactions from the transaction database and reported categories from the merchant database form a transaction graph; the GNN model predicts each merchant's category; a detection rule compares the predicted and reported categories and raises an alert to the investigation team when they mismatch.]

Figure 5: The overall design for the merchant category identification system. The proposed method is used in the GNN model component of the system.

5.3.1 Dataset. We create a graph dataset by sampling transactions from January to August 2020. The resulting graph dataset consists of 17,943,972 nodes, of which 9,089,039 are consumer nodes and 8,854,933 are merchant nodes. There is a total of 651 merchant categories in the dataset. We use 70% of the merchant nodes as the training data, 10% of the merchant nodes as the validation data, and 20% of the merchant nodes as the test data. Because the classification model is solving a multi-class classification problem, we use accuracy (i.e., acc.) as the performance measurement. We also report the hit rate at different thresholds (i.e., hit@k), as the particular detection rule implemented in the identification system has its performance tied strongly to the hit rate of the model.

5.3.2 Implementation. We use the following hyper-parameter settings for the decoder: l = 3, d_c = d_m = 512, d_e = 64, c = 256, and m = 16. Because of the sheer size of the dataset, it is impossible to run the non-compressed baseline on this dataset. We once again use the PyTorch implementation of the GraphSAGE model with the mean pooling aggregator [13, 17]. We choose to move forward with the GraphSAGE model because it provides the best node classification performance in the previous experiments. We use the following hyper-parameter settings for the GraphSAGE model: number of layers = 2, number of neurons = 128, activation function = ReLU, and number of neighbors = 5. We use the following hyper-parameter settings for the AdamW optimizer [22] when optimizing the cross-entropy loss function: learning rate = 0.01, beta1 = 0.9, beta2 = 0.999, and weight decay = 0. We train GraphSAGE for 20 epochs with a batch size of 1024 and report the evaluation accuracy from the epoch with the best validation accuracy.
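The settings above can be collected into a small configuration, and the hit@k values reported next in Table 3 follow the usual top-k hit-rate reading. Both the configuration names and the exact hit@k definition below are assumptions for illustration, not the paper's released code.

    # Hyper-parameters from Section 5.3.2 and a standard top-k hit-rate helper (assumed definition).
    import torch

    decoder_cfg = dict(l=3, d_c=512, d_m=512, d_e=64, c=256, m=16)
    graphsage_cfg = dict(num_layers=2, num_neurons=128, activation="ReLU", num_neighbors=5)
    adamw_cfg = dict(lr=0.01, betas=(0.9, 0.999), weight_decay=0.0)

    def hit_at_k(logits, labels, k):
        # Fraction of samples whose true class appears among the top-k predicted classes.
        topk = logits.topk(k, dim=1).indices                 # (batch, k)
        return (topk == labels.unsqueeze(1)).any(dim=1).float().mean().item()

    logits, labels = torch.randn(8, 651), torch.randint(0, 651, (8,))
    print(hit_at_k(logits, labels, k=5))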

5.3.3 Result. Table 3 summarizes the experiment results. As expected, we observe performance gains on all performance measurements by using the proposed hashing-based coding method instead of the random coding method, similar to the experiments presented in the prior sections. The proposed model achieved over 10% improvement in terms of accuracy when applied in the merchant category identification system. In addition, with the help of the embedding compression method, it only takes around 2.14 GB to store both the binary codes and the decoder. The performance improvement of the hashing-based method over the baseline random coding method is mild compared to the results presented in Table 1. One possible reason for this observation is that the merchant category identification problem is more difficult than the tasks presented in Section 5.2 due to various types of data imbalance. For example, the restaurant category has over 100k merchants while the ambulance service category has fewer than 1k merchants. There are merchants visited by almost one million consumers, but there are also merchants visited by fewer than one hundred consumers. Nevertheless, the improvements in accuracy and hit rate are non-trivial, not to mention the drastic reduction in memory cost.

Table 3: Comparison of the classification model using different compression methods. We use Rand to denote the random coding method (i.e., ALONE) and Hash to denote the proposed hashing-based coding method.

Method    | acc.   | hit@5  | hit@10 | hit@20
Rand      | 0.1239 | 0.3725 | 0.4953 | 0.6233
Hash      | 0.1364 | 0.3867 | 0.5098 | 0.6350
% improve | 10.09% | 3.81%  | 2.93%  | 1.88%

6 CONCLUSION
In this work, we proposed a hashing-based coding scheme that generates compositional codes for compactly representing nodes in graph datasets. The proposed coding scheme outperforms the prior embedding compression method, which uses a random coding scheme, in almost all experiments. On top of that, the performance degradation caused by the lossy compression is minimal, as demonstrated in the experiment results. Because the proposed embedding compression method drastically reduces the memory cost associated with embedding learning, it is now possible to jointly train unique embeddings for all the nodes with GNN models on industrial-scale graph datasets, as demonstrated in Section 5.

6.1 Potential Impact and Future Directions
Aside from GNNs, the proposed method can also be combined with other kinds of models on tasks that require learning embeddings for a large set of entities. For example, it is common to have categorical features/variables with high cardinalities in financial technology/targeted advertising datasets, and embeddings are often used to represent these categorical features [6, 9, 40]. The proposed method is well suited for building memory-efficient deep learning models with these types of large-scale datasets, e.g., for click-through rate (CTR) prediction or recommendation systems. As a result, the proposed embedding compression method could potentially address the scalability problems associated with high-cardinality categorical features in many real-world machine learning problems. Determining the most effective auxiliary information for generating the binary codes should be an interesting direction to explore for different applications. For example, one practical adjustment could be to use higher-order adjacency matrices to replace the original adjacency matrix, since the higher-order auxiliary information, which captures connectivity information on a broader scope, could result in better embedding compression.

REFERENCES
[1] Manuel Blum, Robert W. Floyd, Vaughan R. Pratt, Ronald L. Rivest, Robert Endre Tarjan, et al. 1973. Time bounds for selection. J. Comput. Syst. Sci. (1973).
[2] J. Lawrence Carter and Mark N. Wegman. 1979. Universal classes of hash functions. Journal of Computer and System Sciences 18, 2 (1979), 143-154.
[3] Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In STOC. 380-388.
[4] Ting Chen, Lala Li, and Yizhou Sun. 2020. Differentiable product quantization for end-to-end embedding compression. In ICML.
[5] Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Link for word similarity tasks. https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz.
[6] Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving. In WSDM. 922-930.
[7] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. Link for metapath2vec. https://ericdongyx.github.io/metapath2vec/m2v.html.
[8] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In SIGKDD.
[9] Min Du, Robert Christensen, Wei Zhang, and Feifei Li. 2019. Pcard: personalized restaurants recommendation from card payment transaction records. In WWW.
[10] Chi Thang Duong, Thanh Dat Hoang, Ha The Hien Dang, Quoc Viet Hung Nguyen, and Karl Aberer. 2019. On node features for graph neural networks. arXiv preprint arXiv:1911.08795 (2019).
[11] Manaal Faruqui and Chris Dyer. 2014. Community evaluation and exchange of word vectors at wordvectors.org. In ACL: System Demonstrations. 19-24.
[12] Matthias Fey and Jan Eric Lenssen. 2019. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428 (2019).
[13] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. NeurIPS 30 (2017).
[14] Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In SIGIR. 639-648.
[15] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW.
[16] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: Datasets for machine learning on graphs. NeurIPS 33 (2020), 22118-22133.
[17] Ben Johnson, William L. Hamilton, and Can Güney Aksakalli. 2018. pytorch-graphsage. https://github.com/bkj/pytorch-graphsage.
[18] Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
[19] Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In ICLR.
[20] Lample et al. 2018. Unsupervised Machine Translation Using Monolingual Corpora Only. In ICLR.
[21] Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 2 (1982), 129-137.
[22] Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. In ICLR.
[23] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Link for word2vec. https://code.google.com/archive/p/word2vec/.
[24] Mikolov et al. 2013. Distributed representations of words and phrases and their compositionality. In NeurIPS.
[25] Paszke et al. 2019. PyTorch: An imperative style, high-performance deep learning library. NeurIPS 32 (2019), 8026-8037.
[26] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP. 1532-1543.
[27] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Link for GloVe.6B. https://nlp.stanford.edu/data/glove.6B.zip.
[28] Zongyue Qin, Yunsheng Bai, and Yizhou Sun. 2020. GHashing: Semantic Graph Hashing for Approximate Similarity Search in Graph Databases. In SIGKDD.
[29] Raphael Shu and Hideki Nakayama. 2018. Compressing Word Embeddings via Deep Compositional Code Learning. In ICLR.
[30] Jun Suzuki and Masaaki Nagata. 2016. Learning compact neural word embeddings by parameter space sharing. In IJCAI. 2046-2052.
[31] Dan Svenstrup, Jonas Meinertz Hansen, and Ole Winther. 2017. Hash embeddings for efficient word representations. arXiv preprint arXiv:1709.03933 (2017).
[32] Sho Takase and Sosuke Kobayashi. 2020. All word embeddings from one embedding. NeurIPS 33 (2020), 3775-3785.
[33] Sho Takase and Naoaki Okazaki. 2019. Positional Encoding to Control Output Sequence Length. In NAACL-HLT. 3999-4004.
[34] Qiaoyu Tan, Ninghao Liu, Xing Zhao, Hongxia Yang, Jingren Zhou, and Xia Hu. 2020. Learning to hash with graph neural networks for recommender systems. In WWW.
[35] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. ArnetMiner: extraction and mining of academic social networks. In SIGKDD.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
[37] Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In SIGIR. 165-174.
[38] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In ICML.
[39] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2018. How Powerful are Graph Neural Networks? In ICLR.
[40] Chin-Chia Michael Yeh, Dhruv Gelda, Zhongfang Zhuang, Yan Zheng, Liang Gou, and Wei Zhang. 2020. Towards a flexible embedding learning framework. In ICDMW. IEEE, 605-612.
[41] Chin-Chia Michael Yeh, Mengting Gu, Yan Zheng, Huiyuan Chen, Javid Ebrahimi, Zhongfang Zhuang, Junpeng Wang, Liang Wang, and Wei Zhang. 2022. Source Code. https://www.dropbox.com/s/1mixmhgbg4wiwtd/release.zip.
[42] Chin-Chia Michael Yeh, Zhongfang Zhuang, Yan Zheng, Liang Wang, Junpeng Wang, and Wei Zhang. 2020. Merchant Category Identification Using Credit Card Transactions. In Big Data. IEEE, 1736-1744.
[43] Zhou et al. 2020. Graph neural networks: A review of methods and applications. AI Open (2020).
A HASHING-BASED CODING THRESHOLD
We have compared the different choices of thresholds for binarizing real values into binary codes in Section 3.1 with an experiment. Here, we describe the details of that experiment. The experiment dataset consists of the first 200,000 pre-trained metapath2vec, metapath2vec++, or GloVe embeddings downloaded from the supplemental web pages [7, 8, 26, 27]. The dimension of the pre-trained metapath2vec/metapath2vec++ node embeddings is 128. The dimensionality of the GloVe word embeddings is 300. Because we repeat the experiment 100 times, we generate 100 seeds to ensure that both methods use the same basis for the random projection; the only difference between the two tested methods should be the threshold. In each trial, we first use the seed to generate a random matrix V ∈ R^{𝑑×𝑛bit}, where 𝑑 is 128 for node embeddings and 300 for word embeddings. Next, we project the embedding matrix (i.e., the 200,000 × 𝑑 embedding matrix) using V; then, we binarize the resulting matrix using either zero or the median of each row as the threshold. With the binary codes prepared, we count the number of collisions for each method. We use 24-bit binary codes in the experiment. Once all 100 trials are done, the results are presented as histograms in Figure 3 and Figure 6. The number of collisions is lower for the proposed median threshold than for the zero threshold baseline.
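The following is a minimal NumPy sketch of one trial of this procedure; the helper names, the per-bit interpretation of the median threshold, and the collision-counting convention are illustrative assumptions rather than the released implementation [41].

```python
# Sketch of one trial: random projection, binarization, and collision counting.
import numpy as np

def binary_codes(emb: np.ndarray, n_bit: int, seed: int, use_median: bool) -> np.ndarray:
    """Project the embeddings with a random matrix V and binarize each bit."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((emb.shape[1], n_bit))   # V in R^{d x n_bit}
    proj = emb @ V                                   # shape: (n_entities, n_bit)
    # Threshold each bit at zero, or at that bit's median over all entities.
    threshold = np.median(proj, axis=0) if use_median else 0.0
    return (proj > threshold).astype(np.uint8)

def count_collisions(codes: np.ndarray) -> int:
    """One possible convention: count entities that share a code with another entity."""
    _, counts = np.unique(codes, axis=0, return_counts=True)
    return int(counts[counts > 1].sum())

# Random data standing in for the 200,000 pre-trained embeddings (d = 128).
emb = np.random.default_rng(0).standard_normal((20000, 128))
for use_median in (False, True):
    codes = binary_codes(emb, n_bit=24, seed=42, use_median=use_median)
    print("median" if use_median else "zero", count_collisions(codes))
```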
[Figure 6 appears here: two histograms of the collision counts over the 100 trials, one panel labeled metapath2vec++ (horizontal axis from 0 to 102,159 collisions) and one labeled GloVe (0 to 10,401 collisions), each comparing hashing-zero against hashing-median.]

Figure 6: The experiments are performed on metapath2vec and GloVe 100 times. The distribution of the 100 outcomes (i.e., number of collisions) for each method is shown in the figure. The number of collisions is lower for the median threshold compared to the zero threshold.

B PRE-TRAINED EMBEDDING
B.1 Dataset
B.1.1 Word embedding. The pre-trained GloVe word embeddings are downloaded from the web page created by [26, 27]. The word embeddings are trained using the Wikipedia 2014 and Gigaword 5 datasets (a total of 6B tokens).

B.1.2 Word analogy. We downloaded a list of word analogy pairs from the repository of word2vec [23, 24]. The word analogy pairs are categorized into 14 categories. The experiment is performed as described by [24]. Given a word embedding matrix X and a word analogy pair (e.g., Athens:Greece::Bangkok:Thailand), we first prepare a query vector 𝑄 with X[Greece] − X[Athens] + X[Bangkok]. Next, we use 𝑄 to query X with cosine similarity. The answer is only considered correct if the most similar word is Thailand. The performance is measured with accuracy. We compute the accuracy for each category; then, we report the average of the 14 accuracy values as the performance for word analogy.
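A sketch of this word analogy evaluation is given below; the dictionary-based interface, the helper names, and the exclusion of the three query words from the candidate set are our own assumptions rather than details taken from the paper.

```python
# Sketch of the word-analogy evaluation (e.g., Athens:Greece::Bangkok:Thailand).
import numpy as np

def solve_analogy(emb: dict, a: str, b: str, c: str) -> str:
    """Return the word most cosine-similar to the query vector emb[b] - emb[a] + emb[c]."""
    candidates = [w for w in emb if w not in (a, b, c)]   # common practice: skip the query words
    X = np.stack([emb[w] for w in candidates])
    q = emb[b] - emb[a] + emb[c]                          # e.g., Greece - Athens + Bangkok
    sims = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    return candidates[int(np.argmax(sims))]

def analogy_accuracy(emb: dict, pairs: list) -> float:
    """pairs: (a, b, c, answer) tuples from one of the 14 categories."""
    hits = [solve_analogy(emb, a, b, c) == answer for a, b, c, answer in pairs]
    return float(np.mean(hits))

# The reported score is the mean of the 14 per-category accuracies.
```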
B.1.3 Word similarity. Thirteen word similarity datasets are downloaded from the repository of MUSE [5, 19, 20]. Each dataset consists of a list of paired words and their ground truth similarity scores. The experiment is performed as described by [11]. First, the cosine similarity between the word embeddings of each pair of words in a dataset is computed. Then, the order based on the cosine similarity is compared with the order based on the ground truth similarity scores. The comparison of the orders is measured with Spearman's rho. The resulting Spearman's rhos from the 13 datasets are averaged and reported.
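A sketch of this word similarity evaluation is given below, assuming each dataset is a list of (word, word, score) triples; the helper names and the skipping of out-of-vocabulary pairs are our own assumptions.

```python
# Sketch of the word-similarity evaluation: Spearman's rho between the ordering
# induced by cosine similarity and the ordering of the ground-truth scores.
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def similarity_score(emb: dict, dataset: list) -> float:
    pairs = [(w1, w2, s) for w1, w2, s in dataset if w1 in emb and w2 in emb]
    model = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    truth = [s for _, _, s in pairs]
    return float(spearmanr(model, truth).correlation)

# The reported number is the average rho over the 13 datasets:
# avg_rho = np.mean([similarity_score(emb, d) for d in datasets])
```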
B.1.4 Node embedding. The pre-trained metapath2vec embeddings, the pre-trained metapath2vec++ embeddings, the association between nodes (i.e., researchers), and the cluster labels (i.e., research areas) are downloaded from the web page created by [7, 8]. The node embeddings are trained with the AMiner dataset [35]. There are a total of 246,678 labeled researchers in the downloaded dataset. Each researcher is assigned one of the eight research areas. We use the 𝑘-means clustering algorithm [21] to cluster the embedding associated with each researcher; then, we measure the clustering performance with normalized mutual information.
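A sketch of this clustering evaluation using scikit-learn is given below; the choice of scikit-learn and the specific KMeans arguments (other than the eight clusters) are our own assumptions.

```python
# Sketch of the node-embedding evaluation: k-means clustering scored with NMI.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(embeddings: np.ndarray, labels: np.ndarray, n_clusters: int = 8) -> float:
    """Cluster the researcher embeddings and score them against the research-area labels."""
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    return float(normalized_mutual_info_score(labels, pred))

# embeddings: (246678, 128) metapath2vec vectors; labels: research-area ids in {0, ..., 7}.
```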
B.2 Hyper-parameter Setting
We use the following hyper-parameter settings for the decoders. For GloVe, we use 𝑙 = 3, 𝑑𝑐 = 𝑑𝑚 = 512, 𝑑𝑒 = 300, 𝑐 = 2, and 𝑚 = 128. For metapath2vec/metapath2vec++, we use 𝑙 = 3, 𝑑𝑐 = 𝑑𝑚 = 512, 𝑑𝑒 = 128, 𝑐 = 2, and 𝑚 = 128. Note that the decoder design is the same across the different coding schemes tested on the same dataset. We use a different 𝑑𝑒 for different embeddings because the dimensionality of the different pre-trained embeddings is different. The default hyper-parameter settings for AdamW [22] in PyTorch [25] are: learning rate = 0.001, 𝛽1 = 0.9, 𝛽2 = 0.999, and weight decay = 0.01. We train all models for 1,024 epochs with a batch size of 512.
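For concreteness, the corresponding PyTorch optimizer and training configuration might look as follows; the decoder module shown is only a placeholder for the actual decoder, and the snippet reflects PyTorch's default AdamW values rather than code taken from the paper.

```python
# Sketch of the optimizer/training configuration described in B.2 (GloVe case).
import torch

decoder = torch.nn.Sequential(              # placeholder for the actual decoder model
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 300))

optimizer = torch.optim.AdamW(
    decoder.parameters(),
    lr=1e-3,                                # learning rate = 0.001
    betas=(0.9, 0.999),                     # beta_1 = 0.9, beta_2 = 0.999
    weight_decay=0.01)                      # PyTorch's default AdamW weight decay

EPOCHS, BATCH_SIZE = 1024, 512              # 1,024 epochs with a batch size of 512
```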
B.3 Additional Results
To understand the relationship between the number of compressed entities and the compression ratio, we construct Table 4 to demonstrate how the compression ratio changes as the number of compressed entities is increased.

Table 4: Compression ratios for different numbers of compressed entities. The compression ratios of metapath2vec++ are omitted as the compression ratios are the same as metapath2vec.

# of Entities | 5000 | 10000 | 25000 | 50000 | 100000 | 200000
GloVe | 2.65 | 5.11 | 11.60 | 20.09 | 31.69 | 44.55
metapath2vec | 1.34 | 2.57 | 5.73 | 9.72 | 14.91 | 20.34

Aside from the results presented in Figure 1, we perform additional experiments to compare the proposed hashing-based coding method with the baseline random coding method under different settings of 𝑐 and 𝑚 while varying the number of compressed entities. The results are presented in Table 5. The proposed hashing-based coding method almost always performs better than the baseline random coding method, and the performance gap between the two methods increases as the number of entities compressed by the compression method increases. Because the settings of 𝑐 and 𝑚 also control the size of the decoder model, 𝑐 and 𝑚 affect the compression ratio. Table 6 shows the compression ratio under different settings of 𝑐 and 𝑚. Generally, settings with a lower compression ratio have better performance as the potential information loss is less. In the experiments, the bit size of the binary code is fixed to 128 bits; in other words, both {𝑐 = 256, 𝑚 = 16} and {𝑐 = 2, 𝑚 = 128} use 128-bit binary codes. The settings of 𝑐 and 𝑚 change the compression ratio by changing the decoder size. When using the {𝑐 = 256, 𝑚 = 16} setting, there will be 4,096 vectors in total stored in 16 codebooks. When using the {𝑐 = 2, 𝑚 = 128} setting, there will be 256 vectors in total stored in 2 codebooks. Because the {𝑐 = 256, 𝑚 = 16} setting has a larger model (i.e., a lower compression ratio), it is usually the setting that outperforms the other in terms of embedding quality. To select a suitable setting for {𝑐, 𝑚}, we suggest that users compute the potential memory usage and compression ratio for different settings of {𝑐, 𝑚}, and then select the setting with the lowest compression ratio that still meets the memory requirement.

Table 5: Experiment results on pre-trained embeddings with different settings of 𝑐 and 𝑚. We use random to denote the random coding method (i.e., ALONE), and hashing to denote the proposed hashing coding method.

Embedding (task) | 𝑐 | 𝑚 | Coding Method | # of Entities = 5000 | 10000 | 50000 | 200000
GloVe (analogy) | 2 | 128 | random | 0.578 | 0.444 | 0.074 | 0.005
GloVe (analogy) | 2 | 128 | hashing | 0.580 | 0.490 | 0.364 | 0.288
GloVe (analogy) | 4 | 64 | random | 0.593 | 0.460 | 0.095 | 0.007
GloVe (analogy) | 4 | 64 | hashing | 0.601 | 0.487 | 0.320 | 0.294
GloVe (analogy) | 16 | 32 | random | 0.621 | 0.536 | 0.151 | 0.013
GloVe (analogy) | 16 | 32 | hashing | 0.625 | 0.500 | 0.360 | 0.260
GloVe (analogy) | 256 | 16 | random | 0.671 | 0.653 | 0.426 | 0.084
GloVe (analogy) | 256 | 16 | hashing | 0.668 | 0.669 | 0.471 | 0.314
GloVe (similarity) | 2 | 128 | random | 0.544 | 0.517 | 0.371 | 0.106
GloVe (similarity) | 2 | 128 | hashing | 0.544 | 0.539 | 0.526 | 0.411
GloVe (similarity) | 4 | 64 | random | 0.580 | 0.548 | 0.430 | 0.222
GloVe (similarity) | 4 | 64 | hashing | 0.580 | 0.523 | 0.484 | 0.410
GloVe (similarity) | 16 | 32 | random | 0.550 | 0.581 | 0.450 | 0.162
GloVe (similarity) | 16 | 32 | hashing | 0.550 | 0.530 | 0.447 | 0.430
GloVe (similarity) | 256 | 16 | random | 0.574 | 0.567 | 0.525 | 0.361
GloVe (similarity) | 256 | 16 | hashing | 0.574 | 0.574 | 0.531 | 0.435
metapath2vec | 2 | 128 | random | 0.773 | 0.764 | 0.723 | 0.603
metapath2vec | 2 | 128 | hashing/pre-trained | 0.773 | 0.765 | 0.756 | 0.742
metapath2vec | 2 | 128 | hashing/graph | 0.779 | 0.768 | 0.747 | 0.717
metapath2vec | 4 | 64 | random | 0.772 | 0.769 | 0.727 | 0.627
metapath2vec | 4 | 64 | hashing/pre-trained | 0.780 | 0.770 | 0.751 | 0.751
metapath2vec | 4 | 64 | hashing/graph | 0.777 | 0.772 | 0.753 | 0.717
metapath2vec | 16 | 32 | random | 0.776 | 0.772 | 0.737 | 0.669
metapath2vec | 16 | 32 | hashing/pre-trained | 0.776 | 0.767 | 0.753 | 0.740
metapath2vec | 16 | 32 | hashing/graph | 0.776 | 0.779 | 0.764 | 0.742
metapath2vec | 256 | 16 | random | 0.779 | 0.781 | 0.762 | 0.726
metapath2vec | 256 | 16 | hashing/pre-trained | 0.779 | 0.777 | 0.774 | 0.758
metapath2vec | 256 | 16 | hashing/graph | 0.781 | 0.780 | 0.760 | 0.749
metapath2vec++ | 2 | 128 | random | 0.755 | 0.759 | 0.716 | 0.580
metapath2vec++ | 2 | 128 | hashing/pre-trained | 0.759 | 0.757 | 0.736 | 0.726
metapath2vec++ | 2 | 128 | hashing/graph | 0.754 | 0.750 | 0.734 | 0.701
metapath2vec++ | 4 | 64 | random | 0.762 | 0.748 | 0.726 | 0.613
metapath2vec++ | 4 | 64 | hashing/pre-trained | 0.761 | 0.746 | 0.738 | 0.712
metapath2vec++ | 4 | 64 | hashing/graph | 0.759 | 0.753 | 0.740 | 0.703
metapath2vec++ | 16 | 32 | random | 0.755 | 0.750 | 0.715 | 0.644
metapath2vec++ | 16 | 32 | hashing/pre-trained | 0.765 | 0.752 | 0.746 | 0.731
metapath2vec++ | 16 | 32 | hashing/graph | 0.761 | 0.756 | 0.742 | 0.727
metapath2vec++ | 256 | 16 | random | 0.763 | 0.764 | 0.746 | 0.706
metapath2vec++ | 256 | 16 | hashing/pre-trained | 0.760 | 0.766 | 0.750 | 0.743
metapath2vec++ | 256 | 16 | hashing/graph | 0.766 | 0.764 | 0.747 | 0.729
Table 6: Compression ratios for different numbers of compressed entities with different settings of 𝑐 and 𝑚. The compression ratios of metapath2vec++ are omitted as the ratios are the same as metapath2vec.

Embedding | 𝑐 | 𝑚 | # of Entities = 5000 | 10000 | 50000 | 200000
GloVe | 2 | 128 | 2.65 | 5.11 | 20.09 | 44.55
GloVe | 4 | 64 | 2.65 | 5.11 | 20.09 | 44.55
GloVe | 16 | 32 | 2.15 | 4.18 | 17.09 | 40.60
GloVe | 256 | 16 | 0.59 | 1.18 | 5.53 | 18.11
metapath2vec | 2 | 128 | 1.34 | 2.57 | 9.72 | 20.34
metapath2vec | 4 | 64 | 1.34 | 2.57 | 9.72 | 20.34
metapath2vec | 16 | 32 | 1.05 | 2.03 | 8.10 | 18.42
metapath2vec | 256 | 16 | 0.26 | 0.52 | 2.44 | 7.94

C NODE CLASSIFICATION AND LINK PREDICTION
C.1 Hyper-parameter Setting
We use the following hyper-parameter settings for the decoders: 𝑙 = 3, 𝑑𝑐 = 𝑑𝑚 = 512, and 𝑑𝑒 = 64. We use validation data to tune the settings of 𝑐, the settings of 𝑚, and the light/full method. We use the following hyper-parameter settings for the GraphSAGE model: number of layers = 2, number of neurons = 128, activation function = ReLU, and number of neighbors = 15. These settings are the default hyper-parameter settings from the GraphSAGE implementation [17]. For GCN [18], we use a two-layered structure with a hidden dimension of 128, self-loops, and skip connections. For SGC [38] and GIN [39], we also use a two-layered structure with a hidden dimension of 128, with the other hyper-parameters set to the default values in the PyG library [12]. We use the following hyper-parameter settings for the AdamW optimizer [22]: learning rate = 0.01, 𝛽1 = 0.9, 𝛽2 = 0.999, and weight decay = 0. We train GraphSAGE models for 10 epochs with a batch size of 256 and report the evaluation accuracy from the epoch with the best validation accuracy. We do not use mini-batches with GCN [18], SGC [38], and GIN [39]; these models are trained for 512 epochs, and the evaluation accuracy from the epoch associated with the best validation accuracy is reported.
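For concreteness, a sketch of a two-layer GCN with hidden dimension 128, self-loops, and a skip connection, built from the PyG building blocks [12], is shown below; the class structure, the way the skip connection is added, and the assumed input dimension of 𝑑𝑒 = 64 are our own illustrative choices rather than the exact implementation used in the paper.

```python
# Sketch of a two-layer GCN configuration as described in C.1 (PyG building blocks).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TwoLayerGCN(torch.nn.Module):
    def __init__(self, in_dim: int = 64, num_classes: int = 2, hidden: int = 128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden, add_self_loops=True)
        self.conv2 = GCNConv(hidden, hidden, add_self_loops=True)
        self.skip = torch.nn.Linear(in_dim, hidden)      # one simple form of skip connection
        self.out = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = self.conv2(h, edge_index) + self.skip(x)     # combine GCN output with the skip path
        return self.out(h)

# Full-batch training with AdamW as described above (lr = 0.01, weight decay = 0):
# model = TwoLayerGCN(in_dim=64, num_classes=..., hidden=128)
# optimizer = torch.optim.AdamW(model.parameters(), lr=0.01,
#                               betas=(0.9, 0.999), weight_decay=0.0)
```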
