
SALSA FRESCA: Angular Embeddings and Pre-Training for ML Attacks on Learning With Errors

Samuel Stevens 1,2,*   Emily Wenger 2   Cathy Li 3,2,*   Niklas Nolte 2   Eshika Saxena 2   François Charton 2,**   Kristin Lauter 2,**

* Work done at Meta; ** Co-senior authors. 1 The Ohio State University, 2 Meta AI Research, 3 University of Chicago. Correspondence to: François Charton, Kristin Lauter <[email protected], [email protected]>.

Preprint; under review. arXiv:2402.01082v1 [cs.CR] 2 Feb 2024.

Abstract

Learning with Errors (LWE) is a hard math problem underlying recently standardized post-quantum cryptography (PQC) systems for key exchange and digital signatures (Chen et al., 2022). Prior work (Wenger et al., 2022; Li et al., 2023a;b) proposed new machine learning (ML)-based attacks on LWE problems with small, sparse secrets, but these attacks require millions of LWE samples to train on and take days to recover secrets. We propose three key methods (better preprocessing, angular embeddings and model pre-training) to improve these attacks, speeding up preprocessing by 25× and improving model sample efficiency by 10×. We demonstrate for the first time that pre-training improves and reduces the cost of ML attacks on LWE. Our architecture improvements enable scaling to larger-dimension LWE problems: this work is the first instance of ML attacks recovering sparse binary secrets in dimension n = 1024, the smallest dimension used in practice for homomorphic encryption applications of LWE where sparse binary secrets are proposed (Lauter et al., 2011).

1. Introduction

Lattice-based cryptography was recently standardized by the US National Institute of Standards and Technology (NIST) in the 5-year post-quantum cryptography (PQC) competition (Chen et al., 2022). Lattice-based schemes are believed to be resistant to attacks by both classical and quantum computers. Given their importance for the future of information security, verifying the security of these schemes is critical, especially when special parameter choices are made, such as secrets which are binary or ternary, and sparse.

Both the NIST standardized schemes and homomorphic encryption (HE) rely on the hardness of the "Learning with Errors" (LWE) problem (Regev, 2005). NIST schemes standardize small secrets (binomial), and the HE standard includes binary and ternary secrets (Albrecht, 2017), with sparse versions used in practice.

The LWE problem is defined as follows: in dimension n, the secret s ∈ Z_q^n is a vector of length n with integer entries modulo q. Let A ∈ Z_q^{m×n} be a uniformly random matrix with m rows, and e ∈ Z_q^m an error vector sampled from a narrow Gaussian χ_e (see Table 2 for a notation summary). The goal is to find s given A and b, where

    b = A · s + e  mod q.    (1)

The hardness of this problem depends on the parameter choices: n, q, and the secret and error distributions.

Many HE implementations use sparse, small secrets to improve efficiency and functionality, where all but h entries of s are zero (so h is the Hamming weight of s). The non-zero elements have a limited range of values: 1 for binary secrets, 1 and −1 for ternary secrets. Sparse binary or ternary secrets allow for fast computations, since they replace the n-dimensional scalar product A · s by h sums. However, binary and ternary secrets might be less secure.

Most attacks on LWE rely on lattice reduction techniques, such as LLL or BKZ (Lenstra et al., 1982; Schnorr, 1987; Chen & Nguyen, 2011), which recover s by finding short vectors in a lattice constructed from A, b and q (Ajtai, 1996; Chen et al., 2020). BKZ attacks scale poorly to large dimension n and small moduli q (Albrecht et al., 2015).

ML-based attacks on LWE were first proposed in Wenger et al. (2022); Li et al. (2023a;b), inspired by viewing LWE as a linear regression problem on a discrete torus. These attacks train small transformer models (Vaswani et al., 2017) to extract a secret from eavesdropped LWE samples (A, b), using lattice reduction for preprocessing. Although Li et al. (2023b) solves medium-to-hard small sparse LWE instances, for example dimension n = 512, the approach is bottlenecked by preprocessing and by the larger dimensions used in HE schemes, such as n = 1024.
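To make the setup of Equation (1) concrete, here is a minimal NumPy sketch (not the paper's code) that generates LWE samples with a sparse binary secret; the toy parameters are illustrative and far smaller than the settings attacked in this work.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, q = 64, 128, 3329        # toy dimension, number of samples, modulus
h, sigma_e = 8, 3.0            # Hamming weight of the secret, error std

# Sparse binary secret: exactly h of the n entries are 1.
s = np.zeros(n, dtype=np.int64)
s[rng.choice(n, size=h, replace=False)] = 1

A = rng.integers(0, q, size=(m, n))                           # uniform matrix mod q
e = np.rint(rng.normal(0, sigma_e, size=m)).astype(np.int64)  # rounded Gaussian error
b = (A @ s + e) % q                                           # b = A . s + e mod q
```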


Table 1. Best results from our attack for LWE problems in dimensions n (higher is harder), modulus q (lower is harder) and Hamming weights h (higher is harder). Our work recovers secrets for n = 1024 for the first time in ML-based LWE attacks and reduces total attack time for n = 512, log2 q = 41 to only 50 hours (assuming full CPU parallelization).

    n     log2 q  highest h   (A, b) matrices needed   preprocessing time (hrs/CPU/matrix)   training time (hrs)   total time (hrs)
    512   41      44          1955                     13.1                                  36.9                  50.0
    768   35      9           1302                     12.4                                  14.8                  27.2
    1024  50      13          977                      26.0                                  47.4                  73.4

Table 2. Notation used in this work.

    Symbol   Description
    (A, b)   LWE matrix/vector pair, with b = A · s + e.
    (a, b)   An LWE vector/integer pair, one row of (A, b).
    q        Modulus of the LWE problem.
    s        The (unknown) true secret.
    h        Number of nonzero bits in the secret.
    e        Error vector, drawn from distribution χe.
    σe       Standard deviation of χe.
    n        Problem dimension (the dimension of a and s).
    t        The total number of LWE samples available.
    m        # LWE samples in each subset during reduction.
    s∗       Candidate secret, not necessarily correct.
    R        Matrix computed to reduce the coordinates of A.
    ρ        Preprocessing reduction factor; the ratio σ(RA)/σ(A).

In this work, we introduce several improvements to Li et al.'s ML attack, enabling secret recovery for harder LWE problems in less time. Our main contributions are:

• 25× faster pre-processing using Flatter (Ryan & Heninger, 2023) and interleaving with polishing (Charton et al., 2024) and BKZ (see §3).
• An encoder-only transformer architecture, coupled with an angular embedding for model inputs. This reduces the model's logical and computational complexity and halves the input sequence length, significantly improving model performance (see §4).
• The first use of pre-training for LWE to improve sample efficiency for ML attacks, further reducing preprocessing cost by 10× (see §5).

Overall, these improvements in both preprocessing and modeling allow us to recover secrets for harder instances of LWE, i.e. higher dimension and lower modulus, in less time and with fewer computing resources. A summary of our main results can be found in Table 1.

2. Context and Attack Overview

Wenger et al. (2022); Li et al. (2023a;b) demonstrated the feasibility of ML-based attacks on LWE. Li et al.'s attack has 2 parts: 1) data preprocessing using lattice reduction techniques; 2) model training interleaved with regular calls to a secret recovery routine, which uses the trained model as a cryptographic distinguisher to guess the secret. In this section we provide an overview of ML-based attacks in prior work to put our contributions in context.

2.1. Attack Part 1: LWE data preprocessing

The attack assumes that t = 4n initial LWE samples (a, b) (rows of (A, b)) with the same secret are available. Sampling m ≤ n of the 4n initial samples without replacement, m × n matrices A and associated m-dimensional vectors b are constructed.

The preprocessing step strives to reduce the norm of the rows of A by applying a carefully selected integer linear operator R. Because R is linear with integer entries, the transformed pairs (RA, Rb) mod q are also LWE pairs with the same secret, albeit with larger error. In practice, R is found by performing lattice reduction on the (m + n) × (m + n) matrix

    Λ = [   0       q · I_n ]
        [ ω · I_m      A    ],

and finding linear operators [C R] such that the norms of [C R] Λ = [ω · R   RA + q · C] are small. This achieves a reduction of the norms of the entries of RA mod q, but also increases the error in the calculation of Rb = RA · s + Re, making secret recovery more difficult. Although ML models can learn from noisy data, too much noise will make the distribution of Rb uniform on [0, q) and inhibit learning. The parameter ω controls the trade-off between norm reduction and error increase. Reduction strength is measured by ρ = σ(RA)/σ(A), where σ denotes the mean of the standard deviations of the rows of RA and A.

Li et al. (2023a) use BKZ (Schnorr, 1987) for lattice reduction. Li et al. (2023b) improves the reduction time by 45× via a modified definition of the Λ matrix and by interleaving BKZ2.0 (Chen & Nguyen, 2011) and polish (Charton et al., 2024) (see Appendix C).

This preprocessing step produces many (RA, Rb) pairs that can be used to train models. Individual rows of RA and associated elements of Rb, denoted as reduced LWE samples (Ra, Rb) with some abuse of notation, are used for model training. Both the subsampling of m samples from the original t LWE samples and the reduction step are done repeatedly and in parallel to produce 4 million reduced LWE samples, providing the data needed to train the model.
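The construction above can be summarized in a short NumPy sketch. It builds Λ and computes the reduction factor ρ as defined in the text; `reduce_lattice` is a hypothetical placeholder for an actual BKZ/Flatter run (§3).

```python
import numpy as np

def build_lambda(A, q, omega):
    """Lambda = [[0, q*I_n], [omega*I_m, A]] for an m x n LWE matrix A."""
    m, n = A.shape
    top = np.hstack([np.zeros((n, m), dtype=np.int64), q * np.eye(n, dtype=np.int64)])
    bottom = np.hstack([omega * np.eye(m, dtype=np.int64), A])
    return np.vstack([top, bottom])

def reduction_factor(A, RA):
    """rho = sigma(RA) / sigma(A), where sigma is the mean of the row-wise stds."""
    return np.std(RA, axis=1).mean() / np.std(A, axis=1).mean()

# Hypothetical usage, with reduce_lattice standing in for the Flatter/BKZ
# interleaving of Section 3 (omega = 10 as in the experiments):
#   Lam = build_lambda(A, q, omega=10)
#   reduced = reduce_lattice(Lam)           # rows have the form [omega*R_i | (R*A + q*C)_i]
#   R = reduced[:, :A.shape[0]] // 10       # recover R from the left block
#   rho = reduction_factor(A, (R @ A) % q)
```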


2.2. Attack Part 2: Model training and secret recovery

With 4 million reduced LWE samples (Ra, Rb), a transformer is trained to predict Rb from Ra. For simplicity, and without loss of generality, we will say the transformer learns to predict b from a. Li et al. train encoder-decoder transformers (Vaswani et al., 2017) with shared layers (Dehghani et al., 2019). Inputs and outputs consist of integers that are split into two tokens per integer by representing them in a large base B = q/k with k ≈ 10 and binning the lower digit to keep the vocabulary small as q increases. Training is supervised and minimizes a cross-entropy loss.

The key intuition behind ML attacks on LWE is that to predict b from a, the model must have learned the secret s. We extract the secret from the model by comparing model predictions for two vectors a and a′ which differ only on one entry. We expect the difference between the model's predictions for b and b′ to be small (of the same magnitude as the error) if the corresponding bit of s is zero, and large if it is non-zero. Repeating the process on all n positions yields a guess for the secret.

For ternary secrets, Li et al. (2023b) introduce a two-bit distinguisher, which leverages the fact that if secret bits si and sj have the same value, adding a constant K to inputs at both these indices should induce similar predictions. Thus, if ui is the ith basis vector and K is a random integer, we expect model predictions for a + K·ui and a + K·uj to be the same if si = sj. After using this pairwise method to determine whether the non-zero secret bits have the same value, Li et al. classify them into two groups. With only two ways to assign 1 and −1 to the groups of non-zero secret bits, this produces two secret guesses.

Wenger et al. (2022) tests a secret guess s∗ by computing the residuals b − a · s∗ over the 4n initial LWE samples. If s∗ is correct, the standard deviation of the residuals will be close to σe. Otherwise, it will be close to the standard deviation of a uniform distribution over Zq: q/√12.

For a given dimension, modulus and secret Hamming weight, the performance of ML attacks varies from one secret to the next. Li et al. (2023b) observe that the difficulty of recovering a given secret s from a set of reduced LWE samples (a, b) depends on the distribution of the scalar products a · s. If a large proportion of these products remain in the interval (−q/2, q/2) (assuming centering) even without a modulo operation, the problem is similar enough to linear regression that the ML attack will usually recover the secret. Li et al. introduce the statistic NoMod: the proportion of scalar products in the training set having this property. They demonstrate that large NoMod strongly correlates with likely secret recoveries for the ML attack.

Table 3. LWE parameters attacked in our work. For all settings, we attack both binary and ternary secret distributions χs.

    n     log2 q   h
    512   41       50 ≤ h ≤ 70
    768   35       5 ≤ h ≤ 15
    1024  50       5 ≤ h ≤ 15

2.3. Improving upon prior work

Li et al. (2023b) recover binary and ternary secrets for n = 512 and log2 q = 41 LWE problems with Hamming weight ≤ 63 in about 36 days, using 4,000 CPUs and one GPU (see their Table 1). Most of the computing resources are needed in the preprocessing stage: reducing one m × n A matrix takes about 35 days, and 4000 matrices must be reduced to build a training set of 4 million examples. This suggests two directions for improving attack performance. First, by introducing fast alternatives to BKZ2.0, we could shorten the time required to reduce one matrix. Second, we could minimize the number of samples needed to train the models, which would reduce the number of CPUs needed for the preprocessing stage.

Another crucial goal is scaling to larger dimensions n. The smallest standardized dimension for LWE in the HE Standard (Albrecht et al., 2021) is n = 1024. At present, ML attacks are limited by their preprocessing time and the length of the input sequences they use. The attention mechanism used in transformers is quadratic in the length of the sequence, and Li et al. (2023b) encode n-dimensional inputs with 2n tokens. A more efficient encoding would cut down on transformer processing time and memory consumption quadratically.

2.4. Parameters and settings in our work

Before presenting our innovations and results, we briefly discuss the LWE settings considered in our work. LWE problems are parameterized by the modulus q, the secret dimension n, the secret distribution χs (sparse binary/ternary) and the Hamming weight h of the secret (the number of non-zero entries). Table 3 specifies the LWE problem settings we attack. Proposals for LWE parameter settings in homomorphic encryption suggest using n = 1024 with sparse secrets (as low as h = 64), albeit with smaller q than we consider (Curtis & Player, 2019; Albrecht, 2017). Thus, it is important to show that ML attacks can work in practice for dimension n = 1024 if we hope to attack sparse secrets in real-world settings.

The LWE error distribution remains the same throughout our work: rounded Gaussian with σ = 3 (following Albrecht et al. (2021)). Table 10 in Appendix B contains all experimental settings, including values for the moduli q, preprocessing settings, and model training settings.
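As a concrete illustration of the two statistics described in §2.2, the sketch below implements the residual test of Wenger et al. (2022) for a candidate secret and the NoMod proportion of Li et al. (2023b); the acceptance threshold is a simple midpoint rule chosen here for illustration, not the exact criterion of prior work.

```python
import numpy as np

def centered_mod(x, q):
    """Map values mod q into the centered range (-q/2, q/2]."""
    r = np.mod(x, q)
    return np.where(r > q // 2, r - q, r)

def secret_guess_looks_correct(A, b, s_guess, q, sigma_e):
    """Residuals b - A.s are ~sigma_e wide for a correct guess, ~q/sqrt(12) otherwise.
    The midpoint threshold below is an illustrative choice."""
    res = centered_mod(b - A @ s_guess, q)
    return res.std() < (sigma_e + q / np.sqrt(12)) / 2

def nomod(RA, s, q):
    """NoMod: proportion of (centered) scalar products Ra . s that stay in
    (-q/2, q/2) without any modular reduction."""
    dots = centered_mod(RA, q) @ s
    return np.mean(np.abs(dots) < q / 2)
```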


3. Data Preprocessing

Prior work primarily used BKZ2.0 in the preprocessing/lattice reduction step. While effective, BKZ2.0 is slow for large values of n. We found that for n = 1024 matrices it could not finish a single loop in 3 days on an Intel Xeon Gold 6230 CPU, eventually timing out.

Table 4. Reduction performance and median time to reduce one matrix for Li et al. (2023b) vs. our work. Li et al.'s method fails for n > 512 on our compute cluster.

                          CPU · hours · matrix
    n     log2 q  ρ       (Li et al., 2023b)   Ours
    512   41      0.41    ≈ 350                13.1
    768   35      0.71    N/A                  12.4
    1024  50      0.70    N/A                  26.0

Table 5. Tradeoff between reduction quality and reduction error as controlled by ω. n = 1024, log2 q = 50.

    ω        1      3      5      7      10     13
    ρ        0.685  0.688  0.688  0.694  0.698  0.706
    ∥R∥/q    0.341  0.170  0.118  0.106  0.075  0.068

Our preprocessing pipeline. Our improved preprocessing incorporates a new reduction algorithm, Flatter (Ryan & Heninger, 2023; https://github.com/keeganryan/flatter), which promises similar reduction guarantees to LLL with much-reduced compute time, allowing us to preprocess LWE matrices in dimension up to n = 1024. We interleave Flatter and BKZ2.0, and switch between them after 3 loops of one algorithm result in ∆ρ < −0.001. Following Li et al. (2023b), we run polish after each Flatter and BKZ2.0 loop concludes. We initialize our Flatter and BKZ2.0 runs with block size 18 and α = 0.04, which provided the best empirical trade-off between time and reduction quality (see Appendix C for additional details), and make reduction parameters stricter as reduction progresses, up to block size 22 for BKZ2.0 and α = 0.025 for Flatter.

Preprocessing performance. Table 4 records the reduction ρ achieved for each (n, q) pair and the time required, compared with Li et al. (2023b). Recall that ρ measures the reduction in the standard deviation of the rows of A relative to their original uniform random distribution; lower is better. Reduction in standard deviation strongly correlates with reduction in vector norm, but for consistency with Li et al. (2023b) we use standard deviation. We record reduction time as CPU · hours · matrix, the amount of time it takes our algorithm to reduce one LWE matrix using one CPU. We parallelize our reduction across many CPUs. For n = 512, our methods improve reduction time by a factor of 25, and scale easily to n = 1024 problems. Overall, we find that Flatter improves the time (and consequently the resources required) for preprocessing, but does not improve the overall reduction quality.

Error penalty ω. We run all reduction experiments with penalty ω = 10. Table 5 demonstrates the tradeoff between reduction quality and reduction error, as measured by ∥R∥/q, for n = 1024, log2 q = 50 problems. Empirically, we find that ∥R∥/q < 0.09 is sufficiently low to recover secrets from LWE problems with e ∼ N(0, 3²).

Experimental setup. In practice, we do not preprocess a different set of 4n (a, b) pairs for each secret recovery experiment because preprocessing is so expensive. Instead, we use a single set of preprocessed Ra rows combined with an arbitrary secret to produce different (Ra, Rb) pairs for training. We first generate a secret s and calculate b = A · s + e for the original 4n pairs. Then, we apply the many different R produced by preprocessing to A and b to produce many (Ra, Rb) pairs with reduced norm. This technique enables analyzing attack performance across many dimensions (varying h, model parameters, etc.) in a reasonable amount of time. Preprocessing a new dataset for each experiment would make evaluation at scale near-impossible.
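The following is a control-flow sketch of the interleaved reduction loop described above. The helpers run_flatter, run_bkz, run_polish and measure_rho are hypothetical stand-ins for the Flatter binary, BKZ2.0, the polish routine of Charton et al. (2024) and the ρ statistic; the stall-based switching encodes one reading of the rule in the text, and the gradual tightening of block size and α is omitted for brevity.

```python
def reduce_matrix(Lam, run_flatter, run_bkz, run_polish, measure_rho,
                  target_rho=0.7, stall_loops=3, min_gain=0.001):
    """Interleave Flatter and BKZ2.0 loops, polishing after each one, and switch
    algorithms when several consecutive loops stop improving rho (assumed reading)."""
    use_flatter = True
    rho = measure_rho(Lam)
    stalled = 0
    while rho > target_rho:                      # target_rho is illustrative
        step = run_flatter if use_flatter else run_bkz
        Lam = run_polish(step(Lam))              # polish after every Flatter/BKZ loop
        new_rho = measure_rho(Lam)
        stalled = 0 if rho - new_rho > min_gain else stalled + 1
        if stalled >= stall_loops:               # three unproductive loops: switch algorithm
            use_flatter, stalled = not use_flatter, 0
        rho = new_rho
    return Lam
```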


4. Model Architecture

Previous ML attacks on LWE use an encoder-decoder transformer (Vaswani et al., 2017). A bidirectional encoder processes the input and an auto-regressive decoder generates the output. Integers in both input and output sequences were tokenized as two digits in a large base smaller than q. We propose a simpler and faster encoder-only model and introduce an angular embedding for integers modulo q.

4.1. Encoder-only model

Encoder-decoder models were originally introduced for machine translation, because their outputs can be longer than their inputs. However, they are complex and slow at inference, because the decoder must run once for each output token. For LWE, outputs (one integer) are always shorter than inputs (a vector of n integers). Li et al. (2023b) observe that an encoder-only model, a 4-layer bidirectional transformer based on DeBERTa (He et al., 2020), achieves comparable performance with their encoder-decoder model. Here, we experiment with simpler encoder-only models, without DeBERTa's disentangled attention mechanism, with 2 to 8 layers. Their outputs are max-pooled across the sequence dimension, and decoded by a linear layer for each output digit (Figure 1). We minimize a cross-entropy loss. This simpler architecture improves training speed by 25%.

[Figure 1: architecture diagram. Inputs a1 ... an pass through the angular embedding and a trainable projection We, N layers of self-attention and MLP blocks, a MaxPool over the sequence, and a trainable projection Wo to an output point (x, y) ∈ R².]
Figure 1. Encoder-only transformer (§4.1) with angular embedding architecture. See §4.2 for an explanation.

4.2. Angular embedding

Most transformers process sequences of tokens from a fixed vocabulary, encoded in R^d by a learned embedding. Typical vocabulary sizes vary from 32K in Llama 2 (Touvron et al., 2023) to 256K in Jurassic-1 (Lieber et al., 2021). The larger the vocabulary, the more data is needed to learn the embeddings. LWE inputs and outputs are integers from Zq. For n ≥ 512 and q ≥ 2^35, encoding integers with one token creates a too-large vocabulary. To avoid this, Li et al. (2023a;b) encode integers with two tokens by representing them in base B = q/k with small k and binning the low digit so the overall vocabulary has < 10K tokens.

This approach has two limitations. First, input sequences are 2n tokens long, which slows training as n grows, because transformers' attention mechanism scales quadratically in sequence length. Second, all vocabulary tokens are learned independently, and the inductive bias of number continuity is lost. Prior work on modular arithmetic (Power et al., 2022; Liu et al., 2022) shows that trained integer embeddings for these problems move towards a circle, e.g. with the embedding of 0 close to those of 1 and q − 1.

To address these shortcomings, we introduce an angular embedding which strives to better represent the problem's modular structure in embedding space, while encoding integers with only one token. An integer a ∈ Zq is first converted to an angle by the transformation a → 2πa/q, and then to the point (sin(2πa/q), cos(2πa/q)) ∈ R². All input integers (in Zq) are therefore represented as points on the 2-dimensional unit circle, which is then embedded as an ellipse in R^d via a learned linear projection We. Model outputs, obtained by max-pooling the encoder output sequence, are decoded as points in R² by another linear projection Wo. The training loss is the L2 distance between the model prediction and the point representing b on the unit circle.
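A minimal PyTorch sketch of the encoder-only model with an angular embedding (§4.1 and §4.2) is shown below. Layer sizes follow Table 7's 4-layer, 512-dimensional configuration, but the positional embedding and other details are assumptions not specified in this excerpt; it is a sketch, not the released implementation.

```python
import math
import torch
import torch.nn as nn

def to_circle(x, q):
    """Map integers mod q to points on the unit circle in R^2."""
    theta = 2 * math.pi * x.float() / q
    return torch.stack([torch.sin(theta), torch.cos(theta)], dim=-1)

class AngularEncoder(nn.Module):
    def __init__(self, q, n, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.q = q
        self.embed = nn.Linear(2, d_model)           # learned projection W_e
        self.pos = nn.Embedding(n, d_model)          # positional embedding (an assumption)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, 2)             # learned projection W_o

    def forward(self, a):                            # a: (batch, n) integers mod q
        positions = torch.arange(a.shape[1], device=a.device)
        x = self.embed(to_circle(a, self.q)) + self.pos(positions)
        x = self.encoder(x)                          # (batch, n, d_model)
        x = x.max(dim=1).values                      # max-pool over the sequence
        return self.out(x)                           # predicted point in R^2

# Training step: L2 distance between the prediction and b's point on the circle.
# model = AngularEncoder(q, n)
# loss = ((model(a) - to_circle(b, q)) ** 2).sum(dim=-1).mean()
```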

4.3. Experiments

Here, we compare our new architecture with previous work, and assess its performance on larger instances of LWE (n = 768 and 1024). All comparisons with prior work are performed on LWE instances with n = 512 and log2 q = 41 for binary and ternary secrets, using the same preprocessing techniques as Li et al. (2023b).

Table 6. Best recovery results for binary and ternary secrets on various model architectures (n = 512, log2 q = 41). Encoder-Decoder and DeBERTa models and recovery results are from Li et al. (2023b); we benchmark DeBERTa samples/sec on our hardware. Encoder (Vocab.) uses prior work's vocabulary embedding. Encoder (Angular) is presented in Section 4.2.

    Architecture        Samples/Sec   Largest Binary h   Largest Ternary h
    Encoder-Decoder     200           63                 58
    Encoder (DeBERTa)   83            63                 60
    Encoder (Vocab.)    256           63                 66
    Encoder (Angular)   610           66                 66

Encoder-only models vs prior designs. In Table 6, we compare our encoder-only model (with and without the angular embedding) with the encoder-decoder and the DeBERTa models from Li et al. (2023b). The encoder-only and encoder-decoder models are trained for 72 hours on one 32GB V100 GPU. The DeBERTa model, which requires more computing resources, is trained for 72 hours on four 32GB V100 GPUs. Our encoder-only model, using the same vocabulary embedding as prior work, processes samples 25% faster than the encoder-decoder architecture and is 3× faster than the DeBERTa architecture. With the angular embedding, training is 2.4× faster, because input sequences are half as long, which accelerates attention calculations. Our models also outperform prior designs in terms of secret recovery: previous models recover binary secrets with Hamming weight 63 and ternary secrets with Hamming weight 60. Encoder-only models with an angular embedding recover binary and ternary secrets with Hamming weights up to 66. §D.1 in the Appendix provides detailed results.

Impact of model size. Table 7 compares encoder-only models of different sizes (using angular embeddings). All models are run for up to 72 hours on one 32GB V100 GPU on n = 512, log2 q = 41.

Table 7. Encoder-only (angular embedding) performance for varying # of transformer layers and embedding dimensions (n = 512, log2 q = 41, binary secrets). Samples per second quantifies training speed; "Recovered" is % recovered out of 100 secrets with h from 49-67; "Hours" is mean hours to recovery.

    Layers   Emb. Dim.   Params   Samples/S   Recovered   Hours
    2        128         1.3M     2560        23%         18.9
    4        256         4.1M     1114        22%         19.6
    4        512         14.6M    700         25%         26.2
    6        512         20.9M    465         25%         28.1
    8        512         27.2M    356         24%         30.3

We observe that larger models yield little benefit in terms of secret recovery rate, and small models are significantly faster (both in terms of training and recovery). Additional results are in Appendix D.2. We use 4 layers with embedding dimension 512 for later experiments because it recovers the most secrets (25%) the fastest.

Embedding ablation. Next, we compare our new angular embedding scheme to the vocabulary embeddings. The better embedding should recover both more secrets and more difficult ones (as measured by Hamming weight and NoMod; see §2.2 for a description of NoMod). To measure this, we run attacks on identical datasets, n = 512, log2 q = 41, h = 57-67 with 10 unique secrets per h. One set of models uses the angular embedding, while the other uses the vocabulary embedding from Li et al. (2023b).

To check whether the angular embedding outperforms the vocabulary embedding, we measure the attacked secrets' NoMod. We expect the better embedding to recover secrets with lower NoMod and higher Hamming weights (i.e. harder secrets). As Figure 2 and Table 6 demonstrate, this is indeed the case. Angular embeddings recover secrets with NoMod = 56 vs. 63 for the vocabulary embedding (see Table 29 in Appendix F for raw numbers). Furthermore, angular embedding models recover more secrets than those with vocabulary embeddings (16 vs. 2) and succeed on higher h secrets (66 vs. 63). We conclude that an angular embedding is superior to a vocabulary embedding because it recovers harder secrets.

Figure 2. Count of # successes (orange) and failures (blue) for various NoMod proportions for vocabulary-based and angular embedding schemes. n = 512, log2 q = 41, h = 57-67.

Scaling n. Finally, we use our proposed architecture improvements to scale n. The long input sequence length in prior work made scaling attacks to n ≥ 512 difficult, due to both memory footprint and slow model processing speed. In contrast, our more efficient model and angular embedding scheme (§4.1, §4.2) enable us to attack n = 768 and n = 1024 secrets. Table 8 shows that we can recover up to h = 9 for both n = 768 and n = 1024 settings, with < 24 hours of training on a single 32GB V100 GPU. In §5.1 we show recovery of h = 13 for n = 1024 using a more sample-efficient training strategy. We run identical experiments using prior work's encoder-decoder model, but fail to recover any secrets with n > 512 in the same computational budget. Our proposed model improvements lead to the first successful ML attack that scales to real-world values of n: proposed real-world use cases for LWE-based cryptosystems recommend using dimensions n = 768 and n = 1024 (Avanzi et al., 2021; Albrecht et al., 2021; Curtis & Player, 2019), although they also recommend smaller q and harder secret distributions than we currently consider.

5. Training Methods

The final limitation we address is the 4 million preprocessed LWE samples needed for model training. Recall that each training sample is a row of a reduced LWE matrix RA ∈ Z_q^{m×n}, so producing 4 million training samples requires reducing ≈ 4,000,000/(m+n) LWE matrices. (We bound the number of reduced matrices needed because some rows of the reduction matrices R are 0 and are discarded after preprocessing. Between m and n + m nonzero rows of R are kept, so we must reduce between 4,000,000/(m+n) and 4,000,000/m matrices.) Even with the preprocessing improvements highlighted in §3, for n = 1024, this means preprocessing between 2000 and 4500 matrices at the cost of 26 hours per CPU per matrix. To further reduce total attack time, we propose training with fewer samples and pre-training models.

5.1. Training with Fewer Samples

We first consider simply reducing training dataset size and seeing if the attack succeeds. Li et al. always use 4M training examples. To test whether this many is needed for secret recovery, we subsample datasets of size N = [100K, 300K, 1M, 3M] from the original 4M examples preprocessed via the techniques in §3. We train models to attack LWE problems n = 512, log2 q = 41 with binary secrets and h = 30-45. Each attack is given 3 days on a single V100 32GB GPU.

Table 9 shows that our attack still succeeds, even when as few as 300K samples are used. We recover approximately the same number of secrets with 1M samples as with 4M, and both settings achieve the same best h recovery. Using 1M rather than 4M training samples reduces our preprocessing time by 75%. We run similar experiments for n = 768 and n = 1024, and find that we can recover up to h = 13 secrets for n = 1024 with only 1M training samples. Those results are in Tables 17 to 20 in Appendix E.1.


Table 8. Secret recovery time for larger dimensions n with an encoder-only model with 4 layers, embedding dimension 512 and angular embedding. We only report the training hours (on one V100 GPU) for successful secret recoveries, out of 10 secrets per h value.

                                Hours to recovery for different h values
    n     log2 q  Samples/Sec   h = 5                                    h = 7                   h = 9         h = 11
    768   35      355           3.1, 18.6, 18.9                          9.1, 21.6, 24.9         15.9, 27.7    -
    1024  50      256           1.6, 6.2, 7.6, 8.8, 34.0, 41.4, 43.7     4.6, 7.4, 13.5, 16.7    21.3          -

Table 9. # Training samples needed to recover secrets without pre-training (n = 512, log2 q = 41, binary secrets). We report # of secrets recovered among h = 30-45 (10 secrets for each h), the highest h recovered, and the average attack time among secrets recovered with 1M, 3M and 4M training samples.

    Training samples   Total #   Best h   Mean Hours
    300K               1         32       30.3
    1M                 18        44       28.0 ± 11.6
    3M                 21        44       26.5 ± 10.5
    4M                 22        44       25.3 ± 8.9

5.2. Model Pre-Training

To further improve sample efficiency, we introduce the first use of pre-training in LWE. Pre-training models has improved sample efficiency in language (Devlin et al., 2019; Brown et al., 2020) and vision (Kolesnikov et al., 2020); we hypothesize similar improvements are likely for LWE. Thus, we frame secret recovery as a downstream task and pre-train a transformer to improve sample efficiency when recovering new secrets, further reducing preprocessing costs.

Formally, an attacker would like to pre-train a model parameterized by θ on samples {(a′, b′)} such that the pre-trained parameters θ∗ are a better-than-random initialization for recovering a secret from new samples {(a, b)}. Although pre-training to get θ∗ may require significant compute, θ∗ can initialize models for many different secret recoveries, amortizing the initial cost.

Pre-training setup. First, we generate and reduce a new dataset of 4 million Ra′ samples with which we pre-train a model. The θ∗-initialized model will then train on the (Ra, Rb) samples used in §5.1, leading to a fair comparison between a randomly-initialized model and a θ∗-initialized model. Pre-training on the true attack dataset is unfair and unrealistic, since we assume the attacker will train θ∗ before acquiring the real LWE samples they wish to attack.

The pre-trained weights θ∗ should be a good initialization for recovering many different secrets. Thus, we use many different secrets with the 4M rows Ra′. In typical recovery, we have 4M rows Ra and 4M targets Rb. In the pre-training setting, however, we generate 150 different secrets so that each Ra′ has 150 different possible targets. So the model can distinguish targets, we introduce 150 special vocabulary tokens t_si, one for each secret si. We concatenate the appropriate token t_si to the row Ra′ paired with Rb′ = Ra′ · si + e. Thus, from 4M rows Ra′, we produce 600M triplets (Ra′, t_si, Rb′). The model learns to predict Rb′ from a row Ra′ and an integer token t_si (t_si is embedded using a learned vocabulary).

We hypothesize that including many different (Ra′, Rb′) pairs produced from different secrets will induce strong generalization capabilities. When we train the θ∗-initialized model on new data with an unseen secret s, we indicate that there is a new secret by adding a new token t_0 to the model vocabulary that serves the same function as t_si above. We randomly initialize the new token embedding, but initialize the remaining model parameters with those of θ∗. Then we train and extract secrets as in §2.2.

Experiments. For pre-training data, we use 4M Ra′ rows reduced to ρ = 0.41, generated from a new set of 4n LWE (a, b) samples. We use binary and ternary secrets with Hamming weights from 30 to 45, with 5 different secrets for each weight, for a total of 150 secrets and 600M (Ra′, t_si, Rb′) triplets for pre-training. We pre-train an encoder-only transformer with angular embeddings for 3 days on 8x 32GB V100 GPUs with a global batch size of 1200. The model sees 528M total examples, less than one epoch. We do not run the distinguisher step during pre-training because we are not interested in recovering these secrets.

Figure 3. Mean minimum number of samples needed to recover binary secrets as a function of # pre-training steps (n = 512, log2 q = 41, binary secrets).

We use the weights θ∗ as the initialization and repeat §5.1's experiments. We use three different checkpoints from pre-training as θ∗ to evaluate the effect of different pre-training amounts: 80K, 240K and 440K pre-training steps.
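A minimal sketch of the pre-training data construction is given below: each reduced row Ra′ is paired with every pre-training secret s_i and tagged with that secret's token. It follows the expression Rb′ = Ra′ · s_i + e given above; batching, tokenization and the training loop are omitted.

```python
import numpy as np

def build_pretraining_triplets(RA, secrets, q, sigma_e, rng):
    """RA: (N, n) array of reduced rows Ra'; secrets: list of length-n secret vectors.
    Yields (row, secret_token_id, target) triplets, N * len(secrets) in total."""
    for token_id, s in enumerate(secrets):
        e = np.rint(rng.normal(0, sigma_e, size=RA.shape[0])).astype(np.int64)
        targets = (RA @ s + e) % q               # Rb' = Ra' . s_i + e, per the text
        for row, rb in zip(RA, targets):
            yield row, token_id, rb              # token_id plays the role of t_{s_i}
```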


Results. Figure 3 demonstrates that pre-training improves sample efficiency during secret recovery. We record the minimum number of samples required to recover each secret for each checkpoint. Then we average these minimums among secrets recovered by all checkpoints (including the randomly initialized model) to fairly compare them. We find that 80K steps of pre-training improves sample efficiency, dropping the mean samples required from 1.7M to 409K. However, further pre-training does not further improve sample efficiency.

Recall that using fewer samples harms recovery speed (see Table 9). Figure 4 shows trends in recovery speed for pre-trained and randomly initialized models for different numbers of training samples. We find that any pre-training slows down recovery, but further pre-training might minimize this slowdown. Appendix E has complete results.

Figure 4. How pre-training affects mean hours to secret recovery for different training dataset sizes (n = 512, log2 q = 41, binary secrets). Among secrets recovered by all checkpoints; trend lines calculated using only the pre-trained checkpoints.

6. Related Work

Using machine learning for cryptanalysis. An increasing number of cryptanalytic attacks in recent years have incorporated ML models. Often, models are used to strengthen existing cryptanalysis approaches, such as side channel or differential analysis (Chen & Yu, 2021). Of particular interest is recent work that successfully used ML algorithms to aid side-channel analysis of Kyber, a NIST-standardized PQC method (Dubrova et al., 2022). Other ML-based cryptanalysis schemes train models on plaintext/ciphertext pairs or similar data, enabling direct recovery of cryptographic secrets. Such approaches have been studied against a variety of cryptosystems, including hash functions (Goncharov, 2019), block ciphers (Gohr, 2019; Benamira et al., 2021; Chen & Yu, 2021; Alani, 2012; So, 2020; Kimura et al., 2021; Baek & Kim, 2020), and substitution ciphers (Ahmadzadeh et al., 2021; Srivastava & Bhatia, 2018; Aldarrab & May, 2020; Greydanus, 2017). The three ML-based LWE attacks upon which this work builds, SALSA (Wenger et al., 2022), PICANTE (Li et al., 2023a), and VERDE (Li et al., 2023b), also take this approach.

AI for Math. The use of neural networks for arithmetic was first considered by Siu & Roychowdhury (1992), and recurrent networks by Zaremba et al. (2015), Kalchbrenner et al. (2015) and Kaiser & Sutskever (2015). Transformers have been used to solve problems in symbolic and numerical mathematics, integration (Lample & Charton, 2020), linear algebra (Charton, 2022), arithmetic (Charton, 2024) and theorem proving (Polu & Sutskever, 2020). With the advent of large language models, recent research has focused on training or fine-tuning language models on math word problems: problems of mathematics expressed in natural language (Meng & Rumshisky, 2019; Griffith & Kalita, 2021; Lee et al., 2023). The limitations of these approaches were discussed by Nogueira et al. (2021) and Dziri et al. (2023). Modular arithmetic was first considered by Power et al. (2022) and Wenger et al. (2022). Its difficulty was discussed by Palamas (2017) and Gromov (2023).

7. Discussion & Future Work

Our contributions are spread across multiple fronts: faster preprocessing (25× fewer CPU hours), a simpler architecture (25% more samples/sec), better token embeddings (2.4× faster training) and the first use of pre-training for LWE (10× fewer samples). These lead to 250× fewer CPU hours spent preprocessing and 3× more samples/sec for n = 512, log2 q = 41 LWE problems, and to the first ML attack on LWE for n = 768 and n = 1024. Although we have made substantial progress in pushing the boundaries of machine learning-based attacks on LWE, much future work remains in both building on this pre-training work and improving models' capacity to learn modular arithmetic.

Pre-training. Pre-training improves sample efficiency after 80K steps; however, further improvements to pre-training should be explored. First, our experiments used just one set of 4M RA combined with multiple secrets. To encourage generalization to new RA, pre-training data should include different original As. Second, our transformer only has 14.1M parameters, and may be too small to benefit from pre-training. Third, pre-training data does not have to come from a sniffed set of 4n (a, b) samples. Rather than use expensive preprocessed data, we could simulate reduction and generate random rows synthetically that look like RA.

ML for modular arithmetic. NoMod experiments consistently show that more training data which does not wrap around the modulus leads to more successful secret recovery (see Figure 2 and Li et al. (2023b)). This explains why a smaller modulus q is harder for ML approaches to attack, and indicates that models are not yet learning modular arithmetic (Wenger et al., 2022). Further progress on models learning modular arithmetic problems will likely help to achieve secret recovery for smaller q and larger h.


Acknowledgements

We thank Mark Tygert for his always insightful comments and Mohamed Malhou for running experiments.

8. Impact Statement

The main ethical concern related to this work is the possibility of our attack compromising currently-deployed PQC systems. However, at present, our proposed attack does not threaten current standardized systems. If our attack scales to higher h and lower q settings, then its impact is significant, as it would necessitate changing PQC encryption standards. For reproducibility of these results, our code will be open sourced after publication and is available to reviewers upon request.

References

Ahmadzadeh, E., Kim, H., Jeong, O., and Moon, I. A Novel Dynamic Attack on Classical Ciphers Using an Attention-Based LSTM Encoder-Decoder Model. IEEE Access, 2021.
Ajtai, M. Generating hard instances of lattice problems. In Proc. of the ACM Symposium on Theory of Computing, 1996.
Alani, M. M. Neuro-cryptanalysis of DES and triple-DES. In Proc. of NeurIPS, 2012.
Albrecht, M., Chase, M., Chen, H., et al. Homomorphic encryption standard. In Protecting Privacy through Homomorphic Encryption, pp. 31–62. 2021. https://eprint.iacr.org/2019/939.
Albrecht, M. R. On dual lattice attacks against small-secret LWE and parameter choices in HElib and SEAL. In Proc. of EUROCRYPT, 2017. ISBN 978-3-319-56614-6.
Albrecht, M. R., Player, R., and Scott, S. On the concrete hardness of learning with errors. Journal of Mathematical Cryptology, 9(3):169–203, 2015.
Albrecht, M. R., Göpfert, F., Virdia, F., and Wunderer, T. Revisiting the expected cost of solving uSVP and applications to LWE. In Proc. of ASIACRYPT, 2017.
Aldarrab, N. and May, J. Can sequence-to-sequence models crack substitution ciphers? arXiv preprint arXiv:2012.15229, 2020.
Avanzi, R., Bos, J., Ducas, L., Kiltz, E., Lepoint, T., Lyubashevsky, V., Schanck, J. M., Schwabe, P., Seiler, G., and Stehlé, D. CRYSTALS-Kyber (version 3.02): Submission to round 3 of the NIST post-quantum project. 2021. Available at https://pq-crystals.org/.
Baek, S. and Kim, K. Recent advances of neural attacks against block ciphers. In Proc. of SCIS, 2020.
Benamira, A., Gerault, D., Peyrin, T., and Tan, Q. Q. A deeper look at machine learning-based cryptanalysis. In Proc. of the Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2021.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Proc. of NeurIPS, 2020.
Charton, F. Linear algebra with transformers. Transactions on Machine Learning Research, 2022.
Charton, F. Can transformers learn the greatest common divisor? arXiv:2308.15594, 2024.
Charton, F., Lauter, K., Li, C., and Tygert, M. An efficient algorithm for integer lattice reduction. SIAM Journal on Matrix Analysis and Applications, 45(1), 2024.
Chen, H., Chua, L., Lauter, K., and Song, Y. On the Concrete Security of LWE with Small Secret. Cryptology ePrint Archive, Paper 2020/539, 2020. URL https://eprint.iacr.org/2020/539.
Chen, L., Moody, D., Liu, Y.-K., et al. PQC Standardization Process: Announcing Four Candidates to be Standardized, Plus Fourth Round Candidates. US Department of Commerce, NIST, 2022. https://csrc.nist.gov/News/2022/pqc-candidates-to-be-standardized-and-round-4.
Chen, Y. and Nguyen, P. Q. BKZ 2.0: Better Lattice Security Estimates. In Proc. of ASIACRYPT, 2011.
Chen, Y. and Yu, H. Bridging Machine Learning and Cryptanalysis via EDLCT. Cryptology ePrint Archive, 2021. https://eprint.iacr.org/2021/705.
Curtis, B. R. and Player, R. On the feasibility and impact of standardising sparse-secret LWE parameter sets for homomorphic encryption. In Proc. of the ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2019.
Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal transformers. In Proc. of ICLR, 2019.


Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019. URL https://aclanthology.org/N19-1423.
Dubrova, E., Ngo, K., and Gärtner, J. Breaking a fifth-order masked implementation of crystals-kyber by copy-paste. Cryptology ePrint Archive, 2022. https://eprint.iacr.org/2022/1713.
Dziri, N., Lu, X., Sclar, M., Li, X. L., et al. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654, 2023.
Gohr, A. Improving attacks on round-reduced speck32/64 using deep learning. In Proc. of the Annual International Cryptology Conference, 2019.
Goncharov, S. V. Using fuzzy bits and neural networks to partially invert few rounds of some cryptographic hash functions. arXiv preprint arXiv:1901.02438, 2019.
Greydanus, S. Learning the enigma with recurrent neural networks. arXiv preprint arXiv:1708.07576, 2017.
Griffith, K. and Kalita, J. Solving Arithmetic Word Problems with Transformers and Preprocessing of Problem Text. arXiv preprint arXiv:2106.00893, 2021.
Gromov, A. Grokking modular arithmetic. arXiv preprint arXiv:2301.02679, 2023.
He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
Kaiser, Ł. and Sutskever, I. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.
Kalchbrenner, N., Danihelka, I., and Graves, A. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.
Kimura, H., Emura, K., Isobe, T., Ito, R., Ogawa, K., and Ohigashi, T. Output Prediction Attacks on SPN Block Ciphers using Deep Learning. Cryptology ePrint Archive, 2021. URL https://eprint.iacr.org/2021/401.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big transfer (BiT): General visual representation learning. In Proc. of ECCV, 2020.
Lample, G. and Charton, F. Deep learning for symbolic mathematics. In Proc. of ICLR, 2020.
Lauter, K., Naehrig, M., and Vaikuntanathan, V. Can homomorphic encryption be practical? Proceedings of the 3rd ACM Workshop on Cloud Computing Security, pp. 113–124, 2011.
Lee, N., Sreenivasan, K., Lee, J. D., Lee, K., and Papailiopoulos, D. Teaching arithmetic to small transformers. arXiv preprint arXiv:2307.03381, 2023.
Lenstra, H., Lenstra, A., and Lovász, L. Factoring polynomials with rational coefficients. Mathematische Annalen, 261:515–534, 1982.
Li, C. Y., Sotáková, J., Wenger, E., Malhou, M., Garcelon, E., Charton, F., and Lauter, K. Salsa Picante: A Machine Learning Attack on LWE with Binary Secrets. In Proc. of ACM CCS, 2023a.
Li, C. Y., Wenger, E., Allen-Zhu, Z., Charton, F., and Lauter, K. E. SALSA VERDE: a machine learning attack on LWE with sparse small secrets. In Proc. of NeurIPS, 2023b.
Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1:9, 2021.
Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. Proc. of NeurIPS, 2022.
Meng, Y. and Rumshisky, A. Solving math word problems with double-decoder transformer. arXiv preprint arXiv:1908.10924, 2019.
Micciancio, D. and Voulgaris, P. Faster exponential time algorithms for the shortest vector problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2010.
Nogueira, R., Jiang, Z., and Lin, J. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019, 2021.
Palamas, T. Investigating the ability of neural networks to learn simple modular arithmetic. 2017. https://project-archive.inf.ed.ac.uk/msc/20172390/msc_proj.pdf.
Polu, S. and Sutskever, I. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.
Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint arXiv:2201.02177, 2022.


Regev, O. On Lattices, Learning with Errors, Random Linear Codes, and Cryptography. In Proc. of the ACM Symposium on Theory of Computing, 2005.
Ryan, K. and Heninger, N. Fast practical lattice reduction through iterated compression. Cryptology ePrint Archive, 2023. URL https://eprint.iacr.org/2023/237.pdf.
Schnorr, C.-P. A hierarchy of polynomial time lattice basis reduction algorithms. Theoretical Computer Science, 1987. URL https://www.sciencedirect.com/science/article/pii/0304397587900648.
Siu, K.-Y. and Roychowdhury, V. Optimal depth neural networks for multiplication and related problems. In Proc. of NeurIPS, 1992.
So, J. Deep learning-based cryptanalysis of lightweight block ciphers. Security and Communication Networks, 2020.
Srivastava, S. and Bhatia, A. On the Learning Capabilities of Recurrent Neural Networks: A Cryptographic Perspective. In Proc. of ICBK, 2018.
The FPLLL development team. fplll, a lattice reduction library, Version 5.4.4. Available at https://github.com/fplll/fplll, 2023.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Vaswani, A., Shazeer, N., Parmar, N., et al. Attention is all you need. In Proc. of NeurIPS, 2017.
Wenger, E., Chen, M., Charton, F., and Lauter, K. Salsa: Attacking lattice cryptography with transformers. In Proc. of NeurIPS, 2022.
Zaremba, W., Mikolov, T., Joulin, A., and Fergus, R. Learning simple algorithms from examples. arXiv preprint arXiv:1511.07275, 2015.
A. Appendix
We provide details, additional experimental results, and analysis omitted in the main text:
1. Appendix B: Attack parameters
2. Appendix C: Lattice reduction background
3. Appendix D: Additional model results (Section 4)
4. Appendix E: Additional training results (Section 5)
5. Appendix F: Further NoMod analysis

B. Parameters for our attack


For training all models, we use a learning rate of 10^-6, weight decay of 0.001, 3000 warmup steps, and an embedding
dimension of 512. We use 4 encoder layers and 8 attention heads for all encoder-only model experiments, except the
architecture ablation in Table 7. We use the following primes: q = 2199023255531 for n = 512, q = 34088624597 for
n = 768, and q = 607817174438671 for n = 1024. The rest of the parameters used for the experiments are in Table 10.

Table 10. LWE, preprocessing, and training parameters. For the adaptive increase of preprocessing parameters, we start with block size β1, flatter α1, and LLL-delta δ1, and upgrade to β2, α2, and δ2 at a later stage. Parameters base B and bucket size are used to tokenize the numbers for transformer training.
LWE parameters Preprocessing settings Model training settings
n log2 q h β1 β2 α1 α2 δ1 δ2 Base B Bucket size Batch size
512 41 ≤ 70 18 22 0.04 0.025 0.96 0.9 137438953471 134217728 256
768 35 ≤ 11 18 22 0.04 0.025 0.96 0.9 1893812477 378762 128
1024 50 ≤ 11 18 22 0.04 0.025 0.96 0.9 25325715601611 5065143120 128
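
To make the last two columns of Table 10 concrete, the sketch below shows one plausible reading of the base-B/bucket tokenization: each coordinate in [0, q) is split into a leading base-B digit and a remainder coarsened into buckets. The helper name and exact token layout are illustrative assumptions, not a verbatim description of our tokenizer.

```python
def tokenize_coordinate(a: int, B: int, bucket_size: int) -> tuple[int, int]:
    """Illustrative two-token encoding of a coordinate a in [0, q).

    Assumption: the first token is the leading base-B digit and the second is
    the remainder coarsened into buckets of size `bucket_size` (see Table 10).
    """
    high = a // B                   # leading base-B digit
    low = (a % B) // bucket_size    # bucketed remainder
    return high, low

# Example with the n = 512 settings from Table 10:
# q = 2199023255531, B = 137438953471, bucket size = 134217728.
print(tokenize_coordinate(123456789012, 137438953471, 134217728))  # (0, 919)
```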

C. Additional background on lattice reduction algorithms


Lattice reduction algorithms reduce the length of lattice basis vectors and are a building block in most known attacks on lattice-based cryptosystems. If a short enough vector can be found, it can be used to recover LWE secrets via the dual, decoding, or uSVP attack (Micciancio & Voulgaris, 2010; Albrecht et al., 2017; 2021). The original lattice reduction algorithm, LLL (Lenstra et al., 1982), runs in time polynomial in the dimension of the lattice, but is only guaranteed to find an exponentially bad approximation to the shortest vector; in other words, the quality of the reduction is poor. LLL iterates through the ordered basis vectors of a lattice, projecting vectors onto each other pairwise and swapping vectors until a shorter, re-ordered, nearly orthogonal basis is returned. A toy implementation illustrating these steps is given below.
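
The following minimal Python sketch makes the LLL steps above concrete (size-reduce against earlier vectors, test the Lovász condition, swap on failure). It is a textbook-style illustration over exact rationals, not the optimized fplll or flatter code used in our experiments.

```python
from fractions import Fraction

def gram_schmidt(B):
    """Return Gram-Schmidt vectors B* and coefficients mu for the basis B."""
    n = len(B)
    Bstar, mu = [], [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        v = list(B[i])
        for j in range(i):
            mu[i][j] = sum(a * b for a, b in zip(B[i], Bstar[j])) / sum(b * b for b in Bstar[j])
            v = [a - mu[i][j] * b for a, b in zip(v, Bstar[j])]
        Bstar.append(v)
    return Bstar, mu

def lll_reduce(basis, delta=Fraction(99, 100)):
    """Textbook LLL: size-reduce each vector, test the Lovász condition, swap on failure."""
    B = [[Fraction(x) for x in row] for row in basis]
    n = len(B)
    norm2 = lambda v: sum(x * x for x in v)
    Bstar, mu = gram_schmidt(B)
    k = 1
    while k < n:
        for j in range(k - 1, -1, -1):                 # size reduction against earlier vectors
            c = round(mu[k][j])
            if c:
                B[k] = [a - c * b for a, b in zip(B[k], B[j])]
                Bstar, mu = gram_schmidt(B)            # recompute (simple but slow)
        if norm2(Bstar[k]) >= (delta - mu[k][k - 1] ** 2) * norm2(Bstar[k - 1]):
            k += 1                                     # Lovász condition holds: move on
        else:
            B[k - 1], B[k] = B[k], B[k - 1]            # swap and step back
            Bstar, mu = gram_schmidt(B)
            k = max(k - 1, 1)
    return [[int(x) for x in row] for row in B]

# Tiny example: the reduced basis spans the same lattice with much shorter vectors.
print(lll_reduce([[201, 37], [1648, 297]]))            # [[1, 32], [40, 1]]
```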
To improve the quality of the reduction, the BKZ algorithm (Schnorr, 1987) generalizes LLL by projecting basis vectors onto (k − 1)-dimensional subspaces, where k < n is the “blocksize” (LLL has blocksize 2). As k approaches n, the quality of the reduced basis improves, but the running time is exponential in k, so large block sizes are not practical. Experiments in (Chen et al., 2020) running BKZ to attack LWE instances found that block size k ≥ 60 with n > 100 was infeasible in practice. Both BKZ and LLL are implemented in the fplll library (The FPLLL development team, 2023), along with an improved version of BKZ, BKZ2.0 (Chen & Nguyen, 2011). Charton et al. proposed an alternative lattice reduction algorithm similar to LLL; in practice, we use it as a polishing step after each BKZ loop concludes. It “polishes” by iteratively orthogonalizing the vectors, provably decreasing norms with each run.
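
For readers who want to experiment with these reduction algorithms directly, the snippet below sketches how LLL and BKZ can be invoked on a random q-ary basis via fpylll, the Python bindings for fplll. It is only an illustration of blocksize as a tuning knob and assumes the standard fpylll interface; it is not our preprocessing pipeline.

```python
# Sketch assuming the fpylll Python bindings for fplll are installed; the calls
# below follow fpylll's documented high-level interface, not our attack code.
from fpylll import IntegerMatrix, LLL, BKZ

A = IntegerMatrix.random(80, "qary", k=40, q=7681)   # random q-ary lattice basis
A = LLL.reduction(A)                                 # fast, weaker reduction (blocksize 2)
A = BKZ.reduction(A, BKZ.Param(block_size=20))       # stronger; cost grows quickly with blocksize
print(A[0])                                          # first (now much shorter) basis vector
```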
A newer alternative to LLL is flatter (Ryan & Heninger, 2023), which provides reduction guarantees analogous to LLL but runs faster thanks to better precision management. flatter works on sublattices and reduces the numerical precision it carries as the reduction proceeds, enabling it to run much faster than other reduction implementations. Experiments in the original paper show flatter running orders of magnitude faster than other methods on high-dimensional (n ≥ 1024) lattice problems. The implementation of flatter (https://github.com/keeganryan/flatter) has a few tunable parameters, notably α, which characterizes the strength of the desired reduction in terms of the lattice “drop”, a bespoke measure developed by Ryan & Heninger (2023) that mimics the Lovász condition of traditional LLL (Lenstra et al., 1982). In our runs of flatter, we set α = 0.04 initially and decrease it to α = 0.025 after the lattice is somewhat reduced, following the adaptive reduction approach of Li et al. (2023b).
D. Additional Results for §4


D.1. Architecture Comparison
Tables 11 and 12 (binary secrets) and Tables 13 and 14 (ternary secrets) expand on the results in Table 6 by showing how three different model architectures perform on binary and ternary secrets with different Hamming weights. We see that the encoder-only architecture with the angular embedding improves secret recovery compared to the encoder-decoder model and the encoder-only model with the two-token embedding. Notably, the encoder-only model with the angular embedding recovers secrets up to h = 66 for both binary and ternary secrets, a substantial improvement over previous work.

Table 11. Secret recovery (# successful recoveries/# attack attempts) with various model architectures for h = 57 to h = 66
(n = 512, log2 q = 41, binary secrets).
h
Architecture
57 58 59 60 61 62 63 64 65 66
Encoder-Decoder 1/10 - 1/10 - - - 1/10 - - -
Encoder (Vocabulary) 0/10 0/7 2/10 0/10 0/8 0/10 1/10 0/9 0/8 0/8
Encoder (Angular) 2/10 0/7 3/10 2/10 1/8 1/10 2/10 1/9 1/8 2/8

Table 12. Average time to recovery (hours) for successful recoveries with various model architectures for h = 57 to h = 66
(n = 512, log2 q = 41, binary secrets).
h
Architecture
57 58 59 60 61 62 63 64 65 66
Encoder-Decoder 10.0 - 20.0 - - - 17.5 - - -
Encoder (Vocabulary) - - 19.8 - - - 22.0 - - -
Encoder (Angular) 33.0 - 37.0 33.2 29.1 26.7 29.8 57.1 31.6 28.8

Table 13. Secret recovery (# successful recoveries/# attack attempts) with various model architectures for h = 57 to h = 66
(n = 512, log2 q = 41, ternary secrets).
h
Architecture
57 58 59 60 61 62 63 64 65 66
Encoder-Decoder - 1/10 - - - - - - - -
Encoder (Vocabulary) 0/9 1/8 0/9 0/9 1/10 0/8 0/7 0/7 0/10 1/9
Encoder (Angular) 0/9 1/8 0/9 0/9 1/10 0/8 0/7 0/7 0/10 1/9

Table 14. Average time to recovery (hours) for successful recoveries with various model architectures for h = 57 to h = 66
(n = 512, log2 q = 41, ternary secrets).
h
Architecture
57 58 59 60 61 62 63 64 65 66
Encoder-Decoder - 27.5 - - - - - - - -
Encoder (Vocabulary) - 24.8 - - 28.2 - - - - 34.4
Encoder (Angular) - 57.6 - - 47.4 - - - - 70.3

D.2. Architecture Ablation


Here, we present additional results from the architecture ablation experiments summarized in Table 7. Tables 15 and 16 show the number of successful recoveries and the average time to recovery for varying architectures across different Hamming weights. We see that increasing transformer depth (number of layers) tends to improve recovery but
also increases the average recovery time. Increasing the embedding dimension from 256 to 512 with 4 layers improves secret recovery. Thus, we choose 4 layers with an embedding dimension of 512, as this configuration recovers the most secrets (25%) the fastest (26.2 mean hours).

Table 15. Effect of different transformer depths (# of layers) and widths (embedding dimension) on secret recovery (# successful
recoveries/# attack attempts) with encoder-only model (n = 512, log2 q = 41, binary secrets).
h
Layers Emb Dim
49 51 53 55 57 59 61 63 65 67
2 128 2/9 4/10 5/10 3/9 2/10 2/8 1/9 3/10 1/9 0/10
4 256 2/9 4/10 5/10 3/9 2/10 2/8 1/9 2/10 1/9 0/10
4 512 3/9 4/10 5/10 4/9 2/10 2/8 1/9 3/10 1/9 0/10
6 512 3/9 4/10 5/10 3/9 2/10 2/8 2/9 3/10 1/9 0/10
8 512 3/9 4/10 5/10 4/9 2/10 2/8 1/9 2/10 1/9 0/10

Table 16. Effect of different transformer depths (# of layers) and widths (embedding dimension) on average time to recovery
(hours) with encoder-only model (n = 512, log2 q = 41, binary secrets).
h
Layers Emb Dim
49 51 53 55 57 59 61 63 65 67
2 128 8.7 10.0 8.3 14.0 11.0 5.5 11.5 17.9 18.9 -
4 256 30.9 11.0 12.3 30.3 39.9 10.4 20.6 13.3 19.6 -
4 512 27.2 15.4 18.2 35.2 35.4 17.3 24.3 32.2 26.2 -
6 512 44.1 18.8 20.7 34.2 30.5 19.4 47.9 32.1 28.1 -
8 512 46.4 22.5 24.0 44.6 34.6 23.8 28.0 29.0 30.3 -

E. Additional Results for §5


E.1. Training with Fewer Samples
Here, we present additional results from scaling n to 768 and 1024 (without pre-training) as summarized in Table 8.
Tables 17 and 18 show the results for the n = 768, log2 q = 35 case with binary secrets. Similarly, Tables 19 and 20 show
the results for the n = 1024, log2 q = 50 case with binary secrets.

Table 17. Successful secret recoveries out of 10 trials per h with varying amounts of training data (n = 768, log2 q = 35, binary secrets).
h
# Samples
5 7 9 11
100K 1 - - -
300K 1 1 - -
1M 4 3 2 -
3M 2 3 1 -
4M 3 3 2 -

Table 18. Average time (hours) to successful secret recoveries with varying amounts of training data (n = 768, log2 q = 35, binary secrets).
h
# Samples
5 7 9 11
100K 1.4 - - -
300K 0.7 12.3 - -
1M 27.9 24.5 39.2 -
3M 17.8 13.0 44.2 -
4M 13.5 18.5 21.8 -

E.2. Model Pre-Training


In this section, we expand upon the pre-training results summarized in Figures 3 and 4. For each pre-training checkpoint, we measure the number of successful recoveries out of 10 trials per h and the average time in hours to successful secret recovery for h = 30 to h = 45. We also vary the number of samples from 100K to 4M to see which setup is most sample efficient. In all of these experiments, n = 512 and log2 q = 41 are fixed, and the secrets are binary.
Table 19. Successful secret recoveries out of 10 trials per h with varying amounts of training data (n = 1024, log2 q = 50, binary secrets).
h
# Samples
5 7 9 11 13 15
100K 3 - - - - -
300K 6 3 - - - -
1M 6 4 1 - 1 -
3M 6 4 1 1 - -
4M 7 5 1 - - -

Table 20. Average time (hours) to successful secret recoveries with varying amounts of training data (n = 1024, log2 q = 50, binary secrets).
h
# Samples
5 7 9 11 13 15
100K 11.6 - - - - -
300K 14.8 17.7 - - - -
1M 15.7 29.2 36.1 - 47.4 -
3M 14.8 14.0 29.6 39.8 - -
4M 19.9 10.6 21.3 - - -

The results are presented as follows: no pre-training baseline (Tables 21 and 22), 80K steps of pre-training (Tables 23 and 24), 240K steps of pre-training (Tables 25 and 26), and 440K steps of pre-training (Tables 27 and 28).
Based on these results, we conclude that some pre-training helps to recover secrets with fewer samples, but more pre-training is not necessarily better. We also see that more pre-training increases the average time to successful secret recovery.

Table 21. Successful secret recoveries out of 10 trials per h with no model pre-training (n = 512, log2 q = 41, binary secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - - - - - - - - - - - - - - -
300K - - 1 - - - - - - - - - - - - -
1M - - 3 2 1 - 3 1 2 1 1 1 - - 2 -
3M 1 - 4 3 1 - 2 1 3 1 1 2 - - 2 -
4M 1 - 4 3 1 1 3 1 3 1 1 1 - - 2 -

Table 22. Average time (hours) to successful secret recovery with no model pre-training (n = 512, log2 q = 41, binary secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - - - - - - - - - - - - - - -
300K - - 30.3 - - - - - - - - - - - - -
1M - - 21.1 36.9 36.3 - 27.5 21.9 23.9 16.5 50.7 52.9 - - 36.8 -
3M 14.4 - 21.8 27.6 41.8 - 19.8 25.5 22.7 18.1 39.5 44.3 - - 37.8 -
4M 17.2 - 22.8 26.5 29.1 69.7 22.7 22.7 20.9 21.4 29.0 25.7 - - 40.5 -

Table 23. Successful secret recoveries out of 10 trials per h with 80K steps model pre-training (n = 512, log2 q = 41, binary secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - 1 1 - - - - - - - - - - - -
300K 1 - 2 2 - - 1 - 1 1 - - - - 2 -
1M 1 - 3 2 1 - 1 - 1 1 1 - - - 1 -
3M 1 - 3 2 1 - 1 - 1 1 1 - - - 2 -
4M 1 - 3 2 - - 1 - 1 1 1 - - - 1 -

Table 24. Average time (hours) to successful secret recovery with 80K steps model pre-training (n = 512, log2 q = 41, binary
secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - 29.2 32.1 - - - - - - - - - - - -
300K 57.7 - 47.6 26.9 - - 31.7 - 62.6 55.6 - - - - 50.9 -
1M 53.4 - 37.0 25.6 61.7 - 25.1 - 65.9 30.0 44.8 - - - 51.4 -
3M 34.4 - 41.8 25.1 57.5 - 26.2 - 45.1 23.6 38.4 - - - 39.7 -
4M 35.1 - 30.9 33.0 - - 35.2 - 36.5 21.8 26.8 - - - 32.1 -

Table 25. Successful secret recoveries out of 10 trials per h with 240K steps model pre-training (n = 512, log2 q = 41, binary
secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - - 1 - - - - - - - - - - - -
300K 1 - 2 2 - - 2 - 2 1 - - - - - -
1M 1 - 2 2 1 - 1 - 1 1 1 - - - 1 -
3M 1 - 2 2 - - 2 - 2 1 1 - - - 1 -
4M 1 - 2 2 1 - 2 - 1 1 1 - - - 1 -

Table 26. Average time (hours) to successful secret recovery with 240K steps model pre-training (n = 512, log2 q = 41, binary
secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - - 21.0 - - - - - - - - - - - -
300K 49.3 - 30.8 23.3 - - 50.5 - 64.2 37.9 - - - - - -
1M 49.4 - 29.5 23.4 50.8 - 20.1 - 24.8 24.2 51.9 - - - 61.9 -
3M 48.9 - 38.2 19.7 - - 38.5 - 50.9 20.2 57.7 - - - 47.4 -
4M 42.4 - 34.9 19.5 55.8 - 29.0 - 24.1 18.3 54.0 - - - 52.4 -

Table 27. Successful secret recoveries out of 10 trials per h with 440K steps model pre-training (n = 512, log2 q = 41, binary
secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - 1 1 - - - - - - - - - - - -
300K - - 1 2 - - 1 1 1 1 - - - - 1 -
1M 1 - - 2 - - 1 - 2 1 1 - - - 1 -
3M 1 - 1 2 - - 1 - 1 1 1 - - - 2 -
4M 1 - 1 2 - - 2 - 1 1 1 - - - 2 -

F. Results of NoMod Analysis


As another performance metric for our approach, we measure the NoMod factor for the secrets/datasets we attack. Li et al.
computed NoMod as follows: given a training dataset of LWE pairs (Ra, Rb) represented in the range (−q/2, q/2) and
known secret s, compute x = Ra · s − Rb. If ∥x∥ < q/2, we know that the computation of Ra · s did not cause Rb to
“wrap around” modulus q. The NoMod factor of a dataset is the percentage of (Ra, Rb) pairs for which ∥x∥ < q/2.
Although NoMod is not usable in a real-world attack, since it requires a priori knowledge of s, it is a useful metric for
Table 28. Average time (hours) to successful secret recovery with 440K steps model pre-training (n = 512, log2 q = 41, binary
secrets).
h
# Samples
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
100K - - 33.3 20.0 - - - - - - - - - - - -
300K - - 39.8 12.9 - - 29.0 60.5 36.2 49.1 - - - - 59.1 -
1M 56.9 - - 16.6 - - 18.6 - 43.6 24.9 36.4 - - - 53.4 -
3M 50.9 - 32.3 12.2 - - 25.2 - 22.3 18.9 42.4 - - - 60.8 -
4M 58.3 - 30.0 13.5 - - 34.6 - 19.7 22.3 45.7 - - - 59.7 -

understanding attack success in a lab environment. Li et al. derived an empirical result stating that attacks should be
successful when the NoMod factor of a dataset is ≥ 67. The NoMod analysis indicates that models trained in those
experiments were only learning secrets from datasets in which the majority of Rb values do not “wrap around” q. If models
could be trained to learn modular arithmetic better, this might ease the NoMod condition for success.
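
For completeness, the NoMod computation described above amounts to a few lines of NumPy. The sketch below uses our own variable names and assumes RA and Rb are already centered in (−q/2, q/2); with log2 q ≤ 50 and sparse binary or ternary secrets, the products fit comfortably in 64-bit integers.

```python
import numpy as np

def nomod_factor(RA, Rb, s, q):
    """Percentage of (Ra, Rb) pairs for which Ra . s - Rb lies in (-q/2, q/2),
    i.e. Rb did not "wrap around" the modulus. Requires the secret s, so this
    is a lab-only diagnostic, not part of a real attack.

    RA: (m, n) integer array and Rb: (m,) integer array, both centered in
    (-q/2, q/2); s: (n,) secret with entries in {0, 1} or {-1, 0, 1}.
    """
    x = RA.astype(np.int64) @ s.astype(np.int64) - Rb  # over the integers, no reduction mod q
    return 100.0 * np.mean(np.abs(x) < q / 2)
```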
One of the main goals of the angular embedding is to introduce some inductive bias into the model training process. Specifically, teaching models that 0 and q − 1 are close in the embedding space may enable them to better learn the modular arithmetic task at the heart of LWE. Here, we examine the NoMod factor of various datasets to see whether the angular embedding does provide such an inductive bias. If it did, we would expect models with angular embeddings to recover secrets from datasets with NoMod < 67. Table 29 lists NoMod percentages and successful secret recoveries for the angular and two-token embedding schemes described in §4.2.
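
To illustrate the intended inductive bias, the PyTorch module below sketches one way an angular embedding can be realized: each token index is mapped to a point on the unit circle, so token 0 and the largest token are adjacent, and a learned linear layer lifts the 2-D point to the model dimension. This is a schematic reading of the idea rather than a verbatim copy of the layer used in our experiments.

```python
import math
import torch
from torch import nn

class AngularEmbedding(nn.Module):
    """Sketch: place token t at angle 2*pi*t / vocab_size on the unit circle,
    then project to d_model. Tokens 0 and vocab_size - 1 end up adjacent,
    matching the wrap-around structure of arithmetic mod q."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.vocab_size = vocab_size
        self.proj = nn.Linear(2, d_model, bias=False)  # learned 2 -> d_model map

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        theta = 2 * math.pi * tokens.to(torch.float32) / self.vocab_size
        circle = torch.stack((torch.cos(theta), torch.sin(theta)), dim=-1)
        return self.proj(circle)

# Example: embed a batch of token indices into a 512-dimensional space.
emb = AngularEmbedding(vocab_size=1024, d_model=512)
print(emb(torch.tensor([[0, 1023, 512]])).shape)  # torch.Size([1, 3, 512])
```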

Table 29. NoMod percentages for Verde data n = 512, log2 q = 41, binary secrets (varying h and secrets indexed 0-9), comparing
performance of angular vs. normal embedding. Key: recovered by angular only, recovered by both, not recovered.
h 0 1 2 3 4 5 6 7 8 9
57 45 49 52 41 51 46 51 57 60 49
58 48 48 48 48 38 43 52 52 45 48
59 43 46 56 50 66 63 35 46 41 40
60 48 48 52 59 50 58 48 49 51 54
61 60 49 43 41 56 42 42 41 41 50
62 45 42 45 54 55 43 61 56 54 42
63 56 60 55 54 63 47 54 51 45 43
64 44 46 41 41 41 47 45 43 41 55
65 45 51 48 60 45 48 41 48 45 50
66 45 51 39 64 45 47 43 60 48 55
67 43 47 48 49 40 47 48 51 50 46

Table 30. NoMod percentages for n = 768, log2 q = 35 secrets (varying h and secrets indexed 0-9). Key: secret recovered, secret not recovered.
h 0 1 2 3 4 5 6 7 8 9
5 61 61 52 61 93 68 66 61 56 62
7 52 61 56 77 52 68 56 67 56 61
9 52 55 46 49 52 48 56 60 60 70
11 55 42 44 55 46 46 54 55 60 51
13 46 45 60 48 43 43 55 48 60 43
15 41 38 48 43 40 43 45 43 43 59

Table 31. NoMod percentages for n = 1024, log2 q = 50 secrets (varying h and secrets indexed 0-10). Key: secret recovered, secret not recovered.
h 0 1 2 3 4 5 6 7 8 9
5 62 66 70 69 81 81 81 53 69 94
7 62 47 80 80 69 56 57 52 57 69
9 56 69 52 49 52 53 56 57 46 52
11 44 52 52 62 61 62 56 56 52 49
13 51 46 56 40 46 52 44 49 68 53
15 61 46 40 45 46 41 38 47 48 44