SALSA FRESCA: Angular Embeddings and Pre-Training for ML Attacks on LWE

Samuel Stevens, Emily Wenger, Cathy Li, Niklas Nolte, Eshika Saxena, François Charton, Kristin Lauter
Table 1. Best results from our attack for LWE problems in dimensions n (higher is harder), modulus q (lower is harder) and Hamming
weights h (higher is harder). Our work recovers secrets for n = 1024 for the first time in ML-based LWE attacks and reduces total attack
time for n = 512, log2 q = 41 to only 50 hours (assuming full CPU parallelization).
done repeatedly and in parallel to produce 4 million reduced LWE samples, providing the data needed to train the model.

2.2. Attack Part 2: Model training and secret recovery

With 4 million reduced LWE samples (Ra, Rb), a transformer is trained to predict Rb from Ra. For simplicity, and without loss of generality, we will say the transformer learns to predict b from a. Li et al. train encoder-decoder transformers (Vaswani et al., 2017) with shared layers (Dehghani et al., 2019). Inputs and outputs consist of integers that are split into two tokens per integer by representing them in a large base B = kq with k ≈ 10 and binning the lower digit to keep the vocabulary small as q increases. Training is supervised and minimizes a cross-entropy loss.

The key intuition behind ML attacks on LWE is that to predict b from a, the model must have learned the secret s. We extract the secret from the model by comparing model predictions for two vectors a and a′ which differ only in one entry. We expect the difference between the model's predictions for b and b′ to be small (of the same magnitude as the error) if the corresponding bit of s is zero, and large if it is non-zero. Repeating the process on all n positions yields a guess for the secret.

For ternary secrets, Li et al. (2023b) introduce a two-bit distinguisher, which leverages the fact that if secret bits si and sj have the same value, adding a constant K to inputs at both these indices should induce similar predictions. Thus, if ui is the i-th basis vector and K is a random integer, we expect model predictions for a + K·ui and a + K·uj to be the same if si = sj. After using this pairwise method to determine whether the non-zero secret bits have the same value, Li et al. classify them into two groups. With only two ways to assign 1 and −1 to the groups of non-zero secret bits, this produces two secret guesses.

Wenger et al. (2022) test a secret guess s∗ by computing the residuals b − a · s∗ over the 4n initial LWE samples. If s∗ is correct, the standard deviation of the residuals will be close to σe. Otherwise, it will be close to the standard deviation of a uniform distribution over Zq: q/√12.

For a given dimension, modulus and secret Hamming weight, the performance of ML attacks varies from one secret to the next. Li et al. (2023b) observe that the difficulty of recovering a given secret s from a set of reduced LWE samples (a, b) depends on the distribution of the scalar products a · s. If a large proportion of these products remain in the interval (−q/2, q/2) (assuming centering) even without a modulo operation, the problem is similar enough to linear regression that the ML attack will usually recover the secret. Li et al. introduce the statistic NoMod: the proportion of scalar products in the training set having this property. They demonstrate that large NoMod strongly correlates with likely secret recoveries for the ML attack.
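To make the secret-verification step described above concrete, here is a minimal Python/NumPy sketch (our own illustration, not the authors' code; the function names, the acceptance threshold, and the toy parameters are our assumptions). It accepts a candidate secret when the spread of the centered residuals b − A·s∗ is much closer to σe than to the uniform standard deviation q/√12.

```python
import numpy as np

def center_mod(x, q):
    """Map values to the centered interval around 0 modulo q."""
    return ((x + q // 2) % q) - q // 2

def check_secret(A, b, s_guess, q, sigma_e=3.0):
    """Accept s_guess if the residuals b - A*s_guess look like LWE noise.

    A: (N, n) array of original LWE samples, b: (N,) array of values mod q.
    A correct secret gives residual std close to sigma_e; a wrong one gives
    residual std close to the uniform value q / sqrt(12).
    """
    residuals = center_mod(b - A @ s_guess, q)
    uniform_std = q / np.sqrt(12)
    return residuals.std() < uniform_std / 2   # illustrative threshold

# toy usage with small parameters (illustrative only)
rng = np.random.default_rng(0)
q, n, N = 3329, 64, 256
s = (rng.random(n) < 0.1).astype(np.int64)          # sparse binary secret
A = rng.integers(0, q, size=(N, n))
e = np.rint(rng.normal(0, 3.0, size=N)).astype(np.int64)
b = (A @ s + e) % q
print(check_secret(A, b, s, q))                              # True
print(check_secret(A, b, np.zeros(n, dtype=np.int64), q))    # False (unless s = 0)
```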
2.3. Improving upon prior work

Li et al. (2023b) recover binary and ternary secrets for n = 512, log2 q = 41 LWE problems with Hamming weight ≤ 63 in about 36 days, using 4,000 CPUs and one GPU (see their Table 1). Most of the computing resources are needed in the preprocessing stage: reducing one m × n A matrix takes about 35 days, and 4000 matrices must be reduced to build a training set of 4 million examples. This suggests two directions for improving attack performance. First, by introducing fast alternatives to BKZ2.0, we could shorten the time required to reduce one matrix. Second, we could minimize the number of samples needed to train the models, which would reduce the number of CPUs needed for the preprocessing stage.

Another crucial goal is scaling to larger dimensions n. The smallest standardized dimension for LWE in the HE Standard (Albrecht et al., 2021) is n = 1024. At present, ML attacks are limited by their preprocessing time and the length of the input sequences they use. The attention mechanism used in transformers is quadratic in the length of the sequence, and Li et al. (2023b) encode n-dimensional inputs with 2n tokens. A more efficient encoding would cut down transformer processing time and memory consumption quadratically.

2.4. Parameters and settings in our work

Before presenting our innovations and results, we briefly discuss the LWE settings considered in our work. LWE problems are parameterized by the modulus q, the secret dimension n, the secret distribution χs (sparse binary/ternary) and the Hamming weight h of the secret (the number of non-zero entries). Table 3 specifies the LWE problem settings we attack. Proposals for LWE parameter settings in homomorphic encryption suggest using n = 1024 with sparse secrets (as low as h = 64), albeit with smaller q than we consider (Curtis & Player, 2019; Albrecht, 2017). Thus, it is important to show that ML attacks can work in practice for dimension n = 1024 if we hope to attack sparse secrets in real-world settings.

Table 3. LWE parameters attacked in our work. For all settings, we attack both binary and ternary secret distributions χs.

n       log2 q    h
512     41        50 ≤ h ≤ 70
768     35        5 ≤ h ≤ 15
1024    50        5 ≤ h ≤ 15

The LWE error distribution remains the same throughout our work: rounded Gaussian with σ = 3 (following Albrecht et al. (2021)). Table 10 in Appendix B contains all experimental settings, including values for the moduli q, preprocessing settings, and model training settings.
3. Data Preprocessing

Prior work primarily used BKZ2.0 in the preprocessing/lattice reduction step. While effective, BKZ2.0 is slow for large values of n. We found that for n = 1024 matrices it could not finish a single loop in 3 days on an Intel Xeon Gold 6230 CPU, eventually timing out.

Our preprocessing pipeline. Our improved preprocessing incorporates a new reduction algorithm, Flatter [1], which promises similar reduction guarantees to LLL with much-reduced compute time (Ryan & Heninger, 2023), allowing us to preprocess LWE matrices in dimension up to n = 1024. We interleave Flatter and BKZ2.0 and switch between them after 3 loops of one results in ∆ρ < −0.001. Following Li et al. (2023b), we run polish after each Flatter and BKZ2.0 loop concludes. We initialize our Flatter and BKZ2.0 runs with block size 18 and α = 0.04, which provided the best empirical trade-off between time and reduction quality (see Appendix C for additional details), and make reduction parameters stricter as reduction progresses—up to block size 22 for BKZ2.0 and α = 0.025 for Flatter.

Preprocessing performance. Table 4 records the reduction ρ achieved for each (n, q) pair and the time required, compared with Li et al. (2023b). Recall that ρ measures the reduction in the standard deviation of Ai relative to its original uniform random distribution; lower is better. Reduction in standard deviation strongly correlates with reduction in vector norm, but for consistency with Li et al. (2023b) we use standard deviation. We record reduction time as CPU · hours · matrix, the amount of time it takes our algorithm to reduce one LWE matrix using one CPU. We parallelize our reduction across many CPUs. For n = 512, our methods improve reduction time by a factor of 25, and scale easily to n = 1024 problems. Overall, we find that Flatter improves the time (and consequently the resources required) for preprocessing, but does not improve the overall reduction quality.

Table 4. Reduction performance and median time to reduce one matrix for Li et al. (2023b) vs. our work. Li et al.'s method fails for n > 512 on our compute cluster.

                          CPU · hours · matrix
n      log2 q   ρ        (Li et al., 2023b)   Ours
512    41       0.41     ≈ 350                13.1
768    35       0.71     N/A                  12.4
1024   50       0.70     N/A                  26.0

Error penalty ω. We run all reduction experiments with penalty ω = 10. Table 5 demonstrates the tradeoff between reduction quality and reduction error, as measured by ∥R∥/q, for n = 1024, log2 q = 50 problems. Empirically, we find that ∥R∥/q < 0.09 is sufficiently low to recover secrets from LWE problems with e ∼ N(0, 3²).

Table 5. Tradeoff between reduction quality and reduction error as controlled by ω. n = 1024, log2 q = 50.

ω        1       3       5       7       10      13
ρ        0.685   0.688   0.688   0.694   0.698   0.706
∥R∥/q    0.341   0.170   0.118   0.106   0.075   0.068

Experimental setup. In practice, we do not preprocess a different set of 4n (a, b) pairs for each secret recovery experiment, because preprocessing is so expensive. Instead, we use a single set of preprocessed Ra rows combined with an arbitrary secret to produce different (Ra, Rb) pairs for training. We first generate a secret s and calculate b = A · s + e for the original 4n pairs. Then, we apply the many different R produced by preprocessing to A and b to produce many (Ra, Rb) pairs with reduced norm. This technique enables analyzing attack performance across many dimensions (varying h, model parameters, etc.) in a reasonable amount of time. Preprocessing a new dataset for each experiment would make evaluation at scale near-impossible.
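The two metrics above can be computed directly from a batch of reduced data. The following NumPy sketch is our own illustration, not the authors' code: it evaluates ρ as the ratio of the reduced entries' standard deviation to the uniform value q/√12, and reads ∥R∥/q as the mean row norm of the reduction matrix R divided by q (that reading of ∥R∥ is our assumption).

```python
import numpy as np

def reduction_rho(RA, q):
    """rho: std of the (centered) reduced entries RA relative to the
    standard deviation q/sqrt(12) of the original uniform distribution."""
    centered = ((RA + q // 2) % q) - q // 2
    return centered.std() / (q / np.sqrt(12))

def error_blowup(R, q):
    """||R||/q, interpreted here as the mean Euclidean row norm of the
    reduction matrix R divided by q."""
    return np.linalg.norm(R, axis=1).mean() / q

# toy check: unreduced uniform data gives rho close to 1 (lower is better)
rng = np.random.default_rng(0)
q = 2**20
A = rng.integers(0, q, size=(1000, 256))
print(round(reduction_rho(A, q), 3))   # ~1.0
```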
4. Model Architecture

Previous ML attacks on LWE use an encoder-decoder transformer (Vaswani et al., 2017). A bidirectional encoder processes the input and an auto-regressive decoder generates the output. Integers in both input and output sequences were tokenized as two digits in a large base smaller than q. We propose a simpler and faster encoder-only model and introduce an angular embedding for integers modulo q.

4.1. Encoder-only model

Encoder-decoder models were originally introduced for machine translation, because their outputs can be longer than their inputs. However, they are complex and slow at inference, because the decoder must run once for each output token. For LWE, outputs (one integer) are always shorter than inputs (a vector of n integers). Li et al. (2023b) observe that an encoder-only model, a 4-layer bidirectional transformer based on DeBERTa (He et al., 2020), achieves comparable performance with their encoder-decoder model. Here, we experiment with simpler encoder-only models without DeBERTa's disentangled attention mechanism, with 2 to 8 layers. Their outputs are max-pooled across the sequence dimension and decoded by a linear layer for each output digit (Figure 1). We minimize a cross-entropy loss.

[1] https://github.com/keeganryan/flatter
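The following PyTorch sketch illustrates the encoder-only design just described (max-pooling over the sequence, one linear head per output digit). It is not the authors' implementation: the use of the standard nn.TransformerEncoder (instead of a DeBERTa-style encoder), the two-digit output head, and all hyperparameters are our simplifying assumptions.

```python
import torch
import torch.nn as nn

class EncoderOnlyLWE(nn.Module):
    """Bidirectional encoder; the output is max-pooled over the sequence and
    decoded by one linear classifier per output digit (here: 2 digits)."""

    def __init__(self, vocab_size, dim=512, n_layers=4, n_heads=8, n_out_digits=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # one classification head per output digit of b
        self.heads = nn.ModuleList(
            [nn.Linear(dim, vocab_size) for _ in range(n_out_digits)])

    def forward(self, tokens):            # tokens: (batch, seq_len) int64
        x = self.encoder(self.embed(tokens))
        pooled, _ = x.max(dim=1)          # max-pool across the sequence
        return [head(pooled) for head in self.heads]   # per-digit logits

# toy forward/backward pass with a cross-entropy loss per output digit
model = EncoderOnlyLWE(vocab_size=1024 + 16)
tokens = torch.randint(0, 1024, (8, 64))
targets = [torch.randint(0, 1024, (8,)) for _ in range(2)]
loss = sum(nn.functional.cross_entropy(logits, t)
           for logits, t in zip(model(tokens), targets))
loss.backward()
```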
[Figure 1: encoder-only model architecture. Panels show the input/output embedding (x, y) ∈ R², input tokens x1 ... xn for inputs a2 ... an, outputs y2 ... yn, and ×N stacked Self-Attention and MLP blocks.]

Table 6. Best recovery results for binary and ternary secrets on n = 512, log2 q = 41 problems.

Model                Samples/s   Best h (binary)   Best h (ternary)
Encoder (DeBERTa)    83          63                60
Encoder (Vocab.)     256         63                66
Encoder (Angular)    610         66                66

4.2. Angular embedding

Each input integer a ∈ Zq is mapped to the angle 2πa/q, and then to the point (sin(2πa/q), cos(2πa/q)) ∈ R². All input integers (in Zq) are therefore represented as points on the unit circle.
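Here is a minimal PyTorch sketch of such an angular embedding. Since the remainder of §4.2 is not reproduced above, the choice of projecting the 2-D circle coordinates to the model dimension with a learned linear map, and the module name, are our assumptions rather than the paper's exact construction.

```python
import math
import torch
import torch.nn as nn

class AngularEmbedding(nn.Module):
    """Embed integers mod q as points on the unit circle, then project the
    (sin, cos) coordinates to the model dimension."""

    def __init__(self, q, dim=512):
        super().__init__()
        self.q = q
        self.proj = nn.Linear(2, dim)   # learned map R^2 -> R^dim (our assumption)

    def forward(self, a):               # a: integer tensor with values in [0, q)
        # for very large q, higher-precision angles may be preferable
        angle = 2 * math.pi * a.float() / self.q
        circle = torch.stack((torch.sin(angle), torch.cos(angle)), dim=-1)
        return self.proj(circle)

# 0 and q-1 land next to each other on the circle, unlike in a vocabulary embedding
emb = AngularEmbedding(q=251, dim=8)
a = torch.tensor([0, 1, 250])
print(emb(a).shape)                     # torch.Size([3, 8])
```

The design intent this illustrates is the inductive bias discussed in Appendix F: 0 and q − 1 map to nearby points, so the modular structure of Zq is built into the input representation.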
Table 7. Encoder-only (angular embedding) performance for varying numbers of transformer layers and embedding dimensions (n = 512, log2 q = 41, binary secrets). Samples per second quantifies training speed; "Recovered" is the % recovered out of 100 secrets with h from 49-67; "Hours" is the mean hours to recovery.

Layers   Emb. Dim.   Params   Samples/s   Recovered   Hours
2        128         1.3M     2560        23%         18.9
4        256         4.1M     1114        22%         19.6
4        512         14.6M    700         25%         26.2
6        512         20.9M    465         25%         28.1
8        512         27.2M    356         24%         30.3

Larger models yield little benefit in terms of secret recovery rate, and small models are significantly faster (both in terms of training and recovery). Additional results are in Appendix D.2. We use 4 layers with embedding dimension 512 for later experiments because it recovers the most secrets (25%) the fastest.

Embedding ablation. Next, we compare our new angular embedding scheme to the vocabulary embeddings. The better embedding should recover both more secrets and more difficult ones (as measured by Hamming weight and NoMod; see §2.2 for a description of NoMod). To measure this, we run attacks on identical datasets, n = 512, log2 q = 41, h = 57-67, with 10 unique secrets per h. One set of models uses the angular embedding, while the other uses the vocabulary embedding from Li et al. (2023b).

To check if the angular embedding outperforms the vocabulary embedding, we measure the attacked secrets' NoMod. We expect the better embedding to recover secrets with lower NoMod and higher Hamming weights (i.e., harder secrets). As Figure 2 and Table 6 demonstrate, this is indeed the case. Angular embeddings recover secrets with NoMod = 56 vs. 63 for the vocabulary embedding (see Table 29 in Appendix F for raw numbers). Furthermore, angular embedding models recover more secrets than those with vocabulary embeddings (16 vs. 2) and succeed on higher h secrets (66 vs. 63). We conclude that an angular embedding is superior to a vocabulary embedding because it recovers harder secrets.

Figure 2. Count of # successes (orange) and failures (blue) for various NoMod proportions for vocabulary-based and angular embedding schemes. n = 512, log2 q = 41, h = 57-67.

Scaling n. Finally, we use our proposed architecture improvements to scale n. The long input sequence length in prior work made scaling attacks to n ≥ 512 difficult, due to both memory footprint and slow model processing speed. In contrast, our more efficient model and angular embedding scheme (§4.1, §4.2) enable us to attack n = 768 and n = 1024 secrets. Table 8 shows that we can recover up to h = 9 for both the n = 768 and n = 1024 settings, with < 24 hours of training on a single 32GB V100 GPU. In §5.1 we show recovery of h = 13 for n = 1024 using a more sample-efficient training strategy. We run identical experiments using prior work's encoder-decoder model, but fail to recover any secrets with n > 512 in the same computational budget. Our proposed model improvements lead to the first successful ML attack that scales to real-world values of n: proposed real-world use cases for LWE-based cryptosystems recommend using dimensions n = 768 and n = 1024 (Avanzi et al., 2021; Albrecht et al., 2021; Curtis & Player, 2019), although they also recommend smaller q and harder secret distributions than we currently consider.

5. Training Methods

The final limitation we address is the 4 million preprocessed LWE samples needed for model training. Recall that each training sample is a row of a reduced LWE matrix RA ∈ Z_q^{m×n}, so producing 4 million training samples requires reducing ≈ 4,000,000/(m+n) LWE matrices. Even with the preprocessing improvements highlighted in §3, for n = 1024 this means preprocessing between 2000 and 4500 matrices at the cost of 26 hours per CPU per matrix.[2] To further reduce total attack time, we propose training with fewer samples and pre-training models.

5.1. Training with Fewer Samples

We first consider simply reducing the training dataset size and seeing if the attack succeeds. Li et al. always use 4M training examples. To test if this many is needed for secret recovery, we subsample datasets of size N = [100K, 300K, 1M, 3M] from the original 4M examples preprocessed via the techniques in §3. We train models to attack LWE problems with n = 512, log2 q = 41, binary secrets and h = 30-45. Each attack is given 3 days on a single V100 32GB GPU.

Table 9 shows that our attack still succeeds, even when as few as 300K samples are used. We recover approximately the same number of secrets with 1M samples as with 4M, and both settings achieve the same best h recovery. Using 1M rather than 4M training samples reduces our preprocessing time by 75%. We run similar experiments for n = 768 and n = 1024, and find that we can recover up to h = 13 for n = 1024 (see Appendix E).

[2] We bound the number of reduced matrices needed because some rows of the reduction matrices R are 0 and are discarded after preprocessing. Between m and n + m nonzero rows of R are kept, so we must reduce between 4,000,000/(m+n) and 4,000,000/m matrices.
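To make the bound in footnote [2] concrete, here is a small helper (our own illustrative code; m is left as a free parameter because the number of rows per reduced matrix is not restated here, and the value used below is chosen purely for illustration).

```python
import math

def matrices_needed(n_samples, m, n):
    """Bound the number of m x n LWE matrices that must be reduced to obtain
    n_samples training rows, given that between m and m + n nonzero rows of
    each reduction matrix R survive preprocessing (footnote [2])."""
    lower = math.ceil(n_samples / (m + n))   # every matrix yields m + n rows
    upper = math.ceil(n_samples / m)         # every matrix yields only m rows
    return lower, upper

# example: 4M samples vs. 1M samples for n = 1024 (m = 900 chosen for illustration)
for n_samples in (4_000_000, 1_000_000):
    print(n_samples, matrices_needed(n_samples, m=900, n=1024))
```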
Table 8. Secret recovery time for larger dimensions n with the encoder-only model (4 layers, embedding dimension 512, angular embedding). We report training hours (on one V100 GPU) only for successful secret recoveries, out of 10 secrets per h value.

                            Hours to recovery for different h values
n      log2 q   Samples/s   h = 5                                    h = 7                   h = 9        h = 11
768    35       355         3.1, 18.6, 18.9                          9.1, 21.6, 24.9         15.9, 27.7   -
1024   50       256         1.6, 6.2, 7.6, 8.8, 34.0, 41.4, 43.7     4.6, 7.4, 13.5, 16.7    21.3         -
and indicates that models are not yet learning modular arithmetic (Wenger et al., 2022). Further progress on models learning modular arithmetic problems will likely help to achieve secret recovery for smaller q and larger h.

Acknowledgements

We thank Mark Tygert for his always insightful comments and Mohamed Malhou for running experiments.

8. Impact Statement

The main ethical concern related to this work is the possibility of our attack compromising currently-deployed PQC systems. However, at present, our proposed attack does not threaten current standardized systems. If our attack scales to higher h and lower q settings, then its impact is significant, as it would necessitate changing PQC encryption standards. For reproducibility of these results, our code will be open sourced after publication and is available to reviewers upon request.

References

Ahmadzadeh, E., Kim, H., Jeong, O., and Moon, I. A Novel Dynamic Attack on Classical Ciphers Using an Attention-Based LSTM Encoder-Decoder Model. IEEE Access, 2021.

Ajtai, M. Generating hard instances of lattice problems. In Proc. of the ACM Symposium on Theory of Computing, 1996.

Alani, M. M. Neuro-cryptanalysis of DES and triple-DES. In Proc. of NeurIPS, 2012.

Albrecht, M., Chase, M., Chen, H., et al. Homomorphic encryption standard. In Protecting Privacy through Homomorphic Encryption, pp. 31–62. 2021. https://eprint.iacr.org/2019/939.

Albrecht, M. R. On dual lattice attacks against small-secret LWE and parameter choices in HElib and SEAL. In Proc. of EUROCRYPT, 2017. ISBN 978-3-319-56614-6.

Albrecht, M. R., Player, R., and Scott, S. On the concrete hardness of learning with errors. Journal of Mathematical Cryptology, 9(3):169–203, 2015.

Albrecht, M. R., Göpfert, F., Virdia, F., and Wunderer, T. Revisiting the expected cost of solving uSVP and applications to LWE. In Proc. of ASIACRYPT, 2017.

Aldarrab, N. and May, J. Can sequence-to-sequence models crack substitution ciphers? arXiv preprint arXiv:2012.15229, 2020.

Avanzi, R., Bos, J., Ducas, L., Kiltz, E., Lepoint, T., Lyubashevsky, V., Schanck, J. M., Schwabe, P., Seiler, G., and Stehlé, D. CRYSTALS-Kyber (version 3.02) – Submission to round 3 of the NIST post-quantum project. 2021. Available at https://pq-crystals.org/.

Baek, S. and Kim, K. Recent advances of neural attacks against block ciphers. In Proc. of SCIS, 2020.

Benamira, A., Gerault, D., Peyrin, T., and Tan, Q. Q. A deeper look at machine learning-based cryptanalysis. In Proc. of Annual International Conference on the Theory and Applications of Cryptographic Techniques, 2021.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Proc. of NeurIPS, 2020.

Charton, F. Linear algebra with transformers. Transactions on Machine Learning Research, 2022.

Charton, F. Can transformers learn the greatest common divisor? arXiv preprint arXiv:2308.15594, 2024.

Charton, F., Lauter, K., Li, C., and Tygert, M. An efficient algorithm for integer lattice reduction. SIAM Journal on Matrix Analysis and Applications, 45(1), 2024.

Chen, H., Chua, L., Lauter, K., and Song, Y. On the Concrete Security of LWE with Small Secret. Cryptology ePrint Archive, Paper 2020/539, 2020. URL https://eprint.iacr.org/2020/539.

Chen, L., Moody, D., Liu, Y.-K., et al. PQC Standardization Process: Announcing Four Candidates to be Standardized, Plus Fourth Round Candidates. US Department of Commerce, NIST, 2022. https://csrc.nist.gov/News/2022/pqc-candidates-to-be-standardized-and-round-4.

Chen, Y. and Nguyen, P. Q. BKZ 2.0: Better Lattice Security Estimates. In Proc. of ASIACRYPT, 2011.

Chen, Y. and Yu, H. Bridging Machine Learning and Cryptanalysis via EDLCT. Cryptology ePrint Archive, 2021. https://eprint.iacr.org/2021/705.

Curtis, B. R. and Player, R. On the feasibility and impact of standardising sparse-secret LWE parameter sets for homomorphic encryption. In Proc. of the ACM Workshop on Encrypted Computing & Applied Homomorphic Cryptography, 2019.

Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, Ł. Universal transformers. In Proc. of ICLR, 2019.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019. URL https://aclanthology.org/N19-1423.

Dubrova, E., Ngo, K., and Gärtner, J. Breaking a fifth-order masked implementation of CRYSTALS-Kyber by copy-paste. Cryptology ePrint Archive, 2022. https://eprint.iacr.org/2022/1713.

Dziri, N., Lu, X., Sclar, M., Li, X. L., et al. Faith and fate: Limits of transformers on compositionality. arXiv preprint arXiv:2305.18654, 2023.

Gohr, A. Improving attacks on round-reduced Speck32/64 using deep learning. In Proc. of Annual International Cryptology Conference, 2019.

Goncharov, S. V. Using fuzzy bits and neural networks to partially invert few rounds of some cryptographic hash functions. arXiv preprint arXiv:1901.02438, 2019.

Greydanus, S. Learning the Enigma with recurrent neural networks. arXiv preprint arXiv:1708.07576, 2017.

Griffith, K. and Kalita, J. Solving Arithmetic Word Problems with Transformers and Preprocessing of Problem Text. arXiv preprint arXiv:2106.00893, 2021.

Gromov, A. Grokking modular arithmetic. arXiv preprint arXiv:2301.02679, 2023.

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.

Kaiser, Ł. and Sutskever, I. Neural GPUs learn algorithms. arXiv preprint arXiv:1511.08228, 2015.

Kalchbrenner, N., Danihelka, I., and Graves, A. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.

Kimura, H., Emura, K., Isobe, T., Ito, R., Ogawa, K., and Ohigashi, T. Output Prediction Attacks on SPN Block Ciphers using Deep Learning. Cryptology ePrint Archive, 2021. URL https://eprint.iacr.org/2021/401.

Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big Transfer (BiT): General visual representation learning. In Proc. of ECCV, 2020.

Lample, G. and Charton, F. Deep learning for symbolic mathematics. In Proc. of ICLR, 2020.

Lauter, K., Naehrig, M., and Vaikuntanathan, V. Can homomorphic encryption be practical? In Proceedings of the 3rd ACM Workshop on Cloud Computing Security, pp. 113–124, 2011.

Lee, N., Sreenivasan, K., Lee, J. D., Lee, K., and Papailiopoulos, D. Teaching arithmetic to small transformers. arXiv preprint arXiv:2307.03381, 2023.

Lenstra, H., Lenstra, A., and Lovász, L. Factoring polynomials with rational coefficients. Mathematische Annalen, 261:515–534, 1982.

Li, C. Y., Sotáková, J., Wenger, E., Malhou, M., Garcelon, E., Charton, F., and Lauter, K. Salsa Picante: A Machine Learning Attack on LWE with Binary Secrets. In Proc. of ACM CCS, 2023a.

Li, C. Y., Wenger, E., Allen-Zhu, Z., Charton, F., and Lauter, K. E. SALSA VERDE: A machine learning attack on LWE with sparse small secrets. In Proc. of NeurIPS, 2023b.

Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 1:9, 2021.

Liu, Z., Kitouni, O., Nolte, N. S., Michaud, E., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. Proc. of NeurIPS, 2022.

Meng, Y. and Rumshisky, A. Solving math word problems with double-decoder transformer. arXiv preprint arXiv:1908.10924, 2019.

Micciancio, D. and Voulgaris, P. Faster exponential time algorithms for the shortest vector problem. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2010.

Nogueira, R., Jiang, Z., and Lin, J. Investigating the limitations of transformers with simple arithmetic tasks. arXiv preprint arXiv:2102.13019, 2021.

Palamas, T. Investigating the ability of neural networks to learn simple modular arithmetic. 2017. https://project-archive.inf.ed.ac.uk/msc/20172390/msc_proj.pdf.

Polu, S. and Sutskever, I. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.

Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint arXiv:2201.02177, 2022.
A. Appendix
We provide details, additional experimental results, and analysis omitted in the main text:
1. Appendix B: Attack parameters
2. Appendix C: Lattice reduction background
3. Appendix D: Additional model results (Section 4)
4. Appendix E: Additional training results (Section 5)
5. Appendix F: Further NoMod analysis
Table 10. LWE, preprocessing, and training parameters. For the adaptive increase of preprocessing parameters, we start with block size β1, Flatter α1, and LLL-delta δ1 and upgrade to β2, α2, and δ2 at a later stage. Parameters base B and bucket size are used to tokenize the numbers for transformer training.

LWE parameters        Preprocessing settings                    Model training settings
n     log2 q  h       β1   β2   α1     α2      δ1     δ2        Base B            Bucket size   Batch size
512   41      ≤ 70    18   22   0.04   0.025   0.96   0.9       137438953471      134217728     256
768   35      ≤ 11    18   22   0.04   0.025   0.96   0.9       1893812477        378762        128
1024  50      ≤ 11    18   22   0.04   0.025   0.96   0.9       25325715601611    5065143120    128
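As an illustration of how the base B and bucket size in Table 10 can be used to split an integer into two tokens (a high digit in base B and a binned low digit, as described in §2.2), here is a small sketch. It is our reading of the scheme, not the authors' released code, and the helper names are ours.

```python
def tokenize(x, base, bucket):
    """Split an integer x into two tokens: the high digit in the given base,
    and the low digit binned into buckets of size `bucket`."""
    hi, lo = divmod(x, base)
    return hi, lo // bucket

def detokenize(hi, lo_bin, base, bucket):
    """Approximate inverse: recover x up to the bucket resolution."""
    return hi * base + lo_bin * bucket

# n = 512 setting from Table 10: q = 2^41, B = 137438953471, bucket = 134217728
base, bucket = 137438953471, 134217728
x = 1_234_567_890_123
hi, lo = tokenize(x, base, bucket)
print(hi, lo)                                      # two small token ids
print(detokenize(hi, lo, base, bucket) <= x)       # True: rounding loses only the bucket remainder
```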
Table 11. Secret recovery (# successful recoveries / # attack attempts) with various model architectures for h = 57 to h = 66 (n = 512, log2 q = 41, binary secrets).

Architecture            h=57   h=58   h=59   h=60   h=61   h=62   h=63   h=64   h=65   h=66
Encoder-Decoder         1/10   -      1/10   -      -      -      1/10   -      -      -
Encoder (Vocabulary)    0/10   0/7    2/10   0/10   0/8    0/10   1/10   0/9    0/8    0/8
Encoder (Angular)       2/10   0/7    3/10   2/10   1/8    1/10   2/10   1/9    1/8    2/8

Table 12. Average time to recovery (hours) for successful recoveries with various model architectures for h = 57 to h = 66 (n = 512, log2 q = 41, binary secrets).

Architecture            h=57   h=58   h=59   h=60   h=61   h=62   h=63   h=64   h=65   h=66
Encoder-Decoder         10.0   -      20.0   -      -      -      17.5   -      -      -
Encoder (Vocabulary)    -      -      19.8   -      -      -      22.0   -      -      -
Encoder (Angular)       33.0   -      37.0   33.2   29.1   26.7   29.8   57.1   31.6   28.8

Table 13. Secret recovery (# successful recoveries / # attack attempts) with various model architectures for h = 57 to h = 66 (n = 512, log2 q = 41, ternary secrets).

Architecture            h=57   h=58   h=59   h=60   h=61   h=62   h=63   h=64   h=65   h=66
Encoder-Decoder         -      1/10   -      -      -      -      -      -      -      -
Encoder (Vocabulary)    0/9    1/8    0/9    0/9    1/10   0/8    0/7    0/7    0/10   1/9
Encoder (Angular)       0/9    1/8    0/9    0/9    1/10   0/8    0/7    0/7    0/10   1/9

Table 14. Average time to recovery (hours) for successful recoveries with various model architectures for h = 57 to h = 66 (n = 512, log2 q = 41, ternary secrets).

Architecture            h=57   h=58   h=59   h=60   h=61   h=62   h=63   h=64   h=65   h=66
Encoder-Decoder         -      27.5   -      -      -      -      -      -      -      -
Encoder (Vocabulary)    -      24.8   -      -      28.2   -      -      -      -      34.4
Encoder (Angular)       -      57.6   -      -      47.4   -      -      -      -      70.3
Adding layers also increases average recovery time. Increasing the embedding dimension from 256 to 512 with 4 layers improves secret recovery. Thus, we choose 4 layers with a hidden dimension of 512, as it recovers the most secrets (25%) the fastest (26.2 mean hours).
Table 15. Effect of different transformer depths (# of layers) and widths (embedding dimension) on secret recovery (# successful recoveries / # attack attempts) with the encoder-only model (n = 512, log2 q = 41, binary secrets).

Layers   Emb. Dim.   h=49   h=51   h=53   h=55   h=57   h=59   h=61   h=63   h=65   h=67
2        128         2/9    4/10   5/10   3/9    2/10   2/8    1/9    3/10   1/9    0/10
4        256         2/9    4/10   5/10   3/9    2/10   2/8    1/9    2/10   1/9    0/10
4        512         3/9    4/10   5/10   4/9    2/10   2/8    1/9    3/10   1/9    0/10
6        512         3/9    4/10   5/10   3/9    2/10   2/8    2/9    3/10   1/9    0/10
8        512         3/9    4/10   5/10   4/9    2/10   2/8    1/9    2/10   1/9    0/10
Table 16. Effect of different transformer depths (# of layers) and widths (embedding dimension) on average time to recovery (hours) with the encoder-only model (n = 512, log2 q = 41, binary secrets).

Layers   Emb. Dim.   h=49   h=51   h=53   h=55   h=57   h=59   h=61   h=63   h=65   h=67
2        128         8.7    10.0   8.3    14.0   11.0   5.5    11.5   17.9   18.9   -
4        256         30.9   11.0   12.3   30.3   39.9   10.4   20.6   13.3   19.6   -
4        512         27.2   15.4   18.2   35.2   35.4   17.3   24.3   32.2   26.2   -
6        512         44.1   18.8   20.7   34.2   30.5   19.4   47.9   32.1   28.1   -
8        512         46.4   22.5   24.0   44.6   34.6   23.8   28.0   29.0   30.3   -
Table 17. Successful secret recoveries out of 10 trials per h with varying amounts of training data (n = 768, log2 q = 35, binary secrets).

# Samples   h=5   h=7   h=9   h=11
100K        1     -     -     -
300K        1     1     -     -
1M          4     3     2     -
3M          2     3     1     -
4M          3     3     2     -

Table 18. Average time (hours) to successful secret recoveries with varying amounts of training data (n = 768, log2 q = 35, binary secrets).

# Samples   h=5    h=7    h=9    h=11
100K        1.4    -      -      -
300K        0.7    12.3   -      -
1M          27.9   24.5   39.2   -
3M          17.8   13.0   44.2   -
4M          13.5   18.5   21.8   -
Table 19. Successful secret recoveries out of 10 trials per h with varying amounts of training data (n = 1024, log2 q = 50, binary secrets).

# Samples   h=5   h=7   h=9   h=11   h=13   h=15
100K        3     -     -     -      -      -
300K        6     3     -     -      -      -
1M          6     4     1     -      1      -
3M          6     4     1     1      -      -
4M          7     5     1     -      -      -

Table 20. Average time (hours) to successful secret recoveries with varying amounts of training data (n = 1024, log2 q = 50, binary secrets).

# Samples   h=5    h=7    h=9    h=11   h=13   h=15
100K        11.6   -      -      -      -      -
300K        14.8   17.7   -      -      -      -
1M          15.7   29.2   36.1   -      47.4   -
3M          14.8   14.0   29.6   39.8   -      -
4M          19.9   10.6   21.3   -      -      -
We report results for the no-pre-training baseline (Tables 21 and 22), 80K steps of pre-training (Tables 23 and 24), 240K steps of pre-training (Tables 25 and 26), and 440K steps of pre-training (Tables 27 and 28).

Based on these results, we conclude that some pre-training helps to recover secrets with fewer samples, but more pre-training is not necessarily better. We also see that more pre-training increases the average time to successful secret recovery.
Table 21. Successful secret recoveries out of 10 trials per h with no model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
300K        -     -     1     -     -     -     -     -     -     -     -     -     -     -     -     -
1M          -     -     3     2     1     -     3     1     2     1     1     1     -     -     2     -
3M          1     -     4     3     1     -     2     1     3     1     1     2     -     -     2     -
4M          1     -     4     3     1     1     3     1     3     1     1     1     -     -     2     -
Table 22. Average time (hours) to successful secret recovery with no model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     -     -     -     -     -     -     -     -     -     -     -     -     -     -
300K        -     -     30.3  -     -     -     -     -     -     -     -     -     -     -     -     -
1M          -     -     21.1  36.9  36.3  -     27.5  21.9  23.9  16.5  50.7  52.9  -     -     36.8  -
3M          14.4  -     21.8  27.6  41.8  -     19.8  25.5  22.7  18.1  39.5  44.3  -     -     37.8  -
4M          17.2  -     22.8  26.5  29.1  69.7  22.7  22.7  20.9  21.4  29.0  25.7  -     -     40.5  -
Table 23. Successful secret recoveries out of 10 trials per h with 80K steps of model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     1     1     -     -     -     -     -     -     -     -     -     -     -     -
300K        1     -     2     2     -     -     1     -     1     1     -     -     -     -     2     -
1M          1     -     3     2     1     -     1     -     1     1     1     -     -     -     1     -
3M          1     -     3     2     1     -     1     -     1     1     1     -     -     -     2     -
4M          1     -     3     2     -     -     1     -     1     1     1     -     -     -     1     -
Table 24. Average time (hours) to successful secret recovery with 80K steps of model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     29.2  32.1  -     -     -     -     -     -     -     -     -     -     -     -
300K        57.7  -     47.6  26.9  -     -     31.7  -     62.6  55.6  -     -     -     -     50.9  -
1M          53.4  -     37.0  25.6  61.7  -     25.1  -     65.9  30.0  44.8  -     -     -     51.4  -
3M          34.4  -     41.8  25.1  57.5  -     26.2  -     45.1  23.6  38.4  -     -     -     39.7  -
4M          35.1  -     30.9  33.0  -     -     35.2  -     36.5  21.8  26.8  -     -     -     32.1  -
Table 25. Successful secret recoveries out of 10 trials per h with 240K steps of model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     -     1     -     -     -     -     -     -     -     -     -     -     -     -
300K        1     -     2     2     -     -     2     -     2     1     -     -     -     -     -     -
1M          1     -     2     2     1     -     1     -     1     1     1     -     -     -     1     -
3M          1     -     2     2     -     -     2     -     2     1     1     -     -     -     1     -
4M          1     -     2     2     1     -     2     -     1     1     1     -     -     -     1     -
Table 26. Average time (hours) to successful secret recovery with 240K steps of model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     -     21.0  -     -     -     -     -     -     -     -     -     -     -     -
300K        49.3  -     30.8  23.3  -     -     50.5  -     64.2  37.9  -     -     -     -     -     -
1M          49.4  -     29.5  23.4  50.8  -     20.1  -     24.8  24.2  51.9  -     -     -     61.9  -
3M          48.9  -     38.2  19.7  -     -     38.5  -     50.9  20.2  57.7  -     -     -     47.4  -
4M          42.4  -     34.9  19.5  55.8  -     29.0  -     24.1  18.3  54.0  -     -     -     52.4  -
Table 27. Successful secret recoveries out of 10 trials per h with 440K steps of model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     1     1     -     -     -     -     -     -     -     -     -     -     -     -
300K        -     -     1     2     -     -     1     1     1     1     -     -     -     -     1     -
1M          1     -     -     2     -     -     1     -     2     1     1     -     -     -     1     -
3M          1     -     1     2     -     -     1     -     1     1     1     -     -     -     2     -
4M          1     -     1     2     -     -     2     -     1     1     1     -     -     -     2     -
Table 28. Average time (hours) to successful secret recovery with 440K steps of model pre-training (n = 512, log2 q = 41, binary secrets).

# Samples   h=30  h=31  h=32  h=33  h=34  h=35  h=36  h=37  h=38  h=39  h=40  h=41  h=42  h=43  h=44  h=45
100K        -     -     33.3  20.0  -     -     -     -     -     -     -     -     -     -     -     -
300K        -     -     39.8  12.9  -     -     29.0  60.5  36.2  49.1  -     -     -     -     59.1  -
1M          56.9  -     -     16.6  -     -     18.6  -     43.6  24.9  36.4  -     -     -     53.4  -
3M          50.9  -     32.3  12.2  -     -     25.2  -     22.3  18.9  42.4  -     -     -     60.8  -
4M          58.3  -     30.0  13.5  -     -     34.6  -     19.7  22.3  45.7  -     -     -     59.7  -
NoMod is also a useful statistic for understanding attack success in a lab environment. Li et al. derived an empirical result stating that attacks should be successful when the NoMod factor of a dataset is ≥ 67. The NoMod analysis indicates that models trained in those experiments were only learning secrets from datasets in which the majority of Rb values do not "wrap around" q. If models could be trained to learn modular arithmetic better, this might ease the NoMod condition for success.

One of the main goals of introducing the angular embedding is to introduce some inductive bias into the model training process. Specifically, teaching models that 0 and q − 1 are close in the embedding space may enable them to better learn the modular arithmetic task at the heart of LWE. Here, we examine the NoMod factor of various datasets to see if the angular embedding does provide such inductive bias. If it did, we would expect models with angular embeddings to recover secrets from datasets with NoMod < 67. Table 29 lists NoMod percentages and successful secret recoveries for the angular and tokenization schemes described in §4.2; a short sketch of how NoMod can be computed for a dataset follows the table.
Table 29. NoMod percentages for Verde data, n = 512, log2 q = 41, binary secrets (varying h, secrets indexed 0-9), comparing performance of the angular vs. the vocabulary embedding. Key: recovered by angular only, recovered by both, not recovered.

h \ secret   0   1   2   3   4   5   6   7   8   9
57 45 49 52 41 51 46 51 57 60 49
58 48 48 48 48 38 43 52 52 45 48
59 43 46 56 50 66 63 35 46 41 40
60 48 48 52 59 50 58 48 49 51 54
61 60 49 43 41 56 42 42 41 41 50
62 45 42 45 54 55 43 61 56 54 42
63 56 60 55 54 63 47 54 51 45 43
64 44 46 41 41 41 47 45 43 41 55
65 45 51 48 60 45 48 41 48 45 50
66 45 51 39 64 45 47 43 60 48 55
67 43 47 48 49 40 47 48 51 50 46
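As referenced above, here is a small NumPy sketch (our own illustration, not the authors' code) of how a NoMod percentage like those in Tables 29-31 can be computed for a set of reduced samples and a candidate secret: it is the fraction of scalar products a · s that stay inside (−q/2, q/2) before any modular reduction, following the definition in §2.2.

```python
import numpy as np

def nomod_percentage(RA, s, q):
    """Percentage of reduced samples whose scalar product with the secret
    stays in (-q/2, q/2) without needing a modular reduction (NoMod)."""
    # center the reduced entries around 0 before taking scalar products
    A_centered = ((RA + q // 2) % q) - q // 2
    dots = A_centered @ s
    inside = np.abs(dots) < q / 2
    return 100.0 * inside.mean()

# toy usage on synthetic data (illustrative only)
rng = np.random.default_rng(0)
q, n = 2**20, 128
s = (rng.random(n) < 0.05).astype(np.int64)
RA = rng.integers(0, q, size=(10_000, n))
print(round(nomod_percentage(RA, s, q), 1))
```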
Table 30. NoMod percentages for n = 768, log2 q = 35 secrets (varying h, secrets indexed 0-9). Key: secret recovered, secret not recovered.

h \ secret   0    1    2    3    4    5    6    7    8    9
5            61   61   52   61   93   68   66   61   56   62
7            52   61   56   77   52   68   56   67   56   61
9            52   55   46   49   52   48   56   60   60   70
11           55   42   44   55   46   46   54   55   60   51
13           46   45   60   48   43   43   55   48   60   43
15           41   38   48   43   40   43   45   43   43   59

Table 31. NoMod percentages for n = 1024, log2 q = 50 secrets (varying h, secrets indexed 0-9). Key: secret recovered, secret not recovered.

h \ secret   0    1    2    3    4    5    6    7    8    9
5            62   66   70   69   81   81   81   53   69   94
7            62   47   80   80   69   56   57   52   57   69
9            56   69   52   49   52   53   56   57   46   52
11           44   52   52   62   61   62   56   56   52   49
13           51   46   56   40   46   52   44   49   68   53
15           61   46   40   45   46   41   38   47   48   44