EdMSM: Multi-Scalar-Multiplication for Recursive SNARKs and More
1 Introduction
A SNARK is a cryptographic primitive that enables a prover to prove to a verifier the
knowledge of a satisfying witness to a non-deterministic (NP) statement by producing a
proof π such that the size of π and the cost to verify it are both sub-linear in the size of
the witness. Today, the most efficient SNARKs use elliptic curves to generate and verify
the proof. A SNARK usually consists of three algorithms: Setup, Prove and Verify.
The Setup and Prove algorithms involve solving multiple large instances of tasks about
polynomial arithmetic in Fr [X] and multi-scalar multiplication (MSM) over the points of
an elliptic curve. Fast arithmetic in Fr [X], when manipulating large-degree polynomials, is
best implemented using the Fast Fourier Transform (FFT) [Pol71] and MSMs of large sizes
are best implemented using a variant of Pippenger’s algorithm [BDLO12, Section 4]. For
example, Table 1 reports the numbers of MSMs required in the Setup, Prove and Verify
algorithms in the [Gro16] SNARK and the KZG-based PLONK universal SNARK [GWC19].
The report excludes the number of FFTs as the dominating cost for such constructions is
the MSM computation (∼ 80% of the overall time).
Given a set of n elements G1, · · · , Gn (bases) in a cyclic group G (e.g. the group of points on an elliptic curve) whose order #G has b bits, and a set of n integers a1, · · · , an (scalars) between 0 and #G, the goal is to compute efficiently the group element [a1]G1 + · · · + [an]Gn.
In SNARK applications, we are interested in large instances of variable-base MSMs (n = 10^7, 10^8, 10^9), with random bases and random scalars, over the pairing groups G1 and G2.
Table 1: Cost of Setup, Prove and Verify algorithms for [Gro16] and PLONK. m = number of wires, n = number of multiplication gates, a = number of addition gates and ℓ = number of public inputs. MG = multiplication in G and P = pairing. Note: both the Groth16 and PLONK verifiers have a dependency on the number of public inputs ℓ, but for PLONK it is just a polynomial evaluation (FFT).
The naive algorithm uses a double-and-add strategy to compute each [ai ]Gi then
adds them all up, costing on average 3/2 · b · n group operations (+). There are several
algorithms that optimize the total number of group operations as a function of n such
as Strauss [Str64], Bos–Coster [dR95, Sec. 4] and Pippenger [Pip76] algorithms. For
large instances of a variable-base MSM, the fastest approach is a variant of Pippenger’s
algorithm [BDLO12, Sec. 4]. For simplicity, we call it the bucket method. In this paper
we are interested in the bucket-method MSM on inner curves of 2-chains and 2-cycles of
elliptic curves. We have chosen to test and benchmark our results on mobile devices because fast MSM is critical for the deployment of recursive proof systems in mobile applications.
Definition 1. A 2-chain of elliptic curves is a list of two distinct curves E1/Fp1 and E2/Fp2 where p1 and p2 are large primes and p1 | #E2(Fp2). SNARK-friendly 2-chains
are composed of two curves that have highly 2-adic subgroups of orders r1 | #E1 (Fp1 ) and
r2 | #E2(Fp2) such that r1 ≡ r2 ≡ 1 mod 2^L for a large integer L ≥ 1. This also means that p1 ≡ 1 mod 2^L.
In a 2-chain, the first curve is denoted the inner curve, while the second curve whose
order is the characteristic of the inner curve, is denoted the outer curve (cf. Fig. 1).
Inner curves from polynomial families. The best elliptic curves amenable to efficient
implementations arise from polynomial based families. These curves are obtained by
parameterizing the Complex Multiplication (CM) equation with polynomials p(x), t(x), r(x)
and y(x). The authors of [EG22] showed that the polynomial-based pairing-friendly Barreto–
Lynn–Scott families of embedding degrees k = 12 (BLS12) and k = 24 (BLS24) [BLS03]
Gautam Botrel and Youssef El Housni

[Figure 1: a 2-chain of elliptic curves, with the inner curve E1(Fp1) and the outer curve E2(Fp2) of order #E2(Fp2) = h · p1.]
are the most suitable to construct inner curves in the context of pairing-based SNARKs.
These curves require the seed x to satisfy x ≡ 1 mod 3 · 2^L to meet the 2-adicity requirement with respect to both r and p.
A particular example of an efficient 2-chain for SNARK applications is composed of
the inner curve BLS12-377 [BCG+ 20] and the outer curve BW6-761 [EG20].
We prove useful results (Prop. 1 and Lemma 1) that will be needed later to optimize
the MSM computation.
Proposition 1 ([EG22, Sec. 3.4]). All inner BLS curves admit a short Weierstrass form Y^2 = X^3 + 1.
Proof. Let E : Y^2 = X^3 + b be a BLS curve over Fp parametrized by polynomials in x [BLS03]. Let g be neither a square nor a cube in Fp. One choice of b ∈ {1, g, g^2, g^3, g^4, g^5} gives a curve with the correct order (i.e. r | #E(Fp)) [Sil09, §X.5]. For all BLS curves, x − 1 | #E(Fp) and 3 | x − 1 (which leads to all involved parameters being integers) [BLS03]. If, additionally, 2 | x − 1 then 2, 3 | #E(Fp) and the curve has points of order 2 and 3. A
2-torsion point is (x0, 0) with x0 a root of X^3 + b, hence b = (−x0)^3 is a cube. The two 3-torsion points are (0, ±√b), hence b is a square. This implies that b is a square and a cube in Fp, and therefore b = 1 is the only solution in the set {g^i} for 0 ≤ i ≤ 5 for half of all BLS curves: those with odd x.
An inner BLS curve has a seed x such that x ≡ 1 mod 3 · 2^L for some large integer L [EG22]. This means that 2 | x − 1 and hence that the curve is always of the form Y^2 = X^3 + 1.
Lemma 1. All inner BLS curves admit a twisted Edwards form.

Proof. By Prop. 1, an inner BLS curve has the form Y^2 = X^3 + 1 with the rational 2-torsion point (−1, 0), so it admits a twisted Edwards form whenever 3 · (−1)^2 = 3 is a square in Fp. Denote by (3/p) the Legendre symbol. The quadratic reciprocity theorem tells us that (3/p) · (p/3) = (−1)^((p−1)/2). We have p ≡ 1 mod 4 from the 2-adicity condition, so (3/p) = (p/3). Now (p/3) ≡ p mod 3, which is always equal to 1 for all BLS curves (x ≡ 1 mod 3 and x − 1 | p − 1). More generally, one can prove that 3 is a quadratic residue in Fp when p = 2 or p ≡ 1 or 11 mod 12. For inner BLS curves, we have p ≡ 1 mod 3 · 2^L with L ≥ 2.
2.2 2-cycles
Definition 2. A 2-cycle of elliptic curves is a list of two distinct prime-order curves E1/Fp1 and E2/Fp2 where p1 and p2 are large primes, p1 = #E2(Fp2) and p2 = #E1(Fp1). SNARK-friendly 2-cycles are composed of two curves that have highly 2-adic subgroups, i.e. #E1(Fp1) ≡ #E2(Fp2) ≡ 1 mod 2^L for a large integer L ≥ 1. This also means that p1 ≡ p2 ≡ 1 mod 2^L.
This notion was initially introduced under different names, for example amicable pairs
(or equivalently dual elliptic primes [Mih07]) for 2-cycles of ordinary curves, and aliquot
cycles for the general case [SS11]. Some examples of SNARK-friendly 2-cycles include
MNT4-MNT6 curves [BCTV14], Tweedle curves [BGH19] and Pasta curves [Hop20].
In particular, a 2-cycle is a 2-chain where both curves are inner and outer with respect to each other (cf. Fig. 2). This means that both curves in a 2-cycle admit a twisted Edwards form, following the same reasoning as in subsection 2.1. In the sequel we will focus on the case of BLS12 inner curves that form a 2-chain, but we stress that these results apply to 2-chain inner curves from other families (e.g. BLS24 and BN [AHG22]) and to 2-cycles as well.
[Figure 2: a 2-cycle of elliptic curves E1(Fp1) and E2(Fp2), each curve inner and outer with respect to the other.]
3.2 Step 3: combine the c-bit MSMs into the final b-bit MSM
Algorithm 1 gives an iterative way to combine the small MSMs into the original MSM.
Algorithm 1: Step 3
Input: the b/c window results T1, . . . , Tb/c (most significant first)
Output: T = [a1]G1 + · · · + [an]Gn
1: T ← T1
2: for i from 2 to b/c do
3:    T ← [2^c]T // double c times
4:    T ← T + Ti // add
5: return T
[Figure: the bucket method for one c-bit MSM. Each Gi is added into the bucket indexed by its c-bit scalar digit k ∈ {1, . . . , 2^c − 1}, giving bucket sums S1, . . . , S_{2^c−1}. The window result Σ_k [k]S_k is then computed with a running sum: S_{2^c−1}, then S_{2^c−1} + S_{2^c−2}, and so on down to S_{2^c−1} + S_{2^c−2} + · · · + S2 + S1, accumulating each partial sum into the total.]
Combining Steps 1, 2 and 3, the expected overall cost of the bucket method is

Total cost: (b/c) · (n + 2^c) + (b − c + b/c − 1) ≈ (b/c) · (n + 2^c) group operations.
Remark 1 (On choosing c). The theoretical minimum occurs at c ≈ log n and the asymptotic scaling looks like Θ(b · n / log n). In practice, however, empirical choices of c yield better performance, because the memory usage scales with 2^c and there are fewer edge cases if c divides b. For example, with n = 10^7 and b = 256, we observed peak performance at c = 16 instead of c = log n ≈ 23.
4 Optimizations
4.1 Parallelism
Since each c-bit MSM is independent of the rest, we can compute each one (Step 2) on a separate core. This makes full use of up to b/c cores, but increases memory usage as each core needs its own 2^c − 1 buckets (points). If more than b/c cores are available, further parallelism does not help much, because m MSM instances of size n/m cost more than one MSM instance of size n.
4.2 Precomputation
When the bases G1, · · · , Gn are known in advance, we can use a smooth trade-off between precomputed storage and run time. For each base Gi, choose k as big as the storage allows, precompute the k points [2^c − k]Gi, · · · , [2^c − 1]Gi, and use the bucket method with only the first 2^c − 1 − k buckets instead of 2^c − 1. The total cost becomes ≈ (b/c) · (n + 2^c − k). However, large MSM instances already use most of the available memory. For example, when n = 10^8 our implementation needs 58GB to store enough BLS12-377 curve points to produce a Groth16 [Gro16] proof. Hence, the precomputation approach yields a negligible improvement in our case.
The signed-digit decomposition cost is negligible but it works only if the bitsize of #G1
(and #G2 ) is strictly bigger than b. We use the spare bits to avoid the overflow. This
observation should be taken into account at the curve design level.
Curve forms and coordinate systems. To minimize storage as well as run time, one can store the bases Gi in affine coordinates. This way we only need the tuples (xi, yi) for storage (although we can batch-compress these following [Kos21]) and we can make use of mixed addition with a different coordinate system.
The overall cost of the bucket method is (b/c) · (n + 2^(c−1)) + (b − c + b/c − 1) group operations. This can be broken down explicitly into:
• Mixed additions: to accumulate the Gi in the c-bit MSM buckets, with cost (b/c) · (n − 2^(c−1) + 1)
• Additions and doublings: to combine the c-bit MSMs into the b-bit MSM, with cost b − c + b/c − 1
For large MSM instances, the dominating cost is in the mixed additions as it scales with n. For this, we use extended Jacobian coordinates {X, Y, ZZ, ZZZ} (x = X/ZZ, y = Y/ZZZ, ZZ^3 = ZZZ^2), trading off memory for run time compared to the usual Jacobian coordinates {X, Y, Z} (x = X/Z^2, y = Y/Z^3) (cf. Table 2).
Remark 2. In [GW20], the authors suggest using affine coordinates for batch addition. That is, they only compute the numerators in the affine addition, accumulate the denominators and then batch-invert them using the Montgomery trick [Mon87]. An affine addition costs 3m + 1i (i being a field inversion). For a single addition this is not worth it, as 1i > 7m (= 10m − 3m). If we accumulate L points and batch-add them with cost 3Lm + Li = 6Lm + 1i (the Montgomery trick costing Li = 3Lm + 1i), this might be worth it. Assuming 1i = C·m, there might be an improvement if we accumulate a number of points L > C/4. However, we did not observe a significant improvement in our implementation compared to the extended Jacobian approach. This is mainly because C is large, due to the optimized finite field arithmetic in the bigint library we use, which means L should be large as well, requiring more memory.
We work over fields of large prime characteristic (≠ 2, 3), so the elliptic curves in question always have a short Weierstrass (SW) form y^2 = x^3 + ax + b. Over this form, the fastest mixed addition is achieved using extended Jacobian coordinates. However, there are other forms that enable even faster mixed additions (cf. Table 3).
Table 3: Cost of mixed addition in different elliptic curve forms and coordinate systems
assuming 1m = 1s. Formulas and references from [BL22].
It appears that a twisted Edwards (tEd) form is appealing for the bucket method since
it has the lowest cost for the mixed addition in extended coordinates. Furthermore, the
arithmetic on this form is complete, i.e. the addition formulas are defined for all inputs. This improves the run time by eliminating the need for branching when adding the neutral element or doubling, compared to a SW form. We showed in Lemma 1 that all inner BLS curves admit a tEd form.
For the arithmetic, we use the formulas in [HWCD08] alongside some optimizations.
We take the example of BLS12-377 for which a = −1:
• To combine the c-bit MSMs into a b-bit MSM we use unified additions [HWCD08,
Sec. 3.1] (9m) and dedicated doublings [HWCD08, Sec. 3.3] (4m + 4s).
• To combine the bucket sums we use unified additions (9m) to keep track of the running sum and unified re-additions (8m) to keep track of the total sum. We save 1m by caching the multiplication by 2d′ from the running sum.
• To accumulate the Gi in the c-bit MSM we use unified re-additions with some precomputation. Instead of storing Gi in affine coordinates we store them in a custom coordinate system (X, Y, T) where X = y − x, Y = y + x and T = 2d′ · x · y. This saves 1m and 2a (additions) at each accumulation of Gi.
We note that although the dedicated addition (resp. the dedicated mixed addition) in [HWCD08, Sec. 3.2] saves the multiplication by 2d′, it costs 4m (resp. 2m) to check the operands' equality: X1Z2 = X2Z1 and Y1Z2 = Y2Z1 (resp. X1 = X2Z1 and Y1 = Y2Z1). This cost offset makes both the dedicated (mixed) addition and the dedicated doubling slower than the unified (mixed) addition in the MSM case. We also note that the conversion of all the Gi points, given on a SW curve in affine coordinates, to points on a tEd curve (also with a = −1) in the custom coordinates (X, Y, T) is a one-time computation dominated by a single inverse using the Montgomery batch trick. In SNARKs, since the Gi are points from the proving key, this computation can be part of the Setup algorithm and does not impact the Prove algorithm. If the Setup ceremony is yet to be conducted, it can be performed directly with points in the twisted Edwards form.
Our implementation shows that an MSM instance of size 2^16 on the BLS12-377 curve is 30% faster when the Gi points are given on a tEd curve with the custom coordinates, compared to the extended-Jacobian-based version which takes points in affine coordinates on a SW curve.
5 Implementation
Submissions to the ZPrize “Accelerating MSM on Mobile” division must run on Android 12
(API level 32) and are tested on the Samsung Galaxy A13 5G (Model SM-A136ULGDXAA
with SoC MediaTek Dimensity 700 (MT6833)). The MSM must be an instance of 2^16 G1-points on the BLS12-377 curve. The baseline is the arkworks [aC22] MSM implementation in Rust (the bucket method), a widely used library in SNARK projects. Submissions must beat this baseline by at least 10% in order to be eligible for the prize. We implemented our algorithm in Go using the gnark-crypto bigint library [BPH+]. The ZPrize judges chose this mobile device as a representative Android device: its specifications are similar to older high-end devices and new budget devices, so it represents the kind of hardware that is common today in wealthy markets and will become common over the next 3-5 years in middle-income markets. It is also widely available and relatively inexpensive. We achieved a speedup of 78% (cf. Table 4). The source code is available under MIT or Apache-2 licenses at:
https://github.com/gbotrel/zprize-mobile-harness
The speedup against the baseline/arkworks comes from the algorithmic optimizations
discussed in this paper and the bigint arithmetic optimizations in gnark-crypto aimed
at the arm64 target. We use a Montgomery CIOS variant to handle the field multiplication (details of the algorithms and proofs are in Appendix A). On x86 architectures, gnark-crypto leverages the ADX and BMI2 instructions to efficiently handle the interleaved carry chains in the algorithm. For the arm64 architecture, we “untangled” the carry propagation in the pure Go code to ensure the carry chains were uninterrupted. Moreover, the large number of registers available (in practice 28 for arm64 against 14 for x86) allowed for an efficient implementation of the squaring function: the 64-bit word-word multiplications are performed at the beginning of each iteration and the results are stored in registers. This
Table 4: Comparison of the ZPrize baseline and the submission for MSM instances of 2^16 G1-points on the BLS12-377 curve.
allows us to keep uninterrupted carry chains when doubling the intermediate product. The impact of these optimizations is ∼17% for the Fp multiplication and ∼25% for the squaring.
For an ext-Jac MSM instance of size 2^16, the timing was 821ms before these arm64 field arithmetic optimizations and 620ms after. For the tEd-custom version the speedup is only related to the Fp multiplication, since there are no squarings in the mixed addition. For this same version, we stored (y − x, y + x) in the coordinate system instead of (x, y) and added ∼40 lines of arm64 assembly for a small function in Fp (Butterfly(a, b) → a = a + b; b = a − b). The butterfly's performance impact was ∼5%, as it speeds up the unified (mixed) addition in the tEd form.
However, the large gap cannot be justified by these facts alone. The target device SoC can run both 32-bit and 64-bit instruction sets, but the stock firmware runs a 32-bit ARM architecture (armv7), on which the baseline implementation is benchmarked by the ZPrize judges. For the sake of the competition, we performed a static build targeting the 64-bit ARM architecture (arm64), which allowed us, without a complicated build process, to run 64-bit code on the target device.
For the sake of this paper, and for a fair comparison, we perform the same architecture
hack on the baseline implementation. We report in Figure 3 a comparison of our code to the
baseline. We report timings of several MSM instances of different sizes and with different
curve parameterizations (SW in extended Jacobians vs. tEd (a = −1) in custom/extended
coordinates).
Figure 3: Comparison of our MSM code and the arkworks one for different instances on
the BLS12-377 G1 group.
For the ZPrize MSM instance of size 2^16, the speedup is 45% with the tEd version and 33% with the more generic SW-extJac version. For sizes ranging from 2^8 to 2^18, the speedup is 40-47% with the tEd version and 20-35% with SW-extJac.
6 Conclusion
Multi-scalar multiplication dominates the proving cost in most elliptic-curve-based SNARKs. Inner curves such as BLS12-377 are optimized elliptic curves suitable both for proving general-purpose statements and, in particular, for proving composition and recursive statements. Hence, it is critical to aggressively optimize the computation of MSM instances on these curves. We showed that our work yields a very fast implementation when the points are given on a short Weierstrass curve, and an even faster one when the points are given on a twisted Edwards curve. We showed that the latter is always possible for inner curves such as BLS12-377 and that the conversion cost is a one-time computation that can be performed in the Setup phase. We note that, more generally, these tricks apply to any elliptic curve that admits a twisted Edwards form, particularly to SNARK-friendly 2-cycles of elliptic curves. We suggest that this should be taken into account at the design level of SNARK-friendly curves.
Open question: for Groth16, the same scalars ai are used for both the G1 and G2 MSMs. Is it possible to share a maximum of computation between these two instances? It seems that moving to a type-2 pairing would allow deducing the G1 instance from the G2 one using an efficient homomorphism (the trace map) applied to the resulting single point. However, the G2 computations would then be done over the much slower full extension Fp^k. The pairing, needed for proof verification, would also be moderately slower because of the anti-trace map.
Acknowledgement
The two co-authors of this paper are also co-authors of the gnark-crypto library. We
thank the other co-authors Thomas Piellard, Ivo Kubjas and Arya Pourtaba Tabaie for
their contributions to gnark-crypto, which allowed this work. We also acknowledge that
the appendix of this paper is a reprint of the material as it appears in a blog note we
shared in 2020 on hackmd.io. This note was never published in an academic paper.
ab mod p .
Overview of the solution: the Montgomery multiplication. There are many good
expositions of the Montgomery multiplication (e.g. [BM17]). As such, we do not go into
detail on the mathematics of the Montgomery multiplication. Instead, this section is
intended to establish notation that is used throughout this appendix.
The Montgomery multiplication algorithm does not directly compute ab mod p. In-
stead it computes abR−1 mod p for some carefully chosen number R called the Montgomery
radix. Typically, R is set to the smallest power of two exceeding p that falls on a computer word boundary. For example, if p is 381 bits then R = 2^(6×64) = 2^384 on a 64-bit architecture.
In order to make use of the Montgomery multiplication the numbers a and b must
be encoded into the Montgomery form: instead of storing (a, b), we store the numbers
(ã, b̃) given by ã = aR mod p and b̃ = bR mod p. A simple calculation shows that the
Montgomery multiplication produces the product ab mod p, encoded in the Montgomery
form: (aR)(bR)R−1 = abR mod p. The idea is that numbers are always stored in the
Montgomery form so as to avoid costly conversions to and from the Montgomery form.
Other arithmetic operations such as addition and subtraction are unaffected by the Montgomery-form encoding, but the modular inverse computation a^(−1) mod p must be adapted to account for the Montgomery form. We do not discuss modular inversion in this appendix.
How fast is the CIOS method? Let N denote the number of machine words needed to
store the modulus p. For example, if p is a 381-bit prime and the hardware has a 64-bit word size, then N = 6. The CIOS method solves modular multiplication using 4N^2 + 4N + 2 unsigned integer additions and 2N^2 + N unsigned integer multiplications.
Our optimization reduces the number of additions needed in the CIOS Montgomery multiplication to only 4N^2 − N, a saving of 5N + 2 additions. This optimization can be used whenever the highest bit of the modulus is zero (and not all of the remaining bits are set; see below for details).
The core of the state-of-the-art CIOS Montgomery multiplication is reproduced below.
This listing is adapted from Section 2.3.2 of Tolga Acar’s thesis [Aca98]. The symbols in
this listing have the following meanings:
Next, we show that we can save the additions in lines 5 and 12 of Alg. 3 when the
highest word of the modulus p is at most (D − 1)/2 − 1. This condition holds if and only
if the highest bit of the modulus is zero and not all of the remaining bits are set.
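This condition can be checked mechanically; a sketch for 64-bit words (the function name is ours):

```go
package main

import "fmt"

// noCarryOK reports whether a modulus, given as little-endian 64-bit words,
// satisfies the condition for the optimization: highest word at most
// (D-1)/2 - 1 with D = 2^64, i.e. the top bit is zero and not all of the
// remaining bits are set.
func noCarryOK(pWords []uint64) bool {
	top := pWords[len(pWords)-1]
	return top <= 1<<63-2
}

func main() {
	fmt.Println(noCarryOK([]uint64{0, 1<<63 - 2})) // true: qualifies
	fmt.Println(noCarryOK([]uint64{0, 1<<63 - 1})) // false: all remaining bits set
	fmt.Println(noCarryOK([]uint64{0, 1 << 63}))   // false: top bit set
}
```

SNARK-friendly primes such as the 377-bit BLS12-377 base field leave many spare bits in the top 64-bit word, so they qualify comfortably.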
Our optimization. Observe that lines 4 and 10 have the form (hi, lo) := m1 + m2 · B + m3, where hi, lo, m1, m2, m3 and B are machine words, each at most D − 1. If B ≤ (D − 1)/2 − 1 then a simple calculation shows that

m1 + m2 · B + m3 ≤ (D − 1) + (D − 1) · ((D − 1)/2 − 1) + (D − 1) = D · (D − 1)/2 + ((D + 1)/2 − 1),

so the high word hi is at most (D − 1)/2 and the low word lo is at most (D + 1)/2 − 1.
t[N] = C and t[N + 1] = 0.
A similar observation holds at the end of the second inner loop (j = N − 1) on line 10: Lemma 2 implies that the carry C is at most (D − 1)/2. We previously observed that t[N] is also at most (D − 1)/2, so t[N] + C is at most (D − 1)/2 + (D − 1)/2 = D − 1, which fits entirely into a single word. Then line 11 sets C to 0 and line 12 sets t[N] to 0.
The proof by induction is now complete.
The final algorithm. For N < 12, we merge the two inner loops.
Our optimized Montgomery squaring. The condition on the modulus differs here: p[N − 1] ≤ (D − 1)/4 − 1.
References
[aC22] arkworks Contributors. arkworks zkSNARK ecosystem. https://arkworks.rs, 2022.
[AHG22] Diego F. Aranha, Youssef El Housni, and Aurore Guillevic. A survey of elliptic
curves for proof systems. Cryptology ePrint Archive, Paper 2022/586, 2022.
https://eprint.iacr.org/2022/586.
[BCG+ 20] Sean Bowe, Alessandro Chiesa, Matthew Green, Ian Miers, Pratyush Mishra,
and Howard Wu. ZEXE: Enabling decentralized private computation. pages
947–964, 2020.
[BCTV14] Eli Ben-Sasson, Alessandro Chiesa, Eran Tromer, and Madars Virza. Scalable
zero knowledge via cycles of elliptic curves. LNCS, pages 276–294, 2014.
[BDLO12] Daniel J. Bernstein, Jeroen Doumen, Tanja Lange, and Jan-Jaap Oosterwijk.
Faster batch forgery identification. LNCS, pages 454–473, 2012.
[BGH19] Sean Bowe, Jack Grigg, and Daira Hopwood. Halo: Recursive proof composition without a trusted setup. Cryptology ePrint Archive, Report 2019/1021, 2019. https://eprint.iacr.org/2019/1021.
[BL22] Daniel Bernstein and Tanja Lange. Explicit-formulas database. https://www.hyperelliptic.org/EFD/, 2022.
[BLS03] Paulo S. L. M. Barreto, Ben Lynn, and Michael Scott. Constructing elliptic
curves with prescribed embedding degrees. LNCS, pages 257–267, 2003.
[BM17] Joppe W. Bos and Peter L. Montgomery. Montgomery arithmetic from a
software perspective. Cryptology ePrint Archive, Report 2017/1057, 2017.
https://eprint.iacr.org/2017/1057.
[BPH+ ] Gautam Botrel, Thomas Piellard, Youssef El Housni, Arya Tabaie, and Ivo
Kubjas. Go library for finite fields, elliptic curves and pairings for zero-
knowledge proof systems.
[dR95] Peter de Rooij. Efficient exponentiation using precomputation and vector addition chains. LNCS, pages 389–399, 1995.
[EG20] Youssef El Housni and Aurore Guillevic. Optimized and secure pairing-friendly
elliptic curves suitable for one layer proof composition. LNCS, pages 259–279,
2020.
[EG22] Youssef El Housni and Aurore Guillevic. Families of SNARK-friendly 2-chains
of elliptic curves. LNCS, pages 367–396, 2022.
[Gro16] Jens Groth. On the size of pairing-based non-interactive arguments. LNCS,
pages 305–326, 2016.
[GW20] Ariel Gabizon and Zachary Williamson. Proposal: The Turbo-PLONK program syntax for specifying SNARK programs. https://docs.zkproof.org/pages/standards/accepted-workshop3/proposal-turbo_plonk.pdf, 2020.
[GWC19] Ariel Gabizon, Zachary J. Williamson, and Oana Ciobotaru. PLONK: Permutations over Lagrange-bases for oecumenical noninteractive arguments of knowledge. Cryptology ePrint Archive, Report 2019/953, 2019. https://eprint.iacr.org/2019/953.
[Hop20] Daira Hopwood. The Pasta curves for Halo 2 and beyond. https://electriccoin.co/blog/the-pasta-curves-for-halo-2-and-beyond/, 2020.
[HWCD08] Hüseyin Hisil, Kenneth Koon-Ho Wong, Gary Carter, and Ed Dawson. Twisted
Edwards curves revisited. LNCS, pages 326–343, 2008.
[Kos21] Dmitrii Koshelev. Batch point compression in the context of advanced pairing-based protocols. Cryptology ePrint Archive, Report 2021/1446, 2021. https://eprint.iacr.org/2021/1446.
[Mih07] Preda Mihailescu. Dual elliptic primes and applications to cyclotomy primality
proving. arXiv 0709.4113, 2007.
[MO90] François Morain and Jorge Olivos. Speeding up the computations on an elliptic
curve using addition-subtraction chains. RAIRO - Theoretical Informatics and
Applications - Informatique Théorique et Applications, 24(6):531–543, 1990.
[Pip76] Nicholas Pippenger. On the evaluation of powers and related problems (pre-
liminary version). In 17th Annual Symposium on Foundations of Computer
Science, Houston, Texas, USA, 25-27 October 1976, pages 258–263. IEEE
Computer Society, 1976.
[Pol71] J. M. Pollard. The Fast Fourier Transform in a finite field. Math. Comp.,
25(114):365–374, April 1971.
[Sil09] Joseph H Silverman. The Arithmetic of Elliptic Curves. Graduate texts in
mathematics. Springer, Dordrecht, 2009.
[SS11] Joseph H. Silverman and Katherine E. Stange. Amicable Pairs and Aliquot
Cycles for Elliptic Curves. Experimental Mathematics, 20(3):329 – 357, 2011.