Incremental Gauss–Newton Methods with
Superlinear Convergence Rates
Abstract
This paper addresses the challenge of solving large-scale nonlinear equations with Hölder continuous Jacobians. We introduce a novel Incremental Gauss–Newton (IGN) method with an explicit superlinear convergence rate, which outperforms existing methods that only achieve linear convergence rates. In particular, we formulate our problem as nonlinear least squares with finite-sum structure, and our method iterates incrementally using the information of one component in each round. We also provide a mini-batch extension of our IGN method that obtains an even faster superlinear convergence rate. Furthermore, we conduct numerical experiments to show the advantages of the proposed methods.
1 Introduction
We study the problem of solving the system of nonlinear equations
$f(x) = 0$ (1)
where the nonlinear vector function is Lipschitz continuous and its Jacobian is Hölder continuous. This formulation is a fundamental problem in scientific computing [38], and it arises in a large number of applications including machine learning [15, 10, 4], control systems [7], data assimilation [45] and game theory [20, 40].
Newton-type methods [16, 25, 26, 6, 39, 46] are widely used for solving nonlinear equations. The classical Newton’s method uses the curvature information in Jacobians to obtain a local quadratic convergence rate [39], but it suffers from the expensive computational cost of accessing the Jacobian and its (pseudo) inverse. Several lines of research focus on approximating Newton’s method with inexact Jacobians. For example, the quasi-Newton methods [11, 29, 1] estimate the Jacobians via secant equations, leading to iteration schemes that only need access to function values and Jacobian-vector products. The explicit local superlinear convergence rates of these methods have been established in recent years [30, 31, 32, 49]. Another line of work [51, 50, 42] introduces the matrix sketching technique [48] to reduce the dimension of the Jacobian matrix, which improves the computational efficiency per iteration. The superiority of their local convergence depends on the structure of the Jacobian in the specific problem. Although quasi-Newton and sketched Newton methods can benefit from inexact Jacobians, they still require accessing the full nonlinear vector function value at every iteration.
For large-scale nonlinear equations, we are interested in methods that do not require the computation of full function values and Jacobians. In particular, Bertsekas [8] proposed a variant of the Gauss–Newton (GN) method by following the Extended Kalman Filter (EKF) framework [3, 34, 5], which incrementally accesses partial information of the vector function values and the corresponding Jacobian during the iterations. Subsequently, Moriyama et al. [36] incorporated a stepsize into the EKF method, which guarantees a global linear convergence rate under the gradient-growth condition [23]. In the past decade, incremental (quasi) Newton methods with local superlinear convergence rates have been established for strongly convex optimization [43, 44, 35, 28, 33]. (From the viewpoint of solving nonlinear equations, the methods designed for convex optimization [43, 44, 35, 28, 33] require the additional assumption that the Jacobian is symmetric positive-definite.) However, the superiority of local convergence for incremental Newton-type methods in solving general nonlinear equations is still unclear.
In this work, we propose an incremental Gauss–Newton (IGN) method for solving systems of nonlinear equations. Our method only requires access to one component of the nonlinear vector function and its gradient per iteration. We maintain an aggregated vector and an aggregated matrix, updated incrementally, to estimate the vector function value and its Jacobian. We also introduce a Gram matrix with a low-rank update to reduce the computational cost of the matrix inverse in vanilla Gauss–Newton methods. The theoretical analysis shows our IGN method enjoys an explicit local superlinear convergence rate for nonlinear equations with Hölder continuous Jacobians. Furthermore, we provide a variant of our IGN that makes use of the information of a mini-batch of components, which achieves an even faster superlinear convergence rate. The numerical experiments on real-world applications validate the advantages of the proposed methods.
Paper Organization
In Section 2, we formalize the notations and assumptions for our problem. In Section 3, we propose our incremental Gauss–Newton (IGN) method and present its convergence analysis. In Section 4, we extend the IGN method with the mini-batch update to obtain an even faster convergence rate. In Section 5, we provide a discussion to compare the proposed method with related works. In Section 6, we conduct numerical experiments to show the advantages of our methods. We conclude our work in Section 7.
2 Preliminaries
In this section, we formalize the notations and assumptions throughout this paper.
2.1 Notations
We let and use the notation to represent the remainder of divided by . We denote as the -th standard basis vector of the -dimensional Euclidean space for all . We use to represent the Euclidean norm for a given vector and the spectral norm for a given matrix. Moreover, we use the notation to represent the smallest singular value for a given matrix.
For the system of nonlinear equations (1), we partition the vector function at as , where . We also denote the gradient of component at as , and we organize the corresponding Jacobian as .
2.2 Assumptions
Throughout this paper, we suppose the function satisfies the following assumptions.
Assumption 1.
We suppose the vector function is Lipschitz continuous, i.e., there exists constant such that
(2) |
for all .
Assumption 2.
We suppose the Jacobian is -Hölder continuous for some , i.e., there exists constant such that
(3) |
for all .
Assumption 3.
The system of nonlinear equations (1) has a non-degenerate solution , i.e., there exists some such that
(4) |
3 The Incremental Gauss–Newton Method
In this section, we propose the Incremental Gauss–Newton (IGN) method and provide its explicit superlinear convergence rate.
3.1 The Algorithm
We first introduce the intuition of our algorithm design. Solving the system of nonlinear equations (1) can be regarded as minimizing the norm of the nonlinear vector function , which means we can reformulate the problem as the following nonlinear least squares minimization problem
(5) |
For each component , we consider its linear approximation
(6) |
where is some point related to component at the -th iteration. The estimation (6) motivates us to construct the surrogate problem for the nonlinear least squares (5) as follows
(7) |
Since each is convex, problem (7) has the closed-form solution
(8) |
We assume the matrix is always non-singular in this subsection, which will be verified under our assumptions in later analysis.
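As a sketch of why the surrogate problem (7) admits a closed form, the derivation can be written out in notation assumed here purely for illustration (the symbols $f_i$, $z_i$, $g_i$, $H$ and $u$ are placeholders and may differ from the paper's own notation and scaling):

```latex
% Assumed notation: g_i = \nabla f_i(z_i), with one linearization point z_i per component.
\begin{align*}
\hat{x} &\in \arg\min_{x} \; \frac{1}{2}\sum_{i=1}^{n}
   \bigl( f_i(z_i) + g_i^\top (x - z_i) \bigr)^2, \\
0 &= \sum_{i=1}^{n} g_i \bigl( f_i(z_i) + g_i^\top (\hat{x} - z_i) \bigr)
   \quad \text{(first-order optimality)}, \\
\Bigl(\sum_{i=1}^{n} g_i g_i^\top\Bigr) \hat{x}
  &= \sum_{i=1}^{n} g_i \bigl( g_i^\top z_i - f_i(z_i) \bigr), \\
\hat{x} &= H^{-1} u, \qquad
  H = \sum_{i=1}^{n} g_i g_i^\top, \quad
  u = \sum_{i=1}^{n} g_i \bigl( g_i^\top z_i - f_i(z_i) \bigr).
\end{align*}
```

The last line requires the Gram matrix $H$ to be non-singular, which is exactly the assumption made in this subsection.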
We propose the Incremental Gauss–Newton (IGN) method by performing the update (8) at the -th iteration. It is worth noting that we can take advantage of the inherent finite-sum structure in formulation (5) to establish incremental methods. Specifically, we update one of at each iteration in a cyclic fashion, that is
(9) |
where . This indicates that we only need to address the terms associated with point in update (8), which can be implemented by introducing the aggregated variables
(10) |
Then we can write update (8) as
(11) |
and maintain the aggregated variables by the following recursions (note that there is no need to explicitly construct the matrix in the implementation, although this matrix is useful for understanding and analyzing our method)
(12) |
where the last one is based on Sherman–Morrison–Woodbury formula [47] and definitions
(13) |
Since each of matrices and only contains two columns, updating the variables , and can be implemented within the complexity of flops. Additionally, the memory cost for maintaining variables , , and is . As a comparison, the vanilla Gauss–Newton (GN) method [6, 39] performs the iteration
(14) |
which takes a computation cost of flops and a memory cost of .
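The rank-two inverse update described above can be illustrated with a minimal NumPy sketch. The function name `smw_rank2_update` and the choice of the two-column factors `U = [g_new, g_old]` and `V = [g_new, -g_old]` are assumptions made for illustration; they are one natural way to express the replacement of a single stored gradient, not necessarily the exact definitions in (13).

```python
import numpy as np

def smw_rank2_update(G, g_new, g_old):
    """Given G = H^{-1}, return the inverse of H + g_new g_new^T - g_old g_old^T.

    Uses the Sherman-Morrison-Woodbury identity
    (H + U V^T)^{-1} = G - G U (I + V^T G U)^{-1} V^T G
    with two-column factors, so only a 2x2 system is solved.
    """
    U = np.column_stack([g_new, g_old])    # d x 2 (assumed factorization)
    V = np.column_stack([g_new, -g_old])   # d x 2, so U V^T is the rank-two change
    GU = G @ U                             # O(d^2) work
    K = np.eye(2) + V.T @ GU               # 2 x 2 capacitance matrix
    return G - GU @ np.linalg.solve(K, V.T @ G)
```

Only matrix-vector products with the stored inverse and one 2-by-2 solve are needed, which is where the quadratic per-iteration cost in the dimension comes from.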
We summarize the procedure of our IGN in Algorithm 1. Observe that the vanilla GN iteration (14) can be reformulated by
(15) |
Comparing our updates (7)–(11) with (15), the aggregated variables , and can be regarded as the estimators of the terms , and in scheme (15), respectively. The efficiency of our IGN method comes from the strategy of applying a different in the linear approximation (6) for each component . In contrast, the vanilla GN method is based on the linear approximation at the identical point for all components.
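The cyclic scheme described in this subsection can be sketched in a few lines of NumPy. This is a hedged illustration rather than a transcription of Algorithm 1: the callables `f_i` and `grad_i`, the function name `ign`, the stored scalars `c`, and the choice to initialize every linearization point at `x0` are all assumptions made here.

```python
import numpy as np

def ign(f_i, grad_i, x0, n, num_iters):
    """Cyclic incremental Gauss-Newton sketch.

    f_i(i, x) returns the i-th component value, grad_i(i, x) its gradient.
    Assumption: all linearization points z_i start at x0.
    """
    grads = np.stack([grad_i(i, x0) for i in range(n)])    # stored g_i, shape (n, d)
    c = grads @ x0 - np.array([f_i(i, x0) for i in range(n)])  # c_i = g_i^T z_i - f_i(z_i)
    G = np.linalg.inv(grads.T @ grads)    # inverse of the Gram matrix
    u = grads.T @ c                       # aggregated vector
    x = G @ u                             # first step coincides with a GN step
    for t in range(num_iters):
        i = t % n                         # cyclic component selection
        g_old, c_old = grads[i].copy(), c[i]
        g_new = grad_i(i, x)              # relinearize component i at the current x
        c_new = g_new @ x - f_i(i, x)
        u += c_new * g_new - c_old * g_old
        # Rank-two Sherman-Morrison-Woodbury update of the inverse Gram matrix.
        U = np.column_stack([g_new, g_old])
        V = np.column_stack([g_new, -g_old])
        GU = G @ U
        K = np.eye(2) + V.T @ GU
        G = G - GU @ np.linalg.solve(K, V.T @ G)
        grads[i], c[i] = g_new, c_new
        x = G @ u                         # closed-form minimizer of the surrogate
    return x
```

In this sketch only one component value and one gradient are evaluated per iteration, while the stored gradients play the role of the historical information that the aggregated variables summarize.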
3.2 The Convergence Analysis
In this subsection, we establish the local superlinear convergence of the proposed IGN method.
We start our analysis from the following proposition, which shows the non-singularity of the Gram matrix associated with the exact Jacobian at the non-degenerate solution .
Proposition 1.
Under Assumption 3, it holds that
(16) |
Under the continuity assumptions on and , we can provide the Hölder continuity of the Gram matrices.
Recall that the design of the IGN method is motivated by the estimation , which indicates we can connect Proposition 1 and Lemma 1 to bound the spectrum of as follows.
Lemma 2 indicates that if all of the points are sufficiently close to the solution , the matrix is positive-definite, which guarantees that the inverse of (i.e., matrix ) in the algorithm is always well-defined. Based on this intuition, we use induction to show the positive-definiteness of matrices and , and the local superlinear convergence rate of the proposed method.
Theorem 1.
Under Assumptions 1, 2 and 3, running IGN (Algorithm 1) with initialization , and such that
we have and for all . Additionally, there exists sequence such that and it holds
for all .
Observing that the term of in Theorem 1 is monotonically decreasing with respect to , we can bound it by and simplify the superlinear convergence as follows.
Corollary 1.
Theorem 1 also indicates that a larger leads to a faster superlinear convergence rate. In the case of , our Hölder continuity condition (Assumption 2) reduces to Lipschitz continuity, and we can achieve the -step local quadratic convergence rate as follows.
Corollary 2.
Under the settings and notations of Theorem 1 with , we have the -step quadratic convergence
for all .
4 The Extension to Mini-Batch Methods
We can also improve the efficiency of IGN method by using the mini-batch update. Specifically, we consider the mini-batch size and divide the indices into non-overlapping subsets, i.e., we partition the index set into subsets such that , and for all distinct .
The mini-batch variant of IGN also applies the update of the form . Different from IGN, we update variables with the smaller period such that
(17) |
where .
We establish recursions of aggregated variables by a mini-batch way as follows
(18) |
where we construct matrices as
and indices are the elements in subset such that .
We formally present the procedure of the Mini-Batch Incremental Gauss–Newton (MB-IGN) method in Algorithm 2 (see Appendix A). The memory cost of MB-IGN is , matching the complexity of IGN. Each iteration of MB-IGN includes the matrix multiplication of , and within the complexity of flops. It is worth noting that the mini-batch update in MB-IGN can be efficiently implemented by block matrix operations that take advantage of parallel computation [14].
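A minimal NumPy sketch of the mini-batch recursion follows. It is an illustration under stated assumptions, not Algorithm 2 itself: the callables `f_batch` and `jac_batch`, the function name `mb_ign`, the contiguous partition of the indices, the requirement that the batch size divides the number of components, and the initialization of all linearization points at `x0` are choices made here.

```python
import numpy as np

def mb_ign(f_batch, jac_batch, x0, n, k, num_iters):
    """Mini-batch incremental Gauss-Newton sketch.

    f_batch(idx, x) returns the component values for indices idx,
    jac_batch(idx, x) returns the matching Jacobian rows (len(idx) x d).
    Assumption: k divides n and batches are contiguous index blocks.
    """
    assert n % k == 0
    batches = np.arange(n).reshape(n // k, k)
    all_idx = np.arange(n)
    J = jac_batch(all_idx, x0)                 # stored gradients, shape (n, d)
    c = J @ x0 - f_batch(all_idx, x0)          # c_i = g_i^T z_i - f_i(z_i)
    G = np.linalg.inv(J.T @ J)
    u = J.T @ c
    x = G @ u
    for t in range(num_iters):
        idx = batches[t % (n // k)]            # cyclic pass over the batches
        J_old, c_old = J[idx].copy(), c[idx].copy()
        J_new = jac_batch(idx, x)
        c_new = J_new @ x - f_batch(idx, x)
        u += J_new.T @ c_new - J_old.T @ c_old
        # Rank-2k Woodbury update: only a (2k x 2k) system is solved.
        U = np.hstack([J_new.T, J_old.T])
        V = np.hstack([J_new.T, -J_old.T])
        GU = G @ U
        K = np.eye(2 * k) + V.T @ GU
        G = G - GU @ np.linalg.solve(K, V.T @ G)
        J[idx], c[idx] = J_new, c_new
        x = G @ u
    return x
```

With `k = 1` the sketch reduces to the single-component update, and larger `k` refreshes all stored linearization points in fewer iterations while expressing the work as block matrix products.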
Formally, we present the following convergence results of MB-IGN.
Theorem 2.
Under Assumptions 1, 2 and 3, running MB-IGN (Algorithm 2) with mini-batch size and initialization , and such that
we have and for all . Additionally, there exists sequence such that and it holds
The terms of in the results of Theorem 2 imply that increasing mini-batch size can speed up the convergence of MB-IGN. Additionally, the convergence of MB-IGN matches IGN if we take .
Similar to the discussion in Section 3.2, we have the following corollary for MB-IGN method.
Corollary 3.
Under settings of Theorem 2, we have
for all . In the case of , we have the -step quadratic convergence
for all .
5 Related Work
Methods | Computation | Memory | Convergence | Jacobian | |
---|---|---|---|---|---|
GN [6, 39] | | | quadratic | Lipschitz | ✗ |
SNR [51] | | | sublinear | Lipschitz | ✗ |
GN-BFGS [29] | | | asymptotic superlinear | Hölder | ✗ |
BGB [32] | | | | Lipschitz | ✗ |
BBB [32] | | | | Lipschitz | ✗ |
EKF [8] | | | sublinear | Lipschitz | ✓ |
EKF-S [36, 23] | | | linear | Lipschitz | ✓ |
IGN (this work) | | | | Hölder | ✓ |
MB-IGN (this work) | | | | Hölder | ✓ |
- The SNR method requires the star convexity in their minimization formulation. The notation represents the sketch size.
- The GN-BFGS method requires and the Jacobian is symmetric.
- The BGB and BBB methods require . The notation is the rank of the modification matrix and is the condition number.
We compare the theoretical results of proposed IGN and MB-IGN with existing methods in Table 1.
The methods including Gauss–Newton-based BFGS (GN-BFGS) [29], Block Good Broyden’s method (BGB) [32], Block Bad Broyden’s method (BBB) [32] and Sketched Newton–Raphson (SNR) [51] only focus on establishing the Jacobian estimator, while each of their iterations requires accessing all components of the nonlinear vector function, which is expensive for large-scale problems. In addition, the quasi-Newton methods including GN-BFGS [29], BGB [32] and BBB [32] only work for the scenario of . The SNR method enjoys an efficient update for large , while it lacks the local superlinear convergence of classical Newton-type methods.
The Extended Kalman Filter with Stepsize (EKF-S) [36, 23] is based on the incremental update that only accesses one (or mini-batch) of components and the corresponding gradient at each iteration. Concretely, the EKF-S method performs the iteration
with some stepsize , where is the estimator for the Gram matrix which is constructed by the recursion
(19) |
for some . The original Extended Kalman Filter method (EKF) [8] takes a fixed stepsize of in the above iteration and achieves a sublinear convergence rate. Later, Gürbüzbalaban et al. [23] showed that introducing the adaptive stepsize can achieve the linear convergence rate. Note that EKF-S and EKF will not explicitly reuse the information of vector in later iterations. In other words, the recursion (19) indicates all information of the historical gradient is heuristically compressed into the term of . In contrast, the proposed IGN method establishes the Gram matrix approximation by equations (10) and (12), which clearly corresponds to the linear approximation (6)-(7) by reusing all of the historical gradients . This strategy encourages a more accurate Gram matrix estimation in our method and leads to a superlinear convergence rate.
The incremental Newton-type methods have also been studied in finite-sum strongly convex optimization [43, 44, 35, 28, 33]. In the view of our formulation (1), these works consider solving the system of nonlinear equations of the form , where is the gradient of some objective function and has the finite-sum structure with symmetric positive-definite Jacobian. These methods can achieve superlinear convergence rates by accessing one of and its Jacobian at each iteration. However, their iterations have to maintain Jacobians for all of the individuals with a memory cost of , which is prohibitive for a large .
6 Experiments
We conduct numerical experiments on the following applications:
• Regularized Logistic Regression: We consider training the binary classifier by solving the nonconvex regularized logistic regression problem [2, 27]
where is the training set such that and for all . We set and for the model. We cast the above minimization problem in the form of the nonlinear equations (1) with . We perform the experiments on the dataset “DBWorld” ( and ) [19] for this problem.
• Chandrasekhar’s H-equation
• Soft Maximum Minimization: We consider the soft maximum minimization problem [37, 12]
(20)
which can be formulated by problem (1) with . We follow the setting of [17, 18] by generating the entries of and randomly and independently from the uniform distribution on . We set , , and in our experiments.
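For the soft maximum problem, a hedged sketch of the objective and its gradient is given below. Equation (20) is not recoverable from the text, so the sketch assumes the standard log-sum-exp form $\mu \log \sum_i \exp((a_i^\top x - b_i)/\mu)$ from the smoothing literature [37], and it assumes that the nonlinear equations are obtained by setting the gradient to zero. The names `softmax_objective`, `softmax_gradient`, `A`, `b` and `mu` are illustrative.

```python
import numpy as np

def softmax_objective(x, A, b, mu):
    """Assumed objective: mu * log(sum_i exp((a_i^T x - b_i) / mu))."""
    s = (A @ x - b) / mu
    m = s.max()                           # shift for numerical stability
    return mu * (m + np.log(np.exp(s - m).sum()))

def softmax_gradient(x, A, b, mu):
    """Gradient of the assumed objective: A^T w with softmax weights w."""
    s = (A @ x - b) / mu
    w = np.exp(s - s.max())
    w /= w.sum()
    return A.T @ w
```

Under these assumptions, a root of `softmax_gradient` is a stationary point of the smoothed maximum, which is how a minimization problem of this kind can be viewed as a system of nonlinear equations.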
We first investigate the impact of the mini-batch size of the MB-IGN method (Algorithm 2) on performance. We run MB-IGN with different mini-batch sizes on the three problems and present the empirical results for time (s) against in Figure 1, where the setting corresponds to our IGN method (Algorithm 1). We can observe that the mini-batch update is effective in reducing the time cost. The mini-batch sizes of , , and achieve the best performance on the problems of regularized logistic regression, Chandrasekhar’s H-equation, and soft maximum minimization, respectively.
We then compare the proposed method MB-IGN (Algorithm 2) with the baseline methods SNR [51], EKF-S [8, 36], BGB [32] and BBB [32]. We present the empirical results for the number of epochs against in Figure 2, where one epoch means one complete pass over all components of the nonlinear vector function. We can observe that the proposed MB-IGN and the baseline method BGB outperform the others on all problems. This is reasonable since only these two methods enjoy explicit condition-number-free superlinear convergence rates (see Table 1). The superlinear convergence rate of the BBB method depends on the condition number, so its performance is not always better than that of the linearly convergent method EKF-S.
We also present the empirical results for the time cost (seconds) against in Figure 3. We can observe that the proposed MB-IGN always performs significantly better than all baseline methods. This is in line with our expectations because only our MB-IGN method enjoys both the superlinear convergence rate and the cheap iteration cost. Although the BGB method has a comparable number of epochs to our MB-IGN on the problem of solving Chandrasekhar’s H-equation, accessing all components in every iteration makes its time cost expensive.
7 Conclusion
In this work, we propose the incremental Gauss–Newton method (IGN) for solving the system of nonlinear equations. We design the algorithm by tracking the historical gradients of all components to establish the estimator of the Gram matrix (and its inverse). The theoretical analysis shows IGN enjoys an explicit superlinear convergence rate under the assumption of a Hölder continuous Jacobian. We also provide a mini-batch extension of our IGN method (MB-IGN) and show it has an even faster superlinear convergence rate. The numerical experiments on the applications of regularized logistic regression, Chandrasekhar’s H-equation, and soft maximum minimization validate the advantage of the proposed methods over existing baselines.
In the future, it will be interesting to study the incremental Gauss–Newton method to solve nonlinear equations in the distributed setting. It is also possible to design incremental quasi-Newton methods for solving the general nonlinear equations.
References
- Al-Baali et al. [2014] Mehiddin Al-Baali, Emilio Spedicato, and Francesca Maggioni. Broyden’s quasi-Newton methods for a nonlinear system of equations and unconstrained optimization: a review and open problems. Optimization Methods and Software, 29(5):937–954, 2014.
- Antoniadis et al. [2011] Anestis Antoniadis, Irène Gijbels, and Mila Nikolova. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63:585–615, 2011.
- Athans et al. [1968] Michael Athans, Richard Wishner, and Anthony Bertolini. Suboptimal state estimation for continuous-time nonlinear systems from discrete noisy measurements. IEEE Transactions on Automatic Control, 13(5):504–514, 1968.
- Bai et al. [2019] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 2019.
- Bell [1994] Bradley M. Bell. The iterated Kalman smoother as a Gauss–Newton method. SIAM Journal on Optimization, 4(3):626–636, 1994.
- Ben-Israel [1966] Adi Ben-Israel. A Newton–Raphson method for the solution of systems of equations. Journal of Mathematical analysis and applications, 15(2):243–252, 1966.
- Berthier et al. [2021] Eloïse Berthier, Justin Carpentier, and Francis Bach. Fast and robust stability region estimation for nonlinear dynamical systems. In European Control Conference, 2021.
- Bertsekas [1996] Dimitri P. Bertsekas. Incremental least squares methods and the extended Kalman filter. SIAM Journal on Optimization, 6(3):807–822, 1996.
- Bertsekas [1997] Dimitri P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
- Botev et al. [2017] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss–Newton optimisation for deep learning. In International Conference on Machine Learning, 2017.
- Broyden [1965] Charles G Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics of computation, 19(92):577–593, 1965.
- Bullins [2020] Brian Bullins. Highly smooth minimization of non-smooth problems. In Conference on Learning Theory, 2020.
- Chandrasekhar [1960] Subrahmanyan Chandrasekhar. Radiative transfer. Courier Corporation, 1960.
- Davis [1998] Timothy A. Davis. Block matrix methods: Taking advantage of high-performance computers. Technical report, Computer and Information Sciences Department, 1998.
- Défossez and Bach [2015] Alexandre Défossez and Francis Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In International Conference on Artificial Intelligence and Statistics, 2015.
- Dennis Jr and Schnabel [1996] John E. Dennis Jr and Robert B. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations. SIAM, 1996.
- Doikov et al. [2023] Nikita Doikov, El Mahdi Chayti, and Martin Jaggi. Second-order optimization with lazy Hessians. In International Conference on Machine Learning, 2023.
- Doikov et al. [2024] Nikita Doikov, Konstantin Mishchenko, and Yurii Nesterov. Super-universal regularized Newton method. SIAM Journal on Optimization, 34(1):27–56, 2024.
- Filannino [2011] Michele Filannino. DBWorld e-mails. UCI Machine Learning Repository, 2011. DOI: https://doi.org/10.24432/C5589M.
- Frehse and Bensoussan [1984] J. Frehse and A. Bensoussan. Nonlinear elliptic systems in stochastic game theory. 1984.
- Grapiglia and Nesterov [2017] Geovani N. Grapiglia and Yurii Nesterov. Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization, 27(1):478–506, 2017.
- Grapiglia and Nesterov [2019] Geovani N. Grapiglia and Yurii Nesterov. Accelerated regularized Newton methods for minimizing composite convex functions. SIAM Journal on Optimization, 29(1):77–99, 2019.
- Gürbüzbalaban et al. [2015] Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.
- Hottel and Sarofim [1967] Hoyt C. Hottel and Adel F. Sarofim. Radiative transfer. 1967.
- Kelley [1995] Carl T. Kelley. Iterative methods for linear and nonlinear equations. SIAM, 1995.
- Kelley [2003] Carl T. Kelley. Solving nonlinear equations with Newton’s method. SIAM, 2003.
- Kohler and Lucchi [2017] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, 2017.
- Lahoti et al. [2023] Aakash Lahoti, Spandan Senapati, Ketan Rajawat, and Alec Koppel. Sharpened lazy incremental quasi-Newton method. arXiv preprint arXiv:2305.17283, 2023.
- Li and Fukushima [1999] Donghui Li and Masao Fukushima. A globally and superlinearly convergent Gauss–Newton-based BFGS method for symmetric nonlinear equations. SIAM Journal on numerical Analysis, 37(1):152–172, 1999.
- Lin et al. [2021] Dachao Lin, Haishan Ye, and Zhihua Zhang. Explicit superlinear convergence rates of Broyden’s methods in nonlinear equations. arXiv preprint arXiv:2109.01974, 2021.
- Liu and Luo [2022] Chengchang Liu and Luo Luo. Quasi-Newton methods for saddle point problems. Advances in Neural Information Processing Systems, 2022.
- Liu et al. [2023] Chengchang Liu, Cheng Chen, Luo Luo, and John Lui. Block Broyden’s methods for solving nonlinear equations. Advances in Neural Information Processing Systems, 2023.
- Liu et al. [2024] Zhuanghua Liu, Luo Luo, and Bryan Kian Hsiang Low. Incremental quasi-Newton methods with faster superlinear convergence rates. In AAAI Conference on Artificial Intelligence, 2024.
- Ljung [1979] Lennart Ljung. Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems. IEEE Transactions on Automatic Control, 24(1):36–50, 1979.
- Mokhtari et al. [2018] Aryan Mokhtari, Mark Eisen, and Alejandro Ribeiro. IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM Journal on Optimization, 28(2):1670–1698, 2018.
- Moriyama et al. [2003] Hiroyuki Moriyama, Nobuo Yamashita, and Masao Fukushima. The incremental Gauss–Newton algorithm with adaptive stepsize rule. Computational Optimization and Applications, 26:107–141, 2003.
- Nesterov [2005] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103:127–152, 2005.
- Nesterov and Polyak [2006] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical programming, 108(1):177–205, 2006.
- Nocedal and Wright [1999] Jorge Nocedal and Stephen J. Wright. Numerical optimization. Springer, 1999.
- Nourian and Caines [2013] Mojtaba Nourian and Peter E. Caines. ε-Nash mean field game theory for nonlinear stochastic dynamical systems with major and minor agents. SIAM Journal on Control and Optimization, 51(4):3302–3331, 2013.
- Petersen and Pedersen [2008] Kaare Brandt Petersen and Michael Syskind Pedersen. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.
- Pilanci and Wainwright [2017] Mert Pilanci and Martin J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
- Rodomanov and Kropotov [2015] Anton Rodomanov and Dmitry Kropotov. A Newton-type incremental method with a superlinear convergence rate. In Optimization for Machine Learning, 2015.
- Rodomanov and Kropotov [2016] Anton Rodomanov and Dmitry Kropotov. A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In International Conference on Machine Learning, 2016.
- Trémolet [2007] Yannick Trémolet. Model-error estimation in 4D-var. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography, 133(626):1267–1280, 2007.
- Wang [2012] Yong Wang. Gauss–Newton method. Wiley Interdisciplinary Reviews: Computational Statistics, 4(4):415–420, 2012.
- Woodbury [1950] Max A. Woodbury. Inverting modified matrices. Department of Statistics, Princeton University, 1950.
- Woodruff [2014] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
- Ye et al. [2021a] Haishan Ye, Dachao Lin, and Zhihua Zhang. Greedy and random Broyden’s methods with explicit superlinear convergence rates in nonlinear equations. arXiv preprint arXiv:2110.08572, 2021a.
- Ye et al. [2021b] Haishan Ye, Luo Luo, and Zhihua Zhang. Approximate Newton methods. Journal of Machine Learning Research, 22(66):1–41, 2021b.
- Yuan et al. [2022] Rui Yuan, Alessandro Lazaric, and Robert M. Gower. Sketched Newton–Raphson. SIAM Journal on Optimization, 32(3):1555–1583, 2022.
Appendix
The appendix is organized as follows. In Section A, we provide the detailed procedure of the Mini-Batch Incremental Gauss–Newton Method (MB-IGN). In Section B, we provide some results for Jacobians. In Section C, we introduce an auxiliary sequence and analyze its properties. In Sections D and E, we provide the convergence analysis for the proposed IGN and MB-IGN, respectively.
Appendix A The Mini-Batch Incremental Gauss–Newton Method
We provide the detailed procedure of Mini-Batch Incremental Gauss–Newton Method (MB-IGN) in Algorithm 2.
Appendix B Some Basic Results for Jacobians
This section presents some useful results for our later analysis.
Lemma 3.
Proof.
We denote
and let be the -th standard basis vector in -dimensional Euclidean space. Then the facts and imply we have
where the last step is based on the Hölder continuity of . ∎
Lemma 4.
Proof.
Lemma 5.
Proof.
For all , we have
Taking the spectral norm on both sides, we have
where the inequality comes from Assumption 1.
Therefore, for all it holds
Let be the -th standard basis vector in -dimensional Euclidean space, then we have
for all . ∎
Appendix C The Auxiliary Sequence and Its Properties
We construct the following sequence for our convergence analysis in later sections.
Definition 1.
We define the following sequence for given and :
(23) |
We then provide several useful properties for the sequence in Definition 1.
Lemma 6.
The sequence satisfies
for all .
Proof.
Part I: We first use induction to prove for all . For the induction base, we can verify that . For the induction step, we assume
holds for all such that . Then we have
where the first inequality is based on the induction hypothesis and the last inequality is based on the setting . This finishes the induction.
Part II: We then use induction to prove for all . For the induction base, we can verify that
where the first inequality is based on for all (which has been shown in Part I), and the last inequality is based on the setting . For the induction step, we assume
holds for all such that . Then we have
where the first inequality is based on the induction hypothesis and the last inequality is based on the setting . This finishes the induction.
Combining the results of the above two parts, we finish the proof of this lemma. ∎
Lemma 7.
The sequence satisfies
for all .
Proof.
Part I: For , the fact means
Part II: For all , we have
where the last inequality is based on Lemma 6. This indicates for .
Part III: For all , we use induction to prove . For the induction base, we can verify that
where the last inequality is based on Lemma 6.
For the induction step, we assume
holds for all such that . Then we have
where the inequality is based on the induction hypothesis and the fact for all (which has been shown in Part I).
Combining the results of the above three parts, we finish the proof of this lemma. ∎
Lemma 8.
For the sequence , we have
for all .
Proof.
For all , the definition of implies
Additionally, Lemma 7 implies for all , we have
Combining above results, we achieve
for all . ∎
Lemma 9.
For the sequence , we have
for all , where
Proof.
Lemma 10.
For the sequence , if there exists and such that
(25) |
for all , then we have
for all .
Proof.
Appendix D The Convergence Analysis for IGN
In this section, we provide the proofs for the results in Section 3.
D.1 The Proof of Proposition 1
Proof.
We denote the singular value decomposition of as
where are (column) orthogonal matrices and is a diagonal matrix whose smallest diagonal entry is . Therefore, we have
which means the smallest singular value of is equal to the smallest value of , which is . Therefore, we have
∎
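The argument above, that the singular values of a Gram matrix are the squares of the singular values of the underlying matrix, can be checked numerically. The snippet is a small illustrative sketch; the function name `gram_singular_values` and the random test matrix are assumptions, not part of the paper.

```python
import numpy as np

def gram_singular_values(J):
    """Return the singular values of J and of the Gram matrix J^T J."""
    sv_J = np.linalg.svd(J, compute_uv=False)           # sorted in descending order
    sv_gram = np.linalg.svd(J.T @ J, compute_uv=False)  # sorted in descending order
    return sv_J, sv_gram

# Example with a random tall matrix standing in for the Jacobian.
rng = np.random.default_rng(0)
J = rng.standard_normal((8, 3))
sv_J, sv_gram = gram_singular_values(J)
print(np.allclose(sv_gram, sv_J ** 2))
```

In particular, the smallest singular value of the Gram matrix equals the square of the smallest singular value of the matrix itself, which is the quantity the proposition bounds.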
D.2 Proof of Lemma 1
D.3 Proof of Lemma 2
D.4 The Proof of Theorem 1
We first show the update
in IGN method (Line 9 of Algorithm 1) is well-defined if the matrices and are non-singular.
Lemma 12.
Following the setting of Theorem 1, if the matrices and are non-singular, then the matrix is also non-singular, where
Proof.
The recursion of and the definition of and imply
Since we assume matrices and are non-singular, applying the matrix determinant lemma [41, Section 9.1.2] on above equation leads to
Then the definition implies
which finishes the proof. ∎
Then we show that the non-singularity assumption on yields an upper bound on the distance .
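The matrix determinant lemma invoked in this proof can be checked numerically in its standard rank-one form, det(A + u vᵀ) = (1 + vᵀ A⁻¹ u) det(A). The sketch below is only an illustration of that identity: the names `A`, `u`, `v`, the dimension, and the random seed are assumptions, and how the update in the proof maps onto them is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Illustrative assumption: a symmetric positive definite (hence non-singular) A.
B = rng.standard_normal((d, d))
A = B @ B.T + np.eye(d)
u = rng.standard_normal(d)
v = rng.standard_normal(d)

# Left-hand side: determinant of the rank-one update of A.
lhs = np.linalg.det(A + np.outer(u, v))

# Right-hand side: scalar factor (1 + v^T A^{-1} u) times det(A).
factor = 1.0 + v @ np.linalg.solve(A, u)
rhs = factor * np.linalg.det(A)

print(lhs, rhs)
```

Because det(A) is nonzero, the updated matrix is non-singular exactly when the scalar `factor` is nonzero, which is the kind of criterion a determinant-based non-singularity argument relies on.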
Lemma 13.
Proof.
Subtracting the term on both sides of equation (8), we have
We split the results of Theorem 1 into two parts (i.e., Theorems 3 and 4) and provide their proofs as follows. Our analysis is based on the properties of the auxiliary sequence constructed in Section C.
Theorem 3.
Proof.
We first show
(26) |
holds for all . We split the proof of (26) into the following three parts.
Part I: For , the initialization and the fact lead to
Part II: For all , we use induction to prove the results of
(27) |
For the induction base, we can apply Lemma 2 to verify
This implies
(28) |
According to Lemma 13, we have
where the first inequality is based on equation (28) and the second inequality is based on the initial condition. Therefore, the induction base holds.
For the induction step, we assume
holds for all such that . Therefore, the update (9) implies
(29) |
The induction hypothesis leads to
for , where the second inequality is based on Lemma 6 and the third comes from the initial condition. Combining this with (29), we achieve
According to Lemma 2, we have
where the second inequality comes from the initial condition. Therefore, we have
According to Lemma 13, we have
where the last equality comes from the fact . Therefore, we finish the induction.
Part III: For all , we use induction to prove
For the induction base, we can verify that it holds (from the result of Part II)
and
Then we have
where the second inequality is based on Lemma 6 and the third inequality is based on the initial condition.
From equation (9), we have
For the induction step, we assume
holds for all such that . Combining results of Part I and II, we have
which implies
where the second inequality is based on Lemma 6 and the last inequality is based on the initial condition.
Combining with Lemma 2, we have
Therefore, we achieve
According to Lemma 13, we have
Hence, we finish the induction.
Combining the results of Parts I, II and III completes the proof of (26).
Theorem 4.
Proof.
D.5 Proof of Corollary 1
Proof.
According to Theorem 1, we have
for all . Noting that the value of is monotonically decreasing with respect to , we have
which implies
for all .
∎
D.6 Proof of Corollary 2
Appendix E The Convergence Analysis for MB-IGN
In this section, we analyze the convergence of MB-IGN (Algorithm 2). Most of the proofs in this section follow the analysis in Section D, and we provide the details for completeness.
E.1 The Additional Lemma for Gram Matrix
We bound the spectrum of the matrix for the MB-IGN method as follows.
Lemma 14.
E.2 Proof of Theorem 2
Similarly, we show that the update
in the MB-IGN method (Line 10 of Algorithm 2) is well-defined if the matrices and are non-singular.
Lemma 15.
Following the setting of Theorem 2, if the matrices and are non-singular, then the matrix is also non-singular, where
Proof.
The recursion of and the definition of and imply
Since we assume the matrices and are non-singular, applying the matrix determinant lemma [41, Section 9.1.2] to the above equation leads to
Then the definition implies
which finishes the proof. ∎
Then we show that the non-singularity assumption on yields an upper bound on the distance .
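When several components are updated at once, a rank-k correction is the natural analogue of a rank-one update, and the generalized matrix determinant lemma reads det(A + U Vᵀ) = det(I_k + Vᵀ A⁻¹ U) det(A). The sketch below checks this identity numerically. Reading the mini-batch update as a rank-k correction is an assumption made for illustration, as are the names `A`, `U`, `V`, the sizes `d` and `k`, and the random seed.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 5, 2  # illustrative dimension and (hypothetical) batch-induced rank

# Illustrative assumption: a symmetric positive definite (hence non-singular) A.
B = rng.standard_normal((d, d))
A = B @ B.T + np.eye(d)
U = rng.standard_normal((d, k))
V = rng.standard_normal((d, k))

# Left-hand side: determinant of the rank-k update of A.
lhs = np.linalg.det(A + U @ V.T)

# Right-hand side: determinant of the small k-by-k matrix I_k + V^T A^{-1} U,
# multiplied by det(A).
small = np.eye(k) + V.T @ np.linalg.solve(A, U)
rhs = np.linalg.det(small) * np.linalg.det(A)

print(lhs, rhs)
```

The practical appeal of this form is that non-singularity of the d-by-d updated matrix reduces to non-singularity of a k-by-k matrix.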
Lemma 16.
Proof.
Subtracting the term on both sides of equation (8), we have
We split the results of Theorem 2 into two parts (i.e., Theorems 5 and 6) and provide their proofs as follows. Our analysis is based on the properties of the auxiliary sequence constructed in Section C.
Theorem 5.
Proof.
We first show
(31) |
holds for all . We split the proof of (31) into the following three parts.
Part I: For , the initialization and the fact lead to
Part II: For all , we use induction to prove the results of
(32) |
For the induction base, we can apply Lemma 14 to verify
This implies
(33) |
According to Lemma 16, we have
where the first inequality is based on equation (33) and the second inequality is based on the initial condition. Therefore, the induction base holds.
For the induction step, we assume
holds for all such that . Therefore, the update (9) implies
(34) |
The induction hypothesis leads to
for , where the second inequality is based on Lemma 6 and the third comes from the initial condition. Combining this with (34), we achieve
According to Lemma 14, we have
where the second inequality comes from the initial condition. Therefore, we have
According to Lemma 16, we have
where the last equality comes from the fact . Therefore, we finish the induction.
Part III: For all , we use induction to prove
For the induction base, we can verify that it holds (from the result of Part II)
and
Then we have
where the second inequality is based on Lemma 6 and the third inequality is based on the initial condition.
From equation (9), we have
For the induction step, we assume
holds for all such that . Combining results of Part I and II, we have
which implies
where the second inequality is based on Lemma 6 and the last inequality is based on the initial condition.
Combining with Lemma 14, we have
Therefore, we achieve
According to Lemma 16, we have
Hence, we finish the induction.
Combining the results of Parts I, II and III completes the proof of (31).