Incremental Gauss–Newton Methods with
Superlinear Convergence Rates

Zhiling Zhou, Zhuanghua Liu, Chengchang Liu, Luo Luo

Affiliations: School of Data Science, Fudan University; […] of Computer Science, National University of Singapore; […] of Computer Science and Engineering, The Chinese University of Hong Kong; […] of Data Science, Fudan University

Abstract

This paper addresses the challenge of solving large-scale nonlinear equations with Hölder continuous Jacobians. We introduce a novel Incremental Gauss–Newton (IGN) method with an explicit superlinear convergence rate, which outperforms existing methods that only achieve a linear convergence rate. In particular, we formulate our problem as a nonlinear least squares problem with finite-sum structure, and our method iterates incrementally using the information of one component in each round. We also provide a mini-batch extension of our IGN method that obtains an even faster superlinear convergence rate. Furthermore, we conduct numerical experiments to show the advantages of the proposed methods.

1 Introduction

We study the problem of solving the system of nonlinear equations

\displaystyle{\bf{f}}({\bf{x}})={\bf{0}},

(1)

where the nonlinear vector function ${\bf{f}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{n}$ is Lipschitz continuous and its Jacobian is Hölder continuous. This formulation is a fundamental problem in scientific computing [38], and it appears in a large number of applications including machine learning [15, 10, 4], control systems [7], data assimilation [45] and game theory [20, 40].

The Newton-type methods [16, 25, 26, 6, 39, 46] are widely used for solving nonlinear equations. The classical Newton’s method uses the curvature information in Jacobians to obtain a local quadratic convergence rate [39], but it suffers from the expensive computational cost of accessing the Jacobian and its (pseudo) inverse. Several lines of research focus on approximating Newton’s method with inexact Jacobians. For example, the quasi-Newton methods [11, 29, 1] estimate the Jacobians via secant equations, leading to iteration schemes that only need function values and Jacobian-vector calls. The explicit local superlinear convergence rates of these methods have been established in recent years [30, 31, 32, 49]. Another line of work [51, 50, 42] introduces matrix sketching techniques [48] to reduce the dimension of the Jacobian matrix, which improves the computational efficiency per iteration. The superiority of their local convergence depends on the structure of the Jacobian in the specific problem. Although quasi-Newton and sketched Newton methods can benefit from inexact Jacobians, they still require access to the full nonlinear vector function value at every iteration.

For large-scale nonlinear equations, we are interested in methods that do not require the computation of full function values and Jacobians. In particular, Bertsekas [8] proposed a variant of the Gauss–Newton (GN) method by following the Extended Kalman Filter (EKF) framework [3, 34, 5], which incrementally accesses partial information of the vector function values and the corresponding Jacobian during the iterations. Subsequently, Moriyama et al. [36] incorporated a stepsize into the EKF method, which guarantees a global linear convergence rate under the gradient-growth condition [23]. In the past decade, incremental (quasi) Newton methods with local superlinear convergence rates have been established for strongly convex optimization [43, 44, 35, 28, 33]. (From the viewpoint of solving nonlinear equations, the methods designed for convex optimization [43, 44, 35, 28, 33] require the additional assumption that the Jacobian is symmetric positive-definite.) However, the superiority of local convergence for incremental Newton-type methods in solving general nonlinear equations is still unclear.

In this work, we propose an incremental Gauss–Newton (IGN) method for solving systems of nonlinear equations. Our method only requires access to one component of the nonlinear vector function and its gradient per iteration. We maintain an aggregated vector and an aggregated matrix that estimate the vector function value and its Jacobian through incremental updates. We also introduce a Gram matrix with a low-rank update to reduce the computational cost of the matrix inversion in vanilla Gauss–Newton methods. The theoretical analysis shows that our IGN method enjoys an explicit local superlinear convergence rate for nonlinear equations with Hölder continuous Jacobians. Furthermore, we provide a variant of IGN that uses the information of a mini-batch of components, which achieves an even faster superlinear convergence rate. The numerical experiments on real-world applications validate the advantages of the proposed methods.

Paper Organization

In Section 2, we formalize the notations and assumptions for our problem. In Section 3, we propose our incremental Gauss–Newton (IGN) method and present its convergence analysis. In Section 4, we extend the IGN method with the mini-batch update to obtain an even faster convergence rate. In Section 5, we provide a discussion to compare the proposed method with related works. In Section 6, we conduct numerical experiments to show the advantages of our methods. We conclude our work in Section 7.

2 Preliminaries

In this section, we formalize the notations and assumptions throughout this paper.

2.1 Notations

We let $[n]=\{1,\dots,n\}$ and use the notation $t\%n$ to denote the remainder of $t$ divided by $n$. We denote ${\bf{e}}_{i}\in{\mathbb{R}}^{n}$ as the $i$-th standard basis vector of the $n$-dimensional Euclidean space for all $i\in[n]$. We use $\left\|\cdot\right\|$ to represent the Euclidean norm for a given vector and the spectral norm for a given matrix. Moreover, we use the notation $\sigma_{\min}(\cdot)$ to represent the smallest singular value of a given matrix.

For the system of nonlinear equations (1), we partition the vector function ${\bf{f}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{n}$ at ${\bf{x}}\in{\mathbb{R}}^{d}$ as ${\bf{f}}({\bf{x}})=[f_{1}({\bf{x}}),\dots,f_{n}({\bf{x}})]^{\top}\in{\mathbb{R}}^{n}$, where $f_{i}:{\mathbb{R}}^{d}\to{\mathbb{R}}$. We also denote the gradient of component $f_{i}(\cdot)$ at ${\bf{x}}\in{\mathbb{R}}^{d}$ as ${\bf{g}}_{i}({\bf{x}})=\nabla f_{i}({\bf{x}})$, and we organize the corresponding Jacobian as ${\bf J}({\bf{x}})=[{\bf{g}}_{1}({\bf{x}}),\cdots,{\bf{g}}_{n}({\bf{x}})]^{\top}\in{\mathbb{R}}^{n\times d}$.

2.2 Assumptions

Throughout this paper, we suppose the function ${\bf{f}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{n}$ satisfies the following assumptions.

Assumption 1.

We suppose the vector function ${\bf{f}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{n}$ is Lipschitz continuous, i.e., there exists constant $L_{f}>0$ such that

\displaystyle\left\|{\bf{f}}({\bf{x}})-{\bf{f}}({\bf{y}})\right\|\leq L_{f}\left\|{\bf{x}}-{\bf{y}}\right\|

(2)

for all ${\bf{x}},{\bf{y}}\in{\mathbb{R}}^{d}$ .

Assumption 2.

We suppose the Jacobian ${\bf J}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{n\times d}$ is $\nu$ -Hölder continuous for some $\nu\in(0,1]$ , i.e., there exists constant ${\mathcal{H}}_{\nu}>0$ such that

\displaystyle\left\|{\bf J}({\bf{x}})-{\bf J}({\bf{y}})\right\|\leq{\mathcal{H}}_{\nu}\left\|{\bf{x}}-{\bf{y}}\right\|^{\nu}

(3)

for all ${\bf{x}},{\bf{y}}\in{\mathbb{R}}^{d}$ .

Assumption 3.

The system of nonlinear equations (1) satisfies $n\geq d$ and has a non-degenerate solution ${\bf{x}}^{*}\in{\mathbb{R}}^{d}$, i.e., there exists some $\mu>0$ such that

\displaystyle\mu=\sigma_{\min}({\bf J}({\bf{x}}^{*}))>0.

(4)

Note that most existing works [44, 51, 32, 9, 8, 36, 23] assume a Lipschitz continuous Jacobian, which is the special case of our Assumption 2 with $\nu=1$.

3 The Incremental Gauss–Newton Method

Algorithm 1 Incremental Gauss–Newton Method (IGN)

1: Input: ${\bf{x}}^{0}\in{\mathbb{R}}^{d}$, ${\bf{u}}^{0}\in{\mathbb{R}}^{d}$, ${\bf H}^{0},{\bf G}^{0}\in{\mathbb{R}}^{d\times d}$
2: for $t=0,1,\dots$ do
3:   ${\bf{x}}^{t+1}={\bf G}^{t}{\bf{u}}^{t}$
4:   $i_{t}=t\%n+1$
5:   ${\bf U}^{t}=\big{[}-{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{i_{t}}({\bf{x}}^{t+1})\big{]}$
6:   ${\bf V}^{t}=\big{[}{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{i_{t}}({\bf{x}}^{t+1})\big{]}$
7:   ${\bf{u}}^{t+1}={\bf{u}}^{t}-\left({\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})^{\top}{\bf{z}}_{i_{t}}^{t}-f_{i_{t}}({\bf{z}}_{i_{t}}^{t})\right){\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})+\left({\bf{g}}_{i_{t}}({\bf{x}}^{t+1})^{\top}{\bf{x}}^{t+1}-f_{i_{t}}({\bf{x}}^{t+1})\right){\bf{g}}_{i_{t}}({\bf{x}}^{t+1})$
8:   ${\bf H}^{t+1}={\bf H}^{t}-{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}){\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})^{\top}+{\bf{g}}_{i_{t}}({\bf{x}}^{t+1}){\bf{g}}_{i_{t}}({\bf{x}}^{t+1})^{\top}$
9:   ${\bf G}^{t+1}={\bf G}^{t}-{\bf G}^{t}{\bf U}^{t}({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t})^{-1}({\bf V}^{t})^{\top}{\bf G}^{t}$
10:   ${\bf{z}}_{i}^{t+1}=\begin{cases}{\bf{x}}^{t+1},&\text{if~{}}i=i_{t}\\ {\bf{z}}_{i}^{t},&\text{otherwise}\end{cases}$
11: end for

In this section, we propose the Incremental Gauss-Newton (IGN) method and provide its explicit superlinear convergence rate.

3.1 The Algorithm

We first introduce the intuition of our algorithm design. Solving the system of nonlinear equations (1) can be regarded as minimizing the norm of the nonlinear vector function ${\bf{f}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{n}$ , which means we can reformulate the problem as the following nonlinear least squares minimization problem

\displaystyle\min_{{\bf{x}}\in{\mathbb{R}}^{d}}\phi({\bf{x}})\triangleq\frac{1}{2}\sum_{i=1}^{n}(f_{i}({\bf{x}}))^{2}.

(5)

For each component $f_{i}:{\mathbb{R}}^{d}\to{\mathbb{R}}$ , we consider its linear approximation

\displaystyle f_{i}({\bf{x}})\approx f_{i}({\bf{z}}_{i}^{t})+{\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{x}}-{\bf{z}}_{i}^{t}),

(6)

where ${\bf{z}}_{i}^{t}\in{\mathbb{R}}^{d}$ is some point related to component $f_{i}$ at the $t$ -th iteration. The estimation (6) motivates us to construct the surrogate problem for the nonlinear least squares (5) as follows

\displaystyle\min_{{\bf{x}}\in{\mathbb{R}}^{d}}\psi({\bf{x}})\triangleq\sum_{i=1}^{n}\psi_{i}({\bf{x}}),\qquad\text{where}~{}\psi_{i}({\bf{x}})\triangleq\frac{1}{2}\left\|f_{i}({\bf{z}}_{i}^{t})+{\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{x}}-{\bf{z}}_{i}^{t})\right\|^{2}.

(7)

Since each $\psi_{i}$ is a convex quadratic function, problem (7) has the closed-form solution

\displaystyle{\bf{x}}^{t+1}=\left(\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}\right)^{-1}\sum_{i=1}^{n}\left({\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}{\bf{z}}_{i}^{t}-f_{i}({\bf{z}}_{i}^{t})\right){\bf{g}}_{i}({\bf{z}}_{i}^{t}).

(8)

We assume the matrix $\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}$ is always non-singular in this subsection, which will be verified under our assumptions in later analysis.
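For completeness, the closed form (8) can be recovered from the first-order optimality condition of the surrogate problem (7). The short derivation below uses only the symbols already defined in (6) and (7) and relies on the non-singularity assumption just stated.

```latex
\begin{align*}
\nabla\psi({\bf{x}})
&=\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t})\left(f_{i}({\bf{z}}_{i}^{t})+{\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{x}}-{\bf{z}}_{i}^{t})\right)={\bf{0}}\\
\Longleftrightarrow\quad
\left(\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}\right){\bf{x}}
&=\sum_{i=1}^{n}\left({\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}{\bf{z}}_{i}^{t}-f_{i}({\bf{z}}_{i}^{t})\right){\bf{g}}_{i}({\bf{z}}_{i}^{t}).
\end{align*}
```

Multiplying both sides by the inverse of the matrix on the left gives exactly the expression in (8).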

We propose the Incremental Gauss-Newton (IGN) method, which performs the update (8) at the $t$-th iteration. It is worth noting that we can take advantage of the inherent finite-sum structure in formulation (5) to establish incremental methods. Specifically, we update one of $\{{\bf{z}}^{t}_{i}\}_{i=1}^{n}$ at each iteration in a cyclic fashion, that is

\displaystyle{\bf{z}}_{i}^{t+1}=\begin{cases}{\bf{x}}^{t+1},&\text{if~{}}i=i_{t},\\ {\bf{z}}_{i}^{t},&\text{otherwise},\end{cases}

(9)

where $i_{t}={t\%n}+1$ . This indicates that we only need to address the terms associated with point ${\bf{z}}_{i_{t}}^{t}$ in update (8), which can be implemented by introducing the aggregated variables

\displaystyle{\bf{u}}^{t}=\sum_{i=1}^{n}\left({\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}{\bf{z}}_{i}^{t}-f_{i}({\bf{z}}_{i}^{t})\right){\bf{g}}_{i}({\bf{z}}_{i}^{t}),\qquad{\bf H}^{t}=\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}\qquad\text{and}\qquad{\bf G}^{t}=\left({\bf H}^{t}\right)^{-1}.

(10)

Then we can write update (8) as

\displaystyle{\bf{x}}^{t+1}={\bf G}^{t}{\bf{u}}^{t}

(11)

and maintain the aggregated variables by the following recursions (there is no need to explicitly construct the matrix ${\bf H}^{t}$ in an implementation, but this matrix is useful for understanding and analyzing our method)

\displaystyle\begin{cases}{\bf{u}}^{t+1}={\bf{u}}^{t}-\left({\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})^{\top}{\bf{z}}_{i_{t}}^{t}-f_{i_{t}}({\bf{z}}_{i_{t}}^{t})\right){\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})+\left({\bf{g}}_{i_{t}}({\bf{x}}^{t+1})^{\top}{\bf{x}}^{t+1}-f_{i_{t}}({\bf{x}}^{t+1})\right){\bf{g}}_{i_{t}}({\bf{x}}^{t+1}),\\ {\bf H}^{t+1}={\bf H}^{t}-{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}){\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})^{\top}+{\bf{g}}_{i_{t}}({\bf{x}}^{t+1}){\bf{g}}_{i_{t}}({\bf{x}}^{t+1})^{\top},\\ {\bf G}^{t+1}={\bf G}^{t}-{\bf G}^{t}{\bf U}^{t}({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t})^{-1}({\bf V}^{t})^{\top}{\bf G}^{t},\end{cases}

(12)

where the last one is based on the Sherman–Morrison–Woodbury formula [47] and the definitions

\displaystyle{\bf U}^{t}\triangleq\big{[}-{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{i_{t}}({\bf{x}}^{t+1})\big{]}\in{\mathbb{R}}^{d\times 2}\qquad\text{and}\qquad{\bf V}^{t}\triangleq\big{[}{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{i_{t}}({\bf{x}}^{t+1})\big{]}\in{\mathbb{R}}^{d\times 2}.

(13)
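To see why the Sherman–Morrison–Woodbury formula applies here, it may help to write the update of ${\bf H}^{t}$ as a rank-two correction. The following worked step uses only the quantities in (12) and (13) and assumes that ${\bf H}^{t}$ and ${\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t}$ are invertible.

```latex
\begin{align*}
{\bf U}^{t}({\bf V}^{t})^{\top}
&=-{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}){\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})^{\top}
+{\bf{g}}_{i_{t}}({\bf{x}}^{t+1}){\bf{g}}_{i_{t}}({\bf{x}}^{t+1})^{\top}
={\bf H}^{t+1}-{\bf H}^{t},\\
{\bf G}^{t+1}
&=\left({\bf H}^{t}+{\bf U}^{t}({\bf V}^{t})^{\top}\right)^{-1}
={\bf G}^{t}-{\bf G}^{t}{\bf U}^{t}\left({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t}\right)^{-1}({\bf V}^{t})^{\top}{\bf G}^{t}.
\end{align*}
```

The only matrix that has to be inverted explicitly is therefore the $2\times 2$ matrix ${\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t}$.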

Since each of the matrices ${\bf U}^{t}$ and ${\bf V}^{t}$ contains only two columns, updating the variables ${\bf{x}}^{t+1}$, ${\bf{u}}^{t+1}$ and ${\bf G}^{t+1}$ can be implemented within the complexity of ${\mathcal{O}}(d^{2})$ flops. Additionally, the memory cost for maintaining variables $\{{\bf{z}}_{i}^{t}\}_{i=1}^{n}$, $\{{\bf{g}}_{i}({\bf{z}}_{i}^{t})\}_{i=1}^{n}$, ${\bf{u}}^{t}$ and ${\bf G}^{t}$ is ${\mathcal{O}}(nd+d^{2})$. As a comparison, the vanilla Gauss–Newton (GN) method [6, 39] performs the iteration

\displaystyle{\bf{x}}^{t+1}={\bf{x}}^{t}-\left({\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t})\right)^{-1}{\bf J}({\bf{x}}^{t})^{\top}{\bf{f}}({\bf{x}}^{t}),

(14)

which takes a computation cost of ${\mathcal{O}}(nd^{2}+d^{3})$ flops (forming the Gram matrix ${\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t})$ and inverting it) and a memory cost of ${\mathcal{O}}(nd+d^{2})$.

We summarize the procedure of our IGN in Algorithm 1. Observe that the vanilla GN iteration (14) can be reformulated as

\displaystyle{\bf{x}}^{t+1}=\left({\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t})\right)^{-1}{\bf J}({\bf{x}}^{t})^{\top}({\bf J}({\bf{x}}^{t}){\bf{x}}^{t}-{\bf{f}}({\bf{x}}^{t})).

(15)

Comparing our updates (7)–(11) with (15), the aggregated variables ${\bf{u}}^{t}$, ${\bf H}^{t}$ and ${\bf G}^{t}$ can be regarded as estimators of the terms ${\bf J}({\bf{x}}^{t})^{\top}({\bf J}({\bf{x}}^{t}){\bf{x}}^{t}-{\bf{f}}({\bf{x}}^{t}))$, ${\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t})$ and $({\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t}))^{-1}$ in scheme (15), respectively. The efficiency of our IGN method comes from the strategy of applying a different point ${\bf{z}}_{i}^{t}$ in the linear approximation (6) for each component $f_{i}$. In contrast, the vanilla GN method is based on the linear approximation at the identical point ${\bf{x}}^{t}$ for all components.
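The recursions (10)–(12) can be sketched in a few lines of code. The following is a minimal illustration rather than a reference implementation: the names `ign`, `fs`, `grads`, `dot`, `matvec` and `inverse` are hypothetical, indices are zero-based (so `i = t % n` plays the role of $i_{t}-1$), and we assume the initialization ${\bf{z}}_{i}^{0}={\bf{x}}^{0}$ for all $i$, which makes ${\bf H}^{0}={\bf J}({\bf{x}}^{0})^{\top}{\bf J}({\bf{x}}^{0})$. It uses plain Python lists so that it runs without external libraries; a practical implementation would use a dense linear algebra library. The small test system at the end is our own toy example, not one of the problems studied in this paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting; used only once, for G^0."""
    d = len(A)
    M = [[float(v) for v in row] + [float(i == j) for j in range(d)]
         for i, row in enumerate(A)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(d):
            if r != c:
                factor = M[r][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [row[d:] for row in M]

def ign(fs, grads, x0, num_iters):
    """Sketch of Algorithm 1; assumes z_i^0 = x^0 (an assumption of this sketch)."""
    n, d = len(fs), len(x0)
    z = [list(x0) for _ in range(n)]         # points z_i
    g = [grads[i](x0) for i in range(n)]     # stored gradients g_i(z_i)
    fz = [fs[i](x0) for i in range(n)]       # stored values f_i(z_i)
    u = [0.0] * d
    H = [[0.0] * d for _ in range(d)]
    for i in range(n):                       # aggregated variables (10)
        coef = dot(g[i], z[i]) - fz[i]
        for a in range(d):
            u[a] += coef * g[i][a]
            for b in range(d):
                H[a][b] += g[i][a] * g[i][b]
    G = inverse(H)                           # H itself is not needed afterwards
    x = list(x0)
    for t in range(num_iters):
        x = matvec(G, u)                     # update (11)
        i = t % n                            # zero-based cyclic index
        g_old, g_new, f_new = g[i], grads[i](x), fs[i](x)
        c_old = dot(g_old, z[i]) - fz[i]
        c_new = dot(g_new, x) - f_new
        u = [ua - c_old * go + c_new * gn for ua, go, gn in zip(u, g_old, g_new)]
        # Woodbury update of G with U = [-g_old, g_new] and V = [g_old, g_new].
        Go, Gn = matvec(G, g_old), matvec(G, g_new)
        c11, c12 = 1.0 - dot(g_old, Go), dot(g_old, Gn)   # 2x2 matrix I + V^T G U
        c21, c22 = -dot(g_new, Go), 1.0 + dot(g_new, Gn)
        det = c11 * c22 - c12 * c21
        w1 = [(c22 * p - c12 * q) / det for p, q in zip(Go, Gn)]
        w2 = [(-c21 * p + c11 * q) / det for p, q in zip(Go, Gn)]
        G = [[G[a][b] + Go[a] * w1[b] - Gn[a] * w2[b] for b in range(d)]
             for a in range(d)]
        z[i], g[i], fz[i] = x, g_new, f_new  # update (9)
    return x

# Toy system with n = 3 equations, d = 2 unknowns and solution x* = (1, 1).
toy_fs = [lambda x: x[0] ** 2 + x[1] - 2.0,
          lambda x: x[0] + x[1] ** 2 - 2.0,
          lambda x: x[0] * x[1] - 1.0]
toy_grads = [lambda x: [2.0 * x[0], 1.0],
             lambda x: [1.0, 2.0 * x[1]],
             lambda x: [x[1], x[0]]]
x_hat = ign(toy_fs, toy_grads, [1.02, 0.98], 30)
```

Each pass through the loop touches a single component $f_{i_{t}}$ and only inverts a $2\times 2$ matrix, which is where the ${\mathcal{O}}(d^{2})$ per-iteration cost comes from. The sketch relies on ${\bf G}^{t}$ being symmetric, so that $({\bf V}^{t})^{\top}{\bf G}^{t}$ can be formed from the same matrix-vector products as ${\bf G}^{t}{\bf U}^{t}$.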

3.2 The Convergence Analysis

In this subsection, we establish the local superlinear convergence of the proposed IGN method.

We start our analysis from the following proposition, which shows the non-singularity of the Gram matrix associated with the exact Jacobian at the non-degenerate solution ${\bf{x}}^{*}\in{\mathbb{R}}^{d}$ .

Proposition 1.

Under Assumption 3, it holds that

\displaystyle\sigma_{\min}({\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*}))=\mu^{2}>0.

(16)

Under the continuity assumptions on ${\bf{f}}(\cdot)$ and ${\bf J}(\cdot)$, we can establish the Hölder continuity of the Gram matrices.

Lemma 1.

Under Assumptions 1 and 2, we have

\displaystyle\left\|{\bf J}({\bf{y}})^{\top}{\bf J}({\bf{y}})-{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{x}})\right\|\leq 2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu}

and

\displaystyle\left\|{\bf{g}}_{i}({\bf{y}}){\bf{g}}_{i}({\bf{y}})^{\top}-{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{x}})^{\top}\right\|\leq 2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu}

for all ${\bf{x}},{\bf{y}}\in{\mathbb{R}}^{d}$ and $i\in[n]$.

Recall that the design of the IGN method is motivated by the estimation ${\bf H}^{t}\approx{\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t})$, which indicates that we can combine Proposition 1 and Lemma 1 to bound the spectrum of ${\bf H}^{t}$ as follows.

Lemma 2.

Under Assumptions 1, 2 and 3, running IGN (Algorithm 1) with ${\bf H}^{0}={\bf J}({\bf{x}}^{0})^{\top}{\bf J}({\bf{x}}^{0})$ and ${\bf G}^{0}=({\bf H}^{0})^{-1}$ guarantees that

\displaystyle\sigma_{\min}({\bf H}^{t})\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}

for all $t\geq 0$ .

Lemma 2 indicates that if all of the points ${\bf{z}}_{1}^{t},\dots,{\bf{z}}_{n}^{t}$ are sufficiently close to the solution ${\bf{x}}^{*}$, the matrix ${\bf H}^{t}$ is positive-definite, which guarantees that its inverse (i.e., matrix ${\bf G}^{t}$) in the algorithm is well-defined. Based on this intuition, we use induction to show the positive-definiteness of the matrices ${\bf H}^{t}$, the non-singularity of ${\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t}$, and the local superlinear convergence rate of the proposed method.

Theorem 1.

Under Assumptions 1, 2 and 3, running IGN (Algorithm 1) with initialization ${\bf{x}}^{0}\in{\mathbb{R}}^{d}$ , ${\bf H}^{0}={\bf J}({\bf{x}}^{0})^{\top}{\bf J}({\bf{x}}^{0})$ and ${\bf G}^{0}=({\bf H}^{0})^{-1}$ such that

\displaystyle\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|\leq\left(\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}\right)^{1/\nu},

we have ${\bf H}^{t}\succeq(\mu^{2}/2){\bf I}$ and $\sigma_{\min}({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t})>0$ for all $t\geq 0$. Additionally, there exists a sequence $\{r_{t}\}$ such that $\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq r_{t}$ and it holds that

\displaystyle r_{t+1}\leq c^{(1+\nu)^{\left(\left\lfloor{t}/{n}\right\rfloor-1\right)}}r_{t}\qquad\text{with}\qquad c=1-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{(1+\nu)}\right)

for all $t\geq n$ .

Since the term $c$ in Theorem 1 is monotonically decreasing with respect to $\nu\in(0,1]$, we can bound it by $1-15/(16n)\leq c<1-1/(2n)$ and simplify the superlinear convergence as follows.

Corollary 1.

Under the settings and notations of Theorem 1, we have

\displaystyle r_{t+1}<\Big{(}1-\frac{1}{2n}\Big{)}^{(1+\nu)^{\left(\left\lfloor{t}/{n}\right\rfloor-1\right)}}r_{t}

for all $t\geq n$ .
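As a quick check of the bounds $1-15/(16n)\leq c<1-1/(2n)$ used to pass from Theorem 1 to Corollary 1, one can evaluate the expression for $c$ at the two ends of the range $\nu\in(0,1]$:

```latex
\begin{align*}
\nu=1:&\quad c=1-\frac{1}{n}\left(1-\left(\frac{1}{4}\right)^{2}\right)=1-\frac{15}{16n},\\
\nu\to 0^{+}:&\quad c\to 1-\frac{1}{n}\left(1-\frac{1}{2}\right)=1-\frac{1}{2n}.
\end{align*}
```

Because $c$ decreases as $\nu$ increases, the value at $\nu=1$ is the lower bound and the limit as $\nu\to 0^{+}$ is a strict upper bound, which is the constant appearing in Corollary 1.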

Theorem 1 also indicates that a larger $\nu\in(0,1]$ leads to a faster superlinear convergence rate. In the case of $\nu=1$, our Hölder continuity condition (Assumption 2) reduces to Lipschitz continuity, and we can achieve the $n$-step local quadratic convergence rate as follows.

Corollary 2.

Under the settings and notations of Theorem 1 with $\nu=1$ , we have the $n$ -step quadratic convergence

\displaystyle r_{t}\leq\frac{1}{4}r_{t-n}^{2}

for all $t\geq n$ .

4 The Extension to Mini-Batch Methods

We can also improve the efficiency of the IGN method by using mini-batch updates. Specifically, we consider the mini-batch size $k$ and partition the index set $[n]=\{1,\dots,n\}$ into $m=\lceil n/k\rceil$ non-overlapping subsets $\{{\mathcal{S}}_{1},\dots,{\mathcal{S}}_{m}\}$ such that $|{\mathcal{S}}_{1}|=\dots=|{\mathcal{S}}_{m-1}|=k$, $\cup_{i=1}^{m}{\mathcal{S}}_{i}=[n]$ and ${\mathcal{S}}_{i}\cap{\mathcal{S}}_{j}=\emptyset$ for all distinct $i,j\in[m]$.

The mini-batch variant of IGN also applies the update of the form ${\bf{x}}^{t+1}={\bf G}^{t}{\bf{u}}^{t}$. Different from IGN, we update the variables $\{{\bf{z}}_{i}^{t}\}_{i=1}^{m}$ with the smaller period $m=\lceil n/k\rceil$ such that

\displaystyle{\bf{z}}_{i}^{t+1}=\begin{cases}{\bf{x}}^{t+1},&\text{if~{}}i=i_{t},\\ {\bf{z}}_{i}^{t},&\text{otherwise},\end{cases}

(17)

where $i_{t}={t\%m}+1$ .
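The partition and the cyclic schedule described above can be sketched as follows. This is only an illustration: the helper names `partition_indices` and `active_block` are hypothetical, indices are zero-based, and splitting $[n]$ into consecutive blocks is our own assumption, since any partition with the stated block sizes would satisfy the description.

```python
import math

def partition_indices(n, k):
    """Split {0, ..., n-1} into consecutive blocks of size k (the last may be smaller)."""
    return [list(range(start, min(start + k, n))) for start in range(0, n, k)]

def active_block(t, blocks):
    """Block refreshed at iteration t under the cyclic rule (zero-based i_t = t % m)."""
    return blocks[t % len(blocks)]

blocks = partition_indices(10, 4)            # n = 10 components, mini-batch size k = 4
m = math.ceil(10 / 4)                        # number of blocks, m = ceil(n / k)
schedule = [active_block(t, blocks) for t in range(m + 1)]
```

With these blocks, every component is refreshed once every $m$ iterations instead of once every $n$ iterations.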

We establish the recursions of the aggregated variables in a mini-batch fashion as follows

\displaystyle\begin{cases}{\bf{u}}^{t+1}={\bf{u}}^{t}-\sum_{j\in{\mathcal{S}}_{i_{t}}}\left({\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t})^{\top}{\bf{z}}_{i_{t}}^{t}-f_{j}({\bf{z}}_{i_{t}}^{t})\right){\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t})+\sum_{j\in{\mathcal{S}}_{i_{t}}}\left({\bf{g}}_{j}({\bf{x}}^{t+1})^{\top}{\bf{x}}^{t+1}-f_{j}({\bf{x}}^{t+1})\right){\bf{g}}_{j}({\bf{x}}^{t+1}),\\ {\bf H}^{t+1}={\bf H}^{t}-\sum_{j\in{\mathcal{S}}_{i_{t}}}{\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t}){\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t})^{\top}+\sum_{j\in{\mathcal{S}}_{i_{t}}}{\bf{g}}_{j}({\bf{x}}^{t+1}){\bf{g}}_{j}({\bf{x}}^{t+1})^{\top},\\ {\bf G}^{t+1}={\bf G}^{t}-{\bf G}^{t}{\bf U}^{t}({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t})^{-1}({\bf V}^{t})^{\top}{\bf G}^{t},\end{cases}

(18)

where we construct matrices ${\bf U}^{t},{\bf V}^{t}\in{\mathbb{R}}^{d\times 2|{\mathcal{S}}_{i_{t}}|}$ as

\displaystyle\begin{cases}{\bf U}^{t}=\Big{[}-{\bf{g}}_{j_{1}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{1}}({\bf{x}}^{t+1}),~{}\cdots~{},~{}-{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{x}}^{t+1})\Big{]},\\ {\bf V}^{t}=\Big{[}{\bf{g}}_{j_{1}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{1}}({\bf{x}}^{t+1}),~{}\cdots~{},~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{x}}^{t+1})\Big{]},\end{cases}

and indices $j_{1},\dots,j_{|{\mathcal{S}}_{i_{t}}|}$ are the elements in subset ${\mathcal{S}}_{i_{t}}$ such that $|{\mathcal{S}}_{i_{t}}|\leq k$ .

We formally present the procedure of the Mini-Batch Incremental Gauss-Newton (MB-IGN) method in Algorithm 2 (see Appendix A). The memory cost of MB-IGN is ${\mathcal{O}}(nd+d^{2})$, matching that of IGN. Each iteration of MB-IGN includes the matrix multiplications of ${\bf G}^{t}$, ${\bf U}^{t}$ and ${\bf V}^{t}$ within the complexity of ${\mathcal{O}}(kd^{2})$ flops. It is worth noting that the mini-batch update in MB-IGN can be efficiently implemented by block matrix operations that take advantage of parallel computation [14].

Formally, we present the following convergence results of MB-IGN.

Theorem 2.

Under Assumptions 1, 2 and 3, running MB-IGN (Algorithm 2) with mini-batch size $k$ and initialization ${\bf{x}}^{0}\in{\mathbb{R}}^{d}$ , ${\bf H}^{0}={\bf J}({\bf{x}}^{0})^{\top}{\bf J}({\bf{x}}^{0})$ and ${\bf G}^{0}=({\bf H}^{0})^{-1}$ such that

\displaystyle\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|\leq\left(\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}\lceil{n}/{k}\rceil}\right)^{1/\nu},

we have ${\bf H}^{t}\succeq(\mu^{2}/2){\bf I}\,$ and $\sigma_{\min}({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t})>0$ for all $t\geq 0$ . Additionally, there exists sequence $\{r_{t}\}$ such that $\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq r_{t}$ and it holds

\displaystyle r_{t+1}\leq c^{(1+\nu)^{\left(\left\lfloor\frac{t}{\lceil n/k\rceil}\right\rfloor-1\right)}}r_{t}\qquad\text{with}\qquad c=1-\frac{1}{\lceil n/k\rceil}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{(1+\nu)}\right).

The terms $\lceil n/k\rceil$ in the results of Theorem 2 imply that increasing the mini-batch size $k$ can speed up the convergence of MB-IGN. Additionally, the convergence of MB-IGN matches that of IGN if we take $k=1$.

Similar to the discussion in Section 3.2, we have the following corollary for MB-IGN method.

Corollary 3.

Under settings of Theorem 2, we have

\displaystyle r_{t+1}\leq\Big{(}1-\frac{1}{2\lceil n/k\rceil}\Big{)}^{(1+\nu)^{\left(\left\lfloor\frac{t}{\lceil n/k\rceil}\right\rfloor-1\right)}}r_{t}

for all $t\geq\lceil n/k\rceil$ . In the case of $\nu=1$ , we have the $\lceil n/k\rceil$ -step quadratic convergence

\displaystyle r_{t}\leq\frac{1}{4}r_{t-\lceil n/k\rceil}^{2}

for all $t\geq\lceil n/k\rceil$ .

Specifically, Corollary 3 indicates that the MB-IGN method with $k=n$ has the quadratic convergence under the assumption of Lipschitz continuous Jacobian (Assumption 2 with $\nu=1$ ), which matches the rate of vanilla Gauss–Newton method.

5 Related Work

Table 1: We compare the per-iteration computation complexity, memory cost, convergence rates and Jacobian assumptions of the proposed methods and baselines. In the rightmost column, ✗ means that the method (GN, SNR, GN-BFGS, BGB and BBB) needs to access all of the components $f_{1},\dots,f_{n}$ at each iteration, while ✓ means that the method only needs to access one component or a mini-batch of components.

Methods	Computation	Memory	Convergence	Jacobian	Incremental access to $f_{i}$
GN [6, 39]	${\mathcal{O}}(nd^{2}+d^{3})$	${\mathcal{O}}(nd+d^{2})$	quadratic	Lipschitz	✗
SNR [51]^$\sharp$	${\mathcal{O}}(n\tau^{2}+\tau^{3})$	${\mathcal{O}}(\tau d)$	sublinear	Lipschitz	✗
GN-BFGS [29]^${\ddagger}$	${\mathcal{O}}(d^{2})$	${\mathcal{O}}(d^{2})$	asymptotic superlinear	Hölder	✗
BGB [32]^$\S$	${\mathcal{O}}(\tilde{k}d^{2})$	${\mathcal{O}}(d^{2})$	${\mathcal{O}}\big{(}(1-{\tilde{k}}/d)^{t(t-1)/4}\big{)}$	Lipschitz	✗
BBB [32]^$\S$	${\mathcal{O}}(\tilde{k}d^{2})$	${\mathcal{O}}(d^{2})$	${\mathcal{O}}\big{(}(1-\tilde{k}/(\varkappa d))^{t(t-1)/4}\big{)}$	Lipschitz	✗
EKF [8]	${\mathcal{O}}(d^{2})$	${\mathcal{O}}(d^{2})$	sublinear	Lipschitz	✓
EKF-S [36, 23]	${\mathcal{O}}(d^{2})$	${\mathcal{O}}(d^{2})$	linear	Lipschitz	✓
IGN (this work)	${\mathcal{O}}(d^{2})$	${\mathcal{O}}(nd+d^{2})$	${\mathcal{O}}\big{(}(1-1/(2n))^{(1+\nu)^{\left\lfloor t/n\right\rfloor}}\big{)}$	Hölder	✓
MB-IGN (this work)	${\mathcal{O}}(kd^{2})$	${\mathcal{O}}(nd+d^{2})$	${\mathcal{O}}\big{(}(1-k/(2n))^{(1+\nu)^{\left\lfloor{kt}/{n}\right\rfloor}}\big{)}$	Hölder	✓

$\sharp$ The SNR method requires star convexity in its minimization formulation. The notation $\tau$ denotes the sketch size.

${\ddagger}$ The GN-BFGS method requires $n=d$ and a symmetric Jacobian.

$\S$ The BGB and BBB methods require $n=d$. The notation $\tilde{k}$ is the rank of the modification matrix and $\varkappa\triangleq L_{f}/\mu$ is the condition number.

We compare the theoretical results of proposed IGN and MB-IGN with existing methods in Table 1.

The methods including Gauss–Newton-based BFGS (GN-BFGS) [29], Block Good Broyden’s method (BGB) [32], Block Bad Broyden’s method (BBB) [32] and Sketched Newton–Raphson (SNR) [51] only focus on establishing the Jacobian estimator, while each of their iterations depends on accessing all components of the nonlinear vector function, which is expensive for large-scale problems. In addition, the quasi-Newton methods including GN-BFGS [29], BGB [32] and BBB [32] only work for the scenario of $n=d$. The SNR method enjoys an efficient update for large $n$, but it lacks the local superlinear convergence of classical Newton-type methods.

The Extended Kalman Filter with Stepsize (EKF-S) [36, 23] is based on the incremental update that only accesses one component (or a mini-batch of components) and the corresponding gradient at each iteration. Concretely, the EKF-S method performs the iteration

\displaystyle{\bf{x}}^{t+1}={\bf{x}}^{t}-\alpha^{t}(\tilde{\bf H}^{t})^{-1}{\bf{g}}_{i_{t}}({\bf{x}}^{t})f_{i_{t}}({\bf{x}}^{t})

with some stepsize $\alpha^{t}>0$ , where $\tilde{\bf H}^{t}\in{\mathbb{R}}^{d\times d}$ is the estimator for the Gram matrix ${\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t})$ which is constructed by the recursion

\displaystyle\tilde{\bf H}^{t+1}=\lambda^{t}\tilde{\bf H}^{t}+{\bf{g}}_{i_{t}}({\bf{x}}^{t+1}){\bf{g}}_{i_{t}}({\bf{x}}^{t+1})^{\top}

(19)

for some $\lambda^{t}\in(0,1]$. The original Extended Kalman Filter method (EKF) [8] takes a fixed stepsize of $\alpha^{t}=1$ in the above iteration and achieves a sublinear convergence rate. Later, Gürbüzbalaban et al. [23] showed that introducing an adaptive stepsize can achieve a linear convergence rate. Note that EKF-S and EKF do not explicitly reuse the information of the vector ${\bf{g}}_{i_{t}}({\bf{x}}^{t})$ in later iterations. In other words, the recursion (19) indicates that all information of the historical gradients is heuristically compressed into the term $\lambda^{t}\tilde{\bf H}^{t}$. In contrast, the proposed IGN method establishes the Gram matrix approximation ${\bf H}^{t}\approx{\bf J}({\bf{x}}^{t})^{\top}{\bf J}({\bf{x}}^{t})$ by equations (10) and (12), which clearly corresponds to the linear approximation (6)-(7) by reusing all of the historical gradients $\{{\bf{g}}_{i}({\bf{z}}_{i}^{t})\}_{i=1}^{n}$. This strategy encourages a more accurate Gram matrix estimation in our method and leads to a superlinear convergence rate.

The incremental Newton-type methods have also been studied in finite-sum strongly convex optimization [43, 44, 35, 28, 33]. From the viewpoint of our formulation (1), these works consider solving the system of nonlinear equations of the form ${\bf{f}}({\bf{x}})={\bf{0}}$, where ${\bf{f}}:{\mathbb{R}}^{d}\to{\mathbb{R}}^{d}$ is the gradient of some objective function and has the finite-sum structure ${\bf{f}}({\bf{x}})\triangleq(1/N)\sum_{i=1}^{N}{\bf{f}}_{i}({\bf{x}})$ with symmetric positive-definite Jacobian. These methods can achieve superlinear convergence rates by accessing one of $\{{\bf{f}}_{i}\}_{i=1}^{N}$ and its Jacobian at each iteration. However, their iterations have to maintain Jacobians for all of the individual functions $\{{\bf{f}}_{i}\}_{i=1}^{N}$ with a memory cost of ${\mathcal{O}}(Nd^{2})$, which is prohibitive for a large $N$.

6 Experiments

We conduct numerical experiments on the following applications:

•

Regularized Logistic Regression: We consider training the binary classifier ${\bf{x}}\in{\mathbb{R}}^{d}$ by solving the nonconvex regularized logistic regression problem [2, 27]

\displaystyle\min_{{\bf{x}}\in{\mathbb{R}}^{d}}\ell({\bf{x}})\triangleq\frac{1}{N}\sum_{j=1}^{N}\log(1+\exp(-b_{j}{\bf{a}}_{j}^{\top}{\bf{x}}))+\theta\sum_{k=1}^{d}\frac{\nu x_{k}^{2}}{1+\nu x_{k}^{2}},

where $\{({\bf{a}}_{j},b_{j})\}_{j=1}^{N}$ is the training set such that ${\bf{a}}_{j}\in{\mathbb{R}}^{d}$ and $b_{j}\in\{-1,1\}$ for all $j\in[N]$. We set $\theta=10^{-2}$ and $\nu=1$ for the model. We cast the above minimization problem in the form of the nonlinear equations (1) with ${\bf{f}}({\bf{x}})\triangleq\nabla\ell({\bf{x}})$. We perform the experiments on the dataset “DBWorld” ($N=64$ and $d=4,702$) [19] for this problem.

•

Chandrasekhar’s H-Equation: We consider Chandrasekhar’s H-equation, which is widely used in analytical radiative transfer theory [24, 13]. It can be formulated as problem (1) with

\displaystyle f_{i}({\bf{x}})=x_{i}-\left(1-\frac{c}{2n}\sum_{j=1}^{n}\frac{\mu_{i}x_{j}}{\mu_{i}+\mu_{j}}\right)^{-1}~{}\text{for all}~{}i\in[n],\quad\text{where}\quad\mu_{i}=\frac{i-1/2}{n}.

We set $d=2,000$ and $c=1-10^{-5}$ for this problem in our experiments.

•

Soft Maximum Minimization: We consider the soft maximum minimization problem [37, 12]

\displaystyle\min_{{\bf{x}}\in{\mathbb{R}}^{d}}h({\bf{x}})\triangleq\mu\ln{\left(\sum_{i=1}^{N}\exp{\left(\frac{\langle{\bf{a}}_{i},{\bf{x}}\rangle-b_{i}}{\mu}\right)}\right)}+\frac{\lambda}{2}\left\|{\bf{x}}\right\|^{2},

(20)

which can be formulated by problem (1) with ${\bf{f}}({\bf{x}})\triangleq\nabla h({\bf{x}})$ . We follow the setting of [17, 18] by generating the entries of ${\bf{a}}_{1},\cdots,{\bf{a}}_{N}\in{\mathbb{R}}^{d}$ and ${\bf{b}}\in{\mathbb{R}}^{N}$ randomly and independently from the uniform distribution on $[-1,1]$ . We set $N=2000$ , $d=2000$ , $\mu=5$ and $\lambda=2$ in our experiments.
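To make the component functions of the Chandrasekhar H-equation item above concrete, the following sketch evaluates the residual vector ${\bf{f}}({\bf{x}})$ directly from its definition. The function name `h_equation_residual` is hypothetical, indices are zero-based (so $\mu_{i}=(i-1/2)/n$ becomes `(i + 0.5) / n`), and the plain double loop costs ${\mathcal{O}}(n^{2})$ operations; it is meant for illustration only and is not the implementation used in the experiments.

```python
def h_equation_residual(x, c):
    """Residual f(x) of Chandrasekhar's H-equation for a list x of length n."""
    n = len(x)
    mu = [(i + 0.5) / n for i in range(n)]   # quadrature nodes mu_i = (i - 1/2) / n
    residual = []
    for i in range(n):
        s = sum(mu[i] * x[j] / (mu[i] + mu[j]) for j in range(n))
        residual.append(x[i] - 1.0 / (1.0 - c / (2 * n) * s))
    return residual

# With c = 0 every equation reduces to x_i - 1 = 0, so the all-ones vector is a root.
res_trivial = h_equation_residual([1.0] * 5, 0.0)
```

An incremental method only needs one entry of this residual (one pass of the inner sum) and its gradient per iteration, rather than the full vector.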

We first investigate the impact of the mini-batch size $k$ of the MB-IGN method (Algorithm 2) on performance. We run MB-IGN with different mini-batch sizes on the three problems and present the empirical results for time (s) against $||{\bf{f}}({\bf{x}})||$ in Figure 1, where the setting $k=1$ corresponds to our IGN method (Algorithm 1). We observe that the mini-batch update is effective in reducing the time cost. The mini-batch sizes of $500$ , $200$ , and $100$ achieve the best performance on the problems of regularized logistic regression, Chandrasekhar’s H-equation, and soft maximum minimization, respectively.

We then compare the proposed method MB-IGN (Algorithm 2) with the baseline methods SNR [51], EKF-S [8, 36], BGB [32] and BBB [32]. We present the empirical results for the number of epochs against $||{\bf{f}}({\bf{x}})||$ in Figure 2, where one epoch means one complete pass over all components of the nonlinear vector function. We observe that the proposed MB-IGN and the baseline method BGB outperform the others on all problems. This is reasonable since only these two methods enjoy explicit condition-number-free superlinear convergence rates (see Table 1). The superlinear convergence rate of the BBB method depends on the condition number, so it does not always perform better than the linearly convergent method EKF-S.

We also present the empirical results for the time cost (seconds) against $||{\bf{f}}({\bf{x}})||$ in Figure 3. We observe that the proposed MB-IGN always performs significantly better than all baseline methods. This is in line with our expectations because only our MB-IGN method enjoys both the superlinear convergence rate and the cheap iteration cost. Although the BGB method requires a number of epochs comparable to our MB-IGN on Chandrasekhar’s H-equation, each of its iterations accesses all components, which makes its time cost expensive.

[Figure panel (a): Robust Logistic Regression]

7 Conclusion

In this work, we propose the incremental Gauss–Newton method (IGN) for solving systems of nonlinear equations. We design the algorithm by tracking the historical gradients of all components to build an estimator of the Gram matrix and its inverse. The theoretical analysis shows that IGN enjoys an explicit superlinear convergence rate under the assumption of a Hölder continuous Jacobian. We also provide a mini-batch extension of our IGN method (MB-IGN) and show that it has an even faster superlinear convergence rate. The numerical experiments on regularized logistic regression, Chandrasekhar’s H-equation, and soft maximum minimization validate the advantage of the proposed methods over existing baselines.

In the future, it will be interesting to study the incremental Gauss–Newton method to solve nonlinear equations in the distributed setting. It is also possible to design incremental quasi-Newton methods for solving the general nonlinear equations.

References

Al-Baali et al. [2014] Mehiddin Al-Baali, Emilio Spedicato, and Francesca Maggioni. Broyden’s quasi-Newton methods for a nonlinear system of equations and unconstrained optimization: a review and open problems. Optimization Methods and Software, 29(5):937–954, 2014.
Antoniadis et al. [2011] Anestis Antoniadis, Irène Gijbels, and Mila Nikolova. Penalized likelihood regression for generalized linear models with non-quadratic penalties. Annals of the Institute of Statistical Mathematics, 63:585–615, 2011.
Athans et al. [1968] Michael Athans, Richard Wishner, and Anthony Bertolini. Suboptimal state estimation for continuous-time nonlinear systems from discrete noisy measurements. IEEE Transactions on Automatic Control, 13(5):504–514, 1968.
Bai et al. [2019] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 2019.
Bell [1994] Bradley M. Bell. The iterated Kalman smoother as a Gauss–Newton method. SIAM Journal on Optimization, 4(3):626–636, 1994.
Ben-Israel [1966] Adi Ben-Israel. A Newton–Raphson method for the solution of systems of equations. Journal of Mathematical analysis and applications, 15(2):243–252, 1966.
Berthier et al. [2021] Eloïse Berthier, Justin Carpentier, and Francis Bach. Fast and robust stability region estimation for nonlinear dynamical systems. In European Control Conference, 2021.
Bertsekas [1996] Dimitri P. Bertsekas. Incremental least squares methods and the extended Kalman filter. SIAM Journal on Optimization, 6(3):807–822, 1996.
Bertsekas [1997] Dimitri P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM Journal on Optimization, 7(4):913–926, 1997.
Botev et al. [2017] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss–Newton optimisation for deep learning. In International Conference on Machine Learning, 2017.
Broyden [1965] Charles G Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics of computation, 19(92):577–593, 1965.
Bullins [2020] Brian Bullins. Highly smooth minimization of non-smooth problems. In Conference on Learning Theory, 2020.
Chandrasekhar [1960] Subrahmanyan Chandrasekhar. Radiative transfer. Courier Corporation, 1960.
Davis [1998] Timothy A. Davis. Block matrix methods: Taking advantage of high-performance computers. Technical report, Computer and Information Sciences Department, 1998.
Défossez and Bach [2015] Alexandre Défossez and Francis Bach. Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In International Conference on Artificial Intelligence and Statistics, 2015.
Dennis Jr and Schnabel [1996] John E. Dennis Jr and Robert B. Schnabel. Numerical methods for unconstrained optimization and nonlinear equations. SIAM, 1996.
Doikov et al. [2023] Nikita Doikov, El Mahdi Chayti, and Martin Jaggi. Second-order optimization with lazy Hessians. In International Conference on Machine Learning, 2023.
Doikov et al. [2024] Nikita Doikov, Konstantin Mishchenko, and Yurii Nesterov. Super-universal regularized Newton method. SIAM Journal on Optimization, 34(1):27–56, 2024.
Filannino [2011] Michele Filannino. DBWorld e-mails. UCI Machine Learning Repository, 2011. DOI: https://doi.org/10.24432/C5589M.
Frehse and Bensoussan [1984] J. Frehse and A. Bensoussan. Nonlinear elliptic systems in stochastic game theory. 1984.
Grapiglia and Nesterov [2017] Geovani N. Grapiglia and Yurii Nesterov. Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM Journal on Optimization, 27(1):478–506, 2017.
Grapiglia and Nesterov [2019] Geovani N. Grapiglia and Yurii Nesterov. Accelerated regularized Newton methods for minimizing composite convex functions. SIAM Journal on Optimization, 29(1):77–99, 2019.
Gürbüzbalaban et al. [2015] Mert Gürbüzbalaban, Asuman Ozdaglar, and Pablo Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.
Hottel and Saforim [1967] Hoyt C. Hottel and Adel F. Saforim. Radiative transfer. 1967.
Kelley [1995] Carl T. Kelley. Iterative methods for linear and nonlinear equations. SIAM, 1995.
Kelley [2003] Carl T. Kelley. Solving nonlinear equations with Newton’s method. SIAM, 2003.
Kohler and Lucchi [2017] Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In International Conference on Machine Learning, 2017.
Lahoti et al. [2023] Aakash Lahoti, Spandan Senapati, Ketan Rajawat, and Alec Koppel. Sharpened lazy incremental quasi-Newton method. arXiv preprint arXiv:2305.17283, 2023.
Li and Fukushima [1999] Donghui Li and Masao Fukushima. A globally and superlinearly convergent Gauss–Newton-based BFGS method for symmetric nonlinear equations. SIAM Journal on numerical Analysis, 37(1):152–172, 1999.
Lin et al. [2021] Dachao Lin, Haishan Ye, and Zhihua Zhang. Explicit superlinear convergence rates of Broyden’s methods in nonlinear equations. arXiv preprint arXiv:2109.01974, 2021.
Liu and Luo [2022] Chengchang Liu and Luo Luo. Quasi-Newton methods for saddle point problems. Advances in Neural Information Processing Systems, 2022.
Liu et al. [2023] Chengchang Liu, Cheng Chen, Luo Luo, and John Lui. Block Broyden’s methods for solving nonlinear equations. Advances in Neural Information Processing Systems, 2023.
Liu et al. [2024] Zhuanghua Liu, Luo Luo, and Bryan Kian Hsiang Low. Incremental quasi-Newton methods with faster superlinear convergence rates. In AAAI Conference on Artificial Intelligence, 2024.
Ljung [1979] Lennart Ljung. Asymptotic behavior of the extended Kalman filter as a parameter estimator for linear systems. IEEE Transactions on Automatic Control, 24(1):36–50, 1979.
Mokhtari et al. [2018] Aryan Mokhtari, Mark Eisen, and Alejandro Ribeiro. IQN: An incremental quasi-Newton method with local superlinear convergence rate. SIAM Journal on Optimization, 28(2):1670–1698, 2018.
Moriyama et al. [2003] Hiroyuki Moriyama, Nobuo Yamashita, and Masao Fukushima. The incremental Gauss–Newton algorithm with adaptive stepsize rule. Computational Optimization and Applications, 26:107–141, 2003.
Nesterov [2005] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical programming, 103:127–152, 2005.
Nesterov and Polyak [2006] Yurii Nesterov and Boris T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical programming, 108(1):177–205, 2006.
Nocedal and Wright [1999] Jorge Nocedal and Stephen J. Wright. Numerical optimization. Springer, 1999.
Nourian and Caines [2013] Mojtaba Nourian and Peter E. Caines. $\epsilon$ -Nash mean field game theory for nonlinear stochastic dynamical systems with major and minor agents. SIAM Journal on Control and Optimization, 51(4):3302–3331, 2013.
Petersen and Pedersen [2008] Kaare Brandt Petersen and Michael Syskind Pedersen. The matrix cookbook. Technical University of Denmark, 7(15):510, 2008.
Pilanci and Wainwright [2017] Mert Pilanci and Martin J. Wainwright. Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245, 2017.
Rodomanov and Kropotov [2015] Anton Rodomanov and Dmitry Kropotov. A Newton-type incremental method with a superlinear convergence rate. In Optimization for Machine Learning, 2015.
Rodomanov and Kropotov [2016] Anton Rodomanov and Dmitry Kropotov. A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In International Conference on Machine Learning, 2016.
Trémolet [2007] Yannick Trémolet. Model-error estimation in 4D-var. Quarterly Journal of the Royal Meteorological Society: A journal of the atmospheric sciences, applied meteorology and physical oceanography, 133(626):1267–1280, 2007.
Wang [2012] Yong Wang. Gauss–Newton method. Wiley Interdisciplinary Reviews: Computational Statistics, 4(4):415–420, 2012.
Woodbury [1950] Max A. Woodbury. Inverting modified matrices. Department of Statistics, Princeton University, 1950.
Woodruff [2014] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
Ye et al. [2021a] Haishan Ye, Dachao Lin, and Zhihua Zhang. Greedy and random Broyden’s methods with explicit superlinear convergence rates in nonlinear equations. arXiv preprint arXiv:2110.08572, 2021a.
Ye et al. [2021b] Haishan Ye, Luo Luo, and Zhihua Zhang. Approximate Newton methods. Journal of Machine Learning Research, 22(66):1–41, 2021b.
Yuan et al. [2022] Rui Yuan, Alessandro Lazaric, and Robert M. Gower. Sketched Newton–Raphson. SIAM Journal on Optimization, 32(3):1555–1583, 2022.

Appendix

The appendix is organized as follows. In Section A, we provide the detailed procedure of the Mini-Batch Incremental Gauss–Newton method (MB-IGN). In Section B, we provide some basic results for Jacobians. In Section C, we introduce an auxiliary sequence and analyze its properties. In Sections D and E, we provide the convergence analysis for the proposed IGN and MB-IGN, respectively.

Appendix A The Mini-Batch Incremental Gauss–Newton Method

We provide the detailed procedure of Mini-Batch Incremental Gauss–Newton Method (MB-IGN) in Algorithm 2.

Algorithm 2 Mini-Batch Incremental Gauss–Newton Method (MB-IGN)

1: Input: ${\bf{x}}^{0}\in{\mathbb{R}}^{d}$ , ${\bf{u}}^{0}\in{\mathbb{R}}^{d}$ , ${\bf H}^{0},{\bf G}^{0}\in{\mathbb{R}}^{d\times d}$ , $k\leq n$ , $m=\lceil n/k\rceil$

2: Partition the index set $[n]=\{1,\dots,n\}$ into subsets $\{{\mathcal{S}}_{1},\dots,{\mathcal{S}}_{m}\}$ such that $|{\mathcal{S}}_{1}|=\dots=|{\mathcal{S}}_{m-1}|=k$ , $\cup_{i=1}^{m}{\mathcal{S}}_{i}=[n]$ and ${\mathcal{S}}_{i}\cap{\mathcal{S}}_{j}=\emptyset$ for all distinct $i,j\in[m]$

3: for $t=0,1,\dots$

4: ${\bf{x}}^{t+1}={\bf G}^{t}{\bf{u}}^{t}$

5: $i_{t}=t\%m+1$

6: ${\bf U}^{t}=\Big[-{\bf{g}}_{j_{1}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{1}}({\bf{x}}^{t+1}),~{}\cdots~{},~{}-{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{x}}^{t+1})\Big]$

7: ${\bf V}^{t}=\Big[{\bf{g}}_{j_{1}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{1}}({\bf{x}}^{t+1}),~{}\cdots~{},~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{x}}^{t+1})\Big]$

8: $\displaystyle{{\bf{u}}^{t+1}={\bf{u}}^{t}-\sum_{j\in{\mathcal{S}}_{i_{t}}}\left({\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t})^{\top}{\bf{z}}_{i_{t}}^{t}-f_{j}({\bf{z}}_{i_{t}}^{t})\right){\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t})+\sum_{j\in{\mathcal{S}}_{i_{t}}}\left({\bf{g}}_{j}({\bf{x}}^{t+1})^{\top}{\bf{x}}^{t+1}-f_{j}({\bf{x}}^{t+1})\right){\bf{g}}_{j}({\bf{x}}^{t+1})}$

9: $\displaystyle{{\bf H}^{t+1}={\bf H}^{t}-\sum_{j\in{\mathcal{S}}_{i_{t}}}{\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t}){\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t})^{\top}+\sum_{j\in{\mathcal{S}}_{i_{t}}}{\bf{g}}_{j}({\bf{x}}^{t+1}){\bf{g}}_{j}({\bf{x}}^{t+1})^{\top}}$

10: ${\bf G}^{t+1}={\bf G}^{t}-{\bf G}^{t}{\bf U}^{t}({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t})^{-1}({\bf V}^{t})^{\top}{\bf G}^{t}$

11: ${\bf{z}}_{i}^{t+1}=\begin{cases}{\bf{x}}^{t+1},&\text{if~{}}i=i_{t}\\ {\bf{z}}_{i}^{t},&\text{otherwise}\end{cases}$

12: end for
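The steps of Algorithm 2 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the callables `f_block` and `g_block`, the toy problem, and the initialization ${\bf{z}}_{i}^{0}={\bf{x}}^{0}$ with ${\bf H}^{0}=\sum_{i}{\bf{g}}_{i}({\bf{x}}^{0}){\bf{g}}_{i}({\bf{x}}^{0})^{\top}$ , ${\bf G}^{0}=({\bf H}^{0})^{-1}$ and ${\bf{u}}^{0}=\sum_{i}({\bf{g}}_{i}({\bf{x}}^{0})^{\top}{\bf{x}}^{0}-f_{i}({\bf{x}}^{0})){\bf{g}}_{i}({\bf{x}}^{0})$ are all assumptions, since Algorithm 2 itself only lists these quantities as inputs.

```python
import numpy as np

def mb_ign(f_block, g_block, x0, n, k, num_iters):
    """Sketch of MB-IGN.

    f_block(x, idx) -> values f_j(x) for j in idx, shape (len(idx),)
    g_block(x, idx) -> gradients g_j(x) stacked as rows, shape (len(idx), d)
    """
    m = -(-n // k)                                    # m = ceil(n / k)
    blocks = [np.arange(i * k, min((i + 1) * k, n)) for i in range(m)]
    all_idx = np.arange(n)
    # Assumed initialisation: every stored point z_i^0 equals x^0.
    Z = [x0.copy() for _ in range(m)]
    J0, f0 = g_block(x0, all_idx), f_block(x0, all_idx)
    H = J0.T @ J0
    G = np.linalg.inv(H)
    u = J0.T @ (J0 @ x0 - f0)
    x = x0.copy()
    for t in range(num_iters):
        x = G @ u                                     # x^{t+1} = G^t u^t
        i_t = t % m                                   # 0-based block index
        S, z = blocks[i_t], Z[i_t]
        Jz, fz = g_block(z, S), f_block(z, S)         # stale information at z_{i_t}^t
        Jx, fx = g_block(x, S), f_block(x, S)         # fresh information at x^{t+1}
        u = u - Jz.T @ (Jz @ z - fz) + Jx.T @ (Jx @ x - fx)
        H = H - Jz.T @ Jz + Jx.T @ Jx
        # Columns are grouped instead of interleaved; the product U V^T is unchanged.
        U = np.hstack([-Jz.T, Jx.T])
        V = np.hstack([Jz.T, Jx.T])
        GU = G @ U
        core = np.eye(U.shape[1]) + V.T @ GU
        G = G - GU @ np.linalg.solve(core, V.T @ G)   # low-rank update of the inverse
        Z[i_t] = x.copy()
    return x, H, G

# Hypothetical toy problem with a known root x_star: f_j(x) = s_j + 0.1 s_j^2, s = A (x - x_star).
A = np.array([[1.0, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]])
x_star = np.array([1.0, -2.0, 0.5])

def f_block(x, idx):
    s = A[idx] @ (x - x_star)
    return s + 0.1 * s ** 2

def g_block(x, idx):
    s = A[idx] @ (x - x_star)
    return (1.0 + 0.2 * s)[:, None] * A[idx]

x0 = x_star + 0.01 * np.array([1.0, -1.0, 1.0])
x_hat, H, G = mb_ign(f_block, g_block, x0, n=6, k=2, num_iters=12)
print(np.linalg.norm(x_hat - x_star))
```

The sketch keeps ${\bf H}^{t}$ only to mirror the algorithm; the iteration itself needs just ${\bf G}^{t}$ and ${\bf{u}}^{t}$ . Setting `k=1` gives one component per block, which corresponds to the IGN update.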

Appendix B Some Basic Results for Jacobians

This section presents some useful results for our later analysis.

Lemma 3.

(Hölder continuity of each gradient) Under Assumption 2, it holds that

\displaystyle\left\|{\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right\|\leq{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu},

(21)

for any ${\bf{x}},{\bf{y}}\in{\mathbb{R}}^{d}$ , and $i\in[n]$ .

Proof.

We denote

\displaystyle\tilde{{\bf J}}=\begin{bmatrix}({\bf{g}}_{1}({\bf{y}})-{\bf{g}}_{1}({\bf{x}}))^{\top}\\ \vdots\\ ({\bf{g}}_{n}({\bf{y}})-{\bf{g}}_{n}({\bf{x}}))^{\top}\end{bmatrix}\in{\mathbb{R}}^{n\times d}\qquad\text{with}\qquad{\bf{g}}_{i}({\bf{x}})=\nabla f_{i}({\bf{x}})

and let ${\bf{e}}_{i}\in{\mathbb{R}}^{n}$ be the $i$ -th standard basis vector in $n$ -dimensional Euclidean space. Then the facts $\tilde{{\bf J}}={\bf J}({\bf{y}})-{\bf J}({\bf{x}})$ and $\tilde{{\bf J}}^{\top}{\bf{e}}_{i}={\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})$ imply

\displaystyle\left\|{\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right\|\leq\|\tilde{{\bf J}}^{\top}\|\left\|{\bf{e}}_{i}\right\|=\|\tilde{{\bf J}}\|=\left\|{\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right\|\leq{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu},

where the last step is based on the Hölder continuity of ${\bf J}(\cdot)$ . ∎

Lemma 4.

(Bound for Hölder-continuous function) Under Assumption 2, we have

\displaystyle f_{i}({\bf{y}})-f_{i}({\bf{x}})-{\bf{g}}_{i}({\bf{x}})^{\top}({\bf{y}}-{\bf{x}})\leq\frac{{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{1+\nu},

(22)

for any ${\bf{x}},{\bf{y}}\in{\mathbb{R}}^{d}$ and $i\in[n]$ .

Proof.

Following the proof of [21, 22], we have

\displaystyle\begin{split}
f_{i}({\bf{y}})-f_{i}({\bf{x}})-{\bf{g}}_{i}({\bf{x}})^{\top}({\bf{y}}-{\bf{x}})&=\int_{t=0}^{1}{\bf{g}}_{i}({\bf{x}}+t({\bf{y}}-{\bf{x}}))^{\top}({\bf{y}}-{\bf{x}})\,\text{d}t-{\bf{g}}_{i}({\bf{x}})^{\top}({\bf{y}}-{\bf{x}})\\
&=\int_{t=0}^{1}\left({\bf{g}}_{i}({\bf{x}}+t({\bf{y}}-{\bf{x}}))-{\bf{g}}_{i}({\bf{x}})\right)^{\top}({\bf{y}}-{\bf{x}})\,\text{d}t\\
&\leq\int_{t=0}^{1}\left\|{\bf{g}}_{i}({\bf{x}}+t({\bf{y}}-{\bf{x}}))-{\bf{g}}_{i}({\bf{x}})\right\|\left\|{\bf{y}}-{\bf{x}}\right\|\text{d}t\\
&\leq\int_{t=0}^{1}{\mathcal{H}}_{\nu}t^{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{1+\nu}\text{d}t\\
&={\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{1+\nu}\int_{t=0}^{1}t^{\nu}\,\text{d}t\\
&=\frac{{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{1+\nu},
\end{split}

where the first inequality comes from Cauchy-Schwarz inequality, and the second one comes from Lemma 3 that each gradient is Hölder continuous. ∎

Lemma 5.

(Bound for Jacobian and gradient) Under Assumption 1, we have

\displaystyle\left\|{\bf{g}}_{i}({\bf{x}})\right\|\leq\left\|{\bf J}({\bf{x}})\right\|\leq L_{f}

for all ${\bf{x}}\in{\mathbb{R}}^{d}$ and $i\in[n]$ .

Proof.

For all ${\bf{x}},{\bf{v}}\in{\mathbb{R}}^{d}$ , we have

\displaystyle{\bf J}({\bf{x}}){\bf{v}}=\lim_{h\to 0}\frac{{\bf{f}}({\bf{x}}+h{\bf{v}})-{\bf{f}}({\bf{x}})}{h}.

Taking the Euclidean norm on both sides, we have

\displaystyle\begin{split}
\left\|{\bf J}({\bf{x}}){\bf{v}}\right\|&=\lim_{h\to 0}\frac{\left\|{\bf{f}}({\bf{x}}+h{\bf{v}})-{\bf{f}}({\bf{x}})\right\|}{|h|}\\
&\leq\lim_{h\to 0}\frac{L_{f}\left\|{\bf{x}}+h{\bf{v}}-{\bf{x}}\right\|}{|h|}\\
&=\lim_{h\to 0}\frac{L_{f}|h|\left\|{\bf{v}}\right\|}{|h|}\\
&=L_{f}\left\|{\bf{v}}\right\|,
\end{split}

where the inequality comes from Assumption 1.

Therefore, for all ${\bf{x}}\in{\mathbb{R}}^{d}$ it holds

\displaystyle\left\|{\bf J}({\bf{x}})\right\|=\sup_{{\bf{v}}\in{\mathbb{R}}^{d}\setminus\{{\bf{0}}\}}\frac{\left\|{\bf J}({\bf{x}}){\bf{v}}\right\|}{\left\|{\bf{v}}\right\|}\leq L_{f}.

Let ${\bf{e}}_{i}\in{\mathbb{R}}^{n}$ be the $i$ -th standard basis vector in $n$ -dimensional Euclidean space, then we have

\displaystyle\left\|{\bf{g}}_{i}({\bf{x}})\right\|=\left\|{\bf J}({\bf{x}})^{\top}{\bf{e}}_{i}\right\|\leq\left\|{\bf J}({\bf{x}})\right\|\left\|{\bf{e}}_{i}\right\|=\left\|{\bf J}({\bf{x}})\right\|\leq L_{f}

for all $i\in[n]$ . ∎

Appendix C The Auxiliary Sequence and Its Properties

We construct the following sequence for our convergence analysis in later sections.

Definition 1.

We define the following sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ for given $n\in{\mathbb{N}}^{+}$ and $\nu\in(0,1]$ :

\displaystyle a_{t}(n,\nu)\triangleq\begin{cases}1,~{}~{}~{}~{}&t=0,\\[4pt] \displaystyle{\frac{1}{2(1+\nu)n}\left(\sum_{j=0}^{t-1}(a_{j}(n,\nu))^{1+\nu}+n-t\right)},~{}~{}~{}~{}&1\leq t\leq n,\\[4pt] \displaystyle{\frac{1}{2(1+\nu)n}\sum_{j=t-n}^{t-1}(a_{j}(n,\nu))^{1+\nu}},~{}~{}~{}~{}&t>n.\end{cases}

(23)
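The recursion in Definition 1 can be evaluated directly. The following short Python sketch computes the first terms of $\{a_{t}(n,\nu)\}$ so that its decay can be inspected numerically; the function name `aux_sequence` and the choices $n=5$ , $\nu=0.5$ are illustrative assumptions.

```python
def aux_sequence(n, nu, T):
    """Return the list [a_0, ..., a_T] of the sequence a_t(n, nu) in Definition 1."""
    a = [1.0]                                   # a_0 = 1
    scale = 1.0 / (2.0 * (1.0 + nu) * n)
    for t in range(1, T + 1):
        if t <= n:
            # sum over j = 0, ..., t-1, plus the constant n - t
            total = sum(v ** (1.0 + nu) for v in a[:t]) + (n - t)
        else:
            # sliding window over the last n terms, j = t-n, ..., t-1
            total = sum(v ** (1.0 + nu) for v in a[t - n:t])
        a.append(scale * total)
    return a

n, nu = 5, 0.5
a = aux_sequence(n, nu, T=40)
for t in (0, 1, 5, 10, 20, 40):
    print(t, a[t])
```

Printing a few terms shows the sequence staying below one and shrinking faster with every window of $n$ steps, which is the behavior that the results of this section quantify.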

We then provide several useful properties for the sequence in Definition 1.

Lemma 6.

The sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ satisfies

\displaystyle a_{t}(n,\nu)\leq 1

for all $t\geq 0$ .

Proof.

Part I: We first use induction to prove $a_{t}(n,\nu)\leq 1$ for all $t=0,1,\dots,n$ . For the induction base, we can verify that $a_{0}(n,\nu)=1\leq 1$ . For the induction step, we assume

\displaystyle a_{j}(n,\nu)\leq 1

holds for all $j=0,\dots,t-1$ such that $1\leq t\leq n$ . Then we have

\displaystyle a_{t}(n,\nu)=\frac{1}{2(1+\nu)n}\left(\sum_{j=0}^{t-1}(a_{j}(n,\nu))^{1+\nu}+n-t\right)\leq\frac{1}{2(1+\nu)n}\left(t+n-t\right)=\frac{1}{2(1+\nu)}\leq 1,

where the first inequality is based on the induction hypothesis and the last inequality is based on the setting $\nu\in(0,1]$ . This finishes the induction.

Part II: We then use induction to prove $a_{t}(n,\nu)\leq 1$ for all $t\geq n+1$ . For the induction base, we can verify that

\displaystyle a_{n+1}(n,\nu)=\frac{1}{2(1+\nu)n}\sum_{j=1}^{n}(a_{j}(n,\nu))^{1+\nu}\leq\frac{1}{2(1+\nu)n}\cdot n=\frac{1}{2(1+\nu)}\leq 1,

where the first inequality is based on $a_{t}(n,\nu)\leq 1$ for all $t\leq n$ (which has been shown in Part I), and the last inequality is based on the setting $\nu\in(0,1]$ . For the induction step, we assume

\displaystyle a_{j}(n,\nu)\leq 1

holds for all $j=n+1,\dots,t-1$ such that $t\geq n+2$ . Then we have

\displaystyle a_{t}(n,\nu)=\frac{1}{2(1+\nu)n}\sum_{j=t-n}^{t-1}(a_{j}(n,\nu))^{1+\nu}\leq\frac{1}{2(1+\nu)n}\cdot n=\frac{1}{2(1+\nu)}\leq 1,

where the first inequality is based on the induction hypothesis together with the result of Part I, and the last inequality is based on the setting $\nu\in(0,1]$ . This finishes the induction.

Combining the results of the above two parts, we finish the proof of this lemma. ∎

Lemma 7.

The sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ satisfies

\displaystyle a_{t}(n,\nu)\geq a_{t+1}(n,\nu)

for all $t\geq 0$ .

Proof.

Part I: For $t=0$ , the fact $\nu\in(0,1]$ means

\displaystyle a_{1}(n,\nu)=\frac{1}{2(1+\nu)}\leq 1=a_{0}(n,\nu).

Part II: For all $t=1,\dots,n-1$ , we have

\displaystyle\begin{split}
a_{t+1}(n,\nu)-a_{t}(n,\nu)&=\frac{1}{2(1+\nu)n}\left(\sum_{j=0}^{t}(a_{j}(n,\nu))^{1+\nu}+n-t-1\right)-\frac{1}{2(1+\nu)n}\left(\sum_{j=0}^{t-1}(a_{j}(n,\nu))^{1+\nu}+n-t\right)\\
&=\frac{1}{2(1+\nu)n}\left((a_{t}(n,\nu))^{1+\nu}-1\right)\leq 0,
\end{split}

where the last inequality is based on Lemma 6. This indicates $a_{t+1}\leq a_{t}$ for $t=1,\dots,n-1$ .

Part III: For all $t\geq n$ , we use induction to prove $a_{t+1}(n,\nu)\leq a_{t}(n,\nu)$ . For the induction base, we can verify that

\displaystyle\begin{split}
a_{n+1}(n,\nu)-a_{n}(n,\nu)&=\frac{1}{2(1+\nu)n}\sum_{j=1}^{n}(a_{j}(n,\nu))^{1+\nu}-\frac{1}{2(1+\nu)n}\left(\sum_{j=1}^{n-1}(a_{j}(n,\nu))^{1+\nu}+1\right)\\
&=\frac{1}{2(1+\nu)n}\left((a_{n}(n,\nu))^{1+\nu}-1\right)\\
&\leq 0,
\end{split}

where the last inequality is based on Lemma 6.

For the induction step, we assume

\displaystyle a_{j+1}(n,\nu)\leq a_{j}(n,\nu)

holds for all $j=n,\cdots,t-1$ such that $t\geq n+1$ . Then we have

\displaystyle\begin{split}
a_{t+1}(n,\nu)-a_{t}(n,\nu)&=\frac{1}{2(1+\nu)n}\sum_{j=t-n+1}^{t}(a_{j}(n,\nu))^{1+\nu}-\frac{1}{2(1+\nu)n}\sum_{j=t-n}^{t-1}(a_{j}(n,\nu))^{1+\nu}\\
&=\frac{1}{2(1+\nu)n}\left((a_{t}(n,\nu))^{1+\nu}-(a_{t-n}(n,\nu))^{1+\nu}\right)\leq 0,
\end{split}

where the inequality holds because $a_{t}(n,\nu)\leq a_{t-n}(n,\nu)$ , which follows from the induction hypothesis and the fact $a_{j+1}(n,\nu)\leq a_{j}(n,\nu)$ for all $j\leq n-1$ (which has been shown in Parts I and II).

Combining the results of the above three parts, we finish the proof of this lemma. ∎

Lemma 8.

For the sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ , we have

\displaystyle a_{t}(n,\nu)\leq\frac{1}{2(1+\nu)}(a_{t-n}(n,\nu))^{1+\nu}

for all $t\geq n$ .

Proof.

For all $t\geq n$ , the definition of $a_{t}(n,\nu)$ implies

\displaystyle\begin{split}
a_{t}(n,\nu)&=\frac{1}{2(1+\nu)n}\left(\sum_{j=t-n}^{t-1}(a_{j}(n,\nu))^{1+\nu}\right)\\
&\leq\frac{1}{2(1+\nu)}\max\{(a_{t-n}(n,\nu))^{1+\nu},\cdots,(a_{t-1}(n,\nu))^{1+\nu}\}.
\end{split}

Additionally, Lemma 7 implies for all $t\geq n$ , we have

\displaystyle\max\{(a_{t-n}(n,\nu))^{1+\nu},\cdots,(a_{t-1}(n,\nu))^{1+\nu}\}=(a_{t-n}(n,\nu))^{1+\nu},\quad t\geq n.

Combining above results, we achieve

\displaystyle a_{t}(n,\nu)\leq\frac{1}{2(1+\nu)}(a_{t-n}(n,\nu))^{1+\nu}

for all $t\geq n$ . ∎

Lemma 9.

For the sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ , we have

\displaystyle a_{t+1}(n,\nu)\leq c_{0}a_{t}(n,\nu)

for all $t\geq n$ , where

\displaystyle c_{0}=1-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}\right).

Proof.

For all $t\geq n$ , we have

\displaystyle a_{t}(n,\nu)\leq\frac{1}{2(1+\nu)}(a_{t-n}(n,\nu))^{1+\nu}\leq\frac{1}{2(1+\nu)}a_{t-n}(n,\nu),

(24)

where the first inequality is based on Lemma 8 and the second one is based on Lemma 6.

Then we also have

\displaystyle\small\begin{split}
a_{t+1}(n,\nu)&=\frac{1}{2(1+\nu)n}\left(\sum_{j=t-n+1}^{t}(a_{j}(n,\nu))^{1+\nu}\right)\\
&\leq\frac{1}{2(1+\nu)n}\left(\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}(a_{t-n}(n,\nu))^{1+\nu}+\sum_{j=t-n+1}^{t-1}(a_{j}(n,\nu))^{1+\nu}\right)\\
&=\frac{1}{2(1+\nu)n}\left(\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}(a_{t-n}(n,\nu))^{1+\nu}+\sum_{j=t-n+1}^{t-1}(a_{j}(n,\nu))^{1+\nu}+(a_{t-n}(n,\nu))^{1+\nu}-(a_{t-n}(n,\nu))^{1+\nu}\right)\\
&=a_{t}(n,\nu)+\frac{1}{2(1+\nu)n}\left(\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}(a_{t-n}(n,\nu))^{1+\nu}-(a_{t-n}(n,\nu))^{1+\nu}\right)\\
&=a_{t}(n,\nu)-\frac{1}{2(1+\nu)n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}\right)(a_{t-n}(n,\nu))^{1+\nu}\\
&\leq a_{t}(n,\nu)-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}\right)a_{t}(n,\nu)\\
&=\left(1-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}\right)\right)a_{t}(n,\nu),
\end{split}

for all $t\geq n$ , where the first inequality is based on equation (24) and the last inequality is based on Lemma 8. This finishes the proof. ∎

Lemma 10.

For the sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ , if there exist $c_{1}\in(0,1)$ and $t_{0}\geq 0$ such that

\displaystyle a_{t+1}(n,\nu)\leq c_{1}a_{t}(n,\nu)

(25)

for all $t\geq t_{0}+n$ , then we have

\displaystyle a_{t+1}(n,\nu)\leq c_{1}^{1+\nu}a_{t}(n,\nu)

for all $t\geq t_{0}+2n$ .

Proof.

For all $t\geq t_{0}+2n$ , we have

\displaystyle\begin{split}
a_{t+1}(n,\nu)&=\frac{1}{2(1+\nu)n}\sum_{j=t-n+1}^{t}(a_{j}(n,\nu))^{1+\nu}\\
&\leq\frac{1}{2(1+\nu)n}\sum_{j=t-n}^{t-1}c_{1}^{1+\nu}(a_{j}(n,\nu))^{1+\nu}=c_{1}^{1+\nu}a_{t}(n,\nu),
\end{split}

where the inequality is based on equation (25). ∎

Lemma 11.

For the sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ , we have the superlinear convergence

\displaystyle a_{t+1}(n,\nu)\leq c^{(1+\nu)^{\left(\left\lfloor\frac{t-1}{n}\right\rfloor-1\right)}}a_{t}(n,\nu)

for all $t\geq n$ , where

\displaystyle c=1-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}\right).

Proof.

According to Lemma 9, we have

\displaystyle a_{t+1}(n,\nu)\leq ca_{t}(n,\nu)\quad\text{for all}~{}~{}t\geq n.

According to Lemma 10, we have

	$\displaystyle a_{t+1}(n,\nu)$	$\displaystyle\leq c^{1+\nu}a_{t}(n,\nu)\quad\text{for all}~{}~{}t\geq 2n,$
	$\displaystyle a_{t+1}(n,\nu)$	$\displaystyle\leq c^{(1+\nu)^{2}}a_{t}(n,\nu)\quad\text{for all}~{}~{}t\geq 3n,$
	$\displaystyle a_{t+1}(n,\nu)$	$\displaystyle\leq c^{(1+\nu)^{3}}a_{t}(n,\nu)\quad\text{for all}~{}~{}t\geq 4n,$
	$\displaystyle\cdots$

which implies

\displaystyle a_{t+1}(n,\nu)\leq c^{(1+\nu)^{\left(\lfloor{t}/{n}\rfloor-1\right)}}a_{t}(n,\nu)

for all $t\geq n$ . Since $c\in(0,1)$ and $\lfloor(t-1)/n\rfloor\leq\lfloor t/n\rfloor$ , this bound also implies the one stated in the lemma.

The superlinear convergence of the sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ can be verified by the fact

\displaystyle\lim_{t\to\infty}c^{(1+\nu)^{\left(\lfloor{t}/{n}\rfloor-1\right)}}=0.

Hence, we finish the proof. ∎

Appendix D The Convergence Analysis for IGN

In this section, we provide the proofs for the results in Section 3.

D.1 The Proof of Proposition 1

Proof.

We denote the singular value decomposition of ${\bf J}({\bf{x}}^{*})$ as

\displaystyle{\bf J}({\bf{x}}^{*})={\bf P}{\bf D}{\bf Q}^{\top},

where ${\bf P}\in{\mathbb{R}}^{n\times d},{\bf Q}\in{\mathbb{R}}^{d\times d}$ are (column) orthogonal matrices and ${\bf D}\in{\mathbb{R}}^{d\times d}$ is a diagonal matrix whose smallest diagonal entry is $\mu>0$ . Therefore, we have

\displaystyle{\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*})={\bf Q}{\bf D}^{2}{\bf Q}^{\top},

which means the smallest singular value of ${\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*})$ is equal to the smallest diagonal entry of ${\bf D}^{2}$ , which is $\mu^{2}$ . Therefore, we have

\displaystyle\sigma_{\min}({\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*}))\geq\mu^{2}.

∎

D.2 Proof of Lemma 1

Proof.

The Jacobian satisfies

\displaystyle\begin{split}
\left\|{\bf J}({\bf{y}})^{\top}{\bf J}({\bf{y}})-{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{x}})\right\|&=\left\|{\bf J}({\bf{y}})^{\top}{\bf J}({\bf{y}})-{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{y}})+{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{y}})-{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{x}})\right\|\\
&\leq\left\|\left({\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right)^{\top}{\bf J}({\bf{y}})\right\|+\left\|{\bf J}({\bf{x}})^{\top}\left({\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right)\right\|\\
&\leq\left\|{\bf J}({\bf{y}})\right\|\left\|{\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right\|+\left\|{\bf J}({\bf{x}})\right\|\left\|{\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right\|\\
&\leq 2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu},
\end{split}

where the first inequality comes from the triangle inequality, the second inequality comes from the submultiplicativity of the spectral norm, and the last inequality is based on Lemma 5 and Assumption 2.

For all $i\in[n]$ , the gradient satisfies

\displaystyle\begin{split}
\left\|{\bf{g}}_{i}({\bf{y}}){\bf{g}}_{i}({\bf{y}})^{\top}-{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{x}})^{\top}\right\|&=\left\|{\bf{g}}_{i}({\bf{y}}){\bf{g}}_{i}({\bf{y}})^{\top}-{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{y}})^{\top}+{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{y}})^{\top}-{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{x}})^{\top}\right\|\\
&\leq\left\|\left({\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right){\bf{g}}_{i}({\bf{y}})^{\top}\right\|+\left\|{\bf{g}}_{i}({\bf{x}})\left({\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right)^{\top}\right\|\\
&\leq\left\|{\bf{g}}_{i}({\bf{y}})\right\|\left\|{\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right\|+\left\|{\bf{g}}_{i}({\bf{x}})\right\|\left\|{\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right\|\\
&\leq 2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu},
\end{split}

where the first inequality comes from the triangle inequality, the second inequality comes from the submultiplicativity of the spectral norm, and the last inequality is based on Lemmas 3 and 5. ∎

D.3 Proof of Lemma 2

Proof.

We have

\displaystyle\begin{split}
\left\|{\bf H}^{t}-{\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*})\right\|&=\left\|\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}-\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{x}}^{*}){\bf{g}}_{i}({\bf{x}}^{*})^{\top}\right\|\\
&\leq\sum_{i=1}^{n}\left\|{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}-{\bf{g}}_{i}({\bf{x}}^{*}){\bf{g}}_{i}({\bf{x}}^{*})^{\top}\right\|\\
&\leq\sum_{i=1}^{n}2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu},
\end{split}

where the first inequality comes from the triangle inequality and the second inequality is based on Lemma 1. Thus, we have

\displaystyle{\bf H}^{t}-{\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*})\succeq-\sum_{i=1}^{n}2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}\cdot{\bf I},

which implies that

\displaystyle\sigma_{\min}({\bf H}^{t})\geq\sigma_{\min}({\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*}))-\sum_{i=1}^{n}2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu},

where the last step is based on Proposition 1. ∎

D.4 The Proof of Theorem 1

We first show the update

\displaystyle{\bf G}^{t+1}={\bf G}^{t}-{\bf G}^{t}{\bf U}^{t}({\bf I}+({\bf V}^{t})^{\top}{\bf G}^{t}{\bf U}^{t})^{-1}({\bf V}^{t})^{\top}{\bf G}^{t}

in IGN method (Line 9 of Algorithm 1) is well-defined if the matrices ${\bf H}^{t}$ and ${\bf H}^{t+1}$ are non-singular.

Lemma 12.

Following the setting of Theorem 1, if the matrices ${\bf H}^{t}$ and ${\bf H}^{t+1}$ are non-singular, then the matrix ${\bf I}+{{\bf V}^{t}}^{\top}{\bf G}^{t}{\bf U}^{t}$ is also non-singular, where

\displaystyle{\bf U}^{t}=\begin{bmatrix}-{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})\!&{\bf{g}}_{i_{t}}({\bf{x}}^{t+1})\end{bmatrix}\in{\mathbb{R}}^{d\times 2},~{}~{}{\bf V}^{t}=\begin{bmatrix}{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})\!&{\bf{g}}_{i_{t}}({\bf{x}}^{t+1})\end{bmatrix}\in{\mathbb{R}}^{d\times 2}~{}~{}\text{and}~{}~{}i_{t}={t\%n}+1.

Proof.

The recursion of ${\bf H}^{t}$ and the definition of ${\bf U}^{t}$ and ${\bf V}^{t}$ imply

\displaystyle{\bf H}^{t+1}={\bf H}^{t}-{\bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t}){% \bf{g}}_{i_{t}}({\bf{z}}_{i_{t}}^{t})^{\top}+{\bf{g}}_{i_{t}}({\bf{x}}^{t+1}){% \bf{g}}_{i_{t}}({\bf{x}}^{t+1})^{\top}={\bf H}^{t}+{\bf U}^{t}{{\bf V}^{t}}^{% \top}.

Since we assume the matrices ${\bf H}^{t}$ and ${\bf H}^{t+1}$ are non-singular, applying the matrix determinant lemma [41, Section 9.1.2] to the above equation leads to

\displaystyle\det({\bf H}^{t+1})=\det({\bf H}^{t}+{\bf U}^{t}{{\bf V}^{t}}^{% \top})=\det({\bf I}+{{\bf V}^{t}}^{\top}({\bf H}^{t})^{-1}{\bf U}^{t})\det({% \bf H}^{t}).

Since $\det({\bf H}^{t+1})\neq 0$ and $\det({\bf H}^{t})\neq 0$, the definition ${\bf G}^{t}=({\bf H}^{t})^{-1}$ implies

\displaystyle\det({\bf I}+{{\bf V}^{t}}^{\top}{\bf G}^{t}{\bf U}^{t})=\det({\bf I}+{{\bf V}^{t}}^{\top}({\bf H}^{t})^{-1}{\bf U}^{t})=\frac{\det({\bf H}^{t+1})}{\det({\bf H}^{t})}\neq 0,

which finishes the proof. ∎
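The rank-two structure used in Lemma 12 can be illustrated with a short Python sketch. It is a toy check rather than the authors' implementation: the gradients are random vectors (hypothetical stand-ins for ${\bf g}_{i_t}({\bf z}_{i_t}^{t})$ and ${\bf g}_{i_t}({\bf x}^{t+1})$), and NumPy is an assumed dependency. The sketch forms ${\bf U}^{t}$ and ${\bf V}^{t}$, applies the update of ${\bf G}^{t}$, and compares it with a direct inverse and with the matrix determinant lemma.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 4

# Hypothetical data: row i of G_rows stands in for g_i(z_i^t).
G_rows = rng.standard_normal((n, d))
H = G_rows.T @ G_rows
G = np.linalg.inv(H)

g_old = G_rows[0]                               # g_{i_t}(z_{i_t}^t)
g_new = g_old + 0.1 * rng.standard_normal(d)    # g_{i_t}(x^{t+1})

U = np.column_stack([-g_old, g_new])            # d x 2
V = np.column_stack([g_old, g_new])             # d x 2

# H^{t+1} = H^t + U V^T swaps one rank-one term for another.
H_next = H + U @ V.T
G_rows_next = G_rows.copy()
G_rows_next[0] = g_new
H_replaced = G_rows_next.T @ G_rows_next

# Update of G^t; K is the 2 x 2 matrix I + V^T G U from Lemma 12.
K = np.eye(2) + V.T @ G @ U
G_next = G - G @ U @ np.linalg.solve(K, V.T @ G)

# Matrix determinant lemma: det(H + U V^T) = det(I + V^T H^{-1} U) det(H).
det_lhs = np.linalg.det(H_next)
det_rhs = np.linalg.det(K) * np.linalg.det(H)
```

Only a $2\times 2$ linear system is solved per update, which is the practical reason for writing the change of ${\bf H}^{t}$ as ${\bf U}^{t}{{\bf V}^{t}}^{\top}$.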

Next we show that, when the matrices $\{{\bf H}^{j}\}_{j=0}^{t}$ are non-singular, the distance $\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$ can be upper bounded as follows.

Lemma 13.

Under Assumptions 1 and 2, suppose the matrices $\{{\bf H}^{j}\}_{j=0}^{t}$ are non-singular when running IGN (Algorithm 1). Then it holds that

\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|\leq\frac{L_{f}{% \mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{n}\left\|{\bf{% z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu},

where ${\bf G}^{t}=\left({\bf H}^{t}\right)^{-1}$ .

Proof.

Subtracting the term ${\bf{x}}^{*}$ from both sides of equation (8), we have

	$\displaystyle{\bf{x}}^{t+1}-{\bf{x}}^{*}$	$\displaystyle={\bf G}^{t}\left(\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-\sum_{i=1}^{n}f_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})\right)$
		$\displaystyle={\bf G}^{t}\left(\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-\sum_{i=1}^{n}f_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})+\sum_{i=1}^{n}f_{i}({\bf{x}}^{*}){\bf{g}}_{i}({\bf{z}}_{i}^{t})\right)$
		$\displaystyle={\bf G}^{t}\sum_{i=1}^{n}\left({\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-f_{i}({\bf{z}}_{i}^{t})+f_{i}({\bf{x}}^{*})\right){\bf{g}}_{i}({\bf{z}}_{i}^{t}).$

Here the second equality uses $f_{i}({\bf{x}}^{*})=0$ for all $i\in[n]$. Taking the norm of both sides of the above result, we have

	$\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$	$\displaystyle=\left\|{\bf G}^{t}\sum_{i=1}^{n}\left({\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-f_{i}({\bf{z}}_{i}^{t})+f_{i}({\bf{x}}^{*})\right){\bf{g}}_{i}({\bf{z}}_{i}^{t})\right\|$
		$\displaystyle\leq\left\|{\bf G}^{t}\right\|\left\|\sum_{i=1}^{n}\left({\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-f_{i}({\bf{z}}_{i}^{t})+f_{i}({\bf{x}}^{*})\right){\bf{g}}_{i}({\bf{z}}_{i}^{t})\right\|$
		$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu},$

where the first inequality follows from the submultiplicativity of the matrix norm and the second inequality is based on Lemmas 4 and 5. ∎

We split the results of Theorem 1 into two parts (i.e., Theorems 3 and 4) and provide their proofs as follows. Our analysis is based on the properties of the auxiliary sequence constructed in Section C.

Theorem 3.

Under Assumptions 1, 2 and 3, we run IGN (Algorithm 1) with initialization ${\bf{x}}^{0}\in{\mathbb{R}}^{d}$ and ${\bf H}^{0}={\bf J}({\bf{x}}^{0})^{\top}{\bf J}({\bf{x}}^{0})$ such that

\displaystyle\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|\leq\left(\frac{\mu^{2}}{% 4L_{f}{\mathcal{H}}_{\nu}n}\right)^{{1}/{\nu}},

then it holds

\displaystyle\sigma_{\min}({\bf I}+({\bf V}^{t})^{\top}({\bf H}^{t})^{-1}{\bf U}^{t})>0,\quad{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\quad\text{and}\quad\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq a_{t}(n,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|

for all $t\geq 0$ , where sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ is defined in equation (23).

Proof.

We first show

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq a_{t}(n,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|

(26)

holds for all $t\geq 0$ . We split the proof of results (26) into the following three parts.

Part I: For $t=0$, the initialization and the fact $a_{0}(n,\nu)=1$ lead to

\displaystyle\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|=a_{0}(n,\nu)\left\|{\bf{% x}}^{0}-{\bf{x}}^{*}\right\|.

Part II: For all $t=0,\cdots,n-1$ , we use induction to prove the results of

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|\leq a_{t+1}(n,\nu)\left\|{\bf{x}}^{% 0}-{\bf{x}}^{*}\right\|.

(27)

For the induction base, we can apply Lemma 2 to verify

	$\displaystyle\sigma_{\min}({\bf H}^{0})$	$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{0}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle=\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{n}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}n\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}$
		$\displaystyle=\frac{\mu^{2}}{2}.$

This implies

\displaystyle{\bf H}^{0}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad\left\|{\bf G}^{0}\right\|=\left\|({\bf H}^{0})^{-1}\right\|\leq\frac{2}{\mu^{2}}.

(28)

According to Lemma 13, we have

	$\displaystyle\left\|{\bf{x}}^{1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{0}\right\|\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\cdot\frac{2}{\mu^{2}}\cdot\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle=\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\cdot\frac{2}{\mu^{2}}\cdot n\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{nL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\cdot\frac{2}{\mu^{2}}\cdot\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{1}(n,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,$

where the first inequality is Lemma 13, the second inequality is based on equation (28), the subsequent equality uses ${\bf{z}}_{i}^{0}={\bf{x}}^{0}$ for all $i\in[n]$, and the last inequality is based on the initial condition. Therefore, the induction base holds.

For the induction step, we assume

\displaystyle{\bf H}^{j}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf{x}}^{j+1}-{\bf{x}}^{*}\right\|\leq a_{j+1}(n,\nu)\left\|{\bf{x}}^{% 0}-{\bf{x}}^{*}\right\|

hold for all $j=0,\cdots,t-1$ with $1\leq t\leq n-1$. Therefore, the update (9) means

\displaystyle{\bf{z}}_{i}^{t}=\begin{cases}{\bf{x}}^{i},~{}~{}~{}~{}&1\leq i% \leq t,\\ {\bf{x}}^{0},~{}~{}~{}~{}&t<i\leq n.\end{cases}

(29)

The induction hypothesis leads to

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{\nu}\leq(a_{j}(n,\nu))^% {\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}\leq\left\|{\bf{x}}^{0}-{% \bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n},

for $j=1,\cdots,t$, where the second inequality is based on Lemma 6 and the third comes from the initial condition. Combining with the result of (29), we achieve

\displaystyle\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{% 2}}{4L_{f}{\mathcal{H}}_{\nu}n}.

According to Lemma 2, we have

	$\displaystyle\sigma_{\min}({\bf H}^{t})$	$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}n\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}$
		$\displaystyle=\frac{\mu^{2}}{2},$

where the second inequality comes from the initial condition. Therefore, we have

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf G}^{t}\right\|=\left\|({\bf H}^{t})^{-1}\right\|\leq\frac{2}{\mu^{% 2}}.

According to Lemma 13, we have

	$\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\frac{2}{\mu^{2}}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=1}^{t}\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{1+\nu}+(n-t)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=1}^{t}(a_{j}(n,\nu))^{1+\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}+(n-t)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}\left(\sum_{j=1}^{t}(a_{j}(n,\nu))^{1+\nu}+n-t\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)n}\left(\sum_{j=1}^{t}(a_{j}(n,\nu))^{1+\nu}+n-t\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)n}\left(\sum_{j=0}^{t}(a_{j}(n,\nu))^{1+\nu}+n-t-1\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{t+1}(n,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,$

where the last equality comes from the fact $a_{0}(n,\nu)=1$ . Therefore, we finish the induction.

Part III: For all $t\geq n$ , we use induction to prove

\displaystyle{\bf H}^{t}\succeq(\mu^{2}/2){\bf I}\qquad\text{and}\qquad\left\|% {\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|\leq a_{t+1}(n,\nu)\left\|{\bf{x}}^{0}-{\bf% {x}}^{*}\right\|.

For the induction base, we can verify that it holds (from the result of Part II)

\displaystyle{\bf H}^{j}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{for all~{}~% {}}j=0,\dots,n-1,

and

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|\leq a_{j}(n,\nu)\left\|{% \bf{x}}^{0}-{\bf{x}}^{*}\right\|\qquad\text{for all~{}~{}}j=1,\dots,n.

Then we have

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{\nu}\leq(a_{j}(n,\nu))^% {\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}\leq\left\|{\bf{x}}^{0}-{% \bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n},\quad% \text{for all~{}~{}}j=1,\dots,n,

where the second inequality is based on Lemma 6 and the third inequality is based on the initial condition.

From the update (9), we have

\displaystyle{\bf{z}}_{i}^{n}={\bf{x}}^{i}\qquad\text{for all~{}~{}}i\in[n].

Therefore, we have

\displaystyle\left\|{\bf{z}}_{i}^{n}-{\bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{% 2}}{4L_{f}{\mathcal{H}}_{\nu}n},\qquad\text{for all~{}~{}}i\in[n].

According to Lemma 2, we have

	$\displaystyle\sigma_{\min}({\bf H}^{n})$	$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{n}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}n\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}=\frac{\mu^{2}}{2},$

which implies

\displaystyle{\bf H}^{n}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf G}^{n}\right\|=\left\|({\bf H}^{n})^{-1}\right\|\leq\frac{2}{\mu^{% 2}}.

According to Lemma 13, we have

	$\displaystyle\left\|{\bf{x}}^{n+1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{n}\right\|\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{n}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{n}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=1}^{n}(a_{j}(n,\nu))^{1+\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}\left(\sum_{j=1}^{n}(a_{j}(n,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)n}\left(\sum_{j=1}^{n}(a_{j}(n,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{n+1}(n,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|.$

Hence, we have shown the induction base holds.

For the induction step, we assume

\displaystyle{\bf H}^{j}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf{x}}^{j+1}-{\bf{x}}^{*}\right\|\leq a_{j+1}(n,\nu)\left\|{\bf{x}}^{% 0}-{\bf{x}}^{*}\right\|

hold for all $j=n,\cdots,t-1$ with $t\geq n+1$. Combining this with the results of Parts I and II, we have

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|\leq a_{j}(n,\nu)\left\|{% \bf{x}}^{0}-{\bf{x}}^{*}\right\|\qquad\text{for all}~{}~{}j=0,\dots,t,

which implies

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{\nu}\leq(a_{j}(n,\nu))^% {\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}\leq\left\|{\bf{x}}^{0}-{% \bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n},\qquad% \text{for all}~{}~{}j=1,\dots,t,

where the second inequality is based on Lemma 6 and the last inequality is based on the initial condition.

The update (9) means the points $\{{\bf{z}}_{i}^{t}\}_{i=1}^{n}$ can be written as $\{{\bf{x}}^{t+1-n},\cdots,{\bf{x}}^{t}\}$ , which implies

\displaystyle\max\{\left\|{\bf{z}}_{1}^{t}-{\bf{x}}^{*}\right\|,\cdots,\left\|% {\bf{z}}_{n}^{t}-{\bf{x}}^{*}\right\|\}=\max\{\left\|{\bf{x}}^{t+1-n}-{\bf{x}}% ^{*}\right\|,\cdots,\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\}.

Therefore, we have

\displaystyle\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}\qquad\text{for all}~{}~{}i=1,\dots,n.

Combining with Lemma 2, we have

	$\displaystyle\sigma_{\min}({\bf H}^{t})$	$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2L_{f}{\mathcal{H}}_{\nu}n\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}=\frac{\mu^{2}}{2}.$

Therefore, we achieve

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad\left\|{\bf G}^{t}\right\|=\left\|({\bf H}^{t})^{-1}\right\|\leq\frac{2}{\mu^{2}}.

According to Lemma 13, we have

	$\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\sum_{i=1}^{n}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=t-n+1}^{t}(a_{j}(n,\nu))^{1+\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle=\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=t-n+1}^{t}(a_{j}(n,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2L_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\frac{\mu^{2}}{4L_{f}{\mathcal{H}}_{\nu}n}\left(\sum_{j=t-n+1}^{t}(a_{j}(n,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)n}\left(\sum_{j=t-n+1}^{t}(a_{j}(n,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{t+1}(n,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|.$

Hence, we finish the induction.

Combining results of Part I, II and III completes the proof of (26).

Since the non-singularity of ${\bf H}^{t}$ and ${\bf H}^{t+1}$ has been verified by result (26), we can apply Lemma 12 to achieve

\displaystyle\sigma_{\min}({\bf I}+({\bf V}^{t})^{\top}({\bf H}^{t})^{-1}{\bf U% }^{t})>0.

∎
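The recursion for $a_{t}(n,\nu)$ that the three parts above rely on can be tabulated numerically. The Python sketch below is illustrative only: `aux_sequence` is a hypothetical helper, and the recursion is read off the last lines of Parts II and III (with $a_{0}=1$), on the assumption that this matches equation (23), which is defined outside this section.

```python
def aux_sequence(n, nu, T):
    """Return [a_0, ..., a_T] for the recursion used in Parts II and III.

    Assumed form (inferred from the proof, not copied from equation (23)):
      a_0 = 1,
      a_{t+1} = (sum_{j=0}^{t} a_j^{1+nu} + n - t - 1) / (2 (1+nu) n)   for t < n,
      a_{t+1} = (sum_{j=t-n+1}^{t} a_j^{1+nu}) / (2 (1+nu) n)           for t >= n.
    """
    p = 1.0 + nu
    scale = 1.0 / (2.0 * p * n)
    a = [1.0]
    for t in range(T):
        if t < n:
            # Components not yet visited still sit at x^0, contributing a_0^p = 1 each.
            total = sum(x ** p for x in a[: t + 1]) + (n - t - 1)
        else:
            # After one full pass, only the last n iterates enter the sum.
            total = sum(x ** p for x in a[t - n + 1 : t + 1])
        a.append(scale * total)
    return a


n, nu = 5, 1.0
a = aux_sequence(n, nu, 40)
```

Under this assumed recursion the first step gives $a_{1}=1/(2(1+\nu))$, the sequence is non-increasing, and each full pass over the $n$ components raises the previous value to the power $1+\nu$ up to the factor $1/(2(1+\nu))$.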

Theorem 4.

We define the sequence $\{r_{t}\}_{t\geq 0}$ such that

\displaystyle r_{t}\triangleq\begin{cases}\max\{\left\|{\bf{x}}^{0}-{\bf{x}}^{% *}\right\|,1\},~{}~{}~{}~{}&t=0,\\[5.69046pt] a_{t}(n,\nu)r_{0},~{}~{}~{}~{}&t\geq 1,\\ \end{cases}

where the sequence $\{a_{t}(n,\nu)\}_{t\geq 0}$ is defined by equation (23). Under Assumptions 1, 2 and 3, running IGN (Algorithm 1) with the initial condition of Theorem 3, we have

\displaystyle\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq r_{t}\qquad\text{and% }\qquad r_{t+1}\leq c^{(1+\nu)^{\left(\left\lfloor\frac{t}{n}\right\rfloor-1% \right)}}r_{t}

(30)

for all $t\geq n$ , where

\displaystyle c=1-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}% \right).

Proof.

The definition of $\{r_{t}\}_{t\geq 0}$ leads to

\displaystyle r_{0}=\max\{\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,1\}\geq% \left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|.

According to Theorem 3, we have

\displaystyle\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq a_{t}(n,\nu)\left\|{% \bf{x}}^{0}-{\bf{x}}^{*}\right\|\leq a_{t}(n,\nu)r_{0}=r_{t}.

According to Lemma 11, we have

\displaystyle a_{t+1}(n,\nu)\leq c^{(1+\nu)^{\left(\left\lfloor\frac{t}{n}% \right\rfloor-1\right)}}a_{t}(n,\nu)\qquad\text{for all}~{}~{}t\geq n.

Thus, we achieve

\displaystyle r_{t+1}=a_{t+1}(n,\nu)r_{0}\leq c^{(1+\nu)^{\left(\left\lfloor% \frac{t}{n}\right\rfloor-1\right)}}a_{t}(n,\nu)r_{0}=c^{(1+\nu)^{\left(\left% \lfloor\frac{t}{n}\right\rfloor-1\right)}}r_{t}\qquad\text{for all}~{}~{}t\geq n,

where

\displaystyle c=1-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}% \right).

∎

Combining the results of Theorems 3 and 4, we finish the proof of Theorem 1.

D.5 Proof of Corollary 1

Proof.

According to Theorem 1, we have

\displaystyle r_{t+1}\leq c^{(1+\nu)^{\left(\left\lfloor\frac{t}{n}\right\rfloor-1\right)}}r_{t}\qquad\text{with}\qquad c=1-\frac{1}{n}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}\right)

for all $\nu\in(0,1]$. Since $c$ is monotonically decreasing in $\nu$, taking the limit $\nu\to 0$ and the value $\nu=1$ gives

\displaystyle 1-\frac{1}{2n}>c\geq 1-\frac{15}{16n},

which implies

\displaystyle r_{t+1}\leq\Big{(}1-\frac{1}{2n}\Big{)}^{(1+\nu)^{(\left\lfloor t% /n\right\rfloor-1)}}r_{t}

for all $t\geq n$ .

∎
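The two-sided bound on $c$ used in this proof can be checked by direct evaluation. The Python sketch below is a small illustration; the function name `contraction_base` and the sampled values of $n$ and $\nu$ are hypothetical choices made for this example.

```python
def contraction_base(n, nu):
    """Evaluate c = 1 - (1/n) * (1 - (1 / (2 (1 + nu)))^(1 + nu))."""
    p = 1.0 + nu
    return 1.0 - (1.0 - (1.0 / (2.0 * p)) ** p) / n


nus = [0.01, 0.25, 0.5, 0.75, 1.0]
table = {n: [contraction_base(n, nu) for nu in nus] for n in (1, 10, 1000)}
```

At $\nu=1$ the inner term equals $(1/4)^{2}=1/16$, which gives the lower end $1-\frac{15}{16n}$; as $\nu\to 0$ it approaches $1/2$, which gives the open upper end $1-\frac{1}{2n}$.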

D.6 Proof of Corollary 2

Proof.

According to the definition of $\{r_{t}\}_{t\geq 0}$ and Theorem 4, we have

\displaystyle r_{0}=\max\{\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,1\}\geq 1.

Combining with Lemma 8, we have

	$\displaystyle r_{t}=$	$\displaystyle a_{t}(n,\nu)r_{0}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{2(1+\nu)}(a_{t-n}(n,\nu))^{1+\nu}r_{0}$
	$\displaystyle=$	$\displaystyle\frac{1}{2(1+\nu)r_{0}^{\nu}}(a_{t-n}(n,\nu))^{1+\nu}r_{0}^{1+\nu}$
	$\displaystyle=$	$\displaystyle\frac{1}{2(1+\nu)r_{0}^{\nu}}r_{t-n}^{1+\nu}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{2(1+\nu)}r_{t-n}^{1+\nu}$

for all $t\geq n$ . This leads to

\displaystyle r_{t}\leq\frac{1}{4}r_{t-n}^{2}

in the case of $\nu=1$ . ∎
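To make the epoch-wise recursion concrete, the following Python sketch iterates the upper bound $r_{t}\leq\frac{1}{2(1+\nu)}r_{t-n}^{1+\nu}$ once per pass over the $n$ components. It is a worked arithmetic example only: `epoch_bounds` is a hypothetical helper, and the starting value $r_{0}=1$ is an assumed choice (it corresponds to $\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|\leq 1$ in the definition of $r_{0}$).

```python
def epoch_bounds(r0, nu, epochs):
    """Iterate R_{e+1} = R_e^(1+nu) / (2 (1+nu)), starting from R_0 = r0."""
    p = 1.0 + nu
    bounds = [r0]
    for _ in range(epochs):
        bounds.append(bounds[-1] ** p / (2.0 * p))
    return bounds


# nu = 1 is the Lipschitz-Jacobian case, where the bound reads r_t <= r_{t-n}^2 / 4.
quadratic = epoch_bounds(1.0, 1.0, 3)
```

With $\nu=1$ and $r_{0}=1$ the successive bounds are $1$, $1/4$, $1/64$ and $1/16384$, so the exponent of the bound roughly doubles with every pass.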

Appendix E The Convergence Analysis for MB-IGN

In this section, we analyze the convergence of MB-IGN (Algorithm 2). Most of the proofs in this section follow the analysis in Section D; we provide the details for completeness.

E.1 The Additional Lemma for Gram Matrix

We bound the spectrum of the matrix ${\bf H}^{t}$ for the MB-IGN method as follows.

Lemma 14.

Under Assumptions 1, 2 and 3, running MB-IGN (Algorithm 2) with batch size $k$, ${\bf H}^{0}={\bf J}({\bf{x}}^{0})^{\top}{\bf J}({\bf{x}}^{0})$ and ${\bf G}^{0}=({\bf H}^{0})^{-1}$ guarantees that

\displaystyle\sigma_{\min}({\bf H}^{t})\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}% \sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}

for all $t\geq 0$ , where $m=\lceil n/k\rceil$ .

Proof.

We have

	$\displaystyle\left\|{\bf H}^{t}-{\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*})\right\|$	$\displaystyle=\left\|\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}{\bf{g}}_{j}({\bf{z}}_{i}^{t}){\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}-\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}{\bf{g}}_{j}({\bf{x}}^{*}){\bf{g}}_{j}({\bf{x}}^{*})^{\top}\right\|$
		$\displaystyle\leq\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}\left\|{\bf{g}}_{j}({\bf{z}}_{i}^{t}){\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}-{\bf{g}}_{j}({\bf{x}}^{*}){\bf{g}}_{j}({\bf{x}}^{*})^{\top}\right\|$
		$\displaystyle\leq\sum_{i=1}^{m}2|{\mathcal{S}}_{i}|L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\leq\sum_{i=1}^{m}2kL_{f}{\mathcal{H}}_{\nu}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu},$

where the first inequality comes from the triangle inequality, the second inequality is based on Lemma 1, and the last inequality uses $|{\mathcal{S}}_{i}|\leq k$ for all $i\in[m]$. Thus, we have

\displaystyle{\bf H}^{t}-{\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*})% \succeq-\sum_{i=1}^{m}2kL_{f}{\mathcal{H}}_{\nu}\left\|{\bf{z}}_{i}^{t}-{\bf{x% }}^{*}\right\|^{\nu}\cdot{\bf I},

which implies that

\displaystyle\sigma_{\min}({\bf H}^{t})\geq\sigma_{\min}({\bf J}({\bf{x}}^{*})% ^{\top}{\bf J}({\bf{x}}^{*}))-\sum_{i=1}^{m}2kL_{f}{\mathcal{H}}_{\nu}\left\|{% \bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}=\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}% \sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu},

where the last step is based on Proposition 1. ∎

E.2 Proof of Theorem 2

Similarly, we then show the update

\displaystyle{\bf G}^{t+1}={\bf G}^{t}-{\bf G}^{t}{\bf U}^{t}({\bf I}+({\bf V}% ^{t})^{\top}{\bf G}^{t}{\bf U}^{t})^{-1}({\bf V}^{t})^{\top}{\bf G}^{t}

in MB-IGN method (Line 10 of Algorithm 2) is well-defined if the matrices ${\bf H}^{t}$ and ${\bf H}^{t+1}$ are non-singular.

Lemma 15.

Following the setting of Theorem 2, if the matrices ${\bf H}^{t}$ and ${\bf H}^{t+1}$ are non-singular, then the matrix ${\bf I}+{{\bf V}^{t}}^{\top}{\bf G}^{t}{\bf U}^{t}$ is also non-singular, where

\displaystyle\begin{cases}{\bf U}^{t}=\Big{[}-{\bf{g}}_{j_{1}}({\bf{z}}_{i_{t}% }^{t}),~{}~{}{\bf{g}}_{j_{1}}({\bf{x}}^{t+1}),~{}\cdots~{},~{}-{\bf{g}}_{j_{|{% \mathcal{S}}_{i_{t}}|}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{|{\mathcal{S}% }_{i_{t}}|}}({\bf{x}}^{t+1})\Big{]},\\[5.69046pt] {\bf V}^{t}=\Big{[}{\bf{g}}_{j_{1}}({\bf{z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{1% }}({\bf{x}}^{t+1}),~{}\cdots~{},~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf% {z}}_{i_{t}}^{t}),~{}~{}{\bf{g}}_{j_{|{\mathcal{S}}_{i_{t}}|}}({\bf{x}}^{t+1})% \Big{]},\quad i_{t}={t\%m}+1,\end{cases}

Proof.

The recursion of ${\bf H}^{t}$ and the definition of ${\bf U}^{t}$ and ${\bf V}^{t}$ imply

\displaystyle{\bf H}^{t+1}={\bf H}^{t}-\sum_{j\in{\mathcal{S}}_{i_{t}}}{\bf{g}% }_{j}({\bf{z}}_{i_{t}}^{t}){\bf{g}}_{j}({\bf{z}}_{i_{t}}^{t})^{\top}+\sum_{j% \in{\mathcal{S}}_{i_{t}}}{\bf{g}}_{j}({\bf{x}}^{t+1}){\bf{g}}_{j}({\bf{x}}^{t+% 1})^{\top}={\bf H}^{t}+{\bf U}^{t}{{\bf V}^{t}}^{\top}.

Since we assume the matrices ${\bf H}^{t}$ and ${\bf H}^{t+1}$ are non-singular, applying the matrix determinant lemma [41, Section 9.1.2] to the above equation leads to

\displaystyle\det({\bf H}^{t+1})=\det({\bf H}^{t}+{\bf U}^{t}{{\bf V}^{t}}^{% \top})=\det({\bf I}+{{\bf V}^{t}}^{\top}({\bf H}^{t})^{-1}{\bf U}^{t})\det({% \bf H}^{t}).

Since $\det({\bf H}^{t+1})\neq 0$ and $\det({\bf H}^{t})\neq 0$, the definition ${\bf G}^{t}=({\bf H}^{t})^{-1}$ implies

\displaystyle\det({\bf I}+{{\bf V}^{t}}^{\top}{\bf G}^{t}{\bf U}^{t})=\det({\bf I}+{{\bf V}^{t}}^{\top}({\bf H}^{t})^{-1}{\bf U}^{t})=\frac{\det({\bf H}^{t+1})}{\det({\bf H}^{t})}\neq 0,

which finishes the proof. ∎
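For the mini-batch case the same identity holds with $2|{\mathcal{S}}_{i_t}|$ columns in ${\bf U}^{t}$ and ${\bf V}^{t}$. The Python sketch below is a toy check under stated assumptions: the gradients are random vectors, the index set `batch` is a hypothetical choice of ${\mathcal{S}}_{i_t}$, and NumPy is an assumed dependency. Columns are interleaved in the order given in Lemma 15.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 12, 4, 3

# Hypothetical data: row j of G_rows stands in for g_j evaluated at its stored point.
G_rows = rng.standard_normal((n, d))
H = G_rows.T @ G_rows
G = np.linalg.inv(H)

batch = np.arange(k)                                # hypothetical index set S_{i_t}
old = G_rows[batch]                                 # g_j(z_{i_t}^t), j in the batch
new = old + 0.1 * rng.standard_normal((k, d))       # g_j(x^{t+1}),  j in the batch

# Interleave columns: [-g_{j1}(z), g_{j1}(x), ..., -g_{jk}(z), g_{jk}(x)].
U = np.empty((d, 2 * k))
V = np.empty((d, 2 * k))
U[:, 0::2] = -old.T
U[:, 1::2] = new.T
V[:, 0::2] = old.T
V[:, 1::2] = new.T

# Update of G^t with the (2k) x (2k) matrix I + V^T G U.
K = np.eye(2 * k) + V.T @ G @ U
G_next = G - G @ U @ np.linalg.solve(K, V.T @ G)

# Reference: rebuild H^{t+1} from scratch after replacing the batch gradients.
G_rows_next = G_rows.copy()
G_rows_next[batch] = new
H_next = G_rows_next.T @ G_rows_next
```

The inner system has size $2k\times 2k$, so in this sketch the cost of one update grows with the batch size rather than with $n$.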

Next we show that, when the matrices $\{{\bf H}^{j}\}_{j=0}^{t}$ are non-singular, the distance $\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$ can be upper bounded as follows.

Lemma 16.

Under Assumptions 1 and 2, suppose the matrices $\{{\bf H}^{j}\}_{j=0}^{t}$ are non-singular when running MB-IGN (Algorithm 2) with batch size $k$. Then it holds that

\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|\leq\frac{kL_{f}{% \mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{m}\left\|{\bf{% z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu},

where ${\bf G}^{t}=\left({\bf H}^{t}\right)^{-1}$ and $m=\lceil n/k\rceil$ .

Proof.

Subtracting the term ${\bf{x}}^{*}$ from both sides of equation (8), we have

	$\displaystyle{\bf{x}}^{t+1}-{\bf{x}}^{*}$	$\displaystyle=\left(\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}{\bf{g}}_{j}({\bf{z}}_{i}^{t}){\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}\right)^{-1}\left(\sum_{i=1}^{m}\left(\sum_{j\in{\mathcal{S}}_{i}}{\bf{g}}_{j}({\bf{z}}_{i}^{t}){\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}\right)({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}f_{j}({\bf{z}}_{i}^{t}){\bf{g}}_{j}({\bf{z}}_{i}^{t})\right)$
		$\displaystyle={\bf G}^{t}\left(\sum_{i=1}^{m}\left(\sum_{j\in{\mathcal{S}}_{i}}{\bf{g}}_{j}({\bf{z}}_{i}^{t}){\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}\right)({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}f_{j}({\bf{z}}_{i}^{t}){\bf{g}}_{j}({\bf{z}}_{i}^{t})+\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}f_{j}({\bf{x}}^{*}){\bf{g}}_{j}({\bf{z}}_{i}^{t})\right)$
		$\displaystyle={\bf G}^{t}\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}{\bf{g}}_{j}({\bf{z}}_{i}^{t})\left({\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-f_{j}({\bf{z}}_{i}^{t})+f_{j}({\bf{x}}^{*})\right).$

Here the second equality uses $f_{j}({\bf{x}}^{*})=0$ for all $j\in[n]$. Taking the norm of both sides of the above result, we have

	$\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$	$\displaystyle=\left\|{\bf G}^{t}\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}{\bf{g}}_{j}({\bf{z}}_{i}^{t})\left({\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-f_{j}({\bf{z}}_{i}^{t})+f_{j}({\bf{x}}^{*})\right)\right\|$
		$\displaystyle\leq\left\|{\bf G}^{t}\right\|\left\|\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}\left({\bf{g}}_{j}({\bf{z}}_{i}^{t})^{\top}({\bf{z}}_{i}^{t}-{\bf{x}}^{*})-f_{j}({\bf{z}}_{i}^{t})+f_{j}({\bf{x}}^{*})\right){\bf{g}}_{j}({\bf{z}}_{i}^{t})\right\|$
		$\displaystyle\leq\frac{L_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{m}\sum_{j\in{\mathcal{S}}_{i}}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu},$

where the first inequality follows from the submultiplicativity of the matrix norm, the second inequality is based on Lemmas 4 and 5, and the last inequality is based on $|{\mathcal{S}}_{i}|\leq k$ for all $i\in[m]$. ∎

We split the results of Theorem 2 into two parts (i.e., Theorems 5 and 6) and provide their proofs as follows. Our analysis is based on the properties of the auxiliary sequence constructed in Section C.

Theorem 5.

Under Assumptions 1, 2 and 3, we run MB-IGN (Algorithm 2) with batch size $k$, initialization ${\bf{x}}^{0}\in{\mathbb{R}}^{d}$ and ${\bf H}^{0}={\bf J}({\bf{x}}^{0})^{\top}{\bf J}({\bf{x}}^{0})$ such that

\displaystyle\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|\leq\left(\frac{\mu^{2}}{% 4kL_{f}{\mathcal{H}}_{\nu}m}\right)^{{1}/{\nu}},

where $m=\lceil n/k\rceil$ , then it holds

\displaystyle\sigma_{\min}({\bf I}+({\bf V}^{t})^{\top}({\bf H}^{t})^{-1}{\bf U}^{t})>0,\quad{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\quad\text{and}\quad\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq a_{t}(m,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|

for all $t\geq 0$ , where the sequence $\{a_{t}(m,\nu)\}_{t\geq 0}$ is defined in equation (23).

Proof.

We first show

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq a_{t}(m,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|

(31)

holds for all $t\geq 0$ . We split the proof of results (31) into the following three parts.

Part I: For $t=0$, the initialization and the fact $a_{0}(m,\nu)=1$ lead to

\displaystyle\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|=a_{0}(m,\nu)\left\|{\bf{% x}}^{0}-{\bf{x}}^{*}\right\|.

Part II: For all $t=0,\cdots,m-1$ , we use induction to prove the results of

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|\leq a_{t+1}(m,\nu)\left\|{\bf{x}}^{% 0}-{\bf{x}}^{*}\right\|.

(32)

For the induction base, we can apply Lemma 14 to verify

	$\displaystyle\sigma_{\min}({\bf H}^{0})$	$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{0}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle=\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{m}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}m\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}$
		$\displaystyle=\frac{\mu^{2}}{2}.$

This implies

\displaystyle{\bf H}^{0}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad\left\|{\bf G}^{0}\right\|=\left\|({\bf H}^{0})^{-1}\right\|\leq\frac{2}{\mu^{2}}.

(33)

According to Lemma 16, we have

	$\displaystyle\left\|{\bf{x}}^{1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{0}\right\|\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\cdot\frac{2}{\mu^{2}}\cdot\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle=\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\cdot\frac{2}{\mu^{2}}\cdot m\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{kmL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\cdot\frac{2}{\mu^{2}}\cdot\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{1}(m,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,$

where the first inequality is Lemma 16, the second inequality is based on equation (33), the subsequent equality uses ${\bf{z}}_{i}^{0}={\bf{x}}^{0}$ for all $i\in[m]$, and the last inequality is based on the initial condition. Therefore, the induction base holds.

For the induction step, we assume

\displaystyle{\bf H}^{j}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf{x}}^{j+1}-{\bf{x}}^{*}\right\|\leq a_{j+1}(m,\nu)\left\|{\bf{x}}^{% 0}-{\bf{x}}^{*}\right\|

hold for all $j=0,\cdots,t-1$ with $1\leq t\leq m-1$. Therefore, the update (9) means

\displaystyle{\bf{z}}_{i}^{t}=\begin{cases}{\bf{x}}^{i},~{}~{}~{}~{}&1\leq i% \leq t,\\ {\bf{x}}^{0},~{}~{}~{}~{}&t<i\leq m.\end{cases}

(34)

The induction hypothesis leads to

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{\nu}\leq(a_{j}(m,\nu))^% {\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}\leq\left\|{\bf{x}}^{0}-{% \bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m},

for $j=1,\cdots,t$, where the second inequality is based on Lemma 6 and the third comes from the initial condition. Combining with the result of (34), we achieve

\displaystyle\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{% 2}}{4kL_{f}{\mathcal{H}}_{\nu}m}.

According to Lemma 14, we have

	$\displaystyle\sigma_{\min}({\bf H}^{t})$	$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}m\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}$
		$\displaystyle=\frac{\mu^{2}}{2},$

where the second inequality comes from the initial condition. Therefore, we have

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf G}^{t}\right\|=\left\|({\bf H}^{t})^{-1}\right\|\leq\frac{2}{\mu^{% 2}}.

According to Lemma 16, we have

	$\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\frac{2}{\mu^{2}}\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=1}^{t}\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{1+\nu}+(m-t)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=1}^{t}(a_{j}(m,\nu))^{1+\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}+(m-t)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}\left(\sum_{j=1}^{t}(a_{j}(m,\nu))^{1+\nu}+m-t\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)m}\left(\sum_{j=1}^{t}(a_{j}(m,\nu))^{1+\nu}+m-t\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)m}\left(\sum_{j=0}^{t}(a_{j}(m,\nu))^{1+\nu}+m-t-1\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{t+1}(m,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,$

where the last equality comes from the fact $a_{0}(m,\nu)=1$ . Therefore, we finish the induction.

Part III: For all $t\geq m$ , we use induction to prove

\displaystyle{\bf H}^{t}\succeq(\mu^{2}/2){\bf I}\qquad\text{and}\qquad\left\|% {\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|\leq a_{t+1}(m,\nu)\left\|{\bf{x}}^{0}-{\bf% {x}}^{*}\right\|.

For the induction base, we can verify that it holds (from the result of Part II)

\displaystyle{\bf H}^{j}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{for all~{}~% {}}j=0,\dots,m-1,

and

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|\leq a_{j}(m,\nu)\left\|{% \bf{x}}^{0}-{\bf{x}}^{*}\right\|\qquad\text{for all~{}~{}}j=1,\dots,m.

Then we have

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{\nu}\leq(a_{j}(m,\nu))^% {\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}\leq\left\|{\bf{x}}^{0}-{% \bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m},\quad% \text{for all~{}~{}}j=1,\dots,m,

where the second inequality is based on Lemma 6 and the third inequality is based on the initial condition.

From Eq. 9, we have

\displaystyle{\bf{z}}_{i}^{m}={\bf{x}}^{i}\qquad\text{for all~{}~{}}i\in[m].

Therefore, we have

\displaystyle\left\|{\bf{z}}_{i}^{m}-{\bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{% 2}}{4kL_{f}{\mathcal{H}}_{\nu}m},\qquad\text{for all~{}~{}}i\in[m].

According to Lemma 14, we have

	$\displaystyle\sigma_{\min}({\bf H}^{m})$	$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{m}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}m\cdot\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}=\frac{\mu^{2}}{2},$

which implies

\displaystyle{\bf H}^{m}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf G}^{m}\right\|=\left\|({\bf H}^{m})^{-1}\right\|\leq\frac{2}{\mu^{% 2}}.

According to Lemma 16, we have

	$\displaystyle\left\|{\bf{x}}^{m+1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{m}\right\|\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{m}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{m}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=1}^{m}(a_{j}(m,\nu))^{1+\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\cdot\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}\left(\sum_{j=1}^{m}(a_{j}(m,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)m}\left(\sum_{j=1}^{m}(a_{j}(m,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{m+1}(m,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|.$

Hence, we have shown the induction base holds.

For the induction step, we assume

\displaystyle{\bf H}^{j}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad% \left\|{\bf{x}}^{j+1}-{\bf{x}}^{*}\right\|\leq a_{j+1}(m,\nu)\left\|{\bf{x}}^{% 0}-{\bf{x}}^{*}\right\|

holds for all $j=m+1,\cdots,t-1$ with $t\geq m+2$. Combining this hypothesis with the induction base and the results of Parts I and II, we have

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|\leq a_{j}(m,\nu)\left\|{% \bf{x}}^{0}-{\bf{x}}^{*}\right\|\qquad\text{for all}~{}~{}j=0,\dots,t,

which implies

\displaystyle\left\|{\bf{x}}^{j}-{\bf{x}}^{*}\right\|^{\nu}\leq(a_{j}(m,\nu))^% {\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{\nu}\leq\left\|{\bf{x}}^{0}-{% \bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m},% \qquad\text{for all}~{}~{}j=1,\dots,t,

where the second inequality is based on Lemma 6 and the last inequality is based on the initial condition.

The update (17) means the points $\{{\bf{z}}_{i}^{t}\}_{i=1}^{m}$ can be written as $\{{\bf{x}}^{t+1-m},\cdots,{\bf{x}}^{t}\}$ , which implies

\displaystyle\max\{\left\|{\bf{z}}_{1}^{t}-{\bf{x}}^{*}\right\|,\cdots,\left\|% {\bf{z}}_{m}^{t}-{\bf{x}}^{*}\right\|\}=\max\{\left\|{\bf{x}}^{t+1-m}-{\bf{x}}% ^{*}\right\|,\cdots,\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\}.

Therefore, we have

\displaystyle\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}\leq\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}\qquad\text{for all}~~i=1,\dots,m.

Combining with Lemma 14, we have

	$\displaystyle\sigma_{\min}({\bf H}^{t})$	$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu}$
		$\displaystyle\geq\mu^{2}-2kL_{f}{\mathcal{H}}_{\nu}m\cdot\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}$
		$\displaystyle=\mu^{2}-\frac{\mu^{2}}{2}=\frac{\mu^{2}}{2}.$

Therefore, we achieve

\displaystyle{\bf H}^{t}\succeq\frac{\mu^{2}}{2}{\bf I}\qquad\text{and}\qquad\left\|{\bf G}^{t}\right\|=\left\|({\bf H}^{t})^{-1}\right\|\leq\frac{2}{\mu^{2}}.

According to Lemma 16, we have

	$\displaystyle\left\|{\bf{x}}^{t+1}-{\bf{x}}^{*}\right\|$	$\displaystyle\leq\frac{kL_{f}{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf G}^{t}\right\|\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\sum_{i=1}^{m}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=t-m+1}^{t}(a_{j}(m,\nu))^{1+\nu}\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}\right)$
		$\displaystyle=\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\left(\sum_{j=t-m+1}^{t}(a_{j}(m,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|^{1+\nu}$
		$\displaystyle\leq\frac{2kL_{f}{\mathcal{H}}_{\nu}}{(1+\nu)\mu^{2}}\cdot\frac{\mu^{2}}{4kL_{f}{\mathcal{H}}_{\nu}m}\left(\sum_{j=t-m+1}^{t}(a_{j}(m,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=\frac{1}{2(1+\nu)m}\left(\sum_{j=t-m+1}^{t}(a_{j}(m,\nu))^{1+\nu}\right)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|$
		$\displaystyle=a_{t+1}(m,\nu)\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|.$

Hence, we finish the induction.

Combining results of Part I, II and III completes the proof of (31).

Since the non-singularity of ${\bf H}^{t}$ and ${\bf H}^{t+1}$ has been verified by result (31), we can apply Lemma 15 to achieve

\displaystyle\sigma_{\min}({\bf I}+({\bf V}^{t})^{\top}({\bf H}^{t})^{-1}{\bf U% }^{t})>0.

∎

Theorem 6.

We define the sequence $\{r_{t}\}_{t\geq 0}$ such that

\displaystyle r_{t}\triangleq\begin{cases}\max\{\left\|{\bf{x}}^{0}-{\bf{x}}^{% *}\right\|,1\},~{}~{}~{}~{}&t=0,\\[5.69046pt] a_{t}(m,\nu)r_{0},~{}~{}~{}~{}&t\geq 1,\\ \end{cases}

where the sequence $\{a_{t}(m,\nu)\}_{t\geq 0}$ is defined by equation (23). Under Assumptions 1, 2 and 3, running MB-IGN (Algorithm 2) with the initial condition stated in Theorem 5, we have

\displaystyle\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq r_{t}\qquad\text{and% }\qquad r_{t+1}\leq c^{(1+\nu)^{\left(\left\lfloor\frac{t}{m}\right\rfloor-1% \right)}}r_{t}

(35)

for all $t\geq m$ , where

\displaystyle c=1-\frac{1}{m}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}% \right).

Proof.

The definition of $\{r_{t}\}_{t\geq 0}$ leads to

\displaystyle r_{0}=\max\{\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,1\}\geq% \left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|.

According to Theorem 5, we have

\displaystyle\left\|{\bf{x}}^{t}-{\bf{x}}^{*}\right\|\leq a_{t}(m,\nu)\left\|{% \bf{x}}^{0}-{\bf{x}}^{*}\right\|\leq a_{t}(m,\nu)r_{0}=r_{t}.

According to Lemma 11, we have

\displaystyle a_{t+1}(m,\nu)\leq c^{(1+\nu)^{\left(\left\lfloor\frac{t}{m}% \right\rfloor-1\right)}}a_{t}(m,\nu)\qquad\text{for all}~{}~{}t\geq m.

Thus, we obtain

\displaystyle r_{t+1}=a_{t+1}(m,\nu)r_{0}\leq c^{(1+\nu)^{\left(\left\lfloor% \frac{t}{m}\right\rfloor-1\right)}}a_{t}(m,\nu)r_{0}=c^{(1+\nu)^{\left(\left% \lfloor\frac{t}{m}\right\rfloor-1\right)}}r_{t}\qquad\text{for all}~{}~{}t\geq m,

where

\displaystyle c=1-\frac{1}{m}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}% \right).

∎

Combining the results of Theorem 5 and 6, we finish the proof of Theorem 2.
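The rate in (35) can be sanity-checked numerically. The sketch below is only an illustration under a stated assumption: the recurrence for $a_{t}(m,\nu)$ is read off from the proof of Theorem 5 (with $a_{j}(m,\nu)=1$ for $j\leq 0$) rather than copied from equation (23), and the helper names `aux_sequence`, `contraction_factor` and `check_rate` are hypothetical.

```python
def aux_sequence(m, nu, T):
    """Return [a_0, ..., a_T].

    Assumed recurrence (inferred from the proof of Theorem 5, not quoted
    from equation (23)): a_j = 1 for j <= 0 and
        a_{t+1} = 1 / (2 (1 + nu) m) * sum_{j = t-m+1}^{t} a_j^(1 + nu).
    """
    a = [1.0]
    scale = 1.0 / (2.0 * (1.0 + nu) * m)
    for t in range(T):
        window = 0.0
        for j in range(t - m + 1, t + 1):
            a_j = a[j] if j >= 0 else 1.0  # indices before the start count as 1
            window += a_j ** (1.0 + nu)
        a.append(scale * window)
    return a


def contraction_factor(m, nu):
    """The constant c from Theorem 6."""
    return 1.0 - (1.0 - (1.0 / (2.0 * (1.0 + nu))) ** (1.0 + nu)) / m


def check_rate(m, nu, T, rtol=1e-9):
    """Check a_{t+1} <= c^((1 + nu)^(floor(t/m) - 1)) * a_t for t = m, ..., T."""
    a = aux_sequence(m, nu, T + 1)
    c = contraction_factor(m, nu)
    for t in range(m, T + 1):
        exponent = (1.0 + nu) ** (t // m - 1)
        # rtol guards against floating-point rounding when the bound is tight
        if a[t + 1] > (c ** exponent) * a[t] * (1.0 + rtol):
            return False
    return True


if __name__ == "__main__":
    print(aux_sequence(1, 1.0, 2))
    print(check_rate(3, 0.5, 30))
```

Since $r_{t}=a_{t}(m,\nu)r_{0}$, the same check covers the inequality for $r_{t}$ in (35). Keeping `T` moderate avoids overflow when the exponent $(1+\nu)^{\lfloor t/m\rfloor-1}$ is formed in floating point.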

E.3 Proof of Corollary 3

Proof.

Let $m=\lceil n/k\rceil$. According to Theorem 2, we have

\displaystyle r_{t+1}\leq c^{(1+\nu)^{\left(\left\lfloor{t}/{m}\right\rfloor-1\right)}}r_{t}\qquad\text{with}\qquad c=1-\frac{1}{m}\left(1-\left(\frac{1}{2(1+\nu)}\right)^{1+\nu}\right)

for all $\nu\in(0,1]$. Since $c$ is monotonically decreasing in $\nu$, evaluating it at $\nu=1$ and letting $\nu\to 0$ gives

\displaystyle 1-\frac{1}{2m}>c\geq 1-\frac{15}{16m},

which implies

\displaystyle r_{t+1}\leq\Big{(}1-\frac{1}{2m}\Big{)}^{(1+\nu)^{(\left\lfloor t% /m\right\rfloor-1)}}r_{t}

for all $t\geq m$ .

According to the definition of $\{r_{t}\}_{t\geq 0}$ and Theorem 6, we have

\displaystyle r_{0}=\max\{\left\|{\bf{x}}^{0}-{\bf{x}}^{*}\right\|,1\}\geq 1.

Combining with Lemma 8, we have

	$\displaystyle r_{t}=$	$\displaystyle a_{t}(m,\nu)r_{0}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{2(1+\nu)}(a_{t-m}(m,\nu))^{1+\nu}r_{0}$
	$\displaystyle=$	$\displaystyle\frac{1}{2(1+\nu)r_{0}^{\nu}}(a_{t-m}(m,\nu))^{1+\nu}r_{0}^{1+\nu}$
	$\displaystyle=$	$\displaystyle\frac{1}{2(1+\nu)r_{0}^{\nu}}r_{t-m}^{1+\nu}$
	$\displaystyle\leq$	$\displaystyle\frac{1}{2(1+\nu)}r_{t-m}^{1+\nu}$

for all $t\geq m$ . This leads to

\displaystyle r_{t}\leq\frac{1}{4}r_{t-m}^{2}

in the case of $\nu=1$ . ∎
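As an illustrative check of these bounds, consider the simplest instance $m=1$ and $\nu=1$. This worked example assumes the recurrence $a_{t+1}(1,1)=\frac{1}{4}(a_{t}(1,1))^{2}$ with $a_{0}(1,1)=1$, which is what the proof of Theorem 5 suggests for this case; it is not quoted from equation (23). Under that assumption we get

\displaystyle a_{t}(1,1)=4^{-(2^{t}-1)}\qquad\text{and}\qquad c=1-\Big(1-\frac{1}{16}\Big)=\frac{1}{16},

so that

\displaystyle c^{2^{t-1}}a_{t}(1,1)=4^{-2^{t}}\cdot 4^{-(2^{t}-1)}=4^{-(2^{t+1}-1)}=a_{t+1}(1,1)\qquad\text{for all}~~t\geq 1.

In this instance the rate of Theorem 6 therefore holds with equality, and for $r_{0}=1$ the bound $r_{t}\leq\frac{1}{4}r_{t-m}^{2}$ is attained as well.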

	$\displaystyle f_{i}({\bf{y}})-f_{i}({\bf{x}})-{\bf{g}}_{i}({\bf{x}})^{\top}({\bf{y}}-{\bf{x}})$	$\displaystyle=\int_{t=0}^{1}{\bf{g}}_{i}({\bf{x}}+t({\bf{y}}-{\bf{x}}))^{\top}({\bf{y}}-{\bf{x}})\text{d}t-{\bf{g}}_{i}({\bf{x}})^{\top}({\bf{y}}-{\bf{x}})$
		$\displaystyle=\int_{t=0}^{1}\left({\bf{g}}_{i}({\bf{x}}+t({\bf{y}}-{\bf{x}}))-{\bf{g}}_{i}({\bf{x}})\right)^{\top}({\bf{y}}-{\bf{x}})\text{d}t$
		$\displaystyle\leq\int_{t=0}^{1}\left\|{\bf{g}}_{i}({\bf{x}}+t({\bf{y}}-{\bf{x}}))-{\bf{g}}_{i}({\bf{x}})\right\|\left\|{\bf{y}}-{\bf{x}}\right\|\text{d}t$
		$\displaystyle\leq\int_{t=0}^{1}{\mathcal{H}}_{\nu}t^{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{1+\nu}\text{d}t$
		$\displaystyle={\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{1+\nu}\int_{t=0}^{1}t^{\nu}\text{d}t$
		$\displaystyle=\frac{{\mathcal{H}}_{\nu}}{1+\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{1+\nu},$

	$\displaystyle\left\|{\bf J}({\bf{x}}){\bf{v}}\right\|$	$\displaystyle=\lim_{h\to 0}\frac{\left\|{\bf{f}}({\bf{x}}+h{\bf{v}})-{\bf{f}}({\bf{x}})\right\|}{\|h\|}$
		$\displaystyle\leq\lim_{h\to 0}\frac{L_{f}\left\|{\bf{x}}+h{\bf{v}}-{\bf{x}}\right\|}{\|h\|}$
		$\displaystyle=\lim_{h\to 0}\frac{L_{f}\|h\|\left\|{\bf{v}}\right\|}{\|h\|}$
		$\displaystyle=L_{f}\left\|{\bf{v}}\right\|,$

	$\displaystyle\left\|{\bf J}({\bf{y}})^{\top}{\bf J}({\bf{y}})-{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{x}})\right\|$	$\displaystyle=\left\|{\bf J}({\bf{y}})^{\top}{\bf J}({\bf{y}})-{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{y}})+{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{y}})-{\bf J}({\bf{x}})^{\top}{\bf J}({\bf{x}})\right\|$
		$\displaystyle\leq\left\|\left({\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right)^{\top}{\bf J}({\bf{y}})\right\|+\left\|{\bf J}({\bf{x}})^{\top}\left({\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right)\right\|$
		$\displaystyle\leq\left\|{\bf J}({\bf{y}})\right\|\left\|{\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right\|+\left\|{\bf J}({\bf{x}})\right\|\left\|{\bf J}({\bf{y}})-{\bf J}({\bf{x}})\right\|$
		$\displaystyle\leq 2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu},$

	$\displaystyle\left\|{\bf{g}}_{i}({\bf{y}}){\bf{g}}_{i}({\bf{y}})^{\top}-{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{x}})^{\top}\right\|$	$\displaystyle=\left\|{\bf{g}}_{i}({\bf{y}}){\bf{g}}_{i}({\bf{y}})^{\top}-{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{y}})^{\top}+{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{y}})^{\top}-{\bf{g}}_{i}({\bf{x}}){\bf{g}}_{i}({\bf{x}})^{\top}\right\|$
		$\displaystyle\leq\left\|\left({\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right){\bf{g}}_{i}({\bf{y}})^{\top}\right\|+\left\|{\bf{g}}_{i}({\bf{x}})\left({\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right)^{\top}\right\|$
		$\displaystyle\leq\left\|{\bf{g}}_{i}({\bf{y}})\right\|\left\|{\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right\|+\left\|{\bf{g}}_{i}({\bf{x}})\right\|\left\|{\bf{g}}_{i}({\bf{y}})-{\bf{g}}_{i}({\bf{x}})\right\|$
		$\displaystyle\leq 2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{y}}-{\bf{x}}\right\|^{\nu},$

	$\displaystyle\left\|{\bf H}^{t}-{\bf J}({\bf{x}}^{*})^{\top}{\bf J}({\bf{x}}^{*})\right\|$	$\displaystyle=\left\|\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}-\sum_{i=1}^{n}{\bf{g}}_{i}({\bf{x}}^{*}){\bf{g}}_{i}({\bf{x}}^{*})^{\top}\right\|$
		$\displaystyle\leq\sum_{i=1}^{n}\left\|{\bf{g}}_{i}({\bf{z}}_{i}^{t}){\bf{g}}_{i}({\bf{z}}_{i}^{t})^{\top}-{\bf{g}}_{i}({\bf{x}}^{*}){\bf{g}}_{i}({\bf{x}}^{*})^{\top}\right\|$
		$\displaystyle\leq\sum_{i=1}^{n}2L_{f}{\mathcal{H}}_{\nu}\left\|{\bf{z}}_{i}^{t}-{\bf{x}}^{*}\right\|^{\nu},$
