Overview of Optimization Algorithms in Federated Learning#
The most important contribution of the initial work on federated learning [McMahan et al.[1]]
was the introduction of the Federated Averaging algorithm (FedAvg
). Mathematically, federated learning
solves the following problem of minimization of empirical risk function
where \(\ell_k\) is the loss function of client \(k\), \(\mathcal{D}_k\) is the data distribution of client \(k\), \(\mathcal{P}\) is the distribution of clients, and \(\mathbb{E}\) is the expectation operator. If we simply let \(\mathcal{P} = \{1, 2, \ldots, K\}\), then the optimization problem can be simplified as
For further simplicity, we often take \(w_k = \frac{1}{K}\). The functions \(f_k\) and \(f\) are usually assumed to satisfy the following conditions:
(A1) \(f_k\) and \(f\) are \(L\)-smooth (\(L > 0\)), i.e. they have \(L\)-Lipschitz continuous gradients:
(3)#\[\begin{split}\begin{array}{c} \lVert \nabla f (\theta) - f (\theta') \rVert \leqslant L \lVert \theta - \theta' \rVert, \\ \lVert \nabla f_k (\theta) - f_k (\theta') \rVert \leqslant L \lVert \theta - \theta' \rVert, \end{array} \quad \forall \theta, \theta' \in \R^d, k = 1, \ldots, K.\end{split}\](A2) The range of \(f\)
\[ \begin{align}\begin{aligned}\DeclareMathOperator*{\dom}{dom}\\\dom(f) := \{ \theta \in \R^d ~|~ f(\theta) < + \infty \}\end{aligned}\end{align} \]is nonempty and lower bounded, i.e. there exists a constant \(c \in \R\) such that
(4)#\[f(\theta) \geqslant c > -\infty, ~ \forall \theta \in \R^d,\]or equivalently,
(5)#\[f^* := \inf\limits_{\theta \in \R^d} f(\theta) > - \infty.\]
In many cases, in order to facilitate the analysis of convergence, we will also make some assumptions about the gradient of the objective function:
(A3) Bounded gradient: there exists a constant \(G > 0\) such that
(6)#\[\lVert \nabla f_k (\theta) \rVert^2 \leqslant G^2, ~ \forall \theta \in \R^d, ~ k = 1, \ldots K.\]
And the following assumptions on data distributions:
(A4-1) Data distribution is I.I.D. (identically and independently distributed) across clients, i.e.
(7)#\[\nabla f(\theta) = \expectation [f_k(\theta)] = \expectation\limits_{(x, y) \sim \mathcal{D}_k}[\nabla \ell_k(\theta; x, y)], ~ \forall \theta \in \R^d, ~ k = 1, \ldots K,\]or equivalently, for any \(\varepsilon > 0\), there exists a constant \(B \geqslant 0\) such that
(8)#\[\sum\limits_{k=1}^K \lVert \nabla f_k(\theta) \rVert^2 = \lVert f(\theta) \rVert^2, ~ \forall \theta \in \left\{ \theta \in \R^d ~ \middle| ~ \lVert f(\theta) \rVert^2 > \varepsilon \right\}.\]
- (A4-2) Data distribution is non-I.I.D across clients, in which case we need a quantity to measure
the degree of this statistical heterogeneity. This quantity can be defined in a number of ways [Karimireddy et al.[2], Zhang et al.[3], Li et al.[4], Li et al.[5]]. For example, in [Karimireddy et al.[2]] and [Zhang et al.[3]], the so-called bounded gradient dissimilarity (BGD), denoted as \((G; B)\)-BGD, is used as this quantity. More specifically, there exists constants \(G > 0\) and \(B \geqslant 0\) such that
(9)#\[\dfrac{1}{K} \sum\limits_{k=1}^K \lVert \nabla f_k(\theta) \rVert^2 \leqslant G^2 + B^2 \lVert \nabla f(\theta) \rVert^2, ~ \forall \theta \in \R^d.\]It should be noted that letting \(B = 0\), the bounded gradient dissimilarity condition (A4-2) degenrates to the bounded gradient condition (A3).
Sometimes, in the proof of algorithm convergence, one needs to make assumptions on the convexity of the objective function \(f\), which can be defined as follows:
(A5-1) convexity:
(10)#\[f(a \theta + (1 - a) \theta') \leqslant a f(\theta) + (1 - a) f(\theta'), ~ \forall \theta, \theta' \in \mathcal{C}, ~ \alpha \in [0, 1].\]where \(\mathcal{C}\) is a convex set on which \(f\) is defined.
- (A5-2) Strong convexity: there exists a constant \(\mu > 0\) such that \(f - \frac{\mu}{2} \lVert \theta \rVert^2\)
is convex. In this case, we say that \(f\) is \(\mu\)-strongly convex.
Due to the natural layered and decoupled structure of the federal learning problem, it is more natural to consider the following constrained optimization problem:
It is easy to find the equivalence between the constrained optimization problem (11) and the unconstrained optimization problem (2). The constrained formulation (11) is called the consensus problem in the literature of distributed optimization [Boyd et al.[6]]. The superiority of the constrained formulation (11) is that the objective function becomes block-separable, which is more suitable for the design of parallel and distributed algorithms.
Federated Averaging Algorithm#
The core idea of the FedAvg
algorithm is to make full use of the local computation resources of each client
so that each client can perform multiple local iterations before uploading the local model to the server.
It alleviates the problem of straggler clients and reduces the communication overhead,
hence accelerating the convergence of the algorithm. This may well be thought of as a simple form of
skipping algorithm, which were further developed in [Zhang et al.[3], Mishchenko et al.[7], Zhang and Loizou[8]].
Pseudocode for FedAvg
is shown as follows:
FedAvg
achieved some good numerical results (see Section 3 of [McMahan et al.[1]]),
but its convergence, espcially under non-I.I.D. data distributions, is not properly analyzed
(see [Khaled et al.[9], Khaled et al.[10]]). There are several works that deal with this issue
(such as [Zhou and Cong[11], Li et al.[4]]) with extra assumptions such as
the convexity of the objective function \(f\), etc.
FedAvg
from the Perspective of Optimization#
In this section, we will analyze the FedAvg
algorithm from the perspective of optimization theory.
In fact, the optimization problem (2) that FedAvg
solves can be equivalently reformulated
as the following constrained optimization problem:
where \(\Theta = \col(\theta_1, \cdots, \theta_K) := \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_K \end{pmatrix}, \theta_1, \ldots, \theta_K \in \R^d\) and \(\mathcal{E} = \left\{ \Theta ~ \middle| ~ \theta_1 = \cdots = \theta_K \right\}\) is a convex set in \(\R^{Kd}\). Projected gradient descent (PGD) is an effective method for solving the constrained optimization problem (12), which has the following update rule:
where \(\Pi_{\mathcal{E}}\) is the projection operator onto the set \(\mathcal{E}\). It is easy to show that the projection operator onto the set \(\mathcal{E}\) is indeed the average operator, i.e.,
We have shown that mathematically the FedAvg
algorithm is indeed a kind of stochastic projected gradient descent (SPGD)
algorithm, where the clients perform local stochastic gradient descent (SGD) updates and the server performs
the projection step (14).
A Direct Improvement of FedAvg
#
Since FedAvg
is based on stochastic gradient descent (SGD), it is natural to consider applying
acceleration techniques [Duchi et al.[12], Kingma and Ba[13], Zaheer et al.[14], Reddi et al.[15]] to improve the algorithm performance.
Computation on clients and on the server are relatively decoupled, so it does not require large modifications
to the whole algorithm framework. Indeed, the authors of the FedAvg
paper put this idea into practice and proposed
a federated learning framework called FedOpt
[Reddi et al.[16]] which has stronger adaptability.
The pseudocode for FedOpt
is shown as follows:
In the above pseudocode, \(\operatorname{aggregate} \left( \left\{ \Delta_{k}^{(t)} \right\}_{k \in \mathcal{S}^{(t)}} \right)\) refers to some method that aggregates the local inertia updates \(\Delta_{k}^{(t)}\) from the selected clients \(\mathcal{S}^{(t)}\) into a global inertia update \(\Delta^{(t)}\). This method, for example, can be simply averaging
or linear combination with inertia of the previous iteration
As one has already noticed, compared to FedAvg
, FedOpt
introduces some momentum terms on the server node (in ServerOpt) to
accelerate the convergence. In [Reddi et al.[16]], the authors listed several options for ServerOpt:
FedAdagrad
:(17)#\[\begin{split}\begin{aligned} v^{(t)} & \gets v^{(t-1)} + ( \Delta^{(t)} )^2 \\ \theta^{(t+1)} & \gets \theta^{(t)} + \eta_g \Delta^{(t)} / (\sqrt{v^{(t)}}+\tau) \end{aligned}\end{split}\]FedYogi
:(18)#\[\begin{split}\begin{aligned} v^{(t)} & \gets v^{(t-1)} - (1 - \beta_2) ( \Delta^{(t)} )^2 \operatorname{sign}(v^{(t-1)} - ( \Delta^{(t)} )^2) \\ \theta^{(t+1)} & \gets \theta^{(t)} + \eta_g \Delta^{(t)} / (\sqrt{v^{(t)}}+\tau) \end{aligned}\end{split}\]FedAdam
:(19)#\[\begin{split}\begin{aligned} v^{(t)} & \gets \beta_2 v^{(t-1)} + (1 - \beta_2) ( \Delta^{(t)} )^2 \\ \theta^{(t+1)} & \gets \theta^{(t)} + \eta_g \Delta^{(t)} / (\sqrt{v^{(t)}}+\tau) \end{aligned}\end{split}\]
FedOpt
applys acceleration techniques which are frequently used in general machine learning tasks to the field of
federated learning. It is a direct improvement of FedAvg
which is simple but important. Moreover, it demonstrates
the decoupling of the computation on clients and on the server, which is a key feature of federated learning.
To better handle non-I.I.D. data, one needs to introduce some other techniques. In non-I.I.D. scenarios, the gradients have different distributions across clients. A natural idea is to bring in some extra parameters which update along with the model parameters to make corrections (modifications) to the gradients on clients, reducing their variance and further accelerating the convergence. This technique is the so-called variance reduction technique [Johnson and Zhang[17]], which was first introduced to federated learning in [Karimireddy et al.[2]] in the form of a new federated learning algorithm called SCAFFOLD (Stochastic Controlled Averaging algorithm). The pseudocode for SCAFFOLD is shown as follows:
Variance reduction is a technique that can be flexibly combined with most algorithms and has been widely used in federated learning for dealing with statistical heterogeneity. However, it should be noted in the SCAFFOLD algorithm that on both the server and the clients, there are extra parameters \(c\) and \(c_k\) to maintain, which may increase the communication cost. In scenarios which are sensitive to communication cost, this would potentially be a problem. Therefore, a better solution could be a combination of the variance reduction technique and some skipping techniques (e.g. [Zhang and Loizou[8]]), which will be introduced in next sections.
References