Notes on Ray paper

I am reading Robert Nishihara’s thesis on Ray [1] and keep encountering concepts I am not familiar with, so this post takes notes on those new concepts.

Ring AllReduce

Ring AllReduce is a technique for aggregating results across multiple computation nodes. It is a primitive in many distributed training systems. In the Ray paper, the author analyzes how easily Ring AllReduce can be implemented with Ray’s API.

There are two articles that articulate the idea really well [5, 6]. They point out that network data communication becomes the bottleneck when a distributed training system aggregates gradients from worker nodes in a master-slave pattern: each worker (say there are P-1 workers in total) sends its gradient vector (say it takes N space) to the master, and the master aggregates them and sends the updated model vector (the same size as a gradient vector) back to each worker for the next iteration. The data transferred through the master’s network link is therefore as large as 2(P-1)N per iteration, which scales linearly with the number of workers.

Ring AllReduce instead runs multiple cycles of data transfer among all workers (P workers now, since we no longer need a master). Each cycle transfers only a small amount of data, so the total traffic through any single node is about 2(P-1)N/P \approx 2N, compared with 2(P-1)N through the master in the master-slave pattern, and it is nearly independent of the number of workers.

The basic idea of Ring AllReduce is to divide each gradient vector into P chunks on every worker. There are first P-1 cycles of data aggregation, then P-1 cycles of data syncing. In the first cycle, each worker i (i=1,\ldots,P) sends its i-th chunk to the next worker (worker i+1, with indices wrapping around the ring), and receives the (i-1)-th chunk from the previous worker. The received (i-1)-th chunk is aggregated locally with the worker’s own (i-1)-th chunk. In the second cycle, each worker sends its (i-1)-th chunk, which was aggregated in the last cycle, to the next worker, and receives the (i-2)-th chunk from the previous worker; each worker can then aggregate its (i-2)-th chunk with the received chunk of the same index. Continuing this pattern, each worker sends its (i-2)-th, (i-3)-th, … chunk in successive cycles until the P-1 aggregation cycles are done, at which point each worker holds exactly one chunk that has been fully aggregated across all workers. Running similar circular cycles another P-1 times then syncs all the fully aggregated chunks to every worker.
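To make the cycle structure concrete, here is a minimal single-process simulation in Python/NumPy. It is only a sketch of the data flow under the indexing described above (0-indexed in code), not a real networked implementation and not Ray’s API:

```python
import numpy as np

def ring_allreduce(grads):
    P = len(grads)
    # Each worker splits its vector into P chunks; chunk c has the same
    # length on every worker, so chunks can be added across workers.
    chunks = [np.array_split(g.astype(float), P) for g in grads]

    # Phase 1: P-1 aggregation (scatter-reduce) cycles. In cycle t, worker i
    # sends chunk (i - t) mod P to worker (i + 1) mod P, which adds it in.
    for t in range(P - 1):
        for i in range(P):
            c = (i - t) % P
            chunks[(i + 1) % P][c] = chunks[(i + 1) % P][c] + chunks[i][c]

    # Now worker i holds the fully aggregated chunk (i + 1) mod P.
    # Phase 2: P-1 syncing (allgather) cycles passing finished chunks around.
    for t in range(P - 1):
        for i in range(P):
            c = (i + 1 - t) % P
            chunks[(i + 1) % P][c] = chunks[i][c]

    return [np.concatenate(ch) for ch in chunks]

grads = [np.arange(8) + 10.0 * w for w in range(4)]  # 4 workers, length-8 vectors
result = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in result)
```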

I drew a simple diagram to illustrate the idea:

[Diagram omitted: the P-1 aggregation cycles followed by the P-1 syncing cycles among P workers.]

ADMM (Alternating Direction Method of Multipliers)

Another interesting technique introduced in the thesis is ADMM. It is a good opportunity to revisit optimization from scratch. I mainly follow [4] and [7] for the basic concepts.

The definition of directional derivative and gradient:
The gradient of f(x), x\in \mathbb{R}^n, is \nabla f(x) = \left( \frac{\partial f(x)}{\partial x_1}, \cdots, \frac{\partial f(x)}{\partial x_n}\right). The gradient is a vector, and it is only fully known once evaluated at a concrete point x \in \mathbb{R}^n. The directional derivative is the gradient’s projection onto a unit vector u \in \mathbb{R}^n: df(x)(u) = \langle \nabla f(x), u \rangle = \|\nabla f(x)\| \,\|u\| \cos \theta = \|\nabla f(x)\| \cos \theta, where \langle \cdot, \cdot \rangle is the inner product and \theta is the angle between \nabla f(x) and u. See some introduction in [9] and [10].
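As a quick sanity check of these definitions, here is a tiny NumPy snippet (the function f below is my own toy example): the finite-difference estimate of the directional derivative should match \langle \nabla f(x), u \rangle.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1]   # toy function with gradient (2*x1, 3)

x = np.array([1.0, 2.0])
grad = np.array([2 * x[0], 3.0])  # analytic gradient at x
u = np.array([3.0, 4.0]) / 5.0    # a unit vector

eps = 1e-6
directional = (f(x + eps * u) - f(x)) / eps   # finite-difference estimate
assert np.isclose(directional, grad @ u, atol=1e-4)
```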

The definition of a proper function is simple (Definition 51.5 [7]): f(x) > -\infty for all x \in \mathbb{R}^n, and f(x) < +\infty for some x \in \mathbb{R}^n.

The definition of a convex and a strictly convex function (Proposition 40.9 [7]):
A differentiable function f is convex on U iff f(v) \geq f(u) + df(u)(v-u) for all u, v \in U. Here df(u)(v-u) is the directional derivative at u along v-u, i.e., the first-order change of f when moving from u toward v; in one dimension it reduces to f'(u)(v-u), the tangent-line approximation, where f'(u) is the slope of the tangent line at u. f is strictly convex iff f(v) > f(u) + df(u)(v-u) for all u \neq v.

This definition actually leads to a very common technique to check whether a multivariate quadratic function is convex: if f(u)=\frac{1}{2}u^T A u - u^T b, u \in \mathbb{R}^n, then f(u) is convex as long as the symmetric matrix A is positive semidefinite. See Example 40.1 [7].
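In code, this check is a single eigenvalue computation (the matrix A below is an arbitrary example of mine):

```python
# f(u) = (1/2) u^T A u - u^T b is convex iff A is positive semidefinite,
# i.e., all eigenvalues of the symmetric matrix A are nonnegative.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues = np.linalg.eigvalsh(A)   # eigvalsh is for symmetric matrices
print(eigenvalues, np.all(eigenvalues >= -1e-12))   # [1. 3.] True
```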

The definition of affine hull: given a set of vectors S=\{x_1, \cdots, x_m\} with x_i \in \mathbb{R}^n, aff(S) is the set of all affine combinations of the vectors in S, i.e., aff(S)=\{\lambda_1 x_1 + \cdots + \lambda_m x_m \mid \lambda_i \in \mathbb{R},\; \lambda_1 + \cdots + \lambda_m = 1\}.

The definition of relint (Definition 51.9 [7]):
Suppose C is a subset of \mathbb{R}^n. The relative interior of C is the set relint(C)=\{x \in C \mid B_\epsilon(x) \cap aff(C) \subseteq C \;\; \text{for some } \epsilon > 0\}, where B_\epsilon(x) is the open ball of radius \epsilon centered at x. There is a good example in [8] which explains the difference between interior and relative interior.

The most important thing is to understand, at a high level, the typical optimization procedure, which I also covered in [3]. Suppose we want to solve the primal problem:
minimize  J(v)
subject to  \varphi_i(v) \leq 0, \quad i=1,\ldots,m
                  \psi_j(v)=0, \quad j = 1,\ldots, p
with dom(J) = \Omega \subset \mathbb{R}^n

The Lagrangian L(v,\mu, \upsilon) for the primal problem is defined as:
L(v, \mu, \upsilon)=J(v)+\sum\limits_{i=1}^m \mu_i \varphi_i(v) + \sum\limits_{j=1}^p \upsilon_j \psi_j(v), where \mu \in \mathbb{R}^m_+ (i.e., \mu_i \geq 0) and \upsilon \in \mathbb{R}^p.

The KKT conditions are necessary conditions: if v^* is a local minimum of the primal problem (and some regularity conditions hold), then there exist multipliers \mu and \upsilon such that v^* satisfies the following [11]:
1. stationarity: \nabla J(v^*) + \sum\limits_{i=1}^m \mu_i \nabla \varphi_i(v^*) + \sum\limits_{j=1}^p \upsilon_j \nabla \psi_j(v^*) = 0
2. primal feasibility: \varphi_i(v^*) \leq 0, \; i=1, \ldots, m and \psi_j(v^*)=0, \; j=1, \ldots, p
3. dual feasibility: \mu_i \geq 0, \; i=1,\ldots, m
4. complementary slackness: \mu_i \varphi_i(v^*)=0, \; i=1, \ldots, m

In other words, if we already know v^*, then it must satisfy the KKT conditions. However, not every point satisfying the KKT conditions is a local minimum of the primal problem. The KKT conditions can be used to find the global minimum v^* if additional conditions are satisfied. One such condition is the second-order sufficient conditions [12], which I also talked about in [3]. Another is convexity: if the constraints, the domain dom(J), and the objective function J are all convex, then the KKT conditions also imply that v^* is a global minimum (Theorem 50.6 [7]).
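A tiny worked example: minimize J(v)=v^2 subject to \varphi(v)=1-v \leq 0. Stationarity gives 2v^* - \mu = 0 and complementary slackness gives \mu(1-v^*)=0. If \mu=0 then v^*=0, which violates primal feasibility (\varphi(0)=1>0), so we must have 1-v^*=0, i.e., v^*=1 with \mu=2 \geq 0, and all four conditions hold. Since J and \varphi are convex here, v^*=1 is the global minimum.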

While some problems have characteristics that allow using the KKT conditions as sufficient conditions to find the global solution, others are easier to solve or approximate through the dual problem: by maximizing another function called the dual function, we can get the exact solution or at least a lower bound of the primal problem.

The dual function G: \mathbb{R}^m_+ \times \mathbb{R}^p \rightarrow \mathbb{R} is defined as:
G(\mu, \upsilon) = \inf\limits_{v \in \Omega} L(v, \mu, \upsilon), \quad \mu \in \mathbb{R}^m_+ \text{ and } \upsilon \in \mathbb{R}^p

The dual problem is defined as:
maximize G(\mu, \upsilon)
subject to \mu \in \mathbb{R}^m_+, \upsilon \in \mathbb{R}^p

From [7] page 1697, one of the main advantages of the dual problem over the primal problem is that it is a convex optimization problem, since we wish to maximize a concave objective function G (thus minimize −G, a convex function), and the constraints \mu \geq 0 are convex. In a number of practical situations, the dual function G can indeed be computed.

If the maximum of the dual problem is d^* and the minimum of the primal problem is p^*, then weak duality always holds: d^* \leq p^*. Strong duality holds when the duality gap is zero, which requires additional conditions such as Slater’s condition [14]. For the SVM dual, the constrained maximum (\mu^*, \upsilon^*) can be found with a coordinate-ascent-style algorithm called sequential minimal optimization (SMO) [13], because special treatment is needed for the constraints involved in the dual problem.

[7] provides two ways to do constrained optimization for SVMs (Sections 50.6 and 50.10): one is to use the KKT conditions, the other is to solve the dual problem.

The dual problem simplifies when the primal problem has only affine equality constraints:
Primal problem:
minimize J(u)
subject to  Au=b

Dual problem:
maximize G(\lambda)=\inf\limits_{u \in \mathbb{R}^n} L(u, \lambda)=\inf\limits_{u \in \mathbb{R}^n}\left(J(u)+\lambda^T(Au-b)\right)
subject to \lambda \in \mathbb{R}^p

Since the dual problem is an unconstrained optimization problem, we can use dual ascent to solve this problem easily:
u^{k+1} = argmin_u L(u, \lambda^k)
\lambda^{k+1} = \lambda^k + \alpha^k(Au^{k+1}-b),
where \alpha^k>0 is a step size.
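Here is a minimal NumPy sketch of these two updates on a made-up strongly convex quadratic J(u)=\frac{1}{2}u^T Q u - c^T u (the values of Q, c, A, b, and the step size are illustrative, not from the thesis):

```python
import numpy as np

Q = np.array([[3.0, 0.5],
              [0.5, 2.0]])      # positive definite, so argmin_u is unique
c = np.array([1.0, 1.0])
A = np.array([[1.0, 1.0]])      # one equality constraint: u1 + u2 = 1
b = np.array([1.0])

lam = np.zeros(1)
alpha = 0.5                     # step size
for _ in range(200):
    # u-update: argmin_u L(u, lam) solves Q u = c - A^T lam
    u = np.linalg.solve(Q, c - A.T @ lam)
    # dual ascent step on lambda along the constraint residual A u - b
    lam = lam + alpha * (A @ u - b)

print(u, A @ u - b)             # residual approaches 0 at convergence
```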

The main flaw of dual ascent is that, unless fairly strong assumptions on J hold, the iterates may diverge. The augmented Lagrangian method augments the Lagrangian with a quadratic penalty term, where \rho>0 is a penalty parameter:
L_{\rho}(u, \lambda)=J(u) + \lambda^T(Au-b)+ (\rho/2)\|Au-b\|^2_2

Thanks to the penalty term, the augmented Lagrangian is strongly convex in u under mild assumptions (e.g., when A has full column rank), which gives the method better convergence properties.

The method of multipliers is dual ascent applied to the augmented Lagrangian, with the penalty parameter \rho as the step size:
u^{k+1} = argmin_u L_\rho(u, \lambda^k)
\lambda^{k+1} = \lambda^k + \rho(Au^{k+1}-b)

If u can be separated into two parts such that J(u)=J(x,z)=f(x)+g(z), then we can update x, z, and \lambda iteratively and separately; this method is called the Alternating Direction Method of Multipliers (ADMM):
Primal problem:
minimize f(x)+g(z) 
subject to Ax+Bz=c

The augmented Lagrangian:
L_\rho(x,z,\lambda)=f(x)+g(z)+\lambda^T(Ax+Bz-c)+(\rho/2)\|Ax+Bz-c\|^2_2

Updates (note that, unlike the method of multipliers, we are not doing the joint minimization (x^{k+1}, z^{k+1})=argmin_{x,z} L_\rho(x,z,\lambda^k)):
x^{k+1}=argmin_x L_\rho(x, z^k, \lambda^k)
z^{k+1}=argmin_z L_\rho(x^{k+1}, z, \lambda^k)
\lambda^{k+1} = \lambda^k + \rho(Ax^{k+1}+Bz^{k+1}-c)

If we define the scaled dual variable \mu=(1/\rho)\lambda and the residual r=Ax+Bz-c, then the updates can be simplified as:
x^{k+1}=argmin_x \left( f(x) + (\rho/2) \|Ax+Bz^k-c+\mu^k\|^2_2 \right)
z^{k+1}=argmin_z \left( g(z) + (\rho/2)\|Ax^{k+1} + Bz - c + \mu^k \|^2_2 \right)
\mu^{k+1}=\mu^k + Ax^{k+1} + Bz^{k+1} - c
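Where does this simplification come from? Completing the square on the residual gives \lambda^T r + (\rho/2)\|r\|^2_2 = (\rho/2)\|r + \mu\|^2_2 - (\rho/2)\|\mu\|^2_2 with \mu=(1/\rho)\lambda, and since (\rho/2)\|\mu^k\|^2_2 does not depend on x or z, it can be dropped inside the argmin.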

Structures in f, g, A, and B can often be exploited to carry out x-minimization and z-minimization more efficiently. We now look at several examples.

Example 1: if A=I is the identity matrix, v=-Bz^k+c-\mu^k, and f(\cdot) is the indicator function of a closed nonempty convex set \mathcal{C}, then x^{k+1}=argmin_x \left( f(x) + (\rho/2) \|x-v\|^2_2 \right)=\Pi_{\mathcal{C}}(v), which means x^{k+1} is the projection of v onto \mathcal{C}.
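For instance, if \mathcal{C} is the nonnegative orthant (a choice of mine for illustration), the projection is elementwise clipping:

```python
import numpy as np

v = np.array([-1.5, 0.3, 2.0])
x_next = np.maximum(v, 0.0)    # projection of v onto C = {x | x >= 0}
print(x_next)                  # [0.  0.3 2. ]
```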

Example 2: if f(x)=(1/2)x^T P x + q^T x + r and v=-Bz^k+c-\mu^k, then x^{k+1}=argmin_x \left( f(x) + (\rho/2) \|Ax-v\|^2_2 \right) becomes an unconstrained quadratic program with the analytic solution x^*=\left(P+\rho A^T A\right)^{-1}(\rho A^Tv-q).
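In code, this x-update is a single linear solve (P, q, A, v, and \rho below are illustrative values of mine):

```python
import numpy as np

P = np.array([[2.0, 0.0],
              [0.0, 4.0]])
q = np.array([1.0, -1.0])
A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
v = np.array([0.5, 1.0])
rho = 1.0

# x* = (P + rho A^T A)^{-1} (rho A^T v - q); solve() beats forming the inverse
x_star = np.linalg.solve(P + rho * A.T @ A, rho * A.T @ v - q)

# sanity check: gradient of f(x) + (rho/2)||Ax - v||^2 vanishes at x*
grad = P @ x_star + q + rho * A.T @ (A @ x_star - v)
assert np.allclose(grad, 0.0)
```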

Example 3: if f(x)=(1/2)x^T P x + q^T x + r with dom(f)=\{x \mid Ax=b\}, and we additionally require x\geq 0, then the ADMM primal problem becomes:
minimize f(x) + g(z)  (where g(z) is the indicator function of the nonnegative orthant \mathbb{R}_+^n)
subject to x-z=0

Then x^{k+1}=argmin_x \left( f(x) + (\rho/2) \|x-z^k+\mu^k\|^2_2 \right). This is an equality-constrained quadratic program, which we can solve through its KKT conditions. Suppose the Lagrange multiplier for the constraint Ax=b is v; then the KKT conditions state that:
\nabla f(x) +\nabla\left((\rho/2) \|x-z^k+\mu^k\|^2_2 \right) + A^T v = Px + q + \rho (x - z^k + \mu^k)+ A^T v = 0
Ax-b = 0,
which is exactly the linear system in matrix form given in Section 5.2 [4]:
\begin{bmatrix} P+\rho I & A^T \\ A & 0 \end{bmatrix}\begin{bmatrix} x^{k+1} \\ v \end{bmatrix} = \begin{bmatrix} \rho(z^k-\mu^k) - q \\ b \end{bmatrix}

(As a side note, solving an equality-constrained quadratic program usually relies on the KKT conditions; the problem can be converted to a linear system in matrix form, see https://www.math.uh.edu/~rohop/fall_06/Chapter3.pdf.)

Example 4 (Section 4.4.3 in [4]): if f(x)=\lambda \|x\|_1, v=-Bz^k+c-\mu^k, and A=I, then x^*=argmin_x(\lambda \|x\|_1 + (\rho/2)\|x-v\|^2_2)=S_{\lambda/\rho}(v), where S_\kappa is the elementwise soft thresholding operator: S_\kappa(v)_i = \text{sign}(v_i)\max(|v_i|-\kappa, 0).
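Putting Examples 2 and 4 together, here is a minimal ADMM sketch for the lasso problem (which [4] also covers): minimize (1/2)\|Dx-d\|^2_2+\lambda\|z\|_1 subject to x-z=0. The data D, d and the values of \lambda, \rho are my own illustration:

```python
import numpy as np

def soft_threshold(v, k):
    # S_k(v): shrink each entry of v toward zero by k
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

rng = np.random.default_rng(0)
D = rng.standard_normal((50, 10))
d = rng.standard_normal(50)
lam, rho = 0.5, 1.0

x = np.zeros(10)
z = np.zeros(10)
mu = np.zeros(10)
M = D.T @ D + rho * np.eye(10)   # reused by every x-update
for _ in range(300):
    x = np.linalg.solve(M, D.T @ d + rho * (z - mu))  # x-update (Example 2 form)
    z = soft_threshold(x + mu, lam / rho)             # z-update (Example 4 form)
    mu = mu + x - z                                   # scaled dual update

print(z)   # sparse solution: many entries are exactly zero
```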

 

References

[1] On Systems and Algorithms for Distributed Machine Learning: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-30.pdf

[2] Compare between parameter servers and ring allreduce: https://harvest.usask.ca/bitstream/handle/10388/12390/CHOWDHURY-THESIS-2019.pdf?sequence=1&isAllowed=y

[3] Overview of optimization: https://czxttkl.com/2016/02/22/optimization-overview/

[4] ADMM tutorial: https://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

[5] Technologies behind distributed deep learning: AllReduce: https://tech.preferred.jp/en/blog/technologies-behind-distributed-deep-learning-allreduce/

[6] Visual intuition on ring-allreduce for distributed deep learning: https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da

[7] Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning https://www.cis.upenn.edu/~jean/math-deep.pdf

[8] https://math.stackexchange.com/a/2774285/235140

[9] http://sites.science.oregonstate.edu/math/home/programs/undergrad/CalculusQuestStudyGuides/vcalc/grad/grad.html

[10] https://math.stackexchange.com/a/661220/235140

[11] https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions#Necessary_conditions

[12] https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions#Sufficient_conditions

[13] http://cs229.stanford.edu/notes/cs229-notes3.pdf

[14] https://en.wikipedia.org/wiki/Slater%27s_condition
