Coding for Distributed Computing (in Machine Learning and Data Analytics)Modern distributed computing frameworks play a critical role in various applications, such as large-scale machine learning and big data analytics, which require processing a large volume of data in a high throughput. To achieve high performance when running a job of distributed computing whose input is a large dataset, it is common to run such a job on a large-scale distributed infrastructure, e.g., in the cloud. By dividing a job of distributed computing into multiple tasks, distributed computing frameworks, such as MPI, Horovods, MapReduce, and Spark, have been able to process a high volume of data at the scale of tens of terabytes or more, on a cluster of nodes with limited power of CPU and memory space. However, it is well known that nodes in a distributed infrastructure are subject to various faulty behaviors. For example, a node may experience temporary performance degradation due to resource contention or load imbalance. Even worse, a node may fail to complete a task due to the failure inside the distributed infrastructure. Therefore, when the computation is distributed onto multiple nodes, its progress may be significantly delayed by stragglers, i.e., tasks running on such faulty nodes. Although the adversarial effects of stragglers in the distributed infrastructure can be mitigated by running replicated tasks on multiple nodes, such replicated tasks will consume a significant amount of resources. Instead of adding replicated tasks, my research focuses on coded computing, running additional tasks that process parity data encoded from the dataset. Compared to simply running duplicated tasks on more nodes, coded computing tolerates potential stragglers with much fewer resources. Jointly considering various resources in the distributed infrastructure, including computing, bandwidth, and storage, my research pushes forward the performance in coded computing and achieves flexible tradeoffs between different resources. Communication-aware coding for distributed matrix multiplicationMatrix multiplication is a fundamental building block in various machine learning algorithms, including linear regression, logistic regression, deep neural networks, etc. When the matrix comes from a large dataset, the multiplication can be split into multiple tasks that calculate the multiplication of submatrices on different nodes. In order to tolerate stragglers, coded distributed matrix multiplication has been proposed where each task multiplies matrices encoded from the original input matrices, and the result of the multiplication can be decoded from a subset of such tasks. So far, various coding schemes have been proposed which split the input matrices differently and lead to different recovery thresholds, i.e., the number of tasks required for the completion of the job. As the large-scale matrix multiplication typically requires running in a large-scale distributed infrastructure, it is common to run its corresponding tasks on different nodes hosted in the cloud. However, as resources in the cloud are shared by multiple tenants, the performance of tasks can change frequently over time. Although splitting the input matrices into smaller partitions can lead to lower computational complexity in each task, the corresponding recovery threshold will be larger, leading to higher communication overhead as the results of such tasks need to be uploaded to some single node for decoding. Therefore, it becomes challenging to choose the coding scheme and its parameters with dynamic resources. In my research, I explore flexible tradeoffs between computational overhead, communication overhead, and the recovery threshold. Existing coding schemes split the input matrices in at most two dimensions, while three-dimensional splitting across the rows of A, the shared dimension, and the columns of B allows for a lower recovery threshold. I propose dual entangled polynomial (DEP) codes that execute two matrix multiplications in each task by splitting the matrices in all three dimensions, reducing the recovery threshold by approximately 25% compared to the best prior three-dimensional coding scheme, entangled polynomial (EP) codes, while also lowering the decoding overhead and memory consumption of each task. To address the challenge of dynamically changing resource performance in the cloud, I further propose a local re-encoding framework in which each task can switch its coding scheme and parameters by only locally re-encoding its existing data on the same node, without receiving any additional data. This framework supports dynamic conversion among polynomial codes, MatDot codes, and entangled polynomial codes, and saves up to 92.7% of the time required to re-encode all tasks from scratch at a centralized node. While prior work focuses on a single pairwise matrix multiplication, many distributed computing workloads require computing multiple matrix multiplications at the same time. I propose a coding framework for concurrent matrix multiplications where the results of multiple independent multiplications A₁B₁, A₂B₂, …, AₙBₙ can all be recovered within a single distributed job rather than by running separate parallel jobs, achieving the lowest recovery threshold among all polynomial-based coding schemes. For workloads that require computing the product of a chain of matrices M₀×M₁×···×M_{m-1}, as is common in multi-layer neural network inference, coded matrix chain multiplication encodes all input matrices jointly and recovers the entire chain result within a single round of tasks, saving up to 90.3% of completion time compared to completing each multiplication round by round. For batch matrix multiplication where n pairs of independent matrices must be multiplied in parallel, rook coding (RC) constructs its coding polynomials in much simpler forms than existing schemes, achieving a recovery threshold of O(n^{log_2 3}) with substantially lower encoding overhead, and demonstrates a lower overall job completion time in experiments on AWS despite its higher recovery threshold. When servers in the distributed infrastructure are heterogeneous, Spinner leverages the results of partially completed tasks on straggler servers, dynamically assigning workload proportional to each server's performance and adapting the coding scheme accordingly, improving the time of matrix multiplication by up to 84% and linear regression by 40.7%. A further line of research focuses on reducing the encoding overhead by exploiting the order in which sub-tasks are executed within each task. In existing coded computing designs, coded sub-tasks are encoded from all uncoded sub-tasks equally, without accounting for the fact that sub-tasks executed earlier have a lower probability of being incomplete than those executed later. Sequence-aware coding (SAC) uses this temporal structure to encode each coded sub-task from only a carefully chosen subset of uncoded sub-tasks, reducing encoding complexity by 50% while still leveraging partial results from stragglers and saving job completion time by up to 80.3% compared to global MDS codes. I further propose a general framework parameterized by a θ-recoverability (0 ≤ θ ≤ 1) that arbitrarily controls the probability of recovering the result from any given set of completed sub-tasks. When θ = 1, full recovery is guaranteed as with global MDS codes; when θ = 0, no coded sub-tasks are needed at all. This framework enables a flexible and continuous tradeoff between straggler mitigation and encoding complexity, and supports both matrix-vector and matrix-matrix multiplication. Related publications:
Coded computing for distributed gradient descentGradient methods such as stochastic gradient descent (SGD) are widely deployed for training optimization-based models on large datasets. In distributed training, the dataset is split into partitions assigned to different worker nodes, each of which computes gradients over its local data. A parameter server aggregates the gradients and updates the model. In synchronous SGD, even a single straggler worker can significantly delay the entire training iteration, as the server must wait for all workers before updating the model parameters. Coded computing offers a principled solution: by encoding the dataset partitions and distributing coded data among workers, the server can recover the full aggregate gradient from a subset of workers without waiting for stragglers. In my research, I propose lightweight projective derivative codes for compressed asynchronous gradient descent. Unlike prior coded computing work that encodes the raw data, this approach directly encodes the partial derivatives themselves and applies lossy compression on the derivative codewords. The coding scheme is designed by recognizing that the real projective space ℝℙⁿ, not the Hamming space, is the correct information metric for gradients, since noise is tolerable and sometimes even helpful in gradient descent. By maximizing the information contained within each codeword while minimizing redundancy between codewords, the scheme compresses the gradient in a manner that scales well with the number of workers, achieving lower communication complexity and memory overhead than the state of the art. The low-weight structure of the code further enables asynchronous gradient updates, so that a worker's result can be immediately incorporated into the model upon arrival, with zero decoding overhead at the master since it simply adds and subtracts the results from workers. This framework applies to general machine learning models including deep neural networks. Beyond tolerating a fixed number of stragglers, I further study gradient coding from the perspective of arbitrary straggler ignorance. Standard gradient coding (GC) fully recovers all gradients but is limited to tolerating at most c−1 stragglers; ignore-straggler SGD (IS-SGD) ignores an arbitrary number of stragglers but only partially recovers the gradients. I propose ignore-straggler gradient coding (IS-GC) that unifies both approaches, allowing the parameter server to flexibly choose how many workers to wait for and recover more gradients than IS-SGD while tolerating more stragglers than GC. IS-GC uses a conflict-graph model to characterize the dependencies among coded gradients from different workers, and I design graph-based decoding algorithms proven to maximize gradient recovery from any arbitrary set of available workers. IS-GC is applied to representative data placement schemes including fractional repetition and cyclic repetition, and I further propose hybrid repetition that generalizes both and achieves a flexible tradeoff between the recovery of gradients and the flexibility of choosing straggler tolerance parameters. Related publications:
Parallelism-aware coding for distributed data analyticsAnother important application supported by the distributed computing framework is big data analytics. The tasks running in a job of distributed data analytics typically take the input data from a distributed storage system, which stores partitions of the input file on multiple nodes. For example, it is common for a job running in Spark or Hadoop to take its input from the Hadoop distributed file system (HDFS). In order to take advantage of data locality, tasks in the data analytics framework are typically co-located on the same node that stores the corresponding input partition in the distributed storage system. As the nodes hosting the distributed storage system are also subject to faulty behaviors, erasure coding has been supported by most existing distributed storage systems. By adding parity partitions in additional nodes, the data loss caused by faulty nodes can be tolerated with low storage overhead. However, such parity partitions typically cannot be taken as input by the tasks of distributed data analytics. Therefore, the data parallelism, which refers to the number of partitions that can be read by different tasks simultaneously, is limited by existing erasure codes deployed in the distributed storage system. In my research, I propose a coding framework that can convert representative erasure codes for distributed storage systems, such as Reed-Solomon codes and locally repairable codes, into linearly equivalent codes that offer a much higher level of data parallelism. After such a conversion, the new code will encode data into partitions that contain both original data and parity data. Hence, the original data can be sequentially embedded into all partitions, instead of just some of them, and therefore data can be read and processed in parallel in all the tasks with a higher overall throughput. Moreover, the new code is linearly equivalent to the original code, and thus it will maintain desirable properties that tolerate faulty nodes in the distributed storage systems. Therefore, by adding more parity data in the partitions, a higher level of data parallelism can also be achieved. This framework has been further extended to support nodes with heterogeneous hardware/software configurations. Hence, the amount of original data in each task/node can be arbitrarily determined, leading to a flexible tradeoff between the computational overhead and the storage overhead. Related publications:
|