Publications

Preprint

Journal papers

Conference papers

Paper review duties

  • IEEE Transactions on Communications (TCOM).
  • IEEE International Symposium on Information Theory (ISIT).

Robust Machine Learning

Relevant paper: [C3].

In this work, we examine distributed deep learning setups in which computing devices may return adversarial or erroneous computations. This behavior is called Byzantine and is typically attributed to adversarial attacks on the nodes or failures/crashes thereof. As a result, the corresponding gradients are potentially unreliable for use in training and model updates. Achieving robustness in the face of Byzantine node behavior (which can cause gradient-based methods to converge slowly or converge to a bad solution) is of crucial importance and has attracted significant attention in the community. Our redundancy-based method ByzShield leverages the properties of bipartite expander graphs for the assignment of tasks to workers; this helps to effectively mitigate the effect of the Byzantine behavior.

We consider an omniscient adversary who has full knowledge of the tasks assigned to the workers. The adversary is allowed to pick any subset of $q$ out of the $K$ worker nodes to attack at each iteration of the algorithm. Our model allows for collusion among the Byzantine workers to inflict maximum damage on the training process. In addition, we test our method against the most sophisticated attacks in the literature, among which is ALIE.
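To make the threat model concrete, below is a minimal Python/NumPy sketch of how such a colluding attack can be simulated on the gradients reported by the workers. The function name, the perturbation strength z, and the exact perturbation rule are illustrative assumptions in the spirit of ALIE, not the attack's actual parameters.

    import numpy as np

    def alie_style_attack(gradients, byzantine_ids, z=1.0):
        # gradients: array of shape (K, d), one gradient vector per worker
        # byzantine_ids: indices of the q workers chosen by the omniscient adversary
        # z: hypothetical perturbation strength (ALIE tunes this to evade filters)
        grads = np.array(gradients, dtype=float)
        honest = np.delete(grads, byzantine_ids, axis=0)
        mu, sigma = honest.mean(axis=0), honest.std(axis=0)
        malicious = mu - z * sigma          # colluding value shared by all q attackers
        grads[byzantine_ids] = malicious
        return grads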

Our work demonstrates an interesting link between achieving robustness and spectral graph theory. Specifically, we demonstrate that using optimal expander graphs (constructed from mutually orthogonal Latin squares and Ramanujan graphs) allows us to assign redundant tasks to workers in such a way that the adversary's ability to corrupt the gradients is significantly reduced as compared to the prior state of the art.
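For intuition, the following sketch constructs the standard family of mutually orthogonal Latin squares of prime order $l$; the actual ByzShield task assignment built on top of such squares is described in [C3], so this snippet should be read as background rather than the scheme itself.

    import numpy as np

    def mols(l):
        # Standard construction of l-1 mutually orthogonal Latin squares of
        # prime order l: the k-th square is L_k[i, j] = (k*i + j) mod l.
        assert l > 1 and all(l % p for p in range(2, l)), "l must be prime"
        return [np.fromfunction(lambda i, j: (k * i + j) % l, (l, l), dtype=int)
                for k in range(1, l)]

    # Superimposing any two of these squares produces every ordered pair of
    # symbols exactly once; this orthogonality is what limits how many
    # redundant task groups a fixed set of Byzantine workers can corrupt.
    squares = mols(5)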

Our theoretical analysis shows that our method achieves a much lower fraction of corrupted gradients compared to other methods (DETOX, DRACO); this is supported by numerical simulations, which indicate over a 36% reduction on average in the fraction of corrupted gradients. Our experimental results on large-scale training for image classification (CIFAR-10) show that ByzShield has, on average, a 20% advantage in accuracy under the most sophisticated attacks.

The notation used here is the following:

  • $q$: number of adversarial devices.
  • $c_{\mathrm{max}}^{(q)}$: maximum number of gradients that a worst-case attack on ByzShield can distort.
  • $\hat{\epsilon}$: fraction of distorted gradients achieved by each scheme.
  • $\gamma$: upper bound on $c_{\mathrm{max}}^{(q)}$ of ByzShield, derived in our paper.

Let us refer to a scheme which does not assign redundant gradient tasks to workers as a baseline scheme. A baseline scheme directly filters the gradients returned by the $K$ workers. A redundancy-based scheme such as ByzShield, DETOX, or DRACO assigns multiple copies of each gradient computation to different workers. Then, majority voting is performed across all workers responsible for the same gradient. Finally, the outputs of the majority voting pass through one more step of filtering (e.g., coordinate-wise median) in order to decide the final gradient used for the model update.
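The following is a minimal sketch of this two-stage filtering on the parameter server side; the function names and the choice of coordinate-wise median for the second stage are illustrative assumptions rather than the exact ByzShield aggregator.

    import numpy as np

    def aggregate(returned, assignment):
        # returned[w]  : gradient vector reported by worker w (possibly corrupted)
        # assignment[g]: list of the workers asked to compute gradient g
        voted = []
        for workers in assignment:
            copies = [tuple(returned[w]) for w in workers]
            # honest replicas of the same task are identical, so the copy
            # reported by the largest number of workers wins the majority vote
            winner = max(set(copies), key=copies.count)
            voted.append(np.array(winner))
        # second filtering stage: coordinate-wise median across the voted gradients
        return np.median(np.stack(voted), axis=0)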

A very useful metric that serves as a good indicator of the convergence of the model is the fraction of distorted gradients $\hat{\epsilon}$, as defined above. A set of simulations on a cluster of $K=15$ servers is presented in Table I. Since both DETOX and DRACO use the Fractional Repetition Code (FRC), we present them in a unified manner. Note that ByzShield enjoys a much lower $\hat{\epsilon}$ than the other schemes in most cases.

Table I: Fraction of distorted gradients achieved by various schemes.

Some of our experimental results on CIFAR-10 using $K=25$ workers are shown in Figures I and II. The metric is the top-1 test accuracy. Our method outperforms the other schemes even for large values of $q$ (e.g., when $q=5$ out of the $K=25$ workers are under attack).

Figure I: ALIE attack and median-based defenses (CIFAR-10).

Figure II: Constant attack and signSGD-based defenses (CIFAR-10).

Coding & Distributed Computing

Relevant papers: [J2], [C1].

Big data analytics often require distributed implementations in which clusters process massive amounts of data. Hadoop MapReduce and Apache Spark are examples of such frameworks. These systems perform initial processing during the Map phase, the machines communicate during the Shuffle phase, and a final computation is performed in the Reduce phase. Extensive evidence suggests that the Shuffle phase is rather time-intensive and can delay the entire algorithm.
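As a toy, single-machine illustration of the three phases (and of why the Shuffle phase is communication-heavy), here is a hedged sketch; the bucketing rule and the input splits are made up for the example and have nothing to do with TeraSort itself.

    from collections import defaultdict

    def map_phase(split):
        # each mapper emits (key, value) pairs from its own input split
        return [(record % 10, record) for record in split]

    def shuffle_phase(mapper_outputs):
        # all pairs with the same key are routed to the same reducer; on a real
        # cluster this all-to-all exchange is the time-intensive Shuffle phase
        buckets = defaultdict(list)
        for output in mapper_outputs:
            for key, value in output:
                buckets[key].append(value)
        return buckets

    def reduce_phase(buckets):
        # toy reducer: sort the values that landed in each bucket
        return {key: sorted(values) for key, values in buckets.items()}

    splits = [[13, 7, 42], [8, 23, 91]]
    result = reduce_phase(shuffle_phase([map_phase(s) for s in splits]))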

Existing methods run multiple copies of each Map task on different servers and utilize coded transmissions to reduce the amount of data exchanged among the machines. The main issue with prior techniques is that they require the original job to be subdivided into a large number of Map tasks; this increases the encoding complexity and diminishes any promised gains. We show that one can simultaneously obtain low communication loads while ensuring that jobs do not need to be split too finely. Our approach explores the relation between the Map task assignment and resolvable designs.
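A toy example of the kind of coded transmission involved (the classical three-server example from the coded MapReduce literature, not the resolvable-design assignment of [J2], [C1]) is sketched below: because every Map task is replicated on two servers, a single XOR-coded packet can serve two recipients at once.

    import numpy as np

    # iv[X] stands in for the intermediate value computed from file X during Map.
    iv = {name: np.random.randint(0, 256, 4, dtype=np.uint8) for name in "ABC"}

    # Replicated Map assignment: each file is processed on two of the three servers.
    placement = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"A", "C"}}

    # Server 1 multicasts a single XOR-coded packet that simultaneously serves
    # server 2 (which is missing A) and server 3 (which is missing B).
    coded_packet = iv["A"] ^ iv["B"]

    # Each recipient cancels the part it already computed locally.
    recovered_A_at_2 = coded_packet ^ iv["B"]
    recovered_B_at_3 = coded_packet ^ iv["A"]
    assert (recovered_A_at_2 == iv["A"]).all() and (recovered_B_at_3 == iv["B"]).all()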

We have obtained experimental results on AWS EC2 clusters for a widely known distributed algorithm, namely TeraSort. Our method achieves a speedup of more than 4.69x over the baseline approach and of more than 2.6x over the current state of the art. The data set is 12GB in size and 16 servers were used. The main experimental results are presented in Table II and visualized in Figure III. Table II reports the time with and without the memory allocation cost, which turns out to be significant on AWS.

Table II: MapReduce time for sorting 12GB data on 16 server nodes.

Figure III: MapReduce time chart for sorting 12GB data on 16 server nodes.

Aggregated MapReduce

Relevant papers: [J2], [C2].

We have extended our coded scheme for the general MapReduce framework to a class of distributed algorithms, broadly used in deep learning, in which intermediate computations of the same task can be combined. In this case, our goal is to execute multiple MapReduce jobs. Prior techniques reduce the communication load significantly. However, they suffer from a requirement similar in flavor to subpacketization, i.e., they require a number of jobs that grows exponentially in the system parameters. We propose a new scheme which achieves the same load as the state of the art while ensuring that the number of jobs (and hence the number of subfiles that the data set needs to be split into) remains small.

Performing large matrix-vector multiplications is a key building block of several machine learning algorithms. For instance, during each step of forward propagation in deep neural networks, the output of a layer is the result of multiplying the matrix of the input data set with the weight vector. We formulate the matrix-vector product as a MapReduce operation and test our algorithm by executing multiple such massive multiplications using both the naive method and the proposed CAMR (Coded Aggregated MapReduce) scheme. In this setting, we give a different interpretation of the resolvable design in terms of jobs instead of subfiles. Table III demonstrates the gain obtained by computing 512 such matrix-vector products on 20 AWS EC2 servers. We compare against the baseline method, which does not involve any coded transmissions, and outperform it by a speedup factor of 4.31x.
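A minimal sketch of how a single matrix-vector job decomposes into Map and Reduce steps whose intermediate results can be aggregated (the column split, the number of tasks, and the reduced matrix size are illustrative assumptions, not the CAMR scheme itself):

    import numpy as np

    def map_tasks(A, b, num_tasks):
        # column-split the job so that each Map task returns a partial product
        index_blocks = np.array_split(np.arange(A.shape[1]), num_tasks)
        return [A[:, idx] @ b[idx] for idx in index_blocks]

    def reduce_task(partials):
        # aggregation: intermediate results of the same job are combined by addition
        return np.sum(partials, axis=0)

    # Small stand-in for the 234000x100 matrices of Table III.
    A, b = np.random.randn(2340, 100), np.random.randn(100)
    assert np.allclose(reduce_task(map_tasks(A, b, num_tasks=4)), A @ b)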

Table III: Time for computing 512 products $Ab$, where $A$ is 234000x100 and $b$ is 100x1, on 20 servers.

Straggler Mitigation in Matrix Multiplication

Relevant paper: [J1].

It is well recognized that stragglers, i.e., machines that perform considerably slower than average on a cluster, are a major bottleneck of distributed algorithms. In these cases, the result depends on local computations carried out by all servers and hence the execution may be delayed if one of them is experiencing performance issues. In order to motivate this problem, we refer to Figure IV, which reports the average time to compute the product of two 2000x2000 matrices on 16 AWS EC2 t2.micro servers. Even though all machines have, in theory, the same specifications, we can see that two of them demonstrate straggling behavior and their execution time has a large standard deviation.

Figure IV: Mean and standard deviation of the computation time of 16 AWS EC2 t2.micro machines.

Recent work has demonstrated that straggler mitigation can be viewed as a problem of designing erasure codes. In a distributed matrix multiplication setup, the technique essentially maps the computation into the multiplication of smaller (coded) submatrices. The stragglers are treated as erasures in this process. The computation can be completed as long as a certain number of workers (called the recovery threshold) complete their assigned tasks. We have proposed a novel coding strategy for this problem when the absolute values of the matrix entries are sufficiently small. One of the main benefits of our approach is that the recovery threshold is significantly reduced compared to prior methods.
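For context, the snippet below sketches a generic MDS-coded matrix multiplication with a recovery threshold of $k$ out of $n$ workers, using a real Vandermonde generator; it illustrates the erasure-coding viewpoint described above, not the specific construction of [J1], and all sizes are hypothetical.

    import numpy as np

    def encode(A, n, k):
        # split A into k row blocks and form n coded blocks; any k suffice to decode
        blocks = np.stack(np.array_split(A, k, axis=0))                       # (k, m/k, d)
        G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)  # (n, k)
        return np.einsum("nk,kmd->nmd", G, blocks), G

    def decode(results, ids, G):
        # invert the k x k generator submatrix of the k fastest (non-straggling) workers
        blocks = np.einsum("js,smd->jmd", np.linalg.inv(G[ids]), np.stack(results))
        return np.concatenate(blocks, axis=0)

    m, d, p, n, k = 60, 8, 5, 6, 4                  # hypothetical sizes: any 4 of 6 workers suffice
    A = np.random.randint(0, 5, (m, d)).astype(float)
    B = np.random.randint(0, 5, (d, p)).astype(float)
    coded_A, G = encode(A, n, k)
    fastest = [0, 2, 3, 5]                          # two stragglers never return
    worker_results = [coded_A[i] @ B for i in fastest]
    assert np.allclose(decode(worker_results, fastest, G), A @ B)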

Next, we refer to Figure V. A master node initially generates random input matrices A and B of size 8000x8000 with elements in the set {0, 1, ..., 50}. These matrices remained the same for all experiments. Computation latency refers to the elapsed time from the point when all workers have received their inputs until enough of them finish their computations, accounting for the decoding time. For this experiment, our algorithm can tolerate up to 6 stragglers while the prior one is resilient to at most 2 stragglers. We observe that when the number of stragglers $S$ reaches the corresponding limit for each scheme, the computation latency jumps and the entire algorithm is significantly delayed by the slowest server.

Figure V: Comparison of total computation latency by simulating up to 8 stragglers.