## Publications

### Preprint

- [P1]
**K. Konstantinidis** and A. Ramamoorthy, “Aspis: A Robust Detection System for Distributed Learning,” (preprint).

### Journal papers

- [J2]
**K. Konstantinidis** and A. Ramamoorthy, “Resolvable Designs for Speeding Up Distributed Computing,” IEEE/ACM Transactions on Networking (ToN), May 2020.
Source code
- [J1] L. Tang,
**K. Konstantinidis**, and A. Ramamoorthy, “Erasure Coding for Distributed Matrix Multiplication for Matrices With Bounded Entries,” IEEE Communications Letters (CL), vol. 23, no. 1, pp. 8-11, Jan. 2019.
Source code

### Conference papers

- [C3]
**K. Konstantinidis** and A. Ramamoorthy, “ByzShield: An Efficient and Robust System for Distributed Training,” Machine Learning and Systems (MLSys), April 2021 (to appear).
Source code
- [C2]
**K. Konstantinidis** and A. Ramamoorthy, “CAMR: Coded Aggregated MapReduce,” IEEE International Symposium on Information Theory (ISIT), pp. 1427-1431, Jul. 2019.
- [C1]
**K. Konstantinidis** and A. Ramamoorthy, “Leveraging Coding Techniques for Speeding up Distributed Computing,” IEEE Global Communications Conference (GLOBECOM), pp. 1-6, Dec. 2018.
Source code

## Paper review duties

- IEEE Transactions on Communications (TCOM).
- IEEE International Symposium on Information Theory (ISIT).

## Robust Machine Learning

##### Relevant paper: [C3].

In this work, we examine distributed deep learning setups in which computing devices may return adversarial or erroneous computations. This behavior is called
*Byzantine* and is typically attributed to adversarial attacks on the nodes or failures/crashes thereof. As a result, the corresponding
gradients are potentially unreliable for use in training and model updates. Achieving robustness in the face of Byzantine node behavior (which
can cause gradient-based methods to converge slowly or converge to a bad solution) is of crucial importance and has attracted significant
attention in the community. Our redundancy-based method *ByzShield* leverages the properties of bipartite expander graphs for the assignment
of tasks to workers; this helps to effectively mitigate the effect of the Byzantine behavior.

We consider an omniscient adversary who has full knowledge of the tasks assigned to the workers. The adversary is allowed to pick any subset of $q$ out of the $K$ worker nodes to attack at each iteration of the algorithm. Our model allows for collusion among the Byzantine workers to inflict maximum damage on the training process. In addition, we test our method against the most sophisticated attacks in the literature, among which is ALIE.

Our work demonstrates an interesting link between achieving robustness and spectral graph theory. Specifically, we demonstrate that using
optimal expander graphs (constructed from *mutually orthogonal Latin squares* and *Ramanujan* graphs) allows us to assign
redundant tasks to workers in such a way that the adversary's ability to corrupt the gradients is significantly reduced as compared
to the prior state of the art.
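To illustrate the combinatorial ingredient, the following is a minimal Python sketch of the classical construction of mutually orthogonal Latin squares for a prime order $n$; the actual task-assignment scheme built on top of such squares is described in [C3], and this snippet only shows the underlying objects.

```python
def mols(n):
    # For prime n, the squares L_k(i, j) = (k*i + j) mod n, for k = 1..n-1,
    # form a family of n-1 mutually orthogonal Latin squares.
    return [[[(k * i + j) % n for j in range(n)] for i in range(n)]
            for k in range(1, n)]

def orthogonal(a, b):
    # Two Latin squares are orthogonal if superimposing them yields
    # every ordered pair of symbols exactly once.
    n = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n
```

For example, `mols(5)` returns four squares of order 5, any two of which are orthogonal; each square (or pair of squares) can then index a redundant assignment of gradient tasks to workers.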

Our theoretical analysis shows that our method achieves a much lower fraction of corrupted gradients compared to other methods
(DETOX,
DRACO);
this is supported by numerical simulations which indicate over a **36%** reduction on average in the fraction of corrupted gradients.
Our experimental results on large-scale training for image classification (**CIFAR-10**) show that ByzShield has on average a
**20%** advantage in accuracy under the most sophisticated attacks.

The notation used here is the following:

- $q$: number of adversarial devices.
- $c_{\mathrm{max}}^{(q)}$: maximum number of gradients that a worst-case attack on ByzShield can distort.
- $\hat{\epsilon}$: fraction of distorted gradients achieved by each scheme.
- $\gamma$: upper bound on $c_{\mathrm{max}}^{(q)}$ of ByzShield, derived in our paper.

Let us define a scheme which does not assign redundant gradient tasks to workers as *baseline*. A baseline scheme directly
filters the gradients returned by the $K$ workers. A *redundancy*-based scheme such as ByzShield, DETOX and DRACO assigns multiple
copies of each gradient computation to different workers. A majority vote is then performed across all workers responsible for the same
gradient. Finally, the outputs of the majority voting pass through one more step of filtering (e.g., coordinate-wise median) in order
to decide the final gradient used for the model update.
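The vote-then-filter pipeline described above can be sketched as follows. This is a simplified illustration rather than the exact ByzShield/DETOX implementation; the `assignment` structure (which workers replicate which gradient task) is a hypothetical input.

```python
import numpy as np

def aggregate(grads, assignment):
    # grads[w]       : gradient vector returned by worker w
    # assignment[g]  : list of the workers assigned gradient task g
    voted = []
    for workers in assignment:
        copies = [tuple(grads[w]) for w in workers]
        # majority vote across the redundant copies of this gradient
        winner = max(set(copies), key=copies.count)
        voted.append(np.array(winner))
    # final robust filtering step: coordinate-wise median of the voted gradients
    return np.median(np.stack(voted), axis=0)
```

With redundancy 3, a single Byzantine copy of a gradient is outvoted by its two honest replicas before the median is even applied; the adversary must corrupt a majority of a task's replicas to distort that gradient.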

A very useful metric that serves as a good indicator of the convergence of the model is the fraction of distorted gradients
$\hat{\epsilon}$, as defined above. A set of simulations on a cluster of $K=15$ servers is presented in Table I. Since both DETOX and
DRACO use the *Fractional Repetition Code* (FRC), we present them in a unified manner. Note that ByzShield enjoys a much lower $\hat{\epsilon}$
than the other schemes in most cases.

Table I: Fraction of distorted gradients achieved by various schemes.

Some of our experimental results on CIFAR-10 using $K=25$ workers are shown in Figures I and II. The metric is the top-1 test accuracy. Our method outperforms the other schemes even for large values of $q$ (e.g., when $q=5$ out of the $K=25$ workers are under attack).

Figure I: ALIE attack and median-based defenses (CIFAR-10).

Figure II: Constant attack and signSGD-based defenses (CIFAR-10).

## Coding & Distributed Computing

##### Relevant papers: [J2], [C1].

Big data analytics often require distributed implementations in which clusters process vast amounts of data. *Hadoop MapReduce* and *Apache Spark* are
examples of such frameworks. These systems perform an initial processing during the *Map* phase, the machines communicate during the
*Shuffle* phase and a final computation is performed in the *Reduce* phase. Extensive evidence suggests that the Shuffle phase is rather
time intensive and can delay the entire algorithm.

Existing methods run multiple copies of each Map task on different servers and utilize coded transmissions to reduce the amount of data exchanged among the machines. The main issue of prior techniques is that they require the original job to be subdivided into a large number of Map tasks. This increases the encoding complexity and diminishes any promised gains. We show that one can simultaneously obtain low communication loads while ensuring that jobs do not need to be split too finely. Our approach explores the relation between the Map task assignment and resolvable designs.

We have obtained experimental results on **AWS EC2** clusters for a widely known distributed algorithm, namely **TeraSort**. Our method demonstrates a speedup of more than **4.69x**
over the baseline approach and more than **2.6x** over the current state of the art. The data set is of size 12GB, and 16 servers were used.
The main experimental results are presented in Table II and visualized in Figure III. Table II reports the time with and without the memory allocation cost, which turns out to be significant on AWS.

Table II: MapReduce time for sorting 12GB data on 16 server nodes.

Figure III: MapReduce time chart for sorting 12GB data on 16 server nodes.

## Aggregated MapReduce

##### Relevant papers: [J2], [C2].

We have extended our coded scheme for a general MapReduce framework to a class of distributed algorithms, broadly used in deep learning, in which intermediate computations of the same task can be combined. In this case, our goal is to execute multiple MapReduce jobs. Prior techniques reduce the communication load significantly. However, they suffer from a requirement of a similar flavor to subpacketization, i.e., they require a number of jobs that grows exponentially in the system parameters. We propose a new scheme which achieves the same load as the state of the art while ensuring that the number of jobs (and hence the number of subfiles) that the data set needs to be split into remains small.

Performing large matrix-vector multiplications is a key building block of several machine
learning algorithms. For instance, during each step of the forward propagation in deep neural
networks, the output of a layer is the result of multiplying the matrix of the input data set with
the weight vector. We formulate the matrix-vector product as a MapReduce
operation, reinterpreting the resolvable design in terms of jobs instead of subfiles, and test our algorithm by executing multiple such massive
multiplications with both the naive method and the proposed *CAMR* (*Coded Aggregated MapReduce*) scheme. Table III demonstrates the gain
obtained by multiplying 512 fat matrices on 20 AWS EC2 servers. We compare against the baseline method, which does not involve any coded transmissions, and outperform it by a speedup factor of **4.31x**.

Table III: Time for computing 512 products **Ab**, where **A**: 234000x100 and **b**: 100x1 on 20 servers.
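The "combinable intermediate computations" property that CAMR exploits is easy to see for a matrix-vector product: splitting **A** by columns makes every worker's partial result a term of a plain sum. The sketch below is illustrative only; the block sizes and two-worker split are hypothetical, not the scheduling used in our experiments.

```python
import numpy as np

def map_phase(A_blocks, b_blocks):
    # each worker multiplies its column block of A with the matching slice of b
    return [A_i @ b_i for A_i, b_i in zip(A_blocks, b_blocks)]

def reduce_phase(partials):
    # intermediate results of the same product can simply be summed --
    # this is exactly the aggregation structure the scheme relies on
    return sum(partials)

A = np.arange(12.0).reshape(4, 3)
b = np.array([1.0, 2.0, 3.0])
A_blocks = [A[:, :2], A[:, 2:]]   # column split across 2 workers
b_blocks = [b[:2], b[2:]]
y = reduce_phase(map_phase(A_blocks, b_blocks))
assert np.allclose(y, A @ b)
```

Because the Reduce step is a sum, partial results for the same job can be combined anywhere along the way, which is what allows coded transmissions to carry aggregated rather than individual intermediate values.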

## Straggler Mitigation in Matrix Multiplication

##### Relevant paper: [J1].

It is well recognized that *stragglers*, i.e., machines that perform considerably slower than average on a cluster, are a major bottleneck of distributed algorithms. In these cases, the result depends
on local computations carried out by all servers, and hence the execution may be delayed if one of them experiences performance issues. To motivate this problem, we refer to Figure IV, which reports
the average time to compute the product of two 2000x2000 matrices on 16 AWS EC2 t2.micro servers. Even though all machines have in theory the same specifications, we can see that two of them demonstrate
straggling behavior and their execution time has a large standard deviation.

Figure IV: Mean and st. deviation of computation time of 16 AWS EC2 t2.micro machines.

Recent work has demonstrated that straggler mitigation can be viewed as a problem of designing erasure codes. In a distributed matrix multiplication setup, the technique essentially maps the computation into
the multiplication of smaller (coded) submatrices. The stragglers are treated as erasures in this process. The computation can be completed as long as a certain number of workers (called the *recovery
threshold*) complete their assigned tasks. We have proposed a novel coding strategy for this problem when the absolute values of the matrix entries are sufficiently small. One of the main benefits that
our approach enjoys is that the threshold has been significantly reduced compared to prior methods.
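The erasure-coding view can be sketched with a simple polynomial-evaluation code for a distributed matrix-vector product. This mirrors the general recovery-threshold idea (in the spirit of polynomial codes) rather than the specific bounded-entry construction of [J1]; the block count, evaluation points, and sizes are hypothetical.

```python
import numpy as np

def encode(blocks, xs):
    # evaluate the "matrix polynomial" sum_i A_i * x^i at distinct points xs;
    # any k of the n evaluations determine the k original blocks (MDS property)
    return [sum(A * (x ** i) for i, A in enumerate(blocks)) for x in xs]

def decode(results, xs, k):
    # interpolate the coefficients A_i @ b from any k finished workers
    # by solving a small Vandermonde system
    V = np.vander(np.array(xs, dtype=float), k, increasing=True)
    flat = np.stack(results).reshape(k, -1)
    coeffs = np.linalg.solve(V, flat)
    return [c.reshape(results[0].shape) for c in coeffs]

A = np.arange(24.0).reshape(8, 3)
blocks = [A[:4], A[4:]]               # k = 2 row blocks of A
xs = [1.0, 2.0, 3.0, 4.0]             # n = 4 workers, one evaluation point each
b = np.array([1.0, -1.0, 2.0])
partials = [C @ b for C in encode(blocks, xs)]   # each worker's task
# recovery threshold is k = 2: pretend workers 0 and 2 straggle
decoded = decode([partials[1], partials[3]], [xs[1], xs[3]], k=2)
assert np.allclose(np.concatenate(decoded), A @ b)
```

The slow workers are treated exactly like erasures of a code: the master simply interpolates from the first $k$ responses and ignores the rest.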

Next we refer to Figure V. A master node initially generates random input matrices **A** and **B** of size 8000x8000 with elements in the set {0, 1,..., 50}. These matrices remained the same for all
experiments. *Computation latency* refers to the elapsed time from the point when all workers have received their inputs until enough of them finish their computations accounting for the decoding time.
For this experiment, our algorithm can tolerate up to 6 stragglers while the prior one is resilient to at most 2. We observe that when the number of stragglers $S$ reaches the corresponding tolerance of each scheme, the computation
latency jumps and the entire algorithm is significantly delayed by the slowest server.

Figure V: Comparison of total computation latency by simulating up to 8 stragglers.