Problem Statement
The habit of programming by copy-and-paste is one of the sources for similar fragments of code in many software systems. Such fragments are called clones. It is found that cloning also occurs in model-based software development (MBD). Clones create difficulties for software maintenance, for example, changes to a cloned fragment must be carried out in several other places in the codebase.
Many approaches have been proposed to detect clones in traditional software as well as in model-based software. Among them, structure-oriented approaches are the most popular and successful in both code-based and model-based clone detection. In structure-oriented approaches, software artifacts including source code and models are represented as tree-based and/or graph-based data structures. For example, abstract syntax trees (ASTs) or parse trees are used for source code and attributed, directed graphs are used for program dependency graphs (PDGs), call graphs, and models in MBD.
In those approaches, the common detection process include (1) modeling the artifacts and their fragments (i.e. potential clone parts) in the form of a structure-oriented representation, (2) extracting features of each fragment from its representation, (3) computing the similarity between fragments using a similarity measure on the features, and (4) grouping similar fragments into clone groups. The key phases (3) and (4) involve the comparison of features extracted from the structure-oriented representation of fragments in phase (2). The key feature in those structure-oriented approaches is the internal structure of a (code or model) fragment. However, methods for comparing tree-based or graph-based structures, such as tree editing distance measurement or graph isomorphism are computationally expensive for scalable and efficient clone detection.
To avoid that high complexity, several light-weight approaches for clone detection have been proposed. However, the accuracy is not satisfactory. Methods for extracting structural features in artifacts are still simple. One of the state-of-the-art light-weight approaches, Deckard , uses a vector-based approach to extract the structural information in an AST or a parse tree. They proposed to use the occurrence counts of each q-level complete binary subtree in such a tree as characteristic vectors. However, for q > 1, this method does not work for graph-based representations. For q = 1 (the case that Deckard tool is implemented), a fragment is characterized by a vector in which each element is the occurrence counts of single AST node types. However, occurrence counts of only single node types is insufficient to capture well structural information.

Figure 1: Different fragments with similar occurrence-count vectors of single node types
The illustrated example in Figure 1 shows two code fragments. They quite differ from each other because they have very different nesting structures of program constructs (e.g. a "for" encloses an "if" and vice versa). In addition, in two methods, the sequential structures (i.e. the orders) of statements are also different, despite that each contains a declaration, a "for", an "if", and a "return" statement. Unfortunately, the above two code fragments have very similar Deckard representation vectors since they have similar numbers of occurrences of single AST node types, e.g., method and variable declarations, "if" and "for" statements, expressions, literals, simple names, etc. Thus, two fragments would be incorrectly detected by Deckard's vector-based approach.
In brief, existing methods for capturing structural information in software artifacts for clone detection are either too computationally expensive to be efficient or too light-weight to be accurate. Importantly, no light-weight approach has been proposed for graph-based structure. In this paper, we present a novel approach, Exas, for better approximating the structure as a feature of tree-based and graph-based representations in software artifacts. At the same time, Exas achieves high levels in term of both accuracy and efficiency. In Exas, features are the sequences of labels and numbers built from nodes, edges, and paths of various lengths in the graph-based representation. A structural characteristic vector for a fragment contains the occurrence counts of all kinds of its features.