Synthesis of Application Specific Instruction Sets

Ing-Jer Huang, Member, IEEE and Alvin M. Despain, Member, IEEE

Abstract—An instruction set serves as the interface between hardware and software in a computer system. In an application specific environment, the system performance can be improved by designing an instruction set that matches the characteristics of hardware and the application. We present a systematic approach to generate application-specific instruction sets so that software applications can be efficiently mapped to a given pipelined microarchitecture. The approach synthesizes instruction sets from application benchmarks, given a machine model, an objective function, and a set of design constraints. In addition, assembly code is generated to show how the benchmarks can be compiled with the synthesized instruction set. The problem of designing instruction sets is formulated as a modified scheduling problem. A binary tuple is proposed to model the semantics of instructions and integrate the instruction formation process into the scheduling process. A simulated annealing scheme is used to solve for the schedule. Experiments have shown that the approach is capable of synthesizing powerful instructions for modern pipelined microprocessors, and running with reasonable time and a modest amount of memory for large applications.

I. INTRODUCTION

MICROPROCESSORS (instruction set processors) offer a flexible and low-cost solution for embedded systems with complex algorithms or control intensive applications. The performance of a microprocessor-based system depends on how efficiently the application is mapped to the hardware. One key issue determining the success of the mapping is the design of the instruction set, which serves as the interface between the hardware and application (software). How to design an instruction set that closely matches the characteristics of the hardware and the application is an important design problem.

The design of instruction sets was once viewed as a design process independent of the design of the hardware (microarchitecture). Instruction sets designed under this principle, such as those of many mainframe computers, suffered from the fact that their supporting hardware was difficult to speed up, or that hardware was wasted due to the low utilization rate of the related instructions in real applications. The necessity of closely matching the design of instruction sets with the design of micro-architectures was recognized and adopted in the design of many modern RISC-style pipelined processors, in order to achieve a better performance and cost trade-off. However, in most design projects, the designs were carried out manually, which limited the exploration of the design space and the understanding of the interaction between hardware and software. CAD tools are necessary to explore and manage such a complex design space. While there has been much progress in automating instruction set processor design, most of the work synthesizes micro-architectures at the RTL level from given instruction sets (e.g., [11], [13], and [14]). How to systematically design instruction sets which closely match the characteristics of hardware and software is still an open problem. The goal of our research is thus to investigate the instruction set design problem in a systematic way. The research intends to provide further understanding of the design and interaction of the hardware and software interface.

In this paper we present the problem formulation and the algorithm of a systematic approach [7] which synthesizes application-specific instruction sets for parameterized, pipelined micro-architectures from a given application benchmark. The problem is formulated as a modified scheduling problem, with the micro-operations (MOP's) representing the application benchmark as the nodes to be scheduled, subject to several design constraints. Instructions are formed by an instruction formation process which is integrated into the scheduling process. The compiled code of the application is generated using the synthesized instruction set. A simulated annealing scheme is used to solve for the schedule and the instruction set. The design issues addressed in this approach include instruction utilization, instruction operand encoding, delayed load/store, and delayed branches.

The rest of the paper is organized as follows. Section II reviews related work. Section III presents the models for the micro-architectures, instruction sets and application benchmarks. Sections IV and V describe the problem formulation and algorithm, respectively. Section VI demonstrates our techniques with some experiments. Section VII discusses the current status, limitations, and future directions.

II. RELATED WORK

Most of the early work in automatic instruction set design views the design problem as a design process independent of the hardware implementation. Instructions were not restricted to single-cycle instructions since multiple-cycle instructions can be supported through micro-programming (firmware). Without knowledge of the decode/control complexity, the focus was mainly on directly supporting high-level languages or increasing the code density. The results were CISC-like instructions. These studies include Haney's [1], Bosse's [2], and Bennett's [3] work. These techniques are not suitable for designing instruction sets for modern pipelined processors.

Sato et al. [6] propose an integrated design framework for application specific instruction set processors. This framework generates profiling information from a given set of application benchmarks and their expected data. Based on the profiles, the design system customizes an instruction set from a super set, decides the hardware architecture (derived from the GCC's abstract machine model), and the related software development tools. This framework is similar to our work in terms of the inputs and outputs of the design system; however, it is different from ours in terms of the machine model and the design method. They assume a sequential (nonpipelined) machine model, whereas we assume a pipelined machine with a data-stationary control model. In addition, they generate instruction sets by selecting subsets from a super set, whereas we synthesize the instruction sets directly in order to find new and useful instructions for the given application domain.

Different from previous approaches, Holmer [4], [5] focuses on generating instruction sets which are closely coupled to the underlying micro-architecture. As pipelined micro-architectures proved their superiority in the 1980's, Holmer adopts the modern pipeline control model (data stationary control) and a simple, parameterized data path as the underlying micro-architecture model. The parameters for a data path include the number of read/write register ports, the number of memory ports, the number of functional units, and the cycle counts for memory operations. The user specifies the parameters, and then invokes the system to find the set of instructions which best utilizes the hardware resources such that minimal cycle counts for the benchmarks are achieved. Our work builds on the results of Holmer and improves the problem formulation and synthesis algorithms, in order to generate application-specific instruction sets and compiled code for microprocessor-based embedded systems.

Another design problem that is close to the instruction set design problem is microcode compaction [15]–[17]. However, it differs in terms of the design space and design goals. The micro-instructions do not have “opcodes” (and hence the semantics) and the goal of microcode compaction is to reduce the number of cycles to execute a microprogram. On the other hand, in the instruction set design, the size of the instruction set is determined by both syntax and semantics. The goal of the instruction set design is to optimize and trade off the instruction set size, the program size, and the number of cycles to execute a program.

III. DESIGN MODELS

In this section we present the models of instruction sets, micro-architectures and application benchmark programs, and describe how they are represented in our design system.

<table>
<thead>
<tr>
<th>Instruction Field Type</th>
<th>Number of bits</th>
</tr>
</thead>
<tbody>
<tr>
<td>Instruction word</td>
<td>32</td>
</tr>
<tr>
<td>opcode</td>
<td>6</td>
</tr>
<tr>
<td>register (R)</td>
<td>5</td>
</tr>
<tr>
<td>tag (T)</td>
<td>5</td>
</tr>
<tr>
<td>displacement (S)</td>
<td>16</td>
</tr>
<tr>
<td>immediate (I)</td>
<td>16</td>
</tr>
<tr>
<td>relation operator (OP)</td>
<td>2</td>
</tr>
</tbody>
</table>

Table I. Bit Width Specification for Some Instruction Field Types

Fig. 1. Examples of instruction formats.

Fig. 2. Variation in data path for different instruction sets. (a) Register file with constant '1'. (b) Register file.

A. Instruction Sets

The instruction set is assumed to be of fixed word length, typically 32 b, which is specified by the designer. An instruction consists of fields. The fields are a combination of some field types. For example, the instruction add(R1, R2, Immed) consists of an opcode field add, two register index fields R1 and R2, and one immediate data field Immed. The bit width of each field type is provided by the designer. Table I lists the specification of some instruction field types and their bit widths, taken from the BAM instruction set [19]. Each instruction has one opcode field, but the use of other fields is constrained only by the total number of bits needed by the operations in the instruction.

Fig. 1 lists the instruction formats for the instructions add(R1, R2, Immed) ‘R1 ← R2 + Immed’ and inc(R) ‘R ← R + 1’, based on the bit width specification in Table I. Note that there are 21 b unused in the format of inc.

The operands of instructions can be encoded in the opcodes. There are two ways to encode operands. First, a specific value can be permanently assigned to an operand and becomes implicit to the opcode. Second, the register specifiers can be unified. For example, the instruction inc is obtained from the general instruction add. The facts of R1 = R2 (unifying register specifiers; i.e., both register accesses refer to the same physical register) and Immed = 1 (fixing an operand to a specific value which becomes implicit) are encoded into the opcode inc. Encoding operands saves instruction fields, at the cost of possibly larger instruction set size, additional connections and hardwired constants in the data path. For example, adding the instruction inc to the instruction set increases the instruction set size by one, and adds a hardwired constant ‘1’ and an additional multiplexer in the data path, as shown in Fig. 2.

Furthermore, encoding allows more MOP's to be packed into a single instruction. For example, if we find that the values of two independent registers are often increased by one at the same time, we may then devise a new instruction $\text{incd}(R_1, R_2)$ which performs the two MOP's "$R_1 \leftarrow R_1 + 1; R_2 \leftarrow R_2 + 1$" concurrently. This instruction uses only 16 b, as opposed to the 58 b used by its generalized form "$R_1 \leftarrow R_2 + \text{Immed}_1; R_3 \leftarrow R_4 + \text{Immed}_2$", which does not meet the instruction word width constraint for 32-b instructions.
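The bit counts quoted in this subsection follow directly from Table I. The short Python sketch below recomputes them; the function name `instruction_bits` and the list-of-field-types representation are illustrative assumptions, not part of the design system described here.

```python
# Field widths in bits, copied from Table I.
FIELD_BITS = {"opcode": 6, "R": 5, "T": 5, "S": 16, "I": 16, "OP": 2}
WORD_BITS = 32  # instruction word width from Table I


def instruction_bits(explicit_fields):
    """Opcode width plus the widths of all explicit (non-encoded) fields.

    `explicit_fields` is a list of field type names (hypothetical
    representation); encoded operands are simply left out of the list.
    """
    return FIELD_BITS["opcode"] + sum(FIELD_BITS[f] for f in explicit_fields)


add_bits = instruction_bits(["R", "R", "I"])   # add(R1, R2, Immed)
inc_bits = instruction_bits(["R"])             # inc(R): R1 = R2, Immed = 1 encoded
incd_bits = instruction_bits(["R", "R"])       # incd(R1, R2)
general_bits = instruction_bits(["R", "R", "I", "R", "R", "I"])  # generalized form

print(add_bits, WORD_BITS - inc_bits, incd_bits, general_bits)  # 32 21 16 58
```

The generalized two-MOP form exceeds the 32-b word, whereas the encoded form incd fits with room to spare, which is the trade-off the text describes.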

B. Micro-Architectures

The micro-architectures considered in this work are pipelined. For example, Fig. 3(a) shows a basic pipeline, which can be functionally partitioned into six pipeline stages: instruction fetch (IF), instruction decode (ID), register read (R), arithmetic/logic operation (A), memory access (M), and register write (W). Each functional stage may take more than one cycle, and can be further pipelined. The first two stages are identical for all instructions. The last four stages, the instruction execution stages, depend on the semantics of the instructions. The combination of pipeline stages can be varied. For example, the pipeline 'IF-ID/R-A-M-W' of Fig. 3(b) can be derived by merging the register-read stage with the instruction-decode stage, at the cost of restricting the instructions to use a single format for register specification such that registers can always be prefetched at the instruction-decode stage. On the other hand, the pipeline 'IF-ID/R-A/M-W' of Fig. 3(c) is derived by merging the arithmetic stage with the memory stage, at the cost of eliminating the displacement addressing mode. The displacements have to be computed by other instructions preceding the memory-related instructions.

The pipeline is controlled in a data stationary fashion [9]. In data stationary control, the opcode flows through the pipeline in synchronization with the data being processed in the data path. Fig. 4 shows the relationship between the control path with the data stationary model and the data path. The register files at the top and bottom are the same register file. They are duplicated for ease of readability. Opcodes are forwarded to the next stages synchronously. At each stage, the opcode, together with possible status bits from the data path, is decoded to generate the control signals necessary to drive the data path.

This pipeline configuration supports single-cycle instructions which are typical of modern RISC-style processors. Multiple-cycle instructions can be accommodated with some modification to the linear pipeline such as the insertion of internal opcodes [10]. To manage the complexity of this research, general multiple-cycle instructions are not considered at this moment. However, multiple-cycle arithmetic/logic operations, memory access, and change of control flow (branch/jump/call) are supported by specifying the delay cycles as design parameters.

**The Specification for the Target Micro-Architecture:** The target micro-architecture can be fully described by specifying the supported MOP's and a set of parameters. The supported MOP's describe the functionality supported by the micro-architecture, and the connectivity among modules in the data path. For example, the first two columns of Table II list some of the MOP's supported in the VLSI-BAM microprocessor [20] and their corresponding MOP type ID's. The basic pipeline structure of the microprocessor is the same as Fig. 3(b).

The tabulated specification easily supports variations of the micro-architecture. For example, the pipeline configuration 'IF-ID/R-A/M-W' in Fig. 3(c) can be derived by eliminating the MOP's rmd, mrd and marad from Table II.

The set of parameters describes resource allocation and timing. The parameters include the number of register-file read/write ports, number of memory ports, number of functional units, the sizes of the register file and memory, latencies of operations, and the delay cycles between operations of memory access, functional units and control flow change.

1 A single-cycle instruction has instruction latency of one cycle.

---

**TABLE II**

<table>
<thead>
<tr>
<th>Type ID</th>
<th>MOP*</th>
<th>Instruction Format Cost</th>
<th>Hardware Cost</th>
</tr>
</thead>
<tbody>
<tr>
<td>$m$</td>
<td>$R_1 \leftarrow R_2$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$r$</td>
<td>$R_1 \leftarrow R_2$</td>
<td>2 &amp; 1</td>
<td>2 &amp; 1</td>
</tr>
<tr>
<td>$t$</td>
<td>$R_1 \leftarrow \text{Immed}_1$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$n$</td>
<td>$R_1 \leftarrow \text{Immed}_2$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$r$</td>
<td>$R_1 \leftarrow \text{Immed}_1$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$m$</td>
<td>$R_1 \leftarrow \text{Immed}_2$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$r$</td>
<td>$R_1 \leftarrow \text{Immed}_1$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$m$</td>
<td>$R_1 \leftarrow \text{Immed}_2$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$r$</td>
<td>$R_1 \leftarrow \text{Immed}_1$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
<tr>
<td>$m$</td>
<td>$R_1 \leftarrow \text{Immed}_2$</td>
<td>1 &amp; 1</td>
<td>1 &amp; 1</td>
</tr>
</tbody>
</table>

---

* The operator "\)" appends a tag to a value before the value is sent to a destination.

* The characters 'F', 'R', 'W', 'D' indicate the tag of a register-file, memory, memory port.

* The characters 'D' and 'F' in a parameter indicate the number of a particular hardware component. For example, '2R' means two-read ports for register file.
TABLE III
Parameters for the Target Microarchitectures: Resource Size and Latency

<table>
<thead>
<tr>
<th>Resource Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Read port, register file (R)</td>
</tr>
<tr>
<td>Write port, register file (W)</td>
</tr>
<tr>
<td>Memory read/write port (M)</td>
</tr>
<tr>
<td>Functional unit (F)</td>
</tr>
<tr>
<td>Register file size</td>
</tr>
<tr>
<td>Memory size</td>
</tr>
</tbody>
</table>

Table III is an example of the resource parameters for the VLSI-BAM microprocessor. The resource parameters specified in this table include the numbers and sizes of resources, and their operation latencies. Table IV lists the delay parameters for various pairs of operations. For example, the M-A pair in the table specifies that there should be one cycle delay between a memory operation and a succeeding (dependent) arithmetic operation.

Note that the existence of bypassing buses in the data path can be modeled by the delay parameters. For example, if we remove the bypassing bus in the 'A' stage in Fig. 4, then the delay cycles for the A-A, A-M, and A-C pairs all become one, instead of zero.

Each MOP supported by the data path is assigned costs for the instruction format and hardware resources. The costs of the instruction format are the instruction fields required to operate the MOP, including register indexes, function selectors, and immediate data. The hardware costs are the hardware resources required to support the MOP. The hardware resources include read/write ports of the register file, memory ports, and functional units. The third and fourth columns in Table II list the costs for the corresponding MOP's.

C. Application Benchmarks

Each application benchmark is represented as a group of weighted basic blocks. The weight is defined by the designer, and is usually used to indicate how many times the basic block is executed in the benchmark. The basic blocks are mapped to control/data flow graphs (CDFG's) of MOP's, based on the given MOP specification. Different micro-architectures result in different MOP specifications, which may map the basic blocks to different CDFG's. Fig. 5 shows an example of a basic block, which consists of six MOP's, based on the MOP specification in Table II. The bold labels before the MOP's are the MOP ID's. The dashed arrows are control dependencies; the MOP MO6 changes the control flow at the end of the basic block, and hence logically follows the other MOP's in the block. The solid arrows are data-related dependencies. The data-related dependencies can be classified into three categories: read-after-write (RAW), write-after-read (WAR), and write-after-write (WAW). They all specify a before relation: the preceding MOP has to be scheduled before the succeeding MOP, except in micro-architectures where master-slave latches are used to implement registers. In this case, the WAR dependency indicates a no-later-than relation: the preceding MOP has to be scheduled no later than the succeeding MOP. The data dependencies in the figure are all WAR's.
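The three dependency categories can be derived from the registers each MOP reads and writes. The sketch below is a minimal illustration under that assumption; the `Mop` record, the register names, and the `master_slave_registers` flag are hypothetical and are not taken from Fig. 5 or Table II.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mop:
    """Hypothetical MOP record: the storage it reads and the storage it writes."""
    name: str
    reads: frozenset
    writes: frozenset


def dependencies(first, second):
    """Data dependency kinds from `first` to a MOP `second` that follows it."""
    kinds = []
    if first.writes & second.reads:
        kinds.append("RAW")
    if first.reads & second.writes:
        kinds.append("WAR")
    if first.writes & second.writes:
        kinds.append("WAW")
    return kinds


def may_share_time_step(first, second, master_slave_registers):
    """RAW and WAW force a strict 'before' relation. A pure WAR relaxes to
    'no later than' when registers can be read and written in the same cycle."""
    kinds = dependencies(first, second)
    if not kinds:
        return True
    return master_slave_registers and kinds == ["WAR"]


x = Mop("x", reads=frozenset({"r1"}), writes=frozenset({"r2"}))
y = Mop("y", reads=frozenset({"r2"}), writes=frozenset({"r3"}))
z = Mop("z", reads=frozenset({"r4"}), writes=frozenset({"r1"}))

print(dependencies(x, y))                   # ['RAW']
print(dependencies(x, z))                   # ['WAR']
print(may_share_time_step(x, z, True))      # True
```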

IV. INSTRUCTION SET DESIGN AS A MODIFIED SCHEDULING PROBLEM

The instruction set design problem can be formulated as a modified scheduling problem (Fig. 6). The inputs of the problem are: an application represented in CDFG's, constraints on the instruction word and field widths and on hardware resources, the objective function, and the micro-architecture specification. The MOP's in the CDFG's are scheduled into time steps, subject to various constraints to be discussed later. While the MOP's are scheduled into time steps, instructions are formed at the same time. Finally, the outputs of this problem formulation are a synthesized instruction set and compiled code.

Two schedules of the MOP's in Fig. 5 are shown in Tables V and VI, respectively. In the first column of the table are time steps, and in the second column are the ID's of the MOP's scheduled into the corresponding time step. In this example we assume a one-cycle delay for the jump MOP (MO6) and zero-cycle delay for memory MOP's (MO0 and MO3). The schedule in Table V is a serialized one, with seven cycles. There is one MOP in each time step. Note that there is a nop at the seventh cycle since MO6 is scheduled as the last MOP. The schedule in Table VI is a more compact one, with four cycles. Note that the delay slot of MO6 is filled with MO5 such that there is no need for a nop.

A. Instruction Formation: The Binary Tuple and Its Relation with Scheduling Process

The semantics of an instruction can be represented by a binary tuple \( \langle \text{MOPTypeIDs}, \text{IMPFields} \rangle \), where \( \text{MOPTypeIDs} \) is a list of type ID's (as shown in the first column of Table II) of the MOP's contained in the instruction, and \( \text{IMPFields} \) is a list of fields that are encoded into the opcode.

For example, the binary tuple for the instruction \( \text{add}(R_1, R_2, \text{Immed}) \) is \( \langle [\text{rrai}], [\,] \rangle \). The instruction contains one MOP '\( R_1 \leftarrow R_2 + \text{Immed} \)' with the type ID rrai, which is represented by the list in the first argument of the tuple. Since no fields are encoded, the second argument of the tuple is an empty list. On the other hand, the binary tuple for the instruction \( \text{inc}(R) \), an encoded version of the instruction \( \text{add}(R_1, R_2, \text{Immed}) \) as discussed in Section III-A, is \( \langle [\text{rrai}], [R_1 = R_2, \text{Immed} = 1] \rangle \). The list in the second argument of the tuple specifies how the fields are encoded: the element \( R_1 = R_2 \) unifies the register specifiers \( R_1 \) and \( R_2 \) to the same register, and the element \( \text{Immed} = 1 \) fixes the immediate value permanently to the constant one.

Instructions are generated from time steps in the schedule. Each time step corresponds to one instruction. The type ID's of the MOP's scheduled to the same time step are assigned to the first argument of the binary tuple for the instruction at the time step. The operand encoding specification, which is generated by an encoding process integrated into the scheduling process (described in Section V), is assigned to the second argument of the binary tuple.

In Tables V and VI, the columns under the headers 'Instruction Semantics' and 'Instruction Fields' describe the semantics and field information of the instructions formed for the two schedules, respectively. The columns 'MOP type ID's' and 'Encoded fields' specify the binary tuples for the instructions. The RTL's for the corresponding MOP types are listed under the 'RTLs' column. Note that '*' denotes concurrency. The 'Inst Name' column assigns names to the generated instructions. The column 'Format' describes the instruction format, i.e., the required instruction fields. The column 'Field values' lists the instantiated field values for the corresponding time step. Note that, in order to demonstrate the variation in instruction formation, the instruction set in Table V is deliberately a nonoptimal one.

For example, in Table V, the MOP's scheduled into time steps 4 and 5 have the same binary tuple, and thus are mapped to the same instruction \( \text{inst}(R_1, R_2, I) \), with their field values instantiated to \( (r_1, r_2, 1) \) and \( (r_2, r_2, 2) \), respectively. Note that we use capitalized letters, e.g., \( R_1 \), to denote the instruction fields, and noncapitalized letters, e.g., \( r_2 \), to denote the instantiated values of the fields. On the other hand, the MOP in time step 2 is mapped to a different instruction \( \text{inst}(R_3) \), although it contains the same type of MOP, rrai, as in time steps 4 and 5. The reason is that its field for the immediate data \( I \) is permanently assigned the constant zero and made implicit in the opcode, which is indicated by the specification \( I = 0 \) in the 'Encoded fields' column. This implicit field makes the generated instruction behave as a 'move' instruction, instead of an 'add.'

The compiled code can be obtained easily from the instruction names and instantiated field values. For example, the compiled code for the scheduled basic block in Table VI is represented as the sequence

\[
\text{inst7}(r_2, r_0, 0), \text{inst7}(r_2, r_1, 1), \text{inst5}(1024), \text{inst4}(r_2, r_2, 2).
\]

The instruction set is formed by taking the union of the instructions generated from all time steps. For example, the instruction set derived from the schedule in Table V contains six instructions (inst1 through inst6), and the instruction set for the schedule in Table VI contains three instructions (inst4, inst5, and inst7).
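The formation rule of this subsection (one instruction per time step, with identical binary tuples sharing one instruction) can be sketched as follows. This is an illustration only: the function `form_instructions`, the string encoding of the encoded fields, and the three-step schedule are assumptions, and the generated names do not follow the numbering of Tables V and VI.

```python
def form_instructions(schedule):
    """Derive an instruction set and compiled code from a schedule.

    `schedule` is a list of time steps, each a triple
    (MOP type IDs, encoded fields, instantiated field values).
    Time steps with the same binary tuple share one instruction.
    """
    names = {}   # binary tuple -> instruction name
    code = []    # compiled code: (instruction name, field values) per time step
    for type_ids, encoded, values in schedule:
        key = (tuple(sorted(type_ids)), tuple(sorted(encoded)))
        name = names.setdefault(key, f"inst{len(names) + 1}")
        code.append((name, values))
    return names, code


# Hypothetical schedule: three time steps that all contain one rrai MOP.
schedule = [
    (["rrai"], ["I=0"], ("r3", "r0")),      # immediate fixed to 0: a 'move'
    (["rrai"], [], ("r1", "r2", 1)),        # general add
    (["rrai"], [], ("r2", "r2", 2)),        # same tuple, different field values
]

names, code = form_instructions(schedule)
print(len(names))   # 2
print(code)
```

Sorting the type IDs treats the MOP's of a time step as an unordered collection, which is a design choice of this sketch rather than something the text prescribes.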
B. Performance (Cycle Count) and Costs (Instruction Bits and Hardware Resources)

The weighted sum of the lengths (number of time steps) of the scheduled basic blocks gives the execution cycle count of the benchmarks. The length of a basic block includes nop slots which are inserted by the design process to preserve the constraints due to multicycle operations. The design process will try to eliminate the nop slots by reordering other independent operations into them.

Each instruction has two costs associated with it. One is the total number of bits required to represent the instruction. The number is the sum of the field widths of the opcode and all explicit fields required to operate the MOP's contained in the instruction. The implicit fields do not consume instruction bits. For example, in Table V, the instruction inst4 requires 32 b, using the bit width specification in Table I, whereas inst2 requires only 16 b because its immediate data field is made implicit, saving 16 b. The maximal bit widths of the instruction sets in Tables V and VI are 48 and 32 b, respectively.

The other cost is hardware. It is the collection of the resources required by all MOP's contained in the instruction, minus the shared resources. The sharing of resources can be related to field encoding. When two or more register reads of different MOP's are unified, i.e., they read from the same register, one read port of the register file is sufficient, instead of two or more. Similarly, if more than one destination receives the result of the same arithmetic/logic expression, one functional unit is enough since the computation result can be shared. For example, inst7 needs only one read port instead of three since R1, R2, and R4 are unified. It also needs only one functional unit, instead of three, since the three destinations (memory data register, memory address register, register file) all receive the same value: R1.

The global hardware resources are obtained by choosing the maximal number for each resource type over all instructions. For example, the global hardware resources used for schedules I and II in Tables V and VI are {2R, 1W, 1M, 2F} and {1R, 1W, 1M, 1F}, respectively.

The example in Table VI shows that compact and powerful instructions can be synthesized by packing more MOP's into a single instruction, and by making fields implicit and register ports unified to satisfy the cost constraints. This is particularly useful in an application specific environment where instruction sets can be customized to produce compact and efficient code for the intended applications.
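The per-resource maximum used to obtain the global hardware resources can be written in a few lines. In the sketch below, the function name `global_resources` and the per-instruction usage numbers are invented for illustration; they are not the values behind Table V or Table VI.

```python
def global_resources(per_instruction_usage):
    """Take, for each resource type, the maximum demand over all instructions.

    Each element maps a resource type (R, W, M, F as in Table III)
    to the number of units one instruction needs after sharing.
    """
    total = {}
    for usage in per_instruction_usage:
        for resource, count in usage.items():
            total[resource] = max(total.get(resource, 0), count)
    return total


# Hypothetical per-instruction demands.
usage = [
    {"R": 2, "W": 1, "F": 1},
    {"R": 1, "M": 1, "F": 2},
    {"R": 1, "W": 1, "M": 1, "F": 1},
]

print(global_resources(usage))  # {'R': 2, 'W': 1, 'F': 2, 'M': 1}
```

Because resources are reused from one time step to the next, the global requirement is a maximum rather than a sum.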

C. Constraints
The MOP's are scheduled into time steps, subject to several constraints. First, the data/control dependencies and the timing constraints (for multicycle MOP's) have to be satisfied. Data-dependent MOP's have to be scheduled into different time steps, subject to the precedence relationship and timing constraints, except single-cycle MOP's with WAR dependencies, which can be scheduled into the same time step if the registers can be read and written simultaneously. A control dependency with a timing constraint, e.g., a delayed jump, has to be dealt with differently. The MOP's that are data-independent of the jump/branch MOP's can be scheduled into the time steps before the jump/branch MOP's or into the delay slots after them. The length of the delay slots is determined by the timing constraint. For example, in Table VI, the independent MOP MO5 is scheduled into time step 4, which is the delay slot of the jump MO6.

Second, the instruction word width and the hardware resources consumed by the instructions must not exceed those specified by the designer. Third, the size of the instruction set must not exceed what the opcode field can encode.

D. Objective Function
Generally speaking, a richer instruction set may result in more compact and efficient compiled code. On the other hand, the larger the instruction set, the more complex the decoding circuitry, and the more time the hardware designers spend on design and verification. The same trends hold on the compiler side as well. Therefore, an objective function is necessary to control the performance/cost trade-off.

The goal of our design system is to minimize the objective function. The objective function is a function of the cycle count \(C\) and the instruction set size \(S\), where \(C\) represents the performance metric, i.e., how many cycles the benchmarks execute on the target machine, and \(S\) represents the cost metric. An objective function suitable for our purpose is the following.

\[
\text{Objective} = (100/P) \cdot \ln(C) + S. \tag{1}
\]
This is an integral form, derived by Holmer in [4], of the statement "a new instruction will be accepted if it provides a \(P\%\) performance improvement," which tries to balance the instruction set size with the performance gain. Other types of objective functions can be used with the design system as well.

Note that in our formulation, the design constraints are checked separately, and are not captured in the objective function.
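To make the trade-off in (1) concrete, the sketch below evaluates the objective before and after adding one instruction. The function names and all numbers (cycle counts, set sizes, \(P = 5\)) are illustrative assumptions, not results from the experiments.

```python
import math


def cycle_count(blocks):
    """Weighted sum of scheduled basic-block lengths.

    `blocks` is an iterable of (weight, number of time steps) pairs.
    """
    return sum(weight * length for weight, length in blocks)


def objective(cycles, set_size, p_percent):
    """Equation (1): (100 / P) * ln(C) + S."""
    return (100.0 / p_percent) * math.log(cycles) + set_size


P = 5  # hypothetical threshold: roughly a 5% gain per added instruction

baseline = objective(1000, 20, P)
gain_6_percent = objective(940, 21, P)   # one more instruction, 6% fewer cycles
gain_4_percent = objective(960, 21, P)   # one more instruction, 4% fewer cycles

print(gain_6_percent < baseline)  # True: the extra instruction pays off
print(gain_4_percent < baseline)  # False: the gain is too small
print(cycle_count([(10, 4), (1, 7)]))  # 47
```

With this form, adding one instruction lowers the objective only if the cycle count shrinks by a factor below \(e^{-P/100}\), which is close to a \(P\%\) improvement for small \(P\).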

V. SIMULATED ANNEALING ALGORITHM AND THE DESIGN FLOW
Although we have formulated the instruction set design problem as a scheduling problem, it is in fact more difficult than a regular scheduling problem, because we have to control the number of unique patterns (the instruction set) in the time steps during scheduling, in addition to the dependency and performance/cost constraints. Also, the problem size is usually much larger than that of regular scheduling problems, since the application benchmarks may easily contain thousands of MOP's to be scheduled.

We propose an efficient solution to the problem based on a simulated annealing scheme. An initial design state consisting of an initial schedule and its derived instruction set (generated by a preprocessor) is given to the design system, and then a simulated annealing process is invoked to modify the design state in order to optimize the objective function, until the design state reaches an equilibrium state.
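The overall shape of such an annealing loop is sketched below in generic form. Everything specific in it is an assumption made for illustration: the geometric cooling schedule, the Metropolis acceptance rule, the temperature-based stopping test (used here in place of an equilibrium test), and the toy integer problem that stands in for a real design state.

```python
import math
import random


def anneal(state, cost, propose, is_feasible,
           t_start=10.0, t_end=0.01, alpha=0.95, moves_per_temp=100, seed=0):
    """Generic simulated annealing: minimize `cost` over feasible states."""
    rng = random.Random(seed)
    current, current_cost = state, cost(state)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            candidate = propose(current, rng)   # apply one move operator
            if not is_feasible(candidate):      # constraints checked separately
                continue
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept a worse state with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= alpha                              # geometric cooling (assumed)
    return best, best_cost


# Toy stand-in for a design state: an integer in [0, 10], optimum at 3.
best, best_cost = anneal(
    state=10,
    cost=lambda x: (x - 3) ** 2,
    propose=lambda x, rng: x + rng.choice((-1, 1)),
    is_feasible=lambda x: 0 <= x <= 10,
)
print(best, best_cost)
```

In the design system, `propose` would correspond to one of the move operators described next, and `is_feasible` to the separately checked design constraints.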
A. Move Operators

The move operators change the design state. They provide methods of manipulating the MOP's and time steps. The move operators can be classified into three groups.

Manipulation of the Instruction Semantics and Format: The first group manipulates the instruction semantics and format of a selected time step. There are five move operators in this group.

Unification: Unify two register accesses in the MOP's; i.e., they always access the same register. For example, the specification of $R_1 = R_2$ in our previous example of the increment instruction $inc(R)$ is a result of the 'unification' operator. The effects of this operator are the decreases in the instruction word width and register read/write ports.

Split: Cancel the effect of the 'unification' operator. Two register accesses that are previously unified to the same register are made independent. The effects of this operator are the increases in the instruction word width and register read/write ports.

Implicit value: Bind a register specifier to a specific register, or an immediate data field to a specific value. The specific values are the instantiated values in the MOP's of the selected time step. For example, the specification of $Immed = 1$ in the instruction $inc(R)$ is a result of this operator. The effect of this operator is the decrease in the instruction word width.

Explicit value: Cancel the effect of the 'implicit value' operator. Instruction fields that are previously bound to specific values are made explicit; i.e., their values are assigned by the compiler and are specified in the regular instruction fields. The effect of this operator is the increase in the instruction word width.

Generalization: If the current instruction format of the selected time step contains encoded operands, make these operands general so that they become explicit in the instruction fields. The effects of this operator are increases in the instruction word width and hardware resources.
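The effect of these operators on the instruction word width can be illustrated with a minimal Python sketch. The `Field` class, the helper names, and the bit widths (4-b register specifiers, 14-b immediate) are assumptions made for illustration and are not the representation used in the design system; only the unification/split and implicit/explicit value pairs are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    """One operand field of an instruction (hypothetical representation)."""
    name: str                           # e.g. "R1" or "Immed"
    bits: int                           # width when encoded explicitly
    implicit: Optional[int] = None      # bound value after 'implicit value'
    unified_with: Optional[str] = None  # partner after 'unification'


def word_width(fields: List[Field]) -> int:
    """Operand bits that still occupy space in the instruction word."""
    return sum(f.bits for f in fields
               if f.implicit is None and f.unified_with is None)


def _find(fields: List[Field], name: str) -> Field:
    return next(f for f in fields if f.name == name)


def unify(fields, keep, merged):
    """'Unification': `merged` always accesses the same register as `keep`."""
    _find(fields, merged).unified_with = keep


def split(fields, merged):
    """'Split': cancel an earlier unification."""
    _find(fields, merged).unified_with = None


def make_implicit(fields, name, value):
    """'Implicit value': bind a field to a fixed value."""
    _find(fields, name).implicit = value


def make_explicit(fields, name):
    """'Explicit value': cancel an earlier binding."""
    _find(fields, name).implicit = None


# A register-immediate add with fields R1, R2, Immed (assumed widths).
add = [Field("R1", 4), Field("R2", 4), Field("Immed", 14)]
general_width = word_width(add)          # 4 + 4 + 14 = 22

unify(add, "R1", "R2")                   # R1 = R2
make_implicit(add, "Immed", 1)           # Immed = 1
inc_width = word_width(add)              # only R1 remains: 4

split(add, "R2")
make_explicit(add, "Immed")
restored_width = word_width(add)         # back to 22
```

Under these assumed widths, specializing the general add into an increment shrinks the operand fields from 22 b to 4 b, and the inverse operators restore the general form.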

Manipulation of MOP's Locations: The second group of move operators involves the movement of the MOP's. There are four move operators in this group, which are all subject to the data/control dependencies and delay constraints when moving MOP's. The target MOP's and time steps can be selected randomly or with the guidance of heuristics.

Interchange: Interchange the locations of two MOP's from different time steps. This operator changes the semantics and formats of the two instructions in the corresponding time steps.

Displacement: Displace a MOP to another time step. This operator simplifies the semantics and format of the instruction in the original time step, and enriches the semantics and format of the other instruction in the destination time step.

Insertion: Insert an empty time step after or before the selected time step and move one MOP to the new time slot. This operator simplifies the semantics and formats of instructions in the selected and new time steps, and increases the cycle count.

Deletion: Delete the selected time step if it is an empty one. This operator decreases the cycle count.

In our current implementation, if the selected MOP's contain unified or implicit fields, these fields are restored to the original forms (generalized, explicit) before the move operators in this group are applied to the MOP's. In addition to the aforementioned effects, these move operators may change the resource usage in the selected time steps as well.
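The operators in this group can be sketched in Python as follows, assuming a schedule is a list of time steps and each time step is a list of MOP identifiers. The function names and the `before` set of precedence pairs are hypothetical. The sketch checks only "before" dependencies and ignores delay constraints and resource usage, and the interchange operator is omitted.

```python
def step_of(schedule, mop):
    """Index of the time step that holds `mop`."""
    for t, step in enumerate(schedule):
        if mop in step:
            return t
    raise ValueError(f"MOP {mop} is not scheduled")


def respects(schedule, before):
    """True if every (a, b) pair has a in an earlier time step than b."""
    return all(step_of(schedule, a) < step_of(schedule, b) for a, b in before)


def displace(schedule, mop, dst, before):
    """'Displacement': move `mop` to step `dst`; None if a dependency breaks."""
    new = [[m for m in step if m != mop] for step in schedule]
    new[dst].append(mop)
    return new if respects(new, before) else None


def insert_step(schedule, t, mop, before):
    """'Insertion': add an empty step after `t` and move `mop` into it."""
    new = [[m for m in step if m != mop] for step in schedule]
    new.insert(t + 1, [mop])
    return new if respects(new, before) else None


def delete_step(schedule, t):
    """'Deletion': drop step `t`, allowed only when it is empty."""
    if schedule[t]:
        raise ValueError("only empty time steps can be deleted")
    return schedule[:t] + schedule[t + 1:]


before = {(1, 3)}                     # MOP 1 must precede MOP 3
s0 = [[1], [2], [3]]                  # 3 cycles
s1 = displace(s0, 2, 0, before)       # [[1, 2], [], [3]]
s2 = delete_step(s1, 1)               # [[1, 2], [3]] -> 2 cycles
blocked = displace(s2, 3, 0, before)  # None: MOP 3 would share a step with MOP 1
```

A displacement followed by a deletion shortens the schedule by one cycle, while a displacement that would break a dependency is rejected.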

Micro-Architecture-Dependent Operators: The third group of move operators includes methods that explore the special properties of the target micro-architecture. These move operators are provided by the designer as part of the micro-architecture specification.

For example, if the target micro-architecture provides both register file → functional unit → register file, and register file → register file data paths, the designer can specify that the following MOP's ($rrai$ and $rr$) are functionally equivalent and can be transformed from one to another

$rrai: R_1 → R_2 + Immed(Immed = 0)$

$rr: R_1 → R_2$.

These MOP's have different costs in hardware and instruction format. While $rrai$ uses a functional unit and consumes an additional instruction field for the immediate data, $rr$ uses a direct bus between the read and write ports of the register file. When discovering an $rrai$ MOP with its immediate data being zero, the design system can map this MOP to the equivalent $rr$ MOP, or vice versa.
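A designer-supplied equivalence of this kind could be expressed as a pair of rewrite functions. The sketch below is a hypothetical Python encoding in which a MOP is a dictionary with keys `type`, `r1`, `r2`, and `immed`; it is not the specification language of the design system.

```python
def to_rr(mop):
    """Rewrite an rrai MOP whose immediate is zero as the equivalent rr MOP."""
    if mop["type"] == "rrai" and mop["immed"] == 0:
        return {"type": "rr", "r1": mop["r1"], "r2": mop["r2"]}
    return mop


def to_rrai(mop):
    """Rewrite an rr MOP as the equivalent rrai MOP with a zero immediate."""
    if mop["type"] == "rr":
        return {"type": "rrai", "r1": mop["r1"], "r2": mop["r2"], "immed": 0}
    return mop


move = {"type": "rrai", "r1": 3, "r2": 5, "immed": 0}   # qualifies for rr
add4 = {"type": "rrai", "r1": 3, "r2": 5, "immed": 4}   # must stay rrai
```

Only the MOP with a zero immediate is rewritten, and the two functions are inverses of each other on such MOP's.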

An Example: Changing the Design State with Move Operators: We demonstrate how the move operators are used to change design states. Here we show a sequence of move operators which transforms the schedule and instruction set (one design state) in Table V to the ones (a better design state) in Table VI. The sequence is

1) DISPLACEMENT: displace the MO2 from time step 2 to 1 (as shown in Table VII).
2) UNIFICATION: unify fields D1 and D2 in the time step 1.
3) UNIFICATION: unify fields D1 and I in the time step 1.
4) UNIFICATION: unify fields R1 and R2 in the time step 1.
5) UNIFICATION: unify fields R1 and R4 in the time step 1 (as shown in Table VIII).
6) DELETION: delete the empty time step 2.
7) DISPLACEMENT: displace the MO2 from time step 4 to 3 (as shown in Table IX).
8) DELETION: delete the empty time step 4.
9) UNIFICATION: unify fields D1 and I in the time step 1.
10) UNIFICATION: unify fields R1 and R4 in the time step 1.
11) DISPLACEMENT: displace the MO5 from time step 5 to 7 (as shown in Table X).
12) DELETION: delete the empty time step 5.

Tables VII–X show the resulting schedule and instruction set for the design state after the first, fifth, seventh, and eleventh move operators are applied, respectively. After the twelfth move operator is applied, the design state in Table VI is obtained. In the last row of the tables we show the cycle count, instruction set size, hardware cost, and instruction word width for the corresponding design states. The deleted time steps are shown as shaded rows. The time steps in which the move operators are applied are emphasized with heavy rectangles around the time step indices. The elements in the design state that are modified by the move operators are listed in boldface. Note that, for ease of illustration, we use the original time step indices in Table V in the above sequence when referring to selected time steps. In the implementation, the indices of time steps have to be adjusted when time steps are inserted or deleted such that the delay constraints between MOP's can be correctly maintained.

Note that more than one sequence can accomplish the same design state transition. How such sequences are formed depends on the design algorithm. In our simulated annealing scheme, the move operators are selected with a mix of random and heuristic strategies as described in Section V-B.

B. Heuristics for Target Selection

During each iteration, the design state is examined to determine whether it violates any design constraints. If so, a time step is randomly selected from the pool of time steps that violate constraints. If more than one constraint is violated, the resource violation gets higher priority than the instruction word width violation, since a movement that resolves the former may resolve the latter as well.

**TABLE VII**

Design state after the first move operator (columns: Schedule, Instruction Semantics, Instruction Fields, Costs).

**TABLE VIII**

Design state after the fifth move operator (columns: Schedule, Instruction Semantics, Instruction Fields, Costs).

**TABLE IX**

Design state after the seventh move operator (columns: Schedule, Instruction Semantics, Instruction Fields, Costs).

**TABLE X**

Design state after the eleventh move operator (columns: Schedule, Instruction Semantics, Instruction Fields, Costs).

Depending on the type of the constraints, one of the following rules is applied.

1) If the instruction word width constraint is violated, apply randomly one of the move operators: 'unification,' 'implicit value,' 'interchange,' 'displacement,' or 'insertion.'

2) If the resource constraint is violated, apply randomly one of the move operators: 'unification' (only when the register port constraint is violated), 'implicit value,' 'displacement,' or 'insertion.'
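The two rules, together with the priority of resource violations over word width violations, can be sketched as a small Python selection routine. The function and parameter names are hypothetical, and the boolean flags stand in for whatever constraint checks the design system performs.

```python
import random

# Operators eligible when the instruction word width constraint is violated.
WIDTH_OPS = ["unification", "implicit value", "interchange",
             "displacement", "insertion"]


def resource_ops(register_port_violated):
    """Operators eligible when a resource constraint is violated."""
    ops = ["implicit value", "displacement", "insertion"]
    if register_port_violated:
        ops.append("unification")   # only helps with register port violations
    return ops


def pick_operator(resource_violated, width_violated,
                  register_port_violated=False, rng=random):
    """Pick a move operator at random from the applicable rule."""
    if resource_violated:           # resource violations take priority
        return rng.choice(resource_ops(register_port_violated))
    if width_violated:
        return rng.choice(WIDTH_OPS)
    return None                     # no violation: all operators are eligible
```

For example, `pick_operator(True, True)` draws only from the resource rule, because resolving a resource violation may resolve the word width violation as well.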

When the current design state does not violate any constraint, all move operators are eligible for changing the design state. In this case, a basic block is selected with the probability $Selection_i$, which is the selection weight of a basic block $i$ and is defined by the following equation, where $F_i$ is the execution frequency of the basic block $i$ in the benchmark, $N_i$ is the number of MOP's in the basic block $i$, and the summation in the denominator is the total number of MOP's executed in the benchmark. Therefore, the selection weight is intended to denote the degree of importance of a basic block in the benchmark. A time step is then randomly chosen from the selected basic block, and one move operator is randomly selected and applied to the time step.

$$Selection_i = \frac{F_i \cdot N_i}{\sum_{j} F_j \cdot N_j} \qquad (2)$$
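As a worked illustration of (2), the following Python sketch computes the selection weights and draws a basic block accordingly. The three basic blocks and their $(F_i, N_i)$ values are made-up numbers, not data from the benchmarks.

```python
import random


def selection_weights(blocks):
    """blocks: list of (F_i, N_i) pairs; returns Selection_i for each block."""
    total = sum(f * n for f, n in blocks)    # total MOP's executed
    return [f * n / total for f, n in blocks]


# Hypothetical basic blocks: (execution frequency, number of MOP's).
blocks = [(1, 10), (100, 5), (10, 40)]
weights = selection_weights(blocks)          # [10/910, 500/910, 400/910]

# Draw one basic block with probability Selection_i.
chosen = random.choices(range(len(blocks)), weights=weights, k=1)[0]
```

The small but frequently executed second block receives the largest weight (500/910), so it is selected most often.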

C. Cooling Schedule

The cooling schedule is controlled by five parameters.

1) The initial temperature ($T_0$) should be high enough so that there is no rejection for high-cost states at the initial temperature. A simple heuristic to set the initial temperature is to start the simulated annealing algorithm with a given initial temperature. If some states are rejected at the initial temperature, then the value of the initial temperature is doubled. The trial run is repeated until the ideal initial temperature is obtained.

2) The number ($M$) of movements tried at each temperature is proportional to the total number ($O_{ps}$) of MOP's in the benchmarks. The proportionality factor, typically five, is given by the designer.

3) The next temperature is 90% of the current temperature.

4) A low temperature point is defined such that a special handling routine can be applied to stabilize the design state. The special handling routine stabilizes the design state by adopting move acceptance rules that are different from the ones in high temperatures. The move acceptance rules are described in Section V-D.

5) The annealing process terminates when the design state stays unchanged for a certain (e.g., four) consecutive temperature points. The number of the consecutive stable temperature points is given by the designer.
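The cooling schedule can be summarized by the Python sketch below. The callbacks `any_rejection_at` and `run_moves` are hypothetical placeholders for a trial run and for the inner loop of move trials, and the special handling at the low temperature point (parameter 4) is not modeled.

```python
def find_initial_temperature(t_start, any_rejection_at):
    """Double the temperature until a trial run rejects no state."""
    t = t_start
    while any_rejection_at(t):
        t *= 2
    return t


def moves_per_temperature(num_mops, factor=5):
    """M is proportional to the number of MOP's (factor given by the designer)."""
    return factor * num_mops


def anneal(t0, num_mops, run_moves, alpha=0.9, stable_limit=4):
    """Run the schedule; return the number of temperature points visited.

    run_moves(t, m) tries m movements at temperature t and returns True
    if the design state changed.
    """
    m = moves_per_temperature(num_mops)
    t, stable, points = t0, 0, 0
    while stable < stable_limit:
        changed = run_moves(t, m)
        stable = 0 if changed else stable + 1
        t *= alpha                  # next temperature is 90% of the current one
        points += 1
    return points
```

With a toy `run_moves` that reports a change only while the temperature exceeds 50, a run starting at 100 visits seven changing temperature points and then four stable ones before terminating.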

The complexity of the algorithm is mainly determined by the cooling schedule and the data structures used to represent the design state. As discussed previously, the number of movements tried at each temperature is proportional to the total number ($O_{ps}$) of MOP's in the benchmarks; the complexity of accessing the data structures, in our current implementation, is proportional to $O_{ps}$ as well. Therefore, the complexity of the algorithm at each temperature is of the order of $O_{ps}^2$. This complexity can be lowered by using more efficient data structures in our future implementation.

To derive the global complexity formally, we need to determine the total number of temperature points, which is difficult to analyze since it is affected by both the problem size and the nature of the benchmarks. However, our empirical study shows that the global complexity of the algorithm is roughly of the order of $O_{ps}^3$.

D. Move Acceptance

At high temperatures, a movement that satisfies one of the following conditions is definitely accepted.

1) The movement reduces the value of the objective function; 
2) The movement is a result of constraint resolution; i.e., it is a necessary movement in order to resolve some constraint violations.

Otherwise, a movement is accepted with the probability of $\exp(-\Delta/T)$ where $\Delta$ is the increased value of the objective function and $T$ is the current temperature.

At low temperatures, a different strategy is adopted to stabilize the design state. A movement is accepted when either one of the following conditions is true.

1) The movement generates a new state which does not violate any design constraint and has lower objective value; 
2) The movement is a result of constraint resolution. This condition is the same as the one at high temperatures.

Otherwise, only those movements that generate new states which do not violate any design constraint are accepted with the probability of $\exp(-\Delta/T)$.

In addition, the current best design state is kept when the algorithm decides to accept inferior design states. At the end of each temperature point, if the reached design state is inferior to the current best state, the design state falls back to the current best state with the probability $1 - T/T_i$, where $T_i$ is the initial temperature.
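The acceptance rules for both temperature regimes can be condensed into a single Python predicate. The argument names are hypothetical, and the flags for constraint resolution and constraint violation are assumed to be computed elsewhere.

```python
import math
import random


def accept(delta, temp, *, resolves_constraint, violates_constraint,
           low_temperature, rng=random.random):
    """Decide whether to accept a movement.

    delta: change in the objective function (negative means improvement).
    """
    if resolves_constraint:
        return True                  # accepted in both regimes
    if low_temperature and violates_constraint:
        return False                 # low temperature: no new violations
    if delta < 0:
        return True                  # objective value is reduced
    return rng() < math.exp(-delta / temp)


def fallback_probability(temp, initial_temp):
    """Probability of returning to the best state at the end of a temperature."""
    return 1 - temp / initial_temp
```

The fallback probability grows as the temperature drops, so late in the annealing process an inferior design state is almost always replaced by the best state found so far.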

E. Design Flow Based on the Simulated Annealing Algorithm

The instruction set design process consists of three major steps.

1) The given application is translated to dependency graphs of MOP's which are supported by the given architecture template. This translation is performed in two steps. First, the application, written in a high-level language, is translated into an intermediate representation by the compiler of the high-level language (in our current environment, the Aquarius Prolog Compiler [21]). Second, a retargetable MOP mapper, consulting the given architectural template specified with the language described in Section III-B, transforms the intermediate representation into the dependency graphs of MOP's.
TABLE XI
THE MOP's AND THEIR DEPENDENCIES OF A LIST-CREATING APPLICATION

<table>
<thead>
<tr>
<th>MOP ID</th>
<th>Type ID</th>
<th>RTL*</th>
<th>MOP ID</th>
<th>Type ID</th>
<th>RTL*</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>.fd</td>
<td>d = 1 \rightarrow 0 + 1</td>
<td>10</td>
<td>mst</td>
<td>d = 0 \rightarrow 1</td>
</tr>
<tr>
<td>2</td>
<td>rit</td>
<td>d = 0 \rightarrow 0 + 1</td>
<td>11</td>
<td>rit</td>
<td>d = 0 \rightarrow 0 + 1</td>
</tr>
<tr>
<td>3</td>
<td>mar</td>
<td>d = 0 \rightarrow 10</td>
<td>12</td>
<td>mst</td>
<td>d = 0 \rightarrow 10</td>
</tr>
<tr>
<td>4</td>
<td>mar</td>
<td>d = 0 \rightarrow 11</td>
<td>13</td>
<td>mst</td>
<td>d = 0 \rightarrow 11</td>
</tr>
<tr>
<td>5</td>
<td>rit</td>
<td>d = 0 \rightarrow 00 + 1</td>
<td>14</td>
<td>mst</td>
<td>d = 0 \rightarrow 00 + 1</td>
</tr>
<tr>
<td>6</td>
<td>mar</td>
<td>d = 0 \rightarrow 00</td>
<td>15</td>
<td>map</td>
<td>d = 0 \rightarrow 00</td>
</tr>
<tr>
<td>7</td>
<td>mar</td>
<td>d = 0 \rightarrow 11</td>
<td>16</td>
<td>mst</td>
<td>d = 0 \rightarrow 11</td>
</tr>
<tr>
<td>8</td>
<td>rit</td>
<td>d = 0 \rightarrow 00 + 1</td>
<td>17</td>
<td>mst</td>
<td>d = 0 \rightarrow 00 + 1</td>
</tr>
<tr>
<td>9</td>
<td>mar</td>
<td>d = 0 \rightarrow 00</td>
<td>18</td>
<td>mst</td>
<td>d = 0 \rightarrow 00</td>
</tr>
</tbody>
</table>

* bit width: tag=2, immed=14

TABLE XII
32-B INSTRUCTION SET

<table>
<thead>
<tr>
<th>Instruction name</th>
<th>Instruction fields</th>
<th>RTL*</th>
<th>MOP type ID</th>
<th>Encoded fields</th>
</tr>
</thead>
<tbody>
<tr>
<td>instl1</td>
<td>Rb, Rd</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>instl2</td>
<td>Rs, Rs, Rd, T0</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>instl3</td>
<td>Rb, Rd</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

* The right two columns specify the binary value for the corresponding instruction.

TABLE XIII
Compiled Code with the 32-b Instruction Set

<table>
<thead>
<tr>
<th>Time Step</th>
<th>Compiled Code</th>
<th>Time Step</th>
<th>Compiled Code</th>
<th>Time Step</th>
<th>Compiled Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>instl1(R0, R1, R2, R3)</td>
<td>2</td>
<td>instl2(R0, R1, R2)</td>
<td>3</td>
<td>instl3(R0, R1)</td>
</tr>
<tr>
<td>4</td>
<td>instl4(R0, R1, R2)</td>
<td>5</td>
<td>instl5(R0, R1, R2)</td>
<td>6</td>
<td>instl6(R0, R1)</td>
</tr>
</tbody>
</table>

TABLE XIV
64-B INSTRUCTION SET

<table>
<thead>
<tr>
<th>Instruction name</th>
<th>Instruction fields</th>
<th>RTL*</th>
<th>MOP type ID</th>
<th>Encoded fields</th>
</tr>
</thead>
<tbody>
<tr>
<td>instl1</td>
<td>Rb, Rd</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>instl2</td>
<td>Rs, Rs, Rd, T0</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>instl3</td>
<td>Rb, Rd</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

TABLE XV
Compiled Code with the 64-b Instruction Set

<table>
<thead>
<tr>
<th>Time Step</th>
<th>Compiled Code</th>
<th>Time Step</th>
<th>Compiled Code</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>instl1(R0, R1, R2, R3)</td>
<td>2</td>
<td>instl2(R0, R1, R2)</td>
</tr>
<tr>
<td>4</td>
<td>instl4(R0, R1, R2)</td>
<td>5</td>
<td>instl5(R0, R1, R2)</td>
</tr>
</tbody>
</table>

2) A preprocessor generates a simple schedule for the MOP's. The schedule is obtained by serializing the dependency graphs. An initial instruction set is then derived from the schedule. This is done by directly mapping time steps in the schedule into instructions without encoding any operand. The obtained schedule and instruction set constitute the initial design state.

3) The simulated annealing algorithm is invoked to optimize the design state. Several trial runs of the algorithm may be necessary to adjust the cooling schedule.

The best instruction set, micro-architecture, and assembly code which minimize the objective function can be obtained after the design state reaches the equilibrium state.
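The preprocessing step can be sketched in Python as a topological serialization followed by a direct mapping of time steps to instructions. This is an illustrative approximation: the MOP identifiers, types, and dependencies below are invented, delay constraints are ignored, and an instruction is identified only by the MOP types in its time step.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def initial_schedule(mops, before):
    """Serialize the dependency graph: one MOP per time step."""
    preds = {m: set() for m in mops}
    for a, b in before:          # a must be scheduled before b
        preds[b].add(a)
    return [[m] for m in TopologicalSorter(preds).static_order()]


def initial_instruction_set(schedule, mop_type):
    """One instruction per distinct pattern of MOP types in a time step."""
    return {tuple(sorted(mop_type[m] for m in step)) for step in schedule}


mop_type = {1: "rrai", 2: "rr", 3: "rrai", 4: "rr"}   # hypothetical MOP's
before = [(1, 3), (2, 3), (3, 4)]
schedule = initial_schedule(mop_type, before)
instructions = initial_instruction_set(schedule, mop_type)
position = {step[0]: t for t, step in enumerate(schedule)}
```

The initial design state therefore has as many cycles as MOP's and as many instructions as distinct MOP types; the annealing process then packs MOP's together to reduce the cycle count.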

We have implemented the algorithm and its supporting tools into our design system ASIA (Automatic Synthesis of Instruction-set Architectures). It consists of about 8000 lines of Prolog code.

VI. EXPERIMENTS

We first demonstrate our technique with a small, illustrative example, and then with Prolog application benchmarks.

A. A Small Example

In this example, we assumed the target architecture in Table II, the instruction field specification in Table I with smaller bit widths for tag (2 b) and immediate (14 b), and the delay specification in Table IV. The example used in this subsection is a small application which sets up a list of two elements in Prolog. It consists of 18 MOP's. Table XI lists the MOP's and their dependencies. The bf clauses in the last row specify the before dependencies between MOP's. For example, bf \((1,4)\) requires that MOP 1 be scheduled in an earlier time step than MOP 4. The ctl \((18)\) clause specifies that MOP 18 changes the control flow. Note that the control flow change has a one-cycle delay. We synthesized the 32-b and 64-b instruction sets, with the resource constraints \((3R, 1W, 2M, 1F)\) and \((6R, 4W, 4M, 4F)\), respectively. The objective function used is (1) with \(P = 1\).

The synthesized 32-b instruction set is listed in Table XII, consisting of four instructions. Note that two instructions insl1 and insl2 contain encoded fields, in order to satisfy the required 32-b word constraint. This instruction set compiles the application into 12 cycles, as shown in Table XIII. Note that time step 12 is the delay slot of insl1 which changes the control flow. An independent instruction insl12 is scheduled into time step 12 to make use of the delay slot.

Table XIV lists the 64-b instruction set, consisting of five instructions. Most of the instructions have concurrent MOP's. Since 64 b are wide enough to accommodate all instruction fields, there is no encoded field required in this instruction set. The compiled code (Table XV) consists of 9 cycles, which is 3 cycles fewer than the 32-b one. Also note that the instruction insl16 is scheduled to the delay slot of instruction insl15 which changes the control flow.

$^3$Refer to footnote 1 of Table II for the meaning of the notation.

In this subsection, experiments are presented to show the versatility and practicality of our tools by synthesizing instruction sets for some application benchmarks, with various design constraints and objective functions. Four benchmarks were selected from the Prolog Benchmark suite [18]. The benchmarks con1 and nreverse are programs for list manipulation. The benchmark query is a program for database query. The benchmark circuit maps boolean equations into logic gates. The second column in Table XVI lists the characteristics of the benchmarks, including the numbers of MOP's, data-related dependencies, and control dependencies in the benchmarks. The number of MOP's represents the size of the benchmark; the number of data-related dependencies is related to the degree of parallelism available within the benchmark; the number of control dependencies indicates the degree of the impact of the branch/jump delays on the benchmark.

We assumed that every basic block executes once. We assumed the target architecture in Table II and the instruction field specification in Table I. The delay constraints for control and memory operations are one and zero, respectively. The experiment was conducted on a HP750 workstation with 256 MBytes of memory.

For each benchmark, we synthesized its 32-b, 48-b, and 64-b instruction sets, respectively. We were interested in how the instruction sets vary with bit widths. Table XVI lists the results, synthesized under the objective function with $P = 1$ in (1). For all four benchmarks, as we had expected, the cycle count decreases when the instruction word width increases. However, we observed a smaller gain in nreverse and circuit. This can be explained by their larger ratios of the number of data dependencies to the number of MOP's. Most of the MOP's depend on each other, so there is less parallelism available when packing MOP's into instructions.

In general, the size of the instruction set also increases when the instruction word width increases. This is due to the fact that wider words can accommodate more MOP's, resulting in richer and more powerful instructions. However, the 48-b instruction sets are 'embarrassing' designs for con1 and nreverse. Their instruction set sizes are larger, and their performance is worse than their 64-b alternatives in compiling the benchmarks. The 48 b are not wide enough for these benchmarks to accommodate the most frequent MOP patterns, for which 64 b are sufficient. Therefore, the design process has to specialize the general forms of some powerful instructions into several distinct instructions by making fields implicit or unifying register ports, in order to satisfy the bit width constraint.

In the 'Instruction set space' column we examined the number of instruction candidates explored by the design process. The numbers, much larger than the final instruction sets, show that the design process was able to explore a rich design space for the best candidates while keeping the size of the design space manageable.

In the two rightmost columns we also list the run time and memory usage of our algorithm, which show that our tools were able to synthesize instructions for application benchmarks within reasonable time while consuming a modest amount of memory.

In Table XVII we compared the synthesized 32-b instruction sets for these benchmarks with the BAM instruction set, which was designed for the VLSI-BAM microprocessor and includes some powerful instructions to support efficient logic computation such as Prolog. The benchmarks were compiled with the BAM instruction set, and we measured the number of distinct instructions used (in the 'Instruction set size' column), and the number of cycles to execute the compiled code (in the 'Cycle' column). The programs were compiled by the Aquarius Prolog Compiler, with the post-phase optimization turned off. The experiments show that the synthesized instruction sets produce more compact code for all four benchmarks, with 10%, 5%, 17%, and 3% reduction in the code size, respectively. This was achieved at the cost of a small number of additional instructions (7, 1, and 2 for con1, nreverse, and query, respectively), except in circuit where 16 additional instructions are required. We then used Holmer's objective function $100 \cdot \ln(C) + S$ to evaluate the global performance/cost trade-offs for both instruction sets and found that in most cases (con1, nreverse, and query) the synthesized ones yield better results, as indicated in the 'Objective value' column (smaller values are better). It may be possible to improve the result of circuit by adjusting the initial temperature and the cooling schedule in future experiments. We also compared the hardware resources used by both instruction sets. They both use the same amount of resources, except in the nreverse case, where our synthesized instruction set uses one less register read port and one less memory port than BAM does. This experiment shows that ASIA is capable of competing with manually designed instruction sets within our collection of benchmarks. Further studies will be needed to investigate its competence in more general cases.

\[\text{RESULTS (OBJECTIVE FUNCTION} = 100 \cdot \ln(C) + S)\]

\begin{table}[h]
\centering
\begin{tabular}{|c|c|c|c|c|c|}
\hline
Benchmark & Instruction set & Hardware resources & Cycle (C) & Instruction set space (S) & Objective value (number in base) \\
\hline
con1 & BAM, 32, 48, 64 & 123 & 22 & 527 & 100 \\
& BAM, 32, 48, 64 & 178 & 27 & 505 & 100 \\
nreverse & BAM, 32, 48, 64 & 176 & 24 & 527 & 100 \\
query & BAM, 32, 48, 64 & 140 & 28 & 505 & 100 \\
control & BAM, 32, 48, 64 & 140 & 24 & 505 & 100 \\
\hline
\end{tabular}
\caption{Performance Comparison with a Manually Designed Instruction Set}
\end{table}

\* The objective constraint given to ASIA is the same as in the VLSI-BAM processor: 100, 10, 48, 1. 
\* BAM refers to the instruction set that was manually designed for the VLSI-BAM processor. 
1. ASIA refers to the instruction set synthesized by the tools (ASIA), reported in the paper.
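The trade-off expressed by the objective function $100 \cdot \ln(C) + S$ can be checked numerically. The Python sketch below uses invented cycle counts and instruction set sizes, not results from the experiments, to show that adding one instruction lowers the objective value only if it reduces the cycle count by roughly 1% or more.

```python
import math


def objective(cycles, set_size):
    """Objective value 100 * ln(C) + S; smaller is better."""
    return 100 * math.log(cycles) + set_size


# Hypothetical baseline: 1000 cycles with 30 instructions.
base = objective(1000, 30)

# One extra instruction that saves 1.5% of the cycles is worthwhile ...
better = objective(985, 31) < base     # True

# ... but one that saves only 0.5% is not.
worse = objective(995, 31) > base      # True

# Break-even ratio C_new / C_old for one additional instruction.
break_even = math.exp(-0.01)           # about 0.990
```

The break-even ratio follows from $100 \cdot \ln(C_{new}) + S + 1 < 100 \cdot \ln(C_{old}) + S$, which holds exactly when $C_{new}/C_{old} < e^{-0.01}$.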

Table XVIII shows some interesting instructions synthesized for the benchmark query. They are selected from the 32-b, 48-b, and 64-b instruction sets, respectively. For ease of illustration, we do not list the binary tuples for these instructions; instead, we describe the RTL's of these instructions directly. In the RTL's, the register sharing is indicated by using the same register index. Note that the 32-b version of the instructions can be found in the BAM instruction set as well. This fact provides the BAM designers with more confidence about their instruction set, since some of the instructions that they considered 'powerful' retain their existence when the instruction set is designed by other independent designers (in this case, the ASIA design automation system). This observation suggests that ASIA, in addition to its original purpose (an automatic design tool), can be used as a verification tool for designers to verify their manually designed instruction sets as well.

Finally, Table XIX shows how the synthesized instruction sets vary with the objective functions. In this experiment we synthesized 32-b instruction sets for the benchmark query with two objective functions: one with \( P = 1 \), another with \( P = 5 \). The latter assigns less importance to the cycle count. Therefore, the tools focused on reducing the instruction set size, resulting in 7 fewer instructions but 16 more cycles than the former case.

### Table XVIII

<table>
<thead>
<tr>
<th>Instruction word width</th>
<th>RTLs*</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>32</td>
<td>( m(t_1) \rightarrow R_2 ), ( R_2 \rightarrow R_2 )</td>
<td>push*</td>
</tr>
<tr>
<td></td>
<td>( (t_0) \rightarrow m(t_1) \rightarrow R_2 ), ( R_2 \rightarrow R_2 )</td>
<td>conditional push*</td>
</tr>
<tr>
<td></td>
<td>( (t_0) \rightarrow m(t_1) \rightarrow (p \rightarrow p \rightarrow D_3) )</td>
<td>switch on tag*</td>
</tr>
<tr>
<td></td>
<td>( m(t_1) \rightarrow R_2 )</td>
<td>compute condition and jump</td>
</tr>
<tr>
<td>48</td>
<td>( m(t_1) \rightarrow R_2 )</td>
<td>store and conditional push (with a shared register)</td>
</tr>
<tr>
<td></td>
<td>( (t_1) \rightarrow m(t_1) \rightarrow R_2 )</td>
<td>store and add</td>
</tr>
<tr>
<td>64</td>
<td>( m(t_1) \rightarrow R_2 )</td>
<td>store and conditional push</td>
</tr>
<tr>
<td></td>
<td>( (t_1) \rightarrow m(t_1) \rightarrow R_2 )</td>
<td>store and add</td>
</tr>
<tr>
<td></td>
<td>( R_2 \rightarrow R_2 )</td>
<td>tag data and add</td>
</tr>
</tbody>
</table>

* Notes: 1. The RTL's in an instruction are executed simultaneously; 2. \( p \) is a one-bit latch which holds the truth value of a logic comparator; 3. The operator " " appends a tag to a value before the value is sent to a destination.

### Table XIX

<table>
<thead>
<tr>
<th>Objective function</th>
<th>Cycle (C)</th>
<th>Instruction set size (S)</th>
</tr>
</thead>
<tbody>
<tr>
<td>( \text{cond} \rightarrow \text{cond} )</td>
<td>128</td>
<td>39</td>
</tr>
<tr>
<td>( \text{cycle} \rightarrow \text{cycle} )</td>
<td>29</td>
<td>32</td>
</tr>
</tbody>
</table>

VII. CONCLUSION

We have presented a design automation system ASIA (Automatic Synthesis of Instruction-set Architectures) that synthesizes computer instruction sets from application benchmarks. The design problem is formulated as a modified scheduling problem. The benchmarks are represented as data/control flow graphs of MOP's. The MOP's are scheduled into time steps subject to constraints of dependencies, hardware resources, and instruction word width. Instructions are formed during the scheduling phase. A binary tuple is used to describe the semantics and formats of instructions. The binary tuple is the key idea which links the instruction formation to the scheduling process. In addition to the synthesized instruction sets, ASIA also generates the compiled code for the given benchmarks, showing how the instruction sets can actually be used to compile programs. An objective function of the cycle count and instruction set size is used to guide the design process, in order to balance the performance/cost trade-off. A simulated annealing algorithm is used to solve for the schedule. We have discussed the move operators suitable for our problem, and other issues such as cooling schedules and heuristics.

We have demonstrated the versatility and practicality of ASIA by conducting experiments on some application benchmarks, with various design constraints and objective functions. The tools used a reasonable amount of CPU time and a modest amount of memory. It has been shown that our tools are capable of synthesizing powerful instructions, many of which can be found in today's processors. Compared with manually designed instruction sets, the synthesized instruction sets produce more compact code and may require less hardware. The tools were able to explore a rich design space, and handle important design options such as the instruction word width and the performance/cost trade-off. We were able to explain the variation of the performance of the instruction sets on different benchmarks, based on the characteristics of the benchmarks. The experiments also show that ASIA, in addition to its original purpose in automating the design process, can be used by the designers to verify their manually designed instruction sets as well.

The current limitations include the following. First, the designers are required to specify the number of hardware resources, and it may take several iterations to find the best hardware allocation. Second, ASIA does not recognize the situation when the constraints are too loose, e.g., the instruction word is too wide or hardware resources are too rich. In this case, it is possible to suggest some partitioning of the constraints. For example, a 128-b instruction word can be realized as a single wide-word instruction or an abutting of several smaller instructions. Third, in our problem formulation, the concept of the basic block is used to partition benchmarks into small pieces. However, there are other ways of partitioning benchmarks, such as traces and random segments [5]. Which way is best is unknown at this moment. Fourth, even though we have demonstrated that our algorithm is able to synthesize instruction sets from thousands of MOP's within 22 h, real-world application benchmarks, such as system, CAD, and simulation software, are usually much larger. How to manage problems of such sizes is an important issue. Fifth, the machine model is insufficient to account for the dynamic behavior of some modern architectures such as superscalar machines.

In the future, we will continue our efforts on ASIA and pursue the following issues: 1) addressing the aforementioned limitations; 2) code generation for the synthesized instruction sets; 3) synthesis and comparison for application specific uniprocessors and VLIW processors; 4) design and synthesis of low-power instruction set architectures; and 5) analysis of architectural properties for application benchmarks.

ACKNOWLEDGMENT

The authors would like to thank B. Holmer, C.-L. Su, and the anonymous reviewers for their comments and suggestions for improving this work.

REFERENCES


Ing-Jer Huang (S'89–M'95) received the B.S. degree in electrical engineering from the National Taiwan University, Taiwan, R.O.C., in 1986, and the M.S. and Ph.D. degrees in computer engineering from the University of Southern California, Los Angeles, in 1989 and 1994, respectively.

He is currently with the Institute of Computer and Information Engineering at National Sun Yat-Sen University, Taiwan, R.O.C. as an Associate Professor. His research interests include hardware/software co-design, high/system level synthesis, computer architecture, and VLSI system design.

Dr. Huang has published ten technical papers in his areas of research. He is a member of ACM.

Alvin M. Despain (S'58–M'65) received the B.S. (1960), M.S. (1962), and Ph.D. (1966) degrees in electrical engineering from the University of Utah, Salt Lake City.

He is currently with the University of Southern California, Los Angeles, as the Powell Professor of Computer Engineering and as a Professor in the Computer Science and Electrical Engineering—Systems Departments. He has been an Assistant Research Professor with Utah State University, Logan, a Visiting Associate Professor with Stanford University, Stanford, CA, and a Professor with the University of California at Berkeley, and has been with USC since 1989. He is a pioneer in the study of high-performance computer systems for symbolic calculations. His research group builds experimental software and hardware systems including computers, custom VLSI processors, and multiprocessor systems. The goal is to determine principles for the design of high-performance computer systems. His research interests include computer architecture, multiprocessor systems, logic programming, and design automation.

Dr. Despain is a member of ACM and AAAI.