# CprE / ComS 583 Reconfigurable Computing

Prof. Joseph Zambreno Department of Electrical and Computer Engineering Iowa State University

Lecture #24 - Reconfigurable Coprocessors

### Quick Points

- Unresolved course issues
  - Gigantic red bug
  - Bean-leaf beetle?
- This Thursday, project status updates
  - 10-minute presentations per group + questions
  - Combination of Adobe Breeze and calling in to teleconference
  - More details later today

November 13, 2007

CprE 583 - Reconfigurable Computing

Lect-24.2









### Outline

- Recap
- Reconfigurable Coprocessors
  - Motivation
  - Compute Models
  - Architecture
  - Examples


### Overview

- Processors efficient at sequential codes, regular arithmetic operations
- FPGA efficient at fine-grained parallelism, unusual bit-level operations
- Tight-coupling important: allows sharing of data/control
- Efficiency is an issue:
  - Context-switches
  - Memory coherency
  - Synchronization


### Compute Models

- I/O pre/post processing
- Application specific operation
- Reconfigurable Co-processors
  - Coarse-grained
  - Mostly independent
- Reconfigurable Functional Unit
  - Tightly integrated with processor pipeline
  - Register file sharing becomes an issue


### Instruction Augmentation

- Processor can only describe a small number of basic computations in a cycle
  - I bits -> 2<sup>I</sup> operations
- Far more operations than that could be performed on two W-bit words
- ALU implementations restrict execution of some simple operations
  - e.g. bit reversal: swap bit positions, $a_{31}\,a_{30} \ldots a_0 \rightarrow b_{31} \ldots b_0$ with $b_i = a_{31-i}$
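To illustrate why bit reversal is awkward on a conventional ALU, here is a minimal sketch of the common shift-and-mask software approach for a 32-bit word. The function name `bit_reverse32` is an illustrative choice, not something from the lecture.

```python
def bit_reverse32(a: int) -> int:
    """Return b where bit i of b equals bit (31 - i) of a (32-bit word)."""
    a &= 0xFFFFFFFF
    # Swap adjacent bits, then bit pairs, nibbles, bytes, and half-words.
    a = ((a >> 1) & 0x55555555) | ((a & 0x55555555) << 1)
    a = ((a >> 2) & 0x33333333) | ((a & 0x33333333) << 2)
    a = ((a >> 4) & 0x0F0F0F0F) | ((a & 0x0F0F0F0F) << 4)
    a = ((a >> 8) & 0x00FF00FF) | ((a & 0x00FF00FF) << 8)
    a = (a >> 16) | ((a & 0x0000FFFF) << 16)
    return a & 0xFFFFFFFF
```

Each of the five swap steps costs several shift, AND, and OR operations on the processor, whereas in reconfigurable logic the same permutation is only wiring.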


### Instruction Augmentation (cont.)

- Provide a way to augment the processor instruction set for an application
- Avoid mismatch between hardware/software
- Fit augmented instructions into data <u>and</u> control stream
- Create a functional unit for augmented instructions
- Compiler techniques to identify/use new functional unit


### "First" Instruction Augmentation

- PRISM
  - Processor Reconfiguration through Instruction Set Metamorphosis
- PRISM-I
  - 68010 (10MHz) + XC3090
  - Can reconfigure FPGA in one second!
  - 50-75 clocks for operations


### PRISM-I Results

| Function | Description | Input/output bytes | Compilation time (mins) | % utilization of XC3090 FPGA | Speed-up factor |
|---|---|---|---|---|---|
| Hamming(x,y) | Calculates the Hamming metric. | 4/2 | | | |
| Bitrev(x) | Bit-reversal function. | | | | |
| | Cascadable 4-input N-Net function. | 4/4 | | 52% | 12 |
| MultAccm(x,y) | Multiply/accumulate function. | 4/4 | 11 | 58% | 2.9 |
| LogicEv(x) | Logic simulation engine function. | 4/4 | 12 | 40% | 18 |
| ECC(x,y) | Error correction coder/decoder. | 3/2 | | | |
| Find first 1(x) | Find first '1' in input. | 4/1 | | 11% | 42 |
| Piecewise(x) | 5-section piecewise linear seg. | 4/4 | 24 | 77% | 5.1 |
| ALog2(x) | Computes base-2 A\*log(x). | 4/4 | | | |

Empty cells did not survive in the source. Bitrev(x) and ECC(x,y) each retain one value (26 and 24, respectively) whose column is unclear.

### PRISM Architecture

- FPGA on bus
- Access as memory mapped peripheral
- Explicit context management
- Some software discipline for use
- ...not much of an "architecture" presented to user




















| Instruction                   | Interlock? | Description                                                              |
|-------------------------------|------------|--------------------------------------------------------------------------|
| gaconf reg                    | yes        | Load (or switch to) configuration at address given by reg.               |
| mtga reg, array-row-reg, count | yes       | Copy reg value to array-row-reg and set array clock counter to count.    |
| mfga reg, array-row-reg, count | yes       | Copy array-row-reg value to reg and set array clock counter to count.    |
| gabump reg                    | no         | Increase array clock counter by value in reg.                            |
| gastop reg                    | no         | Copy array clock counter to reg and stop array by zeroing clock counter. |
| gacinv reg                    | no         | Invalidate cache copy of configuration at address given by reg.          |
| cfga reg, array-control-reg   | no         | Copy value of array control register array-control-reg to reg.           |
| gasave reg                    | yes        | Save all array data state to memory at address given by reg.             |
| garestore reg                 | yes        | Restore previously saved data state from memory at address given by reg. |

- Interlock indicates if processor waits for array to count to zero
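The clock-counter and interlock behavior described in the table can be sketched as a toy model. This is only an illustration under stated assumptions: the class `ArrayModel` and its methods `set_count`, `bump`, and `stop` are hypothetical names standing in for the set-counter, increase-counter, and stop operations, and the one-stall-per-array-cycle timing is assumed, not taken from the lecture.

```python
class ArrayModel:
    """Toy model of an array whose execution is gated by a clock counter."""

    def __init__(self):
        self.clock_counter = 0  # array advances only while this is nonzero
        self.cycles_run = 0     # total array cycles executed
        self.stall_cycles = 0   # processor cycles lost to interlocks (assumed 1:1)

    def tick(self):
        """Advance the array by one cycle if the counter is nonzero."""
        if self.clock_counter > 0:
            self.clock_counter -= 1
            self.cycles_run += 1

    def _interlock(self):
        """Stall the processor until the array has counted down to zero."""
        while self.clock_counter > 0:
            self.tick()
            self.stall_cycles += 1

    def set_count(self, count):
        """Interlocked: wait for the array, then start a new countdown."""
        self._interlock()
        self.clock_counter = count

    def bump(self, extra):
        """Not interlocked: extend the current countdown."""
        self.clock_counter += extra

    def stop(self):
        """Not interlocked: return the remaining count and halt the array."""
        remaining = self.clock_counter
        self.clock_counter = 0
        return remaining
```

In this model, an interlocked operation issued while the array is still counting stalls the processor, while the non-interlocked operations take effect immediately.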





### PRISC/Chimaera vs. Garp

- PRISC/Chimaera
  - Basic op is single cycle: expfu
  - No state
  - Could have multiple PFUs
  - Fine grained parallelism
  - Not effective for deep pipelines
- Garp
  - Basic op is multi-cycle: gaconfig
  - Effective for deep pipelining
  - Single array
  - Requires state swapping consideration


### VLIW/microcoded Model

- Similar to instruction augmentation
- Single tag (address, instruction)
  - Controls a number of more basic operations
- Some difference in expectation
  - Can sequence a number of different tags/operations together


### REMARC

- Array of "nano-processors"
  - 16b, 32 instructions each
  - VLIW like execution, global sequencer
- Coprocessor interface (similar to GARP)
  - No direct array⇔memory




### Common Theme

- To overcome instruction expression limits:
  - Define new array instructions, which makes decode hardware slower / more complicated
  - Many bits of configuration mean long swap time, which is an issue (recall tips for dynamic reconfiguration)
- Give array configuration short "name" which processor can call out
- Store multiple configurations in array
- Access as needed (DPGA)
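The idea of calling out a configuration by a short name while keeping several configurations resident can be sketched as a small cache. This is a hedged illustration: the class `ConfigCache`, the least-recently-used replacement policy, and the `loads` counter are assumptions made for the example, not details from the lecture.

```python
from collections import OrderedDict


class ConfigCache:
    """Keeps up to num_contexts configurations resident, keyed by short name."""

    def __init__(self, num_contexts, backing_store):
        self.num_contexts = num_contexts
        self.backing_store = backing_store  # name -> full configuration bits
        self.resident = OrderedDict()       # resident contexts, oldest first
        self.loads = 0                      # number of slow configuration loads

    def call(self, name):
        """Activate configuration `name`, loading it only on a miss."""
        if name in self.resident:
            self.resident.move_to_end(name)        # hit: fast context switch
        else:
            if len(self.resident) >= self.num_contexts:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[name] = self.backing_store[name]
            self.loads += 1                        # miss: pay the swap time
        return self.resident[name]
```

A hit costs only a context switch, while a miss pays the full configuration swap time, which is why keeping multiple configurations in the array matters.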


### Observation

- All coprocessors have been single-threaded
  - Performance improvement limited by application parallelism
- Potential for task/thread parallelism
  - DPGA
  - Fast context switch
- Concurrent threads seen in discussion of IO/stream processor
- Added complexity needs to be addressed in software


### Parallel Computation

- What would it take to let the processor and FPGA run in parallel?

### Modern Processors

Deal with:

- Variable data delays
- Dependencies with data
- Multiple heterogeneous functional units

Via:

- Register scoreboarding
- Runtime data flow (Tomasulo)


### OneChip

- Want array to have direct memory→memory operations
- Want to fit into programming model/ISA
  - Without forcing exclusive processor/FPGA operation
  - Allowing decoupled processor/array execution
- Key Idea:
  - FPGA operates on memory→memory regions
  - Make regions explicit to processor issue
  - Scoreboard memory blocks


### OneChip Pipeline

### OneChip Instructions

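- Basic operation is:
  - FPGA MEM[Rsource]→MEM[Rdst]
    - block sizes powers of 2

| Field | Opcode | FPGA function | misc. | R<sub>source</sub> | R<sub>dest</sub> | source block size | destination block size |
|-------|--------|---------------|-------|--------------------|------------------|-------------------|------------------------|
| Bits  | 6      | 4             | 2     | 5                  | 5                | 5                 | 5                      |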

- Supports 14 "loaded" functions
  - DPGA/contexts so 4 can be cached
- Fits well into soft-core processor model
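The instruction format can be illustrated with a small pack/unpack sketch. It assumes seven fields of widths 6, 4, 2, 5, 5, 5, and 5 bits (32 bits total) packed most-significant-field first. The field order, the short field names, and the helpers `encode` and `decode` are assumptions for illustration, and the block-size fields are treated as opaque 5-bit values.

```python
# (name, width in bits), listed from most significant to least significant.
FIELDS = [
    ("opcode", 6),
    ("func", 4),      # FPGA function
    ("misc", 2),
    ("rsrc", 5),      # source address register
    ("rdst", 5),      # destination address register
    ("src_size", 5),  # source block size field
    ("dst_size", 5),  # destination block size field
]


def encode(**values):
    """Pack the named fields into one 32-bit instruction word."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack a 32-bit instruction word into a dict of fields."""
    fields = {}
    shift = sum(width for _, width in FIELDS)
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields
```

A 4-bit function field can name at most 16 functions, which is consistent with the slide's figure of 14 "loaded" functions.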


### OneChip (cont.)

- Basic op is: FPGA MEM→MEM
- No state between these ops
- Coherence is that ops appear sequential
- Could have multiple/parallel FPGA compute units
  - Scoreboard with processor and each other
- Single source operations?
- Can't chain FPGA operations?

### OneChip Extensions

- FPGA operates on certain memory regions only
- Makes regions explicit to processor issue
- Scoreboard memory blocks

Figure: memory map with addresses 0x0, 0x1000, and 0x10000 marked and regions labeled FPGA and Proc.

Indicates usage of data pages like virtual memory system!

### Compute Model Roundup

- Interfacing
- IO Processor (Asynchronous)
- Instruction Augmentation
  - PFU (like FU, no state)
  - Synchronous Coprocessor
  - VLIW
  - Configurable Vector
- Asynchronous Coroutine/coprocessor
- Memory⇒memory coprocessor

### **Shadow Registers**

- Reconfigurable functional units require tight integration with register file
- Many reconfigurable operations require more than two operands at a time



### **Multi-Operand Operations**

- What's the best speedup that could be achieved?
  - Provides upper bound
- Assumes all operands available when needed



### Additional Register File Access

- Dedicated link: move data as needed
  - Adds latency
- Extra register port: consumes resources
  - May not be used often
- Replicate whole (or most) of register file
  - Can be wasteful





### Summary

- Many different models for co-processor implementation
  - Functional unit
  - Stand-alone co-processor
- Programming models for these systems are key
- Recent compiler advancements open the door for future development
- Need tie in with applications
