



# **Compact Code Generation**

Tobias Edler von Koch Igor Böhm and Björn Franke



 FASA

 Processor Automated Synthesis

 by iTerative Analysis







### Integrated Instruction Selection and Register Allocation for Compact Code Generation Exploiting Freeform Mixing of 16- and 32-bit Instructions

Tobias Edler von Koch Igor Böhm and Björn Franke



FASAProcessor Automated Synthesisby iTerative Analysis







# **Compact Code Generation**

Tobias Edler von Koch Igor Böhm and Björn Franke



 FASA

 Processor Automated Synthesis

 by iTerative Analysis









## **Does Code Size Matter?**

#### 128-Mbit Flash 27.3mm<sup>2</sup> at 0.13µm









## Does Code Size Matter?

#### I 28-Mbit Flash 27.3mm<sup>2</sup> at 0.I3μm



ISA

ARM Cortex M3 0.43mm<sup>2</sup> at 0.13µm









# Does Code Size Matter?

I 28-Mbit Flash 27.3mm<sup>2</sup> at 0.I3μm



ISA

ARM Cortex M3 0.43mm<sup>2</sup> at 0.13µm

**b** d

ARCompact ISA ENCORE Calton 0.15mm<sup>2</sup> at 0.13µm









## Does Code Size Matter?

I 28-Mbit Flash 27.3mm<sup>2</sup> at 0.I3μm



ISA

ARM Cortex M3 0.43mm<sup>2</sup> at 0.13µm



ARCompact ISA ENCORE Calton 0.15mm² at 0.13µm

### **RISC Architectures**

sacrifice code density in order to simplify implementation circuitry and decrease die area













 Dual instruction sets providing 32-bit and I6-bit instruction encodings:







 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

# ARM Thumb-2 ARM Thumb ARCompact







 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

#### ARM Thumb-2 ARM Thumb ARM Thumb

• There's no such thing as a free lunch!







 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

#### ARM Thumb-2 ARM Thumb ARM Thumb

- There's no such thing as a free lunch!
- 16-bit instructions come with constraints!







 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

#### ARM Thumb-2 ARM Thumb ARM Thumb

- There's no such thing as a free lunch!
- 16-bit instructions come with constraints!







 Dual instruction sets providing 32-bit and 16-bit instruction encodings:

#### ARM Thumb-2 ARM Thumb ARM Thumb

- There's no such thing as a free lunch!
- 16-bit instructions come with constraints!

















### Common Compact Instruction Format Constraints

• Only subset of registers accessible









- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes









- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands









- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands
- Not every 32-bit instruction has a 16-bit counterpart









- Only subset of registers accessible
- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands
- Not every 32-bit instruction has a 16-bit counterpart
- Free mixing of I6- and 32-bit encodings not always possible







### **ARCompact** 16-bit Instruction Format Constraints

• Only subset of registers accessible

THE UNIVERSITY of EDINBURGH

informatics

- Explicit instructions necessary to switch between 16-bit and 32-bit modes
- Reduced size of immediate operands
- Not every 32-bit instruction has a 16-bit counterpart
- Free mixing of 16- and 32-bit encodings not always possible









## Motivating Example

32-Bit Only

| ld         | r2,[sp,0]                                    |
|------------|----------------------------------------------|
| ld         | r3,[sp,4]                                    |
| add<br>asl | r4,[sp,8]<br>r2,r2,r3<br>r2,r2,2<br>r2,r2,r4 |

#### 24 bytes









# Motivating Example



24 bytes

12 bytes

5

Instruction Selection

**Basic Block** 







# Motivating Example



24 bytes

Instruction Selection

**Register Allocation** 









# Motivating Example

| 32-Bit Only                                                   | Mixed Mode<br>Aggressive                                                        | Mixed Mode<br>Integrated                                          |
|---------------------------------------------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------------------|
| ld r2,[sp,0]<br>ld r3,[sp,4]                                  | <pre>ld_s r2,[sp,0] ld_s r3,[sp,4] mov r4,r1</pre>                              | ld_s r2,[sp,0]<br>ld_s r3,[sp,4]                                  |
| <pre>ld r4,[sp,8] add r2,r2,r3 asl r2,r2,2 sub r2,r2,r4</pre> | <pre>ld_s r1,[sp,8] add_s r2,r2,r3 asl_s r2,r2,2 sub_s r2,r2,r1 mov r4,r1</pre> | <pre>ld r4,[sp,8] add_s r2,r2,r3 asl_s r2,r2,2 sub r2,r2,r4</pre> |

24 bytes

20 bytes

6 bytes





## Efficient Compact Code Generation is an Integrated Instruction Selection and Register Allocation Problem!







**ECC** - EnCore C Compiler based on commercial CoSy compiler by ACE<sup>©</sup>.









**ECC** - EnCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies

Opportunistic

Feedback-Guided







**ECC** - EnCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies









**ECC** - EnCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies









**ECC** - EnCore C Compiler based on commercial CoSy compiler by ACE<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies









**ECC** - EnCore C Compiler based on commercial CoSy compiler by ACE<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies









**ECC** - **E**nCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies







**ECC** - EnCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies







**ECC** - EnCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies







**ECC** - EnCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies







# **Compact Code Generation Approach**

**ECC** - **E**nCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies







# **Compact Code Generation Approach**

**ECC** - EnCore C Compiler based on commercial CoSy compiler by ACE<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies







# **Compact Code Generation Approach**

**ECC** - EnCore **C** Compiler based on commercial **CoSy** compiler by **ACE**<sup>©</sup>.

Compiler backend supports **two** compact code generation strategies

































































BenchMarks EEMBC I.I









BenchMarks EEMBC I.I



| Simulator          | ArcSim                                           |
|--------------------|--------------------------------------------------|
| Simulation Mode    | Full System, Cycle Accurate                      |
| Accuracy           | Cycle accurate mode<br>validated against real HW |
| Options            | Default                                          |
| I/O & System Calls | Emulated                                         |









| BenchMarks | EEMBC I.I |
|------------|-----------|
|------------|-----------|

| Core              | ARC750D             |
|-------------------|---------------------|
| Pipeline          | 7-stage interlocked |
| Execution Order   | In-Order            |
| Branch Prediction | Yes                 |
| ISA               | ARCompact           |
| Floating Point    | Hardware            |



| Simulator          | ArcSim                                           |
|--------------------|--------------------------------------------------|
| Simulation Mode    | Full System, Cycle Accurate                      |
| Accuracy           | Cycle accurate mode<br>validated against real HW |
| Options            | Default                                          |
| I/O & System Calls | Emulated                                         |









| BenchMarks               | EEMBC I.I            |  |
|--------------------------|----------------------|--|
|                          |                      |  |
| Core                     | ARC750D              |  |
| Pipeline                 | 7-stage interlocked  |  |
| Execution Order          | In-Order             |  |
| <b>Branch Prediction</b> | Yes                  |  |
| ISA                      | ARCompact            |  |
| Floating Point           | Hardware             |  |
| Memory<br>Subsystem      |                      |  |
| LI Cache                 | Yes                  |  |
| Instruction              | 8k/2-way associative |  |
| Data                     | 8k/2-way associative |  |
| L2 Cache                 | No                   |  |



| Simulator          | ArcSim                                           |
|--------------------|--------------------------------------------------|
| Simulation Mode    | Full System, Cycle Accurate                      |
| Accuracy           | Cycle accurate mode<br>validated against real HW |
| Options            | Default                                          |
| I/O & System Calls | Emulated                                         |







#### Evaluation - Code Size Reduction

Improvement in Code Size (in %) 35 30 25 20 15 10 5 Λ -5 viterboo bitmp01 candrol inft01 pritchol average cipe<sup>9</sup> aiift01 ospt 10<sup>th</sup> 0kup ent aittin holedol hol med infole hor cache identification otherotateotextotoroneroo totalo tion dipeonyonidon office automotive telecom net consumer Baseline: plain32-bit code Feedback-guided selection Opportunistic selection GCC using -O3. (avg: 16.7%) (avg: 2.7%) (avg: 15.4%) 9











































# Why does the simple Opportunistic Mode perform so well?









#### Why does the simple Opportunistic Mode perform so well?









Why does the simple Opportunistic Mode perform so well?

 register allocator selects registers with lower ID from set of possible registers









Why does the simple Opportunistic Mode perform so well?

- register allocator selects registers with lower ID from set of possible registers
- calling conventions constrain register allocator

















































# Conclusions





# Conclusions

• Compact Code Generation is an *integrated* Instruction Selection and Register Allocation problem.





# Conclusions

- Compact Code Generation is an *integrated* Instruction Selection and Register Allocation problem.
- While our simple *opportunistic* mode works well, our *feedback-directed* mode produces more consistent results and does not rely on calling conventions or register-allocation implementations.





# Conclusions

- Compact Code Generation is an *integrated* Instruction Selection and Register Allocation problem.
- While our simple *opportunistic* mode works well, our *feedback-directed* mode produces more consistent results and does not rely on calling conventions or register-allocation implementations.
- Our scheme is the first one demonstrating that small code size can be achieved whilst improving performance.





# Thanks! Questions?

For more information visit our website or search for the term 'PASTA project'.



Engineering and Physical Sciences Research Council









http://groups.inf.ed.ac.uk/pasta/

Processor Automated Synthesis by Terative Analysis Project

# **Thanks! Questions**?

For more information visit our website or search for the term 'PASTA project'.



**EPSRC** Engineering and Physical Sciences Research Council

In the PASTA project we seek to automate the design and optimisation of customisable embedded processors. We do this by creating tools that are able to learn about the physical characteristics of the underlying silicon. technology and use that knowledge to synthesise the structure of an embedded processor. As the processor is now a flexible entity without a pre-defined instruction. set, the compiler for that processor must also be synthesised. Furthermore, the code optimisations that the compiler performs when translating from source. code to the synthetic architecture, must also be synthesised.

THE UNIVERSITY of EDINBURGH

*informatics* 

The three main areas for automated synthesis are:

- The processor architecture.
- the micro-architecture, and
- the compiler.



However, the information on which to make automated decisions in each case will be different. At the micro-architecture level we need to know how each microarchitecture option translates into speed. energy and die area. At the architectural we need to know how each level instruction set option translates into clock. cycles of execution time, and at the compiler level we need to know how each optimisation reduces the overall number of instructions executed and maximises the effectiveness of the memory system.

The challenge of our research is that all three areas are inter-dependent and ultimately depend on the characteristics of the silicon on which the system is based. By blurring the boundary between hardware and software, and by automating

the process of adjusting that boundary, we hope to create a system that can perform design trade-offs in seconds when currently it takes an experienced designer several days.



EPGA Platform ASIC Flow Chips & Boards

M.Sc. and UG Projects

Research Areas

Resource Sharing ConFigurable Flow Accelerators Automated ISE Mapping ISE Power Prediction & Optimisation Interconnect Synthesis Cache Optimisation for Embedde Systems: Compilation for Dual Memory Banks Fast Cycle - Approximate

