

## 50% Power Reduction Through Automated Synthesis of an Asynchronous Microprocessor and Logic from Synchronous RTL

Bill Ellersick, Analog Circuit Works

Dylan Hand, Reduced Energy Microsystems

Peter A. Beerel, University of Southern California

April 19, 2018



## **BIG PICTURE**

Power efficiency is key to mobile devices and datacenters

Industry is shifting from raw performance to performance/W

REM and ACW are fabless semiconductor companies collaborating on a project using unique technology to build ultra-efficient processors

## POWER EFFICIENT MICROPROCESSOR PROJECT

- Asynchronous logic to reduce power (same functionality from outside)
- Integrated power management with 5 zones to optimize different circuit types



Copyright 2018 Reduced Energy Microsystems, Analog Circuit Works

## **WORST-CASE PERFORMANCE**



Synchronous designs are synthesized to meet worst-case timing

Area and power overhead required to hit the performance target can be substantial

Figure labels: Power-Optimized Block Size; Extra Area to Meet Timing

## **OVERVIEW**

- Introduction
- Asynchronous bundled data technology
- Supply voltage scaling
- Synchronous to asynchronous flow
- Summary

## **RESILIENT BUNDLED DATA**



Static combinational logic, with error-detecting latches to pass data to the next block

Delay matches the worst-case combinational logic delay 90% of the time

10% of the time, outputs change after Delay: control holds off the next stage
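A rough way to see why the 90%/10% split pays off is to compute the expected stage time. This is a minimal sketch, not the authors' model: the delay values and the assumption that a late output costs a fixed stall penalty are hypothetical, chosen only for illustration.

```python
def average_cycle_time(delay, error_rate, penalty):
    """Expected stage time: nominal delay plus the stall paid when data arrives late."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    return delay + error_rate * penalty


# Hypothetical, normalized numbers (assumptions, not measured data):
WORST_CASE = 1.0   # cycle time a synchronous design must honor
DELAY = 0.6        # delay line setting that covers 90% of operations
PENALTY = 0.6      # assumed stall when the error-detecting latch flags late data

avg = average_cycle_time(DELAY, error_rate=0.10, penalty=PENALTY)
speedup = WORST_CASE / avg
print(f"average stage time {avg:.2f}, speedup vs. worst case {speedup:.2f}x")
```

Under these assumed numbers the occasional stall adds little to the average, so the stage still runs well below the worst-case cycle time.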



## **AVERAGE-CASE PERFORMANCE**



Relaxing timing constraints allows some instructions to take longer

On average, computational performance does not change

Smaller logic blocks are more power and area efficient

Figure labels: Power-Optimized Block Size; Async Control Logic; Extra Area to Meet Timing

## **DATA DEPENDENCY**



Resilient BD allows cycle time to change depending on input data

Assumes long data-dependent cycles are rare: average performance improves

## **SELECTABLE DELAYS**



Identify blocks with variable delay based on operation (manually)

Change delay of stage on a cycle-by-cycle basis

Achieve averaging over time
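The averaging effect of selectable delays can be sketched by summing per-operation delays over an instruction trace and comparing against a single fixed worst-case delay. The operation names, delay values, and trace below are hypothetical, chosen only for illustration.

```python
# Hypothetical normalized stage delays per operation type (assumed values).
OP_DELAY = {"add": 0.5, "shift": 0.4, "mul": 1.0}


def selectable_time(trace, op_delay):
    """Total time when the stage delay is selected per operation, cycle by cycle."""
    return sum(op_delay[op] for op in trace)


def fixed_time(trace, op_delay):
    """Total time when every cycle uses the slowest operation's delay."""
    return len(trace) * max(op_delay.values())


# Assumed trace in which the slow operation is rare.
trace = ["add"] * 6 + ["shift"] * 3 + ["mul"]

t_sel = selectable_time(trace, OP_DELAY)
t_fix = fixed_time(trace, OP_DELAY)
print(f"selectable: {t_sel:.1f}, fixed worst case: {t_fix:.1f}")
```

The gain depends entirely on the operation mix: a trace dominated by the slowest operation would show no benefit.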



## **RESILIENT BD ADVANTAGES**

Exploits variations in delay regardless of logical function enabling a more generic approach

Allows designers to cut margins as timing violations will be detected and corrected

Tolerates and averages random delay variations

## **DELAY DISTRIBUTION**

Distribution of Operation Delays in Synchronous Processor



Worst-case operations occur infrequently. Wide variability in delay shows potential to optimize designs for average case performance, saving time.



## **VOLTAGE SCALING**



Dropping voltage trades time savings for power savings

Matches synchronous performance at a lower voltage

## TIME SAVINGS CAN TRADE FOR **EVEN LARGER POWER SAVINGS**

- When peak speed not needed, run at lower Vdd
- Power drops faster than speed as Vdd drops => efficiency increases
  - Chart shows 28nm normalized logic efficiency (Gops/W) vs. Vdd
  - Blue curve includes efficient, integrated buck and charge pump converters
- Async designs use 2-3x less power than sync designs [D. Hand et al., "Blade -- A Timing Violation Resilient Asynchronous Template," ASYNC 2015]
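The claim that power drops faster than speed can be illustrated with a first-order textbook model: dynamic power scaling as Vdd² times frequency, and gate speed following the alpha-power law. This is only a sketch under those assumptions. It does not reproduce the 28nm chart, it ignores leakage and converter losses, and the threshold voltage and alpha values are hypothetical.

```python
def relative_speed(vdd, vt=0.35, alpha=1.3):
    """Alpha-power-law gate speed, up to a constant factor (vt, alpha are assumed)."""
    if vdd <= vt:
        raise ValueError("model only valid for vdd above the threshold voltage")
    return (vdd - vt) ** alpha / vdd


def relative_power(vdd, vt=0.35, alpha=1.3):
    """Dynamic power ~ C * Vdd^2 * f, with f tracking gate speed (leakage ignored)."""
    return vdd ** 2 * relative_speed(vdd, vt, alpha)


def normalized(vdd, vref=1.0):
    """Speed, power, and efficiency (ops/W) at vdd, relative to vref."""
    speed = relative_speed(vdd) / relative_speed(vref)
    power = relative_power(vdd) / relative_power(vref)
    return speed, power, speed / power


for v in (1.0, 0.9, 0.8, 0.7):
    s, p, e = normalized(v)
    print(f"Vdd={v:.1f}  speed={s:.2f}  power={p:.2f}  efficiency={e:.2f}")
```

In this simplified model efficiency rises as 1/Vdd² because leakage is left out; a real curve flattens and eventually falls at low supply voltages.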



## EDA FLOW OVERVIEW



Design flow on top of existing synchronous flow

Custom Tcl / Perl scripts to glue between tools and generate constraints

Benefits from advancements in tools & synchronous IP (e.g. RISC-V core)



Flow diagram labels: Verification; Timing Analysis

## SYNC TO ASYNC

- Automated conversion of synchronous netlist to asynchronous
  - Choose asynchronous logic domains
  - Replace flip-flops with latches
  - Add custom cells: delay lines, error detecting latches, and control logic
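The conversion steps listed above can be sketched on a toy netlist. This is an illustrative sketch, not the authors' tool: the cell kinds (`DFF`, `LATCH`, `EDL`, `DELAY_LINE`, `ASYNC_CTRL`), the `critical` set used to decide which registers become error-detecting latches, and the one-controller-per-domain layout are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str
    kind: str     # hypothetical cell type, e.g. "DFF" or "NAND2"
    domain: str   # asynchronous logic domain chosen for this cell


def to_async(netlist, critical=frozenset()):
    """Replace flip-flops with latches and add per-domain custom cells."""
    converted = []
    for cell in netlist:
        if cell.kind == "DFF":
            # Assumption: registers flagged as timing-critical get error detection.
            kind = "EDL" if cell.name in critical else "LATCH"
            converted.append(Cell(cell.name, kind, cell.domain))
        else:
            converted.append(cell)
    # Assumption: one delay line and one controller per asynchronous domain.
    for domain in sorted({c.domain for c in netlist}):
        converted.append(Cell(f"{domain}_delay", "DELAY_LINE", domain))
        converted.append(Cell(f"{domain}_ctrl", "ASYNC_CTRL", domain))
    return converted


sync_netlist = [
    Cell("pc_reg", "DFF", "fetch"),
    Cell("u1", "NAND2", "fetch"),
    Cell("alu_out_reg", "DFF", "execute"),
]
async_netlist = to_async(sync_netlist, critical={"alu_out_reg"})
for cell in async_netlist:
    print(cell)
```

A real flow also rewires clock pins to the controller handshake signals, which this sketch leaves out.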



Flow outputs: Async Constraints File (Tcl); Async Netlist

## INTERACTIVE OPTIMIZATION, **TIMING CLOSURE**

Define virtual clock for each async controller to establish constraints

Periods, delays of clocks based on analysis of asynchronous circuit

Async scripts analyze average-case performance and generate constraints

Iterate to obtain desired performance post-synthesis and place/route

Delay lines synthesized, modified during P&R
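Constraint generation of this kind can be sketched as a script that emits one virtual clock per controller. The sketch below uses the standard SDC `create_clock -name ... -period ...` form, which defines a virtual clock when no source object is given. The controller names, periods, and `vclk_` prefix are hypothetical, and the flow described here uses Tcl / Perl scripts rather than Python.

```python
def virtual_clock_sdc(controllers):
    """Emit one SDC virtual clock per async controller.

    `controllers` maps a controller name to its target period in ns.
    """
    lines = []
    for name, period_ns in sorted(controllers.items()):
        if period_ns <= 0:
            raise ValueError(f"period for {name} must be positive")
        # No source object is listed, so SDC treats this as a virtual clock.
        lines.append(f"create_clock -name vclk_{name} -period {period_ns:.3f}")
    return "\n".join(lines)


# Hypothetical periods that an average-case analysis step might produce.
sdc = virtual_clock_sdc({"fetch": 0.8, "execute": 1.25})
print(sdc)
```

Iterating toward timing closure would then amount to adjusting the periods and regenerating the file after each synthesis or place/route pass.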





## **TECHNOLOGY SUMMARY**





## **RECENT RESULTS**

- 130nm computational logic ASIC
  - Flawless operation from 0.65 to 1.3V, silicon-verified
  - 3.39M transistors in 10.35mm<sup>2</sup> with 6 power domains
  - Excellent power efficiency, matching simulation
- 28nm 3-stage OpenCore MIPS CPU
  - 14k gates
  - 15,226 um<sup>2</sup>: 9% area increase vs. 14,000 um<sup>2</sup> synchronous design
  - 35% speedup vs. synchronous design



## UPCOMING SILICON

- REM's upcoming 22nm Machine Vision processor
  - Custom architecture tailored for edge computing
    - Expected order-of-magnitude smaller power envelope than existing solutions
  - Heterogeneous processors for maximum versatility
    - Dual RISC-V processors
    - Powerful Tensilica Vision DSP
    - Proprietary Asynchronous Neural Network Accelerator designed in-house



## **22FDX PROCESS**

- 22nm Fully-Depleted Silicon-on-Insulator
  - FinFET performance at 28nm planar cost
  - Energy efficient for logic, analog, RF
  - Roadmap to 12nm FD-SOI
- Ultra-low power dissipation with supply voltage as low as 0.4V
- Body biasing can super-charge or throttle with no layout changes

Reference: https://www.globalfoundries.com/technology-solutions/cmos/fdx/22fdx

## WHY ASYNCHRONOUS LOGIC

- Eliminates inefficiencies in synchronous RTL methodology
  - Waits for worst case transistor, process, voltage, temperature, noise
- It may finally be time
  - ILLIAC II was an early asynchronous processor, in 1962
    - In ILLIAC's first 3 weeks of operation, it discovered 3 new Mersenne primes
    - (In one of my early school projects, I built the WILLIAC, powered by paper tape)
  - Asynchronous arithmetic units are widely used
  - Intel recently announced the Loihi neural asynchronous processor

## WHY RESILIENT BUNDLED DATA

- Automatic conversion from sync RTL: no costly manual design
- Tolerates variations and recovers in realtime
  - Performance improves as supply drops (variations increase)
  - Handles rapid changes in supply voltage gracefully
- 100s of timing constraints vs. 100K in previous async flows
  - Combinational blocks are similar to synchronous combinational blocks
- Meets urgent need for improved power efficiency



## SUMMARY

- REM and ACW are building highly power-efficient processors
- REM's asynchronous technology enables average-case design
  - Dramatic power-efficiency gains
  - Can optimize asynchronous flow with latest IP advances
- Analog Circuit Works enhances power efficiency, enables SoCs
  - DDR Memory I/F, CSI Camera I/F, USB
  - Integrated power management optimizes each circuit
- We are open to partnerships on products

## ANALOG CIRCUIT WORKS





