EE290A Homework 3 Report
Rhett Davis, Ning Zhang, Chris Tayor, David Chinnery

I. Introduction to Inventra™ Viterbi encoder/decoder soft core


II. Our implementation based on the Mentor Viterbi soft core


III. Synopsys and EPIC Power and Timing Analysis Results

Summary of back annotation methods:
Summary of Timing Analysis:
Summary of final Power Analysis:
Summary of Power Compiler Results (Decoder Gates only)
Power     Without annotation   With annotated        With annotated capacitances
                               switching activity    and switching activity
cell      28 mW                20 mW                 20 mW
net       15 mW                6 mW                  9 mW
Total     43 mW                26 mW                 29 mW
leakage   750 nW               810 nW                810 nW
Summary of Memory Module Power
Memory Module                         Element count   Energy per operation   Measured switching activity   Frequency   Total power
9x8 ACS Memory (small SRAM)           16              31.9 pJ                0.5                           60 MHz      7.9 mW
64x16 Traceback Memory (large SRAM)   3               96.5 pJ                0.94                          60 MHz      18 mW

Final Power Numbers:

EPIC Powermill decoder simulation results:

IV. Place and Route

Place and route of the Viterbi SRAM macro cells and standard cells was done with Cadence tools, using Silicon Ensemble for routing. Parasitics for timing simulation were extracted from the final placed and routed nets in Silicon Ensemble.

Routing the chip without violations proved difficult: the smallest number of routing violations observed was 6, even with a large chip area of 12 mm².

The routing congestion appears to be worst at the outputs of the 16 by 64 bit SRAM; an SRAM cell design with its pins spread over a greater length might solve the problem. From previous experience, Rhett has found that Silicon Ensemble has trouble routing to pins spaced less than 2 um apart, since wires are restricted to being 1 um apart and vias to higher metal layers are quite large.

Different areas were tried, along with different placements (with width x height in um):

The ASIC estimated area was 2.5 mm², compared with 4.0 mm² in the final implementation. Since this was an implementation of a soft core, with some routing problems, a somewhat larger area is to be expected. Solving the routing problems by spacing the pins more widely or by using IC Craftsman might have given a chip area as low as 3.2 mm².
The final placed and routed chip.

Area statistics:

Gate count statistics:
Placement and routing wiring statistics:
Total wire lengths:

V. SRAM Simulations

Assumptions:
Smaller ACS unit SRAM (16 by 8 bits) power simulation, without parasitics. There are 16 of these on the chip.
Operations                       Average Current (uA)   Average Power (mW)   Average Energy per Operation (pJ)
all read activity                664                    1.66                 24.9
all write activity               563                    1.41                 21.1
random read and write activity   612                    1.53                 23.0
Larger traceback SRAM (16 by 64 bits) power simulation, without parasitics. There are three of these on the chip.
Operations                       Average Current (uA)   Average Power (mW)   Average Energy per Operation (pJ)
all read activity                2170                   5.43                 81.4
all write activity               1890                   4.73                 71.0
random read and write activity   2090                   5.22                 78.3
Smaller ACS unit SRAM (16 by 8 bits) power simulation, with parasitics.
Operations                       Average Current (uA)   Average Power (mW)   Average Energy per Operation (pJ)
all read activity                950                    2.37                 35.6
all write activity               773                    1.93                 29.0
random read and write activity   851                    2.13                 31.9
Larger traceback SRAM (16 by 64 bits) power simulation, with parasitics. There are three of these on the chip.
Operations                       Average Current (uA)   Average Power (mW)   Average Energy per Operation (pJ)
all read activity                2480                   6.21                 93.2
all write activity               2680                   6.69                 100.0
random read and write activity   2570                   6.44                 96.5
Including the extracted parasitics increased power consumption by about 30% for the SRAM used in the ACS unit and by about 20% for the SRAM used for the traceback.
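
The SRAM tables above can be cross-checked with a short calculation. The sketch below is illustrative only. It assumes a 2.5 V supply and a 15 ns operation period; neither value is stated for these simulations, but both are consistent with the tabulated current, power, and energy figures. The function names are hypothetical. For random read and write activity, the increase relative to the value without parasitics works out to roughly 39% for the smaller SRAM and 23% for the larger one; taken as a share of the power with parasitics, the figures are roughly 28% and 19%, which is closer to the rounded percentages quoted above.

```python
# Cross-check of the SRAM power tables (values copied from the tables above).
VDD = 2.5      # volts; ASSUMED supply voltage for the SRAM simulations
T_OP = 15e-9   # seconds per operation; ASSUMED, inferred from power vs. energy columns


def power_mw(current_ua, vdd=VDD):
    """Average power in mW from average supply current in uA."""
    return current_ua * 1e-6 * vdd * 1e3


def energy_pj(power_mw_value, t_op=T_OP):
    """Energy per operation in pJ from average power in mW."""
    return power_mw_value * 1e-3 * t_op * 1e12


def relative_increase(without_parasitics, with_parasitics):
    """Fractional increase relative to the value without parasitics."""
    return (with_parasitics - without_parasitics) / without_parasitics


def parasitic_share(without_parasitics, with_parasitics):
    """Fraction of the with-parasitics value attributable to parasitics."""
    return (with_parasitics - without_parasitics) / with_parasitics


if __name__ == "__main__":
    # Random read and write activity, average power in mW.
    for name, p_without, p_with in [("small SRAM", 1.53, 2.13),
                                    ("large SRAM", 5.22, 6.44)]:
        print(name,
              f"increase={relative_increase(p_without, p_with):.1%}",
              f"share={parasitic_share(p_without, p_with):.1%}")
```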

VI. Scaling

The implemented Viterbi decoder soft core achieves a much higher sample rate at a 2.5 V supply than is required, so voltage scaling is applied to trade the "extra" performance for lower power consumption. Since the standard cell library is characterized only at 2.5 V, all of the scaling is based on the two figures below, which show critical path delay vs. supply voltage and energy-delay product (EDP) vs. supply voltage. The figures are based on HSPICE simulations of a 16 bit ripple adder design (with parasitic capacitance) in the same 0.25 um technology. The voltage range considered is 0.8 V to 2.5 V.

Different supply voltages can be chosen depending on the design's figure of merit: for example, 2.5 V for highest performance, 0.8 V for lowest energy/power consumption, and 1.25 V for lowest energy-delay product. The table below lists the scaled performance of the implemented Viterbi decoder, where the clock rates are chosen to be approximately 1 / (critical path delay).
 
Performance Results for Variable Clock Frequency
                            Supply Voltage (V)   Clock Rate (MHz)   Symbol Rate (Msps)   Energy-Delay Product (fJ-s)   Power (mW)
Optimized for Performance   2.5                  60                 3.75                 3.60                          50.6
Optimized for Power         0.8                  7.46               0.47                 2.96                          0.64
Optimized for EDP           1.25                 25.1               1.57                 2.15                          5.29
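
The relationships between the columns of the table above can be sketched as follows. This is an illustration under stated assumptions, not the method used to produce the table: the factor of 16 clock cycles per symbol is inferred from the clock-rate and symbol-rate columns, and the energy-delay product is taken to be (energy per symbol) x (symbol period), which reproduces the tabulated values to within rounding. The function names are hypothetical.

```python
# Relationships between the columns of the variable-clock-frequency table.
CYCLES_PER_SYMBOL = 16  # ASSUMED: inferred from 60 MHz / 3.75 Msps


def symbol_rate_msps(clock_mhz, cycles_per_symbol=CYCLES_PER_SYMBOL):
    """Symbol rate in Msps for a given clock rate in MHz."""
    return clock_mhz / cycles_per_symbol


def edp_fj_s(power_mw, symbol_msps):
    """Energy-delay product in fJ-s, assuming per-symbol energy and delay."""
    symbols_per_second = symbol_msps * 1e6
    energy_j = power_mw * 1e-3 / symbols_per_second  # energy per symbol
    delay_s = 1.0 / symbols_per_second               # symbol period
    return energy_j * delay_s * 1e15


if __name__ == "__main__":
    # (supply voltage in V, clock rate in MHz, power in mW) from the table.
    for vdd, clock_mhz, power_mw in [(2.5, 60.0, 50.6),
                                     (0.8, 7.46, 0.64),
                                     (1.25, 25.1, 5.29)]:
        rate = symbol_rate_msps(clock_mhz)
        print(f"{vdd} V: {rate:.2f} Msps, EDP = {edp_fj_s(power_mw, rate):.2f} fJ-s")
```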

Note that even at the lowest supply voltage, 0.8 V, the Viterbi decoder still runs faster than the target symbol rate (100 ksps). Frequency scaling is therefore applied to obtain the power at the intended clock rate: the estimated total power consumption of the decoder is 0.14 mW at a throughput of 100 ksps.
 
Power Consumption after Fixing Clock Frequency
                            Supply Voltage (V)   Clock Rate (MHz)   Symbol Rate (Msps)   Power (mW)
Optimized for Performance   2.5                  1.6                0.1                  1.35
Optimized for Power         0.8                  1.6                0.1                  0.14
Optimized for EDP           1.25                 1.6                0.1                  0.33
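
The frequency scaling step can be sketched as a linear scaling of power with clock rate at a fixed supply voltage. This is a minimal sketch that assumes power is dominated by dynamic switching power and neglects leakage; the function name is hypothetical. Under that assumption the power figures from the variable-clock-frequency table scale to values that agree with the table above to within about 0.01 mW.

```python
# Linear frequency scaling of power at a fixed supply voltage.
# ASSUMPTION: power is proportional to clock rate (leakage neglected).
TARGET_CLOCK_MHZ = 1.6  # clock rate listed for the 100 ksps target


def scale_power_mw(power_mw, clock_mhz, target_clock_mhz=TARGET_CLOCK_MHZ):
    """Power in mW after lowering the clock from clock_mhz to target_clock_mhz."""
    return power_mw * target_clock_mhz / clock_mhz


if __name__ == "__main__":
    # (supply voltage in V, clock rate in MHz, power in mW) before scaling.
    for vdd, clock_mhz, power_mw in [(2.5, 60.0, 50.6),
                                     (0.8, 7.46, 0.64),
                                     (1.25, 25.1, 5.29)]:
        scaled = scale_power_mw(power_mw, clock_mhz)
        print(f"{vdd} V: {scaled:.3f} mW at {TARGET_CLOCK_MHZ} MHz")
```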

 


VII. Additional Data

HSPICE simulation of energy scaling with supply voltage for 16 bit ripple adder.