EE290A Homework 3 Report
Rhett Davis, Ning Zhang, Chris Taylor, David Chinnery
I. Introduction to the Inventra Viterbi encoder/decoder soft core
Features:
- Flexible coding parameters: code rate, constraint length, traceback depth, code generating functions, and soft decision word length
- Two orders of magnitude of flexibility in speed and area: fully parallel for high speed, or parameterized resource sharing for area efficiency
- Robust algorithmic implementation: self-normalizing traceback and saturating arithmetic in the ACS computation (see the sketch after this list)
- RTL model optimized for high quality synthesis results
- Additional features (which can be disabled): channel bit error rate monitoring, a Viterbi encoder, and a synchronization monitor
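A minimal C sketch of the add-compare-select (ACS) step with saturating metrics, as the feature list describes. The 9-bit metric width matches the swidth parameter of Section II, but the state indexing and branch-metric values are illustrative assumptions, not the core's actual RTL:

    #include <stdio.h>

    /* Sketch of one add-compare-select (ACS) step with saturating arithmetic.
     * The accumulated metric width matches swidth = 9 in Section II; the
     * state indexing and branch metrics are illustrative assumptions only. */

    #define SWIDTH     9
    #define METRIC_MAX ((1u << SWIDTH) - 1)   /* largest 9-bit metric */

    /* Add a branch metric to a path metric, saturating instead of wrapping. */
    static unsigned sat_add(unsigned metric, unsigned branch)
    {
        unsigned sum = metric + branch;
        return sum > METRIC_MAX ? METRIC_MAX : sum;
    }

    /* One ACS operation: keep the smaller candidate path metric and record a
     * survivor bit for the traceback memory. */
    static unsigned acs(unsigned m0, unsigned b0,   /* candidate from predecessor 0 */
                        unsigned m1, unsigned b1,   /* candidate from predecessor 1 */
                        unsigned *survivor)
    {
        unsigned p0 = sat_add(m0, b0);
        unsigned p1 = sat_add(m1, b1);
        *survivor = p1 < p0;              /* 1 if predecessor 1 wins */
        return *survivor ? p1 : p0;
    }

    int main(void)
    {
        unsigned s, m = acs(500, 20, 100, 7, &s);   /* p0 saturates at 511; p1 = 107 wins */
        printf("survivor=%u metric=%u\n", s, m);    /* survivor=1 metric=107 */
        return 0;
    }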
Deliverables:
- Bit-true, cycle-based C and VHDL behavioral models
- Synthesizable and simulatable VHDL models, with simulation scripts for the Mentor VHDL simulator (the scripts did not work with the Synopsys VHDL simulator, however)
- Self-verifying test bench
- Synopsys synthesis scripts
- User's guide
II. Our implementation based on the Mentor Viterbi soft core
Parameters:
- code rate = 1/2
- constraint length (L) = 7
- code generating functions: g0 = 171, g1 = 133 (octal; see the encoder sketch after this list)
- soft decision word length (q) = 6
- traceback depth (D) = 48
- accumulated state metric word length (swidth) = 9
- number of parallel ACS units implemented in hardware = 4
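For concreteness, here is a small C sketch of the encoder these parameters define (rate 1/2, constraint length 7, generators 171 and 133 octal). The shift-register bit ordering is an assumption made for illustration; the core's own conventions may differ:

    #include <stdio.h>

    /* Sketch of the rate-1/2, constraint-length-7 convolutional encoder
     * defined by the parameters above, generators g0 = 171 and g1 = 133
     * (octal).  Bit ordering is an assumption, not the core's convention. */

    #define G0 0171   /* 1111001 binary */
    #define G1 0133   /* 1011011 binary */

    static unsigned parity(unsigned v)    /* XOR of all bits of v */
    {
        unsigned p = 0;
        for (; v; v >>= 1) p ^= v & 1;
        return p;
    }

    /* Encode one input bit; returns the coded pair (c0 in bit 1, c1 in bit 0).
     * 'state' holds the six previous input bits: 2^(7-1) = 64 trellis states. */
    static unsigned encode_bit(unsigned *state, unsigned in_bit)
    {
        unsigned reg = ((in_bit & 1) << 6) | *state;   /* 7 taps in total */
        unsigned out = (parity(reg & G0) << 1) | parity(reg & G1);
        *state = reg >> 1;                             /* shift register update */
        return out;
    }

    int main(void)
    {
        unsigned state = 0, in[4] = {1, 0, 1, 1};
        for (int i = 0; i < 4; i++) {
            unsigned cc = encode_bit(&state, in[i]);
            printf("in=%u  c0=%u c1=%u\n", in[i], cc >> 1, cc & 1);
        }
        return 0;
    }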
Functionality verification and bit error rate (BER) simulations:
As part of the Viterbi decoder soft core deliverables, three functionally equivalent models are provided: a synthesizable VHDL RTL model, a bit-true behavioral VHDL model, and a bit-true behavioral C model. The equivalence of the models for the particular decoder parameters of interest was verified through C and VHDL (ModelSim) simulations. The simulation times of the three models differ significantly: the behavioral VHDL simulation was about 10 times slower than the behavioral C simulation, and the VHDL RTL simulation was about 1,000 times slower.
The C behavioral model was used for the BER simulations. It is a true cycle-based implementation offering the highest simulation speed, and it implements the decoder in integer arithmetic with the same precision and rounding algorithms as the RTL model. The figure below shows the BER against signal-to-noise ratio (SNR) curves for both the "ideal" case (D = 100 and floating point, or swidth = 11 with modulo arithmetic in the ACS unit) and the Mentor core case (D = 48 and swidth = 9 with normalization). The SNR degradation is about 0.15 dB.

[Figure: BER vs. SNR curves for the ideal and Mentor core decoder configurations]
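To make the methodology concrete, the following is a hedged C sketch of a BER-vs-SNR measurement loop of the kind described; it is not the deliverable bit-true model. The BPSK/AWGN channel, the q = 6 quantizer, and the viterbi_decode() interface are assumptions, decoder latency (traceback depth D) is ignored for brevity, and encode_bit() refers to the encoder sketch in Section II:

    #include <math.h>
    #include <stdlib.h>

    /* Sketch of a BER-vs-SNR measurement loop like the one behind the curves
     * above -- not the deliverable bit-true C model.  The BPSK/AWGN channel,
     * the q = 6 soft quantizer, and the decoder interface are assumptions. */

    #define Q_MAX 63                        /* q = 6 soft decision bits: 0..63 */

    extern unsigned encode_bit(unsigned *state, unsigned in_bit);   /* Sec. II sketch */
    extern unsigned viterbi_decode(unsigned soft0, unsigned soft1); /* hypothetical */

    static double gauss(void)               /* crude N(0,1) sample, demo only */
    {
        double s = 0.0;
        for (int i = 0; i < 12; i++) s += (double)rand() / RAND_MAX;
        return s - 6.0;
    }

    static unsigned quantize(double r)      /* map the +/-1 range onto 0..63 */
    {
        long q = lround((r + 1.0) * (Q_MAX / 2.0));
        return (unsigned)(q < 0 ? 0 : q > Q_MAX ? Q_MAX : q);
    }

    double measure_ber(double ebno_db, long nbits)
    {
        /* For rate R = 1/2 BPSK, noise sigma^2 = 1 / (2 * R * Eb/N0). */
        double sigma = sqrt(1.0 / pow(10.0, ebno_db / 10.0));
        unsigned state = 0;
        long errors = 0;

        for (long i = 0; i < nbits; i++) {
            unsigned bit = (unsigned)(rand() & 1);
            unsigned cc  = encode_bit(&state, bit);
            /* BPSK: code bit 0 -> +1, 1 -> -1; add noise; quantize to 6 bits. */
            unsigned s0 = quantize(((cc >> 1) ? -1.0 : 1.0) + sigma * gauss());
            unsigned s1 = quantize(((cc & 1)  ? -1.0 : 1.0) + sigma * gauss());
            errors += viterbi_decode(s0, s1) != bit;   /* latency D ignored */
        }
        return (double)errors / (double)nbits;
    }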
The synthesized netlist of the Viterbi decoder was simulated, and the gate-level simulation results were compared with those from behavioral simulation to verify the functionality of the synthesized hardware (i.e., that no synthesis-related errors were introduced and that the timing constraints are met under the target conditions). After place and route, a transistor-level netlist with parasitic capacitances can be extracted, and final verification could be done using the Epic tools.
Technology:
- 0.25 um ST Microelectronics process with six metal layers and a good standard cell library.
Design flow:
- Synthesis was performed assuming a cycle time of 15 us (effectively unconstrained; a 66.7 kHz target clock).
Performance:
- A symbol rate of 100 ksps was achieved at a clock speed of 1.6 MHz.
- With the clock speed reduced to 1.6 MHz and the supply voltage scaled down to 0.8 V, the power was 0.14 mW (0.03 mW/mm2).
- The SNR degradation was 0.15 dB with the test data provided.
Cost:
- The final chip area was 4.00 mm2.
- The Inventra Viterbi soft core costs $100,000 from Mentor.
Productivity and time to market:
- 30 designer days total: 5 days of simulations, 10 days for the soft core design flow, and 15 days for software familiarization and debugging. It is expected that significantly more time would be required to integrate the soft core with other components.
- Portability: the Viterbi soft core is portable, provided SRAM generators are available for the target technology.
- Predictability: according to information on the Mentor web site, the Viterbi core takes roughly 19,700 gates and can be clocked at up to 75 MHz in a 0.35 um technology, for a sample rate of 4.69 Msps. In comparison, this implementation used 5,595 gates (about 35,000 gates including the SRAMs) and had a maximum clock speed of 60 MHz, for a sample rate of 3.75 Msps (with 64 trellis states and 4 parallel ACS units, each symbol takes 16 clock cycles, so the sample rate is the clock rate divided by 16). The lower clock speed is expected, as there was no significant timing constraint, allowing the design to be optimized for area and power (a 66.7 kHz clock speed was targeted in synthesis).
Comparison with original estimates:

|             | Homework 3 Estimation | This Project     | Discrepancy |
| Performance | 58 Msps               | 3.75 Msps        | The proposed architecture was fully parallel, offering roughly a factor of 16 increase in speed. |
| Power       | 15 uW                 | 50.6 mW          | The proposed circuit ran at 1.0 V instead of 2.5 V. |
| Area        | 2.5 mm2               | 4.0 mm2          | Much of the area went into floorplanning for the 16 ACS unit SRAMs, which we did not expect. |
| Design Time | 120 designer days     | 30 designer days | The soft core from Mentor sped up the design process considerably. |
Micro-architectural summary and breakdown of gates/area/power:

[Figure: micro-architectural block diagram with the gate, area, and power breakdown]
III. Synopsys and Epic Power and Timing Analysis Results
Summary of back annotation methods:

Parasitic annotation:
- Capacitance per net was reported from the router (Silicon Ensemble) using the "report simcap" command.
- A Design Compiler capacitance annotation script was generated with a Perl script (capfilter.pl); a sketch of the transformation it performs follows below.
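Purely to illustrate the transformation capfilter.pl performs, here is a C sketch that turns hypothetical "net capacitance" pairs into Design Compiler set_load annotations. The actual script is Perl, its input format is not shown in this report, and the exact annotation command syntax is simplified, so everything here beyond the set_load command name is an assumption:

    #include <stdio.h>

    /* Illustration of what capfilter.pl does: convert the router's per-net
     * capacitance report into a Design Compiler annotation script of
     * set_load commands.  The "<net> <capacitance in pF>" input format
     * assumed here is hypothetical. */

    int main(void)
    {
        char   net[256];
        double cap_pf;

        while (scanf("%255s %lf", net, &cap_pf) == 2) {
            /* set_load is the dc_shell command for annotating capacitance. */
            printf("set_load %.4f %s\n", cap_pf, net);
        }
        return 0;
    }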
Switching activity annotation:
- Switching activity was measured from a 16,000 cycle (1,000 symbol) gate-level VHDL simulation.
- The "-saiffile tsviterbi.saif" option was used with vhdlsim to generate a switching activity (SAIF) file.
- The SAIF file was filtered with a Perl script (saiffilter.pl).
- The filtered SAIF file was read into Design Compiler with the command "read_saif -i tsviterbi.saif -unit ps".

Notes:
- A special variable must be set for this to work (find_ignore_case = true).
- Annotations are not hierarchical; if a block lower in the hierarchy is to be analyzed, it must be re-annotated.
Summary of timing analysis:
- The critical path (not including the SRAMs) was found with the report_timing command, assuming 2.5 V operation:
  - before parasitic annotation: 8.65 ns
  - after parasitic annotation: 16.7 ns
  - The maximum operating frequency is therefore 60 MHz.
  - As expected, the critical path is through an ACS unit.
- The longest path including an SRAM was found by probing:
  - Command used: report_timing -from tsviterbi_decode/acs_unit_2/ramc/dout[0] -to tsviterbi_decode/acs_unit_1/ramc/din[8]
  - Path delay: 11.3 ns (after parasitic annotation)
  - After adding in the SRAM delay of 1.8 ns and a setup time of 1 ns, the path delay is 13.1 ns, meaning that the SRAM is not in the critical path.
Summary of final power analysis:
- Power numbers for the gate-level design were found using the report_power command.
- 2.5 V operation and a 60 MHz clock were assumed.

Summary of Power Compiler results (decoder gates only):

| Power   | Without annotation | With annotated switching activity | With annotated capacitances and switching activity |
| cell    | 28 mW              | 20 mW                             | 20 mW  |
| net     | 15 mW              | 6 mW                              | 9 mW   |
| total   | 43 mW              | 26 mW                             | 29 mW  |
| leakage | 750 nW             | 810 nW                            | 810 nW |
Summary of memory module power:

| Memory module                       | Element count | Energy per operation | Measured switching activity | Frequency | Total power |
| 9x8 ACS memory (small SRAM)         | 16            | 31.9 pJ              | 0.5                         | 60 MHz    | 7.9 mW      |
| 64x16 traceback memory (large SRAM) | 3             | 96.5 pJ              | 0.94                        | 60 MHz    | 18 mW       |
Final power numbers:
- report_power was used on each sub-block of the design (without any back annotation) to obtain the power percentages shown in Part I.
- It was found that 11% of the power was spent in blocks not relevant to this project (BER monitor, scrambler, etc.).
- Final power = (29 mW x 89%) + 7.9 mW + 18 mW = 50.6 mW
EPIC Powermill decoder simulation results:
- The transistor-level netlist was simulated without interconnect capacitances.
- The results are questionable, since proper operation was never observed (due to problems creating a correct netlist).
- Simulating 100 symbols with 183,061 CMOS elements at the transistor level took 20 hours; this simulation gave a power of 19.7 mW.
- Note that this is very close to the 20 mW of cell power predicted by Design Compiler and Power Compiler.
IV. Place and Route
The place and route of the Viterbi SRAM macro cells and standard cells
was done in Cadence. Silicon Ensemble was used for routing. Parasitics
for timing simulation were extracted from the final placed and routed nets
in Silicon Ensemble.
There were significant problems in attempting to route the chip without
routing violations. The smallest number of routing violations observed
was 6, despite trying a large chip area of 12 mm2.
The routing congestion appears to be worst at the 16 by 64 bit SRAM outputs; an SRAM cell design with its pins spread over a wider length might solve the problem. From previous experience, Rhett has found problems in Silicon Ensemble when routing pins are not spaced at least 2 um apart, as wires are restricted to being 1 um apart and vias to higher metal layers are quite large.
Different chip areas were tried, along with different placements (width x height in um):
- chip area 2.5 mm2, 1575x1560: about 4,700 violations
- chip area 3.2 mm2, 1803x1790: 57 violations
- chip area 3.6 mm2, 1850x1950: 57 violations; 64 violations with the external pin locations changed
- chip area 4.0 mm2, 1850x2150: 64 violations; 77 violations with one 64 bit SRAM macro cell above the other two (significant routing congestion between all three) - in both placements the 64 bit SRAMs were vertical
- chip area 4.5 mm2, 1850x2450: 78 violations initially, reduced to 15 violations, then 9 violations with better placement
- final placement, chip area 4.0 mm2, 1250x3200: 9 violations, with the 64 bit SRAMs placed sideways next to each other
The estimated ASIC area was 2.5 mm2, compared with 4.0 mm2 in the final implementation. Since this implementation was of a soft core and had routing problems, a somewhat larger area is to be expected. Solving the routing problems by spacing the pins more widely, or by using IC Craftsman, might have given a chip area as low as 3.2 mm2.
[Figure: the final placed and routed chip]
Area statistics:
- Area of the 16 8x9 bit SRAM macro cells: 0.052 mm2 (290x180 um) each. This was 62% larger than required, because 16x8 bit SRAMs were used (the SRAM generator output had only been verified for powers of 2).
- Area of the 3 16x64 bit SRAM macro cells: 0.25 mm2 (420x590 um) each.
- Total area of standard cells: 1.02 mm2 (the initial estimate from the DEF file was 0.35 mm2).
- Total area of macro cells: 1.58 mm2, of which 0.52 mm2 is due to using 16x8 bit SRAMs rather than 8x9 bit SRAMs. This compares with the estimated ASIC SRAM area of 1.08 mm2 for three 16 by 64 bit SRAMs.
Gate count statistics:
- Viterbi decoder gate count: 5,470
- small SRAM equivalent gate count: 16,296 (16 elements, 4,074 transistors each)
- large SRAM equivalent gate count: 13,205 (3 elements, 17,606 transistors each)
- total gate count: 35,096
Placement and routing wiring statistics:
- There were about 6,200 nets.
- There were 46,114 vias.
- Ground and power rails were spaced alternately 100 um apart, horizontally and vertically.
- There were six metal layers, with layers 5 and 6 used for power and ground respectively.
Total wire lengths:
- metal layer 1: 3,293 um
- metal layer 2: 458,440 um
- metal layer 3: 510,517 um
- metal layer 4: 218,023 um
- metal layer 5: 96,882 um regular, and 38,400 um power
- metal layer 6: 8,660 um regular, and 37,500 um ground
- overall: 685 mm horizontal, 611 mm vertical, 1,296 mm total
V. SRAM Simulations
Assumptions:
- clock speed: 66 MHz
- 1,000 iterations of activity (read, write, or both)
- supply voltage: 2.5 V
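As a sanity check on the energy-per-operation columns in the tables that follow, the average energy is just average current x supply voltage x clock period. A minimal C sketch, assuming a 15 ns period (the ~66 MHz clock above) and using the small SRAM all-read current as input:

    #include <stdio.h>

    /* Cross-check of the energy-per-operation figures below:
     * energy = average current x supply voltage x clock period.
     * The 15 ns period (~66 MHz) is an assumption inferred from the
     * simulation setup; the current is the small-SRAM all-read case. */

    int main(void)
    {
        const double i_avg = 664e-6;   /* average current, A */
        const double vdd   = 2.5;      /* supply voltage, V */
        const double t_clk = 15e-9;    /* clock period, s (~66 MHz) */

        double power  = i_avg * vdd;   /* 1.66 mW */
        double energy = power * t_clk; /* ~24.9 pJ per operation */

        printf("power  = %.2f mW\n", power * 1e3);
        printf("energy = %.1f pJ/op\n", energy * 1e12);
        return 0;
    }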
Smaller ACS unit SRAM (16 by 8 bits) power simulation, without parasitics. There are 16 of these on the chip.

| Operations                     | Average Current (uA) | Average Power (mW) | Average Energy per Operation (pJ) |
| all read activity              | 664                  | 1.66               | 24.9 |
| all write activity             | 563                  | 1.41               | 21.1 |
| random read and write activity | 612                  | 1.53               | 23.0 |

Timing characteristics:
- setup time: 1 ns
- hold time: 2 ns
- read access time after the rising clock edge: 1.8 ns
Larger traceback SRAM (16 by 64 bits) power simulation, without parasitics. There are three of these on the chip.

| Operations                     | Average Current (uA) | Average Power (mW) | Average Energy per Operation (pJ) |
| all read activity              | 2170                 | 5.43               | 81.4 |
| all write activity             | 1890                 | 4.73               | 71.0 |
| random read and write activity | 2090                 | 5.22               | 78.3 |

Timing characteristics:
- setup time: 1 ns
- hold time: 2 ns
- read access time after the rising clock edge: 5 ns
Smaller ACS unit SRAM (16 by 8 bits) power simulation, with parasitics.

| Operations                     | Average Current (uA) | Average Power (mW) | Average Energy per Operation (pJ) |
| all read activity              | 950                  | 2.37               | 35.6 |
| all write activity             | 773                  | 1.93               | 29.0 |
| random read and write activity | 851                  | 2.13               | 31.9 |

Timing characteristics:
- setup time: 1 ns
- hold time: 2 ns
- read access time after the rising clock edge: 1.9 ns
Larger traceback SRAM (16 by 64 bits) power simulation, with parasitics. There are three of these on the chip.

| Operations                     | Average Current (uA) | Average Power (mW) | Average Energy per Operation (pJ) |
| all read activity              | 2480                 | 6.21               | 93.2  |
| all write activity             | 2680                 | 6.69               | 100.0 |
| random read and write activity | 2570                 | 6.44               | 96.5  |

Timing characteristics:
- setup time: 1 ns
- hold time: 8 ns
- read access time after the rising clock edge: 5 ns
Including the extracted parasitics increased the simulated power consumption by about 30% for the SRAM used in the ACS units and by about 20% for the SRAM used for the traceback.
VI. Scaling
To take advantage of the fact that the implemented Viterbi decoder soft core can achieve a much higher sample rate at a 2.5 V supply than required, voltage scaling is applied to trade the "extra" performance for lower power consumption. Since the standard cell library is characterized only at 2.5 V, all of the scaling is based on the two figures below: critical path delay vs. supply voltage, and energy-delay product (EDP) vs. supply voltage. The figures are based on results from HSPICE simulations of a 16 bit ripple adder design (with parasitic capacitances) in the same 0.25 um technology. The voltage range under consideration is 0.8 V to 2.5 V.
A different supply voltage could be chosen depending on the figure of merit for the design: for example, 2.5 V for highest performance, 0.8 V for lowest energy/power consumption, and 1.25 V for lowest energy-delay product. The table below lists the scaled performance of the implemented Viterbi decoder, where the clock rates are chosen to be approximately 1 / (critical path delay). (The EDP column is the energy per decoded symbol multiplied by the symbol period, i.e., the power divided by the square of the symbol rate.)
Performance Results for Variable Clock Frequency

|                           | Supply Voltage (V) | Clock Rate (MHz) | Symbol Rate (Msps) | Energy-Delay Product (fJ-s) | Power (mW) |
| Optimized for Performance | 2.5                | 60               | 3.75               | 3.60                        | 50.6       |
| Optimized for Power       | 0.8                | 7.46             | 0.47               | 2.96                        | 0.64       |
| Optimized for EDP         | 1.25               | 25.1             | 1.57               | 2.15                        | 5.29       |
Notice that even at the lowest supply voltage, 0.8 V, the Viterbi decoder can still run faster than the targeted operation (100 ksps). Frequency scaling is therefore applied to obtain the power at the intended clock rate: the estimated total power consumption of the decoder is 0.14 mW at the 100 ksps throughput.
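The fixed-frequency powers in the table below are consistent with the first-order dynamic power relation P = P0 (V/V0)^2 (f/f0) applied to the measured 50.6 mW at 2.5 V and 60 MHz. A minimal sketch of that calculation follows; the quadratic voltage dependence is the standard CV^2 f approximation, assumed here rather than extracted from the HSPICE data, and it matches the table to within rounding:

    #include <stdio.h>

    /* First-order dynamic power scaling: P = P0 * (V/V0)^2 * (f/f0),
     * applied to the measured 50.6 mW operating point at 2.5 V, 60 MHz.
     * The quadratic voltage dependence is the standard CV^2 f
     * approximation, an assumption rather than a characterized value. */

    static double scale_power(double v, double f_mhz)
    {
        const double p0 = 50.6, v0 = 2.5, f0 = 60.0;   /* measured point */
        return p0 * (v / v0) * (v / v0) * (f_mhz / f0);
    }

    int main(void)
    {
        /* 100 ksps needs a 1.6 MHz clock (16 cycles per symbol). */
        printf("2.5 V : %.2f mW\n", scale_power(2.5, 1.6));    /* ~1.35 mW */
        printf("1.25 V: %.2f mW\n", scale_power(1.25, 1.6));   /* ~0.33 mW */
        printf("0.8 V : %.2f mW\n", scale_power(0.8, 1.6));    /* ~0.14 mW */
        return 0;
    }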
Power Consumption after Fixing Clock Frequency

|                           | Supply Voltage (V) | Clock Rate (MHz) | Symbol Rate (Msps) | Power (mW) |
| Optimized for Performance | 2.5                | 1.6              | 0.1                | 1.35       |
| Optimized for Power       | 0.8                | 1.6              | 0.1                | 0.14       |
| Optimized for EDP         | 1.25               | 1.6              | 0.1                | 0.33       |
VII. Additional Data
[Figure: HSPICE simulation of energy scaling with supply voltage for the 16 bit ripple adder]