Niraj Shah
Scott Weber
EE290a Homework #3A

Description

We chose to implement a Viterbi decoder using Tensilica's Xtensa configurable microprocessor with extensions to the instruction set architecture (ISA). With Tensilica's tools, it is possible to compile C code for a particular target architecture using native C programming tools (gcc, gdb). It is also possible to generate an instruction set simulator for the target architecture. These software tools are used to evaluate the system. Once the design is finalized, a description of the hardware implementation can be produced.

To estimate the implementation characteristics of a Viterbi decoder, we generated a compiler and instruction set simulator for a microprocessor core with a first cut of the ISA extensions. The microprocessor core is a 32-bit RISC processor without a multiplier. It has 16 general-purpose registers and DSP-like features (such as zero-overhead loops) [1].

Using the processor generator, compiler, and instruction set simulator, we arrived at estimates for a Viterbi decoder implementation on the processor described above. After analyzing the C code, we identified some possible ISA extensions. The first set of numbers in each section below shows the results of our first attempt to implement the Viterbi decoder.

Parameters

uncoded word length (k) = 1

coded word length (n) = 2

constraint length (L) = 7

branch metric calculation is QPSK

soft decision wordlength (q) = 6

chain-backing depth (D) = 100

generator polynomials: g0 = 171, g1 = 133 (octal)

data rate target: 100 kbs

goal: bit error rate (BER) = 10^-4

signal to noise ratio (SNR) degradation 0.05dB
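
The code parameters above (k = 1, n = 2, L = 7, g0 = 171, g1 = 133 octal) can be made concrete with a minimal sketch of the convolutional encoder whose output the decoder has to undo. The function names, the shift direction of the register, and the ordering of the two output bits are assumptions made for illustration; they are not taken from the simulator code used in this report.

```c
#include <assert.h>

#define G0 0171u  /* generator polynomial g0 (octal), 7 taps */
#define G1 0133u  /* generator polynomial g1 (octal), 7 taps */

/* XOR of all bits in x; valid for values up to 8 bits wide. */
static unsigned parity7(unsigned x)
{
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Shift one input bit into the 7-bit register *sr (newest bit in the LSB,
   an assumed convention) and return the 2-bit coded symbol: g0 output in
   bit 1, g1 output in bit 0. */
unsigned conv_encode_bit(unsigned *sr, unsigned bit)
{
    assert(bit <= 1u);
    *sr = ((*sr << 1) | bit) & 0x7Fu;
    return (parity7(*sr & G0) << 1) | parity7(*sr & G1);
}
```

Starting from an all-zero register, each call consumes one uncoded bit and produces one two-bit coded symbol, which matches the rate k/n = 1/2 listed above.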

Performance

To calculate performance, we simulated the Viterbi decoder using Rhett Davis' Viterbi simulator code and the customized Tensilica instruction set simulator. After simulating 1000 samples, gprof was used to profile the execution time of the Viterbi decoder. The profile showed that decoding one bit takes 1947 cycles. This value is amortized, since traceback occurs every 100 cycles. The data rate (Kbs) was computed as 1/((Cycles/Decoded Bit) * (Clock Period)).

**Viterbi Decoder Performance**
	227 MHz	100 MHz
Cycles/Decoded Bit	1947	1947
Clock Period (ns)	4.4	10
Kbs	117	51
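
The data-rate figures in the table follow directly from the cycle count and the clock period. As a small sanity check, the arithmetic can be sketched in C; the function name `throughput_kbps` is hypothetical and the code only restates the formula given above.

```c
#include <assert.h>

/* Data rate in kb/s from cycles per decoded bit and clock period in ns:
   bits/s = 1 / (cycles_per_bit * period_in_seconds). */
double throughput_kbps(double cycles_per_bit, double clock_period_ns)
{
    assert(cycles_per_bit > 0.0 && clock_period_ns > 0.0);
    double ns_per_bit = cycles_per_bit * clock_period_ns;
    /* 1e9 ns per second, divided by 1e3 bits per kilobit. */
    return 1.0e6 / ns_per_bit;
}
```

Calling `throughput_kbps(1947.0, 10.0)` gives about 51.4 kb/s and `throughput_kbps(1947.0, 4.4)` gives about 116.7 kb/s, in line with the rounded table entries.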

By unrolling the ACS loop, implementing an ACS (add-compare-select) extension instruction, and packing the memory, a conservative speedup of 3x should be achievable, which reduces the cost from 1947 to about 650 cycles per decoded bit.

	227 MHz	100 MHz
Cycles/Decoded Bit	650	650
Clock Period (ns)	4.4	10
Kbs	349	154

Projected Viterbi Decoder Performance
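
For reference, the ACS operation that the proposed extension instruction would accelerate can be sketched as follows. This is a generic add-compare-select step written in plain C, not the Xtensa extension itself; the type and function names are hypothetical, and tie-breaking toward the first predecessor is an assumption.

```c
#include <assert.h>
#include <limits.h>

/* Result of one add-compare-select step for a single trellis state. */
typedef struct {
    unsigned metric;    /* surviving path metric */
    unsigned decision;  /* 0 or 1: which predecessor survived */
} acs_result;

/* Add each predecessor's path metric (pm) to its branch metric (bm),
   compare the two candidates, and select the smaller one. */
acs_result acs_step(unsigned pm0, unsigned bm0, unsigned pm1, unsigned bm1)
{
    assert(pm0 <= UINT_MAX - bm0 && pm1 <= UINT_MAX - bm1);
    unsigned cand0 = pm0 + bm0;
    unsigned cand1 = pm1 + bm1;
    acs_result r;
    if (cand1 < cand0) {
        r.metric = cand1;
        r.decision = 1u;
    } else {
        r.metric = cand0;
        r.decision = 0u;
    }
    return r;
}
```

With constraint length L = 7 the trellis has 2^(L-1) = 64 states, so a step like this runs 64 times per decoded bit, which is why the ACS loop dominates the cycle count.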

Energy

Energy per decoded bit was calculated as (Core Power) (J/s) * (Clock Period) (s/clock cycle) * (Cycles/Decoded Bit) (clock cycles/bit) = J/bit.

**Viterbi Decoder Energy**
	227 MHz	100 MHz
Core Power (mW)	155	53
Clock Period (ns)	4.4	10
Cycles/Decoded Bit	1947	1947
uJ / bit	1.3	1.0
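
The unit conversion behind the energy figures is easy to get wrong, so here is a short sketch of the same calculation with the table's units (mW, ns, cycles per bit). The function name `energy_uj_per_bit` is hypothetical; the formula is the one stated above.

```c
#include <assert.h>

/* Energy per decoded bit in microjoules.
   mW * ns = 1e-3 W * 1e-9 s = 1e-12 J = 1e-6 uJ, hence the final scale factor. */
double energy_uj_per_bit(double core_power_mw, double clock_period_ns,
                         double cycles_per_bit)
{
    assert(core_power_mw >= 0.0 && clock_period_ns > 0.0 && cycles_per_bit > 0.0);
    return core_power_mw * clock_period_ns * cycles_per_bit * 1.0e-6;
}
```

For example, `energy_uj_per_bit(155.0, 4.4, 1947.0)` evaluates to roughly 1.33 uJ/bit and `energy_uj_per_bit(53.0, 10.0, 1947.0)` to roughly 1.03 uJ/bit, consistent with the rounded values in the table.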

These are the energy numbers with the estimated extensions we are exploring. Core power is taken to be the same as for the base core.

	227 MHz	100 MHz
Core Power (mW)	155	53
Clock Period (ns)	4.4	10
Cycles/Decoded Bit	650	650
uJ/bit	0.44	0.34

Projected Viterbi Decoder Energy

Cost

The code size for the Viterbi decoder is small (< 1000 instructions), so we did not include code size as a cost metric here.

An additional cost metric is die area. Given the clock speed, the processor generator estimated the number of gates our design would take and estimated the die area for a given process. The table below shows the die area estimates for different clock speeds at 0.25um. These estimates include a 4K Dcache and a 4K Icache. The estimates for die area were taken from the NTRS [2].

	227 MHz	100 MHz
Gates (NAND2 equivalent)	40675	31835
Die Area (mm²)	3.33	2.91

Core Die Area

We estimate the size of the hardware needed to implement the new instructions will be approximately 3000 gates, depending on design priorities. The following table shows the estimated gate count and die area for the ISA extended design.

	227 MHz	100 MHz
Gates (NAND2 equivalent)	43675	34835
Die Area (mm²)	3.5	3.1

ISA Extended Core Die Area

Design effort

Since we already have a C model of the system we are designing, we expect the bulk of our time to be spent identifying and evaluating new instructions. We expect to implement a Viterbi decoder with ISA extensions in 100 man-hours.

Summary

The table below summarizes our estimates for different clock speeds assuming the extension speedups.

**Estimates Summary**
	227 MHz	100 MHz
Performance
Viterbi Decoder Speed (Kbs)	349	154
Energy
Viterbi Decoder Energy Dissipation (uJ / bit)	0.44	0.34
Core Power (mW)	155	53
Power Density (mW / mm²)	44.2	17.1
Cost
Die Area (0.25um) (mm²)	3.5	3.1
Design Effort (man-hours)	100	100

References

[1] Xtensa Instruction Set Architecture Reference Manual
[2] National Technology Roadmap for Semiconductors, 1997