Niraj Shah
Scott Weber
EE290a Homework #3A
Description
We chose to implement a Viterbi decoder on Tensilica's Xtensa configurable
microprocessor, with extensions to the instruction set architecture (ISA).
With Tensilica's tools, it is possible to compile C code for a particular
target architecture using native C programming tools (gcc, gdb). It is
also possible to generate an instruction set simulator for the target architecture.
These software tools are used to evaluate the system. Once the design is
finalized, a description of the hardware implementation can be produced.
To estimate the implementation characteristics of a Viterbi decoder, we
generated a compiler and an instruction set simulator for a microprocessor
core with a first attempt at ISA extensions. The microprocessor core is a
32-bit RISC processor that does not have a multiplier. It has 16 general
purpose registers and DSP-like features (such as zero-overhead loops) [1].
Using the processor generator, compiler, and instruction set simulator,
we arrived at estimates for a Viterbi decoder implementation on the processor
described above. After analyzing the C code, we identified some possible
ISA extensions. The first set of numbers below shows the results of our first
attempt at implementing the Viterbi decoder.
Parameters
uncoded word length (k) = 1
coded word length (n) = 2
constraint length (L) = 7
branch metric calculation is QPSK
soft decision wordlength (q) = 6
chain-backing depth (D) = 100
generator polynomials: g0 = 171, g1 = 133 (octal)
data rate target: 100 kbs
goal: bit error rate (BER) = 10^-4
signal-to-noise ratio (SNR) degradation: 0.05 dB
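The code parameters above (k = 1, n = 2, L = 7, g0 = 171, g1 = 133 octal) describe a rate-1/2 convolutional code. As a minimal sketch of the encoder that such a decoder would undo, the Python below shifts each input bit into a 7-bit register and emits one parity bit per generator polynomial. The bit ordering (newest bit at the most significant position) and the names `encode` and `parity` are assumptions made for illustration; they are not taken from the simulator code used in this report.

```python
# Sketch of a rate-1/2, constraint-length-7 convolutional encoder.
# Assumption: the newest input bit enters at the most significant bit of the
# shift register. Other bit-ordering conventions exist.

K = 7          # constraint length (L in the parameter list)
G0 = 0o171     # generator polynomial g0 (octal)
G1 = 0o133     # generator polynomial g1 (octal)


def parity(x):
    """Return the XOR of all bits in x (1 if the number of set bits is odd)."""
    return bin(x).count("1") & 1


def encode(bits):
    """Encode a sequence of 0/1 input bits into a list of (c0, c1) pairs."""
    reg = 0
    out = []
    for b in bits:
        # Shift the register right and insert the new bit at the top.
        reg = (reg >> 1) | (b << (K - 1))
        out.append((parity(reg & G0), parity(reg & G1)))
    return out
```

Under this convention, feeding a single 1 followed by six 0s reproduces the binary digits of each generator polynomial on its output stream, which is a quick sanity check on the tap masks.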
Performance
To calculate performance, we simulated the Viterbi decoder using Rhett
Davis' Viterbi simulator code and the customized Tensilica instruction
set simulator. After simulating 1000 samples, we used gprof to profile
the execution time in the Viterbi decoder. The profile showed that decoding
one bit takes 1947 cycles. This value is amortized since traceback
occurs every 100 cycles. Throughput in bits per second is
1/((Cycles/Decoded Bit) * (Clock Period)), with the clock period in seconds;
dividing by 1000 gives Kbs.
                     | 227 MHz | 100 MHz |
Cycles/Decoded Bit   | 1947    | 1947    |
Clock Period (ns)    | 4.4     | 10      |
Kbs                  | 117     | 51      |
Viterbi Decoder Performance
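The throughput figures follow directly from the cycle count and the clock period. The short Python sketch below repeats that arithmetic; the function name `throughput_kbs` is a hypothetical helper introduced only for this illustration.

```python
def throughput_kbs(cycles_per_bit, clock_period_ns):
    """Decoded throughput in Kbs from cycles per bit and clock period in ns."""
    seconds_per_bit = cycles_per_bit * clock_period_ns * 1e-9
    bits_per_second = 1.0 / seconds_per_bit
    return bits_per_second / 1e3


# Measured cycle count at the two clock speeds considered in this report.
fast = throughput_kbs(1947, 4.4)   # 227 MHz core
slow = throughput_kbs(1947, 10)    # 100 MHz core
```

Rounded to the nearest integer, these give the 117 and 51 Kbs entries in the table.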
By unrolling the ACS (add-compare-select) loop, implementing an ACS extension
instruction, and packing the memory, a conservative speedup of 3x should be
achievable.
                     | 227 MHz | 100 MHz |
Cycles/Decoded Bit   | 650     | 650     |
Clock Period (ns)    | 4.4     | 10      |
Kbs                  | 349     | 154     |
Projected Viterbi Decoder Performance
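For readers unfamiliar with the operation an ACS extension instruction would accelerate, the following Python sketch shows one add-compare-select step over all 2^(L-1) = 64 trellis states. It is an illustration only, not the C code profiled above. It assumes that smaller metrics are better, that the newest bit occupies the top bit of the state, and that `branch_metric` is a hypothetical callable supplied by the caller.

```python
NUM_STATES = 1 << 6   # 2^(L-1) = 64 states for constraint length L = 7


def acs(pm0, bm0, pm1, bm1):
    """Add-compare-select for one state.

    Adds each branch metric to its predecessor's path metric, compares the
    two candidates, and selects the smaller. Returns (metric, decision),
    where decision is 0 or 1 for the surviving predecessor (ties pick 0).
    """
    m0 = pm0 + bm0
    m1 = pm1 + bm1
    return (m1, 1) if m1 < m0 else (m0, 0)


def acs_step(path_metrics, branch_metric):
    """Run ACS for every state at one trellis step.

    branch_metric(prev_state, next_state) is a hypothetical callable.
    Returns the new path metrics and the per-state decision bits that a
    later traceback would consume.
    """
    new_metrics = []
    decisions = []
    for nxt in range(NUM_STATES):
        # The two predecessors differ only in the bit shifted out.
        prev0 = (nxt << 1) & (NUM_STATES - 1)
        prev1 = prev0 | 1
        metric, decision = acs(
            path_metrics[prev0], branch_metric(prev0, nxt),
            path_metrics[prev1], branch_metric(prev1, nxt),
        )
        new_metrics.append(metric)
        decisions.append(decision)
    return new_metrics, decisions
```

Because this inner loop runs once per state for every decoded bit, it is a natural candidate for unrolling and for a dedicated instruction.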
Energy
Energy per bit was calculated as (Core Power) (J/s) * (Clock Period) (s/clock
cycle) * (Cycles/Decoded Bit) (clock cycles/bit), which yields J/bit.
                     | 227 MHz | 100 MHz |
Core Power (mW)      | 155     | 53      |
Clock Period (ns)    | 4.4     | 10      |
Cycles/Decoded Bit   | 1947    | 1947    |
uJ / bit             | 1.3     | 1.0     |
Viterbi Decoder Energy
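The energy calculation can be checked with a few lines of Python. This is only a sketch of the unit conversion described in this section; `energy_uj_per_bit` is a hypothetical helper name.

```python
def energy_uj_per_bit(core_power_mw, clock_period_ns, cycles_per_bit):
    """Energy per decoded bit in microjoules.

    power (J/s) * period (s/cycle) * cycles/bit gives J/bit, which is then
    scaled to uJ/bit.
    """
    joules_per_bit = (core_power_mw * 1e-3) * (clock_period_ns * 1e-9) * cycles_per_bit
    return joules_per_bit * 1e6


fast = energy_uj_per_bit(155, 4.4, 1947)   # 227 MHz core
slow = energy_uj_per_bit(53, 10, 1947)     # 100 MHz core
```

Rounded to one decimal place, these give 1.3 and 1.0 uJ/bit for the measured 1947 cycles per decoded bit.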
These are the energy numbers with the estimated extensions we are exploring.
                     | 227 MHz | 100 MHz |
Core Power (mW)      | 155     | 53      |
Clock Period (ns)    | 4.4     | 10      |
Cycles/Decoded Bit   | 650     | 650     |
uJ / bit             | 0.44    | 0.34    |
Projected Viterbi Decoder Energy
Cost
The code size for the Viterbi decoder is small (< 1000 instructions),
so we did not include code size as a cost metric here.
An additional cost metric is die area. Given the clock speed,
the processor generator estimated the number of gates our design would
take and estimated the die area for a given process. The table below shows
the die area estimates for different clock speeds at 0.25um.
These estimates include a 4K Dcache and a 4K Icache. The estimates for
die area were taken from the NTRS [2].
                          | 227 MHz | 100 MHz |
Gates (NAND2 equivalent)  | 40675   | 31835   |
Die Area (mm2)            | 3.33    | 2.91    |
Core Die Area
We estimate that the hardware needed to implement the new instructions
will be approximately 3000 gates, depending on design priorities. The following
table shows the estimated gate count and die area for the ISA extended
design.
                          | 227 MHz | 100 MHz |
Gates (NAND2 equivalent)  | 43675   | 34835   |
Die Area (mm2)            | 3.5     | 3.1     |
ISA Extended Core Die Area
Design effort
Since we have a C model of the system we are designing, we expect the bulk
of our time to be spent in identifying and evaluating new instructions.
We expect to implement a Viterbi decoder with ISA extensions in 100 man-hours.
Summary
The table below summarizes our estimates for different clock speeds assuming
the extension speedups.
                          |                                               | 227 MHz | 100 MHz |
Performance               | Viterbi Decoder Speed (Kbs)                   | 349     | 154     |
Energy                    | Viterbi Decoder Energy Dissipation (uJ / bit) | 0.44    | 0.34    |
                          | Core Power (mW)                               | 155     | 53      |
                          | Power Density (mW / mm2)                      | 44.2    | 17.1    |
Cost                      | Die Area (0.25um) (mm2)                       | 3.5     | 3.1     |
Design Effort (man-hours) |                                               | 100     | 100     |
Estimates Summary
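The power density row is the core power divided by the ISA extended die area. A minimal Python sketch of that division follows; `power_density` is a hypothetical helper name.

```python
def power_density(core_power_mw, die_area_mm2):
    """Power density in mW per square millimeter."""
    return core_power_mw / die_area_mm2


fast = power_density(155, 3.5)   # 227 MHz core, extended die area
slow = power_density(53, 3.1)    # 100 MHz core, extended die area
```

The results are about 44.3 and 17.1 mW / mm2, which agrees with the summary table to within rounding of the die area and power inputs.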
References
[1] Xtensa Instruction Set Architecture Reference Manual
[2] National Technology Roadmap for Semiconductors, 1997