Niraj Shah
Scott Weber
EE290a Homework #3A

Description

We chose to implement a Viterbi decoder using Tensilica's Xtensa configurable microprocessor with extensions to the instruction set architecture (ISA). With Tensilica's tools, it is possible to compile C code for a particular target architecture using native C programming tools (gcc, gdb). It is also possible to generate an instruction set simulator for the target architecture. We use these software tools to evaluate the system. Once the design is finalized, a description of the hardware implementation can be produced.

To estimate the implementation characteristics of a Viterbi decoder, we generated a compiler and instruction set simulator for a microprocessor core with a first attempt at ISA extensions. The microprocessor core is a 32-bit RISC processor that does not have a multiplier. It has 16 general-purpose registers and DSP-like features (such as zero-overhead loops) [1].

Using the processor generator, compiler, and instruction set simulator, we arrived at estimates for a Viterbi decoder implementation on the processor described above. After analyzing the C code, we identified some possible ISA extensions. The first set of numbers below shows the results of our first attempt to implement the Viterbi decoder.

Parameters

  • uncoded word length (k) = 1
  • coded word length (n) = 2
  • constraint length (L) = 7
  • branch metric calculation is QPSK
  • soft decision wordlength (q) = 6
  • chain-backing depth (D) = 100
  • generator polynomials: g0 = 171, g1 = 133 (octal)
  • data rate target: 100 kb/s
  • goal: bit error rate (BER) = 10^-4
  • signal-to-noise ratio (SNR) degradation: 0.05 dB
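
As an illustration of the code parameters above (k = 1, n = 2, L = 7, g0 = 171, g1 = 133 octal), here is a minimal sketch of the matching convolutional encoder. The names `conv_encode_bit` and `parity8` are hypothetical and are not taken from the simulator code used in this report. The bit ordering (newest input bit in the most significant bit of the shift register) is an assumption; other implementations may align the generator taps the other way round.

```c
#include <assert.h>

#define CONSTRAINT_LEN 7      /* L = 7 */
#define G0 0171u              /* generator polynomial g0 (octal) */
#define G1 0133u              /* generator polynomial g1 (octal) */

/* Parity (XOR of all bits) of a value that fits in 8 bits. */
static unsigned parity8(unsigned x)
{
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Shift one input bit into the 7-bit register *sr and return the 2-bit
 * coded symbol as (g0 output << 1) | g1 output.
 * Assumption: the newest bit sits in bit 6 (the MSB) of the register. */
unsigned conv_encode_bit(unsigned *sr, unsigned bit)
{
    assert(bit <= 1u);
    *sr = ((*sr >> 1) | (bit << (CONSTRAINT_LEN - 1))) & 0x7Fu;
    return (parity8(*sr & G0) << 1) | parity8(*sr & G1);
}
```

Starting from `sr = 0` and feeding a single 1 followed by zeros reads out the taps of both generators one position at a time, which is a quick way to check the tap alignment against whatever convention the decoder assumes.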

    Performance

    To calculate performance, we simulated the Viterbi decoder using Rhett Davis' Viterbi simulator code and the customized Tensilica instruction set simulator. After simulating 1000 samples, we used gprof to profile the execution time spent in the Viterbi decoder. The profile showed that it takes 1947 cycles to decode a bit. This value is amortized, since traceback occurs every 100 cycles. Throughput in kb/s was computed as 1 / ((Cycles/Decoded Bit) * (Clock Period)).
     
                          227 MHz    100 MHz
    Cycles/Decoded Bit    1947       1947
    Clock Period (ns)     4.4        10
    kb/s                  117        51
    Viterbi Decoder Performance
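
The throughput figures in the table above follow from the time per decoded bit, which is the cycle count multiplied by the clock period. A minimal sketch of that arithmetic is given here; the function name `decoder_kbps` is hypothetical and is only meant to make the unit conversion explicit.

```c
#include <assert.h>

/* Decoder throughput in kb/s.
 * cycles_per_bit  : clock cycles needed per decoded bit
 * clock_period_ns : clock period in nanoseconds
 * Time per bit is cycles_per_bit * clock_period_ns nanoseconds, so
 * bits/s = 1e9 / ns_per_bit and kb/s = 1e6 / ns_per_bit. */
double decoder_kbps(double cycles_per_bit, double clock_period_ns)
{
    assert(cycles_per_bit > 0.0 && clock_period_ns > 0.0);
    double ns_per_bit = cycles_per_bit * clock_period_ns;
    return 1.0e6 / ns_per_bit;
}
```

For example, `decoder_kbps(1947, 4.4)` evaluates to roughly 117 and `decoder_kbps(1947, 10)` to roughly 51, matching the table.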
    By unrolling the add-compare-select (ACS) loop, implementing an ACS extension instruction, and packing the memory, a conservative speedup of 3 should be achievable. The projected figures are:
     
     
                          227 MHz    100 MHz
    Cycles/Decoded Bit    650        650
    Clock Period (ns)     4.4        10
    kb/s                  349        154
    Projected Viterbi Decoder Performance
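
For readers unfamiliar with the ACS step that the proposed extension instruction targets, here is a minimal sketch in plain C. It is not the code that was profiled, and all names (`acs`, `acs_update`, `acs_result`) are hypothetical. It assumes that smaller path metrics are better and that the 6-bit state holds the most recent input bit in its MSB, so the predecessors of state s are (2s mod 64) and (2s mod 64) + 1.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STATES 64  /* 2^(L-1) states for constraint length L = 7 */

/* Result of one add-compare-select step. */
typedef struct {
    uint32_t metric;    /* surviving path metric */
    uint8_t  decision;  /* 0 if predecessor 0 survived, 1 if predecessor 1 */
} acs_result;

/* Add the branch metrics to the two incoming path metrics, compare the
 * candidates, and select the smaller one (ties go to predecessor 0). */
acs_result acs(uint32_t pm0, uint32_t bm0, uint32_t pm1, uint32_t bm1)
{
    uint32_t cand0 = pm0 + bm0;
    uint32_t cand1 = pm1 + bm1;
    acs_result r;
    if (cand1 < cand0) {
        r.metric = cand1;
        r.decision = 1;
    } else {
        r.metric = cand0;
        r.decision = 0;
    }
    return r;
}

/* One trellis stage: run ACS for every state. bm[s][i] is the branch
 * metric for the transition from predecessor i into state s. */
void acs_update(const uint32_t old_pm[NUM_STATES],
                const uint32_t bm[NUM_STATES][2],
                uint32_t new_pm[NUM_STATES],
                uint8_t decision[NUM_STATES])
{
    assert(old_pm != new_pm);  /* the update is not done in place */
    for (unsigned s = 0; s < NUM_STATES; s++) {
        unsigned p0 = (s << 1) & (NUM_STATES - 1);  /* oldest bit was 0 */
        unsigned p1 = p0 | 1u;                      /* oldest bit was 1 */
        acs_result r = acs(old_pm[p0], bm[s][0], old_pm[p1], bm[s][1]);
        new_pm[s] = r.metric;
        decision[s] = r.decision;
    }
}
```

The loop body is the same for all 64 states, which is why unrolling it and folding the add, compare, and select into a single instruction is a natural candidate for an ISA extension.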

    Energy

    Energy was calculated as (Core Power) (J/s) * (Clock Period) (s/clock cycle) * (Cycles/Decoded Bit) (clock cycles/bit) = J/bit.
     
                          227 MHz    100 MHz
    Core Power (mW)       155        53
    Clock Period (ns)     4.4        10
    Cycles/Decoded Bit    1947       1947
    uJ/bit                1.3        1.0
    Viterbi Decoder Energy
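
The energy-per-bit product can be sketched as follows, using the units in which the table reports its inputs. The function name `energy_uj_per_bit` is hypothetical; the only point of the sketch is the unit conversion from mW and ns to uJ.

```c
#include <assert.h>

/* Energy per decoded bit in microjoules.
 * core_power_mw   : core power in milliwatts
 * clock_period_ns : clock period in nanoseconds
 * cycles_per_bit  : clock cycles per decoded bit
 * 1 mW * 1 ns = 1e-3 W * 1e-9 s = 1e-12 J = 1e-6 uJ. */
double energy_uj_per_bit(double core_power_mw,
                         double clock_period_ns,
                         double cycles_per_bit)
{
    assert(core_power_mw > 0.0 && clock_period_ns > 0.0 && cycles_per_bit > 0.0);
    return core_power_mw * clock_period_ns * cycles_per_bit * 1.0e-6;
}
```

For example, `energy_uj_per_bit(155, 4.4, 1947)` evaluates to roughly 1.3 and `energy_uj_per_bit(53, 10, 1947)` to roughly 1.0. Because the clock period appears in both the time per bit and the energy per bit, the lower-power 100 MHz configuration uses less energy per bit even though it decodes more slowly.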
    These are the energy numbers with the estimated extensions we are exploring.
     
                          227 MHz    100 MHz
    Core Power (mW)       155        53
    Clock Period (ns)     4.4        10
    Cycles/Decoded Bit    650        650
    uJ/bit                0.44       0.34
    Projected Viterbi Decoder Energy

    Cost

    The code size for the Viterbi decoder is small (< 1000 instructions), so we did not include code size here.

    An additional cost metric is die area.  Given the clock speed, the processor generator estimated the number of gates our design would take and estimated the die area for a given process. The table below shows the die area estimates for different clock speeds at 0.25um. These estimates include a 4K Dcache and a 4K Icache. The estimates for die area were taken from the NTRS [2].
     
                                227 MHz    100 MHz
    Gates (NAND2 equivalent)    40675      31835
    Die Area (mm2)              3.33       2.91
    Core Die Area

    We estimate the size of the hardware needed to implement the new instructions will be approximately 3000 gates, depending on design priorities. The following table shows the estimated gate count and die area for the ISA extended design.
     
                                227 MHz    100 MHz
    Gates (NAND2 equivalent)    43675      34835
    Die Area (mm2)              3.5        3.1
    ISA Extended Core Die Area

    Design effort

    Since we have a C model of the system we are designing, we expect the bulk of our time to be spent identifying and evaluating new instructions. We expect to implement a Viterbi decoder with ISA extensions in 100 man-hours.

    Summary

    The table below summarizes our estimates for different clock speeds assuming the extension speedups.
     
     
                                                      227 MHz    100 MHz
    Performance
      Viterbi Decoder Speed (kb/s)                    349        154
    Energy
      Viterbi Decoder Energy Dissipation (uJ/bit)     0.44       0.34
      Core Power (mW)                                 155        53
      Power Density (mW/mm2)                          44.2       17.1
    Cost
      Die Area (0.25um) (mm2)                         3.5        3.1
    Design Effort (man-hours)                         100        100
    Estimates Summary

    References

    [1] Xtensa Instruction Set Architecture Reference Manual
    [2] National Technology Roadmap for Semiconductors, 1997