Viterbi Decoder Implementation on a TI TMS320C54xx DSP

Viterbi Decoder Implementation on a TI TMS320VC5402 DSP

Final Project, EE 290A

May 4, 1999

Approach

We will investigate the costs and benefits of implementing a Viterbi Decoder using a TI TMS320C54x DSP chip. In particular we plan to estimate the impact that a variety of design parameters have on the following objective functions (as stated in class):

Performance (bits/sec)
Power (mW/bit)
Cost ($/unit,area)
Design effort (engineer-months)

C54x Description

The TI TMS320C54x is a new, low-power line of DSPs unveiled by TI in June 1998. These chips use 16-bit fixed-point words and can run at as low as 0.45 mW/MIPS when built in 0.25 micron technology.

For our implementation, we choose to use the VC5402, which has the following features:

Operates at 100 MIPS with a core voltage of 1.8V
I/O pins operate at 3.3V
16K word x 16 bits of dual-access RAM
4K word x 16 bits of ROM
Internal DMA
Fabricated in 0.18 micron technology (results are later scaled up to a 0.25 micron implementation for comparison purposes)

Hardware capabilities of the entire C54x series include:

Three 16-bit data buses and one 16-bit program memory bus
40-bit ALU with a 40-bit barrel shifter and two independent 40-bit accumulators
A single cycle non-pipelined MAC
Single-instruction repeat and block-repeat operations for program code
Block-memory-move instructions for better program and data management
Arithmetic instructions with parallel store and parallel load
Up to 168K single access RAM
Up to 32K dual access RAM
Up to 8M word external memory access
Six channel DMA controller

Viterbi Implementation

Specifications for our design include:

uncoded word length = 1
coded word length (n) = 2
constraint length (K, a.k.a. L) = 7
branch metric calculation is QPSK
this means 4 transitions for each state and 64x4=256 total transitions
soft decision wordlength (q) = 6
chain-backing depth (D) = 96
generator polynomials: p0 = 171, p1= 133 (octal)
data rate = 100 Kbits/sec
goal: bit error rate (BER) = 10^-4
signal to noise ratio (SNR) degradation 0.05dB
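
As a concrete reading of the specification above, here is a minimal Python sketch of the matching rate-1/2, K = 7 encoder. It is not part of the original project, which was written in C54x assembly. The function names and the bit-ordering convention (newest input bit in the most significant position of the 7-bit window) are assumptions made for illustration.

```python
K = 7        # constraint length
G0 = 0o171   # generator polynomial p0 (octal)
G1 = 0o133   # generator polynomial p1 (octal)


def parity(x):
    """Return the XOR of all bits of x (0 or 1)."""
    return bin(x).count("1") & 1


def conv_encode(bits):
    """Encode a list of 0/1 input bits; returns one (c0, c1) pair per input bit.

    Assumed convention: the 7-bit window holds the newest bit in its MSB,
    and the 6-bit encoder state is that window shifted right by one.
    """
    state = 0
    out = []
    for b in bits:
        window = (b << (K - 1)) | state
        out.append((parity(window & G0), parity(window & G1)))
        state = window >> 1
    return out
```

Feeding a single 1 followed by zeros reads the two generator polynomials back out bit by bit, which is a quick sanity check on the convention chosen.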

Viterbi Decoder Software Implementation

Texas Instruments provides a reference document describing how to implement a Viterbi Decoder on a C54x DSP. This document can be downloaded at:

http://www-s.ti.com/sc/psheets/spra071/spra071.pdf

The C54x is highly optimized for Viterbi decoding due to the following features of its ALU:

A single-cycle compare, select, and store unit (CSSU) compares the two candidate path metrics, records the larger value, and stores the appropriate decision bit, all in one cycle
Dual accumulators allow dual add/subtract operations to occur in one clock cycle
Address pointer registers can be incremented/decremented in a circular buffer as part of these instructions
Together, these features allow a butterfly to complete in 5 cycles: 1 to load the local distance, 2 for the dual add/subtract, and 2 for the CSSU
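
The butterfly described above can be sketched in Python as follows. This is only an illustration of the add-compare-select arithmetic, not the assembly used in the project. The function name, the state numbering (old states 2j and 2j+1 feeding new states j and j + N/2), and the use of +dist and -dist as the two branch metrics are assumptions in the spirit of the TI application report.

```python
def acs_butterfly(old_pm, j, dist, n_states=64):
    """Perform one add-compare-select butterfly.

    old_pm   : list of path metrics for the previous trellis stage
    j        : butterfly index, 0 <= j < n_states // 2
    dist     : local distance (branch metric) for this butterfly
    Returns ((metric, decision), (metric, decision)) for new states
    j and j + n_states // 2. decision is 1 when old state 2j+1 survives.
    """
    a = old_pm[2 * j]       # path metric of old state 2j
    b = old_pm[2 * j + 1]   # path metric of old state 2j+1

    # Dual add/subtract: four candidate metrics from two old metrics.
    top_from_a, top_from_b = a + dist, b - dist
    bot_from_a, bot_from_b = a - dist, b + dist

    # Compare-select: keep the larger candidate and note which one won.
    top = (max(top_from_a, top_from_b), int(top_from_b > top_from_a))
    bot = (max(bot_from_a, bot_from_b), int(bot_from_b > bot_from_a))
    return top, bot
```

With 64 states there are 32 such butterflies per decoded bit, which is where the 2^(K-2) x 5 term in TI's cycle estimate comes from.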

I created an assembly program using code snippets from the TI Viterbi Decoder Application Report, which showed how to best take advantage of these features. The code can be downloaded at viterbi.asm, and the listing file memory mapped to the VC5402 architecture can be found at viterbi.lst.

The only deviation from the specifications listed above is that a length-128 traceback was used, with only the final 64 bits being recorded. A multiple of 16 was chosen to simplify the traceback code; the longer depth does not degrade the performance of the algorithm (in fact, it enhances it). Since 16-bit arithmetic was used on 6-bit input words, there was no quantization loss in the algorithm (beyond input quantization noise), so it easily met the SNR and BER requirements.
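
To make the windowed traceback concrete, here is a small Python sketch. It is an illustration under stated assumptions rather than a transcription of viterbi.asm: the function name and the layout of the decision table are hypothetical, the state numbering assumes the newest input bit sits in the MSB of the state, and "final 64 bits" is interpreted as the 64 oldest bits, which the traceback reaches last.

```python
def traceback(decisions, end_state, keep=64, n_states=64):
    """Trace back through stored decision bits and keep the oldest bits.

    decisions : list of rows, oldest stage first; decisions[t][s] is 1 if
                new state s at stage t was reached from old state 2j+1,
                and 0 if from old state 2j, where j = s % (n_states // 2)
    end_state : state from which the traceback starts
    keep      : number of oldest decoded bits to return
    """
    half = n_states // 2
    state = end_state
    bits = []
    for row in reversed(decisions):
        bits.append(state // half)               # MSB of state is the decoded bit
        state = 2 * (state % half) + row[state]  # step to the surviving predecessor
    bits.reverse()                               # put the oldest bit first
    return bits[:keep]
```

With 128 rows in decisions and keep=64, the 64 most recent bits (whose survivor paths may not yet have merged) are discarded and only the older 64 are returned.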

The input I and Q data was assumed to have been put into a particular memory location by one of the on-chip DMA channels, and the output data words are assumed to have been moved to their final destination by another on-chip DMA channel. Alternatively, since the Viterbi decoding algorithm does not require all of the processing power of the chip, the input data might have been placed in its memory location after the TI chip was used to run some filtering or other DSP algorithm on the data points.

Analysis

The code mentioned above was simulated on a cycle-accurate TI simulator, which included a memory map specific to this processor. To decode 64 bits, the decoding algorithm took, on average, 13780 cycles (13714 cycles if the path metrics did not need to be decremented, 13914 cycles if they did). At this rate, decoding 100 Kbits/second requires about 21.5 million cycles per second, which can be handled if the chip runs at 22 MHz. Alternatively, the chip could handle a maximum of 464.7 Kbits/second at 100 MHz. Since we are scaling the technology up from 0.18 to 0.25 micron, it is unlikely that the 1.8V core could be clocked at 100 MHz, but 22 MHz should be easily attainable.
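
The arithmetic behind these figures is simple enough to check with a few lines of Python. The constant and function names below are mine, and the sketch assumes one instruction cycle per clock cycle. It gives roughly 464 Kbits/second at 100 MHz, in line with the figure quoted above.

```python
CYCLES_PER_BLOCK = 13780   # average simulated cycles per decoded block
BITS_PER_BLOCK = 64        # bits recorded per block


def cycles_per_bit():
    """Average cycles spent per decoded bit."""
    return CYCLES_PER_BLOCK / BITS_PER_BLOCK


def required_clock_hz(bit_rate):
    """Clock rate needed to sustain bit_rate (bits/second)."""
    return bit_rate * cycles_per_bit()


def max_bit_rate(clock_hz):
    """Highest sustainable bit rate (bits/second) at clock_hz."""
    return clock_hz / cycles_per_bit()
```

For example, required_clock_hz(100e3) comes to about 21.5 MHz, which is why a 22 MHz clock is sufficient.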

As seen below, the code used only about a fourth of the 4K x 16 ROM, and the data storage necessary was a little over 1/16 of the available 16K x 16 dual access RAM.

Code Size:	1032 (16 bit) Program Words
Data Storage:	1280 (16 bit) Data Words
Cycles Required (per 64 bits, average):	13780
Maximum Speed (100 MIPS):	464.7 Kbits/sec

The results closely follow the estimates from TI. From the TI Application Report, we have the following predictions for required operations/frame:
(FS = frame size, FR = frame rate)

Metric update: Cycles/frame = (#States/2 butterflies × butterfly calculation + TRN store + local dist. calculation) × # bits
    $= (2^{K-2} \times 5 + 2^{K-5} + 1 + n \times 2^{n-1}) \times FS$
Traceback: Cycles/frame = (loop overhead and data storage + loop × 16) × # bits/16
    $= (9 + 12 \times 16) \times FS/16$
    $= 201 \times FS/16$
Data reversal: Cycles/frame $= 43 \times FS/16$
Total MIPS = Frame rate × (metric update + traceback + data reversal) cycles/frame
    $= FR \times [(2^{K-2} \times 5 + 2^{K-5} + 1 + n \times 2^{n-1}) \times FS + (201/16) \times FS + (43/16) \times FS]$
    $= FR \times FS \times (2^{K-2} \times 5 + 2^{K-5} + 1 + n \times 2^{n-1} + (201 + 43)/16)$
    $= FR \times FS \times (2^{K-2} \times 5 + 2^{K-5} + n \times 2^{n-1} + 16.25)$

So, for a frame size of 100K bits and a frame rate of 1 Hz, the estimate was 18.425 MIPS at 100 Kbits/sec, or about 543 Kbits/sec maximum at 100 MIPS. The difference between the estimate and the actual implementation is due to the necessary overhead of setting up pointers, creating loops, decrementing path metrics, and doing a traceback twice for all bits of data (since only the last 64 bits of a length-128 traceback are stored).
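
The TI estimate can be evaluated directly for K = 7 and n = 2. The short Python sketch below uses function names of my own choosing and simply plugs numbers into the formula above. It gives 184.25 cycles per bit, against the 13780/64 = 215.3 cycles per bit measured in simulation, so the implementation runs roughly 17% over the estimate.

```python
def ti_cycles_per_bit(K=7, n=2):
    """Evaluate TI's per-bit cycle estimate for constraint length K, rate 1/n."""
    metric_update = 2 ** (K - 2) * 5 + 2 ** (K - 5) + 1 + n * 2 ** (n - 1)
    traceback = 201 / 16       # (9 + 12 * 16) cycles per 16 bits
    data_reversal = 43 / 16    # 43 cycles per 16 bits
    return metric_update + traceback + data_reversal


def ti_mips(frame_size, frame_rate, K=7, n=2):
    """Estimated load in MIPS for frame_size bits at frame_rate frames/second."""
    return frame_rate * frame_size * ti_cycles_per_bit(K, n) / 1e6


def overhead_ratio(measured_cycles=13780, bits=64):
    """Ratio of the measured cycles per bit to the TI estimate."""
    return (measured_cycles / bits) / ti_cycles_per_bit()
```

Calling ti_mips(100_000, 1) reproduces the 18.425 MIPS figure quoted above.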

Power:

Using another Application Report from TI, Calculation of TMS320LC54x Power Dissipation, I examined every instruction used in my program. TI's standard figure of .45 mA/MIPS for .25 micron technology assumes a mix of half NOP and half MAC instructions; relative to that mix, I concluded that my program averaged 1.08 times the standard current. Their chips are made using static CMOS designs, made to run at frequencies approaching 0 Hz, so for a clock rate of 22 MHz the power should scale linearly. Thus, we calculate that running the Viterbi decoder at 100 Kbits/sec on a C5402 DSP requires 19.25 mW in the core (22 MIPS x .45 mA/MIPS x 1.08 = 10.69 mA at 1.8 V).

	Voltage	Current	Power
DSP Core	1.8 V	10.69 mA	19.25 mW
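
The table entries follow from a linear scaling of TI's per-MIPS current figure. The Python sketch below reproduces them. The function and parameter names are mine, and the linear model is the assumption stated in the text rather than a measured result.

```python
def core_power(mips, ma_per_mips=0.45, activity=1.08, vcore=1.8):
    """Estimate core current (mA) and power (mW) under a linear scaling model.

    mips        : instruction rate the decoder needs (22 for 100 Kbits/sec)
    ma_per_mips : TI's standard current figure (half MAC, half NOP mix)
    activity    : program's current relative to that standard mix
    vcore       : core supply voltage in volts
    """
    current_ma = mips * ma_per_mips * activity
    power_mw = current_ma * vcore
    return current_ma, power_mw
```

core_power(22) gives about 10.69 mA and 19.25 mW, matching the table.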

This figure does not include any estimate of power to drive I/O pins since our comparison is based upon the core only.

Physical Chip Size:
TI does not publish their die sizes, and I was unable to find a suitable figure anywhere in the literature. This chip (and other ones with larger memories) is implemented on a 144 pin ball grid array (BGA) measuring 12 mm per side. The area in this package usable for a die measures 3.2 mm on a side, so the maximum size of the die and memory is 10.24 mm^2.

Cost:

Cost of Chip (50K quantities)	$5 - $75 ea.
Emulator Development Kit	$2,995
Code Composer Development Environment	$3,156.45
Simulator	$1,578.22
C Compiler/Assembler/Linker	$2,367.33
Debugger	$3,156.45

Development Tools:

TI boasts a fairly sophisticated, integrated set of development tools. The Code Composer Development Environment claims to seamlessly integrate the compiler/assembler/linker, debugger, simulator and emulator. It includes a signal probe, profiler, multiprocessor debugging, data visualization, interactive compiling, and a development environment similar to Microsoft Visual C++. Run-time libraries are available to speed code development and assure code optimality, and TI claims efficiency of compiled code to be close to hand-assembled code.

This project did not use the C compiler at all; the entire program was written in assembly language. I found that many features of the debugger seemed better suited to debugging C code than assembly. The tools overall were well integrated and helpful, although there is still plenty of room for improvement.

Development Time:

This project could be completed in approximately 3 engineer days, from start to finish, using the latest tools. This is based upon the comparatively sophisticated development tools available for the TI line of DSPs, the availability of run-time libraries, the availability of the Viterbi algorithm application report (with included code snippets), and the availability of configurable emulator boards with JTAG pins. It also assumes that the engineer is familiar with the TI tools and the TI C54x assembly language. Since I was familiar with neither, this project took me somewhat longer to finish.

Conclusion

This chip is an excellent choice for a low data rate implementation of the Viterbi Decoder. The chip is fast, low power, and has specialized instructions that greatly accelerate decoding. An application report from TI will speed development, and with less than 25% of the chip's operation time and only a fraction of its memory being used for a 100 Kbit/sec decoder, there is plenty of processing power remaining to implement other blocks in the receiver chain.

References

VC5402 Datasheet: http://www-s.ti.com/sc/psheets/sprs079/sprs079.pdf
TMS320C5000 DSP Family Functional Overview: http://www-s.ti.com/sc/psheets/spru307/spru307.pdf
Tool Information: http://www.ti.com/sc/docs/dsps/tools/c5000/c54x/
Pricing Information: Arrow Semiconductor http://www.arrowsemi.com/