Viterbi Decoder Implementation on a TI TMS320VC5402 DSP
Final Project, EE 290A
May 4, 1999
We will investigate the costs and benefits of implementing a Viterbi Decoder using a TI TMS320C54x DSP chip. In particular we plan to estimate the impact that a variety of design parameters have on the following objective functions (as stated in class):
The TI TMS320C54x DSP is a new, low-power line of DSPs unveiled by TI in June 1998. These chips use 16-bit fixed-point words and can run at as low as 0.45 mW/MIPS for chips built in 0.25 micron technology.
For our implementation, we choose to use the VC5402, which has the following features:
Hardware capabilities of the entire C54x series include:
Specifications for our design include:
Viterbi Decoder Software Implementation
Texas Instruments provides a reference document describing how to implement a Viterbi Decoder on a C54x DSP. This document can be downloaded at:
The C54x is highly optimized to perform Viterbi decoding due to the following features of its ALU:
I created an assembly program using code snippets from the TI Viterbi Decoder Application Report, which shows how to best take advantage of these features. The code can be downloaded at viterbi.asm, and the listing file memory mapped to the VC5402 architecture can be found at viterbi.lst.
The only deviation from the specifications listed above is that a length 128 traceback was used, with only the final 64 bits being recorded. The traceback length was chosen as a multiple of 16 to simplify the traceback code; the longer traceback does not degrade the performance of the algorithm (in fact, it enhances it). Since 16-bit arithmetic was used on 6-bit input words, there was no quantization loss in the algorithm (beyond input quantization noise), so it easily met the SNR and BER requirements.
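The windowed traceback described above can be illustrated with a small Python model. This is only a sketch, not the assembly program from this project: it assumes a constraint length of K = 7, a rate 1/2 code with generators 171 and 133 (octal), and hard-decision inputs instead of the 6-bit soft inputs. None of those assumptions comes from the text above. The traceback depth of 128 and the 64 recorded bits per traceback do follow the text.

```python
import random

K = 7                       # constraint length (assumed; not stated above)
POLYS = (0o171, 0o133)      # rate-1/2 generators (assumed, a common choice)
N_STATES = 1 << (K - 1)
MASK = N_STATES - 1
TB_LEN = 128                # traceback depth, as in the report
OUT_LEN = 64                # bits recorded per traceback, as in the report
BIG = 10**6                 # stands in for an "infinite" initial path metric


def parity(x):
    return bin(x).count("1") & 1


def encode(bits):
    """Convolutional encoder: one (c0, c1) pair per input bit."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state      # newest bit sits in the MSB
        out.append(tuple(parity(reg & g) for g in POLYS))
        state = reg >> 1
    return out


def traceback(decisions, end, state, length):
    """Walk back `length` stages from `state` at time `end`; oldest bit first."""
    bits = []
    for t in range(end - 1, end - 1 - length, -1):
        bits.append(state >> (K - 2))     # input bit that led into `state`
        state = ((state << 1) | decisions[t][state]) & MASK
    bits.reverse()
    return bits


def decode(symbols):
    """Hard-decision Viterbi decoder with a sliding 128/64 traceback window."""
    pm = [0] + [BIG] * (N_STATES - 1)     # encoder is known to start in state 0
    decisions, decoded = [], []
    for t, (r0, r1) in enumerate(symbols):
        new_pm, dec = [0] * N_STATES, [0] * N_STATES
        for s in range(N_STATES):
            best = None
            for d in (0, 1):              # d = bit shifted out of the predecessor
                reg = (s << 1) | d
                c0, c1 = (parity(reg & g) for g in POLYS)
                m = pm[reg & MASK] + (c0 != r0) + (c1 != r1)
                if best is None or m < best:
                    best, dec[s] = m, d
            new_pm[s] = best
        low = min(new_pm)                 # renormalize so metrics stay bounded
        pm = [m - low for m in new_pm]
        decisions.append(dec)
        end = t + 1
        if end >= TB_LEN and (end - TB_LEN) % OUT_LEN == 0:
            start = pm.index(0)           # best state after renormalization
            decoded += traceback(decisions, end, start, TB_LEN)[:OUT_LEN]
    # Flush: one final traceback over whatever has not been emitted yet.
    n = len(symbols)
    if len(decoded) < n:
        decoded += traceback(decisions, n, pm.index(0), n - len(decoded))
    return decoded
```

Each window traces back 128 stages but keeps only the oldest 64 decisions, so every recorded bit has at least 64 stages of path history behind it. The model stores all decisions in a list for simplicity; a DSP implementation would presumably use a fixed-size buffer.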
The input I and Q data are assumed to have been placed in a particular memory location by one of the on-chip DMA channels, and the output data words are assumed to be moved to their final destination by another on-chip DMA channel. Alternatively, since the Viterbi decoding algorithm does not require all of the processing power of the chip, the input data might be placed in its memory location after the TI chip has run some filtering or other DSP algorithm on the data points.
The code mentioned above was simulated on a cycle-accurate TI simulator, which included a memory map specific to this processor. To decode 64 bits, the decoding algorithm took, on average, 13780 cycles (13714 cycles if the path metrics did not need to be decremented, 13914 cycles if they did). At this rate, decoding 100 Kbits/second requires about 21.5 million cycles per second, which can be handled if the chip runs at 22 MHz. Alternatively, the chip could handle a maximum of 464.7 Kbits/second at 100 MHz. Since we are scaling the technology up from 0.18 micron to 0.25 micron, it is unlikely that the 1.8V core could be clocked at 100 MHz, but 22 MHz should be easily attainable.
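The throughput arithmetic above can be reproduced with a few lines of Python. This is a sketch that assumes one instruction cycle per clock cycle (so 1 MHz corresponds to 1 MIPS), as the figures above imply; the function names are only illustrative.

```python
CYCLES_PER_BLOCK = 13780   # average simulated cycles per 64 decoded bits
BITS_PER_BLOCK = 64

cycles_per_bit = CYCLES_PER_BLOCK / BITS_PER_BLOCK


def required_clock_hz(bit_rate):
    """Clock rate needed to sustain `bit_rate` decoded bits per second."""
    return bit_rate * cycles_per_bit


def max_bit_rate(clock_hz):
    """Decoded bits per second available at a given clock rate."""
    return clock_hz / cycles_per_bit


print(f"{cycles_per_bit:.1f} cycles per decoded bit")
print(f"{required_clock_hz(100e3) / 1e6:.2f} MHz needed for 100 Kbits/sec")
print(f"{max_bit_rate(100e6) / 1e3:.1f} Kbits/sec at 100 MHz")
```

Under these assumptions the required clock comes out at roughly 21.5 MHz, which rounds up to the 22 MHz figure, and the 100 MHz limit lands at roughly 464 Kbits/sec.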
As seen below, the code used only about a fourth of the 4K x 16 ROM, and the data storage required was a little over 1/16 of the available 16K x 16 dual-access RAM.
|Code Size:||1032 (16 bit) Program Words|
|Data Storage:||1280 (16 bit) Data Words|
|Cycles Required (per 64 decoded bits, average):||13780|
|Maximum Speed (100 MIPS):||464.7 Kbits/sec|
The results closely follow the estimates from TI. From the TI Application Report, we have the following predictions for the required cycle counts and MIPS (FS = frame size in bits, FR = frame rate, K = constraint length, 1/n = code rate):
Metric update: Cycles/frame = (#States/2 butterflies × butterfly calculation + TRN store + local distance calculation) × # bits
$$= (2^{K-2} \times 5 + 2^{K-5} + 1 + n \times 2^{n-1}) \times FS$$
Traceback: Cycles/frame = (loop overhead and data storage + loop × 16) × # bits/16
$$= (9 + 12 \times 16) \times FS/16 = 201 \times FS/16$$
Data reversal: Cycles/frame = 43 × FS/16
Total MIPS = Frame rate × (metric update + traceback + data reversal) cycles/frame
$$= FR \times [(2^{K-2} \times 5 + 2^{K-5} + 1 + n \times 2^{n-1}) \times FS + (201/16) \times FS + (43/16) \times FS]$$
$$= FR \times FS \times (2^{K-2} \times 5 + 2^{K-5} + 1 + n \times 2^{n-1} + (201 + 43)/16)$$
$$= FR \times FS \times (2^{K-2} \times 5 + 2^{K-5} + n \times 2^{n-1} + 16.25)$$
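The cycle-count estimate above can be evaluated numerically. The sketch below assumes K = 7 and n = 2; these values are not stated in the text and are chosen here only because they reproduce the 18.425 MIPS figure quoted below. The function names are illustrative.

```python
def cycles_per_bit(K, n):
    """Estimated cycles per decoded bit from the three terms of the estimate."""
    metric_update = 5 * 2 ** (K - 2) + 2 ** (K - 5) + 1 + n * 2 ** (n - 1)
    traceback = 201 / 16        # (9 + 12 * 16) cycles per 16 bits
    data_reversal = 43 / 16     # 43 cycles per 16 bits
    return metric_update + traceback + data_reversal


def total_mips(K, n, frame_size, frame_rate):
    """Millions of cycles per second for frame_size bits, frame_rate frames/sec."""
    return frame_rate * frame_size * cycles_per_bit(K, n) / 1e6


# Assumed parameters: K = 7, n = 2 (rate 1/2); one 100 Kbit frame per second.
print(cycles_per_bit(7, 2))
print(total_mips(7, 2, frame_size=100_000, frame_rate=1))
```

With these assumed parameters the estimate works out to 184.25 cycles per bit, compared with about 215 cycles per bit measured in simulation.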
So, for a frame size of 100K bits and a frame rate of 1 Hz, the estimate is 18.425 MIPS at 100 Kbits/sec, or about 543 Kbits/sec maximum at 100 MIPS. The difference between the estimate and the actual implementation is due to the necessary overhead of setting up pointers, creating loops, decrementing path metrics, and doing a traceback twice for all bits of data (since only the last 64 bits of a length 128 traceback are stored).
Using another Application Report from TI, Calculation of TMS320LC54x Power Dissipation, I examined every instruction used in my program. TI's standard figure of 0.45 mA/MIPS for 0.25 micron technology assumes a mix of half NOP and half MAC instructions; relative to that baseline, I concluded that my program averages 1.08 times the current. Their chips are made using static CMOS designs, made to run at frequencies approaching 0 Hz, so for a clock rate of 22 MHz the power should scale linearly. Thus, we calculate that running the Viterbi decoder at 100 Kbits/sec on a C5402 DSP requires 19.25 mW (22 MIPS × 0.45 mA/MIPS × 1.08 × 1.8 V).
|DSP Core||1.8 V||10.69 mA||19.25 mW|
This figure does not include any estimate of the power needed to drive I/O pins, since our comparison is based on the core only.
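The core power figure can be checked with a short calculation. This sketch assumes 1 MIPS per MHz of clock and uses the values given in this section; the variable names are illustrative.

```python
MIPS_NEEDED = 22            # 22 MHz clock, assuming 1 MIPS per MHz
BASE_MA_PER_MIPS = 0.45     # TI's half-MAC, half-NOP baseline figure
PROGRAM_FACTOR = 1.08       # this program's current relative to the baseline
CORE_VOLTS = 1.8

current_ma = MIPS_NEEDED * BASE_MA_PER_MIPS * PROGRAM_FACTOR
power_mw = current_ma * CORE_VOLTS

print(f"core current: {current_ma:.2f} mA")
print(f"core power:   {power_mw:.2f} mW")
```

Because the core is static CMOS, the same linear scaling could be applied to other clock rates by changing MIPS_NEEDED.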
Physical Chip Size:
TI does not publish their die sizes, and I was unable to find a suitable figure anywhere in the literature. This chip (and other ones with larger memories) is implemented on a 144 pin ball grid array (BGA) measuring 12 mm per side. The area in this package usable for a die measures 3.2 mm on a side, so the maximum size of the die and memory is 10.24 mm^2.
|Cost of Chip (50K quantities)||$5 - $75 ea.|
|Emulator Development Kit||$2,995|
|Code Composer Development Environment||$3156.45|
TI boasts a fairly sophisticated, integrated set of development tools. The Code Composer Development Environment claims to seamlessly integrate the compiler/assembler/linker, debugger, simulator and emulator. It includes a signal probe, profiler, multiprocessor debugging, data visualization, interactive compiling, and a development environment similar to Microsoft Visual C++. Run-time libraries are available to speed code development and assure code optimality, and TI claims efficiency of compiled code to be close to hand-assembled code.
This project did not use the C compiler at all; the entire program was written in assembly language. I found that many features of the debugger seemed to be better suited to debugging C code than assembly. The tools overall were well integrated and helpful, although there is still plenty of work to be done that would make them better.
This project could be completed in approximately 3 engineer days, from start to finish, using the latest tools. This is based upon the comparatively sophisticated development tools available for the TI line of DSPs, the availability of run-time libraries, the availability of the Viterbi algorithm application report (with included code snippets), and the availability of configurable emulator boards with JTAG pins. It also assumes that the engineer is familiar with the TI tools and the TI C54x assembly language. Since I was familiar with neither, this project took me somewhat longer to finish.
This chip is an excellent choice for a low data rate implementation of the Viterbi Decoder. The chip is fast, low power, and has specialized instructions that greatly accelerate decoding. An application report from TI will speed development, and with less than 25% of the chip's processing time and only a fraction of its memory being used for a 100 Kbit/sec decoder, there is plenty of processing power remaining to implement other blocks in the receiver chain.