JPEG Implementation Study for a TI TMS320VC5410 DSP
Homework #1, EE 290A
February 2, 1999
We will investigate the costs and benefits of implementing a JPEG Encoder/Decoder using a TI TMS320C54xx DSP chip. In particular we plan to estimate the impact that a variety of design parameters have on the following objective functions (as stated in class):
The TI TMS320C54x is a new, low-power line of DSPs unveiled by TI in June 1998. These chips use 16-bit fixed-point words and can run at as little as 0.45 mW, or at 120 mW at 200 MIPS. The latest in the series is the VC5420, which boasts two parallel 100 MIPS DSP cores capable of operating at 1.8V.
Hardware capabilities include:
We chose to use the TI TMS320VC5410 for our analysis. The algorithm is easily parallelized, so the dual 100 MHz DSP cores of the VC5420 could have greatly sped up our system. Unfortunately, that chip can only access up to 200K of on-chip RAM and a maximum of 256K of external RAM, which is not enough to hold a nearly 1MB 640x480 color picture in its initial bitmap form. A multiprocessor solution using several VC5420s in parallel could significantly increase frame throughput at the expense of greater hardware cost, power, and design time, but we will focus here on the single-chip implementation, which is realizable using the VC5410.
The VC5410:
Our system diagram:
Our processor will begin to operate when an image arrives in memory. The image blocker/splitter will split the image into its three components (luminance and two chrominance components) and then divide each component into 8x8 blocks. Since the bulk of the processing occurs within the JPEG block, which works on one 8x8 block at a time, this partitioning will efficiently parallelize the algorithm. With the help of DMA transfers, the blocker/splitter itself should not be computationally complex.
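The blocking step can be sketched as follows. This is only an illustration, assuming each component is stored as a row-major array, that the image dimensions are multiples of 8, and that no chrominance subsampling is used (which is what the 14400-block figure used later implies). The function name `split_into_blocks` is our own.

```python
# Sketch: split one image component (plane) into 8x8 blocks.
# Assumptions: row-major storage, dimensions divisible by 8, no subsampling.

BLOCK = 8

def split_into_blocks(plane, width, height):
    """Return a list of 8x8 blocks (each a list of 8 rows) from a flat plane."""
    blocks = []
    for by in range(0, height, BLOCK):
        for bx in range(0, width, BLOCK):
            block = []
            for y in range(BLOCK):
                start = (by + y) * width + bx
                block.append(plane[start:start + BLOCK])
            blocks.append(block)
    return blocks

WIDTH, HEIGHT, COMPONENTS = 640, 480, 3
plane = [0] * (WIDTH * HEIGHT)   # placeholder pixel data for one component
blocks_per_plane = len(split_into_blocks(plane, WIDTH, HEIGHT))
total_blocks = blocks_per_plane * COMPONENTS
print(blocks_per_plane, total_blocks)   # prints: 4800 14400
```

On the DSP this loop would mostly reduce to address generation for the DMA controller, which is why we do not count it in the cycle budget.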
Within the JPEG block, an 8x8 DCT is performed first. Using the AAN algorithm to take advantage of the redundancies in the DCT, each 8x8 block will require 64 adds and 256 multiply-accumulates (MACs). Since each of these is a single-cycle operation, we estimate that an 8x8 DCT can be computed in roughly 64 + 256 = 320 clock cycles.
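For reference, the transform itself can be written as a direct row-column 8x8 DCT-II. The sketch below is not the AAN factorization and is not fixed-point; it is a floating-point reference with orthonormal scaling that a fast implementation could be checked against. The direct form performs 16 one-dimensional transforms of 64 multiply-accumulates each (1024 MACs), which illustrates why a fast factorization is needed to approach the 320-cycle estimate.

```python
import math

N = 8

def dct_1d(v):
    """Direct 8-point DCT-II with orthonormal scaling (8 MACs per output)."""
    out = []
    for u in range(N):
        scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
        acc = 0.0
        for x in range(N):
            acc += v[x] * math.cos((2 * x + 1) * u * math.pi / (2 * N))
        out.append(scale * acc)
    return out

def dct_2d(block):
    """Row-column 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d([rows[y][u] for y in range(N)]) for u in range(N)]
    # cols[u][v] holds horizontal frequency u, vertical frequency v
    return [[cols[u][v] for u in range(N)] for v in range(N)]

# Cycle estimate used in this report: 64 adds + 256 MACs, one cycle each.
CYCLES_PER_DCT = 64 + 256

flat = [[100] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6), CYCLES_PER_DCT)
```

A constant block is a convenient sanity check: all of its energy should land in the DC coefficient, which equals 8 times the pixel value under this scaling.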
Quantization is next within the JPEG block. Quantization will require a 64 element quantization table to be in memory, in the form of an 8 bit mask of the number of bits required. Each element of the transformed DCT vector will be AND-ed with its mask, quantizing it to the proper precision. This operation will take 64 cycles per 8x8 block.
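A minimal sketch of mask-based quantization is shown below. It assumes that each table entry is interpreted as the number of low-order bits to discard for that coefficient, and that the mask clears those bits; the helper names and the example bit counts are hypothetical and are not taken from any standard JPEG table.

```python
# Sketch: quantize coefficients by AND-ing each one with a per-position mask.
# Assumption: masks clear a chosen number of low-order bits (two's complement).

def make_mask(bits_dropped):
    """Mask that clears the lowest `bits_dropped` bits of a coefficient."""
    return ~((1 << bits_dropped) - 1)

def quantize(coeffs, masks):
    """One AND per coefficient, i.e. 64 single-cycle operations per 8x8 block."""
    return [c & m for c, m in zip(coeffs, masks)]

# Hypothetical example: drop 2 bits from the first two values, 3 from the third.
masks = [make_mask(2), make_mask(2), make_mask(3)]
quantized = quantize([13, -5, 7], masks)
print(quantized)   # prints: [12, -8, 0]
```

Note that masking rounds toward negative infinity in two's complement, so -5 becomes -8 rather than -4.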
The encoder will first subtract the previous block's quantized DC value from the current one. This difference will then be used in a table lookup to find its Huffman code, which will begin the encoded vector for this 8x8 block. From here, the processor will scan the remaining elements in the zig-zag pattern, counting the number of consecutive zeroes. When a non-zero element is found, its value is shifted, added to the number of consecutive zeroes, and used in a table lookup to find the Huffman code for that sequence. Once found, this code is shifted and added to the current vector for the 8x8 data set. Assuming 25% non-zero elements, this will take approximately 120 clock cycles.
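The scan described above can be sketched without the Huffman tables, which we leave out here. The sketch assumes the DC difference is taken as current minus previous, as in baseline JPEG, and it omits the end-of-block marker and the special handling of zero runs longer than 15. The function names are our own.

```python
# Sketch: DC differencing plus zig-zag run-length scan of one quantized block.
# Huffman lookup, end-of-block and long-run (>15) handling are omitted.

def zigzag_order(n=8):
    """Return (row, col) pairs in zig-zag scan order for an n x n block."""
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals are walked bottom-left to top-right,
        # odd ones top-right to bottom-left.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order

def run_length_encode(block, prev_dc):
    """Return (dc_difference, [(zero_run, value), ...]) for one block."""
    zz = [block[r][c] for r, c in zigzag_order()]
    dc_diff = zz[0] - prev_dc
    pairs = []
    run = 0
    for value in zz[1:]:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return dc_diff, pairs

block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[2][0] = 50, 3, -2
encoded = run_length_encode(block, prev_dc=45)
print(encoded)   # prints: (5, [(0, 3), (1, -2)])
```

Each `(zero_run, value)` pair corresponds to one table lookup in the encoder, so the cycle count scales with the number of non-zero coefficients rather than with the block size.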
The decoder will first add the proper number of bits to the DC value. A shift and mask will then drive a table lookup to decode the number of consecutive preceding zeroes, followed by the size and amplitude of the next non-zero value. Beginning with a vector initialized with 63 zeroes, only the necessary values will need to be changed. Assuming that most of the 75% zeroes are towards the end of the zig-zag pattern, we can estimate about 150 clock cycles for the decoder.
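The run-length expansion on the decoder side can be sketched as follows, again with the Huffman lookup left out. It assumes the input is a DC difference (current minus previous) plus a list of `(zero_run, value)` pairs, and it returns the 64 coefficients in zig-zag order; these conventions are ours, chosen for illustration.

```python
# Sketch: rebuild one block's coefficients (zig-zag order) from run-length pairs.
# Only non-zero positions are written; the vector starts out as all zeroes.

def run_length_decode(dc_diff, pairs, prev_dc):
    """Return 64 coefficients in zig-zag order."""
    zz = [0] * 64
    zz[0] = prev_dc + dc_diff      # undo the DC differencing
    pos = 1
    for zero_run, value in pairs:
        pos += zero_run            # skip over the run of zeroes
        zz[pos] = value
        pos += 1
    return zz

decoded = run_length_decode(5, [(0, 3), (1, -2)], prev_dc=45)
print(decoded[:4])   # prints: [50, 3, 0, -2]
```

Because untouched positions stay zero, the work is proportional to the number of pairs, which is the basis for the cycle estimate above.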
Dequantization will simply involve a shift for every element based on the quantization table, resulting in 64 clock cycles of operation (due to the single cycle barrel shifter).
The final step in the JPEG block is the Inverse Discrete Cosine Transform (IDCT). Again, taking advantage of the redundancies in the IDCT with a similar structure to the DCT, we can perform an IDCT on an 8x8 block using approximately 320 cycles.
Analysis
Computational Summary for a 640 x 480 color picture (14400 8x8 blocks):
Function | Cycles/8x8 Block | Cycles/Frame | Time/Frame (ms) |
DCT | 320 | 4608000 | 46.08 |
Quantization | 64 | 921600 | 9.22 |
Encoding | 120 | 1728000 | 17.28 |
Decoding | 150 | 2160000 | 21.60 |
De-Quantization | 64 | 921600 | 9.22 |
IDCT | 320 | 4608000 | 46.08 |
Total: | 1038 | 14947200 | 149.47 |
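The per-frame figures follow directly from the per-block cycle counts. The short script below reproduces them, assuming a 100 MHz clock with one instruction per cycle and three full-resolution components (14400 blocks per frame).

```python
# Sketch: derive cycles/frame and time/frame from the per-block estimates.
# Assumptions: 100 MHz clock, single-cycle instructions, 14400 blocks/frame.

CLOCK_HZ = 100_000_000
BLOCKS_PER_FRAME = (640 // 8) * (480 // 8) * 3

CYCLES_PER_BLOCK = {
    "DCT": 320,
    "Quantization": 64,
    "Encoding": 120,
    "Decoding": 150,
    "De-Quantization": 64,
    "IDCT": 320,
}

def frame_cost(cycles_per_block):
    """Return (cycles per frame, milliseconds per frame)."""
    cycles = cycles_per_block * BLOCKS_PER_FRAME
    return cycles, cycles * 1000 / CLOCK_HZ

for name, per_block in CYCLES_PER_BLOCK.items():
    cycles, ms = frame_cost(per_block)
    print(f"{name:<16} {per_block:>5} {cycles:>9} {ms:>7.2f}")

total_per_block = sum(CYCLES_PER_BLOCK.values())
total_cycles, total_ms = frame_cost(total_per_block)
print(f"{'Total':<16} {total_per_block:>5} {total_cycles:>9} {total_ms:>7.2f}")
```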
Data I/O for a 640 x 480 color picture (14400 8x8 blocks):
Total Bytes to Transfer (including DRAM -> SRAM and SRAM -> DRAM) | 14400 * 64 * 2 = 1843200 |
Total Cycles (with 5 state DRAM using DMA): | 9216000 |
Total Time: | 92.16 ms |
Because the DMA can move data to and from external DRAM while the computations are in progress, the 92 ms of transfer time is hidden behind the roughly 150 ms of computation. Each frame will therefore require about 150 ms, allowing 6.67 frames/sec.
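The transfer budget and the resulting frame rate can be checked the same way. The sketch assumes one byte per transferred element, 5 cycles per byte for the external DRAM, a 100 MHz clock, and complete overlap of DMA transfers with computation, so the frame time is simply the larger of the two figures.

```python
# Sketch: DMA transfer time and frame rate under full compute/I-O overlap.
# Assumptions: 1 byte per element, 5 cycles per byte, 100 MHz, ideal overlap.

CLOCK_HZ = 100_000_000
BLOCKS_PER_FRAME = 14400
BYTES_PER_BLOCK = 64
TRANSFERS_PER_BLOCK = 2      # DRAM -> SRAM, then SRAM -> DRAM
CYCLES_PER_BYTE = 5
COMPUTE_MS = 149.47          # total computation time per frame

io_bytes = BLOCKS_PER_FRAME * BYTES_PER_BLOCK * TRANSFERS_PER_BLOCK
io_cycles = io_bytes * CYCLES_PER_BYTE
io_ms = io_cycles * 1000 / CLOCK_HZ

frame_ms = max(COMPUTE_MS, io_ms)   # the slower of the two sets the pace
fps = 1000 / frame_ms

print(io_bytes, io_cycles, f"{io_ms:.2f} ms", f"{fps:.2f} frames/sec")
```

With these inputs the frame time is compute-bound, and the frame rate comes out just under 6.7 frames/sec, in line with the rounded 150 ms figure.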
Power:
Item | Voltage | Current | Power |
Core CPU | 2.5 V | 45.2 mA | 113 mW |
External Pins | 3.0 V | 25 mA | 75 mW |
IDLE2 (shut down CPU and peripherals) | 2.5 V | 2 mA | 5 mW |
IDLE3 (shut down processor entirely) | 2.5 V | .005 mA | 0.0125 mW |
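Each power figure is the product of supply voltage and current. The sketch below recomputes them from the voltages and currents in the table; the combined active figure (core plus external pins) is our own addition and excludes external RAM, for which no data is available.

```python
# Sketch: P = V * I for each row of the power table (volts * milliamps = mW).

OPERATING_POINTS = {
    "Core CPU": (2.5, 45.2),
    "External Pins": (3.0, 25.0),
    "IDLE2": (2.5, 2.0),
    "IDLE3": (2.5, 0.005),
}

power_mw = {name: volts * milliamps
            for name, (volts, milliamps) in OPERATING_POINTS.items()}

# Assumed active total: core and external pins both running, no external RAM.
active_mw = power_mw["Core CPU"] + power_mw["External Pins"]

for name, mw in power_mw.items():
    print(f"{name:<14} {mw:.4f} mW")
print(f"{'Active total':<14} {active_mw:.1f} mW")
```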
Note: Power data for the external RAM is not yet available, since TI does not yet list any approved vendors of external DRAM for the C54x series.
Cost:
Cost of Chip (50K quantities) | $5 ea. |
Emulator Development Kit | $2,995 |
Code Composer Development Environment | $3156.45 |
Simulator | $1578.22 |
C Compiler/Assembler/Linker | $2367.33 |
Debugger | $3156.45 |
Development Tools:
TI boasts a fairly sophisticated, integrated set of development tools. The Code Composer Development Environment claims to seamlessly integrate the compiler/assembler/linker, debugger, simulator and emulator. It includes a signal probe, profiler, multiprocessor debugging, data visualization, interactive compiling, and a development environment similar to Microsoft Visual C++. Run-time libraries are available to speed code development and assure code optimality, and TI claims efficiency of compiled code to be close to hand-assembled code.
Development Time:
This project could be completed in approximately 3 engineer-months, from start to finish, using the latest tools. This estimate is based upon the comparatively sophisticated development tools available for the TI line of DSPs, the availability of run-time libraries, the availability of JPEG algorithms in C on the web, and the availability of configurable emulator boards with JTAG pins.
The code size should be less than 1000 lines of C code. The exact code size depends on the number of library calls and hand-coded assembly routines, as well as on the choice of real-time operating system (RTOS) most suitable for the application using this JPEG codec.
Conclusion
This chip is likely somewhat underpowered for sustaining a high frame rate at 640 x 480 resolution. The estimated frame rate of 6.7 frames/second is somewhat aggressive, since it depends on efficient library calls and assembly-coded kernels. Actual speed will depend on how well these library calls and the compiler can utilize the capabilities of the processor. A smaller image size, capable of fitting into the 200K words of the VC5420, or at least into the 456K of addressable space of the VC5420, could have allowed a 2x speedup of the algorithm, due to the dual 100 MIPS DSP cores. Power would have been comparable, due to the 1.8V power supply of the VC5420.
As of January 1999, TI does not list any third party vendors which have external DRAM supporting the C54x series. The TMS320VC5410, as well as the software for it, is expected sometime in 1H99.