Ning Zhang & Marlene Wan
We use the JavaTime implementation of DCT and Huffman coding from the JPEG standard
[1] to estimate the performance and energy of software running on a RISC processor
(StrongARM) [2]. We then move the computation-intensive part (DCT) to a configurable
ASIC platform [3] to get another data point. These results are also compared with a pure
ASIC implementation [4] and two other software implementations on DSPs [5][6].
Algorithm: baseline JPEG encoding of a 640x480 color image (14400 8x8 single-component images)
For each 8x8 single-component image, we break down the major operations, such as memory read and write, ALU, and multiply, as follows (on StrongARM, each operation takes one cycle except multiplication; the exact cycle count of each multiplication is noted below):
1. DCT:
1.1 Software implementation (JavaTime implementation)
computation: 8x8x8x(2 r, 1 mpy, 2 alu) + 8x8 w (1 cycle mpy)
index: 8x8x8 alu + 8x8 alu + 8 alu
address: 8x8x8x(2 alu) + 8x8 alu
computation: 8x8x8x(2 r, 1 mpy, 1 alu) + 8x8 w + (8x8 shift) (2 cycle mpy)
index: 8x8x8 alu + 8x8 alu + 8 alu
address: 8x8x8x(2 alu) + 8x8 alu
1.2 Dataflow implementation of the same computation (based on Pleiades [3] performance and energy numbers):
Performance: bottleneck -> mpy -> 20 ns
512x20ns + 512x20ns = 20.48 usec per block
-> 293 msec for a frame
assuming 2 parallel pipes -> 293/2 = 146.5 msec a frame
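The timing estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that each of the two DCT stages performs 8x8x8 = 512 multiplies at the 20 ns multiplier rate; the constant and function names are illustrative and do not come from the Pleiades tools.

```python
# Back-of-the-envelope timing for the dataflow DCT (all names are illustrative).
MPY_NS = 20                                      # multiplier bottleneck: 20 ns per multiply
MPYS_PER_STAGE = 8 * 8 * 8                       # 512 multiplies per stage
STAGES = 2                                       # assumed: two passes over each 8x8 block
BLOCKS_PER_FRAME = 3 * (640 // 8) * (480 // 8)   # 14400 single-component blocks


def dct_block_time_us() -> float:
    """Estimated DCT time for one 8x8 block, in microseconds."""
    return STAGES * MPYS_PER_STAGE * MPY_NS / 1000.0


def dct_frame_time_ms(pipes: int = 1) -> float:
    """Estimated DCT time per frame, in milliseconds, with `pipes` parallel pipes."""
    return dct_block_time_us() * BLOCKS_PER_FRAME / pipes / 1000.0
```

The straight product comes out at about 295 msec per frame for one pipe, within 1% of the 293 msec quoted above.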
Energy: mpy: 16 pJ
mem: 7 pJ
agp: 5 pJ
alu: 7 pJ
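These per-operation energies can be combined into per-stage and per-frame totals. The sketch below assumes that each of the 512 inner iterations of a stage costs two operand reads (memory plus address generator), one multiply, and one ALU operation, and that each of the 64 results costs one write. This breakdown is an assumption inferred from the operation counts in 1.1, not a published Pleiades formula, and the function names are illustrative.

```python
# Per-operation energies quoted above, in picojoules.
E_MPY, E_MEM, E_AGP, E_ALU = 16, 7, 5, 7
BLOCKS_PER_FRAME = 14400


def stage_energy_pj() -> int:
    """Energy of one DCT stage on one 8x8 block, in pJ (assumed breakdown)."""
    access = E_MEM + E_AGP                             # memory access plus address generation
    inner = 8 * 8 * 8 * (2 * access + E_MPY + E_ALU)   # 512 iterations x 47 pJ
    writes = 8 * 8 * access                            # 64 result writes x 12 pJ
    return inner + writes


def frame_energy_mj(stages: int = 2) -> float:
    """DCT energy per frame in millijoules (1 pJ = 1e-9 mJ)."""
    return stages * stage_energy_pj() * BLOCKS_PER_FRAME * 1e-9
```

Under these assumptions each stage comes to 24832 pJ, two stages to 49664 pJ per block, and a frame to about 0.715 mJ.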
stage 1:
24064 pJ + 768 pJ = 24832 pJ
stage 2:
24064 pJ + 768 pJ = 24832 pJ
-> 49664 pJ
-> 0.715 mJ/frame
2. Quantize:
8x8x(2 r, 1 mul, 1 shift, 1 w) (3 cycle mpy)
(assuming that the inverse of the value of each entry of the quantization table is stored in the memory)
3. Zig-zag and Huffman encoding:
Read zig-zag table stored in memory:
8x8x(2 r + 1 alu)
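Quantization with stored reciprocals and the table-driven zig-zag read can be sketched as follows. This is only an illustration: the 16-bit fixed-point scale, the truncating shift (baseline JPEG encoders usually round to nearest), and all function names are assumptions, not the JavaTime code.

```python
SHIFT = 16  # assumed fixed-point precision of the stored reciprocals


def reciprocal_table(qtable):
    """Precompute round(2^SHIFT / q) for each quantization step q."""
    return [((1 << SHIFT) + q // 2) // q for q in qtable]


def quantize(coeffs, recips):
    """One multiply and one shift per coefficient, as in the operation count."""
    out = []
    for c, r in zip(coeffs, recips):
        mag = (abs(c) * r) >> SHIFT          # truncates toward zero
        out.append(mag if c >= 0 else -mag)
    return out


def zigzag_table(n=8):
    """Index table for the zig-zag scan of an n x n block stored row-major."""
    order = sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Sort by anti-diagonal; odd diagonals run top to bottom, even ones bottom to top.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )
    return [r * n + c for r, c in order]


def zigzag(block, table):
    """One table read and one data read per output element."""
    return [block[i] for i in table]
```

In a real encoder the zig-zag table would be a constant array in memory, which is what the read count above assumes; it is generated here only to keep the sketch self-contained.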
DC encoding:
1 alu (differential)
1 cmp (MSB (16th) ?= 1 or 0)
max of 16, average of 8 x (1 shift, 1 cmp) to get the category number and rest of the code
1 r (index into the category table to get the base code)
1 cmp (MSB (10) ?= 1 or 0)
max of 8, average of 4 x (1 shift, 1 cmp)
1 shift, 1 alu, 1 write to get the final code
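The shift-and-compare loop that finds the category of a DC difference can be sketched as below. The function names are hypothetical, and the base-code table is left as a caller-supplied mapping rather than the Huffman table from the standard.

```python
def category(value: int) -> int:
    """Number of bits needed for |value|, found by repeated shift and compare."""
    mag = abs(value)
    cat = 0
    while mag != 0:        # 1 cmp per step
        mag >>= 1          # 1 shift per step
        cat += 1
    return cat


def extra_bits(value: int, cat: int) -> int:
    """Amplitude bits: the value itself if non-negative, else the low `cat` bits of value - 1."""
    if value >= 0:
        return value
    return (value - 1) & ((1 << cat) - 1)


def encode_dc(dc: int, prev_dc: int, base_codes):
    """Return (bits, length) for one DC coefficient.

    base_codes maps category -> (code, code_length); its contents are assumed.
    """
    diff = dc - prev_dc                             # 1 alu (differential)
    cat = category(diff)
    code, length = base_codes[cat]                  # 1 r (table lookup)
    bits = (code << cat) | extra_bits(diff, cat)    # 1 shift, 1 alu
    return bits, length + cat
```

For example, with a toy table `{2: (0b011, 3)}`, `encode_dc(5, 2, ...)` concatenates the 3-bit base code with the 2 amplitude bits of the difference 3.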
AC encoding:
(8x8-1)x{
1 cmp, 1 alu (counting of zeros)
30% { (assuming 30% of the quantized AC coefficients are nonzero)
Table lookup:
1 cmp (MSB (16th) ?= 1 or 0)
max of 8, average of 4 x (1 shift, 1 cmp) to get the category number and rest of the code
1 r (use the category number and the number of zeros to get the base code)
1 cmp (MSB (17) ?= 1 or 0)
max of 15, average of 7.5 x (1 shift, 1 cmp)
1 shift, 1 alu, 1 write to concatenate the two codes and get the final code
Pack (assuming two indices: array index and bit index):
2 alu (update array and bit index)
(if boundary has been crossed, the following x 2, which we assume occurs 30% of the time)
1 shift (shift according to bit index)
1 alu (concatenate)
}}
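The zero-run counting and the two-index bit packing described above can be sketched as follows. This is an illustration under assumptions: codes are passed in as (bits, length) pairs, bytes are filled most-significant bit first, and end-of-block codes, runs longer than 15 zeros, and byte stuffing are omitted. The names `run_lengths` and `BitPacker` are hypothetical.

```python
def run_lengths(ac):
    """Yield (zero_run, value) for each nonzero AC coefficient in zig-zag order."""
    run = 0
    for v in ac:
        if v == 0:          # 1 cmp
            run += 1        # 1 alu (counting of zeros)
        else:
            yield run, v
            run = 0


class BitPacker:
    """Packs variable-length codes using an array index and a bit index."""

    def __init__(self):
        self.buf = bytearray()
        self.bit = 0        # bits already used in the last byte (0..7)

    def put(self, bits: int, length: int) -> None:
        while length > 0:
            if self.bit == 0:
                self.buf.append(0)                           # advance the array index
            take = min(8 - self.bit, length)                 # bits that fit in this byte
            chunk = (bits >> (length - take)) & ((1 << take) - 1)
            self.buf[-1] |= chunk << (8 - self.bit - take)   # shift, then concatenate
            self.bit = (self.bit + take) % 8                 # update the bit index
            length -= take
```

The loop body runs twice when a code crosses a byte boundary, which corresponds to the "x 2" case in the operation count.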
Result:
SA1100 [2]: 220 MHz, typical power dissipation is 550 mW
Running 1.1, 2, 3 on SA1100, using C as the source code and conventional C compilers
Performance:
1.12 sec/frame
Power:
0.616 J/frame
Cost:
$39 each for 10,000 units
Design effort:
after the code is done (the fixed-point calculation code is tricky: 1 week), 2-3 more days of
compilation (easy) and debugging.
SA1100 + dataflow:
Running 1.2 in dataflow ASIC, 2 and 3 on SA1100 ->
Performance:
701.75 msec/frame
Power:
0.305715 J/frame
Cost:
StrongARM cost + fabrication of the dataflow ASIC component
Design effort:
on top of the time spent designing code for the StrongARM, and assuming the architecture and
library of dataflow ASIC modules are already available:
2 weeks - 1 month of assembly and debugging.
1. ASIC
JAGUAR [4]:
Performance:
30 frames/sec for 1024x1024 image frames -> 10 msec per 640x480 frame
Power:
Assume the DCT has the same energy as the Pleiades dataflow ASIC implementation (although it
should be less)
0.626 mJ/frame
In the Huffman coding stage:
assuming that memory access is the dominant factor:
185 memory accesses (based on the description above)
Using the energy consumption of the Pleiades memory and address generator: 185 x 12 pJ x 14400
= 0.032 mJ/frame
total -> 0.6258 mJ/frame
Cost:
Design cost + fab cost of 12mmx14mm in 2um -> 1.5x1.75mm in 0.25 um
Design effort:
9-28 man-months for a dedicated Huffman encoder and an optimized DCT coder, all in ASIC
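The scaled figures used for the ASIC data point can be cross-checked with the sketch below. It assumes constant pixel throughput when scaling the frame time and an ideal linear shrink when scaling the die; both are simplifications, and the function names are illustrative.

```python
BLOCKS_PER_FRAME = 14400


def scaled_frame_time_ms(fps=30, ref_pixels=1024 * 1024, pixels=640 * 480):
    """Frame time in msec if throughput in pixels/sec stays constant (an assumption)."""
    return 1000.0 / fps * pixels / ref_pixels


def huffman_memory_energy_mj(accesses_per_block=185, pj_per_access=7 + 5):
    """Memory (7 pJ) plus address generator (5 pJ) energy per frame, in mJ."""
    return accesses_per_block * pj_per_access * BLOCKS_PER_FRAME * 1e-9


def scaled_die_mm(width_mm=12.0, height_mm=14.0, from_um=2.0, to_um=0.25):
    """Die dimensions after an ideal linear shrink with feature size."""
    k = to_um / from_um
    return width_mm * k, height_mm * k
```

With the default arguments these give roughly 9.8 msec per frame, about 0.032 mJ per frame for the Huffman memory accesses, and a 1.5 mm x 1.75 mm die.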
2. DSP
Two parallel TIC30's [5]: each TIC30 is about 33-50 MHz
architecture feature: floating-point DSP
Performance:
2.75 sec/frame
Power:
~ 1W each
Cost:
$ 247.95/processor
Design effort:
1 day for compilation (no performance tuning at the assembly level)
TMS320C40 [6]: about 40-60 MHz
architecture feature: floating-point DSP
Performance:
2.6 sec/frame
Power:
248 mA at 5V -> 3.224 J/frame
Cost:
$160
Design effort:
compilation (no performance tuning at the assembly level) -> 1 day
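The energy-per-frame figures for the processors follow from power times frame time. A minimal sketch, using the SA1100 and TMS320C40 numbers listed above (the function name is illustrative):

```python
def energy_per_frame_j(power_w: float, sec_per_frame: float) -> float:
    """Energy per frame in joules, assuming constant power over the whole frame."""
    return power_w * sec_per_frame


# SA1100: 550 mW typical, 1.12 sec/frame
sa1100_j = energy_per_frame_j(0.550, 1.12)

# TMS320C40: 248 mA at 5 V, 2.6 sec/frame
c40_j = energy_per_frame_j(0.248 * 5.0, 2.6)
```

This reproduces the 0.616 J/frame and 3.224 J/frame entries; it treats the typical power as constant, which is itself an approximation.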
The technology and voltage of some of the processors are not specified. Therefore, we compare the "raw" data in this discussion.
The pure ASIC implementation is about 100 times faster and 1000 times more energy efficient than the microprocessor implementation. Moving part of the algorithm into an ASIC improves speed and energy somewhat, but we run into Amdahl's Law because Huffman coding becomes the bottleneck. Although we could implement Huffman coding in a fine-grain FPGA, the power and timing of bit-level manipulation operations on an FPGA are hard to predict at a high level. Therefore, we decided to keep Huffman coding in software for our estimation.
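The ratios and the Amdahl's Law limit can be checked numerically. The sketch below assumes that the 701.75 msec hybrid figure is the sum of the 146.5 msec dataflow DCT time and the remaining software time; that split is an inference from the numbers listed in the results, not something they state directly.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the run time is accelerated s times."""
    return 1.0 / ((1.0 - p) + p / s)


T_SW = 1.12          # sec/frame, everything in software on the SA1100
T_HYBRID = 0.70175   # sec/frame, DCT on the dataflow ASIC
T_DCT_HW = 0.1465    # sec/frame, dataflow DCT with two pipes
T_ASIC = 0.010       # sec/frame, pure ASIC

t_rest = T_HYBRID - T_DCT_HW    # assumed: quantize + Huffman time left in software
t_dct_sw = T_SW - t_rest        # implied software DCT time
p = t_dct_sw / T_SW             # fraction of the software run spent in the DCT
s = t_dct_sw / T_DCT_HW         # speedup of the DCT alone

overall = amdahl_speedup(p, s)              # equals T_SW / T_HYBRID
limit = amdahl_speedup(p, float("inf"))     # best case with an infinitely fast DCT

speed_ratio = T_SW / T_ASIC                 # ASIC vs. microprocessor, time
energy_ratio = 0.616 / 0.6258e-3            # ASIC vs. microprocessor, energy
```

Under this assumed split, roughly half of the software frame time is DCT, so even an infinitely fast DCT would give only about a 2x overall speedup, while the raw ASIC ratios come out near 112x in time and 984x in energy.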
The DSP implementations seem to have the worst performance and energy efficiency. One
reason is that the code is simply compiled for the processor, whereas many DSPs benefit
greatly from hand-tuned code. Another reason is that floating-point processors are usually
very power hungry. In addition, our estimation takes into account only the pure
computation cost, not the extra control cost (reading in the frame, etc.), while the
DSP implementations in the papers might have taken all of these into account. Therefore,
our estimation is more optimistic.
[1] G. Wallace, "The JPEG Still Picture Compression Standard."
[2] http://developer.intel.com/design/strong/sa1100.htm
[3] Pleiades homepage: http://infopad/research/reconfigurable/
[4] M. Kovac and N. Ranganathan, "JAGUAR: A Fully Pipelined VLSI Architecture for JPEG Image Compression Standard."
[5] N. Altoveros, X. Shi, and J. Waller, "DSP microprocessor implementation of image data compression."
[6] D. C. Chen and R. H. Price, "A Real-Time TMS320C40 Based Parallel System for High Rate Digital Signal Processing."