Homework #1

Estimation of the JPEG Implementation

Ning Zhang & Marlene Wan

I. Introduction

We use the JavaTime implementation of the DCT and Huffman coding from the JPEG standard [1] to estimate the performance of software running on a RISC processor (StrongARM) [2]. We then move the computation-intensive part (the DCT) to a configurable ASIC platform [3] to get another data point. These results are also compared with a pure ASIC implementation [4] and two other software implementations on DSPs [5][6].

II. Our Implementation

Algorithm: baseline JPEG encoding of a 640x480 color image (14400 8x8 single-component images)
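
The block count quoted above can be checked with a short sketch. It assumes three full-resolution color components (no chroma subsampling), which is what reproduces the figure of 14400; the constant names are illustrative only.

```python
# Count the 8x8 single-component blocks in one frame.
WIDTH, HEIGHT = 640, 480   # frame size from the text
COMPONENTS = 3             # assumption: three components, no subsampling
BLOCK = 8                  # block edge length in pixels

blocks_per_component = (WIDTH // BLOCK) * (HEIGHT // BLOCK)
blocks_per_frame = blocks_per_component * COMPONENTS
print(blocks_per_component, blocks_per_frame)
```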

Algorithm breakdown:

For each 8x8 single-component image, we break down the major operations (memory read and write, ALU, and multiply) as follows. On the StrongARM, each operation takes one cycle except multiplication; the exact cycle count of each multiplication is noted below.

1. DCT:

1.1 Software implementation (JavaTime implementation)

computation: 8x8x8x(2 r, 1 mpy, 2 alu) + 8x8 w                                   (1 cycle mpy)
index: 8x8x8 alu + 8x8 alu + 8 alu
address: 8x8x8x(2 alu) + 8x8 alu
computation: 8x8x8x(2 r , 1 mpy, 1 alu) + 8x8 w + (8x8 shift)            (2 cycle mpy)
index: 8x8x8 alu + 8x8 alu + 8 alu
address: 8x8x8x(2 alu) + 8x8 alu
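
As a rough check, the operation counts above can be tallied into cycles per block. This is only a sketch under stated assumptions: the two listed groups are treated as two passes whose costs add, every read, write, ALU operation and shift costs one cycle, and pipeline stalls and cache misses are ignored. The function name and parameters are hypothetical.

```python
def dct_pass_cycles(mpy_cycles, alu_per_mac, extra_shifts=0):
    """Cycles for one listed group of the software DCT on an 8x8 block."""
    macs = 8 * 8 * 8                      # multiply-accumulate steps
    # 2 reads + 1 multiply + ALU ops per step, then 8x8 writes (+ shifts)
    computation = macs * (2 + mpy_cycles + alu_per_mac) + 8 * 8 + extra_shifts
    index = 8 * 8 * 8 + 8 * 8 + 8         # loop index updates
    address = macs * 2 + 8 * 8            # address calculations
    return computation + index + address

first = dct_pass_cycles(mpy_cycles=1, alu_per_mac=2)
second = dct_pass_cycles(mpy_cycles=2, alu_per_mac=1, extra_shifts=8 * 8)
total = first + second                    # assumption: the two groups add

BLOCKS = 14400
CLOCK_HZ = 220e6                          # SA1100 clock from the Result section
print(first, second, total, total * BLOCKS / CLOCK_HZ)
```

Under these assumptions the DCT accounts for roughly half of the 1.12 sec/frame software figure reported in the Result section.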

1.2 Dataflow implementation of the same computation (based on Pleiades [3] performance and energy numbers):

Performance: bottleneck -> mpy -> 20 ns
                        512x20ns + 512x20ns = 20.48 usec per block
                                            -> 293 msec for a frame
                        assuming 2 parallel pipes -> 293/2 = 146.5 msec a frame
Energy:           mpy: 16 pJ
                        mem: 7 pJ
                        agp: 5 pJ
                        alu: 7 pJ
                        stage 1:
                                24064 pJ + 768 pJ = 24832 pJ
                        stage 2:
                                24064 pJ + 768 pJ = 24832 pJ
                        -> 49664 pJ
                        -> 0.715 mJ/frame
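
The timing and energy arithmetic for the dataflow DCT can be sketched as follows. The way the per-operation energies are combined is an assumption (each memory access is charged one `mem` plus one `agp`, and each multiply-accumulate step is 2 reads, 1 multiply and 1 ALU operation); it was chosen because it reproduces the 49664 pJ per block and 0.715 mJ/frame totals. Note that the straight product for the frame time comes out at about 294.9 msec, slightly above the 293 msec quoted.

```python
MPY_NS = 20.0                 # multiplier is the bottleneck
MACS_PER_STAGE = 8 * 8 * 8    # 512 multiply steps per stage
STAGES = 2
BLOCKS = 14400

block_us = STAGES * MACS_PER_STAGE * MPY_NS / 1000.0
frame_ms = block_us * BLOCKS / 1000.0
frame_ms_two_pipes = frame_ms / 2          # two parallel pipes

ENERGY_PJ = {"mpy": 16, "mem": 7, "agp": 5, "alu": 7}
# Assumption: every memory access also uses the address generator.
access_pj = ENERGY_PJ["mem"] + ENERGY_PJ["agp"]
mac_pj = ENERGY_PJ["mpy"] + 2 * access_pj + ENERGY_PJ["alu"]
stage_pj = MACS_PER_STAGE * mac_pj + 8 * 8 * access_pj   # plus 64 writes
block_pj = STAGES * stage_pj
frame_mj = block_pj * BLOCKS * 1e-9        # pJ -> mJ

print(block_us, frame_ms, frame_ms_two_pipes)
print(stage_pj, block_pj, frame_mj)
```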

2. Quantize:

8x8x(2 r, 1 mpy, 1 shift, 1 w)            (3 cycle mpy)
(assuming that the inverse of each quantization table entry is stored in memory)
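
A minimal sketch of quantization by stored reciprocals, matching the read, multiply, shift, write pattern above. The 16-bit fixed-point scale, the rounding offset, and the function names are assumptions made for illustration; the text does not specify them.

```python
SHIFT = 16  # assumption: reciprocals are stored as Q16 fixed-point values

def make_reciprocal_table(qtable):
    """Precompute round(2^SHIFT / q) for every quantization table entry."""
    return [round((1 << SHIFT) / q) for q in qtable]

def quantize(coeffs, recip):
    """Multiply by the stored reciprocal, then shift back down.

    Adding half of the scale before shifting rounds to nearest
    instead of truncating toward minus infinity.
    """
    half = 1 << (SHIFT - 1)
    return [(c * r + half) >> SHIFT for c, r in zip(coeffs, recip)]

recip = make_reciprocal_table([16, 11, 10])
print(quantize([100, -45, 7], recip))
```

Replacing the division with a multiply and a shift is what makes the per-coefficient cost a fixed handful of cycles on a processor without a hardware divider.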

3. Zig-zag and Huffman encoding:

Read zig-zag table stored in memory:

8x8x(2 r + 1 alu)
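
The table-driven zig-zag read can be sketched as below: one read fetches the table entry, a second read fetches the coefficient it points to. Building the table by sorting on anti-diagonals is just a compact way to generate it for this example; an implementation like the one estimated here would store the 64 entries as constants.

```python
def zigzag_table(n=8):
    """Return the zig-zag scan order as a list of row-major indices."""
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Walk anti-diagonals; odd diagonals run top to bottom, even ones bottom to top.
    cells.sort(key=lambda rc: (rc[0] + rc[1],
                               rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))
    return [r * n + c for r, c in cells]

ZIGZAG = zigzag_table()

def reorder(block):
    """Read 64 coefficients in zig-zag order (table read + data read each)."""
    return [block[ZIGZAG[k]] for k in range(64)]

print(ZIGZAG[:10])
```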

DC encoding:

1 alu (differential)
1 cmp (MSB (16th) ?= 1 or 0)

max of 16, average of 8 x (1 shift, 1 cmp) to get the category number and rest of the code

                           1 r     (index into the category table to get the base code)
                           1 cmp (MSB (10) ?= 1 or 0)

max of 8, average of 4 x (1 shift, 1 cmp)

                           1 shift , 1 alu, 1 write to get the final code
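
The DC steps above can be illustrated with a small sketch: take the difference from the previous DC value, find its category with a shift-and-compare loop, and form the extra bits. The amplitude convention (subtract one from negative values, keep the low bits) is the baseline JPEG one. The lookup of the Huffman base code for the category is left out, and the function names are hypothetical.

```python
def category(value):
    """Number of bits needed for |value|, found by repeated shift and compare."""
    magnitude = abs(value)
    size = 0
    while magnitude:          # one shift and one compare per iteration
        magnitude >>= 1
        size += 1
    return size

def amplitude_bits(value, size):
    """Extra bits appended after the Huffman base code."""
    if value < 0:
        value -= 1            # negative values use the ones-complement form
    return value & ((1 << size) - 1)

def encode_dc(dc, prev_dc):
    diff = dc - prev_dc       # differential coding
    size = category(diff)
    return size, amplitude_bits(diff, size)

print(encode_dc(5, 0), encode_dc(-5, 0), encode_dc(7, 7))
```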

AC encoding:
(8x8-1)x{
                1 cmp, 1 alu        (counting of zeros)
                30% {                 (assuming 30% of the quantized AC coefficients are nonzero)

       Table lookup:

1 cmp (MSB (16th) ?= 1 or 0)

max of 8, average of 4 x (1 shift, 1 cmp) to get the category number and rest of the code

1 r     (use the category number and the number of zeros to get the base code)
1 cmp (MSB (17) ?= 1 or 0)

max of 15, average of 7.5 x (1 shift, 1 cmp)

1 shift, 1 alu, 1 write to concatenate the two codes and get the final code

        Pack (assuming two indices: array index and bit index):

2 alu               (update array and bit index)
(if an array-element boundary has been crossed, which we assume occurs 30% of the time, the following operations are performed twice)
1 shift             (shift according to bit index)
1 alu               (concatenate)
}

                }
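
The AC loop above can be sketched in two parts: turning the 63 AC coefficients into (run, size, amplitude) symbols, and packing variable-length codes into a byte array using an array position and a bit index. This is a simplified illustration, not the estimated implementation: the Huffman table lookup is omitted, the packer writes one bit at a time rather than using word-level shifts, and JPEG byte stuffing after 0xFF is not handled. All names are hypothetical.

```python
def ac_symbols(ac):
    """Turn zig-zag-ordered AC coefficients into (run, size, amplitude) symbols."""
    symbols = []
    run = 0
    for coef in ac:
        if coef == 0:
            run += 1                       # counting of zeros
            continue
        while run > 15:
            symbols.append((15, 0, 0))     # ZRL symbol: a run of sixteen zeros
            run -= 16
        size = abs(coef).bit_length()      # category number
        amp = (coef - 1 if coef < 0 else coef) & ((1 << size) - 1)
        symbols.append((run, size, amp))
        run = 0
    if run:
        symbols.append((0, 0, 0))          # EOB: only zeros remain
    return symbols

class BitPacker:
    """Append variable-length codes to a byte array, most significant bit first."""
    def __init__(self):
        self.data = bytearray()
        self.bit_index = 0                 # bits already used in the last byte

    def put(self, code, length):
        for shift in range(length - 1, -1, -1):
            if self.bit_index == 0:
                self.data.append(0)        # crossed a byte boundary
            bit = (code >> shift) & 1
            self.data[-1] |= bit << (7 - self.bit_index)
            self.bit_index = (self.bit_index + 1) % 8

packer = BitPacker()
packer.put(0b101, 3)
packer.put(0b11111, 5)
print(ac_symbols([3, 0, 0, -2] + [0] * 59), packer.data.hex())
```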

Result:
        SA1100 [2]: 220 MHz, typical power dissipation is 550mW
                 Running 1.1, 2, 3 on SA1100, using C as the source code and conventional C compilers
               Performance:
                        1.12 sec/frame
               Energy:
                        0.616 J/frame
               Cost:
                        $39 each for 10,000 units
               Design effort:
                        after the code is done (the fixed-point calculation code is tricky: 1 week), 2-3 more days of
                        compilation (easy) and debugging.
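
The SA1100 energy figure follows from the typical power and the frame time, and the same numbers give the cycle budget per block. The sketch below assumes the processor draws its typical power for the whole frame time.

```python
CLOCK_HZ = 220e6     # SA1100 clock
POWER_W = 0.550      # typical power dissipation
FRAME_S = 1.12       # estimated time per frame
BLOCKS = 14400       # 8x8 single-component blocks per frame

energy_j = POWER_W * FRAME_S               # energy per frame
cycles_per_frame = CLOCK_HZ * FRAME_S
cycles_per_block = cycles_per_frame / BLOCKS

print(energy_j, cycles_per_frame, cycles_per_block)
```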

        SA1100 + dataflow:
                Running 1.2 in dataflow ASIC, 2 and 3 on SA1100 ->
               Performance:
                            701.75 msec/frame
               Energy:
                            0.305715 J/frame
               Cost:
                            StrongARM cost + fabrication of the dataflow ASIC component
               Design effort:
                            on top of the time spent designing code for the StrongARM, assuming the architecture and library of dataflow ASIC
                            modules already exist:
                            2 weeks - 1 month of assembly and debugging.

III. Other Implementations from the Literature:

1. ASIC
        JAGUAR [4]:
        Performance:
                      30 frames/sec for 1024x1024 images -> about 10 msec per 640x480 frame
        Energy:
                      Assume the DCT has the same energy as the Pleiades dataflow ASIC implementation (although it should be less)
                      0.626 mJ/frame
                      In the Huffman coding stage:
                      assuming that memory access is the dominant factor:
                      185 mem accesses (based on the description above)
                      Using the energy consumption of the Pleiades memory and address generator: 185x12pJx14400 = 0.032 mJ/frame

                      total -> 0.658 mJ/frame
        Cost:
                      Design cost + fab cost of 12mmx14mm in 2um -> 1.5x1.75mm in 0.25 um
        Design effort:
                      9-28 man-months for a dedicated Huffman encoder and optimized DCT coder, all in ASIC
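
The JAGUAR frame time and Huffman-stage energy can be reproduced with the sketch below. It assumes that frame time scales linearly with pixel count and that each memory access costs one Pleiades memory access plus one address generation (7 pJ + 5 pJ = 12 pJ).

```python
# Scale the published 1024x1024 frame rate down to a 640x480 frame.
frame_1024_ms = 1000.0 / 30                      # 30 frames/sec
scale = (640 * 480) / (1024 * 1024)              # assumption: linear in pixels
frame_vga_ms = frame_1024_ms * scale

# Huffman stage energy, dominated by memory accesses.
ACCESSES_PER_BLOCK = 185
ACCESS_PJ = 7 + 5                                # mem + agp
BLOCKS = 14400
huffman_mj = ACCESSES_PER_BLOCK * ACCESS_PJ * BLOCKS * 1e-9   # pJ -> mJ

print(frame_vga_ms, huffman_mj)
```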

2. DSP
        Two parallel TIC30's [5]: each TIC30 is about 33-50 MHz
                architecture feature: floating-point DSP
            Performance:
                        2.75 sec/frame
            Power:
                        ~ 1W each
            Cost:
                        $ 247.95/processor
            Design effort:
                        1 day for compilation (no performance tuning in assembly level)

        TMS320C40 [6]: about 40 - 60 MHz
                architecture feature: floating-point DSP
               Performance:
                            2.6 sec/frame
               Power:
                            248 mA at 5V -> 3.224 J/frame
               Cost:
                            $160
               Design effort:
                            compilation (no performance tuning in assembly level) -> 1 day
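
The DSP energy per frame is power times frame time. The TMS320C40 figure below follows directly from the numbers above; the figure for the two TIC30s is an extra illustration that assumes both processors draw about 1 W for the whole frame time.

```python
# TMS320C40: supply current and voltage give the power.
c40_power_w = 0.248 * 5.0            # 248 mA at 5 V
c40_energy_j = c40_power_w * 2.6     # 2.6 sec/frame

# Two parallel TIC30s (assumption: both active at ~1 W for the full frame).
c30_energy_j = 2 * 1.0 * 2.75        # 2.75 sec/frame

print(c40_power_w, c40_energy_j, c30_energy_j)
```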

IV. Discussion

The technology and voltage of some of the processors are not specified. Therefore, we compare the "raw" data in this discussion.

The pure ASIC implementation is about 100 times faster and about 1000 times more energy efficient than the microprocessor implementation. Moving part of the algorithm into an ASIC improves speed and energy somewhat, but we run into Amdahl's Law because Huffman coding becomes our bottleneck. Although we could implement Huffman coding in a fine-grain FPGA, the power and timing of bit-level manipulation operations on an FPGA are hard to predict at a high level. Therefore, we decided to keep Huffman coding in software for our estimation.
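
The Amdahl's Law effect can be made concrete with the frame times from Section II. This sketch assumes the DCT and the remaining software stages run strictly one after the other with no overlap; the split of the software time into a DCT part and a remainder is derived from the reported totals, not measured.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the work is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

T_SOFTWARE = 1.12      # sec/frame, everything on the SA1100
T_HYBRID = 0.70175     # sec/frame, DCT on the dataflow ASIC
T_DCT_ASIC = 0.1465    # sec/frame, dataflow DCT with two pipes

t_rest = T_HYBRID - T_DCT_ASIC        # quantize + Huffman, still in software
t_dct_sw = T_SOFTWARE - t_rest        # implied software DCT time
fraction = t_dct_sw / T_SOFTWARE      # share of the work that was accelerated
factor = t_dct_sw / T_DCT_ASIC        # speedup of the DCT alone

overall = amdahl_speedup(fraction, factor)
limit = 1.0 / (1.0 - fraction)        # bound even with an infinitely fast DCT

print(fraction, factor, overall, limit)
```

Under these assumptions the DCT is about half of the software run time, so even a free DCT could only roughly double the frame rate, which is why the Huffman stage dominates after the move.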

DSP implementations seem to have the worst performance and energy efficiency. One reason is that the code is simply compiled for the processor, whereas many DSPs benefit substantially from hand-tuned code. Another reason is that floating-point processors are usually very power hungry. In addition, our estimation takes into account only the pure computation cost, not the extra control cost (reading in the frame, etc.), while the DSP implementations in the papers might have taken all of these into account. Therefore, our estimation is more optimistic.

V. References

[1] G. Wallace, "The JPEG Still Picture Compression Standard".
[2] http://developer.intel.com/design/strong/sa1100.htm
[3] Pleiades homepage: http://infopad/research/reconfigurable/
[4] M. Kovac and N. Ranganathan, "JAGUAR: A Fully Pipelined VLSI Architecture for JPEG Image Compression Standard."
[5] N. Altoveros, X. Shi, and J. Waller, "DSP microprocessor implementation of image data compression."
[6] D. C. Chen and R. H. Price, "A Real-Time TMS320C40 Based Parallel System for High Rate Digital Signal Processing."