A Pipelined JPEG CODEC ASIC

EE290A Homework #1
Feb. 2nd, 1999
David Chinnery, Rhett Davis


Description:

The ASIC for which we have estimated performance and costs implements JPEG encoding with Huffman run-length encoding of the computed, quantized coefficients. A quantization table is input to the ASIC, and a Huffman table based on an image database is used (rather than computing the Huffman table for each image individually). Using a standard Huffman table for all images increases the size of the final compressed JPEG image by about 10%.

According to the JPEG standard [Wallace], coefficients are quantized by dividing each coefficient in the 8x8 DCT coefficient block by the corresponding value in an 8x8 quantization table, which reduces the bit allocation. In the hardware implementation this is simplified: the quantization table gives the number of bits used to represent each coefficient, and the coefficient is right shifted to reduce it to that many bits. The largest possible DCT coefficient is 65280 (256x255), which requires sixteen bits for the magnitude plus a sign bit, as coefficients may be positive or negative [Keating].
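The shift-based quantization described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the shift amount is derived, so the sketch assumes a sign-magnitude coefficient with a 16-bit magnitude and shifts right by (16 - kept bits). The names quantize_by_shift and dequantize_by_shift are hypothetical.

```python
MAG_BITS = 16  # magnitude width quoted in the text; the sign is kept separately


def quantize_by_shift(coeff: int, kept_bits: int) -> int:
    """Keep the top `kept_bits` magnitude bits of a coefficient (assumed scheme)."""
    if not 0 <= kept_bits <= MAG_BITS:
        raise ValueError("kept_bits must be between 0 and 16")
    magnitude = abs(coeff)
    if magnitude >= (1 << MAG_BITS):
        raise ValueError("coefficient magnitude does not fit in 16 bits")
    shifted = magnitude >> (MAG_BITS - kept_bits)  # drop the low-order bits
    return -shifted if coeff < 0 else shifted


def dequantize_by_shift(q: int, kept_bits: int) -> int:
    """Approximate inverse: shift the kept bits back to their original weight."""
    shift = MAG_BITS - kept_bits
    return -(abs(q) << shift) if q < 0 else (q << shift)


# The largest coefficient, kept to 8 bits, becomes 255.
largest_q = quantize_by_shift(65280, 8)
```

Shifting replaces the divider that a general quantization table would need, at the cost of restricting the effective quantization steps to powers of two.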

The DCT algorithm implemented requires 22 multiplies and 30 additions for each one-dimensional 8 point DCT [Richardson and Riley], and row-column decomposition is used to compute the two-dimensional DCT. This is pipelined, so that each three-stage pipelined multiplier is in constant use while processing the image. A diagram of the implementation is shown below. The multipliers are 16 bit, which is required for accuracy (eight bit integer with four bit fractional part; the input values at each stage are multiplied by one of seven different constants, which range between 0.19 and 0.98 and are represented to 16 binary places).
 

Diagram of the pipeline for a color.
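The row-column decomposition mentioned above can be sketched in software as follows. This is not the 22-multiply fast algorithm of [Richardson and Riley]: it uses the direct DCT-II sum (64 multiplies per 1D transform) purely to show the data flow. The factor-of-two unnormalized scaling is an assumption, chosen because it reproduces the 65280 (256x255) bound quoted in the description; the function names are hypothetical.

```python
import math

N = 8  # block size


def dct_1d(x):
    """Direct 8-point DCT-II: y[k] = 2 * sum_n x[n] * cos(pi * (2n + 1) * k / (2N))."""
    return [
        2.0 * sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def dct_2d(block):
    """Row-column decomposition: 1D DCT on every row, then on every column."""
    rows = [dct_1d(row) for row in block]              # first pass over rows
    cols = [dct_1d(list(col)) for col in zip(*rows)]   # second pass over columns
    return [list(r) for r in zip(*cols)]               # transpose back to row-major


# A block of all-255 samples puts all of its energy in the DC coefficient.
flat_block = [[255] * N for _ in range(N)]
coeffs = dct_2d(flat_block)
```

With this scaling the DC coefficient of the all-255 block is 16 x 16 x 255 = 65280, and every other coefficient is zero up to floating-point error.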

The stream of data coming in (assumed to be 8x8 24 bit RGB image blocks) is broken up into each eight bit color component, which goes through the above pipeline. The final 8x8 DCT coefficients are quantized and output. This is shown below.
 

Block diagram of implementation.

The inverse two-dimensional DCT uses the same pipeline, requiring only right shifting of the input 8x8 block of coefficients and of the output values (this can be done because the 1D DCT and inverse 1D DCT differ in the scaling factor of the DC coefficient and a factor of 16 [Lim]).

Power estimates are accurate to within about a factor of two, while the speed and area estimates are accurate to within about 20%.
 

Performance:

The following parameters were assumed in calculating the performance. We have also assumed that the decoding is not significantly different from the encoding, and that the same results hold for each case.

Since the register files between the 1D DCT calculations are "double-buffered", we can assume that the JPEG encoding is pipelined at the level of a 1D DCT. Likewise, if we assume that the control logic for the 1D DCT is intelligent enough not to overwrite values from the previous operation (add/mult/add), then we can assume that the DCT is "pipelined" for each operation. Furthermore, if we assume that the multiplier is the critical resource in the pipeline, then the number of multiplications determines the "cycle-time" of this pipeline. Lastly, the multiplier itself is pipelined with a depth of 3 and operates at 25 MHz. From these assumptions, we can calculate the total encode/decode time as follows:
 
Stage             Cycle Time  Pipeline Depth  Iterations                  Total Time
Multiply          40 ns       3               22 (multiplies per 1D DCT)  960 ns
1D DCT            960 ns      3               8 (1D DCTs per block)       9600 ns
Block Processing  9600 ns     2               4800 (blocks per frame)     0.046 s

Note that the total time for each stage is the cycle time multiplied by (iterations + pipeline depth - 1). For example, the multiply stage takes 40 ns x (22 + 3 - 1) = 960 ns. Note also that we have ignored the parser and quantizer/coder blocks, because we have assumed that they add only a few cycles to the latency of the outermost pipeline. This corresponds to less than a microsecond and is insignificant.
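The three rows of the timing table can be reproduced with a short calculation. The helper name pipelined_time is hypothetical; the inputs are the figures given above.

```python
def pipelined_time(cycle_time, depth, iterations):
    """Total time of a pipeline: cycle time x (iterations + depth - 1)."""
    return cycle_time * (iterations + depth - 1)


# Each level's total time becomes the cycle time of the level above it.
multiply_ns = pipelined_time(40, 3, 22)            # 960 ns per 1D DCT
dct_1d_ns = pipelined_time(multiply_ns, 3, 8)      # 9600 ns per block
frame_ns = pipelined_time(dct_1d_ns, 2, 4800)      # 46,089,600 ns per frame
frame_s = frame_ns * 1e-9                          # about 0.046 s
```

At about 0.046 s per frame, the estimate corresponds to roughly 21 frames per second.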
 

Power:

Power estimates were performed on a block-by-block basis.  Power estimates for the multiplier are based on numbers from the Maia chip to be fabricated in the next month [Wan].  All other power estimates are based on the clock switching energy for a D-flip-flop.  The flip-flop under consideration has a total gate area of 3.26 um2 switched by the clock.  With a Cox of 6.52 fF/um2 and a supply voltage of 1.0V, we can calculate the switching energy to be 21.3 fJ.
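The quoted switching energy is consistent with E = (gate area x Cox) x Vdd^2, that is, a full C V^2 per clock cycle. A minimal check, with a hypothetical function name:

```python
def switching_energy_fj(gate_area_um2, cox_ff_per_um2, vdd_v):
    """Clock switching energy in fJ, taken as C * V^2 with C = area * Cox."""
    capacitance_ff = gate_area_um2 * cox_ff_per_um2   # fF
    return capacitance_ff * vdd_v ** 2                # fF * V^2 = fJ


# Values from the text: 3.26 um2 of gate area, 6.52 fF/um2, 1.0 V supply.
e_ff_fj = switching_energy_fj(3.26, 6.52, 1.0)        # about 21.3 fJ
```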

Based on this number, the power estimates for the register files (which are based on arrays of D-flip-flops) are calculated as twice the switching energy for one word times 25 MHz.  Power estimates for the other logic are based on a function of the number of state flip-flops assumed for each functional block.  It was assumed that 9 state flip-flops would be needed for the 2D DCT control block, along with 20 flip-flops for the Parser and Quantizer/Coder.  From here, the total power was estimated as six times the switching energy of the flip-flops times 25 MHz.
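The two power rules above can be written out as a short sketch. The function names are hypothetical; the constants are the 21.3 fJ switching energy and the 25 MHz clock from the text.

```python
E_FF_J = 21.3e-15   # clock switching energy per flip-flop, in joules
F_CLK_HZ = 25e6     # clock frequency


def regfile_power_uw(bits_per_word):
    """Register file: twice the switching energy of one word, times the clock rate."""
    return 2 * bits_per_word * E_FF_J * F_CLK_HZ * 1e6


def logic_power_uw(state_flip_flops):
    """Other logic: six times the flip-flop switching energy, times the clock rate."""
    return 6 * state_flip_flops * E_FF_J * F_CLK_HZ * 1e6


p_regfile_8bit = regfile_power_uw(8)     # about 8.5 uW
p_regfile_16bit = regfile_power_uw(16)   # about 17 uW
p_dct_control = logic_power_uw(9)        # about 28.8 uW
p_parser_coder = logic_power_uw(20)      # about 63.9 uW
```

These values agree with the tabulated power numbers to within rounding.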

The table below lists the power numbers for each block and for the chip as a whole. We can calculate the total energy per frame as the total power times the calculation time: 483 uW x 0.046 s = 22.2 uJ.

Since gated clocks are used, the static power is limited to the input clock loading, which is assumed to be 25 MHz times the switching energy for 10 D-flip-flops. This gives a static power of 5.3 uW.
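The energy-per-frame and static-power figures follow directly from numbers already given. A short check, with hypothetical variable names:

```python
TOTAL_POWER_W = 483e-6   # total chip power from the table
FRAME_TIME_S = 0.046     # calculation time per frame from the Performance section
E_FF_J = 21.3e-15        # switching energy per D-flip-flop
F_CLK_HZ = 25e6          # clock frequency

# Energy per frame = total power x calculation time.
energy_per_frame_uj = TOTAL_POWER_W * FRAME_TIME_S * 1e6   # about 22.2 uJ

# Static power = clock rate x switching energy of 10 flip-flops (input clock loading).
static_power_uw = F_CLK_HZ * 10 * E_FF_J * 1e6             # about 5.3 uW
```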

Cost:

Area estimates for the register files were based on existing layout. Length was calculated as (16 um x words + 60 um), and width was calculated as (16 um x bits + 40 um); for example, a 64-word by 8-bit register file is 1084 um x 168 um, or 0.18 mm2. Area for multipliers is based on an existing layout for a 16-bit multiplier. Area for other logic is based on a function of the number of state flip-flops assumed for each functional block. It was assumed that 9 state flip-flops would be needed for the 2D DCT control block, along with 20 flip-flops for the Parser and Quantizer/Coder. From here, the total area was estimated to be 6 times the area required for the flip-flops alone.

The area for the 2D-DCT block was assumed to be twice the sum of the areas of the sub-blocks (to account for routing area).  Likewise, the area for the entire chip was assumed to be twice the sum of the 2D-DCT and other blocks' areas.
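The area rules can be checked against the table with a short sketch. Two points are assumptions: the offsets are paired as 60 um on the words dimension and 40 um on the bits dimension, because that pairing reproduces the tabulated register file areas, and the multiplier area (0.155 mm2) is taken from the table as given. The function names are hypothetical.

```python
AREA_PER_FF_MM2 = 0.000216   # area per flip-flop, from the table
MULTIPLIER_AREA_MM2 = 0.155  # multiplier area entry, taken from the table as given


def regfile_area_mm2(words, bits):
    """Register file area: (16 um x words + 60 um) x (16 um x bits + 40 um)."""
    return (16 * words + 60) * (16 * bits + 40) / 1e6   # um2 -> mm2


def logic_area_mm2(state_flip_flops):
    """Other logic: six times the area of the state flip-flops alone."""
    return 6 * state_flip_flops * AREA_PER_FF_MM2


# (words, bits) of the ten register files in the 2D-DCT block, in table order.
DCT_REGFILES = [(64, 8), (64, 8), (16, 11), (22, 16), (64, 12),
                (64, 12), (16, 15), (22, 16), (64, 16), (64, 16)]

sub_block_sum = (sum(regfile_area_mm2(w, b) for w, b in DCT_REGFILES)
                 + MULTIPLIER_AREA_MM2
                 + logic_area_mm2(9))
dct_block_area_mm2 = 2 * sub_block_sum   # doubled to account for routing, about 4.15 mm2
```

Using the unrounded register file areas, the doubled sum comes to about 4.15 mm2, matching the block area in the table.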
 
 
Block         Sub-block              words                 bits                   area (mm2)  power (uW)
2D-DCT Block  Register Files         64                    8                      0.18        8.5
                                     64                    8                      0.18        8.5
                                     16                    11                     0.068       11.7
                                     22                    16                     0.12        17
                                     64                    12                     0.25        12.8
                                     64                    12                     0.25        12.8
                                     16                    15                     0.088       16
                                     22                    16                     0.12        17
                                     64                    16                     0.32        17
                                     64                    16                     0.32        17
                                     area per mult. (mm2)  energy per mult. (pJ)
              Multipliers            0.075                 16                     0.155       80
                                     area per f-f (mm2)    energy per f-f (pJ)
              Control Logic          0.000216              0.0213                 0.0117      28.7
              block area:                                                         4.15
                                     words                 bits
Rest of Chip  Register File          64                    4                      0.113       4.25
                                     area per f-f (mm2)    energy per f-f (pJ)
              Parser & Quant./Coder  0.000216              0.0213                 0.00432     63.8
Total for Chip:                                                                   24.9 (mm2)  483 (uW)
 

Design Effort:

Estimates are based on a previous design effort with a similar approach. Total design time and work force: 3 people, ~5 months
 

Possible Improvements:


References: