Estimate of design cost and performance of a JPEG CODEC ASIC.

A Pipelined JPEG CODEC ASIC

EE290A Homework #1
Feb. 2nd, 1999
David Chinnery, Rhett Davis

Description:

The ASIC for which we have estimated performance and cost implements JPEG encoding, with Huffman run-length encoding of the computed, quantized coefficients. A quantization table is input to the ASIC, and a Huffman table based on an image database is used (rather than computing the Huffman table for each image individually). Using a standard Huffman table for all images increases the size of the final compressed JPEG image by about 10%.

According to the JPEG standard [Wallace], coefficients are quantized by dividing each coefficient in the 8x8 DCT coefficient block by the corresponding value in an 8x8 quantization table, which reduces the bit allocation. In the hardware implementation this is simplified: the quantization table gives the number of bits used to represent each coefficient, and the coefficient is right shifted to reduce it to that many bits. The largest possible DCT coefficient is 65280 (256x255), requiring sixteen bits to represent the magnitude, plus a sign bit, as coefficients may be positive or negative [Keating].
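The shift-based quantization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the coefficient is treated as sign-magnitude (so only the magnitude is shifted), and the shift amount is taken to be sixteen minus the number of bits given by the quantization table. The report does not specify these details.

```python
COEFF_MAGNITUDE_BITS = 16  # from the text: 16 magnitude bits plus a sign bit


def quantize_by_shift(coeff: int, kept_bits: int) -> int:
    """Reduce a signed coefficient to `kept_bits` magnitude bits by right shifting.

    Assumption: sign-magnitude handling, shift = 16 - kept_bits.
    """
    if not 0 <= kept_bits <= COEFF_MAGNITUDE_BITS:
        raise ValueError("kept_bits must be between 0 and 16")
    shift = COEFF_MAGNITUDE_BITS - kept_bits
    magnitude = abs(coeff) >> shift  # discard the low-order bits
    return -magnitude if coeff < 0 else magnitude


def dequantize_by_shift(quantized: int, kept_bits: int) -> int:
    """Approximate inverse: left shift the magnitude back to 16-bit scale."""
    shift = COEFF_MAGNITUDE_BITS - kept_bits
    magnitude = abs(quantized) << shift
    return -magnitude if quantized < 0 else magnitude


# The largest coefficient quoted in the text, kept to 8 bits.
print(quantize_by_shift(65280, 8))   # 255
print(quantize_by_shift(-1000, 10))  # -15
```

Shifting is equivalent to dividing by a power of two, so this scheme restricts the effective quantization steps to powers of two in exchange for removing the divider.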

The DCT algorithm implemented requires 22 multiplies and 30 additions for each one-dimensional 8-point DCT [Richardson and Riley], and row-column decomposition is used to compute the two-dimensional DCT. This is pipelined, so that each three-stage pipelined multiplier is in constant use while processing the image. A diagram of the implementation is shown below. The multipliers are 16 bit, as required for accuracy: the values have an eight-bit integer part and a four-bit fractional part, and the input values at each stage are multiplied by one of seven different constants, which range between 0.19 and 0.98 and are represented to 16 binary places.

The incoming data stream (assumed to be 8x8 blocks of 24-bit RGB pixels) is split into its three eight-bit color components, each of which goes through the above pipeline. The final 8x8 DCT coefficients are quantized and output. This is shown below.
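Independently of the diagrams, the row-column decomposition can be illustrated in software. The sketch below is not the 22-multiply fast algorithm used in the hardware: it uses the direct textbook DCT-II with orthonormal scaling (an assumption, so the coefficient range differs from the 65280 quoted earlier), and the function names are hypothetical. It only shows how a 2D DCT is built from 1D DCTs over rows and then columns.

```python
import math

N = 8  # block size


def dct_1d(x):
    """Direct (unoptimized) orthonormal 8-point DCT-II."""
    out = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        acc = sum(
            x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
            for n in range(N)
        )
        out.append(scale * acc)
    return out


def dct_2d(block):
    """Row-column decomposition: 1D DCT on every row, then on every column."""
    rows = [dct_1d(row) for row in block]              # pass 1: rows
    cols = [dct_1d(list(col)) for col in zip(*rows)]   # pass 2: columns
    return [list(r) for r in zip(*cols)]               # transpose back


# A constant block has all of its energy in the DC coefficient.
flat = [[255] * N for _ in range(N)]
coeffs = dct_2d(flat)
print(round(coeffs[0][0], 6))
```

In the hardware the intermediate result between the two passes is what the double-buffered register files hold.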

The inverse two dimensional DCT uses the same pipeline, just requiring right shifting of the input 8x8 block of coefficients and the output values, to perform the inverse DCT (this can be done as the 1D DCT and inverse 1D DCT differ in the scaling factor of the DC coefficient and a factor of 16 [Lim]).

Power estimates are accurate to within about a factor of two, while the speed and area estimates are accurate to within about 20%.

Performance:

The following parameters were assumed in calculating the performance.

0.25 um technology provided by ST
25 MHz clock rate
1.0 V supply voltage
Custom multipliers and triple-ported register files
Standard cells for the rest (using a library provided by ST)

We have also assumed that the decoding is not significantly different from the encoding, and that the same results hold for each case.

Since the register files between the 1D DCT calculations are "double-buffered", we can assume that the JPEG encoding is pipelined at the level of a 1D DCT. Likewise, if we assume that the control logic for the 1D DCT is intelligent enough not to overwrite values from the previous operation (add/mult/add), then we can assume that the DCT is "pipelined" for each operation. Furthermore, if we assume that the multiplier is the critical resource in the pipeline, then the number of multiplications determines the "cycle time" of this pipeline. Lastly, the multiplier itself is pipelined with a depth of 3 and operates at 25 MHz. From these assumptions, we can calculate the total encode/decode time as follows:

                  Cycle Time  Pipeline Depth  Iterations                   Total time
Multiply          40 ns       3               22 (multiplies per 1D DCT)   960 ns
1D DCT            960 ns      3               8 (1D DCTs per block)        9600 ns
Block Processing  9600 ns     2               4800 (blocks per frame)      0.046 s

Note that the total time is the cycle time multiplied by (the number of iterations plus the pipeline depth minus one); for example, the multiply row gives 40 ns x (22 + 3 - 1) = 960 ns. Note also that we have ignored the parser and quantizer/coder blocks, because we have assumed that they add only a few cycles to the latency of the outermost pipeline. This corresponds to less than a microsecond and is insignificant.
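The nested timing calculation can be reproduced with a short script. This is a minimal sketch using only the figures from the table; the function and variable names are hypothetical.

```python
def pipeline_time(cycle_time_s: float, depth: int, iterations: int) -> float:
    """Total time = cycle time x (iterations + pipeline depth - 1)."""
    return cycle_time_s * (iterations + depth - 1)


CLOCK_HZ = 25e6  # 25 MHz clock, i.e. a 40 ns cycle

# Each level's total time becomes the cycle time of the next level up.
multiply_s = pipeline_time(1 / CLOCK_HZ, depth=3, iterations=22)
dct_1d_s = pipeline_time(multiply_s, depth=3, iterations=8)
frame_s = pipeline_time(dct_1d_s, depth=2, iterations=4800)

print(f"1D DCT cycle: {multiply_s * 1e9:.0f} ns")
print(f"block cycle:  {dct_1d_s * 1e9:.0f} ns")
print(f"frame time:   {frame_s:.3f} s")
```

At roughly 0.046 s per frame, the estimate corresponds to a little over 21 frames per second.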

Power:

Power estimates were performed on a block-by-block basis. Power estimates for the multiplier are based on numbers from the Maia chip to be fabricated in the next month [Wan]. All other power estimates are based on the clock switching energy for a D-flip-flop. The flip-flop under consideration has a total gate area of 3.26 um² switched by the clock. With a Cox of 6.52 fF/um² and a supply voltage of 1.0V, we can calculate the switching energy to be 21.3 fJ.
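As a check on the arithmetic, the switching energy can be recomputed from the quoted gate area, Cox and supply voltage. The sketch assumes the energy is taken as E = C x V^2 (not one half of that), since this is the form that reproduces the 21.3 fJ figure; the names are hypothetical.

```python
GATE_AREA_UM2 = 3.26      # clocked gate area of one D-flip-flop, um^2
COX_FF_PER_UM2 = 6.52     # gate capacitance per unit area, fF/um^2
VDD_V = 1.0               # supply voltage, V


def switching_energy_fj(area_um2: float, cox_ff_per_um2: float, vdd_v: float) -> float:
    """Clock switching energy in fJ, assuming E = C * V^2."""
    capacitance_ff = area_um2 * cox_ff_per_um2
    return capacitance_ff * vdd_v ** 2


energy_fj = switching_energy_fj(GATE_AREA_UM2, COX_FF_PER_UM2, VDD_V)
print(f"{energy_fj:.1f} fJ")
```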

Based on this number, the power estimates for the register files (which are based on arrays of D-flip-flops) are calculated as twice the switching energy for one word times 25 MHz. Power estimates for the other logic are based on a function of the number of state flip-flops assumed for each functional block. It was assumed that 9 state flip-flops would be needed for the 2D DCT control block, along with 20 flip-flops for the Parser and Quantizer/Coder. From here, the total power was estimated as six times the switching energy of the flip-flops times 25 MHz.
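The two power rules in this paragraph can be written out directly. The sketch below uses the unrounded flip-flop energy (3.26 um² x 6.52 fF/um² x 1 V², about 21.26 fJ), which is an assumption about how the tabulated values were produced; the function names are hypothetical.

```python
E_FF_J = 3.26 * 6.52 * 1.0 ** 2 * 1e-15  # switching energy per flip-flop, J
CLOCK_HZ = 25e6                           # 25 MHz clock


def register_file_power_w(bits_per_word: int) -> float:
    """Twice the switching energy of one word, times the clock rate."""
    return 2 * bits_per_word * E_FF_J * CLOCK_HZ


def logic_power_w(state_flip_flops: int) -> float:
    """Six times the switching energy of the state flip-flops, times the clock rate."""
    return 6 * state_flip_flops * E_FF_J * CLOCK_HZ


for bits in (8, 11, 12, 15, 16):
    print(f"{bits:2d}-bit register file: {register_file_power_w(bits) * 1e6:.1f} uW")
print(f"2D DCT control (9 f-f):         {logic_power_w(9) * 1e6:.1f} uW")
print(f"Parser and Quant./Coder (20 f-f): {logic_power_w(20) * 1e6:.1f} uW")
```

Note that the register file power depends only on the word width, not on the number of words, because only one word is assumed to switch per access.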

The table below lists the power for each block and for the chip as a whole. The total energy per frame is the total power times the calculation time: 483 uW x 0.046 s = 22.2 uJ.

Since gated clocks are used, the static power is limited to the input clock loading, which is assumed to be 25 MHz times the switching energy for 10 D-flip-flops. This gives a static power of 5.3 uW.

Gated clocks (adds to design time)
Supply voltage (1.0 V)
Energy/Frame (22.2 uJ)
Minimal static power (100 transistors tied to clock line, 5.3 uW)
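The energy per frame and the static power quoted above follow from two one-line calculations. A minimal sketch, taking the total power and frame time as given in the report and using hypothetical names:

```python
TOTAL_POWER_W = 483e-6                    # total chip power from the table
FRAME_TIME_S = 0.046                      # calculation time per frame
E_FF_J = 3.26 * 6.52 * 1.0 ** 2 * 1e-15  # switching energy per flip-flop, J
CLOCK_HZ = 25e6                           # 25 MHz clock
CLOCK_LOAD_FLIP_FLOPS = 10                # assumed input clock loading

energy_per_frame_j = TOTAL_POWER_W * FRAME_TIME_S
static_power_w = CLOCK_HZ * CLOCK_LOAD_FLIP_FLOPS * E_FF_J

print(f"energy per frame: {energy_per_frame_j * 1e6:.1f} uJ")
print(f"static power:     {static_power_w * 1e6:.1f} uW")
```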

Cost:

Area estimates for the register files were based on existing layout. Length was calculated as (16 um x words + 60 um), and width was calculated as (16 um x bits + 40 um). Area for multipliers is based on an existing layout for a 16-bit multiplier. Area for other logic is based on a function of the number of state flip-flops assumed for each functional block. It was assumed that 9 state flip-flops would be needed for the 2D DCT control block, along with 20 flip-flops for the Parser and Quantizer/Coder. From here, the total area was estimated to be 6 times the area required for the flip-flops alone.

The area for the 2D-DCT block was assumed to be twice the sum of the areas of the sub-blocks (to account for routing area). Likewise, the area for the entire chip was assumed to be twice the sum of the 2D-DCT and other blocks' areas.
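The register file and 2D-DCT block areas can be recomputed as a cross-check. In this sketch the fixed offsets are taken as 60 um on the word dimension and 40 um on the bit dimension, which is the assignment that reproduces the tabulated areas; the multiplier area (0.155 mm²) is copied from the table, and all names are hypothetical.

```python
FF_AREA_MM2 = 0.000216        # area per flip-flop, from the table
MULTIPLIER_AREA_MM2 = 0.155   # multiplier area, taken from the table as-is

# (words, bits) of the ten register files in the 2D-DCT block
REGISTER_FILES = [
    (64, 8), (64, 8), (16, 11), (22, 16), (64, 12),
    (64, 12), (16, 15), (22, 16), (64, 16), (64, 16),
]


def register_file_area_mm2(words: int, bits: int) -> float:
    """Area from the layout rule: (16*words + 60) um by (16*bits + 40) um."""
    length_um = 16 * words + 60
    width_um = 16 * bits + 40
    return length_um * width_um * 1e-6  # um^2 -> mm^2


def logic_area_mm2(state_flip_flops: int) -> float:
    """Six times the area of the state flip-flops alone."""
    return 6 * state_flip_flops * FF_AREA_MM2


sub_block_area = (
    sum(register_file_area_mm2(w, b) for w, b in REGISTER_FILES)
    + MULTIPLIER_AREA_MM2
    + logic_area_mm2(9)
)
dct_block_area = 2 * sub_block_area  # factor of two accounts for routing

print(f"64x8 register file: {register_file_area_mm2(64, 8):.2f} mm^2")
print(f"2D-DCT block:       {dct_block_area:.2f} mm^2")
```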

Block         Sub-block              words                 bits                   area (mm²)   power (uW)

2D-DCT Block  Register Files         64                    8                      0.18         8.5
                                     64                    8                      0.18         8.5
                                     16                    11                     0.068        11.7
                                     22                    16                     0.12         17
                                     64                    12                     0.25         12.8
                                     64                    12                     0.25         12.8
                                     16                    15                     0.088        16
                                     22                    16                     0.12         17
                                     64                    16                     0.32         17
                                     64                    16                     0.32         17
                                     area per mult. (mm²)  energy per mult. (pJ)
              Multipliers            0.075                 16                     0.155        80
                                     area per f-f (mm²)    energy per f-f (pJ)
              Control Logic          0.000216              0.0213                 0.0117       28.7
                                                           block area:            4.15

Rest of Chip                         words                 bits
              Register File          64                    4                      0.113        4.25
                                     area per f-f (mm²)    energy per f-f (pJ)
              Parser & Quant./Coder  0.000216              0.0213                 0.00432      63.8

                                                           Total for Chip:        24.9         483

Design Effort:

Estimates are based on a previous design effort with a similar approach.

Specification (3 people, 1 month)
Multiplier design (1 person, 3 months)
Register file design (1 person, 3 months)
Logic/VHDL design (1 person, 1 month)
Gated clocking system design (1 person, 2 months)
Component assembly and verification (3 people, 1 month)

Total design time and work force: 3 people, ~5 months

Possible Improvements:

The area used for the multipliers can be reduced by not computing to more than six binary places, which would not change the final computed coefficients significantly (+/- 0.5).
There are faster algorithms for the two-dimensional DCT transform, but these are not easily pipelined [Lim].

References:

Marlene Wan, Energy Estimates for the Maia chip, http://infopad.EECS.Berkeley.EDU/~marlene/Research/Maia/Sats.html
Gregory K. Wallace, "The JPEG Still Picture Compression Standard," IEEE Trans. Consumer Electronics, 1991.
Iain G. Richardson and Martyn J. Riley, "Digital Video Coding: Coding of Still Images," 1996.
Jae S. Lim, Two-Dimensional Signal and Image Processing, Prentice Hall, 1990.
Sean Keating, "Towards Hardware Implementation of the 2D-DCT Image Processing Algorithm," 1997.
