Niraj Shah
Scott Weber
EE290a Homework #1

Description

We chose to implement a JPEG codec using Tensilica's Xtensa configurable microprocessor with extensions to the instruction set architecture (ISA). With Tensilica's tools, it is possible to compile C code for a particular target architecture using native C programming tools (gcc, gdb). It is also possible to generate an instruction set simulator for the target architecture. These software tools are used to evaluate the system. Once the design is finalized, a description of the hardware implementation can be produced.

To estimate implementation characteristics of JPEG, we generated a compiler and instruction set simulator for a microprocessor core with no ISA extensions. The microprocessor core is a 32-bit RISC processor that does not have a multiplier. There are 16 general purpose registers and DSP-like features (such as zero-overhead loops) [1].

Since Tensilica's approach uses native C programming tools, we used the Independent JPEG Group's C implementation [2] as our initial system model for the estimation. Because of its efficiency, we chose the AA&N algorithm [3] using a fixed point architecture for implementing the forward and inverse discrete cosine transform.
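As a rough illustration of what fixed-point DCT arithmetic looks like on a core without a multiplier, the sketch below multiplies by cos(pi/4), approximately 0.7071 (a constant of the kind that appears in fast DCT algorithms), first with an integer multiply and then with shifts and adds only. The function names and the choice of 8 fractional bits are assumptions made for this example; they are not taken from the IJG source.

```python
FRAC_BITS = 8  # assumed number of fractional bits for the constant

# 0.707106781 scaled by 2**8 and rounded to the nearest integer (181)
C_0_7071 = round(0.707106781 * (1 << FRAC_BITS))


def fixed_mul(x, c=C_0_7071):
    """Multiply integer x by a fixed-point constant, dropping the fraction bits."""
    return (x * c) >> FRAC_BITS


def shift_add_mul_0_7071(x):
    """Same product using only shifts and adds: 181 = 128 + 32 + 16 + 4 + 1."""
    acc = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x
    return acc >> FRAC_BITS


if __name__ == "__main__":
    for sample in (100, -100, 255):
        print(sample, fixed_mul(sample), shift_add_mul_0_7071(sample))
```

Because the constant is known in advance, the shift-and-add form is the kind of operation that could be realized directly in hardware rather than in a general multiplier.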

Using the processor generator, compiler, and instruction set simulator, we arrived at estimates for the JPEG implementation on the processor described above. After analyzing the C code, we identified some possible ISA extensions. Since most of the arithmetic operations are on 16-bit integers, the extensions we identified were mostly related to parallelizing relatively simple operations. The estimates given below are based on our projections of how the ISA extensions we identified will affect the performance, power, and cost data obtained from Tensilica's tools.

Performance

To calculate performance, we simulated the JPEG encode and decode on a test image. The instruction set simulator gave us the number of instructions executed to perform encode and decode. For the calculation of frames per second, we assume a frame to be 640 pixels by 480 pixels with all three color components (red, green, and blue). Frames per second equals (avg. # of instructions per clock period) / [ (# of instructions to process 1 frame) * (clock period) ]. The tables below summarize the performance data for decode and encode at different clock speeds, assuming an average of one instruction executed per clock cycle.

**Decode Performance**
	100 MHz	175 MHz	250 MHz
Instructions Executed	61946059	61946059	61946059
Clock Period (ns)	10	5.7	4
Frames / second	1.61	2.83	4

**Encode Performance**
	100 MHz	175 MHz	250 MHz
Instructions Executed	62774404	62774404	62774404
Clock Period (ns)	10	5.7	4
Frames / second	1.59	2.79	3.98
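The frame rates in the two tables above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it works from the clock frequency rather than the rounded clock period, so results may differ slightly in the last digit.

```python
def frames_per_second(instructions_per_frame, clock_hz, ipc=1.0):
    """Frame rate given instruction count per frame, clock rate, and average IPC."""
    seconds_per_frame = instructions_per_frame / (ipc * clock_hz)
    return 1.0 / seconds_per_frame


if __name__ == "__main__":
    DECODE_INSTRUCTIONS = 61946059  # from the Decode Performance table
    ENCODE_INSTRUCTIONS = 62774404  # from the Encode Performance table
    for mhz in (100, 175, 250):
        dec = frames_per_second(DECODE_INSTRUCTIONS, mhz * 1e6)
        enc = frames_per_second(ENCODE_INSTRUCTIONS, mhz * 1e6)
        print(f"{mhz} MHz: decode {dec:.2f} fps, encode {enc:.2f} fps")
```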

Since this is a general-purpose JPEG implementation, it includes functions that would not likely be present in an embedded system implementation, such as file I/O. Thus, the number of instructions executed is an over-estimate. Since most of the computation time of encode and decode is spent in the forward DCT and inverse DCT respectively, the ISA extensions will likely be geared toward optimizing these functions. At first glance, we see large potential gains in parallelizing add instructions and realizing constant multiplies in hardware. Combined with the execution time over-estimate, we estimate the ISA extensions will result in about a 100% performance gain. The gain would be higher were it not offset by the software overhead needed to make full use of the new instructions (e.g. setting up a 32-bit register to hold two 16-bit values).
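To make the idea of a parallel add concrete, the sketch below emulates in software what a two-lane 16-bit add instruction might do on one 32-bit register, including the pack and unpack steps that account for the setup overhead mentioned above. This is an assumed illustration of the general technique, not the instruction we will actually define; all names are hypothetical.

```python
MASK16 = 0xFFFF
LOW15 = 0x7FFF7FFF   # low 15 bits of each 16-bit lane
TOP1 = 0x80008000    # top bit of each 16-bit lane


def pack16x2(hi, lo):
    """Pack two signed 16-bit values into one 32-bit word (the setup overhead)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack16x2(word):
    """Split a 32-bit word back into two signed 16-bit values (hi, lo)."""
    def to_signed(v):
        return v - 0x10000 if v & 0x8000 else v
    return to_signed((word >> 16) & MASK16), to_signed(word & MASK16)


def add16x2(a, b):
    """Add both 16-bit lanes at once; no carry crosses the lane boundary."""
    partial = (a & LOW15) + (b & LOW15)      # add the low 15 bits of each lane
    return partial ^ ((a ^ b) & TOP1)        # fold in the top bits without carry-out


if __name__ == "__main__":
    x = pack16x2(100, -3)
    y = pack16x2(-50, 7)
    print(unpack16x2(add16x2(x, y)))
```

Each lane wraps around independently on overflow, which is the behavior a hardware implementation would get for free by cutting the carry chain between bits 15 and 16.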

Energy

We chose to characterize the energy consumption of this approach in Joules / frame. The definition of a frame is the same as in the previous section (640 x 480, RGB). Tensilica's processor generator gave us core power estimates based on the clock speed. The energy per frame is calculated as (core power) * (clock period) * (# instructions executed), again assuming one instruction per clock cycle. The tables below show the energy per frame for various clock speeds.

**Decode Energy**
	100 MHz	175 MHz	250 MHz
Core Power (mW)	70	99	128
Clock Period (ns)	10	5.7	4
Instructions Executed	61946059	61946059	61946059
J / frame	0.043	0.035	0.031

**Encode Energy**
	100 MHz	175 MHz	250 MHz
Core Power (mW)	70	99	128
Clock Period (ns)	10	5.7	4
Instructions Executed	62774404	62774404	62774404
J / frame	0.044	0.035	0.032
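The energy figures follow from multiplying core power by the time needed to process one frame. A minimal sketch of that calculation, with a hypothetical function name and the units used in the tables above:

```python
def energy_per_frame_joules(core_power_mw, clock_period_ns, instructions):
    """Energy per frame = core power * clock period * instructions executed."""
    seconds_per_frame = instructions * clock_period_ns * 1e-9
    return core_power_mw * 1e-3 * seconds_per_frame


if __name__ == "__main__":
    DECODE_INSTRUCTIONS = 61946059  # from the Decode Energy table
    for power_mw, period_ns in ((70, 10), (99, 5.7), (128, 4)):
        joules = energy_per_frame_joules(power_mw, period_ns, DECODE_INSTRUCTIONS)
        print(f"{power_mw} mW, {period_ns} ns: {joules:.4f} J/frame")
```

Power rises with clock speed, but the time per frame falls faster, which is why energy per frame decreases at the higher clock speeds.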

As described in the performance section, the program executes some extraneous code. Extracting the core functions required for JPEG encode and decode will reduce the number of instructions executed and therefore reduce the energy consumed per frame. We are unsure what effect ISA extensions will have on energy dissipation. We foresee that the new instructions will dramatically reduce the total number of instructions executed (which will reduce J/frame). However, it is possible that the hardware realization of the new instructions speeds up execution at the cost of more energy consumption. The estimated power consumption is highly dependent on the relative priorities of speed, power, and area.

Cost

We chose to estimate cost by two different metrics: die area and code size. We did not estimate the cost in $/unit, as that information was not available to us. To estimate code size of the final implementation, we start with the executable size of the Independent JPEG Group's C implementation of the encoder and decoder. Since this code is very general, we believe the executable size can be reduced by 50% by stripping out the code irrelevant to our implementation. Additionally, we project the ISA extensions we design will reduce code size by another 20%. This estimate is based on the type of new instructions we expect to add.

**Code Size**
	Encode	Decode
Current Code Size (bytes)	79k	70k
Estimated Code Size (bytes)	31.6k	28k
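The estimated sizes apply the two reductions in sequence: 50% from stripping irrelevant code, then a further 20% of what remains from the ISA extensions. A short sketch of that arithmetic, with a hypothetical function name and sizes in kilobytes:

```python
def estimated_code_size_kb(current_kb, strip_fraction=0.50, isa_fraction=0.20):
    """Apply the stripping reduction first, then the ISA-extension reduction."""
    return current_kb * (1.0 - strip_fraction) * (1.0 - isa_fraction)


if __name__ == "__main__":
    for name, size_kb in (("Encode", 79), ("Decode", 70)):
        print(f"{name}: {estimated_code_size_kb(size_kb):.1f}k bytes")
```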

An additional cost metric is die area. Given the clock speed, the processor generator estimated the number of gates our design would take and estimated the die area for a given process. The table below shows the die area estimates for different clock speeds at 0.25 µm.

**Core Die Area**
	100 MHz	175 MHz	250 MHz
Gates (NAND2 equivalent)	29473	36879	44550
Die Area (mm²)	1.35 - 2.10	1.69 - 2.44	2.03 - 2.78

We estimate the size of the hardware needed to implement the new instructions will range from 5000 to 8000 gates, depending on design priorities. The following table shows the estimated gate count and die area for the ISA extended design.

**ISA Extended Core Die Area**
	100 MHz	175 MHz	250 MHz
Gates (NAND2 equivalent)	35000 - 38000	42000 - 45000	50000 - 53000
Die Area (mm²)	1.60 - 2.49	1.92 - 2.78	2.28 - 3.12

Note that the core die area does not include any memory. To calculate the memory size, we assumed on-chip instruction and data caches as well as on-chip ROM to store the code. Additionally, we assumed there is off-chip memory to store the image data. Since DCT and IDCT operate on 8 x 8 pixel blocks of 8-bit color in a streaming fashion, our data cache need not be large. To compensate for second-level cache latency, we chose to make our data cache large enough to hold 10 blocks (640 bytes). Since most of the execution time will be spent in inner loops, the instruction cache need not be large either. We estimate about one page of memory for the instruction cache (1k bytes). The on-chip ROM that stores the code will store about 60k bytes. The next step is to translate these memory sizes to die area. The cache areas are calculated assuming 6 transistors/bit of memory and 3.7M transistors/cm² (taken from the Logic transistors/cm² entry in Table 1 of [4]). The area of the on-chip ROM is calculated using one DRAM cell per bit, where the DRAM cell size is 0.56 µm²/bit (taken from Table 15 in [4]). The table below summarizes memory cost.

**Memory Sizes**
	Size (bytes)	Size (mm²)
Instruction Cache	1k	1.3
Data Cache	640	0.83
Program ROM	60k	0.27
Total	n/a	2.4
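The memory areas can be checked with the densities quoted above. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and "1k" is taken to mean 1024 bytes, which the text does not specify.

```python
TRANSISTORS_PER_BIT = 6             # 6 transistors per cache bit, as stated above
LOGIC_TRANSISTORS_PER_CM2 = 3.7e6   # logic density quoted from [4]
ROM_CELL_UM2_PER_BIT = 0.56         # DRAM cell size quoted from [4]


def cache_area_mm2(size_bytes):
    """Cache area from transistor count and logic transistor density."""
    transistors = size_bytes * 8 * TRANSISTORS_PER_BIT
    return transistors / LOGIC_TRANSISTORS_PER_CM2 * 100.0  # 1 cm^2 = 100 mm^2


def rom_area_mm2(size_bytes):
    """ROM area at one DRAM-sized cell per bit."""
    return size_bytes * 8 * ROM_CELL_UM2_PER_BIT * 1e-6     # 1 mm^2 = 1e6 um^2


if __name__ == "__main__":
    icache = cache_area_mm2(1024)
    dcache = cache_area_mm2(640)
    rom = rom_area_mm2(60 * 1024)
    print(f"I-cache {icache:.2f}, D-cache {dcache:.2f}, ROM {rom:.2f}, "
          f"total {icache + dcache + rom:.2f} mm^2")
```

The ROM holds far more bytes than the caches yet takes less area, because a single small cell per bit is much denser than six logic transistors per bit.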

The total die area for the ISA extended processor including memory is shown below.

**ISA Extended Processor Die Area (including memory)**
	100 MHz	175 MHz	250 MHz
Die Area (mm²)	4.00 - 4.89	4.32 - 5.18	4.68 - 5.52
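The totals are the ISA extended core area range plus the 2.4 mm² of memory. The power density figures reported in the summary appear consistent with dividing core power by this total area, which is an inference from the numbers rather than a formula stated in the text. A sketch with hypothetical function names, using the 175 MHz column as the example:

```python
def total_die_area_mm2(core_range_mm2, memory_mm2):
    """Add the fixed memory area to both ends of the core area range."""
    low, high = core_range_mm2
    return (low + memory_mm2, high + memory_mm2)


def power_density_mw_per_mm2(core_power_mw, area_range_mm2):
    """Power density range: the largest area gives the lowest density."""
    low_area, high_area = area_range_mm2
    return (core_power_mw / high_area, core_power_mw / low_area)


if __name__ == "__main__":
    area = total_die_area_mm2((1.92, 2.78), 2.4)   # 175 MHz ISA extended core
    density = power_density_mw_per_mm2(99, area)   # 99 mW core power at 175 MHz
    print(f"area {area[0]:.2f} - {area[1]:.2f} mm^2, "
          f"density {density[0]:.1f} - {density[1]:.1f} mW/mm^2")
```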

Design effort

Since Tensilica's tool suite has not been announced, there is no data regarding the design effort required to design a system using their approach. Since we have a C model of the system we are designing, we expect the bulk of our time to be spent in identifying and evaluating new instructions. Assuming we do not encounter too many bugs while using the Xtensa Software-Development Toolkit, we expect to implement a JPEG codec in 100 man-hours.

Summary

The table below summarizes our estimates for different clock speeds.

**Estimates Summary**
	100 MHz	175 MHz	250 MHz
Performance
Decode Speed (frames/sec)	1.61	2.83	4
Encode Speed (frames/sec)	1.59	2.79	3.98
Energy
Decode Energy Dissipation (J / frame)	0.043	0.035	0.031
Encode Energy Dissipation (J / frame)	0.044	0.035	0.032
Core Power (mW)	70	99	128
Power Density (mW / mm²)	14.3 - 17.5	19.1 - 22.9	23.2 - 27.4
Cost
Die Area (0.25 µm) (mm²)	4.00 - 4.89	4.32 - 5.18	4.68 - 5.52
Decode Code Size (bytes)	28k	28k	28k
Encode Code Size (bytes)	31.6k	31.6k	31.6k
Design Effort (man-hours)	100	100	100

References

[1] Xtensa Instruction Set Architecture Reference Manual
[2] Independent JPEG Group's C implementation
[3] AA&N algorithm
[4] National Technology Roadmap for Semiconductors, 1997