Niraj Shah
Scott Weber
EE290a Homework #1
 

Description

We chose to implement a JPEG codec using Tensilica's Xtensa configurable microprocessor with extensions to the instruction set architecture (ISA). With Tensilica's tools, it is possible to compile C code for a particular target architecture using native C programming tools (gcc, gdb). It is also possible to generate an instruction set simulator for the target architecture. These software tools are used to evaluate the system. Once the design is finalized, a description of the hardware implementation can be produced.

To estimate implementation characteristics of JPEG, we generated a compiler and instruction set simulator for a microprocessor core with no ISA extensions. The microprocessor core is a 32-bit RISC processor that does not have a multiplier. There are 16 general purpose registers and DSP-like features (such as zero-overhead loops) [1].

Since Tensilica's approach uses native C programming tools, we used the Independent JPEG Group's C implementation [2] as our initial system model for the estimation. Because of its efficiency, we chose the AA&N algorithm [3] using a fixed point architecture for implementing the forward and inverse discrete cosine transform.
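To make the fixed-point choice concrete, the following is a minimal sketch of how a multiplication by a fractional constant can be carried out with integers only, which matters on a core without a multiplier. The number of fractional bits (8) and the constant 0.7071 (cos(pi/4), a value that commonly appears in DCT factorizations) are assumptions chosen for illustration; the function names are hypothetical and are not taken from the Independent JPEG Group's code.

```python
# Sketch of fixed-point constant multiplication using integers only.
# CONST_BITS and the example constant are illustrative assumptions.

CONST_BITS = 8  # assumed number of fractional bits


def to_fixed(x: float) -> int:
    """Scale a real constant to an integer with CONST_BITS fractional bits."""
    return int(round(x * (1 << CONST_BITS)))


def mul_fixed(value: int, const_fixed: int) -> int:
    """Multiply an integer sample by a fixed-point constant, then rescale."""
    return (value * const_fixed) >> CONST_BITS


def mul_by_181_shift_add(value: int) -> int:
    """Same product built from shifts and adds: 181 = 128 + 32 + 16 + 4 + 1."""
    product = (value << 7) + (value << 5) + (value << 4) + (value << 2) + value
    return product >> CONST_BITS


if __name__ == "__main__":
    c = to_fixed(0.707106781)  # scaled constant for cos(pi/4)
    sample = 100
    print(c, mul_fixed(sample, c), mul_by_181_shift_add(sample))
```

The shift-and-add form is the kind of constant multiply that could be realized directly in hardware, since the constant is known at design time.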

Using the processor generator, compiler, and instruction set simulator, we arrived at estimates for the JPEG implementation on the processor described above. After analyzing the C code, we identified some possible ISA extensions. Since most of the arithmetic operations are on 16-bit integers, the extensions we identified were mostly related to parallelizing relatively simple operations. The estimates given below are based on our projections of how the ISA extensions we identified will affect the performance, power, and cost data obtained from Tensilica's tools.

Performance

To calculate performance, we simulated JPEG encode and decode on a test image. The instruction set simulator gave us the number of instructions executed to perform each operation. For the calculation of frames per second, we assume a frame to be 640 pixels by 480 pixels with all three color components (red, green, and blue). Frames per second equals (avg. # of instructions per clock period) / [ (# of instructions to process 1 frame) * (clock period) ]. The tables below summarize the performance data for decode and encode at different clock speeds, assuming an average of one instruction executed per clock cycle.
 
                        100 MHz     175 MHz     250 MHz
Instructions Executed   61946059    61946059    61946059
Clock Period (ns)       10          5.7         4
Frames / second         1.61        2.83        4
Decode Performance
 
 
                        100 MHz     175 MHz     250 MHz
Instructions Executed   62774404    62774404    62774404
Clock Period (ns)       10          5.7         4
Frames / second         1.59        2.79        3.98
Encode Performance
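The frame-rate arithmetic can be sketched as follows. This is a minimal illustration using the instruction counts reported above and the stated assumption of one instruction per clock cycle; the function and variable names are hypothetical.

```python
# Frames per second from instruction count and clock frequency.
# Assumes an average of `ipc` instructions completed per clock cycle.

DECODE_INSTRUCTIONS = 61_946_059  # from the decode table
ENCODE_INSTRUCTIONS = 62_774_404  # from the encode table


def frames_per_second(instructions_per_frame: int, clock_hz: float,
                      ipc: float = 1.0) -> float:
    """Return frames/second: ipc / (instructions * clock period)."""
    seconds_per_frame = instructions_per_frame / (ipc * clock_hz)
    return 1.0 / seconds_per_frame


if __name__ == "__main__":
    for mhz in (100, 175, 250):
        dec = frames_per_second(DECODE_INSTRUCTIONS, mhz * 1e6)
        enc = frames_per_second(ENCODE_INSTRUCTIONS, mhz * 1e6)
        print(f"{mhz} MHz: decode {dec:.2f} fps, encode {enc:.2f} fps")
```

Raising `ipc` or lowering the instruction count (for example through ISA extensions) scales the frame rate proportionally in this model.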
 
Since this is a general purpose JPEG implementation, it includes functions that would not likely be present in an embedded system implementation, such as file I/O. Thus, the number of instructions executed is an over-estimate. Since most of the computation time of encode and decode is spent in the forward DCT and inverse DCT respectively, the ISA extensions will likely be geared toward optimizing these functions. At first glance, we see large potential gains in parallelizing add instructions and realizing constant multiplies in hardware. Combined with the execution time over-estimate, we estimate the ISA extensions will result in about a 100% performance gain. The gain would be higher, but it is offset by the software overhead needed to make full use of the new instructions (e.g. setting up the 32-bit result register to hold two 16-bit values).
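As an illustration of what a parallel add on 16-bit data could look like, the sketch below models a hypothetical instruction that adds two pairs of 16-bit values held in 32-bit words. The operation and its name (`add2x16`) are assumptions made for this example and do not describe an actual Xtensa instruction; the pack and unpack helpers stand in for the software overhead mentioned above.

```python
# Software model of a hypothetical "two 16-bit adds in one 32-bit word" operation.

MASK16 = 0xFFFF


def pack2x16(hi: int, lo: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2x16(word: int) -> tuple:
    """Split a 32-bit word back into two signed 16-bit values (hi, lo)."""
    def signed(v: int) -> int:
        return v - 0x10000 if v & 0x8000 else v
    return signed((word >> 16) & MASK16), signed(word & MASK16)


def add2x16(a: int, b: int) -> int:
    """Lane-wise add; a carry out of the low lane never reaches the high lane."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


if __name__ == "__main__":
    x = pack2x16(100, -3)
    y = pack2x16(-50, 7)
    print(unpack2x16(add2x16(x, y)))
```

In this model each lane wraps around on overflow; a real design would have to decide between wrapping and saturating behavior.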
 

Energy

We chose to characterize the energy dissipation of this approach in Joules / frame. The definition of a frame is the same as in the previous section (640 x 480, RGB). Tensilica's processor generator gave us core power estimates based on the clock speed. The energy calculation is as follows: (core power) * (clock period) * (# instructions executed), again assuming one instruction executed per clock cycle. The tables below show energy dissipation for various clock speeds.
 
                        100 MHz     175 MHz     250 MHz
Core Power (mW)         70          99          128
Clock Period (ns)       10          5.7         4
Instructions Executed   61946059    61946059    61946059
J / frame               0.043       0.035       0.031
Decode Energy
 
 
                        100 MHz     175 MHz     250 MHz
Core Power (mW)         70          99          128
Clock Period (ns)       10          5.7         4
Instructions Executed   62774404    62774404    62774404
J / frame               0.044       0.035       0.032
Encode Energy
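The energy-per-frame calculation can be sketched in the same way. The core power and instruction counts are the values reported in the tables above; the function name and the one-instruction-per-cycle default are assumptions for illustration.

```python
# Energy per frame = core power * clock period * instructions executed,
# assuming `ipc` instructions complete per clock cycle.

def joules_per_frame(core_power_mw: float, clock_hz: float,
                     instructions: int, ipc: float = 1.0) -> float:
    """Return the energy in Joules needed to process one frame."""
    seconds_per_frame = instructions / (ipc * clock_hz)
    return core_power_mw * 1e-3 * seconds_per_frame


if __name__ == "__main__":
    core_power_mw = {100: 70, 175: 99, 250: 128}  # MHz -> mW, from the tables
    for mhz, power in core_power_mw.items():
        dec = joules_per_frame(power, mhz * 1e6, 61_946_059)
        enc = joules_per_frame(power, mhz * 1e6, 62_774_404)
        print(f"{mhz} MHz: decode {dec:.4f} J, encode {enc:.4f} J")
```

Because power grows more slowly than clock frequency in these estimates, the energy per frame falls as the clock speed rises.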
 
As described in the performance section, the program executes some extraneous code. Extracting the core functions required for JPEG encode and decode will reduce the number of instructions executed and therefore reduce the energy consumed per frame. We are unsure what effect ISA extensions will have on energy dissipation. We foresee that the new instructions will dramatically reduce the total number of instructions executed (which will reduce J/frame). However, it is possible that the hardware realization of the new instructions speeds up execution at the cost of more energy consumption. The estimated power consumption is highly dependent on the relative priorities of speed, power, and area.
 

Cost

We chose to estimate cost by two different metrics: die area and code size. We did not estimate the cost in $/unit, as that information was not available to us. To estimate code size of the final implementation, we start with the executable size of the Independent JPEG Group's C implementation of the encoder and decoder. Since this code is very general, we believe the executable size can be reduced by 50% by stripping out the code irrelevant to our implementation. Additionally, we project the ISA extensions we design will reduce code size by another 20%. This estimate is based on the type of new instructions we expect to add.
 
                              Encode    Decode
Current Code Size (bytes)     79k       70k
Estimated Code Size (bytes)   31.6k     28k
Code Size
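The code size projection is two successive reductions applied to the current executable size. A minimal sketch, with a hypothetical function name and the 50% and 20% reductions taken from the text:

```python
# Projected code size after stripping irrelevant code and adding ISA extensions.

def estimated_code_size(current_bytes: float,
                        strip_fraction: float = 0.50,
                        isa_fraction: float = 0.20) -> float:
    """Apply the stripping reduction first, then the ISA-extension reduction."""
    return current_bytes * (1.0 - strip_fraction) * (1.0 - isa_fraction)


if __name__ == "__main__":
    for name, size in (("encode", 79_000), ("decode", 70_000)):
        print(f"{name}: {estimated_code_size(size) / 1000:.1f}k bytes")
```

The two reductions compound, so the combined effect is a 60% reduction rather than 70%.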
An additional cost metric is die area. Given the clock speed, the processor generator estimated the number of gates our design would require and the die area for a given process. The table below shows the die area estimates for different clock speeds in a 0.25 µm process.
 
                           100 MHz        175 MHz        250 MHz
Gates (NAND2 equivalent)   29473          36879          44550
Die Area (mm²)             1.35 - 2.10    1.69 - 2.44    2.03 - 2.78
Core Die Area
 
We estimate the size of the hardware needed to implement the new instructions will range from 5000 to 8000 gates, depending on design priorities. The following table shows the estimated gate count and die area for the ISA extended design.
 
                           100 MHz          175 MHz          250 MHz
Gates (NAND2 equivalent)   35000 - 38000    42000 - 45000    50000 - 53000
Die Area (mm²)             1.60 - 2.49      1.92 - 2.78      2.28 - 3.12
ISA Extended Core Die Area
 
Note that the core die area does not include any memory. To calculate the memory size, we assumed on-chip instruction and data caches as well as on-chip ROM to store the code. Additionally, we assumed there is off-chip memory to store the image data. Since DCT and IDCT operate on 8 x 8 pixel blocks of 8-bit color in a streaming fashion, our data cache need not be large. To compensate for second-level cache latency, we chose to make our data cache large enough to hold 10 blocks (640 bytes). Since most of the execution time will be spent in inner loops, the instruction cache need not be large either. We estimate about one page of memory for the instruction cache (1k bytes). The on-chip ROM that stores the code will hold about 60k bytes. The next step is to translate these memory sizes to die area. The cache areas are calculated assuming 6 transistors/bit of memory and 3.7M transistors/cm² (taken from the Logic transistors/cm² entry in Table 1 of [4]). The area of the on-chip ROM is calculated assuming one DRAM cell per bit, where the DRAM cell size is 0.56 µm²/bit (taken from Table 15 in [4]). The table below summarizes memory cost.
 
                    Size (bytes)   Size (mm²)
Instruction Cache   1k             1.3
Data Cache          640            0.83
Program ROM         60k            0.27
Total               n/a            2.4
Memory Sizes
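The conversion from memory size to die area can be sketched as follows. The densities are the ones quoted from [4]; treating "1k" and "60k" as 1000 and 60000 bytes is an assumption made here, and the function names are hypothetical.

```python
# Memory size to die area, using the density figures quoted from [4].

TRANSISTORS_PER_BIT = 6            # SRAM cell assumption from the text
LOGIC_TRANSISTORS_PER_CM2 = 3.7e6  # logic density from [4], Table 1
DRAM_CELL_UM2_PER_BIT = 0.56       # DRAM cell size from [4], Table 15


def cache_area_mm2(size_bytes: int) -> float:
    """Cache area: bits * transistors/bit / (transistors per cm2), in mm2."""
    transistors = size_bytes * 8 * TRANSISTORS_PER_BIT
    area_cm2 = transistors / LOGIC_TRANSISTORS_PER_CM2
    return area_cm2 * 100.0  # 1 cm2 = 100 mm2


def rom_area_mm2(size_bytes: int) -> float:
    """ROM area: one DRAM-sized cell per bit, converted from um2 to mm2."""
    return size_bytes * 8 * DRAM_CELL_UM2_PER_BIT * 1e-6  # 1 mm2 = 1e6 um2


if __name__ == "__main__":
    icache = cache_area_mm2(1_000)   # assumes 1k = 1000 bytes
    dcache = cache_area_mm2(640)
    rom = rom_area_mm2(60_000)       # assumes 60k = 60000 bytes
    print(f"I-cache {icache:.2f}, D-cache {dcache:.2f}, ROM {rom:.2f}, "
          f"total {icache + dcache + rom:.1f} mm2")
```

Note that the ROM, although far larger in bytes, takes the least area because of the much denser cell assumed for it.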
The total die area for the ISA extended processor including memory is shown below.
 
                 100 MHz        175 MHz        250 MHz
Die Area (mm²)   4.00 - 4.89    4.32 - 5.18    4.68 - 5.52
ISA Extended Processor Die Area (including memory)
 

Design effort

Since Tensilica's tool suite has not been announced, there is no data regarding the design effort required to design a system using their approach. Since we have a C model of the system we are designing, we expect the bulk of our time to be spent in identifying and evaluating new instructions. Assuming we do not encounter too many bugs while using the Xtensa Software-Development Toolkit, we expect to implement a JPEG codec in 100 man-hours.

Summary

The table below summarizes our estimates for different clock speeds.

 
                                          100 MHz        175 MHz        250 MHz
Performance
  Decode Speed (frames/sec)               1.61           2.83           4
  Encode Speed (frames/sec)               1.59           2.79           3.98
Energy
  Decode Energy Dissipation (J / frame)   0.043          0.035          0.031
  Encode Energy Dissipation (J / frame)   0.044          0.035          0.032
  Core Power (mW)                         70             99             128
  Power Density (mW / mm²)                14.3 - 17.5    19.1 - 22.9    23.2 - 27.4
Cost
  Die Area (0.25 µm) (mm²)                4.00 - 4.89    4.32 - 5.18    4.68 - 5.52
  Decode Code Size (bytes)                28k            28k            28k
  Encode Code Size (bytes)                31.6k          31.6k          31.6k
Design Effort (man-hours)                 100            100            100
Estimates Summary
     

References

[1] Xtensa Instruction Set Architecture Reference Manual
[2] Independent JPEG Group's C implementation
[3] AA&N algorithm
[4] National Technology Roadmap for Semiconductors, 1997