Niraj Shah
Scott Weber
EE290a Homework #1
Description
We chose to implement a JPEG codec using Tensilica's Xtensa configurable
microprocessor with extensions to the instruction set architecture (ISA).
With Tensilica's tools, it is possible to compile C code for a particular
target architecture using native C programming tools (gcc, gdb). It is
also possible to generate an instruction set simulator for the target architecture.
These software tools are used to evaluate the system. Once the design is
finalized, a description of the hardware implementation can be produced.
To estimate implementation characteristics of JPEG, we generated a compiler
and instruction set simulator for a microprocessor core with no ISA extensions.
The microprocessor core is a 32-bit RISC processor that does not have a
multiplier. There are 16 general purpose registers and DSP-like features
(such as zero-overhead loops) [1].
Since Tensilica's approach uses native C programming tools, we used
the Independent JPEG Group's C implementation [2] as
our initial system model for the estimation. Because of its efficiency,
we chose the AA&N algorithm [3] using a fixed point
architecture for implementing the forward and inverse discrete cosine transform.
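As a rough illustration of what a fixed point DCT constant multiply looks like on a core without a multiplier, the sketch below scales a real constant to an integer and then forms the product from shifts and adds. The 8 fractional bits and all function names are assumptions made for this example; they are not taken from the report's toolflow or from any particular source file.

```python
CONST_BITS = 8  # assumed number of fractional bits for this sketch


def fix(x: float) -> int:
    """Convert a real constant to fixed point with CONST_BITS fractional bits."""
    return int(x * (1 << CONST_BITS) + 0.5)


def multiply(value: int, const: int) -> int:
    """Fixed point multiply: integer product, then drop the fractional bits."""
    return (value * const) >> CONST_BITS


# cos(pi/4) scaled by 2**8 rounds to 181
FIX_0_707106781 = fix(0.707106781)


def multiply_0_707_shift_add(value: int) -> int:
    """Same product without a multiplier, using 181 = 128 + 32 + 16 + 4 + 1."""
    product = (value << 7) + (value << 5) + (value << 4) + (value << 2) + value
    return product >> CONST_BITS
```

Because the constants of the transform are known in advance, each one reduces to a fixed pattern of shifts and adds like this, which is the kind of operation a custom instruction could realize directly in hardware.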
Using the processor generator, compiler, and instruction set simulator,
we arrived at estimates for the JPEG implementation on the processor described
above. After analyzing the C code, we identified some possible ISA extensions.
Since most of the arithmetic operations are on 16-bit integers, the extensions
we identified were mostly related to parallelizing relatively simple operations.
The estimates given below are based on our projections of how the ISA extensions
we identified will affect the performance, power, and cost data obtained
from Tensilica's tools.
Performance
To calculate performance, we simulated the JPEG encode and decode on a
test image. The instruction set simulator gave us the number of instructions
executed to perform encode and decode. For the calculation of frames per
second, we assume a frame to be 640 pixels by 480 pixels representing all
three colors (red, green, and blue). Frames per second equals (avg. # of
instructions per clock period) / [ (# of instructions to process 1 frame) *
(clock period) ]. The tables below summarize the performance data for decode and
encode for different clock speeds, assuming an average of one instruction
executed per clock cycle.
| | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|
| Instructions Executed | 61946059 | 61946059 | 61946059 |
| Clock Period (ns) | 10 | 5.7 | 4 |
| Frames / second | 1.61 | 2.83 | 4 |

Decode Performance
| | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|
| Instructions Executed | 62774404 | 62774404 | 62774404 |
| Clock Period (ns) | 10 | 5.7 | 4 |
| Frames / second | 1.59 | 2.79 | 3.98 |

Encode Performance
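The frames per second figures in the two performance tables can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming one instruction per clock cycle as stated above; the function and constant names are illustrative only.

```python
DECODE_INSTRUCTIONS = 61_946_059  # instructions to decode one 640 x 480 RGB frame
ENCODE_INSTRUCTIONS = 62_774_404  # instructions to encode one 640 x 480 RGB frame


def frames_per_second(instructions_per_frame: int, clock_hz: float, ipc: float = 1.0) -> float:
    """Frames per second = (instructions per clock) * (clock rate) / (instructions per frame)."""
    return ipc * clock_hz / instructions_per_frame


for mhz in (100, 175, 250):
    clock_hz = mhz * 1e6
    decode = frames_per_second(DECODE_INSTRUCTIONS, clock_hz)
    encode = frames_per_second(ENCODE_INSTRUCTIONS, clock_hz)
    print(f"{mhz} MHz: decode {decode:.2f} fps, encode {encode:.2f} fps")
```

Dividing by the clock period is the same as multiplying by the clock rate, so the code uses the clock rate directly.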
Since this is a general purpose JPEG implementation, it
includes functions that would not likely be present in an embedded system
implementation, such as file I/O. Thus, the number of instructions executed
is an over-estimate. Since most of the computation time of encode and decode
is spent in the forward DCT and inverse DCT respectively, the ISA extensions
will likely be geared toward optimizing these functions. At first glance,
we see large potential gains in parallelizing add instructions and realizing
constant multiplies in hardware. Combined with the execution time over-estimate,
we estimate the ISA extensions will result in about a 100% performance
gain. The gain would be higher were it not offset by the software overhead
needed to make full use of the new instructions (e.g. setting up the 32-bit
result register to hold two 16-bit values).
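To make the idea of parallel 16-bit adds concrete, the sketch below models in software what a packed add instruction might compute: two independent 16-bit sums held in one 32-bit word, with no carry crossing between the halves. The names pack, unpack, and packed_add16 are hypothetical and only illustrate the concept; they do not describe an actual Xtensa instruction.

```python
MASK16 = 0xFFFF


def pack(hi: int, lo: int) -> int:
    """Place two signed 16-bit values in the upper and lower halves of a 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack(word: int) -> tuple[int, int]:
    """Recover the two signed 16-bit values from a packed 32-bit word."""
    def signed(x: int) -> int:
        return x - 0x10000 if x & 0x8000 else x
    return signed((word >> 16) & MASK16), signed(word & MASK16)


def packed_add16(a: int, b: int) -> int:
    """Add the two 16-bit lanes of a and b independently (wrap-around per lane)."""
    lo = (a + b) & MASK16                  # low lane; any carry out is discarded
    hi = ((a >> 16) + (b >> 16)) & MASK16  # high lane, computed separately
    return (hi << 16) | lo
```

The pack and unpack steps correspond to the software overhead mentioned above: values must be arranged into the packed form before the combined instruction pays off.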
Energy
We chose to characterize the energy dissipation of this approach in Joules
/ frame. The definition of a frame is the same as in the previous section
(640 x 480, RGB). Tensilica's processor generator gave us core power estimates
based on the clock speed. The energy calculation is as follows: (core power)
* (clock period) * (# instructions executed). The tables below show energy
dissipation for various clock speeds.
| | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|
| Core Power (mW) | 70 | 99 | 128 |
| Clock Period (ns) | 10 | 5.7 | 4 |
| Instructions Executed | 61946059 | 61946059 | 61946059 |
| J / frame | 0.043 | 0.035 | 0.031 |

Decode Energy
| | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|
| Core Power (mW) | 70 | 99 | 128 |
| Clock Period (ns) | 10 | 5.7 | 4 |
| Instructions Executed | 62774404 | 62774404 | 62774404 |
| J / frame | 0.044 | 0.035 | 0.032 |

Encode Energy
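The energy per frame calculation can be sketched as follows, assuming one instruction per clock cycle so that execution time equals instruction count times clock period. The function name and argument layout are illustrative assumptions.

```python
def joules_per_frame(core_power_mw: float, clock_hz: float, instructions: int) -> float:
    """Energy = (core power) * (clock period) * (instructions executed)."""
    clock_period_s = 1.0 / clock_hz
    return core_power_mw * 1e-3 * clock_period_s * instructions


CORE_POWER_MW = {100: 70, 175: 99, 250: 128}  # core power estimates per clock speed (MHz)

for mhz, power_mw in CORE_POWER_MW.items():
    decode = joules_per_frame(power_mw, mhz * 1e6, 61_946_059)
    encode = joules_per_frame(power_mw, mhz * 1e6, 62_774_404)
    print(f"{mhz} MHz: decode {decode:.4f} J/frame, encode {encode:.4f} J/frame")
```

Although core power rises with clock speed, the time per frame falls faster, which is why the energy per frame decreases at the higher clock rates.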
As described in the performance section, the implementation executes some
extraneous code. Extracting the core functions required for JPEG encode and decode
will reduce the number of instructions executed and therefore reduce the
energy consumed per frame. We are unsure what effect ISA extensions
will have on energy dissipation. We foresee that the new instructions
will dramatically reduce the total number of instructions executed (which
will reduce J/frame). However, it is possible that the hardware realization
of the new instructions speeds up execution at the cost of more
energy consumption. The estimated power consumption is highly dependent
on the relative priorities of speed, power, and area.
Cost
We chose to estimate cost by two different metrics: die area and code size.
We did not estimate the cost in $/unit, as that information was not available
to us. To estimate code size of the final implementation, we start with
the executable size of the Independent JPEG Group's C implementation of
the encoder and decoder. Since this code is very general, we believe the
executable size can be reduced by 50% by stripping out the code irrelevant
to our implementation. Additionally, we project the ISA extensions we design
will reduce code size by another 20%. This estimate is based on the type
of new instructions we expect to add.
| | Encode | Decode |
|---|---|---|
| Current Code Size (bytes) | 79k | 70k |
| Estimated Code Size (bytes) | 31.6k | 28k |

Code Size
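The estimated code sizes follow from applying the two projected reductions in sequence: 50% from stripping irrelevant code, then a further 20% from the ISA extensions. A minimal sketch, with an illustrative function name:

```python
def estimated_code_size(current_bytes: float,
                        strip_fraction: float = 0.5,
                        isa_fraction: float = 0.2) -> float:
    """Apply the projected reductions one after the other."""
    after_stripping = current_bytes * (1.0 - strip_fraction)
    return after_stripping * (1.0 - isa_fraction)


print(estimated_code_size(79_000))  # encoder
print(estimated_code_size(70_000))  # decoder
```

The two reductions compound, so the final size is 40% of the current executable rather than 30%.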
An additional cost metric is die area. Given the clock speed, the
processor generator estimated the number of gates our design would take
and estimated the die area for a given process. The table below shows the
die area estimates for different clock speeds at 0.25 µm.
| | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|
| Gates (NAND2 equivalent) | 29473 | 36879 | 44550 |
| Die Area (mm²) | 1.35 - 2.10 | 1.69 - 2.44 | 2.03 - 2.78 |

Core Die Area
We estimate the size of the hardware needed to implement the new instructions
will range from 5000 to 8000 gates, depending on design priorities. The
following table shows the estimated gate count and die area for the ISA
extended design.
| | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|
| Gates (NAND2 equivalent) | 35000 - 38000 | 42000 - 45000 | 50000 - 53000 |
| Die Area (mm²) | 1.60 - 2.49 | 1.92 - 2.78 | 2.28 - 3.12 |

ISA Extended Core Die Area
Note that the core die area does not include any memory. To calculate
the memory size, we assumed on-chip instruction and data caches as well
as on-chip ROM to store the code. Additionally, we assumed there is off-chip
memory to store the image data. Since DCT and IDCT operate on 8 x 8 pixel
blocks of 8 bit color in a streaming fashion, our data cache need not be
large. To compensate for second-level cache latency, we chose to make our
data cache large enough to hold 10 blocks (640 bytes). Since most of the
execution time will be spent in inner loops, the instruction cache need not
be large either. We estimate about one page of memory for the instruction
cache (1k bytes). The on-chip ROM that stores the code will store about
60k bytes. The next step is to translate these memory sizes to die area.
The cache areas are calculated assuming 6 transistors/bit of memory and
3.7M transistors/cm² (taken from the Logic transistors/cm²
entry in Table 1 of [4]). The area of the on-chip ROM
is calculated assuming one DRAM-sized cell per bit, where the DRAM cell size
is 0.56 µm²/bit (taken from Table 15 in [4]). The table below summarizes
memory cost.
| | Size (bytes) | Area (mm²) |
|---|---|---|
| Instruction Cache | 1k | 1.3 |
| Data Cache | 640 | 0.83 |
| Program ROM | 60k | 0.27 |
| Total | n/a | 2.4 |

Memory Sizes
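The memory areas can be rechecked from the stated densities. The sketch below assumes "1k" means 1000 bytes, which reproduces the rounded table values; the constant and function names are illustrative.

```python
TRANSISTORS_PER_CACHE_BIT = 6     # assumed transistors per cache bit, as stated above
LOGIC_TRANSISTORS_PER_CM2 = 3.7e6 # logic transistor density used for the caches
ROM_CELL_UM2_PER_BIT = 0.56       # DRAM cell size used for the ROM estimate
MM2_PER_CM2 = 100.0
UM2_PER_MM2 = 1e6


def cache_area_mm2(size_bytes: int) -> float:
    """Cache area from transistor count and logic transistor density."""
    transistors = size_bytes * 8 * TRANSISTORS_PER_CACHE_BIT
    return transistors / LOGIC_TRANSISTORS_PER_CM2 * MM2_PER_CM2


def rom_area_mm2(size_bytes: int) -> float:
    """ROM area assuming one DRAM-sized cell per bit."""
    return size_bytes * 8 * ROM_CELL_UM2_PER_BIT / UM2_PER_MM2


icache = cache_area_mm2(1_000)  # instruction cache, 1k bytes
dcache = cache_area_mm2(640)    # data cache, 10 blocks of 64 bytes
rom = rom_area_mm2(60_000)      # program ROM, 60k bytes
total = icache + dcache + rom
print(f"I-cache {icache:.2f}, D-cache {dcache:.2f}, ROM {rom:.2f}, total {total:.1f} mm^2")
```

The ROM holds far more bytes than the caches but takes less area, because a single dense cell per bit is much smaller than six logic-density transistors per bit.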
The total die area for the ISA extended processor including memory is shown
below.
| | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|
| Die Area (mm²) | 4.00 - 4.89 | 4.32 - 5.18 | 4.68 - 5.52 |

ISA Extended Processor Die Area (including memory)
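The total area is the ISA extended core area plus the 2.4 mm² memory total, and dividing core power by that area gives a power density. The sketch below is one way to carry out that arithmetic; the dictionary layout and function names are assumptions for illustration, and the power density here uses core power only, spread over core plus memory area.

```python
MEMORY_AREA_MM2 = 2.4

# (low, high) ISA extended core area in mm^2, keyed by clock speed in MHz
CORE_AREA_MM2 = {100: (1.60, 2.49), 175: (1.92, 2.78), 250: (2.28, 3.12)}
CORE_POWER_MW = {100: 70, 175: 99, 250: 128}


def total_area_mm2(mhz: int) -> tuple[float, float]:
    """Core area range plus the on-chip memory area."""
    low, high = CORE_AREA_MM2[mhz]
    return low + MEMORY_AREA_MM2, high + MEMORY_AREA_MM2


def power_density_mw_per_mm2(mhz: int) -> tuple[float, float]:
    """Core power divided by total area; the largest area gives the lowest density."""
    low_area, high_area = total_area_mm2(mhz)
    power = CORE_POWER_MW[mhz]
    return power / high_area, power / low_area


for mhz in (100, 175, 250):
    area = total_area_mm2(mhz)
    density = power_density_mw_per_mm2(mhz)
    print(f"{mhz} MHz: area {area[0]:.2f} - {area[1]:.2f} mm^2, "
          f"density {density[0]:.1f} - {density[1]:.1f} mW/mm^2")
```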
Design effort
Since Tensilica's tool suite has not been announced, there is no data regarding
the design effort required to design a system using their approach. Since
we have a C model of the system we are designing, we expect the bulk of
our time to be spent in identifying and evaluating new instructions. Assuming
we do not encounter too many bugs while using the Xtensa Software-Development
Toolkit, we expect to implement a JPEG codec in 100 man-hours.
Summary
The table below summarizes our estimates for different clock speeds.
| Category | Metric | 100 MHz | 175 MHz | 250 MHz |
|---|---|---|---|---|
| Performance | Decode Speed (frames/sec) | 1.61 | 2.83 | 4 |
| | Encode Speed (frames/sec) | 1.59 | 2.79 | 3.98 |
| Energy | Decode Energy Dissipation (J / frame) | 0.043 | 0.035 | 0.031 |
| | Encode Energy Dissipation (J / frame) | 0.044 | 0.035 | 0.032 |
| | Core Power (mW) | 70 | 99 | 128 |
| | Power Density (mW / mm²) | 14.3 - 17.5 | 19.1 - 22.9 | 23.2 - 27.4 |
| Cost | Die Area (0.25 µm) (mm²) | 4.00 - 4.89 | 4.32 - 5.18 | 4.68 - 5.52 |
| | Decode Code Size (bytes) | 28k | 28k | 28k |
| | Encode Code Size (bytes) | 31.6k | 31.6k | 31.6k |
| Design Effort (man-hours) | | 100 | 100 | 100 |

Estimates Summary
References
[1] Xtensa Instruction Set Architecture Reference Manual
[2] Independent JPEG Group's C implementation
[3] AA&N algorithm
[4] National Technology Roadmap for Semiconductors, 1997