Niraj Shah
Scott Weber
EE290a
Viterbi Project Final Report
Overview
We chose to implement a Viterbi decoder using Tensilica's Xtensa configurable
microprocessor with extensions to the instruction set architecture (ISA).
With Tensilica's tools, it is possible to compile C code for a particular
target architecture using native C programming tools (gcc, gdb). It is
also possible to generate an instruction set simulator for the target architecture.
These software tools are used to evaluate the system. Once the design is
finalized, a description of the hardware implementation can be produced.
The goal of this project was to design one implementation with a data
rate of 100 kb/s and another with the highest achievable data rate. In
addition, we designed and characterized a Viterbi decoder that could be
implemented if one of Xtensa's restrictions were relaxed.
The rest of this report is organized as follows: first, we describe the
hardware/software co-design flow with Tensilica's Xtensa processor.
Then, we describe the hardware and software of the designs we built. Lastly,
we present the quality of results of these designs under the specified
parameters.
Design Flow
The diagram below demonstrates the design flow methodology that we used.
A designer uses the Tensilica Processor Generator to configure a microprocessor.
The designer has the ability to do such things as configure caches, configure
memories, set up interrupt mechanisms, add datapath units such as a multiplier
or MAC, and add TIE instructions. Based on the configurations, a compiler
and instruction set simulator are generated. The compiler and instruction
set simulator are used to determine the performance of the applications
on the described architecture. Based on the feedback from the simulator,
the designer can update the architecture and restructure the C code. The
process continues until the designer is satisfied with the results.
Design flow diagram
Designs
This section describes each of our three implementations in detail. The
Viterbi C code and instruction extensions of the first two implementations
are exactly the same; the only difference between the two is the cache
size. The processor configuration of the last design is the same as that
of design 2; however, the extension instructions differ, as they can
hold state.
The Xtensa Architecture is shown below. The Xtensa Core is a RISC architecture.
The Xtensa architecture allows TIE extensions to be added to the
core. The two input registers to the TIE unit, Rs and Rt, are 32-bit registers.
The TIE unit writes its result to the 32-bit result register, Rr. The TIE
unit is controlled by the instruction. TIE instructions must meet the following
criteria:
- single cycle
- state free
- no new exceptions
- no stalls
- typeless data
Hardware Architecture
Our software architecture is composed of six components. The I/O Device
is some device that retrieves data from somewhere. To abstract the I/O
Device in our model, we used file I/O. The ADC was abstracted as software
quantization following Rhett's code. We did not measure the performance
of the I/O Device and ADC since they would not be implemented as software
in an embedded system. We did measure the performance of the following
four components: Init, TraceBack, ACS, and RAM. The Init component initializes
arrays of data and other variables at the beginning of execution. TraceBack
traces back the decisions after every 100 iterations. ACS is the add-compare-select
unit that performs the Viterbi butterfly. RAM is the memory of the system
and acts as a common interface between the modules.
Software Architecture
Design 1: 100 kb/s decoder
This implementation uses instruction extensions and a "lean" processor
to achieve the target data rate at low power and small area. The C code for
this design is here.
Instruction Extensions
The SetBMreg TIE instruction demonstrates a MIMD instruction we used to
pack branch metrics into a register. The four branch metrics are as follows,
where I is the in-phase component and Q is the quadrature component of the
QPSK signal:
bm0 = I + Q
bm1 = I - Q
bm2 = -I + Q
bm3 = -I - Q
An additional 0x7F is added to each in order to normalize the metrics
from -127...127 to 0...254 so that modulo arithmetic is possible.
SetBMReg Instruction Extension
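As a rough software sketch of what this packing instruction computes (the function name and byte order are our own, not the actual TIE source), the four metrics could be formed and packed like this, assuming 6-bit soft inputs so each offset metric fits in a byte:

```c
#include <stdint.h>

/* Software model of the SetBMreg behavior described above (hypothetical
 * names): compute the four QPSK branch metrics from the soft in-phase (i)
 * and quadrature (q) samples, offset each by 0x7F to map -127...127 into
 * 0...254, and pack them into one 32-bit word, one metric per byte.
 * With 6-bit soft inputs (-63...63), every offset metric fits in 8 bits. */
static uint32_t set_bm_reg(int8_t i, int8_t q)
{
    uint8_t bm0 = (uint8_t)( i + q + 0x7F);  /* bm0 =  I + Q */
    uint8_t bm1 = (uint8_t)( i - q + 0x7F);  /* bm1 =  I - Q */
    uint8_t bm2 = (uint8_t)(-i + q + 0x7F);  /* bm2 = -I + Q */
    uint8_t bm3 = (uint8_t)(-i - q + 0x7F);  /* bm3 = -I - Q */
    return ((uint32_t)bm3 << 24) | ((uint32_t)bm2 << 16) |
           ((uint32_t)bm1 <<  8) |  (uint32_t)bm0;
}
```

In hardware, the four adds happen in parallel in one cycle; the C model just makes the packing explicit.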
The four ACS TIE instructions ACS03, ACS12, ACS21, and ACS30 are used
to implement the butterfly network in a Viterbi trellis. We have an instruction
to compute each of the four patterns in the trellis. Each instruction performs
an add-compare-select using the appropriate branch metrics. Modulo arithmetic
(mod 2048) is used to normalize path metrics so they do not overflow. Each
instruction sets a decision bit, recording which branch was taken, as well
as the new path metric. The decision bit is used in TraceBack, and the path
metric is used in the next iteration of the trellis.
ACS Instruction Extension
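A minimal C sketch of one add-compare-select step as described above (the function name and operand layout are ours; the real TIE instructions also encode the trellis pattern, and the modulo-metric comparison must handle wrap-around, which we gloss over here):

```c
#include <stdint.h>

/* One add-compare-select: pm_a and pm_b are the path metrics of the two
 * predecessor states, bm_a and bm_b the branch metrics of the corresponding
 * transitions. Path metrics are kept mod 2048 so they never overflow. The
 * surviving (smaller) candidate becomes the new path metric, and the
 * decision bit records which branch survived, for use in TraceBack. */
static uint32_t acs(uint32_t pm_a, uint32_t bm_a,
                    uint32_t pm_b, uint32_t bm_b, int *decision)
{
    uint32_t cand_a = (pm_a + bm_a) & 0x7FF;  /* mod 2048 */
    uint32_t cand_b = (pm_b + bm_b) & 0x7FF;
    /* Plain comparison for clarity; a true modulo-metric compare must
     * account for wrap-around of the normalized metrics. */
    if (cand_a <= cand_b) { *decision = 0; return cand_a; }
    *decision = 1;
    return cand_b;
}
```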
The Zmask TIE instruction calculates the previous state of the encoding
machine to determine which bits were sent. This was made a TIE instruction
because it packs several register operations into a single cycle. In
addition, this instruction probably requires fewer than 100 gates.
Zmask Traceback Instruction Extension
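A rough C model of the traceback step this instruction supports (the shift convention and names are our assumptions, not the actual Zmask definition): with constraint length 7, the encoder state is L - 1 = 6 bits, and if the encoder shifts each input bit into the LSB, the stored decision bit is the state bit that was shifted out.

```c
/* Recover the previous 6-bit trellis state during traceback, assuming the
 * encoder update was new = ((old << 1) | input) & 0x3F. The decision bit
 * stored by ACS supplies the old state's MSB that was shifted out. */
static unsigned prev_state(unsigned state, unsigned decision)
{
    return (state >> 1) | (decision << 5);
}
```

Walking this function backward over the 100 stored decision vectors yields the decoded bits in reverse order.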
Processor Configuration
The processor was configured with a basic instruction set along with the
above instruction extensions. The processor speed was 100 MHz. There was
no multiplier or MAC unit. For area and power reasons, we configured the
data and instruction caches to be the smallest possible size (1 KB) with
the smallest line size (16 bytes). In addition, the read and write data
bus widths were set as small as possible (32 bits), and the number of
write-buffer entries was set as small as possible (4 address/value pairs).
Design 2: Maximum data rate decoder (within Xtensa's restrictions)
Instruction Extensions
For this design, the new instructions were the same as in the previous design.
Processor Configuration
The processor for this design was the same as in the first design except
for the memory and cache configuration and the processor speed. For this
design, we increased the processor speed to the maximum allowed, 222 MHz.
For performance reasons, we configured the data and instruction caches to
be the largest possible size (16 KB) with the largest line size (64 bytes).
In addition, the read and write data bus widths were set as large as
possible (128 bits), and the number of write-buffer entries was set as
large as possible (32 address/value pairs).
Design 3: Maximum data rate decoder (by allowing instruction extensions
to hold state)
Instruction Extensions
If the Xtensa architecture allowed state in the TIE extensions, then the
following micro-architecture would allow two butterfly networks to be computed
simultaneously, since the BM register could be held as state. The TIE
instructions here were limited by data bandwidth, and state would free up
the needed bandwidth.
ACS Instruction Extension with State
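A rough C model of the idea (all names hypothetical): the packed branch-metric word becomes state held inside the unit, so an ACS instruction no longer spends an input operand on it, and the freed operand slot can carry path metrics for a second butterfly.

```c
#include <stdint.h>

/* Hypothetical stateful TIE unit: the packed branch-metric word lives
 * inside the unit instead of occupying an input register. */
static uint32_t bm_state;                      /* state held across instructions */

static void set_bm_state(uint32_t bm) { bm_state = bm; }

/* One ACS drawing its branch metrics from the held state. sel_a/sel_b
 * pick which of the four packed metric bytes each branch uses; with the
 * BM operand freed, Rs and Rt could feed two butterflies per issue. */
static uint32_t acs_with_state(uint32_t pm_a, unsigned sel_a,
                               uint32_t pm_b, unsigned sel_b, int *decision)
{
    uint32_t bm_a = (bm_state >> (8 * sel_a)) & 0xFF;
    uint32_t bm_b = (bm_state >> (8 * sel_b)) & 0xFF;
    uint32_t cand_a = (pm_a + bm_a) & 0x7FF;   /* path metrics mod 2048 */
    uint32_t cand_b = (pm_b + bm_b) & 0x7FF;
    if (cand_a <= cand_b) { *decision = 0; return cand_a; }
    *decision = 1;
    return cand_b;
}
```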
The C-code for this design is here.
Processor Configuration
The configuration of the processor for this design was exactly the same
as design 2.
Quality of Results
Assumptions
In evaluating the quality of results, we have assumed the following:
- data is retrieved out of memory
- pipeline delay is modeled
- the processor is fabricated in a 0.25 µm process
Parameters
The following parameters were set at the beginning of the project:
- uncoded word length (k) = 1
- coded word length (n) = 2
- constraint length (L) = 7
- branch metric calculation: QPSK
- soft decision wordlength (q) = 6
- chain-backing depth (D) = 100
- generator polynomials: g0 = 171, g1 = 133 (octal)
- bit error rate (BER) = 10^-4
- signal-to-noise ratio (SNR) degradation = 0.05 dB
Performance
To calculate performance numbers, we used Tensilica's custom instruction
set simulator to simulate 10,000 samples. We did not simulate larger
numbers of samples in the interest of run-time (simulating 10,000 samples
took about 30 minutes). To ensure correctness, we compiled the viterbi
decoder for a native machine and ran it over 10,000,000 samples. The BER
was 9.5 x 10^-5. Using the profiling output in conjunction
with a custom version of gprof (GNU code profiling utility), we calculated
performance data. To facilitate estimation from profiling, we structured
the code to easily count the contributing operations and ignore the overhead
(e.g. file I/O). The code was compiled using a custom compiler based on
gcc with the -O2 optimizations turned on (this performs all optimizations
that do not involve a space-time trade-off). For each design, we give performance
data assuming a realistic cache (cache misses modeled) and a perfect cache.
The perfect-cache numbers could be used to estimate performance if Xtensa
supported a streaming data input.
Design 1
|      | realistic cache | perfect cache |
| kb/s | 118             | 409           |
Design 1: Performance Data
Design 2
|      | realistic cache | perfect cache |
| kb/s | 793             | 909           |
Design 2: Performance Data
Design 3
|      | realistic cache | perfect cache |
| kb/s | 966             | 1142          |
Design 3: Performance Data
Power/Energy
Since Tensilica does not provide any tools to calculate dynamic power dissipation,
we only give static power dissipation numbers here. In addition, based
on the number of clock cycles of computation per bit, we calculate the
energy cost per bit. This is calculated as (Core Power) [J/s] * (Clock
Period) [s/clock cycle] * (Cycles/Decoded Bit) [clock cycle/bit] = [J/bit].
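As a sanity check of the formula, a small helper (names are ours) evaluates it for the design 1 realistic-cache figures reported below (48 mW, 10 ns clock period, 841 cycles per decoded bit), which works out to roughly 0.4 µJ per decoded bit:

```c
/* Energy per decoded bit, in joules:
 * (core power) [J/s] * (clock period) [s/cycle] * (cycles per bit). */
static double energy_per_bit(double core_power_w, double clock_period_s,
                             double cycles_per_bit)
{
    return core_power_w * clock_period_s * cycles_per_bit;
}
```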
Design 1
|                       | realistic cache | perfect cache |
| Core Power (mW)       | 48              | 48            |
| Processor Speed (MHz) | 100             | 100           |
| Clock Period (ns)     | 10              | 10            |
| Cycles/Decoded Bit    | 841             | 244           |
| µJ/bit                | 0.40            | 0.12          |
Design 1: Power/Energy Data
Design 2
|                       | realistic cache | perfect cache |
| Core Power (mW)       | 191             | 191           |
| Processor Speed (MHz) | 222             | 222           |
| Clock Period (ns)     | 4.5             | 4.5           |
| Cycles/Decoded Bit    | 280             | 244           |
| µJ/bit                | 0.24            | 0.21          |
Design 2: Power/Energy Data
Design 3
|                       | realistic cache | perfect cache |
| Core Power (mW)       | 191             | 191           |
| Processor Speed (MHz) | 222             | 222           |
| Clock Period (ns)     | 4.5             | 4.5           |
| Cycles/Decoded Bit    | 230             | 194           |
| µJ/bit                | 0.20            | 0.17          |
Design 3: Power/Energy Data
Cost
We included two metrics for cost: code size and die area.
Code Size
The code size for all three designs is given below. To calculate code size,
we dumped the assembly code for main.cc
and viterbi.c compiled with -O2,
so no space-time tradeoffs (such as loop unrolling) were explored. However,
we unrolled the main decoding loop by hand for performance reasons. The
number of assembly instructions times four gives a good estimate of code
size, since each instruction takes 4 bytes. Recall that the code for
designs 1 and 2 is the same.
|                   | Design 1 | Design 2 | Design 3 |
| Code size (bytes) | 2,640    | 2,640    | 2,288    |
Code size cost
Die Area
Given the clock speed, the processor generator estimated the number of
gates our design would take and the die area for a given process.
The table below shows the die area estimates for the different designs
in a 0.25 µm process. The core gate count does not include
the caches, but the die area does.
|                              | Design 1 | Design 2 | Design 3 |
| Gates (NAND2 equivalent)     | 26,480   | 47,098   | 47,098   |
| Core Die Area (mm²)          | 1.21     | 2.15     | 2.15     |
| Core + Caches Die Area (mm²) | 1.96     | 6.55     | 6.55     |
Core Die Area
The above numbers do not include the ISA extensions. Since the processor
generator does not estimate the gate count for the extensions, we estimated
it ourselves. We estimate that the hardware needed to implement the new
instructions (for all designs) requires approximately 3000 gates, assuming
the extensions share resources such as adders. The following table shows the
estimated gate count and die area for the ISA extended design.
|                          | Design 1 | Design 2 | Design 3 |
| Gates (NAND2 equivalent) | 29,480   | 50,098   | 50,098   |
| Total Die Area (mm²)     | 2.10     | 6.69     | 6.69     |
ISA Extended Core Die Area
Design effort
Since we started with a C model of the Viterbi decoder (Rhett's code),
the bulk of our time was spent identifying and evaluating new instructions.
It took us about 150 man-hours to design the Viterbi decoders with ISA
extensions.
Summary
The table below summarizes the results for our different designs.
|                                | Design 1, realistic | Design 1, perfect | Design 2, realistic | Design 2, perfect | Design 3, realistic | Design 3, perfect |
| Viterbi Decoder Speed (kb/s)   | 118   | 409   | 793   | 909   | 966   | 1142  |
| Energy Dissipation (µJ/bit)    | 0.40  | 0.12  | 0.24  | 0.21  | 0.20  | 0.17  |
| Core Power (mW)                | 48    | 48    | 191   | 191   | 191   | 191   |
| Power Density (mW/mm²)         | 22.9  | 22.9  | 28.5  | 28.5  | 28.5  | 28.5  |
| Performance*Energy (µs·µJ/bit) | 3.39  | 0.293 | 0.315 | 0.231 | 0.207 | 0.148 |
| Code Size (bytes)              | 2,640 | 2,640 | 2,640 | 2,640 | 2,288 | 2,288 |
| Die Area in 0.25 µm (mm²)      | 2.10  | 2.10  | 6.69  | 6.69  | 6.69  | 6.69  |
| Total Design Effort            | 150 man-hours for all designs |
Summary
Conclusions
To draw conclusions about the relationship between clock speed and
memory usage, we extrapolated two additional processor designs, design
1+ and design 2-. Design 1+ is identical to design 1, except the
clock has been turned up to 222 MHz. Design 2- is identical to design 2,
except the clock has been turned down to 100 MHz. By calculating energy
data for these designs, we can draw conclusions about the impact of processor
configuration on energy and power.
|                                | Design 1+, realistic | Design 1+, perfect | Design 2-, realistic | Design 2-, perfect |
| Viterbi Decoder Speed (kb/s)   | 263  | 909   | 357   | 409   |
| Energy Dissipation (µJ/bit)    | 0.54 | 0.16  | 0.19  | 0.17  |
| Core Power (mW)                | 144  | 144   | 69    | 69    |
| Power Density (mW/mm²)         | 60.8 | 60.8  | 11.2  | 11.2  |
| Performance*Energy (µs·µJ/bit) | 2.05 | 0.176 | 0.532 | 0.416 |
| Die Area in 0.25 µm (mm²)      | 2.37 | 2.37  | 6.14  | 6.14  |
Extrapolated Processor Data
Charts comparing all designs under different metrics are given below.
Performance Chart
As cache size and clock frequency increase, performance increases.
Energy Dissipation
Increasing the clock frequency increases the power dissipation. Data
bandwidth also has a large impact on power dissipation, as can be seen
by comparing design 1+ and design 2: they have the same clock frequency,
but the cache on design 2 is bigger.
Performance*Energy
As cache size and clock frequency increase, performance*energy improves.
Die Area
From this data, we can conclude that the effect of cache size is greater
than the effect of clock frequency. Design 2- (which runs at a slow clock
speed of 100 MHz but has large 16 KB caches) runs at a higher data rate
than design 1+ (which runs at 222 MHz and has small caches), and it
dissipates energy and power more efficiently than design 1+. Thus, if
design 2- operates at an acceptable data rate, it is the best choice in
terms of performance/power efficiency. If the data is streaming, then
clock frequency is the only useful parameter to vary.
In general, this shows that Tensilica's Xtensa processor can achieve
useful data rates for a Viterbi decoder. In addition, we believe the
power/performance numbers are comparable to those of an off-the-shelf DSP.
Also, if the extensions can hold state, we showed an additional performance
gain. Lastly, core power dissipation could be improved by deleting processor
instructions that are not used; this would simplify the processor control
logic and reduce power dissipation.
References
[1] Xtensa Instruction Set Architecture Reference Manual