Niraj Shah
Scott Weber
EE290a
Viterbi Project Final Report
Overview
We chose to implement a Viterbi decoder using Tensilica's Xtensa configurable
microprocessor with extensions to the instruction set architecture (ISA).
With Tensilica's tools, it is possible to compile C code for a particular
target architecture using native C programming tools (gcc, gdb). It is
also possible to generate an instruction set simulator for the target architecture.
These software tools are used to evaluate the system. Once the design is
finalized, a description of the hardware implementation can be produced.
The goal of this project was to design one implementation with a data
rate of 100 kb/s and another with the highest achievable data rate. In
addition, we designed and characterized a Viterbi decoder that could be
implemented if one of Xtensa's restrictions were relaxed.
The rest of this report is organized as follows: first, we describe the
hardware/software co-design flow with Tensilica's Xtensa processor.
Then, we describe the hardware and software of the designs we built. Lastly,
we present the quality of results of these designs under the specified
parameters.
Design Flow
The diagram below demonstrates the design flow methodology that we used.
A designer uses the Tensilica Processor Generator to configure a microprocessor.
The designer has the ability to do such things as configure caches, configure
memories, set up interrupt mechanisms, add datapath units such as a multiplier
or MAC, and add TIE instructions. Based on the configurations, a compiler
and instruction set simulator are generated. The compiler and instruction
set simulator are used to determine the performance of the applications
on the described architecture. Based on the feedback from the simulator,
the designer can update the architecture and restructure the C code. The
process continues until the designer is satisfied with the results.
Design flow diagram
Designs
This section describes each of our three implementations in detail. The
Viterbi C code and instruction extensions of the first two implementations
are exactly the same; the only difference between the two is the cache
size. The processor configuration of the last design is the same as that
of design 2; however, the extension instructions differ, as they can
hold state.
The Xtensa Architecture is shown below. The Xtensa Core is a RISC architecture.
The Xtensa architecture allows TIE extensions to be added to the
core. The two input registers to the TIE unit, Rs and Rt, are 32-bit registers.
The TIE unit writes its result to the 32-bit result register, Rr. The TIE
unit is controlled by the instruction. TIE instructions must meet the following
criteria:
- single cycle
- state free
- no new exceptions
- no stalls
- typeless data
Hardware Architecture
Our software architecture is composed of six components. The I/O Device
is some device that retrieves data from somewhere. To abstract the I/O
Device in our model, we used file I/O. The ADC was abstracted as software
quantization following Rhett's code. We did not measure the performance
of the I/O Device and ADC since they would not be implemented as software
in an embedded system. We did measure the performance of the following
four components: Init, TraceBack, ACS, and RAM. The Init component initializes
arrays of data and other variables at the beginning of execution. TraceBack
traces back the decisions after every 100 iterations. ACS is the add-compare-select
unit that performs the Viterbi butterfly. RAM is the memory of the system
and acts as a common interface between the modules.
Software Architecture
Design 1: 100 kb/s decoder
This implementation uses instruction extensions and a "lean" processor
to achieve the target data rate at low power and small area. The C code for
this design is here.
Instruction Extensions
The SetBMreg TIE instruction demonstrates a MIMD instruction we used to
pack branch metrics into a register. The four branch metrics are as follows,
where I is the in-phase component and Q is the quadrature component of the
QPSK signal:
bm0 = I + Q
bm1 = I - Q
bm2 = -I + Q
bm3 = -I - Q
An additional 0x7F is added to each in order to normalize the metrics
from -127...127 to 0...254 so that modulo arithmetic is possible.
SetBMReg Instruction Extension
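As a rough software sketch of what this packing instruction computes (the function name and byte order are our own, not the actual TIE source), the four metrics could be formed and packed like this, assuming 6-bit soft inputs so each offset metric fits in a byte:

```c
#include <stdint.h>

/* Software model of the SetBMreg behavior described above (hypothetical
 * names): compute the four QPSK branch metrics from the soft in-phase (i)
 * and quadrature (q) samples, offset each by 0x7F to map -127...127 into
 * 0...254, and pack them into one 32-bit word, one metric per byte.
 * With 6-bit soft inputs (-63...63), every offset metric fits in 8 bits. */
static uint32_t set_bm_reg(int8_t i, int8_t q)
{
    uint8_t bm0 = (uint8_t)( i + q + 0x7F);  /* bm0 =  I + Q */
    uint8_t bm1 = (uint8_t)( i - q + 0x7F);  /* bm1 =  I - Q */
    uint8_t bm2 = (uint8_t)(-i + q + 0x7F);  /* bm2 = -I + Q */
    uint8_t bm3 = (uint8_t)(-i - q + 0x7F);  /* bm3 = -I - Q */
    return ((uint32_t)bm3 << 24) | ((uint32_t)bm2 << 16) |
           ((uint32_t)bm1 <<  8) |  (uint32_t)bm0;
}
```

In hardware, the four adds happen in parallel in one cycle; the C model just makes the packing explicit.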
The four ACS TIE instructions ACS03, ACS12, ACS21, and ACS30 are used
to implement the butterfly network in a Viterbi trellis. We have an instruction
to compute each of the four patterns in the trellis. Each instruction performs
an add-compare-select using the appropriate branch metrics. Modulo arithmetic
(mod 2048) is used to normalize path metrics so they do not overflow. Each
instruction sets a decision bit, recording which branch was taken, as well
as the new path metric. The decision bit is used in TraceBack, and the path
metric is used in the next iteration of the trellis.
ACS Instruction Extension
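A minimal C sketch of one add-compare-select step as described above (the function name and operand layout are ours; the real TIE instructions also encode the trellis pattern, and the modulo-metric comparison must handle wrap-around, which we gloss over here):

```c
#include <stdint.h>

/* One add-compare-select: pm_a and pm_b are the path metrics of the two
 * predecessor states, bm_a and bm_b the branch metrics of the corresponding
 * transitions. Path metrics are kept mod 2048 so they never overflow. The
 * surviving (smaller) candidate becomes the new path metric, and the
 * decision bit records which branch survived, for use in TraceBack. */
static uint32_t acs(uint32_t pm_a, uint32_t bm_a,
                    uint32_t pm_b, uint32_t bm_b, int *decision)
{
    uint32_t cand_a = (pm_a + bm_a) & 0x7FF;  /* mod 2048 */
    uint32_t cand_b = (pm_b + bm_b) & 0x7FF;
    /* Plain comparison for clarity; a true modulo-metric compare must
     * account for wrap-around of the normalized metrics. */
    if (cand_a <= cand_b) { *decision = 0; return cand_a; }
    *decision = 1;
    return cand_b;
}
```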
The Zmask TIE instruction calculates the previous state of the encoding
machine to determine which bits were sent. This was made a TIE instruction
because it packs several register operations into a single cycle. In
addition, this instruction probably requires fewer than 100 gates.
Zmask Traceback Instruction Extension
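A rough C model of the traceback step this instruction supports (the shift convention and names are our assumptions, not the actual Zmask definition): with constraint length 7, the encoder state is L - 1 = 6 bits, and if the encoder shifts each input bit into the LSB, the stored decision bit is the state bit that was shifted out.

```c
/* Recover the previous 6-bit trellis state during traceback, assuming the
 * encoder update was new = ((old << 1) | input) & 0x3F. The decision bit
 * stored by ACS supplies the old state's MSB that was shifted out. */
static unsigned prev_state(unsigned state, unsigned decision)
{
    return (state >> 1) | (decision << 5);
}
```

Walking this function backward over the 100 stored decision vectors yields the decoded bits in reverse order.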
Processor Configuration
The processor was configured with a basic instruction set along with the
above instruction extensions. The processor speed was 100 MHz. There was
no multiplier or MAC unit. For area and power reasons, we configured the
data and instruction caches to be the smallest possible size (1 KB) with
the smallest line size (16 bytes). In addition, the read and write data
bus widths were set as small as possible (32 bits), and the number of
write-buffer entries was set as small as possible (4 address/value pairs).
Design 2: Maximum data rate decoder (within Xtensa's restrictions)
Instruction Extensions
For this design, the new instructions were the same as in the previous design.
Processor Configuration
The processor for this design was the same as in the first design except
for the memory and cache configuration and the processor speed. For this
design, we increased the processor speed to the maximum allowed, 222 MHz.
For performance reasons, we configured the data and instruction caches to
be the largest possible size (16 KB) with the largest line size (64 bytes).
In addition, the read and write data bus widths were set as large as
possible (128 bits), and the number of write-buffer entries was set as
large as possible (32 address/value pairs).
Design 3: Maximum data rate decoder (by allowing instruction extensions
to hold state)
Instruction Extensions
If the Xtensa architecture allowed state in the TIE extensions, then the
following micro-architecture would allow two butterfly networks to be computed
simultaneously, since the BM register could be held as state. The TIE
instructions here were limited by data bandwidth, and state would free up
the needed bandwidth.
ACS Instruction Extension with State
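A rough C model of the idea (all names hypothetical): the packed branch-metric word becomes state held inside the unit, so an ACS instruction no longer spends an input operand on it, and the freed operand slot can carry path metrics for a second butterfly.

```c
#include <stdint.h>

/* Hypothetical stateful TIE unit: the packed branch-metric word lives
 * inside the unit instead of occupying an input register. */
static uint32_t bm_state;                      /* state held across instructions */

static void set_bm_state(uint32_t bm) { bm_state = bm; }

/* One ACS drawing its branch metrics from the held state. sel_a/sel_b
 * pick which of the four packed metric bytes each branch uses; with the
 * BM operand freed, Rs and Rt could feed two butterflies per issue. */
static uint32_t acs_with_state(uint32_t pm_a, unsigned sel_a,
                               uint32_t pm_b, unsigned sel_b, int *decision)
{
    uint32_t bm_a = (bm_state >> (8 * sel_a)) & 0xFF;
    uint32_t bm_b = (bm_state >> (8 * sel_b)) & 0xFF;
    uint32_t cand_a = (pm_a + bm_a) & 0x7FF;   /* path metrics mod 2048 */
    uint32_t cand_b = (pm_b + bm_b) & 0x7FF;
    if (cand_a <= cand_b) { *decision = 0; return cand_a; }
    *decision = 1;
    return cand_b;
}
```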
The C-code for this design is here.
Processor Configuration
The configuration of the processor for this design was exactly the same
as design 2.
Quality of Results
Assumptions
In evaluating the quality of results, we have assumed the following:
- data is retrieved out of memory
- pipeline delay is modeled
- the processor is fabricated in a 0.25 µm process
Parameters
The following parameters were set at the beginning of the project:
- uncoded word length (k) = 1
- coded word length (n) = 2
- constraint length (L) = 7
- branch metric calculation: QPSK
- soft decision wordlength (q) = 6
- chain-backing depth (D) = 100
- generator polynomials: g0 = 171, g1 = 133 (octal)
- bit error rate (BER) = 10^-4
- signal-to-noise ratio (SNR) degradation = 0.05 dB
Performance
To calculate performance numbers, we used Tensilica's custom instruction
set simulator to simulate 10,000 samples. We did not simulate larger
numbers of samples in the interest of run-time (simulating 10,000 samples
took about 30 minutes). To ensure correctness, we compiled the viterbi
decoder for a native machine and ran it over 10,000,000 samples. The BER
was 9.5 x 10^-5. Using the profiling output in conjunction
with a custom version of gprof (GNU code profiling utility), we calculated
performance data. To facilitate estimation from profiling, we structured
the code to easily count the contributing operations and ignore the overhead
(e.g. file I/O). The code was compiled using a custom compiler based on
gcc with the -O2 optimizations turned on (this performs all optimizations
that do not involve a space-time trade-off). For each design, we give performance
data assuming a realistic cache (cache misses modeled) and a perfect cache.
The perfect-cache numbers could be used to estimate performance if Xtensa
supported a streaming data input.
Design 1
|      | realistic cache | perfect cache |
| kb/s | 118             | 409           |
Design 1: Performance Data
Design 2
|      | realistic cache | perfect cache |
| kb/s | 793             | 909           |
Design 2: Performance Data
Design 3
|      | realistic cache | perfect cache |
| kb/s | 966             | 1142          |
Design 3: Performance Data
Power/Energy
Since Tensilica does not provide any tools to calculate dynamic power dissipation,
we only give static power dissipation numbers here. In addition, based
on the number of clock cycles of computation per bit, we calculate the
energy cost per bit. This is calculated as (Core Power) [J/s] * (Clock
Period) [s/clock cycle] * (Cycles/Decoded Bit) [clock cycle/bit] = [J/bit].
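As a sanity check of the formula, a small helper (names are ours) evaluates it for the design 1 realistic-cache figures reported below (48 mW, 10 ns clock period, 841 cycles per decoded bit), which works out to roughly 0.4 µJ per decoded bit:

```c
/* Energy per decoded bit, in joules:
 * (core power) [J/s] * (clock period) [s/cycle] * (cycles per bit). */
static double energy_per_bit(double core_power_w, double clock_period_s,
                             double cycles_per_bit)
{
    return core_power_w * clock_period_s * cycles_per_bit;
}
```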
Design 1
|                       | realistic cache | perfect cache |
| Core Power (mW)       | 48              | 48            |
| Processor Speed (MHz) | 100             | 100           |
| Clock Period (ns)     | 10              | 10            |
| Cycles/Decoded Bit    | 841             | 244           |
| µJ/bit                | 0.40            | 0.12          |
Design 1: Power/Energy Data
Design 2
|                       | realistic cache | perfect cache |
| Core Power (mW)       | 191             | 191           |
| Processor Speed (MHz) | 222             | 222           |
| Clock Period (ns)     | 4.5             | 4.5           |
| Cycles/Decoded Bit    | 280             | 244           |
| µJ/bit                | 0.24            | 0.21          |
Design 2: Power/Energy Data
Design 3
|                       | realistic cache | perfect cache |
| Core Power (mW)       | 191             | 191           |
| Processor Speed (MHz) | 222             | 222           |
| Clock Period (ns)     | 4.5             | 4.5           |
| Cycles/Decoded Bit    | 230             | 194           |
| µJ/bit                | 0.20            | 0.17          |
Design 3: Power/Energy Data
Cost
We included two metrics for cost: code size and die area.
Code Size
The code size for all three designs is given below. To calculate code size,
we dumped the assembly code for main.cc
and viterbi.c compiled with -O2,
so no space-time tradeoffs (such as loop unrolling) were explored. However,
we unrolled the main decoding loop by hand for performance reasons. The
number of assembly instructions times four gives a good estimate of code
size, since each instruction takes 4 bytes. Recall that the code for
designs 1 and 2 is the same.
|                   | Design 1 | Design 2 | Design 3 |
| Code size (bytes) | 2,640    | 2,640    | 2,288    |
Code size cost
Die Area
Given the clock speed, the processor generator estimated the number of
gates our design would take and the die area for a given process.
The table below shows the die area estimates for the different designs
in a 0.25 µm process. The core gate count does not include
the caches, but the die area does.
|                              | Design 1 | Design 2 | Design 3 |
| Gates (NAND2 equivalent)     | 26,480   | 47,098   | 47,098   |
| Core Die Area (mm²)          | 1.21     | 2.15     | 2.15     |
| Core + Caches Die Area (mm²) | 1.96     | 6.55     | 6.55     |
Core Die Area
The above numbers do not include the ISA extensions. Since the processor
generator does not estimate the gate count for the extensions, we estimated
it ourselves. We estimate that the hardware needed to implement the new
instructions (for all designs) requires approximately 3000 gates, assuming
the extensions share resources such as adders. The following table shows the
estimated gate count and die area for the ISA extended design.
|                          | Design 1 | Design 2 | Design 3 |
| Gates (NAND2 equivalent) | 29,480   | 50,098   | 50,098   |
| Total Die Area (mm²)     | 2.10     | 6.69     | 6.69     |
ISA Extended Core Die Area
Design effort
Since we started with a C model of the Viterbi decoder (Rhett's code),
the bulk of our time was spent identifying and evaluating new instructions.
It took us about 150 man-hours to design the Viterbi decoders with ISA
extensions.
Summary
The table below summarizes the results for our different designs.
|                                | Design 1, realistic | Design 1, perfect | Design 2, realistic | Design 2, perfect | Design 3, realistic | Design 3, perfect |
| Viterbi Decoder Speed (kb/s)   | 118   | 409   | 793   | 909   | 966   | 1142  |
| Energy Dissipation (µJ/bit)    | 0.40  | 0.12  | 0.24  | 0.21  | 0.20  | 0.17  |
| Core Power (mW)                | 48    | 48    | 191   | 191   | 191   | 191   |
| Power Density (mW/mm²)         | 22.9  | 22.9  | 28.5  | 28.5  | 28.5  | 28.5  |
| Performance*Energy (µs·µJ/bit) | 3.39  | 0.293 | 0.315 | 0.231 | 0.207 | 0.148 |
| Code Size (bytes)              | 2,640 | 2,640 | 2,640 | 2,640 | 2,288 | 2,288 |
| Die Area in 0.25 µm (mm²)      | 2.10  | 2.10  | 6.69  | 6.69  | 6.69  | 6.69  |
| Total Design Effort            | 150 man-hours for all designs |
Summary
Conclusions
To draw conclusions about the relationship between clock speed and
memory usage, we extrapolated two additional processor designs, design
1+ and design 2-. Design 1+ is identical to design 1, except the
clock has been turned up to 222 MHz. Design 2- is identical to design 2,
except the clock has been turned down to 100 MHz. By calculating energy
data for these designs, we can draw conclusions about the impact of processor
configuration on energy and power.
|                                | Design 1+, realistic | Design 1+, perfect | Design 2-, realistic | Design 2-, perfect |
| Viterbi Decoder Speed (kb/s)   | 263  | 909   | 357   | 409   |
| Energy Dissipation (µJ/bit)    | 0.54 | 0.16  | 0.19  | 0.17  |
| Core Power (mW)                | 144  | 144   | 69    | 69    |
| Power Density (mW/mm²)         | 60.8 | 60.8  | 11.2  | 11.2  |
| Performance*Energy (µs·µJ/bit) | 2.05 | 0.176 | 0.532 | 0.416 |
| Die Area in 0.25 µm (mm²)      | 2.37 | 2.37  | 6.14  | 6.14  |
Extrapolated Processor Data
Charts comparing all designs under different metrics are given below.
Performance Chart
As cache size and clock frequency increase, performance increases.
Energy Dissipation
Increasing the clock frequency increases the power dissipation. Data
bandwidth also has a large impact on power dissipation, as can be seen
by comparing design 1+ and design 2: they have the same clock frequency,
but the cache on design 2 is bigger.
Performance*Energy
As cache size and clock frequency increase, performance*energy improves.
Die Area
From this data, we can conclude that the effect of cache size is greater
than the effect of clock frequency. Design 2- (which runs at a slow clock
speed of 100 MHz but has large 16 KB caches) runs at a higher data rate
than design 1+ (which runs at 222 MHz and has small caches), and it
dissipates energy and power more efficiently than design 1+. Thus, if
design 2- operates at an acceptable data rate, it is the best choice in
terms of performance/power efficiency. If the data is streaming, then
clock frequency is the only useful parameter to vary.
In general, this shows that Tensilica's Xtensa processor can achieve
useful data rates for a Viterbi decoder. In addition, we believe the
power/performance numbers are comparable to those of an off-the-shelf DSP.
Also, if the extensions can hold state, we showed an additional performance
gain. Lastly, core power dissipation could be improved by deleting processor
instructions that are not used; this would simplify the processor control
logic and reduce power dissipation.
References
[1] Xtensa Instruction Set Architecture Reference Manual