#### CS 152 Computer Architecture and Engineering

#### Introduction to Architectures for Digital Signal Processing

#### Nov. 12, 1997 Bob Brodersen (http://infopad.eecs.berkeley.edu)

# **Processor Applications**





#### World's Cellular Subscribers



Source: Ericsson Radio Systems, Inc.



Embedded applications



# Requirements of the Embedded Processors

- Optimized for a single program, with code often in on-chip ROM or off-chip EPROM
- Minimum code size (one of the motivations initially for Java)
- Performance obtained by optimizing datapath
- Low cost
  - Lowest possible area
  - Technology behind the leading edge
  - High level of integration of peripherals (reduces system cost)
- Fast time to market
  - Compatible architectures (e.g. ARM) allow reusable code
  - Customizable core
- Low power if application requires portability



# Another figure of merit: computation per unit area



# National Semiconductor - Embedded Processor Family



- Simple architecture
- 3-stage pipeline: fetch, decode, execute
- Minimum power and size
  - Short pipeline avoids branch prediction and bypass
  - Versions range from 8 to 64 bits; choose the minimum that meets requirements

#### Code size



- If a majority of the chip is the program stored in ROM, then code size is a critical issue
- The Piranha has three instruction sizes: a basic 2-byte form, and 2 bytes plus a 16-bit or 32-bit immediate

### Example application (single chip system)



- Vector instructions directly supported
- Pipelined datapath supports single-cycle multiply, add, shift, load/store and pointer adjustment
- Operates in parallel to processor core
- Saturation, overflow and rounding for ALU operations
- Automatic support for cyclic buffers (modulo arithmetic)
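The saturation behaviour listed above can be illustrated with a minimal sketch. The 16-bit word width and the function names are assumptions made for illustration; the slides do not state the datapath width of the National DSP module.

```python
# Assumed 16-bit signed word; the actual datapath width is not given in the slides.
INT16_MIN, INT16_MAX = -32768, 32767


def saturating_add16(a: int, b: int) -> int:
    """Add two 16-bit signed values, clamping to the range limits on overflow."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def wrapping_add16(a: int, b: int) -> int:
    """Two's-complement wraparound add, shown only for comparison."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s


if __name__ == "__main__":
    print(saturating_add16(30000, 10000))  # clamps at 32767
    print(wrapping_add16(30000, 10000))    # wraps to -25536
```

Clamping keeps a loud signal at full scale, whereas wraparound flips its sign, which is usually far more audible or destabilizing in a filter.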

### The National DSP Module Architecture



# The 486 "Embedded Processor": Look familiar???




### The "Embedded" Features of the 486 GX

- Said to be designed "for embedded battery-operated and hand-held applications" (???)
- Fully static design (clock can stop and all state is kept)
- "Auto Clock Freeze" stops circuits which are not being used in a given instruction (gated clocks)
- Stop Clock mode (60 $\mu$W); Stop Grant mode, in which the clock runs but no program executes (40-85 mW)
- Split power supply: 2.0-3.3 V core, 3.3 V I/O

Embedded Ultra-Low Power Intel486<sup>™</sup> GX Processor


# Power = $C V^2 f_{clock}$

Table 17. Active I<sub>CC</sub> Values

T<sub>CASE</sub>=0 °C to +85 °C
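To make the $C V^2 f_{clock}$ relation concrete, here is a small sketch of how dynamic power scales with supply voltage. The capacitance and clock values are hypothetical placeholders, not figures from the Intel486 GX data sheet; only the 2.0 V and 3.3 V core voltages come from the slides.

```python
def dynamic_power(c_farads: float, v_volts: float, f_hz: float) -> float:
    """Dynamic switching power in watts: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hz


if __name__ == "__main__":
    C_EFF = 1e-9   # hypothetical effective switched capacitance (1 nF)
    F_CLK = 33e6   # hypothetical clock frequency (33 MHz)

    p_high = dynamic_power(C_EFF, 3.3, F_CLK)
    p_low = dynamic_power(C_EFF, 2.0, F_CLK)
    # The ratio depends only on the voltages: (2.0 / 3.3)^2, roughly 0.37
    print(p_low / p_high)
```

Because power depends on the square of the voltage, dropping the core from 3.3 V to 2.0 V at the same clock cuts dynamic power to a little over a third, independent of the assumed capacitance.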



# Characterizing programs for their energy consumption



The top four functions account for 90% of the power, and 65% of the power dissipation is in dot-vector products (data obtained from profiling C++ code, weighted with estimated instruction energy costs).

# An architecture optimized for multiply-accumulate



Energy/Flexibility Tradeoffs

ARM 6 core (5V, 20 MHz): 0.02 MIPS/mW

ZSP DSP Superscalar (3V, 200 MHz): 0.3 MOPS/mW

Reconfigurable Dot-Vector Processor (1.5V, 30 MHz): 5.9 MIPS/mW

\* MOPS = millions of operations/sec = millions of MACS/sec
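The efficiency figures above can be read as energy per operation, since 1 MOPS/mW equals $10^9$ operations per joule, or 1 nJ per operation. The sketch below does that conversion; it assumes, as the slide does implicitly, that the MIPS and MOPS figures can be compared directly. The function name is illustrative.

```python
def nanojoules_per_op(mops_per_mw: float) -> float:
    """Convert an efficiency in MOPS/mW (or MIPS/mW) to energy per operation in nJ.

    1 MOPS/mW = 1e6 ops/s / 1e-3 W = 1e9 ops/J, so energy per op = 1 / x nJ.
    """
    ops_per_joule = mops_per_mw * 1e9
    return 1e9 / ops_per_joule


if __name__ == "__main__":
    # Efficiency figures taken from the slide above.
    for name, eff in [("ARM 6 core", 0.02),
                      ("ZSP DSP", 0.3),
                      ("Reconfigurable dot-vector processor", 5.9)]:
        print(f"{name}: {nanojoules_per_op(eff):.3f} nJ/op")
```

On these numbers the ARM 6 core spends about 50 nJ per operation and the reconfigurable processor about 0.17 nJ, a gap of roughly 295x.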

## **DSP** Application - equalization



- The audio data streams from the source (computer) through the digital analysis and synthesis
- Hard real-time requirement: the processing must be done at the sample rate

# Common DSP algorithms and applications

- Applications
  - Instrumentation and measurement
  - Communications
  - Audio and video processing
  - Graphics, image enhancement, 3-D rendering
  - Navigation, radar, GPS
  - Control: robotics, machine vision, guidance
- Algorithms
  - Frequency domain filtering: FIR and IIR
  - Frequency-time transformations: FFT
  - Correlation

#### Sampled data processing



This analog circuit is really just a solution of the differential equation, calculated using the physics of electric fields and currents:

$$RC\frac{dV_{out}}{dt} + V_{out}(t) = V_{in}(t)$$

To implement this digitally we need to convert this expression to discrete time. First we need to convert from a continuous time representation of the signal to discrete time sequences: $V_{out}(t) \rightarrow Y_1, Y_2, Y_3, \dots, Y_n$ and $V_{in}(t) \rightarrow X_1, X_2, X_3, \dots, X_n$

#### Discrete time representation

The sampled version of $V_{in}(t)$ is a sequence of numbers 6, 8, 4, 12, .... This then provides the input to the digital signal processing algorithm.

- Now what is the processing that goes on to implement the filtering?
- Using a discrete approximation to the derivative we obtain the discrete time equivalent of the continuous time differential equation:

$$RC\left(\frac{Y_n - Y_{n-1}}{\Delta t}\right) + Y_{n-1} = X_{n-1}$$

#### A computational structure

This can be rewritten as:

$$Y_{n} = \left(1 - \frac{\Delta t}{RC}\right)Y_{n-1} + \left(\frac{\Delta t}{RC}\right)X_{n-1} = \mathbf{a}Y_{n-1} + \mathbf{b}X_{n-1}$$

Since the new sample is a function only of past samples, it can be computed using the following procedure:
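A minimal sketch of that procedure in software, assuming the recurrence $Y_n = aY_{n-1} + bX_{n-1}$ with $a = 1 - \Delta t/RC$ and $b = \Delta t/RC$ as derived above. The function name, the zero initial output, and the choice $\Delta t/RC = 0.5$ in the usage are illustrative assumptions.

```python
def rc_lowpass(x, dt, rc, y0=0.0):
    """Discrete-time RC low-pass: Y[n] = a*Y[n-1] + b*X[n-1].

    x  : list of input samples X[0], X[1], ...
    dt : sample period (same time units as rc)
    rc : RC time constant
    y0 : assumed initial output Y[0]
    """
    a = 1.0 - dt / rc
    b = dt / rc
    y = [y0]
    for n in range(1, len(x)):
        # Each new output uses only the previous output and previous input.
        y.append(a * y[n - 1] + b * x[n - 1])
    return y


if __name__ == "__main__":
    # Input sequence from the slides, with dt/RC = 0.5 chosen for illustration.
    print(rc_lowpass([6, 8, 4, 12], dt=1.0, rc=2.0))  # [0.0, 3.0, 5.5, 4.75]
```

This forward-difference approximation is only stable when $\Delta t < 2RC$, and it tracks the analog response closely only when $\Delta t$ is much smaller than $RC$.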



Direct mapping architecture



- These calculations must finish within every sample period, since $Y_n$ depends on $Y_{n-1}$ and new data arrives continuously => hard real-time requirement
- In each sample period there are two multiplies and one add (accumulate).
- We could map this structure directly into hardware: the delay becomes a pipeline register, and we would need two multipliers and an adder. This is the most direct approach, with almost no control but also no flexibility

#### Filter structures




# Mapping of the filter onto a DSP execution unit





- The critical hardware unit in a DSP is the multiplier; much of the architecture is organized around allowing use of the multiplier on every cycle
- This means providing two operands on every cycle, through multiple data and address busses, multiple address units and local accumulator feedback

# IIR and FIR filters

- Infinite Impulse Response (IIR) filter: has a feedback loop, and the response to an impulse goes on forever



- The impulse response completely characterizes the filter response, so a more direct (purely digital) approach is the finite impulse response (FIR) filter.





- FIR filters are a very general structure and form the base of much more sophisticated processing, e.g. adaptive filters, which make 56 kbit modems possible

Transformations result in different critical paths for direct map architectures
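The FIR structure described above is a chain of multiply-accumulates over the most recent input samples, $y_n = \sum_k h_k x_{n-k}$. A minimal direct-form sketch follows; the function name and the tap values in the usage are illustrative, not taken from the slides.

```python
def fir_filter(x, h):
    """Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k].

    x : list of input samples
    h : list of filter coefficients (the impulse response)
    Samples before the start of x are treated as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


if __name__ == "__main__":
    # Feeding in a unit impulse returns the coefficients themselves,
    # which is why the response is "finite".
    print(fir_filter([1, 0, 0, 0], [0.5, 0.3, 0.2]))
    # Two-tap moving average of the sample sequence used earlier.
    print(fir_filter([6, 8, 4, 12], [0.5, 0.5]))  # [3.0, 7.0, 6.0, 8.0]
```

The inner loop performs one multiply-accumulate per tap, which is the operation a DSP datapath is built to issue every cycle.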



### Delay Lines



### FFT support

 "Flow diagram" of FFT algorithm - again based on multiply adds



[Figure: FFT flow diagram for 8 points, with inputs x[0] to x[7], outputs X[k], and twiddle factors $W_N^0$ to $W_N^3$]

Bit reversed addressing - what is the pattern? 000 → 000, 001 → 100, 010 → 010, 011 → 110
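The pattern is that each index is rewritten with its binary digits in reverse order. A small sketch, with an illustrative function name, that generates the bit-reversed ordering for an 8-point FFT:

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return index with its lowest `bits` binary digits reversed."""
    reversed_index = 0
    for _ in range(bits):
        # Shift the result left and append the current lowest bit of index.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


if __name__ == "__main__":
    # 3 address bits for an 8-point FFT.
    print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

In software this costs a loop per address; a DSP address unit produces the same sequence with no extra cycles, typically by propagating the carry in the reverse direction.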

### Address calculation unit for DSP

#### ADDRESS CALCULATION UNIT



- Supports modulo and bit reversal arithmetic
- Often duplicated to calculate multiple addresses per cycle
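The modulo arithmetic supported by the address unit is what makes a delay line cheap: the pointer wraps around a fixed buffer instead of the data being shifted. The class below is a hypothetical software sketch of that idea, not a model of any particular DSP's address unit.

```python
class DelayLine:
    """Circular buffer holding the most recent `length` samples."""

    def __init__(self, length: int):
        self.buf = [0.0] * length
        self.head = 0  # index of the most recently written sample

    def push(self, sample: float) -> None:
        # Modulo update: the pointer wraps to 0 after the last slot.
        self.head = (self.head + 1) % len(self.buf)
        self.buf[self.head] = sample

    def tap(self, delay: int) -> float:
        """Return the sample written `delay` pushes ago (0 = newest)."""
        return self.buf[(self.head - delay) % len(self.buf)]


if __name__ == "__main__":
    line = DelayLine(3)
    for s in [1.0, 2.0, 3.0, 4.0]:
        line.push(s)
    print(line.tap(0), line.tap(1), line.tap(2))  # 4.0 3.0 2.0
```

Each new sample overwrites the oldest one, so no data moves; only the pointer changes, and in hardware the wraparound costs no extra instructions.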

# Let's look at an application - Supporting the Road Warrior of 1999



# Physical Layer Standards

| Parameter                     | AMPS    | IS54                           | GSM                                     | JCP                             | DECT                    | CT2                     | PHP                        | 802.11FH |
|-------------------------------|---------|--------------------------------|-----------------------------------------|---------------------------------|-------------------------|-------------------------|----------------------------|----------|
| Origin                        | EIA/TIA | EIA/TIA                        | ETSI                                    |                                 | ETSI                    | UK                      | Japan                      | IEEE     |
| Access                        | FDD     | FDM/FDD/TDM                    | FDM/FDD/TDM                             |                                 | FDM/TDM/TDD             | FDM/TDD                 | TDM/TDD                    | FH/FDM   |
| Modulation<br>Baseband filter | FM      | pi/4QPSK<br>Root raised cosine | GMSK, diff<br>Root raised cos. beta=0.3 | pi/4DQPSK<br>Root raised cosine | GFSK<br>Gaussian BT=0.5 | GFSK<br>Gaussian BT=0.5 | pi/4-DQPSK<br>Root Nyquist | (G)FSK   |

Software radio solution?





- Convert to digital representation as close to the antenna as possible
- Determine the best architecture to perform the DSP (FFTs, filters, correlators, ...)

Example of the digital processing - Direct sequence spread spectrum (CDMA)

- Modulator (transmit side): the data input (bit period $t_{bit}$) is combined with a spreading code (chip period $t_{chip}$) to produce the spread output data
- Demodulator (receive side): a correlator is needed to decode the data
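The spreading and correlation steps can be sketched as follows, using ±1 values for both data bits and chips. The 7-chip code and the function names are illustrative assumptions; the slides do not specify the spreading code or its length.

```python
def spread(bits, code):
    """Transmit side: replace each data bit (+1/-1) by bit * code, chip by chip."""
    return [b * c for b in bits for c in code]


def despread(chips, code):
    """Receive side: correlate each code-length block of chips with the code."""
    n = len(code)
    decoded = []
    for i in range(0, len(chips), n):
        block = chips[i:i + n]
        correlation = sum(ch * c for ch, c in zip(block, code))
        # The sign of the correlation gives the decoded bit.
        decoded.append(1 if correlation >= 0 else -1)
    return decoded


if __name__ == "__main__":
    CODE = [1, 1, 1, -1, -1, 1, -1]  # hypothetical 7-chip spreading code
    data = [1, -1, 1]

    tx = spread(data, CODE)
    rx = list(tx)
    rx[0] = -rx[0]  # flip one chip to mimic a channel error
    print(despread(rx, CODE))  # [1, -1, 1]
```

With 7 chips per bit, a single flipped chip reduces the correlation magnitude from 7 to 5 but leaves its sign, and therefore the decoded bit, unchanged.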




# Efficiency of direct mapping - CDMA digital baseband architecture



# Summary: How is DSP different?

- Essentially infinite streams of data which need to be processed in real time
- Relatively small programs and data storage requirements
- Intensive arithmetic processing with low amount of control and branching (in the critical loops)
- High amount of I/O with analog interface
- Loosely coupled multiprocessor operation

# Summary: How are DSP µPs different?

- Single cycle multiply accumulate (multiple busses and array multipliers)
- Complex instructions for standard DSP functions (IIR and FIR filters, convolvers)
- Specialized memory addressing
  - Bit reversal (FFT)
  - Modular arithmetic for circular buffers (delay lines)
- Zero overhead loops and repeat instructions
- I/O support
  - Serial and parallel ports
  - DMA
  - A/D and D/A interface
- Limited use of data and instruction caches
- Compiler support for hazard elimination

# Tradeoff between high performance µPs and DSPs

- Advantages of General Purpose µPs
  - High volume production advantages
  - High level language and tool support
  - Efficient implementation of non-DSP tasks
  - Higher clock rates and more advanced technology
- Advantages of DSP µPs
  - Software and development support for signal processing applications (filters, FFTs, etc.)
  - Real Time OS and application libraries
  - Minimal support chips
  - Variety of versions allow cost/performance/power tradeoffs
  - Low cost