University of California, Berkeley College of Engineering Computer Science Division | EECS

Spring 1998 D.A. Patterson

Quiz 1 Solutions CS252 Graduate Computer Architecture

Notes for future semesters: This quiz was long. If we were going to give this quiz again, we would probably drop the third part of question 2, and parts (b), (c), and (i) of question 3.

## Question 1

A certain system with a 350 MHz clock uses separate data and instruction caches and a unified second-level cache. The first-level data cache is a direct-mapped, write-through, write-allocate cache with 8kBytes of data total and 8-Byte blocks, and has a perfect write buffer (never causes any stalls). The first-level instruction cache is a direct-mapped cache with 4kBytes of data total and 8-Byte blocks. The second-level cache is a two-way set associative, write-back, write-allocate cache with 2MBytes of data total and 32-Byte blocks.

The first-level instruction cache has a miss rate of  $2\%$ . The first-level data cache has a miss rate of 15%. The unified second-level cache has a local miss rate of 10%. Assume that  $40\%$  of all instructions are data memory accesses; 60% of those are loads, and 40% are stores. Assume that 50% of the blocks in the second-level cache are dirty at any time. Assume that there is no optimization for fast reads on an L1 or L2 cache miss.

All first-level cache hits cause no stalls. The second-level hit time is 10 cycles. (That means that the L1 miss penalty, assuming a hit in the L2 cache, is 10 cycles.) Main memory access time is 100 cycles to the first bus width of data; after that, the memory system can deliver consecutive bus widths of data on each following cycle. Outstanding non-consecutive memory requests can not overlap; an access to one memory location must complete before an access to another memory location can begin. There is a 128-bit bus from memory to the L2 cache, and a 64-bit bus from both L1 caches to the L2 cache. Assume a perfect TLB for this problem (never causes any stalls).

a) (2 points) What percent of all data memory references cause a main memory access (main memory is accessed before the memory request is satisfied)? First show the equation, then the numeric result.

If you did not treat all stores as L1 misses: $(\text{L1 miss rate}) \times (\text{L2 miss rate}) = (0.15) \times (0.10) = 1.5\%$

If you treated all stores as L1 misses: $(\text{\% of data refs that are writes}) \times (\text{L2 miss rate}) + (\text{\% of data refs that are reads}) \times (\text{L1 miss rate}) \times (\text{L2 miss rate}) = (0.4)(0.1) + (0.6)(0.15)(0.1) = 4.9\%$
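
The two answers to part (a) can be checked with a few lines of arithmetic. This is only a sketch; the constant and function names are made up for illustration, and the only inputs are the miss rates and the load/store mix given in the problem statement.

```python
# Fraction of data references that reach main memory (part a).
L1_DATA_MISS_RATE = 0.15   # first-level data cache miss rate
L2_LOCAL_MISS_RATE = 0.10  # local miss rate of the unified L2
LOAD_FRACTION = 0.60       # share of data references that are loads
STORE_FRACTION = 0.40      # share of data references that are stores


def main_memory_fraction(stores_always_miss_l1: bool) -> float:
    """Return the fraction of data references that access main memory."""
    if stores_always_miss_l1:
        # Every store goes to L2; only loads are filtered by the L1 miss rate.
        store_part = STORE_FRACTION * L2_LOCAL_MISS_RATE
        load_part = LOAD_FRACTION * L1_DATA_MISS_RATE * L2_LOCAL_MISS_RATE
        return store_part + load_part
    # Loads and stores are treated alike.
    return L1_DATA_MISS_RATE * L2_LOCAL_MISS_RATE


print(f"{main_memory_fraction(False):.1%}")  # stores not treated as misses
print(f"{main_memory_fraction(True):.1%}")   # stores treated as misses
```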

b) (3 points) How many bits are used to index each of the caches? Assume the caches are presented physical addresses.

- Data: 8K/8 = 1024 blocks = 10 bits
- Inst: 4K/8 = 512 blocks = 9 bits
- L2: 2M/32 = 64K blocks = 32K sets = 15 bits
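
The index widths follow from the number of sets in each cache. A minimal sketch, assuming power-of-two sizes; `index_bits` is a hypothetical helper name, not part of the quiz.

```python
def index_bits(capacity_bytes: int, block_bytes: int, ways: int = 1) -> int:
    """Number of address bits needed to select a set."""
    sets = capacity_bytes // (block_bytes * ways)
    # For a power of two, bit_length() - 1 is the exact base-2 logarithm.
    return sets.bit_length() - 1


print(index_bits(8 * 1024, 8))               # L1 data, direct-mapped
print(index_bits(4 * 1024, 8))               # L1 instruction, direct-mapped
print(index_bits(2 * 1024 * 1024, 32, 2))    # L2, two-way set associative
```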


c) (3 points) How many cycles can the longest possible data memory access take? Describe (briefly) the events that occur during this access.

L1 miss, L2 miss, writeback.

Note that the time to read an  $L2$  cache line from memory is 101 cycles (the first 16 B returns in 100 cycles; the next 16 return the next cycle).
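
The 101-cycle figure is the first-access latency plus one cycle for each additional bus width. A small sketch of that rule; the function name and parameter names are illustrative only.

```python
def memory_transfer_cycles(block_bytes: int, bus_bytes: int,
                           first_access_cycles: int = 100) -> int:
    """Cycles to move one block over the memory bus."""
    transfers = -(-block_bytes // bus_bytes)  # ceiling division
    # The first bus width arrives after the access time; each later one
    # arrives on the following cycle.
    return first_access_cycles + (transfers - 1)


print(memory_transfer_cycles(32, 16))  # 32-byte L2 block over a 128-bit bus
```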

d) (4 points) What is the average memory access time in cycles (including instruction and data memory references)? First show the equation, then the numeric result.

If you did not treat all stores as L1 misses:

$$
\text{AMAT}_{\text{total}} = \frac{1}{1.4}\,\text{AMAT}_{\text{inst}} + \frac{0.4}{1.4}\,\text{AMAT}_{\text{data}}
$$

$$
\text{AMAT} = (\text{L1 hit time}) + (\text{L1 miss rate}) \times \left[(\text{L2 hit time}) + (\text{L2 miss rate}) \times (\text{mem transfer time})\right]
$$

$$
\text{AMAT}_{\text{inst}} = 1 + 0.02\,(10 + 0.10 \times 1.5 \times 101) = 1.503
$$

$$
\text{AMAT}_{\text{data}} = 1 + 0.15\,(10 + 0.10 \times 1.5 \times 101) = 4.773
$$

$$
\text{AMAT}_{\text{total}} = 2.44
$$

Note that the mem transfer time is multiplied by 1.5 to account for writebacks in the L2 cache.
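
As a check on the 2.44 result, the calculation can be sketched as below. The weights assume one instruction fetch plus 0.4 data references per instruction, and a hit time of 1 cycle as used in the AMATinst line; the names are illustrative.

```python
L1_HIT_TIME = 1          # cycles, as used in the solution's formula
L2_HIT_TIME = 10         # cycles
L2_MISS_RATE = 0.10      # local miss rate
WRITEBACK_FACTOR = 1.5   # 50% of L2 blocks are dirty
MEM_TRANSFER = 101       # cycles to move one L2 block


def amat(l1_miss_rate: float) -> float:
    """Average memory access time for one kind of reference."""
    l1_miss_penalty = L2_HIT_TIME + L2_MISS_RATE * WRITEBACK_FACTOR * MEM_TRANSFER
    return L1_HIT_TIME + l1_miss_rate * l1_miss_penalty


amat_inst = amat(0.02)
amat_data = amat(0.15)
# 1 instruction reference and 0.4 data references per instruction.
amat_total = (1.0 * amat_inst + 0.4 * amat_data) / 1.4
print(round(amat_inst, 3), round(amat_data, 3), round(amat_total, 2))
```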

If you treated all stores as L1 misses:

$$
\text{AMAT}_{\text{total}} = \frac{1}{1.4}\,\text{AMAT}_{\text{inst}} + \frac{0.24}{1.4}\,\text{AMAT}_{\text{loads}} + \frac{0.16}{1.4}\,\text{AMAT}_{\text{stores}}
$$

$$
\text{AMAT} = (\text{L1 hit time}) + (\text{L1 miss rate}) \times \left[(\text{L2 hit time}) + (\text{L2 miss rate}) \times (\text{mem transfer time})\right]
$$

$$
\text{AMAT}_{\text{inst}} = 1 + 0.02\,(10 + 0.10 \times 1.5 \times 101) = 1.503
$$

$$
\text{AMAT}_{\text{loads}} = 1 + 0.15\,(10 + 0.10 \times 1.5 \times 101) = 4.773
$$

$$
\text{AMAT}_{\text{stores}} = 1 + 1\,(10 + 0.10 \times 1.5 \times 101) = 26.15
$$

$$
\text{AMAT}_{\text{total}} = 4.88
$$

Note that the mem transfer time is multiplied by 1.5 to account for writebacks in the L2 cache.
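
The 4.88 result can be checked the same way. This sketch assumes that stores see an L1 miss rate of 100%, that loads are 60% and stores 40% of the 0.4 data references per instruction, and a hit time of 1 cycle; the names are illustrative.

```python
L1_HIT_TIME = 1          # cycles, as used in the solution's formula
L2_HIT_TIME = 10         # cycles
L2_MISS_RATE = 0.10      # local miss rate
WRITEBACK_FACTOR = 1.5   # 50% of L2 blocks are dirty
MEM_TRANSFER = 101       # cycles to move one L2 block


def amat(l1_miss_rate: float) -> float:
    """Average memory access time for one kind of reference."""
    l1_miss_penalty = L2_HIT_TIME + L2_MISS_RATE * WRITEBACK_FACTOR * MEM_TRANSFER
    return L1_HIT_TIME + l1_miss_rate * l1_miss_penalty


amat_inst = amat(0.02)
amat_loads = amat(0.15)
amat_stores = amat(1.0)  # assumption: every store misses in L1

# Per instruction: 1 fetch, 0.4 * 0.6 loads, 0.4 * 0.4 stores.
amat_total_stores_miss = (1.0 * amat_inst
                          + 0.24 * amat_loads
                          + 0.16 * amat_stores) / 1.4
print(round(amat_stores, 2), round(amat_total_stores_miss, 2))
```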

## Question 2: Tomasulo's Revenge

Using the DLX code shown below, show the state of the Reservation stations, Reorder buffers, and FP register status for a speculative processor implementing Tomasulo's algorithm. Assume the following:

- Only one instruction can issue per cycle.
- 
- The reorder buffer implements the functionality of the load buffers and store buffers.
- All FUs are fully pipelined.
- There are 2 FP multiply reservation stations.
- 
- There are 3 integer reservation stations, which also execute load and store instructions.
- No exceptions occur during the execution of this code.
- All integer operations require 1 execution cycle. Memory requests occur and complete in this cycle. (For this problem, assume that, barring structural hazards, loads issue in one cycle, execute in the next, write in the third, and a dependent instruction can start execution on the fourth.)
- All FP multiply operations require 4 execution cycles.
- All FP addition operations require 2 execution cycles.
- On a CDB write conflict, the instruction issued earlier gets priority.
- Execution for a dependent instruction can begin on the cycle after its operand is broadcast on the CDB.
- If any item in the tables is no longer in use, update the row to reflect this, but you should not erase any other information in the row (unless another instruction then overwrites that information).
- Assume that all reservation stations, reorder buffers, and functional units were empty and not busy when the code shown below began execution.
- The "Value" column gets updated when the value is broadcast on the CDB.

Integer registers are not shown, and you do not have to show their state.

 $Lp$ :

a) (4 points) The tables below show the state after the cycle in which the second SUBI from the code below issued. Show the state after the next cycle.









b) (8 points) The tables below show the state during the cycle in which the second MULTD from the code below issued. Show the state after *two* cycles.









c) (8 points) The tables below show the state after the cycle in which the SUB from the code below issued. Show the state after the next *four* cycles.

|  | Operands |
|--|----------|
|  | F4, F0, F0 |
|  | F4, F4, F2 |
|  | R2, R2, 1 |
|  | R3, R3, 1 |
|  | F2, F6, F8 |
|  | F0, F6, F8 |







## Question 3

Examine the two architectures below.

The first architecture is a 25 MHz 3-stage DSP processor. A block diagram showing some of the fully-bypassed datapath is shown below.

Figure 1: The DSP block diagram

The three stages are fetch, decode (where branches are evaluated and the PC updated), and execute (where memory and register writes also occur). The processor is able to multiply, accumulate, and shift during its execute stage. It has the same load, store, and branch instructions as DLX. It also includes an LT instruction, which loads a value into a register from memory, and decrements the base register to the next element. The arithmetic operations are slightly different:

- Register 0 always contains the value 0
- Register 1 always contains the value 1
- The result from the shifter is always written to the accumulator on arithmetic operations
- Operations can be specied as MAC W, X, Y, Z, S, where W is the register to be written; X and Y are registers that go to the multiplier; Z is the register that goes to the alu; and S specifies the amount to right shift the result.
- Operations can also be specied as MACA W, X, Y, S, where W is the register to be written; X and Y are registers that go to the multiplier; the accumulator goes to the alu; and S specifies the amount to right shift the result.

The second architecture is a 100 MHz vector processor with an MVL of 64 elements. It has one FP add/subtract FU, one FP multiply/divide FU, and a single memory FU. The startup overhead is 5 cycles for add, subtract, multiply, and divide instructions, and 10 cycles for memory instructions. It supports flexible chaining but not tailgating.

Here is the code for the DSP:

    LP: LT   R2, 0(R5)            # Load R2 with new value
        MAC  R0, R10, R2, R0, 0   # Perform the calculation
        MACA R0, R11, R2, 0
        MACA R2, R12, R2, 0
        BNEZ R5, LP
        SW   R2, -4(R5)           # Delayed branch

a) (2 points) What is the peak performance, in results per second, of the above three-tap filter?

3 results per loop, 6 instructions per loop, 25 MHz: (3/6) × 25 MHz = 12.5M results per second

b) (2 points) What would be the peak performance, in results per second, of the above code, if it was a five-tap filter?

5 results per loop, 8 instructions per loop, 25 MHz: (5/8) × 25 MHz = 15.625M results per second
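
Both peak-performance answers follow the same pattern. The sketch below assumes one instruction completes per cycle and that each loop iteration has one MAC or MACA per tap plus three other instructions (LT, BNEZ, SW), which matches the counts of 6 and 8 instructions; the function name is illustrative.

```python
def peak_results_per_second(taps: int, clock_hz: float = 25e6,
                            other_instructions: int = 3) -> float:
    """Peak results per second for a filter loop with the given tap count."""
    instructions_per_loop = taps + other_instructions
    results_per_loop = taps
    return results_per_loop / instructions_per_loop * clock_hz


print(peak_results_per_second(3))  # three-tap filter
print(peak_results_per_second(5))  # five-tap filter
```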

c) (3 points) Translate the following DLX code to code that will operate on this DSP. Assume that all floating point calculations below can be done in fixed-point on the DSP. Do not worry about round-off error from converting between floating point and fixed point. Assume that for DLX code, F0 contains 0, F2 contains 0.5, and F4 contains 2.0. Assume that for the DSP code you will write, R2 contains the value 2. Assume for both that register R5 contains the correct initial loop count.







Here is the code for the vector machine:

d) (3 points) Show the convoys of vector instructions for the above code. Follow the timing examples in the book. Draw lines to show the convoys on the existing code shown above.

(Shown above with lines)

e) (4 points) Show the execution time in clock cycles of this loop with n elements  $(T_n)$ ; assume  $T_{\text{loop}} = 15$ . Show the equation, and give the value of execution time for  $n=64$ .

$$
T_n = \left\lceil \frac{n}{64} \right\rceil \times (T_{\text{loop}} + T_{\text{start}}) + n \times T_{\text{chime}}
$$

Here $T_{\text{start}}$ is the sum of the start-up times of the load (10 cycles), the five arithmetic instructions (5 cycles each), and the store (10 cycles), and $T_{\text{chime}} = 3$.

$$
T_{64} = \left\lceil \frac{64}{64} \right\rceil \times (15 + 10 + 5 + 5 + 5 + 5 + 5 + 10) + 64 \times 3
$$

$$
T_{64} = 252
$$
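
The execution-time formula can be evaluated for other values of n with a short sketch. The default start-up total of 45 cycles assumes one load, five arithmetic instructions, and one store, which matches the numbers substituted for n = 64; `vector_time` is a hypothetical name.

```python
from math import ceil


def vector_time(n: int, mvl: int = 64, t_loop: int = 15,
                t_start: int = 10 + 5 * 5 + 10, t_chime: int = 3) -> int:
    """Clock cycles to run the strip-mined vector loop over n elements."""
    strips = ceil(n / mvl)  # number of strip-mined loop iterations
    return strips * (t_loop + t_start) + n * t_chime


print(vector_time(64))
```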

f) (3 points) What is  $R_{\infty}$  for this loop?

$$
R_{\infty} = \frac{\text{Operations per iteration} \times \text{Clock rate}}{\lim_{n \to \infty} \text{Clock cycles per iteration}}
$$

$$
\lim_{n \to \infty} \text{Clock cycles per iteration} = \lim_{n \to \infty} \left(\frac{T_n}{n}\right) = \lim_{n \to \infty} \left(\frac{3n + (60/64)n}{n}\right) = 3.9375
$$

$$
R_{\infty} = \frac{5 \times 100}{3.9375} = 127 \text{ MFLOPS}
$$
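
The same limit can be checked numerically. A minimal sketch, assuming 5 floating-point operations per element, 3 chimes, and 60 cycles of loop and start-up overhead per 64-element strip; the function and parameter names are illustrative.

```python
def r_infinity_mflops(flops_per_iteration: int = 5, clock_mhz: float = 100.0,
                      chimes: int = 3, overhead_cycles: int = 60,
                      mvl: int = 64) -> float:
    """Asymptotic MFLOPS rate for very long vectors."""
    # For large n the per-strip overhead is amortised over mvl elements.
    cycles_per_iteration = chimes + overhead_cycles / mvl
    return flops_per_iteration * clock_mhz / cycles_per_iteration


print(round(r_infinity_mflops()))
```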

g) (3 points) List 6 characteristics of DSP instruction set architectures that differ from general purpose microprocessors.

For (g), (h), (i), and (j), there are many possible answers besides what is listed here.

- Autoincrement addressing
- Circular addressing
- Bit reverse addressing
- Multiply-accumulate
- Saturating overflow
- Fixed point
- Narrow data
- Fast loops

h) (1 point) Which of those characteristics are supported in vector architectures?

- Autoincrement addressing
- Multiply and add (with chaining)
- Fast loops

i) (1 point) Which of the unsupported characteristics could be handled in software?

- Circular addressing

j) (2 points) What changes to the hardware would you make to handle the remaining characteristics?

- Saturating overflow
- Narrow data support
- Fixed-point support