CS252 Graduate Computer Architecture Lecture 8

> Explicit Renaming (con't) Prediction (Branches, Return Addrs)

John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley

http://www.eecs.berkeley.edu/~kubitron/cs252

#### Quick Recap: Explicit Register Renaming

- Make use of a *physical* register file that is larger than number of registers specified by ISA
- Keep a translation table:
  - ISA register => physical register mapping
  - When register is written, replace table entry with new register from freelist.
  - Physical register becomes free when not being used by any instructions in progress.
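The translation-table steps above can be sketched in a few lines of Python. This is a toy model, not the R10000 logic: the class name `RenameTable`, the register counts, and the policy of freeing the old mapping when the overwriting instruction commits are all illustrative assumptions.

```python
from collections import deque


class RenameTable:
    """Toy explicit-renaming model: ISA register -> physical register."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts out in physical register i.
        self.table = list(range(num_arch))
        self.freelist = deque(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_dest, renamed_srcs, old_dest)."""
        phys_srcs = [self.table[s] for s in srcs]  # read sources before remapping dest
        old = self.table[dest]                     # mapping that is being replaced
        # A real pipeline would stall on an empty freelist; here popleft raises IndexError.
        new = self.freelist.popleft()
        self.table[dest] = new
        return new, phys_srcs, old

    def commit(self, old):
        """Assumed policy: the replaced register is freed when its overwriter commits."""
        self.freelist.append(old)


rt = RenameTable(num_arch=4, num_phys=8)
# R1 <- R2 op R3, then R1 <- R1 op R2: two writes to the same ISA register.
d1, s1, old1 = rt.rename(1, [2, 3])
d2, s2, old2 = rt.rename(1, [1, 2])
print(d1, s1, old1)
print(d2, s2, old2)
```

The second instruction reads the physical register allocated by the first, while each write to R1 lands in a different physical register, so the name conflict on R1 disappears.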



2/18/09

cs252-S09, Lecture 8



#### Explicit register renaming: R10000 Freelist Management



- Physical register file larger than ISA register file
- On issue, each instruction that modifies a register is allocated new physical register from freelist
- Used on: R10000, Alpha 21264, HP PA8000


#### Explicit register renaming: R10000 Freelist Management



- Note that physical register P0 is "dead" (or not "live") past the point of this load.
  - When we go to commit the load, we free up






# **Advantages of Explicit Renaming**

- Decouples renaming from scheduling:
  - Pipeline can be exactly like "standard" DLX pipeline (perhaps with multiple operations issued per cycle)
  - Or, pipeline could be Tomasulo-like or a scoreboard, etc.
  - Standard forwarding or bypassing could be used
- Allows data to be fetched from single register file
  - No need to bypass values from reorder buffer
  - This can be important for balancing pipeline
- Many processors use a variant of this technique:
  - R10000, Alpha 21264, HP PA8000
- Another way to get precise interrupt points:
  - All that needs to be "undone" for precise break point is to undo the table mappings
  - Provides an interesting mix between reorder buffer and future file
    - Results are written immediately back to register file
    - Register names are "freed" in program order (by ROB)




# **Superscalar Register Renaming**

- During decode, instructions allocated new physical destination register
- Source operands renamed to physical register with newest value
- Execution unit only sees physical register numbers




# Administrative

- Midterm I: Wednesday 3/18, Location: 310 Soda Hall, Time: 6:00-9:00
  - Can have 1 sheet of 8½x11 handwritten notes, both sides
  - No microfiche of the book!
- This info is on the Lecture page (has been)
- Meet at LaVal's afterwards for Pizza and Beverages
  - Great way for me to get to know you better
  - I'll Buy!



### Branches must be resolved quickly

 In our loop-unrolling example, we relied on the fact that branches were under control of "fast" integer unit in order to get overlap!

| Label | Opcode | Operands   |
|-------|--------|------------|
| Loop: | LD     | F0, 0(R1)  |
|       | MULTD  | F4, F0, F2 |
|       | SD     | F4, 0(R1)  |
|       | SUBI   | R1, R1, #8 |
|       | BNEZ   | R1, Loop   |

- What happens if branch depends on result of MULTD?
  - We completely lose all of our advantages!
  - Need to be able to "predict" branch outcome.
  - If we were to predict that branch was taken, this would be right most of the time.
- Problem much worse for superscalar machines!


# **MIPS Branches and Jumps**

Each instruction fetch depends on one or two pieces of information from the preceding instruction:

1) Is the preceding instruction a taken branch?

2) If so, what is the target address?

| Instruction | Taken known?       | Target known?      |
|-------------|--------------------|--------------------|
| J           | After Inst. Decode | After Inst. Decode |
| JR          | After Inst. Decode | After Reg. Fetch   |
| BEQZ/BNEZ   | After Reg. Fetch*  | After Inst. Decode |

\*Assuming zero detect on register read


#### **Branch Penalties in Modern Pipelines**

UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000)



#### **Reducing Control Flow Penalty**

#### Software solutions

- Eliminate branches: *loop unrolling* increases the run length
- Reduce resolution time: *instruction scheduling* computes the branch condition as early as possible (of limited value)

#### Hardware solutions

- Find something else to do: *delay slots* replace pipeline bubbles with useful work (requires software cooperation)
- Speculate: *branch prediction*, i.e. speculative execution of instructions beyond the branch




#### **Branch Prediction**

- Motivation:
  - Branch penalties limit performance of deeply pipelined processors
  - Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
- Required hardware support:
  - Prediction structures:
    - Branch history tables, branch target buffers, etc.
  - Mispredict recovery mechanisms:
    - Keep result computation separate from commit
    - Kill instructions following branch in pipeline
    - Restore state to state following branch

#### Case for Branch Prediction when Issue N instructions per clock cycle

- Branches will arrive up to *n* times faster in an *n*-issue processor
  - Amdahl's Law => relative impact of the control stalls will be larger with the lower potential CPI in an *n*-issue processor
  - conversely, need branch prediction to 'see' potential parallelism
- Performance = f(accuracy, cost of misprediction)
  - Misprediction ⇒ Flush Reorder Buffer
  - Questions: How to increase accuracy or decrease cost of misprediction?
- Decreasing cost of misprediction
  - Reduce number of pipeline stages before result known
  - Decrease number of instructions in pipeline
  - Both contraindicated in high issue-rate processors!
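To make "Performance = f(accuracy, cost of misprediction)" concrete, here is a first-order CPI estimate. The formula is a common textbook approximation rather than something stated on the slides, and every number below (branch frequency, accuracy, penalty, base CPI) is a hypothetical value chosen only for illustration.

```python
def effective_cpi(base_cpi, branch_freq, accuracy, penalty_cycles):
    """First-order model: base CPI plus average misprediction stall per instruction."""
    mispredicts_per_instr = branch_freq * (1.0 - accuracy)
    return base_cpi + mispredicts_per_instr * penalty_cycles


# Hypothetical workload: 20% branches, 95% prediction accuracy, 10-cycle flush.
scalar = effective_cpi(base_cpi=1.0, branch_freq=0.2, accuracy=0.95, penalty_cycles=10)
wide = effective_cpi(base_cpi=0.25, branch_freq=0.2, accuracy=0.95, penalty_cycles=10)

# Fraction of total execution time spent on misprediction stalls.
scalar_stall_share = (scalar - 1.0) / scalar
wide_stall_share = (wide - 0.25) / wide
print(f"1-issue: CPI={scalar:.2f}, stall share={scalar_stall_share:.0%}")
print(f"4-issue: CPI={wide:.2f}, stall share={wide_stall_share:.0%}")
```

Under these assumed numbers the same absolute stall cost is a much larger share of execution time on the wider machine, which is the Amdahl's Law point made above.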




### **Static Branch Prediction**

Overall probability a branch is taken is ~60-70% but:



ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110: bne0 (preferred taken), beq0 (not taken)

ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64. Typically reported as ~80% accurate.

#### **Dynamic Branch Prediction**

Learning based on past behavior:

- Temporal correlation: the way a branch resolves may be a good predictor of the way it will resolve at the next execution
- Spatial correlation: several branches may resolve in a highly correlated manner (a preferred path of execution)

#### **Dynamic Branch Prediction Problem**

*[Figure: incoming branches { Address } feed the branch predictor, which holds history information and emits predictions { Address, Value }; corrections { Address, Value } are returned to the predictor.]*

- Incoming stream of addresses
- Fast outgoing stream of predictions
- Correction information returned from pipeline




# **Predicated Execution**

- Avoid branch prediction by turning branches into conditionally executed instructions: `if (x) then A = B op C else NOP`

- If false, then neither store result nor cause exception
- Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instruction
- IA-64: 64 1-bit condition fields, selectable so that any instruction can be executed conditionally
- This transformation is called "if-conversion"
- Drawbacks to conditional instructions
  - Still takes a clock even if "annulled"
  - Stall if condition evaluated late
  - Complex conditions reduce effectiveness; condition becomes known late in pipeline

# What does history look like? E.g.: One-level Branch History Table (BHT)

- Each branch given its own predictor state machine
- BHT is table of "Predictors"
  - Could be 1-bit, could be complex state machine
  - Indexed by PC address of Branch without tags
- Problem: in a loop, 1-bit BHT will cause two mispredictions (avg is 9 iterations before exit):
  - End of loop case: when it exits instead of looping as before
  - First time through loop on *next* time through code, when it predicts exit instead of looping
- Thus, most schemes use at least 2 bit predictors
- Performance = f(accuracy, cost of misprediction)
  - Misprediction ⇒ Flush Reorder Buffer
- In Fetch stage of branch:
  - Use Predictor to make prediction
- When branch completes:
  - Update corresponding Predictor






# 2-bit predictor

• Solution: a 2-bit scheme that changes its prediction only after it mispredicts *twice*



• Adds hysteresis to decision making process
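The hysteresis effect can be checked with a short simulation. The sketch below assumes a plain saturating counter (one common 2-bit scheme; the state machine in the slide's figure may differ in its transitions), an initial state of "strongly taken", and a loop branch that is taken 9 times and then falls through.

```python
def count_mispredicts(outcomes, bits):
    """Count mispredictions of a single saturating counter over a branch trace."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # predict taken when state >= threshold
    state = max_state             # assumption: start in the strongest "taken" state
    wrong = 0
    for taken in outcomes:
        if (state >= threshold) != taken:
            wrong += 1
        # Move one step toward the actual outcome, saturating at both ends.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return wrong


# Loop-closing branch: taken 9 times, then not taken at exit; loop entered 3 times.
trace = ([True] * 9 + [False]) * 3
one_bit = count_mispredicts(trace, bits=1)
two_bit = count_mispredicts(trace, bits=2)
print("1-bit mispredictions:", one_bit)
print("2-bit mispredictions:", two_bit)
```

With this trace the 1-bit counter mispredicts at every loop exit and again on every re-entry after the first, while the 2-bit counter mispredicts only at the exits.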


#### **Pipeline considerations for BHT**

Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.

| Stage | Function                                           |
|-------|----------------------------------------------------|
| A     | PC Generation/Mux                                  |
| P     | Instruction Fetch Stage 1                          |
| F     | Instruction Fetch Stage 2                          |
| B     | Branch Address Calc/Begin Decode                   |
| I     | Complete Decode                                    |
| J     | Steer Instructions to Functional units             |
| R     | Register File Read                                 |
| E     | Integer Execute                                    |
|       | Remainder of execute pipeline (+ another 6 stages) |

Figure annotations alongside these stages: "Correctly predicted taken branch penalty" and "Jump Register penalty".

#### UltraSPARC-III fetch pipeline



## **Branch Target Buffer**



BP bits are stored with the predicted target address.

IF stage: if (BP = taken) then nPC = target else nPC = PC+4

Later: check prediction; if wrong, kill the instruction and update BTB and BP bits; else update BP bits

# **Address Collisions in BTB**






#### **BTB is only for Control Instructions**

BTB contains useful information for branch and jump instructions only

 $\Rightarrow$  Do not update it for other instructions

For all other instructions the next PC is PC+4 !

*How to achieve this effect without decoding the instruction?* 

## **Branch Target Buffer (BTB)**



- Keep both the branch PC and target PC in the BTB
- PC+4 is fetched if match fails
- Only predicted taken branches and jumps held in BTB
- Next PC determined before branch fetched and decoded
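The four points above can be sketched as a direct-mapped, fully tagged BTB. The class name, table size, 4-byte instruction assumption, and the policy of evicting an entry when its branch resolves not-taken are illustrative choices, not details taken from the slides.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB storing (branch PC, target PC) pairs."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.entries = [None] * num_entries   # each slot: (branch_pc, target_pc) or None

    def _index(self, pc):
        return (pc >> 2) % self.num_entries   # assumes 4-byte aligned instructions

    def predict_next_pc(self, pc):
        """Called at fetch, before the instruction is decoded."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:   # tag match: predicted-taken control op
            return entry[1]
        return pc + 4                              # no match: fall through

    def update(self, pc, taken, target):
        """Called when a branch or jump resolves."""
        i = self._index(pc)
        if taken:
            self.entries[i] = (pc, target)
        elif self.entries[i] is not None and self.entries[i][0] == pc:
            self.entries[i] = None   # assumed policy: hold only predicted-taken branches


btb = BranchTargetBuffer()
btb.update(pc=1024, taken=True, target=2000)
print(btb.predict_next_pc(1024))   # branch with a BTB entry
print(btb.predict_next_pc(1028))   # instruction with no entry
print(btb.predict_next_pc(1280))   # same index as 1024, different tag
```

Because the full branch PC is stored as a tag, an instruction that merely shares an index with a branch still gets PC+4 as its predicted next PC.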




### **Consulting BTB Before Decoding**



- The match for PC=1028 fails and 1028+4 is fetched ⇒ eliminates false predictions after ALU instructions
- BTB contains entries only for control transfer instructions ⇒ more room to store branch targets


### **Uses of Jump Register (JR)**

- Switch statements (jump to address of matching case)
  - BTB works well if same case used repeatedly
- Dynamic function call (jump to run-time function address)
  - BTB works well if same function usually called (e.g., in C++ programming, when objects have same type in virtual function call)
- Subroutine returns (jump to return address)
  - BTB works well if usually return to the same place
  - ⇒ Often one function called from many distinct call sites!

How well does BTB work for each of these cases?

### **Combining BTB and BHT**

- BTB entries are considerably more expensive than BHT, but can redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)
- BHT can hold many more entries and is more accurate



#### BTB/BHT only updated after branch resolves in E stage




#### Subroutine Return Stack

Small structure to accelerate JR for subroutine returns; typically much more accurate than BTBs.



### **Mispredict Recovery**

In-order execution machines:

- Assume no instruction issued after branch can write-back before branch resolves
- Kill all instructions in pipeline behind mispredicted branch

#### Out-of-order execution?

 Multiple instructions following branch in program order can complete before branch resolves

#### In-Order Commit for Precise Exceptions



- Instructions fetched and decoded into instruction reorder buffer in-order
- Execution is out-of-order (  $\Rightarrow$  out-of-order completion)
- Commit (write-back to architectural state, i.e., regfile & memory) is in-order

Temporary storage needed in ROB to hold results before commit




# **Recovering ROB/Renaming Table**



Take snapshot of register rename table at each predicted branch, recover earlier snapshot if branch mispredicted
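A minimal sketch of snapshot-based recovery follows. It assumes branches carry increasing integer tags in program order and models only the rename table; freelist and ROB recovery are omitted, and the class and method names are hypothetical.

```python
class CheckpointedRenameTable:
    """Toy rename table that is checkpointed at every predicted branch."""

    def __init__(self, num_arch):
        self.table = list(range(num_arch))
        self.snapshots = {}   # branch tag -> copy of the table at prediction time

    def rename_dest(self, arch_reg, new_phys):
        self.table[arch_reg] = new_phys

    def predict_branch(self, tag):
        self.snapshots[tag] = list(self.table)   # snapshot taken at the branch

    def resolve_branch(self, tag, mispredicted):
        saved = self.snapshots.pop(tag)
        if mispredicted:
            self.table = saved                   # roll mappings back to the branch
            # Younger branches were fetched down the wrong path; drop their snapshots.
            self.snapshots = {t: s for t, s in self.snapshots.items() if t < tag}


crt = CheckpointedRenameTable(num_arch=4)
crt.rename_dest(1, 4)       # before the branch
crt.predict_branch(0)
crt.rename_dest(1, 5)       # speculative, after branch 0
crt.rename_dest(2, 6)
crt.predict_branch(1)       # a younger branch on the speculative path
crt.rename_dest(3, 7)
crt.resolve_branch(0, mispredicted=True)
print(crt.table, crt.snapshots)
```

After the misprediction the table again holds exactly the mappings that existed when branch 0 was predicted.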


### **Speculating Both Directions**

An alternative to branch prediction is to execute both directions of a branch *speculatively* 

- resource requirement is proportional to the number of concurrent speculative executions
- only half the resources engage in useful work when both directions of a branch are executed speculatively
- branch prediction takes less resources than speculative execution of both paths

With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction


#### Exploiting Spatial Correlation (Yeh and Patt, 1992)



If first condition false, second condition also false

*History register,* H, records the direction of the last N branches executed by the processor

# **Correlating Branches**

- Hypothesis: recent branches are correlated; that is, behavior of recently executed branches affects prediction of current branch
- Two possibilities; Current branch depends on:
  - Last m most recently executed branches anywhere in program. Produces a "GA" (for "global adaptive") in the Yeh and Patt classification (e.g. GAg)
  - Last m most recent outcomes of same branch. Produces a "PA" (for "per-address adaptive") in same classification (e.g. PAg)
- Idea: record m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
  - A single history table shared by all branches (appends a "g" at end), indexed by history value.
  - Address is used along with history to select table entry (appends a "p" at end of classification)
  - If only portion of address used, often appends an "s" to indicate "set-indexed" tables (e.g. GAs)

### **Correlating Branches**

• For instance, consider global history, set-indexed BHT. That gives us a GAs history table.

#### (2,2) GAs predictor

- First 2 means that we keep two bits of history
- Second means that we have 2 bit counters in each slot.
- Then behavior of recent branches selects between, say, four predictions of next branch, updating just that prediction
- Note that the original two-bit counter solution would be a (0,2) GAs predictor
- Note also that aliasing is possible here...
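A (2,2) GAs predictor as described above can be sketched as follows. The table size, the 4-byte instruction assumption, and the initial counter value are illustrative; setting `history_bits=0` gives the plain (0,2) two-bit-counter case for comparison.

```python
class GAsPredictor:
    """Toy (m, n) GAs predictor: global history plus set-indexed n-bit counters."""

    def __init__(self, history_bits=2, counter_bits=2, num_sets=16):
        self.history_bits = history_bits
        self.num_sets = num_sets
        self.max_count = (1 << counter_bits) - 1
        self.threshold = 1 << (counter_bits - 1)
        self.history = 0   # global history register, newest outcome in bit 0
        # One row per address set, one counter per history pattern.
        self.counters = [[self.threshold] * (1 << history_bits) for _ in range(num_sets)]

    def _slot(self, pc):
        return (pc >> 2) % self.num_sets, self.history   # partial address => aliasing

    def predict(self, pc):
        s, h = self._slot(pc)
        return self.counters[s][h] >= self.threshold

    def update(self, pc, taken):
        s, h = self._slot(pc)
        c = self.counters[s][h]
        # Update only the counter selected by the current history.
        self.counters[s][h] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


def run(predictor, pc, outcomes):
    """Return the number of mispredictions for one branch over a trace."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


pattern = [True, False] * 50   # a branch that strictly alternates
with_history = run(GAsPredictor(history_bits=2), 0x400, pattern)
no_history = run(GAsPredictor(history_bits=0), 0x400, pattern)
print("(2,2) mispredictions:", with_history)
print("(0,2) mispredictions:", no_history)
```

The alternating pattern is a case where history selects a different counter for each phase of the pattern, so the (2,2) version settles after warm-up while the history-free counter keeps mispredicting.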








- PAg: Per-Address History Register, Global History Table
- PAp: Per-Address History Register, Per-Address History Table






#### Two-Level Adaptive Schemes: History Registers of Same Length (6 bits)



• PAg performs better because it has a branch history table

# Why doesn't GAg do better?

- Difference between GAg and both PA variants:
  - GAg tracks correlations between different branches
  - PAg/PAp track correlations between different instances of the same branch
- These are two different types of pattern tracking
  - Among other things, GAg good for branches in straight-line code, while PA variants good for loops
- Problem with GAg? It aliases results from different branches into same table
  - Issue is that different branches may take same global pattern and resolve it differently
  - GAg doesn't leave flexibility to do this




### **Tournament Predictor in Alpha 21264**

- 4K 2-bit counters to choose from among a global predictor and a local predictor
- Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
- Local predictor consists of a 2-level predictor:
  - Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  - Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
- Total size: 4K\*2 + 4K\*2 + 1K\*10 + 1K\*3 = 29K bits!
  - (~180,000 transistors)
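The size arithmetic can be checked directly. The snippet assumes K = 1024 and uses descriptive labels of my own for the four tables.

```python
K = 1024  # assumption: "K" on the slide means 1024 entries

# Bits per structure = number of entries * bits per entry.
components = {
    "chooser (4K x 2-bit counters)": 4 * K * 2,
    "global predictor (4K x 2-bit)": 4 * K * 2,
    "local history table (1K x 10-bit)": 1 * K * 10,
    "local counters (1K x 3-bit)": 1 * K * 3,
}

total_bits = sum(components.values())
for name, bits in components.items():
    print(f"{name}: {bits} bits")
print(f"total: {total_bits} bits = {total_bits // K} Kbits")
```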


# % of predictions from local predictor in Tournament Scheme






#### Accuracy v. Size (SPEC89)



12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken



- 21264 uses tournament predictor (29 Kbits)
- Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
- SPEC95 benchmarks: 21264 outperforms
  - 21264 avg. 11.5 mispredictions per 1000 instructions
  - 21164 avg. 16.5 mispredictions per 1000 instructions
- Reversed for transaction processing (TP)!
  - 21264 avg. 17 mispredictions per 1000 instructions
  - 21164 avg. 15 mispredictions per 1000 instructions
- TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)


#### **Special Case Return Addresses**

#### Register indirect branches: target address is hard to predict

- SPEC89: 85% of such branches are procedure returns
- Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate






#### **Performance: Return Address Predictor**

#### Cache most recent return addresses:

- Call ⇒ Push a return address on stack
- Return ⇒ Pop an address off stack & predict as new PC
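The push/pop behavior can be sketched as a small bounded stack. The depth of 8, the return address of call PC + 4 (i.e. no delay slot), and the drop-oldest overflow policy are assumptions for illustration.

```python
class ReturnAddressStack:
    """Toy return address predictor: a small stack of recent return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)               # assumed overflow policy: drop the oldest
        self.stack.append(call_pc + 4)      # assumes return to the next instruction

    def on_return(self):
        """Predicted next PC for a return, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(100)          # outer call
ras.on_call(200)          # nested call
inner = ras.on_return()   # returns from the nested call first
outer = ras.on_return()
empty = ras.on_return()   # nothing left to predict
print(inner, outer, empty)
```

Unlike a BTB entry, the prediction depends on the most recent call site rather than on the address of the return instruction, so a function called from many places is still predicted correctly as long as the stack does not overflow.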



#### Conclusion

- Explicit Renaming: more physical registers than needed by ISA.
  - Rename table: tracks current association between architectural registers and physical registers
  - Uses a translation table to perform compiler-like transformation on the fly
- Prediction works because ....
  - Programs have patterns
  - Just have to figure out what they are
  - Basic Assumption: Future can be predicted from past!
- Correlation: Recently executed branches correlated with next branch.
  - Either different branches (GA)
  - Or different executions of same branches (PA).
- Two-Level Branch Prediction
  - Uses complex history (either global or local) to predict next branch
  - Two tables: a history table and a pattern table
  - Global Predictors: GAg, GAs, GShare
  - Local Predictors: PAg, PAp
