

- Keep recently accessed data items closer to processor
- Spatial Locality (Locality in Space):
  - Move contiguous blocks to the upper levels







### Review: Fully Associative Cache

- Fully Associative: Every block can hold any line
  - Address does not include a cache index
  - Compare Cache Tags of all Cache Entries in Parallel
- Example: Block Size = 32 B
  - We need N 27-bit comparators (one per cache entry)
  - Still have byte select to choose from within block
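The 27-bit figure can be checked with a short sketch. It assumes 32-bit addresses, which the slide implies but does not state; `split_address` is a hypothetical helper name.

```python
def split_address(addr, block_size=32, addr_bits=32):
    """Split an address for a fully associative cache.

    Assumes block_size is a power of two and addresses are addr_bits wide.
    Returns (tag, byte_select, tag_bits).
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    tag_bits = addr_bits - offset_bits          # no cache index bits at all
    tag = addr >> offset_bits                   # compared against every entry
    byte_select = addr & (block_size - 1)       # picks the byte within the block
    return tag, byte_select, tag_bits


tag, byte_select, tag_bits = split_address(0x12345678)
print(tag_bits, hex(tag), byte_select)
```

With 32 B blocks, 5 bits go to the byte select and the remaining 27 bits form the tag.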





## Review: Which block should be replaced on a miss?

- Easy for Direct Mapped: Only one possibility
- Set Associative or Fully Associative:
  - Random
  - LRU (Least Recently Used)

| Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random |
|--------|-----------|--------------|-----------|--------------|-----------|--------------|
| 16 KB  | 5.2%      | 5.7%         | 4.7%      | 5.3%         | 4.4%      | 5.0%         |
| 64 KB  | 1.9%      | 2.0%         | 1.5%      | 1.7%         | 1.4%      | 1.5%         |
| 256 KB | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%        |
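A minimal sketch of LRU replacement within one cache set may help make the policy concrete. The class name `LRUSet` is an illustration only, and it does not reproduce the miss rates in the table.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (illustrative sketch)."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # tag -> None, least recently used first
        self.misses = 0

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # hit: now most recently used
            return True
        self.misses += 1
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)    # evict least recently used
        self.blocks[tag] = None
        return False


s = LRUSet(ways=2)
for t in [1, 2, 1, 3, 2]:
    s.access(t)
print(s.misses, list(s.blocks))
```

Random replacement would pick the victim with a random choice instead of evicting the front of the ordered dictionary.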

### Where does a Block Get Placed in a Cache?



### Review: What happens on a write?

- Write through: The information is written to both the block in the cache and to the block in the lower-level memory
- Write back: The information is written only to the block in the cache.
  - Modified cache block is written to main memory only when it is replaced
  - Question: is block clean or dirty?
- Pros and Cons of each?
  - WT:
    - » PRO: read misses cannot result in writes
    - » CON: Processor held up on writes unless writes buffered
  - WB:
    - » PRO: repeated writes not sent to DRAM
    - » PRO: processor not held up on writes
    - » CON: More complex
      - Read miss may require writeback of dirty data
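The trade-off can be illustrated by counting how many writes reach lower-level memory under each policy. This is a sketch under simplifying assumptions: a one-block cache, write-allocate on a miss, and no final flush of a dirty block. `memory_writes` is a hypothetical name.

```python
def memory_writes(writes, policy):
    """Count writes that reach lower-level memory.

    writes: sequence of block addresses written by the processor
    policy: "write-through" or "write-back"
    """
    mem_writes = 0
    cached, dirty = None, False
    for block in writes:
        if block != cached:                      # miss: replace cached block
            if policy == "write-back" and dirty:
                mem_writes += 1                  # write back the dirty victim
            cached, dirty = block, False
        if policy == "write-through":
            mem_writes += 1                      # every write goes to memory
        else:
            dirty = True                         # only the cache copy changes
    return mem_writes


trace = ["A", "A", "A", "A", "B"]
print(memory_writes(trace, "write-through"))
print(memory_writes(trace, "write-back"))
```

Repeated writes to the same block are absorbed by the write-back cache and reach memory only when the dirty block is replaced.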

Lec 14.11

10/19/15

### Administrivia

- Still working on the grading of exams
  - No deadline yet, will let you know
- Solutions are done!
  - Will post them on the website tomorrow

#### Caching Applied to Address Translation



## What Actually Happens on a TLB Miss?

Kubiatowicz CS162 ©UCB Fall 2015

- Hardware traversed page tables:
  - On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    - » If PTE valid, hardware fills TLB and processor never knows
    - » If PTE marked as invalid, causes Page Fault, after which kernel decides what to do
- Software traversed Page tables (like MIPS)
  - On TLB miss, processor receives TLB fault
  - Kernel traverses page table to find PTE
    - » If PTE valid, fills TLB and returns from fault
    - » If PTE marked as invalid, internally calls Page Fault handler
- Most chip sets provide hardware traversal
  - Modern operating systems tend to have more TLB faults since they use translation for many things
  - Examples:
    - » shared segments
    - » user-level portions of an operating system
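The software-traversed case can be sketched as follows. This is a simplified model, not the MIPS handler: the page table is a single-level dictionary, and `translate` and `PageFault` are hypothetical names.

```python
class PageFault(Exception):
    """Raised when the PTE for a virtual page is missing or invalid."""


def translate(vpn, tlb, page_table):
    """Return the physical frame for virtual page number `vpn`.

    tlb:        dict mapping vpn -> frame (cached translations)
    page_table: dict mapping vpn -> {"valid": bool, "frame": int or None}
    """
    if vpn in tlb:                        # TLB hit: no page-table access
        return tlb[vpn]
    pte = page_table.get(vpn)             # TLB miss: traverse the page table
    if pte is None or not pte["valid"]:
        raise PageFault(vpn)              # invalid PTE: page fault handler runs
    tlb[vpn] = pte["frame"]               # valid PTE: fill TLB and return
    return pte["frame"]


page_table = {0x40: {"valid": True, "frame": 0x10},
              0x41: {"valid": False, "frame": None}}
tlb = {}
print(hex(translate(0x40, tlb, page_table)))   # miss, then the TLB is filled
print(0x40 in tlb)
```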

## Transparent Exceptions: TLB/Page fault



- How to transparently restart faulting instructions?
  - (Consider load or store that gets TLB or Page fault)
  - Could we just skip faulting instruction?
    - » No: need to perform load or store after reconnecting physical page
- Hardware must help out by saving:
  - Faulting instruction and partial state
    - » Need to know which instruction caused fault
    - » Is single PC sufficient to identify faulting position?
  - Processor State: sufficient to restart user thread
    - » Save/restore registers, stack, etc
- What if an instruction has side-effects?


### Consider weird things that can happen

- What if an instruction has side effects?
  - Options:
    - » Unwind side-effects (easy to restart)
    - » Finish off side-effects (messy!)
  - Example 1: mov (sp)+,10
    - » What if page fault occurs when write to stack pointer?
    - » Did sp get incremented before or after the page fault?
  - Example 2: strcpy (r1), (r2)
    - » Source and destination overlap: can't unwind in principle!
    - » IBM S/370 and VAX solution: execute twice, once read-only
- What about "RISC" processors?
  - For instance delayed branches?
    - » Example: bne somewhere; ld r1, (sp)
    - » Precise exceptions:
    - » Example: div r1, r2, r3; ld r1, (sp)
  - What if takes many cycles to discover divide by zero, but load has already caused page fault?

### Precise Exceptions

- Precise ⇒ state of the machine is preserved as if program executed up to the offending instruction
  - All previous instructions completed
  - Offending instruction and all following instructions act as if they have not even started
  - Same system code will work on different implementations
  - Difficult in the presence of pipelining, out-of-order execution, ...
  - MIPS takes this position
- Imprecise ⇒ system software has to figure out what is where and put it all back together
- Performance goals often lead designers to forsake precise interrupts
  - System software developers, users, markets etc. usually wish they had not done this
- Modern techniques for out-of-order execution and branch prediction help implement precise interrupts

### What happens on a Context Switch?

- Need to do something, since TLBs map virtual addresses to physical addresses
  - Address Space just changed, so TLB entries no longer valid!
- Options?
  - Invalidate TLB: simple but might be expensive
    - » What if switching frequently between processes?
  - Include ProcessID in TLB
    - » This is an architectural solution: needs hardware
- What if translation tables change?
  - For example, to move page from memory to disk or vice versa...
  - Must invalidate TLB entry!
    - » Otherwise, might think that page is still in memory!
  - Called "TLB Consistency"
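The ProcessID option can be sketched as a TLB whose lookup key includes an address-space ID. `TaggedTLB` and its method names are assumptions made for illustration; real hardware does this match in parallel in the TLB itself.

```python
class TaggedTLB:
    """TLB whose entries are tagged with an address-space ID (ASID)."""

    def __init__(self):
        self.entries = {}                      # (asid, vpn) -> frame

    def insert(self, asid, vpn, frame):
        self.entries[(asid, vpn)] = frame

    def lookup(self, asid, vpn):
        # Entries of other processes never match, so a context switch
        # does not require flushing the whole TLB.
        return self.entries.get((asid, vpn))   # None means TLB miss

    def invalidate(self, asid, vpn):
        # Needed when a translation changes, e.g. the page moves to disk.
        self.entries.pop((asid, vpn), None)


tlb = TaggedTLB()
tlb.insert(asid=1, vpn=0x40, frame=0x10)
tlb.insert(asid=2, vpn=0x40, frame=0x22)
print(tlb.lookup(1, 0x40), tlb.lookup(2, 0x40))
```

Both processes use virtual page 0x40, yet each lookup returns its own frame. Invalidating one entry keeps the TLB consistent when the page table changes.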

## What TLB organization makes sense?




- Needs to be really fast
  - Critical path of memory access
    - » In simplest view: before the cache
    - » Thus, this adds to access time (reducing cache speed)
  - Seems to argue for Direct Mapped or Low Associativity
- However, needs to have very few conflicts!
  - With TLB, the Miss Time is extremely high!
  - This argues that cost of Conflict (Miss Time) is much higher than slightly increased cost of access (Hit Time)
- Thrashing: continuous conflicts between accesses
  - What if use low order bits of page as index into TLB?
    - » First page of code, data, stack may map to same entry
    - » Need 3-way associativity at least?
  - What if use high order bits as index?
    - » TLB mostly unused for small programs
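A small simulation illustrates the thrashing argument. The page numbers below are hypothetical, chosen so that their low-order bits collide. `tlb_misses` is an illustrative helper that models a set-associative TLB with LRU replacement.

```python
def tlb_misses(vpns, num_sets, ways):
    """Count misses for a TLB indexed by the low-order bits of the VPN."""
    sets = [[] for _ in range(num_sets)]   # each set: VPNs, LRU first
    misses = 0
    for vpn in vpns:
        s = sets[vpn % num_sets]           # low-order bits select the set
        if vpn in s:
            s.remove(vpn)                  # hit: re-appended below as MRU
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict least recently used
        s.append(vpn)
    return misses


# Hypothetical first pages of code, data, and stack: all have index 0
trace = [0x000, 0x100, 0x7F0] * 10
print(tlb_misses(trace, num_sets=16, ways=1))   # direct mapped
print(tlb_misses(trace, num_sets=16, ways=3))   # 3-way set associative
```

In the direct-mapped case every access evicts the entry the next access needs. With three ways only the compulsory misses remain.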


## TLB organization: include protection

- How big does TLB actually have to be?
  - Usually small: 128-512 entries
  - Not very big, can support higher associativity
- TLB usually organized as fully-associative cache
  - Lookup is by Virtual Address
  - Returns Physical Address + other info
- What happens when fully-associative is too slow?
  - Put a small (4-16 entry) direct-mapped cache in front
  - Called a "TLB Slice"
- Example for MIPS R3000:

| Virtual Address | Physical Address | Dirty | Ref | Valid | Access | ASID |
|-----------------|------------------|-------|-----|-------|--------|------|
| 0xFA00          | 0x0003           | Y     | N   | Y     | R/W    | 34   |
| 0x0040          | 0x0010           | N     | Y   | Y     | R      | 0    |
| 0x0041          | 0x0011           | N     | Y   | Y     | R      | 0    |


# Reducing translation time further

- As described, TLB lookup is in serial with cache lookup: the virtual address is translated first, and the resulting physical address is then used to access the cache

- Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  - Works because offset available early

## Example: R3000 pipeline includes TLB "stages"

MIPS R3000 Pipeline

| Inst Fetch   | Dcd / Reg | ALU / E.A.           | Memory  | Write Reg |
|--------------|-----------|----------------------|---------|-----------|
| TLB, I-Cache | RF        | Operation; E.A., TLB | D-Cache | WB        |

#### TLB

64 entry, on-chip, fully associative, software TLB fault handler

Virtual Address Space



# Overlapping TLB & Cache Access (1/2)

- Main idea:
  - Offset in virtual address exactly covers the "cache index" and "byte select"
  - Thus can select the cached byte(s) in parallel with performing address translation

| Address          | High bits      | Low bits    |
|------------------|----------------|-------------|
| virtual address  | Virtual Page # | Offset      |
| physical address | tag / page #   | index, byte |
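The condition can be checked with simple bit arithmetic. The sketch below assumes power-of-two sizes and relaxes "exactly covers" to "fits within". `can_overlap` and the example sizes are illustrative, not taken from the slides.

```python
def can_overlap(page_size, cache_size, block_size, ways=1):
    """True if cache index + byte select fit inside the page offset."""
    offset_bits = page_size.bit_length() - 1       # log2(page size)
    num_sets = cache_size // (block_size * ways)
    index_bits = num_sets.bit_length() - 1         # log2(number of sets)
    byte_bits = block_size.bit_length() - 1        # log2(block size)
    return index_bits + byte_bits <= offset_bits


print(can_overlap(4096, 4096, 32))           # 7 + 5 = 12 bits, fits in 12
print(can_overlap(4096, 8192, 32))           # 8 + 5 = 13 bits, does not fit
print(can_overlap(4096, 8192, 32, ways=2))   # more ways, fewer index bits
```

Under these assumptions a larger cache can still be indexed from the untranslated offset bits if its associativity grows along with its size.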








## Putting Everything Together: Cache







## **Demand Paging**

- Modern programs require a lot of physical memory
  - Memory per system growing faster than 25%-30%/year
- But they don't use all their memory all of the time
  - 90-10 rule: programs spend 90% of their time in 10% of their code
  - Wasteful to require all of user's code to be in memory
- Solution: use main memory as cache for disk
- Disk is larger than physical memory ⇒
  - In-use virtual memory can be bigger than physical memory
  - Combined memory of running processes much larger than physical memory
    - » More programs fit into memory, allowing more concurrency
- Principle: Transparent Level of Indirection (page table)
  - Supports flexible placement of physical data
    - » Data could be on disk or somewhere across network
  - Variable location of data transparent to user program
    - » Performance issue, not correctness issue




- Since Demand Paging is Caching, must ask:
  - What is block size?
    - » 1 page
  - What is organization of this cache (i.e. direct-mapped, set-associative, fully-associative)?
    - » Fully associative: arbitrary virtual → physical mapping
  - How do we find a page in the cache when look for it?
    - » First check TLB, then page-table traversal
  - What is page replacement policy? (i.e. LRU, Random...)
    - » This requires more explanation... (kinda LRU)
  - What happens on a miss?
    - » Go to lower level to fill miss (i.e. disk)
  - What happens on a write? (write-through, write back)
    - » Definitely write-back. Need dirty bit!




### Review: What is in a PTE?




## **Demand Paging Mechanisms**

- PTE helps us implement demand paging
  - Valid  $\Rightarrow$  Page in memory, PTE points at physical page
  - Not Valid ⇒ Page not in memory; use info in PTE to find it on disk when necessary
- Suppose user references page with invalid PTE?
  - Memory Management Unit (MMU) traps to OS
    - » Resulting trap is a "Page Fault"
  - What does OS do on a Page Fault?
    - » Choose an old page to replace
    - » If old page modified ("D=1"), write contents back to disk
    - » Change its PTE and any cached TLB to be invalid
    - » Load new page into memory from disk
    - » Update page table entry, invalidate TLB for new entry
    - » Continue thread from original faulting location
  - TLB for new page will be loaded when thread continued!
  - While pulling pages off disk for one process, OS runs another process from ready queue
    - » Suspended process sits on wait queue





## Loading an executable into memory

- .exe
  - lives on disk in the file system
  - contains contents of code & data segments, relocation entries and symbols
  - OS loads it into memory, initializes registers (and initial stack pointer)
  - program sets up stack and heap upon initialization: crt0



## Create Virtual Address Space of the Process



- User Page table maps entire VAS
  - Resident pages map to the frame in memory they occupy
  - The portion of the page table that the HW needs to access must be resident in memory

# Provide Backing Store for VAS



- Resident pages mapped to memory frames
- For all other pages, OS must record where to find them on disk


- What data structure is required to map non-resident pages to disk?
  - FindBlock(PID, page#) => disk\_block
  - Some OSs utilize spare space in PTE for paged blocks
  - Like the PT, but purely software
- Where to store it?
  - In memory: can be compact representation if swap storage is contiguous on disk
  - Could use hash table (like Inverted PT)
- Usually want backing store for resident pages too.
- May map code segment directly to on-disk image
  - Saves copying the code to the swap file
- May share code segment with multiple instances of the program
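Both storage options for FindBlock can be sketched briefly. The names `SwapMap` and `contiguous_block` are hypothetical, and the sketch assumes one disk block per page.

```python
class SwapMap:
    """Hash-table variant: (pid, page#) -> disk block, purely in software."""

    def __init__(self):
        self.table = {}

    def record(self, pid, page, disk_block):
        self.table[(pid, page)] = disk_block

    def find_block(self, pid, page):
        return self.table[(pid, page)]     # KeyError if no backing block


def contiguous_block(swap_base, page):
    """Compact variant: if a process's swap area is contiguous on disk,
    storing only its first block is enough (one block per page assumed)."""
    return swap_base + page


swap = SwapMap()
swap.record(pid=7, page=3, disk_block=1203)
print(swap.find_block(7, 3))
print(contiguous_block(swap_base=1200, page=3))
```

The hash table handles arbitrary placement at the cost of one entry per page. The contiguous layout needs only a base block per process.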




## Provide Backing Store for VAS



# On page Fault ... find & start load




## Eventually reschedule faulting thread



## Summary: Steps in Handling a Page Fault




### Demand Paging (more details)

- Does software-loaded TLB need use bit? Two Options:
  - Hardware sets use bit in TLB; when TLB entry is replaced, software copies use bit back to page table
  - Software manages TLB entries as FIFO list; everything not in TLB is Second-Chance list, managed as strict LRU
- Core Map
  - Page tables map virtual page → physical page
  - Do we need a reverse mapping (i.e. physical page → virtual page)?
    - » Yes. Clock algorithm runs through page frames. If sharing, then multiple virtual-pages per physical page
    - » Can't push page out to disk without invalidating all PTEs

### Summary (1/2)

- The Principle of Locality:
  - Program likely to access a relatively small portion of the address space at any instant of time.
    - » Temporal Locality: Locality in Time
    - » Spatial Locality: Locality in Space
- Three (+1) Major Categories of Cache Misses:
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Conflict Misses: increase cache size and/or associativity
  - Capacity Misses: increase cache size
  - Coherence Misses: Caused by external processors or I/O devices
- Cache Organizations:
  - Direct Mapped: single block per set
  - Set associative: more than one block per set
  - Fully associative: all entries equivalent

### Summary (2/2)

- A cache of translations called a "Translation Lookaside Buffer" (TLB)
  - Relatively small number of entries (< 512)
  - Fully Associative (Since conflict misses expensive)
  - TLB entries contain PTE and optional process ID
- On TLB miss, page table must be traversed
  - If located PTE is invalid, cause Page Fault
- On context switch/change in page table
  - TLB entries must be invalidated somehow
- TLB is logically in front of cache
  - Thus, needs to be overlapped with cache access to be really fast
- Precise Exception specifies a single instruction for which:
  - All previous instructions have completed (committed state)
  - No following instructions nor actual instruction have started