In their short note, the authors point out what the IRAM class is all about: "The gap between memory speed and processor speed is growing". Furthermore, the gap is growing exponentially. The so-called "memory wall" is hit when program execution time is totally determined by memory speed. The authors' model assumes a single level of cache that keeps up with processor speed and produces no capacity or conflict misses. Assuming that 20% of instructions reference memory, the "memory wall" is hit when the average time to access memory is 5 CPU cycles.
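The 5-cycle threshold follows from a simple cycle count. Below is a minimal sketch of the arithmetic (my own simplification of the model: one cycle of useful work per instruction, one data reference per five instructions, each costing t_avg cycles on average; the loop and numbers are only illustrative):

    /* Back-of-envelope check of the "memory wall" threshold described above.
     * Assumptions (mine, for illustration): every instruction does 1 cycle of
     * useful work, 20% of instructions make a data reference, and each
     * reference costs t_avg cycles on average. */
    #include <stdio.h>

    int main(void)
    {
        const double mem_refs_per_instr = 0.20;   /* 1 reference per 5 instructions */
        for (double t_avg = 1.0; t_avg <= 10.0; t_avg += 1.0) {
            double cpi = 1.0 + mem_refs_per_instr * t_avg;      /* cycles per instruction */
            double mem_share = (mem_refs_per_instr * t_avg) / cpi;
            printf("t_avg = %4.1f cycles -> CPI = %4.2f, memory share = %3.0f%%\n",
                   t_avg, cpi, 100.0 * mem_share);
        }
        return 0;   /* at t_avg = 5 cycles, memory accounts for half the execution time */
    }

At an average memory access time of 5 cycles, memory accounts for as much time as the useful work itself; beyond that point memory speed dominates.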
Assuming the current rates of speed increase for CPUs and DRAMs, the paper concludes that the memory wall will be hit within 10 to 15 years, no matter how far the compulsory miss rate is reduced (as long as it stays above zero). The paper finally makes some wild speculations about potential solutions to the memory gap problem.
Comments
It was pointed out that the compulsory miss rate of 1% mentioned in the paper is conservative and that 0.2% may be more realistic. However, as shown in the paper (Figure 3), even with 0.2% the wall is still hit within 15 years.
Sites presents visual representations of address reference traces for a SPEC92 benchmark (tomcatv), the DEC C compiler, and an SQL database server. SPEC92 benchmarks such as tomcatv produce very regular and simple address reference patterns, whereas the SQL server's pattern is nearly indistinguishable from white noise. From this observation, he argues that SPEC benchmarks are poor predictors of performance for large commercial software, and that "caches don't work".
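For readers who want to reproduce this kind of picture, here is a minimal sketch of one way to turn a trace into a plot; it assumes a hypothetical trace file with one hexadecimal address per line, which is not necessarily the format Sites used:

    /* Turn an address trace into (x, y) points suitable for a scatter plot:
     * x = reference number, y = 4 KB page of the address.  Regular codes such
     * as tomcatv show clean bands; noisy codes fill the plane. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long addr;
        unsigned long long n = 0;
        /* one hexadecimal address per line on stdin, e.g. "7fff5a3b1c40" */
        while (scanf("%llx", &addr) == 1)
            printf("%llu %llu\n", n++, addr >> 12);   /* 4 KB page number */
        return 0;
    }

Piping the output into a plotting tool gives reference number on the x-axis and page number on the y-axis, which is roughly the kind of picture described above.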
Comments
Some people in the class did not agree with Sites' conclusion that "caches don't work". Sites does not discuss the impact of a 3rd- or 4th-level cache or of cache size.
Someone pointed out that the SQL server, like an OS, is a large piece of software written by many people, and that it may not be tuned for memory performance.
This paper explores the impact of techniques that hide memory latency in the processor (superscalar, out-of-order issue) and in the memory subsystem (lockup-free caches, large block sizes). The authors argue that such techniques transform the latency problem into a pin-bandwidth problem. To measure the impact of limited bandwidth, execution time is broken down into processing time, raw memory latency time, and bandwidth time, where bandwidth time accounts for cycles stalled due to contention in the memory system. The authors support their thesis by measuring the execution-time breakdown of a set of SPEC92 benchmarks on simulations of MIPS-like processors. Processors with aggressive latency hiding show substantially higher fractions of bandwidth stalls in their execution time.
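The breakdown can be made concrete with a toy calculation. The sketch below uses made-up cycle counts and one plausible way to separate the components (comparing runs with a perfect memory system, with infinite bandwidth, and with the real machine); it is not necessarily the paper's exact procedure:

    /* Illustrative breakdown of execution time into processing, latency, and
     * bandwidth components, from three hypothetical simulation runs:
     *   t_perfect - same program with a perfect (single-cycle) memory system
     *   t_infbw   - real latencies but unlimited memory bandwidth
     *   t_real    - the actual machine
     * The cycle counts are invented purely for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double t_perfect = 1.00e9;   /* processing time */
        double t_infbw   = 1.40e9;   /* processing + latency stalls */
        double t_real    = 1.75e9;   /* processing + latency + bandwidth stalls */

        double processing = t_perfect;
        double latency    = t_infbw - t_perfect;
        double bandwidth  = t_real  - t_infbw;

        printf("processing %.0f%%, latency stalls %.0f%%, bandwidth stalls %.0f%%\n",
               100 * processing / t_real, 100 * latency / t_real,
               100 * bandwidth / t_real);
        return 0;
    }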
The authors view the pin-bandwidth requirements of CPUs of the next decade as a major obstacle, both in terms of the number of pins needed (several thousand) and their clock frequency (several GHz). As an alternative, the authors propose to increase the effective pin bandwidth (the apparent bandwidth seen by the processor) by improving the on-chip caches. Contrasting a perfectly managed cache (fully associative, MIN replacement, write and read around, 4-byte blocks) with a direct-mapped cache with 32-byte blocks, they quantify the opportunity to increase effective pin bandwidth at 1 to 2 orders of magnitude. The biggest benefit appears to come from small block sizes, whereas MIN replacement has a small impact. Finally, they point out that increasing the effective pin bandwidth does not necessarily reduce execution time: some programs benefit from small block sizes, others with good spatial locality benefit from large ones. As a consequence, the authors suggest that block size be put under program control.
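The block-size effect can be illustrated with a toy traffic calculation; the miss counts below are invented and only meant to show why small blocks raise the effective pin bandwidth when spatial locality is poor:

    /* Toy comparison of off-chip traffic for two block sizes.  Miss counts
     * are invented for illustration; the point is only that
     * traffic = misses * block_size, so small blocks move less data that is
     * never touched when spatial locality is poor. */
    #include <stdio.h>

    int main(void)
    {
        /* hypothetical program: 10 million misses with 32-byte blocks */
        double misses_32B = 10e6;
        /* with 4-byte blocks, assume 2.5x as many misses (each fetch brings less) */
        double misses_4B  = 25e6;

        double traffic_32B = misses_32B * 32.0;   /* bytes moved across the pins */
        double traffic_4B  = misses_4B  * 4.0;

        printf("32B blocks: %.0f MB of traffic\n", traffic_32B / 1e6);
        printf(" 4B blocks: %.0f MB of traffic\n", traffic_4B  / 1e6);
        printf("ratio: %.1fx less traffic with small blocks\n",
               traffic_32B / traffic_4B);
        return 0;
    }

With good spatial locality the extra misses of the small blocks would outweigh the savings, which is exactly why the authors suggest program-controlled block size.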
Comments
It was felt that one of the main contributions of the paper was the split of execution time into processing time, memory stall time, and bandwidth stall time.
The maximum bandwidths of the simulated architecture are 7.2 GB/s (4-way superscalar, 2 load/store units at 300 MHz) between processor and L1 cache, 1.3 GB/s between L1 and L2 cache, and 0.8 GB/s between L2 and memory. These numbers seem to suggest that the architecture isn't well balanced. According to the paper, the traffic ratio for the 128 KB L1 cache is 0.55 on average, which suggests that for the most aggressive CPU the bandwidth between L1 and L2 should be at least 4 GB/s. In view of this, the bandwidth limitation does not come as a surprise.
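The 4 GB/s figure is easy to verify: the required L1-to-L2 bandwidth is roughly the traffic ratio times the processor-to-L1 demand. A quick check with the numbers quoted above:

    /* Sanity check of the bandwidth-balance argument, using the figures quoted
     * above (7.2 GB/s processor-to-L1 demand, 0.55 average traffic ratio for
     * the 128 KB L1, 1.3 GB/s provided between L1 and L2). */
    #include <stdio.h>

    int main(void)
    {
        double l1_demand_GBs  = 7.2;    /* processor <-> L1 */
        double traffic_ratio  = 0.55;   /* fraction of L1 traffic passed on to L2 */
        double l1_l2_provided = 1.3;    /* L1 <-> L2 as simulated */

        double l1_l2_needed = l1_demand_GBs * traffic_ratio;   /* about 4.0 GB/s */
        printf("needed L1<->L2 bandwidth: %.1f GB/s, provided: %.1f GB/s (%.1fx short)\n",
               l1_l2_needed, l1_l2_provided, l1_l2_needed / l1_l2_provided);
        return 0;
    }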
Someone mentioned that the raw pin-bandwidth of processors isn't fully utilized today due to protocol limitations. Redesigning the transfer protocol across processor pins may yield higher bandwidth.
Another comment concerned the paper's suggestion of program-controlled block size. From the simulations in the paper it appears that two block sizes (16 B and 128 B) may be enough to cover a wide range of applications. On the same issue, someone gave the following reference:
C. Dubnicki and T. J. LeBlanc, "Adjustable Block Size Coherent Caches," Proc. 19th Annual International Symposium on Computer Architecture (Gold Coast, Australia, May 1992); Computer Architecture News, vol. 20, no. 2, May 1992, pp. 170-180.
Another technique used in some LISP systems to compact list data types and avoid indirection is cdr-coding. The idea of list compaction has been investigated in a project by Eric Anderson, and is also a subject in the Titanium compiler project at Berkeley.
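For those unfamiliar with it, here is a minimal sketch of the cdr-coding idea in C (illustrative only, not any particular LISP runtime): each cons cell carries a small tag saying whether its cdr is simply the next cell in memory, so a list built in order occupies a compact array of cars with no pointer chasing.

    /* Minimal sketch of cdr-coding (illustrative, not a real LISP runtime).
     * A cell whose cdr is "the next cell" needs no cdr pointer; in a real
     * implementation the tag is 2 bits and compact cells omit the pointer
     * field entirely, which is where the space saving comes from. */
    #include <stddef.h>

    enum cdr_code { CDR_NEXT, CDR_NIL, CDR_PTR };

    struct cell {
        int            car;       /* the element (an int here for simplicity) */
        enum cdr_code  code;      /* how to find the cdr                      */
        struct cell   *cdr_ptr;   /* only meaningful when code == CDR_PTR     */
    };

    /* Follow the cdr of a cell, interpreting the tag. */
    static struct cell *cdr(struct cell *c)
    {
        switch (c->code) {
        case CDR_NEXT: return c + 1;       /* next cell in the same block */
        case CDR_PTR:  return c->cdr_ptr;  /* ordinary pointer cdr        */
        default:       return NULL;        /* CDR_NIL: end of list        */
        }
    }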
It was pointed out that there is a correlation between programs exhibiting good spatial locality and the ability to vectorize such programs.
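A trivial pair of loops illustrates the point: the array version below has unit-stride accesses (good spatial locality) and is easy to vectorize, while the pointer-chasing list version is neither.

    /* Two loops summing the same values.  The array version has unit-stride
     * accesses (good spatial locality, large cache blocks help, easy to
     * vectorize); the list version chases pointers (poor locality, hard to
     * vectorize). */
    struct node { double val; struct node *next; };

    double sum_array(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)       /* unit stride: vectorizable */
            s += a[i];
        return s;
    }

    double sum_list(const struct node *p)
    {
        double s = 0.0;
        for (; p != NULL; p = p->next)    /* pointer chasing: not vectorizable */
            s += p->val;
        return s;
    }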