In their short note, the authors point out what the IRAM class is all about: "The gap between memory speed and processor speed is growing". Furthermore, the gap is growing exponentially. The so-called "memory wall" is hit when program execution time is totally determined by memory speed. The authors' model assumes a single level of cache that keeps up with processor speed and produces no capacity or conflict misses. Assuming that 20% of instructions reference memory, the "memory wall" is hit when the average time to access memory is 5 CPU cycles.
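The 5-cycle threshold follows from a simple cycle count. Below is a minimal sketch of the arithmetic (my own simplification of the model: one cycle of useful work per instruction, one data reference per five instructions, each costing t_avg cycles on average; the loop and numbers are only illustrative):

    /* Back-of-envelope check of the "memory wall" threshold described above.
     * Assumptions (mine, for illustration): every instruction does 1 cycle of
     * useful work, 20% of instructions make a data reference, and each
     * reference costs t_avg cycles on average. */
    #include <stdio.h>

    int main(void)
    {
        const double mem_refs_per_instr = 0.20;   /* 1 reference per 5 instructions */
        for (double t_avg = 1.0; t_avg <= 10.0; t_avg += 1.0) {
            double cpi = 1.0 + mem_refs_per_instr * t_avg;      /* cycles per instruction */
            double mem_share = (mem_refs_per_instr * t_avg) / cpi;
            printf("t_avg = %4.1f cycles -> CPI = %4.2f, memory share = %3.0f%%\n",
                   t_avg, cpi, 100.0 * mem_share);
        }
        return 0;   /* at t_avg = 5 cycles, memory accounts for half the execution time */
    }

At an average memory access time of 5 cycles, memory accounts for as much time as the useful work itself; beyond that point memory speed dominates.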
Assuming the current rates of speed increase for CPUs and DRAMs, the paper concludes that the memory wall will be hit within 10 to 15 years, no matter how far the compulsory miss rate is reduced (as long as it stays above zero). The paper finally makes some wild speculations about potential solutions to the memory gap problem.
Comments
It was pointed out that the compulsory miss rate of 1% mentioned in the paper is conservative and that 0.2% may be more realistic. However, as shown in the paper (Figure 3), even with 0.2% the wall is still hit within 15 years.
Sites presents visual representations of address reference traces for a SPEC92 benchmark (tomcatv), the DEC C compiler, and an SQL database server. SPEC92 benchmarks such as tomcatv produce very regular and simple address reference patterns, whereas the SQL server's pattern is nearly indistinguishable from white noise. From this observation, he argues that SPEC benchmarks are poor predictors of performance for large commercial software, and that "caches don't work".
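For readers who want to reproduce this kind of picture, here is a minimal sketch of one way to turn a trace into a plot; it assumes a hypothetical trace file with one hexadecimal address per line, which is not necessarily the format Sites used:

    /* Turn an address trace into (x, y) points suitable for a scatter plot:
     * x = reference number, y = 4 KB page of the address.  Regular codes such
     * as tomcatv show clean bands; noisy codes fill the plane. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long addr;
        unsigned long long n = 0;
        /* one hexadecimal address per line on stdin, e.g. "7fff5a3b1c40" */
        while (scanf("%llx", &addr) == 1)
            printf("%llu %llu\n", n++, addr >> 12);   /* 4 KB page number */
        return 0;
    }

Piping the output into a plotting tool gives reference number on the x-axis and page number on the y-axis, which is roughly the kind of picture described above.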
Comments
Some people in the class did not agree with Sites' conclusion that "caches don't work". Sites does not discuss the impact of a 3rd- or 4th-level cache or of cache size.
Someone pointed out that the SQL server, like an OS, is a large piece of software written by many people, and that it may not be tuned for memory performance.
This paper explores the impact of techniques that hide memory latency in the processor (superscalar, out-of-order issue) and in the memory subsystem (lockup-free caches, large block sizes). The authors argue that such techniques transform the latency problem into a pin-bandwidth problem. To measure the impact of limited bandwidth, execution time is broken down into processing time, raw memory latency time, and bandwidth time, where bandwidth time accounts for cycles stalled due to contention in the memory system. The authors support their thesis by measuring the execution-time breakdown of a set of SPEC92 benchmarks on simulations of MIPS-like processors. Processors with aggressive latency hiding show substantially higher fractions of bandwidth stalls in their execution time.
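The breakdown can be made concrete with a toy calculation. The sketch below uses made-up cycle counts and one plausible way to separate the components (comparing runs with a perfect memory system, with infinite bandwidth, and with the real machine); it is not necessarily the paper's exact procedure:

    /* Illustrative breakdown of execution time into processing, latency, and
     * bandwidth components, from three hypothetical simulation runs:
     *   t_perfect - same program with a perfect (single-cycle) memory system
     *   t_infbw   - real latencies but unlimited memory bandwidth
     *   t_real    - the actual machine
     * The cycle counts are invented purely for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double t_perfect = 1.00e9;   /* processing time */
        double t_infbw   = 1.40e9;   /* processing + latency stalls */
        double t_real    = 1.75e9;   /* processing + latency + bandwidth stalls */

        double processing = t_perfect;
        double latency    = t_infbw - t_perfect;
        double bandwidth  = t_real  - t_infbw;

        printf("processing %.0f%%, latency stalls %.0f%%, bandwidth stalls %.0f%%\n",
               100 * processing / t_real, 100 * latency / t_real,
               100 * bandwidth / t_real);
        return 0;
    }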
The authors view the pin-bandwidth requirements of CPUs of the next decade as a major obstacle, both in terms of the number of pins needed (several thousand) and their clock frequency (several GHz). As an alternative, the authors propose to increase the effective pin bandwidth (the apparent bandwidth seen by the processor) by improving the on-chip caches. Contrasting a perfectly managed cache (fully associative, MIN replacement, write and read around, 4-byte blocks) with a direct-mapped cache with 32-byte blocks, they quantify the opportunity to increase effective pin bandwidth at 1 to 2 orders of magnitude. The biggest benefit appears to come from small block sizes, whereas MIN replacement has a small impact. Finally, they point out that increasing the effective pin bandwidth does not necessarily reduce execution time: some programs benefit from small block sizes, others with good spatial locality benefit from large ones. As a consequence, the authors suggest that block size be put under program control.
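The block-size effect can be illustrated with a toy traffic calculation; the miss counts below are invented and only meant to show why small blocks raise the effective pin bandwidth when spatial locality is poor:

    /* Toy comparison of off-chip traffic for two block sizes.  Miss counts
     * are invented for illustration; the point is only that
     * traffic = misses * block_size, so small blocks move less data that is
     * never touched when spatial locality is poor. */
    #include <stdio.h>

    int main(void)
    {
        /* hypothetical program: 10 million misses with 32-byte blocks */
        double misses_32B = 10e6;
        /* with 4-byte blocks, assume 2.5x as many misses (each fetch brings less) */
        double misses_4B  = 25e6;

        double traffic_32B = misses_32B * 32.0;   /* bytes moved across the pins */
        double traffic_4B  = misses_4B  * 4.0;

        printf("32B blocks: %.0f MB of traffic\n", traffic_32B / 1e6);
        printf(" 4B blocks: %.0f MB of traffic\n", traffic_4B  / 1e6);
        printf("ratio: %.1fx less traffic with small blocks\n",
               traffic_32B / traffic_4B);
        return 0;
    }

With good spatial locality the extra misses of the small blocks would outweigh the savings, which is exactly why the authors suggest program-controlled block size.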
Comments
It was felt that one of the main contributions of the paper was the split of execution time into processing time, memory stall time, and bandwidth stall time.
The maximum bandwidths of the simulated architecture are 7.2 GB/s (4-way superscalar, 2 load/store units at 300 MHz) between processor and L1 cache, 1.3 GB/s between L1 and L2 cache, and 0.8 GB/s between L2 and memory. These numbers seem to suggest that the architecture isn't well balanced. According to the paper, the traffic ratio for the 128 KB L1 cache is 0.55 on average, which suggests that for the most aggressive CPU the bandwidth between L1 and L2 should be at least 4 GB/s. In view of this, the bandwidth limitation does not come as a surprise.
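The 4 GB/s figure is easy to verify: the required L1-to-L2 bandwidth is roughly the traffic ratio times the processor-to-L1 demand. A quick check with the numbers quoted above:

    /* Sanity check of the bandwidth-balance argument, using the figures quoted
     * above (7.2 GB/s processor-to-L1 demand, 0.55 average traffic ratio for
     * the 128 KB L1, 1.3 GB/s provided between L1 and L2). */
    #include <stdio.h>

    int main(void)
    {
        double l1_demand_GBs  = 7.2;    /* processor <-> L1 */
        double traffic_ratio  = 0.55;   /* fraction of L1 traffic passed on to L2 */
        double l1_l2_provided = 1.3;    /* L1 <-> L2 as simulated */

        double l1_l2_needed = l1_demand_GBs * traffic_ratio;   /* about 4.0 GB/s */
        printf("needed L1<->L2 bandwidth: %.1f GB/s, provided: %.1f GB/s (%.1fx short)\n",
               l1_l2_needed, l1_l2_provided, l1_l2_needed / l1_l2_provided);
        return 0;
    }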
Someone mentioned that the raw pin-bandwidth of processors isn't fully utilized today due to protocol limitations. Redesigning the transfer protocol across processor pins may yield higher bandwidth.
Another comment concerned the paper's suggestion of program-controlled block size. From the simulations in the paper it appears that two block sizes (16 B and 128 B) may be enough to cover a wide range of applications. On the same issue, someone gave the following reference:
C. Dubnicki and T. J. LeBlanc, "Adjustable Block Size Coherent Caches," Proc. 19th Annual International Symposium on Computer Architecture (Gold Coast, Australia, May 1992); Computer Architecture News, vol. 20, no. 2, May 1992, pp. 170-180.
Another technique used in some LISP systems to compact list data types and avoid indirection is cdr-coding. The idea of list compaction has been investigated in a project by Eric Anderson, and is also a subject in the Titanium compiler project at Berkeley.
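For those unfamiliar with it, here is a minimal sketch of the cdr-coding idea in C (illustrative only, not any particular LISP runtime): each cons cell carries a small tag saying whether its cdr is simply the next cell in memory, so a list built in order occupies a compact array of cars with no pointer chasing.

    /* Minimal sketch of cdr-coding (illustrative, not a real LISP runtime).
     * A cell whose cdr is "the next cell" needs no cdr pointer; in a real
     * implementation the tag is 2 bits and compact cells omit the pointer
     * field entirely, which is where the space saving comes from. */
    #include <stddef.h>

    enum cdr_code { CDR_NEXT, CDR_NIL, CDR_PTR };

    struct cell {
        int            car;       /* the element (an int here for simplicity) */
        enum cdr_code  code;      /* how to find the cdr                      */
        struct cell   *cdr_ptr;   /* only meaningful when code == CDR_PTR     */
    };

    /* Follow the cdr of a cell, interpreting the tag. */
    static struct cell *cdr(struct cell *c)
    {
        switch (c->code) {
        case CDR_NEXT: return c + 1;       /* next cell in the same block */
        case CDR_PTR:  return c->cdr_ptr;  /* ordinary pointer cdr        */
        default:       return NULL;        /* CDR_NIL: end of list        */
        }
    }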
It was pointed out that there is a correlation between programs exhibiting good spatial locality and the ability to vectorize such programs.
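A trivial pair of loops illustrates the point: the array version below has unit-stride accesses (good spatial locality) and is easy to vectorize, while the pointer-chasing list version is neither.

    /* Two loops summing the same values.  The array version has unit-stride
     * accesses (good spatial locality, large cache blocks help, easy to
     * vectorize); the list version chases pointers (poor locality, hard to
     * vectorize). */
    struct node { double val; struct node *next; };

    double sum_array(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)       /* unit stride: vectorizable */
            s += a[i];
        return s;
    }

    double sum_list(const struct node *p)
    {
        double s = 0.0;
        for (; p != NULL; p = p->next)    /* pointer chasing: not vectorizable */
            s += p->val;
        return s;
    }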