Second CS 294 Course Assignments

IRAM Software Prototyping System. Perhaps the investigation of IRAM alternatives would benefit from a prototype that could reconfigure itself to emulate several different IRAM alternatives. Options might include number of cells per sense amp, number of cells per word line, number of I/O lines per sense amp, number of banks, number of buses, width of buses, number of external connections and so on. Ideally there would either be an option or a separate program that could simulate the potential noise and retention problems to confirm that it would work without having to construct an IRAM. Alternatively, small test chips might collect and confirm parameters that could drive or verify the simulation results so that an IRAM chip would not be necessary. In addition to the memory subsystem, you would also need to vary the processor and cache portion of the IRAM. Key to an IRAM prototype is low development cost, ease of change, and speed of execution. One recommendation would be the best way to design a software IRAM emulator. Is it simply writing a C or C++ program to run on a uniprocessor? Are there advantages in being able to run on a multiprocessor? Can you get the benefits of a multiprocessor from a network of workstations? How fast would it run? What might the programming interface be? What is an easy way to run many programs? What kind of measurements would you like to collect from such a system?
(Cedric Krumbein)
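
As a rough illustration of what "ease of change" could mean for such an emulator, the options listed above can be gathered into a single configuration record and swept over. This is only a sketch: the class name `IramConfig`, the helper `sweep`, and every numeric value are hypothetical placeholders, not parameters from the project.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class IramConfig:
    """One IRAM alternative; field names mirror the options listed in the abstract."""
    cells_per_sense_amp: int
    cells_per_word_line: int
    io_lines_per_sense_amp: int
    num_banks: int
    num_buses: int
    bus_width_bits: int
    external_connections: int


def sweep(base, **choices):
    """Yield one config per combination of the listed parameter values."""
    names = list(choices)
    for values in product(*(choices[name] for name in names)):
        yield replace(base, **dict(zip(names, values)))


# Placeholder values only; they are not taken from any real design.
base = IramConfig(
    cells_per_sense_amp=256,
    cells_per_word_line=1024,
    io_lines_per_sense_amp=1,
    num_banks=8,
    num_buses=1,
    bus_width_bits=256,
    external_connections=64,
)

configs = list(sweep(base, num_banks=[4, 8, 16], bus_width_bits=[128, 256]))
```

Each element of `configs` could then be handed to one emulator run, which is also a natural unit of work to farm out to a multiprocessor or a network of workstations.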

DRAM Architecture for Vector IRAM. Putting a vector unit in a DRAM will dramatically lower the cost of meeting the high bandwidth requirements between the vector unit and main memory. In this assignment I first give a brief overview of the design techniques used in the memory hierarchies of some modern vector machines and then, based on a survey of DRAM architectures, propose a suitable DRAM architecture for a Vector IRAM system.
(Hui T. Zhang)

Finding Cache Hotspots in Spec95 Programs. Although frequently criticized, the Spec benchmark suite is widely used to measure computer performance and aid system design. We will use the WARTS toolkit to collect and analyze data about the cache behavior of Spec95 programs to guide the design of the IRAM memory hierarchy.
(Joseph D. Darcy and Manuel Fähndrich)

Cache for IRAM. This assignment is to propose a more cost-effective cache design for IRAM than conventional designs offer. The assumption is that the large miss penalty of conventional designs drives their memory organization, and that an IRAM could use a much simpler cache. For example, a gigabit IRAM can logically fetch 4096B in less than 100 ns while an Alpha fetches 64B in about 250 ns, a potential improvement of more than a hundredfold in miss penalty bandwidth. The assignment is to redesign the memory hierarchy under the assumption that it is implemented as an IRAM, and to validate the design using SPEC92 or SPEC95 programs. One possible starting point is a system based on the 300 MHz Alpha 21164 microprocessor, which uses a three-level cache hierarchy plus memory:

two direct-mapped, write through 8KB caches for instructions and data at the first level, block size is 32B, and latency is 2 clock cycles (6.7 ns);
a combined 3-way set associative, write back cache also on chip at the second level, block size is either 32B or 64B, and latency is 6 clock cycles for read (20 ns);
a direct mapped off-chip cache of 4MB (for the AlphaServer 8200), block size is 64B, with a latency of 6 clock cycles for read (20 ns) and 5 for write (16.7 ns);
DRAM is organized in up to 16 banks, depending on memory size, and transfers over a 256-bit (32B) bus to the off-chip cache. The latency is 76 cycles (253 ns) for 64 bytes.

How many levels make sense in an IRAM? What is the capacity and block size at each level?
(Windsor Hsu and Min Zhou)
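
The figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers from the abstract (300 MHz clock, 76 cycles for 64 bytes, 4096B in 100 ns); the function names are illustrative, and the 100 ns IRAM figure is treated as an upper bound, so the resulting ratio is a lower bound.

```python
CLOCK_MHZ = 300  # Alpha 21164 clock rate quoted in the abstract


def cycles_to_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


def bandwidth_bytes_per_ns(num_bytes, latency_ns):
    """Miss penalty bandwidth: bytes delivered per nanosecond of miss latency."""
    return num_bytes / latency_ns


# Alpha: 64 bytes from DRAM in 76 cycles (about 253 ns).
alpha_bw = bandwidth_bytes_per_ns(64, cycles_to_ns(76))

# Gigabit IRAM: 4096 bytes in at most 100 ns (figure from the abstract).
iram_bw = bandwidth_bytes_per_ns(4096, 100.0)

ratio = iram_bw / alpha_bw
print(f"Alpha: {alpha_bw:.3f} B/ns, IRAM: {iram_bw:.2f} B/ns, ratio: {ratio:.0f}x")
```

With these inputs the ratio comes out at roughly 160, which is in line with the "hundredfold" order of magnitude claimed in the text.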

Benchmark Trends. In class, it was suggested that Linpack performance may be correlated with achievable peak memory bandwidth. What other benchmarks exhibit this behavior? The benchmarks I will survey include: SPEC92 (the well-known CPU benchmark), Laddis (a file system benchmark), and TPC (a database benchmark). Tune in and find out!
(Remzi Arpaci)

Prefetching on an IRAM. Two of the major criticisms of conventional cache prefetching are that it wastes bandwidth and that it leads to cache pollution (where the prefetched data displaces data that really is needed and will therefore result in a subsequent cache miss). The first criticism is not particularly valid in an IRAM context, where there is substantial on-chip bandwidth that we are looking to exploit. The second criticism may or may not be valid, depending on how the caching strategy is implemented. This project looks at the possible consequences of prefetching on an IRAM. A design is studied in which the sense amps act as one level of cache. There may also be another cache level, a smaller conventional SRAM cache. In a 2-level cache scenario, if the prefetching is limited to the sense amp (second level) cache, it is hypothesized that the problem of cache pollution will be less severe, partly because the second level cache will already have a relatively high miss rate compared to the first level cache. If prefetching into the L2 cache does not invalidate any L1 cache blocks, this may also reduce the potential cache pollution problem. This proposal requires partly abandoning the principle of cache inclusion, since the contents of the L1 cache will not always be a subset of the L2 cache. However, it is not apparent that there are any significant drawbacks to eliminating strict inclusion. [This project began as a literature search on prefetching, so a number of paper summaries are also included.]
(Richard Fromm)
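
To make the pollution criticism concrete, here is a toy model, not the design studied in the project: a single direct-mapped cache that, on a miss, also prefetches the next sequential block. The function name `count_misses`, the cache size, and both traces are assumptions chosen purely for illustration.

```python
def count_misses(trace, num_lines, prefetch_next=False):
    """Count demand misses in a direct-mapped cache of `num_lines` blocks.

    With prefetch_next=True, each demand miss also installs block + 1,
    possibly evicting a block that is still needed (cache pollution).
    """
    lines = [None] * num_lines
    misses = 0
    for block in trace:
        index = block % num_lines
        if lines[index] != block:
            misses += 1
            lines[index] = block
            if prefetch_next:
                nxt = block + 1
                lines[nxt % num_lines] = nxt
    return misses


sequential = list(range(8))        # prefetching helps here
ping_pong = [0, 3, 0, 3, 0, 3]     # prefetch of block 4 evicts block 0

seq_plain = count_misses(sequential, 4)
seq_pref = count_misses(sequential, 4, prefetch_next=True)
pp_plain = count_misses(ping_pong, 4)
pp_pref = count_misses(ping_pong, 4, prefetch_next=True)
```

On the sequential trace prefetching halves the misses (8 down to 4), while on the ping-pong trace it adds one miss (2 up to 3) because the prefetched block displaces one that is about to be reused.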

Low Power Design Techniques for Memory and Microprocessors. Since power may be one of the constraints in the design of an IRAM, a literature survey was performed of past low power design techniques for both memory and microprocessors. A few design options were also explored.
(Heather Bowers and Trevor Pering)

NUMA for IRAM. Memory access latencies may be nonuniform within a single IRAM chip. If one wishes to build a multiprocessor on an IRAM die, it may make sense to take a NUMA approach to the design. A number of NUMA papers are surveyed, and their significance to IRAM is examined. A NUMA multiprocessor is one that exposes the different memory latencies to the programmer or operating system. The NUMA problem is that of managing the migration and replication of pages among the processors. Page placement policies enhance performance significantly compared to naive page placement. A parameterized policy can be easily adapted to work well on different architectures. These policies work well when there is a large latency difference, little network contention, and fast block transfers. Whether an IRAM multiprocessor has these characteristics depends on technology as well as design.
(James Young)

Survey of Vector Memory Units. Some have suggested that the vector chaining and gather/scatter units are very complex and difficult to design. Others have disagreed: the suggestion is to look at the patents awarded to Cray Research in this area for hints. This assignment would follow that advice, going to the Patent Office in Silicon Valley to find patents and to get copies, read them, and summarize the key ideas on how to design the memory unit for a vector machine that supports chaining and gather/scatter.
(Lloyd Y. Huang)

A proposal for a prefetching icache and compiler optimization for an IRAM. Potential IRAM designs have huge bandwidth between cache and memory, which reduces the penalty due to an incorrect prefetch. Similarly, since there are multiple memory banks, they can be prefetched independently. But since there are few banks, care must be taken to prevent bank conflicts. An icache design is postulated, which naively prefetches instructions at some fixed lookahead from the current point of execution, in addition to prefetching the return point of a function through an explicit instruction. A compiler optimization needed to prevent bank conflicts is also proposed. Then, real programs (notably SIOD and several of the SPEC benchmarks) are analyzed to see if the compiler optimization can be used in practice.
(Nicholas C. Weaver)
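
One way to picture the bank conflict concern is sketched below. It assumes, hypothetically, that cache lines are interleaved across banks in round-robin order and that a conflict means the prefetch target falls in the same bank as the instruction currently being fetched; the abstract does not specify either point, and the line size, bank count, and function names are illustrative.

```python
def bank_of(addr, line_bytes, num_banks):
    """Bank holding `addr`, assuming line-interleaved banks (an assumption)."""
    return (addr // line_bytes) % num_banks


def conflicts(pc, lookahead_bytes, line_bytes, num_banks):
    """True if the fixed-lookahead prefetch hits the same bank as the fetch at pc."""
    target = pc + lookahead_bytes
    return bank_of(pc, line_bytes, num_banks) == bank_of(target, line_bytes, num_banks)


LINE_BYTES = 32   # illustrative value
NUM_BANKS = 4     # "few banks", illustrative value

pcs = range(0, 1024, 4)
always = all(conflicts(pc, 128, LINE_BYTES, NUM_BANKS) for pc in pcs)
never = not any(conflicts(pc, 64, LINE_BYTES, NUM_BANKS) for pc in pcs)
```

Under these assumptions a lookahead equal to a whole number of bank rotations (here 128 bytes) conflicts on every fetch, whereas a lookahead of 64 bytes never does, which suggests why the choice of lookahead and code layout matters when banks are few.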

A Survey of Embedded Processors. Embedded computing is one application for IRAM which has been discussed many times. This project investigates the possibility of using IRAM for embedded applications. Several types of embedded systems are compared based on performance, cost, power, and reliability. A summary of the suitability of IRAM for use in embedded systems is presented.
(Rich Martin)

IRAM Yield Consideration: Some Thoughts from IBM CMOS Technology Evolution. Since IRAM combines a DRAM process with a logic process, any yield investigation or prediction must build on what is known about DRAM and logic yield. However, the industry is inching toward the day when there may be two standardized manufacturing process flows, one for CMOS and one for DRAMs. Also, factory, process, and cost modeling efforts have helped push the development of a single industry process, really two processes, to accommodate the difference in the way logic and memory devices are fabricated. In this project, I will review modern yield analysis for DRAM and logic and point out the key technology in the realization of IRAM. It is hard to discuss and compare different technologies, so in this paper I will concentrate on IBM CMOS technology, which includes the evolution of both DRAM (4Mb, 16Mb, 64Mb, and 256Mb) and logic (CMOS 5L, CMOS 5S, CMOS 5X). It is interesting to note that IBM does migrate designs from one process to another. The material is organized as follows:

IBM CMOS technology.
1. DRAM.
2. CMOS logic.
Yield consideration about IBM technology.
How about yield of IRAM?
Conclusion and future work.

(Xinhui Niu)

Low Power Design Techniques and their Relation to IRAM
One possible application of IRAM technology is in low power systems. Low power techniques are explored from the system level down to the circuit level. It would appear that IRAM is suitable for low power in a number of ways. IRAM eliminates much of the off-chip capacitance by integrating the memory on-chip. This not only eliminates all of the off-chip address and data lines, but also eliminates extra chips used for memory and bus management. Another interesting advantage of IRAM is its memory latency. The processor is kept waiting for fewer cycles during a memory miss, thus wasting less power driving the clock while the processor is idle. In the conclusion, an argument for IRAM in low power applications is made based on the findings in the study.
(Bruce McGaughy)