Course Assignments
CS 294-4 - Intelligent RAM (IRAM)
Assignment Overview
These assignments are to be done in two weeks by 1 or 2 people. It is
fine if an assignment is done by more than one group or person, but
given the wealth of important topics, we probably don't want three or
more groups working on the same assignment.
The results should be placed on the WWW. My idea is that those who have
taken an assignment will give me a dummy URL at sign-up time, and the
results will then be updated over the two weeks so that people can see
what is being learned as it happens. The initial report would just say
that the page is under construction.
Needless to say, the purpose of this class is to explore a potentially
exciting new frontier, thus it makes no sense to hold back. Each group will
present their results at a class meeting.
My general idea is to have assignments in three areas:
- integrated circuits
- computer architecture
- software (compilers, operating systems, and possibly applications)
and have three types of assignments:
- programming
- literature search
- short designs
My hope is that each of you will pick an assignment according to your skills:
e.g., if you are good at programming/UNIX and interested in architecture, you
will do a programming assignment in architecture, such as determining how
well a cache with very large blocks would work on some of the SPEC95 programs.
On the other hand, if your skills were more in circuits, you might try
to find all papers regarding logic in a DRAM process, coming up with a
bibliography and summarizing the results in a table, both on the WWW.
The model is that the first two assignments would either be literature
search or programming, depending on your area and skills, and everyone
does a short design project as the final assignment.
All three assignments should form a solid foundation on which to do the
final projects.
Although it's fun to program, it usually saves time to use programs that
others have created. Two useful sets of resources are:
Below are examples of assignments. I am very interested in suggestions,
from anyone, on what would be good things to work on.
Programming Assignments
- IRAM vs. conventional caches on database/OS trace.
I have a CD containing Dick Sites' trace of the Microsoft SQL server running
on the Windows NT system for a DEC Alpha computer
[Sit96].
The first step would be to recreate the results he claimed in his paper and
his course at Berkeley. The second step would be to see how well IRAM would
work for his workload. You would start from a straightforward cache design,
just with very wide blocks that are loaded quickly, e.g., from 1024 bits to
16384 bits in 50 ns, and vary the number of Sense Amps/Buffers; a sketch of
such a simulator follows this item. The question
is whether the benefits (if any) come from reuse, spatial locality, or simply
the wide bandwidth. It's possible that there is no benefit, as each access
might look like a random 32-bit load or store.
If IRAM does have a performance advantage, estimate how much slower an
IRAM processor could be and still be as fast as the Alpha with a
conventional memory system.
(Windsor Hsu and Min Zhou)
(Also, Remzi Arpaci is doing a similar but independent assignment.)
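As a concrete starting point, the wide-block cache can be simulated in a
few dozen lines of C. Here is a minimal sketch; the trace format (one hex
address per line on stdin) and the block and buffer counts are assumptions,
to be replaced by the actual Sites trace format and the parameters you want
to sweep:

    /* Minimal direct-mapped cache with very wide blocks, driven by an
     * address trace.  Trace format and parameters are assumptions. */
    #include <stdio.h>

    #define BLOCK_BITS 11   /* log2(block bytes): 2048B = 16384 bits */
    #define NUM_BLOCKS 256  /* e.g., one block per sense-amp buffer */

    static unsigned long tag[NUM_BLOCKS];
    static int valid[NUM_BLOCKS];

    int main(void)
    {
        unsigned long addr, hits = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {
            unsigned long block = addr >> BLOCK_BITS;
            unsigned long index = block % NUM_BLOCKS;
            if (valid[index] && tag[index] == block)
                hits++;
            else {              /* miss: the wide block loads in ~50 ns */
                misses++;
                tag[index] = block;
                valid[index] = 1;
            }
        }
        if (hits + misses > 0)
            printf("hits %lu  misses %lu  miss rate %.4f\n",
                   hits, misses, (double)misses / (hits + misses));
        return 0;
    }

Sweeping BLOCK_BITS and NUM_BLOCKS, and comparing against a conventional
configuration, would help separate the reuse, spatial-locality, and
bandwidth effects.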
- Vector vs. Superscalar/conventional Cache on SPEC95.
It's possible that vector computing [Joh78] would be a very good match
to IRAM. SPEC 95 includes a set of integer programs in C and floating point
programs in Fortran, and you would expect that some of the floating point
programs would do very well on a Cray. I expect to have the SPEC95 CD this
week. This project would run programs with
and without vectorization, and report the results. Included in the paper
would be a comparison with some superscalar RISC machines on each program
(whose results can probably be found via the Computer Architecture Home Page),
so that we can see where vector works well and where it works poorly.
I can probably get you an account on a Cray if you don't already have access
to one. Since there are many SPEC95 programs, it probably makes sense just to
do a subset, so several groups could take a crack at this. If time permits,
it would be interesting to see why the results are good for vector vs.
superscalar/cache. See if you can characterize which SPEC programs are a
good match (e.g., tomcatv) versus a poor match (e.g., gcc) for IRAMs; the
loop fragments after this item illustrate the difference.
(To the best of my knowledge, I've never seen SPEC ratings on Cray Research
computers, so this would be a first.)
(Cedric Krumbein and Richard P. Martin)
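For intuition about the good match/poor match question, compare a
tomcatv-style loop with a gcc-style pointer chase. These are illustrative
C fragments I made up, not actual SPEC95 code:

    /* Illustrative fragments, not actual SPEC95 code.  The first loop
     * has independent iterations over long arrays, so a vectorizing
     * compiler maps it onto vector instructions; the second serializes
     * on a load whose address depends on the previous load. */
    #define N 1000

    void vectorizable(double x[N], double y[N], double a)
    {
        for (int i = 0; i < N; i++)     /* tomcatv-like: vectorizes */
            x[i] = x[i] + a * y[i];
    }

    struct node { struct node *next; int val; };

    int pointer_chase(struct node *p)
    {
        int sum = 0;
        while (p != NULL) {             /* gcc-like: does not vectorize */
            sum += p->val;
            p = p->next;
        }
        return sum;
    }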
- Instruction/Data Correlation
Instead of moving data to the processor, move the processor to the data.
For a given piece of data, what is the size of the code that accesses it,
and what other pieces of data does that code access?
(Trevor Pering)
- Gather/Scatter Support for IRAM
One area where Cray Research vector supercomputers shine compared to
conventional cache-based workstations is applications that take
advantage of the gather/scatter hardware. This mechanism allows
vectors of data to be loaded or stored using another
vector register which contains addresses of the data,
giving basically a vector version of indirect addressing.
This is fast on Cray Research machines for three reasons:
- the vector gather/scatter hardware
- the highly interleaved main memory of Cray Research machines,
which reduces collisions from the same memory bank
- the low latency of the (expensive) SRAM used for main memory
IRAM could offer the first two, and while an IRAM's latency to its
on-chip main memory should be lower than that of a traditional off-chip
DRAM system, it would not be as fast as SRAM.
If the reason was simply the highly interleaved memory, then we might
not need the vector operations.
The purpose of this assignment is to determine how well
such an IRAM would work in such cases. You might write your own
microbenchmark to perform gather/scatter (a sketch follows this item),
see how well it runs on a cache-based machine, and then simulate the
memory performance for IRAM; also look at a Blocked Cholesky Sparse
Matrix code.
(This project was suggested by Kathy Yelick.)
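A minimal gather microbenchmark might look like the C sketch below; the
array size and the random index stream are arbitrary choices, and a real
study would also time the scatter direction (x[idx[i]] = y[i]) and vary
the index distribution:

    /* Gather microbenchmark sketch: y[i] = x[idx[i]].  On a Cray the
     * inner loop maps onto vector gather hardware; on a cache-based
     * machine each random index is likely a miss once x outgrows the
     * cache.  Sizes and index distribution are arbitrary choices. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)     /* arbitrary: 1M elements (8MB per array) */

    int main(void)
    {
        static double x[N], y[N];
        static long idx[N];
        long i;
        clock_t start, stop;

        for (i = 0; i < N; i++) {
            x[i] = (double)i;
            idx[i] = rand() % N;        /* random gather indices */
        }

        start = clock();
        for (i = 0; i < N; i++)         /* the gather itself */
            y[i] = x[idx[i]];
        stop = clock();

        printf("gather of %d elements: %.3f s (y[0] = %g)\n", N,
               (double)(stop - start) / CLOCKS_PER_SEC, y[0]);
        return 0;
    }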
- Cache for IRAM
This assignment is to propose a more cost-effective solution for
caches for IRAM than conventional designs. The assumption is that
the large miss penalty of these designs drives the memory
organization, and that you would use a much simpler cache on an IRAM.
For example, a gigabit IRAM can logically fetch 4096B in less than 100 ns
while an Alpha fetches 64B in about 250 ns, a potential hundredfold
improvement in miss-penalty bandwidth. This assignment is to redesign the
memory hierarchy assuming it is implemented in an IRAM, and to validate
the design using SPEC92 or SPEC95 programs. One example you could start
from is
systems based on the 300 MHz Alpha 21164 microprocessor, which uses a
three-level cache hierarchy plus memory:
- two direct-mapped, write-through 8KB caches for instructions and data
at the first level; block size is 32B, and latency is 2 clock cycles
(6.6 ns);
- a combined 3-way set-associative, write-back cache also on chip at the
second level; block size is either 32B or 64B, and read latency is 6
clock cycles (20 ns);
- a direct-mapped off-chip cache of 4MB (for the AlphaServer 8200);
block size is 64B, with a latency of 6 clock cycles for read (20 ns)
and 5 for write (16.7 ns);
- DRAM is organized in up to 16 banks, depending on memory size,
and transfers over a 256-bit (32B) bus to off-chip cache. The latency is 76
cycles (253 ns) for 64 bytes.
How many levels make sense in an IRAM? What is the capacity and block
size at each level? (A back-of-the-envelope calculation from these
numbers follows this item.)
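The latencies above plug directly into the usual recursion, AMAT = hit
time + miss rate x (next-level access time), which is easy to script when
comparing hierarchies. A C sketch; the miss rates are placeholders to be
replaced by values measured from SPEC92/SPEC95 simulation, and the IRAM
figures come from the text above:

    /* Back-of-the-envelope average memory access time (AMAT) for the
     * 21164-style hierarchy described above vs. a simpler IRAM
     * hierarchy.  All miss rates are PLACEHOLDERS. */
    #include <stdio.h>

    int main(void)
    {
        double l1 = 6.6, l2 = 20.0, l3 = 20.0, mem = 253.0;  /* ns */
        double m1 = 0.05, m2 = 0.20, m3 = 0.10;  /* placeholder miss rates */

        double amat = l1 + m1 * (l2 + m2 * (l3 + m3 * mem));
        printf("Alpha-style AMAT: %.1f ns\n", amat);

        /* IRAM alternative: one small cache backed by on-chip DRAM that
         * delivers a 4096B block in under 100 ns (figure from the text);
         * the lower miss rate is a placeholder for the effect of very
         * wide blocks. */
        double il1 = 6.6, im1 = 0.01, imem = 100.0;
        printf("IRAM-style AMAT:  %.1f ns\n", il1 + im1 * imem);
        return 0;
    }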
- Caches and Code Size
RISC designs sacrifice code size for fast execution by the CPU.
More compact encodings, such as those used by the Java interpreter or
the VAX, accept more complicated instruction decoding to save code size.
As the processor-memory gap grows, compact instructions may increasingly
gain performance by wasting less time on instruction cache misses. This
assignment tries to quantify those benefits.
Make assumptions of cache sizes and miss penalty for 1986 and for 1996.
Pick a RISC
machine and some computer with compact encoding. Use the fast cache
simulation schemes to compare performance. How much slower can the
compact-instruction CPU be and still be as fast as the RISC machine in
1986 vs. 1996? (A sketch of the break-even arithmetic follows this item.)
(This project was suggested by John Ousterhout.)
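One way to frame the question is as break-even arithmetic: if denser code
cuts instruction-cache misses, how much longer can the compact machine's
cycle time be at equal work per instruction? A C sketch in which every
number is a placeholder to be replaced by measurement:

    /* Break-even cycle-time arithmetic for a compact-encoding CPU vs. a
     * RISC CPU, assuming equal instruction counts and base CPI.  All
     * numbers are placeholders to be replaced by measurement. */
    #include <stdio.h>

    static void break_even(double penalty, const char *era)
    {
        double cpi_exec     = 1.5;   /* base CPI, assumed equal */
        double risc_miss    = 0.05;  /* I-cache misses/instruction, RISC */
        double compact_miss = 0.02;  /* denser code -> fewer misses */

        double risc_cpi    = cpi_exec + risc_miss * penalty;
        double compact_cpi = cpi_exec + compact_miss * penalty;
        /* equal time per instruction => compact clock may be slower by: */
        printf("%s: compact CPU may clock %.0f%% slower\n",
               era, 100.0 * (risc_cpi / compact_cpi - 1.0));
    }

    int main(void)
    {
        break_even(5.0,  "1986");    /* placeholder miss penalty, cycles */
        break_even(50.0, "1996");
        return 0;
    }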
- Code space for Vector vs. Conventional Designs.
One argument for vector processing is that vector instructions use
fewer bits to specify instruction-level parallelism than do
superscalar designs. This assignment would simply collect data to
determine whether or not it is true. You would compare the code size
of a few computers with the optimizations specified by their SPEC95
results, which may include statically linked libraries and loop
unrolling, to a vector machine such as the Cray. It would be
interesting to include the binaries for an x86 machine as well.
- Examine Hot-Spots.
Use standard UNIX profiling tools to find some of the time consuming
code sequences in SPEC95, in the Sites database trace (if instructions are
included), or in a commercial database.
See if you can find techniques, well matched to IRAM, that would make
them run better. Be sure to look at explicit memory management and vector
processing, but consider more radical techniques like periodically
linearizing linked lists (sketched after this item). Estimate how much
faster these hot spots would be, as well as what fraction of the time is
spent in them. It's fine to do these examples by hand.
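As one example of the radical end, here is a hedged C sketch of
linearizing a linked list: the nodes are copied into traversal order so
that later walks become sequential and can stream from wide DRAM rows.
The node layout is hypothetical:

    /* Periodic linked-list linearization: copy nodes into a contiguous
     * array in traversal order so subsequent walks have perfect spatial
     * locality.  The node layout is hypothetical. */
    #include <stdlib.h>

    struct node { struct node *next; int val; };

    struct node *linearize(struct node *head, long n)
    {
        struct node *flat;
        long i = 0;

        if (head == NULL || n <= 0)
            return head;
        flat = malloc(n * sizeof *flat);
        if (flat == NULL)
            return head;            /* out of memory: keep the old list */
        for (struct node *p = head; p != NULL && i < n; p = p->next, i++) {
            flat[i].val  = p->val;
            flat[i].next = &flat[i + 1];
        }
        flat[i - 1].next = NULL;    /* terminate the copied list */
        return flat;                /* caller frees the old nodes */
    }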
- Calibrate the benefits of code and data compression.
Using both standard and novel compression schemes, experimentally
determine the benefits of on-the-fly compression. How much benefit do you
get from code? From data? What is the overall benefit? One approach might
be to periodically cause core dumps and then run the compression on the
resulting images; a zlib-based sketch follows this item.
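For the standard-scheme end of the spectrum, zlib's compress() gives a
quick calibration of how compressible a memory image is. A sketch,
assuming the dump image is an ordinary file named on the command line:

    /* Measure how compressible a memory image is with zlib.
     * Usage: ./calib dumpfile   (the file name is an assumption). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        long n;
        unsigned char *src, *dst;
        uLongf dstlen;

        if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
            return 1;
        fseek(f, 0, SEEK_END); n = ftell(f); rewind(f);

        src = malloc(n);
        dstlen = n + n / 1000 + 64;     /* zlib's worst-case bound */
        dst = malloc(dstlen);
        if (!src || !dst || fread(src, 1, n, f) != (size_t)n)
            return 1;

        if (compress(dst, &dstlen, src, n) != Z_OK)
            return 1;
        printf("%ld -> %lu bytes (%.1f%% of original)\n",
               n, (unsigned long)dstlen, 100.0 * dstlen / n);
        return 0;
    }

Separating the text, data, and stack segments of the image would split
the benefit between code and data.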
Literature Search Assignments
Using the following resources:
- WWW search engines (such as Inktomi or Alta Vista);
- the University of California on-line library database, Melvyl;
- good old chasing of references in papers and going to the library;
find on-line and regular references to architecture studies that describe
proposed or real computers where the processing is next to the memory.
Your task is to summarize this work by including the proper citation,
a short summary of the claimed results along with the pros and cons,
and a single on-line table that summarizes all of the projects. Items to
include would be year, style of machine (uniprocessor, SIMD, MIMD, DSP, ...),
application area, performance claims, status of hardware (if any), citation,
and so on. Especially important will be finding on-going projects!
- History and State-of-the-Art of Logic in Memory Chips.
Similar to the assignment above, find examples of proposals or chips that
are dominated by memory but include logic on the chip. In addition to
the categories mentioned above,
see if you can find estimates of the logic's size, power, and
speed if it is a DRAM chip, or the memory's size, power, and speed if it
is a logic chip. Also list as many process parameters as are available.
(Bruce McGaughy and Xinhui Niu)
(Also, Lloyd Y. Huang is doing a similar but independent assignment.)
- History and State-of-the-Art of Compiler Controlled Memory Hierarchy.
Similar to the assignments above, review the effectiveness of
compilers in improving the performance of memory accesses by explicit
optimization of memory transfers. Examples should include vector
architectures, cache-based optimizations, and anything else along these
lines.
In addition to the categories mentioned above,
summarize the types of programs for which the optimizations work well and
those for which they work poorly.
(Joe Darcy and Manuel Fahndrich)
- Program Size and Page Fault Optimization Survey.
This is a survey report on optimizations for program space, as well
as historical work on reducing page-faults in programs. Since IRAM
will be limited to the memory of a single DRAM, code and data space
are important considerations.
(Nick Weaver)
- History and State-of-the-Art of Circuit and Architecture in DRAM Chips.
This survey will summarize various DRAM designs from the perspective of
their circuit techniques and architecture, in order to reveal the
potential and the limitations in launching an IRAM chip. We will evaluate
claimed performance results and circuit-design techniques, and weigh the
pros and cons of matching DRAMs with different types of systems. Based on
the survey we want to extract some plausible strategies, especially for
the physical design of the memory part of an IRAM, in terms of area,
power, timing, noise, etc.
(John Deng and Hui Zhang)
- DRAM Architecture Tradeoffs.
DRAM designs are typically optimized for operation with the traditional
RAS/CAS memory interface. In an IRAM processor, the processor and memory
reside together on a single die, so the DRAM does not need to deliver its
data off-chip. Hence many design choices exist as to how to interface the
DRAM to the processor(s). I will survey the impact of several factors,
such as block size, column decoding, and address decoding, on the overall
performance of the DRAM. An accurate characterization of the DRAM will
enable sound architectural decisions to be made as to how best to
interface the memory to the IRAM processor core.
(James Young)
- History and State-of-the-Art of DRAM Testability Issues.
A major portion of the cost of a DRAM is testing time. It may be possible
to utilize the processing power present in an IRAM to reduce this cost.
However, the additional complexity of testing the processor logic could
also increase this cost. Before delving into the issues of how an IRAM
might affect DRAM test costs, it is useful to first understand the
history and state of the art of DRAM testability, including both
traditional testing and more novel techniques such as Built-In Self-Test
(BIST). It would also be helpful to investigate how existing chips that
merge logic into a DRAM process perform testing of the logic. This
project is a literature search to explore these areas.
(Rich Fromm)
- History and State-of-the-Art of Digital Signal Processors and Memory Bandwidth.
DSP designers have also been pursuing the concept of
DSP processors built on-board a DRAM chip. This literature
survey will provide a brief history of conventional memory
accesses in DSP applications, and will then focus on recent
industry developments in overcoming the memory bandwidth
issue.
(Heather Bowers)
- History and State-of-the-Art of Code and Data Compression.
Given the speed of off-chip accesses, it may be important to reduce the
size of instructions and data so as to fit more on chip.
Look at the papers from the 1970s on instruction set encoding, as well
as Huffman encoding. See if you can find any schemes that looked at
using less space for data. Also, review standard compression
technology to see if there were any schemes that might let you use
standard instruction sets and data but decompress on-the-fly from
memory and compress on-the-fly to memory. Also include a survey of
instruction set support for direct memory management in existing
architectures.
(Craig Teuscher)
- History and State-of-the-Art of Logic in Memory Architectures.
Include style of machine (uniprocessor, SIMD, MIMD, DSP, ...) as
well as performance claims.
- History and State-of-the-Art of Performance Optimization for
and Evaluation of Real-Time Applications.
Given that one of the potential applications of IRAM is embedded
computing, and given that IRAMs may offer explicit control of
memory management and even interrupts, real-time applications
may be a good match to IRAM. Survey how real-time applications
are evaluated, what it means to improve performance (e.g., worst
case or average case?), real-time benchmarks, and so on. Two new
EE faculty, Professors Thomas Henzinger (tah@eecs) and Sharad Malik (sharad@ee.princeton.edu), would be good people
to talk to.
- History and State-of-the-Art of Power Optimization for
Processors and Memory.
Given that power may be one of the constraints of an IRAM, it will
be useful to learn the prior work on conserving power in processors,
memory, and circuits. This assignment performs such a survey.
Short Design Assignments
The model for these assignments is a relatively short investigation,
either simple idea generation with little evaluation, or more serious
evaluation of a very simple portion of the problem.
- Multichip IRAM solutions.
Propose a scheme that would allow programs and data to be larger than
one chip. Here are a few places to start:
- Mountain to Mohammed: Assume one processor executes the program,
so the processor must stall until the requested instruction or
data is fetched from the remote IRAM. Reviewing the prefetching
literature might be helpful.
- Hybrid System: It seems unlikely that generic DRAMs will disappear,
no matter how successful IRAMs might be. Hence this design simply has
part of main memory on chip, with the rest in external DRAMs over a bus.
How well would this work? What is the impact of different DRAM interfaces
(e.g., Synchronous DRAM or Rambus DRAM)? Would you swap pages from
external to internal DRAM, or simply have slower access? How might
software or the linker allocate two different speeds of main memory?
(A hypothetical allocator interface is sketched at the end of this item.)
- Bit-slice: Assume the processor is capable of operating either as
a single whole unit or as a bit slice. For example, assuming a 64-bit
processor, 2 chips would each have 32 bits of the logical processor, 4
chips would each have 16 bits, and so on. You should probably look at
the old bit-slice chips from AMD as a reference.
- Continuous State Broadcast: Using the network connections or a
bus, try to keep every processor up to date by broadcasting the new
results from the processor that is active. The active processor is the
one whose memory is being accessed.
- Parallel Processing: By definition, every program and
its data are distributed across many chips, and it's up to the
programmer to coordinate the execution of multiple processors and the
necessary communication to operate correctly.
For each scheme considered, do a back-of-the-envelope calculation of its
performance, and list the pros and cons. Look at the cases where the code
is large but the data fits, and vice versa, as well as where both the
code and data are too large.
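For the Hybrid System option above, here is a hypothetical C sketch of
what a two-speed memory allocation interface might look like; every name
in it is invented for illustration:

    /* Hypothetical interface for allocating from two speeds of main
     * memory in a hybrid IRAM + external-DRAM system.  All names are
     * invented; a real system would carve MEM_ONCHIP allocations out
     * of a reserved fast physical range.  Here both pools fall back to
     * malloc so the sketch compiles. */
    #include <stddef.h>
    #include <stdlib.h>

    enum mem_pool { MEM_ONCHIP, MEM_OFFCHIP };

    void *pool_alloc(enum mem_pool pool, size_t bytes)
    {
        (void)pool;                 /* placeholder: ignore the pool */
        return malloc(bytes);
    }

    /* The compiler, linker, or programmer steers hot structures on
     * chip and bulk data off chip: */
    void example(void)
    {
        double *hot  = pool_alloc(MEM_ONCHIP,  4096 * sizeof(double));
        char   *bulk = pool_alloc(MEM_OFFCHIP, 1 << 24);
        (void)hot; (void)bulk;
    }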
- 4-bit Adder in DRAM vs. Logic process.
Use Spice to design a simple 4-bit adder in both a logic process and
a DRAM process. Include the relationship of power, area, and speed,
and speculate what would happen for 64-bit adders.
- Four 4-bit Registers in DRAM vs. Logic process.
Like the assignment above, but design four 4-bit registers. Each
register must have two read ports and one write port. Speculate on the
performance if there were 32 64-bit registers.
- Cost Justified IRAM.
(You could also consider this a DRAM with a free processor.)
One cost of standard DRAMs is testing time. If a very low cost
processor was part of every gigabit DRAM, perhaps the processor
could be justified simply by the reduction of tests. See if this
idea has merit or not, and whether you can find other difficulties
in DRAM manufacturing that would justify the cost of the processor
even if a customer never used it. Estimate how small the processor
would have to be to avoid making yield worse, and how fast it would
have to be to significantly reduce testing time. It would also need to
match the power limits of a standard DRAM. Can it be done with less
than 1% impact on area, power, yield? How fast a processor would you
have?
- IRAM as a network interface controller.
One potential use of IRAM might be to control networks. Network interface
controllers need processing, memory, and serial interfaces to the
networks. Examine the
processor speed, amount of memory, network interfaces, and cost goals
to see if an IRAM might be attractive for several networks.
- Reprogrammable Memory.
One use of the FPGA on an IRAM is to customize a processor to an application.
Mark Horowitz suggested that another use might be to tailor the
organization of memory, e.g., turn an IRAM into a single chip with
five FIFOs for use in a router. The basic idea is that with all the
capacity in a memory and the limited number of pins on a chip, perhaps
being able to "program" the logical width, number of memory modules,
and connections between the modules on chips would make for a very
attractive component.
- An IRAM interface.
Propose an interface appropriate for
IRAM. It can have many more pins than a DRAM and you should include
how the network should interface to the sense amps and an addressing
scheme that allows a chip to read or write remote data over these
pins. See if some existing RAM packages are appropriate: DRAM, VRAM,
Rambus, and so on.
- Explicit Memory Management.
Propose a scheme that would allow the compiler to explicitly load
or store a memory buffer/vector register. Include the instructions
that would be needed to perform this control, and estimate how well
it would work for some programs. See how close you can stay to an
existing instruction set. (A hypothetical flavor of such an interface is
sketched after this item.)
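To show the flavor of such an interface (not a real ISA), here is a
hypothetical sketch of compiler-controlled buffer load/store written as C
functions standing in for the new instructions; all names are invented:

    /* Hypothetical compiler-controlled buffer load/store, with C
     * functions standing in for instructions.  All names are invented;
     * the assignment is to map such operations onto an existing ISA. */
    #include <stddef.h>
    #include <string.h>

    #define BUF_BYTES 2048              /* say, one wide DRAM row */
    static char onchip_buf[BUF_BYTES];  /* stands in for the buffer */

    /* "load.buffer": fill the buffer from memory starting at addr */
    static void buf_load(const void *addr, size_t n)
    {
        memcpy(onchip_buf, addr, n < BUF_BYTES ? n : BUF_BYTES);
    }

    /* "store.buffer": write the buffer back to memory at addr */
    static void buf_store(void *addr, size_t n)
    {
        memcpy(addr, onchip_buf, n < BUF_BYTES ? n : BUF_BYTES);
    }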
- IRAM Prototyping System.
Perhaps the investigation of IRAM alternatives would benefit
from a prototype that could reconfigure itself to
emulate several different IRAM alternatives. Options might include
the number of cells per sense amp, number of cells per word line,
number of I/O lines per sense amp, number of banks, number of buses,
width of buses, number of external connections, and so on.
In addition to the memory subsystem, you would also need to vary the
processor and cache portion of the IRAM.
Key to an IRAM prototype are low development cost, ease of change, and
speed of execution.
One deliverable would be a recommendation on the best way to design
a software IRAM emulator. Is it simply writing a C or C++ program
to run on a uniprocessor? Are there advantages in being able to
run on a multiprocessor? Can you get the benefits of a multiprocessor
from a network of workstations? How fast would it run? What might the
programming interface be? What is an easy way to run many programs?
What kind of measurements would you like to collect from such a
system?
Another recommendation would be a hardware prototype. This prototype
might consist of Altera programmable logic chips, switch chips, and
large amounts of SRAM or DRAM. How fast might it be? How easy would it be to
change parameters? How would it run programs?
What measurements could it collect? How long would it take to construct?
How much would the components cost? How big would it be? How would it
connect to computers? (Either way, a sketch of the configuration record
the emulator might be built around follows this item.)
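The parameter list above suggests the core data structure. A C sketch of
the configuration record a software emulator might be built around; the
field names and example values are guesses:

    /* Configuration record an IRAM emulator might sweep over.  Field
     * names and example values are guesses based on the list above. */
    struct iram_config {
        int cells_per_sense_amp;
        int cells_per_word_line;
        int io_lines_per_sense_amp;
        int num_banks;
        int num_buses;
        int bus_width_bits;
        int num_external_pins;
        int cache_bytes;            /* processor/cache side */
        int cache_block_bytes;
    };

    /* One configuration among the many to emulate: */
    static const struct iram_config example = {
        512, 1024, 1, 16, 2, 1024, 100, 8192, 2048
    };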
- Layout of logic in an IRAM.
Several of the IRAMs implementing SIMD machines on chip lay out the logic
to match the metal pitch so as to minimize area. One gigabit DRAM
scrambled the blocks so as to minimize the power and interconnect area
to the pins. The somewhat vague goal of this assignment is to explore
the options in laying out a processor and cache so as to minimize area
and power. What is the impact on processor speed of stretching it
across the chip? Look at the 3-level and 4-level metal processes used
in the gigabit chips presented at ISSCC 96 as well as more
conservative designs.
Various thoughts and comments about IRAM and the lectures.
(Seth Copen Goldstein)