Ideas for CS 294 Course Assignments

Programming Assignments

Vector IRAM I recently had a potentially interesting idea. Instead of having a full processor on the IRAM, perhaps an interesting split would be to have a vector unit in the IRAM with an external processor. They would be linked with a queue, with vector instructions and possibly scalar data shipped down the queue to the vector unit. Such a split has several potentially interesting advantages (see Vector IRAM ). Key questions, however, are how much information would need to be shipped back and forth across the processor-IRAM boundary and how often would the processor and IRAM need to be sychornnized. This project would use examine programs developed for the Cray to determine the traffic across this interface. A dynammic instruction mix would likely go a long way towards answering this question.
Gather/Scatter Support for IRAM One area where Cray Research vector supercomputers shine compared to conventional cache-based workstations is applications that take advantage of the gather/scatter hardware. This mechanism allows vectors of data to be loaded or stored using another vector register which contains addresses of the data, giving basically a vector version of indirect addressing. This is fast on Cray Research machines for three reasons:
- the vector gather/scatter hardware
- the highly interleaved main memory of Cray Research machines, which reduces collisions from the same memory bank
- the low latency of the (expensive) SRAM used for main memory
IRAM could offer the first two, and while the latency might be less to main memory than with a traditional DRAM memory, it would not be as fast as SRAM. If the reason was simply the highly interleaved memory, then we might not need the vector operations. The purpose of this assignment is to determine how well such an IRAM would work for such cases. You might write your own microbenchmark to perform gather/scatter to see how well it runs on a cache based machine and then simulate the memory performance for IRAM as well as look at a Blocked Cholesky Sparse Matrix code. (This project was suggested by Kathy Yelick.)
Cache for IRAM This assignment is to propose a more cost effective solution for caches for IRAM than conventional designs. The assumption is that the large miss penalty of these designs drives the memory organization, and that you would use a much simpler cache on an IRAM. For example, a gigabit IRAM can logically fetch 4096B in less than 100 ns while an Alpha fetches 64B in about 250 ns, or a potential hundredfold improvement in miss penalty bandwidth. This assignment is redesign the memory hierarchy if we assume it was implemented as an IRAM, and validated using SPEC92 or SPEC95 programs. One example you could start from is systems based on the 300 MHz Alpha 21164 microprocessor, which uses a three level cache hierarchy plus memory:
- two direct-mapped, write through 8KB caches for instructions and data at the first level, block size is 32B, and latency is 2 clock cycles (6.6 ns);
- a combined 3-way set associative, write back cache also on chip at the second level, block size either is 32B or 64B, and latency is 6 clock cycles for read (20 ns);
- a direct mapped off-chip cache of 4MB (for the AlphaServer 8200), block size is 64B, with a latency of 6 clock cycles for read (20 ns) and 5 for write (16.7 ns);
- DRAM is organized in up to 16 banks, depending on memory size, and transfers over a 256-bit (32B) bus to off-chip cache. The latency is 76 cycles (253 ns) for 64 bytes.
How many levels make sense in an IRAM? What is the capacity and block size at each level?
Caches and Code Size RISC designs sacrifice code size for fast execution by the CPU. More efficient instruction encoding, such as that used by the Java interpreter or the VAX, use more complicated instruction encoding to save code size. As the processor-memory gap goes, compact instructions may increasingly get performance benefits by wasting less time on instruction cache misses. This assignment tries to quantify those benefits. Make assumptions of cache sizes and miss penalty for 1986 and for 1996. Pick a RISC machine and some computer with compact encoding. Use the fast cache simulation schemes to compare performance. How much slower can the compact instruction CPU be and still be as fast the RISC machine in 1986 vs. 1996? (This project was suggested by John Ousterhout.)
Code space for Vector vs. Conventional Designs. One argument for vector processing is that vector instructions use fewer bits to specify instruction-level parallelism than do superscalar designs. This assignment would simply collect data to determine whether or not it is true. You would compare the code size of a few computers with the optimizations specified by their SPEC95 results, which may include statically linked libraries and loop unrolling, to a vector machine such as the Cray. It would be interesting to include the binaries for a x86 machine as well.
Examine Hot-Spots. Use standard UNIX profiling tools to find some of the time consuming code sequences in SPEC95, in the Sites database trace (if instructions are included), or in a commercial database. See if you can find techniques that would make them run better that are a good match to IRAM. Be sure to look at explicit memory management and vector processing, but consider more radical techniques like periodic linearizing of linked lists. Estimate how much faster these hot spots would be, as well as what fraction of the time are spent in these hot spots. Its fine to do these examples by hand.
Calibrate the benefits of code and data compression. Using both standard and novel compression schemes, experimentally determine the benefits of on-the-fly compression. How much benefit do you get from code? from data? What is the overall benefit. One approach might be to periodically cause core dumps and then run the compression on the resulting images.

Literature Search Assignments

History and State-of-the-Art of Logic in Memory Architectures. Include style of machine (uniprocessor, SIMD, MIMD, DSP, ...) as well as performance claims.
History and State-of-the-Art of Performance Optimization for and Evaluation of Real-Time Applications. Given that one of the potential applications of IRAM is embedded computing, and given that IRAMs may offer explicit control of memory management and even interrupts, real-time applications may be a good match to IRAM. Survey how real-time applications are evaluated, what it means to improve performance (e.g., worst case or average case?), real-time benchmarks, and so on. Two new EE faculty, Professors Thomas Henzinger (tah@eecs) and Sharad Malik (sharad@ee.princeton.edu), would be good people to talk to.
Survey of Vector Memory Units Some have suggested that the vector chaining and gather/scatter units are very complex and difficult to designs. Others have disagreed: the suggestion is to look at the patents awarded to Cray Research in this area for hints. This assignment would follow that advice, going to the Patent Office in Silicon Valley to find patents and to get copies, read them, and summarize the key ideas on how to design the memory unit for a vector machine that supports chaining and gather/scatter.

Short Design Assignments

The model for these assignments is relatively short investigations, either being simply idea generation and little evaluation or more serious evaluation of a very simple portion of the problem.

Multichip IRAM solutions. Propose a scheme that would allow programs and data to be larger than one chip. Here are a few places to start:
- Mountain to Mohammed : Assume one processor executes the program, so the processor must stall until the requested instruction or data is fetched from the remote IRAM. Reviewing the prefetching literature might be helpful.
- Mem> Hybrid System : it seems unlikely that generic DRAMs will disappear, not matter how successful IRAMs might be. Hence this design is simply having part of main memory being on-chip, and the rest simply be external DRAMs over a bus. How well would this work? What is the impact of different DRAM interfaces (e.g., Synchronous DRAM or Rambus DRAM)? Would you swap pages from external to internal DRAM, or simply have slower access? How might software or the linker allocate two different speeds of main memory?
- Bit-slice : Assume the processor is capable of operating either as a single whole unit or as a bit-slice. For example, assuming a 64-bit processor, 2 chips would each have 32-bits of the logical processor, 4 chips would each have 16-bits, and so on. You should probably look at the old bit-slice chips from AMD as a reference.
- Continuous State Broadcast : Using the network connections or a bus, try to keep every processor up-to-date by broadcasting the new results from the processor that is active. The active processor is the one that whose memory is being accessed.
- Parallel Processing : By definition, every program and its data is distributed between many chips, and its up to the programmer to coordinate the execution of multiple processors and the necessary communication to operate correctly.

For each scheme considered, do a back-of-the-envelope calculation on the performance of each scheme, and list the pros and cons. Look at the cases where the code is large but the data fits, and vice versa, as well as both the code and data are too large.

4-bit Adder in DRAM vs. Logic process. Use Spice to design a simple 4-bit adder in both a logic process and a DRAM process. Include the relationship of power, size, and area, and speculate what would happen for 64-bit adders.

Four, 4-bit Registers in DRAM vs. Logic process. Like the assignment above, but design four 4-bit registers. Each register must have two read ports and one write port. Speculate on the performance if there were 32 64-bit registers.

Cost Justified IRAM. You could also consider this DRAM with a free processor. One cost of standard DRAMs is testing time. If a very low cost processor was part of every gigabit DRAM, perhaps the processor could be justified simply by the reduction of tests. See if this idea has merit or not, and whether you can find other difficulties in DRAM manufacturing that would justify the cost of the processor even if a customer never used it. Estimate how small the processor would have to be to avoid making yield worse, and how fast it would have to be to significantly reduce testing time. It would also need to match the power limits of a standard DRAM. Can it be done with less than 1% impact on area, power, yield? How fast a processor would you have?

IRAM as a network interface controller. One potentially use of IRAM might be to control networks. They need processing, memory, and serial interfaces to the networks. Examine the processor speed, amount of memory, network interfaces, and cost goals to see if an IRAM might be attractive for several networks.

Reprogrammable Memory. One use of the FPGA on an IRAM is customize a processor to an application. Mark Horowitz suggested that another use might be to tailor the organization of memory, e.g., turn an IRAM into a single chip with five FIFOs for use in a router. The basic idea is with all the capacity in a memory and the limited number of pins on a chip, perhaps being able to "program" the logical width, number of memory modules, and connections between the modules on chips would make for a very attractive component.

An IRAM interface. Propose an interface appropriate for IRAM. It can have many more pins than a DRAM and you should include how the network should interface to the sense amps and an addressing scheme that allows a chip to read or write remote data over these pins. See if some existing RAM packages are appropriate: DRAM, VRAM, Rambus, and so on.

Explicit Memory Management. Propose a scheme that would allow the compiler to explicitly load or store a memory buffer/vector register. Include the instructions that would be needed to perform this control, and estimate how well it would work for some programs. See how close you can be to an existing instruction set.

IRAM Software Prototyping System. Perhaps the investigation of IRAM alternatives would benefit from a prototype that could reconfigure itself to emulate several different IRAM alternatives. Options might include number of cells per sense amp, number of cells per word line, number of I/O lines per sense amp, number of banks, number of buses, width of buses, number of external connections and so on. Ideally there would either be an option or a separate program that could simulate the potential noise and retention problems to confirm that it would work without having to construct an IRAM. Alternatively, small test chips might collect and confirm parameters that could drive or verify the simulation results so that an IRAM chip would not be necessary. In addition to the memory subsystem, you would also need to vary the processor and cache portion of the IRAM. Key to an IRAM prototype is low development cost, ease of change, and speed of execution. One recommendation would be the best way to design a software IRAM emulator. Is it simply writing a C or C++ program to run on a uniprocessor? Are there advantages in being able to run on a multiprocessor? Can you get the benefits of a multiprocessor from a network of workstations? How fast would it run? What might the programming interface be? What is an easy way to run many programs? What kind of measurements would you like to collect from such a system?
(Cedric Krumbein is interested in this.)

IRAM Hardware Prototyping System. Another recommendation would be a hardware prototype. This prototype might consist of Altera programmable logic chips, switch chips, and large amounts of SRAM or DRAM. How fast might it be? How easy would it be to change parameters? How would it run programs? What measurements could it collect? How long would it take to construct? How much would the components cost? How big would it be? How would it connect to computers?

Layout of logic in an IRAM. Several of the IRAMs implementing an SIMD on chip would layout the logic to match the metal pitch so as to minimize area. One gigabit DRAM scrambled the blocks so as to minimize the power and interconnect area to the pins. The somewhat vague goal of this assignment is to explore the options in laying out a processor and cache so as to minimize area and power. What is the impact on processor speed of stretching in across the chip? Look at the 3-level and 4-level metal processes used in the gigabit chips presented at ISSCC 96 as well as more conservative designs.

Applications of IRAM. This assignment would investigate applications for IRAM. One potential application of an IRAM could be as a PostScript processor in laser printers. Printers need lots of RAM and a modestly fast processor. Furthermore, OS issues like virtual memory, multiple processes would not be important for a printer (Suggested by Manuel Fahndrich.) This assignment would find and give back of the envelope evalutations of several applications for IRAM.