- Vector IRAM
I recently had a potentially interesting idea. Instead of having a
full processor on the IRAM, perhaps an interesting split would be
to have a vector unit in the IRAM with an external processor. They
would be linked with a queue, with vector instructions and possibly
scalar data shipped down the queue to the vector unit. Such a split
has several potentially interesting advantages
(see Vector IRAM).
The key questions, however,
are how much information would need to be shipped back and forth
across the processor-IRAM boundary and how often the processor
and IRAM would need to be synchronized. This project would examine
programs developed for the Cray to determine the traffic across this
interface. A dynamic instruction mix would likely go a long way
towards answering these questions.
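As a minimal sketch of how a dynamic instruction mix could be turned into an estimate of queue traffic, the following assumes a hypothetical trace in which each instruction has already been classified into one of four classes. The class names, the 4-byte instruction size, and the 8-byte scalar size are all assumptions made for illustration, not properties of any Cray trace format.

```python
from collections import Counter

# Hypothetical instruction classes for a trace; a real Cray trace
# would need its own opcode-to-class mapping.
VECTOR = "vector"                # vector op shipped down the queue
SCALAR_TO_VEC = "scalar_to_vec"  # vector op that also carries a scalar operand
VEC_TO_SCALAR = "vec_to_scalar"  # vector op whose result returns to the processor
SCALAR = "scalar"                # executes entirely on the external processor


def boundary_traffic(trace, instr_bytes=4, scalar_bytes=8):
    """Summarize processor-IRAM traffic for a list of instruction classes.

    instr_bytes and scalar_bytes are assumed sizes, not measured values.
    """
    mix = Counter(trace)
    shipped = mix[VECTOR] + mix[SCALAR_TO_VEC] + mix[VEC_TO_SCALAR]
    return {
        "mix": dict(mix),
        # Instructions sent down the queue, plus any scalar operands they carry.
        "bytes_down": shipped * instr_bytes + mix[SCALAR_TO_VEC] * scalar_bytes,
        # Scalar results returned to the external processor.
        "bytes_up": mix[VEC_TO_SCALAR] * scalar_bytes,
        # Each value returned to the processor is treated as a sync point.
        "syncs": mix[VEC_TO_SCALAR],
        "vector_fraction": shipped / len(trace) if trace else 0.0,
    }
```

Dividing `bytes_down` and `syncs` by the number of instructions in the trace gives per-instruction rates that can be compared across programs.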
- Gather/Scatter Support for IRAM
One area where Cray Research vector supercomputers shine compared to
conventional cache-based workstations is applications that take
advantage of the gather/scatter hardware. This mechanism allows
vectors of data to be loaded or stored using another
vector register which contains addresses of the data,
giving basically a vector version of indirect addressing.
This is fast on Cray Research machines for three reasons:
- the vector gather/scatter hardware
- the highly interleaved main memory of Cray Research machines,
which reduces collisions on the same memory bank
- the low latency of the (expensive) SRAM used for main memory
IRAM could offer the first two, and while its latency to main memory
might be lower than with a traditional DRAM memory, it would not be as
fast as SRAM.
If the reason were simply the highly interleaved memory, then we might
not need the vector operations.
The purpose of this assignment is to determine how well
such an IRAM would work for such cases. You might write your own
microbenchmark to perform gather/scatter to see how well it runs on
a cache-based machine, then simulate the memory performance for IRAM,
and also look at a Blocked Cholesky Sparse Matrix code.
(This project was suggested by Kathy Yelick.)
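To make the gather/scatter item concrete, here is a small sketch of the two operations together with a deliberately simple bank-conflict model. The model is an assumption made for illustration: one address is issued per cycle, in order, to word-interleaved banks, and a bank stays busy for `bank_busy` cycles after it accepts a request. It does not describe any actual Cray or IRAM memory system.

```python
def gather(memory, indices):
    """Vector gather: result[i] = memory[indices[i]]."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Vector scatter: memory[indices[i]] = values[i], updated in place."""
    for i, v in zip(indices, values):
        memory[i] = v


def gather_cycles(indices, num_banks, bank_busy):
    """Cycles until the last access completes under the assumed bank model."""
    free_at = [0] * num_banks      # cycle at which each bank is next free
    cycle = 0                      # earliest cycle the next address can issue
    for addr in indices:
        bank = addr % num_banks    # word-interleaved bank selection
        start = max(cycle, free_at[bank])  # stall while the bank is busy
        free_at[bank] = start + bank_busy
        cycle = start + 1
    return max(free_at)
```

With 4 banks and a 4-cycle busy time, the conflict-free index vector `[0, 1, 2, 3]` completes in 7 cycles under this model, while `[0, 4, 8, 12]`, which hits the same bank every time, takes 16.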
- Cache for IRAM
This assignment is to propose a more cost-effective solution for
caches for IRAM than conventional designs. The assumption is that
the large miss penalty of these designs drives the memory
organization, and that you would use a much simpler cache on an IRAM.
For example, a gigabit IRAM can logically fetch 4096B in less than 100 ns
while an Alpha fetches 64B in about 250 ns, or a potential hundredfold
improvement in miss penalty bandwidth. This assignment is to redesign the
memory hierarchy assuming it is implemented as an IRAM, and to
validate the design using SPEC92 or SPEC95 programs. One example you could start
with is systems based on the 300 MHz Alpha 21164 microprocessor, which uses a three level cache
hierarchy plus memory:
- two direct-mapped, write through 8KB caches for instructions and data at the first level, block size is 32B, and latency is 2 clock cycles (6.6 ns);
- a combined 3-way set associative, write back cache also on chip at the
second level, block size is either 32B or 64B, and latency is 6 clock cycles for read (20 ns);
- a direct mapped off-chip cache of 4MB (for the AlphaServer
8200), block size is 64B, with a latency of 6 clock cycles for read (20 ns) and 5 for write (16.7 ns);
- DRAM is organized in up to 16 banks, depending on memory size,
and transfers over a 256-bit (32B) bus to off-chip cache. The latency is 76
cycles (253 ns) for 64 bytes.
How many levels make sense in an IRAM? What is the capacity and block
size at each level?
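The miss penalty bandwidth comparison and the cycle-to-nanosecond figures quoted in this item can be checked with a few lines of arithmetic. The function names below are made up for this sketch; the numbers are the ones given in the text.

```python
def bandwidth_bytes_per_ns(block_bytes, latency_ns):
    """Bytes delivered per nanosecond when one block is fetched per miss."""
    return block_bytes / latency_ns


def cycles_to_ns(cycles, clock_mhz=300.0):
    """Convert clock cycles to nanoseconds at the given clock rate."""
    return cycles * 1000.0 / clock_mhz


iram = bandwidth_bytes_per_ns(4096, 100)   # gigabit IRAM: 4096B in 100 ns
alpha = bandwidth_bytes_per_ns(64, 250)    # Alpha: 64B in about 250 ns
ratio = iram / alpha                       # about 160

dram_latency_ns = cycles_to_ns(76)         # about 253 ns at 300 MHz
```

Because the IRAM fetch is stated as taking less than 100 ns, the ratio of about 160 is a lower bound, which is consistent with the "hundredfold" claim.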
- Caches and Code Size
RISC designs sacrifice code size for fast execution by the CPU.
More compact instruction sets, such as those used by the Java
interpreter or the VAX, use more complicated instruction encoding to
save code size. As the processor-memory gap grows, compact instructions
may increasingly gain performance by wasting less time on
instruction cache misses. This assignment tries to quantify those benefits.
Make assumptions about cache sizes and miss penalties for 1986 and for 1996.
Pick a RISC
machine and some computer with compact encoding. Use the fast cache
simulation schemes to compare performance. How much slower can the
compact instruction CPU be and still be as fast as the RISC machine in
1986 vs. 1996?
(This project was suggested by John Ousterhout.)
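One way to frame the break-even question is with a very simple timing model, offered here only as a sketch. Assume the RISC machine spends `exec_cycles` executing a program plus `miss_penalty` cycles for each instruction cache miss, and that the compact-encoding machine takes `s` times as many execution cycles but suffers fewer misses. Setting the two times equal gives the largest tolerable `s`. All of the numbers in the usage note are hypothetical, chosen only to show the shape of the result.

```python
def break_even_slowdown(exec_cycles, risc_misses, compact_misses, miss_penalty):
    """Largest execution slowdown s at which the compact machine still ties.

    Model (an assumption of this sketch):
        risc_time    = exec_cycles     + risc_misses    * miss_penalty
        compact_time = s * exec_cycles + compact_misses * miss_penalty
    Solving risc_time == compact_time for s gives the expression below.
    """
    saved_stall_cycles = (risc_misses - compact_misses) * miss_penalty
    return 1.0 + saved_stall_cycles / exec_cycles
```

For a hypothetical program with 1,000,000 execution cycles, 20,000 RISC misses, and 10,000 compact-code misses, a 10-cycle miss penalty gives a break-even slowdown of 1.1, while a 100-cycle penalty gives 2.0. Real miss counts would come from the cache simulations the assignment calls for.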
- Code space for Vector vs. Conventional Designs.
One argument for vector processing is that vector instructions use
fewer bits to specify instruction-level parallelism than do
superscalar designs. This assignment would simply collect data to
determine whether or not it is true. You would compare the code size
of a few computers with the optimizations specified by their SPEC95
results, which may include statically linked libraries and loop
unrolling, to a vector machine such as the Cray. It would be
interesting to include the binaries for an x86 machine as well.
- Examine Hot-Spots.
Use standard UNIX profiling tools to find some of the time consuming
code sequences in SPEC95, in the Sites database trace (if instructions are
included), or in a commercial database.
See if you can find techniques, well matched to IRAM, that would make
them run better. Be sure to look at explicit
memory management and vector processing, but also consider more radical
techniques like periodic linearizing of linked lists. Estimate how much faster
these hot spots would be, as well as what fraction of the time is
spent in these hot spots. It's fine to do these examples by hand.
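One of the more radical techniques mentioned here, periodic linearizing of linked lists, could be modeled as follows. The sketch assumes a list stored in a node pool, with parallel arrays of values and successor indices and -1 marking the end of the list; this representation is an assumption chosen to make memory layout explicit, not something taken from a particular program.

```python
def linearize(values, nxt, head):
    """Re-lay out a pool-based linked list in traversal order.

    values[i] is the payload of node i, nxt[i] is the index of its
    successor (or -1 at the end), and head is the index of the first node.
    Returns (values, nxt, head) for an equivalent list whose nodes are
    contiguous, so node k is followed by node k + 1. Nodes not reachable
    from head are dropped.
    """
    new_values = []
    node = head
    while node != -1:
        new_values.append(values[node])
        node = nxt[node]
    n = len(new_values)
    new_next = [i + 1 for i in range(n)]
    if n:
        new_next[-1] = -1
    return new_values, new_next, (0 if n else -1)
```

After linearizing, a traversal touches consecutive locations, which is the access pattern that wide IRAM fetches or vector loads could exploit.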
- Calibrate the benefits of code and data compression.
Using both standard and novel compression schemes, experimentally determine
the benefits of on-the-fly compression. How much benefit do you get
from code? From data? What is the overall benefit?
One approach might be to
periodically cause core dumps and then run the compression on the
Literature Search Assignments
- History and State-of-the-Art of Logic in Memory Architectures.
Include style of machine (uniprocessor, SIMD, MIMD, DSP, ...) as
well as performance claims.
- History and State-of-the-Art of Performance Optimization for
and Evaluation of Real-Time Applications.
Given that one of the potential applications of IRAM is embedded
computing, and given that IRAMs may offer explicit control of
memory management and even interrupts, real-time applications
may be a good match to IRAM. Survey how real-time applications
are evaluated, what it means to improve performance (e.g., worst
case or average case?), real-time benchmarks, and so on. Two new
EE faculty, Professors Thomas Henzinger (tah@eecs) and Sharad Malik (email@example.com), would be good people
to talk to.
- Survey of Vector Memory Units
Some have suggested that the vector chaining and gather/scatter
units are very complex and difficult to design. Others have
disagreed: the suggestion is to look at the patents awarded to
Cray Research in this area for hints. This assignment would follow
that advice, going to the Patent Office in Silicon Valley to find
patents and to get copies, read them, and summarize the key ideas
on how to design the memory unit for a vector machine that supports
chaining and gather/scatter.
Short Design Assignments
The model for these assignments is relatively short investigations, either
simply idea generation with little evaluation or a more serious evaluation
of a very simple portion of the problem.
For each scheme considered, do a back-of-the-envelope calculation of
its performance, and list the pros and cons. Look at
the cases where the code is large but the data fits, and vice versa,
as well as the case where both the code and data are too large.
4-bit Adder in DRAM vs. Logic process.
Use Spice to design a simple 4-bit adder in both a logic process and
a DRAM process. Include the relationship of power, size, and
area, and speculate what would happen for 64-bit adders.
Four, 4-bit Registers in DRAM vs. Logic process.
Like the assignment above, but design four 4-bit registers. Each
register must have two read ports and one write port. Speculate on the
performance if there were 32 64-bit registers.
Cost Justified IRAM.
You could also think of this as a DRAM with a free processor.
One cost of standard DRAMs is testing time. If a very low cost
processor were part of every gigabit DRAM, perhaps the processor
could be justified simply by the reduction in testing. See if this
idea has merit or not, and whether you can find other difficulties
in DRAM manufacturing that would justify the cost of the processor
even if a customer never used it. Estimate how small the processor
would have to be to avoid making yield worse, and how fast it would
have to be to significantly reduce testing time. It would also need to
match the power limits of a standard DRAM. Can it be done with less
than 1% impact on area, power, yield? How fast a processor would you
IRAM as a network interface controller.
One potential use of IRAM might be to control networks. They need
processing, memory, and serial interfaces to the networks. Examine the
processor speed, amount of memory, network interfaces, and cost goals
to see if an IRAM might be attractive for several networks.
One use of the FPGA on an IRAM is to customize a processor to an application.
Mark Horowitz suggested that another use might be to tailor the
organization of memory, e.g., turn an IRAM into a single chip with
five FIFOs for use in a router. The basic idea is that, with all the
capacity in a memory and the limited number of pins on a chip, perhaps
being able to "program" the logical width, number of memory modules,
and connections between the modules on chips would make for a very
An IRAM interface.
Propose an interface appropriate for
IRAM. It can have many more pins than a DRAM and you should include
how the network should interface to the sense amps and an addressing
scheme that allows a chip to read or write remote data over these
pins. See if some existing RAM packages are appropriate: DRAM, VRAM,
Rambus, and so on.
Explicit Memory Management.
Propose a scheme that would allow the compiler to explicitly load
or store a memory buffer/vector register. Include the instructions
that would be needed to perform this control, and estimate how well
it would work for some programs. See how close you can be to an
existing instruction set.
IRAM Software Prototyping System.
Perhaps the investigation of IRAM alternatives would benefit
from a prototype that could reconfigure itself to
emulate several different IRAM alternatives. Options might include
number of cells per sense amp, number of cells per word line,
number of I/O lines per sense amp, number of banks, number of buses,
width of buses, number of external connections and so on.
Ideally there would either be an option or a separate program that
could simulate the potential noise and retention problems to confirm
that it would work without having to construct an IRAM.
Alternatively, small test chips might collect and confirm parameters
that could drive or verify the simulation results so that an IRAM
chip would not be necessary.
In addition to the memory subsystem, you would also need to vary the processor
and cache portion of the IRAM.
Key to an IRAM prototype is low development cost, ease of change, and
speed of execution.
One outcome would be a recommendation on the best way to design
a software IRAM emulator. Is it simply writing a C or C++ program
to run on a uniprocessor? Are there advantages in being able to
run on a multiprocessor? Can you get the benefits of a multiprocessor
from a network of workstations? How fast would it run? What might the
programming interface be? What is an easy way to run many programs?
What kind of measurements would you like to collect from such a
- Multichip IRAM solutions.
Propose a scheme that would allow programs and data to be larger than
one chip. Here are a few places to start:
- Mountain to Mohammed : Assume one processor executes the program,
so the processor must stall until the requested instruction or
data is fetched from the remote IRAM. Reviewing the prefetching
literature might be helpful.
- Hybrid System : it seems unlikely that generic DRAMs
will disappear, no matter how successful IRAMs might be. Hence this
design simply has part of main memory on-chip, with the
rest in external DRAMs over a bus. How well would this work?
What is the impact of different DRAM interfaces (e.g., Synchronous DRAM or
Rambus DRAM)? Would you swap pages from external to internal DRAM, or
simply have slower access? How might software or the linker allocate
two different speeds of main memory?
- Bit-slice : Assume the processor is capable of operating either as
a single whole unit or as a bit-slice. For example, assuming a 64-bit
processor, 2 chips would each have 32-bits of the logical processor, 4
chips would each have 16-bits, and so on. You should probably look at
the old bit-slice chips from AMD as a reference.
- Continuous State Broadcast : Using the network connections or a
bus, try to keep every processor up-to-date by broadcasting the new
results from the processor that is active. The active processor is the
one whose memory is being accessed.
- Parallel Processing : By definition, every program and
its data are distributed among many chips, and it's up to the
programmer to coordinate the execution of multiple processors and the
necessary communication to operate correctly.
(Cedric Krumbein is interested in this.)
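For the Bit-slice option above, a small model shows what has to cross chip boundaries. The sketch below splits a 64-bit add across equal-width slices, with the carry out of each slice standing in for the inter-chip signal. It assumes the width is evenly divisible by the number of slices and is not modeled on any particular AMD part.

```python
def sliced_add(a, b, slices=2, width=64):
    """Add two width-bit values using `slices` equal bit-slices.

    Each loop iteration models one chip: it adds its own slice of the
    operands plus the carry from the chip below, then passes a carry up.
    Returns (sum modulo 2**width, final carry out).
    """
    slice_bits = width // slices
    mask = (1 << slice_bits) - 1
    carry = 0
    result = 0
    for k in range(slices):
        shift = k * slice_bits
        partial = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (partial & mask) << shift
        carry = partial >> slice_bits   # the signal that crosses chips
    return result, carry
```

The serial dependence on `carry` is what would make the inter-chip latency a central cost of this organization for arithmetic.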
IRAM Hardware Prototyping System.
Another recommendation would be a hardware prototype. This prototype
might consist of Altera programmable logic chips, switch chips, and
large amounts of SRAM or DRAM. How fast might it be? How easy would it be to
change parameters? How would it run programs?
What measurements could it collect? How long would it take to construct?
How much would the components cost? How big would it be? How would it
connect to computers?
Layout of logic in an IRAM.
Several of the IRAMs implementing a SIMD on chip would lay out the logic
to match the metal pitch so as to minimize area. One gigabit DRAM
scrambled the blocks so as to minimize the power and interconnect area
to the pins. The somewhat vague goal of this assignment is to explore
the options in laying out a processor and cache so as to minimize area
and power. What is the impact on processor speed of stretching it
across the chip? Look at the 3-level and 4-level metal processes used
in the gigabit chips presented at ISSCC 96 as well as more
Applications of IRAM.
This assignment would investigate applications for IRAM.
One potential application of an IRAM could be as a PostScript
processor in laser printers. Printers need lots of RAM and a
modestly fast processor. Furthermore, OS issues like virtual memory and
multiple processes would not be important for a printer. (Suggested by
Manuel Fahndrich.) This assignment would find and give back-of-the-envelope
evaluations of several applications for IRAM.