Course Assignments
CS 294-4 - Intelligent RAM (IRAM)
Assignment Overview
These assignments are to be done in two weeks by 1 or 2 people. It is
fine if an assignment is done by more than one group or person, but
given the wealth of important topics, we probably don't want three or
more groups working on the same assignment.
The results should be placed on the WWW. My idea is that those who have
taken an assignment will give me a dummy URL at sign-up time, and the
results will then be updated over the two weeks so that people can see
what is being learned as it happens. The initial report would just say
that the page is under construction.
Needless to say, the purpose of this class is to explore a potentially
exciting new frontier, thus it makes no sense to hold back. Each group will
present their results at a class meeting.
My general idea is to have assignments in three areas:
- integrated circuits
- computer architecture
- software (compilers, operating systems, and possibly applications)
and have three types of assignments:
- programming
- literature search
- short designs
My hope is that each of you will pick an assignment according to your skills:
e.g., if you are good at programming/UNIX and interested in architecture, you
will do a programming assignment in architecture, such as determining how
well a cache with very large blocks would work on some of the SPEC95 programs.
On the other hand, if your skills were more in circuits, you might try
to find all papers regarding logic in a DRAM process, coming up with a
bibliography and summarizing the results in a table, both on the WWW.
The model is that the first two assignments would either be literature
search or programming, depending on your area and skills, and everyone
does a short design project as the final assignment.
All three assignments should form a solid foundation on which to do the
final projects.
Although it's fun to program, it usually saves time to use programs that
others have created. Two useful sets of resources are:
Below are examples of assignments. I am very interested in suggestions,
from anyone, on what would be good things to work on.
Programming Assignments
- IRAM vs. conventional caches on database/OS trace.
I have a CD containing Dick Sites' trace of the Microsoft SQL server running
on the Windows NT system for a DEC Alpha computer
[Sit96].
The first step would be to recreate the results he claimed in his paper and
his course at Berkeley. The second step would be to see how well IRAM would
work for his workload. You would start from a straightforward cache design,
just with very wide blocks that are loaded quickly, e.g., from 1024 bits to
16384 bits in 50 ns, and vary the number of Sense Amps/Buffers; a sketch of
such a simulator follows this item. The question
is whether the benefits (if any) come from reuse, spatial locality, or simply
the wide bandwidth. It's possible that there is no benefit, as each access
might look like a random 32-bit load or store.
If IRAM does have a performance advantage, estimate how much slower an
IRAM processor could be and still be as fast as the Alpha with a
conventional memory system.
(Windsor Hsu and Min Zhou)
(Also, Remzi Arpaci is doing a similar but independent assignment.)
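As a concrete starting point, the wide-block cache can be simulated in a
few dozen lines of C. Here is a minimal sketch; the trace format (one hex
address per line on stdin) and the block and buffer counts are assumptions,
to be replaced by the actual Sites trace format and the parameters you want
to sweep:

    /* Minimal direct-mapped cache with very wide blocks, driven by an
     * address trace.  Trace format and parameters are assumptions. */
    #include <stdio.h>

    #define BLOCK_BITS 11   /* log2(block bytes): 2048B = 16384 bits */
    #define NUM_BLOCKS 256  /* e.g., one block per sense-amp buffer */

    static unsigned long tag[NUM_BLOCKS];
    static int valid[NUM_BLOCKS];

    int main(void)
    {
        unsigned long addr, hits = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {
            unsigned long block = addr >> BLOCK_BITS;
            unsigned long index = block % NUM_BLOCKS;
            if (valid[index] && tag[index] == block)
                hits++;
            else {              /* miss: the wide block loads in ~50 ns */
                misses++;
                tag[index] = block;
                valid[index] = 1;
            }
        }
        if (hits + misses > 0)
            printf("hits %lu  misses %lu  miss rate %.4f\n",
                   hits, misses, (double)misses / (hits + misses));
        return 0;
    }

Sweeping BLOCK_BITS and NUM_BLOCKS, and comparing against a conventional
configuration, would help separate the reuse, spatial-locality, and
bandwidth effects.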
- Vector vs. Superscalar/conventional Cache on SPEC95.
It's possible that vector computing [Joh78] would be a very good match
to IRAM. SPEC 95 includes a set of integer programs in C and floating point
programs in Fortran, and you would expect that some of the floating point
programs would do very well on a Cray. I expect to have the SPEC95 CD this
week. This project would run programs with
and without vectorization, and report the results. Included in the paper
would be a comparison with some superscalar RISC machines on each program
(whose results can probably be found via the Computer Architecture Home Page),
so that we can see where vector works well and where it works poorly.
I can probably get you an account on a Cray if you don't already have access
to one. Since there are many SPEC95 programs, it probably makes sense just to
do a subset, so several groups could take a crack at this. If time permits,
it would be interesting to see why the results are good for vector vs.
superscalar/cache. See if you can characterize which SPEC programs are a
good match (e.g., tomcatv) versus a poor match (e.g., gcc) for IRAMs; the
loop fragments after this item illustrate the difference.
(To the best of my knowledge, I've never seen SPEC ratings on Cray Research
computers, so this would be a first.)
(Cedric Krumbein and Richard P. Martin)
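For intuition about the good match/poor match question, compare a
tomcatv-style loop with a gcc-style pointer chase. These are illustrative
C fragments I made up, not actual SPEC95 code:

    /* Illustrative fragments, not actual SPEC95 code.  The first loop
     * has independent iterations over long arrays, so a vectorizing
     * compiler maps it onto vector instructions; the second serializes
     * on a load whose address depends on the previous load. */
    #define N 1000

    void vectorizable(double x[N], double y[N], double a)
    {
        for (int i = 0; i < N; i++)     /* tomcatv-like: vectorizes */
            x[i] = x[i] + a * y[i];
    }

    struct node { struct node *next; int val; };

    int pointer_chase(struct node *p)
    {
        int sum = 0;
        while (p != NULL) {             /* gcc-like: does not vectorize */
            sum += p->val;
            p = p->next;
        }
        return sum;
    }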
- Instruction/Data Correlation
Instead of moving data to the processor, move the processor to the data.
For a given piece of data, what is the size of the code that accesses it,
and what other pieces of data does that code access?
(Trevor Pering)
- Gather/Scatter Support for IRAM
One area where Cray Research vector supercomputers shine compared to
conventional cache-based workstations is applications that take
advantage of the gather/scatter hardware. This mechanism allows
vectors of data to be loaded or stored using another
vector register which contains addresses of the data,
giving basically a vector version of indirect addressing.
This is fast on Cray Research machines for three reasons:
- the vector gather/scatter hardware
- the highly interleaved main memory of Cray Research machines,
which reduces collisions from the same memory bank
- the low latency of the (expensive) SRAM used for main memory
IRAM could offer the first two, and while an IRAM's latency to its
on-chip main memory should be lower than that of a traditional off-chip
DRAM system, it would not be as fast as SRAM.
If the reason was simply the highly interleaved memory, then we might
not need the vector operations.
The purpose of this assignment is to determine how well
such an IRAM would work in such cases. You might write your own
microbenchmark to perform gather/scatter (a sketch follows this item),
see how well it runs on a cache-based machine, and then simulate the
memory performance for IRAM; also look at a Blocked Cholesky Sparse
Matrix code.
(This project was suggested by Kathy Yelick.)
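A minimal gather microbenchmark might look like the C sketch below; the
array size and the random index stream are arbitrary choices, and a real
study would also time the scatter direction (x[idx[i]] = y[i]) and vary
the index distribution:

    /* Gather microbenchmark sketch: y[i] = x[idx[i]].  On a Cray the
     * inner loop maps onto vector gather hardware; on a cache-based
     * machine each random index is likely a miss once x outgrows the
     * cache.  Sizes and index distribution are arbitrary choices. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)     /* arbitrary: 1M elements (8MB per array) */

    int main(void)
    {
        static double x[N], y[N];
        static long idx[N];
        long i;
        clock_t start, stop;

        for (i = 0; i < N; i++) {
            x[i] = (double)i;
            idx[i] = rand() % N;        /* random gather indices */
        }

        start = clock();
        for (i = 0; i < N; i++)         /* the gather itself */
            y[i] = x[idx[i]];
        stop = clock();

        printf("gather of %d elements: %.3f s (y[0] = %g)\n", N,
               (double)(stop - start) / CLOCKS_PER_SEC, y[0]);
        return 0;
    }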
- Cache for IRAM
This assignment is to propose a more cost-effective solution for
caches for IRAM than conventional designs. The assumption is that
the large miss penalty of these designs drives the memory
organization, and that you would use a much simpler cache on an IRAM.
For example, a gigabit IRAM can logically fetch 4096B in less than 100 ns
while an Alpha fetches 64B in about 250 ns, a potential hundredfold
improvement in miss-penalty bandwidth. This assignment is to redesign the
memory hierarchy assuming it is implemented in an IRAM, and to validate
the design using SPEC92 or SPEC95 programs. One example you could start
from is
systems based on the 300 MHz Alpha 21164 microprocessor, which uses a
three-level cache hierarchy plus memory:
- two direct-mapped, write-through 8KB caches for instructions and data
at the first level; block size is 32B, and latency is 2 clock cycles
(6.6 ns);
- a combined 3-way set-associative, write-back cache also on chip at the
second level; block size is either 32B or 64B, and read latency is 6
clock cycles (20 ns);
- a direct-mapped off-chip cache of 4MB (for the AlphaServer 8200);
block size is 64B, with a latency of 6 clock cycles for read (20 ns)
and 5 for write (16.7 ns);
- DRAM is organized in up to 16 banks, depending on memory size,
and transfers over a 256-bit (32B) bus to off-chip cache. The latency is 76
cycles (253 ns) for 64 bytes.
How many levels make sense in an IRAM? What is the capacity and block
size at each level? (A back-of-the-envelope calculation from these
numbers follows this item.)
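The latencies above plug directly into the usual recursion, AMAT = hit
time + miss rate x (next-level access time), which is easy to script when
comparing hierarchies. A C sketch; the miss rates are placeholders to be
replaced by values measured from SPEC92/SPEC95 simulation, and the IRAM
figures come from the text above:

    /* Back-of-the-envelope average memory access time (AMAT) for the
     * 21164-style hierarchy described above vs. a simpler IRAM
     * hierarchy.  All miss rates are PLACEHOLDERS. */
    #include <stdio.h>

    int main(void)
    {
        double l1 = 6.6, l2 = 20.0, l3 = 20.0, mem = 253.0;  /* ns */
        double m1 = 0.05, m2 = 0.20, m3 = 0.10;  /* placeholder miss rates */

        double amat = l1 + m1 * (l2 + m2 * (l3 + m3 * mem));
        printf("Alpha-style AMAT: %.1f ns\n", amat);

        /* IRAM alternative: one small cache backed by on-chip DRAM that
         * delivers a 4096B block in under 100 ns (figure from the text);
         * the lower miss rate is a placeholder for the effect of very
         * wide blocks. */
        double il1 = 6.6, im1 = 0.01, imem = 100.0;
        printf("IRAM-style AMAT:  %.1f ns\n", il1 + im1 * imem);
        return 0;
    }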
- Caches and Code Size
RISC designs sacrifice code size for fast execution by the CPU.
More compact encodings, such as those used by the Java interpreter or
the VAX, accept more complicated instruction decoding to save code size.
As the processor-memory gap grows, compact instructions may increasingly
gain performance by wasting less time on instruction cache misses. This
assignment tries to quantify those benefits.
Make assumptions of cache sizes and miss penalty for 1986 and for 1996.
Pick a RISC
machine and some computer with compact encoding. Use the fast cache
simulation schemes to compare performance. How much slower can the
compact-instruction CPU be and still be as fast as the RISC machine in
1986 vs. 1996? (A sketch of the break-even arithmetic follows this item.)
(This project was suggested by John Ousterhout.)
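One way to frame the question is as break-even arithmetic: if denser code
cuts instruction-cache misses, how much longer can the compact machine's
cycle time be at equal work per instruction? A C sketch in which every
number is a placeholder to be replaced by measurement:

    /* Break-even cycle-time arithmetic for a compact-encoding CPU vs. a
     * RISC CPU, assuming equal instruction counts and base CPI.  All
     * numbers are placeholders to be replaced by measurement. */
    #include <stdio.h>

    static void break_even(double penalty, const char *era)
    {
        double cpi_exec     = 1.5;   /* base CPI, assumed equal */
        double risc_miss    = 0.05;  /* I-cache misses/instruction, RISC */
        double compact_miss = 0.02;  /* denser code -> fewer misses */

        double risc_cpi    = cpi_exec + risc_miss * penalty;
        double compact_cpi = cpi_exec + compact_miss * penalty;
        /* equal time per instruction => compact clock may be slower by: */
        printf("%s: compact CPU may clock %.0f%% slower\n",
               era, 100.0 * (risc_cpi / compact_cpi - 1.0));
    }

    int main(void)
    {
        break_even(5.0,  "1986");    /* placeholder miss penalty, cycles */
        break_even(50.0, "1996");
        return 0;
    }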
- Code space for Vector vs. Conventional Designs.
One argument for vector processing is that vector instructions use
fewer bits to specify instruction-level parallelism than do
superscalar designs. This assignment would simply collect data to
determine whether or not it is true. You would compare the code size
of a few computers with the optimizations specified by their SPEC95
results, which may include statically linked libraries and loop
unrolling, to a vector machine such as the Cray. It would be
interesting to include the binaries for an x86 machine as well.
- Examine Hot-Spots.
Use standard UNIX profiling tools to find some of the time consuming
code sequences in SPEC95, in the Sites database trace (if instructions are
included), or in a commercial database.
See if you can find techniques, well matched to IRAM, that would make
them run better. Be sure to look at explicit memory management and vector
processing, but consider more radical techniques like periodically
linearizing linked lists (sketched after this item). Estimate how much
faster these hot spots would be, as well as what fraction of the time is
spent in them. It's fine to do these examples by hand.
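As one example of the radical end, here is a hedged C sketch of
linearizing a linked list: the nodes are copied into traversal order so
that later walks become sequential and can stream from wide DRAM rows.
The node layout is hypothetical:

    /* Periodic linked-list linearization: copy nodes into a contiguous
     * array in traversal order so subsequent walks have perfect spatial
     * locality.  The node layout is hypothetical. */
    #include <stdlib.h>

    struct node { struct node *next; int val; };

    struct node *linearize(struct node *head, long n)
    {
        struct node *flat;
        long i = 0;

        if (head == NULL || n <= 0)
            return head;
        flat = malloc(n * sizeof *flat);
        if (flat == NULL)
            return head;            /* out of memory: keep the old list */
        for (struct node *p = head; p != NULL && i < n; p = p->next, i++) {
            flat[i].val  = p->val;
            flat[i].next = &flat[i + 1];
        }
        flat[i - 1].next = NULL;    /* terminate the copied list */
        return flat;                /* caller frees the old nodes */
    }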
- Calibrate the benefits of code and data compression.
Using both standard and novel compression schemes, experimentally
determine the benefits of on-the-fly compression. How much benefit do you
get from code? From data? What is the overall benefit? One approach might
be to periodically cause core dumps and then run the compression on the
resulting images; a zlib-based sketch follows this item.
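For the standard-scheme end of the spectrum, zlib's compress() gives a
quick calibration of how compressible a memory image is. A sketch,
assuming the dump image is an ordinary file named on the command line:

    /* Measure how compressible a memory image is with zlib.
     * Usage: ./calib dumpfile   (the file name is an assumption). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(int argc, char **argv)
    {
        FILE *f;
        long n;
        unsigned char *src, *dst;
        uLongf dstlen;

        if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
            return 1;
        fseek(f, 0, SEEK_END); n = ftell(f); rewind(f);

        src = malloc(n);
        dstlen = n + n / 1000 + 64;     /* zlib's worst-case bound */
        dst = malloc(dstlen);
        if (!src || !dst || fread(src, 1, n, f) != (size_t)n)
            return 1;

        if (compress(dst, &dstlen, src, n) != Z_OK)
            return 1;
        printf("%ld -> %lu bytes (%.1f%% of original)\n",
               n, (unsigned long)dstlen, 100.0 * dstlen / n);
        return 0;
    }

Separating the text, data, and stack segments of the image would split
the benefit between code and data.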
Literature Search Assignments
Using the following resources:
- WWW search engines (such as Inktomi or Alta Vista);
- the University of California on-line library database, Melvyl;
- good old chasing of references in papers and going to the library;
find on-line and regular references to architecture studies that describe
proposed or real computers where the processing is next to the memory.
Your task is to summarize this work by including the proper citation,
a short summary of the claimed results along with the pros and cons,
and a single on-line table that summarizes all of the projects. Items to
include would be year, style of machine (uniprocessor, SIMD, MIMD, DSP, ...),
application area, performance claims, status of hardware (if any), citation,
and so on. Especially important will be finding on-going projects!
- History and State-of-the-Art of Logic in Memory Chips.
Similar to the assignment above, find examples of proposals or chips that
are dominated by memory but include logic on the chip. In addition to
the categories mentioned above,
see if you can find estimates of the logic's size, power, and
speed if it is a DRAM chip, or the memory's size, power, and speed if it
is a logic chip. Also list as many process parameters as are available.
(Bruce McGaughy and Xinhui Niu)
(Also, Lloyd Y. Huang is doing a similar but independent assignment.)
- History and State-of-the-Art of Compiler Controlled Memory Hierarchy.
Similar to the assignments above, review the effectiveness of
compilers in improving the performance of memory accesses by explicit
optimization of memory transfers. Examples should include vector
architectures, cache-based optimizations, and anything else along these
lines.
In addition to the categories mentioned above,
summarize the types of programs for which the optimizations work well and
those for which they work poorly.
(Joe Darcy and Manuel Fahndrich)
- Program Size and Page Fault Optimization Survey.
This is a survey report on optimizations for program space, as well
as historical work on reducing page-faults in programs. Since IRAM
will be limited to the memory of a single DRAM, code and data space
are important considerations.
(Nick Weaver)
- History and State-of-the-Art of Circuit and Architecture in DRAM Chips.
This survey will summarize various DRAM designs from the perspective of
their circuit techniques and architecture, in order to reveal the
potential and the limitations in launching an IRAM chip. We will evaluate
claimed performance results and circuit-design techniques, and weigh the
pros and cons of matching DRAMs with different types of systems. Based on
the survey we want to extract some plausible strategies, especially for
the physical design of the memory part of an IRAM, in terms of area,
power, timing, noise, etc.
(John Deng and Hui Zhang)
- DRAM Architecture Tradeoffs.
DRAM designs are typically optimized for operation with the traditional
RAS/CAS memory interface. In an IRAM processor, the processor and memory
reside together on a single die, so the DRAM does not need to deliver its
data off-chip. Hence many design choices exist as to how to interface the
DRAM to the processor(s). I will survey the impact of several factors,
such as block size, column decoding, and address decoding, on the overall
performance of the DRAM. An accurate characterization of the DRAM will
enable sound architectural decisions to be made as to how best to
interface the memory to the IRAM processor core.
(James Young)
- History and State-of-the-Art of DRAM Testability Issues.
A major portion of the cost of a DRAM is testing time. It may be possible
to utilize the processing power present in an IRAM to reduce this cost.
However, the additional complexity of testing the processor logic could
also increase this cost. Before delving into the issues of how an IRAM
might affect DRAM test costs, it is useful to first understand the
history and state of the art of DRAM testability, including both
traditional testing and more novel techniques such as Built-In Self-Test
(BIST). It would also be helpful to investigate how existing chips that
merge logic into a DRAM process perform testing of the logic. This
project is a literature search to explore these areas.
(Rich Fromm)
- History and State-of-the-Art of Digital Signal Processors and Memory Bandwidth.
DSP designers have also been pursuing the concept of
DSP processors built on-board a DRAM chip. This literature
survey will provide a brief history of conventional memory
accesses in DSP applications, and will then focus on recent
industry developments in overcoming the memory bandwidth
issue.
(Heather Bowers)
- History and State-of-the-Art of Code and Data Compression.
Given the speed of off-chip accesses, it may be important to reduce the
size of instructions and data so as to fit more on chip.
Look at the papers from the 1970s on instruction set encoding, as well
as Huffman encoding. See if you can find any schemes that looked at
using less space for data. Also, review standard compression
technology to see if there were any schemes that might let you use
standard instruction sets and data but decompress on-the-fly from
memory and compress on-the-fly to memory. Also include a survey of
instruction set support for direct memory management in existing
architectures.
(Craig Teuscher)
- History and State-of-the-Art of Logic in Memory Architectures.
Include style of machine (uniprocessor, SIMD, MIMD, DSP, ...) as
well as performance claims.
- History and State-of-the-Art of Performance Optimization for
and Evaluation of Real-Time Applications.
Given that one of the potential applications of IRAM is embedded
computing, and given that IRAMs may offer explicit control of
memory management and even interrupts, real-time applications
may be a good match to IRAM. Survey how real-time applications
are evaluated, what it means to improve performance (e.g., worst
case or average case?), real-time benchmarks, and so on. Two new
EE faculty, Professors Thomas Henzinger (tah@eecs) and Sharad Malik (sharad@ee.princeton.edu), would be good people
to talk to.
- History and State-of-the-Art of Power Optimization for
Processors and Memory.
Given that power may be one of the constraints of an IRAM, it will
be useful to learn the prior work on conserving power in processors,
memory, and circuits. This assignment performs such a survey.
Short Design Assignments
The model for these assignments is a relatively short investigation,
either simple idea generation with little evaluation, or more serious
evaluation of a very simple portion of the problem.
- Multichip IRAM solutions.
Propose a scheme that would allow programs and data to be larger than
one chip. Here are a few places to start:
- Mountain to Mohammed: Assume one processor executes the program,
so the processor must stall until the requested instruction or
data is fetched from the remote IRAM. Reviewing the prefetching
literature might be helpful.
- Hybrid System: It seems unlikely that generic DRAMs will disappear,
no matter how successful IRAMs might be. Hence this design simply has
part of main memory on chip, with the rest in external DRAMs over a bus.
How well would this work? What is the impact of different DRAM interfaces
(e.g., Synchronous DRAM or Rambus DRAM)? Would you swap pages from
external to internal DRAM, or simply have slower access? How might
software or the linker allocate two different speeds of main memory?
(A hypothetical allocator interface is sketched at the end of this item.)
- Bit-slice: Assume the processor is capable of operating either as
a single whole unit or as a bit slice. For example, assuming a 64-bit
processor, 2 chips would each have 32 bits of the logical processor, 4
chips would each have 16 bits, and so on. You should probably look at
the old bit-slice chips from AMD as a reference.
- Continuous State Broadcast: Using the network connections or a
bus, try to keep every processor up to date by broadcasting the new
results from the processor that is active. The active processor is the
one whose memory is being accessed.
- Parallel Processing: By definition, every program and
its data are distributed across many chips, and it's up to the
programmer to coordinate the execution of multiple processors and the
necessary communication to operate correctly.
For each scheme considered, do a back-of-the-envelope calculation of its
performance, and list the pros and cons. Look at the cases where the code
is large but the data fits, and vice versa, as well as where both the
code and data are too large.
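For the Hybrid System option above, here is a hypothetical C sketch of
what a two-speed memory allocation interface might look like; every name
in it is invented for illustration:

    /* Hypothetical interface for allocating from two speeds of main
     * memory in a hybrid IRAM + external-DRAM system.  All names are
     * invented; a real system would carve MEM_ONCHIP allocations out
     * of a reserved fast physical range.  Here both pools fall back to
     * malloc so the sketch compiles. */
    #include <stddef.h>
    #include <stdlib.h>

    enum mem_pool { MEM_ONCHIP, MEM_OFFCHIP };

    void *pool_alloc(enum mem_pool pool, size_t bytes)
    {
        (void)pool;                 /* placeholder: ignore the pool */
        return malloc(bytes);
    }

    /* The compiler, linker, or programmer steers hot structures on
     * chip and bulk data off chip: */
    void example(void)
    {
        double *hot  = pool_alloc(MEM_ONCHIP,  4096 * sizeof(double));
        char   *bulk = pool_alloc(MEM_OFFCHIP, 1 << 24);
        (void)hot; (void)bulk;
    }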
- 4-bit Adder in DRAM vs. Logic process.
Use Spice to design a simple 4-bit adder in both a logic process and
a DRAM process. Include the relationship of power, area, and speed,
and speculate what would happen for 64-bit adders.
- Four 4-bit Registers in DRAM vs. Logic process.
Like the assignment above, but design four 4-bit registers. Each
register must have two read ports and one write port. Speculate on the
performance if there were 32 64-bit registers.
- Cost Justified IRAM.
(You could also consider this a DRAM with a free processor.)
One cost of standard DRAMs is testing time. If a very low cost
processor was part of every gigabit DRAM, perhaps the processor
could be justified simply by the reduction of tests. See if this
idea has merit or not, and whether you can find other difficulties
in DRAM manufacturing that would justify the cost of the processor
even if a customer never used it. Estimate how small the processor
would have to be to avoid making yield worse, and how fast it would
have to be to significantly reduce testing time. It would also need to
match the power limits of a standard DRAM. Can it be done with less
than 1% impact on area, power, yield? How fast a processor would you
have?
- IRAM as a network interface controller.
One potential use of IRAM might be to control networks. Network interface
controllers need processing, memory, and serial interfaces to the
networks. Examine the
processor speed, amount of memory, network interfaces, and cost goals
to see if an IRAM might be attractive for several networks.
- Reprogrammable Memory.
One use of the FPGA on an IRAM is to customize a processor to an application.
Mark Horowitz suggested that another use might be to tailor the
organization of memory, e.g., turn an IRAM into a single chip with
five FIFOs for use in a router. The basic idea is that with all the
capacity in a memory and the limited number of pins on a chip, perhaps
being able to "program" the logical width, number of memory modules,
and connections between the modules on chips would make for a very
attractive component.
- An IRAM interface.
Propose an interface appropriate for
IRAM. It can have many more pins than a DRAM and you should include
how the network should interface to the sense amps and an addressing
scheme that allows a chip to read or write remote data over these
pins. See if some existing RAM packages are appropriate: DRAM, VRAM,
Rambus, and so on.
- Explicit Memory Management.
Propose a scheme that would allow the compiler to explicitly load
or store a memory buffer/vector register. Include the instructions
that would be needed to perform this control, and estimate how well
it would work for some programs. See how close you can stay to an
existing instruction set. (A hypothetical flavor of such an interface is
sketched after this item.)
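To show the flavor of such an interface (not a real ISA), here is a
hypothetical sketch of compiler-controlled buffer load/store written as C
functions standing in for the new instructions; all names are invented:

    /* Hypothetical compiler-controlled buffer load/store, with C
     * functions standing in for instructions.  All names are invented;
     * the assignment is to map such operations onto an existing ISA. */
    #include <stddef.h>
    #include <string.h>

    #define BUF_BYTES 2048              /* say, one wide DRAM row */
    static char onchip_buf[BUF_BYTES];  /* stands in for the buffer */

    /* "load.buffer": fill the buffer from memory starting at addr */
    static void buf_load(const void *addr, size_t n)
    {
        memcpy(onchip_buf, addr, n < BUF_BYTES ? n : BUF_BYTES);
    }

    /* "store.buffer": write the buffer back to memory at addr */
    static void buf_store(void *addr, size_t n)
    {
        memcpy(addr, onchip_buf, n < BUF_BYTES ? n : BUF_BYTES);
    }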
- IRAM Prototyping System.
Perhaps the investigation of IRAM alternatives would benefit
from a prototype that could reconfigure itself to
emulate several different IRAM alternatives. Options might include
the number of cells per sense amp, number of cells per word line,
number of I/O lines per sense amp, number of banks, number of buses,
width of buses, number of external connections, and so on.
In addition to the memory subsystem, you would also need to vary the
processor and cache portion of the IRAM.
Key to an IRAM prototype are low development cost, ease of change, and
speed of execution.
One deliverable would be a recommendation on the best way to design
a software IRAM emulator. Is it simply writing a C or C++ program
to run on a uniprocessor? Are there advantages in being able to
run on a multiprocessor? Can you get the benefits of a multiprocessor
from a network of workstations? How fast would it run? What might the
programming interface be? What is an easy way to run many programs?
What kind of measurements would you like to collect from such a
system?
Another recommendation would be a hardware prototype. This prototype
might consist of Altera programmable logic chips, switch chips, and
large amounts of SRAM or DRAM. How fast might it be? How easy would it be to
change parameters? How would it run programs?
What measurements could it collect? How long would it take to construct?
How much would the components cost? How big would it be? How would it
connect to computers? (Either way, a sketch of the configuration record
the emulator might be built around follows this item.)
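The parameter list above suggests the core data structure. A C sketch of
the configuration record a software emulator might be built around; the
field names and example values are guesses:

    /* Configuration record an IRAM emulator might sweep over.  Field
     * names and example values are guesses based on the list above. */
    struct iram_config {
        int cells_per_sense_amp;
        int cells_per_word_line;
        int io_lines_per_sense_amp;
        int num_banks;
        int num_buses;
        int bus_width_bits;
        int num_external_pins;
        int cache_bytes;            /* processor/cache side */
        int cache_block_bytes;
    };

    /* One configuration among the many to emulate: */
    static const struct iram_config example = {
        512, 1024, 1, 16, 2, 1024, 100, 8192, 2048
    };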
- Layout of logic in an IRAM.
Several of the IRAMs implementing SIMD machines on chip lay out the logic
to match the metal pitch so as to minimize area. One gigabit DRAM
scrambled the blocks so as to minimize the power and interconnect area
to the pins. The somewhat vague goal of this assignment is to explore
the options in laying out a processor and cache so as to minimize area
and power. What is the impact on processor speed of stretching it
across the chip? Look at the 3-level and 4-level metal processes used
in the gigabit chips presented at ISSCC 96 as well as more
conservative designs.
Various thoughts and comments about IRAM and the lectures.
(Seth Copen Goldstein)