IRAM Software Prototyping System.
Perhaps the investigation of IRAM alternatives would benefit
from a prototype that could reconfigure itself to
emulate several different IRAM alternatives. Options might include
number of cells per sense amp, number of cells per word line,
number of I/O lines per sense amp, number of banks, number of buses,
width of buses, number of external connections and so on.
Ideally there would be either an option or a separate program that
could simulate the potential noise and retention problems, to confirm
that a given configuration would work without having to construct an IRAM.
Alternatively, small test chips might collect and confirm parameters
that could drive or verify the simulation results so that an IRAM
chip would not be necessary.
In addition to the memory subsystem, you would also need to vary the
processor and cache portion of the IRAM.
Key to an IRAM prototype is low development cost, ease of change, and
speed of execution.
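
As a minimal sketch of what "ease of change" could mean in a software prototype, the options listed above can be held in one configuration record and enumerated as a sweep. The class name, field names, and sample values below are illustrative assumptions, not part of any existing IRAM tool.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class IramConfig:
    """One hypothetical IRAM design point (all field names are assumptions)."""
    cells_per_sense_amp: int
    cells_per_word_line: int
    io_lines_per_sense_amp: int
    banks: int
    buses: int
    bus_width_bits: int

    def peak_bus_bytes_per_cycle(self) -> int:
        # Upper bound only: assumes every bus moves a full-width word each cycle.
        return self.buses * self.bus_width_bits // 8


def sweep(**options):
    """Yield one IramConfig per combination of the supplied option lists."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield IramConfig(**dict(zip(names, values)))


# Sample values are placeholders chosen only to show the mechanics.
configs = list(sweep(
    cells_per_sense_amp=[256, 512],
    cells_per_word_line=[1024],
    io_lines_per_sense_amp=[1],
    banks=[4, 16],
    buses=[1, 2],
    bus_width_bits=[256],
))

for cfg in configs:
    print(cfg, cfg.peak_bus_bytes_per_cycle())
```

Because each design point is an immutable record, independent points could be handed to separate processes or workstations, which bears on the multiprocessor questions raised in this project.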
One outcome would be a recommendation on the best way to design
a software IRAM emulator. Is it simply writing a C or C++ program
to run on a uniprocessor? Are there advantages in being able to
run on a multiprocessor? Can you get the benefits of a multiprocessor
from a network of workstations? How fast would it run? What might the
programming interface be? What is an easy way to run many programs?
What kind of measurements would you like to collect from such a
DRAM Architecture for Vector IRAM.
Putting a vector unit in a DRAM will dramatically lower the cost of meeting
the high bandwidth requirements between the vector unit and main memory. In this
assignment I will first give a brief overview of the memory hierarchy design
techniques used in some modern vector machines. Then, based on a survey of
DRAM architectures, I will propose a suitable DRAM architecture for a
Vector IRAM system.
(Hui T. Zhang)
Finding Cache Hotspots in Spec95 Programs.
Although frequently criticized, the Spec benchmark suite is widely
used to measure computer performance and aid system design. We will
use the WARTS toolkit to collect and analyze data about the cache
behavior of Spec95 programs to guide the design of the IRAM memory
hierarchy.
(Joseph D. Darcy and Manuel Fähndrich)
Cache for IRAM
This assignment is to propose a more cost effective solution for
caches for IRAM than conventional designs. The assumption is that
the large miss penalty of these designs drives the memory
organization, and that you would use a much simpler cache on an IRAM.
For example, a gigabit IRAM can logically fetch 4096B in less than 100 ns
while an Alpha fetches 64B in about 250 ns, or a potential hundredfold
improvement in miss penalty bandwidth. This assignment is to redesign the
memory hierarchy assuming it is implemented as an IRAM, and to validate
the design using SPEC92 or SPEC95 programs. How many levels make sense in
an IRAM? What are the capacity and block size at each level? One starting
point is a system based on the 300 MHz Alpha 21164 microprocessor, which
uses a three-level cache hierarchy plus memory:
- two direct-mapped, write through 8KB caches for instructions and data
at the first level, block size is 32B, and latency is 2 clock cycles (6.7 ns);
- a combined 3-way set associative, write back 96KB cache also on chip at the
second level, block size is either 32B or 64B, and latency is 6 clock
cycles for read (20 ns);
- a direct mapped off-chip cache of 4MB (for the AlphaServer 8200),
block size is 64B, with a latency of 6 clock cycles for read (20 ns) and
5 for write (16.7 ns);
- DRAM is organized in up to 16 banks, depending on memory size, and
transfers over a 256-bit (32B) bus to off-chip cache. The latency is 76
cycles (253 ns) for 64 bytes.
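
The figures above can be checked with a few lines of arithmetic. The sketch below converts the quoted cycle counts to nanoseconds at 300 MHz and compares miss-penalty bandwidth for the two cases in the text. It takes 100 ns for the IRAM fetch, which the text gives only as an upper bound, so the resulting ratio is a lower bound. The function names are illustrative.

```python
CLOCK_HZ = 300e6  # 300 MHz Alpha 21164, as stated in the text


def cycles_to_ns(cycles):
    """Convert processor clock cycles to nanoseconds."""
    return cycles / CLOCK_HZ * 1e9


def bandwidth_gb_per_s(nbytes, latency_ns):
    """Bytes delivered per nanosecond, which is numerically GB/s."""
    return nbytes / latency_ns


# Latencies quoted in the list above, in cycles.
for name, cycles in [("L1", 2), ("L2 read", 6), ("L3 write", 5), ("DRAM", 76)]:
    print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")

alpha_bw = bandwidth_gb_per_s(64, cycles_to_ns(76))  # 64B per DRAM access
iram_bw = bandwidth_gb_per_s(4096, 100.0)            # 4096B in (at most) 100 ns
ratio = iram_bw / alpha_bw

print(f"Alpha: {alpha_bw:.2f} GB/s, IRAM: {iram_bw:.2f} GB/s, ratio: {ratio:.0f}x")
```

With these inputs the ratio comes out at roughly 160, which is consistent with the "hundredfold" estimate.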
(Windsor Hsu and Min Zhou)
In class, it was suggested that linpack performance may be correlated to
achievable peak memory bandwidth. What other benchmarks exhibit this
behavior? The benchmarks I will survey include: SPEC92 (the well-known
CPU benchmark), Laddis (a file system benchmark), and TPC (a database
benchmark). Tune in and find out!
Prefetching on an IRAM
Two of the major criticisms of conventional cache prefetching are that it
wastes bandwidth and that it leads to cache pollution (where the prefetched
data displaces data that really is needed and will therefore result in a
subsequent cache miss). The first criticism is not particularly valid in an
IRAM context, where there is substantial on-chip bandwidth that we are looking
to exploit. The second criticism may or may not be valid, depending on how
the caching strategy is implemented. This project looks at the possible
consequences of prefetching on an IRAM. A design is studied in which
the sense amps act as one level of cache. There may also be another cache
level, a smaller conventional SRAM cache. In a 2-level cache scenario, if the
prefetching is limited to the sense amp (second level) cache, it is
hypothesized that the problem of cache pollution will be less severe, partly
because the second level cache will already have a relatively high miss rate
compared to the first level cache. If prefetching into the L2 cache does
not invalidate any L1 cache blocks, this may also reduce the potential cache
pollution problem. This proposal requires relaxing the
principle of cache inclusion, since the L1 cache will not always be a
subset of the L2 cache. However, it is not apparent that there are any
significant drawbacks to eliminating strict inclusion. [This project began as
a literature search on prefetching, so a number of paper summaries are also
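
The two-level arrangement described in this abstract can be illustrated with a toy simulation. This is a sketch under strong assumptions that the abstract does not state: both levels are fully associative with LRU replacement, both use the same block size, and the prefetcher simply fetches the next sequential block into L2 on an L2 miss. L2 fills and evictions never touch L1, so inclusion is not enforced.

```python
from collections import OrderedDict


class LruCache:
    """Fully associative LRU cache that tracks block numbers only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def lookup(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            return True
        return False

    def fill(self, block):
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used


def simulate(trace, l1_blocks, l2_blocks, prefetch):
    """Return (l1_misses, l2_misses) for a trace of block numbers."""
    l1, l2 = LruCache(l1_blocks), LruCache(l2_blocks)
    l1_misses = l2_misses = 0
    for block in trace:
        if l1.lookup(block):
            continue
        l1_misses += 1
        if not l2.lookup(block):
            l2_misses += 1
            l2.fill(block)
            if prefetch:
                # Prefetch goes into L2 only; L1 contents are left alone.
                l2.fill(block + 1)
        l1.fill(block)
    return l1_misses, l2_misses


trace = list(range(64))  # a purely sequential sweep, the best case for this prefetcher
print(simulate(trace, l1_blocks=4, l2_blocks=8, prefetch=False))
print(simulate(trace, l1_blocks=4, l2_blocks=8, prefetch=True))
```

On a sequential sweep the prefetcher halves the L2 misses while leaving the L1 miss count unchanged. Traces with reuse would be needed to observe any pollution effect.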
Low Power Design Techniques for Memory and Microprocessors.
Since power may be one of the constraints in the design of an IRAM, a
literature survey was performed of past low power design techniques for both
memory and microprocessors. A few design options were also explored.
(Heather Bowers and Trevor Pering)
NUMA for IRAM.
Memory access latencies may be nonuniform within a single IRAM chip. If
one wishes to build a multiprocessor on an IRAM die, it may make sense to
take a NUMA approach to the design. A number of NUMA papers are surveyed,
and their significance to IRAM is examined. A NUMA multiprocessor is one
that exposes the different memory latencies to the programmer or operating
system. The NUMA problem is that of managing the migration and replication
of pages among the processors. Page placement policies enhance performance
significantly compared to naive page placement. A parameterized policy can
be easily adapted to work well on different architectures. These policies
work well when there is a large latency difference, little network
contention, and fast block transfers. Whether an IRAM multiprocessor has
these characteristics depends on technology as well as design.
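
To make "parameterized policy" concrete, here is a minimal sketch of one possible page migration rule: first-touch placement, then migration once a node has made a threshold number of remote accesses to a page. The class name, the single `migrate_threshold` parameter, and the rule itself are assumptions for illustration; they are not taken from the surveyed papers, and replication is not modeled.

```python
from collections import defaultdict


class PagePlacer:
    """Toy page-migration policy with a single tunable threshold (hypothetical)."""

    def __init__(self, migrate_threshold):
        self.migrate_threshold = migrate_threshold
        self.home = {}  # page -> node that currently holds the page
        self.remote_counts = defaultdict(lambda: defaultdict(int))
        self.migrations = 0

    def access(self, node, page):
        """Record an access by `node`; return True if it was local."""
        home = self.home.setdefault(page, node)  # first-touch placement
        if home == node:
            return True
        counts = self.remote_counts[page]
        counts[node] += 1
        if counts[node] >= self.migrate_threshold:
            # Move the page to the node that keeps asking for it.
            self.home[page] = node
            counts.clear()
            self.migrations += 1
        return False


placer = PagePlacer(migrate_threshold=3)
placer.access(0, page=7)                      # node 0 touches the page first
remote = [placer.access(1, page=7) for _ in range(3)]
print(remote, placer.home[7], placer.migrations)
```

Raising the threshold suits machines where migration is expensive relative to a remote access; lowering it suits a large latency gap with fast block transfers, the regime the text identifies as favorable.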
Survey of Vector Memory Units
Some have suggested that the vector chaining and gather/scatter units are
very complex and difficult to design. Others have disagreed, suggesting
a look at the patents awarded to Cray Research in this area for hints.
This assignment would follow that advice, going to the Patent Office in
Silicon Valley to find patents and to get copies, read them, and summarize
the key ideas on how to design the memory unit for a vector machine that
supports chaining and gather/scatter.
(Lloyd Y. Huang)
A proposal for a prefetching icache and compiler optimization for an
Potential IRAM designs have huge bandwidth between cache and memory,
which reduces the penalty due to an incorrect prefetch. Similarly,
since there are multiple memory banks, they can be prefetched
independently. But since there are few banks, care must be taken to
prevent bank conflicts. An icache design is postulated, which naively
prefetches instructions at some fixed lookahead from the current point
of execution, in addition to prefetching the return point of a
function through an explicit instruction. A compiler optimization
needed to prevent bank conflicts is also proposed. Then, real
programs (notably SIOD and several of the SPEC benchmarks) are
analyzed to see if the compiler optimization can be used in practice.
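
The bank-conflict concern with fixed-lookahead prefetching can be shown with a small check. The sketch assumes block-interleaved banks, where the bank index is the block number modulo the number of banks. The abstract does not specify the mapping or the compiler optimization, so this only shows how a conflict might be detected, not how the proposed optimization works.

```python
def bank_of(addr, block_bytes, nbanks):
    """Bank index under an assumed block-interleaved mapping."""
    return (addr // block_bytes) % nbanks


def lookahead_conflicts(pc_trace, lookahead_blocks, block_bytes, nbanks):
    """Count fetches whose prefetch target falls in the same bank as the fetch."""
    conflicts = 0
    for pc in pc_trace:
        target = pc + lookahead_blocks * block_bytes
        if bank_of(pc, block_bytes, nbanks) == bank_of(target, block_bytes, nbanks):
            conflicts += 1
    return conflicts


# Straight-line code: one fetch per 32-byte block, four banks (sample values).
trace = [i * 32 for i in range(16)]
for lookahead in (3, 4):
    n = lookahead_conflicts(trace, lookahead, block_bytes=32, nbanks=4)
    print(f"lookahead={lookahead}: {n} conflicts out of {len(trace)}")
```

Under this mapping a lookahead that is a multiple of the bank count conflicts on every fetch, while any other lookahead never conflicts on straight-line code. Branches and function returns break that regularity, which is where code placement would matter.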
A Survey of Embedded Processors
Embedded computing is one application for IRAM which has been
discussed many times. This project investigates the possibility of using
IRAM for embedded applications. Several types of embedded systems are
compared based on performance, cost, power, and reliability.
A summary of the suitability of IRAM for use in embedded systems is
IRAM Yield Consideration: Some Thoughts from IBM CMOS
Since IRAM combines a DRAM process with a logic process,
any yield investigation or prediction must build on those for DRAM
and logic. However, the industry is inching toward the day when there
may be two standardized manufacturing process flows, one for CMOS and
one for DRAMs. Also, factory, process and cost modeling efforts have
helped push the development of a single industry process, really two
processes, to accommodate the difference in the way logic and memory
devices are fabricated. In this project, I will review modern
yield analysis for DRAM and logic, and point out the key technology in
the realization of IRAM. It is hard to discuss and compare different
technologies, so in this paper I will concentrate on IBM CMOS
technology, which includes the evolution of both DRAM (4Mb, 16Mb, 64Mb
and 256Mb) and logic (CMOS 5L, CMOS 5S, CMOS 5X). It is interesting to note
that design migration from one process to another already exists at IBM.
The material is organized as follows:
- IBM CMOS technology.
- CMOS logic.
- Yield consideration about IBM technology.
- How about yield of IRAM?
- Conclusion and future work.
Low Power Design Techniques and their Relation to IRAM
One possible application of IRAM technology is in low power systems.
Low power techniques are explored from the system level down to the
circuit level. It would appear that IRAM is suitable for low power in a
number of ways. IRAM eliminates much of the off-chip capacitance by
integrating the memory on-chip. This not only eliminates all of the
off-chip address and data lines, but also eliminates extra chips used
for memory and bus management. Another interesting advantage of IRAM is
its lower memory latency. The processor is kept waiting for fewer cycles during
a memory miss, thus wasting less power driving the clock while the processor
is idle. In the conclusion, an argument for IRAM in low power applications
is made based on the findings in the study.
(Bruce McGaughy)