Power and Performance Analysis of IRAM
A set of models for analyzing performance and energy efficiency in various memory architectures
is presented. The models take into account key parameters such as cache miss rates, instruction
mix, bus sizes, memory and cache latencies, and number of bits accessed. There are separate
models for SRAM and DRAM, and the models incorporate many of the key technology-dependent
parameters. Tradeoffs are explored in terms of power, performance, and area for IRAM vs.
traditional off-chip DRAM approaches, as well as for DRAM and SRAM second-level caches. It was
found that IRAM has significant advantages in both power and performance, and may be
indispensable for matching processor speeds of 750 MHz and higher. Based on our findings and
the fact that DRAM yields slower transistors than optimized microprocessor technologies, we
suggest that IRAM may find its first application in low power designs.
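The style of analytical model the abstract describes can be sketched as a toy average-access-time and energy calculation. Every number below (hit rate, latencies, per-access energies) is an invented illustrative value, not one of the paper's calibrated parameters:

```python
# A minimal sketch of an analytical memory model in the spirit of the report.
# All parameter values are made up for illustration only.
def avg_access_time(hit_rate, hit_ns, miss_ns):
    """Average memory access time for a single-level cache model."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

def avg_access_energy(hit_rate, hit_nj, miss_nj):
    """Average energy per access; each miss pays the full memory-access cost."""
    return hit_rate * hit_nj + (1.0 - hit_rate) * miss_nj

# Conventional system: off-chip DRAM misses are slow and must drive pads.
t_off = avg_access_time(0.95, 2.0, 120.0)
e_off = avg_access_energy(0.95, 0.5, 20.0)

# IRAM: on-chip DRAM shortens the miss path and avoids off-chip drivers.
t_iram = avg_access_time(0.95, 2.0, 25.0)
e_iram = avg_access_energy(0.95, 0.5, 4.0)

print(round(t_off, 2), round(t_iram, 2))    # 7.9 3.15
print(round(e_off, 3), round(e_iram, 3))    # 1.475 0.675
```

Even with identical hit rates, shrinking the miss latency and miss energy dominates the averages, which is the intuition behind the power and performance advantages claimed above.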
MPEG Encoding on a Vector Processor
A number of emerging applications have the potential for wide public use except for their
enormous computational demands. The MPEG video compression standard is one of these.
At the same time, vector processing is a long-established technique that has been proven
useful in tackling many large computational problems. This report investigates the viability of
using vector processing for MPEG encoding. It describes an implementation of a portion of
the MPEG encoding process on a vector processor and the performance benefits obtained.
Utilizing the on-chip IRAM bandwidth:
Operating system use of memory block copy and clear operations
An IRAM architecture offers the potential for a tremendous amount of on-chip
bandwidth. An important question is how to best utilize this bandwidth to
achieve improved performance. One possible use of this bandwidth is in
memory block copy and clear operations. An IRAM has the potential to
move large blocks of data quickly from one region of memory to another,
and to clear a large block of memory just as quickly.
This project studies the use of block copy and clear operations by the
operating system in an attempt to quantify the usefulness of IRAM
architectural support for memory block copy and clear operations. The
frequency of occurrences of block copy and clear operations, the size of the
operations, and the relative alignment of the addresses for copy operations
are measured and analyzed.
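The quantities being measured can be sketched with a toy classifier; the (src, dst, length) tuples below are invented examples, not real operating-system trace data:

```python
# A hedged sketch of the study's measurements: classifying block-copy
# operations by size and by the relative alignment of source and destination.
def relative_alignment(src, dst, line_bytes=16):
    """Copies whose addresses differ by a multiple of the line width can be
    moved one wide line at a time by IRAM block-copy hardware."""
    return (dst - src) % line_bytes

copies = [(0x1000, 0x2000, 4096),   # line-aligned relative to each other
          (0x1004, 0x3000, 512),    # misaligned by 12 bytes
          (0x0500, 0x0510, 64)]     # line-aligned

aligned = [c for c in copies if relative_alignment(c[0], c[1]) == 0]
total_bytes = sum(length for _, _, length in copies)

print(len(aligned), total_bytes)   # 2 4672
```

Copies whose source and destination share the same offset within a wide line are the ones an IRAM can accelerate most directly, which is why relative alignment is worth measuring.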
IRAM for Real-Time Data Compression
Real-time data compression is an important application to consider for
IRAM, as the video telephony industry moves out of the research realm and
into the commercial market. Compression can be a memory intensive
operation, and IRAM could provide a single chip solution that is actually
lower in power due to the reduced need to drive off-chip buffers. This
paper investigates the JPEG compression algorithm and floorplans a JPEG
encoder using DRAM for buffers and scratchpad operations. Issues relating
to MPEG are also discussed, as well as compression techniques that do not
require a custom chip solution, such as general-purpose DSPs.
Building Cache Efficient Trees Using Long Cache Lines
The goal of IRAM is to improve overall system performance by putting the
processor as close to main memory as possible: on the same chip. Such an
organization allows the potential for memory access to have both much lower latency and
tremendously greater bandwidth. But even an IRAM may require some sort of cache to
ensure the processor can be fed data and instructions quickly enough. The internal
structure of a gigabit DRAM suggests a natural cache structure: place a row of latches
as wide as a bank in front of each bank of DRAM modules to form a direct-mapped cache.
Higher associativities can be realized by using more sets of latches. Such a structure was
proposed for the design of an IRAM-like chip. If that cache organization were adopted
for IRAM, cache lines would be 1024 bits wide, four times longer than current cache lines.
While long cache lines can decrease the cache miss rate, they can also
increase the required bandwidth by several orders of magnitude. Although
bandwidth demands are not necessarily a concern for an IRAM, reducing resource
requirements is always beneficial.
(Joseph D. Darcy)
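One way to see the effect of the 1024-bit line is to size a search-tree node to fill exactly one line. The 4-byte key and child-pointer widths below are assumptions for illustration, not figures from the study:

```python
# A hedged sketch of sizing a search-tree node to one 1024-bit (128-byte)
# cache line, assuming 4-byte keys and 4-byte child pointers.
import math

LINE_BYTES = 128          # 1024-bit cache line
KEY_BYTES = PTR_BYTES = 4

# A node holding k keys needs k + 1 child pointers:
# k * KEY_BYTES + (k + 1) * PTR_BYTES <= LINE_BYTES.
fanout_keys = (LINE_BYTES - PTR_BYTES) // (KEY_BYTES + PTR_BYTES)

def height(n, keys_per_node):
    """Levels needed for a complete tree indexing n keys."""
    return math.ceil(math.log(n + 1, keys_per_node + 1))

print(fanout_keys)                  # 15 keys fit in one line-sized node
print(height(10**6, fanout_keys))   # 5 levels for a million keys
print(height(10**6, 3))             # 10 levels with a narrow 4-ary node
```

Filling a whole line per node halves the number of line fetches per lookup relative to a narrow node, which is the sense in which long lines can reduce the miss rate for tree traversals.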
DRAM in a Logic Process
The potential of IRAM rests, to some extent, on the amount of memory available on-chip.
The proposed processor-memory hybrid is to be composed mostly of DRAM cells, hence the
name Intelligent Dynamic Random Access Memory. This project investigates the density,
speed, and power of DRAMs fabricated in a logic process, specifically MOSIS's scalable
CMOS 0.6 um process, versus those made in proprietary DRAM processes.
A single bank of memory is laid out with its associated circuitry. This bank is fully
analyzed and an analysis of its placement in a 4Mbit DRAM is given.
(Lloyd Y. Huang)
Instruction and data traces of workloads drive the design of the next generation CPU and
memory systems. Unfortunately, traces often do not include kernel code. Furthermore, traces
of small programs, including many applications found in the SPEC benchmark suite, do not
show system behavior under real workloads. Orienting designs on such traces may therefore be misleading.
In the context of the IRAM project at Berkeley, we decided to overcome the SPEC trace
dilemma and obtain meaningful traces of ``real world'' applications, such as web browsers
and personal productivity tools. To this end, we are putting an instrumentation framework
into place that allows tracing of applications running on the Solaris 2.4 operating system.
Our methodology for collecting the traces is to extend EEL (Executable Editing Library),
a library for building tools that analyze and modify executable programs and object
files. EEL allows code to be inserted at arbitrary points in a program, enabling other tools
that insert code to gather instruction and data address traces for on-the-fly simulations or
buffering for later analysis. One of the main strengths of EEL is that since it can edit
executables, it requires no access to source code; thus, EEL can instrument any program.
As of this week, we have an instrumented version of Solaris 2.4 up and running. The
instrumentation is currently limited to tracing of entries to kernel functions. We plan on
extending the tool to gather full instruction and data traces in the near future.
Vectorizing a Hash-Join
A vector instruction set is a well known method for exposing bandwidth
to applications. Although extensively studied in the scientific
programming community, scant work exists on vectorizing other kinds of
applications. This work examines vectorizing a traditional database
operation, a Grace hash-join. We show how to vectorize both the hash
and join phases of the algorithm, and present performance results on a
Cray C90 as well as on traditional microprocessors. We conclude that
vector scatter-gather and compress operations are essential both to this
algorithm and to other non-scientific codes.
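The vectorizable structure can be seen in a scalar sketch. Real vector hardware would gather hash-bucket entries for a whole block of probe keys and then compress away the non-matching ones; the toy relations and plain loops below are illustrative stand-ins, not the paper's Grace partitioning:

```python
# A hedged scalar sketch of a hash-join; the (key, payload) relations are
# invented, and the loops stand in for vector gather and compress steps.
def hash_join(build, probe):
    buckets = {}
    # Build phase: hash each key of the smaller relation into a bucket.
    for key, payload in build:
        buckets.setdefault(key, []).append(payload)
    out = []
    # Probe phase: vector hardware would gather buckets for a block of probe
    # keys at once, then compress away the keys with no matching entry.
    for key, payload in probe:
        for match in buckets.get(key, []):
            out.append((key, match, payload))
    return out

r = [(1, 'a'), (2, 'b')]
s = [(2, 'x'), (3, 'y'), (1, 'z')]
print(hash_join(r, s))   # [(2, 'b', 'x'), (1, 'a', 'z')]
```

The inner filtering step is where scatter-gather (indexing buckets by hashed key) and compress (discarding probes with empty buckets) would do the work on a vector machine.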
Multithreading & Multiprocessing in IRAM
This report studies the
effectiveness of different basic processor architectures in an IRAM
environment. Moving logic onto a DRAM process dramatically changes the
cost-effectiveness of processing components (caches, processor cores,
register files, etc.), which in turn affects the cost-effectiveness of various
processor architectures. In this report, graphs are made of price vs.
performance for several different architectures, where price is measured
in terms of die area. The overall conclusion is that none of the standard
systems studied (single-threaded, multithreaded, and multiprocessing
architectures) is more, or even equally, cost-effective in IRAM than in a
traditional split-logic process. This conclusion, although negative, is helpful in
limiting and directing the search for IRAM architectures to ones that will
be cost-effective. Such solutions will have to exploit benefits of IRAM, such as
extremely wide internal data buses, that are not exploited by the
studied architectures. Examples of such architectural solutions are
vector computing and multimedia processing.
The SPEC benchmark programs have become very popular targets for
evaluating architectural ideas. Recently, an increasing number of
researchers have pointed out that the behavior of the SPEC benchmark
programs is not representative of real workloads. In particular, the
memory reference patterns of the SPEC programs are relatively simple
and their working sets are relatively small. Using the SPEC programs
alone to evaluate IRAM ideas will result in the erroneous conclusion
that current cache designs work very well at hiding the memory
bottleneck and that there is little performance advantage in employing
an IRAM solution. Thus getting representative traces of real workloads
is a crucial first step in IRAM design and research. In this assignment,
we set out to collect a trace of the memory behavior of a database
management system (DBMS). Postgres is a natural choice because it is a
public domain DBMS and has none of the profiling and tracing restrictions
that are attached to commercial DBMSs. However, a major concern with
using a public domain DBMS is whether its behavior is representative of
commercial DBMSs. In this assignment, we compare some characteristics of
our Postgres trace to published numbers obtained on commercial systems.
A Novel DRAM Hierarchy Architecture for IRAM
This paper proposes a new separate bit-line
DRAM hierarchy architecture, which will include
a 1 Gbit main memory and a 4 MByte level-2 cache. Relying on
state-of-the-art technology, I estimate that the access times
of the main memory and level-2 cache will be 23 ns and 9 ns
respectively. The active power consumption for a level-2
cache hit is five times lower than for a main memory access.
The level-2 cache and its associated TAG controller will
only take 2.4% of the DRAM area.
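The 9 ns and 23 ns access times and the 5x power ratio above combine into average-case figures once a hit rate is chosen; the 90% level-2 hit rate below is an assumed value added for illustration, not a number from the paper:

```python
# Latencies and the power ratio are from the abstract; the hit rate is an
# assumption for illustration.
L2_NS, MEM_NS = 9.0, 23.0
POWER_RATIO = 5.0          # main-memory access vs. level-2 cache hit
HIT_RATE = 0.90            # assumed, not from the paper

avg_ns = HIT_RATE * L2_NS + (1.0 - HIT_RATE) * MEM_NS
avg_power = HIT_RATE * 1.0 + (1.0 - HIT_RATE) * POWER_RATIO

print(round(avg_ns, 2))     # 10.4 ns average access time
print(round(avg_power, 2))  # 1.4x the power of an all-hit access stream
```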
Considerations in Memory Design for IRAM
IRAM, or intelligent DRAM, is a new approach to computer design, involving the integration
of processor logic and main memory together on a single chip. An IRAM can reap maximum
benefits from this approach only by exposing the physical organization of the DRAM to the architecture.
A parameterized Spice deck is used to simulate cell access latency in a 64 Mb DRAM, and
the results are used to draw some suggestions as to how the memory/processor interface
should be designed for IRAM. The possibility of performing fast memory operations within the memory itself is also considered.
An Investigation of a Wide Bus in a DRAM Architecture
In recent years, there has been an increasing gap between processor speed
and memory bandwidth. One proposed solution is integrating both DRAM
and a microprocessor on a single die, which drastically improves
memory bandwidth. Most of the focus has been on increasing bandwidth
and reducing the distance between the cache and main memory. This paper
considers the benefits of increasing bandwidth between the register
file and the L1 cache by allowing aligned blocks of 8 or more
registers to be transferred as a single operation. This load occurs
through a bitmask, so only some memory words are affected. In
addition, an operation that allows a block of registers to be
rotated is proposed, allowing the programmer to relax alignment
restrictions. Several uses for such a mechanism are proposed and
explored, including faster function calls, improved garbage collection,
faster memory-to-memory copies, and simple vector processing.
(Nicholas C. Weaver)
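The proposed masked block load and block rotate can be simulated in software. The 8-word block size comes from the abstract; the function names and the word-array model of registers and memory are assumptions for illustration:

```python
# A hedged simulation of the proposed operations. Registers and memory are
# modeled as plain word arrays; names and the 8-word block are illustrative.
def masked_block_load(regs, start, memory, addr, mask):
    """Copy memory[addr+i] into regs[start+i] only where bit i of mask is set,
    so only some memory words affect the register block."""
    for i in range(8):
        if (mask >> i) & 1:
            regs[start + i] = memory[addr + i]
    return regs

def rotate_block(regs, start, k):
    """Rotate an 8-register block left by k positions, letting the programmer
    relax alignment restrictions on the loaded data."""
    block = regs[start:start + 8]
    regs[start:start + 8] = block[k:] + block[:k]
    return regs

mem = list(range(100, 108))
r = [0] * 8
masked_block_load(r, 0, mem, 0, 0b00001111)  # load only words 0-3
print(r)   # [100, 101, 102, 103, 0, 0, 0, 0]
rotate_block(r, 0, 2)
print(r)   # [102, 103, 0, 0, 0, 0, 100, 101]
```

A load that starts mid-block can thus be done as an aligned masked load followed by a rotate, which is how the mechanism sidesteps alignment restrictions.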