Power and Performance Analysis of IRAM
A set of models for analyzing performance and energy efficiency in various memory architectures
is presented. The models take into account key parameters such as cache miss rates, instruction
mix, bus sizes, memory and cache latencies, and number of bits accessed. There are separate
models for SRAM and DRAM, and the models incorporate many of the key technology-dependent
parameters. Tradeoffs are explored in terms of power, performance, and area for IRAM vs.
traditional off-chip DRAM approaches, as well as for DRAM and SRAM second-level caches. It was
found that IRAM has significant advantages in both power and performance, and may be
indispensable for matching processor speeds of 750 MHz and higher. Based on our findings and
the fact that DRAM yields slower transistors than optimized microprocessor technologies, we
suggest that IRAM may find its first application in low power designs.
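The style of analytical model the abstract describes can be sketched as a toy average-access-time and energy calculation. Every number below (hit rate, latencies, per-access energies) is an invented illustrative value, not one of the paper's calibrated parameters:

```python
# A minimal sketch of an analytical memory model in the spirit of the report.
# All parameter values are made up for illustration only.
def avg_access_time(hit_rate, hit_ns, miss_ns):
    """Average memory access time for a single-level cache model."""
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

def avg_access_energy(hit_rate, hit_nj, miss_nj):
    """Average energy per access; each miss pays the full memory-access cost."""
    return hit_rate * hit_nj + (1.0 - hit_rate) * miss_nj

# Conventional system: off-chip DRAM misses are slow and must drive pads.
t_off = avg_access_time(0.95, 2.0, 120.0)
e_off = avg_access_energy(0.95, 0.5, 20.0)

# IRAM: on-chip DRAM shortens the miss path and avoids off-chip drivers.
t_iram = avg_access_time(0.95, 2.0, 25.0)
e_iram = avg_access_energy(0.95, 0.5, 4.0)

print(round(t_off, 2), round(t_iram, 2))    # 7.9 3.15
print(round(e_off, 3), round(e_iram, 3))    # 1.475 0.675
```

Even with identical hit rates, shrinking the miss latency and miss energy dominates the averages, which is the intuition behind the power and performance advantages claimed above.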
MPEG Encoding on a Vector Processor
A number of emerging applications have the potential for wide public use except for their
enormous computational demands. The MPEG video compression standard is one of these.
At the same time, vector processing is a long-established technique that has been proven
useful in tackling many large computational problems. This report investigates the viability of
using vector processing for MPEG encoding. It describes an implementation of a portion of
the MPEG encoding process on a vector processor and the performance benefits obtained.
Utilizing the on-chip IRAM bandwidth:
Operating system use of memory block copy and clear operations
An IRAM architecture offers the potential for a tremendous amount of on-chip
bandwidth. An important question is how to best utilize this bandwidth to
achieve improved performance. One possible use of this bandwidth is in
memory block copy and clear operations. An IRAM has the potential to
move large blocks of data quickly from one region of memory to another,
and to clear a large block of memory just as quickly.
This project studies the use of block copy and clear operations by the
operating system in an attempt to quantify the usefulness of IRAM
architectural support for memory block copy and clear operations. The
frequency of occurrences of block copy and clear operations, the size of the
operations, and the relative alignment of the addresses for copy operations
are measured and analyzed.
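The quantities being measured can be sketched with a toy classifier; the (src, dst, length) tuples below are invented examples, not real operating-system trace data:

```python
# A hedged sketch of the study's measurements: classifying block-copy
# operations by size and by the relative alignment of source and destination.
def relative_alignment(src, dst, line_bytes=16):
    """Copies whose addresses differ by a multiple of the line width can be
    moved one wide line at a time by IRAM block-copy hardware."""
    return (dst - src) % line_bytes

copies = [(0x1000, 0x2000, 4096),   # line-aligned relative to each other
          (0x1004, 0x3000, 512),    # misaligned by 12 bytes
          (0x0500, 0x0510, 64)]     # line-aligned

aligned = [c for c in copies if relative_alignment(c[0], c[1]) == 0]
total_bytes = sum(length for _, _, length in copies)

print(len(aligned), total_bytes)   # 2 4672
```

Copies whose source and destination share the same offset within a wide line are the ones an IRAM can accelerate most directly, which is why relative alignment is worth measuring.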
IRAM for Real-Time Data Compression
Real-time data compression is an important application to consider for
IRAM, as the video telephony industry moves out of the research realm and
into the commercial market. Compression can be a memory intensive
operation, and IRAM could provide a single chip solution that is actually
lower in power due to the reduced need to drive off-chip buffers. This
paper investigates the JPEG compression algorithm and floorplans a JPEG
encoder using DRAM for buffers and scratchpad operations. Issues relating
to MPEG are also discussed, as well as compression techniques that do not
require a custom chip solution, such as general-purpose DSPs.
Building Cache Efficient Trees Using Long Cache Lines
The goal of IRAM is to improve overall system performance by putting the
processor as close to main memory as possible: on the same chip. Such an
organization allows the potential for memory access to have both much lower latency and
tremendously greater bandwidth. But even an IRAM may require some sort of cache to
ensure the processor can be fed data and instructions quickly enough. The internal
structure of a gigabit DRAM suggests a natural cache structure: place a row of latches
as wide as a bank in front of each bank of DRAM modules to form a direct-mapped cache.
Higher associativities can be realized by using more sets of latches. Such a structure was
proposed for the design of an IRAM-like chip. If that cache organization were adopted
for IRAM, cache lines would be 1024 bits wide, four times longer than current cache lines.
While long cache lines can decrease the cache miss rate, they can also
increase the required bandwidth by several orders of magnitude. Although
bandwidth demands are not necessarily a concern for an IRAM, reducing resource
requirements is always beneficial.
(Joseph D. Darcy)
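One way to see the effect of the 1024-bit line is to size a search-tree node to fill exactly one line. The 4-byte key and child-pointer widths below are assumptions for illustration, not figures from the study:

```python
# A hedged sketch of sizing a search-tree node to one 1024-bit (128-byte)
# cache line, assuming 4-byte keys and 4-byte child pointers.
import math

LINE_BYTES = 128          # 1024-bit cache line
KEY_BYTES = PTR_BYTES = 4

# A node holding k keys needs k + 1 child pointers:
# k * KEY_BYTES + (k + 1) * PTR_BYTES <= LINE_BYTES.
fanout_keys = (LINE_BYTES - PTR_BYTES) // (KEY_BYTES + PTR_BYTES)

def height(n, keys_per_node):
    """Levels needed for a complete tree indexing n keys."""
    return math.ceil(math.log(n + 1, keys_per_node + 1))

print(fanout_keys)                  # 15 keys fit in one line-sized node
print(height(10**6, fanout_keys))   # 5 levels for a million keys
print(height(10**6, 3))             # 10 levels with a narrow 4-ary node
```

Filling a whole line per node halves the number of line fetches per lookup relative to a narrow node, which is the sense in which long lines can reduce the miss rate for tree traversals.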
DRAM in a Logic Process
The potential of IRAM rests, to some extent, on the amount of memory available on-chip.
The proposed processor-memory hybrid is to be composed mostly of DRAM cells, hence the
name Intelligent Dynamic Random Access Memory. This project investigates the density,
speed, and power of DRAMs fabricated in a logic process, specifically MOSIS's scalable
CMOS 0.6 um process, versus those made in proprietary DRAM processes.
A single bank of memory is laid out with its associated circuitry. This bank is fully
analyzed and an analysis of its placement in a 4Mbit DRAM is given.
(Lloyd Y. Huang)
Instruction and data traces of workloads drive the design of the next generation CPU and
memory systems. Unfortunately, traces often do not include kernel code. Furthermore, traces
of small programs, including many applications found in the SPEC benchmark suite, do not
show system behavior under real workloads. Orienting designs on such traces may therefore be misleading.
In the context of the IRAM project at Berkeley, we decided to overcome the SPEC trace
dilemma and obtain meaningful traces of ``real world'' applications, such as web browsers
and personal productivity tools. To this end, we are putting an instrumentation framework
into place that allows tracing of applications running on the Solaris 2.4 operating system.
Our methodology for collecting the traces is to extend EEL (Executable Editing Library),
a library for building tools that analyze and modify executable programs and object
files. EEL allows code to be inserted at arbitrary points in a program, enabling other tools
that insert code to gather instruction and data address traces for on-the-fly simulations or
buffering for later analysis. One of the main strengths of EEL is that since it can edit
executables, it requires no access to source code; thus, EEL can instrument any program.
As of this week, we have an instrumented version of Solaris 2.4 up and running. The
instrumentation is currently limited to tracing of entries to kernel functions. We plan on
extending the tool to gather full instruction and data traces in the near future.
Vectorizing a Hash-Join
A vector instruction set is a well known method for exposing bandwidth
to applications. Although extensively studied in the scientific
programming community, scant work exists on vectorizing other kinds of
applications. This work examines vectorizing a traditional database
operation, a Grace hash-join. We show how to vectorize both the hash
and join phases of the algorithm, and present performance results on a
Cray C90 as well as on traditional microprocessors. We conclude that
vector scatter-gather and compress operations are essential both to this
algorithm and to other non-scientific codes.
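The vectorizable structure can be seen in a scalar sketch. Real vector hardware would gather hash-bucket entries for a whole block of probe keys and then compress away the non-matching ones; the toy relations and plain loops below are illustrative stand-ins, not the paper's Grace partitioning:

```python
# A hedged scalar sketch of a hash-join; the (key, payload) relations are
# invented, and the loops stand in for vector gather and compress steps.
def hash_join(build, probe):
    buckets = {}
    # Build phase: hash each key of the smaller relation into a bucket.
    for key, payload in build:
        buckets.setdefault(key, []).append(payload)
    out = []
    # Probe phase: vector hardware would gather buckets for a block of probe
    # keys at once, then compress away the keys with no matching entry.
    for key, payload in probe:
        for match in buckets.get(key, []):
            out.append((key, match, payload))
    return out

r = [(1, 'a'), (2, 'b')]
s = [(2, 'x'), (3, 'y'), (1, 'z')]
print(hash_join(r, s))   # [(2, 'b', 'x'), (1, 'a', 'z')]
```

The inner filtering step is where scatter-gather (indexing buckets by hashed key) and compress (discarding probes with empty buckets) would do the work on a vector machine.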
Multithreading & Multiprocessing in IRAM
This report studies the
effectiveness of different basic processor architectures in an IRAM
environment. Moving logic onto a DRAM process dramatically changes the
cost-effectiveness of processing components (caches, processor cores,
register files, etc.), which in turn affects the cost-effectiveness of various
processor architectures. In this report, graphs are made of price vs.
performance for several different architectures, where price is measured
in terms of die area. The overall conclusion is that none of the standard
systems studied (single-threaded, multithreaded, and multiprocessing
architectures) is more, or even equally, cost-effective in IRAM than in a
traditional split-logic process. This conclusion, although negative, is helpful in
limiting and directing the search for IRAM architectures to ones that will
be cost-effective. Such solutions will have to exploit benefits of IRAM, such as
extremely wide internal data buses, that are not exploited by the
studied architectures. Examples of such architectural solutions are
vector computing and multimedia processing.
The SPEC benchmark programs have become very popular targets for
evaluating architectural ideas. Recently, an increasing number of
researchers have pointed out that the behavior of the SPEC benchmark
programs is not representative of real workloads. In particular, the
memory reference patterns of the SPEC programs are relatively simple
and their working sets are relatively small. Using the SPEC programs
alone to evaluate IRAM ideas will result in the erroneous conclusion
that current cache designs work very well at hiding the memory
bottleneck and that there is little performance advantage in employing
an IRAM solution. Thus getting representative traces of real workloads
is a crucial first step in IRAM design and research. In this assignment,
we set out to collect a trace of the memory behavior of a database
management system (DBMS). Postgres is a natural choice because it is a
public domain DBMS and has none of the profiling and tracing restrictions
that are attached to commercial DBMSs. However, a major concern with
using a public domain DBMS is whether its behavior is representative of
commercial DBMSs. In this assignment, we compare some characteristics of
our Postgres trace to published numbers obtained on commercial systems.
A Novel DRAM Hierarchy Architecture for IRAM
This paper proposes a new separate bit-line
DRAM hierarchy architecture, which will include
a 1 Gbit main memory and a 4 MByte level-2 cache. Relying on
state-of-the-art technology, I estimate that the access times
of the main memory and level-2 cache will be 23 ns and 9 ns
respectively. The active power consumption for a level-2
cache hit is five times lower than for a main memory access.
The level-2 cache and its associated TAG controller will
only take 2.4% of the DRAM area.
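The 9 ns and 23 ns access times and the 5x power ratio above combine into average-case figures once a hit rate is chosen; the 90% level-2 hit rate below is an assumed value added for illustration, not a number from the paper:

```python
# Latencies and the power ratio are from the abstract; the hit rate is an
# assumption for illustration.
L2_NS, MEM_NS = 9.0, 23.0
POWER_RATIO = 5.0          # main-memory access vs. level-2 cache hit
HIT_RATE = 0.90            # assumed, not from the paper

avg_ns = HIT_RATE * L2_NS + (1.0 - HIT_RATE) * MEM_NS
avg_power = HIT_RATE * 1.0 + (1.0 - HIT_RATE) * POWER_RATIO

print(round(avg_ns, 2))     # 10.4 ns average access time
print(round(avg_power, 2))  # 1.4x the power of an all-hit access stream
```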
Considerations in Memory Design for IRAM
IRAM, or intelligent DRAM, is a new approach to computer design, involving the integration
of processor logic and main memory together on a single chip. An IRAM can reap maximum
benefits from this approach only by exposing the physical organization of the DRAM to the architecture.
A parameterized Spice deck is used to simulate cell access latency in a 64 Mb DRAM, and
the results are used to draw some suggestions as to how the memory/processor interface
should be designed for IRAM. The possibility of performing fast memory operations within the memory itself is also considered.
An Investigation of a Wide Bus in a DRAM Architecture
In recent years, there has been an increasing gap between processor speed
and memory bandwidth. One proposed solution is integrating both DRAM
and a microprocessor on a single die, which drastically improves
memory bandwidth. Most of the focus has been on increasing bandwidth
and reducing the distance between the cache and main memory. This paper
considers the benefits of increasing bandwidth between the register
file and the L1 cache by allowing aligned blocks of 8 or more
registers to be transferred as a single operation. This load occurs
through a bitmask, so only some memory words are affected. In
addition, an operation that allows a block of registers to be
rotated is proposed, allowing the programmer to relax alignment
restrictions. Several uses for such a mechanism are proposed and
explored, including faster function calls, improved garbage collection,
faster memory-to-memory copies, and simple vector processing.
(Nicholas C. Weaver)
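The proposed masked block load and block rotate can be simulated in software. The 8-word block size comes from the abstract; the function names and the word-array model of registers and memory are assumptions for illustration:

```python
# A hedged simulation of the proposed operations. Registers and memory are
# modeled as plain word arrays; names and the 8-word block are illustrative.
def masked_block_load(regs, start, memory, addr, mask):
    """Copy memory[addr+i] into regs[start+i] only where bit i of mask is set,
    so only some memory words affect the register block."""
    for i in range(8):
        if (mask >> i) & 1:
            regs[start + i] = memory[addr + i]
    return regs

def rotate_block(regs, start, k):
    """Rotate an 8-register block left by k positions, letting the programmer
    relax alignment restrictions on the loaded data."""
    block = regs[start:start + 8]
    regs[start:start + 8] = block[k:] + block[:k]
    return regs

mem = list(range(100, 108))
r = [0] * 8
masked_block_load(r, 0, mem, 0, 0b00001111)  # load only words 0-3
print(r)   # [100, 101, 102, 103, 0, 0, 0, 0]
rotate_block(r, 0, 2)
print(r)   # [102, 103, 0, 0, 0, 0, 100, 101]
```

A load that starts mid-block can thus be done as an aligned masked load followed by a rotate, which is how the mechanism sidesteps alignment restrictions.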