CS252: Fall 1998 Project Suggestions
One question raised by the IRAM architecture is how to support expansion of the built-in memory. One option is to connect external DRAMs to the IRAM chip. This project would investigate tradeoffs among different methods of connecting the external DRAM and, more importantly, would investigate how to structure this IRAM-based memory hierarchy. For example, should the on-chip memory act as a cache for the external DRAM (with cache- or disk-like paging), or should it be part of the same physical memory space? Who should manage the location of blocks in the hierarchy (hardware, OS, application, some combination)?
This project will involve selecting one or more simple benchmarks that have large working sets (larger than the on-chip IRAM memory) and modelling the performance of these applications on different IRAM/external-memory configurations, either analytically or via simulation.
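As a concrete starting point, the hierarchy's average memory access time (AMAT) can be modelled with just a few parameters. The C sketch below treats the on-chip memory as a cache for the external DRAM; all latencies and hit rates are illustrative placeholders, not measured VIRAM numbers:

    /* Back-of-the-envelope AMAT model for an IRAM with external DRAM,
     * treating the on-chip memory as a cache.  All numbers below are
     * illustrative placeholders, not measured VIRAM parameters. */
    #include <stdio.h>

    int main(void)
    {
        double t_onchip   = 20e-9;    /* assumed on-chip DRAM access time  */
        double t_external = 120e-9;   /* assumed external DRAM access time */
        double hit;

        /* Sweep the fraction of references satisfied on chip. */
        for (hit = 0.50; hit <= 1.0001; hit += 0.05) {
            double amat = hit * t_onchip + (1.0 - hit) * t_external;
            printf("on-chip hit rate %.2f -> AMAT %5.1f ns\n",
                   hit, amat * 1e9);
        }
        return 0;
    }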
One curiosity of the IRAM architecture is that it is built around a vector processor, yet is targeted primarily at embedded multimedia applications rather than the large-scale scientific applications for which vector processors are typically used. This project investigates whether the multimedia-oriented hardware on VIRAM (fixed-point and DSP-like operations) could be leveraged to accelerate traditional scientific and supercomputer applications.
Characterize the memory bandwidth requirements and access patterns of multimedia codes, and compare them with those of scientific applications. Ideally, this would involve building a parameterized analytic modelling framework for the memory demands of such applications and the hardware they run on (as LogP did for communication in parallel programs/machines). The model might take into consideration bandwidth, latency, strides, the ratio of memory operations to computation, vectorizability, etc.
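As a hint of what such a framework might look like, here is a minimal roofline-style sketch in C: execution time is the larger of compute time and memory-transfer time, plus a per-strip startup latency. Every parameter, and all the machine/kernel numbers, are made-up stand-ins that would have to be calibrated against real measurements:

    /* First cut at a LogP-style analytic model for a vectorizable kernel.
     * All parameters below are stand-ins to be calibrated. */
    #include <stdio.h>

    struct machine {
        double flops;      /* peak arithmetic rate (ops/s)           */
        double bandwidth;  /* sustainable memory bandwidth (bytes/s) */
        double latency;    /* startup latency per vector strip (s)   */
    };

    struct kernel {
        double ops_per_elem;    /* arithmetic operations per element */
        double bytes_per_elem;  /* bytes moved per element           */
        long   n;               /* problem size (elements)           */
        long   strip;           /* vector strip length               */
    };

    double model_time(struct machine m, struct kernel k)
    {
        double compute = k.n * k.ops_per_elem / m.flops;
        double memory  = k.n * k.bytes_per_elem / m.bandwidth;
        double startup = (double)(k.n / k.strip + 1) * m.latency;
        /* time is bounded by the slower of compute and memory */
        return (compute > memory ? compute : memory) + startup;
    }

    int main(void)
    {
        struct machine m = { 1e9, 2e9, 100e-9 };       /* made-up machine */
        struct kernel  k = { 2.0, 12.0, 1000000, 64 }; /* daxpy-like kernel */
        printf("predicted time: %g s\n", model_time(m, k));
        return 0;
    }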
Eugene Miya wrote a paper, "A Methodology for Calibrating the Cray Research Hardware Performance Monitor: Initial Observations". He ran many benchmark tests and discovered non-deterministic behavior in the values returned by the Cray's hardware performance-measurement tools. His experiments and analysis barely skimmed the surface of explaining what was causing the non-deterministic behavior, how large and how frequent the deviations were, and what, if anything, can be done to get better performance results from such measuring tools. This area would be a valuable one to explore, since the IRAM test chip will need similar hardware performance monitors. This project would involve duplicating and perhaps extending Miya's benchmarks, running them on the Cray and/or perhaps the Pentium Pro (which also has such measuring tools), analyzing the results, and then making general design recommendations for such tools.
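A natural first microbenchmark, sketched below under the assumption of gcc on an x86 machine, times an identical fixed loop repeatedly using the Pentium Pro's cycle counter (RDTSC); any spread in the readings is exactly the non-determinism to be explained:

    /* Repeatability test in the spirit of Miya's experiments: time the
     * same fixed loop many times with RDTSC and observe the variation.
     * Assumes gcc inline-asm syntax on an x86 processor. */
    #include <stdio.h>

    static unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        enum { TRIALS = 20, WORK = 1000000 };
        volatile int sink = 0;
        int t, i;

        for (t = 0; t < TRIALS; t++) {
            unsigned long long start = rdtsc();
            for (i = 0; i < WORK; i++)
                sink += i;                 /* identical work each trial */
            unsigned long long stop = rdtsc();
            /* Identical work should give identical counts; any spread
             * is the non-determinism under investigation. */
            printf("trial %2d: %llu cycles\n", t, stop - start);
        }
        return 0;
    }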
Since we are finalizing the microarchitecture and ISA for VIRAM-1 this semester, now is the time to investigate algorithms and develop benchmarks to analyze the performance impact of our implementation decisions. We have simulators and other tools to help collect statistics. There are several possible projects here:
Prof. Malik and his students have developed novel algorithms for image segmentation (see http://HTTP.CS.Berkeley.EDU/~jshi/Grouping/overview.html). This project would have you code the segmentation algorithm for VIRAM and then compare the results to unvectorized versions running on standard workstations. The goal is to develop insight into how the algorithm works, how it interacts with the VIRAM microarchitecture, and how VIRAM might be improved to better support it.
UCLA has developed MediaBench, a comprehensive benchmark suite of multimedia kernels and applications. Since multimedia applications are a primary target for VIRAM, we would like to understand how these applications perform on VIRAM. This project would involve coding the algorithms (or some subset) for VIRAM, and then comparing the results to those available for standard systems and DSPs.
One open question is how well the VIRAM architecture will perform on large scientific benchmarks. One way to investigate this is to look at the performance of LAPACK on VIRAM. LAPACK is a linear algebra subroutine package that contains many of the kernels used in scientific applications. This project would involve coding some of the LAPACK algorithms on VIRAM, comparing their performance to published results, and investigating IRAM's strengths and weaknesses for those algorithms.
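As one example of why LAPACK is a plausible fit, consider DAXPY (y = a*x + y), a BLAS Level-1 kernel that LAPACK routines build on: every iteration is independent, so the loop maps directly onto unit-stride vector loads, a multiply-add, and unit-stride stores. A plain C version, which would serve as the scalar baseline, is sketched below:

    /* DAXPY (y = a*x + y): each iteration is independent, so the loop
     * vectorizes trivially; this plain C form is the scalar baseline. */
    #include <stdio.h>

    void daxpy(long n, double a, const double *x, double *y)
    {
        long i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* independent across i */
    }

    int main(void)
    {
        double x[8], y[8];
        int i;
        for (i = 0; i < 8; i++) { x[i] = i; y[i] = 1.0; }
        daxpy(8, 2.0, x, y);
        printf("y[7] = %g\n", y[7]);  /* expect 2*7 + 1 = 15 */
        return 0;
    }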
This suggestion is less tightly formed. Examples include hidden Markov models for speech recognition, Viterbi decoding as an alternative to HMMs (possible help from Eric Fosler), lossless compression algorithms, MPEG-2 encoding, GSM cellphone algorithms, and encryption algorithms.
The following are some ISTORE-related projects. There are several people who may be able to help with these projects; talk to Aaron Brown (abrown@cs.berkeley.edu) if you're interested in one of the following projects.
Although intelligent disks were originally proposed for large server machines, one interesting application would be to use them to accelerate the performance of traditional desktop PCs. This project would investigate how an IDISK might help ordinary PCs, for example by making the DOS or Windows NT file systems run better/faster. The idea would be to take some disk-oriented PC benchmarks (such as those from PC Magazine, ZDBOP, or Byte), analyze them, and determine whether intelligence on the disk could accelerate them.
One facet of ISTORE is the use of detailed low-level positioning information from the IDISKs to do highly optimized disk scheduling. In order to make this work, we need to understand the behavior of disks under various access patterns. This project would involve first writing and running microbenchmarks to analyze disk performance and behavior, then using them to build an analytic model of disk performance. This model might include the behavior of seeks of different lengths, different block-access patterns, the policy and behavior of the on-disk track caches, etc. Another interesting part of the project could be to analyze several disks of the same model to determine whether there is any variation between them.
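A minimal seek-distance microbenchmark might look like the C sketch below. The raw-device path is hypothetical, and defeating the on-disk track cache (e.g., by randomizing the starting position on each trial) is part of the project, not handled here:

    /* Skeleton of a seek-time microbenchmark: read one sector, seek a
     * given distance, read again, and time the pair.  The device path
     * is a placeholder; error handling and cache-defeating tricks are
     * deliberately omitted. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        const char *dev = "/dev/rsd0c";   /* hypothetical raw disk */
        char buf[512];
        long dist;
        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror(dev); return 1; }

        for (dist = 1; dist <= 1L << 20; dist <<= 1) {  /* sectors */
            double t0 = now();
            lseek(fd, 0L, SEEK_SET);
            read(fd, buf, sizeof buf);
            lseek(fd, dist * 512L, SEEK_SET);
            read(fd, buf, sizeof buf);
            printf("seek distance %7ld sectors: %8.3f ms\n",
                   dist, (now() - t0) * 1000.0);
        }
        close(fd);
        return 0;
    }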
Bruce Worthington et al. wrote a paper last year on automatically extracting physical layout and performance information from a SCSI disk (see http://www.hpl.hp.com/SSP/papers/Worthington96.ps). One limitation of their technique was that they used SCSI commands to directly read out the physical geometry of the disk. While this may work for some SCSI disks, it does not work for all SCSI disks, and certainly does not work for commodity EIDE disks. This project investigates the possibility of obtaining the physical disk layout parameters of a modern disk (with its own caching and scheduling algorithms) while treating the disk as a black box.
Many novel DRAM organizations have been proposed for 3D graphics applications (e.g., FBRAM). This project investigates whether an IMEM would be a useful architecture for 3D graphics. Do the typical components of a 3D graphics pipeline (e.g., RenderMan, OpenGL) decompose in a manner amenable to IMEM? Would IMEM be a good architecture for such codes? What performance could be expected? One approach to this project would be to take the Mesa graphics library (a free implementation of OpenGL) and try to decompose the code for use on an IMEM-style system.
A reconfigurable array embedded with DRAM creates an opportunity for interesting application-specific memory organizations; the BRASS group's embedded DRAM in the HSRA is one example. Explore the performance benefits for one or more selected applications. For example:
Per raw bit operation, the potential energy costs of a low-voltage FPGA and a low-power DSP or microprocessor are very similar. Once correlation between bits in a datapath is taken into account, the energy may vary considerably, perhaps by an order of magnitude.
In particular, a spatial (non-multiplexed) implementation on an FPGA will have a low activity rate when data is highly correlated. The heavy multiplexing and interleaving of operations on a processor will tend to destroy the natural correlation in the data, yielding a higher activity rate.
For some common kernels (perhaps starting with the filters and transforms common in signal/video processing), collect the data activity and estimate the actual energy consumed on a processor and on an FPGA implementation. The goal would be to understand the source of potential benefits for the reconfigurable architecture and quantify typical effects.
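One simple, concrete way to measure data activity is to count bit transitions (the Hamming distance) between consecutive data words flowing through a kernel, as in the C sketch below; the sample values are arbitrary stand-ins for a real kernel trace:

    /* Count bit toggles between consecutive data words.  Highly
     * correlated streams toggle few bits; interleaved, unrelated values
     * toggle many more.  Sample data is an arbitrary stand-in. */
    #include <stdio.h>

    static int popcount16(unsigned v)
    {
        int n = 0;
        for (; v; v >>= 1)
            n += v & 1;
        return n;
    }

    int main(void)
    {
        /* e.g., successive samples from a slowly varying signal */
        unsigned short samples[] = { 1000, 1002, 1001, 1005, 1003, 1008 };
        int nsamples = sizeof samples / sizeof samples[0];
        long toggles = 0;
        int i;

        for (i = 1; i < nsamples; i++)
            toggles += popcount16(samples[i - 1] ^ samples[i]);

        printf("average toggles per 16-bit word: %.2f (of 16)\n",
               (double)toggles / (nsamples - 1));
        return 0;
    }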
Andre' DeHon (amd@cs) would give advice on this project. It would likely involve:
Come up with an architecture which includes both execution units and monitoring units (these might be the same), and a scheme for exploiting the monitoring process. A complete solution here would be a PhD thesis, so you should consider small pieces of this. For instance:
Is there some set of instructions or hardware mechanisms that could be added to an architecture to make dynamic compilation easier, faster, or more efficient?
As mentioned a number of times in class, power dissipation is currently a major problem in architectures. Come up with some way to exploit the Introspective Computing concept to save power. What monitoring of execution would be appropriate? How would you alter the execution based on this monitoring to save power? This is pretty open-ended; you are practically guaranteed a good conference paper if you come up with something.
Figure out how an introspective computing architecture might be able to automatically extract parallelism from running code (loops?) and split this code out into an explicitly parallel version.
In class, we handed out Joel Emer's paper on using genetic algorithms to synthesize branch predictors, with impressive results. Genetic algorithms can probably be of use in other areas of computer architecture, such as data predictors and hardware prefetching. Figure out how to exploit genetic algorithms to design other aspects of hardware architectures.
Another thing that we discussed in class was the notion of "breaking the dataflow barrier" through data value prediction. Although various architectures have been proposed for (1) predicting values and (2) exploiting these predictions, there is a tremendous amount of room for improvement. Come up with a value prediction strategy and a proposed architecture for exploiting it. Figure out how to evaluate it using a simulation model (the SimpleScalar simulator from Wisconsin?).
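For calibration, a new scheme should at least beat the standard baseline: a last-value predictor with saturating confidence counters. A minimal C sketch, with a fabricated trace standing in for simulator output, is given below:

    /* Last-value predictor with 2-bit confidence counters; the trace is
     * fabricated, and in practice would come from a simulator. */
    #include <stdio.h>

    #define TABLE_SIZE 1024

    struct entry {
        unsigned long last;  /* last value produced by this (hashed) PC */
        int conf;            /* saturating 0..3 confidence counter      */
    };

    static struct entry table[TABLE_SIZE];

    /* Returns 1 if the predictor confidently predicted `value`,
     * then updates the table. */
    int predict_and_update(unsigned long pc, unsigned long value)
    {
        struct entry *e = &table[pc % TABLE_SIZE];
        int hit = (e->conf >= 2 && e->last == value);

        if (e->last == value) {
            if (e->conf < 3) e->conf++;
        } else {
            if (e->conf > 0) e->conf--; else e->last = value;
        }
        return hit;
    }

    int main(void)
    {
        /* fabricated value trace: one PC that mostly repeats a value */
        unsigned long vals[] = { 7, 7, 7, 7, 3, 7, 7, 7 };
        int hits = 0, n = sizeof vals / sizeof vals[0], i;

        for (i = 0; i < n; i++)
            hits += predict_and_update(0x400100UL, vals[i]);

        printf("correct confident predictions: %d/%d\n", hits, n);
        return 0;
    }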
The following suggestions are from John R. Mashey (mash@mash.engr.sgi.com), and were originally proposed for last year's 252 class:
1996's efforts on XSPEC (a prior CS252 project) were useful, but this is a never-ending task. The following are several possible projects related to benchmarking, each of which would make its own CS252 project.
I.e., along the same model as SPEC/XSPEC, take real programs and analyze them. Gathering and publishing data is always useful, although deeper analyses of fewer codes might be good.
Do a serious mathematical correlation analysis among the existing masses of SPEC95 & XSPEC data to understand the level of redundancy, with the goal of selecting well-correlated subsets and shrinking the number of benchmark cases, not growing them. [One of the implicit goals of SPEC, at least for some of us, was to make sure there was a reasonable body of results to analyze...]
Pick some product line with a number of members that differ in clock rate, memory system, and compiler version. Study how performance has changed with compiler versions, and how it varies with MHz, cache size, etc. Question: given the SPEC or XSPEC numbers for one family member, how well can you predict the numbers for other family members, given MHz, peak issue rate, cache size, memory latency, memory bandwidth, etc., for entire benchmark sets or individual members?
Like P7.2.1, but with the added complexity of different compilers, cache designs, etc. If one proposes a formula that predicts most of the variation, it is especially important to analyze the mis-predictions (at least for commercial benchmark fights :-).
WARNING: be careful with correlation analyses. People have sometimes reached somewhat-incorrect conclusions by using a set of data points dominated by large subsets of related systems. I.e., if many of the systems used Pentiums of various flavors, one might find that MHz alone accounted for much of the variation.
(Some of this was done in 1996; much more is needed.) Some benchmarks have reasonably scalable dataset sizes. In some cases performance is very sensitive to dataset size versus cache size, in others it is not, and sometimes it depends on how the code is compiled. (I.e., SPEC89's matrix300 was supposed to be a cache & memory thrasher, but cache-blocking compilers invalidated its usefulness for that, as they made it much less sensitive to cache size and memory performance.) Analyze the existing SPEC/XSPEC codes and see if there are any simple models that relate performance, cache size, dataset size, and some metric for the nature of the code. Try to categorize codes better: it is silly to make multiple runs of different sizes if little new information is provided.
This probably involves running more sizes of data on fewer benchmarks, to look for steep dropoffs in performance.
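The probe itself can be quite simple. The C sketch below times the same number of accesses over working sets of doubling size, so throughput drops should appear as the working set passes each cache level; a real study would substitute actual benchmark codes for the trivial summing loop:

    /* Size-sweep probe: same work at every size, so throughput dips
     * mark cache-capacity boundaries.  The summing loop is a trivial
     * stand-in for real benchmark code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        enum { MAXWORDS = 1 << 22 };              /* 16 MB of ints */
        int *a = calloc(MAXWORDS, sizeof *a);
        long sum = 0, size;
        if (!a) return 1;

        for (size = 1 << 10; size <= MAXWORDS; size <<= 1) {
            long touches = 1L << 24, i;           /* same work every size */
            clock_t t0 = clock();
            for (i = 0; i < touches; i++)
                sum += a[i & (size - 1)];         /* wrap in working set */
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("working set %8ld KB: %6.1f M accesses/s\n",
                   size * (long)sizeof *a / 1024, touches / secs / 1e6);
        }
        free(a);
        return (int)sum;  /* keep sum live so the loop isn't deleted */
    }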
This is like P7.3, but emphasizing inter-machine comparisons.
[Note: in commercial system benchmarking, fixed-size benchmarks are notorious for allowing misleading/surprising comparisons, and vendor benchmarkers know this well.] If a benchmark does have dataset-vs-cache-size sensitivity, and there are two systems:
Of course, the SPEC benchmarks are all fixed-size, and sometimes prone to odd cases where the size just happens to clobber a system, but size/2 or 2*size would not. I've seen cases where one could make either A or B look faster than the other by adjusting dataset sizes.
Maybe CPU benchmarks should look more like system throughput benchmarks, i.e., shown as a curve. For benchmarks in which cache size versus data size matters, it would be good to compare machines by curves with vertical = performance, horizontal = variable data size, across an appropriate range of sizes, and then see if there are better, succinct metrics derivable from the mass of data. Somehow, one would like a metric that gives credit to performance at a range of sizes, not just one size.
Redo any of the analyses above and get away from individual performance numbers into means with error bars, or just 90% confidence limits, or anything that would bring more rationality to the presentation and analysis of these things. [It is incredible that people talk about SPEC#s with 3 digits of accuracy...]
I'd love to have a reasonably short, scalable, perhaps synthetic benchmark, that, with appropriate & understandable input parameters, could predict most of the variation in the various SPEC/XSPEC codes. (This is probably not for 1998, since it should build on some of the other suggestions above, not be invented out of whole cloth.)
Data-intense codes lend themselves to scaling tests; instruction-cache ones don't. Characterize the instruction cache behavior of the existing codes. Propose a parameterizable synthetic benchmark whose code size can be varied, and investigate its correlation with the existing programs.
As in P15.4, given the same size external cache, there are odd cases of differences among:
Also, between direct-mapped and set-associative caches: direct-mapped D-caches sometimes have terrible performance dips as array sizes are increased, due to cache collisions. (Customers get angry out of all proportion; i.e., system A (direct-mapped) may have better average performance than system B (set-associative), but B degrades smoothly rather than showing the occasional giant dips of A. People hate A.)
Study the existing benchmarks and analyze the performance results in light of either I-size or D-size being near sensitivity points of current system designs, or sensitivity points on associativity.
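The collision effect is easy to reproduce. In the C sketch below, two arrays placed a power of two apart map to the same sets of a direct-mapped cache and evict each other on every access; the stride and sizes are placeholders to be tuned to the cache under test (on a set-associative cache the same program runs without the dip):

    /* Direct-mapped dip demo: x[i] and y[i] are a power of two apart,
     * so in a direct-mapped cache they share a set and thrash.
     * STRIDE is a placeholder, assumed >= the cache size. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define STRIDE (1 << 20)

    int main(void)
    {
        char *block = calloc(2 * STRIDE, 1);  /* x and y STRIDE apart */
        char *x = block, *y = block + STRIDE;
        long sum = 0, r, i;
        if (!block) return 1;

        clock_t t0 = clock();
        for (r = 0; r < 1000; r++)
            for (i = 0; i < STRIDE; i += 64) {   /* one touch per line */
                sum += x[i];   /* x[i] and y[i] share a cache set ...  */
                sum += y[i];   /* ... so each access evicts the other  */
            }
        printf("time: %.2f s (sum %ld)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
        free(block);
        return 0;
    }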
Larry McVoy's lmbench benchmark is a useful indicator of performance for small-object, latency-intense codes, i.e., where cache misses are frequent. Since this includes some behavior of DBMS, networking, and OS code, and such behavior is not necessarily well represented by SPEC, these numbers are of interest. On the other hand, the numbers are prone to over-interpretation: over-reliance on the part of the benchmark that measures memory latency alone would lead one to design machines with the simplest cache design (1-level, blocking loads, no fetchahead) and the simplest in-order execution, since the benchmark explicitly defeats the use of multiple outstanding cache misses and out-of-order execution, even though these features are believed by many to be useful.
Analyze lmbench results (of which many exist), compare them to SPEC/XSPEC results, and see which, if any, of the SPEC/XSPEC components are well predicted by lmbench's (a) memory latency or (b) latency of the largest cache.
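For reference, the heart of lmbench's memory-latency measurement is a pointer chase in which each load depends on the previous one, so only one miss can ever be outstanding; this is precisely why it rewards simple blocking-load memory systems. A stripped-down version in C:

    /* Pointer-chase latency loop: each load depends on the previous
     * one, so only one miss is ever outstanding.  Sizes and strides
     * are placeholders to be tuned to the machine under test. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 1 << 20 };            /* 8 MB of pointers */
        void **ring = malloc(N * sizeof *ring);
        long i, chases = 20 * 1000 * 1000;
        if (!ring) return 1;

        /* Link the cells into one big cycle; the odd stride visits
         * every cell and keeps successive loads on distant lines. */
        for (i = 0; i < N; i++)
            ring[i] = &ring[(i + 4097) % N];

        void **p = ring;
        clock_t t0 = clock();
        for (i = 0; i < chases; i++)
            p = (void **)*p;             /* serial load-to-load chain */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%.1f ns per dependent load (p=%p)\n",
               secs / chases * 1e9, (void *)p);
        free(ring);
        return 0;
    }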
The following suggestion is from Greg Pfister of IBM (pfister@us.ibm.com), and was originally proposed for last year's 252 class:
First: looking over the list of prior projects from last time, I noticed virtually nothing about I/O. You are using a text that doesn't focus there (nor do many others), but nevertheless, many topics from last time could be revisited in an I/O context: benchmarks of I/O, efficiency of I/O, etc. How good is the memory system at block-streaming multiple multimedia streams onto disk and/or a fast network? How about OS overhead for lots of little transactions? There certainly are imaginative ways it could be reduced, and proposing and measuring the results could be a good project.
The following was suggested by David Douglas (douglas@East.Sun.COM) and F. Balint (balintf@East.Sun.COM) from Sun Microsystems, for last year's CS252 class:
"I2O" (I-to-O) is the big new I/O architecture definition from WiNTel, attempting to push more processing off of the main CPU's and onto the I/O cards. Their question is: Is this any faster than the "smart" IO subsystems people have been building for a while? Will I2O open up new opportunities to move stuff off of the main CPU's, resulting in faster performance?
One additional issue not mentioned below, but within Berkeley's historic interests, is to look at I2O in light of RAID controllers.
The project goal is twofold: What are the factors impacting system performance? What IOP performance is needed to gain the most, and is there a linear relationship between IOP performance and system performance gain?
The focus should be on storage and on network performance.
For step #1, the recommendation is to compare the
The data to be looked at is I/Os per second for 2K r/w and 8K r/w. The system utilization should be compared (intelligent I/O is supposed to reduce PIO and interrupt rates, and thus CPU utilization).
The following was suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
In his PhD thesis, John Kubiatowicz explored the probability of deadlock in message-passing multiprocessors with direct network interfaces (e.g., the Alewife interface). See the description of "DeadSIM" in Chapter 6 of the thesis, available off the publications link on his homepage. This exploration was done with probabilistic simulation in mesh networks of varying dimensions. Expand this analysis to include: (1) networks with virtual channels, and (2) networks with automatic queueing to memory (such as the Wisconsin CNI interface). Figure out how to make the message traffic more realistic by incorporating actual multiprocessor traces. Will direct network interfaces with software deadlock recovery do well in large systems and under heavy load?