CS252: Fall 1998 Project Suggestions
One question raised by the IRAM architecture is how to support expansion of the built-in memory. One option is to connect external DRAMs to the IRAM chip. This project would investigate tradeoffs among different methods of connecting the external DRAM and, more importantly, would investigate how to structure this IRAM-based memory hierarchy. For example, should the on-chip memory act as a cache for the external DRAM (with cache- or disk-like paging), or should it be part of the same physical memory space? Who should manage the location of blocks in the hierarchy (hardware, OS, application, some combination)?
This project will involve selecting one or more simple benchmarks that have large working sets (larger than the on-chip IRAM memory) and modelling the performance of these applications on different IRAM/external-memory configurations, either analytically or via simulation.
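As a concrete starting point, the hierarchy's average memory access time (AMAT) can be modelled with just a few parameters. The C sketch below treats the on-chip memory as a cache for the external DRAM; all latencies and hit rates are illustrative placeholders, not measured VIRAM numbers:

    /* Back-of-the-envelope AMAT model for an IRAM with external DRAM,
     * treating the on-chip memory as a cache.  All numbers below are
     * illustrative placeholders, not measured VIRAM parameters. */
    #include <stdio.h>

    int main(void)
    {
        double t_onchip   = 20e-9;    /* assumed on-chip DRAM access time  */
        double t_external = 120e-9;   /* assumed external DRAM access time */
        double hit;

        /* Sweep the fraction of references satisfied on chip. */
        for (hit = 0.50; hit <= 1.0001; hit += 0.05) {
            double amat = hit * t_onchip + (1.0 - hit) * t_external;
            printf("on-chip hit rate %.2f -> AMAT %5.1f ns\n",
                   hit, amat * 1e9);
        }
        return 0;
    }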
One curiosity of the IRAM architecture is that it is built around a vector processor, yet is targeted primarily at embedded multimedia applications rather than the large-scale scientific applications for which vector processors are typically used. This project investigates whether the multimedia-oriented hardware on VIRAM (fixed-point and DSP-like operations) could be leveraged to accelerate traditional scientific and supercomputer applications.
Characterize the memory bandwidth requirements and access patterns of multimedia codes, and compare them with those of scientific applications. Ideally, this would involve building a parameterized analytic modelling framework for the memory demands of such applications and the hardware they run on (as LogP did for communication in parallel programs/machines). The model might take into consideration bandwidth, latency, strides, the ratio of memory operations to computation, vectorizability, etc.
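As a hint of what such a framework might look like, here is a minimal roofline-style sketch in C: execution time is the larger of compute time and memory-transfer time, plus a per-strip startup latency. Every parameter, and all the machine/kernel numbers, are made-up stand-ins that would have to be calibrated against real measurements:

    /* First cut at a LogP-style analytic model for a vectorizable kernel.
     * All parameters below are stand-ins to be calibrated. */
    #include <stdio.h>

    struct machine {
        double flops;      /* peak arithmetic rate (ops/s)           */
        double bandwidth;  /* sustainable memory bandwidth (bytes/s) */
        double latency;    /* startup latency per vector strip (s)   */
    };

    struct kernel {
        double ops_per_elem;    /* arithmetic operations per element */
        double bytes_per_elem;  /* bytes moved per element           */
        long   n;               /* problem size (elements)           */
        long   strip;           /* vector strip length               */
    };

    double model_time(struct machine m, struct kernel k)
    {
        double compute = k.n * k.ops_per_elem / m.flops;
        double memory  = k.n * k.bytes_per_elem / m.bandwidth;
        double startup = (double)(k.n / k.strip + 1) * m.latency;
        /* time is bounded by the slower of compute and memory */
        return (compute > memory ? compute : memory) + startup;
    }

    int main(void)
    {
        struct machine m = { 1e9, 2e9, 100e-9 };       /* made-up machine */
        struct kernel  k = { 2.0, 12.0, 1000000, 64 }; /* daxpy-like kernel */
        printf("predicted time: %g s\n", model_time(m, k));
        return 0;
    }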
Eugene Miya wrote a paper, "A Methodology for Calibrating the Cray Research Hardware Performance Monitor: Initial Observations". He ran many benchmark tests and discovered non-deterministic behavior in the values returned by the Cray's hardware performance-measurement tools. His experiments and analysis barely skimmed the surface of explaining what was causing the non-deterministic behavior, how large and how frequent the deviations were, and what, if anything, can be done to get better performance results from such measuring tools. This area would be a valuable one to explore, since the IRAM test chip will need similar hardware performance monitors. This project would involve duplicating and perhaps extending Miya's benchmarks, running them on the Cray and/or perhaps the Pentium Pro (which also has such measuring tools), analyzing the results, and then making general design recommendations for such tools.
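A natural first microbenchmark, sketched below under the assumption of gcc on an x86 machine, times an identical fixed loop repeatedly using the Pentium Pro's cycle counter (RDTSC); any spread in the readings is exactly the non-determinism to be explained:

    /* Repeatability test in the spirit of Miya's experiments: time the
     * same fixed loop many times with RDTSC and observe the variation.
     * Assumes gcc inline-asm syntax on an x86 processor. */
    #include <stdio.h>

    static unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main(void)
    {
        enum { TRIALS = 20, WORK = 1000000 };
        volatile int sink = 0;
        int t, i;

        for (t = 0; t < TRIALS; t++) {
            unsigned long long start = rdtsc();
            for (i = 0; i < WORK; i++)
                sink += i;                 /* identical work each trial */
            unsigned long long stop = rdtsc();
            /* Identical work should give identical counts; any spread
             * is the non-determinism under investigation. */
            printf("trial %2d: %llu cycles\n", t, stop - start);
        }
        return 0;
    }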
Since we are finalizing the microarchitecture and ISA for VIRAM-1 this semester, now is the time to investigate algorithms and develop benchmarks to analyze the performance impact of our implementation decisions. We have simulators and other tools to help collect statistics. There are several possible projects here:
Prof. Malik and his students have developed novel algorithms for image segmentation (see http://HTTP.CS.Berkeley.EDU/~jshi/Grouping/overview.html). This project would have you code the segmentation algorithm for VIRAM and then compare the results to unvectorized versions running on standard workstations. The goal is to develop insight into how the algorithm works, how it interacts with the VIRAM microarchitecture, and how VIRAM might be improved to better support it.
UCLA has developed MediaBench, a comprehensive benchmark suite of multimedia kernels and applications. Since multimedia applications are a primary target for VIRAM, we would like to understand how these applications perform on VIRAM. This project would involve coding the algorithms (or some subset) for VIRAM, and then comparing the results to those available for standard systems and DSPs.
One open question is how well the VIRAM architecture will perform on large scientific benchmarks. One way to investigate this is to look at the performance of LAPACK on VIRAM. LAPACK is a linear algebra subroutine package that contains many of the kernels used in scientific applications. This project would involve coding some of the LAPACK algorithms on VIRAM, comparing their performance to published results, and investigating IRAM's strengths and weaknesses for those algorithms.
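As one example of why LAPACK is a plausible fit, consider DAXPY (y = a*x + y), a BLAS Level-1 kernel that LAPACK routines build on: every iteration is independent, so the loop maps directly onto unit-stride vector loads, a multiply-add, and unit-stride stores. A plain C version, which would serve as the scalar baseline, is sketched below:

    /* DAXPY (y = a*x + y): each iteration is independent, so the loop
     * vectorizes trivially; this plain C form is the scalar baseline. */
    #include <stdio.h>

    void daxpy(long n, double a, const double *x, double *y)
    {
        long i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* independent across i */
    }

    int main(void)
    {
        double x[8], y[8];
        int i;
        for (i = 0; i < 8; i++) { x[i] = i; y[i] = 1.0; }
        daxpy(8, 2.0, x, y);
        printf("y[7] = %g\n", y[7]);  /* expect 2*7 + 1 = 15 */
        return 0;
    }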
This suggestion is less tightly formed. Examples include hidden Markov models for speech recognition, Viterbi decoding as an alternative to HMMs (possible help from Eric Fosler), lossless compression algorithms, MPEG-2 encoding, GSM cellphone algorithms, and encryption algorithms.
The following are some ISTORE-related projects. There are several people who may be able to help with these projects; talk to Aaron Brown (abrown@cs.berkeley.edu) if you're interested in one of the following projects.
Although intelligent disks were originally proposed for large server machines, one interesting application would be to use them to accelerate the performance of traditional desktop PCs. This project would investigate how an IDISK might help ordinary PCs, for example by making the DOS or Windows NT file systems run better/faster. The idea would be to take some disk-oriented PC benchmarks (such as those from PC Magazine, ZDBOP, or Byte), analyze them, and determine whether intelligence on the disk could accelerate them.
One facet of ISTORE is the use of detailed low-level positioning information from the IDISKs to do highly optimized disk scheduling. In order to make this work, we need to understand the behavior of disks under various access patterns. This project would involve first writing and running microbenchmarks to analyze disk performance and behavior, then using them to build an analytic model of disk performance. This model might include the behavior of seeks of different lengths, different block-access patterns, the policy and behavior of the on-disk track caches, etc. Another interesting part of the project could be to analyze several disks of the same model to determine whether there is any variation between them.
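A minimal seek-distance microbenchmark might look like the C sketch below. The raw-device path is hypothetical, and defeating the on-disk track cache (e.g., by randomizing the starting position on each trial) is part of the project, not handled here:

    /* Skeleton of a seek-time microbenchmark: read one sector, seek a
     * given distance, read again, and time the pair.  The device path
     * is a placeholder; error handling and cache-defeating tricks are
     * deliberately omitted. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        const char *dev = "/dev/rsd0c";   /* hypothetical raw disk */
        char buf[512];
        long dist;
        int fd = open(dev, O_RDONLY);
        if (fd < 0) { perror(dev); return 1; }

        for (dist = 1; dist <= 1L << 20; dist <<= 1) {  /* sectors */
            double t0 = now();
            lseek(fd, 0L, SEEK_SET);
            read(fd, buf, sizeof buf);
            lseek(fd, dist * 512L, SEEK_SET);
            read(fd, buf, sizeof buf);
            printf("seek distance %7ld sectors: %8.3f ms\n",
                   dist, (now() - t0) * 1000.0);
        }
        close(fd);
        return 0;
    }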
Bruce Worthington et al. wrote a paper last year on automatically extracting physical layout and performance information from a SCSI disk (see http://www.hpl.hp.com/SSP/papers/Worthington96.ps). One limitation of their technique was that they used SCSI commands to directly read out the physical geometry of the disk. While this may work for some SCSI disks, it does not work for all SCSI disks, and certainly does not work for commodity EIDE disks. This project investigates the possibility of obtaining the physical disk layout parameters of a modern disk (with its own caching and scheduling algorithms) while treating the disk as a black box.
Many novel DRAM organizations have been proposed for 3D graphics applications (e.g., FBRAM). This project investigates whether an IMEM would be a useful architecture for 3D graphics. Do the typical components of a 3D graphics pipeline (e.g., RenderMan, OpenGL) decompose in a manner amenable to IMEM? Would IMEM be a good architecture for such codes? What performance could be expected? One approach to this project would be to take the Mesa graphics library (a free implementation of OpenGL) and try to decompose the code for use on an IMEM-style system.
A reconfigurable array embedded with DRAM creates an opportunity for interesting application-specific memory organizations; the BRASS group's embedded DRAM in the HSRA is one example. Explore the performance benefits for one or more selected applications. For example:
Per raw bit operation, the potential energy costs of a low-voltage FPGA and a low-power DSP or microprocessor are very similar. Once correlation between bits in a datapath is taken into account, the energy may vary considerably, perhaps by an order of magnitude.
In particular, a spatial (non-multiplexed) implementation on an FPGA will have a low activity rate when data is highly correlated. The heavy multiplexing and interleaving of operations on a processor will tend to destroy the natural correlation in the data, yielding a higher activity rate.
For some common kernels (perhaps starting with the filters and transforms common in signal/video processing), collect the data activity and estimate the actual energy consumed on a processor and on an FPGA implementation. The goal would be to understand the source of potential benefits for the reconfigurable architecture and quantify typical effects.
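One simple, concrete way to measure data activity is to count bit transitions (the Hamming distance) between consecutive data words flowing through a kernel, as in the C sketch below; the sample values are arbitrary stand-ins for a real kernel trace:

    /* Count bit toggles between consecutive data words.  Highly
     * correlated streams toggle few bits; interleaved, unrelated values
     * toggle many more.  Sample data is an arbitrary stand-in. */
    #include <stdio.h>

    static int popcount16(unsigned v)
    {
        int n = 0;
        for (; v; v >>= 1)
            n += v & 1;
        return n;
    }

    int main(void)
    {
        /* e.g., successive samples from a slowly varying signal */
        unsigned short samples[] = { 1000, 1002, 1001, 1005, 1003, 1008 };
        int nsamples = sizeof samples / sizeof samples[0];
        long toggles = 0;
        int i;

        for (i = 1; i < nsamples; i++)
            toggles += popcount16(samples[i - 1] ^ samples[i]);

        printf("average toggles per 16-bit word: %.2f (of 16)\n",
               (double)toggles / (nsamples - 1));
        return 0;
    }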
Andre' DeHon (amd@cs) would give advice on this project. It would likely involve:
Come up with an architecture which includes both execution units and monitoring units (these might be the same), and a scheme for exploiting the monitoring process. A complete solution here would be a PhD thesis, so you should consider small pieces of this. For instance:
Is there some set of instructions or hardware mechanisms that could be added to an architecture to make dynamic compilation easier, faster, or more efficient?
As mentioned a number of times in class, power dissipation is currently a major problem in architectures. Come up with some way to exploit the Introspective Computing concept to save power. What monitoring of execution would be appropriate? How would you alter the execution based on this monitoring to save power? This is pretty open-ended; you are practically guaranteed a good conference paper if you come up with something.
Figure out how an introspective computing architecture might be able to automatically extract parallelism from running code (loops?) and split this code out into an explicitly parallel version.
In class, we handed out Joel Emer's paper on using genetic algorithms to synthesize branch predictors, with impressive results. Genetic algorithms can probably be of use in other areas of computer architecture, such as data predictors and hardware prefetching. Figure out how to exploit genetic algorithms to design other aspects of hardware architectures.
Another thing that we discussed in class was the notion of "breaking the dataflow barrier" through data value prediction. Although various architectures have been proposed for (1) predicting values and (2) exploiting these predictions, there is a tremendous amount of room for improvement. Come up with a value prediction strategy and a proposed architecture for exploiting it. Figure out how to evaluate it using a simulation model (the SimpleScalar simulator from Wisconsin?).
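For calibration, a new scheme should at least beat the standard baseline: a last-value predictor with saturating confidence counters. A minimal C sketch, with a fabricated trace standing in for simulator output, is given below:

    /* Last-value predictor with 2-bit confidence counters; the trace is
     * fabricated, and in practice would come from a simulator. */
    #include <stdio.h>

    #define TABLE_SIZE 1024

    struct entry {
        unsigned long last;  /* last value produced by this (hashed) PC */
        int conf;            /* saturating 0..3 confidence counter      */
    };

    static struct entry table[TABLE_SIZE];

    /* Returns 1 if the predictor confidently predicted `value`,
     * then updates the table. */
    int predict_and_update(unsigned long pc, unsigned long value)
    {
        struct entry *e = &table[pc % TABLE_SIZE];
        int hit = (e->conf >= 2 && e->last == value);

        if (e->last == value) {
            if (e->conf < 3) e->conf++;
        } else {
            if (e->conf > 0) e->conf--; else e->last = value;
        }
        return hit;
    }

    int main(void)
    {
        /* fabricated value trace: one PC that mostly repeats a value */
        unsigned long vals[] = { 7, 7, 7, 7, 3, 7, 7, 7 };
        int hits = 0, n = sizeof vals / sizeof vals[0], i;

        for (i = 0; i < n; i++)
            hits += predict_and_update(0x400100UL, vals[i]);

        printf("correct confident predictions: %d/%d\n", hits, n);
        return 0;
    }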
The following suggestions are from John R. Mashey (mash@mash.engr.sgi.com), and were originally proposed for last year's 252 class:
1996's efforts on XSPEC (a prior CS252 project) were useful, but this is a never-ending task. The following are several possible projects related to benchmarking, each of which would make its own CS252 project.
I.e., along the same model as SPEC/XSPEC, take real programs and analyze them. Gathering and publishing data is always useful, although deeper analyses of fewer codes might be good.
Do a serious mathematical correlation analysis among the existing masses of SPEC95 & XSPEC data to understand the level of redundancy, with the goal of selecting well-correlated subsets and shrinking the number of benchmark cases, not growing them. [One of the implicit goals of SPEC, at least for some of us, was to make sure there was a reasonable body of results to analyze...]
Pick some product line with a number of members that differ in clock rate, memory system, and compiler version. Study how performance has changed with compiler versions, and how it varies with MHz, cache size, etc. Question: given the SPEC or XSPEC numbers for one family member, how well can you predict the numbers for other family members, given MHz, peak issue rate, cache size, memory latency, memory bandwidth, etc., for entire benchmark sets or individual members?
Like P7.2.1, but with the added complexity of different compilers, cache designs, etc. If one proposes a formula that predicts most of the variation, it is especially important to analyze the mis-predictions (at least for commercial benchmark fights :-).
WARNING: be careful with correlation analyses. People have sometimes reached somewhat-incorrect conclusions by using a set of data points dominated by large subsets of related systems. I.e., if many of the systems used Pentiums of various flavors, one might find that MHz alone accounted for much of the variation.
(Some of this was done in 1996; much more is needed.) Some benchmarks have reasonably scalable dataset sizes. In some cases performance is very sensitive to dataset size versus cache size, in others it is not, and sometimes it depends on how the code is compiled. (I.e., SPEC89's matrix300 was supposed to be a cache & memory thrasher, but cache-blocking compilers invalidated its usefulness for that, as they made it much less sensitive to cache size and memory performance.) Analyze the existing SPEC/XSPEC codes and see if there are any simple models that relate performance, cache size, dataset size, and some metric for the nature of the code. Try to categorize codes better: it is silly to make multiple runs of different sizes if little new information is provided.
This probably involves running more sizes of data on fewer benchmarks, to look for steep dropoffs in performance.
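The probe itself can be quite simple. The C sketch below times the same number of accesses over working sets of doubling size, so throughput drops should appear as the working set passes each cache level; a real study would substitute actual benchmark codes for the trivial summing loop:

    /* Size-sweep probe: same work at every size, so throughput dips
     * mark cache-capacity boundaries.  The summing loop is a trivial
     * stand-in for real benchmark code. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        enum { MAXWORDS = 1 << 22 };              /* 16 MB of ints */
        int *a = calloc(MAXWORDS, sizeof *a);
        long sum = 0, size;
        if (!a) return 1;

        for (size = 1 << 10; size <= MAXWORDS; size <<= 1) {
            long touches = 1L << 24, i;           /* same work every size */
            clock_t t0 = clock();
            for (i = 0; i < touches; i++)
                sum += a[i & (size - 1)];         /* wrap in working set */
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("working set %8ld KB: %6.1f M accesses/s\n",
                   size * (long)sizeof *a / 1024, touches / secs / 1e6);
        }
        free(a);
        return (int)sum;  /* keep sum live so the loop isn't deleted */
    }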
This is like P7.3, but emphasizing inter-machine comparisons.
[Note: in commercial system benchmarking, fixed-size benchmarks are notorious for allowing misleading/surprising comparisons, and vendor benchmarkers know this well.] If a benchmark does have dataset-vs-cache-size sensitivity, and there are two systems:
Of course, the SPEC benchmarks are all fixed-size, and sometimes prone to odd cases where the size just happens to clobber a system, but size/2 or 2*size would not. I've seen cases where one could make either A or B look faster than the other by adjusting dataset sizes.
Maybe CPU benchmarks should look more like system throughput benchmarks, i.e., shown as a curve. For benchmarks in which cache size versus data size matters, it would be good to compare machines by curves with vertical = performance, horizontal = variable data size, across an appropriate range of sizes, and then see if there are better, succinct metrics derivable from the mass of data. Somehow, one would like a metric that gives credit to performance at a range of sizes, not just one size.
Redo any of the analyses above and get away from individual performance numbers into means with error bars, or just 90% confidence limits, or anything that would bring more rationality to the presentation and analysis of these things. [It is incredible that people talk about SPEC#s with 3 digits of accuracy...]
I'd love to have a reasonably short, scalable, perhaps synthetic benchmark, that, with appropriate & understandable input parameters, could predict most of the variation in the various SPEC/XSPEC codes. (This is probably not for 1998, since it should build on some of the other suggestions above, not be invented out of whole cloth.)
Data-intense codes lend themselves to scaling tests; instruction-cache ones don't. Characterize the instruction cache behavior of the existing codes. Propose a parameterizable synthetic benchmark whose code size can be varied, and investigate its correlation with the existing programs.
As in P15.4, given the same size external cache, there are odd cases of differences among:
Also, between direct-mapped and set-associative caches: direct-mapped D-caches sometimes have terrible performance dips as array sizes are increased, due to cache collisions. (Customers get angry out of all proportion; i.e., system A (direct-mapped) may have better average performance than system B (set-associative), but B degrades smoothly rather than showing the occasional giant dips of A. People hate A.)
Study the existing benchmarks and analyze the performance results in light of either I-size or D-size being near sensitivity points of current system designs, or sensitivity points on associativity.
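The collision effect is easy to reproduce. In the C sketch below, two arrays placed a power of two apart map to the same sets of a direct-mapped cache and evict each other on every access; the stride and sizes are placeholders to be tuned to the cache under test (on a set-associative cache the same program runs without the dip):

    /* Direct-mapped dip demo: x[i] and y[i] are a power of two apart,
     * so in a direct-mapped cache they share a set and thrash.
     * STRIDE is a placeholder, assumed >= the cache size. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define STRIDE (1 << 20)

    int main(void)
    {
        char *block = calloc(2 * STRIDE, 1);  /* x and y STRIDE apart */
        char *x = block, *y = block + STRIDE;
        long sum = 0, r, i;
        if (!block) return 1;

        clock_t t0 = clock();
        for (r = 0; r < 1000; r++)
            for (i = 0; i < STRIDE; i += 64) {   /* one touch per line */
                sum += x[i];   /* x[i] and y[i] share a cache set ...  */
                sum += y[i];   /* ... so each access evicts the other  */
            }
        printf("time: %.2f s (sum %ld)\n",
               (double)(clock() - t0) / CLOCKS_PER_SEC, sum);
        free(block);
        return 0;
    }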
Larry McVoy's lmbench benchmark is a useful indicator of performance for small-object, latency-intense codes, i.e., where cache misses are frequent. Since this includes some behavior of DBMS, networking, and OS code, and such behavior is not necessarily well represented by SPEC, these numbers are of interest. On the other hand, the numbers are prone to over-interpretation: over-reliance on the part of the benchmark that measures memory latency alone would lead one to design machines with the simplest cache design (1-level, blocking loads, no fetchahead) and the simplest in-order execution, since the benchmark explicitly defeats the use of multiple outstanding cache misses and out-of-order execution, even though these features are believed by many to be useful.
Analyze lmbench results (of which many exist), compare them to SPEC/XSPEC results, and see which, if any, of the SPEC/XSPEC components are well predicted by lmbench's (a) memory latency or (b) latency of the largest cache.
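For reference, the heart of lmbench's memory-latency measurement is a pointer chase in which each load depends on the previous one, so only one miss can ever be outstanding; this is precisely why it rewards simple blocking-load memory systems. A stripped-down version in C:

    /* Pointer-chase latency loop: each load depends on the previous
     * one, so only one miss is ever outstanding.  Sizes and strides
     * are placeholders to be tuned to the machine under test. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 1 << 20 };            /* 8 MB of pointers */
        void **ring = malloc(N * sizeof *ring);
        long i, chases = 20 * 1000 * 1000;
        if (!ring) return 1;

        /* Link the cells into one big cycle; the odd stride visits
         * every cell and keeps successive loads on distant lines. */
        for (i = 0; i < N; i++)
            ring[i] = &ring[(i + 4097) % N];

        void **p = ring;
        clock_t t0 = clock();
        for (i = 0; i < chases; i++)
            p = (void **)*p;             /* serial load-to-load chain */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%.1f ns per dependent load (p=%p)\n",
               secs / chases * 1e9, (void *)p);
        free(ring);
        return 0;
    }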
The following suggestion is from Greg Pfister of IBM (pfister@us.ibm.com), and was originally proposed for last year's 252 class:
First: looking over the list of prior projects from last time, I noticed virtually nothing about I/O. You are using a text that doesn't focus there (nor do many others), but nevertheless, many topics from last time could be revisited in an I/O context: benchmarks of I/O, efficiency of I/O, etc. How good is the memory system at block-streaming multiple multimedia streams onto disk and/or a fast network? How about OS overhead for lots of little transactions? There certainly are imaginative ways it could be reduced, and proposing and measuring the results could be a good project.
The following was suggested by David Douglas (douglas@East.Sun.COM) and F. Balint (balintf@East.Sun.COM) from Sun Microsystems, for last year's CS252 class:
"I2O" (I-to-O) is the big new I/O architecture definition from WiNTel, attempting to push more processing off of the main CPU's and onto the I/O cards. Their question is: Is this any faster than the "smart" IO subsystems people have been building for a while? Will I2O open up new opportunities to move stuff off of the main CPU's, resulting in faster performance?
One additional issue not mentioned below, but within Berkeley's historic interests, is to look at I2O in light of RAID controllers.
The project goal is twofold: What are the factors impacting system performance? What IOP performance is needed to gain the most, and is there a linear relationship between IOP performance and system performance gain?
The focus should be on storage and on network performance.
For step #1, the recommendation is to compare the
The data to be looked at is I/Os per second for 2K r/w and 8K r/w. The system utilization should be compared (intelligent I/O is supposed to reduce PIO and interrupt rates, and thus CPU utilization).
The following was suggested by John Kubiatowicz (kubitron@cs.berkeley.edu):
In his PhD thesis, John Kubiatowicz explored the probability of deadlock in message-passing multiprocessors with direct network interfaces (e.g., the Alewife interface). See the description of "DeadSIM" in Chapter 6 of the thesis, available off the publications link on his homepage. This exploration was done with probabilistic simulation in mesh networks of varying dimensions. Expand this analysis to include: (1) networks with virtual channels, and (2) networks with automatic queueing to memory (such as the Wisconsin CNI interface). Figure out how to make the message traffic more realistic by incorporating actual multiprocessor traces. Will direct network interfaces with software deadlock recovery do well in large systems and under heavy load?