The IRAM project has several topics that need investigation. Since we are going to start building a 250,000,000-transistor microprocessor this summer, likely the most transistors in any MPU when it is finished, such projects would be timely contributions to what may prove to be a very influential effort.
P4: Characterize Architecture Metrics for Multiple Commercial Databases
About 40% of server sales are for database applications, yet little has been published comparing multiple databases on a single SMP. The idea is to use the built-in hardware performance tools of either the 4-way SPARC SMP or the 4-way Intel Pentium II SMP to recreate Kim Keeton's study across several commercial databases. The question is whether architectural support varies by database. Kim Keeton (kkeeton@cs) would be willing to help with this study.
P5: Explore energy implications of reconfigurable implementation of compute kernels.
This project is related to the Berkeley Reconfigurable Architectures, Systems and Software (BRASS) Research Project.
Per raw bit operation, the potential energy costs of a low-voltage FPGA and a low-power DSP or microprocessor are very similar. Once correlation between bits in a datapath is taken into account, the energy may vary considerably, perhaps by an order of magnitude.
In particular, a spatial (non-multiplexed) implementation on an FPGA will have a low activity rate when data is highly correlated. The heavy multiplexing and interleaving of operations on the processor will tend to destroy the natural correlation in the data, yielding a higher activity rate.
For some common kernels (maybe start with filters and transforms common in signal/video processing), collect the data activity and estimate the actual energy consumed by a processor implementation and an FPGA implementation. The goal would be to understand the source of potential benefits for the reconfigurable architecture and quantify typical effects.
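As a starting point, one could measure activity directly from sample data before worrying about detailed energy models. The following is a minimal C sketch, with a made-up 16-bit sample stream, that estimates the activity rate as the average fraction of datapath bits toggling between consecutive values; a real study would feed in the actual intermediate values of the kernel under consideration.

    /* Sketch: estimate datapath activity as the fraction of bits that toggle
     * between consecutive samples.  The sample stream and the 16-bit width
     * are assumptions for illustration only. */
    #include <stdio.h>
    #include <stdint.h>

    static int popcount16(uint16_t x) {
        int n = 0;
        while (x) { n += x & 1; x >>= 1; }
        return n;
    }

    int main(void) {
        /* hypothetical, highly correlated sample stream (e.g. a slow ramp) */
        uint16_t samples[] = { 1000, 1002, 1001, 1005, 1004, 1008, 1010, 1009 };
        int n = sizeof(samples) / sizeof(samples[0]);
        long toggles = 0;

        for (int i = 1; i < n; i++)
            toggles += popcount16(samples[i - 1] ^ samples[i]);

        /* activity = average fraction of the 16 datapath bits switching per sample */
        printf("activity rate = %.3f\n", (double)toggles / (16.0 * (n - 1)));
        return 0;
    }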
Andre' DeHon (amd@cs) would give advice on this project. It would likely involve:
P6: Very Large Scale Backup.
Edward K. Lee (eklee@pa.dec.com) of DEC SRC research labs would be willing to give advice on this project. He thinks, for example, that designing an efficient backup system where users can get useful work done after only a small part of the backup has been restored is an interesting problem.
Automatic, reliable backup for large-scale storage systems should be done at the device rather than the file level. This ensures that all data can be backed up uniformly and regularly without regard to the type of file system or database system the data represents. It also means that if you have very large files, you can back up just the changed portions of the file.
Some requirements:
John R. Mashey (mash@mash.engr.sgi.com) made three major suggestions for projects, each containing multiple CS252 projects:
1996's efforts on XSPEC (prior CS252 project) were useful, but this is a never-ending task...
P7.1 Continue to propose/analyze new benchmarks.
I.e., along the same model as SPEC/XSPEC, take real programs and analyze them. Gathering and publishing data is always useful, although deeper analyses of fewer codes might be good.
P7.2 No new data, more correlation [need to know some statistics]
Do a serious mathematical correlation analysis among the existing masses of SPEC95 & XSPEC data to understand the level of redundancy, with the goal of selecting well-correlated subsets and shrinking the number of benchmark cases, not growing them. [One of the implicit goals of SPEC, at least for some of us, was to make sure there was a reasonable body of results to analyze...]
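A minimal sketch of the basic ingredient, assuming the published results have already been extracted into arrays: a Pearson correlation between two benchmarks' score vectors across the same set of machines. The sample numbers are invented; a real study would compute the full correlation matrix across all SPEC/XSPEC components and hunt for near-redundant pairs.

    /* Sketch: Pearson correlation between two benchmarks' scores across the
     * same machines, a building block for redundancy analysis.  The numbers
     * are made up for illustration. */
    #include <stdio.h>
    #include <math.h>

    static double pearson(const double *x, const double *y, int n) {
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];        sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;
        double vx  = sxx - sx * sx / n;
        double vy  = syy - sy * sy / n;
        return cov / sqrt(vx * vy);
    }

    int main(void) {
        /* hypothetical SPECratios of two benchmarks across six machines */
        double bench_a[] = { 8.1, 9.0, 12.3, 15.2, 17.8, 21.0 };
        double bench_b[] = { 7.9, 9.4, 11.8, 15.9, 16.5, 22.1 };
        printf("r = %.3f\n", pearson(bench_a, bench_b, 6));
        return 0;
    }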
Pick some product line with a number of members with different clock rates, memory systems, and compiler versions. Study how performance has changed with compiler versions, and how it varies by MHz, cache size, etc. Question: given the SPEC or XSPEC numbers for one family member, how well can you predict the numbers for other family members, given MHz, peak issue rate, cache size, memory latency, memory bandwidth, etc., for entire benchmark sets or individual members?
Like P7.2.1, but with added complexity of different compilers, cache designs, etc. If one proposes a formula that predicts most of the variation, it is especially important to analyze the mis-predictions (at least, for commercial benchmark fights :-).
WARNING: be careful with correlation analyses. Sometimes people have reached somewhat-incorrect conclusions by using a set of data points dominated by large subsets of related systems. I.e., if many of the systems used Pentiums of various flavors, one might find that MHz alone accounted for much of the variation.
P7.3 Performance versus data cache size & dataset size.
(Some of this was done in 1996; much more is needed.) Some benchmarks have reasonably scalable dataset sizes. In some cases performance is very sensitive to dataset size versus cache size, in others it is not, and sometimes it depends on how the code is compiled. (I.e., SPEC89's matrix300 was supposed to be a cache & memory thrasher, but cache-blocking compilers invalidated its usefulness for that, as they made it much less sensitive to cache size and memory performance.) Analyze the existing SPEC/XSPEC codes and see if there are any simple models that relate performance, cache size, dataset size, and some metric for the nature of the code. Try to categorize codes better: it is silly to make multiple runs of different sizes if little new information is provided.
This probably involves running more sizes of data on fewer benchmarks, to look for steep dropoffs in performance.
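One way to look for those drop-offs is sketched below, under assumptions (a 64-byte line, strided reads of an array whose size doubles up to 8 MB): time a trivial kernel at each dataset size, report nanoseconds per access, and then plot the resulting curve and mark where it steps up.

    /* Sketch: sweep the dataset size of a simple kernel and report time per
     * access, to look for the steep drop-offs where the working set falls
     * out of each cache level.  Sizes, stride, and work per size are
     * arbitrary choices for illustration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define MAX_BYTES (8 * 1024 * 1024)
    #define STRIDE    64                  /* assumed cache-line size in bytes */

    int main(void) {
        char *buf = malloc(MAX_BYTES);
        memset(buf, 1, MAX_BYTES);        /* touch every page up front */
        volatile char *p = buf;           /* keep the reads from being optimized away */
        long sink = 0;

        for (long size = 4 * 1024; size <= MAX_BYTES; size *= 2) {
            long passes = (32L * 1024 * 1024) / (size / STRIDE);  /* ~constant work per size */
            long accesses = 0;
            clock_t t0 = clock();
            for (long pass = 0; pass < passes; pass++)
                for (long i = 0; i < size; i += STRIDE) {
                    sink += p[i];
                    accesses++;
                }
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("%8ld bytes: %6.1f ns/access\n", size, 1e9 * secs / accesses);
        }
        free(buf);
        return (int)(sink & 1);           /* use sink so the loops cannot be discarded */
    }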
P7.4 Performance, data cache size, dataset size across machine types.
This is like P7.3, but emphasizing inter-machine comparisons.
[Note: in commercial system benchmarking, fixed-size benchmarks are notorious for allowing misleading/surprising comparisons, and vendor benchmarkers know this well.] If a benchmark does have dataset vs. cache-size sensitivity, and there are two systems:
Of course, the SPEC benchmarks are all fixed-size, and sometimes prone to odd cases where the size just happens to clobber a system, but size/2 or 2*size would not. I've seen cases where one could make either A or B look substantially faster than the other by adjusting dataset sizes.
Maybe CPU benchmarks should look more like system throughput benchmarks, i.e., shown as a curve. For benchmarks in which cache size versus data size matters, it would be good to compare machines by curves with vertical = performance, horizontal = variable data size, across an appropriate range of sizes, and then see if there are better, succinct metrics derivable from the mass of data. Somehow, one would like a metric that gives credit to performance at a range of sizes, not just one size.
P7.5 Error bars
Redo any of the analyses above and get away from individual performance numbers into means with error bars, or just 90% confidence limits, or anything that would bring more rationality to the presentation and analysis of these things. [It is incredible that people talk about SPEC#s with 3 digits of accuracy...]
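A minimal sketch of the reporting style being asked for, using invented run times: compute the mean of repeated runs and a rough 90% confidence interval. The normal-approximation multiplier (1.645) is an assumption for simplicity; with only a handful of runs a Student-t multiplier would be more defensible.

    /* Sketch: report a mean and a rough 90% confidence interval over
     * repeated runs instead of a single 3-digit number.  Run times are
     * invented; the z = 1.645 normal approximation is an assumption. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* hypothetical elapsed times (seconds) for ten runs of one benchmark */
        double runs[] = { 41.2, 40.8, 42.5, 41.0, 43.1, 40.9, 41.7, 42.0, 41.4, 41.1 };
        int n = sizeof(runs) / sizeof(runs[0]);

        double sum = 0, sumsq = 0;
        for (int i = 0; i < n; i++) { sum += runs[i]; sumsq += runs[i] * runs[i]; }

        double mean = sum / n;
        double var  = (sumsq - n * mean * mean) / (n - 1);   /* sample variance */
        double half = 1.645 * sqrt(var / n);                 /* 90% CI half-width */

        printf("%.2f +/- %.2f s (90%% CI, n=%d)\n", mean, half, n);
        return 0;
    }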
P7.6 Synthetic benchmark (data)
I'd love to have a reasonably short, scalable, perhaps synthetic benchmark that, with appropriate & understandable input parameters, could predict most of the variation in the various SPEC/XSPEC codes. (This is probably not for 1998, since it should build on some of the other suggestions above, not be invented out of whole cloth.)
P7.7 Synthetic benchmark (instruction)
Data-intense codes lend themselves to scaling tests; instruction-cache ones don't. Characterize the instruction cache behavior of the existing codes. Propose a parameterizable synthetic benchmark whose code size can be varied, and investigate its correlation with the existing programs.
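A toy sketch of the parameterization idea (not a realistic benchmark): generate many small functions with a macro and call a configurable number of distinct ones in a random order, so the active instruction footprint scales with a command-line knob. To stress real I-caches each function would need to be padded to hundreds of bytes of straight-line code; everything here is illustrative.

    /* Sketch of a code-footprint knob: 16 macro-generated functions, of
     * which the first `footprint` are called in random order.  Real versions
     * would use many more, much larger functions. */
    #include <stdio.h>
    #include <stdlib.h>

    #define F(n) static unsigned long f##n(unsigned long x) { return x * (n + 3u) + (n ^ 7u); }
    F(0) F(1) F(2)  F(3)  F(4)  F(5)  F(6)  F(7)
    F(8) F(9) F(10) F(11) F(12) F(13) F(14) F(15)

    typedef unsigned long (*fn)(unsigned long);
    static fn funcs[16] = { f0, f1, f2,  f3,  f4,  f5,  f6,  f7,
                            f8, f9, f10, f11, f12, f13, f14, f15 };

    int main(int argc, char **argv) {
        int footprint = (argc > 1) ? atoi(argv[1]) : 16;   /* # of distinct functions used */
        if (footprint < 1 || footprint > 16) footprint = 16;

        unsigned long x = 1;
        for (long i = 0; i < 10000000; i++)
            x = funcs[rand() % footprint](x);              /* jump around the I-footprint */

        printf("%lu\n", x);                                /* keep the work live */
        return 0;
    }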
As in P7.4, given the same size external cache, there are odd cases of differences among:
- 1-level, split I & D caches
- 2-level, joint L2 caches
- 3-level (as in Alphas)
(or in earlier days, between a split I&D or joint 1-level cache).
Also, between direct-mapped and set-associative caches: direct-mapped D-caches sometimes have terrible performance dips as array sizes are increased, due to cache collisions. (Customers get angry out of all proportion: system A (direct-mapped) may have better average performance than system B (set-associative), but B degrades smoothly rather than showing the occasional giant dips of A. People hate A.)
Study the existing benchmarks and analyze the performance results in light of either I-size or D-size being near sensitivity points of current system designs, or sensitivity points on associativity.
P7.8 Latency
Larry McVoy's lmbench benchmark is a useful indicator of performance for small-object-latency-intense codes, i.e., where cache misses are frequent. Since this includes some behavior of DBMS, networking, and OS code, and such behavior is not necessarily well represented by SPEC, these numbers are of interest. On the other hand, the numbers are prone to over-interpretation: over-reliance on the part of the benchmark that measures memory latency alone would lead one to design machines with the simplest cache design (1-level, blocking loads, no fetchahead) and the simplest in-order execution, since the benchmark explicitly defeats the use of multiple outstanding cache misses and out-of-order execution, even though these features are believed by many to be useful.
Analyze lmbench results (of which many exist), compare them to SPEC/XSPEC results, and see which, if any, of the SPEC/XSPEC components are well predicted by lmbench's (a) memory latency or (b) latency of the largest cache.
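For reference, the style of measurement looks roughly like the sketch below (an assumed reconstruction, not lmbench's actual code; buffer size and iteration count are arbitrary): build a random cycle of pointers through a buffer and chase it, so every load depends on the previous one and overlap of misses is defeated by construction.

    /* Sketch of a dependent-load latency loop: chase a random cycle of
     * pointers so each load must complete before the next can issue. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 20)   /* 1M pointers: large enough to miss in most caches */

    int main(void) {
        void **ring = malloc(N * sizeof(void *));
        size_t *perm = malloc(N * sizeof(size_t));

        /* random permutation of 0..N-1 (Fisher-Yates) */
        for (size_t i = 0; i < N; i++) perm[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
        }
        /* link the slots into one cycle in permuted order */
        for (size_t i = 0; i < N; i++)
            ring[perm[i]] = &ring[perm[(i + 1) % N]];

        void **p = &ring[perm[0]];
        long iters = 20 * N;
        clock_t t0 = clock();
        for (long i = 0; i < iters; i++)
            p = (void **)*p;                /* serialized chain of loads */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("%.1f ns per dependent load (%p)\n", 1e9 * secs / iters, (void *)p);
        free(ring); free(perm);
        return 0;
    }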
BACKGROUND
[At least once a quarter, I stop by the school down the road from my house and check out whatever they are using for "Algorithms and Data Structures" or the equivalent [for UCB: CS61B especially]. I complain to Hennessy: "These courses are out of date; they seldom even mention cache memory, much less include it in algorithm analysis; they fit CPUs of 10-20 years ago, but not current ones; students ought to be sensitized to the issues a bit earlier, instead of hearing about theoretical machines in the data structures course, and then real machines (but without reference to the data structures) ... rant, rave." Hennessy says "Well, Knuth is rewriting Volume 3" ... yes, good, sometime, but we have a problem, and it is in fact one of the most commonly asked questions by practicing programmers: what should we do to our data structures to cope with the new CPUs?]
The problem: Knuth Volume 3, and many other textbooks, effectively act like CPUs are stuck in the early-1980s microprocessor days:
But now we have multi-level cache hierarchies, and it is quite possible for a cache miss to cost 400 or more (peak) instruction issues, and this is going to get worse, not better, for a while.
But with caches, spatial locality can be very important, especially with out-of-order, superscalar machines, which can take good advantage of multiple elements in a cache line. For example, if one hashes into a large table, think hard about hashing to cache lines and then searching each cache line, rather than rehashing on each collision. You have paid many cycles to obtain the cache line... Also, in real life, sometimes people spend a lot of time aligning data structures on cache line boundaries...
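A minimal sketch of that idea in C, with made-up sizes: each bucket is sized to one assumed 64-byte line and holds several keys, so a probe costs roughly one likely cache miss followed by a short in-line search, instead of a chain of rehashes that each risk another miss. Forcing the table onto line boundaries is compiler-specific and left out.

    /* Sketch of "hash to a cache line, then search within it".  Bucket size
     * (15 keys + a count = 64 bytes) assumes a 64-byte line; overflow
     * handling is omitted. */
    #include <stdio.h>
    #include <stdint.h>

    #define NBUCKETS      4096
    #define KEYS_PER_LINE 15

    struct bucket {
        uint32_t count;
        uint32_t key[KEYS_PER_LINE];
    };  /* a real version would also align the table to line boundaries */

    static struct bucket table[NBUCKETS];

    static uint32_t hash(uint32_t k) { return (k * 2654435761u) % NBUCKETS; }

    int insert(uint32_t k) {
        struct bucket *b = &table[hash(k)];
        if (b->count == KEYS_PER_LINE) return 0;     /* bucket full */
        b->key[b->count++] = k;
        return 1;
    }

    int lookup(uint32_t k) {
        struct bucket *b = &table[hash(k)];          /* one cache line fetched... */
        for (uint32_t i = 0; i < b->count; i++)      /* ...then searched cheaply */
            if (b->key[i] == k) return 1;
        return 0;
    }

    int main(void) {
        insert(12345); insert(99999);
        printf("%d %d %d\n", lookup(12345), lookup(99999), lookup(4242));
        return 0;
    }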
Two algorithms may theoretically have similar computational complexity, but one may run much faster in practice, for many reasons. Cache temporal locality can vary dramatically between otherwise similar-looking algorithms, as for example, in sorting, where people spend serious effort optimizing sorts around the use of caches, making internal sorts act somewhat more like external sorts.
Scientific programming, of course, is well familiar with these things, as evidenced by the widespread use of cache-blocking compilers. Unfortunately, analyses for lists, trees, searching, sorting, etc., seem not to have progressed as quickly, although I may have missed something, as I only look at this every once in a while. I think there is room for a lot of publishable papers to update the literature, and one of these days there should be a new textbook, or at least a section called "classic data structures versus modern machines". Of course, if everything turns into JAVA, maybe this won't matter :-)
P8.1 Classical algorithm analysis meets modern machines
Go back to your CS61B (or equivalent) textbook/notes.
Pick a scalable algorithm that can be scaled large enough to stress cache memories well. Run it to produce a plot of elapsed time (vertical) vs. problem size (horizontal), compare the shape of the curve with the theoretical analysis, and analyze any surprises. Do this across several systems with different memory hierarchies. If the systems you use have machine-level profiling tools (i.e., equivalent to those in the MIPS R10000-based systems, which can easily be made to tell you about cache misses, TLB misses, etc.; I think DEC has similar things, and I'm not sure how accessible they are elsewhere), use them to explain the curves. Suggest cache-related improvements to the algorithm and see if they matter. Searching, sorting, and memory allocation problems are all relevant, and some people at UCB study parallel versions of some of these ... but it would be nice just to accumulate better analysis for the simplest uniprocessor cases.
(I'm mostly ignoring MMU issues in favor of cache ones, but the latter can pop up as well.)
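A minimal starting point under assumptions (random binary searches over a sorted array of longs, sizes doubling up to 4M elements): time the textbook algorithm across problem sizes and emit a CSV for plotting. The O(log N) model predicts a gentle curve; caches and the TLB usually add visible steps worth explaining.

    /* Sketch for P8.1: time random binary searches across problem sizes and
     * print a CSV of nanoseconds per search vs. N.  Sizes, query counts, and
     * the LCG for random keys are arbitrary choices. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static long bsearch_idx(const long *a, long n, long key) {
        long lo = 0, hi = n - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            if (a[mid] == key) return mid;
            if (a[mid] < key) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    int main(void) {
        const long max_n = 1L << 22;                /* up to 4M sorted longs */
        long *a = malloc(max_n * sizeof(long));
        for (long i = 0; i < max_n; i++) a[i] = i;

        unsigned long seed = 1;
        printf("n,ns_per_search\n");
        for (long n = 1L << 10; n <= max_n; n <<= 1) {
            const long searches = 1L << 20;
            long found = 0;
            clock_t t0 = clock();
            for (long i = 0; i < searches; i++) {
                seed = seed * 1103515245UL + 12345UL;            /* simple LCG */
                found += bsearch_idx(a, n, (long)((seed >> 8) % (unsigned long)n)) >= 0;
            }
            double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("%ld,%.1f\n", n, 1e9 * secs / searches);
            if (found < 0) return 1;     /* use found so the searches are not optimized away */
        }
        free(a);
        return 0;
    }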
P8.2 Algorithm 1 meets algorithm 2, in presence of cache
As in P8.1, but compare two different algorithms that accomplish the same task, e.g., different sort programs. Analyze cache size/nature sensitivities, and any other surprises. Can the source code be parameterized (with cache sizes, latencies) to give "good" performance across a wide range of machines? Can a binary have the same property and still be clean code? (I.e., how does one write such algorithms so they perform "well" on machines in the same family with 256K -> 8MB caches?) How complex does a machine model need to be to get good results in parameterizing such codes? Do this on one system, or better, on several.
P8.3 Evil pointers, linked lists, out-of-order machines
Following a chain of pointers, where each pointer is a cache miss, is one of the worst possible things to make cached machines do, since even a speculative-execution machine with multiple outstanding cache misses (MIPS R10000, HP PA8000, Intel Pentium Pro, Pentium II) gets slowed to DRAM speed in following such a pointer chain. Such CPUs can sometimes unroll tight iterative loops and get the resulting cache misses going faster ... but only if, for example, a linear list is restructured as pointers to vectors of pointers, over which the hardware can iterate.
Look at algorithms that use lists, or binary trees, and see if converting them to N-ary structures helps on machines with o-o-o execution, or not, especially as compared with those CPUs that block on first use of data.
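The restructuring being described looks roughly like the sketch below (the data layouts and the fanout of 8 are assumptions): the same elements kept either as a plain singly linked list, where each element costs one dependent miss, or as an "unrolled" list whose nodes each hold a small vector of element pointers that an out-of-order CPU can fetch in parallel.

    /* Sketch of the P8.3 restructuring: linked list vs. "unrolled" list of
     * pointer vectors.  A real experiment would use millions of elements
     * allocated in a deliberately scattered order. */
    #include <stdio.h>
    #include <stdlib.h>

    struct elem { long payload; };

    /* classic list: the next pointer must be loaded before the next element */
    struct lnode { struct elem *e; struct lnode *next; };

    /* unrolled list: FANOUT independent element pointers per node */
    #define FANOUT 8
    struct unode { int count; struct elem *e[FANOUT]; struct unode *next; };

    static long sum_list(struct lnode *p) {
        long s = 0;
        for (; p; p = p->next)
            s += p->e->payload;            /* serialized: miss, then miss, then miss... */
        return s;
    }

    static long sum_unrolled(struct unode *p) {
        long s = 0;
        for (; p; p = p->next)
            for (int i = 0; i < p->count; i++)
                s += p->e[i]->payload;     /* up to FANOUT independent misses can overlap */
        return s;
    }

    int main(void) {
        /* build the same three elements both ways */
        struct elem *e[3];
        struct lnode ln[3];
        struct unode un = { 3, { NULL }, NULL };
        for (int i = 0; i < 3; i++) {
            e[i] = malloc(sizeof(struct elem));
            e[i]->payload = i + 1;
            ln[i].e = e[i];
            ln[i].next = (i < 2) ? &ln[i + 1] : NULL;
            un.e[i] = e[i];
        }
        printf("%ld %ld\n", sum_list(&ln[0]), sum_unrolled(&un));
        return 0;
    }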
P8.4 Algorithms, languages, object-oriented stuff
Chasing pointers is bad enough; the worst thing for a modern CPU is doing an indirect call to a function pointer located in a structure that has just been accessed via a pointer ... the kind of thing some object-oriented programming styles do constantly. Even a deeply speculative CPU cannot get beyond that indirect call. Study the I-cache miss behavior of a program coded in this style: perhaps this is not such a bad problem (if the call targets are a small enough set there is a good cache-hit rate), maybe it is. If it is, suggest a hardware solution that makes indirect calls as fast as explicit ones. (50 points, or just send me the solution :-)
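For concreteness, the access pattern in question looks like the sketch below (the struct and operations are made up): load an object pointer, load a function pointer out of the object, then make an indirect call through it, so the call target is not known until two dependent loads have completed.

    /* Sketch of the P8.4 pattern: pointer -> function pointer -> indirect call. */
    #include <stdio.h>
    #include <stdlib.h>

    struct object {
        long state;
        long (*method)(struct object *self);   /* "virtual" operation */
    };

    static long op_double(struct object *o) { return o->state * 2; }
    static long op_negate(struct object *o) { return -o->state; }

    int main(void) {
        int n = 100000;
        struct object **objs = malloc(n * sizeof(*objs));
        for (int i = 0; i < n; i++) {
            objs[i] = malloc(sizeof(struct object));
            objs[i]->state = i;
            objs[i]->method = (i & 1) ? op_double : op_negate;
        }

        long sum = 0;
        for (int i = 0; i < n; i++)
            /* load the object pointer, load the function pointer, call indirectly */
            sum += objs[i]->method(objs[i]);

        printf("%ld\n", sum);
        return 0;
    }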
Some famous people have said that I/O is the orphan of computer architecture, and it is ... but many real-life problems depend a lot more on I/O performance than on anything else. Even worse, many of them are very dependent on latency (rather than bandwidth), and latency is not improving as fast as anyone would like. Unfortunately, it is harder to think up good I/O problems that are doable in a few weeks. I'll suggest the #1 "InfraStress" (InfraStructure Stress) issue I know of coming up related to I/O, and maybe some problems will fall out.
Disks are getting HUGE, fast, and algorithms are breaking (out) all over
Disk density, and thus capacity per cubic inch, is going up dramatically already, and is about to go up even faster (i.e., due to IBM GMR disks, and TeraStor/Quinta laser-enhanced magnetics): if you believe what TeraStor says, one would expect to see 500GB 3.5" drives by 2003, so that one could get a Terabyte on the desktop ... but unfortunately, one would guess that a normal 500GB disk would take 10,000 seconds to read ... not good.
Unfortunately, disk densities are going up faster than bandwidth, and latency is only dropping very slowly. Such changes in ratios tend to break old limits (which are at least obvious), and more insidiously, existing algorithms tend to have performance breakage in the face of large changes. The coming disks are likely to cause trouble for most current filesystems, and there will need to be some serious rearchitecting.
Any problems here are likely to need to build on work/environments that are already around UCB, and I'm not sure offhand what there is. Realistically, one would want some environment that would let students do file system evaluations/analysis at user level, and study key algorithms without having to invent everything.
Greg Pfister of IBM (pfister@us.ibm.com) suggests projects P10 and P11:
First: Looking over the list of prior projects from last time, I noticed virtually nothing about I/O. You are using a text that doesn't focus there (nor do many others), but nevertheless -- many topics from last time could be revisited in an I/O context: benchmarks of I/O, efficiency of I/O, etc. How good is the memory system at block streaming multiple multimedia streams onto disk and/or a fast network? How about OS overhead for lots of little transactions -- there certainly are imaginative ways it could be reduced, and proposing/measuring the results could be a good project.
Second, how about an application comparison of a low-overhead messaging scheme (like VIA) versus a shared-memory implementation using CC-NUMA? Of course it should be based on measurement, so maybe SMP measurement plus some analytical modelling substitutes for CC-NUMA and some analytical work.
Jim Gray of Microsoft (Gray@Microsoft.com) also thought last time there was not much emphasis on IO or the memory bus. An IO/Networking project:
David Douglas (douglas@East.Sun.COM) and F. Balint (balintf@East.Sun.COM) from Sun Microsystems made a suggestion regarding a new I/O standard. "I2O" (I-to-O) is the big new I/O architecture definition from WiNTel, attempting to push more processing off of the main CPUs and onto the I/O cards. Their question is: Is this any faster than the "smart" IO subsystems people have been building for a while? Will I2O open up new opportunities to move work off of the main CPUs, resulting in faster performance?
One additional issue not mentioned below, but within Berkeley's historic interests, is to look at I2O in light of RAID controllers.
The project goal is twofold:
What are the factors impacting system performance? What IOP performance is needed to gain the most? Is there a linear relationship between IOP performance and system performance gain?
The focus should be on storage and on network performance.
For step #1, the recommendation is to compare the
The data to be looked at is IOs per second for 2K r/w and 8K r/w. The system utilization should be compared (intelligent IO is supposed to reduce PIO and interrupt rate, and thus CPU utilization).
Christoforos Kozyrakis (kozyraki@ikaros.CS.Berkeley.EDU) had suggestions based on using the embedded processor board for StrongARM SA110 that we have in the lab, plus a few miscellaneous ideas.
Run SPECint95 and other micro-benchmarks (lmbench, etc.) on an embedded processor (recommended: the SA110 that we have available). Measure performance; compare with results for other microprocessors.
Use BDTI benchmarks or other DSP kernels to compare/analyze the performance of an embedded processor (SA110) and a desktop processor (Sparc/Pentium).
It is believed that scatter/gather support on vector processors is crucial for high performance. Take a small number of vector algorithms that are usually written using scatter/gather. Change them (i.e., use a different algorithm to solve the same problem) to use strided accesses. Quantify the difference in performance using the T0 simulation tools.
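For illustration, the contrast in the inner loops looks like the sketch below (the reduction kernel and data layout are made up): the gather form needs indexed loads through an index vector, while the strided rewrite reaches the same elements with a fixed stride and no indexed addressing.

    /* Sketch: the same reduction written with gather-style indexed accesses
     * and with strided accesses.  The kernel is an invented example. */
    #include <stdio.h>

    #define N 1024

    /* gather: elements picked out through an index vector */
    double sum_gather(const double *x, const int *idx, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[idx[i]];
        return s;
    }

    /* strided: the same elements, laid out (or re-sorted) so a fixed stride
     * reaches them without indexed addressing */
    double sum_strided(const double *x, int stride, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i * stride];
        return s;
    }

    int main(void) {
        static double x[4 * N];
        static int idx[N];
        for (int i = 0; i < N; i++) { x[4 * i] = 1.0; idx[i] = 4 * i; }
        printf("%f %f\n", sum_gather(x, idx, N), sum_strided(x, 4, N));
        return 0;
    }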
Previous work has indicated the existence of a large amount of locality in texture accesses in high-end 3D machines. Still, the industrial experience with 3D accelerators for personal computers has been that texture accesses in 3D games and other PC applications do not cache well.
Collect traces from 3D applications, games or benchmarks and analyze the locality (temporal and spatial).
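A minimal trace-driven sketch of the analysis, assuming a trace format of one hex texel address per line on stdin: run the addresses through a small direct-mapped cache model and report the hit rate, then sweep line size and cache size (and add associativity or LRU) to separate spatial from temporal locality.

    /* Sketch: direct-mapped cache model over an address trace read from
     * stdin.  The trace format, line size, and cache size are assumptions
     * to be swept in a real study. */
    #include <stdio.h>

    #define LINE_BYTES 32
    #define NLINES     2048            /* 64 KB direct-mapped cache, as an example */

    int main(void) {
        static unsigned long tags[NLINES];
        static int valid[NLINES];
        unsigned long addr, refs = 0, hits = 0;

        while (scanf("%lx", &addr) == 1) {          /* one hex address per line */
            unsigned long line = addr / LINE_BYTES;
            unsigned long set  = line % NLINES;
            refs++;
            if (valid[set] && tags[set] == line)
                hits++;
            else { valid[set] = 1; tags[set] = line; }
        }
        if (refs)
            printf("%lu refs, hit rate %.1f%%\n", refs, 100.0 * hits / refs);
        return 0;
    }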
At ISCA97, Joe Emer presented how genetic algorithms can be used to describe and synthesize branch predictors, with impressive results. Genetic algorithms can probably be of use in other areas of computer architecture: data predictors, hardware prefetching. Exploit the potential of creating predictors by using genetic algorithms. (Kubi may have more specific ideas here.)
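As a toy illustration of the approach (everything here is an assumption: the synthetic branch trace, the gshare-style predictor being tuned, the genome of (history length, log2 table size), and the GA constants), a small elitist genetic search over predictor parameters might look like this; a real project would evolve the predictor structure itself against real traces.

    /* Sketch: elitist genetic search over (history bits, log2 table size)
     * for a simple 2-bit gshare-style predictor on a synthetic trace. */
    #include <stdio.h>
    #include <stdlib.h>

    #define TRACE_LEN 20000
    #define POP 16
    #define GENS 30

    static int trace_pc[TRACE_LEN], trace_taken[TRACE_LEN];

    /* fitness: fraction of branches predicted correctly */
    static double fitness(int hist_bits, int log_size) {
        int size = 1 << log_size;
        unsigned char *ctr = calloc(size, 1);
        unsigned hist = 0, correct = 0;
        for (int i = 0; i < TRACE_LEN; i++) {
            unsigned idx = ((unsigned)trace_pc[i] ^ (hist & ((1u << hist_bits) - 1))) % size;
            int pred = ctr[idx] >= 2;
            if (pred == trace_taken[i]) correct++;
            if (trace_taken[i]) { if (ctr[idx] < 3) ctr[idx]++; }
            else                { if (ctr[idx] > 0) ctr[idx]--; }
            hist = (hist << 1) | trace_taken[i];
        }
        free(ctr);
        return (double)correct / TRACE_LEN;
    }

    int main(void) {
        /* synthetic trace: a few branches with biased, history-dependent behavior */
        for (int i = 0; i < TRACE_LEN; i++) {
            trace_pc[i] = rand() % 64;
            trace_taken[i] = (i % (trace_pc[i] + 2)) != 0;
        }

        int hist[POP], bits[POP];
        for (int i = 0; i < POP; i++) { hist[i] = rand() % 16; bits[i] = 4 + rand() % 10; }

        for (int g = 0; g < GENS; g++) {
            /* keep the best individual, refill the rest by crossover and mutation */
            int best = 0;
            for (int i = 1; i < POP; i++)
                if (fitness(hist[i], bits[i]) > fitness(hist[best], bits[best])) best = i;
            for (int i = 0; i < POP; i++) {
                if (i == best) continue;
                int mate = rand() % POP;
                hist[i] = (rand() & 1) ? hist[best] : hist[mate];   /* crossover */
                bits[i] = (rand() & 1) ? bits[best] : bits[mate];
                if (rand() % 4 == 0) hist[i] = rand() % 16;         /* mutation */
                if (rand() % 4 == 0) bits[i] = 4 + rand() % 10;
            }
            printf("gen %2d: best accuracy %.3f (hist=%d, 2^%d entries)\n",
                   g, fitness(hist[best], bits[best]), hist[best], bits[best]);
        }
        return 0;
    }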
List of CS252 projects from last time (look under Course Projects)
Back to CS252 home page...