CS 252 PROJECT SUGGESTIONS FALL 1994

P1. RISC enhancements for 80x86 emulation.

WABI and similar programs simulate an instruction set such as the 80x86 accurately enough that many application programs run correctly, albeit slowly. Assume that WABI is the single most important program running on {Sun, DEC, SGI, PowerPC} workstations today. What 5 instructions would you add to the native instruction set to speed up x86 code running under WABI? What would the speedup be? How would it compare to Intel offerings? (Suggested by Dave Douglas of Thinking Machines)
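A first cut at "what would the speedup be?" is an Amdahl's Law estimate from profile data. The sketch below is only illustrative: the 40% fraction and the 4x local speedup are hypothetical placeholders, not WABI measurements.

```python
# Back-of-the-envelope Amdahl's Law estimate of what a new
# emulation-assist instruction buys.  The fraction and local speedup
# below are hypothetical placeholders, not WABI measurements.

def overall_speedup(fraction_accelerated, local_speedup):
    """Amdahl's Law: overall speedup when only fraction_accelerated
    of execution time is sped up by local_speedup."""
    return 1.0 / ((1.0 - fraction_accelerated)
                  + fraction_accelerated / local_speedup)

# E.g. if profiling showed 40% of WABI's time computing x86 condition
# codes, and one new native instruction made that part 4x faster:
estimated = overall_speedup(0.40, 4.0)   # about 1.43x overall
```

Repeating this for each of the 5 candidate instructions shows quickly which ones are worth silicon.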

P2. TCP/IP accelerator.

Starting with a SPARCstation 10 or SPARCstation 20 system design, how fast could you make TCP/IP go? Assume that you can add instructions to SPARC, add a coprocessor if necessary, add logic to the I/O controller, and so on. Be sure to understand why no one has implemented Van Jacobson's proposals for very fast TCP/IP. (Suggested by Dave Douglas of Thinking Machines)
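One classic candidate for acceleration is the Internet checksum, which touches every byte of every packet. A reference model (after RFC 1071's one's-complement sum) makes clear how little work per byte there is, and therefore why the win from an accelerator depends on removing the memory traffic, not the arithmetic:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by IP/TCP/UDP (RFC 1071).
    This per-word loop is exactly the kind of work a TCP/IP
    accelerator (or a new checksum instruction) would try to remove
    from the critical path."""
    if len(data) % 2:
        data += b"\x00"                  # pad odd-length data
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return ~total & 0xFFFF
```

A receiver can verify a packet by checksumming data plus the transmitted checksum; the result is 0 when the packet is intact.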

P3. Truth in SPECmarks I.

Using the available compilers on each workstation, compile and run the SPEC benchmarks, comparing your results to those published in the SPEC newsletter. Then ask:

What is the SPECmanship ratio between published results and each of these measurements? What would be necessary to actually reproduce the indicated results at your site (e.g. new compilers, more memory, larger disk) and how much would that cost? (Suggested by Stephen Richardson of Sun Microsystems)
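Since SPEC summarizes a suite with the geometric mean of per-benchmark ratios, the SPECmanship ratio falls out of a few lines. The numbers below are made-up placeholders, not real published or measured SPEC results:

```python
# SPEC-style summary: each benchmark's ratio to the reference machine,
# combined with a geometric mean.  All numbers are hypothetical.

def geometric_mean(ratios):
    product = 1.0
    for r in ratios:
        product *= r
    return product ** (1.0 / len(ratios))

published = [60.0, 45.0, 80.0, 55.0]   # hypothetical newsletter numbers
measured  = [50.0, 40.0, 60.0, 50.0]   # hypothetical on-site runs

specmanship_ratio = geometric_mean(published) / geometric_mean(measured)
```

A ratio above 1.0 quantifies how far the published figures outrun what you can reproduce with the compilers and configuration actually on hand.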

P4. Truth in SPECmarks II.

Prepare your own suite of 10 benchmarks (5 integer, 5 floating point). Compile and run on each workstation using default compiler flags. How does the relative performance of each workstation compare to published SPECmarks? (Suggested by Stephen Richardson of Sun Microsystems)

P5. Sophistication of Coherency and Traffic.

Measure the differences in memory or coherency traffic for 3-state vs. 4-state cache coherency protocols for some standard multiprocessor benchmark suites. Be sure to search for prior results in this area before starting. As Wallach says, ``for all I know somewhere buried in some journals some people have already done projects like this. With the publications these days, it is difficult to always keep up on everything.'' (Suggested by Steve Wallach of Convex)
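To see where the traffic difference comes from before building a full simulator, a single-cache-line model of MSI (3-state) vs. MESI (4-state) is enough. This is only a sketch under simplifying assumptions (two CPUs, one line, every bus transaction weighted equally); the `simulate` function is an invention for illustration:

```python
# Toy single-line coherence model: count bus transactions for a
# 3-state (MSI) vs. 4-state (MESI) protocol on the same trace.

def simulate(trace, protocol):
    """trace: list of (cpu, 'r' or 'w') for cpus 0 and 1.
    Returns the number of bus transactions."""
    state = {0: "I", 1: "I"}      # per-CPU state for the one line
    bus = 0
    for cpu, op in trace:
        other = 1 - cpu
        if op == "r":
            if state[cpu] == "I":                  # read miss: BusRd
                bus += 1
                if protocol == "MESI" and state[other] == "I":
                    state[cpu] = "E"               # exclusive-clean
                else:
                    state[cpu] = "S"
                if state[other] in ("M", "E"):
                    state[other] = "S"
        else:                                      # write
            if state[cpu] == "M":
                pass                               # hit, silent
            elif protocol == "MESI" and state[cpu] == "E":
                state[cpu] = "M"                   # silent E -> M
            else:                                  # BusRdX / upgrade
                bus += 1
                state[cpu] = "M"
                state[other] = "I"
    return bus

# Private read-then-write: MESI's E state saves one bus upgrade.
private = [(0, "r"), (0, "w")]
# Genuinely shared data: the two protocols generate the same traffic.
shared = [(0, "r"), (1, "r"), (0, "w")]
```

The project is then to find out how often real benchmark traces look like `private` (where MESI wins) versus `shared` (where it does not).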

P6. False sharing and Block Size.

Measure invalidations in a two-processor system as a function of cache block size. The bigger the cache block, the more false sharing. This would generate lots of nice charts and graphs:

y-axis = percentage false-sharing

x-axis = cache block size.

Theoretically, making the block size one word eliminates false sharing; from there you can develop different curves based on coherency policy. (Suggested by Steve Wallach of Convex)
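The measurement loop itself is small. The sketch below assumes a write-invalidate protocol and a trace of per-word writes; with the two CPUs writing disjoint words, every invalidation counted is false sharing, and block size 1 drives the count to zero as claimed above:

```python
def count_invalidations(trace, block_size):
    """trace: list of (cpu, word_address) writes under a
    write-invalidate protocol.  A write invalidates the block when the
    other CPU wrote it last; since the two CPUs below write disjoint
    words, every invalidation counted is false sharing."""
    owner = {}
    inv = 0
    for cpu, addr in trace:
        block = addr // block_size
        if owner.get(block, cpu) != cpu:
            inv += 1          # other CPU held the block: invalidate
        owner[block] = cpu
    return inv

# CPU 0 writes even words, CPU 1 writes odd words: no true sharing.
trace = [(i % 2, i) for i in range(16)]
```

Plotting `count_invalidations(trace, b)` for b = 1, 2, 4, ... gives exactly the y-axis/x-axis chart described above.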

P7. Cache sharing in the year 2000.

Wallach writes: I am posing the question to architects of what you would do with 100 million transistors, available in the year 2000. I did this with my ``testimony'' at the recent NAS pow wow to review HPC. This generated some lively discussion between myself, Hennessy, Sutherland, and Lampson. Since we are pin-limited on packages, I maintain that we will have tightly coupled shared memory systems with these dies; that is, with one memory bus coming out. However, the processors will have their own caches, or possibly share a common cache: who knows? An interesting simulation experiment would be to run some traces and determine if, say, one 4 MB cache shared between 2 processors is better or worse than 2 separate 2 MB individual caches. To a first approximation the same die size is used. Certainly a shared cache eliminates or simplifies cache coherency. Be sure to look at a cache access and cycle time model that Norm Jouppi recently completed. There is a tech report for it, and C code is available via ftp. It gives the expected cache access and cycle time for a set of cache parameters (size, associativity, and so on) for a 0.8um process. (Suggested by Steve Wallach of Convex)
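The shared-vs-private trade can be prototyped with a tiny trace-driven model before committing to full 2 MB/4 MB simulations. This is only a sketch under simplifying assumptions (direct-mapped caches, block-granularity trace, equal total capacity standing in for equal die area); `hits_direct_mapped` and the toy trace are inventions for illustration, not Jouppi's model:

```python
def hits_direct_mapped(trace, num_lines, shared):
    """Count hits for two CPUs over a trace of (cpu, block_address).
    shared=True : one direct-mapped cache of num_lines serves both.
    shared=False: each CPU gets a private cache of num_lines // 2,
    so total capacity (die area) is the same in both designs."""
    if shared:
        c = [None] * num_lines
        caches = {0: c, 1: c}             # same physical cache
    else:
        caches = {0: [None] * (num_lines // 2),
                  1: [None] * (num_lines // 2)}
    hits = 0
    for cpu, addr in trace:
        cache = caches[cpu]
        idx = addr % len(cache)
        if cache[idx] == addr:
            hits += 1
        else:
            cache[idx] = addr
    return hits

# Both CPUs sweep the same 8-block data set twice: the shared cache
# keeps one copy that serves both, while the split caches thrash.
trace = [(cpu, a) for _ in range(2) for a in range(8) for cpu in (0, 1)]
```

Workloads with genuinely shared data favor the shared cache (one copy, no coherence); fully private working sets favor the split design because there is no interference, which is exactly the tension the traces should decide.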

P8. Cache coherency on networks of workstations. (NOW)

Benchmark shared address programs on a NOW. Willy Zwaenepoel at Rice has developed and distributed a simulated global, coherent memory system on top of Unix workstations. Willy has shown this to me working on Suns, IBMs, and Alphas. This stuff is much better than Kai Li's implementation. You could measure microbenchmarks on this class of system architectures. (Suggested by Steve Wallach of Convex)

P9. Cache warmth and performance.

Measure performance degradations, if any, when the following experiment is performed.

(a) Run a single threaded program that runs for several seconds. Make sure that a lot of the data IS cached.

(b) Have another program perform some type of interfering process. It could be as simple as an interval timer on a uniprocessor, or another program PROBING shared data.

The objective is to measure the effect on performance with respect to hot/cold/mild cache and/or background interrupts. About 10 years ago, at Michigan, a paper was written about this phenomenon relative to a Cray-2. The program was run uniprocessor, and then with another processor executing programs with different characteristics. The paper was by Calahan. The probe program was a scalar and then a vector program. The scalar program had more random references. It turns out the scalar program resulted in more degradation to the STANDARD program. The memory system was designed to interleave multiple regular address streams (namely vector), but could not deal with random interference patterns. (Suggested by Steve Wallach of Convex)
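Before instrumenting real hardware, the Calahan effect can be reproduced in a toy cache model: interleave one probe reference between each reference of a "standard" program and compare a strided (vector-like) probe against a random (scalar-like) one. Everything here is an invented illustration, not a model of the Cray-2's interleaved memory, but the same qualitative result falls out:

```python
import random

LINES = 64   # direct-mapped cache size, in lines

def standard_hit_rate(probe_addrs):
    """Hit rate of a 'standard' program (repeated sweep of a working
    set that exactly fits a direct-mapped cache) when each of its
    references is interleaved with one reference from an interfering
    probe stream (empty list = run alone)."""
    cache = [None] * LINES
    hits = 0
    sweep = [a for _ in range(50) for a in range(LINES)]
    for i, addr in enumerate(sweep):
        if cache[addr % LINES] == addr:
            hits += 1
        else:
            cache[addr % LINES] = addr
        if probe_addrs:                  # one probe ref in between
            p = probe_addrs[i % len(probe_addrs)]
            cache[p % LINES] = p
    return hits / len(sweep)

# Vector-like probe: constant stride, lands on a single cache line.
vector_probe = [LINES * (k + 1) for k in range(100)]
# Scalar-like probe: random references, lands on random lines.
random.seed(0)
scalar_probe = [random.randrange(1000, 10**6) for _ in range(100)]
```

The regular probe keeps clobbering the same line, so the standard program loses only that line; the random probe sprays evictions everywhere and hurts much more, matching Calahan's observation.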

P10. Value of Secondary Caches

With increasing densities you can now get 64K or 128K caches on chip. These result in high hit rates. The question is whether an external cache can be justified. How fast does it have to be to make it worthwhile? The other alternative is a fast memory based on Synchronous DRAMs. Assume that a cached design cannot start a memory access before you know that the cache is going to miss. A simpler alternative would be a study that measures cache hit ratios for a secondary cache relative to the number of accesses to that cache, as a function of primary cache size. You could calculate it from the data for all cache sizes, but I have never seen it presented that way. (Suggested by Dileep Bhandarkar of Digital Equipment)
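The quantity asked for, L2 hits relative to accesses reaching L2, is the local hit ratio, and the interesting effect is that a bigger L1 filters out the locality, leaving L2 with the hard misses. A sketch under simplifying assumptions (fully associative LRU caches, a synthetic mixed-locality trace, all names invented for illustration):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()

    def access(self, addr):
        """Touch addr; return True on hit, False on miss (and fill)."""
        if addr in self.data:
            self.data.move_to_end(addr)
            return True
        self.data[addr] = True
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)     # evict LRU entry
        return False

def l2_local_hit_ratio(trace, l1_size, l2_size):
    """Fraction of L1 misses that hit in L2 (the 'local' hit ratio)."""
    l1, l2 = LRUCache(l1_size), LRUCache(l2_size)
    misses = hits = 0
    for addr in trace:
        if not l1.access(addr):
            misses += 1
            if l2.access(addr):
                hits += 1
    return hits / misses if misses else 1.0

# Mixed locality: a hot 8-address loop plus occasional one-shot data.
trace = [1000 + i if i % 10 == 9 else i % 8 for i in range(1000)]
```

With a tiny L1 the hot loop thrashes and L2 catches it (high local hit ratio); once L1 holds the hot set, L2 sees only one-shot data and its local hit ratio collapses, which is the curve the project would present for real traces.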

P11. Experimental Validation

An architecture has a feature like a cache. How do you know it works? How do you know how well it works? About 10 years ago (a little less, maybe), CRI produced a feature called microtasking. It was an automatic way to use more than one CPU. The problem was that multiprocessor Crays were hard to come by, even at CRI. We had one, so we were a logical test choice. We scheduled time on weekends for literally snapshot-fast runs. It turned out that the software, developed on a uniprocessor, had 1 CPU as the distributed default. We had 3 other (briefly) idle CPUs. How do we know these things work? I recently wrote a program to tell me how much memory I could use in an array. It did this by bisection of the address space. I ran into a few unexpected problems. Real simple program. It found a use in work within 24 hours. How do we know something doesn't get bypassed? In the fluid dynamics community, we call it validation and verification of the science (of the simulations). I don't think we do enough of it, and we do it in grossly general ways. Assign different features: how can we tell VM is working? What about a cache, register assignment, and so on? The work all needs to be in the public domain. It will be fun hacking, too. (Suggested by Eugene N. Miya of NASA Ames Research Center)
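The "bisection of the address space" probe mentioned above boils down to a binary search for the largest size at which a monotone predicate (here, "can I allocate this much?") still succeeds. A reproducible sketch, with a fake limit standing in for the real allocation attempt (a real probe would try the allocation and catch the failure):

```python
def largest_true(predicate, lo, hi):
    """Binary search for the largest n in [lo, hi] with predicate(n)
    True, assuming predicate is monotone (True up to a point, then
    False).  This mirrors the 'bisect the address space' memory probe
    described above."""
    assert predicate(lo), "predicate must hold at the lower bound"
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if predicate(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Stand-in for "can I allocate an array of n bytes?" -- a fake limit
# so the sketch is reproducible; a real probe would attempt the
# allocation and treat failure as False.
FAKE_LIMIT = 123_456_789
max_bytes = largest_true(lambda n: n <= FAKE_LIMIT, 0, 2**32)
```

Even this "real simple program" exercises the VM system, overcommit policy, and allocator limits at once, which is precisely the kind of feature-level validation being argued for.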

P12. Distributed system cost

An I/O benchmark by Peter Chen suggests that we can validate the UNIX file system performance of many options from a single large system. Since we can get the price from the company, we can calculate price/performance. What is the most economical way to configure distributed systems in terms of local memory, remote memory (in file system), local disk, remote disk, network bandwidth, and so on? (Suggested by Andy Bechtolsheim of Sun Microsystems.)
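Once prices and a throughput model are in hand, picking the most economical configuration is a search for the minimum cost per unit of performance. Everything below is an invented placeholder (the prices, the throughput model, and all names), sketching only the shape of the calculation:

```python
# Toy price/performance search over distributed-system configurations.
# Prices and the throughput model are invented placeholders, not real
# component prices or a validated performance model.
PRICES = {"local_mem_mb": 30, "local_disk": 1500}

def cost(config):
    """Total price of a configuration: {component: quantity}."""
    return sum(PRICES[part] * qty for part, qty in config.items())

def throughput(config):
    # Diminishing returns on memory; a local disk doubles I/O rate.
    mem = config.get("local_mem_mb", 0)
    return (mem ** 0.5) * (2 if config.get("local_disk") else 1)

def best_config(candidates):
    """Pick the configuration with the lowest cost per unit
    throughput (best price/performance)."""
    return min(candidates, key=lambda c: cost(c) / throughput(c))

candidates = [{"local_mem_mb": 64, "local_disk": 1},
              {"local_mem_mb": 256}]
```

In the real project, `throughput` would be replaced by measurements from Chen's I/O benchmark and `PRICES` by vendor list prices.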


Last Modified: 08:25am PDT, August 31, 1995