CS 252 PROJECT SUGGESTIONS SPRING 1994

Here are some ideas that could lead to interesting projects. Some of these are short-term, self-contained projects that could easily be accomplished within the context of a term-project. Others could eventually expand into publications, or even a thesis topic. If you pick a problem with a wider scope, you must be sure to isolate a piece that can be adequately addressed within the semester. You may work in teams of two or three, but there should be a clearly identifiable role for each member of the team, especially for three-person teams.

To find background literature for your projects you might want to consult the standard architecture journals and conference proceedings.

Instruction Set Design and Measurement

P1. SPEC Revisited

These days designers do their work with SPEC marks hanging over their heads. Some have complained that the benchmarking has become too cooked and that, for example, compilers are doing SPEC-specific optimizations. On a sizable variety of locally available machines, try to reproduce the published SPEC marks using the published compiler incantations. Compare this with -O performance or other standard compilation options. Are the SPEC optimizations actually safe? How do SPEC marks correlate with performance on other benchmarks?

P2. Condition codes or not, that is the question

(Suggested by David Wood) Most architectures designed in the 1960s and 1970s employ condition codes. Several of the recent RISC architectures do not, yet others do. For example, the MIPS architecture does not have condition codes, yet has SET instructions that put the result of a comparison into a general purpose register. The RS/6000 has 8 (virtual) condition code registers, which can be set either as a side-effect of an ALU instruction, or explicitly by compare instructions. Explore the similarities and differences of these two schemes. Is one clearly better than the other? Are the differences technology dependent? As processors become increasingly integrated, which scheme will lead to the best performance?

P3. Analyze instruction issue strategies

Analyze and compare scoreboarding, Tomasulo's algorithm, and the RS/6000 scheme. Where's the beef? How much of the advantage of dynamic scheduling is achieved by static scheduling and delay slots? How do the trade-offs change with increased processor/memory speed ratio? Are lockup-free caches or write buffers critical? A bunch of people looked at this last year and built some great tools. The problem was that the studies were not as complete as one would like. Perhaps you could start from where they left off and really spend time on the study, rather than the tool building.

P4. Instruction Statistics Tools

Adapt spim, DLXsim, or other tools to provide detailed measurements of other instruction sets, such as SPARC, Alpha, PowerPC, PA-RISC, or older ISAs (VAX, IBM, 80x86). GCC back-ends are available for some of these. The vendors' compilers probably run only on their own platforms, so working from object code may be very important. Alternatively, object code translation (as in Pixie) could be used to collect data in situ.

P5. Improving CS252 tools

It should be possible to run the entire SPEC benchmark suite through the pipeline or cache simulator.

P6. Address spaces beyond 32 bits

According to some observers, the demand for virtual address space increases at the rate of 1 bit every 2 years. Thus, while 16 bit architectures were quite acceptable throughout the 60s and 70s, they eventually became too constraining, and have almost entirely been replaced. At this rate, we have just a few more years before 32-bit addresses begin to constrain our programs. Some architectures have introduced segments to extend the address space. Some computer architects suggest that only full 64-bit machines (integers and addresses) will solve the problems. Explore the cost, performance, programming, and compatibility issues of these approaches.

We don't have any tools for analyzing the use of large address spaces. Several interesting methodology issues arise in doing so.

P7. Machine independent binaries

All RISC instruction sets look basically alike, but you sure cannot compile one program for all. Perhaps there is a middle ground, a brand-X generic RISC that could be easily mapped to a variety of ISAs with reasonable efficiency. (One such design has been developed pretty far by OSF.) This raises another interesting question of machine-independent pipeline scheduling.

Memory Systems

P8. Validation of Cache Studies

Using tools like pixie and piping the trace directly into the simulator, rather than attempting to store the trace, it is possible to evaluate substantial workloads of billions of instructions. Review the literature and see whether the important published results on cache studies really hold up. In particular, what is the limit of SPEC-based caching studies? At some point, all the SPEC programs fit in the cache.
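
To make the methodology concrete, here is a minimal sketch of a simulator skeleton that consumes a trace from a pipe rather than a stored file. The trace format (one hexadecimal address per line on stdin) and the cache parameters are assumptions for illustration, not the pixie output format:

    /* Minimal pipe-driven cache simulator skeleton.  Assumes a made-up
     * trace format of one hex address per line, e.g.:
     *     tracegen | ./cachesim
     * so the trace never touches the disk. */
    #include <stdio.h>

    #define LINE_SIZE 32              /* bytes per block (assumed) */
    #define NUM_LINES 2048            /* 64 KB direct-mapped cache (assumed) */

    static unsigned long tags[NUM_LINES];
    static int valid[NUM_LINES];

    int main(void)
    {
        unsigned long addr, refs = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {
            unsigned long block = addr / LINE_SIZE;
            unsigned long index = block % NUM_LINES;
            refs++;
            if (!valid[index] || tags[index] != block) {
                misses++;             /* miss: fill the line */
                valid[index] = 1;
                tags[index] = block;
            }
        }
        printf("%lu references, %lu misses, miss rate %.4f\n",
               refs, misses, refs ? (double)misses / refs : 0.0);
        return 0;
    }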

Alternatively, examine the body of work on very large caches. Is there any validity to single-program studies? How solid and reproducible are the multiprogrammed studies? Can you put this on a stronger footing? Maybe you'll have to apply MPP computing power to the problem.

P9. Transit Buffers

(Suggested by David Wood) In a recent paper, Norm Jouppi proposes a cache optimization called victim caches. A victim cache is a small fully-associative cache that sits behind a larger direct-mapped cache. Victims (replaced blocks) from the big cache are placed in the victim cache, rather than just being thrown away. His simulation results indicate that victim caches can significantly reduce the effective access time. Victim caches can be generalized in several ways, to produce a mechanism that might be called a Transit Buffer. Transit buffers include the functions of a prefetch buffer and a writeback buffer, further improving performance. Transit buffers can also be extended to support lockup-free execution. Explore the design space using trace-driven simulation.
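
A minimal sketch of the victim-cache part of the idea follows (the sizes, the FIFO replacement in the victim buffer, and the swap-on-hit behavior are simplifying assumptions, not Jouppi's exact design); extending it toward a transit buffer would mean adding prefetch and writeback entries:

    /* Direct-mapped cache backed by a tiny fully-associative victim cache.
     * All parameters are arbitrary; victim replacement is FIFO for simplicity. */
    #include <stdio.h>

    #define LINE_SIZE    32
    #define NUM_LINES    1024         /* 32 KB direct-mapped cache */
    #define VICTIM_LINES 4            /* small fully-associative buffer */

    static unsigned long tag[NUM_LINES];
    static int valid[NUM_LINES];
    static unsigned long vtag[VICTIM_LINES];
    static int vvalid[VICTIM_LINES];
    static int vnext;                 /* FIFO replacement pointer */

    /* Returns 0 for a main-cache hit, 1 for a victim-cache hit, 2 for a miss. */
    static int cache_access(unsigned long addr)
    {
        unsigned long block = addr / LINE_SIZE;
        unsigned long index = block % NUM_LINES;
        int i;

        if (valid[index] && tag[index] == block)
            return 0;

        for (i = 0; i < VICTIM_LINES; i++) {
            if (vvalid[i] && vtag[i] == block) {
                /* Hit in the victim cache: swap with the resident block. */
                unsigned long old = tag[index];
                int oldv = valid[index];
                tag[index] = block;  valid[index] = 1;
                vtag[i] = old;       vvalid[i] = oldv;
                return 1;
            }
        }

        /* Full miss: the displaced block becomes the newest victim. */
        if (valid[index]) {
            vtag[vnext] = tag[index];
            vvalid[vnext] = 1;
            vnext = (vnext + 1) % VICTIM_LINES;
        }
        tag[index] = block;
        valid[index] = 1;
        return 2;
    }

    int main(void)
    {
        unsigned long addr, count[3] = { 0, 0, 0 };

        while (scanf("%lx", &addr) == 1)
            count[cache_access(addr)]++;
        printf("main hits %lu, victim hits %lu, misses %lu\n",
               count[0], count[1], count[2]);
        return 0;
    }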

P10. High-Performance Memory Systems

As processors get faster and service more instructions per cycle, the processor-to-cache and cache-to-memory bandwidth requirements increase. Supercomputer memory systems with substantial interleaving are being considered for the workstations of the near future. Interleaved caches and new RAM technologies are currently being explored. The new Power2 architecture provides very high cache-to-memory bandwidth compared to other microprocessor systems, and in David Bailey's Supercomputing '93 paper you can really see the payoff for scientific computing. Study the trade-offs in this design space. Caution: you cannot study the problem on toy programs or wimpy processors.

P11. Novel cache policies

Most caches implement some variant of an LRU policy. Work on optimal cache policies shows that sometimes it is necessary to throw away the most recently used item. Machines targeted at scientific programming generally do not even have caches, because large amounts of data must be ``streamed'' through the processor, exhibiting little locality. Perhaps there should be a kind of inbound purgatory, where items prove their worth by being accessed more than once before entering the cache. Many have suggested explicit control over caching.
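
As one possible reading of the ``purgatory'' idea, here is a sketch (entirely hypothetical, with an arbitrary filter size) in which a block is allocated in the cache only on its second reference; the first reference merely records the block in a small filter table and bypasses the cache:

    /* Sketch of a two-touch admission policy.  The filter remembers recently
     * seen block numbers; a block must be seen twice before it may be cached. */
    #include <stdio.h>

    #define LINE_SIZE   32
    #define FILTER_SIZE 256           /* arbitrary purgatory size */

    static unsigned long filter[FILTER_SIZE];
    static int filter_valid[FILTER_SIZE];

    /* Returns nonzero if this reference should allocate a line in the cache. */
    static int admit(unsigned long addr)
    {
        unsigned long block = addr / LINE_SIZE;
        unsigned long slot = block % FILTER_SIZE;

        if (filter_valid[slot] && filter[slot] == block)
            return 1;                 /* second touch: admit to the cache */
        filter[slot] = block;         /* first touch: remember and bypass */
        filter_valid[slot] = 1;
        return 0;
    }

    int main(void)
    {
        unsigned long addr, admitted = 0, total = 0;

        while (scanf("%lx", &addr) == 1) {
            total++;
            admitted += admit(addr);
        }
        printf("%lu of %lu references would be allocated\n", admitted, total);
        return 0;
    }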

P12. Trace Compaction

Although the SPEC benchmark set offers a fairly limited basis for design decisions, an address trace from this suite would be extremely large. A wealth of literature is available on trace compaction techniques. Is it possible to produce a very compact trace set that accurately predicts SPEC performance? What new tricks can be played?
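
As a strawman baseline only (not one of the published compaction techniques), the filter below rewrites a trace of hexadecimal addresses as deltas from the previous reference; the deltas are mostly small and compress far better with a standard compressor downstream:

    /* Trivial trace compactor: emit the first address in full, then signed
     * deltas.  Intended as a baseline to compare real compaction schemes
     * against, e.g.:  cat trace | ./deltas | compress > trace.Z */
    #include <stdio.h>

    int main(void)
    {
        unsigned long addr, prev = 0;
        int first = 1;

        while (scanf("%lx", &addr) == 1) {
            if (first) {
                printf("%lx\n", addr);    /* base address */
                first = 0;
            } else {
                printf("%ld\n", (long)(addr - prev));
            }
            prev = addr;
        }
        return 0;
    }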

P13. Network RAM

Currently, DRAM accesses cost 25 to 100 instruction times and transfer at about half a gigabyte per second, while a disk access costs roughly one million instruction times and transfers at a megabyte per second. With good network interfaces we can obtain access times of 100 to 1000 instruction times and a transfer rate of 10 to 100 Megabytes per second. Clearly, it is interesting to consider paging into the other memories of workstations on the network. How do you do it? What is the win? What is the block size? The policy? How is interactive performance impacted? Address translation?
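
A back-of-envelope comparison using the figures above helps frame the win; the page size and the assumption of roughly 10 ns per instruction time are mine, and the access and bandwidth numbers are simply the ranges quoted in the paragraph:

    /* Rough page-fetch costs in instruction times for network RAM vs. disk,
     * using the ranges quoted above.  4 KB pages and 10 ns per instruction
     * time are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double instr_ns = 10.0; /* assumed instruction time */
        const double page = 4096.0;   /* assumed page size in bytes */
        struct level { const char *name; double access; double mb_per_s; };
        struct level levels[3] = {
            { "network RAM, best case ", 100.0, 100.0 },
            { "network RAM, worst case", 1000.0, 10.0 },
            { "local disk              ", 1.0e6, 1.0 }
        };
        int i;

        for (i = 0; i < 3; i++) {
            double xfer = page / (levels[i].mb_per_s * 1.0e6) * 1.0e9 / instr_ns;
            printf("%s: ~%.0f instruction times per page\n",
                   levels[i].name, levels[i].access + xfer);
        }
        return 0;
    }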

Superscalar, Superpipeline

P14. Superscalar instruction set design

Load/Store instruction set architectures provide an elegant abstraction of a form of machine organization, with a simple pipeline centered on a large register file. Several new machines have been announced, or are about to be, that issue multiple instructions per cycle. The current round of products strives for compatibility with previous-generation one-instruction-per-cycle RISC machines. This requires the hardware to detect hazards by analysis of register usage, on the fly. Perhaps we have returned to a mismatch of instruction set and machine organization. What is the best way of representing programs for machines that can perform a small number of operations per cycle? There is a long history of long and very-long instruction word machines, where each instruction specifies several operations explicitly. Dataflow ideas are also well developed, in which an instruction explicitly identifies the instructions that are to receive its result. A recent machine developed by Burton Smith, the Tera, allows the compiler to inform the instruction dispatcher about independence via a look-ahead field. The RS/6000 may play a similar game. Consider how variations on current instruction set architectures could improve performance and simplify implementation of superscalar machines.

The previous issue could be generalized still further; what is the optimal representation of programs with a certain degree of operation level parallelism? The first step is understanding what cost metric should be applied.

P15. Superscalar instruction fetching

(Suggested by David Wood) Investigate combining AMD's branch target cache with CRISP's decoded instruction cache. Each block could contain a tag, the next branch target, and one or more instructions at the branch target. The next branch target could be dynamically filled. Instructions could be placed in the buffer out of order to aid superscalar decode. Many issues here. (Might want to consider explicit representation of dependences in the internal form.)

P16. Instruction-cache organization for Superscalar Architectures

(Suggested by Tzi-cker Chiueh) The increasing processing bandwidth offered by superscalar architectures won't be used efficiently if the instruction supply bandwidth cannot keep up. Instruction memory organization has been extensively studied in the context of scalar RISC architectures. Measure the instruction bandwidth requirement for superscalar machines and propose new architectural solutions to alleviate this problem. Accompanying this issue is the branch-handling problem, which becomes increasingly important as the effective basic-block size shrinks relative to the growing instruction-level parallelism.

P17. Sophisticated Branching for Superscalar Machines

(Suggested by Steve Krueger, T.I.) In RISC processors a pipeline break is relatively costly. In superscalar RISC processors that cost is effectively multiplied by the number of instructions that execute simultaneously. It is therefore desirable to increase the length of runs of instructions without branches. Some old architectures had SKIP instructions that were used extensively. It seems that SKIP instructions (possibly with restrictions on what could be skipped) could give conditional execution without breaking the pipeline, through the use of SQUASH or KILL hardware already in place due to the needs of exception processing. Study whether the number of cases where SKIP instructions could be used effectively is great enough to make them useful. Skip-n forms might extend the usefulness by allowing more than one instruction to be skipped.

Several variants of this idea could be considered, including multiway branches, conditional moves, operators to avoid branching (e.g., max, min, abs), and conditional operators.

P18. Superpipelining

Develop a VLSI cost model in which to assess the tradeoffs in interleaving various portions of a superpipelined processor. For a given technology, is there an ``optimal organization''?

Vector Processing

P19. Vector cache organization

What kind of organizational changes should be made to make caches actually work well with vector codes? For example, does an interleaved cache make any sense?

P20. Vector nodal performance on MPPs

Several recent MPPs (the Paragon, the CM-5, and the CS-2) have unique vector support on each node. Evaluate the nodal and global vector performance along the lines that have been published for the traditional vector processors.

P21. Programmable Address Generators for Vector Processors

Vector operations are driven by hardwired address sequence generators, the most popular being an arithmetic sequence from a given base, with a fixed stride, and a certain length. More sophisticated generators are needed for sparse matrix operations, including gather and scatter. These are certainly not the only kinds of data sequences we would like to combine efficiently. Array processors and modern DSP chips have a wide instruction word, so the address generation is explicitly programmed into the inner loop. Another approach that has been proposed is programmable address generators. A generalized vector operation would involve specifying the address generation for each of the sequences and the pointwise data transformation. The arithmetic part runs at whatever speed the memory system can supply the correct data. Study how this technique could be applied to achieve ``greater vectorizability'' in, say, the Livermore loops. What machine organizational advantages/disadvantages does it present?
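
The sketch below is one way to picture the idea in software (a real design would put the generators in hardware or microcode, and the particular generators and element types here are arbitrary): each operand of a generalized vector operation carries its own index-generation function.

    /* Generalized vector add in which each operand's index sequence comes
     * from a programmable generator.  Illustrative only; the generators would
     * be hardware sequencers, not C function pointers, in a real machine. */
    #include <stdio.h>

    typedef long (*idx_gen)(long i);  /* maps element number to array index */

    static long unit(long i)    { return i; }        /* stride 1 */
    static long stride2(long i) { return 2 * i; }    /* fixed stride 2 */

    static long gather_ix[4] = { 3, 1, 4, 1 };
    static long gathered(long i) { return gather_ix[i]; }   /* gather */

    /* c[gc(i)] = a[ga(i)] + b[gb(i)] for i = 0..n-1 */
    static void vadd(long n, double *a, idx_gen ga, double *b, idx_gen gb,
                     double *c, idx_gen gc)
    {
        long i;
        for (i = 0; i < n; i++)
            c[gc(i)] = a[ga(i)] + b[gb(i)];
    }

    int main(void)
    {
        double a[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
        double b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        double c[4];
        long i;

        vadd(4, a, stride2, b, gathered, c, unit);   /* c[i] = a[2i] + b[ix[i]] */
        for (i = 0; i < 4; i++)
            printf("c[%ld] = %g\n", i, c[i]);
        return 0;
    }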

Alternatively, how well do aggressive superscalar designs capture vector loops?

P22. Vector Processor Design Tools

Develop a well-instrumented, flexible vector SPIM (and code generator?). Memory system analysis is particularly important in this domain. This might serve as a basis for determining better vector register organizations, and so on. (Some previous projects have taken some big steps on this.)

Parallelism

P23. Multiple processors on a Chip

If you allow yourself a transistor budget of 10 to 50 million transistors on a chip, you have plenty of room for innovative designs with multiple processors on a chip. This raises a host of interesting questions. Should caches be shared or dedicated to individual processors? Should floating point units be shared or dedicated? What is the trade-off between having more, simpler processors versus fewer, more sophisticated ones? In some sense this is trading instruction-level parallelism against process-level parallelism. The most serious bottleneck is going to be pin bandwidth in and out of the chip. How can you minimize the bandwidth requirement? Are new protocols required? There are several ways to frame studies in this context. You may want to look at multiple independent processes, as in a workstation with many open windows, or a small shared-memory multiprocessor design, or you may want to look at this as a component in a massively parallel machine.

P24. Architectural analysis of parallel languages

There are quite a number of parallel applications being developed in Split-C within the department. The Splash and NAS benchmarks are widely circulated as well. It is quite likely that the instruction frequencies, memory usage characteristics, I/O usage, etc. for such parallel programs are quite different from sequential programs. Develop or borrow tools, measure and compare. Trace analysis for parallel programs is an interesting problem. Cache behavior is also likely to differ significantly from that observed on sequential programs.

P25. Fast Communication Layers for novel Parallel Machines

A current hot topic in the parallel computing arena is how to obtain low-overhead communication. The CM-5 Active Message layer, developed at Berkeley by Thorsten von Eicken, is something of a de facto standard. An earlier version ran on the nCUBE. Rich Martin has developed an Active Message layer for a cluster of HP workstations with a special network interface. It would be very valuable to construct a layer of similar quality for the Intel Paragon, Meiko CS-2, or ATM-based networks. A component of this, which could be a project all by itself, is to characterize the performance of dual processors on a cache-coherent bus where one is a compute processor and one is a message processor.

P26. Language implementation

A great way to learn about architectures is to implement a programming language on them. You really see what things cost, what you can use, and what you can not. Split-C is built as an extension to GCC for distributed memory multiprocessors. Currently it runs on the CM-5, Intel Paragon, and some workstation clusters. Two interesting machines with novel hardware support for communication and global access are the Cray T3D and the Meiko CS-2. Implementing Split-C on these machines will involve a modest amount of coding and a great deal of architectural understanding. Other candidates are the IBM SP-1 and true shared memory MPs, such as the KSR and Dash.

P27. Abstract Parallel Machine

In a recent paper the LogP model was formulated to capture the critical performance characteristics of modern multiprocessors as a basis for algorithm design. We have some preliminary measurements on the CM-5. It would be very interesting to characterize a variety of parallel machines and try to test the model predictions on programs.
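
For reference, the model's four parameters are L (network latency), o (per-message send/receive overhead), g (minimum gap between messages from one processor), and P (number of processors). The sketch below shows the kind of first-order prediction one would test against measurements; the parameter values are placeholders, not measurements of any real machine, and g >= o is assumed:

    /* First-order LogP predictions.  L, o, g are in microseconds and are
     * placeholders; the send rate is assumed to be limited by the gap g. */
    #include <stdio.h>

    int main(void)
    {
        double L = 6.0, o = 2.0, g = 4.0; /* hypothetical machine parameters */
        int k = 100;                      /* messages sent back to back */

        printf("one small message, end to end: %.1f us\n", o + L + o);
        printf("last of %d back-to-back messages arrives at: %.1f us\n",
               k, (k - 1) * g + o + L + o);
        return 0;
    }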

Multiprocessors

P28. Software-based Cache Coherency Scheme

Some believe software-based cache coherency schemes are the only feasible solution for large-scale multiprocessors. Currently there are a few proposals in the literature that try to solve this problem. All of these are based on some kind of single-assignment memory semantics for shared-write data. Study these algorithms and evaluate them either analytically or empirically through simulation.

P29. Hardware support for parallel distributed debugging

Investigate hardware support for logging, replay, and synchronization for multiple threads of the same program.

P30. Characterizing communication and sharing in multiprocessors

In current bus-based multiprocessors, interprocessor communication takes the form of cache misses. Thus, several issues get folded into a single number --- the miss rate. Some good work has been done to try to characterize sharing in terms of modern cache organization, but there remain many unanswered questions. Some of these may only be answered by inventing new analysis techniques. Certainly it will be hard to get useful data. The reference string generated by each of the processors depends on how work is scheduled onto processors. This is generally done by allowing the processors to contend for various scheduling data structures. Thus, the schedule is somewhat dependent on the memory system. Exploring design variations relative to a fixed trace ignores this feedback. A thorough study needs to be done on the sensitivity or robustness of multiprocessor address traces. (A few other concerns have been raised, such as the number of shared references in available traces.)

P31. What is the minimum cache miss rate due to communication?

As uniprocessor caches get larger, the miss rate approaches the initial load cost (compulsory miss rate). It would seem that multiprocessor caches should tend toward the compulsory miss rate plus a communication factor. How would an optimal cache perform?

Snoopy caches provide communication and replication of data. Replication is what causes the coherence headaches. How does the miss rate (or communication rate) decrease with degree of replication?

P32. Multiprocessor Design Tools

Develop a well-instrumented, flexible shared-memory or distributed-memory multiprocessor simulator. Cache simulation is particularly important in this domain. How is the network really used?

P33. Communication Patterns in Modern Network Multiprocessors

Study actual intercommunication patterns in available multiprocessors. How is the network really used? What could be done to improve communication performance?

High Performance NOWs

Busses, interfaces, routers.

Network error characteristics. What is the relationship between network quality and the cost of ``semantic transparency''?

Characterize the effect of parallel file transfers.

Characterize peak point loads and average loads on a NOW.

Serious evaluation of ATM. Fibre Channel? SCI? Measure real machines.

Optical networks

P34. Smart Valley

There is a megaproject brewing to build an advanced network giving direct multimedia access to institutions and individuals throughout the greater Bay Area. This hinges on developing a standard representation for information of a very general nature. What are the existing alternatives? What are the criteria for judging such a standard? What are the architectural implications for machine-independent exchange of information?

Multithreading

P35. Multithreaded Processor/Cache Design

In view of the ever-increasing gap between processor speed and memory latency, an architecture can either reduce or tolerate this mismatch. Caching is a scheme to reduce this gap by replicating data. Multithreading, on the other hand, aims at tolerating it. The central idea is to apply the context-switching idea at the instruction-by-instruction level. Suppose several process states are stored on the processor; whenever a process encounters a cache miss, the processor switches to serve another process, so the throughput of the processor is not reduced. This scheme is particularly attractive in a multiprocessor configuration, where memory latency is a major problem. Study the design tradeoffs of this kind of architecture, especially focusing on the structure of cache memory design.

(A previous study made some progress on this, but the results were inconclusive due to an interesting methodological issue. The right question may be, ``When is multithreading more cost-effective than adding a second-level cache?'')
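
A crude back-of-envelope (not the model cited in P36, and with made-up numbers) illustrates the shape of the trade-off: each context does R cycles of useful work between misses, a miss takes T cycles to service, and a switch costs C cycles.

    /* Processor utilization versus number of hardware contexts, under a very
     * simple steady-state argument: a thread's miss is overlapped by the
     * other n-1 threads, each contributing R cycles of work plus a C-cycle
     * switch.  All parameter values are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double R = 40.0;              /* assumed run length between misses */
        double T = 100.0;             /* assumed miss service time */
        double C = 4.0;               /* assumed context switch cost */
        int n;

        for (n = 1; n <= 8; n++) {
            double u;
            if ((n - 1) * (R + C) >= T)
                u = R / (R + C);              /* enough contexts: saturated */
            else
                u = n * R / (R + C + T);      /* latency-bound region */
            printf("%d contexts: utilization ~ %.2f\n", n, u);
        }
        return 0;
    }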

P36. Validating Models of Multithreading

In a recent paper, Saavedra-Barrera, Culler, and von Eicken propose a simple analytical model of multithreading and cache behavior under this model. Try to validate or disprove this model empirically. (Anant Agarwal has proposed a related model with a more sophisticated network component and a more primitive processor component.)

Symbolic Computing: Prolog, Lisp, etc

P37. Instruction Level Characterization of Symbolic Programs

Given modern compilation techniques for languages like Lisp, ML, Miranda, Prolog, and Smalltalk for current processors, perform a detailed study to determine how the requirements of this class of programs differ from those of the C and Fortran programs provided in the text. (Do type inference and significant use of higher-order functions matter? What about lazy evaluation?)

P38. Read barrier

(Suggested by Doug Johnson) Under a generation-based garbage collector, a record must be kept of pointers from big-old-space to little-new-space, so that the little-new-space can be scanned without crawling over the big-old-space. One means of achieving this is to trap when a new-space pointer value is stored into an old-space location --- a write barrier. This is not too hard to implement efficiently. In an upcoming paper, Doug Johnson argues for a read barrier as well. This is more challenging since it is on the critical path of the memory interface; doing it with low impact on performance and (ideally) little or no impact on the instruction set would be ``kind of interesting''.
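
For contrast, a software write barrier is simple enough to sketch in a few lines (the space layout and remembered-set representation below are invented for illustration); it is the read barrier that has no comparably cheap software form, which is what motivates hardware support:

    /* Sketch of a software write barrier for a two-generation collector.
     * Every pointer store goes through write_field(); stores that create an
     * old-space -> new-space pointer are logged in a remembered set so new
     * space can be scavenged without scanning old space. */
    #include <stdio.h>

    #define SPACE_WORDS    1024
    #define REMEMBERED_MAX 4096

    static void *old_space[SPACE_WORDS];  /* stand-in for the old generation */
    static void *new_space[SPACE_WORDS];  /* stand-in for the new generation */

    static void **remembered[REMEMBERED_MAX];
    static int remembered_n;

    static int in_old(void *p)
    {
        return p >= (void *)&old_space[0] && p < (void *)&old_space[SPACE_WORDS];
    }

    static int in_new(void *p)
    {
        return p >= (void *)&new_space[0] && p < (void *)&new_space[SPACE_WORDS];
    }

    /* The barrier: do the store, then log it if it made an old->new pointer. */
    static void write_field(void **slot, void *value)
    {
        *slot = value;
        if (in_old((void *)slot) && in_new(value) && remembered_n < REMEMBERED_MAX)
            remembered[remembered_n++] = slot;
    }

    int main(void)
    {
        write_field(&old_space[7], &new_space[3]);   /* logged */
        write_field(&new_space[1], &new_space[2]);   /* intra-new: not logged */
        printf("remembered set holds %d entries\n", remembered_n);
        return 0;
    }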

P39. Effect of tagging support for Prolog architectures

(Suggested by Peter Van Roy) Study the effects on architectural support for tagging when the ``tag hoisting'' transformation proposed by Saumya Debray is used to reduce the number of required tag operations. Debray has a paper explaining this transformation. The project would entail implementing this transformation for BAM code output by the Aquarius compiler and using the VLSI-BAM simulation tools to study its effect. Some people have mentioned contradictory intuitions that either (1) tagging would be reduced significantly, or (2) it has little effect on the inner loops of programs. It would be nice to get a solid answer to this question.

P40. Effect of type derivation for Prolog architectures

(Suggested by Peter Van Roy) Using a dataflow analyzer that is able to infer a rich set of types in Prolog programs, study its effect on architectural support for dynamic typing. It is expected that the support for tagging will be less useful, but it is not known by how much. The major effort in this project is to either implement such a dataflow analysis, or else annotate programs semi-manually with types that such an analysis could infer (justifying the derived types, of course), or else determine the actual types occurring empirically and annotate with those types. The latter would give a best-case result. The compilation and measurement tools for this project already exist.

P41. Prolog performance on RISC's

(Suggested by Bruce Holmer) The Aquarius Prolog system (Peter's compiler and Ralph Haygood's runtime system) is at a point where it could be used to do a comparative study of Prolog performance on the popular RISC architectures. Do a detailed study comparing the performance of the different machines on a set of large Prolog benchmarks. It may be that differences in the instruction sets/memory architectures could have non-trivial consequences on performance. For example, how is performance affected by the presence/absence of annulling for branches, how strong is the dependence on the cost of memory reads/writes (including block reads and writes), and perhaps one of the architectures allows a clever way of doing the tag manipulations. This project would require some familiarity with Prolog, since the translation from compiler intermediate code to assembly language is done with a Prolog program. There would also be a period of tweaking the assembly language output, perhaps using some peephole optimization rules. There will be plenty of support from our group, and the MIPS code generator could be used to help guide one through the process of writing the new code generator.

Miscellany

P42. Architectures for Note Pads

(Suggested by David Wood) Many researchers are interested in portable note pad computers that have touch-sensitive screens and wireless communication. These machines require fast processing, to do handwriting and possibly speech analysis, but have very limited power and weight budgets. In addition, the communications bandwidth is limited. How should the architecture of a note pad computer differ from a standard workstation architecture? What is the power consumption of current pipelines and how can it be reduced?

P43. Network instrumentation

(Suggested by Nick Wainwright (HP)) Current implementations of UNIX perform a kernel-to-user-space copy on network data. For large transfers this can be replaced by a remapping of the user buffer into kernel space. Instrument the network code to collect statistics on packet and user buffer length for various applications. Estimate the number of instructions required to remap or copy to determine whether remapping gives a performance improvement.

P44. Page Re-mapping

(Suggested by Greg Watson (HP)) Page re-mapping has been proposed as a way to reduce the number of data copies that are made when data is passed to another process. Two examples of this are network I/O and disk I/O: instead of copying the data into kernel space, the page is simply mapped out of user space and into kernel space. There are some hazards to consider when trying to implement this: data that is not page aligned, what happens when the user process tries to write to the page that has just been mapped out of its address space (there are several alternatives), and what if the re-mapping costs more than the copy itself? Where is the break-even point in terms of data size? How do current burst read/write cache architectures affect the decisions?
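
A back-of-envelope framing of the break-even question (all the per-instruction costs below are assumptions chosen only to make the shape of the trade-off visible, not measurements of any kernel):

    /* Copy cost grows with transfer size; re-map cost is roughly fixed per
     * page touched.  Both cost figures here are assumed, not measured. */
    #include <stdio.h>

    int main(void)
    {
        const double copy_instr_per_byte = 0.5;   /* assumed copy loop cost */
        const double remap_instr = 500.0;         /* assumed PTE update + TLB flush */
        const unsigned long page_bytes = 4096;
        unsigned long size;

        for (size = 256; size <= 65536; size *= 2) {
            unsigned long pages = (size + page_bytes - 1) / page_bytes;
            double copy = size * copy_instr_per_byte;
            double remap = pages * remap_instr;   /* ignores alignment hazards */
            printf("%6lu bytes: copy ~%6.0f instr, remap ~%6.0f instr  (%s)\n",
                   size, copy, remap, remap < copy ? "remap wins" : "copy wins");
        }
        return 0;
    }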

