CS 252 PROJECT SUGGESTIONS SPRING 1994

Here are some ideas that could lead to interesting projects. Some of these are short-term, self-contained projects that could easily be accomplished within the context of a term-project. Others could eventually expand into publications, or even a thesis topic. If you pick a problem with a wider scope, you must be sure to isolate a piece that can be adequately addressed within the semester. You may work in teams of two or three, but there should be a clearly identifiable role for each member of the team, especially for three-person teams.

To find background literature for your projects you might want to consult the standard architecture journals and conference proceedings.

Instruction Set Design and Measurement

P1. SPEC Revisited

These days designers do their work with SPEC marks hanging over their heads. Some have complained that the benchmarking has become too cooked and that, for example, compilers are doing SPEC-specific optimizations. On a sizable variety of locally available machines, try to reproduce the published SPEC marks using the published compiler incantations. Compare this with -O performance or other standard compilation options. Are the SPEC optimizations actually safe? How do SPEC marks correlate with performance on other benchmarks?

P2. Condition codes or not, that is the question

(Suggested by David Wood) Most architectures designed in the 1960s and 1970s employ condition codes. Several of the recent RISC architectures do not, yet others do. For example, the MIPS architecture does not have condition codes, yet has SET instructions that put the result of a comparison into a general purpose register. The RS/6000 has 8 (virtual) condition code registers, which can be set either as a side-effect of an ALU instruction, or explicitly by compare instructions. Explore the similarities and differences of these two schemes. Is one clearly better than the other? Are the differences technology dependent? As processors become increasingly integrated, which scheme will lead to the best performance?

P3. Analyze instruction issue strategies

Analyze and compare scoreboarding, Tomasulo's algorithm, and the RS/6000 scheme. Where's the beef? How much of the advantage of dynamic scheduling is achieved by static scheduling and delay slots? How do the trade-offs change with increased processor/memory speed ratio? Are lockup-free caches or write buffers critical? A bunch of people looked at this last year and built some great tools. The problem was that the studies were not as complete as one would like. Perhaps you could start from where they left off and really spend time on the study, rather than the tool building.

P4. Instruction Statistics Tools

Adapt spim, DLXsim, or other tools to provide detailed measurements of other instruction sets, such as SPARC, Alpha, PowerPC, PA-RISC, or older ISAs (VAX, IBM, 80x86). GCC back-ends are available for some of these. The vendors' compilers probably run only on their own platforms, so working from object code may be very important. Alternatively, object code translation (as in Pixie) could be used to collect data in situ.

P5. Improving CS252 tools

It should be possible to run the entire SPEC benchmark suite through the pipeline or cache simulator.

P6. Address spaces beyond 32 bits

According to some observers, the demand for virtual address space increases at the rate of 1 bit every 2 years. Thus, while 16 bit architectures were quite acceptable throughout the 60s and 70s, they eventually became too constraining, and have almost entirely been replaced. At this rate, we have just a few more years before 32-bit addresses begin to constrain our programs. Some architectures have introduced segments to extend the address space. Some computer architects suggest that only full 64-bit machines (integers and addresses) will solve the problems. Explore the cost, performance, programming, and compatibility issues of these approaches.

We don't have any tools for analyzing the use of large address spaces. Several interesting methodology issues arise in doing so.

P7. Machine independent binaries

All RISC instruction sets look basically alike, but you sure cannot compile one program for all. Perhaps there is a middle ground, a brand-X generic RISC that could be easily mapped to a variety of ISAs with reasonable efficiency. (One such design has been developed pretty far by OSF.) This raises another interesting question of machine-independent pipeline scheduling.

Memory Systems

P8. Validation of Cache Studies

Using tools like pixie and piping the trace directly into the simulator, rather than attempting to store the trace, it is possible to evaluate substantial workloads of billions of instructions. Review the literature and see whether the important published results on cache studies really hold up. In particular, what is the limit of SPEC-based caching studies? At some point, all the SPEC programs fit in the cache.
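
To make the methodology concrete, here is a minimal sketch of a simulator skeleton that consumes a trace from a pipe rather than a stored file. The trace format (one hexadecimal address per line on stdin) and the cache parameters are assumptions for illustration, not the pixie output format:

    /* Minimal pipe-driven cache simulator skeleton.  Assumes a made-up
     * trace format of one hex address per line, e.g.:
     *     tracegen | ./cachesim
     * so the trace never touches the disk. */
    #include <stdio.h>

    #define LINE_SIZE 32              /* bytes per block (assumed) */
    #define NUM_LINES 2048            /* 64 KB direct-mapped cache (assumed) */

    static unsigned long tags[NUM_LINES];
    static int valid[NUM_LINES];

    int main(void)
    {
        unsigned long addr, refs = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {
            unsigned long block = addr / LINE_SIZE;
            unsigned long index = block % NUM_LINES;
            refs++;
            if (!valid[index] || tags[index] != block) {
                misses++;             /* miss: fill the line */
                valid[index] = 1;
                tags[index] = block;
            }
        }
        printf("%lu references, %lu misses, miss rate %.4f\n",
               refs, misses, refs ? (double)misses / refs : 0.0);
        return 0;
    }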

Alternatively, examine the body of work on very large caches. Is there any validity to single-program studies? How solid and reproducible are the multiprogrammed studies? Can you put this on a stronger footing? Maybe you'll have to apply MPP computing power to the problem.

P9. Transit Buffers

(Suggested by David Wood) In a recent paper, Norm Jouppi proposes a cache optimization called victim caches. A victim cache is a small fully-associative cache that sits behind a larger direct-mapped cache. Victims (replaced blocks) from the big cache are placed in the victim cache, rather than just being thrown away. His simulation results indicate that victim caches can significantly reduce the effective access time. Victim caches can be generalized in several ways, to produce a mechanism that might be called a Transit Buffer. Transit buffers include the functions of a prefetch buffer and a writeback buffer, further improving performance. Transit buffers can also be extended to support lockup-free execution. Explore the design space using trace-driven simulation.
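
A minimal sketch of the victim-cache part of the idea follows (the sizes, the FIFO replacement in the victim buffer, and the swap-on-hit behavior are simplifying assumptions, not Jouppi's exact design); extending it toward a transit buffer would mean adding prefetch and writeback entries:

    /* Direct-mapped cache backed by a tiny fully-associative victim cache.
     * All parameters are arbitrary; victim replacement is FIFO for simplicity. */
    #include <stdio.h>

    #define LINE_SIZE    32
    #define NUM_LINES    1024         /* 32 KB direct-mapped cache */
    #define VICTIM_LINES 4            /* small fully-associative buffer */

    static unsigned long tag[NUM_LINES];
    static int valid[NUM_LINES];
    static unsigned long vtag[VICTIM_LINES];
    static int vvalid[VICTIM_LINES];
    static int vnext;                 /* FIFO replacement pointer */

    /* Returns 0 for a main-cache hit, 1 for a victim-cache hit, 2 for a miss. */
    static int cache_access(unsigned long addr)
    {
        unsigned long block = addr / LINE_SIZE;
        unsigned long index = block % NUM_LINES;
        int i;

        if (valid[index] && tag[index] == block)
            return 0;

        for (i = 0; i < VICTIM_LINES; i++) {
            if (vvalid[i] && vtag[i] == block) {
                /* Hit in the victim cache: swap with the resident block. */
                unsigned long old = tag[index];
                int oldv = valid[index];
                tag[index] = block;  valid[index] = 1;
                vtag[i] = old;       vvalid[i] = oldv;
                return 1;
            }
        }

        /* Full miss: the displaced block becomes the newest victim. */
        if (valid[index]) {
            vtag[vnext] = tag[index];
            vvalid[vnext] = 1;
            vnext = (vnext + 1) % VICTIM_LINES;
        }
        tag[index] = block;
        valid[index] = 1;
        return 2;
    }

    int main(void)
    {
        unsigned long addr, count[3] = { 0, 0, 0 };

        while (scanf("%lx", &addr) == 1)
            count[cache_access(addr)]++;
        printf("main hits %lu, victim hits %lu, misses %lu\n",
               count[0], count[1], count[2]);
        return 0;
    }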

P10. High-Performance Memory Systems

As processors get faster and service more instructions per cycle, the processor-to-cache and cache-to-memory bandwidth requirements increase. Supercomputer memory systems with substantial interleaving are being considered for the workstations of the near future. Interleaved caches and new RAM technologies are currently being explored. The new Power2 architecture provides very high cache-to-memory bandwidth compared to other microprocessor systems, and in David Bailey's Supercomputing '93 paper you can really see the payoff for scientific computing. Study the trade-offs in this design space. Caution: you cannot study the problem on toy programs or wimpy processors.

P11. Novel cache policies

Most caches implement some variant of an LRU policy. Work on optimal cache policies shows that sometimes it is necessary to throw away the most recently used item. Machines targeted at scientific programming generally do not even have caches, because large amounts of data must be ``streamed'' through the processor, exhibiting little locality. Perhaps there should be a kind of inbound purgatory, where items prove their worth by being accessed more than once before entering the cache. Many have suggested explicit control over caching.
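
As one possible reading of the ``purgatory'' idea, here is a sketch (entirely hypothetical, with an arbitrary filter size) in which a block is allocated in the cache only on its second reference; the first reference merely records the block in a small filter table and bypasses the cache:

    /* Sketch of a two-touch admission policy.  The filter remembers recently
     * seen block numbers; a block must be seen twice before it may be cached. */
    #include <stdio.h>

    #define LINE_SIZE   32
    #define FILTER_SIZE 256           /* arbitrary purgatory size */

    static unsigned long filter[FILTER_SIZE];
    static int filter_valid[FILTER_SIZE];

    /* Returns nonzero if this reference should allocate a line in the cache. */
    static int admit(unsigned long addr)
    {
        unsigned long block = addr / LINE_SIZE;
        unsigned long slot = block % FILTER_SIZE;

        if (filter_valid[slot] && filter[slot] == block)
            return 1;                 /* second touch: admit to the cache */
        filter[slot] = block;         /* first touch: remember and bypass */
        filter_valid[slot] = 1;
        return 0;
    }

    int main(void)
    {
        unsigned long addr, admitted = 0, total = 0;

        while (scanf("%lx", &addr) == 1) {
            total++;
            admitted += admit(addr);
        }
        printf("%lu of %lu references would be allocated\n", admitted, total);
        return 0;
    }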

P12. Trace Compaction

Although the SPEC benchmark set offers a fairly limited basis for design decisions, an address trace from this suite would be extremely large. A wealth of literature is available on trace compaction techniques. Is it possible to produce a very compact trace set that accurately predicts SPEC performance? What new tricks can be played?
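
As a strawman baseline only (not one of the published compaction techniques), the filter below rewrites a trace of hexadecimal addresses as deltas from the previous reference; the deltas are mostly small and compress far better with a standard compressor downstream:

    /* Trivial trace compactor: emit the first address in full, then signed
     * deltas.  Intended as a baseline to compare real compaction schemes
     * against, e.g.:  cat trace | ./deltas | compress > trace.Z */
    #include <stdio.h>

    int main(void)
    {
        unsigned long addr, prev = 0;
        int first = 1;

        while (scanf("%lx", &addr) == 1) {
            if (first) {
                printf("%lx\n", addr);    /* base address */
                first = 0;
            } else {
                printf("%ld\n", (long)(addr - prev));
            }
            prev = addr;
        }
        return 0;
    }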

P13. Network RAM

Currently, DRAM accesses cost 25 to 100 instruction times and transfer at about half a gigabyte per second, while a disk access costs roughly one million instruction times and transfers at a megabyte per second. With good network interfaces we can obtain access times of 100 to 1000 instruction times and a transfer rate of 10 to 100 Megabytes per second. Clearly, it is interesting to consider paging into the other memories of workstations on the network. How do you do it? What is the win? What is the block size? The policy? How is interactive performance impacted? Address translation?
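
A back-of-envelope comparison using the figures above helps frame the win; the page size and the assumption of roughly 10 ns per instruction time are mine, and the access and bandwidth numbers are simply the ranges quoted in the paragraph:

    /* Rough page-fetch costs in instruction times for network RAM vs. disk,
     * using the ranges quoted above.  4 KB pages and 10 ns per instruction
     * time are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double instr_ns = 10.0; /* assumed instruction time */
        const double page = 4096.0;   /* assumed page size in bytes */
        struct level { const char *name; double access; double mb_per_s; };
        struct level levels[3] = {
            { "network RAM, best case ", 100.0, 100.0 },
            { "network RAM, worst case", 1000.0, 10.0 },
            { "local disk              ", 1.0e6, 1.0 }
        };
        int i;

        for (i = 0; i < 3; i++) {
            double xfer = page / (levels[i].mb_per_s * 1.0e6) * 1.0e9 / instr_ns;
            printf("%s: ~%.0f instruction times per page\n",
                   levels[i].name, levels[i].access + xfer);
        }
        return 0;
    }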

Superscalar, Superpipeline

P14. Superscalar instruction set design

Load/Store instruction set architectures provide an elegant abstraction of a form of machine organization, with a simple pipeline centered on a large register file. Several new machines have been announced, or are about to be, that issue multiple instructions per cycle. The current round of products strives for compatibility with previous-generation one-instruction-per-cycle RISC machines. This requires the hardware to detect hazards by analysis of register usage, on the fly. Perhaps we have returned to a mismatch of instruction set and machine organization. What is the best way of representing programs for machines that can perform a small number of operations per cycle? There is a long history of long and very-long instruction word machines, where each instruction specifies several operations explicitly. Dataflow ideas are also well developed, in which an instruction explicitly identifies the instructions that are to receive its result. A recent machine developed by Burton Smith, the Tera, allows the compiler to inform the instruction dispatcher about independence via a look-ahead field. The RS/6000 may play a similar game. Consider how variations on current instruction set architectures could improve performance and simplify implementation of superscalar machines.

The previous issue could be generalized still further; what is the optimal representation of programs with a certain degree of operation level parallelism? The first step is understanding what cost metric should be applied.

P15. Superscalar instruction fetching

(Suggested by David Wood) Investigate combining AMD's branch target cache with CRISP's decoded instruction cache. Each block could contain a tag, the next branch target, and one or more instructions at the branch target. The next branch target could be dynamically filled. Instructions could be placed in the buffer out of order to aid superscalar decode. Many issues here. (Might want to consider explicit representation of dependences in the internal form.)

P16. Instruction-cache organization for Superscalar Architectures

(Suggested by Tzi-cker Chiueh) The increasing processing bandwidth offered by superscalar architectures won't be used efficiently if the instruction supply bandwidth cannot keep up. Instruction memory organization has been extensively studied in the context of scalar RISC architectures. Measure the instruction bandwidth requirement for superscalar machines and propose new architectural solutions to alleviate this problem. Accompanying this issue is the branch-handling problem, which becomes increasingly important as the effective basic-block size shrinks relative to the growing instruction-level parallelism.

P17. Sophisticated Branching for Superscalar Machines

(Suggested by Steve Krueger, T.I.) In RISC processors a pipeline break is relatively costly. In superscalar RISC processors that cost is effectively multiplied by the number of instructions that execute simultaneously. It is therefore desirable to increase the length of runs of instructions without branches. Some old architectures had SKIP instructions that were used extensively. It seems that SKIP instructions (possibly with restrictions on what could be skipped) could give conditional execution without breaking the pipeline, through the use of SQUASH or KILL hardware already in place due to the needs of exception processing. Study whether the number of cases where SKIP instructions could be used effectively is great enough to make them useful. Skip-n forms might extend the usefulness by allowing more than one instruction to be skipped.

Several variants of this idea could be considered, including multiway branches, conditional moves, operators to avoid branching (e.g., max, min, abs), and conditional operators.

P18. Superpipelining

Develop a VLSI cost model in which to assess the tradeoffs in interleaving various portions of a superpipelined processor. For a given technology, is there an ``optimal organization''?

Vector Processing

P19. Vector cache organization

What kind of organizational changes should be made to make caches actually work well with vector codes? For example, does an interleaved cache make any sense?

P20. Vector nodal performance on MPPs

Several recent MPPs (the Paragon, the CM-5, and the CS-2) have unique vector support on each node. Evaluate the nodal and global vector performance along the lines that have been published for the traditional vector processors.

P21. Programmable Address Generators for Vector Processors

Vector operations are driven by hardwired address sequence generators, the most popular being an arithmetic sequence from a given base, with a fixed stride, and a certain length. More sophisticated generators are needed for sparse matrix operations, including gather and scatter. These are certainly not the only kinds of data sequences we would like to combine efficiently. Array processors and modern DSP chips have a wide instruction word, so the address generation is explicitly programmed into the inner loop. Another approach that has been proposed is programmable address generators. A generalized vector operation would involve specifying the address generation for each of the sequences and the pointwise data transformation. The arithmetic part runs at whatever speed the memory system can supply the correct data. Study how this technique could be applied to achieve ``greater vectorizability'' in, say, the Livermore loops. What machine organizational advantages/disadvantages does it present?
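
The sketch below is one way to picture the idea in software (a real design would put the generators in hardware or microcode, and the particular generators and element types here are arbitrary): each operand of a generalized vector operation carries its own index-generation function.

    /* Generalized vector add in which each operand's index sequence comes
     * from a programmable generator.  Illustrative only; the generators would
     * be hardware sequencers, not C function pointers, in a real machine. */
    #include <stdio.h>

    typedef long (*idx_gen)(long i);  /* maps element number to array index */

    static long unit(long i)    { return i; }        /* stride 1 */
    static long stride2(long i) { return 2 * i; }    /* fixed stride 2 */

    static long gather_ix[4] = { 3, 1, 4, 1 };
    static long gathered(long i) { return gather_ix[i]; }   /* gather */

    /* c[gc(i)] = a[ga(i)] + b[gb(i)] for i = 0..n-1 */
    static void vadd(long n, double *a, idx_gen ga, double *b, idx_gen gb,
                     double *c, idx_gen gc)
    {
        long i;
        for (i = 0; i < n; i++)
            c[gc(i)] = a[ga(i)] + b[gb(i)];
    }

    int main(void)
    {
        double a[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
        double b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        double c[4];
        long i;

        vadd(4, a, stride2, b, gathered, c, unit);   /* c[i] = a[2i] + b[ix[i]] */
        for (i = 0; i < 4; i++)
            printf("c[%ld] = %g\n", i, c[i]);
        return 0;
    }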

Alternatively, how well do aggressive superscalar designs capture vector loops?

P22. Vector Processor Design Tools

Develop a well-instrumented, flexible vector SPIM (and code generator?). Memory system analysis is particularly important in this domain. This might serve as a basis for determining better vector register organizations, and so on. (Some previous projects have taken some big steps on this.)

Parallelism

P23. Multiple processors on a Chip

If you allow yourself a transistor budget of 10 to 50 million transistors on a chip, you have plenty of room for innovative designs with multiple processors on a chip. This raises a host of interesting questions. Should caches be shared or dedicated to individual processors? Should floating point units be shared or dedicated? What is the trade-off between having more, simpler processors versus fewer, more sophisticated ones? In some sense this is trading instruction-level parallelism against process-level parallelism. The most serious bottleneck is going to be pin bandwidth in and out of the chip. How can you minimize the bandwidth requirement? Are new protocols required? There are several ways to frame studies in this context. You may want to look at multiple independent processes, as in a workstation with many open windows, or a small shared-memory multiprocessor design, or you may want to look at this as a component in a massively parallel machine.

P24. Architectural analysis of parallel languages

There are quite a number of parallel applications being developed in Split-C within the department. The Splash and NAS benchmarks are widely circulated as well. It is quite likely that the instruction frequencies, memory usage characteristics, I/O usage, etc. for such parallel programs are quite different from sequential programs. Develop or borrow tools, measure and compare. Trace analysis for parallel programs is an interesting problem. Cache behavior is also likely to differ significantly from that observed on sequential programs.

P25. Fast Communication Layers for novel Parallel Machines

A current hot topic in the parallel computing arena is how to obtain low-overhead communication. The CM-5 Active Message layer, developed at Berkeley by Thorsten von Eicken, is something of a de facto standard. An earlier version ran on the nCUBE. Rich Martin has developed an Active Message layer for a cluster of HP workstations with a special network interface. It would be very valuable to construct a layer of similar quality for the Intel Paragon, Meiko CS-2, or ATM-based networks. A component of this, which could be a project all by itself, is to characterize the performance of dual processors on a cache-coherent bus where one is a compute processor and one is a message processor.

P26. Language implementation

A great way to learn about architectures is to implement a programming language on them. You really see what things cost, what you can use, and what you can not. Split-C is built as an extension to GCC for distributed memory multiprocessors. Currently it runs on the CM-5, Intel Paragon, and some workstation clusters. Two interesting machines with novel hardware support for communication and global access are the Cray T3D and the Meiko CS-2. Implementing Split-C on these machines will involve a modest amount of coding and a great deal of architectural understanding. Other candidates are the IBM SP-1 and true shared memory MPs, such as the KSR and Dash.

P27. Abstract Parallel Machine

In a recent paper the LogP model was formulated to capture the critical performance characteristics of modern multiprocessors as a basis for algorithm design. We have some preliminary measurements on the CM-5. It would be very interesting to characterize a variety of parallel machines and try to test the model predictions on programs.
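
For reference, the model's four parameters are L (network latency), o (per-message send/receive overhead), g (minimum gap between messages from one processor), and P (number of processors). The sketch below shows the kind of first-order prediction one would test against measurements; the parameter values are placeholders, not measurements of any real machine, and g >= o is assumed:

    /* First-order LogP predictions.  L, o, g are in microseconds and are
     * placeholders; the send rate is assumed to be limited by the gap g. */
    #include <stdio.h>

    int main(void)
    {
        double L = 6.0, o = 2.0, g = 4.0; /* hypothetical machine parameters */
        int k = 100;                      /* messages sent back to back */

        printf("one small message, end to end: %.1f us\n", o + L + o);
        printf("last of %d back-to-back messages arrives at: %.1f us\n",
               k, (k - 1) * g + o + L + o);
        return 0;
    }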

Multiprocessors

P28. Software-based Cache Coherency Scheme

Some believe software-based cache coherency schemes are the only feasible solution for large-scale multiprocessors. Currently there are a few proposals in the literature that try to solve this problem. All of these are based on some kind of single-assignment memory semantics for shared-write data. Study these algorithms and evaluate them either analytically or empirically through simulation.

P29. Hardware support for parallel distributed debugging

Investigate hardware support for logging, replay, and synchronization for multiple threads of the same program.

P30. Characterizing communication and sharing in multiprocessors

In current bus-based multiprocessors, interprocessor communication takes the form of cache misses. Thus, several issues get folded into a single number --- the miss rate. Some good work has been done to try to characterize sharing in terms of modern cache organization, but there remain many unanswered questions. Some of these may only be answered by inventing new analysis techniques. Certainly it will be hard to get useful data. The reference string generated by each of the processors depends on how work is scheduled onto processors. This is generally done by allowing the processors to contend for various scheduling data structures. Thus, the schedule is somewhat dependent on the memory system. Exploring design variations relative to a fixed trace ignores this feedback. A thorough study needs to be done on the sensitivity or robustness of multiprocessor address traces. (A few other concerns have been raised, such as the number of shared references in available traces.)

P31. What is the minimum cache miss rate due to communication?

As uniprocessor caches get larger, the miss rate approaches the initial load cost (compulsory miss rate). It would seem that multiprocessor caches should tend toward the compulsory miss rate plus a communication factor. How would an optimal cache perform?

Snoopy caches provide communication and replication of data. Replication is what causes the coherence headaches. How does the miss rate (or communication rate) decrease with degree of replication?

P32. Multiprocessor Design Tools

Develop a well-instrumented, flexible shared-memory or distributed-memory multiprocessor simulator. Cache simulation is particularly important in this domain. How is the network really used?

P33. Communication Patterns in Modern Network Multiprocessors

Study actual intercommunication patterns in available multiprocessors. How is the network really used? What could be done to improve communication performance?

High Performance NOWs

Busses, interfaces, routers.

Network error characteristics. What is the relationship between network quality and the cost of ``semantic transparency''?

Characterize the effect of parallel file transfers.

Characterize peak point loads and average loads on a NOW.

Serious evaluation of ATM. Fibre Channel? SCI? Measure real machines.

Optical networks

P34. Smart Valley

There is a megaproject brewing to build an advanced network giving direct multimedia access to institutions and individuals throughout the greater Bay Area. This hinges on developing a standard representation for information of a very general nature. What are the existing alternatives? What are the criteria for judging such a standard? What are the architectural implications for machine-independent exchange of information?

Multithreading

P35. Multithreaded Processor/Cache Design

In view of the ever-increasing gap between processor speed and memory latency, an architecture can either reduce or tolerate this mismatch. Caching is a scheme to reduce this gap by replicating data. Multithreading, on the other hand, aims at tolerating it. The central idea is to apply the context-switching idea at the instruction-by-instruction level. Suppose several process states are stored on the processor; whenever a process encounters a cache miss, the processor switches to serve another process, so the throughput of the processor is not reduced. This scheme is particularly attractive in a multiprocessor configuration, where memory latency is a major problem. Study the design tradeoffs of this kind of architecture, especially focusing on the structure of cache memory design.

(A previous study made some progress on this, but the results were inconclusive due to an interesting methodological issue. The right question may be, ``When is multithreading more cost-effective than adding a second-level cache?'')
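
A crude back-of-envelope (not the model cited in P36, and with made-up numbers) illustrates the shape of the trade-off: each context does R cycles of useful work between misses, a miss takes T cycles to service, and a switch costs C cycles.

    /* Processor utilization versus number of hardware contexts, under a very
     * simple steady-state argument: a thread's miss is overlapped by the
     * other n-1 threads, each contributing R cycles of work plus a C-cycle
     * switch.  All parameter values are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        double R = 40.0;              /* assumed run length between misses */
        double T = 100.0;             /* assumed miss service time */
        double C = 4.0;               /* assumed context switch cost */
        int n;

        for (n = 1; n <= 8; n++) {
            double u;
            if ((n - 1) * (R + C) >= T)
                u = R / (R + C);              /* enough contexts: saturated */
            else
                u = n * R / (R + C + T);      /* latency-bound region */
            printf("%d contexts: utilization ~ %.2f\n", n, u);
        }
        return 0;
    }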

P36. Validating Models of Multithreading

In a recent paper, Saavedra-Barrera, Culler, and von Eicken propose a simple analytical model of multithreading and cache behavior under this model. Try to validate or disprove this model empirically. (Anant Agarwal has proposed a related model with a more sophisticated network component and a more primitive processor component.)

Symbolic Computing: Prolog, Lisp, etc

P37. Instruction Level Characterization of Symbolic Programs

Given modern compilation techniques for languages like Lisp, ML, Miranda, Prolog, and Smalltalk for current processors, perform a detailed study to determine how the requirements of this class of programs differ from those of the C and Fortran programs provided in the text. (Do type inference and significant use of higher-order functions matter? What about lazy evaluation?)

P38. Read barrier

(Suggested by Doug Johnson) Under a generation-based garbage collector, a record must be kept of pointers from big-old-space to little-new-space, so that the little-new-space can be scanned without crawling over the big-old-space. One means of achieving this is to trap when a new-space pointer value is stored into an old-space location --- a write barrier. This is not too hard to implement efficiently. In an upcoming paper, Doug Johnson argues for a read barrier as well. This is more challenging since it is on the critical path of the memory interface; doing it with low impact on performance and (ideally) little or no impact on the instruction set would be ``kind of interesting''.
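
For contrast, a software write barrier is simple enough to sketch in a few lines (the space layout and remembered-set representation below are invented for illustration); it is the read barrier that has no comparably cheap software form, which is what motivates hardware support:

    /* Sketch of a software write barrier for a two-generation collector.
     * Every pointer store goes through write_field(); stores that create an
     * old-space -> new-space pointer are logged in a remembered set so new
     * space can be scavenged without scanning old space. */
    #include <stdio.h>

    #define SPACE_WORDS    1024
    #define REMEMBERED_MAX 4096

    static void *old_space[SPACE_WORDS];  /* stand-in for the old generation */
    static void *new_space[SPACE_WORDS];  /* stand-in for the new generation */

    static void **remembered[REMEMBERED_MAX];
    static int remembered_n;

    static int in_old(void *p)
    {
        return p >= (void *)&old_space[0] && p < (void *)&old_space[SPACE_WORDS];
    }

    static int in_new(void *p)
    {
        return p >= (void *)&new_space[0] && p < (void *)&new_space[SPACE_WORDS];
    }

    /* The barrier: do the store, then log it if it made an old->new pointer. */
    static void write_field(void **slot, void *value)
    {
        *slot = value;
        if (in_old((void *)slot) && in_new(value) && remembered_n < REMEMBERED_MAX)
            remembered[remembered_n++] = slot;
    }

    int main(void)
    {
        write_field(&old_space[7], &new_space[3]);   /* logged */
        write_field(&new_space[1], &new_space[2]);   /* intra-new: not logged */
        printf("remembered set holds %d entries\n", remembered_n);
        return 0;
    }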

P39. Effect of tagging support for Prolog architectures

(Suggested by Peter Van Roy) Study the effects on architectural support for tagging when the ``tag hoisting'' transformation proposed by Saumya Debray is used to reduce the number of required tag operations. Debray has a paper explaining this transformation. The project would entail implementing this transformation for BAM code output by the Aquarius compiler and using the VLSI-BAM simulation tools to study its effect. Some people have mentioned contradictory intuitions that either (1) tagging would be reduced significantly, or (2) it has little effect on the inner loops of programs. It would be nice to get a solid answer to this question.

P40. Effect of type derivation for Prolog architectures

(Suggested by Peter Van Roy) Using a dataflow analyzer that is able to infer a rich set of types in Prolog programs, study its effect on architectural support for dynamic typing. It is expected that the support for tagging will be less useful, but it is not known by how much. The major effort in this project is to either implement such a dataflow analysis, or else annotate programs semi-manually with types that such an analysis could infer (justifying the derived types, of course), or else determine the actual types occurring empirically and annotate with those types. The latter would give a best-case result. The compilation and measurement tools for this project already exist.

P41. Prolog performance on RISC's

(Suggested by Bruce Holmer) The Aquarius Prolog system (Peter's compiler and Ralph Haygood's runtime system) is at a point where it could be used to do a comparative study of Prolog performance on the popular RISC architectures. Do a detailed study comparing the performance of the different machines on a set of large Prolog benchmarks. It may be that differences in the instruction sets/memory architectures could have non-trivial consequences on performance. For example, how is performance affected by the presence/absence of annulling for branches, how strong is the dependence on the cost of memory reads/writes (including block reads and writes), and perhaps one of the architectures allows a clever way of doing the tag manipulations. This project would require some familiarity with Prolog, since the translation from compiler intermediate code to assembly language is done with a Prolog program. There would also be a period of tweaking the assembly language output, perhaps using some peephole optimization rules. There will be plenty of support from our group, and the MIPS code generator could be used to help guide one through the process of writing the new code generator.

Miscellany

P42. Architectures for Note Pads

(Suggested by David Wood) Many researchers are interested in portable note pad computers that have touch-sensitive screens and wireless communication. These machines require fast processing, to do handwriting and possibly speech analysis, but have very limited power and weight budgets. In addition, the communications bandwidth is limited. How should the architecture of a note pad computer differ from a standard workstation architecture? What is the power consumption of current pipelines and how can it be reduced?

P43. Network instrumentation

(Suggested by Nick Wainwright (HP)) Current implementations of UNIX perform a kernel-to-user-space copy on network data. For large transfers this can be replaced by a remapping of the user buffer into kernel space. Instrument the network code to collect statistics on packet and user buffer length for various applications. Estimate the number of instructions required to remap or copy to determine whether remapping gives a performance improvement.

P44. Page Re-mapping

(Suggested by Greg Watson (HP)) Page re-mapping has been proposed as a way to reduce the number of data copies that are made when data is passed to another process. Two examples of this are network I/O and disk I/O: instead of copying the data into kernel space, the page is simply mapped out of user space and into kernel space. There are some hazards to consider when trying to implement this: data that is not page aligned, what happens when the user process tries to write to the page that has just been mapped out of its address space (there are several alternatives), and what if the re-mapping costs more than the copy itself? Where is the break-even point in terms of data size? How do current burst read/write cache architectures affect the decisions?
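
A back-of-envelope framing of the break-even question (all the per-instruction costs below are assumptions chosen only to make the shape of the trade-off visible, not measurements of any kernel):

    /* Copy cost grows with transfer size; re-map cost is roughly fixed per
     * page touched.  Both cost figures here are assumed, not measured. */
    #include <stdio.h>

    int main(void)
    {
        const double copy_instr_per_byte = 0.5;   /* assumed copy loop cost */
        const double remap_instr = 500.0;         /* assumed PTE update + TLB flush */
        const unsigned long page_bytes = 4096;
        unsigned long size;

        for (size = 256; size <= 65536; size *= 2) {
            unsigned long pages = (size + page_bytes - 1) / page_bytes;
            double copy = size * copy_instr_per_byte;
            double remap = pages * remap_instr;   /* ignores alignment hazards */
            printf("%6lu bytes: copy ~%6.0f instr, remap ~%6.0f instr  (%s)\n",
                   size, copy, remap, remap < copy ? "remap wins" : "copy wins");
        }
        return 0;
    }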

