Computer Science 252. Graduate Computer Architecture

Handout #3. Project Ideas, Fall 1995

P1. RISC Enhancements for Instruction Set Emulation

Older generations of popular instruction sets are successfully emulated on today's high end microprocessors (e.g., 80x86 on virtually every platform, 68K on the PowerPC, etc.). Choose a native instruction set and a target instruction set to be emulated. What small number of instructions would you add to the native instruction set to speed up the emulation under existing software packages like SoftPC? What speedup could you achieve?

A variation on this theme is motivated by the recent announcement that HP and Intel will collaborate on a new architecture that will be binary compatible with both the existing HP PA-RISC architecture and the x86 architecture in the latter part of this decade. There are rumors that this architecture will be based in some way on the VLIW (very long instruction word) concept. Consider how to extend the PA-RISC architecture to be a better substrate for x86 execution. How might you design a VLIW-based future architecture to be particularly good at emulating 80x86 programs?
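To get a feel for where the cycles go, the sketch below is a minimal emulator inner loop for a made-up two-instruction target subset; the opcodes, encodings, and flag layout are invented purely for illustration. The point is that the x86-style condition-flag update after every arithmetic operation is exactly the kind of hot spot a small native instruction-set extension (say, a single "compute x86 flags" instruction) might attack.

    /* Minimal emulator inner-loop sketch for a hypothetical two-instruction
     * target subset (ADD, JNZ).  Semantics: ADD rd, rs, imm does
     * reg[rd] += reg[rs] + imm and sets x86-style flags; JNZ is a relative
     * branch taken when ZF is clear. */
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_ADD, OP_JNZ, OP_HALT };        /* hypothetical decoded opcodes */

    typedef struct { uint8_t op, rd, rs; int32_t imm; } Insn;

    static uint32_t reg[8];
    static uint32_t flags;                   /* bit0=CF, bit6=ZF, bit7=SF, bit11=OF */

    /* Emulating the EFLAGS side effects of a 32-bit add costs several native
     * instructions per emulated instruction; a single "compute x86 flags"
     * native instruction could collapse this entire function. */
    static void set_add_flags(uint32_t a, uint32_t b, uint32_t r)
    {
        flags  = (r < a) ? 1u : 0u;                      /* CF */
        flags |= (r == 0) ? (1u << 6) : 0u;              /* ZF */
        flags |= (r >> 31) << 7;                         /* SF */
        flags |= ((~(a ^ b) & (a ^ r)) >> 31) << 11;     /* OF */
    }

    static void run(const Insn *code)
    {
        for (int pc = 0; ; pc++) {
            Insn i = code[pc];
            switch (i.op) {
            case OP_ADD: {
                uint32_t a = reg[i.rd], b = reg[i.rs] + (uint32_t)i.imm;
                reg[i.rd] = a + b;
                set_add_flags(a, b, reg[i.rd]);          /* the hot spot */
                break;
            }
            case OP_JNZ:
                if (!(flags & (1u << 6)))                /* branch if ZF clear */
                    pc += i.imm - 1;                     /* relative to the JNZ */
                break;
            case OP_HALT:
                return;
            }
        }
    }

    int main(void)
    {
        /* count down from 5: r1 = 5; do { r1 += -1; } while (r1 != 0); */
        Insn prog[] = { {OP_ADD, 1, 0, 5}, {OP_ADD, 1, 0, -1},
                        {OP_JNZ, 0, 0, -1}, {OP_HALT, 0, 0, 0} };
        run(prog);
        printf("r1 = %u, flags = 0x%03x\n", reg[1], flags);
        return 0;
    }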

P2. TCP/IP and/or I/O Accelerator Architectures

Since network access continues to be a bottleneck in many workstation applications, it would be highly desirable to have the workstation architecture support network protocol processing as efficiently as possible. Choose a workstation system design for investigation, and explore the space of possible design options for accelerating the processing of the TCP/IP protocol suite. Possibilities include new instructions, a protocol coprocessor, smarter DMA processing, or a new processor/I/O controller interface.
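As one concrete starting point, the ones-complement Internet checksum (RFC 1071) must be computed over every TCP segment and touches every byte of the payload; the reference version below (a standard formulation, not taken from any particular protocol stack) is the sort of kernel you would compare against a new instruction, a checksumming DMA engine, or a protocol coprocessor.

    /* Ones-complement Internet checksum (RFC 1071): a per-byte cost on every
     * TCP segment, and thus a concrete target for acceleration. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    uint16_t inet_checksum(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t sum = 0;

        while (len > 1) {                    /* sum 16-bit words */
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len == 1)                        /* pad a trailing odd byte */
            sum += (uint32_t)p[0] << 8;

        while (sum >> 16)                    /* fold carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint8_t segment[1460];               /* one Ethernet-sized payload */
        for (size_t i = 0; i < sizeof segment; i++)
            segment[i] = (uint8_t)i;
        printf("checksum = 0x%04x\n", inet_checksum(segment, sizeof segment));
        return 0;
    }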

It has been observed that the bottleneck in network and I/O systems is the need to move the packets or I/O data from interfaces or controllers over a system bus optimized for processor-memory interactions rather than controller-memory interactions. Propose and evaluate schemes for circumventing the intrinsic bottleneck of the traditional system backplane, and use them to accelerate network and I/O performance in workstation systems. You might even consider how an approach that solves both problems may form the basis of a really high performance network file server.

For example, the Berkeley RAID server was built around a high performance crossbar interconnect that linked the control processor, I/O devices, and network interfaces to a single high speed memory system, making it very fast to stream data between disks and network interfaces.

P3. Software-based Cache Coherency Scheme

Increasingly sophisticated multiprocessor cache coherency schemes have been proposed, but these can become quite difficult to implement correctly as hardware finite state machines. A possible approach to reducing implementation complexity is to develop software-based schemes for cache coherency: the hardware FSMs handle the common cases, trapping to software for the less common cases. This has the advantage of keeping the cache control cycle time short for the most frequent cases. It would be interesting to evaluate the proposed schemes to see how well they perform on real multiprocessor workloads. The study could be pursued either with an analytical model or with one based on simulation.
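The rough sketch below illustrates the intended split, using simple MSI states; which transitions count as "rare" is an assumption made for illustration, and the C function merely stands in for a coherence trap handler.

    /* Sketch of the split between a hardware cache-coherence FSM and a
     * software trap handler, using simple MSI states. */
    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } State;
    typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_WRITE } Event;

    /* Infrequent case handled in software: another processor touches a line
     * we hold MODIFIED, so we must write it back and downgrade or invalidate.
     * In real hardware this would raise a coherence trap. */
    static State coherence_trap(State s, Event e)
    {
        (void)s;
        printf("  [software handler] writeback, then %s\n",
               e == BUS_READ ? "downgrade to SHARED" : "invalidate");
        return e == BUS_READ ? SHARED : INVALID;
    }

    /* Common cases resolved directly by the "hardware" FSM. */
    static State coherence_fsm(State s, Event e)
    {
        switch (s) {
        case INVALID:
            if (e == CPU_READ)  return SHARED;
            if (e == CPU_WRITE) return MODIFIED;
            return INVALID;
        case SHARED:
            if (e == CPU_WRITE) return MODIFIED;   /* upgrade; invalidate sharers */
            if (e == BUS_WRITE) return INVALID;
            return SHARED;
        case MODIFIED:
            if (e == BUS_READ || e == BUS_WRITE)
                return coherence_trap(s, e);       /* rare: trap to software */
            return MODIFIED;
        }
        return s;
    }

    int main(void)
    {
        State s = INVALID;
        Event trace[] = { CPU_READ, CPU_WRITE, BUS_READ, CPU_READ };
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            s = coherence_fsm(s, trace[i]);
            printf("after event %u: state %d\n", i, (int)s);
        }
        return 0;
    }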

P4. Validation of SPECmark95

The new SPECmark95 benchmarks were announced in the middle of August. Run these programs on as many OLD and NEW machines as you can find in the department, and develop a complete report of the new SPECmark ratings. How well do the new SPECmarks track previously reported SPEC89 and SPEC92 performance? In particular, has the relative ordering of machine performance changed with the choice of a new benchmark set? Make sure to use both optimized and unoptimized versions of the benchmark sets.
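Recall that SPEC summarizes a machine as the geometric mean of the per-benchmark ratios of reference time to measured time, so a re-ranking between suites can come either from the programs themselves or from the reference machine. The sketch below computes that summary number; the run times in it are made up.

    /* SPEC-style rating: geometric mean of (reference time / measured time)
     * over the benchmark suite.  The times below are invented examples. */
    #include <math.h>
    #include <stdio.h>

    double spec_ratio_geomean(const double *ref, const double *meas, int n)
    {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(ref[i] / meas[i]);
        return exp(log_sum / n);
    }

    int main(void)
    {
        /* hypothetical reference and measured run times, in seconds */
        double ref[]  = { 9600.0, 5100.0, 2800.0, 7300.0 };
        double meas[] = {  120.0,   64.0,   41.0,   88.0 };
        printf("SPEC-style rating: %.1f\n", spec_ratio_geomean(ref, meas, 4));
        return 0;
    }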

A related issue is how SPEC performance varies with other system parameters, such as cache size or processor-memory bandwidth. Reverse engineer the new set of SPEC benchmark programs to analyze how amenable they are to acceleration through variation of system level parameters. You can bet that there are many engineers in industry doing this as you read this suggestion!

P5. Value of Secondary Caches

The microprocessor of the year 1998 will probably have on the order of 10 million transistors. What are the implications for the system organization, and for the memory hierarchy in particular? For example, this level of integration could easily support on-chip caches on the order of 64-128 KBytes (or even more). Caches of these sizes are likely to result in very high hit rates, eliminating all but a small number of misses beyond those that are compulsory. Is a second level cache STILL worthwhile? How large/fast does it have to be to make it a worthwhile addition to a system at the turn of the century? How could you exploit new memory organizations, like Synchronous DRAM, to accelerate the Level 2 cache?
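A back-of-envelope average memory access time (AMAT) model is a reasonable place to start; in the sketch below, all of the latencies and miss rates are assumptions to be replaced with measured or simulated values.

    /* AMAT model for deciding whether a level-2 cache still pays off behind a
     * large on-chip cache.  All numbers are assumptions, not measurements. */
    #include <stdio.h>

    /* AMAT = L1 hit time + L1 miss rate * (L2 hit time + L2 miss rate * memory) */
    static double amat_with_l2(double l1_hit, double l1_miss_rate,
                               double l2_hit, double l2_miss_rate,
                               double mem_latency)
    {
        return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency);
    }

    static double amat_no_l2(double l1_hit, double l1_miss_rate, double mem_latency)
    {
        return l1_hit + l1_miss_rate * mem_latency;
    }

    int main(void)
    {
        double l1_hit = 1.0;     /* cycles, assumed */
        double mem    = 60.0;    /* cycles to main memory, assumed */

        /* Sweep the L1 miss rate downward: the question is where the level-2
         * cache stops mattering. */
        for (double m1 = 0.05; m1 >= 0.005; m1 /= 2.0) {
            double with    = amat_with_l2(l1_hit, m1, 8.0, 0.25, mem);
            double without = amat_no_l2(l1_hit, m1, mem);
            printf("L1 miss %.3f:  no L2 = %5.2f cycles,  with L2 = %5.2f cycles\n",
                   m1, without, with);
        }
        return 0;
    }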

P6. A Benchmark Set for Intelligent Information Appliances

Traditional CPU benchmarks have been influenced by the kinds of applications people typically ran on workstations in the late 1980s: CAD and scientific applications that were computationally intensive. Formulate a new benchmark set appropriate for the applications of the coming decade:

Collect real programs for your benchmark set and run them on a variety of different machines and system configurations. Analyze the sources of the performance differences observed in your benchmarking effort. Be careful to consider non-processor-centric elements of the system, like graphics accelerators, cache size, etc.

P7. A Benchmark Set for Scientific Application I/O

In many scientific applications, such as General Circulation Models (GCMs) for climate modeling, there is a growing realization that overall application performance is becoming dictated by I/O performance. Collect application kernels that can be used to characterize the I/O access patterns of important classes of scientific applications. Compare the I/O performance predicted by these kernels for two different classes of machines, like a massively parallel machine (e.g., CM-5) and a vector supercomputer (e.g., Cray).
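A tiny kernel along the following lines, contrasting sequential and strided reads, suggests the flavor of what such a benchmark might contain; the file name, size, and stride are arbitrary assumptions.

    /* A tiny I/O kernel contrasting sequential and strided reads, as a
     * stand-in for the access patterns a GCM-style application generates. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define RECORD   4096L
    #define NRECORDS 4096L       /* 16 MB working file, assumed */

    static void make_file(const char *path)
    {
        FILE *f = fopen(path, "wb");
        char  rec[RECORD] = {0};
        for (long i = 0; i < NRECORDS; i++)
            fwrite(rec, 1, RECORD, f);
        fclose(f);
    }

    static double read_pattern(const char *path, long stride)
    {
        FILE   *f  = fopen(path, "rb");
        char    rec[RECORD];
        clock_t t0 = clock();      /* rough CPU-time proxy for the sketch */

        /* visit every record once, in strides of `stride` records */
        for (long start = 0; start < stride; start++)
            for (long i = start; i < NRECORDS; i += stride) {
                fseek(f, i * RECORD, SEEK_SET);
                fread(rec, 1, RECORD, f);
            }
        fclose(f);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        make_file("iokernel.dat");
        printf("sequential: %.2f s\n", read_pattern("iokernel.dat", 1));
        printf("strided x8: %.2f s\n", read_pattern("iokernel.dat", 8));
        remove("iokernel.dat");
        return 0;
    }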

P8. Finding the Optimal Performance/Power Tradeoff in Microprocessors

Existing microprocessor families obtain low(er) power implementations by using some combination of the following implementation and organizational approaches:

The general goal is to reduce internal and external capacitances, rather than seek a smaller die size for reduced cost or implementation complexity (though these may be positive side benefits).

Develop an abstract power/performance model that attempts to quantify the tradeoffs inherent in the strategies outlined above. Use it with one microprocessor family (x86, PowerPC, Sparc, etc.) to design the abstract organization of a processor in that family that achieves the "best" performance per watt. Validate your model against some real microprocessors.
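One plausible starting point is the standard dynamic-power relation P = a*C*V^2*f combined with a crude performance model (IPC times clock rate); every number in the sketch below is invented purely for illustration.

    /* Abstract dynamic-power model, P = a * C * V^2 * f, plus a crude
     * performance model (IPC * f), used to compare performance per watt
     * across hypothetical organizations.  All numbers are made up. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        double activity;    /* a: switching activity factor */
        double cap_nf;      /* C: total switched capacitance, nF */
        double vdd;         /* V: supply voltage, volts */
        double freq_mhz;    /* f: clock frequency, MHz */
        double ipc;         /* sustained instructions per clock */
    } Config;

    static double power_watts(const Config *c)
    {
        return c->activity * c->cap_nf * 1e-9 * c->vdd * c->vdd
               * c->freq_mhz * 1e6;
    }

    int main(void)
    {
        Config configs[] = {
            { "wide, fast, 3.3V",   0.3, 40.0, 3.3, 200.0, 1.8 },
            { "wide, slow, 2.5V",   0.3, 40.0, 2.5, 120.0, 1.8 },
            { "narrow, fast, 2.5V", 0.3, 25.0, 2.5, 200.0, 1.1 },
        };

        for (unsigned i = 0; i < sizeof configs / sizeof configs[0]; i++) {
            const Config *c = &configs[i];
            double watts = power_watts(c);
            double mips  = c->ipc * c->freq_mhz;      /* native MIPS */
            printf("%-20s  %6.2f W   %6.0f MIPS   %6.1f MIPS/W\n",
                   c->name, watts, mips, mips / watts);
        }
        return 0;
    }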

P9. Instruction-Level Approaches for Saving Power

Could you save significant amounts of power by reducing the instruction set, at some cost in performance? As a wild example, could you save enough power by removing the multiply unit from the datapath to offset the power cost of executing an iterative multiply operation in software, a la the early generation research RISC machines?
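For reference, the software fallback is the classic shift-and-add multiply shown below; comparing the energy of running this loop against the static and dynamic power of a hardware multiplier is exactly the kind of audit being proposed.

    /* Software multiply by shift-and-add, in the style of the multiply-step
     * support on early research RISC machines. */
    #include <stdint.h>
    #include <stdio.h>

    uint32_t soft_mul(uint32_t a, uint32_t b)
    {
        uint32_t product = 0;
        while (b != 0) {
            if (b & 1)            /* add the multiplicand for each set bit */
                product += a;
            a <<= 1;
            b >>= 1;
        }
        return product;
    }

    int main(void)
    {
        printf("1234 * 5678 = %u (expected %u)\n",
               soft_mul(1234, 5678), 1234u * 5678u);
        return 0;
    }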

What about an instruction-level power audit that examines each instruction in the instruction set for the power it consumes during operation? Given such information, it would be possible to develop a compiler that optimizes for power as an alternative to optimizing for computational performance.

P10. System Level Power Control Issues for Portable Devices

There have been a number of studies that have developed optimal schemes for managing disk drive spin-up/spin-down frequency in portable computers. Spinning down the disk saves power, but doing so reduces performance and adds significant start-up latency (and power demand spikes) when the disk must be spun back up.

Can you apply the same methodology to other aspects of the system-level design of portable devices, such as memory subsystems, displays, and network interfaces? You will need to quantify the power demands and behavior of these elements of the system and the degree to which they can support stand-by or sleep modes.
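The disk case reduces to a break-even analysis: spinning down pays off only when the idle gap exceeds the transition energy divided by the power saved. The sketch below works through that analysis with a simple threshold policy; the power numbers and idle-gap trace are assumptions, not measurements of any real drive.

    /* Break-even analysis for a threshold spin-down policy.  The same
     * structure applies to displays, DRAM banks, or network interfaces. */
    #include <stdio.h>

    #define P_IDLE   0.95   /* W, platter spinning but no transfers (assumed) */
    #define P_SLEEP  0.10   /* W, spun down (assumed) */
    #define E_CYCLE  6.0    /* J, energy to spin down and back up (assumed) */

    /* Energy over one idle gap of `gap` seconds under a policy that spins the
     * disk down after `threshold` seconds of inactivity. */
    static double gap_energy(double gap, double threshold)
    {
        if (gap <= threshold)
            return P_IDLE * gap;                     /* never spun down */
        return P_IDLE * threshold                    /* wait at full power */
             + P_SLEEP * (gap - threshold)           /* sleep for the rest */
             + E_CYCLE;                              /* pay the transition */
    }

    int main(void)
    {
        double breakeven = E_CYCLE / (P_IDLE - P_SLEEP);
        printf("break-even idle time: %.1f s\n", breakeven);

        /* hypothetical trace of idle gaps between disk requests, in seconds */
        double gaps[]       = { 2.0, 45.0, 3.0, 300.0, 8.0, 120.0 };
        double thresholds[] = { 0.0, 5.0, 30.0, 1e9 };   /* 1e9 ~ never spin down */

        for (unsigned t = 0; t < sizeof thresholds / sizeof thresholds[0]; t++) {
            double joules = 0.0;
            for (unsigned g = 0; g < sizeof gaps / sizeof gaps[0]; g++)
                joules += gap_energy(gaps[g], thresholds[t]);
            printf("threshold %10.0f s: %6.1f J over the trace\n",
                   thresholds[t], joules);
        }
        return 0;
    }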

P11. Hardware Support for Hot Java

Many people believe that the killer app for the late 1990's is World Wide Web access. A method for shipping new functionality across the network for interpretive execution in web browsers has been developed by Sun Microsystems, and is called Hot Java. In many respects, the goals of Hot Java are not very different from those of the early Lisp machines: high performance for an interpretive environment. Evaluate the architectural techniques developed for earlier forms of dynamic languages, especially those that provide dynamic type checking and that support ``safe'' execution, and see how they might speed up the performance of real Java programs.
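The sketch below shows the tag-check overhead that Lisp-machine-style tagged hardware was designed to remove: every arithmetic operation in a dynamically typed interpreter first checks runtime tags and traps on a mismatch. The value representation here is a made-up illustration, not Java's actual object model.

    /* Tag-check overhead in a dynamically typed interpreter: each operand
     * carries a runtime tag, and mismatches trap to a handler. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { TAG_INT, TAG_REF } Tag;

    typedef struct {
        Tag tag;
        union { long i; void *ref; } u;
    } Value;

    static void type_trap(const char *op)
    {
        fprintf(stderr, "type error in %s\n", op);
        exit(1);
    }

    /* In software the tag check costs a compare-and-branch per operand per
     * operation; tagged ALUs performed the check in parallel with the add. */
    static Value checked_add(Value a, Value b)
    {
        if (a.tag != TAG_INT || b.tag != TAG_INT)
            type_trap("add");
        Value r = { TAG_INT, { .i = a.u.i + b.u.i } };
        return r;
    }

    int main(void)
    {
        Value x = { TAG_INT, { .i = 40 } };
        Value y = { TAG_INT, { .i = 2 } };
        printf("%ld\n", checked_add(x, y).u.i);
        return 0;
    }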

If you are a fan of Tcl/Tk, you can examine architectural extensions that could help accelerate the execution of Tcl scripts.

P12. Memory System Support for Embedded Processors

Normally, architects (especially of RISC processors!) do not concern themselves too much with code size. However, for many embedded applications, from PDAs to game computers to set-top boxes, code size is a major consideration. For one of the popular RISC instruction sets, investigate alternative techniques for compacting code size in the memory system without sacrificing the benefits of RISC pipelining. Here are some ideas that could be pursued:

Identify a target set of programs for analysis (the SPEC benchmarks are probably not indicative of what embedded programs typically do). What are the performance/code size tradeoffs?

P13. Comparison of Branch Prediction Schemes

Since the widespread introduction of superscalar execution, branch prediction schemes have become increasingly important in processor architectures. Even if mispredicted branches occur infrequently, a bad guess is expensive when you consider that the branch delay penalty (typically two cycles) is multiplied by the number of instructions simultaneously being initiated (up to four in current generation designs!).

Quite a few schemes have been proposed: static prediction (always not taken, always taken), compiler hints, and branch history mechanisms. In addition to branch histories, recently predicted branch target addresses can be cached in a Branch Target Address Cache, which can be accessed in parallel with instruction fetch.

Using the new SPECmark95 benchmark programs, evaluate the effectiveness of the various schemes for branch prediction.
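As a calibration point, the sketch below simulates the classic 2-bit saturating-counter branch history table on a synthetic trace; the table size and the trace itself are arbitrary, and driving it from SPECmark95 traces is the actual project.

    /* A 2-bit saturating-counter branch history table on a synthetic trace. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_BITS 10
    #define TABLE_SIZE (1u << TABLE_BITS)

    static uint8_t bht[TABLE_SIZE];   /* 0,1 = predict not taken; 2,3 = taken */

    static int predict(uint32_t pc)
    {
        return bht[(pc >> 2) & (TABLE_SIZE - 1)] >= 2;
    }

    static void update(uint32_t pc, int taken)
    {
        uint8_t *c = &bht[(pc >> 2) & (TABLE_SIZE - 1)];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
    }

    int main(void)
    {
        /* synthetic trace: a loop branch taken 9 of every 10 times, plus a
         * data-dependent branch taken pseudo-randomly half the time */
        unsigned long correct = 0, total = 0;
        srand(1);
        for (int i = 0; i < 100000; i++) {
            uint32_t pc    = 0x1000;
            int      taken = (i % 10) != 9;
            correct += (predict(pc) == taken);
            update(pc, taken);
            total++;

            pc    = 0x1004;
            taken = rand() & 1;
            correct += (predict(pc) == taken);
            update(pc, taken);
            total++;
        }
        printf("prediction accuracy: %.1f%%\n", 100.0 * correct / total);
        return 0;
    }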

P14. Multipath Instruction Execution

One way to avoid the issues of branch prediction altogether is to fetch from both the taken and not-taken instruction streams. Design and evaluate a datapath/memory system pipeline structure that can support such parallel execution. What are the bandwidth implications for the cache? How can a superscalar datapath support multiple streams as well as parallel instruction execution? What are the relative advantages and disadvantages of shallow versus deep pipelines for supporting this idea (e.g., what happens when another branch appears in a stream being fetched before the first branch's direction has been resolved)? For your design, how frequently might this case occur for the SPECmark95 programs?
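A back-of-envelope model suggests how often the awkward case arises: if a fraction b of instructions are branches and W x D instructions enter the pipeline before a branch resolves, then roughly 1 - (1-b)^(W*D) of branches see a second branch in their shadow. The branch frequency, fetch widths, and resolution latencies in the sketch below are assumptions; measuring the real numbers on SPECmark95 is the project.

    /* Rough estimate of how often a second branch enters the pipeline before
     * the first one resolves, the awkward case for dual-path fetch. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double branch_freq = 0.20;          /* ~1 branch per 5 instructions, assumed */
        int widths[]    = { 1, 2, 4 };      /* instructions fetched per cycle */
        int latencies[] = { 2, 5 };         /* cycles until the branch resolves */

        for (unsigned w = 0; w < sizeof widths / sizeof widths[0]; w++)
            for (unsigned l = 0; l < sizeof latencies / sizeof latencies[0]; l++) {
                int shadow = widths[w] * latencies[l];   /* instructions in the shadow */
                double p = 1.0 - pow(1.0 - branch_freq, shadow);
                printf("width %d, %d-cycle resolve: P(second branch) = %.0f%%\n",
                       widths[w], latencies[l], 100.0 * p);
            }
        return 0;
    }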


Last Modified: 03:44pm PDT, September 04, 1995