Vault - A VLSI Architecture Using Lightweight Threads

Ian Watson and Greg Wright
{iwatson,gwright}@cs.man.ac.uk
Department of Computer Science, University of Manchester
Manchester M13 9PL, England

The Vault system is a single-chip multiprocessor designed to run Java software, with hardware support for fast lightweight thread creation.

With the search for increased performance, more parallelism is being sought even in processors aimed at the PC market. Most modern superscalar processors select instructions to execute in parallel from a single instruction stream. These processors are complex to design, and adding more execution units produces steeply diminishing returns.

The advent of Java and "network computers" presents new possibilities. Java has in-built support for threading, and each application on a desktop might have several threads performing calculations in the background, animating graphics, or downloading files. It seems slightly bizarre that these tasks should be time-sliced into a serial instruction stream, with a superscalar processor then trying to put some parallelism back together. A single-chip multiprocessor can exploit these different levels of parallelism, use fast on-chip communications, and is much easier to design and optimise than a conventional superscalar machine. Also, most signals are local, so it should be better suited to future chips where wire delays are the limiting factor.

The Vault system has several (say 8 or 16) processors on a chip, with private caches using a write-back protocol on a shared memory bus. Each processor is stack-based to allow rapid context switching: there are no registers to save and restore. There is also a fast task distribution bus; within very few cycles, any processor can determine whether another is unoccupied and fork a new lightweight thread onto it with a remote procedure call.
The system also buffers return values from remote functions; a processor waiting for a result can accept work from elsewhere. This allows dynamic parallelism: the way the work is distributed can be decided at run-time with very little overhead, and irregularly-structured computations can be parallelised efficiently.

The Javar restructuring compiler has been modified to use these lightweight threads, and encouraging speedup results have been obtained. Work is progressing on more realistic and detailed simulations.
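
The run-time work distribution described above (fork onto another processor if one is unoccupied, otherwise carry on locally) can be approximated in ordinary Java as a rough software analogy. The sketch below is not the Vault hardware mechanism and not the output of the modified Javar compiler. Everything named in it is an assumption made for illustration: the class `VaultForkSketch`, the constant `PROCESSORS`, a `Semaphore` standing in for the task distribution bus, a thread pool standing in for the other processors, and a recursive `fib` as an example of an irregularly-structured computation.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

public class VaultForkSketch {
    // Hypothetical processor count; the abstract suggests "say 8 or 16".
    static final int PROCESSORS = 8;

    // Assumption: one permit per processor other than the caller.
    // Holding a permit models "that processor is occupied".
    static final Semaphore idle = new Semaphore(PROCESSORS - 1);

    // Assumption: pool threads stand in for the other on-chip processors.
    // Daemon threads let the JVM exit without an explicit shutdown.
    static final ExecutorService pool =
        Executors.newFixedThreadPool(PROCESSORS - 1, r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

    // Illustrative workload only: an unbalanced recursive call tree.
    static long fib(int n) {
        if (n < 2) {
            return n;
        }
        Future<Long> remote = null;
        // Cheap, non-blocking check for an unoccupied "processor".
        if (idle.tryAcquire()) {
            // Fork one branch as a remote call; free the processor when done.
            remote = pool.submit(() -> {
                try {
                    return fib(n - 1);
                } finally {
                    idle.release();
                }
            });
        }
        // The caller keeps working on the other branch in the meantime.
        long right = fib(n - 2);
        if (remote == null) {
            // Nobody was idle: run the first branch locally as a plain call.
            return fib(n - 1) + right;
        }
        try {
            // Collect the remote return value.
            return remote.get() + right;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fib(20)); // prints 6765
    }
}
```

The decision to fork is taken at each call site at run-time, so the shape of the parallel execution follows the shape of the computation rather than a static schedule. One simplification should be noted: here a thread waiting on `remote.get()` simply blocks, whereas Vault buffers return values so that a waiting processor can accept work from elsewhere.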