Register Window Vectors with Automatic Spill/Fill

Brian Davila

A processor with a multi-ported memory and a suitably long instruction word does not need a register file. The sources and destinations of instructions can simply be specified as direct or indirect memory addresses. Each cycle, the processor would fetch one or more instructions, read operands and write the results of any completing operations.

Such an architecture would greatly reduce the demands placed on compiler writers and assembly language programmers. There would be no need for register allocation algorithms. Many cycles would be saved that were previously wasted shuffling large data sets between the register file and memory, not to mention those spent saving and restoring registers on function calls and context switches. The CPI of such a processor would be superior to that of one with equivalent functionality using a load-store architecture.

However, architectures that allow for addressing memory as operands have fallen out of favor. Indeed, Hennessy and Patterson state "virtually every new architecture after 1980 uses a load-store register architecture."[1] What could cause processor designers to inflict such a burden on those who must use their instruction sets?

One reason is the complexity of decoding such instructions. This is not as valid as it once was. The steady progression of Moore's law has made relatively fast implementations of convoluted instruction sets possible, as recent x86-compatible processors from Intel and AMD have shown. But support for many instructions introduced decades ago really exists only for legacy reasons, and one would prefer a processor be efficient with the help of its instruction set design rather than in spite of it.

Many reasons boil down to the large and ever increasing discrepancy between processor cycle time and memory latency. While caching (often in multiple levels) can help buffer the processor from slower levels of the memory hierarchy, it is not a panacea. Data must be kept in registers to maximize utilization of functional units, lest they be kept idly waiting for operands to arrive from sluggish memory.

While a large number of addressable registers would seem to ease the task of creating efficient code, bits must be allocated in the instruction to address these registers as operands, and it is important to keep the size of the instruction word small. While an instruction cache can have a lower miss rate than a data cache[2], superscalar processors need to look at larger windows of instructions to exploit more instruction-level parallelism[3]. Even the 32-bit instruction format traditional in RISC machines is too long for some applications, especially in the embedded world. Mike Johnson, VP and AMD senior fellow, has said "When all instructions are fixed at 32 bits in length, the definition of a RISC instruction set becomes, to some extent, a study in how to waste 32 bits of instruction."[4] Evidence of this can be seen in the introduction of the ARM Thumb and MIPS16 instruction set extensions ("Reduced" Reduced Instruction Set Computers?) and the design of the SuperH processor that natively uses 16-bit instructions[5].
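The encoding pressure described above can be made concrete with a little arithmetic. The following Python sketch counts the bits consumed by register specifiers alone. The function name operand_field_bits and the three-operand instruction format are assumptions made for illustration, not part of any ISA cited here.

```python
def operand_field_bits(num_registers, num_operands=3):
    """Bits needed to name `num_operands` registers out of `num_registers`."""
    # (n - 1).bit_length() is the exact integer form of ceil(log2(n)) for n >= 1.
    bits_per_operand = (num_registers - 1).bit_length()
    return bits_per_operand * num_operands


# Compare what is left for opcode and immediates in 32-bit and 16-bit formats.
for regs in (8, 16, 32, 128):
    used = operand_field_bits(regs)
    print(f"{regs:3d} registers: {used:2d} operand bits, "
          f"{32 - used:3d} left of 32, {16 - used:3d} left of 16")
```

A negative remainder in the last column means the three specifiers alone do not fit in a 16-bit instruction word.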

How then to reconcile the desire for many registers with the need to keep the instruction short? The Berkeley RISC-I and SPARC architectures include hardware support for register windows. However, compiler register allocation was shown to often be as good, especially when accompanied by profiling[6], and the technique has not been widely adopted, though the IA-64 has a register stack engine that is similar in some regards.[7]

Many more recent designs use register renaming to make available a larger set of registers than is specified in the ISA. These additional registers can be used to remove name dependences and for speculation. Register renaming is virtually a necessity for a modern superscalar design and is used in the PowerPC, MIPS and Alpha families as well as the Pentium II, III and IV and IA-64[8].

The WIMS microcontroller has two copies of both its data and address registers and uses special instructions to change which is active and to move data between the two.[9] (What did the Dragon processor do? Mention it here?)

I propose a new architecture that provides a set of register windows. There will be two active windows at any time, each consisting of eight 32-bit words, any of which may be used as scalar integer operands. Instructions will be provided to load a window from memory or to allocate a new window. Each window will be associated with a memory address. The number of physical register windows will be invisible to the ISA. The processor will keep track of which register windows have been written to and will mark these as dirty. It will then try to write this data back to the data cache when there are no other data memory requests, thus utilizing idle periods.
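The behavior described above can be sketched as a small software model. This is only an illustration under stated assumptions, not the proposed hardware. The class WindowFile, its method names, the word-addressed dictionary standing in for the data cache, and the eviction policy (drop a clean window if one exists, otherwise spill the oldest) are all hypothetical choices that the proposal does not fix.

```python
WINDOW_WORDS = 8          # eight words per window, as proposed
MASK32 = 0xFFFFFFFF       # each word holds a 32-bit value


class WindowFile:
    """Toy model of physical windows, each shadowing a block of memory."""

    def __init__(self, num_physical, memory):
        self.num_physical = num_physical  # not visible to the ISA
        self.memory = memory              # dict: word address -> value (assumed)
        self.words = {}                   # base address -> list of words
        self.dirty = {}                   # bases awaiting write-back, oldest first

    def _spill(self, base):
        # Copy every word of the window back to memory and mark it clean.
        for i, value in enumerate(self.words[base]):
            self.memory[base + i] = value
        self.dirty.pop(base, None)

    def _make_room(self):
        if len(self.words) < self.num_physical:
            return
        # Assumed policy: drop a clean window if one exists, else spill the oldest.
        victim = next((b for b in self.words if b not in self.dirty), None)
        if victim is None:
            victim = next(iter(self.words))
            self._spill(victim)
        del self.words[victim]

    def load(self, base):
        """Fill a window from memory unless it is already resident."""
        if base not in self.words:
            self._make_room()
            self.words[base] = [self.memory.get(base + i, 0)
                                for i in range(WINDOW_WORDS)]

    def allocate(self, base):
        """Create a zeroed window without reading memory; it starts dirty."""
        if base not in self.words:
            self._make_room()
        self.words[base] = [0] * WINDOW_WORDS
        self.dirty[base] = True

    def read(self, base, index):
        return self.words[base][index]

    def write(self, base, index, value):
        self.words[base][index] = value & MASK32
        self.dirty[base] = True

    def idle_cycle(self):
        """Call when no other data memory request is pending."""
        if not self.dirty:
            return False
        self._spill(next(iter(self.dirty)))
        return True


memory = {}
wf = WindowFile(num_physical=2, memory=memory)
wf.allocate(0x100)
wf.write(0x100, 3, 42)
print(0x103 in memory)    # the write has not reached memory yet
wf.idle_cycle()           # background spill during an idle period
print(memory[0x103])
```

In this sketch a forced spill happens only when every resident window is dirty, which is the case the background write-back is meant to make rare.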

I will construct a Verilog model of this architecture or, alternatively, model it in SimpleScalar or M5, and measure whether the implicit parallelization of the background spill and fill of registers, combined with the greater flexibility of loading blocks of addressable operands from memory, results in an overall speedup over a more conventional load-store architecture. Performing integer matrix multiplications of various sizes will be the first test suite used to benchmark the two approaches. It would also be nice to measure deeply recursive subroutine calls to see whether this register window scheme can be used as a stack cache. Finally, it seems simple to add vector support to this architecture if the physical resources are available.
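For the matrix multiplication benchmark, results from either simulated machine would need to be checked against a known-good answer. A minimal reference kernel might look like the following. The name matmul_u32 and the choice of unsigned 32-bit wraparound on overflow are assumptions, since the proposal does not specify overflow behavior.

```python
MASK32 = 0xFFFFFFFF  # assumed: results wrap to unsigned 32 bits


def matmul_u32(a, b):
    """Multiply matrices given as lists of rows, wrapping each entry to 32 bits."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must agree"
    return [[sum(a[i][t] * b[t][j] for t in range(inner)) & MASK32
             for j in range(cols)]
            for i in range(rows)]


# Small sanity check of the kind one might run against simulator output.
print(matmul_u32([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```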

(Gotta read "Design and Applications of a Virtual Context Architecture" just sent to me, sounds like related work.)

[1] Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003, p. 93

[2] Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003, Fig. 5.8, p. 406

[3] Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003, Fig. 3.36, p. 245

[4] Johnson, Mike, "System Considerations in the Design of the Am29000," IEEE Micro, August 1987, pp. 28-41

[5] Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003, pp. 119-120

[6] Wall, David W., "Register Windows vs. Register Allocation", Proceedings of the SIGPLAN 88 Conference on Programming Language Design and Implementation, 1988, pp. 67-78

[7] Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003, p. 352

[8] Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003, p. 239

[9] Ravindran, et al., "Increasing the Number of Effective Registers in a Low-Power Processor Using a Windowed Register File", CASES '03, 2003, pp. 125-136.