Register Window Vectors with Automatic Spill/Fill
Brian Davila
A processor with a multi-ported
memory and a suitably long instruction word does not need a register file. The
sources and destinations of instructions can simply be specified as direct or
indirect memory addresses. Each cycle, the processor would fetch one or more
instructions, read operands and write the results of any completing operations.
Such an architecture would greatly
reduce the demands placed on compiler writers and assembly language
programmers. There would be no need for register allocation algorithms. Many
cycles would be saved that were previously wasted shuffling large data sets
between the register file and memory, not to mention those spent saving and
restoring registers on function calls and context switches. The CPI of such a
processor would be superior to that of one with equivalent functionality using
a load-store architecture.
However, architectures that allow
for addressing memory as operands have fallen out of favor. Indeed, Hennessy
and Patterson state "virtually every new architecture after 1980 uses a
load-store register architecture."[1] What could cause processor designers
to inflict such a burden on those who must use their instruction sets?
One reason is the complexity of
decoding such instructions. This objection is less valid than it once was. The
steady progression of Moore's law has made relatively fast implementations of
convoluted instruction sets possible, as recent x86-compatible processors from
Intel and AMD have shown. But support for many instructions introduced decades
ago exists only for legacy reasons, and one would prefer a processor to be
efficient with the help of its instruction set design rather than in spite of
it.
Many other reasons boil down to the large
and ever increasing discrepancy between processor cycle time and memory
latency. While caching (often in multiple levels) can help buffer the processor
from slower levels of the memory hierarchy, it is not a panacea. Data must be
kept in registers to maximize utilization of functional units, lest they sit
idle waiting for operands to arrive from sluggish memory.
While a large number of addressable
registers would seem to ease the task of creating efficient code, bits must be
allocated in the instruction to address these registers as operands and it is
important to keep the size of the instruction word small. While an instruction
cache can have a lower miss rate than a data cache[2], superscalar processors
need to look at larger windows of instructions to exploit more instruction
level parallelism[3]. Even the 32-bit instruction format traditional in RISC
machines is too long for some applications, especially in the embedded world.
Mike Johnson, VP and AMD senior fellow, has said "When all instructions
are fixed at 32 bits in length, the definition of a RISC instruction set
becomes, to some extent, a study in how to waste 32 bits of
instruction."[4] Evidence of this can be seen in the introduction of the
ARM Thumb and MIPS16 instruction set extensions ("Reduced" Reduced
Instruction Set Computers?) and the design of the SuperH processor that
natively uses 16-bit instructions[5].
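To make the encoding pressure concrete, here is a minimal sketch that counts the instruction bits consumed by register specifiers alone. The function name and the default three-operand format are assumptions made for illustration and do not describe any particular ISA.

```python
def operand_field_bits(num_registers: int, num_operands: int = 3) -> int:
    """Bits needed to name `num_operands` registers out of `num_registers`.

    Hypothetical helper: assumes every operand field can name any register
    and that fields are encoded as plain binary indices.
    """
    if num_registers < 1 or num_operands < 0:
        raise ValueError("need at least one register and a non-negative operand count")
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1, computed exactly on integers.
    bits_per_operand = (num_registers - 1).bit_length()
    return bits_per_operand * num_operands


if __name__ == "__main__":
    for regs in (8, 16, 32, 128):
        print(regs, "registers ->", operand_field_bits(regs), "bits for three operands")
```

Under these assumptions, three operands drawn from 32 registers take 15 bits, nearly half of a 32-bit instruction, while three operands drawn from 8 registers take 9 bits, which already exceeds half of a 16-bit instruction.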
How then to reconcile the desire for many
registers with the need to keep the instruction short? The Berkeley RISC-I
and SPARC architectures include hardware support for register windows.
However, compiler register allocation was shown to often be as good,
especially when accompanied by profiling[6], and the technique has not been
widely adopted, though the IA-64 has a register stack engine that is
similar in some regards.[7]
Many more recent designs use
register renaming to make available a larger set of registers than is specified
in the ISA. These additional registers can be used to remove name dependences
and for speculation. Register renaming is virtually a necessity for a modern
superscalar design and is used in the PowerPC, MIPS and Alpha families as well
as the Pentium II, III and IV and IA-64[8].
The WIMS microcontroller has two
copies of both its data and address registers and uses special instructions to
change which is active and to move data between the two.[9] (What did the
Dragon processor do? Mention it here?)
I propose a new architecture that
provides a set of register windows. There will be two active windows at any time,
each consisting of eight 32-bit words, any of which may be used as scalar
integer operands. Instructions will be provided to load a window from memory or
to allocate a new window. Each window will be associated with a memory address.
The number of physical register windows will be invisible to the ISA. The
processor will keep track of which register windows have been written to and
will mark these as dirty. The processor will then try to write this data back
to the data cache when there are no other data memory requests, thus utilizing
idle periods.
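The dirty-tracking and background write-back behavior described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions: the class and method names, the default of four physical windows, the oldest-first eviction policy, and the use of a dictionary in place of the data cache are all hypothetical, and the model does not represent the two active windows seen by instructions.

```python
WORDS_PER_WINDOW = 8      # from the proposal: eight 32-bit words per window
MASK32 = 0xFFFFFFFF


class Window:
    def __init__(self, base, words=None):
        self.base = base                              # associated memory (word) address
        self.words = words if words is not None else [0] * WORDS_PER_WINDOW
        self.dirty = False                            # set on any write to the window


class WindowFile:
    """Toy model of a pool of physical windows with lazy spill (assumed design)."""

    def __init__(self, memory, num_physical=4):
        self.memory = memory              # dict: word address -> value (stands in for the data cache)
        self.num_physical = num_physical  # hypothetical size; invisible to the ISA
        self.windows = {}                 # base address -> Window, in insertion order

    def load(self, base):
        """Model of 'load a window from memory': fill eight words starting at base."""
        if base not in self.windows:
            self._make_room()
            words = [self.memory.get(base + i, 0) for i in range(WORDS_PER_WINDOW)]
            self.windows[base] = Window(base, words)
        return self.windows[base]

    def allocate(self, base):
        """Model of 'allocate a new window': no memory read, contents start at zero."""
        self.windows.pop(base, None)
        self._make_room()
        self.windows[base] = Window(base)
        return self.windows[base]

    def read(self, base, index):
        return self.windows[base].words[index]

    def write(self, base, index, value):
        window = self.windows[base]
        window.words[index] = value & MASK32
        window.dirty = True

    def idle_cycle(self):
        """Call when no other data memory request is pending: spill one dirty window.

        Returns the base address written back, or None if nothing was dirty.
        """
        for window in self.windows.values():
            if window.dirty:
                self._spill(window)
                return window.base
        return None

    def _spill(self, window):
        for i, value in enumerate(window.words):
            self.memory[window.base + i] = value
        window.dirty = False

    def _make_room(self):
        # Assumed policy: evict the oldest resident window, spilling it first if dirty.
        if len(self.windows) >= self.num_physical:
            victim = self.windows.pop(next(iter(self.windows)))
            if victim.dirty:
                self._spill(victim)
```

In this model a write only marks the window dirty. Memory is updated later, either by a call to idle_cycle or by a forced spill when the window is evicted. A real design would also need to decide how forced spills stall the pipeline, which this sketch ignores.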
I will construct a Verilog model of
this architecture or, alternatively, model it in SimpleScalar or M5 and measure
whether the implicit parallelization of the background spill and fill of
registers combined with the greater flexibility of loading blocks of
addressable operands from memory results in an overall speedup over a more
conventional load-store architecture. Performing integer matrix multiplications
of various sizes will be the first test suite used to benchmark the two
approaches. It would also be nice to measure deeply recursive subroutine calls
to see whether this register window scheme can be used as a stack cache.
Finally, it seems simple to add vector support to this architecture if the physical
resources are available.
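For the matrix multiplication benchmark, a simple reference kernel is useful for checking that both simulated machines compute the same results. The sketch below is one possible golden model. The function name is hypothetical, and wrapping each accumulation to 32 bits is an assumption chosen to match the 32-bit words of the proposed windows.

```python
MASK32 = 0xFFFFFFFF


def matmul_u32(a, b):
    """Multiply integer matrices given as lists of rows, wrapping to 32 bits.

    Assumed reference model: unsigned 32-bit wraparound on every accumulate.
    """
    if not a or not b or not b[0]:
        raise ValueError("matrices must be non-empty")
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    if any(len(row) != cols for row in b):
        raise ValueError("rows of b must all have the same length")
    result = [[0] * cols for _ in range(len(a))]
    for i, row in enumerate(a):
        for j in range(cols):
            acc = 0
            for p in range(inner):
                acc = (acc + row[p] * b[p][j]) & MASK32
            result[i][j] = acc
    return result
```

Comparing simulator output against such a model for each matrix size separates correctness problems from the performance measurements.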
(Gotta
read "Design and Applications of a Virtual Context Architecture" just
sent to me, sounds like related work.)
[1]
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003,
p. 93
[2]
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003,
Fig. 5.8, p. 406
[3]
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003,
Fig. 3.36, p. 245
[4]
Johnson, Mike, "System Considerations in the Design of the Am29000,"
IEEE Micro, August 1987, pp. 28-41
[5]
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003,
pp. 119-120
[6]
Wall, David W., "Register Windows vs. Register Allocation," Proceedings of
the SIGPLAN 88 Conference on Programming Language Design and Implementation,
1988, pp. 67-78
[7]
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003,
p. 352
[8]
Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 2003,
p. 239
[9]
Ravindran, et al., "Increasing the Number of Effective Registers in a
Low-Power Processor Using a Windowed Register File," CASES '03, 2003, pp.
125-136