Data Threaded Microarchitecture: Dataflow on the fly

ABSTRACT

Problems

Processor designs for today's desktop and server applications incorporate superscalar techniques to increase throughput, but these techniques expose new costs and bottlenecks: larger caches yield diminishing returns, register files and caches require higher degrees of multiporting, and more resources must be devoted to guessing conditional branch outcomes. A new approach addresses these problems.

Solution

A new computer microarchitecture transforms itself from traditional control flow to dataflow while transparently executing unmodified programs. The flow of operands between instructions is persistently mapped during initial execution so that subsequent executions can initiate direct data flows: operands flow directly from producing instructions to consuming instructions. Receipt of all needed operands triggers execution of flow targets, and their resulting output operands are in turn forwarded to other instructions, which may also be triggered (data threads -- hence data threaded microarchitecture: DTMA). Values for registers, memory and stack locations, special registers, condition codes and vectors can all be mapped and forwarded.

This leads to radically out-of-order execution, but instructions are retired and results materialized (or killed) in sequential program order (made possible by wide retirement words). Operands can often flow around minimal control dependencies where the target would be reached through all possible paths; other operand flows can await branch resolution before flowing to their targets. Conditional branches themselves also benefit from operand flows (of condition codes). For simple backward (looping) branches the number of iterations (as in a FOR loop) is known and surprises (possible loop exits) can be detected, so the backward conditional branch direction is known and operand forwarding can proceed to subsequent iterations as in a Tomasulo machine (but without forgetting the mapped flows).
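The trigger-on-all-operands rule described above can be sketched in software. The following is a minimal illustrative model, not the hardware design: the class names, slot numbering, and ready-list scheduling are all hypothetical simplifications. Each instruction records its mapped consumers, and fires once every awaited operand has arrived, forwarding its result directly along the mapped flows.

```python
# Illustrative sketch of DTMA-style operand forwarding (all names hypothetical).
# An instruction fires when all needed operands have been received; its result
# then flows directly to each mapped consumer, which may fire in turn.

class Instruction:
    def __init__(self, name, op, num_inputs):
        self.name = name
        self.op = op                # function combining the input operands
        self.needed = num_inputs    # operands still awaited before firing
        self.inputs = {}            # slot -> received operand value
        self.consumers = []         # (target, slot) pairs: the mapped flows

    def map_flow(self, target, slot):
        """Persistently record a producer -> consumer operand flow."""
        self.consumers.append((target, slot))

    def receive(self, slot, value, ready):
        self.inputs[slot] = value
        self.needed -= 1
        if self.needed == 0:        # all operands present: trigger execution
            ready.append(self)

def run(sources):
    """Execute a data thread: fire instructions as their operands arrive."""
    ready = list(sources)           # sources need no inputs, so start ready
    fired = []
    while ready:
        instr = ready.pop(0)
        result = instr.op(*(instr.inputs[s] for s in sorted(instr.inputs)))
        fired.append((instr.name, result))
        for target, slot in instr.consumers:   # direct operand flow, no registers
            target.receive(slot, result, ready)
    return fired

# Example: two producers feed an add; the add fires only after both arrive.
a = Instruction("a", lambda: 2, 0)
b = Instruction("b", lambda: 3, 0)
add = Instruction("add", lambda x, y: x + y, 2)
a.map_flow(add, 0)
b.map_flow(add, 1)
print(run([a, b]))  # [('a', 2), ('b', 3), ('add', 5)]
```

Note that no register file or cache is consulted here: once the flows are mapped, values move producer to consumer, which is the source of the reduced register and cache traffic claimed below.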
Knowledge gained from an initial loop iteration also allows loop explosion: a form of hardware loop unrolling in which multiple iterations of loop-dependent instructions are started in each cycle. With or without loop explosion, knowledge garnered from an initial iteration can allow post-loop instructions to start before the loop completes its remaining iterations, and can even permit concurrent execution of multiple adjoining loops as well as combinations of mixed iterations of inner and outer loops.

Advantages

DTMA has many of the advantages of dataflow architectures but can be applied to existing CISC, RISC and stack (e.g. Java) control flow instruction sets executing legacy code. By enabling more out-of-order execution and keeping more instructions in flight, DTMA is more tolerant of memory latencies. Direct operand flows dramatically decrease the need for data cache; traffic in and out of registers also decreases greatly, as does the need for multiporting of register files, TLBs and caches. DTMA further eliminates or decreases the need for branch guessing hardware. DTMA should be particularly attractive for executing legacy instruction sets -- especially ones that are register constrained (x86, S/370) and/or make extensive use of stacks. DTMA should keep more execution units busy using fewer other resources than alternative microarchitectures. Because DTMA requires much less associative storage and is easier to distribute across more independent units, it should scale much better than conventional superscalar microarchitectures.
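The loop explosion described under Solution can be illustrated with a toy scheduling model. This is a sketch of the timing effect only, under the stated assumption that the trip count is known after the first iteration; the function name and the issue-width parameter are hypothetical, not part of the original description.

```python
# Illustrative model of loop explosion (hypothetical names and parameters).
# Once an initial iteration has mapped the loop's operand flows and the trip
# count is known, `width` iterations of loop-dependent instructions can be
# started in each cycle instead of one.

def explode(trip_count, body, width=4):
    """Start `width` iterations per cycle; return (results, cycles used)."""
    results = []
    cycles = 0
    i = 0
    while i < trip_count:
        batch = range(i, min(i + width, trip_count))
        results.extend(body(j) for j in batch)  # iterations issued together
        i += width
        cycles += 1
    return results, cycles

# A 10-iteration loop issued 4 iterations per cycle finishes in 3 cycles
# rather than 10, with the same results as serial execution.
squares, cycles = explode(10, lambda j: j * j)
print(squares, cycles)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] 3
```

A real implementation would be bounded by execution-unit availability and cross-iteration dependences; the point of the sketch is only that a known trip count converts a serial loop into batches of concurrently started iterations.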