Data Threaded Microarchitecture: Dataflow on the fly

ABSTRACT

Problems

Processor designs for today's desktop and server applications incorporate superscalar techniques to increase throughput, but these techniques expose new costs and bottlenecks: larger caches yield diminishing returns, register files and caches require higher degrees of multiporting, and more resources must be devoted to guessing conditional branch outcomes. A new approach addresses these problems.

Solution

A new computer microarchitecture transforms itself from traditional control flow to dataflow while transparently executing unmodified programs. The flow of operands between instructions is persistently mapped during initial execution so that subsequent executions can initiate direct data flows: operands flow directly from producing instructions to consuming instructions. Receipt of all needed operands triggers execution of flow targets, and their resulting output operands are in turn forwarded to other instructions, which may also be triggered (data threads -- hence data threaded microarchitecture: DTMA). Values for registers, memory and stack locations, special registers, condition codes and vectors can all be mapped and forwarded.

This leads to radically out-of-order execution, but instructions are retired and results materialized (or killed) in sequential program order (made possible by wide retirement words). Operands can often flow around minimal control dependencies where the target would be reached through all possible paths; other operand flows can await branch resolution before flowing to their targets. Conditional branches themselves also benefit from operand flows (of condition codes). For simple backward (looping) branches the number of iterations (as in a FOR loop) is known and surprises (possible loop exits) can be detected, so the backward conditional branch direction is known and operand forwarding can proceed to subsequent iterations as in a Tomasulo machine (but without forgetting the mapped flows).
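The trigger-on-all-operands rule described above can be sketched in software. The following is a minimal illustrative model, not the hardware design: the class names, slot numbering, and ready-list scheduling are all hypothetical simplifications. Each instruction records its mapped consumers, and fires once every awaited operand has arrived, forwarding its result directly along the mapped flows.

```python
# Illustrative sketch of DTMA-style operand forwarding (all names hypothetical).
# An instruction fires when all needed operands have been received; its result
# then flows directly to each mapped consumer, which may fire in turn.

class Instruction:
    def __init__(self, name, op, num_inputs):
        self.name = name
        self.op = op                # function combining the input operands
        self.needed = num_inputs    # operands still awaited before firing
        self.inputs = {}            # slot -> received operand value
        self.consumers = []         # (target, slot) pairs: the mapped flows

    def map_flow(self, target, slot):
        """Persistently record a producer -> consumer operand flow."""
        self.consumers.append((target, slot))

    def receive(self, slot, value, ready):
        self.inputs[slot] = value
        self.needed -= 1
        if self.needed == 0:        # all operands present: trigger execution
            ready.append(self)

def run(sources):
    """Execute a data thread: fire instructions as their operands arrive."""
    ready = list(sources)           # sources need no inputs, so start ready
    fired = []
    while ready:
        instr = ready.pop(0)
        result = instr.op(*(instr.inputs[s] for s in sorted(instr.inputs)))
        fired.append((instr.name, result))
        for target, slot in instr.consumers:   # direct operand flow, no registers
            target.receive(slot, result, ready)
    return fired

# Example: two producers feed an add; the add fires only after both arrive.
a = Instruction("a", lambda: 2, 0)
b = Instruction("b", lambda: 3, 0)
add = Instruction("add", lambda x, y: x + y, 2)
a.map_flow(add, 0)
b.map_flow(add, 1)
print(run([a, b]))  # [('a', 2), ('b', 3), ('add', 5)]
```

Note that no register file or cache is consulted here: once the flows are mapped, values move producer to consumer, which is the source of the reduced register and cache traffic claimed below.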
Knowledge gained from an initial loop iteration also allows loop explosion: a form of hardware loop unrolling in which multiple iterations of loop-dependent instructions are started in each cycle. With or without loop explosion, knowledge garnered from an initial iteration can allow post-loop instructions to start before the loop completes its remaining iterations, and can even permit concurrent execution of multiple adjoining loops as well as combinations of mixed iterations of inner and outer loops.

Advantages

DTMA has many of the advantages of dataflow architectures but can be applied to existing CISC, RISC and stack (e.g. Java) control flow instruction sets executing legacy code. By enabling more out-of-order execution and keeping more instructions in flight, DTMA is more tolerant of memory latencies. Direct operand flows dramatically decrease the need for data cache; traffic in and out of registers also decreases greatly, as does the need for multiporting of register files, TLBs and caches. DTMA further eliminates or decreases the need for branch guessing hardware. DTMA should be particularly attractive for executing legacy instruction sets -- especially ones that are register constrained (x86, S/370) and/or make extensive use of stacks. DTMA should keep more execution units busy using fewer other resources than alternative microarchitectures. Because DTMA requires much less associative storage and is easier to distribute across more independent units, it should scale much better than conventional superscalar microarchitectures.
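The loop explosion described under Solution can be illustrated with a toy scheduling model. This is a sketch of the timing effect only, under the stated assumption that the trip count is known after the first iteration; the function name and the issue-width parameter are hypothetical, not part of the original description.

```python
# Illustrative model of loop explosion (hypothetical names and parameters).
# Once an initial iteration has mapped the loop's operand flows and the trip
# count is known, `width` iterations of loop-dependent instructions can be
# started in each cycle instead of one.

def explode(trip_count, body, width=4):
    """Start `width` iterations per cycle; return (results, cycles used)."""
    results = []
    cycles = 0
    i = 0
    while i < trip_count:
        batch = range(i, min(i + width, trip_count))
        results.extend(body(j) for j in batch)  # iterations issued together
        i += width
        cycles += 1
    return results, cycles

# A 10-iteration loop issued 4 iterations per cycle finishes in 3 cycles
# rather than 10, with the same results as serial execution.
squares, cycles = explode(10, lambda j: j * j)
print(squares, cycles)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] 3
```

A real implementation would be bounded by execution-unit availability and cross-iteration dependences; the point of the sketch is only that a known trip count converts a serial loop into batches of concurrently started iterations.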