Vector IRAM
Dave Patterson
March 2, 1996
<P>
Problems in IRAM design:
<LI> Logic is slower in DRAM process
<LI> Adding a processor means an instruction set, which customizes the part and 
limits software
<LI> How can you really use the phenomenal bandwidth?
<P>Observations:
<LI> The vector processing units on different brands of vector computers are 
largely the same, with the same operations and registers. There is basically 
widespread agreement on the design of vector units.
<LI> Vector units can trade off clock rate against amount of hardware: you can 
build a vector processor with the same peak bandwidth in a slower technology 
simply by replicating the functional units so that they process, say, 4 
elements per clock cycle.
<LI> The large cost of vector systems is the network that connects the 
memory banks to the vector units. It is essentially a crossbar or fat tree.
<LI> There is a very clear dividing line between a vector processor and a 
scalar processor: vector instructions, and operations to load the vector 
length and vector mask registers, are the primary items that cross the line.
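<P>
The clock-rate versus hardware trade-off in the second observation can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the clock figures (400 MHz and 100 MHz) are assumptions chosen to reproduce the "4 elements per clock cycle" example, not measurements of any real part.

```python
def peak_elements_per_second(clock_hz: int, lanes: int) -> int:
    """Peak throughput of a vector unit that completes `lanes` elements per clock."""
    return clock_hz * lanes


def lanes_needed(target_rate: int, clock_hz: int) -> int:
    """Smallest number of replicated functional units (lanes) that reaches
    `target_rate` elements per second at the given clock rate."""
    return -(-target_rate // clock_hz)  # integer ceiling division


# Hypothetical numbers: a fast logic process at 400 MHz with one lane,
# versus a slower process running at 100 MHz.
fast_rate = peak_elements_per_second(400_000_000, lanes=1)
slow_lanes = lanes_needed(fast_rate, 100_000_000)
slow_rate = peak_elements_per_second(100_000_000, slow_lanes)
```

With these assumed figures the slower process needs four lanes to match the peak rate of the faster one, at the price of four times the functional-unit hardware.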
<P>
Proposal:
Instead of putting a full processor in a DRAM, put a vector unit in a DRAM 
and provide a port between a traditional processor and the vector IRAM. 
Across this port go vector instructions and possibly scalar values, which 
can specify a lot of work in a few bits. Thus a conventional processor-cache 
complex would handle the things that work well with caches, while anything 
that needs lots of bandwidth is done inside the memory, using the standard 
load-store interface to communicate between the two worlds. 
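<P>
To make "a lot of work in a few bits" concrete, here is a toy software model of the proposal: a host issues a handful of commands into a queue, and the vector IRAM executes them entirely against its own memory. Everything in it (the class name `VectorIRAM`, the command names `setvl`, `vload`, `vadd`, `vstore`, the register count and vector length) is a hypothetical illustration, not a proposed instruction set.

```python
from collections import deque


class VectorIRAM:
    """Toy model: on-chip memory plus a vector unit, driven through a command queue."""

    def __init__(self, mem_words: int, num_vregs: int = 8, max_vl: int = 64):
        self.mem = [0] * mem_words
        self.vregs = [[0] * max_vl for _ in range(num_vregs)]
        self.max_vl = max_vl
        self.vl = max_vl          # current vector length register
        self.queue = deque()      # instruction queue fed by the host port

    def issue(self, op: str, *args: int) -> None:
        """Host side: send one command across the port."""
        self.queue.append((op, args))

    def run(self) -> None:
        """IRAM side: drain the queue, touching only on-chip memory."""
        while self.queue:
            op, args = self.queue.popleft()
            if op == "setvl":
                (n,) = args
                self.vl = min(n, self.max_vl)
            elif op == "vload":
                vd, base = args
                self.vregs[vd][:self.vl] = self.mem[base:base + self.vl]
            elif op == "vadd":
                vd, va, vb = args
                for i in range(self.vl):
                    self.vregs[vd][i] = self.vregs[va][i] + self.vregs[vb][i]
            elif op == "vstore":
                vs, base = args
                self.mem[base:base + self.vl] = self.vregs[vs][:self.vl]
            else:
                raise ValueError(f"unknown command: {op}")


# Usage: five commands cross the port; 192 words move inside the IRAM.
iram = VectorIRAM(mem_words=256)
iram.mem[0:64] = range(64)
iram.mem[64:128] = [10] * 64
for cmd in [("setvl", 64), ("vload", 0, 0), ("vload", 1, 64),
            ("vadd", 2, 0, 1), ("vstore", 2, 128)]:
    iram.issue(*cmd)
iram.run()
```

The host never sees the operand data in this sketch; it would read the result back through the normal load-store interface only if it needed it.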
<P>
Details:
<LI> The width of this port depends on how radical you want the IRAM to be.
<LI> A conservative model would use, say, Synchronous DRAM to send 
instructions in pieces into an instruction queue at whatever rate the processor 
can generate information. By reserving a portion of the address space for 
commands, you can get information from the address lines as well as the data 
lines.
<LI> A more radical model might use the Rambus interface to ship 
instructions in 8b chunks. There is no need to send a single instruction at a 
time.
<LI> Of course, you can make the port as wide as you want.
<LI> You can have multiple vector IRAMs if you need more memory or more 
processing; communication between IRAMs could be done:
<UL>
<LI> by chip-to-chip transfers over the memory bus, assuming an appropriate 
controller
<LI> through the processor, via a block move instruction using the normal 
memory interface
<LI> via a network connection between IRAMs
</UL>
<LI> By adding some instructions to manipulate the vector control registers 
(moves, possibly simple arithmetic) you may be able to reduce the number of 
scalar moves across the port.
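<P>
The conservative model above, in which part of the address space is reserved for commands, could be sketched as follows. The window location (`COMMAND_BASE`), the 4-bit field layout, and both function names are assumptions made up for this illustration. The point is only that an ordinary store can carry instruction fields on the address lines and an operand on the data lines.

```python
# Hypothetical layout: the top 64 KB of a 32-bit address space is the command window.
COMMAND_BASE = 0xFFFF0000
COMMAND_MASK = 0xFFFF0000


def encode_command_address(opcode: int, vd: int, va: int, vb: int) -> int:
    """Pack four 4-bit fields into the low 16 address bits of the command window."""
    for field in (opcode, vd, va, vb):
        if not 0 <= field < 16:
            raise ValueError("each field must fit in 4 bits")
    return COMMAND_BASE | (opcode << 12) | (vd << 8) | (va << 4) | vb


def decode_store(address: int, data: int):
    """Interpret a store seen by the IRAM.

    Returns None for an ordinary memory write, or a dict of command fields
    when the address falls inside the reserved command window."""
    if (address & COMMAND_MASK) != COMMAND_BASE:
        return None
    return {
        "opcode": (address >> 12) & 0xF,
        "vd": (address >> 8) & 0xF,
        "va": (address >> 4) & 0xF,
        "vb": address & 0xF,
        "operand": data,  # e.g. a scalar value or a vector length
    }


# Usage: one store carries a whole command; the data word is a scalar operand.
addr = encode_command_address(opcode=2, vd=3, va=1, vb=0)
command = decode_store(addr, data=64)
plain_write = decode_store(0x00001000, data=5)
```

A wider port would simply carry more fields per transfer; the decode step stays the same.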
<P>
Comments:
<LI> The single, central set of vector registers means that memory on chip is 
uniformly accessible, unlike some of the SIMD approaches to IRAM. 
<LI> The speed of the logic on a DRAM process simply determines the width of 
the vector units. If logic is 1/2 speed, we can get the same peak performance 
by doubling the number of vector elements processed per clock (costing twice 
as much hardware). The cost is a longer vector startup time, which makes for 
a larger N1/2 value.
<LI> Key to the design is the interconnect: how much area does it take to provide 
a potent interconnect?
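<P>
The N1/2 remark can be illustrated with the usual linear timing model, in which a vector operation of length n takes a fixed startup plus n divided by the number of elements completed per clock. N1/2 is then the vector length at which half of peak performance is reached. The sketch assumes the startup takes the same number of cycles in both processes; the cycle count and clock rates are invented for the example.

```python
def n_half(startup_cycles: int, lanes: int) -> int:
    """Vector length giving half of peak rate under T(n) = startup + n / lanes cycles."""
    return startup_cycles * lanes


def achieved_rate(n: int, startup_cycles: int, lanes: int, clock_hz: int) -> float:
    """Elements per second for a length-n vector operation under the same model."""
    cycles = startup_cycles + n / lanes
    return n * clock_hz / cycles


# Hypothetical figures: 20 startup cycles in either process.
fast = dict(startup_cycles=20, lanes=1, clock_hz=100_000_000)  # full-speed logic
slow = dict(startup_cycles=20, lanes=2, clock_hz=50_000_000)   # half-speed logic, 2x lanes

peak_fast = fast["lanes"] * fast["clock_hz"]
peak_slow = slow["lanes"] * slow["clock_hz"]
n_half_fast = n_half(fast["startup_cycles"], fast["lanes"])
n_half_slow = n_half(slow["startup_cycles"], slow["lanes"])
```

Under these assumptions both designs have the same peak rate, but the half-speed design needs vectors twice as long to reach half of it.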
<P>
Questions:
<LI> How many instructions and how much data really cross the vector-scalar 
interface?
<LI> How expensive is the hardware to fully connect to the memory 
modules?
<LI> How much area would it take for, say, 16 or 32 vector registers, each 
with 64 or 128 vector elements? What about the functional units?
<LI> Can the scalar registers remain in the vector unit, or must they be 
transmitted across that interface as well? (Since it's only for reads, it may 
not be too bad.)
<LI> How good a match are vectors for visualization instructions?
<LI> Can the idea become popular as graphics accelerators?
<LI> Is the overhead of communication so high that it's better to perform 
length-1 vector operations than to perform operations in the scalar unit?
<LI> Can software handle all the synchronization between scalar and vector 
accesses? (e.g., read/write conflicts to the same word)
<LI> Do any such machines exist so that we could look at the code?
<LI> How many applications will run well with vector assist?