Vector IRAM
Dave Patterson
March 2, 1996
Problems in IRAM design:
- Logic is slower in DRAM process
- Adding a processor means choosing an instruction set, which customizes the
part and limits software compatibility
- How can a slower processor really use the phenomenal memory bandwidth?
Observations:
- There is basically widespread agreement on the design of vector units.
The vector processing units on different brands of vector computer are
largely the same, with similar operations and register organizations.
- Vector units can trade off clock rate and amount of hardware: you can
build a vector processor with the same peak bandwidth in a slower technology
simply by replicating the function units so that they do, say, 4 elements per
clock cycle vs. 1 element per clock at a 4X faster clock rate (see the sketch
after this list).
- One large cost of vector systems is the network that connects the
memory banks to the vector units. It is essentially a crossbar or fat tree.
- There is a very clear dividing line between a vector processor and a
scalar processor: vector instructions, plus operations to load the vector length
and vector mask registers, are the primary items that cross the line.
Proposal:
Instead of putting a full processor in a DRAM, put a vector unit in a DRAM
and provide a port between a traditional processor and the vector IRAM.
Across this port go vector instructions and possibly scalar values, which
can specify a lot of work in a few bits.
Keeping the vector interconnection network on-chip will also dramatically
lower the cost of phenomenal bandwidth.
Thus a conventional processor-cache complex handles the things that work
well on caches, while anything that needs lots of bandwidth is done inside
the memory, using the standard load-store interface to communicate between
the two worlds.
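To make the port concrete, here is a rough sketch of the kind of message that
might cross it: one vector instruction plus an optional scalar operand. The
field names and widths are assumptions for illustration, not a proposed
encoding:

    #include <stdint.h>

    /* Hypothetical message from the scalar processor to the vector IRAM.
       A message this small can trigger a full vector length (say, 64 or
       128 elements) of loads, stores, or arithmetic inside the DRAM.    */
    struct viram_cmd {
        uint8_t  opcode;        /* vector load, store, add, multiply, ... */
        uint8_t  vd, vs1, vs2;  /* vector register numbers                */
        uint16_t vlen;          /* vector length for this operation       */
        uint64_t scalar;        /* scalar operand or base address, if any */
    };  /* roughly 16 bytes specify on the order of 100 element operations */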
Details:
- The width of this port depends on how radical you want the IRAM:
- A conservative model would use, say, Synchronous DRAM to send
instructions in pieces into an instruction queue at whatever rate the processor
can generate information. By reserving a portion of the address space for
commands, you can get information from the address lines as well as the data
lines (see the sketch after these details).
- A more radical model might use the Rambus interface to ship
instructions in 8-bit chunks. There is no need to send a single instruction at a
time.
- Of course, you can make the port as wide as you want.
- You can have multiple vector IRAMs if you need more memory or more
processing; communication between IRAMs could be done by:
- chip to chip transfers over the memory bus, assuming an appropriate
controller
- through the processor via a block move instruction using the normal
memory interface
- via a network connection between IRAMs
- By adding some instructions to manipulate the vector control registers
(moves, possibly simple arithmetic), you may be able to reduce the number of
vector-scalar moves.
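One way to picture the conservative (Synchronous DRAM) option above: the
scalar side stores encoded instructions into a reserved command region of the
address space, and the IRAM's control logic drains them into an on-chip
instruction queue. A minimal sketch of the scalar side, with the reserved
address and queue depth purely assumed:

    #include <stdint.h>

    /* Assumed: a region at the top of the IRAM address space is reserved
       for commands, so ordinary stores double as instruction delivery
       (the address lines say "this is a command", the data lines carry
       the instruction bits).                                            */
    #define VIRAM_CMD_BASE ((volatile uint64_t *)0xFFFF0000UL)  /* hypothetical */

    static unsigned cmd_slot;   /* next slot in the on-chip instruction queue */

    /* Push one encoded vector instruction into the IRAM's queue. */
    static void viram_push(uint64_t encoded_instr) {
        VIRAM_CMD_BASE[cmd_slot % 64] = encoded_instr;  /* depth of 64 assumed */
        cmd_slot++;
    }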
Comments:
- The single, central set of vector registers means that memory on chip is
uniformly accessible and easier for software, unlike some of the SIMD approaches to IRAM.
- The speed of the logic on a DRAM process simply determines the width of
the vector units. If logic is 1/2 speed, we can get the same peak performance
by doubling the number of vector elements processed per clock (costing twice
as much hardware). The cost is a larger vector startup time, and hence a larger
N_1/2 value (see the sketch after these comments).
- Key to the design is the interconnect: how much area does it take to provide
a potent interconnect?
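A small model of the startup-time comment above, using the usual
T(n) = startup + n/rate form, where N_1/2 is the vector length at which half
of peak performance is reached (numbers are illustrative):

    #include <stdio.h>

    /* For an n-element vector operation taking
          T(n) = startup + n / elems_per_clock   clocks,
       half of peak rate is reached at N_1/2 = startup * elems_per_clock,
       so N_1/2 grows with both startup time and elements per clock.    */
    static double n_half(double startup_clocks, double elems_per_clock) {
        return startup_clocks * elems_per_clock;
    }

    int main(void) {
        /* Same peak performance, but the wider design has a larger N_1/2. */
        printf("1 elem/clock, 10-clock startup: N1/2 = %.0f\n", n_half(10, 1));
        printf("2 elem/clock, 10-clock startup: N1/2 = %.0f\n", n_half(10, 2));
        return 0;
    }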
Questions:
- How much data and how many instructions really cross the vector-scalar interface?
- How expensive is the hardware to fully connect to the memory
modules?
- How much area would it take for, say, 16 or 32 vector registers, each
with 64 or 128 vector elements? What about the functional units? (A
back-of-the-envelope sketch follows these questions.)
- Can the scalar registers remain in the vector unit, or must they be
transmitted across that interface as well? (Since it's only for reads, it may not
be too bad to send scalar data.)
- How good a match are vectors to visualization instructions?
- Is the overhead of communication so high that it's better to perform
length-1 vector operations than to perform operations in the scalar unit?
- Can software handle all the synchronization between scalar and vector
accesses? (e.g., read/write conflicts to the same word)
- Do any such machines exist (e.g., VAX 6000)
so that we could look at the code?
- Can Vector IRAM become popular as graphics accelerators?
- How many applications will run well with vector assist?
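A back-of-the-envelope sketch for the register-file question above, counting
only bits of storage (64-bit elements assumed; real area also depends on ports
and layout, and the functional units are not counted):

    #include <stdio.h>

    /* Vector register file storage: regs * elements/reg * bits/element. */
    static long vreg_bytes(long regs, long elems, long bits_per_elem) {
        return regs * elems * bits_per_elem / 8;
    }

    int main(void) {
        /* The configurations asked about above, assuming 64-bit elements. */
        printf("16 regs x  64 elems: %ld KB\n", vreg_bytes(16,  64, 64) / 1024);
        printf("32 regs x 128 elems: %ld KB\n", vreg_bytes(32, 128, 64) / 1024);
        return 0;
    }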