Vector IRAM
Dave Patterson
March 2, 1996
Problems in IRAM design:
- Logic is slower in DRAM process
- Adding a processor means choosing an instruction set, which customizes the
part and limits software compatibility
- How can a slower processor really use the phenomenal memory bandwidth?
Observations:
- There is basically widespread agreement on the design of vector units.
The vector processing units on different brands of vector computer are
largely the same, with similar operations and register organizations.
- Vector units can trade off clock rate and amount of hardware: you can
build a vector processor with the same peak bandwidth in a slower technology
simply by replicating the function units so that they do, say, 4 elements per
clock cycle vs. 1 element per clock at a 4X faster clock rate (see the sketch
after this list).
- One large cost of vector systems is the network that connects the
memory banks to the vector units. It is essentially a crossbar or fat tree.
- There is a very clear dividing line between a vector processor and a
scalar processor: vector instructions, plus operations to load the vector length
and vector mask registers, are the primary items that cross the line.
Proposal:
Instead of putting a full processor in a DRAM, put a vector unit in a DRAM
and provide a port between a traditional processor and the vector IRAM.
Across this port go vector instructions and possibly scalar values, which
can specify a lot of work in a few bits.
Keeping the vector interconnection network on-chip will also dramatically
lower the cost of phenomenal bandwidth.
Thus a conventional processor-cache complex handles the things that work
well on caches, while anything that needs lots of bandwidth is done inside
the memory, using the standard load-store interface to communicate between
the two worlds.
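To make the port concrete, here is a rough sketch of the kind of message that
might cross it: one vector instruction plus an optional scalar operand. The
field names and widths are assumptions for illustration, not a proposed
encoding:

    #include <stdint.h>

    /* Hypothetical message from the scalar processor to the vector IRAM.
       A message this small can trigger a full vector length (say, 64 or
       128 elements) of loads, stores, or arithmetic inside the DRAM.    */
    struct viram_cmd {
        uint8_t  opcode;        /* vector load, store, add, multiply, ... */
        uint8_t  vd, vs1, vs2;  /* vector register numbers                */
        uint16_t vlen;          /* vector length for this operation       */
        uint64_t scalar;        /* scalar operand or base address, if any */
    };  /* roughly 16 bytes specify on the order of 100 element operations */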
Details:
- The width of this port depends on how radical you want the IRAM:
- A conservative model would use, say, Synchronous DRAM to send
instructions in pieces into an instruction queue at whatever rate the processor
can generate information. By reserving a portion of the address space for
commands, you can get information from the address lines as well as the data
lines (see the sketch after these details).
- A more radical model might use the Rambus interface to ship
instructions in 8-bit chunks. There is no need to send a single instruction at a
time.
- Of course, you can make the port as wide as you want.
- You can have multiple vector IRAMs if you need more memory or more
processing; communication between IRAMs could be done by:
- chip to chip transfers over the memory bus, assuming an appropriate
controller
- through the processor via a block move instruction using the normal
memory interface
- via a network connection between IRAMs
- By adding some instructions to manipulate the vector control registers
(moves, possibly simple arithmetic), you may be able to reduce the number of
vector-scalar moves.
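One way to picture the conservative (Synchronous DRAM) option above: the
scalar side stores encoded instructions into a reserved command region of the
address space, and the IRAM's control logic drains them into an on-chip
instruction queue. A minimal sketch of the scalar side, with the reserved
address and queue depth purely assumed:

    #include <stdint.h>

    /* Assumed: a region at the top of the IRAM address space is reserved
       for commands, so ordinary stores double as instruction delivery
       (the address lines say "this is a command", the data lines carry
       the instruction bits).                                            */
    #define VIRAM_CMD_BASE ((volatile uint64_t *)0xFFFF0000UL)  /* hypothetical */

    static unsigned cmd_slot;   /* next slot in the on-chip instruction queue */

    /* Push one encoded vector instruction into the IRAM's queue. */
    static void viram_push(uint64_t encoded_instr) {
        VIRAM_CMD_BASE[cmd_slot % 64] = encoded_instr;  /* depth of 64 assumed */
        cmd_slot++;
    }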
Comments:
- The single, central set of vector registers means that memory on chip is
uniformly accessible and easier for software, unlike some of the SIMD approaches to IRAM.
- The speed of the logic on a DRAM process simply determines the width of
the vector units. If logic is 1/2 speed, we can get the same peak performance
by doubling the number of vector elements processed per clock (costing twice
as much hardware). The cost is a larger vector startup time, and hence a larger
N_1/2 value (see the sketch after these comments).
- Key to the design is the interconnect: how much area does it take to provide
a potent interconnect?
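A small model of the startup-time comment above, using the usual
T(n) = startup + n/rate form, where N_1/2 is the vector length at which half
of peak performance is reached (numbers are illustrative):

    #include <stdio.h>

    /* For an n-element vector operation taking
          T(n) = startup + n / elems_per_clock   clocks,
       half of peak rate is reached at N_1/2 = startup * elems_per_clock,
       so N_1/2 grows with both startup time and elements per clock.    */
    static double n_half(double startup_clocks, double elems_per_clock) {
        return startup_clocks * elems_per_clock;
    }

    int main(void) {
        /* Same peak performance, but the wider design has a larger N_1/2. */
        printf("1 elem/clock, 10-clock startup: N1/2 = %.0f\n", n_half(10, 1));
        printf("2 elem/clock, 10-clock startup: N1/2 = %.0f\n", n_half(10, 2));
        return 0;
    }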
Questions:
- How much data and how many instructions really cross the vector-scalar interface?
- How expensive is the hardware to fully connect to the memory
modules?
- How much area would it take for, say, 16 or 32 vector registers, each
with 64 or 128 vector elements? What about the functional units? (A
back-of-the-envelope sketch follows these questions.)
- Can the scalar registers remain in the vector unit, or must they be
transmitted across that interface as well? (Since it's only for reads, it may not
be too bad to send scalar data.)
- How good a match are vectors to visualization instructions?
- Is the overhead of communication so high that it's better to perform
length-1 vector operations than to perform operations in the scalar unit?
- Can software handle all the synchronization between scalar and vector
accesses? (e.g., read/write conflicts to the same word)
- Do any such machines exist (e.g., VAX 6000)
so that we could look at the code?
- Can Vector IRAM become popular as graphics accelerators?
- How many applications will run well with vector assist?
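A back-of-the-envelope sketch for the register-file question above, counting
only bits of storage (64-bit elements assumed; real area also depends on ports
and layout, and the functional units are not counted):

    #include <stdio.h>

    /* Vector register file storage: regs * elements/reg * bits/element. */
    static long vreg_bytes(long regs, long elems, long bits_per_elem) {
        return regs * elems * bits_per_elem / 8;
    }

    int main(void) {
        /* The configurations asked about above, assuming 64-bit elements. */
        printf("16 regs x  64 elems: %ld KB\n", vreg_bytes(16,  64, 64) / 1024);
        printf("32 regs x 128 elems: %ld KB\n", vreg_bytes(32, 128, 64) / 1024);
        return 0;
    }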