CS 294-4: Intelligent DRAM (IRAM)

The following comments, regarding Lecture 4 and Lecture 5, are from Olivet Chou, of Mitsubishi Electronics America, Inc.


  1. Conceptually, the description for Figure 1 is correct, but any 3D rendering controller designer will tell you that data transfers across the "bus" are always bursty to achieve bandwidth efficiency. The burst transfer itself is a tradeoff between too much data (wasted burst cycles) and not enough data (needing another burst transfer).

  2. The two levels of cache in 3D-RAM not only solve the speed mismatch and balance the bandwidth, but also enable the on-chip ALU to realize its benefit. This is so because the only way to get the ALU working as desired is to have a dual-port memory. It is VERY expensive to build a dual-port DRAM at 10-Mbit density. On the other hand, a 2-Kbit triple-ported SRAM is an economical solution. Mitsubishi is leading the way in implementing SRAM in a DRAM process technology in production, in addition to the much-hyped DRAM + logic on the DRAM process.

  3. As mentioned in one of the lectures in the course, "a new technology needs a roadmap." 3D-RAM is a family of products. Mitsubishi currently has 3 generations of 3D-RAM products.

    3D-RAM(1) P/N M5M410092 0.50um proc, 2metal, 4poly, in production
    3D-RAM(2) P/N M5M410092A 0.45um proc, 2metal, 4poly, in production
    3D-RAM(3) P/N M5M410092B 0.40um proc, 2metal, 4poly, production soon

    The lecture notes incorrectly state the process to be 0.55 um with 1 level of metal. The point is that Mitsubishi is very much in sync with state-of-the-art CMOS process technology.

  4. The performance cited in the lecture notes corresponds to Figure 4. This means that the rendering controller has a (32 + 32) x 4 = 256-bit data interface with the 3D-RAM based frame buffer. This is a workstation-class controller delivering super-workstation-class 3D graphics.

  5. Your notes from the discussion on Jan 26, 1996 state that "a lesson from the FBRAM chip seems that we can use the sense amps as a second level cache ... this cache wouldn't be very big, namely 128KB for a 1GB DRAM". Today, Synchronous DRAM (SDRAM) and Synchronous Graphics RAM (SGRAM) have 8192 bits of sense amps for a 4-Mbit DRAM bank at densities of 16 Mbits or 8 Mbits per DRAM chip. Even at 64 Mbits per chip density, it is conceivable to have 8192 bits of sense amps for an 8-Mbit DRAM bank. In this latter system, you would get 1 MByte of sense amps in a 1-GByte DRAM, or 128 KBytes of sense amps in a 1-Gbit DRAM, depending on how you look at it.
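    The capacity figures above follow directly from the bank arithmetic. A minimal sketch, using only the numbers given in the text (8192 sense-amp bits per 8-Mbit bank; the function name is illustrative):

```python
# Sense-amp capacity arithmetic from the text: 8192 sense-amp bits
# per 8-Mbit DRAM bank, scaled up to a full memory system.
BITS_PER_BYTE = 8
MBIT = 2 ** 20   # bits in a megabit
GBIT = 2 ** 30   # bits in a gigabit

BANK_SIZE_BITS = 8 * MBIT          # one 8-Mbit DRAM bank
SENSE_AMP_BITS_PER_BANK = 8192

def sense_amp_bytes(total_dram_bits):
    """Total sense-amp storage, in bytes, for a given amount of DRAM."""
    banks = total_dram_bits // BANK_SIZE_BITS
    return banks * SENSE_AMP_BITS_PER_BANK // BITS_PER_BYTE

# 1 GByte of DRAM = 8 Gbit -> 1024 banks -> 1 MByte of sense amps
assert sense_amp_bytes(8 * GBIT) == 2 ** 20
# 1 Gbit of DRAM -> 128 banks -> 128 KBytes of sense amps
assert sense_amp_bytes(GBIT) == 128 * 2 ** 10
```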

    But the point is totally missed. The sense amps are not suitable to function as the first line of cache between the processor (or the on-processor cache) and the DRAM cells, because the speed mismatch between the two is still too significant. Mitsubishi believes the approach of integrating SRAM on a DRAM chip is the best approach to solve this speed mismatch problem, even if there is NO on-chip ALU. Mitsubishi pioneered Cache DRAM (CDRAM) and introduced a Multi-port Cache DRAM (MP-RAM) at the quarterly semiconductor memory supplier meeting (JEDEC meeting) in Portland, OR in June 1996. An on-chip SRAM 100% matches the controller's appetite for data, while the internal global bus between the DRAM and the on-chip SRAM compensates for the slower speed of the sense amps with an 8x width. Note again that this principle holds even if there is NO on-chip ALU. The first step to a truly intelligent RAM is to get this IRAM to provide the RIGHT data to the controller cycle by cycle without missing a beat!
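    The bandwidth-matching principle behind the 8x-wide internal bus can be sketched in a few lines. The 8x ratio comes from the text; the absolute width and clock figures below are illustrative assumptions, not Mitsubishi specifications:

```python
# Bandwidth matching: a wider, slower internal bus can deliver the same
# bandwidth as a narrower, faster SRAM port (bandwidth = width x rate).
sram_width_bits = 256        # assumed controller-side SRAM port width (illustrative)
sram_rate_mhz = 100          # assumed controller-side clock (illustrative)

dram_rate_mhz = sram_rate_mhz / 8          # sense amps run ~8x slower
dram_bus_width_bits = sram_width_bits * 8  # internal global bus is 8x wider

sram_bandwidth = sram_width_bits * sram_rate_mhz
dram_bandwidth = dram_bus_width_bits * dram_rate_mhz
# The two stages move data at the same rate, so neither starves the other.
assert sram_bandwidth == dram_bandwidth
```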

    For the latest analysis of the CDRAM (or DRAM + SRAM) advantage, please see the July 22, 1996 issue of the EE Times newspaper, p. 57.

  6. Not to belabor the point too much, but it is not really the absolute size of the on-chip SRAM or the number of sense amps per DRAM bank that determines the performance bottleneck; rather, it is the MATCHING of bandwidth and speed at every stage. It is OK to have a cache miss, as long as the IRAM can be informed several cycles (e.g. 4) in advance (which is entirely possible from the controller's design perspective). The SRAM cache can then perform a line flush and line update, and still present the requested data in time.
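    The early-notice idea above can be illustrated with a toy cycle model. A minimal sketch, assuming a simple cache where a line fill takes a fixed number of cycles and the controller announces each address some cycles ahead (function and parameter names are hypothetical):

```python
# Toy model: if every address is announced `lookahead` cycles before the
# data is needed, and a line fill from DRAM takes `fill_cycles`, misses
# cost nothing whenever lookahead >= fill_cycles.
def stall_cycles(addresses, lookahead, fill_cycles):
    """Total stall cycles for a stream of line addresses."""
    cached = set()
    stalls = 0
    for addr in addresses:
        if addr not in cached:
            # The fill started `lookahead` cycles early; any remainder
            # of the fill time is a visible stall.
            stalls += max(0, fill_cycles - lookahead)
            cached.add(addr)
    return stalls

# With 4 cycles of advance notice, 4-cycle line fills are fully hidden.
assert stall_cycles([1, 2, 3, 4], lookahead=4, fill_cycles=4) == 0
# With no advance notice, every miss costs the full fill time.
assert stall_cycles([1, 2, 3, 4], lookahead=0, fill_cycles=4) == 16
```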
