next up previous
Next: PERFORMANCE Up: No Title Previous: INSTRUCTION SET ARCHITECTURE

SPACE IMPLEMENTATION

 

We have constructed a large SPACE system as part of the PADMAVATI project [Guichard-Jary1990]. Full custom VLSI SPACE chips are packaged using Tape Automated Bonding onto compact SPACE modules. These modules are attached to conventional circuit boards, which are then connected by a backplane to host processor boards within a MIMD parallel computer system. The following sections describe each level in the packaging hierarchy.

SPACE chip

The SPACE chip implements a small version of the SPACE programming model described in Figure 1. Each chip contains an array of 5328 static CAPP cells arranged as 148 words of 36 bits, a 148-bit flag chain, 36-bit Mask and Write-Enable registers and a 148-input priority resolution tree. The array is never coordinate addressed, so there is no incentive to make the number of words per die a power of two. A conservative 16 transistor static CAPP cell design was adopted to simplify design and ensure robustness. Priority resolution was implemented using a high-radix tree to reduce the latency of selection operations.

The SPACE chip has 56 pins: 7 instruction inputs, a 36b tri-state data bus, 4 pins for cascading chips (pins PRF , NXF , REQ and PRQ ), 3 timing and control inputs (pins CS , CE and PCH ), and 6 power supply connections.

PRF and NXF are tri-state pins used to connect the flag chain of a chip to the previous and next chips' flag chains respectively. REQ and PRQ are used to connect the internal priority resolution tree as a leaf node of an external priority resolution tree. A chip asserts REQ if it contains selected words, the external tree asserts PRQ if any preceding chip asserted REQ . Each SPACE chip in a large system maintains its own copy of the Mask and Write-Enable registers. For write register operations all registers are updated in parallel. For read register operations, the priority resolution tree is used to ensure that only the first chip in the array responds with the values stored in these registers.

The CS input is used to partition a large array into independent banks. When low, it disables all chip operations and outputs. Multiple selected chips must be adjacent if flag chain continuity is to be preserved. This feature was added to allow a large SPACE array to be partitioned amongst a number of different applications. For example, on the PADMAVATI machine the run-time system may use one bank to speed global address translation for interprocess communication while the rest of the array is made available for user applications.

SPACE chip timing is controlled by the CE input. The PCH

pin controls the internal prechargers. It can be tied high in which case the chip is fully static but burns more power, or can be pulsed high between supplying the data and strobing CE .

The SPACE chips were fabricated in a 1.2 m two-level-metal n-well CMOS technology from a fully-custom layout. The die measures 5.87.9mm and contains over 80,000 transistors. The minimum chip cycle time is 125ns.

Figure 2 is a photomicrograph of the chip. Samples were received in June 1989 and passed all functional and speed tests first time, thanks to extensive functional and electrical simulation prior to fabrication. Over 1,200 working parts have since been fabricated for the PADMAVATI project. A more comprehensive data sheet is available [Howe and Asanovic1990].

 

 


Figure 2: SPACE Chip.

SPACE modules

A SPACE module contains twelve SPACE chips, buffering and one stage of external priority resolution, and is functionally equivalent to a 1776-word SPACE chip. There are numerous advantages to mounting the chips on modules rather than directly onto a large circuit board. Fabrication, testing and repair are simplified, and the modules can be reused in different target systems.

Each module measures 10060mm and holds twelve 56-pin SPACE chips plus seven 20--28 pin SSI/MSI components. Tape Automated Bonding (TAB) is used to bond the SPACE chips to the module PCB; the remaining components are surface mounted. With TAB, each SPACE chip has a PCB footprint of only 1210mm . The module PCB has power and ground planes, and 4 signal layers with a trace pitch of 0.5mm . The pin-out of the SPACE chip was designed to allow devices to be tightly tessellated on the module PCB. Each face of the die has the connections for one data byte. Since all four data bytes within a word are equivalent, the north facing data byte of one die can be directly connected to the south facing data byte of another die, minimising the area needed for routing and vias on the module PCB.

SPACE in PADMAVATI

Each PADMAVATI processor node occupies a standard 280233mm 6u board and includes a 25MHz Inmos T800 transputer with 16MB of 200ns DRAM and a 32b asynchronous bus interface. SPACE boards have the same form factor and connect to the processor through the bus interface. Figure 3 is a photograph of a SPACE board. Each board holds six SPACE modules, and each operation can select any subset of the six modules. The selected subset then acts like a single SPACE chip with up to 10656 words.

 

 


Figure 3: SPACE Board.

The transputer acts as the microcode sequencer for the SPACE array, sending instructions and data over the inter-board bus. The SPACE array is memory mapped, with instructions encoded on the bus address lines. SPACE board write and search operations are pipelined, and have a minimum cycle time of 320ns. SPACE read operations require a wait for the returned value, and have a minimum cycle time of 480ns. The transputer limits the rate at which instructions are issued to the board. Measurements on the PADMAVATI prototype give an average instruction cycle time of around 1250ns. A more tightly coupled, dedicated microcontroller could reduce the board cycle time for each SPACE instruction to around 160ns with the existing modules.

The final level of packaging yields a complete PADMAVATI system. Sixteen transputer/SPACE nodes are fully connected through a custom VLSI dynamic routing switch that obeys the transputer link protocol [Rabet et al. 1990]. By partitioning the SPACE array amongst the transputers in this manner, we gain higher array I/O bandwidth and the flexibility of MIMD control of SIMD subarrays. The total associative storage capacity is 17049636b words, or around 750KBytes. The complete system is attached as a compute server to a host Sun workstation.



next up previous
Next: PERFORMANCE Up: No Title Previous: INSTRUCTION SET ARCHITECTURE



Krste Asanovic
Wed Jan 31 22:40:32 PST 1996