DRAFT - TO BE SUBMITTED FOR PUBLICATION - NOT FOR WIDE DISTRIBUTION

Evaluating Billion Transistor Architectures
under Different Computing Domains

Christoforos E. Kozyrakis - David A. Patterson

Computer Science Division
University of California at Berkeley
Berkeley, CA 94720

Abstract

The opportunity of integrating a billion transistors in a single chip has led to the proposal of a number of processor architectures to utilize this budget. Most of them are optimized for the technical, scientific and commercial workloads for which desktop and server systems are typically used today.

In this paper we present a different computing environment that we expect to become popular when these chips are available: personal mobile computing, where portable devices are used for visual computing and personal communications tasks. Such a device supports in an integrated fashion all the functionalities provided by a portable computer, a cellular phone, a digital camera and a video game today. The requirements placed on the processor in this environment are energy efficiency, high performance for multimedia and DSP functions and area efficient, scalable designs.

We examine the proposed architectures with respect to these requirements and discover that most them are unable to meet the new challenges and provide the necessary enhancements for multimedia applications running on portable devices.

Introduction

Advances in integrated circuits technology will soon provide the capability to integrate one billion transistors in a single chip [1]. This exciting opportunity presents computer architects and designers with the challenging problem of proposing microprossecor organizations able to utilize this huge transistor budget efficiently and meet the requirements of future applications. To address this challenge, IEEE Computer magazine hosted a special issue on ``Billion Transistor Architectures'' [2] in September 1997. The first three articles of the issue discussed problems and trends that will affect future processor design, while seven articles (footnote 1) from academic research groups proposed organizations for billion transistor processors. These proposals covered a wide architecture space, ranging from out-of-order designs to reconfigurable systems. In addition to the academic proposals, Intel and Hewlett-Packard presented the basic characteristics of their next generation IA-64 architecture [4], which is expected to dominate the high-performance processor market within a few years.

In this paper we present a perspective evaluation of these architectures under two different computing domains. We start with the computing domain that has shaped processor architecture for the past decade: the uniprocessor desktop running technical and scientific applications, and the multiprocessor server used for transaction processing and file-system workloads.

In the second part of the paper we evaluate the same architectures under a new computing domain that we expect to play a significant role in driving technology in the next millennium: personal mobile computing. In this paradigm, the basic personal computing and communication devices will be portable and battery operated, will support multimedia functions like speech recognition and video, and will be sporadically interconnected through a wireless infrastructure. A different set of requirements for the microprocessor, like real-time response and energy efficiency, arise in such an environment and lead to significantly different evaluation results.

This paper reflects the opinion and expectations of its authors. We believe that in order to design successful processor architectures for the future, we first need to explore the future applications of computing and then try to match their requirements in a scalable, cost-efficient way.

Overview of the Billion Transistor Architectures

Architecture	Source	Key Idea	Transistors used for Memory
Advanced Superscalar	[5]	wide-issue superscalar processor with speculative execution and multilevel on-chip caches	910M
Superspeculative Architecture	[6]	wide-issue superscalar processor with aggressive data and control speculation and multilevel on-chip caches	820M
Trace Processor	[7]	multiple distinct cores, that speculatively execute program traces, with multilevel on-chip caches	600M (footnote 2)
Simultaneous Multithreaded (SMT)	[3]	wide superscalar with support for aggressive sharing among multiple threads and multilevel on-chip caches	810M
Chip Multiprocessor (CMP)	[8]	symmetric multiprocessor system with shared second level cache	450M
IA-64	[4]	VLIW architecture with support for predicated execution and long instruction bundling	600M (footnote 2)
RAW	[9]	multiple processing tiles with reconfigurable logic and caches, interconnected through a reconfigurable network	640M
Vector IRAM (VIRAM)	[10]	multimedia enhanced vector processor with high bandwidth on-chip DRAM memory	800M

Table 1: The architectures proposed for the billion transistor microprocessors and the number of transistors used for memory cells for each one. We assume a billion transistor implementation the IA-64 architecture.

Table 1 summarizes the basic features of the proposed architectures.

The first three architectures (Advanced Superscalar, Superspeculative Architecture and Trace Processor) have very similar characteristics. The basic idea is a wide superscalar organization with multiple execution units or functional cores, that uses multi-level caching and aggressive prediction of data, control and even sequences of instructions (traces) to utilize all the available instruction level parallelism (ILP). Due to their similarity, we group them together and call them ``Wide Superscalar'' processors in the rest of this paper.

The Simultaneous Multithreaded (SMT) processor uses multithreading at the granularity of issue slot to maximize the utilization of a wide-issue out-of-order superscalar processor at the cost of additional complexity in the issue and control logic.

The Chip Multiprocessor (CMP) utilizes the transistor budget by placing a symmetric multiprocessor on a single die. There will be eight uniprocessors on the chip, all similar to current out-of-order processors, which will have separate first level caches but will share a large second level cache and the main memory interface.

The IA-64 can be considered as the commercial reincarnation of the VLIW architecture, renamed ``Explicitly Parallel Instruction Computer''. Its major innovations announced so far are the support for bundling multiple long instructions and the instruction dependence information attached to each one of them, which attack the problem of scaling and code density of older VLIW machines. It also includes hardware checks for hazards and interlocks so that binary compatibility can be maintained across generations of chips. Finally, it supports predicated execution through general-purpose predication registers to reduce control hazards.

The RAW machine is probably the most revolutionary architecture proposed, supporting the case of reconfigurable logic for general purpose-computing. The processor consists of 128 tiles, each with a small core, first level caches and a reconfigurable functional unit, interconnected with a reconfigurable network in an matrix fashion. The emphasis is placed on the software infrastructure, compiler and dynamic-event support, which handles the partitioning and mapping of programs on the tiles, as well as the configuration selection, data routing and scheduling.

Vector IRAM (VIRAM), the architecture introduced by the authors of this paper, proposes the integration of the processor and DRAM memory on the same chip, as well as the the use of vector processing. The system consists of an in-order dual-issue superscalar processor with first level caches, a multi-media enhanced vector execution unit with multiple partitionable pipelines and support for digital signal processing (DSP), and 96 MBytes of DRAM used as main memory. Serial lines, each operating in the Gbit/s range, will be used for high-speed I/O directly to on-chip memory.

Table 1 also reports the number of transistors used for caches and main memory in each billion transistor architecture. This varies from almost half the budget to 90% of it. It is interesting to notice that only a single architecture uses this memory as the main system memory. The rest spend 50% to 90% of their transistor budget to build caches in order to tolerate the high latency and low bandwidth problem of external memory.

In other words, the conventional vision of computers of the future is to spend most of the billion transistor budget on redundant, local copies of data normally found elsewhere in the system. Is such redundancy really our best idea for the use of 500,000,000 transistors(footnote 3) for applications of the future?

The Desktop/Server Computing Domain

	Wide Superscalar	Simultaneous Multithreaded	Chip Multiprocessor	IA-64	RAW	Vector IRAM
SPEC'04 (Desktop)	A	A	B	A-	C	?
TPC-F (Server)	B	A	A	B	D	?
Software Effort	A	C	C	C	D	?
Physical Design Complexity	D	D	B	B	A	?

Table 2: The evaluation of the billion transistor architectures for the desktop/server domain. Wide Superscalar includes the Advanced Superscalar, Superspeculative and Trace processors.

Current processors and computer systems are being optimized for the desktop and server domain, with SPEC'95 and TCP-C/D being the most popular benchmarks. This computing domain will likely be significant when the billion transistor chips will be available and similar benchmark suites will be in use. We playfully call them ``SPEC'04'' for technical/scientific applications and ``TCP-F'' for on-line transaction processing (OLTP) workloads.

Table 2 presents our prediction of the performance of these architectures for this domain using the traditional academic grading system from A to F. NOTE TO REVIEWERS(footnote 0)

For the desktop environment, the wide superscalar processors and the Simultaneous Multithreading one are expected to deliver the highest performance on SPEC'04, since out-of-order and advanced prediction techniques can utilize most the available ILP of a single sequential program. IA-64 will perform slightly worse because VLIW compilers are not yet mature enough to outperform the most advanced hardware ILP techniques, which exploit run-time information. CMP, RAW and VIRAM will have inferior performance since desktop applications have not been shown to be highly parallelizable or vectorizable. CMP will still benefit from the out-of-order features of its cores, while VIRAM will benefit from low memory latency and high memory bandwidth.

For the server domain, CMP and SMT will provide the best performance, due to their ability to utilize coarse-grain parallelism even with a single chip. Wide Superscalar or IA-64 systems will perform worse, since current evidence is that out-of-order execution provides little benefit to database-like applications [12]. The same holds for VIRAM because of its limited on-chip memory(footnote 4). With the RAW architecture it is difficult to predict any potential success of its software to map the parallelism of databases on reconfigurable logic.

A potentially different evaluation for the server domain could arise if we examine decision support (DSS) instead of OLTP workloads. In this case, small code loops with highly data parallel operations dominate execution time [15], so architectures like RAW and VIRAM should perform significantly better than for OLTP workloads.

For any new architecture to be widely accepted, it has to be able to run a significant body of software [11]. Thus, the effort needed to port existing software or develop new software is very important. The wide superscalar processors have the edge, since they can run existing executables. The same holds for SMT and CMP but, in this case, high performance can be delivered if the applications are written in a multithreaded or parallel fashion. As the past decade has taught us, parallel programming for high performance is neither easy nor automated. For IA-64 a significant amount of work is required to enhance VLIW compilers. In the case of VIRAM, we need to vectorize applications, but vectorizing compilers have been developed and used in commercial environments for years now. The RAW machine relies on the most challenging software development. Apart from the requirements of sophisticated routing, mapping and scheduling tools, there is a need for development of compilers or libraries to make such an architecture usable.

A last issue is that of physical design complexity which includes the effort for design, verification and testing. Currently, the whole development of an advanced microprocessor takes almost 4 years and a few hundred engineers [2][16][17]. Functional and electrical verification and testing complexity has been steadily growing [18][19] and accounts for the majority of the processor development effort. The wide superscalar and multithreading architectures exacerbate both problems by using complex techniques like aggressive data/control prediction, out-of-order execution and multithreading, and by having non modular designs(footnote 5) (multiple blocks individually designed). Chip Multiprocessor carries on the complexity of current out-of-order designs with support for cache coherency and multiprocessor communication. With the IA-64 architecture, the basic challenge is the design and verification of the forwarding logic between the multiple functional units on the chip. The RAW machine and VIRAM are modular designs. For the RAW architecture, only a single tile and network switch need to be designed and replicated. Verification of a reconfigurable organization is trivial in terms of the circuits, but verification of the mapping software is also required. For VIRAM, the necessary building blocks are the in-order scalar core, the vector pipeline, which is replicated 8 times, and the basic memory array tile. Due to the lack of dependencies and forwarding in the vector model and the in-order paradigm, the verification effort is expected to be low.

A New Target for Future Computers: Personal Mobile Computing

In the last few years, we have experienced a significant change in technology drivers. While high-end systems alone used to direct the evolution of computing, current technology is mostly driven by the low-end systems due to their large volume. Within this environment, two important trends have evolved that could change the shape of computing.

The first new trend is that of multimedia applications. The recent improvements in circuits technology and innovations in software development have enabled the use of real-time media data-types like video, speech, animation and music. These dynamic data-types greatly improve the usability, quality, productivity and enjoyment of personal computers [20]. Functions like 3D graphics, video and visual imaging are already included in the most popular applications and it is common knowledge that their influence on computing will only increase:

``90% of desktop cycles will be spent on `media' applications by 2000'' [21]
``multimedia workloads will continue to increase in importance'' [2]``many users would like outstanding 3D graphics and multimedia'' [16]
``image, handwriting, and speech recognition will be other major challenges'' [19]

At the same time, portable computing and communication devices have gained large popularity. Inexpensive ``gadgets'', small enough to fit in a pocket, like personal digital assistants (PDA), palmtop computers, webphones and digital cameras were added to the list of portable devices like notebook computers, cellular phones, pagers and video games [22]. The functionalities supported by such devices are constantly expanded and multiple devices are converging into a single one. This leads to a natural increase in their demand for computing power, but at the same time their size, weight and power consumption have to remain constant. For example, a typical PDA is 5 to 8 inches by 3.2 inches big, weighs six to twelve ounces, has 2 to 8 Mbytes of memory (ROM/RAM) and is expected to run on the same set of batteries for a period of a few days to a few weeks [22]. One should also notice the large software, operating system and networking infrastructure developed for such devices (wireless modems, infra-red communications etc): Windows CE and the PalmPilot development environment are prime examples [22].

Figure 1: Personal mobile devices of the future will integrate the functions of current portable devices like PDAs, video games, digital cameras and cellular phones.

Our expectation is that these two trends together will lead to a new application domain and market in the near future. In this environment, there will be a single personal computation and communication device, small enough to carry around all the time. This device will include the functionalities of a pager, a cellular phone, a laptop computer, a PDA, a digital camera and a video game combined [23] (Figure 1) . The most important feature of such a device will be the interface and interaction with the user: voice and image input and output (speech and voice recognition) will be key functions used to type notes, scan documents and check the surrounding for specific objects. A wireless infrastructure for sporadic connectivity will be used for services like networking (www and email), telephony and global positioning system (GPS), while the device will be fully functional even in the absence of network connectivity.

Potentially this device will be all that a person may need to perform tasks ranging from keeping notes to making an on-line presentation, and from browsing the web to programming a VCR. The numerous uses of such devices and the potential large volume lead us to expect that this computing domain will soon become at least as significant as desktop computing is today.

The microprocessor needed for these computing devices is actually a merged general-purpose processor and digital-signal processor (DSP), at the power budget of the latter. There are four major requirements: energy/power efficiency, high performance for multimedia functions, small size and low design complexity.

With a budget of less than two Watts for the whole device, the processor has to be designed with a power target less than one Watt, while still being able to provide high-performance for functions like speech recognition. Power budgets close to those of current high-performance microprocessors (tens of Watts) are unacceptable.

The basic characteristics of media-centric applications that a processor needs to support or utilize in order to provide high-performance were specified in [20] in the same issue of IEEE Computer:

real-time response: instead of maximum peak performance, sufficient worst case guaranteed performance is needed for real-time qualitative perception for applications like video.
continuous-media data types: media functions are typically processing a continuous stream of input that is discarded once it is too old, and continuously send results to a display or speaker. Hence, the data reuse and locality premise of the desktop processors is questionable. This data is also narrow, as pixel images and sound samples are 8 to 16 bits wide, rather than the 32-bit or 64-bit data of desktop machines. The ability to perform multiple operations on such types on a single wide datapath is desirable.
fine-grained parallelism: in functions like image, voice and signal processing, the same operation is performed across sequences of data in a SIMD fashion.
coarse-grained parallelism: in many media applications a single stream of data is processed by a pipeline of functions to produce the end result.
high instruction-reference locality: media functions usually have small kernels or loops that dominate the processing time and demonstrate high temporal and spatial locality for instructions.
high memory bandwidth: applications like 3D graphics require huge memory bandwidth for large data sets that have limited locality.
high network bandwidth: streaming data like video or images from external sources requires high network and I/O bandwidth.

After energy efficiency and multimedia support, the third main requirement for personal mobile computers is small size and weight. The desktop assumption of several chips for external cache and many more for main memory is infeasible for PDAs, and integrated solutions that reduce chip count are highly desirable. A related matter is code size, as PDAs will have limited memory to keep down costs and size, so the size of program representations is important.

Another important concern is design complexity, like in the desktop domain, and scalability. An architecture should scale efficiently not only in terms of performance but also in terms of physical design. Long interconnects for on-chip communication are expected to be a limiting factor for future architectures as a small region of the chip (around 15%) will be accessible in a single clock cycle [24] and therefore should be avoided.

Architecture Evaluation for Mobile Multimedia Applications

	Wide Superscalar	Simultaneous Multithreaded	Chip Multiprocessor	IA-64	RAW	Vector IRAM
Real-time response	C	C	C	C	C	?
Real-time response	unpredictability of out-of-order, branch prediction and/or caching techniques					in-order model no vector caches
Continuous data-types	B	B	B	B	B	?
Continuous data-types	caches do not efficiently support data streams with little locality					no vector caches
Fine-grained parallelism	B	B	B	B	A	?
Fine-grained parallelism	MMX-like extensions less efficient than full vector support				reconfigurable logic unit	vector unit
Coarse-grained parallelism	C	A	A	C	A	?
Instruction ref. locality	A	A	A	A	A	?
Instruction ref. locality	locality through large instruction caches
Code size	A-	A-	A-	C	C	?
Code size	potential use of loop unrolling and software pipelining for higher ILP			VLIW instructions	hardware configuration	vector instructions for loops
Memory bandwidth	C	C	C	C	C	?
Memory bandwidth	cache-based designs					on-chip multibank DRAM
Network bandwidth limits	C	B	B	C	B	?
Network bandwidth limits	caches/cost of interrupts	separate thread or process for network I/O		caches/cost of interrupts	separate tile for network I/O	Gbit/s serial I/O to memory
Energy/power efficiency	D	D	C	C	D	?
Energy/power efficiency	power penalty for out-of-order schemes, complex issue logic, forwarding and reconfigurable logic}
Physical design complexity	D	D	B	B	A	?
Design scalability	C	C	C	C	C	?
Design scalability	long wires for forwarding data or for reconfigurable interconnect					memory crossbar

Table 3: The evaluation of the billion transistor architectures for the personal mobile computing domain.

Table 3 summarizes our evaluation of the billion transistor architectures with respect to personal mobile computing.

The support for multimedia applications is limited in most architectures. Out-of-order techniques and caches make the delivered performance quite unpredictable for easily guaranteed real-time response, while caches also complicate support for continuous-media data-types. Fine-grained parallelism can be exploited by using MMX-like, vector or reconfigurable execution units. The vector model is still superior to MMX-like solutions, as it provides explicit support of the length of SIMD instructions, and it does not expose the complexity of data packing and alignment to software. After all, most media processing functions are based on algorithms working on vectors of pixels or samples, so the highest performance can be delivered by a vector unit. Coarse-grained parallelism, on the other hand, is best on the Simultaneous Multithreading, Chip Multiprocessor and RAW architectures.

Instruction reference locality has traditionally been exploited through large instruction caches. Yet, designers of portable system would prefer reductions in code size as suggested by the 16-bit instruction versions of MIPS and ARM [25]. Code size is a weakness for IA-64 and any other architecture that relies heavily on loop unrolling for performance, as it will surely be larger than that of 32-bit RISC machines. RAW may also have code size problems, as one must ``program'' the reconfigurable portion of each datapath. The code size penalty of the other designs will likely depend on how much they exploit loop unrolling and in-line procedures to expose enough parallelism for high performance. VIRAM has the advantage in being able to specify whole loops in a single vector instruction, potentially leading to smaller programs than the other alternatives.

Memory bandwidth is another limited resource for cache-based architectures, especially in the presence of multiple data sequences, with little locality, being streamed through the system. Network and I/O bandwidth can be limited by software overhead, interrupt cost or memory bandwidth. Overlooking the software overhead limit which should be similar for all architectures, network and I/O bandwidth will be further limited by the cache bandwidth and the high cost of interrupts in out-of-order or deeply pipelined designs. Still, certain architectures can address this issue by dedicating a process, thread or tile to I/O handling. For VIRAM, I/O goes directly to memory through Gbit/s serial lines and interrupt cost is not high for vector architectures with multiple vector pipelines and on-chip main memory [26].

The energy/power efficiency issue, despite its importance both for portable and desktop domains [27], is not addressed in most designs. Redundant computation for out-of-order models, complex issue and dependence analysis logic, fetching a large number of instructions for a single loop, forwarding across long wires and use of the typically power hungry reconfigurable logic increase the energy consumption of a single task and the power of the processor. VIRAM on the other hand uses the inherently low-power DRAM and a vector unit, where there are no dependencies, limited forwarding is needed and performance comes from multiple vector pipelines rather than from high-frequency operation.

As for physical design scalability, forwarding results across large chips is the main problem of most designs. Such communication already requires multiple cycles in high-performance out-of-order designs. Scaling reconfigurable interconnects is the challenge for RAW architecture. For VIRAM, the processor-memory crossbar is the only place were long wires are used. Still, the vector model can tolerate latency if sufficient fine-grain parallelism is available, so deep pipelining is a viable solution.

Conclusions

Billion transistor architectures will be a reality almost a decade from now. Nevertheless, it is clear that we are still designing processors of the future with a heavy bias for the past. For example, the programs in the SPEC'95 suite were originally written many years ago, yet these were the main drivers for most papers in the special issue on billion transistor processors for 2010.

Exploring future architectures requires exploration of future computer applications as well. In the last few years, the major use of computing devices has shifted to non-engineering areas. Personal computing is already the mainstream market, portable devices for computation, communication and entertainment have become popular, and multimedia functions drive the application market. We expect that the combination of these will lead to the personal mobile computing domain, where portability, energy efficiency and efficient interfaces through the use of media types (voice and images) will be the key features.

The question we asked is whether the proposed architectures can meet the challenges of this new computing domain. Unfortunately, the answer is negative for most of them. Limited and mostly ``ad-hoc'' support for multimedia or DSP functions is provided, power is not a serious issue and unlimited complexity of design and verification is justified by even slightly higher peak performance.

Providing the necessary support for personal mobile computing requires a significant shift in the way we design processors. The key requirements that processor designers will have to address will be energy efficiency to allow battery operated devices, focus on worst case performance instead of peak for real-time applications, multimedia and DSP support to enable visual computing, and simple scalable designs with reduced development and verification cycles.

We believe that personal mobile computing offers a vision of the future with a much richer and more exciting set of architecture research challenges than extrapolations of the current desktop architectures and benchmarks.

Put another way, which problem would you rather work on: improving performance of PCs running FPPPP or making speech input practical for PDAs?

References

1: Semiconductor Industry Association. The National Technology Roadmap for Semiconductors. SEMATECH Inc., 1997.
2: D. Burger and D. Goodman. Billion-Transistor Architectures - Guest Editors' Introduction. IEEE Computer, 30(9):46-48, September 1997.
3: S.J. Eggers, J.S. Emer, H.M. Leby, J.L. Lo, R.L. Stamm, and D.M. Tullsen. Simultaneous Multithreading: a Platform for Next-Generation Processors. IEEE MICRO, 17(5):12-19, October 1997.
4: J. Crawford and J. Huck. Motivations and Design Approach for the IA-64 64-Bit Instruction Set Architecture. In Microprocessor Forum, October 1997.
5: Y.N. Patt, S.J. Patel, M. Evers, D.H. Friendly, and J. Stark. One Billion Transistors, One Uniprocessor, One Chip. IEEE Computer, 30(9):51-57, September 1997.
6: M. Lipasti and L.P. Shen. Superspeculative Microarchitecture for Beyond AD 2000. IEEE Computer, 30(9):59-66, September 1997.
7: J. Smith and S. Vajapeyam. Trace Processors: Moving to Fourth Generation Microarchitectures. IEEE Computer, 30(9):68-74, September 1997.
8: L. Hammond, B.A. Nayfeh, and K. Olukotun. A Single-Chip Multiprocessor. IEEE Computer, 30(9):79-85, September 1997.
9: E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring It All to Software: Raw Machines. IEEE Computer, 30(9):86-93, September 1997.
10: C.E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell, R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and K. Yelick. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer, 30(9):75-78, September 1997.
11: J Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, second edition. Morgan Kaufmann, 1996.
12: K. Keeton, D.A. Patterson, Y.Q. He, and Baker W.E. Performance Characterization of the Quad Pentium Pro SMP Using OLTP Workloads. In the proceedings of the 1998 International Symposium on Computer Architecture (to appear), June 1998.
13: K. Keeton, R. Arpaci-Dusseau, and D.A. Patterson. IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck. In Workshop on "Mixing Logic and DRAM: Chips that Compute and Remember", the 24th Annual International Symposium on Computer Architecture, June 1997.
14: K. Keeton, D.A. Patterson, and J.M. Hellerstein. The Intelligent Disk (IDISK): A Revolutionary Approach to Database Computing Infrastructure. submitted for publication, March 1998.
15: P. Trancoso, J. Larriba-Pey, Z. Zhang, and J. Torrellas. The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors. In Third International Symposium on High-Performance Computer Architecture, January 1997.
16: G. Grohoski. Challenges and Trends in Processor Design: Reining in Complexity. IEEE Computer, 31(1):41-42, January 1998.
17: P. Rubinfeld. Challenges and Trends in Processor Design: Managing Problems in High Speed. IEEE Computer, 31(1):47-48, January 1998.
18: R. Colwell. Challenges and Trends in Processor Design: Maintaining a Leading Position. IEEE Computer, 31(1):45-47, January 1998.
19: E. Killian. Challenges and Trends in Processor Design: Challenges, Not Roadblocks. IEEE Computer, 31(1):44-45, January 1998.
20: K. Diefendorff and R. Dubey. How Multimedia Workloads Will Change Processor Design. IEEE Computer, 30(9):43-45, September 1997.
21: W. Dally. Tomorrow's Computing Engines. Keynote Speech, Fourth International Symposium on High-Performance Computer Architecture, February 1998.
22: T. Lewis. Information Appliances: Gadget Netopia. IEEE Computer, 31(1):59-68, January 1998.
23: V. Cerf. The Next 50 Years of Networking. In ACM97 Conference Proceedings, March 1997.
24: D. Matzke. Will Physical Scalability Sabotage Performance Gains? IEEE Computer, 30(9):37-39, September 1997.
25: L. Goudge and S. Segars. Thumb: reducing the cost of 32-bit RISC performance in portable and consumer applications. In Digest of Papers, COMPCON '96, February 1996.
26: K. Asanovic. Vector Microprocessors. PhD thesis, Computer Science Division, University of California at Berkeley, 1998.
27: T. Mudge. Strategic Directions in Computer Architecture. ACM Computing Surveys, 28(4):671-678, December 1996.

Footnotes...

Footnote 1

[3] was actually published in IEEE Micro in October 1997, but it was part of the IEEE Computer special issue.

Footnote 2

These numbers include transistors for main memory, caches and tags. They are calculated based on information from the referenced papers. The numbers for the Trace processor and IA-64 were based on lower-limit expectations and the fact that their predecessors spent at least half their transistor budget on caches.

Footnote 3

While die area is not a linear function of the transistor number (memory transistors can be placed much more densely than logic transistors), die cost is an exponential function of die area [11]. Thus, 500M transistors are very expensive.

Footnote 0

NOTE TO REVIEWERS: We are torn as whether or not to include VIRAM in evaluations presented. As co-authors of the VIRAM paper and since we are going to spend the next few years of our lives working on it, we are obviously biased in its favor, and will not likely give it a lower grade than it deserves. Hence, we should drop it. On the other hand, since we are working on a related idea it may seem machiavellian to grade everyone else on metrics we think are important for the future. The choice between being accused for being biased or machiavellian is hard one, but we picked the former. In addition, to make reviewing more interesting, this version of the paper includes our reasoning but not the actual grades we assigned to VIRAM. Each reviewer can assign her/his own grades which we will compare to ours.

Footnote 4

While the use of VIRAM as the main CPU is not attractive for servers, a more radical approach to servers of the future places a VIRAM in each SIMM module [13] or each disk [14] and have them communicate over high speed serial lines via crossbar switches

Footnote 5

The Trace Processor uses replicated processing elements, so its design complexity is lower than that of the other wide superscalar processors.

Evaluating Billion Transistor Architectures under Different Computing Domains

Evaluating Billion Transistor Architectures
under Different Computing Domains