CS298-1 Systems Seminar

Spring 1996

Thursdays, 3:30-5, 306 Soda Hall

Course Control #25236 (directions to Soda)

The UC Berkeley Systems Seminar highlights exciting developments in Computer Architecture, Operating Systems, Networking, and related areas. It returns to its traditional Thursday afternoon timeslot, with refreshments served at 3:30 and talks beginning at 4 pm. Graduate students may enroll for one unit of credit. It looks like the pattern this semester is Berkeley systems graduates. We will have Ed Lee (DEC), Stuart Sechrest (U Mich), Mary Baker (Stanford), Mark Hill (Wisconsin), and Mendel Rosenblum (Stanford) talking about current research. Bill Dally (MIT) has also agreed to speak. More will get filled in as we know it.

Current Schedule

Jan. 25: Scalable Network Storage, Ed Lee, Digital Equipment Corp.
(SRC 1996 Summer Intern Program Announcement)
Feb. 1: Stuart Sechrest, Univ. Michigan, Dynamic Branch Predictors: The BIG Picture
Feb. 8: The Illinois Aggressive Cache Only Memory Architecture Multiprocessor (I-ACOMA), Josep Torrellas, Univ. Illinois
Feb. 15: The Information Bus(r) -- An Architecture for Extensible Distributed Systems, Thomas Joseph, Teknekron Software Systems
Feb. 22: Mendel Rosenblum, Stanford University, The SimOS machine simulation environment
Feb. 29: John Wawrzynek, SPERT-II: A Vector Microprocessor System
Mar. 7: C. Mohan, Trends in Workflow Management and an Overview of IBM's Exotica Project
Mar. 14: I.L.P
Mar. 21: Thorsten von Eicken, Cornell Univ., U-Net: A User-Level Network Interface for Parallel and Distributed Computing
Mar. 28: Spring Break
Apr. 4: Bill Dally, MIT, The M-Machine: Cost-Effective, Scalable Parallel Computing
Apr. 11: Mark Hill, Tempest: A Substrate for Portable Parallel Programs

Apr. 25: The MosquitoNet Mobile and Wireless Computing Project, Mary Baker, Stanford University
May 2: Vern Paxson, UCB, An Analysis of End-to-End Internet Dynamics

David E. Culler culler@cs.Berkeley.edu, Last Edited: 1/15/95

Upcoming Talks

The MosquitoNet Mobile and Wireless Computing Project

Mary Baker
Stanford University

The main goal of the MosquitoNet project is to provide ubiquitous, continuous network connectivity for mobile computers. Achieving this goal requires taking advantage of whatever network connectivity is available at a particular time or place. It also requires switching transparently between different varieties of networks as availability changes. In MosquitoNet so far we have built a simple test-bed. It provides in-office connectivity using Ethernet and metropolitan-area connectivity using Metricom's wireless radios. We are able to switch between these networks transparently using our mobile IP implementation. Our mobile IP implementation emphasizes the ability to "visit" networks that are operated by different authorities. In this talk I'll describe our test-bed, the mobile IP implementation, and the performance we've seen so far.

An Analysis of End-to-End Internet Dynamics

Vern Paxson
vern@ee.lbl.gov
EECS
University of California, Berkeley

Network Research Group
Lawrence Berkeley National Laboratory

Dissertation Seminar

With (currently) 10,000,000 hosts and 100,000 networks, the Internet presents formidable difficulties to accurate, representative measurement. Yet without sound measurement, keeping the Internet functioning, improving its performance, and extending it for future traffic become seat-of-the-pants engineering problems, destined to prove disastrous when applied to such a huge system. Further complicating the problem of measurement is the fact that what is generally of interest is the *end-to-end* behavior of an Internet path, which can be difficult or impossible to deduce from localized measurement of only one component (say, the portion of the path that traverses Sprint's links). At the heart of my dissertation research lies a prototype framework for conducting end-to-end Internet measurements. An important question we must address early on when conducting end-to-end measurements concerns the behavior of end-to-end routing. The frequency of routing changes and failures directly colors subsequent analysis of other end-to-end properties. I'll give an overview of the measurement framework and then a detailed look at an analysis of 40,000 end-to-end route measurements conducted using repeated "traceroutes" between 37 Internet sites. After arguing that the set of measurements is plausibly representative, I'll characterize the different pathologies, the stability of routes over time, and their symmetry. One ominous finding that emerges from the analysis is that the likelihood of encountering a major routing pathology more than doubled between the end of 1994 and the end of 1995, rising from 1.5% to 3.4%. This finding matches widespread anecdotal evidence that Internet performance has seriously degraded in the post-NSFNET era. I will also present preliminary results from a corresponding analysis of end-to-end TCP dynamics, and finish with a look at how the measurement methodology might evolve into a more general Internet measurement infrastructure.

Past Talks

The M-Machine: Cost-Effective, Scalable Parallel Computing

William J. Dally
Massachusetts Institute of Technology
Laboratory for Computer Science, and Artificial Intelligence Laboratory

The M-Machine is an experimental parallel computer system being designed to address three fundamental problems with current approaches to parallel computing: single processor scalability, programmability, and cost/performance. The M-Machine uses "processor coupling" to exploit instruction-level-parallelism across four independent "clusters" of arithmetic elements. This approach overcomes the register bandwidth and instruction issue bottlenecks of superscalar processors and the synchronization bottleneck of VLIW approaches. The M-Machine supports an incremental approach to parallel programming in which a sequential program is converted in stages. To this end, the machine provides a distributed shared memory that can be used to automate data placement and migration as well as low-overhead (sub microsecond), protected message passing for optimized communication. An efficient implementation of capability-based addressing provides fast switching between protection domains to support a modular approach to parallel programming. Finally, the M-Machine uses a very high processor to memory ratio with supporting mechanisms to overcome the cost-performance disadvantage of conventional parallel computers. For jobs with even modest amounts of parallelism, the M-Machine offers both better absolute performance and better cost performance than a sequential computer.

Physically, the M-Machine is a collection of processing nodes connected by a 3-D mesh network. Each M-Machine node consists of a multi-ALU processor (MAP) chip and 1MW (8MB) of external memory (a total of five chips per node). A MAP chip contains four operation clusters, a network router, a network interface, a 4-bank, 128KB cache, and internal switches. Each cluster includes two integer ALUs, a floating point unit, and the associated register files. Each network channel and each node's local memory port provide bandwidth of one 64-bit word per processor cycle.

Biography

William Dally received the B.S. degree in Electrical Engineering from Virginia Polytechnic Institute, the M.S. degree in Electrical Engineering from Stanford University, and the Ph.D. degree in Computer Science from Caltech.

Bill has worked at Bell Telephone Laboratories where he contributed to the design of the BELLMAC32 microprocessor. Later as a consultant to Bell Laboratories, he helped design the MARS hardware accelerator. He was a Research Assistant and then a Research Fellow at Caltech where he designed the MOSSIM Simulation Engine and the Torus Routing Chip. He is currently a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology where he directs the Concurrent VLSI Architecture group. Bill and his group have built the J-Machine, a fine-grain concurrent computer, and the Reliable Router, a high-speed fault-tolerant network component that includes simultaneous bidirectional signalling and a low-latency plesiochronous synchronizer. They are currently designing the M-Machine, a computer that explores mechanisms appropriate for parallel computation.

Bill's research interests include computer architecture, computer graphics, parallel computing, software systems, computer aided design, and VLSI design.

Tempest: A Substrate for Portable Parallel Programs

Mark D. Hill
Computer Sciences Department
University of Wisconsin
1210 West Dayton Street
Madison, WI 53706

Are parallel programs portable? Today the answer is "no". Massively parallel processors (MPPs) are programmed with explicit message passing, networks of workstations (NOWs) with sockets, and symmetric multiprocessors (SMPs) with shared memory. To be portable, parallel applications must run across this range of machines without source-level changes. To make this possible, we have developed the Tempest interface, which provides mechanisms to support shared-memory, message-passing, and hybrid communications. With these mechanisms, a low-cost workstation cluster can provide the same communications abstractions (e.g., shared memory) as high-performance parallel machines. The Tempest interface consists of low-level communication and memory-system mechanisms. User-level software implements specific policies, such as shared-memory protocols. Programmers and compilers can customize a policy to fit an application's semantics and sharing patterns. Experiments show that in some cases custom cache-coherence policies can produce an order-of-magnitude performance improvement, without restructuring a program's source code. We have studied several Tempest implementations. The first, Typhoon, offers high performance with first-class hardware support. This proposed hardware implements Tempest using a fully-programmable, user-level processor in the network interface. The second system, Blizzard, demonstrates Tempest's portability by implementing the same interface on unmodified stock hardware: both a Thinking Machines CM-5 and the Wisconsin COW (a Cluster Of Workstations). Finally, our third system, Typhoon-Zero, serves as both a mid-range implementation and a prototype of the more complex Typhoon hardware. Typhoon-Zero adds a simple hardware module to the Wisconsin COW nodes to accelerate Tempest's fine-grain access control mechanism. This work is part of the Wisconsin Wind Tunnel project, co-led by Mark Hill, James Larus, and David Wood. 
For more information, see URL http://www.cs.wisc.edu/~wwt.

SPERT-II: A Vector Microprocessor System

John Wawrzynek
Computer Science Division
University of California, Berkeley

Feb 29

We have developed a high-performance system, called SPERT-II, for multimedia, human-interface, neural network, and other digital signal processing tasks. The system is based on a custom vector microprocessor packaged as an attached processor for a conventional workstation. The processor comprises an integer RISC core and a fixed-point vector coprocessor integrated on a single VLSI chip. The attached processor is in daily use for neural network backpropagation training for speech recognition. It demonstrates roughly 15 times the performance of a mid-range workstation and five times the performance of a high-end workstation, even with extensive hand-optimization of both workstation versions. I will present the architecture of our vector processor and the SPERT-II system and present performance comparisons with workstations on neural network backpropagation.

U-Net: A User-Level Network Interface for Parallel and Distributed Computing

Thorsten von Eicken
Cornell University
March 21

The U-Net communication architecture provides processes with a virtual view of a network interface to enable user-level access to high-speed communication devices. The architecture, implemented on standard workstations using off-the-shelf ATM communication hardware, removes the kernel from the communication path, while still providing full protection. The model presented by U-Net allows for the construction of protocols at user level whose performance is limited only by the capabilities of the network. The architecture is extremely flexible in the sense that traditional protocols like TCP and UDP, as well as novel abstractions like Active Messages, can be implemented efficiently. A U-Net prototype on an 8-node ATM cluster of standard workstations offers 65 microseconds round-trip latency and 15 Mbytes/sec bandwidth. It achieves TCP performance at maximum network bandwidth and demonstrates performance equivalent to Meiko CS-2 and TMC CM-5 supercomputers on a set of Split-C benchmarks.

Trends in Workflow Management and an Overview of IBM's Exotica Project

C. MOHAN

IBM Almaden Research Center, San Jose, CA 95120, USA
mohan@almaden.ibm.com

In the last few years, workflow management has become a hot topic in the research community and, especially, in the commercial arena. Workflow management is multidisciplinary in nature encompassing many aspects of computing: database management, distributed client-server systems, transaction management, mobile computing, business process reengineering, integration of legacy and new applications via object oriented technology, and heterogeneity of networks, and hardware and software platforms. Many academic and industrial research projects are underway. Numerous successful products have been released. Standardization efforts are in progress under the auspices of the Workflow Management Coalition. As has happened in the RDBMS area with respect to some topics, in the workflow area also, some of the important real-life problems faced by customers and product developers are not being tackled by researchers. Based on my experience founding and leading the Exotica workflow project at IBM, and my close collaboration with the IBM FlowMark workflow product group and, more recently, with the Lotus people, I will briefly survey the state of the art in workflow management. I will also give an overview of the Exotica project at IBM Almaden Research Center. Information on Exotica is available on the web at http://www.almaden.ibm.com/cs/exotica/.

The SimOS machine simulation environment

Mendel Rosenblum
Computer Systems Laboratory
Stanford University
Feb 22

In this talk I will describe SimOS, a machine simulation environment my students and I have developed for studying workloads and evaluating ideas in operating systems and computer architecture. SimOS can model the behavior of modern machines with enough speed and detail that it is possible to boot and run commercial operating systems along with the application workloads that run on top. In this talk I will describe some of the key features of SimOS including the explicit control of the simulation speed-detail tradeoff and the techniques used for relating the events observed in the simulator back to the high-level program actions that cause them. I will illustrate the power of these techniques with examples from several completed SimOS studies.

Scalable Network Storage

Edward K. Lee
Chandramohan A. Thekkath
Digital Equipment Corporation
Systems Research Center

High-performance scalable interconnects allow us to build large-scale computing systems consisting of tens if not hundreds of computing nodes. The practical deployment and use of such systems is often limited by our ability to manage them. This problem is particularly severe for large-scale storage subsystems, because they must reliably store and manage vast amounts of information over long periods of time. We feel that such systems must be designed to be highly available, incrementally expandable in both capacity and performance, and support mechanisms for automating the configuration and management of the system. Our approach to the problem is to design plug-and-play block-level storage servers that cooperate with each other to implement a single logical storage subsystem that is auto-configuring and self-managing. Expanding the storage system is a matter of plugging in a new storage server and switching on the server. The information stored on the storage system automatically redistributes itself online to take advantage of the added capacity and performance of the new storage server. While centralized block-level storage systems with some of the above functionality are currently available, we are unaware of any such system that has implemented the above functionality at the distributed systems level. This talk describes our goals, the overall architecture of the system, and performance measurements of our second Scalable Network Storage prototype implemented using 8 Alpha workstations, 32 SCSI disks, and AN2, an auto-configuring ATM network designed, built, and deployed at our research center.

Dynamic Branch Predictors: The BIG Picture

Stuart Sechrest
University of Michigan

The importance of accurate branch prediction to future processors has been widely noted, but much of the recent work on branch prediction has been limited by its reliance upon user-level traces of the SPECint89 and SPECint92 benchmarks and a failure to appreciate the way that interbranch aliasing will affect accuracy.

In this talk we will report on simulations of a variety of branch prediction schemes using a set of relatively large benchmark programs more representative than SPECint of likely system workloads. We examine the effects of aliasing in two-level branch prediction schemes and the sensitivity of these schemes to variation in workload, in resources, and in design and configuration. We also examine the role of adaptation in these schemes, showing that adaptation is not necessary for competitive performance.
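As a rough illustration of what interbranch aliasing means, the following sketch indexes a small table of 2-bit saturating counters with the low bits of the branch address. This is a deliberately simple single-level predictor, not one of the two-level schemes studied in the talk; the function name, the branch addresses, and the table sizes are all hypothetical and chosen only so that two branches with opposite behavior collide in the smaller table.

```python
def simulate(branches, table_bits, steps=1000):
    """Return the fraction of correct predictions.

    Each branch is predicted by a 2-bit saturating counter selected by the
    low `table_bits` bits of its address (an assumed indexing scheme).
    """
    size = 1 << table_bits
    # Counter values 0-1 predict not taken, 2-3 predict taken.
    counters = [1] * size  # start every entry at "weakly not taken"
    correct = total = 0
    for _ in range(steps):
        for address, taken in branches:
            index = address & (size - 1)
            predicted_taken = counters[index] >= 2
            correct += predicted_taken == taken
            total += 1
            # Move the counter toward the actual outcome, saturating at 0 and 3.
            if taken:
                counters[index] = min(3, counters[index] + 1)
            else:
                counters[index] = max(0, counters[index] - 1)
    return correct / total


# Two hypothetical branches: one always taken, one never taken.
# Their addresses agree in the low 4 bits and differ in bit 4.
BRANCHES = [(0x1000, True), (0x1010, False)]

aliased = simulate(BRANCHES, table_bits=4)   # both branches share entry 0
separate = simulate(BRANCHES, table_bits=5)  # entries 0 and 16

print(f"16-entry table (aliased):  {aliased:.4f}")
print(f"32-entry table (separate): {separate:.4f}")
```

In the smaller table the two branches keep undoing each other's training, so the shared counter mispredicts both. With one more index bit each branch gets its own entry and is predicted almost perfectly.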

The Illinois Aggressive Cache Only Memory Architecture Multiprocessor (I-ACOMA)

Josep Torrellas
Computer Science Department and Center for Supercomputing Research and Development
Univ. of Illinois at Urbana-Champaign

The Illinois Aggressive Cache-Only Memory Architecture Multiprocessor (I-ACOMA) is a research project that explores scalability issues for hundreds of high-performance processors organized in a flat-COMA configuration. The machine is a tightly-coupled multiprocessor that presents a single address space to the programmer.
The two main aspects of the project are an aggressive design of the memory hierarchy to eliminate slowdowns due to memory accesses and a machine organization that supports database workloads effectively. The first issue, namely eliminating slowdowns due to memory accesses, is addressed via a combination of COMA organization, a novel update protocol, and aggressive prefetching at several levels. The second issue, database support, is addressed with the COMA organization. The project also includes other systems and compiler issues that will be briefly discussed in the talk.

The Information Bus(r) -- An Architecture for Extensible Distributed Systems

Thomas Joseph
Teknekron Software Systems

Research can rarely be performed on large-scale, distributed systems at the level of thousands of workstations. In this talk, we describe the motivating constraints, design principles, and architecture for an extensible, distributed system operating in such an environment. The constraints include continuous operation, dynamic system evolution, and integration with extant systems. The Information Bus is a synthesis of four design principles: core communication protocols have minimal semantics, objects are self-describing, types can be dynamically defined, and communication is anonymous. The current implementation provides both flexibility and high performance, and has been proven in several commercial environments, including brokerage/trading floors and semiconductor fabrication plants.
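Two of the design principles named above, self-describing objects and anonymous communication, can be sketched in a few lines. This is a toy in-process illustration, not the Information Bus API: the `Bus` class, the `describe` helper, the idea of addressing messages by a subject string, and the example subjects are all assumptions made for the example.

```python
from collections import defaultdict


class Bus:
    """Hypothetical in-process stand-in for an anonymous message bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, subject, callback):
        """Register interest in a subject; the subscriber never names a sender."""
        self._subscribers[subject].append(callback)

    def publish(self, subject, message):
        """Deliver a message to whoever subscribed; return how many received it.

        The publisher names only a subject, never a receiver.
        """
        callbacks = list(self._subscribers.get(subject, ()))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


def describe(type_name, **fields):
    """Build a self-describing message: the data travels with its own schema."""
    return {
        "type": type_name,
        "schema": {name: type(value).__name__ for name, value in fields.items()},
        "data": dict(fields),
    }


bus = Bus()
received = []
bus.subscribe("quotes.IBM", received.append)

delivered = bus.publish("quotes.IBM", describe("Quote", symbol="IBM", price=101.5))
unheard = bus.publish("quotes.DEC", describe("Quote", symbol="DEC", price=60.0))

print(delivered, unheard)
print(received[0]["schema"])
```

Because each message carries its own schema, a receiver can handle a type it has never seen before, and because neither side names the other, publishers and subscribers can be added or replaced while the system keeps running.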