CS252: Spring 2012 Final Projects

This page contains pointers to the final CS252 project pages for Spring of 2012. These projects are done in groups of two or three and span a wide range of topics.
1:   Parallel Assembly of Full Genomes on Commodity Clusters
 Richard Xia and Albert Kim
Modern genome sequencers are capable of producing millions of short sequences of DNA, and each new generation of sequencers provides an order of magnitude more data than the last. The challenge today is to build a software genome assembler that is highly parallel, high-performance, and affordable to run. We present an implementation of a whole-genome assembler which is highly parallelized and runs efficiently on commodity clusters. We use a number of techniques to lower memory usage, reduce communication bandwidth, and allow for scaling up to hundreds of cores.
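A central technique in parallel assemblers of this kind is to partition the input by k-mer, hashing each k-mer to an owning node so every occurrence of the same k-mer lands in one place regardless of which read it came from. A minimal sketch of that idea (the function names and toy reads here are hypothetical illustrations, not the authors' assembler):

```python
from collections import defaultdict

def kmers(read, k):
    """Yield all overlapping k-mers of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def partition_kmers(reads, k, num_nodes):
    """Assign each k-mer to a node by hashing, so that every copy of
    a given k-mer is routed to the same node. In a real assembler each
    bucket would be a remote node rather than a local list."""
    buckets = defaultdict(list)
    for read in reads:
        for km in kmers(read, k):
            buckets[hash(km) % num_nodes].append(km)
    return buckets

reads = ["ACGTACGT", "CGTACGTA"]
buckets = partition_kmers(reads, k=4, num_nodes=4)
all_kmers = [km for b in buckets.values() for km in b]
```

Because ownership is a pure function of the k-mer, nodes can build their slice of the de Bruijn graph without any global coordination.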
Supporting Documentation: Final Report (pdf) Slides (ppt)
2:   Hardware Support for Irregular Control Flow in Vector Processors
 Huy Vo
Data-parallel applications can be divided into two categories: regular and irregular data-level parallel applications. Regular data-parallel applications have structured data accesses as well as structured control flow. Irregular data-parallel applications, on the other hand, have unstructured data accesses, unstructured control flow, or both. Data-parallel processors are known to exhibit much higher performance and energy efficiency than scalar processors when executing data-parallel applications. If data-parallel processors are to continue to grow in popularity, they must be able to efficiently handle both regular and irregular data-level parallelism. In this paper, I show how to provide hardware support for data-parallel applications that exhibit irregular control flow. Through an extensive design space exploration, I evaluate the cost, performance, and energy efficiency of my implementations.
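A common hardware approach to irregular control flow on data-parallel machines is predication: rather than branching per element, the machine computes a mask and executes both sides of the branch for all lanes, merging results under the mask. A small software model of that pattern (purely illustrative; the function names are invented here, and real hardware would skip masked-off work where possible):

```python
def masked_vector_op(data, cond, then_fn, else_fn):
    """Software model of vector predication: compute a per-lane mask,
    evaluate both paths across all lanes, and select per lane. This
    mirrors how a vector unit handles a divergent if/else without
    per-element branches."""
    mask = [cond(x) for x in data]
    then_vals = [then_fn(x) for x in data]   # "executed" for every lane
    else_vals = [else_fn(x) for x in data]   # "executed" for every lane
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

# even lanes are doubled, odd lanes are negated
out = masked_vector_op([1, 2, 3, 4],
                       lambda x: x % 2 == 0,
                       lambda x: 2 * x,
                       lambda x: -x)
# out == [-1, 4, -3, 8]
```

The cost of this scheme is exactly the inefficiency the project targets: both paths consume execution slots even when most lanes take one side.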
Supporting Documentation: Final Report (pdf) Slides (pdf)
3:   Maximizing Cache Energy-Efficiency through Joint Optimization of L1 Write Policy, SRAM Design, and Error Protection
 Brian Zimmer and Michael Zimmer
L1 cache design contributes significantly to the performance and energy consumption of microprocessors due to the L1's large proportion of die size and high activity factor. Voltage reduction lowers energy per operation; however, increased process variability in modern technology nodes causes an exponential growth in SRAM failure probability for a linear reduction in voltage, necessitating some form of error correction. We compare the energy, area, performance, and error-rate implications of two methods of error correction (a write-back cache with SEC-DED and a write-through cache with parity detection) and identify the system factors, such as cache size and latency, that define the optimal solution. Additionally, we explore the potential benefits of running a processor at finite error rates (aggressive voltage scaling) by analyzing recovery costs and system energy. Energy measurements come from full transistor-level simulation of functional 28nm SRAM designs over varying operating conditions, and hit/miss rate measurements come from both a cache simulator embedded in the RISC-V ISA simulator and cycle-accurate simulation of the Rocket processor.
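The SEC-DED scheme above builds on Hamming codes. As a concrete illustration, here is a Hamming(7,4) encoder and single-error corrector; adding one overall parity bit would extend it to double-error detection. This is the generic textbook construction, not the paper's SRAM protection circuit:

```python
def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword.
    Bit positions 1..7 hold p1 p2 d1 p4 d2 d3 d4, where each parity
    bit covers the positions whose binary index includes it."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4        # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c):
    """Recompute the three checks; the syndrome is the 1-based position
    of the flipped bit (0 means no error). Flip it back and return the
    four data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[4] ^= 1                         # inject a single-bit SRAM upset
assert hamming74_correct(code) == word
```

The area/energy trade-off the paper studies follows directly: SEC-DED stores and checks extra bits on every access, while parity detection is cheaper per access but can only discard, not repair, a failed line.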
Supporting Documentation: Final Report (pdf) Slides (pdf)
4:   SecureCell: An Architecture for Computing with Private Data on the Cloud
 Eric Love and Soham Mehta
We present an architecture called SecureCell that enables secure computation with users' private data in a cloud setting. Specifically, it allows the users to run untrusted third-party applications on top of untrusted operating systems on machines they do not control, while remaining confident that no private data will be leaked by either the applications or the operating system. The SecureCell architecture accomplishes these goals by introducing a 'sealed container' primitive, a hardware feature that ensures only code running inside a single address space may see an unencrypted view of that address space's data, and that automatically encrypts and decrypts data as it moves across the container's boundary. We present an initial implementation of the automatic encryption feature and evaluate its performance in an architectural simulator.
Supporting Documentation: Final Report (pdf) Slides (ppt)
5:   An Investigation into Concurrent Expectation Propagation
 David Hall and Alex Kantchelian
As statistical machine learning becomes more prevalent and models grow more complicated and are fit to larger amounts of data, approximate inference mechanisms become ever more crucial to their success. Expectation propagation (EP) is one such algorithm for inference in probabilistic graphical models. In this work, we introduce a robustified version of EP which helps ensure convergence under a relaxed memory consistency model. The resulting algorithm can be efficiently implemented on a GPU in a straightforward way. Using a 2D Ising spin glass model, we evaluate both the original EP algorithm and our robustified version in terms of convergence behavior and precision on a classic single-core processor. We also compare the naive parallelized version of the original EP algorithm against the parallelized robustified EP on both a multicore CPU and a GPU.
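On a 2D grid model like the Ising spin glass, the standard trick for conflict-free parallel updates is red/black (checkerboard) coloring: sites of one color share no neighbors, so an entire half-sweep can update concurrently. The sketch below shows that coloring pattern with a simple Metropolis update; it is an illustrative sketch of the concurrency structure only, not the authors' EP algorithm:

```python
import math
import random

def checkerboard_sweep(spins, beta):
    """One red/black sweep of a 2D Ising grid with periodic boundaries.
    Within each color, no site reads another same-color site, so each
    inner double loop could run fully in parallel; the same coloring
    idea gives conflict-free parallel message updates on grid-shaped
    graphical models."""
    n = len(spins)
    for color in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != color:
                    continue
                nb = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j] +
                      spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
                dE = 2 * spins[i][j] * nb
                if dE <= 0 or random.random() < math.exp(-beta * dE):
                    spins[i][j] = -spins[i][j]
    return spins

random.seed(0)
n = 8
grid = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(n)]
for _ in range(10):
    grid = checkerboard_sweep(grid, beta=0.6)
```

A relaxed-consistency implementation, as studied in the project, drops even this coloring discipline and must instead tolerate reading stale neighbor values.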
Supporting Documentation: Final Report (pdf) Slides (ppt)
6:   Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors
 Amik Singh
We explore optimization techniques for geometric multigrid on existing multicore systems, including the Cray XE6 and Sandy Bridge- and Nehalem-based InfiniBand clusters, as well as manycore architectures including NVIDIA's Fermi GPUs and Intel's Many Integrated Core (MIC) processor. We apply a number of techniques, including communication aggregation, threaded wavefront-based DRAM communication avoidance, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase at each level of the v-cycle for both single-node and distributed-memory experiments, noting where our techniques enabled performance gains and what the costs were elsewhere.
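For reference, the v-cycle being tuned has a fixed recursive shape: pre-smooth, restrict the residual to a coarser grid, recursively solve for a correction, prolong it back, and post-smooth. A minimal serial sketch for the 1D Poisson problem (the 1D setting, weighted-Jacobi smoother, and injection restriction are simplifications chosen here, not the authors' configuration):

```python
import math

def smooth(u, f, h, iters=2):
    """Weighted-Jacobi relaxation for -u'' = f with zero boundaries."""
    w = 2.0 / 3.0
    n = len(u)
    for _ in range(iters):
        new = u[:]
        for i in range(1, n - 1):
            new[i] = (1 - w) * u[i] + w * 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
        u = new
    return u

def residual(u, f, h):
    n = len(u)
    r = [0.0] * n
    for i in range(1, n - 1):
        r[i] = f[i] - (2 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r

def restrict(r):
    """Restriction to the coarse grid (simple injection, for brevity)."""
    return [r[2 * i] for i in range((len(r) + 1) // 2)]

def prolong(e, n_fine):
    """Linear interpolation of a coarse correction back to the fine grid."""
    u = [0.0] * n_fine
    for i in range(len(e)):
        u[2 * i] = e[i]
    for i in range(1, n_fine - 1, 2):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u

def v_cycle(u, f, h):
    """One geometric-multigrid v-cycle: pre-smooth, coarse-grid
    correction (recursively), post-smooth."""
    if len(u) <= 3:
        return smooth(u, f, h, iters=50)      # "solve" the coarsest grid
    u = smooth(u, f, h)
    r = restrict(residual(u, f, h))
    e = v_cycle([0.0] * len(r), r, 2 * h)
    u = [a + b for a, b in zip(u, prolong(e, len(u)))]
    return smooth(u, f, h)

# solve -u'' = pi^2 sin(pi x) on [0,1]; exact solution is sin(pi x)
n = 65
h = 1.0 / (n - 1)
f = [math.pi ** 2 * math.sin(math.pi * i * h) for i in range(n)]
u = [0.0] * n
for _ in range(5):
    u = v_cycle(u, f, h)
```

Each function here corresponds to one of the phases whose per-level costs the project measures: smoothing dominates flops, while restriction and prolongation dominate communication in the distributed case.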
Supporting Documentation: Final Report (pdf) Slides (ppt)
7:   Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems
 Scott Marshall and Stephen Twigg
Increasing core counts resulting from many-core proliferation, combined with greater I/O demands in the form of memory, disk, and network I/O, have placed a strain on the processor-memory interconnect in modern architectures. This has spurred the need to enforce fair resource utilization, particularly when multiple tenants share a single many-core system. In our project, we strive to bring attention to DMA-based I/O operations when establishing fairness in these systems, through the design of a DMA-aware QoS policy that applies admission-control throttling to requests entering the interconnection network. We validate our work through two means: PARSEC and DMA benchmarks in a Linux environment via the gem5 full-system simulator, and a full-system memory simulator with synthetic traffic that we designed as part of our current work. We aim to improve upon past research, which analyzed the performance of isolated message passes, by evaluating QoS performance in systems employing realistic cache coherence protocols. The proposed admission control scheme promises to enforce granted bandwidth guarantees while allowing for the recovery of unused resources.
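A simple way to picture admission-control throttling with bandwidth guarantees is a per-tenant token bucket: each requester earns tokens at its guaranteed rate and may only inject into the interconnect when a token is available, with unused allocation accumulating up to a burst limit. A hypothetical sketch of that idea (the class and parameters are invented here, not the paper's policy):

```python
class TokenBucket:
    """Minimal admission-control throttle. Each tenant (CPU core or
    DMA engine) earns `rate` tokens per cycle up to `burst`; a request
    is admitted to the interconnect only when a whole token is
    available. Unused guaranteed bandwidth accumulates, allowing
    later recovery of idle allocation."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst

    def tick(self):
        self.tokens = min(self.burst, self.tokens + self.rate)

    def admit(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# two always-requesting tenants: CPU guaranteed 75% of injection
# slots, the DMA engine 25%
cpu, dma = TokenBucket(0.75, 4), TokenBucket(0.25, 4)
admitted = {"cpu": 0, "dma": 0}
for cycle in range(1000):
    for name, bucket in (("cpu", cpu), ("dma", dma)):
        bucket.tick()
        if bucket.admit():
            admitted[name] += 1
```

Under saturation the admitted counts track the granted rates; when a tenant idles, its accumulated tokens let it briefly burst, recovering bandwidth it did not use.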
Supporting Documentation: Final Report (pdf) Slides (pdf)
8:   A Distributed Algorithm for 3D Radar Imaging on eWallpaper
 Patrick Li and Simon Scott
eWallpaper is a smart wallpaper with embedded low-power processors and radio transceivers. An important application of the wallpaper is to use the radio transceivers to perform 3D imaging of the room, a task traditionally achieved using mm-wave synthetic-aperture radar (SAR) imaging techniques. The major obstacles to implementing this technique on the wallpaper are the distribution of the data amongst the large number of processors, the restrictive mesh topology and the limited local memory for each processor. Our major contribution is a distributed and memory efficient implementation of the 3D imaging algorithm that achieves real-time framerates. An eWallpaper hardware simulator was built to verify the algorithm's correctness and investigate various communication schemes. This simulator was parallelized using MPI and Pthreads, enabling fast simulation on a high-performance computing cluster. A computational model was derived from the parallel algorithm, and a network traffic simulator was built, to model the actual performance on the eWallpaper hardware.
Supporting Documentation: Final Report (pdf) Slides (ppt)
9:   Hardware Chaining using Coherency Traffic
 James Martin
Many shared memory algorithms organize processing elements into virtual or physical pipelines, and sender-to-receiver sharing accounts for a significant fraction of memory accesses. On distributed shared memory architectures, this sharing incurs extra latency because the receiver must monitor the sender's writes to data that may reside significantly further away. Ideally, the sender would communicate directly with the receiver. Rather than introducing a separate protocol such as message passing, the proposed solution lets the sender write directly into the address space managed by the receiver, reusing the same coherency traffic as ordinary shared memory with minimal additional hardware.
Supporting Documentation:

Maintained by John Kubiatowicz (kubitron@cs.berkeley.edu).
Last modified Tue May 8 09:48:02 2012