CS262a: Fall 2021 Final Project Suggestions (New suggestions coming!)

Stay tuned for other project suggestions! For the final Fall 2019 projects, check HERE.


Here are some suggestions for projects:

FOG ROBOTICS PROJECTS:
Suggested by John Kubiatowicz (kubitron@cs.berkeley.edu) and Anthony Joseph (adj@berkeley.edu):

Here are some suggested projects in this space. Note that many variations of these projects are possible.

    Figure 1: The Paranoid Stateful Lambda System (PSL)
  1. Sharded, Secure multicast tree for Paranoid Stateful Lambdas (Suggested/supervised by John Kubiatowicz)
    The Paranoid Stateful Lambda system (PSL) is built on top of DataCapsules and provides a secure system for large-scale parallel computation protected within secure enclaves such as Intel SGX. The PSL system provides a key-value store implemented within DataCapsules which allows multiple simultaneous writes with eventually consistent semantics. The current implementation of PSL sends signed writes from each secure enclave to every other secure enclave over an extremely basic multicast tree, which is extremely responsive but does not scale. This project involves providing a much more scalable and secure multicast tree implementation that can scale to 100s or 1000s of nodes. This system should dynamically build a multicast tree that restricts communication to nodes with proper cryptographic credentials, distributes communication over multiple intermediate networking elements for performance, and automatically figures out which PSL writes are of interest to which enclaves, thereby sending each secure write to only the subset of enclave participants that need it (the suggestion here is to use a Bloom filter or equivalent; see the sketch below). Ideally, the multicast tree would also handle retransmission in a network-efficient fashion (e.g., using NAKs). Suggestions for implementation technologies include Capsule, which was inspired by Scott Shenker's NetBricks.
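    A minimal sketch of the Bloom-filter idea is below. Everything here is illustrative rather than part of the existing PSL code (the BloomFilter class and the child.filter / child.send interfaces are invented): each enclave advertises a filter summarizing the keys it cares about, interior tree nodes hold the OR of their subtree's filters, and a write is forwarded only down subtrees whose filter might contain the key.

      # Hypothetical sketch: Bloom-filter-based forwarding in the multicast tree.
      import hashlib

      class BloomFilter:
          def __init__(self, size_bits=8192, num_hashes=4):
              self.size = size_bits
              self.num_hashes = num_hashes
              self.bits = bytearray(size_bits // 8)

          def _positions(self, key: bytes):
              for i in range(self.num_hashes):
                  h = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
                  yield int.from_bytes(h[:8], "big") % self.size

          def add(self, key: bytes):
              for p in self._positions(key):
                  self.bits[p // 8] |= 1 << (p % 8)

          def might_contain(self, key: bytes) -> bool:
              return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

      def forward_write(children, key: bytes, signed_write: bytes):
          """Send a write only down subtrees whose aggregated filter may contain the key."""
          for child in children:                  # child.filter is the OR of its subtree's filters
              if child.filter.might_contain(key):
                  child.send(signed_write)        # false positives cost bandwidth, never correctness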
  2. Epidemic Replication Mechanisms for DataCapsules (Suggested/supervised by John Kubiatowicz)
    In the interest of further improving the Paranoid Stateful Lambda system (PSL), described above, this project seeks to provide a continuous replication mechanism for DataCapsules which supports multiple replicas of the PSL Key-Value system embedded in DataCapsules. This system would figure out how to provide efficient acknowledgements of writes back to the Paranoid Stateful Lambdas (of which there might be 100s or 1000s) while guaranteeing that a minimum threshold of DataCapsule replicas receives each write and that replicas constantly update each other to make up for lost writes. Ideally, one or more replicas could go temporarily offline and come back online without disturbing the PSL computation. Note that this task is helped by the fact that DataCapsules can be viewed as CRDTs (as in Reading #9), since each log entry has a unique ID (i.e. the hash over its contents). Further, given a set of log entries from a DataCapsule, they can be uniquely reassembled into an acyclic graph. Consequently, replicas can be synchronized via set-union (see the sketch at the end of this item). For this project, develop a replication strategy for DataCapsules that deals well with (1) crashed and restarted servers, (2) network partitions, and (3) a minimum threshold number of copies of each write (a quorum) to assure durability. What is the API for readers? Can you provide a read mechanism that deals with missing information (a "hole" in Figure 1 of the proposal above)?

    There are many interesting details, such as utilizing multicast replication during normal operation coupled with epidemic, pair-wise synchronization to handle failures. Other questions involve secure acknowledgement to the writer that a quorum has received its writes. Can this be done in a way that is less expensive than requiring every replica to sign its ack and send it back to the client in bulk? [Note that, among other things, each DataCapsule server would have a unique cryptographic identity. Consequently, acknowledgements must be signed with that identity in order for the writer to be sure that an authorized party has received the writes. Such signatures can be optimized in a variety of ways. What becomes particularly interesting, however, is how to verify that each write has been received by a minimum (quorum) number of servers without flooding the network with a huge number of acknowledgements.]
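    Below is a minimal sketch of the pair-wise (epidemic) synchronization step, assuming a hypothetical replica interface with record_hashes(), get_record(), and store(); it is not the actual DataCapsule API. Because entries are content-addressed, exchanging hash sets and shipping the differences is enough for two replicas to converge.

      # Hypothetical anti-entropy round between two DataCapsule replicas.
      def anti_entropy_round(local_replica, remote_replica):
          local_hashes = set(local_replica.record_hashes())
          remote_hashes = set(remote_replica.record_hashes())

          for h in local_hashes - remote_hashes:      # records the peer is missing
              remote_replica.store(local_replica.get_record(h))
          for h in remote_hashes - local_hashes:      # records we are missing
              local_replica.store(remote_replica.get_record(h))
          # Because records are content-addressed and form an acyclic graph,
          # repeated rounds converge regardless of ordering or duplicate delivery.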
  3. Log-Structured Merge-Tree in DataCapsules (Suggested/supervised by John Kubiatowicz in combination with William Mullen)
    The Paranoid Stateful Lambda project (PSL) provides an eventually-consistent key-value store that can be simultaneously read and written by multiple tasks running in multiple enclaves. However, the back-end storage portion of PSL is still in need of further implementation. Adopting ideas from RocksDB, LevelDB, or SplinterDB, produce an implementation of the back-end storage that is compatible with the PSL system and that can efficiently and securely store key-value pairs in DataCapsules. This component corresponds to the "KVS CAAPI" component shown in Figure 1 above and would need to run in an enclave. This project should maintain the linked and signed structure of records in DataCapsules, and allow for efficient querying of the latest value associated with a given key. It should also validate writes (by checking signatures) coming on the multicast tree from the many enclaves running in PSL. Many performance optimizations are possible. Note that this project already has at least one partner -- they are looking for another one! Talk to William Mullen (wmullen@berkeley.edu).
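    To make the shape of the task concrete, here is a toy sketch of the LSM pattern with hypothetical interfaces (capsule_writer.append_signed_run is invented for illustration); a real design would follow RocksDB/LevelDB much more closely, with compaction, per-run filters, and on-capsule indexes.

      # Toy LSM sketch: writes buffer in a memtable, flush as sorted immutable runs.
      class TinyLSM:
          def __init__(self, capsule_writer, memtable_limit=1000):
              self.memtable = {}                 # key -> (value, signature)
              self.runs = []                     # newest-first list of flushed runs
              self.capsule_writer = capsule_writer
              self.memtable_limit = memtable_limit

          def put(self, key, value, signature):
              # A real implementation would verify the enclave signature before accepting.
              self.memtable[key] = (value, signature)
              if len(self.memtable) >= self.memtable_limit:
                  self.flush()

          def flush(self):
              run = sorted(self.memtable.items())          # immutable sorted run
              self.capsule_writer.append_signed_run(run)   # hypothetical: signs and links the run
              self.runs.insert(0, dict(run))
              self.memtable = {}

          def get(self, key):
              if key in self.memtable:
                  return self.memtable[key][0]
              for run in self.runs:                        # newest first
                  if key in run:
                      return run[key][0]
              return None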
  4. Secure Edge Computing using dynamically linked code exported as DataCapsules (Suggested/supervised by John Kubiatowicz)
    Extend a secure enclave environment (such as that of the Paranoid Stateful Lambda system (PSL)) to serve secure computation to mobile clients at the edge of the network (such as robots, industrial manufacturing environments, etc.). Assume that both code and data are exported using DataCapsules, so that all that has to happen is for the client to contact a third-party "computation provider", verify ("attest") the third-party software, then exchange cryptographic credentials to allow DataCapsules to be used (a sketch of this flow appears after the notes below). If done correctly, such computation providers could be embedded anywhere on the edge -- in 5G base stations or in local Fog Servers (or "cloudlets"). An additional aspect would be to include a secure migration protocol which would allow computation to move along with the client (consider a smart car traveling along a highway, using a succession of computation providers along the side of the road). What is the architecture of such a system? Consider any number of interesting issues, such as how a client (1) confirms the reliability of the offered secure execution environment, (2) negotiates keys with the secure environment to give it access to the corresponding DataCapsules, and (3) begins the execution of a secure computation that reads and writes its information through DataCapsules. You might consider building a Docker-like format inside DataCapsules that allows execution in a variety of different environments. Can you figure out how to dynamically link the DataCapsule into the secure enclave environment once you have started it running? Some interesting consequences:
    1. Since DataCapsules provide secure branched/versioned data, a single DataCapsule could provide multiple versions of an application or library -- thus exporting an official (and cryptographically secured) package of binaries. In fact, applications could be dynamically linked with a variety of specific libraries (also in DataCapsules) at the time that they start executing.
    2. Media (such as video, slides, documents, etc) could be stored in DataCapsules in a self-executing fashion, i.e. so that the DataCapsule holding the media would have embedded in it pointers to the binaries required to play/display this media. Thus, assuming that DataCapsules can be embedded in the network for extended periods (say decades), one could be sure that this media could always be playable, decades after it was authored.
    Note that the GDP group is looking to work with researchers at LBL to integrate functionality such as this into a 5G basestation, so that this particular project may have some interesting follow-on research.
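    As a rough illustration of the client-side protocol discussed in this project, the following sketch shows one possible flow; every name in it (provider, verify_attestation, key_exchange, wrap_key, etc.) is hypothetical rather than an existing API.

      # Hypothetical client-side flow: attest, exchange keys, grant capsule access, launch.
      def run_on_edge_provider(provider, code_capsule_id, data_capsule_id, my_keys):
          quote = provider.request_attestation()               # e.g., an SGX quote
          if not verify_attestation(quote, expected_measurement(code_capsule_id)):
              raise RuntimeError("provider enclave failed attestation")

          session_key = key_exchange(provider, my_keys)        # e.g., key agreement bound to the quote
          provider.grant_capsule_access(data_capsule_id,
                                        wrap_key(my_keys.capsule_key, session_key))
          return provider.start(code_capsule_id, data_capsule_id)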
  5. Pick an application to run on the edge in a Paranoid Stateful Lambda infrastructure (Suggested/supervised by John Kubiatowicz)
    This project is quite open ended. It would involve porting an interesting application to the Paranoid Stateful Lambda infrastructure and demonstrating its advantages in an Edge environment. Ideally matched to this environment would be robotics or machine learning applications involving sensitive information. Talk to John Kubiatowicz (kubitron@cs.berkeley.edu) for help in defining this project.
  6. Browser-based Secure DataCapsule Access (Suggested by Anthony Joseph/John Kubiatowicz/Joey Gonzales)
    Use Javascript-based access to SGX to enable secure reads and writes of DataCapsules through the GDP, possibly using THIS. Another possibility is converting one of the GDP experimental implementations (written in Python) into Javascript using Transcrypt. Note that interfacing with the Key-Value store from PSL is one obvious option here.
  7. Secure Data Filesystem built on top of the DataCapsule/GDP infrastructure. (Suggested by John Kubiatowicz)
    Build a usable file system on top of the GDP that handles authentication, replication, and key management (i.e. can deal with many different users, each with their own credentials). Interesting questions might be how to get good performance out of this system (think about things like the Log-structured file system and/or Flash File system). Could this have high-enough performance to make people want to use it? How would it deal with replication? There are many directions that this project could go.
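    One way to picture the core of such a file system is as a mapping from file operations onto signed, append-only DataCapsule records, in the spirit of a log-structured file system. The sketch below is purely illustrative (capsule.append, capsule.replay, sign, encrypt, and the other helpers are hypothetical); a real design would keep an index so reads do not replay the whole log.

      # Hypothetical mapping of file writes/reads onto an append-only DataCapsule log.
      def write_file(capsule, path, offset, data, user_key):
          record = {
              "op": "write",
              "path": path,
              "offset": offset,
              "data": encrypt(data, user_key),   # per-user key management decides who can decrypt
          }
          capsule.append(sign(record, user_key)) # appended, hash-linked, and signed

      def read_file(capsule, path, user_key):
          contents = bytearray()
          for record in capsule.replay():        # a real design would consult an index instead
              rec = verify_and_decode(record)
              if rec["op"] == "write" and rec["path"] == path:
                  data = decrypt(rec["data"], user_key)
                  end = rec["offset"] + len(data)
                  if len(contents) < end:
                      contents.extend(b"\x00" * (end - len(contents)))
                  contents[rec["offset"]:end] = data
          return bytes(contents)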
  8. Routing Information Base for the Global Data Plane
    This project is quite open ended. The idea is to reimplement some of the GDP infrastructure in Rust using something like Capsule, which was inspired by Scott Shenker's NetBricks. The point is not to implement everything, but rather to reimplement a reasonable subset of data-centric routing, routing delegations, and the DataCapsule service. The hope is to make the basic service as lightweight as possible, to allow EECS to support a GDP networking infrastructure that is fast enough to be used instead of IP for a variety of data services. Talk to Kubi about some options here.
    Some details:
    1. The Global Data Plane (GDP) provides a flat address space routing protocol to endpoints (e.g. DataCapsules) identified by 256-bit hash names, rather than IP addresses. DataCapsules are served over the network on DataCapsule servers, which "advertise" the DataCapsules that they support. These advertisements are signed delegations from the DataCapsule owner stating that the owner of the DataCapsule server is allowed to advertise it. For replication, multiple replicas of the same DataCapsule could be advertised in the network at the same time. Clients route queries through the GDP infrastructure to the closest DataCapsule.
    2. The current GDP infrastructure is undergoing a lot of changes. Among other things, it includes a two-level communication structure. The lowest layer is a switching layer that switches packets (for the GDPinUDP tunneling protocol) from point to point; an initial version of this routing layer already exists. When it is presented with an unknown destination, the switching layer consults a higher-level Routing Information Base (RIB), which is updated as names appear and disappear. In some sense, the RIB incorporates functions similar to DNS (locating names) and OSPF or IS-IS (determining paths). The job of this "location resolution process" is to place an optimized path into the switching layer from the client to the DataCapsule.
    3. We view the process described above as "caching routes for active conversations in the switching layer." Inactive conversations are removed (expire) from the switching layer. We need a new implementation for the RIB (which we informally call the "Black box"). This work will definitely lead to an interesting, publishable result. Note, also, that there are several different possible projects related to the RIB, including a globally scalable option (this project), a secure multicast tree (project 1 above), or a QoS-aware router (the next project).
    4. The desired architecture for the RIB is a hierarchy (tree) with nodes corresponding to Administrative Management Domains, which we call "Trust Domains" (TDs). Each TD knows about the names managed in that domain, any broadly advertised names from lower in the hierarchy, and about a "parent" TD to be consulted for other names. Updates to one TD are only passed to the parent if the name is intended to be more broadly known, allowing for local-only names. The top of the hierarchy (the root of the tree) is expected to hold all globally known names (and no local names at all). Our current implementation supports only a single TD (so all names are global), and that implementation is centralized and doesn't scale well. This project will move us toward the ultimate goal (a sketch of the hierarchical lookup appears below).
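    The following toy sketch illustrates the hierarchical resolution described in point 4; the TrustDomainRIB class and its methods are invented for illustration and are not part of the current GDP code.

      # Hypothetical hierarchical RIB: resolve locally, else escalate to the parent TD.
      class TrustDomainRIB:
          def __init__(self, parent=None):
              self.parent = parent
              self.routes = {}                  # 256-bit name (bytes) -> location or child TD to ask next

          def advertise(self, name, location, globally_visible=False):
              self.routes[name] = location
              if globally_visible and self.parent is not None:
                  self.parent.advertise(name, self, globally_visible=True)   # local-only names never propagate

          def resolve(self, name):
              if name in self.routes:
                  return self.routes[name]
              if self.parent is not None:
                  return self.parent.resolve(name)   # escalate toward the global root
              return None                            # unknown everywhere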
  9. Quality of Service for the Global Data Plane
    There is significant interest in utilizing the GDP/DataCapsule infrastructure for real-time control systems at the edge of the network. This project would investigate how to add QoS constraints (e.g., max latency, min throughput, low variation in behavior) to an application utilizing the GDP infrastructure. In particular, it would provide an API for asserting QoS constraints and mechanisms within the underlying GDP network to guarantee such constraints (one possible API shape is sketched below). QoS could be guaranteed by adjusting the mapping in underlying switches (see above). It could also investigate migrating the location of DataCapsules to achieve desired constraints. This project might touch many levels of the existing GDP network. The project should investigate a range of workloads and mechanisms for achieving QoS, possibly modifying the switches and RIBs. One possible modification, for instance, would be to investigate fair queueing and/or real-time scheduling in the switching infrastructure. Another would be to forward messages directly from the writer to both DataCapsules and readers (subscribers) in parallel to achieve lower latency. An investigation of the resulting semantics from the perspective of clients might be in order here.
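    As a starting point for discussion, one possible (purely hypothetical) shape for such an API is sketched below; the gdp.open and gdp.admit calls do not exist today and stand in for whatever admission-control mechanism the project designs.

      # Hypothetical QoS API: the application states constraints and the GDP admits or rejects the flow.
      from dataclasses import dataclass

      @dataclass
      class QoSConstraint:
          max_latency_ms: float        # end-to-end bound per write/read
          min_throughput_mbps: float   # sustained rate the path must support
          max_jitter_ms: float         # bound on variation in latency

      def open_capsule_with_qos(gdp, capsule_name, qos: QoSConstraint):
          session = gdp.open(capsule_name)          # hypothetical GDP client call
          if not gdp.admit(session, qos):           # admission control may re-route or migrate capsules
              raise RuntimeError("QoS constraints cannot be guaranteed on any current path")
          return session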

OTHER RISELAB Projects:
Suggested by various folks. Ask Stephanie Wang (swang@cs.berkeley.edu) for more details:

    1. Fuzz testing distributed invariants in Ray
      Safety invariants are key to reasoning about the correctness of a distributed protocol. However, even if a protocol is proven to be correct, the implementation can still have bugs that violate the invariant. The goal of this project is to build a testing framework for a distributed system that can find critical implementation bugs while keeping overhead low (to avoid interfering with normal execution). A good use case to start with is checking distributed memory safety in Ray, an open source general purpose platform for distributed execution.
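      As a toy illustration of the harness shape (not of Ray's internal memory-safety invariants), the sketch below randomly kills an actor during a workload and checks a simple application-level invariant afterwards; a real framework would hook into Ray itself and check system-level invariants such as distributed reference counts.

        # Toy fuzzing harness: inject actor failures and check an application invariant.
        import random
        import ray
        from ray.exceptions import RayActorError

        ray.init()

        @ray.remote
        class Counter:
            def __init__(self):
                self.value = 0
            def incr(self):
                self.value += 1
                return self.value
            def get(self):
                return self.value

        def fuzz_round(num_increments=100, kill_probability=0.05):
            counter = Counter.remote()
            completed = 0
            for _ in range(num_increments):
                if random.random() < kill_probability:
                    ray.kill(counter)                      # inject a failure mid-run
                try:
                    ray.get(counter.incr.remote())
                    completed += 1
                except RayActorError:
                    counter = Counter.remote()             # naive recovery, just for the harness
            # Invariant for this toy workload: the surviving counter never exceeds
            # the number of increments the harness observed as completed.
            assert ray.get(counter.get.remote()) <= completed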
    2. Benchmarking and optimizing data storage in Ray
      Ray is an open-source general-purpose platform for distributed execution. For performance, Ray implements a distributed memory layer that allows tasks to reference memory on other nodes. The data may be stored in various ways: process heap, shared memory, local disk, S3, etc. The goal of this project is to measure the impact and tradeoffs of this choice on some representative applications. A followup goal is to dynamically and automatically choose the best data placement based on application characteristics.
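      A minimal starting point for the measurement side might look like the micro-benchmark below, which times round trips through Ray's object store for a few object sizes; a real study would additionally control where each object is placed (worker heap, shared memory, local disk, S3) and use representative applications.

        # Micro-benchmark: time ray.put plus a task that fetches the object by reference.
        import time
        import numpy as np
        import ray

        ray.init()

        @ray.remote
        def consume(array):
            return array.sum()          # forces the task to fetch the object

        for size_mb in [1, 10, 100]:
            data = np.zeros(size_mb * 1024 * 1024 // 8)   # float64 array of roughly size_mb MB
            start = time.time()
            ref = ray.put(data)                           # store in the object store
            ray.get(consume.remote(ref))                  # pass by reference to a task
            print(f"{size_mb} MB round trip: {time.time() - start:.3f} s")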
    3. Online job reconfiguration in ERDOS
      ERDOS is an open-source streaming system for self-driving car and robotics applications. ERDOS programs (jobs) are structured as a directed graph, where data flows between operators which are responsible for processing. Currently, this graph is static, meaning that it is difficult to change application behavior at runtime. The goal of this project is to investigate how we can support changes to ERDOS programs at runtime, and measure the impact of this feature on several applications.
    4. Reducing storage footprint under data versioning in mltrace
      mltrace is an open-source tool that brings observability to ML pipelines (DAGs) through logging, data tracing, and debugging support. Users log inputs and outputs at every stage of the pipeline, which can contribute to a massive storage overhead because all versions of data & artifacts are saved. The goal of this project is to explore data versioning techniques to reduce mltrace's storage footprint. This project is fairly open-ended -- students can optimize model versioning & storage, dataframe versioning & storage, etc.
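      One simple storage-reduction idea is content addressing, so that artifacts that did not change between pipeline runs are stored only once. The sketch below is generic Python, not mltrace's actual API.

        # Content-addressed artifact store: identical versions are deduplicated by hash.
        import hashlib
        import os
        import pickle

        def save_artifact(store_dir, obj):
            blob = pickle.dumps(obj)
            digest = hashlib.sha256(blob).hexdigest()
            path = os.path.join(store_dir, digest)
            if not os.path.exists(path):          # unchanged artifacts are stored only once
                with open(path, "wb") as f:
                    f.write(blob)
            return digest                          # log only the hash for each pipeline run

        def load_artifact(store_dir, digest):
            with open(os.path.join(store_dir, digest), "rb") as f:
                return pickle.loads(f.read())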
    5. Distributed data loader for large-scale machine learning in Parax
      Parax is a JIT compiler for distributed tensor computation and large-scale neural network training. Parax is built on top of JAX and XLA. Parax can compile a tensor computational graph, generate a parallelization strategy and run it on a Ray GPU cluster. Parax searches for the best parallelization strategy in a comprehensive search space that combines intra-operator parallelism and inter-operator parallelism. Parax can automatically find and compose complicated strategies such as data-parallel, tensor partitioning and pipeline parallel. Currently, Parax lacks an efficient distributed data loader, making data-loading the bottleneck of training. In this project, you are going to design and implement a flexible and efficient distributed data loader for Parax.
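      A distributed data loader for data-parallel training typically gives each worker a disjoint shard and overlaps reading the next batch with computation on the current one. The sketch below illustrates that pattern in plain Python; it is not Parax's interface, and read_sample stands in for whatever per-example I/O the workload needs.

        # Sharded, prefetching loader sketch: strided shards per worker, one batch of lookahead.
        import itertools
        from concurrent.futures import ThreadPoolExecutor

        def shard_indices(num_samples, rank, world_size):
            return range(rank, num_samples, world_size)      # strided sharding across workers

        def batches(read_sample, indices, batch_size):
            it = iter(indices)
            while True:
                idx = list(itertools.islice(it, batch_size))
                if not idx:
                    return
                yield [read_sample(i) for i in idx]

        def prefetching_loader(read_sample, num_samples, rank, world_size, batch_size):
            gen = batches(read_sample, shard_indices(num_samples, rank, world_size), batch_size)
            with ThreadPoolExecutor(max_workers=1) as pool:
                future = pool.submit(next, gen, None)
                while True:
                    batch = future.result()
                    if batch is None:
                        return
                    future = pool.submit(next, gen, None)     # overlap I/O with compute
                    yield batch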
    6. Cross mesh collective communication in Parax
      Parax is a JIT compiler for distributed tensor computation and large-scale neural network training. Parax is built on top of JAX and XLA. Parax can compile a tensor computational graph, generate a parallelization strategy and run it on a Ray GPU cluster. Parax searches for the best parallelization strategy in a comprehensive search space that combines intra-operator parallelism and inter-operator parallelism. Parax can automatically find and compose complicated strategies such as data-parallel, tensor partitioning and pipeline parallel. One performance bottleneck of Parax is handling the communication of distributed tensors across device meshes. The goal of this project is to design and implement an efficient collective communication library that supports the communication of distributed tensors across device meshes.
    7. Run-time statistics collection for Modin
      Modin is a scalable dataframe system that acts as a drop-in replacement for pandas -- and is installed over 50,000 times a week by data scientists who are looking to address their scalability bottlenecks. There are lots of remaining challenges in adapting systems techniques to make pandas query execution even faster. One approach to reduce latencies further is to collect statistics about the underlying dataframes so that Modin can find the best way of optimizing execution. Examples of such statistics could be the data distribution and the number of distinct values for each column/row. However, collecting statistics could be time-consuming and prolong the response time of normal dataframe execution. This project studies how to collect statistics at run-time without slowing down execution.
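      A minimal illustration of run-time collection (generic pandas, not Modin internals) is below: statistics are gathered from results an operation has already materialized, and a real design would have to sample or otherwise bound the extra cost so queries are not slowed down.

        # Sketch: wrap an operation so per-column statistics are recorded from its result.
        import pandas as pd

        class StatsCache:
            def __init__(self):
                self.distinct_counts = {}   # column -> number of distinct values

            def record(self, frame: pd.DataFrame):
                # Exact nunique is shown for simplicity; a real system would sample or sketch.
                for col in frame.columns:
                    self.distinct_counts[col] = frame[col].nunique()

        stats = StatsCache()

        def apply_with_stats(frame: pd.DataFrame, op):
            result = op(frame)
            if isinstance(result, pd.DataFrame):
                stats.record(result)        # collected from data already in memory
            return result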
    8. Replacing transactions with distributed objects
      While general-purpose transactions offer flexibility to application developers, they can also be harmful for the system to support if programmers abuse transactional APIs (e.g., running large transactions over objects that don’t need to be modified together). This project seeks to explore whether we can abstract transactions away behind a distributed object API to restrict how users can access the underlying database. Safeguarding the transaction mechanism through a programming abstraction can affect workload behavior, and the goal of this project would be to figure out what abstraction provides enough functionality for developers while also allowing for optimized system behavior.
    9. Mixing isolation levels
      There is a growing range of applications supported by databases, and these apps often have workloads that have very different consistency requirements. Ideally, the database could support mixed transactional isolation levels with high performance. This project aims to explore what it means to mix isolation levels and how to run concurrency control over them (with performance isolation). In particular, evaluating how system behavior changes with the addition of stronger isolation models and how to limit their impact would be important towards understanding how various guarantees can be supported together.
    10. Instrumentation of ML Framework Compilers
      PyTorch, TensorFlow, and JAX are popular frameworks for model training in Python. These frameworks were designed around the particular semantics of machine learning rather than around specialized hardware or high-performance architectures. Compilers such as NVCC, XLA, and PyTorch Glow bridge that gap by lowering the high-level program into efficiently executable code. For this project, you will write software to instrument these compilers. The objective is to transparently intercept the generated Intermediate Representation, together with other compile-time data structures, without significant runtime overhead. The compile-time execution data you intercept will be used by the FLOR project for model version control: specifically, to enable comparing or diffing versions of model training.
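      For the JAX case, one concrete hook is jax.make_jaxpr, which exposes the traced intermediate representation of a function; a tool like FLOR could log this text per version of a training script and diff it. Intercepting the lower-level XLA IR, or the analogous structures in PyTorch and TensorFlow, would require framework-specific hooks not shown here.

        # Capture the jaxpr IR of a (toy) training step for later version-to-version diffing.
        import jax
        import jax.numpy as jnp

        def train_step(w, x):
            return jnp.tanh(w @ x).sum()

        ir = jax.make_jaxpr(train_step)(jnp.ones((4, 4)), jnp.ones((4,)))
        print(ir)            # storing this text alongside the model version enables diffs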
    11. Co-scheduling Feature Updates and Queries for Feature Stores
      RALF is a system for maintaining data in feature stores, which store pre-computed features used for ML model training and inference. RALF is responsible for processing new data to compute updates to feature tables, which are queried by downstream models. Currently, RALF only computes updates eagerly. The goal of this project is to add support for lazy computation in RALF, so that features are computed when queried while still meeting a specified deadline.
    12. Automated Cache Hierarchy for Feature Stores
      RALF is a system for maintaining data in feature stores, which store pre-computed features used for ML model training and inference. RALF currently assumes all features fit in memory. This project will add support for RALF to spill features into durable storage and keep only frequently queried features in memory.
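      The sketch below illustrates the basic spill policy with a hypothetical table class (not RALF's API): recently queried features stay in an in-memory LRU structure, and everything else is written to durable storage and faulted back in on demand.

        # Hypothetical feature table with LRU eviction to durable storage.
        import os
        import pickle
        from collections import OrderedDict

        class SpillingFeatureTable:
            def __init__(self, spill_dir, max_in_memory=10000):
                self.hot = OrderedDict()            # key -> feature, in LRU order
                self.spill_dir = spill_dir
                self.max_in_memory = max_in_memory

            def _spill_path(self, key):
                return os.path.join(self.spill_dir, f"{key}.pkl")

            def put(self, key, feature):
                self.hot[key] = feature
                self.hot.move_to_end(key)
                if len(self.hot) > self.max_in_memory:
                    cold_key, cold_val = self.hot.popitem(last=False)   # evict least recently used
                    with open(self._spill_path(cold_key), "wb") as f:
                        pickle.dump(cold_val, f)

            def get(self, key):
                if key in self.hot:
                    self.hot.move_to_end(key)
                    return self.hot[key]
                with open(self._spill_path(key), "rb") as f:            # fault in from durable storage
                    feature = pickle.load(f)
                self.put(key, feature)
                return feature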
    13. Computational planning for secure multi-party computation
      Piranha is a GPU-based platform, implemented in C++, that provides a generic interface for accelerating multi-party computation (MPC). Piranha supports privacy-preserving machine learning that significantly speeds up training times compared to the same MPC protocols executing on a CPU. The system is limited by its static design choices: code executes only on a single GPU, and each protocol uses a fixed-point integer representation with a static precision; if the precision is set incorrectly, computation will overflow. The project would involve creating a computational planner that could distribute computation locally across GPUs and/or adjust precision as needed based on the phase of computation.
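      To see why per-phase precision matters, the small example below encodes reals as 32-bit fixed-point integers: with 13 fractional bits, values above roughly 2^18 overflow while very small values round to zero, which is exactly the tradeoff a computational planner would adjust between phases. The helper functions are illustrative only (Piranha itself is C++).

        # Fixed-point encode/decode: integer bits vs. fractional bits trade range for accuracy.
        def to_fixed(x, frac_bits, total_bits=32):
            scaled = int(round(x * (1 << frac_bits)))
            limit = 1 << (total_bits - 1)
            if not -limit <= scaled < limit:
                raise OverflowError(f"{x} does not fit with {frac_bits} fractional bits")
            return scaled

        def from_fixed(v, frac_bits):
            return v / (1 << frac_bits)

        # With 13 fractional bits in 32-bit words, values up to about 2**18 are representable.
        print(from_fixed(to_fixed(3.14159, 13), 13))      # ~3.1416
        print(from_fixed(to_fixed(1e-5, 13), 13))         # 0.0: precision lost to rounding
        # to_fixed(1e6, 13) would raise OverflowError: integer range exceeded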

 


Back to CS262a page
Maintained by John Kubiatowicz (kubitron@cs.berkeley.edu).
Last modified Friday, 9/21/2021