I am Professor Emeritus and a Professor of the Graduate School in the Computer Science Division of the EECS Department at the University of California, Berkeley. My main research areas are computer architecture, VLSI design, parallel programming, and operating system design. I am Co-Director of the SLICE lab, which is improving the development process and deployed efficiency of SpeciaLIzed Computing Ecosystems. I am also an Associate Director at the Berkeley Wireless Research Center.

I am a co-founder, Chief Architect, and SVP Global Engineering at SiFive. I am Chief Architect at RISC-V International.

Short bio here.

I am no longer taking on new students as their primary advisor.

Active Research Projects

Chipyard (2019-)

Chipyard is a framework for designing and evaluating full systems-on-chip (SoCs), and supports an agile hardware design approach. Chipyard comprises a collection of tools and IP libraries, with a tightly integrated flow supporting both open-source and commercial tools for the development and evaluation of SoCs. Among other components, Chipyard integrates Rocket and BOOM RISC-V core generators, Gemmini neural-net accelerator generators, and the FireSim FPGA-accelerated simulation environment.

Gemmini (2019-)

The Gemmini project is developing a Chisel-based generator of programmable systolic-array-based machine-learning accelerators to explore the design space, particularly for edge and mobile SoCs.
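The systolic-array style of computation that Gemmini generators target can be illustrated with a small software model. This is a hedged sketch in Python of the generic output-stationary dataflow, not Gemmini's actual RTL, API, or schedule: each processing element (PE) holds one accumulator while skewed operands arrive from its neighbors.

```python
def systolic_matmul(A, B):
    """Toy model of an output-stationary systolic array computing C = A @ B.

    PE(i, j) holds the partial sum for C[i][j]. With the classic input
    skewing, operand pair (A[i][k], B[k][j]) reaches PE(i, j) at cycle
    i + j + k; we replay that schedule cycle by cycle. Illustrative only;
    real Gemmini instances pipeline data between neighboring PEs.
    """
    n, m, k_dim = len(A), len(B[0]), len(B)
    acc = [[0] * m for _ in range(n)]   # one accumulator per PE
    last_cycle = n + m + k_dim - 3      # cycle at which the last MAC fires
    for cycle in range(last_cycle + 1):
        for i in range(n):
            for j in range(m):
                k = cycle - i - j       # operand index arriving at PE(i, j)
                if 0 <= k < k_dim:
                    acc[i][j] += A[i][k] * B[k][j]
    return acc
```

The skew means each PE performs at most one multiply-accumulate per cycle, which is why an N×N array sustains N² MACs per cycle once the pipeline fills.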

Keystone (2017-)

Keystone is an open-source project for building customizable trusted execution environments (TEEs) based on RISC-V for various platforms and use cases. Our goal is to enable trustworthy open-source secure hardware enclaves, which can be applied to a wide range of applications and devices.

FireSim (2017-)

FireSim is an open-source, cycle-accurate, FPGA-accelerated full-system hardware simulation platform that runs on cloud FPGAs (Amazon EC2 F1). FireSim follows the earlier RAMP and DIABLO projects but, unlike those projects, automatically generates the FPGA models from design RTL. FireSim was originally developed for the FireBox project to allow full-scale simulation of datacenter clusters at the RTL level using FPGAs. FireSim can simulate arbitrary hardware designs written in Chisel, or designs that can be transformed into FIRRTL (including Verilog designs via Yosys's Verilog-to-FIRRTL flow). With FireSim, you can write your own RTL (processors, accelerators, etc.) and run at near-FPGA-prototype speeds on cloud FPGAs, while obtaining cycle-exact performance results matching a fabricated version of the design. Depending on the hardware design and the simulation scale, FireSim simulations run at 10s to 100s of MHz. Custom software models can also be added for components that don't need or want to be in RTL.

FireBox (2013-)

The FireBox project is developing a system architecture for third-generation Warehouse-Scale Computers (WSCs). The original FireBox vision was to scale up to a ~1 megawatt WSC containing up to 10,000 compute nodes and up to an exabyte (2^60 bytes) of non-volatile memory connected via a low-latency, high-bandwidth optical switch. Each compute node contains a System-on-a-Chip (SoC) with around 100 cores connected to high-bandwidth on-package DRAM. Fast SoC network interfaces reduce the software overhead of communicating between application services, and high-radix network backplane switches connected by Terabit/sec optical fibers reduce the network's contribution to tail latency. The very large non-volatile store directly supports in-memory databases, and pervasive encryption ensures that data is always protected in transit and in storage. FireBox is being developed in the Berkeley ADEPT and RISE labs. Recent work has focused on specialized accelerators for datacenter applications.

RISC-V Instruction Set Architecture (2010-)

RISC-V is a free and open instruction set architecture (ISA) developed at UC Berkeley, initially designed for research and education but now increasingly used for commercial designs. A full set of open-source software tools is available, as well as several open-source processor implementations. RISC-V was initially developed as part of the Par Lab, and work continued in ASPIRE and now in ADEPT. The RISC-V ISA standard is now managed by RISC-V International.

Constructing Hardware in a Scala Embedded Language (2010-)

Chisel is an open-source hardware construction language developed at UC Berkeley that supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages. Chisel is embedded in the Scala programming language, which raises the level of hardware design abstraction by providing concepts including object orientation, functional programming, parameterized types, and type inference. In the move to Chisel version 3, a separate intermediate representation, FIRRTL, was developed to support a wide range of circuit transforms. Chisel was originally developed in the DoE Project Isis and the Par Lab, and development continued in ASPIRE and ADEPT. Chisel/FIRRTL is now maintained by the Chisel Working Group within the CHIPS Alliance.
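The core generator idea, a program that elaborates a parameterized description into a concrete circuit, can be caricatured in a few lines. Chisel generators are Scala programs that build a circuit graph and hand it to FIRRTL; the Python string-template toy below is purely illustrative of parameterization and is not Chisel's API or its output format.

```python
def adder_module(width):
    """Toy 'generator': elaborate a width-parameterized adder to Verilog text.

    Illustrative only -- a real Chisel generator is a Scala class whose
    constructor parameters configure an in-memory circuit, not a string
    template. The point is that one parameterized program yields a whole
    family of hardware modules.
    """
    return (
        f"module adder_{width} (\n"
        f"  input  [{width - 1}:0] a, b,\n"
        f"  output [{width}:0]     sum\n"
        f");\n"
        f"  assign sum = a + b;\n"
        f"endmodule\n"
    )

# One generator, many instances: adder_module(8), adder_module(64), ...
```

Because the generator is ordinary code, it can compute derived parameters (here the carry-out width) instead of requiring the designer to restate them per instance.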

Earlier Projects at UC Berkeley

The ADEPT Lab (2016-2021)

The goal of the ADEPT lab was to dramatically improve computing capability by reducing the cost and risk of designing custom silicon for new application areas. The integrated 5-year research mission cut across applications, programming systems, architecture, and hardware design and verification methodologies.

Monolithically Integrated CMOS Photonics (2006-2019)

In a collaboration with MIT, the University of Colorado at Boulder, and Micron Technology, we explored the use of silicon photonics to provide high-bandwidth, energy-efficient links between processors and memory. A Nature publication describes how we used this technology to build a RISC-V microprocessor that communicates directly with light. Integrated photonics is a key component of the FireBox project, where it will be used to construct warehouse-scale computers. The technology is being commercialized at Ayar Labs.

Resiliency for Extreme Energy Efficiency (2010-2018)

Most manycore hardware designs have the potential to achieve maximum energy efficiency when operated across a broad range of supply voltages, spanning from nominal down to near the transistor threshold. As part of ASPIRE, we worked on new circuit and architectural techniques to enable parallel processors to work across a broad supply range while tolerating technology variability and providing immunity to soft and hard errors. We built several prototype resilient microprocessors, codenamed Raven, in 28nm FDSOI technology, as well as a resilient out-of-order processor, BROOM.

Graph Algorithm Platform (2010-2017)

Graph algorithms are becoming increasingly important, from warehouse-scale computers reasoning about vast amounts of data for analytics and recommendation applications to mobile clients running recognition and machine-learning applications. Unfortunately, graph algorithms execute inefficiently on current platforms, whether shared-memory systems or distributed clusters. The Berkeley Graph Algorithm Platform (GAP) project spanned the entire stack, aiming to accelerate graph algorithms through software optimization and hardware acceleration. GAP began in the Par Lab and continued in ASPIRE.

The ASPIRE Lab (2012-2017)

ASPIRE was a 5-year research project that recognized the shift from transistor-scaling-driven performance improvements to a new post-scaling world where whole-stack co-design is the key to improved efficiency. Building on the success of the Par Lab project, it explored deep hardware and software co-tuning to achieve the highest possible performance and energy efficiency for future warehouse-scale and mobile computing systems.

Hwacha Vector-Fetch Processor (2012-2017)

The Hwacha project developed a new vector-fetch architecture to improve the energy efficiency of data-parallel accelerators, following on from earlier work on Scale and Maven. Versions of the Hwacha vector accelerator have been taped out several times as RISC-V Rocket coprocessors in both 28nm and 45nm nodes, running at up to 1.5 GHz. Hwacha was a project in the ASPIRE lab.

A Liquid Thread Environment (2008-2015)

Applications built by composing different parallel libraries perform poorly when those libraries interfere with one another by obliviously using the same physical cores, leading to destructive resource oversubscription. Lithe is a low-level substrate that provides basic primitives and a standard interface for composing parallel libraries efficiently; it can be inserted underneath the runtimes of legacy parallel libraries, such as TBB and OpenMP, to provide bolt-on composability without changes to existing application code. Lithe was initially developed in the Par Lab, continued as part of DEGAS, and is available as an open-source project.

DIABLO: Datacenter-In-A-Box at LOw Cost (2007-2015)

DIABLO is a wind tunnel for datacenter research, simulating O(10,000) datacenter servers and O(1,000) switches for O(100) seconds. DIABLO is built with FPGAs and executes real instructions and moves real bytes, while running the full Linux operating system and unmodified datacenter software stacks on each simulated server. DIABLO has successfully reproduced some real-life datacenter phenomena, such as the memcached request latency long tail at large scales. DIABLO was initially developed in the RAMP project and continued in ASPIRE. The next generation of Berkeley datacenter simulators is being developed as part of the FireBox project.

DHOSA: Defending Against Hostile Operating Systems (2009-2014)

The DHOSA research project focused on building systems that remain secure even when the operating system is compromised or hostile. DHOSA was a collaborative effort among researchers from Harvard, Stony Brook, UC Berkeley, the University of Illinois at Urbana-Champaign, and the University of Virginia.

Par Lab: The Parallel Computing Laboratory (2008-2013)

With the end of sequential processor performance scaling, multicore processors provide the only path to increased performance and energy efficiency in all platforms from mobile to warehouse-scale computers. The Par Lab was created by a team of Berkeley researchers with the ambitious goal of enabling "most programmers to be productive writing efficient, correct, portable SW for 100+ cores & scale as cores increase every 2 years".

The Maven Vector-Thread Architecture (2007-2013)

Based on our experiences designing, implementing, and evaluating the Scale vector-thread architecture, we identified three primary directions for improvement to simplify both the hardware and software aspects of the VT architectural design pattern: (1) a unified VT instruction set architecture; (2) a VT microarchitecture more closely based on the vector-SIMD pattern; and (3) an explicitly data-parallel VT programming methodology. These ideas formed the foundation for the Maven VT architecture.

Tessellation OS

Tessellation is a manycore OS targeted at the resource management challenges of emerging client devices. Tessellation is built on two central ideas: Space-Time Partitioning and Two-Level Scheduling. Tessellation was initially developed within Par Lab and is now part of the Swarm Lab.

RAMP: Research Accelerator for Multi-Processors (2005-2010)

The RAMP project was a multi-university collaboration to develop new techniques for efficient FPGA-based emulation of novel parallel architectures, thereby overcoming the multicore simulation bottlenecks facing computer architecture researchers. At Berkeley, prototypes included the 1,008-processor RAMP Blue system and the RAMP Gold manycore emulator, as well as the follow-on DIABLO datacenter emulator.


RAMP Gold is an FPGA-based emulator for SPARC V8 manycore processors providing a high-throughput, cycle-accurate full-system simulator capable of booting real operating systems. RAMP Gold models target-system timing and functionality separately, and employs host-multithreading for an efficient FPGA implementation. The RAMP Gold prototype runs on a single Xilinx Virtex-5 FPGA board and simulates a 64-core shared-memory target machine.
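Host-multithreading, the trick that lets one FPGA pipeline stand in for 64 target cores, can be sketched in software. This is a hedged toy in Python, not RAMP Gold's actual microarchitecture: the "host pipeline" round-robins over per-core contexts, advancing one target core's state per host cycle.

```python
def host_multithreaded_run(cores, host_cycles):
    """Toy model of host-multithreading: a single host execution pipeline
    interleaves many target-core contexts, advancing one target core by
    one target cycle per host cycle. Each context here is just a cycle
    counter; a real emulator would hold full per-core architectural state.
    """
    n = len(cores)
    for hc in range(host_cycles):
        core = cores[hc % n]    # select the next target context in turn
        core["cycles"] += 1     # advance that target core by one cycle
    return cores

# 64 simulated cores share one host pipeline; after 6400 host cycles,
# every target core has advanced by exactly 100 target cycles.
cores = [{"id": i, "cycles": 0} for i in range(64)]
host_multithreaded_run(cores, 6400)
```

The appeal on an FPGA is that one deeply pipelined host core keeps its functional units busy with independent target threads, so simulated core count scales without replicating the pipeline.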


RAMP Blue was the first large-scale RAMP system, built as a demonstrator of the ideas. The system models a cluster of up to 1,008 MicroBlaze cores implemented using up to 84 Virtex-II Pro FPGAs on up to 21 BEE2 boards. The software infrastructure consists of GCC, uClinux, and the UPC parallel language and runtime, and the prototype can run off-the-shelf scientific applications.

Earlier Projects from the MIT SCALE Group (1998-2007)

The Scale Vector-Thread Microprocessor

The Scale microprocessor introduced a new architectural paradigm, vector-threading, which combines the benefits of vector and threaded execution. The vector-thread unit can smoothly morph its control structure from vector-style to threaded-style execution.

Transactional Memory

In many dynamic thread-parallel applications, lock management is the source of much programming complexity as well as space and time overhead. We investigated practical microarchitectures for implementing transactional memory, which provides a superior solution for atomicity that is much simpler to program than locks, while also reducing space and time overheads.
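The programming-model contrast can be sketched in software. This is a hedged Python caricature of optimistic atomicity, not the hardware microarchitectures the project studied: a transaction reads a snapshot, computes without holding any application lock, and commits only if no conflicting write intervened, retrying otherwise.

```python
import threading

class VersionedCell:
    """A single shared value with a version counter -- a minimal software
    caricature of a transaction. Hardware transactional memory detects
    conflicts via the cache coherence protocol instead of an explicit
    version check, but the abort-and-retry shape is the same.
    """
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # guards only the brief commit

    def atomic_update(self, fn):
        while True:
            v0, snapshot = self.version, self.value  # speculative read
            new_value = fn(snapshot)                 # compute outside any lock
            with self._commit_lock:
                if self.version == v0:               # no conflicting writer
                    self.value = new_value
                    self.version += 1
                    return new_value
            # conflict detected: the transaction aborts and retries

cell = VersionedCell(0)
for _ in range(100):
    cell.atomic_update(lambda x: x + 1)
```

Note that the programmer supplies only the computation; conflict detection and retry live in the substrate, which is the simplification transactional memory promises over hand-placed locks.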

Low-power Microprocessor Design

We developed techniques that combine new circuit designs and microarchitectural algorithms to reduce both switching and leakage power in components that dominate energy consumption, including flip-flops, caches, datapaths, and register files.

Energy-Exposed Instruction Sets

Modern ISAs such as RISC or VLIW expose to software only those properties of the implementation that affect performance. In this project we developed new energy-exposed hardware-software interfaces that also allow software fine-grain control over energy consumption.

Mondriaan Memory Protection

Mondriaan memory protection (MMP) is a fine-grained protection scheme that allows multiple protection domains to flexibly share memory and export protected services. In contrast to earlier page-based systems, MMP allows arbitrary permissions control at the granularity of individual words.
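The word-granularity idea can be sketched with a toy permission table. This is a hedged Python illustration of the abstraction only; real MMP uses compressed permission tables walked by hardware and cached in a protection lookaside buffer, and the permission encoding below is my own simplification.

```python
# Toy word-granularity permission check in the spirit of MMP.
# Assumption for this sketch: permissions are ordered, so WRITE implies READ.
NONE, READ, WRITE = 0, 1, 2

class PermissionTable:
    """One protection domain's view: a map from word address to permission.
    Real MMP compresses runs of identical permissions; a dict keeps the
    sketch short while preserving the key property -- arbitrary permission
    boundaries at individual-word granularity, not page granularity.
    """
    def __init__(self):
        self.perms = {}

    def grant(self, start, nwords, perm):
        for addr in range(start, start + nwords):
            self.perms[addr] = perm

    def check(self, addr, need):
        return self.perms.get(addr, NONE) >= need

table = PermissionTable()
table.grant(0x1000, 4, READ)    # four words exported read-only
table.grant(0x1002, 1, WRITE)   # one word in the middle is also writable
```

A page-based scheme could not express this layout without copying or padding: the writable word sits inside a read-only region smaller than any page.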

Highly Parallel Memory Systems

We investigated techniques for building high-performance, low-power memory subsystems for highly parallel architectures.

Mobile Computing Systems

Within the context of MIT Project Oxygen, several projects examined the energy and performance of complete mobile wireless systems.

Heads and Tails: Efficient Variable-Length Instruction Encoding

Existing variable-length instruction formats provide higher code densities than fixed-length formats, but are ill-suited to pipelined or parallel instruction fetch and decode. Heads-and-Tails is a new variable-length instruction format that supports parallel fetch and decode of multiple instructions per cycle, allowing both high code density and rapid execution for high-performance embedded processors.
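The split can be sketched with a toy packer. This is a simplified Python model under my own assumptions about field sizes (2-byte heads, 32-byte bundles); the actual HAT encoding differs in detail. Fixed-length heads pack from the front of a bundle, so the fetch unit can locate instruction i's head at a known offset without scanning, while the variable-length tails pack from the back.

```python
def pack_bundle(instrs, bundle_bytes=32, head_bytes=2):
    """Pack (head, tail) pairs into one fixed-size bundle: fixed-size heads
    grow from the front, variable-size tails grow from the back.
    Field widths here are illustrative assumptions, not the HAT spec.
    """
    bundle = bytearray(bundle_bytes)
    tail_end = bundle_bytes
    for i, (head, tail) in enumerate(instrs):
        assert len(head) == head_bytes
        bundle[i * head_bytes:(i + 1) * head_bytes] = head
        tail_end -= len(tail)
        bundle[tail_end:tail_end + len(tail)] = tail
        assert (i + 1) * head_bytes <= tail_end, "bundle overflow"
    return bytes(bundle)

def head_of(bundle, i, head_bytes=2):
    # Because heads are fixed-size, instruction i's head sits at a known
    # offset, so several heads can be fetched and decoded in parallel.
    return bundle[i * head_bytes:(i + 1) * head_bytes]
```

The density win comes from the variable tails; the decode-parallelism win comes from never needing the length of instruction i-1 to find instruction i.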

Early Projects

IRAM: Intelligent RAM (1997-2002)

The Berkeley IRAM project sought to understand the entire spectrum of issues involved in designing general-purpose computer systems that integrate a processor and DRAM onto a single chip, from circuits, VLSI design, and architectures to compilers and operating systems.

PHiPAC: Portable High-Performance ANSI C (1994-1997)

PHiPAC was the first autotuning project, automatically generating a high-performance general matrix-multiply (GEMM) routine by using parameterized code generators and empirical search to produce fast code for any platform. Autotuners are now standard in high-performance library development.
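The autotuning recipe, generate parameterized code variants, time each on the actual machine, and keep the fastest, can be sketched in a few lines. This is a hedged Python toy over a blocked matrix multiply; PHiPAC itself generated and benchmarked ANSI C, and searched a much richer parameter space than the single block size shown here.

```python
import random
import time

def blocked_matmul(A, B, bs):
    """Blocked GEMM over square lists-of-lists; bs is the tuning parameter."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

def autotune(n=64, candidates=(4, 8, 16, 32)):
    """Empirical search: time each candidate block size on this machine
    and return the fastest -- the essence of the autotuning approach."""
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    def bench(bs):
        t0 = time.perf_counter()
        blocked_matmul(A, B, bs)
        return time.perf_counter() - t0
    return min(candidates, key=bench)
```

The winning block size depends on the machine's cache hierarchy, which is exactly why empirical search beats a single hand-tuned constant across platforms.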

The T0 Vector Microprocessor (1992-1998)

T0 (Torrent-0) was the first single-chip vector microprocessor. T0 was designed for multimedia, human-interface, neural-network, and other digital signal processing tasks. T0 includes a MIPS-II-compatible 32-bit integer RISC core, a 1KB instruction cache, a high-performance fixed-point vector coprocessor, a 128-bit-wide external memory interface, and a byte-serial host interface. T0 formed the basis of the SPERT-II workstation accelerator.

SPACE: Symbolic Processing in Associative Computing Elements (1987-1992)

In the PADMAVATI prototype system, a hierarchy of packaging technologies cascades multiple SPACE chips to form an associative processor array with 170,496 36-bit processors. Primary applications for SPACE were AI algorithms that require fast searching and processing within large, rapidly changing data structures.

[Krste, headshot licensed as CC-BY-3.0 by copyright owner SiFive]
Professor Emeritus
Professor of the Graduate School
Computer Science Division
EECS Department
579 Soda Hall, MC #1776
University of California
Berkeley, CA 94720-1776
email: krste at berkeley dot edu
phone: 510-642-6506 (don't phone, use email!)
fax: 510-643-1534
Administrative Support:
Tami Chouteau
565 Soda Hall
phone: 510-643-4816
email: slice-admin at eecs dot berkeley dot edu

Ria Briggs
563 Soda Hall
phone: 510-643-1455
email: slice-admin at eecs dot berkeley dot edu
Grant Administrator:
Leslie Goldstein
765 Soda Hall
phone: 510-643-2469
email: lgolds at berkeley dot edu