Publications

  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
    OSDI 2022

    [Paper] [Code] [arXiv]

  • Rearchitecting In-Memory Object Stores for Low Latency
    VLDB 2022

    [Paper] [Code]

  • TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
    ICML 2021

    [Paper] [Code] [arXiv]

  • Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems
    SIGCOMM 2021

    [Paper] [Code] [arXiv]

  • Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
    ICML 2020

    [Paper] [arXiv] [Blog Post]

  • Fast Structured Decoding for Sequence Models
    NeurIPS 2019

    [Paper] [Code] [Poster] [arXiv]

  • Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
    NeurIPS 2019 Workshop on Machine Learning and the Physical Sciences
    ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations

    [Paper] [Code] [arXiv]

  • Hint-Based Training for Non-Autoregressive Machine Translation
    EMNLP 2019

    [Paper] [Code] [Slides] [Video] [arXiv]

  • Efficient Training of BERT by Progressively Stacking
    ICML 2019

    [Paper] [Code]

  • Towards Binary-Valued Gates for Robust LSTM Training
    ICML 2018

    [Paper] [Code] [Slides] [Video] [Poster] [arXiv] [Blog Post (Chinese)]

  • Reproducing Vectorization of the Tersoff Multi-Body Potential on the Intel Broadwell Architecture
    Parallel Computing (reproducibility challenge of the SC17 Student Cluster Competition)

    [Paper] [DOI]

  • ParConnect Reproducibility Report
    Parallel Computing (reproducibility challenge of the SC16 Student Cluster Competition)

    [Paper] [DOI]