Paper List for AI Systems

Survey

  • Training and Serving System of Foundation Models: A Comprehensive Survey

Training

System Architecture

  • NSDI ‘24 MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
  • NSDI ‘24 Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
  • SOSP ‘23 Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

Parallelism & Communication

  • NSDI ‘23 TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
  • NSDI ‘23 TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
  • NSDI ‘23 Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
  • SIGCOMM ‘24 Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
  • ASPLOS ‘23 Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
  • ASPLOS ‘23 In-Network Aggregation with Transport Transparency for Distributed Training
  • ASPLOS ‘24 Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
  • ASPLOS ‘24 AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning

Resource Management & Scheduling

  • NSDI ‘23 Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
  • NSDI ‘23 Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
  • NSDI ‘23 Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
  • SOSP ‘23 Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
  • ASPLOS ‘23 Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-demand VMs
  • ASPLOS ‘23 Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
  • ASPLOS ‘23 ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
  • ASPLOS ‘24 Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
  • OSDI ‘24 MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale

Job Analysis & Optimization

  • NSDI ‘24 Characterization of Large Language Model Development in the Datacenter
  • OSDI ‘24 When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling
  • NSDI ‘23 ModelKeeper: Accelerating DNN Training via Automated Training Warmup
  • NSDI ‘23 BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
  • SIGCOMM ‘23 Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

Failure Recovery

  • SOSP ‘23 GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints

Serving

System Architecture

  • NSDI ‘23 SHEPHERD: Serving DNNs in the Wild
  • OSDI ‘24 Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
  • OSDI ‘24 ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
  • OSDI ‘24 InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
  • OSDI ‘24 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
  • OSDI ‘24 Llumnix: Dynamic Scheduling for Large Language Model Serving

Resource Management & Optimization

  • OSDI ‘24 USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
  • OSDI ‘24 Fairness in Serving Large Language Models
  • OSDI ‘24 dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
  • ASPLOS ‘24 ExeGPT: Constraint-Aware Resource Allocator for LLM Inference
  • ASPLOS ‘24 Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
  • ASPLOS ‘24 SpotServe: Serving Generative Large Language Models on Preemptible Instances

Memory & Cache Management

  • NSDI ‘24 Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models
  • ATC ‘24 Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
  • SOSP ‘23 Efficient Memory Management for Large Language Model Serving with PagedAttention
  • SIGCOMM ‘24 CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving

Infrastructure

Networking

  • SIGCOMM ‘24 NetLLM: Adapting Large Language Models for Networking
  • SIGCOMM ‘24 Alibaba HPN: A Data Center Network for Large Language Model Training

Cloud & Serverless

  • ASPLOS ‘24 RainbowCake: Mitigating Cold-starts in Serverless with Layer-wise Container Caching and Sharing
  • ASPLOS ‘23 AQUATOPE: QoS-and-Uncertainty-Aware Resource Management for Multi-stage Serverless Workflows
  • ASPLOS ‘24 AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines

System Tools & Analysis

  • ASPLOS ‘24 Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples
  • ASPLOS ‘24 A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs
  • ASPLOS ‘24 DREAM: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads
  • ASPLOS ‘24 NDPipe: Exploiting Near-data Processing for Scalable Inference and Continuous Training in Photo Storage