# Paper List for AI Systems

## Survey
- Training and Serving System of Foundation Models: A Comprehensive Survey
## Training

### System Architecture
- NSDI ‘24 MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
- NSDI ‘24 Resiliency at Scale: Managing Google’s TPUv4 Machine Learning Supercomputer
- SOSP ‘23 Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates
### Parallelism & Communication
- NSDI ‘23 TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
- NSDI ‘23 TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
- NSDI ‘23 Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE
- SIGCOMM ‘24 Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem
- ASPLOS ‘23 Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
- ASPLOS ‘23 In-Network Aggregation with Transport Transparency for Distributed Training
- ASPLOS ‘24 Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- ASPLOS ‘24 AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
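Several of the papers above (e.g., the ASPLOS '23 decomposition work and Centauri) share one core idea: partition a large collective into chunks so that one chunk's communication overlaps with another chunk's computation. A minimal cost-model sketch of why this helps, assuming fixed per-chunk compute and communication times (all numbers illustrative, not from any of the papers):

```python
def makespan(n_chunks, t_compute, t_comm, overlap):
    """Total time to process n_chunks, where each chunk needs
    t_compute seconds of GPU work followed by t_comm seconds of
    communication (e.g., its slice of an all-reduce)."""
    if not overlap:
        # Serial schedule: every chunk computes, then communicates.
        return n_chunks * (t_compute + t_comm)
    # Pipelined schedule: chunk i's communication runs while chunk
    # i+1 computes. When t_comm <= t_compute, only the final
    # communication is exposed; otherwise the extra comm time of
    # each earlier chunk leaks into the critical path.
    exposed = max(t_comm - t_compute, 0.0)
    return n_chunks * t_compute + t_comm + (n_chunks - 1) * exposed
```

With equal compute and communication times, overlapping 4 chunks cuts the makespan from 8 units to 5; the benefit shrinks as chunks get smaller, which is exactly the chunking-overhead tradeoff systems like Centauri have to navigate.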
### Resource Management & Scheduling
- NSDI ‘23 Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs
- NSDI ‘23 Transparent GPU Sharing in Container Clouds for Deep Learning Workloads
- NSDI ‘23 Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
- SOSP ‘23 Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
- ASPLOS ‘23 Snape: Reliable and Low-Cost Computing with Mixture of Spot and On-demand VMs
- ASPLOS ‘23 Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
- ASPLOS ‘23 ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning
- ASPLOS ‘24 Heet: Accelerating Elastic Training in Heterogeneous Deep Learning Clusters
- OSDI ‘24 MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale
### Job Analysis & Optimization
- NSDI ‘24 Characterization of Large Language Model Development in the Datacenter
- OSDI ‘24 When will my ML Job finish? Toward providing Completion Time Estimates through Predictability-Centric Scheduling
- NSDI ‘23 ModelKeeper: Accelerating DNN Training via Automated Training Warmup
- NSDI ‘23 BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing
- SIGCOMM ‘23 Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models
### Failure Recovery
- SOSP ‘23 GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
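GEMINI's central idea is to checkpoint training state into the CPU memory of peer machines rather than remote storage, so recovery avoids the storage bandwidth bottleneck. A toy sketch of that replication scheme, assuming dict-valued state and a ring of peers (all class and method names here are hypothetical, not GEMINI's API):

```python
import copy

class InMemoryCheckpointer:
    """Toy sketch: the latest checkpoint of each rank is kept in
    another rank's (CPU) memory, so a failed rank can be restored
    from its peer without touching remote storage."""

    def __init__(self, world_size):
        self.world_size = world_size
        # replica_of[r] holds rank r's most recent checkpoint; in a
        # real system it would live on rank (r + 1) % world_size.
        self.replica_of = {}

    def checkpoint(self, rank, state, step):
        # Deep-copy to mimic serializing the state off the GPU.
        self.replica_of[rank] = {"step": step, "state": copy.deepcopy(state)}

    def recover(self, failed_rank):
        ckpt = self.replica_of.get(failed_rank)
        if ckpt is None:
            raise RuntimeError(f"no in-memory checkpoint for rank {failed_rank}")
        return ckpt["step"], copy.deepcopy(ckpt["state"])
```

The real system additionally interleaves checkpoint traffic with training traffic and handles correlated failures, which this sketch omits.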
## Serving

### System Architecture
- NSDI ‘23 SHEPHERD: Serving DNNs in the Wild
- OSDI ‘24 Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- OSDI ‘24 ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
- OSDI ‘24 InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- OSDI ‘24 DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
- OSDI ‘24 Llumnix: Dynamic Scheduling for Large Language Model Serving
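Sarathi-Serve's handling of the throughput-latency tradeoff rests on chunked prefills: long prompt prefills are split under a per-iteration token budget so that ongoing decodes can ride along in every batch instead of stalling behind a full prefill. A minimal scheduler sketch of that idea, with the function name, queue shapes, and budget all illustrative rather than taken from the paper:

```python
def build_batch(prefill_queue, decode_queue, token_budget):
    """Form one iteration's batch: admit all decodes first (one
    token each), then fill the remaining budget with a chunk of the
    oldest pending prefill. prefill_queue holds (request_id,
    tokens_remaining) pairs. Returns (batch, updated prefill_queue)."""
    batch = [("decode", rid, 1) for rid in decode_queue]
    budget = token_budget - len(batch)
    queue = list(prefill_queue)
    if queue and budget > 0:
        rid, remaining = queue[0]
        chunk = min(remaining, budget)
        batch.append(("prefill", rid, chunk))
        if remaining - chunk == 0:
            queue.pop(0)          # prefill finished
        else:
            queue[0] = (rid, remaining - chunk)
    return batch, queue
```

Because decodes are never evicted from the batch, their per-token latency stays bounded even while a long prompt is being prefilled in the background.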
### Resource Management & Optimization
- OSDI ‘24 USHER: Holistic Interference Avoidance for Resource Optimized ML Inference
- OSDI ‘24 Fairness in Serving Large Language Models
- OSDI ‘24 dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving
- ASPLOS ‘24 ExeGPT: Constraint-Aware Resource Allocator for LLM Inference
- ASPLOS ‘24 Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling
- ASPLOS ‘24 SpotServe: Serving Generative Large Language Models on Preemptible Instances
### Memory & Cache Management
- NSDI ‘24 Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models
- ATC ‘24 Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention
- SOSP ‘23 Efficient Memory Management for Large Language Model Serving with PagedAttention
- SIGCOMM ‘24 CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
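The PagedAttention paper's key move is to manage the KV cache like virtual memory: carve it into fixed-size blocks and give each sequence a block table from logical token positions to physical blocks, so memory is allocated on demand instead of overprovisioned per sequence. A toy allocator sketch of that scheme (class and method names are hypothetical, not vLLM's API):

```python
class PagedKVCache:
    """Toy sketch of PagedAttention-style allocation: sequences grow
    one token at a time, and a new physical block is claimed only
    when the current block fills up."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token of seq_id."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Internal fragmentation is bounded by one partially filled block per sequence, which is what lets serving systems pack far more concurrent sequences into the same GPU memory.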
## Infrastructure

### Networking
- SIGCOMM ‘24 NetLLM: Adapting Large Language Models for Networking
- SIGCOMM ‘24 Alibaba HPN: A Data Center Network for Large Language Model Training
### Cloud & Serverless
- ASPLOS ‘24 RainbowCake: Mitigating Cold-starts in Serverless with Layer-wise Container Caching and Sharing
- ASPLOS ‘23 AQUATOPE: QoS-and-Uncertainty-Aware Resource Management for Multi-stage Serverless Workflows
- ASPLOS ‘24 AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines
### System Tools & Analysis
- ASPLOS ‘24 Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O Samples
- ASPLOS ‘24 A Journey of a 1,000 Kernels Begins with a Single Step: A Retrospective of Deep Learning on GPUs
- ASPLOS ‘24 DREAM: A Dynamic Scheduler for Dynamic Real-time Multi-model ML Workloads
- ASPLOS ‘24 NDPipe: Exploiting Near-data Processing for Scalable Inference and Continuous Training in Photo Storage