US $99.99 CAN $124.99
ISBN: 979-8-341-62778-9

Elevate your AI system performance capabilities with this definitive guide to maximizing efficiency across every layer of your AI infrastructure. In today's era of ever-growing generative models, AI Systems Performance Engineering provides engineers, researchers, and developers with a hands-on set of actionable optimization strategies. Learn to co-optimize hardware, software, and algorithms to build resilient, scalable, and cost-effective AI systems that excel in both training and inference.

Authored by Chris Fregly, a performance-focused engineering and product leader, this resource transforms complex AI systems into streamlined, high-impact AI solutions. Inside, you'll discover step-by-step methodologies for fine-tuning GPU CUDA kernels, PyTorch-based algorithms, and multinode training and inference systems. You'll also master the art of scaling GPU clusters for high-performance distributed model training jobs and inference servers. The book ends with a checklist of 175+ proven, ready-to-use optimizations.

• Codesign and optimize hardware, software, and algorithms to achieve maximum throughput and cost savings
• Implement cutting-edge inference strategies that reduce latency and boost throughput in real-world settings
• Utilize industry-leading scalability tools and frameworks
• Profile, diagnose, and eliminate performance bottlenecks across complex AI pipelines
• Integrate full-stack optimization techniques for robust, reliable AI system performance

Chris Fregly is an AI product and performance engineering leader with experience at AWS, Databricks, and Netflix. He coauthored Data Science on AWS and Generative AI on AWS, created an O'Reilly course on optimizing AI performance with NVIDIA GPUs, and founded the global AI Performance Engineering meetup.

"AI systems are layered and fast-moving. Chris breaks the complexity down into a reference that will set the standard for years."
—Chris Lattner, CEO at Modular

"CUDA kernels, distributed training, compilers, disaggregated inference—finally in one place. An encyclopedia of ML systems."
—Mark Saroufim, PyTorch engineer at Meta and founder of GPU MODE Community

"Squeezing the most performance out of your AI system is what separates the good from the great. This is the missing manual."
—Sebastian Raschka, ML/AI researcher and thought leader
Praise for AI Systems Performance Engineering

AI systems are moving incredibly fast through intense research and development, and are filled with multiple layers of complexity. Chris breaks down this complexity and explains it in a way we can all benefit from, providing a singular reference that will set a standard for years to come.
—Chris Lattner, CEO at Modular

Understanding how LLMs work is just the beginning. Getting the most out of them through model and hardware optimization is what separates good practitioners from great ones. Chris's book is the missing reference manual that shows you how.
—Sebastian Raschka, PhD, ML/AI researcher and author of bestselling book Build a Large Language Model from Scratch (Manning, 2024)

The breadth here is insane—CUDA kernels, distributed training, PyTorch compiler internals, disaggregated inference, all in one place. Chris pulled together stuff that's usually scattered across a million blog posts and papers. This is an excellent encyclopedia of ML systems.
—Mark Saroufim, PyTorch engineer at Meta and founder of GPU MODE Community

Taming AI systems at scale is one of the immense challenges of our time. Chris Fregly has written the definitive field guide for this new frontier. He masterfully connects the dots from the silicon all the way up to the application, providing the full-stack wisdom that every AI engineer needs to turn raw compute into efficient, high-performance models.
—Harsh Banwait, director of product at CoreWeave
This book is an exceptional resource for anyone looking to immerse themselves in modern ML systems engineering. Its focus on the details of state-of-the-art projects like vLLM and llm-d demonstrates a deep understanding of inference optimization techniques and the transformative power of open source software.
—Michael Goin, vLLM maintainer and principal engineer at Red Hat

This book is a definitive guide for AI engineers who refuse to settle for default performance. Whether you're tuning CUDA kernels, scaling LLM inference, or orchestrating AI agents across GPUs, this book gives you the surgical tools and system-level clarity to efficiently run modern AI workloads. Read it, use it, and you will redefine how you build, scale, and think about modern AI systems. AI Systems Performance Engineering is a master key to unlocking AI performance.
—Arpitha Srinivas, AI systems performance engineer at the world's leading manufacturer of high-performance and high-efficiency AI servers

AI Systems Performance Engineering brilliantly connects GPU hardware architecture with modern AI workload optimization, combining rigorous low-level tuning with high-level system design. Spanning GPU memory bandwidth optimization, KV cache management, batching strategies, and multi-GPU inference, the book delivers insights forged in real-world production. Chris brings clarity to complex topics like Nsight profiling, transformer attention tuning, and distributed scaling, making it a must-read for performance engineers bridging hardware, systems, and AI.
—Amer Ather, cloud and ML performance engineer at Netflix

The most comprehensive and up-to-date guide on building modern-day AI systems. A must-read for every AI/ML developer and practitioner.
—Chaim Rand, AI/ML algorithm engineer

This is the go-to reference for anything AI-performance related. Chris's latest book is packed full of content that helps me in my day-to-day activities of optimizing and tuning AI workloads.
He covers all of today's AI systems performance issues—and provides solutions that are invaluable to every company trying to put AI into production.
—Antje Barth, member of technical staff at Amazon AGI

There is no other book in this field that even comes close to AI Systems Performance Engineering. Each chapter demonstrates deep expertise and could easily be a standalone book. The content feels so fresh and evergreen, and very easy to digest.
—Suman Debnath, ML systems engineer at Anyscale
A tour de force that is essential reading for performance engineers who are working with the latest AI-based applications.
—Adrian Cockcroft, OrionX.net (former head of cloud infrastructure at Netflix and VP of engineering at AWS)

This is the book I was waiting for. It ties together the scattered, vast, and fast-moving world of AI systems performance engineering into one clear, modern resource.
—Madison Kanna, AI engineer at Baseten

As AI becomes the foundation of modern computing, understanding accelerator architectures and their ecosystems is no longer optional—it's a strategic imperative. Chris distills extraordinary technical depth into a clear and accessible narrative. For anyone engineering or leading at hyperscale, this book is the essential starting point.
—Omer Zaki, VP of AI infrastructure at Oracle

Today's AI is a system problem where one has to co-optimize the software for the hardware fabric to achieve peak performance. Chris peels back the curtain to showcase the different levels of the AI software and hardware stack that modern AI workloads run on.
—Abdul Dakkak, head of GenAI at Modular
Chris Fregly

AI Systems Performance Engineering
Optimizing Model Training and Inference Workloads with GPUs, CUDA, and PyTorch
AI Systems Performance Engineering
by Chris Fregly

Copyright © 2026 Flux Capacitor, LLC. All rights reserved.

Published by O'Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (https://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Angela Rufino
Production Editor: Kristen Brown
Copyeditor: nSight, Inc.
Proofreader: Piper Content Partners
Indexer: Sue Klefstad
Cover Designer: Susan Brown
Cover Illustrator: José Marzan Jr.
Interior Designer: David Futato
Interior Illustrator: Kate Dullea

November 2025: First Edition

Revision History for the First Edition
2025-11-11: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9798341627789 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. AI Systems Performance Engineering, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher's views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

979-8-341-62778-9
[LSI]
Table of Contents

Preface  xxi

1. Introduction and AI System Overview  1
    The AI Systems Performance Engineer  3
    Benchmarking and Profiling  4
    Scaling Distributed Training and Inference  5
    Managing Resources Efficiently  5
    Cross-Team Collaboration  6
    Transparency and Reproducibility  6
    DeepSeek Scales to ~680-Billion Parameter Models Despite US Export Hardware Restrictions in China  9
    Toward 100-Trillion-Parameter Models  11
    NVIDIA's "AI Supercomputer in a Rack"  13
    Mechanical Sympathy: Hardware-Software Codesign  15
    Measuring "Goodput" Useful Throughput  16
    Book Roadmap and Methodology  18
    Key Takeaways  20
    Conclusion  21

2. AI System Hardware Overview  23
    The CPU and GPU Superchip  23
    NVIDIA Grace CPU  26
    NVIDIA Blackwell "Dual-Die" GPU  26
    NVIDIA GPU Tensor Cores and Transformer Engine  29
    Streaming Multiprocessor, Threads, and Warps  31
    Ultrascale Networking Treating Many GPUs as One  33
    NVLink and NVSwitch  34
    Multi-GPU Programming  38
    In-Network Aggregations with NVIDIA SHARP  40
    Multirack and Storage Communication  41
    Preintegrated Rack Appliance  42
    Co-Packaged Optics: Future of Networking Hardware  43
    Compute Density and Power Requirements  43
    Liquid Cooling Versus Air Cooling  45
    Performance Monitoring and Utilization in Practice  47
    Sharing and Scheduling  47
    ROI of Upgrading Your Hardware  48
    A Glimpse into the Future: NVIDIA's Roadmap  49
    Blackwell Ultra and Grace Blackwell Ultra  49
    Vera Rubin Superchip (2026)  50
    Rubin Ultra and Vera Rubin Ultra (2027)  51
    Feynman GPU (2028) and Doubling Something Every Year  51
    Key Takeaways  52
    Conclusion  53

3. OS, Docker, and Kubernetes Tuning for GPU-Based Environments  55
    Operating System  56
    NVIDIA Software Stack  56
    GPU Driver  57
    CUDA Toolkit and Runtime  57
    CUDA Forward and Backward Compatibility Across GPU Hardware Generations  58
    C++ and Python CUDA Libraries  59
    PyTorch and Higher-Level AI Frameworks  60
    Configuring the CPUs and OS for GPU Environments  61
    NUMA Awareness and CPU Pinning  62
    NUMA-Friendly Memory Allocation and Memory Pinning  70
    Transparent Hugepages  72
    Scheduler and Interrupt Affinity  73
    Virtual Memory and Swapping  74
    Filesystem Caching and Write-Back  75
    CPU Frequency and C-states  75
    Tune Host CPU Memory Allocator  76
    GPU Driver and Runtime Settings for Performance  77
    GPU Persistence Mode  77
    MPS  78
    MIG  81
    GPU Clock Speeds and ECC  84
    GPU Memory Oversubscription, Fragmentation, and Out-of-Memory Handling  85
    Container Runtime Optimizations for GPUs  88
    NVIDIA Container Toolkit and CUDA Compatibility  88
    NVIDIA Container Runtime  89
    Avoiding Container Overlay Filesystem Overhead  90
    Reduce Image Size for Faster Container Startup  91
    Kubernetes for Topology-Aware Container Orchestration and Networking  91
    Orchestrating Containers with Kubernetes Topology Manager  92
    Job Scheduling with Kubernetes and SLURM  93
    Slicing a GPU with MIG  94
    Optimizing Network Communication for Kubernetes  95
    Reducing Kubernetes Orchestration Jitter  96
    Improving Resource Guarantees  96
    Memory Isolation and Avoiding the OOM Killer  97
    Dealing with I/O Isolation  98
    Key Takeaways  98
    Conclusion  101

4. Tuning Distributed Networking Communication  103
    Overlapping Communication and Computation (Pipelining)  104
    Asynchronous Execution with Streams  106
    Reducing Communication Frequency and Volume  107
    Achieving Maximal Overlap in Practice  107
    NVIDIA Magnum IO Optimization Stack  113
    High-Speed, Low-Overhead Data Transfers with RDMA  114
    Tuning Multinode Connectivity  117
    Multinode Communication Pitfalls  119
    NCCL for Distributed Multi-GPU Communication  125
    Topology Awareness in NCCL  126
    NCCL Communication Algorithms  129
    Distributed Data Parallel Strategies  132
    NCCL Communicator Lifecycle and Environment Gotchas  137
    Profiling and Debugging NCCL  145
    In-Network SHARP Aggregation  146
    Persistent NCCL User Buffers and Zero-Copy Registration  148
    NVIDIA's NIXL and Disaggregated Inference  148
    Separate Prefill and Decode Inference Stages  150
    Intelligent Interconnect Routing for KV Cache Transfers  152
    NIXL Asynchronous API with Callbacks  153
    KV Cache Offloading with NIXL  157
    NIXL and High-Performance Inference Systems Like NVIDIA Dynamo  159
    NCCL Versus NIXL  160
    Key Takeaways  161
    Conclusion  162
5. GPU-Based Storage I/O Optimizations  163
    Fast Storage and Data Locality  163
    Sequential Versus Random Read Patterns  165
    Tuning NVMe and Filesystem for Throughput  166
    Using NVIDIA GDS  167
    Checkpointing GPU State with cuda-checkpoint  170
    Measuring GDS with gdsio  171
    DeepSeek's Fire-Flyer File System  172
    Distributed, Parallel Filesystems and Object Stores  174
    Tuning, Replicating, and Compressing Data  176
    Monitoring Storage I/O  177
    Tuning the Data Pipeline  178
    Efficient Data Loading and Preprocessing  178
    Scaling Out Workers as You Scale Out Number of GPUs  181
    Multimodal Data Processing with NVIDIA DALI  182
    Creating High-Quality LLM Datasets with NVIDIA NeMo Curator  183
    Continuous Profiling and Tuning Workflow  184
    Diagnosing Communication- Versus Compute-Bound Workloads  188
    Key Takeaways  189
    Conclusion  190

6. GPU Architecture, CUDA Programming, and Maximizing Occupancy  191
    Understanding GPU Architecture  191
    Threads, Warps, Blocks, and Grids  195
    Choosing Threads-per-Block and Blocks-per-Grid Sizes  200
    CUDA GPU Backward and Forward Compatibility Model  203
    CUDA Programming Refresher  203
    Configuring Launch Parameters: Blocks per Grid and Threads per Block  207
    2D and 3D Kernel Inputs  209
    Asynchronous Memory Allocation and Memory Pools  210
    Understanding GPU Memory Hierarchy  212
    Unified Memory  218
    Maintaining High Occupancy and GPU Utilization  221
    Tuning Occupancy with Launch Bounds  229
    Debugging Functional Correctness with NVIDIA Compute Sanitizer  231
    Roofline Model: Compute-Bound or Memory-Bound Workloads  233
    Key Takeaways  237
    Conclusion  238

7. Profiling and Tuning GPU Memory Access Patterns  239
    Coalesced Versus Uncoalesced Global Memory Access  239
    Vectorized Memory Access  247
    Tiling and Data Reuse Using Shared Memory  255
    Avoid Shared-Memory Bank Conflicts  264
    Warp Shuffle Intrinsics: Avoid Shared Memory and Explicit Synchronization  271
    Read-Only Data Caches  273
    Asynchronous Memory Prefetching and Tensor Memory Accelerator  278
    Key Takeaways  283
    Conclusion  285

8. Occupancy Tuning, Warp Efficiency, and Instruction-Level Parallelism  287
    Profiling and Diagnosing GPU Bottlenecks  288
    Nsight Systems Timeline View  288
    Profiling and Tuning the Data Pipeline  289
    Nsight Compute and Roofline Analysis  290
    PyTorch Profiler and Visualization Tools  291
    Profiler-Guided Analysis  293
    Analyzing Warp Stall Reasons with Nsight Compute  293
    Memory-Related Stalls  293
    Execution-Dependency Stalls  295
    Execution Unit Contention  295
    Other Stall Reasons  295
    Inspecting Achieved Occupancy and GPU Utilization  297
    Kernel Memory Throughput Versus Peak HBM Memory Bandwidth  299
    Kernel Compute Throughput Versus Peak GPU FLOPS  299
    Iteratively Profiling and Determining the Kernel Bottleneck  301
    Optimizing the Kernel  303
    Tuning Occupancy  306
    Find the Right Occupancy for Your Workload  307
    Techniques for Occupancy Tuning  308
    Compiler Hints to Optimize Occupancy  311
    Determine Optimal Launch Configuration with the Occupancy API  312
    Tuning Occupancy with PyTorch  312
    Improving Warp Execution Efficiency (Warp Divergence)  314
    Causes of Warp Divergence  314
    Techniques to Avoid Warp Divergence  315
    Profiling and Detecting Warp Divergence  318
    Using Predication to Minimize Divergence  319
    Efficient Intrawarp Communication with Warp Intrinsics  323
    PyTorch Considerations for Warp-Level Efficiency  323
    Exposing Instruction-Level Parallelism  325
    Warp Scheduling and Dual Issue Instructions  327
    ILP and Occupancy  330
    Loop Unrolling, Interleaving, and Compiler Hinting  332
    Profiling and Mitigating Register Pressure  333
    Key Takeaways  334
    Conclusion  335

9. Increasing CUDA Kernel Efficiency and Arithmetic Intensity  337
    Multilevel Microtiling and Software Prefetching  339
    Tiling with Thread Block Clusters  341
    Kernel Fusion  344
    Structured Sparsity  347
    Recomputation Versus Memory Trade-Off  350
    PyTorch and Arithmetic Intensity  350
    Mixed Precision and Utilizing Tensor Cores  351
    Feeding Tensor Cores with TMEM and TMA  352
    TF32 and Automatic Mixed Precision (PyTorch)  355
    BF16/FP16, FP8, and FP4 Reduced Precision  356
    INT8 Reduced Precision and DP4A Instructions for Inference  357
    Transformer Engine and TMEM in Depth  358
    Using CUTLASS for Optimal Arithmetic Intensity and Tensor Core Performance  360
    Inline PTX and SASS Tuning for Microoptimizations  365
    DeepSeek's Use of Inline PTX for Memory Allocation Optimization  369
    Key Takeaways  370
    Conclusion  371

10. Intra-Kernel Pipelining, Warp Specialization, and Cooperative Thread Block Clusters  373
    Intra-Kernel Pipelining Techniques  374
    Cooperative Tiling and Double-Buffering with the CUDA Pipeline API  375
    Warp Specialization and the Producer-Consumer Model  381
    Using CUDA Pipeline API for Warp Specialization  385
    PyTorch, CUDA Pipeline API, and Warp Specialization  390
    Persistent Kernels and Megakernels  391
    Common Workloads for Persistent Kernels  393
    Megakernels for Inference  394
    Persistent Kernels and Warp Specialization  395
    Cooperative Groups  395
    Cooperative Grid Synchronization and Persistent Kernels  398
    When to Combine Persistent Kernels and Cooperative Groups  401
    Thread Block Clusters and Distributed Shared Memory  402
    Thread Block Swizzling  404
    Distributed Shared Memory  405
    Scratch Memory  406
    Launching a Thread Block Cluster  407
    Coordinating Thread Block Clusters with Cooperative Groups API  408
    Thread Block Pair  410
    Reducing Global Memory Traffic with Thread Block Clusters  413
    Designing Efficient Algorithms with Thread Block Clusters  416
    Warp Specialization with Thread Block Clusters  418
    Key Takeaways  425
    Conclusion  426

11. Inter-Kernel Pipelining, Synchronization, and CUDA Stream-Ordered Memory Allocations  429
    Overlapping Kernel Execution with CUDA Streams  429
    Using Streams to Overlap Compute with Data Transfers  433
    Stream-Ordered Memory Allocator  435
    Using CUDA Streams and Stream-Ordered Memory Allocator with LLMs  439
    Legacy Default Stream  441
    Modern Per-Thread Default Stream  442
    Default Versus Explicit (Nondefault) Streams  443
    Best Practices for Default Stream Usage  444
    Fine-Grained Synchronization with Events and Callbacks  446
    Using CUDA Events for Cross-Stream Synchronization  448
    Pipelining with Warp Specialization (Intra-Kernel) and CUDA Streams (Inter-Kernel)  449
    Warp Specialization with Thread Block Clusters and CUDA Streams  456
    Multi-GPU Compute and Data Transfer Overlap with CUDA Streams  463
    Programmatic Dependent Launch  469
    Combining PDL and Thread Block Clusters with Warp Specialization  472
    Key Takeaways  478
    Conclusion  480

12. Dynamic Scheduling, CUDA Graphs, and Device-Initiated Kernel Orchestration  481
    Dynamic Scheduling with Atomic Work Queues  482
    Atomic Counters  483
    Atomic Queues  485
    CUDA Graphs  489
    PyTorch, Inference Engines, and CUDA Graphs  490
    Memory Pools for CUDA Graphs  491
    Capturing a CUDA Graph with a CUDA Stream  492
    Dynamic Graph Update  496
    Device-Initiated CUDA Graph Launch  497
    Atomic Queues and Device-Initiated CUDA Graphs for In-Kernel Persistent Scheduling  501
    Conditional Graph Nodes  501
    Dynamic Parallelism  506
    Orchestrate Across Multiple GPUs and Cluster Nodes (NVSHMEM)  512
    Fine-Grained GPU-to-GPU Memory Sharing with NVSHMEM  513
    Capturing Multi-GPU Collectives with NCCL and CUDA Graphs  517
    Pattern for N-GPU Scaling  520
    Roofline-Guided Scheduling and Orchestration Decisions  521
    Key Takeaways  523
    Conclusion  524

13. Profiling, Tuning, and Scaling PyTorch  527
    NVTX Markers and Profiling Tools  528
    Profiling PyTorch to Identify Bottlenecks  531
    Using PyTorch Profiler  532
    System Profiling with Nsight Systems and NVTX Timelines  535
    Kernel Roofline Analysis for General Matrix Multiply (GEMM)  537
    CPU and GPU Profiling with Linux perf  539
    PyTorch Compiler (torch.compile)  543
    Using the PyTorch Compiler  544
    Compiling Versus Writing Custom Kernels  548
    Compilation Modes and Trade-Offs in Speed, Memory, and Compile Time  549
    Regional Compilation  551
    Profiling and Debugging Compiler Performance Issues  552
    PyTorch Optimized Attention Mechanisms  553
    PyTorch Architecture Optimization (torchao), Quantization, Sparsity, and Pruning  555
    Concurrency with CUDA Streams  555
    Overlapping Communication and Computation  556
    Stream Synchronization with Events  558
    Using CUDA Streams with MoE Models  560
    Reducing Kernel Launch Overhead with CUDA Graphs  562
    Capturing a CUDA Graph and Preallocating Memory  562
    Replaying the Graph  565
    Best Practices for CUDA Graphs  567
    CUDA Graph Trees (PyTorch Compiler Internal)  568
    Profiling and Tuning Memory in PyTorch  569
    Tuning the CUDA Memory Allocator  570
    Activation Checkpointing for Memory Savings  572
    Offloading Parameters to CPU and NVMe  573
    SuperOffload: Optimized CPU-GPU Superchip Offload  573
    FSDP Automatic Checkpointing and Offloading  574
    Combining FSDP with Tensor Parallel and Pipeline Parallel  577
    Pluggable Memory Allocators and Cross-GPU Data Transfers  578
    Enabling Peer-to-Peer DMA and UCX  581
    PyTorch Symmetric Memory  582
    Optimizing the Data Input Pipeline  583
    Scaling with PyTorch Distributed  585
    DDP with torch.compile  586
    FSDP with torch.compile  586
    Tensor and Pipeline Parallelism with torch.compile  589
    TorchTitan, AsyncTP, AutoParallel, and SimpleFSDP  590
    Multi-GPU Profiling with HTA  591
    Continuous Integration and Performance Benchmarking  592
    PyTorch HUD Performance Dashboard  594
    Performance Benchmarks and MLPerf Logging  595
    Key Takeaways  598
    Conclusion  602

14. PyTorch Compiler, OpenAI Triton, and XLA Backends  605
    PyTorch Compiler Deep Dive  606
    TorchDynamo for Bytecode Capture and Graph Extraction  607
    AOT Autograd Fusion for Forward and Backward Passes  612
    PrimTorch IR (Prims) Simplified Operator Set  613
    TorchInductor Backend Code Generation  615
    Autotuning with TorchInductor  617
    Dynamic Shapes and Variable Sequence Lengths  620
    Disabling the PyTorch Compiler and Reverting Back to Eager Mode  623
    Performance Hints and Debugging Generated Code  623
    Debugging Numerical Correctness and Accuracy  624
    Explaining and Minimizing Graph Breaks  627
    Graph Breaks and TorchDynamo explain()  627
    Minimize Graph Recompilations  632
    Mark Functions and Code Blocks as Safe with allow_in_graph  632
    Tips for Handling Graph Breaks  633
    Debugging Compiler Phases, Graph Breaks, and Performance  636
    Writing Custom Kernels with OpenAI Triton  637
    Triton Programming Model  638
    Accessing Shared Memory in Triton  641
    Registering Custom Kernels with PyTorch  641
    Tuning Kernel-Launch Parameters  643
    Autotuning Triton Kernels  643
    Advanced Triton Kernel Implementations  645
    Warp Specialization with Triton  645
    Tiled and Persistent GEMM Kernel (Triton)  646
    Software Pipelining and Double Buffering with Triton  653
    Profiling with Triton Proton Profiler  657
    PyTorch XLA Backend  658
    Key Takeaways  659
    Conclusion  662

15. Multinode Inference, Parallelism, Decoding, and Routing Optimizations  663
    Disaggregated Prefill and Decode Architecture  664
    Prefill-Decode Interference  665
    Scaling Prefill and Worker Nodes Independently  665
    Impact on Latency (TTFT) and Throughput (TPOT)  666
    KV Cache Data Transfer and NIXL  667
    Deploying Disaggregated Prefill and Decode with Kubernetes  668
    Parallelism Strategies for Serving Massive MoE Models  670
    Tensor Parallelism  672
    Pipeline Parallelism  673
    Expert Parallelism  674
    Data Parallelism  677
    Context (Sequence) Parallelism  678
    Hybrid Parallelism  679
    Speculative Decoding and Parallel Token Generation Techniques  681
    Two-Model, Draft-Based Speculative Decoding and EAGLE  682
    Single-Model Self-Speculative Decoding  686
    Multitoken Decoding with Medusa's Multiple Heads  686
    Interleaving Decode Steps from Multiple Requests  688
    Combining Decoding Techniques and Evaluating Complexity  688
    Constrained Decoding Performance Implications  689
    Dynamic Routing Strategies for MoE Inference  691
    Expert Communication Optimization  691
    Load Balancing, Capacity Factor, and Expert Replication  693
    Adaptive Expert Routing and Real-Time Monitoring  695
    Key Takeaways  698
    Conclusion  699

16. Profiling, Debugging, and Tuning Inference at Scale  701
    Profiling, Debugging, and Tuning Inference Performance  702
    Monitoring System Metrics and Counters  705
    Profiling with Nsight Systems and Nsight Compute  709
    Inference Troubleshooting Recipes  712
    Full-Stack Inference Optimizations  713
    Debugging Correctness Issues  715
    Dynamic Batching, Scheduling, and Routing  717
    Dynamic Batching  718
    Continuous Batching  720
    Continuous Scheduling  721
    Stall-Free Scheduling (Chunked Prefill)  723
    Latency-Aware Scheduling and Dynamic Routing  724
    Systems-Level Optimizations  725
    Overlapping Communication and Computation  725
    Maximizing GPU Utilization and Throughput Versus Latency Trade-Offs  729
    Power and Thermal Constraints  730
    Error Handling  731
    Memory  732
    KV Cache Offloading and Memory Pool Allocation  732
    Quantization Approaches for Real-Time Inference  733
    Reducing Precision from FP16 to FP8 and FP4  734
    Weight-Only Quantization (GPTQ, AWQ)  735
    Activation Quantization  737
    Post-Training Quantization Workflow  737
    Combining Weight and Activation Quantization  738
    Fusing Quantization-Dequantization Steps into the Execution Graph  739
    Application-Level Optimizations  740
    Prompt Compression  740
    Prompt Cleansing  741
    Prefix Caching  743
    Model Cascading and Tiered Model Deployment  750
    Streaming Responses  752
    Debouncing and Request Coalescing  755
    Token Output Limits and Timeouts  756
    Key Takeaways  757
    Conclusion  758

17. Scaling Disaggregated Prefill and Decode for Inference  759
    Why Prefill-Decode Disaggregation?  761
    Advantages of Disaggregation  762
    Disaggregated Prefill and Decode Cluster Pools  765
    Disaggregated Routing and Scheduling Policies  780
    Scalability of Disaggregated Prefill and Decode  795
    Key Takeaways  796
    Conclusion  797

18. Advanced Prefill-Decode and KV Cache Tuning  799
    Optimized Decode Kernels  799
    FlashMLA (DeepSeek)  800
    ThunderMLA (Stanford)  801
    FlexDecoding (PyTorch)  802
    Tuning KV Cache Utilization and Management  805
    Disaggregated KV Cache Pool  805
    KV Cache Reuse and Prefix Sharing  808
    Optimized KV Cache Memory Layout  810
    GPU and CPU-GPU Superchip Improvements  811
    Fast KV Cache Transfer Between Prefill and Decode  812
    KV Cache Size  812
    Zero-Copy GPU-to-GPU Transfer  813
    Connector and Data Path Design  817
    Heterogeneous Hardware and Parallelism Strategies for Prefill and Decode  820
    Compute-Optimized Versus Memory-Optimized Hardware  820
    Hybrid Prefill with GPU-CPU Collaboration  826
    SLO-Aware Request Management and Fault Tolerance  829
    Early Rejection (Admission Control)  829
    Quality of Service  831
    Fault Tolerance  832
    Dynamic Scheduling and Load Balancing  834
    Adaptive Resource Scheduling and Hotspot Prevention  834
    Key Takeaways  841
    Conclusion  841

19. Dynamic and Adaptive Inference Engine Optimizations  843
    Adaptive Parallelism Strategies (TP Versus PP Versus Hybrid)  843
    Dynamic Precision Changes  848
    Kernel Autotuning for Transformer Self-Attention and MLP Paths  853
    Dynamic Shared-Memory Allocation and Occupancy-Aware Kernel Selection  858
    Speculative KV Prefetching for Faster TTFT  862
    Real-Time KV Cache Compression and Policy Switching  867
    Reinforcement Learning Agents for Tuning AI Systems at Runtime  875
    Dynamic Memory-Allocation Switching (Slab Versus Caching Versus Stream-Ordered)  880
    Runtime Kernel Performance Improvements and Hot-Swappable Implementations  886
    Continuous Prewarming of CUDA Graphs and Caches Using Time-Series Prediction  889
    Adaptive Batching and Chunked Prefill Scheduling  893
    Congestion-Aware and Topology-Aware Scheduling with Multiple GPUs  899
    NVLink/NVSwitch Topology and Bandwidth Constraints  900
    Real-Time Link Telemetry and Monitoring  901
    Adaptive Process-GPU Mapping  902