Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iii Professional Linux® Kernel Architecture Wolfgang Mauerer Wiley Publishing, Inc.
Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page ii
Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page i Professional Linux® Kernel Architecture Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvii Chapter 1: Introduction and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Chapter 2: Process Management and Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 Chapter 3: Memory Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 Chapter 4: Virtual Process Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289 Chapter 5: Locking and Interprocess Communication . . . . . . . . . . . . . . . . . . . . . . . 347 Chapter 6: Device Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391 Chapter 7: Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473 Chapter 8: The Virtual Filesystem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 519 Chapter 9: The Extended Filesystem Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583 Chapter 10: Filesystems without Persistent Storage . . . . . . . . . . . . . . . . . . . . . . . 643 Chapter 11: Extended Attributes and Access Control Lists . . . . . . . . . . . . . . . . . 707 Chapter 12: Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 733 Chapter 13: System Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819 Chapter 14: Kernel Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847 Chapter 15: Time management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893 Chapter 16: Page and Buffer Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 949 Chapter 17: Data Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 989 Chapter 18: Page Reclaim and Swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1023 Chapter 19: Auditing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1097 Appendix A: Architecture Specifics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1117 Appendix B: Working with the Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1141 Appendix C: Notes on C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1175 Appendix D: System Startup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1223 Appendix E: The ELF Binary Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1241 Appendix F: The Kernel Development Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1267 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1293
Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page ii
Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iii Professional Linux® Kernel Architecture Wolfgang Mauerer Wiley Publishing, Inc.
Mauerer ffirs.tex V2 - 08/26/2008 3:23am Page iv Professional Linux® Kernel Architecture Published by Wiley Publishing, Inc. 10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com Copyright © 2008 by Wolfgang Mauerer Published by Wiley Publishing, Inc., Indianapolis, Indiana Published simultaneously in Canada ISBN: 978-0-470-34343-2 Manufactured in the United States of America 10 9 8 7 6 5 4 3 2 1 Library of Congress Cataloging-in-Publication Data: Mauerer, Wolfgang, 1978- Professional Linux kernel architecture / Wolfgang Mauerer. p. cm. Includes index. ISBN 978-0-470-34343-2 (pbk.) 1. Linux. 2. Computer architecture. 3. Application software. I. Title. QA76.9.A73M38 2008 005.4’32--dc22 2008028067 No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Legal Department, Wiley Publishing, Inc., 10475 Crosspoint Blvd., Indianapolis, IN 46256, (317) 572-3447, fax (317) 572-4355, or online at http://www.wiley.com/go/permissions. Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Website is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Website may provide or recommendations it may make. Further, readers should be aware that Internet Websites listed in this work may have changed or disappeared between when this work was written and when it is read. For general information on our other products and services please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002. Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Wrox Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book. Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Mauerer fauth.tex V2 - 08/22/2008 4:52am Page v About the Author Wolfgang Mauerer is a quantum physicist whose professional interests are centered around quantum cryptography, quantum electrodynamics, and compilers for — you guessed it — quantum architectures. With the confirmed capacity of being the worst experimentalist in the known universe, he sticks to the theoretical side of his profession, which is especially reassuring considering his constant fear of acci- dentally destroying the universe. Outside his research work, he is fascinated by operating systems, and for more than a decade — starting with an article series about the kernel in 1997 — he has found great pleasure in documenting and explaining Linux kernel internals. He is also the author of a book about typesetting with LaTeX and has written numerous articles that have been translated into seven languages in total. When he’s not submerged in vast Hilbert spaces or large quantities of source code, he tries to take the opposite direction, namely, upward — be this with model planes, a paraglider, or on foot with an ice axe in his hands: Mountains especially have the power to outrival even the Linux kernel. Consequently, he considers planning and accomplishing a first-ascent expedition to the vast arctic glaciers of east Green- land to be the really unique achievement in his life. Being interested in everything that is fundamental, he is also the author of the first compiler for Plankalkül, the world’s earliest high-level language devised in 1942–1946 by Konrad Zuse, the father of the computer. As an avid reader, he is proud that despite the two-digit number of computers present in his living room, the volume required for books still occupies a larger share.
Mauerer fauth.tex V2 - 08/22/2008 4:52am Page vi
Mauerer fcredit.tex V2 - 08/22/2008 4:53am Page vii Credits Executive Editor Carol Long Senior Development Editor Tom Dinse Production Editor Debra Banninger Copy Editors Cate Caffrey Kathryn Duggan Editorial Manager Mary Beth Wakefield Production Manager Tim Tate Vice President and Executive Group Publisher Richard Swadley Vice President and Executive Publisher Joseph B. Wikert Project Coordinator, Cover Lynsey Stanford Proofreader Publication Services, Inc. Indexer Jack Lewis
Mauerer fcredit.tex V2 - 08/22/2008 4:53am Page viii
Mauerer fack.tex V4 - 09/04/2008 3:36pm Page ix Acknowledgments First and foremost, I have to thank the thousands of programmers who have created the Linux kernel over the years — most of them commercially based, but some also just for their own private or academic joy. Without them, there would be no kernel, and I would have had nothing to write about. Please accept my apologies that I cannot list all several hundred names here, but in true UNIX style, you can easily generate the list by: for file in $ALL_FILES_COVERED_IN_THIS_BOOK; do git log --pretty="format:%an" $file; done | sort -u -k 2,2 It goes without saying that I admire your work very much — you are all the true heroes in this story! What you are reading right now is the result of an evolution over more than seven years: After two years of writing, the first edition was published in German by Carl Hanser Verlag in 2003. It then described kernel 2.6.0. The text was used as a basis for the low-level design documentation for the EAL4+ security evaluation of Red Hat Enterprise Linux 5, requiring to update it to kernel 2.6.18 (if the EAL acronym does not mean anything to you, then Wikipedia is once more your friend). Hewlett-Packard sponsored the translation into English and has, thankfully, granted the rights to publish the result. Updates to kernel 2.6.24 were then performed specifically for this book. Several people were involved in this evolution, and my appreciation goes to all of them: Leslie Mackay- Poulton, with support from David Jacobs, did a tremendous job at translating a huge pile of text into English. I’m also indebted to Sal La Pietra of atsec information security for pulling the strings to get the translation project rolling, and especially to Stephan Müller for close cooperation during the evaluation. My cordial thanks also go to all other HP and Red Hat people involved in this evaluation, and also to Claudio Kopper and Hans Löhr for our very enjoyable cooperation during this project. Many thanks also go to the people at Wiley — both visible and invisible to me — who helped to shape the book into its current form. The German edition was well received by readers and reviewers, but nevertheless comments about inaccuracies and suggestions for improvements were provided. I’m glad for all of them, and would also like to mention the instructors who answered the publisher’s survey for the original edition. Some of their suggestions were very valuable for improving the current publication. The same goes for the referees for this edition, especially to Dr. Xiaodong Zhang for providing numerous suggestions for Appendix F.4. Furthermore, I express my gratitude to Dr. Christine Silberhorn for granting me the opportunity to suspend my regular research work at the Max Planck Research Group for four weeks to work on this project. I hope you enjoyed the peace during this time when nobody was trying to install Linux on your MacBook! As with every book, I owe my deepest gratitude to my family for supporting me in every aspect of life — I more than appreciate this indispensable aid. Finally, I have to thank Hariet Fabritius for infinite
Mauerer fack.tex V4 - 09/04/2008 3:36pm Page x Acknowledgments patience with an author whose work cycle not only perfectly matched the most alarming forms of sleep dyssomnias, but who was always right on the brink of confusing his native tongue with ‘‘C,’’ and whom she consequently had to rescue from numerous situations where he seemingly had lost his mind (see below. . .). Now that I have more free time again, I’m not only looking forward to our well-deserved holiday, but can finally embark upon the project of giving your laptop all joys of a proper operating system! (Writing these acknowledgments, I all of a sudden realize why people always hasten to lock away their laptops when they see me approaching. . . .) x
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xi Contents Introduction xxvii Chapter 1: Introduction and Overview 1 Tasks of the Kernel 2 Implementation Strategies 3 Elements of the Kernel 3 Processes, Task Switching, and Scheduling 4 Unix Processes 4 Address Spaces and Privilege Levels 7 Page Tables 11 Allocation of Physical Memory 13 Timing 16 System Calls 17 Device Drivers, Block and Character Devices 17 Networks 18 Filesystems 18 Modules and Hotplugging 18 Caching 20 List Handling 20 Object Management and Reference Counting 22 Data Types 25 . . . and Beyond the Infinite 27 Why the Kernel Is Special 28 Some Notes on Presentation 29 Summary 33 Chapter 2: Process Management and Scheduling 35 Process Priorities 36 Process Life Cycle 38 Preemptive Multitasking 40 Process Representation 41 Process Types 47 Namespaces 47
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xii Contents Process Identification Numbers 54 Task Relationships 62 Process Management System Calls 63 Process Duplication 63 Kernel Threads 77 Starting New Programs 79 Exiting Processes 83 Implementation of the Scheduler 83 Overview 84 Data Structures 86 Dealing with Priorities 93 Core Scheduler 99 The Completely Fair Scheduling Class 106 Data Structures 106 CFS Operations 107 Queue Manipulation 112 Selecting the Next Task 113 Handling the Periodic Tick 114 Wake-up Preemption 115 Handling New Tasks 116 The Real-Time Scheduling Class 117 Properties 118 Data Structures 118 Scheduler Operations 119 Scheduler Enhancements 121 SMP Scheduling 121 Scheduling Domains and Control Groups 126 Kernel Preemption and Low Latency Efforts 127 Summary 132 Chapter 3: Memory Management 133 Overview 133 Organization in the (N)UMA Model 136 Overview 136 Data Structures 138 Page Tables 153 Data Structures 154 Creating and Manipulating Entries 161 Initialization of Memory Management 161 Data Structure Setup 162 Architecture-Specific Setup 169 Memory Management during the Boot Process 191 xii
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xiii Contents Management of Physical Memory 199 Structure of the Buddy System 199 Avoiding Fragmentation 201 Initializing the Zone and Node Data Structures 209 Allocator API 215 Reserving Pages 222 Freeing Pages 240 Allocation of Discontiguous Pages in the Kernel 244 Kernel Mappings 251 The Slab Allocator 256 Alternative Allocators 258 Memory Management in the Kernel 259 The Principle of Slab Allocation 261 Implementation 265 General Caches 283 Processor Cache and TLB Control 285 Summary 287 Chapter 4: Virtual Process Memory 289 Introduction 289 Virtual Process Address Space 290 Layout of the Process Address Space 290 Creating the Layout 294 Principle of Memory Mappings 297 Data Structures 298 Trees and Lists 299 Representation of Regions 300 The Priority Search Tree 302 Operations on Regions 306 Associating Virtual Addresses with a Region 306 Merging Regions 308 Inserting Regions 309 Creating Regions 310 Address Spaces 312 Memory Mappings 314 Creating Mappings 314 Removing Mappings 317 Nonlinear Mappings 319 Reverse Mapping 322 Data Structures 323 Creating a Reverse Mapping 324 Using Reverse Mapping 325 xiii
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xiv Contents Managing the Heap 327 Handling of Page Faults 330 Correction of Userspace Page Faults 336 Demand Allocation/Paging 337 Anonymous Pages 339 Copy on Write 340 Getting Nonlinear Mappings 341 Kernel Page Faults 341 Copying Data between Kernel and Userspace 344 Summary 345 Chapter 5: Locking and Interprocess Communication 347 Control Mechanisms 348 Race Conditions 348 Critical Sections 349 Kernel Locking Mechanisms 351 Atomic Operations on Integers 352 Spinlocks 354 Semaphores 355 The Read-Copy-Update Mechanism 357 Memory and Optimization Barriers 359 Reader/Writer Locks 361 The Big Kernel Lock 361 Mutexes 362 Approximate Per-CPU Counters 364 Lock Contention and Fine-Grained Locking 365 System V Interprocess Communication 366 System V Mechanisms 366 Semaphores 367 Message Queues 376 Shared Memory 380 Other IPC Mechanisms 381 Signals 381 Pipes and Sockets 389 Summary 390 Chapter 6: Device Drivers 391 I/O Architecture 391 Expansion Hardware 392 Access to Devices 397 Device Files 397 Character, Block, and Other Devices 397 xiv
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xv Contents Device Addressing Using Ioctls 400 Representation of Major and Minor Numbers 401 Registration 403 Association with the Filesystem 406 Device File Elements in Inodes 406 Standard File Operations 407 Standard Operations for Character Devices 407 Standard Operations for Block Devices 408 Character Device Operations 409 Representing Character Devices 409 Opening Device Files 409 Reading and Writing 412 Block Device Operations 412 Representation of Block Devices 413 Data Structures 415 Adding Disks and Partitions to the System 423 Opening Block Device Files 425 Request Structure 427 BIOs 430 Submitting Requests 432 I/O Scheduling 438 Implementation of Ioctls 441 Resource Reservation 442 Resource Management 442 I/O Memory 445 I/O Ports 446 Bus Systems 448 The Generic Driver Model 449 The PCI Bus 454 USB 463 Summary 471 Chapter 7: Modules 473 Overview 473 Using Modules 474 Adding and Removing 474 Dependencies 477 Querying Module Information 478 Automatic Loading 480 Inserting and Deleting Modules 483 Module Representation 483 Dependencies and References 488 xv
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xvi Contents Binary Structure of Modules 491 Inserting Modules 496 Removing Modules 505 Automation and Hotplugging 506 Automatic Loading with kmod 507 Hotplugging 508 Version Control 511 Checksum Methods 512 Version Control Functions 515 Summary 517 Chapter 8: The Virtual Filesystem 519 Filesystem Types 520 The Common File Model 521 Inodes 522 Links 522 Programming Interface 523 Files as a Universal Interface 524 Structure of the VFS 525 Structural Overview 525 Inodes 527 Process-Specific Information 532 File Operations 537 Directory Entry Cache 542 Working with VFS Objects 547 Filesystem Operations 548 File Operations 565 Standard Functions 572 Generic Read Routine 573 The fault Mechanism 576 Permission-Checking 578 Summary 581 Chapter 9: The Extended Filesystem Family 583 Introduction 583 Second Extended Filesystem 584 Physical Structure 585 Data Structures 592 xvi
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xvii Contents Creating a Filesystem 608 Filesystem Actions 610 Third Extended Filesystem 637 Concepts 638 Data Structures 639 Summary 642 Chapter 10: Filesystems without Persistent Storage 643 The proc Filesystem 644 Contents of /proc 644 Data Structures 652 Initialization 655 Mounting the Filesystem 657 Managing /proc Entries 660 Reading and Writing Information 664 Task-Related Information 666 System Control Mechanism 671 Simple Filesystems 680 Sequential Files 680 Writing Filesystems with Libfs 684 The Debug Filesystem 687 Pseudo Filesystems 689 Sysfs 689 Overview 690 Data Structures 690 Mounting the Filesystem 695 File and Directory Operations 697 Populating Sysfs 704 Summary 706 Chapter 11: Extended Attributes and Access Control Lists 707 Extended Attributes 707 Interface to the Virtual Filesystem 708 Implementation in Ext3 714 Implementation in Ext2 721 Access Control Lists 722 Generic Implementation 722 Implementation in Ext3 726 Implementation in Ext2 732 Summary 732 xvii
Mauerer ftoc.tex V4 - 09/03/2008 11:13pm Page xviii Contents Chapter 12: Networks 733 Linked Computers 734 ISO/OSI and TCP/IP Reference Model 734 Communication via Sockets 738 Creating a Socket 738 Using Sockets 740 Datagram Sockets 744 The Layer Model of Network Implementation 745 Networking Namespaces 747 Socket Buffers 749 Data Management Using Socket Buffers 750 Management Data of Socket Buffers 753 Network Access Layer 754 Representation of Network Devices 755 Receiving Packets 760 Sending Packets 768 Network Layer 768 IPv4 769 Receiving Packets 771 Local Delivery to the Transport Layer 772 Packet Forwarding 774 Sending Packets 775 Netfilter 778 IPv6 783 Transport Layer 785 UDP 785 TCP 787 Application Layer 799 Socket Data Structures 799 Sockets and Files 803 The socketcall System Call 804 Creating Sockets 805 Receiving Data 807 Sending Data 808 Networking from within the Kernel 808 Communication Functions 808 The Netlink Mechanism 810 Summary 817 xviii
Comments 0
Loading comments...
Reply to Comment
Edit Comment