The Hundred-Page Language Models Book: Hands-On with PyTorch (2025)

Author: Andriy Burkov


Large language models (LLMs) have fundamentally transformed how machines process and generate information. They are reshaping white-collar jobs at a pace comparable only to the revolutionary impact of personal computers. Understanding the mathematical foundations and inner workings of language models has become crucial for maintaining relevance and competitiveness in an increasingly automated workforce. This book guides you through the evolution of language models, starting from machine learning fundamentals. Rather than presenting transformers right away, which can feel overwhelming, we build an understanding of language models step by step—from simple count-based methods through recurrent neural networks to modern architectures. Each concept is grounded in clear mathematical foundations and illustrated with working Python code. In the largest chapter, on large language models, you'll learn both effective prompt engineering techniques and how to finetune these models to follow arbitrary instructions. Through hands-on experience, you'll master proven strategies for getting consistent outputs and adapting models to your needs.

📄 Text Preview (First 20 pages)


📄 Page 1
Andriy Burkov
THE HUNDRED-PAGE LANGUAGE MODELS BOOK
Hands-on with PyTorch
📄 Page 2
“Andriy's long-awaited sequel in his ‘The Hundred-Page’ series of machine learning textbooks is a masterpiece of concision.”
— Bob van Luijt, CEO and Co-Founder of Weaviate

“Andriy has this almost supernatural talent for shrinking epic AI concepts down to bite-sized, ‘Ah, now I get it!’ moments.”
— Jorge Torres, CEO at MindsDB

“Andriy paints for us, in 100 marvelous strokes, the journey from linear algebra basics to the implementation of transformers.”
— Florian Douetteau, Co-founder and CEO at Dataiku

“Andriy's book is an incredibly concise, clear, and accessible introduction to machine learning.”
— Andre Zayarni, Co-founder and CEO at Qdrant

“This is one of the most comprehensive yet concise handbooks out there for truly understanding how LLMs work under the hood.”
— Jerry Liu, Co-founder and CEO at LlamaIndex

Featuring a foreword by Tomas Mikolov and back cover text by Vint Cerf
📄 Page 3
The Hundred-Page Language Models Book
Andriy Burkov
📄 Page 4
Copyright © 2025 Andriy Burkov. All rights reserved.

1. Read First, Buy Later: You are welcome to freely read and share this book with others by preserving this copyright notice. However, if you find the book valuable or continue to use it, you must purchase your own copy. This ensures fairness and supports the author.
2. No Unauthorized Use: No part of this work—its text, structure, or derivatives—may be used to train artificial intelligence or machine learning models, nor to generate any content on websites, apps, or other services, without the author’s explicit written consent. This restriction applies to all forms of automated or algorithmic processing.
3. Permission Required: If you operate any website, app, or service and wish to use any portion of this work for the purposes mentioned above—or for any other use beyond personal reading—you must first obtain the author’s explicit written permission. No exceptions or implied licenses are granted.
4. Enforcement: Any violation of these terms is copyright infringement. It may be pursued legally in any jurisdiction. By reading or distributing this book, you agree to abide by these conditions.

ISBN 978-1-7780427-2-0
Publisher: True Positive Inc.
📄 Page 5
To my family, with love
📄 Page 6
“Language is the source of misunderstandings.” —Antoine de Saint-Exupéry, The Little Prince

“In mathematics you don't understand things. You just get used to them.” —John von Neumann

“Computers are useless. They can only give you answers.” —Pablo Picasso

The book is distributed on the “read first, buy later” principle.
📄 Page 7
Contents

Foreword 9
Preface 11
Who This Book Is For 11
What This Book Is Not 12
Book Structure 13
Should You Buy This Book? 14
Acknowledgements 15
Chapter 1. Machine Learning Basics 16
1.1. AI and Machine Learning 16
1.2. Model 19
1.3. Four-Step Machine Learning Process 28
1.4. Vector 28
1.5. Neural Network 32
1.6. Matrix 37
1.7. Gradient Descent 40
1.8. Automatic Differentiation 45
Chapter 2. Language Modeling Basics 50
2.1. Bag of Words 50
2.2. Word Embeddings 63
2.3. Byte-Pair Encoding 70
2.4. Language Model 75
2.5. Count-Based Language Model 77
2.6. Evaluating Language Models 84
Chapter 3. Recurrent Neural Network 98
3.1. Elman RNN 98
3.2. Mini-Batch Gradient Descent 100
3.3. Programming an RNN 101
3.4. RNN as a Language Model 104
📄 Page 8
3.5. Embedding Layer 105
3.6. Training an RNN Language Model 107
3.7. Dataset and DataLoader 111
3.8. Training Data and Loss Computation 113
Chapter 4. Transformer 117
4.1. Decoder Block 117
4.2. Self-Attention 119
4.3. Position-Wise Multilayer Perceptron 123
4.4. Rotary Position Embedding 124
4.5. Multi-Head Attention 131
4.6. Residual Connection 133
4.7. Root Mean Square Normalization 136
4.8. Key-Value Caching 138
4.9. Transformer in Python 139
Chapter 5. Large Language Model 147
5.1. Why Larger Is Better 147
5.2. Supervised Finetuning 154
5.3. Finetuning a Pretrained Model 156
5.4. Sampling From Language Models 171
5.5. Low-Rank Adaptation (LoRA) 176
5.6. LLM as a Classifier 180
5.7. Prompt Engineering 182
5.8. Hallucinations 188
5.9. LLMs, Copyright, and Ethics 191
Chapter 6. Further Reading 195
6.1. Mixture of Experts 195
6.2. Model Merging 195
6.3. Model Compression 196
📄 Page 9
6.4. Preference-Based Alignment 196
6.5. Advanced Reasoning 196
6.6. Language Model Security 197
6.7. Vision Language Model 197
6.8. Preventing Overfitting 198
6.9. Concluding Remarks 198
6.10. More From the Author 199
Index 201
📄 Page 10
Foreword

The first time I got involved in language modeling was already two decades ago. I wanted to improve some of my data compression algorithms and found out about n-gram statistics. Very simple concept, but so hard to beat! Then I quickly gained another motivation—since my childhood, I was interested in artificial intelligence. I had a vision of machines that would understand patterns in our world that are hidden from our limited minds. It would be so exciting to talk with such super-intelligence. And I realized that language modeling could be a way towards such AI.

I started searching for others sharing this vision and found the works of Solomonoff, Schmidhuber, and the Hutter Prize competition organized by Matt Mahoney. They all wrote about the AI-completeness of language modeling, and I knew I had to try to make it work. But the world was very different than it is today. Language modeling was considered a dead research direction, and I heard countless times that I should give up, as nothing would ever beat n-grams on large data.

I completed my master's thesis on neural language models, as these models were quite like what I had previously developed for data compression, and I believed that distributed representations, which could be applied to any language, were the right way to go. This infuriated a local linguist who declared my ideas to be total nonsense, since language modeling had to be addressed from the linguistics point of view and each language had to be treated differently.

However, I did not give up and continued working on my vision of AI-complete language models. Just the summer before starting my PhD, I came up with the idea to generate text from these neural models. I was amazed by how much better this text was than text generated from n-gram models. That was the summer of 2007, and I quickly realized the only person excited about this at the Brno University of Technology was actually me. But I did not give up anyway.

In the following years, I developed a number of algorithms to make neural language models more useful. To convince others of their qualities, I published the open-source toolkit RNNLM in 2010. It had the first-ever implementations of neural text generation, gradient clipping, dynamic evaluation, model adaptation (nowadays called fine-tuning), and other tricks such as hierarchical softmax or splitting infrequent words into subword units. However, the result
📄 Page 11
I was most proud of was when I could demonstrate in my PhD thesis that neural language models not only beat n-grams on large datasets—something widely considered to be impossible at the time—but that the improvements actually increased with the amount of training data. This happened for the first time after something like fifty years of language modeling research, and I still remember the disbelief on the faces of famous researchers when I showed them my work.

Fast forward some fifteen years, and I'm amazed by how much the world has changed. The mindset completely flipped—what used to be an obscure technology in a dead research direction is now thriving and gets the attention of the CEOs of the largest companies in the world. Language models are everywhere today. With all this hype, I think it is more important than ever to actually understand this technology.

Young students who want to learn about language modeling are flooded with information. Thus, I was delighted when I learned about Andriy's project to write a short book with only one hundred pages that would cover some of the most important ideas. I think the book is a good start for anyone new to language modeling who aspires to improve on the state of the art—and if someone tells you that everything that could have been invented in language modeling has already been discovered, don't believe it.

Tomas Mikolov, Senior Researcher at the Czech Institute of Informatics, Robotics and Cybernetics, author of word2vec and FastText
📄 Page 12
Preface

My interest in text began in the late 1990s during my teenage years, building dynamic websites using Perl and HTML. This early experience with coding and organizing text into structured formats sparked my fascination with how text could be processed and transformed. Over the years, I advanced to building web scrapers and text aggregators, developing systems to extract structured data from webpages. The challenge of processing and understanding text led me to explore more complex applications, including designing chatbots that could understand and address user needs.

The challenge of extracting meaning from words intrigued me. The complexity of the task only fueled my determination to “crack” it, using every tool at my disposal—ranging from regular expressions and scripting languages to text classifiers and named entity recognition models.

The rise of large language models (LLMs) transformed everything. For the first time, computers could converse with us fluently and follow verbal instructions with remarkable precision. However, like any tool, their immense power comes with limitations. Some are easy to spot, but others are more subtle, requiring deep expertise to handle properly. Attempting to build a skyscraper without fully understanding your tools will only result in a pile of concrete and steel. The same holds true for language models. Approaching large-scale text processing tasks or creating reliable products for paying users requires precision and knowledge—guesswork simply isn’t an option.

Who This Book Is For

I wrote this book for those who, like me, are captivated by the challenge of understanding language through machines. Language models are, at their core, just mathematical functions. However, their true potential isn’t fully appreciated in theory—you need to implement them to see their power and how their abilities grow as they scale. This is why I decided to make this book hands-on.

This book serves software developers, data scientists, machine learning engineers, and anyone curious about language models. Whether your goal is to integrate existing models into applications or to train your own, you’ll find practical guidance alongside theoretical foundations.
📄 Page 13
Given its hundred-page format, the book makes certain assumptions about readers. You should have programming experience, as all hands-on examples use Python.

While familiarity with PyTorch and tensors—PyTorch’s fundamental data types—is beneficial, it’s not mandatory. If you’re new to these tools, the book’s wiki (thelmbook.com/wiki) provides a concise introduction with examples and resource links for further learning. This wiki format ensures content remains current and addresses reader questions beyond publication.

College-level math knowledge helps, but you needn’t remember every detail or have machine learning experience. The book introduces concepts systematically, beginning with notations, definitions, and fundamental vector and matrix operations. From there, it progresses through simple neural networks to more advanced topics. Mathematical concepts are presented intuitively, with clear diagrams and examples that facilitate understanding.

What This Book Is Not

This book is focused on understanding and implementing language models. It will not cover:

• Large-scale training: This book won’t teach you how to train massive models on distributed systems or how to manage training infrastructure.
• Production deployment: Topics like model serving, API development, scaling for high traffic, monitoring, and cost optimization are not covered. The code examples focus on understanding the concepts rather than production readiness.
• Enterprise applications: This book won’t guide you through building commercial LLM applications, handling user data, or integrating with existing systems.

If you’re interested in learning the mathematical foundations of language models, understanding how they work, implementing core components yourself, or learning to work effectively with LLMs, this book is for you. But if you’re primarily looking to deploy models in production or build scalable applications, you may want to supplement this book with other resources.
📄 Page 14
Book Structure

To make this book engaging and to deepen the reader’s understanding, I decided to discuss language modeling as a whole, including approaches that are often overlooked in modern literature. While Transformer-based LLMs dominate the spotlight, earlier approaches like count-based methods and recurrent neural networks (RNNs) remain effective for some tasks.

Learning the math of the Transformer architecture may seem overwhelming for someone starting from scratch. By revisiting these foundational methods, my goal is to gradually build up the reader’s intuition and mathematical understanding, making the transition to modern Transformer architectures feel like a natural progression rather than an intimidating leap.

The book is divided into six chapters, progressing from fundamentals to advanced topics:

• Chapter 1 covers machine learning basics, including key concepts like AI, models, neural networks, and gradient descent. Even if you’re familiar with these topics, the chapter provides important foundations for understanding language models.
• Chapter 2 introduces language modeling fundamentals, exploring text representation methods like bag of words and word embeddings, as well as count-based language models and evaluation techniques.
• Chapter 3 focuses on recurrent neural networks, covering their implementation, training, and application as language models.
• Chapter 4 provides a detailed exploration of the Transformer architecture, including key components like self-attention, position embeddings, and practical implementation.
• Chapter 5 examines large language models (LLMs), discussing why scale matters, finetuning techniques, practical applications, and important considerations around hallucinations, copyright, and ethics.
• Chapter 6 concludes with further reading on advanced topics like mixture of experts, model compression, preference-based alignment, and vision language models, providing direction for continued learning.

Most chapters contain working code examples you can run and modify. While only essential code appears in the book, complete code is available as Jupyter notebooks on the book’s website, with notebooks referenced in relevant
📄 Page 15
sections. All code in notebooks remains compatible with the latest stable versions of Python, PyTorch, and other libraries.

The notebooks run on Google Colab, which at the time of writing offers free access to computing resources including GPUs and TPUs. These resources, though, aren’t guaranteed and have usage limits that may vary. Some examples might require extended GPU access, potentially involving wait times for availability. If the free tier proves limiting, Colab’s pay-as-you-go option lets you purchase compute credits for reliable GPU access. While these credits are relatively affordable by North American standards, costs may be significant depending on your location.

For those familiar with the Linux command line, GPU cloud services provide another option through pay-per-time virtual machines with one or more GPUs. The book’s wiki maintains current information on free and paid notebook or GPU rental services.

Verbatim terms and blocks indicate code, code fragments, or code execution outputs. Bold terms link to the book’s term index, and occasionally highlight algorithm steps.

In this book, we use pip3 to ensure the packages are installed for Python 3. On most modern systems, you can use pip instead if it's already set up for Python 3.

Should You Buy This Book?

Like my previous two books, this one is distributed on the read first, buy later principle. I firmly believe that paying for content before consuming it means buying a pig in a poke. At a dealership, you can see and try a car. In a department store, you can try on clothes. Similarly, you should be able to read a book before paying for it.

The read first, buy later principle means you can freely download the book, read it, and share it with friends and colleagues. If you find the book helpful or useful in your work, business, or studies—or if you simply enjoy reading it—then buy it.
📄 Page 16
Acknowledgements

The high quality of this book would be impossible without volunteer editors. I especially thank Erman Sert, Viet Hoang Tran Duong, Alex Sherstinsky, Kelvin Sundli, and Mladen Korunoski for their systematic contributions.

I am also grateful to Alireza Bayat Makou, Taras Shalaiko, Domenico Siciliani, Preethi Raju, Srikumar Sundareshwar, Mathieu Nayrolles, Abhijit Kumar, Giorgio Mantovani, Abhinav Jain, Steven Finkelstein, Ryan Gaughan, Ankita Guha, Harmanan Kohli, Daniel Gross, Kea Kohv, Marcus Oliveira, Tracey Mercier, Prabin Kumar Nayak, Saptarshi Datta, Gurgen R. Hayrapetyan, Sina Abdidizaji, Federico Raimondi Cominesi, Santos Salinas, Anshul Kumar, Arash Mirbagheri, Roman Stanek, Jeremy Nguyen, Efim Shuf, Pablo Llopis, Marco Celeri, Tiago Pedro, and Manoj Pillai for their help.

If this is your first time exploring language models, I envy you a little—it’s truly magical to discover how machines learn to understand the world through natural language.

I hope you enjoy reading this book as much as I enjoyed writing it.

Now grab your tea or coffee, and let’s begin!
📄 Page 17
Chapter 1. Machine Learning Basics

This chapter starts with a brief overview of how artificial intelligence has evolved, explains what a machine learning model is, and presents the four steps of the machine learning process. Then, it covers some math basics like vectors and matrices, introduces neural networks, and wraps up with optimization methods like gradient descent and automatic differentiation.

1.1. AI and Machine Learning

The term artificial intelligence (AI) was first introduced in 1955 during a workshop led by John McCarthy. Researchers at the workshop aimed to explore how machines could use language, form concepts, solve problems like humans, and improve over time.

1.1.1. Early Progress

The field’s first major breakthrough came in 1956 with the Logic Theorist. Created by Allen Newell, Herbert Simon, and Cliff Shaw, it was the first program engineered to perform automated reasoning and was later described as “the first artificial intelligence program.”

Frank Rosenblatt’s Perceptron (1958) was an early neural network designed to recognize patterns by adjusting its internal parameters based on examples. The Perceptron learned a decision boundary—a dividing line that separates examples of different classes (e.g., spam versus not spam).
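To make the decision boundary idea concrete, here is a minimal sketch of a perceptron-style learning rule on a made-up two-dimensional dataset; the data points, labels, and epoch count are illustrative assumptions, not the book's own example.

```python
# Minimal perceptron sketch on toy, made-up data (not from the book).
# Each example is a 2D feature vector with label +1 ("spam") or -1 ("not spam").
data = [
    ((2.0, 1.0), 1), ((3.0, 2.5), 1), ((2.5, 3.0), 1),
    ((0.5, 0.5), -1), ((1.0, 0.2), -1), ((0.2, 1.0), -1),
]

w = [0.0, 0.0]  # weights defining the decision boundary
b = 0.0         # bias term

for epoch in range(20):
    for (x1, x2), y in data:
        # Which side of the current boundary w·x + b = 0 is the point on?
        score = w[0] * x1 + w[1] * x2 + b
        pred = 1 if score >= 0 else -1
        if pred != y:
            # Misclassified: nudge the boundary toward the correct side
            w[0] += y * x1
            w[1] += y * x2
            b += y

print("learned boundary parameters:", w, b)
```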
📄 Page 18
Around the same time, in 1959, Arthur Samuel coined the term machine learning. In his paper, “Some Studies in Machine Learning Using the Game of Checkers,” he described machine learning as “programming computers to learn from experience.”

Another notable development of the mid-1960s was ELIZA. Developed in 1966 by Joseph Weizenbaum and considered the first chatbot in history, ELIZA gave the illusion of understanding language by matching patterns in users’ text and generating preprogrammed responses. Despite its simplicity, it illustrated the lure of building machines that could appear to think or understand.

Optimism about near-future breakthroughs ran high during this period. Herbert Simon, a future Turing Award recipient, exemplified this enthusiasm when he predicted in 1965 that “machines will be capable, within twenty years, of doing any work a man can do.” Many experts shared this optimism, forecasting that truly human-level AI—often called artificial general intelligence (AGI)—was just a few decades away. Interestingly, these predictions maintained a consistent pattern: decade after decade, AGI remained roughly 25 years on the horizon.
📄 Page 19
1.1.2. AI Winters

As researchers tried to deliver on early promises, they encountered unforeseen complexity. Numerous high-profile projects failed to meet ambitious goals. As a consequence, funding and enthusiasm waned significantly between 1975 and 1980, a period now known as the first AI winter.

During the first AI winter, even the term “AI” became somewhat taboo. Many researchers rebranded their work as “informatics,” “knowledge-based systems,” or “pattern recognition” to avoid association with AI’s perceived failures.

In the 1980s, a resurgence of interest in expert systems—rule-based software designed to replicate specialized human knowledge—promised to capture and automate domain expertise. These expert systems were part of a broader branch of AI research known as symbolic AI, often referred to as good old-fashioned AI (GOFAI), which had been a dominant approach since AI’s earliest days. GOFAI methods relied on explicitly coded rules and symbols to represent knowledge and logic, and while they worked well in narrowly defined areas, they struggled with scalability and adaptability.

From 1987 to 2000, AI entered its second winter, when the limitations of symbolic methods caused funding to diminish, once again leading to numerous research and development projects being put on hold or canceled.

Despite these setbacks, new techniques continued to evolve. In particular, decision trees, first introduced in 1963 by John Sonquist and James Morgan and then advanced by Ross Quinlan’s ID3 algorithm in 1986, split data into subsets through a tree-like structure. Each node in a tree represents a question about the data, each branch is an answer, and each leaf provides a prediction. While easy to interpret, decision trees were prone to overfitting, where they adapted too closely to training data, reducing their ability to perform well on new, unseen data.

1.1.3. The Modern Era

In the late 1990s and early 2000s, incremental improvements in hardware and the availability of larger datasets (thanks to the widespread use of the Internet) started to lift AI from its second winter. Leo Breiman’s random forest algorithm (2001) addressed overfitting in decision trees by creating multiple trees on random subsets of the data and then combining their outputs—dramatically improving predictive accuracy.
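As a small illustration of the tree structure described above (a question at each node, an answer on each branch, a prediction at each leaf), here is a hand-written sketch; the feature names and thresholds are invented for this example and are not taken from the book.

```python
# Hand-written decision tree sketch with made-up features and thresholds.
# Each `if` is a node's question, each branch is an answer, each return is a leaf.
def classify_email(email):
    if email["contains_winner"]:          # node: does the text contain "winner"?
        if email["num_links"] > 3:        # node: how many links does it contain?
            return "spam"                 # leaf
        return "not spam"                 # leaf
    if email["known_sender"]:             # node: is the sender in the address book?
        return "not spam"                 # leaf
    return "spam"                         # leaf

# Usage: classify one toy example
print(classify_email({"contains_winner": True, "num_links": 5, "known_sender": False}))
```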
📄 Page 20
Support vector machines (SVMs), introduced in 1992 by Vladimir Vapnik and his colleagues, were another significant step forward. SVMs identify the optimal hyperplane that separates data points of different classes with the widest margin. The introduction of kernel methods allowed SVMs to manage complex, non-linear patterns by mapping data into higher-dimensional spaces, making it easier to find a suitable separating hyperplane. These innovations placed SVMs at the center of machine learning research in the early 2000s.

A turning point arrived around 2012, when more advanced versions of neural networks called deep neural networks began outperforming other techniques in fields like speech and image recognition. Unlike the simple Perceptron, which used only a single “layer” of learnable parameters, this deep learning approach stacked multiple layers to tackle much more complex problems. Surging computational power, abundant data, and algorithmic advancements converged to produce remarkable breakthroughs. As academic and commercial interest soared, so did AI’s visibility and funding.

Today, AI and machine learning remain intimately entwined. Research and industry efforts continue to seek ever more capable models that learn complex tasks from data. Although predictions of achieving human-level AI “in just 25 years” have consistently failed to materialize, AI’s impact on everyday applications is undeniable.

Throughout this book, AI refers broadly to techniques that enable machines to solve problems once considered solvable only by humans, with machine learning being its key subfield focused on creating algorithms that learn from collections of examples. These examples can come from nature, be designed by humans, or be generated by other algorithms. The process involves gathering a dataset and building a model from it, which is then used to solve a problem.

I will use “learning” and “machine learning” interchangeably to save keystrokes.

Let’s examine what exactly we mean by a model and how it forms the foundation of machine learning.

1.2. Model

A model is typically represented by a mathematical equation:

y = f(x)
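As a minimal illustration of a model as a function y = f(x), here is a sketch in which f is a simple linear function; the parameter values w and b are made up for illustration, whereas in machine learning such parameters are found from data.

```python
# A model as a function y = f(x): a linear f with made-up parameters.
w, b = 2.0, 0.5  # hypothetical parameter values; in practice they are learned from data

def f(x):
    return w * x + b  # the model's prediction y for input x

print(f(3.0))  # 6.5
```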