Natural Language Processing in Action
SECOND EDITION
Hobson Lane
Maria Dyshel
MANNING
[Inside cover figure: Agentic chatbot architecture, showing user text messages (utterances) flowing through NLU components (grammar/regex parser, deep learning encoder, intent classifier, information extractor) and conversation manager logic, then through NLG components (template-based, retrieval-based, programmatic, and deep learning generators) to produce alternative chatbot responses and the final machine-generated response delivered at the human interface.]
Praise for the First Edition

Learn both the theory and practical skills needed to go beyond merely understanding the inner workings of NLP, and start creating your own algorithms or models.
—Dr. Arwen Griffioen, Zendesk, from the foreword to the first edition

Natural language processing unleashed—Go from novice to ninja!
—Parthasarathy C. Mandayam, senior technical lead at XECOM Information Technologies

A deep dive in natural language processing for human-machine cooperation.
—Simona Russo, technical director at Serendipity S.r.l.

Gives a thorough, in-depth look at natural language processing, starting from the basics, all the way up to state-of-the-art problems.
—Srdjan Santic, data science mentor at Springboard.com

An intuitive guide to start with natural language processing, which also covers deep learning techniques for NLP and real-world use cases. The book is full of many programming examples which help to learn the subject in a very pragmatic way.
—Tommaso Teofili, computer scientist at Adobe Systems

Natural Language Processing in Action provides a great overview of current NLP tools in Python. I’ll definitely be keeping this book on hand for my own NLP work. Highly recommended!
—Tony Mullen, associate professor at Northeastern University, Seattle
Natural Language Processing in Action
SECOND EDITION

HOBSON LANE
MARIA DYSHEL

MANNING
SHELTER ISLAND
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2025 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The authors and publisher have made every effort to ensure that the information in this book was correct at press time. The authors and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Development editor: Karen Miller
Technical editor: Kostas Passadis
Review editor: Dunja Nikitović
Production editor: Keri Hales
Copy editor: Christian Berk
Proofreader: Katie Tennant
Technical proofreaders: Mayur Patil, Kimberly Fessel
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

ISBN: 9781617299445
Printed in the United States of America
brief contents

PART 1 WORDY MACHINES: VECTOR MODELS OF NATURAL LANGUAGE
1 ■ Machines that read and write: A natural language processing overview
2 ■ Tokens of thought: Natural language words
3 ■ Math with words: Term frequency–inverse document frequency vectors
4 ■ Finding meaning in word counts: Semantic analysis

PART 2 DEEPER LEARNING: NEURAL NETWORKS
5 ■ Word brain: Neural networks
6 ■ Reasoning with word embeddings
7 ■ Finding kernels of knowledge in text with CNNs
8 ■ Reduce, reuse, and recycle your words: RNNs and LSTMs
PART 3 GETTING REAL: REAL-WORLD NLP APPLICATIONS
9 ■ Stackable deep learning: Transformers
10 ■ Large language models in the real world
11 ■ Information extraction and knowledge graphs
12 ■ Getting chatty with dialog engines
contents

preface
acknowledgments
about this book
about the authors
about the contributors
about the cover illustration

PART 1 WORDY MACHINES: VECTOR MODELS OF NATURAL LANGUAGE

1 Machines that read and write: A natural language processing overview
1.1 Programming languages vs. NLP
    Natural language understanding ■ Natural language generation ■ Plumbing it all together for positive-impact AI
1.2 The magic of natural language
    Language and thought ■ Machines that converse ■ The math
1.3 Applications
    Processing programming languages with NLP
1.4 Language through a computer’s “eyes”
    The language of locks ■ Regular expressions
1.5 Building a simple chatbot
    Keyword-based greeting recognizer ■ Pattern-based intent recognition ■ Another way to recognize greetings
1.6 A brief overflight of hyperspace
1.7 Word order and grammar
1.8 A chatbot natural language pipeline
1.9 Processing in depth
1.10 Natural language IQ
1.11 Test yourself

2 Tokens of thought: Natural language words
2.1 Tokens and tokenization
    Your tokenizer toolbox ■ The simplest tokenizer ■ Rule-based tokenization ■ SpaCy ■ Finding the fastest word tokenizer
2.2 Beyond word tokens
    WordPiece tokenizers
2.3 Improving your vocabulary
    Extending your vocabulary with n-grams ■ Normalizing your vocabulary
2.4 Challenging tokens: Processing logographic languages
    A complicated picture: Lemmatization and stemming in Chinese
2.5 Vectors of tokens
    One-hot vectors ■ Bag-of-words vectors ■ Why not bag of characters?
2.6 Sentiment
    VADER: A rule-based sentiment analyzer ■ Naive Bayes
2.7 Test yourself

3 Math with words: Term frequency–inverse document frequency vectors
3.1 Bag-of-words vectors
3.2 Vectorizing text
    DataFrame constructor ■ Faster, better, easier token counting ■ Vectorizing your code ■ Vector space TF–IDF (term frequency–inverse document frequency)
3.3 Vector distance and similarity
    Dot product
3.4 Counting TF–IDF frequencies
    Analyzing “this”
3.5 Zipf’s law
3.6 Inverse document frequency
    Return of Zipf ■ Relevance ranking ■ Smoothing out the math
3.7 Using TF–IDF for your bot
3.8 What’s next
3.9 Test yourself

4 Finding meaning in word counts: Semantic analysis
4.1 From word counts to topic scores
    The limitations of TF–IDF vectors and lemmatization ■ Topic vectors ■ Thought experiment ■ Algorithms for scoring topics
4.2 The challenge: Detecting toxicity
    Linear discriminant analysis classifier ■ Going beyond linear
4.3 Reducing dimensions
    Enter principal component analysis ■ Singular value decomposition
4.4 Latent semantic analysis
    Diving into semantics analysis ■ TruncatedSVD or PCA? ■ How well does LSA perform for toxicity detection? ■ Other ways to reduce dimensions
4.5 Latent Dirichlet allocation
    The LDiA idea ■ LDiA topic model for comments ■ Detecting toxicity with LDiA ■ A fairer comparison: 32 LDiA topics
4.6 Distance and similarity
4.7 Steering with feedback
4.8 Topic vector power
    Semantic search
4.9 Equipping your bot with semantic search
4.10 Test yourself
PART 2 DEEPER LEARNING: NEURAL NETWORKS

5 Word brain: Neural networks
5.1 Why neural networks?
    Neural networks for words ■ Neurons as feature engineers ■ Biological neurons ■ Perceptron ■ A Python perceptron
5.2 An example logistic neuron
    The logistics of clickbait ■ Sex education ■ Pronouns, gender, and sex ■ Sex logistics ■ A sleek, new PyTorch neuron
5.3 Skiing down the error slope
    Off the chair lift, onto the slope: Gradient descent and local minima ■ Shaking things up: Stochastic gradient descent
5.4 Test yourself

6 Reasoning with word embeddings
6.1 This is your brain on words
6.2 Applications
    Search for meaning ■ Combining word embeddings ■ Analogy questions ■ Word2Vec innovation ■ Artificial intelligence relies on embeddings
6.3 Word2Vec
    Analogy reasoning ■ Learning word embeddings ■ Learning meaning without a dictionary ■ Using the gensim.word2vec module ■ Generating your own word vector representations
6.4 Word2Vec alternatives
    GloVe ■ fastText ■ Word2Vec vs. LSA ■ Static vs. contextualized embeddings ■ Visualizing word relationships ■ Making connections ■ Unnatural words
6.5 Test yourself

7 Finding kernels of knowledge in text with CNNs
7.1 Patterns in sequences of words
7.2 Convolution
    Stencils for natural language text ■ A bit more stenciling ■ Correlation vs. convolution ■
    Convolution as a mapping function ■ Python convolution example ■ PyTorch 1D CNN on 4D embedding vectors ■ Natural examples
7.3 Morse code
    Decoding Morse with convolution
7.4 Building a CNN with PyTorch
    Clipping and padding ■ Better representation with word embeddings ■ Transfer learning ■ Robustifying your CNN with dropout
7.5 PyTorch CNN to process disaster toots
    Network architecture ■ Pooling ■ Linear layers ■ Getting fit ■ Hyperparameter tuning
7.6 Test yourself

8 Reduce, reuse, and recycle your words: RNNs and LSTMs
8.1 What are RNNs good for?
    RNN sequence handling ■ RNNs remember everything you tell them ■ RNNs hide their understanding ■ RNNs remember everything you tell them
8.2 Predicting nationality with only a last name
    Building an RNN from scratch ■ Training an RNN, one token at a time ■ Understanding the results ■ Multiclass classifiers vs. multi-label taggers
8.3 Backpropagation through time
    Initializing the hidden layer in an RNN
8.4 Remembering with recurrent networks
    Word-level language models ■ Gated recurrent units ■ Long short-term memory ■ Giving your RNN a tune-up
8.5 Predicting
8.6 Test yourself

PART 3 GETTING REAL: REAL-WORLD NLP APPLICATIONS

9 Stackable deep learning: Transformers
9.1 Recursion vs. recurrence
    Attention is not all you need ■ A LEGO set for language
9.2 Filling the attention gaps
    Positional encoding ■ Connecting all the pieces ■ Transformer translation
9.3 Bidirectional backpropagation and BERT
    Tokenization and pretraining ■ Fine-tuning ■ Implementation ■ Fine-tuning a pretrained BERT model for text classification
9.4 Test yourself

10 Large language models in the real world
10.1 Large language models
    Scaling up ■ Smarter, smaller LLMs ■ Semantic routing and guard rails ■ Red teaming
10.2 Generating words with your own LLM
    Creating your own generative LLM ■ Fine-tuning your generative model ■ Nonsense: Hallucination
10.3 Giving LLMs an IQ boost with search
    Searching for words: Full-text search ■ Searching for meaning: Semantic search ■ Scaling up your semantic search ■ Approximate nearest neighbor search ■ Choosing your index ■ Quantizing the math ■ Pulling it all together with Haystack ■ Getting real ■ A haystack of knowledge ■ Answering questions ■ Combining semantic search with text generation ■ Deploying your app in the cloud ■ Serve your users better ■ AI ethics vs. AI safety
10.4 Test yourself

11 Information extraction and knowledge graphs
11.1 Grounding
    Going old-fashioned: Information extraction with patterns
11.2 First things first: Segmenting your text into sentences
    Why won’t split('.!?') work? ■ Sentence segmentation with regular expressions ■ Sentence semantics
11.3 A knowledge extraction pipeline
11.4 Entity recognition
    Pattern-based entity recognition: Extracting GPS locations ■ Named entity recognition with spaCy
11.5 Coreference resolution
    Coreference resolution with spaCy ■ Entity name normalization
11.6 Dependency parsing
    Constituency parsing with benepar
11.7 From dependency parsing to relation extraction
    Pattern-based relation extraction ■ Neural relation extraction
11.8 Building your knowledge base
    A large knowledge graph
11.9 Finding answers in a knowledge graph
    From questions to queries
11.10 Test yourself

12 Getting chatty with dialog engines
12.1 Chatbots are everywhere
    Different chatbots, same tools ■ Conversation design ■ Your first conversation diagram ■ What makes a good conversation? ■ Making your chatbot a good listener: Implicit and explicit confirmations ■ Using GUI elements
12.2 Making sense of the user’s input: Natural language understanding
    Intent recognition ■ Multi-label classification
12.3 Generating a response
    Template-based approach ■ Conversation graphs ■ Storing your graph in a relational database ■ Scaling up the content: The search-based approach ■ Designing more complex logic: The programmatic approach
12.4 The generative approach
12.5 Chatbot frameworks
    Building an intent-based chatbot with Rasa ■ Adding LLMs to your chatbot with LangChain
12.6 Maintaining your chatbot’s design
12.7 Evaluating your chatbot
    Defining your chatbot’s performance metrics ■ Measuring NLU performance ■ Measuring user experience ■ What’s next?
12.8 Test yourself
appendix A Your NLP tools
appendix B Playful Python and regular expressions
appendix C Vectors and linear algebra
appendix D Machine learning tools and techniques
appendix E Deploying NLU containerized microservices
appendix F Glossary
notes
index
preface

A lot has changed in the world of NLP since the first edition. You probably couldn’t miss the release of BERT, GPT-3, Llama 3, and the wave of enthusiasm for ever larger large language models, such as ChatGPT. More subtly, while reviewing the first edition of this book at the San Diego Machine Learning group book club (https://github.com/SanDiegoMachineLearning/bookclub), we watched while PyTorch (https://github.com/pytorch/pytorch) and spaCy (https://spacy.io/) rose to prominence as the workhorses of NLP at even the biggest of big tech corporations. And the past few years have seen the rise of Phind, You.com, Papers With Code (http://paperswithcode.com; Meta AI Research maintains a repository of machine learning papers, code, datasets, and leaderboards), Wayback Machine (http://archive.today; The Internet Archive maintains the Wayback Machine, which houses petabytes of cached natural language content from web pages you wouldn’t have access to otherwise), arXiv.org (http://arxiv.org; Cornell University maintains arXiv for independent researchers to release prepublication academic research), and many smaller search engines powered by prosocial NLP algorithms. In addition, vector search databases were a niche product when we wrote the first edition; now they are the cornerstone of most NLP applications.

With this expansion and retooling of the NLP toolbox has come an explosion of opportunities for applying NLP to benefit society. NLP algorithms have become ingrained in the core business processes of big tech, startups, and small businesses alike. Luckily for you, big tech has myopically focused on digging deeper moats around
their monopolies, a business process called enshittification. This nearsightedness has left a green field of opportunity for you to build user-focused, prosocial NLP that can outcompete the enshittified NLP algorithms of big tech. Business models optimized for monopoly building have so thoroughly captivated users and captured regulators, business executives, and engineers that most are blind to the decline in profitability of those business models. If you learn how to build NLP systems that serve your needs, you will contribute to building a better world for everyone.

The unchecked growth in the power of algorithms to transform society is apparent to those able to escape the information bubble these algorithms capture us in. Authoritarian governments and tech businesses, both large and small, have utilized NLP algorithms to dramatically shift our collective will and values. The breakup of the EU, the insurrection in the US, and the global addiction to Like buttons are all being fueled by people employing natural language processing to propagate misinformation and suppress authentic voices.

In his book Human Compatible (Penguin Books, 2020), Stuart Russell estimates that out of approximately 100,000 researchers focused on advancing the power of AI, only about 20 are focused on trying to protect humanity from the powerful AI that is rapidly emerging. And even the social tragedies of the past decade have been insufficient to wake up the collective consciousness of AI researchers.

This may be due to social media and information retrieval tools insulating us from the inconvenient truth that the technology we are advancing is putting society into a collective trance. For example, Russell’s interviews and lectures on beneficial AI typically garner fewer than 20 likes per year on YouTube and X (formerly Twitter), whereas comparable videos by gung-ho AI researchers garner thousands of likes. Most AI researchers and the general public are seemingly ignorant of the algorithms chipping away at their access to truthful information and profound ideas.

So this second edition is a more strident call to arms for budding engineers not yet captured by algorithms. We few, we happy few. Our hope for the future is powered by two things: an idea and a skill. The idea is that we can outcompete those businesses and individuals that degrade the collective consciousness with NLP. You need only put your faith in the supercooperator habits your parents and teachers taught you. You can pass along those powerful habits and instincts to the NLP algorithms you build.

The second pillar of our hope is your skill. The expertise in NLP that you will gain from this book will ensure you can maintain that prosocial instinct by protecting yourself and those around you from manipulation and coercion. Hopefully, many of you will even achieve dramatic commercial success building on this idea with your toolbox of NLP skills. You will program and resist being programmed.

For this second edition, we have a new lead author, bringing a fresh perspective and a wealth of experience in the impact of prosocial algorithms. Maria Dyshel and I were sitting in Geisel Library collaborating with our fellow San Diegans at a Python
User Group meetup when we realized we had the same mission. Maria had just founded Tangible AI to harness the power of NLP for the social sector, and I was working with San Diego Machine Learning (SDML) friends to build a cognitive assistant called qary. She immediately saw how qary and the tools you’ll learn about here are such powerful forces for good.

In the rest of this book, she and I will show you how NLP can be used to help nonprofits and social-impact businesses in ways I’d never considered before that fateful encounter. You’ll find many new success stories of prosocial NLP in the real world within these pages. She’s teaching me conversation design (and appropriate emoji use). I’m teaching her how to build dialog engines and information retrieval systems. And we’re both showing businesses and nonprofits (and you) how to harness these tools for good. From authentic information retrieval and misinformation filtering to emotional support and companionship, chatbots and NLP may just save society from itself.

—Hobson Lane
acknowledgments

We deeply thank the contributing authors who created and sustained the collective intelligence that went into this book, often putting into words the ideas we could not. Hannes Hapke and Cole Howard were crucial in creating the first edition of this book and fostering our mutual learning and growth as NLP engineers.

When we set out to write the second edition, we were fortunate to tap into the collective intelligence of the San Diego Machine Learning community, and it amazed us how many people chose to generously give their time and mind to cocreate with us.

Brian Cox took on the daunting task of rewriting the entire vector and linear algebra appendix.

Geoffrey Marshall valiantly drafted all of chapter 9, which Hobson then mangled, trying to get it up to speed with PyTorch’s evolution—as with the other chapters, all bugs and mistakes are Hobson’s. Geoffrey’s writing discipline inspired us throughout the entire process of writing this book.

John Sundin enriched chapter 6 with network diagrams that connect sentences and concepts.

Ted Kye contributed paragraphs about byte pair encoding as well as subword tokenization.

Vishvesh Bhat contributed large parts of chapter 11 and continues to share his groundbreaking research into grounding LLMs at the startups he has cofounded.

Greg Thompson contributed his Rasa and turn.io knowledge to chapter 12 and wrote appendix E on containerization. If it weren’t for Greg, this book and our business would have faded into oblivion long ago.