Statistics: 5 views · 0 downloads · 0 donations
Uploader: 高宏飞 (shared on 2026-02-04)

Author: Trey Grainger, Doug Turnbull, Max Irwin

Apply cutting-edge machine learning techniques, from crowdsourced relevance and knowledge graph learning to Large Language Models (LLMs), to enhance the accuracy and relevance of your search results. Delivering effective search is one of the biggest challenges you can face as an engineer. AI-Powered Search is an in-depth guide to building intelligent search systems you can be proud of. It covers the critical tools you need to automate ongoing relevance improvements within your search applications. Inside you'll learn modern, data-science-driven search techniques like:

• Semantic search using dense vector embeddings from foundation models
• Retrieval augmented generation (RAG)
• Question answering and summarization combining search and LLMs
• Fine-tuning transformer-based LLMs
• Personalized search based on user signals and vector embeddings
• Collecting user behavioral signals and building signals-boosting models
• Semantic knowledge graphs for domain-specific learning
• Semantic query parsing, query-sense disambiguation, and query intent classification
• Implementing machine-learned ranking models (Learning to Rank)
• Building click models to automate machine-learned ranking
• Generative search, hybrid search, multimodal search, and the search frontier

AI-Powered Search will help you build the kind of highly intelligent search applications demanded by modern users. Whether you're enhancing your existing search engine or building from scratch, you'll learn how to deliver an AI-powered service that can continuously learn from every content update, user interaction, and the hidden semantic relationships in your content. You'll learn both how to enhance your AI systems with search and how to integrate large language models (LLMs) and other foundation models to massively accelerate the capabilities of your search technology.
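For a flavor of these techniques, here is a minimal hybrid search sketch (not code from the book): it fuses a naive keyword ranking with a toy dense-vector ranking using reciprocal rank fusion. The tiny corpus, the hash-based stand-in embeddings, and all function names are illustrative assumptions; a real system would use a search engine and a trained embedding model.

# Minimal sketch (not from the book): hybrid search that fuses a keyword
# ranking and a dense-vector ranking via reciprocal rank fusion (RRF).
import hashlib
import numpy as np

docs = ["apple iphone charger", "usb-c cable for android phone", "wireless phone charger pad"]

def keyword_scores(query: str) -> list[float]:
    # Naive lexical scoring: fraction of query terms appearing in each doc.
    terms = set(query.split())
    return [len(terms & set(d.split())) / len(terms) for d in docs]

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Stand-in "embedding": each term hashes to a deterministic random vector;
    # the text vector is the normalized average of its term vectors.
    vecs = []
    for term in text.split():
        seed = int.from_bytes(hashlib.md5(term.encode()).digest()[:4], "little")
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def vector_scores(query: str) -> list[float]:
    # Cosine similarity; vectors are unit-normalized, so a dot product suffices.
    q = embed(query)
    return [float(q @ embed(d)) for d in docs]

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
    fused: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda d: fused[d], reverse=True)

query = "phone charger"
ks, vs = keyword_scores(query), vector_scores(query)
by_keywords = sorted(range(len(docs)), key=lambda i: ks[i], reverse=True)
by_vectors = sorted(range(len(docs)), key=lambda i: vs[i], reverse=True)
print([docs[i] for i in rrf([by_keywords, by_vectors])])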

Tags: none
Publisher: Manning Publications
Publish Year: 2024
Language: English
Pages: 520
File Format: PDF
File Size: 6.7 MB
Support statistics: ¥0.00 · 0 times
Text Preview (First 20 pages)

AI-Powered Search
Trey Grainger, Doug Turnbull, Max Irwin
Foreword by Grant Ingersoll
Manning Publications, Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact:

Special Sales Department
Manning Publications Co.
20 Baldwin Road, PO Box 761
Shelter Island, NY 11964
Email: orders@manning.com

©2025 by Trey Grainger. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

The author and publisher have made every effort to ensure that the information in this book was correct at press time. The author and publisher do not assume and hereby disclaim any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause, or from any usage of the information herein.

Development editor: Marina Michaels
Technical development editor: John Guthrie
Review editors: Adriana Sabo and Dunja Nikitović
Production editor: Keri Hales
Copy editor: Andy Carroll
Proofreader: Mike Beady
Technical proofreaders: Alex Ott, Daniel Crouch
Typesetter and cover designer: Marija Tudor

ISBN 9781617296970
Printed in the United States of America
brief contents

Part 1: Modern search relevance
1 Introducing AI-powered search
2 Working with natural language
3 Ranking and content-based relevance
4 Crowdsourced relevance

Part 2: Learning domain-specific intent
5 Knowledge graph learning
6 Using context to learn domain-specific language
7 Interpreting query intent through semantic search

Part 3: Reflected intelligence
8 Signals-boosting models
9 Personalized search
10 Learning to rank for generalizable search relevance
11 Automating learning to rank with click models
12 Overcoming ranking bias through active learning

Part 4: The search frontier
13 Semantic search with dense vectors
14 Question answering with a fine-tuned large language model
15 Foundation models and emerging search paradigms
contents

foreword ■ preface ■ acknowledgments ■ about this book ■ about the authors ■ about the cover illustration

Part 1: Modern search relevance

1 Introducing AI-powered search
1.1 What is AI-powered search?
1.2 Understanding user intent
    What is a search engine? ■ What do recommendation engines offer? ■ The personalization spectrum between search and recommendations ■ Semantic search and knowledge graphs ■ Understanding the dimensions of user intent
1.3 How does AI-powered search work?
    The core search foundation ■ Reflected intelligence through feedback loops ■ Signals boosting, collaborative filtering, and learning to rank ■ Content and domain intelligence ■ Generative AI and retrieval augmented generation ■ Curated vs. black-box AI ■ Architecture for an AI-powered search engine

2 Working with natural language
2.1 The myth of unstructured data
    Types of unstructured data ■ Data types in traditional structured databases ■ Joins, fuzzy joins, and entity resolution in unstructured data
2.2 The structure of natural language
2.3 Distributional semantics and embeddings
2.4 Modeling domain-specific knowledge
2.5 Challenges in natural language understanding for search
    The challenge of ambiguity (polysemy) ■ The challenge of understanding context ■ The challenge of personalization ■ Challenges interpreting queries vs. documents ■ Challenges interpreting query intent
2.6 Content + signals: The fuel powering AI-powered search

3 Ranking and content-based relevance
3.1 Scoring query and document vectors with cosine similarity
    Mapping text to vectors ■ Calculating similarity between dense vector representations ■ Calculating similarity between sparse vector representations ■ Term frequency: Measuring how well documents match a term ■ Inverse document frequency: Measuring the importance of a term in the query ■ TF-IDF: A balanced weighting metric for text-based relevance
3.2 Controlling the relevance calculation
    BM25: The industry standard default text-similarity algorithm ■ Functions, functions, everywhere! ■ Choosing multiplicative vs. additive boosting for relevance functions ■ Differentiating matching (filtering) vs. ranking (scoring) of documents ■ Logical matching: Weighting the relationships between terms in a query ■ Separating concerns: Filtering vs. scoring
3.3 Implementing user and domain-specific relevance ranking

4 Crowdsourced relevance
4.1 Working with user signals
    Content vs. signals vs. models ■ Setting up our product and signals datasets (RetroTech) ■ Exploring the signals data ■ Modeling users, sessions, and requests
4.2 Introducing reflected intelligence
    What is reflected intelligence? ■ Popularized relevance through signals boosting ■ Personalized relevance through collaborative filtering ■ Generalized relevance through learning to rank ■ Other reflected intelligence models ■ Crowdsourcing from content

Part 2: Learning domain-specific intent

5 Knowledge graph learning
5.1 Working with knowledge graphs
5.2 Using our search engine as a knowledge graph
5.3 Automatically extracting knowledge graphs from content
    Extracting arbitrary relationships from text ■ Extracting hyponyms and hypernyms from text
5.4 Learning intent by traversing semantic knowledge graphs
    What is a semantic knowledge graph? ■ Indexing the datasets ■ Structure of an SKG ■ Calculating edge weights to measure the relatedness of nodes ■ Using SKGs for query expansion ■ Using SKGs for content-based recommendations ■ Using SKGs to model arbitrary relationships
5.5 Using knowledge graphs for semantic search

6 Using context to learn domain-specific language
6.1 Classifying query intent
6.2 Query-sense disambiguation
6.3 Learning related phrases from query signals
    Mining query logs for related queries ■ Finding related queries through product interactions
6.4 Phrase detection from user signals
    Treating queries as entities ■ Extracting entities from more complex queries
6.5 Misspellings and alternative representations
    Learning spelling corrections from documents ■ Learning spelling corrections from user signals
6.6 Pulling it all together

7 Interpreting query intent through semantic search
7.1 The mechanics of query interpretation
7.2 Indexing and searching on a local reviews dataset
7.3 An end-to-end semantic search example
7.4 Query interpretation pipelines
    Parsing a query for semantic search ■ Enriching a query for semantic search ■ Sparse lexical and expansion models ■ Transforming a query for semantic search ■ Searching with a semantically enhanced query

Part 3: Reflected intelligence

8 Signals-boosting models
8.1 Basic signals boosting
8.2 Normalizing signals
8.3 Fighting signal spam
    Using signal spam to manipulate search results ■ Combating signal spam through user-based filtering
8.4 Combining multiple signal types
8.5 Time decays and short-lived signals
    Handling time-insensitive signals ■ Handling time-sensitive signals
8.6 Index-time vs. query-time boosting: Balancing scale vs. flexibility
    Tradeoffs when using query-time boosting ■ Implementing index-time signals boosting ■ Tradeoffs when implementing index-time boosting

9 Personalized search
9.1 Personalized search vs. recommendations
    Personalized queries ■ User-guided recommendations
9.2 Recommendation algorithm approaches
    Content-based recommenders ■ Behavior-based recommenders ■ Multimodal recommenders
9.3 Implementing collaborative filtering
    Learning latent user and item features through matrix factorization ■ Implementing collaborative filtering with Alternating Least Squares ■ Personalizing search results with recommendation boosting
9.4 Personalizing search using content-based embeddings
    Generating content-based latent features ■ Implementing categorical guardrails for personalization ■ Integrating embedding-based personalization into search results
9.5 Challenges with personalizing search results

10 Learning to rank for generalizable search relevance
10.1 What is LTR?
    Moving beyond manual relevance tuning ■ Implementing LTR in the real world
10.2 Step 1: A judgment list, starting with the training data
10.3 Step 2: Feature logging and engineering
    Storing features in a modern search engine ■ Logging features from our search engine corpus
10.4 Step 3: Transforming LTR to a traditional machine learning problem
    SVMrank: Transforming ranking to binary classification ■ Transforming our LTR training task to binary classification
10.5 Step 4: Training (and testing!) the model
    Turning a separating hyperplane's vector into a scoring function ■ Taking the model for a test drive ■ Validating the model
10.6 Steps 5 and 6: Upload a model and search
    Deploying and using the LTR model ■ A note on LTR performance
10.7 Rinse and repeat

11 Automating learning to rank with click models
11.1 (Re)creating judgment lists from signals
    Generating implicit, probabilistic judgments from signals ■ Training an LTR model using probabilistic judgments ■ Click-Through Rate: Your first click model ■ Common biases in judgments
11.2 Overcoming position bias
    Defining position bias ■ Position bias in RetroTech data ■ Simplified dynamic Bayesian network: A click model that overcomes position bias
11.3 Handling confidence bias: Not upending your model due to a few lucky clicks
    The low-confidence problem in click data ■ Using a beta prior to model confidence probabilistically
11.4 Exploring your training data in an LTR system

12 Overcoming ranking bias through active learning
12.1 Our automated LTR engine in a few lines of code
    Turning clicks into training data (chapter 11 in one line of code) ■ Model training and evaluation in a few function calls
12.2 A/B testing a new model
    Taking a better model out for a test drive ■ Defining an A/B test in the context of automated LTR ■ Graduating the better model into an A/B test ■ When "good" models go bad: What we can learn from a failed A/B test
12.3 Overcoming presentation bias: Knowing when to explore vs. exploit
    Presentation bias in the RetroTech training data ■ Beyond the ad hoc: Thoughtfully exploring with a Gaussian process ■ Examining the outcome of our explorations
12.4 Exploit, explore, gather, rinse, repeat: A robust automated LTR loop

Part 4: The search frontier

13 Semantic search with dense vectors
13.1 Representation of meaning through embeddings
13.2 Search using dense vectors
    A brief refresher on sparse vectors ■ A conceptual dense vector search engine
13.3 Getting text embeddings by using a Transformer encoder
    What is a Transformer? ■ Openly available pretrained Transformer models
13.4 Applying Transformers to search
    Using the Stack Exchange outdoors dataset ■ Fine-tuning and the Semantic Text Similarity Benchmark ■ Introducing the SBERT Transformer library
13.5 Natural language autocomplete
    Getting noun and verb phrases for our nearest-neighbor vocabulary ■ Getting embeddings ■ ANN search ■ ANN index implementation
13.6 Semantic search with LLM embeddings
    Getting titles and their embeddings ■ Creating and searching the nearest-neighbor index
13.7 Quantization and representation learning for more efficient vector search
    Scalar quantization ■ Binary quantization ■ Product quantization ■ Matryoshka Representation Learning ■ Combining multiple vector search optimization approaches
13.8 Cross-encoders vs. bi-encoders

14 Question answering with a fine-tuned large language model
14.1 Question-answering overview
    How a question-answering model works ■ The retriever-reader pattern
14.2 Constructing a question-answering training dataset
    Gathering and cleaning a question-answering dataset ■ Creating the silver set: Automatically labeling data from a pretrained model ■ Human-in-the-loop training: Manually correcting the silver set to produce a golden set ■ Formatting the golden set for training, testing, and validation
14.3 Fine-tuning the question-answering model
    Tokenizing and shaping our labeled data ■ Configuring the model trainer ■ Performing training and evaluating loss ■ Holdout validation and confirmation
14.4 Building the reader with the new fine-tuned model
14.5 Incorporating the retriever: Using the question-answering model with the search engine
    Step 1: Querying the retriever ■ Step 2: Inferring answers from the reader model ■ Step 3: Reranking the answers ■ Step 4: Returning results by combining the retriever, reader, and reranker

15 Foundation models and emerging search paradigms
15.1 Understanding foundation models
    What qualifies as a foundation model? ■ Training vs. fine-tuning vs. prompting
15.2 Generative search
    Retrieval augmented generation ■ Results summarization using foundation models ■ Data generation using foundation models ■ Evaluating generative output ■ Constructing your own metric ■ Algorithmic prompt optimization
15.3 Multimodal search
    Common modes for multimodal search ■ Implementing multimodal search
15.4 Other emerging AI-powered search paradigms
    Conversational and contextual search ■ Agent-based search
15.5 Hybrid search
    Reciprocal rank fusion ■ Other hybrid search algorithms
15.6 Convergence of contextual technologies
15.7 All the above, please!

appendix A Running the code examples
appendix B Supported search engines and vector databases
index
foreword

For the past two decades, search has been at the heart of nearly every aspect of our technical existence as humans. Need to find a fact? Do a search. Want to try a new restaurant? Do a search. Need directions to that trailhead in the mountains for your weekend hike? Do a search. Yet, for many engineers, the underpinnings of how search works or goes beyond simple keyword matching to truly unlock what users need out of an information system is a mystery, left untaught in almost all computer science courses and bootcamps. Given this relative lack of instruction and the new golden age of AI, there is no better time for AI-Powered Search to make its mark on the world by teaching all of the core principles required for readers to unlock AI in any application.

At the heart of all search systems is the goal of doing just that: unlocking information to help users make better decisions that help them understand and navigate their world. This unlocking primarily takes place in four ways:

1. Combing through data, finding relevant pieces of information, and ranking and returning the most important bits for the user to synthesize
2. Summarizing data into smaller, more digestible forms for sharing and collaboration via visualizations and other abstractions
3. Relating data to other, ideally familiar, pieces of information and concepts
4. Feeding any of these three, along with other context from the user, into a large language model (LLM) for further synthesis, summarization, and insights, all while interacting and updating based on user feedback

In these same two decades that search has become ubiquitous in our lives at the consumer level, the engines powering this world, like Google, Elasticsearch, Apache Solr, and others, have evolved to tackle not only the retrieval and ranking part above, but also the other three challenges, and not just on text data, but on all forms of data.
Search engines have leaped forward to tackle these problems by deeply incorporating statistical analysis, machine learning, large language models, and natural language processing; in other words, integrating artificial intelligence techniques into every aspect of their core. And yet, despite their depth and breadth of capabilities, they are all too often overlooked as that thing that does "keyword search."

In AI-Powered Search, Trey, Doug, and Max have crafted a rich and thorough guide designed to take engineers through all aspects of building intelligent information systems using all means available: LLMs, domain-specific knowledge, knowledge bases and graphs, and finally, user- and crowdsourced signals. Examples in the book highlight key concepts in accessible, easy-to-understand ways.

As someone who has spent the better part of their career building, teaching, and promoting search as a means to help solve some of the most important challenges of our time, I've witnessed firsthand the A-ha! moments that launch engineers (after they push through the fuzziness inherent in dealing with messy, multimodal data) into lifelong careers working on one of the hardest and most interesting problems of our time. My hope in your reading this book is that you too will find endless fascination in the world of search. Happy searching!

—Grant Ingersoll, CEO & founder of Develomentor LLC, OpenSearch Leadership Committee
preface

Thanks for purchasing AI-Powered Search! This book will teach you the knowledge and skills you need to deliver highly intelligent search applications that can automatically learn from every content update and user interaction, delivering continuously more relevant search results.

There is no better time than now to learn how to implement AI-powered search. With the rise of generative AI, techniques like retrieval augmented generation (RAG) have arisen as the de facto way to ground AI systems with up-to-date and reliable data from which to drive responses. Yet the "R" in RAG is often the least-well-understood aspect of building such systems. This book provides a deep dive into how to do AI-powered information retrieval well, whether you're using it to power an AI system, building a traditional search application, or creating a novel application requiring intelligent ranking and matching.

Over my career, I've had the opportunity to dive deep into search relevance, semantic search, personalized search and recommendations, behavioral signals processing, semantic knowledge graphs, learning to rank, LLMs and other foundation models, dense vector search, and many other AI-powered search capabilities, publishing research in top journals and conferences and, more importantly, delivering working software at massive scale. As founder of Searchkernel and as Lucidworks' former chief algorithms officer and SVP of engineering, I've also helped deliver many of these capabilities to hundreds of the most innovative companies in the world to help them power search experiences you probably use every single day.

I'm thrilled to also have Doug Turnbull (Reddit, previously Shopify) and Max Irwin (Max.io, previously OpenSource Connections) as contributing authors on this book, pulling from their many years of hands-on experience helping companies and clients with search and relevance engineering.

In this book, we distill our many decades of combined experience into a practical guide to help you take your search applications to the next level. You'll discover how to enable your applications to continually learn to better understand your content, users, and domain in order to deliver optimally relevant experiences with each and every user interaction.

Best wishes as you begin putting AI-powered search into practice!

—Trey Grainger
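To make that "R" concrete, here is a minimal sketch (not code from the book; the function names are illustrative assumptions, and generate() stands in for a call to a real LLM) of how retrieval augmented generation grounds a generative model in retrieved passages:

# Minimal RAG sketch (illustrative only): retrieve top passages with any
# scoring function, then ground a generative model's prompt in them.
def retrieve(query: str, corpus: list[str], score_fn, top_k: int = 3) -> list[str]:
    # Rank the corpus by relevance to the query and keep the top passages.
    return sorted(corpus, key=lambda doc: score_fn(query, doc), reverse=True)[:top_k]

def build_prompt(query: str, passages: list[str]) -> str:
    # Assemble retrieved passages into a grounding context for the model.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def answer(query: str, corpus: list[str], score_fn, generate) -> str:
    # The quality of the final answer hinges on the retrieval step:
    # irrelevant passages lead the model toward ungrounded responses.
    return generate(build_prompt(query, retrieve(query, corpus, score_fn)))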
acknowledgments

First and foremost, I want to thank my wife, Lindsay, and my children Melodie, Tallie, and Olivia. You've supported me through all the long nights and weekends I've spent writing this book, and I couldn't have done it without you. I love you all!

Next, I'd like to thank Doug Turnbull and Max Irwin for their contributions to this book, and to the field of AI-powered search (and to Information Retrieval in general). Doug contributed most of chapters 10–12, and Max contributed most of chapters 13–14 and some of 15. I've learned a lot from both of you and your careers, and I'm grateful for the opportunity to work with you on this book.

Next, I'd like to acknowledge my development editor at Manning, Marina Michaels. Thank you for your encouragement and patience, especially as the timelines stretched on this massive undertaking due to me working at multiple startups during the course of this project. The quality of the book is in large part attributable to your experience and guidance.

Thanks as well to all the other folks at Manning who worked with me on development and promotion: John Guthrie on technical development, Ivan Martinović on early-access releases, Michael Stephens on overall vision and direction, and the entire Manning marketing team. I also thank the Manning production team for all their hard work in the formatting and typesetting of this book.

Special thanks to Grant Ingersoll for writing the foreword. I've learned a tremendous amount from you over the years, and I'm very grateful for your support.

I'd next like to thank the additional technical contributors to the book:

■ Daniel Crouch, for his thorough review of the book's manuscript, his extensive refactoring of the book's codebase, and his work to make the book mostly search-engine agnostic by integrating plug-and-play support for multiple popular search engines and vector databases
■ Alex Ott, for his many technical reviews of the book and for his many rounds of contributions to improve the book's codebase
■ Mohammed Korayem, PhD, for his collaboration and implementation of the algorithms for knowledge graph learning from user signals (chapter 6) and personalized search techniques leveraging embeddings (chapter 9)
■ Chao Han, PhD, for her collaboration on the design of the signals-based algorithms for domain-specific phrase detection and spelling correction

I'd also like to thank the many readers who provided feedback on the early-access versions of this book while it was in development. Your feedback made a significant impact on the quality of the book.

Finally, I'd also like to thank the reviewers who took their valuable time to read the manuscript at various stages during its development and who provided invaluable feedback: Abdul-Basit Hafeez, Adam Dudczak, Al Krinker, Alain Couniot, Alfonso Jesus Flores Alvarado, Austin Story, Bhagvan Kommadi, David Meza, Davide Cadamuro, Davide Fiorentino, Derek Hampton, Gaurav Mohan Tuli, George Seif, Håvard Wall, Ian Pointer, Ishan Khurana, John Kasiewicz, Keith Kim, Kim Falk Jorgensen, Maria Ana, Mark James Miller, Martin Beer, Matt Welke, Maxim Volgin, Milorad Imbra, Nick Rakochy, Pierluigi Riti, Richard Vaughan, Satej Kumar Sahu, Sen Xu, Sriram Macharla, Steve Rogers, Sumit Pal, Thomas Hauck, Tiklu Ganguly, Tony Holdroyd, Venkata Marrapu, Vidhya Vinay, Yudhiesh Ravindranath, and Zorodzayi Mukuya. Your suggestions helped make this a better book.

—Trey Grainger