Author: Jim Dowling
Get up to speed on a new unified approach to building machine learning (ML) systems with a feature store. Using this practical book, data scientists and ML engineers will learn in detail how to develop and operate batch, real-time, and agentic ML systems. Author Jim Dowling introduces fundamental principles and practices for developing, testing, and operating ML and AI systems at scale.
Hopsworks
Praise for Building Machine Learning Systems with a Feature Store

I witnessed the rise of feature stores at Uber, where ML-powered products operated on batch and real-time data. Jim Dowling helped define the category, and this book gives every engineer a practical playbook for shipping production-grade ML systems that matter.
—Vinoth Chandar, CEO and founder of Onehouse Inc.

This book shows how modern feature engineering is really done: with scalable, expressive tools at its core. It bridges the gap between research and production by demonstrating how DataFrame engines, feature stores, and ML pipelines can work together seamlessly. A must-read for anyone serious about building efficient, real-world ML systems.
—Ritchie Vink, inventor of Polars, CEO and founder of Polars Inc.

Nobody before has captured the essentials of building AI apps using modern data streaming systems like Flink. Jim’s book shows the way! Using only widely available open source technologies, this book provides the right blueprints for the job.
—Paris Carbone, ACM-awarded computer scientist and Apache Flink committer

It’s easy to be lost in quality metrics land and forget about the crucial systems aspect to ML. Jim does a great job explaining those aspects and gives a lot of practical tips on how to survive a long deployment.
—Hannes Mühleisen, cocreator of DuckDB
Building machine learning systems in production has historically involved a lot of black magic and undocumented learnings. Jim Dowling is doing a great service to ML practitioners by sharing the best practices and putting together a clear step-by-step guide.
—Erik Bernhardsson, CEO of Modal

In this crazy industry of ours, Jim’s the closest thing we have to a world-class expert. Read this book if you want a detailed, practical, reusable manual on how to get a good-quality running system—as an SRE, I especially appreciate his attention to observability and debugging. The detailed case studies are crunchy icing on a filling cake.
—Niall Murphy, O’Reilly author, cofounder and CEO at Stanza

It’s really excellent, the sort of material that isn’t taught anywhere.
—Liam Brannigan, data science educator

A must-read for AI/ML practitioners looking to match use cases to the right ML platforms and tools. The book strikes the right balance of breadth, depth, and historical context through comprehensive projects covering real-world ML architectures.
—Lalith Suresh, CEO of Feldera
Building Machine Learning Systems with a Feature Store
Batch, Real-Time, and LLM Systems

Jim Dowling
Building Machine Learning Systems with a Feature Store
by Jim Dowling

Copyright © 2026 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 141 Stony Circle, Suite 195, Santa Rosa, CA 95401.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Gary O’Brien
Production Editor: Clare Laylock
Copyeditor: nSight, Inc.
Proofreader: Doug McNair
Indexer: WordCo Indexing Services, Inc.
Cover Designer: Susan Brown
Cover Illustrator: José Marzan Jr.
Interior Designer: David Futato
Interior Illustrator: Kate Dullea
November 2025: First Edition

Revision History for the First Edition
2025-11-06: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098165239 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Building Machine Learning Systems with a Feature Store, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Hopsworks. See our statement of editorial independence.

978-1-098-16524-6

[LSI]
Preface

AI is a wide and deep field. If you’ve never trained a model, it can feel like you need a PhD just to begin. If you have trained a model, building a machine learning (ML) system can feel like you need to first become both a data engineer and a Kubernetes or cloud expert.

You may already have some experience in ML or AI. Maybe you trained a model on a static dataset. Or you may have learned about large language models (LLMs) by crafting a prompt that successfully accomplished a task. But to create real value from AI, you need to move from static datasets and static prompts to dynamic data and context engineering. When you train a model, you need a system that will make many predictions with it, not just predictions on the static dataset you downloaded. When you AI-enable an application, you don’t have to hardwire the same responses for all users. You can personalize the AI by providing fresh and relevant context information at request time.

ML and AI systems create the most value when they work with dynamic data. Pipelines are key to this. You need pipelines to transform the dynamic data from your data sources into a format that can be used for anything from training your model, to making predictions, to providing context information for your LLM. In this book, we will define ML systems as sequences of pipelines that transform data progressively from data sources until it is used as input to a model for training or inference (making predictions).

Pipelines enable us to lift the level of abstraction when describing an ML or AI system. What is the pipeline’s input and output? Does it create feature data from your data sources? Does it train a model from your feature data? Does it output predictions using the model you trained? Pipelines help us decompose our ML or AI system into modular components. We will see how the feature store, a data management platform for AI, enables the composition of pipelines into working ML or AI systems. You will also see that the journey to building pipelines for AI systems is similar to the journey to building pipelines for ML systems. Context engineering for agents follows many of the same principles as feature engineering for classical ML models.

This book is useful because it can help you build different types of ML and AI systems from scratch. A real-world ML system rarely processes a ready-made dataset and optimizes a clear metric. Instead, it often implements a messy process of identifying the right “prediction problem” to solve for available data sources; managing incremental, never-ending data flows; sometimes training or fine-tuning a model; and building a user interface so that stakeholders can get value from your model. Your ML system should also be well engineered, not a house of cards. It needs to be tested before it goes into production and monitored once in production. And you should follow software engineering best practices for automated testing and deployment. This book can help you attain the skills of a staff data scientist or lead ML engineer.

This book teaches you the skills needed to build three important classes of ML or AI systems:

Batch ML systems that make predictions on a schedule
Real-time ML systems that run 24/7 and make (personalized) predictions in response to requests

Agentic AI systems that work autonomously to solve a goal using LLMs and relevant context data

Why Did I Write This Book?

This book is the coursebook I would like to have had for ID2223, “Scalable Machine Learning and Deep Learning”, a course I developed and taught at KTH Royal Institute of Technology in Stockholm. KTH is the alma mater of the founders of important AI companies like Spotify, Lovable, Databricks, Modal, and Feldera (all of which are referenced in this book). My course was, to the best of my knowledge, the first university course that taught students to build complete and novel ML systems as part of their coursework. It was the result of my own nontraditional academic route of going wide (not just deep). I have published at top-tier conferences in the most important disciplines for building ML systems: AI (ICML, AAMAS), systems (USENIX, ACM Middleware), programming languages (ECOOP), and databases (SIGMOD, PVLDB). Building ML systems requires you to go wider, to leave your comfort zone. Hopefully, you will learn something new about data engineering, model training, agents, or MLOps for building ML systems.

By the end of my course, the students had built their own ML or AI system (after two to three weeks of work, in groups of two). Their ML system specification answered the following questions:
What unique data source (or sources) generates new data at some cadence?

What is the prediction problem you will solve with ML or AI using the data source(s)?

What is the UI (interactive or dashboard) for stakeholder(s) to generate value from your ML system?

How will you ensure the correctness and monitor the performance of your system?

Here are some examples of ML and AI systems built by students:

A water height prediction system that uses public measurements of water height along with weather forecast data

A system that predicts electricity demand using historical and projected demand data, as well as weather forecast data

A system that predicts public transport arrival times using historical data, weather forecast data, and real-time context data

A system that lets users ask questions about the course through a UI, by indexing the course’s PDFs with retrieval-augmented generation (RAG) pipelines and an LLM

Hopefully, after reading this book, you will be similarly inspired to build your own ML and AI systems.
Target Readers of This Book

This book is for data scientists, data engineers, software engineers, and software architects who love to build things and are interested in building ML or AI systems. If you are a data scientist who is tired of the constant refrain of productionizing your models but is not yet a Docker and Terraform expert, then this book is for you. If you are a data engineer who wonders what all the fuss is about AI, then this book is for you. ML engineers will also enjoy the exercises, which will help them refine their skills in ML system design, pipeline building, and offline and online testing. You should have some experience in Python and SQL to get the most out of the exercises.

If any of the following describe you, you’ll find this book valuable:

A data scientist who wants to build ML systems, not just train models

A data engineer who wants to learn about data modeling for AI as well as batch and real-time feature engineering

An AI engineer who wants to build agents that are fed with relevant context using pipelines

An ML engineer who wants to build scalable, reliable, and maintainable ML systems

A developer who wants to build ML systems, whether for a portfolio or for fun
What This Book Is Not

This book is not a traditional MLOps book that starts with experiment tracking and how to package and deploy software with containers and infrastructure as code. We do not discuss Docker, Terraform, or AWS CloudFormation; we don’t need them, as we assume support for automatic containerization of pipelines. We also don’t cover experiment tracking, due to our focus on ML systems over model training, the rise of AutoML (and the corresponding drop in the importance of hyperparameter tuning), and the fact that a model registry is all you need to store model evaluation results and support model governance.

Outline of the Book

The book is arranged into six logical parts, each consisting of a group of chapters. Each chapter stands on its own and has exercises to help deepen your understanding of the concepts and technologies introduced.

Part I introduces the feature, training, and inference (FTI) architecture and concludes with a case study. In Chapter 1, we describe the anatomy of an ML system, provide a whirlwind history of ML system architectures and MLOps, and introduce a unified architecture for building ML systems: FTI pipelines, connected by the feature store and model registry. Chapter 2 introduces the three main classes of ML pipeline: feature pipelines, training pipelines, and batch/online/agentic inference pipelines. It also introduces a development process for building AI systems and a taxonomy that helps you understand which class of data transformation should be performed in which FTI pipeline. In Chapter 3, you’ll build your first ML system. You’ll identify an air quality sensor near where you live and build an air quality forecasting system using ML, along with a dashboard. You will also query it with natural language using an LLM.

Part II introduces feature stores for ML and a real-time credit card fraud example that is covered throughout the book. In Chapter 4, we provide an overview of the main characteristics of a feature store, including the problems it solves: storing feature data for training and inference in feature groups, querying feature data using feature views, preventing offline/online skew by supporting the taxonomy of data transformations, and data modeling. In Chapter 5, we introduce the Hopsworks feature store, its multitenant project security model, and its APIs for reading and writing from ML pipelines using feature groups and feature views, as well as for running ML pipelines as jobs.

Part III is about data transformations for AI systems using frameworks such as Pandas, Polars, Apache Spark, Apache Flink, and Feldera. Chapter 6 describes data transformations for feature pipelines, including data validation with Great Expectations. Chapter 7 describes feature transformations for training and inference pipelines, including real-time transformations. Chapter 8 describes how to design and schedule batch feature pipelines. Chapter 9 describes how to design and operate streaming feature pipelines, including windowed aggregations and rolling aggregations.

Part IV is about training models. In Chapter 10, we start by describing how to build training datasets from a feature store and how to train a decision tree from time-series data. We then look at training models with unstructured data, including fine-tuning LLMs with low-rank adaptation (LoRA) and training PyTorch models with Ray. We also outline the scalability challenges in distributed training.

Part V is about making predictions in batch, real-time, and agentic AI systems. In Chapter 11, we look at batch inference and how to scale it with PySpark. We also look at real-time inference and deployment APIs, and at model serving using KServe, both with and without graphics processing units (GPUs), including vLLM for serving LLMs. In Chapter 12, we introduce agents and LLM workflows. We look at LlamaIndex, RAG, and protocols for using tools (like the Model Context Protocol [MCP]) and other agents (like Agent-to-Agent [A2A]). We also compare the agentic workflow with LLM workflows and introduce a development process for agents.

Part VI is about MLOps. In Chapter 13, we cover offline tests for AI systems, from unit tests for features (to enforce their contract), to ML pipeline integration tests, to blue/green tests for deployments, to evals for agents. We also cover governance and automatic containerization for ML pipelines. In Chapter 14, we cover observability for AI systems, built on logging/traces and metrics for models and agents. We look at how feature monitoring and model monitoring are built from logs, as well as evals from agent traces, and at how metrics help models meet service-level objectives through autoscaling. We conclude the book in Chapter 15 with a case study on how to build a personalized video recommender system, similar to TikTok’s, and the dirty dozen fallacies of MLOps.

The book is deliberately light on references compared with the academic articles I usually write. I hope the book will still guide you to deeper sources of information on the topics covered and give credit to all the technologies and ideas it builds on.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/featurestorebook/mlfs-book.

If you have a technical question or a problem using the code examples, please send an email to support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Building Machine Learning Systems with a Feature Store by Jim Dowling (O’Reilly). Copyright 2026 O’Reilly Media, Inc., 978-1-098-16524-6.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
O’Reilly Online Learning

NOTE
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
141 Stony Circle, Suite 195
Santa Rosa, CA 95401
800-889-8969 (in the United States or Canada)
707-827-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/buildingMLsys-feature-store.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media.

Watch us on YouTube: https://youtube.com/oreillymedia.

Acknowledgments

It takes a village to bring a book to life. First and foremost, I would like to thank the technical reviewers who helped polish my patchy prose: Liam Brannigan (Polars expert), Pier Paolo Ippolito, Paridhi Singh, Sanjay Shukla, Shubham Patel, and Pau Labarta Bajo. My thanks also go to many more members of the village: my colleagues at Hopsworks who helped review sections: Manu Joseph, Aleksey Veresov, Mikael Ronström, Aleksei Avstreikh, Raymond Cunningham, Javier de la Rua Martinez, and Kenneth Mak. My cofounders at Hopsworks: Fabio Buso, Ermias Gebremeskel, Robin Andersson, Salman Niazi, Mahmoud Ismail, and Prof. Seif Haridi. My colleague Lars Nordwall, who pressed me to get this over the line, and my board, who enable and help us achieve things: Sami Ahvenniemi, Caroline Wadstein, Timo Tirkkonen, and Artis Bisers. Our advisor Vinay Joosery, who taught us the art of bootstrapping. All those who have