Author: Julien Pivotto, Brian Brazil

Get up to speed with Prometheus, the metrics-based monitoring system used in production by tens of thousands of organizations. This updated second edition provides site reliability engineers, Kubernetes administrators, and software developers with a hands-on introduction to the most important aspects of Prometheus, including dashboarding and alerting, direct code instrumentation, and metric collection from third-party systems with exporters.

Publisher: O'Reilly Media, Inc.
Publish Year: 2023
Language: English
File Format: PDF
File Size: 7.7 MB
Text Preview (First 20 pages)

Prometheus: Up & Running
Infrastructure and Application Performance Monitoring
SECOND EDITION
Julien Pivotto and Brian Brazil

Prometheus: Up & Running, Second Edition
by Julien Pivotto and Brian Brazil

Copyright © 2023 Julien Pivotto. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: John Devins
Development Editor: Rita Fernando
Production Editor: Ashley Stussy
Copyeditor: Kim Cofer
Proofreader: Sonia Saruba
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2018: First Edition
April 2023: Second Edition

Revision History for the Second Edition
2023-04-04: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098131142 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Prometheus: Up & Running, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-13114-2
[LSI]

Preface

This book describes in detail how to use the Prometheus monitoring system to monitor, graph, and alert on the performance of your applications and infrastructure. This book is intended for application developers, system administrators, and everyone in between.

Expanding the Known

When it comes to monitoring, knowing that the systems you care about are turned on is important, but that’s not where the real value is. The big wins are in understanding the performance of your systems. By performance we don’t only mean the response time of and CPU used by each request, but the broader meaning of performance. How many requests to the database are required for each customer order that is processed? Is it time to purchase higher throughput networking equipment? How many machines are your cache misses costing? Are enough of your users interacting with a complex feature in order to justify its continued existence?

These are the sorts of questions that a metrics-based monitoring system can help you answer, and beyond that help you dig into why the answer is what it is. We see monitoring as getting insight from throughout your system, from high-level overviews down to the nitty-gritty details that are useful for debugging. A full set of monitoring tools for debugging and analysis includes not only metrics, but also logs, traces, and profiling; but metrics should be your first port of call when you want to answer systems-level questions. Prometheus encourages you to have instrumentation liberally spread across your systems, from applications all the way down to the bare metal. With instrumentation you can observe how all your subsystems and components are interacting, and convert unknowns into knowns.

The Evolution of Prometheus

As Prometheus has crossed the 10-year mark, this second edition brings new developments across all sections. Prometheus has continued to evolve and expand, offering even more options for scraping, storing, and querying data. This progress is a result of the dedicated community of users and contributors who use Prometheus across a wide and growing range of industries and applications.

The second edition of this book provides coverage of the many new PromQL functions, service discovery providers, and Alertmanager receivers that have been added since the first edition. A new dedicated chapter covers server-side security features, such as TLS, that have been added to Prometheus and some of the exporters.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, configuration files, etc.) is available for download at https://github.com/prometheus-up-and-running-2e/examples.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Prometheus: Up & Running, Second Edition by Julien Pivotto and Brian Brazil (O’Reilly). Copyright 2023 Julien Pivotto, 978-1-098-13114-2.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

NOTE
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/prometheus-up-running-2e.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

This book would not have been possible without all the work of the Prometheus team, and the hundreds of contributors to Prometheus and its ecosystem. A special thanks to Julius Volz, Richard Hartmann, Carl Bergquist, Andrew McMillan, and Greg Stark for providing feedback on initial drafts of the first revision of this book. Thanks to Brian Brazil, Bartłomiej Płotka, Carl Bergquist, TJ Hoplock, and Richard Hartmann for their feedback on the second edition.

Part I. Introduction

This section will introduce you to monitoring in general, and Prometheus more specifically. In Chapter 1 you will learn about the many different meanings of monitoring and approaches to it, the metrics approach that Prometheus takes, and the architecture of Prometheus. In Chapter 2 you will get your hands dirty running a simple Prometheus setup that scrapes machine metrics, evaluates queries, and sends alert notifications.

Chapter 1. What Is Prometheus?

Prometheus is an open source, metrics-based monitoring system. Of course, Prometheus is far from the only one of those out there, so what makes it notable?

Prometheus does one thing and it does it well. It has a simple yet powerful data model and a query language that lets you analyze how your applications and infrastructure are performing. It does not try to solve problems outside of the metrics space, leaving those to other more appropriate tools.

Since its beginnings with no more than a handful of developers working in SoundCloud in 2012, a community and ecosystem has grown around Prometheus. Prometheus is primarily written in Go and licensed under the Apache 2.0 license. There are hundreds of people who have contributed to the project itself, which is not controlled by any one company. It is always hard to tell how many users an open source project has, but we estimate that as of 2022, hundreds of thousands of organizations are using Prometheus in production. In 2016 the Prometheus project became the second member1 of the Cloud Native Computing Foundation (CNCF).

For instrumenting your own code, there are client libraries in all the popular languages and runtimes, including Go, Java/JVM, C#/.Net, Python, Ruby, Node.js, Haskell, Erlang, and Rust. Many popular applications are already instrumented with Prometheus client libraries, like Kubernetes, Docker, Envoy, and Vault. For third-party software that exposes metrics in a non-Prometheus format, there are hundreds of integrations available. These are called exporters, and include HAProxy, MySQL, PostgreSQL, Redis, JMX, SNMP, Consul, and Kafka. A friend of Brian’s even added support for monitoring Minecraft servers, as he cares a lot about his frames per second.

A simple text format2 makes it easy to expose metrics to Prometheus. Other monitoring systems, both open source and commercial, have added support for this format. This allows all of these monitoring systems to focus more on core features, rather than each having to spend time duplicating effort to support every single piece of software a user like you may wish to monitor.

The data model identifies each time series not just with a name, but also with an unordered set of key-value pairs called labels. The PromQL query language allows aggregation across any of these labels, so you can analyze not just per process but also per datacenter and per service or by any other labels that you have defined. These can be graphed in dashboard systems such as Grafana and Perses.

Alerts can be defined using the exact same PromQL query language that you use for graphing. If you can graph it, you can alert on it. Labels make maintaining alerts easier, as you can create a single alert covering all possible label values. In some other monitoring systems you would have to individually create an alert per machine/application. Relatedly, service discovery can automatically determine what applications and machines should be scraped from sources such as Kubernetes, Consul, Amazon Elastic Compute Cloud (EC2), Azure, Google Compute Engine (GCE), and OpenStack.

For all these features and benefits, Prometheus is efficient and simple to run. A single Prometheus server can ingest millions of samples per second. It is a single, statically linked binary with a configuration file. All components of Prometheus can be run in containers, and they avoid doing anything fancy that would get in the way of configuration management tools. It is designed to be integrated into the infrastructure you already have and built on top of, not to be a management platform itself.

Now that you have an overview of what Prometheus is, let’s step back for a minute and look at what is meant by “monitoring” in order to provide some context. Following that, we will look at what the main components of Prometheus are, and what Prometheus is not.

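Before moving on, here is a minimal sketch of how the label-based data model, PromQL, and alerting fit together, written in the YAML rule-file format that Prometheus loads. The metric name http_requests_total, its code label, and the 5% threshold are illustrative assumptions for this sketch, not examples taken from the book; the expr field is ordinary PromQL, the same expression you could graph.

```yaml
groups:
  - name: example-rules
    rules:
      - alert: HighErrorRate
        # Hypothetical metric and threshold: the fraction of 5xx responses,
        # aggregated with "sum by (job)" across every instance sharing a job label.
        expr: |
          sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests to {{ $labels.job }} are failing"
```

Because the expression aggregates by the job label, this single rule covers every machine and instance that shares that label, which is the “single alert covering all possible label values” point made above.
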
What Is Monitoring?

In secondary school, one of Brian’s teachers told him that if you were to ask 10 economists what economics means, you’d get 11 answers. Monitoring has a similar lack of consensus as to what exactly it means. When he tells others what he does, people think his job entails everything from keeping an eye on temperature in factories, to employee monitoring where he is the one to find out who is accessing Facebook during working hours, and even detecting intruders on networks.

Prometheus wasn’t built to do any of those things.3 It was built to aid software developers and administrators in the operation of production computer systems, such as the applications, tools, databases, and networks backing popular websites.

So what is monitoring in that context? Let’s narrow this sort of operational monitoring of computer systems down to four things:

Alerting
Knowing when things are going wrong is usually the most important thing that you want monitoring for. You want the monitoring system to call in a human to take a look.

Debugging
Now that you have called in a human, they need to investigate to determine the root cause and ultimately resolve whatever the issue is.

Trending
Alerting and debugging usually happen on timescales on the order of minutes to hours. While less urgent, the ability to see how your systems are being used and changing over time is also useful. Trending can feed into design decisions and processes such as capacity planning.

Plumbing
When all you have is a hammer, everything starts to look like a nail. At the end of the day, all monitoring systems are data processing pipelines. Sometimes it is more convenient to appropriate part of your monitoring system for another purpose, rather than building a bespoke solution. This is not strictly monitoring, but it is common in practice so we like to include it.

Depending on who you talk to and their background, they may consider only some of these to be monitoring. This leads to many discussions about monitoring going around in circles, leaving everyone frustrated. To help you understand where others are coming from, we’re going to look at a small bit of the history of monitoring.

A Brief and Incomplete History of Monitoring

Monitoring has seen a shift toward tools including Prometheus in the past few years. For a long time, the dominant solution has been some combination of Nagios and Graphite or their variants. When we say Nagios, we are including any software within the same broad family, such as Icinga, Zmon, and Sensu. They work primarily by regularly executing scripts called checks. If a check fails by returning a nonzero exit code, an alert is generated. Nagios was initially started by Ethan Galstad in 1996 as an MS-DOS application used to perform pings. It was first released as NetSaint in 1999, and renamed Nagios in 2002.

To talk about the history of Graphite, we need to go back to 1994. Tobias Oetiker created a Perl script that became Multi Router Traffic Grapher, or MRTG 1.0, in 1995. As the name indicates, it was mainly used for network monitoring via the Simple Network Management Protocol (SNMP). It could also obtain metrics by executing scripts.4 The year 1997 brought big changes with a move of some code to C, and the creation of the Round Robin Database (RRD), which was used to store metric data. This brought notable performance improvements, and RRD was the basis for other tools, including Smokeping and Graphite.

Started in 2006, Graphite uses Whisper for metrics storage, which has a similar design to RRD. Graphite does not collect data itself, rather it is sent in by collection tools such as collectd and StatsD, which were created in 2005 and 2010, respectively.

The key takeaway here is that graphing and alerting were once completely separate concerns performed by different tools. You could write a check script to evaluate a query in Graphite and generate alerts on that basis, but most checks tended to be on unexpected states such as a process not running.

Another holdover from this era is the relatively manual approach to administering computer services. Services were deployed on individual machines and lovingly cared for by system administrators. Alerts that might potentially indicate a problem were jumped upon by devoted engineers. As cloud and cloud native technologies such as EC2, Docker, and Kubernetes have come to prominence, treating individual machines and services like pets with each getting individual attention does not scale. Rather, they tend to be looked at more as cattle and administered and monitored as a group. In the same way that the industry has moved from doing management by hand, to tools like Chef and Ansible, to now starting to use technologies like Kubernetes, monitoring also needs to make a similar transition. This means moving from checks on individual processes on individual machines to monitoring based on service health as a whole.

Moving to a more recent time, OpenTelemetry was born from two other open source projects, OpenCensus and OpenTracing. OTel5 is a specification and a set of components that aim to offer built-in telemetry for projects. Its metrics component is compatible with Prometheus with the addition of the OpenTelemetry collector,6 which is responsible for collecting and providing metrics to your Prometheus server.

You may have noticed that we didn’t mention logging, tracing, and profiling. Historically, logs have been used as something that you use tail, grep, and awk on by hand. You might have had an analysis tool such as AWStats to produce reports hourly or daily. In more recent years, logs have also been used as a significant part of monitoring, such as with the Elasticsearch, Logstash, and Kibana (ELK) and OpenSearch stack. Tracing and profiling are generally done with their own software stack: Zipkin and Jaeger are made for tracing, while Parca and Pyroscope deal with continuous profiling.

Now that we have looked a bit at graphing and alerting, let’s look at how metrics and logs fit into the landscape. Are there more categories of monitoring than those two?

Categories of Monitoring

At the end of the day, most monitoring is about the same thing: events. Events can be almost anything, including:

- Receiving an HTTP request
- Sending an HTTP 400 response
- Entering a function
- Reaching the else of an if statement
- Leaving a function
- A user logging in
- Writing data to disk
- Reading data from the network
- Requesting more memory from the kernel

All events also have context. An HTTP request will have the IP address it is coming from and going to, the URL being requested, the cookies that are set, and the user who made the request. An HTTP response will have how long the response took, the HTTP status code, and the length of the response body. Events involving functions have the call stack of the functions above them, and whatever triggered this part of the stack, such as an HTTP request.

Having all the context for all the events would be great for debugging and understanding how your systems are performing in both technical and business terms, but that amount of data is not practical to process and store. Thus, we see roughly four ways to approach reducing that volume of data to something workable, namely profiling, tracing, logging, and metrics.

Profiling

Profiling takes the approach that you can’t have all the context for all of the events all of the time, but you can have some of the context for limited periods of time.

Tcpdump is one example of a profiling tool. It allows you to record network traffic based on a specified filter. It’s an essential debugging tool, but you can’t really turn it on all the time as you will run out of disk space.

Debug builds of binaries that track profiling data are another example. They provide a plethora of useful information, but the performance impact of gathering all that information, such as timings of every function call, means that it is not generally practical to run it in production on an ongoing basis.

In the Linux kernel, enhanced Berkeley Packet Filters (eBPF) allow detailed profiling of kernel events from filesystem operations to network oddities. These provide access to a level of insight that was not generally available previously. eBPF comes with other advantages, such as portability and safety. We’d recommend reading Brendan Gregg’s writings on the subject.

Profiling is largely for tactical debugging. If it is being used on a longer-term basis, then the data volume must be cut down in order to fit into one of the other categories of monitoring, or you’d need to move to continuous profiling, which enables the collection over longer runs. What’s new with continuous profiling is that in order to cut down the data volume and keep a relatively low overhead, it reduces the profiling frequency. One of the emerging continuous profiling tools, the eBPF-based Parca Agent, uses a 19Hz frequency.7 As a consequence, it tries to get statistically significant data over minutes rather than seconds, while still providing the data required to understand how the CPU time is spent in an infrastructure, and helping to improve application efficiency where it’s needed.

Tracing

Tracing doesn’t typically look at all events, rather it takes some proportion of events such as one in a hundred that pass through some functions of interest. Tracing will note the functions in the stack trace of the points of interest, and often also how long each of these functions took to execute. From this you can get an idea of where your program is spending time and which code paths are most contributing to latency.

Rather than doing snapshots of stack traces at points of interest, some tracing systems trace and record timings of every function call below the function of interest. For example, one in a hundred user HTTP requests might be sampled, and for those requests you could see how much time was spent talking to backends such as databases and caches. This allows you to see how timings differ based on factors like cache hits versus cache misses.

Distributed tracing takes this a step further. It makes tracing work across processes by attaching unique IDs to requests that are passed from one process to another in remote procedure calls (RPCs) in addition to whether this request is one that should be traced. The traces from different processes and machines can be stitched back together based on the request ID. This is a vital tool for debugging distributed microservices architectures. Technologies in this space include OpenZipkin and Jaeger. For tracing, it is the sampling that keeps the data volumes and instrumentation performance impact within reason.

Logging

Logging looks at a limited set of events and records some of the context for each of these events. For example, it may look at all incoming HTTP requests, or all outgoing database calls. To avoid consuming too many resources, as a rule of thumb you are limited to somewhere around a hundred fields per log entry. Beyond that, bandwidth and storage space tend to become a concern. For example, for a server handling 1,000 requests per second, a log entry with 100 fields each taking 10 bytes works out as 1 megabyte per second. That’s a nontrivial proportion of a 100 Mbit network card, and 84 GB of storage per day just for logging.

A big benefit of logging is that there is (usually) no sampling of events, so even though there is a limit on the number of fields, it is practical to determine how slow requests are affecting one particular user talking to one particular API endpoint.

Just as monitoring means different things to different people, logging also means different things depending on who you ask, which can cause confusion. Different types of logging have different uses, durability, and retention requirements. As we see it, there are four general and somewhat overlapping categories:

Transaction logs
These are the critical business records that you must keep safe at all costs, likely forever. Anything touching on money or that is used for critical user-facing features tends to be in this category.

Request logs
If you are tracking every HTTP request, or every database call, that’s a request log. They may be processed in order to implement user-facing features, or just for internal optimizations. You don’t generally want to lose them, but it’s not the end of the world if some of them go missing.

Application logs
Not all logs are about requests; some are about the process itself. Startup messages, background maintenance tasks, and other process-level log lines are typical. These logs are often read directly by a human, so you should try to avoid having more than a few per minute in normal operations.

Debug logs
Debug logs tend to be very detailed and thus expensive to create and store. They are often only used in very narrow debugging situations, and are trending toward profiling due to their data volume. Reliability and retention requirements tend to be low, and debug logs may not even leave the machine they are generated on.

Treating the differing types of logs all in the same way can put you in the worst of all worlds, where you have the data volume of debug logs combined with the extreme reliability requirements of transaction logs. Thus as your system grows, you should plan on splitting out the debug logs so that they can be handled separately. Examples of logging systems include the ELK stack, OpenSearch, Grafana Loki, and Graylog.

Metrics

Metrics largely ignore context, instead tracking aggregations over time of different types of events. To keep resource usage sane, the amount of different numbers being tracked needs to be limited: 10,000 per process is a reasonable upper bound for you to keep in mind. Examples of the sort of metrics you might have would be the number of times you received HTTP requests, how much time was spent handling requests, and how many requests are currently in progress. By excluding any information about context, the data volumes and processing required are kept reasonable.

That is not to say, though, that context is always ignored. For an HTTP request you could decide to have a metric for each URL path. But the 10,000 metric guideline has to be kept in mind, as each distinct path now counts as a metric. Using context such as a user’s email address would be unwise, as they have an unbounded cardinality.8

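As a rough illustration of that last point, here is what a counter with a path label could look like in the Prometheus text exposition format; the metric name, paths, and values are made up for this sketch rather than taken from the book.

```
# HELP http_requests_total Total HTTP requests handled, by path and status code.
# TYPE http_requests_total counter
http_requests_total{path="/",code="200"} 21078
http_requests_total{path="/login",code="200"} 1934
http_requests_total{path="/login",code="401"} 57
```

Each distinct combination of label values is its own time series, so every new path counts toward the rough 10,000-per-process guideline, and an unbounded label such as an email address would create an unbounded number of series.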