Enterprise AIOps (a collaboration between O’Reilly and Booz Allen Hamilton)


Booz Allen Hamilton


Enterprise AIOps
A Framework for Enabling Artificial Intelligence
Justin Neroda, Steve Escaravage, and Aaron Peters

Enterprise AIOps
by Justin Neroda, Steve Escaravage, and Aaron Peters

Copyright © 2021 O’Reilly Media Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Rebecca Novack
Development Editor: Virginia Wilson
Production Editor: Christopher Faucher
Copyeditor: nSight, Inc.
Proofreader: Piper Editorial Consulting, LLC
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Kate Dullea

August 2021: First Edition

Revision History for the First Edition
2021-08-16: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Enterprise AIOps, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Booz Allen Hamilton. See our statement of editorial independence. 978-1-098-10726-0 [LSI]

Preface

A significant transformation is underway in the marketplace. Mountains of data are being generated on a daily basis. Business operations, online transactions, vehicles and smart homes, our cell phones, and the increasing prevalence of the Internet of Things (IoT) mean that data is continuously being generated and stored. This growth in data vastly outstrips growth in the supply of technical analysts, even as demand for their services proliferates across industries. Companies, governments, and organizations are rightly asking how they can possibly derive value from these data volumes with their existing analytical capabilities and staff while determining where they need to invest to meet this growth.

When deployed effectively, Artificial Intelligence (AI) provides the necessary force multiplier, allowing organizations to generate insights not possible even a few years ago. AI provides organizations with the ability to process vast amounts of data by training algorithms to automate current processes, generate new insights, and then present these insights so that decision makers can act.

Artificial Intelligence (AI)
The ability for machines to solve problems and perform tasks that would normally require human capabilities and intelligence. [1]

Rapid gains in computing power, inexpensive storage solutions for large volumes of data, and new research in algorithm development make AI increasingly accessible. These three keys (processing power, data, and algorithms) bring us to a new reality, one where AI can now meet or exceed human capability across a variety of diverse tasks.

Similar to past industrial revolutions, AI’s adoption will have sweeping effects on our daily lives and work. Traditional business analytics (think Excel or basic statistical analyses) historically provided a competitive advantage to organizations in many industries. Now, companies are seeking the next wave of competitive advantage and strive for better performance in their analyses, a result that effective AI deployments can deliver. As such, artificial intelligence will change how we conduct business, introducing new efficiencies, and complexities, that we are just beginning to understand. Just as importantly, AI will fundamentally alter how you manage your teams, services, and operations. The ability of strategic leaders to anticipate and guide these long-term changes will be essential to ensuring the success of this transition and allowing their organization to remain competitive.

While there is growing consensus around AI’s importance, there is a growing chasm regarding how to successfully deploy and apply artificial intelligence at an enterprise-wide scale. Many organizations are struggling with expanding, evolving, and integrating their early AI development efforts into mature, sustainable, enterprise-wide capabilities. This gap is due to the drastic increase in scope and complexity required to operationalize artificial intelligence, particularly in terms of integrating AI solutions within the larger organization.

This report is for business leaders who want to transition AI from small pilot projects to an enterprise-wide reality. We will introduce an artificial intelligence operations (AIOps) engineering framework to assist you in overcoming these post-pilot challenges by covering responsible AI development, the important role of data management, team roles and responsibilities, and large-scale implementation.

The insights and strategies we share come from the lessons we’ve learned working at Booz Allen Hamilton on a portfolio of over 120 AI projects across the federal government. By the end of the report, you’ll learn how to unlock the incredible potential that lies within your organization’s exponentially growing data, deriving insights not available just a few years ago.

To extract the most benefit from this report, you should understand AI’s fundamentals, its purpose, and what challenges can be solved when deploying AI-based solutions. If you’d like a quick refresher or a deeper introduction to AI, our AI primer is a great place to start. [2]

Acknowledgments

A huge thank-you to all the individuals who collaborated with us on this report. To the following colleagues: John Larson, Kathleen Featheringham, Elizabeth Cardosa, Alex Walter-Higgins, Caleb Wharton, Sheshadri Mudiyanur, Byron Gaskin, Jeffrey Gross, Chuck Audet, Drew Farris, Drew Leety, Geoff Schaefer, David McCleary, Susan Johnston, Katrina Jacobs, Catherine Quinn, and Catherine Ordun: your insight and contributions strengthened this piece considerably.

[1] While there are many, often competing, technical definitions for AI, we wanted to provide a broad, high-level definition for this report. Our definition of AI is extracted from the National Security Commission on Artificial Intelligence. You can view their 2021 report at their website, where they define AI on page 20.
[2] Booz Allen Hamilton, The Artificial Intelligence Primer, accessed July 13, 2021.

Chapter 1. Demystifying AI

Most analytics exist to take operational data (e.g., past/present stock prices) and provide focused insights (e.g., predicted stock prices) that inform decision making. This essential objective is the same for conventional business analytics and AI analytics and includes a range of functions (e.g., automation, augmentation, conversational AI for consumers, etc.). The key difference is how developers create the code that transforms operational data into insights. For conventional business analytics, this is a static process where the developer manually defines each logical operation the computer must take. AI analytics, via machine learning (ML), attempts to derive the necessary operations directly from the data, reducing the onus on the developer to create and update the model over time (but not eliminating the developer) and making it possible to address otherwise prohibitively sophisticated use cases (e.g., computer vision).

Machine Learning (ML)
A subset of AI, machine learning is the study of computer algorithms that improve automatically through experience and by use of data. ML algorithms build a model based on sample data, known as “training data” (defined in Chapter 5), to make predictions or decisions without being explicitly programmed to do so. [1]

Beyond the initial challenge of the ML algorithm teaching an AI analytic to complete a basic task, we must ensure that it does not learn additional, undesirable behaviors that may impact its long-term sustainability (reliability, security, etc.). The ability to holistically understand the learned behavior of an AI analytic is called explainability and will be explored in detail in the following chapters.

With machine learning, computer models use experiences/historical data to make decisions by recognizing patterns in data. These experiences take several forms. For example, they could be collected by reviewing historical process data or observing current processes, or they could be generated using synthetic data. However, in many cases, practitioners must manually extract these patterns before they can be used. The sophistication of patterns and resulting operations can vary wildly based on the algorithm selected, the learning parameters used, and the way in which the training data is processed into the algorithm. Similarly, AI (more specifically, the sub-area of deep learning) uses models, such as neural networks, to learn highly complex patterns across various data types.

To summarize at a high level, AI enables computers to perceive, learn from, abstract, and act on data while automatically recognizing patterns in provided datasets. [2] AI can be used for a variety of use cases, some of which you may be familiar with. A few common examples where AI can be deployed to recognize patterns include:

1. Detecting anomalous entries in a dataset (e.g., identifying fraudulent versus legitimate credit card purchases)
2. Classifying a set of pixels in an image as familiar or unfamiliar (e.g., suggesting which of your friends might be included in a photo you took on your phone)
3. Offering new suggestions for entertainment choices based on your history (e.g., Netflix, Spotify, Amazon)

We can also describe what AI is not, at least not today. For example, some older sci-fi movies depict robots with the ability to have sophisticated, improvised, and fluent conversations with humans, or carry out complex actions and decisions in unexpected circumstances as people can. In fact, we’re not at that level of AI sophistication; to get there will take significant, persistent investment to advance current AI capabilities.

Currently, operational instances of AI represent what is known as narrow intelligence, or the ability to supplement human judgment for a single decision under controlled circumstances. Artificial general intelligence, in which machines can match a human’s capacity to perform multiple decisions in uncontrolled circumstances, does not exist at this point in time. While there have been recent advances to move in the direction of general intelligence, we are still quite far from this type of AI being seen at any meaningful scale. [3] Figure 1-1 provides a high-level overview of what AI can and cannot do well today.
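The contrast drawn earlier in this chapter, between logic a developer defines by hand and logic derived from data, can be illustrated with a minimal Python sketch built on the fraud-detection example. Everything in it is an assumption made for illustration: the purchase amounts, the function names, the fixed limit of 1000, and the "mean plus three standard deviations" cutoff, which is a deliberately tiny stand-in for a real ML model rather than a method described in this report.

```python
from statistics import mean, stdev

# Hypothetical historical purchase amounts (illustrative data only).
HISTORY = [12.5, 40.0, 23.1, 18.9, 35.4, 27.8, 15.0, 31.2, 22.6, 29.3]


def rule_based_flag(amount: float, limit: float = 1000.0) -> bool:
    """Conventional analytic: the developer hard-codes the decision logic."""
    return amount > limit


def fit_threshold(history: list[float], k: float = 3.0) -> float:
    """Data-derived analytic: compute the cutoff from past observations."""
    return mean(history) + k * stdev(history)


def learned_flag(amount: float, threshold: float) -> bool:
    """Flag a purchase that exceeds the cutoff derived from the data."""
    return amount > threshold


threshold = fit_threshold(HISTORY)
for amount in (30.0, 250.0, 5000.0):
    print(amount, rule_based_flag(amount), learned_flag(amount, threshold))
```

With this sample data, a purchase of 250 passes the hand-written rule but is flagged by the data-derived cutoff. Refitting the cutoff on new history changes the behavior without anyone editing the decision logic, which is the reduced developer burden the chapter describes.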


Figure 1-1. AI limitations

AI Pilot-to-Production Challenges

Mature AI capabilities do not appear overnight. Rather, they require months to years of sustained, cooperative, organization-spanning efforts to achieve. Creating and maintaining buy-in across stakeholders (e.g., strategic leadership, end users, and risk managers) is a critical challenge for change agents within your organization. AI analytic pilots performed in laboratory conditions (handpicked use cases, curated data, controlled environments) are one of the best ways to create initial buy-in at modest cost. However, most analytical use cases will require organizations to graduate these pilots from a laboratory setting to a production environment in order to fully solve the analytical challenges within the selected use cases. A common mistake many organizations make is underestimating the challenge of transitioning between these environments and failing to mature their development capability in response. A few main challenges include scalability, sustainability, and coordination.

1. Scalability
During a pilot, AI development teams can be small and simple in terms of roles and processes because they are addressing only a single use case. As AI capabilities mature and migrate from pilot to production, project volume will generally rise much more rapidly than available personnel. In particular, analytics already in production will begin to compete for resources with new deployments (amplified by the sustainability challenges in item 2). This calls for the evolution of the development team, process, and tooling to allow individual and collective distribution of labor across multiple teams and projects. Additionally, the volume and velocity of data involved in development will increase, demanding increasingly powerful, efficient, and sophisticated training pipelines (discussed extensively in Chapter 5).

2. Sustainability
By design, the laboratory environment limits threats to analytic sustainment to help the pilot team focus on functionality. Once in production, analytics are subject to a diverse range of issues, including operational (e.g., load variability, data drift, user error), security, legal, and ethical issues. Ensuring that sustainability does not compromise scalability requires the development process to evolve so that it anticipates and resolves these issues prior to release. Sustainability also benefits from coordination (see item 3) to allow key stakeholders to participate in the effort (see Chapter 3).

3. Coordination
In a laboratory environment, the pilot team interacts with a limited number of stakeholders by design. The number of stakeholders climbs drastically as these pilots enter production, and your teams must be prepared to motivate and facilitate coordination across data owners, end users, operations staff, risk managers, and others. Coordination also helps ensure equitable and efficient distribution of labor across the organization.

In addition to the three we just discussed, Table 1-1 provides a more complete list of challenges you might face when moving your AI solutions from pilots to production.

Table 1-1. AI pilots versus AI in production

Challenges to Operationalizing AI

AI Pilots | AI in Production
Simplified, static use case | Multistakeholder, dynamic use case
High-performance laboratory environment | Distributed legacy systems with dynamic fallback options
Openly accessible, low latency, data remains consistent | Access controlled; latency restricted; high-velocity data
Complete responsibility and control over data | Data mostly controlled by upstream stakeholders
No change or widely anticipated changes to data | Rapid, unexpected data drift
One-time, manual explanation for algorithm’s results | Real-time, automated explanation
AI developer does not reexamine model after pilot | AI developer continues to monitor model
No-cost shut-down, refactor | Costly shut-down, refactoring
One-off development against a single use case | Reproducible development against multiple use cases
Small team with well-defined focus and requisite skill sets | Continual training for a wide mix of experience, skill levels, and specialties
Informal, research-oriented project management | Hybrid research/development project management

Failing to mature AI capabilities to meet these challenges threatens the long-term viability of AI adoption since organizations will struggle to implement artificial intelligence in a production capacity. AI development will slow, existing analytics will remain difficult to sustain, leadership will become disillusioned with the lack of lasting mission impact, end users will lose faith, and hard conversations will ensue. Proactively addressing these challenges during the design phase results in organizations dramatically increasing the speed of adoption and impact of AI initiatives. In coming chapters, we’ll introduce a framework (AIOps) to address these challenges and allow your organization to maximize the impact of AI across the enterprise. [4]
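Data drift and continued model monitoring both appear among the production challenges listed in this chapter. As a rough sketch of what an automated check might look like, the Python below compares the mean of a new batch of values against the data a model was trained on. The metric (shift in the mean, measured in baseline standard deviations), the limit of 2.0, the sample values, and the function names are all hypothetical choices for illustration; the report does not prescribe a particular drift test.

```python
from statistics import mean, stdev


def drift_score(baseline: list[float], current: list[float]) -> float:
    """Shift of the current batch mean, in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variation; score is undefined")
    return abs(mean(current) - mean(baseline)) / spread


def has_drifted(baseline: list[float], current: list[float],
                limit: float = 2.0) -> bool:
    """Return True when the batch mean has moved more than `limit` deviations."""
    return drift_score(baseline, current) > limit


# Illustrative values only: one feature as seen at training time and in production.
TRAINING = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 9.9]
STABLE_BATCH = [10.3, 9.7, 10.0, 10.4]
SHIFTED_BATCH = [14.8, 15.2, 15.1, 14.9]

print("stable batch drifted:", has_drifted(TRAINING, STABLE_BATCH))
print("shifted batch drifted:", has_drifted(TRAINING, SHIFTED_BATCH))
```

A check of this kind could run on every incoming batch and alert the team responsible for the model, turning the "AI developer continues to monitor model" row of Table 1-1 into a routine, automated step. Production systems typically compare full distributions rather than a single mean.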

[1] Aurélien Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow (O’Reilly Media, 2017).
[2] While there are many, often competing, technical definitions of AI, we wanted to provide a broad, high-level definition for this report. Our definition of AI is extracted from the National Security Commission on Artificial Intelligence. You can view their 2021 report at their website, where they define AI on page 20.
[3] An example of current research and thinking in the area of artificial general intelligence that continues to evolve rapidly is David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton, “Reward Is Enough”, Artificial Intelligence 299 (October 2021).
[4] Mike Loukides, AI Adoption in the Enterprise (O’Reilly Media, 2021).

Chapter 2. Defining the AIOps Framework

Now that we’ve outlined the challenges of moving AI deployments from pilots to production, let’s introduce a framework (see Figure 2-1) that will enable your organization to move past these challenges and operationalize AI at an enterprise scale. Our AIOps framework will focus on two primary objectives: (1) evolving the AI development process itself and (2) integrating that development process within other parts of your organization to achieve a scalable, sustainable, and coordinated AI enterprise capability.

In the following section, we’ll introduce the components and functionalities you’ll need to implement for a successful AIOps deployment. Figure 2-1 demonstrates how these components are aligned and sequenced with one another to enable enterprise AIOps.

AIOps
The processes, strategies, and frameworks to operationalize AI to address real-world challenges and realize high-impact, enduring outcomes. AIOps combines responsible AI development, data, algorithms, and teams into an integrated, automated, and documented modular solution for development and sustainment of AI.


Figure 2-1. AIOps framework

This AIOps framework provides you with the basis to move AI from pilots to production. It is composed of several key components that are integrated to deliver AI solutions that meet the requirements of specific use cases, operate in production environments, and can be updated rapidly to address changing conditions. We’ll discuss each of these components throughout the report. These primary components are:

Mission engineering
It is critical for organizations to define and validate whether AI is applicable to the use case(s) they want to solve. Successful mission engineering will allow organizations to bring AI to the enterprise responsibly with real mission outcomes. We’ll cover this more in Chapter 6.

Responsible AI with human-centered design
To operationalize AI, you’ll need to focus on responsible AI to ensure that the AI solutions, when deployed, will meet the required performance and adhere to organization standards and core values. We’ll discuss this further in Chapter 3.

Data engineering and data operations (DataOps)
Locating the required data and developing repeatable data pipelines to increase the value of data and make it available across the enterprise. The operationalization of data engineering and management is known as “DataOps,” which allows the rest of your downstream pipeline to reap the benefits (e.g., more accurate AI/ML training) of better data availability and quality. See Chapter 4 for more detail.

ML engineering and ML operations (MLOps)
Development of advanced algorithms using supervised, unsupervised, reinforcement, deep learning, etc. as required to support complex decision making. This process includes both science and art but provides
