The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science
Author: Alex Gorelik
Praise for The Enterprise Big Data Lake

Alex is a visionary in the data industry. He has encapsulated his practical insights into a thorough treatise examining the technical considerations, firm-wide implications, and leveraged business impact of transitioning to a data-driven enterprise. This is a book for any business or technical professional who wishes to succeed with data.
Keyur Desai, Chief Data Officer, TD Ameritrade

Data lakes are essential in achieving many of the benefits of decision- and analytics-driven solutions. This book does a great job clarifying the architecture of data lakes, what value they provide, what challenges they pose, and how to address those challenges.
Jari Koister, VP of Product and Technology, FICO, and professor in the data science program at UC Berkeley, California

Big Data is one of the most confusing terms in the industry today. This book breaks down the components into easy, understandable terms and explains the best ways to approach such projects. I found the sections that articulate the interconnectedness of data streams, data ponds, and data lakes especially helpful. The book is a must-read for any executive looking to understand and educate themselves on contemporary methods of analytics.
Opinder Bawa, Vice President and Chief Information Officer, University of San Francisco

I can’t wait to share this book with managers I know who have joined data lake teams and need an introduction to the tools and terms they will need to converse and understand their new teams. They will also get a great idea for the direction they should try and steer their teams. This book is a great place to start, whether you are building a data lake or have inherited one.
Nicole Schwartz, Agile and Technical Product Management consultant
978-1-491-93155-4
[LSI]

The Enterprise Big Data Lake
by Alex Gorelik

Copyright © 2019 Alex Gorelik. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Andy Oram
Production Editor: Kristen Brown
Copyeditor: Rachel Head
Proofreader: Rachel Monaghan
Indexer: Ellen Troutman Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

March 2019: First Edition

Revision History for the First Edition
2019-02-19: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491931554 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Enterprise Big Data Lake, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Preface

1. Introduction to Data Lakes
  Data Lake Maturity
  Data Puddles
  Data Ponds
  Creating a Successful Data Lake
  The Right Platform
  The Right Data
  The Right Interface
  The Data Swamp
  Roadmap to Data Lake Success
  Standing Up a Data Lake
  Organizing the Data Lake
  Setting Up the Data Lake for Self-Service
  Data Lake Architectures
  Data Lakes in the Public Cloud
  Logical Data Lakes
  Conclusion

2. Historical Perspective
  The Drive for Self-Service Data: The Birth of Databases
  The Analytics Imperative: The Birth of Data Warehousing
  The Data Warehouse Ecosystem
  Storing and Querying the Data
  Loading the Data: Data Integration Tools
  Organizing and Managing the Data
  Consuming the Data
  Conclusion

3. Introduction to Big Data and Data Science
  Hadoop Leads the Historic Shift to Big Data
  The Hadoop File System
  How Processing and Storage Interact in a MapReduce Job
  Schema on Read
  Hadoop Projects
  Data Science
  What Should Your Analytics Organization Focus On?
  Machine Learning
  Explainability
  Change Management
  Conclusion

4. Starting a Data Lake
  The What and Why of Hadoop
  Preventing Proliferation of Data Puddles
  Taking Advantage of Big Data
  Leading with Data Science
  Strategy 1: Offload Existing Functionality
  Strategy 2: Data Lakes for New Projects
  Strategy 3: Establish a Central Point of Governance
  Which Way Is Right for You?
  Conclusion

5. From Data Ponds/Big Data Warehouses to Data Lakes
  Essential Functions of a Data Warehouse
  Dimensional Modeling for Analytics
  Integrating Data from Disparate Sources
  Preserving History Using Slowly Changing Dimensions
  Limitations of the Data Warehouse as a Historical Repository
  Moving to a Data Pond
  Keeping History in a Data Pond
  Implementing Slowly Changing Dimensions in a Data Pond
  Growing Data Ponds into a Data Lake: Loading Data That’s Not in the Data Warehouse
  Raw Data
  External Data
  Internet of Things (IoT) and Other Streaming Data
  Real-Time Data Lakes
  The Lambda Architecture
  Data Transformations
  Target Systems
  Data Warehouses
  Operational Data Stores
  Real-Time Applications and Data Products
  Conclusion

6. Optimizing for Self-Service
  The Beginnings of Self-Service
  Business Analysts
  Finding and Understanding Data: Documenting the Enterprise
  Establishing Trust
  Provisioning
  Preparing Data for Analysis
  Data Wrangling in the Data Lake
  Situating Data Preparation in Hadoop
  Common Use Cases for Data Preparation
  Analyzing and Visualizing
  The New World of Self-Service Business Intelligence
  The New Analytic Workflow
  Gatekeepers to Shopkeepers
  Governing Self-Service
  Conclusion

7. Architecting the Data Lake
  Organizing the Data Lake
  Landing or Raw Zone
  Gold Zone
  Work Zone
  Sensitive Zone
  Multiple Data Lakes
  Advantages of Keeping Data Lakes Separate
  Advantages of Merging the Data Lakes
  Cloud Data Lakes
  Virtual Data Lakes
  Data Federation
  Big Data Virtualization
  Eliminating Redundancy
  Conclusion

8. Cataloging the Data Lake
  Organizing the Data
  Technical Metadata
  Business Metadata
  Tagging
  Automated Cataloging
  Logical Data Management
  Sensitive Data Management and Access Control
  Data Quality
  Relating Disparate Data
  Establishing Lineage
  Data Provisioning
  Tools for Building a Catalog
  Tool Comparison
  The Data Ocean
  Conclusion

9. Governing Data Access
  Authorization or Access Control
  Tag-Based Data Access Policies
  Deidentifying Sensitive Data
  Data Sovereignty and Regulatory Compliance
  Self-Service Access Management
  Provisioning Data
  Conclusion

10. Industry-Specific Perspectives
  Big Data in Financial Services
  Consumers, Digitization, and Data Are Changing Finance as We Know It
  Saving the Bank
  New Opportunities Offered by New Data
  Key Processes in Making Use of the Data Lake
  Value Added by Data Lakes in Financial Services
  Data Lakes in the Insurance Industry
  Smart Cities
  Big Data in Medicine

Index
Preface

In recent years many enterprises have begun experimenting with using big data and cloud technologies to build data lakes and support data-driven culture and decision making, but the projects often stall or fail because the approaches that worked at internet companies have to be adapted for the enterprise, and there is no comprehensive practical guide on how to successfully do that. I wrote this book with the hope of providing such a guide.

In my roles as executive at IBM and Informatica (major data technology vendors), Entrepreneur in Residence at Menlo Ventures (a leading VC firm), and founder and CTO of Waterline (a big data startup), I’ve been fortunate to have had the opportunity to speak with hundreds of experts, visionaries, industry analysts, and hands-on practitioners about the challenges of building successful data lakes and creating a data-driven culture. This book is a synthesis of the themes and best practices that I’ve encountered across industries (from social media to banking and government agencies) and roles (from chief data officers and other IT executives to data architects, data scientists, and business analysts).

Big data, data science, and analytics supporting data-driven decision making promise to bring unprecedented levels of insight and efficiency to everything from how we work with data to how we work with customers to the search for a cure for cancer, but data science and analytics depend on having access to historical data. In recognition of this, companies are deploying big data lakes to bring all their data together in one place and start saving history, so data scientists and analysts have access to the information they need to enable data-driven decision making.

Enterprise big data lakes bridge the gap between the freewheeling culture of modern internet companies, where data is core to all practices, everyone is an analyst, and most people can code and roll their own data sets, and enterprise data warehouses, where data is a precious commodity, carefully tended to by professional IT personnel and provisioned in the form of carefully prepared reports and analytic data sets.
To be successful, enterprise data lakes must provide three new capabilities:

• Cost-effective, scalable storage and computing, so large amounts of data can be stored and analyzed without incurring prohibitive computational costs
• Cost-effective data access and governance, so everyone can find and use the right data without incurring expensive human costs associated with programming and manual ad hoc data acquisition
• Tiered, governed access, so different levels of data can be available to different users based on their needs and skill levels and applicable data governance policies

Hadoop, Spark, NoSQL databases, and elastic cloud–based systems are exciting new technologies that deliver on the first promise of cost-effective, scalable storage and computing. While they are still maturing and face some of the challenges inherent to any new technology, they are rapidly stabilizing and becoming mainstream. However, these powerful enabling technologies do not deliver on the other two promises of cost-effective and tiered data access. So, as enterprises create large clusters and ingest vast amounts of data, they find that instead of a data lake, they end up with a data swamp: a large repository of unusable data sets that are impossible to navigate or make sense of, and too dangerous to rely on for any decisions.

This book guides readers through the considerations and best practices of delivering on all the promises of the big data lake. It discusses various approaches to starting and growing a data lake, including data puddles (analytical sandboxes) and data ponds (big data warehouses), as well as building data lakes from scratch. It explores the pros and cons of different data lake architectures (on premises, cloud-based, and virtual) and covers setting up different zones to house everything from raw, untreated data to carefully managed and summarized data, and governing access to those zones. It explains how to enable self-service so that users can find, understand, and provision data themselves; how to provide different interfaces to users with different skill levels; and how to do all of that in compliance with enterprise data governance policies.

Who Should Read This Book?

This book is intended for the following audiences at large traditional enterprises:

• Data services and governance teams: chief data officers and data stewards
• IT executives and architects: chief technology officers and big data architects
• Analytics teams: data scientists, data engineers, data analysts, and heads of analytics
• Compliance teams: chief information security officers, data protection officers, information security analysts, and regulatory compliance heads

The book leverages my 30-year career developing leading-edge data technology and working with some of the world’s largest enterprises on their thorniest data problems. It draws on best practices from the world’s leading big data companies and enterprises, with essays and success stories from hands-on practitioners and industry experts to provide a comprehensive guide to architecting and deploying a successful big data lake. If you’re interested in taking advantage of what these exciting new big data technologies and approaches offer to the enterprise, this book is an excellent place to start. Management may want to read it once and refer to it periodically as big data issues come up in the workplace, while for hands-on practitioners it can serve as a useful reference as they are planning and executing big data lake projects.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
  Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
  Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width italic
  Shows text that should be replaced with user-supplied values or by values determined by context.

O’Reilly Online Learning

For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/Enterprise-Big-Data-Lake.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

First and foremost, I want to express my deep gratitude to all the experts and practitioners who shared their stories, expertise, and best practices with me. This book is for and about you!

A great thank you also to all the people who helped me work on this project. This is my first book, and I truly would not have been able to do it without their help. Thanks to:

• The O’Reilly team: Andy Oram, my O’Reilly editor, who breathed new life into this book as I was running out of steam and helped bring it from a stream of consciousness to some level of coherency; Tim McGovern, the original editor who helped get this book off the ground; and Rachel Head, the copyeditor who shocked me with how many more improvements could still be made to the book after over two years of writing, editing, rewriting, reviewing, more rewriting, more editing, more rewriting…; and Kristen Brown, who shepherded the book through the production process.
• The industry contributors who shared their thoughts and best practices in essays and whose names and bios you will find next to their essays inside the book
• The reviewers who made huge improvements with their fresh perspective, critical eye, and industry expertise: Sanjeev Mohan, Opinder Bawa, and Nicole Schwartz

Finally, this book would not have happened without the support and love of my wonderful family (my wife, Irina; my kids, Hannah, Jane, Lisa, and John; and my mom, Regina), my friends, and my wonderful Waterline family.
CHAPTER 1
Introduction to Data Lakes

Data-driven decision making is changing how we work and live. From data science, machine learning, and advanced analytics to real-time dashboards, decision makers are demanding data to help make decisions. Companies like Google, Amazon, and Facebook are data-driven juggernauts that are taking over traditional businesses by leveraging data. Financial services organizations and insurance companies have always been data driven, with quants and automated trading leading the way. The Internet of Things (IoT) is changing manufacturing, transportation, agriculture, and healthcare. From governments and corporations in every vertical to non-profits and educational institutions, data is being seen as a game changer. Artificial intelligence and machine learning are permeating all aspects of our lives.

The world is bingeing on data because of the potential it represents. We even have a term for this binge: big data, defined by Doug Laney of Gartner in terms of the three Vs (volume, variety, and velocity), to which he later added a fourth and, in my opinion, the most important V: veracity.

With so much variety, volume, and velocity, the old systems and processes are no longer able to support the data needs of the enterprise. Veracity is an even bigger problem for advanced analytics and artificial intelligence, where the principle of “GIGO” (garbage in = garbage out) is even more critical because it is virtually impossible to tell whether the data was bad and caused bad decisions in statistical and machine learning models or the model was bad.

To support these endeavors and address these challenges, a revolution is occurring in data management around how data is stored, processed, managed, and provided to the decision makers. Big data technology is enabling scalability and cost efficiency orders of magnitude greater than what’s possible with traditional data management infrastructure.
Self-service is taking over from the carefully crafted and labor-intensive approaches of the past, where armies of IT professionals created well-governed data warehouses and data marts, but took months to make any changes. The data lake is a daring new approach that harnesses the power of big data technology and marries it with the agility of self-service. Most large enterprises today either have deployed or are in the process of deploying data lakes.

This book is based on discussions with over a hundred organizations, ranging from the new data-driven companies like Google, LinkedIn, and Facebook to governments and traditional corporate enterprises, about their data lake initiatives, analytic projects, experiences, and best practices. The book is intended for IT executives and practitioners who are considering building a data lake, are in the process of building one, or have one already but are struggling to make it productive and widely adopted.

What’s a data lake? Why do we need it? How is it different from what we already have? This chapter gives a brief overview that will get expanded in detail in the following chapters. In an attempt to keep the summary succinct, I am not going to explain and explore each term and concept in detail here, but will save the in-depth discussion for subsequent chapters.

Data-driven decision making is all the rage. From data science, machine learning, and advanced analytics to real-time dashboards, decision makers are demanding data to help make decisions. This data needs a home, and the data lake is the preferred solution for creating that home. The term was invented and first described by James Dixon, CTO of Pentaho, who wrote in his blog: “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state.
The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

I italicized the critical points, which are:

• The data is in its original form and format (natural or raw data).
• The data is used by various users (i.e., accessed and accessible by a large user community).

This book is all about how to build a data lake that brings raw (as well as processed) data to a large user community of business analysts rather than just using it for IT-driven projects. The reason to make raw data available to analysts is so they can perform self-service analytics. Self-service has been an important mega-trend toward democratization of data. It started at the point of usage with self-service visualization tools like Tableau and Qlik (sometimes called data discovery tools) that let analysts analyze data without having to get help from IT. The self-service trend continues with data preparation tools that help analysts shape the data for analytics, catalog tools that help analysts find the data that they need, and data science tools that help perform advanced analytics. For even more advanced analytics generally referred to as data science, a new class of users called data scientists also usually makes a data lake their primary data source.

Of course, a big challenge with self-service is governance and data security. Everyone agrees that data has to be kept safe, but in many regulated industries, there are prescribed data security policies that have to be implemented and it is illegal to give analysts access to all data. Even in some non-regulated industries, it is considered a bad idea. The question becomes, how do we make data available to the analysts without violating internal and external data compliance regulations? This is sometimes called data democratization and will be discussed in detail in subsequent chapters.

Data Lake Maturity

The data lake is a relatively new concept, so it is useful to define some of the stages of maturity you might observe and to clearly articulate the differences between these stages:

• A data puddle is basically a single-purpose or single-project data mart built using big data technology. It is typically the first step in the adoption of big data technology. The data in a data puddle is loaded for the purpose of a single project or team. It is usually well known and well understood, and the reason that big data technology is used instead of traditional data warehousing is to lower cost and provide better performance.
• A data pond is a collection of data puddles. It may be like a poorly designed data warehouse, which is effectively a collection of colocated data marts, or it may be an offload of an existing data warehouse. While lower technology costs and better scalability are clear and attractive benefits, these constructs still require a high level of IT participation. Furthermore, data ponds limit data to only that needed by the project, and use that data only for the project that requires it. Given the high IT costs and limited data availability, data ponds do not really help us with the goals of democratizing data usage or driving self-service and data-driven decision making for business users.
• A data lake is different from a data pond in two important ways. First, it supports self-service, where business users are able to find and use data sets that they want to use without having to rely on help from the IT department. Second, it aims to contain data that business users might possibly want even if there is no project requiring it at the time.
• A data ocean expands self-service data and data-driven decision making to all enterprise data, wherever it may be, regardless of whether it was loaded into the data lake or not.
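The distinctions among the four stages can be made concrete with a small sketch. The Python below is only an illustration and does not come from the book: the Environment fields and the classify function are hypothetical names, and reducing each stage to a handful of yes/no answers is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    PUDDLE = "data puddle"
    POND = "data pond"
    LAKE = "data lake"
    OCEAN = "data ocean"


@dataclass
class Environment:
    # Hypothetical yes/no simplifications of the stage descriptions above.
    serves_multiple_projects: bool    # more than one project or team loads data
    self_service: bool                # business users find and use data without IT
    holds_data_without_project: bool  # data is kept even if no project needs it yet
    covers_all_enterprise_data: bool  # self-service reaches data outside the lake


def classify(env: Environment) -> Stage:
    """Map the yes/no answers onto the four maturity stages."""
    if not env.serves_multiple_projects:
        return Stage.PUDDLE
    # A lake differs from a pond in two ways: self-service and speculative data.
    if not (env.self_service and env.holds_data_without_project):
        return Stage.POND
    if env.covers_all_enterprise_data:
        return Stage.OCEAN
    return Stage.LAKE


# Example: a warehouse offload that several teams use, but only through IT.
offload = Environment(True, False, False, False)
print(classify(offload).value)  # data pond
```

Under these assumptions, an environment counts as a lake only when both conditions hold; meeting just one of them still leaves it classified as a pond.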
Figure 1-1 illustrates the differences between these concepts. As maturity grows from a puddle to a pond to a lake to an ocean, the amount of data and the number of users grow, sometimes quite dramatically. The usage pattern moves from one of high-touch IT involvement to self-service, and the data expands beyond what’s needed for immediate projects.

Figure 1-1. The four stages of maturity

The key difference between the data pond and the data lake is the focus. Data ponds provide a less expensive and more scalable technology alternative to existing relational data warehouses and data marts. Whereas the latter are focused on running routine, production-ready queries, data lakes enable business users to leverage data to make their own decisions by doing ad hoc analysis and experimentation with a variety of new types of data and tools, as illustrated in Figure 1-2.

Before we get into what it takes to create a successful data lake, let’s take a closer look at the two maturity stages that lead up to it.