Learning Spark, 2nd Edition: Lightning-Fast Data Analytics

Author: Jules S. Damji, Brooke Wenig, Tathagata Das, Denny Lee


Data is getting bigger, arriving faster, and coming in more varied formats, and it all needs to be processed at scale for analytics or machine learning. How can you process such varied data workloads efficiently? Enter Apache Spark. Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through discourse, code snippets, and notebooks, you'll be able to:

• Learn Python, SQL, Scala, or Java high-level Structured APIs: DataFrames and Datasets
• Peek under the hood of the Spark SQL engine to understand Spark transformations and performance
• Inspect, tune, and debug your Spark operations with Spark configurations and the Spark UI
• Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
• Perform analytics on batch and streaming data using Structured Streaming
• Build reliable data pipelines with open source Delta Lake and Spark
• Develop machine learning pipelines with MLlib and productionize models using MLflow
• Use Koalas, the open source pandas API for Spark, for data transformation and feature engineering
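As a taste of the high-level Structured APIs the bullets above describe, here is a minimal sketch, not taken from the book; the input path and column names are invented for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import avg

    # Every Spark application starts from a SparkSession
    spark = SparkSession.builder.appName("BlurbSketch").getOrCreate()

    # Read a CSV file into a DataFrame, letting Spark infer the schema
    df = spark.read.csv("data/people.csv", header=True, inferSchema=True)

    # Declarative, SQL-like transformations on named columns
    df.groupBy("city").agg(avg("age").alias("avg_age")).show()

    spark.stop()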

📄 File Format: PDF
💾 File Size: 15.3 MB

📄 Text Preview (First 20 pages)


📄 Page 1
Learning Spark: Lightning-Fast Data Analytics
Jules S. Damji, Brooke Wenig, Tathagata Das & Denny Lee
Foreword by Matei Zaharia
2nd Edition. Covers Apache Spark 3.0.
Compliments of
📄 Page 2
(This page has no text content)
📄 Page 3
Praise for Learning Spark, Second Edition

This book offers a structured approach to learning Apache Spark, covering new developments in the project. It is a great way for Spark developers to get started with big data.
—Reynold Xin, Databricks Chief Architect and Cofounder and Apache Spark PMC Member

For data scientists and data engineers looking to learn Apache Spark and how to build scalable and reliable big data applications, this book is an essential guide!
—Ben Lorica, Databricks Chief Data Scientist, Past Program Chair O'Reilly Strata Conferences, Program Chair for Spark + AI Summit
📄 Page 4
(This page has no text content)
📄 Page 5
Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee
Learning Spark: Lightning-Fast Data Analytics
Second Edition
Beijing • Boston • Farnham • Sebastopol • Tokyo
📄 Page 6
ISBN: 978-1-492-05004-9 [GP]

Learning Spark by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee
Copyright © 2020 Databricks, Inc. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jonathan Hassell
Development Editor: Michele Cronin
Production Editor: Deborah Baker
Copyeditor: Rachel Head
Proofreader: Penelope Perkins
Indexer: Potomac Indexing, LLC
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

January 2015: First Edition
July 2020: Second Edition

Revision History for the Second Edition
2020-06-24: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492050049 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Learning Spark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O'Reilly and Databricks. See our statement of editorial independence.
📄 Page 7
Table of Contents

Foreword  xiii
Preface  xv

1. Introduction to Apache Spark: A Unified Analytics Engine  1
   The Genesis of Spark  1
   Big Data and Distributed Computing at Google  1
   Hadoop at Yahoo!  2
   Spark's Early Years at AMPLab  3
   What Is Apache Spark?  4
   Speed  4
   Ease of Use  5
   Modularity  5
   Extensibility  5
   Unified Analytics  6
   Apache Spark Components as a Unified Stack  6
   Apache Spark's Distributed Execution  10
   The Developer's Experience  14
   Who Uses Spark, and for What?  14
   Community Adoption and Expansion  16

2. Downloading Apache Spark and Getting Started  19
   Step 1: Downloading Apache Spark  19
   Spark's Directories and Files  21
   Step 2: Using the Scala or PySpark Shell  22
   Using the Local Machine  23
   Step 3: Understanding Spark Application Concepts  25
   Spark Application and SparkSession  26
📄 Page 8
   Spark Jobs  27
   Spark Stages  28
   Spark Tasks  28
   Transformations, Actions, and Lazy Evaluation  28
   Narrow and Wide Transformations  30
   The Spark UI  31
   Your First Standalone Application  34
   Counting M&Ms for the Cookie Monster  35
   Building Standalone Applications in Scala  40
   Summary  42

3. Apache Spark's Structured APIs  43
   Spark: What's Underneath an RDD?  43
   Structuring Spark  44
   Key Merits and Benefits  45
   The DataFrame API  47
   Spark's Basic Data Types  48
   Spark's Structured and Complex Data Types  49
   Schemas and Creating DataFrames  50
   Columns and Expressions  54
   Rows  57
   Common DataFrame Operations  58
   End-to-End DataFrame Example  68
   The Dataset API  69
   Typed Objects, Untyped Objects, and Generic Rows  69
   Creating Datasets  71
   Dataset Operations  72
   End-to-End Dataset Example  74
   DataFrames Versus Datasets  74
   When to Use RDDs  75
   Spark SQL and the Underlying Engine  76
   The Catalyst Optimizer  77
   Summary  82

4. Spark SQL and DataFrames: Introduction to Built-in Data Sources  83
   Using Spark SQL in Spark Applications  84
   Basic Query Examples  85
   SQL Tables and Views  89
   Managed Versus Unmanaged Tables  89
   Creating SQL Databases and Tables  90
   Creating Views  91
   Viewing the Metadata  93
📄 Page 9
   Caching SQL Tables  93
   Reading Tables into DataFrames  93
   Data Sources for DataFrames and SQL Tables  94
   DataFrameReader  94
   DataFrameWriter  96
   Parquet  97
   JSON  100
   CSV  102
   Avro  104
   ORC  106
   Images  108
   Binary Files  110
   Summary  111

5. Spark SQL and DataFrames: Interacting with External Data Sources  113
   Spark SQL and Apache Hive  113
   User-Defined Functions  114
   Querying with the Spark SQL Shell, Beeline, and Tableau  119
   Using the Spark SQL Shell  119
   Working with Beeline  120
   Working with Tableau  122
   External Data Sources  129
   JDBC and SQL Databases  129
   PostgreSQL  132
   MySQL  133
   Azure Cosmos DB  134
   MS SQL Server  136
   Other External Sources  137
   Higher-Order Functions in DataFrames and Spark SQL  138
   Option 1: Explode and Collect  138
   Option 2: User-Defined Function  138
   Built-in Functions for Complex Data Types  139
   Higher-Order Functions  141
   Common DataFrames and Spark SQL Operations  144
   Unions  147
   Joins  148
   Windowing  149
   Modifications  151
   Summary  155

6. Spark SQL and Datasets  157
   Single API for Java and Scala  157
📄 Page 10
   Scala Case Classes and JavaBeans for Datasets  158
   Working with Datasets  160
   Creating Sample Data  160
   Transforming Sample Data  162
   Memory Management for Datasets and DataFrames  167
   Dataset Encoders  168
   Spark's Internal Format Versus Java Object Format  168
   Serialization and Deserialization (SerDe)  169
   Costs of Using Datasets  170
   Strategies to Mitigate Costs  170
   Summary  172

7. Optimizing and Tuning Spark Applications  173
   Optimizing and Tuning Spark for Efficiency  173
   Viewing and Setting Apache Spark Configurations  173
   Scaling Spark for Large Workloads  177
   Caching and Persistence of Data  183
   DataFrame.cache()  183
   DataFrame.persist()  184
   When to Cache and Persist  187
   When Not to Cache and Persist  187
   A Family of Spark Joins  187
   Broadcast Hash Join  188
   Shuffle Sort Merge Join  189
   Inspecting the Spark UI  197
   Journey Through the Spark UI Tabs  197
   Summary  205

8. Structured Streaming  207
   Evolution of the Apache Spark Stream Processing Engine  207
   The Advent of Micro-Batch Stream Processing  208
   Lessons Learned from Spark Streaming (DStreams)  209
   The Philosophy of Structured Streaming  210
   The Programming Model of Structured Streaming  211
   The Fundamentals of a Structured Streaming Query  213
   Five Steps to Define a Streaming Query  213
   Under the Hood of an Active Streaming Query  219
   Recovering from Failures with Exactly-Once Guarantees  221
   Monitoring an Active Query  223
   Streaming Data Sources and Sinks  226
   Files  226
   Apache Kafka  228
📄 Page 11
   Custom Streaming Sources and Sinks  230
   Data Transformations  234
   Incremental Execution and Streaming State  234
   Stateless Transformations  235
   Stateful Transformations  235
   Stateful Streaming Aggregations  238
   Aggregations Not Based on Time  238
   Aggregations with Event-Time Windows  239
   Streaming Joins  246
   Stream–Static Joins  246
   Stream–Stream Joins  248
   Arbitrary Stateful Computations  253
   Modeling Arbitrary Stateful Operations with mapGroupsWithState()  254
   Using Timeouts to Manage Inactive Groups  257
   Generalization with flatMapGroupsWithState()  261
   Performance Tuning  262
   Summary  264

9. Building Reliable Data Lakes with Apache Spark  265
   The Importance of an Optimal Storage Solution  265
   Databases  266
   A Brief Introduction to Databases  266
   Reading from and Writing to Databases Using Apache Spark  267
   Limitations of Databases  267
   Data Lakes  268
   A Brief Introduction to Data Lakes  268
   Reading from and Writing to Data Lakes Using Apache Spark  269
   Limitations of Data Lakes  270
   Lakehouses: The Next Step in the Evolution of Storage Solutions  271
   Apache Hudi  272
   Apache Iceberg  272
   Delta Lake  273
   Building Lakehouses with Apache Spark and Delta Lake  274
   Configuring Apache Spark with Delta Lake  274
   Loading Data into a Delta Lake Table  275
   Loading Data Streams into a Delta Lake Table  277
   Enforcing Schema on Write to Prevent Data Corruption  278
   Evolving Schemas to Accommodate Changing Data  279
   Transforming Existing Data  279
   Auditing Data Changes with Operation History  282
   Querying Previous Snapshots of a Table with Time Travel  283
   Summary  284
📄 Page 12
10. Machine Learning with MLlib  285
   What Is Machine Learning?  286
   Supervised Learning  286
   Unsupervised Learning  288
   Why Spark for Machine Learning?  289
   Designing Machine Learning Pipelines  289
   Data Ingestion and Exploration  290
   Creating Training and Test Data Sets  291
   Preparing Features with Transformers  293
   Understanding Linear Regression  294
   Using Estimators to Build Models  295
   Creating a Pipeline  296
   Evaluating Models  302
   Saving and Loading Models  306
   Hyperparameter Tuning  307
   Tree-Based Models  307
   k-Fold Cross-Validation  316
   Optimizing Pipelines  320
   Summary  321

11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark  323
   Model Management  323
   MLflow  324
   Model Deployment Options with MLlib  330
   Batch  332
   Streaming  333
   Model Export Patterns for Real-Time Inference  334
   Leveraging Spark for Non-MLlib Models  336
   Pandas UDFs  336
   Spark for Distributed Hyperparameter Tuning  337
   Summary  341

12. Epilogue: Apache Spark 3.0  343
   Spark Core and Spark SQL  343
   Dynamic Partition Pruning  343
   Adaptive Query Execution  345
   SQL Join Hints  348
   Catalog Plugin API and DataSourceV2  349
   Accelerator-Aware Scheduler  351
   Structured Streaming  352
   PySpark, Pandas UDFs, and Pandas Function APIs  354
   Redesigned Pandas UDFs with Python Type Hints  354
📄 Page 13
   Iterator Support in Pandas UDFs  355
   New Pandas Function APIs  356
   Changed Functionality  357
   Languages Supported and Deprecated  357
   Changes to the DataFrame and Dataset APIs  357
   DataFrame and SQL Explain Commands  358
   Summary  360

Index  361
📄 Page 14
(This page has no text content)
📄 Page 15
Foreword

Apache Spark has evolved significantly since I first started the project at UC Berkeley in 2009. After moving to the Apache Software Foundation, the open source project has had over 1,400 contributors from hundreds of companies, and the global Spark meetup group has grown to over half a million members. Spark's user base has also become highly diverse, encompassing Python, R, SQL, and JVM developers, with use cases ranging from data science to business intelligence to data engineering. I have been working closely with the Apache Spark community to help continue its development, and I am thrilled to see the progress thus far.

The release of Spark 3.0 marks an important milestone for the project and has sparked the need for updated learning material. The idea of a second edition of Learning Spark has come up many times, and it was overdue. Even though I coauthored both Learning Spark and Spark: The Definitive Guide (both O'Reilly), it was time for me to let the next generation of Spark contributors pick up the narrative. I'm delighted that four experienced practitioners and developers, who have been working closely with Apache Spark from its early days, have teamed up to write this second edition of the book, incorporating the most recent APIs and best practices for Spark developers in a clear and informative guide.

The authors' approach to this edition is highly conducive to hands-on learning. The key concepts in Spark and distributed big data processing have been distilled into easy-to-follow chapters. Through the book's illustrative code examples, developers can build confidence using Spark and gain a greater understanding of its Structured APIs and how to leverage them.

I hope that this second edition of Learning Spark will guide you on your large-scale data processing journey, whatever problems you wish to tackle using Spark.

— Matei Zaharia, Chief Technologist, Cofounder of Databricks, Asst. Professor at Stanford, and original creator of Apache Spark
📄 Page 16
(This page has no text content)
📄 Page 17
Preface

We welcome you to the second edition of Learning Spark. It's been five years since the first edition was published in 2015, originally authored by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia. This new edition has been updated to reflect Apache Spark's evolution through Spark 2.x and Spark 3.0, including its expanded ecosystem of built-in and external data sources, machine learning, and streaming technologies with which Spark is tightly integrated.

Over the years since its first 1.x release, Spark has become the de facto big data unified processing engine. Along the way, it has extended its scope to include support for various analytic workloads. Our intent is to capture and curate this evolution for readers, showing not only how you can use Spark but how it fits into the new era of big data and machine learning. Hence, we have designed each chapter to build progressively on the foundations laid by the previous chapters, ensuring that the content is suited for our intended audience.

Who This Book Is For

Most developers who grapple with big data are data engineers, data scientists, or machine learning engineers. This book is aimed at those professionals who are looking to use Spark to scale their applications to handle massive amounts of data.

In particular, data engineers will learn how to use Spark's Structured APIs to perform complex data exploration and analysis on both batch and streaming data; use Spark SQL for interactive queries; use Spark's built-in and external data sources to read, refine, and write data in different file formats as part of their extract, transform, and load (ETL) tasks; and build reliable data lakes with Spark and the open source Delta Lake table format.

For data scientists and machine learning engineers, Spark's MLlib library offers many common algorithms to build distributed machine learning models. We will cover how to build pipelines with MLlib, best practices for distributed machine learning, how to use Spark to scale single-node models, and how to manage and deploy these models using the open source library MLflow.
📄 Page 18
While the book is focused on learning Spark as an analytical engine for diverse workloads, we will not cover all of the languages that Spark supports. Most of the examples in the chapters are written in Scala, Python, and SQL. Where necessary, we have infused a bit of Java. For those interested in learning Spark with R, we recommend Javier Luraschi, Kevin Kuo, and Edgar Ruiz's Mastering Spark with R (O'Reilly).

Finally, because Spark is a distributed engine, building an understanding of Spark application concepts is critical. We will guide you through how your Spark application interacts with Spark's distributed components and how execution is decomposed into parallel tasks on a cluster. We will also cover which deployment modes are supported and in what environments.

While there are many topics we have chosen to cover, there are a few that we have opted to not focus on. These include the older low-level Resilient Distributed Dataset (RDD) APIs and GraphX, Spark's API for graphs and graph-parallel computation. Nor have we covered advanced topics such as how to extend Spark's Catalyst optimizer to implement your own operations, how to implement your own catalog, or how to write your own DataSource V2 data sinks and sources. Though part of Spark, these are beyond the scope of your first book on learning Spark.

Instead, we have focused and organized the book around Spark's Structured APIs, across all its components, and how you can use Spark to process structured data at scale to perform your data engineering or data science tasks.

How the Book Is Organized

We organized the book in a way that leads you from chapter to chapter by introducing concepts, demonstrating these concepts via example code snippets, and providing full code examples or notebooks in the book's GitHub repo.

Chapter 1, Introduction to Apache Spark: A Unified Analytics Engine
Introduces you to the evolution of big data and provides a high-level overview of Apache Spark and its application to big data.

Chapter 2, Downloading Apache Spark and Getting Started
Walks you through downloading and setting up Apache Spark on your local machine.

Chapter 3, Apache Spark's Structured APIs, through Chapter 6, Spark SQL and Datasets
These chapters focus on using the DataFrame and Dataset Structured APIs to ingest data from built-in and external data sources, apply built-in and custom functions, and utilize Spark SQL. These chapters comprise the foundation for later chapters, incorporating all the latest Spark 3.0 changes where appropriate.
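As a flavor of the ETL-style work described above, in which Spark reads, refines, and writes data across file formats, here is a minimal sketch. It is not from the book, and the input and output paths and the status column are invented for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("EtlSketch").getOrCreate()

    # Extract: read semi-structured JSON into a DataFrame
    raw = spark.read.json("logs/events.json")

    # Transform: keep only well-formed events (hypothetical "status" column)
    clean = raw.filter(raw["status"] == "ok")

    # Load: write the refined data out as Parquet
    clean.write.mode("overwrite").parquet("out/events")

    spark.stop()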
📄 Page 19
Chapter 7, Optimizing and Tuning Spark Applications
Provides you with best practices for tuning, optimizing, debugging, and inspecting your Spark applications through the Spark UI, as well as details on the configurations you can tune to increase performance.

Chapter 8, Structured Streaming
Guides you through the evolution of the Spark Streaming engine and the Structured Streaming programming model. It examines the anatomy of a typical streaming query and discusses the different ways to transform streaming data (stateful aggregations, stream joins, and arbitrary stateful computations), while providing guidance on how to design performant streaming queries.

Chapter 9, Building Reliable Data Lakes with Apache Spark
Surveys three open source table format storage solutions, as part of the Spark ecosystem, that employ Apache Spark to build reliable data lakes with transactional guarantees. Due to Delta Lake's tight integration with Spark for both batch and streaming workloads, we focus on that solution and explore how it facilitates a new paradigm in data management, the lakehouse.

Chapter 10, Machine Learning with MLlib
Introduces MLlib, the distributed machine learning library for Spark, and walks you through an end-to-end example of how to build a machine learning pipeline, including topics such as feature engineering, hyperparameter tuning, evaluation metrics, and saving and loading models.

Chapter 11, Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Covers how to track and manage your MLlib models with MLflow, compares and contrasts different model deployment options, and explores how to leverage Spark for non-MLlib models for distributed model inference, feature engineering, and/or hyperparameter tuning.

Chapter 12, Epilogue: Apache Spark 3.0
The epilogue highlights notable features and changes in Spark 3.0. While the full range of enhancements and features is too extensive to fit in a single chapter, we highlight the major changes you should be aware of and recommend you check the release notes when Spark 3.0 is officially released.

Throughout these chapters, we have incorporated or noted Spark 3.0 features where needed and tested all the code examples and notebooks against Spark 3.0.0-preview2.
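For a flavor of the streaming query anatomy that Chapter 8 examines, here is a minimal sketch, not from the book, using Spark's built-in rate source (which generates timestamped rows) and the console sink:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("StreamSketch").getOrCreate()

    # Source: the built-in rate source emits (timestamp, value) rows continuously
    events = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    # Sink: print each micro-batch to the console as it arrives
    query = (events.writeStream
                   .format("console")
                   .outputMode("append")
                   .start())

    query.awaitTermination()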
📄 Page 20
How to Use the Code Examples

The code examples in the book range from brief snippets to complete Spark applications and end-to-end notebooks, in Scala, Python, SQL, and, where necessary, Java.

While some short code snippets in a chapter are self-contained and can be copied and pasted to run in a Spark shell (pyspark or spark-shell), others are fragments from standalone Spark applications or end-to-end notebooks. To run standalone Spark applications in Scala, Python, or Java, read the instructions in the respective chapter's README files in this book's GitHub repo.

As for the notebooks, to run these you will need to register for a free Databricks Community Edition account. We detail how to import the notebooks and create a cluster using Spark 3.0 in the README.

Software and Configuration Used

Most of the code in this book and the accompanying notebooks were written in and tested against Apache Spark 3.0.0-preview2, which was available to us at the time we were writing the final chapters. By the time this book is published, Apache Spark 3.0 will have been released and be available to the community for general use. We recommend that you download and use the official release with the following configurations for your operating system:

• Apache Spark 3.0 (prebuilt for Apache Hadoop 2.7)
• Java Development Kit (JDK) 1.8.0

If you intend to use only Python, then you can simply run pip install pyspark.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.
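To illustrate the kind of self-contained snippet described above, one that can be pasted directly into the pyspark shell, here is a minimal sketch (not from the book). The shell predefines the spark variable, so no imports or session setup are needed:

    # Create a small DataFrame of the integers 0-4 and compute their squares
    df = spark.range(5).withColumnRenamed("id", "n")
    df.selectExpr("n", "n * n AS n_squared").show()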
