Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark

Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen & Josh Wills
The amount of data being generated today is staggering—and growing. Apache Spark has emerged as the de facto tool for analyzing big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming.

Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques—including classification, clustering, collaborative filtering, and anomaly detection—to fields such as genomics, security, and finance. This updated edition also covers image processing and the Spark NLP library. If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.

• Familiarize yourself with Spark's programming model and ecosystem
• Learn general approaches in data science
• Examine complete implementations that analyze large public datasets
• Discover which machine learning tools make sense for particular problems
• Explore code that can be adapted to many uses

Akash Tandon is cofounder and CTO of Looppanel. Previously, he worked as a senior data engineer at Atlan.

Sandy Ryza leads development of the Dagster project and is a committer on Apache Spark.

Uri Laserson is founder and CTO of Patch Biosciences. Previously, he worked on big data and genomics at Cloudera.

Sean Owen, a principal solutions architect focusing on machine learning and data science at Databricks, is an Apache Spark committer and PMC member.

Josh Wills is a software engineer at WeaveGrid and the former head of data engineering at Slack.

US $59.99  CAN $74.99
ISBN: 978-1-098-10365-1
Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Advanced Analytics with PySpark: Patterns for Learning from Data at Scale Using Python and Spark

Boston · Farnham · Sebastopol · Tokyo · Beijing
Advanced Analytics with PySpark
by Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills

Copyright © 2022 Akash Tandon. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Copyeditor: Penelope Perkins
Proofreader: Kim Wimpsett
Indexer: Sue Klefstad
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

June 2022: First Edition

Revision History for the First Edition
2022-06-14: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098103651 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Advanced Analytics with PySpark, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher's views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10365-1
[LSI]
Table of Contents

Preface

1. Analyzing Big Data
   Working with Big Data
   Introducing Apache Spark and PySpark
   Components
   PySpark
   Ecosystem
   Spark 3.0
   PySpark Addresses Challenges of Data Science
   Where to Go from Here

2. Introduction to Data Analysis with PySpark
   Spark Architecture
   Installing PySpark
   Setting Up Our Data
   Analyzing Data with the DataFrame API
   Fast Summary Statistics for DataFrames
   Pivoting and Reshaping DataFrames
   Joining DataFrames and Selecting Features
   Scoring and Model Evaluation
   Where to Go from Here

3. Recommending Music and the Audioscrobbler Dataset
   Setting Up the Data
   Our Requirements for a Recommender System
   Alternating Least Squares Algorithm
   Preparing the Data
   Building a First Model
   Spot Checking Recommendations
   Evaluating Recommendation Quality
   Computing AUC
   Hyperparameter Selection
   Making Recommendations
   Where to Go from Here

4. Making Predictions with Decision Trees and Decision Forests
   Decision Trees and Forests
   Preparing the Data
   Our First Decision Tree
   Decision Tree Hyperparameters
   Tuning Decision Trees
   Categorical Features Revisited
   Random Forests
   Making Predictions
   Where to Go from Here

5. Anomaly Detection with K-means Clustering
   K-means Clustering
   Identifying Anomalous Network Traffic
   KDD Cup 1999 Dataset
   A First Take on Clustering
   Choosing k
   Visualization with SparkR
   Feature Normalization
   Categorical Variables
   Using Labels with Entropy
   Clustering in Action
   Where to Go from Here

6. Understanding Wikipedia with LDA and Spark NLP
   Latent Dirichlet Allocation
   LDA in PySpark
   Getting the Data
   Spark NLP
   Setting Up Your Environment
   Parsing the Data
   Preparing the Data Using Spark NLP
   TF-IDF
   Computing the TF-IDFs
   Creating Our LDA Model
   Where to Go from Here

7. Geospatial and Temporal Data Analysis on Taxi Trip Data
   Preparing the Data
   Converting Datetime Strings to Timestamps
   Handling Invalid Records
   Geospatial Analysis
   Intro to GeoJSON
   GeoPandas
   Sessionization in PySpark
   Building Sessions: Secondary Sorts in PySpark
   Where to Go from Here

8. Estimating Financial Risk
   Terminology
   Methods for Calculating VaR
   Variance-Covariance
   Historical Simulation
   Monte Carlo Simulation
   Our Model
   Getting the Data
   Preparing the Data
   Determining the Factor Weights
   Sampling
   The Multivariate Normal Distribution
   Running the Trials
   Visualizing the Distribution of Returns
   Where to Go from Here

9. Analyzing Genomics Data and the BDG Project
   Decoupling Storage from Modeling
   Setting Up ADAM
   Introduction to Working with Genomics Data Using ADAM
   File Format Conversion with the ADAM CLI
   Ingesting Genomics Data Using PySpark and ADAM
   Predicting Transcription Factor Binding Sites from ENCODE Data
   Where to Go from Here

10. Image Similarity Detection with Deep Learning and PySpark LSH
    PyTorch
    Installation
    Preparing the Data
    Resizing Images Using PyTorch
    Deep Learning Model for Vector Representation of Images
    Image Embeddings
    Import Image Embeddings into PySpark
    Image Similarity Search Using PySpark LSH
    Nearest Neighbor Search
    Where to Go from Here

11. Managing the Machine Learning Lifecycle with MLflow
    Machine Learning Lifecycle
    MLflow
    Experiment Tracking
    Managing and Serving ML Models
    Creating and Using MLflow Projects
    Where to Go from Here

Index
Preface

Apache Spark's long lineage of predecessors, from MPI (message passing interface) to MapReduce, made it possible to write programs that take advantage of massive resources while abstracting away the nitty-gritty details of distributed systems. As much as data processing needs have motivated the development of these frameworks, in a way the field of big data has become so tied to them that its scope is defined by what they can handle. Spark's original promise was to take this a little further—to make writing distributed programs feel like writing regular programs.

The rise in Spark's popularity coincided with that of the Python data (PyData) ecosystem. So it makes sense that Spark's Python API—PySpark—has significantly grown in popularity over the last few years. Although the PyData ecosystem has recently sprung up some distributed programming options of its own, Apache Spark remains one of the most popular choices for working with large datasets across industries and domains. Thanks to recent efforts to integrate PySpark with the other PyData tools, learning the framework can help you boost your productivity significantly as a data science practitioner.

We think that the best way to teach data science is by example. To that end, we have put together a book of applications, trying to touch on the interactions between the most common algorithms, datasets, and design patterns in large-scale analytics. This book isn't meant to be read cover to cover: turn to a chapter that looks like something you're trying to accomplish, or that simply ignites your interest, and start there.

Why Did We Write This Book Now?

Apache Spark experienced a major version upgrade in 2020—version 3.0. One of the biggest improvements was the introduction of Adaptive Query Execution, which takes away a big portion of the complexity around tuning and optimization. We do not refer to it elsewhere in the book because it is turned on by default in Spark 3.2 and later versions, so you get its benefits automatically.
The ecosystem changes, combined with Spark's latest major release, make this edition a timely one. Unlike previous editions of Advanced Analytics with Spark, which chose Scala, we will use Python. We'll cover best practices and integrate with the wider Python data science ecosystem when appropriate. All chapters have been updated to use the latest PySpark API. Two new chapters have been added, and multiple chapters have undergone major rewrites. We will not cover Spark's streaming and graph libraries. With Spark in a new era of maturity and stability, we hope that these changes will preserve the book as a useful resource on analytics for years to come.

How This Book Is Organized

Chapter 1 places Spark and PySpark within the wider context of data science and big data analytics. After that, each chapter comprises a self-contained analysis using PySpark. Chapter 2 introduces the basics of data processing in PySpark and Python through a use case in data cleansing. The next few chapters delve into the meat and potatoes of machine learning with Spark, applying some of the most common algorithms in canonical applications. The remaining chapters are a bit more of a grab bag and apply Spark in slightly more exotic applications—for example, querying Wikipedia through latent semantic relationships in the text, analyzing genomics data, and identifying similar images.

This book is not about PySpark's merits and disadvantages. There are a few other things that it is not about either. It introduces the Spark programming model and the basics of Spark's Python API, PySpark. However, it does not attempt to be a Spark reference or provide a comprehensive guide to all of Spark's nooks and crannies. It does not try to be a machine learning, statistics, or linear algebra reference, although many of the chapters provide some background on these before using them.

Instead, this book will help the reader get a feel for what it's like to use PySpark for complex analytics on large datasets by covering the entire pipeline: not just building and evaluating models, but also cleansing, preprocessing, and exploring data, with attention paid to turning results into production applications. We believe that the best way to teach this is by example.

Here are examples of some tasks that will be tackled in this book:

Predicting forest cover
We predict the type of forest cover using relevant features like location and soil type by using decision trees (see Chapter 4).

Querying Wikipedia for similar entries
We identify relationships between entries and query the Wikipedia corpus by using NLP (natural language processing) techniques (see Chapter 6).
Understanding utilization of New York cabs
We compute average taxi waiting time as a function of location by performing temporal and geospatial analysis (see Chapter 7).

Reducing risk for an investment portfolio
We estimate financial risk for an investment portfolio using Monte Carlo simulation (see Chapter 8).

When possible, we attempt not to just provide a "solution," but to demonstrate the full data science workflow, with all of its iterations, dead ends, and restarts. This book will be useful for getting more comfortable with Python, Spark, and machine learning and data analysis. However, these are in service of a larger goal, and we hope that most of all this book will teach you how to approach tasks like those described earlier. Each chapter, in about 20 measly pages, will try to get as close as possible to demonstrating how to build one piece of these data applications.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a tip or suggestion.

This element signifies a general note.
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/sryza/aas.

If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Advanced Analytics with PySpark by Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills (O'Reilly). Copyright 2022 Akash Tandon, 978-1-098-10365-1."

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O'Reilly Online Learning

For more than 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O'Reilly's online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, visit https://oreilly.com.
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/adv-analytics-pyspark.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

It goes without saying that you wouldn't be reading this book if it were not for the existence of Apache Spark and MLlib. We all owe thanks to the team that has built and open sourced it and the hundreds of contributors who have added to it.

We would like to thank everyone who spent a great deal of time reviewing the content of the previous editions of the book with expert eyes: Michael Bernico, Adam Breindel, Ian Buss, Parviz Deyhim, Jeremy Freeman, Chris Fregly, Debashish Ghosh, Juliet Hougland, Jonathan Keebler, Nisha Muktewar, Frank Nothaft, Nick Pentreath, Kostas Sakellis, Tom White, Marcelo Vanzin, and Juliet Hougland again. Thanks all! We owe you one. This has greatly improved the structure and quality of the result.

Sandy also would like to thank Jordan Pinkus and Richard Wang for helping with some of the theory behind the risk chapter.

Thanks to Jeff Bleiel and O'Reilly for the experience and great support in getting this book published and into your hands.
CHAPTER 1 Analyzing Big Data When people say that we live in an age of big data they mean that we have tools for collecting, storing, and processing information at a scale previously unheard of. The following tasks simply could not have been accomplished 10 or 15 years ago: • Build a model to detect credit card fraud using thousands of features and billions of transactions • Intelligently recommend millions of products to millions of users • Estimate financial risk through simulations of portfolios that include millions of instruments • Easily manipulate genomic data from thousands of people to detect genetic associations with disease • Assess agricultural land use and crop yield for improved policymaking by peri‐ odically processing millions of satellite images Sitting behind these capabilities is an ecosystem of open source software that can lev‐ erage clusters of servers to process massive amounts of data. The introduction/release of Apache Hadoop in 2006 has led to widespread adoption of distributed computing. The big data ecosystem and tooling have evolved at a rapid pace since then. The past five years have also seen the introduction and adoption of many open source machine learning (ML) and deep learning libraries. These tools aim to leverage vast amounts of data that we now collect and store. But just as a chisel and a block of stone do not make a statue, there is a gap between having access to these tools and all this data and doing something useful with it. Often, “doing something useful” means placing a schema over tabular data and using SQL to answer questions like “Of the gazillion users who made it to the third page in our registration process, how many are over 25?” The field of how to architect 1
data storage and organize information (data warehouses, data lakes, etc.) to make answering such questions easy is a rich one, but we will mostly avoid its intricacies in this book.

Sometimes, "doing something useful" takes a little extra work. SQL may still be core to the approach, but to work around idiosyncrasies in the data or perform complex analysis, we need a programming paradigm that's more flexible, with richer functionality in areas like machine learning and statistics. This is where data science comes in, and that's what we are going to talk about in this book.

In this chapter, we'll start by introducing big data as a concept and discuss some of the challenges that arise when working with large datasets. We will then introduce Apache Spark, an open source framework for distributed computing, and its key components. Our focus will be on PySpark, Spark's Python API, and how it fits within a wider ecosystem. This will be followed by a discussion of the changes brought by Spark 3.0, the framework's first major release in four years. We will finish with a brief note about how PySpark addresses the challenges of data science and why it is a great addition to your skill set.

Previous editions of this book used Spark's Scala API for code examples. We decided to use PySpark instead because of Python's popularity in the data science community and an increased focus by the core Spark team on better supporting the language. By the end of this chapter, you will ideally appreciate this decision.

Working with Big Data

Many of our favorite small data tools hit a wall when working with big data. Libraries like pandas are not equipped to deal with data that can't fit in our RAM. So what should an equivalent process look like that can leverage clusters of computers to achieve the same outcomes on large datasets? The challenges of distributed computing require us to rethink many of the basic assumptions that we rely on in single-node systems. For example, because data must be partitioned across many nodes on a cluster, algorithms that have wide data dependencies will suffer from the fact that network transfer rates are orders of magnitude slower than memory accesses. As the number of machines working on a problem increases, the probability of a failure increases. These facts require a programming paradigm that is sensitive to the characteristics of the underlying system: one that discourages poor choices and makes it easy to write code that will execute in a highly parallel manner.
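To make that contrast concrete, here is a minimal sketch, not taken from the book's own examples, of the same filtering-and-counting question (in the spirit of the registration example earlier) answered first with pandas on a single machine and then with PySpark against a cluster. The file paths and column names (signup_page, age) are hypothetical.

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("registrations").getOrCreate()

# Single-machine approach: the entire dataset must fit in RAM.
pdf = pd.read_csv("registrations.csv")
local_count = len(pdf[(pdf["signup_page"] >= 3) & (pdf["age"] > 25)])

# Distributed approach: the same question against a DataFrame that Spark
# partitions across the cluster, so the data never has to fit on one machine.
df = spark.read.csv("hdfs:///data/registrations.csv", header=True, inferSchema=True)
cluster_count = df.filter((F.col("signup_page") >= 3) & (F.col("age") > 25)).count()

print(local_count, cluster_count)

The point is not the syntax, which is deliberately similar, but that the second version expresses the same question while leaving it to Spark to decide how to split the work across machines.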
How Big Is Big Data?

Without a reference point, the term big data is ambiguous. Moreover, the age-old two-tier definition of small and big data can be confusing. When it comes to data size, a three-tiered definition is more helpful (see Table 1-1).

Table 1-1. A tiered definition of data sizes

Dataset type      Fits in RAM?   Fits on local disk?
Small dataset     Yes            Yes
Medium dataset    No             Yes
Big dataset       No             No

As per the table, if a dataset can fit in memory or on disk on a single system, it cannot be termed big data. This definition is not perfect, but it does act as a good rule of thumb in the context of an average machine. The focus of this book is to enable you to work efficiently with big data. If your dataset is small and can fit in memory, stay away from distributed systems. To analyze medium-sized datasets, a database or parallelism may be good enough at times. At other times, you may have to set up a cluster and use big data tools. Hopefully, the experience that you will gain in the following chapters will help you make such judgment calls.

Single-machine tools that have come to recent prominence in the software community are not the only tools used for data analysis. Scientific fields like genomics that deal with large datasets have been leveraging parallel-computing frameworks for decades. Most people processing data in these fields today are familiar with a cluster-computing environment called HPC (high-performance computing). Where the difficulties with Python and R lie in their inability to scale, the difficulties with HPC lie in its relatively low level of abstraction and difficulty of use. For example, to process a large file full of DNA-sequencing reads in parallel, we must manually split it up into smaller files and submit a job for each of those files to the cluster scheduler. If some of these fail, the user must detect the failure and manually resubmit them. If the analysis requires all-to-all operations like sorting the entire dataset, the large dataset must be streamed through a single node, or the scientist must resort to lower-level distributed frameworks like MPI, which are difficult to program without extensive knowledge of C and distributed/networked systems.

Tools written for HPC environments often fail to decouple the in-memory data models from the lower-level storage models. For example, many tools only know how to read data from a POSIX filesystem in a single stream, making it difficult to make
tools naturally parallelize or to use other storage backends, like databases. Modern distributed computing frameworks provide abstractions that allow users to treat a cluster of computers more like a single computer—to automatically split up files and distribute storage over many machines, divide work into smaller tasks and execute them in a distributed manner, and recover from failures. They can automate a lot of the hassle of working with large datasets and are far cheaper than HPC.

A simple way to think about distributed systems is that they are a group of independent computers that appear to the end user as a single computer. They allow for horizontal scaling: adding more computers rather than upgrading a single system (vertical scaling). The latter is relatively expensive and often insufficient for large workloads. Distributed systems are great for scaling and reliability but also introduce complexity when it comes to design, construction, and debugging. One should understand this trade-off before opting for such a tool.

Introducing Apache Spark and PySpark

Enter Apache Spark, an open source framework that combines an engine for distributing programs across clusters of machines with an elegant model for writing programs atop it. Spark originated at the University of California, Berkeley, AMPLab and has since been contributed to the Apache Software Foundation. When released, it was arguably the first open source software that made distributed programming truly accessible to data scientists.

Components

Apart from the core computation engine (Spark Core), Spark comprises four main components. Spark code written by a user, using any of its APIs, is executed in the workers' JVMs (Java Virtual Machines) across the cluster (see Chapter 2). These components are available as distinct libraries, as shown in Figure 1-1:

Spark SQL and DataFrames + Datasets
A module for working with structured data.

MLlib
A scalable machine learning library.

Structured Streaming
Makes it easy to build scalable, fault-tolerant streaming applications.
GraphX (legacy)
Apache Spark's library for graphs and graph-parallel computation. However, for graph analytics, GraphFrames is recommended instead of GraphX, which sees little active development and lacks Python bindings. GraphFrames is an open source general graph processing library that is similar to Apache Spark's GraphX but uses DataFrame-based APIs.

Figure 1-1. Apache Spark components

Comparison with MapReduce

One illuminating way to understand Spark is in terms of its advances over its predecessor, Apache Hadoop's MapReduce. MapReduce revolutionized computation over huge datasets by offering a simple and resilient model for writing programs that could execute in parallel across hundreds to thousands of machines. It broke up work into small tasks and could gracefully accommodate task failures without compromising the job to which they belonged. Spark maintains MapReduce's linear scalability and fault tolerance, but extends it in three important ways:

• First, rather than relying on a rigid map-then-reduce format, its engine can execute a more general directed acyclic graph of operators. This means that in situations where MapReduce must write out intermediate results to the distributed filesystem, Spark can pass them directly to the next step in the pipeline.
• Second, it complements its computational capability with a rich set of transformations that enable users to express computation more naturally. Out-of-the-box functions are provided for various tasks, including numerical computation, datetime processing, and string manipulation.
• Third, Spark extends its predecessors with in-memory processing. This means that future steps that want to deal with the same dataset need not recompute it or reload it from disk.

Spark is well suited for highly iterative algorithms as well as ad hoc queries, as the brief sketch after this list illustrates.
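The sketch below, which is not taken from the book's examples, makes the in-memory point concrete: caching a DataFrame keeps it in cluster memory, so the two jobs that follow reuse it instead of rereading the file from disk, which is the step a MapReduce-style pipeline would have to repeat. The file path and column names are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataset, used only for illustration.
events = spark.read.parquet("hdfs:///data/events.parquet")

# Mark the DataFrame for in-memory caching; it is materialized the first
# time an action runs against it.
events.cache()

# Both jobs below reuse the cached partitions instead of reloading the file.
events.groupBy("event_date").count().show(5)
distinct_users = events.select("user_id").distinct().count()

# Release the cached partitions once the iterative work is done.
events.unpersist()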
PySpark

PySpark is Spark's Python API. In simpler words, PySpark is a Python-based wrapper over the core Spark framework, which is written primarily in Scala. PySpark provides an intuitive programming environment for data science practitioners and offers the flexibility of Python with the distributed processing capabilities of Spark.

PySpark allows us to work across programming models. For example, a common pattern is to perform large-scale extract, transform, and load (ETL) workloads with Spark and then collect the results to a local machine for manipulation using pandas; a brief sketch of that pattern follows the listing below. We'll explore such programming models as we write PySpark code in the upcoming chapters. Here is a code example from the official documentation to give you a glimpse of what's to come:

from pyspark.ml.classification import LogisticRegression

# Load training data
training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(training)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

# We can also use the multinomial family for binary classification
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8,
                         family="multinomial")

# Fit the model
mlrModel = mlr.fit(training)

# Print the coefficients and intercepts for logistic regression
# with multinomial family
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))
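As a minimal sketch of the Spark-then-pandas pattern described above, and not an example from the book itself, the snippet below does the heavy aggregation on the cluster and collects only the small result to the driver as a pandas DataFrame. The file path and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The large-scale work stays on the cluster: read and aggregate a big dataset.
trips = spark.read.parquet("hdfs:///data/taxi_trips.parquet")
summary = trips.groupBy("pickup_borough").agg(
    F.avg("fare_amount").alias("avg_fare"),
    F.count("*").alias("n_trips"),
)

# Only the aggregated result, a handful of rows, comes back to the driver
# as a pandas DataFrame for local inspection or plotting.
summary_pdf = summary.toPandas()
print(summary_pdf.sort_values("avg_fare", ascending=False))

The design point to keep in mind is that toPandas should only be called on results small enough to fit comfortably on a single machine; the raw data itself never leaves the cluster.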