Learning Data Science
Data Wrangling, Exploration, Visualization, and Modeling with Python
Sam Lau, Joseph Gonzalez & Deborah Nolan
DATA SCIENCE

“This is the book I wish we had when we first came up with the term data scientist to describe what we do. If you’re looking to be in data science/engineering, AI, or machine learning, this is where you need to start.”
—DJ Patil, PhD, first US Chief Data Scientist

As an aspiring data scientist, you appreciate why organizations rely on data for important decisions—whether it’s for companies designing websites, cities deciding how to improve services, or scientists working to stop the spread of disease. And you want the skills required to distill a messy pile of data into actionable insights. We call this the data science lifecycle: the process of collecting, wrangling, analyzing, and drawing conclusions from data.

Learning Data Science is the first book to cover foundational skills in both programming and statistics that encompass this entire lifecycle. It’s aimed at those who wish to become data scientists or who work with data scientists, and at data analysts who wish to cross the “technical/nontechnical” divide. If you have a basic knowledge of Python programming, you’ll learn how to work with data using industry-standard tools like pandas.

• Refine a question of interest to one that can be studied with data
• Pursue data collection that may involve text processing, web scraping, etc.
• Glean valuable insights through data cleaning, exploration, and visualization
• Learn how to use modeling to describe the data
• Generalize findings beyond the data

Sam Lau is an assistant teaching professor in the Halıcıoğlu Data Science Institute at UC San Diego. Sam has a decade of teaching experience, and he has designed and taught flagship data science courses at UC Berkeley and UC San Diego.

Joey Gonzalez is an associate professor in the EECS Department at UC Berkeley, a member of the Berkeley AI Research group, and a founding member of the Berkeley RISE Lab. He also cofounded Turi Inc. and Aqueduct, which develop tools for data scientists.

Deborah Nolan is professor emerita of statistics and associate dean for students in the College of Computing, Data Science, and Society at UC Berkeley.

US $89.99 | CAN $112.99 | ISBN: 978-1-098-11300-1
Praise for Learning Data Science

“I helped develop and teach the UC Berkeley data science course based on Learning Data Science. This book provides the foundational skills and concepts needed to solve real-world data science problems.”
—Fernando Pérez, UC Berkeley Professor and Cofounder of Project Jupyter

“Learning Data Science is a great introduction to the field of data science for beginners and working professionals alike. Read it for the exciting case studies.”
—Siddharth Yadav, Freelance Data Scientist

“There’s not a lot of data science books that focus on exploratory data analysis and how that segues into the real modeling process. This book does just that and should serve anyone wanting a deep-dive in how to explore data.”
—Thomas Nield, Consultant/Instructor, Nield Consulting Group/Yawman Flight

“Learning Data Science provides a fantastic, comprehensive introduction to the data science lifecycle. It builds a strong foundation in data science principles and techniques, enabling readers to tackle the complex problems we face each day. What truly sets this book apart is the abundance of modern, real-world examples.”
—Sona Jeswani, Machine Learning Engineer for Google Search Ads Quality
“This great book covers the whole data science pipeline, from data wrangling to visualization to modeling. The text and (Python) code are both beautifully written. (For example, the extensive use of pandas pipes is quite elegant, and similar in style to R’s tidyverse.) I recommend the book for anyone who wants to get started in data science.”
—Kevin Murphy, Research Scientist at Google DeepMind, Author of Probabilistic Machine Learning (MIT Press, 2023)
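For readers unfamiliar with the “pandas pipes” mentioned in the quote above: the style chains custom steps together with DataFrame.pipe, much as R’s tidyverse chains verbs with %>%. The short sketch below is illustrative only; its baby-name data and helper functions are hypothetical and are not code from the book.

    import pandas as pd

    def add_decade(df):
        # Derive a decade column from the year column
        return df.assign(decade=(df["year"] // 10) * 10)

    def top_names(df, n=3):
        # Keep the n most frequent names within each decade
        return (df.sort_values("count", ascending=False)
                  .groupby("decade")
                  .head(n))

    # Hypothetical baby-name data, for illustration only
    babynames = pd.DataFrame({
        "name": ["Luna", "Emma", "Noah", "Luna", "Olivia", "Liam"],
        "year": [1995, 2004, 2007, 2015, 2012, 2018],
        "count": [301, 5400, 4700, 6900, 6200, 5100],
    })

    # Each .pipe step takes a dataframe and returns a dataframe,
    # so the chain reads top to bottom like a tidyverse pipeline.
    result = (babynames
              .pipe(add_decade)
              .pipe(top_names, n=2))
    print(result)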
Sam Lau, Joseph Gonzalez, and Deborah Nolan
Learning Data Science
Data Wrangling, Exploration, Visualization, and Modeling with Python
Beijing • Boston • Farnham • Sebastopol • Tokyo
Learning Data Science
by Sam Lau, Joseph Gonzalez, and Deborah Nolan

Copyright © 2023 Sam Lau, Joseph Gonzalez, and Deborah Nolan. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Melissa Potter
Production Editor: Katherine Tozer
Copyeditor: Audrey Doyle
Proofreader: J.M. Olejarz
Indexer: Potomac Indexing, LLC
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

September 2023: First Edition

Revision History for the First Release
2023-09-15: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098113001 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Learning Data Science, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-11300-1
[LSI]
Table of Contents

Preface  xv

Part I. The Data Science Lifecycle

1. The Data Science Lifecycle  3
    The Stages of the Lifecycle  3
    Examples of the Lifecycle  6
    Summary  7

2. Questions and Data Scope  9
    Big Data and New Opportunities  10
    Example: Google Flu Trends  10
    Target Population, Access Frame, and Sample  12
    Example: What Makes Members of an Online Community Active?  14
    Example: Who Will Win the Election?  14
    Example: How Do Environmental Hazards Relate to an Individual’s Health?  15
    Instruments and Protocols  16
    Measuring Natural Phenomena  17
    Example: What Is the Level of CO2 in the Air?  18
    Accuracy  19
    Types of Bias  20
    Types of Variation  22
    Summary  24

3. Simulation and Data Design  27
    The Urn Model  28
    Sampling Designs  30
    Sampling Distribution of a Statistic  32
    Simulating the Sampling Distribution  33
    Simulation with the Hypergeometric Distribution  35
    Example: Simulating Election Poll Bias and Variance  36
    The Pennsylvania Urn Model  38
    An Urn Model with Bias  40
    Conducting Larger Polls  41
    Example: Simulating a Randomized Trial for a Vaccine  43
    Scope  43
    The Urn Model for Random Assignment  44
    Example: Measuring Air Quality  46
    Summary  49

4. Modeling with Summary Statistics  51
    The Constant Model  52
    Minimizing Loss  54
    Mean Absolute Error  55
    Mean Squared Error  57
    Choosing Loss Functions  59
    Summary  60

5. Case Study: Why Is My Bus Always Late?  63
    Question and Scope  64
    Data Wrangling  64
    Exploring Bus Times  67
    Modeling Wait Times  70
    Summary  74

Part II. Rectangular Data

6. Working with Dataframes Using pandas  79
    Subsetting  80
    Data Scope and Question  80
    Dataframes and Indices  81
    Slicing  83
    Filtering Rows  86
    Example: How Recently Has Luna Become a Popular Name?  89
    Aggregating  91
    Basic Group-Aggregate  92
    Grouping on Multiple Columns  95
    Custom Aggregation Functions  96
    Pivoting  98
    Joining  100
    Inner Joins  101
    Left, Right, and Outer Joins  103
    Example: Popularity of NYT Name Categories  105
    Transforming  107
    Apply  107
    Example: Popularity of “L” Names  109
    The Price of Apply  110
    How Are Dataframes Different from Other Data Representations?  111
    Dataframes and Spreadsheets  111
    Dataframes and Matrices  112
    Dataframes and Relations  113
    Summary  113

7. Working with Relations Using SQL  115
    Subsetting  115
    SQL Basics: SELECT and FROM  116
    What’s a Relation?  117
    Slicing  118
    Filtering Rows  119
    Example: How Recently Has Luna Become a Popular Name?  121
    Aggregating  122
    Basic Group-Aggregate Using GROUP BY  123
    Grouping on Multiple Columns  124
    Other Aggregation Functions  125
    Joining  126
    Inner Joins  127
    Left and Right Joins  129
    Example: Popularity of NYT Name Categories  130
    Transforming and Common Table Expressions  131
    SQL Functions  131
    Multistep Queries Using a WITH Clause  134
    Example: Popularity of “L” Names  134
    Summary  135

Part III. Understanding the Data

8. Wrangling Files  139
    Data Source Examples  140
    Drug Abuse Warning Network (DAWN) Survey  140
    San Francisco Restaurant Food Safety  140
    File Formats  142
    Delimited Format  142
    Fixed-Width Format  144
    Hierarchical Formats  145
    Loosely Formatted Text  145
    File Encoding  146
    File Size  148
    The Shell and Command-Line Tools  151
    Table Shape and Granularity  155
    Granularity of Restaurant Inspections and Violations  156
    DAWN Survey Shape and Granularity  158
    Summary  161

9. Wrangling Dataframes  163
    Example: Wrangling CO2 Measurements from the Mauna Loa Observatory  164
    Quality Checks  167
    Addressing Missing Data  170
    Reshaping the Data Table  171
    Quality Checks  172
    Quality Based on Scope  172
    Quality of Measurements and Recorded Values  173
    Quality Across Related Features  174
    Quality for Analysis  174
    Fixing the Data or Not  175
    Missing Values and Records  176
    Transformations and Timestamps  178
    Transforming Timestamps  179
    Piping for Transformations  182
    Modifying Structure  183
    Example: Wrangling Restaurant Safety Violations  186
    Narrowing the Focus  187
    Aggregating Violations  188
    Extracting Information from Violation Descriptions  190
    Summary  193

10. Exploratory Data Analysis  195
    Feature Types  196
    Example: Dog Breeds  198
    Transforming Qualitative Features  203
    The Importance of Feature Types  206
    What to Look For in a Distribution  207
    What to Look For in a Relationship  211
    Two Quantitative Features  211
    One Qualitative and One Quantitative Variable  212
    Two Qualitative Features  214
    Comparisons in Multivariate Settings  216
    Guidelines for Exploration  220
    Example: Sale Prices for Houses  221
    Understanding Price  222
    What Next?  224
    Examining Other Features  225
    Delving Deeper into Relationships  229
    Fixing Location  230
    EDA Discoveries  232
    Summary  233

11. Data Visualization  235
    Choosing Scale to Reveal Structure  235
    Filling the Data Region  236
    Including Zero  237
    Revealing Shape Through Transformations  239
    Banking to Decipher Relationships  241
    Revealing Relationships Through Straightening  242
    Smoothing and Aggregating Data  245
    Smoothing Techniques to Uncover Shape  245
    Smoothing Techniques to Uncover Relationships and Trends  247
    Smoothing Techniques Need Tuning  249
    Reducing Distributions to Quantiles  250
    When Not to Smooth  252
    Facilitating Meaningful Comparisons  254
    Emphasize the Important Difference  254
    Ordering Groups  256
    Avoid Stacking  258
    Selecting a Color Palette  260
    Guidelines for Comparisons in Plots  262
    Incorporating the Data Design  263
    Data Collected Over Time  263
    Observational Studies  265
    Unequal Sampling  266
    Geographic Data  267
    Adding Context  268
    Example: 100m Sprint Times  269
    Creating Plots Using plotly  270
    Figure and Trace Objects  271
    Modifying Layout  273
    Plotting Functions  274
    Annotations  276
    Other Tools for Visualization  277
    matplotlib  278
    Grammar of Graphics  278
    Summary  279

12. Case Study: How Accurate Are Air Quality Measurements?  281
    Question, Design, and Scope  282
    Finding Collocated Sensors  284
    Wrangling the List of AQS Sites  284
    Wrangling the List of PurpleAir Sites  286
    Matching AQS and PurpleAir Sensors  288
    Wrangling and Cleaning AQS Sensor Data  290
    Checking Granularity  291
    Removing Unneeded Columns  292
    Checking the Validity of Dates  292
    Checking the Quality of PM2.5 Measurements  293
    Wrangling PurpleAir Sensor Data  294
    Checking the Granularity  296
    Handling Missing Values  300
    Exploring PurpleAir and AQS Measurements  302
    Creating a Model to Correct PurpleAir Measurements  308
    Summary  310

Part IV. Other Data Sources

13. Working with Text  315
    Examples of Text and Tasks  316
    Convert Text into a Standard Format  316
    Extract a Piece of Text to Create a Feature  316
    Transform Text into Features  317
    Text Analysis  317
    String Manipulation  318
    Converting Text to a Standard Format with Python String Methods  318
    String Methods in pandas  319
    Splitting Strings to Extract Pieces of Text  320
    Regular Expressions  321
    Concatenation of Literals  322
    Quantifiers  324
    Alternation and Grouping to Create Features  326
    Reference Tables  327
    Text Analysis  329
    Summary  334

14. Data Exchange  335
    NetCDF Data  336
    JSON Data  341
    HTTP  345
    REST  349
    XML, HTML, and XPath  353
    Example: Scraping Race Times from Wikipedia  356
    XPath  358
    Example: Accessing Exchange Rates from the ECB  360
    Summary  363

Part V. Linear Modeling

15. Linear Models  367
    Simple Linear Model  368
    Example: A Simple Linear Model for Air Quality  372
    Interpreting Linear Models  374
    Assessing the Fit  375
    Fitting the Simple Linear Model  377
    Multiple Linear Model  379
    Fitting the Multiple Linear Model  384
    Example: Where Is the Land of Opportunity?  388
    Explaining Upward Mobility Using Commute Time  389
    Relating Upward Mobility Using Multiple Variables  392
    Feature Engineering for Numeric Measurements  396
    Feature Engineering for Categorical Measurements  400
    Summary  407

16. Model Selection  409
    Overfitting  410
    Example: Energy Consumption  410
    Train-Test Split  415
    Cross-Validation  419
    Regularization  424
    Model Bias and Variance  425
    Summary  429

17. Theory for Inference and Prediction  431
    Distributions: Population, Empirical, Sampling  431
    Basics of Hypothesis Testing  433
    Example: A Rank Test to Compare Productivity of Wikipedia Contributors  435
    Example: A Test of Proportions for Vaccine Efficacy  439
    Bootstrapping for Inference  442
    Basics of Confidence Intervals  446
    Basics of Prediction Intervals  450
    Example: Predicting Bus Lateness  450
    Example: Predicting Crab Size  451
    Example: Predicting the Incremental Growth of a Crab  453
    Probability for Inference and Prediction  455
    Formalizing the Theory for Average Rank Statistics  456
    General Properties of Random Variables  459
    Probability Behind Testing and Intervals  462
    Probability Behind Model Selection  465
    Summary  467

18. Case Study: How to Weigh a Donkey  471
    Donkey Study Question and Scope  471
    Wrangling and Transforming  472
    Exploring  477
    Modeling a Donkey’s Weight  481
    A Loss Function for Prescribing Anesthetics  481
    Fitting a Simple Linear Model  482
    Fitting a Multiple Linear Model  484
    Bringing Qualitative Features into the Model  485
    Model Assessment  488
    Summary  490

Part VI. Classification

19. Classification  495
    Example: Wind-Damaged Trees  496
    Modeling and Classification  498
    A Constant Model  498
    Examining the Relationship Between Size and Windthrow  499
    Modeling Proportions (and Probabilities)  501
    A Logistic Model  502
    Log Odds  504
    Using a Logistic Curve  505
    A Loss Function for the Logistic Model  505
    From Probabilities to Classification  509
    The Confusion Matrix  511
    Precision Versus Recall  512
    Summary  515

20. Numerical Optimization  517
    Gradient Descent Basics  518
    Minimizing Huber Loss  520
    Convex and Differentiable Loss Functions  522
    Variants of Gradient Descent  524
    Stochastic Gradient Descent  525
    Mini-Batch Gradient Descent  525
    Newton’s Method  526
    Summary  527

21. Case Study: Detecting Fake News  529
    Question and Scope  530
    Obtaining and Wrangling the Data  531
    Exploring the Data  535
    Exploring the Publishers  536
    Exploring Publication Date  538
    Exploring Words in Articles  540
    Modeling  542
    A Single-Word Model  542
    Multiple-Word Model  544
    Predicting with the tf-idf Transform  546
    Summary  549

Additional Material  551

Data Sources  557

Index  561
Preface

Data science is exciting work. The ability to draw insights from messy data is valuable for all kinds of decision making across business, medicine, policy, and more. This book, Learning Data Science, aims to prepare readers to do data science. To achieve this, we’ve designed this book with the following special features:

Focus on the fundamentals
    Technologies come and go. While we work with specific technologies in this book, our goal is to equip readers with the fundamental building blocks of data science. We do this by revealing how to think about data science problems and challenges, and by covering the fundamentals behind the individual technologies. Our aim is to serve readers even as technologies change.

Cover the entire data science lifecycle
    Instead of just focusing on a single topic, like how to work with data tables or how to apply machine learning techniques, we cover the entire data science lifecycle—the process of asking a question, obtaining data, understanding the data, and understanding the world. Working through the entire lifecycle can often be the hardest part of being a data scientist.

Use real data
    To be prepared for working on real problems, we consider it essential to learn from examples that use real data, warts and all. We chose the datasets presented in this book by carefully picking from actual data analyses that have made an impact, rather than using overly refined or synthetic data.

Apply concepts through case studies
    We’ve included extended case studies throughout the book that follow or extend analyses from other data scientists. These case studies show readers how to navigate the data science lifecycle in real settings.
Combine both computational and inferential thinking
    On the job, data scientists need to foresee how the decisions they make when writing code, and how the size of a dataset, might affect statistical analysis. To prepare readers for their future work, Learning Data Science integrates computational and statistical thinking. We also motivate statistical concepts through simulation studies rather than mathematical proofs.

The text and code for this book are open source and available on GitHub.

Expected Background Knowledge

We expect readers to be proficient in Python and understand how to use built-in data structures like lists, dictionaries, and sets; import and use functions and classes from other packages; and write functions from scratch. We also use the numpy Python package without introduction but don’t expect readers to have much prior experience using it. Readers will get more from this book if they also know a bit of probability, calculus, and linear algebra, but we aim to explain mathematical ideas intuitively.

Organization of the Book

This book has 21 chapters, divided into six parts:

Part I (Chapters 1–5)
    Part I describes what the lifecycle is, makes one full pass through the lifecycle at a basic level, and introduces terminology that we use throughout the book. The part concludes with a short case study about bus arrival times.

Part II (Chapters 6–7)
    Part II introduces dataframes and relations and how to write code to manipulate data using pandas and SQL.

Part III (Chapters 8–12)
    Part III is all about obtaining data, discovering its traits, and spotting issues. After understanding these concepts, a reader can take a data file and describe the dataset’s interesting features to someone else. This part ends with a case study about air quality.

Part IV (Chapters 13–14)
    Part IV looks at widely used alternative sources of data, like text, binary formats, and data from the web.
Part V (Chapters 15–18)
    Part V focuses on understanding the world using data. It covers inferential topics like confidence intervals and hypothesis testing in addition to model fitting, feature engineering, and model selection. This part ends with a case study about predicting donkey weights for veterinarians in Kenya.

Part VI (Chapters 19–21)
    Part VI completes our study of supervised learning with logistic regression and optimization. It ends with a case study on predicting whether news articles make real or fake statements.

At the end of the book, we included resources to learn more about many of the topics this book introduces, and we provided the complete list of datasets used throughout the book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a general note.

This element indicates a tip.
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://learningds.org.

If you have a technical question or a problem using the code examples, please email bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Learning Data Science by Sam Lau, Joseph Gonzalez, and Deborah Nolan (O’Reilly). Copyright 2023 Sam Lau, Joseph Gonzalez, and Deborah Nolan, 978-1-098-11300-1.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at bookquestions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.