📄 Page
1
SQL for Data Analysis Advanced Techniques for Transforming Data into Insights Cathy Tanimura
📄 Page
2
(This page has no text content)
📄 Page
3
Cathy Tanimura SQL for Data Analysis Advanced Techniques for Transforming Data into Insights Boston Farnham Sebastopol TokyoBeijing
📄 Page
4
978-1-492-08878-3 [LSI] SQL for Data Analysis by Cathy Tanimura Copyright © 2021 Cathy Tanimura. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com. Acquisitions Editor: Andy Kwan Development Editors Amelia Blevins and Shira Evans Production Editor: Kristen Brown Copyeditor: Arthur Johnson Proofreader: Paula L. Fleming Indexer: Ellen Troutman-Zaig Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Kate Dullea September 2021: First Edition Revision History for the First Edition 2021-09-09: First Release See http://oreilly.com/catalog/errata.csp?isbn=9781492088783 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. SQL for Data Analysis, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
📄 Page
5
Table of Contents Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix 1. Analysis with SQL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 What Is Data Analysis? 1 Why SQL? 4 What Is SQL? 4 Benefits of SQL 7 SQL Versus R or Python 8 SQL as Part of the Data Analysis Workflow 9 Database Types and How to Work with Them 12 Row-Store Databases 13 Column-Store Databases 15 Other Types of Data Infrastructure 16 Conclusion 17 2. Preparing Data for Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 Types of Data 20 Database Data Types 20 Structured Versus Unstructured 22 Quantitative Versus Qualitative Data 22 First-, Second-, and Third-Party Data 23 Sparse Data 24 SQL Query Structure 25 Profiling: Distributions 27 Histograms and Frequencies 28 Binning 31 n-Tiles 33 iii
📄 Page
6
Profiling: Data Quality 35 Detecting Duplicates 36 Deduplication with GROUP BY and DISTINCT 38 Preparing: Data Cleaning 39 Cleaning Data with CASE Transformations 39 Type Conversions and Casting 42 Dealing with Nulls: coalesce, nullif, nvl Functions 45 Missing Data 47 Preparing: Shaping Data 52 For Which Output: BI, Visualization, Statistics, ML 52 Pivoting with CASE Statements 53 Unpivoting with UNION Statements 55 pivot and unpivot Functions 57 Conclusion 59 3. Time Series Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 Date, Datetime, and Time Manipulations 62 Time Zone Conversions 62 Date and Timestamp Format Conversions 64 Date Math 68 Time Math 71 Joining Data from Different Sources 72 The Retail Sales Data Set 74 Trending the Data 75 Simple Trends 75 Comparing Components 77 Percent of Total Calculations 86 Indexing to See Percent Change over Time 90 Rolling Time Windows 95 Calculating Rolling Time Windows 97 Rolling Time Windows with Sparse Data 102 Calculating Cumulative Values 104 Analyzing with Seasonality 107 Period-over-Period Comparisons: YoY and MoM 109 Period-over-Period Comparisons: Same Month Versus Last Year 112 Comparing to Multiple Prior Periods 116 Conclusion 119 4. Cohort Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Cohorts: A Useful Analysis Framework 122 The Legislators Data Set 125 iv | Table of Contents
📄 Page
7
Retention 127 SQL for a Basic Retention Curve 128 Adjusting Time Series to Increase Retention Accuracy 131 Cohorts Derived from the Time Series Itself 137 Defining the Cohort from a Separate Table 142 Dealing with Sparse Cohorts 146 Defining Cohorts from Dates Other Than the First Date 151 Related Cohort Analyses 153 Survivorship 154 Returnship, or Repeat Purchase Behavior 158 Cumulative Calculations 163 Cross-Section Analysis, Through a Cohort Lens 166 Conclusion 174 5. Text Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 Why Text Analysis with SQL? 175 What Is Text Analysis? 176 Why SQL Is a Good Choice for Text Analysis 176 When SQL Is Not a Good Choice 177 The UFO Sightings Data Set 178 Text Characteristics 179 Text Parsing 182 Text Transformations 187 Finding Elements Within Larger Blocks of Text 195 Wildcard Matches: LIKE, ILIKE 195 Exact Matches: IN, NOT IN 200 Regular Expressions 203 Constructing and Reshaping Text 218 Concatenation 218 Reshaping Text 222 Conclusion 226 6. Anomaly Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 Capabilities and Limits of SQL for Anomaly Detection 228 The Data Set 229 Detecting Outliers 230 Sorting to Find Anomalies 231 Calculating Percentiles and Standard Deviations to Find Anomalies 234 Graphing to Find Anomalies Visually 241 Forms of Anomalies 250 Anomalous Values 250 Table of Contents | v
📄 Page
8
Anomalous Counts or Frequencies 254 Anomalies from the Absence of Data 258 Handling Anomalies 260 Investigation 260 Removal 260 Replacement with Alternate Values 262 Rescaling 264 Conclusion 266 7. Experiment Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 Strengths and Limits of Experiment Analysis with SQL 269 The Data Set 270 Types of Experiments 272 Experiments with Binary Outcomes: The Chi-Squared Test 272 Experiments with Continuous Outcomes: The t-Test 274 Challenges with Experiments and Options for Rescuing Flawed Experiments 276 Variant Assignment 277 Outliers 278 Time Boxing 279 Repeated Exposure Experiments 280 When Controlled Experiments Aren’t Possible: Alternative Analyses 282 Pre-/Post-Analysis 282 Natural Experiment Analysis 284 Analysis of Populations Around a Threshold 286 Conclusion 286 8. Creating Complex Data Sets for Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 When to Use SQL for Complex Data Sets 287 Advantages of Using SQL 288 When to Build into ETL Instead 288 When to Put Logic in Other Tools 290 Code Organization 292 Commenting 292 Capitalization, Indentation, Parentheses, and Other Formatting Tricks 293 Storing Code 296 Organizing Computations 296 Understanding Order of SQL Clause Evaluation 296 Subqueries 300 Temporary Tables 302 Common Table Expressions 303 grouping sets 305 vi | Table of Contents
📄 Page
9
Managing Data Set Size and Privacy Concerns 308 Sampling with %, mod 308 Reducing Dimensionality 310 PII and Data Privacy 314 Conclusion 316 9. Conclusion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 Funnel Analysis 317 Churn, Lapse, and Other Definitions of Departure 319 Basket Analysis 323 Resources 325 Books and Blogs 325 Data Sets 326 Final Thoughts 327 Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329 Table of Contents | vii
📄 Page
10
(This page has no text content)
📄 Page
11
Preface Over the past 20 years, I’ve spent many of my working hours manipulating data with SQL. For most of those years, I’ve worked in technology companies spanning a wide range of consumer and business-to-business industries. In that time, volumes of data have increased dramatically, and the technology I get to use has improved by leaps and bounds. Databases are faster than ever, and the reporting and visualization tools used to communicate the meaning in the data are more powerful than ever. One thing that has remained remarkably constant, however, is SQL being a key part of my toolbox. I remember when I first learned SQL. I started my career in finance, where spread‐ sheets rule, and I’d gotten pretty good at writing formulas and memorizing all those keyboard shortcuts. One day I totally geeked out and Ctrl- and Alt-clicked every key on my keyboard just to see what would happen (and then created a cheat sheet for my peers). That was part fun and part survival: the faster I was with my spreadsheets, the more likely I would be to finish my work before midnight so I could go home and get some sleep. Spreadsheet mastery got me in the door at my next role, a startup where I was first introduced to databases and SQL. Part of my role involved crunching inventory data in spreadsheets, and thanks to early internet scale, the data sets were sometimes tens of thousands of rows. This was “big data” at the time, at least for me. I got in the habit of going for a cup of coffee or for lunch while my computer’s CPU was occupied with running its vlookup magic. One day my manager went on vacation and asked me to tend to the data warehouse he’d built on his laptop using Access. Refreshing the data involved a series of steps: running SQL queries in a portal, loading the resulting csv files into the database, and then refreshing the spreadsheet reports. After the first successful load, I started tin‐ kering, trying to understand how it worked, and pestering the engineers to show me how to modify the SQL queries. I was hooked, and even when I thought I might change directions with my career, I’ve kept coming back to data. Manipulating data, answering questions, helping my ix
📄 Page
12
colleagues work better and smarter, and learning about businesses and the world through sets of data have never stopped feeling fun and exciting. When I started working with SQL, there weren’t many learning resources. I got a book on basic syntax, read it in a night, and from there mostly learned through trial and error. Back in the days when I was learning, I queried production databases directly and brought the website down more than once with my overly ambitious (or more likely just poorly written) SQL. Fortunately my skills improved, and over the years I learned to work forward from the data in tables, and backward from the out‐ put needed, solving technical and logic challenges and puzzles to write queries that returned the right data. I ended up designing and building data warehouses to gather data from different sources and avoid bringing down critical production databases. I’ve learned a lot about when and how to aggregate data before writing the SQL query and when to leave data in a more raw form. I’ve compared notes with others who got into data around the same time, and it’s clear we mostly learned in the same ad hoc way. The lucky among us had peers with whom to share techniques. Most SQL texts are either introductory and basic (there’s definitely a place for these!) or else aimed at database developers. There are few resources for advanced SQL users who are focused on analysis work. Knowledge tends to be locked up in individuals or small teams. A goal of this book is to change that, giving practitioners a reference for how to solve common analysis problems with SQL, and I hope inspiring new inquiries into data using techniques you might not have seen before. Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, file extensions, and keywords. Constant width Used for program listings, as well as within paragraphs to refer to program ele‐ ments such as variable or function names, databases, environment variables, and statements. Constant width bold Shows commands or other text that should be typed literally by the user. Constant width italic Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. x | Preface
📄 Page
13
This element signifies a tip or suggestion. This element signifies a general note. This element indicates a warning or caution. Using Code Examples Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/cathytanimura/sql_book. If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission. We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “SQL for Data Analysis by Cathy Tanimura (O’Reilly). Copyright 2021 Cathy Tanimura, 978-1-492-08878-3.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com. Preface | xi
📄 Page
14
O’Reilly Online Learning For more than 40 years, O’Reilly Media has provided technol‐ ogy and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com. How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/sql-data-analysis. Email bookquestions@oreilly.com to comment or ask technical questions about this book. For news and information about our books and courses, visit http://oreilly.com. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia Acknowledgments This book wouldn’t have been possible without the efforts of a number of people at O’Reilly. Andy Kwan recruited me to this project. Amelia Blevins and Shira Evans guided me through the process and gave helpful feedback along the way. Kristen Brown shepherded the book through the production process. Arthur Johnson improved the quality and clarity of the text and inadvertently made me think more deeply about SQL keywords. xii | Preface
📄 Page
15
Many colleagues over the years played an important role in my SQL journey, and I’m grateful for their tutorials, tips, and shared code, and the time spent brainstorming ways to solve analysis problems over the years. Sharon Lin opened my eyes to regular expressions. Elyse Gordon gave me lots of book-writing advice. Dave Hoch and our conversations about experiment analysis inspired Chapter 7. Dan, Jim, and Stu from the Star Chamber have long been my favorite guys to geek out with. I’m also grateful for all of the colleagues who asked hard questions over the years—and once those were answered, asked even harder ones. I’d like to thank my husband Rick, son Shea, daughters Lily and Fiona, and mom Janet for their love, encouragement, and most of all the gift of time to work on this book. Amy, Halle, Jessi, and the Den of Slack kept me sane and laughing through months of writing and pandemic lockdown. Preface | xiii
📄 Page
16
(This page has no text content)
📄 Page
17
CHAPTER 1 Analysis with SQL If you’re reading this book, you’re probably interested in data analysis and in using SQL to accomplish it. You may be experienced with data analysis but new to SQL, or perhaps you’re experienced with SQL but new to data analysis. Or you may be new to both topics entirely. Whatever your starting point, this chapter lays the groundwork for the topics covered in the rest of the book and makes sure we have a common vocabulary. I’ll start with a discussion of what data analysis is and then move on to a discussion of SQL: what it is, why it’s so popular, how it compares to other tools, and how it fits into data analysis. Then, since modern data analysis is so intertwined with the technologies that have enabled it, I’ll conclude with a discussion of different types of databases that you may encounter in your work, why they’re used, and what all of that means for the SQL you write. What Is Data Analysis? Collecting and storing data for analysis is a very human activity. Systems to track stores of grain, taxes, and the population go back thousands of years, and the roots of statistics date back hundreds of years. Related disciplines, including statistical process control, operations research, and cybernetics, exploded in the 20th century. Many dif‐ ferent names are used to describe the discipline of data analysis, such as business intelligence (BI), analytics, data science, and decision science, and practitioners have a range of job titles. Data analysis is also done by marketers, product managers, busi‐ ness analysts, and a variety of other people. In this book, I’ll use the terms data ana‐ lyst and data scientist interchangeably to mean the person working with SQL to understand data. I will refer to the software used to build reports and dashboards as BI tools. Data analysis in the contemporary sense was enabled by, and is intertwined with, the history of computing. Trends in both research and commercialization have shaped it, 1
📄 Page
18
and the story includes a who’s who of researchers and major companies, which we’ll talk about in the section on SQL. Data analysis blends the power of computing with techniques from traditional statistics. Data analysis is part data discovery, part data interpretation, and part data communication. Very often the purpose of data analysis is to improve decision making, by humans and increasingly by machines through automation. Sound methodology is critical, but analysis is about more than just producing the right number. It’s about curiosity, asking questions, and the “why” behind the num‐ bers. It’s about patterns and anomalies, discovering and interpreting clues about how businesses and humans behave. Sometimes analysis is done on a data set gathered to answer a specific question, as in a scientific setting or an online experiment. Analysis is also done on data that is generated as a result of doing business, as in sales of a company’s products, or that is generated for analytics purposes, such as user interac‐ tion tracking on websites and mobile apps. This data has a wide range of possible applications, from troubleshooting to planning user interface (UI) improvements, but it often arrives in a format and volume such that the data needs processing before yielding answers. Chapter 2 will cover preparing data for analysis, and Chapter 8 will discuss some of the ethical and privacy concerns with which all data practitioners should be familiar. It’s hard to think of an industry that hasn’t been touched by data analysis: manufac‐ turing, retail, finance, health care, education, and even government have all been changed by it. Sports teams have employed data analysis since the early years of Billy Beane’s term as general manager of the Oakland Athletics, made famous by Michael Lewis’s book Moneyball (Norton). Data analysis is used in marketing, sales, logistics, product development, user experience design, support centers, human resources, and more. The combination of techniques, applications, and computing power has led to the explosion of related fields such as data engineering and data science. Data analysis is by definition done on historical data, and it’s important to remember that the past doesn’t necessarily predict the future. The world is dynamic, and organi‐ zations are dynamic as well—new products and processes are introduced, competi‐ tors rise and fall, sociopolitical climates change. Criticisms are leveled against data analysis for being backward looking. Though that characterization is true, I have seen organizations gain tremendous value from analyzing historical data. Mining histori‐ cal data helps us understand the characteristics and behavior of customers, suppliers, and processes. Historical data can help us develop informed estimates and predicted ranges of outcomes, which will sometimes be wrong but quite often will be right. Past data can point out gaps, weaknesses, and opportunities. It allows organizations to optimize, save money, and reduce risk and fraud. It can also help organizations find opportunity, and it can become the building blocks of new products that delight customers. 2 | Chapter 1: Analysis with SQL
📄 Page
19
Organizations that don’t do some form of data analysis are few and far between these days, but there are still some holdouts. Why do some organizations not use data analysis? One argument is the cost-to-value ratio. Collecting, processing, and analyzing data takes work and some level of financial investment. Some organizations are too new, or they’re too haphazard. If there isn’t a consistent pro‐ cess, it’s hard to generate data that’s consistent enough to analyze. Finally, there are ethical considerations. Collecting or storing data about certain people in certain situations may be regulated or even banned. Data about children and health-care interventions is sensi‐ tive, for example, and there are extensive regulations around its collection. Even organizations that are otherwise data driven need to take care around customer privacy and to think hard about what data should be collected, why it is needed, and how long it should be stored. Regulations such as the European Union’s General Data Protection Regulation, or GDPR, and the California Consumer Pri‐ vacy Act, or CCPA, have changed the way businesses think about consumer data. We’ll discuss these regulations in more depth in Chapter 8. As data practitioners, we should always be thinking about the ethical implications of our work. When working with organizations, I like to tell people that data analysis is not a project that wraps up at a fixed date—it’s a way of life. Developing a data-informed mindset is a process, and reaping the rewards is a journey. Unknowns become known, difficult questions are chipped away at until there are answers, and the most critical information is embedded in dashboards that power tactical and strategic deci‐ sions. With this information, new and harder questions are asked, and then the pro‐ cess repeats. Data analysis is both accessible for those looking to get started and hard to master. The technology can be learned, particularly SQL. Many problems, such as optimizing marketing spend or detecting fraud, are familiar and translate across businesses. Every organization is different and every data set has quirks, so even familiar prob‐ lems can pose new challenges. Communicating results is a skill. Learning to make good recommendations and becoming a trusted partner to an organization take time. In my experience, simple analysis presented persuasively has more impact than sophisticated analysis presented poorly. Successful data analysis also requires partner‐ ship. You can have great insights, but if there is no one to execute on them, you haven’t really made an impact. Even with all the technology, it’s still about people, and relationships matter. What Is Data Analysis? | 3
📄 Page
20
Why SQL? This section describes what SQL is, the benefits of using it, how it compares to other languages commonly used for analysis, and finally how SQL fits into the analysis workflow. What Is SQL? SQL is the language used to communicate with databases. The acronym stands for Structured Query Language and is pronounced either like “sequel” or by saying each letter, as in “ess cue el.” This is only the first of many controversies and inconsisten‐ cies surrounding SQL that we’ll see, but most people will know what you mean regardless of how you say it. There is some debate as to whether SQL is or isn’t a pro‐ gramming language. It isn’t a general purpose language in the way that C or Python are. SQL without a database and data in tables is just a text file. SQL can’t build a web‐ site, but it is powerful for working with data in databases. On a practical level, what matters most is that SQL can help you get the job of data analysis done. IBM was the first to develop SQL databases, from the relational model invented by Edgar Codd in the 1960s. The relational model was a theoretical description for man‐ aging data using relationships. By creating the first databases, IBM helped to advance the theory, but it also had commercial considerations, as did Oracle, Microsoft, and every other company that has commercialized a database since. From the beginning, there has been tension between computer theory and commercial reality. SQL became an International Organization for Standards (ISO) standard in 1987 and an American National Standards Institute (ANSI) standard in 1986. Although all major databases start from these standards in their implementation of SQL, many have var‐ iations and functions that make life easier for the users of those databases. These come at the cost of making SQL more difficult to move between databases without some modifications. SQL is used to access, manipulate, and retrieve data from objects in a database. Data‐ bases can have one or more schemas, which provide the organization and structure and contain other objects. Within a schema, the objects most commonly used in data analysis are tables, views, and functions. Tables contain fields, which hold the data. Tables may have one or more indexes; an index is a special kind of data structure that allows data to be retrieved more efficiently. Indexes are usually defined by a database administrator. Views are essentially stored queries that can be referenced in the same way as a table. Functions allow commonly used sets of calculations or procedures to be stored and easily referenced in queries. They are usually created by a database administrator, or DBA. Figure 1-1 gives an overview of the organization of databases. 4 | Chapter 1: Analysis with SQL