Think Bayes: Bayesian Statistics in Python (Allen B. Downey)


Allen B. Downey
Think Bayes
Bayesian Statistics in Python
SECOND EDITION

Beijing  Boston  Farnham  Sebastopol  Tokyo

978-1-492-08946-9
[LSI]

Think Bayes
by Allen B. Downey

Copyright © 2021 Allen B. Downey. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Jessica Haberman
Development Editor: Michele Cronin
Production Editor: Kristen Brown
Copyeditor: O’Reilly Production Services
Proofreader: Stephanie English
Indexer: Sue Klefstad
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Allen B. Downey

September 2013: First Edition
May 2021: Second Edition

Revision History for the Second Edition
2021-05-18: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492089469 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Think Bayes, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Think Bayes is available under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The author maintains an online version at https://greenteapress.com/wp/think-bayes.

Table of Contents

Preface

1. Probability
    Linda the Banker
    Probability
    Fraction of Bankers
    The Probability Function
    Political Views and Parties
    Conjunction
    Conditional Probability
    Conditional Probability Is Not Commutative
    Condition and Conjunction
    Laws of Probability
    Theorem 1
    Theorem 2
    Theorem 3
    The Law of Total Probability
    Summary
    Exercises

2. Bayes’s Theorem
    The Cookie Problem
    Diachronic Bayes
    Bayes Tables
    The Dice Problem
    The Monty Hall Problem
    Summary
    Exercises

3. Distributions
    Distributions
    Probability Mass Functions
    The Cookie Problem Revisited
    101 Bowls
    The Dice Problem
    Updating Dice
    Summary
    Exercises

4. Estimating Proportions
    The Euro Problem
    The Binomial Distribution
    Bayesian Estimation
    Triangle Prior
    The Binomial Likelihood Function
    Bayesian Statistics
    Summary
    Exercises

5. Estimating Counts
    The Train Problem
    Sensitivity to the Prior
    Power Law Prior
    Credible Intervals
    The German Tank Problem
    Informative Priors
    Summary
    Exercises

6. Odds and Addends
    Odds
    Bayes’s Rule
    Oliver’s Blood
    Addends
    Gluten Sensitivity
    The Forward Problem
    The Inverse Problem
    Summary
    More Exercises

7. Minimum, Maximum, and Mixture
    Cumulative Distribution Functions
    Best Three of Four
    Maximum
    Minimum
    Mixture
    General Mixtures
    Summary
    Exercises

8. Poisson Processes
    The World Cup Problem
    The Poisson Distribution
    The Gamma Distribution
    The Update
    Probability of Superiority
    Predicting the Rematch
    The Exponential Distribution
    Summary
    Exercises

9. Decision Analysis
    The Price Is Right Problem
    The Prior
    Kernel Density Estimation
    Distribution of Error
    Update
    Probability of Winning
    Decision Analysis
    Maximizing Expected Gain
    Summary
    Discussion
    More Exercises

10. Testing
    Estimation
    Evidence
    Uniformly Distributed Bias
    Bayesian Hypothesis Testing
    Bayesian Bandits
    Prior Beliefs
    The Update
    Multiple Bandits
    Explore and Exploit
    The Strategy
    Summary
    More Exercises

11. Comparison
    Outer Operations
    How Tall Is A?
    Joint Distribution
    Visualizing the Joint Distribution
    Likelihood
    The Update
    Marginal Distributions
    Conditional Posteriors
    Dependence and Independence
    Summary
    Exercises

12. Classification
    Penguin Data
    Normal Models
    The Update
    Naive Bayesian Classification
    Joint Distributions
    Multivariate Normal Distribution
    A Less Naive Classifier
    Summary
    Exercises

13. Inference
    Improving Reading Ability
    Estimating Parameters
    Likelihood
    Posterior Marginal Distributions
    Distribution of Differences
    Using Summary Statistics
    Update with Summary Statistics
    Comparing Marginals
    Summary
    Exercises

14. Survival Analysis
    The Weibull Distribution
    Incomplete Data
    Using Incomplete Data
    Light Bulbs
    Posterior Means
    Posterior Predictive Distribution
    Summary
    Exercises

15. Mark and Recapture
    The Grizzly Bear Problem
    The Update
    Two-Parameter Model
    The Prior
    The Update
    The Lincoln Index Problem
    Three-Parameter Model
    Summary
    Exercises

16. Logistic Regression
    Log Odds
    The Space Shuttle Problem
    Prior Distribution
    Likelihood
    The Update
    Marginal Distributions
    Transforming Distributions
    Predictive Distributions
    Empirical Bayes
    Summary
    More Exercises

17. Regression
    More Snow?
    Regression Model
    Least Squares Regression
    Priors
    Likelihood
    The Update
    Marathon World Record
    The Priors
    Prediction
    Summary
    Exercises

18. Conjugate Priors
    The World Cup Problem Revisited
    The Conjugate Prior
    What the Actual?
    Binomial Likelihood
    Lions and Tigers and Bears
    The Dirichlet Distribution
    Summary
    Exercises

19. MCMC
    The World Cup Problem
    Grid Approximation
    Prior Predictive Distribution
    Introducing PyMC3
    Sampling the Prior
    When Do We Get to Inference?
    Posterior Predictive Distribution
    Happiness
    Simple Regression
    Multiple Regression
    Summary
    Exercises

20. Approximate Bayesian Computation
    The Kidney Tumor Problem
    A Simple Growth Model
    A More General Model
    Simulation
    Approximate Bayesian Computation
    Counting Cells
    Cell Counting with ABC
    When Do We Get to the Approximate Part?
    Summary
    Exercises

Index

Preface

The premise of this book, and the other books in the Think X series, is that if you know how to program, you can use that skill to learn other topics.

Most books on Bayesian statistics use math notation and present ideas using mathematical concepts like calculus. This book uses Python code and discrete approximations instead of continuous mathematics. As a result, what would be an integral in a math book becomes a summation, and most operations on probability distributions are loops or array operations.

I think this presentation is easier to understand, at least for people with programming skills. It is also more general, because when we make modeling decisions, we can choose the most appropriate model without worrying too much about whether the model lends itself to mathematical analysis. Also, it provides a smooth path from simple examples to real-world problems.

Who Is This Book For?

To start this book, you should be comfortable with Python. If you are familiar with NumPy and pandas, that will help, but I’ll explain what you need as we go. You don’t need to know calculus or linear algebra. You don’t need any prior knowledge of statistics.

In Chapter 1, I define probability and introduce conditional probability, which is the foundation of Bayes’s theorem. Chapter 3 introduces the probability distribution, which is the foundation of Bayesian statistics.

In later chapters, we use a variety of discrete and continuous distributions, including the binomial, exponential, Poisson, beta, gamma, and normal distributions. I will explain each distribution when it is introduced, and we will use SciPy to compute them, so you don’t need to know about their mathematical properties.
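As a small illustration of the remark that an integral becomes a summation, the sketch below approximates the mean of a continuous distribution on a discrete grid. The choice of an exponential distribution with rate 2, the grid bounds, and all variable names are assumptions made for this example; they are not taken from the book.

```python
import numpy as np

# Hypothetical example: an exponential distribution with rate lam = 2,
# whose exact mean is 1 / lam = 0.5.
lam = 2.0

# A discrete grid standing in for the continuous range of x.
xs = np.linspace(0, 20, 20001)
dx = xs[1] - xs[0]

# The density evaluated at each grid point.
density = lam * np.exp(-lam * xs)

# The integral of x * f(x) dx becomes a sum over the grid points.
approx_mean = np.sum(xs * density) * dx

print(approx_mean)
```

With a grid this fine, the sum should land very close to the exact mean of 0.5; a coarser grid trades accuracy for speed.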

Modeling

Most chapters in this book are motivated by a real-world problem, so they involve some degree of modeling. Before we can apply Bayesian methods (or any other analysis), we have to make decisions about which parts of the real-world system to include in the model and which details we can abstract away.

For example, in Chapter 8, the motivating problem is to predict the winner of a soccer (football) game. I model goal-scoring as a Poisson process, which implies that a goal is equally likely at any point in the game. That is not exactly true, but it is probably a good enough model for most purposes.

I think it is important to include modeling as an explicit part of problem solving because it reminds us to think about modeling errors (that is, errors due to simplifications and assumptions of the model).

Many of the methods in this book are based on discrete distributions, which makes some people worry about numerical errors. But for real-world problems, numerical errors are almost always smaller than modeling errors.

Furthermore, the discrete approach often allows better modeling decisions, and I would rather have an approximate solution to a good model than an exact solution to a bad model.

Working with the Code

Reading this book will only get you so far; to really understand it, you have to work with the code. The original form of this book is a series of Jupyter notebooks. After you read each chapter, I encourage you to run the notebook and work on the exercises. If you need help, my solutions are available.

There are several ways to run the notebooks:

• If you have Python and Jupyter installed, you can download the notebooks and run them on your computer.

• If you don’t have a programming environment where you can run Jupyter notebooks, you can use Colab, which lets you run Jupyter notebooks in a browser without installing anything.

To run the notebooks on Colab, start from this landing page, which has links to all of the notebooks.

If you already have Python and Jupyter, you can download the notebooks as a ZIP file.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Think Bayes, Second Edition, by Allen B. Downey (O’Reilly). Copyright 2021 Allen B. Downey, 978-1-492-08946-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact O’Reilly Media at permissions@oreilly.com.

Installing Jupyter

If you don’t have Python and Jupyter already, I recommend you install Anaconda, which is a free Python distribution that includes all the packages you’ll need. I found Anaconda easy to install. By default it installs files in your home directory, so you don’t need administrator privileges. You can download Anaconda from this site.

Anaconda includes most of the packages you need to run the code in this book. But there are a few additional packages you need to install. To make sure you have everything you need (and the right versions), the best option is to create a Conda environment. Download this Conda environment file and run the following commands to create and activate an environment called ThinkBayes2:

    conda env create -f environment.yml
    conda activate ThinkBayes2

If you don’t want to create an environment just for this book, you can install what you need using Conda.
The following commands should get everything you need:

    conda install python jupyter pandas scipy matplotlib
    pip install empiricaldist

If you don’t want to use Anaconda, you will need the following packages:

• Jupyter to run the notebooks, https://jupyter.org;
• NumPy for basic numerical computation, https://numpy.org;
• SciPy for scientific computation, https://scipy.org;
• pandas for working with data, https://pandas.pydata.org;
• matplotlib for visualization, https://matplotlib.org;
• empiricaldist for representing distributions, https://pypi.org/project/empiricaldist.

Although these are commonly used packages, they are not included with all Python installations, and they can be hard to install in some environments. If you have trouble installing them, I recommend using Anaconda or one of the other Python distributions that include these packages.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates URLs, email addresses, filenames, and file extensions.

Bold
    Indicates new and key terms.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/thinkBayes2e.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia

Contributor List

If you have a suggestion or correction, please send email to downey@allendowney.com. If I make a change based on your feedback, I will add you to the contributor list (unless you ask to be omitted).

If you include at least part of the sentence the error appears in, that makes it easy for me to search. Page and section numbers are fine, too, but not as easy to work with. Thanks!

• First, I have to acknowledge David MacKay’s excellent book, Information Theory, Inference, and Learning Algorithms, which is where I first came to understand Bayesian methods. With his permission, I use several problems from his book as examples.

• Several examples and exercises in the second edition are borrowed, with permission, from Cameron Davidson-Pilon and one exercise from Rasmus Bååth.

• This book also benefited from my interactions with Sanjoy Mahajan, especially in Fall 2012, when I audited his class on Bayesian Inference at Olin College.

• Many examples in this book were developed in collaboration with students in my Bayesian Statistics classes at Olin College. In particular, the Red Line example started as a class project by Brendan Ritter and Kai Austin.

• I wrote parts of this book during project nights with the Boston Python User Group, so I would like to thank them for their company and pizza.

• Jasmine Kwityn and Dan Fauxsmith at O’Reilly Media proofread the first edition and found many opportunities for improvement.

• Linda Pescatore found a typo and made some helpful suggestions.

• Tomasz Miasko sent many excellent corrections and suggestions.

• For the second edition, I want to thank Michele Cronin and Kristen Brown at O’Reilly Media and the technical reviewers Ravin Kumar, Thomas Nield, Josh Starmer, and Junpeng Lao.

• I am grateful to the developers and contributors of the software libraries this book is based on, especially Jupyter, NumPy, SciPy, pandas, PyMC, ArviZ, and Matplotlib.

Other people who spotted typos and errors include Greg Marra, Matt Aasted, Marcus Ogren, Tom Pollard, Paul A. Giannaros, Jonathan Edwards, George Purkins, Robert Marcus, Ram Limbu, James Lawry, Ben Kahle, Jeffrey Law, Alvaro Sanchez, Olivier Yiptong, Yuriy Pasichnyk, Kristopher Overholt, Max Hailperin, Markus Dobler, Brad Minch, Allen Minch, Nathan Yee, Michael Mera, Chris Krenn, and Daniel Vianna.

CHAPTER 1
Probability

The foundation of Bayesian statistics is Bayes’s theorem, and the foundation of Bayes’s theorem is conditional probability.

In this chapter, we’ll start with conditional probability, derive Bayes’s theorem, and demonstrate it using a real dataset. In the next chapter, we’ll use Bayes’s theorem to solve problems related to conditional probability. In the chapters that follow, we’ll make the transition from Bayes’s theorem to Bayesian statistics, and I’ll explain the difference.

Linda the Banker

To introduce conditional probability, I’ll use an example from a famous experiment by Tversky and Kahneman, who posed the following question:

    Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?

    1. Linda is a bank teller.
    2. Linda is a bank teller and is active in the feminist movement.

Many people choose the second answer, presumably because it seems more consistent with the description. It seems uncharacteristic if Linda is just a bank teller; it seems more consistent if she is also a feminist.

But the second answer cannot be “more probable”, as the question asks. Suppose we find 1,000 people who fit Linda’s description and 10 of them work as bank tellers. How many of them are also feminists? At most, all 10 of them are; in that case, the two options are equally probable. If fewer than 10 are, the second option is less probable. But there is no way the second option can be more probable.
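The counting argument above can be checked with a few lines of Python. This is a minimal sketch: the counts (1,000 people, 10 bank tellers) come from the text, while the function name `conjunction_prob` and the idea of looping over every possible number of feminist bank tellers are assumptions made for illustration.

```python
# Counts from the text: 1,000 people fit the description, 10 are bank tellers.
n_people = 1000
n_tellers = 10

def conjunction_prob(n_feminist_tellers):
    """Fraction of people who are both bank tellers and feminists."""
    # The feminist tellers are a subset of the tellers, so their
    # number can never exceed the number of tellers.
    if not 0 <= n_feminist_tellers <= n_tellers:
        raise ValueError("feminist tellers must be a subset of tellers")
    return n_feminist_tellers / n_people

p_teller = n_tellers / n_people

# Whatever subset size we assume, the conjunction is never more probable.
for k in range(n_tellers + 1):
    assert conjunction_prob(k) <= p_teller
```

The two probabilities are equal only when all 10 bank tellers are also feminists; for every smaller subset the conjunction is strictly less probable.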

If you were inclined to choose the second option, you are in good company. The biologist Stephen J. Gould wrote:

    I am particularly fond of this example because I know that the [second] statement is least probable, yet a little homunculus in my head continues to jump up and down, shouting at me, “but she can’t just be a bank teller; read the description.”

If the little person in your head is still unhappy, maybe this chapter will help.

Probability

At this point I should provide a definition of “probability”, but that turns out to be surprisingly difficult. To avoid getting stuck before we start, we will use a simple definition for now and refine it later: A probability is a fraction of a finite set.

For example, if we survey 1,000 people, and 20 of them are bank tellers, the fraction that work as bank tellers is 0.02 or 2%. If we choose a person from this population at random, the probability that they are a bank teller is 2%. By “at random” I mean that every person in the dataset has the same chance of being chosen.

With this definition and an appropriate dataset, we can compute probabilities by counting. To demonstrate, I’ll use data from the General Social Survey (GSS).

I’ll use pandas to read the data and store it in a DataFrame.

    import pandas as pd

    gss = pd.read_csv('gss_bayes.csv', index_col=0)
    gss.head()

            year   age  sex  polviews  partyid  indus10
    caseid
    1       1974  21.0    1       4.0      2.0   4970.0
    2       1974  41.0    1       5.0      0.0   9160.0
    5       1974  58.0    2       6.0      1.0   2670.0
    6       1974  30.0    1       5.0      4.0   6870.0
    7       1974  48.0    1       5.0      4.0   7860.0

The DataFrame has one row for each person surveyed and one column for each variable I selected.

The columns are

• caseid: Respondent id (which is the index of the table).
• year: Year when the respondent was surveyed.
• age: Respondent’s age when surveyed.
• sex: Male or female.
• polviews: Political views on a range from liberal to conservative.
• partyid: Political party affiliation: Democratic, Republican, or independent.
• indus10: Code for the industry the respondent works in.

Let’s look at these variables in more detail, starting with indus10.

Fraction of Bankers

The code for “Banking and related activities” is 6870, so we can select bankers like this:

    banker = (gss['indus10'] == 6870)
    banker.head()

    caseid
    1    False
    2    False
    5    False
    6     True
    7    False
    Name: indus10, dtype: bool

The result is a pandas Series that contains the Boolean values True and False.

If we use the sum function on this Series, it treats True as 1 and False as 0, so the total is the number of bankers:

    banker.sum()

    728

In this dataset, there are 728 bankers.

To compute the fraction of bankers, we can use the mean function, which computes the fraction of True values in the Series:

    banker.mean()

    0.014769730168391155

About 1.5% of the respondents work in banking, so if we choose a random person from the dataset, the probability they are a banker is about 1.5%.
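To see why sum counts and mean gives a fraction, here is a small sketch that uses a hand-made Series rather than the GSS data. The five industry codes match the rows shown in the gss.head() output earlier; the names `toy_indus`, `toy_banker`, `n_bankers`, and `frac_bankers` are hypothetical and exist only for this example.

```python
import pandas as pd

# A tiny stand-in for gss['indus10'] (five respondents only).
toy_indus = pd.Series([4970.0, 9160.0, 2670.0, 6870.0, 7860.0])

# Comparing a Series to a value yields a Boolean Series.
toy_banker = (toy_indus == 6870)

# True counts as 1 and False as 0, so sum() counts the True values.
n_bankers = toy_banker.sum()

# mean() is that count divided by the number of elements.
frac_bankers = toy_banker.mean()

print(n_bankers, frac_bankers)
```

Because mean is just the sum divided by the length, the same Boolean Series serves both for counting and for computing a probability under the "fraction of a finite set" definition.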

The Probability Function

I’ll put the code from the previous section in a function that takes a Boolean Series and returns a probability:

    def prob(A):
        """Computes the probability of a proposition, A."""
        return A.mean()

So we can compute the fraction of bankers like this:

    prob(banker)

    0.014769730168391155

Now let’s look at another variable in this dataset. The values of the column sex are encoded like this:

    1  Male
    2  Female

So we can make a Boolean Series that is True for female respondents and False otherwise:

    female = (gss['sex'] == 2)

And use it to compute the fraction of respondents who are women:

    prob(female)

    0.5378575776019476

The fraction of women in this dataset is higher than in the adult US population because the GSS doesn’t include people living in institutions like prisons and military housing, and those populations are more likely to be male.

Political Views and Parties

The other variables we’ll consider are polviews, which describes the political views of the respondents, and partyid, which describes their affiliation with a political party.

The values of polviews are on a seven-point scale:

    1  Extremely liberal
    2  Liberal
    3  Slightly liberal
    4  Moderate
    5  Slightly conservative
    6  Conservative
    7  Extremely conservative

I’ll define liberal to be True for anyone whose response is “Extremely liberal”, “Liberal”, or “Slightly liberal”:
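A minimal sketch of such a definition, assuming the seven-point coding listed above, so that the three liberal categories are the responses 1, 2, and 3. The Series `toy_polviews` is hypothetical data standing in for gss['polviews'], and `toy_liberal` is a name chosen for this example.

```python
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

# Hypothetical responses on the seven-point scale (1 = extremely liberal).
toy_polviews = pd.Series([4.0, 5.0, 6.0, 2.0, 3.0, 1.0, 7.0, 4.0])

# Responses 1, 2, and 3 are the three "liberal" categories.
toy_liberal = (toy_polviews <= 3)

print(prob(toy_liberal))
```

One design note: a comparison with a missing value (NaN) evaluates to False in pandas, so respondents with no recorded answer would be counted as not liberal under this sketch.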
