Manning
Oliver Dürr and Beate Sick, with Elvis Murina
With Python, Keras, and TensorFlow Probability
[Cover figure] Data modeling with probabilistic DL. The network determines the parameters of a probability distribution; the model is fit using the MaxLike principle. In the example shown, the outcome is count data, modeled by a Poisson distribution whose rate parameter λ is controlled by the NN (see the chosen last plate with one output node). The figure depicts a network shelf (fcNNs, CNNs, an RNN) and a probability distribution shelf (Poisson, ZIP, Gaussian) linked to the data via MaxLike.
Probabilistic Deep Learning
Probabilistic Deep Learning
With Python, Keras, and TensorFlow Probability

Oliver Dürr
Beate Sick
with Elvis Murina

Manning
Shelter Island
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964. Email: orders@manning.com

©2020 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Development editor: Marina Michaels
Technical development editors: Michiel Trimpe and Arthur Zubarev
Review editor: Aleksandar Dragosavljević
Production editor: Deirdre S. Hiam
Copy editor: Frances Buran
Proofreader: Keri Hales
Technical proofreader: Al Krinker
Typesetter: Dennis Dalinnik
Cover designer: Marija Tudor

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

ISBN: 9781617296079
Printed in the United States of America
brief contents

Part 1  Basics of deep learning  1
  1 ■ Introduction to probabilistic deep learning  3
  2 ■ Neural network architectures  25
  3 ■ Principles of curve fitting  62

Part 2  Maximum likelihood approaches for probabilistic DL models  91
  4 ■ Building loss functions with the likelihood approach  93
  5 ■ Probabilistic deep learning models with TensorFlow Probability  128
  6 ■ Probabilistic deep learning models in the wild  157

Part 3  Bayesian approaches for probabilistic DL models  195
  7 ■ Bayesian learning  197
  8 ■ Bayesian neural networks  229
contents

preface  xi
acknowledgments  xii
about this book  xiv
about the authors  xvii
about the cover illustration  xviii

Part 1  Basics of deep learning  1

1  Introduction to probabilistic deep learning  3
   1.1  A first look at probabilistic models  4
   1.2  A first brief look at deep learning (DL)  6
        A success story  8
   1.3  Classification  8
        Traditional approach to image classification  9 ■ Deep learning approach to image classification  12 ■ Non-probabilistic classification  14 ■ Probabilistic classification  14 ■ Bayesian probabilistic classification  16
   1.4  Curve fitting  16
        Non-probabilistic curve fitting  17 ■ Probabilistic curve fitting  18 ■ Bayesian probabilistic curve fitting  20
   1.5  When to use and when not to use DL?  21
        When not to use DL  21 ■ When to use DL  22 ■ When to use and when not to use probabilistic models?  22
   1.6  What you’ll learn in this book  23

2  Neural network architectures  25
   2.1  Fully connected neural networks (fcNNs)  26
        The biology that inspired the design of artificial NNs  26 ■ Getting started with implementing an NN  28 ■ Using a fully connected NN (fcNN) to classify images  38
   2.2  Convolutional NNs for image-like data  44
        Main ideas in a CNN architecture  44 ■ A minimal CNN for edge lovers  47 ■ Biological inspiration for a CNN architecture  50 ■ Building and understanding a CNN  52
   2.3  One-dimensional CNNs for ordered data  56
        Format of time-ordered data  57 ■ What’s special about ordered data?  58 ■ Architectures for time-ordered data  59

3  Principles of curve fitting  62
   3.1  “Hello world” in curve fitting  63
        Fitting a linear regression model based on a loss function  65
   3.2  Gradient descent method  69
        Loss with one free model parameter  69 ■ Loss with two free model parameters  73
   3.3  Special DL sauce  78
        Mini-batch gradient descent  78 ■ Using SGD variants to speed up the learning  79 ■ Automatic differentiation  79
   3.4  Backpropagation in DL frameworks  80
        Static graph frameworks  81 ■ Dynamic graph frameworks  88

Part 2  Maximum likelihood approaches for probabilistic DL models  91

4  Building loss functions with the likelihood approach  93
   4.1  Introduction to the MaxLike principle: The mother of all loss functions  94
   4.2  Deriving a loss function for a classification problem  99
        Binary classification problem  99 ■ Classification problems with more than two classes  105 ■ Relationship between NLL, cross entropy, and Kullback-Leibler divergence  109
   4.3  Deriving a loss function for regression problems  111
        Using a NN without hidden layers and one output neuron for modeling a linear relationship between input and output  111 ■ Using a NN with hidden layers to model non-linear relationships between input and output  119 ■ Using an NN with additional output for regression tasks with nonconstant variance  121

5  Probabilistic deep learning models with TensorFlow Probability  128
   5.1  Evaluating and comparing different probabilistic prediction models  130
   5.2  Introducing TensorFlow Probability (TFP)  132
   5.3  Modeling continuous data with TFP  135
        Fitting and evaluating a linear regression model with constant variance  136 ■ Fitting and evaluating a linear regression model with a nonconstant standard deviation  140
   5.4  Modeling count data with TensorFlow Probability  145
        The Poisson distribution for count data  148 ■ Extending the Poisson distribution to a zero-inflated Poisson (ZIP) distribution  153

6  Probabilistic deep learning models in the wild  157
   6.1  Flexible probability distributions in state-of-the-art DL models  159
        Multinomial distribution as a flexible distribution  160 ■ Making sense of discretized logistic mixture  162
   6.2  Case study: Bavarian roadkills  165
   6.3  Go with the flow: Introduction to normalizing flows (NFs)  166
        The principle idea of NFs  168 ■ The change of variable technique for probabilities  170 ■ Fitting an NF to data  175 ■ Going deeper by chaining flows  177 ■ Transformation between higher dimensional spaces*  181 ■ Using networks to control flows  183 ■ Fun with flows: Sampling faces  188

Part 3  Bayesian approaches for probabilistic DL models  195

7  Bayesian learning  197
   7.1  What’s wrong with non-Bayesian DL: The elephant in the room  198
   7.2  The first encounter with a Bayesian approach  201
        Bayesian model: The hacker’s way  202 ■ What did we just do?  206
   7.3  The Bayesian approach for probabilistic models  207
        Training and prediction with a Bayesian model  208 ■ A coin toss as a Hello World example for Bayesian models  213 ■ Revisiting the Bayesian linear regression model  224

8  Bayesian neural networks  229
   8.1  Bayesian neural networks (BNNs)  230
   8.2  Variational inference (VI) as an approximative Bayes approach  232
        Looking under the hood of VI*  233 ■ Applying VI to the toy problem*  238
   8.3  Variational inference with TensorFlow Probability  243
   8.4  MC dropout as an approximate Bayes approach  245
        Classical dropout used during training  246 ■ MC dropout used during train and test times  249
   8.5  Case studies  252
        Regression case study on extrapolation  252 ■ Classification case study with novel classes  256

Glossary of terms and abbreviations  264
index  269
preface

Thank you for buying our book. We hope that it provides you with a look under the hood of deep learning (DL) and gives you some inspiration on how to use probabilistic DL methods in your work.

All three of us, the authors, have a background in statistics. We started our journey in DL together in 2014. We got so excited about it that DL is still at the center of our professional lives. DL has a broad range of applications, but we are especially fascinated by the power of combining DL models with probabilistic approaches as used in statistics.

In our experience, a deep understanding of the potential of probabilistic DL requires both insight into the underlying methods and practical experience. Therefore, we tried to find a good balance of both ingredients in this book. We aimed to give some clear ideas and examples of applications before discussing the methods involved. You also have the chance to make practical use of all the discussed methods by working with the accompanying Jupyter notebooks.

We hope you learn as much by reading this book as we learned while writing it. Have fun and stay curious!
acknowledgments

We want to thank all the people who helped us in writing this book. Special thanks go out to our development editor, Marina Michaels, who managed to teach a bunch of Swiss and Germans how to write sentences shorter than a few hundred words. Without her, you would have no fun deciphering the text. Also, many thanks to our copyeditor, Frances Buran, who spotted uncountable errors and inconsistencies in the text (and also in the formulas, kudos!). We also got much support on the technical side from Al Krinker and Hefin Rhys to make the text and code in the notebooks more consistent and easier to understand. Also, thank you to our project editor, Deirdre Hiam; our proofreader, Keri Hales; and our review editor, Aleksandar Dragosavljević.

We would also like to thank the reviewers who, at various stages of the book, helped with their very valuable feedback: Bartek Krzyszycha, Brynjar Smári Bjarnason, David Jacobs, Diego Casella, Francisco José Lacueva Pérez, Gary Bake, Guillaume Alleon, Howard Bandy, Jon Machtynger, Kim Falk Jorgensen, Kumar Kandasami, Raphael Yan, Richard Vaughan, Richard Ward, and Zalán Somogyváry.

Finally, we would also like to thank Richard Sheppard for the many excellent graphics and drawings that make the book less dry and friendlier.

I, Oliver, would like to thank my partner Lena Obendiek for her patience as I worked on the book for many long hours. I also thank my friends from the “Tatort” viewing club for providing food and company each Sunday at 8:15 pm and for keeping me from going crazy while writing this book.

I, Beate, want to thank my friends, not so much for helping me to write the book, but for sharing with me a good time beyond the computer screen—first of all my partner Michael, but also the infamous Limmat BBQ group and my friends and family outside of Zurich who still spend leisure time with me despite the Rösti-Graben, the country border to the big canton, or even the big pond in between.

I, Elvis, want to thank everyone who supported me during the exciting time of writing this book, not only professionally, but also privately over a good glass of wine or a game of football.

We, the Tensor Chiefs, are happy that we made it together to the end of this book. We look forward to new scientific journeys, but also to less stressful times when we meet not only for work, but also for fun.
about this book

In this book, we hope to bring the probabilistic principles underpinning deep learning (DL) to a broader audience. In the end, (almost) all neural networks (NNs) in DL are probabilistic models. There are two powerful probabilistic principles: maximum likelihood and Bayes.

Maximum likelihood (fondly referred to as MaxLike) governs all traditional DL. Understanding networks as probabilistic models trained with the maximum likelihood principle helps you to boost the performance of your networks (as Google did when going from WaveNet to WaveNet++) or to generate astounding applications (like OpenAI did with Glow, a net that generates realistic-looking faces). Bayesian methods come into play in situations where networks need to say, “I’m not sure.” (Strangely, traditional NNs cannot do this.)

The subtitle for the book, “with Python, Keras, and TensorFlow Probability,” reflects the fact that you really should get your hands dirty and do some coding.

Who should read this book

This book is written for people who would like to understand the underlying probabilistic principles of DL. Ideally, you should have some experience with DL or machine learning (ML) and should not be too afraid of a bit of math and Python code. We did not spare the math, and we always included examples in code. We believe math goes better with code.
How this book is organized: A roadmap

The book has three parts that cover eight chapters.

Part 1 explains traditional deep learning (DL) architectures and how the training of neural networks (NNs) is done technically.

■ Chapter 1—Sets the stage and introduces you to probabilistic DL.
■ Chapter 2—Talks about network architectures. We cover fully connected neural networks (fcNNs), which are kind of all-purpose networks, and convolutional neural networks (CNNs), which are ideal for images.
■ Chapter 3—Shows you how NNs manage to fit millions of parameters. We keep it easy and show gradient descent and backpropagation on the simplest network one can think of—linear regression.

Part 2 focuses on using NNs as probabilistic models. In contrast to part 3, we discuss maximum likelihood approaches. These are behind all traditional DL.

■ Chapter 4—Explores maximum likelihood (MaxLike), the underlying principle of ML and DL. We start by applying this principle to classification and (simple) regression problems.
■ Chapter 5—Introduces TensorFlow Probability (TFP), a framework to build deep probabilistic models. We use it for not-so-simple regression problems like count data.
■ Chapter 6—Begins with more complex regression models. At the end, we explain how you can use probabilistic models to master complex distributions, like describing images of human faces.

Part 3 introduces Bayesian NNs. Bayesian NNs allow you to handle uncertainty.

■ Chapter 7—Motivates the need for Bayesian DL and explains its principles. We again look at the simple example of linear regression to explain the Bayesian principle.
■ Chapter 8—Shows you how to build Bayesian NNs. Here we cover two approaches called MC (Monte Carlo) dropout and variational inference.

If you already have experience with DL, you can skip the first part. Also, the second part of chapter 6 (starting with section 6.3) describes normalizing flows; you do not need to know these to understand the material in part 3. Section 6.3.5 is a bit heavy on math, so if this is not your cup of tea, you can skip it. The same holds true for sections 8.2.1 and 8.2.2.

About the code

This book contains many examples of source code, both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font, like this, to separate it from ordinary text.

The code samples are taken from Jupyter notebooks. These notebooks include additional explanations, and most include little exercises you should do for a better understanding of the concepts introduced in this book. You can find all the code in this GitHub repository: https://github.com/tensorchiefs/dl_book/. A good place to start is https://tensorchiefs.github.io/dl_book/, where you’ll find links to the notebooks. The notebooks are numbered according to the chapters. So, for example, nb_ch08_02 is the second notebook in chapter 8.

All the examples in this book, except nb_06_05, are tested with TensorFlow v2.1 and TensorFlow Probability (TFP) v0.8. The notebooks nb_ch03_03 and nb_ch03_04, describing the computation graphs, are easier to understand in TensorFlow v1. For these notebooks, we also include both versions of TensorFlow. The nb_06_05 notebook only works with TensorFlow v1 because we need weights that are only provided in that version of TensorFlow.

You can execute the notebooks in Google’s Colab or locally. Colab is great; you can simply click on a link and then play with the code in the cloud. No installation—you just need a browser. We definitely suggest that you go this way.

TensorFlow is still fast-evolving, and we cannot guarantee the code will run in several years’ time. We, therefore, provide a Docker container (https://github.com/oduerr/dl_book_docker/) that you can use to execute all notebooks except nb_06_05 and the TensorFlow 1.0 versions of nb_ch03_03 and nb_ch03_04. This Docker container is the way to go if you want to use the notebooks locally.

liveBook discussion forum

Purchase of Probabilistic Deep Learning includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the authors and from other users. To access the forum, go to https://livebook.manning.com/book/probabilistic-deep-learning-with-python/welcome/v-6/. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/#!/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the authors can take place. It is not a commitment to any specific amount of participation on the part of the authors, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the authors some challenging questions lest their interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
about the authors

Oliver Dürr is a professor of data science at the University of Applied Sciences in Konstanz, Germany. Beate Sick holds a chair in applied statistics at ZHAW and works as a researcher and lecturer at the University of Zurich and as a lecturer at ETH Zurich. Elvis Murina is a research scientist, responsible for the extensive exercises that accompany this book.

Dürr and Sick are both experts in machine learning and statistics. They have supervised numerous bachelor’s, master’s, and PhD theses on the topic of deep learning, and have planned and conducted several postgraduate- and master’s-level deep learning courses. All three authors have worked with deep learning methods since 2013 and have extensive experience in both teaching the topic and developing probabilistic deep learning models.
about the cover illustration

The figure on the cover of Probabilistic Deep Learning is captioned “Danseuse de l’Isle O-tahiti,” or A dancer from the island of Tahiti. The illustration is taken from a collection of dress costumes from various countries by Jacques Grasset de Saint-Sauveur (1757–1810), titled Costumes de Différents Pays, published in France in 1788. Each illustration is finely drawn and colored by hand. The rich variety of Grasset de Saint-Sauveur’s collection reminds us vividly of how culturally apart the world’s towns and regions were just 200 years ago. Isolated from each other, people spoke different dialects and languages. In the streets or in the countryside, it was easy to identify where they lived and what their trade or station in life was just by their dress.

The way we dress has changed since then and the diversity by region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants of different continents, let alone different towns, regions, or countries. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.

At a time when it is hard to tell one computer book from another, Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by Grasset de Saint-Sauveur’s pictures.