Python Data Science Handbook: Essential Tools for Working with Data, 2nd Edition (Jake VanderPlas)

Jake VanderPlas Python Data Science Handbook Essential Tools for Working with Data Second Edition

DATA

“This freshly updated edition offers clear, easy-to-follow examples that will help you successfully set up and use essential data science and machine learning tools.”
- Anne Bonner, Founder and CEO, Content Simplicity

Python Data Science Handbook

Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all: IPython, NumPy, pandas, Matplotlib, Scikit-Learn, and other related tools.

Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

With this handbook, you’ll learn how:

• IPython and Jupyter provide computational environments for scientists using Python
• NumPy includes the ndarray for efficient storage and manipulation of dense data arrays
• Pandas contains the DataFrame for efficient storage and manipulation of labeled/columnar data
• Matplotlib includes capabilities for a flexible range of data visualizations
• Scikit-Learn helps you build efficient and clean Python implementations of the most important and established machine learning algorithms

Jake VanderPlas is a software engineer at Google Research, working on tools that support data-intensive research. He creates and develops Python tools for use in data-intensive science, including packages like Scikit-Learn, SciPy, Astropy, Altair, JAX, and many others.

US $69.99 CAN $87.99
ISBN: 978-1-098-12122-8
Twitter: @oreillymedia
linkedin.com/company/oreilly-media
youtube.com/oreillymedia

Praise for Python Data Science Handbook, Second Edition

There are many data science books out there today but I find Jake VanderPlas’s book to be exceptional. He takes a subject that is very broad and complex and breaks it down in a way that makes it easy to understand with great writing and exercises that get you using the concepts quickly.
- Celeste Stinger, Site Reliability Engineer

Jake VanderPlas’s expertise and passion for sharing knowledge are undeniable. This freshly updated edition offers clear, easy-to-follow examples that will help you successfully set up and use essential data science and machine learning tools. If you’re ready to dive into core techniques for using Python-based tools to gain real insights from your data, this is the book for you!
- Anne Bonner, Founder and CEO, Content Simplicity

Python Data Science Handbook has been a favorite of mine for years for recommending to data science students. The second edition improves on an already amazing book complete with compelling Jupyter notebooks that allow you to execute your favorite data science recipe while you read along.
- Noah Gift, Duke Executive in Residence and Founder of Pragmatic AI Labs

This updated edition is a great introduction to the libraries that make Python a top language for data science and scientific computing, presented in an accessible style with great examples throughout.
- Allen Downey, author of Think Python and Think Bayes

Python Data Science Handbook is an excellent guide for readers learning the Python data science stack. With complete practical examples written in an approachable manner, the reader will undoubtedly learn how to effectively store, manipulate, and gain insight from a dataset.
- William Jamir Silva, Senior Software Engineer, Adjust GmbH

Jake VanderPlas has a history of breaking down core Python concepts and tooling for those learning data science, and in the second edition of Python Data Science Handbook he has done it once again. In this book, he provides an overview of all the tools one would need to get started as well as the background on why certain things are the way they are, and he does so in an accessible way.
- Jackie Kazil, Creator of the Mesa Library and Data Science Leader

Jake VanderPlas Python Data Science Handbook Essential Tools for Working with Data SECOND EDITION Beijing Boston Farnham Sebastopol Tokyo

Python Data Science Handbook
by Jake VanderPlas

Copyright © 2023 Jake VanderPlas. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Aaron Black
Development Editor: Jill Leonard
Production Editor: Katherine Tozer
Copyeditor: Rachel Head
Proofreader: James Fraleigh
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

December 2022: Second Edition

Revision History for the Second Edition
2022-12-06: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098121228 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Python Data Science Handbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

Python Data Science Handbook is available under the Creative Commons Attribution-Noncommercial-No Derivatives 4.0 International Public License. The author maintains an online version at https://github.com/jakevdp/PythonDataScienceHandbook.

978-1-098-12122-8
[LSI]

Table of Contents

Preface

Part I. Jupyter: Beyond Normal Python

1. Getting Started in IPython and Jupyter
    Launching the IPython Shell
    Launching the Jupyter Notebook
    Help and Documentation in IPython
    Accessing Documentation with ?
    Accessing Source Code with ??
    Exploring Modules with Tab Completion
    Keyboard Shortcuts in the IPython Shell
    Navigation Shortcuts
    Text Entry Shortcuts
    Command History Shortcuts
    Miscellaneous Shortcuts

2. Enhanced Interactive Features
    IPython Magic Commands
    Running External Code: %run
    Timing Code Execution: %timeit
    Help on Magic Functions: ?, %magic, and %lsmagic
    Input and Output History
    IPython’s In and Out Objects
    Underscore Shortcuts and Previous Outputs
    Suppressing Output
    Related Magic Commands
    IPython and Shell Commands
    Quick Introduction to the Shell
    Shell Commands in IPython
    Passing Values to and from the Shell
    Shell-Related Magic Commands

3. Debugging and Profiling
    Errors and Debugging
    Controlling Exceptions: %xmode
    Debugging: When Reading Tracebacks Is Not Enough
    Profiling and Timing Code
    Timing Code Snippets: %timeit and %time
    Profiling Full Scripts: %prun
    Line-by-Line Profiling with %lprun
    Profiling Memory Use: %memit and %mprun
    More IPython Resources
    Web Resources
    Books

Part II. Introduction to NumPy

4. Understanding Data Types in Python
    A Python Integer Is More Than Just an Integer
    A Python List Is More Than Just a List
    Fixed-Type Arrays in Python
    Creating Arrays from Python Lists
    Creating Arrays from Scratch
    NumPy Standard Data Types

5. The Basics of NumPy Arrays
    NumPy Array Attributes
    Array Indexing: Accessing Single Elements
    Array Slicing: Accessing Subarrays
    One-Dimensional Subarrays
    Multidimensional Subarrays
    Subarrays as No-Copy Views
    Creating Copies of Arrays
    Reshaping of Arrays
    Array Concatenation and Splitting
    Concatenation of Arrays
    Splitting of Arrays

6. Computation on NumPy Arrays: Universal Functions
    The Slowness of Loops
    Introducing Ufuncs
    Exploring NumPy’s Ufuncs
    Array Arithmetic
    Absolute Value
    Trigonometric Functions
    Exponents and Logarithms
    Specialized Ufuncs
    Advanced Ufunc Features
    Specifying Output
    Aggregations
    Outer Products
    Ufuncs: Learning More

7. Aggregations: min, max, and Everything in Between
    Summing the Values in an Array
    Minimum and Maximum
    Multidimensional Aggregates
    Other Aggregation Functions
    Example: What Is the Average Height of US Presidents?

8. Computation on Arrays: Broadcasting
    Introducing Broadcasting
    Rules of Broadcasting
    Broadcasting Example 1
    Broadcasting Example 2
    Broadcasting Example 3
    Broadcasting in Practice
    Centering an Array
    Plotting a Two-Dimensional Function

9. Comparisons, Masks, and Boolean Logic
    Example: Counting Rainy Days
    Comparison Operators as Ufuncs
    Working with Boolean Arrays
    Counting Entries
    Boolean Operators
    Boolean Arrays as Masks
    Using the Keywords and/or Versus the Operators &/|

10. Fancy Indexing
    Exploring Fancy Indexing
    Combined Indexing
    Example: Selecting Random Points
    Modifying Values with Fancy Indexing
    Example: Binning Data

11. Sorting Arrays
    Fast Sorting in NumPy: np.sort and np.argsort
    Sorting Along Rows or Columns
    Partial Sorts: Partitioning
    Example: k-Nearest Neighbors

12. Structured Data: NumPy’s Structured Arrays
    Exploring Structured Array Creation
    More Advanced Compound Types
    Record Arrays: Structured Arrays with a Twist
    On to Pandas

Part III. Data Manipulation with Pandas

13. Introducing Pandas Objects
    The Pandas Series Object
    Series as Generalized NumPy Array
    Series as Specialized Dictionary
    Constructing Series Objects
    The Pandas DataFrame Object
    DataFrame as Generalized NumPy Array
    DataFrame as Specialized Dictionary
    Constructing DataFrame Objects
    The Pandas Index Object
    Index as Immutable Array
    Index as Ordered Set

14. Data Indexing and Selection
    Data Selection in Series
    Series as Dictionary
    Series as One-Dimensional Array
    Indexers: loc and iloc
    Data Selection in DataFrames
    DataFrame as Dictionary
    DataFrame as Two-Dimensional Array
    Additional Indexing Conventions

15. Operating on Data in Pandas
    Ufuncs: Index Preservation
    Ufuncs: Index Alignment
    Index Alignment in Series
    Index Alignment in DataFrames
    Ufuncs: Operations Between DataFrames and Series

16. Handling Missing Data
    Trade-offs in Missing Data Conventions
    Missing Data in Pandas
    None as a Sentinel Value
    NaN: Missing Numerical Data
    NaN and None in Pandas
    Pandas Nullable Dtypes
    Operating on Null Values
    Detecting Null Values
    Dropping Null Values
    Filling Null Values

17. Hierarchical Indexing
    A Multiply Indexed Series
    The Bad Way
    The Better Way: The Pandas MultiIndex
    MultiIndex as Extra Dimension
    Methods of MultiIndex Creation
    Explicit MultiIndex Constructors
    MultiIndex Level Names
    MultiIndex for Columns
    Indexing and Slicing a MultiIndex
    Multiply Indexed Series
    Multiply Indexed DataFrames
    Rearranging Multi-Indexes
    Sorted and Unsorted Indices
    Stacking and Unstacking Indices
    Index Setting and Resetting

18. Combining Datasets: concat and append
    Recall: Concatenation of NumPy Arrays
    Simple Concatenation with pd.concat
    Duplicate Indices
    Concatenation with Joins
    The append Method

19. Combining Datasets: merge and join
    Relational Algebra
    Categories of Joins
    One-to-One Joins
    Many-to-One Joins
    Many-to-Many Joins
    Specification of the Merge Key
    The on Keyword
    The left_on and right_on Keywords
    The left_index and right_index Keywords
    Specifying Set Arithmetic for Joins
    Overlapping Column Names: The suffixes Keyword
    Example: US States Data

20. Aggregation and Grouping
    Planets Data
    Simple Aggregation in Pandas
    groupby: Split, Apply, Combine
    Split, Apply, Combine
    The GroupBy Object
    Aggregate, Filter, Transform, Apply
    Specifying the Split Key
    Grouping Example

21. Pivot Tables
    Motivating Pivot Tables
    Pivot Tables by Hand
    Pivot Table Syntax
    Multilevel Pivot Tables
    Additional Pivot Table Options
    Example: Birthrate Data

22. Vectorized String Operations
    Introducing Pandas String Operations
    Tables of Pandas String Methods
    Methods Similar to Python String Methods
    Methods Using Regular Expressions
    Miscellaneous Methods
    Example: Recipe Database
    A Simple Recipe Recommender
    Going Further with Recipes

23. Working with Time Series
    Dates and Times in Python
    Native Python Dates and Times: datetime and dateutil
    Typed Arrays of Times: NumPy’s datetime64
    Dates and Times in Pandas: The Best of Both Worlds
    Pandas Time Series: Indexing by Time
    Pandas Time Series Data Structures
    Regular Sequences: pd.date_range
    Frequencies and Offsets
    Resampling, Shifting, and Windowing
    Resampling and Converting Frequencies
    Time Shifts
    Rolling Windows
    Example: Visualizing Seattle Bicycle Counts
    Visualizing the Data
    Digging into the Data

24. High-Performance Pandas: eval and query
    Motivating query and eval: Compound Expressions
    pandas.eval for Efficient Operations
    DataFrame.eval for Column-Wise Operations
    Assignment in DataFrame.eval
    Local Variables in DataFrame.eval
    The DataFrame.query Method
    Performance: When to Use These Functions
    Further Resources

Part IV. Visualization with Matplotlib

25. General Matplotlib Tips
    Importing Matplotlib
    Setting Styles
    show or No show? How to Display Your Plots
    Plotting from a Script
    Plotting from an IPython Shell
    Plotting from a Jupyter Notebook
    Saving Figures to File
    Two Interfaces for the Price of One

26. Simple Line Plots
    Adjusting the Plot: Line Colors and Styles
    Adjusting the Plot: Axes Limits
    Labeling Plots
    Matplotlib Gotchas

27. Simple Scatter Plots
    Scatter Plots with plt.plot
    Scatter Plots with plt.scatter
    plot Versus scatter: A Note on Efficiency
    Visualizing Uncertainties
    Basic Errorbars
    Continuous Errors

28. Density and Contour Plots
    Visualizing a Three-Dimensional Function
    Histograms, Binnings, and Density
    Two-Dimensional Histograms and Binnings
    plt.hist2d: Two-Dimensional Histogram
    plt.hexbin: Hexagonal Binnings
    Kernel Density Estimation

29. Customizing Plot Legends
    Choosing Elements for the Legend
    Legend for Size of Points
    Multiple Legends

30. Customizing Colorbars
    Customizing Colorbars
    Choosing the Colormap
    Color Limits and Extensions
    Discrete Colorbars
    Example: Handwritten Digits

31. Multiple Subplots
    plt.axes: Subplots by Hand
    plt.subplot: Simple Grids of Subplots
    plt.subplots: The Whole Grid in One Go
    plt.GridSpec: More Complicated Arrangements

32. Text and Annotation
    Example: Effect of Holidays on US Births
    Transforms and Text Position
    Arrows and Annotation

33. Customizing Ticks
    Major and Minor Ticks
    Hiding Ticks or Labels
    Reducing or Increasing the Number of Ticks
    Fancy Tick Formats
    Summary of Formatters and Locators

34. Customizing Matplotlib: Configurations and Stylesheets
    Plot Customization by Hand
    Changing the Defaults: rcParams
    Stylesheets
    Default Style
    FiveThirtyEight Style
    ggplot Style
    Bayesian Methods for Hackers Style
    Dark Background Style
    Grayscale Style
    Seaborn Style

35. Three-Dimensional Plotting in Matplotlib
    Three-Dimensional Points and Lines
    Three-Dimensional Contour Plots
    Wireframes and Surface Plots
    Surface Triangulations
    Example: Visualizing a Möbius Strip

36. Visualization with Seaborn
    Exploring Seaborn Plots
    Histograms, KDE, and Densities
    Pair Plots
    Faceted Histograms
    Categorical Plots
    Joint Distributions
    Bar Plots
    Example: Exploring Marathon Finishing Times
    Further Resources
    Other Python Visualization Libraries

Part V. Machine Learning

37. What Is Machine Learning?
    Categories of Machine Learning
    Qualitative Examples of Machine Learning Applications
    Classification: Predicting Discrete Labels
    Regression: Predicting Continuous Labels
    Clustering: Inferring Labels on Unlabeled Data
    Dimensionality Reduction: Inferring Structure of Unlabeled Data
    Summary

38. Introducing Scikit-Learn
    Data Representation in Scikit-Learn
    The Features Matrix
    The Target Array
    The Estimator API
    Basics of the API
    Supervised Learning Example: Simple Linear Regression
    Supervised Learning Example: Iris Classification
    Unsupervised Learning Example: Iris Dimensionality
    Unsupervised Learning Example: Iris Clustering
    Application: Exploring Handwritten Digits
    Loading and Visualizing the Digits Data
    Unsupervised Learning Example: Dimensionality Reduction
    Classification on Digits
    Summary

39. Hyperparameters and Model Validation
    Thinking About Model Validation
    Model Validation the Wrong Way
    Model Validation the Right Way: Holdout Sets
    Model Validation via Cross-Validation
    Selecting the Best Model
    The Bias-Variance Trade-off
    Validation Curves in Scikit-Learn
    Learning Curves
    Validation in Practice: Grid Search
    Summary

40. Feature Engineering
    Categorical Features
    Text Features
    Image Features
    Derived Features
    Imputation of Missing Data
    Feature Pipelines

41. In Depth: Naive Bayes Classification
    Bayesian Classification
    Gaussian Naive Bayes
    Multinomial Naive Bayes
    Example: Classifying Text
    When to Use Naive Bayes

42. In Depth: Linear Regression
    Simple Linear Regression
    Basis Function Regression
    Polynomial Basis Functions
    Gaussian Basis Functions
    Regularization
    Ridge Regression (L2 Regularization)
    Lasso Regression (L1 Regularization)
    Example: Predicting Bicycle Traffic

43. In Depth: Support Vector Machines
    Motivating Support Vector Machines
    Support Vector Machines: Maximizing the Margin
    Fitting a Support Vector Machine
    Beyond Linear Boundaries: Kernel SVM
    Tuning the SVM: Softening Margins
    Example: Face Recognition
    Summary

44. In Depth: Decision Trees and Random Forests
    Motivating Random Forests: Decision Trees
    Creating a Decision Tree
    Decision Trees and Overfitting
    Ensembles of Estimators: Random Forests
    Random Forest Regression
    Example: Random Forest for Classifying Digits
    Summary

45. In Depth: Principal Component Analysis
    Introducing Principal Component Analysis
    PCA as Dimensionality Reduction
    PCA for Visualization: Handwritten Digits
    What Do the Components Mean?
    Choosing the Number of Components
    PCA as Noise Filtering
    Example: Eigenfaces
    Summary

46. In Depth: Manifold Learning
    Manifold Learning: “HELLO”
    Multidimensional Scaling
    MDS as Manifold Learning
    Nonlinear Embeddings: Where MDS Fails
    Nonlinear Manifolds: Locally Linear Embedding
    Some Thoughts on Manifold Methods
    Example: Isomap on Faces
    Example: Visualizing Structure in Digits

47. In Depth: k-Means Clustering
    Introducing k-Means
    Expectation–Maximization
    Examples
    Example 1: k-Means on Digits
    Example 2: k-Means for Color Compression

48. In Depth: Gaussian Mixture Models
    Motivating Gaussian Mixtures: Weaknesses of k-Means
    Generalizing E–M: Gaussian Mixture Models
    Choosing the Covariance Type
    Gaussian Mixture Models as Density Estimation
    Example: GMMs for Generating New Data

49. In Depth: Kernel Density Estimation
    Motivating Kernel Density Estimation: Histograms
    Kernel Density Estimation in Practice
    Selecting the Bandwidth via Cross-Validation
    Example: Not-so-Naive Bayes
    Anatomy of a Custom Estimator
    Using Our Custom Estimator

50. Application: A Face Detection Pipeline
    HOG Features
    HOG in Action: A Simple Face Detector
    1. Obtain a Set of Positive Training Samples
    2. Obtain a Set of Negative Training Samples
    3. Combine Sets and Extract HOG Features
    4. Train a Support Vector Machine
    5. Find Faces in a New Image
    Caveats and Improvements
    Further Machine Learning Resources

Index
