Python Data Science Handbook
Essential Tools for Working with Data
Jake VanderPlas
Python Data Science Handbook
by Jake VanderPlas

Copyright © 2017 Jake VanderPlas. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Dawn Schanafelt
Production Editor: Kristen Brown
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Monaghan
Indexer: WordCo Indexing Services, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

December 2016: First Edition

Revision History for the First Edition
2016-11-17: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491912058 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Python Data Science Handbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-491-91205-8
[LSI]
Table of Contents

Preface

1. IPython: Beyond Normal Python
    Shell or Notebook?
    Launching the IPython Shell
    Launching the Jupyter Notebook
    Help and Documentation in IPython
    Accessing Documentation with ?
    Accessing Source Code with ??
    Exploring Modules with Tab Completion
    Keyboard Shortcuts in the IPython Shell
    Navigation Shortcuts
    Text Entry Shortcuts
    Command History Shortcuts
    Miscellaneous Shortcuts
    IPython Magic Commands
    Pasting Code Blocks: %paste and %cpaste
    Running External Code: %run
    Timing Code Execution: %timeit
    Help on Magic Functions: ?, %magic, and %lsmagic
    Input and Output History
    IPython’s In and Out Objects
    Underscore Shortcuts and Previous Outputs
    Suppressing Output
    Related Magic Commands
    IPython and Shell Commands
    Quick Introduction to the Shell
    Shell Commands in IPython
    Passing Values to and from the Shell
    Shell-Related Magic Commands
    Errors and Debugging
    Controlling Exceptions: %xmode
    Debugging: When Reading Tracebacks Is Not Enough
    Profiling and Timing Code
    Timing Code Snippets: %timeit and %time
    Profiling Full Scripts: %prun
    Line-by-Line Profiling with %lprun
    Profiling Memory Use: %memit and %mprun
    More IPython Resources
    Web Resources
    Books

2. Introduction to NumPy
    Understanding Data Types in Python
    A Python Integer Is More Than Just an Integer
    A Python List Is More Than Just a List
    Fixed-Type Arrays in Python
    Creating Arrays from Python Lists
    Creating Arrays from Scratch
    NumPy Standard Data Types
    The Basics of NumPy Arrays
    NumPy Array Attributes
    Array Indexing: Accessing Single Elements
    Array Slicing: Accessing Subarrays
    Reshaping of Arrays
    Array Concatenation and Splitting
    Computation on NumPy Arrays: Universal Functions
    The Slowness of Loops
    Introducing UFuncs
    Exploring NumPy’s UFuncs
    Advanced Ufunc Features
    Ufuncs: Learning More
    Aggregations: Min, Max, and Everything in Between
    Summing the Values in an Array
    Minimum and Maximum
    Example: What Is the Average Height of US Presidents?
    Computation on Arrays: Broadcasting
    Introducing Broadcasting
    Rules of Broadcasting
    Broadcasting in Practice
    Comparisons, Masks, and Boolean Logic
    Example: Counting Rainy Days
    Comparison Operators as ufuncs
    Working with Boolean Arrays
    Boolean Arrays as Masks
    Fancy Indexing
    Exploring Fancy Indexing
    Combined Indexing
    Example: Selecting Random Points
    Modifying Values with Fancy Indexing
    Example: Binning Data
    Sorting Arrays
    Fast Sorting in NumPy: np.sort and np.argsort
    Partial Sorts: Partitioning
    Example: k-Nearest Neighbors
    Structured Data: NumPy’s Structured Arrays
    Creating Structured Arrays
    More Advanced Compound Types
    RecordArrays: Structured Arrays with a Twist
    On to Pandas

3. Data Manipulation with Pandas
    Installing and Using Pandas
    Introducing Pandas Objects
    The Pandas Series Object
    The Pandas DataFrame Object
    The Pandas Index Object
    Data Indexing and Selection
    Data Selection in Series
    Data Selection in DataFrame
    Operating on Data in Pandas
    Ufuncs: Index Preservation
    UFuncs: Index Alignment
    Ufuncs: Operations Between DataFrame and Series
    Handling Missing Data
    Trade-Offs in Missing Data Conventions
    Missing Data in Pandas
    Operating on Null Values
    Hierarchical Indexing
    A Multiply Indexed Series
    Methods of MultiIndex Creation
    Indexing and Slicing a MultiIndex
    Rearranging Multi-Indices
    Data Aggregations on Multi-Indices
    Combining Datasets: Concat and Append
    Recall: Concatenation of NumPy Arrays
    Simple Concatenation with pd.concat
    Combining Datasets: Merge and Join
    Relational Algebra
    Categories of Joins
    Specification of the Merge Key
    Specifying Set Arithmetic for Joins
    Overlapping Column Names: The suffixes Keyword
    Example: US States Data
    Aggregation and Grouping
    Planets Data
    Simple Aggregation in Pandas
    GroupBy: Split, Apply, Combine
    Pivot Tables
    Motivating Pivot Tables
    Pivot Tables by Hand
    Pivot Table Syntax
    Example: Birthrate Data
    Vectorized String Operations
    Introducing Pandas String Operations
    Tables of Pandas String Methods
    Example: Recipe Database
    Working with Time Series
    Dates and Times in Python
    Pandas Time Series: Indexing by Time
    Pandas Time Series Data Structures
    Frequencies and Offsets
    Resampling, Shifting, and Windowing
    Where to Learn More
    Example: Visualizing Seattle Bicycle Counts
    High-Performance Pandas: eval() and query()
    Motivating query() and eval(): Compound Expressions
    pandas.eval() for Efficient Operations
    DataFrame.eval() for Column-Wise Operations
    DataFrame.query() Method
    Performance: When to Use These Functions
    Further Resources
4. Visualization with Matplotlib
    General Matplotlib Tips
    Importing matplotlib
    Setting Styles
    show() or No show()? How to Display Your Plots
    Saving Figures to File
    Two Interfaces for the Price of One
    Simple Line Plots
    Adjusting the Plot: Line Colors and Styles
    Adjusting the Plot: Axes Limits
    Labeling Plots
    Simple Scatter Plots
    Scatter Plots with plt.plot
    Scatter Plots with plt.scatter
    plot Versus scatter: A Note on Efficiency
    Visualizing Errors
    Basic Errorbars
    Continuous Errors
    Density and Contour Plots
    Visualizing a Three-Dimensional Function
    Histograms, Binnings, and Density
    Two-Dimensional Histograms and Binnings
    Customizing Plot Legends
    Choosing Elements for the Legend
    Legend for Size of Points
    Multiple Legends
    Customizing Colorbars
    Example: Handwritten Digits
    Multiple Subplots
    plt.axes: Subplots by Hand
    plt.subplot: Simple Grids of Subplots
    plt.subplots: The Whole Grid in One Go
    plt.GridSpec: More Complicated Arrangements
    Text and Annotation
    Example: Effect of Holidays on US Births
    Transforms and Text Position
    Arrows and Annotation
    Customizing Ticks
    Major and Minor Ticks
    Hiding Ticks or Labels
    Reducing or Increasing the Number of Ticks
    Fancy Tick Formats
    Summary of Formatters and Locators
    Customizing Matplotlib: Configurations and Stylesheets
    Plot Customization by Hand
    Changing the Defaults: rcParams
    Stylesheets
    Three-Dimensional Plotting in Matplotlib
    Three-Dimensional Points and Lines
    Three-Dimensional Contour Plots
    Wireframes and Surface Plots
    Surface Triangulations
    Geographic Data with Basemap
    Map Projections
    Drawing a Map Background
    Plotting Data on Maps
    Example: California Cities
    Example: Surface Temperature Data
    Visualization with Seaborn
    Seaborn Versus Matplotlib
    Exploring Seaborn Plots
    Example: Exploring Marathon Finishing Times
    Further Resources
    Matplotlib Resources
    Other Python Graphics Libraries

5. Machine Learning
    What Is Machine Learning?
    Categories of Machine Learning
    Qualitative Examples of Machine Learning Applications
    Summary
    Introducing Scikit-Learn
    Data Representation in Scikit-Learn
    Scikit-Learn’s Estimator API
    Application: Exploring Handwritten Digits
    Summary
    Hyperparameters and Model Validation
    Thinking About Model Validation
    Selecting the Best Model
    Learning Curves
    Validation in Practice: Grid Search
    Summary
    Feature Engineering
    Categorical Features
    Text Features
    Image Features
    Derived Features
    Imputation of Missing Data
    Feature Pipelines
    In Depth: Naive Bayes Classification
    Bayesian Classification
    Gaussian Naive Bayes
    Multinomial Naive Bayes
    When to Use Naive Bayes
    In Depth: Linear Regression
    Simple Linear Regression
    Basis Function Regression
    Regularization
    Example: Predicting Bicycle Traffic
    In-Depth: Support Vector Machines
    Motivating Support Vector Machines
    Support Vector Machines: Maximizing the Margin
    Example: Face Recognition
    Support Vector Machine Summary
    In-Depth: Decision Trees and Random Forests
    Motivating Random Forests: Decision Trees
    Ensembles of Estimators: Random Forests
    Random Forest Regression
    Example: Random Forest for Classifying Digits
    Summary of Random Forests
    In Depth: Principal Component Analysis
    Introducing Principal Component Analysis
    PCA as Noise Filtering
    Example: Eigenfaces
    Principal Component Analysis Summary
    In-Depth: Manifold Learning
    Manifold Learning: “HELLO”
    Multidimensional Scaling (MDS)
    MDS as Manifold Learning
    Nonlinear Embeddings: Where MDS Fails
    Nonlinear Manifolds: Locally Linear Embedding
    Some Thoughts on Manifold Methods
    Example: Isomap on Faces
    Example: Visualizing Structure in Digits
    In Depth: k-Means Clustering
    Introducing k-Means
    k-Means Algorithm: Expectation–Maximization
    Examples
    In Depth: Gaussian Mixture Models
    Motivating GMM: Weaknesses of k-Means
    Generalizing E–M: Gaussian Mixture Models
    GMM as Density Estimation
    Example: GMM for Generating New Data
    In-Depth: Kernel Density Estimation
    Motivating KDE: Histograms
    Kernel Density Estimation in Practice
    Example: KDE on a Sphere
    Example: Not-So-Naive Bayes
    Application: A Face Detection Pipeline
    HOG Features
    HOG in Action: A Simple Face Detector
    Caveats and Improvements
    Further Machine Learning Resources
    Machine Learning in Python
    General Machine Learning

Index
Preface

What Is Data Science?

This is a book about doing data science with Python, which immediately raises the question: what is data science? It’s a surprisingly hard definition to nail down, especially given how ubiquitous the term has become. Vocal critics have variously dismissed the term as a superfluous label (after all, what science doesn’t involve data?) or a simple buzzword that only exists to salt résumés and catch the eye of overzealous tech recruiters.

In my mind, these critiques miss something important. Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia. This cross-disciplinary piece is key: in my mind, the best existing definition of data science is illustrated by Drew Conway’s Data Science Venn Diagram, first published on his blog in September 2010 (see Figure P-1).

[Figure P-1. Drew Conway’s Data Science Venn Diagram]
While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say “data science”: it is fundamentally an interdisciplinary subject. Data science comprises three distinct and overlapping areas: the skills of a statistician who knows how to model and summarize datasets (which are growing ever larger); the skills of a computer scientist who can design and use algorithms to efficiently store, process, and visualize this data; and the domain expertise—what we might think of as “classical” training in a subject—necessary both to formulate the right questions and to put their answers in context.

With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but as a new set of skills that you can apply within your current area of expertise. Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, the goal of this book is to give you the ability to ask and answer new questions about your chosen subject area.

Who Is This Book For?

In my teaching both at the University of Washington and at various tech-focused conferences and meetups, one of the most common questions I have heard is this: “how should I learn Python?” The people asking are generally technically minded students, developers, or researchers, often with an already strong background in writing code and using computational and numerical tools. Most of these folks don’t want to learn Python per se, but want to learn the language with the aim of using it as a tool for data-intensive and computational science. While a large patchwork of videos, blog posts, and tutorials for this audience is available online, I’ve long been frustrated by the lack of a single good answer to this question; that is what inspired this book.

The book is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks. Instead, it is meant to help Python users learn to use Python’s data science stack—libraries such as IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related tools—to effectively store, manipulate, and gain insight from data.

Why Python?

Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind.
The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: NumPy for manipulation of homogeneous array-based data, Pandas for manipulation of heterogeneous and labeled data, SciPy for common scientific computing tasks, Matplotlib for publication-quality visualizations, IPython for interactive execution and sharing of code, Scikit-Learn for machine learning, and many more tools that will be mentioned in the following pages.

If you are looking for a guide to the Python language itself, I would suggest the sister project to this book, A Whirlwind Tour of the Python Language. This short report provides a tour of the essential features of the Python language, aimed at data scientists who already are familiar with one or more other programming languages.

Python 2 Versus Python 3

This book uses the syntax of Python 3, which contains language enhancements that are not compatible with the 2.x series of Python. Though Python 3.0 was first released in 2008, adoption has been relatively slow, particularly in the scientific and web development communities. This is primarily because it took some time for many of the essential third-party packages and toolkits to be made compatible with the new language internals. Since early 2014, however, stable releases of the most important tools in the data science ecosystem have been fully compatible with both Python 2 and 3, and so this book will use the newer Python 3 syntax. However, the vast majority of code snippets in this book will also work without modification in Python 2: in cases where a Py2-incompatible syntax is used, I will make every effort to note it explicitly.
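As a quick illustration of the kind of incompatibility to watch for, here is a small snippet of my own (not one of the book’s examples) showing two of the most common differences between the two versions:

    # Python 3 syntax, as used throughout this book
    print("Hello, world")   # print is a function in Python 3;
                            # in Python 2, the statement form `print "Hello, world"` was valid

    ratio = 3 / 2           # true division: gives 1.5 in Python 3;
                            # in Python 2, 3 / 2 performs integer division and gives 1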
Outline of This Book

Each chapter of this book focuses on a particular package or tool that contributes a fundamental piece of the Python data science story.

IPython and Jupyter (Chapter 1)
    These packages provide the computational environment in which many Python-using data scientists work.

NumPy (Chapter 2)
    This library provides the ndarray object for efficient storage and manipulation of dense data arrays in Python.

Pandas (Chapter 3)
    This library provides the DataFrame object for efficient storage and manipulation of labeled/columnar data in Python.

Matplotlib (Chapter 4)
    This library provides capabilities for a flexible range of data visualizations in Python.

Scikit-Learn (Chapter 5)
    This library provides efficient and clean Python implementations of the most important and established machine learning algorithms.

The PyData world is certainly much larger than these five packages, and is growing every day. With this in mind, I make every attempt through these pages to provide references to other interesting efforts, projects, and packages that are pushing the boundaries of what can be done in Python. Nevertheless, these five are currently fundamental to much of the work being done in the Python data science space, and I expect they will remain important even as the ecosystem continues growing around them.

Using Code Examples

Supplemental material (code examples, figures, etc.) is available for download at https://github.com/jakevdp/PythonDataScienceHandbook. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example, “Python Data Science Handbook by Jake VanderPlas (O’Reilly). Copyright 2017 Jake VanderPlas, 978-1-491-91205-8.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

Installation Considerations

Installing Python and the suite of libraries that enable scientific computing is straightforward. This section will outline some of the considerations to keep in mind when setting up your computer.

Though there are various ways to install Python, the one I would suggest for use in data science is the Anaconda distribution, which works similarly whether you use Windows, Linux, or Mac OS X. The Anaconda distribution comes in two flavors:

• Miniconda gives you the Python interpreter itself, along with a command-line tool called conda that operates as a cross-platform package manager geared toward Python packages, similar in spirit to the apt or yum tools that Linux users might be familiar with.
• Anaconda includes both Python and conda, and additionally bundles a suite of other preinstalled packages geared toward scientific computing. Because of the size of this bundle, expect the installation to consume several gigabytes of disk space.

Any of the packages included with Anaconda can also be installed manually on top of Miniconda; for this reason I suggest starting with Miniconda.

To get started, download and install the Miniconda package (make sure to choose a version with Python 3), and then install the core packages used in this book:

    [~]$ conda install numpy pandas scikit-learn matplotlib seaborn ipython-notebook

Throughout the text, we will also make use of other, more specialized tools in Python’s scientific ecosystem; installation is usually as easy as typing conda install packagename. For more information on conda, including information about creating and using conda environments (which I would highly recommend), refer to conda’s online documentation.
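As an illustrative sketch of such an environment (the environment name, Python version, and package list here are arbitrary choices, not something the book prescribes), the setup might look like this:

    [~]$ conda create --name pydata python=3.5 numpy pandas scikit-learn matplotlib seaborn
    [~]$ source activate pydata          # on Windows: activate pydata
    (pydata) [~]$ conda install ipython jupyter   # add the interactive tools inside the environment

Keeping the book’s packages in a dedicated environment like this makes it easy to experiment without disturbing your system Python installation.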
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

O’Reilly Safari

Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/python-data-sci-handbook.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
CHAPTER 1
IPython: Beyond Normal Python

There are many options for development environments for Python, and I’m often asked which one I use in my own work. My answer sometimes surprises people: my preferred environment is IPython plus a text editor (in my case, Emacs or Atom depending on my mood). IPython (short for Interactive Python) was started in 2001 by Fernando Perez as an enhanced Python interpreter, and has since grown into a project aiming to provide, in Perez’s words, “Tools for the entire lifecycle of research computing.” If Python is the engine of our data science task, you might think of IPython as the interactive control panel.

As well as being a useful interactive interface to Python, IPython also provides a number of useful syntactic additions to the language; we’ll cover the most useful of these additions here. In addition, IPython is closely tied with the Jupyter project, which provides a browser-based notebook that is useful for development, collaboration, sharing, and even publication of data science results. The IPython notebook is actually a special case of the broader Jupyter notebook structure, which encompasses notebooks for Julia, R, and other programming languages. As an example of the usefulness of the notebook format, look no further than the page you are reading: the entire manuscript for this book was composed as a set of IPython notebooks.

IPython is about using Python effectively for interactive scientific and data-intensive computing. This chapter will start by stepping through some of the IPython features that are useful to the practice of data science, focusing especially on the syntax it offers beyond the standard features of Python. Next, we will go into a bit more depth on some of the more useful “magic commands” that can speed up common tasks in creating and using data science code. Finally, we will touch on some of the features of the notebook that make it useful in understanding data and sharing results.
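As a small taste of these additions before we dive in, the following illustrative session (the particular object and expression are arbitrary choices of mine, not examples drawn from later sections) shows two features covered in this chapter: appending a question mark to a name to display its documentation, and using a magic command such as %timeit to time a short statement:

    In [1]: import numpy as np

    In [2]: np.sum?

    In [3]: %timeit sum(range(1000))

Both of these conveniences, help with ? and magic commands like %timeit, are discussed in detail in the sections that follow.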
Shell or Notebook?

There are two primary means of using IPython that we’ll discuss in this chapter: the IPython shell and the IPython notebook. The bulk of the material in this chapter is relevant to both, and the examples will switch between them depending on what is most convenient. In the few sections that are relevant to just one or the other, I will explicitly state that fact. Before we start, some words on how to launch the IPython shell and IPython notebook.

Launching the IPython Shell

This chapter, like most of this book, is not designed to be absorbed passively. I recommend that as you read through it, you follow along and experiment with the tools and syntax we cover: the muscle memory you build through doing this will be far more useful than the simple act of reading about it. Start by launching the IPython interpreter by typing ipython on the command line; alternatively, if you’ve installed a distribution like Anaconda or EPD, there may be a launcher specific to your system (we’ll discuss this more fully in “Help and Documentation in IPython” on page 3). Once you do this, you should see a prompt like the following:

    IPython 4.0.1 -- An enhanced Interactive Python.
    ?         -> Introduction and overview of IPython's features.
    %quickref -> Quick reference.
    help      -> Python's own help system.
    object?   -> Details about 'object', use 'object??' for extra details.

    In [1]:

With that, you’re ready to follow along.

Launching the Jupyter Notebook

The Jupyter notebook is a browser-based graphical interface to the IPython shell, and builds on it a rich set of dynamic display capabilities. As well as executing Python/IPython statements, the notebook allows the user to include formatted text, static and dynamic visualizations, mathematical equations, JavaScript widgets, and much more. Furthermore, these documents can be saved in a way that lets other people open them and execute the code on their own systems.

Though the IPython notebook is viewed and edited through your web browser window, it must connect to a running Python process in order to execute code. To start this process (known as a “kernel”), run the following command in your system shell:

    $ jupyter notebook

This command will launch a local web server that will be visible to your browser. It immediately spits out a log showing what it is doing; that log will look something like this: