Author: Kyle Gallatin, Chris Albon

This practical guide provides more than 200 self-contained recipes to help you solve machine learning challenges you may encounter in your work. If you're comfortable with Python and its libraries, including pandas and scikit-learn, you'll be able to address specific problems all the way from loading data to training models and leveraging neural networks. Each recipe in this updated edition includes code that you can copy, paste, and run with a toy dataset to ensure it works. From there, you can adapt these recipes according to your use case or application. Recipes include a discussion that explains the solution and provides meaningful context. Go beyond theory and concepts by learning the nuts and bolts you need to construct working machine learning applications.

You'll find recipes for:
Vectors, matrices, and arrays
Working with data from CSV, JSON, SQL, databases, cloud storage, and other sources
Handling numerical and categorical data, text, images, and dates and times
Dimensionality reduction using feature extraction or feature selection
Model evaluation and selection
Linear and logistic regression, trees and forests, and k-nearest neighbors
Support vector machines (SVM), naive Bayes, clustering, and tree-based models
Saving and loading trained models from multiple frameworks

ISBN: 1098135725
Publisher: O'Reilly Media
Publish Year: 2023
Language: English
Pages: 404
File Format: PDF
File Size: 3.4 MB
Text Preview (First 20 pages)
Machine Learning with Python Cookbook
SECOND EDITION
Practical Solutions from Preprocessing to Deep Learning
Kyle Gallatin and Chris Albon
Machine Learning with Python Cookbook
by Kyle Gallatin and Chris Albon

Copyright © 2023 Kyle Gallatin. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Nicole Butterfield
Development Editor: Jeff Bleiel
Production Editor: Clare Laylock
Copyeditor: Penelope Perkins
Proofreader: Piper Editorial Consulting, LLC
Indexer: Potomac Indexing, LLC
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

April 2018: First Edition
July 2023: Second Edition

Revision History for the Second Edition
2023-07-27: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098135720 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Machine Learning with Python Cookbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-13572-0
[LSI]
Preface

When the first edition of this book was published in 2018, it filled a critical gap in the growing wealth of machine learning (ML) content. By providing well-tested, hands-on Python recipes, it enabled practitioners to copy and paste code before easily adapting it to their use cases. In a short five years, the ML space has continued to explode with advances in deep learning (DL) and the associated DL Python frameworks.

Now, in 2023, there is a need for the same sort of hands-on content that serves the needs of both ML and DL practitioners with the latest Python libraries. This book intends to build on the existing (and fantastic) work done by the author of the first edition by:

Updating existing examples to use the latest Python versions and frameworks
Incorporating modern practices in data sources, data analysis, ML, and DL
Expanding the DL content to include tensors, neural networks, and DL for text and vision in PyTorch
Taking our models one step further by serving them in an API

Like the first edition, this book takes a task-based approach to machine learning, boasting over 200 self-contained solutions (copy, paste, and run) for the most common tasks a data scientist or machine learning engineer building a model will run into.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

Using Code Examples

This book is accompanied by a GitHub repository that has instructions for running a Jupyter Notebook in a Docker container with all dependencies used in this book. By replicating the commands from this book in the notebook, you can ensure the examples in this book will be completely reproducible.

If you have a technical question or a problem using the code examples, please send an email to support@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Machine Learning with Python Cookbook, 2nd ed., by Kyle Gallatin and Chris Albon (O’Reilly). Copyright 2023 Kyle Gallatin, 978-1-098-13572-0.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

NOTE
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.xhtml

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/ml_python_2e.

For news and information about our books and courses, visit https://oreilly.com.

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

The second edition of this book is clearly only possible because of the fantastic content, structure, and quality laid out in the first edition by the original author, Chris Albon. As the first author of the second edition, I cannot overstate the degree to which this made my job way, way easier.

Of course, the machine learning space also evolves rapidly, and the updates included in this second edition could not have been written without the thoughtful feedback of my peers. I’d specifically like to thank my fellow Etsy coworkers Andrea Heyman, Maria Gomez, Alek Maelstrum, and Brian Schmidt for acquiescing to requests for input on various chapters and being unwillingly coaxed into sudden brainstorming sessions that shaped the new content added to this edition. I’d also like to thank the technical reviewers—Jigyasa Grover, Matteus Tanha, and Ganesh Harke—along with the O’Reilly editors: Jeff Bleiel, Nicole Butterfield, and Clare Laylock.

That being said, the number of people who’ve helped me and this book get to the place it’s at (in one way or another) is massive. I’d love to thank everyone who’s been a part of my ML journey in one way or another and helped make this book what it is. Love y’all.
Chapter 1. Working with Vectors, Matrices, and Arrays in NumPy

1.0 Introduction

NumPy is a foundational tool of the Python machine learning stack. NumPy allows for efficient operations on the data structures often used in machine learning: vectors, matrices, and tensors. While NumPy isn’t the focus of this book, it will show up frequently in the following chapters. This chapter covers the most common NumPy operations we’re likely to run into while working on machine learning workflows.

1.1 Creating a Vector

Problem
You need to create a vector.

Solution
Use NumPy to create a one-dimensional array:

# Load library
import numpy as np

# Create a vector as a row
vector_row = np.array([1, 2, 3])

# Create a vector as a column
vector_column = np.array([[1],
                          [2],
                          [3]])

Discussion
NumPy’s main data structure is the multidimensional array. A vector is just an array with a single dimension. To create a vector, we simply create a one-dimensional array. Just like vectors, these arrays can be represented horizontally (i.e., rows) or vertically (i.e., columns).

See Also
Vectors, Math Is Fun
Euclidean vector, Wikipedia
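As a quick, illustrative addition (not part of the book's recipe), checking the shape attribute makes the row/column distinction above concrete:

# Illustrative check of the arrays created above
import numpy as np

vector_row = np.array([1, 2, 3])
vector_column = np.array([[1],
                          [2],
                          [3]])

# A row vector has a single dimension; a column vector is a 2D array with one column
print(vector_row.shape)     # (3,)
print(vector_column.shape)  # (3, 1)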
1.2 Creating a Matrix

Problem
You need to create a matrix.

Solution
Use NumPy to create a two-dimensional array:

# Load library
import numpy as np

# Create a matrix
matrix = np.array([[1, 2],
                   [1, 2],
                   [1, 2]])

Discussion
To create a matrix we can use a NumPy two-dimensional array. In our solution, the matrix contains three rows and two columns (a column of 1s and a column of 2s).

NumPy actually has a dedicated matrix data structure:

matrix_object = np.mat([[1, 2],
                        [1, 2],
                        [1, 2]])
matrix([[1, 2],
        [1, 2],
        [1, 2]])

However, the matrix data structure is not recommended for two reasons. First, arrays are the de facto standard data structure of NumPy. Second, the vast majority of NumPy operations return arrays, not matrix objects.

See Also
Matrix, Wikipedia
Matrix, Wolfram MathWorld

1.3 Creating a Sparse Matrix

Problem
Given data with very few nonzero values, you want to efficiently represent it.
Solution
Create a sparse matrix:

# Load libraries
import numpy as np
from scipy import sparse

# Create a matrix
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Create compressed sparse row (CSR) matrix
matrix_sparse = sparse.csr_matrix(matrix)

Discussion
A frequent situation in machine learning is having a huge amount of data; however, most of the elements in the data are zeros. For example, imagine a matrix where the columns are every movie on Netflix, the rows are every Netflix user, and the values are how many times a user has watched that particular movie. This matrix would have tens of thousands of columns and millions of rows! However, since most users do not watch most movies, the vast majority of elements would be zero.

A sparse matrix is a matrix in which most elements are 0. Sparse matrices store only nonzero elements and assume all other values will be zero, leading to significant computational savings. In our solution, we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view the sparse matrix we can see that only the nonzero values are stored:

# View sparse matrix
print(matrix_sparse)
  (1, 1)    1
  (2, 0)    3

There are a number of types of sparse matrices. However, in compressed sparse row (CSR) matrices, (1, 1) and (2, 0) represent the (zero-indexed) indices of the nonzero values 1 and 3, respectively. For example, the element 1 is in the second row and second column. We can see the advantage of sparse matrices if we create a much larger matrix with many more zero elements and then compare this larger matrix with our original sparse matrix:

# Create larger matrix
matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                         [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Create compressed sparse row (CSR) matrix
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# View original sparse matrix
print(matrix_sparse)
  (1, 1)    1
  (2, 0)    3

# View larger sparse matrix
print(matrix_large_sparse)
  (1, 1)    1
  (2, 0)    3

As we can see, despite the fact that we added many more zero elements in the larger matrix, its sparse representation is exactly the same as our original sparse matrix. That is, the addition of zero elements did not change the size of the sparse matrix.

As mentioned, there are many different types of sparse matrices, such as compressed sparse column, list of lists, and dictionary of keys. While an explanation of the different types and their implications is outside the scope of this book, it is worth noting that while there is no “best” sparse matrix type, there are meaningful differences among them, and we should be conscious about why we are choosing one type over another.
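As a brief, illustrative sketch of those alternative formats (an addition, not the book's code), SciPy exposes each one through its own constructor:

# Illustrative sketch: the same dense matrix in other SciPy sparse formats
import numpy as np
from scipy import sparse

matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

matrix_csc = sparse.csc_matrix(matrix)  # compressed sparse column
matrix_lil = sparse.lil_matrix(matrix)  # list of lists
matrix_dok = sparse.dok_matrix(matrix)  # dictionary of keys

# Each format stores only the two nonzero values; convert between formats
# when a different access pattern is needed
print(matrix_lil.tocsr())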
See Also
SciPy documentation: Sparse Matrices
101 Ways to Store a Sparse Matrix

1.4 Preallocating NumPy Arrays

Problem
You need to preallocate arrays of a given size with some value.

Solution
NumPy has functions for generating vectors and matrices of any size using 0s, 1s, or values of your choice:

# Load library
import numpy as np

# Generate a vector of shape (5,) containing all zeros
vector = np.zeros(shape=5)

# View the vector
print(vector)
[0. 0. 0. 0. 0.]

# Generate a matrix of shape (3,3) containing all ones
matrix = np.full(shape=(3,3), fill_value=1)

# View the matrix
print(matrix)
[[1 1 1]
 [1 1 1]
 [1 1 1]]

Discussion
Generating arrays prefilled with data is useful for a number of purposes, such as making code more performant or using synthetic data to test algorithms. In many programming languages, preallocating an array of default values (such as 0s) is considered common practice.

1.5 Selecting Elements

Problem
You need to select one or more elements in a vector or matrix.

Solution
NumPy arrays make it easy to select elements in vectors or matrices:

# Load library
import numpy as np

# Create row vector
vector = np.array([1, 2, 3, 4, 5, 6])

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Select third element of vector
vector[2]
3

# Select second row, second column
matrix[1,1]
5

Discussion
Like most things in Python, NumPy arrays are zero-indexed, meaning that the index of the first element is 0, not 1. With that caveat, NumPy offers a wide variety of methods for selecting (i.e., indexing and slicing) elements or groups of elements in arrays:

# Select all elements of a vector
vector[:]
array([1, 2, 3, 4, 5, 6])

# Select everything up to and including the third element
vector[:3]
array([1, 2, 3])

# Select everything after the third element
vector[3:]
array([4, 5, 6])

# Select the last element
vector[-1]
6

# Reverse the vector
vector[::-1]
array([6, 5, 4, 3, 2, 1])

# Select the first two rows and all columns of a matrix
matrix[:2,:]
array([[1, 2, 3],
       [4, 5, 6]])

# Select all rows and the second column
matrix[:,1:2]
array([[2],
       [5],
       [8]])

1.6 Describing a Matrix

Problem
You want to describe the shape, size, and dimensions of a matrix.

Solution
Use the shape, size, and ndim attributes of a NumPy object:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

# View number of rows and columns
matrix.shape
(3, 4)

# View number of elements (rows * columns)
matrix.size
12

# View number of dimensions
matrix.ndim
2

Discussion
This might seem basic (and it is); however, time and again it will be valuable to check the shape and size of an array both for further calculations and simply as a gut check after an operation.
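To make that gut check concrete, a short assertion (an illustrative addition, not from the book) can catch unexpected shapes immediately after an operation:

# Illustrative shape check after an operation
import numpy as np

matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

transposed = matrix.T

# Fail fast if the result does not have the shape we expect
assert transposed.shape == (4, 3), transposed.shape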
1.7 Applying Functions over Each Element

Problem
You want to apply some function to all elements in an array.

Solution
Use the NumPy vectorize method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Create function that adds 100 to something
add_100 = lambda i: i + 100

# Create vectorized function
vectorized_add_100 = np.vectorize(add_100)

# Apply function to all elements in matrix
vectorized_add_100(matrix)
array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])

Discussion
The NumPy vectorize method converts a function into a function that can apply to all elements in an array or slice of an array. It’s worth noting that vectorize is essentially a for loop over the elements and does not increase performance. Furthermore, NumPy arrays allow us to perform operations between arrays even if their dimensions are not the same (a process called broadcasting). For example, we can create a much simpler version of our solution using broadcasting:

# Add 100 to all elements
matrix + 100
array([[101, 102, 103],
       [104, 105, 106],
       [107, 108, 109]])

Broadcasting does not work for all shapes and situations, but it is a common way of applying simple operations over all elements of a NumPy array.
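To illustrate when broadcasting does and does not apply (an added sketch with assumed example shapes, not the book's code):

# Broadcasting sketch: arrays are compatible when trailing dimensions
# match or are 1
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

row = np.array([10, 20, 30])           # shape (3,) broadcasts across each row
column = np.array([[10], [20], [30]])  # shape (3, 1) broadcasts across each column

print(matrix + row)
print(matrix + column)

# A shape (2,) array is not compatible with (3, 3) and raises a ValueError:
# matrix + np.array([10, 20])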
1.8 Finding the Maximum and Minimum Values

Problem
You need to find the maximum or minimum value in an array.

Solution
Use NumPy’s max and min methods:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return maximum element
np.max(matrix)
9

# Return minimum element
np.min(matrix)
1

Discussion
Often we want to know the maximum and minimum value in an array or subset of an array. This can be accomplished with the max and min methods. Using the axis parameter, we can also apply the operation along a certain axis:

# Find maximum element in each column
np.max(matrix, axis=0)
array([7, 8, 9])

# Find maximum element in each row
np.max(matrix, axis=1)
array([3, 6, 9])

1.9 Calculating the Average, Variance, and Standard Deviation

Problem
You want to calculate some descriptive statistics about an array.

Solution
Use NumPy’s mean, var, and std:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Return mean
np.mean(matrix)
5.0

# Return variance
np.var(matrix)
6.666666666666667

# Return standard deviation
np.std(matrix)
2.5819888974716112

Discussion
Just like with max and min, we can easily get descriptive statistics about the whole matrix or do calculations along a single axis:

# Find the mean value in each column
np.mean(matrix, axis=0)
array([4., 5., 6.])

1.10 Reshaping Arrays

Problem
You want to change the shape (number of rows and columns) of an array without changing the element values.

Solution
Use NumPy’s reshape:

# Load library
import numpy as np

# Create 4x3 matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [10, 11, 12]])

# Reshape matrix into 2x6 matrix
matrix.reshape(2, 6)
array([[ 1,  2,  3,  4,  5,  6],
       [ 7,  8,  9, 10, 11, 12]])

Discussion
reshape allows us to restructure an array so that we maintain the same data but organize it as a different number of rows and columns. The only requirement is that the shape of the original and new matrix contain the same number of elements (i.e., are the same size). We can see the size of a matrix using size:
matrix.size
12

One useful argument in reshape is -1, which effectively means “as many as needed,” so reshape(1, -1) means one row and as many columns as needed:

matrix.reshape(1, -1)
array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])

Finally, if we provide one integer, reshape will return a one-dimensional array of that length:

matrix.reshape(12)
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])

1.11 Transposing a Vector or Matrix

Problem
You need to transpose a vector or matrix.

Solution
Use the T method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Transpose matrix
matrix.T
array([[1, 4, 7],
       [2, 5, 8],
       [3, 6, 9]])

Discussion
Transposing is a common operation in linear algebra where the column and row indices of each element are swapped. A nuanced point typically overlooked outside of a linear algebra class is that, technically, a vector can’t be transposed because it’s just a collection of values:

# Transpose vector
np.array([1, 2, 3, 4, 5, 6]).T
array([1, 2, 3, 4, 5, 6])

However, it is common to refer to transposing a vector as converting a row vector to a column vector (notice the second pair of brackets) or vice versa:

# Transpose row vector
np.array([[1, 2, 3, 4, 5, 6]]).T
array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

1.12 Flattening a Matrix

Problem
You need to transform a matrix into a one-dimensional array.

Solution
Use the flatten method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Flatten matrix
matrix.flatten()
array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Discussion
flatten is a simple method to transform a matrix into a one-dimensional array. Alternatively, we can use reshape to create a row vector:

matrix.reshape(1, -1)
array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])

Another common way to flatten arrays is the ravel method. Unlike flatten, which always returns a copy of the original array, ravel returns a flattened view of the original data whenever possible rather than a copy, and is therefore slightly faster. It also lets us flatten lists of arrays, which we can’t do with the flatten method. This operation is useful for flattening very large arrays and speeding up code:

# Create one matrix
matrix_a = np.array([[1, 2],
                     [3, 4]])

# Create a second matrix
matrix_b = np.array([[5, 6],
                     [7, 8]])

# Create a list of matrices
matrix_list = [matrix_a, matrix_b]

# Flatten the entire list of matrices
np.ravel(matrix_list)
array([1, 2, 3, 4, 5, 6, 7, 8])

1.13 Finding the Rank of a Matrix

Problem
You need to know the rank of a matrix.

Solution
Use NumPy’s linear algebra method matrix_rank:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 1, 1],
                   [1, 1, 10],
                   [1, 1, 15]])

# Return matrix rank
np.linalg.matrix_rank(matrix)
2

Discussion
The rank of a matrix is the dimension of the vector space spanned by its columns or rows. Finding the rank of a matrix is easy in NumPy thanks to matrix_rank.

See Also
The Rank of a Matrix, CliffsNotes

1.14 Getting the Diagonal of a Matrix

Problem
You need to get the diagonal elements of a matrix.
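The preview ends here, before this recipe’s Solution. As a hedged sketch only (not the book's text), one common way to pull the diagonal out of a NumPy array is the diagonal method:

# Sketch: retrieving diagonal elements (illustrative, not the book's code)
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Main diagonal
matrix.diagonal()
array([1, 5, 9])

# Diagonal one above the main diagonal
matrix.diagonal(offset=1)
array([2, 6])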
End of preview (first 20 pages).