Machine Learning in Action (Peter Harrington)

Author: Peter Harrington

Category: Science

Summary

Machine Learning in Action is a unique book that blends the foundational theories of machine learning with the practical realities of building tools for everyday data analysis. You'll use the flexible Python programming language to build programs that implement algorithms for data classification, forecasting, recommendations, and higher-level features like summarization and simplification.

About the Book

A machine is said to learn when its performance improves with experience. Learning requires algorithms and programs that capture data and ferret out the interesting or useful patterns. Once the specialized domain of analysts and mathematicians, machine learning is becoming a skill needed by many.

Machine Learning in Action is a clearly written tutorial for developers. It avoids academic language and takes you straight to the techniques you'll use in your day-to-day work. Many Python examples present the core algorithms of statistical data processing, data analysis, and data visualization in code you can reuse. You'll understand the concepts and how they fit in with tactical tasks like classification, forecasting, recommendations, and higher-level features like summarization and simplification.

Readers need no prior experience with machine learning or statistical processing. Familiarity with Python is helpful.

What's Inside

- A no-nonsense introduction
- Examples showing common ML tasks
- Everyday data analysis
- Implementing classic algorithms like Apriori and AdaBoost

Table of Contents

PART 1 CLASSIFICATION
Machine learning basics
Classifying with k-Nearest Neighbors
Splitting datasets one feature at a time: decision trees
Classifying with probability theory: naïve Bayes
Logistic regression
Support vector machines
Improving classification with the AdaBoost meta-algorithm

PART 2 FORECASTING NUMERIC VALUES WITH REGRESSION
Predicting numeric values: regression
Tree-based regression

PART 3 UNSUPERVISED LEARNING
Grouping unlabeled items using k-means clustering
Association analysis with the Apriori algorithm
Efficiently finding frequent itemsets with FP-growth

PART 4 ADDITIONAL TOOLS
Using principal component analysis to simplify data
Simplifying data with the singular value decomposition
Big data and MapReduce
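To give a sense of the kind of example the book promises, here is a minimal k-nearest-neighbors classifier sketched in Python with NumPy. It is an illustrative sketch only, in the spirit of the book's chapter 2 topic; the function name, toy dataset, and labels are invented for this listing and are not the book's actual code.

    # Minimal kNN sketch (illustrative, not from the book's listings).
    import numpy as np

    def classify_knn(query, data, labels, k=3):
        """Return the majority label among the k training points closest to query."""
        # Euclidean distance from the query point to every training point
        distances = np.sqrt(((data - query) ** 2).sum(axis=1))
        nearest = distances.argsort()[:k]          # indices of the k closest points
        votes = {}
        for i in nearest:
            votes[labels[i]] = votes.get(labels[i], 0) + 1
        return max(votes, key=votes.get)           # label with the most votes

    # Tiny made-up dataset: two clusters in 2-D
    train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    train_labels = ['A', 'A', 'B', 'B']
    print(classify_knn(np.array([0.9, 0.8]), train, train_labels, k=3))  # prints 'A'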

📄 File Format: PDF
💾 File Size: 11.8 MB

📄 Text Preview (First 20 pages)


📄 Page 1
Machine Learning IN ACTION
Peter Harrington
MANNING
📄 Page 2
Machine Learning in Action
📄 Page 4
Machine Learning in Action
PETER HARRINGTON
MANNING
Shelter Island
📄 Page 5
For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 261, Shelter Island, NY 11964. Email: orders@manning.com

©2012 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Jeff Bleiel
Technical proofreaders: Tricia Hoffman, Alex Ott
Copyeditor: Linda Recktenwald
Proofreader: Maureen Spencer
Typesetter: Gordan Salinovic
Cover designer: Marija Tudor

ISBN 9781617290183
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 17 16 15 14 13 12
📄 Page 6
To Joseph and Milo
📄 Page 8
brief contents

PART 1 CLASSIFICATION 1
1 ■ Machine learning basics 3
2 ■ Classifying with k-Nearest Neighbors 18
3 ■ Splitting datasets one feature at a time: decision trees 37
4 ■ Classifying with probability theory: naïve Bayes 61
5 ■ Logistic regression 83
6 ■ Support vector machines 101
7 ■ Improving classification with the AdaBoost meta-algorithm 129

PART 2 FORECASTING NUMERIC VALUES WITH REGRESSION 151
8 ■ Predicting numeric values: regression 153
9 ■ Tree-based regression 179

PART 3 UNSUPERVISED LEARNING 205
10 ■ Grouping unlabeled items using k-means clustering 207
11 ■ Association analysis with the Apriori algorithm 224
12 ■ Efficiently finding frequent itemsets with FP-growth 248
📄 Page 9
PART 4 ADDITIONAL TOOLS 267
13 ■ Using principal component analysis to simplify data 269
14 ■ Simplifying data with the singular value decomposition 280
15 ■ Big data and MapReduce 299
📄 Page 10
contents

preface xvii
acknowledgments xix
about this book xxi
about the author xxv
about the cover illustration xxvi

PART 1 CLASSIFICATION 1

1 Machine learning basics 3
1.1 What is machine learning? 5
    Sensors and the data deluge 6 ■ Machine learning will be more important in the future 7
1.2 Key terminology 7
1.3 Key tasks of machine learning 10
1.4 How to choose the right algorithm 11
1.5 Steps in developing a machine learning application 11
1.6 Why Python? 13
    Executable pseudo-code 13 ■ Python is popular 13 ■ What Python has that other languages don’t have 14 ■ Drawbacks 14
1.7 Getting started with the NumPy library 15
1.8 Summary 17
📄 Page 11
2 Classifying with k-Nearest Neighbors 18
2.1 Classifying with distance measurements 19
    Prepare: importing data with Python 21 ■ Putting the kNN classification algorithm into action 23 ■ How to test a classifier 24
2.2 Example: improving matches from a dating site with kNN 24
    Prepare: parsing data from a text file 25 ■ Analyze: creating scatter plots with Matplotlib 27 ■ Prepare: normalizing numeric values 29 ■ Test: testing the classifier as a whole program 31 ■ Use: putting together a useful system 32
2.3 Example: a handwriting recognition system 33
    Prepare: converting images into test vectors 33 ■ Test: kNN on handwritten digits 35
2.4 Summary 36

3 Splitting datasets one feature at a time: decision trees 37
3.1 Tree construction 39
    Information gain 40 ■ Splitting the dataset 43 ■ Recursively building the tree 46
3.2 Plotting trees in Python with Matplotlib annotations 48
    Matplotlib annotations 49 ■ Constructing a tree of annotations 51
3.3 Testing and storing the classifier 56
    Test: using the tree for classification 56 ■ Use: persisting the decision tree 57
3.4 Example: using decision trees to predict contact lens type 57
3.5 Summary 59

4 Classifying with probability theory: naïve Bayes 61
4.1 Classifying with Bayesian decision theory 62
4.2 Conditional probability 63
4.3 Classifying with conditional probabilities 65
4.4 Document classification with naïve Bayes 65
4.5 Classifying text with Python 67
    Prepare: making word vectors from text 67 ■ Train: calculating probabilities from word vectors 69 ■ Test: modifying the classifier for real-world conditions 71 ■ Prepare: the bag-of-words document model 73
4.6 Example: classifying spam email with naïve Bayes 74
    Prepare: tokenizing text 74 ■ Test: cross validation with naïve Bayes 75
📄 Page 12
4.7 Example: using naïve Bayes to reveal local attitudes from personal ads 77
    Collect: importing RSS feeds 78 ■ Analyze: displaying locally used words 80
4.8 Summary 82

5 Logistic regression 83
5.1 Classification with logistic regression and the sigmoid function: a tractable step function 84
5.2 Using optimization to find the best regression coefficients 86
    Gradient ascent 86 ■ Train: using gradient ascent to find the best parameters 88 ■ Analyze: plotting the decision boundary 90 ■ Train: stochastic gradient ascent 91
5.3 Example: estimating horse fatalities from colic 96
    Prepare: dealing with missing values in the data 97 ■ Test: classifying with logistic regression 98
5.4 Summary 100

6 Support vector machines 101
6.1 Separating data with the maximum margin 102
6.2 Finding the maximum margin 104
    Framing the optimization problem in terms of our classifier 104 ■ Approaching SVMs with our general framework 106
6.3 Efficient optimization with the SMO algorithm 106
    Platt’s SMO algorithm 106 ■ Solving small datasets with the simplified SMO 107
6.4 Speeding up optimization with the full Platt SMO 112
6.5 Using kernels for more complex data 118
    Mapping data to higher dimensions with kernels 118 ■ The radial basis function as a kernel 119 ■ Using a kernel for testing 122
6.6 Example: revisiting handwriting classification 125
6.7 Summary 127

7 Improving classification with the AdaBoost meta-algorithm 129
7.1 Classifiers using multiple samples of the dataset 130
    Building classifiers from randomly resampled data: bagging 130 ■ Boosting 131
7.2 Train: improving the classifier by focusing on errors 131
📄 Page 13
7.3 Creating a weak learner with a decision stump 133
7.4 Implementing the full AdaBoost algorithm 136
7.5 Test: classifying with AdaBoost 139
7.6 Example: AdaBoost on a difficult dataset 140
7.7 Classification imbalance 142
    Alternative performance metrics: precision, recall, and ROC 143 ■ Manipulating the classifier’s decision with a cost function 147 ■ Data sampling for dealing with classification imbalance 148
7.8 Summary 148

PART 2 FORECASTING NUMERIC VALUES WITH REGRESSION 151

8 Predicting numeric values: regression 153
8.1 Finding best-fit lines with linear regression 154
8.2 Locally weighted linear regression 160
8.3 Example: predicting the age of an abalone 163
8.4 Shrinking coefficients to understand our data 164
    Ridge regression 164 ■ The lasso 167 ■ Forward stagewise regression 167
8.5 The bias/variance tradeoff 170
8.6 Example: forecasting the price of LEGO sets 172
    Collect: using the Google shopping API 173 ■ Train: building a model 174
8.7 Summary 177

9 Tree-based regression 179
9.1 Locally modeling complex data 180
9.2 Building trees with continuous and discrete features 181
9.3 Using CART for regression 184
    Building the tree 184 ■ Executing the code 186
9.4 Tree pruning 188
    Prepruning 188 ■ Postpruning 190
9.5 Model trees 192
9.6 Example: comparing tree methods to standard regression 195
9.7 Using Tkinter to create a GUI in Python 198
    Building a GUI in Tkinter 199 ■ Interfacing Matplotlib and Tkinter 201
9.8 Summary 203
📄 Page 14
PART 3 UNSUPERVISED LEARNING 205

10 Grouping unlabeled items using k-means clustering 207
10.1 The k-means clustering algorithm 208
10.2 Improving cluster performance with postprocessing 213
10.3 Bisecting k-means 214
10.4 Example: clustering points on a map 217
    The Yahoo! PlaceFinder API 218 ■ Clustering geographic coordinates 220
10.5 Summary 223

11 Association analysis with the Apriori algorithm 224
11.1 Association analysis 225
11.2 The Apriori principle 226
11.3 Finding frequent itemsets with the Apriori algorithm 228
    Generating candidate itemsets 229 ■ Putting together the full Apriori algorithm 231
11.4 Mining association rules from frequent item sets 233
11.5 Example: uncovering patterns in congressional voting 237
    Collect: build a transaction data set of congressional voting records 238 ■ Test: association rules from congressional voting records 243
11.6 Example: finding similar features in poisonous mushrooms 245
11.7 Summary 246

12 Efficiently finding frequent itemsets with FP-growth 248
12.1 FP-trees: an efficient way to encode a dataset 249
12.2 Build an FP-tree 251
    Creating the FP-tree data structure 251 ■ Constructing the FP-tree 252
12.3 Mining frequent items from an FP-tree 256
    Extracting conditional pattern bases 257 ■ Creating conditional FP-trees 258
12.4 Example: finding co-occurring words in a Twitter feed 260
12.5 Example: mining a clickstream from a news site 264
12.6 Summary 265
📄 Page 15
PART 4 ADDITIONAL TOOLS 267

13 Using principal component analysis to simplify data 269
13.1 Dimensionality reduction techniques 270
13.2 Principal component analysis 271
    Moving the coordinate axes 271 ■ Performing PCA in NumPy 273
13.3 Example: using PCA to reduce the dimensionality of semiconductor manufacturing data 275
13.4 Summary 278

14 Simplifying data with the singular value decomposition 280
14.1 Applications of the SVD 281
    Latent semantic indexing 281 ■ Recommendation systems 282
14.2 Matrix factorization 283
14.3 SVD in Python 284
14.4 Collaborative filtering–based recommendation engines 286
    Measuring similarity 287 ■ Item-based or user-based similarity? 289 ■ Evaluating recommendation engines 289
14.5 Example: a restaurant dish recommendation engine 290
    Recommending untasted dishes 290 ■ Improving recommendations with the SVD 292 ■ Challenges with building recommendation engines 295
14.6 Example: image compression with the SVD 295
14.7 Summary 298

15 Big data and MapReduce 299
15.1 MapReduce: a framework for distributed computing 300
15.2 Hadoop Streaming 302
    Distributed mean and variance mapper 303 ■ Distributed mean and variance reducer 304
15.3 Running Hadoop jobs on Amazon Web Services 305
    Services available on AWS 305 ■ Getting started with Amazon Web Services 306 ■ Running a Hadoop job on EMR 307
15.4 Machine learning in MapReduce 312
15.5 Using mrjob to automate MapReduce in Python 313
    Using mrjob for seamless integration with EMR 313 ■ The anatomy of a MapReduce script in mrjob 314
📄 Page 16
15.6 Example: the Pegasos algorithm for distributed SVMs 316
    The Pegasos algorithm 317 ■ Training: MapReduce support vector machines with mrjob 318
15.7 Do you really need MapReduce? 322
15.8 Summary 323

appendix A Getting started with Python 325
appendix B Linear algebra 335
appendix C Probability refresher 341
appendix D Resources 345
index 347
📄 Page 18
preface

After college I went to work for Intel in California and mainland China. Originally my plan was to go back to grad school after two years, but time flies when you are having fun, and two years turned into six. I realized I had to go back at that point, and I didn't want to do night school or online learning; I wanted to sit on campus and soak up everything a university has to offer. The best part of college is not the classes you take or research you do, but the peripheral things: meeting people, going to seminars, joining organizations, dropping in on classes, and learning what you don't know.

Sometime in 2008 I was helping set up for a career fair. I began to talk to someone from a large financial institution and they wanted me to interview for a position modeling credit risk (figuring out if someone is going to pay off their loans or not). They asked me how much stochastic calculus I knew. At the time, I wasn't sure I knew what the word stochastic meant. They were hiring for a geographic location my body couldn't tolerate, so I decided not to pursue it any further. But this stochastic stuff interested me, so I went to the course catalog and looked for any class being offered with the word "stochastic" in its title. The class I found was "Discrete-time Stochastic Systems." I started attending the class without registering, doing the homework and taking tests. Eventually I was noticed by the professor and she was kind enough to let me continue, for which I am very grateful.

This class was the first time I saw probability applied to an algorithm. I had seen algorithms take an averaged value as input before, but this was different: the variance and mean were internal values in these algorithms. The course was about "time series" data where every piece of data is a regularly spaced sample. I found another course with Machine Learning in the title. In this class the
📄 Page 19
data was not assumed to be uniformly spaced in time, and they covered more algorithms but with less rigor. I later realized that similar methods were also being taught in the economics, electrical engineering, and computer science departments.

In early 2009, I graduated and moved to Silicon Valley to start work as a software consultant. Over the next two years, I worked with eight companies on a very wide range of technologies and saw two trends emerge which make up the major thesis for this book: first, in order to develop a compelling application you need to do more than just connect data sources; and second, employers want people who understand theory and can also program.

A large portion of a programmer's job can be compared to the concept of connecting pipes—except that instead of pipes, programmers connect the flow of data—and monstrous fortunes have been made doing exactly that. Let me give you an example. You could make an application that sells things online—the big picture for this would be allowing people a way to post things and to view what others have posted. To do this you could create a web form that allows users to enter data about what they are selling and then this data would be shipped off to a data store. In order for other users to see what a user is selling, you would have to ship the data out of the data store and display it appropriately. I'm sure people will continue to make money this way; however, to make the application really good you need to add a level of intelligence. This intelligence could do things like automatically remove inappropriate postings, detect fraudulent transactions, direct users to things they might like, and forecast site traffic. To accomplish these objectives, you would need to apply machine learning. The end user would not know that there is magic going on behind the scenes; to them your application "just works," which is the hallmark of a well-built product.

An organization may choose to hire a group of theoretical people, or "thinkers," and a set of practical people, "doers." The thinkers may have spent a lot of time in academia, and their day-to-day job may be pulling ideas from papers and modeling them with very high-level tools or mathematics. The doers interface with the real world by writing the code and dealing with the imperfections of a non-ideal world, such as machines that break down or noisy data. Separating thinkers from doers is a bad idea and successful organizations realize this. (One of the tenets of lean manufacturing is for the thinkers to get their hands dirty with actual doing.) When there is a limited amount of money to be spent on hiring, who will get hired more readily—the thinker or the doer? Probably the doer, but in reality employers want both. Things need to get built, but when applications call for more demanding algorithms it is useful to have someone who can read papers, pull out the idea, implement it in real code, and iterate.

I didn't see a book that addressed the problem of bridging the gap between thinkers and doers in the context of machine learning algorithms. The goal of this book is to fill that void, and, along the way, to introduce uses of machine learning algorithms so that the reader can build better applications.
📄 Page 20
acknowledgments

This is by far the easiest part of the book to write...

First, I would like to thank the folks at Manning. Above all, I would like to thank my editor Troy Mott; if not for his support and enthusiasm, this book never would have happened. I would also like to thank Maureen Spencer who helped polish my prose in the final manuscript; she was a pleasure to work with.

Next I would like to thank Jennie Si at Arizona State University for letting me sneak into her class on discrete-time stochastic systems without registering. Also Cynthia Rudin at MIT for pointing me to the paper "Top 10 Algorithms in Data Mining,"[1] which inspired the approach I took in this book. For indirect contributions I would like to thank Mark Bauer, Jerry Barkely, Jose Zero, Doug Chang, Wayne Carter, and Tyler Neylon.

Special thanks to the following peer reviewers who read the manuscript at different stages during its development and provided invaluable feedback: Keith Kim, Franco Lombardo, Patrick Toohey, Josef Lauri, Ryan Riley, Peter Venable, Patrick Goetz, Jeroen Benckhuijsen, Ian McAllister, Orhan Alkan, Joseph Ottinger, Fred Law, Karsten Strøbæk, Brian Lau, Stephen McKamey, Michael Brennan, Kevin Jackson, John Griffin, Sumit Pal, Alex Alves, Justin Tyler Wiley, and John Stevenson.

My technical proofreaders, Tricia Hoffman and Alex Ott, reviewed the technical content shortly before the manuscript went to press and I would like to thank them

[1] Xindong Wu, et al., "Top 10 Algorithms in Data Mining," Journal of Knowledge and Information Systems 14, no. 1 (December 2007).