Hands-On Unsupervised Learning Using Python
How to Build Applied Machine Learning Solutions from Unlabeled Data
Ankur A. Patel
Hands-On Unsupervised Learning Using Python
by Ankur A. Patel
Copyright © 2019 Human AI Collaboration, Inc. All rights reserved. Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Development Editor: Michele Cronin
Acquisition Editor: Jonathan Hassell
Production Editor: Katherine Tozer
Copyeditor: Jasmine Kwityn
Proofreader: Christina Edwards
Indexer: Judith McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
February 2019: First Edition
Revision History for the First Edition
2019-02-21: First Release
See http://oreilly.com/catalog/errata.csp?isbn=9781492035640 for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hands-On Unsupervised Learning Using Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-492-03564-0 [LSI]
Preface

A Brief History of Machine Learning

Machine learning is a subfield of artificial intelligence (AI) in which computers learn from data—usually to improve their performance on some narrowly defined task—without being explicitly programmed. The term machine learning was coined as early as 1959 (by Arthur Samuel, a legend in the field of AI), but there were few major commercial successes in machine learning during the twentieth century. Instead, the field remained a niche research area for academics at universities.

Early on (in the 1960s), many in the AI community were too optimistic about its future. Researchers at the time, such as Herbert Simon and Marvin Minsky, claimed that AI would reach human-level intelligence within a matter of decades:

Machines will be capable, within twenty years, of doing any work a man can do.
—Herbert Simon, 1965

From three to eight years, we will have a machine with the general intelligence of an average human being.
—Marvin Minsky, 1970

Blinded by their optimism, researchers focused on so-called strong AI or artificial general intelligence (AGI) projects, attempting to build AI agents capable of problem solving, knowledge representation, learning and planning, natural language processing, perception, and motor control. This optimism helped attract significant funding into the nascent field from major players such as the Department of Defense, but the problems these researchers tackled were too ambitious and
ultimately doomed to fail. AI research rarely made the leap from academia to industry, and a series of so-called AI winters followed. In these AI winters (an analogy based on the nuclear winter of the Cold War era), interest in and funding for AI dwindled. Occasionally, hype cycles around AI occurred but had very little staying power. By the early 1990s, interest in and funding for AI had hit a trough.

AI Is Back, but Why Now?

AI has re-emerged with a vengeance over the past two decades—first as a purely academic area of interest and now as a full-blown field attracting the brightest minds at both universities and corporations. Three critical developments are behind this resurgence: breakthroughs in machine learning algorithms, the availability of lots of data, and superfast computers.
First, instead of focusing on overly ambitious strong AI projects, researchers turned their attention to narrowly defined subproblems of strong AI, also known as weak AI or narrow AI. This focus on improving solutions for narrowly defined tasks led to algorithmic breakthroughs, which paved the way for successful commercial applications. Many of these algorithms—often developed initially at universities or private research labs—were quickly open-sourced, speeding up the adoption of these technologies by industry.
Second, data capture became a focus for most organizations, and the costs of storing data fell dramatically, driven by advances in digital data storage. Thanks to the internet, lots of data also became widely and publicly available at a scale never before seen.
Third, computers became increasingly powerful and available over the cloud, allowing AI researchers to easily and cheaply scale their IT infrastructure as required without making huge upfront investments in
hardware.

The Emergence of Applied AI

These three forces have pushed AI from academia to industry, helping attract increasingly higher levels of interest and funding every year. AI is no longer just a theoretical area of interest but rather a full-blown applied field. Figure P-1 shows a chart from Google Trends, indicating the growth in interest in machine learning over the past five years.

Figure P-1. Interest in machine learning over time

AI is now viewed as a breakthrough horizontal technology, akin to the advent of computers and smartphones, that will have a significant impact on every single industry over the next decade. Successful commercial applications involving machine learning include—but are certainly not limited to—optical character recognition, email spam filtering, image classification, computer vision, speech recognition, machine translation, group segmentation and clustering, generation of synthetic data, anomaly detection, cybercrime prevention, credit card fraud detection, internet fraud detection, time series prediction, natural language processing, board game and video game playing, document classification, recommender systems, search, robotics, online advertising, sentiment analysis, DNA sequencing,
financial market analysis, information retrieval, question answering, and healthcare decision making.

Major Milestones in Applied AI over the Past 20 Years

The milestones presented here helped bring AI from a mostly academic topic of conversation to a mainstream staple in technology today.

1997: Deep Blue, an AI bot that had been in development since the mid-1980s, beats world chess champion Garry Kasparov in a highly publicized chess event.

2004: DARPA introduces the DARPA Grand Challenge, an autonomous driving challenge held annually in the desert. In 2005, Stanford takes the top prize. In 2007, Carnegie Mellon University performs this feat in an urban setting. In 2009, Google builds a self-driving car. By 2015, many major technology giants, including Tesla, Alphabet’s Waymo, and Uber, have launched well-funded programs to build mainstream self-driving technology.

2006: Geoffrey Hinton of the University of Toronto introduces a fast learning algorithm to train neural networks with many layers, kicking off the deep learning revolution.

2006: Netflix launches the Netflix Prize competition, with a one-million-dollar purse, challenging teams to use machine learning to improve its recommendation system’s accuracy by at least 10%. A team won the prize in 2009.

2007: AI achieves superhuman performance at checkers, solved by a team from the University of Alberta.

2010: ImageNet launches an annual contest—the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)—in which teams use machine learning algorithms to correctly detect and classify objects in a large, well-curated image dataset. This draws significant attention from both academia and technology giants. The classification error rate falls from 25% in 2011 to just a few percent by 2015, backed by advances in deep convolutional neural networks. This leads to commercial applications of computer vision and object recognition.

2010: Microsoft launches Kinect for Xbox 360. Developed by the computer vision team at Microsoft Research, Kinect is capable of tracking human body movement and translating this into gameplay.

2010: Siri, one of the first mainstream digital voice assistants, is acquired by Apple and released as part of the iPhone 4S in October 2011. Eventually, Siri is rolled out across all of Apple’s products. Powered by convolutional neural networks and long short-term memory recurrent neural networks, Siri performs both speech recognition and natural language processing. Eventually, Amazon, Microsoft, and Google enter the race, releasing Alexa (2014), Cortana (2014), and Google Assistant (2016), respectively.

2011: IBM Watson, a question-answering AI agent developed by a team led by David Ferrucci, beats former Jeopardy! winners Brad Rutter and Ken Jennings. IBM Watson is now used across several industries, including healthcare and retail.

2012: The Google Brain team, led by Andrew Ng and Jeff Dean, trains a neural network to recognize cats by watching unlabeled images taken from YouTube videos.

2013: Google wins DARPA’s Robotics Challenge, involving trials in which semi-autonomous bots perform complex tasks in treacherous environments, such as driving a vehicle, walking across rubble, removing debris from a blocked entryway, opening a door, and climbing a ladder.

2014: Facebook publishes work on DeepFace, a neural network-based system that can identify faces with 97% accuracy. This is near human-level performance and is a more than 27% improvement over previous systems.

2015: AI goes mainstream and is commonly featured in media outlets around the world.

2015: Google DeepMind’s AlphaGo beats world-class professional Fan Hui at the game of Go. In 2016, AlphaGo defeats Lee Sedol, and in 2017, AlphaGo defeats Ke Jie. In 2017, a new version called AlphaGo Zero defeats the previous AlphaGo version 100 games to zero. AlphaGo Zero incorporates unsupervised learning techniques and masters Go just by playing itself.

2016: Google launches a major revamp of its language translation service, Google Translate, replacing its existing phrase-based translation system with a deep learning-based neural machine translation system, reducing translation errors by up to 87% and approaching near human-level accuracy.

2017: Libratus, developed by Carnegie Mellon, wins at head-to-head no-limit Texas Hold’em.

2017: An OpenAI-trained bot beats a professional gamer at a Dota 2 tournament.

From Narrow AI to AGI

Of course, these successes in applying AI to narrowly defined problems
are just a starting point. There is a growing belief in the AI community that—by combining several weak AI systems—we can develop strong AI. This strong AI or AGI agent will be capable of human-level performance at many broadly defined tasks. Soon after AI achieves human-level performance, some researchers predict this strong AI will surpass human intelligence and reach so-called superintelligence. Estimates for attaining such superintelligence range from as little as 15 years to as many as 100 years from now, but most researchers believe AI will advance enough to achieve this in a few generations. Is this inflated hype once again (like what we saw in previous AI cycles), or is it different this time around? Only time will tell.

Objective and Approach

Most of the successful commercial applications to date—in areas such as computer vision, speech recognition, machine translation, and natural language processing—have involved supervised learning, taking advantage of labeled datasets. However, most of the world’s data is unlabeled. In this book, we will cover the field of unsupervised learning, a branch of machine learning used to find hidden patterns and learn the underlying structure in unlabeled data. According to many industry experts, such as Yann LeCun, the Director of AI Research at Facebook and a professor at NYU, unsupervised learning is the next frontier in AI and may hold the key to AGI. For this and many other reasons, unsupervised learning is one of the trendiest topics in AI today. The book’s goal is to outline the concepts and tools required for you to develop the intuition necessary for applying this technology to the everyday problems you work on. In other words, this is an applied book, one that will allow you to build real-world systems. We will also explore
how to efficiently label unlabeled datasets to turn unsupervised learning problems into semisupervised ones. The book will use a hands-on approach, introducing some theory but focusing mostly on applying unsupervised learning techniques to solving real-world problems. The datasets and code are available online as Jupyter notebooks on GitHub. Armed with the conceptual understanding and hands-on experience you’ll gain from this book, you will be able to apply unsupervised learning to large, unlabeled datasets to uncover hidden patterns, obtain deeper business insight, detect anomalies, cluster groups based on similarity, perform automatic feature engineering and selection, generate synthetic datasets, and more.

Prerequisites

This book assumes that you have some Python programming experience, including familiarity with NumPy and Pandas. For more on Python, visit the official Python website. For more on Jupyter Notebook, visit the official Jupyter website. For a refresher on college-level calculus, linear algebra, probability, and statistics, read Part I of the Deep Learning textbook by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. For a refresher on machine learning, read The Elements of Statistical Learning.

Roadmap

The book is organized into four parts, covering the following topics:

Part I, Fundamentals of Unsupervised Learning
Differences between supervised and unsupervised learning, an overview of popular supervised and unsupervised algorithms, and an end-to-end machine learning project
Part II, Unsupervised Learning Using Scikit-Learn
Dimensionality reduction, anomaly detection, and clustering and group segmentation

TIP
For more information on the concepts discussed in Parts I and II, refer to the Scikit-learn documentation.

Part III, Unsupervised Learning Using TensorFlow and Keras
Representation learning and automatic feature extraction, autoencoders, and semisupervised learning

Part IV, Deep Unsupervised Learning Using TensorFlow and Keras
Restricted Boltzmann machines, deep belief networks, and generative adversarial networks

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, etc.) is available for download on GitHub.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by
citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hands-On Unsupervised Learning Using Python by Ankur A. Patel (O’Reilly). Copyright 2019 Ankur A. Patel, 978-1-492-03564-0.”
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

NOTE
For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:
O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/unsupervised-learning.
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

1. Such views inspired Stanley Kubrick in 1968 to create the AI agent HAL 9000 in 2001: A Space Odyssey.
2. According to McKinsey Global Institute, over half of all the professional activities people are paid to do could be automated by 2055.
Part I. Fundamentals of Unsupervised Learning

To start, let’s explore the current machine learning ecosystem and where unsupervised learning fits in. We will also build a machine learning project from scratch to cover basics such as setting up the programming environment, acquiring and preparing data, exploring data, selecting machine learning algorithms and cost functions, and evaluating the results.
Chapter 1. Unsupervised Learning in the Machine Learning Ecosystem

Most of human and animal learning is unsupervised learning. If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake. We need to solve the unsupervised learning problem before we can even think of getting to true AI.
—Yann LeCun

In this chapter, we will explore the difference between a rules-based system and machine learning, the difference between supervised learning and unsupervised learning, and the relative strengths and weaknesses of each. We will also cover many popular supervised learning algorithms and unsupervised learning algorithms and briefly examine how semisupervised learning and reinforcement learning fit into the mix.

Basic Machine Learning Terminology

Before we delve into the different types of machine learning, let’s take a look at a simple and commonly used machine learning example to help make the concepts we introduce tangible: the email spam filter. We need to build a simple program that takes in emails and correctly classifies them as either “spam” or “not spam.” This is a straightforward classification problem.
Here’s a bit of machine learning terminology as a refresher: the input variables into this problem are the text of the emails. These input variables are also known as features or predictors or independent variables. The output variable—what we are trying to predict—is the label “spam” or “not spam.” This is also known as the target variable, dependent variable, or response variable (or class since this is a classification problem).
The set of examples the AI trains on is known as the training set, and each individual example is called a training instance or sample. During training, the AI is attempting to minimize its cost function or error rate, or, framed more positively, to maximize its value function—in this case, the ratio of correctly classified emails. The AI actively optimizes for a minimal error rate during training. Its error rate is calculated by comparing the AI’s predicted label with the true label.
However, what we care about most is how well the AI generalizes its training to never-before-seen emails. This will be the true test for the AI: can it correctly classify emails that it has never seen before using what it has learned by training on the examples in the training set? This generalization error or out-of-sample error is the main thing we use to evaluate machine learning solutions. This set of never-before-seen examples is known as the test set or holdout set (because the data is held out from the training). If we choose to have multiple holdout sets (perhaps to gauge our generalization error as we train, which is advisable), we may have intermediate holdout sets that we use to evaluate our progress before the final test set; these intermediate holdout sets are called validation sets.
To put all of this together, the AI trains on the training data (experience) to improve its error rate (performance) in flagging spam (task), and the ultimate success criterion is how well its experience generalizes to new, never-before-seen data (generalization error).
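To make these terms concrete, here is a minimal Python sketch using scikit-learn. The tiny toy emails, labels, and split sizes are invented purely for illustration and are not taken from the book’s own code; it simply shows the vocabulary above in action: features, labels, a training set, a validation set, a test set, and training versus out-of-sample error.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Features (input variables): raw email text. Target variable: spam (1) or not spam (0).
emails = ["BUY NOW cheap meds", "u have won a free prize",
          "Click here 4 exclusive offers", "Lowest price BUY NOW",
          "Meeting moved to 4 pm tomorrow", "Quarterly report attached",
          "Can you review my draft?", "Lunch on Friday?"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out data the model never trains on: a validation set to monitor progress
# while developing, and a final test set to estimate the generalization error.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    emails, labels, test_size=0.5, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_holdout, y_holdout, test_size=0.5, stratify=y_holdout, random_state=42)

# Turn the text into numeric features, then fit a simple classifier on the training set.
vectorizer = CountVectorizer()
clf = LogisticRegression()
clf.fit(vectorizer.fit_transform(X_train), y_train)

# Training error measures fit to the training set; validation and test error
# estimate how well the model generalizes to never-before-seen emails.
def error_rate(X, y):
    return 1 - accuracy_score(y, clf.predict(vectorizer.transform(X)))

print("training error:       ", error_rate(X_train, y_train))
print("validation error:     ", error_rate(X_val, y_val))
print("test (out-of-sample): ", error_rate(X_test, y_test))

With so few examples the printed numbers are meaningless, but the workflow is the point: fit on the training set, use the validation set to gauge progress while you iterate, and report the test (generalization) error only once at the end.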
Rules-Based vs. Machine Learning

Using a rules-based approach, we can design a spam filter with explicit rules to catch spam such as flag emails with “u” instead of “you,” “4” instead of “for,” “BUY NOW,” etc. But this system would be difficult to maintain over time as bad guys change their spam behavior to evade the rules. If we used a rules-based system, we would have to frequently adjust the rules manually just to stay up-to-date. Also, it would be very expensive to set up—think of all the rules we would need to create to make this a well-functioning system.
Instead of a rules-based approach, we can use machine learning to train on the email data and automatically engineer rules to correctly flag malicious email as spam. This machine learning-based system could be automatically adjusted over time as well. This system would be much cheaper to train and maintain.
In this simple email problem, it may be possible for us to handcraft rules, but, for many problems, handcrafting rules is not feasible at all. For example, consider designing a self-driving car—imagine drafting rules for how the car should behave in each and every single instance it ever encounters. This is an intractable problem unless the car can learn and adapt on its own based on its experience.
We could also use machine learning systems as an exploration or data discovery tool to gain deeper insight into the problem we are trying to solve. For example, in the email spam filter example, we can learn which words or phrases are most predictive of spam and recognize newly emerging malicious spam patterns.

Supervised vs. Unsupervised

The field of machine learning has two major branches—supervised learning and unsupervised learning—and plenty of sub-branches that bridge the two.