Learning Algorithms
A Programmer’s Guide to Writing Better Code
George T. Heineman
978-1-492-09106-6 [GP]

Learning Algorithms
by George T. Heineman

Copyright © 2021 George T. Heineman. All rights reserved. Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Melissa Duffield
Developmental Editor: Sarah Grey
Production Editor: Beth Kelly
Copyeditor: Piper Editorial Consulting, LLC
Proofreader: Justin Billing
Indexer: nSight, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

August 2021: First Edition

Revision History for the First Edition
2021-07-20: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492091066 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Learning Algorithms, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents

Foreword
Preface

1. Problem Solving
   What Is an Algorithm?
   Finding the Largest Value in an Arbitrary List
   Counting Key Operations
   Models Can Predict Algorithm Performance
   Find Two Largest Values in an Arbitrary List
   Tournament Algorithm
   Time Complexity and Space Complexity
   Summary
   Challenge Exercises

2. Analyzing Algorithms
   Using Empirical Models to Predict Performance
   Multiplication Can Be Faster
   Performance Classes
   Asymptotic Analysis
   Counting All Operations
   Counting All Bytes
   When One Door Closes, Another One Opens
   Binary Array Search
   Almost as Easy as π
   Two Birds with One Stone
   Pulling It All Together
   Curve Fitting Versus Lower and Upper Bounds
   Summary
   Challenge Exercises

3. Better Living Through Better Hashing
   Associating Values with Keys
   Hash Functions and Hash Codes
   A Hashtable Structure for (Key, Value) Pairs
   Detecting and Resolving Collisions with Linear Probing
   Separate Chaining with Linked Lists
   Removing an Entry from a Linked List
   Evaluation
   Growing Hashtables
   Analyzing the Performance of Dynamic Hashtables
   Perfect Hashing
   Iterate Over (key, value) Pairs
   Summary
   Challenge Exercises

4. Heaping It On
   Max Binary Heaps
   Inserting a (value, priority)
   Removing the Value with Highest Priority
   Representing a Binary Heap in an Array
   Implementation of Swim and Sink
   Summary
   Challenge Exercises

5. Sorting Without a Hat
   Sorting by Swapping
   Selection Sort
   Anatomy of a Quadratic Sorting Algorithm
   Analyze Performance of Insertion Sort and Selection Sort
   Recursion and Divide and Conquer
   Merge Sort
   Quicksort
   Heap Sort
   Performance Comparison of O(N log N) Algorithms
   Tim Sort
   Summary
   Challenge Exercises

6. Binary Trees: Infinity in the Palm of Your Hand
   Getting Started
   Binary Search Trees
   Searching for Values in a Binary Search Tree
   Removing Values from a Binary Search Tree
   Traversing a Binary Tree
   Analyzing Performance of Binary Search Trees
   Self-Balancing Binary Trees
   Analyzing Performance of Self-Balancing Trees
   Using Binary Tree as (key, value) Symbol Table
   Using the Binary Tree as a Priority Queue
   Summary
   Challenge Exercises

7. Graphs: Only Connect!
   Graphs Efficiently Store Useful Information
   Using Depth First Search to Solve a Maze
   Breadth First Search Offers Different Searching Strategy
   Directed Graphs
   Graphs with Edge Weights
   Dijkstra’s Algorithm
   All-Pairs Shortest Path
   Floyd–Warshall Algorithm
   Summary
   Challenge Exercises

8. Wrapping It Up
   Python Built-in Data Types
   Implementing Stack in Python
   Implementing Queues in Python
   Heap and Priority Queue Implementations
   Future Exploration

Index
Foreword

Algorithms are at the heart of computer science and essential to the modern information age. They power the search engines used to answer billions of daily Internet search requests and provide privacy when communicating over the Internet. Algorithms are increasingly visible to consumers in countless areas, from customized advertising to online price quotes, and the news media is full of discussions about what algorithms are and what they can do.

The large growth in STEM (Science, Technology, Engineering and Mathematics) is powering a new wave of sustained growth and innovation in the global economy. But there simply aren’t enough computer scientists to discover and apply the algorithms needed for advances in medicine, engineering, and even government. We need to increase the number of people who know how to apply algorithms to the problems within their own fields and disciplines.

You don’t need a four-year degree in computer science to get started with algorithms. Unfortunately, most online material and textbooks on the topic are designed for undergraduate students, with an emphasis on mathematical proofs and computer science concepts. Algorithm textbooks can be intimidating because they are references for so many different algorithms, with countless variations and highly specialized cases. All too often, readers find it hard to complete the first chapter of these books. Using them can be a bit like trying to improve your English spelling by reading an entire dictionary: you would be much better off if, instead, you had a specially designed reference book that summarizes the 100 most misspelled words in the English language and explains the rules (and exceptions) that govern them. Similarly, people from different backgrounds and experiences who use algorithms in their work need a reference book that is more focused and designed for their needs.
Learning Algorithms provides an approachable introduction to a range of algorithms that you can immediately use to improve the efficiency of your code. All algorithms are presented in Python, one of the most popular and user-friendly programming languages, used in disciplines ranging from data science to bioinformatics to
engineering. The text carefully explains each algorithm, with numerous images to help readers grasp the essential concepts. The code is open source and freely available from the book’s repository.

Learning Algorithms will teach you the fundamental algorithms and data types used in computer science, so that you can write more efficient programs. If you are looking for a technical job that requires programming skills, this book might help you ace your next coding interview. I hope it inspires you to continue your journey in learning algorithms.

— Zvi Galil
Dean of Computing Emeritus
Frederick G. Storey Chair in Computing
Georgia Institute of Technology
Atlanta, May 2021
Preface

Who This Book Is For

If you are reading this book, I assume you already have a working knowledge of a programming language, such as Python. If you have never programmed before, I encourage you to first learn a programming language and then come back! I use Python in this book because it is accessible to programmers and nonprogrammers alike.

Algorithms are designed to solve common problems that arise frequently in software applications. When teaching algorithms to undergraduate students, I try to bridge the gap between the students’ background knowledge and the algorithm concepts I’m teaching. Many textbooks have carefully written—but always too brief—explanations. Without a guide to explain how to navigate this material, students are often unable to learn algorithms on their own.

In one paragraph and in Figure P-1, let me show you my goal for the book. I introduce a number of data structures that explain how to organize information using primitive fixed-size types, such as 32-bit integer values or 64-bit floating point values. Some algorithms, such as Binary Array Search, work directly on data structures. More complicated algorithms, especially graph algorithms, rely on a number of fundamental abstract data types, which I introduce as needed, such as stacks or priority queues. These data types provide fundamental operations that can be efficiently implemented by choosing the right data structure. By the end of this book, you will understand how the various algorithms achieve their performance. For these algorithms, I will either show full implementations in Python or refer you to third-party Python packages that provide efficient implementations.

If you review the associated code resources provided with the book, you will see that for each chapter there is a book.py Python file that can be executed to reproduce all tables within the book. As they say in the business, “your mileage may vary,” but the overall trends will still appear.
Figure P-1. Summary of the technical content of the book

At the end of every chapter in the book are challenge exercises that give you a chance to put your new knowledge to the test. I encourage you to try these on your own before you review my sample solutions, found in the code repository for the book.

About the Code

All the code for this book can be found in the associated GitHub repository, http://github.com/heineman/LearningAlgorithms. The code conforms to Python 3.4 or higher. Where relevant, I conform to Python best practices using double underscore methods, such as __str__() and __len__(). Throughout the code examples in the book, I use two-space indentation to reduce the width of the code on the printed page; the code repository uses standard four-space indentation. In a few code listings, I format code using an abbreviated one-line if statement like if j == lo: break.

The code uses three externally available, open source Python libraries:

• NumPy version 1.19.5
• SciPy version 1.6.0
• NetworkX version 2.5
NumPy and SciPy are among the most commonly used open source Python libraries and have a wide audience. I use these libraries to analyze empirical runtime performance. NetworkX provides a wide range of efficient algorithms for working with graphs, as covered in Chapter 7; it also provides a ready-to-use graph data type implementation. Using these libraries ensures that I do not unnecessarily reinvent the wheel. If you do not have these libraries installed, you will still be fine, since I provide workarounds.

All timing results presented in the book use the timeit module with repeated runs of a code snippet. Each snippet is often executed many times per run to ensure it can be accurately measured. After a number of runs, the minimum time is used as the timing performance, not the average of all runs. This is commonly considered the most effective way to produce an accurate timing measurement, because averaging a number of runs can skew timing results when some performance runs are affected by external factors, such as other executing tasks from the operating system. When the performance of an algorithm is highly sensitive to its input (such as Insertion Sort in Chapter 5), I will clearly state that I am taking the average over all performance runs.

The code repository contains over 10,000 lines of Python code, with scripts to execute all test cases and compute the tables presented in the book; many of the charts and graphs can also be reproduced. The code is documented using Python docstring conventions, and code coverage is 95%, using https://coverage.readthedocs.io. If you have a technical question or a problem using the code examples, please send email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code.
For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Learning Algorithms: A Programmer’s Guide to Writing Better Code by George T. Heineman (O’Reilly). Copyright 2021 George T. Heineman, 978-1-492-09106-6.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
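To make the timing methodology described in this section concrete, here is a sketch of my own (not code from the book's repository; the name best_time is made up) that uses the timeit module to repeat a snippet several runs and keep the minimum, since the fastest run is the one least disturbed by other operating-system activity:

```python
import timeit

def best_time(stmt, setup, repeat=5, number=100):
    """Return the fastest of `repeat` timing runs, each executing
    the snippet `number` times, as described in the text above."""
    times = timeit.repeat(stmt=stmt, setup=setup, repeat=repeat, number=number)
    return min(times)   # minimum, not average: least skewed by OS interference

t = best_time('max(values)', 'values = list(range(10000))')
print('fastest of 5 runs: {:.6f} seconds'.format(t))
```

The same approach works for any snippet; only the statement and its setup string change.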
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, filenames, file extensions, and points I want to emphasize.

Constant width
    Used for program listings as well as within paragraphs to refer to program elements such as variable or function names, data types, statements, and keywords.

This element, identified by an image of a ring-tailed lemur, is a tip or suggestion. I use this image because lemurs have a combined visual field of up to 280°, which is a wider visual field than anthropoid primates (such as humans). When you see this tip icon, I am literally asking you to open your eyes wider to learn a new fact or Python capability.

This element, identified by an image of a crow, signifies a general note. Numerous researchers have identified crows to be intelligent, problem-solving animals—some even use tools. I use these notes to define a new term or call your attention to a useful concept that you should understand before advancing to the next page.

This element, identified by an image of a scorpion, indicates a warning or caution. Much like in real life, when you see a scorpion, stop and look! I use the scorpion to call attention to key challenges you must address when applying algorithms.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.
How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/learn-algorithms.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia

Acknowledgments

For me, the study of algorithms is the best part of computer science. Thank you for giving me the opportunity to present this material to you. I also want to thank my wife, Jennifer, for her support on yet another book project, and my two sons, Nicholas and Alexander, who are now both old enough to learn about programming.

My O’Reilly editors—Melissa Duffield, Sarah Grey, Beth Kelly, and Virginia Wilson—improved the book by helping me organize the concepts and their explanations. My technical reviewers—Laura Helliwell, Charlie Lovering, Helen Scott, Stanley Selkow, and Aura Velarde—helped eliminate numerous inconsistencies and increase the quality of the algorithm implementations and explanations. All defects that remain are my responsibility.
Chapter 1. Problem Solving

In this chapter, you will learn:

• Multiple algorithms that solve an introductory problem.
• How to consider an algorithm’s performance on problem instances of size N.
• How to count the number of times a key operation is invoked when solving a given problem instance.
• How to determine order of growth as the size of a problem instance doubles.
• How to estimate time complexity by counting the number of key operations an algorithm executes on a problem instance of size N.
• How to estimate space complexity by determining the amount of memory required by an algorithm on a problem instance of size N.

Let’s get started!

What Is an Algorithm?

Explaining how an algorithm works is like telling a story. Each algorithm introduces a novel concept or innovation that improves upon ordinary solutions. In this chapter I explore several solutions to a simple problem to explain the factors that affect an algorithm’s performance. Along the way I introduce techniques used to analyze an algorithm’s performance independent of its implementation, though I will always provide empirical evidence from actual implementations.
An algorithm is a step-by-step problem-solving method implemented as a computer program that returns a correct result in a predictable amount of time. The study of algorithms is concerned with both correctness (will this algorithm work for all input?) and performance (is this the most efficient way to solve this problem?).

Let’s walk through an example of a problem-solving method to see what this looks like in practice. What if you wanted to find the largest value in an unordered list? Each Python list in Figure 1-1 is a problem instance, that is, the input processed by an algorithm (shown as a cylinder); the correct answer appears on the right. How is this algorithm implemented? How would it perform on different problem instances? Can you predict the time needed to find the largest value in a list of one million values?

Figure 1-1. Three different problem instances processed by an algorithm

An algorithm is more than just a problem-solving method; the program also needs to complete in a predictable amount of time. The built-in Python function max() already solves this problem. Now, it can be hard to predict an algorithm’s performance on problem instances containing random data, so it’s worth identifying problem instances that are carefully constructed.

Table 1-1 shows the results of timing max() on two kinds of problem instances of size N: one where the list contains ascending integers and one where the list contains descending integers. While your execution may yield different results in the table, based on the configuration of your computing system, you can verify the following two statements:

• The timing for max() on ascending values is always slower than on descending values once N is large enough.
• As N increases ten-fold in subsequent rows, the corresponding time for max() also appears to increase ten-fold, with some deviation, as is to be expected from live performance trials.
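The experiment behind these observations can be sketched as follows. This is my own reconstruction, not the book's book.py script; absolute times will differ on your machine, but the roughly ten-fold growth between sizes should still appear:

```python
import timeit

# Time max() on carefully constructed problem instances of size N:
# one list of ascending integers, one of descending integers.
for n in [100, 1000, 10000]:
    asc = timeit.timeit('max(vals)',
                        setup='vals = list(range({}))'.format(n),
                        number=50)
    desc = timeit.timeit('max(vals)',
                         setup='vals = list(range({}, -1, -1))'.format(n),
                         number=50)
    print('N={:>6}  ascending={:.6f}s  descending={:.6f}s'.format(n, asc, desc))
```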
For this problem, the maximum value is returned, and the input is unchanged. In some cases, the algorithm updates the problem instance directly instead of computing a new value—for example, sorting a list of values, as you will see in Chapter 5. In this book, N represents the size of a problem instance.

Table 1-1. Executing max() on two kinds of problem instances of size N (time in ms)

    N           Ascending values    Descending values
    100              0.001               0.001
    1,000            0.013               0.013
    10,000           0.135               0.125
    100,000          1.367               1.276
    1,000,000       14.278              13.419

When it comes to timing:

• You can’t predict in advance the value of T(100,000)—that is, the time required by the algorithm to solve a problem instance of size 100,000—because computing platforms vary, and different programming languages may be used.
• However, once you empirically determine T(10,000), you can predict T(100,000)—that is, the time to solve a problem instance ten times larger—though the prediction will inevitably be inaccurate to an extent.

When designing an algorithm, the primary challenge is to ensure it is correct and works for all input. I will spend more time in Chapter 2 explaining how to analyze and compare the behavior of different algorithms that solve the exact same problem. The field of algorithm analysis is tied to the study of interesting, relevant problems that arise in real life. While the mathematics of algorithms can be challenging to understand, I will provide specific examples to always connect the abstract concepts with real-world problems.

The standard way to judge the efficiency of an algorithm is to count how many computing operations it requires. But this is exceptionally hard to do! Computers have a central processing unit (CPU) that executes machine instructions that perform mathematical computations (like add and multiply), assign values to CPU registers, and compare two values with each other. Modern programming languages (like C or C++) are compiled into machine instructions.
Other languages (like Python or Java) are compiled into an intermediate byte code representation. The Python interpreter (which is itself a C program) executes the byte code, while built-in functions, such as min() and max(), are implemented in C and ultimately compiled into machine instructions for execution.
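The ten-fold prediction described earlier—once T(10,000) is measured, T(100,000) should be about ten times larger—can be checked against Table 1-1 with back-of-the-envelope arithmetic (a sketch of my own; values copied from the ascending column of the table):

```python
# Times in milliseconds, from the ascending column of Table 1-1.
t_10000 = 0.135
predicted_100000 = t_10000 * 10   # ten-fold growth predicts 1.35 ms
measured_100000 = 1.367           # Table 1-1 reports 1.367 ms

# The prediction is inaccurate to an extent, as the text warns,
# but the relative error is small.
error = abs(predicted_100000 - measured_100000) / measured_100000
print('predicted={:.3f} ms, measured={:.3f} ms, error={:.1%}'.format(
    predicted_100000, measured_100000, error))
```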
The Almighty Array

An array stores a collection of N values in a contiguous block of memory. It is one of the oldest and most dependable data structures programmers use to store multiple values. The following image represents an array of eight integers.

The array A has eight values indexed by their location. For example, A[0] = 31, and A[7] = 5. The values in A can be of any type, such as strings or more complicated objects. The following are important things to know about an array:

• The first value, A[0], is at index position 0; the last is A[N–1], where N is the size of the array.
• Each array has a fixed length. Python and Java allow the programmer to determine this length at runtime, while C does not.
• One can read or update an individual location, A[i], based on the index position, i, which is an integer in the range from 0 to N – 1.
• An array cannot be extended (or shrunk); instead, you allocate a new array of the desired size and copy the old values that should remain.

Despite their simplicity, arrays are an extremely versatile and efficient way to structure data. In Python, list objects can be considered arrays, even though they are more powerful because they can grow and shrink in size over time.

It is nearly impossible to count the total number of executed machine instructions for an algorithm, not to mention that modern-day CPUs can execute billions of instructions per second! Instead, I will count the number of times a key operation is invoked for each algorithm, which could be “the number of times two values in an array are compared with each other” or “how many times a function is called.” In this discussion of max(), the key operation is “how many times the less-than (<) operator is invoked.” I will expand on this counting principle in Chapter 2.

Now is a good time to lift up the hood on the max() algorithm to see why it behaves the way it does.
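The counting idea can be made concrete by instrumenting a simple scan for the largest value. This sketch is my own (the name largest_with_count is made up, and the chapter goes on to examine max() itself); it tallies how many times less-than is invoked, which is exactly N – 1 for a list of N values:

```python
def largest_with_count(A):
    """Return (largest value in A, number of less-than comparisons used)."""
    my_max = A[0]
    count = 0
    for v in A[1:]:
        count += 1          # one less-than comparison for each remaining value
        if my_max < v:
            my_max = v
    return (my_max, count)

print(largest_with_count([31, 41, 59, 26, 53, 58, 97, 93]))  # (97, 7)
```

A list of eight values requires seven invocations of less-than, regardless of how the values are ordered—only the number of assignments to my_max changes with the input.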