
Introducing Regular Expressions
Michael Fitzgerald
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

Introducing Regular Expressions
by Michael Fitzgerald

Copyright © 2012 Michael Fitzgerald. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Simon St. Laurent
Production Editor: Holly Bauer
Proofreader: Julie Van Keuren
Indexer: Lucie Haskins
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrator: Rebecca Demarest

July 2012: First Edition.

Revision History for the First Edition:
2012-07-10 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449392680 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Introducing Regular Expressions, the image of a fruit bat, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-39268-0
[LSI]
1341860829

Table of Contents

Preface

1. What Is a Regular Expression?
    Getting Started with Regexpal
    Matching a North American Phone Number
    Matching Digits with a Character Class
    Using a Character Shorthand
    Matching Any Character
    Capturing Groups and Back References
    Using Quantifiers
    Quoting Literals
    A Sample of Applications
    What You Learned in Chapter 1
    Technical Notes

2. Simple Pattern Matching
    Matching String Literals
    Matching Digits
    Matching Non-Digits
    Matching Word and Non-Word Characters
    Matching Whitespace
    Matching Any Character, Once Again
    Marking Up the Text
    Using sed to Mark Up Text
    Using Perl to Mark Up Text
    What You Learned in Chapter 2
    Technical Notes

3. Boundaries
    The Beginning and End of a Line
    Word and Non-word Boundaries
    Other Anchors
    Quoting a Group of Characters as Literals
    Adding Tags
    Adding Tags with sed
    Adding Tags with Perl
    What You Learned in Chapter 3
    Technical Notes

4. Alternation, Groups, and Backreferences
    Alternation
    Subpatterns
    Capturing Groups and Backreferences
    Named Groups
    Non-Capturing Groups
    Atomic Groups
    What You Learned in Chapter 4
    Technical Notes

5. Character Classes
    Negated Character Classes
    Union and Difference
    POSIX Character Classes
    What You Learned in Chapter 5
    Technical Notes

6. Matching Unicode and Other Characters
    Matching a Unicode Character
    Using vim
    Matching Characters with Octal Numbers
    Matching Unicode Character Properties
    Matching Control Characters
    What You Learned in Chapter 6
    Technical Notes

7. Quantifiers
    Greedy, Lazy, and Possessive
    Matching with *, +, and ?
    Matching a Specific Number of Times
    Lazy Quantifiers
    Possessive Quantifiers
    What You Learned in Chapter 7
    Technical Notes

8. Lookarounds
    Positive Lookaheads
    Negative Lookaheads
    Positive Lookbehinds
    Negative Lookbehinds
    What You Learned in Chapter 8
    Technical Notes

9. Marking Up a Document with HTML
    Matching Tags
    Transforming Plain Text with sed
    Substitution with sed
    Handling Roman Numerals with sed
    Handling a Specific Paragraph with sed
    Handling the Lines of the Poem with sed
    Appending Tags
    Using a Command File with sed
    Transforming Plain Text with Perl
    Handling Roman Numerals with Perl
    Handling a Specific Paragraph with Perl
    Handling the Lines of the Poem with Perl
    Using a File of Commands with Perl
    What You Learned in Chapter 9
    Technical Notes

10. The End of the Beginning
    Learning More
    Notable Tools, Implementations, and Libraries
    Perl
    PCRE
    Ruby (Oniguruma)
    Python
    RE2
    Matching a North American Phone Number
    Matching an Email Address
    What You Learned in Chapter 10

Appendix: Regular Expression Reference
Regular Expression Glossary
Index


Preface

This book shows you how to write regular expressions through examples. Its goal is to make learning regular expressions as easy as possible. In fact, this book demonstrates nearly every concept it presents by way of example so you can easily imitate the examples and try them yourself.

Regular expressions help you find patterns in text strings. More precisely, they are specially encoded text strings that match patterns in sets of strings, most often strings that are found in documents or files.

Regular expressions began to emerge when mathematician Stephen Kleene wrote his book Introduction to Metamathematics (New York, Van Nostrand), first published in 1952, though the concepts had been around since the early 1940s. They became more widely available to computer scientists with the advent of the Unix operating system (the work of Brian Kernighan, Dennis Ritchie, Ken Thompson, and others at AT&T Bell Labs) and its utilities, such as sed and grep, in the early 1970s.

The earliest appearance that I can find of regular expressions in a computer application is in the QED editor. QED, short for Quick Editor, was written for the Berkeley Timesharing System, which ran on the Scientific Data Systems SDS 940. Documented in 1970, it was a rewrite by Ken Thompson of a previous editor on MIT’s Compatible Time-Sharing System and yielded one of the earliest, if not the first, practical implementations of regular expressions in computing. (Table A-1 in the Appendix documents the regex features of QED.)

I’ll use a variety of tools to demonstrate the examples. You will, I hope, find most of them usable and useful; others won’t be usable because they are not readily available on your Windows system. You can skip the ones that aren’t practical for you or that aren’t appealing. But I recommend that anyone who is serious about a career in computing learn about regular expressions in a Unix-based environment. I have worked in that environment for 25 years and still learn new things every day.

“Those who don’t understand Unix are condemned to reinvent it, poorly.”
-- Henry Spencer

Some of the tools I’ll show you are available online via a web browser, which will be the easiest for most readers to use. Others you’ll use from a command or a shell prompt, and a few you’ll run on the desktop. The tools, if you don’t have them, will be easy to download. The majority are free or won’t cost you much money.

This book also goes light on jargon. I’ll share with you what the correct terms are when necessary, but in small doses. I use this approach because over the years, I’ve found that jargon can often create barriers. In other words, I’ll try not to overwhelm you with the dry language that describes regular expressions. That is because the basic philosophy of this book is this: Doing useful things can come before knowing everything about a given subject.

There are lots of different implementations of regular expressions. You will find regular expressions used in Unix command-line tools like vi (vim), grep, and sed, among others. You will find regular expressions in programming languages like Perl (of course), Java, JavaScript, C# or Ruby, and many more, and you will find them in declarative languages like XSLT 2.0. You will also find them in applications like Notepad++, Oxygen, or TextMate, among many others.

Most of these implementations have similarities and differences. I won’t cover all those differences in this book, but I will touch on a good number of them. If I attempted to document all the differences between all implementations, I’d have to be hospitalized. I won’t get bogged down in these kinds of details in this book. You’re expecting an introductory text, as advertised, and that is what you’ll get.

Who Should Read This Book

The audience for this book is people who haven’t ever written a regular expression before. If you are new to regular expressions or programming, this book is a good place to start. In other words, I am writing for the reader who has heard of regular expressions and is interested in them but who doesn’t really understand them yet. If that is you, then this book is a good fit.

I’ll cover the features of regex in order from the simple to the complex. In other words, we’ll go step by simple step.

Now, if you happen to already know something about regular expressions and how to use them, or if you are an experienced programmer, this book may not be where you want to start. This is a beginner’s book, for rank beginners who need some hand-holding. If you have written some regular expressions before, and feel familiar with them, you can start here if you want, but I’m planning to take it slower than you will probably like.

I recommend several books to read after this one. First, try Jeff Friedl’s Mastering Regular Expressions, Third Edition (see http://shop.oreilly.com/product/9781565922570.do). Friedl’s book gives regular expressions a thorough going over, and I highly recommend it. I also recommend the Regular Expressions Cookbook (see http://shop.oreilly.com/product/9780596520694.do) by Jan Goyvaerts and Steven Levithan. Jan Goyvaerts is the creator of RegexBuddy, a powerful desktop application (see http://www.regexbuddy.com/). Steven Levithan created RegexPal, an online regular expression processor that you’ll use in the first chapter of this book (see http://www.regexpal.com).

What You Need to Use This Book

To get the most out of this book, you’ll need access to tools available on Unix or Linux operating systems, such as Darwin on the Mac, a variant of BSD (Berkeley Software Distribution) on the Mac, or Cygwin on a Windows PC, which offers many GNU tools in its distribution (see http://www.cygwin.com and http://www.gnu.org).

There will be plenty of examples for you to try out here. You can just read them if you want, but to really learn, you’ll need to follow as many of them as you can, as the most important kind of learning, I think, always comes from doing, not from standing on the sidelines. You’ll be introduced to websites that will teach you what regular expressions are by highlighting matched results, workhorse command line tools from the Unix world, and desktop applications that analyze regular expressions or use them to perform text search.

You will find examples from this book on GitHub at https://github.com/michaeljamesfitzgerald/Introducing-Regular-Expressions. You will also find an archive of all the examples and test files in this book for download from http://examples.oreilly.com/9781449392680/examples.zip. It would be best if you create a working directory or folder on your computer and then download these files to that directory before you dive into the book.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, file extensions, and so forth.

Constant width
    Used for program listings, as well as within paragraphs, to refer to program elements such as expressions and command lines or any other programmatic elements.

This icon signifies a tip, suggestion, or a general note.

Using Code Examples

This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Introducing Regular Expressions by Michael Fitzgerald (O’Reilly). Copyright 2012 Michael Fitzgerald, 978-1-4493-9268-0.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact O’Reilly at permissions@oreilly.com.

Safari® Books Online

Safari Books Online (www.safaribooksonline.com) is an on-demand digital library that delivers expert content in both book and video form from the world’s leading authors in technology and business.

Technology professionals, software developers, web designers, and business and creative professionals use Safari Books Online as their primary resource for research, problem solving, learning, and certification training.

Safari Books Online offers a range of product mixes and pricing programs for organizations, government agencies, and individuals. Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, Course Technology, and dozens more. For more information about Safari Books Online, please visit us online.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

    O’Reilly Media, Inc.
    1005 Gravenstein Highway North
    Sebastopol, CA 95472
    800-998-9938 (in the United States or Canada)
    707-829-0515 (international or local)
    707-829-0104 (fax)

This book has a web page listing errata, examples, and any additional information. You can access this page at:

    http://orei.ly/intro_regex

To comment or to ask technical questions about this book, send email to:

    bookquestions@oreilly.com

For more information about O’Reilly books, courses, conferences, and news, see its website at http://www.oreilly.com.

Find O’Reilly on Facebook: http://facebook.com/oreilly
Follow O’Reilly on Twitter: http://twitter.com/oreillymedia
Watch O’Reilly on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

Once again, I want to express appreciation to my editor at O’Reilly, Simon St. Laurent, a very patient man without whom this book would never have seen the light of day. Thank you to Seara Patterson Coburn and Roger Zauner for your helpful reviews. And, as always, I want to recognize the love of my life, Cristi, who is my raison d’être.


CHAPTER 1
What Is a Regular Expression?

Regular expressions are specially encoded text strings used as patterns for matching sets of strings. They began to emerge in the 1940s as a way to describe regular languages, but they really began to show up in the programming world during the 1970s. The first place I could find them showing up was in the QED text editor written by Ken Thompson.

“A regular expression is a pattern which specifies a set of strings of characters; it is said to match certain strings.”
-- Ken Thompson

Regular expressions later became an important part of the tool suite that emerged from the Unix operating system: the ed, sed, and vi (vim) editors, grep, and AWK, among others. But the ways in which regular expressions were implemented were not always so regular.

This book takes an inductive approach; in other words, it moves from the specific to the general. So rather than an example after a treatise, you will often get the example first and then a short treatise following that. It’s a learn-by-doing book.

Regular expressions have a reputation for being gnarly, but that all depends on how you approach them. There is a natural progression from something as simple as this:

    \d

a character shorthand that matches any digit from 0 to 9, to something a bit more complicated, like:

    ^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$

which is where we’ll wind up at the end of this chapter: a fairly robust regular expression that matches a 10-digit, North American telephone number, with or without parentheses around the area code, or with or without hyphens or dots (periods) to separate the numbers. (The parentheses must be balanced, too; in other words, you can’t just have one.)
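If you would like a preview outside the browser, here is a minimal sketch that tries the longer pattern against a few sample strings. It assumes Python and its standard re module, which this chapter does not itself use; the helper name is_phone and the list of samples are made up for illustration.

```python
import re

# The chapter's phone number pattern, copied verbatim from the text above.
PHONE = re.compile(r"^(\(\d{3}\)|^\d{3}[.-]?)?\d{3}[.-]?\d{4}$")

def is_phone(text):
    """Return True if the whole string matches the phone pattern (hypothetical helper)."""
    return PHONE.search(text) is not None

samples = [
    "707-827-7019",    # hyphens
    "(707)827-7019",   # balanced parentheses around the area code
    "707.827.7019",    # dots
    "7078277019",      # no separators
    "(707-827-7019",   # unbalanced: opening parenthesis only
    "707)827-7019",    # unbalanced: closing parenthesis only
]

for s in samples:
    print(s, is_phone(s))
```

The first four samples should report True and the two with a single parenthesis False, which is the "balanced" behavior described above.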

Chapter 10 shows you a slightly more sophisticated regular expression for a phone number, but the one above is sufficient for the purposes of this chapter.

If you don’t get how that all works yet, don’t worry: I’ll explain the whole expression a little at a time in this chapter. If you will just follow the examples (and those throughout the book, for that matter), writing regular expressions will soon become second nature to you. Ready to find out for yourself?

I at times represent Unicode characters in this book using their code point, a four-digit, hexadecimal (base 16) number. These code points are shown in the form U+0000. U+002E, for example, represents the code point for a full stop or period (.).

Getting Started with Regexpal

First let me introduce you to the Regexpal website at http://www.regexpal.com. Open the site up in a browser, such as Google Chrome or Mozilla Firefox. You can see what the site looks like in Figure 1-1.

You can see that there is a text area near the top, and a larger text area below that. The top text box is for entering regular expressions, and the bottom one holds the subject or target text. The target text is the text or set of strings that you want to match.

At the end of this chapter and each following chapter, you’ll find a “Technical Notes” section. These notes provide additional information about the technology discussed in the chapter and tell you where to get more information about that technology. Placing these notes at the end of the chapters helps keep the flow of the main text moving forward rather than stopping to discuss each detail along the way.

Matching a North American Phone Number

Now we’ll match a North American phone number with a regular expression. Type the phone number shown here into the lower section of Regexpal:

    707-827-7019

Do you recognize it? It’s the number for O’Reilly Media.

Let’s match that number with a regular expression. There are lots of ways to do this, but to start out, simply enter the number itself in the upper section, exactly as it is written in the lower section (hold on now, don’t sigh):

    707-827-7019

What you should see is the phone number you entered in the lower box highlighted from beginning to end in yellow. If that is what you see (as shown in Figure 1-2), then you are in business.

When I mention colors in this book, in relation to something you might see in an image or a screenshot, such as the highlighting in Regexpal, those colors may appear online and in e-book versions of this book, but, alas, not in print. So if you are reading this book on paper, then when I mention a color, your world will be grayscale, with my apologies.

What you have done in this regular expression is use something called a string literal to match a string in the target text. A string literal is a literal representation of a string.

Now delete the number in the upper box and replace it with just the number 7. Did you see what happened? Now only the sevens are highlighted. The literal character (number) 7 in the regular expression matches the four instances of the number 7 in the text you are matching.

Figure 1-1. Regexpal in the Google Chrome browser
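The two string literal matches described in this section can also be reproduced in code. This is a minimal sketch, assuming Python's standard re module rather than Regexpal; the variable names are made up for illustration.

```python
import re

target = "707-827-7019"

# A string literal pattern matches itself, character for character.
whole = re.search("707-827-7019", target)
print(whole.group())   # 707-827-7019

# The single literal 7 matches every 7 in the target text.
sevens = re.findall("7", target)
print(len(sevens))     # 4
```

Each item returned by findall corresponds to one highlighted stretch in Regexpal.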

Matching Digits with a Character Class

What if you wanted to match all the numbers in the phone number, all at once? Or match any number for that matter? Try the following, exactly as shown, once again in the upper text box:

    [0-9]

All the numbers (more precisely digits) in the lower section are highlighted, in alternating yellow and blue. What the regular expression [0-9] is saying to the regex processor is, “Match any digit you find in the range 0 through 9.”

The square brackets are not literally matched because they are treated specially as metacharacters. A metacharacter has special meaning in regular expressions and is reserved. A regular expression in the form [0-9] is called a character class, or sometimes a character set.

Figure 1-2. Ten-digit phone number highlighted in Regexpal
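To see why the highlighting alternates, it helps to look at the individual matches. The following is a small sketch, again assuming Python's standard re module as a stand-in for Regexpal, with illustrative variable names.

```python
import re

target = "707-827-7019"

# [0-9] matches one digit at a time, so findall returns each digit
# as its own match (one per alternating highlight in Regexpal).
digits = re.findall(r"[0-9]", target)
print(digits)

# The hyphens are not in the class, so they are never matched.
print(len(digits))   # 10
```

The square brackets do not appear in any match; they only delimit the set of characters to look for.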

You can limit the range of digits more precisely and get the same result using a more specific list of digits to match, such as the following:

    [012789]

This will match only those digits listed, that is, 0, 1, 2, 7, 8, and 9. Try it in the upper box. Once again, every digit in the lower box will be highlighted in alternating colors.

To match any 10-digit, North American phone number, whose parts are separated by hyphens, you could do the following:

    [0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

This will work, but it’s bombastic. There is a better way with something called a shorthand.

Using a Character Shorthand

Yet another way to match digits, which you saw at the beginning of the chapter, is with \d which, by itself, will match all Arabic digits, just like [0-9]. Try that in the top section and, as with the previous regular expressions, the digits below will be highlighted. This kind of regular expression is called a character shorthand. (It is also called a character escape, but this term can be a little misleading, so I avoid it. I’ll explain later.)

To match any digit in the phone number, you could also do this:

    \d\d\d-\d\d\d-\d\d\d\d

Repeating the \d three and four times in sequence will exactly match three and four digits in sequence. The hyphen in the above regular expression is entered as a literal character and will be matched as such.

What about those hyphens? How do you match them? You can use a literal hyphen (-) as already shown, or you could use an escaped uppercase D (\D), which matches any character that is not a digit.

This sample uses \D in place of the literal hyphen.

    \d\d\d\D\d\d\d\D\d\d\d\d

Once again, the entire phone number, including the hyphens, should be highlighted.

Matching Any Character

You could also match those pesky hyphens with a dot (.):

    \d\d\d.\d\d\d.\d\d\d\d

The dot or period essentially acts as a wildcard and will match any character (except, in certain situations, a line ending). In the example above, the regular expression matches the hyphen, but it could also match a percent sign (%):

    707%827%7019

Or a vertical bar (|):

    707|827|7019

Or any other character.

As I mentioned, the dot character (officially, the full stop) will not normally match a newline character, such as a line feed (U+000A). However, there are ways to make it possible to match a newline with a dot, which I will show you later. This is often called the dotall option.

Capturing Groups and Back References

You’ll now match just a portion of the phone number using what is known as a capturing group. Then you’ll refer to the content of the group with a backreference. To create a capturing group, enclose a \d in a pair of parentheses to place it in a group, and then follow it with a \1 to backreference what was captured:

    (\d)\d\1

The \1 refers back to what was captured in the group enclosed by parentheses. As a result, this regular expression matches the prefix 707. Here is a breakdown of it:

• (\d) matches the first digit and captures it (the number 7)
• \d matches the next digit (the number 0) but does not capture it because it is not enclosed in parentheses
• \1 references the captured digit (the number 7)

This will match only the area code. Don’t worry if you don’t fully understand this right now. You’ll see plenty of examples of groups later in the book.

You could now match the whole phone number with one group and several backreferences:

    (\d)0\1\D\d\d\1\D\1\d\d\d

But that’s not quite as elegant as it could be. Let’s try something that works even better.

Using Quantifiers

Here is yet another way to match a phone number using a different syntax:

    \d{3}-?\d{3}-?\d{4}

The numbers in the curly braces tell the regex processor exactly how many occurrences of those digits you want it to look for. The braces with numbers are a kind of quantifier. The braces themselves are considered metacharacters.
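The patterns introduced so far in this chapter can be checked side by side. What follows is a minimal sketch, assuming Python's standard re module in place of Regexpal; the dictionary keys and variable names are made up for illustration, while the patterns themselves are copied from the text above. Note that re.DOTALL is Python's name for the dotall option and is not something this chapter has introduced.

```python
import re

target = "707-827-7019"

# Patterns copied from this chapter, in the order they were introduced.
patterns = {
    "character classes": r"[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]",
    "shorthand":         r"\d\d\d-\d\d\d-\d\d\d\d",
    "non-digit":         r"\d\d\d\D\d\d\d\D\d\d\d\d",
    "dot":               r"\d\d\d.\d\d\d.\d\d\d\d",
    "backreferences":    r"(\d)0\1\D\d\d\1\D\1\d\d\d",
    "quantifiers":       r"\d{3}-?\d{3}-?\d{4}",
}

# Every pattern should match the whole phone number.
for name, pattern in patterns.items():
    print(name, bool(re.fullmatch(pattern, target)))

# The dot accepts any separator; the literal hyphen does not.
print(bool(re.fullmatch(patterns["dot"], "707%827%7019")))        # True
print(bool(re.fullmatch(patterns["shorthand"], "707%827%7019")))  # False

# By default the dot does not match a newline; re.DOTALL changes that.
print(re.search(r"7.8", "7\n8") is None)                    # True
print(re.search(r"7.8", "7\n8", re.DOTALL) is not None)     # True

# (\d)\d\1 captures the first digit and requires it again two places later.
area = re.search(r"(\d)\d\1", target)
print(area.group(), area.group(1))   # 707 7

# -? makes each hyphen optional, so the bare digits match too.
print(bool(re.fullmatch(patterns["quantifiers"], "7078277019")))  # True
```

The backreference pattern is tied to this particular number because \1 must repeat the captured 7, whereas the quantifier pattern would accept any 10 digits laid out the same way.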
