Natural Language Processing with Spark NLP
Learning to Understand Text at Scale

Alex Thomas

Beijing · Boston · Farnham · Sebastopol · Tokyo
Natural Language Processing with Spark NLP
by Alex Thomas

Copyright © 2020 Alex Thomas. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Mike Loukides
Developmental Editors: Nicole Taché, Gary O’Brien
Production Editor: Beth Kelly
Copyeditor: Piper Editorial
Proofreader: Athena Lakri
Indexer: WordCo, Inc.
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

July 2020: First Edition

Revision History for the First Edition
2020-06-24: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492047766 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Natural Language Processing with Spark NLP, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

ISBN: 978-1-492-04776-6
[LSI]
Table of Contents

Preface  xi

Part I. Basics

1. Getting Started  3
    Introduction  3
    Other Tools  5
    Setting Up Your Environment  6
    Prerequisites  6
    Starting Apache Spark  6
    Checking Out the Code  7
    Getting Familiar with Apache Spark  7
    Starting Apache Spark with Spark NLP  8
    Loading and Viewing Data in Apache Spark  8
    Hello World with Spark NLP  11

2. Natural Language Basics  19
    What Is Natural Language?  19
    Origins of Language  20
    Spoken Language Versus Written Language  21
    Linguistics  22
    Phonetics and Phonology  22
    Morphology  23
    Syntax  24
    Semantics  25
    Sociolinguistics: Dialects, Registers, and Other Varieties  25
    Formality  26
    Context  26
    Pragmatics  27
    Roman Jakobson  27
    How To Use Pragmatics  28
    Writing Systems  28
    Origins  28
    Alphabets  29
    Abjads  30
    Abugidas  31
    Syllabaries  32
    Logographs  32
    Encodings  33
    ASCII  33
    Unicode  33
    UTF-8  34
    Exercises: Tokenizing  34
    Tokenize English  35
    Tokenize Greek  35
    Tokenize Ge’ez (Amharic)  36
    Resources  36

3. NLP on Apache Spark  39
    Parallelism, Concurrency, Distributing Computation  40
    Parallelization Before Apache Hadoop  43
    MapReduce and Apache Hadoop  43
    Apache Spark  44
    Architecture of Apache Spark  44
    Physical Architecture  44
    Logical Architecture  46
    Spark SQL and Spark MLlib  51
    Transformers  54
    Estimators and Models  57
    Evaluators  60
    NLP Libraries  63
    Functionality Libraries  63
    Annotation Libraries  63
    NLP in Other Libraries  64
    Spark NLP  65
    Annotation Library  65
    Stages  65
    Pretrained Pipelines  72
    Finisher  74
    Exercises: Build a Topic Model  76
    Resources  77

4. Deep Learning Basics  79
    Gradient Descent  84
    Backpropagation  85
    Convolutional Neural Networks  96
    Filters  96
    Pooling  97
    Recurrent Neural Networks  97
    Backpropagation Through Time  97
    Elman Nets  98
    LSTMs  98
    Exercise 1  99
    Exercise 2  99
    Resources  100

Part II. Building Blocks

5. Processing Words  103
    Tokenization  104
    Vocabulary Reduction  107
    Stemming  108
    Lemmatization  108
    Stemming Versus Lemmatization  108
    Spelling Correction  110
    Normalization  112
    Bag-of-Words  113
    CountVectorizer  114
    N-Gram  116
    Visualizing: Word and Document Distributions  118
    Exercises  122
    Resources  122

6. Information Retrieval  123
    Inverted Indices  124
    Building an Inverted Index  124
    Vector Space Model  130
    Stop-Word Removal  133
    Inverse Document Frequency  134
    In Spark  137
    Exercises  137
    Resources  138

7. Classification and Regression  139
    Bag-of-Words Features  142
    Regular Expression Features  143
    Feature Selection  145
    Modeling  148
    Naïve Bayes  149
    Linear Models  149
    Decision/Regression Trees  149
    Deep Learning Algorithms  150
    Iteration  150
    Exercises  153

8. Sequence Modeling with Keras  155
    Sentence Segmentation  156
    (Hidden) Markov Models  156
    Section Segmentation  163
    Part-of-Speech Tagging  164
    Conditional Random Field  168
    Chunking and Syntactic Parsing  168
    Language Models  169
    Recurrent Neural Networks  170
    Exercise: Character N-Grams  176
    Exercise: Word Language Model  176
    Resources  177

9. Information Extraction  179
    Named-Entity Recognition  179
    Coreference Resolution  187
    Assertion Status Detection  189
    Relationship Extraction  191
    Summary  195
    Exercises  196

10. Topic Modeling  197
    K-Means  198
    Latent Semantic Indexing  202
    Nonnegative Matrix Factorization  205
    Latent Dirichlet Allocation  209
    Exercises  211

11. Word Embeddings  215
    Word2vec  215
    GloVe  226
    fastText  227
    Transformers  227
    ELMo, BERT, and XLNet  228
    doc2vec  229
    Exercises  231

Part III. Applications

12. Sentiment Analysis and Emotion Detection  235
    Problem Statement and Constraints  235
    Plan the Project  236
    Design the Solution  240
    Implement the Solution  241
    Test and Measure the Solution  245
    Business Metrics  245
    Model-Centric Metrics  246
    Infrastructure Metrics  247
    Process Metrics  247
    Offline Versus Online Model Measurement  248
    Review  248
    Initial Deployment  249
    Fallback Plans  249
    Next Steps  250
    Conclusion  250

13. Building Knowledge Bases  251
    Problem Statement and Constraints  252
    Plan the Project  253
    Design the Solution  253
    Implement the Solution  255
    Test and Measure the Solution  262
    Business Metrics  262
    Model-Centric Metrics  262
    Infrastructure Metrics  263
    Process Metrics  263
    Review  264
    Conclusion  264

14. Search Engine  265
    Problem Statement and Constraints  266
    Plan the Project  266
    Design the Solution  266
    Implement the Solution  267
    Test and Measure the Solution  275
    Business Metrics  275
    Model-Centric Metrics  275
    Review  276
    Conclusion  276

15. Chatbot  277
    Problem Statement and Constraints  278
    Plan the Project  279
    Design the Solution  279
    Implement the Solution  280
    Test and Measure the Solution  289
    Business Metrics  289
    Model-Centric Metrics  290
    Review  290
    Conclusion  290

16. Object Character Recognition  291
    Kinds of OCR Tasks  291
    Images of Printed Text and PDFs to Text  291
    Images of Handwritten Text to Text  292
    Images of Text in Environment to Text  292
    Images of Text to Target  293
    Note on Different Writing Systems  293
    Problem Statement and Constraints  294
    Plan the Project  294
    Implement the Solution  295
    Test and Measure the Solution  299
    Model-Centric Metrics  300
    Review  300
    Conclusion  300

Part IV. Building NLP Systems

17. Supporting Multiple Languages  303
    Language Typology  303
    Scenario: Academic Paper Classification  303
    Text Processing in Different Languages  304
    Compound Words  304
    Morphological Complexity  305
    Transfer Learning and Multilingual Deep Learning  306
    Search Across Languages  307
    Checklist  308
    Conclusion  308

18. Human Labeling  309
    Guidelines  310
    Scenario: Academic Paper Classification  310
    Inter-Labeler Agreement  312
    Iterative Labeling  313
    Labeling Text  314
    Classification  314
    Tagging  314
    Checklist  315
    Conclusion  315

19. Productionizing NLP Applications  317
    Spark NLP Model Cache  318
    Spark NLP and TensorFlow Integration  319
    Spark Optimization Basics  319
    Design-Level Optimization  321
    Profiling Tools  322
    Monitoring  322
    Managing Data Resources  322
    Testing NLP-Based Applications  323
    Unit Tests  323
    Integration Tests  323
    Smoke and Sanity Tests  323
    Performance Tests  324
    Usability Tests  324
    Demoing NLP-Based Applications  325
    Checklists  325
    Model Deployment Checklist  325
    Scaling and Performance Checklist  326
    Testing Checklist  326
    Conclusion  327

Glossary  329

Index  339
Preface

Why Natural Language Processing Is Important and Difficult

Natural language processing (NLP) is a field of study concerned with processing language data. We will be focusing on text, but natural language audio data is also a part of NLP. Dealing with natural language text data is difficult. The reason it is difficult is that it relies on three fields of study: linguistics, software engineering, and machine learning. It is hard to find the expertise in all three for most NLP-based projects. Fortunately, you don’t need to be a world-class expert in all three fields to make informed decisions about your application. As long as you know some basics, you can use libraries built by experts to accomplish your goals. Consider the advances made in creating efficient algorithms for vector and matrix operations. If the common linear algebra libraries that deep learning libraries use were not available, imagine how much harder it would have been for the deep learning revolution to begin. Even though these libraries mean that we don’t need to implement cache-aware matrix multiplication for every new project, we still need to understand the basics of linear algebra and of how the operations are implemented to make the best use of these libraries. I believe the situation is becoming the same for NLP and NLP libraries.

Applications that use natural language (text, spoken, and gestural) will always be different from other applications because of the data they use. The benefit of, and draw to, these applications is how much data is out there: humans are producing and churning out natural language data all the time. The difficult aspects are that people have literally evolved to detect mistakes in natural language use, and that the data (text, images, audio, and video) is not made with computers in mind. These difficulties can be overcome through a combination of linguistics, software engineering, and machine learning.
This book deals with text data. This is the easiest of the data types that natural language comes in, because our computers were designed with text in mind. That being said, we still want to consider a lot of small and large details that are not obvious.

Background

A few years ago, I was working on a tutorial for O’Reilly. This tutorial was about building NLP pipelines on Apache Spark. At the time, Apache Spark 2.0 was still relatively new, but I was mainly using version 1.6. I thought it would be cool to build an annotation library using the new DataFrames and pipelines; alas, I was not able to implement this for the tutorial. However, I talked about this with my friend (and tutorial copresenter) David Talby, and we created a design doc. I didn’t have enough time to work on building the library, so I consulted Saif Addin, whom David had hired to work on the project. As the project grew and developed, David, Claudiu Branzan (another friend and colleague), and I began presenting tutorials at conferences and meetups. It seemed like there was interest in learning more about the library, and in learning more about NLP in general.

People who know me know I am rant-prone, and few topics are as likely to get me started as NLP and how it is used and misused in the technology industry. I think this is because of my background. Growing up, I studied linguistics as a hobby—an all-consuming hobby. When I went to university, even though I focused on mathematics, I also took linguistics courses. Shortly before graduating, I decided that I also wanted to learn computer science, so I could take the theoretical concepts I had learned and create something. Once I began in the industry, I learned that I could combine these three interests into one: NLP. This gives me a rare view of NLP because I studied its components first individually and then combined. I am really excited to be working on this book, and I hope it helps you build your next NLP application!

Philosophy

An important part of the library is the idea that people should build their own models. There is no one-size-fits-all method in NLP. If you want to build a successful NLP application, you need to understand your data as well as your product. Prebuilt models are useful for initial versions, demos, and tutorials. This means that if you want to use Spark NLP successfully, you will need to understand how NLP works. So in this book we will cover more than just the Spark NLP API. We will talk about how to use Spark NLP, but we will also talk about how NLP and deep learning work. When you combine an understanding of NLP with a library that is built with the intent of customization, you will be able to build NLP applications that achieve your goals.
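To give a first taste of what that looks like in practice, here is a minimal sketch of the kind of Spark NLP pipeline introduced later in the book (see “Hello World with Spark NLP” in Chapter 1). It is illustrative only: it assumes PySpark and the spark-nlp package are installed, and the sample sentence and column names are placeholders, not anything prescribed by the book.

    import sparknlp
    from sparknlp.base import DocumentAssembler, Finisher
    from sparknlp.annotator import Tokenizer
    from pyspark.ml import Pipeline

    # Start a Spark session with Spark NLP available.
    spark = sparknlp.start()

    # A one-row DataFrame standing in for your own text data.
    data = spark.createDataFrame(
        [("Spark NLP lets you build your own pipelines.",)], ["text"]
    )

    # Turn raw text into Spark NLP document annotations.
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    # Split each document into tokens.
    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("tokens")

    # Convert annotations back into plain arrays of strings.
    finisher = Finisher().setInputCols(["tokens"])

    # Fit and apply the pipeline, then inspect the result.
    pipeline = Pipeline(stages=[document_assembler, tokenizer, finisher])
    model = pipeline.fit(data)
    model.transform(data).show(truncate=False)

The point, in keeping with the philosophy above, is that each stage in such a pipeline can be replaced or retrained to fit your own data and product.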
Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
    Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
    Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
    Shows commands or other text that should be typed literally by the user.

Constant width italic
    Shows text that should be replaced with user-supplied values or by values determined by context.

This element signifies a general note.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://oreil.ly/SparkNLP.

If you have a technical question or a problem using the code examples, please send an email to bookquestions@oreilly.com.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but generally do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN.
For example: “Natural Language Processing with Spark NLP by Alex Thomas (O’Reilly). Copyright 2020 Alex Thomas, 978-1-492-04776-6.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O’Reilly Online Learning

For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/NLPSpark.

Email bookquestions@oreilly.com to comment or ask technical questions about this book.

For news and information about our books and courses, visit http://oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://youtube.com/oreillymedia
Acknowledgments

I want to thank my editors at O’Reilly, Nicole Taché and Gary O’Brien, for their help and support. I want to thank the tech reviewers, who were of great help in restructuring the book. I also want to thank Mike Loukides for his guidance in starting this project. I want to thank David Talby for all his mentorship. I want to thank Saif Addin, Maziyar Panahi, and the rest of the John Snow Labs team for taking the initial design David and I had and making it into a successful and widely used library. I also want to thank Vishnu Vettrivel for his support and counsel during this project. Finally, I want to thank my family and friends for their patience and encouragement.
Part I. Basics