Preface
    a. Approach of the Book
    b. Prerequisites
    c. Some Important Libraries to Know
    d. Books to Read
    e. Conventions Used in This Book
    f. Using Code Examples
    g. O’Reilly Online Learning
    h. How to Contact Us
    i. Acknowledgments
1. Gaining Early Insights from Textual Data
    a. What You’ll Learn and What We’ll Build
    b. Exploratory Data Analysis
    c. Introducing the Dataset
    d. Blueprint: Getting an Overview of the Data with Pandas
        i. Calculating Summary Statistics for Columns
        ii. Checking for Missing Data
        iii. Plotting Value Distributions
        iv. Comparing Value Distributions Across Categories
        v. Visualizing Developments Over Time
    e. Blueprint: Building a Simple Text Preprocessing Pipeline
        i. Performing Tokenization with Regular Expressions
        ii. Treating Stop Words
        iii. Processing a Pipeline with One Line of Code
    f. Blueprints for Word Frequency Analysis
        i. Blueprint: Counting Words with a Counter
        ii. Blueprint: Creating a Frequency Diagram
        iii. Blueprint: Creating Word Clouds
        iv. Blueprint: Ranking with TF-IDF
    g. Blueprint: Finding a Keyword-in-Context
    h. Blueprint: Analyzing N-Grams
    i. Blueprint: Comparing Frequencies Across Time Intervals and Categories
        i. Creating Frequency Timelines
        ii. Creating Frequency Heatmaps
    j. Closing Remarks
2. Extracting Textual Insights with APIs
    a. What You’ll Learn and What We’ll Build
    b. Application Programming Interfaces
    c. Blueprint: Extracting Data from an API Using the Requests Module
        i. Pagination
        ii. Rate Limiting
    d. Blueprint: Extracting Twitter Data with Tweepy
        i. Obtaining Credentials
        ii. Installing and Configuring Tweepy
        iii. Extracting Data from the Search API
        iv. Extracting Data from a User’s Timeline
        v. Extracting Data from the Streaming API
    e. Closing Remarks
3. Scraping Websites and Extracting Data
    a. What You’ll Learn and What We’ll Build
    b. Scraping and Data Extraction
    c. Introducing the Reuters News Archive
    d. URL Generation
    e. Blueprint: Downloading and Interpreting robots.txt
    f. Blueprint: Finding URLs from sitemap.xml
    g. Blueprint: Finding URLs from RSS
    h. Downloading Data
    i. Blueprint: Downloading HTML Pages with Python
    j. Blueprint: Downloading HTML Pages with wget
    k. Extracting Semistructured Data
    l. Blueprint: Extracting Data with Regular Expressions
    m. Blueprint: Using an HTML Parser for Extraction
    n. Blueprint: Spidering
        i. Introducing the Use Case
        ii. Error Handling and Production-Quality Software
    o. Density-Based Text Extraction
        i. Extracting Reuters Content with Readability
        ii. Summary Density-Based Text Extraction
    p. All-in-One Approach
    q. Blueprint: Scraping the Reuters Archive with Scrapy
    r. Possible Problems with Scraping
    s. Closing Remarks and Recommendation
4. Preparing Textual Data for Statistics and Machine Learning
    a. What You’ll Learn and What We’ll Build
    b. A Data Preprocessing Pipeline
    c. Introducing the Dataset: Reddit Self-Posts
        i. Loading Data Into Pandas
        ii. Blueprint: Standardizing Attribute Names
        iii. Saving and Loading a DataFrame
    d. Cleaning Text Data
        i. Blueprint: Identify Noise with Regular Expressions
        ii. Blueprint: Removing Noise with Regular Expressions
        iii. Blueprint: Character Normalization with textacy
        iv. Blueprint: Pattern-Based Data Masking with textacy
    e. Tokenization
        i. Blueprint: Tokenization with Regular Expressions
        ii. Tokenization with NLTK
        iii. Recommendations for Tokenization
    f. Linguistic Processing with spaCy
        i. Instantiating a Pipeline
        ii. Processing Text
        iii. Blueprint: Customizing Tokenization
        iv. Blueprint: Working with Stop Words
        v. Blueprint: Extracting Lemmas Based on Part of Speech
        vi. Blueprint: Extracting Noun Phrases
        vii. Blueprint: Extracting Named Entities
    g. Feature Extraction on a Large Dataset
        i. Blueprint: Creating One Function to Get It All
        ii. Blueprint: Using spaCy on a Large Dataset
        iii. Persisting the Result
        iv. A Note on Execution Time
    h. There Is More
        i. Language Detection
        ii. Spell-Checking
        iii. Token Normalization
    i. Closing Remarks and Recommendations
5. Feature Engineering and Syntactic Similarity
    a. What You’ll Learn and What We’ll Build
    b. A Toy Dataset for Experimentation
    c. Blueprint: Building Your Own Vectorizer
        i. Enumerating the Vocabulary
        ii. Vectorizing Documents
        iii. The Document-Term Matrix
        iv. The Similarity Matrix
    d. Bag-of-Words Models
        i. Blueprint: Using scikit-learn’s CountVectorizer
        ii. Blueprint: Calculating Similarities
    e. TF-IDF Models
        i. Optimized Document Vectors with TfidfTransformer
        ii. Introducing the ABC Dataset
        iii. Blueprint: Reducing Feature Dimensions
        iv. Blueprint: Improving Features by Making Them More Specific
        v. Blueprint: Using Lemmas Instead of Words for Vectorizing Documents
        vi. Blueprint: Limit Word Types
        vii. Blueprint: Remove Most Common Words
        viii. Blueprint: Adding Context via N-Grams
    f. Syntactic Similarity in the ABC Dataset
        i. Blueprint: Finding Most Similar Headlines to a Made-up Headline
        ii. Blueprint: Finding the Two Most Similar Documents in a Large Corpus (Much More Difficult)
        iii. Blueprint: Finding Related Words
        iv. Tips for Long-Running Programs like Syntactic Similarity
    g. Summary and Conclusion
6. Text Classification Algorithms
    a. What You’ll Learn and What We’ll Build
    b. Introducing the Java Development Tools Bug Dataset
    c. Blueprint: Building a Text Classification System
        i. Step 1: Data Preparation
        ii. Step 2: Train-Test Split
        iii. Step 3: Training the Machine Learning Model
        iv. Step 4: Model Evaluation
    d. Final Blueprint for Text Classification
    e. Blueprint: Using Cross-Validation to Estimate Realistic Accuracy Metrics
    f. Blueprint: Performing Hyperparameter Tuning with Grid Search
    g. Blueprint Recap and Conclusion
    h. Closing Remarks
    i. Further Reading
7. How to Explain a Text Classifier
    a. What You’ll Learn and What We’ll Build
    b. Blueprint: Determining Classification Confidence Using Prediction Probability
    c. Blueprint: Measuring Feature Importance of Predictive Models
    d. Blueprint: Using LIME to Explain the Classification Results
    e. Blueprint: Using ELI5 to Explain the Classification Results
    f. Blueprint: Using Anchor to Explain the Classification Results
        i. Using the Distribution with Masked Words
        ii. Working with Real Words
    g. Closing Remarks
8. Unsupervised Methods: Topic Modeling and Clustering
    a. What You’ll Learn and What We’ll Build
    b. Our Dataset: UN General Debates
        i. Checking Statistics of the Corpus
        ii. Preparations
    c. Nonnegative Matrix Factorization (NMF)
        i. Blueprint: Creating a Topic Model Using NMF for Documents
        ii. Blueprint: Creating a Topic Model for Paragraphs Using NMF
    d. Latent Semantic Analysis/Indexing
        i. Blueprint: Creating a Topic Model for Paragraphs with SVD
    e. Latent Dirichlet Allocation
        i. Blueprint: Creating a Topic Model for Paragraphs with LDA
        ii. Blueprint: Visualizing LDA Results
    f. Blueprint: Using Word Clouds to Display and Compare Topic Models
    g. Blueprint: Calculating Topic Distribution of Documents and Time Evolution
    h. Using Gensim for Topic Modeling
        i. Blueprint: Preparing Data for Gensim
        ii. Blueprint: Performing Nonnegative Matrix Factorization with Gensim
        iii. Blueprint: Using LDA with Gensim
        iv. Blueprint: Calculating Coherence Scores
        v. Blueprint: Finding the Optimal Number of Topics
        vi. Blueprint: Creating a Hierarchical Dirichlet Process with Gensim
    i. Blueprint: Using Clustering to Uncover the Structure of Text Data
    j. Further Ideas
    k. Summary and Recommendation
    l. Conclusion
9. Text Summarization
    a. What You’ll Learn and What We’ll Build
    b. Text Summarization
        i. Extractive Methods
        ii. Data Preprocessing
    c. Blueprint: Summarizing Text Using Topic Representation
        i. Identifying Important Words with TF-IDF Values
        ii. LSA Algorithm
    d. Blueprint: Summarizing Text Using an Indicator Representation
    e. Measuring the Performance of Text Summarization Methods
    f. Blueprint: Summarizing Text Using Machine Learning
        i. Step 1: Creating Target Labels
        ii. Step 2: Adding Features to Assist Model Prediction
        iii. Step 3: Build a Machine Learning Model
    g. Closing Remarks
    h. Further Reading
10. Exploring Semantic Relationships with Word Embeddings
    a. What You’ll Learn and What We’ll Build
    b. The Case for Semantic Embeddings
        i. Word Embeddings
        ii. Analogy Reasoning with Word Embeddings
        iii. Types of Embeddings
    c. Blueprint: Using Similarity Queries on Pretrained Models
        i. Loading a Pretrained Model
        ii. Similarity Queries
    d. Blueprints for Training and Evaluating Your Own Embeddings
        i. Data Preparation
        ii. Blueprint: Training Models with Gensim
        iii. Blueprint: Evaluating Different Models
    e. Blueprints for Visualizing Embeddings
        i. Blueprint: Applying Dimensionality Reduction
        ii. Blueprint: Using the TensorFlow Embedding Projector
        iii. Blueprint: Constructing a Similarity Tree
    f. Closing Remarks
    g. Further Reading
11. Performing Sentiment Analysis on Text Data
    a. What You’ll Learn and What We’ll Build
    b. Sentiment Analysis
    c. Introducing the Amazon Customer Reviews Dataset
    d. Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches
        i. Bing Liu Lexicon
        ii. Disadvantages of a Lexicon-Based Approach
    e. Supervised Learning Approaches
        i. Preparing Data for a Supervised Learning Approach
    f. Blueprint: Vectorizing Text Data and Applying a Supervised Machine Learning Algorithm
        i. Step 1: Data Preparation
        ii. Step 2: Train-Test Split
        iii. Step 3: Text Vectorization
        iv. Step 4: Training the Machine Learning Model
    g. Pretrained Language Models Using Deep Learning
        i. Deep Learning and Transfer Learning
    h. Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model
        i. Step 1: Loading Models and Tokenization
        ii. Step 2: Model Training
        iii. Step 3: Model Evaluation
    i. Closing Remarks
    j. Further Reading
12. Building a Knowledge Graph
    a. What You’ll Learn and What We’ll Build
    b. Knowledge Graphs
        i. Information Extraction
    c. Introducing the Dataset
    d. Named-Entity Recognition
        i. Blueprint: Using Rule-Based Named-Entity Recognition
        ii. Blueprint: Normalizing Named Entities
        iii. Merging Entity Tokens
    e. Coreference Resolution
        i. Blueprint: Using spaCy’s Token Extensions
        ii. Blueprint: Performing Alias Resolution
        iii. Blueprint: Resolving Name Variations
        iv. Blueprint: Performing Anaphora Resolution with NeuralCoref
        v. Name Normalization
        vi. Entity Linking
    f. Blueprint: Creating a Co-Occurrence Graph
        i. Extracting Co-Occurrences from a Document
        ii. Visualizing the Graph with Gephi
    g. Relation Extraction
        i. Blueprint: Extracting Relations Using Phrase Matching
        ii. Blueprint: Extracting Relations Using Dependency Trees
    h. Creating the Knowledge Graph
        i. Don’t Blindly Trust the Results
    i. Closing Remarks
    j. Further Reading
13. Using Text Analytics in Production
    a. What You’ll Learn and What We’ll Build
    b. Blueprint: Using Conda to Create Reproducible Python Environments
    c. Blueprint: Using Containers to Create Reproducible Environments
    d. Blueprint: Creating a REST API for Your Text Analytics Model
    e. Blueprint: Deploying and Scaling Your API Using a Cloud Provider
    f. Blueprint: Automatically Versioning and Deploying Builds
    g. Closing Remarks
    h. Further Reading
Index
Praise for Blueprints for Text Analytics Using Python

This is the book I wish I had at the beginning of my research. Solidly written, well-researched, and substantiated with hands-on examples that can be replicated for a variety of business use cases that need ML.

—K.V.S. Dileep, Head, Program Development, GreyAtom

An excellent book for anyone looking to enter the world of text analytics in an efficient manner. Packed with well-thought-out examples that can jump start the process of developing real-world applications using text analytics and natural language processing.

—Marcus Bender, Distinguished Solution Engineer and Oracle Fellow

The authors provide a comprehensive view of all useful methods and techniques related to text analytics and NLP that are used today in any production system. All datasets and use cases are inspired by real-life problems, which would help readers understand how complex business problems are solved in large organizations.

—Dr. Soudip Roy Chowdhury, Cofounder and CEO, Eugenie.ai

Text analytics as a field is advancing considerably, which mandates a solid foundation while building text-related applications. This book helps achieve exactly that, with detailed concepts and blueprints for the implementation of multiple applications on realistic datasets.

—Kishore Ayyadevara, author of books on ML and AI
A seamless melding of the methodical demands of the engineering discipline with the reactive nature of data science. This text is for the serious data engineer and balances an enterprise project’s prescriptive nature with innovative techniques and exploratory scenarios.

—Craig Trim, Senior Engineer at Causality Link

This book bridges the gap between frantically Googling and hoping that it works, and just knowing that it will. The extremely code-driven layout combined with clear names of methods and approaches is a perfect combination to save you tons of time and heartache.

—Nirant Kasliwal, Verloop.io

This book is high quality, very practical, and teaches necessary basics.

—Oliver Zeigermann, book and video course author and machine learning practitioner
Blueprints for Text Analytics Using Python

Machine Learning-Based Solutions for Common Real World (NLP) Applications

Jens Albrecht, Sidharth Ramachandran, and Christian Winkler
Blueprints for Text Analytics Using Python

by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler

Copyright © 2021 Jens Albrecht, Sidharth Ramachandran, and Christian Winkler. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Amelia Blevins
Production Editor: Daniel Elfanbaum
Copyeditor: Kim Wimpsett
Proofreader: Piper Editorial LLC
Indexer: Sam Arnold-Boyd
Interior Designer: David Futato
Cover Designer: Jose Marzan
Illustrator: Kate Dullea

December 2020: First Edition

Revision History for the First Edition