Author: Jens Albrecht, Sidharth Ramachandran, Christian Winkler

Turning text into valuable information is essential for businesses looking to gain a competitive advantage. With recent improvements in natural language processing (NLP), users now have many options for solving complex challenges. But it's not always clear which NLP tools or libraries will work for a business's needs, or which techniques to use and in what order. This practical book provides data scientists and developers with blueprints for best-practice solutions to common tasks in text analytics and natural language processing. Authors Jens Albrecht, Sidharth Ramachandran, and Christian Winkler provide real-world case studies and detailed code examples in Python to help you get started quickly.

- Extract data from APIs and web pages
- Prepare textual data for statistical analysis and machine learning
- Use machine learning for classification, topic modeling, and summarization
- Explain AI models and classification results
- Explore and visualize semantic similarities with word embeddings
- Identify customer sentiment in product reviews
- Create a knowledge graph based on named entities and their relations

Tags
python
Publisher: O'Reilly Media, Inc.
Publish Year: 2020
Language: English
File Format: PDF
File Size: 18.5 MB
Text Preview (First 20 pages)
Table of Contents

Preface
  a. Approach of the Book
  b. Prerequisites
  c. Some Important Libraries to Know
  d. Books to Read
  e. Conventions Used in This Book
  f. Using Code Examples
  g. O'Reilly Online Learning
  h. How to Contact Us
  i. Acknowledgments

1. Gaining Early Insights from Textual Data
  a. What You'll Learn and What We'll Build
  b. Exploratory Data Analysis
  c. Introducing the Dataset
  d. Blueprint: Getting an Overview of the Data with Pandas
    i. Calculating Summary Statistics for Columns
    ii. Checking for Missing Data
    iii. Plotting Value Distributions
    iv. Comparing Value Distributions Across Categories
    v. Visualizing Developments Over Time
  e. Blueprint: Building a Simple Text Preprocessing Pipeline
    i. Performing Tokenization with Regular Expressions
    ii. Treating Stop Words
    iii. Processing a Pipeline with One Line of Code
  f. Blueprints for Word Frequency Analysis
    i. Blueprint: Counting Words with a Counter
    ii. Blueprint: Creating a Frequency Diagram
    iii. Blueprint: Creating Word Clouds
    iv. Blueprint: Ranking with TF-IDF
  g. Blueprint: Finding a Keyword-in-Context
  h. Blueprint: Analyzing N-Grams
  i. Blueprint: Comparing Frequencies Across Time Intervals and Categories
    i. Creating Frequency Timelines
    ii. Creating Frequency Heatmaps
  j. Closing Remarks

2. Extracting Textual Insights with APIs
  a. What You'll Learn and What We'll Build
  b. Application Programming Interfaces
  c. Blueprint: Extracting Data from an API Using the Requests Module
    i. Pagination
    ii. Rate Limiting
  d. Blueprint: Extracting Twitter Data with Tweepy
    i. Obtaining Credentials
    ii. Installing and Configuring Tweepy
    iii. Extracting Data from the Search API
    iv. Extracting Data from a User's Timeline
    v. Extracting Data from the Streaming API
  e. Closing Remarks

3. Scraping Websites and Extracting Data
  a. What You'll Learn and What We'll Build
  b. Scraping and Data Extraction
  c. Introducing the Reuters News Archive
  d. URL Generation
  e. Blueprint: Downloading and Interpreting robots.txt
  f. Blueprint: Finding URLs from sitemap.xml
  g. Blueprint: Finding URLs from RSS
  h. Downloading Data
  i. Blueprint: Downloading HTML Pages with Python
  j. Blueprint: Downloading HTML Pages with wget
  k. Extracting Semistructured Data
  l. Blueprint: Extracting Data with Regular Expressions
  m. Blueprint: Using an HTML Parser for Extraction
  n. Blueprint: Spidering
    i. Introducing the Use Case
    ii. Error Handling and Production-Quality Software
  o. Density-Based Text Extraction
    i. Extracting Reuters Content with Readability
    ii. Summary Density-Based Text Extraction
  p. All-in-One Approach
  q. Blueprint: Scraping the Reuters Archive with Scrapy
  r. Possible Problems with Scraping
  s. Closing Remarks and Recommendation

4. Preparing Textual Data for Statistics and Machine Learning
  a. What You'll Learn and What We'll Build
  b. A Data Preprocessing Pipeline
  c. Introducing the Dataset: Reddit Self-Posts
    i. Loading Data Into Pandas
    ii. Blueprint: Standardizing Attribute Names
    iii. Saving and Loading a DataFrame
  d. Cleaning Text Data
    i. Blueprint: Identify Noise with Regular Expressions
    ii. Blueprint: Removing Noise with Regular Expressions
    iii. Blueprint: Character Normalization with textacy
    iv. Blueprint: Pattern-Based Data Masking with textacy
  e. Tokenization
    i. Blueprint: Tokenization with Regular Expressions
    ii. Tokenization with NLTK
    iii. Recommendations for Tokenization
  f. Linguistic Processing with spaCy
    i. Instantiating a Pipeline
    ii. Processing Text
    iii. Blueprint: Customizing Tokenization
    iv. Blueprint: Working with Stop Words
    v. Blueprint: Extracting Lemmas Based on Part of Speech
    vi. Blueprint: Extracting Noun Phrases
    vii. Blueprint: Extracting Named Entities
  g. Feature Extraction on a Large Dataset
    i. Blueprint: Creating One Function to Get It All
    ii. Blueprint: Using spaCy on a Large Dataset
    iii. Persisting the Result
    iv. A Note on Execution Time
  h. There Is More
    i. Language Detection
    ii. Spell-Checking
    iii. Token Normalization
  i. Closing Remarks and Recommendations

5. Feature Engineering and Syntactic Similarity
  a. What You'll Learn and What We'll Build
  b. A Toy Dataset for Experimentation
  c. Blueprint: Building Your Own Vectorizer
    i. Enumerating the Vocabulary
    ii. Vectorizing Documents
    iii. The Document-Term Matrix
    iv. The Similarity Matrix
  d. Bag-of-Words Models
    i. Blueprint: Using scikit-learn's CountVectorizer
    ii. Blueprint: Calculating Similarities
  e. TF-IDF Models
    i. Optimized Document Vectors with TfidfTransformer
    ii. Introducing the ABC Dataset
    iii. Blueprint: Reducing Feature Dimensions
    iv. Blueprint: Improving Features by Making Them More Specific
    v. Blueprint: Using Lemmas Instead of Words for Vectorizing Documents
    vi. Blueprint: Limit Word Types
    vii. Blueprint: Remove Most Common Words
    viii. Blueprint: Adding Context via N-Grams
  f. Syntactic Similarity in the ABC Dataset
    i. Blueprint: Finding Most Similar Headlines to a Made-up Headline
    ii. Blueprint: Finding the Two Most Similar Documents in a Large Corpus (Much More Difficult)
    iii. Blueprint: Finding Related Words
    iv. Tips for Long-Running Programs like Syntactic Similarity
  g. Summary and Conclusion

6. Text Classification Algorithms
  a. What You'll Learn and What We'll Build
  b. Introducing the Java Development Tools Bug Dataset
  c. Blueprint: Building a Text Classification System
    i. Step 1: Data Preparation
    ii. Step 2: Train-Test Split
    iii. Step 3: Training the Machine Learning Model
    iv. Step 4: Model Evaluation
  d. Final Blueprint for Text Classification
  e. Blueprint: Using Cross-Validation to Estimate Realistic Accuracy Metrics
  f. Blueprint: Performing Hyperparameter Tuning with Grid Search
  g. Blueprint Recap and Conclusion
  h. Closing Remarks
  i. Further Reading

7. How to Explain a Text Classifier
  a. What You'll Learn and What We'll Build
  b. Blueprint: Determining Classification Confidence Using Prediction Probability
  c. Blueprint: Measuring Feature Importance of Predictive Models
  d. Blueprint: Using LIME to Explain the Classification Results
  e. Blueprint: Using ELI5 to Explain the Classification Results
  f. Blueprint: Using Anchor to Explain the Classification Results
    i. Using the Distribution with Masked Words
    ii. Working with Real Words
  g. Closing Remarks

8. Unsupervised Methods: Topic Modeling and Clustering
  a. What You'll Learn and What We'll Build
  b. Our Dataset: UN General Debates
    i. Checking Statistics of the Corpus
    ii. Preparations
  c. Nonnegative Matrix Factorization (NMF)
    i. Blueprint: Creating a Topic Model Using NMF for Documents
    ii. Blueprint: Creating a Topic Model for Paragraphs Using NMF
  d. Latent Semantic Analysis/Indexing
    i. Blueprint: Creating a Topic Model for Paragraphs with SVD
  e. Latent Dirichlet Allocation
    i. Blueprint: Creating a Topic Model for Paragraphs with LDA
    ii. Blueprint: Visualizing LDA Results
  f. Blueprint: Using Word Clouds to Display and Compare Topic Models
  g. Blueprint: Calculating Topic Distribution of Documents and Time Evolution
  h. Using Gensim for Topic Modeling
    i. Blueprint: Preparing Data for Gensim
    ii. Blueprint: Performing Nonnegative Matrix Factorization with Gensim
    iii. Blueprint: Using LDA with Gensim
    iv. Blueprint: Calculating Coherence Scores
    v. Blueprint: Finding the Optimal Number of Topics
    vi. Blueprint: Creating a Hierarchical Dirichlet Process with Gensim
  i. Blueprint: Using Clustering to Uncover the Structure of Text Data
  j. Further Ideas
  k. Summary and Recommendation
  l. Conclusion

9. Text Summarization
  a. What You'll Learn and What We'll Build
  b. Text Summarization
    i. Extractive Methods
    ii. Data Preprocessing
  c. Blueprint: Summarizing Text Using Topic Representation
    i. Identifying Important Words with TF-IDF Values
    ii. LSA Algorithm
  d. Blueprint: Summarizing Text Using an Indicator Representation
  e. Measuring the Performance of Text Summarization Methods
  f. Blueprint: Summarizing Text Using Machine Learning
    i. Step 1: Creating Target Labels
    ii. Step 2: Adding Features to Assist Model Prediction
    iii. Step 3: Build a Machine Learning Model
  g. Closing Remarks
  h. Further Reading

10. Exploring Semantic Relationships with Word Embeddings
  a. What You'll Learn and What We'll Build
  b. The Case for Semantic Embeddings
    i. Word Embeddings
    ii. Analogy Reasoning with Word Embeddings
    iii. Types of Embeddings
  c. Blueprint: Using Similarity Queries on Pretrained Models
    i. Loading a Pretrained Model
    ii. Similarity Queries
  d. Blueprints for Training and Evaluating Your Own Embeddings
    i. Data Preparation
    ii. Blueprint: Training Models with Gensim
    iii. Blueprint: Evaluating Different Models
  e. Blueprints for Visualizing Embeddings
    i. Blueprint: Applying Dimensionality Reduction
    ii. Blueprint: Using the TensorFlow Embedding Projector
    iii. Blueprint: Constructing a Similarity Tree
  f. Closing Remarks
  g. Further Reading

11. Performing Sentiment Analysis on Text Data
  a. What You'll Learn and What We'll Build
  b. Sentiment Analysis
  c. Introducing the Amazon Customer Reviews Dataset
  d. Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches
    i. Bing Liu Lexicon
    ii. Disadvantages of a Lexicon-Based Approach
  e. Supervised Learning Approaches
    i. Preparing Data for a Supervised Learning Approach
  f. Blueprint: Vectorizing Text Data and Applying a Supervised Machine Learning Algorithm
    i. Step 1: Data Preparation
    ii. Step 2: Train-Test Split
    iii. Step 3: Text Vectorization
    iv. Step 4: Training the Machine Learning Model
  g. Pretrained Language Models Using Deep Learning
    i. Deep Learning and Transfer Learning
  h. Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model
    i. Step 1: Loading Models and Tokenization
    ii. Step 2: Model Training
    iii. Step 3: Model Evaluation
  i. Closing Remarks
  j. Further Reading

12. Building a Knowledge Graph
  a. What You'll Learn and What We'll Build
  b. Knowledge Graphs
    i. Information Extraction
  c. Introducing the Dataset
  d. Named-Entity Recognition
    i. Blueprint: Using Rule-Based Named-Entity Recognition
    ii. Blueprint: Normalizing Named Entities
    iii. Merging Entity Tokens
  e. Coreference Resolution
    i. Blueprint: Using spaCy's Token Extensions
    ii. Blueprint: Performing Alias Resolution
    iii. Blueprint: Resolving Name Variations
    iv. Blueprint: Performing Anaphora Resolution with NeuralCoref
    v. Name Normalization
    vi. Entity Linking
  f. Blueprint: Creating a Co-Occurrence Graph
    i. Extracting Co-Occurrences from a Document
    ii. Visualizing the Graph with Gephi
  g. Relation Extraction
    i. Blueprint: Extracting Relations Using Phrase Matching
    ii. Blueprint: Extracting Relations Using Dependency Trees
  h. Creating the Knowledge Graph
    i. Don't Blindly Trust the Results
  i. Closing Remarks
  j. Further Reading

13. Using Text Analytics in Production
  a. What You'll Learn and What We'll Build
  b. Blueprint: Using Conda to Create Reproducible Python Environments
  c. Blueprint: Using Containers to Create Reproducible Environments
  d. Blueprint: Creating a REST API for Your Text Analytics Model
  e. Blueprint: Deploying and Scaling Your API Using a Cloud Provider
  f. Blueprint: Automatically Versioning and Deploying Builds
  g. Closing Remarks
  h. Further Reading

Index
Praise for Blueprints for Text Analytics Using Python

This is the book I wish I had at the beginning of my research. Solidly written, well-researched, and substantiated with hands-on examples that can be replicated for a variety of business use cases that need ML.
—K.V.S. Dileep, Head, Program Development, GreyAtom

An excellent book for anyone looking to enter the world of text analytics in an efficient manner. Packed with well-thought-out examples that can jump-start the process of developing real-world applications using text analytics and natural language processing.
—Marcus Bender, Distinguished Solution Engineer and Oracle Fellow

The authors provide a comprehensive view of all useful methods and techniques related to text analytics and NLP that are used today in any production system. All datasets and use cases are inspired by real-life problems, which would help readers understand how complex business problems are solved in large organizations.
—Dr. Soudip Roy Chowdhury, Cofounder and CEO, Eugenie.ai

Text analytics as a field is advancing considerably, which mandates a solid foundation while building text-related applications. This book helps achieve exactly that, with detailed concepts and blueprints for the implementation of multiple applications on realistic datasets.
—Kishore Ayyadevara, author of books on ML and AI

A seamless melding of the methodical demands of the engineering discipline with the reactive nature of data science. This text is for the serious data engineer and balances an enterprise project's prescriptive nature with innovative techniques and exploratory scenarios.
—Craig Trim, Senior Engineer at Causality Link

This book bridges the gap between frantically Googling and hoping that it works, and just knowing that it will. The extremely code-driven layout combined with clear names of methods and approaches is a perfect combination to save you tons of time and heartache.
—Nirant Kasliwal, Verloop.io

This book is high quality, very practical, and teaches necessary basics.
—Oliver Zeigermann, book and video course author and machine learning practitioner
Blueprints for Text Analytics Using Python
Machine Learning-Based Solutions for Common Real World (NLP) Applications

Jens Albrecht, Sidharth Ramachandran, and Christian Winkler
Blueprints for Text Analytics Using Python
by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler

Copyright © 2021 Jens Albrecht, Sidharth Ramachandran, and Christian Winkler. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Michelle Smith
Development Editor: Amelia Blevins
Production Editor: Daniel Elfanbaum
Copyeditor: Kim Wimpsett
Proofreader: Piper Editorial LLC
Indexer: Sam Arnold-Boyd
Interior Designer: David Futato
Cover Designer: Jose Marzan
Illustrator: Kate Dullea

December 2020: First Edition

Revision History for the First Edition