Web Scraping with Python: Data Extraction from the Modern Web (Ryan Mitchell)

Author: Ryan Mitchell

Python

If programming is magic, then web scraping is surely a form of wizardry. By writing a simple automated program, you can query web servers, request data, and parse it to extract the information you need. This thoroughly updated third edition not only introduces you to web scraping but also serves as a comprehensive guide to scraping almost every type of data from the modern web.

Part I focuses on web scraping mechanics: using Python to request information from a web server, performing basic handling of the server's response, and interacting with sites in an automated fashion. Part II explores a variety of more specific tools and applications to fit any web scraping scenario you're likely to encounter.

Parse complicated HTML pages
Develop crawlers with the Scrapy framework
Learn methods to store the data you scrape
Read and extract data from documents
Clean and normalize badly formatted data
Read and write natural languages
Crawl through forms and logins
Scrape JavaScript and crawl through APIs
Use and write image-to-text software
Avoid scraping traps and bot blockers
Use scrapers to test your website

📄 Page 2

Web Scraping with Python
THIRD EDITION
Data Extraction from the Modern Web
Ryan Mitchell

📄 Page 3

Web Scraping with Python
by Ryan Mitchell

Copyright © 2024 Ryan Mitchell. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Acquisitions Editor: Amanda Quinn
Development Editor: Sara Hunter
Production Editor: Aleeya Rahman
Copyeditor: Sonia Saruba
Proofreader: Piper Editorial Consulting, LLC
Indexer: nSight, Inc
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Kate Dullea

July 2015: First Edition
April 2018: Second Edition

📄 Page 4

February 2024: Third Edition

Revision History for the Third Edition
2024-02-14: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781098145354 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Web Scraping with Python, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-14535-4
[LSI]

📄 Page 5

Preface

To those who have not developed the skill, computer programming can seem like a kind of magic. If programming is magic, web scraping is wizardry: the application of magic for particularly impressive and useful (yet surprisingly effortless) feats.

In my years as a software engineer, I’ve found that few programming practices capture the excitement of both programmers and laypeople alike quite like web scraping. The ability to write a simple bot that collects data and streams it down a terminal or stores it in a database, while not difficult, never fails to provide a certain thrill and sense of possibility, no matter how many times you might have done it before.

Unfortunately, when I speak to other programmers about web scraping, there’s a lot of misunderstanding and confusion about the practice. Some people aren’t sure it’s legal (it is), or how to handle problems like JavaScript-heavy pages or required logins. Many are confused about how to start a large web scraping project, or even where to find the data they’re looking for. This book seeks to put an end to many of these common questions and misconceptions about web scraping, while providing a comprehensive guide to most common web scraping tasks.

Web scraping is a diverse and fast-changing field, and I’ve tried to provide both high-level concepts and concrete examples to cover just about any data collection project you’re likely to encounter. Throughout the book, code samples are provided to demonstrate these concepts and

📄 Page 6

allow you to try them out. The code samples themselves can be used and modified with or without attribution (although acknowledgment is always appreciated). All code samples are available on GitHub for viewing and downloading.

What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as screen scraping, data mining, web harvesting, or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.

In theory, web scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as data analysis, natural language parsing, and information security. Because the scope of the field is so broad, this book covers the fundamentals of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers carefully study the first

📄 Page 7

part and delve into the more specific topics in the second part as needed.

Why Web Scraping?

If the only way you access the internet is through a browser, you’re missing out on a huge range of possibilities. Although browsers are handy for executing JavaScript, displaying images, and arranging objects in a more human-readable format (among other things), web scrapers are excellent at gathering and processing large amounts of data quickly. Rather than viewing one page at a time through the narrow window of a monitor, you can view databases spanning thousands or even millions of pages at once.

In addition, web scrapers can go places that traditional search engines cannot. A Google search for “cheapest flights to Boston” will result in a slew of advertisements and popular flight search sites. Google knows only what these websites say on their content pages, not the exact results of various queries entered into a flight search application. However, a well-developed web scraper can chart the cost of a flight to Boston over time, across a variety of websites, and tell you the best time to buy your ticket.

You might be asking: “Isn’t data gathering what APIs are for?” (If you’re unfamiliar with APIs, see Chapter 15.) Well, APIs can be fantastic, if you find one that suits your purposes. They are designed to provide a convenient stream of well-formatted data from one computer program to another. You can find an API for many types of data you might want to use, such as Twitter posts or Wikipedia pages. In general, it is preferable to use an API (if one

📄 Page 8

exists), rather than build a bot to get the same data. However, an API might not exist or be useful for your purposes for several reasons:

You are gathering relatively small, finite sets of data across a large collection of websites without a cohesive API.
The data you want is fairly small or uncommon, and the creator did not think it warranted an API.
The source does not have the infrastructure or technical ability to create an API.
The data is valuable and/or protected and not intended to be spread widely.

Even when an API does exist, the request volume and rate limits, the types of data, or the format of data that it provides might be insufficient for your purposes.

This is where web scraping steps in. With few exceptions, if you can view data in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.

There are obviously many extremely practical applications of having access to nearly unlimited data: market forecasting, machine-language translation, and even medical diagnostics have benefited tremendously from the ability to retrieve and analyze data from news sites, translated texts, and health forums, respectively.

Even in the art world, web scraping has opened up new frontiers for creation. The 2006 project “We Feel Fine” by Jonathan Harris and Sep Kamvar scraped a variety of English-language blog sites for phrases starting with “I

📄 Page 9

feel” or “I am feeling.” This led to a popular data visualization, describing how the world was feeling day by day and minute by minute.

Regardless of your field, web scraping almost always provides a way to guide business practices more effectively, improve productivity, or even branch off into a brand-new field entirely.

About This Book

This book is designed to serve not only as an introduction to web scraping but also as a comprehensive guide to collecting, transforming, and using data from uncooperative sources. Although it uses the Python programming language and covers many Python basics, it should not be used as an introduction to the language.

If you don’t know any Python at all, this book might be a bit of a challenge. Please do not use it as an introductory Python text. With that said, I’ve tried to keep all concepts and code samples at a beginning-to-intermediate Python programming level in order to make the content accessible to a wide range of readers. To this end, there are occasional explanations of more advanced Python programming and general computer science topics where appropriate. If you are a more advanced reader, feel free to skim these parts!

If you’re looking for a more comprehensive Python resource, Introducing Python by Bill Lubanovic (O’Reilly) is a good, if lengthy, guide. For those with shorter attention spans, the video series Introduction to Python by Jessica McKellar (O’Reilly) is an excellent resource. I’ve also enjoyed Think Python by a former professor of mine, Allen

📄 Page 10

Downey (O’Reilly). This last book in particular is ideal for those new to programming, and teaches computer science and software engineering concepts along with the Python language.

Technical books often focus on a single language or technology, but web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools. This book attempts to cover all of these, and other topics, from the perspective of “data gathering.” It should not be used as a complete treatment of any of these subjects, but I believe they are covered in enough detail to get you started writing web scrapers!

Part I covers the subject of web scraping and web crawling in depth, with a strong focus on a small handful of libraries used throughout the book. Part I can easily be used as a comprehensive reference for these libraries and techniques (with certain exceptions, where additional references will be provided). The skills taught in the first part will likely be useful for everyone writing a web scraper, regardless of their particular target or application.

Part II covers additional subjects that the reader might find useful when writing web scrapers, but that might not be useful for all scrapers all the time. These subjects are, unfortunately, too broad to be neatly wrapped up in a single chapter. Because of this, frequent references are made to other resources for additional information.

The structure of this book enables you to easily jump around among chapters to find only the web scraping technique or information that you are looking for. When a concept or piece of code builds on another mentioned in a

📄 Page 11

previous chapter, I explicitly reference the section that it was addressed in.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Constant width bold
Shows commands or other text that should be typed by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP
This element signifies a tip or suggestion.

📄 Page 12

NOTE
This element signifies a general note.

WARNING
This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/REMitchell/python-scraping.

This book is here to help you get your job done. If the example code in this book is useful to you, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Web Scraping with Python, Third Edition, by Ryan Mitchell (O’Reilly). Copyright 2024 Ryan Mitchell, 978-1-098-14535-4.”

📄 Page 13

If you feel your use of code examples falls outside fair use or the permission given here, feel free to contact us at permissions@oreilly.com.

Unfortunately, printed books are difficult to keep up-to-date. With web scraping, this provides an added challenge, as the many libraries and websites that the book references and that the code often depends on may occasionally be modified, and code samples may fail or produce unexpected results. If you choose to run the code samples, please run them from the GitHub repository rather than copying from the book directly. I, and readers of this book who choose to contribute (including, perhaps, you!), will strive to keep the repository up-to-date with required modifications.

In addition to code samples, terminal commands are often provided to illustrate how to install and run software. In general, these commands are geared toward Linux-based operating systems but will usually be applicable for Windows users with a properly configured Python environment and pip installation. When this is not the case, I have provided instructions for all major operating systems, or external references for Windows users to accomplish the task.

O’Reilly Online Learning

NOTE
For more than 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our

📄 Page 14

online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit https://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-889-8969 (in the United States or Canada)
707-829-7019 (international or local)
707-829-0104 (fax)
support@oreilly.com
https://www.oreilly.com/about/contact.html

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/web_scraping_with_python.

For news and information about our books and courses, visit https://oreilly.com.

📄 Page 15

Find us on LinkedIn: https://linkedin.com/company/oreilly-media
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://youtube.com/oreillymedia

Acknowledgments

Just as some of the best products arise out of a sea of user feedback, this book never could have existed in any useful form without the help of many collaborators, cheerleaders, and editors. Thank you to the O’Reilly staff and their amazing support for this somewhat unconventional subject; to my friends and family who have offered advice and put up with impromptu readings; and to my coworkers at the Gerson Lehrman Group, whom I now likely owe many hours of work.

Thank you to my editors: Sara Hunter, John Obelenus, and Tracey Larvenz. Their feedback, guidance, and occasional tough love were invaluable. Quite a few sections and code samples were written as a direct result of their suggestions.

The inspiration for the first two chapters, as well as many new inclusions throughout the third edition, are thanks to Bryan Specht. The legacy he left is more broad and vast than even he knew, but the hole he left to be filled by that legacy is even bigger.

Finally, thanks to Jim Waldo, who started this whole project many years ago when he mailed a Linux box and The Art and Science of C by Eric Roberts (Addison-Wesley) to a young, impressionable teenager.

📄 Page 16

Part I. Building Scrapers

The first part of this book focuses on the basic mechanics of web scraping: how to use Python to request information from a web server, how to perform basic handling of the server’s response, and how to begin interacting with a website in an automated fashion. By the end, you’ll be cruising around the internet with ease, building scrapers that can hop from one domain to another, gather information, and store that information for later use.

To be honest, web scraping is a fantastic field to get into if you want a huge payout for relatively little up-front investment. In all likelihood, 90% of web scraping projects you’ll encounter will draw on techniques used in just the next six chapters. This section covers what the general (albeit technically savvy) public tends to think of when they think of “web scrapers”:

Retrieving HTML data from a domain name
Parsing that data for target information
Storing the target information
Optionally, moving to another page to repeat the process

This will give you a solid foundation before moving on to more complex projects in Part II. Don’t be fooled into thinking that this first section isn’t as important as some of the more advanced projects in the second half. You will use

📄 Page 17

nearly all the information in the first half of this book on a daily basis while writing web scrapers!
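
As a rough illustration of the four steps listed in this introduction (retrieve, parse, store, and optionally move on to another page), here is a minimal sketch that uses only the Python standard library. It is not the book's own code: the class and function names, the `pages` table, and the choice of `html.parser` and SQLite are all assumptions made for this example, and the `fetch` parameter exists only so that the crawl can be exercised without a live network connection.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class TitleAndLinkParser(HTMLParser):
    """Hypothetical helper: collects the page title and every <a href>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def retrieve(url):
    """Step 1: retrieve HTML from a URL (performs a real network request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html, base_url):
    """Step 2: parse the HTML for target information (title and links)."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    # Resolve relative links against the page they were found on.
    links = [urljoin(base_url, href) for href in parser.links]
    return parser.title.strip(), links


def store(connection, url, title):
    """Step 3: store the target information in a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    connection.execute(
        "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)", (url, title)
    )
    connection.commit()


def crawl(start_url, connection, max_pages=3, fetch=retrieve):
    """Step 4: optionally move to another page and repeat the process."""
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        title, links = parse(fetch(url), url)
        store(connection, url, title)
        queue.extend(links)
    return seen
```

Calling `crawl("https://example.com/", sqlite3.connect("pages.db"))` would fetch up to three pages starting from that address; the URL and database filename here are placeholders. A real crawler would also need error handling, politeness delays, and respect for each site's terms of use, all of which this sketch omits.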

📄 Page 18

Chapter 1. How the Internet Works

I have met very few people in my life who truly know how the internet works, and I am certainly not one of them.

The vast majority of us are making do with a set of mental abstractions that allow us to use the internet just as much as we need to. Even for programmers, these abstractions might extend only as far as what was required for them to solve a particularly tricky problem once in their career.

Due to limitations in page count and the knowledge of the author, this chapter must also rely on these sorts of abstractions. It describes the mechanics of the internet and web applications, to the extent needed to scrape the web (and then, perhaps, a little more).

This chapter, in a sense, describes the world in which web scrapers operate: the customs, practices, protocols, and standards that will be revisited throughout the book.

When you type a URL into the address bar of your web browser and hit Enter, interactive text, images, and media spring up as if by magic. This same magic is happening for billions of other people every day. They’re visiting the same websites and using the same applications, often getting media and text customized just for them.

And these billions of people are all using different types of devices and software applications, written by different developers at different (often competing!) companies.

📄 Page 19

Amazingly, there is no all-powerful governing body regulating the internet and coordinating its development with any sort of legal force. Instead, different parts of the internet are governed by several different organizations that evolved over time on a somewhat ad hoc and opt-in basis.

Of course, choosing not to opt into the standards that these organizations publish may result in your contributions to the internet simply...not working. If your website can’t be displayed in popular web browsers, people likely aren’t going to visit it. If the data your router is sending can’t be interpreted by any other router, that data will be ignored.

Web scraping is, essentially, the practice of replacing a web browser with an application of your own design. Because of this, it’s important to understand the standards and frameworks that web browsers are built on. As a web scraper, you must both mimic and, at times, subvert the expected internet customs and practices.

Networking

In the early days of the telephone system, each telephone was connected by a physical wire to a central switchboard. If you wanted to make a call to a nearby friend, you picked up the phone, asked the switchboard operator to connect you, and the switchboard operator physically created (via plugs and jacks) a dedicated connection between your phone and your friend’s phone.

Long-distance calls were expensive and could take minutes to connect. Placing a long-distance call from Boston to Seattle would result in the coordination of switchboard operators across the United States creating a single

📄 Page 20

enormous length of wire directly connecting your phone to the recipient’s.

Today, rather than make a telephone call over a temporary dedicated connection, we can make a video call from our house to anywhere in the world across a persistent web of wires. The wire doesn’t tell the data where to go; the data guides itself, in a process called packet switching. Although many technologies over the years contributed to what we think of as “the internet,” packet switching is really the technology that single-handedly started it all.

In a packet-switched network, the message to be sent is divided into discrete ordered packets, each with its own sender and destination address. These packets are routed dynamically to any destination on the network, based on that address. Rather than being forced to blindly traverse a single dedicated connection from sender to receiver, the packets can take any path the network chooses. In fact, packets in the same message transmission might take different routes across the network and be reordered by the receiving computer when they arrive.

If the old phone networks were like a zip line, taking passengers from a single point at the top of a hill to a single point at the bottom, then packet-switched networks are like a highway system, where cars going to and from multiple destinations are all able to use the same roads.

A modern packet-switching network is usually described using the Open Systems Interconnection (OSI) model, which is composed of seven layers of routing, encoding, and error handling:

1. Physical layer
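
To make the packet-switching description above more concrete, the following toy simulation divides a message into ordered packets, each carrying its own sender and destination address, delivers them out of order, and reassembles them on the receiving side. It is only a sketch of the idea, not a model of any real protocol: the `Packet` fields, the function names, the default packet size, and the shuffle used to stand in for "different routes" are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    sender: str        # address of the sending machine
    destination: str   # address of the receiving machine
    sequence: int      # position of this packet within the message
    payload: str       # the slice of the message this packet carries


def packetize(message, sender, destination, size=8):
    """Divide a message into ordered packets of at most `size` characters."""
    return [
        Packet(sender, destination, seq, message[start:start + size])
        for seq, start in enumerate(range(0, len(message), size))
    ]


def send_over_network(packets, seed=None):
    """Simulate packets taking different routes by delivering them in arbitrary order."""
    delivered = list(packets)
    random.Random(seed).shuffle(delivered)
    return delivered


def reassemble(packets):
    """Reorder packets by sequence number on the receiving side and rebuild the message."""
    ordered = sorted(packets, key=lambda packet: packet.sequence)
    return "".join(packet.payload for packet in ordered)
```

For example, `reassemble(send_over_network(packetize("Hello from Boston to Seattle!", "10.0.0.1", "10.0.0.2")))` returns the original string regardless of the order in which the packets arrive; the two addresses are placeholders. Real networks must also cope with lost, duplicated, and corrupted packets, which this sketch ignores.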
