Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
About this ebook
Collect and scrape different complexities of data from the modern Web using the latest tools, best practices, and techniques
Key Features
- Learn various scraping techniques using a range of Python libraries such as Scrapy and Beautiful Soup
- Build scrapers and crawlers to extract relevant information from the web
- Automate web scraping operations to bridge the accuracy gap and ease complex business needs
Web scraping is an essential technique used in many organizations to scrape valuable data from web pages. This book will enable you to delve deeply into web scraping techniques and methodologies.
This book will introduce you to the fundamental concepts of web scraping techniques and how they can be applied to multiple sets of web pages. We'll use powerful libraries from the Python ecosystem—such as Scrapy, lxml, pyquery, bs4, and others—to carry out web scraping operations. We will take an in-depth look at essential tasks to carry out simple to intermediate scraping operations such as identifying information from web pages, using patterns or attributes to retrieve information, and others. This book adopts a practical approach to web scraping concepts and tools, guiding you through a series of use cases and showing you how to use the best tools and techniques to efficiently scrape web pages. This book also covers the use of other popular web scraping tools, such as Selenium, Regex, and web-based APIs.
By the end of this book, you will have learned how to efficiently scrape the web using different techniques with Python and other popular tools.
What you will learn
- Analyze data and information from web pages
- Learn how to use browser-based developer tools from the scraping perspective
- Use XPath and CSS selectors to identify and explore markup elements
- Learn to handle and manage cookies
- Explore advanced concepts in handling HTML forms and processing logins
- Optimize web securities, data storage, and API use to scrape data
- Use Regex with Python to extract data
- Deal with complex web entities by using Selenium to find and extract data
This book is for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need! A working knowledge of the Python programming language is expected.
Hands-On Web Scraping with Python - Anish Chapagain
Hands-On Web Scraping with Python
Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
Anish Chapagain
BIRMINGHAM - MUMBAI
Hands-On Web Scraping with Python
Copyright © 2019 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Commissioning Editor: Sunith Shetty
Acquisition Editor: Aniruddha Patil
Content Development Editor: Roshan Kumar
Senior Editor: Ayaan Hoda
Technical Editor: Sushmeeta Jena
Copy Editor: Safis Editing
Project Coordinator: Namrata Swetta
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Production Designer: Alishon Mendonsa
First published: June 2019
Production reference: 2120619
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78953-339-2
www.packtpub.com
To my daughter, Aasira, and my family and friends. Special thanks to Ashish Chapagain,
Peter, and Prof. W.J. Teahan. This book is dedicated to you all.
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals
Improve your learning with Skill Plans built especially for you
Get a free eBook or video every month
Fully searchable for easy access to vital information
Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
Contributors
About the author
Anish Chapagain is a software engineer with a passion for data science, its processes, and Python programming, which began around 2007. He has been working with web scraping and analysis-related tasks for more than 5 years, and is currently pursuing freelance projects in the web scraping domain. Anish previously worked as a trainer, web/software developer, and as a banker, where he was exposed to data and gained further insights into topics including data analysis, visualization, data mining, information processing, and knowledge discovery. He has an MSc in computer systems from Bangor University (University of Wales), United Kingdom, and an Executive MBA from Himalayan Whitehouse International College, Kathmandu, Nepal.
About the reviewers
Radhika Datar has more than 5 years' experience in software development and content writing. She is well versed in frameworks such as Python, PHP, and Java, and regularly provides training on them. She has been working with Educba and Eduonix as a training consultant since June 2016, while also working as a freelance academic writer in data science and data analytics. She obtained her master's degree from the Symbiosis Institute of Computer Studies and Research and her bachelor's degree from K. J. Somaiya College of Science and Commerce.
Rohit Negi completed his bachelor of technology in computer science from Uttarakhand Technical University, Dehradun. His bachelor's curriculum included a specialization in computer science and applied engineering. Currently, he is working as a senior test consultant at Orbit Technologies and provides test automation solutions to LAM Research (USA clients). He has extensive quality assurance proficiency working with the following tools: Microsoft Azure VSTS, Selenium, Cucumber/BDD, MS SQL/MySQL, Java, and web scraping using Selenium. Additionally, he has a good working knowledge of how to automate workflows using Selenium, Protractor for AngularJS-based applications, Python for exploratory data analysis, and machine learning.
Packt is searching for authors like you
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Table of Contents
Title Page
Copyright and Credits
Hands-On Web Scraping with Python
Dedication
About Packt
Why subscribe?
Contributors
About the author
About the reviewers
Packt is searching for authors like you
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Download the color images
Conventions used
Get in touch
Reviews
Section 1: Introduction to Web Scraping
Web Scraping Fundamentals
Introduction to web scraping
Understanding web development and technologies
HTTP
HTML
HTML elements and attributes
Global attributes
XML
JavaScript
JSON
CSS
AngularJS
Data finding techniques for the web
HTML page source
Case 1
Case 2
Developer tools
Sitemaps
The robots.txt file
Summary
Further reading
Section 2: Beginning Web Scraping
Python and the Web – Using urllib and Requests
Technical requirements
Accessing the web with Python
Setting things up
Loading URLs
URL handling and operations with urllib and requests
urllib
requests
Implementing HTTP methods
GET
POST
Summary
Further reading
Using LXML, XPath, and CSS Selectors
Technical requirements
Introduction to XPath and CSS selector
XPath
CSS selectors
Element selectors
ID and class selectors
Attribute selectors
Pseudo selectors
Using web browser developer tools for accessing web content
HTML elements and DOM navigation
XPath and CSS selectors using DevTools
Scraping using lxml, a Python library
lxml by examples
Example 1 – reading XML from file and traversing through its elements
Example 2 – reading HTML documents using lxml.html
Example 3 – reading and parsing HTML for retrieving HTML form type element attributes
Web scraping using lxml
Example 1 – extracting selected data from a single page using lxml.html.xpath
Example 2 – looping with XPath and scraping data from multiple pages
Example 3 – using lxml.cssselect to scrape content from a page
Summary
Further reading
Scraping Using pyquery – a Python Library
Technical requirements
Introduction to pyquery
Exploring pyquery
Loading documents
Element traversing, attributes, and pseudo-classes
Iterating
Web scraping using pyquery
Example 1 – scraping data science announcements
Example 2 – scraping information from nested links
Example 3 – extracting AHL Playoff results
Example 4 – collecting URLs from sitemap.xml
Case 1 – using the HTML parser
Case 2 – using the XML parser
Summary
Further reading
Web Scraping Using Scrapy and Beautiful Soup
Technical requirements
Web scraping using Beautiful Soup
Introduction to Beautiful Soup
Exploring Beautiful Soup
Searching, traversing, and iterating
Using children and parents
Using next and previous
Using CSS Selectors
Example 1 – listing <li> elements with the data-id attribute
Example 2 – traversing through elements
Example 3 – searching elements based on attribute values
Building a web crawler
Web scraping using Scrapy
Introduction to Scrapy
Setting up a project
Generating a Spider
Creating an item
Extracting data
Using XPath
Using CSS Selectors
Data from multiple pages
Running and exporting
Deploying a web crawler
Summary
Further reading
Section 3: Advanced Concepts
Working with Secure Web
Technical requirements
Introduction to secure web
Form processing
Cookies and sessions
Cookies
Sessions
User authentication
HTML <form> processing
Handling user authentication
Working with cookies and sessions
Summary
Further reading
Data Extraction Using Web-Based APIs
Technical requirements
Introduction to web APIs
REST and SOAP
REST
SOAP
Benefits of web APIs
Accessing web API and data formats
Making requests to the web API using a web browser
Case 1 – accessing a simple API (request and response)
Case 2 – demonstrating status codes and informative responses from the API
Case 3 – demonstrating RESTful API cache functionality
Web scraping using APIs
Example 1 – searching and collecting university names and URLs
Example 2 – scraping information from GitHub events
Summary
Further reading
Using Selenium to Scrape the Web
Technical requirements
Introduction to Selenium
Selenium projects
Selenium WebDriver
Selenium RC
Selenium Grid
Selenium IDE
Setting things up
Exploring Selenium
Accessing browser properties
Locating web elements
Using Selenium for web scraping
Example 1 – scraping product information
Example 2 – scraping book information
Summary
Further reading
Using Regex to Extract Data
Technical requirements
Overview of regular expressions
Regular expressions and Python
Using regular expressions to extract data
Example 1 – extracting HTML-based content
Example 2 – extracting dealer locations
Example 3 – extracting XML content
Summary
Further reading
Section 4: Conclusion
Next Steps
Technical requirements
Managing scraped data
Writing to files
Analysis and visualization using pandas and matplotlib
Machine learning
ML and AI
Python and ML
Types of ML algorithms
Supervised learning
Classification
Regression
Unsupervised learning
Association
Clustering
Reinforcement learning
Data mining
Tasks of data mining
Predictive
Classification
Regression
Prediction
Descriptive
Clustering
Summarization
Association rules
What's next?
Summary
Further reading
Other Books You May Enjoy
Leave a review - let other readers know what you think
Preface
Web scraping is an essential technique used in many organizations to extract valuable data from web pages. Web scraping, or web harvesting, is done with a view to extracting and collecting data from websites. It comes in handy in model development, which requires data to be collected on the fly. It is also useful when the data must be current and relevant to the topic, and short-term accuracy matters more than relying on existing datasets. Collected data is stored in files in formats such as JSON, CSV, and XML, written to a database for later use, or made available online as datasets. This book will help you delve deep into web scraping techniques and methodologies using Python libraries and other popular tools, such as Selenium. By the end of this book, you will have learned how to efficiently scrape different websites.
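As a rough illustration of storing collected data in file formats such as JSON and CSV, the following is a minimal sketch using only the Python standard library. The records, field names, and URLs are hypothetical placeholders, not data from the book, and the text is built in memory rather than written to disk.

```python
import csv
import io
import json

# Hypothetical scraped records; field names and values are illustrative only.
records = [
    {"title": "Example page A", "url": "http://localhost:8080/a"},
    {"title": "Example page B", "url": "http://localhost:8080/b"},
]


def to_json(rows):
    """Serialize a list of dicts to a JSON string."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["title", "url"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = to_json(records)
csv_text = to_csv(records)
print(json_text)
print(csv_text)
```

To write real files, the same strings could be passed to a file opened with open(), or the writer could be pointed at the file object directly.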
Who this book is for
This book is intended for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need!
What this book covers
Chapter 1, Web Scraping Fundamentals, explores some core technologies and tools that are relevant to WWW and that are required for web scraping.
Chapter 2, Python and the Web – Using urllib and Requests, demonstrates some of the core features available through Python libraries such as requests and urllib, in addition to exploring page contents in various formats and structures.
Chapter 3, Using LXML, XPath, and CSS Selectors, describes various examples using LXML, implementing a variety of techniques and library features to deal with elements and ElementTree.
Chapter 4, Scraping Using pyquery – a Python Library, goes into more detail regarding web scraping techniques and a number of new Python libraries that deploy these techniques.
Chapter 5, Web Scraping Using Scrapy and Beautiful Soup, examines various aspects of traversing web documents using Beautiful Soup, while also exploring a framework that was built for crawling activities using spiders, in other words, Scrapy.
Chapter 6, Working with Secure Web, covers a number of basic security-related measures and techniques that are often encountered and that pose a challenge to web scraping.
Chapter 7, Data Extraction Using Web-Based APIs, covers the Python programming language and how to interact with the web APIs with regard to data extraction.
Chapter 8, Using Selenium to Scrape the Web, covers Selenium and how to use it to scrape data from the web.
Chapter 9, Using Regex to Extract Data, goes into more detail regarding web scraping techniques using regular expressions.
Chapter 10, Next Steps, introduces and examines basic concepts regarding data management using files, and analysis and visualization using pandas and matplotlib, while also providing an introduction to machine learning and data mining and exploring a number of related resources that can be helpful in terms of further learning and career development.
To get the most out of this book
Readers should have some working knowledge of the Python programming language.
Download the example code files
You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:
Log in or register at www.packt.com.
Select the SUPPORT tab.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box and follow the onscreen instructions.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR/7-Zip for Windows
Zipeg/iZip/UnRarX for Mac
7-Zip/PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Web-Scraping-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789533392_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: and The
HTML elements contain general text information (element content) with them.
A block of code is set as follows:
import requests
# The URL must be a quoted string
link = "http://localhost:8080/~cache"
queries = {'id': '123456', 'display': 'yes'}
addedheaders = {'user-agent': ''}
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
import requests
link = "http://localhost:8080/~cache"
queries = {'id': '123456', 'display': 'yes'}
addedheaders = {'user-agent': ''}
Any command-line input or output is written as follows:
C:\> pip --version
pip 18.1 from c:\python37\lib\site-packages\pip (python 3.7)
Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: If accessing Developer tools through the Chrome menu, click More tools | Developer tools
Warnings or important notes appear like this.
Tips and tricks appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit packt.com.
Section 1: Introduction to Web Scraping
In this section, you will be given an overview of web scraping (scraping requirements, the importance of data), web contents (patterns and layouts), Python programming and libraries (the basics and advanced), and data managing techniques (file handling and databases).
This section consists of the following chapter:
Chapter 1, Web Scraping Fundamentals
Web Scraping Fundamentals
In this chapter, we will learn about and explore certain fundamental concepts related to web scraping and web-based technologies, assuming that you have no prior experience of web scraping.
So, to start with, let's begin by asking a number of questions:
Why is there a growing need or demand for data?
How are we going to manage and fulfill the requirement for data with resources from the World Wide Web (WWW)?
Web scraping addresses both these questions, as it provides various tools and technologies that can be deployed to extract data or assist with information retrieval. Whether the data on the web is structured or unstructured, we can use the web scraping process to extract it and use it for research, analysis, personal collections, information extraction, knowledge discovery, and many other purposes.
We will learn general techniques that are deployed to find data from the web and explore those techniques in depth using the Python programming language in the chapters ahead.
In this chapter, we will cover the following topics:
Introduction to web scraping
Understanding web development and technologies
Data finding techniques
Introduction to web scraping
Scraping is the process of extracting, copying, screening, or collecting data. Scraping or extracting data from the web (that is, from websites, web pages, or other internet-related resources) is normally termed web scraping.
Web scraping is the process of extracting data from the web to meet specific requirements. Because data collection and analysis feed into information gathering, decision making, and research-related activities, the scraping process is relevant to all types of industry.
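To make the idea of extracting data from a web page concrete, here is a minimal sketch that collects link targets from a small HTML string using Python's built-in html.parser module. The sample markup and the names LinkCollector and extract_links are assumptions made for illustration; they do not come from the book, and a real scraper would first obtain the HTML from an HTTP response.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this would come from an HTTP response.
SAMPLE_HTML = """
<html><body>
  <h1>Sample listing</h1>
  <a href="/page/1">First</a>
  <a href="/page/2">Second</a>
  <a>No link here</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


links = extract_links(SAMPLE_HTML)
print(links)
```

The standard library parser is used here only to keep the sketch dependency-free; libraries such as lxml, pyquery, and Beautiful Soup, covered in later chapters, offer more convenient selection by XPath or CSS selectors.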
The popularity of the internet and its resources is causing information domains to evolve every day, which is also causing a growing demand for raw data. Data is the basic requirement in the fields of science, technology, and management. Collected or organized data is processed with varying degrees of logic to obtain information and gain further insights.
Web scraping provides the tools and techniques used to collect data from websites as appropriate for either personal or business-related needs, but with a number of legal considerations.
There are a number of legal factors to consider before performing scraping tasks. Most websites contain pages such as Privacy Policy, About Us, and Terms and Conditions, where legal terms, prohibited content policies, and general information are available. It's a developer's ethical duty to follow those policies before planning any crawling and scraping activities from websites.
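One machine-readable signal that can be checked before crawling is a site's robots.txt file. The following is a minimal sketch, assuming a hypothetical robots.txt body, a made-up user agent name (example-bot), and placeholder localhost URLs. It uses the standard library's urllib.robotparser and parses the rules from a string so that no network request is made. Passing this check is not a substitute for reading a site's terms and policies.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = build_rules(SAMPLE_ROBOTS_TXT)

# "example-bot" is an assumed user agent name used only for this sketch.
can_public = rules.can_fetch("example-bot", "http://localhost:8080/catalog/")
can_private = rules.can_fetch("example-bot", "http://localhost:8080/private/data")

print("catalog allowed:", can_public)
print("private allowed:", can_private)
```

For a live site, set_url() followed by read() would load the file over HTTP instead of parsing a string.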
The terms scraping and crawling are used interchangeably throughout the chapters in this book.