Hands-On Web Scraping with Python: Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others
Ebook, 588 pages, 3 hours

About this ebook

Collect and scrape different complexities of data from the modern Web using the latest tools, best practices, and techniques

Key Features
  • Learn various scraping techniques using a range of Python libraries such as Scrapy and Beautiful Soup
  • Build scrapers and crawlers to extract relevant information from the web
  • Automate web scraping operations to bridge the accuracy gap and ease complex business needs
Book Description

Web scraping is an essential technique used in many organizations to scrape valuable data from web pages. This book will enable you to delve deeply into web scraping techniques and methodologies.

This book will introduce you to the fundamental concepts of web scraping techniques and how they can be applied to multiple sets of web pages. We'll use powerful libraries from the Python ecosystem—such as Scrapy, lxml, pyquery, bs4, and others—to carry out web scraping operations. We will take an in-depth look at essential tasks to carry out simple to intermediate scraping operations such as identifying information from web pages, using patterns or attributes to retrieve information, and others. This book adopts a practical approach to web scraping concepts and tools, guiding you through a series of use cases and showing you how to use the best tools and techniques to efficiently scrape web pages. This book also covers the use of other popular web scraping tools, such as Selenium, Regex, and web-based APIs.

By the end of this book, you will have learned how to efficiently scrape the web using different techniques with Python and other popular tools.

What you will learn
  • Analyze data and information from web pages
  • Learn how to use browser-based developer tools from the scraping perspective
  • Use XPath and CSS selectors to identify and explore markup elements
  • Learn to handle and manage cookies
  • Explore advanced concepts in handling HTML forms and processing logins
  • Optimize web securities, data storage, and API use to scrape data
  • Use Regex with Python to extract data
  • Deal with complex web entities by using Selenium to find and extract data
Who this book is for

This book is for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need! A working knowledge of the Python programming language is expected.

Language: English
Publisher: Packt Publishing
Release date: Jul 15, 2019
ISBN: 9781789536195
    Book preview

    Hands-On Web Scraping with Python

    Perform advanced scraping operations using various Python libraries and tools such as Selenium, Regex, and others

    Anish Chapagain

    BIRMINGHAM - MUMBAI

    Hands-On Web Scraping with Python

    Copyright © 2019 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    Commissioning Editor: Sunith Shetty

    Acquisition Editor: Aniruddha Patil

    Content Development Editor: Roshan Kumar

    Senior Editor: Ayaan Hoda

    Technical Editor: Sushmeeta Jena

    Copy Editor: Safis Editing

    Project Coordinator: Namrata Swetta

    Proofreader: Safis Editing

    Indexer: Tejal Daruwale Soni

    Production Designer: Alishon Mendonsa

    First published: June 2019

    Production reference: 2120619

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham

    B3 2PB, UK.

    ISBN 978-1-78953-339-2

    www.packtpub.com

    To my daughter, Aasira, and my family and friends. Special thanks to Ashish Chapagain,

    Peter, and Prof. W.J. Teahan. This book is dedicated to you all.

    Packt.com

    Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

    Why subscribe?

    Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

    Improve your learning with Skill Plans built especially for you

    Get a free eBook or video every month

    Fully searchable for easy access to vital information

    Copy and paste, print, and bookmark content

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

    At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

    Contributors

    About the author

    Anish Chapagain is a software engineer with a passion for data science, its processes, and Python programming, which began around 2007. He has been working with web scraping and analysis-related tasks for more than 5 years, and is currently pursuing freelance projects in the web scraping domain. Anish previously worked as a trainer, web/software developer, and as a banker, where he was exposed to data and gained further insights into topics including data analysis, visualization, data mining, information processing, and knowledge discovery. He has an MSc in computer systems from Bangor University (University of Wales), United Kingdom, and an Executive MBA from Himalayan Whitehouse International College, Kathmandu, Nepal.

    About the reviewers

    Radhika Datar has more than 5 years' experience in software development and content writing. She is well versed in frameworks such as Python, PHP, and Java, and regularly provides training on them. She has been working with Educba and Eduonix as a training consultant since June 2016, while also working as a freelance academic writer in data science and data analytics. She obtained her master's degree from the Symbiosis Institute of Computer Studies and Research and her bachelor's degree from K. J. Somaiya College of Science and Commerce.

    Rohit Negi completed his bachelor of technology in computer science from Uttarakhand Technical University, Dehradun. His bachelor's curriculum included a specialization in computer science and applied engineering. Currently, he is working as a senior test consultant at Orbit Technologies and provides test automation solutions to LAM Research (USA clients). He has extensive quality assurance proficiency working with the following tools: Microsoft Azure VSTS, Selenium, Cucumber/BDD, MS SQL/MySQL, Java, and web scraping using Selenium. Additionally, he has a good working knowledge of how to automate workflows using Selenium, Protractor for AngularJS-based applications, Python for exploratory data analysis, and machine learning.

    Packt is searching for authors like you

    If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

    Table of Contents

    Title Page

    Copyright and Credits

    Hands-On Web Scraping with Python

    Dedication

    About Packt

    Why subscribe?

    Contributors

    About the author

    About the reviewers

    Packt is searching for authors like you

    Preface

    Who this book is for

    What this book covers

    To get the most out of this book

    Download the example code files

    Download the color images

    Conventions used

    Get in touch

    Reviews

    Section 1: Introduction to Web Scraping

    Web Scraping Fundamentals

    Introduction to web scraping

    Understanding web development and technologies

    HTTP

    HTML 

    HTML elements and attributes

    Global attributes

    XML

    JavaScript

    JSON

    CSS

    AngularJS

    Data finding techniques for the web

    HTML page source

    Case 1

    Case 2

    Developer tools

    Sitemaps

    The robots.txt file

    Summary

    Further reading

    Section 2: Beginning Web Scraping

    Python and the Web – Using urllib and Requests

    Technical requirements

    Accessing the web with Python

    Setting things up

    Loading URLs

    URL handling and operations with urllib and requests

    urllib

    requests

    Implementing HTTP methods

    GET

    POST

    Summary

    Further reading

    Using LXML, XPath, and CSS Selectors

    Technical requirements

    Introduction to XPath and CSS selector

    XPath

    CSS selectors

    Element selectors

    ID and class selectors

    Attribute selectors

    Pseudo selectors

    Using web browser developer tools for accessing web content

    HTML elements and DOM navigation

    XPath and CSS selectors using DevTools

    Scraping using lxml, a Python library

    lxml by examples

    Example 1 – reading XML from file and traversing through its elements

    Example 2 – reading HTML documents using lxml.html

    Example 3 – reading and parsing HTML for retrieving HTML form type element attributes

    Web scraping using lxml

    Example 1 – extracting selected data from a single page using lxml.html.xpath

    Example 2 – looping with XPath and scraping data from multiple pages

    Example 3 – using lxml.cssselect to scrape content from a page

    Summary

    Further reading

    Scraping Using pyquery – a Python Library

    Technical requirements

    Introduction to pyquery

    Exploring pyquery

    Loading documents

    Element traversing, attributes, and pseudo-classes

    Iterating

    Web scraping using pyquery

    Example 1 – scraping data science announcements

    Example 2 – scraping information from nested links

    Example 3 – extracting AHL Playoff results

    Example 4 – collecting URLs from sitemap.xml

    Case 1 – using the HTML parser

    Case 2 – using the XML parser

    Summary

    Further reading

    Web Scraping Using Scrapy and Beautiful Soup

    Technical requirements

    Web scraping using Beautiful Soup

    Introduction to Beautiful Soup

    Exploring Beautiful Soup

    Searching, traversing, and iterating

    Using children and parents

    Using next and previous

    Using CSS Selectors

    Example 1 – listing <li> elements with the data-id attribute 

    Example 2 – traversing through elements

    Example 3 – searching elements based on attribute values

    Building a web crawler

    Web scraping using Scrapy

    Introduction to Scrapy

    Setting up a project

    Generating a Spider

    Creating an item

    Extracting data

    Using XPath

    Using CSS Selectors

    Data from multiple pages

    Running and exporting

    Deploying a web crawler

    Summary

    Further reading

    Section 3: Advanced Concepts

    Working with Secure Web

    Technical requirements

    Introduction to secure web

    Form processing

    Cookies and sessions

    Cookies

    Sessions

    User authentication

    HTML <form> processing

    Handling user authentication

    Working with cookies and sessions

    Summary

    Further reading

    Data Extraction Using Web-Based APIs

    Technical requirements

    Introduction to web APIs

    REST and SOAP

    REST 

    SOAP 

    Benefits of web APIs

    Accessing web API and data formats

    Making requests to the web API using a web browser

    Case 1 – accessing a simple API (request and response)

    Case 2 – demonstrating status codes and informative responses from the API

    Case 3 – demonstrating RESTful API cache functionality

    Web scraping using APIs

    Example 1 – searching and collecting university names and URLs

    Example 2 – scraping information from GitHub events

    Summary

    Further reading

    Using Selenium to Scrape the Web

    Technical requirements

    Introduction to Selenium

    Selenium projects

    Selenium WebDriver

    Selenium RC

    Selenium Grid

    Selenium IDE

    Setting things up

    Exploring Selenium

    Accessing browser properties

    Locating web elements

    Using Selenium for web scraping

    Example 1 – scraping product information

    Example 2 – scraping book information

    Summary

    Further reading

    Using Regex to Extract Data

    Technical requirements

    Overview of regular expressions

    Regular expressions and Python

    Using regular expressions to extract data

    Example 1 – extracting HTML-based content

    Example 2 – extracting dealer locations

    Example 3 – extracting XML content

    Summary

    Further reading

    Section 4: Conclusion

    Next Steps

    Technical requirements

    Managing scraped data

    Writing to files

    Analysis and visualization using pandas and matplotlib

    Machine learning 

    ML and AI

    Python and ML

    Types of ML algorithms

    Supervised learning

    Classification

    Regression

    Unsupervised learning

    Association

    Clustering

    Reinforcement learning

    Data mining 

    Tasks of data mining

    Predictive

    Classification

    Regression

    Prediction 

    Descriptive

    Clustering

    Summarization

    Association rules

    What's next?

    Summary 

    Further reading

    Other Books You May Enjoy

    Leave a review - let other readers know what you think

    Preface

    Web scraping is an essential technique used in many organizations to scrape valuable data from web pages. Web scraping, or web harvesting, is done with a view to extracting and collecting data from websites. It comes in handy for model development, which requires data to be collected on the fly. It is also useful when current, topic-relevant data is needed and short-term accuracy matters more than working from prebuilt datasets. Collected data is stored in files in formats including JSON, CSV, and XML, written to a database for later use, or made available online as datasets. This book will open the gates for you in terms of delving deep into web scraping techniques and methodologies using Python libraries and other popular tools, such as Selenium. By the end of this book, you will have learned how to efficiently scrape different websites.

    Who this book is for

    This book is intended for Python programmers, data analysts, web scraping newbies, and anyone who wants to learn how to perform web scraping from scratch. If you want to begin your journey in applying web scraping techniques to a range of web pages, then this book is what you need!

    What this book covers

    Chapter 1, Web Scraping Fundamentals, explores some core technologies and tools that are relevant to WWW and that are required for web scraping.

    Chapter 2, Python and the Web – Using urllib and Requests, demonstrates some of the core features available through Python libraries such as requests and urllib, in addition to exploring page contents in various formats and structures.

    Chapter 3, Using LXML, XPath, and CSS Selectors, describes various examples using LXML, implementing a variety of techniques and library features to deal with elements and ElementTree. 

    Chapter 4, Scraping Using pyquery – a Python Library, goes into more detail regarding web scraping techniques and a number of new Python libraries that deploy these techniques.

    Chapter 5, Web Scraping Using Scrapy and Beautiful Soup, examines various aspects of traversing web documents using Beautiful Soup, while also exploring a framework that was built for crawling activities using spiders, in other words, Scrapy.

    Chapter 6, Working with Secure Web, covers a number of basic security-related measures and techniques that are often encountered and that pose a challenge to web scraping.

    Chapter 7, Data Extraction Using Web-Based APIs, covers the Python programming language and how to interact with the web APIs with regard to data extraction.

    Chapter 8, Using Selenium to Scrape the Web, covers Selenium and how to use it to scrape data from the web.

    Chapter 9, Using Regex to Extract Data, goes into more detail regarding web scraping techniques using regular expressions.

    Chapter 10, Next Steps, introduces and examines basic concepts regarding data management using files, and analysis and visualization using pandas and matplotlib, while also providing an introduction to machine learning and data mining and exploring a number of related resources that can be helpful in terms of further learning and career development. 

    To get the most out of this book

    Readers should have some working knowledge of the Python programming language.

    Download the example code files

    You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

    You can download the code files by following these steps:

    Log in or register at www.packt.com.

    Select the SUPPORT tab.

    Click on Code Downloads & Errata.

    Enter the name of the book in the Search box and follow the onscreen instructions.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR/7-Zip for Windows

    Zipeg/iZip/UnRarX for Mac

    7-Zip/PeaZip for Linux

    The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Web-Scraping-with-Python. In case there's an update to the code, it will be updated on the existing GitHub repository.

    We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Download the color images

    We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://www.packtpub.com/sites/default/files/downloads/9781789533392_ColorImages.pdf.

    Conventions used

    There are a number of text conventions used throughout this book.

    CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: The 

     and 

     HTML elements contain general text information (element content) with them.

    A block of code is set as follows:

    import requests
    link="http://localhost:8080/~cache"
    queries= {'id':'123456','display':'yes'}
    addedheaders={'user-agent':''}

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    import requests
    link="http://localhost:8080/~cache"
    queries= {'id':'123456','display':'yes'}
    addedheaders={'user-agent':''}

    Any command-line input or output is written as follows:

    C:\> pip --version

    pip 18.1 from c:\python37\lib\site-packages\pip (python 3.7)

    Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: If accessing Developer tools through the Chrome menu, click More tools | Developer tools.

    Warnings or important notes appear like this.

    Tips and tricks appear like this.

    Get in touch

    Feedback from our readers is always welcome.

    General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

    Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

    Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

    If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

    Reviews

    Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

    For more information about Packt, please visit packt.com.

    Section 1: Introduction to Web Scraping

    In this section, you will be given an overview of web scraping (scraping requirements, the importance of data), web contents (patterns and layouts), Python programming and libraries (the basics and advanced), and data managing techniques (file handling and databases).

    This section consists of the following chapter: 

    Chapter 1, Web Scraping Fundamentals

    Web Scraping Fundamentals

    In this chapter, we will learn about and explore certain fundamental concepts related to web scraping and web-based technologies, assuming that you have no prior experience of web scraping. 

    So, to start with, let's begin by asking a number of questions: 

    Why is there a growing need or demand for data?

    How are we going to manage and fulfill the requirement for data with resources from the World Wide Web (WWW)?

    Web scraping addresses both these questions, as it provides various tools and technologies that can be deployed to extract data or assist with information retrieval. Whether it's web-based structured or unstructured data, we can use the web scraping process to extract data and use it for research, analysis, personal collections, information extraction, knowledge discovery, and many more purposes.

    We will learn general techniques that are deployed to find data from the web and explore those techniques in depth using the Python programming language in the chapters ahead.

    In this chapter, we will cover the following topics:

    Introduction to web scraping

    Understanding web development and technologies

    Data finding techniques

    Introduction to web scraping

    Scraping is the process of extracting, copying, screening, or collecting data. Scraping or extracting data from the web (that is, from websites, web pages, or other internet-related resources) is normally termed web scraping.

    Web scraping is a process of extracting data from the web to suit particular requirements. Data collection and analysis, their role in information gathering and decision making, and research-related activities make the scraping process important for all types of industry.
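    To make the idea of extraction concrete, here is a minimal sketch that pulls link targets and link text out of an HTML string using only Python's standard library html.parser module. The LinkCollector class, the extract_links function, and the sample markup are illustrative assumptions and are not taken from the book's code bundle; a real scraper would first download the page rather than use a hardcoded string.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value and visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []            # list of (href, text) tuples
        self._current_href = None  # href of the <a> currently being read
        self._text_parts = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> element with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


# Hypothetical page content used purely for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <a href="/books/1">First Book</a>
  <a href="/books/2">Second Book</a>
</body></html>
"""


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

    The libraries listed in the table of contents, such as lxml, pyquery, and Beautiful Soup, offer far more convenient selectors for the same task; the standard library is used here only to keep the sketch dependency-free.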

    The popularity of the internet and its resources is causing information domains to evolve every day, which is also causing a growing demand for raw data. Data is the basic requirement in the fields of science, technology, and management. Collected or organized data is processed with varying degrees of logic to obtain information and gain further insights.

    Web scraping provides the tools and techniques used to collect data from websites as appropriate for either personal or business-related needs, but with a number of legal considerations. 

    There are a number of legal factors to consider before performing scraping tasks. Most websites contain pages such as Privacy Policy, About Us, and Terms and Conditions, where legal terms, prohibited content policies, and general information are available. It's a developer's ethical duty to review and follow those policies before planning any crawling and scraping activities on a website.
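    One machine-readable complement to those policy pages is a site's robots.txt file, which this chapter's outline lists as a later topic. The sketch below shows one possible way to check a URL against robots.txt rules with the standard library's urllib.robotparser. The rules, the example-bot user agent, and the example.com URLs are all hypothetical, and the rules are parsed from a string so that nothing is fetched over the network. Passing this check does not replace reading a site's terms and conditions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would load the site's own file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Create a RobotFileParser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


rules = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(rules, "example-bot", "https://example.com/books/1"))
print(is_allowed(rules, "example-bot", "https://example.com/private/data"))
```

    Against a live site, the same parser can be pointed at the real file with set_url() followed by read() instead of parse().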

    The terms scraping and crawling are used interchangeably throughout the chapters in this book.
