Web Scraping Cheat Sheet 2.0

This document provides a cheat sheet on web scraping using Beautiful Soup and Selenium. It covers HTML basics, Beautiful Soup workflow including making requests, parsing responses and finding elements, and XPath syntax and functions for locating elements.

Uploaded by

Debajyoti Sahoo

Web Scraping

Cheat Sheet

BS4 | Selenium | Scrapy

Frank Andrade
Web Scraping Cheat Sheet

Web Scraping is the process of extracting data from a website. Before studying Beautiful Soup and Selenium, it's good to review some HTML basics first.

HTML for Web Scraping

Let's take a look at the HTML element syntax.

    <h1 class="title"> Titanic (1997) </h1>

    Tag name:          h1
    Attribute name:    class
    Attribute value:   "title"
    Attribute:         class="title"
    Affected content:  Titanic (1997)
    End tag:           </h1>
    HTML element:      the whole line, from the opening tag to the end tag

This is a single HTML element, but the HTML code behind a website has hundreds of them.

HTML code example

    <article class="main-article">
    <h1> Titanic (1997) </h1>
    <p class="plot"> 84 years later ... </p>
    <div class="full-script"> 13 meters. You ... </div>
    </article>

The HTML code is structured with "nodes". Each entry below represents a node (element, attribute, or text node).

    <article>                        Root element (parent node)
        class="main-article"         Attribute
        <h1>                         Element
            Titanic (1997)           Text
        <p>                          Element
            class="plot"             Attribute
            84 years later ...       Text
        <div>                        Element
            class="full-script"      Attribute
            13 meters. You ...       Text

"Siblings" are nodes with the same parent. Here <h1>, <p>, and <div> are siblings.

Beginners are advised to find elements by ID and, if an element has no ID, to build an XPath.

Beautiful Soup

Workflow

Importing the libraries
    from bs4 import BeautifulSoup
    import requests

Fetch the pages
    result = requests.get("https://www.google.com")  # the URL must include the scheme
    result.status_code  # get status code
    result.headers      # get the headers

Page content
    content = result.text

Create soup
    soup = BeautifulSoup(content, "lxml")

HTML in a readable format
    print(soup.prettify())

Find an element
    soup.find(id="specific_id")

Find elements
    soup.find_all("a")
    soup.find_all("a", "css_class")
    soup.find_all("a", class_="my_class")
    soup.find_all("a", attrs={"class": "my_class"})

Get inner text
    sample = element.get_text()
    sample = element.get_text(strip=True, separator=' ')

Get specific attributes
    sample = element.get('href')

Here are my guides/tutorials and courses
- Medium Guides/YouTube Tutorials
- Web Scraping Course
- Data Science Course
- Automation Course
- Make Money Using Programming Skills

Made by Frank Andrade, frank-andrade.medium.com

XPath

We need to learn XPath to scrape with Selenium or Scrapy.

XPath Syntax

An XPath usually contains a tag name, attribute name, and attribute value.

    //tagName[@AttributeName="Value"]

Let's check some examples to locate the article, title, and transcript elements of the HTML code we used before.

    //article[@class="main-article"]
    //h1
    //div[@class="full-script"]

XPath Functions and Operators

XPath functions
    //tag[contains(@AttributeName, "Value")]

XPath Operators: and, or
    //tag[(expression 1) and (expression 2)]

XPath Special Characters

    /     Selects the children from the node set on the left side of this character
    //    Specifies that the matching node set should be located at any level within the document
    .     Specifies the current context should be used (refers to the present node)
    ..    Refers to a parent node
    *     A wildcard character that selects all elements or attributes regardless of names
    @     Selects an attribute
    ()    Groups an XPath expression
    [n]   Indicates that a node with index "n" should be selected
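The XPath expressions above can be tried without a browser. The following is a minimal sketch, not part of the original cheat sheet, that uses Python's standard-library xml.etree.ElementTree on the cheat sheet's HTML example. It rests on two assumptions. First, ElementTree only supports a subset of XPath: paths must start with "." and functions such as contains() are not available. Second, the example is wrapped in <html><body> tags (added here for illustration) so that <article> is not the root node. Selenium and Scrapy accept the full //tag[@attr="value"] form shown in the cheat sheet.

```python
import xml.etree.ElementTree as ET

# The cheat sheet's HTML example, wrapped in <html><body> (an addition made
# for this sketch) so that <article> is a descendant of the root.
HTML = """
<html><body>
<article class="main-article">
<h1> Titanic (1997) </h1>
<p class="plot"> 84 years later ... </p>
<div class="full-script"> 13 meters. You ... </div>
</article>
</body></html>
"""

root = ET.fromstring(HTML.strip())

# ".//" searches at any level below the current node; [@attr='value'] filters by attribute.
article = root.find(".//article[@class='main-article']")
title = root.find(".//h1")
transcript = root.find(".//div[@class='full-script']")

# "*" is the wildcard: every direct child element of <article>, i.e. the siblings.
siblings = [child.tag for child in article.findall("./*")]

# ".." steps up from <h1> to its parent node.
parent_of_title = root.find(".//h1/..")

print(title.text.strip())       # Titanic (1997)
print(transcript.text.strip())  # 13 meters. You ...
print(siblings)                 # ['h1', 'p', 'div']
print(parent_of_title.tag)      # article
```

The same three lookups (article, title, transcript) are the ones the cheat sheet lists; only the leading "." differs because of ElementTree's restricted syntax.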
Selenium 4

Note that there are a few changes between Selenium 3.x versions and Selenium 4.

Import libraries:
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    web = "https://www.google.com"  # the URL must include the scheme
    path = 'introduce chromedriver path'
    service = Service(executable_path=path)     # selenium 4
    driver = webdriver.Chrome(service=service)  # selenium 4
    driver.get(web)

Note:
    driver = webdriver.Chrome(path)  # selenium 3.x

Find an element
    driver.find_element(by="id", value="...")   # selenium 4
    driver.find_element_by_id("write-id-here")  # selenium 3.x

Find elements
    driver.find_elements(by="xpath", value="...")      # selenium 4
    driver.find_elements_by_xpath("write-xpath-here")  # selenium 3.x

Quit driver
    driver.quit()

Getting the text
    data = element.text

Implicit Waits
    import time
    time.sleep(2)              # fixed pause: always blocks for 2 seconds
    driver.implicitly_wait(2)  # Selenium's implicit wait: element lookups keep retrying for up to 2 seconds

Explicit Waits
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 5 seconds until an element is clickable
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, 'id_name')))

Options: Headless mode, change window size
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.headless = True  # deprecated in later Selenium 4 releases; use options.add_argument('--headless') there
    options.add_argument('window-size=1920x1080')
    driver = webdriver.Chrome(service=service, options=options)

Scrapy

Scrapy is a powerful, full-featured web scraping framework for Python, but it's a bit complicated to set up, so check my guide or its documentation to set it up.

Creating a Project and Spider

To create a new project, run the following command in the terminal.
    scrapy startproject my_first_spider

To create a new spider, first change the directory.
    cd my_first_spider

Create a spider
    scrapy genspider example example.com

The Basic Template

When you create a spider, you obtain a template with the following content.

    import scrapy

    class ExampleSpider(scrapy.Spider):       # class
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        def parse(self, response):            # parse method
            pass

The class is built with the data we introduced in the previous command, but the parse method needs to be built by us. To build it, use the functions below.

Finding elements

To find elements in Scrapy, use the response argument from the parse method.
    response.xpath('//tag[@AttributeName="Value"]')

Getting the text

To obtain the text of an element, we use text() and either .get() (first match) or .getall() (all matches). For example:
    response.xpath('//h1/text()').get()
    response.xpath('//tag[@Attribute="Value"]/text()').getall()

Return data extracted

To see the data extracted, we have to use the yield keyword.

    def parse(self, response):
        title = response.xpath('//h1/text()').get()
        # Return data extracted
        yield {'titles': title}

Run the spider and export data to CSV or JSON
    scrapy crawl example
    scrapy crawl example -o name_of_file.csv
    scrapy crawl example -o name_of_file.json
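
To see why the parse method uses yield, it may help to look at the generator pattern on its own. The following is a minimal sketch that does not import Scrapy. The parse function here is a hypothetical stand-in that takes a standard-library ElementTree element instead of a Scrapy response, and PAGE is an illustrative page built from the cheat sheet's HTML example. In a real project, Scrapy's engine plays the role of the caller and writes each yielded item to the CSV or JSON file.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative page (assumption): in Scrapy this would be the downloaded response.
PAGE = "<html><body><article><h1> Titanic (1997) </h1></article></body></html>"


def parse(root):
    """Stand-in for a spider's parse method: yield one dict per extracted item."""
    for h1 in root.iter("h1"):
        yield {"titles": h1.text.strip()}


# The caller consumes the generator item by item, as Scrapy's engine would.
items = list(parse(ET.fromstring(PAGE)))

# Roughly what a JSON export holds: a list with one object per yielded item.
print(json.dumps(items))  # [{"titles": "Titanic (1997)"}]
```

Because parse is a generator, it can yield any number of items per page, which is the reason the template returns data with yield rather than return.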
