
Data Acquisition

Scraping as a method of acquiring data


There are two main ways to extract data from a website:
1. Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction.
2. Use the API of the website (if it exists). For example:
• Google search results can be queried through SerpApi (https://serpapi.com/).
• Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook.
1st technique: web scraping, also called web harvesting or web data extraction.
Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use the third-party Python HTTP library requests.
2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most HTML data is nested, we cannot extract it simply through string processing; we need a parser that can build a nested/tree structure from the HTML. There are many HTML parser libraries available; one of the most standards-compliant is html5lib.
3. Now all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup, which is designed for pulling data out of HTML and XML files.
Retrieving the page
#!pip install requests

import requests

URL = "https://www.bbc.com/"

# Send an HTTP GET request; the response object carries the status code and the page content
r = requests.get(URL)

print(r)   # e.g. <Response [200]>
Note that there are many more status codes. Some common ones:

200 – ‘OK’
400 – ‘Bad Request’ is sent when the server cannot understand the request sent by the client. Generally, this indicates malformed request syntax, invalid request message framing, etc.
401 – ‘Unauthorized’ is sent whenever fulfilling the request requires supplying valid credentials.
403 – ‘Forbidden’ means that the server understood the request but will not fulfil it. In cases where credentials were provided, 403 means that the account in question does not have sufficient permissions to view the content.
404 – ‘Not Found’ means that the server found no content matching the Request-URI. Sometimes 404 is used to mask 403 responses when the server does not want to reveal reasons for refusing the request.
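A small, hedged sketch (not in the original slides): the status code can be checked before any parsing is attempted; raise_for_status() is the requests helper that turns 4xx/5xx responses into exceptions.

if r.status_code == 200:
    html = r.text            # safe to hand over to the parser
elif r.status_code == 404:
    print("Page not found")
else:
    r.raise_for_status()     # raise an HTTPError for other 4xx/5xx codes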
print(r.content)
Beautiful Soup Library

#!pip install bs4 html5lib

from bs4 import BeautifulSoup

# Parse the HTML content into a nested/tree structure using the html5lib parser
soup = BeautifulSoup(r.content, 'html5lib')
print(soup)
# Select every <img> tag that appears inside a <div>
images = soup.select('div img')
print(images)

# Take the URL of the first image and download it to a file
images_url = images[0]['src']
print(images_url)
img_data = requests.get(images_url).content
with open('img1.jpg', 'wb') as handler:
    handler.write(img_data)
Let us discuss how to loop over all of them; one possible sketch follows.
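A minimal sketch of such a loop (not in the original slides). The img{i}.jpg naming scheme and the skipping of non-absolute URLs are assumptions:

# Loop over every matched <img> tag and save each image to its own file
for i, img in enumerate(images):
    src = img.get('src')
    # Skip tags with no src, and skip relative or data: URLs that requests cannot fetch directly
    if not src or not src.startswith(('http://', 'https://')):
        continue
    img_data = requests.get(src).content
    with open(f'img{i}.jpg', 'wb') as handler:
        handler.write(img_data)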
2nd technique: Use the API of the website.
For example: SerpApi (https://serpapi.com/)

To use this API library, you have to create an account and get your own "api_key".

params = {
    "q": "Coffee", "location": "Egypt", "hl": "en", "gl": "us",
    "engine": "google", "google_domain": "google.com",
    "api_key": "………."}
# Request the Google Images page, passing the query parameters
html = requests.get("http://www.google.com/images", params=params)

# Parse the returned HTML, this time with the lxml parser
soup2 = BeautifulSoup(html.text, 'lxml')
print(soup2)

images = soup2.select('div img')
print(len(images))
print(images)

# Download the tenth matched image
images_url = images[9]['src']
img_data = requests.get(images_url).content
with open('pic.jpg', 'wb') as handler:
    handler.write(img_data)
Let us discuss how to loop over all of them; a sketch that also handles inline thumbnails follows.
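Again a hedged sketch (not in the original slides): on the Google Images page some src attributes may be inline data: URIs (base64-encoded thumbnails) rather than http(s) URLs, so the loop below handles both cases. The pic{i}.jpg naming scheme is an assumption:

import base64

# Save every thumbnail: decode inline data: URIs, download http(s) URLs, skip everything else
for i, img in enumerate(images):
    src = img.get('src', '')
    if src.startswith('data:image'):
        # data:image/jpeg;base64,<payload>  ->  raw bytes
        img_data = base64.b64decode(src.split(',', 1)[1])
    elif src.startswith(('http://', 'https://')):
        img_data = requests.get(src).content
    else:
        continue
    with open(f'pic{i}.jpg', 'wb') as handler:
        handler.write(img_data)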
SerpApi Example (2)

(For your assignment sheet)

GoogleSearch Example
#!pip install google-search-results

from serpapi import GoogleSearch

params = {
    "q": "world cup",
    "hl": "en",
    "api_key": "…………………..",
}

search = GoogleSearch(params)
results = search.get_dict()

print(results)

# The ordinary (non-ad) search results live under "organic_results"
res = results["organic_results"]
print(res)

# Print the link of every organic result
for i in range(len(res)):
    print(res[i]["link"])
Google Scholar Example

params = {
    "engine": "google_scholar",
    "q": "Guido Burkard",
    "api_key": "……..",
}

search = GoogleSearch(params)
results = search.get_dict()
print(results)

organic_results = results["organic_results"]

# Inspect the inline links (cited by, related articles, versions, ...) of each result
for i in range(len(organic_results)):
    print(organic_results[i]["inline_links"])

# Print the citation count for results that expose a "cited_by" entry
for i in range(len(organic_results)):
    if "cited_by" in organic_results[i]["inline_links"]:
        print(organic_results[i]["inline_links"]["cited_by"]["total"])

# Print the title of every result
for i in range(len(organic_results)):
    print(organic_results[i]["title"])
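A final hedged sketch (not in the original slides): the same fields can be gathered into one record per result, for example to rank papers by citation count. The record keys and the use of .get() for possibly missing fields are assumptions about this particular response:

# Build one record per result: title, link, and citation count (0 when absent)
records = []
for result in organic_results:
    cited_by = result["inline_links"].get("cited_by", {})
    records.append({
        "title": result["title"],
        "link": result.get("link"),
        "citations": cited_by.get("total", 0),
    })

# Most-cited papers first
records.sort(key=lambda rec: rec["citations"], reverse=True)
for rec in records:
    print(rec["citations"], "-", rec["title"])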
