
Data Acquisition

Scraping as a method of acquiring data


There are two main ways to extract data from a website:
1. Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction.
2. Use the API of the website (if it exists). For example:
• Google search results can be queried through SerpApi (https://serpapi.com/).
• Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook.
1st technique: web scraping, also called web harvesting or web data extraction.
Steps involved in web scraping:

1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use the third-party Python HTTP library requests.
2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most HTML data is nested, we cannot extract it simply through string processing; we need a parser that can build a nested/tree structure from the HTML. There are many HTML parser libraries available; one of the most standards-compliant is html5lib.
3. Now all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup, which is designed for pulling data out of HTML and XML files.
Retrieving the page
#!pip install requests

import requests

URL = "https://www.bbc.com/"

# Send an HTTP GET request; the response object carries the status code and the page content
r = requests.get(URL)

print(r)   # e.g. <Response [200]>
Note that there are many more status codes. Some common ones:

200 – ‘OK’
400 – ‘Bad Request’ is sent when the server cannot understand the request sent by the client. Generally, this indicates malformed request syntax, invalid request message framing, etc.
401 – ‘Unauthorized’ is sent whenever fulfilling the request requires supplying valid credentials.
403 – ‘Forbidden’ means that the server understood the request but will not fulfil it. In cases where credentials were provided, 403 means that the account in question does not have sufficient permissions to view the content.
404 – ‘Not Found’ means that the server found no content matching the Request-URI. Sometimes 404 is used to mask 403 responses when the server does not want to reveal reasons for refusing the request.
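A small, hedged sketch (not in the original slides): the status code can be checked before any parsing is attempted; raise_for_status() is the requests helper that turns 4xx/5xx responses into exceptions.

if r.status_code == 200:
    html = r.text            # safe to hand over to the parser
elif r.status_code == 404:
    print("Page not found")
else:
    r.raise_for_status()     # raise an HTTPError for other 4xx/5xx codes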
print(r.content)
Beautiful Soup Library

#!pip install bs4 html5lib

from bs4 import BeautifulSoup

# Parse the HTML content into a nested/tree structure using the html5lib parser
soup = BeautifulSoup(r.content, 'html5lib')
print(soup)
# Select every <img> tag that appears inside a <div>
images = soup.select('div img')
print(images)

# Take the URL of the first image and download it to a file
images_url = images[0]['src']
print(images_url)
img_data = requests.get(images_url).content
with open('img1.jpg', 'wb') as handler:
    handler.write(img_data)
Let us discuss how to loop over all of them; one possible sketch follows.
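A minimal sketch of such a loop (not in the original slides). The img{i}.jpg naming scheme and the skipping of non-absolute URLs are assumptions:

# Loop over every matched <img> tag and save each image to its own file
for i, img in enumerate(images):
    src = img.get('src')
    # Skip tags with no src, and skip relative or data: URLs that requests cannot fetch directly
    if not src or not src.startswith(('http://', 'https://')):
        continue
    img_data = requests.get(src).content
    with open(f'img{i}.jpg', 'wb') as handler:
        handler.write(img_data)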
2nd technique: Use the API of the website.
For example: SerpApi (https://serpapi.com/)

To use this API library, you have to create an account and get your own "api_key".

params = {
    "q": "Coffee", "location": "Egypt", "hl": "en", "gl": "us",
    "engine": "google", "google_domain": "google.com",
    "api_key": "………."}
# Request the Google Images page, passing the query parameters
html = requests.get("http://www.google.com/images", params=params)

# Parse the returned HTML, this time with the lxml parser
soup2 = BeautifulSoup(html.text, 'lxml')
print(soup2)

images = soup2.select('div img')
print(len(images))
print(images)

# Download the tenth matched image
images_url = images[9]['src']
img_data = requests.get(images_url).content
with open('pic.jpg', 'wb') as handler:
    handler.write(img_data)
Let us discuss how to loop over all of them; a sketch that also handles inline thumbnails follows.
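Again a hedged sketch (not in the original slides): on the Google Images page some src attributes may be inline data: URIs (base64-encoded thumbnails) rather than http(s) URLs, so the loop below handles both cases. The pic{i}.jpg naming scheme is an assumption:

import base64

# Save every thumbnail: decode inline data: URIs, download http(s) URLs, skip everything else
for i, img in enumerate(images):
    src = img.get('src', '')
    if src.startswith('data:image'):
        # data:image/jpeg;base64,<payload>  ->  raw bytes
        img_data = base64.b64decode(src.split(',', 1)[1])
    elif src.startswith(('http://', 'https://')):
        img_data = requests.get(src).content
    else:
        continue
    with open(f'pic{i}.jpg', 'wb') as handler:
        handler.write(img_data)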
SerpApi Example (2)

(For your assignment sheet)

GoogleSearch Example
#!pip install google-search-results

from serpapi import GoogleSearch

params = {
    "q": "world cup",
    "hl": "en",
    "api_key": "…………………..",
}

search = GoogleSearch(params)
results = search.get_dict()

print(results)

# The ordinary (non-ad) search results live under "organic_results"
res = results["organic_results"]
print(res)

# Print the link of every organic result
for i in range(len(res)):
    print(res[i]["link"])
Google Scholar Example

params = {
    "engine": "google_scholar",
    "q": "Guido Burkard",
    "api_key": "……..",
}

search = GoogleSearch(params)
results = search.get_dict()
print(results)

organic_results = results["organic_results"]

# Inspect the inline links (cited by, related articles, versions, ...) of each result
for i in range(len(organic_results)):
    print(organic_results[i]["inline_links"])

# Print the citation count for results that expose a "cited_by" entry
for i in range(len(organic_results)):
    if "cited_by" in organic_results[i]["inline_links"]:
        print(organic_results[i]["inline_links"]["cited_by"]["total"])

# Print the title of every result
for i in range(len(organic_results)):
    print(organic_results[i]["title"])
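A final hedged sketch (not in the original slides): the same fields can be gathered into one record per result, for example to rank papers by citation count. The record keys and the use of .get() for possibly missing fields are assumptions about this particular response:

# Build one record per result: title, link, and citation count (0 when absent)
records = []
for result in organic_results:
    cited_by = result["inline_links"].get("cited_by", {})
    records.append({
        "title": result["title"],
        "link": result.get("link"),
        "citations": cited_by.get("total", 0),
    })

# Most-cited papers first
records.sort(key=lambda rec: rec["citations"], reverse=True)
for rec in records:
    print(rec["citations"], "-", rec["title"])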
