1. Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use requests, a third-party HTTP library for Python.
2. Once we have accessed the HTML content, we are left with the task of parsing the data. Since most of the HTML data is nested, we cannot extract it through simple string processing; we need a parser that can build a nested/tree structure out of the HTML. There are many HTML parser libraries available, but the most advanced one is html5lib.
3. Now all we need to do is navigate and search the parse tree that we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup, which pulls data out of HTML and XML files. A minimal end-to-end sketch of these three steps follows this list.
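To make the three steps concrete, here is a minimal sketch; the BBC homepage is used as an example target, and picking the first <h2> tag is purely illustrative, not part of the original example:

#!pip install requests bs4 html5lib
import requests
from bs4 import BeautifulSoup

# Step 1: send the HTTP request and receive the HTML content
page = requests.get("https://www.bbc.com/")

# Step 2: parse the HTML into a tree structure with the html5lib parser
soup = BeautifulSoup(page.content, "html5lib")

# Step 3: traverse the tree, e.g. print the text of the first <h2> tag
first_heading = soup.find("h2")
if first_heading is not None:
    print(first_heading.get_text(strip=True))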
Retrieving a page
#!pip install requests
import requests

URL = "https://www.bbc.com/"
r = requests.get(URL)  # send an HTTP GET request to the page
print(r)               # prints the response status, e.g. <Response [200]>
Note that there are many more status codes; the most common ones are:
200 – ‘OK’
400 – ‘Bad request’ is sent when the server cannot understand the request sent by the client. Generally, this indicates malformed request syntax, invalid request message framing, etc.
401 – ‘Unauthorized’ is sent whenever fulfilling the request requires supplying valid credentials.
403 – ‘Forbidden’ means that the server understood the request but will not fulfil it. In cases where credentials were provided, 403 means that the account in question does not have sufficient permissions to view the content.
404 – ‘Not found’ means that the server found no content matching the Request-URI. Sometimes 404 is used to mask 403 responses when the server does not want to reveal its reasons for refusing the request.
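As a small illustration (a sketch, not part of the original code), the numeric code is available as r.status_code and can be checked before the content is used:

if r.status_code == 200:
    print("OK – the page was retrieved successfully")
elif r.status_code in (401, 403):
    print("Access denied – credentials missing or insufficient")
elif r.status_code == 404:
    print("Not found – no content matches this URL")
else:
    print("Request ended with status", r.status_code)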
print(r.content)  # the raw HTML of the page, as bytes
Beautiful Soup Library
#!pip install bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html5lib')  # parse the downloaded HTML
images = soup.find_all('img')                # collect all <img> tags on the page

images_url = images[0]['src']  # URL of the first image
print(images_url)
img_data = requests.get(images_url).content  # download the image bytes
with open('img1.jpg', 'wb') as handler:
    handler.write(img_data)
Let us discuss how to loop over all of them; a sketch follows.
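Here is a minimal sketch of such a loop, building on the soup and images objects above; resolving each src against the page URL with urljoin is an added safeguard for relative paths, and tags without a usable src are skipped:

from urllib.parse import urljoin

for i, img in enumerate(images):
    src = img.get('src')
    if not src or src.startswith('data:'):
        continue                      # skip missing sources and inline data URIs
    full_url = urljoin(URL, src)      # resolve relative paths against the page URL
    img_data = requests.get(full_url).content
    with open(f'img{i}.jpg', 'wb') as handler:
        handler.write(img_data)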
2nd technique: Use the API of the website.
For example, SerpApi (https://serpapi.com/):
#!pip install google-search-results
from serpapi import GoogleSearch

params = {
    "q": "Coffee", "location": "Egypt", "hl": "en", "gl": "us",
    "engine": "google", "google_domain": "google.com",
    "api_key": "………"}  # your SerpApi key

search = GoogleSearch(params)
results = search.get_dict()
print(results)

res = results["organic_results"]
print(res)
for i in range(len(res)):
    print(res[i]["link"])
Google Scholar Example
params = {
    "engine": "google_scholar",
    "q": "Guido Burkard",
    "api_key": "……..",
}
search = GoogleSearch(params)
results = search.get_dict()
print(results)

organic_results = results["organic_results"]

# Inline links (cited-by, versions, related articles) for each result
for i in range(len(organic_results)):
    print(organic_results[i]["inline_links"])

# Citation counts, where available
for i in range(len(organic_results)):
    if "cited_by" in organic_results[i]["inline_links"]:
        print(organic_results[i]["inline_links"]["cited_by"]["total"])

# Result titles
for i in range(len(organic_results)):
    print(organic_results[i]["title"])