Python Web Scraping - Data Processing
In earlier chapters, we learned about extracting data from web pages, that is, web scraping, with various Python modules. In this chapter, let us look into various techniques to process the data that has been scraped.
Introduction
To process the data that has been scraped, we must store the data on our local machine in a particular format like a spreadsheet (CSV), JSON, or sometimes in databases like MySQL.
CSV and JSON data processing
First, we are going to write the information grabbed from a web page into a CSV file. We will grab the information using the BeautifulSoup module, as done earlier, and then use the Python csv module to write that textual information into the CSV file. We begin by importing the necessary Python libraries −
import requests
from bs4 import BeautifulSoup
import csv
In the following line of code, we use requests to make a GET HTTP request for the URL https://authoraditiagarwal.com/ −
r = requests.get('https://authoraditiagarwal.com/')
Now, with the help of the next lines of code, we will write the grabbed data into a CSV file named dataprocessing.csv.
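Continuing the script above, a minimal sketch of these lines, assuming we save only the title of the page under an illustrative column heading, could look like this −
soup = BeautifulSoup(r.text, 'lxml')
with open('dataprocessing.csv', 'w', newline='') as csv_file:
   writer = csv.writer(csv_file)
   # Write a header row followed by the scraped page title
   writer.writerow(['Title'])
   writer.writerow([soup.title.text])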
After running this script, the textual information, that is, the title of the webpage, will be saved in the above-mentioned CSV file on your local machine.
Similarly, we can save the collected information in a JSON file. The following is an easy-to-understand Python script for doing the same, in which we grab the same information as in the last Python script, but this time the grabbed information is saved in JSONFile.txt by using the json Python module.
import requests
from bs4 import BeautifulSoup
import json

r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
# Grab the page title and write it out as JSON
y = soup.title.text
with open('JSONFile.txt', 'wt') as outfile:
   json.dump(y, outfile)
After running this script, the grabbed information, that is, the title of the webpage, will be saved in the above-mentioned text file on your local machine.
Data processing using AWS S3
Sometimes we may want to store the scraped data in cloud storage such as Amazon S3 (AWS S3), rather than only on the local machine. With the help of the following steps, we can scrape data and upload it to an AWS S3 bucket −
Step 1 − First, we need an AWS account, which will provide us the secret keys to use in our Python script while storing the data. It will also let us create an S3 bucket in which we can store our data.
Step 2 − Next, we need to install the boto3 Python library for accessing the S3 bucket. It can be installed with the help of the following command −
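With pip, for example −
pip install boto3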
Step 3 − Next, we can use the following Python script for scraping data from a web page and saving it to an AWS S3 bucket.
First, we need to import the Python libraries for scraping; here we are working with requests, and boto3 for saving data to the S3 bucket.
import requests
import boto3
s3 = boto3.client('s3')
bucket_name = "our-content"
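The rest of such a script is only sketched here; assuming the bucket name is available and your AWS credentials are already configured, it could fetch the page and upload it as an object (the key name page.html is just an illustrative choice) −
# Fetch the page text that we want to store
data = requests.get('https://authoraditiagarwal.com/').text
# Create the bucket (in the default region) and upload the page as an object
s3.create_bucket(Bucket=bucket_name)
s3.put_object(Bucket=bucket_name, Key='page.html', Body=data.encode('utf-8'))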
Now you can check the bucket named our-content from your AWS account.
Data processing using MySQL
Let us learn how to process data using MySQL. If you want to learn about MySQL, then you can follow the link
https://www.tutorialspoint.com/mysql/.
With the help of the following steps, we can scrape and process data into a MySQL table −
Step 1 − First, by using MySQL, we need to create a database and a table in which we want to save our scraped data. For example, we are creating the table with the following query −
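A plausible version of this query, assuming a database named Scrap and a table named Scrap_pages with title and content columns as used by the script in Step 4, is −
CREATE DATABASE Scrap;
USE Scrap;
CREATE TABLE Scrap_pages (
   id BIGINT(7) NOT NULL AUTO_INCREMENT,
   title VARCHAR(200),
   content VARCHAR(10000),
   created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
   PRIMARY KEY(id)
);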
Step 2 − Next, we need to deal with Unicode. Note that older MySQL versions do not handle the full Unicode range by default. We need to turn on this feature with the help of the following commands, which will change the default character set for the database, for the table and for both of the columns −
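Assuming the Scrap database and Scrap_pages table from Step 1, these commands would typically be −
ALTER DATABASE Scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;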
Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL, which can be installed with the help of the following command −
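With pip, for example −
pip install PyMySQL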
Step 4 − Now, our database named Scrap, created earlier, is ready to receive the data scraped from the web and save it into the table named Scrap_pages. In our example, we are going to scrape data from Wikipedia and save it into our database. Note that the database connection details and the store() helper in the script below use placeholder values; adjust them to match your own MySQL setup −
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql
import random
import re

# Placeholder connection details: change the host, user and password to match your MySQL setup
conn = pymysql.connect(host='127.0.0.1', user='root', password='', db='Scrap', charset='utf8mb4')
cur = conn.cursor()

def store(title, content):
   # Save one scraped page into the Scrap_pages table
   cur.execute('INSERT INTO Scrap_pages (title, content) VALUES (%s, %s)', (title, content))
   conn.commit()

def getLinks(articleUrl):
   html = urlopen('http://en.wikipedia.org' + articleUrl)
   bs = BeautifulSoup(html, 'html.parser')
   title = bs.find('h1').get_text()
   content = bs.find('div', {'id': 'mw-content-text'}).find('p').get_text()
   store(title, content)
   # Return all internal /wiki/ links from the article body
   return bs.find('div', {'id': 'bodyContent'}).findAll('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
try:
   while len(links) > 0:
      newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
      print(newArticle)
      links = getLinks(newArticle)
finally:
   cur.close()
   conn.close()
This will save the data gathered from Wikipedia into the table named Scrap_pages. If you are familiar with MySQL and web scraping, then the above code would not be tough to understand.
Data processing using PostgreSQL
The process of handling scraped data with PostgreSQL is similar to that of MySQL. If you are not familiar with PostgreSQL, then you can learn it at https://www.tutorialspoint.com/postgresql/. And with the help of the following command we can install the psycopg2 Python library −
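With pip, for example −
pip install psycopg2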