Python Web Scraping - Data Processing
In earlier chapters, we learned about extracting data from web pages, that is, web scraping, with various Python modules. In this chapter, let us look into various techniques to process the data that has been scraped.
Introduction
To process the data that has been scraped, we must store the data on our local machine in a particular format like a spreadsheet (CSV), JSON, or sometimes in databases like MySQL.
CSV and JSON data processing
First, we are going to write the information grabbed from a web page into a CSV file. We will grab the information using the BeautifulSoup module, as done earlier, and then use the Python csv module to write that textual information into the CSV file. We begin by importing the necessary Python libraries −
import requests
from bs4 import BeautifulSoup
import csv
In the following line of code, we use requests to make a GET HTTP request for the URL https://authoraditiagarwal.com/ −
r = requests.get('https://authoraditiagarwal.com/')
Now, with the help of the next lines of code, we will write the grabbed data into a CSV file named dataprocessing.csv.
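Continuing the script above, a minimal sketch of these lines, assuming we save only the title of the page under an illustrative column heading, could look like this −
soup = BeautifulSoup(r.text, 'lxml')
with open('dataprocessing.csv', 'w', newline='') as csv_file:
   writer = csv.writer(csv_file)
   # Write a header row followed by the scraped page title
   writer.writerow(['Title'])
   writer.writerow([soup.title.text])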
After running this script, the textual information, that is, the title of the webpage, will be saved in the above-mentioned CSV file on your local machine.
Similarly, we can save the collected information in a JSON file. The following is an easy-to-understand Python script for doing the same, in which we grab the same information as in the last Python script, but this time the grabbed information is saved in JSONFile.txt by using the json Python module.
import requests
from bs4 import BeautifulSoup
import json

r = requests.get('https://authoraditiagarwal.com/')
soup = BeautifulSoup(r.text, 'lxml')
# Grab the page title and write it out as JSON
y = soup.title.text
with open('JSONFile.txt', 'wt') as outfile:
   json.dump(y, outfile)
After running this script, the grabbed information, that is, the title of the webpage, will be saved in the above-mentioned text file on your local machine.
Data processing using AWS S3
Sometimes we may want to store the scraped data in cloud storage such as Amazon S3 (AWS S3), rather than only on the local machine. With the help of the following steps, we can scrape data and upload it to an AWS S3 bucket −
Step 1 − First, we need an AWS account, which will provide us the secret keys to use in our Python script while storing the data. It will also let us create an S3 bucket in which we can store our data.
Step 2 − Next, we need to install the boto3 Python library for accessing the S3 bucket. It can be installed with the help of the following command −
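With pip, for example −
pip install boto3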
Step 3 − Next, we can use the following Python script for scraping data from a web page and saving it to an AWS S3 bucket.
First, we need to import the Python libraries for scraping; here we are working with requests, and boto3 for saving data to the S3 bucket.
import requests
import boto3
s3 = boto3.client('s3')
bucket_name = "our-content"
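The rest of such a script is only sketched here; assuming the bucket name is available and your AWS credentials are already configured, it could fetch the page and upload it as an object (the key name page.html is just an illustrative choice) −
# Fetch the page text that we want to store
data = requests.get('https://authoraditiagarwal.com/').text
# Create the bucket (in the default region) and upload the page as an object
s3.create_bucket(Bucket=bucket_name)
s3.put_object(Bucket=bucket_name, Key='page.html', Body=data.encode('utf-8'))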
Now you can check the bucket named our-content from your AWS account.
Data processing using MySQL
Let us learn how to process data using MySQL. If you want to learn about MySQL, then you can follow the link
https://www.tutorialspoint.com/mysql/.
With the help of the following steps, we can scrape and process data into a MySQL table −
Step 1 − First, by using MySQL, we need to create a database and a table in which we want to save our scraped data. For example, we are creating the table with the following query −
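A plausible version of this query, assuming a database named Scrap and a table named Scrap_pages with title and content columns as used by the script in Step 4, is −
CREATE DATABASE Scrap;
USE Scrap;
CREATE TABLE Scrap_pages (
   id BIGINT(7) NOT NULL AUTO_INCREMENT,
   title VARCHAR(200),
   content VARCHAR(10000),
   created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
   PRIMARY KEY(id)
);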
Step 2 − Next, we need to deal with Unicode. Note that older MySQL versions do not handle the full Unicode range by default. We need to turn on this feature with the help of the following commands, which will change the default character set for the database, for the table and for both of the columns −
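Assuming the Scrap database and Scrap_pages table from Step 1, these commands would typically be −
ALTER DATABASE Scrap CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE title title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE Scrap_pages CHANGE content content VARCHAR(10000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;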
Step 3 − Now, integrate MySQL with Python. For this, we will need PyMySQL, which can be installed with the help of the following command −
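With pip, for example −
pip install PyMySQL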
Step 4 − Now, our database named Scrap, created earlier, is ready to receive the data scraped from the web and save it into the table named Scrap_pages. In our example, we are going to scrape data from Wikipedia and save it into our database. Note that the database connection details and the store() helper in the script below use placeholder values; adjust them to match your own MySQL setup −
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymysql
import random
import re

# Placeholder connection details: change the host, user and password to match your MySQL setup
conn = pymysql.connect(host='127.0.0.1', user='root', password='', db='Scrap', charset='utf8mb4')
cur = conn.cursor()

def store(title, content):
   # Save one scraped page into the Scrap_pages table
   cur.execute('INSERT INTO Scrap_pages (title, content) VALUES (%s, %s)', (title, content))
   conn.commit()

def getLinks(articleUrl):
   html = urlopen('http://en.wikipedia.org' + articleUrl)
   bs = BeautifulSoup(html, 'html.parser')
   title = bs.find('h1').get_text()
   content = bs.find('div', {'id': 'mw-content-text'}).find('p').get_text()
   store(title, content)
   # Return all internal /wiki/ links from the article body
   return bs.find('div', {'id': 'bodyContent'}).findAll('a', href=re.compile('^(/wiki/)((?!:).)*$'))

links = getLinks('/wiki/Kevin_Bacon')
try:
   while len(links) > 0:
      newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
      print(newArticle)
      links = getLinks(newArticle)
finally:
   cur.close()
   conn.close()
This will save the data gathered from Wikipedia into the table named Scrap_pages. If you are familiar with MySQL and web scraping, then the above code would not be tough to understand.
Data processing using PostgreSQL
The process of handling scraped data with PostgreSQL is similar to that of MySQL. If you are not familiar with PostgreSQL, then you can learn it at https://www.tutorialspoint.com/postgresql/. And with the help of the following command we can install the psycopg2 Python library −
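With pip, for example −
pip install psycopg2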