
Prelim Exercise 1

Submitted By:

Cabico, Julius Daniel

Tan, Kyle Daeniel

Submitted to:

Ma’am Josephine Dela Cruz


1. Provide the source of your data and the specific tool/s you used to scrape your data.

For this task, we gathered data from the Amazon store. Amazon holds a prominent position among e-commerce sites, being a highly popular and frequently scraped website. Having identified Amazon as our target, we set our sights on the reviews of an affordable monitor for potential purchase.

To initiate the web scraping process, we chose the requests-html Python library. This choice is backed by the library's ability to render JavaScript content, a challenge that often arises during web scraping. While alternatives like BeautifulSoup and Scrapy are available, requests-html stood out for its straightforward API, which handles the complexities of scraping JavaScript-heavy pages with little extra code.

In the preliminary stage, we identified the specific data points we aimed to extract from the website, using the browser's "inspect element" function to locate those elements within the HTML structure. We then initialized an HTMLSession, the step that establishes communication with the target website. To capture accurate and up-to-date information, we made sure the JavaScript content was fully rendered before extraction. With that prerequisite fulfilled, we used CSS selectors to precisely extract the desired data from the page. Lastly, we stored all the harvested data in a structured CSV file for easy analysis and further processing. This process shows how well-chosen tools make the dynamic parts of web scraping manageable. A minimal sketch of the flow is shown below; the full script appears in the next section.
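A minimal sketch of that flow, assuming requests-html is installed (the shortened review URL is a placeholder; the selectors are the same ones used in the full script below):

from requests_html import HTMLSession

session = HTMLSession()                                                # establish the session
r = session.get('https://www.amazon.com/product-reviews/B09TTDRXNS/')  # placeholder review URL
r.html.render(sleep=1)                                                 # render JavaScript before extracting
for review in r.html.find('div[data-hook=review]'):                   # selector located via "inspect element"
    print(review.find('a[data-hook=review-title]', first=True).text)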

2. Scrape at least 50 rows of data. Provide the steps on how you scraped your data. In a
table, provide the Python scripts you used to clean the dataset.

Scraping Data
PYTHON SCRIPT:

from requests_html import HTMLSession
import pandas as pd
import time

REMARKS: This code imports the necessary libraries for web scraping.

PYTHON SCRIPT:

class Reviews:
    def __init__(self, asin) -> None:
        self.asin = asin
        self.session = HTMLSession()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/115.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US, en; q=0.5'
        }
        self.url = (
            f'https://www.amazon.com/KOORUI-FreeSyncTM-Compatible-Ultra-Thin-24E4'
            f'/product-reviews/{self.asin}/ref=cm_cr_arp_d_paging_btm_next_2'
            '?ie=UTF8&reviewerType=all_reviews&pageNumber='
        )

REMARKS: This code declares the class. It also stores the ASIN, the user agent headers, and the base review URL to be used.

PYTHON SCRIPT:

    def pagination(self, page):
        r = self.session.get(self.url + str(page))
        r.html.render(sleep=1)
        if not r.html.find('div[data-hook=review]'):
            return False
        else:
            return r.html.find('div[data-hook=review]')

REMARKS: This method is responsible for fetching each page of the product's reviews. When the review tag can no longer be found on a page, it returns False so the runner knows there are no more pages to visit.
PYTHON SCRIPT:

    def parse(self, reviews):
        total = []
        for review in reviews:
            customer_name = review.find('div[class=a-profile-content] span', first=True).text
            verified_or_not = review.find('a[class=a-link-normal] span', first=True).text
            title = review.find('a[data-hook=review-title]', first=True).text
            cust_review = review.find('span[data-hook=review-body]', first=True).text.replace('\n', '').strip()
            cust_rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            data = {
                'Customer Name': customer_name,
                'Verified?': verified_or_not,
                'Title': title,
                'Review': cust_review,
                'Rating': cust_rating
            }
            total.append(data)
        print(total)
        return total

REMARKS: This method parses the content that we want to extract from the review page and stores it in the dictionary we specified. In this case, the fields scraped from each review are Customer Name, Verified Purchase, Title, Customer Review, and Customer Rating.

PYTHON SCRIPT:

    def save_to_csv(self, results):
        df = pd.DataFrame(results)
        df.to_csv('Cabico_Tan_Amazon_Review.csv')

REMARKS: This method converts the list of dictionaries into a dataframe and saves it in CSV file format.

PYTHON SCRIPT:

if __name__ == '__main__':
    amz = Reviews('B09TTDRXNS')
    results = []
    for i in range(1, 13):
        print('getting page ', i)
        time.sleep(0.5)
        reviews = amz.pagination(i)
        if reviews is not False:
            results.extend(amz.parse(reviews))
        else:
            print('No More Pages')
            break
    print(results)
    amz.save_to_csv(results)

REMARKS: This code serves as the runner of the program. It was able to go through 10 pages and extract over 50 rows of data.

Preprocessing
PYTHON SCRIPT:

import pandas as pd
import numpy as np

REMARKS: Imports the necessary libraries for preprocessing.
PYTHON SCRIPT:

df = pd.read_csv('Cabico_Tan_Amazon_Review.csv')
df

REMARKS: Reads the CSV file and displays it.
PYTHON SCRIPT:

df = df.iloc[:, 1:]  # remove first column
df

REMARKS: Removes the unnecessary first column.

PYTHON SCRIPT:

df['Title'] = df['Title'].str.replace('5.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('4.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('3.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('2.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('1.0 out of 5 stars ', '')
df

REMARKS: Removes the rating text from the Title column.
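The five replace calls above could also be collapsed into one regular-expression replace; a minimal equivalent sketch, assuming every prefix follows the same 'N.0 out of 5 stars ' pattern:

df['Title'] = df['Title'].str.replace(r'[1-5]\.0 out of 5 stars ', '', regex=True)  # strips all five rating prefixes in one pass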

PYTHON SCRIPT:

rating_order = ['1.0 out of 5 stars', '2.0 out of 5 stars', '3.0 out of 5 stars',
                '4.0 out of 5 stars', '5.0 out of 5 stars']

REMARKS: Creates an order that fits the 'Rating' column in the CSV.

PYTHON SCRIPT:

df['Rating'] = pd.Categorical(df['Rating'], categories=rating_order, ordered=True)

REMARKS: Converts the 'Rating' column into a Categorical data type with the proper order.
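As a quick sanity check before sorting, the dtype and category order can be inspected; a minimal sketch using standard pandas attributes (the printed values depend on the scraped data):

print(df['Rating'].dtype)                     # category
print(df['Rating'].cat.categories)            # '1.0 out of 5 stars' ... '5.0 out of 5 stars', in order
print(df['Rating'].value_counts(sort=False))  # review counts per rating, in category order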

PYTHON SCRIPT:

df = df.sort_values(by='Rating')
df

REMARKS: Arranges the data in ascending order based on the 'Rating' column.
CSV File
https://docs.google.com/spreadsheets/d/1azC4OGkA7kmm4rUDJRG-ivCuEdqyOuFcJW_tIOhDRA0/edit#gid=1335734933

REFERENCES:

KOORUI 24 Inch Computer Monitor - FHD 1080P Gaming Monitor 165Hz
(https://www.amazon.com/KOORUI-FreeSyncTM-Compatible-Ultra-Thin-24E4/product-reviews/B09TTDRXNS/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=)
