
Prelim Exercise 1

Submitted By:

Cabico, Julius Daniel

Tan, Kyle Daeniel

Submitted to:

Ma’am Josephine Dela Cruz


1. Provide the source of your data and the specific tool/s you used to scrape your data.

For this task, we gathered data from the Amazon store. Amazon holds a prominent position among e-commerce sites, being a highly popular and frequently scraped website. Having identified Amazon as our target, we set our sights on the reviews of an affordable monitor for potential purchase.

To initiate the web scraping process, we chose the requests-html Python library. This choice is backed by the library's ability to render JavaScript content, a challenge that often arises during web scraping. While alternatives like BeautifulSoup and Scrapy are available, requests-html stood out for its straightforward API, which handles the complexities of scraping JavaScript-heavy pages with little extra code.

In the preliminary stage, we identified the specific data points we aimed to extract from the website, using the browser's "inspect element" function to locate those elements within the HTML structure. We then initialized an HTMLSession, the step that establishes communication with the target website. To capture accurate and up-to-date information, we made sure the JavaScript content was fully rendered before extraction. With that prerequisite fulfilled, we used CSS selectors to precisely extract the desired data from the page. Lastly, we stored all the harvested data in a structured CSV file for easy analysis and further processing. This process shows how well-chosen tools make the dynamic parts of web scraping manageable. A minimal sketch of the flow is shown below; the full script appears in the next section.
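A minimal sketch of that flow, assuming requests-html is installed (the shortened review URL is a placeholder; the selectors are the same ones used in the full script below):

from requests_html import HTMLSession

session = HTMLSession()                                                # establish the session
r = session.get('https://www.amazon.com/product-reviews/B09TTDRXNS/')  # placeholder review URL
r.html.render(sleep=1)                                                 # render JavaScript before extracting
for review in r.html.find('div[data-hook=review]'):                   # selector located via "inspect element"
    print(review.find('a[data-hook=review-title]', first=True).text)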

2. Scrape at least 50 rows of data. Provide the steps on how you scraped your data. In a
table, provide the Python scripts you used to clean the dataset.

Scraping Data
PYTHON SCRIPT:

from requests_html import HTMLSession
import pandas as pd
import time

REMARKS: This code imports the necessary libraries for web scraping.

PYTHON SCRIPT:

class Reviews:
    def __init__(self, asin) -> None:
        self.asin = asin
        self.session = HTMLSession()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/115.0.0.0 Safari/537.36',
            'Accept-Language': 'en-US, en; q=0.5'
        }
        self.url = (
            f'https://www.amazon.com/KOORUI-FreeSyncTM-Compatible-Ultra-Thin-24E4'
            f'/product-reviews/{self.asin}/ref=cm_cr_arp_d_paging_btm_next_2'
            '?ie=UTF8&reviewerType=all_reviews&pageNumber='
        )

REMARKS: This code declares the class. It also stores the ASIN, the user agent headers, and the base review URL to be used.

PYTHON SCRIPT:

    def pagination(self, page):
        r = self.session.get(self.url + str(page))
        r.html.render(sleep=1)
        if not r.html.find('div[data-hook=review]'):
            return False
        else:
            return r.html.find('div[data-hook=review]')

REMARKS: This method is responsible for fetching each page of the product's reviews. When the review tag can no longer be found on a page, it returns False so the runner knows there are no more pages to visit.
PYTHON SCRIPT:

    def parse(self, reviews):
        total = []
        for review in reviews:
            customer_name = review.find('div[class=a-profile-content] span', first=True).text
            verified_or_not = review.find('a[class=a-link-normal] span', first=True).text
            title = review.find('a[data-hook=review-title]', first=True).text
            cust_review = review.find('span[data-hook=review-body]', first=True).text.replace('\n', '').strip()
            cust_rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            data = {
                'Customer Name': customer_name,
                'Verified?': verified_or_not,
                'Title': title,
                'Review': cust_review,
                'Rating': cust_rating
            }
            total.append(data)
        print(total)
        return total

REMARKS: This method parses the content that we want to extract from the review page and stores it in the dictionary we specified. In this case, the fields scraped from each review are Customer Name, Verified Purchase, Title, Customer Review, and Customer Rating.

PYTHON SCRIPT:

    def save_to_csv(self, results):
        df = pd.DataFrame(results)
        df.to_csv('Cabico_Tan_Amazon_Review.csv')

REMARKS: This method converts the list of dictionaries into a dataframe and saves it in CSV file format.

PYTHON SCRIPT:

if __name__ == '__main__':
    amz = Reviews('B09TTDRXNS')
    results = []
    for i in range(1, 13):
        print('getting page ', i)
        time.sleep(0.5)
        reviews = amz.pagination(i)
        if reviews is not False:
            results.extend(amz.parse(reviews))
        else:
            print('No More Pages')
            break
    print(results)
    amz.save_to_csv(results)

REMARKS: This code serves as the runner of the program. It was able to go through 10 pages and extract over 50 rows of data.

Preprocessing
PYTHON SCRIPT:

import pandas as pd
import numpy as np

REMARKS: Imports the necessary libraries for preprocessing.
PYTHON SCRIPT:

df = pd.read_csv('Cabico_Tan_Amazon_Review.csv')
df

REMARKS: Reads the CSV file and displays it.
PYTHON SCRIPT:

df = df.iloc[:, 1:]  # remove first column
df

REMARKS: Removes the unnecessary first column.

PYTHON SCRIPT:

df['Title'] = df['Title'].str.replace('5.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('4.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('3.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('2.0 out of 5 stars ', '')
df['Title'] = df['Title'].str.replace('1.0 out of 5 stars ', '')
df

REMARKS: Removes the rating text from the Title column.
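The five replace calls above could also be collapsed into one regular-expression replace; a minimal equivalent sketch, assuming every prefix follows the same 'N.0 out of 5 stars ' pattern:

df['Title'] = df['Title'].str.replace(r'[1-5]\.0 out of 5 stars ', '', regex=True)  # strips all five rating prefixes in one pass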

PYTHON SCRIPT:

rating_order = ['1.0 out of 5 stars', '2.0 out of 5 stars', '3.0 out of 5 stars',
                '4.0 out of 5 stars', '5.0 out of 5 stars']

REMARKS: Creates an order that fits the 'Rating' column in the CSV.

PYTHON SCRIPT:

df['Rating'] = pd.Categorical(df['Rating'], categories=rating_order, ordered=True)

REMARKS: Converts the 'Rating' column into a Categorical data type with the proper order.
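As a quick sanity check before sorting, the dtype and category order can be inspected; a minimal sketch using standard pandas attributes (the printed values depend on the scraped data):

print(df['Rating'].dtype)                     # category
print(df['Rating'].cat.categories)            # '1.0 out of 5 stars' ... '5.0 out of 5 stars', in order
print(df['Rating'].value_counts(sort=False))  # review counts per rating, in category order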

PYTHON SCRIPT:

df = df.sort_values(by='Rating')
df

REMARKS: Arranges the data in ascending order based on the 'Rating' column.
CSV File
https://docs.google.com/spreadsheets/d/1azC4OGkA7kmm4rUDJRG-ivCuEdqyOuFcJW_tIOhDRA0/edit#gid=1335734933

REFERENCES:

KOORUI 24 Inch Computer Monitor - FHD 1080P Gaming Monitor 165Hz
(https://www.amazon.com/KOORUI-FreeSyncTM-Compatible-Ultra-Thin-24E4/product-reviews/B09TTDRXNS/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=)
