Submitted By: Cabico Tan
Submitted to:
In this task, we set out to gather data from the Amazon store. Amazon holds a prominent position among e-commerce sites: it is highly popular and one of the most frequently scraped websites. Having identified Amazon as our target, we set our sights on finding an affordable monitor for a potential purchase.
To carry out the web scraping, we chose the "requests-html" Python library (imported as requests_html). The main reason for this choice is its ability to render JavaScript content, a challenge that often arises during web scraping. Alternatives such as BeautifulSoup and Scrapy are available, but requests-html stands out for its straightforward API, which handles the complexities of scraping efficiently, especially when JavaScript is involved.
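As a minimal sketch of what that looks like in practice (the product URL here is a placeholder, not the page we actually scraped, and real Amazon pages may also require custom request headers):

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get('https://www.amazon.com/product-reviews/B000000000')  # placeholder URL
    r.html.render(sleep=1)  # execute the page's JavaScript in a headless browser
    print(len(r.html.find('div[data-hook=review]')))  # count review elements on the rendered page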
In the preliminary stage, we identified the specific data points we aimed to extract from the website, using the browser's "Inspect Element" function to locate those elements within the HTML structure. We then initialized an HTML Session, the step that establishes communication with the target website. To capture accurate and up-to-date information, we made sure the JavaScript content was fully rendered before parsing. With that prerequisite fulfilled, we used CSS selectors to extract the desired data from the page. Lastly, we stored all of the harvested data in a structured CSV file for easy analysis and further processing. The process shows how well-chosen tools and strategies make it practical to navigate the dynamic landscape of web scraping.
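To make the "Inspect Element" step concrete, here is a simplified sketch of the markup of one review block and the matching selectors (the HTML is abbreviated, not copied verbatim from Amazon, and r is the rendered response from the sketch above):

    # Roughly what the inspector shows for a single review:
    # <div data-hook="review">
    #   <div class="a-profile-content"><span>Customer Name</span></div>
    #   <a data-hook="review-title">...</a>
    #   <span data-hook="review-body">...</span>
    # </div>
    reviews = r.html.find('div[data-hook=review]')  # one element per review
    name = reviews[0].find('div[class=a-profile-content] span', first=True).text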
2. Scrape at least 50 rows of data. Provide the steps on how you scraped your data. In a
table, provide the Python scripts you used to clean the dataset.
Scraping Data
PYTHON SCRIPT:

    # Body of the scraper's page-fetching step (see the sketch below for the full wrapper).
    r.html.render(sleep=1)  # render the page's JavaScript before searching for reviews
    if not r.html.find('div[data-hook=review]'):
        return False
    else:
        return r.html.find('div[data-hook=review]')

REMARKS: Renders the review page and returns the list of review <div> elements, or False when the page contains no reviews, which signals that there are no more pages to scrape.

SCREENSHOT OF OUTPUT: [image not preserved in the text copy]
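The table does not show the method that wraps this snippet; judging from the later call amz.pagination(i), it is the pagination() method of the scraper object. A minimal sketch of what that wrapper might look like (the review-page URL pattern and the self.session/self.url attributes are assumptions, not taken from the original script):

    def pagination(self, page):
        # Assumed URL pattern for paging through Amazon reviews.
        r = self.session.get(f'{self.url}&pageNumber={page}')
        r.html.render(sleep=1)
        if not r.html.find('div[data-hook=review]'):
            return False  # empty page: no more reviews to fetch
        else:
            return r.html.find('div[data-hook=review]')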
PYTHON SCRIPT:

    def parse(self, reviews):
        total = []
        for review in reviews:
            customer_name = review.find('div[class=a-profile-content] span', first=True).text
            verified_or_not = review.find('a[class=a-link-normal] span', first=True).text
            title = review.find('a[data-hook=review-title]', first=True).text
            cust_review = review.find('span[data-hook=review-body]', first=True).text.replace('\n', '').strip()
            cust_rating = review.find('i[data-hook=review-star-rating] span', first=True).text
            data = {
                'Customer Name': customer_name,
                'Verified?': verified_or_not,
                'Title': title,
                'Review': cust_review,
                'Rating': cust_rating
            }
            total.append(data)
        print(total)
        return total

REMARKS: This function is responsible for parsing the content that we want to extract from the review page. It stores it in a dictionary that we specified. In this case, the contents scraped from the review page are Customer Name, Verified Purchase, Title, Customer Review, and Customer Rating.

SCREENSHOT OF OUTPUT: [image not preserved in the text copy]
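Before the driver loop below, the scraper object amz has to be created. The class itself is not shown in the table; a minimal sketch of the scaffolding the calls imply (the class name, constructor, and placeholder URL are assumptions):

    from requests_html import HTMLSession

    class AmazonReviews:  # hypothetical class name
        def __init__(self, url):
            self.session = HTMLSession()
            self.url = url  # base review-page URL, e.g. with the product's ASIN

        # pagination(), parse(), and save_to_csv() as shown in this table

    amz = AmazonReviews('https://www.amazon.com/product-reviews/B000000000?ie=UTF8')  # placeholder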
PYTHON SCRIPT:

    # Excerpt from the driver script; the import, results list, and loop header
    # are restored here for context (the page range is an assumption).
    import time

    results = []
    for i in range(1, 100):
        print('getting page ', i)
        time.sleep(0.5)  # short pause between requests
        reviews = amz.pagination(i)
        if reviews is not False:
            results.extend(amz.parse(reviews))
        else:
            print('No More Pages')
            break
    print(results)
    amz.save_to_csv(results)

REMARKS: Walks through the review pages one by one, pausing half a second between requests. Each page's reviews are parsed and appended to the results list, and the loop stops once pagination() reports that there are no more pages. The collected rows are then printed and saved to a CSV file.

SCREENSHOT OF OUTPUT: [image not preserved in the text copy]
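The save_to_csv() helper that the loop ends with is also not shown in the table. A minimal sketch, assuming it receives the list of dictionaries that parse() builds (the method name comes from the call above; the body and filename are assumptions):

    import csv

    def save_to_csv(self, results, filename='amazon_reviews.csv'):
        # 'results' is a list of dicts with the keys set in parse()
        # ('Customer Name', 'Verified?', 'Title', 'Review', 'Rating').
        with open(filename, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=results[0].keys())
            writer.writeheader()
            writer.writerows(results)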
Preprocessing
PYTHON SCRIPT | REMARKS | SCREENSHOT OF OUTPUT