
Web Parsing Course: Lesson 5 - Scraping Dynamic Content with JavaScript-Heavy Websites

Objective:

In this lesson, you will learn how to scrape dynamic content from websites that heavily rely on
JavaScript. These websites load content dynamically through AJAX calls or other client-side
rendering techniques, which traditional scraping methods might not capture.

Lesson Outline:

1. Understanding Dynamic Content


o Static vs Dynamic Websites:
   - Static Websites: Serve content directly via HTML, making scraping straightforward (e.g., downloading and parsing the HTML file).
   - Dynamic Websites: Use JavaScript to load additional content after the initial HTML page load. This is common in Single Page Applications (SPAs), which use frameworks like React, Angular, or Vue.js.
o AJAX Requests: Websites use AJAX (Asynchronous JavaScript and XML) to fetch data in the background without refreshing the page.
   - Example: Loading more content as you scroll down a page.
2. Challenges of Scraping JavaScript-Heavy Websites
o Content Not Present in Initial HTML: Traditional HTML scrapers like BeautifulSoup or lxml may not find the data if it’s loaded via JavaScript after the page renders (see the sketch after this list).
o JavaScript Frameworks: Modern front-end frameworks make it difficult to locate
elements or data using static HTML parsers.
o Infinite Scroll and Pagination: Some websites load content as users scroll down or
navigate through pages dynamically.
o API Endpoints Hidden in JavaScript: Instead of providing data directly in the
HTML, websites fetch it from hidden APIs via AJAX calls.
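
o Seeing the First Challenge in Code:
   - A minimal sketch of the "content not in the initial HTML" problem; the URL and the .dynamic-content selector are placeholders. The static fetch sees only the initial HTML, while the browser fetch sees what JavaScript rendered.

python
import requests
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

url = "https://example-dynamic-site.com"  # placeholder URL

# Static fetch: only the initial HTML, before any JavaScript runs
soup = BeautifulSoup(requests.get(url).text, "html.parser")
print("Static fetch found:", len(soup.select(".dynamic-content")), "elements")

# Browser fetch: the DOM after JavaScript has rendered the page
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    page.wait_for_selector(".dynamic-content")
    print("Rendered fetch found:", len(page.query_selector_all(".dynamic-content")), "elements")
    browser.close()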
3. Using Playwright for Dynamic Content
o Why Playwright?:
   - Playwright allows you to interact with a full browser and run JavaScript, making it ideal for scraping dynamic websites. It can wait for network requests to finish and trigger interactions like scrolling or button clicks.
o Basic Example:

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example-dynamic-site.com")

    # Wait for AJAX content to load
    page.wait_for_selector(".dynamic-content")

    # Extract content after JavaScript has finished rendering
    content = page.content()
    print(content)

    browser.close()
o Waiting for Elements:
   - Playwright allows you to wait until elements appear after JavaScript loads them.

python
page.wait_for_selector(".loaded-element")

4. Dealing with Infinite Scrolling


o What is Infinite Scrolling?
   - Websites dynamically load more content when the user scrolls down, a feature common on social media platforms, e-commerce sites, and news sites.
o Simulating Scroll in Playwright:

python
page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
page.wait_for_timeout(2000) # Give it some time to load

o Automating Scroll Behavior:


   - Repeat the scroll action in a loop to capture all dynamically loaded content.

python
last_height = 0
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
    page.wait_for_timeout(2000)  # Wait for content to load
    new_height = page.evaluate("document.body.scrollHeight")
    if new_height == last_height:  # Break if no new content appeared
        break
    last_height = new_height

5. Scraping AJAX Requests Directly


o Identifying AJAX Endpoints:
   - Sometimes, it’s more efficient to directly capture and scrape data from the API endpoints used by the website’s AJAX calls, rather than rendering the whole page.
o How to Find AJAX Calls:
   - Use browser Developer Tools (Network tab) to inspect the requests made by the page. Look for API endpoints returning JSON or XML.
o Capturing AJAX Calls in Playwright:
   - Playwright allows you to intercept and log network requests; a second sketch after the snippet below also reads the response bodies.

python
def log_request(request):
    if "api" in request.url:
        print("Captured API request:", request.url)

page.on("request", log_request)
page.goto("https://example.com")

6. Extracting Data from JSON Responses


o Fetching JSON Directly:
   - Once you've identified the API endpoint, you can send HTTP requests directly to it, bypassing the need to scrape HTML or run JavaScript.

python
import requests

response = requests.get("https://example.com/api/data")
json_data = response.json()
print(json_data)

o Working with JSON Responses:


   - Many AJAX calls return data in JSON format, which is structured and easy to parse with Python’s json library (a short sketch follows below).
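
   - A short sketch of parsing such a payload, assuming a hypothetical structure with an "items" list:

python
import json

# Hypothetical payload; the real structure depends on the API
raw = '{"items": [{"name": "Widget", "price": 9.99}]}'
data = json.loads(raw)

for item in data["items"]:
    print(item["name"], "-", item["price"])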
7. Handling Pagination in Dynamic Websites
o Navigating Through Pages:
   - Some websites use pagination, where clicking "Next" or "Load More" loads new data.
o Automating Clicks on Pagination Buttons:

python
next_button = page.query_selector(".pagination-next")
while next_button:
    next_button.click()
    page.wait_for_timeout(2000)  # Wait for content to load
    next_button = page.query_selector(".pagination-next")

8. Practical Task: Scraping a JavaScript-Heavy Website


o Scenario: Scrape data from a website that loads additional content via AJAX.
Identify the AJAX endpoints and either capture the requests or use Playwright to
extract dynamically loaded content.
o Use Playwright’s ability to wait for selectors, handle scrolling, and navigate through paginated content to scrape the complete data (a combined sketch follows below).
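
o One Possible Solution Shape:
   - A sketch only; the URL and the .dynamic-content, .item, and .pagination-next selectors are placeholders to adapt to the target site.

python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example-dynamic-site.com")  # placeholder URL
    page.wait_for_selector(".dynamic-content")    # Wait for AJAX content

    results = []
    while True:
        # Collect the items currently on the page (placeholder selector)
        results.extend(el.inner_text() for el in page.query_selector_all(".item"))
        next_button = page.query_selector(".pagination-next")
        if not next_button:
            break
        next_button.click()
        page.wait_for_timeout(2000)  # Give the next page time to load

    print(len(results), "items scraped")
    browser.close()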
9. Advanced: Headless Browsers vs Headed Browsers
o Headless Browsers:
   - By default, Playwright runs in headless mode (i.e., without showing the browser UI). This is faster and more efficient for scraping tasks.
   - Example of running Playwright in headless mode:

python
browser = p.chromium.launch(headless=True)

o When to Use Headed Browsers:


   - Sometimes it’s useful to see what’s happening in the browser. You can launch a browser in headed mode for debugging.

python
browser = p.chromium.launch(headless=False)

10. Performance Optimization for Dynamic Websites


o Reducing Resource Load:
   - You can disable loading of unnecessary resources like images, videos, and stylesheets to improve scraping speed.

python
def block_resources(route):
    if route.request.resource_type in ["image", "media", "stylesheet"]:
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_resources)

o Headless Mode for Faster Scraping:
   - Using headless mode significantly speeds up scraping by not rendering the browser UI.
o Timeout and Retry Mechanisms:
   - Implement retries and timeout handling so that scraping continues even when the website is slow or AJAX calls fail (a retry sketch follows the snippet below).

python
page.wait_for_timeout(5000) # Wait for 5 seconds before retrying
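
   - A minimal retry sketch, assuming transient timeouts are worth retrying a few times before giving up (the URL and selector are placeholders):

python
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

for attempt in range(3):
    try:
        page.goto("https://example.com", timeout=10000)  # 10-second timeout
        page.wait_for_selector(".dynamic-content", timeout=10000)
        break  # Success: stop retrying
    except PlaywrightTimeoutError:
        print(f"Attempt {attempt + 1} timed out, retrying...")
        page.wait_for_timeout(5000)  # Wait for 5 seconds before retrying
else:
    print("Giving up after 3 attempts")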

Key Takeaways:

- Scraping JavaScript-heavy websites requires the ability to wait for content rendered via JavaScript.
- Playwright is a powerful tool for interacting with and extracting data from such websites, allowing you to simulate real browsing behavior.
- AJAX calls are often the key to scraping dynamic content, and intercepting them can make scraping more efficient.
- Techniques like scrolling, paginating, and monitoring network requests are essential for extracting data from dynamic, content-rich websites.

By the end of this lesson, you’ll have the skills to handle and scrape data from websites that rely
heavily on client-side JavaScript, making your scraping capabilities more versatile and robust.
