WebScraping Lessons 5
Websites
Objective:
In this lesson, you will learn how to scrape dynamic content from websites that heavily rely on
JavaScript. These websites load content dynamically through AJAX calls or other client-side
rendering techniques, which traditional scraping methods might not capture.
Lesson Outline:
python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch Chromium (headless by default) and open a new tab.
    browser = p.chromium.launch()
    page = browser.new_page()
    # Navigate; the page's JavaScript runs as it would in a normal browser.
    page.goto("https://example-dynamic-site.com")
    browser.close()
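The launch-and-navigate steps above could be wrapped in a small reusable function. This is only a sketch: `fetch_rendered_html` is a hypothetical helper name, and waiting for `networkidle` is an assumption about when the page is "done" (sites that poll continuously may never go idle).

```python
def fetch_rendered_html(p, url, headless=True):
    """Return the page HTML after client-side JavaScript has run.

    `p` is the object yielded by `sync_playwright()`.
    """
    browser = p.chromium.launch(headless=headless)
    try:
        page = browser.new_page()
        # "networkidle" waits until network activity has quieted down.
        page.goto(url, wait_until="networkidle")
        return page.content()
    finally:
        # Always release the browser, even if navigation fails.
        browser.close()

# Usage (requires Playwright to be installed):
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     html = fetch_rendered_html(p, "https://example-dynamic-site.com")
```

Closing the browser in a `finally` block keeps a failed navigation from leaving a browser process behind.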
o Waiting for Elements:
Playwright allows you to wait until elements appear after JavaScript loads
them.
python
# Block until an element matching the selector is present and visible.
page.wait_for_selector(".loaded-element")
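Waiting is usually followed by extraction. The sketch below combines the two; `texts_when_ready` is a hypothetical helper, and the 10-second timeout is an arbitrary assumption rather than a recommended value.

```python
def texts_when_ready(page, selector, timeout_ms=10000):
    """Wait for `selector` to appear, then return the text of every match."""
    # Playwright raises its TimeoutError if nothing appears within timeout_ms.
    page.wait_for_selector(selector, timeout=timeout_ms)
    elements = page.query_selector_all(selector)
    return [el.inner_text().strip() for el in elements]

# Usage: texts = texts_when_ready(page, ".loaded-element")
```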
python
# Scroll to the bottom of the page to trigger loading of more content.
page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
page.wait_for_timeout(2000)  # Give it some time to load
python
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
    page.wait_for_timeout(2000)  # Wait for content to load
    if not page.query_selector(".new-content"):  # Break if no new content
        break
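The loop above depends on a site-specific `.new-content` marker. One alternative, sketched here as an assumption rather than the lesson's method, is to stop when the page height no longer grows after a scroll. `scroll_until_stable` and the `max_rounds` safety cap are hypothetical names.

```python
def scroll_until_stable(page, pause_ms=2000, max_rounds=20):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that made the page taller.
    """
    loaded = 0
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # Give new content time to load
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # Nothing new was appended
        last_height = new_height
        loaded += 1
    return loaded
```

The cap on rounds guards against feeds that never end.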
python
def log_request(request):
    # Print every request whose URL looks like an API call.
    if "api" in request.url:
        print("Captured API request:", request.url)

# Attach the listener before navigating so no requests are missed.
page.on("request", log_request)
page.goto("https://example.com")
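Printing is handy for exploration, but collecting the URLs makes them usable later. A minimal sketch, assuming that a substring match on the URL is enough to identify API calls; `collect_request_urls` is a hypothetical helper.

```python
def collect_request_urls(page, url, marker="api"):
    """Navigate to `url` and return the URLs of requests containing `marker`."""
    captured = []

    def on_request(request):
        if marker in request.url:
            captured.append(request.url)

    # Register the listener before navigating so early requests are seen.
    page.on("request", on_request)
    page.goto(url)
    return captured

# Usage: api_urls = collect_request_urls(page, "https://example.com")
```

Requests fired after `goto` returns (for example on scroll) keep being appended to the same list for as long as the listener stays attached.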
python
import requests

# Call the API endpoint directly instead of rendering the page.
response = requests.get("https://example.com/api/data")
json_data = response.json()
print(json_data)
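When calling an endpoint directly, it helps to fail loudly on HTTP errors and to set a timeout. The sketch below takes any object that behaves like `requests.Session`; `fetch_json` is a hypothetical helper, and the 10-second timeout is an assumption.

```python
def fetch_json(session, url, params=None, timeout=10):
    """GET `url` and return the decoded JSON body.

    `session` is expected to behave like `requests.Session`.
    """
    response = session.get(url, params=params, timeout=timeout)
    # Raise on 4xx/5xx instead of parsing an error page as data.
    response.raise_for_status()
    return response.json()

# Usage (requires the requests package):
# import requests
# data = fetch_json(requests.Session(), "https://example.com/api/data")
```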
python
next_button = page.query_selector(".pagination-next")
while next_button:
    next_button.click()
    page.wait_for_timeout(2000)  # Wait for content to load
    # Look the button up again; the loop ends once it is no longer found.
    next_button = page.query_selector(".pagination-next")
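The loop above only clicks through the pages. A sketch that also collects data on each page might look like the following; `scrape_all_pages`, the item selector passed in, and the `max_pages` cap are all hypothetical, and treating a disabled button as "last page" is an assumption about the site.

```python
def scrape_all_pages(page, item_selector, next_selector=".pagination-next",
                     pause_ms=2000, max_pages=50):
    """Collect item texts from each page, clicking "next" until it is gone."""
    results = []
    for _ in range(max_pages):
        # Read the current page before moving on.
        for el in page.query_selector_all(item_selector):
            results.append(el.inner_text())
        next_button = page.query_selector(next_selector)
        # Stop when the button is missing or disabled (assumed last page).
        if next_button is None or not next_button.is_enabled():
            break
        next_button.click()
        page.wait_for_timeout(pause_ms)  # Wait for content to load
    return results

# Usage: rows = scrape_all_pages(page, ".item")
```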
python
# Headless mode: no visible browser window (the default).
browser = p.chromium.launch(headless=True)
python
# Headed mode: shows the browser window, which is useful for debugging.
browser = p.chromium.launch(headless=False)
python
def block_resources(route):
    # Skip downloads that are not needed for data extraction.
    if route.request.resource_type in ["image", "media", "stylesheet"]:
        route.abort()
    else:
        route.continue_()

# Apply the handler to every request the page makes.
page.route("**/*", block_resources)
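If different pages need different block lists, the handler can be built by a small factory. This is a sketch only; `make_resource_blocker` is a hypothetical name, and which resource types are safe to drop depends on the site.

```python
def make_resource_blocker(blocked_types=("image", "media", "stylesheet")):
    """Return a route handler that aborts requests of the given types."""
    blocked = frozenset(blocked_types)

    def handler(route):
        if route.request.resource_type in blocked:
            route.abort()      # Drop the request entirely
        else:
            route.continue_()  # Let everything else through unchanged

    return handler

# Usage: also block fonts on this page.
# page.route("**/*", make_resource_blocker(("image", "media", "stylesheet", "font")))
```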
python
page.wait_for_timeout(5000) # Wait for 5 seconds before retrying
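The fixed pause before retrying can be generalized into a retry loop. The sketch below is one possible shape, not the lesson's own code: `retry`, its parameters, and the doubling backoff are assumptions, and catching every `Exception` is deliberately broad for illustration.

```python
import time

def retry(action, attempts=3, delay_s=5.0, backoff=2.0, sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            sleep(delay_s)      # Wait before the next try
            delay_s *= backoff  # Wait longer each time

# Usage: retry(lambda: page.goto("https://example.com"))
```

The `sleep` parameter exists so the delay can be replaced in tests; in a Playwright script it could be swapped for a wrapper around `page.wait_for_timeout`.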
Key Takeaways:
Scraping JavaScript-heavy websites requires the ability to wait for content rendered via
JavaScript.
Playwright is a powerful tool for interacting with and extracting data from such websites,
allowing you to simulate real browsing behavior.
AJAX calls are often the key to scraping dynamic content, and intercepting them can make
scraping more efficient.
Techniques like scrolling, paginating, and monitoring network requests are essential to
extract data from dynamic content-rich websites.
By the end of this lesson, you’ll have the skills to handle and scrape data from websites that rely
heavily on client-side JavaScript, making your scraping capabilities more versatile and robust.