This cheat sheet provides a comprehensive guide to web scraping and data collection techniques using various libraries such as Requests, BeautifulSoup, Selenium, and Scrapy. It covers essential topics including HTTP requests, HTML parsing, handling dynamic content, authentication, data extraction, storage, error handling, and advanced techniques like proxy rotation and CAPTCHA handling. The document is structured into sections detailing setup, operations, and best practices for effective web scraping.
[ Web Scraping and Data Collection ] ( CheatSheet )
1. Basic Setup and Libraries
● Import requests: import requests
● Import BeautifulSoup: from bs4 import BeautifulSoup
● Import Selenium: from selenium import webdriver
● Import Scrapy: import scrapy
● Import pandas: import pandas as pd
● Import lxml: import lxml
● Import regex: import re
2. HTTP Requests with Requests Library
● GET request: response = requests.get('https://example.com')
● POST request: response = requests.post('https://example.com/submit', data={'key': 'value'})
● Request with headers: response = requests.get('https://example.com', headers={'User-Agent': 'Mozilla/5.0'})
● Request with timeout: response = requests.get('https://example.com', timeout=5)
● Request with cookies: response = requests.get('https://example.com', cookies={'session_id': '123'})
● Request with params: response = requests.get('https://example.com/search', params={'q': 'python'})
● Request with authentication: response = requests.get('https://example.com', auth=('username', 'password'))
● Request with proxy: response = requests.get('https://example.com', proxies={'http': 'http://10.10.1.10:3128'})
● Get status code: status_code = response.status_code
● Get response content: content = response.content
● Get response text: text = response.text
● Get response headers: headers = response.headers
● Get response cookies: cookies = response.cookies
● Get response encoding: encoding = response.encoding
● Get response URL: url = response.url
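A minimal sketch combining several of the options above into one request helper; the URL, header, and query values are illustrative placeholders rather than anything prescribed by the cheat sheet:

    import requests

    def fetch(url):
        try:
            response = requests.get(
                url,
                headers={'User-Agent': 'Mozilla/5.0'},  # identify the client
                params={'q': 'python'},                 # appended as ?q=python
                timeout=5,                              # fail fast on slow servers
            )
            response.raise_for_status()                 # raise on 4xx/5xx responses
            return response.text
        except requests.exceptions.RequestException as exc:
            print(f"Request failed: {exc}")
            return None

    html = fetch('https://example.com/search')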
3. Parsing HTML with BeautifulSoup
● Create BeautifulSoup object: soup = BeautifulSoup(response.content, 'html.parser')
● Find first occurrence of a tag: element = soup.find('div')
● Find all occurrences of a tag: elements = soup.find_all('p')
● Find by ID: element = soup.find(id='my-id')
● Find by class: elements = soup.find_all(class_='my-class')
● Find by attribute: elements = soup.find_all(attrs={'data-test': 'value'})
● Get tag name: tag_name = element.name
● Get tag text: text = element.text
● Get tag contents: contents = element.contents
● Get tag children: children = element.children
● Get tag parent: parent = element.parent
● Get tag siblings: siblings = element.next_siblings
● Get tag attributes: attributes = element.attrs
● Get specific attribute: href = element['href']
● Navigate DOM: element = soup.body.div.p
● Search by CSS selector: elements = soup.select('div.class > p')
● Search by XPath (BeautifulSoup has no XPath support; use lxml): from lxml import html; elements = html.fromstring(response.content).xpath('//div[@class="my-class"]')
● Get all links: links = [a['href'] for a in soup.find_all('a', href=True)]
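A short parsing sketch assuming the HTML has already been fetched; the sample markup and selectors are made up for illustration:

    from bs4 import BeautifulSoup

    html = "<div class='item'><a href='/a'>A</a></div><div class='item'><a href='/b'>B</a></div>"
    soup = BeautifulSoup(html, 'html.parser')

    # CSS selector: every <a> directly inside a div with class "item"
    for link in soup.select('div.item > a'):
        print(link.get_text(strip=True), link.get('href'))

    # .get() avoids a KeyError when an attribute is missing
    first = soup.find('a')
    href = first.get('href', '') if first else ''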
4. Web Scraping with Selenium
● Import By locators (Selenium 4 removed the old find_element_by_* helpers): from selenium.webdriver.common.by import By
● Initialize Firefox WebDriver: driver = webdriver.Firefox()
● Open URL: driver.get('https://example.com')
● Get page source: source = driver.page_source
● Find element by ID: element = driver.find_element(By.ID, 'my-id')
● Find element by class name: element = driver.find_element(By.CLASS_NAME, 'my-class')
● Find element by tag name: element = driver.find_element(By.TAG_NAME, 'div')
● Find element by XPath: element = driver.find_element(By.XPATH, '//div[@class="my-class"]')
● Find element by CSS selector: element = driver.find_element(By.CSS_SELECTOR, 'div.my-class')
● Find multiple elements: elements = driver.find_elements(By.CLASS_NAME, 'my-class')
● Click element: element.click()
● Send keys to element: element.send_keys('text')
● Clear input field: element.clear()
● Get element text: text = element.text
● Get element attribute: attribute = element.get_attribute('class')
● Check if element is displayed: is_displayed = element.is_displayed()
● Check if element is enabled: is_enabled = element.is_enabled()
● Check if element is selected: is_selected = element.is_selected()
● Execute JavaScript: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
● Take screenshot: driver.save_screenshot('screenshot.png')
● Switch to frame: driver.switch_to.frame('frame_name')
● Switch to default content: driver.switch_to.default_content()
● Switch to window: driver.switch_to.window(driver.window_handles[-1])
● Close current window: driver.close()
● Quit WebDriver: driver.quit()
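A hedged end-to-end sketch in the Selenium 4 style used above; the URL and locators are placeholders and it assumes Firefox with geckodriver is installed:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()              # or webdriver.Chrome()
    try:
        driver.get('https://example.com')
        # Type into a (hypothetical) search box and submit the form
        box = driver.find_element(By.NAME, 'q')
        box.clear()
        box.send_keys('web scraping')
        box.submit()
        # Collect the text of all result headings
        titles = [h.text for h in driver.find_elements(By.CSS_SELECTOR, 'h3')]
        print(titles)
    finally:
        driver.quit()                         # always release the browser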
5. Web Scraping with Scrapy
● Create new Scrapy project: scrapy startproject myproject
● Generate new spider: scrapy genspider myspider example.com
● Run spider: scrapy crawl myspider
● Extract data with CSS selector: response.css('div.class::text').extract()
● Extract data with XPath: response.xpath('//div[@class="my-class"]/text()').extract()
● Extract first item: response.css('div.class::text').extract_first()
● Follow link: yield response.follow(next_page, self.parse)
● Store extracted item: yield {'name': name, 'price': price}
● Use item loader: loader = ItemLoader(item=MyItem(), response=response)
● Add value to item loader: loader.add_css('name', 'div.name::text')
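A minimal spider sketch tying these commands together; it assumes a project created with scrapy startproject, and the domain and selectors are placeholders:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['https://example.com/page/1']

        def parse(self, response):
            # One item per listing block on the page
            for row in response.css('div.listing'):
                yield {
                    'name': row.css('span.name::text').get(),
                    'price': row.css('span.price::text').get(),
                }
            # Follow pagination until there is no "next" link
            next_page = response.css('a.next::attr(href)').get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Running scrapy crawl myspider -o items.json dumps the yielded items to a file.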
6. Handling Dynamic Content
● Wait for element (Selenium): WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'my-id')))
● Wait for element to be clickable: WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'my-id')))
● Scroll to element: driver.execute_script("arguments[0].scrollIntoView();", element)
● Scroll to bottom of page: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
● Handle alert: alert = driver.switch_to.alert; alert.accept()
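An explicit-wait sketch with the imports the two wait bullets rely on; 'my-id' and the URL are placeholders:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    driver.get('https://example.com')

    wait = WebDriverWait(driver, 10)          # poll for up to 10 seconds
    # Wait until the element exists in the DOM
    element = wait.until(EC.presence_of_element_located((By.ID, 'my-id')))
    # Wait until it is clickable, then click it
    wait.until(EC.element_to_be_clickable((By.ID, 'my-id'))).click()
    driver.quit()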
7. Authentication
● Basic auth with requests: requests.get('https://example.com', auth=('user', 'pass'))
● Use session for persistent login: session = requests.Session(); session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})
● Handle cookies: cookies = {'session_id': '123'}; requests.get('https://example.com', cookies=cookies)
● Use API key: headers = {'Authorization': 'Bearer YOUR_API_KEY'}; requests.get('https://api.example.com', headers=headers)
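A session-login sketch assuming a simple form that posts username and password to /login; the field names and URLs are placeholders:

    import requests

    with requests.Session() as session:
        # The session stores cookies returned by the login response...
        session.post(
            'https://example.com/login',
            data={'username': 'user', 'password': 'pass'},
            timeout=10,
        )
        # ...so subsequent requests are made as the logged-in user
        profile = session.get('https://example.com/profile', timeout=10)
        print(profile.status_code)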
8. Parsing and Data Extraction
● Parse JSON response: data = response.json()
● Parse XML with lxml: from lxml import etree; root = etree.fromstring(response.content)
● Extract table data with pandas: tables = pd.read_html(response.text)
● Extract data with regex: match = re.search(r'pattern', text)
● Clean text data: clean_text = ' '.join(text.split())
● Remove HTML tags: clean_text = re.sub('<.*?>', '', html_text)
● Parse dates: date = pd.to_datetime('2023-05-20')
● Extract numbers from text: numbers = re.findall(r'\d+', text)
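A small sketch that chains a few of these extraction helpers; the sample HTML string is invented for illustration:

    import re
    import pandas as pd

    html_text = '<p>Price: $1,299 (updated 2023-05-20)</p>'

    # Strip tags with a regex, then normalise whitespace
    clean_text = ' '.join(re.sub('<.*?>', '', html_text).split())

    # Pull out the digits and the date
    numbers = re.findall(r'\d+', clean_text.replace(',', ''))
    date = pd.to_datetime(re.search(r'\d{4}-\d{2}-\d{2}', clean_text).group())

    print(clean_text, numbers, date)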
9. Data Storage and Export
● Save to CSV: df.to_csv('data.csv', index=False)
● Save to Excel: df.to_excel('data.xlsx', index=False)
● Save to JSON: df.to_json('data.json', orient='records')
● Save to SQLite: df.to_sql('table_name', sqlite_connection, if_exists='replace')
● Save to MongoDB: collection.insert_many(df.to_dict('records'))
● Save to pickle: df.to_pickle('data.pkl')
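A sketch of exporting the same scraped records to a few destinations; the records, table, and file names are made-up examples:

    import sqlite3
    import pandas as pd

    records = [
        {'name': 'Widget', 'price': 9.99},
        {'name': 'Gadget', 'price': 24.50},
    ]
    df = pd.DataFrame(records)

    df.to_csv('data.csv', index=False)            # flat file
    df.to_json('data.json', orient='records')     # list of JSON objects

    with sqlite3.connect('data.db') as conn:      # local SQLite database
        df.to_sql('products', conn, if_exists='replace', index=False)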
10. Rate Limiting and Politeness
● Add delay between requests: time.sleep(1)
● Use random delay: time.sleep(random.uniform(1, 3))
11. Error Handling
● Handle request exceptions: try: response = requests.get(url) except requests.exceptions.RequestException as e: print(f"An error occurred: {e}")
● Retry with exponential backoff: @retry(wait=wait_exponential(multiplier=1, max=60), stop=stop_after_attempt(5))
● Handle specific HTTP status codes: if response.status_code == 404: print("Page not found")
● Log errors: logging.error(f"Failed to scrape {url}: {e}")
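A sketch of defensive fetching with logging and simple exponential backoff, written without the tenacity decorator for clarity; the retry count and wait times are arbitrary:

    import logging
    import time
    import requests

    logging.basicConfig(level=logging.INFO)

    def fetch_with_retries(url, attempts=5):
        for attempt in range(attempts):
            try:
                response = requests.get(url, timeout=10)
                if response.status_code == 404:
                    logging.error("Page not found: %s", url)
                    return None
                response.raise_for_status()
                return response.text
            except requests.exceptions.RequestException as exc:
                wait = 2 ** attempt                        # exponential backoff
                logging.warning("Attempt %d failed (%s); retrying in %ss",
                                attempt + 1, exc, wait)
                time.sleep(wait)
        logging.error("Failed to scrape %s after %d attempts", url, attempts)
        return None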
12. Parallel and Asynchronous Scraping
● Use multithreading: with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor: results = list(executor.map(scrape_url, urls))
● Use multiprocessing: with concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor: results = list(executor.map(scrape_url, urls))
● Use asyncio: asyncio.run(scrape_urls(urls))
● Use aiohttp for async requests: async with aiohttp.ClientSession() as session: async with session.get(url) as response: html = await response.text()
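A thread-pool sketch for I/O-bound scraping; scrape_url is just a stand-in fetch function and the URLs are placeholders:

    import concurrent.futures
    import requests

    def scrape_url(url):
        response = requests.get(url, timeout=10)
        return url, response.status_code

    urls = [f'https://example.com/page/{i}' for i in range(1, 6)]

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_url, urls))

    for url, status in results:
        print(status, url)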
13. Advanced Techniques
● Use proxy rotation: proxies = ['http://proxy1:8080', ...]; pick a different proxy per request, e.g. requests.get(url, proxies={'http': random.choice(proxies)})
● Solve image CAPTCHAs (anticaptcha): task = ImageToTextTask(captcha_file); job = client.createTask(task); job.join(); print(job.get_solution_response())
● Use Tor for anonymity: from stem import Signal; from stem.control import Controller; with Controller.from_port(port=9051) as controller: controller.authenticate(); controller.signal(Signal.NEWNYM)
● Implement IP rotation: requests.get(url, proxies={'http': f'socks5://127.0.0.1:{random.randint(9000, 9100)}'})
● Handle JavaScript rendering: driver.execute_script("return document.documentElement.outerHTML")
● Extract data from PDF: import PyPDF2; pdf = PyPDF2.PdfReader(open('file.pdf', 'rb')); text = pdf.pages[0].extract_text()
● Handle infinite scroll (Selenium): scroll to the bottom, wait, and compare page heights in a loop until no new content loads (see the sketch after this list)
● Extract data from images: from PIL import Image; import pytesseract; text = pytesseract.image_to_string(Image.open('image.png'))
● Solve reCAPTCHA using 2captcha: solver = TwoCaptcha('YOUR_API_KEY'); result = solver.recaptcha(sitekey='SITE_KEY', url='https://example.com')
● Bypass IP-based restrictions: response = requests.get(url, proxies={'http': 'http://username:password@proxyhost:8080'})
● Mimic human behavior with random delays: time.sleep(random.uniform(1, 3))
● Rotate user agents: headers = {'User-Agent': random.choice(user_agents)}
● Handle browser fingerprinting: options.add_argument('--disable-blink-features=AutomationControlled')
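The infinite-scroll bullet above written out as a loop; this is a sketch that assumes a Selenium driver and a page that loads more content as you scroll:

    import time
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get('https://example.com/feed')

    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)                              # give new content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:              # nothing new appeared, stop
            break
        last_height = new_height

    html = driver.page_source
    driver.quit()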
19. Asynchronous Scraping
● Use aiohttp for async requests: async with aiohttp.ClientSession() as session: async with session.get(url) as response: html = await response.text()
● Parse HTML asynchronously: soup = BeautifulSoup(html, 'html.parser')
● Use asyncio to run multiple coroutines: asyncio.gather(*[fetch(url) for url in urls])
● Implement rate limiting in async code (e.g. with a semaphore): sem = asyncio.Semaphore(10); async with sem: await session.get(url)
● Use aiofiles for async file I/O: async with aiofiles.open('output.txt', mode='w') as f: await f.write(data)
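A concurrency-capped async sketch using asyncio.Semaphore (a standard-library alternative to a dedicated rate limiter); the URLs are placeholders:

    import asyncio
    import aiohttp

    async def fetch(session, semaphore, url):
        async with semaphore:                      # at most 10 requests in flight
            async with session.get(url) as response:
                return url, await response.text()

    async def scrape_urls(urls):
        semaphore = asyncio.Semaphore(10)
        async with aiohttp.ClientSession() as session:
            tasks = [fetch(session, semaphore, url) for url in urls]
            return await asyncio.gather(*tasks)

    urls = [f'https://example.com/page/{i}' for i in range(1, 6)]
    results = asyncio.run(scrape_urls(urls))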
20. Data Extraction from Various Formats
● Extract data from XML: tree = ET.parse('file.xml'); root = tree.getroot()
● Parse RSS feed: feed = feedparser.parse('http://example.com/rss')
● Extract data from CSV: with open('file.csv', 'r') as f: reader = csv.DictReader(f)
● Read Excel file: df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
● Extract text from PDF: text = textract.process('file.pdf').decode('utf-8')
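A sketch of reading two of these formats; the file name and feed URL are placeholders, and the RSS fields assume a typical feed that exposes title and link:

    import csv
    import feedparser

    # CSV rows become dictionaries keyed by the header line
    with open('file.csv', newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))

    # feedparser returns entries with attribute-style access
    feed = feedparser.parse('http://example.com/rss')
    items = [(entry.title, entry.link) for entry in feed.entries]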
21. Scraped Data Validation and Cleaning
● Remove HTML tags: clean_text = BeautifulSoup(html, 'html.parser').get_text()
22. Database Storage
● Save to SQLite database: conn = sqlite3.connect('data.db'); df.to_sql('table_name', conn, if_exists='replace')
● Insert into MySQL database: engine = create_engine('mysql://user:password@localhost/dbname'); df.to_sql('table_name', engine, if_exists='append')
● Save to MongoDB: client = pymongo.MongoClient('mongodb://localhost:27017/'); db = client['dbname']; db['collection'].insert_many(df.to_dict('records'))
● Write to Elasticsearch: es = Elasticsearch(); es.index(index='my_index', body=document)
● Save to Amazon S3: s3 = boto3.client('s3'); s3.put_object(Bucket='my-bucket', Key='data.csv', Body=csv_string)
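A sketch of pushing scraped rows into MongoDB; the connection string, database, and collection names are placeholders:

    import pandas as pd
    import pymongo

    df = pd.DataFrame([{'name': 'Widget', 'price': 9.99}])

    client = pymongo.MongoClient('mongodb://localhost:27017/')
    collection = client['scraping']['products']
    collection.insert_many(df.to_dict('records'))  # one document per DataFrame row
    client.close()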
23. Monitoring and Logging
● Set up basic logging: logging.basicConfig(level=logging.INFO)
24. Performance Optimization
● Use multiprocessing for CPU-bound tasks: with Pool(4) as p: results = p.map(scrape_func, urls)
● Use multithreading for I/O-bound tasks: with ThreadPoolExecutor(max_workers=10) as executor: futures = [executor.submit(scrape_url, url) for url in urls]
● Implement caching: @functools.lru_cache(maxsize=100)
● Use a job queue: q.enqueue(scrape_func, url)
● Optimize database queries: session.query(Model).filter(Model.attr == value).options(joinedload(Model.relation))
25. Ethical Scraping Practices
● Check robots.txt before crawling: robotparser = urllib.robotparser.RobotFileParser(); robotparser.set_url('http://example.com/robots.txt'); robotparser.read(); allowed = robotparser.can_fetch('*', url)
● Set user agent to identify your bot: headers = {'User-Agent': 'MyBot/1.0 (+http://example.com/bot)'}
● Implement politeness delay: time.sleep(random.uniform(1, 3))
● Respect 'nofollow' links: if 'rel' in link.attrs and 'nofollow' in link['rel']: continue
● Handle terms of service compliance: if not check_tos_compliance(url): raise Exception("TOS violation")
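A robots.txt check sketch tying the bullets above together; the bot name and URLs are placeholders:

    import urllib.robotparser
    import requests

    robotparser = urllib.robotparser.RobotFileParser()
    robotparser.set_url('http://example.com/robots.txt')
    robotparser.read()

    url = 'http://example.com/some/page'
    headers = {'User-Agent': 'MyBot/1.0 (+http://example.com/bot)'}

    if robotparser.can_fetch('MyBot', url):        # check the rules for this bot
        response = requests.get(url, headers=headers, timeout=10)
    else:
        print('Disallowed by robots.txt, skipping:', url)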