
[ Web Scraping and Data Collection ] ( CheatSheet )

1. Basic Setup and Libraries

● Import requests: import requests
● Import BeautifulSoup: from bs4 import BeautifulSoup
● Import Selenium: from selenium import webdriver
● Import Scrapy: import scrapy
● Import pandas: import pandas as pd
● Import lxml etree: from lxml import etree
● Import regex: import re

2. HTTP Requests with Requests Library

● GET request: response = requests.get('https://example.com')
● POST request: response = requests.post('https://example.com/submit', data={'key': 'value'})
● Request with headers: response = requests.get('https://example.com', headers={'User-Agent': 'Mozilla/5.0'})
● Request with timeout: response = requests.get('https://example.com', timeout=5)
● Request with cookies: response = requests.get('https://example.com', cookies={'session_id': '123'})
● Request with params: response = requests.get('https://example.com/search', params={'q': 'python'})
● Request with authentication: response = requests.get('https://example.com', auth=('username', 'password'))
● Request with proxy: response = requests.get('https://example.com', proxies={'http': 'http://10.10.1.10:3128'})
● Get status code: status_code = response.status_code
● Get response content: content = response.content
● Get response text: text = response.text
● Get response headers: headers = response.headers
● Get response cookies: cookies = response.cookies
● Get response encoding: encoding = response.encoding
● Get response URL: url = response.url
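
Example - a minimal sketch tying the calls above into one polite GET; the URL, header value, and fetch() helper name are placeholders:

    import requests

    def fetch(url):
        # Send a GET with a browser-like User-Agent and a timeout,
        # then raise on 4xx/5xx so callers can handle failures explicitly.
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=5)
        response.raise_for_status()
        return response.text

    html = fetch('https://example.com')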

3. Parsing HTML with BeautifulSoup

● Create BeautifulSoup object: soup = BeautifulSoup(response.content,
'html.parser')
● Find first occurrence of a tag: element = soup.find('div')
● Find all occurrences of a tag: elements = soup.find_all('p')
● Find by ID: element = soup.find(id='my-id')
● Find by class: elements = soup.find_all(class_='my-class')
● Find by attribute: elements = soup.find_all(attrs={'data-test': 'value'})
● Get tag name: tag_name = element.name
● Get tag text: text = element.text
● Get tag contents: contents = element.contents
● Get tag children: children = element.children
● Get tag parent: parent = element.parent
● Get tag siblings: siblings = element.next_siblings
● Get tag attributes: attributes = element.attrs
● Get specific attribute: href = element['href']
● Navigate DOM: element = soup.body.div.p
● Search by CSS selector: elements = soup.select('div.class > p')
● Search by XPath (BeautifulSoup has no XPath support; use lxml): from lxml import html; tree = html.fromstring(response.content); elements = tree.xpath('//div[@class="my-class"]')
● Get all links: links = [a['href'] for a in soup.find_all('a', href=True)]
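
Example - a sketch combining the lookups above; it assumes the page at the placeholder URL has <a> tags and <div class="my-class"> blocks:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get('https://example.com', timeout=5)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Collect every link on the page
    links = [a['href'] for a in soup.find_all('a', href=True)]

    # Grab the text of each div with a given (placeholder) class
    texts = [div.get_text(strip=True) for div in soup.find_all('div', class_='my-class')]
    print(len(links), texts[:3])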

4. Web Scraping with Selenium

● Initialize Chrome WebDriver: driver = webdriver.Chrome()
● Initialize Firefox WebDriver: driver = webdriver.Firefox()
● Open URL: driver.get('https://example.com')
● Get page source: source = driver.page_source
● Import locator strategies (Selenium 4): from selenium.webdriver.common.by import By
● Find element by ID: element = driver.find_element(By.ID, 'my-id')
● Find element by class name: element = driver.find_element(By.CLASS_NAME, 'my-class')
● Find element by tag name: element = driver.find_element(By.TAG_NAME, 'div')
● Find element by XPath: element = driver.find_element(By.XPATH, '//div[@class="my-class"]')
● Find element by CSS selector: element = driver.find_element(By.CSS_SELECTOR, 'div.my-class')
● Find multiple elements: elements = driver.find_elements(By.CLASS_NAME, 'my-class')
● Click element: element.click()
● Send keys to element: element.send_keys('text')
● Clear input field: element.clear()
● Get element text: text = element.text
● Get element attribute: attribute = element.get_attribute('class')
● Check if element is displayed: is_displayed = element.is_displayed()
● Check if element is enabled: is_enabled = element.is_enabled()
● Check if element is selected: is_selected = element.is_selected()
● Execute JavaScript: driver.execute_script("window.scrollTo(0,
document.body.scrollHeight);")
● Take screenshot: driver.save_screenshot('screenshot.png')
● Switch to frame: driver.switch_to.frame('frame_name')
● Switch to default content: driver.switch_to.default_content()
● Switch to window: driver.switch_to.window(driver.window_handles[-1])
● Close current window: driver.close()
● Quit WebDriver: driver.quit()
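
Example - a short Selenium 4 sketch tying the basics together; the URL, element IDs, and selectors are placeholders, and Selenium Manager is assumed to resolve the ChromeDriver binary:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get('https://example.com')
        # Fill a hypothetical search box and submit the form
        box = driver.find_element(By.ID, 'search')
        box.clear()
        box.send_keys('python')
        driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()
        print(driver.title)
    finally:
        driver.quit()  # always release the browser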

5. Web Scraping with Scrapy

● Create new Scrapy project: scrapy startproject myproject
● Generate new spider: scrapy genspider myspider example.com
● Run spider: scrapy crawl myspider
● Extract data with CSS selector: response.css('div.class::text').extract()
● Extract data with XPath:
response.xpath('//div[@class="my-class"]/text()').extract()
● Extract first item: response.css('div.class::text').extract_first()
● Follow link: yield response.follow(next_page, self.parse)
● Store extracted item: yield {'name': name, 'price': price}
● Use item loader: loader = ItemLoader(item=MyItem(), response=response)
● Add value to item loader: loader.add_css('name', 'div.name::text')
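
Example - a minimal spider showing how the pieces above fit together; the start URL and CSS selectors are placeholders, run with scrapy crawl myspider -o items.json:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
        start_urls = ['https://example.com']

        def parse(self, response):
            # Yield one item per listing block (placeholder selectors)
            for product in response.css('div.product'):
                yield {
                    'name': product.css('h2::text').get(),
                    'price': product.css('span.price::text').get(),
                }
            # Follow pagination if a "next" link exists
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, self.parse)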

6. Handling Dynamic Content

● Wait for element (Selenium): WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'my-id')))
● Wait for element to be clickable: WebDriverWait(driver,
10).until(EC.element_to_be_clickable((By.ID, 'my-id')))
● Scroll to element:
driver.execute_script("arguments[0].scrollIntoView();", element)
● Scroll to bottom of page: driver.execute_script("window.scrollTo(0,
document.body.scrollHeight);")
● Handle alert: alert = driver.switch_to.alert; alert.accept()
● Handle infinite scroll: last_height = driver.execute_script("return
document.body.scrollHeight")
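
Example - the explicit-wait plus infinite-scroll pattern from the items above; it assumes a driver is already open on a page that loads more content as you scroll, and the 'content' ID is a placeholder:

    import time
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Wait up to 10 s for the container to appear before touching it
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'content')))

    # Keep scrolling until the page height stops growing
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height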

7. Handling Authentication

● Basic auth with requests: requests.get('https://example.com', auth=('user', 'pass'))
● Use session for persistent login: session = requests.Session(); session.post('https://example.com/login', data={'username': 'user', 'password': 'pass'})
● Handle cookies: cookies = {'session_id': '123'}; requests.get('https://example.com', cookies=cookies)
● Use API key: headers = {'Authorization': 'Bearer YOUR_API_KEY'}; requests.get('https://api.example.com', headers=headers)
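
Example - a sketch of the session-based login flow above; the form field names and URLs are placeholders for whatever the target site actually uses:

    import requests

    session = requests.Session()

    # Log in once; the session keeps the returned cookies for later requests
    login = session.post('https://example.com/login',
                         data={'username': 'user', 'password': 'pass'},
                         timeout=5)
    login.raise_for_status()

    # Subsequent requests reuse the authenticated session automatically
    profile = session.get('https://example.com/account', timeout=5)
    print(profile.status_code)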

8. Parsing and Data Extraction

● Parse JSON response: data = response.json()
● Parse XML with lxml: root = etree.fromstring(response.content)
● Extract table data with pandas: tables = pd.read_html(response.text)
● Extract data with regex: match = re.search(r'pattern', text)
● Clean text data: clean_text = ' '.join(text.split())
● Remove HTML tags: clean_text = re.sub('<.*?>', '', html_text)
● Parse dates: date = pd.to_datetime('2023-05-20')
● Extract numbers from text: numbers = re.findall(r'\d+', text)

9. Data Storage and Export

● Save to CSV: df.to_csv('data.csv', index=False)
● Save to Excel: df.to_excel('data.xlsx', index=False)
● Save to JSON: df.to_json('data.json', orient='records')
● Save to SQLite: df.to_sql('table_name', sqlite_connection,
if_exists='replace')
● Save to MongoDB: collection.insert_many(df.to_dict('records'))
● Save to pickle: df.to_pickle('data.pkl')

10. Rate Limiting and Politeness

● Add delay between requests: time.sleep(1)
● Use random delay: time.sleep(random.uniform(1, 3))
● Respect robots.txt: from urllib.robotparser import RobotFileParser; rp = RobotFileParser(); rp.set_url('https://example.com/robots.txt'); rp.read(); can_fetch = rp.can_fetch('*', 'https://example.com/page')
● Implement exponential backoff: time.sleep(2 ** retry_count +
random.random())
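
Example - one way to combine the robots.txt check, random delay, and exponential backoff above into a single polite fetch helper; the polite_get() name and URLs are illustrative:

    import random
    import time
    import requests
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('https://example.com/robots.txt')
    rp.read()

    def polite_get(url, max_retries=3):
        # Skip disallowed URLs, pause between requests, back off on failures
        if not rp.can_fetch('*', url):
            return None
        for retry_count in range(max_retries):
            time.sleep(random.uniform(1, 3))  # politeness delay
            try:
                return requests.get(url, timeout=10)
            except requests.exceptions.RequestException:
                time.sleep(2 ** retry_count + random.random())  # backoff
        return None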

11. Error Handling and Retrying

● Try-except block: try: response = requests.get(url) except requests.exceptions.RequestException as e: print(f"An error occurred: {e}")
● Retry with exponential backoff:
@retry(wait=wait_exponential(multiplier=1, max=60),
stop=stop_after_attempt(5))
● Handle specific HTTP status codes: if response.status_code == 404:
print("Page not found")
● Log errors: logging.error(f"Failed to scrape {url}: {e}")
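
Example - the try/except and logging pieces above written out as a block; the retry decorator in this section is assumed to come from the tenacity package:

    import logging
    import requests
    from tenacity import retry, wait_exponential, stop_after_attempt

    @retry(wait=wait_exponential(multiplier=1, max=60), stop=stop_after_attempt(5))
    def scrape(url):
        # Raise on network errors or 4xx/5xx so tenacity retries with backoff
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text

    url = 'https://example.com'
    try:
        html = scrape(url)
    except Exception as e:
        logging.error(f"Failed to scrape {url}: {e}")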

12. Parallel and Asynchronous Scraping

● Use multithreading: with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor: results = list(executor.map(scrape_url, urls))
● Use multiprocessing: with
concurrent.futures.ProcessPoolExecutor(max_workers=5) as executor:
results = list(executor.map(scrape_url, urls))
● Use asyncio:
asyncio.get_event_loop().run_until_complete(scrape_urls(urls))
● Use aiohttp for async requests: async with aiohttp.ClientSession() as
session: async with session.get(url) as response: html = await
response.text()
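
Example - the threaded pattern above for I/O-bound scraping; scrape_url is the placeholder worker used throughout this section:

    import concurrent.futures
    import requests

    def scrape_url(url):
        # Placeholder worker: return (url, status code) for demonstration
        response = requests.get(url, timeout=10)
        return url, response.status_code

    urls = ['https://example.com/page1', 'https://example.com/page2']

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_url, urls))

    print(results)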

13. Advanced Techniques

● Use proxy rotation: proxies = ['http://proxy1:8080', 'http://proxy2:8080']; requests.get(url, proxies={'http': random.choice(proxies)})
● Implement user-agent rotation: user_agents = ['Mozilla/5.0 ...',
'Chrome/91.0 ...']; headers = {'User-Agent': random.choice(user_agents)}
● Handle CAPTCHA: from python_anticaptcha import AnticaptchaClient, ImageToTextTask; client = AnticaptchaClient('your-api-key'); task = ImageToTextTask(captcha_file); job = client.createTask(task); job.join(); print(job.get_solution_response())
● Use Tor for anonymity: from stem import Signal; from stem.control import
Controller; with Controller.from_port(port=9051) as controller:
controller.authenticate(); controller.signal(Signal.NEWNYM)
● Implement IP rotation: requests.get(url, proxies={'http':
f'socks5://127.0.0.1:{random.randint(9000, 9100)}'})
● Handle JavaScript rendering: driver.execute_script("return
document.documentElement.outerHTML")
● Extract data from PDF: import PyPDF2; pdf =
PyPDF2.PdfReader(open('file.pdf', 'rb')); text =
pdf.pages[0].extract_text()
● Handle infinite scroll (Selenium): while True:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);");
time.sleep(2); new_height = driver.execute_script("return
document.body.scrollHeight"); if new_height == last_height: break;
last_height = new_height
● Extract data from images: from PIL import Image; import pytesseract; text
= pytesseract.image_to_string(Image.open('image.png'))

14. Data Validation and Cleaning

● Remove duplicates: df.drop_duplicates(subset=['column'], keep='first', inplace=True)
● Handle missing values: df.fillna(value={'column': 0}, inplace=True)
● Convert data types: df['column'] = df['column'].astype(int)
● Normalize text data: df['text'] = df['text'].str.lower().str.strip()
● Remove special characters: df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
● Validate email addresses: df['valid_email'] =
df['email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')
● Validate URLs: df['valid_url'] =
df['url'].str.match(r'^https?:\/\/(?:www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}
\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&\/=]*)$')
● Validate phone numbers: df['valid_phone'] =
df['phone'].str.match(r'^\+?1?\d{9,15}$')
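
Example - a short pandas cleaning pass stringing the steps above together; the column names and sample rows are placeholders:

    import pandas as pd

    df = pd.DataFrame({'email': ['a@example.com', 'bad-email', 'a@example.com'],
                       'text': ['  Hello, World! ', 'FOO  bar', None]})

    df = df.drop_duplicates(subset=['email'], keep='first')
    df['text'] = df['text'].fillna('')
    df['text'] = df['text'].str.lower().str.strip()
    df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
    df['valid_email'] = df['email'].str.match(r'^[\w\.-]+@[\w\.-]+\.\w+$')
    print(df)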

15. Web APIs and JSON Parsing

● Make API request: response = requests.get('https://api.example.com/data', params={'key': 'value'})
● Parse JSON response: data = response.json()
● Extract nested JSON data: value = data['key1']['key2'][0]['key3']
● Flatten nested JSON: df = pd.json_normalize(data)
● Handle paginated API: while url: response = requests.get(url);
data.extend(response.json()['results']); url =
response.json().get('next')
● Use API authentication: headers = {'Authorization': f'Bearer {token}'};
response = requests.get(url, headers=headers)
● Handle rate limiting: if response.status_code == 429: retry_after =
int(response.headers['Retry-After']); time.sleep(retry_after)
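
Example - the pagination and rate-limit handling above combined into one loop; the endpoint and the 'results'/'next' response keys follow a common API convention and are assumptions about the target API:

    import time
    import requests

    url = 'https://api.example.com/data'       # placeholder endpoint
    headers = {'Authorization': 'Bearer YOUR_API_KEY'}
    records = []

    while url:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 429:        # rate limited: wait, retry same page
            time.sleep(int(response.headers.get('Retry-After', 5)))
            continue
        response.raise_for_status()
        payload = response.json()
        records.extend(payload['results'])     # assumed response shape
        url = payload.get('next')              # assumed link to the next page

    print(len(records))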

16. Scrapy-specific Operations

● Define spider: class MySpider(scrapy.Spider): name = 'myspider'; start_urls = ['https://example.com']
● Parse response: def parse(self, response): yield {'title':
response.css('h1::text').get()}
● Follow links: yield response.follow(next_page, self.parse)
● Use item pipeline: class MyPipeline: def process_item(self, item,
spider): return item
● Use middleware: class MyMiddleware: def process_request(self, request, spider): request.meta['proxy'] = 'http://proxy.com:8080'
● Handle cookies: request.cookies['sessionid'] = '1234567890abcdef'
● Set download delay: custom_settings = {'DOWNLOAD_DELAY': 1}
● Limit crawl speed: custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN':
2}
● Implement crawl rules: rules = (Rule(LinkExtractor(allow=r'/category/'),
callback='parse_item', follow=True),)
● Use FormRequest for form submission: yield
scrapy.FormRequest.from_response(response, formdata={'username': 'john',
'password': 'secret'}, callback=self.after_login)
● Extract data with Scrapy's ItemLoader: loader =
ItemLoader(item=Product(), response=response); loader.add_css('name',
'h1::text'); yield loader.load_item()
● Use Scrapy shell: scrapy shell 'http://example.com'
● Save scraped items to CSV: scrapy crawl myspider -o output.csv
● Use Scrapy contracts for testing: add annotations such as @url http://example.com and @returns items 1 to the parse method's docstring, then run scrapy check myspider
● Handle JavaScript with Scrapy-Splash: yield SplashRequest(url,
self.parse, args={'wait': 0.5})

17. Advanced Selenium Techniques

● Use Selenium with headless browser: options = webdriver.ChromeOptions(); options.add_argument('--headless'); driver = webdriver.Chrome(options=options)
● Handle dynamic content loading: WebDriverWait(driver,
10).until(EC.presence_of_element_located((By.ID, 'content')))
● Interact with dropdown: Select(driver.find_element(By.ID, 'dropdown')).select_by_visible_text('Option')
● Handle multiple windows:
driver.switch_to.window(driver.window_handles[-1])
● Perform drag and drop: ActionChains(driver).drag_and_drop(source,
target).perform()
● Upload file: driver.find_element(By.ID, 'file').send_keys('/path/to/file')
● Execute custom JavaScript:
driver.execute_script("arguments[0].scrollIntoView();", element)
● Handle iframes: driver.switch_to.frame('iframe_name')
● Set browser capabilities: options = webdriver.ChromeOptions(); options.set_capability('goog:loggingPrefs', {'browser': 'ALL'}); driver = webdriver.Chrome(options=options)
● Extract console logs: logs = driver.get_log('browser')

18. Handling CAPTCHAs and Anti-Bot Measures

● Solve reCAPTCHA using 2captcha: solver = TwoCaptcha('YOUR_API_KEY'); result = solver.recaptcha(sitekey='SITE_KEY', url='https://example.com')
● Bypass IP-based restrictions: response = requests.get(url, proxies={'http': 'http://username:password@proxyhost:8080'})
● Mimic human behavior with random delays: time.sleep(random.uniform(1, 3))
● Rotate user agents: headers = {'User-Agent': random.choice(user_agents)}
● Handle browser fingerprinting:
options.add_argument('--disable-blink-features=AutomationControlled')

19. Asynchronous Scraping

● Use aiohttp for async requests: async with aiohttp.ClientSession() as session: async with session.get(url) as response: html = await response.text()
● Parse HTML asynchronously: soup = BeautifulSoup(html, 'html.parser')
● Use asyncio to run multiple coroutines: asyncio.gather(*[fetch(url) for
url in urls])
● Implement rate limiting in async code: from aiolimiter import AsyncLimiter; limiter = AsyncLimiter(10, 1); async with limiter: await session.get(url)
● Use aiofiles for async file I/O: async with aiofiles.open('output.txt',
mode='w') as f: await f.write(data)
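
Example - an async sketch that fetches several placeholder URLs concurrently with aiohttp and parses them with BeautifulSoup; the semaphore caps concurrency rather than enforcing a strict requests-per-second rate:

    import asyncio
    import aiohttp
    from bs4 import BeautifulSoup

    async def fetch_title(session, sem, url):
        async with sem:                          # cap concurrent requests
            async with session.get(url) as response:
                html = await response.text()
        title = BeautifulSoup(html, 'html.parser').title
        return url, title.get_text(strip=True) if title else None

    async def main(urls):
        sem = asyncio.Semaphore(10)
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(*[fetch_title(session, sem, u) for u in urls])

    results = asyncio.run(main(['https://example.com', 'https://example.org']))
    print(results)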

20. Data Extraction from Various Formats

● Extract data from XML: tree = ET.parse('file.xml'); root = tree.getroot()
● Parse RSS feed: feed = feedparser.parse('http://example.com/rss')
● Extract data from CSV: with open('file.csv', 'r') as f: reader =
csv.DictReader(f)
● Read Excel file: df = pd.read_excel('file.xlsx', sheet_name='Sheet1')
● Extract text from PDF: text =
textract.process('file.pdf').decode('utf-8')

21. Scraped Data Validation and Cleaning

● Remove HTML tags: clean_text = BeautifulSoup(html, 'html.parser').get_text()
● Normalize whitespace: normalized_text = ' '.join(text.split())
● Remove non-ASCII characters: ascii_text = text.encode('ascii',
'ignore').decode('ascii')
● Validate email addresses: is_valid =
re.match(r'^[\w\.-]+@[\w\.-]+\.\w+$', email) is not None
● Validate dates: valid_date = datetime.strptime(date_string, '%Y-%m-%d')

22. Data Storage and Database Integration

● Save to SQLite database: conn = sqlite3.connect('data.db'); df.to_sql('table_name', conn, if_exists='replace')
● Insert into MySQL database: engine =
create_engine('mysql://user:password@localhost/dbname');
df.to_sql('table_name', engine, if_exists='append')
● Save to MongoDB: client =
pymongo.MongoClient('mongodb://localhost:27017/'); db = client['dbname'];
db['collection'].insert_many(df.to_dict('records'))
● Write to Elasticsearch: es = Elasticsearch(); es.index(index='my_index',
body=document)
● Save to Amazon S3: s3 = boto3.client('s3');
s3.put_object(Bucket='my-bucket', Key='data.csv', Body=csv_string)
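
Example - a minimal end-to-end save of a scraped DataFrame to SQLite, the simplest of the targets above; the table, file name, and sample row are placeholders:

    import sqlite3
    import pandas as pd

    df = pd.DataFrame([{'name': 'Widget', 'price': 9.99}])  # stand-in for scraped rows

    conn = sqlite3.connect('data.db')
    df.to_sql('products', conn, if_exists='replace', index=False)

    # Read it back to confirm the write
    print(pd.read_sql('SELECT * FROM products', conn))
    conn.close()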

23. Monitoring and Logging

● Set up basic logging: logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
● Log to file: handler = logging.FileHandler('scraper.log');
logger.addHandler(handler)
● Use rotating file handler: handler = RotatingFileHandler('scraper.log',
maxBytes=10000, backupCount=5)
● Send email alerts: send_mail(subject='Scraping Error', message='Error occurred during scraping', from_email='scraper@example.com', recipient_list=['admin@example.com'])
● Integrate with monitoring tools: statsd.increment('pages_scraped')
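
Example - a logging setup sketch using only the standard-library pieces named above; the logger name, file name, and size limits are placeholders:

    import logging
    from logging.handlers import RotatingFileHandler

    logger = logging.getLogger('scraper')
    logger.setLevel(logging.INFO)

    handler = RotatingFileHandler('scraper.log', maxBytes=10000, backupCount=5)
    handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
    logger.addHandler(handler)

    logger.info('Scrape started')
    try:
        raise ValueError('demo failure')   # stand-in for a scraping error
    except ValueError as e:
        logger.error(f"Failed to scrape page: {e}")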

24. Performance Optimization

● Use multiprocessing for CPU-bound tasks: with Pool(4) as p: results = p.map(scrape_func, urls)
● Use multithreading for I/O-bound tasks: with
ThreadPoolExecutor(max_workers=10) as executor: futures =
[executor.submit(scrape_url, url) for url in urls]
● Implement caching: @functools.lru_cache(maxsize=100)
● Use a job queue: q.enqueue(scrape_func, url)
● Optimize database queries: session.query(Model).filter(Model.attr ==
value).options(joinedload(Model.relation))

25. Legal and Ethical Considerations

● Check robots.txt: robotparser = RobotFileParser(); robotparser.set_url('http://example.com/robots.txt'); robotparser.read(); allowed = robotparser.can_fetch('*', url)
● Set user agent to identify your bot: headers = {'User-Agent': 'MyBot/1.0 (+http://example.com/bot)'}
● Implement politeness delay: time.sleep(random.uniform(1, 3))
● Respect 'nofollow' links: if 'rel' in link.attrs and 'nofollow' in
link['rel']: continue
● Handle terms of service compliance: if not check_tos_compliance(url):
raise Exception("TOS violation")

By: Waleed Mousa
