CITL Exp 8
1. Web Crawlers/Spiders:
• Definition: Web crawlers, also known as spiders or bots, are
automated programs that browse the internet systematically to
index content and gather information.
• Functionality: They navigate through web pages by following
hyperlinks and retrieving content, which can be stored and
processed for various applications such as search engines, data
mining, and analytics (a minimal crawler sketch follows this list).
• Use Cases: Common uses include indexing for search engines
(like Google), gathering data for research, monitoring changes in
websites, and scraping data for analysis.
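To make the crawling loop concrete, here is a minimal sketch in
Python; the start URL, page limit, and the choice of requests and
BeautifulSoup are illustrative assumptions, not part of the experiment code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from collections import deque

def crawl(start_url, max_pages=10):
    # Breadth-first crawl: fetch a page, record it, queue its links
    visited = set()
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        visited.add(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        # "Indexing" step: a real crawler would store/process the content;
        # here we just print the page title
        print(url, "->", soup.title.string if soup.title else "(no title)")
        # Follow hyperlinks to discover further pages
        for a in soup.find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))

# e.g. crawl("https://example.com", max_pages=5)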
2. Web Scraping Techniques: Data Extraction
• Web scraping involves extracting data from web pages. This can be
achieved using various techniques, with two common methods being
XPATH and CSS selectors.
XPATH:
• Definition: XPATH is a query language used to select nodes from
an XML document. It can also be used to navigate HTML
documents.
• Syntax: XPATH uses a path-like syntax to specify the location of
elements in a document. For example, //div[@class='example']
selects all <div> elements with a class of "example".
• Advantages: XPATH is powerful for complex queries and allows
for precise element selection, including attributes and text content.
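For instance, the query above can be run in Python with the lxml
library (a common XPATH implementation; the sample HTML here is illustrative):

from lxml import html

doc = html.fromstring(
    '<div><div class="example">first</div>'
    '<div class="other">skipped</div>'
    '<div class="example">second</div></div>')

# //div[@class='example'] matches every <div> whose class is exactly "example"
for node in doc.xpath("//div[@class='example']"):
    print(node.text)  # prints "first", then "second"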
CSS Selectors:
• Definition: CSS selectors are used to select elements in HTML
based on their attributes, types, classes, and IDs.
• Syntax: For example, .example selects all elements with the class
"example", and #uniqueID selects the element with the ID
"uniqueID".
Code:
• euler.py:
import requests                  # fetch web pages over HTTP
from bs4 import BeautifulSoup    # parse the returned HTML
import sqlite3                   # store scraped data locally
import matplotlib.pyplot as plt  # plot the scraped data
import os                        # filesystem checks
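Only the imports of euler.py are reproduced above. A minimal sketch
of what the rest might look like, consistent with those imports; the
URL, database name, and table schema are assumptions (the row
structure follows the CSS Selector example at the end of this section):

DB_PATH = "euler.db"  # assumed database file name

def scrape_and_store():
    # Fetch the Project Euler problem archive and parse it
    resp = requests.get("https://projecteuler.net/archives")
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS problems "
                 "(id INTEGER PRIMARY KEY, title TEXT, solved_by INTEGER)")
    for row in soup.select("tr"):
        id_cell = row.select_one("td.id_column")
        link = row.select_one("a")
        count = row.select_one("div.center")
        if id_cell and link and count:  # skip header/malformed rows
            conn.execute("INSERT OR REPLACE INTO problems VALUES (?, ?, ?)",
                         (int(id_cell.get_text()), link.get_text(),
                          int(count.get_text())))
    conn.commit()
    conn.close()

def plot_solved_counts():
    # Plot how many users solved each problem
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("SELECT id, solved_by FROM problems ORDER BY id").fetchall()
    conn.close()
    plt.bar([r[0] for r in rows], [r[1] for r in rows])
    plt.xlabel("Problem ID")
    plt.ylabel("Solved by")
    plt.savefig("solved_counts.png")

if __name__ == "__main__":
    if not os.path.exists(DB_PATH):  # os used for a simple existence check
        scrape_and_store()
    plot_solved_counts()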
• newProjectEuler.py:
import sqlite3  # query the locally stored scrape results
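Only the import of newProjectEuler.py is shown. A minimal sketch,
assuming it reads back the euler.db schema from the euler.py sketch above:

DB_PATH = "euler.db"  # assumed; must match the file written by euler.py

def list_problems():
    conn = sqlite3.connect(DB_PATH)
    # Print each stored problem with its solved-by count
    for pid, title, solved_by in conn.execute(
            "SELECT id, title, solved_by FROM problems ORDER BY id"):
        print(f"{pid:4d}  {title}  (solved by {solved_by})")
    conn.close()

if __name__ == "__main__":
    list_problems()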
CSS Selector example (a sample row from the scraped HTML):
<tr><td class="id_column">1</td><td><a href="problem=1"
title="Published on Friday, 5th October 2001, 06:00 pm">Multiples of 3
or 5</a></td><td><div class="center">1010203</div></td></tr>