DAP - Module 4
• The main purpose of web scraping is to collect and analyze data from
websites for various applications, such as research, business
intelligence, or creating datasets.
• Developers use tools and libraries like BeautifulSoup (for Python),
Scrapy, or Puppeteer to automate the process of fetching and parsing
web data.
Python Libraries
• requests
• Beautiful Soup
• Selenium
Requests
There are mainly two ways to extract data from a website:
• Use the API of the website (if it exists). Ex. Facebook Graph API
• Access the HTML of the webpage and extract useful information/data from it. Ex. web scraping
Steps involved in web scraping
• BeautifulSoup does not fetch the web page for us, so we use requests to download the page first.
pip install requests
pip install beautifulsoup4
BeautifulSoup
"html.parser") print(type(soup))
Tag Object
• A Tag object corresponds to a tag in the original HTML document and gives access to the contents and attributes of the given element.
descendants generator
• The descendants generator is provided by Beautiful Soup.
• The .contents and .children attributes only consider a tag's direct children.
• The .descendants generator is used to iterate over all of the tag's children, recursively.
Example for descendants generator
from bs4 import BeautifulSoup
# Create the document
doc = "<body><b><p>Hello world<i>innermost</i></p></b><p>Outer text</p></body>"
# Initialize the object with the document
soup = BeautifulSoup(doc, "html.parser")
# Get the body tag
tag = soup.body
for content in tag.contents:
print(content)
for child in tag.children:
print(child)
for descendant in tag.descendants:
print(descendant)
Searching for and Extracting Specific Tags with Beautiful Soup
• BeautifulSoup provides several methods for searching for tags based on their
contents, such as find(), find_all(), and select().
• The find_all() method returns a list of all tags that match a given filter, while the
find() method returns the first tag that matches the filter.
• You can use the text keyword argument to search for tags that contain specific text.
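A minimal sketch of these methods on a made-up HTML fragment (the tag names, classes, and text below are illustrative, not from the slides):

```python
from bs4 import BeautifulSoup

html = '''
<div id="content">
  <p class="paragraph">First paragraph.</p>
  <p class="paragraph">Second paragraph.</p>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches)
first_p = soup.find("p")
print(first_p.text)                       # First paragraph.

# find_all() returns a list of every matching tag
all_p = soup.find_all("p", class_="paragraph")
print(len(all_p))                         # 2

# string= (the text= keyword in older versions) filters on the tag's text
hit = soup.find("p", string="Second paragraph.")
print(hit["class"])                       # ['paragraph']

# select() takes a CSS selector instead
print(soup.select("#content p")[0].text)  # First paragraph.
```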
Select method
• Id selector (#)
• Class selector (.)
• Universal Selector (*)
• Element Selector (tag)
• Grouping Selector(,)
CSS Selector
• Id selector (#): The ID selector targets a specific HTML element based on its unique identifier attribute (id). An ID is intended to be unique within a webpage, so the ID selector lets you style or apply CSS rules to the one element with that ID.
#header {
  color: blue;
  font-size: 16px;
}
• Class selector (.): The class selector is used to select and style HTML elements based on their class attribute. Unlike IDs, multiple elements can share the same class, enabling you to apply the same styles to multiple elements throughout the document.
.highlight {
  background-color: yellow;
  font-weight: bold;
}
CSS Selector
• Universal Selector (*): The universal selector selects all HTML elements on the webpage. It can be used to apply styles or rules globally, affecting every element, so use it judiciously to avoid unintended consequences.
* {
  margin: 0;
  padding: 0;
}
• Element Selector (tag): The element selector targets all instances of a specific HTML element on the page. It allows you to apply styles universally to elements of the same type, regardless of their class or ID.
p {
  color: green;
  font-size: 14px;
}
• Grouping Selector (,): The grouping selector allows you to apply the same styles to multiple selectors at once. Selectors are separated by commas, and the specified styles are applied to all the listed selectors.
h1, h2, h3 {
font-family: 'Arial', sans-serif;
color: #333;
}
• These selectors are fundamental to CSS and provide a powerful way to target
and style different elements on a webpage.
Creating a basic HTML page
<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div id="content">
<h1>Heading 1</h1>
<p class="paragraph">This is a sample paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
<a href="https://example.com">Visit Example</a>
</div>
</body>
</html>
Scraping example using CSS selectors
from bs4 import BeautifulSoup
import requests

# "web.html" is the slide's placeholder; requests.get() needs a full URL
html = requests.get("web.html").text
soup = BeautifulSoup(html, 'html.parser')

# 1. Select by tag name
heading = soup.select('h1')
print("1. Heading:", heading[0].text)

# 2. Select by class
paragraph = soup.select('.paragraph')
print("2. Paragraph:", paragraph[0].text)

# 3. Select by ID
div_content = soup.select('#content')
print("3. Div Content:", div_content[0].text)

# 4. Select by attribute
link = soup.select('a[href="https://example.com"]')
print("4. Link:", link[0]['href'])

# 5. Select all list items
list_items = soup.select('ul li')
print("5. List Items:")
for item in list_items:
    print("-", item.text)
Selenium
• Selenium is an open-source browser automation and testing tool, free to download and use.
CSS Selector
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(3)
# Navigate to the form page
driver.get('https://www.confirmtkt.com/pnr-status')
# Locate form elements
pnr_field = driver.find_element("name", "pnr")
submit_button = driver.find_element(By.CSS_SELECTOR, '.col-xs-4')
# Fill in form fields
pnr_field.send_keys('4358851774')
# Submit the form
submit_button.click()
Downloading web pages through form submission
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(3)
# Navigate to the form page
driver.get('https://www.confirmtkt.com/pnr-status')
# Locate form elements
pnr_field = driver.find_element("name", "pnr")
submit_button = driver.find_element(By.CSS_SELECTOR, '.col-xs-4')
# Fill in form fields
pnr_field.send_keys('4358851774')
# Submit the form
submit_button.click()
welcome_message = driver.find_element(By.CSS_SELECTOR, ".pnr-card")
# Print or use the scraped values
print(type(welcome_message))
html_content = welcome_message.get_attribute('outerHTML')
# Print the HTML content
print("HTML Content:", html_content)
# Close the browser
driver.quit()
A Python Integer Is More Than Just an Integer
Every Python object is simply a cleverly disguised C structure, which contains not only its value, but other information as well.
X = 10000
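This per-object overhead can be observed with the standard library's sys.getsizeof, which reports the full size of the object rather than just the raw bytes of the value:

```python
import sys

x = 10000
# A Python int is a full object: it carries a type pointer, a
# reference count, and the digits of the value, not just 4-8 raw bytes
print(sys.getsizeof(x))   # typically 28 bytes on 64-bit CPython
```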
In the special case that all variables are of the same type, much of this
information is redundant: it can be much more efficient to store data in a
fixed-type array. The difference between a dynamic-type list and a fixed-type
(NumPy-style) array is illustrated in Figure.
Fixed-Type Arrays in Python
• Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module (long part of the standard library) can be used to create dense arrays of a uniform type:
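A short sketch of the array module in use (the type code 'i' denotes a C int):

```python
import array

L = list(range(10))
A = array.array('i', L)   # 'i' is the type code for a C int
print(A)                  # array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# The fixed type is enforced: a float cannot be appended
try:
    A.append(3.14)
except TypeError as e:
    print("TypeError:", e)
```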
While Python’s array object provides efficient storage of array-based data, NumPy
adds to this efficient operations on that data.
Creating Arrays from Python Lists
import numpy as np
NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will upcast if
possible
If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:
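A brief illustration of upcasting and the dtype keyword (the example values are arbitrary):

```python
import numpy as np

# Mixed int/float input is upcast to a common floating-point type
a = np.array([3.14, 4, 2, 3])
print(a.dtype)     # float64

# The dtype keyword sets the element type explicitly
b = np.array([1, 2, 3, 4], dtype='float32')
print(b, b.dtype)  # [1. 2. 3. 4.] float32
```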
Array Indexing: Accessing Single Elements
You can also modify values using any of the above index notation:
NumPy arrays have a fixed type. This means, for example, that if you attempt to insert a
floating-point value to an integer array, the value will be silently truncated.
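A small sketch of single-element access and the silent truncation described above (values are arbitrary):

```python
import numpy as np

x = np.array([5, 0, 3, 3, 7, 9])
print(x[0])      # 5   (index from the front)
print(x[-1])     # 9   (negative index counts from the end)

# Assigning a float into an integer array truncates silently
x[0] = 3.14159
print(x)         # [3 0 3 3 7 9]
```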
• Note that for this to work, the size of the initial array must match the size of the
reshaped array.
• The reshape method will use a no-copy view of the initial array, but with
noncontiguous memory buffers this is not always the case.
Another common reshaping pattern is the conversion of a one-dimensional
array into a two-dimensional row or column matrix.
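A sketch covering both points: reshape requires matching sizes, and a one-dimensional array can become a row or column matrix (np.newaxis is the common alternative spelling for the row case):

```python
import numpy as np

# Sizes must match: 9 elements reshape into a 3x3 grid
grid = np.arange(1, 10).reshape(3, 3)
print(grid)

# One-dimensional array to row or column matrix
x = np.array([1, 2, 3])
print(x.reshape(1, 3))     # row vector via reshape
print(x[np.newaxis, :])    # row vector via np.newaxis
print(x.reshape(3, 1))     # column vector via reshape
```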