
Data Analytics using Python Module-4

Web Scraping and Numerical Analysis


Topics to be studied

• Data acquisition by scraping web applications
• Submitting a form
• Fetching web pages
• Downloading web pages through form submission
• CSS selectors
• NumPy essentials: the NumPy array

Need for Web Scraping

• Let’s suppose you want to get some information from a website, say an article from some news site. What will you do?
• The first thing that may come to mind is to copy and paste the information into your local media.
• But what if you want a large amount of data on a daily basis, and as quickly as possible?
• In such situations, copy and paste will not work, and that’s where you’ll need web scraping.
Web Scraping

• Web scraping is a technique used to extract data from websites. It
involves fetching and parsing HTML content to gather information.

• The main purpose of web scraping is to collect and analyze data from
websites for various applications, such as research, business
intelligence, or creating datasets.
• Developers use tools and libraries like BeautifulSoup (for Python),
Scrapy, or Puppeteer to automate the process of fetching and parsing
web data.
Python Libraries

• requests
• Beautiful Soup
• Selenium
Requests

• The requests module allows you to send HTTP requests using Python.
• The HTTP request returns a Response object with all the response data (content, encoding, status, etc.).
• Install requests with: pip install requests
Python script to make a simple HTTP GET request

import requests

# Specify the URL you want to make a GET request to
url = "https://www.w3schools.com"

# Make the GET request
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Print the content of the response
    print("Response content:")
    print(response.text)
else:
    # Print an error message if the request was not successful
    print(f"Error: {response.status_code}")
import requests

# Specify the base URL
base_url = "https://jsonplaceholder.typicode.com"

# GET request
get_response = requests.get(f"{base_url}/posts/1")
print(f"GET Response:\n{get_response.json()}\n")

# POST request
new_post_data = {
    'title': 'New Post',
    'body': 'This is the body of the new post.',
    'userId': 1
}
post_response = requests.post(f"{base_url}/posts", json=new_post_data)
print(f"POST Response:\n{post_response.json()}\n")

# PUT request (update the post with ID 1)
updated_post_data = {
    'title': 'Updated Post',
    'body': 'This is the updated body of the post.',
    'userId': 1
}
put_response = requests.put(f"{base_url}/posts/1", json=updated_post_data)
print(f"PUT Response:\n{put_response.json()}\n")

# DELETE request (delete the post with ID 1)
delete_response = requests.delete(f"{base_url}/posts/1")
print(f"DELETE Response:\nStatus Code: {delete_response.status_code}")
Implementing Web Scraping in Python with BeautifulSoup

There are mainly two ways to extract data from a website:
• Use the API of the website (if it exists). Ex: Facebook Graph API
• Access the HTML of the webpage and extract useful information/data from it. Ex: web scraping
Steps involved in web scraping

• Send an HTTP request to the URL
• Parse the data which is accessed
• Navigate and search the parse tree that we created
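A minimal sketch of these three steps together, using requests and BeautifulSoup (the URL is illustrative):

import requests
from bs4 import BeautifulSoup

# Step 1: send an HTTP request to the URL
page = requests.get("https://www.python.org")

# Step 2: parse the data which is accessed
soup = BeautifulSoup(page.content, "html.parser")

# Step 3: navigate and search the parse tree
print(soup.title.string)
for link in soup.find_all("a")[:5]:
    print(link.get("href"))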


BeautifulSoup

• It is an incredible tool for pulling information out of a webpage.
• Used to extract tables, lists, and paragraphs; you can also apply filters to extract information from web pages.
• BeautifulSoup does not fetch the web page for us, so we use requests alongside it.
• Install it with: pip install beautifulsoup4
BeautifulSoup

from bs4 import BeautifulSoup

# Parsing the document
soup = BeautifulSoup('''<h1>Knowx Innovations Pvt Ltd</h1>''', "html.parser")
print(type(soup))

Tag Object

• A Tag object corresponds to an XML or HTML tag in the original document.
• This object is usually used to extract a tag from the whole HTML document.
• Beautiful Soup is not an HTTP client, which means that to scrape online websites you first have to download them using the requests module and then serve them to Beautiful Soup for scraping.
• This object returns the first found tag if your document has multiple tags with the same name.

from bs4 import BeautifulSoup

# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b>RNSIT</b>
<b>Knowx Innovations</b>
</html>
''', "html.parser")

# Get the tag
tag = soup.b
print(tag)

# Print the type of the tag object
print(type(tag))
• The tag contains many methods and attributes. Two important features of a tag are its name and its attributes.
• Name: the name of the tag can be accessed through the .name suffix.
• Attributes: anything associated with the tag that is NOT its name, accessed like dictionary keys.
# Import Beautiful Soup
from bs4 import BeautifulSoup

# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b>Knowx Innovations</b>
</html>
''', "html.parser")

# Get the tag
tag = soup.b

# Print the tag's name
print(tag.name)

# Changing the tag name
tag.name = "strong"
print(tag)
from bs4 import BeautifulSoup

# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b class="RNSIT" name="knowx">Knowx Innovations</b>
</html>
''', "html.parser")

# Get the tag
tag = soup.b
print(tag["class"])

# Modifying the class
tag["class"] = "ekant"
print(tag)

# Delete the class attribute
del tag["class"]
print(tag)
• A document may contain multi-valued attributes, which can be accessed as a key-value pair.

# Import Beautiful Soup
from bs4 import BeautifulSoup

# Initialize the object with an HTML page
# soup for multi-valued attributes
soup = BeautifulSoup('''
<html>
<b class="rnsit knowx">Knowx Innovations</b>
</html>
''', "html.parser")

# Get the tag
tag = soup.b
print(tag["class"])
• NavigableString object: a string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text.

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<html>
<b>Knowx Innovations</b>
</html>
''', "html.parser")

tag = soup.b

# Get the string inside the tag
string = tag.string
print(string)

# Print the type of the string object
print(type(string))
Find the Siblings of a tag

• previous_sibling is used to find the previous element of the given element
• next_sibling is used to find the next element of the given element
• previous_siblings is used to find all previous elements of the given element
• next_siblings is used to find all next elements of the given element
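A short sketch of sibling navigation (the HTML snippet is illustrative; note that whitespace between tags would itself count as a sibling):

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<ul><li>First</li><li>Second</li><li>Third</li></ul>
''', "html.parser")

# Start from the middle <li> tag
second = soup.find_all("li")[1]

print(second.previous_sibling)   # <li>First</li>
print(second.next_sibling)       # <li>Third</li>

# Generator over all following siblings
for sibling in second.next_siblings:
    print(sibling)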

descendants generator

• The descendants generator is provided by Beautiful Soup.
• The .contents and .children attributes only consider a tag’s direct children.
• The descendants generator is used to iterate over all of the tag’s children, recursively.
Example for the descendants generator

from bs4 import BeautifulSoup

# Create the document
doc = "<body><b><p>Hello world<i>innermost</i></p></b><p>Outer text</p></body>"

# Initialize the object with the document
soup = BeautifulSoup(doc, "html.parser")

# Get the body tag
tag = soup.body

# .contents: direct children as a list
for content in tag.contents:
    print(content)

# .children: direct children as a generator
for child in tag.children:
    print(child)

# .descendants: all children, recursively
for descendant in tag.descendants:
    print(descendant)
Searching for and Extracting Specific Tags with Beautiful Soup

• Python BeautifulSoup – find all classes

# Import modules
from bs4 import BeautifulSoup
import requests

# Website URL
URL = 'https://www.python.org/'

# Set to collect class names
class_list = set()

# Page content from the website URL
page = requests.get(URL)

# Parse the HTML content
soup = BeautifulSoup(page.content, 'html.parser')

# Get all tag names
tags = {tag.name for tag in soup.find_all()}

# Iterate over all tags
for tag in tags:
    # Find all elements of this tag
    for i in soup.find_all(tag):
        # If the tag has a class attribute
        if i.has_attr("class"):
            if len(i['class']) != 0:
                class_list.add(" ".join(i['class']))

print(class_list)
Find a particular class

# Import module
from bs4 import BeautifulSoup

html_doc = """<html><head><title>Welcome to geeksforgeeks</title></head>
<body>
<p class="title"><b>Geeks</b></p>
<p class="body">This is an example to find a particular class</p>
</body>
"""

# Parse the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')

# Finding by class name
c = soup.find(class_="body")
print(c)
Search by text inside a tag

Steps involved in searching for text inside a tag:

• Import the module
• Pass the URL
• Request the page
• Specify the tag to be searched
• To search by text inside a tag, we check a condition with the help of the string attribute, which returns the text inside a tag.
• While navigating tags, we compare each tag’s string with the given text.
• Return the matching tags
from bs4 import BeautifulSoup
import requests

# Sample web page
sample_web_page = 'https://www.python.org'

# Call the get method to request that page
page = requests.get(sample_web_page)

# With the help of BeautifulSoup and the HTML parser, create the soup
soup = BeautifulSoup(page.content, "html.parser")
child_soup = soup.find_all('strong')
# print(child_soup)

text = """Notice:"""

# Search for the tag whose text is the same as the given text
for i in child_soup:
    if i.string == text:
        print(i)
IMPORTANT POINTS

• BeautifulSoup provides several methods for searching for tags based on their contents, such as find(), find_all(), and select().
• The find_all() method returns a list of all tags that match a given filter, while the find() method returns the first tag that matches the filter.
• You can use the text keyword argument (aliased as string in newer versions) to search for tags that contain specific text.
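A brief sketch of these methods side by side (the HTML snippet is illustrative):

from bs4 import BeautifulSoup

soup = BeautifulSoup('''
<div><p class="intro">Hello</p><p>World</p></div>
''', "html.parser")

print(soup.find("p"))                  # first matching tag
print(soup.find_all("p"))              # list of all matching tags
print(soup.select("p.intro"))          # tags matching a CSS selector
print(soup.find_all(string="World"))   # text nodes matching the given text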
Select method

• The select method in BeautifulSoup (bs4) is used to find all elements in a parsed HTML or XML document that match a specific CSS selector.
• CSS selectors are patterns used to select and style elements in a document.
• The select method allows you to apply these selectors to navigate and extract data from the parsed document easily.
CSS Selector

• ID selector (#)
• Class selector (.)
• Universal selector (*)
• Element selector (tag)
• Grouping selector (,)
CSS Selector

• ID selector (#): the ID selector targets a specific HTML element based on its unique identifier attribute (id). An ID is intended to be unique within a webpage, so using the ID selector allows you to style or apply CSS rules to a particular element with a specific ID.

#header {
    color: blue;
    font-size: 16px;
}

• Class selector (.): the class selector is used to select and style HTML elements based on their class attribute. Unlike IDs, multiple elements can share the same class, enabling you to apply the same styles to multiple elements throughout the document.

.highlight {
    background-color: yellow;
    font-weight: bold;
}
CSS Selector

• Universal selector (*): the universal selector selects all HTML elements on the webpage. It can be used to apply styles or rules globally, affecting every element. However, it is important to use the universal selector judiciously to avoid unintended consequences.

* {
    margin: 0;
    padding: 0;
}

• Element selector (tag): the element selector targets all instances of a specific HTML element on the page. It allows you to apply styles universally to elements of the same type, regardless of their class or ID.

p {
    color: green;
    font-size: 14px;
}

• Grouping selector (,): the grouping selector allows you to apply the same styles to multiple selectors at once. Selectors are separated by commas, and the styles specified will be applied to all the listed selectors.

h1, h2, h3 {
    font-family: 'Arial', sans-serif;
    color: #333;
}

• These selectors are fundamental to CSS and provide a powerful way to target
and style different elements on a webpage.
Creating a basic HTML page

<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div id="content">
<h1>Heading 1</h1>
<p class="paragraph">This is a sample paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
<a href="https://example.com">Visit Example</a>
</div>
</body>
</html>
Scraping example using CSS selectors

from bs4 import BeautifulSoup

# Read the HTML page created above
with open("web.html") as f:
    html = f.read()

soup = BeautifulSoup(html, 'html.parser')

# 1. Select by tag name
heading = soup.select('h1')
print("1. Heading:", heading[0].text)

# 2. Select by class
paragraph = soup.select('.paragraph')
print("2. Paragraph:", paragraph[0].text)

# 3. Select by ID
div_content = soup.select('#content')
print("3. Div Content:", div_content[0].text)

# 4. Select by attribute
link = soup.select('a[href="https://example.com"]')
print("4. Link:", link[0]['href'])

# 5. Select all list items
list_items = soup.select('ul li')
print("5. List Items:")
for item in list_items:
    print("-", item.text)
Selenium

• Selenium is an open-source testing tool, which means it can be downloaded from the internet without spending anything.
• Selenium is a functional testing tool and is also compatible with non-functional testing tools.
• Install it with: pip install selenium

Steps in form filling

• Import the webdriver from selenium
• Create a driver instance by specifying the browser
• Find the element
• Send the values to the element
• Use the click function to submit

WebDriver

• WebDriver is a powerful tool for automating web browsers.
• It provides a programming interface for interacting with web browsers and performing various operations, such as clicking buttons, filling forms, navigating between pages, and more.
• WebDriver supports multiple programming languages.
from selenium import webdriver
Creating a WebDriver instance
• You can create an instance of WebDriver by using the webdriver class and the browser which you want to use.
• Ex: driver = webdriver.Chrome()
• Browsers:
– webdriver.Chrome()
– webdriver.Firefox()
– webdriver.Edge()
– webdriver.Safari()
– webdriver.Opera()
– webdriver.Ie()
Find the element

• First you need to get the page containing the form, using the get() function.
• To find the element you can use find_element(), specifying any of the following locator strategies (sketched below):
— XPATH
— CSS Selector
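A brief sketch of both locator styles (the page and element names are illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.python.org")

# Locate the search box by XPATH
search_box = driver.find_element(By.XPATH, '//input[@name="q"]')

# Locate the same element by CSS selector
search_box = driver.find_element(By.CSS_SELECTOR, 'input[name="q"]')

search_box.send_keys("selenium")
driver.quit()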
from selenium import webdriver
import time
from selenium.webdriver.common.by import By

# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(3)

# Navigate to the form page
driver.get('https://www.confirmtkt.com/pnr-status')

# Locate form elements
pnr_field = driver.find_element(By.NAME, "pnr")
submit_button = driver.find_element(By.CSS_SELECTOR, '.col-xs-4')

# Fill in form fields
pnr_field.send_keys('4358851774')

# Submit the form
submit_button.click()
Downloading web pages through form submission

from selenium import webdriver
import time
from selenium.webdriver.common.by import By

# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
driver.maximize_window()
time.sleep(3)

# Navigate to the form page
driver.get('https://www.confirmtkt.com/pnr-status')

# Locate form elements
pnr_field = driver.find_element(By.NAME, "pnr")
submit_button = driver.find_element(By.CSS_SELECTOR, '.col-xs-4')

# Fill in form fields
pnr_field.send_keys('4358851774')

# Submit the form
submit_button.click()

welcome_message = driver.find_element(By.CSS_SELECTOR, ".pnr-card")

# Print or use the scraped values
print(type(welcome_message))
html_content = welcome_message.get_attribute('outerHTML')

# Print the HTML content
print("HTML Content:", html_content)

# Close the browser
driver.quit()
A Python Integer Is More Than Just an Integer

• Every Python object is simply a cleverly disguised C structure, which contains not only its value but other information as well.

X = 10000

• X is not just a “raw” integer. It’s actually a pointer to a compound C structure, which contains several values.

[Figure: difference between a C variable and a Python variable]


A Python List Is More Than Just a List

• Because of Python’s dynamic typing, we can even create heterogeneous lists, as in the sketch below:
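L = [True, "2", 3.0, 4]
print([type(item) for item in L])
# [<class 'bool'>, <class 'str'>, <class 'float'>, <class 'int'>]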

• In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array. The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the figure.
Fixed-Type Arrays in Python
• Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module can be used to create dense arrays of a uniform type, as sketched below:
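A minimal sketch using the built-in array module:

import array

L = list(range(10))
A = array.array('i', L)   # 'i' indicates integer type
print(A)
# array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])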

• While Python’s array object provides efficient storage of array-based data, NumPy adds to this efficient operations on that data.
Creating Arrays from Python Lists

import numpy as np

• NumPy is constrained to arrays that all contain the same type. If the types do not match, NumPy will upcast where possible.
• If we want to explicitly set the data type of the resulting array, we can use the dtype keyword.
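A short sketch of both behaviors:

import numpy as np

# An integer list becomes an integer array
print(np.array([1, 4, 2, 5, 3]))                 # [1 4 2 5 3]

# Mixed integers and floats are upcast to float
print(np.array([3.14, 4, 2, 3]))                 # [3.14 4.   2.   3.  ]

# Explicitly set the data type with the dtype keyword
print(np.array([1, 2, 3, 4], dtype='float32'))   # [1. 2. 3. 4.]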

Creating Arrays from Python Lists

• NumPy arrays can explicitly be multidimensional; here’s one way of initializing a multidimensional array using a list of lists:
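import numpy as np

# The nested lists result in a multidimensional array
m = np.array([range(i, i + 3) for i in [2, 4, 6]])
print(m)
# [[2 3 4]
#  [4 5 6]
#  [6 7 8]]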
Creating Arrays from Scratch
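The slide’s screenshots are not reproduced here; a representative sketch of the standard constructors:

import numpy as np

print(np.zeros(10, dtype=int))        # length-10 array of zeros
print(np.ones((3, 5), dtype=float))   # 3x5 array of ones
print(np.full((3, 3), 3.14))          # 3x3 array filled with 3.14
print(np.arange(0, 20, 2))            # linear sequence from 0 to 18, step 2
print(np.linspace(0, 1, 5))           # 5 values evenly spaced between 0 and 1
print(np.random.random((2, 2)))       # 2x2 array of uniform random values
print(np.eye(3))                      # 3x3 identity matrix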
NumPy Standard Data Types

• While constructing an array, you can specify the data type using a string, or using the associated NumPy object, as sketched below.
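import numpy as np

# Data type specified as a string
a = np.zeros(10, dtype='int16')

# Equivalently, specified with the associated NumPy object
b = np.zeros(10, dtype=np.int16)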


The Basics of NumPy Arrays
We’ll cover a few categories of basic array manipulations here:
• Attributes of arrays: determining the size, shape, memory consumption, and data types of arrays
• Indexing of arrays: getting and setting the values of individual array elements
• Slicing of arrays: getting and setting smaller subarrays within a larger array
• Reshaping of arrays: changing the shape of a given array
• Joining and splitting of arrays: combining multiple arrays into one, and splitting one array into many
NumPy Array Attributes

• Each array has the attributes:
– ndim (the number of dimensions)
– shape (the size of each dimension)
– size (the total size of the array)

Exercise: write a Python program that creates an m×n integer array and prints its attributes using NumPy.
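A minimal sketch of such a program, using a deterministic 3×4 (m×n) array so the output is reproducible:

import numpy as np

# Create a 3x4 integer array
x = np.arange(12).reshape(3, 4)

print("ndim: ", x.ndim)    # ndim:  2
print("shape:", x.shape)   # shape: (3, 4)
print("size: ", x.size)    # size:  12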
Array Indexing: Accessing Single Elements

• In a multidimensional array, you access items using a comma-separated tuple of indices.
• You can also modify values using the same index notation.
• NumPy arrays have a fixed type. This means, for example, that if you attempt to insert a floating-point value into an integer array, the value will be silently truncated.
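A brief sketch of all three behaviors:

import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

# Access with a comma-separated tuple of indices
print(x2[0, 0])    # 3
print(x2[2, -1])   # 7

# Modify a value using the same notation
x2[0, 0] = 12

# Inserting a float into an integer array silently truncates it
x2[0, 0] = 3.14159
print(x2[0, 0])    # 3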

Array Slicing: Accessing Subarrays

• One-dimensional subarrays
• Multidimensional subarrays
• Subarray dimensions can even be reversed together
• Accessing array rows and columns
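The slide’s screenshots are not reproduced; a compact sketch of each case:

import numpy as np

x = np.arange(10)
print(x[4:7])          # one-dimensional subarray: [4 5 6]

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])
print(x2[:2, :3])      # multidimensional subarray: two rows, three columns
print(x2[::-1, ::-1])  # subarray dimensions reversed together

print(x2[:, 0])        # first column of x2
print(x2[0, :])        # first row of x2 (equivalently, x2[0])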
Subarrays as no-copy views

• Array slices return views rather than copies of the array data. Now if we modify such a subarray, we’ll see that the original array is changed! Observe the sketch below:
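import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])

# Extract a 2x2 subarray (a view, not a copy)
x2_sub = x2[:2, :2]
x2_sub[0, 0] = 99

# The original array is changed too
print(x2[0, 0])   # 99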
Creating copies of arrays
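A sketch using the copy() method:

import numpy as np

x2 = np.array([[12, 5, 2, 4],
               [ 7, 6, 8, 8],
               [ 1, 6, 7, 7]])

# Explicitly copy the subarray
x2_sub_copy = x2[:2, :2].copy()
x2_sub_copy[0, 0] = 42

# The original array is untouched
print(x2[0, 0])   # 12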
Reshaping of Arrays
Another useful type of operation is reshaping of arrays. The most flexible way of
doing this is with the reshape() method. For example, if you want to put the
numbers 1 through 9 in a 3×3 grid, you can do the following:
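A sketch of this reshape:

import numpy as np

grid = np.arange(1, 10).reshape((3, 3))
print(grid)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]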

• Note that for this to work, the size of the initial array must match the size of the
reshaped array.

• Where possible, the reshape method will use a no-copy view of the initial array, but with noncontiguous memory buffers this is not always the case.
• Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix, as sketched below.
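A sketch of both conversions, using reshape and the np.newaxis keyword:

import numpy as np

x = np.array([1, 2, 3])

# Row vector via reshape, or via newaxis
print(x.reshape((1, 3)))    # [[1 2 3]]
print(x[np.newaxis, :])     # [[1 2 3]]

# Column vector via reshape, or via newaxis
print(x.reshape((3, 1)))
print(x[:, np.newaxis])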
