Python Packages for Web Data Access

The document discusses Python packages for web data access, including modules for web scraping like urllib and BeautifulSoup, and highlights the differences between web scraping and using APIs. It explains the processes involved in both methods, such as fetching, extracting, and storing data, as well as the use of Regular Expressions for pattern matching and data manipulation. Key features of urllib and examples of its functionalities are also provided.

Uploaded by

jmhh2187

Python Packages for Web Data Access


Web data is any information available on the internet, such as text, images, or
structured data. Web data access means retrieving that information with Python.
Websites publish a great deal of data (news, weather, stock prices, book lists, and so on),
and Python lets us fetch it so we can use it in programs.
Accessing Web Data with Python
Python modules used for web scraping:
1. urllib – A built-in Python package for fetching webpage content (similar in purpose to the third-party requests library).
2. BeautifulSoup – Parses HTML and extracts and organizes data from it (helps in web scraping).
3. re (Regular Expressions, or Regex) – Finds patterns in text (useful for extracting specific data).

Formats and interfaces used with APIs:

1. json – A format for storing and exchanging data (most APIs return data as JSON); Python's built-in json module reads and writes it.
2. REST API – A style of web interface through which a site provides data on request (common in APIs).
3. Facebook and Twitter APIs – Social media platforms provide APIs so developers can access posts, user data, or analytics.
Differences between web scraping and APIs:
1️⃣ Start → The process begins with deciding how to access web data.
2️⃣ Choose Method →
• If the website does not provide an API, we use Web Scraping.
• If the website offers an API, we use the API method.

Web Scraping Path:


Fetch Webpage (HTML) → We first download the webpage content.
Extract Data (BeautifulSoup/Regex) → Then, we process and extract relevant data.
Store Data (CSV, JSON, TXT) → Finally, we save it in a file for later use.
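
The three scraping steps can be sketched as follows. This is a minimal illustration, not the presenters' own code: the function names, the sample HTML, and the choice of `<h2>` as the target element are assumptions, and the extraction uses `re` from the standard library (BeautifulSoup would usually be more robust on real pages).

```python
import csv
import io
import re
from urllib.request import urlopen


def fetch_html(url):
    """Step 1 (fetch): download the page and decode it to text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_titles(html):
    """Step 2 (extract): return the text of every <h2> element."""
    found = re.findall(r"<h2[^>]*>(.*?)</h2>", html, flags=re.S | re.I)
    return [title.strip() for title in found]


def store_csv(titles, out):
    """Step 3 (store): write one title per row to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["title"])
    writer.writerows([title] for title in titles)


# Hypothetical page content, standing in for fetch_html("https://example.com/books")
SAMPLE_HTML = "<html><body><h2>Dune</h2><h2> Emma </h2></body></html>"

titles = extract_titles(SAMPLE_HTML)
buffer = io.StringIO()  # use open("books.csv", "w", newline="") to write a real file
store_csv(titles, buffer)
print(buffer.getvalue())
```

The sample string keeps the sketch runnable offline; on a live site, the result of `fetch_html(url)` would be passed to `extract_titles` instead.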

API Path:
Send API Request → We send a request to an API server.
Receive Data (JSON/XML) → The server sends back structured data.
Store Data (CSV, JSON, TXT) → We save the API data in a file.
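
A comparable sketch for the API path, using only the standard library. The endpoint URL, the `fetch_json` helper, and the sample reply are hypothetical; a real API defines its own URLs, fields, and authentication.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Send a GET request and decode the JSON body of the reply."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical reply, standing in for fetch_json("https://api.example.com/weather")
SAMPLE_REPLY = '{"city": "Pune", "temp_c": 31.5, "tags": ["hot", "dry"]}'

# Receive: the JSON text becomes ordinary Python dicts and lists
data = json.loads(SAMPLE_REPLY)

# Store: serialise back to text, ready to be written to a .json file
stored = json.dumps(data, indent=2, sort_keys=True)
print(stored)
```

Because the server already returns structured data, there is no extraction step: the fields are read directly, for example `data["city"]`.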
REGEX (Regular Expressions)
A Regular Expression (Regex) is a special sequence of characters used to find, match, and manipulate
patterns in text. It acts like a smart filter that helps you search for specific words, numbers, or patterns
inside a large amount of text.

Key Features of Regex:


✔ Pattern Matching – Finds specific words, numbers, or symbols in text.
✔ Text Validation – Ensures correct formats (like email, phone numbers, dates).
✔ Data Extraction – Pulls useful information from messy text (like all email IDs).
✔ Text Replacement – Helps clean or modify text (like replacing all spaces with commas).

To use regex, first import the module:


import re
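
With the module imported, the validation, extraction, and replacement features listed above might look like this. The email pattern is a deliberately simplified assumption (it does not cover every valid address), and the sample text and helper name are made up for illustration.

```python
import re

# Simplified email pattern: name, "@", domain, then one or more ".part" suffixes
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


def is_valid_email(value):
    """Text validation: the whole string must match the pattern."""
    return re.fullmatch(EMAIL_PATTERN, value) is not None


messy = "Contact: a.kumar@example.com, or   b_rao@example.org  (office)"

# Data extraction: pull every email address out of the messy text
emails = re.findall(EMAIL_PATTERN, messy)

# Text replacement: collapse each run of whitespace into a single space
cleaned = re.sub(r"\s+", " ", messy)

print(is_valid_email("user@example.com"), emails, cleaned)
```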
Metacharacters
Metacharacters are characters with a special meaning:
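
As a rough illustration, the snippet below exercises the common metacharacters one at a time. The sample sentence and variable names are made up for this sketch.

```python
import re

text = "hello planet, hello world 2024"

any_char = re.findall(r"he..o", text)         # .   any single character
at_start = re.search(r"^hello", text)         # ^   match only at the start
at_end = re.search(r"2024$", text)            # $   match only at the end
zero_or_more = re.findall(r"hel*o", text)     # *   zero or more "l"
one_or_more = re.findall(r"l+", text)         # +   one or more "l"
optional = re.findall(r"worlds?", text)       # ?   zero or one "s"
exact_count = re.findall(r"\d{4}", text)      # {}  exactly four digits
either = re.findall(r"planet|world", text)    # |   either alternative
captured = re.findall(r"hello (\w+)", text)   # ()  capture the word after "hello"

print(any_char, one_or_more, exact_count, either, captured)
```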
Special Sequences
A special sequence is a \ followed by one of the characters in the list below, and has a special
meaning:
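
A short sketch of the most frequently used special sequences, applied to a made-up sample string:

```python
import re

text = "Room 42 is free"

digits = re.findall(r"\d+", text)          # \d  digits
non_digits = re.findall(r"\D+", text)      # \D  anything that is not a digit
spaces = re.findall(r"\s", text)           # \s  whitespace characters
words = re.findall(r"\w+", text)           # \w  letters, digits, and underscore
starts_with = re.search(r"\ARoom", text)   # \A  only at the start of the string
ends_with = re.search(r"free\Z", text)     # \Z  only at the end of the string
boundary = re.findall(r"\bfr", text)       # \b  at the edge of a word

print(digits, non_digits, words)
```

Patterns are written as raw strings (`r"..."`) so that Python does not interpret the backslash before `re` sees it.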
Sets
A set is a set of characters inside a pair of square brackets [] with a special meaning:
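
A few sets in action, again on an invented sample string:

```python
import re

text = "The rain in Spain 2024"

any_of = re.findall(r"[arn]", text)           # one of a, r, or n
vowels = re.findall(r"[aeiou]", text)         # one lowercase vowel
digit_range = re.findall(r"[0-9]", text)      # any digit from 0 to 9
capitals = re.findall(r"[A-Z]", text)         # any uppercase letter
negated = re.findall(r"[^a-zA-Z ]", text)     # ^ inside a set means "not": no letters, no spaces

print(any_of, vowels, digit_range, capitals, negated)
```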

Methods
Examples:
1. re.findall()
2. re.sub()
3. re.search()
4. re.match()
5. re.split()
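
The five functions listed above can be compared on a single string. The sample sentence and variable names are assumptions chosen for this sketch.

```python
import re

text = "The rain in Spain"

# 1. re.findall(): list of every non-overlapping match
all_matches = re.findall(r"ai", text)

# 2. re.sub(): replace every match with another string
replaced = re.sub(r"\s", ",", text)

# 3. re.search(): first match anywhere in the string, or None
first_space = re.search(r"\s", text)

# 4. re.match(): match only at the beginning of the string, or None
at_beginning = re.match(r"The", text)
not_at_beginning = re.match(r"rain", text)

# 5. re.split(): cut the string at every match
parts = re.split(r"\s", text)

print(all_matches, replaced, first_space.start(), parts)
```

The difference between `re.search()` and `re.match()` is the one that most often causes confusion: `re.match(r"rain", text)` returns `None` here because "rain" is not at the start, while `re.search(r"rain", text)` would find it.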
urllib
URL (Uniform Resource Locator) Library
urllib is a package in Python's standard library for fetching, parsing, and handling URLs. It allows
Python to interact with websites by sending requests and downloading data, and it covers related
tasks such as encoding URLs and handling errors.

Key Features:
✅ Open and read web pages (urllib.request)
✅ Parse and manipulate URLs (urllib.parse)
✅ Handle HTTP errors (urllib.error)

✅ Check robots.txt rules (urllib.robotparser)


1. urllib.request (For Opening URLs)
2. urllib.parse (For Manipulating URLs)
3. urllib.error (For Handling Errors)
4. urllib.robotparser (For Checking Robots.txt)


Example Code
1. urllib.request
2. urllib.parse
3. urllib.error
4. urllib.robotparser
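
A minimal sketch touching all four submodules. The URLs, the `read_page` helper, the "demo-bot" agent name, and the robots.txt rules are hypothetical; the robots.txt rules are parsed from a list of lines so that the example runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import parse_qs, urlencode, urljoin, urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser


def read_page(url):
    """urllib.request + urllib.error: fetch a page, or return None on failure."""
    request = Request(url, headers={"User-Agent": "demo-bot/0.1"})
    try:
        with urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except HTTPError as err:   # the server answered with an error status (404, 500, ...)
        print(f"HTTP error {err.code}: {err.reason}")
    except URLError as err:    # the server could not be reached at all
        print(f"Connection failed: {err.reason}")
    return None


# urllib.parse: take a URL apart, build a query string, resolve a relative link
parts = urlparse("https://example.com/search?q=python&page=2")
query = parse_qs(parts.query)
encoded = urlencode({"q": "web data", "page": 1})
absolute = urljoin("https://example.com/docs/", "intro.html")

# urllib.robotparser: check whether a path may be crawled
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
allowed = rules.can_fetch("demo-bot", "https://example.com/index.html")
blocked = rules.can_fetch("demo-bot", "https://example.com/private/data.html")

print(parts.netloc, query, encoded, absolute, allowed, blocked)
```

`HTTPError` is caught before `URLError` because it is a subclass of `URLError`; in the opposite order the more specific handler would never run. Against a live site, `rules.set_url(...)` followed by `rules.read()` would download the real robots.txt instead of using `parse()`.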
Submitted by:
CB.PS.I5MAT24004
CB.PS.I5MAT24006
CB.PS.I5MAT24009
