
DATA VISUALIZATION 21AD71
MODULE-5
Networked programs

HyperText Transfer Protocol – HTTP, The World’s Simplest Web Browser, Retrieving an image over HTTP, Retrieving web pages with urllib, Parsing HTML and scraping the web, Parsing HTML using regular expressions, Parsing HTML using BeautifulSoup, Reading binary files using urllib

Using Web Services

eXtensible Markup Language – XML, Parsing XML, Looping through nodes, JavaScript Object Notation – JSON, Parsing JSON

Hypertext Transfer Protocol – HTTP


The network protocol that powers the web is simple.
Python has built-in support for network connections using the
socket module.
A socket is a two-way connection between two programs, allowing
both reading and writing.
When data is written to a socket, it is sent to the other program
connected through the socket.
When data is read from a socket, the data sent by the other
program is received.
If no data is sent by the other side, the program waits indefinitely.
If both programs wait for data without sending anything, they can
remain stuck waiting.

A protocol is needed to manage communication between programs over the internet.
Protocols define rules such as which program sends data first, how the other side responds, and the sequence of exchanges that follows.
The Hypertext Transfer Protocol (HTTP) is one such protocol
for web communication.
The world’s simplest web browser

Perhaps the easiest way to show how the HTTP protocol works is to
write a very simple Python program that makes a connection to a web
server and follows the rules of the HTTP protocol to request a
document and display what the server sends back.

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()

The program connects to port 80 on the server data.pr4e.org.


The program acts as a web browser and follows the HTTP protocol.
A GET command is sent, followed by a blank line to comply with
the protocol.
The \r\n represents an End of Line (EOL) and \r\n\r\n represents a
blank line.
After sending the blank line, the program enters a loop.
The loop receives data in 512-character chunks from the socket.
The data is printed out until no more data is available.
The loop terminates when recv() returns an empty string, signaling
the end of data.
The program produces output that begins with the headers sent by the server, starting with the status line:

HTTP/1.1 200 OK

followed by the remaining headers, a blank line, and then the contents of romeo.txt.


The output begins with headers sent by the web server to describe
the document.
Example: Content-Type header indicates that the document is
plain text (text/plain).
After sending headers, the server adds a blank line to indicate the
end of the headers.
The actual data (e.g., the file romeo.txt) is sent after the headers.
Demonstrates a low-level network connection using sockets.
Sockets can be used to communicate with various servers,
including web and mail servers.
Protocols guide the communication; knowing the protocol allows
sending and receiving data.
Python has a special HTTP library for retrieving documents over
the web.
HTTP protocol requires sending and receiving data as bytes
objects.


The encode() and decode() methods are used to convert between strings and bytes objects.
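As a small illustration (this sketch is not from the textbook code, and the response bytes are made up), encode() turns a request string into bytes ready to send on a socket, and decode() plus a split at the blank line separates a response's headers from its data:

request = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'
print(type(request.encode()))        # <class 'bytes'> - ready for mysock.send()

# Pretend this is what recv() accumulated from the server
response = b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nBut soft what light'
headers, body = response.split(b'\r\n\r\n', 1)   # split at the first blank line
print(headers.decode())              # the header block as a string
print(body.decode())                 # the document data as a string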

Retrieving an image over HTTP

We can use a similar program to retrieve an image across the network using HTTP. Instead of copying the data to the screen as the program runs, we accumulate the data in a string (a bytes object), trim off the headers, and then save the image data to a file as follows:

import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    # time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)

    picture = picture + data

mysock.close()

# Look for the end of the header (2 CRLF)
pos = picture.find(b"\r\n\r\n")
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open("stuff.jpg", "wb")
fhand.write(picture)
fhand.close()

# Code: https://www.py4e.com/code3/urljpeg.py
When the program runs, it produces the following output:
$ python urljpeg.py
5120 5120
5120 10240
4240 14480
5120 19600
...
5120 214000
3200 217200
5120 222320
5120 227440
3167 230607

Header length 393
HTTP/1.1 200 OK
Date: Wed, 11 Apr 2018 18:54:09 GMT
Server: Apache/2.4.7 (Ubuntu)
Last-Modified: Mon, 15 May 2017 12:27:40 GMT
ETag: "38342-54f8f2e5b6277"
Accept-Ranges: bytes
Content-Length: 230210
Vary: Accept-Encoding
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: image/jpeg

The Content-Type header for this URL indicates the document is an image (image/jpeg).
Once the program finishes, the image data can be viewed by
opening the file stuff.jpg in an image viewer.
The recv() method doesn't always return the full 5120 characters;
it depends on how much data has been transferred by the server at that
moment.
In this run, for example, several calls return only 3200 or 4240 characters even though up to 5120 were requested.
Results may vary depending on network speed.


The last chunk of data received is 3167 bytes, which is the end of the stream.
A subsequent recv() call then returns a zero-length string, signaling that the server has closed its end of the socket.
Introducing a time.sleep() delay (e.g., a quarter of a second) can
slow down the recv() calls, allowing the server more time to send
data.
With the delay, the server can send more data before the next
recv() call.
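As a sketch, the receive loop from urljpeg.py above looks like this with the delay enabled; the rest of the program (mysock, count, picture) is unchanged:

import time

while True:
    data = mysock.recv(5120)
    if len(data) < 1: break
    time.sleep(0.25)        # pause so the server can get further ahead of our recv() calls
    count = count + len(data)
    print(len(data), count)
    picture = picture + data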

Retrieving web pages with urllib

Instead of manually handling HTTP with the socket library, Python offers a simpler solution using the urllib library.
urllib makes it easier to interact with web pages by handling all
HTTP protocol and header details.
With urllib, a web page can be treated similarly to a file.
You specify the web page to retrieve, and urllib manages the rest.
The code to read the romeo.txt file using urllib is more concise
and efficient than with manual socket handling.
import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())

# Code: https://www.py4e.com/code3/urllib1.py


After opening a web page with urllib.request.urlopen, the page can be treated like a file.
The contents can be read using a for loop.
When the program runs, only the output of the file's contents is
visible.
The headers are still sent by the server, but urllib handles them
internally.
The urllib code consumes the headers and returns only the data to
the user.
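If you do want to see the headers, the response object returned by urlopen still carries them; a minimal sketch (the exact header values will vary by server):

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
print(fhand.getcode())                    # HTTP status code, e.g. 200
for name, value in fhand.getheaders():    # the headers urllib consumed for us
    print(name + ':', value)
for line in fhand:
    print(line.decode().strip())          # the document data itself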
Reading binary files using urllib

You may want to retrieve non-text (or binary) files, such as images or videos.
The data in binary files is typically not useful for printing.
You can easily copy a URL to a local file on your hard disk using
urllib.
The process involves opening the URL and using read() to
download the entire contents into a string variable (e.g., img).
After retrieving the data, you can write it to a local file.

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()

# Code: https://www.py4e.com/code3/curl1.py


The program reads all data from the network at once and stores it
in the img variable in main memory.
It then opens the file cover3.jpg and writes the data to disk.
The wb argument in open() indicates that the file is opened for
writing in binary mode.
This approach works well if the file size is less than the computer's
memory capacity.
For large files (like audio or video), this method may cause the
program to crash or run slowly due to memory exhaustion.
To prevent running out of memory, data is retrieved in blocks (or
buffers).
Each block is written to disk before retrieving the next, allowing
the program to handle files of any size without consuming all
available memory.

import urllib.request, urllib.parse, urllib.error

img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)
    if len(info) < 1: break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()

# Code: https://www.py4e.com/code3/curl2.py

The program reads data in increments of 100,000 characters at a time.
After reading each chunk, it writes those characters to the
cover3.jpg file.
The program retrieves the next 100,000 characters of data from
the web after writing the previous chunk.
Parsing HTML and scraping the web
One simple way to parse HTML is to use regular expressions to
repeatedly search for and extract substrings that match a particular
pattern.

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>

We can construct a well-formed regular expression to match and extract the link values from the above text as follows:

href="http[s]?://.+?"

The regular expression searches for strings starting with “href="http://” or “href="https://”.
It is followed by one or more characters (denoted by .+?), ending with another double quote.


The [s]? indicates that the match can include zero or one “s” after
“http”.
The ? added after .+ makes the match non-greedy, meaning it seeks the smallest possible matching string.
In contrast, a greedy match would look for the largest possible
matching string.
Parentheses are used in the regular expression to specify which
part of the matched string to extract.
This results in a program designed to extract specific parts of the
matched strings.

# Search for link values within URL input
import urllib.request, urllib.parse, urllib.error
import re
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
links = re.findall(b'href="(http[s]?://.*?)"', html)
for link in links:
    print(link.decode())

# Code: https://www.py4e.com/code3/urlregex.py


The ssl library enables the program to access websites that enforce
HTTPS.
The read() method returns the HTML source code as a bytes
object instead of an HTTPResponse object.
The findall regular expression method provides a list of all strings
that match the regular expression.
It returns only the link text found between the double quotes.
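A quick sketch (the sample string is invented) shows what findall returns, and how the non-greedy .*? differs from a greedy .*:

import re

text = ('<a href="http://www.dr-chuck.com/page2.htm">Two</a> '
        '<a href="https://www.py4e.com/">Py4E</a>')

# Non-greedy: each match stops at the first closing quote
print(re.findall('href="(http[s]?://.*?)"', text))
# ['http://www.dr-chuck.com/page2.htm', 'https://www.py4e.com/']

# Greedy: the match runs to the last quote, swallowing both links into one string
print(re.findall('href="(http[s]?://.*)"', text))
# ['http://www.dr-chuck.com/page2.htm">Two</a> <a href="https://www.py4e.com/']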

Regular expressions work very nicely when your HTML is well formatted and predictable. But since there are a lot of “broken” HTML pages out there, a solution only using regular expressions might either miss some valid links or end up with bad data.
This can be solved by using a robust HTML parsing library.

Parsing HTML using BeautifulSoup


Although HTML resembles XML and some pages are constructed as XML, most HTML found on the web is flawed.
These flaws can cause an XML parser to reject the entire HTML page as improperly formed.
Several Python libraries are available for parsing HTML and extracting data from web pages.
Each library has its own strengths and weaknesses, allowing users to choose based on specific needs.
As an example, the BeautifulSoup library can be used to parse HTML input and extract links.
BeautifulSoup is particularly forgiving of flawed HTML, making it easier to extract the required data (a short sketch at the end of this section illustrates this).
The BeautifulSoup library can be downloaded and installed from
its respective source.

https://pypi.python.org/pypi/beautifulsoup4

Information on installing BeautifulSoup with the Python Package Index tool pip is available at:

https://packaging.python.org/tutorials/installing-packages/

We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.

# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file


import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))

# Code: https://www.py4e.com/code3/urllinks.py
The program prompts for a web address, then opens the web page,
reads the data and passes the data to the BeautifulSoup parser, and
then retrieves all of the anchor tags and prints out the href attribute for
each tag.
You can also use BeautifulSoup to pull out various parts of each tag:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")

# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    # Look at the parts of a tag
    print('TAG:', tag)
    print('URL:', tag.get('href', None))
    print('Contents:', tag.contents[0])
    print('Attrs:', tag.attrs)

# Code: https://www.py4e.com/code3/urllink2.py
Run the program as: python urllink2.py
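To see how forgiving BeautifulSoup is of flawed HTML, here is a small sketch (the broken fragment is invented) that parses markup with unclosed tags and still recovers the link:

from bs4 import BeautifulSoup

# Deliberately broken HTML: the <p> and <a> tags are never closed
broken = ('<h1>The First Page</h1><p>If you like, you can switch to the '
          '<a href="http://www.dr-chuck.com/page2.htm">Second Page')
soup = BeautifulSoup(broken, 'html.parser')
for tag in soup('a'):
    print(tag.get('href', None))   # http://www.dr-chuck.com/page2.htm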

Using Web Services

With the ease of retrieving and parsing documents over HTTP, it became common to create documents designed for program consumption rather than for display in browsers.
Two popular formats are commonly used for data exchange across
the web:


 eXtensible Markup Language (XML): An older format best suited for exchanging document-style data.
 JavaScript Object Notation (JSON): Typically used for exchanging dictionaries, lists, or other internal information between programs.
Both XML and JSON formats will be examined further.
eXtensible Markup Language – XML

XML looks very similar to HTML, but XML is more structured than HTML. Here is a sample of an XML document:

<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>


Parsing XML

Here is a simple application that parses some XML and extracts some
data elements from the XML:
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

# Code: https://www.py4e.com/code3/xml1.py

Triple single quotes (''') and triple double quotes (""") allow for the
creation of strings that span multiple lines.
The fromstring function converts the string representation of
XML into a “tree” of XML elements.
Once in a tree format, various methods can be called to extract
data from the XML string.
The find function searches through the XML tree to retrieve the
element that matches a specified tag.
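Continuing with the same tree object from the program above, a small sketch of find() and get() on the phone element (the variable name phone is mine):

phone = tree.find('phone')             # the <phone> element
print('Type:', phone.get('type'))      # attribute value: intl
print('Number:', phone.text.strip())   # element text with surrounding whitespace removed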

Looping through nodes

Often the XML has multiple nodes and we need to write a loop to process all of the nodes. In the following program, we loop through all of the user nodes:
import xml.etree.ElementTree as ET

input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(input)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))

# Code: https://www.py4e.com/code3/xml2.py
The findall method retrieves a Python list of subtrees that represent the user structures in the XML tree. Then we can write a for loop that looks at each of the user nodes, and prints the name and id text elements as well as the x attribute from the user node.
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7

JavaScript Object Notation – JSON

The JSON format was inspired by the object and array format
used in JavaScript.
However, since Python was created before JavaScript, Python’s
syntax for dictionaries and lists influenced JSON's syntax.
As a result, the format of JSON closely resembles a combination
of Python lists and dictionaries.
An example of JSON encoding is provided, which is roughly
equivalent to the simple XML mentioned earlier.

{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
  },
  "email" : {
    "hide" : "yes"
  }
}

There are some key differences between XML and JSON:


 In XML, attributes like “intl” can be added to tags, whereas in JSON, data is represented as key-value pairs.
 The XML “person” tag is replaced by outer curly braces in JSON.
Generally, JSON structures are simpler than XML because JSON has fewer capabilities.
However, JSON's advantage lies in its direct mapping to dictionaries and lists, which are common in many programming languages.
This makes JSON a natural format for data exchange between cooperating programs.
JSON is rapidly becoming the preferred format for data exchange between applications due to its simplicity compared to XML.
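To see how directly JSON maps onto Python structures, a short sketch (mine, not from the textbook) builds the equivalent nested dictionary and serializes it with json.dumps:

import json

person = {
    'name': 'Chuck',
    'phone': {'type': 'intl', 'number': '+1 734 303 4456'},
    'email': {'hide': 'yes'}
}
print(json.dumps(person, indent=2))   # prints JSON very much like the example above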

Parsing JSON

JSON is constructed by nesting dictionaries and lists as needed.


In the example, a list of users is represented, where each user is a
set of key-value pairs (i.e., a dictionary).
This results in a list of dictionaries.
The built-in json library is used in the program to parse the JSON
and read through the data.
Compared to equivalent XML data and code, JSON has less detail.
Users must know in advance that they are working with a list of
users, where each user is a set of key-value pairs.
JSON is more succinct, which is an advantage, but it is also less
self-describing, which can be seen as a disadvantage.
import json

data = '''
[
  { "id" : "001",
    "x" : "2",
    "name" : "Chuck"
  },
  { "id" : "009",
    "x" : "7",
    "name" : "Brent"
  }
]'''

info = json.loads(data)
print('User count:', len(info))

for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

# Code: https://www.py4e.com/code3/json2.py

Comparing the code for extracting data from parsed JSON and
XML shows some differences:
 The output from json.loads() is a Python list.
 This list is traversed using a for loop, with each item being a
Python dictionary.
After parsing the JSON, the Python index operator can be used
to extract specific pieces of data for each user.

There is no need to use the JSON library to navigate through the parsed JSON because the returned data consists of native Python structures.
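For example, parsing the nested person record shown earlier needs nothing beyond json.loads and ordinary dictionary indexing (a small sketch):

import json

data = '''{
  "name" : "Chuck",
  "phone" : { "type" : "intl", "number" : "+1 734 303 4456" },
  "email" : { "hide" : "yes" }
}'''
info = json.loads(data)                # a plain Python dictionary
print(info['name'])                    # Chuck
print(info['phone']['number'])         # +1 734 303 4456 - nested dict, native indexing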
