DATA VISUALIZATION
21AD71
MODULE 5
Networked programs
Perhaps the easiest way to show how the HTTP protocol works is to
write a very simple Python program that makes a connection to a web
server and follows the rules of the HTTP protocol to request a
document and display what the server sends back.
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()
HTTP/1.1 200 OK
The output begins with headers sent by the web server to describe
the document.
Example: Content-Type header indicates that the document is
plain text (text/plain).
After sending headers, the server adds a blank line to indicate the
end of the headers.
The actual data (e.g., the file romeo.txt) is sent after the headers.
Demonstrates a low-level network connection using sockets.
Sockets can be used to communicate with various servers,
including web and mail servers.
Protocols guide the communication; knowing the protocol allows
sending and receiving data.
Python provides a special HTTP library, urllib, for retrieving documents over the web.
HTTP protocol requires sending and receiving data as bytes
objects.
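The bytes requirement can be seen in a short sketch: strings must be encoded before they are sent, and received bytes decoded before they are printed.

```python
# str -> bytes before sending over the socket
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'
request = cmd.encode()
print(type(request))        # <class 'bytes'>

# bytes -> str after receiving from the socket
reply = b'HTTP/1.1 200 OK'
print(reply.decode())       # HTTP/1.1 200 OK
```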
import socket
import time

HOST = 'data.pr4e.org'
PORT = 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((HOST, PORT))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')
count = 0
picture = b""

while True:
    data = mysock.recv(5120)
    if len(data) < 1:
        break
    # time.sleep(0.25)
    count = count + len(data)
    print(len(data), count)
    picture = picture + data

mysock.close()

# Skip past the HTTP headers (which end with a blank line)
# and save the image data to disk
pos = picture.find(b"\r\n\r\n")
picture = picture[pos + 4:]
fhand = open("cover.jpg", "wb")
fhand.write(picture)
fhand.close()
The final recv() call returns 3167 bytes, indicating the end of the
stream.
A subsequent recv() call returns a zero-length string, signaling
the server has closed its socket.
Introducing a time.sleep() delay (e.g., a quarter of a second) can
slow down the recv() calls, allowing the server more time to send
data.
With the delay, the server can send more data before the next
recv() call.
A urllib-based version of the program reads all of the data from the network at once and stores it in the img variable in main memory. It then opens the file cover.jpg and writes the data to disk.
The wb argument in open() indicates that the file is opened for
writing in binary mode.
This approach works well if the file size is less than the computer's
memory capacity.
For large files (like audio or video), this method may cause the
program to crash or run slowly due to memory exhaustion.
To prevent running out of memory, data is retrieved in blocks (or
buffers).
Each block is written to disk before retrieving the next, allowing
the program to handle files of any size without consuming all
available memory.
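The block-at-a-time pattern can be sketched without a network connection by using an in-memory io.BytesIO stream as a stand-in for the socket or urllib file handle (the 250,000-byte payload and 100,000-byte block size below are arbitrary choices for illustration):

```python
import io

# Stand-in for a network stream; a real program would get this
# from urllib.request.urlopen() or a socket instead.
source = io.BytesIO(b"x" * 250000)
dest = io.BytesIO()  # stand-in for a file opened with open('cover.jpg', 'wb')

size = 0
while True:
    block = source.read(100000)   # retrieve at most 100,000 bytes
    if len(block) < 1:            # an empty read signals end of stream
        break
    size = size + len(block)
    dest.write(block)             # write each block before reading the next

print(size, 'characters copied.')  # 250000 characters copied.
```

Only one block is ever held in memory at a time, so the same loop handles files of any size.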
The [s]? indicates that the match can include zero or one “s” after
“http”.
The ? after .+? signifies a non-greedy match, meaning it seeks the
smallest possible matching string.
In contrast, a greedy match would look for the largest possible
matching string.
Parentheses are used in the regular expression to specify which
part of the matched string to extract.
This results in a program designed to extract specific parts of the
matched strings.
The ssl library enables the program to access websites that enforce
HTTPS.
The read() method returns the HTML source code as a bytes
object instead of an HTTPResponse object.
The findall regular expression method provides a list of all strings
that match the regular expression.
It returns only the link text found between the double quotes.
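These notes do not show the regular expression itself; in the py4e materials it is href="(http[s]?://.+?)". A short offline sketch, using a hard-coded string in place of a downloaded page, shows the non-greedy versus greedy behavior:

```python
import re

# Hard-coded sample HTML standing in for a page fetched over the network
html = '<a href="https://www.py4e.com">One</a> <a href="http://data.pr4e.org">Two</a>'

# Non-greedy .+? stops at the first closing double quote,
# so each link is captured separately
print(re.findall('href="(http[s]?://.+?)"', html))
# ['https://www.py4e.com', 'http://data.pr4e.org']

# Greedy .+ runs to the last closing double quote,
# merging both links into a single match
print(re.findall('href="(http[s]?://.+)"', html))
```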
BeautifulSoup can be downloaded from:
https://pypi.python.org/pypi/beautifulsoup4
Information on installing BeautifulSoup with the Python Package Index tool pip is available at:
https://packaging.python.org/tutorials/installing-packages/
We will use urllib to read the page and then use BeautifulSoup to extract the href attributes from the anchor (a) tags.
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file
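As a sketch, assuming the bs4 package is installed, the anchor-tag extraction can be tried on a small hard-coded page (the URLs below are made-up examples) before pointing it at a real page fetched with urllib:

```python
from bs4 import BeautifulSoup

# Hard-coded sample HTML standing in for a page read with urllib
html = '''
<html><body>
<a href="http://www.example.com/page1.htm">First</a>
<a href="http://www.example.com/page2.htm">Second</a>
</body></html>'''

soup = BeautifulSoup(html, 'html.parser')
for tag in soup('a'):                # soup('a') retrieves all anchor tags
    print(tag.get('href', None))
```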
XML looks very similar to HTML, but XML is more structured than HTML. Here is a sample of an XML document:

<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>
Parsing XML
Here is a simple application that parses some XML and extracts some
data elements from the XML:
import xml.etree.ElementTree as ET

data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">
    +1 734 303 4456
  </phone>
  <email hide="yes" />
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attr:', tree.find('email').get('hide'))

# Code: https://www.py4e.com/code3/xml1.py
PROFESSOR S.VINUTHA RNSIT CSE-DATA SCIENCE 18
Triple single quotes (''') and triple double quotes (""") allow for the
creation of strings that span multiple lines.
The fromstring function converts the string representation of
XML into a “tree” of XML elements.
Once in a tree format, various methods can be called to extract
data from the XML string.
The find function searches through the XML tree to retrieve the
element that matches a specified tag.
Often the XML has multiple nodes and we need to write a loop to process all of the nodes. In the following program, we loop through all of the user nodes:
import xml.etree.ElementTree as ET

input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Brent</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(input)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name', item.find('name').text)
    print('Id', item.find('id').text)
    print('Attribute', item.get('x'))

# Code: https://www.py4e.com/code3/xml2.py
The findall method retrieves a Python list of subtrees that represent the user structures in the XML tree. Then we can write a for loop that looks at each of the user nodes, and prints the name and id text elements as well as the x attribute from the user node.
User count: 2
Name Chuck
Id 001
Attribute 2
Name Brent
Id 009
Attribute 7
The JSON format was inspired by the object and array format
used in JavaScript.
However, since Python was created before JavaScript, Python’s
syntax for dictionaries and lists influenced JSON's syntax.
As a result, the format of JSON closely resembles a combination
of Python lists and dictionaries.
An example of JSON encoding is provided, which is roughly
equivalent to the simple XML mentioned earlier.
{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
  },
  "email" : {
    "hide" : "yes"
  }
}
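Parsed with Python's built-in json library, this structure becomes nested dictionaries; a minimal sketch, mirroring the XML example earlier:

```python
import json

data = '''{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
  },
  "email" : {
    "hide" : "yes"
  }
}'''

info = json.loads(data)                # JSON text -> Python dictionary
print('Name:', info['name'])           # Name: Chuck
print('Hide:', info['email']['hide'])  # Hide: yes
```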
Parsing JSON
import json

data = '''
[
  { "id" : "001",
    "x" : "2",
    "name" : "Chuck"
  },
  { "id" : "009",
    "x" : "7",
    "name" : "Brent"
  }
]'''

info = json.loads(data)
print('User count:', len(info))

for item in info:
    print('Name', item['name'])
    print('Id', item['id'])
    print('Attribute', item['x'])

# Code: https://www.py4e.com/code3/json2.py
Comparing the code for extracting data from parsed JSON and
XML shows some differences:
The output from json.loads() is a Python list.
This list is traversed using a for loop, with each item being a
Python dictionary.
After parsing the JSON, the Python index operator can be used
to extract specific pieces of data for each user.