Module - 5: Python Application Programming
MODULE – 5
NETWORKED PROGRAMS
In this era of the internet, many situations require retrieving data from the web and processing it. In this section, we discuss the basics of network protocols and the Python libraries available for extracting data from the web.
Consider a situation: you try to read from a socket, but the program on the other end of the socket has not sent any data, so you must wait.
If the programs on both ends of the socket simply wait for data without sending anything, they will wait for a very long time.
So an important part of programs that communicate over the Internet is to have some sort of protocol. A protocol is a set of precise rules that determine
who will send a request, and for what purpose
what action is to be taken
what response is to be given
To send requests and receive responses, HTTP uses the GET and POST methods.
NOTE: To test all the programs in this section, you must be connected to internet.
Consider a simple program to retrieve the data from a web page. To understand the program given below, one should know the meaning of the terminology used in it.
AF_INET is an address family (IP) used to designate the type of addresses that your socket can communicate with. When you create a socket, you must specify its address family, and then you can use only addresses of that type with the socket.
SOCK_STREAM is a constant indicating the socket type (TCP). It works like a file stream and is the most reliable option over the network.
A port is a logical end-point. Port 80 is one of the most commonly used port numbers in the Transmission Control Protocol (TCP) suite; it is the default port for HTTP.
The command to retrieve the data must use CRLF (Carriage Return Line Feed) line endings, and it must end in \r\n\r\n (the blank line that marks the end of the request header in the protocol specification).
The encode() method applied to a string returns the bytes representation of that string. Instead of calling encode(), one can prefix the string literal with the character b for the same effect.
The decode() method returns a string decoded from the given bytes.
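A small sketch illustrating encode(), the b'...' prefix, and decode() (the request string shown is an example of the CRLF convention just described):

```python
# encode() converts a str to its bytes representation
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()

# prefixing the literal with b gives the same bytes directly
same = b'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'
print(cmd == same)        # True: the two byte strings are identical

# decode() converts bytes back to a str
text = b'Hello world'.decode()
print(text)               # Hello world
```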
A socket connection between the user program and the webpage is shown in Figure 5.1.
import socket

# connection setup (these lines are missing in the printed listing;
# the host and file are the ones used elsewhere in this module)
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if len(data) < 1:
        break
    print(data.decode(), end='')

mysock.close()
When we run the above program, we first get some header information from the web server of the site we are contacting, followed by the data written in that web page. In this program, we extract 512 bytes of data at a time (one can use any convenient number here). The extracted data is decoded and printed. When the length of the received data becomes less than one (that is, no more data is left on the web page), the loop terminates.
import socket
import time

# connection setup (missing in the printed listing; the image cover3.jpg
# on data.pr4e.org is the one referred to in the surrounding text)
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
mysock.sendall(b'GET http://data.pr4e.org/cover3.jpg HTTP/1.0\r\n\r\n')

count = 0
picture = b""                        # empty string in binary format
while True:
    data = mysock.recv(5120)         # retrieve 5120 bytes at a time
    if len(data) < 1:
        break
    count = count + len(data)
    print(count, len(data))          # cumulative byte count
    picture = picture + data

mysock.close()

pos = picture.find(b"\r\n\r\n")      # find end of the header (2 CRLF)
print('Header length', pos)
print(picture[:pos].decode())

# Skip past the header and save the picture data
picture = picture[pos+4:]
fhand = open('stuff.jpg', 'wb')
fhand.write(picture)
fhand.close()
When we run the above program, the amount of data (in bytes) retrieved from the internet is displayed in a cumulative format. At the end, the image file 'stuff.jpg' is stored in the current working directory (one can verify this by looking at the current working directory of the program).
Instead of talking to a web server directly over a socket, we can use the urllib library. Once the web page has been opened with urllib.request.urlopen(), we can treat it like a file and read through it using a for-loop. When the program runs, we see only the contents of the file in the output. The headers are still sent, but the urllib code consumes the headers and returns only the data to us.
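The read-through-a-for-loop pattern just described can be sketched as follows (the romeo.txt URL is the one used in the next example; running it requires an internet connection):

```python
import urllib.request

# urlopen returns a file-like object; iterating it yields one bytes line at a time
fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    print(line.decode().strip())    # decode bytes to str and strip the newline
```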
Following is a program to retrieve the data from the file romeo.txt, which resides at data.pr4e.org, and then count the number of words in it.

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
counts = dict()
# counting loop (reconstructed): tally each word using the standard
# dictionary get() idiom
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
print(counts)
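The paragraphs that follow describe a program that copies an image to disk in one shot; a minimal sketch (the file name cover3.jpg and host data.pr4e.org are taken from the surrounding text; running it requires an internet connection):

```python
import urllib.request

# read the entire image into memory at once...
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg').read()

# ...then write it out to disk in binary mode
fhand = open('cover3.jpg', 'wb')
fhand.write(img)
fhand.close()
```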
Once we execute the above program, we can see a file cover3.jpg in the current working directory in
our computer.
The program reads all of the data in at once across the network and stores it in the variable img in the
main memory of your computer, then opens the file cover3.jpg and writes the data out to your disk.
This will work if the size of the file is less than the size of the memory (RAM) of your computer.
However, if this is a large audio or video file, this program may crash or at least run extremely
slowly when your computer runs out of memory. In order to avoid memory overflow, we retrieve the
data in blocks (or buffers) and then write each block to your disk before retrieving the next block.
This way the program can read any size file without using up all of the memory you have in your
computer.
Following is another version of the above program, where the data is read in chunks and then stored onto the disk.
import urllib.request

# opening of the URL and the output file (reconstructed from the
# surrounding text, which names the file cover3.jpg)
img = urllib.request.urlopen('http://data.pr4e.org/cover3.jpg')
fhand = open('cover3.jpg', 'wb')
size = 0
while True:
    info = img.read(100000)      # read at most 100000 bytes per chunk
    if len(info) < 1:
        break
    size = size + len(info)
    fhand.write(info)

print(size, 'characters copied.')
fhand.close()
Once we run the above program, an image file cover3.jpg will be stored on to the current working
directory.
Google spiders its way through nearly all of the pages on the web. Google also uses the frequency
of links from pages it finds to a particular page as one measure of how “important” a page is and
how high the page should appear in its search results.
Here,
<h1> and </h1> are the beginning and end of a header tag
<p> and </p> are the beginning and end of a paragraph tag
<a> and </a> are the beginning and end of an anchor tag, which is used for giving links
href is the attribute of the anchor tag whose value is the link to another page.
The above information clearly indicates that if we want to extract all the hyperlinks in a webpage, we need a regular expression that matches the href attribute. Thus, we can create a regular expression as –
href="http://.+?"
Here, the question mark in .+? indicates that the match should find the smallest possible matching string (a non-greedy match).
Now, consider a Python program that uses the above regular expression to extract all hyperlinks from the webpage given as input.

import urllib.request
import re

# prompt, fetch and scan (reconstructed around the two surviving lines)
url = input('Enter - ')
html = urllib.request.urlopen(url).read()
links = re.findall(b'href="(http://.*?)"', html)
for link in links:
    print(link.decode())
When we run this program, it prompts for user input. We need to give a valid URL of any website.
Then all the hyperlinks on that website will be displayed.
The BeautifulSoup library, which can be downloaded and installed from http://www.crummy.com/software/, makes parsing HTML much easier. Consider the following program, which uses urllib to read the page and uses BeautifulSoup to extract the href attribute from each anchor tag.
import urllib.request
from bs4 import BeautifulSoup
import ssl   # Secure Socket Layer

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# remainder of the listing reconstructed from the description below
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags and print the href of each
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
The above program prompts for a web address, then opens the web page, reads the data and passes
the data to the BeautifulSoup parser, and then retrieves all of the anchor tags and prints out the
href attribute for each tag.
The BeautifulSoup can be used to extract various parts of each tag as shown below –
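A sketch of pulling the various parts out of a tag (the markup string here is a made-up example):

```python
from bs4 import BeautifulSoup

html = '<p><a href="http://example.com">Example</a></p>'   # sample markup
soup = BeautifulSoup(html, 'html.parser')

tag = soup('a')[0]                      # the first (and only) anchor tag
print('TAG:', tag)                      # the whole tag
print('URL:', tag.get('href', None))    # value of the href attribute
print('Contents:', tag.contents[0])     # text between <a> and </a>
print('Attrs:', tag.attrs)              # all attributes as a dictionary
```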
XML is best suited for exchanging document-style data. When programs just want to exchange
dictionaries, lists, or other internal information with each other, they use JavaScript Object Notation
or JSON (refer www.json.org). We will look at both formats.
<person>
<name>Chuck</name>
<phone type="intl">
+1 734 303 4456
</phone>
<email hide="yes"/>
</person>
Often it is helpful to think of an XML document as a tree structure, where there is a top tag person and other tags such as phone are drawn as children of their parent nodes. The corresponding figure shows the tree structure for the XML code given above.
Parsing XML
Python provides the library xml.etree.ElementTree to parse data from XML. One provides the XML code as a string to the fromstring() function of the ElementTree module. ElementTree acts as a parser and provides a set of relevant methods to extract the data, so the programmer need not know the rules and format of XML document syntax in detail. The fromstring() function converts the XML code into a tree structure of XML nodes. Once the XML is in tree form, Python provides several methods to extract data from it. Consider the following program.
import xml.etree.ElementTree as ET

# the XML string from the example above
data = '''
<person>
  <name>Chuck</name>
  <phone type="intl">+1 734 303 4456</phone>
  <email hide="yes"/>
</person>'''

tree = ET.fromstring(data)
print('Name:', tree.find('name').text)
print('Attribute for tag email:', tree.find('email').get('hide'))
print('Attribute for tag phone:', tree.find('phone').get('type'))
In the above example, fromstring() is used to convert the XML code into a tree. The find() method searches the XML tree and retrieves a node that matches the specified tag. The get() method retrieves the value associated with the specified attribute of that tag. Each node can have some text, some attributes (like hide), and some "child" nodes. Each node can be the parent of a tree of nodes.
import xml.etree.ElementTree as ET

# sample XML with a list of users (the values mirror the JSON
# example given in the next section)
input = '''
<stuff>
  <users>
    <user x="2">
      <id>001</id>
      <name>Chuck</name>
    </user>
    <user x="7">
      <id>009</id>
      <name>Chuck</name>
    </user>
  </users>
</stuff>'''

stuff = ET.fromstring(input)
lst = stuff.findall('users/user')
print('User count:', len(lst))

for item in lst:
    print('Name:', item.find('name').text)
    print('Id:', item.find('id').text)
    print('Attribute:', item.get('x'))
The findall() method retrieves a Python list of subtrees that represent the user structures in the XML
tree. Then we can write a for-loop that extracts each of the user nodes, and prints the name and id,
which are text elements as well as the attribute x from the user node.
{
  "name" : "Chuck",
  "phone" : {
    "type" : "intl",
    "number" : "+1 734 303 4456"
  },
  "email" : {
    "hide" : "yes"
  }
}
In general, JSON structures are simpler than XML because JSON has fewer capabilities than XML.
But JSON has the advantage that it maps directly to some combination of dictionaries and lists. And
since nearly all programming languages have something equivalent to Python’s dictionaries and
lists, JSON is a very natural format to have two compatible programs exchange data. JSON is
quickly becoming the format of choice for nearly all data exchange between applications because of
its relative simplicity compared to XML.
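The direct mapping to dictionaries can be checked with a short sketch that parses the JSON snippet shown above:

```python
import json

data = '''{
  "name" : "Chuck",
  "phone" : {"type" : "intl", "number" : "+1 734 303 4456"},
  "email" : {"hide" : "yes"}
}'''

info = json.loads(data)          # JSON object -> Python dictionary
print(info['name'])              # ordinary dictionary indexing from here on
print(info['phone']['type'])
print(info['email']['hide'])
```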
Parsing JSON
Python provides a module json to parse data in JSON form. Consider the following program, which uses the JSON equivalent of the XML string written in Section 5.2.3. Note that the JSON string has to embed a list of dictionaries.
import json
data = ''' [
{ "id" : "001",
"x" : "2",
"name" : "Chuck"
},
{ "id" : "009",
"x" : "7",
"name" : "Chuck"
}
]'''

# parsing loop (reconstructed from the description below)
info = json.loads(data)
print('User count:', len(info))
for item in info:
    print('Name:', item['name'])
    print('Id:', item['id'])
    print('Attribute:', item['x'])
Here, the string data contains a list of users, where each user is a set of key-value pairs. The loads() method in the json module converts the string into a list of dictionaries. From then on, we don't need anything more from json, because the parsed data is available as native Python structures. Using a for-loop, we can iterate through the list of dictionaries and extract every element as if it were a dictionary object; that is, we use the index operator (a pair of square brackets) to extract the value for a particular key.
NOTE: The current IT industry trend is to use JSON rather than XML for web services, because JSON is simpler than XML and maps directly onto native data structures we already have in programming languages. This makes parsing and data extraction simpler compared to XML. But XML is more self-descriptive than JSON, so there are some applications where XML retains an advantage. For example, most word processors store documents internally using XML rather than JSON.
When we build programs whose functionality includes access to services provided by other programs, we call the approach a Service-Oriented Architecture (SOA). An SOA approach is one where our overall application makes use of the services of other applications. A non-SOA approach is a single stand-alone application which contains all of the code necessary to implement it.
Consider an example of SOA: Through a single website, we can book flight tickets and hotels. The
data related to hotels is not stored in the airline servers. Instead, airline servers contact the services
on hotel servers and retrieve the data from there and present it to the user. When the user agrees to
make a hotel reservation using the airline site, the airline site uses another web service on the hotel
systems to actually make the reservation. Similarly, to reach the airport, we may book a cab through a cab rental service. And when it comes time to charge your credit card for the whole transaction, still other computers become involved in the process. This process is depicted in Figure 5.3.
To realize these benefits, an SOA system must be carefully designed to have good performance and to meet the user's needs. When an application makes a set of services in its API available over the web, we call these web services.
The following program asks the user for the name of a location to search for, calls the Google geocoding API, and extracts information from the returned JSON.

import urllib.request, urllib.parse
import json

# the classic geocoding endpoint (reconstructed; newer versions of
# this API require an API key)
serviceurl = 'http://maps.googleapis.com/maps/api/geocode/json?'

address = input('Enter location: ')
url = serviceurl + urllib.parse.urlencode({'address': address})
data = urllib.request.urlopen(url).read().decode()

try:
    js = json.loads(data)
except:
    js = None

if not js or js.get('status') != 'OK':
    print('==== Failure To Retrieve ====')
else:
    print(json.dumps(js, indent=4))
    lat = js["results"][0]["geometry"]["location"]["lat"]
    lng = js["results"][0]["geometry"]["location"]["lng"]
    print('lat', lat, 'lng', lng)
    location = js['results'][0]['formatted_address']
    print(location)
(Students are advised to run the above program and check the output, which will contain
several lines of Google geographical data).
The above program reads the search string and URL-encodes it. This encoded string, appended to the Google API link, forms the URL used to fetch the data from the internet. The retrieved data is then passed to json.loads() to convert it into a JSON object. If the input string (which must be an existing geographical location like Channasandra, Malleshwaram, etc.) cannot be located by the Google API, either due to a bad internet connection or an unknown location, we simply display the message 'Failure to Retrieve'. If Google successfully identifies the location, we dump the data as a JSON object. Then, using indexing on the JSON object (as it is in the form of a dictionary), we can retrieve the location address, longitude, latitude, etc.
Sometimes, a vendor wants more security and expects the user to provide cryptographically signed messages using shared keys and secrets. The most common protocol used on the internet for signing requests is OAuth.
As the Twitter API became increasingly valuable, Twitter went from an open and public API to an
API that required the use of OAuth signatures on each API request. But, there are still a number of
convenient and free OAuth libraries so you can avoid writing an OAuth implementation from scratch
by reading the specification. These libraries are of varying complexity and have varying degrees of
richness. The OAuth web site has information about various OAuth libraries.
There are many database management systems such as Oracle, MySQL, Microsoft SQL Server, PostgreSQL and SQLite. They are designed to insert and retrieve data very quickly, however big the dataset is. Database software builds indexes as data is added to the database, so as to provide quicker access to a particular entry.
In this course of study, SQLite is used because it is already built into Python. SQLite is a C library
that provides a lightweight disk-based database that doesn’t require a separate server process and
allows accessing the database using a non-standard variant of the SQL query language. SQLite is
designed to be embedded into other applications to provide database support within the application.
For example, the Firefox browser also uses the SQLite database internally. SQLite is well suited to
some of the data manipulation problems in Informatics such as the Twitter spidering application etc.
Database Concepts
At first look, a database seems to be a spreadsheet with multiple sheets. The primary data structures in a database are tables, rows, and columns. In relational database terminology, tables, rows, and columns are referred to as relation, tuple, and attribute respectively. The typical structure of a database table is as shown below. A table may consist of n attributes and m tuples (or records). Every tuple gives the information about one individual, and every cell (i, j) in the table holds the value of the jth attribute for the ith tuple.
Consider the problem of storing details of students in a database table. The format may look like –
There are some clauses like FROM, WHERE, ORDER BY, INNER JOIN etc. that are used with SQL commands, which we will study in due course. The following table gives a few of the SQL commands.
Command Meaning
CREATE DATABASE creates a new database
As mentioned earlier, every RDBMS has its own way of storing data in tables, and each RDBMS uses its own set of data types for attribute values. SQLite uses the data types mentioned in the following table –
REAL   The value is a floating-point value, stored as an 8-byte floating-point number
TEXT   The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE)
BLOB   The value is a blob (Binary Large Object) of data, stored exactly as it was input
Note that SQL commands are case-insensitive. But it is a common practice to write commands and clauses in uppercase letters just to differentiate them from table names and attribute names.
Now, let us see some of the examples to understand the usage of SQL statements –
CREATE TABLE Tracks (title TEXT, plays INTEGER)
This command creates a table called Tracks with the attributes title and plays,
where title can store data of type TEXT and plays can store data of type INTEGER.
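The effect of this command can be checked from Python itself; a sketch using a throwaway in-memory database (sqlite_master is SQLite's built-in catalogue of tables):

```python
import sqlite3

conn = sqlite3.connect(':memory:')    # temporary database, lives only in RAM
cur = conn.cursor()
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# sqlite_master lists every table in the database
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = cur.fetchall()
print(tables)                         # [('Tracks',)]
conn.close()
```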
Using this browser, one can easily create tables, insert data, edit data, or run simple SQL queries on
the data in the database. This database browser is similar to a text editor when working with text
files. When you want to do one or very few operations on a text file, you can just open it in a text
editor and make the changes you want. When you have many changes that you need to do to a text
file, often you will write a simple Python program. You will find the same pattern when working
with databases. You will do simple operations in the database manager and more complex operations
will be most conveniently done in Python.
Ex1.
import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

# remainder of the listing reconstructed from the description below
cur.execute('DROP TABLE IF EXISTS Tracks')
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

conn.close()
The connect() method of sqlite3 makes a "connection" to the database stored in the file music.sqlite in the current directory. If the file does not exist, it is created. Sometimes the database is stored on a different database server from the server on which our program runs, but all the examples we consider here use a local file in the current working directory of the Python code.
A cursor is like a file handle that we can use to perform operations on the data stored in the database. Calling cursor() is conceptually very similar to calling open() when dealing with text files. Once we have a cursor, we can execute commands on the contents of the database using the execute() method.
In the above program, we first remove the database table Tracks if it already exists in the database. The DROP TABLE command deletes the table along with all its rows and columns; this step helps avoid the error of trying to create a table with the same name. Then we create a table named Tracks with two columns: title, which can hold TEXT data, and plays, which can hold INTEGER data. Once our job with the database is over, we close the connection using the close() method.
In the previous example, we just created a table but did not insert any records into it. So consider the program given below, which creates a table, inserts two rows, and finally deletes records based on a condition.
Ex2.
import sqlite3

conn = sqlite3.connect('music.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS Tracks')
cur.execute('CREATE TABLE Tracks (title TEXT, plays INTEGER)')

# first record: values written directly inside the SQL string
# (sample title as in the standard version of this example)
cur.execute("INSERT INTO Tracks (title, plays) VALUES ('Thunderstruck', 20)")
# second record: question marks as placeholders for the values
cur.execute("INSERT INTO Tracks (title, plays) VALUES (?, ?)", ('My Way', 15))
conn.commit()

print('Tracks:')
cur.execute('SELECT title, plays FROM Tracks')
for row in cur:
    print(row)

cur.execute('DELETE FROM Tracks WHERE plays < 100')
conn.commit()
conn.close()
In the above program, the first record is inserted with an SQL command in which the values are written directly inside the SQL string. Note that execute() requires the SQL command to be a string. But if a value to be stored in the table is itself a string (TEXT type), there may be a conflict of string representation using quotes. Hence, in this example, the entire SQL command is written within double quotes and the value to be inserted within single quotes. If we would like to use either single quotes or double quotes everywhere, we need escape sequences like \' or \".
While inserting the second row, the SQL statement uses a slightly different syntax –
"INSERT INTO Tracks (title, plays) VALUES (?, ?)", ('My Way', 15)
Here, each question mark acts as a placeholder for a particular value. This syntax is useful when we want to pass user-input values into a database table.
After inserting two rows, we must use commit() method to store the inserted records permanently on
the database table. If this method is not applied, then the insertion (or any other statement execution)
will be temporary and will affect only the current run of the program.
Later, we use SELECT command to retrieve the data from the table and then use for-loop to display
all records. When data is retrieved from database using SELECT command, the cursor object gets
those data as a list of records. Hence, we can use for-loop on the cursor object. Finally, we have used
a DELETE command to delete all the records WHERE plays is less than 100.
Ex3.
import sqlite3
from sqlite3 import Error

def create_connection():
    """Create a database connection to a database that resides in memory."""
    conn = None
    try:
        conn = sqlite3.connect(':memory:')
        print("SQLite Version:", sqlite3.version)
    except Error as e:
        print(e)
    finally:
        if conn:
            conn.close()

create_connection()
Ex4. Write a program to create a Student database with a table consisting of student name and age.
Read n records from the user and insert them into database. Write queries to display all records and
to display the students whose age is 20.
import sqlite3

conn = sqlite3.connect('StudentDB.db')
c = conn.cursor()

# table creation and input loop (reconstructed from the description below)
c.execute("DROP TABLE IF EXISTS tblStudent")
c.execute("CREATE TABLE tblStudent (name TEXT, age INTEGER)")

n = int(input("How many students? "))
for i in range(n):
    name = input("Enter name: ")
    age = int(input("Enter age: "))
    c.execute("INSERT INTO tblStudent (name, age) VALUES (?, ?)", (name, age))
conn.commit()

c.execute("SELECT * FROM tblStudent")
print(c.fetchall())

# students whose age is 20
c.execute("SELECT * FROM tblStudent WHERE age = 20")
print(c.fetchall())

conn.close()
In the above program we use a for-loop to get user input for each student's name and age, and these data are inserted into the table. Observe the question marks acting as placeholders for the user-input values. Later we use the fetchall() method to retrieve all the records from the table in the form of a list of tuples; each tuple is one record from the table.
A foreign key is usually a number that points to the primary key of an associated row in a
different table.
Consider a table consisting of student details like RollNo, name, age, semester and address as shown
below –
In this table, RollNo can be considered as a primary key because it is unique for every student in that
table. Consider another table that is used for storing marks of students in all the three tests as below
–
RollNo Sem M1 M2 M3
1 6 34 45 42.5
2 6 42.3 44 25
3 4 38 44 41.5
4 6 39.4 43 40
2 8 37 42 41
To save memory, this table can contain just the RollNo and the marks in all the tests. There is no need to store information like the name and age of the students, as this information can be retrieved from the first table. Here, RollNo is treated as a foreign key in the second table.
Consider the example of the Student database discussed in Section 5.3.5. We can create the student table with a CREATE TABLE command in which RollNo is declared as the primary key; by default, it will then be unique within that table. Another table, tblMarks, can then be created for the marks. In tblMarks, which holds the marks of 3 tests for all the students, RollNo and sem are together unique: in one semester, only one student can have a particular RollNo, whereas in another semester the same RollNo may appear again.
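The two tables described above can be sketched as follows (the column names come from the text; the exact types and key declarations are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(':memory:')    # in-memory database for illustration
c = conn.cursor()

# Student details: RollNo alone identifies a student
c.execute('''CREATE TABLE tblStudent (
                 RollNo  INTEGER PRIMARY KEY,
                 name    TEXT,
                 age     INTEGER,
                 sem     INTEGER,
                 address TEXT)''')

# Marks: RollNo is a foreign key; (RollNo, sem) together are unique
c.execute('''CREATE TABLE tblMarks (
                 RollNo INTEGER,
                 sem    INTEGER,
                 M1 REAL, M2 REAL, M3 REAL,
                 PRIMARY KEY (RollNo, sem),
                 FOREIGN KEY (RollNo) REFERENCES tblStudent(RollNo))''')

c.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
tables = [row[0] for row in c.fetchall()]
print(tables)                         # ['tblMarks', 'tblStudent']
conn.close()
```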
Such types of relationships are established between various tables in RDBMS and that will help
better management of time and space.
Consider the following program which creates two tables tblStudent and tblMarks as discussed in the
previous section. Few records are inserted into both the tables. Then we extract the marks of students
who are studying in 6th semester.
import sqlite3

conn = sqlite3.connect('StudentDB.db')
c = conn.cursor()

# table creation and sample records (reconstructed; marks taken from the
# table shown earlier, student names are made up for illustration)
c.execute('DROP TABLE IF EXISTS tblStudent')
c.execute('DROP TABLE IF EXISTS tblMarks')
c.execute('CREATE TABLE tblStudent (RollNo INTEGER, name TEXT, age INTEGER, sem INTEGER)')
c.execute('CREATE TABLE tblMarks (RollNo INTEGER, sem INTEGER, M1 REAL, M2 REAL, M3 REAL)')

c.execute("INSERT INTO tblStudent VALUES (?, ?, ?, ?)", (1, 'Anil', 21, 6))
c.execute("INSERT INTO tblStudent VALUES (?, ?, ?, ?)", (2, 'Bina', 20, 6))
c.execute("INSERT INTO tblMarks VALUES (?, ?, ?, ?, ?)", (1, 6, 34, 45, 42.5))
c.execute("INSERT INTO tblMarks VALUES (?, ?, ?, ?, ?)", (2, 6, 42.3, 44, 25))
conn.commit()

# join the two tables on RollNo and sem, keeping only 6th-semester students
c.execute('''SELECT s.name, m.M1, m.M2, m.M3
             FROM tblStudent s INNER JOIN tblMarks m
             ON s.RollNo = m.RollNo AND s.sem = m.sem
             WHERE s.sem = 6''')
for row in c.fetchall():
    print(row)

conn.close()
The query joins the two tables and extracts the records where RollNo and sem match in both tables, and sem is 6.