Data Science - III

Uploaded by

VENKATESHWARLU

UNIT - I

Data Analysis and Types of Data Analysis

Data analysis is the process of cleaning, transforming, and modeling data to
discover useful information for business decision-making. Its purpose is to extract
useful information from data and to make decisions based on that information.

Data Analysis Tools

Common data analysis tools include Python, R, SQL, Java, and MATLAB.

Types of Data Analysis: Techniques and Methods


Several types of data analysis techniques exist, depending on the business and the
technology involved. The major data analysis methods are:

• Text Analysis
• Statistical Analysis
• Diagnostic Analysis
• Predictive Analysis
• Prescriptive Analysis

Text Analysis
Text Analysis is also referred to as Data Mining. It is a method of discovering
patterns in large data sets using databases or data mining tools, and it is used to
transform raw data into business information. Business Intelligence tools available
in the market are used to make strategic business decisions. Overall, it offers a way
to extract and examine data, derive patterns, and finally interpret the data.

Statistical Analysis
Statistical Analysis shows “What happened?” by using past data, often in the form
of dashboards. It includes the collection, analysis, interpretation, presentation,
and modeling of data, and it analyses either a complete set of data or a sample of
it. There are two categories of this type of analysis: Descriptive Analysis and
Inferential Analysis.

Descriptive Analysis
Descriptive Analysis analyses complete data or a sample of summarized numerical
data. It shows the mean and deviation for continuous data, and the percentage and
frequency for categorical data.

Inferential Analysis
Inferential Analysis analyses a sample drawn from the complete data. In this type of
analysis, you can reach different conclusions from the same data by selecting
different samples.
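
As a rough illustration of the descriptive measures named above, the sketch below computes the mean and standard deviation of a continuous variable and the frequency and percentage of a categorical one, using only Python's standard library. The sample values (salaries, departments) are invented for this example and do not come from any real data set.

```python
import statistics
from collections import Counter

# Hypothetical sample data, invented for this illustration
salaries = [32000, 38000, 41000, 38000, 51000]      # continuous data
departments = ["HR", "IT", "IT", "Sales", "IT"]     # categorical data

# Continuous data: mean and (sample) standard deviation
mean_salary = statistics.mean(salaries)
std_salary = statistics.stdev(salaries)

# Categorical data: frequency and percentage of each category
frequency = Counter(departments)
percentage = {name: 100 * count / len(departments)
              for name, count in frequency.items()}

print("Mean:", mean_salary)
print("Standard deviation:", round(std_salary, 2))
print("Frequency:", dict(frequency))
print("Percentage:", percentage)
```

Selecting a different subset of salaries and repeating the calculation gives different numbers, which is the point made above about inferential analysis depending on the sample chosen.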

Diagnostic Analysis
Diagnostic Analysis shows “Why did it happen?” by finding the cause behind the
insights found in Statistical Analysis. This analysis is useful for identifying
behavior patterns in data. If a new problem arises in your business process, you can
look into this analysis to find similar patterns for that problem, and you may be
able to apply similar prescriptions to the new problem.

Predictive Analysis
Predictive Analysis shows “What is likely to happen?” by using previous data. A
simple example: if last year I bought two dresses based on my savings, and this year
my salary doubles, then I can buy four dresses. Of course, it is not that easy,
because you have to consider other circumstances: clothes prices may rise this year,
or instead of dresses you may want to buy a new bike, or you may need to buy a house!

So this analysis makes predictions about future outcomes based on current or past
data. A forecast is just an estimate; its accuracy depends on how much detailed
information you have and how deeply you dig into it.

Prescriptive Analysis
Prescriptive Analysis combines the insights from all the previous analyses to
determine which action to take for a current problem or decision. Most data-driven
companies use Prescriptive Analysis because predictive and descriptive analysis
alone are not enough to improve data performance. Based on current situations and
problems, they analyze the data and make decisions.

Data Analysis Process / Data Analysis Sequence


The Data Analysis Process is the gathering of information using a suitable
application or tool that allows you to explore the data and find patterns in it.
Based on that information and data, you can make decisions or draw final
conclusions.
Data Analysis consists of the following phases, in sequence:

• Data Requirement Gathering
• Data Collection
• Data Cleaning
• Data Analysis
• Data Interpretation
• Data Visualization

Data Requirement Gathering → Data Collection → Data Cleaning → Data Analysis →
Data Interpretation → Data Visualization

Figure: Data Analysis Sequence


Data Requirement Gathering
First of all, think about why you want to do this data analysis: you need to find
out the purpose or aim of the analysis. You also have to decide which type of data
analysis you want to do. In this phase, you decide what to analyze and how to
measure it; you have to understand why you are investigating and which measures you
will use for the analysis.

Data Collection
After requirement gathering, you will have a clear idea of what you have to measure
and what your findings should be. Now it is time to collect your data based on the
requirements. Once you have collected your data, remember that it must be processed
or organized for analysis. Because you collect data from various sources, you must
keep a log with the collection date and the source of the data.

Data Cleaning
The collected data may be useless or irrelevant to the aim of your analysis, so it
should be cleaned. It may contain duplicate records, white spaces, or errors. The
data should be cleaned and made error free. This phase must be done before analysis
because the better the data cleaning, the closer the output of your analysis will be
to your expected outcome.
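
The three problems named above (duplicate records, white spaces, and errors) can be handled with a few lines of plain Python. The following is only a minimal sketch: the record layout (a name and a numeric age separated by a comma) and the sample rows are assumptions made for the illustration.

```python
# Hypothetical raw records, invented for this illustration
raw_records = [
    "  Lata , 22 ",
    "Anil,21",
    "Lata,22",        # duplicate of the first record once white space is removed
    "John, twenty",   # error: the age is not a number
    "",               # empty line
]

def clean(records):
    """Return unique (name, age) tuples with white space and bad rows removed."""
    cleaned = []
    seen = set()
    for record in records:
        parts = [part.strip() for part in record.split(",")]
        if len(parts) != 2 or not parts[1].isdigit():
            continue                      # skip empty or malformed rows
        row = (parts[0], int(parts[1]))
        if row not in seen:               # skip duplicate records
            seen.add(row)
            cleaned.append(row)
    return cleaned

cleaned_records = clean(raw_records)
print(cleaned_records)
```

Whether a malformed row should be dropped, repaired, or reported depends on the aim of the analysis; dropping it silently is just the simplest choice for a sketch.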

Data Analysis
Once the data is collected, cleaned, and processed, it is ready for analysis. As you
manipulate the data, you may find that you have exactly the information you need, or
that you need to collect more data. During this phase, you can use data analysis
tools and software to help you understand, interpret, and derive conclusions based
on the requirements.

Data Interpretation
After analyzing your data, it is finally time to interpret your results. You can
choose how to express or communicate your data analysis: simply in words, or with a
table or chart. Then use the results of your data analysis process to decide your
best course of action.

Data Visualization
Data visualizations are very common in day-to-day life; they often appear in the
form of charts and graphs. In other words, data is shown graphically so that it is
easier for the human brain to understand and process. Data visualization is often
used to discover unknown facts and trends. By observing relationships and comparing
datasets, you can find meaningful information.

Data Acquisition Pipeline


Data acquisition is all about obtaining the artifacts that contain the input data from a
variety of sources, extracting the data from the artifacts, and converting it into
representations suitable for further processing, as shown in the following figure.

The three main sources of data are the Internet (namely, the World Wide Web),
databases, and local files (possibly previously downloaded by hand or using
additional software). Some of the local files may have been produced by other
Python programs and contain serialized or “pickled” data.
The formats of data in the artifacts may range widely. The most popular formats are:
• Unstructured plain text in a natural language (such as English or Chinese)
• Structured data, including:
   o Tabular data in comma separated values (CSV) files
   o Tabular data from databases
   o Tagged data in HyperText Markup Language (HTML) or, in general, in
     eXtensible Markup Language (XML)
   o Tagged data in JavaScript Object Notation (JSON)
Depending on the original structure of the extracted data and the purpose and
nature of further processing, the data are represented using native Python data
structures (lists and dictionaries) or advanced data structures that support
specialized operations (numpy arrays and pandas data frames).
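
To make the idea of "native Python data structures" concrete, here is a small sketch that turns CSV and JSON text into lists and dictionaries with the standard csv and json modules. The two input strings are invented stand-ins for data that would normally come from a file, a database, or the Web.

```python
import csv
import io
import json

# Hypothetical inputs, invented for this illustration
csv_text = "name,age\nLata,22\nAnil,21\n"
json_text = '{"course": "Data Science", "units": [1, 2, 3]}'

# Tabular CSV data -> list of dictionaries (one dictionary per row)
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON data -> nested dictionaries and lists
course = json.loads(json_text)

print(rows)
print(course["units"])
```

Note that the csv module returns every field as a string, so numeric columns such as age still have to be converted before analysis. Structures like these can later be loaded into numpy arrays or pandas data frames when specialized operations are needed.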

We attempt to keep the data processing pipeline (obtaining, cleaning, and
transforming raw data; descriptive and exploratory data analysis; and data modeling
and prediction) fully automated. For this reason, we avoid using interactive GUI
tools, as they can rarely be scripted to operate in a batch mode, and they rarely
record any history of operations. To promote modularity, reusability, and
recoverability, we will break a long pipeline into shorter sub-pipelines, saving
intermediate results into Pickle or JSON files, as appropriate.
Pipeline automation naturally leads to reproducible code: a set of Python scripts that
anyone can execute to convert the original raw data into the final results as
described in the report, ideally without any additional human interaction. Other
researchers can use reproducible code to validate your models and results and to
apply the process that you developed to their own problems.
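
One possible way to split a pipeline into sub-pipelines with a saved intermediate result is sketched below. The stage names (acquire, clean, analyse), the hard-coded raw data, and the file name cleaned.json are all hypothetical; a real project would read from its actual sources and keep the intermediate files in a project directory rather than a temporary one.

```python
import json
import os
import tempfile

def acquire():
    """Stage 1: obtain raw data (hard-coded here; a real stage would read a source)."""
    return [" 4 ", "8", "", "15"]

def clean(raw):
    """Stage 2: strip white space, drop empty values, convert to integers."""
    return [int(value) for value in (item.strip() for item in raw) if value]

def analyse(values):
    """Stage 3: a trivial descriptive summary."""
    return {"count": len(values), "total": sum(values)}

workdir = tempfile.mkdtemp()
intermediate = os.path.join(workdir, "cleaned.json")

# First sub-pipeline: acquire and clean, then save the intermediate result
with open(intermediate, "w") as f:
    json.dump(clean(acquire()), f)

# Second sub-pipeline: can be re-run later without repeating the first one
with open(intermediate) as f:
    summary = analyse(json.load(f))

print(summary)
```

Because the second part only depends on the saved file, a failure or a change in the analysis step does not force the data to be acquired and cleaned again.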

Report Structure
The project report is what we (data scientists) submit to the data sponsor (the
customer). The report typically includes the following:
• Abstract (a brief and accessible description of the project)
• Introduction
• Methods that were used for data acquisition and processing
• Results that were obtained (do not include intermediate and insignificant
  results in this section; rather, put them into an appendix)
• Conclusion
• Appendix

INTRODUCTION TO FILES
Usually, organizations want to permanently store information about employees,
inventory, sales, etc. to avoid the repetitive task of entering the same data.
Hence, data are stored permanently on secondary storage devices for reusability. We
store Python programs written in script mode with a .py extension. Each program
is stored on the secondary device as a file.
Likewise, the data entered and the output can be stored permanently in a file.
Text files contain only the ASCII equivalent of the contents of the file.
A file is a named location on a secondary storage media where data are permanently
stored for later access.
Types Of Files
Computers store every file as a collection of 0s and 1s, i.e., in binary form.
Therefore, every file is basically just a series of bytes stored one after the other.
There are mainly two types of data files: text files and binary files. A text file
consists of human readable characters and can be opened by any text editor.
Binary files, on the other hand, are made up of non-human readable characters
and symbols, and they require specific programs to access their contents.

Text file
A text file can be understood as a sequence of characters consisting of letters,
numbers and other special symbols. Files with extensions like .txt, .py, .csv,
etc. are some examples of text files. When we open a text file using a text editor
(e.g., Notepad), we see several lines of text. However, the file contents are not
stored in such a way internally. Rather, they are stored as a sequence of bytes
consisting of 0s and 1s. In ASCII, UNICODE or any other encoding scheme, the
value of each character of the text file is stored as bytes.
Binary Files
Binary files are also stored in terms of bytes (0s and 1s), but unlike text files, these
bytes do not represent the ASCII values of characters. Rather, they represent the actual
content such as image, audio, video, compressed versions of other files, executable
files, etc. These files are not human readable. Thus, trying to open a binary file using
a text editor will show some garbage values. We need specific software to read or
write the contents of a binary file.
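
The difference between the two kinds of content can be seen directly in Python. In the sketch below, a short string is encoded to the bytes a text file would hold, while an arbitrary byte sequence (chosen here purely for illustration) cannot be decoded as ASCII text.

```python
text = "Hi!"

# A text file stores the encoded bytes of each character
encoded = text.encode("ascii")
print(list(encoded))            # byte values of 'H', 'i' and '!'

# Arbitrary binary content need not correspond to readable characters
binary = bytes([0, 255, 16, 128])
try:
    binary.decode("ascii")
    readable = True
except UnicodeDecodeError:
    readable = False
print(readable)
```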
Opening And Closing A Text File
In real world applications, computer programs deal with data coming from different
sources like databases, CSV files, HTML, XML, JSON, etc. We broadly access files
either to write data to them or to read data from them. Operations on files include
creating and opening a file, writing data in a file, traversing a file, reading data
from a file and so on. Python has the io module that contains different functions
for handling files.

Opening a file
To open a file in Python, we use the open() function. The syntax of open() is as
follows:
file_object = open(file_name, access_mode)
This function returns a file object, called a file handle, which is stored in the
variable file_object. We can use this variable to transfer data to and from the
file (read and write) by calling the functions defined in Python’s io module. If
the file does not exist and it is opened in a write or append mode, the above
statement creates a new empty file with the name we specify in the statement. In a
read mode (r, rb or r+), a missing file is not created; open() raises
FileNotFoundError instead.
File Open Modes

File Mode       Description                                          File Offset Position
<r>             Opens the file in read-only mode.                    Beginning of the file
<rb>            Opens the file in binary and read-only mode.         Beginning of the file
<r+> or <+r>    Opens the file in both read and write mode.          Beginning of the file
<w>             Opens the file in write mode. If the file already    Beginning of the file
                exists, all the contents will be overwritten. If
                the file doesn’t exist, then a new file will be
                created.
<wb+> or <+wb>  Opens the file in read, write and binary mode. If    Beginning of the file
                the file already exists, the contents will be
                overwritten. If the file doesn’t exist, then a new
                file will be created.
<a>             Opens the file in append mode. If the file doesn’t   End of the file
                exist, then a new file will be created.
<a+> or <+a>    Opens the file in append and read mode. If the file  End of the file
                doesn’t exist, then it will create a new file.
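
The behaviour of the most common modes in the table can be checked with a short experiment. This is a sketch only; the file name modes.txt is made up, and a temporary directory is used so that no existing file is touched.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")   # hypothetical file name

# 'r' does not create a missing file; it raises FileNotFoundError
try:
    open(path, "r")
    missing_raised = False
except FileNotFoundError:
    missing_raised = True

# 'w' creates the file (or empties an existing one)
with open(path, "w") as f:
    f.write("first\n")

# 'a' keeps the existing content and writes at the end
with open(path, "a") as f:
    f.write("second\n")
with open(path, "r") as f:
    after_append = f.read()

# 'w' again: the earlier content is lost
with open(path, "w") as f:
    f.write("third\n")
with open(path, "r") as f:
    after_rewrite = f.read()

print(missing_raised, repr(after_append), repr(after_rewrite))
```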

Closing a file
Once we are done with the read/write operations on a file, it is a good practice to close
the file. Python provides a close() method to do so. While closing a file, the system
frees the memory allocated to it. The syntax of close() is:
file_object.close()

Here, file_object is the object that was returned while opening the file.

Opening a file using with clause


In Python, we can also open a file using the with clause. The syntax of the with
clause is:
with open(file_name, access_mode) as file_object:
The advantage of using the with clause is that any file opened using this clause is
closed automatically once control leaves the with block. If the user forgets to
close the file explicitly, or if an exception occurs, the file is still closed
automatically. It also provides a simpler syntax.
with open("myfile.txt", "r+") as myObject:
    content = myObject.read()

Here, we don’t have to close the file explicitly using the close() statement. Python
will automatically close the file.
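
The automatic closing described above can be verified through the closed attribute of the file object. The sketch below uses a temporary file and a deliberately raised ValueError to stand in for "an exception occurs"; both are assumptions made for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

with open(path, "w") as f:
    f.write("Hello\n")
closed_after_block = f.closed        # the file is closed on leaving the block

try:
    with open(path, "r") as g:
        raise ValueError("simulated error while the file is open")
except ValueError:
    pass
closed_after_error = g.closed        # closed even though an exception occurred

print(closed_after_block, closed_after_error)
```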

Writing To A Text File


For writing to a file, we first need to open it in write or append mode. If we open an
existing file in write mode, the previous data will be erased, and the file object will
be positioned at the beginning of the file. On the other hand, in append mode, new
data will be added at the end of the previous data as the file object is at the end of
the file. After opening the file, we can use the following methods to write data in the
file.
• write() - for writing a single string
• writelines() - for writing a sequence of strings

The write() method


The write() method takes a string as an argument and writes it to the text file. It
returns the number of characters written in a single execution of the write()
method. Also, we need to add a newline character (\n) at the end of every
sentence to mark the end of line.
Consider the following piece of code:
>>> myobject = open("myfile.txt", 'w')
>>> myobject.write("Hey I have started using files in Python\n")
41
>>> myobject.close()

On execution, write() returns the number of characters written to the file.
Hence, 41, which is the length of the string passed as an argument, is displayed.
Note: ‘\n’ is treated as a single character.
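
A quick way to confirm the return value of write(), and to see that neither write() nor writelines() (covered in the next section) adds newline characters on its own, is the following sketch. The file is created in a temporary directory, which is an assumption made so the example does not overwrite anything.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

with open(path, "w") as f:
    count = f.write("Hello everyone\n")          # returns the characters written
    result = f.writelines(["one", "two\n"])      # returns None, adds no newlines

with open(path) as f:
    content = f.read()

print(count, result, repr(content))
```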

The writelines() method


This method is used to write multiple strings to a file. We need to pass an iterable
object like a list, tuple, etc. containing strings to the writelines() method. Unlike
write(), the writelines() method does not return the number of characters written
in the file. The following code explains the use of writelines().
>>> myobject = open("myfile.txt", 'w')
>>> lines = ["Hello everyone\n", "Writing multiline strings\n", "This is the third line"]
>>> myobject.writelines(lines)
>>> myobject.close()

Reading From A Text File

We can write a program to read the contents of a file. Before reading a file, we must
make sure that the file is opened in “r”, “r+”, “w+” or “a+” mode. There are three ways
to read the contents of a file: read(), readline() and readlines().

The read() method


This method is used to read a specified number of characters from a text file (or
bytes from a file opened in binary mode). The syntax of read() method is:
file_object.read(n)

Consider the following set of statements to understand the usage of read() method:

>>> myobject = open("myfile.txt", 'r')
>>> myobject.read(10)
'Hello ever'
>>> myobject.close()

If no argument or a negative number is specified in read(), the entire file
content is read. For example,
>>> myobject = open("myfile.txt", 'r')
>>> print(myobject.read())
Hello everyone
Writing multiline strings
This is the third line
>>> myobject.close()

The readlines() method


The method reads all the lines and returns them, along with their newline
characters, as a list of strings. The following example uses readlines() to read
data from the text file myfile.txt.
>>> myobject = open("myfile.txt", 'r')
>>> print(myobject.readlines())
['Hello everyone\n', 'Writing multiline strings\n', 'This is the third line']
>>> myobject.close()

As shown in the above output, when we read a file using the readlines() function,
the lines in the file become members of a list, where each list element ends with a
newline character (‘\n’), except the last one if the file does not end with a
newline.
In case we want to display each word of a line separately as an element of a list,
then we can use split() function. The following code demonstrates the use of split()
function.

>>> myobject = open("myfile.txt", 'r')
>>> d = myobject.readlines()
>>> for line in d:
        words = line.split()
        print(words)

['Hello', 'everyone']
['Writing', 'multiline', 'strings']
['This', 'is', 'the', 'third', 'line']

In the output, each word is returned as an element of a list. However, if splitlines()
is used instead of split(), then each line is returned as an element of a list, as
shown in the output below:
>>> for line in d:
        words = line.splitlines()
        print(words)

['Hello everyone']
['Writing multiline strings']
['This is the third line']

Let us now write a program that accepts a string from the user and writes it to a
text file. Thereafter, the same program reads the text file and displays it on the
screen.
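
A minimal sketch of such a program is given here. To keep it runnable without keyboard input, the string is passed to a helper function (write_then_read, a name invented for this sketch) instead of being read with input(); the file name userdata.txt is also hypothetical.

```python
import os
import tempfile

def write_then_read(text, path):
    """Write text to the file at path, then read the file back and return its contents."""
    with open(path, "w") as f:
        f.write(text + "\n")
    with open(path, "r") as f:
        return f.read()

path = os.path.join(tempfile.mkdtemp(), "userdata.txt")   # hypothetical file name

# In an interactive program the text would come from input("Enter a string: ")
contents = write_then_read("Python is easy to learn", path)
print(contents, end="")
```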
Setting Offsets In A File (Navigation in a File)

The functions that we have learnt till now are used to access the data sequentially
from a file. But if we want to access data in a random fashion, then Python gives
us seek() and tell() functions to do so.
The tell() method
This function returns an integer that specifies the current position of the file
object in the file. The position so specified is the byte position from the beginning
of the file till the current position of the file object. The syntax of using tell() is:
file_object.tell()

The seek() method


This method is used to position the file object at a particular position in a file. The
syntax of seek() is:
file_object.seek(offset [, reference_point])
In the above syntax, offset is the number of bytes by which the file object is to be
moved. reference_point indicates the starting position of the file object, that is,
the position with reference to which the offset is counted. It can have any of the
following values:
0 - beginning of the file
1 - current position of the file
2 - end of file
By default, the value of reference_point is 0, i.e. the offset is counted from
the beginning of the file. For example, the statement fileObject.seek(5,0)
will position the file object at the 5th byte position from the beginning of the file.
The program below demonstrates the usage of seek() and tell().

Application of seek() and tell()

print("Learning to move the file object")
fileobject = open("testfile.txt", "r+")
text = fileobject.read()
print(text)
print("Initially, the position of the file object is:", fileobject.tell())
fileobject.seek(0)
print("Now the file object is at the beginning of the file:", fileobject.tell())
fileobject.seek(5)
print("We are moving to 5th byte position from the beginning of file")
print("The position of the file object is at", fileobject.tell())
text = fileobject.read()
print(text)
fileobject.close()

Output of Program:
Learning to move the file object
roll_numbers = [1, 2, 3, 4, 5, 6]
Initially, the position of the file object is: 33
Now the file object is at the beginning of the file: 0
We are moving to 5th byte position from the beginning of file
The position of the file object is at 5
numbers = [1, 2, 3, 4, 5, 6]
>>>
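
The three reference points can be tried out together in a short sketch. One caveat that matters in practice: for files opened in text mode, Python only allows offsets counted from the beginning of the file (apart from seek(0, 1) and seek(0, 2)), so the example below opens the file in binary mode. The file contents mirror the testfile.txt used above, written to a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "testfile.txt")
with open(path, "wb") as f:
    f.write(b"roll_numbers = [1, 2, 3, 4, 5, 6]")

with open(path, "rb") as f:
    f.seek(5)                  # 5 bytes from the beginning (reference point 0)
    from_start = f.read(7)
    f.seek(-3, 2)              # 3 bytes before the end of the file
    from_end = f.read()
    f.seek(0)
    f.seek(4, 1)               # 4 bytes forward from the current position
    position = f.tell()

print(from_start, from_end, position)
```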

Creating And Traversing A Text File (Random Access files)


So far we have learnt methods that help us to open and close a file, read and write
data in a text file, find the position of the file object (tell()) and move the file
object to a desired location (seek()). To perform these operations together, let us
assume that we will be working with practice.txt.

Creating a file and writing data


To create a text file, we use the open() function and provide the filename and the
mode. If a file with the same name already exists, open() behaves differently
depending on the mode (write or append) used. In write mode (w), all the existing
contents of the file are lost, and an empty file with the same name is created. In
append mode (a), the new data are written after the existing data. In both cases, if
the file does not exist, then a new empty file is created.
In the program below, a file, practice.txt, is opened in write (w+) mode and the
sentences entered by the user are stored in it, as shown in the output that follows it.
Program To create a text file and write data in it

# program to create a text file and add data
fileobject = open("practice.txt", "w+")
while True:
    data = input("Enter data to save in the text file: ")
    fileobject.write(data)
    ans = input("Do you wish to enter more data?(y/n): ")
    if ans == 'n':
        break
fileobject.close()

Output of Program 2-3:

>>>
RESTART: Path_to_file\Program2-3.py
Enter data to save in the text file: I am interested to learn about Computer Science
Do you wish to enter more data?(y/n): y
Enter data to save in the text file: Python is easy to learn
Do you wish to enter more data?(y/n): n
>>>

Traversing a file and displaying data


To read and display data that is stored in a text file, we will refer to the previous
example where we have created the file practice.txt. The file will be opened in read
mode and reading will begin from the beginning of the file.
Program To display data from a text file

fileobject = open("practice.txt", "r")
line = fileobject.readline()
while line:
    print(line)
    line = fileobject.readline()
fileobject.close()

In the program, readline() is used in the while loop to read the data line by line
from the text file. The lines are displayed using print(). When the end of file is
reached, readline() returns an empty string, which ends the loop. Finally, the file
is closed using close().
Output of Program
>>>

I am interested to learn about Computer SciencePython is easy to learn
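
The two sentences appear on one line because the writing program never wrote a newline character between them. The sketch below writes each sentence with its own "\n" and then traverses the file by iterating over the file object directly, which is a common alternative to the readline() loop. The temporary directory is an assumption made so the example leaves the real practice.txt alone.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "practice.txt")
with open(path, "w") as f:
    f.write("I am interested to learn about Computer Science\n")
    f.write("Python is easy to learn\n")

lines = []
with open(path, "r") as fileobject:
    for line in fileobject:               # yields one line at a time until end of file
        lines.append(line.rstrip("\n"))   # drop the trailing newline before printing

for line in lines:
    print(line)
```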

Till now, we have been creating separate programs for writing data to a file
and for reading the file. Now let us create one single program to read and write data
using a single file object. Since both the operations have to be performed using a single
file object, the file will be opened in w+ mode.

Program To perform reading and writing operation in a text file

fileobject = open("report.txt", "w+")
print("WRITING DATA IN THE FILE")
print()  # to display a blank line
while True:
    line = input("Enter a sentence ")
    fileobject.write(line)
    fileobject.write('\n')
    choice = input("Do you wish to enter more data? (y/n): ")
    if choice in ('n', 'N'):
        break
print("The byte position of file object is ", fileobject.tell())
fileobject.seek(0)  # places file object at beginning of file
print()
print("READING DATA FROM THE FILE")
text = fileobject.read()
print(text)
fileobject.close()

In the program, the file is read until the end of file is reached, and the output
shown below is displayed.

Output of Program:

Enter a sentence I am a student of class XII
Do you wish to enter more data? (y/n): y
Enter a sentence my school contact number is 4390xxx8
Do you wish to enter more data? (y/n): n
The byte position of file object is 67

READING DATA FROM THE FILE
I am a student of class XII
my school contact number is 4390xxx8

>>>
The Pickle Module
We know that Python considers everything as an object. So, all data types
including list, tuple, dictionary, etc. are also considered as objects.
During execution of a program, we may need to store the current state of
variables so that we can later restore them to that state. Suppose you
are playing a video game, and after some time, you want to close it. So, the
program should be able to store the current state of the game, including
current level/stage, your score, etc. as a Python object. Likewise, you may
like to store a Python dictionary as an object, to be able to retrieve it later. To
save any object structure along with data, Python provides a module called
pickle. The pickle module is used for serializing and de-serializing any
Python object structure. In everyday usage, pickling is a method of preserving
food items by placing them in some solution, which increases the shelf life. In
other words, it is a method to store food items for later consumption.
Serialization is the process of transforming data or an object in
memory (RAM) into a stream of bytes called a byte stream. These byte
streams can then be stored in a binary file on disk, stored in a database, or
sent through a network. The serialization process is also called pickling.
De-serialization or unpickling is the inverse of the pickling process, where a
byte stream is converted back to a Python object.
The pickle module deals with binary files. Here, data are not written but
dumped and similarly, data are not read but loaded. The pickle module
must be imported to load and dump data. It provides two methods, dump()
and load(), to work with binary files for pickling and unpickling,
respectively.
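
Serialization and de-serialization can also be tried without any file, using pickle.dumps() and pickle.loads(), the in-memory counterparts of dump() and load(). The game state dictionary below is a made-up example in the spirit of the video game scenario above. Note that unpickling should only be done on data from a trusted source, because loading a pickle can execute code.

```python
import pickle

# Hypothetical game state, invented for this illustration
state = {"level": 3, "score": 1250, "items": ["key", "map"]}

byte_stream = pickle.dumps(state)        # serialization (pickling)
restored = pickle.loads(byte_stream)     # de-serialization (unpickling)

print(type(byte_stream).__name__)
print(restored == state, restored is state)
```

The restored dictionary is equal to the original but is a separate object, which is what one would expect after a round trip through a byte stream.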

The dump() method


This method is used to convert (pickle) Python objects for writing
data in a binary file. The file in which data are to be dumped
needs to be opened in binary write mode (wb).
Syntax of dump() is as follows:
dump(data_object, file_object)

where data_object is the object that has to be dumped to the file with the file
handle named file_object. For example, the program below writes the record of a
student (roll_no, name, gender and marks) in the binary file named
mybinary.dat using dump(). We need to close the file after pickling.

import pickle
listvalues = [1, "Geetika", 'F', 26]
fileobject = open("mybinary.dat", "wb")
pickle.dump(listvalues, fileobject)
fileobject.close()
The load() method
This method is used to load (unpickle) data from a binary file. The file to be loaded
is opened in binary read (rb) mode. Syntax of load() is as follows:
store_object = load(file_object)
Here, the pickled Python object is loaded from the file having a file handle named
file_object and is stored in a new variable called store_object. The
program below demonstrates how to read data from the file mybinary.dat using
load().
Program: Unpickling data in Python

import pickle
print("The data that were stored in file are: ")
fileobject = open("mybinary.dat", "rb")
objectvar = pickle.load(fileobject)
fileobject.close()
print(objectvar)

Output of Program
>>>
RESTART: Path_to_file\Program2-7.py
The data that were stored in file are:
[1, 'Geetika', 'F', 26]
>>>

File handling using pickle module


Just as we read and write data in a text file, we will now add and display
data for a binary file. Program 2-8 accepts a record of an employee from the user and
appends it to the binary file empfile.dat. Thereafter, the records are read from the
binary file and displayed on the screen. The user may enter as many
records as they wish to. The program also displays the size of the binary file
before starting with the reading process.

Program 2-8 To perform basic operations on a binary file using pickle module

# Program to write and read employee records in a binary file
import pickle
print("WORKING WITH BINARY FILES")
bfile = open("empfile.dat", "ab")
recno = 1
print("Enter Records of Employees")
print()
# taking data from user and dumping in the file as list object
while True:
    print("RECORD No.", recno)
    eno = int(input("\tEmployee number : "))
    ename = input("\tEmployee Name : ")
    ebasic = int(input("\tBasic Salary : "))
    allow = int(input("\tAllowances : "))
    totsal = ebasic + allow
    print("\tTOTAL SALARY : ", totsal)
    edata = [eno, ename, ebasic, allow, totsal]
    pickle.dump(edata, bfile)
    ans = input("Do you wish to enter more records (y/n)? ")
    recno = recno + 1
    if ans.lower() == 'n':
        print("Record entry OVER ")
        print()
        break
# retrieving the size of file
print("Size of binary file (in bytes):", bfile.tell())
bfile.close()
# Reading the employee records from the file using load()
print("Now reading the employee records from the file")
print()
readrec = 1
try:
    with open("empfile.dat", "rb") as bfile:
        while True:
            edata = pickle.load(bfile)
            print("Record Number : ", readrec)
            print(edata)
            readrec = readrec + 1
except EOFError:
    pass  # end of file reached; the with clause has already closed the file
Output of Program:
>>>
RESTART: Path_to_file\Program2-8.py
WORKING WITH BINARY FILES
Enter Records of Employees

RECORD No. 1
        Employee number : 11
        Employee Name : D N Ravi
        Basic Salary : 32600
        Allowances : 4400
        TOTAL SALARY : 37000
Do you wish to enter more records (y/n)? y
RECORD No. 2
        Employee number : 12
        Employee Name : Farida Ahmed
        Basic Salary : 38250
        Allowances : 5300
        TOTAL SALARY : 43550
Do you wish to enter more records (y/n)? n
Record entry OVER

Size of binary file (in bytes): 216
Now reading the employee records from the file

Record Number : 1
[11, 'D N Ravi', 32600, 4400, 37000]
Record Number : 2
[12, 'Farida Ahmed', 38250, 5300, 43550]
>>>

As each employee record is stored as a list in the file empfile.dat, a list showing
the record of each employee is displayed while reading the file. Notice that in
Program 2-8, we have also used a try..except block to handle the end-of-file
exception (EOFError), which load() raises when there are no more records to read.
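
The read-until-EOFError pattern used in Program 2-8 can be wrapped in a small reusable generator. This is one possible sketch, not part of the original program: the function name read_records is invented here, and the two sample records are written to a temporary copy of empfile.dat so that the example runs without user input.

```python
import os
import pickle
import tempfile

def read_records(path):
    """Yield every pickled object stored one after another in the file at path."""
    with open(path, "rb") as bfile:
        while True:
            try:
                yield pickle.load(bfile)
            except EOFError:          # raised when no more records remain
                return

path = os.path.join(tempfile.mkdtemp(), "empfile.dat")
with open(path, "ab") as bfile:
    pickle.dump([11, "D N Ravi", 32600, 4400, 37000], bfile)
    pickle.dump([12, "Farida Ahmed", 38250, 5300, 43550], bfile)

records = list(read_records(path))
print(len(records), records[1][1])
```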
Summary

• A file is a named location on secondary storage media where
data are permanently stored for later access.
• A text file contains only textual information consisting
of alphabets, numbers and other special symbols. Such files
are stored with extensions like .txt, .py, .c, .csv, .html,
etc. Each byte of a text file represents a character.
• Each line of a text file is stored as a sequence of ASCII
equivalents of the characters and is terminated by a
special character, called the End of Line (EOL).
• A binary file consists of data stored as a stream of bytes.
• open() is used to open a file in Python and it
returns a file object called the file handle. The file handle
is used to transfer data to and from the file by calling
the functions defined in Python's io module.
• close() method is used to close the file. While closing
a file, the system frees up all the resources, like
processor and memory, allocated to it.
• write() method takes a string as an argument
and writes it to the text file.
• writelines() method is used to write multiple strings
to a file. We need to pass an iterable object, like a list or
tuple, containing strings to the writelines() method.
• read([n]) method is used to read a specified
number of bytes (n) of data from a data file. If n is
omitted, the entire file is read.
• readline([n]) method reads one complete line
from a file where lines end with a newline (\n).
It can also be used to read a specified number
(n) of bytes of data from a file, but at most up
to the newline character (\n).
• readlines() method reads all the lines and
returns the lines, along with the newline character, as a
list of strings.
• tell() method returns an integer that specifies the
current position of the file object. The position so
specified is the byte position from the beginning of the
file till the current position of the file object.
• seek() method is used to position the file object at
a particular position in a file.
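The behaviour of tell() and seek() summarised above can be sketched as follows. This is only an illustration: the file name demo.txt and the temporary folder are assumptions made for the example, and the file is read in binary mode so that positions are plain byte offsets.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as f:
    f.write("Hello World")

with open(path, "rb") as f:       # binary mode: positions are byte offsets
    start = f.tell()              # position before anything is read
    first = f.read(5)             # read the first five bytes
    after_read = f.tell()         # the position has moved past those bytes
    f.seek(0)                     # go back to the beginning of the file
    again = f.read(5)             # the same five bytes are read again

print(start, first, after_read, again)
```

After seek(0) the same bytes are returned again, which shows that reading always starts from the current position of the file object.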
Reading and writing CSV files in python

CSV (comma-separated values) is a data format commonly used by
spreadsheets. The csv module in Python's standard library provides classes and functions to
perform read/write operations on CSV files.

writer()
This function in the csv module returns a writer object that converts data into a delimited string
and stores it in a file object. The function needs a file object with write permission as a
parameter. Every row written to the file ends with a newline character. To prevent additional
blank lines between rows, the file is opened with the newline parameter set to ''.

The writer object has the following methods:

writerow()
This method writes the items of an iterable (list, tuple or string), separating them by the comma
character.

writerows()
This method takes a list of iterables as its parameter and writes each one as a comma-separated
line of items in the file.

The following example shows the use of the writer() function. First a file is opened in 'w' mode. This
file is used to obtain a writer object. Each tuple in a list of tuples is then written to the file using
the writerow() method.

>>> import csv
>>> persons=[('Lata',22,45),('Anil',21,56),('John',20,60)]
>>> csvfile=open('persons.csv','w', newline='')
>>> obj=csv.writer(csvfile)
>>> for person in persons:
...     obj.writerow(person)
...
>>> csvfile.close()

This creates the file 'persons.csv' in the current directory, with the following contents.

Lata,22,45
Anil,21,56
John,20,60

Instead of iterating over the list to write each row individually, we can use the writerows()
method.

>>> csvfile = open('persons.csv','w', newline='')
>>> obj = csv.writer(csvfile)
>>> obj.writerows(persons)
>>> csvfile.close()

reader()
This function returns a reader object, which is an iterator over the lines in the CSV file. Using
a regular for loop, all lines in the file are displayed in the following example.

>>> csvfile=open('persons.csv','r', newline='')
>>> obj=csv.reader(csvfile)
>>> for row in obj:
...     print(row)
...
['Lata', '22', '45']
['Anil', '21', '56']
['John', '20', '60']

Since the reader object is an iterator, the built-in next() function can also be used to display all lines in
the CSV file.

>>> csvfile = open('persons.csv','r', newline='')
>>> obj = csv.reader(csvfile)
>>> while True:
...     try:
...         row=next(obj)
...         print(row)
...     except StopIteration:
...         break
...

The csv module also defines a Dialect class. A dialect is a set of formatting conventions (such as
the delimiter, quoting rules and line terminator) used when reading and writing CSV data. The list
of available dialects can be obtained with the list_dialects() function.

>>> csv.list_dialects()
['excel', 'excel-tab', 'unix']

DictWriter()
This function returns a DictWriter object. It is similar to the writer object, but each row is
supplied as a dictionary. The function needs a file object with write permission and a
list of the dictionary keys as the fieldnames parameter. The field names are used to write the
first line of the file as a header.

writeheader()
This method writes the list of keys as a comma-separated line as the first line in the
file.

In the following example, a list of dictionary items is defined. Each item in the list is a
dictionary. Using the writerows() method, they are written to the file in comma-separated form.

>>> persons=[{'name':'Lata', 'age':22, 'marks':45},
...          {'name':'Anil', 'age':21, 'marks':56},
...          {'name':'John', 'age':20, 'marks':60}]
>>> csvfile=open('persons.csv','w', newline='')
>>> fields=list(persons[0].keys())
>>> obj=csv.DictWriter(csvfile, fieldnames=fields)
>>> obj.writeheader()
>>> obj.writerows(persons)
>>> csvfile.close()

The file shows following contents.

name,age,marks
Lata,22,45
Anil,21,56
John,20,60

DictReader()
This function returns a DictReader object from the underlying CSV file. Like the
reader object, it is an iterator through which the contents of the file are retrieved.

>>> csvfile = open('persons.csv','r', newline='')
>>> obj = csv.DictReader(csvfile)

The class provides the fieldnames attribute, which returns the dictionary keys used as the header of the file.

>>> obj.fieldnames
['name', 'age', 'marks']

Loop over the DictReader object to fetch the individual dictionary objects.

>>> for row in obj:
...     print(row)
...

This results in the following output on Python 3.6 and 3.7. From Python 3.8 onwards each row is
returned as a plain dict, so the rows are printed as ordinary dictionaries.

OrderedDict([('name', 'Lata'), ('age', '22'), ('marks', '45')])
OrderedDict([('name', 'Anil'), ('age', '21'), ('marks', '56')])
OrderedDict([('name', 'John'), ('age', '20'), ('marks', '60')])

An OrderedDict object is converted to a normal dictionary by passing it to dict(). The import
from the collections module below is needed only to build the sample OrderedDict.

>>> from collections import OrderedDict
>>> r=OrderedDict([('name', 'Lata'), ('age', '22'), ('marks', '45')])
>>> dict(r)
{'name': 'Lata', 'age': '22', 'marks': '45'}
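Note that the values read back from a CSV file are always strings ('22', not 22). The sketch below shows one way of restoring the numeric types after reading. It is only an illustration: an in-memory io.StringIO buffer stands in for persons.csv, and the variable names are assumptions made for the example.

```python
import csv
import io

# In-memory stand-in for a CSV file (assumption made for this example)
buffer = io.StringIO()

fields = ['name', 'age', 'marks']
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows([
    {'name': 'Lata', 'age': 22, 'marks': 45},
    {'name': 'Anil', 'age': 21, 'marks': 56},
])

buffer.seek(0)   # rewind so that the data just written can be read back

records = []
for row in csv.DictReader(buffer):
    # every value arrives as a string, so numbers are converted explicitly
    records.append({'name': row['name'],
                    'age': int(row['age']),
                    'marks': int(row['marks'])})

print(records)
```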

json module

In interactive programs, user responses are held in Python data
structures like lists and dictionaries. When the program is closed, that data is no longer
available, so it has to be stored in files. Python's json module allows us
to store these data structures in files and load the data back as and when needed.
JSON (JavaScript Object Notation) was developed for JavaScript and is now used by many
programming languages, including Python.
Storing and loading the data
The json module has two functions, json.dump() and json.load(), to store and load
the data. The code below demonstrates the concept.

#This code stores the list names to file names.json
import json
names = ['RAMAN','SAMEER','GOVIND','VISHAL','VAMAN']
filename = 'names.json'
with open(filename, 'w') as f:
    json.dump(names, f)

#loading the names list from names.json
import json
filename = 'names.json'
with open(filename) as f:
    names = json.load(f)
print(names)

['RAMAN', 'SAMEER', 'GOVIND', 'VISHAL', 'VAMAN']


Accepting user input and sending it into a file using json.dump()
The code below accepts input from the user and sends it into a file using the
json.dump() method.

import json
filename = 'favourite_dish.json'
user_option = "\nEnter your favourite_dish: "
user_option += '\n Type quit if you have no item to type :'
msg=''
favourite_items=[]
while True:
    msg = input(user_option)
    if msg == 'quit':
        break
    favourite_items.append(msg)
with open(filename, 'w') as f:
    json.dump(favourite_items, f)
print('program terminated')

Enter your favourite_dish:


Type quit if you have no item to type :NOODLES

Enter your favourite_dish:


Type quit if you have no item to type :PASTA

Enter your favourite_dish:


Type quit if you have no item to type :IDLI

Enter your favourite_dish:


Type quit if you have no item to type :DOSA

Enter your favourite_dish:


Type quit if you have no item to type :quit
program terminated
Loading the data from file using json.load()
The code below loads the user data stored in a file using the json.load() method.

import json
filename = 'favourite_dish.json'
try:
    with open(filename) as f:
        favourite_items = json.load(f)
except FileNotFoundError:
    print('Sorry, favourite_dish.json is not found')
else:
    print(favourite_items)

['NOODLES', 'PASTA', 'IDLI', 'DOSA']
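The same two functions work for dictionaries as well as lists. The sketch below is only an illustration: the record, the file name record.json and the temporary folder are assumptions made for the example. It also shows json.dumps() and json.loads(), which work on strings instead of files.

```python
import json
import os
import tempfile

# Hypothetical record and scratch file, used only for this demonstration
record = {'name': 'Lata', 'age': 22, 'marks': [45, 56]}
path = os.path.join(tempfile.mkdtemp(), 'record.json')

with open(path, 'w') as f:
    json.dump(record, f)          # dictionary -> JSON text in the file

with open(path) as f:
    loaded = json.load(f)         # JSON text in the file -> dictionary

print(loaded == record)

text = json.dumps(record)         # dictionary -> JSON string
print(json.loads(text)['marks'])  # JSON string -> dictionary
```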


Python OS module

The os module in Python comes with various functions that enable
developers to interact with the operating system that they are currently
working on.
Python's os module is part of the standard library, so it is available as soon as
Python is installed. This means you do not need to install it separately using pip.
To access its functions, you just need to import the module.
import os
Now that you've imported the module, you can start using its various
functions.

Getting current working directory
The current working directory is the folder from which the Python script is
being run. Relative file names are resolved against it.
import os
print(os.getcwd())
Note − A directory is nothing but a folder.

Creating a directory
import os
os.mkdir("D:\\Data Science")
This will create a folder named Data Science in the D drive.

Deleting a directory
In order to delete a directory, we will be using the rmdir() function, it
stands for remove directory.
import os
os.rmdir("D:\\Data Science")

Renaming a directory
In order to rename a folder, we use the rename() function present in the os
module.
import os
os.mkdir("D:\\Data Science")
os.rename("D:\\Data Science","D:\\Data Science3")

Basic file manipulation
Now that you know how to work with folders, let us look into file
manipulation.

Creating a file
file = open("Hello.txt", 'w')
file.close()
A file named Hello.txt is created in the current working directory. The os module
works with paths and directories; the file itself is created and written with the
built-in open() function.

Adding content to the created file
file = open("Hello.txt", 'w')
file.write("Hello there! This is a Data Science article")
file.close()
Note − You can use os.rename to rename files as well. Just make sure
you get their extensions right.

Example
Given below is the complete program to test out all the above-mentioned
scenarios:
import os
print(os.getcwd())
os.mkdir("D:\\Data Science")
os.rmdir("D:\\Data Science")
os.mkdir("D:\\Data Science")
os.rename("D:\\Data Science","D:\\Data Science2")
file = open("Hello.txt", 'w')
file.write("Hello there! This is a Data Science article")
file.close()
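The program above depends on a D drive being present. A portable sketch of the same steps is given below. It is only an illustration: it assumes a temporary folder created with the tempfile module in place of the D drive, and the variable names are chosen for the example.

```python
import os
import tempfile

# A temporary folder stands in for the D drive (assumption for this example)
with tempfile.TemporaryDirectory() as base:
    original = os.path.join(base, "Data Science")
    renamed = os.path.join(base, "Data Science2")

    os.mkdir(original)              # create the folder
    os.rename(original, renamed)    # rename it

    # create a file inside the renamed folder and write to it
    file_path = os.path.join(renamed, "Hello.txt")
    with open(file_path, "w") as f:
        f.write("Hello there! This is a Data Science article")

    listing = os.listdir(renamed)   # names of the entries in the folder
    print(listing)

    os.remove(file_path)            # rmdir() works only on empty folders
    os.rmdir(renamed)
    remaining = os.listdir(base)
    print(remaining)
```

os.rmdir() removes only empty directories, which is why the file is deleted first with os.remove().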

The os.path module is widely used and is handy when
processing files from different places in the system. It is used for
purposes such as joining, normalizing and retrieving path names in
Python. All of these functions accept either only bytes or only string
objects as their parameters. The results are specific to the OS on which the
code is run.

os.path.basename
This function gives us the last part of the path, which may be a folder or a
file name. Please note the difference in how the path is written in Windows
and Linux in terms of the backslash and the forward slash.
Example
import os
# In windows
fldr = os.path.basename("C:\\Users\\xyz\\Documents\\My Web Sites")
print(fldr)
file = os.path.basename("C:\\Users\\xyz\\Documents\\My Web Sites\\intro.html")
print(file)
Running the above code gives us the following result −
Output
My Web Sites
intro.html
MyWebSites
music.txt

os.path.dirname
This function gives us the name of the directory in which the folder or file is
located.
Example
import os
# In windows
DIR = os.path.dirname("C:\\Users\\xyz\\Documents\\My Web Sites")
print(DIR)
Running the above code gives us the following result −
Output
C:\Users\xyz\Documents
/Documents

os.path.isfile
Sometimes we may need to check whether the complete path given represents
an existing file. If the path does not exist, or if it refers to a folder, the
output is False. If the file exists, the output is True.
Example
import os
IS_FILE = os.path.isfile("C:\\Users\\xyz\\Documents\\My Web Sites\\intro.html")
print(IS_FILE)
Running the above code gives us the following result −
Output
False
True
False
True
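
Since the results of os.path depend on the operating system, the Windows and Linux behaviour can be compared on any machine with the ntpath and posixpath modules of the standard library, which are the Windows and POSIX implementations behind os.path. The sketch below is only an illustration, and the Linux path used in it is hypothetical.

```python
import ntpath      # Windows flavour of os.path
import posixpath   # Linux/macOS flavour of os.path

win_path = "C:\\Users\\xyz\\Documents\\My Web Sites\\intro.html"
nix_path = "/home/xyz/Documents/intro.html"   # hypothetical Linux path

print(ntpath.basename(win_path))      # last part of the Windows path
print(ntpath.dirname(win_path))       # everything before the last part
print(posixpath.basename(nix_path))
print(posixpath.dirname(nix_path))

# Under POSIX rules the backslash is not a separator, so the whole
# Windows path is treated as a single name.
print(posixpath.basename(win_path))
```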

XML is a portable, open source language that allows programmers to develop applications
that can be read by other applications, regardless of operating system and/or development
language.

What is XML?

The Extensible Markup Language (XML) is a markup language much like
HTML or SGML. It is recommended by the World Wide Web Consortium
and available as an open standard.

XML is extremely useful for keeping track of small to medium amounts of
data without requiring a SQL-based backbone.

XML Parser Architectures and APIs

The Python standard library provides a minimal but useful set of interfaces
to work with XML.

The two most basic and broadly used APIs to XML data are the SAX and
DOM interfaces.

• Simple API for XML (SAX) − Here, you register callbacks for
events of interest and then let the parser proceed through the
document. This is useful when your documents are large or you
have memory limitations: the parser processes the file as it reads it
from disk, and the entire file is never stored in memory.
• Document Object Model (DOM) API − This is a World Wide Web
Consortium recommendation wherein the entire file is read into
memory and stored in a hierarchical (tree-based) form to represent
all the features of an XML document.
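
The callback style of SAX described above can be sketched with the xml.sax module of the standard library. This is only an illustration: the XML text, the element name model and the ModelHandler class are assumptions made for the example.

```python
import xml.sax

# Hypothetical XML content used only for this demonstration
XML_TEXT = (b"<catalog>"
            b"<model name='model1'>model1abc</model>"
            b"<model name='model2'>model2abc</model>"
            b"</catalog>")

class ModelHandler(xml.sax.ContentHandler):
    """Collects the name attribute of every <model> element."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        # called by the parser each time an opening tag is met
        if name == "model":
            self.names.append(attrs.getValue("name"))

handler = ModelHandler()
xml.sax.parseString(XML_TEXT, handler)   # the parser drives the callbacks
print(handler.names)
```

The handler keeps only the values it is interested in, so the document is never held in memory as a whole.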

Example :

from xml.dom import minidom

# parse an xml file by name
file = minidom.parse('d:/myxml.xml.txt')

# use getElementsByTagName() to get tag
models = file.getElementsByTagName('model')

# one specific item attribute
print('model #2 attribute:')
print(models[1].attributes['name'].value)

# all item attributes
print('\nAll attributes:')
for elem in models:
    print(elem.attributes['name'].value)

# one specific item's data
print('\nmodel #2 data:')
print(models[1].firstChild.data)
print(models[1].childNodes[0].data)

# all items data
print('\nAll model data:')
for elem in models:
    print(elem.firstChild.data)

model #2 attribute:
model2

All attributes:
model1
model2

model #2 data:
model2abc
model2abc

All model data:


model1abc
model2abc
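
The contents of the XML file used above are not shown. The sketch below assumes a small document that would be consistent with the printed output and parses it from a string with minidom.parseString(), so that it can be run without any file. The XML text and the root element name catalog are assumptions made for the example.

```python
from xml.dom import minidom

# Assumed XML content, chosen to be consistent with the output shown above
XML_TEXT = """<catalog>
  <model name="model1">model1abc</model>
  <model name="model2">model2abc</model>
</catalog>"""

doc = minidom.parseString(XML_TEXT)          # whole document is held in memory
models = doc.getElementsByTagName('model')   # all <model> elements

names = [m.attributes['name'].value for m in models]
data = [m.firstChild.data for m in models]   # text inside each element

print(names)
print(data)
```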
UNIT – II

Web scraping is a technique to extract a large amount of data from
several websites. The term "scraping" refers to obtaining information
from another source (web pages) and saving it into a local file. For
example: Suppose you are working on a project called "Phone
comparing website," where you require the prices, ratings and
model names of mobile phones to make comparisons between the different
phones. If you collect these details by checking various sites, it will
take a lot of time. In that case, web scraping plays an important role:
by writing a few lines of code you can get the desired results.

Web scraping extracts data from websites in an unstructured
format. It helps to collect this unstructured data and convert it into a
structured form.

Startups prefer web scraping because it is a cheap and effective way to
get a large amount of data without any partnership with a data-selling
company.

Steps involved in this kind of processing:


from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)

parser = MyHTMLParser()

parser.feed('<html><head><title>Sucf Degree College</title></head>'
            '<body><h1>I Believe in war, Not in Morality </h1></body></html>')

OUTPUT:

Encountered a start tag: html


Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Sucf Degree College
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : I Believe in war, Not in Morality
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
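
In scraping, the parser is normally used to collect one particular piece of data instead of printing every event. The sketch below keeps only the text of the title tag. It is only an illustration: the class name TitleParser and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while the parser is inside <title>
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Sample page used only for this demonstration
page = ('<html><head><title>Sucf Degree College</title></head>'
        '<body><h1>Hello</h1></body></html>')

title_parser = TitleParser()
title_parser.feed(page)
print(title_parser.title)
```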

Processing texts in Natural languages:

Natural language processing:

Natural language processing (NLP) is the ability of a computer program to
understand human language as it is spoken and written -- referred to as
natural language. It is a component of artificial intelligence (AI).

Techniques and methods of natural language processing

Syntax and semantic analysis are two main techniques used with natural
language processing.
Syntax is the arrangement of words in a sentence to make grammatical
sense. NLP uses syntax to assess meaning from a language based on
grammatical rules. Syntax techniques include:

• Parsing. This is the grammatical analysis of a sentence. Example: A
natural language processing algorithm is fed the sentence, "The dog
barked." Parsing involves breaking this sentence into parts of
speech -- i.e., dog = noun, barked = verb. This is useful for more
complex downstream processing tasks.

• Word segmentation. This is the act of taking a string of text and
deriving word forms from it. Example: A person scans a handwritten
document into a computer. The algorithm would be able to analyze
the page and recognize that the words are divided by white spaces.

• Sentence breaking. This places sentence boundaries in large
texts. Example: A natural language processing algorithm is fed the
text, "The dog barked. I woke up." The algorithm can recognize the
period that splits up the sentences using sentence breaking.

• Morphological segmentation. This divides words into smaller
parts called morphemes. Example: The word untestably would be
broken into [[un[[test]able]]ly], where the algorithm recognizes
"un," "test," "able" and "ly" as morphemes. This is especially useful
in machine translation and speech recognition.

• Stemming. This reduces inflected words to their root forms.
Example: In the sentence, "The dog barked," the algorithm would be
able to recognize that the root of the word "barked" is "bark." This would
be useful if a user was analyzing a text for all instances of the word
bark, as well as all of its conjugations. The algorithm can see that
they are essentially the same word even though the letters are
different.

Semantics involves the use of and meaning behind words. Natural
language processing applies algorithms to understand the meaning and
structure of sentences. Semantics techniques include:

• Word sense disambiguation. This derives the meaning of a word
based on context. Example: Consider the sentence, "The pig is in
the pen." The word pen has different meanings. An algorithm using
this method can understand that the use of the word pen here
refers to a fenced-in area, not a writing implement.

• Named entity recognition. This determines words that can be
categorized into groups. Example: An algorithm using this method
could analyze a news article and identify all mentions of a certain
company or product. Using the semantics of the text, it would be
able to differentiate between entities that are visually the same. For
instance, in the sentence, "Daniel McDonald's son went to
McDonald's and ordered a Happy Meal," the algorithm could
recognize the two instances of "McDonald's" as two separate
entities -- one a restaurant and one a person.

• Natural language generation. This uses a database to determine
semantics behind words and generate new text. Example: An
algorithm could automatically write a summary of findings from a
business intelligence platform, mapping certain words and phrases
to features of the data in the BI platform. Another example would be
automatically generating news articles or tweets based on a certain
body of text used for training.

Example

import nltk

nltk.download('punkt')

# adjacent string literals inside parentheses are joined into one string
text = ("Backgammon is one of the oldest known board games. Its history "
        "can be traced back nearly 5,000 years to archeological discoveries in the "
        "Middle East. It is a two player game where each player has fifteen "
        "checkers which move between twenty-four points according to the roll of "
        "two dice.")

sentences = nltk.sent_tokenize(text)

for sentence in sentences:
    print(sentence)
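
Sentence breaking and word segmentation, as described in the list of syntax techniques, can also be sketched roughly with the re module alone. This is a deliberately naive illustration and not how NLTK works internally: it assumes that every sentence ends with '.', '!' or '?' followed by white space, which fails for abbreviations such as "Dr. Rao".

```python
import re

text = "The dog barked. I woke up."

# Sentence breaking: split at white space that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)

# Word segmentation: take every run of word characters as one word
words = re.findall(r"\w+", sentences[0])
print(words)
```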

Python Regular Expressions

A regular expression is a sequence of characters
which is used to search for a pattern in a string. The module re provides
the support to use regex in a Python program. The re module raises an
exception (re.error) if the regular expression itself is not valid.

The re module must be imported to use the regex functionality in
Python.
import re

Regex Functions
The following regex functions are used in Python.

SN  Function  Description
1   match     Matches the regex pattern at the beginning of the string,
              with optional flags. It returns a match object if a match
              is found, otherwise it returns None.
2   search    Returns the match object for the first match found
              anywhere in the string, otherwise None.
3   findall   Returns a list that contains all the matches of a
              pattern in the string.
4   split     Returns a list in which the string has been split at
              each match.
5   sub       Replaces one or many matches in the string.
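
The five functions in the table can be tried on one sample string, as in the short sketch below. The sample text and the variable names are chosen only for illustration.

```python
import re

text = "How are you. How is everything"

at_start = re.match("How", text)      # pattern is at the beginning: match object
not_at_start = re.match("are", text)  # "are" is not at the beginning: None
found = re.search("are", text)        # search looks anywhere in the string

print(at_start is not None)
print(not_at_start)
print(found.span())

all_matches = re.findall("How", text)    # every occurrence, as a list
parts = re.split(r"\. ", text)           # split at each ". "
replaced = re.sub("How", "What", text)   # replace every occurrence

print(all_matches)
print(parts)
print(replaced)
```

The difference between match() and search() is visible in the second and third results: match() only looks at the beginning of the string.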

Forming a regular expression
A regular expression can be formed by using a mix of meta-characters,
special sequences, and sets.

Meta-Characters
A metacharacter is a character with a special meaning.

Metacharacter  Description                                       Example
[]             Represents a set of characters.                   "[a-z]"
\              Signals a special sequence.                       "\r"
.              Matches any single character (except newline)     "Ja.v."
               at that place.
^              Matches the pattern at the beginning of the       "^Java"
               string.
$              Matches the pattern at the end of the string.     "point$"
*              Zero or more occurrences of the preceding         "hello*"
               character or group.
+              One or more occurrences of the preceding          "hello+"
               character or group.
{}             The specified number of occurrences of the        "java{2}"
               preceding character or group.
|              Either the pattern on the left or the one on      "java|point"
               the right is present.
()             Capture and group.

Special Sequences
Special sequences are sequences containing \ followed by one of the
characters listed below.

Character  Description
\A         Returns a match if the specified characters are present at
           the beginning of the string.
\b         Returns a match if the specified characters are present at
           the beginning or the end of a word.
\B         Returns a match if the specified characters are present,
           but not at the beginning or the end of a word.
\d         Returns a match where the string contains a digit [0-9].
\D         Returns a match where the string contains a character that
           is not a digit.
\s         Returns a match where the string contains a white space
           character.
\S         Returns a match where the string contains a character that
           is not white space.
\w         Returns a match where the string contains a word character
           (a letter, a digit or the underscore).
\W         Returns a match where the string contains a character that
           is not a word character.
\Z         Returns a match if the specified characters are at the end of
           the string.

Sets
A set is a group of characters given inside a pair of square brackets. It
has a special meaning.

SN  Set          Description
1   [arn]        Returns a match if the string contains any of the
                 specified characters in the set.
2   [a-n]        Returns a match if the string contains any of the
                 characters between a and n.
3   [^arn]       Returns a match if the string contains any character
                 except a, r, and n.
4   [0123]       Returns a match if the string contains any of the
                 specified digits.
5   [0-9]        Returns a match if the string contains any digit between
                 0 and 9.
6   [0-5][0-9]   Returns a match if the string contains any two-digit
                 number between 00 and 59.
7   [a-zA-Z]     Returns a match if the string contains any alphabet
                 (lower-case or upper-case).
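
A few of the metacharacters, special sequences and sets listed above are tried out in the sketch below. The sample strings are chosen only for illustration.

```python
import re

# \b : "cat" only as a whole word, so the "cat" inside "concat" is skipped
words = re.findall(r"\bcat\b", "cat concat cat.")

# ^ and $ : pattern at the beginning and at the end of the string
starts = re.search(r"^Java", "Java point") is not None
ends = re.search(r"point$", "Java point") is not None

# [0-5][0-9] : two-digit numbers from 00 to 59, so 75 is skipped
minutes = re.findall(r"[0-5][0-9]", "12:75:59")

# [^arn] : any character except a, r and n
not_arn = re.findall(r"[^arn]", "barn")

print(words, starts, ends, minutes, not_arn)
```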

The findall() function
This function returns a list containing all matches of a pattern
within the string. It returns the matches in the order they are found. If
there are no matches, then an empty list is returned.

Consider the following example.

Example

import re
text = "How are you. How is everything"
matches = re.findall("How", text)
print(matches)

Output:

['How', 'How']

The match object
The match object contains information about the search and the
result. If there is no match found, None is returned instead.

Example
import re
text = "How are you. How is everything"
matches = re.search("How", text)
print(type(matches))
print(matches) # matches is the match object

Output (Python 3.7 and later; older versions print _sre.SRE_Match instead of re.Match):

<class 're.Match'>
<re.Match object; span=(0, 3), match='How'>

The Match object methods
The following methods and attributes are associated with the Match object.

• span(): It returns a tuple containing the start and end position of the
match.
• string: This attribute (not a method) holds the string passed into the function.
• group(): It returns the part of the string where the match is found.

Example
import re
text = "How are you. How is everything"
matches = re.search("How", text)
print(matches.span())
print(matches.group())
print(matches.string)

Output:

(0, 3)
How
How are you. How is everything
UNIT – III

SQL is a language to operate databases; it includes database creation,
deletion, fetching rows, modifying rows, etc. SQL is an ANSI (American
National Standards Institute) standard language, but there are many
different versions of the SQL language.

What is SQL?
SQL is Structured Query Language, which is a computer language for
storing, manipulating and retrieving data stored in a relational database.

SQL is the standard language for relational database systems. All the
Relational Database Management Systems (RDBMS) like MySQL, MS
Access, Oracle, Sybase, Informix, Postgres and SQL Server use SQL as
their standard database language.

Also, they use different dialects, such as −

• MS SQL Server uses T-SQL,
• Oracle uses PL/SQL,
• the MS Access version of SQL is called JET SQL (native format), etc.

Why SQL?
SQL is widely popular because it offers the following advantages −

• Allows users to access data in relational database management
systems.
• Allows users to describe the data.
• Allows users to define the data in a database and manipulate that
data.
• Allows SQL to be embedded within other languages using SQL modules,
libraries and pre-compilers.
• Allows users to create and drop databases and tables.
• Allows users to create views, stored procedures and functions in a
database.
• Allows users to set permissions on tables, procedures and views.

SQL Commands
The standard SQL commands to interact with relational databases are CREATE, SELECT,
INSERT, UPDATE, DELETE and DROP. These commands can be classified into the
following groups based on their nature −

DDL - Data Definition Language

Sr.No.  Command  Description
1       CREATE   Creates a new table, a view of a table, or other object in the database.
2       ALTER    Modifies an existing database object, such as a table.
3       DROP     Deletes an entire table, a view of a table or other objects in the database.

DML - Data Manipulation Language

Sr.No.  Command  Description
1       SELECT   Retrieves certain records from one or more tables.
2       INSERT   Creates a record.
3       UPDATE   Modifies records.
4       DELETE   Deletes records.

DCL - Data Control Language

Sr.No.  Command  Description
1       GRANT    Gives a privilege to a user.
2       REVOKE   Takes back privileges granted to a user.
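
The DDL and DML commands listed above can be tried from Python with the sqlite3 module of the standard library. The sketch below is only an illustration: it assumes an in-memory SQLite database and a hypothetical persons table. SQLite has no user accounts, so the DCL commands GRANT and REVOKE are not shown.

```python
import sqlite3

# In-memory database: nothing is written to disk (assumption for this example)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a table
cur.execute("CREATE TABLE persons (name TEXT, age INTEGER, marks INTEGER)")

# DML: insert, update and delete records
cur.executemany("INSERT INTO persons VALUES (?, ?, ?)",
                [('Lata', 22, 45), ('Anil', 21, 56), ('John', 20, 60)])
cur.execute("UPDATE persons SET marks = 50 WHERE name = 'Lata'")
cur.execute("DELETE FROM persons WHERE name = 'John'")

# DML: retrieve the remaining records
rows = cur.execute("SELECT name, marks FROM persons ORDER BY name").fetchall()
print(rows)

# DDL: drop the table
cur.execute("DROP TABLE persons")
conn.close()
```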

MongoDB
MongoDB is an open-source document database that provides high
performance, high availability, and automatic scaling.

In simple words, you can say that MongoDB is a document-oriented
database. It is an open source product, developed and supported by a
company named 10gen (now MongoDB, Inc.).

MongoDB is available under General Public license for free, and it is also
available under Commercial license from the manufacturer.

The manufacturing company 10gen has defined MongoDB as:

"MongoDB is a scalable, open source, high performance, document-oriented
database." - 10gen

MongoDB was designed to work with commodity servers. It is now used by
companies of all sizes, across all industries.

The primary purpose of building MongoDB is:

• Scalability
• Performance
• High Availability
• Scaling from single server deployments to large, complex multi-site
architectures.

Key points of MongoDB

• Develop Faster
• Deploy Easier
• Scale Bigger

First of all, we should know what a document-oriented database is.

Example of document oriented database

MongoDB is a document-oriented database. This is a key feature of
MongoDB. It offers document-oriented storage, which is very simple and
easy to program.

MongoDB stores data as documents, so it is known as a document-oriented
database.

FirstName = "John",
Address = "Detroit",
Spouse = [{Name: "Angela"}].
FirstName = "John",
Address = "Wick"

There are two different documents (separated by "."). Note that the two
documents do not have the same fields.

Storing data in this manner is called a document-oriented database.

MongoDB falls into a class of databases called Document Oriented
Databases. These belong to a broader category of databases known as NoSQL
Databases.

Features of MongoDB

These are some important features of MongoDB:

1. Supports ad hoc queries

In MongoDB, you can search by field or by range, and it also supports
regular expression searches.

2. Indexing

You can index any field in a document.

3. Replication

MongoDB supports Master Slave replication.

A master can perform reads and writes, and a slave copies data from the
master and can only be used for reads or backup (not writes).

4. Duplication of data

MongoDB can run over multiple servers. The data is duplicated to keep
the system up and running in case of hardware failure.

5. Load balancing

It has an automatic load balancing configuration because data is placed in
shards.

6. Supports map reduce and aggregation tools.

7. Uses JavaScript instead of Procedures.

8. It is a schema-less database written in C++.

9. Provides high performance.

10. Stores files of any size easily without complicating your stack.

11. Easy to administer in the case of failures.

12. It also supports:

• JSON data model with dynamic schemas
• Auto-sharding for horizontal scalability
• Built-in replication for high availability

Nowadays many companies use MongoDB to create new types of
applications and to improve performance and availability.

What are Numpy Arrays? Explain

NumPy stands for 'Numerical Python'. It is a package for data analysis and
scientific computing with Python. NumPy uses a multidimensional array
object, and has functions and tools for working with these arrays. The
powerful n-dimensional array in NumPy speeds up data processing.
NumPy can be easily interfaced with other Python packages and provides
tools for integrating with other programming languages like C, C++ etc.

Installing NumPy

NumPy can be installed by typing the following command: pip install numpy

An array:

An array is a data type used to store multiple values using a single
identifier (variable name). An array contains an ordered collection of data
elements where each element is of the same type and can be referenced
by its index (position). The important characteristics of an array are:

• Each element of the array is of the same data type, though the values
stored in them may be different.
• The entire array is stored contiguously in memory. This makes
operations on the array fast.
• Each element of the array is identified or referred to using the name
of the array along with the index of that element, which is unique
for each element. The index of an element is an integral value
associated with the element, based on the element's position in the
array.

For example consider an array with 5 numbers: [ 10, 9, 99, 71, 90 ]

Here, the 1st value in the array is 10 and has the index value [0]
associated with it;

the 2nd value in the array is 9 and has the index value [1] associated with
it, and so on.

The last value (in this case the 5th value) in this array has an index [4].

This is called zero based indexing. This is very similar to the indexing of
lists in Python. The idea of arrays is so important that almost all
programming languages support it in one form or another.
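
Zero-based indexing and the "same data type" rule can be seen with the array module of the Python standard library, as in the sketch below. This is only an illustration using Python's built-in array type, not NumPy, and the variable names are chosen for the example.

```python
from array import array

# 'i' is the type code for signed integers: every element must be an int
values = array('i', [10, 9, 99, 71, 90])

first = values[0]    # index 0 refers to the 1st value
last = values[4]     # index 4 refers to the 5th (last) value
print(first, last)

# A value of another type is rejected because all elements share one type
try:
    values.append(3.5)
    accepted = True
except TypeError:
    accepted = False
print(accepted)
```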

NumPy Array

NumPy arrays are used to store lists of numerical data, vectors and
matrices. The NumPy library has a large set of routines (built-in functions)
for creating, manipulating, and transforming NumPy arrays. The Python
language also has an array data structure, but it is not as versatile,
efficient and useful as the NumPy array. The NumPy array is officially
called ndarray but is commonly known as array.

How do you create numpy arrays?

Creating NumPy arrays


array() method allows us to create single and multidimensional arrays.
The code is shown below:

Single dimensional array creation


import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
print(a)

[1 2 3 4 5 6]

Double dimensional array creation


import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a)

[[1 2 3]
 [4 5 6]]

Creation of array from list


import numpy as np
list1 = [1, 2, 3, 4, 5, 6]
arr = np.array(list1)
print(arr)

[1 2 3 4 5 6]

Creating an array using arange() method


The arange() method creates a numpy array from the range of values
specified. The lower limit of the range is included but the upper limit
is excluded.

import numpy as np
arr = np.arange(5, 50)
print(arr)

The values 5 to 49 are displayed.

Array with zeros() method


The zeros() method creates an array filled with zeros.

import numpy as np
arr = np.zeros((4,4))
print(arr)

[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]

linspace() method
The linspace() method returns the specified number of evenly spaced
elements within the given range. Both ends of the range are included.

import numpy as np
arr = np.linspace(0,100,11)
arr

array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90., 100.])
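The difference between arange() and linspace() is easy to mix up, so a short side-by-side sketch may help. It uses the same range 0 to 100; the variable names are only for illustration.

```python
import numpy as np

# arange(start, stop, step): fixed step, stop value excluded
by_step = np.arange(0, 100, 10)

# linspace(start, stop, num): fixed count, stop value included by default
by_count = np.linspace(0, 100, 11)

print(by_step)    # last value is 90
print(by_count)   # last value is 100.0
print(len(by_step), len(by_count))
```

Use arange() when the step size matters and linspace() when the number of points matters.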

asarray() method
The asarray() method accepts a list of values and returns a numpy array.

import numpy as np
x=[1,2,3,4,5]
a=np.asarray(x)
print(a)

[1 2 3 4 5]

random number arrays


Numpy also has lots of ways to create random number arrays

rand() method
It creates an array of the given shape with random samples from a
uniform distribution over [0, 1).

import numpy as np
x = np.random.rand(2)
y = np.random.rand(5,5)
print(f'x = {x}')
print(f'y = {y}')

x = [0.85403266 0.42939896]
For y, a 5X5 array of random numbers is displayed.

randn() method
Returns a sample (or samples) from the "standard normal" distribution
(mean 0, variance 1), unlike rand(), which samples from a uniform
distribution.

import numpy as np
x = np.random.randn(2)
y = np.random.randn(5,5)
print(f'x = {x}')
print(f'y = {y}')
x = [-0.930095 -0.12667659]
For y, a 5X5 array of random numbers drawn from the standard normal
distribution is displayed.

randint() method
Return random integers from `low` (inclusive) to `high` (exclusive)

import numpy as np
x = np.random.randint(1,100) # returns one value from 1 to 99
y = np.random.randint(1,100,10) # returns 10 values from 1 to 99
print(f'x = {x}')
print(f'y = {y}')

x=4
y = [ 4 96 12 45 41 2 66 53 32 67]

reshape() method
This method returns a numpy array as per the given dimensions. The
product of dimensions must be equal to number of elements in the array.

import numpy as np
arr=np.zeros(12)
arr3d=arr.reshape((2,2,3))
print(arr3d)

[[[0. 0. 0.]
[0. 0. 0.]]
[[0. 0. 0.]
[0. 0. 0.]]]
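The rule that the product of the dimensions must equal the number of elements can be checked with a short sketch. It also shows -1, which asks NumPy to work out one dimension by itself. The names inferred and mismatch_rejected are chosen only for this example.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension: 12 elements / 3 rows = 4 columns
inferred = arr.reshape((3, -1))
print(inferred.shape)

# 5 * 3 = 15 does not equal 12, so this reshape is rejected
try:
    arr.reshape((5, 3))
    mismatch_rejected = False
except ValueError as err:
    mismatch_rejected = True
    print(f'reshape failed: {err}')
```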

How do you Access element in Numpy arrays.

Accessing elements in the array


We can access the elements in the array with index as subscript in square
brackets. Index begins with zero.

import numpy as np
arr=np.arange(2,20)
print(arr)
element=arr[8]
print(f'element 8 is {element}')

[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
element 8 is 10

Slicing
Slicing is the process of extracting a particular set of elements from the
array. Array slicing is similar to list slicing. The code samples below
demonstrate slicing.

import numpy as np
arr=np.arange(20)
print(arr)
arr_slice=slice(1,10,2) # every 2nd element from index 1 up to (not including) 10
print(f'slice = {arr_slice}')
print(f'elements in the slice : {arr[arr_slice]}')

arr_slice2=slice(1,15,3) # every 3rd element from index 1 up to (not including) 15
print(f'slice2 = {arr_slice2}')
print(f'elements in the slice2 : {arr[arr_slice2]}')

[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
slice = slice(1, 10, 2)
elements in the slice : [1 3 5 7 9]
slice2 = slice(1, 15, 3)
elements in the slice2 : [ 1 4 7 10 13]
import numpy as np
arr=np.arange(20)
print(arr[2:])
[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]

import numpy as np
arr=np.arange(20)
print(arr[:15])
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
import numpy as np
a=np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a[0:2,0:2])

[[1 2]
[3 4]]

Array shape, dimension and size


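The shape, number of dimensions and size of one element (in bytes) of a
numpy array can be known with the attributes shape, ndim and itemsize
respectively. These are attributes, not methods, so they are written
without parentheses.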

import numpy as np
a=np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a.shape) # prints the shape of the array
print(a.ndim) # prints the number of dimensions of the array
print(a.itemsize) # prints the size of one element in bytes

(3, 3)
2
4
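Since itemsize is the size of a single element and not of the whole array, a short sketch relating it to size and nbytes may be useful. The dtype is fixed to int32 here as an assumption, so that the element size is 4 bytes on every platform.

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]], dtype=np.int32)

print(a.shape)     # (rows, columns)
print(a.ndim)      # number of dimensions
print(a.size)      # total number of elements
print(a.itemsize)  # bytes used by one element (4 for int32)
print(a.nbytes)    # total bytes = size * itemsize
```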

dtype attribute
The dtype attribute gives the data type of the elements in the array

import numpy as np
arr = np.arange(25)
arr.dtype

dtype('int32')

Statistical Operations on Numpy arrays

Min, max methods


The min() and max() methods return the minimum and maximum of the
elements of the numpy array.

import numpy as np
a= np.array([10,11,12,13,14,15])
print(f'minimum = {a.min()}')
print(f'maximum = {a.max()}')

minimum = 10
maximum = 15

Finding sum of elements


The sum() method finds the sum of all elements in the array. It can
also find the sums column wise (axis=0) and row wise (axis=1). The
concept is shown below practically

import numpy as np
a= np.array([10,11,12,13,14,15])
print(f'sum of the elements = {a.sum()}')

a1 = np.array([(10,11,12),(13,14,15)])
print(f'sum of the elements column wise= {a1.sum(axis=0)}')
print(f'sum of the elements row wise = {a1.sum(axis=1)}')

sum of the elements = 75


sum of the elements column wise= [23 25 27]
sum of the elements row wise = [33 42]

Square root and standard deviation


The sqrt() and std() functions of the numpy module return the square root
of each element and the standard deviation of the elements in the array
respectively. The concept is shown below practically

import numpy as np
a1 = np.array([(10,11,12),(13,14,15)])
print(f'sqrt of the elements: \n {np.sqrt(a1)}')
print(f'standard deviation of the elements = {np.std(a1)}')

sqrt of the elements:


[[3.16227766 3.31662479 3.46410162]
[3.60555128 3.74165739 3.87298335]]
standard deviation of the elements = 1.707825127659933
Some Special methods on NumPy

ravel() method
This method flattens a 2 dimensional or multidimensional numpy array
into a single dimensional array. Practically the concept is as
follows.
import numpy as np

a1 = np.array([(10,11,12),(13,14,15)])
a1 = a1.ravel()
print(f'elements of a1 array : \n {a1}')
print(f'dimension of a1 is {a1.ndim}')

elements of a1 array :
[10 11 12 13 14 15]
dimension of a1 is 1

transpose() method
This method returns the matrix transpose of a 2 dimensional numpy
array. The concept is shown below practically.

import numpy as np
x = np.arange(9).reshape((3,3))
print(f'input for transpose is : \n {x}')
x=np.transpose(x)
print(f' array after transpose is \n{x}')

input for transpose is :


[[0 1 2]
[3 4 5]
[6 7 8]]
array after transpose is
[[0 3 6]
[1 4 7]
[2 5 8]]

eye() method
This method allows us to create identity matrix.
import numpy as np
arr = np.eye(4)
print(arr)

[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]

Arithmetic (Mathematical) Operations on numpy arrays


We can perform arithmetic operations on numpy arrays. The concept is
given below practically

import numpy as np
x= np.array([(1,2,3),(4,5,6)])
y= np.array([(2,3,4),(5,6,7)])
print('array x is :\n')
print(x)
print('\narray y is :\n')
print(y)
print('x+y: \n')
print(x+y)
print('x-y: \n')
print(x-y)
print('x*y: \n')
print(x*y)
print('x/y: \n')
print(x/y)

array x is :

[[1 2 3]
[4 5 6]]

array y is :
[[2 3 4]
[5 6 7]]
x+y:

[[ 3 5 7]
[ 9 11 13]]
x-y:

[[-1 -1 -1]
[-1 -1 -1]]
x*y:

[[ 2 6 12]
[20 30 42]]
x/y:

[[0.5 0.66666667 0.75 ]
[0.8 0.83333333 0.85714286]]

Broadcasting
Broadcasting allows us to perform an operation on arrays even when their
shapes are not the same, provided the shapes are compatible: the smaller
array is stretched to match the larger one. The example below
demonstrates broadcasting with a single value and with a one-row array.

import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,])
b = 10
print('a+b is \n')
print(a+b) # broadcasting is done with single element
x= np.array([(1,2,3),(4,5,6),(7,8,9)])
y= np.array([(10,11,12)])
print('\nx+y is \n')
print(x+y) # single row of y is added to every row of x (broadcasting)

a+b is
[11 12 13 14 15 16 17 18 19]
x+y is
[[11 13 15]
[14 16 18]
[17 19 21]]
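Broadcasting also works between a column and a row, and it fails when the shapes cannot be matched. The sketch below is only an illustration; the names col, row and table are not part of the example above.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)

# (3, 1) and (3,) are stretched to a common shape (3, 3)
table = col + row
print(table)

# shapes (3,) and (4,) differ and neither length is 1, so this fails
try:
    row + np.array([1, 2, 3, 4])
    incompatible_rejected = False
except ValueError:
    incompatible_rejected = True
print(incompatible_rejected)
```

Two dimensions are compatible when they are equal or when one of them is 1.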

Vertical & Horizontal Stacking


Concatenating two arrays is possible in horizontal and vertical format.
The concept is presented below practically

import numpy as np
x= np.array([(1,2,3),(4,5,6)])
y= np.array([(7,8,9),(10,11,12)])
print(f'array x is : \n {x}')
print(f'array y is : \n {y}')
print('vertical stack of x and y\n')
print(np.vstack((x,y)))
print('horizontal stack of x and y\n')
print(np.hstack((x,y)))

array x is :
[[1 2 3]
[4 5 6]]
array y is :
[[ 7 8 9]
[10 11 12]]
vertical stack of x and y

[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
horizontal stack of x and y

[[ 1 2 3 7 8 9]
[ 4 5 6 10 11 12]]
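The same results can also be obtained with np.concatenate() by choosing the axis. This is a short sketch on the same two arrays; the names v and h are chosen only for illustration.

```python
import numpy as np

x = np.array([(1, 2, 3), (4, 5, 6)])
y = np.array([(7, 8, 9), (10, 11, 12)])

v = np.concatenate((x, y), axis=0)   # rows of y placed below x, like vstack
h = np.concatenate((x, y), axis=1)   # columns of y placed beside x, like hstack

print(v.shape)
print(h.shape)
```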

Write about Indexing in numpy arrays


Indexing a 2D array (matrices)
The general format is arr_2d[row][col] or arr_2d[row,col]. The comma
notation is preferred for clarity.

import numpy as np
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
#Printing array
print(f'array arr_2d is : \n {arr_2d}')
print(f'\nrow at index 1 is :\n {arr_2d[1]}')
# Getting individual element value
print(f'\nelement at 1st row 0th column :\n {arr_2d[1][0]}')
#Getting individual element using comma notation in index
print(f'\nelement at 1st row 1st column :\n {arr_2d[1,1]}')

# 2D array slicing
print(f'\nShape (2,2) from top right corner :\n {arr_2d[:2,1:]}')

#printing bottom row in two ways
print(f'\nbottom row :\n {arr_2d[2]}')
print(f'\nbottom row :\n {arr_2d[2,:]}')

array arr_2d is :
[[ 5 10 15]
[20 25 30]
[35 40 45]]
row at index 1 is :
[20 25 30]

element at 1st row 0th column :


20

element at 1st row 1st column :


25

Shape (2,2) from top right corner :


[[10 15]
[25 30]]

bottom row :
[35 40 45]

bottom row :
[35 40 45]

Fancy Indexing
Fancy indexing allows you to select entire rows or columns in any order
by passing a list of indexes. The concept is practically shown below

import numpy as np
#Set up matrix
arr2d = np.zeros((10,10))
#Number of rows in the array
arr_length = arr2d.shape[0]
print(f'\nlength of the array is : {arr_length}')
#Fill each row with its own index
for i in range(arr_length):
    arr2d[i] = i
print(f'\nthe 2d array is : \n {arr2d}')

print(f'\n array with fancy indexing :\n {arr2d[[2,4,6,8]]}')

#Allows in any order
print(f'\n array in different order :\n {arr2d[[6,4,2,7]]}')

length of the array is : 10

the 2d array is :
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
[4. 4. 4. 4. 4. 4. 4. 4. 4. 4.]
[5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]
[6. 6. 6. 6. 6. 6. 6. 6. 6. 6.]
[7. 7. 7. 7. 7. 7. 7. 7. 7. 7.]
[8. 8. 8. 8. 8. 8. 8. 8. 8. 8.]
[9. 9. 9. 9. 9. 9. 9. 9. 9. 9.]]

array with fancy indexing :


[[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[4. 4. 4. 4. 4. 4. 4. 4. 4. 4.]
[6. 6. 6. 6. 6. 6. 6. 6. 6. 6.]
[8. 8. 8. 8. 8. 8. 8. 8. 8. 8.]]

array in different order :


[[6. 6. 6. 6. 6. 6. 6. 6. 6. 6.]
[4. 4. 4. 4. 4. 4. 4. 4. 4. 4.]
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[7. 7. 7. 7. 7. 7. 7. 7. 7. 7.]]
Selecting array values with comparison
(Boolean Comparison)
We can select the values from a numpy array by comparing values as
shown below

import numpy as np
arr = np.arange(1,11)
print(f'arr = {arr}')
print(f'\n compare array values with some integer\n {arr> 3}')
bool_arr = arr>4
print(f'\n arr values > 4 : {arr[bool_arr]}')
print(f'\n arr values > 5 : {arr[arr>5]}')

arr = [ 1 2 3 4 5 6 7 8 9 10]
compare array values with some integer
[False False False True True True True True True True]

arr values >4 : [ 5 6 7 8 9 10]

arr values >5 : [ 6 7 8 9 10]
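More than one condition can be combined in a selection. The sketch below is an illustration on the same array; the names between, outside and odd are chosen only for this example.

```python
import numpy as np

arr = np.arange(1, 11)

# & means "and", | means "or", ~ means "not";
# each comparison must be placed in parentheses
between = arr[(arr > 3) & (arr < 8)]
outside = arr[(arr <= 3) | (arr >= 8)]
odd = arr[~(arr % 2 == 0)]

print(between)
print(outside)
print(odd)
```

The Python keywords and, or and not do not work element by element on arrays, so the operators &, | and ~ are used instead.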

UNIT - IV

Data Series (Different ways of creating series in pandas)
A Data Series is similar to a numpy array. The difference is that a series
can be indexed by labels, whereas a numpy array is indexed only by integer
positions and is normally used for values of a single (usually numeric)
type. In a data series, the index can be a label and the values can be
any Python object, not just numbers.

Creating a Series
We can create a series from a list, numpy array or dictionary. The process
is practically shown below.

Creating a series from list


A list can be sent as an argument to pandas Series() method to create a
data series from list.

import numpy as np
import pandas as pd
labels = ['a','b','c']
my_list = [10,20,30]
#creating a series from a list
list_series1 = pd.Series(data=my_list)
list_series2 = pd.Series(data=my_list,index=labels)
list_series3 = pd.Series(my_list,labels)
print(f'\nlist_series1:\n{list_series1}')
print(f'\nlist_series2:\n{list_series2}')
print(f'\nlist_series3:\n{list_series3}')

list_series1:
0 10
1 20
2 30
dtype: int64

list_series2:
a 10
b 20
c 30
dtype: int64
list_series3:
a 10
b 20
c 30
dtype: int64

Creating a series from numpy array


A numpy array is passed to the Series() method of pandas to create a data
series.

import numpy as np
import pandas as pd
labels = ['a','b','c']
arr = np.array([10,20,30])
arr_series1=pd.Series(arr)
arr_series2=pd.Series(arr,labels)
print(f'\narr_series1:\n{arr_series1}')
print(f'\narr_series2:\n{arr_series2}')

arr_series1:
0 10
1 20
2 30
dtype: int32

arr_series2:
a 10
b 20
c 30
dtype: int32

Creating a series from dictionary


Similar to lists and numpy arrays, python dictionary can be sent as
argument to pandas Series() method to create a data series.
import pandas as pd
d = {'a': 10, 'b':20, 'c':30}
pd.Series(d)

a 10
b 20
c 30
dtype: int64
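Once a series is created, its values can be read by label or by position. The following is a minimal sketch; the series named mixed is a made-up example to show that the values need not be numbers.

```python
import pandas as pd

s = pd.Series({'a': 10, 'b': 20, 'c': 30})

by_label = s['b']           # lookup by label
by_position = s.iloc[0]     # lookup by position
subset = s[['a', 'c']]      # several labels at once

# A Series can also hold non-numeric Python objects
mixed = pd.Series(['x', 3.5, None], index=['p', 'q', 'r'])

print(by_label, by_position)
print(subset)
print(mixed)
```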

Data frames (creating and accessing elements in Data Frames)
A Data Frame is a collection of Series objects put together that share the
same index.

Creating a Data frame


With the code below we create a 5X4 data frame that contains random
numbers. The method random.seed() sets the starting point (seed) for the
random number generator. Because the seed is fixed at 101, the same
numbers are generated every time the code is run.

import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101) # fixed starting point, so the same numbers are generated on every run
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
print(df)

W X Y Z

A 2.706850 0.628133 0.907969 0.503826

B 0.651118 -0.319318 -0.848077 0.605965

C -2.018168 0.740122 0.528813 -0.589001

D 0.188695 -0.758872 -0.933237 0.955057

E 0.190794 1.978757 2.605967 0.683509

Selection and Indexing

Column selection
We can select a column by its name as shown below

print(df['W'])

A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64

Selecting more columns


We can select two or more columns by sending the column names as a
list as shown below

print(df[['W','Z']])

W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509

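We can also select a column from a data frame using the dot operator
(attribute access). It is not recommended, because it fails when the
column name is the same as a data frame method or is not a valid Python
name.

# Attribute access (NOT RECOMMENDED!)
print(df.W)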
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64

Note that the columns we are extracting from the data frame are data
series. We can check this by type() function as shown below

type(df['W'])

pandas.core.series.Series

Creating a new column


We can create new column from the existing columns in the data frame.
The concept is shown below practically

df['NEW'] = df['X'] + df['Y']


df

W X Y Z NEW

A 2.706850 0.628133 0.907969 0.503826 1.536102

B 0.651118 -0.319318 -0.848077 0.605965 -1.167395

C -2.018168 0.740122 0.528813 -0.589001 1.268936

D 0.188695 -0.758872 -0.933237 0.955057 -1.692109

E 0.190794 1.978757 2.605967 0.683509 4.584725

Removing Columns
Columns can be removed with the drop() method of a data frame, using
axis=1. The change is kept in the data frame only if the inplace parameter
is True; otherwise drop() returns a new data frame and leaves the original
unchanged. The concept is practically shown below

df.drop('NEW',axis=1)
W X Y Z

A 2.706850 0.628133 0.907969 0.503826

B 0.651118 -0.319318 -0.848077 0.605965

C -2.018168 0.740122 0.528813 -0.589001

D 0.188695 -0.758872 -0.933237 0.955057

E 0.190794 1.978757 2.605967 0.683509

But the changes are not reflected in the data frame. We can see it by
printing the data frame

# Not inplace unless specified!


df

W X Y Z NEW

A 2.706850 0.628133 0.907969 0.503826 1.536102

B 0.651118 -0.319318 -0.848077 0.605965 -1.167395

C -2.018168 0.740122 0.528813 -0.589001 1.268936

D 0.188695 -0.758872 -0.933237 0.955057 -1.692109

E 0.190794 1.978757 2.605967 0.683509 4.584725

df.drop('NEW',axis=1,inplace=True)
df

W X Y Z

A 2.706850 0.628133 0.907969 0.503826

B 0.651118 -0.319318 -0.848077 0.605965

C -2.018168 0.740122 0.528813 -0.589001

D 0.188695 -0.758872 -0.933237 0.955057

E 0.190794 1.978757 2.605967 0.683509

Removing rows
We can drop rows in the same way as columns. Here we use axis = 0. The
change is kept in the data frame only if inplace = True.

df.drop('E',axis=0)

W X Y Z

A 2.706850 0.628133 0.907969 0.503826

B 0.651118 -0.319318 -0.848077 0.605965

C -2.018168 0.740122 0.528813 -0.589001

D 0.188695 -0.758872 -0.933237 0.955057

Selecting rows
We can select the rows from a data frame using labels and position. We
can also select a subset of rows and columns. The concept is practically
shown below.

#selecting row with label


df.loc['A']

W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
#select row with position
df.iloc[2]

W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
#selecting a particular cell
df.loc['B','Y']

-0.84807698340363147
#selecting a particular part of the dataframe
df.loc[['A','B'],['W','Y']]

W Y

A 2.706850 0.907969

B 0.651118 -0.848077
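Both loc and iloc also accept slices, with one difference that is easy to miss. The sketch below uses a small made-up data frame of whole numbers (not the random one above) so that the result is easy to read.

```python
import numpy as np
import pandas as pd

# Illustrative data frame with the same row and column labels
df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  index=list('ABCDE'), columns=list('WXYZ'))

by_label = df.loc['A':'C', 'W':'X']   # label slices include both end labels
by_position = df.iloc[0:3, 0:2]       # position slices exclude the stop position

print(by_label)
print(by_position)
```

Both selections return rows A to C and columns W and X.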

Conditional Selection
Selecting values from data frame is possible based on conditions. This is
similar to the selection in numpy arrays. The concept is practically shown
below

Consider the following data frame

print(df)

W X Y Z

A 2.706850 0.628133 0.907969 0.503826

B 0.651118 -0.319318 -0.848077 0.605965

C -2.018168 0.740122 0.528813 -0.589001

D 0.188695 -0.758872 -0.933237 0.955057

E 0.190794 1.978757 2.605967 0.683509

print(df>0)
W X Y Z

A True True True True

B True False False True

C False True True False

D True False False True

E True True True True

df[df>0]

W X Y Z

A 2.706850 0.628133 0.907969 0.503826

B 0.651118 NaN NaN 0.605965

C NaN 0.740122 0.528813 NaN

D 0.188695 NaN NaN 0.955057

E 0.190794 1.978757 2.605967 0.683509

df[df['W']>0]

W X Y Z

A 2.706850 0.628133 0.907969 0.503826

B 0.651118 -0.319318 -0.848077 0.605965

D 0.188695 -0.758872 -0.933237 0.955057

E 0.190794 1.978757 2.605967 0.683509

df[df['W']>0]['Y']

A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64

df[df['W']>0][['Y','X']]

Y X

A 0.907969 0.628133

B -0.848077 -0.319318

D -0.933237 -0.758872

E 2.605967 1.978757

df[(df['W']>0) & (df['Y'] > 1)]

W X Y Z

E 0.190794 1.978757 2.605967 0.683509

Resetting index

The reset_index() method of pandas resets the index of a dataframe to the default integer index. The old index is kept as a column named index.

# Reset to default 0,1...n index


df.reset_index()

index W X Y Z

0 A 2.706850 0.628133 0.907969 0.503826

1 B 0.651118 -0.319318 -0.848077 0.605965

2 C -2.018168 0.740122 0.528813 -0.589001

3 D 0.188695 -0.758872 -0.933237 0.955057

4 E 0.190794 1.978757 2.605967 0.683509


#Creating new index
newind = 'TS AP UP MP TN'.split()
df['States'] = newind
df

W X Y Z States

A 2.706850 0.628133 0.907969 0.503826 TS

B 0.651118 -0.319318 -0.848077 0.605965 AP

C -2.018168 0.740122 0.528813 -0.589001 UP

D 0.188695 -0.758872 -0.933237 0.955057 MP

E 0.190794 1.978757 2.605967 0.683509 TN

#set new index


df.set_index('States',inplace=True)

W X Y Z

States

TS 2.706850 0.628133 0.907969 0.503826

AP 0.651118 -0.319318 -0.848077 0.605965

UP -2.018168 0.740122 0.528813 -0.589001

MP 0.188695 -0.758872 -0.933237 0.955057

TN 0.190794 1.978757 2.605967 0.683509

Handling missing data in data frames


We find some data is missing in data frames. There are several techniques
to handle missing data. Some of the techniques are practically shown
below. For this consider the following data frame.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df

A B C

0 1.0 5.0 1

1 2.0 NaN 2

2 NaN NaN 3

Drop null values


The dropna() method removes the rows that contain null values from the data frame.

df.dropna()

A B C

0 1.0 5.0 1

#drop the columns that contain null values
df.dropna(axis=1)

C

0 1

1 2

2 3

#keep only the rows that have at least 2 non-null values
df.dropna(thresh=2)

A B C

0 1.0 5.0 1

1 2.0 NaN 2

#fill null values with zeros


df.fillna(value=0)

A B C

0 1.0 5.0 1

1 2.0 0.0 2

2 0.0 0.0 3

#fill null values of column A with the mean of that column


df['A'].fillna(value=df['A'].mean())

0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
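Before choosing a technique it helps to count the missing values, and the mean fill can be applied to every column at once. This is a sketch on the same data frame; the names missing_per_column and filled are chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Number of null values in each column
missing_per_column = df.isnull().sum()

# df.mean() gives one mean per column; fillna() uses the matching mean
filled = df.fillna(df.mean())

print(missing_per_column)
print(filled)
```

fillna() returns a new data frame here, so df itself still contains the null values.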

Grouping data in data frame


The groupby() method of pandas allows you to group rows of data
together and call aggregate functions on each group. The concept is
practically described below. For this consider the following data frame.

import pandas as pd
# Create dataframe
data = {'course':['MPCS','MPCS','MSCS','MSCS','MSDS','MSDS'],
'student':
['Sameer','Shantan','Amit','Varsha','Govind','Vishal'],
'marks':[410,370,402,433,366,297]}
df = pd.DataFrame(data)
df
course student marks

0 MPCS Sameer 410

1 MPCS Shantan 370

2 MSCS Amit 402

3 MSCS Varsha 433

4 MSDS Govind 366

5 MSDS Vishal 297

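#course wise mean of marks
by_course = df.groupby("course")
# numeric_only=True leaves out the non-numeric student column
# (recent pandas versions raise an error without it)
by_course.mean(numeric_only=True)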

marks

course

MPCS 390.0

MSCS 417.5

MSDS 331.5

#course wise standard deviation of marks
by_course.std(numeric_only=True)

marks

course

MPCS 28.284271

MSCS 21.920310

MSDS 48.790368

#course wise minimum of each column (student names are compared alphabetically, independently of marks)


by_course.min()

student marks

course

MPCS Sameer 370

MSCS Amit 402

MSDS Govind 297

by_course.max()

student marks

course

MPCS Shantan 410

MSCS Varsha 433

MSDS Vishal 366

by_course.count()
student marks

course

MPCS 2 2

MSCS 2 2

MSDS 2 2

by_course.describe()

marks

count mean std min 25% 50% 75% max

course

MPCS 2.0 390.0 28.284271 370.0 380.00 390.0 400.00 410.0

MSCS 2.0 417.5 21.920310 402.0 409.75 417.5 425.25 433.0

MSDS 2.0 331.5 48.790368 297.0 314.25 331.5 348.75 366.0

by_course.describe().transpose()['MPCS']
marks count 2.000000
mean 390.000000
std 28.284271
min 370.000000
25% 380.000000
50% 390.000000
75% 400.000000
max 410.000000
Name: MPCS, dtype: float64
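Several aggregate functions can be requested in one call with agg(). The following is a sketch on the same data; selecting the marks column first is a choice made here so that only numeric values are aggregated.

```python
import pandas as pd

data = {'course': ['MPCS', 'MPCS', 'MSCS', 'MSCS', 'MSDS', 'MSDS'],
        'student': ['Sameer', 'Shantan', 'Amit', 'Varsha', 'Govind', 'Vishal'],
        'marks': [410, 370, 402, 433, 366, 297]}
df = pd.DataFrame(data)

# One row per course, one column per aggregate function
summary = df.groupby('course')['marks'].agg(['mean', 'min', 'max'])
print(summary)
```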
Concatenating the data frames
Concatenation joins DataFrames together. The data frames should have
matching labels along the other axis (for example, the same columns when
rows are stacked); where labels do not match, the gaps are filled with
NaN. We can use pd.concat and pass in a list of Data Frames to concatenate
together. The code below practically demonstrates the concept.

import pandas as pd
df1 = pd.DataFrame({'BSc': ['11', '12', '13', '14'],
'BCom': ['31', '32', '33', '34'],
'BBA': ['51', '52', '53', '54'],
'BA': ['71', '72', '73', '74']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'BSc': ['15', '16', '17', '18'],
'BCom': ['35', '36', '37', '38'],
'BBA': ['55', '56', '57', '58'],
'BA': ['75', '76', '77', '78']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'BSc': ['19', '20', '21', '22'],
'BCom': ['39', '40', '41', '42'],
'BBA': ['59', '60', '61', '62'],
'BA': ['79', '80', '81', '82']},
index=[8, 9, 10, 11])
print(f'\ndf1 is \n {df1}')
print(f'\ndf2 is \n {df2}')
print(f'\ndf3 is \n {df3}')
df1 is
BSc BCom BBA BA
0 11 31 51 71
1 12 32 52 72
2 13 33 53 73
3 14 34 54 74

df2 is
BSc BCom BBA BA
4 15 35 55 75
5 16 36 56 76
6 17 37 57 77
7 18 38 58 78

df3 is
BSc BCom BBA BA
8 19 39 59 79
9 20 40 60 80
10 21 41 61 81
11 22 42 62 82
#concatenated row wise
pd.concat([df1,df2,df3])

BSc BCom BBA BA

0 11 31 51 71

1 12 32 52 72

2 13 33 53 73

3 14 34 54 74

4 15 35 55 75

5 16 36 56 76

6 17 37 57 77

7 18 38 58 78

8 19 39 59 79

9 20 40 60 80

10 21 41 61 81

11 22 42 62 82

#axis =1, column based concatenation


pd.concat([df1,df2,df3],axis=1)

BSc BCom BBA BA BSc BCom BBA BA BSc BCom BBA BA

0 11 31 51 71 NaN NaN NaN NaN NaN NaN NaN NaN

1 12 32 52 72 NaN NaN NaN NaN NaN NaN NaN NaN

2 13 33 53 73 NaN NaN NaN NaN NaN NaN NaN NaN

3 14 34 54 74 NaN NaN NaN NaN NaN NaN NaN NaN

4 NaN NaN NaN NaN 15 35 55 75 NaN NaN NaN NaN

5 NaN NaN NaN NaN 16 36 56 76 NaN NaN NaN NaN

6 NaN NaN NaN NaN 17 37 57 77 NaN NaN NaN NaN

7 NaN NaN NaN NaN 18 38 58 78 NaN NaN NaN NaN

8 NaN NaN NaN NaN NaN NaN NaN NaN 19 39 59 79

9 NaN NaN NaN NaN NaN NaN NaN NaN 20 40 60 80

10 NaN NaN NaN NaN NaN NaN NaN NaN 21 41 61 81

11 NaN NaN NaN NaN NaN NaN NaN NaN 22 42 62 82

Merging

The merge() function allows us to merge DataFrames together using logic similar to
joining SQL tables. The concept is practically demonstrated below.

left = pd.DataFrame({'key': ['11', '12', '13', '14'],
'S1': ['89', '90', '91', '86'],
'S2': ['74', '86', '77', '70']})

right = pd.DataFrame({'key': ['11', '12', '13', '14'],
'S3': ['78', '79', '81', '85'],
'S4': ['88', '85', '87', '73']})
print(f'\nleft is \n {left}')
print(f'\nright is \n {right}')
pd.merge(left,right,how='inner',on='key')
left is
key S1 S2
0 11 89 74
1 12 90 86
2 13 91 77
3 14 86 70

right is
key S3 S4
0 11 78 88
1 12 79 85
2 13 81 87
3 14 85 73

key S1 S2 S3 S4

0 11 89 74 78 88

1 12 90 86 79 85

2 13 91 77 81 87

3 14 86 70 85 73
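The how parameter matters when the key values of the two data frames do not fully match. The sketch below uses two small made-up data frames in which key 11 appears only on the left and key 14 only on the right.

```python
import pandas as pd

left = pd.DataFrame({'key': ['11', '12', '13'], 'S1': ['89', '90', '91']})
right = pd.DataFrame({'key': ['12', '13', '14'], 'S3': ['79', '81', '85']})

inner = pd.merge(left, right, how='inner', on='key')      # keys present in both
outer = pd.merge(left, right, how='outer', on='key')      # keys present in either
left_only = pd.merge(left, right, how='left', on='key')   # all keys of left

print(inner)
print(outer)
print(left_only)
```

Cells that have no matching row on the other side are filled with NaN.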

Joining
Joining is a convenient method for combining the columns of two
potentially differently-indexed DataFrames into a single result DataFrame.

left = pd.DataFrame({
'S1': ['89', '90', '91', '86'],
'S2': ['74', '86', '77', '70'],
'S5': ['70', '80', '79', '79']})

right = pd.DataFrame({ 'S3': ['78', '79', '81', '85'],


'S4': ['88', '85', '87', '73']})
print(f'\nleft is \n {left}')
print(f'\nright is \n {right}')
left.join(right) #joining
left is
S1 S2 S5
0 89 74 70
1 90 86 80
2 91 77 79
3 86 70 79

right is
S3 S4
0 78 88
1 79 85
2 81 87
3 85 73

S1 S2 S5 S3 S4

0 89 74 70 78 88

1 90 86 80 79 85

2 91 77 79 81 87

3 86 70 79 85 73
Matplotlib

Data visualization is the presentation of data in a pictorial or graphical
format. It enables decision makers to see patterns, trends and correlations
that might go undetected in text-based data. A large set of numbers or
text may not give clarity, so it is necessary to plot the patterns in the
data. Different types of charts and their plotting process are discussed
in this chapter.

Matplotlib is a Python library that is specially designed for creating
graphs and charts, including interactive data visualizations.
Matplotlib is inspired by the MATLAB software and reproduces many of
its features. If matplotlib is not installed on your system, install it with pip.

plot with matplotlib (How do you create and format plots)
The following code plots a simple graph for a list of values using
matplotlib. Here the plot() method takes the values for the Y axis, and the
X axis values are implicit. The X axis values run from 0 to N-1, where N is
the length of the list. The show() method displays the plot.

import matplotlib.pyplot as plt


plt.plot([2,4,6,8])
plt.show()

A plot with values for both axes


In the above plot, the values for the X axis are implicit. In this section
we supply values for both the X and Y axes.
Let X = [0,1,2,3,4] and Y = X^2. The code for the plot is as follows

import matplotlib.pyplot as plt


x = range(5)
plt.plot(x,[s**2 for s in x])
plt.show()

Multiline plots
Here we use a numpy array and call the plot() method once for each of 3
equations, so that three lines are drawn on the same axes.

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0,5,0.01)
plt.plot(x,[x1 for x1 in x]) # y=x
plt.plot(x,[s**2 for s in x]) #y=x^2
plt.plot(x,[c**3 for c in x]) #y=x^3
plt.show()
Adding a grid
The grid() method adds grid lines to the plot area. Passing True turns
the grid on and False turns it off. The grid appears in the background
of the plot

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0,5,0.01)
plt.plot(x,[x1 for x1 in x]) # y=x
plt.plot(x,[s*2 for s in x]) #y=2x
plt.plot(x,[c*4 for c in x]) #y=4x
plt.grid(True)
plt.show()
Limiting the Axes
The axis limits of the plot can be set using the axis() method, which takes
a list of the form [xmin, xmax, ymin, ymax]. Use of the axis() method
is practically demonstrated below.

import matplotlib.pyplot as plt


x = range(5)
plt.plot(x,[x1 for x1 in x]) # y=x
plt.plot(x,[s*2 for s in x]) #y=2x
plt.plot(x,[c*4 for c in x]) #y=4x
plt.axis([-1,5,-1,12]) #sets new axes limits
plt.grid(True)
plt.show()

xlim() and ylim() methods


These methods are used to set the limits of each axis separately. The code
is shown below

import matplotlib.pyplot as plt


x = range(5)
plt.plot(x,[x1 for x1 in x]) # y=x
plt.plot(x,[s*2 for s in x]) #y=2x
plt.plot(x,[c*4 for c in x]) #y=4x
plt.xlim([-1,5]) # limits for x axis
plt.ylim([-1,20]) #limits for y axis
plt.grid(True)
plt.show()

Adding axes labels and graph title


The xlabel(), ylabel() and title() methods allow us to add axes labels and
a graph title respectively.

import matplotlib.pyplot as plt


x = range(5)
plt.plot(x,[x1 for x1 in x]) # y=x
plt.plot(x,[s*2 for s in x]) #y=2x
plt.plot(x,[c*4 for c in x]) #y=4x
plt.xlim([-1,5]) # limits for x axis
plt.ylim([-1,16]) #limits for y axis
plt.xlabel('X value') # x axis label
plt.ylabel('Y value') # y axis label
plt.title('Polynomial Graph') # chart title
plt.grid(True)
plt.show()
Adding Legend
Legend displays meaning of each line in the graph. The legend() method
displays the legend in the plot.

import matplotlib.pyplot as plt


import numpy as np
x = np.arange(5)
plt.plot(x,x, label = 'linear') # y=x
plt.plot(x, x*x, label = 'square') #y=x^2
plt.plot(x, x*x*x, label='cube') #y=x^3
plt.xlim([0,5]) # limits for x axis
plt.ylim([0,20]) #limits for y axis
plt.xlabel('X value') # x axis label
plt.ylabel('Y value') # y axis label
plt.title('Polynomial Graph') # chart title
plt.legend()
plt.grid(True)
plt.show()
Saving plots
Plots can be saved using savefig() method. The concept is shown below
practically

import matplotlib.pyplot as plt


import numpy as np
x = np.arange(5)
plt.plot(x,x, label = 'linear') # y=x
plt.plot(x, x*x, label = 'square') #y=x^2
plt.plot(x, x*x*x, label='cube') #y=x^3
plt.xlim([0,5]) # limits for x axis
plt.ylim([0,20]) #limits for y axis
plt.xlabel('X value') # x axis label
plt.ylabel('Y value') # y axis label
plt.title('Polynomial Graph') # chart title
plt.legend()
plt.grid(True)
plt.savefig('Polygraph.png')
#plot will be saved in current working directory
plt.show()

Types of Plots
Matplotlib provides many types of plot formats for visualising information.
The different formats are listed below
 Histogram

 Bar Graph

 Scatter Plot

 Pie Chart

Histogram
Histograms display the distribution of a variable by counting how many
values fall into each range of values. hist() is the method used to plot a
histogram.

import matplotlib.pyplot as plt


import numpy as np
y = np.random.randn(100,100) # 100X100 array of values from a Gaussian distribution
plt.hist(y) # y is the data for the histogram
plt.show()

A histogram groups values into non-overlapping ranges called bins. The default number of
bins in the histogram plot is 10. The code below plots the histogram with 100 bins.

import matplotlib.pyplot as plt


import numpy as np
y = np.random.randn(100,100) # 100X100 array of values from a Gaussian distribution
plt.hist(y,100) # 100 bins are taken
plt.show()
Bar Chart
A bar chart compares data of different categories. It can also be plotted
to visualise changes over a period of time. The bar() method is used to
plot the graph.

# Importing numpy and the matplotlib library
import numpy as np
import matplotlib.pyplot as plt
# Declaring the figure size as (width, height) in inches
plt.figure(figsize=[10, 5])
# Data to be plotted
srilanka = [50,102,200,280,320]
india = [80,120,210,290,326]
X = np.arange(5)
plt.bar(X, srilanka, color = 'r', width = 0.25)
plt.bar(X + 0.25, india, color = 'g', width = 0.25)
# Creating the legend of the bars in the plot
plt.legend(['SRILANKA', 'INDIA'])
# Overriding the x axis ticks with the number of overs
plt.xticks([i + 0.25 for i in range(5)], ['10', '20', '30', '40', '50'])
# Giving the title for the plot
plt.title("Bar plot representing score comparison")
# Naming the x and y axis
plt.xlabel('Overs')
plt.ylabel('Runs')
# Displaying the bar plot
plt.show()

Pie charts
Pie charts are used to compare multiple parts against the whole. The
pie() method is used to plot the pie chart

import matplotlib.pyplot as plt


plt.figure(figsize=[3, 3])
x = [5000,3000,1000,800,500]
labels = ['FOOD','EDUCATION','HEALTH','TRANSPORT','MISC']
plt.pie(x,labels=labels)
plt.show()
Scatter plot
Scatter plots display values for two sets of data, visualized as a collection
of points.

import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(1000)
y = np.random.rand(1000)
x1 = np.random.rand(100)
y1 = np.random.rand(100)
plt.scatter(x,y, color = 'r')
plt.scatter(x1,y1, color = 'g')
plt.show()

Creating Multiplots on Same Canvas


The subplot() method allows us to create multiple plots on the same canvas.
The code below demonstrates the concept practically

import matplotlib.pyplot as plt


import numpy as np
x = np.linspace(0, 10, 21)
y = x ** 2
# plt.subplot(nrows, ncols, plot_number)
plt.subplot(1,2,1)
plt.plot(x, y, 'g--') # More on color options later
plt.subplot(1,2,2)
plt.plot(y, x, 'r*-')
plt.show()
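The same two plots can also be drawn with plt.subplots(), which creates the figure and all the plot areas in one call. The following is only a sketch: the Agg backend is chosen here as an assumption so that the code runs on a machine without a display, and the titles are made up for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 21)
y = x ** 2

# One row, two columns; axes is an array holding one plot area per cell
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
axes[0].plot(x, y, 'g--')
axes[0].set_title('y = x^2')
axes[1].plot(y, x, 'r*-')
axes[1].set_title('x against y')
fig.tight_layout()   # keeps the two plots from overlapping
```

With an interactive backend, calling plt.show() at the end displays the figure as in the earlier examples.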
Styling the plots
Matplotlib allows us to choose custom colors for plots. In the plot()
method we can pass a color code as an argument. The code below
demonstrates the concept practically

import matplotlib.pyplot as plt


import numpy as np
y = np.arange(1,5)
plt.plot(y,'r') # red
plt.plot(y+5,'g') #green
plt.plot(y+10,'b') #blue
plt.show()

The color codes are listed below:

Color code Color
b Blue
c Cyan
g Green
k Black
m Magenta
r Red
w White
y Yellow
We can also select the line style in the plots. Different options are listed
below

Style Style Name
- Solid line
-- Dashed line
-. Dash dot line
: Dotted line
The below code demonstrates the concept of applying line style practically.

import matplotlib.pyplot as plt


import numpy as np
y = np.arange(1,5)
plt.plot(y,':') # dotted
plt.plot(y+5,'--') #dashed
plt.plot(y+10,'-') #solid
plt.show()
