Data Science - III
Text Analysis
Statistical Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
Text Analysis
Text Analysis is also referred to as Data Mining. It is one of the methods of data
analysis used to discover patterns in large data sets using databases or data mining
tools. It is used to transform raw data into business information. Business Intelligence
tools available in the market are used to take strategic business decisions.
Overall, it offers a way to extract and examine data, derive patterns, and finally
interpret the data.
Statistical Analysis
Statistical Analysis shows “What happened?” by using past data in the form of
dashboards. Statistical Analysis includes collection, analysis, interpretation,
presentation, and modeling of data. It analyses a set of data or a sample of data.
There are two categories of this type of Analysis – Descriptive Analysis and
Inferential Analysis.
Descriptive Analysis
Descriptive Analysis analyses complete data or a sample of summarized numerical
data. It shows the mean and deviation for continuous data, and the percentage and
frequency for categorical data.
Inferential Analysis
Inferential Analysis analyses a sample of the complete data. In this type of Analysis,
you can reach different conclusions from the same data by selecting different samples.
Diagnostic Analysis
Diagnostic Analysis shows “Why did it happen?” by finding the cause from the
insights found in Statistical Analysis. This Analysis is useful for identifying behavior
patterns in data. If a new problem arises in your business process, you can look into
this Analysis to find similar patterns for that problem, and there is a chance that
similar prescriptions can be used for the new problem.
Predictive Analysis
Predictive Analysis shows “what is likely to happen” by using previous data. The
simplest example: if last year I bought two dresses based on my savings, and this
year my salary doubles, then I can buy four dresses. But of course it is not that
simple, because you have to consider other circumstances, such as the chance that
clothes prices increase this year, or that instead of dresses you want to buy a new
bike, or you need to buy a house!
So here, this Analysis makes predictions about future outcomes based on current or
past data. Forecasting is just an estimate; its accuracy depends on how much
detailed information you have and how deeply you dig into it.
Prescriptive Analysis
Prescriptive Analysis combines the insights from all the previous analyses to determine
which action to take on a current problem or decision. Most data-driven companies
utilize Prescriptive Analysis because Predictive and Descriptive Analysis alone are
not enough to improve data-driven performance. Based on current situations and
problems, they analyze the data and make decisions.
Data Collection
After requirement gathering, you will have a clear idea about what you have to
measure and what your findings should be. Now it is time to collect your data based
on the requirements. Once you collect your data, remember that the collected data
must be processed or organized for Analysis. As you collect data from various
sources, you must keep a log of the collection date and source of the data.
Data Cleaning
Whatever data is collected may not be useful or may be irrelevant to the aim of your
Analysis, hence it should be cleaned. The collected data may contain duplicate
records, white spaces or errors. The data should be cleaned and made error free.
This phase must be completed before Analysis, because the quality of data cleaning
determines how close the output of your Analysis will be to the expected outcome.
Data Analysis
Once the data is collected, cleaned, and processed, it is ready for Analysis. As you
manipulate data, you may find you have the exact information you need, or you
might need to collect more data. During this phase, you can use data analysis
tools and software which will help you to understand, interpret, and derive
conclusions based on the requirements.
Data Interpretation
After analyzing your data, it is finally time to interpret your results. You can choose
the way to express or communicate your data analysis: simply in words, or perhaps
in a table or chart. Then use the results of your data analysis process to decide your
best course of action.
Data Visualization
Data visualizations are very common in day to day life; they often appear in the
form of charts and graphs. In other words, data is shown graphically so that it is
easier for the human brain to understand and process. Data visualization is often
used to discover unknown facts and trends. By observing relationships and
comparing datasets, you can find meaningful information.
The three main sources of data are the Internet (namely, the World Wide Web),
databases, and local files (possibly previously downloaded by hand or using
additional software). Some of the local files may have been produced by other
Python programs and contain serialized or “pickled” data.
The formats of data in the artifacts may range widely. The most popular formats are:
Unstructured plain text in a natural language (such as English or Chinese)
Structured data, including:
o Tabular data in comma separated values (CSV) files
o Tabular data from databases
o Tagged data in HyperText Markup Language (HTML) or, in general, in
eXtensible Markup Language (XML)
o Tagged data in JavaScript Object Notation (JSON)
Depending on the original structure of the extracted data and the purpose and
nature of further processing, the data are represented using native Python data
structures (lists and dictionaries) or advanced data structures that support
specialized operations (numpy arrays and pandas data frames).
Report Structure
The project report is what we (data scientists) submit to the data sponsor (the
customer). The report typically includes the following:
Abstract (a brief and accessible description of the project)
Introduction
Results that were obtained (do not include intermediate and insignificant results)
Conclusion
Appendix
INTRODUCTION TO FILES
Usually, organizations want to permanently store information about employees,
inventory, sales, etc. to avoid the repetitive task of entering the same data.
Hence, data are stored permanently on secondary storage devices for reusability. We
store Python programs written in script mode with a .py extension. Each program
is stored on the secondary device as a file.
Likewise, the data entered and the output can be stored permanently in a file.
Text files contain only human-readable characters, while binary files store data as
raw bytes.
A file is a named location on a secondary storage media where data are permanently
stored for later access.
Types Of Files
Computers store every file as a collection of 0s and 1s i.e., in binary form.
Therefore, every file is basically just a series of bytes stored one after the other.
There are mainly two types of data files: text files and binary files. A text file
consists of human readable characters, which can be opened by any text editor.
On the other hand, binary files are made up of non-human readable characters
and symbols, which require specific programs to access their contents.
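The distinction can be observed from Python itself; a minimal sketch (the file name sample.txt is an assumption):

```python
# Write a small text file, then look at it in text mode and in binary mode.
with open("sample.txt", "w") as f:
    f.write("AB")

with open("sample.txt", "r") as f:
    print(f.read())        # text mode: the characters AB

with open("sample.txt", "rb") as f:
    data = f.read()
    print(data)            # binary mode: the raw bytes b'AB'
    print(list(data))      # the underlying byte values: [65, 66]
```

The same two bytes are shown as characters in text mode and as byte values in binary mode.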
Text file
A text file can be understood as a sequence of characters consisting of alphabets,
numbers and other special symbols. Files with extensions like .txt, .py, .csv,
etc. are some examples of text files. When we open a text file using a text editor
(e.g., Notepad), we see several lines of text. However, the file contents are not
stored that way internally. Rather, they are stored as a sequence of bytes
consisting of 0s and 1s. In ASCII, UNICODE or any other encoding scheme, the
value of each character of the text file is stored as bytes.
Binary Files
Binary files are also stored in terms of bytes (0s and 1s), but unlike text files, these
bytes do not represent the ASCII values of characters. Rather, they represent the actual
content such as images, audio, video, compressed versions of other files, executable
files, etc. These files are not human readable. Thus, trying to open a binary file using
a text editor will show some garbage values. We need specific software to read or
write the contents of a binary file.
Opening And Closing A Text File
In real world applications, computer programs deal with data coming from different
sources like databases, CSV files, HTML, XML, JSON, etc. We broadly access files
either to write data to them or to read data from them. Operations on files include
creating and opening a file, writing data in a file, traversing a file, reading data
from a file and so on. Python has the io module that contains different functions
for handling files.
Opening a file
To open a file in Python, we use the open() function. The syntax of open() is as
follows:
file_object= open(file_name, access_mode)
This function returns a file object, called a file handle, which is stored in the
variable file_object. We can use this variable to transfer data to and from the
file (read and write) by calling the functions defined in Python’s io module. If
the file does not exist, the above statement creates a new empty file and assigns it
the name we specify in the statement.
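A minimal sketch of open() in action (the file name notes.txt is an assumption):

```python
# Open a file in write mode; the file is created if it does not exist.
file_object = open("notes.txt", "w")
print(file_object.name)    # notes.txt
print(file_object.mode)    # w
print(file_object.closed)  # False
file_object.close()
print(file_object.closed)  # True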
File Open Modes
<a+> or <+a>: Opens the file in append and read mode. If the file doesn’t exist,
then it will create a new file. The file object is positioned at the end of the file.
Closing a file
Once we are done with the read/write operations on a file, it is a good practice to close
the file. Python provides the close() method to do so. While closing a file, the system
frees the memory allocated to it. The syntax of close() is:
file_object.close()
Here, file_object is the object that was returned while opening the file.
When a file is opened using a with statement, we don’t have to close the file
explicitly using close(); Python will automatically close the file.
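This automatic closing happens when the file is opened with a with statement; a minimal sketch (file name demo.txt is an assumption):

```python
# with opens the file and guarantees it is closed when the block ends.
with open("demo.txt", "w") as myobject:
    myobject.write("Hello everyone")

print(myobject.closed)     # True: no explicit close() was needed
```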
>>> myobject = open("myfile.txt", 'w')
>>> lines = ["Hello everyone\n", "Writing multiline strings\n", "This is the third line"]
>>> myobject.writelines(lines)
>>> myobject.close()
>>> myobject = open("myfile.txt", 'r')
>>> myobject.read(10)
'Hello ever'
>>> myobject.close()
>>> myobject = open("myfile.txt", 'r')
>>> print(myobject.read())
Hello everyone
Writing multiline strings
This is the third line
>>> myobject.close()
>>> myobject = open("myfile.txt", 'r')
>>> d = myobject.readlines()
>>> myobject.close()
>>> for line in d:
...     words = line.split()
...     print(words)
['Hello', 'everyone']
['Writing', 'multiline', 'strings']
['This', 'is', 'the', 'third', 'line']
>>> for line in d:
...     words = line.splitlines()
...     print(words)
['Hello everyone']
['Writing multiline strings']
['This is the third line']
Let us now write a program that accepts a string from the user and writes it to a
text file. Thereafter, the same program reads the text file and displays it on the
screen.
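The program described above can be sketched as follows (the file name story.txt is an assumption, and a fixed string stands in for the user's input() so the sketch is self-contained):

```python
def write_and_read(text, filename="story.txt"):
    # write the string to the text file ...
    with open(filename, "w") as f:
        f.write(text)
    # ... then read the same file back
    with open(filename, "r") as f:
        return f.read()

line = "Python makes file handling easy"   # stands in for input()
print(write_and_read(line))                # displays the file contents
```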
SETTING OFFSETS IN A FILE (Navigation in a file)
The functions that we have learnt till now are used to access the data sequentially
from a file. But if we want to access data in a random fashion, then Python gives
us seek() and tell() functions to do so.
The tell() method
This function returns an integer that specifies the current position of the file
object in the file. The position so specified is the byte position from the beginning
of the file till the current position of the file object. The syntax of using tell() is:
file_object.tell()
The seek() method
This method is used to position the file object at a particular byte position in a
file. The syntax of seek() is:
file_object.seek(offset)
The following program demonstrates tell() and seek(), assuming the file
practice.txt contains the single line "Learning to move the file object":
fileobject = open("practice.txt", "r")
str = fileobject.read()
print(str)
print("Initially, the position of the file object is:", fileobject.tell())
fileobject.seek(10)
print("We are moving to 10th byte position from the beginning of file")
print("The position of the file object is at", fileobject.tell())
str = fileobject.read()
print(str)
fileobject.close()
Output of Program :
Learning to move the file object
Initially, the position of the file object is: 33
We are moving to 10th byte position from the beginning of file
The position of the file object is at 10
o move the file object
The following program accepts data from the user and writes it to a text file (a
termination condition has been added so that the loop ends):
fileobject = open("practice.txt", "w")
while True:
    data = input("Enter data to save in the text file: ")
    fileobject.write(data + "\n")
    ans = input("Do you wish to enter more data? (y/n): ")
    if ans == 'n':
        break
fileobject.close()
fileobject = open("practice.txt", "r")
str = fileobject.readline()
while str:
    print(str)
    str = fileobject.readline()
fileobject.close()
In the program, readline() is used in the while loop to read the data line by line
from the text file. The lines are displayed using print(). When the end of file is
reached, readline() returns an empty string. Finally, the file is closed using
close().
Till now, we have been creating separate programs for writing data to a file
and for reading the file. Now let us create one single program to read and write data
using a single file object. Since both operations have to be performed using a single
file object, the file will be opened in w+ mode.
fileobject = open("report.txt", "w+")
fileobject.write("Hello everyone\n")   # write some data first
fileobject.seek(0)                     # move back to the beginning before reading
str = fileobject.read()
print(str)
fileobject.close()
In the program, the file is read until the end of file is reached, and the data
written earlier is displayed.
THE PICKLE MODULE
We know that Python considers everything as an object. So, all data types
including list, tuple, dictionary, etc. are also considered as objects.
During execution of a program, we may need to store the current state of
variables so that we can retrieve them later in the same state. Suppose you
are playing a video game, and after some time, you want to close it. The
program should be able to store the current state of the game, including the
current level/stage, your score, etc. as a Python object. Likewise, you may
like to store a Python dictionary as an object, to be able to retrieve it later. To
save any object structure along with data, Python provides a module called
pickle. The pickle module is used for serializing and de-serializing any
Python object structure. (Pickling is a method of preserving food items by
placing them in some solution, which increases their shelf life; in other
words, it is a method to store food items for later consumption.)
Serialization is the process of transforming data or an object in
memory (RAM) to a stream of bytes called a byte stream. These byte
streams in a binary file can then be stored on a disk or in a database, or sent
through a network. The serialization process is also called pickling.
De-serialization or unpickling is the inverse of the pickling process, where a
byte stream is converted back to a Python object.
The pickle module deals with binary files. Here, data are not written but
dumped and similarly, data are not read but loaded. The Pickle Module
must be imported to load and dump data. The pickle module provides two
methods - dump() and load() to work with binary files for pickling and
unpickling, respectively.
The syntax of dump() is:
pickle.dump(data_object, file_object)
where data_object is the object that has to be dumped to the file with the file
handle named file_object. For example, the program below writes the record of a
student (roll_no, name, gender and marks) in the binary file named
mybinary.dat using dump(). We need to close the file after pickling.
import pickle
listobj = [1, 'Arnav', 'M', 75]        # roll_no, name, gender, marks
fileobject = open("mybinary.dat", "wb")
pickle.dump(listobj, fileobject)
fileobject.close()

# load() the pickled object back from the binary file
fileobject = open("mybinary.dat", "rb")
objectvar = pickle.load(fileobject)
fileobject.close()
print(objectvar)
Output of Program
[1, 'Arnav', 'M', 75]
Program 2-8 To perform basic operations on a binary file using pickle module
import pickle

# append employee records (number, name, basic salary, allowances) to the file
bfile = open("empfile.dat", "ab")
recno = 1
ans = 'y'
while ans.lower() == 'y':
    print("RECORD No.", recno)
    eno = int(input("\tEmployee number : "))
    ename = input("\tEmployee Name : ")
    ebasic = int(input("\tBasic Salary : "))
    allow = int(input("\tAllowances : "))
    pickle.dump([eno, ename, ebasic, allow], bfile)
    ans = input("Do you wish to enter more records (y/n)? ")
    recno = recno + 1
    if ans.lower() == 'n':
        print("Record entry OVER")
        print()
        break
bfile.close()

# read the records back from the binary file
readrec = 1
bfile = open("empfile.dat", "rb")
try:
    while True:
        edata = pickle.load(bfile)
        print("Record Number :", readrec)
        print(edata)
        readrec = readrec + 1
except EOFError:
    pass
bfile.close()
Output of Program :
RECORD No. 1
	Employee number : 11
	Employee Name : D N Ravi
	Basic Salary : 32600
	Allowances : 4400
Do you wish to enter more records (y/n)? y
RECORD No. 2
	Employee number : 12
	Employee Name : Farida Ahmed
	Basic Salary : 38250
	Allowances : 5300
Do you wish to enter more records (y/n)? n
Record entry OVER

Record Number : 1
[11, 'D N Ravi', 32600, 4400]
Record Number : 2
[12, 'Farida Ahmed', 38250, 5300]
CSV (comma separated values) is a commonly used data format for spreadsheets.
The csv module in Python’s standard library provides classes and methods to
perform read/write operations on CSV files.
writer()
This function in the csv module returns a writer object that converts data into a
delimited string and writes it to a file object. The function needs a file object with
write permission as a parameter. Every row written to the file issues a newline
character; to prevent additional blank lines between rows, the newline parameter
of open() is set to ''.
writerow()
This function writes the items of an iterable (list, tuple or string), separating them
by the comma character.
writerows()
This function takes a list of iterables as parameter and writes each item as a comma
separated line of items in the file.
The following example shows the use of the writerow() function. First a file is
opened in ‘w’ mode. This file is used to obtain a writer object. Each tuple in a list
of tuples is then written to the file using the writerow() method.
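A sketch of the example described above (the list of tuples matches the file contents shown below):

```python
import csv

persons = [('Lata', 22, 45), ('Anil', 21, 56), ('John', 20, 60)]

# newline='' prevents blank lines between rows on some platforms
with open('persons.csv', 'w', newline='') as csvfile:
    obj = csv.writer(csvfile)
    for row in persons:
        obj.writerow(row)   # each tuple becomes one comma separated line
```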
This will create ‘persons.csv’ file in current directory. It will show following data.
Lata,22,45
Anil,21,56
John,20,60
Instead of iterating over the list to write each row individually, we can use the
writerows() method.
The reader() function returns a reader object which iterates over the lines of a CSV
file. Since the reader object is an iterator, the built-in next() function is also useful
to display the lines of the csv file.
The csv module also defines a dialect class. Dialect is set of standards used to implement
CSV protocol. The list of dialects available can be obtained by list_dialects() function.
>>> csv.list_dialects()
['excel', 'excel-tab', 'unix']
DictWriter()
This function returns a DictWriter object. It is similar to writer object, but the rows are
mapped to dictionary object. The function needs a file object with write permission and a
list of keys used in dictionary as fieldnames parameter. This is used to write first line in the
file as header.
writeheader()
This method writes list of keys in dictionary as a comma separated line as first line in the
file.
In the following example, a list of dictionary items is defined. Each item in the list
is a dictionary. Using the writerows() method, they are written to the file in comma
separated manner.
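A sketch of such a program, with the field names taken from the output shown below:

```python
import csv

persons = [
    {'name': 'Lata', 'age': 22, 'marks': 45},
    {'name': 'Anil', 'age': 21, 'marks': 56},
    {'name': 'John', 'age': 20, 'marks': 60},
]

with open('persons.csv', 'w', newline='') as csvfile:
    fields = ['name', 'age', 'marks']
    obj = csv.DictWriter(csvfile, fieldnames=fields)
    obj.writeheader()        # writes name,age,marks as the first line
    obj.writerows(persons)   # each dictionary becomes one comma separated row
```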
name,age,marks
Lata,22,45
Anil,21,56
John,20,60
DictReader()
This function returns a DictReader object from the underlying CSV file. As in case of
reader object, this one is also an iterator, using which contents of the file are retrieved.
The class provides fieldnames attribute, returning the dictionary keys used as header of file.
>>> obj.fieldnames
['name', 'age', 'marks']
Use loop over the DictReader object to fetch individual dictionary objects.
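A self-contained sketch of DictReader (the file contents mirror the earlier persons.csv example):

```python
import csv

# Prepare a small CSV file so that the example is self-contained.
with open('persons.csv', 'w', newline='') as f:
    f.write('name,age,marks\nLata,22,45\nAnil,21,56\nJohn,20,60\n')

rows = []
with open('persons.csv', 'r', newline='') as csvfile:
    obj = csv.DictReader(csvfile)
    print(obj.fieldnames)            # ['name', 'age', 'marks']
    for row in obj:                  # each row is mapped to a dictionary
        rows.append(row)
        print(row['name'], row['marks'])
```

Note that DictReader returns every field as a string; numeric fields must be converted explicitly.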
json module
In interactive programs, users’ responses need to be stored in Python data
structures like lists and dictionaries. But when the program is closed, that data will be
lost unless we store it in files. Python’s json module allows us to store data
structures in files and load them back as and when needed.
JSON (JavaScript Object Notation) was originally developed for JavaScript, but it is
used by many programming languages including Python.
Storing and loading the data
We have two methods, json.dump() and json.load(), in the json module to store and
load the data. The code below demonstrates the concept.
import json
filename = 'favourite_dish.json'
user_option = "\nEnter your favourite_dish: "
user_option += '\n Type quit if you have no item to type :'
msg=''
favourite_items=[]
while True:
    msg = input(user_option)
    if msg == 'quit':
        break
    favourite_items.append(msg)

with open(filename, 'w') as f:
    json.dump(favourite_items, f)
print('program terminated')
import json
filename = 'favourite_dish.json'
try:
    with open(filename) as f:
        favourite_items = json.load(f)
except FileNotFoundError:
    print('Sorry, favourite_dish.json is not found')
else:
    print(favourite_items)
Creating a directory
import os
os.mkdir("D:\\Data Science")
This will create a folder Data Science in the D drive.
Deleting a directory
In order to delete a directory, we will be using the rmdir() function, it
stands for remove directory.
import os
os.rmdir("D:\\Data Science")
Renaming a directory
In order to rename a folder, we use the rename function present in the os
module.
import os
os.mkdir("D:\\Data Science")
os.rename("D:\\Data Science","D:\\Data Science3")
Basic file manipulation
Now that you know how to work around with folders, let us look into file
manipulation.
Creating a file
file = open("Hello.txt", 'w')
A file named Hello.txt is created in the current working directory. Note that the
built-in open() function is used here; os.popen() opens a pipe to a command and is
not meant for creating files.
Example
Given below is the complete program to test out all the above-mentioned
scenarios:
import os
os.getcwd()
os.mkdir("D:\\Data Science")
os.rmdir("D:\\Data Science")
os.mkdir("D:\\Data Science")
os.rename("D:\\Data Science", "D:\\Data Science2")
file = open("Hello.txt", 'w')
file.write("Hello there! This is a Data Science article")
file.close()
The os.path module is a very extensively used module that is handy when
processing files from different places in the system. It is used for different
purposes such as merging, normalizing and retrieving path names in
Python. All of these functions accept either only bytes or only string
objects as their parameters. Its results are specific to the OS on which it is
being run.
os.path.basename
This function gives us the last part of the path, which may be a folder or a
file name. Please note the difference in how the path is written in Windows
and Linux, in terms of the backslash and the forward slash.
Example
import os
# In windows
fldr = os.path.basename("C:\\Users\\xyz\\Documents\\My Web Sites")
print(fldr)
file = os.path.basename("C:\\Users\\xyz\\Documents\\My Web Sites\\intro.html")
print(file)
# In Linux
fldr = os.path.basename("/home/xyz/Documents/MyWebSites")
print(fldr)
file = os.path.basename("/home/xyz/Documents/MyWebSites/music.txt")
print(file)
Running the above code gives us the following result −
Output
My Web Sites
intro.html
MyWebSites
music.txt
os.path.dirname
This function gives us the directory name where the folder or file is
located.
Example
import os
# In windows
DIR = os.path.dirname("C:\\Users\\xyz\\Documents\\My Web Sites")
print(DIR)
# In Linux
DIR = os.path.dirname("/Documents/MyWebSites")
print(DIR)
Running the above code gives us the following result −
Output
C:\Users\xyz\Documents
/Documents
os.path.isfile
Sometimes we may need to check whether a given path represents a folder or a
file. os.path.isfile() returns True if the path is an existing file, and False
otherwise (including when the path is a folder).
Example
import os
# In windows
IS_FILE = os.path.isfile("C:\\Users\\xyz\\Documents\\My Web Sites")
print(IS_FILE)
IS_FILE = os.path.isfile("C:\\Users\\xyz\\Documents\\My Web Sites\\intro.html")
print(IS_FILE)
# In Linux
IS_FILE = os.path.isfile("/home/xyz/Documents/MyWebSites")
print(IS_FILE)
IS_FILE = os.path.isfile("/home/xyz/Documents/MyWebSites/intro.html")
print(IS_FILE)
Running the above code gives us the following result −
Output
False
True
False
True
What is XML?
XML is a portable, open source language that allows programmers to develop
applications that can be read by other applications, regardless of operating system
and/or development language.
The Python standard library provides a minimal but useful set of interfaces
to work with XML.
The two most basic and broadly used APIs to XML data are the SAX and
DOM interfaces.
Simple API for XML (SAX) − Here, you register callbacks for
events of interest and then let the parser proceed through the
document. This is useful when your documents are large or you
have memory limitations: the parser processes the file as it reads
it from disk, and the entire file is never stored in memory.
Document Object Model (DOM) API − This is a World Wide Web
Consortium recommendation wherein the entire file is read into
memory and stored in a hierarchical (tree-based) form to represent
all the features of an XML document.
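As a concrete illustration of the DOM approach, here is a minimal sketch using the standard library's xml.dom.minidom (the XML content is an assumption):

```python
from xml.dom import minidom

# A small XML document, parsed entirely into memory (DOM style).
doc = """<collection>
<movie title="Enemy Behind"/>
<movie title="Transformers"/>
</collection>"""

dom = minidom.parseString(doc)
movies = dom.getElementsByTagName("movie")   # all <movie> elements in the tree
print(len(movies), "movie elements found")
for m in movies:
    print(m.getAttribute("title"))           # read the title attribute
```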
Python also provides the html.parser module for parsing HTML. A parser is
created by subclassing HTMLParser and overriding its handler methods:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
    def handle_endtag(self, tag):
        print("End tag:", tag)
    def handle_data(self, data):
        print("Data:", data)

parser = MyHTMLParser()
parser.feed("<html><body>Hello</body></html>")
OUTPUT:
Start tag: html
Start tag: body
Data: Hello
End tag: body
End tag: html
Syntax and semantic analysis are two main techniques used with natural
language processing.
Syntax is the arrangement of words in a sentence to make grammatical
sense. NLP uses syntax to assess meaning from a language based on
grammatical rules. Syntax techniques include tokenization, which breaks
text into units such as sentences or words.
Example
import nltk
nltk.download('punkt')
text = "Backgammon is one of the oldest known board games. Its history
can be traced back nearly 5,000 years to archeological discoveries in the
Middle East. It is a two player game where each player has fifteen
checkers which move between twenty-four points according to the roll of
two dice."
sentences = nltk.sent_tokenize(text)
print(sentences)
Regex Functions
The following regex functions are used in Python:
findall: Returns a list containing all matches.
search: Returns a Match object if there is a match anywhere in the string.
split: Returns a list where the string has been split at each match.
sub: Replaces one or many matches with a string.
Meta-Characters
A metacharacter is a character with a special meaning.
Special Sequences
Special sequences are sequences containing \ followed by one of a set of
characters, each carrying a special meaning.
Sets
A set is a group of characters given inside a pair of square brackets. It
represents a special meaning.
Example
import re
str = "How are you. How is everything"
matches = re.findall("How", str)
print(matches)
Output:
['How', 'How']
Example
import re
str = "How are you. How is everything"
matches = re.search("How", str)
print(type(matches))
print(matches) #matches is the search object
Output:
<class '_sre.SRE_Match'>
<_sre.SRE_Match object; span=(0, 3), match='How'>
The Match object methods
There are the following methods associated with the Match object.
span(): It returns a tuple containing the starting and ending positions of the
match.
string: An attribute that returns the string passed into the function.
group(): It returns the part of the string where the match was found.
Example
import re
str = "How are you. How is everything"
matches = re.search("How", str)
print(matches.span())
print(matches.group())
print(matches.string)
Output:
(0, 3)
How
How are you. How is everything
UNIT – III
What is SQL?
SQL stands for Structured Query Language, a computer language for
storing, manipulating and retrieving data stored in a relational database.
SQL is the standard language for relational database systems. All the
Relational Database Management Systems (RDBMS) like MySQL, MS
Access, Oracle, Sybase, Informix, Postgres and SQL Server use SQL as
their standard database language.
Why SQL?
SQL is widely popular because it provides a single, standard language for
defining, manipulating, controlling and retrieving data in relational database
systems.
SQL Commands
The standard SQL commands to interact with relational databases are CREATE, SELECT,
INSERT, UPDATE, DELETE and DROP. These commands can be classified into the
following groups based on their nature −
DDL - Data Definition Language
CREATE: Creates a new table, a view of a table, or other object in the database.
ALTER: Modifies an existing database object, such as a table.
DROP: Deletes an entire table, a view of a table or other objects in the database.
DML - Data Manipulation Language
SELECT: Retrieves certain records from one or more tables.
INSERT: Creates a record.
UPDATE: Modifies records.
DELETE: Deletes records.
DCL - Data Control Language
GRANT: Gives a privilege to a user.
REVOKE: Takes back privileges granted to a user.
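The command groups above can be exercised from Python using the built-in sqlite3 module; a minimal sketch (the table and column names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # temporary in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE student (roll INTEGER, name TEXT)")    # DDL: CREATE
cur.execute("INSERT INTO student VALUES (1, 'Lata')")            # DML: INSERT
cur.execute("UPDATE student SET name = 'Anil' WHERE roll = 1")   # DML: UPDATE
cur.execute("SELECT roll, name FROM student")                    # DML: SELECT
rows = cur.fetchall()
print(rows)                                                      # [(1, 'Anil')]

cur.execute("DELETE FROM student WHERE roll = 1")                # DML: DELETE
cur.execute("DROP TABLE student")                                # DDL: DROP
conn.close()
```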
MongoDB
MongoDB is an open-source document database that provides high
performance, high availability, and automatic scaling.
MongoDB is available for free under a General Public License, and it is also
available under a commercial license from the manufacturer. MongoDB’s design
goals include:
Scalability
Performance
High availability
Scaling from single server deployments to large, complex multi-site
architectures.
Key points of MongoDB
Develop Faster
Deploy Easier
Scale Bigger
A sample MongoDB document:
{
    FirstName: "John",
    Address: "Detroit"
}
Features of MongoDB
1. Ad hoc queries
In MongoDB, you can search by field and range query, and it also supports
regular expression searches.
2. Indexing
3. Replication
A master can perform reads and writes, and a slave copies data from the
master and can only be used for reads or backup (not writes).
4. Duplication of data
MongoDB can run over multiple servers. The data is duplicated to keep
the system up and running in case of hardware failure.
5. Load balancing
10. Stores files of any size easily without complicating your stack.
NumPy stands for ‘Numerical Python’. It is a package for data analysis and
scientific computing with Python. NumPy uses a multidimensional array
object, and has functions and tools for working with these arrays. The
powerful n-dimensional array in NumPy speeds up data processing.
NumPy can be easily interfaced with other Python packages and provides
tools for integrating with other programming languages like C, C++ etc.
Installing NumPy
NumPy can be installed using the Python package installer pip with the command:
pip install numpy
An array:
• Each element of the array is of the same data type, though the values
stored in them may be different.
Here, the 1st value in the array is 10 and has the index value [0]
associated with it;
the 2nd value in the array is 9 and has the index value [1] associated with
it, and so on.
The last value (in this case the 5th value) in this array has an index [4].
This is called zero based indexing. This is very similar to the indexing of
lists in Python. The idea of arrays is so important that almost all
programming languages support it in one form or another.
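The zero based indexing described above can be checked directly in Python:

```python
# Zero based indexing with the values from the text above.
values = [10, 9, 8, 7, 6]
print(values[0])   # 10: the 1st value has index 0
print(values[1])   # 9:  the 2nd value has index 1
print(values[4])   # 6:  the last (5th) value has index 4
```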
NumPy Array
NumPy arrays are used to store lists of numerical data, vectors and
matrices. The NumPy library has a large set of routines (built-in functions)
for creating, manipulating, and transforming NumPy arrays. Python
language also has an array data structure, but it is not as versatile,
efficient and useful as the NumPy array. The NumPy array is officially
called ndarray but is commonly known as array.
Arrays can be created from lists, nested lists, or tuples using np.array():
import numpy as np
print(np.array([1, 2, 3, 4, 5, 6]))        # from a list
print(np.array([[1, 2, 3], [4, 5, 6]]))    # from a nested list (2-D array)
print(np.array((1, 2, 3, 4, 5, 6)))        # from a tuple
[1 2 3 4 5 6]
[[1 2 3]
[4 5 6]]
[1 2 3 4 5 6]
arange() method
The arange() method returns evenly spaced values within a given range.
import numpy as np
arr = np.arange(5, 50)
print(arr)
The integers from 5 to 49 will be displayed.
zeros() method
The zeros() method creates an array of the given shape filled with zeros.
import numpy as np
arr = np.zeros((4,4))
print(arr)
[[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]
[0. 0. 0. 0.]]
linspace() method
The linspace() method returns the specified number of evenly spaced elements
within the given range.
import numpy as np
arr = np.linspace(0,100,11)
arr
array([ 0., 10., 20., 30., 40., 50., 60., 70., 80., 90., 100.])
asarray() method
The asarray() method returns a numpy array when given a list of values.
import numpy as np
x=[1,2,3,4,5]
a=np.asarray(x)
print(a)
[1 2 3 4 5]
rand() method
It creates an array of the given shape with random samples from a
uniform distribution over [0, 1).
import numpy as np
x = np.random.rand(2)
y = np.random.rand(5,5)
print(f'x = {x}')
print(f'y = {y}')
x = [0.85403266 0.42939896]
For y, a 5x5 array of random numbers is displayed.
randn() method
Returns a sample (or samples) from the "standard normal" distribution,
unlike rand, which samples from a uniform distribution.
import numpy as np
x = np.random.randn(2)
y = np.random.randn(5,5)
print(f'x = {x}')
print(f'y = {y}')
x = [-0.930095 -0.12667659]
Here random numbers are drawn from the standard normal distribution.
randint() method
Return random integers from `low` (inclusive) to `high` (exclusive)
import numpy as np
x = np.random.randint(1,100) # returns one value between 1 and
100
y = np.random.randint(1,100,10)# returns 10 values between 1
and 100
print(f'x = {x}')
print(f'y = {y}')
x=4
y = [ 4 96 12 45 41 2 66 53 32 67]
reshape() method
This method returns a numpy array as per the given dimensions. The
product of dimensions must be equal to number of elements in the array.
import numpy as np
arr=np.zeros(12)
arr3d=arr.reshape((2,2,3))
print(arr3d)
[[[0. 0. 0.]
[0. 0. 0.]]
[[0. 0. 0.]
[0. 0. 0.]]]
Indexing
Individual elements of an array can be accessed by their index, as with lists:
import numpy as np
arr=np.arange(2,20)
print(arr)
element=arr[8]
print(f'element 8 is {element}')
[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
element 8 is 10
Slicing
Slicing is process of extracting particular set of elements from the array.
The slicing is similar to list slicing. The below coding samples
demonstrate the slicing.
import numpy as np
arr=np.arange(20)
print(arr)
arr_slice=slice(1,10,2) # from 1 to 10 alternate elements
print(f'slice = {arr_slice}')
print(f'elements in the slice : {arr[arr_slice]}')
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
slice = slice(1, 10, 2)
elements in the slice : [1 3 5 7 9]
slice2 = slice(1, 15, 3)   # every third element from index 1 to 14
print(f'elements in the slice2 : {arr[slice2]}')
elements in the slice2 : [ 1 4 7 10 13]
import numpy as np
arr=np.arange(20)
print(arr[2:])
[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
import numpy as np
arr=np.arange(20)
print(arr[:15])
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
import numpy as np
a=np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a[0:2,0:2])
[[1 2]
[3 4]]
import numpy as np
a=np.array([[1,2,3],[3,4,5],[4,5,6]])
print(a.shape) #prints the shape of the array
print(a.ndim) #prints dimensions of the array
print(a.itemsize) # prints itemsize
(3, 3)
2
4
dtype attribute
You can also grab the data type of the objects in the array using the dtype
attribute.
import numpy as np
arr = np.arange(25)
arr.dtype
dtype('int32')
import numpy as np
a= np.array([10,11,12,13,14,15])
print(f'minimum = {a.min()}')
print(f'maximum = {a.max()}')
minimum = 10
maximum = 15
import numpy as np
a= np.array([10,11,12,13,14,15])
print(f'sum of the elements = {a.sum()}')
a1 = np.array([(10,11,12),(13,14,15)])
print(f'sum of the elements column wise= {a1.sum(axis=0)}')
print(f'sum of the elements row wise = {a1.sum(axis=1)}')
import numpy as np
a1 = np.array([(10,11,12),(13,14,15)])
print(f'sqrt of the elements: \n {np.sqrt(a1)}')
print(f'standard deviation of the elements = {np.std(a1)}')
ravel() method
This method converts a two-dimensional or multidimensional numpy array
into a one-dimensional array. Practically the concept is as
follows.
import numpy as np
a1 = np.array([(10,11,12),(13,14,15)])
a1 = a1.ravel()
print(f'elements of a1 array : \n {a1}')
print(f'dimension of a1 is {a1.ndim}')
elements of a1 array :
[10 11 12 13 14 15]
dimension of a1 is 1
transpose() method
This method returns the matrix transpose of a two-dimensional numpy
array. The concept is shown below practically.
import numpy as np
x = np.arange(9).reshape((3,3))
print(f'input for transpose is : \n {x}')
x=np.transpose(x)
print(f' array after transpose is \n{x}')
input for transpose is :
 [[0 1 2]
 [3 4 5]
 [6 7 8]]
 array after transpose is
[[0 3 6]
 [1 4 7]
 [2 5 8]]
eye() method
This method allows us to create an identity matrix.
import numpy as np
arr = np.eye(4)
print(arr)
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
import numpy as np
x= np.array([(1,2,3),(4,5,6)])
y= np.array([(2,3,4),(5,6,7)])
print('array x is :\n')
print(x)
print('\narray y is :\n')
print(y)
print('x+y: \n')
print(x+y)
print('x-y: \n')
print(x-y)
print('x*y: \n')
print(x*y)
print('x/y: \n')
print(x/y)
array x is :
[[1 2 3]
[4 5 6]]
array y is :
[[2 3 4]
[5 6 7]]
x+y:
[[ 3 5 7]
[ 9 11 13]]
x-y:
[[-1 -1 -1]
[-1 -1 -1]]
x*y:
[[ 2 6 12]
[20 30 42]]
x/y:
[[0.5        0.66666667 0.75      ]
 [0.8        0.83333333 0.85714286]]
Broadcasting
Broadcasting allows us to perform an operation on arrays even though
their shapes are not the same. The below example demonstrates broadcasting on
one-dimensional and multidimensional arrays.
import numpy as np
a = np.array([1,2,3,4,5,6,7,8,9,])
b = 10
print('a+b is \n')
print(a+b) # broadcasting is done with single element
x= np.array([(1,2,3),(4,5,6),(7,8,9)])
y= np.array([(10,11,12)])
print('\nx+y is \n')
print(x+y) # single row of y is added to every row of x (broadcasting)
a+b is
[11 12 13 14 15 16 17 18 19]
x+y is
[[11 13 15]
[14 16 18]
[17 19 21]]
import numpy as np
x= np.array([(1,2,3),(4,5,6)])
y= np.array([(7,8,9),(10,11,12)])
print(f'array x is : \n {x}')
print(f'array y is : \n {y}')
print('vertical stack of x and y\n')
print(np.vstack((x,y)))
print('horizontal stack of x and y\n')
print(np.hstack((x,y)))
array x is :
[[1 2 3]
[4 5 6]]
array y is :
[[ 7 8 9]
[10 11 12]]
vertical stack of x and y
[[ 1 2 3]
[ 4 5 6]
[ 7 8 9]
[10 11 12]]
horizontal stack of x and y
[[ 1 2 3 7 8 9]
[ 4 5 6 10 11 12]]
import numpy as np
arr_2d = np.array(([5,10,15],[20,25,30],[35,40,45]))
#Printing array
print(f'array arr_2d is : \n {arr_2d}')
print(f'\nrow at index 1 is :\n {arr_2d[1]}')
# Getting individual element value
print(f'\nelement at 1st row 0th column :\n {arr_2d[1][0]}')
#Getting individual element using comma notation in index
print(f'\nelement at 1st row 1st column :\n {arr_2d[1,1]}')
# 2D array slicing ;
print(f'\nShape (2,2) from top right corner :\n {arr_2d[:2,1:]}')
array arr_2d is :
[[ 5 10 15]
[20 25 30]
[35 40 45]]
row at index 1 is :
 [20 25 30]

element at 1st row 0th column :
 20

element at 1st row 1st column :
 25

Shape (2,2) from top right corner :
 [[10 15]
 [25 30]]
Fancy Indexing
Fancy indexing allows you to select entire rows or columns out of order.
The concept is practically shown below
import numpy as np
#Set up matrix
arr2d = np.zeros((10,10))
#Length of array
print(f'\nlength of the array is : {arr2d.shape[1]}')
#Set up array
arr_length = arr2d.shape[1]
for i in range(arr_length):
arr2d[i] = i
print(f'\nthe 2d array is : \n {arr2d}')
the 2d array is :
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
[3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
[4. 4. 4. 4. 4. 4. 4. 4. 4. 4.]
[5. 5. 5. 5. 5. 5. 5. 5. 5. 5.]
[6. 6. 6. 6. 6. 6. 6. 6. 6. 6.]
[7. 7. 7. 7. 7. 7. 7. 7. 7. 7.]
[8. 8. 8. 8. 8. 8. 8. 8. 8. 8.]
[9. 9. 9. 9. 9. 9. 9. 9. 9. 9.]]
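With the matrix built above, fancy indexing itself is just a matter of passing a list of row indices; a minimal sketch (the particular row selections below are arbitrary choices, not from the original notes):

```python
import numpy as np

# build the same 10x10 matrix where row i holds the value i
arr2d = np.zeros((10, 10))
for i in range(arr2d.shape[1]):
    arr2d[i] = i

# fancy indexing: pass a list of row indices
print(arr2d[[2, 4, 6, 8]])   # rows 2, 4, 6 and 8
print(arr2d[[6, 4, 2, 7]])   # rows in any order, out of sequence
```

The result is a new array whose rows appear in exactly the order the indices were listed.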
import numpy as np
arr = np.arange(1,11)
print(f'arr = {arr}')
print(f'\n compare array values with some integer\n {arr> 3}')
bool_arr = arr>4
print(f'\n arr values > 4 : {arr[bool_arr]}')
print(f'\n arr values > 5 : {arr[arr>5]}')
arr = [ 1  2  3  4  5  6  7  8  9 10]

 compare array values with some integer
 [False False False  True  True  True  True  True  True  True]

 arr values > 4 : [ 5  6  7  8  9 10]

 arr values > 5 : [ 6  7  8  9 10]
UNIT - IV
Creating a Series
We can create a series from a list, numpy array or dictionary. The process
is practically shown below.
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_list = [10,20,30]
#creating a series from a list
list_series1 = pd.Series(data=my_list)
list_series2 = pd.Series(data=my_list,index=labels)
list_series3 = pd.Series(my_list,labels)
print(f'\nlist_series1:\n{list_series1}')
print(f'\nlist_series2:\n{list_series2}')
print(f'\nlist_series3:\n{list_series3}')
list_series1:
0 10
1 20
2 30
dtype: int64
list_series2:
a 10
b 20
c 30
dtype: int64
list_series3:
a 10
b 20
c 30
dtype: int64
import numpy as np
import pandas as pd
labels = ['a','b','c']
arr = np.array([10,20,30])
arr_series1=pd.Series(arr)
arr_series2=pd.Series(arr,labels)
print(f'\narr_series1:\n{arr_series1}')
print(f'\narr_series2:\n{arr_series2}')
arr_series1:
0 10
1 20
2 30
dtype: int32
arr_series2:
a 10
b 20
c 30
dtype: int32
#creating a series from a dictionary
d = {'a':10,'b':20,'c':30}
dict_series = pd.Series(d)
print(f'\ndict_series:\n{dict_series}')
dict_series:
a    10
b    20
c    30
dtype: int64
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101) # a starting point to generate random numbers
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
print(df)
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
Column selection
We can select a column by its name as shown below
print(df['W'])
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
print(df[['W','Z']])
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509
We can also select a column from a data frame using SQL-style dot notation
(for example, df.W). But it is not recommended, because column names can
clash with DataFrame methods and attributes.
Note that the columns we are extracting from the data frame are data
series. We can check this by type() function as shown below
type(df['W'])
pandas.core.series.Series
Adding a new column
We can create a new column from the existing columns, for example
df['NEW'] = df['W'] + df['Y']. The data frame then has the columns
W X Y Z NEW.
Removing Columns
Removing columns is possible with the drop() method of pandas. The
changes are reflected only if the inplace parameter is True. The
concept is practically shown below
df.drop('NEW',axis=1)
W X Y Z
But the changes are not reflected in the data frame. We can see it by
printing the data frame
W X Y Z NEW
df.drop('NEW',axis=1,inplace=True)
df
W X Y Z
Removing rows
We can drop rows similarly to columns; here we use axis = 0. The
changes are reflected only if inplace = True.
df.drop('E',axis=0)
W X Y Z
Selecting rows
We can select rows from a data frame using labels (loc) or positions
(iloc). We can also select a subset of rows and columns. The concept is
practically shown below.
#select row with label
df.loc['A']
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
#select row with position
df.iloc[2]
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
#selecting a particular cell
df.loc['B','Y']
-0.84807698340363147
#selecting a particular part from the dataframe
df.loc[['A','B'],['W','Y']]
W Y
A 2.706850 0.907969
B 0.651118 -0.848077
Conditional Selection
Selecting values from data frame is possible based on conditions. This is
similar to the selection in numpy arrays. The concept is practically shown
below
print(df)
W X Y Z
print(df>0)
W X Y Z
df[df>0]
W X Y Z
df[df['W']>0]
W X Y Z
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757
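Two conditions can also be combined with & (and) or | (or); Python's plain `and`/`or` keywords do not work on Series. A minimal sketch, rebuilding the same seeded frame used above:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), index='A B C D E'.split(),
                  columns='W X Y Z'.split())

# rows where W > 0 AND Y > 1 -- note the & operator and the parentheses
print(df[(df['W'] > 0) & (df['Y'] > 1)])

# rows where W > 0 OR Y > 1 -- the | operator
print(df[(df['W'] > 0) | (df['Y'] > 1)])
```

Each condition must be wrapped in parentheses, since & and | bind more tightly than the comparison operators.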
Resetting index
We can reset the index of a data frame to the default numeric index
using the reset_index() method; the old index is kept as a new column
named index. We can also promote an existing column (for example a
States column) to be the new index using the set_index() method. As with
drop(), the changes are reflected only if inplace = True.
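A minimal sketch of both methods on the seeded frame used above (the state labels are illustrative, not part of the original notes):

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), index='A B C D E'.split(),
                  columns='W X Y Z'.split())

# reset_index() moves the old index into a column named 'index'
print(df.reset_index().columns.tolist())   # ['index', 'W', 'X', 'Y', 'Z']

# set_index() promotes an existing column to be the new index
df['States'] = 'CA NY WY OR CO'.split()
print(df.set_index('States').index.name)   # States
```

Neither call modifies df itself unless inplace=True is passed.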
Handling Missing Data
Pandas provides the dropna() and fillna() methods to handle missing
data. The concept is practically shown below.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]})
df
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
#drop the rows containing missing values
df.dropna()
     A    B  C
0  1.0  5.0  1
#drop the columns containing missing values
df.dropna(axis=1)
   C
0  1
1  2
2  3
#keep the rows having at least 2 non-missing values
df.dropna(thresh=2)
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
#fill the missing values with 0
df.fillna(value=0)
     A    B  C
0  1.0  5.0  1
1  2.0  0.0  2
2  0.0  0.0  3
#fill the missing values in column A with the mean of that column
df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
import pandas as pd
# Create dataframe
data = {'course':['MPCS','MPCS','MSCS','MSCS','MSDS','MSDS'],
'student':
['Sameer','Shantan','Amit','Varsha','Govind','Vishal'],
'marks':[410,370,402,433,366,297]}
df = pd.DataFrame(data)
df
  course  student  marks
0   MPCS   Sameer    410
1   MPCS  Shantan    370
2   MSCS     Amit    402
3   MSCS   Varsha    433
4   MSDS   Govind    366
5   MSDS   Vishal    297
#group the rows by the course column
by_course = df.groupby('course')
by_course.mean()
        marks
course
MPCS    390.0
MSCS    417.5
MSDS    331.5
by_course.std()
            marks
course
MPCS    28.284271
MSCS    21.920310
MSDS    48.790368
by_course.min()
        student  marks
course
MPCS     Sameer    370
MSCS       Amit    402
MSDS     Govind    297
by_course.max()
        student  marks
course
MPCS    Shantan    410
MSCS     Varsha    433
MSDS     Vishal    366
by_course.count()
student marks
course
MPCS 2 2
MSCS 2 2
MSDS 2 2
by_course.describe()
        marks
        count   mean        std    min     25%    50%     75%    max
course
MPCS      2.0  390.0  28.284271  370.0  380.00  390.0  400.00  410.0
MSCS      2.0  417.5  21.920310  402.0  409.75  417.5  425.25  433.0
MSDS      2.0  331.5  48.790368  297.0  314.25  331.5  348.75  366.0
by_course.describe().transpose()['MPCS']
marks count 2.000000
mean 390.000000
std 28.284271
min 370.000000
25% 380.000000
50% 390.000000
75% 400.000000
max 410.000000
Name: MPCS, dtype: float64
Concatenating the data frames
Concatenation basically joins DataFrames together. Here the condition is
that dimensions should match along the axis we are concatenating on. We
can use pd.concat and pass in a list of Data Frames to concatenate
together. The below code practically demonstrates the concept.
import pandas as pd
df1 = pd.DataFrame({'BSc': ['11', '12', '13', '14'],
'BCom': ['31', '32', '33', '34'],
'BBA': ['51', '52', '53', '54'],
'BA': ['71', '72', '73', '74']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'BSc': ['15', '16', '17', '18'],
'BCom': ['35', '36', '37', '38'],
'BBA': ['55', '56', '57', '58'],
'BA': ['75', '76', '77', '78']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'BSc': ['19', '20', '21', '22'],
'BCom': ['39', '40', '41', '42'],
'BBA': ['59', '60', '61', '62'],
'BA': ['79', '80', '81', '82']},
index=[8, 9, 10, 11])
print(f'\ndf1 is \n {df1}')
print(f'\ndf2 is \n {df2}')
print(f'\ndf3 is \n {df3}')
df1 is
BSc BCom BBA BA
0 11 31 51 71
1 12 32 52 72
2 13 33 53 73
3 14 34 54 74
df2 is
BSc BCom BBA BA
4 15 35 55 75
5 16 36 56 76
6 17 37 57 77
7 18 38 58 78
df3 is
BSc BCom BBA BA
8 19 39 59 79
9 20 40 60 80
10 21 41 61 81
11 22 42 62 82
#concatenated row wise
pd.concat([df1,df2,df3])
    BSc BCom BBA  BA
0 11 31 51 71
1 12 32 52 72
2 13 33 53 73
3 14 34 54 74
4 15 35 55 75
5 16 36 56 76
6 17 37 57 77
7 18 38 58 78
8 19 39 59 79
9 20 40 60 80
10 21 41 61 81
11 22 42 62 82
#concatenated column wise
pd.concat([df1,df2,df3],axis=1)
    BSc BCom  BBA   BA  BSc BCom  BBA   BA  BSc BCom  BBA   BA
0    11   31   51   71  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
...
10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   21   41   61   81
11  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   22   42   62   82
Merging
The merge() function allows us to merge DataFrames together using a similar logic as
merging SQL Tables together. The concept is practically demonstrated.
import pandas as pd
left = pd.DataFrame({'key': ['11', '12', '13', '14'],
                     'S1': ['89', '90', '91', '86'],
                     'S2': ['74', '86', '77', '70']})
right = pd.DataFrame({'key': ['11', '12', '13', '14'],
                      'S3': ['78', '79', '81', '85'],
                      'S4': ['88', '85', '87', '73']})
print(f'left is \n {left}')
print(f'right is \n {right}')
pd.merge(left,right,on='key')
left is
  key  S1  S2
0  11  89  74
1  12  90  86
2  13  91  77
3  14  86  70
right is
  key  S3  S4
0  11  78  88
1  12  79  85
2  13  81  87
3  14  85  73
  key  S1  S2  S3  S4
0  11  89  74  78  88
1  12  90  86  79  85
2  13  91  77  81  87
3  14  86  70  85  73
Joining
Joining is a convenient method for combining the columns of two
potentially differently-indexed DataFrames into a single result DataFrame.
left = pd.DataFrame({
'S1': ['89', '90', '91', '86'],
'S2': ['74', '86', '77', '70'],
'S5': ['70', '80', '79', '79']})
right = pd.DataFrame({
'S3': ['78', '79', '81', '85'],
'S4': ['88', '85', '87', '73']})
print(f'left is \n {left}')
print(f'right is \n {right}')
left.join(right)
left is
   S1  S2  S5
0  89  74  70
1  90  86  80
2  91  77  79
3  86  70  79
right is
   S3  S4
0  78  88
1  79  85
2  81  87
3  85  73
   S1  S2  S5  S3  S4
0  89  74  70  78  88
1  90  86  80  79  85
2  91  77  79  81  87
3  86  70  79  85  73
Matplotlib
Multiline plots
Here we use a numpy array and the plot() method to draw three equations in one figure.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0,5,0.01)
plt.plot(x,[x1 for x1 in x]) # y=x
plt.plot(x,[s**2 for s in x]) #y=x^2
plt.plot(x,[c**3 for c in x]) #y=x^3
plt.show()
Adding a grid
The grid() method allows us to add grid lines to the plot area. Its
first parameter is a Boolean that turns the grid on or off. The grid
appears in the background of the plot.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0,5,0.01)
plt.plot(x,[x1 for x1 in x]) # y=x
plt.plot(x,[s*2 for s in x]) #y=2x
plt.plot(x,[c*4 for c in x]) #y=4x
plt.grid(True)
plt.show()
Limiting the Axes
The limits of the axes can be set using the axis() method. Use of the
axis() method is practically demonstrated below.
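As a sketch, axis() takes a list [xmin, xmax, ymin, ymax]; the particular limits below are arbitrary choices:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.01)
plt.plot(x, x**2)            # y = x^2
plt.axis([0, 5, 0, 10])      # [xmin, xmax, ymin, ymax] -- clips the curve at y=10
plt.savefig('axis_demo.png')
```

xlim() and ylim() can also be used to set each axis range separately.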
Types of Plots
Matplotlib provides many types of plot formats for visualising information.
The different formats are listed below
Histogram
Bar Graph
Scatter Plot
Pie Chart
Histogram
Histograms display the distribution of a variable over a range of
values. hist() is the method used to plot a histogram. A histogram
groups values into non-overlapping intervals called bins. The default number of
bins is 10. The below plot shows the result of 100 bins.
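A minimal sketch of hist() with 100 bins (the normally distributed sample data is an arbitrary choice):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

data = np.random.randn(1000)                       # 1000 samples from a normal distribution
counts, bins, patches = plt.hist(data, bins=100)   # 100 bins instead of the default 10
plt.savefig('hist_demo.png')
```

hist() returns the per-bin counts and the bin edges; every sample falls into exactly one bin, so the counts sum to the sample size.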
Pie charts
Pie charts are used to compare multiple parts against the whole.
pie() is the method used to plot a pie chart.
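A minimal sketch of pie() (the category names and shares below are made up for illustration):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

shares = [45, 25, 20, 10]                     # parts of the whole
labels = ['MPCS', 'MSCS', 'MSDS', 'Others']   # hypothetical categories
plt.pie(shares, labels=labels, autopct='%1.1f%%')  # label each wedge with its percentage
plt.axis('equal')   # equal aspect ratio so the pie is drawn as a circle
plt.savefig('pie_demo.png')
```

The shares need not sum to 100; pie() normalizes them to fractions of the whole.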
Scatter Plot
Scatter plots show the relationship between two sets of data as a
collection of points. scatter() is the method used to plot a scatter
plot. The below example plots two data sets in different colors.
import numpy as np
import matplotlib.pyplot as plt
x = np.random.rand(1000)
y = np.random.rand(1000)
x1 = np.random.rand(100)
y1 = np.random.rand(100)
plt.scatter(x,y, color = 'r')
plt.scatter(x1,y1, color = 'g')
plt.show()