intro2Python_part2

The document covers Python for Data Science, focusing on data capturing, preparation, and manipulation using libraries like NumPy and Pandas. It explains basic file I/O operations, the differences between NumPy and Pandas, and provides examples of reading and writing files, as well as working with CSV files. Additionally, it details various NumPy functionalities including array creation, random number generation, and aggregation functions.
Python for Data Science (PDS)

Unit-02
Capturing, Preparing and Working with Data
 Outline
Looping

 Basic File IO in Python


 NumPy V/S Pandas (what to use?)
 NumPy
 Pandas
 www.kaggle.com
From Part 1: Why Python?
 A huge number of additional open-source libraries are available.
Some of them are listed below.
 NumPy for scientific computing
 pandas for data analysis
 SciPy for engineering applications, science, and mathematics
 scikit-learn for machine learning
 matplotlib for plotting charts and graphs
 BeautifulSoup for parsing HTML and XML
 Django for server-side web development
 And many more.

 Unit 02 – Overview of Python and Data Analysis 3


Basic IO operations in Python
 Before we can read or write a file, we have to open it using Python's built-in open() function.
syntax
fileobject = open(filename , accessmode)

 filename is the name of the file we want to open.
 accessmode determines the mode in which the file is opened (possible values are listed below).

Mode  Description
r     Read only (default)
rb    Read only in binary format
r+    Read and write both
rb+   Read and write both in binary format
w     Write only
wb    Write only in binary format
w+    Read and write both
wb+   Read and write both in binary format
a     Append; if the file does not exist, it is created for writing
ab    Append in binary format; if the file does not exist, it is created for writing
a+    Append; if the file does not exist, it is created for reading and writing
ab+   Read and write both in binary format (append)
Note: all w modes create the file if it does not exist.
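As a rough illustration of the mode table, the sketch below contrasts binary and text modes and shows that 'r+' writes without emptying the file. The temporary directory and the file name modes_demo.bin are assumptions made for this example only.

```python
import os
import tempfile

# Hypothetical file name inside a throwaway directory
path = os.path.join(tempfile.mkdtemp(), "modes_demo.bin")

# 'wb' creates the file and writes raw bytes
with open(path, "wb") as f:
    f.write(b"abc")

# 'rb' returns bytes, while the default 'r' returns str
with open(path, "rb") as f:
    raw = f.read()
with open(path, "r") as f:
    text = f.read()

# 'r+' allows reading and writing without emptying the file
with open(path, "r+") as f:
    f.write("X")        # overwrites the first character only
    f.seek(0)           # go back to the start before reading
    updated = f.read()

print(raw, text, updated)
```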
Example : Read file in Python
 read(size) reads at most size characters (bytes in binary mode) from the file; if size is omitted, it returns the whole file.
readfile.py
    f = open('demofile.txt')
    data = f.read()
    print(data)
demofile.txt
    Hello! Welcome to demofile.txt
    This file is for testing purposes.
    Good Luck!

 readlines() returns a list of the lines in the file.
readlines.py
    f = open('demofile.txt')
    lines = f.readlines()
    print(lines)
Output
    ['Hello! Welcome to demofile.txt\n', 'This file is for testing purposes.\n', 'Good Luck!']

 We can use a for loop to get each line separately.
readlinesfor.py
    f = open('demofile.txt')
    lines = f.readlines()
    for line in lines:
        print(line)
Output
    Hello! Welcome to demofile.txt

    This file is for testing purposes.

    Good Luck!
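The blank lines in the output appear because every line read from the file keeps its trailing newline and print() adds another one. A minimal sketch of one way around this is given below: it iterates over the file object directly and strips the newline. The file is recreated in a temporary directory (an assumption of this example) so that the sketch runs on its own.

```python
import os
import tempfile

# Recreate demofile.txt in a throwaway directory
path = os.path.join(tempfile.mkdtemp(), "demofile.txt")
with open(path, "w") as f:
    f.write("Hello! Welcome to demofile.txt\n"
            "This file is for testing purposes.\n"
            "Good Luck!")

f = open(path)
cleaned = []
for line in f:                          # a file object yields one line per iteration
    cleaned.append(line.rstrip("\n"))   # drop the newline kept from the file
f.close()

for line in cleaned:
    print(line)
```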
How to write path?
 We can pass a relative path as the argument to open(); alternatively, we can pass an absolute path.
 To specify an absolute path,
 On Windows, f = open('D:\\folder\\subfolder\\filename.txt')
 On macOS and Linux, f = open('/user/folder/subfolder/filename.txt')
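The backslashes in the Windows path are doubled because a backslash starts an escape sequence inside a Python string. The sketch below shows two alternatives, a raw string and os.path.join; it only builds path strings and does not open any file, and the folder names are the placeholders from the slide.

```python
import os

escaped = 'D:\\folder\\subfolder\\filename.txt'   # doubled backslashes
raw = r'D:\folder\subfolder\filename.txt'         # raw string, no doubling needed
print(escaped == raw)

# os.path.join builds a path with the separator of the current operating system
relative = os.path.join('folder', 'subfolder', 'filename.txt')
print(relative)
print(os.path.isabs(relative))                    # a relative path is not absolute
```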

 We are supposed to close the file with its close() method once we are done using it.
closefile.py
    f = open('demofile.txt')
    data = f.read()
    print(data)
    f.close()
 When we open a file using with, we do not need to close it; it is closed automatically when the with block ends.
fileusingwith.py
    with open('demofile.txt') as f:
        data = f.read()
        print(data)
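One way to check this behaviour is the closed attribute of the file object. The sketch below assumes a small file created in a temporary directory; the variable names are chosen for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demofile.txt")
with open(path, "w") as out:
    out.write("Hello")

# Manual style: the file stays open until close() is called
f = open(path)
data = f.read()
print(f.closed)      # False
f.close()
print(f.closed)      # True

# with style: the file is closed when the block ends
with open(path) as g:
    data2 = g.read()
print(g.closed)      # True
```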



Example : Write file in Python
 The write() method writes the specified data to the file.
readdemo.py
    with open('demofile.txt', 'a') as f:
        f.write('Hello world')
 If we open a file in 'w' mode, it overwrites the existing file, or creates a new file if the file does not exist.
 If we open a file in 'a' mode, it appends the data at the end of the existing file, or creates a new file if the file does not exist.
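The difference between the two modes can be seen by writing to the same file several times. This is a small sketch using a temporary file (an assumption of the example); note that write() does not add a newline on its own.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demofile.txt")

with open(path, "w") as f:      # file does not exist yet, so it is created
    f.write("first\n")
with open(path, "w") as f:      # 'w' discards the previous content
    f.write("second\n")
with open(path, "a") as f:      # 'a' keeps the content and writes at the end
    f.write("third\n")

with open(path) as f:
    content = f.read()
print(content)
```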



Reading CSV files without any library functions
 A comma-separated values (CSV) file is a delimited text file that uses commas to separate values.
 Each line of the file is a data record; each record consists of one or more fields, separated by commas.
 Example : Book1.csv
    studentname,enrollment,cpi
    abcd,123456,8.5
    bcde,456789,2.5
    cdef,321654,7.6
readlines.py
    with open('Book1.csv') as f:
        rows = f.readlines()
    isFirstLine = True
    for r in rows:
        if isFirstLine:
            # skip the header row
            isFirstLine = False
            continue
        cols = r.split(',')
        print('Student Name = ', cols[0], end=" ")
        print('\tEn. No. = ', cols[1], end=" ")
        print('\tCPI = \t', cols[2])
 We can use Microsoft Excel to access CSV files.
 In later sessions we will access CSV files using different libraries; we can also access CSV files without any library, but this is not recommended.
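One reason the manual approach is not recommended is that split(',') leaves the trailing newline in the last field and does not handle quoted commas. As a hedged preview of the library route, the sketch below uses the csv module from the Python standard library; the data is held in an in-memory string (an assumption of this example) instead of Book1.csv.

```python
import csv
import io

# Same content as Book1.csv, kept in memory so the sketch is self-contained
book1 = io.StringIO(
    "studentname,enrollment,cpi\n"
    "abcd,123456,8.5\n"
    "bcde,456789,2.5\n"
    "cdef,321654,7.6\n"
)

reader = csv.DictReader(book1)      # the first row supplies the field names
records = [row for row in reader]   # each record is a dict of strings

for row in records:
    print(row["studentname"], row["enrollment"], float(row["cpi"]))
```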



NumPy v/s Pandas
 pandas is built on top of NumPy; as a result, work done in pandas generally goes through NumPy as well.
 To obtain the benefits of pandas, we pay a performance penalty; some testers report pandas to be as much as 100 times slower than NumPy for a similar task.
 Modern computer hardware is often powerful enough to absorb this cost, but when speed of execution is essential NumPy is usually the better choice.
 We use pandas to make writing code easier and faster, and to reduce potential coding errors.
 pandas provides rich time-series functionality, data alignment, NA-friendly statistics, and methods such as groupby and merge; with NumPy we would have to implement all of these manually.
 So,
 if we want performance, we should use NumPy;
 if we want ease of coding, we should use pandas.
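To make the trade-off concrete, here is a small sketch of a group-wise mean done both ways. The names and CPI values are made up for this illustration, and the NumPy version is only one possible manual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one CPI value per record, two students
names = np.array(['a', 'b', 'a', 'b', 'a'])
cpi = np.array([8.0, 6.0, 9.0, 7.0, 7.0])

# pandas: grouping is a single method call
df = pd.DataFrame({'name': names, 'cpi': cpi})
pandas_means = df.groupby('name')['cpi'].mean()

# NumPy: the group labels have to be handled by hand
labels, inverse = np.unique(names, return_inverse=True)
sums = np.bincount(inverse, weights=cpi)    # sum of cpi per label
counts = np.bincount(inverse)               # number of records per label
numpy_means = sums / counts

print(pandas_means)
print(labels, numpy_means)
```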



1- NumPy
 NumPy (Numerical Python) is a Python library for working with arrays.
 Many scientific and data-analysis libraries in Python rely on NumPy as one of their main building blocks.
 NumPy provides functions for domains such as linear algebra and Fourier transforms.
 NumPy is fast because its core routines are implemented in C.
 Install :
 conda install numpy
 pip install numpy



NumPy Array
 The most important object defined in NumPy is an N-dimensional array type called ndarray.
 It describes a collection of items of the same type; items in the collection can be accessed using a zero-based index.
 An instance of the ndarray class can be constructed in many different ways; a basic ndarray can be created as below.
syntax
    import numpy as np
    a = np.array(list | tuple)
 Note: a set or dict is not converted element by element; np.array() wraps it as a single object.
numpyarray.py
    import numpy as np
    a = np.array(['Andalus','Insitute','Sanaa'])
    print(type(a))
    print(a)
Output
    <class 'numpy.ndarray'>
    ['Andalus' 'Insitute' 'Sanaa']
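Since all items of an ndarray share one type, NumPy picks a common dtype when the array is created. The following sketch, with values chosen only for illustration, shows this together with zero-based indexing and what happens when a set is passed.

```python
import numpy as np

from_list = np.array([1, 2, 3])
from_tuple = np.array((1.5, 2, 3))      # the integers are upcast to float
print(from_list.dtype, from_tuple.dtype)
print(from_list[0], from_list[-1])      # zero-based index; -1 is the last item

mixed = np.array([1, 'two', 3.0])       # every item becomes a string
print(mixed.dtype.kind)                 # 'U' stands for Unicode string

wrapped = np.array({1, 2, 3})           # a set is stored as one Python object
print(wrapped.shape, wrapped.dtype)
```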



NumPy Array (Cont.)
 arange(start,end,step) creates a NumPy array starting from start up to end (not included) with the specified step.
numpyarange.py
    import numpy as np
    b = np.arange(0,10,1)
    print(b)
Output
    [0 1 2 3 4 5 6 7 8 9]
 zeros(shape) returns a NumPy array of the given shape, filled with zeros.
numpyzeros.py
    import numpy as np
    c = np.zeros(3)
    print(c)
    c1 = np.zeros((3,3))  # a multi-dimensional shape must be given as a tuple
    print(c1)
Output
    [0. 0. 0.]
    [[0. 0. 0.]
     [0. 0. 0.]
     [0. 0. 0.]]
 ones(shape) returns a NumPy array of the given shape, filled with ones.
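The slide gives no listing for ones(); a minimal sketch that mirrors the zeros() example could look like this. The dtype argument in the last call is an addition of this example.

```python
import numpy as np

d = np.ones(3)                    # one-dimensional, floats by default
d1 = np.ones((2, 3))              # two-dimensional shape given as a tuple
d2 = np.ones((2, 3), dtype=int)   # same shape, but integer ones
print(d)
print(d1)
print(d2)
```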



NumPy Array (Cont.)
 eye(n) creates a 2-D NumPy array with ones on the diagonal and zeros elsewhere.
numpyeye.py
    import numpy as np
    b = np.eye(3)
    print(b)
Output
    [[1. 0. 0.]
     [0. 1. 0.]
     [0. 0. 1.]]
 linspace(start,stop,num) returns num evenly spaced numbers over the interval from start to stop (both included).
numpylinspace.py
    import numpy as np
    c = np.linspace(0,1,11)
    print(c)
Output
    [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
 Note: in the arange function we give start, stop and step, whereas in the linspace function we give start, stop and the number of elements we want.
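The note on arange versus linspace can be checked side by side. In this sketch the variable names are chosen for illustration; the endpoint argument is a standard linspace parameter that the slide does not mention.

```python
import numpy as np

by_step = np.arange(0, 1, 0.1)      # stop is excluded: 10 values, 0.0 to 0.9
by_count = np.linspace(0, 1, 11)    # stop is included: 11 values, 0.0 to 1.0
print(len(by_step), len(by_count))

# endpoint=False makes linspace exclude stop as well
no_end = np.linspace(0, 1, 10, endpoint=False)
print(np.allclose(by_step, no_end))
```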



Array Shape in NumPy
 We can get the shape of an ndarray using its shape attribute.
numpyarange.py
    import numpy as np
    b = np.zeros((3,3))
    print(b.shape)
Output
    (3, 3)
 We can also reshape the array using the reshape method of ndarray.
numpyarange.py
    import numpy as np
    re1 = np.random.randint(1,100,10)
    re2 = re1.reshape(5,2)
    print(re2)
Output (values are random and differ on every run)
    [[29 55]
     [44 50]
     [25 53]
     [59  6]
     [93  7]]
 Note: the number of elements in the original array must equal the product of the rows and columns of the new shape.
 Example : here the old one-dimensional array has 10 elements and the new shape is (5,2); 5 * 2 = 10, so it is a valid reshape.
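The rule can be tried out directly. The sketch below uses arange instead of random numbers so that the result is predictable; the use of -1 to let NumPy infer one dimension is an extra not covered on the slide.

```python
import numpy as np

a = np.arange(10)                 # 10 elements: 0 .. 9
valid = a.reshape(5, 2)           # 5 * 2 == 10, so this works
inferred = a.reshape(2, -1)       # -1 asks NumPy to infer the size (here 5)
print(valid.shape, inferred.shape)

failed = False
try:
    a.reshape(3, 4)               # 3 * 4 == 12, which does not match 10
except ValueError as err:
    failed = True
    print('invalid reshape:', err)
```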



NumPy Random
 rand(d0,d1,...,dn) creates an n-dimensional array of the given shape with random floats drawn from a uniform distribution over [0, 1); if we do not specify any parameter, it returns a single random float.
numpyrand.py
    import numpy as np
    r1 = np.random.rand()
    print(r1)
    r2 = np.random.rand(3,2)  # dimensions are separate arguments, not a tuple
    print(r2)
Output (values are random and differ on every run)
    0.23937253208490505
    [[0.58924723 0.09677878]
     [0.97945337 0.76537675]
     [0.73097381 0.51277276]]
 randint(low,high,num) creates a one-dimensional array of num random integers from low (included) up to high (not included).
numpyrandint.py
    import numpy as np
    r3 = np.random.randint(1,100,10)
    print(r3)
Output (values are random and differ on every run)
    [78 78 17 98 19 26 81 67 23 24]
 We can reshape the array into any compatible shape using the reshape method, which we learned in the previous slide.
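Because the numbers change on every run, results are hard to reproduce. A common remedy, sketched here as an addition to the slide, is to fix the seed; the seed value 42 is arbitrary. The second half uses NumPy's newer Generator interface, which the slides do not cover.

```python
import numpy as np

np.random.seed(42)                       # fix the seed of the legacy generator
first = np.random.randint(1, 100, 10)
np.random.seed(42)
second = np.random.randint(1, 100, 10)
print(np.array_equal(first, second))     # same seed, same numbers

# Newer Generator interface; integers() also excludes the upper bound
rng = np.random.default_rng(42)
r_int = rng.integers(1, 100, 10)
r_float = rng.random((3, 2))             # here the shape is passed as a tuple
print(r_int.shape, r_float.shape)
```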
Aggregations
 min() returns the minimum value of the ndarray. There are two ways to call it; both are shown below.
numpymin.py
    import numpy as np
    l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
    a = np.array(l)
    print('Min way1 =',a.min())
    print('Min way2 =',np.min(a))
Output
    Min way1 = 1
    Min way2 = 1
 max() returns the maximum value of the ndarray. There are two ways to call it; both are shown below.
numpymax.py
    import numpy as np
    l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
    a = np.array(l)
    print('Max way1 =',a.max())
    print('Max way2 =',np.max(a))
Output
    Max way1 = 11
    Max way2 = 11



Aggregations (Cont.)
 NumPy supports many aggregation functions such as min, max, argmin, argmax, sum, mean, std, etc.
numpymin.py
    import numpy as np
    l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
    a = np.array(l)
    print('Min =',a.min())
    print('ArgMin =',a.argmin())
    print('Max =',a.max())
    print('ArgMax =',a.argmax())
    print('Sum =',a.sum())
    print('Mean =',a.mean())
    print('Std =',a.std())
Output
    Min = 1
    ArgMin = 3
    Max = 11
    ArgMax = 8
    Sum = 122
    Mean = 5.304347826086956
    Std = 3.042235771223635
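Two details of these functions are easy to miss: argmin and argmax return the index of the first occurrence, and std computes the population standard deviation unless ddof is given. The sketch below checks both on the same data; the variable names are chosen for this example.

```python
import numpy as np

a = np.array([7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3])

# argmin/argmax give positions, so indexing with them returns min/max
print(a[a.argmin()] == a.min(), a[a.argmax()] == a.max())

# std() divides by N by default (population standard deviation)
manual_std = np.sqrt(np.mean((a - a.mean()) ** 2))
print(np.isclose(manual_std, a.std()))

# ddof=1 divides by N - 1 instead (sample standard deviation)
sample_std = a.std(ddof=1)
print(sample_std > a.std())
```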



Using axis argument with aggregate functions
 When we apply an aggregate function to a multidimensional ndarray, by default it aggregates over all elements, across every dimension (axis).
numpyaxis.py
    import numpy as np
    array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
    print('sum =',array2d.sum())
Output
    sum = 45
 If we want the sum of each row or each column, we can use the axis argument of the aggregate function.
numpyaxis.py
    import numpy as np
    array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
    print('sum (cols) =',array2d.sum(axis=0))  # vertical: one sum per column
    print('sum (rows) =',array2d.sum(axis=1))  # horizontal: one sum per row
Output
    sum (cols) = [12 15 18]
    sum (rows) = [ 6 15 24]
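The axis argument works the same way for the other aggregate functions. A short sketch on the same array follows; keepdims is a standard NumPy argument that is not mentioned on the slide.

```python
import numpy as np

array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])

col_max = array2d.max(axis=0)       # axis=0 collapses the rows: one value per column
row_mean = array2d.mean(axis=1)     # axis=1 collapses the columns: one value per row
print(col_max)                      # [7 8 9]
print(row_mean)                     # [2. 5. 8.]

# keepdims=True keeps the collapsed axis with length 1
row_sum = array2d.sum(axis=1, keepdims=True)
print(row_sum.shape)                # (3, 1)
```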



Single V/S Double bracket notations
 There are two ways to access an element of a multi-dimensional array; both are shown below.
numpybrackets.py
    import numpy as np
    arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
    print('double =',arr[2][1])  # double bracket notation
    print('single =',arr[2,1])   # single bracket notation
Output
    double = h
    single = h
 Both methods are valid and give exactly the same answer, but single bracket notation is recommended: double bracket notation first creates a temporary sub-array of the third row and then fetches the second column from it.
 Single bracket notation is also easier to read and write.
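The two notations agree for single elements, but they can differ once a slice is involved, which is another reason to prefer the single bracket form. A small sketch on the same array:

```python
import numpy as np

arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])

column = arr[:, 1]     # all rows, second column
row = arr[:][1]        # arr[:] is the whole array, so [1] picks the second row
print(column)          # ['b' 'e' 'h']
print(row)             # ['d' 'e' 'f']
```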



Slicing ndarray
 Slicing in Python means taking elements from one given index to another given index.
 As with a Python list, we can use the syntax array[start:end:step] to slice an ndarray.
 Default start is 0
 Default end is the length of the array
 Default step is 1
numpyslice1d.py
    import numpy as np
    arr = np.array(['a','b','c','d','e','f','g','h'])
    print(arr[2:5])
    print(arr[:5])
    print(arr[5:])
    print(arr[2:7:2])
    print(arr[::-1])
Output
    ['c' 'd' 'e']
    ['a' 'b' 'c' 'd' 'e']
    ['f' 'g' 'h']
    ['c' 'e' 'g']
    ['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']
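The syntax matches Python lists, but the behaviour differs in one respect that the slide does not state: a basic slice of an ndarray is a view on the same data, whereas slicing a list makes a copy. The sketch below illustrates this with small integer arrays chosen for the example.

```python
import numpy as np

arr = np.arange(8)             # [0 1 2 3 4 5 6 7]
part = arr[2:5]                # a view: shares memory with arr
part[0] = 99
print(arr)                     # arr[2] has changed as well

independent = arr[5:].copy()   # copy() gives separate memory
independent[0] = -1
print(arr[5])                  # still 5

lst = [0, 1, 2, 3]
lst_part = lst[1:3]            # list slicing copies
lst_part[0] = 99
print(lst)                     # unchanged
```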



Array Slicing Example
a =
         C-0  C-1  C-2  C-3  C-4
    R-0    1    2    3    4    5
    R-1    6    7    8    9   10
    R-2   11   12   13   14   15
    R-3   16   17   18   19   20
    R-4   21   22   23   24   25
 Example :
 a[2][3] =
 a[2,3] =
 a[2] =
 a[0:2] =
 a[0:2:2] =
 a[::-1] =
 a[1:3,1:3] =
 a[3:,:3] =
 a[:,::-1] =
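The expressions listed in this exercise can be evaluated with a short script. Building the 5 x 5 array with arange and reshape is an assumption of this sketch; the slide only shows the values 1 to 25 laid out row by row.

```python
import numpy as np

a = np.arange(1, 26).reshape(5, 5)   # rows R-0 .. R-4, columns C-0 .. C-4

print(a[2][3])       # 14
print(a[2, 3])       # 14
print(a[2])          # row R-2: [11 12 13 14 15]
print(a[0:2])        # rows R-0 and R-1
print(a[0:2:2])      # row R-0 only, still two-dimensional
print(a[::-1])       # rows in reverse order
print(a[1:3, 1:3])   # rows R-1..R-2, columns C-1..C-2: [[7 8], [12 13]]
print(a[3:, :3])     # rows R-3..R-4, columns C-0..C-2
print(a[:, ::-1])    # columns in reverse order
```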



NumPy Arithmetic Operations
numpyop.py
    import numpy as np
    arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
    arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])

    arradd1 = arr1 + 2     # addition of matrix with scalar
    arradd2 = arr1 + arr2  # addition of two matrices
    print('Addition Scalar =', arradd1)
    print('Addition Matrix =', arradd2)

    arrsub1 = arr1 - 2     # subtraction of scalar from matrix
    arrsub2 = arr1 - arr2  # subtraction of two matrices
    print('Subtraction Scalar =', arrsub1)
    print('Subtraction Matrix =', arrsub2)

    arrdiv1 = arr1 / 2     # division of matrix by scalar
    arrdiv2 = arr1 / arr2  # element-wise division of two matrices
    print('Division Scalar =', arrdiv1)
    print('Division Matrix =', arrdiv2)
Output
    Addition Scalar = [[3 4 5]
     [3 4 5]
     [3 4 5]]
    Addition Matrix = [[5 7 9]
     [5 7 9]
     [5 7 9]]
    Subtraction Scalar = [[-1  0  1]
     [-1  0  1]
     [-1  0  1]]
    Subtraction Matrix = [[-3 -3 -3]
     [-3 -3 -3]
     [-3 -3 -3]]
    Division Scalar = [[0.5 1.  1.5]
     [0.5 1.  1.5]
     [0.5 1.  1.5]]
    Division Matrix = [[0.25 0.4  0.5 ]
     [0.25 0.4  0.5 ]
     [0.25 0.4  0.5 ]]
NumPy Arithmetic Operations (Cont.)
numpyop.py
    import numpy as np
    arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
    arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])

    arrmul1 = arr1 * 2     # multiply matrix with scalar
    arrmul2 = arr1 * arr2  # multiply two matrices element by element
    print('Multiply Scalar =', arrmul1)
    # Note: this is not matrix multiplication
    print('Multiply Matrix =', arrmul2)
    # In order to do matrix multiplication
    arrmatmul = np.matmul(arr1,arr2)
    print('Matrix Multiplication =', arrmatmul)
    # OR
    arrdot = arr1.dot(arr2)
    print('Dot =', arrdot)
    # OR
    arrpy3dot5plus = arr1 @ arr2
    print('Python 3.5+ support =', arrpy3dot5plus)
Output
    Multiply Scalar = [[2 4 6]
     [2 4 6]
     [2 4 6]]
    Multiply Matrix = [[ 4 10 18]
     [ 4 10 18]
     [ 4 10 18]]
    Matrix Multiplication = [[24 30 36]
     [24 30 36]
     [24 30 36]]
    Dot = [[24 30 36]
     [24 30 36]
     [24 30 36]]
    Python 3.5+ support = [[24 30 36]
     [24 30 36]
     [24 30 36]]
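With square matrices both kinds of multiplication succeed, so the difference is easy to overlook. The sketch below uses non-square arrays, made up for this example, where only matrix multiplication is defined.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
v = np.array([[1, 0],
              [0, 1],
              [1, 1]])           # shape (3, 2)

product = m @ v                  # (2, 3) @ (3, 2) gives shape (2, 2)
print(product)                   # [[ 4  5] [10 11]]

elementwise_failed = False
try:
    m * v                        # element-wise: shapes (2, 3) and (3, 2) do not match
except ValueError as err:
    elementwise_failed = True
    print('element-wise multiply failed:', err)
```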



 Outline
Looping (Pandas)

 Series
 Data Frames
 Accessing text, CSV, Excel files using pandas
 Accessing SQL Database
 Missing Data
 Group By
 Merging, Joining & Concatenating
 Operations
2- Pandas
 Pandas is an open source library built on top of NumPy.
 It allows for fast data cleaning, preparation and analysis.
 It excels in performance and productivity.
 It also has built-in visualization features.
 It can work with data from a wide variety of sources.
 Install :
 conda install pandas
 pip install pandas
