NumPy Pandas Tutorial
ADVANCED PROGRAMMING
STATISTICS FOR DATA SCIENCE
Ricardo Aler
NumPy
What is NumPy?
• It is a Python module/library
import numpy as np
• It is useful for computing with numeric vectors,
matrices, and multi-dimensional arrays in general
• Standard Python lists could be used, but +, -, *, /,
etc. do not perform element-wise arithmetic on lists
(+ concatenates them instead):
>>> a = [1,3,5,7,9]
>>> print(a[2:4])
[5, 7]
>>> b = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
>>> print(b[0])
[1, 3, 5, 7, 9]
>>> print(b[1][2:4])
[6, 8]
>>> a = [1,3,5,7,9]
>>> b = [3,5,6,7,9]
>>> c = a + b
>>> print(c)
[1, 3, 5, 7, 9, 3, 5, 6, 7, 9]
Creating NumPy arrays
• One-dimension vectors:
# From lists
>>> a = np.array([1,3,5,7,9])
>>> b = np.array([3,5,6,7,9])
>>> c = a + b
>>> print(c)
[ 4  8 11 14 18]
>>> type(c)
<class 'numpy.ndarray'>
>>> c.shape
(5,)
Creating NumPy arrays
• Matrices:
>>> # convert a list to an array
>>> a = np.array([[1, 2, 3], [3, 6, 9], [2, 4, 6]])
>>> print(a)
[[1 2 3]
 [3 6 9]
 [2 4 6]]
>>> a.shape
(3, 3)
Shape of NumPy arrays
• 1-dimensional arrays
• 2-dimensional arrays
(matrices)
• 3-dimensional arrays
Shape of NumPy arrays
• Important: a 1-dimensional vector is different
from a matrix with 1 row (or 1 column)
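The difference between a vector and a one-row or one-column matrix shows up in the shape attribute. The following is a minimal sketch; the variable names v, row and col are chosen here only for illustration.

```python
import numpy as np

v = np.array([1, 3, 5])          # 1-dimensional vector
row = np.array([[1, 3, 5]])      # matrix with 1 row
col = np.array([[1], [3], [5]])  # matrix with 1 column

print(v.shape)    # (3,)
print(row.shape)  # (1, 3)
print(col.shape)  # (3, 1)

# reshape converts between them without changing the data
print(v.reshape((1, 3)).shape)  # (1, 3)
print(v.reshape((3, 1)).shape)  # (3, 1)
```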
Types of NumPy arrays
• All elements in a NumPy
array must belong to the
same type (dtype)
• dtypes are inferred
automatically
• But dtypes can also be
stated explicitly
• Available dtypes
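A short sketch of how dtypes are inferred and how they can be stated, using made-up arrays. The exact integer dtype depends on the platform, so it is not assumed here.

```python
import numpy as np

ints = np.array([1, 2, 3])                        # inferred as an integer dtype
floats = np.array([1, 2.5, 3])                    # one float makes the whole array float64
explicit = np.array([1, 2, 3], dtype=np.float64)  # dtype stated explicitly

print(ints.dtype)      # platform-dependent integer (e.g. int32 or int64)
print(floats.dtype)    # float64
print(explicit.dtype)  # float64

# Because an array has a single dtype, a float stored into an
# integer array is truncated to an integer
ints[0] = 7.9
print(ints)            # [7 2 3]
```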
Beware! (NumPy types)
• A NumPy array belongs to a single type
>>> d = np.arange(5)
>>> print(d.dtype)
int32
>>> print(d)
[0 1 2 3 4]
In []: a = a.reshape((2,5))
In []: a
Out[]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
In []: a.shape
Out[]: (2, 5)
Using NumPy matrices:
slicing (indexing with slices)
>>> print(a)
[[1 2 3]
 [3 6 9]
 [2 4 6]]
# this is just like a list of lists
>>> print(a[0])
[1 2 3]
# arrays can be given comma separated indices
>>> print(a[1, 2])
9
# and slices
>>> print(a[1, 1:3])
[6 9]
>>> print(a[:,1])
[2 6 4]
Using NumPy matrices:
indexing with booleans
In [99]: a = np.array([[0, np.nan], [np.nan, 3], [4, np.nan]])
In [100]: a
Out[100]:
array([[ 0., nan],
       [nan,  3.],
       [ 4., nan]])
# This array of booleans shows where a<4 is true
In [101]: a<4
Out[101]:
array([[ True, False],
       [False,  True],
       [False, False]])
# Here, we can see what elements in the array are < 4
In [103]: a[a<4]
Out[103]: array([0., 3.])
# This array of booleans shows where a contains nan
In [104]: np.isnan(a)
Out[104]:
array([[False,  True],
       [ True, False],
       [False,  True]])
# Here we transform nan's into 0
In [106]: a[np.isnan(a)] = 0
In [107]: a
Out[107]:
array([[0., 0.],
       [0., 3.],
       [4., 0.]])
Using NumPy matrices:
modification (setting)
# We can modify a single element in the matrix
>>> a[1, 2] = 7
>>> print(a)
[[1 2 3]
[3 6 7]
[2 4 6]]
In [20]: a[:] = 0
In [21]: a
Out[21]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
In [22]: a = 0
In [23]: a
Out[23]: 0
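The last two assignments above look similar but do different things: a[:] = 0 writes into the existing array, whereas a = 0 rebinds the name a to an integer. A minimal sketch, where same_array is a hypothetical second name introduced only to make the difference visible:

```python
import numpy as np

a = np.arange(10)
same_array = a      # second name for the same array object

a[:] = 0            # modifies the existing array in place
print(same_array)   # [0 0 0 0 0 0 0 0 0 0]

a = 0               # rebinds the name a to the integer 0
print(a)            # 0
print(same_array)   # the array still exists: [0 0 0 0 0 0 0 0 0 0]
```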
Using NumPy matrices:
views (references)
• b=a does not copy a's content into b. Rather, it makes b a second
reference to the same array. This is standard Python behavior.
In [12]: a
Out[12]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [13]: b = a
In [14]: b[0] = 1000
In [15]: a
Out[15]: array([1000, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [16]: b
Out[16]: array([1000, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Using NumPy matrices:
views (references)
• Beware, slicing also creates a view (reference)! Use .copy() when an independent array is needed:
In [26]: a = np.array(np.arange(10))
In [27]: a
Out[27]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [60]: b = a[2:4].copy()
In [61]: b[:] = -1
In [62]: a
Out[62]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [63]: b
Out[63]: array([-1, -1])
In [64]: a.flags.owndata
Out[64]: True
In [65]: b.flags.owndata
Out[65]: True
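For contrast, the sketch below puts a slice without .copy() next to one with it. The names view, copy and a2 are illustrative and do not come from the slides.

```python
import numpy as np

a = np.arange(10)
view = a[2:4]               # slicing returns a view on a's data
view[:] = -1                # so this also changes a
print(a)                    # [ 0  1 -1 -1  4  5  6  7  8  9]
print(view.flags.owndata)   # False: the data belongs to a

a2 = np.arange(10)
copy = a2[2:4].copy()       # an explicit copy owns its data
copy[:] = -1                # a2 is left untouched
print(a2)                   # [0 1 2 3 4 5 6 7 8 9]
print(copy.flags.owndata)   # True
```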
Exercise
1. Create a 3x5 matrix of normal random numbers named my_matrix. Print it.
2. Now, we are going to introduce some NA's into the matrix (in Python, NA's
are represented as numpy.nan = not a number)
   1. x = [0,2]
   2. y = [3,1]
   3. We are going to use x and y to introduce NA's into my_matrix, at
   positions (0,3) and (2,1), by doing this: my_matrix[x,y] = np.nan.
   4. Print the result.
3. Now, use boolean indexing and np.isnan() to replace all NA's by zero, and
print the result
Solution
In []: my_matrix = np.random.randn(3,5)
In []: my_matrix
array([[-1.48413505, -0.23568385, -1.22030818, -0.81259558, 1.68216758],
[-0.24242369, -2.51793289, 1.70739294, 1.30946991, -1.74124409],
[-0.17144277, -1.42001248, -0.23261268, 1.08373964, 1.41257598]])
In []: x = [0,2]
In []: y = [3,1]
In []: my_matrix[x,y] = np.nan
In []: my_matrix
array([[-1.48413505, -0.23568385, -1.22030818, nan, 1.68216758],
[-0.24242369, -2.51793289, 1.70739294, 1.30946991, -1.74124409],
[-0.17144277, nan, -0.23261268, 1.08373964, 1.41257598]])
In []: my_matrix[np.isnan(my_matrix)] = 0
In []: print(my_matrix)
[[-1.48413505 -0.23568385 -1.22030818 0. 1.68216758]
[-0.24242369 -2.51793289 1.70739294 1.30946991 -1.74124409]
[-0.17144277 0. -0.23261268 1.08373964 1.41257598]]
Universal functions
• They are functions that operate element-wise on one or more arrays
In [69]: a = np.arange(10)
In [70]: a
Out[70]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [73]: b = np.arange(10)*8.7
In [75]: c = np.sqrt(a) + b
In [76]: c
Out[76]:
array([ 0.        ,  9.7       , 18.81421356, 27.83205081, 36.8       ,
       45.73606798, 54.64948974, 63.54575131, 72.42842712, 81.3       ])
Available universal functions
https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs
Reduction functions
• Reduction functions reduce an array to a single number (or, when an axis is given, to one number per row or column):
– sum, mean, ...
In []: a
Out[]: array([0, 1, 2, 3, 4, 5])
In []: b
Out[]: array([10])
In [48]: c = a+b
In [49]: c
Out[49]: array([10, 11, 12, 13, 14, 15])
In [212]: standarized_matrix.mean(axis=0)
Out[212]:
array([ 2.22044605e-16, 3.70074342e-17, 3.70074342e-16, -3.88578059e-16,
4.62592927e-17])
In [213]: standarized_matrix.std(axis=0)
Out[213]: array([1., 1., 1., 1., 1.])
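The slides do not show how standarized_matrix was built. One plausible way, sketched here as an assumption, combines column-wise reductions with broadcasting. The data is randomly generated for illustration and the seed is there only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)             # seeded only for reproducibility
my_matrix = rng.normal(size=(100, 5))      # hypothetical data: 100 rows, 5 columns

total = my_matrix.sum()                    # one number for the whole array
col_means = my_matrix.mean(axis=0)         # one mean per column, shape (5,)
col_stds = my_matrix.std(axis=0)           # one std per column, shape (5,)

# Broadcasting stretches the (5,) vectors across all 100 rows
standarized_matrix = (my_matrix - col_means) / col_stds

print(standarized_matrix.mean(axis=0))     # values very close to 0
print(standarized_matrix.std(axis=0))      # values very close to 1
```

With axis=0 the reduction runs down the rows and leaves one value per column; axis=1 would leave one value per row.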
Loading and saving numpy
arrays to files
• Reading files: np.genfromtxt("BodyTemperature.txt", skip_header=True)
– np.loadtxt is faster, but gives the user less control (header,
handling NA's)
• Writing to text files: np.savetxt(filename, data)
• For NumPy's binary .npy format (faster):
– np.save(filename, data)
– my_array = np.load(filename)
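A small round-trip sketch of the text and binary routes. The file names and the temporary directory are assumptions made so the example does not depend on BodyTemperature.txt.

```python
import os
import tempfile

import numpy as np

data = np.arange(6, dtype=float).reshape((2, 3))

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "data.txt")   # hypothetical file names
    npy_path = os.path.join(tmp, "data.npy")

    np.savetxt(txt_path, data)                 # text format
    from_txt = np.genfromtxt(txt_path)

    np.save(npy_path, data)                    # binary .npy format
    from_npy = np.load(npy_path)

print(np.array_equal(data, from_txt))  # True
print(np.array_equal(data, from_npy))  # True
```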
# Read text data file, ignore the header
# The header is Gender Age HeartRate Temperature
In [313]: data = np.genfromtxt("BodyTemperature.txt", skip_header=True)
# Any nan?
In [315]: np.any(np.isnan(data))
Out[315]: False
# Number of males
In [317]: np.sum(data[:,0] == 0)
Out[317]: 49
# Number of females
In [319]: np.sum(data[:,0] == 1)
Out[319]: 51
In [323]: males = data[data[:,0] == 0]
In [324]: females = data[data[:,0] == 1]
# Ignoring gender, the remaining columns are averages for:
# Age HeartRate Temperature
In [325]: males_mean = males.mean(axis=0)[1:]
In [326]: males_max = males.max(axis=0)[1:]
In [327]: males_min = males.min(axis=0)[1:]
In [328]: females_mean = females.mean(axis=0)[1:]
In [329]: females_max = females.max(axis=0)[1:]
In [330]: females_min = females.min(axis=0)[1:]
In [331]: table = np.array([males_mean, males_max,
males_min, females_mean, females_max, females_min])
In [332]: table
Out[332]:
array([[ 37.81632653,  73.91836735,  98.19795918],
       [ 50.        ,  87.        , 101.3       ],
       [ 22.        ,  61.        ,  96.2       ],
       [ 37.43137255,  73.41176471,  98.45686275],
       [ 49.        ,  87.        , 100.8       ],
       [ 21.        ,  67.        ,  96.8       ]])
s = pd.Series(np.random.randn(5))
In [224]: s
Out[224]:
0    1.037685
1    0.403077
2   -1.814123
3   -0.005181
4    1.692980
dtype: float64
# We can get the values of a series as a numpy array
In [225]: s.values
Out[225]: array([ 1.03768522, 0.40307685, -1.81412276, -0.005181 , 1.69298038])
In [226]: s.index
Out[226]: RangeIndex(start=0, stop=5, step=1)
In [229]: s
Out[229]:
a    0.226183
b   -0.564569
c   -1.058691
d    0.970553
e   -0.857780
dtype: float64
In [230]: s.values
Out[230]: array([ 0.22618273, -0.564569 , -1.05869052, 0.97055338, -0.85777957])
In [231]: s.index
Out[231]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Note: although indices can be useful in some cases (time series, ...), this tutorial will not
focus on them
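The transcript does not show how the series with the labels 'a' to 'e' was created. One common way is to pass index= to pd.Series, sketched below with made-up values.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.25, -0.5, 1.0], index=['a', 'b', 'c'])  # made-up values

print(s['b'])         # access by label: -0.5
print(s.iloc[1])      # access by position: -0.5
print(s.values)       # the underlying numpy array
print(list(s.index))  # ['a', 'b', 'c']
```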
Reading files as dataframes
In [121]: import pandas as pd
# Read file in csv format into a Pandas dataframe
In [122]: flights = pd.read_csv("flights.csv")
In [124]: flights.shape
Out[124]: (336776, 19)
# head: print the first rows
# Getting the names of the columns
In [130]: list(flights.columns)
Out[130]:
['year', 'month', 'day', 'dep_time',
 'sched_dep_time', 'dep_delay',
 'arr_time', 'sched_arr_time', 'arr_delay',
 'carrier', 'flight', 'tailnum', 'origin',
 'dest', 'air_time', 'distance', 'hour', 'minute', 'time_hour']
Parts of a dataframe: column names, values, and index (by default 0, 1, 2, ...)
In []: flights.index
Out[]: RangeIndex(start=0, stop=336776, step=1)
Describing the dataframe
Setting the index
• By default: 0, 1, 2, ...
In [12]: flights.index
Out[12]: RangeIndex(start=0, stop=336776, step=1)
In [39]: flights.values
Out[39]:
array([[2013, 1, 1, ..., 5, 15, '2013-01-01 05:00:00'],
[2013, 1, 1, ..., 5, 29, '2013-01-01 05:00:00'],
[2013, 1, 1, ..., 5, 40, '2013-01-01 05:00:00'],
...,
[2013, 9, 30, ..., 12, 10, '2013-09-30 12:00:00'],
[2013, 9, 30, ..., 11, 59, '2013-09-30 11:00:00'],
[2013, 9, 30, ..., 8, 40, '2013-09-30 08:00:00']], dtype=object)
Selecting rows and columns
(indexing)
• Label selection: both rows and columns can have labels:
loc
– the labels of the rows are the indices (index)
– the labels of the columns are the column names
• Position (integer) selection: iloc
– Rows: e.g. select rows from 0 to 10
– Columns: e.g. select columns from 3 to 7
• Boolean selection: selecting rows that satisfy a condition
– E.g.: Select all rows where age > 35
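The three kinds of selection can be compared on a toy dataframe. The sketch below uses a hypothetical dataframe called people (not the flights data) with string row labels, so that labels and positions are clearly different things.

```python
import pandas as pd

# Hypothetical toy dataframe, not the flights data
people = pd.DataFrame(
    {"name": ["Ana", "Luis", "Eva", "Juan"],
     "age": [41, 29, 36, 52],
     "city": ["Madrid", "Sevilla", "Bilbao", "Madrid"]},
    index=["p1", "p2", "p3", "p4"],
)

by_label = people.loc["p2":"p3", ["name", "age"]]  # label slices include both ends
by_position = people.iloc[1:3, 0:2]                # position slices exclude the end
over_35 = people.loc[people["age"] > 35]           # rows where the condition holds

print(by_label.equals(by_position))  # True
print(list(over_35.index))           # ['p1', 'p3', 'p4']
```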
Label selection: rows
.loc
In [25]: flights.loc[2:4]
Out[25]:
year month day ... hour minute time_hour
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00
Label selection: columns
.loc
List of columns Range of columns
Labels: rows and columns
.loc
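The code for these slides was not preserved, so the following is only a sketch of what label selection of columns with .loc can look like. flights_demo is a small made-up stand-in that reuses a few of the flights column names.

```python
import pandas as pd

# Small stand-in for the flights dataframe (made-up values)
flights_demo = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [1, 1, 2],
    "day": [1, 2, 3],
    "dep_time": [517, 533, 542],
    "dep_delay": [2, 4, 2],
})

list_of_columns = flights_demo.loc[:, ["year", "dep_time"]]  # list of columns
range_of_columns = flights_demo.loc[:, "month":"dep_time"]   # range of columns, both ends included
rows_and_columns = flights_demo.loc[0:1, "year":"day"]       # rows 0..1 and columns year..day

print(list(list_of_columns.columns))   # ['year', 'dep_time']
print(list(range_of_columns.columns))  # ['month', 'day', 'dep_time']
print(rows_and_columns.shape)          # (2, 3)
```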
Beware! series vs. dataframe
# This returns a dataframe
In []: flights.loc[:,['month']]
Out[]:
   month
0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1
# This returns a series!
In [35]: flights.loc[:,'month']
Out[35]:
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1
# Several columns
In [53]: flights.iloc[2:4, flights.columns.get_indexer(['month','day'])]
Out[53]:
   month  day
2      1    1
3      1    1
Boolean indexing:
selecting rows on condition
• Both loc and iloc can be used, but loc is recommended
• List of flights for January the first?
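One way to answer the question above is to combine two boolean conditions inside .loc. This is a sketch on a made-up stand-in dataframe (flights_demo); the same expression should apply to flights, which has the same month and day columns.

```python
import pandas as pd

# Small stand-in for the flights dataframe (made-up values)
flights_demo = pd.DataFrame({
    "month": [1, 1, 2, 1],
    "day": [1, 2, 1, 1],
    "carrier": ["UA", "AA", "B6", "DL"],
})

# Combine conditions with & and wrap each one in parentheses
is_jan_first = (flights_demo["month"] == 1) & (flights_demo["day"] == 1)
jan_first = flights_demo.loc[is_jan_first]

print(list(jan_first["carrier"]))  # ['UA', 'DL']
```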
Creating new columns
• Two ways:
– First: flights.loc[:,'speed'] =
• flights.loc[:,'speed'] = flights.distance - flights.air_time
• flights.loc[:,'speed'] = flights.loc[:,'distance'] - flights.loc[:,'air_time']
• flights.loc[:,'speed'] = flights['distance'] - flights['air_time']
– Shorthand: flights['speed'] =
• flights['speed'] = flights['distance'] - flights['air_time']
Modifying subsets of the
dataframe (setting)
• Let's put nan on the first four rows and
columns 'year', 'month' and 'day'
# Let's create a copy first
In [81]: flights_copy = flights.copy()
# Let's see the content of the first four rows and the first three columns
In [83]: flights_copy.iloc[0:4, flights.columns.get_indexer(['year', 'month', 'day'])]
Out[83]:
   year  month  day
0  2013      1    1
1  2013      1    1
2  2013      1    1
3  2013      1    1
# Now, we do the assignment
In [85]: flights_copy.iloc[0:4, flights.columns.get_indexer(['year', 'month', 'day'])] = np.nan
In [86]: flights_copy
Out[86]:
     year  month  day  ...  minute            time_hour   speed
0     NaN    NaN  NaN  ...      15  2013-01-01 05:00:00  1173.0
1     NaN    NaN  NaN  ...      29  2013-01-01 05:00:00  1189.0
2     NaN    NaN  NaN  ...      40  2013-01-01 05:00:00   929.0
3     NaN    NaN  NaN  ...      45  2013-01-01 05:00:00  1393.0
4  2013.0    1.0  1.0  ...       0  2013-01-01 06:00:00   646.0
5  2013.0    1.0  1.0  ...      58  2013-01-01 05:00:00   569.0
6  2013.0    1.0  1.0  ...       0  2013-01-01 06:00:00   907.0
7  2013.0    1.0  1.0  ...       0  2013-01-01 06:00:00   176.0
Subset (modification) with
boolean selection
• Let's create a column 'satisfaction' with
'good' if arrival delay <= 75, and 'bad'
otherwise
In [107]: flights['satisfaction'] = 'bad'
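The transcript stops after the default value is set. A likely next step, sketched here as an assumption on a made-up stand-in dataframe (flights_demo), is to overwrite the rows that meet the condition using boolean selection with .loc.

```python
import numpy as np
import pandas as pd

# Small stand-in for the flights dataframe (made-up values)
flights_demo = pd.DataFrame({"arr_delay": [11.0, 120.0, 75.0, np.nan]})

flights_demo["satisfaction"] = "bad"   # default value for every row
# Overwrite only the rows where the arrival delay is at most 75
flights_demo.loc[flights_demo["arr_delay"] <= 75, "satisfaction"] = "good"

print(list(flights_demo["satisfaction"]))  # ['good', 'bad', 'good', 'bad']
```

Rows with a missing arr_delay stay 'bad' in this sketch, because a comparison with nan is False.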