0% found this document useful (0 votes)
9 views65 pages

N Umpy Pandas Tutorial

The document provides an overview of NumPy, a Python library for numerical computing with arrays and matrices. It covers creating and manipulating NumPy arrays, including one-dimensional and multi-dimensional arrays, as well as functions for generating arrays and handling data types. Additionally, it discusses indexing, slicing, and modifying arrays, along with practical exercises for applying these concepts.

Uploaded by

juan peñaloza
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views65 pages

N Umpy Pandas Tutorial

The document provides an overview of NumPy, a Python library for numerical computing with arrays and matrices. It covers creating and manipulating NumPy arrays, including one-dimensional and multi-dimensional arrays, as well as functions for generating arrays and handling data types. Additionally, it discusses indexing, slicing, and modifying arrays, along with practical exercises for applying these concepts.

Uploaded by

juan peñaloza
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 65

OPENCOURSEWARE

ADVANCED PROGRAMMING
STATISTICS FOR DATA SCIENCE
Ricardo Aler
NumPy
What is NumPy?
• It is a Python module/library
import numpy as np
• It is useful for computing with numeric vectors,
matrices, and multi-dimensional arrays in general
• Standard Python lists could be used, but +, -, *, /,
etc. cannot be used with numeric lists:
>>> a = [1,3,5,7,9] >>> a = [1,3,5,7,9]
>>> print(a[2:4]) >>> b = [3,5,6,7,9]
[5, 7] >>> c = a + b
>>> b = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]] >>> print c
>>> print(b[0]) [1, 3, 5, 7, 9, 3, 5, 6, 7, 9]
[1, 3, 5, 7, 9]
>>> print(b[1][2:4])
[6, 8]
Creating NumPy arrays
• One-dimension vectors:
# From lists
>>> a = np.array([1,3,5,7,9])
>>> b = np.array([3,5,6,7,9])
>>> c = a + b
>>> print c
[4, 8, 11, 14, 18]

>>> type(c)
(<type 'numpy.ndarray'>)

>>> c.shape
(5,)
Creating NumPy arrays
• Matrices:
>>> # convert a list to an array
>>> a = np.array([[1, 2, 3], [3, 6, 9], [2, 4, 6]])
>>>print(a)
[[1 2 3]
[3 6 9]
[2 4 6]]
>>> a.shape
(3, 3)
Shape of NumPy arrays
• 1-dimensional arrays

• 2-dimensional arrays
(matrices)

• 3-dimensional arrays
Shape of NumPy arrays
• Important: a 1-dimensional vector is different
from a matrix with 1 row (or 1 column)
Types of NumPy arrays
• All elements in a NumPy
array must belong to the
same type (dtype)
• dtypes are inferred
automatically
• But dtypes can also be
stated explicitely
• Available dtypes
Beware! (NumPy types)
• A NumPy array belongs to a single type
>>> d = np.arange(5)
>>> print(d.dtype)
>>> print(d)
int32
[0 1 2 3 4]

# We try to assign a real number to an integer array


# but the value is converted to integer
>>> d[1] = 9.7
print(d)
[0 9 2 3 4]
Creating NumPy vectors
with functions
• arange(x) is similar to list(range(x)), but it generates
NumPy vectors, rather than Python vectors
>>> x = np.arange(0, 10, 1) # arguments: start, stop, step
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> x.dtype
dtype('int32')

# Real-valued (float) vectors can also be created


>>> d = np.arange(5, dtype=numpy.float)
>>> print(d)
[ 0. 1. 2. 3. 4.]

# arbitrary start, stop and step


>>> np.arange(3, 7, 0.5)
array([ 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5])
Creating NumPy vectors
with functions
• linspace is useful for generating n real-valued
vectors within an interval
>>> np.linspace(0, 10, 25)
array([ 0. , 0.41666667, 0.83333333, 1.25 ,
1.66666667, 2.08333333, 2.5 , 2.91666667,
3.33333333, 3.75 , 4.16666667, 4.58333333,
5. , 5.41666667, 5.83333333, 6.25 ,
6.66666667, 7.08333333, 7.5 , 7.91666667,
8.33333333, 8.75 , 9.16666667, 9.58333333, 10.
])
Creating NumPy vectors
with functions
• diag: diagonal matrices • arrays of zeros or ones

# a diagonal matrix # A vector of zeros


>>> np.diag([1,2,3]) >>> b = np.zeros(5)
array([[1, 0, 0], >>> print(b)
[ 0. 0. 0. 0. 0.]
[0, 2, 0],
[0, 0, 3]])
# A matrix of ones
>>> c = np.ones((3,3))
# An identity matrix >>> print(c)
>>> np.eye(3) array([[ 1., 1., 1.],
array([[1., 0., 0.], [ 1., 1., 1.],
[0., 1., 0.], [ 1., 1., 1.]])
[0., 0., 1.]])
Creating NumPy vectors
with functions
• Generating random real numbers from a
uniform distribution in [0,1)
>>> np.random.rand(5,5)
array([[ 0.51531133, 0.74085206, 0.99570623, 0.97064334, 0.5819413 ],
[ 0.2105685 , 0.86289893, 0.13404438, 0.77967281, 0.78480563],
[ 0.62687607, 0.51112285, 0.18374991, 0.2582663 , 0.58475672],
[ 0.72768256, 0.08885194, 0.69519174, 0.16049876, 0.34557215],
[ 0.93724333, 0.17407127, 0.1237831 , 0.96840203, 0.52790012]])

• randn(a,b) returns a axb matrix with random real


numbers from the standard normal distribution
• randint(low, high, size=(a,b)) returns uniform
random integers in the low-high interval
Creating NumPy matrices
from vectors
• Using reshape
In []: a = np.arange(10)
In []: a
Out[]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In []: a.shape
Out[]: (10,)

In []: a = a.reshape((2,5))
In []: a
Out[]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])

In []: a.shape
Out[]: (2, 5)
Using NumPy matrices:
slicing (indexing with slices)
>>> print(a)
0 1 2
[[1 2 3]
[3 6 9] [[1 2 3] 0
[2 4 6]] [3 6 9] 1
[2 4 6]] 2
# this is just like a list of lists
>>> print(a[0]) [[1 2 3] 0
[1 2 3] [3 6 9] 1
[2 4 6]] 2
# arrays can be given comma separated indices [[1 2 3] 0
>>> print(a[1, 2]) [3 6 9] 1
9 [2 4 6]] 2

# and slices
[[1 2 3] 0
>>> print(a[1, 1:3])
[3 6 9] 1
[6 9]
[2 4 6]] 2
>>> print(a[:,1])
[2 6 4]
Using NumPy matrices:
indexing with booleans
In [99]: a = np.array([[0, np.nan], [np.nan, 3], [4, np.nan]]) # This array of booleans shows where a contains nan
In [100]: a
Out[100]: In [104]: np.isnan(a)
array([[ 0., nan], Out[104]:
[nan, 3.], array([[False, True],
[ 4., nan]]) [ True, False],
[False, True]])
In [101]: a<4
# This array of booleans shows where a<4 is true # Here we transform nan’s into 0
Out[101]: In [106]: a[np.isnan(a)] = 0
array([[ True, False], In [107]: a
[False, True], Out[107]:
[False, False]]) array([[0., 0.],
# Here, we can see what elements in the array are < 4 [0., 3.],
In [103]: a[a<4] [4., 0.]])
Out[103]: array([0., 3.])
Using NumPy matrices:
modification (setting)
# We can modify a single element in the matrix
>>> a[1, 2] = 7
>>> print(a)
[[1 2 3]
[3 6 7]
[2 4 6]]

# We can also modify whole columns


>>> a[:, 0] = [0, 9, 8]
>>> print(a)
[[0 2 3]
[9 6 7]
[8 4 6]]

# And whole rows


>>> a[0, :] = [1, 1, 1]
>>> print(a)
[[1 1 1]
[9 6 7]
[8 4 6]]
Using NumPy matrices:
modification (setting)
• Important, for arrays, a = 0 is not the same
as a[:]=0 (or a[0:] = 0)
In [18]: a = np.array(np.arange(10))
In [19]: a
Out[19]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [20]: a[:] = 0
In [21]: a
Out[21]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [22]: a = 0
In [23]: a
Out[23]: 0
Using NumPy matrices:
views (references)
• b=a does not copy b’s content into a. Rather, it creates a
reference (view). This is standard Python behavior.

In [10]: import numpy as np


In [11]: a = np.array(np.arange(10))

In [12]: a
Out[12]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]: b = a
In [14]: b[0] = 1000

In [15]: a
Out[15]: array([1000, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]: b
Out[16]: array([1000, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Using NumPy matrices:
views (references)
• Beware, indexing also creates a view (reference)!
In [26]: a = np.array(np.arange(10))
In [27]: a
Out[27]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [28]: # This is a view into a


In [29]: b = a[2:4]
In [30]: b
Out[30]: array([2, 3])

In [31]: # If we modify the view, we modify the original variable


In [32]: b[:] = -1
In [33]: b
Out[33]: array([-1, -1])
In [34]: a
# a is modified aswell!!
Out[34]: array([ 0, 1, -1, -1, 4, 5, 6, 7, 8, 9])

# We can print owndata to distinguish views from copies


In [35]: a.flags.owndata
Out[35]: True
In [36]: b.flags.owndata
Out[36]: False
Using NumPy matrices:
views (references)
• We can use copy() to actually copy the object
In [58]: a = np.array(np.arange(10))
In [59]: a
Out[59]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [60]: b = a[2:4].copy()
In [61]: b[:] = -1

In [62]: a
Out[62]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [63]: b
Out[63]: array([-1, -1])

In [64]: a.flags.owndata
Out[64]: True
In [65]: b.flags.owndata
Out[65]: True
Exercise
1. Create a 3x5 matrix of normal random numbers named my_matrix. Print it.
2. Now, we are going to introduce some NA's into the matrix (in Python NA's
are represented as numpy.nan = not a number)
1. x = [0,2]
2. y = [3,1]
3. We are going to use x and y to introduce NA's into my_matrix, at
positions (0,3) and (2,1) by doing this: my_matrix[x,y] = np.nan.
4. Print the result.
3. Now, use boolean indexing and isna() for replacing all NA's by zero, and
print the result
Solution
In []: my_matrix = np.random.randn(3,5)
In []: my_matrix
array([[-1.48413505, -0.23568385, -1.22030818, -0.81259558, 1.68216758],
[-0.24242369, -2.51793289, 1.70739294, 1.30946991, -1.74124409],
[-0.17144277, -1.42001248, -0.23261268, 1.08373964, 1.41257598]])

In []: x = [0,2]
In []: y = [3,1]

In []: my_matrix
array([[-1.48413505, -0.23568385, -1.22030818, nan, 1.68216758],
[-0.24242369, -2.51793289, 1.70739294, 1.30946991, -1.74124409],
[-0.17144277, nan, -0.23261268, 1.08373964, 1.41257598]])

In []: my_matrix[np.isnan(my_matrix)] = 0
In []: print(my_matrix)
[[-1.48413505 -0.23568385 -1.22030818 0. 1.68216758]
[-0.24242369 -2.51793289 1.70739294 1.30946991 -1.74124409]
[-0.17144277 0. -0.23261268 1.08373964 1.41257598]]
Universal functions
• They are functions that operate element-wise on one or more arrays
In [69]: a = np.arange(10)
In [70]: a
Out[70]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Universal function sqrt


In [71]: a = np.sqrt(a)
In [72]: a
Out[72]:
array([0. , 1. , 1.41421356, 1.73205081, 2. , 2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])

In [73]: b = np.arange(10)*8.7

In [75]: c = a + b
In [76]: c
Out[76]:
array([ 0. , 9.7 , 18.81421356, 27.83205081, 36.8 , 45.73606798, 54.64948974, 63.54575131, 72.42842712,
81.3 ])
Available universal functions
https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs
Reduction functions
• Reduction functions allow to transform an array to a single number:
– sum, mean, ...

In [10]: a = np.arange(10) In [14]: a = np.array([[0, 1, 2, 3],


In [11]: a [4, 5, 6, 7],
Out[11]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) [8, 9, 10, 11]])
In [15]: a
In [12]: a.sum() Out[15]:
Out[12]: 45 array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9],
In [13]: a.mean() [10, 11, 12, 13, 14]])
Out[13]: 4.5
In [16]: a.sum()
Out[16]: 105
Reduction functions
• In general, reduction functions transform
arrays into arrays of smaller dimensionality
by reducing along an axis
• A 1-dimensional array has one axis, axis=0
• A 2-dimensional array (matrix) has two axis,
axis=0 (rows), axis=1(columns)
• We could, for instance, sum the columns of a
matrix (sum along the 0-axis)
Reduction functions
# Sum along axis 0 / rows (sum all rows elements in a column)
In [19]: a.sum(axis=0)
Out[19]: array([15, 18, 21, 24, 27])
# Sum along axis 1 / columns (sum all column elements in a row)
In [20]: a.sum(axis=1)
Out[20]: array([ 6, 22, 38])

# Sum along axis 0 / rows


In [21]: np.sum(a, axis=0)
Out[21]: array([12, 15, 18, 21])
# Sum along axis 1
In [22]: np.sum(a, axis=1)
Out[22]: array([ 6, 22, 38])
Reduction functions
• Other reduction functions: max, min, mean,
...
Broadcasting
• Why does this work?
In []: a = np.array([0, 1, 2, 3, 4, 5])
In []: b = np.array([10])

In []: a
Out[]: array([0, 1, 2, 3, 4, 5])
In []: b
Out[]: array([10])

In [48]: c = a+b
In [49]: c
Out[49]: array([10, 11, 12, 13, 14, 15])

• Broadcasting allows to have operations


between arrays with different sizes
Broadcasting
• Broadcast requires all dimensions to be 1 or
equal.
In []: a = np.array([[0, 1], [2, 3], [4, 5]])
In []: b = np.array([[10], [20], [30]]) 10
In []: a
Out[]:
array([[0, 1],
[2, 3],
[4, 5]])
In []: b
Out[]:
array([[10],
[20],
[30]])
a.shape = (3,2)
In [61]: c = a + b
In [62]: c
b.shape = (3,1) => (3,2)
Out[62]:
array([[10, 11],
[22, 23],
[34, 35]])
Exercise: normalization (scaling
features to a range)
1. Create a 3x5 matrix of normal random numbers named my_matrix. Print it.
2. Now, use reduction functions (max and min) to compute two vectors
maxima and minima with the máximum and minimum values (respectively)
of the columns of my_matrix
3. Now, compute new matrix normalized_matrix, so that columns of
my_matrix become normalized between zero and one.
1. Definition of normalization: x'ij = (xij - min(x.j))/(max(x.j)-min(x.j))
4. Check that all values of normalized_matrix are >= 0, and <= 1
5. Finally, compute standarized_matrix (mean removal/variance scaling)
1. Def of standarization: x'ij = (xij - mean(x.j))/std(x.j)
6. Verify that the mean of all columns is zero, and the standard deviation is 1
(approximately)
In [201]: my_matrix = np.random.randn(3,5)
In [202]: maxima = my_matrix.max(axis=0)
In [203]: print(maxima)
[ 2.19405637 0.54857877 -0.77583136 -0.75875882 1.22463799]

In [204]: minima = my_matrix.min(axis=0)


In [205]: print(minima)
[ 1.03488226 -0.82966138 -1.55133288 -1.46959842 -0.76071212]

In [206]: normalized_matrix = (my_matrix - minima) / (maxima-minima)


In [207]: print(normalized_matrix)
[[0.28223942 0. 0.37006178 1. 1. ]
[1. 1. 0. 0. 0. ]
[0. 0.14074337 1. 0.64842885 0.40670394]]

In [208]: normalized_matrix >= 0


Out[208]:
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])

In [209]: normalized_matrix <= 1


Out[209]:
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])

In [210]: standarized_matrix = (my_matrix - my_matrix.mean(axis=0))/(my_matrix.std(axis=0))


In [211]: print(standarized_matrix)
[[-0.34486633 -0.86032467 -0.20983943 1.08769345 1.29343692]
[ 1.36020436 1.40221228 -1.10626795 -1.32659331 -1.14196153]
[-1.01533803 -0.54188761 1.31610738 0.23889987 -0.15147539]]

In [212]: standarized_matrix.mean(axis=0)
Out[212]:
array([ 2.22044605e-16, 3.70074342e-17, 3.70074342e-16, -3.88578059e-16,
4.62592927e-17])

In [213]: standarized_matrix.std(axis=0)
Out[213]: array([1., 1., 1., 1., 1.])
Loading and saving numpy
arrays to files
• Reading files: np.genfromtxt("BodyTemperature.txt",
skip_header=True)
– np.loadtxt is faster, but allows for less user control (header,
handling NA's)
• Writing to text files: np.savetxt(filename, data)
• For pickle (binary format, faster):
– np.save(filename, data)
– my_array = np.load(filename, data)
# Read textdatafile, ignore the header In [323]: males = data[data[:,0] == 0]
# The header is Gender Age HeartRate Temperature In [324]: females = data[data[:,0] == 1]
In [313]: data = np.genfromtxt("BodyTemperature.txt" , In [325]: males_mean = males.mean(axis=0)[1:]
skip_header=True ) In [326]: males_max = males.max(axis=0)[1:]
In [327]: males_min = males.min(axis=0)[1:]
# Any nan? In [328]: females_mean = females.mean(axis=0)[1:]
In [315]: np.any(np.isnan(data)) In [329]: females_max = females.max(axis=0)[1:]
Out[315]: False In [330]: females_min = females.min(axis=0)[1:]
In [331]: table = np.array([males_mean, males_max,
# Number of males males_min, females_mean, females_max, females_min])
In [317]: np.sum(data[:,0] == 0)
Out[317]: 49 In [332]: table
Out[332]:
# Number of females array([[ 37.81632653, 73.91836735, 98.19795918],
In [319]: np.sum(data[:,0] == 1) [ 50. , 87. , 101.3 ],
Out[319]: 51 [ 22. , 61. , 96.2 ],
[ 37.43137255, 73.41176471, 98.45686275],
# Ignoring gender, the remaining columns are averages for: [ 49. , 87. , 100.8 ],
# Age HeartRate Temperature [ 21. , 67. , 96.8 ]])

In [322]: data.mean(axis=0)[1:] In [333]: np.savetxt("BD_results.txt", table)


Out[322]: array([37.62, 73.66, 98.33])
Pandas
PANDAS
• Pandas is the Python library to work with
dataframes (similar to R data.frames)
import pandas as pd
• An advantage of Pandas over numpy is that all
elements in a numpy array must belong to the
same type, while Pandas allows to have different
columns with different types (integers, reals,
strings, ...)
PANDAS data structures
• Pandas contains two data structures:
– Series: a series is like a vector, but with an index
– Dataframes: it is similar to R dataframes (a matrix
with column names. Each column may belong to
different data types: integer, real numbers, strings,
...)
• A dataframe is made of:
– index
– column names
– values
Example of series
# Using the default index 0, 1, ... # Using a custom index
In [228]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

s = pd.Series(np.random.randn(5)) In [229]: s
Out[229]:
a 0.226183
In [224]: s b -0.564569
Out[224]: c -1.058691
0 1.037685 d 0.970553
1 0.403077 e -0.857780
2 -1.814123 dtype: float64
3 -0.005181
4 1.692980 In [230]: s.values
dtype: float64 Out[230]: array([ 0.22618273, -0.564569 , -1.05869052, 0.97055338, -
0.85777957])
# We can get the values of a series as a numpy array
In [225]: s.values In [231]: s.index
Out[225]: array([ 1.03768522, 0.40307685, -1.81412276, -0.005181 , Out[231]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
1.69298038])

In [226]: s.index
Out[226]: RangeIndex(start=0, stop=5, step=1)

Note: although índices can be useful in some cases (time series, ...), this tutorial will not
focus on them
Reading files as dataframes
In [121]: import pandas as pd # Getting the names of the columns
# Read file in csv format into a Pandas dataframe In [130]: list(flights.columns)
In [122]: flights = pd.read_csv("flights.csv") Out[130]:
['year', 'month', 'day', 'dep_time',
In [124]: flights.shape 'sched_dep_time', 'dep_delay',
Out[124]: (336776, 19) 'arr_time', 'sched_arr_time', 'arr_delay',
'carrier', 'flight', 'tailnum', 'origin',
# head: print the first rows 'dest', 'air_time', 'distance', 'hour', 'minute', 'time_hour']

Column names

Values

In []: flights.index
Index (by default 0, 1, 2, ...)
Out[]: RangeIndex(start=0, stop=336776, step=1)
Describing the dataframe
Setting the index
• By default: 0, 1, 2, ...
In [12]: flights.index
Out[12]: RangeIndex(start=0, stop=336776, step=1)

• In most cases, this is what you need


• We can set one of the columns as the index:
flights.set_index("month")
Setting the index
• New index. We could also use dates ...

Note: although índices


can be useful in some
cases (time series, ...),
this tutorial will not
focus on them
Extracting the values from a Pandas dataframe or
series to a numpy matrix / array

In [39]: flights.values
Out[39]:
array([[2013, 1, 1, ..., 5, 15, '2013-01-01 05:00:00'],
[2013, 1, 1, ..., 5, 29, '2013-01-01 05:00:00'],
[2013, 1, 1, ..., 5, 40, '2013-01-01 05:00:00'],
...,
[2013, 9, 30, ..., 12, 10, '2013-09-30 12:00:00'],
[2013, 9, 30, ..., 11, 59, '2013-09-30 11:00:00'],
[2013, 9, 30, ..., 8, 40, '2013-09-30 08:00:00']], dtype=object)
Selecting rows and columns
(indexing)
• Label selection: both rows and columns can have labels:
loc
– the labels of the rows are the indices (index)
– the labels of the columns are the column names
• Position (integer) selection: iloc
– Rows: e.g. select rows from 0 to 10
– Columns: e.g. select rows from 3 to 7
• Boolean selection: selecting rows that satisfy a condition
– E.g.: Select all rows where age > 35
Selecting rows and columns
(indexing)
• Label selection: both rows and columns can have labels:
loc
– the labels of the rows are the indices (index)
– the labels of the columns are the column names
• Position (integer) selection: iloc
– Rows: select rows from 0 to 10
– Columns: select rows from 3 to 7
• Boolean selection: selecting rows that satisfy a condition
– E.g.: Select all rows where age > 35
Label selection: rows
.loc

In [25]: flights.loc[2:4]
Out[25]:
year month day ... hour minute time_hour
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00
Label selection: columns
.loc
List of columns Range of columns
Labels: rows and columns
.loc
Beware! series vs. dataframe
This returns a dataframe This returns a series!
In []: flights.loc[:,['month']] In [35]: flights.loc[:,'month']
Out[]: Out[35]:
month 01
01 11
11 21
21 31
31 41
41 51
51 61
61 71
71 81
81 91
91 10 1

In []: type(flights.loc[:,['month']]) In []: type(flights.loc[:,'month'])


Out[]: pandas.core.frame.DataFrame Out[]: pandas.core.series.Series
Selecting single columns (series)
This returns a series! The same, with dot notation
In [35]: flights.loc[:,'month'] In [42]: flights.month
Out[35]: Out[42]:
01 01
11 11
21 21
31 31
41 41
51 51
61 61
71 71
81 81
91 91
10 1 10 1
Note: we can get the values as a numpy array
In [43]: flights.month.values
Out[43]: array([1, 1, 1, ..., 9, 9, 9], dtype=int64)
Shorthand for column selection
• flights.loc[:,'month'] is equivalent to
flights['month']
• flighs.loc[:,['year', 'month']] is equivalent to
flights[['year', 'month']]
Selecting rows and columns
(indexing)
• Label selection: both rows and columns can have labels:
loc
– the labels of the rows are the indices (index)
– the labels of the columns are the column names
• Position (integer) selection: iloc
– Rows: select rows from 0 to 10
– Columns: select rows from 3 to 7
• Boolean selection: selecting rows that satisfy a condition
– E.g.: Select all rows where age > 35
Position (integer) selection:
iloc

In [41]: flights.iloc[2:4, 1:3]


Out[41]:
month day
21 1
31 1
Combining iloc for rows and
loc for columns
• What if we want to select rows by position
but columns by name?
# Just one column
In [51]: flights.iloc[2:4, flights.columns.get_loc('month')]
Out[51]:
21
31
Name: month, dtype: int64

# Several columns
In [53]: flights.iloc[2:4, flights.columns.get_indexer(['month','day'])]
Out[53]:
month day
21 1
31 1
Selecting rows and columns
(indexing)
• Label selection: both rows and columns can have labels:
loc
– the labels of the rows are the indices (index)
– the labels of the columns are the column names
• Position (integer) selection: iloc
– Rows: select rows from 0 to 10
– Columns: select rows from 3 to 7
• Boolean selection: selecting rows that satisfy a condition
– E.g.: Select all rows where age > 35
Boolean indexing:
selecting rows on condition
• Both loc and iloc can be used, but loc is recommended
• List of flights for January the first?

In []: flights.loc[(flights.month == 1) & (flights.day == 1)]


Out[]:
year month day ... hour minute time_hour
0 2013 1 1 ... 5 15 2013-01-01 05:00:00
1 2013 1 1 ... 5 29 2013-01-01 05:00:00
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00

• Note: we can also write (flights.loc[:,"month"] == 1)


Boolean indexing:
selecting rows on condition
• Same thing in two lines (clearer code)
• List of flights for January the first?
In []: condition = (flights.month == 1) & (flights.day == 1)
In []: flights.loc[condition]
Out[]:
year month day ... hour minute time_hour
0 2013 1 1 ... 5 15 2013-01-01 05:00:00
1 2013 1 1 ... 5 29 2013-01-01 05:00:00
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00
Boolean indexing:
selecting rows on condition
• List of flights for January the first?
– A shorter version

•In []: flights.query("month == 1 & day == 1")


Out[]:
year month day ... hour minute time_hour
0 2013 1 1 ... 5 15 2013-01-01 05:00:00
1 2013 1 1 ... 5 29 2013-01-01 05:00:00
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00
Boolean conditions
• In order to create conditions, we can use:
• <, >, ==, <=, >=, !=
• &: and
• |: or
• ~: not
• isin: is value in a list of values?
• isnull: is value nan?
Boolean selection:
selecting rows on condition
• What flights start at EWR or JFK airports?
In [61]: flights.loc[flights.origin.isin(['EWR', 'JFK']), ['origin', 'dest']]
Out[61]:
origin dest
0 EWR IAH
2 JFK MIA
3 JFK BQN
5 EWR ORD
6 EWR FLL
8 JFK MCO
10 JFK PBI

• Note: we are also selecting origin and dest columns


Creating new columns
• Compute speed for every flight
– speed = distance / airtime

• Two ways:
– First: flights.loc[:,'speed'] =
• flights.loc[:,'speed'] = flights.distance - flights.air_time
• flights.loc[:,'speed'] = flights.loc[:,'distance'] - flights.loc[:,'air_time']
• flights.loc[:,'speed'] = flights ['distance'] - flights ['air_time']
– Shorthand: flights['speed'] =
• flights['speed'] = flights['distance'] - flights['air_time']
Creating new columns
Modifying subsets of the
dataframe (setting)
• Let's put nan on the first three rows and
columns 'year', 'month' and 'day'
# Let's create a copy first # Now, we do the assignment
In [81]: flights_copy = flights.copy() In [85]: flights_copy.iloc[0:4,
# Let's see the content of the first three rows and the flights.columns.get_indexer(['year', 'month', 'day'])] = np.nan
first three columns In [86]: flights_copy
In [83]: flights_copy.iloc[0:4, Out[86]:
flights.columns.get_indexer(['year', 'month', 'day'])] year month day ... minute time_hour speed
Out[83]: 0 NaN NaN NaN ... 15 2013-01-01 05:00:00 1173.0
year month day 1 NaN NaN NaN ... 29 2013-01-01 05:00:00 1189.0
0 2013 1 1 2 NaN NaN NaN ... 40 2013-01-01 05:00:00 929.0
1 2013 1 1 3 NaN NaN NaN ... 45 2013-01-01 05:00:00 1393.0
2 2013 1 1 4 2013.0 1.0 1.0 ... 0 2013-01-01 06:00:00 646.0
3 2013 1 1 5 2013.0 1.0 1.0 ... 58 2013-01-01 05:00:00 569.0
6 2013.0 1.0 1.0 ... 0 2013-01-01 06:00:00 907.0
7 2013.0 1.0 1.0 ... 0 2013-01-01 06:00:00 176.0
Subset (modification) with
boolean selection
• Let's create a column 'satisfaction' with
'good' if arrival delay <= 75, and 'bad'
otherwise
In [107]: flights['satisfaction'] = 'bad'

In [108]: flights.loc[flights.arr_delay <= 75, 'satisfaction'] = 'good'

In [109]: flights.loc[:, ['arr_delay', 'satisfaction']].head()


Out[109]:
arr_delay satisfaction
0 11.0 good
1 20.0 good
2 33.0 good
3 -18.0 good
4 -25.0 good

You might also like