NumPy Pandas Tutorial

The document provides an overview of NumPy, a Python library for numerical computing with arrays and matrices. It covers creating and manipulating NumPy arrays, including one-dimensional and multi-dimensional arrays, as well as functions for generating arrays and handling data types. Additionally, it discusses indexing, slicing, and modifying arrays, along with practical exercises for applying these concepts.

Uploaded by

juan peñaloza

OPENCOURSEWARE

ADVANCED PROGRAMMING
STATISTICS FOR DATA SCIENCE
Ricardo Aler
NumPy
What is NumPy?
• It is a Python module/library
import numpy as np
• It is useful for computing with numeric vectors,
matrices, and multi-dimensional arrays in general
• Standard Python lists could be used, but +, -, *, /,
etc. do not perform element-wise arithmetic on lists
(+ concatenates; -, * and / between two lists raise a TypeError):
>>> a = [1,3,5,7,9]
>>> print(a[2:4])
[5, 7]
>>> b = [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
>>> print(b[0])
[1, 3, 5, 7, 9]
>>> print(b[1][2:4])
[6, 8]

>>> a = [1,3,5,7,9]
>>> b = [3,5,6,7,9]
>>> c = a + b
>>> print(c)
[1, 3, 5, 7, 9, 3, 5, 6, 7, 9]
Creating NumPy arrays
• One-dimension vectors:
# From lists
>>> a = np.array([1,3,5,7,9])
>>> b = np.array([3,5,6,7,9])
>>> c = a + b
>>> print(c)
[ 4  8 11 14 18]

>>> type(c)
<class 'numpy.ndarray'>

>>> c.shape
(5,)
Creating NumPy arrays
• Matrices:
>>> # convert a list to an array
>>> a = np.array([[1, 2, 3], [3, 6, 9], [2, 4, 6]])
>>> print(a)
[[1 2 3]
 [3 6 9]
 [2 4 6]]
>>> a.shape
(3, 3)
Shape of NumPy arrays
• 1-dimensional arrays

• 2-dimensional arrays
(matrices)

• 3-dimensional arrays
Shape of NumPy arrays
• Important: a 1-dimensional vector is different
from a matrix with 1 row (or 1 column)
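The difference between a vector and a one-row or one-column matrix can be seen in the shapes. A minimal sketch (the variable names v, row and col are only illustrative):

```python
import numpy as np

v = np.arange(5)            # 1-dimensional vector
row = v.reshape((1, 5))     # matrix with 1 row
col = v.reshape((5, 1))     # matrix with 1 column

print(v.shape, row.shape, col.shape)   # (5,) (1, 5) (5, 1)
print(v.ndim, row.ndim, col.ndim)      # 1 2 2
```

All three hold the same five numbers, but only the last two have a row axis and a column axis.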
Types of NumPy arrays
• All elements in a NumPy
array must belong to the
same type (dtype)
• dtypes are inferred
automatically
• But dtypes can also be
stated explicitly
• Available dtypes
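A short sketch of how dtypes are inferred and how they can be stated explicitly. The array names are illustrative, and the exact integer width that NumPy infers depends on the platform:

```python
import numpy as np

ints = np.array([1, 2, 3])                         # inferred: an integer dtype
floats = np.array([1, 2.5, 3])                     # one float makes the whole array float64
explicit = np.array([1, 2, 3], dtype=np.float32)   # dtype stated explicitly
as_int = floats.astype(np.int64)                   # conversion truncates 2.5 to 2

print(ints.dtype, floats.dtype, explicit.dtype)
print(as_int)
```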
Beware! (NumPy types)
• A NumPy array belongs to a single type
>>> d = np.arange(5)
>>> print(d.dtype)
int32     # platform-dependent; int64 on many systems
>>> print(d)
[0 1 2 3 4]

# We try to assign a real number to an integer array,
# but the value is truncated to an integer
>>> d[1] = 9.7
>>> print(d)
[0 9 2 3 4]
Creating NumPy vectors
with functions
• arange(x) is similar to list(range(x)), but it generates
NumPy vectors, rather than Python lists
>>> x = np.arange(0, 10, 1) # arguments: start, stop, step
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> x.dtype
dtype('int32')

# Real-valued (float) vectors can also be created

>>> d = np.arange(5, dtype=float)
>>> print(d)
[0. 1. 2. 3. 4.]

# arbitrary start, stop and step

>>> np.arange(3, 7, 0.5)
array([ 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. , 6.5])
Creating NumPy vectors
with functions
• linspace is useful for generating a vector of n evenly spaced
real values within an interval (both endpoints included)
>>> np.linspace(0, 10, 25)
array([ 0. , 0.41666667, 0.83333333, 1.25 ,
1.66666667, 2.08333333, 2.5 , 2.91666667,
3.33333333, 3.75 , 4.16666667, 4.58333333,
5. , 5.41666667, 5.83333333, 6.25 ,
6.66666667, 7.08333333, 7.5 , 7.91666667,
8.33333333, 8.75 , 9.16666667, 9.58333333, 10.
])
Creating NumPy vectors
with functions
• diag: diagonal matrices
# a diagonal matrix
>>> np.diag([1,2,3])
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

# An identity matrix
>>> np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

• arrays of zeros or ones
# A vector of zeros
>>> b = np.zeros(5)
>>> print(b)
[0. 0. 0. 0. 0.]

# A matrix of ones
>>> c = np.ones((3,3))
>>> print(c)
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Creating NumPy vectors
with functions
• Generating random real numbers from a
uniform distribution in [0,1)
>>> np.random.rand(5,5)
array([[ 0.51531133, 0.74085206, 0.99570623, 0.97064334, 0.5819413 ],
[ 0.2105685 , 0.86289893, 0.13404438, 0.77967281, 0.78480563],
[ 0.62687607, 0.51112285, 0.18374991, 0.2582663 , 0.58475672],
[ 0.72768256, 0.08885194, 0.69519174, 0.16049876, 0.34557215],
[ 0.93724333, 0.17407127, 0.1237831 , 0.96840203, 0.52790012]])

• randn(a,b) returns an a×b matrix with random real
numbers from the standard normal distribution
• randint(low, high, size=(a,b)) returns uniform
random integers in the half-open interval [low, high)
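The three generators side by side, as a minimal sketch. The fixed seed is an assumption added only so that repeated runs give the same numbers:

```python
import numpy as np

np.random.seed(0)   # assumption: fixed seed, only to make the run repeatable

u = np.random.rand(2, 3)                    # uniform in [0, 1)
n = np.random.randn(2, 3)                   # standard normal
k = np.random.randint(1, 7, size=(2, 3))    # integers in [1, 7), i.e. 1..6

print(u.shape, n.shape, k.shape)
print(k.min() >= 1, k.max() <= 6)
```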
Creating NumPy matrices
from vectors
• Using reshape
In []: a = np.arange(10)
In []: a
Out[]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In []: a.shape
Out[]: (10,)

In []: a = a.reshape((2,5))
In []: a
Out[]:
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])

In []: a.shape
Out[]: (2, 5)
Using NumPy matrices:
slicing (indexing with slices)
# rows and columns are numbered from 0
>>> print(a)
[[1 2 3]
 [3 6 9]
 [2 4 6]]
# this is just like a list of lists
>>> print(a[0])
[1 2 3]
# arrays can be given comma separated indices
>>> print(a[1, 2])
9
# and slices
>>> print(a[1, 1:3])
[6 9]
>>> print(a[:,1])
[2 6 4]
Using NumPy matrices:
indexing with booleans
In [99]: a = np.array([[0, np.nan], [np.nan, 3], [4, np.nan]])
In [100]: a
Out[100]:
array([[ 0., nan],
       [nan,  3.],
       [ 4., nan]])

# This array of booleans shows where a<4 is true
In [101]: a<4
Out[101]:
array([[ True, False],
       [False,  True],
       [False, False]])

# Here, we can see what elements in the array are < 4
In [103]: a[a<4]
Out[103]: array([0., 3.])

# This array of booleans shows where a contains nan
In [104]: np.isnan(a)
Out[104]:
array([[False,  True],
       [ True, False],
       [False,  True]])

# Here we transform nan's into 0
In [106]: a[np.isnan(a)] = 0
In [107]: a
Out[107]:
array([[0., 0.],
       [0., 3.],
       [4., 0.]])
Using NumPy matrices:
modification (setting)
# We can modify a single element in the matrix
>>> a[1, 2] = 7
>>> print(a)
[[1 2 3]
[3 6 7]
[2 4 6]]

# We can also modify whole columns

>>> a[:, 0] = [0, 9, 8]
>>> print(a)
[[0 2 3]
[9 6 7]
[8 4 6]]

# And whole rows

>>> a[0, :] = [1, 1, 1]
>>> print(a)
[[1 1 1]
[9 6 7]
[8 4 6]]
Using NumPy matrices:
modification (setting)
• Important, for arrays, a = 0 is not the same
as a[:]=0 (or a[0:] = 0)
In [18]: a = np.array(np.arange(10))
In [19]: a
Out[19]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [20]: a[:] = 0
In [21]: a
Out[21]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [22]: a = 0
In [23]: a
Out[23]: 0
Using NumPy matrices:
views (references)
• b=a does not copy a's content into b. Rather, b becomes a second
reference to the same array object. This is standard Python behavior.

In [10]: import numpy as np

In [11]: a = np.array(np.arange(10))

In [12]: a
Out[12]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]: b = a
In [14]: b[0] = 1000

In [15]: a
Out[15]: array([1000, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]: b
Out[16]: array([1000, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Using NumPy matrices:
views (references)
• Beware, indexing with slices also creates a view (reference)!
In [26]: a = np.array(np.arange(10))
In [27]: a
Out[27]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [28]: # This is a view into a

In [29]: b = a[2:4]
In [30]: b
Out[30]: array([2, 3])

In [31]: # If we modify the view, we modify the original variable

In [32]: b[:] = -1
In [33]: b
Out[33]: array([-1, -1])
In [34]: a
Out[34]: array([ 0, 1, -1, -1, 4, 5, 6, 7, 8, 9])
# a is modified as well!!

# We can print owndata to distinguish views from copies

In [35]: a.flags.owndata
Out[35]: True
In [36]: b.flags.owndata
Out[36]: False
Using NumPy matrices:
views (references)
• We can use copy() to actually copy the object
In [58]: a = np.array(np.arange(10))
In [59]: a
Out[59]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [60]: b = a[2:4].copy()
In [61]: b[:] = -1

In [62]: a
Out[62]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [63]: b
Out[63]: array([-1, -1])

In [64]: a.flags.owndata
Out[64]: True
In [65]: b.flags.owndata
Out[65]: True
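Not every kind of indexing gives a view. As a hedged sketch (the names view, fancy and masked are illustrative), np.shares_memory can be used to check which results still point at the data of a:

```python
import numpy as np

a = np.arange(10)

view = a[2:4]          # basic slice: a view on a's data
fancy = a[[2, 3]]      # indexing with a list of positions: a copy
masked = a[a > 7]      # boolean indexing: also a copy

print(np.shares_memory(a, view))     # True
print(np.shares_memory(a, fancy))    # False
print(np.shares_memory(a, masked))   # False

fancy[:] = -1          # does not touch a
view[:] = -1           # writes through to a[2] and a[3]
print(a)
```

Only the assignment through the slice changes a; the other two results own their data.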
Exercise
1. Create a 3x5 matrix of normal random numbers named my_matrix. Print it.
2. Now, we are going to introduce some NA's into the matrix (in Python, NA's
are represented as numpy.nan = not a number)
   1. x = [0,2]
   2. y = [3,1]
   3. We are going to use x and y to introduce NA's into my_matrix, at
   positions (0,3) and (2,1) by doing this: my_matrix[x,y] = np.nan.
   4. Print the result.
3. Now, use boolean indexing and np.isnan() to replace all NA's by zero, and
print the result
Solution
In []: my_matrix = np.random.randn(3,5)
In []: my_matrix
array([[-1.48413505, -0.23568385, -1.22030818, -0.81259558, 1.68216758],
[-0.24242369, -2.51793289, 1.70739294, 1.30946991, -1.74124409],
[-0.17144277, -1.42001248, -0.23261268, 1.08373964, 1.41257598]])

In []: x = [0,2]
In []: y = [3,1]
In []: my_matrix[x,y] = np.nan

In []: my_matrix
array([[-1.48413505, -0.23568385, -1.22030818, nan, 1.68216758],
[-0.24242369, -2.51793289, 1.70739294, 1.30946991, -1.74124409],
[-0.17144277, nan, -0.23261268, 1.08373964, 1.41257598]])

In []: my_matrix[np.isnan(my_matrix)] = 0
In []: print(my_matrix)
[[-1.48413505 -0.23568385 -1.22030818 0. 1.68216758]
[-0.24242369 -2.51793289 1.70739294 1.30946991 -1.74124409]
[-0.17144277 0. -0.23261268 1.08373964 1.41257598]]
Universal functions
• They are functions that operate element-wise on one or more arrays
In [69]: a = np.arange(10)
In [70]: a
Out[70]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Universal function sqrt

In [71]: a = np.sqrt(a)
In [72]: a
Out[72]:
array([0. , 1. , 1.41421356, 1.73205081, 2. , 2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])

In [73]: b = np.arange(10)*8.7

In [75]: c = a + b
In [76]: c
Out[76]:
array([ 0. , 9.7 , 18.81421356, 27.83205081, 36.8 , 45.73606798, 54.64948974, 63.54575131, 72.42842712,
81.3 ])
Available universal functions
https://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs
Reduction functions
• Reduction functions reduce an array to a single number:
– sum, mean, ...

In [10]: a = np.arange(10)
In [11]: a
Out[11]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [12]: a.sum()
Out[12]: 45

In [13]: a.mean()
Out[13]: 4.5

In [14]: a = np.array([[0, 1, 2, 3],
                       [4, 5, 6, 7],
                       [8, 9, 10, 11]])
In [15]: a
Out[15]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [16]: a.sum()
Out[16]: 66
Reduction functions
• In general, reduction functions transform
arrays into arrays of smaller dimensionality
by reducing along an axis
• A 1-dimensional array has one axis, axis=0
• A 2-dimensional array (matrix) has two axes,
axis=0 (rows), axis=1 (columns)
• We could, for instance, sum the columns of a
matrix (sum along the 0-axis)
Reduction functions
# Sum along axis 0 (down the rows): one total per column
In [19]: a.sum(axis=0)
Out[19]: array([12, 15, 18, 21])
# Sum along axis 1 (across the columns): one total per row
In [20]: a.sum(axis=1)
Out[20]: array([ 6, 22, 38])

# Sum along axis 0 / rows

In [21]: np.sum(a, axis=0)
Out[21]: array([12, 15, 18, 21])
# Sum along axis 1
In [22]: np.sum(a, axis=1)
Out[22]: array([ 6, 22, 38])
Reduction functions
• Other reduction functions: max, min, mean,
...
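The same axis argument works for the other reduction functions. A minimal sketch on a 3x4 matrix (the result names are illustrative):

```python
import numpy as np

a = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns

col_max = a.max(axis=0)     # reduce over the rows: one value per column
row_min = a.min(axis=1)     # reduce over the columns: one value per row
row_mean = a.mean(axis=1)

print(col_max)    # [ 8  9 10 11]
print(row_min)    # [0 4 8]
print(row_mean)   # [1.5 5.5 9.5]
```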
Broadcasting
• Why does this work?
In []: a = np.array([0, 1, 2, 3, 4, 5])
In []: b = np.array([10])

In []: a
Out[]: array([0, 1, 2, 3, 4, 5])
In []: b
Out[]: array([10])

In [48]: c = a+b
In [49]: c
Out[49]: array([10, 11, 12, 13, 14, 15])

• Broadcasting allows operations
between arrays with different shapes
Broadcasting
• Broadcasting requires each pair of corresponding dimensions
to be equal, or one of them to be 1.
In []: a = np.array([[0, 1], [2, 3], [4, 5]])
In []: b = np.array([[10], [20], [30]])
In []: a
Out[]:
array([[0, 1],
       [2, 3],
       [4, 5]])
In []: b
Out[]:
array([[10],
       [20],
       [30]])

# a.shape = (3,2)
# b.shape = (3,1) => broadcast to (3,2)
In [61]: c = a + b
In [62]: c
Out[62]:
array([[10, 11],
       [22, 23],
       [34, 35]])
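A hedged sketch of which shapes combine and which do not. Shapes are compared from the last dimension backwards, and a missing leading dimension counts as 1; the arrays row, col and bad are illustrative:

```python
import numpy as np

a = np.zeros((3, 2))
row = np.array([10, 20])           # shape (2,), treated as (1, 2)
col = np.array([[1], [2], [3]])    # shape (3, 1)
bad = np.array([1, 2, 3])          # shape (3,), treated as (1, 3)

print((a + row).shape)     # (3, 2)
print((a + col).shape)     # (3, 2)
print((row + col).shape)   # (3, 2): both operands are stretched

try:
    a + bad                # last dimensions are 2 and 3: neither equal nor 1
    compatible = True
except ValueError as err:
    compatible = False
    print("cannot broadcast:", err)
```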
Exercise: normalization (scaling
features to a range)
1. Create a 3x5 matrix of normal random numbers named my_matrix. Print it.
2. Now, use reduction functions (max and min) to compute two vectors,
maxima and minima, with the maximum and minimum values (respectively)
of the columns of my_matrix
3. Now, compute a new matrix normalized_matrix, so that the columns of
my_matrix become normalized between zero and one.
   1. Definition of normalization: x'ij = (xij - min(x.j))/(max(x.j)-min(x.j))
4. Check that all values of normalized_matrix are >= 0, and <= 1
5. Finally, compute standarized_matrix (mean removal/variance scaling)
   1. Definition of standardization: x'ij = (xij - mean(x.j))/std(x.j)
6. Verify that the mean of all columns is zero, and the standard deviation is 1
(approximately)
Solution
In [201]: my_matrix = np.random.randn(3,5)
In [202]: maxima = my_matrix.max(axis=0)
In [203]: print(maxima)
[ 2.19405637 0.54857877 -0.77583136 -0.75875882 1.22463799]

In [204]: minima = my_matrix.min(axis=0)

In [205]: print(minima)
[ 1.03488226 -0.82966138 -1.55133288 -1.46959842 -0.76071212]

In [206]: normalized_matrix = (my_matrix - minima) / (maxima-minima)

In [207]: print(normalized_matrix)
[[0.28223942 0. 0.37006178 1. 1. ]
[1. 1. 0. 0. 0. ]
[0. 0.14074337 1. 0.64842885 0.40670394]]

In [208]: normalized_matrix >= 0

Out[208]:
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])

In [209]: normalized_matrix <= 1

Out[209]:
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])

In [210]: standarized_matrix = (my_matrix - my_matrix.mean(axis=0))/(my_matrix.std(axis=0))

In [211]: print(standarized_matrix)
[[-0.34486633 -0.86032467 -0.20983943 1.08769345 1.29343692]
[ 1.36020436 1.40221228 -1.10626795 -1.32659331 -1.14196153]
[-1.01533803 -0.54188761 1.31610738 0.23889987 -0.15147539]]

In [212]: standarized_matrix.mean(axis=0)
Out[212]:
array([ 2.22044605e-16, 3.70074342e-17, 3.70074342e-16, -3.88578059e-16,
4.62592927e-17])

In [213]: standarized_matrix.std(axis=0)
Out[213]: array([1., 1., 1., 1., 1.])
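The checks in steps 4 and 6 can also be collapsed into single booleans. A sketch, assuming a fixed seed for repeatability and using np.allclose for the approximate comparisons (the names in_range, zero_mean and unit_std are illustrative):

```python
import numpy as np

np.random.seed(1)   # assumption: fixed seed, only for repeatability
my_matrix = np.random.randn(3, 5)

minima = my_matrix.min(axis=0)
maxima = my_matrix.max(axis=0)
normalized_matrix = (my_matrix - minima) / (maxima - minima)
standarized_matrix = (my_matrix - my_matrix.mean(axis=0)) / my_matrix.std(axis=0)

# One boolean per check instead of a full matrix of booleans
in_range = ((normalized_matrix >= 0) & (normalized_matrix <= 1)).all()
zero_mean = np.allclose(standarized_matrix.mean(axis=0), 0)
unit_std = np.allclose(standarized_matrix.std(axis=0), 1)

print(in_range, zero_mean, unit_std)
```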
Loading and saving numpy
arrays to files
• Reading files: np.genfromtxt("BodyTemperature.txt",
skip_header=1)
– np.loadtxt is faster, but allows for less user control (e.g.
handling NA's)
• Writing to text files: np.savetxt(filename, data)
• For NumPy's binary .npy format (faster):
– np.save(filename, data)
– my_array = np.load(filename)
# Read the text data file, ignore the header
# The header is Gender Age HeartRate Temperature
In [313]: data = np.genfromtxt("BodyTemperature.txt",
                               skip_header=1)

# Any nan?
In [315]: np.any(np.isnan(data))
Out[315]: False

# Number of males
In [317]: np.sum(data[:,0] == 0)
Out[317]: 49

# Number of females
In [319]: np.sum(data[:,0] == 1)
Out[319]: 51

# Ignoring gender, the remaining columns are averages for:
# Age HeartRate Temperature
In [322]: data.mean(axis=0)[1:]
Out[322]: array([37.62, 73.66, 98.33])

In [323]: males = data[data[:,0] == 0]
In [324]: females = data[data[:,0] == 1]
In [325]: males_mean = males.mean(axis=0)[1:]
In [326]: males_max = males.max(axis=0)[1:]
In [327]: males_min = males.min(axis=0)[1:]
In [328]: females_mean = females.mean(axis=0)[1:]
In [329]: females_max = females.max(axis=0)[1:]
In [330]: females_min = females.min(axis=0)[1:]
In [331]: table = np.array([males_mean, males_max,
                            males_min, females_mean, females_max, females_min])

In [332]: table
Out[332]:
array([[ 37.81632653,  73.91836735,  98.19795918],
       [ 50.        ,  87.        , 101.3       ],
       [ 22.        ,  61.        ,  96.2       ],
       [ 37.43137255,  73.41176471,  98.45686275],
       [ 49.        ,  87.        , 100.8       ],
       [ 21.        ,  67.        ,  96.8       ]])

In [333]: np.savetxt("BD_results.txt", table)
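A small round trip through both file formats, as a sketch. The numbers and the file names are made up for illustration, and a temporary directory is used so that nothing is left on disk:

```python
import os
import tempfile

import numpy as np

table = np.array([[37.8, 73.9, 98.2],
                  [37.4, 73.4, 98.5]])   # made-up numbers, for illustration only

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "table.txt")   # hypothetical file names
    npy_path = os.path.join(tmp, "table.npy")

    # Text format: human-readable, one row per line, header written as a comment
    np.savetxt(txt_path, table, header="Age HeartRate Temperature")
    from_txt = np.genfromtxt(txt_path, skip_header=1)

    # Binary .npy format: keeps dtype and shape
    np.save(npy_path, table)
    from_npy = np.load(npy_path)

print(np.allclose(from_txt, table), np.array_equal(from_npy, table))
```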
Pandas
PANDAS
• Pandas is a Python library for working with
dataframes (similar to R data.frames)
import pandas as pd
• An advantage of Pandas over numpy is that all
elements in a numpy array must belong to the
same type, while Pandas allows different
columns to have different types (integers, reals,
strings, ...)
PANDAS data structures
• Pandas contains two main data structures:
– Series: a series is like a vector, but with an index
– Dataframes: similar to R dataframes (a matrix
with column names). Each column may have a
different data type: integers, real numbers, strings,
...
• A dataframe is made of:
– index
– column names
– values
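The three parts of a dataframe can be inspected directly. A minimal sketch with made-up data (the column names and values are illustrative only):

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [31, 45, 27],
    "height": [1.70, 1.82, 1.65],
})

print(df.dtypes)           # one dtype per column
print(list(df.index))      # [0, 1, 2]
print(list(df.columns))    # ['name', 'age', 'height']
print(df.values.shape)     # (3, 3)
```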
Example of series
# Using the default index 0, 1, ...
In []: s = pd.Series(np.random.randn(5))
In [224]: s
Out[224]:
0    1.037685
1    0.403077
2   -1.814123
3   -0.005181
4    1.692980
dtype: float64

# We can get the values of a series as a numpy array
In [225]: s.values
Out[225]: array([ 1.03768522, 0.40307685, -1.81412276, -0.005181 ,
1.69298038])

In [226]: s.index
Out[226]: RangeIndex(start=0, stop=5, step=1)

# Using a custom index
In [228]: s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
In [229]: s
Out[229]:
a    0.226183
b   -0.564569
c   -1.058691
d    0.970553
e   -0.857780
dtype: float64

In [230]: s.values
Out[230]: array([ 0.22618273, -0.564569 , -1.05869052, 0.97055338, -0.85777957])

In [231]: s.index
Out[231]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Note: although indices can be useful in some cases (time series, ...), this tutorial will not
focus on them
Reading files as dataframes
In [121]: import pandas as pd
# Read file in csv format into a Pandas dataframe
In [122]: flights = pd.read_csv("flights.csv")

In [124]: flights.shape
Out[124]: (336776, 19)

# Getting the names of the columns
In [130]: list(flights.columns)
Out[130]:
['year', 'month', 'day', 'dep_time',
 'sched_dep_time', 'dep_delay',
 'arr_time', 'sched_arr_time', 'arr_delay',
 'carrier', 'flight', 'tailnum', 'origin',
 'dest', 'air_time', 'distance', 'hour', 'minute', 'time_hour']

# head: print the first rows
[Figure: the first rows of the dataframe, annotated with "Column names", "Values" and "Index (by default 0, 1, 2, ...)"]

In []: flights.index
Out[]: RangeIndex(start=0, stop=336776, step=1)
Describing the dataframe
Setting the index
• By default: 0, 1, 2, ...
In [12]: flights.index
Out[12]: RangeIndex(start=0, stop=336776, step=1)

• In most cases, this is what you need

• We can set one of the columns as the index
(this returns a new dataframe; flights itself is not modified):
flights.set_index("month")
Setting the index
• New index. We could also use dates ...

Note: although indices can be useful in some cases (time series, ...), this tutorial will not
focus on them
Extracting the values from a Pandas dataframe or
series to a numpy matrix / array

In [39]: flights.values
Out[39]:
array([[2013, 1, 1, ..., 5, 15, '2013-01-01 05:00:00'],
[2013, 1, 1, ..., 5, 29, '2013-01-01 05:00:00'],
[2013, 1, 1, ..., 5, 40, '2013-01-01 05:00:00'],
...,
[2013, 9, 30, ..., 12, 10, '2013-09-30 12:00:00'],
[2013, 9, 30, ..., 11, 59, '2013-09-30 11:00:00'],
[2013, 9, 30, ..., 8, 40, '2013-09-30 08:00:00']], dtype=object)
Selecting rows and columns
(indexing)
• Label selection: both rows and columns can have labels:
loc
– the labels of the rows are the indices (index)
– the labels of the columns are the column names
• Position (integer) selection: iloc
– Rows: e.g. select rows from 0 to 10
– Columns: e.g. select columns from 3 to 7
• Boolean selection: selecting rows that satisfy a condition
– E.g.: Select all rows where age > 35
Label selection: rows
.loc

In [25]: flights.loc[2:4]
Out[25]:
year month day ... hour minute time_hour
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00
Label selection: columns
.loc
[Figure: two .loc examples, one selecting a list of columns and one selecting a range of columns]
Labels: rows and columns
.loc
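A hedged sketch of label selection with .loc on a made-up miniature of the flights table (only four of its columns, with invented rows):

```python
import pandas as pd

# Made-up miniature of the flights table, for illustration only
small = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [1, 1, 2],
    "day": [1, 2, 3],
    "carrier": ["UA", "AA", "B6"],
})

by_list = small.loc[:, ["year", "carrier"]]    # list of column labels
by_range = small.loc[:, "month":"carrier"]     # label range, both ends included
both = small.loc[0:1, ["month", "day"]]        # row labels 0..1 and two columns

print(list(by_list.columns))    # ['year', 'carrier']
print(list(by_range.columns))   # ['month', 'day', 'carrier']
print(both.shape)               # (2, 2)
```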
Beware! series vs. dataframe
# This returns a dataframe
In []: flights.loc[:,['month']]
Out[]:
   month
0      1
1      1
2      1
3      1
4      1
5      1
6      1
7      1
8      1
9      1

In []: type(flights.loc[:,['month']])
Out[]: pandas.core.frame.DataFrame

# This returns a series!
In [35]: flights.loc[:,'month']
Out[35]:
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1

In []: type(flights.loc[:,'month'])
Out[]: pandas.core.series.Series
Selecting single columns (series)
# This returns a series!
In [35]: flights.loc[:,'month']
Out[35]:
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1

# The same, with dot notation
In [42]: flights.month
Out[42]:
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     1
10    1

# Note: we can get the values as a numpy array
In [43]: flights.month.values
Out[43]: array([1, 1, 1, ..., 9, 9, 9], dtype=int64)
Shorthand for column selection
• flights.loc[:,'month'] is equivalent to
flights['month']
• flights.loc[:,['year', 'month']] is equivalent to
flights[['year', 'month']]
Position (integer) selection:
iloc

In [41]: flights.iloc[2:4, 1:3]

Out[41]:
   month  day
2      1    1
3      1    1
Combining iloc for rows and
loc for columns
• What if we want to select rows by position
but columns by name?
# Just one column
In [51]: flights.iloc[2:4, flights.columns.get_loc('month')]
Out[51]:
2    1
3    1
Name: month, dtype: int64

# Several columns
In [53]: flights.iloc[2:4, flights.columns.get_indexer(['month','day'])]
Out[53]:
   month  day
2      1    1
3      1    1
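One difference worth checking when mixing the two selectors: a loc slice includes its end label, while an iloc slice excludes its end position. A minimal sketch on a made-up frame with the default index:

```python
import pandas as pd

# Made-up frame with the default 0, 1, 2, ... index
df = pd.DataFrame({"month": [1, 1, 1, 1, 1], "day": [1, 2, 3, 4, 5]})

by_label = df.loc[2:4]        # labels 2, 3 and 4: the end label is included
by_position = df.iloc[2:4]    # positions 2 and 3: the end position is excluded

print(len(by_label), len(by_position))     # 3 2

# Positions for rows, names for columns
cols = df.columns.get_indexer(["day"])
print(df.iloc[2:4, cols]["day"].tolist())  # [3, 4]
```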
Boolean indexing:
selecting rows on condition
• Both loc and iloc can be used, but loc is recommended
• List of flights for January the first?

In []: flights.loc[(flights.month == 1) & (flights.day == 1)]

• Note: we can also write (flights.loc[:,"month"] == 1)

Boolean indexing:
selecting rows on condition
• Same thing in two lines (clearer code)
• List of flights for January the first?
In []: condition = (flights.month == 1) & (flights.day == 1)
In []: flights.loc[condition]
Out[]:
year month day ... hour minute time_hour
0 2013 1 1 ... 5 15 2013-01-01 05:00:00
1 2013 1 1 ... 5 29 2013-01-01 05:00:00
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00
Boolean indexing:
selecting rows on condition
• List of flights for January the first?
– A shorter version

In []: flights.query("month == 1 & day == 1")

Out[]:
year month day ... hour minute time_hour
0 2013 1 1 ... 5 15 2013-01-01 05:00:00
1 2013 1 1 ... 5 29 2013-01-01 05:00:00
2 2013 1 1 ... 5 40 2013-01-01 05:00:00
3 2013 1 1 ... 5 45 2013-01-01 05:00:00
4 2013 1 1 ... 6 0 2013-01-01 06:00:00
Boolean conditions
• In order to create conditions, we can use:
• <, >, ==, <=, >=, !=
• &: and
• |: or
• ~: not
• isin: is value in a list of values?
• isnull: is value nan?
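The operators listed above can be combined, as in this sketch on a made-up miniature of the flights table. Each comparison needs its own parentheses, because &, | and ~ bind more tightly than ==, > and the other comparisons:

```python
import numpy as np
import pandas as pd

# Made-up miniature of the flights table, for illustration only
df = pd.DataFrame({
    "origin": ["EWR", "JFK", "LGA", "JFK"],
    "month": [1, 1, 2, 3],
    "arr_delay": [11.0, np.nan, -5.0, 40.0],
})

late_or_missing = df.loc[(df.arr_delay > 30) | df.arr_delay.isnull()]
not_january = df.loc[~(df.month == 1)]
ewr_or_jfk = df.loc[df.origin.isin(["EWR", "JFK"]) & (df.month != 3)]

print(list(late_or_missing.index))   # [1, 3]
print(list(not_january.index))       # [2, 3]
print(list(ewr_or_jfk.index))        # [0, 1]
```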
Boolean selection:
selecting rows on condition
• What flights start at EWR or JFK airports?
In [61]: flights.loc[flights.origin.isin(['EWR', 'JFK']), ['origin', 'dest']]
Out[61]:
origin dest
0 EWR IAH
2 JFK MIA
3 JFK BQN
5 EWR ORD
6 EWR FLL
8 JFK MCO
10 JFK PBI

• Note: we are also selecting origin and dest columns

Creating new columns
• Compute speed for every flight
– speed = distance / air_time

• Two ways:
– First: flights.loc[:,'speed'] =
• flights.loc[:,'speed'] = flights.distance / flights.air_time
• flights.loc[:,'speed'] = flights.loc[:,'distance'] / flights.loc[:,'air_time']
• flights.loc[:,'speed'] = flights['distance'] / flights['air_time']
– Shorthand: flights['speed'] =
• flights['speed'] = flights['distance'] / flights['air_time']
• Beware: the speed values printed in the outputs that follow (1173.0, 1189.0, ...)
were obtained with - instead of /, so they are differences rather than ratios
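The column arithmetic is element-wise, as with NumPy arrays. A sketch with made-up rows; the units (miles and minutes) and the extra speed_mph column are assumptions for illustration:

```python
import pandas as pd

# Made-up rows, for illustration only; miles and minutes are assumed units
df = pd.DataFrame({"distance": [1400, 1089, 762],
                   "air_time": [200.0, 150.0, 127.0]})

df["speed"] = df["distance"] / df["air_time"]   # one ratio per row
df["speed_mph"] = df["speed"] * 60              # hypothetical extra column

print(df)
```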
Creating new columns
Modifying subsets of the
dataframe (setting)
• Let's put nan on the first four rows and
columns 'year', 'month' and 'day'
# Let's create a copy first
In [81]: flights_copy = flights.copy()
# Let's see the content of the first four rows and the
# first three columns
In [83]: flights_copy.iloc[0:4,
    flights.columns.get_indexer(['year', 'month', 'day'])]
Out[83]:
   year  month  day
0  2013      1    1
1  2013      1    1
2  2013      1    1
3  2013      1    1

# Now, we do the assignment
In [85]: flights_copy.iloc[0:4,
    flights.columns.get_indexer(['year', 'month', 'day'])] = np.nan
In [86]: flights_copy
Out[86]:
     year  month  day  ...  minute            time_hour   speed
0     NaN    NaN  NaN  ...      15  2013-01-01 05:00:00  1173.0
1     NaN    NaN  NaN  ...      29  2013-01-01 05:00:00  1189.0
2     NaN    NaN  NaN  ...      40  2013-01-01 05:00:00   929.0
3     NaN    NaN  NaN  ...      45  2013-01-01 05:00:00  1393.0
4  2013.0    1.0  1.0  ...       0  2013-01-01 06:00:00   646.0
5  2013.0    1.0  1.0  ...      58  2013-01-01 05:00:00   569.0
6  2013.0    1.0  1.0  ...       0  2013-01-01 06:00:00   907.0
7  2013.0    1.0  1.0  ...       0  2013-01-01 06:00:00   176.0
Subset (modification) with
boolean selection
• Let's create a column 'satisfaction' with
'good' if arrival delay <= 75, and 'bad'
otherwise
In [107]: flights['satisfaction'] = 'bad'

In [108]: flights.loc[flights.arr_delay <= 75, 'satisfaction'] = 'good'

In [109]: flights.loc[:, ['arr_delay', 'satisfaction']].head()

Out[109]:
arr_delay satisfaction
0 11.0 good
1 20.0 good
2 33.0 good
3 -18.0 good
4 -25.0 good
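The same labelling on a made-up column of delays, next to a one-step alternative with np.where. The np.where variant is not from the slides; it is shown only as a possible equivalent. Note that a missing delay (nan) fails the <= 75 test and therefore stays 'bad' in both versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"arr_delay": [11.0, 120.0, np.nan, -18.0]})  # made-up delays

# Two-step version: default value first, then overwrite where the condition holds
df["satisfaction"] = "bad"
df.loc[df.arr_delay <= 75, "satisfaction"] = "good"

# One-step alternative with np.where (not from the slides)
df["satisfaction_where"] = np.where(df.arr_delay <= 75, "good", "bad")

print(df)
```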
