Combinepdf Removed

This document serves as an introduction to Python, covering its general purpose, data types, and basic functionalities such as variables, lists, and functions. It explains how to execute commands in the IPython shell, create scripts, and manipulate data using lists and functions like max() and round(). The content is aimed at beginners in data science, providing foundational knowledge for using Python effectively.


Hello Python!

INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
How you will learn

INTRODUCTION TO PYTHON
Python

General purpose: build anything

Open source! Free!

Python packages, also for data science


Many applications and fields

INTRODUCTION TO PYTHON
IPython Shell
Execute Python commands

INTRODUCTION TO PYTHON
Python Script
Text files - .py
List of Python commands

Similar to typing in IPython Shell

Use print() to generate output from script

INTRODUCTION TO PYTHON
DataCamp Interface

INTRODUCTION TO PYTHON
Variables and Types
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Variable
Specific, case-sensitive name
Call up value through variable name

1.79 m - 68.7 kg

height = 1.79
weight = 68.7
height

1.79

INTRODUCTION TO PYTHON
Calculate BMI
height = 1.79
weight = 68.7
height

1.79

BMI = weight / height²

68.7 / 1.79 ** 2

21.4413

weight / height ** 2

21.4413

bmi = weight / height ** 2
bmi

21.4413

INTRODUCTION TO PYTHON
Reproducibility
height = 1.79
weight = 68.7
bmi = weight / height ** 2
print(bmi)

21.4413

INTRODUCTION TO PYTHON
Reproducibility
height = 1.79
weight = 74.2 # <-
bmi = weight / height ** 2
print(bmi)

23.1578
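
The two Reproducibility slides rerun the same script with one changed input. A minimal sketch of that idea, wrapping the formula in a function so each rerun is a single call; the name `compute_bmi` is an assumption for illustration and does not appear in the slides.

```python
def compute_bmi(weight, height):
    """Return the body mass index: weight (kg) divided by height (m) squared."""
    return weight / height ** 2

# Same inputs as the two script runs in the slides
bmi_first = compute_bmi(68.7, 1.79)
bmi_second = compute_bmi(74.2, 1.79)

print(round(bmi_first, 4))
print(round(bmi_second, 4))
```

Only the weight argument changes between the two calls, which mirrors editing a single line of the script.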

INTRODUCTION TO PYTHON
Python Types
type(bmi)

float

day_of_week = 5
type(day_of_week)

int

INTRODUCTION TO PYTHON
Python Types (2)
x = "body mass index"
y = 'this works too'
type(y)

str

z = True
type(z)

bool

INTRODUCTION TO PYTHON
Python Types (3)
2 + 3

5

'ab' + 'cd'

'abcd'

Different type = different behavior!
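
A short sketch of how the same `+` operator behaves for the types introduced so far. The variable names are illustrative only; the last part shows that mixing a string and an integer raises an error rather than guessing a result.

```python
number_sum = 2 + 3        # int + int: arithmetic addition
text_sum = 'ab' + 'cd'    # str + str: concatenation
bool_sum = True + False   # bool behaves like the integers 1 and 0

print(number_sum, type(number_sum))
print(text_sum, type(text_sum))
print(bool_sum, type(bool_sum))

# Mixing incompatible types raises a TypeError
try:
    'ab' + 2
except TypeError as err:
    mixed_error = str(err)
    print("TypeError:", mixed_error)
```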

INTRODUCTION TO PYTHON
Python Lists
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Python Data Types
float - real numbers
int - integer numbers

str - string, text

bool - True, False

height = 1.73
tall = True

Each variable represents single value

INTRODUCTION TO PYTHON
Problem
Data Science: many data points

Height of entire family

height1 = 1.73
height2 = 1.68
height3 = 1.71
height4 = 1.89

Inconvenient

INTRODUCTION TO PYTHON
Python List
[a, b, c]

[1.73, 1.68, 1.71, 1.89]

[1.73, 1.68, 1.71, 1.89]

fam = [1.73, 1.68, 1.71, 1.89]


fam

[1.73, 1.68, 1.71, 1.89]

Name a collection of values

Contain any type

Contain different types

INTRODUCTION TO PYTHON
Python List
[a, b, c]

fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]


fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam2 = [["liz", 1.73],
        ["emma", 1.68],
        ["mom", 1.71],
        ["dad", 1.89]]
fam2

[['liz', 1.73], ['emma', 1.68], ['mom', 1.71], ['dad', 1.89]]

INTRODUCTION TO PYTHON
List type
type(fam)

list

type(fam2)

list

Specific functionality

Specific behavior

INTRODUCTION TO PYTHON
Subsetting Lists
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Subsetting lists
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam[3]

1.68

INTRODUCTION TO PYTHON
Subsetting lists
['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam[6]

'dad'

fam[-1]

1.89

fam[7]

1.89

INTRODUCTION TO PYTHON
List slicing
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam[3:5]

[1.68, 'mom']

fam[1:4]

[1.73, 'emma', 1.68]

INTRODUCTION TO PYTHON
List slicing
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam[:4]

['liz', 1.73, 'emma', 1.68]

fam[5:]

[1.71, 'dad', 1.89]
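
The slicing rule above is that `fam[start:end]` includes index `start` and stops just before index `end`. A small sketch of that rule follows; the step form `fam[start::step]` is not covered in the slides and is shown here only as an assumption about what a reader may want next.

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

# start is included, end is excluded, so the slice has end - start elements
middle = fam[3:5]
print(middle, len(middle))

# Leaving out start or end means "from the beginning" or "to the end"
first_four = fam[:4]
last_three = fam[5:]

# Optional third number is a step: every second element
names = fam[0::2]
heights = fam[1::2]
print(names)
print(heights)
```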

INTRODUCTION TO PYTHON
Manipulating Lists
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
List Manipulation
Change list elements
Add list elements

Remove list elements

INTRODUCTION TO PYTHON
Changing list elements
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam[7] = 1.86
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86]

fam[0:2] = ["lisa", 1.74]


fam

['lisa', 1.74, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86]

INTRODUCTION TO PYTHON
Adding and removing elements
fam + ["me", 1.79]

['lisa', 1.74,'emma', 1.68, 'mom', 1.71, 'dad', 1.86, 'me', 1.79]

fam_ext = fam + ["me", 1.79]


del fam[2]
fam

['lisa', 1.74, 1.68, 'mom', 1.71, 'dad', 1.86]
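
One detail worth making explicit: `+` builds a new list and leaves `fam` untouched, while `del` changes `fam` in place and shifts later elements to the left. A minimal sketch using the values from the slide:

```python
fam = ["lisa", 1.74, "emma", 1.68, "mom", 1.71, "dad", 1.86]

# + returns a new list; fam keeps its 8 elements
fam_ext = fam + ["me", 1.79]
length_before = len(fam)

# del removes the element at index 2 in place
del fam[2]
length_after = len(fam)

print(length_before, length_after, len(fam_ext))
print(fam[2])  # the element that used to be at index 3
```

Because the indexes shift after each `del`, removing both a name and its height means deleting the same index twice.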

INTRODUCTION TO PYTHON
Behind the scenes (1)
x = ["a", "b", "c"]
y = x
y[1] = "z"
y

['a', 'z', 'c']

x

['a', 'z', 'c']

INTRODUCTION TO PYTHON
Behind the scenes (2)
x = ["a", "b", "c"]
y = list(x) # create a copy of the list
y = x[:]    # alternative: copy with a full slice
y[1] = "z"
x

['a', 'b', 'c']
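
The difference between the two "Behind the scenes" cases can be checked directly with the `is` operator, which tests whether two names refer to the same object. A minimal sketch; the names `alias` and `x_copy` are illustrative.

```python
x = ["a", "b", "c"]
alias = x          # a second name for the same list
x_copy = list(x)   # a new list holding the same elements

alias[1] = "z"     # changes the one list that x and alias share

alias_is_x = alias is x
copy_is_x = x_copy is x

print(x)        # changed through alias
print(x_copy)   # unaffected
print(alias_is_x, copy_is_x)
```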

INTRODUCTION TO PYTHON
Functions
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Functions
Nothing new!
type()

Piece of reusable code

Solves particular task

Call function instead of writing code yourself

INTRODUCTION TO PYTHON
Example
fam = [1.73, 1.68, 1.71, 1.89]
fam

[1.73, 1.68, 1.71, 1.89]

max(fam)

1.89

tallest = max(fam)
tallest

1.89

INTRODUCTION TO PYTHON
round()
round(1.68, 1)

1.7

round(1.68)

2

help(round) # Open up documentation

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.

    The return value is an integer if ndigits is omitted or None.
    Otherwise the return value has the same type as the number. ndigits may be negative.

round(number)

round(number, ndigits)
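
The two call forms can be tried side by side. This sketch follows the help text quoted above: omitting `ndigits` returns an integer, and a negative `ndigits` rounds to the left of the decimal point. The value `1234.5` is an arbitrary example.

```python
rounded_one_digit = round(1.68, 1)    # two arguments: keep one decimal
rounded_default = round(1.68)         # one argument: result is an int
rounded_hundreds = round(1234.5, -2)  # negative ndigits: round to hundreds

print(rounded_one_digit)
print(rounded_default, type(rounded_default))
print(rounded_hundreds, type(rounded_hundreds))
```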

INTRODUCTION TO PYTHON
Find functions
How to know?
Standard task -> probably function exists!

The internet is your friend

INTRODUCTION TO PYTHON
Methods
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Built-in Functions
Maximum of list: max()
Length of list or string: len()

Get index in list: ?

Reversing a list: ?

INTRODUCTION TO PYTHON
Back 2 Basics

sister = "liz"

height = 1.73

fam = ["liz", 1.73, "emma", 1.68,
       "mom", 1.71, "dad", 1.89]

Methods: Functions that belong to objects

INTRODUCTION TO PYTHON
list methods
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam.index("mom") # "Call method index() on fam"

4

fam.count(1.73)

1

INTRODUCTION TO PYTHON
str methods
sister

'liz'

sister.capitalize()

'Liz'

sister.replace("z", "sa")

'lisa'

INTRODUCTION TO PYTHON
Methods
Everything = object
Objects have methods associated, depending on type

sister.replace("z", "sa")

'lisa'

fam.replace("mom", "mommy")

AttributeError: 'list' object has no attribute 'replace'

INTRODUCTION TO PYTHON
Methods
sister.index("z")

2

fam.index("mom")

4

INTRODUCTION TO PYTHON
Methods (2)
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]

fam.append("me")
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'me']

fam.append(1.79)
fam

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'me', 1.79]
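
A compact sketch pulling the method slides together. It also shows two details the slides imply but do not state: string methods return a new string and leave the original alone, while `append()` changes the list in place and returns `None`. `hasattr` is used here only to check for a method without triggering the `AttributeError`.

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
sister = "liz"

mom_index = fam.index("mom")        # position of the first match
count_173 = fam.count(1.73)         # number of occurrences

capitalized = sister.capitalize()   # returns a new string
replaced = sister.replace("z", "sa")

list_has_replace = hasattr(fam, "replace")  # lists have no replace()

append_result = fam.append("me")    # modifies fam, returns None

print(mom_index, count_173)
print(sister, capitalized, replaced)
print(list_has_replace, append_result, fam[-1])
```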

INTRODUCTION TO PYTHON
Summary
Functions

type(fam)

list

Methods: call functions on objects

fam.index("dad")

6

INTRODUCTION TO PYTHON
Packages
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Motivation
Functions and methods are powerful

All code in Python distribution?


Huge code base: messy

Lots of code you won’t use

Maintenance problem

INTRODUCTION TO PYTHON
Packages
Directory of Python Scripts

Each script = module

Specify functions, methods,


types

Thousands of packages
available
NumPy

Matplotlib

scikit-learn

INTRODUCTION TO PYTHON
Install package
https://pip.pypa.io/en/stable/installation/
Download get-pip.py

Terminal:
python3 get-pip.py

pip3 install numpy

INTRODUCTION TO PYTHON
Import package
import numpy
array([1, 2, 3])

NameError: name 'array' is not defined

numpy.array([1, 2, 3])

array([1, 2, 3])

import numpy as np
np.array([1, 2, 3])

array([1, 2, 3])

from numpy import array
array([1, 2, 3])

array([1, 2, 3])

INTRODUCTION TO PYTHON
from numpy import array
my_script.py

from numpy import array

fam = ["liz", 1.73, "emma", 1.68,
       "mom", 1.71, "dad", 1.89]

...
fam_ext = fam + ["me", 1.79]

...
print(str(len(fam_ext)) + " elements in fam_ext")

...
np_fam = array(fam_ext)

Using NumPy, but not very clear

INTRODUCTION TO PYTHON
import numpy
import numpy as np

fam = ["liz", 1.73, "emma", 1.68,
       "mom", 1.71, "dad", 1.89]

...
fam_ext = fam + ["me", 1.79]

...
print(str(len(fam_ext)) + " elements in fam_ext")

...
np_fam = np.array(fam_ext) # Clearly using NumPy

INTRODUCTION TO PYTHON
NumPy
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Lists Recap
Powerful
Collection of values

Hold different types

Change, add, remove

Need for Data Science


Mathematical operations over collections

Speed

INTRODUCTION TO PYTHON
Illustration
height = [1.73, 1.68, 1.71, 1.89, 1.79]
height

[1.73, 1.68, 1.71, 1.89, 1.79]

weight = [65.4, 59.2, 63.6, 88.4, 68.7]


weight

[65.4, 59.2, 63.6, 88.4, 68.7]

weight / height ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
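
The error occurs because plain lists do not define element-wise arithmetic. A minimal pure-Python sketch of the calculation the expression was meant to do, pairing the two lists with `zip` and a list comprehension. This is only a stand-in to show what "mathematical operations over collections" means; it is not how NumPy is implemented.

```python
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Pair each weight with its height and apply the BMI formula per person
bmi = [w / h ** 2 for w, h in zip(weight, height)]

bmi_rounded = [round(b, 2) for b in bmi]
print(bmi_rounded)
```

This works, but the loop has to be written out for every calculation, which is the inconvenience the NumPy array removes.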

INTRODUCTION TO PYTHON
Solution: NumPy
Numeric Python
Alternative to Python List: NumPy Array

Calculations over entire arrays

Easy and Fast

Installation
In the terminal: pip3 install numpy

INTRODUCTION TO PYTHON
NumPy
import numpy as np
np_height = np.array(height)
np_height

array([1.73, 1.68, 1.71, 1.89, 1.79])

np_weight = np.array(weight)
np_weight

array([65.4, 59.2, 63.6, 88.4, 68.7])

bmi = np_weight / np_height ** 2


bmi

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])

INTRODUCTION TO PYTHON
Comparison
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
weight / height ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

np_height = np.array(height)
np_weight = np.array(weight)
np_weight / np_height ** 2

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])

INTRODUCTION TO PYTHON
NumPy: remarks
np.array([1.0, "is", True])

array(['1.0', 'is', 'True'], dtype='<U32')

NumPy arrays: contain only one type

INTRODUCTION TO PYTHON
NumPy: remarks
python_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])

python_list + python_list

[1, 2, 3, 1, 2, 3]

numpy_array + numpy_array

array([2, 4, 6])

Different types: different behavior!

INTRODUCTION TO PYTHON
NumPy Subsetting
bmi

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])

bmi[1]

20.975

bmi > 23

array([False, False, False, True, False])

bmi[bmi > 23]

array([24.7473475])

INTRODUCTION TO PYTHON
2D NumPy Arrays
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Type of NumPy Arrays
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

type(np_height)

numpy.ndarray

type(np_weight)

numpy.ndarray

INTRODUCTION TO PYTHON
2D NumPy Arrays
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])
np_2d

array([[ 1.73,  1.68,  1.71,  1.89,  1.79],
       [65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])

np_2d.shape

(2, 5) # 2 rows, 5 columns

np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
          [65.4, 59.2, 63.6, 88.4, "68.7"]])

array([['1.73', '1.68', '1.71', '1.89', '1.79'],
       ['65.4', '59.2', '63.6', '88.4', '68.7']], dtype='<U32')

INTRODUCTION TO PYTHON
Subsetting
          0      1      2      3      4

array([[ 1.73,  1.68,  1.71,  1.89,  1.79],    0
       [65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])   1

np_2d[0]

array([1.73, 1.68, 1.71, 1.89, 1.79])

np_2d[0][2]

1.71

np_2d[0, 2]

1.71

np_2d[:, 1:3]

array([[ 1.68,  1.71],
       [59.2 , 63.6 ]])

np_2d[1, :]

array([65.4, 59.2, 63.6, 88.4, 68.7])

INTRODUCTION TO PYTHON
NumPy: Basic Statistics
INTRODUCTION TO PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Data analysis
Get to know your data
Little data -> simply look at it

Big data -> ?

INTRODUCTION TO PYTHON
City-wide survey
import numpy as np
np_city = ... # Implementation left out
np_city

array([[1.64, 71.78],
[1.37, 63.35],
[1.6 , 55.09],
...,
[2.04, 74.85],
[2.04, 68.72],
[2.01, 73.57]])

INTRODUCTION TO PYTHON
NumPy
np.mean(np_city[:, 0])

1.7472

np.median(np_city[:, 0])

1.75

INTRODUCTION TO PYTHON
NumPy
np.corrcoef(np_city[:, 0], np_city[:, 1])

array([[ 1. , -0.01802],
[-0.01803, 1. ]])

np.std(np_city[:, 0])

0.1992

sum(), sort(), ...

Enforce single data type: speed!
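
To see what these summary functions compute, here is a sketch on the five heights used earlier, with Python's standard `statistics` module swapped in for NumPy so that it runs without any installation. Choosing `pstdev` (population standard deviation) is deliberate: it corresponds to the default behaviour of `np.std`, which divides by the number of values.

```python
import statistics

height = [1.73, 1.68, 1.71, 1.89, 1.79]

mean_height = statistics.mean(height)
median_height = statistics.median(height)
# Population standard deviation (divide by n), like np.std's default
std_height = statistics.pstdev(height)

print(round(mean_height, 2))
print(median_height)
print(round(std_height, 4))
```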

INTRODUCTION TO PYTHON
Generate data
Arguments for np.random.normal()
distribution mean

distribution standard deviation

number of samples

height = np.round(np.random.normal(1.75, 0.20, 5000), 2)

weight = np.round(np.random.normal(60.32, 15, 5000), 2)

np_city = np.column_stack((height, weight))

INTRODUCTION TO PYTHON
Basic plots with Matplotlib
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Basic plots with Matplotlib
Visualization

Data Structure

Control Structures

Case Study

INTERMEDIATE PYTHON
Data visualization
Very important in Data Analysis
Explore data

Report insights

INTERMEDIATE PYTHON
1 Source: GapMinder, Wealth and Health of Nations

INTERMEDIATE PYTHON
Matplotlib
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop)
plt.show()

INTERMEDIATE PYTHON
Scatter plot
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.scatter(year, pop)
plt.show()

INTERMEDIATE PYTHON
Histogram
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Histogram
Explore dataset
Get idea about distribution

INTERMEDIATE PYTHON
Matplotlib
import matplotlib.pyplot as plt

help(plt.hist)

Help on function hist in module matplotlib.pyplot:

hist(x, bins=None, range=None, density=False, weights=None,
     cumulative=False, bottom=None, histtype='bar', align='mid',
     orientation='vertical', rwidth=None, log=False, color=None,
     label=None, stacked=False, *, data=None, **kwargs)
    Plot a histogram.

    Compute and draw the histogram of *x*. The return value is a
    tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...],
    *bins*, [*patches0*, *patches1*, ...]) if the input contains
    multiple data.

INTERMEDIATE PYTHON
Matplotlib example
values = [0,0.6,1.4,1.6,2.2,2.5,2.6,3.2,3.5,3.9,4.2,6]
plt.hist(values, bins=3)
plt.show()
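
Since the resulting figure is not reproduced here, the following pure-Python sketch counts how many values fall into each of the three bins. It assumes equal-width bins spanning the minimum to the maximum of the data, with the largest value placed in the last bin; it is a hand-rolled illustration, not Matplotlib's own code.

```python
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]
bins = 3

low, high = min(values), max(values)
width = (high - low) / bins   # each bin covers 2 units for this data

counts = [0] * bins
for v in values:
    # Clamp the index so the maximum value lands in the last bin
    index = min(int((v - low) / width), bins - 1)
    counts[index] += 1

print(counts)
```

The counts are the bar heights of the histogram: four values in the first bin, six in the second, and two in the third.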

INTERMEDIATE PYTHON
Population pyramid

INTERMEDIATE PYTHON
Customization
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Data visualization
Many options
Different plot types

Many customizations

Choice depends on
Data

Story you want to tell

INTERMEDIATE PYTHON
Basic plot
population.py

import matplotlib.pyplot as plt


year = [1950, 1951, 1952, ..., 2100]
pop = [2.538, 2.57, 2.62, ..., 10.85]

plt.plot(year, pop)

plt.show()

INTERMEDIATE PYTHON
Axis labels
population.py

import matplotlib.pyplot as plt


year = [1950, 1951, 1952, ..., 2100]
pop = [2.538, 2.57, 2.62, ..., 10.85]

plt.plot(year, pop)

plt.xlabel('Year')
plt.ylabel('Population')

plt.show()

INTERMEDIATE PYTHON
Title
population.py

import matplotlib.pyplot as plt


year = [1950, 1951, 1952, ..., 2100]
pop = [2.538, 2.57, 2.62, ..., 10.85]

plt.plot(year, pop)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')

plt.show()

INTERMEDIATE PYTHON
Ticks
population.py

import matplotlib.pyplot as plt


year = [1950, 1951, 1952, ..., 2100]
pop = [2.538, 2.57, 2.62, ..., 10.85]

plt.plot(year, pop)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10])

plt.show()

INTERMEDIATE PYTHON
Ticks (2)
population.py

import matplotlib.pyplot as plt


year = [1950, 1951, 1952, ..., 2100]
pop = [2.538, 2.57, 2.62, ..., 10.85]

plt.plot(year, pop)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10],
['0', '2B', '4B', '6B', '8B', '10B'])

plt.show()

INTERMEDIATE PYTHON
Add historical data
population.py

import matplotlib.pyplot as plt


year = [1950, 1951, 1952, ..., 2100]
pop = [2.538, 2.57, 2.62, ..., 10.85]

# Add more data


year = [1800, 1850, 1900] + year
pop = [1.0, 1.262, 1.650] + pop

plt.plot(year, pop)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10],
['0', '2B', '4B', '6B', '8B', '10B'])

plt.show()

INTERMEDIATE PYTHON
Before vs. after

INTERMEDIATE PYTHON
Dictionaries, Part 1
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
List
pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]
ind_alb = countries.index("albania")
ind_alb

1

pop[ind_alb]

2.77

Not convenient

Not intuitive

INTERMEDIATE PYTHON
Dictionary
pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]

...

world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}


world["albania"]

2.77
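
When the data already exists as two parallel lists, the dictionary does not have to be typed out by hand. A minimal sketch using the built-in `dict` and `zip`, which pair each country with the population at the same position; this shortcut is an addition and is not shown in the slides.

```python
pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]

# zip pairs the lists element by element; dict turns the pairs into key:value
world = dict(zip(countries, pop))

albania_pop = world["albania"]
print(world)
print(albania_pop)
```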

INTERMEDIATE PYTHON
Dictionaries, Part 2
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Recap
world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}
world["albania"]

2.77

world = {"afghanistan":30.55, "albania":2.77,
         "algeria":39.21, "albania":2.81}
world

{'afghanistan': 30.55, 'albania': 2.81, 'algeria': 39.21}

INTERMEDIATE PYTHON
Recap
Keys have to be "immutable" objects

{0:"hello", True:"dear", "two":"world"}

{0: 'hello', True: 'dear', 'two': 'world'}

{["just", "to", "test"]: "value"}

TypeError: unhashable type: 'list'
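
A list cannot be a key because it can change after it is created. A tuple holding the same items is immutable and works as a key. The sketch below shows both cases; the exact wording of the error message can differ between Python versions, so it is only printed, not assumed.

```python
# A tuple of strings is immutable, so it is accepted as a key
valid = {("just", "to", "test"): "value"}
looked_up = valid[("just", "to", "test")]
print(looked_up)

# The same items in a list are rejected
try:
    invalid = {["just", "to", "test"]: "value"}
except TypeError as err:
    message = str(err)
    print("TypeError:", message)
```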

INTERMEDIATE PYTHON
Principality of Sealand

1 Source: Wikipedia

INTERMEDIATE PYTHON
Dictionary
world["sealand"] = 0.000027
world

{'afghanistan': 30.55, 'albania': 2.81,
 'algeria': 39.21, 'sealand': 2.7e-05}

"sealand" in world

True

INTERMEDIATE PYTHON
Dictionary
world["sealand"] = 0.000028
world

{'afghanistan': 30.55, 'albania': 2.81,
 'algeria': 39.21, 'sealand': 2.8e-05}

del(world["sealand"])
world

{'afghanistan': 30.55, 'albania': 2.81, 'algeria': 39.21}
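
The add, update, and delete steps above all use the same square-bracket syntax; whether an assignment adds or overwrites depends only on whether the key already exists. A short sketch of the full sequence, with `world.get()` added as a way to look up a key that may be missing without raising a `KeyError` (that method is not covered in the slides).

```python
world = {"afghanistan": 30.55, "albania": 2.81, "algeria": 39.21}

world["sealand"] = 0.000027    # key is new: the pair is added
world["sealand"] = 0.000028    # key exists: the value is overwritten
sealand_value = world["sealand"]
present_before = "sealand" in world

del world["sealand"]           # remove the pair again
present_after = "sealand" in world

missing = world.get("sealand") # None, because the key is gone

print(sealand_value, present_before, present_after, missing)
```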

INTERMEDIATE PYTHON
List vs. Dictionary
List                                    Dictionary
Select, update, and remove with []      Select, update, and remove with []
Indexed by range of numbers             Indexed by unique keys
Collection of values: order matters,    Lookup table with unique keys
for selecting entire subsets

INTERMEDIATE PYTHON
Pandas, Part 1
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Tabular dataset examples

INTERMEDIATE PYTHON
Datasets in Python
2D NumPy array?
One data type

pandas!
High level data manipulation tool

Wes McKinney

Built on NumPy

DataFrame

INTERMEDIATE PYTHON
DataFrame
brics

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98

INTERMEDIATE PYTHON
DataFrame from Dictionary
dict = {
"country":["Brazil", "Russia", "India", "China", "South Africa"],
"capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
"area":[8.516, 17.10, 3.286, 9.597, 1.221],
"population":[200.4, 143.5, 1252, 1357, 52.98] }

keys (column labels)

values (data, column by column)

import pandas as pd
brics = pd.DataFrame(dict)

INTERMEDIATE PYTHON
DataFrame from Dictionary (2)
brics

area capital country population


0 8.516 Brasilia Brazil 200.40
1 17.100 Moscow Russia 143.50
2 3.286 New Delhi India 1252.00
3 9.597 Beijing China 1357.00
4 1.221 Pretoria South Africa 52.98

brics.index = ["BR", "RU", "IN", "CH", "SA"]


brics

area capital country population


BR 8.516 Brasilia Brazil 200.40
RU 17.100 Moscow Russia 143.50
IN 3.286 New Delhi India 1252.00
CH 9.597 Beijing China 1357.00
SA 1.221 Pretoria South Africa 52.98

INTERMEDIATE PYTHON
DataFrame from CSV file
brics.csv

,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.10,143.5
IN,India,New Delhi,3.286,1252
CH,China,Beijing,9.597,1357
SA,South Africa,Pretoria,1.221,52.98

CSV = comma-separated values

brics = pd.read_csv("path/to/brics.csv")
brics

Unnamed: 0 country capital area population


0 BR Brazil Brasilia 8.516 200.40
1 RU Russia Moscow 17.100 143.50
2 IN India New Delhi 3.286 1252.00
3 CH China Beijing 9.597 1357.00
4 SA South Africa Pretoria 1.221 52.98

INTERMEDIATE PYTHON
DataFrame from CSV file
brics = pd.read_csv("path/to/brics.csv", index_col = 0)
brics

         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98

INTERMEDIATE PYTHON
Pandas, Part 2
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
brics
import pandas as pd
brics = pd.read_csv("path/to/brics.csv", index_col = 0)
brics

country capital area population


BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

INTERMEDIATE PYTHON
Index and select data
Square brackets
Advanced methods
loc

iloc

INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics["country"]

BR Brazil
RU Russia
IN India
CH China
SA South Africa
Name: country, dtype: object

INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

type(brics["country"])

pandas.core.series.Series

1D labelled array

INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics[["country"]]

country
BR Brazil
RU Russia
IN India
CH China
SA South Africa

INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

type(brics[["country"]])

pandas.core.frame.DataFrame

INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics[["country", "capital"]]

country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Delhi
CH China Beijing
SA South Africa Pretoria

INTERMEDIATE PYTHON
Row Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics[1:4]

country capital area population


RU Russia Moscow 17.100 143.5
IN India New Delhi 3.286 1252.0
CH China Beijing 9.597 1357.0

INTERMEDIATE PYTHON
Row Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40 * 0 *
RU Russia Moscow 17.100 143.50 * 1 *
IN India New Delhi 3.286 1252.00 * 2 *
CH China Beijing 9.597 1357.00 * 3 *
SA South Africa Pretoria 1.221 52.98 * 4 *

brics[1:4]

country capital area population


RU Russia Moscow 17.100 143.5
IN India New Delhi 3.286 1252.0
CH China Beijing 9.597 1357.0

INTERMEDIATE PYTHON
Discussion [ ]
Square brackets: limited functionality
Ideally
2D NumPy arrays

my_array[rows, columns]

pandas
loc (label-based)

iloc (integer position-based)

INTERMEDIATE PYTHON
Row Access loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc["RU"]

country Russia
capital Moscow
area 17.1
population 143.5
Name: RU, dtype: object

Row as pandas Series

INTERMEDIATE PYTHON
Row Access loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[["RU"]]

country capital area population


RU Russia Moscow 17.1 143.5

DataFrame

INTERMEDIATE PYTHON
Row Access loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[["RU", "IN", "CH"]]

country capital area population


RU Russia Moscow 17.100 143.5
IN India New Delhi 3.286 1252.0
CH China Beijing 9.597 1357.0

INTERMEDIATE PYTHON
Row & Column loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

country capital
RU Russia Moscow
IN India New Delhi
CH China Beijing

INTERMEDIATE PYTHON
Row & Column loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[:, ["country", "capital"]]

country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Delhi
CH China Beijing
SA South Africa Pretoria

INTERMEDIATE PYTHON
Recap
Square brackets
Column access brics[["country", "capital"]]

Row access: only through slicing brics[1:4]

loc (label-based)
Row access brics.loc[["RU", "IN", "CH"]]

Column access brics.loc[:, ["country", "capital"]]

Row & Column access


brics.loc[
["RU", "IN", "CH"],
["country", "capital"]
]

INTERMEDIATE PYTHON
Row Access iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[["RU"]]

country capital area population


RU Russia Moscow 17.1 143.5

brics.iloc[[1]]

country capital area population


RU Russia Moscow 17.1 143.5

INTERMEDIATE PYTHON
Row Access iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[["RU", "IN", "CH"]]

country capital area population


RU Russia Moscow 17.100 143.5
IN India New Delhi 3.286 1252.0
CH China Beijing 9.597 1357.0

INTERMEDIATE PYTHON
Row Access iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.iloc[[1,2,3]]

country capital area population


RU Russia Moscow 17.100 143.5
IN India New Delhi 3.286 1252.0
CH China Beijing 9.597 1357.0

INTERMEDIATE PYTHON
Row & Column iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

country capital
RU Russia Moscow
IN India New Delhi
CH China Beijing

INTERMEDIATE PYTHON
Row & Column iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.iloc[[1,2,3], [0, 1]]

country capital
RU Russia Moscow
IN India New Delhi
CH China Beijing

INTERMEDIATE PYTHON
Row & Column iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.loc[:, ["country", "capital"]]

country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Delhi
CH China Beijing
SA South Africa Pretoria

INTERMEDIATE PYTHON
Row & Column iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics.iloc[:, [0,1]]

country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Delhi
CH China Beijing
SA South Africa Pretoria
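As a rough check that loc and iloc select the same data, the sketch below rebuilds brics from a dictionary (an assumption, so that no CSV file is needed) and compares both selections. It also shows one difference the slides do not cover: label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import pandas as pd

# brics rebuilt by hand; the slides load it from brics.csv instead.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.100, 3.286, 9.597, 1.221],
        "population": [200.40, 143.50, 1252.00, 1357.00, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Label-based and position-based selection of the same rows and columns.
by_label = brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
by_position = brics.iloc[[1, 2, 3], [0, 1]]
print(by_label.equals(by_position))

# Slices behave differently: loc keeps the end label, iloc drops the end position.
label_slice = list(brics.loc["RU":"CH"].index)
position_slice = list(brics.iloc[1:3].index)
print(label_slice, position_slice)
```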

INTERMEDIATE PYTHON
Comparison Operators
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
NumPy recap
# Code from Intro to Python for Data Science, Chapter 4
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
bmi

array([ 21.852, 20.975, 21.75 , 24.747, 21.441])

bmi > 23

array([False, False, False, True, False], dtype=bool)

bmi[bmi > 23]

array([ 24.747])

Comparison operators: how Python values relate

INTERMEDIATE PYTHON
Numeric comparisons
2 < 3

True

2 == 3

False

2 <= 3

True

3 <= 3

True

x = 2
y = 3
x < y

True

INTERMEDIATE PYTHON
Other comparisons
"carl" < "chris"

True

3 < "chris"

TypeError: '<' not supported between instances of 'int' and 'str'

3 < 4.1

True

INTERMEDIATE PYTHON
Other comparisons
bmi

array([21.852, 20.975, 21.75 , 24.747, 21.441])

bmi > 23

array([False, False, False, True, False], dtype=bool)

INTERMEDIATE PYTHON
Comparators
Comparator Meaning
< Strictly less than

<= Less than or equal

> Strictly greater than

>= Greater than or equal

== Equal

!= Not equal
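The comparators in the table can be tried in one short script. This is a minimal sketch; the variable names (checks, mixed_comparison_failed) are made up for illustration.

```python
# Each entry pairs an expression (as text) with its result.
checks = {
    "2 < 3": 2 < 3,
    "3 <= 3": 3 <= 3,
    "2 > 3": 2 > 3,
    "3 >= 3": 3 >= 3,
    "2 == 3": 2 == 3,
    "2 != 3": 2 != 3,
    '"carl" < "chris"': "carl" < "chris",  # strings compare character by character
    "3 < 4.1": 3 < 4.1,                    # int and float can be compared
}
for expression, result in checks.items():
    print(expression, "->", result)

# Comparing an int with a str using < is not defined and raises TypeError.
try:
    3 < "chris"
    mixed_comparison_failed = False
except TypeError as exc:
    mixed_comparison_failed = True
    print("TypeError:", exc)
```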

INTERMEDIATE PYTHON
Boolean Operators
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Boolean Operators
and

or

not

INTERMEDIATE PYTHON
and
True and True

True

False and True

False

True and False

False

False and False

False

x = 12
x > 5 and x < 15
# True      True

True

INTERMEDIATE PYTHON
or
True or True

True

False or True

True

True or False

True

False or False

False

y = 5
y < 7 or y > 13

True

INTERMEDIATE PYTHON
not
not True

False

not False

True

INTERMEDIATE PYTHON
NumPy
bmi # calculation of bmi left out

array([21.852, 20.975, 21.75 , 24.747, 21.441])

bmi > 21

array([True, False, True, True, True], dtype=bool)

bmi < 22

array([True, True, True, False, True], dtype=bool)

bmi > 21 and bmi < 22

ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()

INTERMEDIATE PYTHON
NumPy
logical_and()

logical_or()

logical_not()

np.logical_and(bmi > 21, bmi < 22)

array([True, False, True, False, True], dtype=bool)

bmi[np.logical_and(bmi > 21, bmi < 22)]

array([21.852, 21.75, 21.441])
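The three NumPy functions can be compared side by side. The sketch below recomputes bmi from the earlier slides; the element-wise & operator is shown as an alternative that the slides do not use, and it needs parentheses around each comparison.

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2

# Element-wise "and": True where both conditions hold.
in_range = np.logical_and(bmi > 21, bmi < 22)

# Element-wise "or" and "not", written so that they describe the opposite set.
out_of_range = np.logical_or(bmi <= 21, bmi >= 22)
negated = np.logical_not(in_range)

# The & operator gives the same result as np.logical_and on boolean arrays.
with_operator = (bmi > 21) & (bmi < 22)

print(in_range)
print(np.array_equal(out_of_range, negated))
print(np.array_equal(in_range, with_operator))
```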

INTERMEDIATE PYTHON
if, elif, else
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Overview
Comparison Operators
< , > , >= , <= , == , !=

Boolean Operators
and , or , not

Conditional Statements
if , else , elif

INTERMEDIATE PYTHON
if
if condition :
    expression

control.py

z = 4
if z % 2 == 0 : # True
    print("z is even")

z is even

INTERMEDIATE PYTHON
if
if condition :
    expression

expression not part of if

control.py

z = 4
if z % 2 == 0 : # True
    print("z is even")

z is even

INTERMEDIATE PYTHON
if
if condition :
    expression

control.py

z = 4
if z % 2 == 0 :
    print("checking " + str(z))
    print("z is even")

checking 4
z is even

INTERMEDIATE PYTHON
if
if condition :
    expression

control.py

z = 5
if z % 2 == 0 : # False
    print("checking " + str(z))
    print("z is even")

INTERMEDIATE PYTHON
else
if condition :
    expression
else :
    expression

control.py

z = 5
if z % 2 == 0 : # False
    print("z is even")
else :
    print("z is odd")

z is odd

INTERMEDIATE PYTHON
elif
if condition :
    expression
elif condition :
    expression
else :
    expression

control.py

z = 3
if z % 2 == 0 : # False
    print("z is divisible by 2")
elif z % 3 == 0 : # True
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")

z is divisible by 3

INTERMEDIATE PYTHON
elif
if condition :
    expression
elif condition :
    expression
else :
    expression

control.py

z = 6
if z % 2 == 0 : # True
    print("z is divisible by 2")
elif z % 3 == 0 :
    print("z is divisible by 3") # Never reached
else :
    print("z is neither divisible by 2 nor by 3")

z is divisible by 2
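To see all three branches in one run, the same construct can be wrapped in a function and called with several values. The function name describe_divisibility is hypothetical and not part of the slides.

```python
def describe_divisibility(z):
    """Return the message of the first branch whose condition is True."""
    if z % 2 == 0:
        return "z is divisible by 2"
    elif z % 3 == 0:
        # Only checked when the if condition above was False.
        return "z is divisible by 3"
    else:
        return "z is neither divisible by 2 nor by 3"

# 6 is divisible by both 2 and 3, but only the first matching branch runs.
for z in [3, 6, 7]:
    print(z, "->", describe_divisibility(z))
```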

INTERMEDIATE PYTHON
Filtering pandas DataFrames
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
brics
import pandas as pd
brics = pd.read_csv("path/to/brics.csv", index_col = 0)
brics

country capital area population


BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

INTERMEDIATE PYTHON
Goal
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

Select countries with area over 8 million km²

3 steps
Select the area column

Do comparison on area column


Use result to select countries

INTERMEDIATE PYTHON
Step 1: Get column
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

brics["area"]

BR 8.516
RU 17.100
IN 3.286
CH 9.597
SA 1.221
Name: area, dtype: float64 # - Need Pandas Series

Alternatives:
brics.loc[:,"area"]
brics.iloc[:,2]

INTERMEDIATE PYTHON
Step 2: Compare
brics["area"]

BR 8.516
RU 17.100
IN 3.286
CH 9.597
SA 1.221
Name: area, dtype: float64

brics["area"] > 8

BR True
RU True
IN False
CH True
SA False
Name: area, dtype: bool

is_huge = brics["area"] > 8

INTERMEDIATE PYTHON
Step 3: Subset DF
is_huge

BR True
RU True
IN False
CH True
SA False
Name: area, dtype: bool

brics[is_huge]

country capital area population


BR Brazil Brasilia 8.516 200.4
RU Russia Moscow 17.100 143.5
CH China Beijing 9.597 1357.0

INTERMEDIATE PYTHON
Summary
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

is_huge = brics["area"] > 8


brics[is_huge]

country capital area population


BR Brazil Brasilia 8.516 200.4
RU Russia Moscow 17.100 143.5
CH China Beijing 9.597 1357.0

brics[brics["area"] > 8]

country capital area population


BR Brazil Brasilia 8.516 200.4
RU Russia Moscow 17.100 143.5
CH China Beijing 9.597 1357.0

INTERMEDIATE PYTHON
Boolean operators
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

import numpy as np
np.logical_and(brics["area"] > 8, brics["area"] < 10)

BR True
RU False
IN False
CH True
SA False
Name: area, dtype: bool

brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

country capital area population


BR Brazil Brasilia 8.516 200.4
CH China Beijing 9.597 1357.0
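The same filter can also be written with the & operator on the boolean Series, which is a common alternative to np.logical_and (not shown on the slides). The sketch rebuilds a reduced brics by hand as an assumption, so it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Reduced copy of brics, built by hand for this sketch.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "area": [8.516, 17.100, 3.286, 9.597, 1.221],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Approach from the slides: NumPy's element-wise logical_and.
with_numpy = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

# Alternative: the & operator; each comparison needs its own parentheses.
with_operator = brics[(brics["area"] > 8) & (brics["area"] < 10)]

print(with_operator)
print(with_numpy.equals(with_operator))
```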

INTERMEDIATE PYTHON
while loop
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
if-elif-else
control.py

Goes through construct only once!

z = 6
if z % 2 == 0 : # True
    print("z is divisible by 2") # Executed
elif z % 3 == 0 :
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")

... # Moving on

While loop = repeated if statement

INTERMEDIATE PYTHON
While
while condition :
    expression

Numerically calculating model


"repeating action until condition is met"

Example
Error starts at 50

Divide error by 4 on every run

Continue until error no longer > 1

INTERMEDIATE PYTHON
While
while condition :
    expression

while_loop.py

error = 50.0

while error > 1:
    error = error / 4
    print(error)

Error starts at 50

Divide error by 4 on every run

Continue until error no longer > 1

INTERMEDIATE PYTHON
While
while condition :
    expression

while_loop.py

error = 50.0
# 50
while error > 1: # True
    error = error / 4
    print(error)

12.5

INTERMEDIATE PYTHON
While
while condition :
    expression

while_loop.py

error = 50.0
# 12.5
while error > 1: # True
    error = error / 4
    print(error)

12.5
3.125

INTERMEDIATE PYTHON
While
while condition :
    expression

while_loop.py

error = 50.0
# 3.125
while error > 1: # True
    error = error / 4
    print(error)

12.5
3.125
0.78125

INTERMEDIATE PYTHON
While
while condition :
    expression

while_loop.py

error = 50.0
# 0.78125
while error > 1: # False
    error = error / 4
    print(error)

12.5
3.125
0.78125
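The walkthrough above can be condensed into a loop that also counts its iterations. The max_steps guard is an addition for illustration (not on the slides); it keeps the loop from running forever if the update line is ever removed.

```python
error = 50.0
steps = 0
max_steps = 100  # hypothetical safety limit, not part of the original example

# Repeat until error is no longer > 1, or until the safety limit is hit.
while error > 1 and steps < max_steps:
    error = error / 4
    steps += 1
    print(steps, error)

print("stopped after", steps, "steps with error", error)
```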

INTERMEDIATE PYTHON
While
while condition :
    expression

DataCamp: session disconnected
Local system: Control + C

while_loop.py

error = 50.0
while error > 1 : # always True
    # error = error / 4
    print(error)

50.0
50.0
50.0
50.0
50.0
50.0
50.0
...

INTERMEDIATE PYTHON
for loop
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
for loop
for var in seq :
    expression

"for each var in seq, execute expression"

INTERMEDIATE PYTHON
fam
family.py

fam = [1.73, 1.68, 1.71, 1.89]


print(fam)

[1.73, 1.68, 1.71, 1.89]

INTERMEDIATE PYTHON
fam
family.py

fam = [1.73, 1.68, 1.71, 1.89]


print(fam[0])
print(fam[1])
print(fam[2])
print(fam[3])

1.73
1.68
1.71
1.89

INTERMEDIATE PYTHON
for loop
for var in seq :
    expression

family.py

fam = [1.73, 1.68, 1.71, 1.89]


for height in fam :
    print(height)

INTERMEDIATE PYTHON
for loop
for var in seq :
    expression

family.py

fam = [1.73, 1.68, 1.71, 1.89]


for height in fam :
    print(height)
    # first iteration
    # height = 1.73

1.73

INTERMEDIATE PYTHON
for loop
for var in seq :
    expression

family.py

fam = [1.73, 1.68, 1.71, 1.89]


for height in fam :
    print(height)
    # second iteration
    # height = 1.68

1.73
1.68

INTERMEDIATE PYTHON
for loop
for var in seq :
    expression

family.py

fam = [1.73, 1.68, 1.71, 1.89]


for height in fam :
    print(height)

1.73
1.68
1.71
1.89

No access to indexes

INTERMEDIATE PYTHON
for loop
for var in seq :
    expression

family.py

fam = [1.73, 1.68, 1.71, 1.89]

???

index 0: 1.73
index 1: 1.68
index 2: 1.71
index 3: 1.89

INTERMEDIATE PYTHON
enumerate
for var in seq :
    expression

family.py

fam = [1.73, 1.68, 1.71, 1.89]


for index, height in enumerate(fam) :
    print("index " + str(index) + ": " + str(height))

index 0: 1.73
index 1: 1.68
index 2: 1.71
index 3: 1.89
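enumerate also accepts a start argument, which helps when the count should begin at 1. A minimal sketch, with f-strings used in place of string concatenation (a style choice, not from the slides):

```python
fam = [1.73, 1.68, 1.71, 1.89]

# enumerate yields (counter, element) pairs; start=1 shifts the counter.
lines = []
for number, height in enumerate(fam, start=1):
    lines.append(f"person {number}: {height}")

for line in lines:
    print(line)
```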

INTERMEDIATE PYTHON
Loop over string
for var in seq :
    expression

strloop.py

for c in "family" :
    print(c.capitalize())

F
A
M
I
L
Y

INTERMEDIATE PYTHON
Loop Data Structures Part 1
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Dictionary
for var in seq :
    expression

dictloop.py

world = { "afghanistan":30.55,
          "albania":2.77,
          "algeria":39.21 }
for key, value in world :
    print(key + " -- " + str(value))

ValueError: too many values to unpack (expected 2)

INTERMEDIATE PYTHON
Dictionary
for var in seq :
    expression

dictloop.py

world = { "afghanistan":30.55,
          "albania":2.77,
          "algeria":39.21 }
for key, value in world.items() :
    print(key + " -- " + str(value))

afghanistan -- 30.55
albania -- 2.77
algeria -- 39.21

INTERMEDIATE PYTHON
Dictionary
for var in seq :
    expression

dictloop.py

world = { "afghanistan":30.55,
          "albania":2.77,
          "algeria":39.21 }
for k, v in world.items() :
    print(k + " -- " + str(v))

afghanistan -- 30.55
albania -- 2.77
algeria -- 39.21

INTERMEDIATE PYTHON
NumPy Arrays
for var in seq :
    expression

nploop.py

import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
for val in bmi :
    print(val)

21.852
20.975
21.750
24.747
21.441

INTERMEDIATE PYTHON
2D NumPy Arrays
nploop.py

import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])
for val in meas :
    print(val)

[ 1.73 1.68 1.71 1.89 1.79]


[ 65.4 59.2 63.6 88.4 68.7]

INTERMEDIATE PYTHON
2D NumPy Arrays
nploop.py

import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])
for val in np.nditer(meas) :
    print(val)

1.73
1.68
1.71
1.89
1.79
65.4
...

INTERMEDIATE PYTHON
Recap
Dictionary
for key, val in my_dict.items() :

NumPy array
for val in np.nditer(my_array) :
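Both recap patterns are put together in the sketch below. The small 2D array is a shortened version of meas, chosen here only to keep the output brief.

```python
import numpy as np

world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}

# .items() yields (key, value) pairs; dictionaries keep insertion order in Python 3.7+.
pairs = []
for key, val in world.items():
    pairs.append(key + " -- " + str(val))
print(pairs)

# A 2 x 2 array: looping over it directly yields rows,
# while np.nditer visits every single element.
meas = np.array([[1.73, 1.68],
                 [65.4, 59.2]])
rows = [row.tolist() for row in meas]
elements = [float(val) for val in np.nditer(meas)]
print(rows)
print(elements)
```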

INTERMEDIATE PYTHON
Loop Data Structures Part 2
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
brics
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98

dfloop.py

import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)

INTERMEDIATE PYTHON
for, first try
dfloop.py

import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for val in brics :
    print(val)

country
capital
area
population

INTERMEDIATE PYTHON
iterrows
dfloop.py

import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for lab, row in brics.iterrows():
    print(lab)
    print(row)

BR
country Brazil
capital Brasilia
area 8.516
population 200.4
Name: BR, dtype: object
...
RU
country Russia
capital Moscow
area 17.1
population 143.5
Name: RU, dtype: object
IN ...

INTERMEDIATE PYTHON
Selective print
dfloop.py

import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for lab, row in brics.iterrows():
    print(lab + ": " + row["capital"])

BR: Brasilia
RU: Moscow
IN: New Delhi
CH: Beijing
SA: Pretoria

INTERMEDIATE PYTHON
Add column
dfloop.py

import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for lab, row in brics.iterrows() :
    # - Creating Series on every iteration
    brics.loc[lab, "name_length"] = len(row["country"])
print(brics)

country capital area population name_length


BR Brazil Brasilia 8.516 200.40 6
RU Russia Moscow 17.100 143.50 6
IN India New Delhi 3.286 1252.00 5
CH China Beijing 9.597 1357.00 5
SA South Africa Pretoria 1.221 52.98 12

INTERMEDIATE PYTHON
apply
dfloop.py

import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
brics["name_length"] = brics["country"].apply(len)
print(brics)

country capital area population name_length


BR Brazil Brasilia 8.516 200.40 6
RU Russia Moscow 17.100 143.50 6
IN India New Delhi 3.286 1252.00 5
CH China Beijing 9.597 1357.00 5
SA South Africa Pretoria 1.221 52.98 12
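The three ways of computing name_length can be compared directly. The sketch below assumes a reduced brics built by hand; .str.len() is a vectorized string method that the slides do not use and is shown only as a further option.

```python
import pandas as pd

# Reduced copy of brics, built by hand for this sketch.
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India", "China", "South Africa"]},
    index=["BR", "RU", "IN", "CH", "SA"],
)

# 1. Explicit loop: iterrows builds a Series for every row.
via_loop = [len(row["country"]) for lab, row in brics.iterrows()]

# 2. apply: calls len on every element of the column.
via_apply = brics["country"].apply(len).tolist()

# 3. Vectorized string method on the column.
via_str = brics["country"].str.len().tolist()

print(via_loop, via_apply, via_str)
```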

INTERMEDIATE PYTHON
Random Numbers
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
INTERMEDIATE PYTHON
Can't go below step 0

0.1% chance of falling down the stairs

Bet: you'll reach step 60

INTERMEDIATE PYTHON
How to solve?
Analytical
Simulate the process
Hacker statistics!

INTERMEDIATE PYTHON
Random generators
import numpy as np
np.random.rand() # Pseudo-random numbers

0.9535543896720104 # Mathematical formula

np.random.seed(123) # Starting from a seed


np.random.rand()

0.6964691855978616

np.random.rand()

0.28613933495037946

INTERMEDIATE PYTHON
Random generators
np.random.seed(123)
np.random.rand()

0.6964691855978616 # Same seed: same random numbers!

np.random.rand() # Ensures "reproducibility"

0.28613933495037946
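Reproducibility can be verified by seeding twice and comparing the draws. The second half uses np.random.default_rng, NumPy's newer generator interface; it is not part of the slides and its numbers differ from those produced by np.random.seed.

```python
import numpy as np

# Legacy interface used on the slides: reseeding restarts the sequence.
np.random.seed(123)
first_run = [np.random.rand() for _ in range(3)]
np.random.seed(123)
second_run = [np.random.rand() for _ in range(3)]
print(first_run == second_run)

# Newer interface (an alternative, not from the slides): each generator
# object carries its own state, so two generators with one seed agree.
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)
draw_a = rng_a.random()
draw_b = rng_b.random()
print(draw_a == draw_b)
```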

INTERMEDIATE PYTHON
Coin toss
game.py

import numpy as np
np.random.seed(123)
coin = np.random.randint(0,2) # Randomly generate 0 or 1
print(coin)

INTERMEDIATE PYTHON
Coin toss
game.py

import numpy as np
np.random.seed(123)
coin = np.random.randint(0,2) # Randomly generate 0 or 1
print(coin)
if coin == 0:
    print("heads")
else:
    print("tails")

0
heads

INTERMEDIATE PYTHON
Random Walk
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Random Step

INTERMEDIATE PYTHON
Random Walk
Known in Science

Path of molecules

Gambler's financial status

INTERMEDIATE PYTHON
Heads or Tails
headtails.py

import numpy as np
np.random.seed(123)
outcomes = []
for x in range(10) :
    coin = np.random.randint(0, 2)
    if coin == 0 :
        outcomes.append("heads")
    else :
        outcomes.append("tails")
print(outcomes)

['heads', 'tails', 'heads', 'heads', 'heads',
 'heads', 'heads', 'tails', 'tails', 'heads']

INTERMEDIATE PYTHON
Heads or Tails: Random Walk
headtailsrw.py

import numpy as np
np.random.seed(123)
tails = [0]
for x in range(10) :
    coin = np.random.randint(0, 2)
    tails.append(tails[x] + coin)
print(tails)

[0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3]

INTERMEDIATE PYTHON
Step to Walk
outcomes

['heads', 'tails', 'heads', 'heads', 'heads',
 'heads', 'heads', 'tails', 'tails', 'heads']

tails

[0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3]
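The link between single steps and the walk is a running total. The sketch below draws all ten tosses at once and builds the walk with np.cumsum, then rebuilds it with the loop from the slide for comparison. Drawing the tosses in one call is an assumption of this sketch, and the exact numbers are not claimed to match the slide output.

```python
import numpy as np

np.random.seed(123)
coins = np.random.randint(0, 2, size=10)  # ten tosses, each 0 (heads) or 1 (tails)

# Walk via running total: start at 0, then add the cumulative number of tails.
walk = np.concatenate(([0], np.cumsum(coins)))

# Same walk built step by step, as on the slide.
tails = [0]
for x in range(10):
    tails.append(tails[x] + int(coins[x]))

print(walk.tolist())
print(tails)
```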

INTERMEDIATE PYTHON
Distribution
INTERMEDIATE PYTHON

Hugo Bowne-Anderson
Data Scientist at DataCamp
Distribution

INTERMEDIATE PYTHON
Random Walk
headtailsrw.py

import numpy as np
np.random.seed(123)
tails = [0]
for x in range(10) :
    coin = np.random.randint(0, 2)
    tails.append(tails[x] + coin)

INTERMEDIATE PYTHON
100 runs
distribution.py

import numpy as np
np.random.seed(123)
final_tails = []
for x in range(100) :
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
print(final_tails)

[3, 6, 4, 5, 4, 5, 3, 5, 4, 6, 6, 8, 6, 4, 7, 5, 7, 4, 3, 3, ..., 4]

INTERMEDIATE PYTHON
Histogram, 100 runs
distribution.py

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
final_tails = []
for x in range(100) :
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
plt.hist(final_tails, bins = 10)
plt.show()

INTERMEDIATE PYTHON
Histogram, 100 runs

INTERMEDIATE PYTHON
Histogram, 1,000 runs
distribution.py

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
final_tails = []
for x in range(1000) : # <--
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
plt.hist(final_tails, bins = 10)
plt.show()

INTERMEDIATE PYTHON
Histogram, 1,000 runs

INTERMEDIATE PYTHON
Histogram, 10,000 runs
distribution.py

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
final_tails = []
for x in range(10000) : # <--
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
plt.hist(final_tails, bins = 10)
plt.show()

INTERMEDIATE PYTHON
Histogram, 10,000 runs
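Since only the final number of tails is plotted, the nested loops can be replaced by one array of tosses. The sketch below is an assumed vectorized variant, not the slide code; it counts outcomes with np.bincount instead of plotting so that it runs without Matplotlib.

```python
import numpy as np

np.random.seed(123)
runs = 10000

# One row per run, ten tosses per row; the row sum is the final number of tails.
tosses = np.random.randint(0, 2, size=(runs, 10))
final_tails = tosses.sum(axis=1)

# How often each outcome 0..10 occurred.
counts = np.bincount(final_tails, minlength=11)
for n_tails, count in enumerate(counts):
    print(n_tails, count)

# There are 11 possible outcomes, so bins = 10 merges two of them into one
# bar whenever the data span 0 to 10. One bar per outcome could be drawn with
# plt.hist(final_tails, bins=range(12)).
```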

INTERMEDIATE PYTHON
Introducing DataFrames
DATA MANIPULATION WITH PANDAS

Richie Cotton
Data Evangelist at DataCamp
What's the point of pandas?
Data Manipulation skill track
Data Visualization skill track

DATA MANIPULATION WITH PANDAS


Course outline
Chapter 1: DataFrames Chapter 3: Slicing and Indexing Data
Sorting and subsetting Subsetting using slicing

Creating new columns Indexes and subsetting using indexes


Chapter 2: Aggregating Data Chapter 4: Creating and Visualizing Data
Summary statistics Plotting

Counting Handling missing data

Grouped summary statistics Reading data into a DataFrame

DATA MANIPULATION WITH PANDAS


pandas is built on NumPy and Matplotlib

DATA MANIPULATION WITH PANDAS


pandas is popular

1 https://pypistats.org/packages/pandas

DATA MANIPULATION WITH PANDAS


Rectangular data
Name Breed Color Height (cm) Weight (kg) Date of Birth
Bella Labrador Brown 56 25 2013-07-01
Charlie Poodle Black 43 23 2016-09-16
Lucy Chow Chow Brown 46 22 2014-08-25
Cooper Schnauzer Gray 49 17 2011-12-11
Max Labrador Black 59 29 2017-01-20
Stella Chihuahua Tan 18 2 2015-04-20
Bernie St. Bernard White 77 74 2018-02-27

DATA MANIPULATION WITH PANDAS


pandas DataFrames
print(dogs)

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
4 Max Labrador Black 59 29 2017-01-20
5 Stella Chihuahua Tan 18 2 2015-04-20
6 Bernie St. Bernard White 77 74 2018-02-27

DATA MANIPULATION WITH PANDAS


Exploring a DataFrame: .head()
print(dogs.head())

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
4 Max Labrador Black 59 29 2017-01-20

DATA MANIPULATION WITH PANDAS


Exploring a DataFrame: .info()
dogs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 name 7 non-null object
1 breed 7 non-null object
2 color 7 non-null object
3 height_cm 7 non-null int64
4 weight_kg 7 non-null int64
5 date_of_birth 7 non-null object
dtypes: int64(2), object(4)
memory usage: 464.0+ bytes

DATA MANIPULATION WITH PANDAS


Exploring a DataFrame: .shape
print(dogs.shape)

(7, 6)

DATA MANIPULATION WITH PANDAS


Exploring a DataFrame: .describe()
print(dogs.describe())

height_cm weight_kg
count 7.000000 7.000000
mean 49.714286 27.428571
std 17.960274 22.292429
min 18.000000 2.000000
25% 44.500000 19.500000
50% 49.000000 23.000000
75% 57.500000 27.000000
max 77.000000 74.000000

DATA MANIPULATION WITH PANDAS


Components of a DataFrame: .values
print(dogs.values)

array([['Bella', 'Labrador', 'Brown', 56, 24, '2013-07-01'],


['Charlie', 'Poodle', 'Black', 43, 24, '2016-09-16'],
['Lucy', 'Chow Chow', 'Brown', 46, 24, '2014-08-25'],
['Cooper', 'Schnauzer', 'Gray', 49, 17, '2011-12-11'],
['Max', 'Labrador', 'Black', 59, 29, '2017-01-20'],
['Stella', 'Chihuahua', 'Tan', 18, 2, '2015-04-20'],
['Bernie', 'St. Bernard', 'White', 77, 74, '2018-02-27']],
dtype=object)

DATA MANIPULATION WITH PANDAS


Components of a DataFrame: .columns and .index
print(dogs.columns)

Index(['name', 'breed', 'color', 'height_cm', 'weight_kg', 'date_of_birth'],
      dtype='object')

dogs.index

RangeIndex(start=0, stop=7, step=1)
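The slides never show how dogs is created. The sketch below builds it from a dictionary of lists, which is one possible construction under the assumption that the values match the printed DataFrame, and then inspects the same attributes.

```python
import pandas as pd

# dogs rebuilt by hand from the values printed on the slides.
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
        "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
                          "2017-01-20", "2015-04-20", "2018-02-27"],
    }
)

print(dogs.shape)          # (rows, columns); an attribute, so no parentheses
print(list(dogs.columns))  # column labels
print(list(dogs.index))    # row labels, here the default 0..6
print(dogs.head(2))        # head accepts the number of rows to show
```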

DATA MANIPULATION WITH PANDAS


pandas Philosophy
There should be one -- and preferably only one -- obvious way to do it.

- The Zen of Python by Tim Peters, Item 13

1 https://www.python.org/dev/peps/pep-0020/

DATA MANIPULATION WITH PANDAS


Sorting and subsetting
DATA MANIPULATION WITH PANDAS

Richie Cotton
Data Evangelist at DataCamp
Sorting
dogs.sort_values("weight_kg")

name breed color height_cm weight_kg date_of_birth


5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27

DATA MANIPULATION WITH PANDAS


Sorting in descending order
dogs.sort_values("weight_kg", ascending=False)

name breed color height_cm weight_kg date_of_birth


6 Bernie St. Bernard White 77 74 2018-02-27
4 Max Labrador Black 59 29 2017-01-20
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
5 Stella Chihuahua Tan 18 2 2015-04-20

DATA MANIPULATION WITH PANDAS


Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"])

name breed color height_cm weight_kg date_of_birth


5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27

DATA MANIPULATION WITH PANDAS


Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

name breed color height_cm weight_kg date_of_birth


5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
0 Bella Labrador Brown 56 24 2013-07-01
2 Lucy Chow Chow Brown 46 24 2014-08-25
1 Charlie Poodle Black 43 24 2016-09-16
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27

Rows with tied weight_kg (Bella, Lucy and Charlie) appear in descending order by
height_cm
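The tie-breaking behaviour can be reproduced on a reduced copy of dogs. The sketch keeps only the three columns involved, which is an assumption made for brevity.

```python
import pandas as pd

# Reduced copy of dogs with the columns needed for sorting.
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    }
)

# Weight ascending; among equal weights, height descending.
sorted_dogs = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)
print(sorted_dogs)

# sort_values returns a new DataFrame; dogs itself keeps its original order.
print(dogs["name"].tolist())
```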

DATA MANIPULATION WITH PANDAS


Subsetting columns
dogs["name"]

0 Bella
1 Charlie
2 Lucy
3 Cooper
4 Max
5 Stella
6 Bernie
Name: name, dtype: object

DATA MANIPULATION WITH PANDAS


Subsetting multiple columns
dogs[["breed", "height_cm"]]

breed height_cm
0 Labrador 56
1 Poodle 43
2 Chow Chow 46
3 Schnauzer 49
4 Labrador 59
5 Chihuahua 18
6 St. Bernard 77

cols_to_subset = ["breed", "height_cm"]
dogs[cols_to_subset]

breed height_cm
0 Labrador 56
1 Poodle 43
2 Chow Chow 46
3 Schnauzer 49
4 Labrador 59
5 Chihuahua 18
6 St. Bernard 77

DATA MANIPULATION WITH PANDAS


Subsetting rows
dogs["height_cm"] > 50

0 True
1 False
2 False
3 False
4 True
5 False
6 True
Name: height_cm, dtype: bool

DATA MANIPULATION WITH PANDAS


Subsetting rows
dogs[dogs["height_cm"] > 50]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27

DATA MANIPULATION WITH PANDAS


Subsetting based on text data
dogs[dogs["breed"] == "Labrador"]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20

DATA MANIPULATION WITH PANDAS


Subsetting based on dates
dogs[dogs["date_of_birth"] < "2015-01-01"]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11

DATA MANIPULATION WITH PANDAS


Subsetting based on multiple conditions
is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
dogs[is_lab & is_brown]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01

dogs[ (dogs["breed"] == "Labrador") & (dogs["color"] == "Brown") ]
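The same boolean masks can also be combined with `|` (or) and `~` (not). The sketch below is a minimal illustration on a cut-down stand-in for the dogs DataFrame (only three columns are recreated here); each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

# Cut-down stand-in for the dogs DataFrame shown in the slides
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
})

is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"

lab_and_brown = dogs[is_lab & is_brown]   # both conditions hold
lab_or_brown = dogs[is_lab | is_brown]    # at least one condition holds
not_lab = dogs[~is_lab]                   # condition negated

print(lab_and_brown["name"].tolist())
print(lab_or_brown["name"].tolist())
print(not_lab["name"].tolist())
```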

DATA MANIPULATION WITH PANDAS


Subsetting using .isin()
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
dogs[is_black_or_brown]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
4 Max Labrador Black 59 29 2017-01-20

DATA MANIPULATION WITH PANDAS


New columns
DATA MANIPULATION WITH PANDAS

Richie Cotton
Data Evangelist at DataCamp
Adding a new column
dogs["height_m"] = dogs["height_cm"] / 100
print(dogs)

name breed color height_cm weight_kg date_of_birth height_m


0 Bella Labrador Brown 56 24 2013-07-01 0.56
1 Charlie Poodle Black 43 24 2016-09-16 0.43
2 Lucy Chow Chow Brown 46 24 2014-08-25 0.46
3 Cooper Schnauzer Gray 49 17 2011-12-11 0.49
4 Max Labrador Black 59 29 2017-01-20 0.59
5 Stella Chihuahua Tan 18 2 2015-04-20 0.18
6 Bernie St. Bernard White 77 74 2018-02-27 0.77

DATA MANIPULATION WITH PANDAS


Doggy mass index
BMI = weight in kg / (height in m)^2

dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2


print(dogs.head())

name breed color height_cm weight_kg date_of_birth height_m bmi


0 Bella Labrador Brown 56 24 2013-07-01 0.56 76.530612
1 Charlie Poodle Black 43 24 2016-09-16 0.43 129.799892
2 Lucy Chow Chow Brown 46 24 2014-08-25 0.46 113.421550
3 Cooper Schnauzer Gray 49 17 2011-12-11 0.49 70.803832
4 Max Labrador Black 59 29 2017-01-20 0.59 83.309394

DATA MANIPULATION WITH PANDAS


Multiple manipulations
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)
bmi_lt_100_height[["name", "height_cm", "bmi"]]

name height_cm bmi


4 Max 59 83.309394
0 Bella 56 76.530612
3 Cooper 49 70.803832
5 Stella 18 61.728395
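The three steps above can also be written as one chained expression, which avoids naming the intermediate DataFrames. This is only a sketch of an alternative style, not something the slides prescribe; the DataFrame below is a stand-in with just the columns needed.

```python
import pandas as pd

# Stand-in for the dogs DataFrame, restricted to the columns used here
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})
dogs["height_m"] = dogs["height_cm"] / 100
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Filter, sort, then select columns in a single chained expression
result = (
    dogs[dogs["bmi"] < 100]
    .sort_values("height_cm", ascending=False)
    [["name", "height_cm", "bmi"]]
)
print(result)
```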

DATA MANIPULATION WITH PANDAS


Summary statistics
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
Summarizing numerical data
dogs["height_cm"].mean()

49.714285714285715

Other summary methods:
.median() , .mode()
.min() , .max()
.var() , .std()
.sum()
.quantile()

DATA MANIPULATION WITH PANDAS


Summarizing dates
Oldest dog:

dogs["date_of_birth"].min()

'2011-12-11'

Youngest dog:

dogs["date_of_birth"].max()

'2018-02-27'

DATA MANIPULATION WITH PANDAS


The .agg() method
def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)

22.599999999999998

DATA MANIPULATION WITH PANDAS


Summaries on multiple columns
dogs[["weight_kg", "height_cm"]].agg(pct30)

weight_kg 22.6
height_cm 45.4
dtype: float64

DATA MANIPULATION WITH PANDAS


Multiple summaries
def pct40(column):
    return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])

pct30 22.6
pct40 24.0
Name: weight_kg, dtype: float64
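The two ideas above (several columns, several summaries) can be combined in one `.agg()` call. A minimal sketch on a stand-in DataFrame; the result has one row per function and one column per input column.

```python
import pandas as pd

# Stand-in for the numeric columns of the dogs DataFrame
dogs = pd.DataFrame({
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

def pct30(column):
    return column.quantile(0.3)

def pct40(column):
    return column.quantile(0.4)

# Rows are labelled with the function names, columns with the column names
summary = dogs[["weight_kg", "height_cm"]].agg([pct30, pct40])
print(summary)
```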

DATA MANIPULATION WITH PANDAS


Cumulative sum
dogs["weight_kg"]

0 24
1 24
2 24
3 17
4 29
5 2
6 74
Name: weight_kg, dtype: int64

dogs["weight_kg"].cumsum()

0 24
1 48
2 72
3 89
4 118
5 120
6 194
Name: weight_kg, dtype: int64

DATA MANIPULATION WITH PANDAS


Cumulative statistics
.cummax()

.cummin()

.cumprod()
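These methods behave like `.cumsum()`: each returns a Series of the same length holding the running statistic. A short sketch using the weights from the slides:

```python
import pandas as pd

# Weights of the seven dogs, in the original row order
weights = pd.Series([24, 24, 24, 17, 29, 2, 74], name="weight_kg")

running_max = weights.cummax()  # largest value seen so far
running_min = weights.cummin()  # smallest value seen so far

print(running_max.tolist())
print(running_min.tolist())
```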

DATA MANIPULATION WITH PANDAS


Walmart
sales.head()

store type dept date weekly_sales is_holiday temp_c fuel_price unemp


0 1 A 1 2010-02-05 24924.50 False 5.73 0.679 8.106
1 1 A 2 2010-02-05 50605.27 False 5.73 0.679 8.106
2 1 A 3 2010-02-05 13740.12 False 5.73 0.679 8.106
3 1 A 4 2010-02-05 39954.04 False 5.73 0.679 8.106
4 1 A 5 2010-02-05 32229.38 False 5.73 0.679 8.106

DATA MANIPULATION WITH PANDAS


Counting
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
Avoiding double counting

DATA MANIPULATION WITH PANDAS


Vet visits
print(vet_visits)

date name breed weight_kg


0 2018-09-02 Bella Labrador 24.87
1 2019-06-07 Max Labrador 28.35
2 2018-01-17 Stella Chihuahua 1.51
3 2019-10-19 Lucy Chow Chow 24.07
.. ... ... ... ...
71 2018-01-20 Stella Chihuahua 2.83
72 2019-06-07 Max Chow Chow 24.01
73 2018-08-20 Lucy Chow Chow 24.40
74 2019-04-22 Max Labrador 28.54

DATA MANIPULATION WITH PANDAS


Dropping duplicate names
vet_visits.drop_duplicates(subset="name")

date name breed weight_kg


0 2018-09-02 Bella Labrador 24.87
1 2019-06-07 Max Chow Chow 24.01
2 2019-03-19 Charlie Poodle 24.95
3 2018-01-17 Stella Chihuahua 1.51
4 2019-10-19 Lucy Chow Chow 24.07
7 2019-03-30 Cooper Schnauzer 16.91
10 2019-01-04 Bernie St. Bernard 74.98
(6 2019-06-07 Max Labrador 28.35)

DATA MANIPULATION WITH PANDAS


Dropping duplicate pairs
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
print(unique_dogs)

date name breed weight_kg


0 2018-09-02 Bella Labrador 24.87
1 2019-03-13 Max Chow Chow 24.13
2 2019-03-19 Charlie Poodle 24.95
3 2018-01-17 Stella Chihuahua 1.51
4 2019-10-19 Lucy Chow Chow 24.07
6 2019-06-07 Max Labrador 28.35
7 2019-03-30 Cooper Schnauzer 16.91
10 2019-01-04 Bernie St. Bernard 74.98

DATA MANIPULATION WITH PANDAS


Easy as 1, 2, 3
unique_dogs["breed"].value_counts()

Labrador 2
Schnauzer 1
St. Bernard 1
Chow Chow 2
Poodle 1
Chihuahua 1
Name: breed, dtype: int64

unique_dogs["breed"].value_counts(sort=True)

Labrador 2
Chow Chow 2
Schnauzer 1
St. Bernard 1
Poodle 1
Chihuahua 1
Name: breed, dtype: int64

DATA MANIPULATION WITH PANDAS


Proportions
unique_dogs["breed"].value_counts(normalize=True)

Labrador 0.250
Chow Chow 0.250
Schnauzer 0.125
St. Bernard 0.125
Poodle 0.125
Chihuahua 0.125
Name: breed, dtype: float64
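To see why the de-duplication step matters, the sketch below counts breeds before and after dropping duplicate (name, breed) pairs. The six-row `vet_visits` table here is made up for illustration and is much smaller than the one in the slides.

```python
import pandas as pd

# Hypothetical, tiny vet_visits table: several visits per dog
vet_visits = pd.DataFrame({
    "name": ["Bella", "Max", "Max", "Bella", "Max", "Lucy"],
    "breed": ["Labrador", "Labrador", "Chow Chow",
              "Labrador", "Labrador", "Chow Chow"],
})

# Counting visits directly counts each dog once per visit
naive_counts = vet_visits["breed"].value_counts()

# Keep one row per distinct (name, breed) pair, then count dogs
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
counts = unique_dogs["breed"].value_counts()
props = unique_dogs["breed"].value_counts(normalize=True)

print(naive_counts)
print(counts)
print(props)
```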

DATA MANIPULATION WITH PANDAS


Grouped summary statistics
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
Summaries by group
dogs[dogs["color"] == "Black"]["weight_kg"].mean()
dogs[dogs["color"] == "Brown"]["weight_kg"].mean()
dogs[dogs["color"] == "White"]["weight_kg"].mean()
dogs[dogs["color"] == "Gray"]["weight_kg"].mean()
dogs[dogs["color"] == "Tan"]["weight_kg"].mean()

26.5
24.0
74.0
17.0
2.0

DATA MANIPULATION WITH PANDAS


Grouped summaries
dogs.groupby("color")["weight_kg"].mean()

color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0
Name: weight_kg, dtype: float64

DATA MANIPULATION WITH PANDAS


Multiple grouped summaries
dogs.groupby("color")["weight_kg"].agg([min, max, sum])

min max sum


color
Black 24 29 53
Brown 24 24 48
Gray 17 17 17
Tan 2 2 2
White 74 74 74

DATA MANIPULATION WITH PANDAS


Grouping by multiple variables
dogs.groupby(["color", "breed"])["weight_kg"].mean()

color breed
Black Labrador 29.0
Poodle 24.0
Brown Chow Chow 24.0
Labrador 24.0
Gray Schnauzer 17.0
Tan Chihuahua 2.0
White St. Bernard 74.0
Name: weight_kg, dtype: float64

DATA MANIPULATION WITH PANDAS


Many groups, many summaries
dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()

weight_kg height_cm
color breed
Black Labrador 29 59
Poodle 24 43
Brown Chow Chow 24 46
Labrador 24 56
Gray Schnauzer 17 49
Tan Chihuahua 2 18
White St. Bernard 74 77
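A compact sketch of the grouped summaries above on a stand-in DataFrame. It passes the statistics to `.agg()` as strings ("min", "max", "mean"), which is an alternative spelling to the built-in functions used in the slides.

```python
import pandas as pd

# Stand-in for the dogs DataFrame, restricted to the columns used here
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# One row per color, one column per statistic
by_color = dogs.groupby("color")["weight_kg"].agg(["min", "max", "mean"])

# Grouping by two columns gives a two-level index on the result
by_color_breed = dogs.groupby(["color", "breed"])["weight_kg"].mean()

print(by_color)
print(by_color_breed)
```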

DATA MANIPULATION WITH PANDAS


Pivot tables
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
Group by to pivot table
dogs.groupby("color")["weight_kg"].mean()

color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0
Name: weight_kg, dtype: float64

dogs.pivot_table(values="weight_kg", index="color")

weight_kg
color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0

DATA MANIPULATION WITH PANDAS


Different statistics
import numpy as np
dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

weight_kg
color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0

DATA MANIPULATION WITH PANDAS


Multiple statistics
dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])

mean median
weight_kg weight_kg
color
Black 26.5 26.5
Brown 24.0 24.0
Gray 17.0 17.0
Tan 2.0 2.0
White 74.0 74.0

DATA MANIPULATION WITH PANDAS


Pivot on two variables
dogs.groupby(["color", "breed"])["weight_kg"].mean()

dogs.pivot_table(values="weight_kg", index="color", columns="breed")

breed Chihuahua Chow Chow Labrador Poodle Schnauzer St. Bernard


color
Black NaN NaN 29.0 24.0 NaN NaN
Brown NaN 24.0 24.0 NaN NaN NaN
Gray NaN NaN NaN NaN 17.0 NaN
Tan 2.0 NaN NaN NaN NaN NaN
White NaN NaN NaN NaN NaN 74.0

DATA MANIPULATION WITH PANDAS


Filling missing values in pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

breed Chihuahua Chow Chow Labrador Poodle Schnauzer St. Bernard


color
Black 0 0 29 24 0 0
Brown 0 24 24 0 0 0
Gray 0 0 0 0 17 0
Tan 2 0 0 0 0 0
White 0 0 0 0 0 74

DATA MANIPULATION WITH PANDAS


Summing with pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed",
fill_value=0, margins=True)

breed Chihuahua Chow Chow Labrador Poodle Schnauzer St. Bernard All
color
Black 0 0 29 24 0 0 26.500000
Brown 0 24 24 0 0 0 24.000000
Gray 0 0 0 0 17 0 17.000000
Tan 2 0 0 0 0 0 2.000000
White 0 0 0 0 0 74 74.000000
All 2 24 26 24 17 74 27.714286
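The relationship between the two-variable pivot table and the two-variable groupby can be checked directly. The sketch below, on a stand-in DataFrame, reshapes the grouped means with `.unstack()` and compares them with the pivot table; `.unstack()` is not covered in the slides and is used here only to make the comparison.

```python
import pandas as pd

# Stand-in for the dogs DataFrame, restricted to the columns used here
dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
})

# Mean weight per color (rows) and breed (columns); empty cells become 0
pivot = dogs.pivot_table(values="weight_kg", index="color",
                         columns="breed", fill_value=0)

# Same numbers via groupby, with the breed level moved into the columns
grouped = (dogs.groupby(["color", "breed"])["weight_kg"]
               .mean()
               .unstack(fill_value=0))

print(pivot)
print((pivot.to_numpy() == grouped.to_numpy()).all())
```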

DATA MANIPULATION WITH PANDAS


Explicit indexes
DATA MANIPULATION WITH PANDAS

Richie Cotton
Data Evangelist at DataCamp
The dog dataset, revisited
print(dogs)

name breed color height_cm weight_kg


0 Bella Labrador Brown 56 25
1 Charlie Poodle Black 43 23
2 Lucy Chow Chow Brown 46 22
3 Cooper Schnauzer Grey 49 17
4 Max Labrador Black 59 29
5 Stella Chihuahua Tan 18 2
6 Bernie St. Bernard White 77 74

DATA MANIPULATION WITH PANDAS


.columns and .index
dogs.columns

Index(['name', 'breed', 'color', 'height_cm', 'weight_kg'], dtype='object')

dogs.index

RangeIndex(start=0, stop=7, step=1)

DATA MANIPULATION WITH PANDAS


Setting a column as the index
dogs_ind = dogs.set_index("name")
print(dogs_ind)

breed color height_cm weight_kg


name
Bella Labrador Brown 56 25
Charlie Poodle Black 43 23
Lucy Chow Chow Brown 46 22
Cooper Schnauzer Grey 49 17
Max Labrador Black 59 29
Stella Chihuahua Tan 18 2
Bernie St. Bernard White 77 74

DATA MANIPULATION WITH PANDAS


Removing an index
dogs_ind.reset_index()

name breed color height_cm weight_kg


0 Bella Labrador Brown 56 25
1 Charlie Poodle Black 43 23
2 Lucy Chow Chow Brown 46 22
3 Cooper Schnauzer Grey 49 17
4 Max Labrador Black 59 29
5 Stella Chihuahua Tan 18 2
6 Bernie St. Bernard White 77 74

DATA MANIPULATION WITH PANDAS


Dropping an index
dogs_ind.reset_index(drop=True)

breed color height_cm weight_kg


0 Labrador Brown 56 25
1 Poodle Black 43 23
2 Chow Chow Brown 46 22
3 Schnauzer Grey 49 17
4 Labrador Black 59 29
5 Chihuahua Tan 18 2
6 St. Bernard White 77 74

DATA MANIPULATION WITH PANDAS


Indexes make subsetting simpler
dogs[dogs["name"].isin(["Bella", "Stella"])]

name breed color height_cm weight_kg


0 Bella Labrador Brown 56 25
5 Stella Chihuahua Tan 18 2

dogs_ind.loc[["Bella", "Stella"]]

breed color height_cm weight_kg


name
Bella Labrador Brown 56 25
Stella Chihuahua Tan 18 2

DATA MANIPULATION WITH PANDAS


Index values don't need to be unique
dogs_ind2 = dogs.set_index("breed")
print(dogs_ind2)

name color height_cm weight_kg


breed
Labrador Bella Brown 56 25
Poodle Charlie Black 43 23
Chow Chow Lucy Brown 46 22
Schnauzer Cooper Grey 49 17
Labrador Max Black 59 29
Chihuahua Stella Tan 18 2
St. Bernard Bernie White 77 74

DATA MANIPULATION WITH PANDAS


Subsetting on duplicated index values
dogs_ind2.loc["Labrador"]

name color height_cm weight_kg


breed
Labrador Bella Brown 56 25
Labrador Max Black 59 29

DATA MANIPULATION WITH PANDAS


Multi-level indexes a.k.a. hierarchical indexes
dogs_ind3 = dogs.set_index(["breed", "color"])
print(dogs_ind3)

name height_cm weight_kg


breed color
Labrador Brown Bella 56 25
Poodle Black Charlie 43 23
Chow Chow Brown Lucy 46 22
Schnauzer Grey Cooper 49 17
Labrador Black Max 59 29
Chihuahua Tan Stella 18 2
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Subset the outer level with a list
dogs_ind3.loc[["Labrador", "Chihuahua"]]

name height_cm weight_kg


breed color
Labrador Brown Bella 56 25
Black Max 59 29
Chihuahua Tan Stella 18 2

DATA MANIPULATION WITH PANDAS


Subset inner levels with a list of tuples
dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

name height_cm weight_kg


breed color
Labrador Brown Bella 56 25
Chihuahua Tan Stella 18 2
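A runnable sketch of the two subsetting styles above, on a stand-in DataFrame with only the columns needed. A list of plain labels selects on the outer level; a list of tuples pins down both levels.

```python
import pandas as pd

# Stand-in for the dogs DataFrame from this chapter
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Grey", "Black", "Tan", "White"],
})
dogs_ind3 = dogs.set_index(["breed", "color"])

# Outer level only: every Labrador and every Chihuahua
outer = dogs_ind3.loc[["Labrador", "Chihuahua"]]

# Both levels: exact (breed, color) combinations
inner = dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

print(outer)
print(inner)
```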

DATA MANIPULATION WITH PANDAS


Sorting by index values
dogs_ind3.sort_index()

name height_cm weight_kg


breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Controlling sort_index
dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])

name height_cm weight_kg


breed color
Poodle Black Charlie 43 23
Labrador Black Max 59 29
Brown Bella 56 25
Chow Chow Brown Lucy 46 22
Schnauzer Grey Cooper 49 17
Chihuahua Tan Stella 18 2
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Now you have two problems
Index values are just data
Indexes violate "tidy data" principles

You need to learn two syntaxes

DATA MANIPULATION WITH PANDAS


Temperature dataset
date city country avg_temp_c
0 2000-01-01 Abidjan Côte D'Ivoire 27.293
1 2000-02-01 Abidjan Côte D'Ivoire 27.685
2 2000-03-01 Abidjan Côte D'Ivoire 29.061
3 2000-04-01 Abidjan Côte D'Ivoire 28.162
4 2000-05-01 Abidjan Côte D'Ivoire 27.547

DATA MANIPULATION WITH PANDAS


Slicing and subsetting with .loc and .iloc
DATA MANIPULATION WITH PANDAS

Richie Cotton
Data Evangelist at DataCamp
Slicing lists
breeds = ["Labrador", "Poodle",
          "Chow Chow", "Schnauzer",
          "Labrador", "Chihuahua",
          "St. Bernard"]
breeds

['Labrador',
 'Poodle',
 'Chow Chow',
 'Schnauzer',
 'Labrador',
 'Chihuahua',
 'St. Bernard']

breeds[2:5]

['Chow Chow', 'Schnauzer', 'Labrador']

breeds[:3]

['Labrador', 'Poodle', 'Chow Chow']

breeds[:]

['Labrador','Poodle','Chow Chow','Schnauzer',
'Labrador','Chihuahua','St. Bernard']

DATA MANIPULATION WITH PANDAS


Sort the index before you slice
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()
print(dogs_srt)

name height_cm weight_kg


breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Slicing the outer index level
dogs_srt.loc["Chow Chow":"Poodle"]

name height_cm weight_kg
breed color
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23

The final value "Poodle" is included

Full dataset

name height_cm weight_kg
breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Slicing the inner index levels badly
dogs_srt.loc["Tan":"Grey"]

Empty DataFrame
Columns: [name, height_cm, weight_kg]
Index: []

Full dataset

name height_cm weight_kg
breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74
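The empty result arises because plain string labels in a slice are matched against the outer level (breed), where "Tan" and "Grey" do not occur. The sketch below reproduces this on a stand-in DataFrame and contrasts it with a slice whose endpoints are (breed, color) tuples.

```python
import pandas as pd

# Stand-in for the dogs DataFrame from this chapter
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Grey", "Black", "Tan", "White"],
})
dogs_srt = dogs.set_index(["breed", "color"]).sort_index()

# Colors are compared against breed names, so nothing matches
bad = dogs_srt.loc["Tan":"Grey"]

# Tuples give a start and end point that specify both levels
good = dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]

print(bad.empty)
print(good["name"].tolist())
```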

DATA MANIPULATION WITH PANDAS


Slicing the inner index levels correctly
dogs_srt.loc[
    ("Labrador", "Brown"):("Schnauzer", "Grey")]

name height_cm weight_kg
breed color
Labrador Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17

Full dataset

name height_cm weight_kg
breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Slicing columns
dogs_srt.loc[:, "name":"height_cm"]

name height_cm
breed color
Chihuahua Tan Stella 18
Chow Chow Brown Lucy 46
Labrador Black Max 59
Brown Bella 56
Poodle Black Charlie 43
Schnauzer Grey Cooper 49
St. Bernard White Bernie 77

Full dataset

name height_cm weight_kg
breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Slice twice
dogs_srt.loc[
    ("Labrador", "Brown"):("Schnauzer", "Grey"),
    "name":"height_cm"]

name height_cm
breed color
Labrador Brown Bella 56
Poodle Black Charlie 43
Schnauzer Grey Cooper 49

Full dataset

name height_cm weight_kg
breed color
Chihuahua Tan Stella 18 2
Chow Chow Brown Lucy 46 22
Labrador Black Max 59 29
Brown Bella 56 25
Poodle Black Charlie 43 23
Schnauzer Grey Cooper 49 17
St. Bernard White Bernie 77 74

DATA MANIPULATION WITH PANDAS


Dog days
dogs = dogs.set_index("date_of_birth").sort_index()
print(dogs)

name breed color height_cm weight_kg


date_of_birth
2011-12-11 Cooper Schnauzer Grey 49 17
2013-07-01 Bella Labrador Brown 56 25
2014-08-25 Lucy Chow Chow Brown 46 22
2015-04-20 Stella Chihuahua Tan 18 2
2016-09-16 Charlie Poodle Black 43 23
2017-01-20 Max Labrador Black 59 29
2018-02-27 Bernie St. Bernard White 77 74

DATA MANIPULATION WITH PANDAS


Slicing by dates
# Get dogs with date_of_birth between 2014-08-25 and 2016-09-16
dogs.loc["2014-08-25":"2016-09-16"]

name breed color height_cm weight_kg


date_of_birth
2014-08-25 Lucy Chow Chow Brown 46 22
2015-04-20 Stella Chihuahua Tan 18 2
2016-09-16 Charlie Poodle Black 43 23

DATA MANIPULATION WITH PANDAS


Slicing by partial dates
# Get dogs with date_of_birth between 2014-01-01 and 2016-12-31
dogs.loc["2014":"2016"]

name breed color height_cm weight_kg


date_of_birth
2014-08-25 Lucy Chow Chow Brown 46 22
2015-04-20 Stella Chihuahua Tan 18 2
2016-09-16 Charlie Poodle Black 43 23
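A sketch of date slicing on a stand-in DataFrame. It assumes the date_of_birth column has been converted with `pd.to_datetime`, so that the index is a DatetimeIndex; the slides do not show how the column was parsed, and slicing by a bare year relies on that datetime index.

```python
import pandas as pd

# Stand-in for the dogs DataFrame; dates are parsed to datetimes (assumption)
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "date_of_birth": pd.to_datetime([
        "2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
        "2017-01-20", "2015-04-20", "2018-02-27",
    ]),
})
dogs = dogs.set_index("date_of_birth").sort_index()

# Full dates: both endpoints are included
exact = dogs.loc["2014-08-25":"2016-09-16"]

# Partial dates: "2016" as an endpoint covers the whole of 2016
by_year = dogs.loc["2014":"2016"]

print(exact["name"].tolist())
print(by_year["name"].tolist())
```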

DATA MANIPULATION WITH PANDAS


Subsetting by row/column number
print(dogs.iloc[2:5, 1:4])

breed color height_cm
2 Chow Chow Brown 46
3 Schnauzer Grey 49
4 Labrador Black 59

Full dataset

name breed color height_cm weight_kg
0 Bella Labrador Brown 56 25
1 Charlie Poodle Black 43 23
2 Lucy Chow Chow Brown 46 22
3 Cooper Schnauzer Grey 49 17
4 Max Labrador Black 59 29
5 Stella Chihuahua Tan 18 2
6 Bernie St. Bernard White 77 74
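One difference worth keeping in mind: `.loc` slices by label and includes the final value, while `.iloc` slices by position and, like list slicing, excludes the final position. A minimal sketch on a stand-in DataFrame indexed by name:

```python
import pandas as pd

# Stand-in for the dogs DataFrame, indexed by name
dogs_ind = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
}).set_index("name")

by_label = dogs_ind.loc["Lucy":"Max"]   # label slice: "Max" is included
by_pos = dogs_ind.iloc[2:5]             # position slice: row 5 is excluded

print(list(by_label.index))
print(list(by_pos.index))
```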

DATA MANIPULATION WITH PANDAS


Working with pivot tables
DATA MANIPULATION WITH PANDAS

Richie Cotton
Data Evangelist at DataCamp
A bigger dog dataset
print(dog_pack)

breed color height_cm weight_kg


0 Boxer Brown 62.64 30.4
1 Poodle Black 46.41 20.4
2 Beagle Brown 36.39 12.4
3 Chihuahua Tan 19.70 1.6
4 Labrador Tan 54.44 36.1
.. ... ... ... ...
87 Boxer Gray 58.13 29.9
88 St. Bernard White 70.13 69.4
89 Poodle Gray 51.30 20.4
90 Beagle White 38.81 8.8
91 Beagle Black 33.40 13.5

DATA MANIPULATION WITH PANDAS


Pivoting the dog pack
dogs_height_by_breed_vs_color = dog_pack.pivot_table(
"height_cm", index="breed", columns="color")
print(dogs_height_by_breed_vs_color)

color Black Brown Gray Tan White


breed
Beagle 34.500000 36.4500 36.313333 35.740000 38.810000
Boxer 57.203333 62.6400 58.280000 62.310000 56.360000
Chihuahua 18.555000 NaN 21.660000 20.096667 17.933333
Chow Chow 51.262500 50.4800 NaN 53.497500 54.413333
Dachshund 21.186667 19.7250 NaN 19.375000 20.660000
Labrador 57.125000 NaN NaN 55.190000 55.310000
Poodle 48.036000 57.1300 56.645000 NaN 44.740000
St. Bernard 63.920000 65.8825 67.640000 68.334000 67.495000

DATA MANIPULATION WITH PANDAS


.loc[] + slicing is a power combo
dogs_height_by_breed_vs_color.loc["Chow Chow":"Poodle"]

color Black Brown Gray Tan White


breed
Chow Chow 51.262500 50.480 NaN 53.4975 54.413333
Dachshund 21.186667 19.725 NaN 19.3750 20.660000
Labrador 57.125000 NaN NaN 55.1900 55.310000
Poodle 48.036000 57.130 56.645 NaN 44.740000

DATA MANIPULATION WITH PANDAS


The axis argument
dogs_height_by_breed_vs_color.mean(axis="index")

color
Black 43.973563
Brown 48.717917
Gray 48.107667
Tan 44.934738
White 44.465208
dtype: float64

DATA MANIPULATION WITH PANDAS


Calculating summary stats across columns
dogs_height_by_breed_vs_color.mean(axis="columns")

breed
Beagle 36.362667
Boxer 59.358667
Chihuahua 19.561250
Chow Chow 52.413333
Dachshund 20.236667
Labrador 55.875000
Poodle 51.637750
St. Bernard 66.654300
dtype: float64
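The two `axis` values are easy to mix up, so here is a small sketch on a made-up 2x2 table of heights (the numbers are illustrative, not taken from dog_pack). `axis="index"` collapses the rows and gives one value per column; `axis="columns"` collapses the columns and gives one value per row. Missing values are skipped by default.

```python
import pandas as pd

# Made-up pivot-style table: breeds as rows, colors as columns
heights = pd.DataFrame(
    {"Black": [34.5, 57.0], "Brown": [36.5, None]},
    index=["Beagle", "Boxer"],
)

col_means = heights.mean(axis="index")    # one mean per color
row_means = heights.mean(axis="columns")  # one mean per breed

print(col_means)
print(row_means)
```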

DATA MANIPULATION WITH PANDAS


Visualizing your data
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
Histograms
import matplotlib.pyplot as plt

dog_pack["height_cm"].hist()

plt.show()

DATA MANIPULATION WITH PANDAS


Histograms
dog_pack["height_cm"].hist(bins=20)
plt.show()

dog_pack["height_cm"].hist(bins=5)
plt.show()

DATA MANIPULATION WITH PANDAS


Bar plots
avg_weight_by_breed = dog_pack.groupby("breed")["weight_kg"].mean()
print(avg_weight_by_breed)

breed
Beagle 10.636364
Boxer 30.620000
Chihuahua 1.491667
Chow Chow 22.535714
Dachshund 9.975000
Labrador 31.850000
Poodle 20.400000
St. Bernard 71.576923
Name: weight_kg, dtype: float64

DATA MANIPULATION WITH PANDAS


Bar plots
avg_weight_by_breed.plot(kind="bar")
plt.show()

avg_weight_by_breed.plot(kind="bar",
    title="Mean Weight by Dog Breed")
plt.show()

DATA MANIPULATION WITH PANDAS


Line plots
sully.head()

date weight_kg
0 2019-01-31 36.1
1 2019-02-28 35.3
2 2019-03-31 32.0
3 2019-04-30 32.9
4 2019-05-31 32.0

sully.plot(x="date",
    y="weight_kg",
    kind="line")
plt.show()

DATA MANIPULATION WITH PANDAS


Rotating axis labels
sully.plot(x="date", y="weight_kg", kind="line", rot=45)
plt.show()

DATA MANIPULATION WITH PANDAS


Scatter plots
dog_pack.plot(x="height_cm", y="weight_kg", kind="scatter")
plt.show()

DATA MANIPULATION WITH PANDAS


Layering plots
dog_pack[dog_pack["sex"]=="F"]["height_cm"].hist()
dog_pack[dog_pack["sex"]=="M"]["height_cm"].hist()
plt.show()

DATA MANIPULATION WITH PANDAS


Add a legend
dog_pack[dog_pack["sex"]=="F"]["height_cm"].hist()
dog_pack[dog_pack["sex"]=="M"]["height_cm"].hist()
plt.legend(["F", "M"])
plt.show()

DATA MANIPULATION WITH PANDAS


Transparency
dog_pack[dog_pack["sex"]=="F"]["height_cm"].hist(alpha=0.7)
dog_pack[dog_pack["sex"]=="M"]["height_cm"].hist(alpha=0.7)
plt.legend(["F", "M"])
plt.show()

DATA MANIPULATION WITH PANDAS


Avocados
print(avocados)

date type year avg_price size nb_sold


0 2015-12-27 conventional 2015 0.95 small 9626901.09
1 2015-12-20 conventional 2015 0.98 small 8710021.76
2 2015-12-13 conventional 2015 0.93 small 9855053.66
... ... ... ... ... ... ...
1011 2018-01-21 organic 2018 1.63 extra_large 1490.02
1012 2018-01-14 organic 2018 1.59 extra_large 1580.01
1013 2018-01-07 organic 2018 1.51 extra_large 1289.07

[1014 rows x 6 columns]

DATA MANIPULATION WITH PANDAS


Missing values
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
What's a missing value?
Name Breed Color Height (cm) Weight (kg) Date of Birth
Bella Labrador Brown 56 25 2013-07-01
Charlie Poodle Black 43 23 2016-09-16
Lucy Chow Chow Brown 46 22 2014-08-25
Cooper Schnauzer Gray 49 17 2011-12-11
Max Labrador Black 59 29 2017-01-20
Stella Chihuahua Tan 18 2 2015-04-20
Bernie St. Bernard White 77 74 2018-02-27

DATA MANIPULATION WITH PANDAS


What's a missing value?
Name Breed Color Height (cm) Weight (kg) Date of Birth
Bella Labrador Brown 56 ? 2013-07-01
Charlie Poodle Black 43 23 2016-09-16
Lucy Chow Chow Brown 46 22 2014-08-25
Cooper Schnauzer Gray 49 ? 2011-12-11
Max Labrador Black 59 29 2017-01-20
Stella Chihuahua Tan 18 2 2015-04-20
Bernie St. Bernard White 77 74 2018-02-27

DATA MANIPULATION WITH PANDAS


Missing values in pandas DataFrames
print(dogs)

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 NaN 2013-07-01
1 Charlie Poodle Black 43 24.0 2016-09-16
2 Lucy Chow Chow Brown 46 24.0 2014-08-25
3 Cooper Schnauzer Gray 49 NaN 2011-12-11
4 Max Labrador Black 59 29.0 2017-01-20
5 Stella Chihuahua Tan 18 2.0 2015-04-20
6 Bernie St. Bernard White 77 74.0 2018-02-27

DATA MANIPULATION WITH PANDAS


Detecting missing values
dogs.isna()

name breed color height_cm weight_kg date_of_birth


0 False False False False True False
1 False False False False False False
2 False False False False False False
3 False False False False True False
4 False False False False False False
5 False False False False False False
6 False False False False False False

DATA MANIPULATION WITH PANDAS


Detecting any missing values
dogs.isna().any()

name False
breed False
color False
height_cm False
weight_kg True
date_of_birth False
dtype: bool

DATA MANIPULATION WITH PANDAS


Counting missing values
dogs.isna().sum()

name 0
breed 0
color 0
height_cm 0
weight_kg 2
date_of_birth 0
dtype: int64

DATA MANIPULATION WITH PANDAS


Plotting missing values
import matplotlib.pyplot as plt
dogs.isna().sum().plot(kind="bar")
plt.show()

DATA MANIPULATION WITH PANDAS


Removing missing values
dogs.dropna()

name breed color height_cm weight_kg date_of_birth


1 Charlie Poodle Black 43 24.0 2016-09-16
2 Lucy Chow Chow Brown 46 24.0 2014-08-25
4 Max Labrador Black 59 29.0 2017-01-20
5 Stella Chihuahua Tan 18 2.0 2015-04-20
6 Bernie St. Bernard White 77 74.0 2018-02-27

DATA MANIPULATION WITH PANDAS


Replacing missing values
dogs.fillna(0)

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 0.0 2013-07-01
1 Charlie Poodle Black 43 24.0 2016-09-16
2 Lucy Chow Chow Brown 46 24.0 2014-08-25
3 Cooper Schnauzer Gray 49 0.0 2011-12-11
4 Max Labrador Black 59 29.0 2017-01-20
5 Stella Chihuahua Tan 18 2.0 2015-04-20
6 Bernie St. Bernard White 77 74.0 2018-02-27
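Filling with 0 is only one option, and for a weight column it may distort later summaries. The sketch below, on a stand-in DataFrame, counts the missing weights, drops them, and alternatively fills them with the column mean; mean imputation is shown purely as an illustration and is not recommended by the slides.

```python
import pandas as pd

# Stand-in for the dogs DataFrame with two missing weights
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "weight_kg": [None, 24.0, 24.0, None, 29.0, 2.0, 74.0],
})

n_missing = dogs["weight_kg"].isna().sum()

# Option 1: drop rows where weight_kg is missing
dropped = dogs.dropna(subset=["weight_kg"])

# Option 2: fill missing weights with the mean of the observed weights
mean_weight = dogs["weight_kg"].mean()
filled = dogs.fillna({"weight_kg": mean_weight})

print(n_missing)
print(len(dropped))
print(filled)
```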

DATA MANIPULATION WITH PANDAS


Creating DataFrames
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
Dictionaries
my_dict = {
    "key1": value1,
    "key2": value2,
    "key3": value3
}

my_dict["key1"]

value1

my_dict = {
    "title": "Charlotte's Web",
    "author": "E.B. White",
    "published": 1952
}

my_dict["title"]

Charlotte's Web

DATA MANIPULATION WITH PANDAS


Creating DataFrames
From a list of dictionaries From a dictionary of lists

Constructed row by row Constructed column by column

DATA MANIPULATION WITH PANDAS


List of dictionaries - by row
name breed height (cm) weight (kg) date of birth
Ginger Dachshund 22 10 2019-03-14
Scout Dalmatian 59 25 2019-05-09

list_of_dicts = [
{"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
"weight_kg": 10, "date_of_birth": "2019-03-14"},
{"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
"weight_kg": 25, "date_of_birth": "2019-05-09"}
]

DATA MANIPULATION WITH PANDAS


List of dictionaries - by row
name breed height (cm) weight (kg) date of birth
Ginger Dachshund 22 10 2019-03-14
Scout Dalmatian 59 25 2019-05-09

new_dogs = pd.DataFrame(list_of_dicts)
print(new_dogs)

name breed height_cm weight_kg date_of_birth


0 Ginger Dachshund 22 10 2019-03-14
1 Scout Dalmatian 59 25 2019-05-09

DATA MANIPULATION WITH PANDAS


Dictionary of lists - by column
name breed height weight date of birth
Ginger Dachshund 22 10 2019-03-14
Scout Dalmatian 59 25 2019-05-09

Key = column name
Value = list of column values

dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14",
                      "2019-05-09"]
}
new_dogs = pd.DataFrame(dict_of_lists)

DATA MANIPULATION WITH PANDAS


Dictionary of lists - by column
name breed height (cm) weight (kg) date of birth
Ginger Dachshund 22 10 2019-03-14
Scout Dalmatian 59 25 2019-05-09

print(new_dogs)

name breed height_cm weight_kg date_of_birth


0 Ginger Dachshund 22 10 2019-03-14
1 Scout Dalmatian 59 25 2019-05-09
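Both construction routes should yield the same DataFrame. The sketch below builds it each way from the data in the slides and compares the results; the variable names `from_rows` and `from_cols` are introduced here for illustration.

```python
import pandas as pd

# Row by row: one dictionary per dog
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
     "weight_kg": 10, "date_of_birth": "2019-03-14"},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
     "weight_kg": 25, "date_of_birth": "2019-05-09"},
]

# Column by column: one list per column
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14", "2019-05-09"],
}

from_rows = pd.DataFrame(list_of_dicts)
from_cols = pd.DataFrame(dict_of_lists)

print(from_rows.equals(from_cols))
```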

DATA MANIPULATION WITH PANDAS


Reading and writing CSVs
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
What's a CSV file?
CSV = comma-separated values

Designed for DataFrame-like data

Most database and spreadsheet programs can use them or create them

DATA MANIPULATION WITH PANDAS


Example CSV file
new_dogs.csv

name,breed,height_cm,weight_kg,date_of_birth
Ginger,Dachshund,22,10,2019-03-14
Scout,Dalmatian,59,25,2019-05-09

DATA MANIPULATION WITH PANDAS


CSV to DataFrame
import pandas as pd
new_dogs = pd.read_csv("new_dogs.csv")
print(new_dogs)

name breed height_cm weight_kg date_of_birth


0 Ginger Dachshund 22 10 2019-03-14
1 Scout Dalmatian 59 25 2019-05-09

DATA MANIPULATION WITH PANDAS


DataFrame manipulation
new_dogs["bmi"] = new_dogs["weight_kg"] / (new_dogs["height_cm"] / 100) ** 2
print(new_dogs)

name breed height_cm weight_kg date_of_birth bmi


0 Ginger Dachshund 22 10 2019-03-14 206.611570
1 Scout Dalmatian 59 25 2019-05-09 71.818443

DATA MANIPULATION WITH PANDAS


DataFrame to CSV
new_dogs.to_csv("new_dogs_with_bmi.csv", index=False)

new_dogs_with_bmi.csv

name,breed,height_cm,weight_kg,date_of_birth,bmi
Ginger,Dachshund,22,10,2019-03-14,206.611570
Scout,Dalmatian,59,25,2019-05-09,71.818443
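The round trip can be tried without touching the file system. The sketch below reads the CSV text from an in-memory buffer (`io.StringIO`, used here instead of a file path) and writes it back to a string, with and without the row index. By default `to_csv` also writes the index as an unnamed first column; `index=False` leaves it out.

```python
import io
import pandas as pd

# CSV text standing in for the new_dogs.csv file
csv_text = (
    "name,breed,height_cm,weight_kg,date_of_birth\n"
    "Ginger,Dachshund,22,10,2019-03-14\n"
    "Scout,Dalmatian,59,25,2019-05-09\n"
)

new_dogs = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_of_birth"])
new_dogs["bmi"] = new_dogs["weight_kg"] / (new_dogs["height_cm"] / 100) ** 2

# With no path argument, to_csv returns the CSV as a string
with_index = new_dogs.to_csv()
without_index = new_dogs.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```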

DATA MANIPULATION WITH PANDAS


Wrap-up
DATA MANIPULATION WITH PANDAS

Maggie Matsui
Senior Content Developer at DataCamp
Recap
Chapter 1
Subsetting and sorting
Adding new columns

Chapter 2
Aggregating and grouping
Summary statistics

Chapter 3
Indexing
Slicing

Chapter 4
Visualizations
Reading and writing CSVs

DATA MANIPULATION WITH PANDAS


More to learn
Joining Data with pandas
Streamlined Data Ingestion with pandas
Analyzing Police Activity with pandas
Analyzing Marketing Campaigns with pandas

DATA MANIPULATION WITH PANDAS


Congratulations!
DATA MANIPULATION WITH PANDAS
