INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
How you will learn
INTRODUCTION TO PYTHON
Python
INTRODUCTION TO PYTHON
IPython Shell
Execute Python commands
INTRODUCTION TO PYTHON
Python Script
Text files - .py
List of Python commands
INTRODUCTION TO PYTHON
DataCamp Interface
INTRODUCTION TO PYTHON
Variables and Types
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Variable
Specific, case-sensitive name
Call up value through variable name
1.79 m - 68.7 kg
height = 1.79
weight = 68.7
height
1.79
INTRODUCTION TO PYTHON
Calculate BMI
BMI = weight / height ** 2
height = 1.79
weight = 68.7
height
1.79
68.7 / 1.79 ** 2
21.4413
weight / height ** 2
21.4413
INTRODUCTION TO PYTHON
Reproducibility
height = 1.79
weight = 68.7
bmi = weight / height ** 2
print(bmi)
21.4413
INTRODUCTION TO PYTHON
Reproducibility
height = 1.79
weight = 74.2 # <-
bmi = weight / height ** 2
print(bmi)
23.1578
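A minimal sketch of why a script helps with reproducibility: once the formula lives in one place, rerunning it with a new weight is a one-line change. The function name bmi_from is an assumption for illustration and does not appear in the slides.

```python
# Hypothetical helper wrapping the BMI formula used on the slides.
def bmi_from(weight_kg, height_m):
    """Return body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2

# Same height, two different weights, same formula.
bmi_first = bmi_from(68.7, 1.79)
bmi_second = bmi_from(74.2, 1.79)

print(round(bmi_first, 4))
print(round(bmi_second, 4))
```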
INTRODUCTION TO PYTHON
Python Types
type(bmi)
float
day_of_week = 5
type(day_of_week)
int
INTRODUCTION TO PYTHON
Python Types (2)
x = "body mass index"
y = 'this works too'
type(y)
str
z = True
type(z)
bool
INTRODUCTION TO PYTHON
Python Types (3)
2 + 3
5
'ab' + 'cd'
'abcd'
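As a small illustration of the point above, the sketch below shows the same + operator doing different things depending on the operand types. The variable names are placeholders chosen for this example.

```python
# The + operator behaves differently depending on the types involved.
total = 2 + 3             # int + int: arithmetic addition
joined = 'ab' + 'cd'      # str + str: concatenation
flag_sum = True + False   # bool is a subclass of int, so this is 1 + 0

print(total, type(total))
print(joined, type(joined))
print(flag_sum, type(flag_sum))
```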
INTRODUCTION TO PYTHON
Python Lists
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Python Data Types
float - real numbers
int - integer numbers
height = 1.73
tall = True
INTRODUCTION TO PYTHON
Problem
Data Science: many data points
height1 = 1.73
height2 = 1.68
height3 = 1.71
height4 = 1.89
Inconvenient
INTRODUCTION TO PYTHON
Python List
[a, b, c]
INTRODUCTION TO PYTHON
List type
type(fam)
list
type(fam2)
list
Specific functionality
Specific behavior
INTRODUCTION TO PYTHON
Subsetting Lists
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Subsetting lists
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
fam
fam[3]
1.68
INTRODUCTION TO PYTHON
Subsetting lists
['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]
fam[6]
'dad'
fam[-1]
1.89
fam[7]
1.89
INTRODUCTION TO PYTHON
List slicing
fam
fam[3:5]
[1.68, 'mom']
fam[1:4]
INTRODUCTION TO PYTHON
List slicing
fam
fam[:4]
fam[5:]
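The slicing rules can be checked with a short sketch, assuming the same fam list as on the earlier slides. The start index is included and the end index is excluded; leaving one side out means "from the start" or "to the end". The variable names on the left are placeholders.

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

middle = fam[3:5]      # indexes 3 and 4; index 5 is excluded
early = fam[1:4]       # indexes 1, 2 and 3
first_four = fam[:4]   # from the start up to, but not including, index 4
from_five = fam[5:]    # from index 5 to the end of the list

print(middle)
print(early)
print(first_four)
print(from_five)
```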
INTRODUCTION TO PYTHON
Manipulating Lists
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
List Manipulation
Change list elements
Add list elements
INTRODUCTION TO PYTHON
Changing list elements
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
fam
fam[7] = 1.86
fam
INTRODUCTION TO PYTHON
Adding and removing elements
fam + ["me", 1.79]
INTRODUCTION TO PYTHON
Behind the scenes (1)
x = ["a", "b", "c"]
y = x
y[1] = "z"
y
INTRODUCTION TO PYTHON
Behind the scenes (2)
x = ["a", "b", "c"]
y = list(x)
y = x[:]
y[1] = "z"
x
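A minimal sketch contrasting the two "behind the scenes" cases: plain assignment makes a second name for the same list, while list(x) or x[:] builds a new list. The names alias and copied are placeholders for this example.

```python
x = ["a", "b", "c"]

alias = x            # no copy: both names refer to the same list object
alias[1] = "z"       # the change is visible through x as well
print(x)

copied = list(x)     # explicit copy; x[:] would work the same way
copied[0] = "q"      # only the copy changes
print(x)
print(copied)
```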
INTRODUCTION TO PYTHON
Functions
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Functions
Nothing new!
type()
INTRODUCTION TO PYTHON
Example
fam = [1.73, 1.68, 1.71, 1.89]
fam
max(fam)
1.89
tallest = max(fam)
tallest
1.89
INTRODUCTION TO PYTHON
round()
round(1.68, 1)
1.7
round(1.68)
2
INTRODUCTION TO PYTHON
round()
help(round)
round(number, ndigits=None)
Round a number to a given precision in decimal digits.
round(number)
round(number, ndigits)
INTRODUCTION TO PYTHON
Find functions
How to know?
Standard task -> probably function exists!
INTRODUCTION TO PYTHON
Methods
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Built-in Functions
Maximum of list: max()
Length of list or string: len()
Reversing a list: ?
INTRODUCTION TO PYTHON
Back 2 Basics
sister = "liz"
height = 1.73
INTRODUCTION TO PYTHON
list methods
fam
fam.count(1.73)
INTRODUCTION TO PYTHON
str methods
sister
'liz'
sister.capitalize()
'Liz'
sister.replace("z", "sa")
'lisa'
INTRODUCTION TO PYTHON
Methods
Everything = object
Objects have methods associated, depending on type
sister.replace("z", "sa")
'lisa'
fam.replace("mom", "mommy")
AttributeError: 'list' object has no attribute 'replace'
INTRODUCTION TO PYTHON
Methods
sister.index("z")
fam.index("mom")
INTRODUCTION TO PYTHON
Methods (2)
fam
fam.append("me")
fam
fam.append(1.79)
fam
['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'me', 1.79]
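The method calls from these slides can be collected into one runnable sketch, assuming the fam and sister values used earlier. The names on the left-hand side are placeholders. Note that append() changes the list in place, while the str methods return new strings.

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
sister = "liz"

# list methods
mom_position = fam.index("mom")     # position of the first match
count_173 = fam.count(1.73)         # how many times 1.73 occurs

# str methods return new strings; sister itself is unchanged
capitalized = sister.capitalize()
replaced = sister.replace("z", "sa")

# append() modifies fam in place
fam.append("me")
fam.append(1.79)

# Methods depend on the type: lists have index() but no replace()
list_has_replace = hasattr(fam, "replace")

print(mom_position, count_173, capitalized, replaced, list_has_replace)
print(fam)
```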
INTRODUCTION TO PYTHON
Summary
Functions
type(fam)
list
fam.index("dad")
INTRODUCTION TO PYTHON
Packages
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Motivation
Functions and methods are powerful
Maintenance problem
INTRODUCTION TO PYTHON
Packages
Directory of Python Scripts
Thousands of packages
available
NumPy
Matplotlib
scikit-learn
INTRODUCTION TO PYTHON
Install package
https://pip.pypa.io/en/stable/installation/
Download get-pip.py
Terminal:
python3 get-pip.py
INTRODUCTION TO PYTHON
Import package
import numpy
numpy.array([1, 2, 3])
array([1, 2, 3])
import numpy as np
np.array([1, 2, 3])
array([1, 2, 3])
INTRODUCTION TO PYTHON
from numpy import array
my_script.py
...
fam_ext = fam + ["me", 1.79]
...
print(str(len(fam_ext)) + " elements in fam_ext")
...
np_fam = array(fam_ext)
INTRODUCTION TO PYTHON
import numpy
import numpy as np
...
fam_ext = fam + ["me", 1.79]
...
print(str(len(fam_ext)) + " elements in fam_ext")
...
np_fam = np.array(fam_ext) # Clearly using NumPy
INTRODUCTION TO PYTHON
NumPy
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Lists Recap
Powerful
Collection of values
Speed
INTRODUCTION TO PYTHON
Illustration
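height = [1.73, 1.68, 1.71, 1.89, 1.79]
height
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
weight
weight / height ** 2
TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'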
INTRODUCTION TO PYTHON
Solution: NumPy
Numeric Python
Alternative to Python List: NumPy Array
Installation
In the terminal: pip3 install numpy
INTRODUCTION TO PYTHON
NumPy
import numpy as np
np_height = np.array(height)
np_height
np_weight = np.array(weight)
np_weight
INTRODUCTION TO PYTHON
Comparison
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
weight / height ** 2
np_height = np.array(height)
np_weight = np.array(weight)
np_weight / np_height ** 2
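The comparison can be run end to end with the sketch below, assuming NumPy is installed. The list version fails with a TypeError, while the array version computes the BMI element by element. The name list_error is a placeholder used only to capture the message.

```python
import numpy as np

height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Plain lists do not support element-wise arithmetic.
try:
    weight / height ** 2
    list_error = None
except TypeError as exc:
    list_error = str(exc)
print(list_error)

# NumPy arrays apply the operation to every element.
np_height = np.array(height)
np_weight = np.array(weight)
bmi = np_weight / np_height ** 2
print(np.round(bmi, 3))
```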
INTRODUCTION TO PYTHON
NumPy: remarks
np.array([1.0, "is", True])
INTRODUCTION TO PYTHON
NumPy: remarks
python_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])
python_list + python_list
[1, 2, 3, 1, 2, 3]
numpy_array + numpy_array
array([2, 4, 6])
INTRODUCTION TO PYTHON
NumPy Subsetting
bmi
bmi[1]
20.975
bmi > 23
array([False, False, False,  True, False])
bmi[bmi > 23]
array([24.7473475])
INTRODUCTION TO PYTHON
2D NumPy Arrays
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Type of NumPy Arrays
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
type(np_height)
numpy.ndarray
type(np_weight)
numpy.ndarray
INTRODUCTION TO PYTHON
2D NumPy Arrays
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
[65.4, 59.2, 63.6, 88.4, 68.7]])
np_2d
np_2d.shape
INTRODUCTION TO PYTHON
Subsetting
0 1 2 3 4
np_2d[0]
INTRODUCTION TO PYTHON
Subsetting
0 1 2 3 4
np_2d[0][2]
1.71
np_2d[0, 2]
1.71
INTRODUCTION TO PYTHON
Subsetting
0 1 2 3 4
np_2d[:, 1:3]
np_2d[1, :]
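A short sketch of the 2D subsetting forms shown above, assuming NumPy is installed and using the same np_2d values. In np_2d[rows, columns], a lone colon selects everything along that axis. The names on the left are placeholders.

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

shape = np_2d.shape              # (rows, columns)
first_row = np_2d[0]             # whole first row
single_a = np_2d[0][2]           # row 0, then element 2
single_b = np_2d[0, 2]           # same element with one [row, column] lookup
middle_columns = np_2d[:, 1:3]   # all rows, columns 1 and 2
second_row = np_2d[1, :]         # row 1, all columns

print(shape)
print(middle_columns)
print(second_row)
```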
INTRODUCTION TO PYTHON
NumPy: Basic Statistics
INTRODUCTION TO PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Data analysis
Get to know your data
Little data -> simply look at it
INTRODUCTION TO PYTHON
City-wide survey
import numpy as np
np_city = ... # Implementation left out
np_city
array([[1.64, 71.78],
[1.37, 63.35],
[1.6 , 55.09],
...,
[2.04, 74.85],
[2.04, 68.72],
[2.01, 73.57]])
INTRODUCTION TO PYTHON
NumPy
np.mean(np_city[:, 0])
1.7472
np.median(np_city[:, 0])
1.75
INTRODUCTION TO PYTHON
NumPy
np.corrcoef(np_city[:, 0], np_city[:, 1])
array([[ 1. , -0.01802],
[-0.01803, 1. ]])
np.std(np_city[:, 0])
0.1992
INTRODUCTION TO PYTHON
Generate data
Arguments for np.random.normal()
distribution mean
distribution standard deviation
number of samples
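A rough sketch of how a dataset like np_city could be generated, assuming NumPy is installed. The means, standard deviations, sample count, and seed below are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

np.random.seed(123)  # assumed seed, only to make the sketch repeatable

# np.random.normal(mean, standard deviation, number of samples)
height = np.round(np.random.normal(1.75, 0.20, 5000), 2)   # assumed parameters
weight = np.round(np.random.normal(60.32, 15, 5000), 2)    # assumed parameters

# Paste the two 1D arrays together as the columns of a 2D array.
np_city = np.column_stack((height, weight))

print(np_city.shape)
print(np.mean(np_city[:, 0]), np.median(np_city[:, 0]))
```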
INTRODUCTION TO PYTHON
Basic plots with Matplotlib
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Basic plots with Matplotlib
Visualization Data Structure
INTERMEDIATE PYTHON
Data visualization
Very important in Data Analysis
Explore data
Report insights
INTERMEDIATE PYTHON
1 Source: GapMinder, Wealth and Health of Nations
INTERMEDIATE PYTHON
Matplotlib
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop)
plt.show()
INTERMEDIATE PYTHON
Scatter plot
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.scatter(year, pop)
plt.show()
INTERMEDIATE PYTHON
Histogram
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Histogram
Explore dataset
Get idea about distribution
INTERMEDIATE PYTHON
Matplotlib
import matplotlib.pyplot as plt
help(plt.hist)
INTERMEDIATE PYTHON
Matplotlib example
values = [0,0.6,1.4,1.6,2.2,2.5,2.6,3.2,3.5,3.9,4.2,6]
plt.hist(values, bins=3)
plt.show()
INTERMEDIATE PYTHON
Population pyramid
INTERMEDIATE PYTHON
Customization
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Data visualization
Many options
Different plot types
Many customizations
Choice depends on
Data
INTERMEDIATE PYTHON
Basic plot
population.py
plt.plot(year, pop)
plt.show()
INTERMEDIATE PYTHON
Axis labels
population.py
plt.plot(year, pop)
plt.xlabel('Year')
plt.ylabel('Population')
plt.show()
INTERMEDIATE PYTHON
Title
population.py
plt.plot(year, pop)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.show()
INTERMEDIATE PYTHON
Ticks
population.py
plt.plot(year, pop)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10])
plt.show()
INTERMEDIATE PYTHON
Ticks (2)
population.py
plt.plot(year, pop)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10],
['0', '2B', '4B', '6B', '8B', '10B'])
plt.show()
INTERMEDIATE PYTHON
Add historical data
population.py
plt.plot(year, pop)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0, 2, 4, 6, 8, 10],
['0', '2B', '4B', '6B', '8B', '10B'])
plt.show()
INTERMEDIATE PYTHON
Before vs. after
INTERMEDIATE PYTHON
Dictionaries, Part 1
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
List
pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]
ind_alb = countries.index("albania")
ind_alb
pop[ind_alb]
2.77
Not convenient
Not intuitive
INTERMEDIATE PYTHON
Dictionary
pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]
...
world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}
world["albania"]
2.77
INTERMEDIATE PYTHON
Dictionaries, Part 2
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Recap
world = {"afghanistan":30.55, "albania":2.77, "algeria":39.21}
world["albania"]
2.77
INTERMEDIATE PYTHON
Recap
Keys have to be "immutable" objects
INTERMEDIATE PYTHON
Principality of Sealand
1 Source: Wikipedia
INTERMEDIATE PYTHON
Dictionary
world["sealand"] = 0.000027
world
"sealand" in world
True
INTERMEDIATE PYTHON
Dictionary
world["sealand"] = 0.000028
world
del(world["sealand"])
world
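The add, update, check, and delete steps from these slides can be followed in one sketch, using the same world dictionary. The names added, updated_value, and removed are placeholders for this example.

```python
world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}

# Assigning to a new key adds an entry.
world["sealand"] = 0.000027
added = "sealand" in world

# Assigning to an existing key overwrites its value; keys stay unique.
world["sealand"] = 0.000028
updated_value = world["sealand"]

# del removes the key and its value.
del world["sealand"]
removed = "sealand" not in world

print(added, updated_value, removed)
print(world)
```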
INTERMEDIATE PYTHON
List vs. Dictionary
List
Select, update, and remove with []
Indexed by range of numbers
Collection of values; order matters, for selecting entire subsets
Dictionary
Select, update, and remove with []
Indexed by unique keys
Lookup table with unique keys
INTERMEDIATE PYTHON
Pandas, Part 1
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Tabular dataset examples
INTERMEDIATE PYTHON
Datasets in Python
2D NumPy array?
One data type
pandas!
High level data manipulation tool
Wes McKinney
Built on NumPy
DataFrame
INTERMEDIATE PYTHON
DataFrame
brics
INTERMEDIATE PYTHON
DataFrame from Dictionary
dict = {
"country":["Brazil", "Russia", "India", "China", "South Africa"],
"capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
"area":[8.516, 17.10, 3.286, 9.597, 1.221],
"population":[200.4, 143.5, 1252, 1357, 52.98] }
import pandas as pd
brics = pd.DataFrame(dict)
INTERMEDIATE PYTHON
DataFrame from Dictionary (2)
brics
INTERMEDIATE PYTHON
DataFrame from CSV file
brics.csv
,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.10,143.5
IN,India,New Delhi,3.286,1252
CH,China,Beijing,9.597,1357
SA,South Africa,Pretoria,1.221,52.98
brics = pd.read_csv("path/to/brics.csv")
brics
INTERMEDIATE PYTHON
DataFrame from CSV file
brics = pd.read_csv("path/to/brics.csv", index_col = 0)
brics
INTERMEDIATE PYTHON
Pandas, Part 2
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
brics
import pandas as pd
brics = pd.read_csv("path/to/brics.csv", index_col = 0)
brics
INTERMEDIATE PYTHON
Index and select data
Square brackets
Advanced methods
loc
iloc
INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics["country"]
BR Brazil
RU Russia
IN India
CH China
SA South Africa
Name: country, dtype: object
INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
type(brics["country"])
pandas.core.series.Series
1D labelled array
INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics[["country"]]
country
BR Brazil
RU Russia
IN India
CH China
SA South Africa
INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
type(brics[["country"]])
pandas.core.frame.DataFrame
INTERMEDIATE PYTHON
Column Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics[["country", "capital"]]
country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Delhi
CH China Beijing
SA South Africa Pretoria
INTERMEDIATE PYTHON
Row Access [ ]
country capital area population
BR Brazil Brasilia 8.516 200.40 * 0 *
RU Russia Moscow 17.100 143.50 * 1 *
IN India New Delhi 3.286 1252.00 * 2 *
CH China Beijing 9.597 1357.00 * 3 *
SA South Africa Pretoria 1.221 52.98 * 4 *
brics[1:4]
INTERMEDIATE PYTHON
Discussion [ ]
Square brackets: limited functionality
Ideally
2D NumPy arrays
my_array[rows, columns]
pandas
loc (label-based)
iloc (integer position-based)
INTERMEDIATE PYTHON
Row Access loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.loc["RU"]
country Russia
capital Moscow
area 17.1
population 143.5
Name: RU, dtype: object
INTERMEDIATE PYTHON
Row Access loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.loc[["RU"]]
DataFrame
INTERMEDIATE PYTHON
Row Access loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.loc[["RU", "IN", "CH"]]
INTERMEDIATE PYTHON
Row & Column loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
country capital
RU Russia Moscow
IN India New Delhi
CH China Beijing
INTERMEDIATE PYTHON
Row & Column loc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.loc[:, ["country", "capital"]]
country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Delhi
CH China Beijing
SA South Africa Pretoria
INTERMEDIATE PYTHON
Recap
Square brackets
Column access brics[["country", "capital"]]
loc (label-based)
Row access brics.loc[["RU", "IN", "CH"]]
INTERMEDIATE PYTHON
Row Access iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.loc[["RU"]]
brics.iloc[[1]]
INTERMEDIATE PYTHON
Row Access iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.iloc[[1,2,3]]
INTERMEDIATE PYTHON
Row & Column iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.iloc[[1,2,3], [0, 1]]
country capital
RU Russia Moscow
IN India New Delhi
CH China Beijing
INTERMEDIATE PYTHON
Row & Column iloc
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics.iloc[:, [0,1]]
country capital
BR Brazil Brasilia
RU Russia Moscow
IN India New Delhi
CH China Beijing
SA South Africa Pretoria
INTERMEDIATE PYTHON
Comparison Operators
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
NumPy recap
# Code from Intro to Python for Data Science, Chapter 4
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
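bmi
array([21.852, 20.975, 21.75 , 24.747, 21.441])
bmi > 23
array([False, False, False,  True, False])
bmi[bmi > 23]
array([24.747])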
INTERMEDIATE PYTHON
Numeric comparisons
2 < 3
True
2 == 3
False
2 <= 3
True
3 <= 3
True
x = 2
y = 3
x < y
True
INTERMEDIATE PYTHON
Other comparisons
"carl" < "chris"
True
3 < "chris"
TypeError: '<' not supported between instances of 'int' and 'str'
3 < 4.1
True
INTERMEDIATE PYTHON
Other comparisons
bmi
bmi > 23
INTERMEDIATE PYTHON
Comparators
Comparator Meaning
< Strictly less than
<= Less than or equal
> Strictly greater than
>= Greater than or equal
== Equal
!= Not equal
INTERMEDIATE PYTHON
Boolean Operators
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Boolean Operators
and
or
not
INTERMEDIATE PYTHON
and
True and True
True
False and True
False
True and False
False
False and False
False
INTERMEDIATE PYTHON
or
True or True
True
False or True
True
True or False
True
False or False
False
y = 5
y < 7 or y > 13
True
INTERMEDIATE PYTHON
not
not True
False
not False
True
INTERMEDIATE PYTHON
NumPy
bmi # calculation of bmi left out
bmi > 21
bmi < 22
bmi > 21 and bmi < 22
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
INTERMEDIATE PYTHON
NumPy
logical_and()
logical_or()
logical_not()
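A minimal sketch of the element-wise alternative, assuming NumPy is installed and reusing the height and weight values from the NumPy recap. np.logical_and() combines two boolean arrays position by position, and the result can be used to subset bmi. The names mask and selected are placeholders.

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2

# Element-wise "and": True where 21 < bmi < 22
mask = np.logical_and(bmi > 21, bmi < 22)

# A boolean array can be used directly to subset the original array.
selected = bmi[mask]

print(mask)
print(selected)
```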
INTERMEDIATE PYTHON
if, elif, else
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Overview
Comparison Operators
< , > , >= , <= , == , !=
Boolean Operators
and , or , not
Conditional Statements
if , else , elif
INTERMEDIATE PYTHON
if
if condition :
    expression
control.py
z = 4
if z % 2 == 0 : # True
    print("z is even")
z is even
INTERMEDIATE PYTHON
if
if condition :
    expression
control.py
z = 4
if z % 2 == 0 :
    print("checking " + str(z))
    print("z is even")
checking 4
z is even
INTERMEDIATE PYTHON
if
if condition :
    expression
control.py
z = 5
if z % 2 == 0 : # False
    print("checking " + str(z))
    print("z is even")
INTERMEDIATE PYTHON
else
if condition :
    expression
else :
    expression
control.py
z = 5
if z % 2 == 0 : # False
    print("z is even")
else :
    print("z is odd")
z is odd
INTERMEDIATE PYTHON
elif
if condition :
    expression
elif condition :
    expression
else :
    expression
control.py
z = 3
if z % 2 == 0 : # False
    print("z is divisible by 2")
elif z % 3 == 0 : # True
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")
z is divisible by 3
INTERMEDIATE PYTHON
elif
if condition :
    expression
elif condition :
    expression
else :
    expression
control.py
z = 6
if z % 2 == 0 : # True
    print("z is divisible by 2")
elif z % 3 == 0 : # Never reached
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")
z is divisible by 2
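To see all three branches at once, the if-elif-else chain can be wrapped in a function and called with different values. This is only a sketch; the function name describe_divisibility is an assumption and not part of the slides. Only the first condition that is True runs, which is why 6 never reaches the check for 3.

```python
# Hypothetical wrapper around the if-elif-else chain from control.py.
def describe_divisibility(z):
    if z % 2 == 0:
        return "z is divisible by 2"
    elif z % 3 == 0:
        # Reached only when the first condition was False.
        return "z is divisible by 3"
    else:
        return "z is neither divisible by 2 nor by 3"

for z in [3, 5, 6]:
    print(z, describe_divisibility(z))
```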
INTERMEDIATE PYTHON
Filtering pandas DataFrames
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
brics
import pandas as pd
brics = pd.read_csv("path/to/brics.csv", index_col = 0)
brics
INTERMEDIATE PYTHON
Goal
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
3 steps
Select the area column
Do comparison on area column
Use result to select countries
INTERMEDIATE PYTHON
Step 1: Get column
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics["area"]
BR 8.516
RU 17.100
IN 3.286
CH 9.597
SA 1.221
Name: area, dtype: float64 # - Need Pandas Series
Alternatives:
brics.loc[:,"area"]
brics.iloc[:,2]
INTERMEDIATE PYTHON
Step 2: Compare
brics["area"]
BR 8.516
RU 17.100
IN 3.286
CH 9.597
SA 1.221
Name: area, dtype: float64
brics["area"] > 8
BR True
RU True
IN False
CH True
SA False
Name: area, dtype: bool
INTERMEDIATE PYTHON
Step 3: Subset DF
is_huge = brics["area"] > 8
is_huge
BR True
RU True
IN False
CH True
SA False
Name: area, dtype: bool
brics[is_huge]
INTERMEDIATE PYTHON
Summary
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
brics[brics["area"] > 8]
INTERMEDIATE PYTHON
Boolean operators
country capital area population
BR Brazil Brasilia 8.516 200.40
RU Russia Moscow 17.100 143.50
IN India New Delhi 3.286 1252.00
CH China Beijing 9.597 1357.00
SA South Africa Pretoria 1.221 52.98
import numpy as np
np.logical_and(brics["area"] > 8, brics["area"] < 10)
BR True
RU False
IN False
CH True
SA False
Name: area, dtype: bool
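The three filtering steps and the boolean-operator variant can be tried without a CSV file. The sketch below assumes pandas and NumPy are installed and builds brics directly from the values shown in the table; constructing it this way instead of reading brics.csv is an assumption made so the example is self-contained. The names huge and medium are placeholders.

```python
import numpy as np
import pandas as pd

# Same values as the brics table, built in code rather than read from brics.csv.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Steps 1 and 2: select the column and compare, giving a boolean Series.
is_huge = brics["area"] > 8

# Step 3: use the boolean Series to keep only the matching rows.
huge = brics[is_huge]

# Two conditions combined element-wise with np.logical_and().
medium = brics[np.logical_and(brics["area"] > 8, brics["area"] < 10)]

print(huge)
print(medium)
```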
INTERMEDIATE PYTHON
while loop
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
if-elif-else
control.py
z = 6
if z % 2 == 0 : # True
    print("z is divisible by 2") # Executed
elif z % 3 == 0 :
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")
... # Moving on
INTERMEDIATE PYTHON
While
while condition :
    expression
Example
Error starts at 50
while_loop.py
error = 50.0
# error: 50 -> 12.5 -> 3.125 -> 0.78125, then error > 1 is False
while error > 1 :
    error = error / 4
    print(error)
12.5
3.125
0.78125
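A small sketch that records each pass of the loop makes it easier to see when the condition turns False. The list name steps is an assumption for this example.

```python
error = 50.0
steps = []  # hypothetical list that records the value after each iteration

# The condition is checked before every pass; the loop ends once error <= 1.
while error > 1:
    error = error / 4
    steps.append(error)

print(steps)
print(len(steps), "iterations")
```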
INTERMEDIATE PYTHON
While
while condition :
    expression
while_loop.py
error = 50.0
while error > 1 : # always True
    # error = error / 4
    print(error)
50.0
50.0
50.0
...
DataCamp: session disconnected
Local system: Control + C
INTERMEDIATE PYTHON
for loop
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
for loop
for var in seq :
    expression
INTERMEDIATE PYTHON
fam
family.py
1.73
1.68
1.71
1.89
INTERMEDIATE PYTHON
for loop
for var in seq :
    expression
family.py
1.73
1.68
1.71
1.89
No access to indexes
INTERMEDIATE PYTHON
enumerate
for var in seq :
    expression
family.py
index 0: 1.73
index 1: 1.68
index 2: 1.71
index 3: 1.89
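The code for family.py is not shown here, so the following is only a sketch of a loop that would produce this output, assuming fam holds the four heights used earlier. enumerate() yields an (index, value) pair on each pass, which gives the loop access to the index.

```python
fam = [1.73, 1.68, 1.71, 1.89]  # assumed contents, taken from the earlier fam example

lines = []
# enumerate() pairs each element with its position in the list.
for index, height in enumerate(fam):
    lines.append("index " + str(index) + ": " + str(height))

print("\n".join(lines))
```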
INTERMEDIATE PYTHON
Loop over string
for var in seq :
    expression
strloop.py
for c in "family" :
    print(c.capitalize())
F
A
M
I
L
Y
INTERMEDIATE PYTHON
Loop Data Structures Part 1
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Dictionary
for var in seq :
    expression
dictloop.py
world = { "afghanistan":30.55,
          "albania":2.77,
          "algeria":39.21 }
for key, value in world :
    print(key + " -- " + str(value))
ValueError: too many values to unpack (expected 2)
INTERMEDIATE PYTHON
Dictionary
for var in seq :
    expression
dictloop.py
world = { "afghanistan":30.55,
          "albania":2.77,
          "algeria":39.21 }
for key, value in world.items() :
    print(key + " -- " + str(value))
afghanistan -- 30.55
albania -- 2.77
algeria -- 39.21
INTERMEDIATE PYTHON
Dictionary
for var in seq :
    expression
dictloop.py
world = { "afghanistan":30.55,
          "albania":2.77,
          "algeria":39.21 }
for k, v in world.items() :
    print(k + " -- " + str(v))
algeria -- 39.21
afghanistan -- 30.55
albania -- 2.77
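Looping over a dictionary directly yields only its keys, which is why unpacking into two variables needs `.items()`. The sketch below contrasts the two; `keys_only` and `pairs` are illustrative names. Since Python 3.7 dictionaries keep insertion order, so a current interpreter would print the lines in the order the keys were written rather than the order shown on the slide.

```python
world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}

# Iterating the dict itself gives keys only.
keys_only = [key for key in world]

# .items() gives (key, value) tuples that can be unpacked.
pairs = [key + " -- " + str(value) for key, value in world.items()]

for line in pairs:
    print(line)
```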
INTERMEDIATE PYTHON
NumPy Arrays
for var in seq :
    expression
nploop.py
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2
for val in bmi :
    print(val)
21.852
20.975
21.750
24.747
21.441
INTERMEDIATE PYTHON
2D NumPy Arrays
nploop.py
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])
for val in meas :
    print(val)
INTERMEDIATE PYTHON
2D NumPy Arrays
nploop.py
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])
for val in np.nditer(meas) :
    print(val)
1.73
1.68
1.71
1.89
1.79
65.4
...
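The difference between the two loops can be checked without printing every element. This sketch collects what each loop visits; `rows` and `flat` are illustrative names.

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
meas = np.array([np_height, np_weight])  # shape (2, 5)

# A plain for loop walks the first axis: two rows of five values each.
rows = [row for row in meas]

# np.nditer visits every element; for this array that is heights, then weights.
flat = [float(val) for val in np.nditer(meas)]

print(len(rows), len(flat))
```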
INTERMEDIATE PYTHON
Recap
Dictionary
for key, val in my_dict.items() :
NumPy array
for val in np.nditer(my_array) :
INTERMEDIATE PYTHON
Loop Data
Structures Part 2
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
brics
         country    capital    area  population
BR        Brazil   Brasilia   8.516      200.40
RU        Russia     Moscow  17.100      143.50
IN         India  New Delhi   3.286     1252.00
CH         China    Beijing   9.597     1357.00
SA  South Africa   Pretoria   1.221       52.98
dfloop.py
import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
INTERMEDIATE PYTHON
for, first try
dfloop.py
import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for val in brics :
    print(val)
country
capital
area
population
INTERMEDIATE PYTHON
iterrows
dfloop.py
import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for lab, row in brics.iterrows():
    print(lab)
    print(row)
BR
country         Brazil
capital       Brasilia
area             8.516
population       200.4
Name: BR, dtype: object
...
RU
country       Russia
capital       Moscow
area            17.1
population     143.5
Name: RU, dtype: object
IN ...
INTERMEDIATE PYTHON
Selective print
dfloop.py
import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for lab, row in brics.iterrows():
    print(lab + ": " + row["capital"])
BR: Brasilia
RU: Moscow
IN: New Delhi
CH: Beijing
SA: Pretoria
INTERMEDIATE PYTHON
Add column
dfloop.py
import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
for lab, row in brics.iterrows() :
    # - Creating Series on every iteration
    brics.loc[lab, "name_length"] = len(row["country"])
print(brics)
INTERMEDIATE PYTHON
apply
dfloop.py
import pandas as pd
brics = pd.read_csv("brics.csv", index_col = 0)
brics["name_length"] = brics["country"].apply(len)
print(brics)
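To compare the row loop with `apply` without needing brics.csv, the sketch below rebuilds the frame inline from the table shown earlier. `lengths_loop` is an illustrative name, and the loop stores its results in a plain dictionary rather than writing back into the frame.

```python
import pandas as pd

# Rebuilt from the brics table on the slide; the CSV file itself is not available here.
brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.100, 3.286, 9.597, 1.221],
        "population": [200.40, 143.50, 1252.00, 1357.00, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Row-by-row: iterrows builds a Series for every row.
lengths_loop = {}
for lab, row in brics.iterrows():
    lengths_loop[lab] = len(row["country"])

# Column-wise: apply calls len on each element of the country column.
brics["name_length"] = brics["country"].apply(len)

print(brics)
```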
INTERMEDIATE PYTHON
Random Numbers
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
INTERMEDIATE PYTHON
Can't go below step 0
INTERMEDIATE PYTHON
How to solve?
Analytical
Simulate the process
Hacker statistics!
INTERMEDIATE PYTHON
Random generators
import numpy as np
np.random.rand() # Pseudo-random numbers
0.6964691855978616
np.random.rand()
0.28613933495037946
INTERMEDIATE PYTHON
Random generators
np.random.seed(123)
np.random.rand()
0.6964691855978616
np.random.rand()
0.28613933495037946
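Setting the seed makes the pseudo-random sequence repeatable. A short sketch that checks this by drawing the same three numbers twice; `first_run` and `second_run` are illustrative names.

```python
import numpy as np

np.random.seed(123)
first_run = [np.random.rand() for _ in range(3)]

# Resetting the same seed restarts the generator from the same state.
np.random.seed(123)
second_run = [np.random.rand() for _ in range(3)]

print(first_run == second_run)
```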
INTERMEDIATE PYTHON
Coin toss
game.py
import numpy as np
np.random.seed(123)
coin = np.random.randint(0,2) # Randomly generate 0 or 1
print(coin)
INTERMEDIATE PYTHON
Coin toss
game.py
import numpy as np
np.random.seed(123)
coin = np.random.randint(0,2) # Randomly generate 0 or 1
print(coin)
if coin == 0:
    print("heads")
else:
    print("tails")
0
heads
INTERMEDIATE PYTHON
Random Walk
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Random Step
INTERMEDIATE PYTHON
Random Walk
Known in Science
Path of molecules
INTERMEDIATE PYTHON
Heads or Tails
headtails.py
import numpy as np
np.random.seed(123)
outcomes = []
for x in range(10) :
    coin = np.random.randint(0, 2)
    if coin == 0 :
        outcomes.append("heads")
    else :
        outcomes.append("tails")
print(outcomes)
INTERMEDIATE PYTHON
Heads or Tails: Random Walk
headtailsrw.py
import numpy as np
np.random.seed(123)
tails = [0]
for x in range(10) :
    coin = np.random.randint(0, 2)
    tails.append(tails[x] + coin)
print(tails)
[0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3]
INTERMEDIATE PYTHON
Step to Walk
outcomes
tails
[0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3]
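The walk is a running total of the individual coin tosses. As an alternative to the loop, the sketch below draws all ten tosses at once and accumulates them with `np.cumsum`. This is a swapped-in vectorised variant, not the slide's code, and it is not assumed to reproduce the exact numbers above.

```python
import numpy as np

np.random.seed(123)

# Ten tosses in one call: each entry is 0 (heads) or 1 (tails).
coins = np.random.randint(0, 2, size=10)

# Running count of tails, with the starting value 0 prepended.
tails = np.concatenate(([0], np.cumsum(coins)))

print(tails)
```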
INTERMEDIATE PYTHON
Distribution
INTERMEDIATE PYTHON
Hugo Bowne-Anderson
Data Scientist at DataCamp
Distribution
INTERMEDIATE PYTHON
Random Walk
headtailsrw.py
import numpy as np
np.random.seed(123)
tails = [0]
for x in range(10) :
    coin = np.random.randint(0, 2)
    tails.append(tails[x] + coin)
INTERMEDIATE PYTHON
100 runs
distribution.py
import numpy as np
np.random.seed(123)
final_tails = []
for x in range(100) :
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
print(final_tails)
[3, 6, 4, 5, 4, 5, 3, 5, 4, 6, 6, 8, 6, 4, 7, 5, 7, 4, 3, 3, ..., 4]
INTERMEDIATE PYTHON
Histogram, 100 runs
distribution.py
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
final_tails = []
for x in range(100) :
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
plt.hist(final_tails, bins = 10)
plt.show()
INTERMEDIATE PYTHON
Histogram, 100 runs
INTERMEDIATE PYTHON
Histogram, 1,000 runs
distribution.py
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
final_tails = []
for x in range(1000) : # <--
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
plt.hist(final_tails, bins = 10)
plt.show()
INTERMEDIATE PYTHON
Histogram, 1,000 runs
INTERMEDIATE PYTHON
Histogram, 10,000 runs
distribution.py
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(123)
final_tails = []
for x in range(10000) : # <--
    tails = [0]
    for x in range(10) :
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])
plt.hist(final_tails, bins = 10)
plt.show()
INTERMEDIATE PYTHON
Histogram, 10,000 runs
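The histogram images are not reproduced here. As a rough text substitute, the sketch below runs the 1,000-run simulation and tallies how often each final count occurs with `np.bincount`. The outer loop variable is renamed to `run` for readability, and `counts` is an illustrative name.

```python
import numpy as np

np.random.seed(123)
final_tails = []
for run in range(1000):
    tails = [0]
    for x in range(10):
        coin = np.random.randint(0, 2)
        tails.append(tails[x] + coin)
    final_tails.append(tails[-1])

# counts[k] is the number of runs that ended with exactly k tails (k = 0..10).
counts = np.bincount(final_tails, minlength=11)

for n_tails, count in enumerate(counts):
    print(n_tails, count)
```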
INTERMEDIATE PYTHON
Introducing
DataFrames
DATA MANIPULATION WITH PANDAS
Richie Cotton
Data Evangelist at DataCamp
What's the point of pandas?
Data Manipulation skill track
Data Visualization skill track
1 https://pypistats.org/packages/pandas
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
# Column Non-Null Count Dtype
-- ------ -------------- -----
0 name 7 non-null object
1 breed 7 non-null object
2 color 7 non-null object
3 height_cm 7 non-null int64
4 weight_kg 7 non-null int64
5 date_of_birth 7 non-null object
dtypes: int64(2), object(4)
memory usage: 464.0+ bytes
(7, 6)
height_cm weight_kg
count 7.000000 7.000000
mean 49.714286 27.428571
std 17.960274 22.292429
min 18.000000 2.000000
25% 44.500000 19.500000
50% 49.000000 23.000000
75% 57.500000 27.000000
max 77.000000 74.000000
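The calls that produced the output above are missing from the extracted slides. The sketch below assumes they were `dogs.info()`, `dogs.shape` and `dogs.describe()`, and rebuilds `dogs` from the seven-row table that appears later in the deck, with dates kept as plain strings.

```python
import pandas as pd

# Rebuilt from the dogs table shown later in the deck.
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "weight_kg": [25, 23, 22, 17, 29, 2, 74],
        "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
                          "2017-01-20", "2015-04-20", "2018-02-27"],
    }
)

dogs.info()                # column names, non-null counts and dtypes
print(dogs.shape)          # (rows, columns)
summary = dogs.describe()  # summary statistics for the numeric columns
print(summary)
```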
dogs.index
1 https://www.python.org/dev/peps/pep-0020/
Richie Cotton
Data Evangelist at DataCamp
Sorting
dogs.sort_values("weight_kg")
pandas with relevant values highlighted: Bella, Lucy and Charlie in descending order by
height
0 Bella
1 Charlie
2 Lucy
3 Cooper
4 Max
5 Stella
6 Bernie
Name: name, dtype: object
         breed  height_cm
0     Labrador         56
1       Poodle         43
2    Chow Chow         46
3    Schnauzer         49
4     Labrador         59
5    Chihuahua         18
6  St. Bernard         77
0 True
1 False
2 False
3 False
4 True
5 False
6 True
Name: height_cm, dtype: bool
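The boolean Series above is consistent with the comparison `dogs["height_cm"] > 50`, although that line is missing from the extracted text. A sketch under that assumption, showing how the mask is then used to subset rows; `is_tall` and `tall_dogs` are illustrative names.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
    }
)

# Assumed threshold of 50 cm: it reproduces the True/False pattern on the slide.
is_tall = dogs["height_cm"] > 50

# Passing the mask in square brackets keeps only the rows marked True.
tall_dogs = dogs[is_tall]

print(tall_dogs)
```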
Richie Cotton
Data Evangelist at DataCamp
Adding a new column
dogs["height_m"] = dogs["height_cm"] / 100
print(dogs)
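A small runnable version of the new-column step, using only the height column rebuilt from the earlier output:

```python
import pandas as pd

dogs = pd.DataFrame({"height_cm": [56, 43, 46, 49, 59, 18, 77]})

# Arithmetic on a column is applied element by element.
dogs["height_m"] = dogs["height_cm"] / 100

print(dogs)
```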
Maggie Matsui
Senior Content Developer at DataCamp
Summarizing numerical data
.median() , .mode()
dogs["height_cm"].mean()
.min() , .max()
.sum()
.quantile()
dogs["date_of_birth"].min()
'2011-12-11'
Youngest dog:
dogs["date_of_birth"].max()
'2018-02-27'
dogs["weight_kg"].agg(pct30)
22.599999999999998
weight_kg 22.6
height_cm 45.4
dtype: float64
dogs["weight_kg"].agg([pct30, pct40])
pct30 22.6
pct40 24.0
Name: weight_kg, dtype: float64
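The definitions of `pct30` and `pct40` are not in the extracted text. The sketch below assumes they wrap `.quantile()` at 0.3 and 0.4, which reproduces the values shown (22.6, 45.4 and 24.0) for the weights and heights used in this chapter.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
    }
)

# Assumed definitions: each takes a column and returns one summary number.
def pct30(column):
    return column.quantile(0.3)

def pct40(column):
    return column.quantile(0.4)

single = dogs["weight_kg"].agg(pct30)                      # one column, one function
per_column = dogs[["weight_kg", "height_cm"]].agg(pct30)   # several columns
per_function = dogs["weight_kg"].agg([pct30, pct40])       # several functions

print(single)
print(per_column)
print(per_function)
```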
dogs["weight_kg"]
0    24
1    24
2    24
3    17
4    29
5     2
6    74
Name: weight_kg, dtype: int64
dogs["weight_kg"].cumsum()
0     24
1     48
2     72
3     89
4    118
5    120
6    194
Name: weight_kg, dtype: int64
.cummin()
.cumprod()
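Cumulative methods return a value for every row rather than a single number. A short sketch with the same weights, using `.cumsum()` and `.cummax()`:

```python
import pandas as pd

weights = pd.Series([24, 24, 24, 17, 29, 2, 74], name="weight_kg")

running_total = weights.cumsum()  # sum of all values up to and including each row
running_max = weights.cummax()    # largest value seen so far at each row

print(running_total.tolist())
print(running_max.tolist())
```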
Maggie Matsui
Senior Content Developer at DataCamp
Avoiding double counting
Labrador       2
Schnauzer      1
St. Bernard    1
Chow Chow      2
Poodle         1
Chihuahua      1
Name: breed, dtype: int64
Labrador       2
Chow Chow      2
Schnauzer      1
St. Bernard    1
Poodle         1
Chihuahua      1
Name: breed, dtype: int64
Labrador 0.250
Chow Chow 0.250
Schnauzer 0.125
St. Bernard 0.125
Poodle 0.125
Chihuahua 0.125
Name: breed, dtype: float64
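The table that these counts come from is not in the extracted text. The sketch below therefore uses a small hypothetical `visits` table (the rows are invented for illustration) to show the pattern: drop repeated dogs first, then count breeds, optionally as proportions.

```python
import pandas as pd

# Hypothetical visit log: Bella and Max each appear twice.
visits = pd.DataFrame(
    {
        "name": ["Bella", "Bella", "Max", "Stella", "Max"],
        "breed": ["Labrador", "Labrador", "Labrador", "Chihuahua", "Labrador"],
    }
)

# Keep one row per (name, breed) pair so each dog is counted once.
unique_dogs = visits.drop_duplicates(subset=["name", "breed"])

counts = unique_dogs["breed"].value_counts(sort=True)
proportions = unique_dogs["breed"].value_counts(normalize=True)

print(counts)
print(proportions)
```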
Maggie Matsui
Senior Content Developer at DataCamp
Summaries by group
dogs[dogs["color"] == "Black"]["weight_kg"].mean()
dogs[dogs["color"] == "Brown"]["weight_kg"].mean()
dogs[dogs["color"] == "White"]["weight_kg"].mean()
dogs[dogs["color"] == "Gray"]["weight_kg"].mean()
dogs[dogs["color"] == "Tan"]["weight_kg"].mean()
26.5
24.0
74.0
17.0
2.0
color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0
Name: weight_kg, dtype: float64
color breed
Black Chow Chow 25
Labrador 29
Poodle 24
Brown Chow Chow 24
Labrador 24
Gray Schnauzer 17
Tan Chihuahua 2
White St. Bernard 74
Name: weight_kg, dtype: int64
weight_kg height_cm
color breed
Black Labrador 29 59
Poodle 24 43
Brown Chow Chow 24 46
Labrador 24 56
Gray Schnauzer 17 49
Tan Chihuahua 2 18
White St. Bernard 74 77
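The grouped outputs above are shown without the calls that produced them. A sketch assuming they come from `.groupby()` followed by `.mean()`, with `dogs` rebuilt from values shown in this chapter:

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
        "height_cm": [56, 43, 46, 49, 59, 18, 77],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    }
)

# One grouping column, one summarised column.
mean_by_color = dogs.groupby("color")["weight_kg"].mean()

# Two grouping columns, two summarised columns.
by_color_breed = dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()

print(mean_by_color)
print(by_color_breed)
```

This replaces the five separate filter-then-mean lines with a single call per summary.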
Maggie Matsui
Senior Content Developer at DataCamp
Group by to pivot table
dogs.groupby("color")["weight_kg"].mean()
color
Black    26.5
Brown    24.0
Gray     17.0
Tan       2.0
White    74.0
Name: weight_kg, dtype: float64
dogs.pivot_table(values="weight_kg", index="color")
       weight_kg
color
Black       26.5
Brown       24.0
Gray        17.0
Tan          2.0
White       74.0
weight_kg
color
Black 26.5
Brown 24.0
Gray 17.0
Tan 2.0
White 74.0
mean median
weight_kg weight_kg
color
Black 26.5 26.5
Brown 24.0 24.0
Gray 17.0 17.0
Tan 2.0 2.0
White 74.0 74.0
breed Chihuahua Chow Chow Labrador Poodle Schnauzer St. Bernard All
color
Black 0 0 29 24 0 0 26.500000
Brown 0 24 24 0 0 0 24.000000
Gray 0 0 0 0 17 0 17.000000
Tan 2 0 0 0 0 0 2.000000
White 0 0 0 0 0 74 74.000000
All 2 24 26 24 17 74 27.714286
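The pivot-table outputs above can be reproduced with a handful of arguments. The sketch below assumes the tables come from `pivot_table` with `aggfunc`, `columns`, `fill_value` and `margins`, since those calls are missing from the extracted text.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
        "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
        "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    }
)

# Default statistic is the mean.
by_color = dogs.pivot_table(values="weight_kg", index="color")

# Several statistics at once give a two-level column header.
stats = dogs.pivot_table(values="weight_kg", index="color",
                         aggfunc=["mean", "median"])

# Two grouping variables, empty cells filled with 0, plus "All" summary margins.
with_margins = dogs.pivot_table(values="weight_kg", index="color",
                                columns="breed", fill_value=0, margins=True)

print(by_color)
print(stats)
print(with_margins)
```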
Richie Cotton
Data Evangelist at DataCamp
The dog dataset, revisited
print(dogs)
dogs.index
dogs_ind.loc[["Bella", "Stella"]]
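`dogs_ind` is not defined in the extracted text. The sketch below assumes it is `dogs` with the `name` column moved into the index via `set_index`, which is what makes label-based lookup with `.loc` possible.

```python
import pandas as pd

dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
                  "Labrador", "Chihuahua", "St. Bernard"],
    }
)

# Assumed definition: names become row labels instead of a regular column.
dogs_ind = dogs.set_index("name")

# .loc selects rows by label, in the order the labels are given.
subset = dogs_ind.loc[["Bella", "Stella"]]

print(subset)
```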
Richie Cotton
Data Evangelist at DataCamp
Slicing lists
breeds = ["Labrador", "Poodle",
          "Chow Chow", "Schnauzer",
          "Labrador", "Chihuahua",
          "St. Bernard"]
['Labrador',
 'Poodle',
 'Chow Chow',
 'Schnauzer',
 'Labrador',
 'Chihuahua',
 'St. Bernard']
breeds[2:5]
['Chow Chow', 'Schnauzer', 'Labrador']
breeds[:3]
['Labrador', 'Poodle', 'Chow Chow']
breeds[:]
['Labrador','Poodle','Chow Chow','Schnauzer',
 'Labrador','Chihuahua','St. Bernard']
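The three slices can be run together as a quick check. The start position is included, the end position is excluded, and an omitted bound means "from the start" or "to the end"; `middle`, `first_three` and `everything` are illustrative names.

```python
breeds = ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
          "Labrador", "Chihuahua", "St. Bernard"]

middle = breeds[2:5]       # positions 2, 3 and 4
first_three = breeds[:3]   # positions 0, 1 and 2
everything = breeds[:]     # a copy of the whole list

print(middle)
print(first_three)
print(everything)
```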
Richie Cotton
Data Evangelist at DataCamp
A bigger dog dataset
print(dog_pack)
color
Black 43.973563
Brown 48.717917
Gray 48.107667
Tan 44.934738
White 44.465208
dtype: float64
breed
Beagle 36.362667
Boxer 59.358667
Chihuahua 19.561250
Chow Chow 52.413333
Dachshund 20.236667
Labrador 55.875000
Poodle 51.637750
St. Bernard 66.654300
dtype: float64
Maggie Matsui
Senior Content Developer at DataCamp
Histograms
import matplotlib.pyplot as plt
dog_pack["height_cm"].hist()
plt.show()
breed
Beagle 10.636364
Boxer 30.620000
Chihuahua 1.491667
Chow Chow 22.535714
Dachshund 9.975000
Labrador 31.850000
Poodle 20.400000
St. Bernard 71.576923
Name: weight_kg, dtype: float64
1 2019-02-28 35.3
2 2019-03-31 32.0
3 2019-04-30 32.9
4 2019-05-31 32.0
Maggie Matsui
Senior Content Developer at DataCamp
What's a missing value?
Name Breed Color Height (cm) Weight (kg) Date of Birth
Bella Labrador Brown 56 25 2013-07-01
Charlie Poodle Black 43 23 2016-09-16
Lucy Chow Chow Brown 46 22 2014-08-25
Cooper Schnauzer Gray 49 17 2011-12-11
Max Labrador Black 59 29 2017-01-20
Stella Chihuahua Tan 18 2 2015-04-20
Bernie St. Bernard White 77 74 2018-02-27
name False
breed False
color False
height_cm False
weight_kg True
date_of_birth False
dtype: bool
name 0
breed 0
color 0
height_cm 0
weight_kg 2
date_of_birth 0
dtype: int64
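The two summaries above have the shape of `.isna().any()` and `.isna().sum()` output, though the calls themselves are missing from the extracted text. In the sketch below the positions of the two missing weights are hypothetical, because the extracted table does not show which rows were blank.

```python
import numpy as np
import pandas as pd

dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
        # Hypothetical: which two weights are missing is not recoverable from the slides.
        "weight_kg": [25, 23, np.nan, 17, 29, np.nan, 74],
    }
)

any_missing = dogs.isna().any()  # True for columns with at least one missing value
n_missing = dogs.isna().sum()    # number of missing values per column

print(any_missing)
print(n_missing)
```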
Maggie Matsui
Senior Content Developer at DataCamp
Dictionaries
my_dict = {
    "key1": value1,
    "key2": value2,
    "key3": value3
}
my_dict["key1"]
my_dict = {
    "title": "Charlotte's Web",
    "author": "E.B. White",
    "published": 1952
}
my_dict["title"]
list_of_dicts = [
{"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
"weight_kg": 10, "date_of_birth": "2019-03-14"},
{"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
"weight_kg": 25, "date_of_birth": "2019-05-09"}
]
new_dogs = pd.DataFrame(list_of_dicts)
print(new_dogs)
print(new_dogs)
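A list of dictionaries builds the frame row by row. The same data can also be written column by column as a dictionary of lists; the sketch below constructs both and checks that they match. `dict_of_lists` and `new_dogs_by_column` are illustrative names.

```python
import pandas as pd

# Row by row: one dictionary per dog.
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22,
     "weight_kg": 10, "date_of_birth": "2019-03-14"},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59,
     "weight_kg": 25, "date_of_birth": "2019-05-09"},
]
new_dogs = pd.DataFrame(list_of_dicts)

# Column by column: one list per column.
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
    "weight_kg": [10, 25],
    "date_of_birth": ["2019-03-14", "2019-05-09"],
}
new_dogs_by_column = pd.DataFrame(dict_of_lists)

print(new_dogs)
print(new_dogs.equals(new_dogs_by_column))
```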
Maggie Matsui
Senior Content Developer at DataCamp
What's a CSV file?
CSV = comma-separated values
Most database and spreadsheet programs can use them or create them
name,breed,height_cm,weight_kg,d_o_b
Ginger,Dachshund,22,10,2019-03-14
Scout,Dalmatian,59,25,2019-05-09
new_dogs_with_bmi.csv
name,breed,height_cm,weight_kg,d_o_b,bmi
Ginger,Dachshund,22,10,2019-03-14,206.611570
Scout,Dalmatian,59,25,2019-05-09,71.818443
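The round trip from CSV to DataFrame and back can be sketched without touching the file system by reading from an in-memory string. The BMI formula below, weight divided by height in metres squared, is an assumption that reproduces the values in new_dogs_with_bmi.csv; `csv_text` and `output` are illustrative names.

```python
import io

import pandas as pd

csv_text = """name,breed,height_cm,weight_kg,d_o_b
Ginger,Dachshund,22,10,2019-03-14
Scout,Dalmatian,59,25,2019-05-09
"""

# read_csv accepts a file path or any file-like object.
new_dogs = pd.read_csv(io.StringIO(csv_text))

# Assumed formula: weight in kg divided by (height in m) squared.
new_dogs["bmi"] = new_dogs["weight_kg"] / (new_dogs["height_cm"] / 100) ** 2

# With no path given, to_csv returns the CSV text; index=False omits row numbers.
output = new_dogs.to_csv(index=False)
print(output)
```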
Maggie Matsui
Senior Content Developer at DataCamp
Recap
Chapter 1
Subsetting and sorting
Chapter 2
Aggregating and grouping
Chapter 3
Indexing
Chapter 4
Visualizations