Introducing Dataframes: Richie Cotton

This document gives an overview of pandas, a popular Python library for data manipulation and analysis. It introduces the DataFrame as the primary data structure in pandas, shows how to explore, sort, filter, and subset DataFrames, and covers their core components (columns, index, and values) along with the methods used to work with them.

Introducing DataFrames
DATA MANIPULATION WITH PANDAS

Richie Cotton
Curriculum Architect at DataCamp
What's the point of pandas?


Course outline
Chapter 1: DataFrames
  Sorting and subsetting
  Creating new columns

Chapter 2: Aggregating Data
  Summary statistics
  Counting
  Grouped summary statistics

Chapter 3: Slicing and Indexing Data
  Subsetting using slicing
  Indexes and subsetting using indexes

Chapter 4: Creating and Visualizing Data
  Plotting
  Handling missing data
  Reading data into a DataFrame


pandas is built on NumPy and Matplotlib


pandas is popular

1 https://pypistats.org/packages/pandas


Rectangular data
Name     Breed        Color  Height (cm)  Weight (kg)  Date of Birth
Bella    Labrador     Brown  56           25           2013-07-01
Charlie  Poodle       Black  43           23           2016-09-16
Lucy     Chow Chow    Brown  46           22           2014-08-25
Cooper   Schnauzer    Gray   49           17           2011-12-11
Max      Labrador     Black  59           29           2017-01-20
Stella   Chihuahua    Tan    18           2            2015-04-20
Bernie   St. Bernard  White  77           74           2018-02-27


pandas DataFrames
print(dogs)

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
4 Max Labrador Black 59 29 2017-01-20
5 Stella Chihuahua Tan 18 2 2015-04-20
6 Bernie St. Bernard White 77 74 2018-02-27
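
The slides print `dogs` without showing how it was created. As a minimal sketch, assuming the data is typed in as a dictionary of lists (the slides do not say how it was actually loaded), the same DataFrame could be rebuilt from the printed values like this:

```python
import pandas as pd

# Values copied from the printed DataFrame above. Building it from a dict
# of lists is an assumption; the slides do not show how `dogs` was loaded.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella", "Bernie"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer",
              "Labrador", "Chihuahua", "St. Bernard"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black", "Tan", "White"],
    "height_cm": [56, 43, 46, 49, 59, 18, 77],
    "weight_kg": [24, 24, 24, 17, 29, 2, 74],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25", "2011-12-11",
                      "2017-01-20", "2015-04-20", "2018-02-27"],
})

# Each dictionary key becomes a column; each list position becomes a row.
print(dogs)
```

With this in place, the later examples that use `dogs` can be run as written.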


Exploring a DataFrame: .head()
dogs.head()

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
4 Max Labrador Black 59 29 2017-01-20


Exploring a DataFrame: .info()
dogs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
name 7 non-null object
breed 7 non-null object
color 7 non-null object
height_cm 7 non-null int64
weight_kg 7 non-null int64
date_of_birth 7 non-null object
dtypes: int64(2), object(4)


Exploring a DataFrame: .shape
dogs.shape

(7, 6)


Exploring a DataFrame: .describe()
dogs.describe()

height_cm weight_kg
count 7.000000 7.000000
mean 49.714286 27.428571
std 17.960274 22.292429
min 18.000000 2.000000
25% 44.500000 19.500000
50% 49.000000 23.000000
75% 57.500000 27.000000
max 77.000000 74.000000


Components of a DataFrame: .values
dogs.values

array([['Bella', 'Labrador', 'Brown', 56, 24, '2013-07-01'],
       ['Charlie', 'Poodle', 'Black', 43, 24, '2016-09-16'],
       ['Lucy', 'Chow Chow', 'Brown', 46, 24, '2014-08-25'],
       ['Cooper', 'Schnauzer', 'Gray', 49, 17, '2011-12-11'],
       ['Max', 'Labrador', 'Black', 59, 29, '2017-01-20'],
       ['Stella', 'Chihuahua', 'Tan', 18, 2, '2015-04-20'],
       ['Bernie', 'St. Bernard', 'White', 77, 74, '2018-02-27']],
      dtype=object)


Components of a DataFrame: .columns and .index
dogs.columns

Index(['name', 'breed', 'color', 'height_cm', 'weight_kg', 'date_of_birth'],
      dtype='object')

dogs.index

RangeIndex(start=0, stop=7, step=1)
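
The three components fit together: `.values` holds the data, `.columns` labels its columns, and `.index` labels its rows. The sketch below checks that relationship on a small stand-in DataFrame (three of the dogs and two of the columns, chosen only to keep the example short):

```python
import pandas as pd

# A small stand-in for `dogs`; the subset of rows and columns is arbitrary.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
})

n_rows, n_cols = dogs.shape

# The 2D array of values has one row per index label
# and one column per column label.
assert dogs.values.shape == (n_rows, n_cols)
assert len(dogs.index) == n_rows
assert len(dogs.columns) == n_cols

print(list(dogs.columns))  # ['name', 'height_cm']
print(list(dogs.index))    # [0, 1, 2]
```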


pandas Philosophy
There should be one -- and preferably only one -- obvious way to do it.

     - The Zen of Python by Tim Peters, Item 13

1 https://www.python.org/dev/peps/pep-0020/


Let's practice!

Sorting and subsetting

Sorting
dogs.sort_values("weight_kg")

name breed color height_cm weight_kg date_of_birth


5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27


Sorting in descending order
dogs.sort_values("weight_kg", ascending=False)

name breed color height_cm weight_kg date_of_birth


6 Bernie St. Bernard White 77 74 2018-02-27
4 Max Labrador Black 59 29 2017-01-20
0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
3 Cooper Schnauzer Gray 49 17 2011-12-11
5 Stella Chihuahua Tan 18 2 2015-04-20


Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"])

name breed color height_cm weight_kg date_of_birth


5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27


Sorting by multiple variables
dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

name breed color height_cm weight_kg date_of_birth


5 Stella Chihuahua Tan 18 2 2015-04-20
3 Cooper Schnauzer Gray 49 17 2011-12-11
0 Bella Labrador Brown 56 24 2013-07-01
2 Lucy Chow Chow Brown 46 24 2014-08-25
1 Charlie Poodle Black 43 24 2016-09-16
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27
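
Two details are easy to miss in the sorted outputs above: the original row labels travel with the rows, and `.sort_values()` returns a new DataFrame rather than changing `dogs`. The sketch below illustrates both on a four-dog stand-in; the call to `.reset_index(drop=True)` is an optional extra step that is not on the slides.

```python
import pandas as pd

# Four of the dogs, enough to show a tie on weight_kg.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24, 24, 24, 17],
})

# Lightest first; among equal weights, tallest first.
sorted_dogs = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)
print(sorted_dogs["name"].tolist())  # ['Cooper', 'Bella', 'Lucy', 'Charlie']
print(sorted_dogs.index.tolist())    # [3, 0, 2, 1]

# The original DataFrame keeps its order.
print(dogs["name"].tolist())         # ['Bella', 'Charlie', 'Lucy', 'Cooper']

# Optional: renumber the rows 0..n-1 after sorting.
renumbered = sorted_dogs.reset_index(drop=True)
print(renumbered.index.tolist())     # [0, 1, 2, 3]
```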


Subsetting columns
dogs["name"]

0 Bella
1 Charlie
2 Lucy
3 Cooper
4 Max
5 Stella
6 Bernie
Name: name, dtype: object


Subsetting multiple columns
dogs[["breed", "height_cm"]]

breed height_cm
0 Labrador 56
1 Poodle 43
2 Chow Chow 46
3 Schnauzer 49
4 Labrador 59
5 Chihuahua 18
6 St. Bernard 77

cols_to_subset = ["breed", "height_cm"]
dogs[cols_to_subset]

breed height_cm
0 Labrador 56
1 Poodle 43
2 Chow Chow 46
3 Schnauzer 49
4 Labrador 59
5 Chihuahua 18
6 St. Bernard 77


Subsetting rows
dogs["height_cm"] > 50

0 True
1 False
2 False
3 False
4 True
5 False
6 True
Name: height_cm, dtype: bool


Subsetting rows
dogs[dogs["height_cm"] > 50]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20
6 Bernie St. Bernard White 77 74 2018-02-27


Subsetting based on text data
dogs[dogs["breed"] == "Labrador"]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
4 Max Labrador Black 59 29 2017-01-20


Subsetting based on dates
dogs[dogs["date_of_birth"] > "2015-01-01"]

name breed color height_cm weight_kg date_of_birth


1 Charlie Poodle Black 43 24 2016-09-16
4 Max Labrador Black 59 29 2017-01-20
5 Stella Chihuahua Tan 18 2 2015-04-20
6 Bernie St. Bernard White 77 74 2018-02-27
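
The `.info()` output earlier lists `date_of_birth` as `object`, so the comparison above is presumably a string comparison. That gives the right answer here because dates written as YYYY-MM-DD sort the same way alphabetically and chronologically. A hedged sketch of the alternative, converting the column to real datetimes with `pd.to_datetime()` first (a step the slides do not show), might look like this:

```python
import pandas as pd

# A four-dog stand-in with dates stored as ISO-formatted strings.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Cooper", "Stella"],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2011-12-11", "2015-04-20"],
})

# String comparison, as on the slide.
born_after_2015_str = dogs[dogs["date_of_birth"] > "2015-01-01"]

# Datetime comparison: convert the column, then compare against a Timestamp.
dogs["date_of_birth"] = pd.to_datetime(dogs["date_of_birth"])
born_after_2015_dt = dogs[dogs["date_of_birth"] > pd.Timestamp("2015-01-01")]

print(born_after_2015_str["name"].tolist())  # ['Charlie', 'Stella']
print(born_after_2015_dt["name"].tolist())   # ['Charlie', 'Stella']
```

Converting first is the safer choice when the date strings are not in YYYY-MM-DD form, since other formats do not sort chronologically as text.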


Subsetting based on multiple conditions
is_lab = dogs["breed"] == "Labrador"

is_brown = dogs["color"] == "Brown"

dogs[is_lab & is_brown]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01

dogs[ (dogs["breed"] == "Labrador") & (dogs["color"] == "Brown") ]
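
The parentheses in the one-line version matter: in Python, `&` binds more tightly than `==`, so each comparison has to be wrapped before the two are combined. The sketch below, on a four-dog stand-in, also shows the related operators `|` (or) and `~` (not), which the slide does not cover.

```python
import pandas as pd

# A four-dog stand-in for `dogs`.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Black"],
})

is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"

lab_and_brown = dogs[is_lab & is_brown]  # both conditions hold
lab_or_brown = dogs[is_lab | is_brown]   # at least one condition holds
not_lab = dogs[~is_lab]                  # condition does not hold

print(lab_and_brown["name"].tolist())  # ['Bella']
print(lab_or_brown["name"].tolist())   # ['Bella', 'Lucy', 'Max']
print(not_lab["name"].tolist())        # ['Charlie', 'Lucy']
```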


Subsetting using .isin()
is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
dogs[is_black_or_brown]

name breed color height_cm weight_kg date_of_birth


0 Bella Labrador Brown 56 24 2013-07-01
1 Charlie Poodle Black 43 24 2016-09-16
2 Lucy Chow Chow Brown 46 24 2014-08-25
4 Max Labrador Black 59 29 2017-01-20
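
As a rough illustration of what `.isin()` saves, the sketch below checks on a four-dog stand-in that it produces the same mask as chaining `==` comparisons with `|`, and then inverts the mask with `~` to keep every other color. The inversion is an extra not shown on the slides.

```python
import pandas as pd

# A four-dog stand-in for `dogs`.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Cooper", "Stella"],
    "color": ["Brown", "Black", "Gray", "Tan"],
})

is_black_or_brown = dogs["color"].isin(["Black", "Brown"])

# The same mask written out by hand; this grows quickly with more values.
same_mask = (dogs["color"] == "Black") | (dogs["color"] == "Brown")
assert is_black_or_brown.equals(same_mask)

# Inverting the mask keeps the dogs whose color is NOT in the list.
other_colors = dogs[~is_black_or_brown]
print(other_colors["name"].tolist())  # ['Cooper', 'Stella']
```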


Let's practice!

New columns

Adding a new column
dogs["height_m"] = dogs["height_cm"] / 100

print(dogs)

name breed color height_cm weight_kg date_of_birth height_m


0 Bella Labrador Brown 56 24 2013-07-01 0.56
1 Charlie Poodle Black 43 24 2016-09-16 0.43
2 Lucy Chow Chow Brown 46 24 2014-08-25 0.46
3 Cooper Schnauzer Gray 49 17 2011-12-11 0.49
4 Max Labrador Black 59 29 2017-01-20 0.59
5 Stella Chihuahua Tan 18 2 2015-04-20 0.18
6 Bernie St. Bernard White 77 74 2018-02-27 0.77


Doggy mass index
BMI = weight in kg / (height in m)²

dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2


print(dogs.head())

name breed color height_cm weight_kg date_of_birth height_m bmi


0 Bella Labrador Brown 56 24 2013-07-01 0.56 76.530612
1 Charlie Poodle Black 43 24 2016-09-16 0.43 129.799892
2 Lucy Chow Chow Brown 46 24 2014-08-25 0.46 113.421550
3 Cooper Schnauzer Gray 49 17 2011-12-11 0.49 70.803832
4 Max Labrador Black 59 29 2017-01-20 0.59 83.309394


Multiple manipulations
bmi_lt_100 = dogs[dogs["bmi"] < 100]

bmi_lt_100_height = bmi_lt_100.sort_values("height_cm", ascending=False)

bmi_lt_100_height[["name", "height_cm", "bmi"]]

name height_cm bmi


4 Max 59 83.309394
0 Bella 56 76.530612
3 Cooper 49 70.803832
5 Stella 18 61.728395
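
The three steps above each store an intermediate DataFrame. One possible alternative, sketched here on a five-dog stand-in, chains the same filter, sort, and column selection into a single expression. The chained style and the use of `.loc` for the final column selection are choices made for this sketch, not something the slides prescribe.

```python
import pandas as pd

# A five-dog stand-in with values taken from the slides.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Cooper", "Max", "Stella"],
    "height_cm": [56, 43, 49, 59, 18],
    "weight_kg": [24, 24, 17, 29, 2],
})
dogs["bmi"] = dogs["weight_kg"] / (dogs["height_cm"] / 100) ** 2

# Filter rows, sort by height (tallest first), then keep three columns.
result = (
    dogs[dogs["bmi"] < 100]
    .sort_values("height_cm", ascending=False)
    .loc[:, ["name", "height_cm", "bmi"]]
)
print(result["name"].tolist())  # ['Max', 'Bella', 'Cooper', 'Stella']
```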


Let's practice!