Join at slido.com
#2613947
LECTURE 2
Pandas, Part I
Introduction to pandas syntax, operators, and functions
Data 100/Data 200, Fall 2023 @ UC Berkeley
Narges Norouzi and Fernando Pérez
A Few Announcements
● The first set of assignments has been released
○ Lab #1 is due Tuesday, 8/29 at 11:59 PM
○ Homework #1 is due Thursday, 8/31 at 11:59 PM – complete both parts 1A and 1B.
Part 1A will also ask you to complete a brief Syllabus Quiz.
● Discussion assignments are released – see Ed announcement
● Office hours have begun – see Ed announcement.
Goals for This Lecture
Lecture 2, Data 100 Fall 2023
• Introduce pandas, an important Python library for working with data
• Key data structures: DataFrames, Series, Indices
• Extracting data: loc, iloc, []
This is the first of a three-lecture sequence about pandas.
Get ready: lots of code incoming!
• Lecture: introduce high-level concepts
• Lab, homework: practical experimentation
Agenda
• Tabular data
• Series, DataFrames, and Indices
• Data extraction with loc, iloc, and []
Tabular Data
Recall the Data Science Lifecycle
[Figure: the data science lifecycle, a cycle of Question & Problem Formulation, Data Acquisition, Exploratory Data Analysis, and Prediction and Inference, leading to Reports, Decisions, and Solutions.]
Plan for First Few Weeks
[Figure: the data science lifecycle (Question & Problem Formulation, Data Acquisition, Exploratory Data Analysis, Prediction and Inference; Reports, Decisions, and Solutions).]
(Weeks 1 and 2) Exploring and Cleaning Tabular Data: from datascience to pandas
(Weeks 2 and 3) Data Science in Practice: EDA, Data Cleaning, Text processing (regular expressions)
Congratulations!!!
[Figure: a box labeled "Box of Data".]
You have collected or have been given
a box of data.
What does this "data" actually look like?
How will you work with it?
Data Scientists Love Tabular Data
"Tabular data" = data in a table.
Typically:
A row represents one observation
(here, a single person running for
president in a particular year).
A column represents some
characteristic, or feature, of that
observation (here, the political party of
that person).
In Data 8, you worked with the datascience library using Tables.
In Data 100 (and beyond), we’ll use an industry-standard library called pandas.
Introducing the Standard Python Data Science Tool: pandas
pandas: the Python Data Analysis Library. The name stands for "panel data".
[Figure: a cartoon panda, the (unofficial) Data 100 logo.]
Introducing the Standard Python Data Science Tool: pandas
Using pandas, we can:
● Arrange data in a tabular format.
● Extract useful information filtered by specific conditions.
● Operate on data to gain new insights.
● Apply NumPy functions to our data (our friends from Data 8).
● Perform vectorized computations to speed up our analysis (Lab 1).
pandas is the standard tool across research and industry for working with tabular data.
The first week of Data 100 will serve as a "bootcamp" in helping you build familiarity with
operating on data with pandas.
Your Data 8 knowledge will serve you well! Much of our work will be in translating syntax.
Contents
Data used in this lecture
Unofficial datascience -> pandas translations
Primary notebook for lecture
DataFrames, Series, and Indices
DataFrames
In the "language" of pandas, we call a table a DataFrame.
We think of DataFrames as collections of named columns, called Series.
[Figure: a DataFrame, and a Series named "Candidate".]
Series
A Series is a 1-dimensional array-like object. It contains:
● A sequence of values of the same type.
● A sequence of data labels, called the index.
import pandas as pd  # pd is the conventional alias for pandas
s = pd.Series(["welcome", "to", "data 100"])
The index is accessed with s.index; the values are accessed with s.values.
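As a small sketch of the two parts of a Series described above, the snippet below inspects the index and the values of the slide's example. Both are attributes rather than methods, so no parentheses are needed; the helper names labels and values are introduced here only for illustration.

```python
import pandas as pd

s = pd.Series(["welcome", "to", "data 100"])

# With no index supplied, pandas assigns the default labels 0, 1, 2.
print(s.index)
# The values are stored as an array holding the three strings.
print(s.values)

# Convert to plain lists to make the contents easy to compare.
labels = list(s.index)
values = list(s.values)
```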
Series - Custom Index
● We can provide index labels for items in a Series by passing an index list.
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s.index
● A Series index can also be changed.
s.index = ["first", "second", "third"]
s.index
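The relabeling step can be sketched end to end as follows. The variable names original_labels, new_labels, and value_at_second are illustrative additions, not part of the lecture code.

```python
import pandas as pd

s = pd.Series([-1, 10, 2], index=["a", "b", "c"])
original_labels = list(s.index)

# Assigning to s.index relabels the same values;
# the new list must supply exactly one label per value.
s.index = ["first", "second", "third"]
new_labels = list(s.index)

# The data is untouched: the value formerly labeled "b" is now "second".
value_at_second = s["second"]
```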
Selection in Series
● We can select a single value or a set of values in a Series using:
○ A single label
○ A list of labels
○ A filtering condition
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
Selection in Series
● We can select a single value or a set of values in a Series using:
○ A single label
○ A list of labels
○ A filtering condition
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s["a"]
Selection in Series
● We can select a single value or a set of values in a Series using:
○ A single label
○ A list of labels
○ A filtering condition
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s[["a", "c"]]
Selection in Series
● We can select a single value or a set of values in a Series using:
○ A single label
○ A list of labels
○ A filtering condition
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
● Say we want to select values in the Series that satisfy a particular condition:
1) Apply a boolean condition to the Series. This creates a new Series of boolean values.
2) Index into our Series using this boolean condition. pandas will select only the entries in
the Series that satisfy the condition.
s > 0
s[s > 0]
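The three selection styles can be tried side by side on the slide's example Series. This is a minimal sketch; the names single, subset, mask, and positive are introduced here for illustration.

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

single = s["a"]          # a single label returns one value
subset = s[["a", "c"]]   # a list of labels returns a smaller Series

mask = s > 0             # step 1: a boolean Series, True where the value is positive
positive = s[mask]       # step 2: keep only the entries where mask is True
```

Note that 0 is not greater than 0, so the entry labeled "c" is filtered out along with the negative value.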
DataFrames of Series!
Typically, we will work with Series using the perspective that they are columns in a DataFrame.
We can think of a DataFrame as a collection of Series that all share the same Index.
[...]
[Figure: the Series "Year", the Series "Candidate", and the DataFrame elections.]
Non-native English speaker note: The plural of "series" is "series". Sorry.
Creating a DataFrame
The syntax for creating a DataFrame is:
pandas.DataFrame(data, index, columns)
Many approaches exist for creating a DataFrame. Here, we will go over the most popular ones.
● From a CSV file.
● Using a list and column name(s).
● From a dictionary.
● From a Series.
Creating a DataFrame
The syntax for creating a DataFrame is:
pandas.DataFrame(data, index, columns)
Many approaches exist for creating a DataFrame. Here, we will go over the most popular ones.
● From a CSV file: elections = pd.read_csv("data/elections.csv")
● Using a list and column name(s).
● From a dictionary.
● From a Series.
[Figure: the DataFrame elections.]
Creating a DataFrame
The syntax for creating a DataFrame is:
pandas.DataFrame(data, index, columns)
Many approaches exist for creating a DataFrame. Here, we will go over the most popular ones.
● From a CSV file: elections = pd.read_csv("data/elections.csv", index_col="Year")
● Using a list and column name(s).
● From a dictionary.
● From a Series.
[Figure: the DataFrame elections with "Year" as the Index.]
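To see the effect of index_col without needing the course data file, the sketch below reads a small CSV from an in-memory string. The CSV text is a hypothetical stand-in for data/elections.csv: the column names echo the lecture, but the rows are invented placeholders rather than real election results.

```python
import io
import pandas as pd

# Hypothetical CSV text (assumption): placeholder rows, not real data.
csv_text = "Year,Candidate,Party\n2000,A,X\n2000,B,Y\n2004,C,X\n"

# Without index_col, pandas numbers the rows 0, 1, 2 and keeps Year as a column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col="Year", the Year column becomes the row labels instead.
year_index = pd.read_csv(io.StringIO(csv_text), index_col="Year")
```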
Creating a DataFrame
Many approaches exist for creating a DataFrame. Here, we will go over the most popular ones.
● From a CSV file.
● Using a list and column name(s).
● From a dictionary.
● From a Series.
pd.DataFrame([1, 2, 3], columns=["Numbers"])
pd.DataFrame([[1, "one"], [2, "two"]], columns=["Number", "Description"])
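A brief sketch of how the two list-based calls differ in shape. The names numbers and described are illustrative only.

```python
import pandas as pd

# A flat list becomes a single column with one row per element.
numbers = pd.DataFrame([1, 2, 3], columns=["Numbers"])

# A list of lists becomes one row per inner list,
# so each inner list needs one entry per column name.
described = pd.DataFrame([[1, "one"], [2, "two"]],
                         columns=["Number", "Description"])
```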
Creating a DataFrame
Many approaches exist for creating a DataFrame. Here, we will go over the most popular ones.
● From a CSV file.
● Using a list and column name(s).
● From a dictionary.
● From a Series.
Specify columns of the DataFrame:
pd.DataFrame({"Fruit":["Strawberry", "Orange"],
"Price": [5.49, 3.99]})
Specify rows of the DataFrame:
pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49},
{"Fruit":"Orange", "Price":3.99}])
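The column-wise and row-wise dictionary forms describe the same table. The sketch below builds both and compares them; the names by_columns, by_rows, and same are introduced here for illustration.

```python
import pandas as pd

# One dictionary: each key is a column name, each value is that column's data.
by_columns = pd.DataFrame({"Fruit": ["Strawberry", "Orange"],
                           "Price": [5.49, 3.99]})

# A list of dictionaries: each dictionary is one row.
by_rows = pd.DataFrame([{"Fruit": "Strawberry", "Price": 5.49},
                        {"Fruit": "Orange", "Price": 3.99}])

# DataFrame.equals checks that labels, dtypes, and values all match.
same = by_columns.equals(by_rows)
```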
Creating a DataFrame
Many approaches exist for creating a DataFrame. Here, we will go over the most popular ones.
● From a CSV file.
● Using a list and column name(s).
● From a dictionary.
● From a Series.
s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])
pd.DataFrame({"A-column":s_a, "B-column":s_b})
pd.DataFrame(s_a)
s_a.to_frame()
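A short sketch of what the Series-based constructions produce. One detail worth knowing: an unnamed Series converted with to_frame() gets the column label 0, and to_frame accepts a name argument to set a label instead. The variable names two_columns, unnamed, and named are illustrative.

```python
import pandas as pd

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index=["r1", "r2", "r3"])

# The dictionary keys become column names; the shared index becomes the row labels.
two_columns = pd.DataFrame({"A-column": s_a, "B-column": s_b})

# A single unnamed Series becomes a one-column DataFrame labeled 0.
unnamed = s_a.to_frame()
# Passing name= gives the column a descriptive label.
named = s_a.to_frame(name="A-column")
```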
Indices Are Not Necessarily Row Numbers
An Index (a.k.a. row labels) can also:
● Be non-numeric.
● Have a name, e.g. "Candidate".
# Creating a DataFrame from a CSV file and specifying the Index column
elections = pd.read_csv("data/elections.csv", index_col = "Candidate")
Indices Are Not Necessarily Unique
The row labels that constitute an index do not have to be unique.
● Left: The index values are all unique and numeric, acting as a row number.
● Right: The index values are named and non-unique.
Modifying Indices
● We can select a new column and set it as the index of the DataFrame.
Example: Setting the index to the "Party" column.
elections.set_index("Party")
Resetting the Index
● We can change our mind and reset the Index back to the default list of integers.
elections.reset_index()
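Both set_index and reset_index return a new DataFrame by default rather than modifying the original, which is easy to miss. The sketch below uses a hypothetical three-row table named toy (invented placeholder rows, not the real elections data) to show the round trip.

```python
import pandas as pd

# Hypothetical stand-in for the elections table (assumption): placeholder rows.
toy = pd.DataFrame({"Candidate": ["A", "B", "C"],
                    "Party": ["X", "Y", "X"]})

# set_index returns a new DataFrame whose row labels come from "Party".
by_party = toy.set_index("Party")

# reset_index moves "Party" back into a column and restores labels 0, 1, 2.
restored = by_party.reset_index()

# toy itself is unchanged, because neither call modified it in place.
```

To keep the result, assign it to a name, as done with by_party and restored here.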
Column Names Are Usually Unique!
Column names in pandas are almost always unique.
● Example: Really shouldn’t have two columns named "Candidate".
Retrieving the Index, Columns, and shape
Sometimes you'll want to extract the list of row and column labels.
elections.set_index("Party")
For row labels, use DataFrame.index: elections.index
For column labels, use DataFrame.columns: elections.columns
For the shape of the DataFrame, use DataFrame.shape: elections.shape
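A minimal sketch of these three attributes, again on a hypothetical placeholder table named toy rather than the real elections data. Note that a column used as the index no longer counts toward the number of columns in shape.

```python
import pandas as pd

# Hypothetical stand-in for the elections table (assumption): placeholder rows.
toy = pd.DataFrame({"Candidate": ["A", "B", "C"],
                    "Party": ["X", "Y", "X"],
                    "Year": [2000, 2000, 2004]})
by_party = toy.set_index("Party")

row_labels = list(by_party.index)        # the row labels, here party names
column_labels = list(by_party.columns)   # the remaining column names
n_rows, n_cols = by_party.shape          # shape is a (rows, columns) tuple
```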
The Relationship Between DataFrames, Series, and Indices
We can think of a DataFrame as a collection of Series that all share the same Index.
● Candidate, Party, %, Year, and Result Series all share an Index from 0 to 5.
[Figure: the Candidate, Party, %, Year, and Result Series.]
The DataFrame API
The API for the DataFrame class is enormous.
● API: "Application Programming Interface".
● The API is the set of abstractions supported by the class.
Full documentation is at
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
● Compare with the Table class from Data 8: http://data8.org/datascience/tables.html
● We will only consider a tiny portion of this API.
We want you to get familiar with the real world programming practice of… Googling!
● Answers to your questions are often found in the pandas documentation, Stack Overflow,
etc.
Data Extraction with loc, iloc, and [ ]
Extracting Data
One of the most basic tasks for manipulating a DataFrame is to extract rows and columns of interest. As we'll see, the large pandas API means there are many ways to do things.
Common ways we may want to extract data:
● Grab the first or last n rows in the DataFrame.
● Grab data with a certain label.
● Grab data at a certain position.
We'll find that all three of these methods are useful to us in data manipulation tasks.
.head and .tail
The simplest scenarios: We want to extract the first or last n rows from the DataFrame.
● df.head(n) will return the first n rows of the DataFrame df.
● df.tail(n) will return the last n rows.
[Figure: the full elections DataFrame, alongside the output of elections.head(5) and elections.tail(5).]
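The behavior of head and tail can be sketched on a hypothetical ten-row table; the DataFrame df and its column "n" are made up for this illustration.

```python
import pandas as pd

# Hypothetical ten-row table (assumption), used only to illustrate head and tail.
df = pd.DataFrame({"n": range(10)})

first_three = df.head(3)   # rows at positions 0, 1, 2
last_three = df.tail(3)    # rows at positions 7, 8, 9

# The extracted rows keep their original row labels.
tail_labels = list(last_three.index)
```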
Label-based Extraction: .loc
A more complex task: We want to extract data with specific column or index labels.
df.loc[row_labels, column_labels]
The .loc accessor allows us to specify the labels of rows and columns we wish to extract.
● We describe "labels" as the bolded text at the top and left of a DataFrame.
[Figure: a DataFrame with its column labels (top) and row labels (left) marked.]
Label-based Extraction: .loc
Arguments to .loc can be:
● A list.
● A slice (syntax is inclusive of the right hand side of the slice).
● A single value.
Label-based Extraction: .loc
Arguments to .loc can be:
● A list.
● A slice (syntax is inclusive of the right hand side of the slice).
● A single value.
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]
Select the rows with labels 87, 25, and 179.
Select the columns with labels "Year", "Candidate", and "Result".
Label-based Extraction: .loc
Arguments to .loc can be:
● A list.
● A slice (syntax is inclusive of the right hand side of the slice).
● A single value.
elections.loc[[87, 25, 179], "Popular vote":"%"]
Select the rows with labels 87, 25, and 179.
Select all columns from "Popular vote" through "%" (both endpoints included).
Label-based Extraction: .loc
To extract all rows or all columns, use a colon (:).
elections.loc[:, ["Year", "Candidate", "Result"]]
All rows for the columns with labels "Year", "Candidate", and "Result".
Ellipses (...) indicate more rows not shown.
elections.loc[[87, 25, 179], :]
All columns for the rows with labels 87, 25, 179.
Label-based Extraction: .loc
Arguments to .loc can be:
● A list.
● A slice (syntax is inclusive of the right hand side of the slice).
● A single value.
elections.loc[[87, 25, 179], "Popular vote"]
Wait, what? Why did everything get so ugly?
We've extracted a subset of the "Popular vote" column as a Series.
elections.loc[0, "Candidate"]
We've extracted the string value with row label 0 and column label "Candidate".
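The argument types accepted by .loc can be tried together on a small table. The DataFrame toy below is a hypothetical stand-in that reuses the lecture's column names; its rows are invented placeholders, not real election results.

```python
import pandas as pd

# Hypothetical stand-in for the elections table (assumption): placeholder rows.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["A", "B", "C", "D"],
    "Party": ["X", "Y", "X", "Y"],
    "Popular vote": [100, 90, 120, 110],
    "Result": ["win", "loss", "win", "loss"],
    "%": [52.6, 47.4, 52.2, 47.8],
})

# Lists for both arguments return a DataFrame, in the order requested.
by_lists = toy.loc[[3, 0], ["Year", "Candidate", "Result"]]

# Label slices include BOTH endpoints: rows 0, 1, 2 and three columns.
by_slices = toy.loc[0:2, "Popular vote":"%"]

# A single column label (with several rows) returns a Series.
one_column = toy.loc[[3, 0], "Popular vote"]

# A single row label and a single column label return one value.
one_value = toy.loc[0, "Candidate"]
```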
Integer-based Extraction: .iloc
A different scenario: We want to extract data according to its position.
● Example: Grab the 1st, 4th, and 3rd columns of the DataFrame.
(Covered until here on 8/29.)
df.iloc[row_integers, column_integers]
The .iloc accessor allows us to specify the integers of rows and columns we wish to extract.
● Python convention: The first position has integer index 0.
[Figure: a DataFrame with column integers 0 to 5 along the top and row integers 0 to 4 down the side.]
Integer-based Extraction: .iloc
Arguments to .iloc can be:
● A list.
● A slice (syntax is exclusive of the right hand side of the slice).
● A single value.
Integer-based Extraction: .iloc
Arguments to .iloc can be:
● A list.
● A slice (syntax is exclusive of the right hand side of the slice).
● A single value.
elections.iloc[[1, 2, 3], [0, 1, 2]]
Select the rows at positions 1, 2, and 3.
Select the columns at positions 0, 1, and 2.
Integer-based Extraction: .iloc
Arguments to .iloc can be:
● A list.
● A slice (syntax is exclusive of the right hand side of the slice).
● A single value.
elections.iloc[[1, 2, 3], 0:3]
Select the rows at positions 1, 2, and 3.
Select all columns from integer 0 to integer 2.
Remember: integer-based slicing is right-end exclusive!
Integer-based Extraction: .iloc
Just like .loc, we can use a colon with .iloc to extract all rows or all columns.
elections.iloc[:, 0:3]
Grab all rows of the columns at integers 0 to 2.
Integer-based Extraction: .iloc
Arguments to .iloc can be:
● A list.
● A slice (syntax is exclusive of the right hand side of the slice).
● A single value.
elections.iloc[[1, 2, 3], 1]
As before, the result for a single value argument is a Series.
We have extracted row integers 1, 2, and 3 from the column at position 1.
elections.iloc[0, 1]
We've extracted the string value with row position 0 and column position 1.
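The same kinds of arguments can be sketched for .iloc. As before, toy is a hypothetical stand-in with the lecture's column names and invented placeholder rows.

```python
import pandas as pd

# Hypothetical stand-in for the elections table (assumption): placeholder rows.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004],
    "Candidate": ["A", "B", "C", "D"],
    "Party": ["X", "Y", "X", "Y"],
})

# Lists of positions for both rows and columns.
by_lists = toy.iloc[[1, 2, 3], [0, 1, 2]]

# Integer slices EXCLUDE the right endpoint: 0:3 means positions 0, 1, 2.
by_slice = toy.iloc[[1, 2, 3], 0:3]

# A single column position (with several rows) returns a Series.
one_column = toy.iloc[[1, 2, 3], 1]

# A single row position and a single column position return one value.
one_value = toy.iloc[0, 1]
```

Here by_lists and by_slice select the same cells, which is a quick way to confirm that the slice stops before position 3.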
.loc vs .iloc
Remember:
● .loc performs label-based extraction
● .iloc performs integer-based extraction
When choosing between .loc and .iloc, you'll usually choose .loc.
● Safer: If the order of data gets shuffled in a public database, your code still works.
● Readable: It is easier to understand what elections.loc[:, ["Year", "Candidate", "Result"]] means than elections.iloc[:, [0, 1, 4]].
.iloc can still be useful.
● Example: If you have a DataFrame of movie earnings sorted by earnings, you can use .iloc to get the median earnings for a given year (index into the middle).
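The "safer" point can be illustrated by reordering the columns of a table and repeating the same extraction by label and by position. The table toy and its rows are hypothetical placeholders, and the reordering stands in for a data source whose layout changes.

```python
import pandas as pd

# Hypothetical stand-in for the elections table (assumption): placeholder rows.
toy = pd.DataFrame({"Year": [2000, 2004],
                    "Candidate": ["A", "C"],
                    "Result": ["win", "win"]})

# Same data, but with the columns in a different order.
shuffled = toy[["Result", "Year", "Candidate"]]

# Label-based extraction finds the same column either way.
by_label_before = toy.loc[:, "Candidate"]
by_label_after = shuffled.loc[:, "Candidate"]

# Position-based extraction silently returns a different column after the reorder.
by_position_before = toy.iloc[:, 1]
by_position_after = shuffled.iloc[:, 1]
```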
… Just When It Was All Making Sense
[Figure: loc, iloc, and [] shown together.]
Context-dependent Extraction: [ ]
Selection operators:
● .loc selects items by label. First argument is rows, second argument is columns.
● .iloc selects items by integer. First argument is rows, second argument is columns.
● [] only takes one argument, which may be:
○ A slice of row numbers.
○ A list of column labels.
○ A single column label.
That is, [] is context sensitive.
Let’s see some examples.
Context-dependent Extraction: [ ]
[] only takes one argument, which may be:
● A slice of row integers.
● A list of column labels.
● A single column label.
elections[3:7]
Context-dependent Extraction: [ ]
[] only takes one argument, which may be:
● A slice of row numbers.
● A list of column labels.
● A single column label.
elections[["Year", "Candidate", "Result"]]
Context-dependent Extraction: [ ]
[] only takes one argument, which may be:
● A slice of row numbers.
● A list of column labels.
● A single column label.
elections["Candidate"]
Extract the "Candidate" column as a Series.
Why Use []?
In short: [ ] can be much more concise than .loc or .iloc.
● Consider the case where we wish to extract the "Candidate" column. It is far simpler to
write elections["Candidate"] than it is to write elections.loc[:, "Candidate"]
In practice, [ ] is often used over .iloc and .loc in data science work. Typing time adds up!
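A compact sketch of the three behaviors of [ ], plus a check that the short spelling matches the .loc spelling. The table toy is a hypothetical stand-in with invented placeholder rows and the default integer index.

```python
import pandas as pd

# Hypothetical stand-in for the elections table (assumption): placeholder rows.
toy = pd.DataFrame({"Year": [2000, 2000, 2004, 2004],
                    "Candidate": ["A", "B", "C", "D"],
                    "Result": ["win", "loss", "win", "loss"]})

# A slice selects ROWS by position, excluding the right endpoint.
rows = toy[1:3]

# A list of labels selects COLUMNS and returns a DataFrame.
columns = toy[["Year", "Candidate"]]

# A single label selects one COLUMN and returns a Series.
candidate = toy["Candidate"]

# The concise form gives the same result as the explicit .loc form.
same_as_loc = candidate.equals(toy.loc[:, "Candidate"])
```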
LECTURE 2
Pandas, Part I
Content credit: Acknowledgments