Lecture 7: Understanding DataFrames in Python and R
References
• Python DataFrame: https://www.w3schools.com/python/pandas/pandas_dataframes.asp
• R DataFrame: https://www.w3schools.com/r/r_data_frames.asp
Data Science is a branch of computer science in which we study how to store, use, and analyze
data in order to derive information from it.
What is Pandas?
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was
created by Wes McKinney in 2008.
Pandas can also delete rows that are not relevant or that contain wrong values, such as empty
or NULL values. This is called cleaning the data.
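As a minimal sketch of the cleaning step described above, the snippet below removes rows that contain empty values with dropna(). The column names and numbers are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical workout log; None marks a missing measurement.
raw = pd.DataFrame({
    "duration": [60, 45, None, 30],
    "pulse": [110, None, 103, 98],
})

# dropna() returns a new DataFrame without the rows that contain empty values.
cleaned = raw.dropna()

print(len(raw), "rows before cleaning")
print(len(cleaned), "rows after cleaning")
```

dropna() leaves the original DataFrame unchanged, so the raw data stays available for comparison.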
Installation of Pandas
If you have Python and pip already installed on a system, then installing Pandas is very
easy. Install it with this command:
pip install pandas
If this command fails, use a Python distribution that already has Pandas installed, such as
Anaconda or Spyder.
Import Pandas
Once Pandas is installed, import it in your applications by adding the import keyword:
import pandas
Example
import pandas
mydataset = {
    'cars': ["BMW", "Volvo", "Ford"],
    'passings': [3, 7, 2]
}
myvar = pandas.DataFrame(mydataset)
print(myvar)
Pandas Series
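What is a Series?
A Pandas Series is like a column in a table. It is a one-dimensional array that can hold data
of any type.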
Example
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
If nothing else is specified, the values are labeled with their index number. The first value
has index 0, the second value has index 1, and so on.
This label can be used to access a specific value.
Example
print(myvar[0])
With the index argument, you can name your own labels.
Example
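import pandas as pd
a = [1, 7, 2]
#labels assumed to be x, y, z; only "y" is confirmed by the next example
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)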
When you have created labels, you can access an item by referring to the label.
Example
print(myvar["y"])
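Once a Series has custom labels, a value can be reached either by its label or by its position. The sketch below assumes the labels x, y, z (only "y" appears in the text above) and uses the loc and iloc accessors to make the two kinds of lookup explicit.

```python
import pandas as pd

values = [1, 7, 2]
# Labels are an assumption for illustration.
labeled = pd.Series(values, index=["x", "y", "z"])

by_label = labeled.loc["y"]     # look up by label
by_position = labeled.iloc[1]   # look up by integer position (0-based)

print(by_label, by_position)
```

Writing loc or iloc explicitly avoids ambiguity about whether a number is meant as a label or as a position.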
You can also use a key/value object, like a dictionary, when creating a Series.
Example
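import pandas as pd
#example values assumed; they match the calories data used later in this lecture
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)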
Example
import pandas as pd
print(myvar)
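When a Series is built from a dictionary, the keys become the labels. The following sketch assumes a small calories dictionary (the keys and values are illustrative) and also shows that the index argument can be used to keep only some of the keys.

```python
import pandas as pd

# Illustrative data: one calorie total per day.
calories = {"day1": 420, "day2": 380, "day3": 390}

# Every key becomes a label.
all_days = pd.Series(calories)

# Passing index keeps only the listed keys, in the listed order.
first_two = pd.Series(calories, index=["day1", "day2"])

print(all_days)
print(first_two)
```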
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
Example
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
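One way to see how a Series and a DataFrame relate is to pull a single column out of the table. This short sketch reuses the calories and duration data from the example above; selecting a column by name gives back a Series.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

# Selecting one column by name returns a Series.
column = df["calories"]

print(type(column).__name__)
print(df.shape)  # (number of rows, number of columns)
```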
Pandas DataFrames
What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a
table with rows and columns.
Example
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
Result
   calories  duration
0       420        50
1       380        40
2       390        45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified row(s).
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
Result
calories    420
duration     50
Name: 0, dtype: int64
Example
Return rows 0 and 1:
#use a list of indexes:
print(df.loc[[0, 1]])
Result
   calories  duration
0       420        50
1       380        40
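The two results above have different shapes: asking loc for a single row gives back a Series, while asking for a list of rows gives back a DataFrame. A small sketch using the same data makes the difference visible.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

one_row = df.loc[0]         # a single label returns a Series
two_rows = df.loc[[0, 1]]   # a list of labels returns a DataFrame

print(type(one_row).__name__)
print(type(two_rows).__name__)
```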
Named Indexes
With the index argument, you can name your own indexes.
Example
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Result
      calories  duration
day1       420        50
day2       380        40
day3       390        45
Use the named index in the loc attribute to return the specified row(s).
Example
Return "day2":
#refer to the named index:
print(df.loc["day2"])
Result
calories    380
duration     40
Name: day2, dtype: int64
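With named indexes, rows can still be reached by position through iloc, and loc also accepts a range of labels. The sketch below reuses the day1 to day3 data; note that a label range in loc includes both end labels.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

row = df.loc["day2"]          # lookup by label
same_row = df.iloc[1]         # lookup by position (0-based)
span = df.loc["day1":"day2"]  # label range, both ends included

print(row)
print(span)
```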
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
CSV files contain plain text and are a well-known format that can be read by everyone,
including Pandas.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.to_string())
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
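The examples above need a data.csv file on disk. To try read_csv() without one, the sketch below reads the same kind of content from an in-memory text buffer; the column names follow the data set used in this lecture, and the three rows are only a small sample.

```python
import io
import pandas as pd

# A tiny CSV document held in a string instead of a file.
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409
60,117,145,479
60,103,135,340
"""

# read_csv() accepts any file-like object, not only a file name.
df = pd.read_csv(io.StringIO(csv_text))

# to_string() renders the whole DataFrame, however many rows it has.
print(df.to_string())
```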
max_rows
The number of rows returned is defined in Pandas option settings.
Example
import pandas as pd
print(pd.options.display.max_rows)
On my system the number is 60, which means that if the DataFrame contains more than 60
rows, the print(df) statement will return only the headers and the first and last 5 rows.
You can change the maximum number of rows with the same option.
Example
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
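Assigning to pd.options.display.max_rows changes the setting for the rest of the program. As a hedged alternative, pd.option_context() changes it only inside a with block; the sketch below uses a made-up 100-row DataFrame to compare a truncated and a full rendering.

```python
import pandas as pd

# Illustrative DataFrame with 100 rows.
df = pd.DataFrame({"n": range(100)})

limit_before = pd.get_option("display.max_rows")

# Inside the block the limit is low, so the output is truncated.
with pd.option_context("display.max_rows", 10):
    short_text = repr(df)

# Inside this block the limit is high enough to show every row.
with pd.option_context("display.max_rows", 200):
    full_text = repr(df)

print(len(short_text.splitlines()), "lines when truncated")
print(len(full_text.splitlines()), "lines in full")
```

After each with block the option returns to its previous value, so other code is not affected.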
Pandas Read JSON
Big data sets are often stored or extracted as JSON.
JSON is plain text, but has the format of an object, and is well known in the world of
programming, including Pandas.
The example below reads a JSON file called data.json.
Example
import pandas as pd
df = pd.read_json('data.json')
print(df.to_string())
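read_json() can also be tried without a data.json file by reading from an in-memory text buffer. This is a minimal sketch; the JSON text is a three-row sample in the same column-oriented layout as the dictionary shown later in this lecture.

```python
import io
import pandas as pd

# JSON text: each column maps row labels to values.
json_text = """
{
  "Duration": {"0": 60, "1": 60, "2": 60},
  "Pulse": {"0": 110, "1": 117, "2": 103}
}
"""

# read_json() accepts a file-like object as well as a file path.
df = pd.read_json(io.StringIO(json_text))

print(df.to_string())
```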
Dictionary as JSON
If your JSON code is not in a file, but in a Python Dictionary, you can load it into a
DataFrame directly:
Example
import pandas as pd
data = {
"Duration":{
"0":60,
"1":60,
"2":60,
"3":45,
"4":45,
"5":60
},
"Pulse":{
"0":110,
"1":117,
"2":103,
"3":109,
"4":117,
"5":102
},
"Maxpulse":{
"0":130,
"1":145,
"2":135,
"3":175,
"4":148,
"5":127
},
"Calories":{
"0":409,
"1":479,
"2":340,
"3":282,
"4":406,
"5":300
}
}
df = pd.DataFrame(data)
print(df)
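One detail worth checking in the example above: the row keys "0", "1", ... are strings, and pd.DataFrame() keeps them as string labels. The sketch below, using a shortened copy of the same dictionary, looks up a row by its string label and then converts the index to integers.

```python
import pandas as pd

# Shortened copy of the dictionary above (first three rows, two columns).
data = {
    "Duration": {"0": 60, "1": 60, "2": 60},
    "Pulse": {"0": 110, "1": 117, "2": 103},
}
df = pd.DataFrame(data)

# The labels are the strings "0", "1", "2", so the lookup uses a string.
first = df.loc["0"]

# Convert the labels to integers if numeric row labels are preferred.
df.index = df.index.astype(int)
first_again = df.loc[0]

print(first)
print(df.index.tolist())
```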
One of the most used methods for getting a quick overview of a DataFrame is
the head() method.
The head() method returns the headers and a specified number of rows, starting from the top.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head(10))
Note: if the number of rows is not specified, the head() method will return the top 5 rows.
Example
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
There is also a tail() method for viewing the last rows of the DataFrame.
The tail() method returns the headers and a specified number of rows, starting from the
bottom.
Example
print(df.tail())
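To see head() and tail() side by side without a data file, the sketch below builds a made-up ten-row DataFrame and takes rows from both ends.

```python
import pandas as pd

# Illustrative DataFrame with ten rows: 10, 20, ..., 100.
df = pd.DataFrame({"Duration": range(10, 110, 10)})

top3 = df.head(3)        # first three rows
default_top = df.head()  # no argument: first five rows
last2 = df.tail(2)       # last two rows

print(top3)
print(last2)
```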
Example
Print information about the data:
print(df.info())
Result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Duration  169 non-null    int64
 1   Pulse     169 non-null    int64
 2   Maxpulse  169 non-null    int64
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None
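Result Explained
The result tells us that there are 169 rows and 4 columns, and it lists the name and data
type of each column.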
Null Values
The info() method also tells us how many Non-Null values are present in each column. In our
data set, 164 of the 169 values in the "Calories" column are Non-Null.
This means that 5 rows have no value at all in the "Calories" column, for whatever reason.
Empty values, or Null values, can be bad when analyzing data, and you should consider
removing rows with empty values. This is a step towards what is called cleaning data, and
you will learn more about that in the next chapters.
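The Non-Null counts reported by info() can also be obtained directly as numbers. The sketch below uses a small made-up DataFrame with two missing "Calories" values: count() gives the Non-Null values per column and isna().sum() gives the missing ones.

```python
import pandas as pd

# Illustrative data: two of the four Calories values are missing.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45],
    "Calories": [409.1, None, 282.4, None],
})

non_null = df.count()      # Non-Null values per column
missing = df.isna().sum()  # missing values per column

print(non_null)
print(missing)
```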
R Data Frames
Data Frames
A data frame can hold different types of data. While the first column can
be character, the second and third can be numeric or logical. However, all values within a
single column should have the same type.
Example
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame
summary(Data_Frame)
Access Items
We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a data frame:
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
Add Rows
Use the rbind() function to add new rows in a Data Frame:
Example
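Data_Frame <- data.frame (
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
# Add a new row (the values and the name New_row_DF are illustrative)
New_row_DF <- rbind(Data_Frame, data.frame(Training = "Strength", Pulse = 110, Duration = 110))
# Print the result
New_row_DF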
Add Columns
Use the cbind() function to add new columns in a Data Frame:
Example
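Data_Frame <- data.frame (
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
# Add a new column (New_col_DF is an illustrative name; Steps values match Data_Frame4 in this lecture)
New_col_DF <- cbind(Data_Frame, Steps = c(3000, 6000, 2000))
# Print the result
New_col_DF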
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
dim(Data_Frame)
You can also use the ncol() function to find the number of columns and nrow() to find the
number of rows:
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
ncol(Data_Frame)
nrow(Data_Frame)
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
length(Data_Frame)
Example
Data_Frame1 <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Use the cbind() function to combine two or more data frames in R horizontally:
Example
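Data_Frame3 <- data.frame (
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
Data_Frame4 <- data.frame (
  Steps = c(3000, 6000, 2000),
  Calories = c(300, 400, 300)
)
# Combine the two data frames horizontally (New_Data_Frame is an illustrative name)
New_Data_Frame <- cbind(Data_Frame3, Data_Frame4)
New_Data_Frame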