DS (Pandas)
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis".
The library was created by Wes McKinney in 2008.
What is a Series?
A Pandas Series is like a column in a table.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
output:
0 1
1 7
2 2
dtype: int64
Labels
If nothing else is specified, the values are labeled with their index number.
First value has index 0, second value has index 1 etc.
print(myvar[0])
Create Labels
With the index argument, you can name your own labels.
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
output
x 1
y 7
z 2
dtype: int64
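Once labels exist, values can be looked up by label instead of by position. A minimal sketch of that, reusing the list and labels from the example above (the variable names `y_value` and `subset` are only illustrative):

```python
import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a, index=["x", "y", "z"])

# Access a single value by its label
y_value = myvar["y"]
print(y_value)

# Access several values at once by passing a list of labels
subset = myvar[["x", "z"]]
print(subset)
```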
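import pandas as pd
# A dictionary can also be used to create a Series; the keys become the labels.
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)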
output:
day1 420
day2 380
day3 390
dtype: int64
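When a Series is built from a dictionary, the index argument can also be used to pick only some of the keys. A short sketch, assuming the same day1/day2/day3 values shown in the output above:

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# Only the keys listed in index end up in the Series
myvar = pd.Series(calories, index=["day1", "day2"])
print(myvar)
```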
DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional
array, or a table with rows and columns. Data sets in Pandas are usually
multi-dimensional tables, called DataFrames. A Series is like a column; a
DataFrame is the whole table.
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]}
df= pd.DataFrame(data)
print(df)
output
calories duration
0 420 50
1 380 40
2 390 45
Locate Row
Pandas uses the loc attribute to return one or more specified row(s).
print(df.loc[0])
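The difference between asking for one row and asking for several rows is easy to miss, so here is a small sketch using the same calories/duration data. The row labels "day1" to "day3" in the last step are illustrative and not part of the original example:

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

# A single label returns that row as a Series
row = df.loc[0]
print(row)

# A list of labels returns a DataFrame containing those rows
rows = df.loc[[0, 1]]
print(rows)

# Rows can be given names with the index argument (labels assumed here)
named = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(named.loc["day2"])
```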
C:\Users\LUV\Downloads\data.csv
import pandas as pd
df = pd.read_csv('data.csv')
If you have a large DataFrame with many rows, Pandas will only return the
first 5 rows, and the last 5 rows:
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
max_rows
The number of rows returned is defined in Pandas option settings.
You can check your system's maximum rows with
the pd.options.display.max_rows statement.
import pandas as pd
print(pd.options.display.max_rows)
On my system the number is 60, which means that if the DataFrame contains
more than 60 rows, the print(df) statement will return only the headers and
the first and last 5 rows.
You can change the maximum rows number with the same statement.
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')
print(df)
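If the setting should only change for one print, a context manager may be more convenient than assigning to pd.options directly. The sketch below is one possible approach, using a made-up 100-row DataFrame as a stand-in for a large CSV file:

```python
import pandas as pd

# Stand-in for a large data set: 100 rows of made-up numbers
df = pd.DataFrame({"value": range(100)})

# Temporarily lower the limit; the old setting is restored after the with block
with pd.option_context("display.max_rows", 10):
    short_text = repr(df)

# to_string() renders every row, whatever the display setting is
full_text = df.to_string()

print(short_text)
```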
Analyzing DataFrames
Viewing the Data
One of the most used methods for getting a quick overview of the DataFrame
is the head() method.
The head() method returns the headers and a specified number of rows,
starting from the top.
Print the first 5 rows of the DataFrame:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
The tail() method returns the headers and a specified number of rows,
starting from the bottom. Print the last 5 rows of the DataFrame:
print(df.tail())
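Both methods accept the number of rows as an argument, and default to 5 when it is left out. A quick sketch on a made-up 20-row DataFrame (data.csv is not needed to try it):

```python
import pandas as pd

# Made-up data standing in for data.csv
df = pd.DataFrame({"value": range(20)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```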
Bad data in a data set could be:
Empty cells
Data in wrong format
Wrong data
Duplicates
The data set contains some empty cells ("Date" in row 22, and "Calories" in
rows 18 and 28).
Empty cells can potentially give you a wrong result when you analyze
data.
Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually OK, since data sets can be very big, and removing a few rows
will not have a big impact on the result.
import pandas as pd
df = pd.read_csv('data.csv')
new_df = df.dropna()
print(new_df.to_string())
import pandas as pd
df = pd.read_csv('data.csv')
df.dropna(inplace = True)
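The two dropna() variants above differ in whether the original DataFrame is changed. A small sketch that makes the difference visible, using made-up values in place of data.csv:

```python
import pandas as pd

# Made-up stand-in for data.csv; None marks an empty cell
df = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [409.1, None, 195.0, None],
})

# Without inplace, dropna() returns a new DataFrame and leaves df unchanged
new_df = df.dropna()
rows_before = len(df)

# With inplace = True, df itself is modified and nothing is returned
df.dropna(inplace=True)
rows_after = len(df)

print(rows_before, len(new_df), rows_after)
```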
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty
cells.
import pandas as pd
df = pd.read_csv('data.csv')
# Replace every empty cell in the DataFrame with the number 130.
df.fillna(130, inplace = True)
To only replace empty values for one column, specify the column name for
the DataFrame.
import pandas as pd
df = pd.read_csv('data.csv')
# Replace the null values in the Calories column with the number 130.
df["Calories"] = df["Calories"].fillna(130)
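Another way to limit the replacement to certain columns is to pass fillna() a dictionary that maps column names to replacement values. This is a sketch on made-up data, not the contents of data.csv:

```python
import pandas as pd

# Made-up values; None marks an empty cell
df = pd.DataFrame({
    "Duration": [60, 45, None],
    "Calories": [409.1, None, 195.0],
})

# Keys are column names, values are the replacements;
# columns that are not listed are left alone
filled = df.fillna({"Calories": 130})
print(filled)
```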
import pandas as pd
df = pd.read_csv(r'C:\Users\LUV\OneDrive\Desktop\data.csv')
df.fillna(130, inplace = True)
print(df.to_string())
df.to_csv('C:\\Users\\LUV\\OneDrive\\Desktop\\data.csv')
#Calculate the MEAN, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(x)
#Calculate the MEDIAN, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].median()
df["Calories"] = df["Calories"].fillna(x)
#Calculate the MODE, and replace any empty values with it:
import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df["Calories"] = df["Calories"].fillna(x)
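The three statistics can give quite different fill values for the same column. A small worked sketch on made-up numbers, chosen so that each result is easy to check by hand:

```python
import pandas as pd

# Made-up values; None marks the empty cell
calories = pd.Series([300.0, 300.0, 400.0, None, 600.0])

mean_value = calories.mean()      # (300 + 300 + 400 + 600) / 4 = 400.0
median_value = calories.median()  # middle of 300, 300, 400, 600 = 350.0
mode_value = calories.mode()[0]   # most frequent value = 300.0

# Empty values are ignored when the statistics are computed
filled = calories.fillna(median_value)
print(mean_value, median_value, mode_value)
print(filled)
```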
Cells with data in the wrong format can make it difficult, or even impossible,
to analyze data. To fix this, you have two options: remove the rows, or
convert all cells in the columns into the same format.
Let's try to convert all cells in the 'Date' column into dates.
import pandas as pd
df = pd.read_csv('data.csv')
df['Date'] = pd.to_datetime(df['Date'])
print(df.to_string())
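The two options can also be combined: convert what can be converted, then remove the rows that could not be. The sketch below assumes a made-up "Date" column with one unparseable cell and uses the errors="coerce" argument of to_datetime():

```python
import pandas as pd

# Made-up data: the last cell is not a valid date
df = pd.DataFrame({"Date": ["2020/12/01", "2020/12/02", "not a date"]})

# errors="coerce" turns cells that cannot be parsed into NaT instead of raising
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")

# NaT counts as an empty value, so dropna() can remove those rows
df = df.dropna(subset=["Date"])
print(df)
```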
Removing Duplicates
To remove duplicates, use the drop_duplicates() method.
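A minimal sketch of finding and removing duplicates, on a made-up DataFrame in which the third row repeats the first. The duplicated() method is used here only to show which rows would be dropped:

```python
import pandas as pd

# Made-up data: row 2 is an exact copy of row 0
df = pd.DataFrame({
    "Duration": [60, 45, 60, 30],
    "Pulse": [110, 117, 110, 109],
})

# duplicated() marks a row True when an identical row appeared earlier
flags = df.duplicated()
print(flags)

# drop_duplicates() returns a new DataFrame without the flagged rows
deduped = df.drop_duplicates()
print(deduped)
```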
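import pandas as pd
# create dataframe (only the 'Name' column survives in these notes;
# the column that holds the F/M values is missing)
data = {
'Name': ['John', 'Michael', 'Tom', 'Alex', 'Ryan']}
df = pd.DataFrame(data)
print(df)
# replace F with M: the replacement step is missing from these notes;
# it looped over the rows with "for i in df.index:"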
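The notes above mention replacing F with M and looping over df.index, but the column being fixed is not shown. The sketch below assumes a hypothetical 'Gender' column with made-up values, and shows two possible ways to fix wrong values: one call to replace(), and a row-by-row loop with loc.

```python
import pandas as pd

# 'Gender' and its values are assumed for illustration only
data = {
    "Name": ["John", "Michael", "Tom", "Alex", "Ryan"],
    "Gender": ["M", "F", "F", "M", "M"],
}
df = pd.DataFrame(data)

# Option 1: replace every F in the column in one call (on a copy)
replaced = df.copy()
replaced["Gender"] = replaced["Gender"].replace("F", "M")

# Option 2: loop over the row labels and fix matching cells one at a time
for i in df.index:
    if df.loc[i, "Gender"] == "F":
        df.loc[i, "Gender"] = "M"

print(df)
```

The loop is easier to extend with extra conditions per row, while replace() is shorter when a plain value swap is enough.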
The corr() method calculates the relationship between each column in your
data set.
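A minimal sketch of calling corr(), using a small made-up DataFrame rather than data.csv. The column names follow the ones used in these notes, but the values are invented so that "Calories" rises exactly in step with "Duration" while "Maxpulse" shows no clear trend:

```python
import pandas as pd

# Made-up values, not the contents of data.csv
df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [200, 300, 400, 600],   # exactly proportional to Duration
    "Maxpulse": [130, 120, 135, 125],   # no clear trend
})

# corr() returns a table with one row and one column per numeric column
corr_table = df.corr()
print(corr_table)

duration_vs_calories = corr_table.loc["Duration", "Calories"]
duration_vs_maxpulse = corr_table.loc["Duration", "Maxpulse"]
```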
Result Explained
The result of the corr() method is a table of numbers that represent how
strong the relationship is between each pair of columns.
The number varies from -1 to 1.
0.9 is a good relationship: if you increase one value, the other will
probably increase as well.
-0.9 is just as good a relationship as 0.9, but if you increase one value,
the other will probably go down.
0.2 means NOT a good relationship: one value going up does not mean that
the other will.
Perfect Correlation:
We can see that "Duration" and "Duration" got the number 1.000000, which
makes sense, each column always has a perfect relationship with itself.
Good Correlation:
"Duration" and "Calories" got a 0.922721 correlation, which is a very good
correlation, and we can predict that the longer you work out, the more
calories you burn, and the other way around: if you burned a lot of calories,
you probably had a long work out.
Bad Correlation:
"Duration" and "Maxpulse" got a 0.009403 correlation, which is a very bad
correlation, meaning that we cannot predict the max pulse by just looking at
the duration of the work out, and vice versa.
What is Matplotlib?
Matplotlib is a low-level graph plotting library in Python that serves as a
visualization utility.
Pandas - Plotting
Plotting
Pandas uses the plot() method to create diagrams.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot()
plt.show()
Scatter Plot
Specify that you want a scatter plot with the kind argument:
kind = 'scatter'
In the example below we will use "Duration" for the x-axis and "Calories" for
the y-axis.
x = 'Duration', y = 'Calories'
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')
plt.show()
In the previous example, we learned that the correlation between "Duration"
and "Calories" was 0.922721, and we concluded that a higher duration means
more calories burned.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')