1-Python Pandas Case Study
There is a close connection between DataFrames and Series in Pandas. A DataFrame
can be seen as a concatenation of Series, each Series sharing the same index, i.e. the index of
the DataFrame.
import pandas as pd

years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)
What happens if we concatenate these "shop" Series? Pandas provides a concat function
for this purpose:
shops_df = pd.concat([shop1, shop2, shop3], axis=1)
print(type(shops_df))
print(shops_df.columns.values)
Custom Index
We can see that an index (0,1,2, ...) has been automatically assigned to
the DataFrame. We can also assign a custom index to the DataFrame
object:
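The examples that follow rely on a `cities` dictionary which is not shown in this text. A minimal stand-in could look like this (the population figures are illustrative, not authoritative statistics):

```python
import pandas as pd

# illustrative stand-in for the 'cities' data used in this section
cities = {"name": ["London", "Berlin", "Madrid", "Rome",
                   "Paris", "Vienna", "Amsterdam", "Hamburg", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038,
                         2273305, 1805681, 790890, 1760433, 1305557],
          "country": ["England", "Germany", "Spain", "Italy",
                      "France", "Austria", "Netherlands", "Germany", "Italy"]}

city_frame = pd.DataFrame(cities)
print(city_frame)
```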
city_frame = pd.DataFrame(cities,
                          columns=["name", "country", "population"])
city_frame
But what if you want to change the column names and the ordering of an
existing DataFrame?
city_frame.reindex(columns=["country", "name", "population"])
Now we want to rename our columns. For this purpose, we will use the
DataFrame method 'rename'. This method supports two calling
conventions.
city_frame.rename(columns={"name": "Nume",
                           "country": "țară",
                           "population": "populație"},
                  inplace=True)
city_frame
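The call above is the first calling convention (a mapping for a specific axis keyword); the second passes a mapper together with an axis argument. A small self-contained sketch of both:

```python
import pandas as pd

df = pd.DataFrame({"A": [1], "B": [2]})
# convention 1: a mapping passed to a per-axis keyword
r1 = df.rename(columns={"A": "a", "B": "b"})
# convention 2: a mapper (dict or function) plus an axis
r2 = df.rename(str.lower, axis="columns")
print(r1.columns.tolist(), r2.columns.tolist())
```

Both calls produce the columns ['a', 'b'].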
city_frame = pd.DataFrame(cities,
                          columns=["name", "population"],
                          index=cities["country"])
city_frame
city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
city_frame = pd.DataFrame(cities,
                          columns=("name", "population"),
                          index=cities["country"])
print(city_frame.loc["Germany"])
print(city_frame.loc[["Germany", "France"]])
print(city_frame.loc[city_frame.population > 2000000])
print(city_frame.sum())
city_frame["population"].sum()
x = city_frame["population"].cumsum()
print(x)
city_frame["population"] = x
print(city_frame)
Instead of replacing the values of the population column with the
cumulative sum, we want to add the cumulative population sum as a new
column with the name "cum_population".
city_frame = pd.DataFrame(cities,
                          columns=["country", "population", "cum_population"],
                          index=cities["name"])
city_frame
city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame
city_frame = pd.DataFrame(cities,
                          columns=["country", "area", "population"],
                          index=cities["name"])
print(city_frame)
# in a dictionary-like way:
print(city_frame["population"])
# as an attribute
print(city_frame.population)
print(type(city_frame.population))
p = city_frame.population
We have not created a copy of the population column here: "p" is a view
on the data of city_frame, not an independent copy.
Assigning New Values to a Column
The column area is still not defined. We can set all elements of the
column to the same value:
city_frame["area"] = 1572
print(city_frame)
In this case, it would definitely be better to assign the exact area to each
city. The list with the area values needs to have the same length as the
number of rows in our DataFrame.
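For example (the area figures in km² below are approximate, given only for illustration, in the same row order as city_frame):

```python
import pandas as pd

# illustrative city data, as used throughout this section
cities = {"name": ["London", "Berlin", "Madrid", "Rome", "Paris",
                   "Vienna", "Amsterdam", "Hamburg", "Milan"],
          "country": ["England", "Germany", "Spain", "Italy", "France",
                      "Austria", "Netherlands", "Germany", "Italy"],
          "population": [8615246, 3562166, 3165235, 2874038, 2273305,
                         1805681, 790890, 1760433, 1305557]}
city_frame = pd.DataFrame(cities,
                          columns=["country", "area", "population"],
                          index=cities["name"])

# approximate areas in km², one value per row of the DataFrame
area = [1572, 891.85, 605.77, 1285, 105.4,
        414.6, 219.3, 755, 181.8]
city_frame["area"] = area
print(city_frame)
```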
Sorting DataFrames
city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)
Let's assume we have only the areas of London, Hamburg and Milan. The
areas are in a Series with the correct indices. We can assign this Series as
well:
city_frame = pd.DataFrame(cities,
                          columns=["country", "area", "population"],
                          index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8],
                       index=['London', 'Hamburg', 'Milan'])
city_frame['area'] = some_areas
print(city_frame)
city_frame = pd.DataFrame(cities,
                          columns=["country", "population"],
                          index=cities["name"])
city_frame
idx = 1
# 'area' is the list of area values in the row order of city_frame
city_frame.insert(loc=idx, column='area', value=area)
city_frame
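The `growth` data used below is not defined in this text. An illustrative nested dictionary that fits the examples (the growth figures are made up for demonstration) could be:

```python
import pandas as pd

# made-up GDP growth figures, for demonstration only
growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
          "Germany": {"2010": 4.1, "2011": 3.6, "2012": 0.4, "2013": 0.1},
          "Greece": {"2010": -5.4, "2011": -8.9, "2012": -6.6, "2013": -3.3},
          "Italy": {"2010": 1.7, "2011": 0.6, "2012": -2.3, "2013": -1.9}}

# outer keys become columns, inner keys become the index
growth_frame = pd.DataFrame(growth)
print(growth_frame)
```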
growth_frame = pd.DataFrame(growth)
growth_frame
You like to have the years in the columns and the countries in the rows?
No problem, you can transpose the data:
growth_frame.T
growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland",
                                      "Italy",
                                      "Germany",
                                      "Greece"])
print(growth_frame2)
import numpy as np

names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March", "April",
         "May", "June", "July", "August",
         "September", "October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5) * 1000).round(2),
                  columns=names,
                  index=index)
df
DataFrame.from_csv
read_csv
There is no big difference between those two functions, though they have
different default values in some cases and read_csv has more parameters.
We will focus on read_csv, because DataFrame.from_csv was only kept inside
Pandas for reasons of backwards compatibility and is deprecated in newer
Pandas versions.
import pandas as pd

exchange_rates = pd.read_csv("/home/madhu/Desktop/datasets/dollar_euro.txt",
                             sep="\t")
print(exchange_rates)
As we can see, read_csv automatically used the first line as the names for
the columns. It is possible to give other names to the columns. For this
purpose, we have to skip the first line by setting the parameter "header"
to 0 and we have to assign a list with the column names to the parameter
"names":
import pandas as pd

exchange_rates = pd.read_csv("/home/madhu/Desktop/datasets/dollar_euro.txt",
                             sep="\t",
                             header=0,
                             names=["year", "min", "max", "days"])
print(exchange_rates)
Exercise
The file "countries_population.csv" is a csv file, containing the population
numbers of all countries (July 2014). The delimiter of the file is a space
and commas are used to separate groups of thousands in the numbers.
The method 'head(n)' of a DataFrame can be used to give out only the
first n rows or lines. Read the file into a DataFrame.
pop = pd.read_csv("/home/madhu/Desktop/datasets/countries_population.csv",
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop.head(5))
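The same parameters can be tried without the data file by feeding a small sample through StringIO (the two lines below only illustrate the file's format; the figures are stand-ins):

```python
import pandas as pd
from io import StringIO

# two sample lines in the same format as countries_population.csv
data = "'United States' 318,892,103\n'Brazil' 202,656,788\n"
pop = pd.read_csv(StringIO(data),
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop)
```

thousands="," strips the group separators, so the Population column comes out as integers.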
Writing csv Files
We can create csv (or dsv) files with the method "to_csv". Before we do
this, we will prepare some data to output, which we will write to a file. We
have two csv files with population data for various countries:
countries_male_population.csv contains the figures of the male
populations and countries_female_population.csv correspondingly the
numbers for the female populations. We will create a new csv file with the
sum:
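Adding two DataFrames aligns them on index and columns, which is what makes the summation straightforward. A minimal self-contained sketch of the idea (with made-up numbers; the real files would first be read with read_csv):

```python
import pandas as pd

# toy male/female population tables with matching index and columns
male_pop = pd.DataFrame({"2002": [100, 200], "2003": [110, 210]},
                        index=["Austria", "France"])
female_pop = pd.DataFrame({"2002": [105, 205], "2003": [115, 215]},
                          index=["Austria", "France"])

# element-wise sum, aligned on row and column labels
population = male_pop + female_pop
print(population)
```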
population.to_csv("/home/madhu/Desktop/datasets/countries_total_population.csv")
We want to create a new DataFrame with all the information, i.e. female,
male and complete population. This means that we have to introduce a
hierarchical index. Before we do it on our DataFrame, we will introduce
this problem in a simple example:
import pandas as pd
shop1 = {"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}}
shop2 = {"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}}
shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)
both_shops = shop1 + shop2
print("Sales of shop1:\n", shop1)
print("\nSales of both shops\n", both_shops)
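The `shops` object used below carries a hierarchical index. One way to build it (a sketch) is to concatenate the two shop DataFrames with the keys parameter:

```python
import pandas as pd

shop1 = pd.DataFrame({"foo": {2010: 23, 2011: 25},
                      "bar": {2010: 13, 2011: 29}})
shop2 = pd.DataFrame({"foo": {2010: 223, 2011: 225},
                      "bar": {2010: 213, 2011: 229}})

# 'keys' adds an outer index level, producing a MultiIndex
shops = pd.concat([shop1, shop2], keys=["one", "two"])
print(shops)
```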
shops = shops.swaplevel()
shops.sort_index(inplace=True)
shops
We will go back to our initial problem with the population
figures. We will apply the same steps to those
DataFrames:
pop_complete = pd.concat([population.T, male_pop.T, female_pop.T],
                         keys=["total", "male", "female"])
df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]
df.to_csv("/home/madhu/Desktop/datasets/countries_total_population.csv")
Exercise
Read in the dsv file (csv) bundeslaender.txt. Create a new
file with the columns 'land', 'area', 'female', 'male',
'population' and 'density' (inhabitants per square
kilometre).
Print out the rows where the area is greater than
30000 and the population is greater than 10000.
Print the rows where the density is greater than 300.
lands = pd.read_csv('/home/madhu/Desktop/datasets/bundeslaender.txt',
                    sep=" ")
print(lands.columns.values)
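A sketch of the rest of the solution, using a made-up stand-in for bundeslaender.txt (the real file is assumed to have the columns land, area, female and male; all figures below are invented, with populations in thousands):

```python
import pandas as pd

# invented stand-in data for bundeslaender.txt (populations in thousands)
lands = pd.DataFrame({"land": ["Bayern", "Berlin", "Hamburg"],
                      "area": [70550, 892, 755],
                      "female": [6478, 1837, 926],
                      "male": [6225, 1684, 894]})

lands["population"] = lands["female"] + lands["male"]
# population is in thousands, so scale up for inhabitants per km²
lands["density"] = lands["population"] * 1000 / lands["area"]

print(lands[(lands.area > 30000) & (lands.population > 10000)])
print(lands[lands.density > 300])
```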
'nan' in Python
Python knows NaN values as well. We can create it with "float":
n1 = float("nan")
n2 = float("Nan")
n3 = float("NaN")
n4 = float("NAN")
print(n1, n2, n3, n4)
import math
n1 = math.nan
print(n1)
print(math.isnan(n1))
nan
True
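One property worth remembering: NaN compares unequal even to itself, which is why math.isnan (or Pandas' isna) is needed instead of an == comparison:

```python
import math

n = float("nan")
print(n == n)         # False: NaN is never equal to anything, itself included
print(math.isnan(n))  # True
```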
NaN in Pandas
Example without NaNs
Before we work with NaN data, we will process a file without any NaN
values. The data file temperatures.csv contains the temperature data of
six sensors, taken every 15 minutes between 6:00 and 19:15.
Reading in the data file can be done with the read_csv function:
import pandas as pd
df = pd.read_csv("/home/madhu/Desktop/datasets/temperatures.csv",
                 sep=";",
                 decimal=",")
df.loc[:3]
df.mean()
average_temp_series = df.mean(axis=1)
print(average_temp_series[:8])
sensors = df.columns.values[1:]
# all columns except the time column will be removed:
df = df.drop(sensors, axis=1)
print(df[:5])
df[:3]
We will use and change the data from the temperatures.csv file:
temp_df = pd.read_csv("/home/madhu/Desktop/datasets/temperatures.csv",
                      sep=";",
                      index_col=0,
                      decimal=",")
We will randomly assign some NaN values to the data frame. For this
purpose, we will use the where method of DataFrame. If we apply
where to a DataFrame object df, i.e. df.where(cond, other_df), it
returns an object of the same shape as df whose entries are taken
from df where the corresponding element of cond is True, and from
other_df otherwise.
Before we continue with our task, we will demonstrate the way of working
of where with some simple examples:
import numpy as np
A = np.random.randint(1, 30, (4, 2))
df = pd.DataFrame(A, columns=['Foo', 'Bar'])
m = df % 2 == 0
df.where(m, -df, inplace=True)
df
random_df = pd.DataFrame(np.random.random(size=(54, 6)),
                         columns=temp_df.columns.values,
                         index=temp_df.index)
nan_df = pd.DataFrame(np.nan,
                      columns=temp_df.columns.values,
                      index=temp_df.index)
df_bool = random_df < 0.8
df_bool[:5]
# keep the original value where df_bool is True, NaN otherwise
disturbed_data = temp_df.where(df_bool, nan_df)
df = disturbed_data.dropna()
df
'dropna' can also be used to drop all columns in which some values are
NaN. This can be achieved by assigning 1 to the axis parameter; the
default value is 0 (drop rows), as we have seen in our previous example.
As every sensor column contains NaN values, they will all disappear:
df = disturbed_data.dropna(axis=1)
df[:5]
Let us change our task: we only want to get rid of all the rows which
contain more than one NaN value. The parameter 'thresh' is ideal for this
task: it is set to an integer value defining the minimum number of
non-NaN values a row must have in order to be kept. We have six
temperature values in every row, so setting 'thresh' to 5 makes sure that
we will have at least 5 valid floats in every remaining row:
cleansed_df = disturbed_data.dropna(thresh=5, axis=0)
cleansed_df[:7]
Now we will calculate the mean values again, but this time on the
DataFrame 'cleansed_df', i.e. where we have taken out all the rows,
where more than one NaN value occurred.
average_temp_series = cleansed_df.mean(axis=1)
sensors = cleansed_df.columns.values
df = cleansed_df.drop(sensors, axis=1)
# best practice:
df = df.assign(temperature=average_temp_series)
# inplace option not available
df[:6]