
Pandas - Series and DataFrame

There is a close connection between Pandas DataFrames and Series. A DataFrame
can be seen as a concatenation of Series, with each Series sharing the same index, i.e. the
index of the DataFrame.

We will demonstrate this in the following example.

We define the following three Series:

import pandas as pd

years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)

What happens if we concatenate these "shop" Series? Pandas provides a concat function
for this purpose:

pd.concat([shop1, shop2, shop3])

The result is a single long Series with a repeated year index. If we concatenate
along axis=1 instead, each Series becomes a column of a DataFrame:

shops_df = pd.concat([shop1, shop2, shop3], axis=1)
shops_df

Let's do some fine-tuning by giving names to the columns:

cities = ["Zürich", "Winterthur", "Freiburg"]
shops_df.columns = cities
print(shops_df)

# alternative way: give names to the Series:
shop1.name = "Zürich"
shop2.name = "Winterthur"
shop3.name = "Freiburg"
print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)

print(type(shops_df))

This means we can arrange or concatenate Series into DataFrames!

DataFrames from Dictionaries


A DataFrame has a row and column index; it's like a dict of Series with a
common index.

cities = {"name": ["London", "Berlin", "Madrid", "Rome",
                   "Paris", "Vienna", "Bucharest", "Hamburg",
                   "Budapest", "Warsaw", "Barcelona",
                   "Munich", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038,
                         2273305, 1805681, 1803425, 1760433,
                         1754000, 1740119, 1602386, 1493900,
                         1350680],
          "country": ["England", "Germany", "Spain", "Italy",
                      "France", "Austria", "Romania",
                      "Germany", "Hungary", "Poland", "Spain",
                      "Germany", "Italy"]}
city_frame = pd.DataFrame(cities)
city_frame

Retrieving the Column Names


It's possible to get the names of the columns as an array:

city_frame.columns.values
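
Analogously, the row labels are available through the index attribute; a quick sketch:

city_frame.index.values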

Custom Index
We can see that an index (0,1,2, ...) has been automatically assigned to
the DataFrame. We can also assign a custom index to the DataFrame
object:

ordinals = ["first", "second", "third", "fourth",
            "fifth", "sixth", "seventh", "eighth",
            "ninth", "tenth", "eleventh", "twelfth",
            "thirteenth"]
city_frame = pd.DataFrame(cities, index=ordinals)
city_frame

Rearranging the Order of Columns


We can also define and rearrange the order of the columns at the time of
creation of the DataFrame. This also makes sure that we get a defined
column ordering when we create the DataFrame from a dictionary. (Plain
dictionaries were unordered in Python versions before 3.7; passing the
columns explicitly is still the clearest way to fix the order.)

city_frame = pd.DataFrame(cities,
                          columns=["name", "country", "population"])
city_frame

But what if you want to change the column names and the ordering of an
existing DataFrame?

# reindex returns a new DataFrame; the columns parameter reorders the columns:
city_frame = city_frame.reindex(columns=["country", "name", "population"])
city_frame

Now we want to rename our columns. For this purpose, we will use the
DataFrame method 'rename'. This method supports two calling conventions:

(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)

We will rename the columns of our DataFrame into Romanian names in
the following example. We set the parameter inplace to True so that our
DataFrame is changed in place instead of a new DataFrame being
returned (inplace=False is the default!).

city_frame.rename(columns={"name": "Nume",
                           "country": "țară",
                           "population": "populație"},
                  inplace=True)
city_frame
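
The second calling convention works analogously: a single mapper plus the
axis parameter. A minimal sketch that renames the columns back to their
English names:

city_frame.rename({"Nume": "name", "țară": "country", "populație": "population"},
                  axis="columns", inplace=True)
city_frame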

Existing Column as the Index of a DataFrame


We want to create a more useful index in the following example. We will
use the country name as the index, i.e. the list associated with the key
"country" of our cities dictionary:

city_frame = pd.DataFrame(cities,
                          columns=["name", "population"],
                          index=cities["country"])
city_frame

Alternatively, we can change an existing DataFrame. We can use the
method set_index to turn a column into an index. "set_index" does not
work in-place; it returns a new DataFrame with the chosen column as the
index:

city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)

We saw in the previous example that the set_index method returns a
new DataFrame object and doesn't change the original DataFrame. If we
set the optional parameter "inplace" to True, the DataFrame will be
changed in place, i.e. no new object will be created:

city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
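
The inverse operation is reset_index, which turns the index back into an
ordinary column; a quick sketch:

city_frame.reset_index(inplace=True)  # 'country' becomes a regular column again
print(city_frame.head())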

Label-Indexing on the Rows


So far we have indexed DataFrames via the columns. We will now
demonstrate how we can access rows from DataFrames via the
locators 'loc' and 'iloc'. (The older 'ix' locator is deprecated and has
been removed in current Pandas versions.)

city_frame = pd.DataFrame(cities,
                          columns=("name", "population"),
                          index=cities["country"])
print(city_frame.loc["Germany"])

print(city_frame.loc[["Germany", "France"]])
print(city_frame.loc[city_frame.population > 2000000])
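
Since only 'loc' is demonstrated above, here is a small sketch of positional
row access with 'iloc':

print(city_frame.iloc[0])        # first row, regardless of its label
print(city_frame.iloc[[0, 2]])   # first and third row as a DataFrame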

Sum and Cumulative Sum


We can calculate the sum of all the columns of a DataFrame or the sum of
certain columns:

print(city_frame.sum())

city_frame["population"].sum()

x = city_frame["population"].cumsum()
print(x)

Assigning New Values to Columns

x is a Pandas Series. We can reassign the previously calculated cumulative
sums to the population column:

city_frame["population"] = x
print(city_frame)

Instead of replacing the values of the population column with the
cumulative sum, we want to add the cumulative population sum as a new
column with the name "cum_population".

city_frame = pd.DataFrame(cities,
                          columns=["country", "population", "cum_population"],
                          index=cities["name"])
city_frame

We can see that the column "cum_population" is set to NaN, as we
haven't provided any data for it. We will now assign the cumulative
sums to this column:

city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame

We can also include a column name which is not contained in the
dictionary when we create the DataFrame from the dictionary. In this
case, all the values of this column will be set to NaN:

city_frame = pd.DataFrame(cities,
                          columns=["country", "area", "population"],
                          index=cities["name"])
print(city_frame)

Accessing the Columns of a DataFrame


There are two ways to access a column of a DataFrame. The result is in
both cases a Series:

# in a dictionary-like way:
print(city_frame["population"])

# as an attribute
print(city_frame.population)

print(type(city_frame.population))

p = city_frame.population
p

From the previous example, we can see that we have not copied the
population column: "p" is a view on the data of city_frame.

Assigning New Values to a Column

The column area is still not defined. We can set all elements of the
column to the same value:

city_frame["area"] = 1572
print(city_frame)

In this case, it is definitely better to assign the exact area to each
city. The list with the area values needs to have the same length as the
number of rows in our DataFrame.

# area in square km:
area = [1572, 891.85, 605.77, 1285,
        105.4, 414.6, 228, 755,
        525.2, 517, 101.9, 310.4,
        181.8]
# area could also have been passed as a Series, an array or a scalar
city_frame["area"] = area
print(city_frame)

Sorting DataFrames

Let's sort our DataFrame according to the city area:

city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)
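
sort_values also accepts a list of column names, which gives a multi-key
sort; a small sketch:

# sort by country first, then by area (descending) within each country:
print(city_frame.sort_values(by=["country", "area"],
                             ascending=[True, False]))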

Let's assume we have only the areas of London, Hamburg and Milan. The
areas are in a Series with the correct indices. We can assign this Series as
well:

city_frame = pd.DataFrame(cities,
                          columns=["country", "area", "population"],
                          index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8],
                       index=['London', 'Hamburg', 'Milan'])
city_frame['area'] = some_areas
print(city_frame)

Inserting New Columns into Existing DataFrames


In the previous example we have added the column area at creation time.
Quite often it will be necessary to add or insert columns into existing
DataFrames.

city_frame = pd.DataFrame(cities,
                          columns=["country", "population"],
                          index=cities["name"])
city_frame

idx = 1
city_frame.insert(loc=idx, column='area', value=area)
city_frame

DataFrame from Nested Dictionaries


A nested dictionary of dicts can be passed to a DataFrame as well. The
keys of the outer dictionary are taken as the columns, and the inner
keys, i.e. the keys of the nested dictionaries, are used as the row indices:

growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
          "Germany": {"2010": 4.1, "2011": 3.6, "2012": 0.4, "2013": 0.1},
          "France": {"2010": 2.0, "2011": 2.1, "2012": 0.3, "2013": 0.3},
          "Greece": {"2010": -5.4, "2011": -8.9, "2012": -6.6, "2013": -3.3},
          "Italy": {"2010": 1.7, "2011": 0.6, "2012": -2.3, "2013": -1.9}}

growth_frame = pd.DataFrame(growth)
growth_frame

Would you like to have the years in the columns and the countries in the rows?
No problem, you can transpose the data:

growth_frame.T

growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland", "Italy",
                                      "Germany", "Greece"])
print(growth_frame2)

Filling a DataFrame with random values:

import numpy as np

names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
         "April", "May", "June",
         "July", "August", "September",
         "October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5) * 1000).round(2),
                  columns=names,
                  index=index)
df
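
If reproducible "random" values are needed, NumPy's global generator can be
seeded before creating the DataFrame; a quick sketch:

np.random.seed(42)  # fixes the pseudo-random sequence for repeatable runs
df = pd.DataFrame((np.random.randn(12, 5) * 1000).round(2),
                  columns=names, index=index)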

Reading CSV and DSV Files

Pandas offers two ways to read in CSV files, or DSV files to be precise:

DataFrame.from_csv
read_csv

There is no big difference between those two functions, but they have
different default values in some cases and read_csv has more parameters.
We will focus on read_csv, because DataFrame.from_csv was kept inside
Pandas only for reasons of backwards compatibility (it is deprecated and
has been removed in recent Pandas versions).

import pandas as pd

exchange_rates = pd.read_csv("/home/madhu/Desktop/datasets/dollar_euro.txt",
                             sep="\t")
print(exchange_rates)

As we can see, read_csv automatically used the first line as the names for
the columns. It is possible to give other names to the columns. For this
purpose, we have to skip the first line by setting the parameter "header"
to 0 and assign a list with the column names to the parameter "names":

import pandas as pd

exchange_rates = pd.read_csv("/home/madhu/Desktop/datasets/dollar_euro.txt",
                             sep="\t",
                             header=0,
                             names=["year", "min", "max", "days"])
print(exchange_rates)

Exercise
The file "countries_population.csv" is a csv file containing the population
numbers of all countries (July 2014). The delimiter of the file is a space,
and commas are used to separate groups of thousands in the numbers.
The method 'head(n)' of a DataFrame can be used to output only the
first n rows or lines. Read the file into a DataFrame:

pop = pd.read_csv("/home/madhu/Desktop/datasets/countries_population.csv",
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop.head(5))

",
header=None,
names=["Country", "Population"],
index_col=0,
quotechar="'",
sep=" ",
thousands=",")
print(pop.head(5))
Writing CSV Files
We can create csv (or dsv) files with the method "to_csv". Before we do
this, we will prepare some data to output, which we will write to a file. We
have two csv files with population data for various
countries: countries_male_population.csv contains the figures of the male
populations and countries_female_population.csv correspondingly the
numbers for the female populations. We will create a new csv file with the
sum:

column_names = ["Country"] + list(range(2002, 2013))
male_pop = pd.read_csv("/home/madhu/Desktop/datasets/countries_male_population.csv",
                       header=None,
                       index_col=0,
                       names=column_names)
female_pop = pd.read_csv("/home/madhu/Desktop/datasets/countries_female_population.csv",
                         header=None,
                         index_col=0,
                         names=column_names)
population = male_pop + female_pop
population

population.to_csv("/home/madhu/Desktop/datasets/countries_total_population.csv")

We want to create a new DataFrame with all the information, i.e. female,
male and complete population. This means that we have to introduce a
hierarchical index. Before we do it on our DataFrame, we will introduce
this problem in a simple example:

import pandas as pd

shop1 = {"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}}
shop2 = {"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}}
shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)
both_shops = shop1 + shop2
print("Sales of shop1:\n", shop1)
print("\nSales of both shops\n", both_shops)

shops = pd.concat([shop1, shop2], keys=["one", "two"])
shops
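
With the hierarchical index in place, rows can be addressed by the outer key
alone or by an (outer, inner) tuple; a quick sketch:

print(shops.loc["one"])          # all rows belonging to the first shop
print(shops.loc[("one", 2010)])  # a single row, returned as a Series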

We want to swap the hierarchical indices. For this we will use 'swaplevel'.
Note that swaplevel returns a new DataFrame, so we have to assign the
result back:

shops = shops.swaplevel()
shops.sort_index(inplace=True)
shops
We will go back to our initial problem with the population
figures. We will apply the same steps to those
DataFrames:

pop_complete = pd.concat([population.T, male_pop.T, female_pop.T],
                         keys=["total", "male", "female"])
df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]

df.to_csv("/home/madhu/Desktop/datasets/countries_total_population.csv")

Exercise
Read in the dsv file (csv) bundeslaender.txt. Create a new
file with the columns 'land', 'area', 'female', 'male',
'population' and 'density' (inhabitants per square
kilometre).
Print out the rows where the area is greater than
30000 and the population is greater than 10000.
Print the rows where the density is greater than 300.

lands = pd.read_csv('/home/madhu/Desktop/datasets/bundeslaender.txt', sep=" ")
print(lands.columns.values)
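
A minimal sketch of the remaining steps, assuming the file provides the
columns 'land', 'area', 'female' and 'male' (the exact names depend on the
file's header):

# derive the missing columns:
lands["population"] = lands["female"] + lands["male"]
lands["density"] = lands["population"] / lands["area"]

# rows with area > 30000 and population > 10000:
print(lands[(lands["area"] > 30000) & (lands["population"] > 10000)])

# rows with density greater than 300:
print(lands[lands["density"] > 300])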
'nan' in Python
Python knows NaN values as well. We can create them with "float":

n1 = float("nan")
n2 = float("Nan")
n3 = float("NaN")
n4 = float("NAN")
print(n1, n2, n3, n4)
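
A useful property to keep in mind: NaN compares unequal to everything,
including itself, which is exactly why helper functions like math.isnan
exist. A quick check:

print(n1 == n1)   # False: NaN is not even equal to itself
print(n1 != n1)   # True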

"nan" is also part of the math module since Python 3.5:

import math
n1 = math.nan
print(n1)
print(math.isnan(n1))

nan
True

NaN in Pandas
Example without NaNs
Before we work with NaN data, we will process a file without any NaN
values. The data file temperatures.csv contains the temperature data of
six sensors, taken every 15 minutes between 6:00 and 19:15.

Reading in the data file can be done with the read_csv function:

import pandas as pd

df = pd.read_csv("/home/madhu/Desktop/datasets/temperatures.csv",
                 sep=";",
                 decimal=",")
df.loc[:3]

We want to calculate the average temperatures per measuring point over
all the sensors. We can use the DataFrame method 'mean'. If we use
'mean' without parameters, it averages each column over all the rows,
which isn't what we want, but it may be interesting as well:

df.mean()

average_temp_series = df.mean(axis=1)
print(average_temp_series[:8])

sensors = df.columns.values[1:]
# all columns except the time column will be removed:
df = df.drop(sensors, axis=1)
print(df[:5])

We will now assign the average temperature values as a new column
'temperature':

# best practice:
df = df.assign(temperature=average_temp_series)
# (an inplace option is not available for assign)
# alternatively:
# df.loc[:, "temperature"] = average_temp_series

df[:3]

Example with NaNs

We will now use a data file similar to the previous temperature csv, but
this time we will have to cope with NaN data, from times when the sensors
malfunctioned.

We will create a temperature DataFrame in which some data is not
defined, i.e. NaN.

We will use and change the data from the temperatures.csv file:

temp_df = pd.read_csv("/home/madhu/Desktop/datasets/temperatures.csv",
                      sep=";",
                      index_col=0,
                      decimal=",")

We will randomly assign some NaN values to the data frame. For this
purpose, we will use the where method of DataFrame. If we apply
where to a DataFrame object df, i.e. df.where(cond, other_df), it will
return an object of the same shape as df, whose entries are taken
from df where the corresponding element of cond is True and from
other_df otherwise.

Before we continue with our task, we will demonstrate how where works
with some simple examples:

s = pd.Series(range(5))
s.where(s > 0)

import numpy as np

A = np.random.randint(1, 30, (4, 2))
df = pd.DataFrame(A, columns=['Foo', 'Bar'])
m = df % 2 == 0
# keep the even values, negate the odd ones:
df.where(m, -df, inplace=True)
df

For our task, we need to create a DataFrame 'nan_df', which consists
purely of NaN values and has the same shape as our temperature
DataFrame 'temp_df'. We will use this DataFrame in 'where'. We also
need a DataFrame 'df_bool' with the conditions as boolean values. For
this purpose we will create a DataFrame with random values between 0
and 1; by applying 'random_df < 0.8' we get the df_bool DataFrame, in
which about 80 % of the values are True, so about 20 % of the
measurements will be replaced by NaN:

random_df = pd.DataFrame(np.random.random(size=(54, 6)),
                         columns=temp_df.columns.values,
                         index=temp_df.index)
nan_df = pd.DataFrame(np.nan,
                      columns=temp_df.columns.values,
                      index=temp_df.index)
df_bool = random_df < 0.8
df_bool[:5]

Finally, we have everything together to create our DataFrame with
disturbed measurements:

disturbed_data = temp_df.where(df_bool, nan_df)
disturbed_data.to_csv("data1/temperatures_with_NaN.csv")
disturbed_data[:10]

Using dropna on the DataFrame

'dropna' is a DataFrame method. If we call this method without
arguments, it will return an object in which every row with missing
data, i.e. at least one NaN value, is omitted:

df = disturbed_data.dropna()
df
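
dropna also accepts a 'how' parameter; with how='all' a row is only dropped
if all of its values are NaN. A small sketch:

# keep every row in which at least one sensor delivered a value:
df_all = disturbed_data.dropna(how="all")
df_all[:5]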

'dropna' can also be used to drop all columns in which some values are
NaN. This can be achieved by assigning 1 to the axis parameter; the
default value is 0 (rows), as we have seen in our previous example. As
every sensor column contains NaN values, they will all disappear:

df = disturbed_data.dropna(axis=1)
df[:5]

Let us change our task: we only want to get rid of the rows which
contain more than one NaN value. The parameter 'thresh' is ideal for this
task. It is set to an integer value which defines the minimum number of
non-NaN values a row needs in order to be kept. We have six temperature
values in every row, so setting 'thresh' to 5 makes sure that we will have
at least 5 valid floats in every remaining row:

cleansed_df = disturbed_data.dropna(thresh=5, axis=0)
cleansed_df[:7]

Now we will calculate the mean values again, but this time on the
DataFrame 'cleansed_df', i.e. after taking out all the rows where more
than one NaN value occurred.

average_temp_series = cleansed_df.mean(axis=1)
sensors = cleansed_df.columns.values
df = cleansed_df.drop(sensors, axis=1)
# best practice:
df = df.assign(temperature=average_temp_series)
# (an inplace option is not available for assign)
df[:6]
