Python Pandas II Notes XII
Descriptive Statistics:
• Descriptive statistics are used to describe and summarize large data sets in ways that are meaningful
and useful. They give us a general idea of the trends in our
data, including:
• The mean, mode, median and range.
• Variance, standard deviation and quartiles.
• Sum, count, maximum and minimum.
• Descriptive statistics are useful because they allow us to take decisions. For example, suppose we
have data on the incomes of one million people. No one is going to want to read a million
pieces of data, and even if they did, they would not be able to extract any useful information from it. On the
other hand, if we summarize it, it becomes useful: an average wage or a median income is
much easier to understand than reams of raw data.
• Some Essential functions:
➢ min() and max(): The min() and max() functions return the minimum and maximum
value respectively from a given set of data.
Syntax:
<df>.min(axis=None, skipna=None, numeric_only=None)
<df>.max(axis=None, skipna=None, numeric_only=None)
Parameters:
axis : {index (0), columns (1)}
skipna : Exclude NA/null values when computing the result
numeric_only : Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data. Not implemented for
Series.
Example of max():
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[12, 4, 5, 44, 1], "B":[5, np.nan, 54, 3, 2],
"C":[20, 16, 7, 3, 8], "D":['ABC', 3, 17, 2, 6]})
print(df)
# column "D" mixes strings and numbers; recent pandas versions raise a
# TypeError when comparing them, so restrict the call to numeric columns
print(df.max(axis = 0, numeric_only = True))
Output:
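A companion sketch for min(), using the same kind of data as the max() example above (the values here are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                   "B": [5, np.nan, 54, 3, 2]})
print(df.min())        # column-wise minima: A -> 1, B -> 2.0 (NaN skipped)
print(df.min(axis=1))  # row-wise minima
```

Note that NaN in column B is skipped by default (skipna=True), so the minimum of B is 2.0, not NaN.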
➢ count(): This function counts the non-NA entries for each row or column. The values
None, NaN, NaT etc. are considered NA in pandas.
Syntax:
<df>.count(axis=0, numeric_only=False)
e.g.
import pandas as pd
import numpy as np
df = pd.DataFrame({"A":[-5, 8, 12, None, 5, 3],
"B":[-1, None, 6, 4, None, 3],
"C":["sam", "haris", "alex", np.nan, "peter", "nathan"]})
print(df)
print(df.count())   # non-NA entries per column: A -> 5, B -> 4, C -> 5
Output:
➢ sum(): This function returns the sum of the values for the requested axis.
Syntax:
<df>.sum(axis=None, skipna=None, numeric_only=None, min_count=0)
Parameters:
axis : {index (0), columns (1)}
skipna : Exclude NA/null values when computing the result.
numeric_only : Include only float, int, boolean columns. If None, will attempt to use
everything, then use only numeric data. Not implemented for Series.
min_count : The required number of valid values to perform the operation. If fewer than
min_count non-NA values are present the result will be NA.
e.g.
i) Example of sum():
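A minimal sketch of sum(), with made-up values, showing the skipna behaviour and the min_count parameter described above:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [12, 4, 5, 44, 1],
                   "B": [5, np.nan, 54, 3, 2]})
print(df.sum())             # column-wise: A -> 66, B -> 64.0 (NaN skipped)
print(df.sum(axis=1))       # row-wise totals
print(df.sum(min_count=5))  # B has only 4 non-NA values, so its sum becomes NaN
```

With min_count=5, column A (five valid values) still sums to 66, while column B falls short of the required count and returns NaN.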
➢ quantile():
• The word "quantile" comes from the word quantity. A quantile is a cut point
that divides a sample into equal-sized subgroups (which is why it is sometimes
called a "fractile"). Equivalently, quantiles divide a probability
distribution into areas of equal probability.
• The median is a kind of quantile; it sits at the centre of a probability
distribution, so exactly half of the data is lower than the median
and half of the data is above the median. Because the median cuts a distribution into two
equal parts, it is sometimes called the 2-quantile.
• Quartiles are the quantiles that divide the distribution into four equal parts,
deciles are the quantiles that divide a distribution into 10 equal parts, and
percentiles are those that divide a distribution into 100 equal parts.
• Common quantiles: certain quantiles are used commonly enough to
have specific names. Below is a list of these:
• The 2-quantile is called the median
• 3-quantiles are called terciles
• 4-quantiles are called quartiles
• 5-quantiles are called quintiles
• 6-quantiles are called sextiles
• 7-quantiles are called septiles
• 8-quantiles are called octiles
• 10-quantiles are called deciles
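The concepts above map onto the quantile() method; a small sketch with made-up scores:

```python
import pandas as pd

df = pd.DataFrame({"Score": [50, 50, 60, 80, 90]})
print(df.quantile(0.5))                         # the median (2-quantile) -> 60.0
print(df["Score"].quantile([0.25, 0.5, 0.75]))  # the three quartile cut points
```

Passing a list of fractions returns one cut point per fraction; with the default linear interpolation, the quartiles here come out as 50.0, 60.0 and 80.0.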
➢ Standard Deviation (std()):
Standard deviation measures the amount of variation or dispersion of a set of values.
A low standard deviation means the values tend to be close to the mean of the set, and a
high standard deviation means the values are spread out over a wider range.
Standard deviation is one of the most important concepts in finance:
finance and banking are largely about measuring risk, and standard deviation
measures risk.
*Note: std() is equivalent to first computing the variance with var() and then taking
the square root of the value var() returns.
e.g.
import pandas as pd
import numpy as np
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}
df = pd.DataFrame(d)
print("Dataframe contents")
print (df)
print(df.std())
Output:
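The relationship noted above (std() equals the square root of var()) can be checked directly on the Score column from the example:

```python
import pandas as pd
import numpy as np

score = pd.Series([80, 60, 90, 50, 50])
print(score.var())  # sample variance (ddof=1) -> 330.0
print(score.std())  # standard deviation
# std() is indeed the square root of var()
print(np.isclose(score.std(), np.sqrt(score.var())))
```

Both methods use the sample (ddof=1) convention by default, which is why they agree.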
DataFrame Sorting
➢ Sorting means arranging the contents in ascending or descending order. There are two
kinds of sorting available for a pandas DataFrame: by value and by index.
• By value (column): sorting a dataframe on the elements of one or more
columns is supported by the sort_values() method. We will cover three aspects of
sorting values of a dataframe:
• Sort a pandas dataframe in ascending or descending order
• Sort a pandas dataframe by a single column
• Sort a pandas dataframe by multiple columns.
Syntax:
<df>.sort_values(by, axis=0, ascending=True, inplace=False, na_position='last')
Parameter:
na_position: Takes two string input ‘last’ or ‘first’ to set position of Null
values. Default is ‘last’.
i) Sort the pandas Dataframe by a single column – Ascending order (continuing with the Name/Age/Score DataFrame defined above):
df=df.sort_values(by='Score')
print("Dataframe contents after sorting")
print (df)
Output:
ii) Sort the python pandas Dataframe by single column – Descending order:
import pandas as pd
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
'Age':pd.Series([30,25,25,30,20]), 'Score':pd.Series([80,60,90,50,50])}
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print (df)
Output:
df=df.sort_values(by='Score',ascending=False)
print("Dataframe contents after sorting")
print (df)
Output:
iii) Sort the pandas Dataframe by multiple columns:
df.sort_values(by=['Age','Name'],ascending=[True,False],inplace=True)
print("Dataframe contents after sorting")
print(df)
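The na_position parameter described above can be demonstrated with a frame that contains a missing score (the values here are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Sachin", "Dhoni", "Virat", "Rohit"],
                   "Score": [80, np.nan, 90, 50]})
print(df.sort_values(by="Score"))                       # NaN row goes last (default)
print(df.sort_values(by="Score", na_position="first"))  # NaN row goes first
```

Missing values are never compared against the real values; na_position only decides which end of the result they are placed at.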
10 | P a g e By Sumit Sir…
• By index: sorting on the row index is supported by the sort_index() method.
df = pd.DataFrame(d)
df = df.reindex([1,4,3,2,0])  # shuffle the rows so the index is out of order
print("Dataframe contents without sorting")
print (df)
Output:
df=df.sort_index()
print("Dataframe contents after sorting")
print(df)
DataFrame Reindexing
Example #1: Use the reindex() function to reindex the dataframe. By default, values in the new index
that have no corresponding records in the dataframe are assigned NaN.
import pandas as pd
df = pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4],
"C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]},
index =["first", "second", "third", "fourth", "fifth"])
print(df)
# "sixth" has no record in df, so its row is filled with NaN
print(df.reindex(["first", "third", "sixth"]))
Output:
Example #2: Use reindex() function to reindex the column axis and specify fill values.
import pandas as pd
df1=pd.DataFrame({"A":[1, 5, 3, 4, 2],
"B":[3, 2, 4, 3, 4], "C":[2, 2, 7, 3, 4],
"D":[4, 3, 6, 12, 7]})
# reindexing the column axis with
# old and new index values ("E" is new, so it is filled with NaN)
df1= df1.reindex(columns =["A","B","D","E"])
print(df1)
Output:
# reindex the columns of df (from Example #1)
# fill the missing values with 30
df2=df.reindex(columns =["E","B","C","A","D"], fill_value = 30)
print(df2)
Output:
Pivoting DataFrame
➢ Pivot – pivot() reshapes data and uses unique values from the chosen
index/columns to form the axes of the resulting dataframe. index is the column name
to use for the new frame's index, columns is the column name to use for the new
frame's columns, and values is the column name to use for populating the new
frame's values.
➢ Exception: a ValueError is raised if there are any duplicate index/column pairs.
Syntax:
<df>.pivot(index=None, columns=None, values=None)
#Example of pivot()
import pandas as pd
import numpy as np
data={'Item':['TV','TV','AC','AC'],
'Company':['LG','VIDEOCON','LG','SONY'],
'Rupees':[12000,10000,15000,14000], 'USD':[700,650,800,750]}
df1=pd.DataFrame(data)
print(df1)
Output:
pvt=df1.pivot(index='Item', columns='Company', values='Rupees')
print(pvt)
print(pvt[pvt.index=='TV'].LG.values)
#pivot() creates a new table/DataFrame whose columns are the unique
values in Company and whose rows are indexed with the unique values
of Item. The last statement of the above program returns the value of the TV item
for the LG company, i.e. 12000.
Now, if in the previous example we want to pivot the values of both Rupees and
USD together, we have to use the pivot function in the following manner:
p2=df1.pivot(index='Item', columns='Company')
If df1 contained duplicate entries for the same index/columns pair (for example, two
rows for the LG TV), then
df1.pivot(index='Item', columns='Company', values='Rupees')
would throw an exception with the following message:
ValueError: Index contains duplicate entries, cannot reshape
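To see the ValueError in action, here is a sketch with two rows for the same (Item, Company) pair (the prices are made up):

```python
import pandas as pd

# two LG TV rows give a duplicate (Item, Company) pair
dup = pd.DataFrame({'Item': ['TV', 'TV', 'AC'],
                    'Company': ['LG', 'LG', 'SONY'],
                    'Rupees': [12000, 13000, 15000]})
try:
    dup.pivot(index='Item', columns='Company', values='Rupees')
except ValueError as err:
    print(err)  # Index contains duplicate entries, cannot reshape
```

pivot() has no way to decide which of the two LG TV prices should occupy the single ('TV', 'LG') cell, so it refuses to reshape.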
➢ Pivot table: Pivot table is used to summarize and aggregate data inside
dataframe.
Syntax:
pandas.pivot_table(data, values=None, index=None, columns=None,
aggfunc='mean')
OR
<df>.pivot_table(values=None, index=None, columns=None, aggfunc='mean')
#Pivot Table
The pivot_table() method solves this problem. It works like pivot, but it
aggregates the values from rows with duplicate entries for the specified columns.
df1.pivot_table(index='Item', columns='Company',
values='Rupees', aggfunc='mean')
In essence pivot_table is a generalization of pivot, which allows you to
aggregate multiple values with the same destination in the pivoted table.
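Continuing the duplicate-entries case from the pivot() discussion, a sketch showing how pivot_table() aggregates the clashing rows (the prices are made up):

```python
import pandas as pd

dup = pd.DataFrame({'Item': ['TV', 'TV', 'AC'],
                    'Company': ['LG', 'LG', 'SONY'],
                    'Rupees': [12000, 14000, 15000]})
# the two LG TV rows are averaged instead of raising ValueError
pt = dup.pivot_table(index='Item', columns='Company', values='Rupees', aggfunc='mean')
print(pt)  # the ('TV', 'LG') cell holds the mean, 13000.0
```

Where pivot() raised ValueError on this data, pivot_table() applies aggfunc to the duplicates and fills the remaining cells with NaN.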
DataFrame: groupby()
➢ Any groupby operation involves one of the following operations on the
original object. They are −
o Splitting the Object
o Applying a function
o Combining the results
➢ In many situations, we split the data into sets and we apply some
functionality on each subset. In the apply functionality, we can perform
the following operations
o Aggregation − computing a summary statistic
o Transformation − perform some group-specific operation
Syntax:
<df>.groupby(by=None, axis=0)
Parameters:
by: label or list of labels to be used for grouping.
axis: {0 for index, 1 for column}, default 0; split along rows (0) or
columns (1)
e.g. (assuming a dataframe df1 that has a 'Tutor' column)
df1.groupby('Tutor')
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x0a399393>
The result of groupby() is also an object, the DataFrameGroupBy object.
gdf=df1.groupby('Tutor')
gdf is a GroupBy object. You can store the GroupBy object in a variable and then
use the following attributes and methods on it.
e.g.
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,
2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,694,
701,804,690]}
df = pd.DataFrame(ipl_data)
print (df)
Output:
#store groupobject
gdf=df.groupby('Team')
#view groups
print(gdf.groups)
Output:
{'Devils': Int64Index([2, 3], dtype='int64'),
'Kings': Int64Index([4, 6, 7], dtype='int64'),
'Riders': Int64Index([0, 1, 8, 11], dtype='int64'),
'Royals': Int64Index([9, 10], dtype='int64'),
'kings': Int64Index([5], dtype='int64')}
#select a group
print(gdf.get_group('Riders'))
Output:
#group on multiple columns and print the size of the nested groups
gdf1=df.groupby(['Team','Year'])
print(gdf1.size())
Output:
#select one nested group
print(gdf1.get_group(('Kings',2016)))
Output:
➢ Aggregation: once the data is grouped, the agg() function applies one or
more aggregate functions to each group.
Syntax:
<GroupbyObject>.agg(func, axis=0)
<df>.groupby(by).agg(func)
func: function, str, list or dict
Example of GroupBy() using agg():
import pandas as pd
import numpy as np
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2], 'Year':[2014,2015,2014,2015,2014,2015,
2016,2017,2016,2014,2015,2017], 'Points':[876,789,863,673,741,812,756,788,
694,701,804,690]}
df = pd.DataFrame(ipl_data)
gdf2=df.groupby('Year')
print(gdf2['Points'].agg(np.mean))
Output:
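Since func can also be a list, several statistics can be computed in one call; a sketch on a small subset of the IPL data above:

```python
import pandas as pd

df = pd.DataFrame({'Year': [2014, 2014, 2015, 2015],
                   'Points': [876, 863, 789, 673]})
g = df.groupby('Year')['Points']
print(g.agg(['mean', 'min', 'max']))  # one output column per function
```

The result is a DataFrame indexed by the grouping key, with one column per function in the list.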
Note: for the mysql.connector library, run the command pip install mysql-connector-python in the command prompt.
Pass the proper host name, database name, user name and password (if any) to the connect method.
Exporting data to a MySQL database from a Pandas data frame
e.g.
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqlconnector://root:root@localhost/bank')
lst = ['vishal', 'ram']
lst2 = [11, 22]
# Calling DataFrame constructor after zipping both lists, with columns specified
df = pd.DataFrame(list(zip(lst, lst2)), columns =['Name', 'val'])
df.to_sql(name='bmaster', con=engine, if_exists = 'replace', index=False)
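The same to_sql() call also accepts a plain sqlite3 connection, which makes a self-contained sketch possible without a MySQL server (the table and column names follow the example above):

```python
import sqlite3
import pandas as pd

# an in-memory SQLite database stands in for the MySQL server here
con = sqlite3.connect(':memory:')
df = pd.DataFrame({'Name': ['vishal', 'ram'], 'val': [11, 22]})
df.to_sql(name='bmaster', con=con, if_exists='replace', index=False)
# read the table back to confirm the export worked
out = pd.read_sql('SELECT * FROM bmaster', con)
print(out)
```

With if_exists='replace', any existing bmaster table is dropped and recreated; use 'append' to add rows to an existing table instead.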