Chapter1.2 PythonPandas2
Data Handling
using Pandas -2
Informatics Practices
Class XII (As per CBSE Board)
Data handling using pandas
Descriptive statistics
Descriptive statistics are used to describe and summarize large data sets
in ways that are meaningful and useful. They are the "must knows" for any
set of data, and they give a general idea of the trends in the data, including:
• The mean, mode, median and range.
• Variance, standard deviation and quartiles.
• Sum, count, maximum and minimum.
Descriptive statistics are useful because they help us make decisions. For
example, suppose we have data on the incomes of one million
people. No one is going to want to read a million pieces of data; if they
did, they would not be able to get any useful information from it. On the
other hand, if we summarize it, it becomes useful: an average wage, or
a median income, is much easier to understand than reams of data.
Steps to get the descriptive statistics
• Step 1: Collect the data
  Either from a data file or from the user.
• Step 2: Create the DataFrame
  Create a DataFrame object using pandas.
• Step 3: Get the descriptive statistics for the pandas DataFrame
  Compute the statistics required, such as mean, mode, max or sum,
  by calling the corresponding methods on the DataFrame.
Note: A DataFrame object is well suited to descriptive statistics because it
can hold a large amount of data and provides the relevant functions.
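The three steps above can be sketched as follows. This is only a minimal illustration: the names and marks are hypothetical data typed in directly, standing in for data read from a file or from the user.

```python
import pandas as pd

# Step 1: collect the data (hypothetical values, not from the chapter)
collected = {'Name': ['Asha', 'Ravi', 'Meena', 'Karan'],
             'Marks': [40, 55, 55, 70]}

# Step 2: create the DataFrame
df = pd.DataFrame(collected)

# Step 3: get the descriptive statistics that are required
mean_marks = df['Marks'].mean()            # average of the marks
median_marks = df['Marks'].median()        # middle value
mode_marks = df['Marks'].mode().tolist()   # most frequent value(s)
max_marks = df['Marks'].max()
total_marks = df['Marks'].sum()
count_marks = df['Marks'].count()

print(mean_marks, median_marks, mode_marks, max_marks, total_marks, count_marks)

# describe() gives several of these statistics in one call
print(df['Marks'].describe())
```

mode() returns a Series because a data set can have more than one mode, which is why it is converted to a list here.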
Descriptive statistics - dataframe
OUTPUT
0.25 4.25
0.50 7.00
0.75 11.50
dtype: float64
How to find quartiles in Python
# Program in Python to find the 0.25 quantile of the series [1, 10, 100, 1000]
import pandas as pd
s = pd.Series([1, 10, 100, 1000])
r = s.quantile(.25)
print(r)
OUTPUT
7.75
Solution steps
1. q = 0.25 (the 0.25 quantile)
2. n = 4 (number of elements)
3. Position of the quantile
   = (n-1)*q + 1
   = (4-1)*0.25 + 1
   = 3*0.25 + 1
   = 0.75 + 1
   = 1.75
4. The integer part is a = 1 and the fractional part is b = 0.75. T denotes a term of the series.
5. The formula for the quantile is
   = T1 + b*(T2 - T1)
   = 1 + 0.75*(10 - 1)
   = 1 + 0.75*9
   = 1 + 6.75 = 7.75
   The quantile is 7.75.
Note: In the series [1, 10, 100, 1000], 1 is at position 1, 10 is at position 2,
100 is at position 3 and so on. T1 is 1 because the integer part of 1.75 is 1 and
the value at position 1 is 1. The fractional part then places the quantile between
1 and 10, at a distance of 0.75*(10-1) = 6.75 above 1. That is why 6.75 is added to 1.
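The solution steps can be checked with a short sketch. The helper quantile_by_hand is a hypothetical function written for this illustration (it is not part of pandas); it follows the position and interpolation steps above and is compared with Series.quantile, which uses linear interpolation by default.

```python
import math
import pandas as pd

def quantile_by_hand(values, q):
    """Quantile by linear interpolation, using 1-based positions as in the steps."""
    terms = sorted(values)
    n = len(terms)
    pos = (n - 1) * q + 1        # position of the quantile, e.g. 1.75
    a = math.floor(pos)          # integer part
    b = pos - a                  # fractional part
    t1 = terms[a - 1]            # term at position a (lists are 0-based)
    t2 = terms[min(a, n - 1)]    # next term, clamped at the last element
    return t1 + b * (t2 - t1)

s = pd.Series([1, 10, 100, 1000])
by_hand = quantile_by_hand(s.tolist(), 0.25)
by_pandas = s.quantile(0.25)
print(by_hand, by_pandas)
```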
Standard Deviation
Standard deviation measures the amount of variation or
dispersion in a set of values. A low standard deviation means the
values tend to be close to the mean of the set, and a high standard
deviation means the values are spread out over a wider range.
Standard deviation is one of the most important concepts in
finance. Finance and banking are largely about measuring
risk, and standard deviation is a common measure of risk. It is
widely used by portfolio managers to measure and track risk.
Steps to calculate the standard deviation:
1. Work out the mean (the simple average of the numbers).
2. Then for each number, subtract the mean and square the result.
3. Then work out the mean of those squared differences (for a sample,
   divide the sum by N-1 instead of N).
4. Take the square root of that and we are done!
E.g. Standard deviation of (9, 2, 12, 4, 5, 7)
Step 1. Work out the mean: (9+2+12+4+5+7) / 6 = 39/6 = 6.5
Step 2. Then for each number, subtract the mean and square the
result: (9 - 6.5)² = (2.5)² = 6.25, (2 - 6.5)² = (-4.5)² = 20.25
Perform the same operation for all remaining numbers.
Step 3. Then work out the mean of those squared differences.
Sum = 6.25 + 20.25 + 30.25 + 6.25 + 2.25 + 0.25 = 65.5
Divide by N-1: (1/5) × 65.5 = 13.1 (this value is the "sample variance")
Step 4. Take the square root of that: s = √(13.1) = 3.619... (standard deviation)
# Create a DataFrame
import pandas as pd
info = {
    'Name': ['Mohak', 'Freya', 'Viraj', 'Santosh', 'Mishti', 'Subrata'],
    'Marks': [9, 2, 12, 4, 5, 7]}
data = pd.DataFrame(info)
# Standard deviation of the numeric columns of the DataFrame
# (numeric_only=True skips the text column Name)
r = data.std(numeric_only=True)
print(r)
OUTPUT
Marks    3.619392
dtype: float64
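The hand calculation divides by N-1, which matches the pandas default (ddof=1, the sample standard deviation). The sketch below repeats the four steps on the same marks and compares the result with std(). The population version, which divides by N, is shown only for contrast.

```python
import math
import pandas as pd

marks = pd.Series([9, 2, 12, 4, 5, 7])

mean = marks.mean()                         # Step 1: 6.5
squared_diffs = (marks - mean) ** 2         # Step 2
total = squared_diffs.sum()                 # 65.5

sample_variance = total / (len(marks) - 1)  # Step 3, dividing by N-1
population_variance = total / len(marks)    # dividing by N instead

sample_std = math.sqrt(sample_variance)     # Step 4
population_std = math.sqrt(population_variance)

print(sample_std, marks.std())              # default ddof=1 (sample)
print(population_std, marks.std(ddof=0))    # ddof=0 (population)
```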
var() – The variance function in pandas calculates the variance of a given set of
numbers: the variance of a DataFrame, of a column, or of rows. Let's see
an example of each.
# e.g. program
import pandas as pd
# Create a dictionary of Series
d = {'Name': pd.Series(['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar']),
     'Age': pd.Series([26, 25, 25, 24, 31]),
     'Score': pd.Series([87, 67, 89, 55, 47])}
# Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents")
print(df)
# Variance of each numeric column (the text column Name is skipped)
print(df.var(numeric_only=True))
# df.loc[:, "Age"].var()               variance of a specific column
# df.var(axis=0, numeric_only=True)    column variance
# df.var(axis=1, numeric_only=True)    row variance
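The three commented variants can be run as in the sketch below, which rebuilds the same data with plain lists. Like std(), var() uses N-1 in the denominator by default. numeric_only=True is passed so that the text column Name is left out of the calculation.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar'],
    'Age': [26, 25, 25, 24, 31],
    'Score': [87, 67, 89, 55, 47],
})

# Variance of one specific column
age_var = df.loc[:, 'Age'].var()

# Variance of every numeric column (down the rows, axis=0)
col_var = df.var(axis=0, numeric_only=True)

# Variance across the numeric columns of each row (axis=1)
row_var = df.var(axis=1, numeric_only=True)

print(age_var)
print(col_var)
print(row_var)
```

The row variance mixes Age and Score in one calculation, so it is shown only to illustrate the axis parameter.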
Dataframe Operations
Pivot – Pivot reshapes data and uses unique values from the index/
columns to form the axes of the resulting DataFrame. index is the column
name to use for the new frame's index. columns is the column name
to use for the new frame's columns. values is the column name to
use for populating the new frame's values.
pandas provides two functions for pivoting:
1. pivot()
2. pivot_table()
p = d.pivot(index='ITEM', columns='COMPANY')
         RUPEES                     USD
COMPANY      LG   SONY  VIDEOCON     LG  SONY  VIDEOCON
ITEM
AC        15000  14000       NaN    800   750       NaN
TV        12000    NaN     10000    700   NaN       650
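The DataFrame d is not shown with this example, so the sketch below rebuilds it as an assumption: one row for each ITEM/COMPANY pair, read back from the non-NaN cells of the output above. Actual pandas output prints the numbers as floats because NaN forces a float column.

```python
import pandas as pd

# Assumed input, reconstructed from the pivoted output (not given in the source)
d = pd.DataFrame({
    'ITEM':    ['TV', 'TV', 'AC', 'AC'],
    'COMPANY': ['LG', 'VIDEOCON', 'LG', 'SONY'],
    'RUPEES':  [12000, 10000, 15000, 14000],
    'USD':     [700, 650, 800, 750],
})

# Without values=, every remaining column (RUPEES and USD) is pivoted,
# giving two-level column labels such as ('RUPEES', 'LG')
p = d.pivot(index='ITEM', columns='COMPANY')
print(p)

# With values=, only one column is used to fill the new frame
rupees_only = d.pivot(index='ITEM', columns='COMPANY', values='RUPEES')
print(rupees_only)
```

Combinations that do not occur in the data, such as AC with VIDEOCON, appear as NaN.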
Pivoting - dataframe
#Common mistake in pivoting
The pivot method takes at least two column names as parameters: the index and the columns named
parameters. The problem is: what happens if multiple rows have the same values
for these columns? What will be the value of the corresponding cell in the table produced by the pivot
method? The following data depicts the problem:
ITEM  COMPANY   RUPEES  USD
TV    LG        12000   700
TV    VIDEOCON  10000   650
TV    LG        15000   800
AC    SONY      14000   750
OUTPUT
Will be as data available in table bmaster
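A minimal sketch of what happens with the four rows listed in the table above, where TV/LG occurs twice: pivot() cannot decide which value to place in that cell and raises a ValueError, while pivot_table() combines the duplicates with an aggregation function. The choice of 'mean' here is only an assumption for illustration; 'sum', 'max' and others are equally possible.

```python
import pandas as pd

d = pd.DataFrame({
    'ITEM':    ['TV', 'TV', 'TV', 'AC'],
    'COMPANY': ['LG', 'VIDEOCON', 'LG', 'SONY'],
    'RUPEES':  [12000, 10000, 15000, 14000],
    'USD':     [700, 650, 800, 750],
})

# pivot() fails because the pair (TV, LG) appears in two rows
try:
    d.pivot(index='ITEM', columns='COMPANY')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)

# pivot_table() aggregates the duplicate rows instead of failing
p = d.pivot_table(index='ITEM', columns='COMPANY',
                  values='RUPEES', aggfunc='mean')
print(p)
```

With aggfunc='mean', the TV/LG cell holds the average of 12000 and 15000.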