Chapter1.2 PythonPandas2

ip ppt

Uploaded by

Coding corner

Chapter 2

Data Handling using Pandas - 2

Informatics Practices
Class XII (As per CBSE Board)
Data handling using pandas
Descriptive statistics
Descriptive statistics describe and summarize large amounts of data in
ways that are meaningful and useful. They are the "must knows" for any
set of data and give us a general idea of the trends in our data, including:
• The mean, mode, median and range.
• Variance, standard deviation and quartiles.
• Sum, count, maximum and minimum.
Descriptive statistics are useful because they help us make decisions. For
example, let's say we have data on the incomes of one million
people. No one is going to want to read a million pieces of data; if they
did, they wouldn't be able to get any useful information from it. On the
other hand, if we summarize it, it becomes useful: an average wage, or
a median income, is much easier to understand than reams of data.
Data handling using pandas
Steps to get the descriptive statistics
• Step 1: Collect the data
Either from a data file or from the user
• Step 2: Create the DataFrame
Create the dataframe from the collected data
• Step 3: Get the descriptive statistics for the pandas DataFrame
Get the descriptive statistics as per requirement, like mean, mode, max, sum etc.,
by calling the corresponding methods on the dataframe
Note :- A dataframe object is well suited to descriptive statistics, as it can hold
a large amount of data and provides the relevant functions.
Descriptive statistics - dataframe

Pandas dataframe objects come with methods to calculate max, min, count,
sum, mean, median, mode, quartile, standard deviation and variance.
Mean
Mean is an average of all the numbers. The steps required
to calculate a mean are:
• sum up all the values of a target variable in the dataset
• divide the sum by the number of values
Descriptive statistics - dataframe
Median - Median is the middle value of a sorted list of numbers.
The steps required to get a median from a list of numbers are:
• sort the numbers from smallest to largest
• if the list has an odd number of values, the value in the middle
position is the median
• if the list has an even number of values, the average of the two
values in the middle will be the median
Mode - To find the mode, or modal value, it is best to put the
numbers in order. Then count how many there are of each number. The number
that appears most often is the mode, e.g. {19, 8, 29, 35, 19, 28, 15}.
Arrange them in order: {8, 15, 19, 19, 28, 29, 35}. 19 appears twice,
all the rest appear only once, so 19 is the mode.
Having two modes is called "bimodal". Having more than two modes
is called "multimodal".
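The steps for mean, median and mode can be sketched in code as below. This is a minimal illustration that reuses the set {19, 8, 29, 35, 19, 28, 15} from the text; the variable names are assumptions made for the example.

```python
import pandas as pd

values = pd.Series([19, 8, 29, 35, 19, 28, 15])

# Mean: sum of the values divided by the number of values
manual_mean = values.sum() / values.count()

# Median: middle value of the sorted list (this list has an odd length)
ordered = values.sort_values().tolist()
manual_median = ordered[len(ordered) // 2]

# Mode: value(s) appearing most often; mode() returns a Series
# because a data set can have more than one mode
modes = values.mode().tolist()

print("mean  ", manual_mean, values.mean())
print("median", manual_median, values.median())
print("mode  ", modes)
```

The hand-computed mean and median should agree with the built-in mean() and median() methods.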
Descriptive statistics - dataframe
#e.g. program for data aggregation/descriptive statistics
from pandas import DataFrame

#STEP1 - collect the data
Cars = {'Brand': ['Maruti ciaz','Ford ','Tata Indigo','Toyota Corolla','Audi A9'],
        'Price': [22000,27000,25000,29000,35000],
        'Year': [2014,2015,2016,2017,2018]
        }
#STEP2 - create the dataframe
df = DataFrame(Cars, columns= ['Brand', 'Price','Year'])
#STEP3 - get the descriptive statistics
stats_numeric = df['Price'].describe().astype (int)
print (stats_numeric)
#describe() method returns count, mean, standard deviation, min, max
#and the 25%, 50% and 75% values

OUTPUT
count        5
mean     27600
std       4878
min      22000
25%      25000
50%      27000
75%      29000
max      35000
Name: Price, dtype: int32
(the dtype may be shown as int64, depending on the platform)
Descriptive statistics - dataframe
#e.g. program for data aggregation/descriptive statistics
import pandas as pd
import numpy as np
#STEP1 - Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]),
     'Score':pd.Series([87,67,89,55,47])}
#STEP2 - Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents")
print (df)
#STEP3 - Get the descriptive statistics
print(df.count())
print("count age",df[['Age']].count())
print("sum of score",df[['Score']].sum())
print("minimum age",df[['Age']].min())
print("maximum score",df[['Score']].max())
print("mean age",df[['Age']].mean())
print("mode of age",df[['Age']].mode())
print("median of score",df[['Score']].median())

OUTPUT
Dataframe contents
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47
Name     5
Age      5
Score    5
dtype: int64
count age Age    5
dtype: int64
sum of score Score    345
dtype: int64
minimum age Age    24
dtype: int64
maximum score Score    89
dtype: int64
mean age Age    26.2
dtype: float64
mode of age    Age
0   25
median of score Score    67.0
dtype: float64
Descriptive statistics - dataframe
Quantile -
Quantiles are cut points that divide a data set into parts. They are used to describe data
in a clear and understandable way. The 0.30 quantile is the value below which 30 % of the
observations in our data set fall. Equivalently, the remaining 70 % of the observations lie
above that value.
Common Quantiles
Certain types of quantiles are used commonly enough to have specific names. Below is
a list of these:
• The 2 quantile is called the median
• The 3 quantiles are called terciles
• The 4 quantiles are called quartiles
• The 5 quantiles are called quintiles
• The 6 quantiles are called sextiles
• The 7 quantiles are called septiles
• The 8 quantiles are called octiles
• The 10 quantiles are called deciles
• The 12 quantiles are called duodeciles
• The 20 quantiles are called vigintiles
• The 100 quantiles are called percentiles
• The 1000 quantiles are called permilles
Quantiles
The word "quantile" comes from the word quantity. In simple terms, a
quantile is a point at which a sample is divided into equal-sized subgroups
(that's why it's sometimes called a "fractile"). It can
also refer to dividing a probability distribution into areas of equal
probability.
The median is a kind of quantile; the median is placed at the center of a
probability distribution so that exactly half of the data is
lower than the median and half of the data is above the median. The
median cuts a distribution into two equal parts, which is why
it is sometimes called the 2-quantile.
Quartiles are quantiles that divide the distribution into four
equal parts. Deciles are quantiles that divide a distribution into 10
equal parts, and percentiles divide a distribution into 100
equal parts.
Quantiles
How to Find Quantiles?
Sample question: Find the number in the following set of data where 30
percent of values fall below it, and 70 percent fall above:
2 4 5 7 9 11 12 17 19 21 22 31 35 36 44 45 55 68 79 80 81 88 90 91 92 100 112
113 114 120 121 132 145 148 149 152 157 170 180 190
Step 1: Order the data from smallest to largest. The data in the question is
already in ascending order.
Step 2: Count how many observations you have in your data set. This particular
data set has 40 items.
Step 3: Convert any percentage to a decimal for "q". We are looking for the
number where 30 percent of the values fall below it, so convert that to .3.
Step 4: Insert your values into the formula:
ith observation = q (n + 1)
ith observation = .3 (40 + 1) = 12.3
Answer: The ith observation is at 12.3, so we round down to 12 (remembering
that this formula is an estimate). The 12th number in the set is 31, which is the
number where 30 percent of the values fall below it.
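The four steps above can be sketched in plain Python as follows. This is only an illustration of the q(n + 1) estimate from the sample question; the variable names are assumptions made for the example.

```python
import math

data = [2, 4, 5, 7, 9, 11, 12, 17, 19, 21, 22, 31, 35, 36, 44, 45, 55, 68,
        79, 80, 81, 88, 90, 91, 92, 100, 112, 113, 114, 120, 121, 132, 145,
        148, 149, 152, 157, 170, 180, 190]

ordered = sorted(data)           # Step 1: order the data
n = len(ordered)                 # Step 2: count the observations
q = 0.3                          # Step 3: 30 percent as a decimal
position = q * (n + 1)           # Step 4: about 12.3
rank = math.floor(position)      # round down to the 12th observation
value = ordered[rank - 1]        # Python lists start counting at 0

print(n, rank, value)
```

Note that pandas' quantile() method interpolates between neighbouring values by default, so it need not return exactly the same number as this estimate.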
Quantiles
How to Find Quartiles in python
In pandas series object->
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 4, 5,6,8,10,12,16,20])
r=s.quantile([0.25,0.5,0.75])
print(r)

OUTPUT
0.25 4.25
0.50 7.00
0.75 11.50
dtype: float64
How to Find Quartiles in python
#Program in python to find 0.25 quantile of
#series [1, 10, 100, 1000]
import pandas as pd
import numpy as np
s = pd.Series([1, 10, 100, 1000])
r=s.quantile(.25)
print(r)

OUTPUT
7.75

Solution steps
1. q=0.25 (0.25 quantile)
2. n = 4 (no of elements)
position =(n-1)*q+1
=(4-1)*0.25+1
=3*0.25+1
=0.75+1
=1.75
3. Now integer part is a=1 and fraction part is b=0.75 and T is term.
Now formula for quantile is
=T1+b*(T2-T1)
=1+0.75*(10-1)
=1+0.75*9
=1+6.75 = 7.75
Quantile is 7.75
Note:- In series [1, 10, 100, 1000], 1 is at position 1, 10 is at position 2,
100 is at position 3 and so on. Here we are choosing T1 as 1 because at
position 1 (integer part of 1.75 is 1) the value is 1 (T1). The fraction part
then tells us how far to move from 1 towards the next value 10; that
distance is 0.75*(10-1) = 6.75. That is why we are adding 6.75 to 1.
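The solution steps above can be written as a small function and compared with pandas. This is a sketch of the linear interpolation rule described in the steps; the function name linear_quantile is an assumption made for the example and is not part of pandas.

```python
import math
import pandas as pd

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest terms."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n - 1) * q + 1        # 1-based position, e.g. 1.75
    a = math.floor(position)          # integer part
    b = position - a                  # fraction part
    t1 = ordered[a - 1]               # term at position a
    t2 = ordered[min(a, n - 1)]       # next term (same term at the end)
    return t1 + b * (t2 - t1)

series = [1, 10, 100, 1000]
by_hand = linear_quantile(series, 0.25)
by_pandas = pd.Series(series).quantile(0.25)
print(by_hand, by_pandas)
```

Both values should agree, because linear interpolation is the default rule used by quantile().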
Standard Deviation
Standard deviation is a measure of the amount of variation /
dispersion of a set of values. A low standard deviation means the
values tend to be close to the mean of the set, and a high standard
deviation means the values are spread out over a wider range.
Standard deviation is one of the most important concepts in
finance. Finance and banking are largely about measuring
risk, and standard deviation is a common measure of risk. Portfolio
managers widely use standard deviation to measure and track risk.
Steps to calculate the standard deviation:
1. Work out the Mean (the simple average of the numbers)
2. Then for each number: subtract the Mean and square the result
3. Then work out the mean of those squared differences (for a sample,
divide their sum by N-1 instead of N; pandas does this by default).
4. Take the square root of that and we are done!
Standard Deviation
E.g. Std deviation for (9, 2, 12, 4, 5, 7)
Step 1. Work out the mean - (9+2+12+4+5+7) / 6 = 39/6 = 6.5
Step 2. Then for each number: subtract the Mean and square the
result - (9 - 6.5)^2 = (2.5)^2 = 6.25 , (2 - 6.5)^2 = (-4.5)^2 = 20.25
Perform same operation for all remaining numbers.
Step 3. Then work out the mean of those squared differences.
Sum = 6.25 + 20.25 + 30.25 + 6.25 + 2.25 + 0.25 = 65.5
Divide by N-1: (1/5) × 65.5 = 13.1 (This value is the "Sample Variance")
Step 4. Take the square root of that: s = √(13.1) = 3.619... (stddev)

Formula for Standard Deviation:
$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2}$

The above e.g. is for practice purposes; otherwise stddev is
computed for large amounts of data.
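The four steps of the worked example can be checked with a few lines of plain Python. This is a minimal sketch using the same six numbers; the variable names are assumptions made for the example.

```python
import math

marks = [9, 2, 12, 4, 5, 7]
n = len(marks)

mean = sum(marks) / n                               # Step 1: the mean
squared_diffs = [(x - mean) ** 2 for x in marks]    # Step 2: squared differences
sample_variance = sum(squared_diffs) / (n - 1)      # Step 3: divide by N-1
std_dev = math.sqrt(sample_variance)                # Step 4: square root

print(mean, sum(squared_diffs), sample_variance, std_dev)
```

Dividing by n instead of n - 1 in Step 3 would give the population standard deviation, which is a slightly smaller number.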
Standard Deviation
E.g. Std deviation for (9, 2, 12, 4, 5, 7)
import pandas as pd
import numpy as np

#Create a DataFrame
info = {
'Name':['Mohak','Freya','Viraj','Santosh','Mishti','Subrata'],
'Marks':[9, 2, 12, 4, 5, 7]}
data = pd.DataFrame(info)
# standard deviation of the numeric columns of the dataframe
# (numeric_only=True skips the text column 'Name', which recent
# pandas versions would otherwise reject with an error)
r=data.std(numeric_only=True)
print(r)

OUTPUT
Marks    3.619392
dtype: float64
Descriptive statistics - dataframe
var() - The variance function in python pandas is used to calculate the variance of a given
set of numbers: the variance of a data frame, the variance of a column and the variance of
rows. Let's see an example.
#e.g.program
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]),
     'Score':pd.Series([87,67,89,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents")
print (df)
#numeric_only=True skips the text column 'Name'
print(df.var(numeric_only=True))
#df.loc[:,"Age"].var() for variance of specific column
#df.var(axis=0, numeric_only=True) column variance
#df.var(axis=1, numeric_only=True) row variance
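The three variants mentioned in the comments (whole dataframe, one column, rows) can be tried out as below. This is a sketch on the same data; the variable names are assumptions made for the example, and numeric_only=True is passed so that the text column 'Name' is left out.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar'],
                   'Age': [26, 25, 25, 24, 31],
                   'Score': [87, 67, 89, 55, 47]})

col_var = df.var(axis=0, numeric_only=True)   # variance of each numeric column
age_var = df.loc[:, 'Age'].var()              # variance of one column
row_var = df.var(axis=1, numeric_only=True)   # variance across Age and Score per row

print(col_var)
print(age_var)
print(row_var)
```

Like std(), var() divides by N-1 by default, so these are sample variances.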
Dataframe Operations

Data aggregation - Aggregation is the process of turning the values
of a dataset (or a subset of it) into one single value. In other words, an
aggregation function takes multiple values and returns a
single value as a result. There are a number of aggregations possible, like
count, sum, min, max, median, quartile etc. These (count, sum etc.) are
descriptive statistics and related operations on a DataFrame. Let us
make this clear! If we have a DataFrame like...
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47
...then a simple aggregation method is to calculate the sum of the
Score, which is 87+67+89+55+47 = 345. Or a different aggregation
method would be to count the number of Name values, which is 5.
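The two aggregations described above (sum of Score, count of Name) can be sketched as follows. The agg() call at the end is an additional illustration of applying several aggregations in one step; the variable names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Sachin', 'Dhoni', 'Virat', 'Rohit', 'Shikhar'],
                   'Age': [26, 25, 25, 24, 31],
                   'Score': [87, 67, 89, 55, 47]})

total_score = df['Score'].sum()     # many values -> one value
name_count = df['Name'].count()     # number of non-missing names

# Several aggregations of the same column at once
summary = df['Score'].agg(['count', 'sum', 'min', 'max', 'median'])

print(total_score, name_count)
print(summary)
```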
Dataframe operations
Group by

A groupby operation involves some combination of splitting the
object, applying a function, and combining the results. This can be
used to group large amounts of data and compute operations on
these groups.
E.g.
import pandas as pd
df = pd.DataFrame({'Animal': ['Tiger', 'Tiger','Parrot', 'Parrot'],
                   'Max Speed': [180., 170., 24., 26.]})
m=df.groupby(['Animal']).mean()
print(m)
OUTPUT
        Max Speed
Animal
Parrot       25.0
Tiger       175.0
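To make the split, apply and combine stages visible, the same grouping can be written out step by step. This is only a sketch on the Animal data; the dictionary by_hand and the variable summary are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Tiger', 'Tiger', 'Parrot', 'Parrot'],
                   'Max Speed': [180., 170., 24., 26.]})

# Split: one sub-frame per animal; apply: mean of each; combine: into a dict
by_hand = {}
for animal, group in df.groupby('Animal'):
    by_hand[animal] = group['Max Speed'].mean()

# The same idea in one call, with several aggregations at once
summary = df.groupby('Animal')['Max Speed'].agg(['min', 'max', 'mean'])

print(by_hand)
print(summary)
```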
Dataframe operations
Sorting

Sorting means arranging the contents in ascending or
descending order. There are two kinds of sorting available in
pandas (Dataframe).
1. By value (column)
2. By index

1. By value - Sorting over dataframe column/s elements is
supported by sort_values() method. We will cover here three
aspects of sorting values of dataframe.
• Sort a pandas dataframe in python by Ascending and
Descending
• Sort a python pandas dataframe by single column
• Sort a pandas dataframe by multiple columns.
Dataframe operations
Sorting

Sort the python pandas Dataframe by single column - Ascending order

import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,27,25,24,31]),
     'Score':pd.Series([87,89,67,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print (df)
df=df.sort_values(by='Score')
print("Dataframe contents after sorting")
print (df)
#In above example dictionary object is used to create
#the dataframe. Elements of dataframe object df are
#sorted by sort_values() method. As argument we are
#passing the column name 'Score' for the by parameter only.
#By default it sorts in ascending order.

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   27     89
2    Virat   25     67
3    Rohit   24     55
4  Shikhar   31     47
Dataframe contents after sorting
      Name  Age  Score
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     67
0   Sachin   26     87
1    Dhoni   27     89
Dataframe operations
Sorting

Sort the python pandas Dataframe by single column - Descending order

import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,27,25,24,31]),
     'Score':pd.Series([87,89,67,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print (df)
df=df.sort_values(by='Score',ascending=False)
print("Dataframe contents after sorting")
print (df)
#In above example dictionary object is used to create
#the dataframe. Elements of dataframe object df are
#sorted by sort_values() method. We are passing False
#(0 also works) for the ascending parameter, which sorts
#the data in descending order of score.

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   27     89
2    Virat   25     67
3    Rohit   24     55
4  Shikhar   31     47
Dataframe contents after sorting
      Name  Age  Score
1    Dhoni   27     89
0   Sachin   26     87
2    Virat   25     67
3    Rohit   24     55
4  Shikhar   31     47
Dataframe operations
Sorting

Sort the pandas Dataframe by Multiple Columns

import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]), 'Score':pd.Series([87,67,89,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
print("Dataframe contents without sorting")
print (df)
df=df.sort_values(by=['Age', 'Score'],ascending=[True,False])
print("Dataframe contents after sorting")
print (df)
#In above example dictionary object is used to create
#the dataframe. Elements of dataframe object df are
#sorted by sort_values() method. We are passing two columns
#as the by parameter value, and the ascending parameter also
#gets two values, first True and second False, which
#means sort in ascending order of age and descending
#order of score.

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47
Dataframe contents after sorting
      Name  Age  Score
3    Rohit   24     55
2    Virat   25     89
1    Dhoni   25     67
0   Sachin   26     87
4  Shikhar   31     47
Dataframe operations
Sorting

2. By index - Sorting over dataframe index is supported by
sort_index() method. We will cover here two aspects of
sorting index of dataframe.

• how to sort a pandas dataframe in python by index in
Ascending order
• how to sort a pandas dataframe in python by index in
Descending order
Dataframe operations
Sorting
Sort the dataframe in python pandas by index in ascending order:
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]),
     'Score':pd.Series([87,67,89,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
df=df.reindex([1,4,3,2,0])
print("Dataframe contents without sorting")
print (df)
df1=df.sort_index()
print("Dataframe contents after sorting")
print (df1)
#In above example dictionary object is used to create
#the dataframe. Elements of dataframe object df are first
#reindexed by reindex() method: index 1 is positioned at
#0, 4 at 1 and so on. Then they are sorted by sort_index() method.
#By default it sorts in ascending order of index.

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
1    Dhoni   25     67
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     89
0   Sachin   26     87
Dataframe contents after sorting
      Name  Age  Score
0   Sachin   26     87
1    Dhoni   25     67
2    Virat   25     89
3    Rohit   24     55
4  Shikhar   31     47
Dataframe operations
Sorting
Sorting pandas dataframe by index in descending order:
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Sachin','Dhoni','Virat','Rohit','Shikhar']),
     'Age':pd.Series([26,25,25,24,31]),
     'Score':pd.Series([87,67,89,55,47])}
#Create a DataFrame
df = pd.DataFrame(d)
df=df.reindex([1,4,3,2,0])
print("Dataframe contents without sorting")
print (df)
df1=df.sort_index(ascending=False)
print("Dataframe contents after sorting")
print (df1)
#In above example dictionary object is used to create
#the dataframe. Elements of dataframe object df are first
#reindexed by reindex() method: index 1 is positioned at
#0, 4 at 1 and so on. Then they are sorted by sort_index() method.
#Passing ascending=False (0 also works) as argument gives descending order.

OUTPUT
Dataframe contents without sorting
      Name  Age  Score
1    Dhoni   25     67
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     89
0   Sachin   26     87
Dataframe contents after sorting
      Name  Age  Score
4  Shikhar   31     47
3    Rohit   24     55
2    Virat   25     89
1    Dhoni   25     67
0   Sachin   26     87
Dataframe operations
Indexing

An index is like an address; that's how any data point across
the dataframe or series can be accessed. Rows and
columns both have indexes: row indices are called the
index, and for columns they are simply the column names.
Indexing in pandas is used for selecting particular rows
and columns of data from a DataFrame. Indexing could
mean selecting all the rows and some of the columns,
some of the rows and all of the columns, or some of
each of the rows and columns. Indexing is also
known as Subset Selection.
Dataframe operations
Indexing e.g.
import pandas as pd
students = [ ('Mohak', 34, 'Sydney') ,('Freya', 30, 'Delhi' ) ,('Rajesh', 16, 'New York') ]
# Create a DataFrame object
dfObj = pd.DataFrame(students, columns = ['Name' , 'Age', 'City'],
                     index=['a', 'b', 'c'])
#Selecting a Single Row by Index label
rowData = dfObj.loc[ 'b' , : ]
print("Select a Single Row " , rowData , sep='\n')
print("Type : " , type(rowData))
#Selecting multiple Rows by Index labels
rowData = dfObj.loc[ ['c' , 'b'] , : ]
print("Select multiple Rows" , rowData , sep='\n')
#Select both Rows & Columns by Index labels
subset = dfObj.loc[ ['c' , 'b'] ,['Age', 'Name'] ]
print("Select both columns & Rows" , subset , sep='\n')
#Select a single column by Index Position
print(" Select column at index 2 ")
print( dfObj.iloc[ : , 2 ] )
#Select multiple columns by Index range (0:2 means positions 0 and 1)
print(" Select columns in column index range 0 to 2")
print(dfObj.iloc[:, 0:2])

OUTPUT
Select a Single Row 
Name    Freya
Age        30
City    Delhi
Name: b, dtype: object
Type :  <class 'pandas.core.series.Series'>
Select multiple Rows
     Name  Age      City
c  Rajesh   16  New York
b   Freya   30     Delhi
Select both columns & Rows
   Age    Name
c   16  Rajesh
b   30   Freya
 Select column at index 2 
a      Sydney
b       Delhi
c    New York
Name: City, dtype: object
 Select columns in column index range 0 to 2
     Name  Age
a   Mohak   34
b   Freya   30
c  Rajesh   16
Dataframe operations
Renaming Indexing e.g.

Index and column labels can be renamed using the rename method.
e.g.
import pandas as pd
df = pd.DataFrame({'A': [11, 21, 31],
                   'B': [12, 22, 32],
                   'C': [13, 23, 33]},
                  index=['ONE', 'TWO', 'THREE'])
df_new = df.rename(columns={'A': 'a'}, index={'ONE': 'one'})
print(df_new)
OUTPUT
        a   B   C
one    11  12  13
TWO    21  22  23
THREE  31  32  33
Pivoting - dataframe

DataFrame - It is a 2-dimensional data structure with columns of
different types. It is similar to a spreadsheet or SQL table, or a
dict of Series objects. It is generally the most commonly used
pandas object.

Pivot - Pivot reshapes data and uses unique values from index/
columns to form axes of the resulting dataframe. index is the column
name to use to make the new frame's index. columns is the column name
to use to make the new frame's columns. values is the column name to
use for populating the new frame's values.

Pivot table - Pivot table is used to summarize and aggregate data
inside a dataframe.
Pivoting - dataframe

Example of pivot:

DATAFRAME
ITEM   COMPANY    RUPEES   USD
TV     LG         12000    700
TV     VIDEOCON   10000    650
AC     LG         15000    800
AC     SONY       14000    750

PIVOT
COMPANY   LG      SONY    VIDEOCON
ITEM
AC        15000   14000   NaN
TV        12000   NaN     10000
Pivoting - dataframe

There are two functions available in pandas for pivoting a dataframe.

1. pivot()
2. pivot_table()

1. pivot() - This function is used to create a new derived table (pivot) from an
existing dataframe. It takes 3 arguments: index, columns, and values. As a
value for each of these parameters we need to specify a column name in the
original table (dataframe). Then the pivot function will create a new
table (pivot), whose row and column indices are the unique values of the
respective parameters. The cell values of the new table are taken from the column
given as the values parameter.
Pivoting - dataframe

#pivot() e.g. program

from collections import OrderedDict
from pandas import DataFrame
import pandas as pd
import numpy as np
table = OrderedDict((
("ITEM", ['TV', 'TV', 'AC', 'AC']),
('COMPANY',['LG', 'VIDEOCON', 'LG', 'SONY']),
('RUPEES', ['12000', '10000', '15000', '14000']),
('USD', ['700', '650', '800', '750'])
))
d = DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
p = d.pivot(index='ITEM', columns='COMPANY', values='RUPEES')
print("\n\nDATA OF PIVOT")
print(p)
print (p[p.index=='TV'].LG.values)
#pivot() creates a new table/DataFrame whose columns are the unique values
#in COMPANY and whose rows are indexed with the unique values of ITEM. The last statement
#of the above program returns the value for the TV item and LG company, i.e. 12000
Pivoting - dataframe

#Pivoting By Multiple Columns

Now suppose that, in the previous example, we want to pivot the values of both RUPEES and USD
together. We use the pivot function in the manner below, leaving out the values parameter so
that all remaining columns are pivoted.

p = d.pivot(index='ITEM', columns='COMPANY')

This will return the following pivot.

          RUPEES                     USD
COMPANY   LG      SONY    VIDEOCON   LG    SONY   VIDEOCON
ITEM
AC        15000   14000   NaN        800   750    NaN
TV        12000   NaN     10000      700   NaN    650
Pivoting - dataframe
#Common Mistake in Pivoting
The pivot method takes at least 2 column names as parameters - the index and the columns named
parameters. The problem is: what happens if we have multiple rows with the same values
for these columns? What will be the value of the corresponding cell in the pivoted table using the
pivot method? The following tables depict the problem:

ITEM   COMPANY    RUPEES   USD
TV     LG         12000    700
TV     VIDEOCON   10000    650
TV     LG         15000    800
AC     SONY       14000    750

COMPANY   LG                 SONY    VIDEOCON
ITEM
AC        NaN                14000   NaN
TV        12000 or 15000 ?   NaN     10000

d.pivot(index='ITEM', columns='COMPANY', values='RUPEES')
It throws an exception with the following message:
ValueError: Index contains duplicate entries, cannot reshape
Pivoting - dataframe
#Pivot Table
The pivot_table() method comes to solve this problem. It works like pivot, but it
aggregates the values from rows with duplicate entries for the specified columns.

ITEM   COMPANY    RUPEES   USD
TV     LG         12000    700
TV     VIDEOCON   10000    650
TV     LG         15000    800
AC     SONY       14000    750

COMPANY   LG                              SONY    VIDEOCON
ITEM
AC        NaN                             14000   NaN
TV        13500 = mean(12000,15000)       NaN     10000

d.pivot_table(index='ITEM', columns='COMPANY', values='RUPEES', aggfunc='mean')

(The RUPEES values must be stored as numbers, not as strings, for the mean to be calculated.)
In essence pivot_table is a generalisation of pivot, which allows you to aggregate multiple values with
the same destination in the pivoted table.
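The difference between pivot() and pivot_table() on duplicate entries can be reproduced with the short sketch below. It rebuilds the table with the duplicate TV/LG rows, storing RUPEES and USD as numbers (an assumption made here so that the mean can be calculated); the flag pivot_failed is a hypothetical name used only for this example.

```python
import pandas as pd

d = pd.DataFrame({
    'ITEM':    ['TV', 'TV', 'TV', 'AC'],
    'COMPANY': ['LG', 'VIDEOCON', 'LG', 'SONY'],
    'RUPEES':  [12000, 10000, 15000, 14000],   # numeric, so they can be averaged
    'USD':     [700, 650, 800, 750],
})

# pivot() cannot choose between the two TV/LG rows and raises ValueError
try:
    d.pivot(index='ITEM', columns='COMPANY', values='RUPEES')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table() aggregates the duplicates, here with the mean
p = d.pivot_table(index='ITEM', columns='COMPANY', values='RUPEES', aggfunc='mean')
print(p)
```

Other aggregations such as 'sum', 'min' or 'max' can be passed as aggfunc in the same way.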

Visit : python.mykvs.in for regular updates

Handling Missing Data

Filling the missing data Eg.

import pandas as pd
import numpy as np
raw_data = {'name': ['freya', 'mohak', 'rajesh'],
            'age': [42, np.nan, 36 ] }
df = pd.DataFrame(raw_data, columns = ['name', 'age'])
print(df)
df['age']=df['age'].fillna(0)
print(df)

OUTPUT
     name   age
0   freya  42.0
1   mohak   NaN
2  rajesh  36.0
     name   age
0   freya  42.0
1   mohak   0.0
2  rajesh  36.0

In above e.g. age of mohak is filled with 0
Note :- The dropna() function is used to remove missing values. Calling df.dropna()
before the fill (while the age of mohak is still NaN) will remove
the record of mohak
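The two options (filling versus dropping) can be compared side by side as below. This is a small sketch on the same data; the names filled and cleaned are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['freya', 'mohak', 'rajesh'],
                   'age': [42, np.nan, 36]})

filled = df.fillna(0)     # keeps all three rows, NaN becomes 0
cleaned = df.dropna()     # removes every row that has a missing value

print(filled)
print(cleaned)
```

Both methods return a new dataframe and leave df itself unchanged.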
Importing data from a MySQL database into a Pandas data frame
import mysql.connector as sql
import pandas as pd
db_connection = sql.connect(host='localhost', database='bank', user='root',
                            password='root')
db_cursor = db_connection.cursor()
db_cursor.execute('SELECT * FROM bmaster')
table_rows = db_cursor.fetchall()
#column_names gives the column headings of the query result
df = pd.DataFrame(table_rows, columns=db_cursor.column_names)
print(df)
db_connection.close()

OUTPUT
Will be the data available in table bmaster

Note :- for the mysql.connector library use the pip install mysql-connector-python command in
command prompt.
Pass the proper host name, database name, user name and password in the connect method.
Exporting data to a MySQL database from a Pandas data frame
import pandas as pd
from sqlalchemy import create_engine
#connection string format:
#mysql+mysqlconnector://<user name>:<password>@<server>/<database name>
engine = create_engine('mysql+mysqlconnector://root:root@localhost/bank')
lst = ['vishal', 'ram']
lst2 = [11, 22]
# Calling DataFrame constructor after zipping
# both lists, with columns specified
df = pd.DataFrame(list(zip(lst, lst2)),
                  columns =['Name', 'val'])
df.to_sql(name='bmaster', con=engine, if_exists = 'replace', index=False)

Note :- Create the dataframe as per the structure of the table. The to_sql() method is used to write
data from a dataframe to a mysql table. The third-party library sqlalchemy (installed separately, e.g.
with pip install sqlalchemy) is used for writing the data.
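Where no MySQL server is available, the export and import round trip can still be tried out. The sketch below is an assumption-laden stand-in: it swaps MySQL for an in-memory SQLite database from Python's built-in sqlite3 module, and reads the table back with read_sql_query(), which keeps the column names. The table name bmaster and the data are taken from the example above.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')   # stand-in for the MySQL connection

# Export: write the dataframe to a table
df = pd.DataFrame(list(zip(['vishal', 'ram'], [11, 22])),
                  columns=['Name', 'val'])
df.to_sql(name='bmaster', con=con, if_exists='replace', index=False)

# Import: read the table back into a new dataframe
read_back = pd.read_sql_query('SELECT * FROM bmaster', con)
print(read_back)

con.close()
```

With a real MySQL database, the sqlalchemy engine from the example above would be passed as con instead of the SQLite connection.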

No ratings yet
Essentials of Pediatric Nursing 3rd Edition Kyle Solution Manual Unlocked Test Bank
317 pages
"Robust Engineering": IE 361 Quality Control
No ratings yet
"Robust Engineering": IE 361 Quality Control
6 pages
CELPIP Path and Checklist PDF
No ratings yet
CELPIP Path and Checklist PDF
2 pages


