1 IP 12 NOTES PythonPandas 2022 PDF
Syllabus:
===============================================================================
Pandas is an open-source Python library that provides high-performance tools for data manipulation and
analysis using its powerful data structures. The name Pandas is derived from "panel data", an
econometrics term for multidimensional data.
The Matplotlib Python library, developed by John Hunter and many other contributors, is used to create
high-quality graphs, charts, and figures. The library is extensive and capable of changing very minute details
of a figure. Some basic concepts and functions provided in matplotlib are:
• Figure and axes: The entire illustration is called a figure and each plot on it is an axes (do not confuse
Axes with Axis).
• Plotting: The very first thing required to plot a graph is data. A dictionary of key-value pairs can be
declared, with keys and values as the x and y values. After that, scatter(), bar(), and pie(), along with
many other functions, can be used to create a plot.
• Axis: The figure and axes obtained using subplots() can be used for modification. Properties of the x-
axis and y-axis (labels, minimum and maximum values, etc.) can be changed using Axes.set().
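The figure, axes, and axis ideas above can be sketched in a few lines. This is only an illustration: the data values and the labels are made up for the example, and the Agg backend is chosen so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical data: keys are the x values, values are the y values
data = {1: 10, 2: 25, 3: 15, 4: 30}

fig, ax = plt.subplots()  # one figure holding one axes
ax.scatter(list(data.keys()), list(data.values()))

# Axes.set() changes several axis properties in one call
ax.set(xlabel="x", ylabel="y", xlim=(0, 5), ylim=(0, 40), title="Sample scatter")
print(ax.get_xlim(), ax.get_ylim())
```

In an interactive session, plt.show() would display the figure.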
Data Type of Pandas
• integer
• string
• float
• object
• Series: 1-D structure to store homogeneous (same data type) and mutable (can be modified/added)
data, but size of the series is immutable.
• DataFrame: 2-D structure to store heterogeneous (multiple data type) and mutable data.
• Panel: a 3-D way of storing data (removed from pandas in version 0.25).
Series
Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python
objects, etc.). The axis labels are collectively called index. For example, the following series is a collection of
integers 10, 23, 56,…
10 23 56 17 52 61 73 90 26 72
Key Points:
• Homogeneous data
• Size Immutable
• Values of Data Mutable
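The key points can be checked with a short sketch. It uses the first five integers from the series shown above; the replacement value 99 is arbitrary.

```python
import pandas as pd

s = pd.Series([10, 23, 56, 17, 52])
print(s.dtype)     # a single dtype describes every element

s.iloc[0] = 99     # values can be changed in place
print(s.iloc[0])
print(len(s))      # changing a value leaves the length unchanged
```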
Output:
Series([], dtype: float64)
Output:
The original list:
[10, 20, 30, 40]
The Series:
0 10
1 20
2 30
3 40
dtype: int64
With user given index:
100 10
200 20
300 30
400 40
dtype: int64
Output:
x=['a','b','c','d'] # List
import pandas as pd
import numpy as np
data = np.array(x) # Array created by NumPy
s = pd.Series(data) # Series
print (s)
Output:
0 a
1 b
2 c
3 d
dtype: object
Array with defined indexes:
Example:
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data, index=[100,101,102,103])
print (s)
Output:
100 a
101 b
102 c
103 d
dtype: object
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print (s)
Output:
a 0.0
b 1.0
c 2.0
dtype: float64
Example
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data, index=['b','c','d','a'])
print (s)
Output:
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
Observe: Index order is persisted and the missing element is filled with NaN (Not a Number).
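A small sketch of how the missing element can be detected and filled. The fill value 0.0 is an arbitrary choice for illustration, not something the notes prescribe.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])

print(s.isna())          # True only for label 'd', which has no value in data
filled = s.fillna(0.0)   # replace the missing value with 0.0
print(filled)
```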
Example:
import pandas as pd
data = {'a': "Africa", 'b': "Britain", 'c': "Canada", 'd': "Denmark"}
s = pd.Series(data) # Series
print (s)
Output:
a Africa
b Britain
c Canada
d Denmark
dtype: object
Example:
import pandas as pd
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print (s)
s1=pd.Series(data, index=['a', 'c', 'd'])
print(s1)
s2=pd.Series(data, index=['b','c','a','x'])
print(s2)
print(s.index)
print(s.dtype)
print(s.shape)
Output:
a 0.0
b 1.0
c 2.0
dtype: float64
a 0.0
c 2.0
d NaN
dtype: float64
b 1.0
c 2.0
a 0.0
x NaN
dtype: float64
Index(['a', 'b', 'c'], dtype='object')
float64
(3,)
Output:
0 15
1 15
2 15
3 15
4 15
dtype: int64
Example
Retrieve/access the first element. As we already know, the counting starts from zero for the array, which
means the first element is stored at zeroth position and so on.
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
#retrieve the first element
print (s[0])
Output:
1
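When the index holds labels such as 'a' to 'e', the position can also be stated explicitly. The sketch below uses iloc for positions and loc for labels; recent pandas releases are understood to warn about plain s[0] on a label-indexed Series, so this form may be the safer habit.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first_by_position = s.iloc[0]   # element at position 0
first_by_label = s.loc['a']     # element with label 'a'
print(first_by_position, first_by_label)
```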
Example
Retrieve/access the first three elements in the Series. If a colon (:) is placed in front of an index, all items
before that index are extracted. If two indexes are given (with : between them), the items between the two
indexes (not including the stop index) are extracted.
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e']) #retrieves the first three element
print (s[:3])
Output:
a 1
b 2
c 3
dtype: int64
Example
Retrieve (slicing) the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e']) #retrieve the last three element
print (s[-3:])
Output:
c 3
d 4
e 5
dtype: int64
Example
Retrieve a single element using index label value.
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e']) #retrieve a single element
print (s['a'])
Output:
1
Example
Retrieve multiple elements using a list of index label values.
import pandas as pd
s = pd.Series([1,2,3,4,5] ,index = ['a','b','c','d','e']) #retrieve multiple elements
print (s[ ['a','c','d'] ])
Output:
a 1
c 3
d 4
dtype: int64
Example 3
If a label is not contained, an exception is raised.
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
print (s['f']) # Raises KeyError, as 'f' is not present in the index
Output:
KeyError: 'f'
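If a missing label should not stop the program, the lookup can be guarded. A minimal sketch, where the default value -1 is an arbitrary choice for this example:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print('f' in s)           # membership test checks the index labels
value = s.get('f', -1)    # returns the default instead of raising
print(value)

try:
    s['f']
except KeyError as err:
    print("KeyError:", err)
```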
10 20
11 22
12 24
13 26
14 28
dtype: int64
Example:
import pandas as pd
s = pd.Series(range(2,15,2))
print(s)
s1=pd.Series(range(2,15,2), index=[10,11,12,13,14,16,17])
print(s1)
Output:
0 2
1 4
2 6
3 8
4 10
5 12
6 14
dtype: int64
10 2
11 4
12 6
13 8
14 10
16 12
17 14
dtype: int64
import pandas as pd
s = pd.Series(range(2,10,2))
print("#Prints the values of Series: ")
print(s)
s1=pd.Series(range(2,12,2), index=[10,11,12,13,14])
print("#Prints the values of Series with user index: ")
print(s1)
print("#Prints the values of Series with default index: ")
print(s[1:4])
print("#Prints the same values of Series with default index using iloc: ")
print(s.iloc[1:4])
print("#Empty result with iloc, as positions 11:14 do not exist: ")
print(s1.iloc[11:14])
print("#Prints the values of Series with user index using loc: ")
print(s1.loc[11:14]) # It prints all the values of the range (:) of labels (user index)
Output:
3 8
14 10
dtype: int64
#Prints the same values of Series with default index using iloc:
1 4
2 6
3 8
dtype: int64
import pandas as pd
s=pd.Series([1, 2, 3, 4, 5] , index=['a', 'b','c','d','e'])
print(s)
print()
print(s.iloc[1 : 4]) # for indexing or selecting based on position
print()
print(s.loc['b' : 'e']) # Label-based slice: the stop label 'e' is included
Output:
a 1
b 2
c 3
d 4
e 5
dtype: int64
b 2
c 3
d 4
dtype: int64
b 2
c 3
d 4
e 5
dtype: int64
import pandas as pd
s = pd.Series([11,22,33,44,55,66,77,88,99,100], index=[49,48,47,46,45, 1, 2, 3, 4, 5])
print(s.loc[:3]) # Prints the values till user's index
print()
print(s.loc[1:3]) # Prints the values till user's index
print()
print(s[:3]) # Prints the values till default index-1
print()
print(s[1:3]) # Prints the values till default index-1
print()
print(s.iloc[:3]) # Prints the values same as default index rules
print()
print(s.iloc[1:3])
Output:
-------
49 11
48 22
47 33
46 44
45 55
1 66
2 77
3 88
dtype: int64
1 66
2 77
3 88
dtype: int64
49 11
48 22
47 33
dtype: int64
49 11
48 22
47 33
dtype: int64
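The difference between label-based and position-based slicing in the example above can be condensed into a short sketch on the same Series:

```python
import pandas as pd

s = pd.Series([11, 22, 33, 44, 55, 66, 77, 88, 99, 100],
              index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])

by_label = s.loc[1:3]       # labels 1 through 3, stop label included
by_position = s.iloc[1:3]   # positions 1 and 2, stop position excluded
print(by_label.tolist())
print(by_position.tolist())
```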
To view a small sample of a Series or the DataFrame object, use the head() and the tail() methods.
head() returns the first n rows(observe the index values). The default number of elements to display is five,
but you may pass a custom number.
Example:
import pandas as pd
s = pd.Series([10,20,30,40,50], index=[1,2,3,4,5])
print(s)
print("Head=>")
print(s.head()) # top 5 rows by default
print(s.head(3)) # top 3 rows
Output:
1 10
2 20
3 30
4 40
5 50
dtype: int64
Head=>
1 10
2 20
3 30
4 40
5 50
dtype: int64
1 10
2 20
3 30
dtype: int64
Example:
import pandas as pd
import numpy as np
#Create a series with 4 random numbers
s = pd.Series(np.random.randn(4))
print ("The original series is:")
print (s)
print ("The first two rows of the data series:")
print(s.head(2))
Output:
The original series is:
0 0.720876
1 -0.765898
2 0.479221
3 -0.139547
dtype: float64
tail() returns the last n rows(observe the index values). The default number of elements to display is five, but
you may pass a custom number.
Example:
import pandas as pd
s = pd.Series([10,20,30,40,50], index=[1,2,3,4,5])
print(s)
print("Tail=> ")
print(s.tail()) # By default print 5 lowermost rows
print(s.tail(3)) # Prints 3 rows from bottom of the series
Output:
1 10
2 20
3 30
4 40
5 50
dtype: int64
Tail=>
1 10
2 20
3 30
4 40
5 50
dtype: int64
3 30
4 40
5 50
dtype: int64
Example:
Output:
The original series is:
0 -0.655091
1 -0.881407
2 -0.608592
3 -2.341413
dtype: float64
The Pandas Series.where() function replaces values where the input condition is False for the given Series object. It takes
another object as an input, which is used to replace those values in the original object.
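A minimal sketch of where(), assuming a Series of the multiples of 10 and the condition s > 50. Values that fail the condition become NaN unless a replacement is passed as the second argument (0 here is an arbitrary choice).

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

kept = s.where(s > 50)          # values failing the condition become NaN
replaced = s.where(s > 50, 0)   # or are replaced by the second argument
print(kept)
```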
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 60.0
6 70.0
7 80.0
8 90.0
9 100.0
dtype: float64
Example #2: Keep only the element whose value is 50
import pandas as pd
s = pd.Series([10,20,30,40,50,60,70,80,90,100])
a=s.where(s == 50)
print(a)
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 50.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
Output:
0 10.0
1 20.0
2 30.0
3 40.0
4 50.0
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
dtype: float64
0 2
1 4
2 6
3 8
dtype: int64
0 False
1 True
2 True
3 True
dtype: bool
1 4
2 6
3 8
dtype: int64
0 NaN
1 4.0
2 6.0
3 8.0
dtype: float64
Example #4: Use Series.where() function to replace values in the given Series object with some other value when
the passed condition is not satisfied.
import pandas as pd
# Creating the First Series
sr1 = pd.Series(['New York', 'Chicago', 'Toronto', 'Lisbon', 'Rio'])
sr1.index = ['City 1', 'City 2', 'City 3', 'City 4', 'City 5']
print(sr1)
Output:
# Series sr1
City 1 New York
City 2 Chicago
City 3 Toronto
City 4 Lisbon
City 5 Rio
dtype: object
# Series sr2
City 1 New York
City 2 Bangkok
City 3 London
City 4 Lisbon
City 5 Brisbane
dtype: object
# 'Brisbane' of sr2 at index City 5 has been replaced by 'Rio' of sr1 at the same index
City 1 New York
City 2 Bangkok
City 3 London
City 4 Lisbon
City 5 Rio
dtype: object
Example #5 : Use Series.where() function to replace values in the given Series object with some other value when
the passed condition is not satisfied.
import pandas as pd
sr1 = pd.Series([22, 18, 19, 20, 21], index = ['Student 1', 'Student 2', 'Student 3', 'Student 4', 'Student 5'])
print(sr1)
print()
Output:
Student 1 22
Student 2 18
Student 3 19
Student 4 20
Student 5 21
dtype: int64
Student 1 19
Student 2 16
Student 3 22
Student 4 20
Student 5 18
dtype: int64
#Replacing the values of sr2 by sr1 where the corresponding value of sr1 is greater than 20
Student 1 22
Student 2 16
Student 3 22
Student 4 20
Student 5 21
dtype: int64
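The code that builds sr2 and applies where() is missing from this example. The sketch below is a reconstruction under the assumption that sr2 holds the second set of marks shown in the output; it reproduces the final result, but the exact call used originally is not known.

```python
import pandas as pd

labels = ['Student 1', 'Student 2', 'Student 3', 'Student 4', 'Student 5']
sr1 = pd.Series([22, 18, 19, 20, 21], index=labels)
sr2 = pd.Series([19, 16, 22, 20, 18], index=labels)   # assumed from the output

# Keep sr2 where sr1 <= 20; elsewhere (sr1 > 20) take the value from sr1
result = sr2.where(sr1 <= 20, sr1)
print(result)
```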
Mathematical Operations:
+ add()
- sub(), subtract()
* mul(), multiply()
/ div(), divide()
// floordiv()
% mod()
** pow()
Example:
import pandas as pd
s1 = pd.Series([1,2,3,4])
s2 = pd.Series([10,20,30,40])
print (s1)
print (s2)
print("ADD:", s2+s1) # print(s2.add(s1))
print("SUB:", s2-s1) # print(s2.sub(s1))
print("MUL:", s2*s1) # print(s2.multiply(s1))
print("DIV:", s2/s1) # print(s2.div(s1))
print("F.DIV:", s2//s1) # print(s2.floordiv(s1))
print("MOD:", s2%s1) # print(s2.mod(s1))
print("POW:", s2**s1) # print(s2.pow(s1))
Output:
0 1
1 2
2 3
3 4
dtype: int64
0 10
1 20
2 30
3 40
dtype: int64
ADD:
0 11
1 22
2 33
3 44
dtype: int64
SUB:
0 9
1 18
2 27
3 36
dtype: int64
MUL:
0 10
1 40
2 90
3 160
dtype: int64
DIV:
0 10.0
1 10.0
2 10.0
3 10.0
dtype: float64
F.DIV:
0 10
1 10
2 10
3 10
dtype: int64
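The method forms (add(), sub(), and so on) accept a fill_value argument, which the operators do not. A brief sketch with two hypothetical Series whose indexes do not fully match:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['a', 'b'])

plain = s1 + s2                     # 'c' has no partner in s2, so the result is NaN
filled = s1.add(s2, fill_value=0)   # the missing partner is treated as 0
print(plain)
print(filled)
```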
import pandas as pd
s = pd.Series(range(1 , 15 , 3), index=[x for x in 'abcde'])
print (s)
Output:
a 1
b 4
c 7
d 10
e 13
dtype: int64
###
DataFrame
DataFrame is a two-dimensional array with heterogeneous data, like a table with rows and columns.
A DataFrame can be created from inputs such as:
• Lists
• dict
• Series
• Numpy ndarrays
• Another DataFrame
The table represents the data of a team of an organization with their overall performance rating. The data is
represented in rows and columns. Each column represents an attribute and each row represents a person.
Column Type
Name String
Age Integer
Gender String
Rating Float
Example:
Empty DataFrame
Columns: []
Index: []
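The output above is what an empty DataFrame prints. A minimal sketch that would produce it, assuming the constructor is called with no arguments:

```python
import pandas as pd

df = pd.DataFrame()   # no data, no columns, no index
print(df)
print(df.empty)       # True when the DataFrame holds no elements
```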
2. Create a DataFrame from Lists:
Example:
import pandas as pd
df = pd.DataFrame([10, 20, 30, 40, 50])
print (df)
Output:
0
0 10
1 20
2 30
3 40
4 50
import pandas as pd
df = pd.DataFrame([ [1, 2, 3, 4, 5] , [10, 20, 30, 40, 50] ])
print (df)
Output:
0 1 2 3 4
0 1 2 3 4 5
1 10 20 30 40 50
Output:
Class 90% Score
0 XII 101
1 XI 201
2 X 301
Example:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data , columns=['Name' , 'Age'])
df['Age'] = df['Age'].astype(float) # Convert only the numeric column to float
print (df)
Output:
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
All the ndarrays must be of the same length. If an index is passed, its length should be equal to the
length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length.
Example:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(d)
print (df)
Output:
col1 col2
0 1 3
1 2 4
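The length rule can be observed directly. In this sketch the mismatched lists and the index labels 'r1' and 'r2' are made up for illustration.

```python
import pandas as pd

try:
    pd.DataFrame({'col1': [1, 2], 'col2': [3, 4, 5]})   # lengths 2 and 3
except ValueError as err:
    print("ValueError:", err)

# With equal lengths, an index of the same length can be supplied
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['r1', 'r2'])
print(df)
```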
Example
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'] , 'Age':[28,34,29,42]}
x = pd.DataFrame(data)
print(x)
Output:
Name Age
0 Tom 28
1 Jack 34
2 Steve 29
3 Ricky 42
Example:
import pandas as pd
nme = ["aparna", "pankaj", "sudhir", "Geeku"]
deg = ["MBA", "BCA", "M.Tech", "MBA"]
scr = [90, 40, 80, 98]
data = {'Name': nme, 'Degree': deg, 'Score': scr} # Avoids shadowing the built-in dict
df = pd.DataFrame(data)
print(df)
Output:
Example:
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1' , 'rank2' , 'rank3' , 'rank4'])
print (df)
Output:
Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'] ,
'Height': [5.1 , 6.2 , 5.1 , 5.2] ,
'Qualification': ['Msc' , 'MA' , 'Msc' , 'Msc']
}
df = pd.DataFrame(data)
print(df)
print("") # Print/insert a blank line
df1 = pd.DataFrame(data, index=['one', 'two', 'three', 'four']) # Assigning index
print(df1)
Output:
Name Height Qualification
0 Jai 5.1 Msc
1 Princ 6.2 MA
2 Gaurav 5.1 Msc
3 Anuj 5.2 Msc
5.1 Example:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000],
'Year': [2015,2013,2018,2018]
}
df = pd.DataFrame(cars, columns= ['Brand', 'Price','Year'])
print (df)
Output:
Brand Price Year
0 Honda Civic 22000 2015
1 Toyota Corolla 25000 2013
2 Ford Focus 27000 2018
3 Audi A4 35000 2018
5.2 Example:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000],
'Year': [2015,2013,2018,2018]
}
df = pd.DataFrame(cars, columns= ['Brand', 'Price','Year'])
sd=df.sort_values(by=['Brand'], ascending=False) # Descending order
print (sd)
Output:
5.3 Example:
import pandas as pd
cars = {'Brand': ['Honda Civic' , 'Toyota Corolla', 'Ford Focus' , 'Audi A4'],
'Price': [22000, 25000, 27000, 35000],
'Year': [2015, 2013, 2018, 2018]
}
df = pd.DataFrame(cars, columns= ['Brand', 'Price' , 'Year'])
sd=df.sort_values(by=['Price'], ascending=False) # Descending order
print (sd)
Output:
5.4 Example:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'Price': [22000,25000,27000,35000],
'Year': [2015,2013,2018,2018]
}
df = pd.DataFrame(cars, columns= ['Brand', 'Price','Year'])
sd=df.sort_values(by=['Year'], ascending=True) # Ascending order
print (sd)
Output:
Brand Price Year
1 Toyota Corolla 25000 2013
0 Honda Civic 22000 2015
2 Ford Focus 27000 2018
3 Audi A4 35000 2018
6. Renaming a column name in DataFrame
import pandas as pd
L1=[10,30,50,70,90]
print(L1) # Prints List
df=pd.DataFrame(L1)
print(df) #Prints DataFrame
df.columns=['Code'] # Renaming column
print(df)
Output:
[10 , 30 , 50 , 70 , 90]
0
0 10
1 30
2 50
3 70
4 90
Code
0 10
1 30
2 50
3 70
4 90
6.2 Renaming column using function rename()
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df)
df.rename(columns={"A": "a", "B": "c"}, inplace=True)
print(df)
Output:
a c
0 1 4
1 2 5
2 3 6
Note: When inplace = True , the data is modified in place, which means it will return nothing and the
dataframe is now updated. When inplace = False , which is the default, then the operation is performed
and it returns a copy of the object. You then need to save it to something.
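The note on inplace can be illustrated with a short sketch on the same DataFrame. The variable names renamed and result are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

renamed = df.rename(columns={"A": "a"})   # default inplace=False: returns a new DataFrame
print(list(df.columns), list(renamed.columns))

result = df.rename(columns={"A": "a"}, inplace=True)   # modifies df itself, returns None
print(result, list(df.columns))
```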
6.3 Example (Get new column names):
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'] ,
'Height': [5.1 , 6.2 , 5.1 , 5.2] ,
'Qualification': ['Msc' , 'MA' , 'Msc' , 'Msc']}
df = pd.DataFrame(data, index=['one', 'two', 'three', 'four'])
print(df)
df.columns=['N. Name' , 'N. Height', 'N. Qualification'] # New column names, in the existing column order
print(df)
Output:
Name Height Qualification
one Jai 5.1 Msc
two Princ 6.2 MA
three Gaurav 5.1 Msc
four Anuj 5.2 Msc
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'] ,
'Height': [5.1 , 6.2 , 5.1 , 5.2] ,
'Qualification': ['Msc' , 'MA' , 'Msc' , 'Msc']}
df = pd.DataFrame(data)
df.rename(columns={'Qualification' : 'Degree'} , inplace=True) # Renaming a specific column
print(df)
Output:
7.1 Example:
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'] ,
'Height': [5.1 , 6.2 , 5.1 , 5.2] ,
'Qualification': ['Msc' , 'MA' , 'Msc' , 'Msc']
}
df = pd.DataFrame(data)
print(df) # Prints the DataFrame
print("")
addr = ['Delhi', 'Bangalore', 'Chennai', 'Patna'] # Declare a list
df['New Address'] = addr # Adds the list as a new column
print(df)
Output:
import pandas as pd
x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
print(x)
print()
y = pd.Series([1, 2, 3])
print(y)
print()
new_x = x.add(y, axis=0) # axis=0: y is aligned with the row labels, so y[i] is added to every value in row i
print(new_x)
print()
new_y = x.add(y, axis=1) # axis=1: y is aligned with the column labels, so y[j] is added to every value in column j
print(new_y)
Output:
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
0 1
1 2
2 3
dtype: int64
0 1 2
0 2 5 8
1 4 7 10
2 6 9 12
0 1 2
0 2 6 10
1 3 7 11
2 4 8 12
7.3 Example: Binary operation of DataFrame with DataFrame row/column wise: addition
import pandas as pd
x = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
y = pd.DataFrame({0: [1,2,3], 1: [4,5,6], 2: [7,8,9] })
print(x)
print()
print(y)
print()
x1 = x.add(y, axis=0) # Adding two DataFrames element-wise; both row and column labels are aligned
print(x1)
print()
y1 = x.add(y, axis=1) # Same result as axis=0 when both operands are DataFrames
print(y1)
Output:
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
0 1 2
0 2 8 14
1 4 10 16
2 6 12 18
0 1 2
0 2 8 14
1 4 10 16
2 6 12 18
Output:
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
Addition:
0 1 2
0 2 8 14
1 4 10 16
2 6 12 18
Subtraction:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
Multiplication:
0 1 2
0 1 16 49
1 4 25 64
2 9 36 81
Division:
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
Output:
0 1 2
0 1 4 7
1 2 5 8
2 3 6 9
Addition:
0 1 2
0 2 8 14
1 4 10 16
2 6 12 18
Subtraction:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
Multiplication:
0 1 2
0 1 16 49
1 4 25 64
2 9 36 81
Division:
0 1 2
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
8.1 Example:
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(data)
x1=df[df.columns[1 : 4]]
print(x1) # From 2nd col to 4th col
print("")
x2=df[df.columns[1:3]] # From 2nd col to 3rd col
print(x2)
x3=df[df.columns[:]]
print(x3) # All columns
print("")
x4=df[df.columns[:2]]
print(x4) # From 1st col to 2nd col
Output:
Age Address Qualification
0 27 Delhi Msc
1 24 Kanpur MA
2 22 Allahabad MCA
3 32 Kannauj Phd
Age Address
0 27 Delhi
1 24 Kanpur
2 22 Allahabad
3 32 Kannauj
Name Age
0 Jai 27
1 Princ 24
2 Gaurav 22
3 Anuj 32
import pandas as pd
# import numpy as np1
raw_data1 = { 'name': ['freya', 'mohak'],
'age': [10, 1],
'favorite_color': ['pink', 'blue'],
'grade': [88, 92]}
df1 = pd.DataFrame(raw_data1, columns = ['name', 'age', 'favorite_color', 'grade'])
for index, row in df1.iterrows(): # Reads data row-wise; each step gives the index and the row as a Series
print (row["name"], row["age"])
Output:
freya 10
mohak 1
Output:
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'] ,
'Height': [5.1 , 6.2 , 5.1 , 5.2] ,
'Qualification': ['Msc' , 'MA' , 'Msc' , 'Msc']
}
df = pd.DataFrame(data)
print(df) # Prints the DataFrame
print("")
df.loc[len(df.index)] = ['Amit', 5.5, 'PhD']
# df.loc places the values of the new row at the given index label
# len(df.index) returns the number of rows, i.e. last index+1, which is the new index
print(df)
df.iloc[x:y , a:b]
11. Example:
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(data)
x1=df.iloc[ : , 1 : 4] # First part is range of rows , Second part is range of columns
print(x1)
x2=df.iloc[1:2 , 1:4]
print(x2)
x3=df.iloc[1: , 1:]
print(x3)
x4=df.iloc[: , :]
print(x4)
Output:
import pandas as pd
d={ 'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df=pd.DataFrame(d)
print(df)
print()
df['three']=pd.Series([10,20,30],index=['a','b','c']) # Adding a new column
print(df)
print()
df['four']=df['one']+df['three'] # Adding a new column made by sum of other columns
print(df)
print()
print (df['one']) # Displaying a column data
Output:
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
10.1 Example:
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Princ 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(data)
del df['Name'] # Removes / deletes field
print(df)
Output:
Age Address Qualification
0 27 Delhi Msc
1 24 Kanpur MA
2 22 Allahabad MCA
3 32 Kannauj Phd
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(data)
df.pop('Age')
print(df)
Output:
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(data)
df1= df.drop("Address", axis=1) # Axis=1 is column
print(df1)
Output:
Name Age Qualification
0 Jai 27 Msc
1 Princ 24 MA
2 Gaurav 22 MCA
3 Anuj 32 Phd
import pandas as pd
data = { 'Name': ['Jai', 'Princ', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
df = pd.DataFrame(data)
df2= df.drop([1,2], axis=0) # Axis=0 is Row
print(df2)
Output:
Name Age Address Qualification
0 Jai 27 Delhi Msc
3 Anuj 32 Kannauj Phd
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df1 = pd.concat([df1, df2]) # DataFrame.append() was removed in pandas 2.0
print (df1)
Output:
a b
0 1 2
1 3 4
0 5 6
1 7 8
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Output:
Name Age Address Qualification
0 Jai 27 Nagpur Msc
1 Princi 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannuaj Phd
0 Abhi 17 Nagpur Btech
1 Ayushi 14 Kanpur B.A
2 Dhiraj 12 Allahabad Bcom
3 Hitesh 52 Kannuaj B.hons
[8 rows x 4 columns]
Name Age Address Qualification Name Age Address Qualification
0 Jai 27 Nagpur Msc Abhi 17 Nagpur Btech
1 Princi 24 Kanpur MA Ayushi 14 Kanpur B.A
2 Gaurav 22 Allahabad MCA Dhiraj 12 Allahabad Bcom
3 Anuj 32 Kannuaj Phd Hitesh 52 Kannuaj B.hons
[4 rows x 8 columns]
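The two outputs above look like row-wise and column-wise concatenation of df1 and df2, although the code that produced them is not shown. A sketch under that assumption, using pd.concat() and two small hypothetical frames in place of data1 and data2:

```python
import pandas as pd

# Small hypothetical frames standing in for data1 and data2
df1 = pd.DataFrame({'Name': ['Jai', 'Anuj'], 'Age': [27, 32]})
df2 = pd.DataFrame({'Name': ['Abhi', 'Hitesh'], 'Age': [17, 52]})

rows = pd.concat([df1, df2])            # axis=0 (default): stack rows, original indexes kept
cols = pd.concat([df1, df2], axis=1)    # axis=1: place the frames side by side
renumbered = pd.concat([df1, df2], ignore_index=True)   # fresh index 0..n-1
print(rows.shape, cols.shape)
```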
NOTE:
• axis=0 acts on all the ROWS in each COLUMN
• axis=1 acts on all the COLUMNS in each ROW
• by default axis=0
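The axis rule in the note can be seen with sum(), using the 3 x 3 DataFrame from the earlier examples:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]})

col_totals = df.sum(axis=0)   # default: one total per column, over all rows
row_totals = df.sum(axis=1)   # one total per row, over all columns
print(col_totals.tolist())
print(row_totals.tolist())
```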
Merging data within DataFrames using merge():
import pandas as pd
x = pd.DataFrame({'id':[1,2],'Name': ['anil', 'vishal'], 'subject_id':['sub1','sub2']})
y = pd.DataFrame({'id':[1,2],'Name': ['sumer', 'salil'], 'subject_id':['sub2','sub4']})
print(pd.merge(x , y , on='id'))
Output:
id Name_x subject_id_x Name_y subject_id_y
0 1 anil sub1 sumer sub2
1 2 vishal sub2 salil sub4
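merge() also takes a how argument that decides which keys are kept. The sketch below reuses x and y but merges on subject_id, a choice made here only to show rows that do and do not match.

```python
import pandas as pd

x = pd.DataFrame({'id': [1, 2], 'Name': ['anil', 'vishal'], 'subject_id': ['sub1', 'sub2']})
y = pd.DataFrame({'id': [1, 2], 'Name': ['sumer', 'salil'], 'subject_id': ['sub2', 'sub4']})

inner = pd.merge(x, y, on='subject_id', how='inner')   # only keys present in both frames
outer = pd.merge(x, y, on='subject_id', how='outer')   # all keys, NaN where a side is missing
print(inner)
print(outer)
```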
import pandas as pd
data = {
'Name': ['Hafza', 'Srikanth', 'Rakesh'],
'Age': [19, 20, 19]
}
df = pd.DataFrame(data, index = [True, False, True]) # Creating a DataFrame with boolean index
print(df)
print("") #Prints a blank line
print(df.loc[True])
print("")
print(df.iloc[1])
Output:
Name Age
True Hafza 19
False Srikanth 20
True Rakesh 19
Name Age
True Hafza 19
True Rakesh 19
Name Srikanth
Age 20
Selecting data from DataFrame with boolean indexing:
import pandas as pd
data = {'name':['Mohak', "Freya", "Roshni"],
'degree': ["MBA", "BCA", "M.Tech"],
'score':[90, 40, 80]}
# creating a dataframe with boolean index
df = pd.DataFrame(data, index = [True, False, True])
# accessing a dataframe using .loc[] function
print(df.loc[True]) #it will return rows of Mohak and Roshni only(matching true only)
Output:
name degree score
True Mohak MBA 90
True Roshni M.Tech 80
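Boolean selection is more often done with a condition on a column than with a boolean index. A sketch on the same data, where the threshold 50 is an arbitrary choice:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Mohak', 'Freya', 'Roshni'],
                   'degree': ['MBA', 'BCA', 'M.Tech'],
                   'score': [90, 40, 80]})

mask = df['score'] > 50   # boolean Series: True, False, True
high = df[mask]           # keeps only the rows where the mask is True
print(high)
```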
Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric
python packages. Pandas is one of those packages and makes importing and analyzing data much easier.
Pandas series is a One-dimensional ndarray with axis labels. The labels need not be unique but must be a hashable
type. The object supports both integer- and label-based indexing and provides a host of methods for performing
operations involving the index.
The Pandas Series.where() function replaces values where the input condition is False for the given Series object. It takes
another object as an input, which is used to replace those values in the original object.
Parameters :
cond : boolean NDFrame, array-like, or callable
other : scalar, NDFrame, or callable
inplace : boolean, default False
axis : int, default None
level : int, default None
errors : str, {‘raise’, ‘ignore’}, default raise
try_cast : boolean, default False
Example #1: Use Series.where() function to replace values in the given Series object with some other value when the
passed condition is not satisfied.
# importing pandas as pd
import pandas as pd
Example #2 : Use Series.where() function to replace values in the given Series object with some other value when
the passed condition is not satisfied.
# importing pandas as pd
import pandas as pd
A CSV is a comma-separated values file, which allows data to be saved in a tabular format. CSVs look like a
garden-variety spreadsheet but with a .csv extension. CSV files can be used with most any spreadsheet
program, such as Microsoft Excel or Google Spreadsheets.
emp1.xlsx
Emp ID Emp Name Emp Role
1 Pankaj Kumar Admin
2 David Lee Editor
3 Lisa Ray Author
emp1.csv
Emp ID, Emp Name, Emp Role
1, Pankaj Kumar, Admin
2, David Lee, Editor
3, Lisa Ray, Author
import pandas as pd
df = pd.read_csv('c:/mydata/class12/emp1.csv')
print(df)
Output:
Emp ID Emp Name Emp Role
0 1 Pankaj Kumar Admin
1 2 David Lee Editor
2 3 Lisa Ray Author
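The emp1.csv listing has a space after each comma. A sketch of reading such text with skipinitialspace=True so that the column names and values come out without leading spaces; io.StringIO stands in for the file here so that the example does not depend on a path.

```python
import io
import pandas as pd

csv_text = (
    "Emp ID, Emp Name, Emp Role\n"
    "1, Pankaj Kumar, Admin\n"
    "2, David Lee, Editor\n"
    "3, Lisa Ray, Author\n"
)

# skipinitialspace=True drops the spaces that follow each comma
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df.columns.tolist())
print(df)
```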
import pandas as pd
# Reading data from the .csv file
df = pd.read_csv('emp1.csv')
x=[50000 , 55000, 60000]
df['Salary'] = x
print(df)
Output:
Emp ID Emp Name Emp Role Salary
0 1 Pankaj Kumar Admin 50000
1 2 David Lee Editor 55000
2 3 Lisa Ray Author 60000
df.to_csv('emp2.csv', index=True)
df.to_csv('emp2.csv', index=False) # Without indexing
Output: (index=True)
Emp ID Emp Name Emp Role Salary
0 0 1 Pankaj Kumar Admin 50000
1 1 2 David Lee Editor 55000
2 2 3 Lisa Ray Author 60000
Output: (index=False)
Emp ID Emp Name Emp Role Salary
0 1 Pankaj Kumar Admin 50000
1 2 David Lee Editor 55000
2 3 Lisa Ray Author 60000
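The effect of the index argument shows up when the written file is read back. This sketch writes to a string instead of emp2.csv and uses a cut-down hypothetical DataFrame:

```python
import io
import pandas as pd

df = pd.DataFrame({'Emp ID': [1, 2], 'Salary': [50000, 55000]})

# to_csv() without a path returns the CSV text
with_index = pd.read_csv(io.StringIO(df.to_csv(index=True)))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(with_index.columns.tolist())      # the saved index returns as an extra unnamed column
print(without_index.columns.tolist())
```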
Output:
Name Age Address Qualification
0 Jai 27 Delhi Msc
1 Prince 24 Kanpur MA
2 Gaurav 22 Allahabad MCA
3 Anuj 32 Kannauj Phd
Output:
32
Output:
22
13.3 Example: count() # It counts the number of values present in the column
import pandas as pd
x = { 'Name': ['Jai', 'Prince', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd'] }
df = pd.DataFrame(x)
print(df['Age'].count())
Output:
4
13.4 Example: sum() # Finds the total/addition of the values of the column
import pandas as pd
x = { 'Name': ['Jai', 'Prince', 'Gaurav', 'Anuj'],
'Age': [27, 24, 22, 32],
'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd'] }
df = pd.DataFrame(x)
print(df['Age'].sum())
Output:
105
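The column summaries in this section can be gathered in one sketch. It assumes that max() and min() are the calls behind the outputs 32 and 22 shown earlier, since the code for those two examples is missing.

```python
import pandas as pd

x = {'Name': ['Jai', 'Prince', 'Gaurav', 'Anuj'],
     'Age': [27, 24, 22, 32]}
df = pd.DataFrame(x)
ages = df['Age']

print(ages.max())     # largest value
print(ages.min())     # smallest value
print(ages.count())   # number of non-missing values
print(ages.sum())     # total of the values
print(ages.mean())    # average of the values
```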
import pandas as pd
import numpy as np
data = np.array([54,76,88,99,34])
s1 = pd.Series(data,index=['a','b','c','d','e'])
print (s1)
s2=s1.rename(index={'a':0, 'b':1})
print("After renaming index labels:\n", s2)
OUTPUT
a 54
b 76
c 88
d 99
e 34
dtype: int32
0 54
1 76
c 88
d 99
e 34
dtype: int32
import pandas as pd
import numpy as np
table = {"name": ['vishal', 'anil', 'mayur', 'viraj','mahesh'],
'age':[15, 16, 15, 17,16],
'weight': [51, 48, 49, 51,48],
'height': [5.1, 5.2, 5.1, 5.3,5.1],
}
d = pd.DataFrame(table)
print("DATA OF DATAFRAME")
print(d)
print("DATA OF DATAFRAME AFTER REINDEX")
df=d.reindex([2,1, 0,4,3])
print(df)
Output:
DATA OF DATAFRAME
name age weight height
0 vishal 15 51 5.1
1 anil 16 48 5.2
2 mayur 15 49 5.1
3 viraj 17 51 5.3
4 mahesh 16 48 5.1
Output:
DATA OF DATAFRAME=>
name age weight height
0 vishal 15 51 5.1
1 anil 16 48 5.2
2 mayur 15 49 5.1
3 viraj 17 51 5.3
4 mahesh 16 48 5.1
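reindex() rearranges rows to follow the given labels, and a label that does not exist produces a row of NaN. A sketch with a shortened, hypothetical version of the table; the label 5 is deliberately absent.

```python
import pandas as pd

d = pd.DataFrame({'name': ['vishal', 'anil', 'mayur'], 'age': [15, 16, 15]})

reordered = d.reindex([2, 1, 0])   # same rows in a new order
extended = d.reindex([0, 1, 5])    # label 5 is missing, so its row is filled with NaN
print(reordered)
print(extended)
```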
17.1 Histogram:
Histogram in Python:
Drawing a histogram in Python is very easy: it takes only 3-4 lines of code. The
complexity arises when we try to visualize live data.
To draw histogram in python following concepts must be clear.
Title: To display heading of the histogram.
Color : To show the colour of the bar.
Axis: y-axis and x-axis.
Data: The data can be represented as an array.
Height and width of bars. This is determined based on the analysis.
The width of each bar is called a bin or interval. Default number of bins = 10
Border colour: To display border colour of the bar.
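What a histogram counts can be checked numerically before drawing it. The sketch below uses the AvMark values listed later in these notes and np.histogram(); the bin edges 60 to 100 are an assumption made for the example.

```python
import numpy as np

marks = [90, 95, 95, 93, 94, 78, 69, 85, 74, 86, 75, 79, 98]

# Each bin covers [low, high); only the last bin also includes its upper edge
counts, edges = np.histogram(marks, bins=[60, 70, 80, 90, 100])
print(counts.tolist())
print(edges.tolist())
```

The counts are the bar heights that plt.hist() would draw for the same data and bins.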
Matplotlib is a Python package/library used to create 2D graphs and plots from Python
scripts. pyplot is a module in matplotlib that supports a very wide variety of graphs and plots, namely
histograms, bar charts, power spectra, error charts, etc. It is used along with NumPy to provide an
open-source alternative to MATLAB.
Pyplot provides the state-machine interface to the plotting library in matplotlib. This means that figures
and axes are implicitly and automatically created to achieve the desired plot. For example, calling plot
from pyplot will automatically create the necessary figure and axes. Setting
a title will then automatically set that title on the current axes object. The pyplot interface is generally
preferred for non-interactive plotting (i.e., scripting).
import pandas as pd
AvMark
0 90
1 95
2 95
3 93
4 94
5 78
6 69
7 85
8 74
9 86
10 75
11 79
12 98
17.2.1a Example: Plot a Histogram to show the various frequencies (bin with distribution) of given marks
17.2.1b Example:
import pandas as pd
import matplotlib.pyplot as plt
x = {'Age': [27, 24, 22, 32, 33, 32], 'Points': [3,5,7, 9, 7, 9] }
df = pd.DataFrame(x)
hist = df.hist()
plt.show()
17.1.2 Example:
import pandas as pd
import matplotlib.pyplot as plt
data={ 'length': [15, 5, 12, 12, 12, 5], 'width': [7, 2, 15, 2, 5, 7] }
df = pd.DataFrame(data)
hist = df.hist(bins=3) # Bin is 3
plt.show()
17.1.3 Example:
import pandas as pd
import matplotlib.pyplot as plt
data={'length': [15, 5, 12, 12, 12, 5], 'width': [7, 2, 15, 2, 5, 7] }
df = pd.DataFrame(data)
hist = df.hist(bins=5) # Bin is 5
plt.show()
17.1.4 Example:
import pandas as pd
import matplotlib.pyplot as plt
data={'length': [15, 5, 12, 12, 12, 5], 'width': [7, 2, 15, 2, 5, 7] }
df = pd.DataFrame(data)
hist = df.hist(bins=10) # Bin is 10
plt.show()
Syntax:
import matplotlib.pyplot as plt
x = [value1, value2, value3,. .. ]
plt.hist(x, bins = number of bins)
plt.show()
17.1.7 Example:
import matplotlib.pyplot as plt
wt=[43.1 , 36.6, 37.6, 36.5, 45.3, 43.5, 40.3, 50.2, 47.3, 31.2, 42.2, 45.5, 30.3, 31.4, 35.6,
45.2, 54.1, 45.6, 36.5, 43.1]
plt.hist(wt, 10)
plt.show()
17.1.9 Example:
Example: With customization (Label, Title, Font size, Edge colour, Face colour.)
import numpy as np
import matplotlib.pyplot as plt
plt.hist([5,15,25,35,45,55], bins=[0,10,20,30,40,50, 60], weights=[20,10,45,33,6,8], edgecolor="red")
plt.show()
# A bin shows no bar when the first argument (list) of hist() has no value inside it. For example, with
# [5,15,25,35,15,55] as the first argument the interval 40 to 50 is empty, while the interval 10 to 20
# has height 16 (weights 10 and 6 are added) because 15 appears twice.
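The effect of weights can be verified without drawing anything. This sketch feeds np.histogram() the list in which 15 appears twice, together with the same bins and weights as the example:

```python
import numpy as np

values = [5, 15, 25, 35, 15, 55]
weights = [20, 10, 45, 33, 6, 8]
bins = [0, 10, 20, 30, 40, 50, 60]

# Each value adds its weight (instead of 1) to the bin it falls in
heights, edges = np.histogram(values, bins=bins, weights=weights)
print(heights.tolist())
```

The bin 10 to 20 collects both weights 10 and 6, and the bin 40 to 50 stays at 0.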
Example (Histogram Type 'step'):
import numpy as np
import matplotlib.pyplot as plt
data = [1,11,21,31,41]
plt.hist([5,15,25,35,15, 55], bins=[0,10,20,30,40,50, 60], weights=[20,10,45,33,6,8],
edgecolor="red", histtype='step') #plt.hist(data, bins=20, histtype='step')
plt.xlabel('Value')
plt.ylabel('Probability')
plt.title('Histogram')
plt.show()
17.2.1 Example:
17.2.2 Example:
Example:
Example:
import numpy as np
import matplotlib.pyplot as plt
x=np.arange(1,5,1)
plt.plot(x, 'r') # ‘r’ makes red colour of the line generated according to value of x
plt.plot(x+1, 'y') # ‘y’ makes yellow colour of the line according to value of x+1
plt.plot(x+2, 'b') # ‘b’ makes blue colour of the line according to value of x+2
plt.show()
X=[1, 2, 3, 4]
X+1 = [2 , 3, 4, 5]
X+2= [3 , 4, 5, 6]
Note:
arange(a,b,c): arange() function generates values from starting value(a) up to before stop value (b)
incremented by third value(c) which is optional. In the above example, [1, 2, 3, 4] values will be
generated for x variable, where initial value is 1, final is 5 and increment value is 1.
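The values described in the note can be printed directly. The last call, with a step of 3, is an extra illustration not taken from the example above.

```python
import numpy as np

x = np.arange(1, 5, 1)   # start 1, stop before 5, step 1
print(x.tolist())
print((x + 1).tolist())  # arithmetic applies to every element
print((x + 2).tolist())

stepped = np.arange(1, 10, 3)   # step of 3
print(stepped.tolist())
```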
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with
heights or lengths proportional to the values that they represent. The bars can be plotted vertically or
horizontally.
A bar graph shows comparisons among discrete categories. One axis of the chart shows the specific
categories being compared, and the other axis represents a measured value.
The Matplotlib API provides the bar() function, which can be used in the MATLAB-style (pyplot) interface as well as
in the object-oriented API.
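A minimal sketch of bar() through the object-oriented API. The years and production figures are borrowed from the last example in these notes, the colour is arbitrary, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so the sketch runs anywhere
import matplotlib.pyplot as plt

years = ["2017", "2018", "2019"]
production = [10000, 12000, 14000]

fig, ax = plt.subplots()
bars = ax.bar(years, production, color="steelblue")   # one vertical bar per category
ax.set(xlabel="Year", ylabel="Production", title="Annual Production")
print([bar.get_height() for bar in bars])
```

The pyplot form of the same chart would be plt.bar(years, production), and ax.barh() draws the bars horizontally.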
17.3.1 Example:
17.3.3 Example:
17.3.4 Example:
17.3.6 Example:
import pandas as pd
import matplotlib.pyplot as plot
data = {"Production":[10000, 12000, 14000],
"Sales":[9000, 10500, 12000]
}
index= ["2017", "2018", "2019"]
df = pd.DataFrame(data=data, index=index) # data variable has two sets of values as X-axis
df.plot.bar(rot=15, title="Annual Production Vs Annual Sales")
plot.show()