CH 02 - Data Handling Using Pandas Leip102 EDITED Smaller 01 Codes Only
770945137.docx 1
size – Printing the number of values in the series print(s.size)
empty – Checking if the series is empty print(s.empty)
if s.empty:
Methods of a series object
head(n) – accessing first n (by default 5) values print(s.head())
of the series print(s.head(2))
tail(n) – accessing last n (by default 5) values of print(s.tail())
the series print(s.tail(2))
count() – returning the number of non-NaN print(s.count())
values in the Series
len(series) – returning the number of values in print(len(s))
the Series including NaN values
len() is a Python function and not a Series
method
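A minimal runnable sketch of the methods above, using made-up data with one NaN to show the difference between count() (excludes NaN) and len() (includes NaN):

```python
import pandas as pd
import numpy as np

# a small series with one missing value (illustrative data)
s = pd.Series([10, 20, np.nan, 40], index=['a', 'b', 'c', 'd'])

print(s.head(2))   # first two values ('a' and 'b')
print(s.tail(2))   # last two values ('c' and 'd')
print(s.count())   # 3 -> NaN is excluded
print(len(s))      # 4 -> NaN is included
```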
Mathematical operations on two or more
series
Addition of two series objects: sA = pd.Series([1,2,3,np.nan],index=['a','b','c','d'])
by using + operator to add the sB = pd.Series([4,5,6,7],index=['a','x','c','d'])
corresponding elements of two series print(sA+sB)
Addition of two series objects by using print(sA.add(sB,fill_value=0))
series method add() with parameter
fill_value to replace the missing values with
a specified value while adding the
corresponding elements of two series so as
to avoid NaN values in the output.
Other mathematical operations on two
series
Subtraction of Series print(sA-sB)
print(sA.sub(sB,fill_value=0))
Multiplication of Series print(sA*sB)
print(sA.mul(sB,fill_value=0))
Division of Series print(sA/sB)
Gives "inf" if denominator is zero print(sA.div(sB,fill_value=0))
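A runnable sketch of the series arithmetic above, using the same sA and sB as in the examples; it contrasts plain + (NaN wherever either operand is missing) with add(fill_value=0):

```python
import pandas as pd
import numpy as np

sA = pd.Series([1, 2, 3, np.nan], index=['a', 'b', 'c', 'd'])
sB = pd.Series([4, 5, 6, 7], index=['a', 'x', 'c', 'd'])

plain = sA + sB              # 'b', 'd', 'x' become NaN (missing on one side)
filled = sA.add(sB, fill_value=0)  # missing side treated as 0

print(plain)
print(filled)   # a=5.0, b=2.0, c=9.0, d=7.0, x=5.0
```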
DataFrame (Two Dimensional or 2D structure)
Accessing dataframe elements by using #use .loc indexer to specify row and col index labels
label based indexing #General syntax: DataFrame.loc[ rows , columns ]
#specify labels e.g. as 'r1', ['r1','r3'], 'r1':'r5'
#for single row, discrete multiple rows and slice or range of rows
#in a slice, both start and stop index labels are included
#single row label returns the row as a series
#single column name returns the column as a series
#returns a series
df.loc['r1']
df.loc['r1' , : ]
df.loc[ : , 'c1']
#returns a dataframe (both axes given as lists or slices)
df.loc[['r1','r3'] , ['c1','c3']]
df.loc['r1':'r3' , 'c1':'c3']
df.loc[ : , 'c1':'c3']
df.loc['r1':'r3' , : ]
df.loc['r1':'r3' , : 'c1']
#returns a series (exactly one axis given as a single label)
df.loc['r1' , ['c1','c3']]
df.loc[['r1','r3'] , 'c1']
df.loc[ 'r1' , 'c1':'c3']
df.loc['r1',:] > 90
df.loc[:,'c1'] > 90
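A runnable sketch of label-based indexing with .loc, on a small made-up dataframe (the labels 'r1'..'c3' follow the examples above); note that label slices include both endpoints:

```python
import pandas as pd

# a small labelled dataframe with made-up marks
df = pd.DataFrame([[95, 80, 70], [60, 85, 90], [75, 65, 88]],
                  index=['r1', 'r2', 'r3'],
                  columns=['c1', 'c2', 'c3'])

row = df.loc['r1']                  # single row label -> Series
col = df.loc[:, 'c1']               # single column label -> Series
sub = df.loc['r1':'r2', 'c1':'c2']  # slices on both axes -> DataFrame
# both 'r2' and 'c2' are included: label slices include the stop label
mask = df.loc[:, 'c1'] > 90         # Boolean Series, True only for 'r1'
```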
Filtering rows and columns of a #filtering rows and columns without writing any explicit condition
dataframe by using Boolean indexing #but by giving a list of Boolean values for each row and column
#use a Boolean list specifying True for the rows and columns to be displayed and False
for the rows and columns to be omitted
#the Boolean list must contain one value per row (or per column); a list of the wrong length raises an error
Example:
To increment the marks by 5 if marks are less than 33
df.loc[df['marks']<33, 'marks'] = df['marks']+5
If MRP is >70 then store 10% of MRP into a new column called 'Discount'
df['Discount'] = df[df['MRP']>70]['MRP']*10/100
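The two conditional-update examples above as a runnable sketch, with made-up marks and MRP values; note that rows failing the MRP condition get NaN in the new 'Discount' column:

```python
import pandas as pd

df = pd.DataFrame({'marks': [30, 50], 'MRP': [100, 60]})

# increment marks by 5 only where marks < 33
df.loc[df['marks'] < 33, 'marks'] = df['marks'] + 5

# 10% of MRP only where MRP > 70; other rows become NaN
df['Discount'] = df[df['MRP'] > 70]['MRP'] * 10 / 100

print(df)   # marks become [35, 50]; Discount is [10.0, NaN]
```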
Descriptive Statistics with Pandas
Applying statistical or aggregation functions on dataframes
Aggregating data to get summary statistics of the data so as to analyse it effectively.
Apply a library function such as max(), min(), count(), sum(), mean()... or a
custom function on df, df.loc[...], df.iloc[...], df.colLabel or df['colLabel']
While using Pandas' statistical functions with a dataframe:
axis=0 (default) means for each column and axis=1 means for each row
NOTE!
This matches the NumPy convention: axis=0 collapses the rows (one result per
column) and axis=1 collapses the columns (one result per row).
Applying functions on all columns of the df.functionName()
dataframe
(certain functions work on numeric
columns only)
Applying functions on a column of the df['colLabel'].functionName()
dataframe
Applying functions on multiple columns df[['colLabel1','colLabel2',...]].functionName()
of the dataframe
Applying functions on a row of the df.loc['rowLabel',:].functionName()
dataframe
Applying functions on a range of rows of df.loc['startRowLabel':'endRowLabel',:].functionName()
the dataframe
Applying functions on a slice or a subset df.loc['startRowLabel':'endRowLabel',
of the dataframe 'startColLabel':'endColLabel'].functionName()
Count of values of each row or column in df.count(axis=0|1|None, skipna=True|False|None,
the dataframe numeric_only=True|False|None, min_count=0)
Sum of all values of each row or column df.sum(axis=0|1|None, skipna=None,
in the dataframe numeric_only=None, min_count=0)
Max (maximum) value of each row or df.max(axis=0|1|None, skipna=True|False|None,
column in the dataframe numeric_only=True|False|None)
Min (minimum) value of each row or df.min(axis=0|1|None, skipna=True|False|None,
column in the dataframe numeric_only=True|False|None)
Mean (computed mean or average of the df.mean(axis=0|1|None, skipna=True|False|None,
dataset) of each row or column in the numeric_only=True|False|None).round(2)
dataframe
Mode (value that appears most often in a df.mode(axis=0|1|None, numeric_only=True|False|None)
dataset) of each row or column in the df.loc['s01':'s02','marks1':'marks2'].mode(axis=1)
dataframe
Median (middle value of a dataset, value df.median(axis=0|1|None, skipna=True|False|None,
which divides dataset into two equal parts) numeric_only=True|False|None)
of each row or column in the dataframe
MAD (mean absolute deviation) – average df.mad(axis=0|1|None, skipna=True|False)
distance between each data value and the #df.mad() was removed in pandas 2.0;
mean of a dataset #(df - df.mean()).abs().mean() gives the same column-wise result
Var (unbiased variance) over the df.var(axis=0|1|None, skipna=True|False|None,
requested axis. Average of the squared numeric_only=True|False|None)
differences from the mean. Variance is the
expectation of the squared deviation of a
random variable from its mean. The var()
function calculates the variance of a given
set of numbers, variance of the dataframe,
variance of one or more columns or
variance of rows.
Std (standard deviation) – Square root of df.std(axis=0|1|None, skipna=True|False)
the variance.
Quantile is the point in a distribution that df.quantile(q=f, axis=0|1|None, numeric_only=True|False|None)
relates to the rank order of a value in that where q is a float or an array-like sequence of values between
distribution. 0 and 1.0
To produce quartiles (4-quantiles), use the list
q=[0.25,0.5,0.75,1.0]
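A runnable sketch of the aggregation functions above, using made-up marks for three students; it shows the axis=0 (per-column) default against axis=1 (per-row):

```python
import pandas as pd

df = pd.DataFrame({'marks1': [40, 60, 80], 'marks2': [50, 70, 90]},
                  index=['s01', 's02', 's03'])

colsum = df.sum()               # axis=0 (default): one total per column
rowsum = df.sum(axis=1)         # axis=1: one total per row
means = df.mean().round(2)      # column means
quarts = df.quantile(q=[0.25, 0.5, 0.75])  # quartiles per column

print(colsum)   # marks1 -> 180, marks2 -> 210
print(rowsum)   # s01 -> 90, s02 -> 130, s03 -> 170
```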
Attributes of DataFrames Access properties/attributes of a DataFrame by using the property name with the
DataFrame name.
DataFrame.index displays row labels ForestAreaDF.index
Index(['GeoArea', 'VeryDense', 'ModeratelyDense',
'OpenForest'], dtype='object')
DataFrame.columns displays column
labels
DataFrame.dtypes displays data type of ForestAreaDF.dtypes
each column in the DataFrame Assam int64
Kerala int64
Delhi float64
dtype: object
DataFrame.values displays a NumPy ForestAreaDF.values
array([ [7.8438e+04, 3.8852e+04, 1.4830e+03],
ndarray having all the values in the [2.7970e+03, 1.6630e+03, 6.7200e+00],
DataFrame, without the axes labels. [1.0192e+04, 9.4070e+03, 5.6240e+01],
[1.5116e+04, 9.2510e+03, 1.2945e+02]])
DataFrame.shape displays a tuple ForestAreaDF.shape
representing the dimensionality of the (4, 3) #means ForestAreaDF has 4 rows and 3 columns
DataFrame.
DataFrame.size displays the number of ForestAreaDF.size
values (elements) in the DataFrame, i.e. 12 #means ForestAreaDF has 12 values in it
rows × columns.
DataFrame.T transposes the DataFrame – ForestAreaDF.T
GeoArea VeryDense ModeratelyDense OpenForest
row index and column labels of the Assam 78438.0 2797.00 10192.00 15116.00
DataFrame are interchanged. Kerala 38852.0 1663.00 9407.00 9251.00
Equivalent of writing Delhi 1483.0 6.72 56.24 129.45
DataFrame.transpose()
DataFrame.head(n) displays first n rows ForestAreaDF.head(2)
Assam Kerala Delhi
of the DataFrame. If parameter n is not GeoArea 78438 38852 1483.00
specified, then by default, it returns first 5 VeryDense 2797 1663 6.72
rows of the DataFrame.
DataFrame.tail(n) displays last n rows of ForestAreaDF.tail(2)
Assam Kerala Delhi
the DataFrame. If parameter n is not ModeratelyDense 10192 9407 56.24
specified then by default, it gives last 5 OpenForest 15116 9251 129.45
rows of the DataFrame.
DataFrame.empty returns value True if ForestAreaDF.empty
False
DataFrame is empty and False otherwise.
df=pd.DataFrame() #Create an empty dataFrame
df.empty
True
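A runnable sketch of the attributes above, on a two-column subset of the forest-area figures; note that shape is a tuple while size is a single number:

```python
import pandas as pd

# a subset of the chapter's forest-area figures, for illustration
ForestAreaDF = pd.DataFrame({'Assam': [78438, 2797], 'Kerala': [38852, 1663]},
                            index=['GeoArea', 'VeryDense'])

print(ForestAreaDF.shape)    # (2, 2) -> tuple (rows, columns)
print(ForestAreaDF.size)     # 4 -> single number, rows x columns
print(ForestAreaDF.T.shape)  # (2, 2) -> rows and columns interchanged
print(ForestAreaDF.empty)    # False
```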
Sorting Dataframes df.sort_values( by=colLabels|rowLabels, axis=0|1,
Arranging rows and columns in ascending ascending=True|False, inplace=True|False,
kind='quicksort', na_position='first|last')
or descending order on the basis of the
Arranging rows by the values of a column
values of one or more specified rows and df.sort_values(by='marks1')
columns. Arranging rows by the values of multiple columns
df.sort_values(by=['class','marks1'])
Arranging rows by the values of multiple columns in different orders
df.sort_values(by=['class','marks1'],ascending=[False,True])
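The sorting calls above as a runnable sketch, with made-up class/marks data; the second sort orders class descending and marks1 ascending within each class:

```python
import pandas as pd

df = pd.DataFrame({'class': ['12A', '12B', '12A'],
                   'marks1': [70, 50, 60]})

by_marks = df.sort_values(by='marks1')
by_both = df.sort_values(by=['class', 'marks1'], ascending=[False, True])

print(by_marks)   # rows ordered 50, 60, 70
print(by_both)    # 12B first, then 12A rows by ascending marks1
```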
Checking/detecting missing values in a df.isnull() returns a Boolean same-sized DataFrame indicating if values are missing
dataframe df.notnull() returns a Boolean same-sized DataFrame which is just opposite of
isnull()
df.isnull().sum() returns a Series containing the number of missing values for
each column
df.isnull().sum().sum() returns the total number of missing values in the entire
dataframe
Dropping the missing values in a df.dropna(axis=0|1, how='any'|'all', thresh=None|num,
dataframe subset=None|[cols], inplace=True|False)
To drop row if any NaN values are present:
df.dropna(axis=0)
To drop column if any NaN values are present:
df.dropna(axis=1)
To drop the rows having fewer than 6 non-NaN values:
df.dropna(axis=0,thresh=6)
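A runnable sketch of missing-value detection and dropping, on a made-up frame with two NaN values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

percol = df.isnull().sum()        # missing values per column
total = df.isnull().sum().sum()   # 2 -> total missing values
clean = df.dropna(axis=0)         # only row 0 survives (no NaN in it)

print(percol)
print(total)
print(clean)
```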
Importing and exporting data between
csv file and dataframe
Importing csv file to a dataframe Reading a csv file with pandas into a dataframe
Using the pd.read_csv() function to read a CSV file
Parameters:
filepath_or_buffer URL or path of the CSV file
sep the separator. By default it is comma ',' as in csv (comma
separated values)
index_col uses the passed column as the row labels instead of the default
index 0, 1, 2, 3…
header uses passed row (int) or rows (int list) as header
usecols uses passed cols (string list) to make the dataframe
squeeze if True and only one column is passed, returns a pandas series
(removed in pandas 2.0; use df.squeeze('columns') instead)
skiprows skips passed rows in the new data frame
df = pd.read_csv("nba.csv")
#or df = pd.read_csv("https://media.geeksforgeeks.org/nba.csv")
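A runnable sketch of read_csv; an in-memory buffer stands in for a file such as "nba.csv" (the column names here are made up) so the example runs without any file on disk:

```python
import io
import pandas as pd

# io.StringIO lets read_csv consume a string exactly as it would a file path
csvtext = "roll,name,marks\n1,Abhay,75\n2,Vijay,82\n"
df = pd.read_csv(io.StringIO(csvtext), sep=',', index_col='roll')

print(df)   # two rows, indexed by the 'roll' column
```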
Importing database connectivity libraries import mysql.connector as myconnector
for MySQL connection OR
import pymysql
Creating database connection object to #mysql-connector
connect to the mysql database conn = myconnector.connect( host='localhost',
user='root', password='', database='mydb')
#pymysql
conn = pymysql.connect( host='localhost',
user='root', port=3306, password='', db='mydb',
cursorclass=pymysql.cursors.DictCursor)
Creating cursor object to execute SQL #mysql-connector
queries cursor = conn.cursor(dictionary=True)
#pymysql
cursor = conn.cursor()
Executing SQL queries sql='select * from t1'
cursor.execute(sql)
OR
df=pd.read_sql(sql,con=conn)
Fetching MySQL table rows in the cursor rows=cursor.fetchall()
by using a loop to access one row at a time df=pd.DataFrame(rows)
Committing DML SQL queries, if any sql='delete from t1'
cursor.execute(sql)
conn.commit()
Handling errors/exceptions try:
...
except myconnector.Error as e:
print("ERROR%d:%s"%(e.args[0],e.args[1]))
Closing connections and destroying try:
objects before terminating the application ...
finally:
cursor.close()
conn.close()
raise SystemExit #terminate app
Change datatype of index of the dataframe df.index=df.index.astype('str')
e.g. from int to str after importing data
into dataframe from mysql table
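The whole connect-query-fetch flow above, as a runnable sketch; sqlite3 (standard library) stands in for a MySQL server so no database setup is needed, but the cursor/commit/read_sql pattern is the same one used with pymysql:

```python
import sqlite3
import pandas as pd

# in-memory database stands in for the MySQL server
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE t1 (roll INTEGER, name TEXT)')
cursor.execute("INSERT INTO t1 VALUES (1, 'Abhay'), (2, 'Vijay')")
conn.commit()

# same call shape as pd.read_sql(sql, con=conn) with a pymysql connection
df = pd.read_sql('SELECT * FROM t1', con=conn)
df.index = df.index.astype('str')   # change index dtype after import
conn.close()

print(df)
```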
MATPLOTLIB Example (plots ver01.py) – Line, Bar & Histogram
#pip install matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
#====plot data
year = [2015,2016,2017,2018]
percentMale = [70,82,63,90]
percentFemale = [80,52,73,43]
#ticks
xticks=np.arange(2015,2020)
yticks=np.arange(0,101,10)
'''
#PLOT – LINE CHART
plt.plot(year,percentMale,color='b')
plt.plot(year,percentFemale,color='r')
'''
'''
#BAR
x=np.arange(year[0],year[len(year)-1]+1)
plt.bar(x+0.00, percentMale, color='b', width=0.25)
plt.bar(x+0.25, percentFemale, color='r', width=0.25)
'''
'''
#BARH - Grouped Horizontal Bar (not stacked)
y=np.arange(year[0],year[len(year)-1]+1)
width = 0.3
fig, ax = plt.subplots()
ax.barh(y+(0.5*width), percentMale, width, color='red', label='Male')
ax.barh(y+(0.5*width)+width, percentFemale, width, color='green', label='Female')
ax.set(yticks=y+width, yticklabels=year) #do not set plt.xticks() and plt.yticks()
xticks = np.arange(0,101,10)
yticks=y+width
'''
'''
#HISTOGRAM
#plt.hist(values, bins, range, ...)
ages=[2,5,70,40,30,45,50,45,43,40,44,60,7,13,57,18,90,77,32,21,20,40] #data values
myrange = (0, 100) #set the range and no. of intervals
bins = 10
plt.hist(ages,bins,myrange,color='green',histtype='bar',rwidth=0.8) # plotting a histogram
xticks=np.arange(0,100,10)
yticks=np.arange(0,11)
'''
#====common plot settings
plt.title('MYPLOT')
plt.xticks(xticks) #plt.xticks(xticks,labels,rotation='vertical')
plt.yticks(yticks)
plt.xlabel('YEAR') #plt.xlabel('X Labels', fontsize=6)
plt.ylabel('PERCENTAGE') #plt.ylabel('Y Labels', fontsize=10)
plt.legend(['Male','Female']) #legends used in the plot
#plt.grid(True)
plt.show()
#plt.savefig("temp.png") #save the plot as an image
Python MySQL Database Connectivity Example (PythonMySQLdbCRUD.py)
#Python MySQL Database Connectivity
import pandas as pd
import numpy as np
import pymysql
roll = input('Roll: ')
name = input('Name: ')
std = input('Class & Section (e.g. 12A): ')
dob = input('DOB (YYYYMMDD): ')
marks = input('Marks %: ')
sql = '''INSERT INTO student(roll,name,std,dob,marks)
VALUES (%s,%s,%s,%s,%s)'''
try:
cursor.execute(sql, (roll,name,std,dob,marks))
conn.commit()
print("New record added successfully.")
except pymysql.Error as e:
print("ERROR %d: %s" %(e.args[0], e.args[1]))
'''--------------------------------
#to insert multiple rows all at once
sql = "INSERT INTO student(roll,name) VALUES(%s, %s)"
val = [('123','Abhay',...),('456','Vijay',...),...]
cursor.executemany(sql,val)
--------------------------------'''
def selectall():
sql = "SELECT * FROM student"
try:
cursor.execute(sql)
#number_of_rows = cursor.execute(sql)
#print(number_of_rows)
result = cursor.fetchall()
print('''ROLL \t NAME \t\t CLASS \t DOB \t\t MARKS''')
print("----------------------------------------------------------------------")
for row in result:
#to use index rather than column name to access cursor data values
#disable 'cursorclass=pymysql.cursors.DictCursor' in conn declaration
#print(row[0],"\t",row[1],"\t",row[2],"\t",row[3],"\t",row[4])
#0:>7 for right-align; #0:<7 for left-align; #0:7 for exact length
print("{0:<9}{1:<16}{2:<8}{3:<16}{4:<5}".format(row['roll'],row['name'],row['std'],str(row['dob']),str(row['marks'])))
#print(row['roll'],row['name'],row['std'],row['dob'],row['marks'])
except pymysql.Error as e:
print("ERROR %d: %s" %(e.args[0], e.args[1]))
def selectcondition():
column = input("Enter column name for record search...")
value = input("Enter value for the column...")
sql = "SELECT * FROM student WHERE {} = %s".format(column)
try:
cursor.execute(sql, (value,))
result = cursor.fetchall()
print("ROLL \t NAME \t STD \t DOB \t MARKS")
print("----------------------------------------------------------------------")
for row in result:
    print(row['roll'],"\t",row['name'],"\t",row['std'],"\t",row["dob"],"\t",row["marks"])
except pymysql.Error as e:
print("ERROR %d: %s" %(e.args[0], e.args[1]))
def update():
roll = input("Enter roll of record to be updated...")
column = input("Enter the column to be updated...")
value = input("Enter the new value of the column...")
sql = "UPDATE student SET {}=%s WHERE roll = %s".format(column)
try:
cursor.execute(sql, (value, roll))
conn.commit()
print("\nSuccessfully Updated...\n")
except pymysql.Error as e:
print("ERROR %d: %s" %(e.args[0], e.args[1]))
def delete():
roll = input("Enter roll of record to be deleted...")
sql = "DELETE FROM student WHERE roll = %s"
try:
cursor.execute(sql, (roll,))
conn.commit()
print("\nSuccessfully Deleted...\n")
except pymysql.Error as e:
print("ERROR %d: %s" %(e.args[0], e.args[1]))
#=====call for Starting function=====
main()
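The parameterised INSERT / executemany() pattern from the listing above, as a runnable sketch; sqlite3 (standard library) stands in for the MySQL server, and its ? placeholder replaces pymysql's %s, but the execute/executemany/commit flow is identical:

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # in-memory stand-in for MySQL
cursor = conn.cursor()
cursor.execute('CREATE TABLE student (roll TEXT, name TEXT)')

# single parameterised insert (never build SQL by string concatenation)
cursor.execute("INSERT INTO student(roll,name) VALUES(?, ?)", ('123', 'Abhay'))

# multiple rows all at once with executemany()
val = [('456', 'Vijay'), ('789', 'Ajay')]
cursor.executemany("INSERT INTO student(roll,name) VALUES(?, ?)", val)
conn.commit()
```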
DataFrame Plot – Line, Bar & Histogram
#----------------------------
#DATAFRAME plots - line & bar
#DATA
year = [2015,2016,2017,2018] #index for x-axis
percentMale = [70,82,63,90] #numeric column for y-axis
percentFemale = [80,52,73,43] #another numeric column for y-axis
#DATAFRAME
#By default - 'index' is laid out on X-axis and all numeric columns on Y-axis
df = pd.DataFrame({'Male': percentMale, 'Female': percentFemale}, index=year)
#PLOT
#ax = df['columnname'].plot(kind='bar',rot=0)
#to plot specified column(y-axis) against index(x-axis)
ax = df.plot(kind='bar',rot=0) #kind='bar|barh|line'
#PLOT settings
#ticks and legends are automatically set
plt.title('MYPLOT')
plt.xlabel('YEAR')
plt.ylabel('PERCENTAGE')
plt.show()
#----------------------------
#----------------------------
#DATAFRAME plots - histogram
#frequencies
ages=[2,5,70,40,30,45,50,45,43,40,44,60,7,13,57,18,90,77,32,21,20,40]
# setting the ranges and no. of intervals
myrange = (0, 100)
bins = 10
#bins=[0,20,40,60,80,100]
# plotting a histogram
df = pd.DataFrame({'Age': ages})
ax = df.plot(kind='hist',range=myrange,bins=bins,rwidth=0.8,alpha=0.5)
plt.title('MYPLOT')
plt.xlabel('AGE')
plt.ylabel('FREQUENCY')
plt.show()
#----------------------------
Joining, Merging and Concatenating
DataFrames
Appending DataFrames The DataFrame.append() method merges two DataFrames. It appends the rows of the second
dataframe to the end of the first DataFrame and may end up with duplicate index labels.
Columns not present in the first dataframe are added as new columns in the resultant
dataframe. Columns remain unique. (Note: DataFrame.append() was removed in pandas 2.0;
pd.concat([df1, df2]) is the replacement.)
df1=pd.DataFrame([[1,2,3],[4,5],[6]],
columns=['C1','C2','C3'], index=['R1','R2','R3'])
df2=pd.DataFrame([[10,20],[30],[40,50]],
columns=['C2','C5'], index=['R4','R2','R5'])
df=df1.append(df2)
C1 C2 C3 C5
R1 1.0 2.0 3.0 NaN
R2 4.0 5.0 NaN NaN
R3 6.0 NaN NaN NaN
R4 NaN 10.0 NaN 20.0
R2 NaN 30.0 NaN NaN
R5 NaN 40.0 NaN 50.0
Set parameter sort=True to get the column labels to appear in sorted order.
df=df1.append(df2,sort=True)
Set the verify_integrity=True parameter to raise an error if row labels are duplicated. By
default, verify_integrity=False, which accepts duplicate row labels while appending the
DataFrames.
Set the ignore_index=True parameter if row index labels are to be ignored in the resultant
dataframe. By default, ignore_index=False retains the index labels.
dFrame1 = dFrame1.append(dFrame2, ignore_index=True)
C1 C2 C3 C5
0 1.0 2.0 3.0 NaN
1 4.0 5.0 NaN NaN
2 6.0 NaN NaN NaN
3 NaN 10.0 NaN 20.0
4 NaN 30.0 NaN NaN
5 NaN 40.0 NaN 50.0
The append() method can also be used to append a series or a dictionary to a DataFrame.
Concatenating DataFrames The concat() function concatenates dataframes along an axis while performing optional set
logic (union or intersection) applied on the index or the other axis i.e. columns. By default it
joins multiple dataframes vertically (row-wise, one after another).
df1 = pd.DataFrame({'A':['A0','A1'],'B':['B0','B1']},
index=[0,1])
df2 = pd.DataFrame({'A':['A2','A3'],'B':['B2','B3']},
index=[2,3])
df3 = pd.DataFrame({'A':['A4','A5'],'B':['B4','B5']},
index=[4,5])
pd.concat([df1,df2,df3])
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
5 A5 B5
Merging DataFrames merge() function joins dataframes using a key column. Similar to SQL join using a
common column. Similar to df.join(), but uses a common key column to join on rather than
the common index. It joins multiple dataframes horizontally (column-wise, one after
another).
df1 = pd.DataFrame({'Key':['K0','K1'],
'A':['A0','A1'],'B':['B0','B1']})
df2 = pd.DataFrame({'Key':['K0','K1'],
'C':['C0','C1'],'D':['D0','D1']})
pd.merge(df1,df2,how='inner',on='Key')
A B Key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
Joining DataFrames The join() function joins dataframes using common index. Similar to SQL join using a
common column. Similar to pd.merge(), but uses common index to join on rather than a
common key column. It joins multiple dataframes horizontally (column-wise, one after
another).
df1 = pd.DataFrame({'A':['A0','A1'],'B':['B0','B1']},
index=['K0','K1'])
df2 = pd.DataFrame({'C':['C0','C1'],'D':['D0','D1']},
index=['K0','K1'])
df1.join(df2)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 C1 D1
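A runnable sketch contrasting merge() and join(), using the same made-up 'Key' data as above; moving 'Key' into the index with set_index() makes join() pair the rows exactly as merge() does on the column:

```python
import pandas as pd

df1 = pd.DataFrame({'Key': ['K0', 'K1'], 'A': ['A0', 'A1']})
df2 = pd.DataFrame({'Key': ['K0', 'K1'], 'C': ['C0', 'C1']})

# merge() joins on the common 'Key' column
m = pd.merge(df1, df2, how='inner', on='Key')

# join() joins on the index, so set 'Key' as the index first
j = df1.set_index('Key').join(df2.set_index('Key'))

print(m)
print(j)
```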