Pandas
# List of dicts
import pandas as pd
l1 = [{'First_Name': 'Suresh', 'Last_Name': 'Kumar'},
      {'First_Name': 'Ramesh', 'Last_Name': 'Babu'}]
df2 = pd.DataFrame(l1)
df2
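• For comparison, a DataFrame can also be built from a dict of lists, where each key becomes a column name (a minimal sketch; df_dict is an illustrative name):
d1 = {'First_Name': ['Suresh', 'Ramesh'],
      'Last_Name': ['Kumar', 'Babu']}
df_dict = pd.DataFrame(d1)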
3. Loading data from files into DataFrames
• The read_csv function is used to load comma separated (csv)
files and text files into a DataFrame. Its syntax is:
df1 = pd.read_csv(file/URL/file-like-object, sep=',', header='infer')
• sep defaults to ',' and header defaults to 'infer'; pass header=None when the file has no header row
• For Example:
df1 = pd.read_csv('./pandas/supermarkets.csv')
df2 = pd.read_csv('./pandas/supermarkets-commas.txt')
df3 = pd.read_csv('./pandas/supermarkets-semi-colons.txt', sep=';')
• df3.index # displays the indexes for this DataFrame
• df3.columns # displays the column names for this DataFrame
Loading data from json and excel files
• The read_json and read_excel methods are used to read json
and excel files respectively. For example:
df4 = pd.read_json('./pandas/supermarkets.json')
df5 = pd.read_excel('./pandas/supermarkets.xlsx', sheet_name=0)
• If a data file does not include header names, the header
option must be set to None:
df6 = pd.read_csv('./pandas/data.txt', header=None)
• We can give user specified column names as:
df6.columns = ['ID', 'Address', 'City', 'Pin_Code', 'Country', 'Name', 'Employees']
4. Missing Data
• In Pandas, missing values are represented as NaN (Not a
Number).
• Consider the marks.txt file with some missing values.
df = pd.read_csv('./pandas/marks.txt', header=None, names=["name", "m1", "m2", "m3"])
• To make detecting missing values easier (and across different
array dtypes), Pandas provides isnull() and notnull() functions,
which are also methods on Series and DataFrame objects
For example: df.isnull()
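• A common follow-up is counting the missing values per column; a minimal sketch:
df.isnull().sum()   # number of NaN values in each column
df.notnull()        # elementwise complement of isnull()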
Handling Missing Data
• The fillna() method is used to replace NaN values with another
value. Note that it returns a new DataFrame; it does not modify
df in place. For example, to replace NaN with zeroes:
df.fillna(0)
• We can replace NaN values in a column with the mean of the
column:
df["m1"].fillna(df["m1"].mean())
• We can drop the rows containing NaN values: df.dropna()
• We can drop the columns containing NaN values: df.dropna(axis=1)
5. Indexing and Slicing
• The Python and NumPy indexing operators "[ ]" and attribute
operator "." provide quick and easy access to Pandas data
structures.
• Pandas supports two indexers for multi-axes indexing:
i. loc[]: used for label based indexing
ii. iloc[]: used for integer based indexing
• In addition, Boolean indexing is used to perform conditional
retrieval of data
Set Index Column
• We can set a column of the DataFrame as an index in place of
the default index column as follows:
df7 = df6.set_index("ID")
• Note that set_index() returns a new DataFrame; the original
DataFrame (df6) is unchanged.
• To make the change permanent in df6 itself, the inplace
option should be set to True:
df6.set_index("ID", inplace=True)
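• The operation can be undone with reset_index(), which moves the index back into an ordinary column; a sketch:
df6.reset_index(inplace=True)   # "ID" becomes a regular column again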
Label based Indexing
• loc takes two single/list/range operators separated by ','. The
first indicates the rows and the second indicates the
columns. For example, to select all rows and only the Country
column:
df7.loc[:, "Country"]
• Select rows with index ID 3 to 5 and columns City to Name
(unlike Python slices, loc slices include both endpoints):
df7.loc[3:5, "City":"Name"]
• Similarly, we can give the following slices:
df7.loc[4,"Country"]
Integer based Indexing
• Pandas provides the iloc[] indexer in order to perform purely
integer based indexing. It uses 0-based positions.
• The various access methods are as follows:
i. An integer: For example, df7.iloc[4] returns the data from the
5th row; df7.iloc[:,4] returns the data from the 5th column.
ii. A list of integers: df7.iloc[[1,3]] returns the rows at positions
1 and 3. A pair of integers such as df7.iloc[4,4] returns the
single value at the intersection of row 4 and column 4.
iii. A range of values: df7.iloc[1:3,1:3] returns the intersection of
rows 1 to 2 and columns 1 to 2.
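• These forms can be mixed across the two axes; a minimal sketch (the column positions are illustrative):
df7.iloc[0:2, [1, 3]]   # first two rows, columns at positions 1 and 3
df7.iloc[-1]            # the last row (negative positions also work)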
Boolean indexing
• In boolean indexing, we will select subsets of data based on the
actual values of the data in the DataFrame and not on their
row/column labels or integer locations.
• The syntax for performing boolean indexing on a DataFrame is
similar to other indexing methods. The index label or integer
value is replaced by a boolean expression. For example:
df[df["m1"]>90]
df[(df["m1"]>90) & (df["m2"]>90)]
6. Data Aggregation
• Data aggregation means any data transformation that
produces scalar values from arrays, such as “mean”, “max”, etc.
• We can create a grouping of categories and apply a function to
the categories. For example, if we want to find out the number
of employees city wise, the procedure is as follows:
• Using groupby method, create groups based on the City
attribute:
df7.groupby(['City'])
df7.groupby(['City']).groups # In order to see the groups
Data Aggregation (cont..)
• Apply an aggregate function called sum() on the Employees
column (selecting the column before aggregating avoids
summing the non-numeric columns):
group = df7.groupby('City')['Employees'].sum()
group
• Similarly, to find the maximum number of employees in a City,
use the max() aggregate function:
group = df7.groupby('City')['Employees'].max()
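• Several aggregates can also be computed in one pass with agg(); a minimal sketch:
df7.groupby('City')['Employees'].agg(['sum', 'max', 'mean'])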
Merging DataFrames
• Pandas has full-featured, high performance in-memory join
operations idiomatically very similar to relational databases like
SQL.
• Pandas provides a single function, merge() for all standard
database join operations between DataFrame objects
• Create two DataFrames called emp and dept with the following
attributes:
• Emp(empno,ename,deptno)
• Dept(deptno,dname)
merge() two DataFrames on a key
• emp = pd.DataFrame({
'empno':[101,102,103,104,105],
'ename': ['Ramesh', 'Suresh', 'Mahesh', 'Dinesh', 'Naresh'],
'deptno':[1,2,2,1,2]})
• dept = pd.DataFrame(
{'deptno':[1,2,3],
'dname': ['Marketing', 'Operations', 'EDP']})
• merge_df=pd.merge(emp,dept,on='deptno') # key is deptno
• merge_df
Types of merge
• Perform left join: Use keys from left object
merge_left= pd.merge(emp,dept, on='deptno', how='left')
• Perform right join: Use keys from right object
merge_right= pd.merge(emp,dept, on='deptno', how='right')
• Perform outer join: Use union of keys (a quick check of its result follows this list)
merge_outer= pd.merge(emp,dept, on='deptno', how='outer')
• Perform inner join: Use intersection of keys (default)
merge_inner= pd.merge(emp,dept, on='deptno', how='inner')
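• With the sample emp and dept data above, the outer join keeps deptno 3 ('EDP') even though no employee references it; its empno and ename appear as NaN. A quick check:
merge_outer[merge_outer['deptno'] == 3]   # the 'EDP' row with NaN empno/ename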
Deleting columns and rows
• The drop operation on a DataFrame can be used to delete a
row or a column. It returns a new DataFrame object; dropping
a column or row is not done in place.
• To drop a column:
df7.drop("City", axis=1)   # axis=1 for column, axis=0 for row
• To drop a row:
df7.drop(3, axis=0)   # 3 is the index label of the row to be dropped
• Deleting more than one row based on position:
df7.drop(df7.index[1:3], axis=0)   # drops the rows at positions 1 and 2
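• Recent pandas versions also accept the more explicit keyword forms, which avoid the axis argument altogether; a sketch:
df7.drop(columns=["City"])   # same as axis=1
df7.drop(index=[3])          # same as axis=0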
Updating and modifying columns
• Retrieve the length of the index: len(df7.index)
• Retrieve the shape of the DataFrame: df7.shape
• Retrieve the number of rows as df7.shape[0]
• Finally, add the new column as:
df7["Continent"] = df7.shape[0] * ["North America"]
• Note that this operation is in place (column assignment
modifies df7 directly)
• To update or modify an existing column:
df7["Continent"] = df7["Country"]+" , "+df7["Continent"]
Updating and modifying rows
• Find the transpose of the DataFrame to which we want to add
the rows. Store this result in a temporary DataFrame:
df7_t=df7.T # Find the transpose
• Now, add a column to the above temporary DataFrame:
df7_t[7] = ["109 Charles Lane", "Los Angeles", "LA 500017", "USA", "Bakers World", 5, "USA North America"]
• Find the transpose of this temporary DataFrame and replace
the original DataFrame:
df7=df7_t.T
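• A simpler alternative that avoids the double transpose is to assign the new row directly with loc under a new index label (a sketch, reusing the label 7 from above):
df7.loc[7] = ["109 Charles Lane", "Los Angeles", "LA 500017", "USA", "Bakers World", 5, "USA North America"]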
THANK YOU