
Machine Learning Lab

(PC 652 IT)

Data Analysis with Pandas


Topics
1. Introduction to Pandas
2. DataFrame and Series
3. Loading data from files into DataFrames
4. Missing Data
5. Indexing and Slicing
6. Data Aggregation
7. Visualization
1. Introduction to Pandas
• The name pandas derives from "panel data"; it is commonly expanded as the Python Data Analysis Library.
• Pandas is a Python package providing fast, flexible, and
expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive.
• It aims to be the fundamental high-level building block for
doing practical, real world data analysis in Python.
• Install pandas by running the following command at the
Anaconda prompt:
pip install pandas
• In order to work with the pandas library, it must first be imported:
import pandas as pd
2. DataFrame and Series
• Series are One-dimensional array-like objects containing an
array of data (of any NumPy data type) and an associated array
of data labels, called its “index”.
• If index of data is not specified, then a default one consisting of
the integers 0 through N-1 is created.
• A DataFrame is a two-dimensional tabular data structure with
an ordered collection of columns, each of which can have a
different value type.
• DataFrame (DF) can be thought of as a dictionary of Series.
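The default and explicit index behaviour of a Series described above can be sketched as follows (the values here are invented for illustration):

```python
import pandas as pd

# Series with the default integer index 0 through N-1
s1 = pd.Series([87.5, 89.9, 91.0])

# Series with an explicit label index
s2 = pd.Series([87.5, 89.9, 91.0], index=["Suresh", "Ramesh", "Mahesh"])

print(s1.index)      # RangeIndex(start=0, stop=3, step=1)
print(s2["Ramesh"])  # 89.9 -- values can be looked up by label
```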
Working with DataFrames and Series
import pandas as pd
# Creating a DataFrame
df1=pd.DataFrame([["Suresh",101,87.5],["Ramesh",102,89.9]])
df1
type(df1)
#Adding columns to a DataFrame
df1=pd.DataFrame([["Suresh",101,87.5],["Ramesh",102,89.9]],
columns=["Student_Name","Roll_No","Percentage"])
df1
Adding index names to DataFrame
df1=pd.DataFrame([["Suresh",101,87.5],["Ramesh",102,89.9]],
columns=["Student_Name","Roll_No","Percentage"],
index=["First_Student","Second_Student"])
df1
type(df1.Student_Name) # Student_Name is a series

# List of methods on a DataFrame:


dir(df1)
Applying methods on DataFrame
# Mean of every numeric column (numeric_only skips the string column):
df1.mean(numeric_only=True)
# Mean of all numeric data in the DataFrame:
df1.mean(numeric_only=True).mean()
#Displaying a single series from the DataFrame:
df1.Percentage
# Mean of only a single column:
df1.Percentage.mean()
Creating a DataFrame from Dictionary
dict1={'First_Name':("Suresh","Ramesh"),
'Last_Name':("Kumar","Babu")}
df2=pd.DataFrame(dict1)
df2

#List of Dicts
l1=[{'First_Name':'Suresh','Last_Name':'Kumar'},{'First_Name':'
Ramesh','Last_Name':'Babu'}]
df2=pd.DataFrame(l1)
df2
3. Loading data from files into DataFrames
• The read_csv method is used to load comma-separated (CSV)
files and text files into a DataFrame. Its syntax is:
df1 = pd.read_csv(file/URL/file-like-object, sep = ',', header = None)
• For Example:
df1=pd.read_csv('./pandas/supermarkets.csv')
df2=pd.read_csv('./pandas/supermarkets-commas.txt')
df3=pd.read_csv('./pandas/supermarkets-semi-colons.txt', sep=";")
• df3.index # displays the indexes for this DataFrame
• df3.columns # displays the column names for this DataFrame
Loading data from json and excel files
• The read_json and read_excel methods are used to read json
and excel files respectively. For example:
df4=pd.read_json('./pandas/supermarkets.json')
df5=pd.read_excel('./pandas/supermarkets.xlsx', sheet_name=0)
• If a data file does not include header names, the header
option should be set to None:
df6=pd.read_csv('./pandas/data.txt', header=None)
• We can give user specified column names as:
df6.columns=['ID','Address','City','Pin_Code','Country',
'Name', 'Employees']
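The supermarkets files used above are not reproduced here, but the same header=None behaviour can be sketched with an in-memory file-like object (the rows below are invented):

```python
import pandas as pd
from io import StringIO

# Simulate a headerless text file with a file-like object
data = StringIO("1,3666 21st St,San Francisco\n"
                "2,735 Dolores St,San Francisco\n")
df6 = pd.read_csv(data, header=None)

# Give user-specified column names after loading
df6.columns = ["ID", "Address", "City"]
print(df6)
```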
4. Missing Data
• In Pandas, missing values are represented as NaN, which
stands for Not a Number.
• Consider the marks.txt file with some missing values.
df = pd.read_csv('./pandas/marks.txt', header=None,
names=["name","m1","m2","m3"])
• To make detecting missing values easier across different
array dtypes, Pandas provides the isnull() and notnull() functions,
which are also available as methods on Series and DataFrame objects.
For example: df.isnull()
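Since marks.txt is not included here, the same check can be sketched with a small stand-in DataFrame (the marks are made up):

```python
import pandas as pd
import numpy as np

# Stand-in for marks.txt: column m2 has one missing value
df = pd.DataFrame({"name": ["A", "B", "C"],
                   "m1": [90.0, 85.0, 78.0],
                   "m2": [88.0, np.nan, 91.0]})

print(df.isnull())        # True wherever a value is NaN
print(df.isnull().sum())  # count of missing values per column
```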
Handling Missing Data
• The fillna() method is used to replace NaN values with another
value. For example, to replace NaN with zeroes,
df.fillna(0)
• We can replace NaN values in a column with the mean of the
column (fillna returns a new Series, so assign it back):
df["m1"] = df["m1"].fillna(df["m1"].mean())
• We can drop the rows containing NaN values: df.dropna()
• We can drop the columns containing NaN values: df.dropna(axis=1)
5. Indexing and Slicing
• The Python and NumPy indexing operators "[ ]" and attribute
operator "." provide quick and easy access to Pandas data
structures.
• Pandas supports two indexers for multi-axes indexing:
i. loc[]: used for label-based indexing
ii. iloc[]: used for integer position-based indexing
• In addition, Boolean indexing is used to perform conditional
retrieval of data
Set Index Column
• We can set a column of the DataFrame as an index in place of
the default index column as follows:
df7 = df6.set_index("ID")
• But if we observe, the changes are not permanent in the older
DataFrame (df6)
• In order to make it the permanent index column, the following
option (inplace) should be set to True:
df6.set_index("ID",inplace = True)
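The copy-versus-inplace distinction above can be sketched with a small invented frame:

```python
import pandas as pd

df6 = pd.DataFrame({"ID": [1, 2, 3],
                    "City": ["Hyderabad", "Mumbai", "Delhi"]})

df7 = df6.set_index("ID")   # returns a NEW DataFrame; df6 is unchanged
print("ID" in df6.columns)  # ID is still an ordinary column of df6

df6.set_index("ID", inplace=True)  # now df6 itself is modified
```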
Label based Indexing
• loc takes two single labels/lists/ranges separated by ','. The
first indicates the rows and the second indicates the
columns. For example: Select all rows and only the Country
column:
df7.loc[:,"Country"]
• Select rows with index ID 3 to 5 and columns City to Name :
df7.loc[3:5,"City":"Name"]
• Similarly, we can give the following slices:
df7.loc[4,"Country"]
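A runnable sketch of the loc slices above, on an invented frame; note that unlike ordinary Python slicing, label slices with loc include BOTH endpoints:

```python
import pandas as pd

df7 = pd.DataFrame({"City": ["SF", "LA", "NY"],
                    "Country": ["USA", "USA", "USA"],
                    "Name": ["A", "B", "C"]},
                   index=[3, 4, 5])

all_countries = df7.loc[:, "Country"]  # every row, one column
block = df7.loc[3:4, "City":"Name"]    # label slices include both ends
one_value = df7.loc[4, "Country"]      # a single cell
```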
Integer based Indexing
• Pandas provides iloc() method in order to perform purely
integer based indexing. These are 0-based indexing methods.
• The various access methods are as follows:
i. A single integer: df7.iloc[4] returns the data from the
5th row; df7.iloc[:,4] returns the data from the 5th column.
ii. A pair of integers: df7.iloc[4,4] returns the value at the
intersection of row position 4 and column position 4. A list of
integers, such as df7.iloc[[1,3],[0,2]], selects those row and
column positions.
iii. A range of values: df7.iloc[1:3,1:3] returns the intersection of
rows 1 to 2 and columns 1 to 2 (the end of the slice is excluded).
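The access methods above, sketched on a small invented frame; unlike loc, the end of an iloc slice is excluded:

```python
import pandas as pd

df7 = pd.DataFrame({"a": [10, 20, 30, 40, 50],
                    "b": [1, 2, 3, 4, 5]})

row5   = df7.iloc[4]            # the 5th row (0-based position 4)
cell   = df7.iloc[4, 1]         # row position 4, column position 1
window = df7.iloc[1:3, 0:2]     # rows 1-2, columns 0-1 (end excluded)
picked = df7.iloc[[0, 2], [1]]  # an explicit list of positions
```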
Boolean indexing
• In boolean indexing, we will select subsets of data based on the
actual values of the data in the DataFrame and not on their
row/column labels or integer locations.
• The syntax for performing boolean indexing on a DataFrame is
similar to other indexing methods. The index label or integer
value is replaced by a boolean expression. For example:
df[df["m1"]>90]
df[(df["m1"]>90) & (df["m2"]>90)]
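A runnable version of the two boolean filters above, with invented marks; note the parentheses around each comparison and the use of & (not the keyword and) to combine them:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B", "C"],
                   "m1": [95, 80, 92],
                   "m2": [91, 99, 85]})

top_m1 = df[df["m1"] > 90]                        # rows where m1 > 90
top_both = df[(df["m1"] > 90) & (df["m2"] > 90)]  # both conditions hold
```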
6. Data Aggregation
• Data aggregation means any data transformation that
produces scalar values from arrays, such as “mean”, “max”, etc.
• We can create a grouping of categories and apply a function to
the categories. For example, if we want to find out the number
of employees city wise, the procedure is as follows:
• Using groupby method, create groups based on the City
attribute:
df7.groupby(['City'])
df7.groupby(['City']).groups # In order to see the groups
Data Aggregation (cont..)
• Apply an aggregate function called sum() on the Employees
column:
group=df7.groupby(['City']).sum()['Employees']
group
• Similarly, to find the maximum number of employees in a City,
the max() aggregate function should be used:
group=df7.groupby(['City']).max()['Employees']
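The two aggregations above can be sketched end to end on an invented frame; selecting the Employees column before aggregating, as below, is equivalent to the .sum()['Employees'] form used on the slide:

```python
import pandas as pd

df7 = pd.DataFrame({"City": ["SF", "LA", "SF"],
                    "Employees": [10, 20, 5]})

# Total and maximum number of employees per city
total_per_city = df7.groupby(["City"])["Employees"].sum()
max_per_city   = df7.groupby(["City"])["Employees"].max()
print(total_per_city)
```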
Merging DataFrames
• Pandas has full-featured, high performance in-memory join
operations idiomatically very similar to relational databases like
SQL.
• Pandas provides a single function, merge() for all standard
database join operations between DataFrame objects
• Create two DataFrames called emp and dept with the following
attributes:
• Emp(empno,ename,deptno)
• Dept(deptno,dname)
merge() two DataFrames on a key
• emp = pd.DataFrame({
'empno':[101,102,103,104,105],
'ename': ['Ramesh', 'Suresh', 'Mahesh', 'Dinesh', 'Naresh'],
'deptno':[1,2,2,1,2]})
• dept = pd.DataFrame(
{'deptno':[1,2,3],
'dname': ['Marketing', 'Operations', 'EDP']})
• merge_df=pd.merge(emp,dept,on='deptno') # key is deptno
• merge_df
Types of merge
• Perform left join: Use keys from left object
merge_left= pd.merge(emp,dept, on='deptno', how='left')
• Perform right join: Use keys from right object
merge_right= pd.merge(emp,dept, on='deptno', how='right')
• Perform outer join: Use union of keys
merge_outer= pd.merge(emp,dept, on='deptno', how='outer')
• Perform inner join: Use intersection of keys (default)
merge_inner= pd.merge(emp,dept, on='deptno', how='inner')
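The difference between the join types shows up when a key appears on only one side. A sketch, using a variant of emp where one deptno has no match in dept:

```python
import pandas as pd

emp = pd.DataFrame({"empno": [101, 102, 103],
                    "ename": ["Ramesh", "Suresh", "Mahesh"],
                    "deptno": [1, 2, 4]})    # deptno 4 has no match
dept = pd.DataFrame({"deptno": [1, 2, 3],
                     "dname": ["Marketing", "Operations", "EDP"]})

inner = pd.merge(emp, dept, on="deptno")               # default: inner
left  = pd.merge(emp, dept, on="deptno", how="left")   # keep all emp rows
outer = pd.merge(emp, dept, on="deptno", how="outer")  # union of keys
```

The inner join drops Mahesh (deptno 4) and the EDP row (deptno 3); the left join keeps Mahesh with dname set to NaN; the outer join keeps everything.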
Deleting columns and rows
• The drop operation on a DataFrame can be used to delete a
row or a column. It returns a DataFrame object. Note that
dropping a column or row is not done inplace.
• To drop a column:
df7.drop("City", axis=1) # axis=1 for columns, axis=0 for rows
• To drop a row:
df7.drop(3, axis=0) # 3 is the index label of the row to be dropped
• Deleting more than one row based on position in the index:
df7.drop(df7.index[1:3], axis=0) # drops the rows at positions 1 and 2
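The three drop operations above, sketched on an invented frame; each returns a new DataFrame and leaves df7 untouched:

```python
import pandas as pd

df7 = pd.DataFrame({"City": ["SF", "LA", "NY"],
                    "Employees": [10, 20, 5]},
                   index=[3, 4, 5])

no_city = df7.drop("City", axis=1)          # drop a column
no_row4 = df7.drop(4, axis=0)               # drop the row labelled 4
trimmed = df7.drop(df7.index[1:3], axis=0)  # drop positions 1 and 2
```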
Updating and modifying columns
• Retrieve the length of the index: len(df7.index)
• Retrieve the shape of the DataFrame: df7.shape
• Retrieve the number of rows as df7.shape[0]
• Finally, add the new column as:
df7["Continent"] = df7.shape[0]*["North America"]
• Note that this operation is in place
• To update or modify an existing column:
df7["Continent"] = df7["Country"]+" , "+df7["Continent"]
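Both steps above, sketched on a two-row invented frame; unlike most DataFrame methods, column assignment modifies the frame in place:

```python
import pandas as pd

df7 = pd.DataFrame({"Country": ["USA", "USA"]})

# Add a new column: a list with one value per row
df7["Continent"] = df7.shape[0] * ["North America"]

# Modify an existing column by combining it with another
df7["Continent"] = df7["Country"] + ", " + df7["Continent"]
```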
Updating and modifying rows
• Find the transpose of the DataFrame to which we want to add
the rows. Store this result in a temporary DataFrame:
df7_t=df7.T # Find the transpose
• Now, add a column to the above temporary DataFrame:
df7_t[7] = ["109 Charles Lane","Los Angeles","LA
500017","USA","Bakers World",5,"USA North America"]
• Find the transpose of this temporary DataFrame and replace
the original DataFrame:
df7=df7_t.T
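The transpose trick above, sketched on a smaller invented frame; assigning to a new index label with loc (last line, an alternative not shown on the slide) adds a row more directly:

```python
import pandas as pd

df7 = pd.DataFrame({"City": ["SF", "LA"], "Employees": [10, 20]})

# Transpose approach: a new row appears as a new column of the transpose
df7_t = df7.T
df7_t[2] = ["NY", 5]
df7 = df7_t.T

# Simpler alternative: assign a list directly to a new index label
df7.loc[3] = ["Boston", 7]
```

One caveat of the transpose approach: the round trip through .T converts all columns to the generic object dtype, so numeric columns may need to be converted back.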
THANK YOU
