The Unique Computers
Neelmatha Lucknow
Python Pandas
What is Pandas?
Pandas is a powerful Python library designed for working with "relational" or
"labeled" data. It aims to support practical, real-world data analysis in
Python, and its flexibility and functionality make it indispensable for many
data-related tasks: data manipulation, dataset exploration, data analysis, and
data preparation for machine learning.
Pandas is built around two data structures, Series and DataFrame. A Series is
a one-dimensional labeled array holding data of any type, such as integers,
strings, and objects, while a DataFrame is a two-dimensional data structure
that holds data in tabular form (rows and columns).
Why Pandas?
The beauty of Pandas is that it simplifies many of the time-consuming,
repetitive tasks involved in working with data frames, such as:
• Import datasets - available in the form of spreadsheets, comma-
separated values (CSV) files, and more.
• Data cleansing - dealing with missing values and representing them
as NaN, NA, or NaT.
• Size mutability - columns can be added and removed from
DataFrame and higher-dimensional objects.
• Data normalization – normalize the data into a suitable format for
analysis.
• Data alignment - objects can be explicitly aligned to a set of labels.
• Intuitive merging and joining of datasets – datasets can be merged and
joined on common keys.
• Reshaping and pivoting of datasets – datasets can be reshaped
and pivoted as per the need.
• Efficient manipulation and extraction - manipulation and
extraction of specific parts of extensive datasets using intelligent label-
based slicing, indexing, and subsetting techniques.
• Statistical analysis - to perform statistical operations on datasets.
• Data visualization - Visualize datasets and uncover insights.
Applications of Pandas
The most common applications of Pandas are as follows:
• Data Cleaning: Pandas provides functionalities to clean messy data,
deal with incomplete or inconsistent data, handle missing values,
remove duplicates, and standardize formats to do effective data
analysis.
• Data Exploration: Pandas makes it easy to compute summary statistics,
find trends, and visualize data using built-in plotting functions or its
Matplotlib and Seaborn integration.
• Data Preparation: Pandas can pivot, melt, convert variables, and
merge datasets based on common columns to prepare data for
analysis.
• Data Analysis: Pandas supports descriptive statistics, time series
analysis, group-by operations, and custom functions.
• Data Visualisation: Pandas has basic plotting capabilities of its own and
also works with data visualization libraries like Matplotlib, Seaborn,
and Plotly.
• Time Series Analysis: Pandas supports date/time indexing,
resampling, frequency conversion, and rolling statistics for time series
data.
• Data Aggregation and Grouping: The Pandas groupby() function lets
you aggregate data and compute group-wise summary statistics or
apply functions to groups.
• Data Input/Output: Pandas makes data input and export easy by
reading and writing CSV, Excel, JSON, SQL databases, and more.
• Machine Learning: Pandas works well with Scikit-learn for data
preparation, feature engineering, and model input data.
• Financial Analysis: Pandas is commonly used in finance for stock
market data analysis, financial indicator calculation, and portfolio
optimization.
• Text Data Analysis: Pandas' string manipulation, regular expressions,
and text mining functions help analyse textual data.
• Experimental Data Analysis: Pandas makes manipulating and
analysing large datasets, performing statistical tests, and visualizing
results easy.
Python Pandas Data Structures
Data structures in Pandas are designed to handle data efficiently. They allow
for the organization, storage, and modification of data in a way that
optimizes memory usage and computational performance. The Python Pandas
library provides two primary data structures for handling and analyzing data −
• Series
• DataFrame
Dimension and Description of Pandas Data Structures
Data Structure   Dimensions   Description
Series           1            A one-dimensional labeled homogeneous array,
                              size-immutable.
DataFrame        2            A two-dimensional labeled, size-mutable tabular
                              structure with potentially heterogeneously
                              typed columns.
Series
A Series is a one-dimensional labeled array that can hold any data type. It
can store integers, strings, floating-point numbers, etc. Each value in a
Series is associated with a label (index), which can be an integer or a string.
Name Steve
Age 35
Gender Male
Rating 3.5
Example
Consider the following Series, in which every value is stored as a string −
import pandas as pd
data = ['Steve', '35', 'Male', '3.5']
series = pd.Series(data, index=['Name', 'Age', 'Gender', 'Rating'])
print(series)
On executing the above program, you will get the following output −
Name      Steve
Age          35
Gender     Male
Rating      3.5
dtype: object
Key Points
Following are the key points related to the Pandas Series.
• Homogeneous data
• Size Immutable
• Values of Data Mutable
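The following minimal sketch illustrates two of these points. The labels and numbers are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical ratings, labeled by name (illustrative values only)
ratings = pd.Series([3.5, 4.0, 2.5], index=["Steve", "Lia", "Vin"])

# Homogeneous data: every element of the Series shares one dtype
print(ratings.dtype)

# Values are mutable: assign a new value through its label
ratings["Lia"] = 4.6
print(ratings)

# Values of mixed Python types are stored under the generic object dtype
mixed = pd.Series(["Steve", 35, 3.5])
print(mixed.dtype)
```

When a Series mixes strings and numbers, Pandas falls back to the object dtype, which is why the earlier example reports dtype: object.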
DataFrame
A DataFrame is a two-dimensional labeled data structure with columns that
can hold different data types. It is similar to a table in a database or a
spreadsheet. Consider the following data representing the performance
rating of a sales team −
Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78
Example
The above tabular data can be represented in a DataFrame as follows −
import pandas as pd
# Data represented as a dictionary: each key becomes a column
data = {
    'Name': ['Steve', 'Lia', 'Vin', 'Katie'],
    'Age': [32, 28, 45, 38],
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Rating': [3.45, 4.6, 3.9, 2.78]
}
# Creating the DataFrame
df = pd.DataFrame(data)
print(df)
Output
On executing the above code you will get the following output −
    Name  Age  Gender  Rating
0  Steve   32    Male    3.45
1    Lia   28  Female    4.60
2    Vin   45    Male    3.90
3  Katie   38  Female    2.78
Key Points
Following are the key points related to the Pandas DataFrame −
• Heterogeneous data
• Size Mutable
• Data Mutable
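These three properties can be seen in a short sketch. The small frame below is a cut-down, hypothetical version of the sales-team data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Steve", "Lia"],
    "Age": [32, 28],
    "Rating": [3.45, 4.6],
})

# Heterogeneous data: each column keeps its own dtype
print(df.dtypes)

# Size mutable: add one column, then drop another
df["Gender"] = ["Male", "Female"]
df = df.drop(columns=["Age"])

# Data mutable: overwrite a single cell by row label and column name
df.loc[0, "Rating"] = 3.5
print(df)
```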
Creating DataFrames
Creating a New DataFrame
import pandas as pd
# Each dictionary key becomes a column; each list holds that column's values
data={"name":["rahul","neha","amit"],
      "age":[12,15,27],
      "Salary":[1200,1500,1200]
      }
df=pd.DataFrame(data)
print(df)
Reading CSV File
import pandas as pd
data=pd.read_csv("book.csv")
print(data)
Reading Excel File
import pandas as pd
data=pd.read_excel("book1.xlsx")
print(data)
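The files book.csv and book1.xlsx are not included here, so the sketch below stands in for a file with an in-memory text buffer. The CSV contents are hypothetical; read_csv accepts either a file path or a file-like object such as this one.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical data)
csv_text = "name,age,Salary\nrahul,12,1200\nneha,15,1500\namit,27,1200\n"

# Read the whole file into a DataFrame
data = pd.read_csv(io.StringIO(csv_text))
print(data)

# usecols loads only the listed columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=["name", "Salary"])
print(subset)

# to_csv without a path returns the CSV text; index=False omits the row index
print(data.to_csv(index=False))
```

Reading Excel files with read_excel works the same way but needs an Excel engine such as openpyxl to be installed.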
Exploring Data in Pandas
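Pandas provides several methods for a first look at a DataFrame:
• head() - shows the first rows (five by default).
• tail() - shows the last rows (five by default).
• info() - prints column names, non-null counts, and data types.
• describe() - returns summary statistics for the numeric columns.
• isnull() - returns True wherever a value is missing.
• isnull().sum() - counts the missing values in each column.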
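A minimal sketch of these calls on a small, made-up DataFrame with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data; np.nan marks a missing age
data = pd.DataFrame({
    "name": ["rahul", "neha", "amit", "priya"],
    "age": [12, 15, 27, np.nan],
    "Salary": [1200, 1500, 1200, 1800],
})

print(data.head(2))          # first two rows
print(data.tail(2))          # last two rows
data.info()                  # prints its report directly and returns None
print(data.describe())       # count, mean, std, min, quartiles, max
print(data.isnull())         # True where a value is missing
print(data.isnull().sum())   # missing values per column
```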
Dealing With Duplicate Values
data.duplicated()
To flag rows that exactly repeat an earlier row
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data.duplicated())
data["Emp_ID"].duplicated()
To flag repeated values in a single column
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data["Emp_ID"].duplicated())
data["Emp_ID"].duplicated().sum()
To count the repeated values in a column
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data["Emp_ID"].duplicated().sum())
data.drop_duplicates("Emp_ID")
To drop rows whose Emp_ID repeats an earlier row
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data.drop_duplicates("Emp_ID"))
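Since salary.xlsx is not available here, the sketch below repeats the same steps on a small, hypothetical frame. It also shows the keep parameter, which controls which copy of a duplicate is treated as the original.

```python
import pandas as pd

# Hypothetical employee data; E02 appears twice
data = pd.DataFrame({
    "Emp_ID": ["E01", "E02", "E02", "E03"],
    "Name": ["Amit", "Neha", "Neha", "Priya"],
})

# True for every value that repeats an earlier one
dup_mask = data["Emp_ID"].duplicated()
print(dup_mask.tolist())   # [False, False, True, False]
print(dup_mask.sum())      # 1

# keep="first" (the default) keeps the first occurrence of each Emp_ID
deduped = data.drop_duplicates(subset="Emp_ID", keep="first")
print(deduped)

# keep=False marks every member of a duplicated group, useful for review
all_dups = data[data["Emp_ID"].duplicated(keep=False)]
print(all_dups)
```

Note that drop_duplicates returns a new DataFrame; the original is unchanged unless the result is assigned back.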
Working with missing values
data.isnull()
To show which values are null
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data.isnull())
data.isnull().sum()
To count null values in each column
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data.isnull().sum())
data.dropna()
To delete rows that contain null values
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
print("\n\n\n")
print(data.dropna())
data.replace(np.nan,"hii")
To replace NaN in every column
import numpy as np
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
# replace() returns a new DataFrame, so print or assign the result
print(data.replace(np.nan,"hii"))
data["Salary"]=data["Salary"].replace(np.nan,30000)
To replace NaN in one column with a chosen value
import pandas as pd
import numpy as np
data=pd.read_excel("salary.xlsx")
data["Salary"]=data["Salary"].replace(np.nan,30000)
print(data)
data["Salary"].mean()
To compute the mean of a column (NaN values are skipped)
import pandas as pd
import numpy as np
data=pd.read_excel("salary.xlsx")
print(data["Salary"].mean())
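data.bfill()
To fill each missing value with the next valid value below it (backward fill).
Older code wrote data.fillna(method="bfill"), which recent pandas versions
deprecate.
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
print("\n\n\n")
print(data.bfill())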
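data.ffill()
To fill each missing value with the last valid value above it (forward fill).
Older code wrote data.fillna(method="ffill"), which recent pandas versions
deprecate.
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data)
print("\n\n\n")
print(data.ffill())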
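The filling strategies above can be compared side by side on a small, hypothetical Salary column. Filling with the column mean is one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing salaries
data = pd.DataFrame({
    "Name": ["Amit", "Neha", "Priya", "Mohit"],
    "Salary": [20000.0, np.nan, 40000.0, np.nan],
})

# mean() skips NaN, so this is (20000 + 40000) / 2
mean_salary = data["Salary"].mean()
filled_mean = data["Salary"].fillna(mean_salary)
print(filled_mean.tolist())    # [20000.0, 30000.0, 40000.0, 30000.0]

# Forward fill copies the last valid value downward
forward = data["Salary"].ffill()
print(forward.tolist())        # [20000.0, 20000.0, 40000.0, 40000.0]

# Backward fill copies the next valid value upward;
# the last row stays NaN because nothing follows it
backward = data["Salary"].bfill()
print(backward.tolist())
```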
Column transformation in Pandas
To create a new column
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data,"\n\n")
# Label each row according to whether a bonus was paid
data.loc[data["Bonus"] == 0,"GetBonus"]="No Bonus"
data.loc[data["Bonus"] > 0,"GetBonus"]="Bonus"
print(data)
To merge two columns
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data,"\n\n")
data["Full name"]=data["Name"]+" "+data["Last Name"]
print(data)
To add a calculated column
import pandas as pd
data=pd.read_excel("salary.xlsx")
print(data,"\n\n")
data["Bonus"]=(data["Salary"]/100)*20
print(data)
To extract some letters from a DataFrame column
import pandas as pd
data={"Month":["January","February","March","April"]}
a=pd.DataFrame(data)
print(a)
def extract(value):
    # Keep the first three letters of the month name
    return value[0:3]
a["Short_Months"]=a["Month"].map(extract)
print(a)
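The same transformations can also be written without a helper function or two separate loc assignments. This is an alternative sketch, not a replacement for the approach above: it uses the .str accessor for slicing and NumPy's where for the two-way label. The Bonus values are made up.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Month": ["January", "February", "March", "April"],
    "Bonus": [0, 500, 0, 250],   # hypothetical amounts
})

# .str applies string operations to every value in the column
data["Short_Months"] = data["Month"].str[:3]

# np.where(condition, value_if_true, value_if_false) fills the column in one step
data["GetBonus"] = np.where(data["Bonus"] > 0, "Bonus", "No Bonus")
print(data)
```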
GroupBy In Pandas
Count Gender by Department
import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
gp=data.groupby("Department").agg({"Gender":"count"})
print(gp)
Count Emp_ID by Job Title
import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
gp=data.groupby("Job Title").agg({"Emp_ID":"count"})
print(gp)
Count Emp_ID by Department and Gender
import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
gp=data.groupby(["Department","Gender"]).agg({"Emp_ID":"count"})
print(gp)
Maximum Age by Country
import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
print("\n\n\n")
a=data.groupby("Countries").agg({"Age":"max"})
print(a)
Maximum Age by Country and Gender
import pandas as pd
data=pd.read_excel("Salary.xlsx")
print(data)
print("\n\n\n")
a=data.groupby(["Countries","Gender"]).agg({"Age":"max"})
print(a)
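Several aggregations can be computed in one groupby call. The sketch below uses a small, hypothetical frame with the same column names as the examples above; the output column names (employees, oldest, mean_age) are chosen here for illustration.

```python
import pandas as pd

# Hypothetical employee data
data = pd.DataFrame({
    "Emp_ID": ["E01", "E02", "E03", "E04", "E05"],
    "Department": ["IT", "IT", "HR", "HR", "HR"],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Age": [34, 29, 41, 25, 38],
})

# Named aggregation: new_column=(source_column, function)
summary = data.groupby("Department").agg(
    employees=("Emp_ID", "count"),
    oldest=("Age", "max"),
    mean_age=("Age", "mean"),
)
print(summary)

# Grouping by two columns, then turning the group labels back into columns
by_dept_gender = (
    data.groupby(["Department", "Gender"])["Emp_ID"]
    .count()
    .reset_index(name="employees")
)
print(by_dept_gender)
```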
Merge, Join, and Concatenate in Pandas
Merge
On the basis of EEID
import pandas as pd
data1={"EEID":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karab","Mohit"],
"Age":[34,56,24,27,28,26]}
print(data1)
data2={"EEID":["A01","A02","A03","A04","A05","A06"],
"Salary":[45000,47000,30000,14200,42300,456600]}
print(data2)
print("\n\n\n")
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print()
print(df2)
print()
print(pd.merge(df1,df2,on="EEID"))
Use of how
P1: how="inner" keeps only the EEID values present in both frames
import pandas as pd
data1={"EEID":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karab","Mohit"],
"Age":[34,56,24,27,28,26]}
print(data1)
data2={"EEID":["A01","A02","A03","A04","A05","A06"],
"Salary":[45000,47000,30000,14200,42300,456600]}
print(data2)
print("\n\n\n")
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print()
print(df2)
print()
print(pd.merge(df1,df2,on="EEID", how="inner"))
P2: how="left" keeps every row of df1
print(pd.merge(df1,df2,on="EEID", how="left"))
P3: how="right" keeps every row of df2
print(pd.merge(df1,df2,on="EEID", how="right"))
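Because df1 and df2 above share exactly the same EEID values, all three settings of how return the same rows. The difference only shows when the keys do not fully match, as in this sketch with shortened, hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({"EEID": ["A01", "A02", "A03"],
                    "Name": ["Amit", "priya", "Neha"]})
df2 = pd.DataFrame({"EEID": ["A02", "A03", "A04"],
                    "Salary": [47000, 30000, 14200]})

# inner: only keys found in both frames (A02, A03)
inner = pd.merge(df1, df2, on="EEID", how="inner")

# left: every row of df1; A01 has no salary, so it gets NaN
left = pd.merge(df1, df2, on="EEID", how="left")

# right: every row of df2; A04 has no name, so it gets NaN
right = pd.merge(df1, df2, on="EEID", how="right")

# outer: every key from either frame; indicator adds a _merge column
# saying where each row came from
outer = pd.merge(df1, df2, on="EEID", how="outer", indicator=True)

for result in (inner, left, right, outer):
    print(result, "\n")
```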
Concatenate
import pandas as pd
data1={"EEID":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karan","Mohit"]}
data2={"EEID":["A07","A08","A09","A010","A11","A12"],
"Name":["Atin","Pankaj","Alia","Suman","Sanjay","Karan"]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print(df2)
print()
ndf=pd.concat([df1,df2])
print(ndf)
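One detail worth noticing in the output above: concat keeps each frame's original row labels, so the index repeats. A minimal sketch with shortened, hypothetical frames shows how ignore_index renumbers the rows:

```python
import pandas as pd

df1 = pd.DataFrame({"EEID": ["A01", "A02"], "Name": ["Amit", "priya"]})
df2 = pd.DataFrame({"EEID": ["A07", "A08"], "Name": ["Atin", "Pankaj"]})

# Default: the original row labels are kept, so they repeat
stacked = pd.concat([df1, df2])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True assigns a fresh 0..n-1 index
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```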
Join
By default, join() aligns the two frames on their row index, not on a column
import pandas as pd
data1={"EEI":["A01","A02","A03","A04","A05","A06"],
"Name":["Amit","priya","Neha","Lovely","Karab","Mohit"]}
print(data1)
data2={"EEID":["A09","A02","A03","A010","A05","A06"],
"Salary":[45000,47000,30000,14200,42300,456600]}
print(data2)
print("\n\n\n")
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
print(df1)
print()
print(df2)
print()
print(df1.join(df2))
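Since join() matches rows by index, the example above pairs rows by position, so A01 ends up next to A09. To match on the employee ID instead, one option is to make that column the index of both frames first. The sketch below assumes both frames name the key column EEID; the data is shortened and hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"EEID": ["A01", "A02", "A03"],
                    "Name": ["Amit", "priya", "Neha"]})
df2 = pd.DataFrame({"EEID": ["A09", "A02", "A03"],
                    "Salary": [45000, 47000, 30000]})

# Joining on the default index pairs rows by position.
# Both frames have an EEID column, so suffixes are needed to tell them apart.
by_position = df1.join(df2, lsuffix="_left", rsuffix="_right")
print(by_position)

# Setting EEID as the index makes join() match on the employee ID
by_key = df1.set_index("EEID").join(df2.set_index("EEID"), how="left")
print(by_key)
```

In the second result, A01 has no matching salary and is shown as NaN, while A02 and A03 receive their own salaries.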