Introduction to Python Pandas
Pandas is a powerful open-source Python library for data analysis and manipulation. It provides
data structures like DataFrame and Series that make handling structured data (like tables and
time-series) easy and efficient. Pandas is widely used in data science, machine learning, and
analytics due to its versatility and high-level abstractions for managing datasets.
Key Features of Pandas
1. Data Structures (see the sketch after this list):
o Series: One-dimensional, similar to a column in Excel or a 1D NumPy array.
o DataFrame: Two-dimensional, like a table with rows and columns.
2. Data Manipulation:
o Filtering, sorting, grouping, and aggregation.
3. Integration:
o Works seamlessly with other libraries like NumPy, Matplotlib, and Scikit-learn.
4. Data I/O:
o Read and write data from various formats like CSV, Excel, SQL, JSON, etc.
5. Time-Series Support:
o Provides functionality for analyzing and processing time-series data.
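The two data structures listed above can be illustrated with a minimal sketch (the names and
values below are made up for demonstration):
import pandas as pd
# A Series: one-dimensional labeled data
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Carol'])
# A DataFrame: two-dimensional, like a table with rows and columns
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [25, 30, 35]})
print(ages)
print(df)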
Applications of Pandas
1. Data Cleaning
Real-time Example:
Task: Cleaning customer data by handling missing values.
import pandas as pd
# Sample data
data = {'Name': ['Alice', 'Bob', None, 'Eve'], 'Age': [25, 30, None, 40]}
df = pd.DataFrame(data)
# Handling missing values (assigning back avoids chained-assignment warnings)
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
2. Financial Analysis
Real-time Example:
Task: Analyzing stock market data.
import pandas as pd
# Loading sample stock data
df = pd.read_csv('https://example.com/stock_prices.csv', parse_dates=['Date'])
# Filtering for a specific company (.copy() avoids SettingWithCopyWarning)
apple_stock = df[df['Company'] == 'Apple'].copy()
# Calculating a 20-day moving average
apple_stock['Moving_Avg'] = apple_stock['Close'].rolling(window=20).mean()
print(apple_stock.head())
3. Exploratory Data Analysis (EDA)
Real-time Example:
Task: Analyzing a dataset of sales.
import pandas as pd
import matplotlib.pyplot as plt
# Loading sales data
sales_data = pd.read_csv('https://example.com/sales_data.csv')
# Grouping sales by region
region_sales = sales_data.groupby('Region')['Sales'].sum()
# Plotting the data
region_sales.plot(kind='bar', title='Sales by Region')
plt.show()
4. Time Series Analysis
Real-time Example:
Task: Forecasting electricity demand based on past data.
import pandas as pd
# Loading time series data
df = pd.read_csv('https://example.com/electricity_demand.csv',
parse_dates=['Timestamp'])
# Resampling data to hourly averages ('h' is the hourly frequency alias)
hourly_demand = df.resample('h', on='Timestamp')['Demand'].mean()
print(hourly_demand.head())
5. Machine Learning Preprocessing
Real-time Example:
Task: Preparing data for a machine learning model.
import pandas as pd
# Loading data
data = pd.read_csv('https://example.com/housing_data.csv')
# Dropping irrelevant columns
data.drop(['ID'], axis=1, inplace=True)
# Encoding categorical features
data = pd.get_dummies(data, columns=['City'], drop_first=True)
# Normalizing numerical features
data['Price'] = (data['Price'] - data['Price'].mean()) / data['Price'].std()
print(data.head())
6. Web Scraping and Analysis
Real-time Example:
Task: Scraping live product prices and analyzing them.
import pandas as pd
import requests
from bs4 import BeautifulSoup
# Scraping data from a website
response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.text, 'html.parser')
# Extracting product names and prices
products = {'Name': [], 'Price': []}
for product in soup.select('.product-item'):
    products['Name'].append(product.select_one('.name').text)
    products['Price'].append(float(product.select_one('.price').text.strip('$')))
df = pd.DataFrame(products)
# Analyzing product prices
print(df.describe())
Why Use Pandas?
Handles large datasets efficiently.
Provides intuitive data manipulation tools.
Simplifies working with different data formats.
Integrates well with visualization and machine learning tools.
What Can Pandas Do?
Pandas gives you answers about your data, such as:
Is there a correlation between two or more columns?
What is the average value?
What is the max value?
What is the min value?
Pandas can also delete rows that are not relevant or that contain wrong values, such as
empty or NULL values. This is called cleaning the data.
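A minimal sketch of these questions in code (column names and values are made up):
import pandas as pd
df = pd.DataFrame({'Sales': [100, 200, None, 400], 'Profit': [10, 25, 30, 50]})
# Correlation between numeric columns
print(df.corr())
# Average, max, and min values of a column
print(df['Sales'].mean(), df['Sales'].max(), df['Sales'].min())
# Cleaning: drop rows that contain empty/NULL (NaN) values
print(df.dropna())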
Pandas Series:
A Pandas Series can be defined as a one-dimensional array that is capable of storing various
data types. We can easily convert a list, tuple, or dictionary into a Series using the Series()
constructor. The row labels of a Series are called the index. A Series cannot contain multiple
columns. It takes the following parameters (illustrated in the sketch after this list):
data: It can be any list, dictionary, or scalar value.
index: The values of the index should be unique and hashable. It must be of the same
length as data. If we do not pass any index, a default np.arange(n) index will be used.
dtype: It refers to the data type of the series.
copy: It is used for copying the data.
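A small sketch showing all four parameters together (the values are illustrative):
import numpy as np
import pandas as pd
data = np.array([10, 20, 30])
# dtype forces float64; copy=True copies the input array instead of referencing it
s = pd.Series(data=data, index=['a', 'b', 'c'], dtype='float64', copy=True)
print(s)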
Creating a Series:
We can create a Series in two ways:
1. Create an empty Series
2. Create a Series using inputs.
Create an Empty Series:
We can easily create an empty Series in Pandas, which means it will not have any values.
The syntax used for creating an empty Series is:
<series object> = pandas.Series()
The example below creates an empty Series object that has no values and the default
datatype, i.e., float64 (note: in pandas 2.0 and later, the default dtype of an empty Series
is object, so newer versions print dtype: object).
Example
import pandas as pd
x = pd.Series()
print(x)
Output
Series([], dtype: float64)
Creating a Series using inputs:
We can create Series by using various inputs:
o Array
o Dict
o Scalar value
Creating a Series from an Array:
Before creating a Series from an array, we first have to import the numpy module and then
use its array() function. If the data is an ndarray and an index is passed, the index must be
of the same length as the array.
If we do not pass an index, then a default index of range(n) is used, where n is the
length of the array, i.e., [0, 1, 2, ...., len(array)-1].
Example
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Output
0    P
1    a
2    n
3    d
4    a
5    s
dtype: object
Create a Series from a dict
We can also create a Series from a dict. If a dictionary object is passed as input and no
index is specified, then the dictionary keys are taken in insertion order to construct the
index (older versions of pandas sorted the keys).
If an index is passed, then the values corresponding to the labels in the index will be
extracted from the dictionary, as shown in the sketch after this example.
#import the pandas library
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print(a)
Output
x 0.0
y 1.0
z 2.0
dtype: float64
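The case where an index is passed can be sketched as follows; labels that have no matching
dictionary key become NaN:
import pandas as pd
info = {'x' : 0., 'y' : 1., 'z' : 2.}
# 'w' has no matching key in the dict, so it becomes NaN
b = pd.Series(info, index=['y', 'z', 'w'])
print(b)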
Create a Series using a Scalar:
If the data is a scalar value, then an index must be provided. The scalar value will be
repeated to match the length of the index.
#import pandas library
import pandas as pd
import numpy as np
x = pd.Series(4, index=[0, 1, 2, 3])
print(x)
Output
0    4
1    4
2    4
3    4
dtype: int64
Accessing Data from a Series by Position:
Once you create a Series object, you can access its index, its data, and even
individual elements.
The data in a Series can be accessed in the same way as in an ndarray; by-position access
uses .iloc.
import pandas as pd
x = pd.Series([1,2,3],index = ['a','b','c'])
#retrieve the first element by position
print(x.iloc[0])
Output
1
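Elements can also be retrieved by label, or several at a time by slicing; a short sketch using
the same Series:
import pandas as pd
x = pd.Series([1,2,3],index = ['a','b','c'])
# retrieve by label
print(x['b'])
# retrieve the first two elements by positional slicing
print(x.iloc[0:2])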
Series object attributes
A Series attribute is defined as any information related to the Series object, such as its
size, datatype, etc. Below are some of the attributes that you can use to get information
about a Series object:
Attribute              Description
Series.index           Defines the index of the Series.
Series.shape           Returns a tuple with the shape of the data.
Series.dtype           Returns the data type of the data.
Series.size            Returns the size of the data.
Series.empty           Returns True if the Series object is empty, otherwise False.
Series.hasnans         Returns True if there are any NaN values, otherwise False.
Series.nbytes          Returns the number of bytes in the data.
Series.ndim            Returns the number of dimensions in the data.
Series.dtype.itemsize  Returns the size in bytes of one item of the datatype (the older
                       Series.itemsize attribute was removed in pandas 1.0).
Retrieving Index array and data array of a series object
We can retrieve the index array and data array of an existing Series object by using the
attributes index and values.
1. import numpy as np
2. import pandas as pd
3. x=pd.Series(data=[2,4,6,8])
4. y=pd.Series(data=[11.2,18.6,22.5], index=['a','b','c'])
5. print(x.index)
6. print(x.values)
7. print(y.index)
8. print(y.values)
Output
RangeIndex(start=0, stop=4, step=1)
[2 4 6 8]
Index(['a', 'b', 'c'], dtype='object')
[11.2 18.6 22.5]
Retrieving Types (dtype) and Size of Type (itemsize)
You can use the dtype attribute of a Series object, as <object name>.dtype, to retrieve the
data type of the individual elements of a Series object; the dtype.itemsize attribute shows
the number of bytes allocated to each data item.
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.dtype)
print(a.dtype.itemsize)
print(b.dtype)
print(b.dtype.itemsize)
Output
int64
8
float64
8
Retrieving the Shape
The shape of a Series object is a tuple giving the total number of elements, including
missing or empty (NaN) values.
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Output
(4,)
(3,)
Retrieving Dimension, Size and Number of bytes:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6], index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Output
1 1
4 3
32 24
Checking Emptiness and Presence of NaNs
To check whether a Series object is empty, you can use the empty attribute. Similarly, to
check whether a Series object contains any NaN values, you can use the hasnans attribute.
Example
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,np.nan])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
c=pd.Series(dtype='float64')
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
# len() includes NaN values; count() excludes them
print(len(a),len(b))
print(a.count(),b.count())
Output
False False True
True False False
4 3
3 3
Series Functions
Some commonly used Series functions are as follows:
Pandas Series.map()           Maps the values of a Series according to an input mapping
                              (a dict, Series, or function).
Pandas Series.std()           Calculates the standard deviation of the given set of
                              numbers, DataFrame, column, or rows.
Pandas Series.to_frame()      Converts the Series object to a DataFrame.
Pandas Series.value_counts()  Returns a Series containing counts of unique values.
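A brief sketch exercising each of these functions on a small Series:
import pandas as pd
s = pd.Series([1, 2, 2, 3])
print(s.map({1: 'one', 2: 'two', 3: 'three'}))  # map values via a dict
print(s.std())                                  # standard deviation
print(s.to_frame(name='value'))                 # convert Series to DataFrame
print(s.value_counts())                         # counts of unique values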
Python DataFrame: Reading CSV and JSON, and Performing Analysis Functions
Python's pandas library provides powerful tools for handling, manipulating, and analyzing
structured data.
1. Python DataFrame: Reading CSV
Definition
pd.read_csv(): Reads a comma-separated values (CSV) file into a DataFrame.
CSV files are widely used for storing tabular data in various fields such as finance, healthcare,
and e-commerce.
Real-Time Scenario
Finance: Reading a CSV containing stock market data to analyze trends.
E-commerce: Reading product sales data for generating reports.
Example: Reading CSV and Basic Operations
import pandas as pd
# Reading a CSV file
df = pd.read_csv("sales_data.csv")
# Displaying first 5 rows
print(df.head())
# Scenario: Calculate total sales
total_sales = df["sales_amount"].sum()
print(f"Total Sales: {total_sales}")
2. Python DataFrame: Reading JSON
Definition
pd.read_json(): Reads a JSON file into a DataFrame.
JSON is a popular format for transmitting data in web applications and APIs.
Real-Time Scenario
Web Development: Reading user details from a JSON API response.
Social Media Analysis: Reading JSON containing user activity for engagement reports.
Example: Reading JSON and Basic Operations
import pandas as pd
# Reading a JSON file
df = pd.read_json("user_data.json")
# Displaying first 5 rows
print(df.head())
# Scenario: Filter users above age 30
filtered_users = df[df["age"] > 30]
print(filtered_users)
3. Python DataFrame: Analysis Functions
Definition
Pandas provides a wide range of functions to analyze and manipulate data, such as
summarization, filtering, grouping, and visualization.
Real-Time Scenario
Healthcare: Summarizing patient data for trend analysis.
Marketing: Grouping customer purchases by region for targeted campaigns.
Analysis Functions
Summarization Functions
df.describe(): Provides statistical summary.
df.mean(), df.sum(), df.count(): Calculate mean, sum, or count of values.
Example: Statistical Summary
df = pd.read_csv("employee_data.csv")
print(df.describe()) # Summary of numerical columns
# Scenario: Calculate average salary
avg_salary = df["salary"].mean()
print(f"Average Salary: {avg_salary}")
Filtering and Querying
df.loc[]: Filter rows by label.
df[df["column_name"] > value]: Conditional filtering.
Example: Filter Data
# Scenario: Employees with salary > 50000
high_salary = df[df["salary"] > 50000]
print(high_salary)
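df.loc[] selects by label rather than by condition alone; a short sketch on the same
hypothetical employee data (column names are assumed from the example above):
# Label-based selection: index labels 0 through 2 (inclusive) and the salary column
print(df.loc[0:2, ["salary"]])
# loc can also combine a boolean condition with a column selection
print(df.loc[df["salary"] > 50000, ["salary"]])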
Grouping and Aggregation
df.groupby(): Groups data by specified columns and applies aggregation functions.
Example: Group Sales by Region
# Scenario: Total sales by region
grouped_sales = df.groupby("region")["sales_amount"].sum()
print(grouped_sales)
Sorting
df.sort_values(): Sorts the DataFrame by specified columns.
Example: Sort Employees by Salary
sorted_employees = df.sort_values(by="salary", ascending=False)
print(sorted_employees)
Handling Missing Data
df.isnull(): Checks for missing values.
df.fillna(): Fills missing values with a specified value.
df.dropna(): Drops rows/columns with missing values.
Example: Handle Missing Values
# Scenario: Replace missing salaries with 30000
df["salary"] = df["salary"].fillna(30000)
print(df)
Merging and Joining
pd.merge(): Merges two DataFrames.
df.join(): Joins DataFrames on indices.
Example: Merge Employee and Department Data
departments = pd.DataFrame({"dept_id": [1, 2], "dept_name": ["HR", "Finance"]})
merged_df = pd.merge(df, departments, on="dept_id")
print(merged_df)
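df.join(), by contrast, aligns on the index; a minimal sketch with made-up frames:
left = pd.DataFrame({"salary": [40000, 60000]}, index=["Alice", "Bob"])
right = pd.DataFrame({"dept": ["HR", "Finance"]}, index=["Alice", "Bob"])
# Joins on the shared index labels
print(left.join(right))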
Visualization
df.plot(): Generates basic plots.
df.hist(): Creates histograms.
Example: Plot Sales Data
import matplotlib.pyplot as plt
# Scenario: Sales Trend
df["sales_amount"].plot(kind="line")
plt.title("Sales Trend")
plt.show()
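df.hist() works similarly; for instance, the distribution of the same hypothetical sales
column:
# Histogram of sales amounts (the bin count is arbitrary)
df["sales_amount"].hist(bins=20)
plt.title("Distribution of Sales")
plt.show()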
Functions Summary
Function          Purpose                                      Example Use Case
pd.read_csv()     Load data from a CSV file into a DataFrame   Load sales or employee data.
pd.read_json()    Load data from a JSON file into a DataFrame  Load API response for user activity.
df.describe()     Statistical summary of numerical columns     Summarize patient statistics.
df.groupby()      Group data and apply aggregation functions   Calculate total sales per region.
df.sort_values()  Sort data by specified columns               Rank employees by salary.
df.fillna()       Fill missing values                          Replace missing product prices.
df.plot()         Visualize data using basic plots             Analyze sales trends over months.
Data Cleaning Functions in Python DataFrames
Data cleaning is a crucial step in preparing datasets for analysis. Pandas provides several
functions to clean and preprocess data. Below is a detailed explanation of key data-cleaning
techniques, real-time scenarios, and example codes.
Common Data Issues and Pandas Cleaning Functions
Issue                    Function/Technique                        Description
Missing Values           isnull(), notnull(), fillna(), dropna()   Identify, fill, or remove missing data.
Duplicate Rows           duplicated(), drop_duplicates()           Detect and remove duplicate rows.
Incorrect Data Types     astype(), to_datetime()                   Convert data to appropriate types.
Outliers                 clip(), replace(), filtering              Handle extreme values.
Invalid Values           filtering, apply(), replace()             Replace or correct invalid entries.
Inconsistent Formatting  str.lower(), str.strip(), str.replace()   Standardize text data for consistency.
Removing Unwanted Data   filtering, drop()                         Drop irrelevant rows or columns.
1. Handling Missing Data
Scenario: A sales dataset has missing values for revenue.
import pandas as pd
# Sample DataFrame with missing values
data = {
"Product": ["A", "B", "C", None],
"Sales": [100, 200, None, 150],
"Revenue": [500, None, 300, 400],
}
df = pd.DataFrame(data)
# Identify missing values
print(df.isnull())
# Fill missing values
df["Revenue"] = df["Revenue"].fillna(df["Revenue"].mean())
# Drop rows with missing Product
df = df.dropna(subset=["Product"])
print(df)
Functions
isnull(): Checks for missing values.
fillna(value): Replaces missing values with a specified value.
dropna(): Removes rows or columns with missing values.
2. Removing Duplicates
Scenario: A customer dataset has duplicate entries.
# Sample DataFrame with duplicates
data = {"Customer": ["Alice", "Bob", "Alice"], "Purchase": [200, 300, 200]}
df = pd.DataFrame(data)
# Detect duplicates
print(df.duplicated())
# Remove duplicates
df = df.drop_duplicates()
print(df)
Functions
duplicated(): Identifies duplicate rows.
drop_duplicates(): Removes duplicate rows.
3. Converting Data Types
Scenario: Date data is in string format and needs conversion.
data = {"Date": ["2024-12-01", "2024-12-02", "2024-12-03"], "Sales": ["100",
"200", "300"]}
df = pd.DataFrame(data)
# Convert Sales to numeric
df["Sales"] = pd.to_numeric(df["Sales"])
# Convert Date to datetime
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
Functions
pd.to_numeric(): Converts a column to a numeric type.
astype(type): Converts a column to the specified type.
pd.to_datetime(): Converts a column to datetime format.
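astype() performs an explicit cast when the target type is already known; a small sketch
continuing the example above:
# Explicit cast with astype(), e.g., treating Sales as floating point
df["Sales"] = df["Sales"].astype("float64")
print(df.dtypes)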
4. Handling Outliers
Scenario: Sales data contains extreme outliers.
data = {"Sales": [100, 200, 300, 10000]}
df = pd.DataFrame(data)
# Cap sales at 500
df["Sales"] = df["Sales"].clip(upper=500)
print(df)
Functions
clip(lower, upper): Limits values within a specified range.
replace(): Replaces specified values.
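replace() suits known sentinel or data-entry-error values; a sketch that treats the value
10000 as an error (the choice of bad value is illustrative):
df = pd.DataFrame({"Sales": [100, 200, 300, 10000]})
# Replace the known bad entry with the column median
df["Sales"] = df["Sales"].replace(10000, df["Sales"].median())
print(df)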
5. Replacing Invalid or Incorrect Values
Scenario: Age column has invalid negative values.
data = {"Name": ["Alice", "Bob"], "Age": [25, -5]}
df = pd.DataFrame(data)
# Clamp negative ages up to 0 (treat negative entries as invalid)
df["Age"] = df["Age"].apply(lambda x: max(x, 0))
print(df)
Functions
replace(to_replace, value): Replaces values based on conditions.
apply(func): Applies a custom function to transform data.
6. Standardizing Text
Scenario: Product names have inconsistent capitalization and whitespace.
data = {"Product": [" apple", "Orange ", "BANANA"]}
df = pd.DataFrame(data)
# Standardize text
df["Product"] = df["Product"].str.strip().str.lower()
print(df)
Functions
str.strip(): Removes leading/trailing whitespace.
str.lower(): Converts text to lowercase.
str.replace(pattern, replacement): Replaces text based on a pattern.
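str.replace() handles recurring patterns; for instance, stripping a hypothetical currency
prefix before converting to numbers:
df = pd.DataFrame({"Price": ["$10", "$20", "$30"]})
# Remove the '$' prefix (a literal string, not a regex), then convert to int
df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(int)
print(df)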
7. Dropping Irrelevant Data
Scenario: Drop unnecessary columns like "ID".
data = {"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"], "Score": [85,
90, 95]}
df = pd.DataFrame(data)
# Drop ID column
df = df.drop(columns=["ID"])
print(df)
Functions
drop(columns=...): Removes specified columns.
drop(index=...): Removes specified rows.
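drop() also removes rows when given index labels; a short sketch continuing the example
above:
# Drop the first row by its index label (0)
df = df.drop(index=[0])
print(df)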
8. Applying Filters
Scenario: Retain rows where revenue > 300.
data = {"Product": ["A", "B", "C"], "Revenue": [200, 400, 300]}
df = pd.DataFrame(data)
# Filter rows
filtered_df = df[df["Revenue"] > 300]
print(filtered_df)
9. Handling Categorical Data
Scenario: Replace categorical values with labels.
data = {"Gender": ["Male", "Female", "Male"]}
df = pd.DataFrame(data)
# Replace categories with numeric labels
df["Gender"] = df["Gender"].replace({"Male": 0, "Female": 1})
print(df)
Summary of Functions
Function           Use Case                               Real-Time Scenario
isnull()           Identify missing values                Detect missing survey responses.
fillna()           Fill missing data                      Replace missing prices with the average.
dropna()           Remove rows/columns with missing data  Drop incomplete customer records.
duplicated()       Detect duplicate rows                  Find duplicate orders in e-commerce data.
drop_duplicates()  Remove duplicate rows                  Clean duplicate customer entries.
astype()           Convert column data types              Convert numeric strings to integers.
replace()          Replace specific values                Replace "NA" with a default value in a column.
clip()             Cap outliers                           Limit revenue to a specific range.
str.strip()        Remove extra spaces                    Clean up messy product names.
drop()             Drop irrelevant data                   Remove ID columns for analysis.