The Pandas Library
Handling Missing Data:
• Pandas provides tools to detect, replace, and manage missing data effectively using NaN
(Not a Number).
Label-based Indexing:
• Pandas allows label-based selection of data via rows or columns, making data extraction
more intuitive and faster.
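As a minimal sketch of label-based selection (the DataFrame, names, and labels below are illustrative):

```python
import pandas as pd

# Illustrative DataFrame with name labels as the row index
df = pd.DataFrame({"Age": [25, 30]}, index=["Alice", "Bob"])

# .loc selects by row label and column label
print(df.loc["Bob", "Age"])  # 30
```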
Data Alignment:
• When performing arithmetic operations, Pandas automatically aligns data based on labels
(rows and columns), simplifying operations on datasets of different shapes.
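A short sketch of automatic alignment with two Series whose labels only partially overlap:

```python
import pandas as pd

# Two Series with partially overlapping labels
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])

# Addition aligns on labels: only "b" appears in both, so "a" and "c" become NaN
total = s1 + s2
print(total)
```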
File Format Support:
• Pandas supports reading from and writing to multiple file formats including CSV, Excel,
XML, HTML, and more.
Data Cleaning:
• With powerful methods for filtering, transforming, and cleaning data, Pandas is a go-to tool
for preparing data for analysis.
Group Operations:
• The groupby() functionality allows easy aggregation and transformation of data based on
conditions or keys.
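A minimal sketch of groupby() aggregation; the 'Region' and 'Sales' column names here are illustrative:

```python
import pandas as pd

# Hypothetical sales table ('Region' and 'Sales' are illustrative names)
df = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "Sales": [100, 200, 150, 250],
})

# groupby() splits the rows by key, then sum() aggregates each group
totals = df.groupby("Region")["Sales"].sum()
print(totals)
```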
Powerful Time Series Tools:
• Pandas has built-in support for time series data, making it easier to handle date ranges,
timestamps, and frequency conversion.
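A small sketch of date ranges and frequency conversion (the dates and values are arbitrary):

```python
import pandas as pd

# Six daily observations starting from an arbitrary date
idx = pd.date_range("2024-01-01", periods=6, freq="D")
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# resample() converts the daily series to two-day totals
two_day = s.resample("2D").sum()
print(two_day)
```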
• Ease of Use: Pandas simplifies working with large datasets by providing high-level
functions for common data analysis tasks.
• Data Preparation for Machine Learning: In machine learning projects, Pandas is often
used to clean and preprocess data before feeding it into models.
• Comprehensive Data Analysis: Pandas offers a wide array of methods for data
manipulation, statistical analysis, and visualization, making it a versatile tool for
researchers, analysts, and developers.
Example:
import pandas as pd
data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)
Output:
0 10
1 20
2 30
3 40
dtype: int64
# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Basic operations
print(df)
Output:
      Name  Age  Salary
0    Alice   25   50000
1      Bob   30   60000
2  Charlie   35   70000
Output of print(df.describe()):
        Age   Salary
count   3.0      3.0
mean   30.0  60000.0
std     5.0  10000.0
min    25.0  50000.0
25%    27.5  55000.0
50%    30.0  60000.0
75%    32.5  65000.0
max    35.0  70000.0
• print(df) displays the entire DataFrame, showing each row with columns "Name", "Age", and
"Salary".
• print(df.describe()) provides summary statistics for the numerical columns (Age and Salary),
including count, mean, standard deviation, minimum, maximum, and quartile values (25%, 50%, and
75%).
Pandas Series
A Series in Pandas is a one-dimensional array that can hold various data types such as integers, floats,
and strings. Each element in a Series is associated with a unique label, called an index.
Creating a Series
You can create a Series from lists, dictionaries, or numpy arrays.
import pandas as pd
# Series from a list
data = pd.Series([10, 20, 30, 40, 50])
print(data)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Here, Pandas assigns a default integer index starting at 0. You can also supply custom labels:
data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(data)
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
Here, each element is labeled with custom indexes 'a', 'b', etc.
print(data['c'])    # Output: 30 (label-based access)
print(data.iloc[2]) # Output: 30 (position-based access; data[2] is deprecated for label-indexed Series)
Operations on Series
Series supports vectorized operations, meaning operations are applied to each element in the Series.
# Addition: applied element-wise
print(data + 5)
# Condition-based filtering: keep elements greater than 20
print(data[data > 20])
Pandas DataFrame
A DataFrame is a two-dimensional, tabular data structure with rows and columns. Each column in a
DataFrame is a Series, so a DataFrame is essentially a collection of Series aligned by a common index.
Creating a DataFrame
A DataFrame can be created from various sources, such as dictionaries, lists, or even other DataFrames.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   22      Chicago
3    David   32      Houston
Here, a custom index can be supplied in place of the default integers:
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(df)
Output:
      Name  Age         City
a    Alice   25     New York
b      Bob   30  Los Angeles
c  Charlie   22      Chicago
d    David   32      Houston
Accessing Columns
print(df['Name'])  # Selecting a single column returns a Series
Adding a Column
df['Salary'] = [50000, 60000, 45000, 70000]  # illustrative values
print(df)
Output:
      Name  Age         City  Salary
a    Alice   25     New York   50000
b      Bob   30  Los Angeles   60000
c  Charlie   22      Chicago   45000
d    David   32      Houston   70000
Deleting a Column
df = df.drop(columns=['Salary'])
Sorting Data
df_sorted = df.sort_values(by='Age')  # Sort rows by the 'Age' column
Allows for complex operations like filtering, grouping, and merging data.
Feature      Series                   DataFrame
Data Type    Can hold one data type   Each column can hold a different data type
Pandas Series and DataFrames are highly versatile structures for managing and analyzing data, making
it possible to apply efficient, readable, and fast operations on various data types.
Definition: In pandas, Index objects are immutable, which means they cannot be changed directly. They
are used as labels for rows and columns in a Series or DataFrame.
Purpose: Indexes are essential for data alignment, slicing, and selection. They make data manipulation
efficient and accessible by referencing labels rather than numeric positions.
Types of Indexes:
1. Default Integer Index: This is the default index (0, 1, 2, …).
2. Custom Index: Users can specify custom labels for indices.
3. MultiIndex: Supports hierarchical (multi-level) indexing, useful for working with multi-
dimensional data.
Creating Index: You can create a Series or DataFrame with a custom index.
import pandas as pd
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
Modify an index with methods like set_index() (for DataFrame) or rename() (for renaming labels).
reset_index() can convert an index back to a column, resetting the index to a default integer-based
index.
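A short sketch of set_index() and reset_index() in action (the DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Promote the 'Name' column to the index...
df_indexed = df.set_index("Name")

# ...and reset_index() turns it back into a regular column with a default integer index
df_reset = df_indexed.reset_index()
print(df_indexed)
print(df_reset)
```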
Suppose we have scores for three students (Alice, Bob, Charlie) in two subjects (Math, Science). Using
MultiIndex, we can organize the scores by both student and subject.
import pandas as pd
students = ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie']
subjects = ['Math', 'Science', 'Math', 'Science', 'Math', 'Science']
index = pd.MultiIndex.from_arrays([students, subjects], names=['Student', 'Subject'])
df = pd.DataFrame({'Score': [85, 90, 78, 82, 92, 88]}, index=index)
print(df)
Output:
Score
Student Subject
Alice Math 85
Science 90
Bob Math 78
Science 82
Charlie Math 92
Science 88
Explanation:
students represents the first level of the index with names of students.
subjects represents the second level of the index with subjects for each student.
Accessing Data
print(df.loc['Alice'])
Output:
Score
Subject
Math 85
Science 90
print(df.loc[('Alice', 'Science')])
Output:
Score    90
Name: (Alice, Science), dtype: int64
Reindexing
Definition: Reindexing aligns data to a new index, which can involve reordering, adding new labels, or
dropping existing ones.
Purpose: It’s useful when you need to conform data to a specific index structure or ensure consistency
across multiple Series or DataFrames.
How It Works:
When you reindex, missing labels in the new index result in NaN values.
New labels will be added, and if they don’t exist in the original data, NaN values are also assigned.
Examples:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s_reindexed = s.reindex(['a', 'b', 'c', 'd'])  # 'd' is a new label, resulting in NaN
Use the fill_value parameter to replace NaN values with a specific value during reindexing.
method parameter allows forward fill (ffill) or backward fill (bfill), which can propagate the last or next
available value.
You can reindex on rows (default) or columns by specifying the axis parameter.
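The fill options above can be sketched as follows (the Series and labels are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# fill_value substitutes a constant instead of NaN for new labels
filled = s.reindex(["a", "b", "c", "d"], fill_value=0)

# method="ffill" carries the last valid value forward (requires a sorted index)
num = pd.Series([10, 20], index=[0, 2])
ffilled = num.reindex(range(4), method="ffill")
print(filled)
print(ffilled)
```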
Dropping
Definition: Dropping removes specified labels (rows or columns) from a Series or DataFrame.
Purpose: Useful for excluding specific data or focusing on a subset by discarding unnecessary parts.
inplace: If True, modifies the original object; otherwise, returns a modified copy.
Examples:
# Dropping a column (assuming a DataFrame df with these columns)
df_dropped = df.drop(columns=['City'])
# In-place modification
df.drop(columns=['Age'], inplace=True)
Definition: Arithmetic operations between Series or DataFrames in pandas are performed element-wise
and align based on index labels.
Purpose: This alignment ensures consistency across data even when labels don’t perfectly match, filling
mismatched areas with NaN.
Supported Operations:
Basic arithmetic (+, -, *, /, etc.) is supported, allowing for intuitive mathematical operations between
Series/DataFrames.
Use .add(), .sub(), .mul(), and .div() methods for controlled alignment and filling.
Alignment Process:
If labels don’t match, the resulting DataFrame or Series has NaN values in non-matching areas.
Example of Arithmetic Operations:
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
s3 = s1 + s2  # Result: 'a' has NaN, 'b' = 2+4, 'c' = 3+5, 'd' has NaN
To fill mismatched labels instead of producing NaN:
s3 = s1.add(s2, fill_value=0)
DataFrame Arithmetic:
When two DataFrames are involved, the alignment occurs based on both row and column labels.
df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'A': [10, 20]}, index=['y', 'z'])
print(df1 + df2)  # only row 'y' has a value (22); 'x' and 'z' become NaN
Operations Between DataFrames and Series
Operations between a DataFrame and a Series typically involve arithmetic like addition, subtraction,
multiplication, and division, and they follow a principle known as "broadcasting."
1. Column-wise operations: When Series matches the DataFrame columns, the operation applies
to each column individually.
2. Row-wise operations: When Series matches the DataFrame rows, the operation applies to each
row individually.
3. Broadcasting: If Series has only one value, it can be broadcasted across all elements in the
DataFrame.
Example:
Suppose we have a DataFrame of students' scores in three subjects, and a Series representing bonus
points to be added to each subject.
import pandas as pd
data = {
    'Math': [85, 90, 78],
    'Science': [92, 88, 84],
    'English': [78, 80, 82]
}
scores_df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])
bonus_series = pd.Series([5, 2, 3], index=['Math', 'Science', 'English'])
new_scores_df = scores_df + bonus_series
print(new_scores_df)
Output:
Math Science English
Alice 90 94 81
Bob 95 90 83
Charlie 83 86 85
Explanation: Each subject's scores have been increased by the respective bonus points. The Series
(bonus_series) was aligned with the DataFrame based on the column labels, and the addition was done
element-wise.
Example:
Suppose we have the same DataFrame of scores, but this time, we want to subtract a specific penalty for
each student.
penalty_series = pd.Series([2, 3, 1], index=['Alice', 'Bob', 'Charlie'])
new_scores_df = scores_df.sub(penalty_series, axis='index')
print(new_scores_df)
Output:
         Math  Science  English
Alice      83       90       76
Bob        87       85       77
Charlie    77       83       81
Explanation: The penalty_series was aligned with the DataFrame based on the row labels, and the
subtraction was done for each student's scores.
Broadcasting with a Series
When a Series has only one value (like a single column or row), it can be broadcasted across either the
columns or rows of a DataFrame.
Example:
Let's say we want to add a fixed bonus of 2 points to each score in the DataFrame. A plain scalar
broadcasts to every element (note that pd.Series([2]) would instead align its integer label 0 against the
column names and produce NaN):
new_scores_df = scores_df + 2
print(new_scores_df)
Output:
         Math  Science  English
Alice      87       94       80
Bob        92       90       82
Charlie    80       86       84
Explanation: Here, the scalar 2 was broadcast across all values in the DataFrame, adding 2 points
to each score.
Functions by Element
Element-wise functions operate on each individual entry in the DataFrame. These are typically
functions applied using applymap() or element-wise arithmetic operations.
Example:
import pandas as pd
# Sample DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)
squared_df = df.applymap(lambda x: x ** 2)  # in pandas >= 2.1, DataFrame.map is the preferred name
print(squared_df)
Output:
A B C
0 1 16 49
1 4 25 64
2 9 36 81
Explanation: Here, applymap(lambda x: x ** 2) applies the lambda function to each element, squaring
each value in the DataFrame.
lambda x: x ** 2 is a lambda function that takes a single input (x) and returns its square (x ** 2).
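Since applymap() is deprecated in recent pandas releases in favor of DataFrame.map, a version-tolerant sketch of the same element-wise squaring looks like this (the hasattr check is just a compatibility guard):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# DataFrame.map (pandas >= 2.1) replaces the older applymap name;
# the hasattr check keeps this sketch working on older versions too
func = df.map if hasattr(df, "map") else df.applymap
squared = func(lambda x: x ** 2)
print(squared)
```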
# Column-wise sum with apply()
column_sums = df.apply(lambda x: x.sum(), axis=0)
print(column_sums)
Output:
A 6
B 15
C 24
dtype: int64
Explanation: Setting axis=0 in apply() performs the sum on each column. So, it calculates the total of
each column (A, B, and C).
# Row-wise mean with apply()
row_means = df.apply(lambda x: x.mean(), axis=1)
print(row_means)
Output:
0 4.0
1 5.0
2 6.0
dtype: float64
Explanation: By setting axis=1, apply() calculates the mean across each row. So, we get the average of
each row.
Statistics Functions
Pandas provides a variety of built-in statistical functions for common calculations on DataFrames and
Series, such as mean, median, standard deviation, and more. These functions can be applied along
either rows or columns via the axis parameter.
1. Mean (mean())
Calculates the average (mean) of values along the specified axis.
import pandas as pd
# Sample DataFrame
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}
df = pd.DataFrame(data)
# Column-wise mean
column_means = df.mean(axis=0)
print(column_means)
Output:
A 2.0
B 5.0
C 8.0
dtype: float64
2. Median (median())
Finds the median (middle value) of values along the specified axis.
# Column-wise median
column_medians = df.median(axis=0)
print(column_medians)
Output:
A 2.0
B 5.0
C 8.0
dtype: float64
3. Standard Deviation (std())
Calculates the standard deviation, a measure of how spread out the values are, along the specified axis.
# Column-wise standard deviation
column_std = df.std(axis=0)
print(column_std)
Output:
A 1.0
B 1.0
C 1.0
dtype: float64
4. Variance (var())
Calculates the variance, which is the square of the standard deviation.
# Column-wise variance
column_variance = df.var(axis=0)
print(column_variance)
Output:
A 1.0
B 1.0
C 1.0
dtype: float64
5. Minimum and Maximum (min(), max())
Finds the minimum or maximum value along the specified axis.
# Column-wise minimum
column_min = df.min(axis=0)
print(column_min)
# Column-wise maximum
column_max = df.max(axis=0)
print(column_max)
Output:
Minimum:
A 1
B 4
C 7
dtype: int64
Maximum:
A 3
B 6
C 9
dtype: int64
6. Sum (sum())
Calculates the sum of values along the specified axis.
# Column-wise sum
column_sum = df.sum(axis=0)
print(column_sum)
Output:
A 6
B 15
C 24
dtype: int64
7. Count (count())
Counts the number of non-NaN values along the specified axis.
# Column-wise count
column_count = df.count(axis=0)
print(column_count)
Output:
A 3
B 3
C 3
dtype: int64
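The columns above have no missing values, so count() equals the number of rows. A small sketch with a NaN shows the difference:

```python
import pandas as pd
import numpy as np

# count() skips NaN entries, so it can be smaller than len(df)
df = pd.DataFrame({"A": [1.0, np.nan, 3.0]})
print(df["A"].count())  # 2
print(len(df))          # 3
```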
8. Correlation (corr())
Calculates the pairwise correlation of columns in a DataFrame.
correlation_matrix = df.corr()
print(correlation_matrix)
Output:
     A    B    C
A  1.0  1.0  1.0
B  1.0  1.0  1.0
C  1.0  1.0  1.0
Each column increases perfectly linearly with the others, so every pairwise correlation is 1.0.
9. Covariance (cov())
Calculates the covariance between columns.
covariance_matrix = df.cov()
print(covariance_matrix)
Output:
     A    B    C
A  1.0  1.0  1.0
B  1.0  1.0  1.0
C  1.0  1.0  1.0
1. Sorting in Pandas
Pandas provides two primary methods for sorting data: sort_values(), which sorts by the values of one
or more columns, and sort_index(), which sorts by the index labels.
Example:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Math': [85, 92, 78, 88],
    'Science': [91, 82, 89, 95]
}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by='Math', ascending=False)
print(sorted_df)
Output:
      Name  Math  Science
1      Bob    92       82
3    David    88       95
0    Alice    85       91
2  Charlie    78       89
The sort_index() function sorts the DataFrame or Series by its index labels.
Example:
sorted_index_df = df.sort_index()
print(sorted_index_df)
You can specify the ascending parameter to sort in descending order if needed.
2. Ranking in Pandas
Ranking assigns ranks to values, with options to handle ties and specify ranking method. The rank()
function assigns ranks to the values in a DataFrame or Series.
Ranking in Pandas is a way of giving each value a position (or rank) based on how it compares to other
values. Think of it like lining up people by height or scores and assigning each person a number to show
their position in the lineup.
In Pandas, the rank() function helps us do this automatically for a column in a DataFrame or a Series.
Ranking Methods
The method parameter controls how ties are handled: 'average' (the default), 'min', 'max', 'first', and
'dense'.
Imagine we have a list of students and their math scores. We want to assign ranks based on the scores,
where the highest score gets rank 1.
import pandas as pd
# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Math': [85, 92, 78, 88]
}
df['Math_Rank'] = df['Math'].rank(ascending=False)
print(df)
Output:
      Name  Math  Math_Rank
0    Alice    85        3.0
1      Bob    92        1.0
2  Charlie    78        4.0
3    David    88        2.0
Explanation:
Each student’s rank is based on their position in the sorted list of scores.
Let’s add another student with the same score as Alice (85) to see how ties work.
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Math': [85, 92, 78, 88, 85]
}
df = pd.DataFrame(data)
df['Math_Rank'] = df['Math'].rank(ascending=False)
print(df)
Output:
      Name  Math  Math_Rank
0    Alice    85        3.5
1      Bob    92        1.0
2  Charlie    78        5.0
3    David    88        2.0
4      Eve    85        3.5
Explanation:
Both Alice and Eve have a score of 85, which ties them for ranks 3 and 4.
With the default average method, they each get an average rank of 3.5.
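The different tie-handling methods can be compared side by side; this sketch reuses the tied scores from above as a plain Series:

```python
import pandas as pd

scores = pd.Series([85, 92, 78, 88, 85])

# Three ways to break the tie between the two 85s (ranking descending)
avg = scores.rank(ascending=False, method="average")  # both 85s get 3.5
low = scores.rank(ascending=False, method="min")      # both 85s get 3.0
dense = scores.rank(ascending=False, method="dense")  # next value gets 4, not 5
print(avg.tolist())
print(low.tolist())
print(dense.tolist())
```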
Covariance
Covariance tells us the direction of the linear relationship between two variables:
• Positive Covariance: If one variable increases when the other increases, they have a positive
covariance.
• Negative Covariance: If one variable increases when the other decreases, they have a negative
covariance.
• Zero Covariance: If there is no predictable relationship in the direction, the covariance will be
close to zero.
import pandas as pd
# Sample DataFrame
data = {
'X': [1, 2, 3, 4, 5],
'Y': [5, 4, 6, 8, 10]
}
df = pd.DataFrame(data)
# Calculate covariance
covariance = df.cov()
print(covariance)
Output:
     X    Y
X  2.5  3.5
Y  3.5  5.8
Explanation: The covariance between X and Y is 3.5, suggesting a positive relationship.
However, the actual value of covariance can vary greatly depending on the units of X and Y,
making it hard to interpret the strength of the relationship.
Correlation
Correlation measures both the strength and direction of the linear relationship between two variables
and is a normalized form of covariance. The correlation coefficient (often called Pearson’s correlation
coefficient) ranges from -1 to +1:
• +1: Perfect positive correlation (as one variable increases, the other increases in a perfectly
linear way).
• 0: No linear correlation.
• -1: Perfect negative correlation (as one variable increases, the other decreases in a perfectly
linear way).
# Calculate correlation
correlation = df.corr()
print(correlation)
Output:
          X         Y
X  1.000000  0.919145
Y  0.919145  1.000000
Explanation: The correlation between X and Y is approximately 0.919, indicating a strong
positive linear relationship. Unlike covariance, the correlation coefficient is dimensionless (it doesn't
depend on the units of the variables), making it easier to interpret.
• Covariance measures the direction of a relationship (positive or negative) but doesn't provide a
standard scale, so it's hard to compare across different datasets.
• Correlation measures both the direction and strength of the relationship on a standardized scale
from -1 to +1, making it easier to interpret and compare.
In practice:
• Use covariance when you only need to know the direction of the relationship and units aren’t an
issue.
• Use correlation when you need a clear, standardized measure of both the direction and strength
of the relationship.
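The link between the two measures can be checked numerically: Pearson correlation is covariance divided by the product of the standard deviations. This sketch reuses the X/Y data from the examples above:

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 3, 4, 5], "Y": [5, 4, 6, 8, 10]})

# Pearson correlation is covariance normalized by the two standard deviations
r = df["X"].cov(df["Y"]) / (df["X"].std() * df["Y"].std())
print(r)
```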
• NaN stands for "Not a Number." It is a special floating-point value defined in the IEEE 754 floating-
point standard.
• NaN is used in Pandas to indicate missing values. It is represented as np.nan in NumPy and pd.NA
for nullable data types in Pandas.
• NaNs are primarily found in datasets where data may be incomplete, such as survey results where
respondents skip questions.
Example
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, 2, 3, np.nan],
    'C': [1, 2, 3, 4]
}
df = pd.DataFrame(data)
print(df.isna())
Output:
       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3  False   True  False
Explanation: The isna() function returns True where values are NaN and False otherwise.
There are several ways to handle NaN values, depending on the analysis requirements:
dropna(): Removes rows (or columns, with axis=1) that contain NaN values.
df_dropped = df.dropna()
print(df_dropped)
Output:
A B C
1 2.0 2.0 2
df_dropped_cols = df.dropna(axis=1)
print(df_dropped_cols)
Output:
   C
0  1
1  2
2  3
3  4
fillna(): Fills NaN values with a specified value or a calculated statistic (like mean, median, or mode).
# Filling NaN values with a specific value (e.g., 0)
df_filled = df.fillna(0)
print(df_filled)
Output:
A B C
0 1.0 0.0 1
1 2.0 2.0 2
2 0.0 3.0 3
3 4.0 0.0 4
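Beyond a constant, fillna() also accepts per-column values such as the column means; a short sketch with illustrative data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [2.0, 4.0, np.nan]})

# Fill each column's NaN with that column's mean
df_filled = df.fillna(df.mean())
print(df_filled)
```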
1. CSV Files
CSV files are one of the most common formats for data storage. They store tabular data in plain text,
where each line represents a row and values are separated by commas.
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
df.to_csv('output.csv', index=False)
2. Text Files
Text files can be read and written in Pandas, especially when they follow a specific delimiter (e.g., tab-
separated, space-separated).
Reading a Text File
df = pd.read_csv('data.txt', delimiter='\t')
print(df.head())
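A self-contained round trip through a tab-separated text file can be sketched like this (written to a temporary directory so no real data file is assumed):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Round-trip through a tab-separated text file in a temporary directory
path = os.path.join(tempfile.gettempdir(), "pandas_roundtrip.txt")
df.to_csv(path, sep="\t", index=False)
df_back = pd.read_csv(path, delimiter="\t")
print(df_back)
```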
3. HTML Files
HTML tables can be directly read into Pandas if they are well-structured in the HTML file.
Pandas can read HTML tables from a URL or local file, and it returns a list of DataFrames (one for each
table found).
url = 'https://example.com/data.html'
dfs = pd.read_html(url)
# Displaying the first table if there are multiple tables in the HTML file
df = dfs[0]
print(df.head())
df.to_html('output.html', index=False)
4. XML Files
XML files are structured documents, and each entry can be mapped to rows in a DataFrame. Pandas can
read XML files with the read_xml() function.
df = pd.read_xml('data.xml')
print(df.head())
df.to_xml('output.xml', index=False)
5. Excel Files
Excel files are read and written with read_excel() and to_excel(). You can specify the sheet name if
there are multiple sheets.
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())
df.to_excel('output.xlsx', index=False)
Practice Program:
1. The Series
import pandas as pd
# Creating a Series
s = pd.Series([10, 20, 30, 40, 50])
print("Series:\n", s)
2. The DataFrame
# Creating a DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)
print("DataFrame:\n", df)
3. Index Objects
# Accessing Index
print("Index:\n", df.index)
4. Reindexing
s_reindexed = s.reindex([0, 1, 2, 3, 4, 5], fill_value=0)
print("Reindexed Series:\n", s_reindexed)
5. Dropping
# Dropping a column
df_dropped = df.drop(columns=['B'])
print("After dropping column 'B':\n", df_dropped)
6. Arithmetic and Data Alignment
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])
result = s1 + s2
print("Arithmetic with Data Alignment:\n", result)
7. Operations Between DataFrames and Series
s_row = pd.Series([1, 2, 3])
df_subtracted = df.sub(s_row, axis='index')  # Subtracts each row in 'A' and 'B' by the Series 's_row'
print("Row-wise subtraction:\n", df_subtracted)
8. Functions by Element
df_squared = df.applymap(lambda x: x ** 2)
print("Squared:\n", df_squared)
9. Ranking
df['B_rank'] = df['B'].rank()
print("With rank column:\n", df)
10. Correlation and Covariance
correlation = df.corr()
covariance = df.cov()
print("Correlation:\n", correlation)
print("Covariance:\n", covariance)
11. Handling NaN
import numpy as np
df_nan = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6]})
df_filled = df_nan.fillna(0)
print("NaN filled:\n", df_filled)
CSV
df.to_csv('output.csv', index=False)
# Reading from a CSV file
df_csv = pd.read_csv('output.csv')
Text File
df.to_csv('output.txt', sep='\t', index=False)
df_txt = pd.read_csv('output.txt', delimiter='\t')
print("DataFrame from text file:\n", df_txt)
HTML
df.to_html('output.html', index=False)
dfs = pd.read_html('output.html')
XML
df.to_xml('output.xml', index=False)
df_xml = pd.read_xml('output.xml')
print("DataFrame from XML:\n", df_xml)
Excel
df.to_excel('output.xlsx', index=False)
df_excel = pd.read_excel('output.xlsx')