
UNIT-3

THE PANDAS LIBRARY


• Pandas is the foundational library for data manipulation and analysis in Python. It offers a robust
set of tools for working with various types of data, enabling fast, intuitive, and powerful data
analysis. Whether you're handling small datasets or large-scale data in machine learning
pipelines, Pandas is an essential tool for data professionals.
• Pandas is an open-source data manipulation and analysis library for Python, built on top of
NumPy.
• It provides powerful data structures, such as Series and DataFrame, that make it easy to work
with structured data, such as tables, time series, and even mixed-type data.
• Pandas is widely used in data science, machine learning, finance, economics, and various fields
where data analysis is required.

Key Features of Pandas:


Data Structures:

• Series: A one-dimensional labeled array (similar to a column in a table).


• DataFrame: A two-dimensional, size-mutable, and heterogeneous tabular data structure (similar
to a spreadsheet or SQL table).

Handling Missing Data:

• Pandas provides tools to detect, replace, and manage missing data effectively using NaN
(Not a Number).

Label-based Indexing:

• Pandas allows label-based selection of data via rows or columns, making data extraction
more intuitive and faster.

Data Alignment:

• When performing arithmetic operations, Pandas automatically aligns data based on labels
(rows and columns), simplifying operations on datasets of different shapes.

Flexible Data Input and Output (I/O):

• Pandas supports reading from and writing to multiple file formats including CSV, Excel,
XML, HTML, and more.

Data Wrangling and Cleaning:

• With powerful methods for filtering, transforming, and cleaning data, Pandas is a go-to tool
for preparing data for analysis.

Group Operations:

• The groupby() functionality allows easy aggregation and transformation of data based on
conditions or keys.
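As a quick sketch of groupby() (using made-up sales data):

```python
import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West'],
    'Sales': [100, 200, 150, 250]
})

# Rows are grouped by the 'Region' key,
# then the 'Sales' column is aggregated with sum()
totals = df.groupby('Region')['Sales'].sum()
print(totals)
```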
Powerful Time Series Tools:

• Pandas has built-in support for time series data, making it easier to handle date ranges,
timestamps, and frequency conversion.
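A minimal sketch of these tools, using an arbitrary start date:

```python
import pandas as pd

# Seven daily timestamps starting from a chosen date
dates = pd.date_range('2024-01-01', periods=7, freq='D')
s = pd.Series(range(7), index=dates)

# Frequency conversion: resample the daily data to weekly sums
print(s.resample('W').sum())
```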

Why Use Pandas?

• Ease of Use: Pandas simplifies working with large datasets by providing high-level
functions for common data analysis tasks.
• Data Preparation for Machine Learning: In machine learning projects, Pandas is often
used to clean and preprocess data before feeding it into models.
• Comprehensive Data Analysis: Pandas offers a wide array of methods for data
manipulation, statistical analysis, and visualization, making it a versatile tool for
researchers, analysts, and developers.

Example:

import pandas as pd

# Creating a Series from a list

data = [10, 20, 30, 40]

s = pd.Series(data)

# Displaying the Series

print(s)

Output:

0 10

1 20

2 30

3 40

dtype: int64

# Creating a DataFrame

data = {'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [25, 30, 35],

'Salary': [50000, 60000, 70000]}

df = pd.DataFrame(data)

# Basic operations

print(df) # Display the DataFrame


Output:

Name Age Salary

0 Alice 25 50000

1 Bob 30 60000

2 Charlie 35 70000

print(df.describe()) # Summary statistics of numerical columns

Output of print(df.describe()):

Age Salary

count 3.000000 3.000000

mean 30.000000 60000.000000

std 5.000000 10000.000000

min 25.000000 50000.000000

25% 27.500000 55000.000000

50% 30.000000 60000.000000

75% 32.500000 65000.000000

max 35.000000 70000.000000

• print(df) displays the entire DataFrame, showing each row with columns "Name", "Age", and
"Salary".
• print(df.describe()) provides summary statistics for the numerical columns (Age and Salary),
including count, mean, standard deviation, minimum, maximum, and quartile values (25%, 50%, and
75%).

Pandas Series

A Series in Pandas is a one-dimensional array that can hold various data types such as integers, floats,
and strings. Each element in a Series is associated with a unique label, called an index.

Creating a Series
You can create a Series from lists, dictionaries, or numpy arrays.

Example 1: Creating a Series from a list

import pandas as pd
# Series from a list

data = pd.Series([10, 20, 30, 40, 50])

print(data)

Output:

0 10

1 20

2 30

3 40

4 50

dtype: int64

Here:

The default index is integers starting from 0.

The dtype int64 denotes the data type of the elements.

Example 2: Creating a Series with a custom index

data = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

print(data)

Output:

a 10

b 20

c 30

d 40

e 50

dtype: int64

Here, each element is labeled with custom indexes 'a', 'b', etc.
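Since a Series can also be created from a dictionary (as noted above), here is a brief sketch; the dictionary keys become the index labels automatically:

```python
import pandas as pd

# Dictionary keys become the index labels
data = pd.Series({'a': 10, 'b': 20, 'c': 30})
print(data)
```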

Accessing Data in a Series


You can access elements using their index, similar to a dictionary.
# Accessing using index

print(data['c']) # Output: 30

# Accessing by integer position

print(data.iloc[2]) # Output: 30

Operations on Series
Series supports vectorized operations, meaning operations are applied to each element in the Series.

# Addition

print(data + 10) # Adds 10 to each element

# Condition-based filtering

print(data[data > 25]) # Only values greater than 25

Key Points of Series


It’s similar to a 1D numpy array but with labels.

Can store different data types: integers, floats, and strings.

Useful for single-column data or labeled data collections.

Pandas DataFrame
A DataFrame is a two-dimensional, tabular data structure with rows and columns. Each column in a
DataFrame is a Series, so a DataFrame is essentially a collection of Series aligned by a common index.

Creating a DataFrame

A DataFrame can be created from various sources, such as dictionaries, lists, or even other DataFrames.

Example 1: Creating a DataFrame from a dictionary

# Creating a DataFrame from a dictionary

data = {

'Name': ['Alice', 'Bob', 'Charlie', 'David'],

'Age': [24, 27, 22, 32],

'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']


}

df = pd.DataFrame(data)

print(df)

Output:

Name Age City

0 Alice 24 New York

1 Bob 27 Los Angeles

2 Charlie 22 Chicago

3 David 32 Houston

Here:

df has three columns: Name, Age, and City.

Rows are labeled with the default integer index.

Example 2: Creating a DataFrame with a custom index

# Setting custom index

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

print(df)

Output:

Name Age City

a Alice 24 New York

b Bob 27 Los Angeles

c Charlie 22 Chicago

d David 32 Houston

Accessing Data in a DataFrame


You can access data by columns, rows, or specific elements.

Accessing Columns

print(df['Name']) # Outputs the 'Name' column as a Series

Accessing Rows by Index

print(df.loc['c']) # Outputs the row with index 'c'

Accessing Specific Elements

print(df.loc['b', 'City']) # Output: Los Angeles

Basic DataFrame Operations

Adding a Column

df['Salary'] = [50000, 60000, 55000, 65000]

print(df)

Output:

Name Age City Salary

a Alice 24 New York 50000

b Bob 27 Los Angeles 60000

c Charlie 22 Chicago 55000

d David 32 Houston 65000

Deleting a Column

df = df.drop(columns=['Salary'])

Filtering Rows Based on Condition

print(df[df['Age'] > 25]) # Rows where Age is greater than 25

Sorting Data

df = df.sort_values(by='Age', ascending=False) # Sorts by Age in descending order

Key Points of DataFrames

It’s like a table or a spreadsheet in Python.


Supports multiple data types across columns.

Provides flexibility with labeled columns and row indexing.

Allows for complex operations like filtering, grouping, and merging data.
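Merging is mentioned above but not demonstrated elsewhere in this unit; a minimal sketch with hypothetical data:

```python
import pandas as pd

# Two hypothetical tables sharing an 'ID' column
employees = pd.DataFrame({'ID': [1, 2, 3],
                          'Name': ['Alice', 'Bob', 'Charlie']})
salaries = pd.DataFrame({'ID': [1, 2, 3],
                         'Salary': [50000, 60000, 70000]})

# Inner join on the common 'ID' column (the default for pd.merge)
merged = pd.merge(employees, salaries, on='ID')
print(merged)
```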

Summary: Series vs. DataFrame

Feature     Series                       DataFrame

Dimensions  1-dimensional                2-dimensional

Structure   Single column                Multiple columns and rows

Data Type   Holds one data type          Each column can hold a different data type

Usage       Simple labeled collections   Tabular, structured data

Indexing    Single (row) index           Row and column indexing

Pandas Series and DataFrames are highly versatile structures for managing and analyzing data, making
it possible to apply efficient, readable, and fast operations on various data types.

The Index Objects

Definition: In pandas, Index objects are immutable, which means they cannot be changed directly. They
are used as labels for rows and columns in a Series or DataFrame.

Purpose: Indexes are essential for data alignment, slicing, and selection. They make data manipulation
efficient and accessible by referencing labels rather than numeric positions.

Types of Indexes:
1. Default Integer Index: This is the default index (0, 1, 2, …).
2. Custom Index: Users can specify custom labels for indices.
3. MultiIndex: Supports hierarchical (multi-level) indexing, useful for working with multi-
dimensional data.

Creating Index: You can create a Series or DataFrame with a custom index.

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])


df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])

Accessing and Modifying Index:

Access an index with df.index or s.index.

Modify an index with methods like set_index() (for DataFrame) or rename() (for renaming labels).

reset_index() can convert an index back to a column, resetting the index to a default integer-based
index.
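A short sketch of set_index() and reset_index() with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})

# Promote the 'Name' column to the index
df_indexed = df.set_index('Name')

# reset_index() moves the index back into a regular column
df_reset = df_indexed.reset_index()
print(df_reset)
```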

Working with MultiIndex:

MultiIndex objects allow you to create multiple levels of indexing.

Useful for complex datasets, such as hierarchical data or grouped data.

arrays = [['A', 'A', 'B', 'B'], ['one', 'two', 'one', 'two']]

index = pd.MultiIndex.from_arrays(arrays, names=('upper', 'lower'))

df = pd.DataFrame({'data': [1, 2, 3, 4]}, index=index)

Example: MultiIndex for Students’ Test Scores

Suppose we have scores for three students (Alice, Bob, Charlie) in two subjects (Math, Science). Using
MultiIndex, we can organize the scores by both student and subject.

import pandas as pd

# Define the levels of the MultiIndex

students = ['Alice', 'Alice', 'Bob', 'Bob', 'Charlie', 'Charlie']

subjects = ['Math', 'Science', 'Math', 'Science', 'Math', 'Science']

# Create the MultiIndex

index = pd.MultiIndex.from_arrays([students, subjects], names=('Student', 'Subject'))

# Create a DataFrame with scores for each student in each subject

scores = [85, 90, 78, 82, 92, 88]

df = pd.DataFrame({'Score': scores}, index=index)


print(df)

Output:

Score

Student Subject

Alice Math 85

Science 90

Bob Math 78

Science 82

Charlie Math 92

Science 88

Explanation:

Define the Index Levels:

students represents the first level of the index with names of students.

subjects represents the second level of the index with subjects for each student.

Create the MultiIndex:

pd.MultiIndex.from_arrays([students, subjects], names=('Student', 'Subject')) combines the two lists into


a MultiIndex and assigns names to each level (Student and Subject).

Create the DataFrame:

df = pd.DataFrame({'Score': scores}, index=index) creates a DataFrame where the MultiIndex is used as


the index and the Score column contains each student’s score in each subject.

Accessing Data

You can access scores for individual students or subjects easily:


All scores for Alice:


print(df.loc['Alice'])

Output:


Score

Subject

Math 85

Science 90

Alice’s score in Science:


print(df.loc[('Alice', 'Science')])

Output:


Score 90

Name: (Alice, Science), dtype: int64

Reindexing
Definition: Reindexing aligns data to a new index, which can involve reordering, adding new labels, or
dropping existing ones.
Purpose: It’s useful when you need to conform data to a specific index structure or ensure consistency
across multiple Series or DataFrames.

How It Works:

When you reindex, missing labels in the new index result in NaN values.

New labels will be added, and if they don’t exist in the original data, NaN values are also assigned.

Examples:

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

s_reindexed = s.reindex(['a', 'b', 'c', 'd']) # 'd' is a new index, resulting in NaN

Filling Missing Values:

Use the fill_value parameter to replace NaN values with a specific value during reindexing.

method parameter allows forward fill (ffill) or backward fill (bfill), which can propagate the last or next
available value.

s.reindex(['a', 'b', 'c', 'd'], fill_value=0)
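The method parameter can be sketched like this (forward fill requires a monotonically ordered index):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Forward fill: the new label 'd' takes the last available value (from 'c')
s_ffilled = s.reindex(['a', 'b', 'c', 'd'], method='ffill')
print(s_ffilled)
```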

Reindexing for DataFrames:

Similar syntax applies to DataFrames.

You can reindex on rows (default) or columns by specifying the axis parameter.

df.reindex(['a', 'b', 'c'], axis=0) # Reindex rows

df.reindex(['A', 'B'], axis=1) # Reindex columns

Dropping

Definition: Dropping removes specified labels (rows or columns) from a Series or DataFrame.

Purpose: Useful for excluding specific data or focusing on a subset by discarding unnecessary parts.

Methods and Parameters:

drop(labels, axis, inplace, errors) where:


labels: Specifies row/column labels to drop.

axis: Determines whether to drop rows (axis=0) or columns (axis=1).

inplace: If True, modifies the original object; otherwise, returns a modified copy.

errors: Controls handling of non-existent labels (ignore or raise).

Examples:

# Dropping a row from a DataFrame

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])

df_dropped = df.drop('a') # Drop row 'a'

# Dropping a column

df_dropped = df.drop('A', axis=1) # Drop column 'A'

# In-place modification

df.drop('a', inplace=True) # Drop row 'a' in-place
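The errors parameter listed above can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])

# errors='ignore' silently skips labels that do not exist;
# the default errors='raise' would raise a KeyError instead
df_safe = df.drop('z', errors='ignore')
print(df_safe)  # unchanged, since 'z' is not in the index
```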

Arithmetic and Data Alignment

Definition: Arithmetic operations between Series or DataFrames in pandas are performed element-wise
and align based on index labels.

Purpose: This alignment ensures consistency across data even when labels don’t perfectly match, filling
mismatched areas with NaN.

Supported Operations:

Basic arithmetic (+, -, *, /, etc.) is supported, allowing for intuitive mathematical operations between
Series/DataFrames.

Use .add(), .sub(), .mul(), and .div() methods for controlled alignment and filling.

Alignment Process:

Pandas aligns data by matching row and column labels.

If labels don’t match, the resulting DataFrame or Series has NaN values in non-matching areas.
Example of Arithmetic Operations:


s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

s3 = s1 + s2 # Result: 'a' has NaN, 'b' = 2+4, 'c' = 3+5, 'd' has NaN

Filling Missing Values in Arithmetic Operations:

Use the fill_value parameter to replace missing values (e.g., 0) in calculations.


s3 = s1.add(s2, fill_value=0)

DataFrame Arithmetic:

When two DataFrames are involved, the alignment occurs based on both row and column labels.

Arithmetic operations can be performed with scalars, Series, or other DataFrames.


df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

df_sum = df1 + df2 # Element-wise addition


Operations between DataFrame and Series
Operations between DataFrame and Series are quite useful in Pandas, as they allow us to perform
operations on individual rows or columns in a DataFrame using a Series.

These operations typically involve arithmetic operations like addition, subtraction, multiplication, and
division, and they follow a principle known as "broadcasting."

1. Column-wise operations: When Series matches the DataFrame columns, the operation applies
to each column individually.
2. Row-wise operations: When Series matches the DataFrame rows, the operation applies to each
row individually.
3. Broadcasting: A single scalar value is broadcast across all elements in the
DataFrame.

Arithmetic Operation between a DataFrame and a Series (Column-Wise)


When a DataFrame and a Series have compatible dimensions, Pandas aligns them based on the index or
column labels and performs operations element-wise.

Example:

Suppose we have a DataFrame of students' scores in three subjects, and a Series representing bonus
points to be added to each subject.

import pandas as pd

# Creating a DataFrame of students' scores

data = {

'Math': [85, 90, 78],

'Science': [92, 88, 84],

'English': [78, 80, 82]

}

scores_df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])

# Series with bonus points for each subject

bonus_series = pd.Series({'Math': 5, 'Science': 2, 'English': 3})

# Adding bonus points to each subject

new_scores_df = scores_df + bonus_series

print(new_scores_df)

Output:
Math Science English

Alice 90 94 81

Bob 95 90 83

Charlie 83 86 85

Explanation: Each subject's scores have been increased by the respective bonus points. The Series
(bonus_series) was aligned with the DataFrame based on the column labels, and the addition was done
element-wise.

Arithmetic Operation between a DataFrame and a Series (Row-Wise)


In this case, if the Series index matches the DataFrame index (row labels), the operation applies to each
row.

Example:

Suppose we have the same DataFrame of scores, but this time, we want to subtract a specific penalty for
each student.

# Series with penalty points for each student

penalty_series = pd.Series({'Alice': 2, 'Bob': 3, 'Charlie': 1})

# Subtracting penalty points from each student's total scores

new_scores_df = scores_df.sub(penalty_series, axis=0)

print(new_scores_df)

Output:

Math Science English

Alice 83 90 76

Bob 87 85 77

Charlie 77 83 81

Explanation: The penalty_series was aligned with the DataFrame based on the row labels, and the
subtraction was done for each student's scores.
Broadcasting with a Scalar
When you operate with a single scalar value, it is broadcast across every element of the DataFrame. (A
one-element Series does not broadcast this way: pandas aligns a Series by its labels, so unmatched labels
produce NaN.)

Example:

Let's say we want to add a fixed bonus of 2 points to each score in the DataFrame.

# Adding a constant bonus; a scalar is broadcast to every element

new_scores_df = scores_df + 2

print(new_scores_df)

Output:

Math Science English

Alice 87 94 80

Bob 92 90 82

Charlie 80 86 84

Explanation: Here, the scalar 2 was broadcast across all values in the DataFrame, adding 2 points to
each score. Note that pd.Series([2]) would not work here: pandas would try to align its index (0) with
the column labels and fill everything with NaN.

Functions by Element
Element-wise functions operate on each individual entry in the DataFrame. These are typically
applied using applymap() (renamed to DataFrame.map() in pandas 2.1+) or element-wise arithmetic operations.

Example:

Suppose we want to square each element in a DataFrame of values.

import pandas as pd

# Sample DataFrame

data = {

'A': [1, 2, 3],

'B': [4, 5, 6],


'C': [7, 8, 9]

}

df = pd.DataFrame(data)

# Squaring each element in the DataFrame

squared_df = df.applymap(lambda x: x ** 2)

print(squared_df)

Output:

A B C

0 1 16 49

1 4 25 64

2 9 36 81

Explanation: Here, applymap(lambda x: x ** 2) applies the lambda function to each element, squaring
each value in the DataFrame.

lambda x: x ** 2 is a lambda function that takes a single input (x) and returns its square (x ** 2).

Functions by Row or Column


Row-wise or column-wise functions operate across either rows or columns in the DataFrame. You
typically use apply() with axis=0 for column-wise operations and axis=1 for row-wise operations.

Example of Column-wise Operation

Let's calculate the sum of each column in the DataFrame.

# Sum of each column

column_sums = df.apply(sum, axis=0)

print(column_sums)

Output:

A 6
B 15

C 24

dtype: int64

Explanation: Setting axis=0 in apply() performs the sum on each column. So, it calculates the total of
each column (A, B, and C).

Example of Row-wise Operation

Now, let’s calculate the mean of each row in the DataFrame.

# Mean of each row

row_means = df.apply(lambda x: x.mean(), axis=1)

print(row_means)

Output:

0 4.0

1 5.0

2 6.0

dtype: float64

Explanation: By setting axis=1, apply() calculates the mean across each row. So, we get the average of
each row.

Statistics Functions
Pandas provides a variety of built-in statistical functions that make it easy to perform common
calculations on DataFrames and Series objects.

Pandas provides a variety of statistical functions to help with data analysis, allowing you to calculate
statistical metrics like mean, median, standard deviation, and more.

These functions can be applied to both DataFrames and Series objects, and they support operations
along rows or columns.

1. Mean (mean())
Calculates the average (mean) of values along the specified axis.
import pandas as pd

# Sample DataFrame

data = {

'A': [1, 2, 3],

'B': [4, 5, 6],

'C': [7, 8, 9]

}

df = pd.DataFrame(data)

# Column-wise mean

column_means = df.mean(axis=0)

print(column_means)

Output:

A 2.0

B 5.0

C 8.0

dtype: float64

2. Median (median())
Finds the median (middle value) of values along the specified axis.

# Column-wise median

column_medians = df.median(axis=0)

print(column_medians)

Output:

A 2.0

B 5.0
C 8.0

dtype: float64

3. Standard Deviation (std())


Calculates the standard deviation, which measures how spread out the values are.

# Column-wise standard deviation

column_std = df.std(axis=0)

print(column_std)

Output:

A 1.0

B 1.0

C 1.0

dtype: float64

4. Variance (var())
Calculates the variance, which is the square of the standard deviation.

# Column-wise variance

column_variance = df.var(axis=0)

print(column_variance)

Output:

A 1.0

B 1.0

C 1.0

dtype: float64
5. Minimum and Maximum (min(), max())
Finds the minimum or maximum value along the specified axis.

# Column-wise minimum

column_min = df.min(axis=0)

print(column_min)

# Column-wise maximum

column_max = df.max(axis=0)

print(column_max)

Output:

Minimum:

A 1

B 4

C 7

dtype: int64

Maximum:

A 3

B 6

C 9

dtype: int64

6. Sum (sum())
Calculates the sum of values along the specified axis.

# Column-wise sum

column_sum = df.sum(axis=0)

print(column_sum)
Output:

A 6

B 15

C 24

dtype: int64

7. Count (count())
Counts the number of non-NaN values along the specified axis.

# Column-wise count

column_count = df.count(axis=0)

print(column_count)

Output:

A 3

B 3

C 3

dtype: int64

8. Correlation (corr())
Calculates the pairwise correlation of columns in a DataFrame.

# Correlation between columns

correlation_matrix = df.corr()

print(correlation_matrix)

Output:

A B C

A 1.0 1.0 1.0


B 1.0 1.0 1.0

C 1.0 1.0 1.0

9. Covariance (cov())
Calculates the covariance between columns.

# Covariance between columns

covariance_matrix = df.cov()

print(covariance_matrix)

Output:

A B C

A 1.0 1.0 1.0

B 1.0 1.0 1.0

C 1.0 1.0 1.0

Sorting and Ranking


In Pandas, sorting and ranking functions are essential for organizing data, either in ascending or
descending order, based on index or values, as well as for ranking data according to specific criteria.

1. Sorting in Pandas
Pandas provides two primary methods for sorting data:

sort_values(): Sorts a DataFrame or Series by one or more columns.

sort_index(): Sorts a DataFrame or Series by index labels.

1.1 Sorting by Values (sort_values())


The sort_values() function allows you to sort a DataFrame or Series by one or multiple columns.

Example:

Suppose we have a DataFrame of students and their scores:


import pandas as pd

data = {

'Student': ['Alice', 'Bob', 'Charlie', 'David'],

'Math': [85, 92, 78, 88],

'Science': [91, 82, 89, 95]

}

df = pd.DataFrame(data)

# Sorting by Math scores in descending order

sorted_df = df.sort_values(by='Math', ascending=False)

print(sorted_df)

Output:

Student Math Science

1 Bob 92 82

3 David 88 95

0 Alice 85 91

2 Charlie 78 89

1.2 Sorting by Index (sort_index())

The sort_index() function sorts the DataFrame or Series by its index labels.

Example:

# Sorting by index in ascending order

sorted_index_df = df.sort_index()

print(sorted_index_df)

You can specify the ascending parameter to sort in descending order if needed.
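For instance, with hypothetical data and an unsorted index:

```python
import pandas as pd

df = pd.DataFrame({'Score': [85, 92, 78]}, index=['b', 'a', 'c'])

# Sort rows by index label in descending order
print(df.sort_index(ascending=False))
```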
2. Ranking in Pandas
Ranking assigns ranks to values, with options to handle ties and specify ranking method. The rank()
function assigns ranks to the values in a DataFrame or Series.

Ranking in Pandas is a way of giving each value a position (or rank) based on how it compares to other
values. Think of it like lining up people by height or scores and assigning each person a number to show
their position in the lineup.

In Pandas, the rank() function helps us do this automatically for a column in a DataFrame or a Series.

Ranking Methods

• average: Default method. Assigns the average rank to tied values.


• min: Assigns the minimum rank to tied values.
• max: Assigns the maximum rank to tied values.
• first: Assigns ranks in the order the values appear.
• dense: Like min, but rank increases by 1 for the next distinct value.
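The difference between these methods can be sketched on a Series with a tie (two 85s):

```python
import pandas as pd

s = pd.Series([85, 92, 78, 88, 85])

# The two 85s tie for ranks 2 and 3 (ascending order)
print(s.rank(method='average'))  # both get 2.5
print(s.rank(method='min'))      # both get 2.0
print(s.rank(method='dense'))    # both get 2.0, and 88 gets 3.0 (no gap)
```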

Example Scenario: Ranking Test Scores

Imagine we have a list of students and their math scores. We want to assign ranks based on the scores,
where the highest score gets the highest rank (rank 1).

import pandas as pd

# Sample DataFrame

data = {

'Student': ['Alice', 'Bob', 'Charlie', 'David'],

'Math': [85, 92, 78, 88]

}

df = pd.DataFrame(data)

# Rank students based on Math scores

df['Math_Rank'] = df['Math'].rank(ascending=False)

print(df)

Output:

Student Math Math_Rank

0 Alice 85 3.0

1 Bob 92 1.0
2 Charlie 78 4.0

3 David 88 2.0

Explanation:

The highest score (92) gets a rank of 1.

The second highest score (88) gets a rank of 2.

The third highest score (85) gets a rank of 3.

The lowest score (78) gets a rank of 4.

Each student’s rank is based on their position in the sorted list of scores.

Example with a Tie

Let’s add another student with the same score as Alice (85) to see how ties work.

# Adding a new student with the same score as Alice

data = {

'Student': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],

'Math': [85, 92, 78, 88, 85]

}

df = pd.DataFrame(data)

# Ranking with the default 'average' method for ties

df['Math_Rank'] = df['Math'].rank(ascending=False)

print(df)

Output:

Student Math Math_Rank

0 Alice 85 3.5

1 Bob 92 1.0

2 Charlie 78 5.0

3 David 88 2.0
4 Eve 85 3.5

Explanation:

Both Alice and Eve have a score of 85, which ties them for ranks 3 and 4.

With the default average method, they each get an average rank of 3.5.

Correlation and Covariance


Correlation and Covariance are both measures of the relationship between two variables, but they
capture different aspects of how these variables are related.

Covariance

Covariance tells us the direction of the linear relationship between two variables:

• Positive Covariance: If one variable increases when the other increases, they have a positive
covariance.
• Negative Covariance: If one variable increases when the other decreases, they have a negative
covariance.
• Zero Covariance: If there is no predictable relationship in the direction, the covariance will be
close to zero.

Example with Pandas

import pandas as pd

# Sample DataFrame
data = {
'X': [1, 2, 3, 4, 5],
'Y': [5, 4, 6, 8, 10]
}
df = pd.DataFrame(data)

# Calculate covariance
covariance = df.cov()
print(covariance)

Output:

     X    Y
X  2.5  3.5
Y  3.5  5.8

Explanation: The covariance between X and Y is 3.5, suggesting a positive relationship.
However, the magnitude of covariance depends on the units of X and Y, making it hard to judge the
strength of the relationship from the value alone.
Correlation

Correlation measures both the strength and direction of the linear relationship between two variables
and is a normalized form of covariance. The correlation coefficient (often called Pearson’s correlation
coefficient) ranges from -1 to +1:

• +1: Perfect positive correlation (as one variable increases, the other increases in a perfectly
linear way).
• 0: No linear correlation.
• -1: Perfect negative correlation (as one variable increases, the other decreases in a perfectly
linear way).

Example with Pandas

# Calculate correlation
correlation = df.corr()
print(correlation)

Output:

          X         Y
X  1.000000  0.919145
Y  0.919145  1.000000

Explanation: The correlation between X and Y is approximately 0.919, indicating a strong
positive linear relationship. Unlike covariance, the correlation coefficient is dimensionless (it doesn't
depend on the units of the variables), making it easier to interpret.

Summary: Covariance vs. Correlation

• Covariance measures the direction of a relationship (positive or negative) but doesn't provide a
standard scale, so it's hard to compare across different datasets.
• Correlation measures both the direction and strength of the relationship on a standardized scale
from -1 to +1, making it easier to interpret and compare.

In practice:

• Use covariance when you only need to know the direction of the relationship and units aren’t an
issue.
• Use correlation when you need a clear, standardized measure of both the direction and strength
of the relationship.

“Not a Number” Data


Not a Number (NaN) represents missing or undefined data within a DataFrame or Series. NaN values
are very common in real-world datasets, as data may be incomplete or missing for various reasons.
These values are handled differently from typical data values, and Pandas provides several methods to
work with them effectively.

• NaN stands for "Not a Number." It is a special floating-point value defined in the IEEE 754 floating-
point standard.
• NaN is used in Pandas to indicate missing values. It is represented as np.nan in NumPy and pd.NA
for nullable data types in Pandas.
• NaNs are primarily found in datasets where data may be incomplete, such as survey results where
respondents skip questions.

How to Identify NaN Values

Pandas provides functions to help identify NaN values in a DataFrame or Series.

Example

import pandas as pd

import numpy as np

# Sample DataFrame with NaN values

data = {

'A': [1, 2, np.nan, 4],

'B': [np.nan, 2, 3, np.nan],

'C': [1, 2, 3, 4]

}

df = pd.DataFrame(data)

# Checking for NaN values

print(df.isna())

Output:

A B C

0 False True False

1 False False False

2 True False False

3 False True False

Explanation: The isna() function returns True where values are NaN and False otherwise.

Handling NaN Values

There are several ways to handle NaN values, depending on the analysis requirements:

1. Removing NaN Values


dropna(): Removes rows or columns that contain NaN values.

# Dropping rows with any NaN values

df_dropped = df.dropna()

print(df_dropped)

Output:

A B C

1 2.0 2.0 2

Explanation: Only rows without any NaN values are retained.

dropna(axis=1): Drops columns with NaN values.

# Dropping columns with any NaN values

df_dropped_cols = df.dropna(axis=1)

print(df_dropped_cols)

Output:

C

0 1

1 2

2 3

3 4

2. Filling NaN Values

fillna(): Fills NaN values with a specified value or a calculated statistic (like mean, median, or mode).
# Filling NaN values with a specific value (e.g., 0)

df_filled = df.fillna(0)

print(df_filled)

Output:

A B C

0 1.0 0.0 1

1 2.0 2.0 2

2 0.0 3.0 3

3 4.0 0.0 4
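As mentioned, fillna() can also take a calculated statistic; a sketch filling each column's NaNs with that column's mean:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [np.nan, 2, 3, np.nan],
                   'C': [1, 2, 3, 4]})

# df.mean() is a Series of column means; fillna aligns it by column label
df_mean_filled = df.fillna(df.mean())
print(df_mean_filled)
```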

Reading and Writing Data on Files


1. CSV (Comma-Separated Values) Files

CSV files are one of the most common formats for data storage. They store tabular data in plain text,
where each line represents a row and values are separated by commas.

Reading a CSV File

import pandas as pd

# Reading a CSV file

df = pd.read_csv('data.csv')

print(df.head()) # Display the first few rows

Writing to a CSV File

# Writing DataFrame to a CSV file

df.to_csv('output.csv', index=False) # index=False prevents saving the index column

2. Text Files

Text files can be read and written in Pandas, especially when they follow a specific delimiter (e.g., tab-
separated, space-separated).
Reading a Text File

# Reading a text file with a specific delimiter (e.g., tab-separated)

df = pd.read_csv('data.txt', delimiter='\t')

print(df.head())

Writing to a Text File

# Writing DataFrame to a text file with space as delimiter

df.to_csv('output.txt', sep=' ', index=False)

3. HTML Files

HTML tables can be directly read into Pandas if they are well-structured in the HTML file.

Reading HTML Files

Pandas can read HTML tables from a URL or local file, and it returns a list of DataFrames (one for each
table found).

# Reading HTML tables from a URL or local file

url = 'https://example.com/data.html'

dfs = pd.read_html(url)

# Displaying the first table if there are multiple tables in the HTML file

df = dfs[0]

print(df.head())

Writing to an HTML File

# Writing DataFrame to an HTML file

df.to_html('output.html', index=False)

4. XML Files

XML files are structured documents, and each entry can be mapped to a row in a DataFrame. Pandas can
read XML files with the read_xml() function (available since pandas 1.3).

Reading XML Files


# Reading an XML file

df = pd.read_xml('data.xml')

print(df.head())

Writing to an XML File

# Writing DataFrame to an XML file

df.to_xml('output.xml', index=False)

5. Microsoft Excel Files


Excel files are often used to store data, and Pandas provides straightforward methods for reading and
writing .xls or .xlsx files.

Reading an Excel File

You can specify the sheet name if there are multiple sheets.

# Reading an Excel file (first sheet by default)

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

print(df.head())

Writing to an Excel File

# Writing DataFrame to an Excel file

df.to_excel('output.xlsx', index=False)

Practice Programs:
1. The Series

import pandas as pd

# Creating a Series

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print("Series:\n", s)
2. The DataFrame

# Creating a DataFrame

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

df = pd.DataFrame(data)

print("DataFrame:\n", df)

3. The Index Object

# Accessing Index

print("Index of DataFrame:", df.index)
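A key property of Index objects is that they are immutable, which lets them be shared safely between data structures. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# An Index cannot be modified in place; item assignment raises TypeError
try:
    df.index[0] = 99
except TypeError:
    print("Index objects are immutable")
```

To change labels, a new index must be assigned as a whole (e.g. df.index = ['x', 'y', 'z']) rather than edited element by element.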

4. Reindexing

# Reindexing the DataFrame

df_reindexed = df.reindex([0, 1, 2, 3]) # The new label 3 gets a row filled with NaN

print("Reindexed DataFrame:\n", df_reindexed)

5. Dropping

# Dropping a column

df_dropped = df.drop('A', axis=1)

print("DataFrame after dropping column A:\n", df_dropped)

6. Arithmetic and Data Alignment

# Arithmetic between two Series with different indices

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

result = s1 + s2
print("Arithmetic with Data Alignment:\n", result)
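The non-overlapping labels ('a' and 'd') come back as NaN in the result above. To treat a label missing on one side as 0 instead, the method form of the operation with fill_value can be used:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

# fill_value substitutes 0 for a label missing on one side before adding
result_filled = s1.add(s2, fill_value=0)
print(result_filled)
```

Here 'a' yields 1.0 and 'd' yields 6.0 instead of NaN, while 'b' and 'c' are summed normally (6.0 and 8.0).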

7. Operations between DataFrame and Series

# Subtracting a Series from a DataFrame

# The Series labels must match the DataFrame's row index, or the result is all NaN

row_vals = pd.Series([10, 20, 30], index=[0, 1, 2])

df_subtracted = df.sub(row_vals, axis='index') # Subtracts row_vals from columns 'A' and 'B', row by row

print("DataFrame after operation with Series:\n", df_subtracted)
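By default (or with axis='columns'), a Series aligns on the DataFrame's column labels instead, broadcasting down each row. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Indexed by column name, so it aligns on the columns
col_series = pd.Series([1, 10], index=['A', 'B'])

# Subtracts 1 from every value in A and 10 from every value in B
diff = df.sub(col_series, axis='columns')
print(diff)
```

Choosing axis='index' or axis='columns' therefore decides whether the Series is matched against row labels or column labels.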

8. Functions by Element

# Applying a function to each element
# (in pandas 2.1+, DataFrame.map is the preferred name for this; applymap is deprecated)

df_squared = df.applymap(lambda x: x**2)

print("DataFrame with elements squared:\n", df_squared)

9. Functions by Row or Column

# Applying a function to each column

column_means = df.apply(lambda x: x.mean())

print("Column Means:\n", column_means)
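apply() works row-wise as well: passing axis=1 (or axis='columns') hands the function one row at a time as a Series. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# axis=1 passes each row (as a Series) to the function
row_sums = df.apply(lambda row: row.sum(), axis=1)
print(row_sums)
```

Each row produces one value, so the result is a Series indexed by the original row labels (here 5, 7, 9).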

10. Statistics Functions

# Using statistical functions

print("Sum of each column:\n", df.sum())

print("Mean of each column:\n", df.mean())

print("Standard Deviation of each column:\n", df.std())
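describe() bundles several of these statistics (count, mean, std, min, the quartiles, and max) into a single summary table:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# One call produces count, mean, std, min, 25%, 50%, 75%, max per numeric column
summary = df.describe()
print(summary)
```

The result is itself a DataFrame, so individual statistics can be looked up, e.g. summary.loc['mean', 'A'].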

11. Sorting and Ranking

# Sorting by column 'B'


df_sorted = df.sort_values(by='B', ascending=False)

print("DataFrame sorted by column B:\n", df_sorted)

# Ranking values in column 'B'

df['B_rank'] = df['B'].rank()

print("Ranking in column B:\n", df)
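By default rank() gives the smallest value rank 1 and assigns tied values the average of their ranks; the method parameter ('min', 'max', 'dense', 'first') controls tie handling. A sketch with a tie:

```python
import pandas as pd

s = pd.Series([7, 3, 7, 9])

print(s.rank())                # average of tied ranks: 2.5, 1.0, 2.5, 4.0
print(s.rank(method='min'))    # lowest rank in the tie group: 2.0, 1.0, 2.0, 4.0
print(s.rank(method='dense'))  # like 'min', but no gaps after ties: 2.0, 1.0, 2.0, 3.0
```

With method='dense', the value after a tie group gets the next consecutive rank rather than skipping ranks.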

12. Correlation and Covariance

# Calculating correlation and covariance

correlation = df.corr()

covariance = df.cov()

print("Correlation:\n", correlation)

print("Covariance:\n", covariance)

13. Not a Number (NaN) Data

# Handling NaN values

df_nan = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})

print("DataFrame with NaN values:\n", df_nan)

# Filling NaN with 0

df_filled = df_nan.fillna(0)

print("DataFrame after filling NaN with 0:\n", df_filled)

14. Reading and Writing Data

CSV

# Writing to a CSV file

df.to_csv('output.csv', index=False)
# Reading from a CSV file

df_csv = pd.read_csv('output.csv')

print("DataFrame from CSV:\n", df_csv)

Text File

# Writing to a text file

df.to_csv('output.txt', sep='\t', index=False)

# Reading from a text file

df_text = pd.read_csv('output.txt', delimiter='\t')

print("DataFrame from Text File:\n", df_text)

HTML

# Writing to an HTML file

df.to_html('output.html', index=False)

# Reading from an HTML file (if file is well-structured)

dfs = pd.read_html('output.html')

print("DataFrame from HTML:\n", dfs[0]) # first table in the HTML file

XML

# Writing to an XML file

df.to_xml('output.xml', index=False)

# Reading from an XML file

df_xml = pd.read_xml('output.xml')
print("DataFrame from XML:\n", df_xml)

Excel

# Writing to an Excel file

df.to_excel('output.xlsx', index=False)

# Reading from an Excel file

df_excel = pd.read_excel('output.xlsx')

print("DataFrame from Excel:\n", df_excel)
