Data Handling Using Pandas-1
Pandas - I
Welcome to Chapter 2 of Information Practices for Class 12. In this
presentation, we will explore the fundamentals of data handling
using the powerful Python library called Pandas. This chapter
introduces essential tools for data analysis that are crucial for
modern data science applications.
Data Handling
Pandas provides the data structures and functions needed to efficiently handle structured data.
Built on NumPy
Pandas is built on top of NumPy, another powerful library for
numerical computation.
Analysis Tools
It offers tools for reading, writing, manipulating, and analyzing
data with ease.
Why Pandas?
Data Analysis Benefits
• Fast and efficient data manipulation
• Handling missing data seamlessly
• Merging and joining datasets
• Reshaping and pivoting data
• Time-series functionality

Industry Relevance
• Essential skill for data scientists
• Used extensively in finance and business
• Popular for academic research
• Foundation for AI and machine learning
• In-demand job skill
Installing Pandas
Check Python Installation
Ensure that Python is installed on your computer. Pandas requires Python 3.6 or higher.
Install Pandas
Run pip install pandas from the command line to download and install the library.
Verify Installation
Import pandas in a Python script to verify: import pandas as pd
Core Data Structures in Pandas
DataFrame
2D labeled data structure with columns of potentially different types
Series
1D labeled array capable of holding any data type
Index
Immutable array-like structure for axis labels
These three data structures form the foundation of data manipulation in Pandas. We'll explore each of these in
detail, starting with Series and then moving to the more complex DataFrame structure, which is the most commonly
used.
Pandas Series
Definition
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Structure
It consists of two arrays: one for data and another for labels, called the index. The index makes data alignment possible and provides additional functionality.
Usage
Series are ideal for representing time series data, vector data,
or any ordered collection where you need to associate labels
with values.
Creating a Series
From Lists
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9])
print(s)

Output:
0    1
1    3
2    5
3    7
4    9
dtype: int64

With Custom Index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)

Output:
a    1
b    3
c    5
d    7
e    9
dtype: int64
Series from Dictionary
Create a Dictionary
Define a Python dictionary with keys and values
Convert to Series
Use pd.Series(dictionary) to create a Series
Result
Keys become index, values become Series values
import pandas as pd
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
s = pd.Series(data)
print(s)
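Running this prints the dictionary keys as the index:

a    10
b    20
c    30
d    40
dtype: int64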
Accessing Series Elements
Elements can be accessed by label or by position, and Series support several element-wise operations (see the sketch below):
Arithmetic
Addition, subtraction, multiplication, division
Sorting
Sort by index or values
Statistics
Mean, sum, min, max, etc.
Filtering
Select data based on conditions
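A short sketch of these operations on a small Series (the values are chosen for illustration):

import pandas as pd

s = pd.Series([5, 2, 9, 4], index=['a', 'b', 'c', 'd'])

# Access by label and by position
print(s['b'])        # 2
print(s.iloc[0])     # 5

# Arithmetic (element-wise)
print(s * 2)

# Statistics
print(s.mean(), s.max())

# Sorting
print(s.sort_values())

# Filtering
print(s[s > 4])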
DataFrame Applications
DataFrames are ideal for representing real-world data like financial data, experimental results, or any structured data that needs to be analyzed.
Creating a DataFrame
From Dictionary of Lists
Keys become column names, lists become columns
From Series
Each Series becomes a column
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
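Adding print(df) would display:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London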
DataFrame from Dictionary of Lists
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)
DataFrame with Custom Row Labels
Use the index parameter when creating the DataFrame to supply your own row labels.
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
'Age': [28, 24, 35]}
row_labels = ['Person1', 'Person2', 'Person3']
df = pd.DataFrame(data, index=row_labels)
print(df)
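With these labels, print(df) shows:

          Name  Age
Person1   John   28
Person2   Anna   24
Person3  Peter   35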
DataFrame from CSV Files
import pandas as pd
# Read CSV file
df = pd.read_csv('students.csv')
# Display first few rows
print(df.head())
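read_csv() also accepts many optional parameters; a few commonly used ones are sketched below (the file and column names are illustrative):

df = pd.read_csv('students.csv',
                 sep=',',                          # field delimiter
                 index_col='ID',                   # use a column as the row index
                 usecols=['ID', 'Name', 'Marks'],  # load selected columns only
                 nrows=100,                        # read only the first 100 rows
                 na_values=['NA', '-'])            # extra strings treated as NaN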
Other Data Sources for DataFrame
Excel Files: Read Excel files using pd.read_excel('file.xlsx', sheet_name='Sheet1')
SQL Databases: Connect to SQL databases using pandas.read_sql_query() or pandas.read_sql_table()
JSON Data: Import JSON data with pd.read_json('file.json') or from API responses
HTML Tables: Parse HTML tables from websites using pd.read_html(url)
Pandas provides versatile tools to import data from various sources, making it a powerful library for data collection and integration from multiple formats.
Viewing DataFrame Contents
Common Methods
• df.head(n) - First n rows (default 5)
• df.tail(n) - Last n rows (default 5)
• df.info() - Concise summary
• df.describe() - Statistical summary
• df.shape - Dimensions (rows, columns)
• df.columns - Column labels
• df.index - Row labels
• df.dtypes - Data types of columns

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 24, 35],
    'City': ['New York', 'Paris', 'Berlin']
}
df = pd.DataFrame(data)

# Display first 2 rows
print(df.head(2))

# Display DataFrame info (info() prints directly, so no print() is needed)
df.info()
Examining a DataFrame
These methods provide quick ways to understand your data's structure, content, and statistical properties before
diving into deeper analysis. The head() and tail() methods are particularly useful for large datasets where displaying
all rows would be impractical.
Accessing DataFrame Columns
Accessing Single Columns
cities = df['City']            # a single column is a Series
print(type(cities))            # <class 'pandas.core.series.Series'>
print(cities)

Accessing Multiple Columns
subset = df[['Name', 'City']]  # a list of columns returns a DataFrame

Boolean Indexing
Filter rows based on conditions.
adults = df[df['Age'] > 25]

Accessing Individual Cells
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
                  index=['row1', 'row2', 'row3'])

value1 = df.loc['row2', 'A']   # 2 (label-based)
value2 = df.iloc[0, 1]         # 4 (position-based)
value3 = df.at['row3', 'B']    # 6 (fast single-value access by label)
value4 = df.iat[1, 0]          # 2 (fast single-value access by position)
Accessing Rows and Columns Together
Using loc[]
# Access rows 'r1' to 'r3' and columns 'A' and 'C'
df.loc['r1':'r3', ['A', 'C']]

# All rows for column 'B'
df.loc[:, 'B']

# Rows where column A > 5
df.loc[df['A'] > 5, :]

Using iloc[]
# Access rows 0 to 2 and columns 0 and 2
df.iloc[0:3, [0, 2]]

# All rows for column at position 1
df.iloc[:, 1]

# First 3 rows, first 2 columns
df.iloc[0:3, 0:2]
The loc[] and iloc[] accessors provide powerful ways to select specific subsets of your data, combining row and
column selections in a single operation.
Adding Columns to DataFrame
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 35]})

# Add a new column by direct assignment
df['City'] = ['New York', 'Paris', 'Berlin']

Adding Rows
# Original DataFrame
df = pd.DataFrame({'Name': ['John', 'Anna'],
                   'Age': [28, 24]})

# New row as a dictionary (example values)
new_row = {'Name': 'Peter', 'Age': 35}

# Using concat
df_new = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
Deleting Columns
# Original DataFrame
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 35]},
                  index=['P1', 'P2', 'P3'])

Using drop() Method
df = df.drop('Age', axis=1)    # returns a new DataFrame without 'Age'

Using del Statement
del df['Name']                 # removes the column in place
Handling Missing Data
Identify
Locate missing values using df.isna() or df.isnull()
Remove
Drop missing values using df.dropna()
Interpolate
Estimate missing values using df.interpolate()
Fill
Replace missing values using df.fillna()
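A minimal sketch of these four steps (the NaN positions are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Marks': [85, np.nan, 72, np.nan, 90]})

print(df.isna())           # identify missing values
print(df.dropna())         # remove rows containing NaN
print(df.interpolate())    # estimate NaN from neighbouring values
print(df.fillna(0))        # replace NaN with a fixed value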
Filtering Rows with Conditions
# Filter rows where Age > 25
result = df[df['Age'] > 25]

# Filter rows where Name is 'John'
result = df[df['Name'] == 'John']

# Rows where Age > 25 AND City is 'New York'
result = df[(df['Age'] > 25) & (df['City'] == 'New York')]

# Rows where Age < 30 OR City is 'Paris'
result = df[(df['Age'] < 30) | (df['City'] == 'Paris')]
Special Methods
Descriptive Statistics
Methods like mean(), median(), min(), max(), count(), std(), var(), etc.
describe()
Generates summary statistics for numeric columns
Aggregation Levels
Apply to entire DataFrame, specific columns, or groups
Correlation
corr() method checks relationships between variables
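For example, on a DataFrame with numeric columns (the column names are assumed):

print(df['Age'].mean())              # average of one column
print(df.describe())                 # summary statistics for numeric columns
print(df[['Age', 'Salary']].corr())  # correlation between two columns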
# A custom function applied to each value (example definition)
def age_category(age):
    return 'Adult' if age >= 18 else 'Minor'

df['Age_Category'] = df['Age'].apply(age_category)

import numpy as np
df['Log_Salary'] = np.log(df['Salary'])
df['Salary_Normalized'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
Aggregate Methods
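A sketch of typical aggregation using groupby() and agg(), assuming City, Age, and Salary columns:

# Mean age per city
print(df.groupby('City')['Age'].mean())

# Several aggregates at once
print(df.groupby('City').agg({'Age': ['mean', 'max'], 'Salary': 'sum'}))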
Merging DataFrames
# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
'Name': ['John', 'Anna', 'Peter', 'Linda']})
df2 = pd.DataFrame({'ID': [2, 3, 5, 6],
'City': ['Paris', 'Berlin', 'London', 'Rome']})
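One way to join these two frames on the shared ID column:

# Inner join keeps only IDs present in both frames (2 and 3)
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)

# An outer join would keep all IDs, filling gaps with NaN
outer = pd.merge(df1, df2, on='ID', how='outer')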
Date Functionality
Access year, month, day, hour, etc. with dt accessor
Date Ranges
Create date ranges with pd.date_range()
Time-Based Indexing
Use datetime objects as index for time series analysis
# Ensure the column holds datetime values
df['Date'] = pd.to_datetime(df['Date'])

# Extract components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
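Date ranges and time-based indexing, sketched with illustrative dates and values:

# Five consecutive days starting 1 Jan 2024
dates = pd.date_range(start='2024-01-01', periods=5, freq='D')

# Use the dates as the index of a Series for time-based selection
ts = pd.Series([10, 12, 9, 14, 11], index=dates)
print(ts['2024-01-02':'2024-01-04'])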
Reshaping Data
Melt Operation
Convert from wide to long format
Long Format
Data structured with more rows and fewer columns
Pivot Operation
Convert from long to wide format
Wide Format
Data spread across multiple columns for easier viewing
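A small sketch of both directions (the column names are illustrative):

wide = pd.DataFrame({'Name': ['John', 'Anna'],
                     'Maths': [90, 85],
                     'Science': [80, 95]})

# Wide -> long
long = pd.melt(wide, id_vars='Name', var_name='Subject', value_name='Marks')

# Long -> wide
back = long.pivot(index='Name', columns='Subject', values='Marks')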
Handling Duplicates
Identify
Find duplicate rows with the duplicated() method
Count
Count occurrences with value_counts()
Keep
Specify which duplicates to keep (first/last)
Remove
Drop duplicates with drop_duplicates()
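In code, assuming a DataFrame df with a City column:

print(df.duplicated())                        # Boolean mask of repeated rows
print(df['City'].value_counts())              # occurrences of each value
df_unique = df.drop_duplicates(keep='first')  # keep the first occurrence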
Combine Results
Concatenate the individual DataFrames into a single DataFrame.
import pandas as pd
import glob

# Read every matching CSV and concatenate (the pattern is illustrative)
files = glob.glob('*.csv')
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Export to JSON
df.to_json('students.json', orient='records')
Data Visualization with Pandas
Built-in Plotting
Series and DataFrames have a .plot() method that creates quick charts directly from your data.
Customization Options
Arguments such as kind, title, and figsize adjust the chart type and appearance.
Pandas visualization is built on Matplotlib, providing a convenient interface for quick data exploration. For more
advanced visualizations, consider using specialized libraries like Matplotlib, Seaborn, or Plotly.
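A minimal sketch (requires Matplotlib to be installed; the column names are assumed):

import matplotlib.pyplot as plt

df.plot(kind='bar', x='Name', y='Age', title='Age by person')
df['Age'].plot(kind='hist', title='Age distribution')
plt.show()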
Data Types in Pandas
Pandas supports various data types to efficiently store and process different kinds of data. Understanding data
types is crucial for memory optimization and ensuring appropriate data operations. The dtypes attribute shows the
data types of each column in a DataFrame, while the astype() method can be used to convert between different
types.
Type Conversion in Pandas
Check Current Types
Use df.dtypes to see the current data types of all columns
Apply Conversion
Use df['column'].astype() or pd.to_numeric(), pd.to_datetime()
Verify Conversion
Check dtypes again to ensure the conversion was successful
# Convert to numeric
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# Convert to string
df['ID'] = df['ID'].astype(str)
Real-World Application: Healthcare
Healthcare organizations use Pandas for patient data analysis, tracking treatment outcomes, and predicting hospital readmissions. During the COVID-19 pandemic, Pandas was extensively used for tracking infection rates, analyzing vaccination data, and modeling the spread of the virus. Medical researchers also use Pandas for clinical trial data analysis and drug effectiveness studies.
Practical Example: Data Cleaning
Import Data
Read data from a CSV file and inspect its structure and content
Remove Duplicates
Identify and remove duplicate records from the dataset
Standardize Values
Normalize text data, fix inconsistencies, and handle outliers
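These steps might look like the following (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv('records.csv')    # import
df.info()                          # inspect structure

df = df.drop_duplicates()          # remove duplicates

df['Name'] = df['Name'].str.strip().str.title()  # standardize text
df = df[df['Age'].between(0, 120)]               # drop implausible outliers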
Data Analysis Workflow
Data Cleaning
Handle missing values, fix data types, and remove outliers
Data Transformation
Create new variables, aggregate data, and reshape as needed
A typical data analysis workflow involves multiple steps, from importing raw data to generating insights and reports. Pandas provides tools for each
stage of this process, allowing analysts to work efficiently within a single environment.
Performance Optimization Tips
Memory Usage
Use appropriate data types (int8/int16 instead of int64, category for text with few unique values)
Computation Speed
Use vectorized operations instead of loops, leverage built-in methods
Large Datasets
Process data in chunks, use filters before loading full data
Indexing
Set appropriate index for common query patterns, use query() for filtering
Common Challenges and Solutions
Memory Errors
• Challenge: "MemoryError" when working with large datasets
• Solution: Use the chunksize parameter in read_csv(), optimize data types, filter early

Data Type Errors
• Challenge: Operations fail due to incorrect data types
• Solution: Explicitly convert data using astype(), to_numeric(), to_datetime()

Missing Data
• Challenge: NaN values causing calculation errors
• Solution: Use fillna(), dropna(), or handle NaN explicitly in calculations

Performance Issues
• Challenge: Slow operations on large DataFrames
• Solution: Use vectorized operations, avoid apply() when possible, use query() for filtering
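A sketch of chunked reading for the MemoryError case (the file name, chunk size, and Amount column are illustrative):

total = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    # Work on one manageable piece at a time
    total += chunk['Amount'].sum()
print(total)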
Practical Exercises
Exercise 1: Data Import and Exploration
Read a CSV file of student marks, display basic information about the
dataset, and calculate summary statistics for each subject.
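One possible starting point (assuming a marks.csv file with one column per subject):

import pandas as pd

df = pd.read_csv('marks.csv')
print(df.head())        # first rows
print(df.info())        # structure of the dataset
print(df.describe())    # summary statistics per subject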
As you advance in your Pandas journey, explore topics like time series analysis with resample() and rolling()
functions, integrating Pandas with machine learning libraries like scikit-learn, processing big data with Dask for
distributed computing, and using extension arrays for custom data types. Advanced indexing techniques and
optimization methods will also become increasingly important as you work with larger and more complex datasets.
Summary and Key Takeaways
Data Structures
Series and DataFrame fundamentals
Transformation
Data cleaning, merging, and reshaping
Analysis
Statistical tools and aggregation methods
Mastery
Advanced pandas techniques for real-world data science
In this chapter, we explored the fundamentals of Pandas, starting with basic data structures like Series and DataFrames. We learned how to
create, manipulate, and analyze data using various Pandas functions. The skills you've gained form a solid foundation for data analysis and
preparation for more advanced topics in data science.
Remember that proficiency in Pandas comes with practice. Continue working with different datasets and exploring the rich functionality that
Pandas offers to become more confident in your data handling abilities.