
Data Handling Using Pandas - I

Welcome to Chapter 2 of Information Practices for Class 12. In this presentation, we will explore the fundamentals of data handling using the powerful Python library called Pandas. This chapter introduces essential tools for data analysis that are crucial for modern data science applications.

We'll cover data structures in Pandas, methods for data manipulation, and techniques for efficient data analysis. By the end of this presentation, you will understand how to use Pandas to solve real-world data problems in your projects and assignments.
Introduction to Pandas

Python Library
Pandas is a powerful, open-source Python library designed for data manipulation and analysis.

Data Handling
It provides the data structures and functions needed to handle structured data efficiently.

Built on NumPy
Pandas is built on top of NumPy, another powerful library for numerical computation.

Analysis Tools
It offers tools for reading, writing, manipulating, and analyzing data with ease.
Why Pandas?

Data Analysis Benefits
• Fast and efficient data manipulation
• Handling missing data seamlessly
• Merging and joining datasets
• Reshaping and pivoting data
• Time-series functionality

Industry Relevance
• Essential skill for data scientists
• Used extensively in finance and business
• Popular for academic research
• Foundation for AI and machine learning
• In-demand job skill
Installing Pandas

Check Python Installation
Ensure that Python is installed on your computer. Pandas requires Python 3.6 or higher.

Install Pandas Using pip
Open the command prompt or terminal and type: pip install pandas

Verify Installation
Import pandas in a Python script to verify: import pandas as pd
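A quick way to confirm the installation worked is to import pandas and print its version (a minimal check; the version number you see will depend on your setup):

import pandas as pd

# Print the installed pandas version to confirm the import works
print(pd.__version__)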
Core Data Structures in Pandas

DataFrame
2D labeled data structure with columns of potentially different types

Series
1D labeled array capable of holding any data type

Index
Immutable array-like structure for axis labels

These three data structures form the foundation of data manipulation in Pandas. We'll explore each of them in detail, starting with Series and then moving to the more complex DataFrame, which is the most commonly used.
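To make the three structures concrete, here is a minimal sketch showing them side by side (the column names and values are illustrative):

import pandas as pd

s = pd.Series([10, 20, 30])                     # Series: 1D labeled array
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})   # DataFrame: 2D labeled table

print(df.index)    # Index holding row labels: RangeIndex(start=0, stop=2, step=1)
print(df.columns)  # Index holding column labels: Index(['A', 'B'], dtype='object')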
Pandas Series

Definition
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.).

Structure
It consists of two arrays: one for the data and another for the labels, called the index. The index makes data alignment possible and provides additional functionality.

Usage
Series are ideal for representing time series data, vector data, or any ordered collection where you need to associate labels with values.
Creating a Series

From a List

import pandas as pd

s = pd.Series([1, 3, 5, 7, 9])
print(s)

Output:

0    1
1    3
2    5
3    7
4    9
dtype: int64

With a Custom Index

import pandas as pd

s = pd.Series([1, 3, 5, 7, 9],
              index=['a', 'b', 'c', 'd', 'e'])
print(s)

Output:

a    1
b    3
c    5
d    7
e    9
dtype: int64
Series from Dictionary
Create a Dictionary
Define a Python dictionary with keys and values

Convert to Series
Use pd.Series(dictionary) to create a Series

Result
Keys become the index; values become the Series values

import pandas as pd
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
s = pd.Series(data)
print(s)
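For reference, the expected output (dictionary insertion order is preserved in modern Python and pandas):

a    10
b    20
c    30
d    40
dtype: int64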
Accessing Series Elements

Using Index Labels
Access elements using their index labels: s['a'], or s[['a', 'c']] for multiple elements

Using Position
Access elements by integer position: s.iloc[0], or s.iloc[1:3] for slicing

Using Conditions
Filter elements using boolean conditions: s[s > 20]

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s['a'])     # 10
print(s.iloc[0])  # 10 (s[0] also worked historically, but positional
                  # access on a labeled Series is deprecated)
print(s[s > 20])  # c    30
                  # d    40
Series Operations

Arithmetic
Addition, subtraction, multiplication, division

Statistics
Mean, sum, min, max, etc.

Sorting
Sort by index or values

Filtering
Select data based on conditions

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['a', 'b', 'c'])
print(s1 + s2)    # Element-wise addition
print(s1.mean())  # Calculate mean
Pandas DataFrame

Definition
A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types. It's similar to a spreadsheet or SQL table.

Structure
It consists of three components: the data, the index (row labels), and the columns (column labels). Each column in a DataFrame is a Series.

Applications
DataFrames are ideal for representing real-world data like financial data, experimental results, or any structured data that needs to be analyzed.
Creating a DataFrame

From a Dictionary of Lists
Keys become column names, lists become columns

From a List of Dictionaries
Each dictionary becomes a row

From Series
Each Series becomes a column

import pandas as pd
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
DataFrame from Dictionary of Lists

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 24, 35],
    'City': ['New York', 'Paris', 'Berlin']
}

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin

The dictionary keys become the column names, and the lists become the values in those columns. Notice that Pandas automatically creates a numeric index (0, 1, 2) for the rows.
DataFrame from List of Dictionaries

import pandas as pd

data = [
    {'Name': 'John', 'Age': 28, 'City': 'New York'},
    {'Name': 'Anna', 'Age': 24, 'City': 'Paris'},
    {'Name': 'Peter', 'Age': 35, 'City': 'Berlin'}
]

df = pd.DataFrame(data)
print(df)

Output:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin

In this approach, each dictionary in the list represents a row in the DataFrame. The keys in each dictionary become the column names, and the values are placed in their respective rows.
Custom Index in DataFrame

Prepare Your Data
Create a dictionary with your data values

Define a Custom Index
Create a list of values to use as row labels

Create the DataFrame
Use the index parameter when creating the DataFrame

import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 35]}
row_labels = ['Person1', 'Person2', 'Person3']
df = pd.DataFrame(data, index=row_labels)
print(df)
DataFrame from CSV Files

Prepare the CSV File
Ensure your CSV file is properly formatted with comma-separated values and headers.

Use pd.read_csv()
Import the CSV file using pandas' read_csv() function with the file path as an argument.

Explore the DataFrame
Use various methods to examine the imported data and start your analysis.

import pandas as pd
# Read CSV file
df = pd.read_csv('students.csv')
# Display first few rows
print(df.head())
Other Data Sources for DataFrame

Excel Files
Read Excel files using pd.read_excel('file.xlsx', sheet_name='Sheet1')

SQL Databases
Connect to SQL databases using pd.read_sql_query() or pd.read_sql_table()

JSON Data
Import JSON data with pd.read_json('file.json') or from API responses

HTML Tables
Parse HTML tables from websites using pd.read_html(url)

Pandas provides versatile tools to import data from various sources, making it a powerful library for data collection and integration across multiple formats.
Viewing DataFrame Contents

Common Methods
• df.head(n) - First n rows (default 5)
• df.tail(n) - Last n rows (default 5)
• df.info() - Concise summary
• df.describe() - Statistical summary
• df.shape - Dimensions (rows, columns)
• df.columns - Column labels
• df.index - Row labels
• df.dtypes - Data types of columns

import pandas as pd
data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 24, 35],
    'City': ['New York', 'Paris', 'Berlin']
}
df = pd.DataFrame(data)

# Display first 2 rows
print(df.head(2))

# Display DataFrame info
print(df.info())
Examining DataFrame

These methods provide quick ways to understand your data's structure, content, and statistical properties before diving into deeper analysis. The head() and tail() methods are particularly useful for large datasets where displaying all rows would be impractical.
Accessing DataFrame Columns

Accessing a Single Column

# Using dictionary-like notation
cities = df['City']

# Using attribute notation
ages = df.Age

print(type(cities))  # pandas.Series
print(cities)

Both methods return a Pandas Series object.

Accessing Multiple Columns

# Select multiple columns
subset = df[['Name', 'Age']]

print(type(subset))  # pandas.DataFrame
print(subset)

When selecting multiple columns, the result is a DataFrame.
Accessing DataFrame Rows

loc[] Accessor
Access rows by label (index). Both slice endpoints are included.

# Get row with index 'Person2'
df.loc['Person2']

# Range of rows
df.loc['Person1':'Person3']

iloc[] Accessor
Access rows by integer position. The upper bound is exclusive.

# Get row at position 1
df.iloc[1]

# Range of rows
df.iloc[0:2]  # rows 0 and 1

Boolean Indexing
Filter rows based on conditions.

# Get rows where Age > 25
df[df['Age'] > 25]
Accessing Specific Cells

Using loc[] for Label-Based Access
Use df.loc[row_label, column_label] to access a specific cell by its row and column labels.

Using iloc[] for Position-Based Access
Use df.iloc[row_position, column_position] to access a specific cell by its integer position.

Using at[] and iat[] for Fast Access
For single-cell access, df.at[row_label, column_label] and df.iat[row_pos, col_pos] are faster methods.

# Example DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
                  index=['row1', 'row2', 'row3'])

# Accessing cells
value1 = df.loc['row2', 'A']  # 2
value2 = df.iloc[0, 1]        # 4
value3 = df.at['row3', 'B']   # 6
value4 = df.iat[1, 0]         # 2
Accessing Rows and Columns Together

Using loc[]

# Access rows 'r1' to 'r3' and columns 'A' and 'C'
df.loc['r1':'r3', ['A', 'C']]

# All rows for column 'B'
df.loc[:, 'B']

# Rows where column A > 5
df.loc[df['A'] > 5, :]

Using iloc[]

# Access rows 0 to 2 and columns 0 and 2
df.iloc[0:3, [0, 2]]

# All rows for the column at position 1
df.iloc[:, 1]

# First 3 rows, first 2 columns
df.iloc[0:3, 0:2]

The loc[] and iloc[] accessors provide powerful ways to select specific subsets of your data, combining row and column selections in a single operation.
Adding Columns to DataFrame

Direct Assignment
Add a column by assigning values to a new column name: df['New_Column'] = values

Using insert() Method
Insert a column at a specific position: df.insert(position, column_name, values)

Using assign() Method
Create a new DataFrame with added columns: new_df = df.assign(New_Column=values)

import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 35]})

# Add a new column
df['City'] = ['New York', 'Paris', 'Berlin']

# Add a column derived from existing data
df['Age_in_Months'] = df['Age'] * 12
Adding Rows to DataFrame

Prepare Row Data
Create a Series or dictionary with the new row's data

Use concat()
Concatenate the new row onto the existing DataFrame (the older append() method was deprecated in pandas 1.4 and removed in 2.0)

Reset Index if Needed
Use reset_index() to create sequential indices

# Original DataFrame
df = pd.DataFrame({'Name': ['John', 'Anna'],
                   'Age': [28, 24]})

# New row as a Series
new_row = pd.Series({'Name': 'Peter', 'Age': 35},
                    name='Person3')

# Concatenate the row (returns a new DataFrame)
df_new = pd.concat([df, pd.DataFrame([new_row])])

# df.append(new_row) worked in older pandas versions,
# but is no longer available in pandas 2.0 and later
Deleting Columns

Using drop() Method

# Drop a single column
df_new = df.drop('Age', axis=1)

# Drop multiple columns
df_new = df.drop(['City', 'Age'], axis=1)

# Drop in-place
df.drop('Age', axis=1, inplace=True)

The axis=1 parameter specifies that we're dropping columns (axis=0 would drop rows).

Using del Statement

# Delete a column using del
df = pd.DataFrame({'Name': ['John', 'Anna'],
                   'Age': [28, 24],
                   'City': ['NY', 'Paris']})

del df['City']  # Removes 'City' column

The del statement modifies the DataFrame in-place without returning a new copy.
Deleting Rows

Using drop() Method
Remove rows by their index labels: df.drop(['row1', 'row3'])

Using Boolean Filtering
Filter out unwanted rows: df[df['Age'] != 35]

Using iloc[] for Position-Based Dropping
Drop a row by position: df.drop(df.iloc[2].name)

# Original DataFrame
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 35]},
                  index=['P1', 'P2', 'P3'])

# Drop rows by index label
df_new = df.drop(['P1', 'P3'])

# Drop rows based on a condition
df_new = df[df['Age'] < 30]
Renaming Columns

Using rename() Method
Rename specific columns using a dictionary mapping: df.rename(columns={'old_name': 'new_name'})

Reassigning the columns Attribute
Replace all column names at once: df.columns = ['col1', 'col2', 'col3']

Using add_prefix() and add_suffix()
Add prefixes or suffixes to all columns: df.add_prefix('X_') or df.add_suffix('_Y')

# Original DataFrame
df = pd.DataFrame({'NAME': ['John', 'Anna'],
                   'AGE': [28, 24]})

# Rename specific columns
df_new = df.rename(columns={'NAME': 'Name', 'AGE': 'Age'})

# Rename all columns at once
df.columns = ['First_Name', 'Years']
Handling Missing Values

Identify
Locate missing values using df.isna() or df.isnull()

Remove
Drop missing values using df.dropna()

Fill
Replace missing values using df.fillna()

Interpolate
Estimate missing values using df.interpolate() (sketched below)

# Check for missing values
missing_values = df.isnull().sum()

# Drop rows with any missing values
df_clean = df.dropna()

# Fill missing values with a specific value
df_filled = df.fillna(0)

# Fill with column means (numeric_only avoids errors on text columns)
df_mean = df.fillna(df.mean(numeric_only=True))
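The interpolate() step can be sketched like this (linear interpolation is the default; the values here are illustrative):

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Fill gaps by linear interpolation between neighboring values
print(s.interpolate())  # 1.0, 2.0, 3.0, 4.0, 5.0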
DataFrame Sorting

sort_values() Method

# Sort by a single column
df_sorted = df.sort_values('Age')

# Sort by multiple columns
df_sorted = df.sort_values(['City', 'Age'])

# Sort in descending order
df_sorted = df.sort_values('Age', ascending=False)

# Sort by multiple columns with different orders
df_sorted = df.sort_values(['City', 'Age'],
                           ascending=[True, False])

sort_index() Method

# Sort by row index
df_sorted = df.sort_index()

# Sort by row index in descending order
df_sorted = df.sort_index(ascending=False)

# Sort by column names
df_sorted = df.sort_index(axis=1)

# In-place sorting
df.sort_values('Age', inplace=True)
Filtering Data

Simple Conditions

# Filter rows where Age > 25
result = df[df['Age'] > 25]

# Filter rows where Name is 'John'
result = df[df['Name'] == 'John']

Multiple Conditions

# Rows where Age > 25 AND City is 'New York'
result = df[(df['Age'] > 25) & (df['City'] == 'New York')]

# Rows where Age < 30 OR City is 'Paris'
result = df[(df['Age'] < 30) | (df['City'] == 'Paris')]

Special Methods

# Rows where Name starts with 'J'
result = df[df['Name'].str.startswith('J')]

# Rows where City is in a list
cities = ['New York', 'London', 'Tokyo']
result = df[df['City'].isin(cities)]
Statistical Functions

Descriptive Statistics
Methods like mean(), median(), min(), max(), count(), std(), var(), etc.

describe()
Generates summary statistics for numeric columns

Aggregation Levels
Apply to the entire DataFrame, specific columns, or groups

Correlation
The corr() method checks relationships between variables

# Get basic statistics for all numeric columns
stats = df.describe()

# Mean of each numeric column
column_means = df.mean(numeric_only=True)

# Calculate correlation between numeric columns
correlation = df.corr(numeric_only=True)
Grouping Data

Split: Group by one or more columns
df.groupby('column') or df.groupby(['col1', 'col2'])

Apply: Perform operations on each group
Apply aggregation functions like mean(), sum(), count()

Combine: Merge results into a new DataFrame
Results are automatically combined into a new DataFrame

# Group by a single column and calculate the mean
result = df.groupby('City')['Age'].mean()

# Group by multiple columns with multiple aggregations
result = df.groupby(['City', 'Gender']).agg({
    'Age': ['mean', 'max', 'count'],
    'Salary': ['sum', 'mean']
})
Pivot Tables

Creating Pivot Tables

# Basic pivot table
pivot = df.pivot_table(
    index='City',      # Rows
    columns='Gender',  # Columns
    values='Salary',   # Values to aggregate
    aggfunc='mean'     # Aggregation function
)

# Multiple aggregation functions
pivot = df.pivot_table(
    index='City',
    values=['Salary', 'Age'],
    aggfunc=['mean', 'sum', 'count']
)

Pivot Table Parameters
• index: Column(s) to use for row labels
• columns: Column(s) to use for column labels
• values: Column(s) to aggregate
• aggfunc: Function(s) for aggregation
• fill_value: Value to replace missing data
• margins: Add a row/column with totals
• dropna: Whether to exclude columns whose entries are all NaN
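The fill_value and margins parameters from the list above can be combined like this (the column names are illustrative):

# Replace empty cells with 0 and add an 'All' row/column of totals
pivot = df.pivot_table(
    index='City',
    columns='Gender',
    values='Salary',
    aggfunc='mean',
    fill_value=0,  # Used where a City/Gender combination has no rows
    margins=True   # Adds an aggregate 'All' row and column
)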
Calculating New Columns

Simple Calculations
Create new columns based on arithmetic operations with existing columns.

df['BMI'] = df['Weight'] / ((df['Height'] / 100) ** 2)
df['Full_Name'] = df['First_Name'] + ' ' + df['Last_Name']

Using apply() Method
Apply custom functions to create new columns.

def age_category(age):
    if age < 18:
        return 'Child'
    elif age < 65:
        return 'Adult'
    else:
        return 'Senior'

df['Age_Category'] = df['Age'].apply(age_category)

Using NumPy Functions
Apply NumPy functions to create new columns.

import numpy as np
df['Log_Salary'] = np.log(df['Salary'])
df['Salary_Normalized'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
Aggregate Methods
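Aggregate methods can be applied to a single column, to several columns at once via agg(), or with different functions per column. A brief sketch, with illustrative column names:

# Single-column aggregates
print(df['Age'].sum())
print(df['Age'].mean())

# Several aggregates at once with agg()
print(df['Age'].agg(['min', 'max', 'mean']))

# Different aggregates per column
print(df.agg({'Age': 'mean', 'Salary': 'sum'}))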
Merging DataFrames

pd.merge()
Merge DataFrames based on common columns, similar to SQL join operations

Join Types
inner, outer, left, right - control how to handle rows with no matches

Keys
Specify the columns to join on with the 'on', 'left_on', and 'right_on' parameters

# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
                    'Name': ['John', 'Anna', 'Peter', 'Linda']})
df2 = pd.DataFrame({'ID': [2, 3, 5, 6],
                    'City': ['Paris', 'Berlin', 'London', 'Rome']})

# Inner join on 'ID' column
result = pd.merge(df1, df2, on='ID', how='inner')

# Left join with different key names on each side
# (assumes the right-hand frame's key column is named 'EmpID' instead of 'ID')
result = pd.merge(df1, df2, left_on='ID', right_on='EmpID', how='left')
Concatenating DataFrames

Vertical Concatenation (Row-wise)

# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenate vertically
result = pd.concat([df1, df2])

# Reset index after concatenation
result = pd.concat([df1, df2]).reset_index(drop=True)

Horizontal Concatenation (Column-wise)

# Sample DataFrames
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df3 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})

# Concatenate horizontally
result = pd.concat([df1, df3], axis=1)

# Use the join parameter for mismatched indices
result = pd.concat([df1, df3], axis=1, join='inner')
Working with Dates and Times

Converting to Datetime
Use pd.to_datetime() to convert strings to datetime objects

Date Functionality
Access year, month, day, hour, etc. with the dt accessor

Date Ranges
Create date ranges with pd.date_range()

Time-Based Indexing
Use datetime objects as the index for time series analysis

# Convert string dates to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Extract components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month

# Create a date range
dates = pd.date_range(start='2023-01-01',
                      end='2023-01-10', freq='D')
String Methods in Pandas

String Manipulation
• str.lower(), str.upper() - Convert case
• str.strip() - Remove whitespace
• str.replace() - Replace substrings
• str.split() - Split strings into lists
• str.join() - Join list elements into strings
• str.slice() - Extract substrings

String Pattern Matching
• str.contains() - Check if a substring exists
• str.startswith(), str.endswith() - Check prefixes and suffixes
• str.match() - Match a regex pattern
• str.extract() - Extract regex groups
• str.findall() - Find all occurrences
• str.count() - Count occurrences

# Convert all names to lowercase
df['Name'] = df['Name'].str.lower()

# Extract domain from email addresses
df['Domain'] = df['Email'].str.split('@').str[1]

# Filter rows where name contains 'john'
result = df[df['Name'].str.contains('john', case=False)]
Reshaping Data with melt()

Wide Format
Data spread across multiple columns

Melt Operation
Convert from wide to long format

Long Format
Data structured with more rows and fewer columns

# Wide format data
wide_df = pd.DataFrame({
    'Student': ['John', 'Anna', 'Peter'],
    'Math': [90, 85, 92],
    'Science': [88, 91, 84],
    'History': [79, 82, 88]
})

# Convert to long format
long_df = pd.melt(
    wide_df,
    id_vars=['Student'],                        # Columns to keep as-is
    value_vars=['Math', 'Science', 'History'],  # Columns to unpivot
    var_name='Subject',                         # Name for the variable column
    value_name='Score'                          # Name for the value column
)
Pivoting Data

Long Format
Data in a normalized form with many rows

Pivot Operation
Convert from long to wide format

Wide Format
Data spread across multiple columns for easier viewing

# Long format data
long_df = pd.DataFrame({
    'Student': ['John', 'John', 'John', 'Anna', 'Anna', 'Anna'],
    'Subject': ['Math', 'Science', 'History', 'Math', 'Science', 'History'],
    'Score': [90, 88, 79, 85, 91, 82]
})

# Convert to wide format
wide_df = long_df.pivot(
    index='Student',    # Rows
    columns='Subject',  # Columns
    values='Score'      # Values to fill the table
)
Handling Duplicates

Identify
Find duplicate rows with the duplicated() method

Count
Count occurrences with value_counts()

Keep
Specify which duplicates to keep (first/last)

Remove
Drop duplicates with drop_duplicates()

# Check for duplicate rows
duplicates = df.duplicated()
print(f"Number of duplicates: {duplicates.sum()}")

# Show duplicate rows
duplicate_rows = df[df.duplicated()]

# Drop duplicates, keeping the first occurrence
df_clean = df.drop_duplicates()

# Drop duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['Name', 'City'])
Reading Multiple Files

List Files
Identify the files to be read using glob, os.listdir, or a manual list.

Read Each File
Use a loop or list comprehension to read each file into a DataFrame.

Combine Results
Concatenate the individual DataFrames into a single DataFrame.

import pandas as pd
import glob

# Get all CSV files in a directory
file_paths = glob.glob('data/*.csv')

# Read and combine files
dfs = []
for file in file_paths:
    df = pd.read_csv(file)
    dfs.append(df)

# Concatenate all DataFrames
combined_df = pd.concat(dfs, ignore_index=True)
File Export Options

CSV: df.to_csv('filename.csv', index=False)
Excel: df.to_excel('filename.xlsx', sheet_name='Sheet1')
JSON: df.to_json('filename.json', orient='records')
SQL: df.to_sql('table_name', connection)

# Export to CSV without index
df.to_csv('students.csv', index=False)

# Export to Excel with multiple sheets
with pd.ExcelWriter('school_data.xlsx') as writer:
    students_df.to_excel(writer, sheet_name='Students')
    courses_df.to_excel(writer, sheet_name='Courses')

# Export to JSON
df.to_json('students.json', orient='records')
Data Visualization with Pandas

Built-in Plotting
• Line plots: df.plot() or df.plot.line()
• Bar charts: df.plot.bar() or df.plot.barh()
• Histograms: df.plot.hist()
• Scatter plots: df.plot.scatter(x='column1', y='column2')
• Box plots: df.plot.box()
• Pie charts: df.plot.pie()

Customization Options
• figsize: Set the figure size
• title: Add a title to the plot
• xlabel, ylabel: Add axis labels
• color: Specify plot colors
• grid: Show or hide grid lines
• legend: Show or hide the legend

Pandas visualization is built on Matplotlib, providing a convenient interface for quick data exploration. For more advanced visualizations, consider using specialized libraries like Matplotlib, Seaborn, or Plotly.
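A minimal sketch combining a plotting call with a few of the customization options above (the column names and values are illustrative, and matplotlib must be installed):

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Subject': ['Math', 'Science', 'History'],
                   'Average': [85, 88, 80]})

# Bar chart with a title, an axis label, and a fixed figure size
df.plot.bar(x='Subject', y='Average', figsize=(6, 4),
            title='Average Marks by Subject', legend=False)
plt.ylabel('Average Marks')
plt.show()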
Data Types in Pandas

Pandas supports various data types to efficiently store and process different kinds of data. Understanding data types is crucial for memory optimization and for ensuring appropriate data operations. The dtypes attribute shows the data type of each column in a DataFrame, while the astype() method can be used to convert between types.
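As a quick sketch of inspecting types (the columns here are illustrative):

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'],
                   'Age': [28, 24],
                   'Height': [1.75, 1.62]})

print(df.dtypes)
# Name       object
# Age         int64
# Height    float64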
Type Conversion in Pandas

Check Current Types
Use df.dtypes to see the current data types of all columns

Apply Conversion
Use df['column'].astype() or pd.to_numeric(), pd.to_datetime()

Verify Conversion
Check dtypes again to ensure the conversion was successful

# Convert to numeric
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

# Convert to string
df['ID'] = df['ID'].astype(str)

# Convert to categorical (for memory efficiency)
df['Category'] = df['Category'].astype('category')
MultiIndex (Hierarchical Indexing)

Creating a MultiIndex

# From tuples
tuples = [('A', 1), ('A', 2), ('B', 1), ('B', 2)]
index = pd.MultiIndex.from_tuples(tuples,
                                  names=['Letter', 'Number'])
df = pd.DataFrame({'Value': [10, 20, 30, 40]},
                  index=index)

# From arrays
arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays,
                                  names=['Letter', 'Number'])

Working with a MultiIndex

# Accessing with .loc
value = df.loc[('A', 1)]

# Slicing the first level
subset = df.loc['A']

# Selecting specific levels
df.index.get_level_values('Letter')

# Unstacking (pivoting) levels
unstacked = df.unstack(level='Number')

# Resetting index to columns
flat_df = df.reset_index()
Real World Example: Student Data Analysis

Student Performance
Analyze student grades across different subjects and identify patterns in academic performance. Track improvements over time and identify areas that need attention.

Attendance Analysis
Monitor attendance patterns, identify students with attendance issues, and analyze the correlation between attendance and academic performance.

Personalized Learning
Use data insights to create personalized learning plans for students based on their strengths, weaknesses, and learning pace.
Real World Example: Financial Analysis

Stock Market Analysis
Analyze historical stock prices, calculate moving averages, and identify trading patterns. Financial analysts use Pandas' time series capabilities to track market trends.

Company Performance
Track revenue, expenses, and profitability metrics over time. Compare performance across different business units or against competitors.

Risk Assessment
Calculate financial ratios, perform credit scoring, and assess investment risks using statistical methods in Pandas.
Real World Example: Healthcare Analytics

Healthcare organizations use Pandas for patient data analysis, tracking treatment outcomes, and predicting hospital
readmissions. During the COVID-19 pandemic, Pandas was extensively used for tracking infection rates, analyzing
vaccination data, and modeling the spread of the virus. Medical researchers also use Pandas for clinical trial data
analysis and drug effectiveness studies.
Practical Example: Data Cleaning

Import Data
Read data from a CSV file and inspect its structure and content

Handle Missing Values
Identify and handle NULL values by filling or dropping them

Fix Data Types
Convert columns to appropriate data types for analysis

Remove Duplicates
Identify and remove duplicate records from the dataset

Standardize Values
Normalize text data, fix inconsistencies, and handle outliers

# Import and clean data
df = pd.read_csv('raw_data.csv')
df = df.drop_duplicates()
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Category'] = df['Category'].str.upper()
df = df.dropna(subset=['Price', 'Date'])
Practical Example: Analysis Workflow

Data Import
Load data from various sources into DataFrames

Data Cleaning
Handle missing values, fix data types, and remove outliers

Data Transformation
Create new variables, aggregate data, and reshape as needed

Analysis & Visualization
Conduct statistical analysis and create visualizations

Reporting & Exporting
Export results and prepare reports for stakeholders

A typical data analysis workflow involves multiple steps, from importing raw data to generating insights and reports. Pandas provides tools for each stage of this process, allowing analysts to work efficiently within a single environment, as the sketch below shows.
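A minimal sketch of the whole workflow in one script, assuming a hypothetical sales.csv with Date, Region, and Amount columns:

import pandas as pd

# 1. Import
df = pd.read_csv('sales.csv')  # hypothetical input file

# 2. Clean
df['Amount'] = pd.to_numeric(df['Amount'], errors='coerce')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Amount', 'Date'])

# 3. Transform: total sales per region per month
df['Month'] = df['Date'].dt.to_period('M')
summary = df.groupby(['Region', 'Month'])['Amount'].sum().reset_index()

# 4. Analyze
print(summary.describe())

# 5. Export
summary.to_csv('sales_summary.csv', index=False)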
Performance Optimization Tips

Memory Usage
Use appropriate data types (int8/int16 instead of int64, category for text with few unique values)

Computation Speed
Use vectorized operations instead of loops; leverage built-in methods

Large Datasets
Process data in chunks; filter before loading the full data

Indexing
Set an appropriate index for common query patterns; use query() for filtering

# Convert object to category
df['Category'] = df['Category'].astype('category')

# Use int8 for small integers
df['SmallNumber'] = df['SmallNumber'].astype('int8')

# Vectorized calculation instead of apply
df['Result'] = df['Value1'] * df['Value2']
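The chunked-processing tip can be sketched as follows (big_file.csv and the filter column are hypothetical):

import pandas as pd

# Read a large CSV in 100,000-row chunks, keeping only the rows we need
chunks = []
for chunk in pd.read_csv('big_file.csv', chunksize=100_000):
    chunks.append(chunk[chunk['Amount'] > 0])  # filter early to save memory

df = pd.concat(chunks, ignore_index=True)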
Common Challenges and Solutions

Memory Errors
• Challenge: "MemoryError" when working with large datasets
• Solution: Use the chunksize parameter in read_csv(), optimize data types, filter early

Performance Issues
• Challenge: Slow operations on large DataFrames
• Solution: Use vectorized operations, avoid apply() when possible, use query() for filtering

Data Type Errors
• Challenge: Operations fail due to incorrect data types
• Solution: Explicitly convert data using astype(), to_numeric(), to_datetime()

Missing Data
• Challenge: NaN values causing calculation errors
• Solution: Use fillna(), dropna(), or handle NaN explicitly in calculations
Practical Exercises

Exercise 1: Data Import and Exploration
Read a CSV file of student marks, display basic information about the dataset, and calculate summary statistics for each subject.

Exercise 2: Data Filtering and Transformation
Filter students who scored above 90 in Mathematics, create a new column for total marks, and rank students based on their performance.

Exercise 3: Data Aggregation and Visualization
Group students by sections, calculate average marks for each subject in each section, and create a bar chart to visualize the comparison.

Exercise 4: Data Merging and Export
Merge the marks dataset with another dataset containing student personal information, clean the merged dataset, and export it to an Excel file.
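As a starting point for Exercise 1, a brief sketch assuming a hypothetical marks.csv with one numeric column per subject:

import pandas as pd

df = pd.read_csv('marks.csv')  # hypothetical file of student marks

print(df.info())                   # structure: columns, dtypes, non-null counts
print(df.describe())               # summary statistics for each subject column
print(df.mean(numeric_only=True))  # average marks per subject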
Advanced Topics for Further Exploration

As you advance in your Pandas journey, explore topics like time series analysis with the resample() and rolling() functions, integrating Pandas with machine learning libraries like scikit-learn, processing big data with Dask for distributed computing, and using extension arrays for custom data types. Advanced indexing techniques and optimization methods will also become increasingly important as you work with larger and more complex datasets.
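A small taste of resample() and rolling(), using a synthetic daily series:

import pandas as pd
import numpy as np

# Synthetic daily data indexed by date
dates = pd.date_range('2023-01-01', periods=90, freq='D')
ts = pd.Series(np.random.randn(90).cumsum(), index=dates)

monthly_mean = ts.resample('M').mean()      # downsample to one value per month
                                            # (newer pandas versions prefer 'ME')
rolling_week = ts.rolling(window=7).mean()  # 7-day moving average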
Summary and Key Takeaways

Mastery
Advanced pandas techniques for real-world data science

Analysis
Statistical tools and aggregation methods

Transformation
Data cleaning, merging, and reshaping

Data Structures
Series and DataFrame fundamentals

In this chapter, we explored the fundamentals of Pandas, starting with the basic data structures, Series and DataFrame. We learned how to create, manipulate, and analyze data using various Pandas functions. The skills you've gained form a solid foundation for data analysis and prepare you for more advanced topics in data science.

Remember that proficiency in Pandas comes with practice. Continue working with different datasets and exploring the rich functionality that
Pandas offers to become more confident in your data handling abilities.
