Data Handling Using Pandas-1
Pandas - I
Welcome to Chapter 2 of Information Practices for Class 12. In this
presentation, we will explore the fundamentals of data handling
using the powerful Python library called Pandas. This chapter
introduces essential tools for data analysis that are crucial for
modern data science applications.
Data Handling
Pandas provides the data structures and functions needed to efficiently handle structured data.
Built on NumPy
Pandas is built on top of NumPy, another powerful library for
numerical computation.
Analysis Tools
It offers tools for reading, writing, manipulating, and analyzing
data with ease.
Why Pandas?
Data Analysis Benefits
• Fast and efficient data manipulation
• Handling missing data seamlessly
• Merging and joining datasets
• Reshaping and pivoting data
• Time-series functionality

Industry Relevance
• Essential skill for data scientists
• Used extensively in finance and business
• Popular for academic research
• Foundation for AI and machine learning
• In-demand job skill
Installing Pandas
Check Python Installation
Ensure that Python is installed on your computer. Pandas requires Python 3.6 or higher.
Install Pandas
Run pip install pandas from the command line to download and install the library.
Verify Installation
Import pandas in a Python script to verify: import pandas as pd
Core Data Structures in Pandas
DataFrame
2D labeled data structure with columns of potentially different types
Series
1D labeled array capable of holding any data type
Index
Immutable array-like structure for axis labels
These three data structures form the foundation of data manipulation in Pandas. We'll explore each of these in
detail, starting with Series and then moving to the more complex DataFrame structure, which is the most commonly
used.
Pandas Series
Definition
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Structure
It consists of two arrays: one for data and another for labels, called the index. The index makes data alignment possible and provides additional functionality.
Usage
Series are ideal for representing time series data, vector data,
or any ordered collection where you need to associate labels
with values.
Creating a Series
From Lists
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9])
print(s)

Output:
0    1
1    3
2    5
3    7
4    9
dtype: int64

With Custom Index
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])
print(s)

Output:
a    1
b    3
c    5
d    7
e    9
dtype: int64
Series from Dictionary
Create a Dictionary
Define a Python dictionary with keys and values
Convert to Series
Use pd.Series(dictionary) to create a Series
Result
Keys become index, values become Series values
import pandas as pd
data = {'a': 10, 'b': 20, 'c': 30, 'd': 40}
s = pd.Series(data)
print(s)
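Running this prints the dictionary keys as the index:

a    10
b    20
c    30
d    40
dtype: int64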
Accessing Series Elements
Elements can be accessed by label or by position, and Series support several element-wise operations (see the sketch below):
Arithmetic
Addition, subtraction, multiplication, division
Sorting
Sort by index or values
Statistics
Mean, sum, min, max, etc.
Filtering
Select data based on conditions
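A short sketch of these operations on a small Series (the values are chosen for illustration):

import pandas as pd

s = pd.Series([5, 2, 9, 4], index=['a', 'b', 'c', 'd'])

# Access by label and by position
print(s['b'])        # 2
print(s.iloc[0])     # 5

# Arithmetic (element-wise)
print(s * 2)

# Statistics
print(s.mean(), s.max())

# Sorting
print(s.sort_values())

# Filtering
print(s[s > 4])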
DataFrame Applications
DataFrames are ideal for representing real-world data like financial data, experimental results, or any structured data that needs to be analyzed.
Creating a DataFrame
From Dictionary of Lists
Keys become column names, lists become columns
From Series
Each Series becomes a column
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
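Adding print(df) would display:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London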
DataFrame from Dictionary of Lists
import pandas as pd

data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, 32],
        'City': ['New York', 'Paris', 'Berlin', 'London']}
df = pd.DataFrame(data)
print(df)
DataFrame with Custom Row Labels
Use the index parameter when creating the DataFrame to supply your own row labels.
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'],
'Age': [28, 24, 35]}
row_labels = ['Person1', 'Person2', 'Person3']
df = pd.DataFrame(data, index=row_labels)
print(df)
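With these labels, print(df) shows:

          Name  Age
Person1   John   28
Person2   Anna   24
Person3  Peter   35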
DataFrame from CSV Files
import pandas as pd
# Read CSV file
df = pd.read_csv('students.csv')
# Display first few rows
print(df.head())
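read_csv() also accepts many optional parameters; a few commonly used ones are sketched below (the file and column names are illustrative):

df = pd.read_csv('students.csv',
                 sep=',',                          # field delimiter
                 index_col='ID',                   # use a column as the row index
                 usecols=['ID', 'Name', 'Marks'],  # load selected columns only
                 nrows=100,                        # read only the first 100 rows
                 na_values=['NA', '-'])            # extra strings treated as NaN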
Other Data Sources for DataFrame
Excel Files: Read Excel files using pd.read_excel('file.xlsx', sheet_name='Sheet1')
SQL Databases: Connect to SQL databases using pandas.read_sql_query() or pandas.read_sql_table()
JSON Data: Import JSON data with pd.read_json('file.json') or from API responses
HTML Tables: Parse HTML tables from websites using pd.read_html(url)
Pandas provides versatile tools to import data from various sources, making it a powerful library for data collection and integration from multiple formats.
Viewing DataFrame Contents
Common Methods
• df.head(n) - First n rows (default 5)
• df.tail(n) - Last n rows (default 5)
• df.info() - Concise summary
• df.describe() - Statistical summary
• df.shape - Dimensions (rows, columns)
• df.columns - Column labels
• df.index - Row labels
• df.dtypes - Data types of columns

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 24, 35],
    'City': ['New York', 'Paris', 'Berlin']
}
df = pd.DataFrame(data)

# Display first 2 rows
print(df.head(2))

# Display DataFrame info (info() prints directly, so no print() is needed)
df.info()
Examining a DataFrame
These methods provide quick ways to understand your data's structure, content, and statistical properties before
diving into deeper analysis. The head() and tail() methods are particularly useful for large datasets where displaying
all rows would be impractical.
Accessing DataFrame Columns
Accessing Single Columns
cities = df['City']            # a single column is a Series
print(type(cities))            # <class 'pandas.core.series.Series'>
print(cities)

Accessing Multiple Columns
subset = df[['Name', 'City']]  # a list of columns returns a DataFrame

Boolean Indexing
Filter rows based on conditions.
adults = df[df['Age'] > 25]

Accessing Individual Cells
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
                  index=['row1', 'row2', 'row3'])

value1 = df.loc['row2', 'A']   # 2 (label-based)
value2 = df.iloc[0, 1]         # 4 (position-based)
value3 = df.at['row3', 'B']    # 6 (fast single-value access by label)
value4 = df.iat[1, 0]          # 2 (fast single-value access by position)
Accessing Rows and Columns Together
Using loc[]
# Access rows 'r1' to 'r3' and columns 'A' and 'C'
df.loc['r1':'r3', ['A', 'C']]

# All rows for column 'B'
df.loc[:, 'B']

# Rows where column A > 5
df.loc[df['A'] > 5, :]

Using iloc[]
# Access rows 0 to 2 and columns 0 and 2
df.iloc[0:3, [0, 2]]

# All rows for column at position 1
df.iloc[:, 1]

# First 3 rows, first 2 columns
df.iloc[0:3, 0:2]
The loc[] and iloc[] accessors provide powerful ways to select specific subsets of your data, combining row and
column selections in a single operation.
Adding Columns to DataFrame
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 35]})

# Add a new column by direct assignment
df['City'] = ['New York', 'Paris', 'Berlin']

Adding Rows
# Original DataFrame
df = pd.DataFrame({'Name': ['John', 'Anna'],
                   'Age': [28, 24]})

# New row as a dictionary (example values)
new_row = {'Name': 'Peter', 'Age': 35}

# Using concat
df_new = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
Deleting Columns
# Original DataFrame
df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'],
                   'Age': [28, 24, 35]},
                  index=['P1', 'P2', 'P3'])

Using drop() Method
df = df.drop('Age', axis=1)    # returns a new DataFrame without 'Age'

Using del Statement
del df['Name']                 # removes the column in place
Handling Missing Data
Identify
Locate missing values using df.isna() or df.isnull()
Remove
Drop missing values using df.dropna()
Interpolate
Estimate missing values using df.interpolate()
Fill
Replace missing values using df.fillna()
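A minimal sketch of these four steps (the NaN positions are illustrative):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Marks': [85, np.nan, 72, np.nan, 90]})

print(df.isna())           # identify missing values
print(df.dropna())         # remove rows containing NaN
print(df.interpolate())    # estimate NaN from neighbouring values
print(df.fillna(0))        # replace NaN with a fixed value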
Filtering Rows with Conditions
# Filter rows where Age > 25
result = df[df['Age'] > 25]

# Filter rows where Name is 'John'
result = df[df['Name'] == 'John']

# Rows where Age > 25 AND City is 'New York'
result = df[(df['Age'] > 25) & (df['City'] == 'New York')]

# Rows where Age < 30 OR City is 'Paris'
result = df[(df['Age'] < 30) | (df['City'] == 'Paris')]
Special Methods
Descriptive Statistics
Methods like mean(), median(), min(), max(), count(), std(), var(), etc.
describe()
Generates summary statistics for numeric columns
Aggregation Levels
Apply to entire DataFrame, specific columns, or groups
Correlation
corr() method checks relationships between variables
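For example, on a DataFrame with numeric columns (the column names are assumed):

print(df['Age'].mean())              # average of one column
print(df.describe())                 # summary statistics for numeric columns
print(df[['Age', 'Salary']].corr())  # correlation between two columns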
# A custom function applied to each value (example definition)
def age_category(age):
    return 'Adult' if age >= 18 else 'Minor'

df['Age_Category'] = df['Age'].apply(age_category)

import numpy as np
df['Log_Salary'] = np.log(df['Salary'])
df['Salary_Normalized'] = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
Aggregate Methods
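A sketch of typical aggregation using groupby() and agg(), assuming City, Age, and Salary columns:

# Mean age per city
print(df.groupby('City')['Age'].mean())

# Several aggregates at once
print(df.groupby('City').agg({'Age': ['mean', 'max'], 'Salary': 'sum'}))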
Merging DataFrames
# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3, 4],
'Name': ['John', 'Anna', 'Peter', 'Linda']})
df2 = pd.DataFrame({'ID': [2, 3, 5, 6],
'City': ['Paris', 'Berlin', 'London', 'Rome']})
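One way to join these two frames on the shared ID column:

# Inner join keeps only IDs present in both frames (2 and 3)
merged = pd.merge(df1, df2, on='ID', how='inner')
print(merged)

# An outer join would keep all IDs, filling gaps with NaN
outer = pd.merge(df1, df2, on='ID', how='outer')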
Date Functionality
Access year, month, day, hour, etc. with dt accessor
Date Ranges
Create date ranges with pd.date_range()
Time-Based Indexing
Use datetime objects as index for time series analysis
# Ensure the column holds datetime values
df['Date'] = pd.to_datetime(df['Date'])

# Extract components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
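Date ranges and time-based indexing, sketched with illustrative dates and values:

# Five consecutive days starting 1 Jan 2024
dates = pd.date_range(start='2024-01-01', periods=5, freq='D')

# Use the dates as the index of a Series for time-based selection
ts = pd.Series([10, 12, 9, 14, 11], index=dates)
print(ts['2024-01-02':'2024-01-04'])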
Reshaping Data
Melt Operation
Convert from wide to long format
Long Format
Data structured with more rows and fewer columns
Pivot Operation
Convert from long to wide format
Wide Format
Data spread across multiple columns for easier viewing
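A small sketch of both directions (the column names are illustrative):

wide = pd.DataFrame({'Name': ['John', 'Anna'],
                     'Maths': [90, 85],
                     'Science': [80, 95]})

# Wide -> long
long = pd.melt(wide, id_vars='Name', var_name='Subject', value_name='Marks')

# Long -> wide
back = long.pivot(index='Name', columns='Subject', values='Marks')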
Handling Duplicates
Identify
Find duplicate rows with the duplicated() method
Count
Count occurrences with value_counts()
Keep
Specify which duplicates to keep (first/last)
Remove
Drop duplicates with drop_duplicates()
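In code, assuming a DataFrame df with a City column:

print(df.duplicated())                        # Boolean mask of repeated rows
print(df['City'].value_counts())              # occurrences of each value
df_unique = df.drop_duplicates(keep='first')  # keep the first occurrence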
Combine Results
Concatenate the individual DataFrames into a single DataFrame.
import pandas as pd
import glob

# Read every matching CSV and concatenate (the pattern is illustrative)
files = glob.glob('*.csv')
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Export to JSON
df.to_json('students.json', orient='records')
Data Visualization with Pandas
Built-in Plotting
Series and DataFrames have a .plot() method that creates quick charts directly from your data.
Customization Options
Arguments such as kind, title, and figsize adjust the chart type and appearance.
Pandas visualization is built on Matplotlib, providing a convenient interface for quick data exploration. For more
advanced visualizations, consider using specialized libraries like Matplotlib, Seaborn, or Plotly.
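A minimal sketch (requires Matplotlib to be installed; the column names are assumed):

import matplotlib.pyplot as plt

df.plot(kind='bar', x='Name', y='Age', title='Age by person')
df['Age'].plot(kind='hist', title='Age distribution')
plt.show()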
Data Types in Pandas
Pandas supports various data types to efficiently store and process different kinds of data. Understanding data
types is crucial for memory optimization and ensuring appropriate data operations. The dtypes attribute shows the
data types of each column in a DataFrame, while the astype() method can be used to convert between different
types.
Type Conversion in Pandas
Check Current Types
Use df.dtypes to see the current data types of all columns
Apply Conversion
Use df['column'].astype() or pd.to_numeric(), pd.to_datetime()
Verify Conversion
Check dtypes again to ensure the conversion was successful
# Convert to numeric
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
# Convert to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# Convert to string
df['ID'] = df['ID'].astype(str)
Real-World Application: Healthcare
Healthcare organizations use Pandas for patient data analysis, tracking treatment outcomes, and predicting hospital readmissions. During the COVID-19 pandemic, Pandas was extensively used for tracking infection rates, analyzing vaccination data, and modeling the spread of the virus. Medical researchers also use Pandas for clinical trial data analysis and drug effectiveness studies.
Practical Example: Data Cleaning
Import Data
Read data from a CSV file and inspect its structure and content
Remove Duplicates
Identify and remove duplicate records from the dataset
Standardize Values
Normalize text data, fix inconsistencies, and handle outliers
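These steps might look like the following (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv('records.csv')    # import
df.info()                          # inspect structure

df = df.drop_duplicates()          # remove duplicates

df['Name'] = df['Name'].str.strip().str.title()  # standardize text
df = df[df['Age'].between(0, 120)]               # drop implausible outliers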
Data Analysis Workflow
Data Cleaning
Handle missing values, fix data types, and remove outliers
Data Transformation
Create new variables, aggregate data, and reshape as needed
A typical data analysis workflow involves multiple steps, from importing raw data to generating insights and reports. Pandas provides tools for each
stage of this process, allowing analysts to work efficiently within a single environment.
Performance Optimization Tips
Memory Usage
Use appropriate data types (int8/int16 instead of int64, category for text with few unique values)
Computation Speed
Use vectorized operations instead of loops, leverage built-in methods
Large Datasets
Process data in chunks, use filters before loading full data
Indexing
Set appropriate index for common query patterns, use query() for filtering
Common Challenges and Solutions
Memory Errors
• Challenge: "MemoryError" when working with large datasets
• Solution: Use the chunksize parameter in read_csv(), optimize data types, filter early

Data Type Errors
• Challenge: Operations fail due to incorrect data types
• Solution: Explicitly convert data using astype(), to_numeric(), to_datetime()

Missing Data
• Challenge: NaN values causing calculation errors
• Solution: Use fillna(), dropna(), or handle NaN explicitly in calculations

Performance Issues
• Challenge: Slow operations on large DataFrames
• Solution: Use vectorized operations, avoid apply() when possible, use query() for filtering
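A sketch of chunked reading for the MemoryError case (the file name, chunk size, and Amount column are illustrative):

total = 0
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    # Work on one manageable piece at a time
    total += chunk['Amount'].sum()
print(total)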
Practical Exercises
Exercise 1: Data Import and Exploration
Read a CSV file of student marks, display basic information about the
dataset, and calculate summary statistics for each subject.
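One possible starting point (assuming a marks.csv file with one column per subject):

import pandas as pd

df = pd.read_csv('marks.csv')
print(df.head())        # first rows
print(df.info())        # structure of the dataset
print(df.describe())    # summary statistics per subject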
As you advance in your Pandas journey, explore topics like time series analysis with resample() and rolling()
functions, integrating Pandas with machine learning libraries like scikit-learn, processing big data with Dask for
distributed computing, and using extension arrays for custom data types. Advanced indexing techniques and
optimization methods will also become increasingly important as you work with larger and more complex datasets.
Summary and Key Takeaways
Data Structures
Series and DataFrame fundamentals
Transformation
Data cleaning, merging, and reshaping
Analysis
Statistical tools and aggregation methods
Mastery
Advanced pandas techniques for real-world data science
In this chapter, we explored the fundamentals of Pandas, starting with basic data structures like Series and DataFrames. We learned how to
create, manipulate, and analyze data using various Pandas functions. The skills you've gained form a solid foundation for data analysis and
preparation for more advanced topics in data science.
Remember that proficiency in Pandas comes with practice. Continue working with different datasets and exploring the rich functionality that
Pandas offers to become more confident in your data handling abilities.