Data Analysis with Pandas

Prof. Murali Krishna Gurram


Dept. of Geo-Engineering & RDT
Centre for Remote Sensing, AUCE
Andhra University, Visakhapatnam – 530 003
Dt. 04/11/2024
Data Analysis with Pandas


a. An overview of the Pandas package
b. The Pandas data structure-Series
c. The DataFrame
d. The essential basic functionality:
• Reindexing and altering labels
• Head and tail
• Binary operations
• Functional statistics
• Function application
• Sorting
• Indexing and selecting data
Overview of the Pandas Package in Python
What is PANDAS?
• Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools.

• Pandas is built on top of the NumPy library and is particularly useful for working with structured data.

• The name "Pandas" is derived from "Panel Data," a term used in econometrics for data collected over time for the same individuals.
Why use PANDAS?
• Data analysis often requires handling large datasets, and Pandas makes this easier by providing functions for manipulating data, performing statistical analysis, handling missing data, and more.

• Pandas is well-suited for tasks that involve data cleaning, data transformation, data visualization, and even complex analyses like grouping and aggregation.
• Pandas is a versatile and essential library for data analysis in Python, providing tools for data manipulation, transformation, aggregation, and visualization.

• It serves as the backbone for many data science and machine learning workflows due to its flexibility and powerful functionality.

• Learning Pandas opens up opportunities to work efficiently with real-world data, conduct complex analyses, and prepare data for advanced applications like machine learning.
Key Features of Pandas
• High-level data structures: Series and DataFrame.

• Easy handling of missing data.

• Powerful tools for data alignment and data manipulation.

• Flexible reshaping and pivoting of datasets.

• Time-series functionality.

• Integration with other Python libraries like Matplotlib and Seaborn for data visualization.
The Pandas Data Structures - Series and DataFrame
1. Introduction to Pandas Data Structures
• Overview:
– Pandas provides two main data structures that simplify handling and analyzing structured data: Series and DataFrame.

– Understanding these structures is essential because they are the foundation of most data analysis tasks in Pandas.
2. Series
What is a Series?
• A Series is a one-dimensional labeled array that can hold any data type (integers, floats, strings, etc.).

• A Series is similar to a single column in a table or an Excel spreadsheet.

• Every element in a Series has an index label, allowing access to values through their index.

• A Series can be created from various data types, including lists, dictionaries, and scalar values.
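The scalar case can be sketched as follows. This is a minimal illustration, and the labels 'a', 'b', 'c' are assumed for the example rather than taken from the slides. When the data is a single scalar, an index has to be supplied, and the value is repeated for every label.

```python
import pandas as pd

# Assumed labels for illustration; the scalar 5 is repeated for each label
scalar_series = pd.Series(5, index=['a', 'b', 'c'])
print(scalar_series)
```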
Syntax and Practical Examples
import pandas as pd

Creating a Series from a List:
data = [10, 20, 30, 40]  # a list with data elements
series = pd.Series(data)
print(series)

This will display a Series with default integer indexing starting from 0.
Creating a Series with Custom Index:
data = [10, 20, 30, 40]  # a list with data elements
index = ['a', 'b', 'c', 'd']  # a list of labels to use as the index
series = pd.Series(data, index=index)  # labels passed through the index parameter
print(series)
The Series will have custom labels a, b, c, and d.

Creating a Series from a Dictionary:
data = {'a': 10, 'b': 20, 'c': 30}  # a dictionary with key-value pairs
series = pd.Series(data)  # dictionary keys become the index labels
print(series)

Accessing Elements in a Series:
# Access by position (use .iloc; series[0] on a labeled Series is deprecated in recent pandas)
print(series.iloc[0])

# Access by label
print(series['a'])
Attributes and Methods of Series:
Attributes:
• series.index - Returns the index of the Series.
• series.values - Returns the values as a NumPy array.
• series.dtype - Shows the data type of the elements.

Methods:
• series.head(n) - Returns the first n elements.
• series.tail(n) - Returns the last n elements.
• series.sum() - Returns the sum of the Series.
• series.mean() - Calculates the mean of values in the Series.
Example
# Example: Summary statistics
print("Sum:", series.sum())
print("Mean:", series.mean())
print("First 2 elements:", series.head(2))
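The attributes listed earlier can be inspected in the same way. A minimal sketch, assuming the same dictionary-based Series used in the creation examples:

```python
import pandas as pd

# Same illustrative data as the dictionary example
series = pd.Series({'a': 10, 'b': 20, 'c': 30})

print(series.index)    # the index labels
print(series.values)   # the underlying NumPy array
print(series.dtype)    # the data type of the elements
print(series.tail(2))  # the last 2 elements
```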
• Exercise 1: Series for Daily Temperatures
• Objective:
– Create a Series to represent daily temperatures for a week.
– Use custom indices (labels) to name each day of the week.
– Calculate and print the average temperature.
import pandas as pd

# List of daily temperatures
temperatures = [23, 25, 22, 26, 24, 28, 27]
# Custom indices for each day of the week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
        'Saturday', 'Sunday']

# Create the Series
temp_series = pd.Series(temperatures, index=days)
# Display the Series
print("Daily Temperatures:\n", temp_series)

# Calculate the average temperature
avg_temp = temp_series.mean()
print("\nAverage Temperature for the Week:", avg_temp)

Output
Daily Temperatures:
Monday       23
Tuesday      25
Wednesday    22
Thursday     26
Friday       24
Saturday     28
Sunday       27
dtype: int64

Average Temperature for the Week: 25.0
• Exercise 2: Series for Stock Prices
• Objective:
– Create a Series using a dictionary where keys represent company names, and values represent their stock prices.
– Access a specific stock price using the company's name as the label.
import pandas as pd

# Dictionary representing stock prices of various companies
stock_prices = {
    'Apple': 150,
    'Microsoft': 280,
    'Google': 2700,
    'Amazon': 3300,
    'Facebook': 340
}

# Create the Series
stock_series = pd.Series(stock_prices)

# Display the Series
print("Stock Prices:\n", stock_series)
# Access the stock price for Google
google_price = stock_series['Google']
print("\nStock Price of Google:", google_price)

Output
Stock Prices:
Apple         150
Microsoft     280
Google       2700
Amazon       3300
Facebook      340
dtype: int64

Stock Price of Google: 2700
Example 1: Monitoring Temperatures During Concrete Curing
Objective: Record the daily temperatures during the concrete curing period and analyze the data.
Solution:
• Record the temperature (in °C) for each day of the week.
• Use a Pandas Series with the day names as the index.
• Calculate the average temperature and identify the days with the highest and lowest temperatures.
import pandas as pd

# Daily temperatures recorded during the concrete curing period (in °C)
temperatures = [22, 23, 25, 21, 24, 26, 23]
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
        'Saturday', 'Sunday']

# Create a Series for temperatures
temperature_series = pd.Series(temperatures, index=days)

# Calculate the average temperature
avg_temperature = temperature_series.mean()

# Identify the days with the highest and lowest temperatures
max_temp_day = temperature_series.idxmax()
min_temp_day = temperature_series.idxmin()

print("Daily Temperatures (°C):\n", temperature_series)
print("\nAverage Temperature (°C):", avg_temperature)
print("Highest Temperature on:", max_temp_day)
print("Lowest Temperature on:", min_temp_day)

Output
Daily Temperatures (°C):
Monday       22
Tuesday      23
Wednesday    25
Thursday     21
Friday       24
Saturday     26
Sunday       23
dtype: int64

Average Temperature (°C): 23.428571428571427
Highest Temperature on: Saturday
Lowest Temperature on: Thursday
Example 2: Monitoring Construction Site Inventory
• In this example, you can track the quantity of various materials (e.g., cement, sand, gravel, steel) available at a construction site. This is useful for managing inventory and knowing when to reorder materials.

Objective:
• Create a Series representing quantities of different construction materials.
• Check the quantity of a specific material.
• Calculate the total inventory.
import pandas as pd

# Material quantities at the construction site
materials = {
    'Cement Bags': 100,
    'Sand (cubic meters)': 50,
    'Gravel (cubic meters)': 30,
    'Steel (tons)': 20
}

# Create a Series for materials
inventory_series = pd.Series(materials)

# Access quantity of a specific material
cement_quantity = inventory_series['Cement Bags']

# Sum the quantities across all materials
# (the units differ, so the total is only a rough overall count)
total_inventory = inventory_series.sum()

print("Construction Site Inventory:\n", inventory_series)
print("\nQuantity of Cement Bags:", cement_quantity)
print("Total Inventory Quantity:", total_inventory)

Output
Construction Site Inventory:
Cement Bags              100
Sand (cubic meters)       50
Gravel (cubic meters)     30
Steel (tons)              20
dtype: int64

Quantity of Cement Bags: 100
Total Inventory Quantity: 200
Example 3: Road Survey Traffic Volume
• In road design and traffic analysis, civil engineers often monitor
traffic flow on specific road segments. Here’s an example of
using a Series to represent the number of vehicles passing a
road segment over several days.

Objective:
• Create a Series to store traffic volume for each day of the week.
• Calculate the average traffic volume.
• Identify the day with peak traffic.
import pandas as pd

# Traffic volume (vehicles) recorded each day on a road segment
traffic_data = [1200, 1350, 1400, 1300, 1250, 1600, 1500]
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
        'Saturday', 'Sunday']

# Create a Series for traffic volume
traffic_series = pd.Series(traffic_data, index=days)

# Calculate average traffic volume
avg_traffic = traffic_series.mean()

# Identify peak traffic day
peak_day = traffic_series.idxmax()

print("Daily Traffic Volume:\n", traffic_series)
print("\nAverage Traffic Volume:", avg_traffic)
print("Peak Traffic Day:", peak_day)

Output
Daily Traffic Volume:
Monday       1200
Tuesday      1350
Wednesday    1400
Thursday     1300
Friday       1250
Saturday     1600
Sunday       1500
dtype: int64

Average Traffic Volume: 1371.4285714285713
Peak Traffic Day: Saturday
Example 4: Structural Steel Beam Loads
• This example can help in analyzing the loads on different steel
beams in a structure. Each beam has a different load capacity,
and it’s essential to keep track of these values.

Objective:
• Create a Series for different beams and their load capacities (in
kN).
• Find the maximum load capacity among the beams.
• Calculate the average load capacity.
import pandas as pd

# Load capacity (kN) for different beams in a structure
beam_loads = {
    'Beam A': 45,
    'Beam B': 50,
    'Beam C': 55,
    'Beam D': 60,
    'Beam E': 52
}

# Create a Series for beam loads
beam_series = pd.Series(beam_loads)

# Find maximum load capacity
max_load = beam_series.max()
# Calculate average load capacity
avg_load = beam_series.mean()

print("Beam Load Capacities (kN):\n", beam_series)
print("\nMaximum Load Capacity (kN):", max_load)
print("Average Load Capacity (kN):", avg_load)

Output
Beam Load Capacities (kN):
Beam A    45
Beam B    50
Beam C    55
Beam D    60
Beam E    52
dtype: int64

Maximum Load Capacity (kN): 60
Average Load Capacity (kN): 52.4
3. DataFrame
What is a DataFrame?
• A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous data structure.

• A DataFrame is similar to a table in SQL or an Excel spreadsheet with rows and columns.

• Each column can contain data of a different type (e.g., integers, floats, strings).

• A DataFrame has two axes: rows and columns, each with its own labels.
• Creating a DataFrame from a Dictionary of Lists:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

• Creating a DataFrame from a List of Dictionaries:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)

• Creating a DataFrame from a NumPy Array:
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
Attributes and Methods of DataFrame
Attributes:
• df.columns - Returns column labels of the DataFrame.
• df.index - Returns row labels of the DataFrame.
• df.dtypes - Shows the data types of each column.

Basic Methods:
• df.head(n) - Returns the first n rows.
• df.tail(n) - Returns the last n rows.
• df.info() - Displays a summary of the DataFrame, including
column names, non-null counts, and data types.
• df.describe() - Provides summary statistics for numerical
columns.
# Example: DataFrame Info and Summary Statistics
df.info()  # prints the summary itself and returns None, so no print() is needed
print(df.describe())
Data Selection and Manipulation in DataFrames
# df is the Name/Age/City DataFrame created earlier
Selecting Columns:
print(df['Name'])            # Select a single column
print(df[['Name', 'City']])  # Select multiple columns

Selecting Rows by Index:
print(df.loc[0])   # Select the row whose index label is 0
print(df.iloc[1])  # Select the second row by position
Adding and Removing Columns:
# Adding a new column
df['Salary'] = [50000, 60000, 70000]

# Removing a column
df.drop(columns=['City'], inplace=True)

Filtering Data:
# Filter rows where Age > 30
print(df[df['Age'] > 30])
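Conditions can also be combined. A minimal sketch, reusing the Name/Age data and the salary figures from the slides: each condition needs its own parentheses, and the operators & and | take the place of Python's `and` and `or`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],
})

# Rows where Age is at least 30 AND Salary is below 70000
mask = (df['Age'] >= 30) & (df['Salary'] < 70000)
print(df[mask])

# Same filter through .loc, keeping only the Name column
names = df.loc[mask, 'Name']
print(names.tolist())
```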
Example 1: Analyzing Concrete Strength Test Results

Objective: Record the compressive strength of concrete samples tested over different days and analyze the data.

Solution:
• Record compressive strength (in MPa) of concrete samples tested after curing for 7, 14, and 28 days.
• Use a Pandas DataFrame to store the data.
• Calculate the mean compressive strength for each curing period.
import pandas as pd

# Data: Compressive strength values (MPa) for 3 samples tested on different days
data = {
    'Sample ID': ['S1', 'S2', 'S3'],
    '7 Days': [18.5, 19.0, 18.0],
    '14 Days': [24.0, 25.5, 24.5],
    '28 Days': [32.0, 31.5, 33.0]
}

# Create DataFrame
df = pd.DataFrame(data)
# Set 'Sample ID' as the index
df.set_index('Sample ID', inplace=True)

# Calculate mean compressive strength for each testing period
mean_strength = df.mean()

print("Concrete Compressive Strength Data:\n", df)
print("\nAverage Compressive Strength (MPa) for each curing period:\n", mean_strength)
Expected Output:
Concrete Compressive Strength Data:
           7 Days  14 Days  28 Days
Sample ID
S1           18.5     24.0     32.0
S2           19.0     25.5     31.5
S3           18.0     24.5     33.0

Average Compressive Strength (MPa) for each curing period:
7 Days     18.500000
14 Days    24.666667
28 Days    32.166667
dtype: float64
Example 2: Structural Analysis of a Building Load

Objective: Calculate the total load on each floor of a building based on floor area and load per square meter.

Solution:
• Create a DataFrame where each row represents a floor with its area (in square meters) and load per square meter (in kN/m²).
• Calculate the total load on each floor and add it as a new column.
import pandas as pd

# Data: Floor area and load per square meter for each floor
data = {
    'Floor': ['Ground', 'First', 'Second', 'Third', 'Fourth'],
    'Area (m²)': [500, 400, 350, 300, 250],
    'Load per m² (kN/m²)': [2.5, 2.2, 2.4, 2.3, 2.1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Calculate total load for each floor
df['Total Load (kN)'] = df['Area (m²)'] * df['Load per m² (kN/m²)']

print("Building Load Analysis:\n", df)
Expected Output:
Building Load Analysis:
    Floor  Area (m²)  Load per m² (kN/m²)  Total Load (kN)
0  Ground        500                  2.5           1250.0
1   First        400                  2.2            880.0
2  Second        350                  2.4            840.0
3   Third        300                  2.3            690.0
4  Fourth        250                  2.1            525.0
4. Summary and Best Practices
Key Points:
• A Series is a one-dimensional labeled array, suitable for representing a single column of data.

• A DataFrame is a two-dimensional, tabular data structure, useful for representing structured datasets.

Best Practices:

• Use appropriate data types to save memory.

• Leverage indexing in Series and DataFrames for efficient data selection.

• Clean and preprocess data before performing analyses.
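
The first best practice can be illustrated with a small sketch. The column name `floor_count` and its values are hypothetical, chosen only to show how a narrower integer type reduces memory use. Downcasting is safe here only because every value fits in the int8 range.

```python
import pandas as pd

# Hypothetical column of small integers (stored with the default integer type)
df = pd.DataFrame({'floor_count': [1, 2, 3, 4, 5] * 1000})
before = df['floor_count'].memory_usage(index=False)

# Downcast to int8, which is safe here because all values fit in -128..127
df['floor_count'] = df['floor_count'].astype('int8')
after = df['floor_count'].memory_usage(index=False)

print(before, after)  # bytes used before and after the conversion
```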


The Essential Basic Functionality of Pandas
1. Reindexing and Altering Labels
• Reindexing conforms a DataFrame or Series to a new index: labels that already exist keep their data, and labels that are new receive missing values (NaN) unless a fill value is given. Reindexing is critical when reshaping or merging data from different sources, because it aligns data on a common index before any operation. It is particularly useful for time series data and for datasets with mismatched indices.

• Altering labels means renaming rows or columns to clearer, more descriptive names. It improves readability and makes data easier to work with, especially in multi-step analyses where column names need to be standardized.
Example
# Reindexing (new_index is a list of the desired row labels)
df = df.reindex(new_index)

# Renaming columns
df = df.rename(columns={'old_name': 'new_name'})
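A runnable sketch of both operations follows. The day labels, readings, and the column name `temperature_c` are assumptions made for illustration, not data from the slides.

```python
import pandas as pd

# Hypothetical readings indexed by day
s = pd.Series([23, 25, 22], index=['Mon', 'Tue', 'Wed'])

# 'Thu' is not in the original index, so it becomes NaN after reindexing
reindexed = s.reindex(['Mon', 'Tue', 'Wed', 'Thu'])
print(reindexed)

# fill_value supplies a default for labels that were missing
filled = s.reindex(['Mon', 'Tue', 'Wed', 'Thu'], fill_value=0)
print(filled)

# rename() accepts mappings from old to new names for columns and rows
df = pd.DataFrame({'temp': [23, 25]}, index=['Mon', 'Tue'])
df = df.rename(columns={'temp': 'temperature_c'}, index={'Mon': 'Monday'})
print(df)
```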
2. Head and Tail

• The head() and tail() methods display the first and last few rows of a DataFrame (five by default), giving a quick snapshot of the data. This is essential for initial data exploration, where users can confirm that data has loaded correctly and observe general characteristics (e.g., data types, column names, any visible patterns).

Example
# Display the first 5 rows
df.head(5)

# Display the last 5 rows
df.tail(5)
3. Binary Operations

• Binary operations perform element-wise arithmetic between two DataFrames or Series using operators such as +, -, *, and /. Pandas aligns the operands on their index (and column) labels before computing, so a label present in only one operand produces NaN in the result.

• Binary operations are fundamental in scenarios where multiple data sources need to be combined or compared. For example, growth rates, differences, or ratios between two datasets can be calculated by aligning the indices and applying binary operations.
Example
# Element-wise addition
result = df1 + df2
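The alignment behaviour can be seen in a small sketch. The material names and quantities are hypothetical. The `add()` method with `fill_value` is shown as one way to avoid NaN for labels that appear in only one operand.

```python
import pandas as pd

# Hypothetical quantities from two sources with partly different labels
s1 = pd.Series([100, 50], index=['cement', 'sand'])
s2 = pd.Series([20, 30], index=['sand', 'gravel'])

# The + operator aligns on labels; labels found in only one Series give NaN
summed = s1 + s2
print(summed)

# add() with fill_value treats a missing label as 0 instead
total = s1.add(s2, fill_value=0)
print(total)
```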
4. Functional Statistics
• Functional statistics are built-in statistical functions that summarize data, such as mean, median, sum, standard deviation, minimum, maximum, and count, plus describe(), which reports several of these at once.

• This functionality is foundational for exploring data distribution and central tendencies, giving analysts a quick overview of the data's quantitative properties.

Example
# Get summary statistics
df.describe()

# Calculate mean
df['column_name'].mean()
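A short sketch of these functions on real numbers, reusing the 7-day and 28-day strength values from the concrete example. The `axis` argument chooses between column-wise and row-wise results.

```python
import pandas as pd

# Strength values (MPa) from the concrete example earlier in the slides
df = pd.DataFrame(
    {'7 Days': [18.5, 19.0, 18.0], '28 Days': [32.0, 31.5, 33.0]},
    index=['S1', 'S2', 'S3'],
)

col_means = df.mean()          # one value per column (default axis=0)
row_means = df.mean(axis=1)    # one value per sample
median_7 = df['7 Days'].median()
std_7 = df['7 Days'].std()     # sample standard deviation
extremes = df.agg(['min', 'max'])  # several statistics at once

print(col_means, row_means, median_7, std_7, extremes, sep="\n")
```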
5. Function Application
• Function application lets you apply custom or built-in functions to rows or columns using apply(), or element-wise across a DataFrame with map() (named applymap() before pandas 2.1).

• This feature is powerful for data transformation, as it enables custom manipulations and complex operations that go beyond built-in functions. Examples include creating new calculated fields, cleaning data, or performing any custom analysis needed on each element or column.

Example
# Apply function to a column
df['column_name'] = df['column_name'].apply(lambda x: x * 2)
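The difference between applying a function per value, per row, and per column can be sketched as follows. The column names `length_m` and `width_m` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical slab dimensions in meters
df = pd.DataFrame({'length_m': [2.0, 3.5, 4.0], 'width_m': [1.0, 2.0, 2.5]})

# apply() on a single column runs the function on each value
df['length_cm'] = df['length_m'].apply(lambda x: x * 100)

# apply() with axis=1 passes each row to the function
df['area_m2'] = df.apply(lambda row: row['length_m'] * row['width_m'], axis=1)

# apply() with the default axis=0 passes each column; here, the range per column
ranges = df[['length_m', 'width_m']].apply(lambda col: col.max() - col.min())

print(df)
print(ranges)
```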
6. Sorting
• Sorting organizes data based on specified criteria, either by labels or by values.

• Use sort_index() to sort by row or column labels, and sort_values() to sort by the values within a column.

• Sorting is useful for ranking data, creating ordered lists, or identifying top or bottom records, which are essential in exploratory data analysis and reporting.

Example
# Sort by column values
df = df.sort_values(by='column_name')
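Both methods are sketched below on a Series. The beam capacities come from the beam example earlier in the slides, entered out of order so the effect of sorting is visible.

```python
import pandas as pd

# Beam capacities (kN), deliberately unordered
beams = pd.Series({'Beam C': 55, 'Beam A': 45, 'Beam D': 60, 'Beam B': 50})

by_label = beams.sort_index()                  # alphabetical by label
by_value = beams.sort_values(ascending=False)  # largest capacity first

print(by_label)
print(by_value.head(2))  # the two beams with the highest capacity
```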
7. Indexing and Selecting Data
• Indexing and selection facilitate subsetting specific rows, columns, or elements.

• Selecting specific rows and columns is fundamental in data manipulation, done using loc (label-based) and iloc (position-based) indexing.

• This functionality is crucial for working with specific parts of a large dataset without processing the rest.

• Efficient indexing and selection allow focused data analysis on relevant parts of data and can improve performance by reducing memory usage and computation.

Example
# Select rows by label and specific columns (label slices include the end label)
subset = df.loc[0:2, ['column1', 'column2']]

# Select the first three rows and first two columns by position
subset = df.iloc[0:3, 0:2]
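To make the loc/iloc distinction concrete, here is a minimal sketch that uses the 7-day and 28-day strength values from the concrete example, with the sample IDs as row labels.

```python
import pandas as pd

# Concrete strength data (MPa), indexed by sample ID
df = pd.DataFrame(
    {'7 Days': [18.5, 19.0, 18.0], '28 Days': [32.0, 31.5, 33.0]},
    index=['S1', 'S2', 'S3'],
)

by_label = df.loc['S2', '28 Days']  # label-based: row 'S2', column '28 Days'
by_position = df.iloc[1, 1]         # position-based: second row, second column
label_slice = df.loc['S1':'S2']     # label slices include the end label
position_slice = df.iloc[0:2]       # position slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```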
8. Computational Tools

• Computational tools allow efficient data manipulation using vectorized operations and NumPy integration, which speeds up data analysis.

• This includes operations on DataFrames and Series, such as applying arithmetic or aggregation across rows or columns.

• Vectorized operations are significantly faster than iterating through data due to their use of low-level optimizations.

Example
# Vectorized operation
df['new_column'] = df['column1'] * df['column2']
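The sketch below contrasts a vectorized expression with an explicit loop that produces the same numbers. The column names `area` and `load` and their values are illustrative, in the style of the building load example. The loop is included only for comparison.

```python
import numpy as np
import pandas as pd

# Illustrative floor areas (m²) and loads (kN/m²)
df = pd.DataFrame({'area': [500.0, 400.0, 350.0], 'load': [2.5, 2.0, 2.4]})

# Vectorized: one expression over whole columns
df['total'] = df['area'] * df['load']

# Equivalent explicit loop, shown for comparison; slower on large data
looped = [row.area * row.load for row in df.itertuples()]

# NumPy functions also apply element-wise to a column
df['sqrt_area'] = np.sqrt(df['area'])

print(df)
print(np.allclose(df['total'], looped))
```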
9. Working with Missing Data
• Handling missing data includes methods like isnull(), fillna(), and dropna() to identify, replace, or drop missing values.

• Missing data handling is essential to prevent errors in analysis, especially in statistical or machine learning applications where missing values can affect model accuracy.

• Methods for dealing with NaNs include imputation (filling missing values with substitutes like the mean or median) or dropping them, depending on the needs of the analysis.

Example
# Replace missing values with 0
df.fillna(0, inplace=True)

# Alternatively, drop rows with missing values
df.dropna(inplace=True)
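A runnable sketch of identifying, imputing, and dropping missing values follows. The `strength` column and its readings are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical strength readings with two missing values
df = pd.DataFrame({'strength': [18.5, np.nan, 19.5, np.nan]})

missing_count = df['strength'].isnull().sum()  # number of missing values
print(missing_count)

# Imputation: fill gaps with the column mean (mean() skips NaN by default)
imputed = df['strength'].fillna(df['strength'].mean())
print(imputed)

# Alternative: drop incomplete rows; without inplace a new object is returned
dropped = df.dropna()
print(len(dropped))
```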
10. Advanced Uses of Pandas for Data Analysis
• Hierarchical Indexing: A multi-level index (MultiIndex) lets you work with data hierarchically. This is especially useful for organizing and analyzing complex, high-dimensional datasets, because it lets analysts group data in a hierarchical structure.

• It is commonly used in time-series data where multiple variables are recorded at different times and hierarchically indexed.
Example
# Setting hierarchical index
df.set_index(['level1', 'level2'], inplace=True)
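A minimal sketch of working with a two-level index follows. The `site` and `day` levels and the traffic counts are hypothetical.

```python
import pandas as pd

# Hypothetical traffic counts for two sites on two days
df = pd.DataFrame({
    'site': ['A', 'A', 'B', 'B'],
    'day': ['Mon', 'Tue', 'Mon', 'Tue'],
    'traffic': [1200, 1350, 900, 950],
})
df = df.set_index(['site', 'day'])

site_a = df.loc['A']                         # all rows for site A
b_tue = df.loc[('B', 'Tue'), 'traffic']      # one value, selected by both levels
site_means = df.groupby(level='site')['traffic'].mean()  # aggregate per site

print(site_a)
print(b_tue)
print(site_means)
```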

• Panel Data: The Panel class was deprecated and has since been removed from Pandas. Multi-dimensional data can still be handled using DataFrames with hierarchical indexing, or with the xarray library for more complex data.
Q&A
