10 20241104 Data-Analysis Pandas
Time-series functionality.
This will display a Series with default integer indexing starting from 0.
The Pandas Data Structures - Series and DataFrame
2. Series
Creating a Series with Custom Index:
import pandas as pd
data = [10, 20, 30, 40]  # the values to store
index = ['a', 'b', 'c', 'd']  # the labels to use as the index
series = pd.Series(data, index=index)  # labels are passed through the 'index' parameter
print(series)
The Series will have custom labels a, b, c, and d.
# Access by label
print(series['a'])
Attributes and Methods of Series:
Attributes:
• series.index - Returns the index of the Series.
• series.values - Returns the values as a NumPy array.
• series.dtype - Shows the data type of the elements.
Methods:
• series.head(n) - Returns the first n elements.
• series.tail(n) - Returns the last n elements.
• series.sum() - Returns the sum of the Series.
• series.mean() - Calculates the mean of values in the Series.
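A short sketch of the attributes listed above, assuming the same values and labels as the custom-index Series. The variable name is reused only for illustration.

```python
import pandas as pd

# Same data and labels as the custom-index example
series = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print(series.index)    # the labels a, b, c, d
print(series.values)   # the underlying NumPy array
print(series.dtype)    # an integer dtype (int64 on most platforms)
print(series.tail(2))  # the last two elements, labelled c and d
```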
Example
# Example: Summary statistics
print("Sum:", series.sum())
print("Mean:", series.mean())
print("First 2 elements:", series.head(2))
• Exercise 1: Series for Daily Temperatures
• Objective:
– Create a Series to represent daily temperatures for a week.
– Use custom indices (labels) to name each day of the week.
– Calculate and print the average temperature.
import pandas as pd
# List of daily temperatures
temperatures = [23, 25, 22, 26, 24, 28, 27]
# Custom indices for each day of the week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
        'Saturday', 'Sunday']
# Create the Series
temp_series = pd.Series(temperatures, index=days)
# Display the Series
print("Daily Temperatures:\n", temp_series)
Output
Daily Temperatures:
Monday       23
Tuesday      25
Wednesday    22
Thursday     26
Friday       24
Saturday     28
Sunday       27
dtype: int64
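The third objective of the exercise, the average temperature, is not computed in the listing. One possible way to finish it is sketched here; the name average_temp is an assumption and does not come from the slides.

```python
import pandas as pd

temperatures = [23, 25, 22, 26, 24, 28, 27]
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
        'Saturday', 'Sunday']
temp_series = pd.Series(temperatures, index=days)

# Average over the week: the values sum to 175, and 175 / 7 = 25.0
average_temp = temp_series.mean()
print("Average Temperature:", average_temp)
```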
Example: Monitoring Construction Site Inventory
Objective:
• Create a Series representing quantities of different construction materials.
• Check the quantity of a specific material.
• Calculate the total inventory.
import pandas as pd
# Material quantities at the construction site
materials = {
    'Cement Bags': 100,
    'Sand (cubic meters)': 50,
    'Gravel (cubic meters)': 30,
    'Steel (tons)': 20
}
# Create a Series for materials
inventory_series = pd.Series(materials)
# Display the inventory, the quantity of one material, and the total
print("Construction Site Inventory:\n", inventory_series)
print("Quantity of Cement Bags:", inventory_series['Cement Bags'])
print("Total Inventory Quantity:", inventory_series.sum())
Output
Construction Site Inventory:
Cement Bags              100
Sand (cubic meters)       50
Gravel (cubic meters)     30
Steel (tons)              20
dtype: int64
Quantity of Cement Bags: 100
Total Inventory Quantity: 200
Example: Road Survey Traffic Volume
Objective:
• Create a Series to store traffic volume for each day of the week.
• Calculate the average traffic volume.
• Identify the day with peak traffic.
import pandas as pd
# Traffic volume (vehicles) recorded each day on a road segment
traffic_data = [1200, 1350, 1400, 1300, 1250, 1600, 1500]
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
        'Saturday', 'Sunday']
# Create a Series for traffic volume
traffic_series = pd.Series(traffic_data, index=days)
# Calculate average traffic volume
avg_traffic = traffic_series.mean()
# Identify peak traffic day
peak_day = traffic_series.idxmax()
# Display the results
print("Daily Traffic Volume:\n", traffic_series)
print("Average Traffic Volume:", avg_traffic)
print("Peak Traffic Day:", peak_day)
Output
Daily Traffic Volume:
Monday       1200
Tuesday      1350
Wednesday    1400
Thursday     1300
Friday       1250
Saturday     1600
Sunday       1500
dtype: int64
Average Traffic Volume: 1371.4285714285713
Peak Traffic Day: Saturday
Example 4: Structural Steel Beam Loads
Objective:
• Create a Series for different beams and their load capacities (in kN).
• Find the maximum load capacity among the beams.
• Calculate the average load capacity.
import pandas as pd
# Load capacity (kN) for different beams in a structure
beam_loads = {
    'Beam A': 45,
    'Beam B': 50,
    'Beam C': 55,
    'Beam D': 60,
    'Beam E': 52
}
# Create a Series for beam loads
beam_series = pd.Series(beam_loads)
# Display the capacities, the maximum, and the average
print("Beam Load Capacities (kN):\n", beam_series)
print("Maximum Load Capacity (kN):", beam_series.max())
print("Average Load Capacity (kN):", beam_series.mean())
Output
Beam Load Capacities (kN):
Beam A    45
Beam B    50
Beam C    55
Beam D    60
Beam E    52
dtype: int64
Maximum Load Capacity (kN): 60
Average Load Capacity (kN): 52.4
3. DataFrame
What is a DataFrame?
A DataFrame has two axes, rows and columns, each with its own labels.
Creating a DataFrame from a Dictionary of Lists:
# Creating a DataFrame from a Dictionary of Lists
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Creating a DataFrame from a List of Dictionaries:
# Creating a DataFrame from a List of Dictionaries
data = [
{'Name': 'Alice', 'Age': 25, 'City': 'New York'},
{'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
{'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)
Creating a DataFrame from a NumPy Array:
# Creating a DataFrame from a NumPy Array
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(df)
Attributes and Methods of DataFrame
Attributes:
• df.columns - Returns column labels of the DataFrame.
• df.index - Returns row labels of the DataFrame.
• df.dtypes - Shows the data types of each column.
Basic Methods:
• df.head(n) - Returns the first n rows.
• df.tail(n) - Returns the last n rows.
• df.info() - Displays a summary of the DataFrame, including
column names, non-null counts, and data types.
• df.describe() - Provides summary statistics for numerical
columns.
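The attributes and methods above can be tried on the small Name/Age/City table from the creation examples. This is a minimal sketch. The comments describe what to expect rather than exact printed output, which varies slightly between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(df.columns.tolist())  # ['Name', 'Age', 'City']
print(df.index)             # default integer row labels 0, 1, 2
print(df.dtypes)            # Age is an integer column; Name and City hold strings
print(df.head(2))           # first two rows
df.info()                   # prints its summary directly and returns None
print(df.describe())        # statistics for the numeric column 'Age' only
```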
# Removing a column
df.drop(columns=['City'], inplace=True)
Filtering Data:
# Filter rows where Age > 30
print(df[df['Age'] > 30])
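A self-contained sketch of column removal and filtering, assuming the Name/Age/City table from the earlier examples. The names without_city, over_30, and selected are illustrative. Conditions are combined with & (and) or | (or), and each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Without inplace=True, drop() returns a new DataFrame and leaves df unchanged
without_city = df.drop(columns=['City'])

# A single condition: rows where Age > 30 (only Charlie)
over_30 = df[df['Age'] > 30]

# Two conditions combined with &: Age at least 30 and City not Chicago (only Bob)
selected = df[(df['Age'] >= 30) & (df['City'] != 'Chicago')]

print(without_city)
print(over_30)
print(selected)
```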
Example 1: Analyzing Concrete Strength Test Results
Solution:
• Record compressive strength (in MPa) of concrete samples
tested after curing for 7, 14, and 28 days.
• Use a Pandas DataFrame to store the data.
• Calculate the mean compressive strength for each curing period.
import pandas as pd
# Data: Compressive strength values (MPa) for 3 samples tested on different days
data = {
'Sample ID': ['S1', 'S2', 'S3'],
'7 Days': [18.5, 19.0, 18.0],
'14 Days': [24.0, 25.5, 24.5],
'28 Days': [32.0, 31.5, 33.0]
}
# Create DataFrame
df = pd.DataFrame(data)
# Set 'Sample ID' as the index
df.set_index('Sample ID', inplace=True)
# Calculate mean compressive strength for each testing period
mean_strength = df.mean()
print("Concrete Compressive Strength Data:\n", df)
print("\nAverage Compressive Strength (MPa) for each curing period:\n", mean_strength)
Expected Output:
Concrete Compressive Strength Data:
            7 Days  14 Days  28 Days
Sample ID
S1            18.5     24.0     32.0
S2            19.0     25.5     31.5
S3            18.0     24.5     33.0
Example 2: Structural Analysis of a Building Load
Solution:
• Create a DataFrame where each row represents a floor with its area (in square meters) and load per square meter (in kN/m²).
• Calculate the total load on each floor and add it as a new column.
import pandas as pd
# Data: Floor area and load per square meter for each floor
data = {
    'Floor': ['Ground', 'First', 'Second', 'Third', 'Fourth'],
    'Area (m²)': [500, 400, 350, 300, 250],
    'Load per m² (kN/m²)': [2.5, 2.2, 2.4, 2.3, 2.1]
}
# Create DataFrame
df = pd.DataFrame(data)
# Total load on each floor = area × load per square meter
df['Total Load (kN)'] = df['Area (m²)'] * df['Load per m² (kN/m²)']
print("Building Load Analysis:\n", df)
Expected Output:
Building Load Analysis:
     Floor  Area (m²)  Load per m² (kN/m²)  Total Load (kN)
0   Ground        500                  2.5           1250.0
1    First        400                  2.2            880.0
2   Second        350                  2.4            840.0
3    Third        300                  2.3            690.0
4   Fourth        250                  2.1            525.0
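The Total Load (kN) column in the expected output is the element-wise product of floor area and load per square meter. A minimal sketch of that calculation follows. The building total and the most heavily loaded floor are additions for illustration, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Floor': ['Ground', 'First', 'Second', 'Third', 'Fourth'],
    'Area (m²)': [500, 400, 350, 300, 250],
    'Load per m² (kN/m²)': [2.5, 2.2, 2.4, 2.3, 2.1],
})

# Element-wise product of two columns, stored as a new column
df['Total Load (kN)'] = df['Area (m²)'] * df['Load per m² (kN/m²)']

# Illustrative extras: total load over all floors and the heaviest floor
building_total = df['Total Load (kN)'].sum()
heaviest_floor = df.set_index('Floor')['Total Load (kN)'].idxmax()

print(df)
print("Building total (kN):", round(building_total, 1))
print("Most heavily loaded floor:", heaviest_floor)
```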
4. Summary and Best Practices
Key Points:
• A Series is a one-dimensional labeled array, suitable for representing a
single column of data.
Best Practices:
# Renaming columns
df = df.rename(columns={'old_name': 'new_name'})
The Essential Basic Functionality of Pandas
2. Head and Tail
These methods (head() and tail()) display the first and last few rows of a
DataFrame, giving a quick snapshot of data. This is essential for initial data
exploration, where users can confirm if data has loaded correctly and observe
general characteristics (e.g., data types, column names, any visible patterns).
Example
# Display the first 5 rows
df.head(5)
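A small sketch of head() and tail() on a hypothetical ten-row DataFrame. The column name reading is made up for illustration. Both methods default to n = 5 when called without an argument.

```python
import pandas as pd

# Hypothetical ten-row DataFrame used only for illustration
df = pd.DataFrame({'reading': range(1, 11)})

first_rows = df.head()   # n defaults to 5, so rows with readings 1 to 5
last_rows = df.tail(3)   # last three rows, readings 8 to 10

print(first_rows)
print(last_rows)
```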
Descriptive statistics methods such as mean(), sum(), min(), max(), count(),
and describe() summarize the data and give insight into its distribution and
central tendency.
Example
# Get summary statistics
df.describe()
# Calculate mean
df['column_name'].mean()
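A runnable version of the calls above, assuming a hypothetical column load_kN that holds the beam capacities from the Series example. describe() returns a Series, so individual statistics can be read from it by label.

```python
import pandas as pd

# Hypothetical column of beam load capacities in kN
df = pd.DataFrame({'load_kN': [45, 50, 55, 60, 52]})

summary = df['load_kN'].describe()  # count, mean, std, min, quartiles, max
print(summary)

print("Sum:", df['load_kN'].sum())    # 262
print("Mean:", df['load_kN'].mean())  # 52.4
print("Min:", df['load_kN'].min())    # 45
print("Max:", df['load_kN'].max())    # 60
```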
5. Function Application
Function application, for example with apply(), is powerful for data
transformation because it enables custom manipulations that go beyond the
built-in functions. Typical uses include creating new calculated fields,
cleaning data, or running a custom analysis on each element or column.
Example
# Apply function to a column
df['column_name'] = df['column_name'].apply(lambda x: x * 2)
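A sketch of two common forms of apply(), under assumed column names (area, load) and a made-up classification rule. On a Series the function receives one element at a time. On a DataFrame it receives one whole column at a time by default.

```python
import pandas as pd

# Hypothetical floor data; the 2.4 threshold below is made up for illustration
df = pd.DataFrame({'area': [500, 400], 'load': [2.5, 2.2]})

def classify(load):
    """Label a single load value as 'heavy' or 'light'."""
    return 'heavy' if load >= 2.4 else 'light'

# Series.apply: the function is called once per element
df['class'] = df['load'].apply(classify)

# DataFrame.apply: the function is called once per column (axis=0 is the default)
col_range = df[['area', 'load']].apply(lambda col: col.max() - col.min())

print(df)
print(col_range)
```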
6. Sorting
Sorting organizes data by labels or by values. Use sort_index() to sort by row
or column labels, and sort_values() to sort by the values in one or more
columns.
Example
# Sort by column values
df = df.sort_values(by='column_name')
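Both sorting methods side by side, on a hypothetical beam table with a deliberately shuffled index. Each call returns a new, sorted DataFrame. The ascending parameter controls the direction.

```python
import pandas as pd

# Hypothetical data with an out-of-order integer index
df = pd.DataFrame({'beam': ['B', 'A', 'C'], 'load': [50, 45, 55]},
                  index=[2, 0, 1])

# Sort by the values of one column, largest first: C, B, A
by_value = df.sort_values(by='load', ascending=False)

# Sort by row labels 0, 1, 2: A, C, B
by_label = df.sort_index()

print(by_value)
print(by_label)
```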
7. Indexing and Selecting Data
Indexing and selection subset specific rows, columns, or elements.
This is crucial for working with specific parts of a large dataset rather than
the whole table.
Selecting only the relevant rows and columns keeps the analysis focused and
can reduce memory usage and computation in later steps.
Example
# Select a column by label
df['column_name']
# Select the first three rows by position
df.iloc[0:3]
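A minimal sketch of the main selection tools: column selection with [], label-based selection with loc, position-based selection with iloc, and boolean selection. The table reuses the Age/City data from the DataFrame examples, with names as row labels (an assumption for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']},
    index=['Alice', 'Bob', 'Charlie'],
)

ages = df['Age']                    # one column, returned as a Series
bob_city = df.loc['Bob', 'City']    # by row and column label: 'Los Angeles'
first_two = df.iloc[0:2]            # by position: the rows for Alice and Bob
cities_over_28 = df.loc[df['Age'] > 28, 'City']  # boolean mask: Bob and Charlie

print(ages)
print(bob_city)
print(first_two)
print(cities_over_28)
```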
8. Computational Tools
Vectorized operations act on whole columns at once. They are usually much
faster than iterating over rows in Python because the work runs in optimized,
compiled array routines.
Example
# Vectorized operation
df['new_column'] = df['column1'] * df['column2']
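To make the comparison concrete, this sketch computes the same product twice: once vectorized and once with an explicit Python loop. The data values are assumptions. Both give the same numbers, but the vectorized form is shorter and scales better on large columns.

```python
import pandas as pd

# Hypothetical data: areas and loads per unit area
df = pd.DataFrame({'column1': [500, 400, 350], 'column2': [2.5, 2.2, 2.4]})

# Vectorized: one expression multiplies the two columns element by element
df['new_column'] = df['column1'] * df['column2']

# Equivalent explicit loop, shown only for comparison
looped = [a * b for a, b in zip(df['column1'], df['column2'])]

print(df)
print(looped)
```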
9. Working with Missing Data
Handling missing data relies on methods like isnull(), fillna(), and dropna() to
identify, replace, or drop missing values.
Depending on the needs of the analysis, NaNs can be imputed (filled with a
substitute such as the mean or median) or dropped.
Example
# Replace missing values with 0
df.fillna(0, inplace=True)
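The three steps (identify, replace, drop) on a hypothetical table of strength readings with gaps. The column names are assumptions. Filling with the column mean is one imputation choice among several, not a recommendation from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical strength readings (MPa) with one missing value per column
df = pd.DataFrame({
    'strength_7d': [18.5, np.nan, 18.0],
    'strength_28d': [32.0, 31.5, np.nan],
})

# Identify: count the missing values in each column
missing_per_column = df.isnull().sum()

# Replace: fill each gap with the mean of its own column
filled = df.fillna(df.mean())

# Drop: keep only the rows that have no missing values
dropped = df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```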
Hierarchical indexing (a MultiIndex) is commonly used for time-series data in
which several variables are recorded at different times, so that each row is
identified by more than one index level.
Example
# Setting hierarchical index
df.set_index(['level1', 'level2'], inplace=True)
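A sketch of building and querying a two-level index. The level names level1 and level2 follow the example above. The data (sample ID and curing day) is hypothetical.

```python
import pandas as pd

# Hypothetical readings: level1 is the sample ID, level2 is the curing day
df = pd.DataFrame({
    'level1': ['S1', 'S1', 'S2', 'S2'],
    'level2': [7, 28, 7, 28],
    'strength': [18.5, 32.0, 19.0, 31.5],
})
df = df.set_index(['level1', 'level2'])

s1_rows = df.loc['S1']                    # all rows for S1, indexed by level2
value = df.loc[('S2', 28), 'strength']    # one cell addressed by both levels
day7 = df.xs(7, level='level2')           # cross-section: day 7 for every sample

print(s1_rows)
print(value)
print(day7)
```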
Multi-dimensional data can still be handled with DataFrames that use hierarchical
indexing, or with the xarray library for more complex data.
Q&A