
Python 2.1.2

The document provides an overview of data manipulation using the Pandas library in Python, detailing its main data structures, Series and DataFrames. It covers data indexing and selection, operations on data, handling missing data, hierarchical indexing, and methods for combining datasets using concat() and append(). Key concepts include creating Series and DataFrames, performing arithmetic operations, detecting and filling missing values, and utilizing multi-level indexing.

Uploaded by

hritikp266
Copyright © All Rights Reserved

2.

Data Manipulation with Pandas: Introducing Pandas Objects, Data Indexing and Selection,
Operating on Data in Pandas, Handling Missing Data, Hierarchical Indexing, Combining Datasets:
Concat and Append.

1. Introducing Pandas Objects

Pandas is a powerful and widely used library in Python for data manipulation and analysis. It
provides two main data structures:

1. Series: A one-dimensional labeled array, similar to a list, that can hold data of any
type (integers, strings, floats, etc.).
2. DataFrame: A two-dimensional labeled data structure, similar to a table in a
database, an Excel spreadsheet, or a dictionary of Series objects. It has both rows and
columns with labels.

Creating a Pandas Series

A Series can be created from a list, a NumPy array, or a dictionary. Here's an example of
creating a Series from a Python list:

import pandas as pd

# Create a Series from a list


data = [10, 20, 30, 40, 50]
series = pd.Series(data)

print(series)

Output:

0 10
1 20
2 30
3 40
4 50
dtype: int64

The index is automatically assigned as integers starting from 0.
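
The other two sources mentioned above, a dictionary and a NumPy array, can be sketched as follows. The variable names and values (prices, scores) are made up for illustration and do not come from the original notes.

```python
import numpy as np
import pandas as pd

# Series from a dictionary: the keys become the index labels
prices = pd.Series({'apple': 1.5, 'banana': 0.5, 'cherry': 3.0})

# Series from a NumPy array, with an explicit index instead of 0, 1, 2
scores = pd.Series(np.array([10, 20, 30]), index=['a', 'b', 'c'])

print(prices['banana'])        # look up a value by its label
print(scores.index.tolist())   # ['a', 'b', 'c']
```

Passing index= is optional; without it the array-based Series would get the default integer index described above.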

Creating a DataFrame

A DataFrame can be created from a dictionary, lists, or NumPy arrays. Here's an example of
creating a DataFrame from a dictionary:

import pandas as pd

# Create a DataFrame from a dictionary


data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City
0    Alice   24     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

The DataFrame has both row labels (index) and column labels (column names).
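
As a rough sketch of the other construction routes mentioned above, a DataFrame can also be built from a list of dictionaries (one per row) or from a two-dimensional NumPy array. The names df_rows and df_array and the column labels 'x' and 'y' are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary is one row
rows = [{'Name': 'Alice', 'Age': 24}, {'Name': 'Bob', 'Age': 30}]
df_rows = pd.DataFrame(rows)

# From a 2-D NumPy array: column labels are supplied explicitly
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])

print(df_rows.shape)               # (2, 2)
print(df_array.columns.tolist())   # ['x', 'y']
```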

2. Data Indexing and Selection

Pandas provides multiple ways to select and index data from Series and DataFrames.

Selecting Data from a DataFrame

• Selecting a single column: You can access a column by using the column name.

# Select a single column


print(df['Name'])

Output:

0 Alice
1 Bob
2 Charlie
Name: Name, dtype: object

• Selecting multiple columns: Use a list of column names.

# Select multiple columns


print(df[['Name', 'Age']])

Output:

Name Age
0 Alice 24
1 Bob 30
2 Charlie 35

Selecting Rows by Index

You can select rows using .loc[] and .iloc[]:

• iloc[] is used for integer-location based indexing (by position).
• loc[] is used for label-based indexing.

# Selecting by position (integer-based)


print(df.iloc[1]) # Select the second row (index 1)
# Selecting by label (index-based)
print(df.loc[1]) # Select the row with index label 1

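Output (both statements print the same row, because the default index labels 0, 1, 2 coincide with the positions):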

Name Bob
Age 30
City Los Angeles
Name: 1, dtype: object
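
The difference between the two indexers only becomes visible when the index labels are not 0, 1, 2. The sketch below assumes a hypothetical DataFrame, df_labeled, whose index is 10, 20, 30.

```python
import pandas as pd

df_labeled = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 30, 35]},
    index=[10, 20, 30],   # custom labels, chosen only for illustration
)

by_position = df_labeled.iloc[1]   # second row, regardless of its label
by_label = df_labeled.loc[10]      # row whose label is 10 (the first row)

print(by_position['Name'])   # Bob
print(by_label['Name'])      # Alice
```

Here df_labeled.loc[1] would raise a KeyError, because no row carries the label 1.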

3. Operating on Data in Pandas

Once you have selected data, Pandas allows you to perform a variety of operations.

Arithmetic Operations

Pandas supports arithmetic operations like addition, subtraction, multiplication, and division.
These operations can be performed element-wise on Series or DataFrames.

# Create a DataFrame
data = {'A': [10, 20, 30], 'B': [5, 15, 25]}
df = pd.DataFrame(data)

# Add 10 to each element


df = df + 10
print(df)

Output:

A B
0 20 15
1 30 25
2 40 35
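
Element-wise arithmetic also works between two columns or two Series, not only with a constant. A minimal sketch, using illustrative names (df_ops, s1, s2) that are not part of the original notes:

```python
import pandas as pd

df_ops = pd.DataFrame({'A': [10, 20, 30], 'B': [5, 15, 25]})

# Column-by-column arithmetic is element-wise, row by row
df_ops['Sum'] = df_ops['A'] + df_ops['B']
df_ops['Ratio'] = df_ops['A'] / df_ops['B']
print(df_ops['Sum'].tolist())   # [15, 35, 55]

# Two Series are aligned on their index labels before the operation;
# labels present in only one of them produce NaN
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([10, 20], index=['b', 'c'])
total = s1 + s2
print(total)
```

Only the shared label 'b' gets a numeric result (12.0); 'a' and 'c' become NaN.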

Applying Functions

You can apply functions with .apply(): on a Series it works element-wise, and on a DataFrame
it works column-wise by default.

# Apply a function to each element of column 'A'


df['A'] = df['A'].apply(lambda x: x * 2)
print(df)

Output:

A B
0 40 15
1 60 25
2 80 35

In this example, the function lambda x: x * 2 was applied to the 'A' column.
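
For comparison, calling .apply() on a whole DataFrame hands the function one column at a time, or one row at a time with axis=1. The sketch below uses a hypothetical frame, df_apply, holding the same values as the output above.

```python
import pandas as pd

df_apply = pd.DataFrame({'A': [40, 60, 80], 'B': [15, 25, 35]})

# Column-wise (default): each column arrives as a Series
col_range = df_apply.apply(lambda col: col.max() - col.min())
print(col_range['A'], col_range['B'])   # 40 20

# Row-wise: axis=1 passes each row as a Series instead
row_sum = df_apply.apply(lambda row: row['A'] + row['B'], axis=1)
print(row_sum.tolist())   # [55, 85, 115]
```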

4. Handling Missing Data

Missing data is common in real-world datasets. Pandas provides powerful tools for detecting,
removing, or replacing missing data.

Detecting Missing Data

Use isnull() to detect missing values and notnull() for the opposite.

import numpy as np

# Create a DataFrame with missing data (NaN)


data = {'Name': ['Alice', 'Bob', np.nan], 'Age': [24, np.nan, 35]}
df = pd.DataFrame(data)

# Check for missing data


print(df.isnull())

Output:

Name Age
0 False False
1 False True
2 True False
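
The Boolean table is often more useful when summarized. As a small sketch (the names df_missing, missing_per_column, and complete_rows are illustrative), isnull() can be combined with sum() to count gaps, and notnull() can be used to keep only complete rows:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({'Name': ['Alice', 'Bob', np.nan],
                           'Age': [24, np.nan, 35]})

# True counts as 1, so summing the mask gives missing values per column
missing_per_column = df_missing.isnull().sum()
print(missing_per_column)

# Keep only the rows in which every value is present
complete_rows = df_missing[df_missing.notnull().all(axis=1)]
print(complete_rows)
```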

Filling Missing Data

You can fill missing values using .fillna().

# Fill missing data with a default value


df_filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(df_filled)

Output:

Name Age
0 Alice 24.0
1 Bob 29.5
2 Unknown 35.0

Here, missing values in the Name column are filled with 'Unknown', and missing values in the
Age column are filled with the mean of the Age column.

Dropping Missing Data

You can drop rows or columns that contain missing data using .dropna().

# Drop rows with missing data


df_dropped = df.dropna()
print(df_dropped)

Output:

Name Age
0 Alice 24.0

Only row 0 remains, because rows 1 and 2 each contain a missing value.
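
dropna() can also drop columns, or look at only some columns when deciding which rows to drop. The sketch below assumes a slightly extended example frame with a fully populated 'City' column, which is not part of the original data.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan],
    'Age': [24, np.nan, 35],
    'City': ['New York', 'Chicago', 'Boston'],   # illustrative extra column
})

# axis=1 drops every column that contains a missing value
only_complete_columns = df_missing.dropna(axis=1)
print(only_complete_columns.columns.tolist())   # ['City']

# subset= restricts the check to the listed columns
known_age = df_missing.dropna(subset=['Age'])
print(known_age.index.tolist())   # [0, 2]
```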

5. Hierarchical Indexing

Hierarchical indexing allows you to have multiple levels of indexing, which can be helpful
when working with more complex data structures.

Creating a Hierarchical Index

You can create a multi-level index by passing a list of arrays to pd.MultiIndex.from_arrays().

# Create a DataFrame with multi-level index


arrays = [['A', 'A', 'B', 'B'], [1, 2, 1, 2]]
index = pd.MultiIndex.from_arrays(arrays, names=('Letter', 'Number'))

df = pd.DataFrame({'Data': [10, 20, 30, 40]}, index=index)


print(df)

Output:

               Data
Letter Number
A      1         10
       2         20
B      1         30
       2         40
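
from_arrays() is not the only constructor. As a hedged sketch, the same index can probably be written more compactly with pd.MultiIndex.from_product() (every combination of the given levels) or spelled out with pd.MultiIndex.from_tuples(); the variable names are illustrative.

```python
import pandas as pd

# Every combination of the two level lists: (A,1), (A,2), (B,1), (B,2)
index_product = pd.MultiIndex.from_product(
    [['A', 'B'], [1, 2]], names=['Letter', 'Number'])

# The same index, listed explicitly as tuples
index_tuples = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['Letter', 'Number'])

print(index_product.equals(index_tuples))   # True
```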

Selecting Data with Multi-level Index

You can use .loc[] to access data in a multi-level index DataFrame.

# Select data for 'A' with Number 2


print(df.loc[('A', 2)])

Output:

Data 20
Name: (A, 2), dtype: int64
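
Selection does not have to name every level. A minimal sketch of partial selection, rebuilding the same DataFrame under the illustrative name df_multi:

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], [1, 2, 1, 2]], names=('Letter', 'Number'))
df_multi = pd.DataFrame({'Data': [10, 20, 30, 40]}, index=index)

# Outer level only: all rows under 'A', indexed by Number
all_a = df_multi.loc['A']
print(all_a['Data'].tolist())        # [10, 20]

# Inner level only: xs() takes a cross-section at Number == 1
number_one = df_multi.xs(1, level='Number')
print(number_one['Data'].tolist())   # [10, 30]

# Full key plus a column name returns a single value
single = df_multi.loc[('B', 2), 'Data']
print(single)                        # 40
```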

6. Combining Datasets: Concat and Append

Pandas provides the concat() function and, in versions before 2.0, the append() method to
combine data from different DataFrames.

Using concat() to Combine DataFrames

The concat() function can concatenate DataFrames along rows or columns.

# Concatenate DataFrames along rows


df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

df_combined = pd.concat([df1, df2], ignore_index=True)


print(df_combined)

Output:

A B
0 1 3
1 2 4
2 5 7
3 6 8
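
Concatenation along columns, mentioned above but not shown, uses axis=1. The sketch below also passes keys= to label each source, which yields a hierarchical index; the frames left and right are illustrative.

```python
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
right = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})

# axis=1 places the frames side by side, matching rows by index label
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.columns.tolist())   # ['A', 'B', 'C', 'D']

# keys= tags each input, producing a two-level row index
stacked = pd.concat([left, left], keys=['first', 'second'])
print(stacked.loc[('second', 1), 'B'])   # 4
```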

Using append() to Add Rows to DataFrame

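append() was an older DataFrame method for adding rows. It was deprecated in pandas 1.4 and
removed in pandas 2.0, so it works only in older versions; pd.concat() is the supported
replacement and is generally more efficient and flexible.

# Append rows to a DataFrame

df3 = pd.DataFrame({'A': [9, 10], 'B': [11, 12]})
# pandas < 2.0 only: df_appended = df1.append(df3, ignore_index=True)
df_appended = pd.concat([df1, df3], ignore_index=True)  # same result, works in all versions
print(df_appended)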

Output:

A B
0 1 3
1 2 4
2 9 11
3 10 12

Summary of Key Concepts:

1. Pandas Objects: Series and DataFrames are the primary data structures.
2. Data Indexing and Selection: Pandas allows easy indexing and selection of data
using labels and positions.
3. Operating on Data: Element-wise operations and functions can be applied to Series
and DataFrames.
4. Handling Missing Data: Missing data can be detected, filled, or dropped.
5. Hierarchical Indexing: Pandas supports multi-level indexes to handle complex data.
6. Combining Datasets: Pandas provides concat() and, before version 2.0, append() to combine
multiple DataFrames.

Questions:

1. What are the two main data structures in Pandas, and how do they differ?
2. How can you fill missing values in a Pandas DataFrame with a default value or a
calculated value (like the mean)?
3. What is hierarchical indexing in Pandas, and how is it useful?
4. How do you access data from a multi-level indexed DataFrame in Pandas?
5. What is the difference between the concat() and append() functions in Pandas?
6. How do you concatenate DataFrames along rows using concat() in Pandas?
7. Explain how to add rows to an existing DataFrame using the append() function in
Pandas.
