
Pandas

 Pandas is a powerful open-source data manipulation and analysis library for the Python
programming language.
 It provides data structures and functions designed to work with structured data, making it
easier for data scientists, analysts, and anyone working with data to perform complex data
operations efficiently.
 Pandas is an essential tool for anyone working with data in Python. Its ability to handle and
manipulate structured data with ease, along with its integration with other data science
libraries, makes it a cornerstone of data analysis in Python. Whether you are cleaning data,
performing analysis, or preparing data for machine learning, Pandas provides the
functionality to streamline your workflow.

Key Features of Pandas

1. Data Structures:

o Series: A one-dimensional labeled array capable of holding any data type (integers,
strings, floating point numbers, Python objects, etc.). It can be compared to a
column in a spreadsheet or a single column in a DataFrame.

o DataFrame: A two-dimensional labeled data structure with columns of potentially
different types. It can be thought of as a table in a database or a spreadsheet. Each
column in a DataFrame can be considered a Series.

2. Data Manipulation:

o Indexing and Selection: Pandas provides powerful indexing and selection capabilities,
allowing users to retrieve and manipulate data based on row and column labels or
integer-based indexing.

o Grouping and Aggregation: Users can group data based on certain criteria and
perform aggregate operations (e.g., sum, mean) on the grouped data.

3. Data Cleaning:

o Pandas offers various functions for handling missing data, including detecting, filling,
or dropping missing values.

o It also allows users to filter, sort, and manipulate data frames easily.

4. Data Input/Output:

o Pandas can read and write data to various file formats, including CSV, Excel, JSON,
SQL databases, and more, making it versatile for data import and export tasks.

5. Time Series Analysis:

o The library has built-in support for handling time series data, which is useful for
financial analysis, stock market data, or any data indexed by time.

6. Performance:

o Pandas is built on top of NumPy, which means it leverages NumPy’s performance
advantages for numerical operations. It is optimized for performance and efficiency
in handling large datasets.

Use Cases of Pandas

 Data Analysis: Analysts use Pandas to perform exploratory data analysis (EDA), which
involves summarizing the main characteristics of a dataset, often using visual methods.

 Data Cleaning and Preparation: Before analysis, data often needs to be cleaned and
transformed. Pandas provides tools to handle missing data, remove duplicates, and convert
data types.

 Statistical Analysis: Researchers can use Pandas for statistical modeling and analysis, such as
calculating correlations or running regressions.

 Data Visualization: While Pandas itself does not provide extensive visualization capabilities, it
integrates well with libraries like Matplotlib and Seaborn, allowing users to create various
plots and charts.

Creating a Series
 A Series is a one-dimensional labeled array capable of holding any data type.
 It is similar to a list or a dictionary in Python but has additional features that make it more
powerful for data analysis.
 Each element in a Series has an associated index, which allows for easy access and
manipulation of the data. Series are one of the two primary data structures in Pandas, the
other being DataFrames.

Creating a Series from Lists

One of the most common ways to create a Series is from a list. This is particularly useful for simple
datasets where you want to convert a Python list into a Series.

Example:
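The original example did not survive conversion; a minimal sketch (the variable names and values are illustrative):

```python
import pandas as pd

# A Series from a plain list gets a default integer index (0, 1, 2, ...)
fruits = pd.Series(['apple', 'banana', 'cherry'])
print(fruits)

# An explicit index can be supplied instead
scores = pd.Series([85, 90, 95], index=['math', 'science', 'english'])
print(scores['math'])  # label-based access
```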

Creating a Series from NumPy Arrays

Pandas Series can also be created from NumPy arrays. This is particularly advantageous when
working with numerical data since NumPy is optimized for numerical computations.

Example:
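A short sketch to replace the missing example (the array values are arbitrary):

```python
import numpy as np
import pandas as pd

arr = np.array([1.5, 2.5, 3.5])
s_from_array = pd.Series(arr)
print(s_from_array)

# The Series inherits the array's dtype
print(s_from_array.dtype)
```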
Creating a Series from Dictionaries

Another powerful feature of Series is that you can create them from dictionaries. In this case, the
keys become the index and the values become the data.

Example:
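A minimal sketch standing in for the lost example (the city data is made up for illustration):

```python
import pandas as pd

# Dictionary keys become the index; values become the data
population = {'New York': 8.4, 'Los Angeles': 3.9, 'Chicago': 2.7}
s_pop = pd.Series(population)
print(s_pop)
```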

This flexibility allows for more meaningful indexing when the data is represented as a dictionary,
which can often be clearer than using default integer indexing.

Storing Data in Series from Intrinsic Sources

Pandas also supports creating Series from intrinsic data sources. For instance, you can use built-in
functions or generate random data.

Example:
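A sketch of what the missing code likely showed: one Series built from a range and one from random numbers (the seed is an assumption for reproducibility):

```python
import numpy as np
import pandas as pd

# Series from a range of numbers
s_range = pd.Series(range(5))

# Series of random floating-point numbers (seeded so reruns match)
rng = np.random.default_rng(0)
s_random = pd.Series(rng.random(5))

print(s_range)
print(s_random)
```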

In the above code, we create a Series from a range of numbers and another Series filled with random
floating-point numbers. This feature is useful for initializing data for testing or simulations.

Accessing Data in a Series

You can access elements in a Series using the index. This is similar to how you would access elements
in a Python list or dictionary.

Example:
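A minimal sketch in place of the missing example, showing label-based and position-based access:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])         # access by label
print(s.iloc[0])      # access by integer position
print(s[['a', 'c']])  # select several labels at once
```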
Modifying a Series

You can modify elements in a Series by assigning new values to existing indices, or add new
elements by assigning to new index labels.

Example:
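A short sketch to replace the lost example (note that assigning to a new label is the idiomatic way to add an element; the old Series.append method has been removed from recent pandas):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Overwrite an existing value
s['a'] = 15

# Assigning to a new label adds an element
s['d'] = 40

print(s)
```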

Creating DataFrames
A DataFrame is a two-dimensional labeled data structure with columns that can hold different data
types (such as integers, floats, strings, etc.). It is similar to a spreadsheet or SQL table and is the
primary data structure used for data manipulation and analysis in Pandas.

Creating a DataFrame from Lists

One of the most straightforward ways to create a DataFrame is by passing a list of lists (each inner list
represents a row) to the pd.DataFrame() constructor.

Example:

import pandas as pd

data = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]

df_from_list = pd.DataFrame(data, columns=['Name', 'Age'])

print("DataFrame from List:\n", df_from_list)

In this example, we create a DataFrame from a list of lists and specify the column names using the
columns parameter.

Creating a DataFrame from Dictionaries

You can also create a DataFrame from a dictionary. In this case, each key-value pair in the dictionary
represents a column, with the key as the column name and the value as the column data.

Example:

data_dict = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df_from_dict = pd.DataFrame(data_dict)
print("DataFrame from Dictionary:\n", df_from_dict)

This method is convenient for constructing DataFrames with predefined data, making it easy to
represent structured data.

Creating a DataFrame from NumPy Arrays

You can also create a DataFrame from a NumPy array, which is particularly useful when dealing with
numerical data.

Example:

import numpy as np

data_array = np.array([['Alice', 25], ['Bob', 30], ['Charlie', 35]])

df_from_array = pd.DataFrame(data_array, columns=['Name', 'Age'])

print("DataFrame from Array:\n", df_from_array)

The flexibility of Pandas allows for easy integration with NumPy, making it an efficient tool for data
analysis. Note that a NumPy array holds a single dtype, so in this mixed example the 'Age' values are
stored as strings; convert them with df_from_array['Age'].astype(int) if numeric operations are needed.

Accessing Data in DataFrames

You can access individual columns or rows in a DataFrame using the column name or index.

Example:

# Accessing a column

print("Name Column:\n", df_from_dict['Name'])

# Accessing a row by index

print("First Row:\n", df_from_dict.iloc[0])

This feature allows for intuitive data manipulation, enabling you to quickly retrieve the information
you need.

Modifying DataFrames

DataFrames allow for easy modifications, including adding new columns, renaming existing columns,
and changing values.

Example:

# Adding a new column

df_from_dict['Salary'] = [70000, 80000, 90000]

print("DataFrame after Adding Salary Column:\n", df_from_dict)

# Renaming a column

df_from_dict.rename(columns={'Age': 'Years'}, inplace=True)

print("DataFrame after Renaming Column:\n", df_from_dict)


# Modifying values

df_from_dict.at[0, 'Salary'] = 75000

print("DataFrame after Modifying Salary:\n", df_from_dict)

Imputation
Imputation is the process of replacing missing values in a dataset with substituted values. Missing
data can lead to inaccurate results and analysis; hence, handling missing data is a crucial part of data
preprocessing.

Identifying Missing Values

Before you can impute missing values, you first need to identify them. Pandas provides functions to
check for missing values in your DataFrame.

Example:

df = pd.DataFrame({

'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [25, None, 35],

'City': ['New York', None, 'Chicago']

})

# Checking for missing values

print("Missing Values:\n", df.isnull())

The isnull() function returns a DataFrame of the same shape as the original, with True indicating
missing values.

Filling Missing Values

You can use the fillna() method to replace missing values with a specific value. This method is useful
when you have a reasonable estimate of what the missing value should be.

Example:

# Filling missing values with a specific value

df_filled = df.fillna({'Age': 30, 'City': 'Unknown'})

print("DataFrame after Imputation:\n", df_filled)

In this example, we fill missing values in the 'Age' and 'City' columns with specific values.

Forward Fill and Backward Fill

Pandas can also propagate the previous or next valid value forward or backward using the ffill() and
bfill() methods (the method parameter of fillna(), which served the same purpose, is deprecated in
recent versions of pandas).

Example:

# Forward filling missing values

df_ffill = df.ffill()

print("Forward Filled DataFrame:\n", df_ffill)

# Backward filling missing values

df_bfill = df.bfill()

print("Backward Filled DataFrame:\n", df_bfill)

Forward fill (`ffill`) propagates the last valid observation forward to the next valid value. Backward fill
(`bfill`) does the opposite.

Imputation Using Statistical Methods

Imputation can also be done using statistical methods such as mean, median, or mode. This is
particularly useful for numerical data.

Example:

# Imputing missing values with the mean

df['Age'] = df['Age'].fillna(df['Age'].mean())

print("DataFrame after Mean Imputation:\n", df)

Here, we replace missing values in the 'Age' column with the mean of the column. Assigning the
result back is preferred over inplace=True, which can trigger chained-assignment warnings in recent
pandas versions.

Grouping and Aggregation


Introduction to Grouping and Aggregation

Grouping is a powerful feature in Pandas that allows you to split data into groups based on some
criteria and perform operations on those groups, such as aggregation. This is particularly useful for
summarizing data.

Creating Sample Data for Grouping

Before diving into grouping and aggregation, let’s create a sample DataFrame to work with.

Example:

df_group = pd.DataFrame({

'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],

'Age': [25, 30, 35, 28, 32],

'Score': [85, 90, 95, 88, 92]


})

print("Sample DataFrame for Grouping:\n", df_group)

Grouping Data

You can use the groupby() function to group data by one or more columns. This function splits the
data into groups based on the unique values of the specified column(s).

Example:

# Grouping by Name

grouped = df_group.groupby('Name')

print("Grouped Data:\n", grouped.mean())

In this case, we group the data by 'Name' and calculate the mean for each group. The result is a new
DataFrame with the average 'Age' and 'Score' for each name.

Aggregating Data

Aggregation refers to applying a function to each group, such as sum, mean, min, max, etc. You can
use the agg() function to apply multiple aggregation functions simultaneously.

Example:

# Aggregating data

agg_data = df_group.groupby('Name').agg({

'Age': ['mean', 'max'],

'Score': ['sum', 'mean']

})

print("Aggregated Data:\n", agg_data)

Here, we calculate the mean and maximum age, as well as the sum and mean score for each name.

Filtering Grouped Data

You can also filter groups based on certain conditions after grouping. This is useful when you want to
analyze only those groups that meet specific criteria.

Example:

# Filtering groups with mean Score greater than 90

filtered_groups = df_group.groupby('Name').filter(lambda x: x['Score'].mean() > 90)

print("Filtered Groups:\n", filtered_groups)

In this example, only the groups whose mean score is greater than 90 are kept.
Merging, Joining, and Concatenation
Introduction to Merging, Joining, and Concatenation

Merging, joining, and concatenating are essential operations in data manipulation that allow you to
combine multiple DataFrames into a single one. These operations are vital for integrating data from
different sources and performing analyses.

Merging DataFrames
Merging involves combining two DataFrames based on common columns or indices. The merge()
function is used to accomplish this.

Example:

df1 = pd.DataFrame({

'EmployeeID': [1, 2, 3],

'Name': ['Alice', 'Bob', 'Charlie']

})

df2 = pd.DataFrame({

'EmployeeID': [1, 2, 4],

'Salary': [70000, 80000, 90000]

})

# Merging DataFrames on EmployeeID

merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')

print("Merged DataFrame:\n", merged_df)

In this example, we merge two DataFrames on the 'EmployeeID' column using an inner join, which
includes only the rows with matching values in both DataFrames.

Different Types of Joins

The how parameter in the merge() function allows you to specify the type of join:

1. Inner Join: Returns only the rows with matching values in both DataFrames (default).

2. Outer Join: Returns all rows from both DataFrames, filling in NaN for missing matches.

3. Left Join: Returns all rows from the left DataFrame and matched rows from the right
DataFrame.

4. Right Join: Returns all rows from the right DataFrame and matched rows from the left
DataFrame.
Example:

# Outer join

outer_joined_df = pd.merge(df1, df2, on='EmployeeID', how='outer')

print("Outer Joined DataFrame:\n", outer_joined_df)

# Left join

left_joined_df = pd.merge(df1, df2, on='EmployeeID', how='left')

print("Left Joined DataFrame:\n", left_joined_df)

# Right join

right_joined_df = pd.merge(df1, df2, on='EmployeeID', how='right')

print("Right Joined DataFrame:\n", right_joined_df)

Concatenating DataFrames
Concatenation involves stacking DataFrames on top of each other or side by side. The concat()
function is used for this purpose.

Example:

# Creating two DataFrames for concatenation

df3 = pd.DataFrame({

'EmployeeID': [5, 6],

'Name': ['David', 'Eva']

})

# Concatenating DataFrames vertically

concat_df = pd.concat([df1, df3], axis=0, ignore_index=True)

print("Concatenated DataFrame (Vertical):\n", concat_df)

# Concatenating DataFrames horizontally

concat_df_horizontal = pd.concat([df1, df2], axis=1)

print("Concatenated DataFrame (Horizontal):\n", concat_df_horizontal)

In this example, we concatenate two DataFrames vertically and horizontally.


Joining DataFrames
Joining is another method to combine DataFrames, similar to merging but generally used when the
joining keys are the index. The join() method is used for this purpose.

Example:

df4 = pd.DataFrame({
    'Salary': [70000, 80000, 90000]},
    index=[1, 2, 3])

# Joining DataFrames

joined_df = df1.join(df4)

print("Joined DataFrame:\n", joined_df)

This method is convenient for joining DataFrames based on their indices. Note that df1 uses the
default index 0-2 while df4 is indexed 1-3, so the row for 'Alice' ends up with a NaN salary.

Finding and Checking for Null Values


Introduction to Null Values

In data analysis, missing or null values can significantly impact the quality and reliability of your
results. Identifying and handling null values is crucial for maintaining data integrity and ensuring
accurate analyses.

Checking for Null Values


Pandas provides several functions to check for null values within a DataFrame. The isnull() method
returns a DataFrame of the same shape as the original, indicating where values are null.

Example:

# Checking for null values

null_check = df.isnull()

print("Null Values Check:\n", null_check)

Counting Null Values


You can also count the total number of null values in each column using the isnull() method
combined with sum().

Example:

# Counting null values in each column

null_count = df.isnull().sum()
print("Count of Null Values:\n", null_count)

This is useful for quickly assessing the extent of missing data in your dataset.

Finding Rows with Null Values


To find specific rows that contain null values, you can use boolean indexing.

Example:

# Finding rows with any null values

rows_with_nulls = df[df.isnull().any(axis=1)]

print("Rows with Null Values:\n", rows_with_nulls)

This method filters the DataFrame to include only those rows with at least one null value.

Filling Null Values


Once you identify null values, you can decide how to handle them. The fillna() method allows you to
fill missing values with a specified value or method.

Example:

# Filling null values with a specified value

df_filled_nulls = df.fillna({'Age': 30, 'City': 'Unknown'})

print("DataFrame after Filling Nulls:\n", df_filled_nulls)

Dropping Null Values


In some cases, you may choose to drop rows or columns that contain null values. The `dropna()`
method allows you to remove these entries.

Example:

# Dropping rows with any null values

df_dropped = df.dropna()

print("DataFrame after Dropping Nulls:\n", df_dropped)

This method is useful for cleaning up your dataset by removing incomplete entries.

Reading Data from CSV, TXT, and Excel Files


Introduction to Reading Data

Pandas provides powerful functions to read data from various file formats, including CSV, TXT, and
Excel files. This functionality makes it easy to import external datasets for analysis.

Reading CSV Files


CSV (Comma-Separated Values) is one of the most common file formats for storing tabular data. The
read_csv() function is used to read CSV files into a DataFrame.

Example:

# Reading a CSV file

df_csv = pd.read_csv('data.csv')

print("DataFrame from CSV:\n", df_csv)

You can specify additional parameters such as sep for the delimiter and header for the row
containing column names.

Reading TXT Files


Text files can also be read into Pandas. The read_csv() function can be used with different delimiters,
making it versatile for reading TXT files as well.

Example:

# Reading a TXT file with tab delimiters

df_txt = pd.read_csv('data.txt', sep='\t')

print("DataFrame from TXT:\n", df_txt)

Reading Excel Files


Pandas provides the read_excel() function to read Excel files. This function requires the openpyxl or
xlrd library, depending on the Excel file format.

Example:

# Reading an Excel file

df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')

print("DataFrame from Excel:\n", df_excel)

You can specify the sheet_name parameter to read a specific sheet from the Excel file.

Handling Missing Values While Reading


When reading data from files, you can also handle missing values directly. The na_values parameter
allows you to specify additional strings to recognize as NA/NaN.

Example:

# Reading a CSV file and treating 'NA' as a missing value

df_csv_with_na = pd.read_csv('data.csv', na_values=['NA'])

print("DataFrame from CSV with NA Handling:\n", df_csv_with_na)
