II Unit: Pandas
Introduction to Pandas
Pandas offers several key advantages:
1. Efficient Data Handling: Pandas provides efficient data structures and operations
for handling large datasets.
2. Flexible Data Manipulation: Pandas offers various data manipulation functions
for data cleaning, transformation, and analysis.
3. Easy Data Analysis: Pandas integrates well with other libraries, making data
analysis and visualization easier.
4. Community Support: Pandas has an active community, ensuring continuous
development and support.
Pandas is a powerful library for data manipulation and analysis. Its efficient data
structures and operations make it an ideal choice for working with structured data
in Python.
What is the library architecture of Pandas?
The library architecture in Pandas is designed to provide a robust and efficient data
analysis framework. Here's an overview of the key components:
1. Data Structures:
- Series (1-dimensional labeled array): Represents a single column of data.
- DataFrame (2-dimensional labeled data structure): Represents a table of data
with rows and columns.
- Panel (3-dimensional labeled data structure): Represented a collection of
DataFrames; now removed from current pandas versions (use a DataFrame with a
MultiIndex instead).
2. Indexing and Selecting Data:
- Index: A data structure that provides fast lookups and labeling for rows and
columns.
- Label-based selection: Select data using label-based indexing (e.g.,
df['column_name']).
- Position-based selection: Select data using position-based indexing (e.g.,
df.iloc[0]).
3. Data Operations:
- Filtering: Select data based on conditions (e.g., df[df['column_name'] > 0]).
- Sorting: Sort data by one or more columns (e.g.,
df.sort_values('column_name')).
- Grouping: Group data by one or more columns and perform aggregation (e.g.,
df.groupby('column_name').mean()).
- Merging: Combine data from multiple DataFrames (e.g., pd.merge(df1, df2,
on='column_name')).
4. Data Input/Output:
- Readers: Read data from various file formats (e.g., CSV, Excel, JSON).
- Writers: Write data to various file formats.
5. Data Manipulation:
- Reshaping: Pivot, melt, and reshape data (e.g., df.pivot_table()).
- Data cleaning: Handle missing data, duplicates, and data normalization.
6. Computational Tools:
- NumPy integration: Leverage NumPy's vectorized operations for efficient
computations.
- Cython optimization: Optimize performance-critical code using Cython.
7. Extensibility:
- Plugins: Extend Pandas functionality using plugins (e.g., Pandas-GBQ for
Google BigQuery).
These features make pandas a powerful and flexible library for data manipulation
and analysis in Python.
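The data operations listed above can be exercised together in a short sketch (the DataFrames and column names below are made up for illustration):

```python
import pandas as pd

# Two small, made-up DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3], 'sales': [250, 80, 130], 'region': ['E', 'W', 'E']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'rep': ['Ann', 'Bob', 'Cho']})

high = df1[df1['sales'] > 100]                     # filtering
ordered = df1.sort_values('sales')                 # sorting
by_region = df1.groupby('region')['sales'].mean()  # grouping + aggregation
merged = pd.merge(df1, df2, on='id')               # merging

print(high)
print(ordered)
print(by_region)
print(merged)
```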
Write the applications of Pandas.
Pandas has numerous applications in various fields, including:
- Data cleaning and preparation for analysis or machine learning
- Financial analysis (time series, stock and market data)
- Scientific research and statistics
- Business intelligence and reporting
- Web analytics and log processing
Pandas is a versatile library that can be applied to various domains and industries,
making it a valuable tool for anyone working with data.
Explain the data structures in Pandas.
Pandas historically provided three primary data structures: the Series
(1-dimensional), the DataFrame (2-dimensional), and the Panel (3-dimensional, now
removed). These structures share several characteristics:
- Label-based indexing: Access data using labels (e.g., column names, row
indices).
- Index-based: Data is stored in an index-based structure, enabling fast lookups and
slicing.
- Flexible data types: Support various data types, including numeric, string, and
datetime.
- Vectorized operations: Perform operations on entire columns or rows at once,
making computations efficient.
- Missing data handling: Support for missing data (NaN, None) and data cleaning
methods.
These data structures enable efficient data manipulation, analysis, and storage in
Pandas.
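The characteristics above can be seen in a tiny sketch (the values and labels are made up):

```python
import pandas as pd
import numpy as np

# A labeled Series with one missing value
s = pd.Series([10, 20, np.nan, 40], index=['a', 'b', 'c', 'd'])

print(s * 2)            # vectorized: multiplies every element at once
print(s['b'])           # label-based access
print(s.isna().sum())   # count of missing values
print(s.dropna())       # missing-data handling
```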
Explain the Series in pandas.
In pandas, a Series is a one-dimensional labeled array of values, similar to a
single column in a spreadsheet or SQL table. It's a fundamental data structure in
pandas, used to represent one column of data.
You can create a Series from a list, array, or other iterable using the pd.Series()
function:
import pandas as pd
data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)
0 1
1 2
2 3
3 4
4 5
dtype: int64
print(series[0]) # prints 1
print(series[1:3]) # prints the elements at positions 1 and 2 (values 2 and 3)
Series Operations:
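The operations themselves were elided here; a minimal sketch of common Series operations (element-wise arithmetic, boolean filtering, and aggregation), using the same series as above:

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

print(series + 10)          # element-wise arithmetic on every value
print(series[series > 2])   # boolean filtering keeps values 3, 4, 5
print(series.sum())         # aggregation: 15
print(series.mean())        # aggregation: 3.0
```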
Explain about Dataframe in Pandas.
A DataFrame in pandas is a two-dimensional labeled data structure with columns
of potentially different types. It's similar to a spreadsheet or a SQL table, and is the
most commonly used data structure in pandas.
Creating a DataFrame:
You can create a DataFrame from a dictionary, list of lists, or other data structures
using the pd.DataFrame() function:
import pandas as pd
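The creation example was cut off here; a minimal sketch (the column names and values are made up):

```python
import pandas as pd

# Create a DataFrame from a dictionary of columns
data = {'Name': ['John', 'Anna', 'Peter'],
        'Age': [28, 24, 35]}
df = pd.DataFrame(data)
print(df)
```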
You can access data in a DataFrame using the column name or row index:
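A short sketch of both access styles (assuming a DataFrame with made-up 'Name' and 'Age' columns):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

print(df['Name'])        # access a column by name
print(df.loc[0])         # access a row by index label
print(df.loc[0, 'Age'])  # access a single cell
```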
DataFrame Operations:
- Filtering data
- Sorting and indexing
- Grouping and aggregating data
- Merging and joining DataFrames
- Data manipulation (e.g., adding, removing columns)
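A compact sketch of a few of these operations, including adding and removing columns (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 35]})

df['Adult'] = df['Age'] >= 18   # add a derived column
older = df[df['Age'] > 25]      # filter rows by condition
df = df.drop(columns='Adult')   # remove the column again

print(older)
print(df.columns.tolist())
```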
Explain dropping entries from an axis in pandas.
In pandas, dropping entries from an axis refers to removing rows or columns from
a DataFrame based on specified labels. This is achieved using the drop()
function, which allows you to remove entries from either the index (rows) or
columns axis.
Dropping rows:
To drop rows, you can use the drop() function with the index parameter. You can
pass a single label or a list of labels to remove specific rows.
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])
print(df.drop('x'))  # removes the row labeled 'x'
Dropping columns:
To drop columns, you can use the drop() function with the columns parameter.
You can pass a single label or a list of labels to remove specific columns.
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
print(df.drop(columns='B'))  # removes column 'B'
Options: drop() accepts several useful parameters, such as axis (0 to drop rows,
1 to drop columns), inplace (modify the DataFrame in place instead of returning a
copy), and errors ('ignore' to skip labels that don't exist).
By using the drop() function, you can efficiently remove unwanted entries from
your DataFrame, making it easier to work with and analyze your data.
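The options above can be sketched as follows (the labels and values are made up):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['x', 'y'])

print(df.drop('x', axis=0))                 # drop a row (axis=0 is the default)
print(df.drop('B', axis=1))                 # drop a column via axis=1
print(df.drop('missing', errors='ignore'))  # skip labels that don't exist
df.drop('y', inplace=True)                  # modify df in place
print(df)
```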
Explain Index objects in pandas.
Index objects are a fundamental component of Pandas, used to label and identify
rows and columns in DataFrames and Series. They provide a way to access and
manipulate data efficiently.
What are the essential functionalities of Pandas?
1. Data Structures:
- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure with columns of potentially
different types)
2. Data Manipulation:
- Filtering, sorting, grouping, and merging data
3. Data Analysis:
- Descriptive statistics (mean, median, std, describe())
4. Data Input/Output:
- Reading data from various file formats (CSV, Excel, JSON, etc.)
- Writing data to various file formats (CSV, Excel, JSON, etc.)
5. Data Cleaning:
- Handling missing values, duplicates, and inconsistent entries
6. Data Transformation:
- Reshaping data with pivot, melt, and stack/unstack
7. Data Selection:
- Label-based (loc), position-based (iloc), and conditional selection
These essential functionalities make Pandas a powerful tool for data manipulation,
analysis, and visualization.
Explain data selection and filtering in pandas.
There are several ways to select data in a DataFrame:
1. Label-based selection: Use the loc attribute to select rows and columns by label.
- df.loc[row_labels, column_labels]
2. Integer-based selection: Use the iloc attribute to select rows and columns by
integer position.
- df.iloc[row_positions, column_positions]
3. Conditional selection: Use boolean indexing to select rows based on conditions.
- df[condition]
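The three selection styles side by side (the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({'age': [28, 24, 35], 'country': ['USA', 'UK', 'Canada']},
                  index=['a', 'b', 'c'])

print(df.loc['a', 'age'])   # label-based selection
print(df.iloc[1, 0])        # position-based selection
print(df[df['age'] > 25])   # conditional (boolean) selection
```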
Filtering:
Filtering selects the subset of rows that satisfy one or more conditions, usually
via boolean indexing. Some examples:
- Select rows where the value in the 'age' column is greater than 30: df[df['age'] >
30]
- Select rows where the value in the 'country' column is either 'USA' or 'Canada':
df[df['country'].isin(['USA', 'Canada'])]
- Select rows where the value in the 'name' column starts with 'J':
df[df['name'].str.startswith('J')]
These are just a few examples of the many ways to select and filter data in pandas.
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
'Age': [28, 24, 35, 32, 40],
'Country': ['USA', 'UK', 'USA', 'Canada', 'UK']}
df = pd.DataFrame(data)
# Selection
print("Original DataFrame:")
print(df)
# Filtering
print("\nFiltering rows where Age is greater than 30:")
print(df.query('Age > 30'))
Output:
Original DataFrame:
Name Age Country
0 John 28 USA
1 Anna 24 UK
2 Peter 35 USA
3 Linda 32 Canada
4 Phil 40 UK

Filtering rows where Age is greater than 30:
Name Age Country
2 Peter 35 USA
3 Linda 32 Canada
4 Phil 40 UK
Explain sorting and ranking in pandas.
Ranking in pandas refers to the process of assigning a rank to each row based on
the values of one or more columns. This can be useful for identifying the top or
bottom performers, or for creating a leaderboard.
Sorting:
- Reorders the rows of a DataFrame or Series based on the values of one or more
columns
- Can be done in ascending or descending order
- Reorders the entire DataFrame
Ranking:
- Assigns a rank to each row based on the values of one or more columns
- Can be done in ascending or descending order
- Does not reorder the entire DataFrame (although it can be used in conjunction
with sorting)
Some common use cases for sorting and ranking in pandas include:
- Sorting:
- Organizing data in alphabetical or numerical order
- Preparing data for visualization or analysis
- Ranking:
- Identifying top or bottom performers
- Creating a leaderboard or scoring system
- Assigning a percentile or quartile rank to each row
Sorting:
1. Sort by a single column:
df.sort_values(by='column_name')
2. Sort by multiple columns:
df.sort_values(by=['column1', 'column2'])
3. Sort in descending order:
df.sort_values(by='column_name', ascending=False)
4. Sort in place:
df.sort_values(by='column_name', inplace=True)
Ranking:
1. Rank by a single column:
df['rank'] = df['column_name'].rank()
2. Rank in descending order:
df['rank'] = df['column_name'].rank(ascending=False)
3. Control how ties are broken:
df['rank'] = df['column_name'].rank(method='min')
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
'Age': [28, 24, 35, 32, 40],
'Score': [90, 85, 95, 88, 92]}
df = pd.DataFrame(data)
# Sort by Age
print("Sorted by Age:")
print(df.sort_values(by='Age'))
# Rank by Score (highest score gets rank 1)
df['rank'] = df['Score'].rank(ascending=False)
print("\nRanked by Score:")
print(df)
Output:
Sorted by Age:
Name Age Score
1 Anna 24 85
0 John 28 90
3 Linda 32 88
2 Peter 35 95
4 Phil 40 92
Ranked by Score:
Name Age Score rank
0 John 28 90 3.0
1 Anna 24 85 5.0
2 Peter 35 95 1.0
3 Linda 32 88 4.0
4 Phil 40 92 2.0
Some common use cases for summarizing and computing descriptive statistics in
pandas include getting a quick overview of a new dataset, spotting outliers and
skew, comparing groups, and preparing figures for reports. Using the same
Name/Age/Score data as above:
import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
'Age': [28, 24, 35, 32, 40],
'Score': [90, 85, 95, 88, 92]}
df = pd.DataFrame(data)
print("Summary:")
print(df.describe())
print("\nMean:")
print(df.mean(numeric_only=True))
print("\nMedian:")
print(df.median(numeric_only=True))
print("\nMode:")
print(df.mode(numeric_only=True))
print("\nStandard Deviation:")
print(df.std(numeric_only=True))
print("\nVariance:")
print(df.var(numeric_only=True))
Output:
Summary:
Age Score
count 5.000000 5.000000
mean 31.800000 90.000000
std 6.180615 3.807887
min 24.000000 85.000000
25% 28.000000 88.000000
50% 32.000000 90.000000
75% 35.000000 92.000000
max 40.000000 95.000000
Mean:
Age 31.8
Score 90.0
dtype: float64
Median:
Age 32.0
Score 90.0
dtype: float64
Mode:
Age Score
0 24 85
1 28 88
2 32 90
3 35 92
4 40 95
Standard Deviation:
Age 6.180615
Score 3.807887
dtype: float64
Variance:
Age 38.200000
Score 14.500000
dtype: float64
For example, grouping by Age, df.groupby('Age')['Score'].mean() produces:
Age
24 85.0
28 90.0
32 88.0
35 95.0
40 92.0
Name: Score, dtype: float64
This example demonstrates various ways to summarize and compute descriptive
statistics in pandas, including using the describe(), mean(), median(), mode(), std(),
var(), min(), max(), and quantile() functions.
1. Mean: The average of the values in a column.
2. Median: The middle value when the column is sorted.
3. Mode: The most frequently occurring value in a column.
4. Standard Deviation (std): A measure of the amount of variation or dispersion in
a column.
5. Variance: The average of the squared differences from the mean.
6. Minimum (min): The smallest value in a column.
7. Maximum (max): The largest value in a column.
8. Quantiles (q): Divide the data into equal-sized groups based on rank or position.
9. Interquartile Range (IQR): The difference between the 75th percentile (Q3) and
25th percentile (Q1).
10. Range: The difference between the maximum and minimum values.
Pandas provides built-in functions for these statistics:
- mean()
- median()
- mode()
- std()
- var()
- min()
- max()
- quantile()
- describe(): Generates a summary of the central tendency, dispersion, and shape of
the dataset's distribution.
These descriptive statistics are essential for understanding the distribution of your
data, identifying patterns and trends, and informing data-driven decisions.
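The quantile, IQR, and range statistics listed above are easy to derive by hand; a short sketch using the Age values from the earlier examples:

```python
import pandas as pd

age = pd.Series([28, 24, 35, 32, 40])

q1 = age.quantile(0.25)      # 25th percentile
q3 = age.quantile(0.75)      # 75th percentile
iqr = q3 - q1                # interquartile range
rng = age.max() - age.min()  # range

print(q1, q3, iqr, rng)
```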
To find unique values in pandas, you can use the unique() function, which returns
an array of unique values. Here are some examples:
Note: The unique() function returns an array of unique values, while nunique()
returns the count of unique values.
Example:
import pandas as pd
data = {'Name': ['John', 'Anna', 'John', 'Linda', 'Anna', 'Phil'],
'Age': [28, 24, 28, 32, 24, 40]}
df = pd.DataFrame(data)
# Find unique values
print(df['Name'].unique())
# Output: ['John' 'Anna' 'Linda' 'Phil']
# Count unique values
print(df['Name'].nunique())
# Output: 4
# Get unique rows
print(df.drop_duplicates())
# Output:
# Name Age
# 0 John 28
# 1 Anna 24
# 3 Linda 32
# 5 Phil 40
# Get unique values with frequency
print(df['Name'].value_counts())
# Output:
# John 2
# Anna 2
# Linda 1
# Phil 1
In this example, we demonstrate how to find unique values, count unique values,
get unique rows, and get unique values with frequency using pandas.
Example:
import pandas as pd
fruits = pd.Series(['banana', 'apple', 'banana', 'orange', 'banana', 'apple'])
print(fruits.value_counts())
# Output:
# banana 3
# apple 2
# orange 1
In this example, the value_counts() function returns a Series with the unique values
('banana', 'apple', 'orange') as the index and their respective counts (3, 2, 1) as the
values.
Example code:
import pandas as pd
import numpy as np
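The example code was elided here; a minimal sketch of the usual strategies for handling missing values (the sample values are made up):

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.fillna(0))      # replace NaN with a constant
print(s.ffill())        # forward-fill from the previous value
print(s.interpolate())  # linearly interpolate between neighbors
```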
Note: The choice of method depends on the nature of the data and the problem
you're trying to solve.
What is filtering out missing data in pandas?
Filtering out missing data means removing or excluding rows or columns that
contain missing or null values from a dataset. This is a common data preprocessing
step in data analysis and machine learning to ensure that the data is complete and
consistent.
Missing data can be represented in different ways, such as NaN (Not a Number) for
numeric data, None for object data, and NaT (Not a Time) for datetime values.
Filtering out missing data can be done using various techniques, including
dropna() to remove rows or columns containing missing values, and boolean masks
built from isna()/notna().
Why filter out missing data?
1. Prevents bias: Missing data can lead to biased results if not handled properly.
2. Improves accuracy: Complete data leads to more accurate analysis and
modeling.
3. Enhances reliability: Filtering out missing data ensures that the results are
reliable and consistent.
However, it's essential to consider the nature of the data and the problem you're
trying to solve before filtering out missing data. In some cases, missing data may
be informative or important for the analysis.
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {'Name': ['John', 'Anna', np.nan, 'Linda', 'Phil'], 'Age': [28, np.nan, 35, 32, 40],
'Score': [90, 85, np.nan, 88, 92]}
df = pd.DataFrame(data)
# Print the original DataFrame
print("Original DataFrame:")
print(df)
# Drop all rows that contain at least one missing value
print("\nFiltered DataFrame:")
print(df.dropna())
Output:
Original DataFrame:
Name Age Score
0 John 28.0 90.0
1 Anna NaN 85.0
2 NaN 35.0 NaN
3 Linda 32.0 88.0
4 Phil 40.0 92.0
Filtered DataFrame:
Name Age Score
0 John 28.0 90.0
3 Linda 32.0 88.0
4 Phil 40.0 92.0