
FUNDAMENTALS OF DATA SCIENCE

II UNIT : GETTING STARTED WITH PANDAS

Introduction to Pandas

Pandas is a powerful open-source library in Python for data manipulation and
analysis. It provides data structures and functions to efficiently handle structured
data, including tabular data such as spreadsheets and SQL tables.

Key Features of Pandas:

1. Data Structures: Pandas introduces two primary data structures:
- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure)
2. Data Operations: Pandas supports various data operations, including:
- Filtering
- Sorting
- Grouping
- Merging
- Reshaping
3. Data Input/Output: Pandas allows reading and writing data from various file
formats, including:
- CSV
- Excel
- JSON
- SQL databases
4. Data Manipulation: Pandas provides various data manipulation functions,
including:
- Handling missing data
- Data cleaning
- Data transformation
5. Integration: Pandas integrates well with other popular Python libraries,
including:
- NumPy
- Matplotlib
- Scikit-learn

Why Use Pandas?

1. Efficient Data Handling: Pandas provides efficient data structures and operations
for handling large datasets.
2. Flexible Data Manipulation: Pandas offers various data manipulation functions
for data cleaning, transformation, and analysis.
3. Easy Data Analysis: Pandas integrates well with other libraries, making data
analysis and visualization easier.
4. Community Support: Pandas has an active community, ensuring continuous
development and support.

Getting Started with Pandas:

1. Install Pandas: pip install pandas
2. Import Pandas: import pandas as pd
3. Create a DataFrame: df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

Pandas is a powerful library for data manipulation and analysis. Its efficient data
structures and operations make it an ideal choice for working with structured data
in Python.
What is the library architecture in Pandas?
The library architecture in Pandas is designed to provide a robust and efficient data
analysis framework. Here's an overview of the key components:

1. Data Structures:
- Series (1-dimensional labeled array): Represents a single column of data.
- DataFrame (2-dimensional labeled data structure): Represents a table of data
with rows and columns.
- Panel (3-dimensional labeled data structure): Represented a collection of
DataFrames; deprecated and removed in modern pandas.
2. Indexing and Selecting Data:
- Index: A data structure that provides fast lookups and labeling for rows and
columns.
- Label-based selection: Select data using label-based indexing (e.g.,
df['column_name']).
- Position-based selection: Select data using position-based indexing (e.g.,
df.iloc[0]).
3. Data Operations:
- Filtering: Select data based on conditions (e.g., df[df['column_name'] > 0]).
- Sorting: Sort data by one or more columns (e.g.,
df.sort_values('column_name')).
- Grouping: Group data by one or more columns and perform aggregation (e.g.,
df.groupby('column_name').mean()).
- Merging: Combine data from multiple DataFrames (e.g., pd.merge(df1, df2,
on='column_name')).
4. Data Input/Output:
- Readers: Read data from various file formats (e.g., CSV, Excel, JSON).
- Writers: Write data to various file formats.
5. Data Manipulation:
- Reshaping: Pivot, melt, and reshape data (e.g., df.pivot_table()).
- Data cleaning: Handle missing data, duplicates, and data normalization.
6. Computational Tools:
- NumPy integration: Leverage NumPy's vectorized operations for efficient
computations.
- Cython optimization: Optimize performance-critical code using Cython.
7. Extensibility:
- Plugins: Extend Pandas functionality using plugins (e.g., Pandas-GBQ for
Google BigQuery).

This architecture enables Pandas to efficiently handle large datasets, perform
complex data operations, and provide a flexible and extensible framework for data
analysis.
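
As a minimal sketch of how these pieces fit together (the column names and values
here are invented for illustration):

import pandas as pd

# The Index labels the rows; columns come from the dict keys
df = pd.DataFrame({'city': ['Paris', 'Rome', 'Oslo'],
                   'pop': [2.1, 2.8, 0.7]},
                  index=['a', 'b', 'c'])

print(df.loc['b'])            # label-based selection via the Index
print(df.iloc[0])             # position-based selection
print(df[df['pop'] > 1])      # boolean filtering
print(df.sort_values('pop'))  # sorting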
Write the key features of Pandas
Here are the key features of pandas:

1. Data Structures: pandas provides two primary data structures:
- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure)
2. Data Operations:
- Filtering: Select data based on conditions
- Sorting: Sort data by one or more columns
- Grouping: Group data by one or more columns and perform aggregation
- Merging: Combine data from multiple DataFrames
3. Data Input/Output:
- Read: Read data from various file formats (e.g., CSV, Excel, JSON)
- Write: Write data to various file formats
4. Data Manipulation:
- Handling missing data: Handle missing data (NaN, None) and data cleaning
methods
- Data transformation: Perform data transformation (e.g., pivoting, melting)
5. Data Analysis:
- Statistical functions: Provide statistical functions (e.g., mean, median, standard
deviation)
- Data visualization: Integrate with visualization libraries (e.g., Matplotlib,
Seaborn)
6. Performance:
- Vectorized operations: Perform operations on entire columns or rows at once
- Cython optimization: Optimize performance-critical code using Cython
7. Integration:
- NumPy integration: Leverage NumPy's vectorized operations
- Matplotlib integration: Integrate with Matplotlib for data visualization
- Scikit-learn integration: Integrate with Scikit-learn for machine learning
8. Time Series Analysis:
- Date and time handling: Handle date and time data
- Time series functions: Provide time series functions (e.g., rolling, resampling)
9. Data Cleaning:
- Data cleaning functions: Provide data cleaning functions (e.g., drop duplicates,
handle missing data)
10. Extensibility:
- Plugins: Extend pandas functionality using plugins
- Custom data types: Support custom data types

These features make pandas a powerful and flexible library for data manipulation
and analysis in Python.
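
As an example of the time series support listed above (rolling windows and
resampling), here is a minimal sketch with made-up values:

import pandas as pd

# A small daily series indexed by dates
idx = pd.date_range('2024-01-01', periods=6, freq='D')
s = pd.Series([10, 12, 9, 15, 14, 11], index=idx)

print(s.rolling(window=3).mean())  # 3-day rolling average
print(s.resample('2D').sum())      # totals over 2-day periods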
Write the applications of Pandas.
Pandas has numerous applications in various fields, including:

1. Data Analysis and Science:
- Data cleaning and preprocessing
- Data visualization
- Statistical analysis
- Machine learning
2. Business Intelligence and Analytics:
- Data reporting and dashboards
- Data mining
- Predictive analytics
- Business insights
3. Finance and Economics:
- Financial data analysis
- Portfolio management
- Risk analysis
- Economic modeling
4. Scientific Research:
- Data analysis and visualization
- Statistical modeling
- Data mining
- Research data management
5. Web Scraping and Data Extraction:
- Extracting data from websites
- Web scraping
- Data parsing
6. Data Engineering and Architecture:
- Data warehousing
- ETL (Extract, Transform, Load) processes
- Data pipeline management
7. Machine Learning and AI:
- Data preprocessing
- Feature engineering
- Model training and evaluation
8. Healthcare and Biomedical Research:
- Medical data analysis
- Clinical trial data management
- Genomics and proteomics research
9. Social Media and Text Analysis:
- Text data analysis
- Sentiment analysis
- Social media monitoring
10. Education and Research:
- Educational data analysis
- Research data management
- Academic data visualization

Pandas is a versatile library that can be applied to various domains and industries,
making it a valuable tool for anyone working with data.
Explain the data structures in Pandas.
Pandas historically provided three primary data structures (only the first two exist in modern pandas):

1. Series (1-dimensional labeled array):
- Represents a single column of data.
- Index-based, with a label for each entry.
- Supports various data types, including numeric, string, and datetime.
2. DataFrame (2-dimensional labeled data structure):
- Represents a table of data with rows and columns.
- Index-based, with labels for rows (index) and columns (columns).
- Supports various data types, including numeric, string, and datetime.
3. Panel (3-dimensional labeled data structure):
- Represented a collection of DataFrames.
- Index-based, with labels for rows (index), columns (columns), and depth
(panels).
- Deprecated and removed in modern pandas (0.25+); use a MultiIndex DataFrame
(see the sketch after this section) or the xarray library instead.

Key characteristics of Pandas data structures:

- Label-based indexing: Access data using labels (e.g., column names, row
indices).
- Index-based: Data is stored in an index-based structure, enabling fast lookups and
slicing.
- Flexible data types: Support various data types, including numeric, string, and
datetime.
- Vectorized operations: Perform operations on entire columns or rows at once,
making computations efficient.
- Missing data handling: Support for missing data (NaN, None) and data cleaning
methods.

These data structures enable efficient data manipulation, analysis, and storage in
Pandas.
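
Because Panel has been removed from modern pandas, the usual way to represent
3-dimensional data today is a DataFrame with a hierarchical (MultiIndex) row
index. A minimal sketch with invented values:

import pandas as pd

# Two 'panels' (years) sharing the same inner structure, stacked via a MultiIndex
idx = pd.MultiIndex.from_product([[2023, 2024], ['Q1', 'Q2']],
                                 names=['year', 'quarter'])
df = pd.DataFrame({'sales': [100, 120, 130, 150]}, index=idx)

print(df.loc[2024])  # select one 'panel' by the outer index level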
Explain Series in pandas.
In pandas, a Series is a one-dimensional labeled array of values, similar to a
column in a spreadsheet or a column in a SQL table. It's a fundamental data
structure in pandas, and it's used to represent a single column of data.

Here are some key characteristics of a pandas Series:

1. One-dimensional: A Series is a single column of data, with a single index (row
labels).
2. Labeled: Each value in the Series has a label, which is used to identify the value.
3. Index: The index of a Series is the row labels, which can be integers, strings, or
other types of data.
4. Data type: A Series can have a single data type, such as integers, floats, strings,
or datetime.
5. Size: A Series can have any number of values, from a few to millions.
Creating a Series:

You can create a Series from a list, array, or other iterable using the pd.Series()
function:

import pandas as pd

data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)

This will output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

Accessing values in a Series:

You can access values in a Series using the index label:

print(series[0]) # prints 1

You can also access values using slicing:

print(series[1:3]) # prints positions 1 and 2 (values 2 and 3)

Series Operations:

You can perform various operations on a Series, such as:

- Arithmetic operations (e.g., addition, subtraction)
- Comparison operations (e.g., equality, greater than)
- Aggregate functions (e.g., mean, sum)
- Data manipulation (e.g., sorting, indexing)
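
For example, continuing with the series created above:

print(series + 10)    # arithmetic: adds 10 to every value
print(series > 2)     # comparison: boolean Series
print(series.mean())  # aggregate: 3.0
print(series.sort_values(ascending=False))  # manipulation: sorting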

Explain DataFrame in Pandas.
A DataFrame in pandas is a two-dimensional labeled data structure with columns
of potentially different types. It's similar to a spreadsheet or a SQL table, and is the
most commonly used data structure in pandas.

Here are some key characteristics of a DataFrame:

1. Two-dimensional: A DataFrame has rows and columns, similar to a spreadsheet.
2. Labeled: Each column and row has a label, which can be used to identify the
data.
3. Columns: Each column can have a different data type (e.g., integers, strings,
datetime).
4. Rows: Each row represents a single observation or record.
5. Index: The index is the row labels, which can be integers, strings, or other types
of data.
6. Size: A DataFrame can have any number of rows and columns.

Creating a DataFrame:

You can create a DataFrame from a dictionary, list of lists, or other data structures
using the pd.DataFrame() function:

import pandas as pd

data = {'Name': ['John', 'Anna', 'Linda'],
        'Age': [28, 24, 32],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)

This will output:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Linda   32    London

Accessing data in a DataFrame:

You can access data in a DataFrame using the column name or row index:

print(df['Name']) # prints the 'Name' column
print(df.loc[0]) # prints the first row

DataFrame Operations:

You can perform various operations on a DataFrame, such as:

- Filtering data
- Sorting and indexing
- Grouping and aggregating data
- Merging and joining DataFrames
- Data manipulation (e.g., adding, removing columns)
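
For example, a few of these operations on the df created above:

print(df[df['Age'] > 25])                # filtering rows
print(df.sort_values(by='Age'))          # sorting
print(df.groupby('City')['Age'].mean())  # grouping and aggregating
df['Country'] = ['USA', 'France', 'UK']  # adding a column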

What is dropping entries from an axis in Pandas?

In pandas, dropping entries from an axis refers to removing rows or columns from
a DataFrame based on specified conditions. This is achieved using the drop()
function, which allows you to remove entries from either the index (rows) or
columns axis.

Dropping rows:

To drop rows, you can use the drop() function with the index parameter. You can
pass a single label or a list of labels to remove specific rows.

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Linda', 'Phil'],
        'Age': [28, 24, 32, 40]}
df = pd.DataFrame(data)

# Drop a single row by its index label
# (the default RangeIndex labels are 0, 1, 2, 3, so rows are dropped by
# those integer labels, not by name)
df.drop(index=1, inplace=True)  # removes the 'Anna' row

# Drop multiple rows by index labels
df.drop(index=[2, 3], inplace=True)  # removes the 'Linda' and 'Phil' rows

Dropping columns:

To drop columns, you can use the drop() function with the columns parameter.
You can pass a single label or a list of labels to remove specific columns.

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Linda', 'Phil'],
        'Age': [28, 24, 32, 40]}
df = pd.DataFrame(data)

# Drop a single column by label (returns a new DataFrame)
df_no_age = df.drop(columns='Age')

# Drop multiple columns by labels
# (done on the original df; dropping 'Age' again from df_no_age would raise
# a KeyError because it is already gone)
df_empty = df.drop(columns=['Name', 'Age'])

Options:

The drop() function has several options to customize the behavior:

- inplace: If True, modifies the original DataFrame in place. If False (the
default), returns a new DataFrame.
- axis: Specifies the axis to drop entries from (0 for rows, 1 for columns).
- errors: Specifies how to handle errors (e.g., 'ignore' to ignore missing labels).

By using the drop() function, you can efficiently remove unwanted entries from
your DataFrame, making it easier to work with and analyze your data.
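
For example, a short sketch of these options on the df above ('Salary' is a
deliberately non-existent column used to show errors='ignore'):

df2 = df.drop('Age', axis=1)                      # same effect as columns='Age'
df3 = df.drop(columns='Salary', errors='ignore')  # missing label is silently skipped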

What are Index objects in Pandas?

Index objects are a fundamental component of Pandas, used to label and identify
rows and columns in DataFrames and Series. They provide a way to access and
manipulate data efficiently.

Types of Index objects:

1. RangeIndex: default index, created from a range of integers (e.g., 0, 1, 2, ...)
2. Int64Index: index with 64-bit integer values (removed in pandas 2.0 in favour of
a plain Index with int64 dtype)
3. Float64Index: index with 64-bit floating-point values (likewise removed in
pandas 2.0)
4. MultiIndex: hierarchical index with multiple levels
5. DatetimeIndex: index with datetime values
6. PeriodIndex: index with period values (e.g., daily, monthly, quarterly)
7. TimedeltaIndex: index with timedelta values

Index object properties:

1. name: name of the index
2. dtype: data type of the index values
3. values: array of index values
4. shape: shape of the index (number of elements)

Common index-related methods (some are Index methods, others Series/DataFrame methods):

1. reindex: conform a Series or DataFrame to a new index
2. reset_index: reset the index to the default integer index
3. set_index: set the index to a specific column or array
4. drop_duplicates: drop duplicate index values
5. get_loc: get the integer location of a specific index value

Using Index objects, you can:

1. Select data using label-based indexing
2. Filter data using conditional indexing
3. Sort and order data using index sorting
4. Group and aggregate data using index grouping
5. Merge and join data using index matching
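
A minimal sketch of a few of these properties and methods (the column names are
invented):

import pandas as pd

df = pd.DataFrame({'code': ['a', 'b', 'c'], 'value': [10, 20, 30]})

print(df.index)               # RangeIndex(start=0, stop=3, step=1)
df = df.set_index('code')     # use the 'code' column as the index
print(df.index.name)          # 'code'
print(df.index.get_loc('b'))  # 1 -- integer position of the label 'b'
df = df.reset_index()         # back to the default integer index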

Write the essential functionality of Pandas.

Essential functionality in Pandas includes:

1. Data Structures:

- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure with columns of potentially
different types)

2. Data Manipulation:

- Filtering: selecting data based on conditions
- Sorting: sorting data by one or more columns
- Grouping: grouping data by one or more columns and applying aggregation
functions
- Merging: combining data from multiple DataFrames
- Reshaping: transforming data from wide to long format and vice versa

3. Data Analysis:

- Statistical functions: mean, median, mode, standard deviation, etc.
- Data alignment: aligning data by index or columns
- Data merging: combining data from multiple DataFrames

4. Data Input/Output:

- Reading data from various file formats (CSV, Excel, JSON, etc.)
- Writing data to various file formats (CSV, Excel, JSON, etc.)

5. Data Cleaning:

- Handling missing data: detecting, filling, and dropping missing values
- Data normalization: scaling and transforming data

6. Data Transformation:

- Melting: transforming data from wide to long format
- Pivoting: transforming data from long to wide format
- Stack and unstack: transforming data by stacking or unstacking levels

7. Data Selection:

- Label-based selection: selecting data by label
- Conditional selection: selecting data based on conditions
- Boolean indexing: selecting data using boolean arrays
8. Data Aggregation:

- GroupBy: grouping data and applying aggregation functions
- Pivot tables: creating pivot tables to summarize data

These essential functionalities make Pandas a powerful tool for data manipulation,
analysis, and visualization.
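
Of these, reshaping (melting and pivoting) is not demonstrated elsewhere in this
unit, so here is a minimal sketch with invented data:

import pandas as pd

wide = pd.DataFrame({'name': ['John', 'Anna'],
                     'math': [90, 85],
                     'physics': [80, 95]})

# Wide to long format
long = wide.melt(id_vars='name', var_name='subject', value_name='score')
print(long)

# Long back to wide format
print(long.pivot(index='name', columns='subject', values='score'))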

What is selection and filtering in Pandas? Give an example.

Selection and filtering are essential operations in pandas, a powerful data
manipulation library in Python. Here are some ways to select and filter data in
pandas:
Selection:

1. Label-based selection: Use the loc attribute to select rows and columns by label.
- df.loc[row_labels, column_labels]
2. Integer-based selection: Use the iloc attribute to select rows and columns by
integer position.
- df.iloc[row_positions, column_positions]
3. Conditional selection: Use boolean indexing to select rows based on conditions.
- df[condition]

Filtering:

1. Boolean indexing: Use boolean conditions to filter rows.
- df[condition]
2. Query: Use the query method to filter rows using a SQL-like syntax.
- df.query('condition')
3. Filtering with isin: Use the isin method to filter rows based on a list of values.
- df[df['column'].isin(values)]

Some examples:

- Select rows where the value in the 'age' column is greater than 30: df[df['age'] >
30]
- Select rows where the value in the 'country' column is either 'USA' or 'Canada':
df[df['country'].isin(['USA', 'Canada'])]
- Select rows where the value in the 'name' column starts with 'J':
df[df['name'].str.startswith('J')]
These are just a few examples of the many ways to select and filter data in pandas.
Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
        'Age': [28, 24, 35, 32, 40],
        'Country': ['USA', 'UK', 'USA', 'Canada', 'UK']}
df = pd.DataFrame(data)

# Selection
print("Original DataFrame:")
print(df)

# Select rows where Age is greater than 30
print("\nRows where Age > 30:")
print(df[df['Age'] > 30])

# Select rows where Country is USA or UK
print("\nRows where Country is USA or UK:")
print(df[df['Country'].isin(['USA', 'UK'])])

# Select rows where Name starts with 'P'
print("\nRows where Name starts with 'P':")
print(df[df['Name'].str.startswith('P')])

# Filtering
print("\nFiltering rows where Age is greater than 30:")
print(df.query('Age > 30'))

# Filter rows where Country is USA and Age is greater than 30
print("\nRows where Country is USA and Age > 30:")
print(df[(df['Country'] == 'USA') & (df['Age'] > 30)])

Output:

Original DataFrame:
    Name  Age Country
0   John   28     USA
1   Anna   24      UK
2  Peter   35     USA
3  Linda   32  Canada
4   Phil   40      UK

Rows where Age > 30:
    Name  Age Country
2  Peter   35     USA
3  Linda   32  Canada
4   Phil   40      UK

Rows where Country is USA or UK:
    Name  Age Country
0   John   28     USA
1   Anna   24      UK
2  Peter   35     USA
4   Phil   40      UK

Rows where Name starts with 'P':
    Name  Age Country
2  Peter   35     USA

Filtering rows where Age is greater than 30:
    Name  Age Country
2  Peter   35     USA
3  Linda   32  Canada
4   Phil   40      UK

Rows where Country is USA and Age > 30:
    Name  Age Country
2  Peter   35     USA
These examples demonstrate how to select and filter rows in a pandas DataFrame
using various conditions.

What is Sorting and Ranking in Pandas?


Sorting in pandas refers to the process of rearranging the rows of a DataFrame in a
specific order based on the values of one or more columns. This can be done in
either ascending or descending order.

Ranking in pandas refers to the process of assigning a rank to each row based on
the values of one or more columns. This can be useful for identifying the top or
bottom performers, or for creating a leaderboard.

Here are some key differences between sorting and ranking:

Sorting:

- Reorders the entire DataFrame
- Can be done in ascending or descending order
- Does not assign a rank to each row

Ranking:

- Assigns a rank to each row based on the values of one or more columns
- Can be done in ascending or descending order
- Does not reorder the entire DataFrame (although it can be used in conjunction
with sorting)

Some common use cases for sorting and ranking in pandas include:

- Sorting:
- Organizing data in alphabetical or numerical order
- Preparing data for visualization or analysis
- Ranking:
- Identifying top or bottom performers
- Creating a leaderboard or scoring system
- Assigning a percentile or quartile rank to each row

Pandas provides various functions for sorting and ranking, including:

- sort_values(): Sorts the DataFrame by one or more columns
- sort_index(): Sorts the DataFrame by its index
- rank(): Assigns a rank to each row based on the values of one or more columns
- nlargest() and nsmallest(): Return the top or bottom N rows based on the values
of one or more columns
Here are some examples of sorting and ranking in pandas:

Sorting:
1. Sort by a single column:

df.sort_values(by='column_name')

2. Sort by multiple columns:

df.sort_values(by=['column1', 'column2'])

3. Sort in descending order:

df.sort_values(by='column_name', ascending=False)

4. Sort in place (modify the original DataFrame):

df.sort_values(by='column_name', inplace=True)

Ranking:

1. Rank by a single column:

df['rank'] = df['column_name'].rank()

2. Rank by multiple columns:

df['rank'] = df[['column1', 'column2']].apply(tuple, axis=1).rank()

3. Rank in descending order:

df['rank'] = df['column_name'].rank(ascending=False)

4. Rank with specific method (e.g., min, max, dense, etc.):

df['rank'] = df['column_name'].rank(method='min')

Here's an example code snippet:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
'Age': [28, 24, 35, 32, 40],
'Score': [90, 85, 95, 88, 92]}
df = pd.DataFrame(data)

# Sort by Age in ascending order
print("Sorted by Age:")
print(df.sort_values(by='Age'))

# Rank by Score in descending order
df['rank'] = df['Score'].rank(ascending=False)
print("\nRanked by Score:")
print(df)

Output:

Sorted by Age:
    Name  Age  Score
1   Anna   24     85
0   John   28     90
3  Linda   32     88
2  Peter   35     95
4   Phil   40     92

Ranked by Score:
    Name  Age  Score  rank
0   John   28     90   3.0
1   Anna   24     85   5.0
2  Peter   35     95   1.0
3  Linda   32     88   4.0
4   Phil   40     92   2.0

What is summarizing and computing statistics in pandas?


Summarizing in pandas refers to the process of reducing a large dataset into a
smaller, more manageable form, while still maintaining the essential characteristics
of the data. This can be done using various summary statistics, such as:
- Count: Number of non-missing values
- Mean: Average value
- Median: Middle value
- Mode: Most frequent value
- Standard Deviation: Measure of variability
- Variance: Measure of spread
- Minimum and Maximum values
- Quartiles (25th, 50th, 75th percentiles)
- Percentiles (e.g., 10th, 90th percentiles)

Computing descriptive statistics in pandas involves calculating these summary
statistics to understand the distribution, central tendency, and variability of the
data. This can be done using various pandas functions, such as:

- describe(): Generates a summary of the central tendency, dispersion, and shape of
the dataset's distribution.
- mean(), median(), mode(), std(), var(), min(), max(), quantile(): Calculate specific
summary statistics.
- groupby(): Calculate summary statistics for each group of a categorical variable.
- pivot_table(): Create a spreadsheet-style summary of the data.

Some common use cases for summarizing and computing descriptive statistics in
pandas include:

- Exploratory data analysis (EDA)
- Data cleaning and preprocessing
- Feature engineering
- Data visualization
- Statistical modeling

By summarizing and computing descriptive statistics, you can:

- Understand the distribution and characteristics of your data
- Identify patterns, trends, and correlations
- Inform data-driven decisions
- Prepare data for machine learning or statistical modeling

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
        'Age': [28, 24, 35, 32, 40],
        'Score': [90, 85, 95, 88, 92]}
df = pd.DataFrame(data)

# Summarize the DataFrame
print("Summary:")
print(df.describe())

# Compute descriptive statistics
# (numeric_only=True skips the non-numeric 'Name' column, which recent
# pandas versions would otherwise refuse to aggregate)
print("\nMean:")
print(df.mean(numeric_only=True))

print("\nMedian:")
print(df.median(numeric_only=True))

print("\nMode:")
print(df.mode(numeric_only=True))  # every value occurs once, so each is a mode

print("\nStandard Deviation:")
print(df.std(numeric_only=True))

print("\nVariance:")
print(df.var(numeric_only=True))

print("\nMinimum and Maximum values:")
print(df.min(numeric_only=True))
print(df.max(numeric_only=True))

print("\nQuantiles (25th, 50th, 75th percentiles):")
print(df.quantile([0.25, 0.5, 0.75], numeric_only=True))

# Group by 'Name' and compute mean 'Score'
print("\nMean Score by Name:")
print(df.groupby('Name')['Score'].mean())

# Pivot table to compute mean 'Score' by 'Age' group
print("\nMean Score by Age:")
print(df.pivot_table(values='Score', index='Age', aggfunc='mean'))

Output:
Summary:
             Age      Score
count   5.000000   5.000000
mean   31.800000  90.000000
std     6.180615   3.807887
min    24.000000  85.000000
25%    28.000000  88.000000
50%    32.000000  90.000000
75%    35.000000  92.000000
max    40.000000  95.000000

Mean:
Age      31.8
Score    90.0
dtype: float64

Median:
Age      32.0
Score    90.0
dtype: float64

Mode:
   Age  Score
0   24     85
1   28     88
2   32     90
3   35     92
4   40     95

Standard Deviation:
Age      6.180615
Score    3.807887
dtype: float64

Variance:
Age      38.2
Score    14.5
dtype: float64

Minimum and Maximum values:
Age      24
Score    85
dtype: int64
Age      40
Score    95
dtype: int64

Quantiles (25th, 50th, 75th percentiles):
       Age  Score
0.25  28.0   88.0
0.50  32.0   90.0
0.75  35.0   92.0

Mean Score by Name:
Name
Anna     85.0
John     90.0
Linda    88.0
Peter    95.0
Phil     92.0
Name: Score, dtype: float64

Mean Score by Age:
     Score
Age
24    85.0
28    90.0
32    88.0
35    95.0
40    92.0
This example demonstrates various ways to summarize and compute descriptive
statistics in pandas, including using the describe(), mean(), median(), mode(), std(),
var(), min(), max(), and quantile() functions.

What are descriptive statistics in pandas?


Descriptive statistics in pandas refer to the statistical measures that summarize and
describe the basic features of a dataset. These measures provide an overview of the
central tendency, dispersion, and shape of the data's distribution.

Common descriptive statistics in pandas include:

1. Mean: The average value of a column.
2. Median: The middle value of a column when sorted in ascending order.

3. Mode: The most frequently occurring value in a column.
4. Standard Deviation (std): A measure of the amount of variation or dispersion in
a column.
5. Variance: The average of the squared differences from the mean.
6. Minimum (min): The smallest value in a column.
7. Maximum (max): The largest value in a column.
8. Quantiles (q): Divide the data into equal-sized groups based on rank or position.
9. Interquartile Range (IQR): The difference between the 75th percentile (Q3) and
25th percentile (Q1).
10. Range: The difference between the maximum and minimum values.

Pandas provides various functions to calculate these descriptive statistics,
including:

- mean()
- median()
- mode()
- std()
- var()
- min()
- max()
- quantile()
- describe(): Generates a summary of the central tendency, dispersion, and shape of
the dataset's distribution.

These descriptive statistics are essential for understanding the distribution of your
data, identifying patterns and trends, and informing data-driven decisions.
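
For example, the IQR and range listed above can be computed from quantile(),
min(), and max(); a small sketch:

import pandas as pd

s = pd.Series([24, 28, 32, 35, 40])

q1 = s.quantile(0.25)     # 28.0
q3 = s.quantile(0.75)     # 35.0
print(q3 - q1)            # IQR: 7.0
print(s.max() - s.min())  # range: 16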

What are unique values in pandas?


In pandas, unique values are the distinct (non-duplicate) values in a column or
Series, i.e. the set of values that remain after duplicates are removed.

To find unique values in pandas, you can use the unique() function, which returns
an array of unique values. Here are some examples:

1. Get unique values in a column:
df['column_name'].unique()

2. Get unique values in a Series:
series.unique()

3. Count unique values:
df['column_name'].nunique()

4. Get unique rows in a DataFrame:
df.drop_duplicates()

5. Get unique values with frequency:
df['column_name'].value_counts()

Note: The unique() function returns an array of unique values, while nunique()
returns the count of unique values.

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'John', 'Linda', 'Anna', 'Phil'],
        'Age': [28, 24, 28, 32, 24, 40]}
df = pd.DataFrame(data)

# Get unique values in the 'Name' column
print(df['Name'].unique())

# Output: ['John' 'Anna' 'Linda' 'Phil']  (a NumPy array)

# Count unique values in the 'Name' column
print(df['Name'].nunique())

# Output: 4

# Get unique rows in the DataFrame
print(df.drop_duplicates())

# Output:
#     Name  Age
# 0   John   28
# 1   Anna   24
# 3  Linda   32
# 5   Phil   40

# Get unique values with frequency in the 'Name' column
print(df['Name'].value_counts())

# Output:
# Anna     2
# John     2
# Linda    1
# Phil     1

In this example, we demonstrate how to find unique values, count unique values,
get unique rows, and get unique values with frequency using pandas.

What is value_counts() in pandas?


In pandas, value_counts() is a function that returns a Series containing the count of
unique values in a Series or column of a DataFrame. It's a convenient way to get
the frequency of each unique value.

Here's what value_counts() does:

1. Counts the number of occurrences of each unique value.
2. Returns a Series with the unique values as the index and the counts as the
values.
3. Sorts the results in descending order by default (most frequent values first).

Example:

import pandas as pd

# Create a sample Series
s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'banana'])

# Get the value counts
print(s.value_counts())

# Output:
# banana    3
# apple     2
# orange    1

In this example, the value_counts() function returns a Series with the unique values
('banana', 'apple', 'orange') as the index and their respective counts (3, 2, 1) as the
values.

You can also use value_counts() on a DataFrame column:

# Create a sample DataFrame
df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']})

# Get the value counts for the 'fruit' column
print(df['fruit'].value_counts())

# Output:
# banana 3
# apple 2
# orange 1

Note that value_counts() has some optional parameters:

- normalize: If True, returns the relative frequencies instead of counts.
- sort: If False, doesn't sort the results.
- ascending: If True, sorts the results in ascending order.
- bins: Groups numeric values into the given number of bins and counts per bin,
instead of counting each exact value.
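
For example, normalize=True on the 'fruit' column above returns relative
frequencies:

print(df['fruit'].value_counts(normalize=True))

# Output:
# banana    0.500000
# apple     0.333333
# orange    0.166667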

How to handle missing data in pandas?

Handling missing data in pandas involves identifying, removing, or replacing
missing values. Here are some steps to handle missing data in pandas using
Python:

1. Identify missing values:
- Use isnull() or isna() to detect missing values.
- Use sum() (e.g., df.isnull().sum()) to count missing values per column.
2. Remove missing values:
- Use dropna() to remove rows or columns with missing values.
- Use dropna(how='all') to remove rows with all missing values.
3. Replace missing values:
- Use fillna() to replace missing values with a specified value.
- Use fillna(method='ffill') to forward-fill missing values.
- Use fillna(method='bfill') to backward-fill missing values.
4. Interpolate missing values:
- Use interpolate() to interpolate missing values.
5. Impute missing values:
- Use SimpleImputer from scikit-learn to impute missing values.

Example code:

import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Identify missing values
print(df.isnull())

# Each technique below starts from the original df; chaining them with
# inplace=True would leave nothing for the later steps to handle.

# Remove rows with missing values
df_dropped = df.dropna()

# Replace missing values with a specified value
df_filled = df.fillna(0)

# Interpolate missing values
df_interpolated = df.interpolate(method='linear')

# Impute missing values using SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed[['A', 'B']] = imputer.fit_transform(df_imputed[['A', 'B']])

Note: The choice of method depends on the nature of the data and the problem
you're trying to solve.
What is filtering out missing data in pandas?
Filtering out missing data means removing or excluding rows or columns that
contain missing or null values from a dataset. This is a common data preprocessing
step in data analysis and machine learning to ensure that the data is complete and
consistent.
Missing data can be represented in different ways, such as:

- NaN (Not a Number)
- None
- Null
- Empty strings
- Special values like -999 or 999

Filtering out missing data can be done using various techniques, including:

1. Listwise deletion: Removing rows with missing values.
2. Pairwise deletion: Removing rows with missing values for a specific analysis or
calculation.
calculation.
3. Mean/Median imputation: Replacing missing values with the mean or median of
the respective column.
4. Forward/Backward fill: Replacing missing values with the previous or next
value in the same column.
5. Interpolation: Estimating missing values based on surrounding values.

Filtering out missing data is important because:

1. Prevents bias: Missing data can lead to biased results if not handled properly.
2. Improves accuracy: Complete data leads to more accurate analysis and
modeling.
3. Enhances reliability: Filtering out missing data ensures that the results are
reliable and consistent.

However, it's essential to consider the nature of the data and the problem you're
trying to solve before filtering out missing data. In some cases, missing data may
be informative or important for the analysis.
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'Name': ['John', 'Anna', np.nan, 'Linda', 'Phil'],
        'Age': [28, np.nan, 35, 32, 40],
        'Score': [90, 85, np.nan, 88, 92]}
df = pd.DataFrame(data)

# Print the original DataFrame
print("Original DataFrame:")
print(df)

# Filter out rows with missing values
df_filtered = df.dropna()

# Print the filtered DataFrame
print("\nFiltered DataFrame:")
print(df_filtered)

# Filter out columns with missing values
# (every column here contains at least one NaN, so all columns are dropped)
df_filtered_columns = df.dropna(axis=1)

# Print the filtered DataFrame
print("\nFiltered DataFrame (columns):")
print(df_filtered_columns)

# Filter out rows with missing values in a specific column
df_filtered_name = df[df['Name'].notnull()]

# Print the filtered DataFrame
print("\nFiltered DataFrame (Name column):")
print(df_filtered_name)

Output:

Original DataFrame:
    Name   Age  Score
0   John  28.0   90.0
1   Anna   NaN   85.0
2    NaN  35.0    NaN
3  Linda  32.0   88.0
4   Phil  40.0   92.0

Filtered DataFrame:
    Name   Age  Score
0   John  28.0   90.0
3  Linda  32.0   88.0
4   Phil  40.0   92.0

Filtered DataFrame (columns):
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

Filtered DataFrame (Name column):
    Name   Age  Score
0   John  28.0   90.0
1   Anna   NaN   85.0
3  Linda  32.0   88.0
4   Phil  40.0   92.0
In this example, we demonstrate how to filter out missing data in pandas using the
dropna() function, which removes rows or columns with missing values (note that
every column here contains at least one NaN, so dropping by column removes all of
them). We also show how to filter out rows with missing values in a specific
column using the notnull() function.
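
dropna() also takes how, thresh, and subset parameters for finer control; a short
sketch on the same df (the threshold of 2 is chosen just for illustration):

# Keep only rows with at least 2 non-missing values
print(df.dropna(thresh=2))

# Drop a row only if ALL of its values are missing
print(df.dropna(how='all'))

# Consider only 'Age' and 'Score' when deciding which rows to drop
print(df.dropna(subset=['Age', 'Score']))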
