
FUNDAMENTALS OF DATA SCIENCE

II UNIT : GETTING STARTED WITH PANDAS

Introduction to Pandas

Pandas is a powerful open-source library in Python for data manipulation and
analysis. It provides data structures and functions to efficiently handle structured
data, including tabular data such as spreadsheets and SQL tables.

Key Features of Pandas:

1. Data Structures: Pandas introduces two primary data structures:
- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure)
2. Data Operations: Pandas supports various data operations, including:
- Filtering
- Sorting
- Grouping
- Merging
- Reshaping
3. Data Input/Output: Pandas allows reading and writing data from various file
formats, including:
- CSV
- Excel
- JSON
- SQL databases
4. Data Manipulation: Pandas provides various data manipulation functions,
including:
- Handling missing data
- Data cleaning
- Data transformation
5. Integration: Pandas integrates well with other popular Python libraries,
including:
- NumPy
- Matplotlib
- Scikit-learn

Why Use Pandas?

1. Efficient Data Handling: Pandas provides efficient data structures and operations
for handling large datasets.
2. Flexible Data Manipulation: Pandas offers various data manipulation functions
for data cleaning, transformation, and analysis.
3. Easy Data Analysis: Pandas integrates well with other libraries, making data
analysis and visualization easier.
4. Community Support: Pandas has an active community, ensuring continuous
development and support.

Getting Started with Pandas:

1. Install Pandas: pip install pandas
2. Import Pandas: import pandas as pd
3. Create a DataFrame: df = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

Pandas is a powerful library for data manipulation and analysis. Its efficient data
structures and operations make it an ideal choice for working with structured data
in Python.
What is the library architecture in Pandas?
The library architecture in Pandas is designed to provide a robust and efficient data
analysis framework. Here's an overview of the key components:

1. Data Structures:
- Series (1-dimensional labeled array): Represents a single column of data.
- DataFrame (2-dimensional labeled data structure): Represents a table of data
with rows and columns.
- Panel (3-dimensional labeled data structure): Represented a collection of
DataFrames; deprecated and removed in modern pandas.
2. Indexing and Selecting Data:
- Index: A data structure that provides fast lookups and labeling for rows and
columns.
- Label-based selection: Select data using label-based indexing (e.g.,
df['column_name']).
- Position-based selection: Select data using position-based indexing (e.g.,
df.iloc[0]).
3. Data Operations:
- Filtering: Select data based on conditions (e.g., df[df['column_name'] > 0]).
- Sorting: Sort data by one or more columns (e.g.,
df.sort_values('column_name')).
- Grouping: Group data by one or more columns and perform aggregation (e.g.,
df.groupby('column_name').mean()).
- Merging: Combine data from multiple DataFrames (e.g., pd.merge(df1, df2,
on='column_name')).
4. Data Input/Output:
- Readers: Read data from various file formats (e.g., CSV, Excel, JSON).
- Writers: Write data to various file formats.
5. Data Manipulation:
- Reshaping: Pivot, melt, and reshape data (e.g., df.pivot_table()).
- Data cleaning: Handle missing data, duplicates, and data normalization.
6. Computational Tools:
- NumPy integration: Leverage NumPy's vectorized operations for efficient
computations.
- Cython optimization: Optimize performance-critical code using Cython.
7. Extensibility:
- Plugins: Extend Pandas functionality using plugins (e.g., Pandas-GBQ for
Google BigQuery).

This architecture enables Pandas to efficiently handle large datasets, perform
complex data operations, and provide a flexible and extensible framework for data
analysis.
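
As a minimal sketch of how these pieces fit together (the column names and values
here are invented for illustration):

import pandas as pd

# The Index labels the rows; columns come from the dict keys
df = pd.DataFrame({'city': ['Paris', 'Rome', 'Oslo'],
                   'pop': [2.1, 2.8, 0.7]},
                  index=['a', 'b', 'c'])

print(df.loc['b'])            # label-based selection via the Index
print(df.iloc[0])             # position-based selection
print(df[df['pop'] > 1])      # boolean filtering
print(df.sort_values('pop'))  # sorting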
Write the key features of Pandas
Here are the key features of pandas:

1. Data Structures: pandas provides two primary data structures:
- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure)
2. Data Operations:
- Filtering: Select data based on conditions
- Sorting: Sort data by one or more columns
- Grouping: Group data by one or more columns and perform aggregation
- Merging: Combine data from multiple DataFrames
3. Data Input/Output:
- Read: Read data from various file formats (e.g., CSV, Excel, JSON)
- Write: Write data to various file formats
4. Data Manipulation:
- Handling missing data: Handle missing data (NaN, None) and data cleaning
methods
- Data transformation: Perform data transformation (e.g., pivoting, melting)
5. Data Analysis:
- Statistical functions: Provide statistical functions (e.g., mean, median, standard
deviation)
- Data visualization: Integrate with visualization libraries (e.g., Matplotlib,
Seaborn)
6. Performance:
- Vectorized operations: Perform operations on entire columns or rows at once
- Cython optimization: Optimize performance-critical code using Cython
7. Integration:
- NumPy integration: Leverage NumPy's vectorized operations
- Matplotlib integration: Integrate with Matplotlib for data visualization
- Scikit-learn integration: Integrate with Scikit-learn for machine learning
8. Time Series Analysis:
- Date and time handling: Handle date and time data
- Time series functions: Provide time series functions (e.g., rolling, resampling)
9. Data Cleaning:
- Data cleaning functions: Provide data cleaning functions (e.g., drop duplicates,
handle missing data)
10. Extensibility:
- Plugins: Extend pandas functionality using plugins
- Custom data types: Support custom data types

These features make pandas a powerful and flexible library for data manipulation
and analysis in Python.
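
As an example of the time series support listed above (rolling windows and
resampling), here is a minimal sketch with made-up values:

import pandas as pd

# A small daily series indexed by dates
idx = pd.date_range('2024-01-01', periods=6, freq='D')
s = pd.Series([10, 12, 9, 15, 14, 11], index=idx)

print(s.rolling(window=3).mean())  # 3-day rolling average
print(s.resample('2D').sum())      # totals over 2-day periods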
Write the applications of Pandas.
Pandas has numerous applications in various fields, including:

1. Data Analysis and Science:
- Data cleaning and preprocessing
- Data visualization
- Statistical analysis
- Machine learning
2. Business Intelligence and Analytics:
- Data reporting and dashboards
- Data mining
- Predictive analytics
- Business insights
3. Finance and Economics:
- Financial data analysis
- Portfolio management
- Risk analysis
- Economic modeling
4. Scientific Research:
- Data analysis and visualization
- Statistical modeling
- Data mining
- Research data management
5. Web Scraping and Data Extraction:
- Extracting data from websites
- Web scraping
- Data parsing
6. Data Engineering and Architecture:
- Data warehousing
- ETL (Extract, Transform, Load) processes
- Data pipeline management
7. Machine Learning and AI:
- Data preprocessing
- Feature engineering
- Model training and evaluation
8. Healthcare and Biomedical Research:
- Medical data analysis
- Clinical trial data management
- Genomics and proteomics research
9. Social Media and Text Analysis:
- Text data analysis
- Sentiment analysis
- Social media monitoring
10. Education and Research:
- Educational data analysis
- Research data management
- Academic data visualization

Pandas is a versatile library that can be applied to various domains and industries,
making it a valuable tool for anyone working with data.
Explain the data structures in Pandas.
Pandas historically provided three primary data structures (only the first two exist in modern pandas):

1. Series (1-dimensional labeled array):
- Represents a single column of data.
- Index-based, with a label for each entry.
- Supports various data types, including numeric, string, and datetime.
2. DataFrame (2-dimensional labeled data structure):
- Represents a table of data with rows and columns.
- Index-based, with labels for rows (index) and columns (columns).
- Supports various data types, including numeric, string, and datetime.
3. Panel (3-dimensional labeled data structure):
- Represented a collection of DataFrames.
- Index-based, with labels for rows (index), columns (columns), and depth
(panels).
- Deprecated and removed in modern pandas (0.25+); use a MultiIndex DataFrame
(see the sketch after this section) or the xarray library instead.

Key characteristics of Pandas data structures:

- Label-based indexing: Access data using labels (e.g., column names, row
indices).
- Index-based: Data is stored in an index-based structure, enabling fast lookups and
slicing.
- Flexible data types: Support various data types, including numeric, string, and
datetime.
- Vectorized operations: Perform operations on entire columns or rows at once,
making computations efficient.
- Missing data handling: Support for missing data (NaN, None) and data cleaning
methods.

These data structures enable efficient data manipulation, analysis, and storage in
Pandas.
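
Because Panel has been removed from modern pandas, the usual way to represent
3-dimensional data today is a DataFrame with a hierarchical (MultiIndex) row
index. A minimal sketch with invented values:

import pandas as pd

# Two 'panels' (years) sharing the same inner structure, stacked via a MultiIndex
idx = pd.MultiIndex.from_product([[2023, 2024], ['Q1', 'Q2']],
                                 names=['year', 'quarter'])
df = pd.DataFrame({'sales': [100, 120, 130, 150]}, index=idx)

print(df.loc[2024])  # select one 'panel' by the outer index level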
Explain Series in pandas.
In pandas, a Series is a one-dimensional labeled array of values, similar to a
column in a spreadsheet or a column in a SQL table. It's a fundamental data
structure in pandas, and it's used to represent a single column of data.

Here are some key characteristics of a pandas Series:

1. One-dimensional: A Series is a single column of data, with a single index (row
labels).
2. Labeled: Each value in the Series has a label, which is used to identify the value.
3. Index: The index of a Series is the row labels, which can be integers, strings, or
other types of data.
4. Data type: A Series can have a single data type, such as integers, floats, strings,
or datetime.
5. Size: A Series can have any number of values, from a few to millions.
Creating a Series:

You can create a Series from a list, array, or other iterable using the pd.Series()
function:

import pandas as pd

data = [1, 2, 3, 4, 5]
series = pd.Series(data)
print(series)

This will output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

Accessing values in a Series:

You can access values in a Series using the index label:

print(series[0]) # prints 1

You can also access values using slicing:

print(series[1:3]) # prints positions 1 and 2 (values 2 and 3)

Series Operations:

You can perform various operations on a Series, such as:

- Arithmetic operations (e.g., addition, subtraction)
- Comparison operations (e.g., equality, greater than)
- Aggregate functions (e.g., mean, sum)
- Data manipulation (e.g., sorting, indexing)
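
For example, continuing with the series created above:

print(series + 10)    # arithmetic: adds 10 to every value
print(series > 2)     # comparison: boolean Series
print(series.mean())  # aggregate: 3.0
print(series.sort_values(ascending=False))  # manipulation: sorting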

Explain DataFrame in Pandas.
A DataFrame in pandas is a two-dimensional labeled data structure with columns
of potentially different types. It's similar to a spreadsheet or a SQL table, and is the
most commonly used data structure in pandas.

Here are some key characteristics of a DataFrame:

1. Two-dimensional: A DataFrame has rows and columns, similar to a spreadsheet.
2. Labeled: Each column and row has a label, which can be used to identify the
data.
3. Columns: Each column can have a different data type (e.g., integers, strings,
datetime).
4. Rows: Each row represents a single observation or record.
5. Index: The index is the row labels, which can be integers, strings, or other types
of data.
6. Size: A DataFrame can have any number of rows and columns.

Creating a DataFrame:

You can create a DataFrame from a dictionary, list of lists, or other data structures
using the pd.DataFrame() function:

import pandas as pd

data = {'Name': ['John', 'Anna', 'Linda'],
        'Age': [28, 24, 32],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)

This will output:

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Linda   32    London

Accessing data in a DataFrame:

You can access data in a DataFrame using the column name or row index:

print(df['Name']) # prints the 'Name' column
print(df.loc[0]) # prints the first row

DataFrame Operations:

You can perform various operations on a DataFrame, such as:

- Filtering data
- Sorting and indexing
- Grouping and aggregating data
- Merging and joining DataFrames
- Data manipulation (e.g., adding, removing columns)
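
For example, a few of these operations on the df created above:

print(df[df['Age'] > 25])                # filtering rows
print(df.sort_values(by='Age'))          # sorting
print(df.groupby('City')['Age'].mean())  # grouping and aggregating
df['Country'] = ['USA', 'France', 'UK']  # adding a column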

What is dropping entries from an axis in Pandas?

In pandas, dropping entries from an axis refers to removing rows or columns from
a DataFrame based on specified conditions. This is achieved using the drop()
function, which allows you to remove entries from either the index (rows) or
columns axis.

Dropping rows:

To drop rows, you can use the drop() function with the index parameter. You can
pass a single label or a list of labels to remove specific rows.

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Linda', 'Phil'],
        'Age': [28, 24, 32, 40]}
df = pd.DataFrame(data)

# Drop a single row by its index label
# (the default RangeIndex labels are 0, 1, 2, 3, so rows are dropped by
# those integer labels, not by name)
df.drop(index=1, inplace=True)  # removes the 'Anna' row

# Drop multiple rows by index labels
df.drop(index=[2, 3], inplace=True)  # removes the 'Linda' and 'Phil' rows

Dropping columns:

To drop columns, you can use the drop() function with the columns parameter.
You can pass a single label or a list of labels to remove specific columns.

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Linda', 'Phil'],
        'Age': [28, 24, 32, 40]}
df = pd.DataFrame(data)

# Drop a single column by label (returns a new DataFrame)
df_no_age = df.drop(columns='Age')

# Drop multiple columns by labels
# (done on the original df; dropping 'Age' again from df_no_age would raise
# a KeyError because it is already gone)
df_empty = df.drop(columns=['Name', 'Age'])

Options:

The drop() function has several options to customize the behavior:

- inplace: If True, modifies the original DataFrame in place. If False (the
default), returns a new DataFrame.
- axis: Specifies the axis to drop entries from (0 for rows, 1 for columns).
- errors: Specifies how to handle errors (e.g., 'ignore' to ignore missing labels).

By using the drop() function, you can efficiently remove unwanted entries from
your DataFrame, making it easier to work with and analyze your data.
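
For example, a short sketch of these options on the df above ('Salary' is a
deliberately non-existent column used to show errors='ignore'):

df2 = df.drop('Age', axis=1)                      # same effect as columns='Age'
df3 = df.drop(columns='Salary', errors='ignore')  # missing label is silently skipped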

What are Index objects in Pandas?

Index objects are a fundamental component of Pandas, used to label and identify
rows and columns in DataFrames and Series. They provide a way to access and
manipulate data efficiently.

Types of Index objects:

1. RangeIndex: default index, created from a range of integers (e.g., 0, 1, 2, ...)
2. Int64Index: index with 64-bit integer values (removed in pandas 2.0 in favour of
a plain Index with int64 dtype)
3. Float64Index: index with 64-bit floating-point values (likewise removed in
pandas 2.0)
4. MultiIndex: hierarchical index with multiple levels
5. DatetimeIndex: index with datetime values
6. PeriodIndex: index with period values (e.g., daily, monthly, quarterly)
7. TimedeltaIndex: index with timedelta values

Index object properties:

1. name: name of the index
2. dtype: data type of the index values
3. values: array of index values
4. shape: shape of the index (number of elements)

Common index-related methods (some are Index methods, others Series/DataFrame methods):

1. reindex: conform a Series or DataFrame to a new index
2. reset_index: reset the index to the default integer index
3. set_index: set the index to a specific column or array
4. drop_duplicates: drop duplicate index values
5. get_loc: get the integer location of a specific index value

Using Index objects, you can:

1. Select data using label-based indexing
2. Filter data using conditional indexing
3. Sort and order data using index sorting
4. Group and aggregate data using index grouping
5. Merge and join data using index matching
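
A minimal sketch of a few of these properties and methods (the column names are
invented):

import pandas as pd

df = pd.DataFrame({'code': ['a', 'b', 'c'], 'value': [10, 20, 30]})

print(df.index)               # RangeIndex(start=0, stop=3, step=1)
df = df.set_index('code')     # use the 'code' column as the index
print(df.index.name)          # 'code'
print(df.index.get_loc('b'))  # 1 -- integer position of the label 'b'
df = df.reset_index()         # back to the default integer index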

Write the essential functionality of Pandas.

Essential functionality in Pandas includes:

1. Data Structures:

- Series (1-dimensional labeled array)
- DataFrame (2-dimensional labeled data structure with columns of potentially
different types)

2. Data Manipulation:

- Filtering: selecting data based on conditions
- Sorting: sorting data by one or more columns
- Grouping: grouping data by one or more columns and applying aggregation
functions
- Merging: combining data from multiple DataFrames
- Reshaping: transforming data from wide to long format and vice versa

3. Data Analysis:

- Statistical functions: mean, median, mode, standard deviation, etc.
- Data alignment: aligning data by index or columns
- Data merging: combining data from multiple DataFrames

4. Data Input/Output:

- Reading data from various file formats (CSV, Excel, JSON, etc.)
- Writing data to various file formats (CSV, Excel, JSON, etc.)

5. Data Cleaning:

- Handling missing data: detecting, filling, and dropping missing values
- Data normalization: scaling and transforming data

6. Data Transformation:

- Melting: transforming data from wide to long format
- Pivoting: transforming data from long to wide format
- Stack and unstack: transforming data by stacking or unstacking levels

7. Data Selection:

- Label-based selection: selecting data by label
- Conditional selection: selecting data based on conditions
- Boolean indexing: selecting data using boolean arrays
8. Data Aggregation:

- GroupBy: grouping data and applying aggregation functions
- Pivot tables: creating pivot tables to summarize data

These essential functionalities make Pandas a powerful tool for data manipulation,
analysis, and visualization.
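
Of these, reshaping (melting and pivoting) is not demonstrated elsewhere in this
unit, so here is a minimal sketch with invented data:

import pandas as pd

wide = pd.DataFrame({'name': ['John', 'Anna'],
                     'math': [90, 85],
                     'physics': [80, 95]})

# Wide to long format
long = wide.melt(id_vars='name', var_name='subject', value_name='score')
print(long)

# Long back to wide format
print(long.pivot(index='name', columns='subject', values='score'))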

What is selection and filtering in Pandas? Give an example.

Selection and filtering are essential operations in pandas, a powerful data
manipulation library in Python. Here are some ways to select and filter data in
pandas:
Selection:

1. Label-based selection: Use the loc attribute to select rows and columns by label.
- df.loc[row_labels, column_labels]
2. Integer-based selection: Use the iloc attribute to select rows and columns by
integer position.
- df.iloc[row_positions, column_positions]
3. Conditional selection: Use boolean indexing to select rows based on conditions.
- df[condition]

Filtering:

1. Boolean indexing: Use boolean conditions to filter rows.
- df[condition]
2. Query: Use the query method to filter rows using a SQL-like syntax.
- df.query('condition')
3. Filtering with isin: Use the isin method to filter rows based on a list of values.
- df[df['column'].isin(values)]

Some examples:

- Select rows where the value in the 'age' column is greater than 30: df[df['age'] >
30]
- Select rows where the value in the 'country' column is either 'USA' or 'Canada':
df[df['country'].isin(['USA', 'Canada'])]
- Select rows where the value in the 'name' column starts with 'J':
df[df['name'].str.startswith('J')]
These are just a few examples of the many ways to select and filter data in pandas.
Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
        'Age': [28, 24, 35, 32, 40],
        'Country': ['USA', 'UK', 'USA', 'Canada', 'UK']}
df = pd.DataFrame(data)

# Selection
print("Original DataFrame:")
print(df)

# Select rows where Age is greater than 30
print("\nRows where Age > 30:")
print(df[df['Age'] > 30])

# Select rows where Country is USA or UK
print("\nRows where Country is USA or UK:")
print(df[df['Country'].isin(['USA', 'UK'])])

# Select rows where Name starts with 'P'
print("\nRows where Name starts with 'P':")
print(df[df['Name'].str.startswith('P')])

# Filtering
print("\nFiltering rows where Age is greater than 30:")
print(df.query('Age > 30'))

# Filter rows where Country is USA and Age is greater than 30
print("\nRows where Country is USA and Age > 30:")
print(df[(df['Country'] == 'USA') & (df['Age'] > 30)])

Output:

Original DataFrame:
    Name  Age Country
0   John   28     USA
1   Anna   24      UK
2  Peter   35     USA
3  Linda   32  Canada
4   Phil   40      UK

Rows where Age > 30:
    Name  Age Country
2  Peter   35     USA
3  Linda   32  Canada
4   Phil   40      UK

Rows where Country is USA or UK:
    Name  Age Country
0   John   28     USA
1   Anna   24      UK
2  Peter   35     USA
4   Phil   40      UK

Rows where Name starts with 'P':
    Name  Age Country
2  Peter   35     USA

Filtering rows where Age is greater than 30:
    Name  Age Country
2  Peter   35     USA
3  Linda   32  Canada
4   Phil   40      UK

Rows where Country is USA and Age > 30:
    Name  Age Country
2  Peter   35     USA
These examples demonstrate how to select and filter rows in a pandas DataFrame
using various conditions.

What is Sorting and Ranking in Pandas?


Sorting in pandas refers to the process of rearranging the rows of a DataFrame in a
specific order based on the values of one or more columns. This can be done in
either ascending or descending order.

Ranking in pandas refers to the process of assigning a rank to each row based on
the values of one or more columns. This can be useful for identifying the top or
bottom performers, or for creating a leaderboard.

Here are some key differences between sorting and ranking:

Sorting:

- Reorders the entire DataFrame
- Can be done in ascending or descending order
- Does not assign a rank to each row

Ranking:

- Assigns a rank to each row based on the values of one or more columns
- Can be done in ascending or descending order
- Does not reorder the entire DataFrame (although it can be used in conjunction
with sorting)

Some common use cases for sorting and ranking in pandas include:

- Sorting:
- Organizing data in alphabetical or numerical order
- Preparing data for visualization or analysis
- Ranking:
- Identifying top or bottom performers
- Creating a leaderboard or scoring system
- Assigning a percentile or quartile rank to each row

Pandas provides various functions for sorting and ranking, including:

- sort_values(): Sorts the DataFrame by one or more columns
- sort_index(): Sorts the DataFrame by its index
- rank(): Assigns a rank to each row based on the values of one or more columns
- nlargest() and nsmallest(): Return the top or bottom N rows based on the values
of one or more columns
Here are some examples of sorting and ranking in pandas:

Sorting:
1. Sort by a single column:

df.sort_values(by='column_name')

2. Sort by multiple columns:

df.sort_values(by=['column1', 'column2'])

3. Sort in descending order:

df.sort_values(by='column_name', ascending=False)

4. Sort in place (modify the original DataFrame):

df.sort_values(by='column_name', inplace=True)

Ranking:

1. Rank by a single column:

df['rank'] = df['column_name'].rank()

2. Rank by multiple columns:

df['rank'] = df[['column1', 'column2']].apply(tuple, axis=1).rank()

3. Rank in descending order:

df['rank'] = df['column_name'].rank(ascending=False)

4. Rank with specific method (e.g., min, max, dense, etc.):

df['rank'] = df['column_name'].rank(method='min')

Here's an example code snippet:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
'Age': [28, 24, 35, 32, 40],
'Score': [90, 85, 95, 88, 92]}
df = pd.DataFrame(data)

# Sort by Age in ascending order
print("Sorted by Age:")
print(df.sort_values(by='Age'))

# Rank by Score in descending order
df['rank'] = df['Score'].rank(ascending=False)
print("\nRanked by Score:")
print(df)

Output:

Sorted by Age:
    Name  Age  Score
1   Anna   24     85
0   John   28     90
3  Linda   32     88
2  Peter   35     95
4   Phil   40     92

Ranked by Score:
    Name  Age  Score  rank
0   John   28     90   3.0
1   Anna   24     85   5.0
2  Peter   35     95   1.0
3  Linda   32     88   4.0
4   Phil   40     92   2.0

What is summarizing and computing statistics in pandas?


Summarizing in pandas refers to the process of reducing a large dataset into a
smaller, more manageable form, while still maintaining the essential characteristics
of the data. This can be done using various summary statistics, such as:
- Count: Number of non-missing values
- Mean: Average value
- Median: Middle value
- Mode: Most frequent value
- Standard Deviation: Measure of variability
- Variance: Measure of spread
- Minimum and Maximum values
- Quartiles (25th, 50th, 75th percentiles)
- Percentiles (e.g., 10th, 90th percentiles)

Computing descriptive statistics in pandas involves calculating these summary
statistics to understand the distribution, central tendency, and variability of the
data. This can be done using various pandas functions, such as:

- describe(): Generates a summary of the central tendency, dispersion, and shape of
the dataset's distribution.
- mean(), median(), mode(), std(), var(), min(), max(), quantile(): Calculate specific
summary statistics.
- groupby(): Calculate summary statistics for each group of a categorical variable.
- pivot_table(): Create a spreadsheet-style summary of the data.

Some common use cases for summarizing and computing descriptive statistics in
pandas include:

- Exploratory data analysis (EDA)
- Data cleaning and preprocessing
- Feature engineering
- Data visualization
- Statistical modeling

By summarizing and computing descriptive statistics, you can:

- Understand the distribution and characteristics of your data
- Identify patterns, trends, and correlations
- Inform data-driven decisions
- Prepare data for machine learning or statistical modeling

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda', 'Phil'],
        'Age': [28, 24, 35, 32, 40],
        'Score': [90, 85, 95, 88, 92]}
df = pd.DataFrame(data)

# Summarize the DataFrame
print("Summary:")
print(df.describe())

# Compute descriptive statistics
# (numeric_only=True skips the non-numeric 'Name' column, which recent
# pandas versions would otherwise refuse to aggregate)
print("\nMean:")
print(df.mean(numeric_only=True))

print("\nMedian:")
print(df.median(numeric_only=True))

print("\nMode:")
print(df.mode(numeric_only=True))  # every value occurs once, so each is a mode

print("\nStandard Deviation:")
print(df.std(numeric_only=True))

print("\nVariance:")
print(df.var(numeric_only=True))

print("\nMinimum and Maximum values:")
print(df.min(numeric_only=True))
print(df.max(numeric_only=True))

print("\nQuantiles (25th, 50th, 75th percentiles):")
print(df.quantile([0.25, 0.5, 0.75], numeric_only=True))

# Group by 'Name' and compute mean 'Score'
print("\nMean Score by Name:")
print(df.groupby('Name')['Score'].mean())

# Pivot table to compute mean 'Score' by 'Age' group
print("\nMean Score by Age:")
print(df.pivot_table(values='Score', index='Age', aggfunc='mean'))

Output:
Summary:
             Age      Score
count   5.000000   5.000000
mean   31.800000  90.000000
std     6.180615   3.807887
min    24.000000  85.000000
25%    28.000000  88.000000
50%    32.000000  90.000000
75%    35.000000  92.000000
max    40.000000  95.000000

Mean:
Age      31.8
Score    90.0
dtype: float64

Median:
Age      32.0
Score    90.0
dtype: float64

Mode:
   Age  Score
0   24     85
1   28     88
2   32     90
3   35     92
4   40     95

Standard Deviation:
Age      6.180615
Score    3.807887
dtype: float64

Variance:
Age      38.2
Score    14.5
dtype: float64

Minimum and Maximum values:
Age      24
Score    85
dtype: int64
Age      40
Score    95
dtype: int64

Quantiles (25th, 50th, 75th percentiles):
       Age  Score
0.25  28.0   88.0
0.50  32.0   90.0
0.75  35.0   92.0

Mean Score by Name:
Name
Anna     85.0
John     90.0
Linda    88.0
Peter    95.0
Phil     92.0
Name: Score, dtype: float64

Mean Score by Age:
     Score
Age
24    85.0
28    90.0
32    88.0
35    95.0
40    92.0
This example demonstrates various ways to summarize and compute descriptive
statistics in pandas, including using the describe(), mean(), median(), mode(), std(),
var(), min(), max(), and quantile() functions.

What are descriptive statistics in pandas?


Descriptive statistics in pandas refer to the statistical measures that summarize and
describe the basic features of a dataset. These measures provide an overview of the
central tendency, dispersion, and shape of the data's distribution.

Common descriptive statistics in pandas include:

1. Mean: The average value of a column.
2. Median: The middle value of a column when sorted in ascending order.

3. Mode: The most frequently occurring value in a column.
4. Standard Deviation (std): A measure of the amount of variation or dispersion in
a column.
5. Variance: The average of the squared differences from the mean.
6. Minimum (min): The smallest value in a column.
7. Maximum (max): The largest value in a column.
8. Quantiles (q): Divide the data into equal-sized groups based on rank or position.
9. Interquartile Range (IQR): The difference between the 75th percentile (Q3) and
25th percentile (Q1).
10. Range: The difference between the maximum and minimum values.

Pandas provides various functions to calculate these descriptive statistics,
including:

- mean()
- median()
- mode()
- std()
- var()
- min()
- max()
- quantile()
- describe(): Generates a summary of the central tendency, dispersion, and shape of
the dataset's distribution.

These descriptive statistics are essential for understanding the distribution of your
data, identifying patterns and trends, and informing data-driven decisions.
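
For example, the IQR and range listed above can be computed from quantile(),
min(), and max(); a small sketch:

import pandas as pd

s = pd.Series([24, 28, 32, 35, 40])

q1 = s.quantile(0.25)     # 28.0
q3 = s.quantile(0.75)     # 35.0
print(q3 - q1)            # IQR: 7.0
print(s.max() - s.min())  # range: 16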

What are unique values in pandas?


In pandas, unique values are the distinct (non-duplicate) values in a column or
Series, i.e. the set of values that remain after duplicates are removed.

To find unique values in pandas, you can use the unique() function, which returns
an array of unique values. Here are some examples:

1. Get unique values in a column:
df['column_name'].unique()

2. Get unique values in a Series:
series.unique()

3. Count unique values:
df['column_name'].nunique()

4. Get unique rows in a DataFrame:
df.drop_duplicates()

5. Get unique values with frequency:
df['column_name'].value_counts()

Note: The unique() function returns an array of unique values, while nunique()
returns the count of unique values.

Example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['John', 'Anna', 'John', 'Linda', 'Anna', 'Phil'],
        'Age': [28, 24, 28, 32, 24, 40]}
df = pd.DataFrame(data)

# Get unique values in the 'Name' column
print(df['Name'].unique())

# Output: ['John' 'Anna' 'Linda' 'Phil']  (a NumPy array)

# Count unique values in the 'Name' column
print(df['Name'].nunique())

# Output: 4

# Get unique rows in the DataFrame
print(df.drop_duplicates())

# Output:
#     Name  Age
# 0   John   28
# 1   Anna   24
# 3  Linda   32
# 5   Phil   40

# Get unique values with frequency in the 'Name' column
print(df['Name'].value_counts())

# Output:
# Anna     2
# John     2
# Linda    1
# Phil     1

In this example, we demonstrate how to find unique values, count unique values,
get unique rows, and get unique values with frequency using pandas.

What is value_counts() in pandas?


In pandas, value_counts() is a function that returns a Series containing the count of
unique values in a Series or column of a DataFrame. It's a convenient way to get
the frequency of each unique value.

Here's what value_counts() does:

1. Counts the number of occurrences of each unique value.
2. Returns a Series with the unique values as the index and the counts as the
values.
3. Sorts the results in descending order by default (most frequent values first).

Example:

import pandas as pd

# Create a sample Series
s = pd.Series(['apple', 'banana', 'apple', 'orange', 'banana', 'banana'])

# Get the value counts
print(s.value_counts())

# Output:
# banana    3
# apple     2
# orange    1

In this example, the value_counts() function returns a Series with the unique values
('banana', 'apple', 'orange') as the index and their respective counts (3, 2, 1) as the
values.

You can also use value_counts() on a DataFrame column:

# Create a sample DataFrame
df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'banana']})

# Get the value counts for the 'fruit' column
print(df['fruit'].value_counts())

# Output:
# banana 3
# apple 2
# orange 1

Note that value_counts() has some optional parameters:

- normalize: If True, returns the relative frequencies instead of counts.
- sort: If False, doesn't sort the results.
- ascending: If True, sorts the results in ascending order.
- bins: Groups numeric values into the given number of bins and counts per bin,
instead of counting each exact value.
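
For example, normalize=True on the 'fruit' column above returns relative
frequencies:

print(df['fruit'].value_counts(normalize=True))

# Output:
# banana    0.500000
# apple     0.333333
# orange    0.166667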

How to handle missing data in pandas?

Handling missing data in pandas involves identifying, removing, or replacing
missing values. Here are some steps to handle missing data in pandas using
Python:

1. Identify missing values:
- Use isnull() or isna() to detect missing values.
- Use sum() (e.g., df.isnull().sum()) to count missing values per column.
2. Remove missing values:
- Use dropna() to remove rows or columns with missing values.
- Use dropna(how='all') to remove rows with all missing values.
3. Replace missing values:
- Use fillna() to replace missing values with a specified value.
- Use fillna(method='ffill') to forward-fill missing values.
- Use fillna(method='bfill') to backward-fill missing values.
4. Interpolate missing values:
- Use interpolate() to interpolate missing values.
5. Impute missing values:
- Use SimpleImputer from scikit-learn to impute missing values.

Example code:

import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Identify missing values
print(df.isnull())

# Each technique below starts from the original df; chaining them with
# inplace=True would leave nothing for the later steps to handle.

# Remove rows with missing values
df_dropped = df.dropna()

# Replace missing values with a specified value
df_filled = df.fillna(0)

# Interpolate missing values
df_interpolated = df.interpolate(method='linear')

# Impute missing values using SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
df_imputed = df.copy()
df_imputed[['A', 'B']] = imputer.fit_transform(df_imputed[['A', 'B']])

Note: The choice of method depends on the nature of the data and the problem
you're trying to solve.
What is filtering out missing data in pandas?
Filtering out missing data means removing or excluding rows or columns that
contain missing or null values from a dataset. This is a common data preprocessing
step in data analysis and machine learning to ensure that the data is complete and
consistent.
Missing data can be represented in different ways, such as:

- NaN (Not a Number)
- None
- Null
- Empty strings
- Special values like -999 or 999

Filtering out missing data can be done using various techniques, including:

1. Listwise deletion: Removing rows with missing values.
2. Pairwise deletion: Removing rows with missing values for a specific analysis or
calculation.
calculation.
3. Mean/Median imputation: Replacing missing values with the mean or median of
the respective column.
4. Forward/Backward fill: Replacing missing values with the previous or next
value in the same column.
5. Interpolation: Estimating missing values based on surrounding values.

Filtering out missing data is important because:

1. Prevents bias: Missing data can lead to biased results if not handled properly.
2. Improves accuracy: Complete data leads to more accurate analysis and
modeling.
3. Enhances reliability: Filtering out missing data ensures that the results are
reliable and consistent.

However, it's essential to consider the nature of the data and the problem you're
trying to solve before filtering out missing data. In some cases, missing data may
be informative or important for the analysis.
import pandas as pd
import numpy as np

# Create a sample DataFrame with missing values
data = {'Name': ['John', 'Anna', np.nan, 'Linda', 'Phil'],
        'Age': [28, np.nan, 35, 32, 40],
        'Score': [90, 85, np.nan, 88, 92]}
df = pd.DataFrame(data)

# Print the original DataFrame
print("Original DataFrame:")
print(df)

# Filter out rows with missing values
df_filtered = df.dropna()

# Print the filtered DataFrame
print("\nFiltered DataFrame:")
print(df_filtered)

# Filter out columns with missing values
# (every column here contains at least one NaN, so all columns are dropped)
df_filtered_columns = df.dropna(axis=1)

# Print the filtered DataFrame
print("\nFiltered DataFrame (columns):")
print(df_filtered_columns)

# Filter out rows with missing values in a specific column
df_filtered_name = df[df['Name'].notnull()]

# Print the filtered DataFrame
print("\nFiltered DataFrame (Name column):")
print(df_filtered_name)

Output:

Original DataFrame:
    Name   Age  Score
0   John  28.0   90.0
1   Anna   NaN   85.0
2    NaN  35.0    NaN
3  Linda  32.0   88.0
4   Phil  40.0   92.0

Filtered DataFrame:
    Name   Age  Score
0   John  28.0   90.0
3  Linda  32.0   88.0
4   Phil  40.0   92.0

Filtered DataFrame (columns):
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

Filtered DataFrame (Name column):
    Name   Age  Score
0   John  28.0   90.0
1   Anna   NaN   85.0
3  Linda  32.0   88.0
4   Phil  40.0   92.0
In this example, we demonstrate how to filter out missing data in pandas using the
dropna() function, which removes rows or columns with missing values (note that
every column here contains at least one NaN, so dropping by column removes all of
them). We also show how to filter out rows with missing values in a specific
column using the notnull() function.
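
dropna() also takes how, thresh, and subset parameters for finer control; a short
sketch on the same df (the threshold of 2 is chosen just for illustration):

# Keep only rows with at least 2 non-missing values
print(df.dropna(thresh=2))

# Drop a row only if ALL of its values are missing
print(df.dropna(how='all'))

# Consider only 'Age' and 'Score' when deciding which rows to drop
print(df.dropna(subset=['Age', 'Score']))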
