DA Unit 2 15m Handling Missing Data

The document discusses the challenges posed by missing data in datasets, categorizing it into three types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). It outlines methods for identifying missing data using functions in Pandas and strategies for handling missing data, including deletion and imputation techniques such as mean/median/mode imputation and K-Nearest Neighbors (KNN) imputation. The document emphasizes the importance of addressing missing data to ensure accurate analysis and reliable conclusions.

Uploaded by

jeyakarthika cs
Copyright © All Rights Reserved

Data Transformation: Handling Missing Data

Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like "NA" or "unknown." These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.

Types of Missing Values

There are three main types of missing values:

1. Missing Completely at Random (MCAR): MCAR is a type of missing data in which the probability of a data point being missing is entirely random and independent of every variable in the dataset, observed or unobserved. In simpler terms, whether a value is missing has nothing to do with the values of other variables or with the characteristics of the data point itself.

2. Missing at Random (MAR): MAR is a type of missing data where the probability of a data point being missing depends on the values of other, observed variables in the dataset, but not on the missing value itself. The missingness mechanism is therefore not entirely random, but it can be predicted from the available information.

3. Missing Not at Random (MNAR): MNAR is the most challenging type of missing data to deal with. It occurs when the probability of a data point being missing is related to the missing value itself. The reason for the missing data is therefore informative and directly associated with the variable that is missing.
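The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch, not part of the original notes: the column names `age` and `income`, the missingness probabilities, and the thresholds are all arbitrary assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data: income rises with age, plus random noise.
age = rng.integers(20, 70, size=n)
income = 20_000 + 700 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance that income is missing depends only on the observed age.
mar_mask = rng.random(n) < np.where(df["age"] > 50, 0.4, 0.1)

# MNAR: the chance that income is missing depends on income itself.
mnar_mask = rng.random(n) < np.where(df["income"] > 60_000, 0.5, 0.05)

full_mean = df["income"].mean()
observed_means = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed = df["income"].mask(mask)  # set masked entries to NaN
    observed_means[name] = observed.mean()  # mean() skips NaN by default
    print(f"{name}: missing share = {observed.isna().mean():.3f}, "
          f"observed mean = {observed_means[name]:.0f} (full mean = {full_mean:.0f})")
```

In this sketch the mean of the observed incomes stays close to the full mean under MCAR, while under MAR and MNAR it is pulled downward because higher incomes are removed more often. Under MAR the distortion could in principle be corrected using `age`; under MNAR the information needed for the correction is exactly what is missing.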

Methods for Identifying Missing Data


Locating and understanding the patterns of missingness in a dataset is an important step in addressing its impact on analysis. Pandas provides several useful functions for detecting, removing, and replacing null values in a DataFrame.

Functions             Descriptions

.isnull()             Identifies missing values in a Series or DataFrame. Returns True for missing values and False for non-missing values.

.notnull()            Checks for non-missing values in a Series or DataFrame. Returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.

.info()               Displays information about the DataFrame, including data types, memory usage, and the non-null count of each column, which reveals missing values.

.isna()               Alias of .isnull() and the inverse of .notnull(): returns True for missing values and False for non-missing values.

.dropna()             Drops rows or columns containing missing values based on custom criteria.

.fillna()             Fills missing values with specific values, means, medians, or other calculated values.

.replace()            Replaces specific values with other values, facilitating data correction and standardization.

.drop_duplicates()    Removes duplicate rows based on specified columns.

.unique()             Finds unique values in a Series.
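As a brief illustration of the detection functions listed above, the sketch below counts and locates missing values in a small DataFrame. The data and the variable names (`missing_per_column`, `rows_with_missing`, and so on) are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in Age and Salary.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [28, np.nan, 35, np.nan],
    "Salary": [50000, 54000, np.nan, 58000],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Share of missing values in each column (isna is an alias of isnull).
missing_share = df.isna().mean()
print(missing_share)

# Rows that contain at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Rows in which every value is present.
complete_rows = df[df.notnull().all(axis=1)]
print(complete_rows)

# Summary of dtypes and non-null counts per column.
df.info()
```

Summing a boolean mask works because True counts as 1 and False as 0, so `isnull().sum()` gives a count and `isna().mean()` gives a proportion.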


How to Handle Missing Data

Missing data is a common headache in any field that deals with datasets. It can arise for various reasons, from human error during data collection to limitations of data gathering methods. Luckily, there are strategies to address missing data and minimize its impact on your analysis. Here are two main approaches:

• Deletion: This involves removing rows or columns with missing values. This is a straightforward method, but it can be problematic if a significant portion of your data is missing. Discarding too much data can affect the reliability of your conclusions.

• Imputation: This replaces missing values with estimates. There are various imputation techniques, each with its strengths and weaknesses. Here are some common ones:

o Mean/Median/Mode Imputation: Replace missing entries with the average (mean), middle value (median), or most frequent value (mode) of the corresponding column. This is a quick and easy approach, but it reduces the variability of the column and can introduce bias if the missing data is not randomly distributed.

o K-Nearest Neighbors (KNN) Imputation: This method finds the closest data points (neighbors) based on available features and uses their values to estimate the missing value. KNN is useful when you have a lot of data and the missing values are scattered.

o Model-based Imputation: This involves creating a statistical model to predict the missing values based on other features in the data. This can be a powerful technique, but it requires more expertise and can be computationally expensive.
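KNN imputation can be sketched with scikit-learn's `KNNImputer`. This is an illustrative example under stated assumptions, not the notes' own method: the data, the `Experience` column, and the choice of `n_neighbors=2` are made up, and the features are standardized first because KNN relies on distances and would otherwise be dominated by the large salary values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one missing Age and one missing Salary.
df = pd.DataFrame({
    "Age": [25, 27, np.nan, 52, 55],
    "Experience": [2, 4, 3, 25, 30],
    "Salary": [40000, 42000, 41000, 90000, np.nan],
})

# Standardize so that every feature contributes comparably to the distance.
# StandardScaler ignores NaN when fitting and keeps it when transforming.
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Replace each missing entry with the average of that feature
# over the 2 nearest rows in which the feature is observed.
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Convert back to the original units.
imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=df.columns,
    index=df.index,
)
print(imputed)
```

The row with the missing age has low experience and salary, so its neighbors are the two youngest employees and the imputed age falls between theirs. A larger `n_neighbors` smooths the estimate but pulls it toward the overall column mean.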
Handling missing values

    import pandas as pd
    import numpy as np

    # Sample data with missing entries in Age and Salary
    data = {
        'Name': ['John', 'Peter', 'Anna', 'Linda', 'Tom'],
        'Age': [28, np.nan, 35, 32, np.nan],
        'Salary': [50000, 54000, np.nan, 58000, 62000],
    }

    df = pd.DataFrame(data)

    # 1. Dropping rows with missing values
    df_drop = df.dropna()
    print(df_drop)

    # 2. Filling missing values with a specific value
    df_filled = df.fillna(0)

    # 3. Filling with the column mean
    df_mean = df.copy()
    df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
    df_mean['Salary'] = df_mean['Salary'].fillna(df_mean['Salary'].mean())
    print(df_mean)

    # 4. Backward fill: each gap takes the next valid value below it
    #    (same effect as the older fillna(method='bfill'))
    df_bfill = df.bfill()
    print(df_bfill)

    # 5. Dropping columns with missing values
    df_col = df.dropna(axis=1)
    print(df_col)
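Alongside mean filling, the median and mode variants of imputation described earlier follow the same pattern. The sketch below is illustrative only: the `Department` column is a hypothetical categorical feature added to show mode imputation, which is the usual choice for non-numeric data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric column and a categorical column, both with gaps.
df = pd.DataFrame({
    "Age": [28, np.nan, 35, 32, np.nan],
    "Department": ["HR", "IT", None, "IT", "Sales"],
})

df_imputed = df.copy()

# Median imputation: less sensitive to outliers than the mean.
df_imputed["Age"] = df_imputed["Age"].fillna(df_imputed["Age"].median())

# Mode imputation: mode() returns a Series (there may be ties), so take the first entry.
most_frequent = df_imputed["Department"].mode()[0]
df_imputed["Department"] = df_imputed["Department"].fillna(most_frequent)

print(df_imputed)
```

Here the median of the observed ages (28, 32, 35) is 32 and the most frequent department is "IT", so those values fill the gaps.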
