DA Unit 2 15m Handling Missing Data
Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various
ways, such as blank cells, null values, or special symbols like "NA" or "unknown." These missing data points pose
a significant challenge in data analysis and can lead to inaccurate or biased results.
1. Missing Completely at Random (MCAR): MCAR is a specific type of missing data in which the probability
of a data point being missing is entirely random and independent of any other variable in the dataset. In
simpler terms, whether a value is missing or not has nothing to do with the values of other variables or the
missing value itself.
2. Missing at Random (MAR): MAR is a type of missing data where the probability of a data point being missing
depends on the values of other variables in the dataset, but not on the missing variable itself. This means that
the missingness mechanism is not entirely random, but can be predicted based on the available information.
3. Missing Not at Random (MNAR): MNAR is the most challenging type of missing data to deal with. It occurs
when the probability of a data point being missing is related to the missing value itself. This means that the
reason for the missing data is informative and directly associated with the variable that is missing.
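The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the variables `age` and `income`, the sample size, and the missingness probabilities are all assumptions chosen for the demonstration, not values from these notes. It hides income values under each mechanism and compares the mean of the remaining values with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (assumed for illustration): income rises with age, plus noise
age = rng.integers(20, 60, size=n)
income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3

# MAR: the chance of a missing income depends only on the observed variable age
mar_mask = rng.random(n) < np.where(df["age"] > 40, 0.5, 0.1)

# MNAR: the chance of a missing income depends on the income value itself
high_income = df["income"] > df["income"].median()
mnar_mask = rng.random(n) < np.where(high_income, 0.5, 0.1)

# Bias of the naive "mean of what is left" estimate under each mechanism
true_mean = df["income"].mean()
bias = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    bias[name] = df.loc[~mask, "income"].mean() - true_mean
    print(f"{name}: observed mean minus true mean = {bias[name]:.0f}")
```

In this setup the MCAR estimate stays close to the true mean, while the MAR and MNAR estimates are pulled downward. The difference between the last two is that the MAR bias can in principle be corrected using the observed `age` column, whereas the MNAR bias cannot be corrected from the observed data alone.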
Working with Missing Data in Pandas
In Pandas, there are several useful functions for working with missing data.
Functions    Descriptions
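As a minimal sketch of the kind of functions involved, the example below uses `isna()`, `notna()`, `dropna()`, and `fillna()`, which are standard Pandas methods for detecting, removing, and replacing missing values. The small DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): two numeric columns contain missing values
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [28, np.nan, 35, 32, np.nan],
    "Salary": [50000, 54000, np.nan, 58000, 62000],
})

# Detection: boolean masks marking missing / present cells
missing_mask = df.isna()        # isnull() is an alias of isna()
present_mask = df.notna()

# Count of missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Removal: drop rows (default) or columns (axis=1) that contain any missing value
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)

# Replacement: fill every missing cell with a constant
filled_zero = df.fillna(0)
```

Note how much is lost by deletion here: only two of the five rows survive `dropna()`, and only the `Name` column survives `dropna(axis=1)`.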
Missing data is a common headache in any field that deals with datasets. It can arise for various reasons, from human
error during data collection to limitations of data gathering methods. Luckily, there are strategies to address missing
data and minimize its impact on your analysis. Here are two main approaches:
• Deletion: This involves removing rows or columns with missing values. This is a straightforward method, but
it can be problematic if a significant portion of your data is missing. Discarding too much data can affect the
reliability of your conclusions.
• Imputation: This replaces missing values with estimates. There are various imputation techniques, each with
its own strengths and weaknesses:
o Mean/Median/Mode Imputation: Replace missing entries with the average (mean), middle value
(median), or most frequent value (mode) of the corresponding column. This is a quick and easy
approach, but it can introduce bias if the missing data is not randomly distributed.
o K-Nearest Neighbors (KNN) Imputation: This method finds the closest data points (neighbors) based
on available features and uses their values to estimate the missing value. KNN is useful when you
have a lot of data and the missing values are scattered.
o Model-based Imputation: This involves creating a statistical model to predict the missing values based
on other features in the data. This can be a powerful technique, but it requires more expertise and can
be computationally expensive.
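To illustrate the KNN idea, here is a hand-written sketch using only NumPy. The function `knn_impute` is a hypothetical helper written for this example, not a library function, and it makes simplifying assumptions: distances are computed only over features that both rows have observed, features are not rescaled, and the estimate is the plain average of the `k` nearest neighbors. In practice a library implementation such as scikit-learn's `KNNImputer` would normally be used instead.

```python
import numpy as np

def knn_impute(values, k=2):
    """Fill each missing cell with the mean of that column over the k nearest rows."""
    X = np.asarray(values, dtype=float)
    filled = X.copy()
    missing_rows, missing_cols = np.where(np.isnan(X))
    for i, j in zip(missing_rows, missing_cols):
        # Candidate neighbors: rows where column j is observed
        donors = np.where(~np.isnan(X[:, j]))[0]
        distances = np.full(len(donors), np.inf)
        for pos, d in enumerate(donors):
            # Compare only on features observed in both rows
            shared = ~np.isnan(X[i]) & ~np.isnan(X[d])
            if shared.any():
                diff = X[i, shared] - X[d, shared]
                distances[pos] = np.sqrt(np.mean(diff ** 2))
        # Rows with no shared features keep distance inf and are chosen last
        nearest = donors[np.argsort(distances)[:k]]
        filled[i, j] = X[nearest, j].mean()
    return filled

# Columns: Age, Salary (illustrative values)
data = np.array([
    [28, 50000],
    [np.nan, 54000],
    [35, np.nan],
    [32, 58000],
    [np.nan, 62000],
])
imputed = knn_impute(data, k=2)
print(imputed)
```

Because the features here are on very different scales, a real application would usually standardize them before computing distances; this sketch skips that step to keep the logic visible.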
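Handling missing values
import pandas as pd
import numpy as np

data = {
    'Name': ['John', 'Peter', 'Anna', 'Linda', 'Tom'],
    'Age': [28, np.nan, 35, 32, np.nan],
    'Salary': [50000, 54000, np.nan, 58000, 62000]
}
df_mean = pd.DataFrame(data)
# Replace missing ages with the mean of the observed ages
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
# Replace missing salaries with the mean of the observed salaries
df_mean['Salary'] = df_mean['Salary'].fillna(df_mean['Salary'].mean())
print(df_mean)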
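Median and mode imputation follow the same `fillna` pattern as mean imputation. The sketch below is a hedged illustration: the `Department` column and its values are hypothetical, added only to show mode imputation on a categorical column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [28, np.nan, 35, 32, np.nan],
    "Department": ["HR", "IT", None, "IT", "Sales"],  # hypothetical categorical column
})

df_filled = df.copy()

# Median imputation: less sensitive to outliers than the mean
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].median())

# Mode imputation: mode() returns a Series (there can be ties), so take the first entry
most_frequent = df_filled["Department"].mode()[0]
df_filled["Department"] = df_filled["Department"].fillna(most_frequent)

print(df_filled)
```

The median is often preferred for skewed numeric columns such as salaries, while the mode is the natural choice for categorical columns where a mean is not defined.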
Backward