Statistical Computing With R: Masters in Data Sciences 503 (S29) Third Batch, SMS, TU, 2024
• There are two main ways this is commonly done: either with replicate() or with for() loops.
• Say we have a vector x, which represents 30 observations of an animal length (mm):
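The slide does not say which task "this" refers to, so the sketch below assumes it means repeating a computation many times, here resampling x and taking the mean. The simulated values of x, the seed, and the names n_rep, means_rep and means_loop are all hypothetical.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

# Hypothetical data: 30 simulated animal lengths in mm (not from the slides)
x <- rnorm(30, mean = 100, sd = 10)

n_rep <- 1000  # hypothetical number of repetitions

# Way 1: replicate() evaluates the expression n_rep times and collects the results
means_rep <- replicate(n_rep, mean(sample(x, replace = TRUE)))

# Way 2: for() loop filling a preallocated numeric vector
means_loop <- numeric(n_rep)
for (i in seq_len(n_rep)) {
  means_loop[i] <- mean(sample(x, replace = TRUE))
}

# Both approaches give a vector of n_rep resampled means
summary(means_rep)
summary(means_loop)
```

replicate() is the shorter form; the for() loop is easier to extend when each repetition needs to store more than one result.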
• prop.table(table(data$admit))
       0      1
  0.6825 0.3175
• Class imbalance: the dependent variable “admit” has 273 (68.25%) cases in category 0 (not admitted) and 127 (31.75%) cases in category 1 (admitted).
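A minimal sketch of the class-balance check. The original data set is not included in the slides, so a hypothetical data frame is rebuilt here from the counts quoted above (273 zeros and 127 ones).

```r
# Hypothetical stand-in for the course data, rebuilt from the quoted counts
data <- data.frame(admit = c(rep(0, 273), rep(1, 127)))

# Absolute counts per class
counts <- table(data$admit)

# Proportions per class: prop.table() divides each count by the total
props <- prop.table(counts)

print(counts)
print(props)
```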
• Real-world data often has a lot of missing values. The cause of missing values can be data corruption or failure to record data.
• The handling of missing data is very important during the preprocessing of the dataset, as many machine learning algorithms do not support missing values.
• Visit the link to learn more about handling missing values:
• The 7 ways to handle missing values in the dataset:
  • Deleting rows with missing values
  • Impute missing values for continuous variables (mean, median etc.)
  • Impute missing values for categorical variables (predict the categories)
  • Other imputation methods
  • Using algorithms that support missing values
  • Prediction of missing values
  • Imputation using a deep learning library: Datawig
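The first options in the list above can be sketched in base R as follows. The toy data frame and its column names are hypothetical. For the categorical variable, the most frequent category is used as a simple stand-in for predicting the category with a model.

```r
# Hypothetical toy data with missing values in both columns
df <- data.frame(
  age   = c(25, NA, 31, 40, NA, 28),
  group = factor(c("a", "b", NA, "a", "a", "b"))
)

# Option 1: delete every row that has at least one missing value
df_deleted <- na.omit(df)

# Option 2: impute a continuous variable with the mean of the observed values
df_imputed <- df
df_imputed$age[is.na(df_imputed$age)] <- mean(df$age, na.rm = TRUE)

# Option 3 (simplified): impute a categorical variable with its most frequent level
mode_group <- names(which.max(table(df$group)))
df_imputed$group[is.na(df_imputed$group)] <- mode_group

print(df_deleted)
print(df_imputed)
```

Deletion shrinks the sample, while single-value imputation keeps every row but understates the variability of the imputed variable.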
Missing values checking and handling in R:
#Check missing values in R
• colSums(is.na(data_frame))
• sum(is.na(data_frame$column_name))
#Strategies
• List-wise deletion
• Pair-wise deletion
• Mean/Mode/Median imputation
• Generalized imputation
• Similar case imputation
• Prediction model
• KNN imputation
#List of R Packages
• MICE
• Amelia
• missForest
• Hmisc
• mi
• etc.
https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17
https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/
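A short sketch of the checks and the two deletion strategies listed above, using base R only. The data frame and its columns are hypothetical. Pair-wise deletion is illustrated with a correlation matrix, which is one common place where it is applied.

```r
# Hypothetical toy data: one missing value in each column
df <- data.frame(
  height = c(150, NA, 165, 172, 158),
  weight = c(50, 62, NA, 80, 55),
  age    = c(21, 34, 29, NA, 40)
)

# Number of missing values in every column
na_per_column <- colSums(is.na(df))

# Number of missing values in a single column
na_in_height <- sum(is.na(df$height))

# List-wise deletion: keep only the rows that are complete on all variables
df_listwise <- df[complete.cases(df), ]

# Pair-wise deletion: each correlation uses all rows complete for that pair
cor_pairwise <- cor(df, use = "pairwise.complete.obs")

print(na_per_column)
print(df_listwise)
print(cor_pairwise)
```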
MICE package:
• MICE (Multivariate Imputation via Chained Equations) is one of the packages most commonly used by R users. Creating multiple imputations, as compared to a single imputation (such as the mean), takes care of the uncertainty in the missing values.
• MICE assumes that the missing data are Missing at Random (MAR), which means that the probability that a value is missing depends only on the observed values and can be predicted using them.
• It imputes data on a variable-by-variable basis by specifying an imputation model per variable.
• The methods used by this package are:
  • pmm (Predictive Mean Matching): for numeric variables
  • logreg (Logistic Regression): for binary variables (2 levels)
  • polyreg (Bayesian polytomous regression): for unordered factor variables (more than 2 levels)
  • polr (Proportional odds model): for ordered factor variables (more than 2 levels)
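A minimal sketch of a multiple-imputation workflow with the mice package, assuming the package is installed. It uses nhanes, a small example data set that ships with mice, rather than the course data. The choices m = 5, the seed, and the model bmi ~ age are illustrative assumptions.

```r
# Assumption: the mice package is installed; the sketch is skipped otherwise
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # nhanes is an example data set shipped with mice; it contains missing values
  print(colSums(is.na(nhanes)))

  # Create 5 imputed data sets with predictive mean matching
  imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

  # Imputation method actually used per variable ("" for complete variables)
  print(imp$method)

  # Extract the first completed data set
  completed <- complete(imp, 1)

  # Fit the same model on every imputed data set and pool the estimates
  fit <- with(imp, lm(bmi ~ age))
  pooled <- pool(fit)
  print(summary(pooled))
}
```

Pooling across the imputed data sets is what carries the uncertainty about the missing values into the final estimates, which a single mean imputation cannot do.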
• R notebook
Thank you!
@shitalbhandary