Module II
learning and R-Programming
Outliers
• Outliers are data points that significantly differ from other observations in a dataset.
• These observations can be unusually high or low in comparison to the majority of the data and
may distort statistical analyses and machine learning models if not handled properly.
• Outliers can occur due to various reasons, including measurement errors, experimental error,
or genuinely unusual observations.
Key points about outliers
• Identification: Outliers are typically identified using statistical methods such as the
interquartile range (IQR) or z-scores.
• Impact: Outliers can have a significant impact on statistical measures such as the
mean and standard deviation. Since these measures are sensitive to extreme values,
outliers can skew them, making them less representative of the central tendency and
variability of the data.
• Winsorizing
• Replacing outliers with the nearest non-outlier values.
• Transformation
• Applying mathematical transformations such as log transformation to make the data less
skewed.
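The points above can be sketched in base R. This is a minimal illustration on a made-up vector `x`; the data, the 1.5 × IQR rule of thumb, and the z-score cutoff of 2 are assumptions chosen for the example, not part of the original notes.

```r
# Made-up sample with one obvious outlier (illustrative data only)
x <- c(12, 15, 14, 13, 16, 15, 14, 95)

# Identification with the 1.5 * IQR rule of thumb
q <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- x < lower | x > upper
x[is_outlier]

# Identification with z-scores (cutoff of 2 used because the sample is tiny)
z <- (x - mean(x)) / sd(x)
x[abs(z) > 2]

# Impact: the mean moves a lot, the median barely changes
mean(x); median(x)
mean(x[!is_outlier]); median(x[!is_outlier])

# Winsorizing: replace each outlier with the nearest non-outlier value
x_wins <- x
x_wins[x > upper] <- max(x[!is_outlier])
x_wins[x < lower] <- min(x[!is_outlier])
x_wins

# Transformation: a log transform pulls in the long right tail
log(x)
```

Both the cutoffs and the winsorizing variant are judgment calls; other thresholds or percentile-based clamping may suit a given dataset better.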
• Data is often stored across several separate datasets. To work with data stored like this it is necessary to learn how to combine and
merge different datasets into a single data frame.
Combining Datasets in R
• If you have new data that has the same structure as your old data, it can be
added onto the end of your data frame with bind_rows().
• If you are instead looking to merge data sets based on some shared variable,
there are a number of joins that are useful.
• inner_join()
• left_join()
• full_join()
inner_join()
• The first join we will look at is an inner_join(). This function takes two data
frames as input and merges them based on their shared column.
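A minimal sketch of `bind_rows()` and the three joins, assuming the dplyr package is installed. The data frames `students` and `scores` and their columns are made up for illustration.

```r
library(dplyr)

# Two small made-up data frames that share the key column "id"
students <- data.frame(id = c(1, 2, 3), name = c("Asha", "Ben", "Chen"))
scores   <- data.frame(id = c(2, 3, 4), score = c(88, 92, 75))

# bind_rows(): append new rows that have the same structure
new_students <- data.frame(id = 4, name = "Dana")
all_students <- bind_rows(students, new_students)

# inner_join(): keep only ids present in both data frames
inner <- inner_join(students, scores, by = "id")

# left_join(): keep every row of students; unmatched scores become NA
left <- left_join(students, scores, by = "id")

# full_join(): keep every id from either data frame
full <- full_join(students, scores, by = "id")

inner
left
full
```

Naming the key explicitly with `by = "id"` avoids relying on dplyr guessing the shared column.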
Tips on Merging Data in R
• Before merging, always inspect your datasets using functions like head(), str(), and
summary(). This helps you understand the structure and identify key variables for merging.
• Ensure that the variables you're merging on are unique and don't have duplicates unless it's
intentional. This prevents unintended data duplication.
Specify Merge Type:
• R's merge function allows for different types of joins: left, right, inner, and outer.
Understand the differences and choose the one that best fits your needs.
• After merging, check for NA values. These can arise if there's no match for a particular
key. Decide how you want to handle these: remove, replace, or impute.
• If the datasets have columns with the same names but different data, R will append a
suffix (e.g., .x and .y) to distinguish them. Rename these columns if necessary for clarity.
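The tips above can be tried out with base R's `merge()`. This is a sketch on two made-up data frames (`left_df`, `right_df`) that share the key `id` and a clashing column `value`; all names are illustrative.

```r
# Made-up data frames sharing the key "id" and a clashing column "value"
left_df  <- data.frame(id = c(1, 2, 3), value = c(10, 20, 30))
right_df <- data.frame(id = c(2, 3, 4), value = c(200, 300, 400))

# Check that the key is unique before merging (0 means no duplicates)
anyDuplicated(left_df$id)
anyDuplicated(right_df$id)

# Join types are selected with the all / all.x / all.y arguments
inner_m <- merge(left_df, right_df, by = "id")                # inner
left_m  <- merge(left_df, right_df, by = "id", all.x = TRUE)  # left
right_m <- merge(left_df, right_df, by = "id", all.y = TRUE)  # right
outer_m <- merge(left_df, right_df, by = "id", all = TRUE)    # outer

# Clashing non-key columns receive the suffixes .x and .y
merged_names <- names(outer_m)
merged_names

# Count the NA values introduced by unmatched keys
na_counts <- colSums(is.na(outer_m))
na_counts

# Rename the suffixed columns for clarity
names(outer_m) <- c("id", "value_left", "value_right")
outer_m
```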
Sort Your Data:
• After merging, it's often helpful to sort your data using the order() function. This can make subsequent
analyses easier and more intuitive.
• For very large datasets, consider using the data.table package. It offers a faster merging process compared to
the base R merge function.
• Ensure that the key variables in both datasets have the same data type. For instance, merging on a character
variable in one dataset and a factor in another can lead to unexpected results.
Test on a Subset:
• If you're unsure about the merge, try it on a small subset of your data first. This allows you to quickly spot
and rectify any issues.
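A short sketch of aligning key types, testing on a subset, and sorting after the merge. The data frames `orders` and `customers` are made up for illustration, and base R `merge()` is used rather than data.table.

```r
# Made-up data: the key is character in one data frame and a factor in the other
orders    <- data.frame(cust = c("b", "a", "c"), amount = c(5, 7, 9),
                        stringsAsFactors = FALSE)
customers <- data.frame(cust = factor(c("a", "b", "c")),
                        city = c("Pune", "Delhi", "Kochi"))

# Compare the key types and align them before merging
class(orders$cust)
class(customers$cust)
customers$cust <- as.character(customers$cust)

# Try the merge on a small subset first
trial <- merge(head(orders, 2), customers, by = "cust")
trial

# Full merge, then sort by amount (largest first) with order()
merged <- merge(orders, customers, by = "cust")
merged <- merged[order(merged$amount, decreasing = TRUE), ]
merged
```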
Functions and Loops
• A function is a block of code which only runs when it is called.
Creating a Function
Arguments are specified after the function name, inside the parentheses. You can add as
many arguments as you want, just separate them with a comma.
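my_function <- function(fname) {  # assumed definition; it is not shown in the original
  paste("Hello,", fname)
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")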
Number of Arguments
By default, a function must be called with the correct number of arguments.
Meaning that if your function expects 2 arguments, you have to call the function
with 2 arguments, not more, and not less.
# Correct: the function expects 2 arguments and gets 2
my_function <- function(fname, lname) {
  paste(fname, lname)
}
my_function("Peter", "Griffin")

# Error: the function expects 2 arguments but gets only 1
my_function("Peter")
Default Parameter Value
The following example shows how to use a default parameter value.
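my_function <- function(country = "Norway") {  # assumed definition; default taken from the comment below
  paste("I am from", country)
}
my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")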
Nested Function
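There are two ways to create a nested function: call a function within another
function, or write a function within a function.
# Calling a function within another function call
Nested_function <- function(x, y) {
  a <- x + y
  return(a)
}
Nested_function(Nested_function(2, 2), Nested_function(3, 3))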
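Writing a function inside another function can be sketched as follows. The names `Outer_func`, `Inner_func`, and `output` are illustrative and do not come from the original notes.

```r
# Outer_func and Inner_func are illustrative names
Outer_func <- function(x) {
  # Inner_func is defined inside Outer_func and can see its argument x
  Inner_func <- function(y) {
    a <- x + y
    return(a)
  }
  return(Inner_func)
}

output <- Outer_func(3)  # output is now a function that adds 3 to its input
output(5)
```

`Inner_func` cannot be called directly from outside; it only becomes reachable through the function that `Outer_func` returns.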
Global Variables
Variables that are created outside of a function are known as global variables.
Global variables can be accessed from anywhere, both inside of functions and outside.
my_function()
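A minimal sketch of how a global variable is seen from inside a function. The variable `txt` and the helper `change_local` are hypothetical names chosen for this example.

```r
# txt is a global variable: it is created outside of any function
txt <- "awesome"

my_function <- function() {
  # txt is not defined locally, so R looks it up in the global environment
  paste("R is", txt)
}
my_function()

# Assigning with <- inside a function creates a local variable;
# the global txt is left unchanged
change_local <- function() {
  txt <- "fantastic"
  paste("R is", txt)
}
change_local()
txt
```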
Loops in R
R is very good at performing repetitive tasks. If we want a set of operations to be
repeated several times we use what’s known as a loop. When you create a loop, R
will execute the instructions in the loop a specified number of times or until a
specified condition is met.
For Loop
Syntax
for (variable in sequence) {
  # Code to be executed
}
Example
for (x in 1:10) {
  print(x)
}
While Loop
With the while loop we can execute a set of statements as long as a condition is
TRUE.
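Syntax
while (condition) {
  # Code to be executed
}
Example
i <- 1
while (i < 6) {
  print(i)     # loop body assumed; it is cut off in the original
  i <- i + 1   # increment so that the condition eventually becomes FALSE
}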
Repeat Loop
Syntax
repeat {
  # Code to be executed
  if (condition) {
    break # Exit the loop if condition is true
  }
}
Example
x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x > 5) {
    break # Exit loop when x is greater than 5
  }
}
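To compare the three loop types side by side, here is a small sketch that computes the same sum (1 + 2 + ... + n) with each of them. The variable names are illustrative.

```r
n <- 5

# for loop: the number of iterations is known up front
total_for <- 0
for (k in 1:n) {
  total_for <- total_for + k
}

# while loop: the condition is checked before each iteration
total_while <- 0
k <- 1
while (k <= n) {
  total_while <- total_while + k
  k <- k + 1
}

# repeat loop: runs until break is reached
total_repeat <- 0
k <- 1
repeat {
  total_repeat <- total_repeat + k
  k <- k + 1
  if (k > n) {
    break
  }
}

c(total_for, total_while, total_repeat)
```

A `for` loop is usually the clearest choice when the sequence is known in advance; `while` and `repeat` fit cases where the stopping point depends on values computed inside the loop.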
Summary statistics
o Summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible. They typically describe:
o Location
o Spread
o Shape
o Dependence
Location
Common measures of location, or central tendency, are the arithmetic mean, median, mode, and
interquartile mean.
Spread
Common measures of statistical dispersion are the standard deviation, variance, range,
interquartile range, absolute deviation, mean absolute difference and the distance standard
deviation. Measures that assess spread in comparison to the typical size of data values include the
coefficient of variation.
The Gini coefficient was originally developed to measure income inequality and is equivalent to one
of the L-moments.
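Several of the location and spread measures listed above are available directly in base R. The sketch below uses a made-up vector `x`; the mode is computed by hand with `table()` because base R has no built-in function for the statistical mode.

```r
# Made-up sample (illustrative data only)
x <- c(2, 4, 4, 5, 7, 9, 11)

# Location
mean(x)
median(x)
mode_x <- names(which.max(table(x)))  # most frequent value, returned as a string
mode_x

# Spread
sd(x)                        # standard deviation
var(x)                       # variance
diff(range(x))               # range as a single number (max - min)
IQR(x)                       # interquartile range
mean(abs(x - mean(x)))       # mean absolute deviation around the mean
cv <- sd(x) / mean(x)        # coefficient of variation
cv
```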
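A simple summary of a dataset is sometimes given by quoting particular order statistics as
approximations to selected percentiles of a distribution.
Shape
Common measures of the shape of a distribution are skewness or kurtosis, while alternatives
can be based on L-moments. A different measure is the distance skewness, for which a value of
zero implies central symmetry.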
Dependence
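The common measure of dependence between paired random variables is the Pearson product-moment
correlation coefficient.
Data summarization typically involves computing summary statistics and
creating summary tables or visualizations. Here's how you can perform data
summarization in R.
o Categorical variables are sometimes recorded as
numbers, but the numbers represent categories rather than actual amounts of things.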
o There are three types of categorical variables: binary, nominal, and ordinal variables.
o Graphical Summaries
Numeric Summaries
o quantile(x): finds the sample quantiles of the numeric vector x. Default is min, Q1, M, Q3, and max. Other quantiles can be requested with the probs argument.
o range(x): finds the range of the numeric vector x. Displays c(min(x), max(x)).
Sometimes you will have multiple columns or rows of a data.frame/matrix that you want to summarize.
o summary(x): for data.frames, display the quantile information and number of NAs
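A brief sketch of these numeric summaries, using the `mtcars` dataset that ships with R as a stand-in (the dataset used in the original notes is not available here).

```r
# mtcars ships with base R (datasets package); mpg is a numeric column
x <- mtcars$mpg

quantile(x)                        # min, Q1, median, Q3, max
quantile(x, probs = c(0.1, 0.9))   # other quantiles via the probs argument
range(x)                           # c(min(x), max(x))

# Summarizing several columns of a data frame at once
cols <- mtcars[, c("mpg", "hp", "wt")]
colMeans(cols)
sapply(cols, sd)
summary(cols)

# Missing values: most summaries return NA unless na.rm = TRUE is set
y <- c(x, NA)
mean(y)
mean(y, na.rm = TRUE)
summary(y)                         # also reports the number of NAs
```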
Graphical Summaries
o For one quantitative variable, we can summarize the data graphically using one of the following:
o Histogram
o Density plot
o plot(density(x)): makes a density plot from a numeric vector x (think a smoothed histogram)
o Boxplot
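The three plots can be produced with base R graphics. This sketch again uses `mtcars$mpg` as a stand-in variable; the titles and labels are illustrative.

```r
# mtcars ships with base R; mpg is used as an example quantitative variable
x <- mtcars$mpg

# Histogram: counts of observations falling into bins
h <- hist(x, main = "Histogram of mpg", xlab = "Miles per gallon")

# Density plot: a smoothed version of the histogram
d <- density(x)
plot(d, main = "Density of mpg")

# Boxplot: median, quartiles, and potential outliers
b <- boxplot(x, horizontal = TRUE, main = "Boxplot of mpg")
```

Each plotting call also returns an object (`h`, `d`, `b`) holding the numbers behind the picture, which can be inspected if needed.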
Summarizing Categorical variables
o Numeric Summaries
o Graphical Summaries
Numeric Summaries
For a single categorical variable, we are interested in counts and proportions within each category. To find
the counts, we can use the base R function table() or, with dplyr, the function count().
table(hers.dat$physact)
503 838
unique(x) will return the unique elements of x, so we can see what the different categories of physical
activity are.
unique(hers.dat$physact)
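Since `hers.dat` is not included here, the following sketch uses a made-up character vector `activity` to show counts, proportions, and unique categories in base R.

```r
# Made-up categorical variable (stand-in for a column such as physact)
activity <- c("low", "high", "medium", "low", "medium", "low")

# Counts within each category
counts <- table(activity)
counts

# Proportions within each category
props <- prop.table(counts)
props

# The distinct categories, in order of first appearance
unique(activity)

# With dplyr installed, counts could also be obtained with:
# dplyr::count(data.frame(activity), activity)
```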
Graphical Summaries
o barplot
o pie chart
o Base R has the functions barplot and pie to make these charts
barplot
o barplot(table(hers.dat$physact))
pie chart
o pie(table(hers.dat$physact))
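A self-contained sketch of both charts, using a made-up vector `activity` in place of `hers.dat$physact`; the titles and axis labels are illustrative.

```r
# Made-up categorical variable (stand-in for a column such as physact)
activity <- c("low", "high", "medium", "low", "medium", "low")
counts <- table(activity)

# Bar plot of counts; barplot() returns the bar midpoints on the x axis
bar_mid <- barplot(counts, main = "Physical activity", ylab = "Count")

# Bar plot of proportions instead of counts
barplot(prop.table(counts), main = "Physical activity", ylab = "Proportion")

# Pie chart of the same counts
pie(counts, main = "Physical activity")
```

Bar plots are generally easier to read than pie charts when categories have similar sizes.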