
Module II

The document provides an overview of outliers in statistical analysis, their identification, impact, and methods for handling them. It also covers data merging techniques in R, including different types of joins, tips for effective merging, and the use of functions and loops in R programming. Additionally, it discusses summarizing quantitative and categorical data with R, including numeric and graphical summaries.

Uploaded by

rishiv1947

Introduction to Statistical Learning and R Programming
Module II
Outliers
• Outliers are data points that significantly differ from other observations in a dataset.

• These observations can be unusually high or low in comparison to the majority of the data and
may distort statistical analyses and machine learning models if not handled properly.

• Outliers can occur due to various reasons, including measurement errors, experimental error,
or genuinely unusual observations.
Key Points About Outliers
• Identification: Outliers are typically identified using statistical methods such as the
interquartile range (IQR) or z-scores.

• Impact: Outliers can have a significant impact on statistical measures such as the
mean and standard deviation. Since these measures are sensitive to extreme values,
outliers can skew them, making them less representative of the central tendency and
variability of the data.

• Influence on Models: In predictive modeling, outliers can distort the results of
statistical models. For example, linear regression models may be heavily influenced
by outliers, leading to inaccurate estimates of coefficients and poor predictive
performance.
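The two identification methods named above can be sketched in base R as follows. The data vector and the cutoff of 2.5 are assumptions made for illustration. With only 10 observations a single point's z-score cannot exceed (n - 1) / sqrt(n), about 2.85, so the common cutoff of 3 would never flag anything in a sample this small.

```r
# Hypothetical sample: nine typical values and one extreme value
x <- c(10, 12, 11, 13, 12, 11, 14, 10, 13, 95)

# IQR rule: flag points more than 1.5 * IQR below Q1 or above Q3
q <- unname(quantile(x, probs = c(0.25, 0.75)))
fence <- 1.5 * IQR(x)
iqr_outliers <- x[x < q[1] - fence | x > q[2] + fence]
iqr_outliers   # 95

# z-score rule: flag points far from the mean, measured in standard deviations
cutoff <- 2.5                      # assumed cutoff, chosen for this small sample
z <- (x - mean(x)) / sd(x)
z_outliers <- x[abs(z) > cutoff]
z_outliers     # 95
```

Note that the extreme value inflates both the mean and the standard deviation used by the z-score rule, which is one reason the IQR rule is often preferred for small or skewed samples.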
Handling Outliers
• Trimming
• Removing outliers from the dataset.

• Winsorizing
• Replacing outliers with the nearest non-outlier values.

• Transformation
• Applying mathematical transformations such as log transformation to make the data less
skewed.

• Robust statistical methods
• Using statistical methods that are less sensitive to outliers, such as the median instead of the mean.
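A minimal sketch of the four approaches above, assuming a made-up data vector and using the 1.5 * IQR rule to decide what counts as an outlier (the rule itself is an assumption here, other definitions are possible):

```r
# Hypothetical sample with one extreme value
x <- c(10, 12, 11, 13, 12, 11, 14, 10, 13, 95)

# Flag outliers with the 1.5 * IQR rule
q <- unname(quantile(x, probs = c(0.25, 0.75)))
lower <- q[1] - 1.5 * IQR(x)
upper <- q[2] + 1.5 * IQR(x)
is_outlier <- x < lower | x > upper

# Trimming: drop the outliers
trimmed <- x[!is_outlier]

# Winsorizing: replace each outlier with the nearest non-outlier value
winsorized <- x
winsorized[x > upper] <- max(trimmed)
winsorized[x < lower] <- min(trimmed)

# Transformation: a log transform compresses large values
logged <- log(x)

# Robust statistic: the median barely moves, the mean is pulled upward
mean(x)     # 20.1
median(x)   # 12
```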
Combining Datasets in R
• Your data may not be stored in a single, complete, form and instead will be
found in a number of places.

• Perhaps measurements taken in different weeks have been saved to separate
places, or maybe you have different files recording observations and metadata.

• To work with data stored like this it is necessary to learn how to combine and
merge different datasets into a single data frame.
Combining Datasets in R
• If you have new data that has the same structure as your old data, it can be
added onto the end of your data frame with bind_rows().

• If you are instead looking to merge data sets based on some shared variable,
there are a number of joins that are useful.

• inner_join()

• left_join()

• full_join()
inner_join()
• The first join we will look at is an inner_join(). This function takes two data
frames as input and merges them based on their shared column.

• Only rows with data in both data frames are kept.


full_join()
• The opposite of an inner_join() is a full_join(). This function keeps
all rows from both data frames, filling in any missing data with NA.
left_join()
• A left_join() uses the first (left) data frame as a reference and merges in data from the
second (right) data frame. Rows in the left data frame with no match in the right are kept,
with the columns that come from the right data frame filled in with NA. Rows in the right
data frame with no match in the left are dropped.
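The functions above come from the dplyr package. The sketch below applies bind_rows() and the three joins to two small made-up data frames; the table names, column names, and values are all hypothetical.

```r
library(dplyr)

# Hypothetical tables that share an `id` column
measurements <- data.frame(id = c(1, 2, 3), weight = c(5.1, 6.3, 4.8))
metadata     <- data.frame(id = c(2, 3, 4), site = c("A", "B", "C"))

# inner_join(): only ids present in both tables (2 and 3)
inner <- inner_join(measurements, metadata, by = "id")

# full_join(): every id from either table (1 to 4), gaps filled with NA
full <- full_join(measurements, metadata, by = "id")

# left_join(): every id from the left table (1 to 3); site is NA for id 1
left <- left_join(measurements, metadata, by = "id")

# bind_rows(): append new rows that have the same structure
week2   <- data.frame(id = c(5, 6), weight = c(5.5, 6.0))
stacked <- bind_rows(measurements, week2)
```

Passing `by = "id"` explicitly is a design choice: without it dplyr joins on all shared column names and prints a message saying which ones it used.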
Tips on Merging Data in R
• Merging data is a common task in data analysis, especially when working with large
datasets. The merge function in R is a powerful tool that allows you to combine two or more
datasets based on shared variables. Here are some tips to ensure a smooth and efficient
merging process:

Understand Your Data:

• Before merging, always inspect your datasets using functions like head(), str(), and
summary(). This helps you understand the structure and identify key variables for merging.

Choose the Right Key Variables:

• Ensure that the variables you're merging on are unique and don't have duplicates unless it's
intentional. This prevents unintended data duplication.
Tips on Merging Data in R
Specify Merge Type:

• R's merge function allows for different types of joins: left, right, inner, and outer. The type is
selected with the all, all.x, and all.y arguments, and the default is an inner join.
Understand the differences and choose the one that best fits your needs.

Handle Missing Values:

• After merging, check for NA values. These can arise if there's no match for a particular
key. Decide how you want to handle these: remove, replace, or impute.

Check Column Names:

• If the datasets have columns with the same names but different data, R will append a
suffix (e.g., .x and .y) to distinguish them. Rename these columns if necessary for clarity.
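The merge types, NA check, and column suffixes described above might look like this with base R's merge(). The two data frames are made up for the example.

```r
# Hypothetical tables: both have an `id` key and a `score` column
left_df  <- data.frame(id = c(1, 2, 3), score = c(90, 85, 70))
right_df <- data.frame(id = c(2, 3, 4), score = c(88, 75, 60))

inner_m <- merge(left_df, right_df, by = "id")                # inner join (default)
left_m  <- merge(left_df, right_df, by = "id", all.x = TRUE)  # left join
right_m <- merge(left_df, right_df, by = "id", all.y = TRUE)  # right join
outer_m <- merge(left_df, right_df, by = "id", all = TRUE)    # outer (full) join

# Clashing column names receive the default suffixes
names(outer_m)            # "id" "score.x" "score.y"

# Count the NA values introduced by unmatched keys, per column
colSums(is.na(outer_m))

# Choose clearer suffixes at merge time instead of renaming afterwards
outer_named <- merge(left_df, right_df, by = "id", all = TRUE,
                     suffixes = c("_first", "_second"))
```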
Tips on Merging Data in R
Sort Your Data:

• After merging, it's often helpful to sort your data using the order() function. This can make subsequent
analyses easier and more intuitive.

Large Datasets Consideration:

• For very large datasets, consider using the data.table package. It offers a faster merging process compared to
the base R merge function.

Consistent Data Types:

• Ensure that the key variables in both datasets have the same data type. For instance, merging on a character
variable in one dataset and a factor in another can lead to unexpected results.

Test on a Subset:

• If you're unsure about the merge, try it on a small subset of your data first. This allows you to quickly spot
and rectify any issues.
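As a rough illustration of the last few tips, the sketch below checks key types, converts a factor key to character, tries the merge on a subset, and sorts the result with order(). All names and values are invented for the example.

```r
# Hypothetical tables: the key is character in one and a factor in the other
sales  <- data.frame(key = c("b", "a", "c"), amount = c(20, 10, 30),
                     stringsAsFactors = FALSE)
labels <- data.frame(key = factor(c("a", "b", "c")),
                     label = c("Alpha", "Beta", "Gamma"))

# Inspect the key types before merging
class(sales$key)    # "character"
class(labels$key)   # "factor"

# Make the key types consistent
labels$key <- as.character(labels$key)

# Test the merge on a small subset first
small <- merge(head(sales, 2), labels, by = "key")

# Full merge, then sort by amount in descending order
merged <- merge(sales, labels, by = "key")
merged <- merged[order(merged$amount, decreasing = TRUE), ]
merged$key          # "c" "b" "a"
```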
Functions and Loops
• A function is a block of code which only runs when it is called.

• You can pass data, known as parameters, into a function.

• A function can return data as a result.

Creating a Function

• To create a function, use the function() keyword:

my_function <- function() { # create a function with the name my_function
  print("Hello World!")
}
Functions
Calling a Function

To call a function, use the function name followed by parentheses, like my_function():

my_function <- function() {
  print("Hello World!")
}

my_function() # call the function named my_function
Arguments
Information can be passed into functions as arguments.

Arguments are specified after the function name, inside the parentheses. You can add as
many arguments as you want, just separate them with a comma.

my_function <- function(fname) {
  paste(fname, "Griffin")
}

my_function("Peter")
my_function("Lois")
my_function("Stewie")
Number of Arguments
By default, a function must be called with the correct number of arguments.
If your function expects 2 arguments, you have to call it with 2 arguments,
not more and not less.

my_function <- function(fname, lname) {
  paste(fname, lname)
}

my_function("Peter", "Griffin")

# Error: the function expects 2 arguments but receives only 1
my_function("Peter")
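To see the argument-count rule without stopping a script, the failing calls can be wrapped in try(). This is only a sketch; the name full_name is made up for the example.

```r
# Hypothetical two-argument function
full_name <- function(fname, lname) {
  paste(fname, lname)
}

ok <- full_name("Peter", "Griffin")   # "Peter Griffin"

# Too few arguments: fails when lname is used and has no default
too_few <- try(full_name("Peter"), silent = TRUE)

# Too many arguments: fails because the third argument is unused
too_many <- try(full_name("Peter", "Lois", "Griffin"), silent = TRUE)

inherits(too_few, "try-error")    # TRUE
inherits(too_many, "try-error")   # TRUE
```

R evaluates arguments lazily, so a missing argument only raises an error at the point where the function body actually uses it.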
Default Parameter Value
The following example shows how to use a default parameter value.

If we call the function without an argument, it uses the default value.

my_function <- function(country = "Norway") {
  paste("I am from", country)
}

my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
Nested Function
There are two ways to create a nested function:

• Call a function within another function.

Nested_function <- function(x, y) {
  a <- x + y
  return(a)
}

Nested_function(Nested_function(2, 2), Nested_function(3, 3))

• Write a function within a function.

Outer_func <- function(x) {
  Inner_func <- function(y) {
    a <- x + y
    return(a)
  }
  return(Inner_func)
}
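A function written inside another function cannot be called directly from outside. One way to use it is to call the outer function first and keep what it returns. In the sketch below, add_three is a name introduced only for illustration.

```r
# Nested calls: the inner calls are evaluated first
Nested_function <- function(x, y) {
  x + y
}
Nested_function(Nested_function(2, 2), Nested_function(3, 3))   # (2 + 2) + (3 + 3) = 10

# A function defined inside another function
Outer_func <- function(x) {
  Inner_func <- function(y) {
    x + y
  }
  Inner_func   # return the inner function itself
}

add_three <- Outer_func(3)   # Inner_func with x fixed at 3
add_three(5)                 # 8
```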
Global Variables
Variables that are created outside of a function are known as global variables.

Global variables can be used by everyone, both inside of functions and outside.

txt <- "awesome"


my_function <- function() {
paste("R is", txt)
}

my_function()
Loops in R
R is very good at performing repetitive tasks. If we want a set of operations to be
repeated several times we use what’s known as a loop. When you create a loop, R
will execute the instructions in the loop a specified number of times or until a
specified condition is met.

There are three main types of loop in R:

the for loop,

the while loop,

the repeat loop.


For Loop
A for loop is used for iterating over a sequence:

Syntax

for (variable in sequence) {
  # Code to be executed
}

Example

for (x in 1:10) {
  print(x)
}
While Loop
With the while loop we can execute a set of statements as long as a condition is
TRUE.

Syntax

while (condition) {
  # Code to be executed
}

Example

i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}
Repeat Loop
In R programming, the repeat loop is a type of loop that allows you to execute a block of code
indefinitely until a specific condition is met. Unlike for and while loops, which have a
predefined number of iterations or are based on a condition, the repeat loop continues to
execute its block of code until explicitly stopped using a break statement.

Syntax

repeat {
  # Code to be executed
  if (condition) {
    break # Exit the loop if condition is true
  }
}

Example

x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x > 5) {
    break # Exit loop when x is greater than 5
  }
}
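To compare the three loop types side by side, the sketch below computes the same sum (1 + 2 + ... + 5) with each of them. The variable names are chosen only for this example.

```r
n <- 5

# for loop: the number of iterations is known in advance
total_for <- 0
for (i in 1:n) {
  total_for <- total_for + i
}

# while loop: the condition is checked before every iteration
total_while <- 0
i <- 1
while (i <= n) {
  total_while <- total_while + i
  i <- i + 1
}

# repeat loop: runs until break is reached
total_repeat <- 0
i <- 1
repeat {
  total_repeat <- total_repeat + i
  i <- i + 1
  if (i > n) {
    break
  }
}

c(total_for, total_while, total_repeat)   # 15 15 15
```

For a simple case like this, the vectorized call sum(1:n) gives the same result without an explicit loop.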
Summary statistics
o Summary statistics are used to summarize a set of observations, in order to
communicate the largest amount of information as simply as possible.

o Statisticians commonly try to describe the observations in terms of

o Location

o Spread

o Shape

o Dependence
Summary statistics
Location
• Common measures of location, or central tendency, are the arithmetic mean, median, mode, and
interquartile mean.

Spread
• Common measures of statistical dispersion are the standard deviation, variance, range,
interquartile range, absolute deviation, mean absolute difference and the distance standard
deviation. Measures that assess spread in comparison to the typical size of data values include the
coefficient of variation.
• The Gini coefficient was originally developed to measure income inequality and is equivalent to one
of the L-moments.
• A simple summary of a dataset is sometimes given by quoting particular order statistics as
approximations to selected percentiles of a distribution.
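Several of the location and spread measures listed above have direct base R equivalents. The sketch below uses a made-up vector. Base R has no built-in function for the statistical mode, so it is derived from table() here, and the 25% trimmed mean is used as an approximation of the interquartile mean; both are choices made for this example.

```r
# Hypothetical observations
x <- c(2, 4, 4, 5, 7, 9, 11)

# Location
mean(x)                                          # 6
median(x)                                        # 5
mode_x <- as.numeric(names(which.max(table(x)))) # most frequent value: 4
iq_mean <- mean(x, trim = 0.25)                  # approximates the interquartile mean

# Spread
var(x)                                           # 10
sd(x)                                            # square root of the variance
range(x)                                         # 2 11
IQR(x)                                           # 4
mean_abs_dev <- mean(abs(x - mean(x)))           # mean absolute deviation
cv <- sd(x) / mean(x)                            # coefficient of variation

# Order statistics as approximate percentiles
quantile(x, probs = c(0.1, 0.9))
```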


Summary statistics
Shape
• Common measures of the shape of a distribution are skewness or kurtosis, while alternatives
can be based on L-moments. A different measure is the distance skewness, for which a value of
zero implies central symmetry.

Dependence
• The common measure of dependence between paired random variables is the Pearson product-moment
correlation coefficient, while a common alternative summary statistic is Spearman's
rank correlation coefficient.

• A value of zero for the distance correlation implies independence.
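Base R does not ship functions for skewness or kurtosis, so the sketch below computes simple moment-based versions by hand (one of several definitions in use) and then compares Pearson and Spearman correlation on made-up data. Distance skewness and distance correlation are not covered here.

```r
# Hypothetical right-skewed sample
x <- c(1, 2, 3, 4, 10)
m <- mean(x)

# Moment-based skewness and kurtosis (population-style, no bias correction)
skew <- mean((x - m)^3) / mean((x - m)^2)^1.5
kurt <- mean((x - m)^4) / mean((x - m)^2)^2
skew > 0   # TRUE: the long tail is on the right

# Dependence: b increases with a, but not linearly
a <- c(1, 2, 3, 4, 5)
b <- a^2
pearson  <- cor(a, b)                        # below 1, relationship is not linear
spearman <- cor(a, b, method = "spearman")   # 1, the ranks agree exactly
```

The contrast in the last two lines reflects the design of the two coefficients: Pearson measures linear association, while Spearman only looks at rank order.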


Summarizing data with R.
o Data summarization in R is the process of condensing large datasets into
smaller, more manageable forms while retaining essential information.

o It typically involves generating descriptive statistics, aggregating data, and
creating summary tables or visualizations. The slides that follow show how to perform
data summarization in R.

o Data is generally divided into two categories:

• Quantitative data (represents amounts)

• Categorical data (represents groupings)


Quantitative variables
o When you collect quantitative data, the numbers you record represent real
amounts that can be added, subtracted, divided, etc.

o There are two types of quantitative variables: discrete and continuous.

Type of variable | What does the data represent? | Examples
Discrete variables | Counts of individual items or values. | Number of students in a class; number of different tree species in a forest
Continuous variables | Measurements of continuous or non-finite values. | Distance; volume; age
Categorical variables
o Categorical variables represent groupings of some kind. They are sometimes recorded as
numbers, but the numbers represent categories rather than actual amounts of things.

o There are three types of categorical variables: binary, nominal, and ordinal variables.

Type of variable | What does the data represent? | Examples
Binary variables (aka dichotomous variables) | Yes or no outcomes. | Heads/tails in a coin flip; win/lose in a football game
Nominal variables | Groups with no rank or order between them. | Species names; colors; brands
Ordinal variables | Groups that are ranked in a specific order. | Finishing place in a race; rating scale responses in a survey, such as Likert scales
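One possible way to represent these variable types in R is sketched below. The mapping (integer for counts, double for measurements, logical for binary, factor for nominal, ordered factor for ordinal) is a common convention rather than a requirement, and all values are made up.

```r
# Quantitative variables
n_students  <- c(28L, 31L, 25L)       # discrete: integer counts
distance_km <- c(1.5, 3.2, 0.8)       # continuous: double measurements

# Binary variable stored as logical
won <- c(TRUE, FALSE, TRUE, TRUE)

# Nominal variable: a factor with no order among its levels
colors <- factor(c("red", "blue", "red", "green"))

# Ordinal variable: an ordered factor with explicitly ranked levels
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

is.ordered(colors)   # FALSE
is.ordered(rating)   # TRUE
rating < "high"      # TRUE FALSE TRUE TRUE: comparisons respect the level order
```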
Summarizing Quantitative Variable
o Numeric Summaries

o Graphical Summaries

Numeric Summaries

o mean(x): find the mean of a numeric vector x.

o sd(x): find the standard deviation of a numeric vector x.

o median(x): finds the median of a numeric vector x.

o quantile(x): finds the sample quantiles of the numeric vector x. Default is min, Q1, M, Q3, and max. Can
find other quantiles by using the probs argument.

o range(x): finds the range of the numeric vector x. Displays c(min(x), max(x)).

o sum(x): find the sum of the elements of the numeric vector x.
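The functions above can be tried on a small made-up vector. The example includes one missing value to show the effect of the na.rm argument, which most of these functions accept.

```r
# Hypothetical measurements with one missing value
x <- c(12, 15, 11, 18, 20, NA, 14)

mean(x)                  # NA: missing values propagate by default
mean(x, na.rm = TRUE)    # 15
median(x, na.rm = TRUE)  # 14.5
sd(x, na.rm = TRUE)      # square root of 12
range(x, na.rm = TRUE)   # 11 20
sum(x, na.rm = TRUE)     # 90

# Default quantiles (min, Q1, median, Q3, max); na.rm = TRUE is required here
quantile(x, na.rm = TRUE)

# Other quantiles via the probs argument
quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)
```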


Numeric Summaries
o R also has many mathematical functions that are useful for transformations (e.g. in regression modeling).

o log - log (base e) transformation

o log2 - log base 2 transformation

o log10 - log base 10 transformation

o sqrt - square root transformation

Sometimes you will have multiple columns or rows of a data.frame/matrix that you want to summarize.

o rowMeans(x): finds the mean of each row of x

o colMeans(x): finds the mean of each column of x

o rowSums(x): finds the sum of each row of x

o colSums(x): finds the sum of each column of x

o summary(x): for data.frames, displays the minimum, quartiles, mean, and maximum of each numeric column, plus the number of NAs
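A short sketch of the transformation and row/column helpers above, using a small made-up matrix and data frame:

```r
# Transformations
log(exp(1))    # 1
log2(8)        # 3
log10(1000)    # 3
sqrt(16)       # 4

# Hypothetical 2 x 3 matrix: row 1 is (1, 3, 5), row 2 is (2, 4, 6)
m <- matrix(1:6, nrow = 2)

rowMeans(m)    # 3 4
colMeans(m)    # 1.5 3.5 5.5
rowSums(m)     # 9 12
colSums(m)     # 3 7 11

# summary() on a data frame reports each column, including NA counts
df <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))
summary(df)
```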
Graphical Summaries
o For one quantitative variable, we can summarize the data graphically using a

o Histogram

o hist(x): makes a histogram from a numeric vector x

o Density plot

o plot(density(x)): makes a density plot from a numeric vector x (think a smoothed histogram)

o Boxplot

o boxplot(x): makes a boxplot from the numeric vector x


Histogram
o hist(hers.dat$LDL)
Density plot
o plot(density(hers.dat$LDL, na.rm = TRUE))
Boxplot
o boxplot(hers.dat$LDL)
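The hers.dat data set is not included with these slides, so the sketch below uses the built-in mtcars data as a stand-in (an assumption made purely so the example runs). Titles and axis labels are illustrative.

```r
# Stand-in numeric vector from a built-in data set
x <- mtcars$mpg

# Histogram
hist(x, main = "Histogram of mpg", xlab = "Miles per gallon")

# Density plot: a smoothed version of the histogram
plot(density(x, na.rm = TRUE), main = "Density of mpg")

# Boxplot
boxplot(x, ylab = "Miles per gallon")

# The numbers behind the plots can be inspected without drawing anything
h <- hist(x, plot = FALSE)   # bin breaks and counts
h$counts
boxplot.stats(x)$stats       # whisker ends, hinges, and median
```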
Summarizing Categorical variables (Numeric Summaries)
o To count how many observations fall into each physical activity category, we can use table or count to
construct a frequency table.

table(hers.dat$physact)

     about as active     much less active     much more active
                 919                  197                  306
somewhat less active somewhat more active
                 503                  838
Summarizing Categorical variables
o Numeric Summaries

o Graphical Summaries

Numeric Summaries

For a single categorical variable, we are interested in counts and proportions within each category. To find
the counts, we can use the base R function table(), or with dplyr we can use count().

unique(x) will return the unique elements of x, so we can see what the different categories of physical
activity are.

unique(hers.dat$physact)

"much more active" "much less active" "about as active"

"somewhat less active" "somewhat more active"
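Counts and proportions for a single categorical variable can be sketched with table() and prop.table(). The vector below is a small made-up stand-in for hers.dat$physact, which is not included with these slides.

```r
# Hypothetical responses
physact <- c("about as active", "much less active", "about as active",
             "somewhat more active", "about as active")

# Distinct categories, in order of first appearance
unique(physact)

# Counts per category
counts <- table(physact)
counts[["about as active"]]   # 3

# Proportions per category (they sum to 1)
props <- prop.table(counts)
round(100 * props, 1)         # percentages
```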


Summarizing Categorical variables
(Graphical Summaries)
o For one categorical variable, we graphically summarize it with a

o barplot

o pie chart

o Base R has the functions barplot and pie to make these charts
barplot
o barplot(table(hers.dat$physact))
pie chart
o pie(table(hers.dat$physact))
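Both chart functions take the frequency table as input. The sketch below again uses a made-up vector in place of hers.dat$physact; the title and axis label are illustrative.

```r
# Hypothetical responses
physact <- c("about as active", "much less active", "about as active",
             "somewhat more active", "about as active")
counts <- table(physact)

# Bar plot: one bar per category; barplot() returns the bar midpoints
mids <- barplot(counts, ylab = "Count", main = "Physical activity")

# Pie chart of the same counts
pie(counts, main = "Physical activity")
```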
