Module II

The document provides an overview of outliers in statistical analysis, their identification, impact, and methods for handling them. It also covers data merging techniques in R, including different types of joins, tips for effective merging, and the use of functions and loops in R programming. Additionally, it discusses summarizing quantitative and categorical data with R, including numeric and graphical summaries.

Uploaded by

rishiv1947


Introduction to Statistical Learning and R Programming
Module II
Outliers
• Outliers are data points that significantly differ from other observations in a dataset.

• These observations can be unusually high or low in comparison to the majority of the data and
may distort statistical analyses and machine learning models if not handled properly.

• Outliers can occur due to various reasons, including measurement errors, experimental error,
or genuinely unusual observations.
Key Points About Outliers
• Identification: Outliers are typically identified using statistical methods such as the
interquartile range (IQR) or z-scores.

• Impact: Outliers can have a significant impact on statistical measures such as the
mean and standard deviation. Since these measures are sensitive to extreme values,
outliers can skew them, making them less representative of the central tendency and
variability of the data.

• Influence on Models: In predictive modeling, outliers can distort the results of
statistical models. For example, linear regression models may be heavily influenced
by outliers, leading to inaccurate estimates of coefficients and poor predictive
performance.
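
The two identification rules named above can be sketched in base R as follows. The sample vector, the 1.5 × IQR fences, and the z-score cutoff of 2 are illustrative assumptions, not values from the slides; a cutoff of 3 is also common, but in a sample this small the extreme value inflates the standard deviation enough that it would not be flagged.

```r
# Hypothetical sample: values near 10-14 plus one extreme value
x <- c(10, 12, 11, 13, 12, 14, 11, 95)

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q <- quantile(x, probs = c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(x)
fence_high <- q[[2]] + 1.5 * IQR(x)
iqr_outliers <- x[x < fence_low | x > fence_high]

# z-score rule: flag points far from the mean in standard-deviation units
z <- (x - mean(x)) / sd(x)
cutoff <- 2  # assumed threshold for this sketch
z_outliers <- x[abs(z) > cutoff]

# Impact: the mean moves much more than the median when 95 is included
c(mean = mean(x), median = median(x))
c(mean = mean(x[x != 95]), median = median(x[x != 95]))
```

Both rules flag only the value 95 here, and dropping it leaves the median at 12 while the mean changes substantially.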
Handling Outliers
• Trimming
• Removing outliers from the dataset.

• Winsorizing
• Replacing outliers with the nearest non-outlier values.

• Transformation
• Applying mathematical transformations such as log transformation to make the data less
skewed.

• Robust statistical methods
• Using statistical methods that are less sensitive to outliers, such as the median instead of the mean.
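
A minimal base R sketch of the four approaches listed above. The data are hypothetical, and using the 1.5 × IQR rule to decide what counts as an outlier is an assumption made for this example.

```r
# Hypothetical sample with one extreme value
x <- c(10, 12, 11, 13, 12, 14, 11, 95)

# Flag outliers with the 1.5 * IQR rule (an assumed choice for this sketch)
q <- quantile(x, probs = c(0.25, 0.75))
lower <- q[[1]] - 1.5 * IQR(x)
upper <- q[[2]] + 1.5 * IQR(x)
is_outlier <- x < lower | x > upper

# Trimming: drop the flagged values
trimmed <- x[!is_outlier]

# Winsorizing: replace each flagged value with the nearest non-outlier value
winsorized <- pmin(pmax(x, min(trimmed)), max(trimmed))

# Transformation: a log transform compresses large values (requires x > 0)
logged <- log(x)

# Robust statistics: median and MAD next to mean and standard deviation
c(mean = mean(x), median = median(x), sd = sd(x), mad = mad(x))
```

Trimming shortens the vector, while winsorizing keeps its length and replaces 95 with 14, the largest non-outlier value.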
Combining Datasets in R
• Your data may not be stored in a single, complete form; instead, it may be
spread across a number of places.

• Perhaps measurements taken in different weeks have been saved to separate
places, or maybe you have different files recording observations and metadata.

• To work with data stored like this, it is necessary to learn how to combine and
merge different datasets into a single data frame.
• If you have new data that has the same structure as your old data, it can be
added onto the end of your data frame with bind_rows() from the dplyr package.

• If you are instead looking to merge datasets based on some shared variable,
dplyr provides a number of joins that are useful:

• inner_join()

• left_join()

• full_join()
inner_join()
• The first join we will look at is an inner_join(). This function takes two data
frames as input and merges them based on their shared column.

• Only rows with data in both data frames are kept.

full_join()
• The opposite of an inner_join() is a full_join(). This function keeps
all rows from both data frames, filling in any missing data with NA.
left_join()
• This uses the first (left) data frame as a reference and merges in data from the second
(right) data frame. Rows that are in the left data frame but not the right are kept, with the
right-hand columns filled in with NA. Rows in the right data frame but not the left are dropped.
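
The three joins and bind_rows() can be compared on a small example. This sketch assumes the dplyr package is installed; the data frames and their column names (id, value, site) are hypothetical.

```r
library(dplyr)

# Hypothetical data: measurements and site metadata sharing the column `id`
measurements <- data.frame(id = c(1, 2, 3), value = c(5.1, 4.8, 6.0))
metadata     <- data.frame(id = c(2, 3, 4), site = c("north", "south", "east"))

# inner_join(): only ids present in both data frames (2 and 3)
inner <- inner_join(measurements, metadata, by = "id")

# full_join(): every id from either data frame (1, 2, 3, 4), gaps filled with NA
full <- full_join(measurements, metadata, by = "id")

# left_join(): every id from measurements (1, 2, 3); id 4 is dropped
left <- left_join(measurements, metadata, by = "id")

# bind_rows(): append new rows that have the same structure
week2 <- data.frame(id = c(5, 6), value = c(5.5, 4.9))
all_measurements <- bind_rows(measurements, week2)
```

Naming the key with by = "id" is optional when only one column is shared, but stating it avoids surprises when data frames share more columns than intended.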
Tips on Merging Data in R
• Merging data is a common task in data analysis, especially when working with large
datasets. The merge function in R is a powerful tool that allows you to combine two or more
datasets based on shared variables. Here are some tips to ensure a smooth and efficient
merging process:

Understand Your Data:

• Before merging, always inspect your datasets using functions like head(), str(), and
summary(). This helps you understand the structure and identify key variables for merging.

Choose the Right Key Variables:

• Ensure that the variables you're merging on are unique and don't have duplicates unless it's
intentional. This prevents unintended data duplication.
Specify Merge Type:

• R's merge function allows for different types of joins: left, right, inner, and outer,
selected with the all.x, all.y, and all arguments (the default is an inner join).
Understand the differences and choose the one that best fits your needs.

Handle Missing Values:

• After merging, check for NA values. These can arise if there's no match for a particular
key. Decide how you want to handle these: remove, replace, or impute.

Check Column Names:

• If the datasets have non-key columns with the same names but different data, R will append a
suffix (by default .x and .y) to distinguish them. Rename these columns if necessary for clarity.
Sort Your Data:

• After merging, it's often helpful to sort your data using the order() function. This can make subsequent
analyses easier and more intuitive.

Large Datasets Consideration:

• For very large datasets, consider using the data.table package. It offers a faster merging process compared to
the base R merge function.

Consistent Data Types:

• Ensure that the key variables in both datasets have the same data type. For instance, merging on a character
variable in one dataset and a factor in another can lead to unexpected results.

Test on a Subset:

• If you're unsure about the merge, try it on a small subset of your data first. This allows you to quickly spot
and rectify any issues.
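
Several of these tips can be combined in one short base R workflow. The data frames a and b, their columns, and the suffixes are hypothetical choices for this sketch.

```r
# Hypothetical datasets sharing the key `id`; both also have a column `score`
a <- data.frame(id = c("3", "1", "2"), score = c(30, 10, 20),
                stringsAsFactors = FALSE)
b <- data.frame(id = factor(c("1", "2", "4")), score = c(11, 22, 44))

# Understand your data: inspect structure before merging
str(a)
str(b)

# Consistent data types: convert the factor key to character
b$id <- as.character(b$id)

# Right key variables: confirm the key has no duplicates in either dataset
stopifnot(!anyDuplicated(a$id), !anyDuplicated(b$id))

# Merge type: all.x = TRUE gives a left join; suffixes label the clashing columns
left <- merge(a, b, by = "id", all.x = TRUE, suffixes = c("_a", "_b"))

# Missing values: count NAs created by keys without a match
colSums(is.na(left))

# Sort the result by the key
left <- left[order(left$id), ]
```

Here id "3" has no match in b, so score_b is NA for that row, and id "4" is dropped because it appears only in the right-hand dataset.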
Functions and Loops
• A function is a block of code which only runs when it is called.

• You can pass data, known as parameters, into a function.

• A function can return data as a result.

Creating a Function

• To create a function, use the function keyword:

my_function <- function() { # create a function with the name my_function
  print("Hello World!")
}
Functions
Calling a Function

To call a function, use the function name followed by parentheses, like
my_function():

my_function <- function() {
  print("Hello World!")
}

my_function() # call the function named my_function
Arguments
Information can be passed into functions as arguments.

Arguments are specified after the function name, inside the parentheses. You can add as
many arguments as you want, just separate them with a comma.

my_function <- function(fname) {
  paste(fname, "Griffin")
}

my_function("Peter")
my_function("Lois")
my_function("Stewie")
Number of Arguments
By default, a function must be called with the correct number of arguments.
If your function expects 2 arguments, you have to call it
with 2 arguments, not more and not fewer.

my_function <- function(fname, lname) {
  paste(fname, lname)
}

my_function("Peter", "Griffin")

# Error: the function expects 2 arguments but receives only 1
my_function <- function(fname, lname) {
  paste(fname, lname)
}

my_function("Peter")
Default Parameter Value
The following example shows how to use a default parameter value.

If we call the function without an argument, it uses the default value.

my_function <- function(country = "Norway") {
  paste("I am from", country)
}

my_function("Sweden")
my_function("India")
my_function() # will get the default value, which is Norway
my_function("USA")
Nested Function
There are two ways to create a nested function:

• Call a function within another function.
• Write a function within a function.

Call a function within another function:

Nested_function <- function(x, y) {
  a <- x + y
  return(a)
}

Nested_function(Nested_function(2,2), Nested_function(3,3))

Write a function within a function:

Outer_func <- function(x) {
  Inner_func <- function(y) {
    a <- x + y
    return(a)
  }
  return (Inner_func)
}
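
A function written inside another function cannot be called directly, because it exists only inside the outer function. A minimal sketch of how the returned inner function might be used; the name add_three is a hypothetical choice for this example.

```r
# Outer_func returns Inner_func, which remembers the value of x
Outer_func <- function(x) {
  Inner_func <- function(y) {
    a <- x + y
    return(a)
  }
  return(Inner_func)
}

# Call the outer function first; the result is itself a function
add_three <- Outer_func(3)  # Inner_func with x fixed at 3

# Then call the returned function with a value for y
result <- add_three(5)      # 3 + 5
print(result)
```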
Global Variables
Variables that are created outside of a function are known as global variables.

Global variables can be used anywhere, both inside of functions and outside.

txt <- "awesome"

my_function <- function() {
paste("R is", txt)
}

my_function()
Loops in R
R is very good at performing repetitive tasks. If we want a set of operations to be
repeated several times we use what’s known as a loop. When you create a loop, R
will execute the instructions in the loop a specified number of times or until a
specified condition is met.

There are three main types of loop in R:

the for loop,

the while loop,

the repeat loop.

For Loop
A for loop is used for iterating over a sequence:

Syntax

for (variable in sequence) {
  # Code to be executed
}

Example

for (x in 1:10) {
  print(x)
}
While Loop
With the while loop we can execute a set of statements as long as a condition is
TRUE.

Syntax

while (condition) {
  # Code to be executed
}

Example

i <- 1
while (i < 6) {
  print(i)
  i <- i + 1
}
Repeat Loop
In R programming, the repeat loop is a type of loop that allows you to execute a block of code
indefinitely until a specific condition is met. Unlike for and while loops, which have a
predefined number of iterations or are based on a condition, the repeat loop continues to
execute its block of code until explicitly stopped using a break statement.

Syntax

repeat {
  # Code to be executed
  if (condition) {
    break # Exit the loop if condition is true
  }
}

Example

x <- 1
repeat {
  print(x)
  x <- x + 1
  if (x > 5) {
    break # Exit loop when x is greater than 5
  }
}
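
To compare the three loop types side by side, the sketch below computes the same sum (1 + 2 + ... + n) with each of them. The variable names and the choice of n = 5 are illustrative assumptions.

```r
n <- 5

# for loop: the number of iterations is fixed by the sequence 1:n
total_for <- 0
for (i in 1:n) {
  total_for <- total_for + i
}

# while loop: the condition is checked before each iteration
total_while <- 0
i <- 1
while (i <= n) {
  total_while <- total_while + i
  i <- i + 1
}

# repeat loop: runs until break is reached inside the body
total_repeat <- 0
i <- 1
repeat {
  total_repeat <- total_repeat + i
  i <- i + 1
  if (i > n) {
    break
  }
}

# All three agree with the vectorized sum(1:n)
c(total_for, total_while, total_repeat, sum(1:n))
```

In practice a vectorized call such as sum(1:n) is usually preferred in R when one exists; explicit loops are most useful when each step depends on the previous one.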
Summary statistics
o Summary statistics are used to summarize a set of observations, in order to
communicate the largest amount of information as simply as possible.

o Statisticians commonly try to describe the observations in terms of their

o Location

o Spread

o Shape

o Dependence
Location
• Common measures of location, or central tendency, are the arithmetic mean, median, mode, and
interquartile mean.

Spread
• Common measures of statistical dispersion are the standard deviation, variance, range,
interquartile range, absolute deviation, mean absolute difference and the distance standard
deviation. Measures that assess spread in comparison to the typical size of data values include the
coefficient of variation.
• The Gini coefficient was originally developed to measure income inequality and is equivalent to one
of the L-moments.
• A simple summary of a dataset is sometimes given by quoting particular order statistics as
approximations to selected percentiles of a distribution.

Shape
• Common measures of the shape of a distribution are skewness or kurtosis, while alternatives
can be based on L-moments. A different measure is the distance skewness, for which a value of
zero implies central symmetry.

Dependence
• The common measure of dependence between paired random variables is the Pearson
product-moment correlation coefficient, while a common alternative summary statistic is
Spearman's rank correlation coefficient.

• A value of zero for the distance correlation implies independence.
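
A few of the measures above can be computed with base R, as sketched below on two hypothetical vectors. Base R has no built-in function for the statistical mode or for skewness, so the mode is taken from a frequency table and skewness uses the moment-based formula, which is only one of several definitions in use.

```r
# Hypothetical paired observations
x <- c(2, 4, 4, 5, 7, 9, 11)
y <- c(1, 3, 2, 5, 6, 8, 12)

# Location
x_mean   <- mean(x)
x_median <- median(x)
x_mode   <- as.numeric(names(which.max(table(x))))  # most frequent value

# Spread
x_sd    <- sd(x)
x_var   <- var(x)
x_range <- diff(range(x))    # max minus min
x_iqr   <- IQR(x)
x_cv    <- sd(x) / mean(x)   # coefficient of variation

# Shape: moment-based sample skewness (positive means a longer right tail)
m <- mean(x)
x_skew <- mean((x - m)^3) / mean((x - m)^2)^1.5

# Dependence
pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
```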

Summarizing data with R
o Data summarization in R involves the process of condensing large datasets into
smaller, more manageable forms while retaining essential information.

o It typically involves generating descriptive statistics, aggregating data, and
creating summary tables or visualizations.

o Data is generally divided into two categories:

• Quantitative data (represents amounts)

• Categorical data (represents groupings)

Quantitative variables
o When you collect quantitative data, the numbers you record represent real
amounts that can be added, subtracted, divided, etc.

o There are two types of quantitative variables: discrete and continuous.

Type of variable | What does the data represent? | Examples
Discrete variables | Counts of individual items or values. | Number of students in a class; Number of different tree species in a forest
Continuous variables | Measurements of continuous or non-finite values. | Distance; Volume; Age
Categorical variables
o Categorical variables represent groupings of some kind. They are sometimes recorded as
numbers, but the numbers represent categories rather than actual amounts of things.

o There are three types of categorical variables: binary, nominal, and ordinal variables.

Type of variable | What does the data represent? | Examples
Binary variables (aka dichotomous variables) | Yes or no outcomes. | Heads/tails in a coin flip; Win/lose in a football game
Nominal variables | Groups with no rank or order between them. | Species names; Colors; Brands
Ordinal variables | Groups that are ranked in a specific order. | Finishing place in a race; Rating scale responses in a survey, such as Likert scales
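
One possible way to represent these variable types in R is sketched below: integer and numeric vectors for quantitative data, and factors (unordered or ordered) for categorical data. All values and variable names are hypothetical.

```r
# Quantitative variables
n_students  <- c(28L, 31L, 25L)   # discrete: stored as integer
distance_km <- c(1.5, 2.75, 0.8)  # continuous: stored as numeric (double)

# Categorical variables
coin    <- factor(c("heads", "tails", "heads"))  # binary: two levels
species <- factor(c("oak", "pine", "oak"))       # nominal: no order
place   <- factor(c("2nd", "1st", "3rd"),
                  levels = c("1st", "2nd", "3rd"),
                  ordered = TRUE)                # ordinal: ranked levels

# Order comparisons are only meaningful for ordered factors
beat_third <- place < "3rd"
```

Storing categories as factors rather than plain numbers keeps functions such as mean() from treating category codes as amounts.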
Summarizing Quantitative Variables
o Numeric Summaries

o Graphical Summaries

Numeric Summaries

o mean(x): finds the mean of a numeric vector x.

o sd(x): finds the standard deviation of a numeric vector x.

o median(x): finds the median of a numeric vector x.

o quantile(x): finds the sample quantiles of the numeric vector x. The default is the minimum, Q1, median, Q3, and maximum. Other quantiles can be found by using the probs argument.

o range(x): finds the range of the numeric vector x. Returns c(min(x), max(x)).

o sum(x): finds the sum of the elements of the numeric vector x.

Numeric Summaries
o R also has many mathematical functions that are useful for transformations (e.g. in regression modeling).

o log - log (base e) transformation

o log2 - log base 2 transformation

o log10 - log base 10 transformation

o sqrt - square root transformation

Sometimes you will have multiple columns or rows of a data.frame/matrix that you want to summarize.

o rowMeans(x): finds the mean of each row of x

o colMeans(x): finds the mean of each column of x

o rowSums(x): finds the sum of each row of x

o colSums(x): finds the sum of each column of x

o summary(x): for data.frames, displays the minimum, quartiles, mean, and maximum of each numeric column, along with the number of NAs
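
The functions listed above can be tried on a small vector. The data are hypothetical; the na.rm = TRUE argument is included because most of these functions return NA when the input contains a missing value.

```r
# Hypothetical numeric vector with one missing value
x <- c(4, 8, 15, 16, 23, 42, NA)

mean(x, na.rm = TRUE)      # 18
median(x, na.rm = TRUE)    # 15.5
sd(x, na.rm = TRUE)
quantile(x, na.rm = TRUE)                        # min, Q1, median, Q3, max
quantile(x, probs = c(0.1, 0.9), na.rm = TRUE)   # other quantiles via probs
range(x, na.rm = TRUE)     # 4 42
sum(x, na.rm = TRUE)       # 108

# Transformations
log10(c(1, 10, 100))       # 0 1 2
sqrt(c(4, 9))              # 2 3

# Row and column summaries of a matrix with rows (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)
rowMeans(m)                # 3 4
colSums(m)                 # 3 7 11

# summary() on a data frame reports quantiles, the mean, and the NA count
summary(data.frame(x = x))
```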
Graphical Summaries
o For one quantitative variable, we can summarize the data graphically using a

o Histogram: hist(x) makes a histogram from a numeric vector x

o Density plot: plot(density(x)) makes a density plot from a numeric vector x (think of a smoothed histogram)

o Boxplot: boxplot(x) makes a boxplot from the numeric vector x

Histogram
o hist(hers.dat$LDL)
Density plot
o plot(density(hers.dat$LDL, na.rm = TRUE))
Boxplot
o boxplot(hers.dat$LDL)
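
The hers.dat data are not included here, so the sketch below uses a simulated vector as a stand-in; the mean, standard deviation, and sample size are arbitrary assumptions. Each plotting call also returns an object describing what was drawn.

```r
# Simulated stand-in for an LDL column, with one missing value
set.seed(1)
ldl <- c(rnorm(200, mean = 140, sd = 35), NA)

# Histogram: hist() skips missing values and returns the bin counts
h <- hist(ldl, main = "Histogram of simulated LDL", xlab = "LDL")

# Density plot: density() needs na.rm = TRUE when NAs are present
d <- density(ldl, na.rm = TRUE)
plot(d, main = "Density of simulated LDL")

# Boxplot: boxplot() returns the five-number summary it draws
b <- boxplot(ldl, ylab = "LDL")
b$stats
```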
Summarizing Categorical Variables (Numeric Summaries)
o To count how many observations fall into each physical activity category, we can use table or count to
construct a frequency table.

table(hers.dat$physact)

     about as active      much less active      much more active
                 919                   197                   306
somewhat less active  somewhat more active
                 503                   838
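
Counts from table() can be converted to proportions with the base R function prop.table(). The sketch below uses a short hypothetical vector in place of hers.dat$physact.

```r
# Hypothetical physical activity responses
physact <- c("about as active", "much less active", "about as active",
             "much more active", "about as active", "much less active")

# Counts per category
counts <- table(physact)
counts

# Proportions per category (they sum to 1)
props <- prop.table(counts)
props

# Percentages rounded to one decimal place
round(100 * props, 1)
```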
Summarizing Categorical variables
o Numeric Summaries

o Graphical Summaries

Numeric Summaries

For a single categorical variable, we are interested in counts and proportions within each category. To find
the counts, we can use the base R function table() or, with dplyr, the function count().

unique(x) returns the unique elements of x, so we can see what the different categories of physical
activity are.

unique(hers.dat$physact)

"much more active" "much less active" "about as active"

"somewhat less active" "somewhat more active"

Summarizing Categorical variables
(Graphical Summaries)
o For one categorical variable, we graphically summarize it with a

o barplot

o pie chart

o Base R has the functions barplot and pie to make these charts
barplot
o barplot(table(hers.dat$physact))
pie chart
o pie(table(hers.dat$physact))
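
Since hers.dat is not available here, the sketch below rebuilds the counts from the frequency table shown earlier as a named vector and plots them. The las, cex.names, and label arguments are optional styling choices added for this example.

```r
# Category counts copied from the frequency table of physical activity
counts <- c("about as active" = 919, "much less active" = 197,
            "much more active" = 306, "somewhat less active" = 503,
            "somewhat more active" = 838)

# Bar plot: one bar per category; las = 2 rotates the long labels
mids <- barplot(counts, las = 2, cex.names = 0.7, ylab = "Count")

# Pie chart: slice sizes are proportional to the counts
pie(counts, main = "Physical activity")
```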
