Unit 2 R
1. How to run R
1. Install R
Download RStudio: Go to the RStudio website and download the free version.
Using R: You can open the R console directly from the installation. It will show a
command line interface where you can enter R commands.
Using RStudio: Open RStudio, and you’ll have a more user-friendly environment with
a script editor, console, and other features.
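Installing Packages: In RStudio, run the following in the console, replacing packageName
with the name of the package you need: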
install.packages("packageName")
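Getting Help: Type ?functionName or help("functionName") in the console to open the
documentation for a function.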
R Sessions
Starting a Session: You start a new session whenever you open R or RStudio. You can
run code interactively or execute scripts.
Workspace Management:
Saving Workspace: Use save.image() to save your current session to an .RData file.
Loading Workspace: Use load("yourfile.RData") to load a saved workspace.
Ending a Session: You can end a session by typing q() in the console. It will prompt
you to save your workspace if you haven’t already.
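The save-and-restore cycle described above can be sketched as follows. This is a minimal illustration, assuming the code runs at the top level of a session; the object names and the temporary file path are made up for the example.

```r
# Create two objects in the current session (illustrative names and values).
scores <- c(90, 85, 88)
label <- "midterm"

# Save every object in the global environment to a file.
# tempfile() gives a throwaway path; by default save.image() writes ".RData"
# in the working directory.
workspace_file <- tempfile(fileext = ".RData")
save.image(file = workspace_file)

# Remove the objects, then restore them from the saved file.
rm(scores, label)
load(workspace_file)
print(scores)
```

Calling q() afterwards would end the session; it is left out here so the example can be run without closing R.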
Functions in R
Defining a Function: You create a function using the function keyword. Here’s the
basic syntax:
my_function <- function(arg1, arg2) {
  # Function body
  result <- arg1 + arg2
  return(result)
}
Calling a Function: Once defined, you can call the function by passing the required
arguments:
my_function(5, 3) # Returns 8
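Default Arguments: You can give an argument a default value, which is used when the caller
omits that argument:
my_function <- function(arg1, arg2 = 10) {
  return(arg1 + arg2)
}
my_function(5) # Returns 15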
Anonymous Functions: You can create functions without naming them, often used for
short, one-time tasks:
result <- (function(x) { x^2 })(4) # Returns 16
Using Functions: Functions are versatile and can be used in various contexts:
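# Iterative factorial (the function name is illustrative)
factorial_function <- function(n) {
  if (n < 0) {
    # Assumed handling: the factorial is undefined for negative numbers
    stop("n must be non-negative")
  } else if (n == 0) {
    return(1)
  } else {
    result <- 1
    for (i in 1:n) {
      result <- result * i
    }
    return(result)
  }
}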
R is a powerful tool for statistical analysis, offering a wide range of built-in functions for
various statistical operations. Here’s an overview of some common statistical functions in R:
Descriptive Statistics
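1. Mean
mean(x) # Calculates the arithmetic mean of x
2. Median
median(x) # Calculates the median of x
3. Standard Deviation
sd(x) # Calculates the standard deviation of x
4. Variance
var(x) # Calculates the variance of x
5. Quantiles
quantile(x, probs = c(0.25, 0.5, 0.75)) # Calculates the 25th, 50th, and 75th percentiles
6. Summary Statistics
summary(x) # Minimum, quartiles, mean, and maximum of x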
Probability Distributions
1. Normal Distribution
Key Concepts
2. PDF Characteristics:
The area under the PDF curve across the entire range is equal to 1.
The probability that a random variable falls within a particular range is given by the
integral of the PDF over that range.
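Both PDF properties can be checked numerically. The sketch below uses the standard normal density dnorm() and base R's integrate(); the range from -1 to 1 is an arbitrary choice for illustration.

```r
# Total area under the standard normal PDF (numerical integration).
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# P(-1 < X < 1) as the integral of the PDF over that range...
prob_integral <- integrate(dnorm, lower = -1, upper = 1)$value

# ...and the same probability from the cumulative distribution function.
prob_cdf <- pnorm(1) - pnorm(-1)

print(total_area)
print(prob_integral)
print(prob_cdf)
```

The two probabilities agree up to numerical error, which is why pnorm() is normally used instead of integrating by hand.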
Characteristics:
c) Random Generation:
Random generation in R refers to the process of generating random numbers from various
statistical distributions. R provides a suite of functions for generating random numbers from
common distributions.
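As a short sketch of random generation, the functions rnorm(), runif(), and rbinom() draw from the normal, uniform, and binomial distributions. The sample sizes and parameters below are arbitrary, and the seed value is an assumption chosen only to make the draws repeatable.

```r
set.seed(42) # Fix the seed so the draws are reproducible

normal_draws <- rnorm(5, mean = 0, sd = 1)          # 5 draws from N(0, 1)
uniform_draws <- runif(5, min = 0, max = 1)         # 5 draws from U(0, 1)
binomial_draws <- rbinom(5, size = 10, prob = 0.5)  # 5 draws from Binomial(10, 0.5)

print(normal_draws)
print(uniform_draws)
print(binomial_draws)
```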
2. Binomial Distribution
3. T-distribution
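For these two distributions, R follows the same naming scheme as for the normal: a d prefix for the density or probability mass, p for the cumulative probability, q for quantiles, and r for random draws. A brief sketch with arbitrarily chosen parameters:

```r
# Binomial: 10 trials with success probability 0.5.
p_exactly_3 <- dbinom(3, size = 10, prob = 0.5) # P(X = 3)
p_at_most_3 <- pbinom(3, size = 10, prob = 0.5) # P(X <= 3)

# t-distribution with 7 degrees of freedom.
t_critical <- qt(0.975, df = 7) # Critical value for a two-sided 95% interval

print(p_exactly_3)
print(p_at_most_3)
print(t_critical)
```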
1. t-tests
a) One-sample t-test: A one-sample t-test is used to determine whether the mean of a single
sample differs significantly from a known or hypothesized population mean. Here’s how
to perform a one-sample t-test in R, along with an example.
3. Use the t.test() Function: This function in R performs the one-sample t-test.
Let’s say we want to test whether the average height of a sample of students is different
from the known average height of 170 cm.
Step 1: Create the Sample Data
student_heights <- c(168, 172, 169, 171, 173, 165, 170, 174)
Step 2: Perform the One-Sample t-Test
We will test the null hypothesis that the mean height is equal to 170 cm.
t_test_result <- t.test(student_heights, mu = 170)
print(t_test_result)
When you run the t.test() function, you'll get an output that includes:
p-value: The probability of observing data at least as extreme as the sample, assuming the
null hypothesis is true.
confidence interval: The range in which the true population mean likely falls.
mean of x: The sample mean, which is 170.25 for the heights above.
p-value of about 0.82: Since this p-value is greater than 0.05, we fail to reject the null
hypothesis. There is not enough evidence to conclude that the average height of the
students is different from 170 cm.
95% confidence interval: This indicates that we are 95% confident the true mean
height falls between about 167.8 and 172.7 cm.
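The quantities discussed above can also be read directly from the object returned by t.test() instead of from the printed report. A minimal sketch using the same student_heights data; the 0.05 threshold is the conventional significance level used in the text.

```r
student_heights <- c(168, 172, 169, 171, 173, 165, 170, 174)
t_test_result <- t.test(student_heights, mu = 170)

sample_mean <- unname(t_test_result$estimate) # The "mean of x" line
p_value <- t_test_result$p.value              # Two-sided p-value
conf_interval <- t_test_result$conf.int       # Lower and upper 95% limits

# Decision at the 5% significance level.
reject_null <- p_value < 0.05

print(sample_mean)
print(p_value)
print(conf_interval)
print(reject_null)
```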
b) Two-sample t-test: A two-sample t-test is used to compare the means of two independent
groups to determine if there is a statistically significant difference between them. Here's
how to perform a two-sample t-test in R, along with an example.
Let’s say we want to compare the heights of students from two different classes to see if there
is a significant difference between their average heights.
We will test the null hypothesis that the mean height of Class A is equal to the mean height of
Class B.
print(t_test_result)
Interpreting the Output
When you run the t.test() function, you'll get an output that includes:
p-value: The probability of observing a difference at least as extreme as the one in the
samples, assuming the null hypothesis is true.
confidence interval: The range in which the true difference in means likely falls.
p-value = 0.00684: Since this p-value is less than 0.05, we reject the null hypothesis.
There is significant evidence to conclude that the average heights of the two classes are
different.
95% confidence interval: This indicates that we are 95% confident the true difference
in means lies between 5.745 cm and 14.255 cm.
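A two-sample test call might look like the sketch below. The class heights here are hypothetical values invented for illustration, so the result will not reproduce the p-value and interval quoted above. By default, t.test() with two samples runs Welch's test, which does not assume equal variances.

```r
# Hypothetical heights in cm for two classes (not the original data).
class_a <- c(170, 172, 168, 175, 171)
class_b <- c(160, 162, 158, 165, 161)

# Null hypothesis: the two class means are equal.
t_test_result <- t.test(class_a, class_b)
print(t_test_result)

# Difference in sample means, and the decision at the 5% level.
mean_difference <- mean(class_a) - mean(class_b)
reject_null <- t_test_result$p.value < 0.05
```

Passing var.equal = TRUE would switch to the classical pooled-variance test.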
5. Chi-squared Test
chisq.test(table) # Performs a chi-squared test of independence
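In the call above, table stands for a contingency table of counts. A minimal sketch with a hypothetical 2 x 2 table; the group labels and counts are made up for illustration.

```r
# Hypothetical counts: rows are groups, columns are outcomes.
observed <- matrix(
  c(20, 15, 30, 35),
  nrow = 2,
  dimnames = list(group = c("A", "B"), outcome = c("yes", "no"))
)

# Null hypothesis: group and outcome are independent.
chisq_result <- chisq.test(observed)
print(chisq_result)

# Expected counts under independence, useful for checking the test's assumptions.
print(chisq_result$expected)
```

For 2 x 2 tables chisq.test() applies Yates' continuity correction by default; correct = FALSE turns it off.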
1. Correlation
2. Linear Regression
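As a sketch of both items, cor() computes the Pearson correlation between two numeric vectors and lm() fits a linear model. The hours and score vectors are hypothetical data invented for the example.

```r
# Hypothetical data: hours studied and resulting score.
hours <- c(1, 2, 3, 4, 5)
score <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation coefficient between the two variables.
correlation <- cor(hours, score)

# Simple linear regression of score on hours.
model <- lm(score ~ hours)
intercept <- unname(coef(model)["(Intercept)"])
slope <- unname(coef(model)["hours"])

print(correlation)
print(summary(model))
```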
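# Sample data (the student heights from the t-test example)
x <- c(168, 172, 169, 171, 173, 165, 170, 174)
# Descriptive statistics
mean_value <- mean(x)
sd_value <- sd(x)
summary_stats <- summary(x)
# One-sample t-test
t_test_result <- t.test(x, mu = 170)
print(mean_value)
print(sd_value)
print(summary_stats)
print(t_test_result)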
4. Basic Maths in R
R is a powerful tool for performing basic mathematical operations. Here’s a quick guide on
how to perform various mathematical tasks in R:
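1. Addition:
result <- 5 + 3 # Adds 5 and 3
2. Subtraction:
result <- 5 - 3 # Subtracts 3 from 5
3. Multiplication:
result <- 5 * 3 # Multiplies 5 by 3
4. Division:
result <- 5 / 3 # Divides 5 by 3
5. Exponentiation:
result <- 5^3 # Raises 5 to the power of 3
6. Modulus (Remainder):
result <- 5 %% 3 # Remainder of 5 divided by 3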
1. Square Root:
2. Absolute Value:
3. Trigonometric Functions:
4. Logarithmic Functions:
5. Exponential Function:
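The functions named in this list are all part of base R. A short sketch with arbitrarily chosen inputs:

```r
root <- sqrt(16)           # Square root: 4
magnitude <- abs(-7.5)     # Absolute value: 7.5
sine <- sin(pi / 2)        # Trigonometric functions take angles in radians
cosine <- cos(0)
natural_log <- log(exp(1)) # log() is the natural logarithm by default
log_base_10 <- log10(1000) # Base-10 logarithm
log_base_2 <- log2(8)      # Base-2 logarithm
e_value <- exp(1)          # Exponential function: e^1

print(c(root, magnitude, sine, cosine, natural_log, log_base_10, log_base_2, e_value))
```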
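# Create a vector (illustrative values)
vec <- c(1, 2, 3, 4, 5)
# Perform operations
vec_sum <- sum(vec) # Sum of all elements
vec_mean <- mean(vec) # Mean of all elements
# Basic arithmetic
a <- 10
b <- 5
# Addition
sum_result <- a + b
# Multiplication
prod_result <- a * b
# Square root
sqrt_result <- sqrt(a)
# Create a vector (illustrative values)
my_vector <- c(1, 2, 3)
# Element-wise addition
added_vector <- my_vector + 2
print(paste("Sum:", sum_result))
print(paste("Product:", prod_result))
print(added_vector)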
Summary
R supports all basic mathematical operations and provides numerous built-in functions
for more complex calculations.
For larger data-processing tasks built on these operations, you can explore packages like
dplyr (data manipulation) or purrr (functional programming).
5. Variables in R
In R, variables are used to store data values. Understanding how to create, assign, and
manipulate variables is fundamental for programming in R. Here’s a guide on variables in R:
The most common assignment operator is <-, but you can also use =.
2. Variable Names:
Variable names must start with a letter (or with a dot that is not followed by a digit)
and can contain letters, numbers, underscores, and dots.
They are case-sensitive, so myVar and myvar are different variables.
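A brief sketch of assignment and case sensitivity; the variable names and values are only examples.

```r
x <- 5 # The usual assignment operator
y = 10 # Also valid for top-level assignment

# Names are case-sensitive, so these are two separate variables.
myVar <- "upper-case V"
myvar <- "lower-case v"

print(x + y)
print(myVar)
print(myvar)
```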
Types of Variables
1. Numeric:
num <- 42.5 # A numeric value
2. Integer:
3. Character:
4. Logical:
5. Complex:
You can check the type of a variable using the class() function:
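One value of each type listed above, checked with class(). The values are arbitrary; note the L suffix, which makes a literal an integer rather than a numeric.

```r
num <- 42.5    # Numeric (double)
int <- 42L     # Integer: the L suffix marks an integer literal
chr <- "hello" # Character
lgl <- TRUE    # Logical
cmp <- 3 + 2i  # Complex

print(class(num)) # "numeric"
print(class(int)) # "integer"
print(class(chr)) # "character"
print(class(lgl)) # "logical"
print(class(cmp)) # "complex"
```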
Updating Variables
x <- 5
x <- x + 2 # x is now 7
Removing Variables
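Variables are removed with rm(). A minimal sketch, using exists() to confirm the removal; temp_value is an illustrative name.

```r
temp_value <- 100

existed_before <- exists("temp_value") # TRUE: the variable is defined
rm(temp_value)                         # Remove it from the environment
exists_after <- exists("temp_value")   # FALSE: it is gone

print(existed_before)
print(exists_after)
```

In an interactive session, rm(list = ls()) clears every object in the workspace.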
Here’s a simple example that demonstrates variable creation, assignment, and manipulation:
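# Create numeric variables
a <- 10
b <- 20
# Perform calculations
sum_result <- a + b
prod_result <- a * b
# Create a character variable (illustrative text)
message <- "Calculation results:"
# Print results
print(message)
print(paste("Sum:", sum_result))
print(paste("Product:", prod_result))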
Summary
Variables in R are created using assignment operators, with <- being the most common.
R supports multiple data types, including numeric, integer, character, logical, and
complex.
You can check variable types, update values, and remove variables as needed.
6. Advanced Data Structures in R
R offers several advanced data structures that allow for more complex data management and
manipulation. Here’s an overview of some of the key advanced data structures in R:
1. Lists
Description: Lists can hold different types of elements, including numbers, strings,
vectors, and even other lists. They are particularly useful for storing collections of
related but heterogeneous data.
Creating a List:
my_list <- list(name = "Alice", age = 25, scores = c(90, 85, 88))
Accessing Elements:
my_list$name # Access the name
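Besides the $ operator, list elements can be reached with double and single brackets. A short sketch using the same my_list; the passed element added at the end is a hypothetical field.

```r
my_list <- list(name = "Alice", age = 25, scores = c(90, 85, 88))

person_age <- my_list[["age"]]    # Double brackets return the element itself
second_score <- my_list$scores[2] # Index into the vector stored in the list
first_as_list <- my_list[1]       # Single brackets return a sub-list

# Add a new element by assigning to a new name.
my_list$passed <- TRUE

print(person_age)
print(second_score)
print(names(my_list))
```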
2. Data Frames
Description: Data frames are two-dimensional structures that can store different types
of variables (columns) in a tabular format. They are widely used for statistical data
analysis.
Creating a Data Frame:
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30)) # illustrative columns and values
Accessing Elements:
df[1, 2] # Access the value in the first row and second column
3. Matrices
Description: Matrices are two-dimensional arrays that can only hold one data type.
They are useful for mathematical computations and linear algebra.
Creating a Matrix:
Accessing Elements:
matrix_data[2, 3] # Access the element in the second row and third column
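A matrix such as matrix_data could be created as in the sketch below. The values 1 to 6 and the 2 x 3 shape are assumptions chosen so that the access shown above works. R fills matrices column by column unless byrow = TRUE is given.

```r
# 2 rows, 3 columns, filled column by column: columns are (1,2), (3,4), (5,6).
matrix_data <- matrix(1:6, nrow = 2, ncol = 3)

element <- matrix_data[2, 3]  # Second row, third column
second_row <- matrix_data[2, ] # Entire second row
transposed <- t(matrix_data)   # Transpose: 3 rows, 2 columns

print(matrix_data)
print(element)
```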
4. Arrays
Description: Arrays are similar to matrices but can have more than two dimensions.
Like matrices, they hold a single data type.
Creating an Array:
Accessing Elements:
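A minimal sketch of creating and indexing a three-dimensional array; the dimensions and values are arbitrary choices for illustration.

```r
# 2 rows x 3 columns x 4 layers, filled with 1 to 24.
arr <- array(1:24, dim = c(2, 3, 4))

element <- arr[1, 2, 3]   # Row 1, column 2, layer 3
first_layer <- arr[, , 1] # The whole first layer, a 2 x 3 matrix

print(dim(arr))
print(element)
print(first_layer)
```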
5. Factors
Description: Factors are used to represent categorical data. They store both the values
and the corresponding levels (categories).
Creating a Factor:
Accessing Levels:
levels(gender) # Get the levels of the factor
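A factor such as gender could be created as in the sketch below; the four values are hypothetical. By default the levels are the sorted unique values.

```r
# Hypothetical categorical data.
gender <- factor(c("male", "female", "female", "male"))

gender_levels <- levels(gender) # "female" "male"
gender_counts <- table(gender)  # Number of observations per level
gender_codes <- as.integer(gender) # Underlying integer codes

print(gender_levels)
print(gender_counts)
print(gender_codes)
```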
6. Tibbles
Description: Tibbles are modern versions of data frames that provide a cleaner print
output and are easier to use with the tidyverse family of packages.
Creating a Tibble:
library(tibble)
Accessing Elements:
print(passed_students)
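The listing that produced passed_students is not preserved here, so the following is only a sketch of how such a result could be obtained. The student names, scores, and the pass mark of 50 are all assumptions. It filters a base data frame and converts the result to a tibble only if the tibble package is installed.

```r
# Hypothetical student data.
students <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  score = c(90, 45, 72)
)

pass_mark <- 50 # Assumed threshold

# Keep only the rows whose score reaches the pass mark.
passed_students <- students[students$score >= pass_mark, ]
print(passed_students)

# Optional: the same rows as a tibble, if the package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  passed_tbl <- tibble::as_tibble(passed_students)
  print(passed_tbl)
}
```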
Summary
R provides a variety of advanced data structures, including lists, data frames, matrices,
arrays, factors, and tibbles.
Each data structure has its own use cases and advantages, making R a flexible tool for
data analysis and manipulation.
7. Recursion in R
Recursion in R is a programming technique where a function calls itself to solve a problem.
It's useful for problems that can be broken down into smaller, similar subproblems. Here’s
how recursion works in R, along with some examples:
Base Case: The condition under which the recursion stops. This prevents infinite recursion.
Recursive Case: The part of the function that calls itself with modified arguments to work
towards the base case.
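The two parts can be seen in a recursive factorial. This is a minimal sketch; recursive_factorial is an illustrative name, and the function assumes a non-negative whole number as input.

```r
recursive_factorial <- function(n) {
  if (n <= 1) {
    # Base case: 0! and 1! are both 1, so the recursion stops here.
    return(1)
  }
  # Recursive case: n! = n * (n - 1)!, moving one step towards the base case.
  return(n * recursive_factorial(n - 1))
}

print(recursive_factorial(5)) # 5 * 4 * 3 * 2 * 1 = 120
```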
R supports recursive functions, which means a function you define can call itself.
Recursion is a common mathematical and programming concept, and it lets you work through
data step by step to reach a result.
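The listing for tri_recursion() is not preserved here. The sketch below is one way such a function could be written, assuming it sums the integers from k down to 1; the starting value 6 is an arbitrary choice.

```r
tri_recursion <- function(k) {
  if (k > 0) {
    # Recursive case: add k to the result for k - 1.
    result <- k + tri_recursion(k - 1)
  } else {
    # Base case: k has reached 0, so stop recursing.
    result <- 0
  }
  return(result)
}

total <- tri_recursion(6)
print(total) # 6 + 5 + 4 + 3 + 2 + 1 = 21
```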
In this example, tri_recursion() is a function that we have defined to call itself ("recurse"). We
use the k variable as the data, which decrements by 1 every time we recurse. The recursion ends
when k is no longer greater than 0 (i.e. when it is 0).
For a new developer it can take some time to work out exactly how this works; the best way to
find out is by testing and modifying it.