
Unit 2 R

Uploaded by

Madhav Gupta
Copyright
© All Rights Reserved

Data Science using R (Unit 2)

1. How to run R

To run R, you’ll need to install it on your computer. Here’s a step-by-step guide:

1. Install R

• Download R: Go to CRAN (the Comprehensive R Archive Network) and download the version
suitable for your operating system (Windows, macOS, or Linux).

• Install R: Follow the installation instructions for your OS.

2. Install RStudio (Optional but Recommended)

• Download RStudio: Go to the RStudio website and download the free version.

• Install RStudio: Follow the installation instructions.


3. Run R or RStudio

• Using R: You can open the R console directly after installation. It shows a
command-line interface where you can enter R commands.

• Using RStudio: Open RStudio, and you’ll have a more user-friendly environment with
a script editor, console, and other features.

4. Write and Run Code

• In the R console: Type R commands directly and press Enter to execute them.

• In RStudio:

  • Write your code in the script editor.

  • Highlight the code and click the "Run" button, or use the keyboard shortcut (usually
    Ctrl + Enter on Windows and Linux, or Cmd + Enter on macOS), to run the selected code.

5. Save Your Work

• You can save your scripts in RStudio with a .R file extension.

6. Install Packages (Optional)

• If you need additional functionality, you can install packages using:

install.packages("packageName")

• Load the package with:

library(packageName)

Getting Help

• Use the ? operator to get help on specific functions. For example:

?mean

2. R sessions and functions

In R, managing sessions and functions is key to effective programming:

R Sessions

• What is an R Session? An R session is an interactive environment where you can
execute R commands and scripts. Each session maintains its own workspace, including
variables, functions, and loaded packages.

• Starting a Session: You start a new session whenever you open R or RStudio. You can
run code interactively or execute scripts.

• Workspace Management:

  • Saving Workspace: Use save.image() to save the objects in your current workspace to a
    file (.RData by default).
  • Loading Workspace: Use load("yourfile.RData") to load a saved workspace.

• Ending a Session: You can end a session by typing q() in the console. By default, an
interactive session prompts you to save your workspace before it closes.
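The save-and-restore cycle described above can be sketched as follows. This is a minimal illustration that assumes two made-up objects (scores and greet) and a temporary file. It uses save(), which writes only the objects you name, whereas save.image() writes everything in the global environment.

```r
# Hypothetical objects standing in for a real workspace
scores <- c(90, 85, 88)
greet <- function(name) paste("Hello,", name)

# Save the named objects to a temporary .RData file
workspace_file <- tempfile(fileext = ".RData")
save(scores, greet, file = workspace_file)

# Remove the objects, then restore them from the file
rm(scores, greet)
load(workspace_file)

print(scores)
print(greet("R"))
```

After load(), both objects are available again under their original names.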

Functions in R

• Defining a Function: You create a function using the function keyword. Here’s the
basic syntax:

my_function <- function(arg1, arg2) {
  # Function body
  result <- arg1 + arg2
  return(result)
}

• Calling a Function: Once defined, you can call the function by passing the required
arguments:

my_function(5, 3) # Returns 8

• Default Arguments: You can set default values for arguments:

my_function <- function(arg1, arg2 = 10) {
  return(arg1 + arg2)
}

my_function(5) # Returns 15

• Anonymous Functions: You can create functions without naming them, often used for
short, one-time tasks:

result <- (function(x) { x^2 })(4) # Returns 16

• Using Functions: Functions are versatile and can be used in various contexts:

  • As arguments to other functions (e.g., apply, lapply).

  • In control structures (e.g., if, for, while).
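Both uses listed above can be sketched in a few lines. The names numbers and square_plus are made up for illustration, and sapply() is used as a close relative of lapply() that returns a vector.

```r
numbers <- c(1, 2, 3, 4)

# Pass an anonymous function to sapply(); it is applied to each element
squares <- sapply(numbers, function(x) x^2)
print(squares)  # 1 4 9 16

# Pass a named function the same way; extra arguments are forwarded to it
square_plus <- function(x, offset = 1) x^2 + offset
shifted <- sapply(numbers, square_plus, offset = 10)
print(shifted)  # 11 14 19 26

# Call a function inside control structures
for (n in numbers) {
  if (square_plus(n) > 5) {
    print(paste(n, "gives a value above 5"))
  }
}
```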

Example of a Simple R Function

Here’s a complete example that calculates the factorial of a number:

factorial_function <- function(n) {
  if (n < 0) {
    return("Error: Negative number")
  } else if (n == 0) {
    return(1)
  } else {
    result <- 1
    for (i in 1:n) {
      result <- result * i
    }
    return(result)
  }
}

# Call the function

factorial_function(5) # Returns 120


3. Statistical functions in R

R is a powerful tool for statistical analysis, offering a wide range of built-in functions for
various statistical operations. Here’s an overview of some common statistical functions in R:

Descriptive Statistics

1. Mean

mean(x) # Calculates the mean of x

2. Median

median(x) # Calculates the median of x

3. Standard Deviation

sd(x) # Calculates the sample standard deviation of x

4. Variance

var(x) # Calculates the sample variance of x

5. Quantiles

quantile(x, probs = c(0.25, 0.5, 0.75)) # Calculates the 25th, 50th, and 75th percentiles

6. Summary Statistics

summary(x) # Provides the minimum, first quartile, median, mean, third quartile, and maximum
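As a quick illustration, the functions above can be applied to a small made-up vector. The values in x are hypothetical and chosen so that the results are easy to check by hand.

```r
# Hypothetical sample of eight observations
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

x_mean <- mean(x)      # 5
x_median <- median(x)  # 4.5
x_var <- var(x)        # sample variance: divides by n - 1, giving 32 / 7
x_sd <- sd(x)          # square root of the sample variance

print(c(mean = x_mean, median = x_median, var = x_var, sd = x_sd))
print(quantile(x, probs = c(0.25, 0.5, 0.75)))
print(summary(x))
```

Note that var() and sd() use the n - 1 denominator, so they estimate the variance and standard deviation of the population from which the sample was drawn.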

Probability Distributions

R provides functions for various probability distributions:

1. Normal Distribution

a) Probability Density Function (PDF):

A Probability Density Function (PDF) describes the relative likelihood of a continuous random
variable taking values near a particular point. The density at a single point is not itself a
probability; probabilities come from areas under the curve. In R, you can work with PDFs for
various distributions.

Key Concepts

1. Continuous Random Variable: Unlike discrete random variables, continuous
variables can take any value within a range.

2. PDF Characteristics:
• The area under the PDF curve across the entire range is equal to 1.
• The probability that a random variable falls within a particular range is given by the
integral of the PDF over that range.

dnorm(x, mean, sd) # Density at x

b) Cumulative Distribution Function (CDF):


The Cumulative Distribution Function (CDF) describes the probability that a random
variable takes a value less than or equal to a specific value. In R, you can work with CDFs for
various distributions using built-in functions.

Characteristics:

• The CDF is a non-decreasing function.

• The value of the CDF approaches 1 as x approaches infinity and approaches 0 as x
approaches negative infinity.

pnorm(q, mean, sd) # Probability of being less than or equal to q

c) Random Generation:
Random generation in R refers to the process of generating random numbers from various
statistical distributions. R provides a suite of functions for generating random numbers from
common distributions.

rnorm(n, mean, sd) # Generates n random numbers from a normal distribution
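The three normal-distribution functions can be tried together. The sketch below assumes a standard normal for the density and CDF calls, and an arbitrary seed, mean, and standard deviation for the random draws. It also checks the statement that the probability of a range is the integral of the PDF over that range.

```r
# Density of the standard normal at its mean: 1 / sqrt(2 * pi)
density_at_zero <- dnorm(0, mean = 0, sd = 1)

# P(X <= 1.96) for a standard normal, roughly 0.975
prob_below <- pnorm(1.96, mean = 0, sd = 1)

# Probability of a range = integral of the PDF = difference of two CDF values
area <- integrate(dnorm, lower = -1, upper = 1)$value
cdf_diff <- pnorm(1) - pnorm(-1)
print(c(area, cdf_diff))  # both are about 0.6827

# Reproducible random draws (the seed value is arbitrary)
set.seed(42)
draws <- rnorm(1000, mean = 50, sd = 10)
print(mean(draws))
print(sd(draws))
```

The sample mean and standard deviation of the draws should land close to, but not exactly on, 50 and 10.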

2. Binomial Distribution

• Probability mass function (the discrete counterpart of a PDF): dbinom(x, size, prob) #
Probability of exactly x successes in size trials

• CDF: pbinom(q, size, prob) # Probability of at most q successes

• Random Generation: rbinom(n, size, prob) # Generates n random numbers from a
binomial distribution

3. T-distribution

• PDF: dt(x, df) # Density at x for t-distribution with df degrees of freedom

• CDF: pt(q, df) # Probability of being less than or equal to q

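A short sketch of the binomial and t-distribution functions, assuming a fair coin flipped 10 times and a t-distribution with 7 degrees of freedom (both chosen only for illustration). It also uses qt(), the quantile function that inverts pt().

```r
# Exactly 3 heads in 10 fair coin flips: choose(10, 3) / 2^10
p_exactly_3 <- dbinom(3, size = 10, prob = 0.5)

# At most 3 heads: the sum of the probabilities for 0, 1, 2, and 3 heads
p_at_most_3 <- pbinom(3, size = 10, prob = 0.5)

print(p_exactly_3)  # 0.1171875
print(p_at_most_3)  # 0.171875

# The t density is symmetric about 0, so the CDF at 0 is 0.5
cdf_at_zero <- pt(0, df = 7)

# qt() inverts pt(): the 97.5th percentile with 7 degrees of freedom
critical_value <- qt(0.975, df = 7)
print(critical_value)  # about 2.36
```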
Inferential Statistics

1. t-tests

a) One-sample t-test: A one-sample t-test is used to determine whether the mean of a single
sample differs significantly from a known or hypothesized population mean. Here’s how
to perform a one-sample t-test in R, along with an example.

Steps to Perform a One-Sample t-Test

1. State the Hypotheses:

• Null Hypothesis: The mean of the population from which the sample was drawn is equal to
the hypothesized value.

• Alternative Hypothesis: That population mean is not equal to the hypothesized value.

2. Collect Sample Data: Gather the data for your sample.

3. Use the t.test() Function: This function in R performs the one-sample t-test.

Example: One-Sample t-Test in R

Let’s say we want to test whether the average height of a sample of students is different
from the known average height of 170 cm.

Step 1: Create Sample Data

# Sample data: heights of students (in cm)

student_heights <- c(168, 172, 169, 171, 173, 165, 170, 174)

Step 2: Perform the One-Sample t-Test

We will test the null hypothesis that the mean height is equal to 170 cm.

# Perform one-sample t-test


t_test_result <- t.test(student_heights, mu = 170)

# Print the results

print(t_test_result)

Interpreting the Output

When you run the t.test() function, you'll get an output that includes:

• t-value: The calculated t-statistic.

• degrees of freedom: For a one-sample test, the sample size minus one (n - 1).

• p-value: The probability of observing a t-statistic at least as extreme as the one
calculated, assuming the null hypothesis is true.

• confidence interval: A range of plausible values for the true population mean (95% by
default).

• mean of x: The sample mean.

Example Output Interpretation

For the sample above, the output looks approximately like this:

data: student_heights

t = 0.24254, df = 7, p-value = 0.8153

alternative hypothesis: true mean is not equal to 170

95 percent confidence interval:

167.8126 172.6874

mean of x 170.25

• t = 0.24254: The t-statistic. The sample mean (170.25) is about a quarter of a standard
error above 170.

• p-value = 0.8153: Since this p-value is greater than 0.05, we fail to reject the null
hypothesis. There is not enough evidence to conclude that the average height of the
students is different from 170 cm.

• 95% confidence interval: This indicates that we are 95% confident the true mean
height falls between about 167.81 and 172.69 cm. The interval contains 170, which agrees
with the p-value.
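The values discussed above do not have to be read off the printed output. t.test() returns a list, and the sketch below pulls out its documented components and recomputes the t-statistic by hand. The names res, t_manual, and alpha are illustrative choices.

```r
student_heights <- c(168, 172, 169, 171, 173, 165, 170, 174)
res <- t.test(student_heights, mu = 170)

print(res$statistic)  # t-statistic
print(res$parameter)  # degrees of freedom
print(res$p.value)    # p-value
print(res$conf.int)   # confidence interval for the mean
print(res$estimate)   # sample mean

# The t-statistic by hand: (sample mean - mu) / standard error
n <- length(student_heights)
t_manual <- (mean(student_heights) - 170) / (sd(student_heights) / sqrt(n))

# Decision at the 5% significance level
alpha <- 0.05
if (res$p.value < alpha) {
  print("Reject the null hypothesis")
} else {
  print("Fail to reject the null hypothesis")
}
```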

b) Two-sample t-test: A two-sample t-test is used to compare the means of two independent
groups to determine if there is a statistically significant difference between them. Here's
how to perform a two-sample t-test in R, along with an example.

Steps to Perform a Two-Sample t-Test

1. State the Hypotheses:

• Null Hypothesis: The means of the two groups are equal.

• Alternative Hypothesis: The means of the two groups are not equal.

2. Collect Sample Data: Gather the data for both groups.


3. Use the t.test() Function: This function in R performs the two-sample t-test.

Example: Two-Sample t-Test in R

Let’s say we want to compare the heights of students from two different classes to see if there
is a significant difference between their average heights.

Step 1: Create Sample Data

# Heights of students in Class A (in cm)

class_a <- c(160, 165, 170, 175, 180)

# Heights of students in Class B (in cm)

class_b <- c(155, 160, 165, 170, 175)

Step 2: Perform the Two-Sample t-Test

We will test the null hypothesis that the mean height of Class A is equal to the mean height of
Class B.

# Perform two-sample t-test

t_test_result <- t.test(class_a, class_b)

# Print the results

print(t_test_result)

Interpreting the Output

When you run the t.test() function, you'll get an output that includes:

• t-value: The calculated t-statistic.

• degrees of freedom: By default, t.test() runs Welch's test, which does not assume equal
variances and estimates the degrees of freedom from the data, so the value need not be a
whole number.

• p-value: The probability of observing a t-statistic at least as extreme as the one
calculated, assuming the null hypothesis is true.

• confidence interval: A range of plausible values for the true difference in means (95% by
default).

• mean of x and mean of y: The means of each group.

Example Output Interpretation


For the two samples above, the output looks approximately like this:

data: class_a and class_b

t = 1, df = 8, p-value = 0.3466

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-6.530 16.530

mean of x & mean of y : 170 & 165

• t = 1: The t-statistic. The difference in means (5 cm) equals one standard error of the
difference.

• p-value = 0.3466: Since this p-value is greater than 0.05, we fail to reject the null
hypothesis. There is not enough evidence to conclude that the average heights of the two
classes are different.

• 95% confidence interval: This indicates that we are 95% confident the true difference
in means lies between about -6.53 cm and 16.53 cm. The interval contains 0, which agrees
with the p-value.

• mean of x: Average height of Class A is 170 cm.

• mean of y: Average height of Class B is 165 cm.
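t.test() has a few arguments that change which test is run. The sketch below compares the default Welch test with the pooled-variance (Student's) test via var.equal = TRUE, and shows a one-sided alternative. The object names are illustrative.

```r
class_a <- c(160, 165, 170, 175, 180)
class_b <- c(155, 160, 165, 170, 175)

# Default: Welch's test, no equal-variance assumption
welch <- t.test(class_a, class_b)

# Student's test with pooled variance: df = n1 + n2 - 2
pooled <- t.test(class_a, class_b, var.equal = TRUE)

# One-sided alternative: is the Class A mean greater than the Class B mean?
one_sided <- t.test(class_a, class_b, alternative = "greater")

print(c(welch = welch$p.value, pooled = pooled$p.value,
        one_sided = one_sided$p.value))
```

In this data set both classes have the same size and the same sample variance, so the Welch and pooled tests agree. With unequal variances or group sizes they generally differ.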

2. ANOVA (Analysis of Variance)

aov_result <- aov(y ~ group, data = your_data) # Fit ANOVA model

summary(aov_result) # Summarizes the ANOVA results
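A runnable version of the ANOVA call, using made-up scores for three hypothetical groups (A, B, and C). The column names y and group follow the formula above, and the data values are assumptions for illustration only.

```r
# Hypothetical data: three groups with three observations each
your_data <- data.frame(
  y = c(78, 82, 80, 88, 91, 89, 70, 72, 74),
  group = factor(rep(c("A", "B", "C"), each = 3))
)

aov_result <- aov(y ~ group, data = your_data)
anova_table <- summary(aov_result)
print(anova_table)

# The summary is a list holding one table; pull the F statistic and p-value
f_value <- anova_table[[1]][["F value"]][1]
p_value <- anova_table[[1]][["Pr(>F)"]][1]
print(c(F = f_value, p = p_value))
```

A small p-value suggests that at least one group mean differs from the others. It does not say which one.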

3. Chi-squared Test

chisq.test(table) # Performs a chi-squared test of independence
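As a sketch, the test can be run on a small 2x2 table of made-up counts. The group and preference labels are hypothetical. For 2x2 tables, chisq.test() applies Yates' continuity correction by default.

```r
# Hypothetical counts: rows are groups, columns are responses
counts <- matrix(
  c(30, 10, 20, 40),
  nrow = 2,
  dimnames = list(group = c("A", "B"), preference = c("Yes", "No"))
)
print(counts)

chi_result <- chisq.test(counts)
print(chi_result)

# Expected counts under independence: row total * column total / grand total
print(chi_result$expected)
```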

Correlation and Regression

1. Correlation

cor(x, y) # Calculates the correlation coefficient between x and y

2. Linear Regression

lm_result <- lm(y ~ x, data = your_data) # Fit linear model

summary(lm_result) # Summarizes the regression results
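The correlation and regression calls can be tried on a small made-up data set. Here x and y are hypothetical values (for example, hours studied and exam score), and predict() is used to estimate y for a new x.

```r
# Hypothetical data
your_data <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(52, 55, 61, 64, 70, 71)
)

r <- cor(your_data$x, your_data$y)
print(r)

lm_result <- lm(y ~ x, data = your_data)
coefs <- coef(lm_result)  # intercept and slope
print(coefs)

# Predicted y for a new value of x
predicted <- predict(lm_result, newdata = data.frame(x = 7))
print(predicted)

# For a regression with one predictor, R-squared is the squared correlation
r_squared <- summary(lm_result)$r.squared
print(r_squared)
```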


Example

Here’s a quick example of performing some descriptive statistics and a t-test:

# Sample data

data <- c(12, 15, 14, 10, 20, 18)

# Descriptive statistics

mean_value <- mean(data)

sd_value <- sd(data)


summary_stats <- summary(data)

# One-sample t-test

t_test_result <- t.test(data, mu = 15)


# Output results

print(mean_value)

print(sd_value)

print(summary_stats)

print(t_test_result)

4. Basic Maths in R

R is a powerful tool for performing basic mathematical operations. Here’s a quick guide on
how to perform various mathematical tasks in R:

Basic Arithmetic Operations

1. Addition:

result <- 5 + 3 # Adds 5 and 3

2. Subtraction:

result <- 5 - 3 # Subtracts 3 from 5

3. Multiplication:

result <- 5 * 3 # Multiplies 5 by 3

4. Division:

result <- 5 / 3 # Divides 5 by 3

5. Exponentiation:

result <- 5 ^ 3 # Raises 5 to the power of 3

6. Modulus (Remainder):

result <- 5 %% 3 # Returns the remainder of 5 divided by 3
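A few related details are worth trying alongside the operators above: integer division with %/%, how %% behaves with negative numbers, and operator precedence. The variable names are illustrative.

```r
int_div <- 7 %/% 2        # 3: integer division (rounds down)
remainder <- 7 %% 2       # 1
neg_remainder <- (-7) %% 2   # 1: the result takes the sign of the divisor
neg_int_div <- (-7) %/% 2    # -4: rounds toward negative infinity

# Precedence: ^ is evaluated first, then * and /, then + and -
mixed <- 2 + 3 * 4^2      # 50
neg_square <- -2^2        # -4, because ^ binds tighter than the unary minus
paren_square <- (-2)^2    # 4

print(c(int_div, remainder, neg_remainder, neg_int_div))
print(c(mixed, neg_square, paren_square))
```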


Functions for Basic Math

1. Square Root:

sqrt_value <- sqrt(25) # Calculates the square root of 25

2. Absolute Value:

abs_value <- abs(-10) # Returns the absolute value of -10

3. Trigonometric Functions:

sin_value <- sin(pi / 2) # Sine of 90 degrees

cos_value <- cos(pi) # Cosine of 180 degrees


tan_value <- tan(pi / 4) # Tangent of 45 degrees

4. Logarithmic Functions:

log_value <- log(10) # Natural logarithm (base e)

log10_value <- log10(100) # Logarithm base 10

5. Exponential Function:

exp_value <- exp(1) # e^1 (Euler's number)

Working with Vectors

R is particularly strong in handling vectors, which allows you to perform element-wise
operations:

# Create a vector

vec <- c(1, 2, 3, 4, 5)

# Perform operations

vec_plus_one <- vec + 1 # Adds 1 to each element

vec_squared <- vec^2 # Squares each element


Example: Basic Math Operations in R

Here’s an example that puts some of these concepts together:

# Basic arithmetic

a <- 10

b <- 5

# Addition

sum_result <- a + b

# Multiplication

prod_result <- a * b

# Square root

sqrt_result <- sqrt(a)

# Create a vector

numbers <- c(1, 2, 3, 4, 5)

# Element-wise addition

added_vector <- numbers + 2


# Print results

print(paste("Sum:", sum_result))

print(paste("Product:", prod_result))

print(paste("Square root of a:", sqrt_result))

print("Vector after adding 2:")

print(added_vector)

Summary

• R supports all basic mathematical operations and provides numerous built-in functions
for more complex calculations.

• You can easily work with vectors to perform operations element-wise.

• To apply these operations across data frames and lists, you can explore packages like
dplyr (data manipulation) or purrr (functional programming).

5. Variables in R

In R, variables are used to store data values. Understanding how to create, assign, and
manipulate variables is fundamental for programming in R. Here’s a guide on variables in R:

Creating and Assigning Variables


1. Assignment Operators:

• The most common assignment operator is <-, but you can also use =.

x <- 5 # Assigns the value 5 to x

y = 10 # Also assigns the value 10 to y

2. Variable Names:

• Variable names must start with a letter (or with a dot that is not followed by a digit) and
can contain letters, numbers, underscores, and dots.
• They are case-sensitive, so myVar and myvar are different variables.

my_variable <- 100

myVariable2 <- 200

Types of Variables

R has several data types that variables can hold:

1. Numeric:

num <- 42.5 # A numeric value

2. Integer:

int_num <- 42L # An integer value (note the 'L')

3. Character:

char <- "Hello, R!" # A character string

4. Logical:

bool <- TRUE # A logical value

5. Complex:

comp <- 1 + 2i # A complex number

Checking Variable Types

You can check the type of a variable using the class() function:

class(num) # Returns "numeric"

class(char) # Returns "character"


class(bool) # Returns "logical"
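Beyond class(), base R offers typeof() for the internal storage type, is.*() functions for testing a type, and as.*() functions for converting between types. A minimal sketch, reusing the variable names from above:

```r
num <- 42.5
int_num <- 42L

int_class <- class(int_num)   # "integer"
num_type <- typeof(num)       # "double": numeric values are stored as doubles
int_is_numeric <- is.numeric(int_num)  # TRUE: integers count as numeric

# Conversions between types
truncated <- as.integer(num)      # 42: the fractional part is dropped
parsed <- as.numeric("3.14")      # 3.14
as_text <- as.character(num)      # "42.5"

print(c(int_class, num_type, as_text))
print(c(truncated, parsed))
```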

Updating Variables

You can update the value of a variable by simply reassigning it:

x <- 5

x <- x + 2 # x is now 7

Removing Variables

To remove a variable, use the rm() function:


rm(x) # Removes the variable x

Example: Working with Variables

Here’s a simple example that demonstrates variable creation, assignment, and manipulation:

# Assign values to variables

a <- 10

b <- 20

# Perform arithmetic operations


sum_result <- a + b

prod_result <- a * b

# Create a character variable

message <- "The results are:"

# Print results

print(message)

print(paste("Sum:", sum_result))

print(paste("Product:", prod_result))

Summary

• Variables in R are created using assignment operators, with <- being the most common.

• R supports multiple data types, including numeric, integer, character, logical, and
complex.

• You can check variable types, update values, and remove variables as needed.

6. Advanced Data Structures

R offers several advanced data structures that allow for more complex data management and
manipulation. Here’s an overview of some of the key advanced data structures in R:

1. Lists

• Description: Lists can hold different types of elements, including numbers, strings,
vectors, and even other lists. They are particularly useful for storing collections of
related but heterogeneous data.

• Creating a List:

my_list <- list(name = "Alice", age = 25, scores = c(90, 85, 88))

• Accessing Elements:

my_list$name # Access the name

my_list[[2]] # Access the second element (age)
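One detail that often trips people up is the difference between single and double brackets on a list. The sketch below reuses my_list from above. The city element added at the end is a made-up value.

```r
my_list <- list(name = "Alice", age = 25, scores = c(90, 85, 88))

# Single brackets return a sub-list; double brackets return the element itself
sub_list <- my_list["age"]
element <- my_list[["age"]]
print(class(sub_list))  # "list"
print(class(element))   # "numeric"

# Reach into a vector stored inside the list
second_score <- my_list$scores[2]  # 85

# Add a new element by assigning to a new name (hypothetical value)
my_list$city <- "Delhi"
print(names(my_list))
```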

2. Data Frames

• Description: Data frames are two-dimensional structures that can store different types
of variables (columns) in a tabular format. They are widely used for statistical data
analysis.

• Creating a Data Frame:

df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  score = c(90, 85, 88)
)

• Accessing Elements:

df$age # Access the age column

df[1, 2] # Access the value in the first row and second column

df[df$score > 85, ] # Subset rows based on a condition
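A few more everyday operations on the same data frame: inspecting its shape, adding a column, and sorting. The passed column and the 85-point threshold are illustrative.

```r
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  score = c(90, 85, 88)
)

# Add a logical column derived from an existing one
df$passed <- df$score > 85

# Inspect the structure and dimensions
str(df)
print(c(rows = nrow(df), cols = ncol(df)))

# Sort by score, highest first, keeping two columns
ranked <- df[order(-df$score), c("name", "score")]
print(ranked)

mean_age <- mean(df$age)  # 30
```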

3. Matrices

• Description: Matrices are two-dimensional arrays that can only hold one data type.
They are useful for mathematical computations and linear algebra.

• Creating a Matrix:

matrix_data <- matrix(1:9, nrow = 3, ncol = 3) # 3x3 matrix

• Accessing Elements:

matrix_data[2, 3] # Access the element in the second row and third column
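By default, matrix() fills values column by column, which affects what an index returns. The sketch below contrasts this with byrow = TRUE and shows a transpose and a matrix product. The names by_row and product are illustrative.

```r
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)
col_fill_value <- matrix_data[2, 3]  # 8: values are filled column by column

# Fill row by row instead
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)
row_fill_value <- by_row[2, 3]       # 6

# Transposing the column-filled matrix gives the row-filled one
transposed <- t(matrix_data)

# * multiplies element by element; %*% is the matrix product
elementwise <- matrix_data * matrix_data
product <- matrix_data %*% diag(3)   # multiplying by the identity matrix

print(matrix_data)
print(by_row)
```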

4. Arrays
• Description: Arrays are similar to matrices but can have more than two dimensions.
Like matrices, they can hold only one data type.

• Creating an Array:

array_data <- array(1:12, dim = c(2, 3, 2)) # 2x3x2 array

• Accessing Elements:

array_data[1, 2, 2] # Access an element in the array
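A small sketch of how array indexing works, and of what happens when values of different types are combined. The mixed array is a made-up example.

```r
array_data <- array(1:12, dim = c(2, 3, 2))

dims <- dim(array_data)          # 2 3 2
picked <- array_data[1, 2, 2]    # 9: row 1, column 2 of the second layer

# Leaving an index empty selects everything along that dimension
first_layer <- array_data[, , 1] # a 2x3 matrix
print(first_layer)

# Mixed inputs are coerced to a single type (here, character)
mixed <- array(c(1, "a", TRUE, 2), dim = c(2, 2))
print(typeof(mixed))             # "character"
```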

5. Factors

• Description: Factors are used to represent categorical data. They store both the values
and the corresponding levels (categories).

• Creating a Factor:

gender <- factor(c("Male", "Female", "Female", "Male"))

• Accessing Levels:

levels(gender) # Get the levels of the factor
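Factors store each value as an integer code that points at a level. The sketch below shows the default level order, a frequency count, and an ordered factor. The size variable and its levels are made up for illustration.

```r
gender <- factor(c("Male", "Female", "Female", "Male"))

gender_levels <- levels(gender)  # "Female" "Male": alphabetical by default
codes <- as.integer(gender)      # 2 1 1 2: the underlying integer codes
counts <- table(gender)          # frequency of each level
print(counts)

# An ordered factor with an explicit level order allows comparisons
size <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
low_below_high <- size[1] < size[2]  # TRUE
```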

6. Tibbles (from the tibble package)

• Description: Tibbles are modern versions of data frames that provide a cleaner print
output and are easier to use with the tidyverse family of packages.

• Creating a Tibble:

library(tibble) # Requires the tibble package to be installed

my_tibble <- tibble(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  score = c(90, 85, 88)
)

• Accessing Elements:

my_tibble$name # Access the name column

Example: Using Advanced Data Structures

Here’s an example that combines several of these data structures:

# Creating a data frame


df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  score = c(90, 85, 88)
)

# Creating a list that contains the data frame and a vector

my_list <- list(student_data = df, passed = df$score > 85)


# Accessing the data frame from the list

df_from_list <- my_list$student_data

# Print the names of students who passed

passed_students <- df_from_list[df_from_list$score > 85, "name"]

print(passed_students)

Summary

• R provides a variety of advanced data structures, including lists, data frames, matrices,
arrays, factors, and tibbles.

• Each data structure has its own use cases and advantages, making R a flexible tool for
data analysis and manipulation.

7. Recursion in R

Recursion in R is a programming technique where a function calls itself to solve a problem.
It's useful for problems that can be broken down into smaller, similar subproblems. Here’s
how recursion works in R, along with some examples:

Key Concepts of Recursion

Base Case: The condition under which the recursion stops. This prevents infinite recursion.

Recursive Case: The part of the function that calls itself with modified arguments to work
towards the base case.
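The two parts can be seen in a recursive factorial. This is a minimal sketch: the name recursive_factorial is chosen for illustration, and the function assumes n is a non-negative whole number.

```r
recursive_factorial <- function(n) {
  # Base case: 0! and 1! are both 1, so the recursion stops here
  if (n <= 1) {
    return(1)
  }
  # Recursive case: n! = n * (n - 1)!, moving one step closer to the base case
  n * recursive_factorial(n - 1)
}

fact_5 <- recursive_factorial(5)
print(fact_5)  # 120
```

Each call waits for the call below it to return, so recursive_factorial(5) expands to 5 * 4 * 3 * 2 * 1.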

R supports function recursion, which means a defined function can call itself.

Recursion is a common mathematical and programming concept. Its benefit is that you can work
through data step by step to reach a result without writing an explicit loop.
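The tri_recursion() function discussed next is not listed in these notes. The sketch below is one version consistent with that description, written on the assumption that it adds k to the result of calling itself with k - 1 and prints each partial sum.

```r
tri_recursion <- function(k) {
  if (k > 0) {
    # Recursive case: k plus the sum of everything below it
    result <- k + tri_recursion(k - 1)
    print(result)  # show each partial sum as the calls return
    return(result)
  }
  # Base case: k is no longer greater than 0, so stop recursing
  0
}

total <- tri_recursion(6)  # prints 1, 3, 6, 10, 15, 21
```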

In this example, tri_recursion() is a function that we have defined to call itself ("recurse"). We
use the k variable as the data, which decrements by 1 every time we recurse. The recursion ends
when k is no longer greater than 0 (i.e., when it is 0).

For a new developer it can take some time to work out exactly how this works; the best way to
find out is to test and modify it.
