Unit 1 - Data Analysis Using R
INTRODUCTION TO R:
R is a programming language and environment for statistical computing, data science,
and graphics. It was inspired by, and is mostly compatible with, the statistical
language S developed at Bell Laboratories (formerly AT&T, now Lucent
Technologies).
It easily integrates with programming languages such as Java, C++, Python and Ruby.
R is free. It is available under the terms of the Free Software Foundation’s GNU General Public
License in source code form.
It is available for Windows, Mac and a wide variety of Unix platforms (including FreeBSD,
Linux, etc.).
DOWNLOADING AND INSTALLING R
Downloading R
Go to the R Project website: Visit https://www.r-project.org.
Click on "Download R": On the R Project homepage, you'll typically find a link or
button that says "Download R." Click on it.
Select a CRAN mirror: CRAN (Comprehensive R Archive Network) mirrors are
servers that host R packages and the R software itself. Choose a location that is
geographically close to you to ensure faster download speeds.
Choose your operating system: There are versions of R available for Windows, macOS,
and various Linux distributions. Select the version that corresponds to your operating
system.
Download the installer: Click on the link to download the installer package appropriate
for your operating system.
o macOS: Click "Download R for macOS" -> download the .pkg file.
o Linux: Follow the instructions specific to your distribution or use package
managers.
4. Run Installer:
o Windows: Double-click the downloaded .exe file and follow the wizard.
o macOS: Double-click the .pkg file and follow the instructions.
o Linux: Use package manager commands (e.g., sudo apt install r-base for
Ubuntu).
5. Complete Installation: Follow the prompts to complete the installation process.
Final Steps
11. Open R: Locate and launch R from your applications or start menu.
12. Open RStudio: Locate and launch RStudio from your applications or start menu.
13. Verify Installation:
o In RStudio, check that R is properly integrated by running version in the R
console.
14. Update Packages: In RStudio, update your packages by running update.packages() in
the R console.
15. Start Coding: Begin using RStudio for your data analysis and coding tasks!
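The verification step above can be sketched in a few lines. This is a minimal example that only uses functions shipped with base R; the variable names r_version and installed are illustrative, and the update call is left commented out because it needs an internet connection.

```r
# Version string of the R installation that is currently running
r_version <- R.version.string
print(r_version)

# Names of the packages that are currently installed
installed <- rownames(installed.packages())
print(head(installed))

# Uncomment to update installed packages (requires an internet connection)
# update.packages(ask = FALSE)
```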
RStudio
RStudio is an integrated development environment (IDE) specifically designed for R
programming. It provides a comprehensive and user-friendly interface for writing, debugging,
and running R code.
RStudio is one of the most widely used tools for writing, testing, and running R code
(Figure 1.7). It is user-friendly and open source. The typical RStudio screen includes several key parts:
Console: Where commands are typed and outputs are displayed.
Workspace: Shows active objects (variables, datasets) used in the current session.
History: Displays a log of previously executed commands.
Files: Provides a view of folders and files in the current directory.
Plots: Shows graphical outputs such as plots and charts.
Packages: Lists add-ons and packages needed for specific tasks.
Help: Provides documentation and assistance on using RStudio, commands, and more.
Eclipse with StatET
Eclipse is an IDE originally known for Java and C++ development; with the StatET plug-in it can
also be used for R programming. StatET adds tools for coding in R and building R packages. It
supports local and remote R installations and can be extended with add-ons such as Sweave and Wikitext.
Key features include:
Console: Executes R commands and displays results.
Object Browser: Helps navigate and manage R objects.
Package Manager: Manages R packages for project dependencies.
Debugger: Assists in finding and fixing issues in R code.
Data Viewer: Displays datasets and allows exploration.
R Help System: Provides documentation and assistance for R functions and commands.
HANDLING PACKAGES IN R
o setwd(): Sets the working directory to a specified location.
o dir() / list.files(): Lists files and directories in the current or specified
directory, with options to:
Display files in a specific path.
Show absolute paths.
Match files/directories with a pattern.
Recursively list files/directories.
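The directory functions above can be tried safely in a temporary folder. The sketch below is illustrative: the folder name wd_demo and the two file names are made up for the example, and getwd() (not listed above) is used only to remember and restore the original working directory.

```r
# Remember the current working directory so it can be restored later
old_dir <- getwd()

# Create a scratch folder with two empty files (hypothetical names)
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("sales.csv", "notes.txt"))))

# Set the working directory and list its contents
setwd(demo_dir)
all_files <- list.files()                             # files in the current directory
csv_files <- list.files(pattern = "\\.csv$")          # only names ending in .csv
abs_paths <- list.files(demo_dir, full.names = TRUE)  # paths including the directory
rec_files <- list.files(recursive = TRUE)             # also descend into subfolders
print(all_files)
print(csv_files)

# Restore the original working directory
setwd(old_dir)
```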
Data Types in R
In R, variables are not explicitly declared with data types. Instead, they store R objects,
and the data type of the R object determines the data type of the variable. The most
commonly used R objects are:
Vector
List
Matrix
Array
Factor
Data Frame
A vector is the simplest R object; each atomic vector holds elements of a single data type. All
other R objects are built on these atomic vectors. The main data types supported by R are:
Logical
Numeric
Integer
Character
Double
Complex
Raw
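As a quick illustration of these types, the sketch below stores one value of each kind and inspects it with class() and typeof(). The sample values are arbitrary. Note that a plain number such as 3.14 has class "numeric" but is stored as type "double".

```r
# One illustrative value per data type
values <- list(
  logical   = TRUE,
  numeric   = 3.14,
  integer   = 42L,            # the L suffix makes an integer
  character = "R",
  complex   = 2 + 3i,
  raw       = charToRaw("A")  # raw bytes
)

# class() reports the high-level class, typeof() the internal storage type
classes <- sapply(values, class)
types   <- sapply(values, typeof)
print(classes)
print(types)
```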
In R, there are several ways to work with dates, including functions to get the current date,
manipulate dates, and format dates. Here are some common date functions in R:
1. Getting the Current Date and Time:
o Sys.Date(): Returns the current date.
o Sys.time(): Returns the current date and time.
3. Formatting Dates:
o format(): Formats a date object to a specified string format.
4. Date Arithmetic:
o You can perform arithmetic operations on date objects to add or subtract days.
These functions cover the basics of handling dates in R. There are many more functions and
packages available (such as lubridate) that provide additional capabilities and convenience
for working with dates and times.
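The date functions described above can be combined as in the following sketch. The fixed date 2024-01-15 is an arbitrary example, and as.Date() is used to build a date object from a string.

```r
# Current date and current date-time
today <- Sys.Date()
now   <- Sys.time()
print(today)
print(now)

# Build a date from a string (ISO format: year-month-day)
d <- as.Date("2024-01-15")

# Formatting: day/month/year
formatted <- format(d, "%d/%m/%Y")             # "15/01/2024"

# Date arithmetic: add days, or subtract two dates
later <- d + 30                                # 2024-02-14
gap   <- as.numeric(as.Date("2024-03-01") - d) # 46 days
print(formatted)
print(later)
print(gap)
```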
VARIABLES
In R, variables are used to store data that can be referenced and manipulated later in
the program.
Creating Variables
7. Data Frame:
o Stores tabular data.
my_data <- data.frame(name = c("John", "Jane", "Doe"),
                      age = c(30, 25, 22),
                      married = c(TRUE, FALSE, TRUE))
Variables can be reassigned values of either the same data type or a different
data type.
(iv) Reassign a string value to the variable, ‘Var’.
> Var <- "R is a Statistical Programming Language"
Print the value in the variable, ‘Var’.
> Var
[1] "R is a Statistical Programming Language"
(v) Reassign a logical value to the variable, ‘Var’.
> Var <- TRUE
> Var
[1] TRUE
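To see how the data type follows the value, class() can be checked after each reassignment. This is a small sketch; the starting value 25 is an arbitrary choice.

```r
# The same variable takes on the type of whatever value it currently holds
Var <- 25
class_1 <- class(Var)   # "numeric"

Var <- "R is a Statistical Programming Language"
class_2 <- class(Var)   # "character"

Var <- TRUE
class_3 <- class(Var)   # "logical"

print(c(class_1, class_2, class_3))
```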
Functions
Functions in R are used to encapsulate a block of code that performs a specific task. They can
take inputs, process them, and return outputs.
Creating Functions
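# Define a function with no arguments
my_function <- function() {
  print("Hello, World!")
}
# Call the function
my_function()
# Define a function with one argument (illustrative definition of greet)
greet <- function(name) {
  print(paste("Hello,", name))
}
greet("Alice")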
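Functions can also take several inputs, use default values, and return a result. The sketch below uses a hypothetical function, rectangle_area, invented for illustration.

```r
# A function with two arguments; width has a default value of 1
rectangle_area <- function(length, width = 1) {
  area <- length * width
  return(area)            # the value handed back to the caller
}

area_default <- rectangle_area(5)             # width falls back to 1 -> 5
area_custom  <- rectangle_area(5, width = 3)  # 15
print(area_default)
print(area_custom)
```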
In R, min, max, and seq are commonly used functions for handling numerical data and
generating sequences. Here's an overview of how to use these functions effectively:
min Function
The min function is used to find the smallest value in a numeric vector.
# Example vector
vec <- c(10, 5, 8, 3, 7)
min(vec) # Output: 3
Handling NA values:
o By default, min will return NA if there are any missing values (NA) in the
vector.
o Use na.rm = TRUE to remove NA values before finding the minimum.
o The na.rm argument stands for "NA remove" and is a standard argument
recognized by many R functions for handling missing values.
max Function
The max function is used to find the largest value in a numeric vector.
# Example vector
vec <- c(10, 5, 8, 3, 7)
max(vec) # Output: 10
Handling NA values:
o By default, max will return NA if there are any missing values (NA) in the
vector.
o Use na.rm = TRUE to remove NA values before finding the maximum.
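The effect of na.rm on min and max can be seen with a vector that contains a missing value. The vector scores below is an illustrative example; range() is a related base R function that returns both extremes at once.

```r
# A vector with one missing value
scores <- c(10, 5, NA, 3, 7)

# Without na.rm the result is NA
min_with_na <- min(scores)
max_with_na <- max(scores)

# With na.rm = TRUE the NA is dropped first
min_no_na <- min(scores, na.rm = TRUE)      # 3
max_no_na <- max(scores, na.rm = TRUE)      # 10
both      <- range(scores, na.rm = TRUE)    # 3 10

print(min_with_na)
print(min_no_na)
print(max_no_na)
print(both)
```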
seq Function
The seq function is used to generate sequences of numbers. It can be used in several ways:
1. Basic Sequence:
o Generates a sequence from a starting value to an ending value.
sequence_points <- seq(1, 10, length.out = 5)
print(sequence_points)
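The main ways of calling seq can be compared side by side. The sketch below uses arbitrary start and end values.

```r
# From a start to an end value, in steps of 1
basic <- seq(1, 10)                        # 1 2 3 ... 10

# With a fixed step size
by_step <- seq(1, 10, by = 2)              # 1 3 5 7 9

# With a fixed number of evenly spaced points
by_length <- seq(0, 1, length.out = 5)     # 0.00 0.25 0.50 0.75 1.00

# A decreasing sequence needs a negative step
down <- seq(10, 1, by = -3)                # 10 7 4 1

print(basic)
print(by_step)
print(by_length)
print(down)
```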
Examples
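vec <- c(10, 5, 8, 3, 7)             # illustrative values
vec_na <- c(10, NA, 8, 3, 7)         # the same vector with one missing value
min_value <- min(vec); print(min_value)
max_value <- max(vec); print(max_value)
min_no_na <- min(vec_na, na.rm = TRUE); print(min_no_na)
max_no_na <- max(vec_na, na.rm = TRUE); print(max_no_na)
custom_sequence <- seq(2, 20, by = 2); print(custom_sequence)
min_custom <- min(custom_sequence); print(min_custom)
max_custom <- max(custom_sequence); print(max_custom)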
Manipulating text in R involves using various functions and techniques to modify, clean, and
analyze text data, such as concatenating strings, changing case, extracting substrings, pattern
matching, replacing text, and splitting or joining strings. This is essential for preparing and
transforming textual information for further analysis and processing.
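The operations listed above are all available in base R. The following sketch runs each of them on a sample string; the strings are arbitrary examples.

```r
text <- "Hello World"

joined     <- paste("Data", "Analysis", sep = " ")   # concatenate: "Data Analysis"
upper      <- toupper(text)                          # "HELLO WORLD"
lower      <- tolower(text)                          # "hello world"
first_word <- substr(text, 1, 5)                     # extract characters 1 to 5: "Hello"
n_chars    <- nchar(text)                            # number of characters: 11
has_world  <- grepl("World", text)                   # pattern matching: TRUE
replaced   <- sub("World", "R", text)                # replace first match: "Hello R"
no_digits  <- gsub("[0-9]+", "", "abc123def")        # regular expression: "abcdef"
words      <- strsplit(text, " ")[[1]]               # split: "Hello" "World"
rejoined   <- paste(words, collapse = "-")           # join: "Hello-World"

print(joined)
print(replaced)
print(words)
```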
Regular Expressions
stringr Package
The stringr package provides a cohesive set of functions designed for consistent and easier
string manipulation.
# Load the package and define the example string
library(stringr)
text <- "Hello World"
# String length
str_length(text) # Output: 11
# Substring
str_sub(text, 1, 5) # Output: "Hello"
# Uppercase
str_to_upper(text) # Output: "HELLO WORLD"
# Pattern matching
str_detect(text, "World") # Output: TRUE
# Replace text
str_replace(text, "World", "Everyone") # Output: "Hello Everyone"
1. Identifying Missing Values
Using is.na:
o is.na returns a logical vector indicating which elements are NA.
o Use replace or indexing to fill NA values with a specified value.
# Calculate the mean, ignoring NAs
mean(vec, na.rm = TRUE)
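Identifying, counting, and filling missing values can be sketched with base R alone. The vector below is an illustrative example, and filling with 0 or with the mean are just two of several possible choices.

```r
vec <- c(4, NA, 7, NA, 10)

# Identify and count missing values
na_flags <- is.na(vec)        # FALSE TRUE FALSE TRUE FALSE
na_count <- sum(na_flags)     # 2

# Fill NAs with a fixed value using indexing
filled_zero <- vec
filled_zero[is.na(filled_zero)] <- 0                 # 4 0 7 0 10

# Fill NAs with the mean of the observed values using replace()
vec_mean    <- mean(vec, na.rm = TRUE)               # 7
filled_mean <- replace(vec, is.na(vec), vec_mean)    # 4 7 7 7 10

# Drop NAs entirely
complete <- vec[!is.na(vec)]                         # 4 7 10

print(na_count)
print(filled_zero)
print(filled_mean)
```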
The tidyverse collection of packages (for example, tidyr with drop_na() and replace_na()) provides convenient functions for handling missing data.
library(Hmisc)
Coercion
Coercion converts a value from one data type to another. For example, the logical value TRUE
yields 1 when converted to numeric, and the logical value FALSE yields 0.
Types of Coercion in R
Implicit Coercion
R will automatically coerce data types to make them compatible in certain operations. For
example, combining numeric and character data in a vector will result in all elements being
coerced to characters.
x <- c(1, "a", TRUE) # illustrative values; every element is coerced to character
print(x)
Explicit Coercion
Explicit coercion is performed using specific functions designed to convert one type to
another.
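A few explicit conversions are sketched below with arbitrary sample values. When a value cannot be converted, the result is NA; as.numeric() also raises a warning in that case, which is suppressed here to keep the output clean.

```r
num_from_text <- as.numeric("3.14")    # 3.14
int_from_num  <- as.integer(3.9)       # 3 (the fractional part is dropped)
text_from_num <- as.character(42)      # "42"
num_from_lgl  <- as.numeric(TRUE)      # 1
lgl_from_text <- as.logical("TRUE")    # TRUE

# Values that cannot be converted become NA
bad_number  <- suppressWarnings(as.numeric("abc"))   # NA
bad_logical <- as.logical("yes")                     # NA

print(num_from_text)
print(int_from_num)
print(bad_number)
```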
List
Usage: Lists are used to store heterogeneous data. Each element can be named for
easier access.
my_list <- list(name = "Ajay", age = 25, scores = c(85, 90, 88))
Vector
Definition: A vector in R is a sequence of elements that are all of the same type. It
can be a numeric, character, logical, or integer vector.
Usage: Vectors are the basic data type in R for storing a sequence of values.
Data Frame
df <- data.frame(name = c("Alice", "Bob"), age = c(25, 30), scores = c(85, 90))
Array
Matrix
Definition: A matrix is a two-dimensional array where elements are of the same type.
It is essentially a collection of vectors of the same length.
Usage: Matrices are used for mathematical computations and storing tabular data.
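The structures in this section can be created side by side to compare their shapes. This is a minimal sketch with made-up values; it also includes an array and a factor, which are listed among the common R objects.

```r
v  <- c(1.5, 2.5, 3.5)                          # vector: one type, one dimension
l  <- list(name = "Ajay", age = 25)             # list: mixed types
m  <- matrix(1:6, nrow = 2)                     # matrix: 2 rows x 3 columns
a  <- array(1:24, dim = c(2, 3, 4))             # array: 2 x 3 x 4
f  <- factor(c("low", "high", "low"))           # factor: categorical data
df <- data.frame(name = c("Alice", "Bob"),      # data frame: columns of equal length
                 age  = c(25, 30))

print(dim(m))       # 2 3
print(dim(a))       # 2 3 4
print(levels(f))    # "high" "low"
str(df)             # compact overview of the data frame
```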
Summary
These data structures are foundational in R and are used extensively for data manipulation
and analysis.
Using the as Functions to Change the Structure of Data
In R, the as family of functions (as.numeric, as.character, and so on) is used to change the
structure or type of data. These functions coerce data from one type to another, which is
essential for data manipulation and analysis, because many functions and analyses expect
their input in a particular type.
Common as functions include as.numeric, as.character, as.integer,
as.factor, as.Date, and as.logical, each serving a specific purpose in data type coercion.
Common as Functions
4. as.factor: Converts data to factor type, which is used for categorical data.
# Example: Convert a numeric vector to logical
num_vector <- c(1, 0, 1, 0)
logical_vector <- as.logical(num_vector)
print(logical_vector) # Output: TRUE FALSE TRUE FALSE
print(df)
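Beyond converting element types, the as functions can also change the structure of an object. The sketch below uses arbitrary example data and a few additional members of the family (as.data.frame, as.vector, as.list) that are part of base R.

```r
# Character vector to factor (categorical data)
grades <- as.factor(c("B", "A", "B", "C"))
grade_levels <- levels(grades)          # "A" "B" "C"
grade_codes  <- as.integer(grades)      # 2 1 2 3 (position of each level)

# String to Date
start_date <- as.Date("2024-01-15")

# Matrix to data frame, and matrix to plain vector
m <- matrix(1:6, nrow = 2)
df_from_matrix <- as.data.frame(m)      # columns named V1, V2, V3
v_from_matrix  <- as.vector(m)          # 1 2 3 4 5 6 (column by column)

# Vector to list
l_from_vector <- as.list(1:3)

print(grade_levels)
print(df_from_matrix)
```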
VECTOR
A vector is a one-dimensional array that stores a sequence of elements of the same
type.
Vector Definition
A vector in R is a basic data structure that contains a collection of elements, all
of which must be of the same type. This type can be numeric, integer, character,
or logical.
Vectors are essential in R because many functions operate on vectors, making them a
fundamental concept in data manipulation and analysis.
Characteristics of Vectors in R
1. Homogeneity: All elements in a vector must be of the same type.
2. Indexing: Elements in a vector can be accessed and manipulated using their index
positions, which start at 1 in R.
3. Element-wise Operations: Many operations in R are vectorized, meaning they can be
applied to each element of a vector simultaneously.
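The three characteristics can be checked directly. The vectors in this sketch are arbitrary examples.

```r
# Homogeneity: mixing types coerces every element to a common type
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)          # "character"

# Indexing starts at 1
nums <- c(10, 20, 30, 40)
first     <- nums[1]                 # 10
some      <- nums[c(2, 4)]           # 20 40
all_but_1 <- nums[-1]                # 20 30 40 (negative index drops an element)

# Element-wise (vectorized) operations
doubled <- nums * 2                  # 20 40 60 80
added   <- nums + c(1, 2, 3, 4)      # 11 22 33 44
large   <- nums > 15                 # FALSE TRUE TRUE TRUE

print(mixed_class)
print(doubled)
print(large)
```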
Types of Vectors
1. Numeric Vectors: Contain numbers (real or decimal numbers).
numeric_vector <- c(1.1, 2.2, 3.3)
NOTE: The c() function in R stands for "combine" or "concatenate." It creates a vector
by combining its arguments. In this case, c(1.1, 2.2, 3.3) creates a vector containing
the numeric values 1.1, 2.2, and 3.3.
2. Integer Vectors: Contain integer numbers.
integer_vector <- c(1L, 2L, 3L) # the L suffix indicates integers.
3. Character Vectors: Contain strings or text.
character_vector <- c("apple", "banana", "cherry")
4. Logical Vectors: Contain Boolean values (TRUE or FALSE).
logical_vector <- c(TRUE, FALSE, TRUE)
Creating Vectors
Vectors are typically created using the c() function, which stands for "combine" or
"concatenate":
S.No.   Vector Type   Syntax
1       Numeric       vector <- c(1, 2, 3, 4, 5)
# Example vector (illustrative values, chosen to match the outputs shown)
numeric_vector <- c(2, 4, 6)
# Accessing elements
second_element <- numeric_vector[2]
print(second_element) # Output: 4
# Functions on vectors
vector_sum <- sum(numeric_vector)
print(vector_sum) # Output: 12
vector_mean <- mean(numeric_vector)
print(vector_mean) # Output: 4
Lists in R
A list in R is a data structure that can contain elements of different types, such as numbers,
strings, vectors, and even other lists. Unlike vectors, which can only hold elements of the
same type, lists are versatile and can mix different data types.
Syntax:
list_name <- list(element1, element2, ..., elementN)
Example:
Let's create a simple list containing different types of data: a numeric vector, a character
vector, and a logical value.
my_list <- list(numbers = c(1, 2, 3), fruits = c("apple", "banana", "cherry"), is_true = TRUE)
print(my_list)
Output:
$numbers
[1] 1 2 3
$fruits
[1] "apple" "banana" "cherry"
$is_true
[1] TRUE
List-Related Operations:
o By Name:
o Using [ Operator:
2. Modifying List Elements:
o Change an element:
my_list$new_item <- "new element"
3. Combining Lists:
o Combine two lists:
5. Length of List:
o Find the number of elements in the list:
length(my_list)
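The list operations named above can be sketched together as follows. The list contents mirror the earlier example; the element extra and the choice of which element to remove are assumptions made for illustration.

```r
my_list <- list(numbers = c(1, 2, 3),
                fruits  = c("apple", "banana", "cherry"),
                is_true = TRUE)

# Accessing elements
by_name  <- my_list$fruits         # by name: the character vector itself
by_index <- my_list[[1]]           # by position: the numeric vector itself
sub_list <- my_list["numbers"]     # single [ returns a list, not the element

# Modifying and adding elements
my_list$is_true  <- FALSE
my_list$new_item <- "new element"

# Combining two lists
combined <- c(my_list, list(extra = 1:2))

# Removing an element by assigning NULL
my_list$new_item <- NULL

print(length(my_list))     # 3
print(length(combined))    # 5
```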
Example with Simple Data Set:
Let's consider a data set with information about students: their names, ages, and scores in a
test.
Output:
$names
[1] "Alice" "Bob" "Charlie"
$scores
[1] 87 92 89
$gender
[1] "F" "M" "M"
Matrix in R
Definition:
A matrix in R is a two-dimensional data structure that contains elements of the same data
type (numeric, character, or logical). It is essentially a collection of vectors of the same
length, arranged in rows and columns.
Syntax:
matrix(data, nrow, ncol, byrow = FALSE, dimnames = NULL)
Example:
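my_matrix <- matrix(1:9, nrow = 3, byrow = TRUE, dimnames = list(paste0("Row", 1:3), paste0("Col", 1:3)))
print(my_matrix)
Output:
     Col1 Col2 Col3
Row1    1    2    3
Row2    4    5    6
Row3    7    8    9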
Operations on Matrices:
1. Accessing Elements:
o Access a specific element:
2. Matrix Arithmetic:
o Addition/Subtraction:
o Element-wise Multiplication:
o Matrix Multiplication:
3. Transposing a Matrix:
o Transpose the matrix:
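# Two 2 x 2 matrices (illustrative values, chosen to match the output below)
matrix1 <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
matrix2 <- matrix(c(5, 6, 7, 8), nrow = 2, byrow = TRUE)
add_result <- matrix1 + matrix2
# Print results
print("Matrix Addition Result:")
print(add_result)
Output:
[1] "Matrix Addition Result:"
     [,1] [,2]
[1,]    6    8
[2,]   10   12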
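The remaining matrix operations listed above (element access, element-wise multiplication, matrix multiplication, and transposition) can be sketched as follows. The matrices A and B are illustrative 2 x 2 examples.

```r
A <- matrix(1:4, nrow = 2, byrow = TRUE)   # rows: (1 2), (3 4)
B <- matrix(5:8, nrow = 2, byrow = TRUE)   # rows: (5 6), (7 8)

# Accessing elements: [row, column]
element    <- A[1, 2]      # 2
second_row <- A[2, ]       # 3 4

# Arithmetic
difference  <- A - B       # every entry is -4
elementwise <- A * B       # rows: (5 12), (21 32)
product     <- A %*% B     # true matrix product, rows: (19 22), (43 50)

# Transpose: rows become columns
transposed <- t(A)         # rows: (1 3), (2 4)

print(elementwise)
print(product)
print(transposed)
```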