
CH 3

Uploaded by

Rashi Mehta
Copyright
© © All Rights Reserved

Chapter 3 - Loading and Handling Data in R

To generate 10 random numbers between 1 and 100, write:


round(runif(10, min = 1, max = 100))

Copyright © 2018
Assignment Operator

Operator    Example
=           x = 5
<-          x <- 5
->          5 -> x
<<-         x <<- 2
Expression, Variables and Functions
Expressions

Operation          Operator           Description
Addition           x + y              y added to x
Subtraction        x - y              y subtracted from x
Multiplication     x * y              x multiplied by y
Division           x / y              x divided by y
Exponentiation     x ^ y or x ** y    x raised to the power y
Modulus            x %% y             Remainder of (x divided by y)
Integer division   x %/% y            x divided by y but rounded down
Square root        sqrt(x)            Computing the square root of x
Relational Operator

Operator    Description
x > y       True if the left operand is greater than the right
x < y       True if the left operand is less than the right
x == y      True if the left operand is equal to the right
x != y      True if the left operand is not equal to the right
x >= y      True if the left operand is greater than or equal to the right
x <= y      True if the left operand is less than or equal to the right
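
The arithmetic and relational operators above can be tried directly at the R console. The following is a minimal sketch; the values assigned to x and y are arbitrary choices for illustration and do not come from the slides.

```r
# Example values (chosen only for illustration)
x <- 7
y <- 2

x + y      # addition
x - y      # subtraction
x * y      # multiplication
x / y      # division
x ^ y      # exponentiation
x %% y     # modulus: remainder of x divided by y
x %/% y    # integer division: quotient rounded down
sqrt(16)   # square root

# Relational operators return TRUE or FALSE
x > y
x == y
x != y
x >= y
```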
Expression, Variables and Functions
Logical values
Logical values are TRUE and FALSE or T and F. Note that these are case sensitive.
The equality operator is ==.

Dates
The default format of date is “YYYY-MM-DD”
Print system’s date

Print system’s time
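
The two steps above can be sketched with the base R functions Sys.Date() and Sys.time(). The variable names and the date "2018-01-15" are assumptions made only for this example.

```r
today <- Sys.Date()   # current date, an object of class "Date"
now   <- Sys.time()   # current date and time, class "POSIXct"

print(today)          # printed in the default YYYY-MM-DD format
print(now)

# as.Date() parses text written in the default format
d <- as.Date("2018-01-15")   # example date, chosen for illustration

# format() converts a date to text in another layout
format(d, "%d/%m/%Y")
```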


Expression, Variables and Functions

Variables
(i) Assign a value of 50 to the variable by the name, “Var”.

Or

(ii) Print the value in the variable, “Var”.

(iii) Perform arithmetic operations on the variable, “Var”.
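
Steps (i) to (iii) can be sketched as follows. Var and the value 50 come from the slide; Var2 is a hypothetical name added for illustration.

```r
Var <- 50         # (i) assign with the leftward arrow
Var = 50          #     or with the equals sign

print(Var)        # (ii) print the value
Var               #      typing the name alone also prints it at the console

Var + 10          # (iii) arithmetic on the variable
Var / 5
Var2 <- Var * 2   # store a result in another variable (hypothetical name)
```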


atrix(t.test.res$p.value, file="ttest.tsv", sep="\t")

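Data Type
• Vectors
• a = c(1, 2, 4, 5, 6)
• sort(a) #sorts the values; a[1] <- 5 #replaces the first element by index
• b = c("abc", "ddd", "ccc") #character vector
• a[1:3] #prints the first three elements
• Lists
• a = list("aa", 55, 33, "bb"); b = list("da", 55, 1, "bb") #elements may differ in type
• c(a, b) #combines both lists into one list of 8 elements (merge() is meant for data frames)
• C = list(a, b) #a list that contains the two lists
• Arrays – store data in more than 2 dimensions
• arr1 = c(11, 13, 15, 16, 14)
• arr2 = c(55, 54, 2, 15, 12)
• arr3 = array(c(arr1, arr2), dim = c(3, 3, 5)) #the 10 values are recycled to fill the 45 cells
• Matrices
• Mtr = matrix(c(arr1, arr2), 5, 2) #5 rows, 2 columns; a 4 x 4 matrix cannot be filled evenly with 10 values
• Factors
• Fact1 = factor(arr1)
• Data Frames
• data <- data.frame(x1 = 1:5, x2 = 2:6, x3 = 3:7) #all columns must have the same length
• data1 <- c("aa", "bb", "cc", "dd")
• Merge: merge(x, y) joins two data frames on their common columns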
Vectors
• Vectors are stored like arrays in C
• Vector indices begin at 1
• All Vector elements must have the same mode such as integer,
numeric (floating point number), character (string), logical (Boolean),
complex, object etc.

Create a vector of numbers

The c function (c is short for combine) creates a new vector consisting of three
values: 4, 7, and 8.
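
A minimal sketch of the vector described above; the variable name v is an assumption made for illustration.

```r
v <- c(4, 7, 8)   # combine three values into one numeric vector
print(v)

length(v)         # number of elements
class(v)          # type of the elements
```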
Vectors
A vector cannot hold values of different data types. Consider the example below.
We are trying to place integer, string and boolean values together in a vector.

Note: All the values are converted to the same data type, i.e. “character”.
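
The coercion described in the note can be sketched as follows. The values and variable names are illustrative assumptions, not the ones shown on the original slide.

```r
# An integer, a string and a logical value placed in one vector
mixed <- c(4L, "seven", TRUE)
print(mixed)      # every element is now a character string
class(mixed)

# Without a string, logical values are coerced to numbers instead
nums <- c(4L, TRUE, FALSE)
class(nums)
```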
Vectors
Accessing the value (s) in the vector
Create a variable by the name, “VariableSeq” and assign to it a vector consisting of
string values.

• To access values in a vector, specify in square brackets the indices at which
the values are present. Indices start at 1.
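
Indexing can be sketched as follows. VariableSeq is the name used on the slide; the string values stored in it are assumptions made for illustration.

```r
VariableSeq <- c("aa", "bb", "cc", "dd")   # example values (hypothetical)

VariableSeq[1]         # first element
VariableSeq[2:3]       # second and third elements
VariableSeq[c(1, 4)]   # first and fourth elements
VariableSeq[-1]        # every element except the first
```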
Vectors
Vector math
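
Arithmetic on vectors is applied element by element. The following is a minimal sketch; the vectors a and b and their values are assumptions chosen for illustration.

```r
a <- c(1, 2, 3)
b <- c(10, 20, 30)

a + 1     # adds 1 to every element
a * 2     # multiplies every element by 2
a + b     # adds the vectors element by element
a * b     # multiplies the vectors element by element

sum(a)    # total of all elements
mean(b)   # average of all elements
```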
Matrices
Create a matrix, “mat”, 3 rows high and 4 columns wide using a vector

Access the element present in the 2nd row and 3rd column of the matrix, “mat”.
Matrices
To access the 2nd column of the matrix, simply provide the column number and
omit the row number.

To access the 2nd and 3rd columns of the matrix, simply provide the column
numbers and omit the row number.
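
The matrix operations above can be sketched as follows. The name mat comes from the slide; filling it with the values 1 to 12 is an assumption made for illustration.

```r
# 3 rows and 4 columns, filled column by column from a vector
mat <- matrix(1:12, nrow = 3, ncol = 4)
print(mat)

mat[2, 3]     # element in the 2nd row and 3rd column
mat[, 2]      # the whole 2nd column (row number omitted)
mat[, 2:3]    # the 2nd and 3rd columns
dim(mat)      # number of rows and columns
```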
List
To create a list, “emp” having three elements, “EmpName”, “EmpUnit”, “EmpSal”.

To get the elements of the list, “emp” use the below command.

Retrieve the names of the elements in the list “emp”.


List
Add an element with the name “EmpDesg” and value “Software Engineer” to the
list, “emp”.

Output:

Delete an element with the name “EmpUnit” and value “IT” from the list, “emp”.
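
The list operations above can be sketched as follows. The element names, the unit "IT" and the designation "Software Engineer" come from the slides; the employee name and salary are hypothetical values.

```r
# Create a list with three named elements (name and salary are made up)
emp <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)

emp                  # show all elements of the list
names(emp)           # names of the elements
emp$EmpName          # access one element by name

# Add an element by assigning to a new name
emp$EmpDesg <- "Software Engineer"

# Delete an element by assigning NULL to it
emp$EmpUnit <- NULL
names(emp)
```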
Recursive list
A recursive list means a list within a list.
Let us begin with two lists, “emp” and “emp1”.
The elements in both the lists are as shown below:

Combine both the lists into a single list by the name “EmpList”.
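
A minimal sketch of a recursive list. The names emp, emp1 and EmpList come from the slide; the values stored in the two lists are hypothetical.

```r
emp  <- list(EmpName = "Alex", EmpUnit = "IT", EmpSal = 55000)   # made-up values
emp1 <- list(EmpName = "Sam",  EmpUnit = "HR", EmpSal = 60000)   # made-up values

# list() keeps each list as one element, giving a list within a list
EmpList <- list(emp, emp1)

length(EmpList)         # two elements, each itself a list
EmpList[[2]]$EmpName    # EmpName of the second list
str(EmpList)            # compact view of the nested structure
```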
Manipulating Text in Data

Function: substr(a, start, stop)
  Arguments: a is a character vector; start and stop are numeric values.
  Description: Returns the part of the string that begins at position start and ends at position stop.

Function: strsplit(a, split, …)
  Arguments: a is a character vector; split is a character vector that contains a regular expression for splitting.
  Description: Splits the given text string into substrings.

Function: paste(…, sep = " ", …)
  Arguments: the dots "…" are R objects; sep is a character string placed between the objects.
  Example: paste("a", 1:5, sep = " ")
  Description: Concatenates string vectors after converting the objects into strings.

Function: grep(pattern, a)
  Arguments: pattern contains the pattern to match; a is a character vector.
  Example: x <- c("d", "a", "c", "abba"); grep("a", x); grepl("a", x)
  Description: Searches each element of a for the pattern. grep() returns the indices of the matching elements; grepl() returns TRUE or FALSE for each element.

Function: toupper(a)
  Arguments: a is a character vector.
  Description: Converts a string into uppercase.

Function: tolower(a)
  Arguments: a is a character vector.
  Description: Converts a string into lowercase.
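
The text functions above can be sketched together as follows. The vector x and the paste() call come from the slide; the sentence stored in a is an assumption chosen for illustration.

```r
a <- "Loading and Handling Data in R"   # example text (chosen for illustration)

first_word <- substr(a, 1, 7)           # characters 1 to 7
words <- strsplit(a, " ")[[1]]          # split on spaces; [[1]] extracts the vector
labels <- paste("a", 1:5, sep = " ")    # "a 1", "a 2", ...

x <- c("d", "a", "c", "abba")
grep("a", x)     # positions of the elements that contain "a"
grepl("a", x)    # TRUE or FALSE for every element

toupper(a)       # all uppercase
tolower(a)       # all lowercase
```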
Exploring a Dataset
Function: names(dataset)
  Arguments: dataset is the name of the dataset.
  Description: Displays the variable names of the given dataset.

Function: summary(dataset)
  Arguments: dataset is the name of the dataset.
  Description: Displays the summary of the given dataset.

Function: str(dataset)
  Arguments: dataset is the name of the dataset.
  Description: Displays the structure of the given dataset.

Function: head(dataset, n)
  Arguments: dataset is the name of the dataset; n is the number of top rows to display.
  Description: Displays the top n rows. If n is not provided, the top 6 rows are displayed by default.

Function: tail(dataset, n)
  Arguments: dataset is the name of the dataset; n is the number of bottom rows to display.
  Description: Displays the bottom n rows. If n is not provided, the bottom 6 rows are displayed by default.

Function: class(dataset)
  Arguments: dataset is the name of the dataset.
  Description: Displays the class of the dataset.

Function: dim(dataset)
  Arguments: dataset is the name of the dataset.
  Description: Returns the dimensions of the dataset, that is, the total number of rows and columns.

Function: table(dataset$variablename)
  Arguments: dataset is the name of the dataset; variablename is the name of a variable in it.
  Description: Returns the count of each distinct value (category) of the variable.
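
The exploration functions above can be tried on any data frame. This sketch uses iris, a dataset that ships with R; choosing iris is an assumption made for illustration, since the slides do not name a dataset.

```r
names(iris)           # variable names
str(iris)             # structure: type and first values of each variable
summary(iris)         # summary statistics for each variable

head(iris)            # top 6 rows by default
head(iris, 3)         # top 3 rows
tail(iris, 3)         # bottom 3 rows

class(iris)           # class of the dataset
dim(iris)             # number of rows and columns
table(iris$Species)   # count of each category of Species
```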
Aggregating and Group Processing of a Variable
The syntax of the aggregate() function is as follows:
aggregate(x, …) or

aggregate(x, by, FUN, …)

where x is an R object (typically a data frame); the by argument is a list
of grouping elements for the variable of the dataset; the FUN argument is
the function that computes the summary statistic for each group; and the
dots "…" stand for other optional arguments.
Create data frame

• data <- data.frame(x1 = 1:5, x2 = 2:6, x3 = 1, group = c("A", "A", "B", "C", "C"))

• aggregate(data[colnames(data) != "group"], by = list(data$group), FUN = mean)


Exercise - Create Data Frame
• A = 10 numbers
• B = 10 random numbers
• C = marks
• D = names (2 names, each used five times)

• Find the average by grouping on name.
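
One possible way to set up this exercise is sketched below. It is only a sketch under stated assumptions: the marks, the two names and the seed are made up, and the formula form of aggregate() is used to average column C within each name.

```r
set.seed(1)   # makes the random column reproducible (seed chosen arbitrarily)

df <- data.frame(
  A = 1:10,                                         # 10 numbers
  B = round(runif(10, min = 1, max = 100)),         # 10 random numbers
  C = c(55, 60, 65, 70, 75, 80, 85, 90, 95, 100),   # made-up marks
  D = rep(c("Asha", "Ravi"), times = 5)             # hypothetical names
)

# Average marks for each name
avg <- aggregate(C ~ D, data = df, FUN = mean)
print(avg)
```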


tapply() Function
The tapply() function is also a built-in function of R and works similarly to
the function aggregate().

tapply(x, …) or

tapply(x, INDEX, FUN, …)

where x is the summary variable (a vector); the INDEX argument is a factor,
or a list of factors, that defines the groups (also called the group
variable); the FUN argument is the function that computes the summary
statistic for each group; and the dots "…" stand for other optional arguments.
Create Data Frame
• price = round(rnorm(15, sd = 8, mean = 25)) #rnorm simulates random values from a normal distribution with the given mean and sd
• class = sample(1:4, size = 15, replace = TRUE) #random class numbers from 1 to 4
• shop1 = sample(paste("Store", 1:4), size = 15, replace = TRUE)

• tapply(price, class, mean) #mean price for each class

• tapply(price, shop1, mean) #mean price for each shop

• class1 = factor(class, levels = 1:4, labels = c("a", "b", "c", "d")) #levels are fixed so the four labels always match, even if a class was not sampled

• tapply(price, class1, mean)


Exercise
Collect 30 samples of movie ticket prices with average = 200 and sd = 20.
Define 3 classes.
Define 5 multiplexes.

Find the mean ticket price for each class and for each multiplex.
Selecting variables
• Install dplyr and tidyr packages

library(dplyr)

# keep the variables name, height, and gender
newdata <- select(starwars, name, height, gender)

# keep the variable name and all variables
# between mass and species inclusive
newdata <- select(starwars, name, mass:species)

# keep all variables except birth_year and gender
newdata <- select(starwars, -birth_year, -gender)
Selecting observations
• library(dplyr)

• # select females
• # (in dplyr 1.0.0 and later, starwars stores "female"/"male" in sex;
• # gender holds "feminine"/"masculine", so sex is used here)
• newdata <- filter(starwars, sex == "female")

• # select females that are from Alderaan
• newdata <- filter(starwars, sex == "female" & homeworld == "Alderaan")

• # select individuals that are from
• # Alderaan, Coruscant, or Endor
• newdata <- filter(starwars, homeworld == "Alderaan" | homeworld == "Coruscant" | homeworld == "Endor")
Creating/Recoding variables
• The mutate function allows you to create new variables or transform
existing ones.

• library(dplyr)

• # convert height in centimeters to inches,
• # and mass in kilograms to pounds
• newdata <- mutate(starwars, height = height * 0.394, mass = mass * 2.205)
• The ifelse function (part of base R) can be used for recoding data. The
format is ifelse(test, return if TRUE, return if FALSE).

• library(dplyr)

• # if height is greater than 180
• # then heightcat = "tall",
• # otherwise heightcat = "short"
• newdata <- mutate(starwars, heightcat = ifelse(height > 180, "tall", "short"))

• # set heights greater than 200 or
• # less than 75 to missing
• newdata <- mutate(starwars, height = ifelse(height < 75 | height > 200, NA, height))
Summarizing data
• The summarize function can be used to reduce multiple values down to a single
value (such as a mean). It is often used in conjunction with the group_by function,
to calculate statistics by group. In the code below, the na.rm=TRUE option is used
to drop missing values before calculating the means.

• library(dplyr)

• # calculate mean height and mass
• newdata <- summarize(starwars, mean_ht = mean(height, na.rm=TRUE), mean_mass = mean(mass, na.rm=TRUE))
• newdata

• # calculate mean height and weight by gender
• newdata <- group_by(starwars, gender)
• newdata <- summarize(newdata, mean_ht = mean(height, na.rm=TRUE), mean_wt = mean(mass, na.rm=TRUE))
• newdata
Methods for Reading Data
Reading CSV Files
A CSV (comma-separated values) file uses the .csv extension and stores
tabular data as plain text. The following function reads data from a CSV
file:
read.csv("filename")
where filename is the name (or path) of the CSV file that needs to be imported.
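
A minimal, self-contained sketch of reading a CSV file. Since no file accompanies the slides, the example first writes a small temporary file; the column names and values are made up for illustration.

```r
# Write a small CSV file to a temporary location (contents are made up)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(80, 92, 75)), tmp, row.names = FALSE)

# Read it back into a data frame
dat <- read.csv(tmp)
str(dat)

unlink(tmp)   # remove the temporary file
```

With a real file, the path to that file replaces tmp in the read.csv() call.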

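Reading Spreadsheets
read.xlsx("filename", …)
where the filename argument defines the path of the file to be read; the
dots "…" stand for other optional arguments. read.xlsx() is not part of
base R; it is provided by add-on packages such as xlsx and openxlsx, so one
of them must be installed and loaded first.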
