Summary R - Coding

The document provides an overview of R programming concepts related to data manipulation, including reading CSV files, data frames, matrices, and logical operators. It covers functions for filtering, sorting, and summarizing data, as well as techniques for data wrangling and handling missing values. Additionally, it discusses the use of the dplyr package for data manipulation tasks such as filtering, mutating, and arranging datasets.

Uploaded by

Yui Hisame

LESSON 1

> read.csv("file_name.csv")
Reads a CSV file. The argument can be the URL of a CSV file or the path of a CSV file stored on the local computer.
> read.csv("C:/Folder/File.csv")
Use the full path if the CSV file is not in the same folder.
> read.csv("../../Folder/File.csv")
Each ../ steps one folder back.
> read.csv("./File.csv")
Here ./ indicates the current working directory.

LESSON 2
- Coercion – R will try to coerce values to a different type so that the function will work.
- Values are converted to the simplest type required to represent all information. The ordering is:
logical < integer < numeric < complex < character < list

> str(data)
Gives an overview of the data, including the data type of each variable. However, the default data type assigned by R may not truly reflect the nature of the data.

LOGICAL OPERATOR
Operator   Description
!          Logical NOT
&          Element-wise logical AND
&&         Logical AND
|          Element-wise logical OR
||         Logical OR

CHARACTER
> paste("I", "am", "a", "human", sep="!") # "I!am!a!human"
If "sep" is not specified, R assumes that " " is the separator.
> paste0("Hello", "World") # "HelloWorld"
Concatenates two separate words with no separator.
> exam.result <- "92/A+"
> strsplit(exam.result, "/") # "92" "A+"
Splits the string into two separate strings. This is similar to how a CSV line is separated by commas.
> a <- "I am a human"
> substr(a, 8, 12) # "human"
"a" is the sentence, 8 is the starting position of the substring, and 12 is the end position. The count starts from 1, not from 0 as in Python.
> gsub("behavior", "behaviour", a)
Substitutes every "behavior" in "a" with "behaviour".
> trimws(" Haha ") # "Haha"
Removes leading and trailing white space.

DATA STRUCTURE
Data organization, management, and storage format that enables efficient access and modification of data.

VECTOR – OPERATOR PRECEDENCE (highest to lowest)
^                        Exponentiation
-, +                     Unary minus/plus
:                        Sequence Operator
*, /                     Multiplication and Division
+, -                     Addition and Subtraction
<, >, <=, >=, ==, !=     Logical Comparisons
&                        Logical AND
|                        Logical OR
<-, =, ->                Assignment

> letters # "a", "b", …
> LETTERS # "A", "B", …
> month.abb # "Jan", "Feb", …
> month.name # "January", …
- Recycling – when two vectors have different lengths, the shorter vector is repeated until it reaches the same length.
- && and || only operate on a single pair of values, not element by element. From R 4.3 onward, passing them vectors longer than one is an error.

ACCESSING DATA POINT
> salaries[1:2] # Name & Column 1 to 2
> salaries[c(1,3)] # Name & Column 1 and 3
> salaries[c("Jo", "James")] # Select elements by name
> salaries[-(1:2)] # all the salaries except the first two
> salaries[-1][1:2] # remove the first element, then take the new first 2
> salaries > 7500 # return logical output
> salaries[salaries > 7500] # the names and the salaries > 7500
> salaries[order(salaries)] # ascending order of salaries
> salaries[order(-salaries)] # descending order of salaries

MATRIX
> m <- matrix(1:12, nrow=4)
# [1,] 1 5 9
# [2,] 2 6 10
# [3,] 3 7 11
# [4,] 4 8 12
The values fill from top to bottom, going through each row of a column before moving on to the next column.
> m <- matrix(1:12, ncol=3, byrow=T)
# [1,] 1 2 3
# [2,] 4 5 6
# [3,] 7 8 9
# [4,] 10 11 12
With byrow=T the order is transposed: the values go through each column of a row before moving on to the next row.
> m <- matrix(1:6, nrow=4, ncol=3)
# [1,] 1 5 3
# [2,] 2 6 4
# [3,] 3 1 5
# [4,] 4 2 6
It recycles the vector until all 12 elements are filled.
> cbind(1:3, 4:6, 7:9, 10:12)
Binds the ranges as columns.
> rbind(1:3, 4:6, 7:9, 10:12)
Binds the ranges as rows.
> colnames(m) <- c("Col1", "Col2", "Col3") # replace the [,1] labels
> rownames(m) <- c("Row1", "Row2", "Row3", "Row4") # replace the [1,] labels
Gives names to columns and rows. One name is needed per column and per row.
> t(1:3) # one row and three columns
To transpose.
> sales[which.max(rowSums(sales)), ] # the row with the highest total sales, all columns
> colSums(sales) > 300 # logical output: is each column total above 300?
> sales[ , colSums(sales) > 300] # the matching columns, in matrix form

LIST
> steven <- list(first.name="Steven", last.name="Wong", age=34, marital.status="Married", no.Children=2, monthly.income=9000)
Creates a list saved under the variable steven. list() is needed here: c() would coerce everything to a character vector, and $ does not work on such a vector.
> steven[c(1,3)] # the first and third elements
Accessing the list
> steven$first.name # steven's first.name
The dollar sign selects an element by its "key".
> steven$"first.name" # steven's first name
This is also a possible solution.

[ ] returns a subset of the original data structure. The result is the same data structure.
The examples below assume steven also holds an element children = c("Nick", "Joe").
> steven["children"] # "Nick" "Joe"
> length(steven["children"]) # 1
> steven["children"][2] # $<NA> NULL

[[ ]] returns a single element of the data structure. The result has the data type / class of the element itself.
> steven[["children"]] # "Nick" "Joe"
> length(steven[["children"]]) # 2
> steven[["children"]][2] # "Joe"

OTHERS
> rm(a) # removes the object "a" from the workspace

Needs of categorical variables:
1. performing statistical analysis on categorical variables
2. improving memory usage
3. ordering and ranking
4. machine learning

FACTORS
> a = c("Male", "Female", "Male")
> a = as.factor(a) # "2", "1", "2"
> c(a, "4") # "2" "1" "2" "4"
The factor assigns a level to each distinct string, in alphabetical order, hence Female is assigned to 1 and Male to 2. Since character is higher in the coercion hierarchy than numeric, the result returns everything as character instead of numeric.

LESSON 3
DATA WRANGLING
- includes data pre-processing and data transformation.

DATA FRAME
Take as an example that "employees" is a data frame.
> employees[1] # The first column
> class(employees[1]) # "data.frame"
A single square bracket returns a data frame.
> employees[[1]] # The values in the first column
> is.vector(employees[[1]]) # TRUE
A double square bracket returns a vector.
> employees$Position # The values in the Position column
> is.vector(employees$Position) # TRUE
The dollar sign returns the values of the named column.
> employees[2,3] # Access the second row, third column
> employees[2,] # Access the entire second row
> rownames(employees) <- employees$Name
# The row labels on the left-hand side, originally numbers, change to the employees' names
> employees$Name <- NULL # The Name column is removed
Assigning NULL to a column removes the column.
> employees["Joe", "Salary"] # Joe's salary
We can access values by specific row and column names.
> df <- read.csv("File_Name.csv") # The data is read in as a data frame

OVERVIEW OF DATA FRAME
> View(df) # Table view of the data in R
> dim(df) # To find the dimensions of the data
> str(df) # To know the structure of the data
> summary(df) # Show data types and basic statistics of variables

MANIPULATING DATA FRAME – SORTING, FILTER
1. Which company has the largest asset value?
> which.max(companies$Assets) # Row index of the max asset value
> companies[which.max(companies$Assets), ]
# Full record of the company that has the largest asset value
> companies[which.max(companies$Assets), "Company"]
# The name of the company with the largest asset value
> companies[which.max(companies$Assets), ]$Company
# The name of the company with the largest asset value
which.max returns only one value (the first maximum).
2. Who are the top 10 companies in asset value?
> companies <- companies[order(companies$Assets, decreasing = T), ]
# Sort the companies in decreasing asset value
> head(companies, 10)$Company # Retrieve the top 10 companies' names
3. Which continent has the greatest number of companies in the list?
> continents <- as.data.frame(table(companies$Continent))
# Tabulate the Continent column to count the frequency of each continent
5. List all companies in Asia who earn a profit of at least $30B.
> result <- companies[companies$Continent == "Asia" & companies$Profits >= 30, ] # The subset of the original data
> result$Company # Names of companies that met the condition
6. What is the average market value of Asian companies who earn a profit of at least $30B?
> mean(result$Market.Value) # The average market value

DPLYR PACKAGE – FILTER, MUTATE, ARRANGE, TOP_N
> library(dplyr) # provides filter, mutate, arrange, top_n, select and %>%
7. Among all the Asian companies with asset value of at least $50B, find the top 10 companies in terms of their profit margin.
> set1 <- filter(companies, Continent == "Asia" & Assets >= 50)
> set1 <- companies %>% filter(Continent == "Asia" & Assets >= 50)
> set2 <- mutate(set1, Profit.Margin = Profits/Sales)
> set2 <- companies %>%
    filter(Continent == "Asia" & Assets >= 50) %>%
    mutate(Profit.Margin = Profits/Sales) # Add a column PM
> set3 <- arrange(set2, -Profit.Margin) # Sort from Highest to Lowest
> set3 <- companies %>%
    filter(Continent == "Asia" & Assets >= 50) %>%
    mutate(Profit.Margin = Profits/Sales) %>%
    arrange(-Profit.Margin)
> set4 <- top_n(set3, 10) # get the top 10
> select(set4, Company, Profit.Margin)
> set4 <- companies %>%
    filter(Continent == "Asia" & Assets >= 50) %>%
    mutate(Profit.Margin = Profits/Sales) %>%
    arrange(-Profit.Margin) %>%
    top_n(10) %>% select(Company, Profit.Margin)

DATA SUMMARY
> companies %>% group_by(Continent) %>%
    summarise(Max.Profit = max(Profits), No.of.Companies = n(),
              Avg.Asset = mean(Assets)) # A table grouped by Continent
> companies %>% group_by(Continent, Country) %>%
    summarise(Max.Profit = max(Profits), No.of.Companies = n(),
              Avg.Asset = mean(Assets)) # Grouped by Continent and Country

DATA FORMAT
> library(tidyr) # provides gather and spread
- Wide Format to Long Format using gather
> house_l <- house_w %>% gather(Year, Rate, 2:23) or
> house_l <- house_w %>% gather(Year, Rate, -Property.Type)
> gather(df, key, value, ...) # … refers to the columns that are gathered
- Long Format to Wide Format using spread
> house_w <- house_l %>% spread(Year, Rate)
> spread(df, key, value) # Long to wide form

Alternative data is normally non-financial data from non-traditional sources, used by investment professionals in their investment processes.

EXTENDING DATA FRAME
- Data grows by adding new attributes, adding new records, or merging with new data.
- When adding new records to the existing dataset, rbind() only works if the column names match those of the new records.
- The dplyr package has the function bind_rows to address this issue.
> employees <- bind_rows(employees, Jack)
# A joint dataset where columns that do not exist for other employees are filled with NA
INNER JOIN – returns only the rows in companies that have matching country names in countries.
> set1 <- merge(companies, countries, by="Country")
OUTER JOIN – returns all rows from both companies and countries, joining records from companies which have matching keys in countries.
> set2 <- merge(companies, countries, by="Country", all=T)
LEFT JOIN – returns all rows from companies, with data from countries wherever there are matching keys.
> set3 <- merge(companies, countries, by="Country", all.x=T)
# Companies are all preserved in the final data
RIGHT JOIN – returns all rows from countries, with data from companies wherever there are matching keys.
> set4 <- merge(companies, countries, by="Country", all.y=T)
# Countries are all preserved in the final data. Companies without matching country names will be discarded.

DATA WRANGLING IN NATIVE R
> attach(assigned_variables) # Attaches assigned_variables to the R search path
> total # Shows the total column of assigned_variables
> detach(assigned_variables) # Detaches assigned_variables from the R search path
> total # Now raises an error
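The four merge() joins described in these notes can be tried on two tiny data frames. This is only a sketch: the rows below are made up for illustration, and only the column name Country and the merge() arguments come from the notes.

```r
# Toy data; every value here is hypothetical.
companies <- data.frame(
  Company = c("A", "B", "C"),
  Country = c("Japan", "China", "Atlantis"),
  stringsAsFactors = FALSE
)
countries <- data.frame(
  Country   = c("Japan", "China", "France"),
  Continent = c("Asia", "Asia", "Europe"),
  stringsAsFactors = FALSE
)

inner <- merge(companies, countries, by = "Country")                # matching rows only
outer <- merge(companies, countries, by = "Country", all = TRUE)    # every row from both
left  <- merge(companies, countries, by = "Country", all.x = TRUE)  # every company kept
right <- merge(companies, countries, by = "Country", all.y = TRUE)  # every country kept

# Row counts show which side each join preserves.
print(c(inner = nrow(inner), outer = nrow(outer),
        left = nrow(left), right = nrow(right)))
```

In the left join the company in "Atlantis" gets NA for Continent, and in the right join "France" gets NA for Company, which matches the comments about which table is preserved.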
> filter(var, rate<=0.71) OR
> var[var$rate<=0.71, ]

> new_table <- select(var, state, region, rate) OR
> new_table <- var[, c("state", "region", "rate")]

> new_table <- var[var$rate<=0.71, c("state", "region", "rate")]
# filter and selection together

LESSON 4
STATIC DATA – Demography (age, job, marital status, education)
DEPENDENT VARIABLE – the outcome (convert and did not convert)

LESSON 5
CONDITIONAL STATEMENT – function | for Loop | while Loop
R, being a programming language, allows creativity, control (over the given data), flexibility, and reusability (in handling similar cases), reducing the time needed to solve the cases.

IF FUNCTION
> if (Boolean condition) { # Common format
    procedure to handle the case when the logical condition is true
  } else {
    alternative procedure to handle the case when condition is false
  }
> ifelse(Boolean condition, case when TRUE,
         ifelse(Boolean condition, case when TRUE, case when FALSE))
> if (Boolean condition) {case when TRUE} # When only TRUE is handled
FOR LOOP
> for (x in range of values) {
    perform operations with x as an input value, which is changing across the range of values }
WHILE LOOP
- use an if condition with "break" to leave a while loop
> while (the condition is TRUE) {
    perform some operations.
    move to the next iteration }
FUNCTION
How to write a good function? It involves design thinking, logical thinking, neat coding, solid testing, and good documentation.
> function_name <- function(inputs) {
    perform operations to achieve goal of the function with given inputs
    return(result)
  }

APPLY FUNCTIONS
Alternatives to loops: apply(), sapply(), mapply(), lapply(), vapply(), rapply(), and tapply() # chosen based on the structure of the data and the desired output
> apply(x, margin, func, ...)
# margin shows how the function is applied: 1 (to each row) or 2 (to each column)
# The func used must accept a vector as its input variable
> lapply(x, operations) # e.g. lapply(x, mean) where x is a list or a vector
# Works on a list or vector
# Always returns a list of the same length as the given list or array
# sapply() works the same way as lapply() except it returns a vector where possible
> vapply() # similar to sapply(), with the expected return type stated explicitly; less commonly used
> rapply(x, operations)
# stands for recursive apply
# applies a function to all elements of a list recursively
> rep(3, 4) # 3 3 3 3 to replicate number 3, 4 times
> mapply()
# e.g. to call the rep function multiple times but with a different pair of input variables each time
# multivariate version of sapply(), for multiple vectors as inputs
# applies func to each element of the vectors
> tapply(x, y, operations) # groups x by y, then applies the operations
# applies a function to each group of an array, grouped based on the levels of certain factors

PIVOT TABLE
- Easily analyze data by grouping data by different fields
- Summarize the data with your own function for a specific purpose
- tapply() works if the studied numbers are in one vector
> split(data, data$column_name)
Splits the data frame into a list of data frames by a factor array.
> ratio_ <- function(x) { sum(x$total)/sum(x$population) }
> lapply(split(data, data$region), ratio_)

GLOBAL VARIABLE – making the variables into a global variable
> x <<- x^2 # Using <<- to make the variable global
R works line by line. If a global variable is assigned within a function and later re-defined outside the function, the variable will hold the value that has been re-defined.

RETURN IN FUNCTION
- By default, if a function has no return, the value of its last operation is returned.
- If a function(x, y, z) is called with incomplete arguments, e.g. function(1, 2), it results in an error when the missing argument is used and has no default value.

EFFICIENCY IN CODING
The ability to handle common and special cases, for instance, how to address equality in a majority scenario. In creating a data structure:
Idea > Design > Develop > Testing > Deployment
Which operations are used in the code also matters, for example, whether the function uses a loop rather than an apply operator.

HANDLING SPECIAL CASES
> tryCatch() # Mechanism that lets the program capture any exception and handle it without breaking the function
> stop() # Stops the function with an error
> warning() # Proceeds to get the result but issues a warning

GENERALIZING
- Making the code applicable to more scenarios, e.g., characters, Boolean values, etc.
- Setting the limits of the code, e.g., self-defining what is majority and what is minority

QUIZ 1-5
DATA FRAME EXTENSION
> union(df1, df2) # appends rows from both tables while removing duplicate rows automatically

OTHER NOTES
Possible techniques suggested to handle missing values (NA)
1. Delete rows, if not many rows have missing values
2. Fill NA with an aggregated statistic (median, min., max., etc.)
3. Fill NA with value from next row
4. Fill NA with value from previous row
5. Fill NA by fitting a model on non-missing rows, and use predictions from the model to fill up NAs
Stopping a loop vs. Stopping a function
- To stop a loop, break and return()
- To stop a function, stop() and return()
Separate function from the tidyr package
> separate() # creates new columns

CLASS DISCUSSION AND ANSWER
LOGICAL
> (3>2) | (5>7) # TRUE
> !(3>2) # NOT TRUE == FALSE

INTEGER
> x <- as.integer(2.56) # 2; it is truncated to 2 instead of rounded to 3
> x <- as.integer(5>6) # 0; TRUE = 1 and FALSE = 0
> x <- as.numeric("abc") # triggers a warning, changed to NA

CHARACTER
Count the characters
> a <- "Singapore"
> nchar(a) # 9
Find the starting position of a small string in a large string; returns -1 if no match is found
> regexpr("ex", "longtext") # 6 (match.length 2)
Find the positions of every match of a small string in a large string
> gregexpr("a", "banana") # 2 4 6 (match.length 1 each)
> regexpr("a", "banana") # 2 (match.length 1)
Find which elements of a vector of text strings match a regular expression. Vector refers to a list of objects.
> txt <- c("arm", "foot", "lefroo", "bafoobar")
> grep("foo", txt) # 2 4
Extract part of a text string based on position in the text string
> substr("Singapore", 2, 4) # "ing"
Replace the first match of a string with a new string
> sub("or", "es", "Singapore") # "Singapese"
Replace every match of a given substring in a string with a new string
> gsub("a", "o", "banana") # "bonono"

VECTOR
Name vectors in 3 different ways
> a <- c(1,3,4)
> furniture <- c("a", "b", "c")
> names(a) <- furniture
> a <- c("a"=1, "b"=3, "c"=4)
> a <- c(a=1, b=3, c=4)
# a (1), b (3), c (4)
Basic Vectors
# 1 3 5 7 9
# 2 3 4 2 3 4 2 3 4
Vector Arithmetic
# 150 200 300
# 20 60 -50
# 180
# 150000
Vector Subsetting
# Multiple elements can be selected, wherein order matters
# gold 3 silver 24
# Removing subset with "-"
> materials[-2] # cloth will be gone
# Logical vector to call out a particular material
> materials[c(T, F, F, T)] # returns wood and gold
# If the logical vector is shorter than the data, recycling takes place
> materials[c(T, F, T)] # returns wood, silver, and gold
# While longer logical vectors introduce NAs
> materials[c(T, F, F, T, T, F)] # returns wood, gold, NA, NA
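The vector subsetting rules in these notes (negative index, logical mask, recycling, a mask that is too long, selection by name) can be checked with a short sketch. The materials vector is an assumption: only gold = 3 and silver = 24 echo the notes, and the other values are invented.

```r
# Hypothetical named vector; wood and cloth values are made up.
materials <- c(wood = 5, cloth = 10, silver = 24, gold = 3)

no_cloth  <- materials[-2]                           # drop the second element
picked    <- materials[c(TRUE, FALSE, FALSE, TRUE)]  # keep positions 1 and 4
recycled  <- materials[c(TRUE, FALSE, TRUE)]         # mask is recycled to T, F, T, T
too_long  <- materials[c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)]  # position 5 does not exist
reordered <- materials[c("gold", "silver")]          # result follows the requested order

print(names(recycled))
print(too_long)
```

For a named vector, the out-of-range position in too_long comes back as an NA value printed under an <NA> name.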

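The split(), lapply(), sapply() and tapply() notes from LESSON 5 can be put together in one small example. This is a sketch under assumptions: the data frame and its values are invented, while ratio_ and the column names region, total and population follow the notes.

```r
# Hypothetical data; the numbers are for illustration only.
data <- data.frame(
  region     = c("East", "East", "West", "West"),
  total      = c(10, 30, 5, 15),
  population = c(100, 300, 200, 200),
  stringsAsFactors = FALSE
)

# Ratio of summed totals to summed population within one data frame.
ratio_ <- function(x) { sum(x$total) / sum(x$population) }

groups    <- split(data, data$region)                 # list of data frames, one per region
as_list   <- lapply(groups, ratio_)                   # always a list
as_vector <- sapply(groups, ratio_)                   # simplified to a named numeric vector
totals    <- tapply(data$total, data$region, sum)     # one vector grouped by a factor

print(as_vector)
print(totals)
```

tapply() is enough when the numbers sit in a single vector, as with totals; the ratio needs two columns at once, which is why it goes through split() and lapply() or sapply() instead.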