Course 7 - R Programming

R is a popular programming language used for statistical analysis, data visualization, and other data analysis tasks. It was developed in the 1990s at the University of Auckland as an enhanced version of the S language. There are several reasons why R is appealing for working with data, including that it is accessible for beginners, data-centric in its approach, and offers powerful tools for cleaning, manipulating, and modeling data.

Uploaded by

Ahanaf Rasheed
Copyright
© © All Rights Reserved

MODULE 1

Programming Languages
Programming means giving instructions to a computer to perform an action or set of actions. Even if this is your first
time programming, you already have plenty of experience telling a computer what to do. For example, you've
probably used a spreadsheet function to sort your data or perform calculations, or you might have used SQL to tell a
computer how to pull data from a database or join two different data tables. Programming goes even further. It gives
you the highest level of control over your data. SQL can communicate with databases, but a general-purpose
programming language lets you create your own applications and build your own functions from scratch. To
program, you first need to know a programming language. In this video, we'll learn about the basics of programming
languages and how they can help you work with your data.
Programming languages are the words and symbols we use to write instructions for computers to follow. You can
think of a programming language as a bridge that connects humans and computers, and allows them to
communicate. Programming languages have their own set of rules for how these words and symbols should be
used, called syntax. Syntax shows you how to arrange the words and symbols you enter so they make sense to a
computer. Coding is writing instructions to the computer in the syntax of a specific programming language. Just like
the variety of human languages around the world, there are lots of different programming languages available to
communicate with computers. There's a language for almost anything you want to do, from designing websites, to
developing video games, to working with data. For example, Python is a general-purpose language that can be used
for all sorts of things, from working with artificial intelligence to creating virtual reality experiences. JavaScript works
well for developing online apps and is an essential part of web browsers. Some other popular programming languages
for data analysis include SAS, Scala, and Julia. While programming languages can look different on the surface, they
all share similar structures and coding concepts. Once you learn your first language, you'll find it easier to learn
others.
Coming up, we'll explore R's many capabilities. Before that, let's talk about some benefits of using any programming
language to work with your data. I'll highlight three. Programming helps you clarify the steps of your analysis, saves
time, and lets you easily reproduce and share your work.
Let's start with clarity. Programming languages have specific rules and guidelines for giving instructions to the
computer. When you're telling a computer what to do, your instructions have to be very clear. There can't be any
inconsistency in the way you write code. If there is, the code won't work. Translating your thoughts into code forces
you to figure out exactly how to write each step of your analysis and how all the steps fit together. It gives your
analysis a level of precision that makes it really powerful.
Using a programming language for data analysis also saves you lots of time. For example, take the process of cleaning
and transforming your data. With one line of code, you can create a separate dataset without any missing values.
With another line, you can apply multiple filters on your data. This lets you spend less time preparing your data and
more time on the analysis itself.
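As a rough illustration of the "one line of code" idea, here is a minimal sketch in base R. The data frame city_orders and its column names are made up for this example, and na.omit() and subset() are only one of several ways to do this.

```r
# Hypothetical data: four cities, one with a missing value
city_orders <- data.frame(
  city   = c("Austin", "Boston", "Chicago", "Denver"),
  units  = c(120, NA, 75, 210),
  region = c("South", "East", "Midwest", "West")
)

# One line: a separate dataset without any missing values
orders_complete <- na.omit(city_orders)

# Another line: apply multiple filters at once
large_non_west <- subset(orders_complete, units > 100 & region != "West")

print(large_non_west)
```

The original city_orders data frame is left unchanged, so each step can be rerun or inspected on its own.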
Finally, programming languages make it easy to reproduce your analysis. Data analysis is most useful when you can
reproduce your work and share it with other people. They can double-check it and help you solve problems. Code
automatically stores all of the steps of your analysis so you can reproduce and share your work at any time in the
future, weeks, months, or even years later. Here's an example. Let's say you're working on a project. You've collected
and cleaned your data and started your analysis, but the results don't add up. You suspect a mistake was made in the
process. You'd like to discuss the issue with a teammate and get their feedback. If you used a spreadsheet, you both
might have to redo the entire analysis to discover the error. There's no easy way to record and reproduce your steps
in a spreadsheet, but if you use a programming language, all your work can be reproduced and shared in a moment,
from loading the data, to creating visualizations, to reporting the results. Plus, you can easily update your analysis
and fix any errors simply by changing the code.
From spreadsheets to SQL to R
Although the programming language R might be new to you, it actually has a lot of similarities to the other tools you have explored
in this program. In this reading, you will compare spreadsheet programs, SQL, and R to have a better sense of how to use each
moving forward.

Spreadsheets, SQL, and R: a comparison


As a data analyst, there is a good chance you will work with SQL, R, and spreadsheets at some point in your career. Each tool has its
own strengths and weaknesses, but they all make the data analysis process smoother and more efficient. There are two main
things that all three have in common:

• They all use filters: for example, you can easily filter a dataset using any of these tools. In R, you can use the filter function.
This performs the same task as a basic SELECT-FROM-WHERE SQL query. In a spreadsheet, you can create a filter using the
menu options.
• They all use functions: In spreadsheets, you use functions in formulas, and in SQL, you include them in queries. In R, you
will use functions in the code that is part of your analysis.
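To make the filter comparison concrete, here is a minimal sketch. It assumes the dplyr package is installed and uses R's built-in ToothGrowth dataset; the dataset and the variable name low_dose are chosen only for illustration.

```r
library(dplyr)

# Roughly equivalent to the SQL query:
#   SELECT * FROM ToothGrowth WHERE dose = 0.5
low_dose <- filter(ToothGrowth, dose == 0.5)

# Show the first few matching rows
head(low_dose)
```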
The table below presents key questions to explore a few more ways that these tools compare to each other. You can use this as a
general guide as you begin to navigate R.

Key question: What is it?
• Spreadsheets: A program that uses rows and columns to organize data and allows for analysis and manipulation
through formulas, functions, and built-in features
• SQL: A database programming language used to communicate with databases to conduct an analysis of data
• R: A general purpose programming language used for statistical analysis, visualization, and other data analysis

Key question: What is a primary advantage?
• Spreadsheets: Includes a variety of visualization tools and features
• SQL: Allows users to manipulate and reorganize data as needed to aid analysis
• R: Provides an accessible language to organize, modify, and clean data frames, and create insightful data
visualizations

Key question: Which datasets does it work best with?
• Spreadsheets: Smaller datasets
• SQL: Larger datasets
• R: Larger datasets

Key question: What is the source of the data?
• Spreadsheets: Entered manually or imported from an external source
• SQL: Accessed from an external database
• R: Loaded with R when installed, imported from your computer, or loaded from external sources

Key question: Where is the data from my analysis usually stored?
• Spreadsheets: In a spreadsheet file on your computer
• SQL: Inside tables in the accessed database
• R: In an R file on your computer

Key question: Do I use formulas and functions?
• Spreadsheets: Yes
• SQL: Yes
• R: Yes

Key question: Can I create visualizations?
• Spreadsheets: Yes
• SQL: Yes, by using an additional tool like a database management system (DBMS) or a business intelligence (BI) tool
• R: Yes
Introduction to R
What is R? R is a programming language frequently used for statistical analysis, visualization, and other data
analysis. RStudio is a popular software environment for the R language.
R is based on another programming language named S. In the 1970s, John Chambers created S for internal use at Bell
Labs, a famous scientific research facility. In the 1990s, Ross Ihaka and Robert Gentleman developed R at the
University of Auckland, New Zealand. The title R refers to the first names of its two authors and plays on the single-letter
title of its predecessor, S. Since then, R has become a preferred programming language of scientists, statisticians,
and data analysts around the world.
There are lots of reasons why people who work with data love R. I want to share four with you. R is:
• ACCESSIBLE: (First, R is an accessible language for beginners. Lots of people without a traditional programming
background learn R.)
• DATA-CENTRIC: (R really appeals to anyone who wants to solve problems that involve data. And that's one of
the things that's so great about R. It's all about data. R is what's known as a data-centric programming
language. It's specifically designed to make data analysis easier, more efficient, and more powerful.)
• OPEN-SOURCE: (R is open source. Open source means that the code is freely available and may be modified
and shared by the people who use it. Let's pause for a moment and unpack how amazing this is. First, anyone
can use R for free. Second, anyone can modify the code, fix bugs, and improve it. In fact, over the years, lots
of excellent programmers have made improvements and fixes to the R code. For example, anyone who knows
the R language can create what's called an add-on package. Literally thousands of R packages exist, and they
were all built by people who wanted to solve specific problems. A lot of these packages are super useful for
data analysts. As an R user, you now enjoy the benefit of the shared knowledge.)
• R has an active community of users: (The R community is the best. This vibrant, diverse, and accessible
community is so supportive of new learners. You can go online anytime to find answers to all your R questions.
Check out websites like R for Data Science Online Learning Community and RStudio Community. On top of
that, R users are all over Twitter and other social media. You'll discover tons of resources for professional
networking, mentoring, and learning.)
Now that we know more about the general benefits of R, let's talk about some specific situations when you might
use it for data analysis. Here are three scenarios:
• Reproducing your analysis (R can save and reproduce every step of your analysis. Earlier, we discussed how
data analysis is most useful when you can easily reproduce your work and share it with others. In R,
reproducing your analysis is as easy as pressing a button on your keyboard. Your code stores it forever. And
you can share it with anyone at any time.) [What does it mean to reproduce data? Answer: Obtaining the
same results.]
• Processing lots of data (Processing lots of data is also something R does really well, just like SQL. As you
learned earlier, spreadsheets organize projects in sheets or tabs. If you've ever had to deal with spreadsheet
files that have tons of sheets or lots of data in each sheet, you know that things can start to move very slowly.
Working with too much data in a spreadsheet can even cause crashes. R can handle large amounts of data
much more quickly and efficiently.)
• Creating data visualizations (R can create powerful visuals and has state-of-the-art graphic capabilities. As
you've seen in this program, tools like spreadsheets and Tableau offer lots of options for visualizing your data.
R's on another level. With only a small bit of code, you can create histograms, scatter plots, line plots, and so
much more. And that's just the beginning. If you work with more advanced packages, you can make some
seriously impressive data visualizations.)
Learning R is a huge benefit to anyone interested in becoming a data analyst. As I mentioned earlier, knowledge of R
will help you stand out as a job candidate. And as you keep moving forward, R will help you find solutions for more
complex data problems. You can keep learning about R throughout your career as a data analyst. The sky's the limit
when it comes to developing your data analysis skills.
Packages are units of reproducible R code.
Members of the R community create packages to keep track of the R functions that they write and reuse. Packages
offer a helpful combination of code, reusable R functions, descriptive documentation, tests for checking your
code, and sample data sets.

The lubridate package that you are about to install is part of the tidyverse. The tidyverse is a collection of packages
in R with a common design philosophy for data manipulation, exploration, and visualization. For a lot of data
analysts, the tidyverse is an essential tool. You will learn more about the tidyverse later on in this course.

When to use RStudio


As a data analyst, you will have plenty of tools to work with in each phase of your analysis. Sometimes, you will be able
to meet your objectives by working in a spreadsheet program or using SQL with a database. In this reading, you will go
through some examples of when working in R and RStudio might be your better option instead.

Why RStudio?
One of your core tasks as an analyst will be converting raw data into insights that are accurate, useful, and interesting.
That can be tricky to do when the raw data is complex. R and RStudio are designed to handle large data sets, which
spreadsheets might not be able to handle as well. RStudio also makes it easy to reproduce your work on different
datasets. When you input your code, it's simple to just load a new dataset and run your scripts again. You can also
create more detailed visualizations using RStudio.

When RStudio truly shines


When the data is spread across multiple categories or groups, it can be challenging to manage your analysis, visualize
trends, and build graphics. And the more groups of data that you need to work with, the harder those tasks become.
That’s where RStudio comes in.

For example, imagine you are analyzing sales data for every city across an entire country. That is a lot of data from a lot
of different groups–in this case, each city has its own group of data.

Here are a few ways RStudio could help in this situation:

• Using RStudio makes it easy to take a specific analysis step and perform it for each group using basic code. In
this example, you could calculate the yearly average sales data for every city.
• RStudio also allows for flexible data visualization. You can visualize differences across the cities effectively using
plotting features like facets–which you’ll learn more about later on.
• You can also use RStudio to automatically create an output of summary stats—or even your visualized plots—for
each group.
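The per-group step described above can be sketched in a few lines. This is a minimal example in base R with made-up numbers; the data frame city_sales and its columns are hypothetical, and aggregate() is just one way to compute a summary for each group.

```r
# Hypothetical sales records: city, year, and sale amount
city_sales <- data.frame(
  city   = c("Austin", "Austin", "Boston", "Boston", "Austin", "Boston"),
  year   = c(2020, 2020, 2020, 2020, 2021, 2021),
  amount = c(100, 200, 50, 150, 300, 250)
)

# Average sale amount for every city and year combination
yearly_avg <- aggregate(amount ~ city + year, data = city_sales, FUN = mean)

print(yearly_avg)
```

The same line of code works unchanged whether the data covers two cities or two thousand, which is the practical advantage over repeating a calculation by hand for each group.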
As you learn more about R and RStudio moving forward in this program, you’ll get a better understanding of when
RStudio should be your data analysis tool of choice.
MODULE 2
Programming Fundamentals
The basic concepts of R are:
1. Functions:
Functions are a body of reusable code used to perform specific tasks in R. Functions begin with function names like
print or paste, and are usually followed by one or more arguments in parentheses. An argument is information that
a function in R needs in order to run. Here's a simple function in action. (Here "Coding in R" is an argument.)
We'll start our function in the console with the function name print. This function returns whatever we
include inside the parentheses. We'll type an open parenthesis followed by a quotation mark. Both the close
parenthesis and end quote automatically pop up because RStudio recognizes this syntax. Now we just have to add
the text string. We'll type "Coding in R". Then we'll press Enter.
Success! The code returns the words "Coding in R."
If you want to find out more about the print function or any function, all you have to do is type a question mark,
the function name, and a set of parentheses: ?print(). This
returns a page in the Help window, which helps you learn more about the functions you're working with.
Keep in mind that functions are case-sensitive, so typing Print with a capital P brings back an error message.
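The steps above can be sketched as a short script. The tryCatch() wrapper and the variable name msg are additions for illustration only; they capture the error so the script keeps running, whereas typing Print(...) directly in the console would simply show the error.

```r
# print is the function name; "Coding in R" is its argument
print("Coding in R")

# Function names are case-sensitive, so Print (capital P) is not found.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(
  Print("Coding in R"),
  error = function(e) conditionMessage(e)
)
msg
```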
Functions are great, but it can be pretty time-consuming to type out lots of values. To save time, we can use variables
to represent the values.
2, 3 and 4. Variables, Comments and Data Types:
Variables let us call out the values any time we need to. A variable is a representation of a value in R that can be
stored for use later during programming. Variables can also be called objects. As a data analyst, you'll find variables
are very useful when programming. For example, if you want to filter a dataset, just assign a variable to the function
you used to filter the data. That way, all you have to do is use that variable to filter the data later. When naming a
variable in R, you can use a short phrase. A variable name should start with a letter and can also contain numbers
and underscores. So, the variable 5penguin wouldn't work because it starts with a number. Also, just like functions,
variable names are case-sensitive. Using all lowercase letters is good practice whenever possible. Now, before we get
to coding a variable, let's add a comment.
Comments are helpful when you want to describe or explain what's going on in your code. Use them as much as
possible so that you and everyone can understand the reasoning behind it. Comments should be used to make an R
script more readable. A comment shouldn't be treated as code, so we'll put a # in front of it. Then we'll add our
comment: # Here's an example of a variable
Every variable is associated with a data type. Common data types include:
Numeric (3.14, 1500)
Character ("Sam", "Bob", "My name is Maverick")
Logical (TRUE / FALSE)
Complex (7+5i, 3-9i; a complex value has two parts, a real part and an imaginary part. In 7+5i, 7 is the real part
and 5i is the imaginary part.)
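A quick way to check these data types is the class() function. The sketch below uses the example values from the list; the variable name z is chosen only for illustration.

```r
class(3.14)                   # numeric
class("My name is Maverick")  # character
class(TRUE)                   # logical
class(7+5i)                   # complex

# A complex value can be split into its two parts
z <- 7+5i
Re(z)  # real part
Im(z)  # imaginary part
```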
Now let's go ahead with our example. It makes sense to use a variable name to connect to what the variable is
representing. So, we'll type the variable name first_variable. Then after the variable name, we'll type a < sign,
followed by a -. This is the assignment operator. It assigns the value to the variable. It looks like an arrow, which makes
sense, since it's pointing from the value to the variable. There are other assignment operators that work too, but it's
always good to stick with just one type in your code. Next, we'll add the value that our variable will represent. We'll
use the text, "This is my variable." If we type the variable and hit Run, it will return the value that the variable
represents. This is a very basic way of using a variable.
For now, let's assign a variable to a different data type, numeric. We'll name this second_variable, and type our
assignment operator. We'll give it the numeric value 12.5. The Environment pane in the upper-right part of our
workspace now shows both of our variables and their values.
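Put together, the walkthrough above looks roughly like this as a script. This is a minimal sketch; the exact punctuation of the text value is an assumption.

```r
# Here's an example of a variable
first_variable <- "This is my variable"

# Typing the variable name returns the value it represents
first_variable

# A second variable holding a numeric value
second_variable <- 12.5
second_variable
```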

5. Vectors: A vector is a group of data elements of the same type stored in a sequence in R. You can make a vector
using the combine function. In R this function is just the letter c followed by the values you want in your vector
inside parentheses, like c(x, y, z, …).
All right, let's create a vector. Imagine this vector is for measurement data that we need to analyze. We'll start our
code with the variable vec_1 to assign to the vector. Then we'll type c and the open parenthesis. Then we'll type our
list of numbers separated by commas. We'll then close our parentheses and press Enter. This time when we type our
variable and press Enter, it returns our vector. We can use this vector anywhere in our analysis with only its variable
name vec_1. The values in the vector will automatically be applied to our analysis.
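Here is a minimal sketch of that vector. The measurement values are made up for illustration, and mean() is just one example of using the vector by name.

```r
# Create a numeric vector with the combine function c()
vec_1 <- c(13, 48.5, 71, 101.5, 2)

# Typing the variable name returns the vector
vec_1

# The whole vector can be passed to a function by name
mean(vec_1)
```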

6. Pipes:
A pipe is a tool in R for expressing a sequence of multiple operations. A pipe is represented by a % sign, followed by
a > sign, and another % sign (%>%). It's used to pass the output of one function into another function. Pipes can
make your code easier to read and understand. For example, a pipe can filter and then sort data in a single chain of steps.
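A minimal sketch of a pipe that filters and then sorts. It assumes the dplyr package is installed and uses R's built-in ToothGrowth dataset; the dataset and the variable name filtered_toothgrowth are illustrative choices, not necessarily the ones used in the original example.

```r
library(dplyr)

# Start with the data, keep only the rows where dose is 0.5,
# then sort those rows by tooth length (len)
filtered_toothgrowth <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

head(filtered_toothgrowth)
```

Read the pipe as "and then": take ToothGrowth, and then filter it, and then arrange the result.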

Vectors and lists in R


In programming, a data structure is a format for organizing and storing data. Data structures are important to
understand because you will work with them frequently when you use R for data analysis. The most common data
structures in the R programming language include:

• Vectors
• Data frames
• Matrices
• Arrays
Think of a data structure like a house that contains your data.

This reading will focus on vectors. Later on, you’ll learn more about data
frames, matrices, and arrays.

There are two types of vectors: atomic vectors and lists. Coming up, you’ll
learn about the basic properties of atomic vectors and lists, and how to use
R code to create them.

Atomic vectors
First, we will go through the different types of atomic vectors. Then, you will learn how to use R code to create, identify,
and name the vectors.

Earlier, you learned that a vector is a group of data elements of the same type, stored in a sequence in R. You cannot
have a vector that contains both logicals and numerics.
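If you do pass mixed values to c(), R does not raise an error. Instead, it converts every element to a single common type so that the rule still holds. A minimal sketch (the variable name mixed is illustrative):

```r
# Logicals and a numeric in the same call to c()
mixed <- c(TRUE, FALSE, 3.5)

# Every element has been converted to one type
typeof(mixed)
mixed
```

Here the logical values are stored as the numbers 1 and 0, so the result is a numeric vector rather than a mix of types.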

There are six primary types of atomic vectors: logical, integer, double, character (which contains strings), complex,
and raw. The last two–complex and raw–aren’t as common in data analysis, so we will focus on the first four. Together,
integer and double vectors are known as numeric vectors because they both contain numbers. This table summarizes
the four primary types:

Type        Description                          Example

Logical     True/False                           TRUE
Integer     Positive and negative whole values   3L (the "L" suffix is required)
Double      Decimal values                       101.175
Character   String/character values              "Coding"
This diagram illustrates the hierarchy of relationships among these four main types of vectors:

Creating vectors
One way to create a vector is by using the c() function (called the “combine” function). The c() function in R combines
multiple values into a vector. In R, this function is just the letter “c” followed by the values you want in your vector inside
the parentheses, separated by a comma: c(x, y, z, …).

For example, you can use the c() function to store numeric data in a vector.

c(2.5, 48.5, 101.5)

To create a vector of integers using the c() function, you must place the letter "L" directly after each number.

c(1L, 5L, 15L)


If you explicitly want an integer, you need to specify the ‘L’ suffix. So, entering 1 in R gives you a numeric
object; entering 1L explicitly gives you an integer object.
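You can confirm the difference with typeof(), which reports how R stores a value:

```r
typeof(1)       # stored as a double
typeof(1L)      # stored as an integer

# Both count as numeric
is.numeric(1)
is.numeric(1L)
```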

You can also create a vector containing characters or logicals.

c("Sara", "Lisa", "Anna")

c(TRUE, FALSE, TRUE)

Determining the properties of vectors


Every vector you create will have two key properties: type and length.

You can determine what type of vector you are working with by using the typeof() function. Place the code for the
vector inside the parentheses of the function. When you run the function, R will tell you the type. For example:

typeof(c("a", "b"))

#> [1] "character"

Notice that the output of the typeof function in this example is “character”. Similarly, if you use the typeof function on
a vector with integer values, then the output will include “integer” instead:

typeof(c(1L , 3L))

#> [1] "integer"

You can determine the length of an existing vector–meaning the number of elements it contains–by using the length()
function. In this example, we use an assignment operator to assign the vector to the variable x. Then, we apply the
length() function to the variable. When we run the function, R tells us the length is 3.

x <- c(33.5, 57.75, 120.05)

length(x)

#> [1] 3

You can also check if a vector is a specific type by using an is function: is.logical(), is.double(), is.integer(),
is.character(). In this example, R returns a value of TRUE because the vector contains integers.

x <- c(2L, 5L, 11L)

is.integer(x)

#> [1] TRUE

In this example, R returns a value of FALSE because the vector does not contain characters, rather it contains logicals.

y <- c(TRUE, TRUE, FALSE)

is.character(y)

#> [1] FALSE

Naming vectors
All types of vectors can be named. Names are useful for writing readable code and describing objects in R. You can
name the elements of a vector with the names() function. As an example, let’s assign the variable x to a new vector with
three elements.

x <- c(1, 3, 5)

You can use the names() function to assign a different name to each element of the vector.
names(x) <- c("a", "b", "c")

Now, when you print x, R shows that the first element of the vector is named a, the second b, and the third c.

#> a b c

#> 1 3 5

Remember that an atomic vector can only contain elements of the same type. If you want to store elements of different
types in the same data structure, you can use a list.

Creating lists
Lists are different from atomic vectors because their elements can be of any type—like dates, data frames, vectors,
matrices, and more. Lists can even contain other lists.

You can create a list with the list() function. Similar to the c() function, the list() function is just list followed by the
values you want in your list inside parentheses: list(x, y, z, …). In this example, we create a list that contains four different
kinds of elements: character ("a"), integer (1L), double (1.5), and logical (TRUE).

list("a", 1L, 1.5, TRUE)

Like we already mentioned, lists can contain other lists. If you want, you can even store a list inside a list inside a list—
and so on.

list(list(list(1 , 3, 5)))

Determining the structure of lists


If you want to find out what types of elements a list contains, you can use the str() function. To do so, place the code for
the list inside the parentheses of the function. When you run the function, R will display the data structure of the list by
describing its elements and their types.

Let’s apply the str() function to our first example of a list.

str(list("a", 1L, 1.5, TRUE))

We run the function, then R tells us that the list contains four elements, and that the elements consist of four different
types: character (chr), integer (int), number (num), and logical (logi).

#> List of 4

#> $ : chr "a"

#> $ : int 1

#> $ : num 1.5

#> $ : logi TRUE

Let’s use the str() function to discover the structure of our second example. First, let’s assign the list to the variable z to
make it easier to input in the str() function.

z <- list(list(list(1 , 3, 5)))

Let’s run the function.

str(z)

#> List of 1
#> $ :List of 1

#> ..$ :List of 3

#> .. ..$ : num 1

#> .. ..$ : num 3

#> .. ..$ : num 5

The indentation of the $ symbols reflects the nested structure of this list. Here, there are three levels (so there is
a list within a list within a list).

Naming lists
Lists, like vectors, can be named. You can name the elements of a list when you first create it with the list() function:

list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)

$`Chicago`

[1] 1

$`New York`

[1] 2

$`Los Angeles`

[1] 3
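Once a list has names, you can use them to pull out individual elements. A minimal sketch; the variable name cities is an assumption added for illustration.

```r
# Store the named list in a variable
cities <- list('Chicago' = 1, 'New York' = 2, 'Los Angeles' = 3)

# List all the names
names(cities)

# A name without spaces can follow $ directly
cities$Chicago

# A name with spaces needs backticks after $, or quotes inside [[ ]]
cities$`New York`
cities[["Los Angeles"]]
```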

Additional resource
To learn more about vectors and lists, check out R for Data Science, Chapter 20: Vectors. R for Data Science is a classic
resource for learning how to use R for data science and data analysis. It covers everything from cleaning to visualizing to
communicating your data. If you want to get more details about the topic of vectors and lists, this chapter is a great place
to start.
Dates and times in R
In this reading, you will learn how to work with dates and times in R using the lubridate package. Coming up, you will
use tools in the lubridate package to convert different types of data in R into date and date-time formats.

Loading tidyverse and lubridate packages


Before you get started working with dates and times, you should load both tidyverse and lubridate. Lubridate is part of
tidyverse.

First, open RStudio.

If you haven't already installed tidyverse, you can use the install.packages() function to do so:

• install.packages("tidyverse")
Next, load the tidyverse and lubridate packages using the library() function. First, load the core tidyverse to make it
available in your current R session:

• library(tidyverse)
Then, load the lubridate package:

• library(lubridate)
Now you’re ready to be introduced to the tools in the lubridate package.

Working with dates and times


This section covers the data types for dates and times in R and how to convert strings to date-time formats.

Types
In R, there are three types of data that refer to an instant in time:

• A date ("2016-08-16")
• A time within a day ("20:11:59 UTC")
• And a date-time. This is a date plus a time ("2018-03-31 18:15:48 UTC")
The time is given in UTC, which stands for Coordinated Universal Time. This is the primary standard by which the
world regulates clocks and time.

For example, to get the current date you can run the today() function. The date appears as year, month, and day.
today()

#> [1] "2021-01-20"

To get the current date-time you can run the now() function. Note that the time appears to the nearest second.

now()

#> [1] "2021-01-20 16:25:05 UTC"

When working with R, there are three ways you are likely to create date-time formats:

• From a string
• From an individual date
• From an existing date/time object
R creates dates in the standard yyyy-mm-dd format by default.

Let's go over each.

Converting from strings


Date/time data often comes as strings. You can convert strings into dates and date-times using the tools provided by
lubridate. These tools automatically work out the date/time format. First, identify the order in which the year, month, and
day appear in your dates. Then, arrange the letters y, m, and d in the same order. That gives you the name of the
lubridate function that will parse your date. For example, for the date 2021-01-20, you use the order ymd:

ymd("2021-01-20")

When you run the function, R returns the date in yyyy-mm-dd format.

#> [1] "2021-01-20"

It works the same way for any order. For example, month, day, and year. R still returns the date in yyyy-mm-dd format.

mdy("January 20th, 2021")

#> [1] "2021-01-20"

Or, day, month, and year. R still returns the date in yyyy-mm-dd format.

dmy("20-Jan-2021")

#> [1] "2021-01-20"

These functions also take unquoted numbers and convert them into the yyyy-mm-dd format.

ymd(20210120)

#> [1] "2021-01-20"

Creating date-time components


The ymd() function and its variations create dates. To create a date-time from a date, add an underscore and one or
more of the letters h, m, and s (hours, minutes, seconds) to the name of the function:

ymd_hms("2021-01-20 20:11:59")

#> [1] "2021-01-20 20:11:59 UTC"

mdy_hm("01/20/2021 08:01")

#> [1] "2021-01-20 08:01:00 UTC"


Optional: Switching between existing date-time objects
Finally, you might want to switch between a date-time and a date.

You can use the function as_date() to convert a date-time to a date. For example, put the current date-time—now()—in
the parentheses of the function.

as_date(now())

#> [1] "2021-01-20"
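Because now() changes every time you run it, a fixed date-time makes the conversion easier to check. A minimal sketch, assuming the lubridate package is installed; the variable names are illustrative.

```r
library(lubridate)

# A fixed date-time, used instead of now() so the result is repeatable
meeting_time <- ymd_hms("2021-01-20 20:11:59")

# Drop the time of day and keep only the date
meeting_date <- as_date(meeting_time)
meeting_date
```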

Other common data structures


In this reading, you will continue on the topic of data structures with an introduction to data frames and matrices. You will
learn about the basic properties of each structure, and simple ways to make use of them using R code. You will also
briefly explore files, which are often used to access and store data and related information.

Data structures
Recall that a data structure is like a house that contains your data.

Data frames:
Data frames are the most common way of storing and analyzing data in R, so it’s important to understand what they are
and how to create them. A data frame is a collection of columns–similar to a spreadsheet or SQL table. Each column
has a name at the top that represents a variable, and includes one observation per row. Data frames help summarize
data and organize it into a format that is easy to read and use.

For example, consider the “diamonds” dataset, which ships with the ggplot2 package. Each column contains a single
variable that is related to diamonds: carat, cut, color, clarity, depth, and so on. Each row represents a single
observation.

There are a few key things to keep in mind when you are working with data frames:
• First, columns should be named.
• Second, data frames can include many different types of data, like numeric, logical, or character.
• Finally, elements in the same column should be of the same type.
You will learn more about data frames later on in the program, but this is a great starting point.

If you need to manually create a data frame in R, you can use the data.frame() function. The data.frame() function takes
vectors as input. In the parentheses, enter the name of the column, followed by an equals sign, and then the vector you
want to input for that column. In this example, the x column is a vector with elements 1, 2, 3, and the y column is a vector
with elements 1.5, 5.5, 7.5.

data.frame(x = c(1, 2, 3) , y = c(1.5, 5.5, 7.5))

If you run the function, R displays the data frame in ordered rows and columns.

  x   y
1 1 1.5
2 2 5.5
3 3 7.5

In most cases, you won’t need to manually create a data frame yourself, as you will typically import data from another
source, such as a .csv file, a relational database, or a software program.
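The key points about data frames listed earlier (named columns, mixed types across columns, one type within a column) can be checked directly. This is a minimal sketch in base R; the people data frame and its values are made up for illustration.

```r
# Build the data frame from the reading: one named vector per column.
df <- data.frame(x = c(1, 2, 3), y = c(1.5, 5.5, 7.5))

# Inspect its shape and column names.
print(dim(df))     # rows, then columns
print(names(df))

# Columns can differ in type, but each column holds a single type.
# (The people data frame and its values are made up for illustration.)
people <- data.frame(
  name = c("Ana", "Ben"),
  age = c(31, 45),
  member = c(TRUE, FALSE),
  stringsAsFactors = FALSE  # keep text as character in older R versions too
)
print(sapply(people, class))

# Pull out one column as a vector with $.
print(df$y)
```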

Files
Let’s go over how to create, copy, and delete files in R. For more information on working with files in R, check out R
documentation: files. R documentation is a tool that helps you easily find and browse the documentation of almost all R
packages on CRAN. It’s a useful reference guide for functions in R code. Let’s go through a few of the most useful
functions for working with files.

Use the dir.create() function to create a new folder, or directory, to hold your files. Place the name of the folder in
the parentheses of the function.

dir.create("destination_folder")

Use the file.create() function to create a blank file. Place the name and the type of the file in the parentheses of the
function. Your file types will usually be something like .txt, .docx, or .csv.

file.create("new_text_file.txt")

file.create("new_word_file.docx")

file.create("new_csv_file.csv")

If the file is successfully created when you run the function, R will return a value of TRUE (if not, R will return FALSE).

file.create("new_csv_file.csv")

[1] TRUE

Copying a file can be done using the file.copy() function. In the parentheses, add the name of the file to be copied.
Then, type a comma, and add the name of the destination folder that you want to copy the file to.

file.copy("new_text_file.txt", "destination_folder")

If you check the Files pane in RStudio, a copy of the file appears in the relevant folder.

You can delete files using the unlink() function. Enter the file’s name in the parentheses of the function.

unlink("some_.file.csv")
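The file functions above can be combined into one short workflow. The sketch below runs inside a temporary directory (an assumption made here so that no real project files are touched) and uses file.exists() to confirm each step.

```r
# Work inside a fresh temporary directory so nothing in your project is touched.
work_dir <- tempfile("files_demo_")
dir.create(work_dir)

# Create a destination folder and a blank text file.
dest_dir <- file.path(work_dir, "destination_folder")
dir.create(dest_dir)
src_file <- file.path(work_dir, "new_text_file.txt")
created <- file.create(src_file)

# Copy the file into the folder, then confirm the copy exists.
copied <- file.copy(src_file, dest_dir)
copy_exists <- file.exists(file.path(dest_dir, "new_text_file.txt"))

# Delete the original file; the copy is unaffected.
unlink(src_file)
original_exists <- file.exists(src_file)

print(c(created = created, copied = copied,
        copy_exists = copy_exists, original_exists = original_exists))

# Remove the whole temporary directory; folders need recursive = TRUE.
unlink(work_dir, recursive = TRUE)
```

Note that unlink() does not report an error when the file is missing, so checking with file.exists() afterward is a simple way to confirm the result.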

Additional resource
If you want to learn more about working with data frames, matrices, and arrays in R, check out the Data Wrangling
section of Stat Education's Introduction to R course. The section includes modules on data frames, matrices, and arrays
(and more), and each module contains helpful examples of key coding concepts.

--------------------------------------------------------------------------------------------------------------------------------------

Optional: Matrices
A matrix is a two-dimensional collection of data elements. This means it has both rows and columns. By contrast, a
vector is a one-dimensional sequence of data elements. But like vectors, matrices can only contain a single data type.
For example, you can’t have both logicals and numerics in a matrix.

To create a matrix in R, you can use the matrix() function. The matrix() function has two main arguments that you enter
in the parentheses. First, add a vector. The vector contains the values you want to place in the matrix. Next, add at least
one matrix dimension. You can choose to specify the number of rows or the number of columns by using the code nrow
= or ncol =.

For example, imagine you want to create a 2x3 (two rows by three columns) matrix containing the values 3-8. First, enter
a vector containing that series of numbers: c(3:8). Then, enter a comma. Finally, enter nrow = 2 to specify the
number of rows.

matrix(c(3:8), nrow = 2)

If you run the function, R displays a matrix with three columns and two rows (typically referred to as a “2x3”) that contain
the numeric values 3, 4, 5, 6, 7, 8. R places the first value (3) of the vector in the uppermost row and leftmost column
of the matrix, then fills each column from top to bottom before moving on to the next column to the right.

     [,1] [,2] [,3]
[1,]    3    5    7
[2,]    4    6    8

You can also choose to specify the number of columns (ncol = ) instead of the number of rows (nrow = ).

matrix(c(3:8), ncol = 2)

When you run the function, R infers the number of rows automatically.

     [,1] [,2]
[1,]    3    6
[2,]    4    7
[3,]    5    8
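
By default, matrix() fills values down each column. If you would rather fill across each row, the byrow argument of matrix() changes the order. A minimal sketch with the same values 3 through 8:

```r
# Default: values fill each column from top to bottom.
by_column <- matrix(3:8, nrow = 2)
print(by_column)

# byrow = TRUE fills each row from left to right instead.
by_row <- matrix(3:8, nrow = 2, byrow = TRUE)
print(by_row)

# dim() reports the number of rows, then columns.
print(dim(by_row))
```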

Operators & Calculations
An operator is a symbol that names the type of operation or calculation to be performed in a formula.
Imagine we have our hands on some e-commerce sales data that we need to analyze. Throughout our analysis, we
will use variables that R stores so that we can reference them whenever we need to. We will work with assignment
operators. Assignment operators are used to assign values to variables and vectors.
So, if we have a bunch of sales figures that we want to include in a vector, we can use an assignment operator to
assign them to a variable.

Now, whenever we want to use the sales figures, we just type the name of the variable we assigned them to.
Let’s check out arithmetic operators. Arithmetic operators are used to complete math calculations. The plus sign (+)
performs addition, the minus sign (-) performs subtraction, the asterisk (*) performs multiplication, and the slash (/)
performs division.
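
As a small illustration of assignment and arithmetic operators working together, the sketch below uses made-up sales figures; the variable names and values are assumptions, not data from the course.

```r
# Hypothetical quarterly sales figures, used only for illustration.
quarter_1_sales <- 100
quarter_2_sales <- 250

# Assign several figures to one vector.
sales <- c(quarter_1_sales, quarter_2_sales)

midyear_sales <- quarter_1_sales + quarter_2_sales   # addition
sales_change  <- quarter_2_sales - quarter_1_sales   # subtraction
doubled_sales <- sales * 2                           # multiplication, element by element
growth_ratio  <- quarter_2_sales / quarter_1_sales   # division

print(midyear_sales)
print(sales_change)
print(doubled_sales)
print(growth_ratio)
```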

Logical operators and conditional statements


Earlier, you learned that an operator is a symbol that identifies the type of operation or calculation to be performed in a
formula. In this reading, you will learn about the main types of logical operators and how they can be used to create
conditional statements in R code.

Logical operators
Logical operators return a logical data type such as TRUE or FALSE.

There are three primary types of logical operators:

• AND (sometimes represented as & or && in R)
• OR (sometimes represented as | or || in R)
• NOT (!)

Review the summarized logical operators below.

AND operator “&”

The AND operator takes two logical values. It returns TRUE only if both individual values are TRUE. This means
that TRUE & TRUE evaluates to TRUE. However, FALSE & TRUE, TRUE & FALSE, and FALSE & FALSE all
evaluate to FALSE.

If you run the corresponding code in R, you get the following results:

> TRUE & TRUE
[1] TRUE
> TRUE & FALSE
[1] FALSE
> FALSE & TRUE
[1] FALSE
> FALSE & FALSE
[1] FALSE

You can illustrate this using the results of comparisons. Imagine you create a variable x that is equal to 10.

x <- 10

To check if x is greater than 3 but less than 12, you can use x > 3 and x < 12 as the values of an “AND” expression.

x > 3 & x < 12

When you run the code, R returns the result TRUE.

[1] TRUE

The first part, x > 3, will evaluate to TRUE since 10 is greater than 3. The second part, x < 12, will also evaluate to
TRUE since 10 is less than 12. So, since both values are TRUE, the result of the AND expression is TRUE. The number
10 lies between the numbers 3 and 12.

However, if you make x equal to 20, the expression x > 3 & x < 12 will return a different result.

x <- 20
x > 3 & x < 12

[1] FALSE

Although x > 3 is TRUE (20 > 3), x < 12 is FALSE (20 < 12). If one part of an AND expression is FALSE, the entire
expression is FALSE (TRUE & FALSE = FALSE). So, R returns the result FALSE.

OR operator “|”

The OR operator (|) works in a similar way to the AND operator (&). The main difference is that at least one of the
values of the OR operation must be TRUE for the entire OR operation to evaluate to TRUE. This means that
TRUE | TRUE, TRUE | FALSE, and FALSE | TRUE all evaluate to TRUE. When both values are FALSE, the result is
FALSE.

If you write out the code, you get the following results:

> TRUE | TRUE
[1] TRUE
> TRUE | FALSE
[1] TRUE
> FALSE | TRUE
[1] TRUE
> FALSE | FALSE
[1] FALSE

For example, suppose you create a variable y equal to 7. To check if y is less than 8 or greater than 16, you can use
the following expression:

y <- 7
y < 8 | y > 16

The comparison result is TRUE (7 is less than 8) | FALSE (7 is not greater than 16). Since only one value of an OR
expression needs to be TRUE for the entire expression to be TRUE, R returns a result of TRUE.

[1] TRUE

Now, suppose y is 12. The expression y < 8 | y > 16 now evaluates to FALSE (12 < 8) | FALSE (12 > 16). Both
comparisons are FALSE, so the result is FALSE.

y <- 12
y < 8 | y > 16

[1] FALSE

NOT operator “!”

The NOT operator (!) simply negates the logical value it applies to. In other words, !TRUE evaluates to FALSE,
and !FALSE evaluates to TRUE.

When you run the code, you get the following results:

> !TRUE
[1] FALSE
> !FALSE
[1] TRUE

Just like the OR and AND operators, you can use the NOT operator in combination with logical operators. Zero is
considered FALSE and non-zero numbers are taken as TRUE. The NOT operator evaluates to the opposite
logical value. Let’s imagine you have a variable x that equals 2:

x <- 2

The NOT operation evaluates to FALSE because it takes the opposite logical value of a non-zero number (TRUE).

> !x
[1] FALSE
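
The list of operators above mentions both & and && (and | and ||) without separating them. As a brief sketch of the difference in base R: the single forms work element by element on vectors, while the double forms work on single values and stop evaluating as soon as the answer is known.

```r
# & compares element by element and returns one result per pair.
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
print(elementwise)

# && works on single values, which is what if () conditions need.
x <- 10
single <- x > 3 && x < 12
print(single)

# && also stops early: the right side is skipped when the left is FALSE,
# so the stop() call below never runs.
short_circuit <- FALSE && stop("this is never evaluated")
print(short_circuit)

# || behaves the same way for OR.
either <- TRUE || stop("this is never evaluated")
print(either)
```

The single forms are the usual choice for filtering rows of a data frame, and the double forms are the usual choice inside conditional statements.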
-----------------

Let’s check out an example of how you might use logical operators to analyze data. Imagine you are working with the
airquality dataset that is preloaded in R. It contains data on daily air quality measurements in New York from May
to September of 1973.

The data frame has six columns: Ozone (the ozone measurement), Solar.R (the solar measurement), Wind (the wind
measurement), Temp (the temperature in Fahrenheit), and the Month and Day of these measurements (each row
represents a specific month and day combination).

Let’s go through how the AND, OR, and NOT operators might be helpful in this situation.

AND example
Imagine you want to specify rows that are extremely sunny and windy, which you define as having a Solar measurement
of over 150 and a Wind measurement of over 10.

In R, you can express this logical statement as Solar.R > 150 & Wind > 10.

Only the rows where both of these conditions are true fulfill the criteria.
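
One way to apply the AND condition is base R's subset() function, sketched below. The variable name sunny_and_windy is only illustrative. Solar.R contains some missing values, and subset() leaves out rows where the condition cannot be evaluated.

```r
# airquality is included with base R (in the datasets package).
sunny_and_windy <- subset(airquality, Solar.R > 150 & Wind > 10)

# Show the first few matching rows.
print(head(sunny_and_windy))
```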

OR example
Next, imagine you want to specify rows where it’s extremely sunny or it’s extremely windy, which you define as having a
Solar measurement of over 150 or a Wind measurement of over 10.

In R, you can express this logical statement as Solar.R > 150 | Wind > 10.

All the rows where either of these conditions is true fulfill the criteria.
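
The OR condition can be sketched the same way with subset(); again, the variable name is only illustrative. A row with a missing Solar.R value is still kept when its Wind value is over 10, because one TRUE side is enough for OR.

```r
# Keep rows where at least one of the two conditions is TRUE.
sunny_or_windy <- subset(airquality, Solar.R > 150 | Wind > 10)

# Show the first few matching rows.
print(head(sunny_or_windy))
```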

NOT example
Now, imagine you just want to focus on the weather measurements for days that aren't the first day of the month.

In R, you can express this logical statement as Day != 1.

The rows where this condition is true fulfill the criteria.

Finally, imagine you want to focus on scenarios that aren't extremely sunny and not extremely windy, based on your
previous definitions of extremely sunny and extremely windy. In other words, the following statement should not be true:
either a Solar measurement greater than 150 or a Wind measurement greater than 10.

Notice that this statement is the opposite of the OR statement used above. To express this statement in R, you can put
an exclamation point (!) in front of the previous OR statement: !(Solar.R > 150 | Wind > 10). R will apply the NOT
operator to everything within the parentheses.

In this case, only one row fulfills the criteria:
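
Both NOT examples can be sketched with subset() on the full airquality dataset, so the number of matching rows may differ from any smaller excerpt of the data. The sketch also checks that negating the whole OR condition selects the same rows as requiring both opposite conditions.

```r
# Days that are not the first day of the month.
not_first_day <- subset(airquality, Day != 1)

# Negating the whole OR condition...
neither <- subset(airquality, !(Solar.R > 150 | Wind > 10))

# ...selects the same rows as requiring both opposite conditions.
neither_alt <- subset(airquality, Solar.R <= 150 & Wind <= 10)

print(nrow(not_first_day))
print(identical(neither, neither_alt))
```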

----------------------------------------------------------------------------------------------------------------------------------------

Optional: Conditional statements


A conditional statement is a declaration that if a certain condition holds, then a certain event must take place. For
example, “If the temperature is above freezing, then I will go outside for a walk.” If the first condition is true (the
temperature is above freezing), then the second condition will occur (I will go for a walk). Conditional statements in R
code have a similar logic.

Let’s discuss how to create conditional statements in R using three related statements:

• if
• else
• else if

if statement
The if statement sets a condition, and if the condition evaluates to TRUE, the R code associated with the if statement is
executed.

In R, you place the code for the condition inside the parentheses of the if statement. The code that has to be executed if
the condition is TRUE follows in curly braces (expr). Note that in this case, the second curly brace is placed on its own
line of code and identifies the end of the code that you want to execute.

if (condition) {
  expr
}

For example, let’s create a variable x equal to 4.

x <- 4

Next, let’s create a conditional statement: if x is greater than 0, then R will print out the string “x is a positive
number".

if (x > 0) {
  print("x is a positive number")
}

Since x = 4, the condition is true (4 > 0). Therefore, when you run the code, R prints out the string “x is a positive
number".

[1] "x is a positive number"

But if you change x to a negative number, like -4, then the condition will be FALSE (-4 > 0). If you run the code, R will not
execute the print statement, so nothing is printed.

else statement
The else statement is used in combination with an if statement. This is how the code is structured in R:

if (condition) {
  expr1
} else {
  expr2
}

The code associated with the else statement gets executed whenever the condition of the if statement is not
TRUE. In other words, if the condition is TRUE, then R will execute the code in the if statement (expr1); if the
condition is not TRUE, then R will execute the code in the else statement (expr2).

Let’s try an example. First, create a variable x equal to 7.

x <- 7

Next, let’s set up the following conditions:

• If x is greater than 0, R will print “x is a positive number”.
• If x is less than or equal to 0, R will print “x is either a negative number or zero”.

In our code, the first condition (x > 0) will be part of the if statement. The second condition of x less than or equal to 0 is
implied in the else statement. If x > 0, then R will print “x is a positive number”. Otherwise, R will print “x is
either a negative number or zero”.

x <- 7

if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}

Since 7 is greater than 0, the condition of the if statement is true. So, when you run the code, R prints out “x is a
positive number”.

[1] "x is a positive number"

But if you make x equal to -7, the condition of the if statement is not true (-7 is not greater than 0). Therefore, R will
execute the code in the else statement. When you run the code, R prints out “x is either a negative number or
zero”.

x <- -7

if (x > 0) {
  print("x is a positive number")
} else {
  print("x is either a negative number or zero")
}

[1] "x is either a negative number or zero"

else if statement
In some cases, you might want to customize your conditional statement even further by adding the else if statement. The
else if statement comes in between the if statement and the else statement. This is the code structure:

if (condition1) {
  expr1
} else if (condition2) {
  expr2
} else {
  expr3
}

If the if condition (condition1) is met, then R executes the code in the first expression (expr1). If the if condition is not met,
and the else if condition (condition2) is met, then R executes the code in the second expression (expr2). If neither of the
two conditions are met, R executes the code in the third expression (expr3).

In our previous example, using only the if and else statements, R can only print “x is either a negative number
or zero” if x equals 0 or x is less than zero. Imagine you want R to print the string “x is zero” if x equals 0. You
need to add another condition using the else if statement.

Let’s try an example. First, create a variable x equal to negative 1 (“-1”), and run the code to save the variable to
memory.

x <- -1

Now, you want to set up the following conditions:

• If x is less than 0, print “x is a negative number”
• If x equals 0, print “x is zero”
• Otherwise, print “x is a positive number”

In the code, the first condition will be part of the if statement, the second condition will be part of the else if statement,
and the third condition will be part of the else statement. If x < 0, then R will print “x is a negative number”. If x =
0, then R will print “x is zero”. Otherwise, R will print “x is a positive number”.

x <- -1

# run the code

if (x < 0) {
  print("x is a negative number")
} else if (x == 0) {
  print("x is zero")
} else {
  print("x is a positive number")
}

Run the code. Since -1 is less than 0, the condition for the if statement evaluates to TRUE, and R prints “x is a
negative number”.

[1] "x is a negative number"

If you make x equal to 0, R will first check the if condition (x < 0), and determine that it is FALSE. Then, R will evaluate
the else if condition. This condition, x==0, is TRUE. So, in this case, R prints “x is zero”.

If you make x equal to 1, both the if condition and the else if condition evaluate to FALSE. So, R will execute the else
statement and print “x is a positive number”.

As soon as R discovers a condition that evaluates to TRUE, R executes the corresponding code and ignores the rest.
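
To see that only the first matching branch runs, consider conditions that overlap. The describe_size() function below is a hypothetical helper written for this illustration, not part of R or the course materials.

```r
# describe_size() is a hypothetical helper used only for this illustration.
describe_size <- function(x) {
  if (x > 10) {
    "large"
  } else if (x > 5) {
    "medium"
  } else {
    "small"
  }
}

# 15 satisfies both x > 10 and x > 5, but only the first matching branch runs.
print(describe_size(15))
print(describe_size(7))
print(describe_size(2))
```

Because the order of the conditions decides which branch wins, it usually makes sense to put the most specific condition first.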

Basic Concepts of R
Function: A body of reusable code for performing specific tasks in R.

Argument: Information needed by a function in R in order to run.

Comment: Helpful text that describes or explains R code, preceded by #.

Variable: A representation of a value in R that can be stored for later use.

Data type: An attribute that describes a piece of data based on its values, its programming language, or
the operations it can perform.

Vector: A group of data elements of the same type stored in a one-dimensional sequence in R.

Pipe: A tool in R for expressing a sequence of multiple operations, represented with %>%.
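
Most of these concepts can appear together in a few lines of base R. The sketch below uses made-up temperature values; the pipe is left out because %>% comes from an add-on package rather than base R.

```r
# This line is a comment: R ignores everything after the # symbol.

# Variable: store a vector of three values (made-up temperatures) for later use.
temperatures <- c(61.5, 72.0, 68.3)

# Function and argument: mean() is the function, temperatures is its argument.
average_temperature <- mean(temperatures)
print(average_temperature)

# Data type: every element of a vector shares one type.
print(class(temperatures))
```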

Available R packages
To make the most of R for your data analysis, you will need to install packages.

Packages are units of reproducible R code that you can use to add more functionality to R.

The best part is that the R community creates and shares packages so that other users can access them! In this reading,
you will learn more about widely used packages and where to find them.

Packages can be found in repositories, which are collections of useful packages that are ready to install. You can find
repositories on Bioconductor, R-Forge, rOpenSci, or GitHub, but the most commonly used repository is the
Comprehensive R Archive Network, or CRAN. CRAN stores code and documentation so that you can install packages
into your own RStudio space.

Package documentation
Packages will not only include the code itself, but also documentation that explains the package’s author, function, and
any other packages that you will need to download. When you are using CRAN, you can find the package documentation
in the DESCRIPTION file.

Check out Karl Broman's R Package Primer to learn more.

Choosing the right packages


With so many packages out there, it can be hard to know which ones will be the most useful for your library or directory
of installed packages. Luckily, there are some great resources out there:

• Tidyverse: the tidyverse is a collection of R packages specifically designed for working with data. It’s a standard
library for most data analysts, but you can also download the packages individually.
• Quick list of useful R packages: this is RStudio Support’s list of useful packages with installation instructions and
functionality descriptions.
• CRAN Task Views: this is an index of CRAN packages sorted by task. You can search for the type of task you
need to perform and it will pull up a page with packages related to that task for you to explore.
You will discover more packages throughout this course and as you use R more often, but this is a great starting point for
building your own library.

Welcome to the Tidyverse
Packages are a big part of what makes R so great. Packages offer a helpful combination of code, reusable R functions,
descriptive documentation, tests for checking operability, and sample data sets. And for lots of data analysts, at the
top of the list of useful packages is the tidyverse.
The tidyverse is actually a collection of packages in R with a common design philosophy for data manipulation,
exploration, and visualization. Using the tidyverse can help you work your way through pretty much the entire data
analysis process. The packages in the tidyverse work together naturally. The tidyverse is considered a key part of
programming for most R users. The principles associated with the tidyverse, which you'll learn both here and at your
job, have been widely adopted by the R community.
Okay, let's install the tidyverse. Earlier, you learned how to find base R packages using the installed.packages()
function. To install packages like the tidyverse that aren't in base R, we'll use the install.packages() function. As we
discussed earlier, this function fetches the tidyverse and other packages from CRAN.
Let's talk about why CRAN was created. Since packages not in base R are mostly made by R users, people need a
reliable way to check and validate submitted code. CRAN makes sure any R content open to the public meets the
required quality standards. So, if it's sourced through CRAN, you can feel good that the package is authentic and
valid. Another major source of packages and other R content is GitHub.
Now, we'll get back to installing the tidyverse. We'll first type install.packages. Then, between the parentheses, we'll
type tidyverse in quotes. Some functions, such as library(), also accept a package name without quotes, but
install.packages() expects the name as a quoted string: install.packages("tidyverse"). We'll press Enter and wait for
RStudio to install the tidyverse. When we click on our Packages tab, we come across a lot of new packages on the list.
That's the tidyverse. You might have noticed that none of the packages are checked off. We need to load them first
before we can use them. But that's a mighty long list. So, let's just load the package named tidyverse for now, using
the library function: library(tidyverse). The return shows that not only was tidyverse loaded, but eight other packages
were too. It also shows a list of conflicts. Conflicts happen when packages have functions with the same names as
other functions. Basically, the last package loaded is the one whose functions will be used, so we'll stick with the
tidyverse functions. But it's important to note that these messages only appear once. So, as you get more used to R,
you'll be able to figure out if you want to use certain functions over others.
The loaded packages are ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. These packages are the core
of the tidyverse because you'll use them in almost every analysis. All of them work together to make your data
analysis smooth and efficient. With these packages, the tidyverse helps you do everything from importing and
transforming data to exploring and visualizing it.
The packages available in the tidyverse change a lot, but you can always check for updates by running
tidyverse_update() in your console. You can then update the packages in a couple of ways. If you use the
update.packages() function, it'll update all of your packages. That might take a while. So, if you just want to update
one package, you can use the install.packages() function again with the package name as your argument in
parentheses. You should update packages regularly to make sure you've got the latest version in your code.
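
The install-then-load sequence can be made a little more defensive. The sketch below is one possible approach, not the course's own script: it checks the library first and leaves the actual installation as a manual step, since installing needs a network connection.

```r
# installed.packages() lists what is already in your library.
installed <- rownames(installed.packages())
print(head(installed))

# Only load the tidyverse if it is installed; otherwise say how to get it.
# (Installation is left as a manual step because it needs a network connection.)
has_tidyverse <- "tidyverse" %in% installed
if (has_tidyverse) {
  library(tidyverse)
} else {
  message('Run install.packages("tidyverse") first, then library(tidyverse).')
}
```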

Working with Pipes


A pipe is a tool in R for expressing a sequence of multiple operations. A pipe is represented by a % sign, followed by
a > sign, and another % sign (%>%). It's used to pass the output of one function into another function. Pipes can
make your code easier to read and understand. For example, a pipe can filter the data and then sort it.
In other words, it takes the output of one statement and makes it the input of the next statement. So, instead of
typing functions contained inside other functions, you can use the pipe operator to do the same work. In programming,
code written as functions inside other functions is described as nested. Nested describes code that performs a
particular function and is contained within code that performs a broader function. (A nested function is a function
that is completely contained within another function. With nested functions, we read from the inside out.)
You can think of pipes as a way to code the phrase “and then”. Say you have got sales data, and you need to find the
mean or the average. You can create a pipe by calling up the data, and then grouping the data, and then summarizing
the grouped data using a mean function.
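
To compare the nested and piped styles, here is a minimal sketch using the built-in ToothGrowth dataset. It assumes the dplyr package is installed; the choice of dose == 0.5 and the variable names are only for illustration and are not taken from the course's R file.

```r
library(dplyr)

# Nested version: read from the inside out (filter first, then arrange).
nested <- arrange(filter(ToothGrowth, dose == 0.5), len)

# Piped version: read top to bottom as
# "take the data, and then filter, and then sort".
piped <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  arrange(len)

# Both versions give the same result.
print(identical(nested, piped))

# "And then" group the filtered rows and summarize each group with a mean.
mean_len_by_supp <- ToothGrowth %>%
  filter(dose == 0.5) %>%
  group_by(supp) %>%
  summarize(mean_len = mean(len), .groups = "drop")
print(mean_len_by_supp)
```

The piped version lists the steps in the order they happen, which is the main readability advantage over nesting.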
The rest of the work is done in the R file named “Toothgrowth exploration”.
