Introduction To R Programming
1. Open-source
R is free, open-source software that is accessible to everyone.
The language is also adaptable, which makes it easy to integrate with
other applications and processes. Although it is free, R is a
high-quality, usable, and versatile tool.
2. Powerful graphics
Graphics are one of the most attractive features of R. Many data
scientists use R to analyze data because its static graphics produce
good-quality data visualizations. R also has a comprehensive set of
libraries that provide interactive graphics and make visualizations easy
to build and interpret. From elaborate, interactive flow diagrams to bar
graphs, R has the tools to make data analysis engaging and easy.
3. Widely used
R has fostered a community of its own. The language is widely used by
data scientists and business leaders across the globe, and because it is
open source, it creates a strong sense of community among its users.
4. Performs complex statistical calculations
R is popular largely because it can perform both simple and complex
mathematical and statistical calculations. It is also used for
analyzing data in many industries.
5. Compatibility
R can interface with languages such as C, C++, Java, and Python, and
its functions can be integrated into programs written in other languages.
These features and more make R a favourite choice of many data
scientists. Organizations such as Twitter, Google Analytics, and the BBC
are reported to use R for data science. They use it to collect data and
create clear data visualizations or graphs, draw inferences, and support
their business decisions. If you are a data scientist wondering how to
start using R for data analysis, proceed to the following sections to get
started.
Basic R program
Because R is syntactically similar to other widely used languages, it is
relatively easy to learn and to write code in. R programs can be written
in any of the widely used IDEs, such as RStudio, Rattle, or Tinn-R. After
writing the program, save the file with the extension .r. To run the
program, use the following command on the command line:
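The command itself is missing from the text above. As a minimal sketch, assuming the script is saved under the hypothetical file name hello.r and that the Rscript launcher that ships with R is available on the command line, a first program and the way to run it might look like this:

```r
# hello.r (a hypothetical file name, used only for illustration)
#
# Run it from the command line with:
#   Rscript hello.r

greeting <- "Hello, World!"
print(greeting)
```

Rscript runs the file from top to bottom and prints the result to the terminal, without opening an interactive R session.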
Advantages of R
R is one of the most comprehensive statistical analysis packages, and
new technology and concepts often appear first in R.
R is open source, so you can run it anywhere and at any time.
R is cross-platform: it runs on GNU/Linux, Windows, and macOS.
In R, everyone is welcome to provide new packages, bug fixes, and
code enhancements.
Disadvantages of R
In the R programming language, the quality of some packages is
less than perfect.
R commands give little thought to memory management, so R may
consume all available memory.
Because R is a community project, there is nobody to complain to if
something doesn't work.
R can be much slower than other programming languages, such as
Python and MATLAB, for some tasks.
Applications of R
R is used for data science. It offers a broad variety of libraries
related to statistics and provides an environment for statistical
computing and design.
Many quantitative analysts use R as their programming tool, where it
helps with data importing and cleaning.
R is widely used by data analysts and research programmers, and it
serves as a fundamental tool in finance.
Tech companies such as Google, Facebook, Bing, Twitter, Accenture, and
Wipro use R.
Importance of R programming in Data Analysis :-
A common question is how R compares with Python on everyday data science
tasks. The bottom line is that each language has its own strengths and
both are great choices for data science, but R does have unique strengths
that are worth considering!
And by the way, it’s not just tech firms: R is in use at analysis and
consulting firms, banks and other financial institutions, academic
institutions and research labs, and pretty much everywhere else data
needs analyzing and visualizing. Even the New York Times uses R!
And that is really just the tip of the iceberg: there are plenty of R
packages even outside of the tidyverse that do cool things. For example,
a package called googleAnalyticsR can be used to crunch Google Analytics
data in R.
5. Inclusive, growing community of data scientists and statisticians.
As the field of data science has exploded, R has exploded with it,
becoming one of the fastest-growing languages in the world (as measured
by StackOverflow). That means it’s easy to find answers to questions and
community guidance as you work your way through projects in R. And
because there are so many enthusiastic R users, you can find R packages
integrating almost any app you can think of!
The R community is also particularly warm and inclusive, and there are
amazing groups like R-Ladies and Minority R Users designed to
help make sure everyone can learn and use R skills.
Even if you don't want to use R yourself, learning the basics will make it
easier for you to follow someone else's R code if you ever have to take
over a coworker's project. Being able to look at R and translate it into
Python means that the amazing resources of both languages are open to
you.
Long story short: there are lots of great reasons to learn R, and it is
a fantastic language for data science.
Data analysis is commonly divided into eight types:
1. Descriptive analysis
2. Diagnostic analysis
3. Exploratory analysis
4. Inferential analysis
5. Predictive analysis
6. Causal analysis
7. Mechanistic analysis
8. Prescriptive analysis
1. DESCRIPTIVE ANALYSIS
Take the Covid-19 statistics page on Google, for example. The line graph
is a pure summary of the cases/deaths, a presentation and description of
the population of a particular country infected by the virus.
Descriptive analysis is the first step in analysis where you summarize and
describe the data you have using descriptive statistics, and the result is a
simple presentation of your data.
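As a minimal sketch of descriptive analysis in R, the block below summarizes a small vector of daily case counts. The numbers are invented purely for illustration and are not real Covid-19 figures; the variable names are also assumptions.

```r
# Hypothetical daily case counts, invented purely for illustration
daily_cases <- c(120, 135, 150, 110, 160, 175, 140)

case_mean   <- mean(daily_cases)    # average cases per day
case_median <- median(daily_cases)  # middle value of the sorted counts
case_sd     <- sd(daily_cases)      # spread around the mean
case_range  <- range(daily_cases)   # smallest and largest value

# summary() reports the minimum, quartiles, mean, and maximum in one call
print(summary(daily_cases))
print(c(mean = case_mean, median = case_median, sd = case_sd))
```

None of these calls explains or predicts anything; they only describe the data that is already there, which is the point of this first step.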
2. DIAGNOSTIC ANALYSIS
Diagnostic analysis seeks to answer the question “Why did this happen?”
by taking a more in-depth look at data to uncover subtle patterns. For
example:
A footwear store wants to review its website traffic levels over the
previous 12 months. Upon compiling and assessing the data, the
company’s marketing team finds that June experienced above-average
levels of traffic while July and August witnessed slightly lower levels of
traffic. To find out why this difference occurred, the marketing team
takes a deeper look. Team members break down the data to focus on
specific categories of footwear. They discover that, in June,
pages featuring sandals and other beach-related footwear received a high
number of views while these numbers dropped in July and August.
Marketers may also review other factors like seasonal changes and
company sales events to see if other variables could have contributed to
this trend.
4. INFERENTIAL ANALYSIS
Inferential analysis uses a small amount of information to extrapolate
and generalize to a larger group. This is also the goal of statistical
modeling itself.
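One way to illustrate this idea is to estimate a population mean from a small sample. The sketch below is only an illustration: the population is simulated, and the names and numbers (heights in cm, a sample of 50) are assumptions rather than anything taken from the text.

```r
set.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 simulated heights in cm
population <- rnorm(10000, mean = 170, sd = 10)

# In practice we only get to observe a small random sample
heights <- sample(population, size = 50)

# A one-sample t-test estimates the population mean from the sample
result <- t.test(heights)

sample_mean <- mean(heights)
conf_int    <- result$conf.int   # 95% confidence interval by default

print(sample_mean)
print(conf_int)
```

The confidence interval is the inferential part: it is a statement about the whole population, made from only 50 observations.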
5. PREDICTIVE ANALYSIS
The 2020 US election is a popular topic, and many models were built to
predict the winning candidate. FiveThirtyEight did this to forecast
the 2016 and 2020 elections. Predictive analysis for an election would
require input variables such as historical polling data, trends and current
polling data in order to return a good prediction. Something as large as an
election wouldn’t just be using a linear model, but a complex model with
certain tunings to best serve its purpose.
Predictive analysis takes data from the past and present to make
predictions about the future.
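The idea of using past data to predict a future value can be sketched with a deliberately simple linear model. The polling figures below are invented for illustration, and, as noted above, a real election forecast would use a far more complex model than this.

```r
# Hypothetical weekly polling percentages, invented for illustration
polls <- data.frame(
  week        = 1:5,
  polling_pct = c(40, 41, 42, 43, 44)
)

# Fit a linear model on the past data
model <- lm(polling_pct ~ week, data = polls)

# Predict the value for a week that has not been observed yet
next_week <- data.frame(week = 6)
forecast  <- predict(model, newdata = next_week)

print(forecast)
```

Because the made-up data rises by exactly one point per week, the model extends the same trend to week 6.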
6. CAUSAL ANALYSIS
To find the cause, you have to question whether the observed correlations
driving your conclusion are valid. Just looking at the surface data won’t
help you discover the hidden mechanisms underlying the correlations.
Causal analysis is applied in randomized studies focused on identifying
causation.
Causal analysis is the gold standard in data analysis and scientific studies
where the cause of a phenomenon is to be extracted and singled out, like
separating wheat from chaff.
Good data is hard to find and requires expensive research and studies.
These studies are analyzed in aggregate (multiple groups), and the
observed relationships are just average effects (mean) of the whole
population. This means the results might not apply to everyone.
Say you want to test out whether a new drug improves human strength
and focus. To do that, you perform randomized control trials for the drug
to test its effect. You compare the sample of candidates for your new drug
against the candidates receiving a mock control drug through a few tests
focused on strength and overall focus and attention. This will allow you
to observe how the drug affects the outcome.
Causal analysis is about finding out the causal relationship between
variables, and examining how a change in one variable affects another.
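The drug trial described above can be sketched as a small simulation. Everything in the block is an assumption made for illustration: the sample size, the score scale, and in particular the +5 effect of the drug, which is built into the simulated data rather than measured.

```r
set.seed(1)  # fixed seed so the illustration is repeatable

n <- 200

# Random assignment to groups is what allows a causal reading
group <- sample(rep(c("drug", "placebo"), each = n / 2))

# Simulated strength scores; the +5 drug effect is an assumption
score <- rnorm(n, mean = 50, sd = 10) + ifelse(group == "drug", 5, 0)

trial <- data.frame(group = factor(group), score = score)

# Compare the average score of the two groups
group_means <- tapply(trial$score, trial$group, mean)
result      <- t.test(score ~ group, data = trial)

print(group_means)
print(result$p.value)
```

The comparison is between group averages, which is why, as noted above, the result describes an average effect and might not apply to every individual.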
7. MECHANISTIC ANALYSIS
8. PRESCRIPTIVE ANALYSIS
R Data Types :-
R has a variety of data types and object classes. You will learn much
more about these as you continue to get to know R.
Important :- We can use the class() function to check the data type of a
variable:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
Assigning variables in R :-
In computer programming, a variable is a named memory location where
data is stored. For example,
x = 13.8
Here, x is the variable where the data 13.8 is stored. Now, whenever we
use x in our program, we will get 13.8.
x = 13.8
# print variable
print(x)
Output
[1] 13.8
Types of R Variables
Depending on the type of data that you want to store, variables can be
divided into the following types.
1. Boolean Variables
a = TRUE
print(a)
print(class(a))
Output
[1] TRUE
[1] "logical"
Here, we have declared the boolean variable a with the value TRUE.
Boolean variables belong to the logical class so class(a) returns "logical".
2. Integer Variables
A = 14L
print(A)
print(class(A))
Output
[1] 14
[1] "integer"
3. Floating Point Variables
x = 13.4
print(x)
print(class(x))
Output
[1] 13.4
[1] "numeric"
Here, we have created a floating point variable named x. You can see that
the floating point variable belongs to the numeric class.
4. Character Variables
alphabet = "a"
print(alphabet)
print(class(alphabet))
Output
[1] "a"
[1] "character"
5. String Variables
A string variable stores data that is composed of more than one
character. We use double quotes to represent string data. For example,
print(message)
print(class(message))
Output
[1] "character"
Here, we have created a string variable named message. You can see that
the string variable also belongs to the character class.
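The assignment that creates message is missing from the example above. A complete version might look like the sketch below; the text "Welcome to R!" is a placeholder chosen for illustration, not the original value.

```r
# "Welcome to R!" is a placeholder string chosen for illustration
message <- "Welcome to R!"

print(message)          # prints the stored string
print(class(message))   # strings belong to the "character" class
```

R makes no distinction between a single character and a longer string: both are stored as the character type.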
print(message)
Output
In this program,
Basic Operations
Once you have a vector (or a list of numbers) in memory most basic
operations are available. Most of the basic operations will act on a whole
vector and can be used to quickly perform a large number of calculations
with a single command. One thing to note: if you perform an
operation on more than one vector, it is often necessary that the vectors
all contain the same number of entries.
Here we first define a vector which we will call “a” and will look at how
to add and subtract constant numbers from all of the numbers in the
vector. First, the vector will contain the numbers 1, 2, 3, and 4. We then
see how to add 5 to each of the numbers, subtract 10 from each of the
numbers, multiply each number by 4, and divide each number by 5.
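The steps just described can be sketched as follows. The names used to hold the results (a_plus_5 and so on) are chosen here for illustration only.

```r
a <- c(1, 2, 3, 4)     # define the vector

a_plus_5   <- a + 5    # add 5 to each number:         6 7 8 9
a_minus_10 <- a - 10   # subtract 10 from each number: -9 -8 -7 -6
a_times_4  <- a * 4    # multiply each number by 4:    4 8 12 16
a_over_5   <- a / 5    # divide each number by 5:      0.2 0.4 0.6 0.8

print(a_plus_5)
print(a_minus_10)
print(a_times_4)
print(a_over_5)
```

A result can also be saved in another vector, here called b: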
> b <- a - 10
> b
[1] -9 -8 -7 -6
If you want to take the square root, find e raised to each number, the
logarithm, etc., then the usual commands can be used:
> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000
> exp(a)
[1] 2.718282 7.389056 20.085537 54.598150
> log(a)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
> exp(log(a))
[1] 1 2 3 4
By combining operations and using parentheses you can make more
complicated expressions:
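As one possible illustration, assuming a still holds the numbers 1 to 4, several operations can be combined in a single expression. The name combined is chosen here for illustration.

```r
a <- c(1, 2, 3, 4)

# Parentheses control the order of evaluation, as in ordinary algebra.
# Every operation is still applied to each element of a in turn.
combined <- (a + sqrt(a)) / (exp(2) + 1)

print(combined)
```

The result is a vector with one value per element of a.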
Note that you can do the same operations with vector arguments. For
example to add the elements in vector a to the elements in vector b use
the following command:
> a+b
[1] -8 -6 -4 -2
The operation is performed on an element by element basis. Note this is
true for almost all of the basic functions. So you can bring together all
kinds of complicated expressions:
> a*b
[1] -9 -16 -21 -24
> a/b
[1] -0.1111111 -0.2500000 -0.4285714 -0.6666667
> (a+3)/(sqrt(1-b)*2-1)
[1] 0.7512364 1.0000000 1.2884234 1.6311303
You need to be careful of one thing. When you do operations on vectors
they are performed on an element by element basis. One ramification of
this is that all of the vectors in an expression should be the same
length. If the lengths of the vectors differ, R reuses the elements of
the shorter vector, and you may get an error message or, worse, only a
warning message and results you did not expect:
> a+b
[1] 11 13 15 14
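The definitions that produce this output are not shown. The result is consistent with a vector of three entries added to a vector of four entries, so the sketch below assumes those values in order to show what R does when the lengths differ.

```r
a <- c(1, 2, 3)          # three entries (assumed values)
b <- c(10, 11, 12, 13)   # four entries (assumed values)

# R recycles the shorter vector: the fourth sum is a[1] + b[4] = 1 + 13.
# Running a + b directly also emits a warning that the longer length is
# not a multiple of the shorter one; it is suppressed here.
total <- suppressWarnings(a + b)

print(total)   # 11 13 15 14
```

Because only a warning is raised, the calculation carries on with a result that is easy to misread.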
As you work in R and create new vectors it can be easy to lose track of
what variables you have defined. To get a list of all of the variables that
have been defined use the ls() command:
> ls()
Finally, you should keep in mind that the basic operations almost always
work on an element by element basis. There are rare exceptions to this
general rule. For example, if you look at the minimum of two vectors
using the min command you will get the minimum of all of the numbers.
There is a special command, called pmin, that may be the command you
want in some circumstances:
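The difference between the two commands can be sketched with two short vectors. The values are chosen here for illustration.

```r
a <- c(1, -2, 3, -4)   # values chosen for illustration
b <- c(-1, 2, -3, 4)

overall_min  <- min(a, b)    # smallest of all the numbers: -4
pairwise_min <- pmin(a, b)   # element by element minimum:  -1 -2 -3 -4

print(overall_min)
print(pairwise_min)
```

min collapses both vectors into a single number, while pmin returns a vector of the same length as its inputs.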