R Notes1 Merged
What is R used for?
•Statistical inference
•Data analysis
•Data visualization
•Reporting
•Machine learning algorithms
Areas where R is used
•Academic
•Health care
•Finance
•Consulting
•Energy
Communicate with R
R has multiple ways to present and share work, either through a markdown document or a
shiny app.
Variables are used to store data whose value can be changed as needed. The unique
name given to a variable (or to a function or object) is called an identifier.
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed by a
digit.
Valid identifiers in R
Invalid identifiers in R
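As a minimal sketch of the two rules above: the example names below are illustrative choices, and is_valid_name() is a helper defined here, not a base R function. It relies on make.names(), which returns its input unchanged only when the name is already syntactically valid.

```r
# Hypothetical helper: TRUE when x is already a syntactically valid name.
is_valid_name <- function(x) make.names(x) == x

valid   <- c("total", "Sum", ".fine.with.dot", "this_is_acceptable", "Number5")
invalid <- c("tot@l",   # contains a character other than letter, digit, . or _
             "5um",     # starts with a digit
             "_fine",   # starts with an underscore
             "TRUE",    # reserved word
             ".0ne")    # period followed by a digit

is_valid_name(valid)     # all TRUE
is_valid_name(invalid)   # all FALSE
```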
Constants in R
Constants, as the name suggests, are entities whose value cannot be altered.
1. Numeric Constants
All numbers fall under this category. They can be of type integer, double or complex.
2. Character Constants
Character constants can be represented using either single quotes (') or double quotes (") as
delimiters.
3. Built-in Constants
Some of the built-in constants defined in R are LETTERS, letters, month.abb, month.name and pi.
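The three kinds of constants can be inspected directly at the console. This is a short sketch using only base R; the literal values are arbitrary.

```r
typeof(5)      # "double": plain numbers are doubles by default
typeof(5L)     # "integer": the L suffix marks an integer constant
typeof(5i)     # "complex"

'example' == "example"   # TRUE: single and double quotes give the same string

LETTERS[1:3]      # "A" "B" "C"
letters[1:3]      # "a" "b" "c"
month.abb[1:2]    # "Jan" "Feb"
month.name[1]     # "January"
pi                # 3.141593
```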
An object is a data structure that has attributes, together with methods that act on them. Common attributes of an R object include:
• names
• dimnames
• dim
• class
Data Type
A data type is a data storage format that can contain a specific type or range of values.
• character
•numeric (real or decimal)
•integer
•logical
•complex
R Operators
Assignment Operators in R
Operator Description
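As a brief sketch of the assignment operators R provides (<-, =, -> and <<-); the variable and function names are illustrative only.

```r
x <- 5        # leftward assignment, the most common form
y = 6         # also leftward; mostly seen at the top level and for arguments
7 -> z        # rightward assignment

counter <- 0
increment <- function() {
  counter <<- counter + 1   # <<- assigns in an enclosing environment
}
increment()
counter       # 1
```

Inside a function, a plain <- would create a local copy and leave the outer counter unchanged, which is why <<- is used in increment().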
Arithmetic Operators in R
Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division
^ Exponent
Relational Operators in R
Operator Description
== Equal to
!= Not equal to
Operator Precedence in R
> 2 + 6 * 5
[1] 32
Operator Associativity
> 3 / 4 / 5
[1] 0.15
> 3 / (4 / 5)
[1] 3.75
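The two results above follow from precedence (which operator binds tighter) and associativity (how a chain of equal-precedence operators is grouped). A few more cases, as a sketch:

```r
2 + 6 * 5        # 32: * binds tighter than +
(2 + 6) * 5      # 40: parentheses override precedence

3 / 4 / 5        # 0.15: / is left-associative, i.e. (3 / 4) / 5
3 / (4 / 5)      # 3.75

2 ^ 3 ^ 2        # 512: ^ is right-associative, i.e. 2 ^ (3 ^ 2)
-2 ^ 2           # -4: ^ binds tighter than unary minus
```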
Read Data
Write data
write.table(x, file, append = FALSE, sep = " ", dec = ".", row.names = TRUE,
col.names = TRUE)
Save and restore a single R object: saveRDS(object, file), my_data <- readRDS(file)
Save the entire workspace: save.image(file = "my_work_space.RData")
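A minimal round trip for the functions above, assuming a small made-up data frame and temporary files created with tempfile(); read.table() is used as the reading counterpart of write.table().

```r
# Illustrative data; the column names are arbitrary.
df <- data.frame(id = 1:3, score = c(90.5, 72, 88))

# Text file: write with write.table(), read back with read.table().
path_txt <- tempfile(fileext = ".txt")
write.table(df, path_txt, sep = "\t", row.names = FALSE)
df_txt <- read.table(path_txt, header = TRUE, sep = "\t")

# Binary file holding one R object: saveRDS() / readRDS().
path_rds <- tempfile(fileext = ".rds")
saveRDS(df, path_rds)
df_rds <- readRDS(path_rds)

all.equal(df, df_txt)    # TRUE: same values after the text round trip
identical(df, df_rds)    # TRUE: the object is restored exactly
```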
Data Structure
Data structures are ways of arranging data so that it can be used efficiently in a computer.
R provides several data structures for storing multiple values. These include
•vector
•list
•matrix
•data frame
•factors
Vector
What is coercion in an R vector?
Vectors only hold elements of the same data type. If there is more than one data type,
the c() function converts the elements. This is known as coercion. The conversion takes
place from lower to higher types.
logical < integer < double < complex < character.
Code:
> vec6 <- c(1, FALSE, 3L, 12+5i, "hello")
> typeof(vec6)
[1] "character"
Elements of a vector can be accessed using vector indexing. The vector used for indexing
can be logical, integer or character vector.
Vector indexing in R starts from 1, unlike most programming languages, where indexing starts
from 0.
> x
[1]  0  2  4  6  8 10
> x[3] # access 3rd element
[1] 4
In logical indexing, the elements at positions where the logical vector is TRUE are
returned. For example, in the code below, R returns the elements at positions 1 and 3, where
the corresponding logical values are TRUE.
> a <- c(1, 2, 3, 4)
> a[c(TRUE, FALSE, TRUE, FALSE)]
[1] 1 3
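The three kinds of index vector mentioned above (integer, logical and character) can be compared side by side. The named vector ages is a made-up example.

```r
x <- c(0, 2, 4, 6, 8, 10)

x[3]             # 4: integer index, counting from 1
x[c(1, 5)]       # 0 8: several positions at once
x[-1]            # 2 4 6 8 10: a negative index drops that position
x[x > 5]         # 6 8 10: logical index built from a condition

ages <- c(ann = 31, bob = 45)   # hypothetical named vector
ages["bob"]      # 45: character index selects by name
```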
What is R List?
An R list is an object that can contain elements of different types, such as strings, numbers,
vectors and even another list.
Unlike an R vector, an R list can contain items of different data types. Named list elements
are accessed with two-part names, written with the dollar sign $ in R (list_name$element_name).
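A short sketch of creating a list and reaching into it; the list person and its contents are invented for illustration.

```r
person <- list(
  name    = "Asha",               # character
  age     = 30,                   # numeric
  scores  = c(88, 92, 79),        # vector
  address = list(city = "Pune")   # another list nested inside
)

person$name             # "Asha": two-part name with $
person[["scores"]][2]   # 92: [[ ]] extracts the element, then index into it
person$address$city     # "Pune": $ can be chained for nested lists
class(person["age"])    # "list": single [ ] keeps the list wrapper
```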
Built-in constants available in every R session:
1. letters
2. LETTERS
3. month.abb
4. month.name
R Matrix
In a matrix, numbers are arranged in a fixed number of rows and columns and usually, the
numbers are the real numbers.
Matrix is a two dimensional data structure in R programming.
matrix(1:9, nrow = 3)
It is possible to name the rows and columns of a matrix during creation by passing a 2-element
list to the argument dimnames.
Another way of creating a matrix is by using functions cbind() and rbind() as in column
bind and row bind.
> cbind(c(1,2,3),c(4,5,6))
> rbind(c(1,2,3),c(4,5,6))
We can access elements of a matrix using the square bracket [ indexing method. Elements
can be accessed as var[row, column]. Here rows and columns are vectors.
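Putting the pieces of this section together, as a sketch: the row and column names below are arbitrary labels.

```r
m <- matrix(1:9, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("a", "b", "c")))
# matrix() fills column by column, so column "a" holds 1, 2, 3.

m[2, 3]          # 8: row 2, column 3
m["r1", ]        # 1 4 7: the whole first row, selected by name
m[, c(1, 3)]     # columns 1 and 3 (a 3 x 2 matrix)

dim(cbind(c(1, 2, 3), c(4, 5, 6)))   # 3 2: the vectors become columns
dim(rbind(c(1, 2, 3), c(4, 5, 6)))   # 2 3: the vectors become rows
```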
R Data Frame
A data frame stores tabular data. It is a data structure in R in which the rows represent
observations and the columns represent measurements (variables).
A data frame is used for storing data tables. It is a list of vectors of equal length, one
vector per column.
> output <- data.frame(employee_data$employee_name, employee_data$employee_id)
> print(output)
Add Column
Add the column vector using a new column name.
•Add the “dept” column
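A runnable sketch of the steps above. The contents of employee_data are hypothetical (the notes do not show them); only the column names employee_name, employee_id and dept come from the text.

```r
# Hypothetical data frame; the values are made up for illustration.
employee_data <- data.frame(
  employee_id   = 1:3,
  employee_name = c("Ravi", "Meena", "John")
)

# Select two columns into a new data frame.
output <- data.frame(employee_data$employee_name, employee_data$employee_id)
print(output)

# Add the "dept" column by assigning a vector with one value per row.
employee_data$dept <- c("IT", "HR", "Finance")
str(employee_data)
```

The new vector must have as many elements as the data frame has rows (or a length that recycles evenly), otherwise R raises an error.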
Factor in R
How to create a factor in R?
We can create a factor using the function factor(). Levels of a factor are inferred from the
data if not provided.
> x <- factor(c("single", "married", "married", "single"))
> x
Factors are also created when we read non-numerical columns into a data frame with
stringsAsFactors = TRUE (this was the default before R 4.0.0).
Accessing components of a factor is very similar to accessing components of a vector.
> x[3]
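A few more operations on the same factor, as a sketch. The second factor y, with an extra level "divorced", is an added illustration.

```r
x <- factor(c("single", "married", "married", "single"))

levels(x)        # "married" "single": inferred and sorted alphabetically
table(x)         # counts per level: married 2, single 2
as.integer(x)    # 2 1 1 2: the underlying integer codes
x[3]             # married (the full set of levels is kept)

# Levels can also be given explicitly, including ones absent from the data.
y <- factor(c("single", "married"), levels = c("single", "married", "divorced"))
nlevels(y)       # 3
```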
Statistics is the discipline that concerns the collection, organization, analysis, interpretation
and presentation of data.
Statistics is the science of learning from data, and of measuring, controlling, and
communicating uncertainty; it thereby provides the navigation essential for controlling
the course of scientific and societal advances.
Variable: A variable in the mathematical sense, i.e. a quantity which may take any one of
a specified set of values.
Population: In statistics, a population is the entire pool from which a statistical sample is
drawn. A population can be described as an aggregate of subjects grouped together by a
common feature.
There are two basic forms: descriptive statistics and inferential statistics.
• Descriptive statistics is primarily about summarizing a given data set through numerical
summaries and graphs. It can be used for exploratory analysis, to visualize the information
contained in the data and to suggest hypotheses. Interactive computer graphics are now
regularly used to display numerical information.
Descriptive Statistics
Measures of Central Tendency : Central tendency refers to the idea that there is one
number that best summarizes the entire set of measurements, a number that is in
some way “central” to the set
• Mean
• Median
• Mode
• Variance
• Standard Deviation
• Quartile
• Correlation
• Measures of shapes : It describes the distribution (or pattern) of the data within a
dataset
• skewness
• kurtosis
Mean / Average
The mean, or average, is a measure of the central tendency of the data, i.e. a number around
which the whole data set is spread out. In a way, it is a single number that can represent the
whole data set.
Median
The median is the middle value of the sorted data. Here the sorted data set has an even number
of values, with positions 5 and 6 in the middle; therefore, to get the median we average them
by adding the two and dividing by 2.
Median = (50 + 63)/2 = 56.5
Mode
The mode is the value that appears most often in a data set, i.e. the value with the highest
frequency. A data set may have no mode at all if all values appear the same number of times.
If two values are tied for the highest frequency, the data set is bimodal; if three values
are tied, it is trimodal; in general, a data set with several modes is multimodal.
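Base R has no function for the statistical mode (the built-in mode() reports an object's storage mode), so the sketch below defines a small helper, modes(), which is an assumption of this example. It returns every value tied for the highest frequency, or NULL when all values are equally frequent.

```r
# Hypothetical helper for numeric data.
modes <- function(x) {
  counts <- table(x)
  if (length(counts) > 1 && all(counts == counts[1])) {
    return(NULL)                 # every value equally frequent: no mode
  }
  top <- names(counts)[counts == max(counts)]
  as.numeric(top)                # table() stores the values as names
}

modes(c(21, 35, 46, 46, 50))     # 46: unimodal
modes(c(1, 1, 2, 2, 3))          # 1 2: bimodal
modes(c(1, 2, 3))                # NULL: no mode
```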
Range
The range is just the maximum value minus the minimum value.
For our example above, the highest value is 92 and the lowest is 21. So the range is
92 - 21 = 71
Variance
It is the most commonly used measure of dispersion. It is calculated by taking the average
of the squared differences between each value and the mean.
Standard Deviation
The standard deviation is the square root of the variance. It is a detailed description of
dispersion because it reflects how far each value in the set lies from the mean, and it is
expressed in the same units as the data.
Deviations from the mean (58.5):
21 - 58.5 = -37.5
35 - 58.5 = -23.5
46 - 58.5 = -12.5
46 - 58.5 = -12.5
50 - 58.5 = -8.5
63 - 58.5 = 4.5
67 - 58.5 = 8.5
77 - 58.5 = 18.5
Standard deviation = 23.02294
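The measures in this section can be reproduced in R. The data vector below is an assumption: it is reconstructed so that it reproduces the figures quoted in the text (mean 58.5, median 56.5, range 71, standard deviation 23.02294), and the value 88 in particular is inferred rather than taken from the notes. Note that var() and sd() divide by n - 1 (the sample versions), which is what yields 23.02294.

```r
# Assumed data set, reconstructed to match the figures quoted in the text.
scores <- c(21, 35, 46, 46, 50, 63, 67, 77, 88, 92)

mean(scores)                 # 58.5
median(scores)               # 56.5 = (50 + 63) / 2
diff(range(scores))          # 71 = 92 - 21

deviations <- scores - mean(scores)
manual_var <- sum(deviations^2) / (length(scores) - 1)   # sample variance
all.equal(manual_var, var(scores))                       # TRUE
sd(scores)                   # 23.02294 = sqrt(var(scores))
```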
In statistics and probability, quartiles are values that divide the data into quarters, provided
the data is sorted in ascending order.
There are three quartile values. The first quartile is at the 25th percentile, the second
quartile at the 50th percentile and the third quartile at the 75th percentile. The second
quartile (Q2) is the median of the whole data. The first quartile (Q1) is the median of the
lower half of the data, and the third quartile (Q3) is the median of the upper half of the data.
Example: 5, 7, 4, 4, 6, 2, 8
Sorted: 2, 4, 4, 5, 6, 7, 8
•Quartile 1 (Q1), the lower quartile, = 4
•Quartile 2 (Q2), the middle quartile, which is also the median, = 5
•Quartile 3 (Q3), the upper quartile, = 7
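The example can be checked in R with the median-of-halves method described above. As a caveat, quantile() supports several definitions: for this data type = 6 agrees with the hand calculation, while the default (type = 7) interpolates and gives a different Q3.

```r
x <- sort(c(5, 7, 4, 4, 6, 2, 8))    # 2 4 4 5 6 7 8
n <- length(x)
half <- n %/% 2                      # size of each half (median left out when n is odd)

q2 <- median(x)                      # 5
q1 <- median(x[1:half])              # median of 2 4 4 -> 4
q3 <- median(x[(n - half + 1):n])    # median of 6 7 8 -> 7

quantile(x, c(0.25, 0.5, 0.75), type = 6)   # 4 5 7, agrees with the values above
quantile(x, 0.75)                           # 6.5 with the default type = 7
```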
A probability is a number that reflects the chance or likelihood that a particular event will
occur. Probabilities can be expressed as proportions that range from 0 to 1, and they can also
be expressed as percentages ranging from 0% to 100%. A probability of 0 indicates that
there is no chance that a particular event will occur, whereas a probability of 1 indicates that
an event is certain to occur. A probability of 0.45 (45%) indicates that there are 45 chances
out of 100 of the event occurring.
Example
A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two
outcomes (“heads” and “tails”) are both equally probable; the probability of “heads” equals
the probability of “tails”; and since no other outcomes are possible, the probability of either
“heads” or “tails” is 1/2 (which could also be written as 0.5 or 50%).
•Experiment: an uncertain situation that could have multiple outcomes. Whether it rains
on a given day is an experiment: a physical situation whose outcome cannot be predicted
until it is observed.
•Outcome: the result of a single trial. So, if it rains today, the outcome of today’s
trial from the experiment is “It rained”.
•Event: one or more outcomes from an experiment. An EVENT is a subset of the
sample space. “It rained” is one of the possible events for this experiment.
•SAMPLE SPACE of a statistical experiment is the set of all possible outcomes (also
known as SAMPLE POINTS).
Example: I flip a coin, with two possible outcomes: heads (H) or tails (T).
What is the sample space for this experiment? What about for three flips in a row?
Solution: For the first experiment (flip a coin once), the sample space is just {H, T}. For the
second experiment (flip a coin three times), the sample space is {HHH, HHT, HTH, HTT,
THH, THT, TTH, TTT}. Order matters: HHT is a different outcome than HTH.
Example: For the experiment where I flip a coin three times in a row, consider the
event that I get exactly one T. Which outcomes are in this event?
Solution: The subset of the sample space that contains all outcomes with exactly one T is
{HHT, HTH, THH}.
When all outcomes in the sample space S are equally likely, the probability of an event E is
P(E) = |E| / |S|. That is, the probability of an event is the proportion of outcomes in the
sample space that are also outcomes in that event.
Example: I flip a fair coin, with two possible outcomes: heads (H) or tails (T). What is the
probability that I get exactly one T if I flip the coin once? What if I flip it three times?
Solution: First, note that I said it’s a fair coin. This is important, because it means that on
any one flip, each outcome is equally likely. We already determined the relevant events and
sample spaces for each experiment in the previous section, so now we just need to divide
those numbers. Specifically, if we only flip the coin once, then the event we care about
(getting T) has one possible outcome, and the sample space has two possible outcomes, so
the probability of getting T is 1/2. If we flip the coin three times, there are three outcomes
with exactly one T, and eight outcomes altogether, so the probability of getting exactly one
T is 3/8.
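The counting argument in the solution above can be checked by enumerating the sample space. This is a sketch using expand.grid() from base R; the column names first, second and third are arbitrary.

```r
# All outcomes of three flips; each row of expand.grid() is one outcome.
flips <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     third = c("H", "T"), stringsAsFactors = FALSE)
sample_space <- paste0(flips$first, flips$second, flips$third)
length(sample_space)                    # 8

# Event: exactly one T among the three flips.
n_tails <- rowSums(flips == "T")
event <- sample_space[n_tails == 1]
sort(event)                             # "HHT" "HTH" "THH"

length(event) / length(sample_space)    # 0.375 = 3/8
```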
MUTUALLY EXCLUSIVE (or DISJOINT) events. Two events are mutually exclusive
iff (“iff” means “if and only if”) they contain no outcomes in common (i.e., both events
cannot occur at the same time). For example, if I roll two dice, the events “get a total of 7”
and “get a total of 8” are mutually exclusive. On the other hand, “get a total of 7” and “get a
6 on one die” are not mutually exclusive, since both could occur on the same roll.
Now, suppose we have a set of n mutually exclusive events that together cover all possible
outcomes in our sample space S. Such a set is called a PARTITION of the sample space.
A simple example would be flipping a coin, where S = {H, T} and we define n = 2 mutually
exclusive events, E1 = {H} and E2 = {T}.
PROBABILITY DISTRIBUTION
Property 1: 0 ≤ P(Ei) ≤ 1
Property 2: ∑ P(Ei) = 1
That is, every probability must fall between 0 and 1 (inclusive), and the sum of the
probabilities of the mutually exclusive events that cover the sample space must equal 1.
Uniform Distribution
The amount of probability mass assigned to each Ei is what we call P(Ei). In our coin flipping
example, the two events are equally likely, so the mass is divided evenly and we get P(E1)
= P(E2) = 1/2. This is an example of a UNIFORM DISTRIBUTION, a distribution where
all events in a partition are equally likely.
Combining events
Often we are interested in combinations of two or more events. This can be represented
using set theoretic operations. Assume a sample space S and two events A and B:
• complement ¬A (also written A′): all elements of S that are not in A;
P(¬E) = 1 − P(E).
Example: Suppose I have a list of words, and I choose a word uniformly at random. If
the probability of getting a word starting with t is 1/7, then what is the probability of getting
a word that does not start with t?
Solution: Let E be the event that the word starts with t. Then P(¬E) is the probability we
were asked for, and it is 1 − P(E), or 6/7.
Example: Suppose I have a group containing the following first- and second-year university
students from various countries. The first 3 are male, and the last 4 female:
Consider the union of two events, A and B, written A ∪ B. Since events are just sets of
outcomes, taking their union corresponds to considering any outcome that belongs to either
A or B. For example, looking at the scenario above, let’s define A = “the student is female”
and B = “the student is from the UK”.
What is P(A ∪ B), that is, the probability that the student is female or from the UK?
Solution: You might imagine that the answer is just P(A) + P(B). Let’s see if that is correct.
First, we compute P(A), which is 4/7. Next, compute P(B), which is also 4/7. So P(A) +
P(B) = 8/7. But that clearly can’t be correct, since probabilities cannot be greater than one.
So, let’s instead consider which outcomes are actually in the set A ∪ B. They are: {Fiona,
Lea, Ajitha, Sarah, Andrew}. Since this set has five elements, we know that P(A ∪ B) must
be 5/7.
So what went wrong when we computed P(A) + P(B)? Notice that there are three students
who belong to both A and B: Fiona, Ajitha, and Sarah. So when we counted the outcomes in
A and the outcomes in B separately, these three students were counted twice. The correct rule is
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Since A ∩ B is the set of students that are in both A and B, this is exactly the set that will
have been counted twice, so we subtract off that amount from the probability. In our
example, we now get P(A ∪ B) = 4/7 + 4/7 − 3/7 = 5/7, which is exactly what we got when
computing P(A ∪ B) directly.
In the special case where A and B are mutually exclusive, it is true that P(A ∪ B) = P(A) + P(B),
because there are no outcomes in the intersection.
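The union calculation can be verified with logical vectors. Andrew and the four women are named in the text; "male_2" and "male_3" are placeholders for the two male students whose names are not given, and their non-UK status is inferred from P(B) = 4/7.

```r
# Placeholder names "male_2" and "male_3" are assumptions of this sketch.
students <- data.frame(
  name   = c("Andrew", "male_2", "male_3", "Fiona", "Lea", "Ajitha", "Sarah"),
  female = c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  uk     = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)
)

# With a uniformly random choice, a probability is the mean of a logical vector.
p_a       <- mean(students$female)                 # P(A)     = 4/7
p_b       <- mean(students$uk)                     # P(B)     = 4/7
p_a_and_b <- mean(students$female & students$uk)   # P(A ∩ B) = 3/7
p_a_or_b  <- mean(students$female | students$uk)   # P(A ∪ B) = 5/7

all.equal(p_a_or_b, p_a + p_b - p_a_and_b)         # TRUE
```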
Joint probabilities
The probability of the intersection of two events has a special name: it is called the JOINT
PROBABILITY of A and B, written P(A ∩ B).
If E1, ..., En form a partition of the sample space, then
P(B) = ∑ P(B ∩ Ei)
This is the law of total probability: it tells us that we can compute the probability of B by
adding up the joint probability of B with each of the Ei.
Example: Consider the student scenario above. We partition the sample space according to
the country that each student comes from, with E1 = “student is British”, E2 = “student is
Chinese”, and E3 = “student is German”. Also let B be the event that the student is female.
Apply the law of total probability to compute P(B), and check that the result is the
same as when computing P(B) directly.
Solution: Using the law of total probability, we have
P(B) = P(B ∩ E1) + P(B ∩ E2) + P(B ∩ E3)
The three British women (Fiona, Ajitha and Sarah) give P(B ∩ E1) = 3/7, and the remaining
woman, Lea, contributes 1/7 to whichever of the other two terms matches her country. The
total is 4/7, the same as computing P(B) directly from the four female students out of seven.
Conditional probability
Conditional probability is one of the most important concepts of probability theory. A
CONDITIONAL PROBABILITY expresses the probability that some event A will occur, given that
(conditioned on the fact that) event B has occurred. The conditional
probability of A given B, written P(A | B), where the | is pronounced “given”, is defined as
P(A | B) = P(A ∩ B) / P(B)
Example: Again let’s use the student scenario, with events A = “the student is
male” and B = “the student is from the UK”. What is P(A | B)?
Solution: In this case, A ∩ B is the set of male British students, so P(A ∩ B) = 1/7. P(B) = 4/7,
so
P(A | B) = (1/7) / (4/7) = 1/4
Intuitively, the probability that the chosen student is male given that the student is British
is simply the number of male British students as a fraction of the number of British students,
or 1/4. But again, it’s important to learn the formal rules of probability theory since not all
problems you will be faced with are so straightforward.
Example: The probability that it is Friday and that a student is absent is 0.03. Since there
are 5 school days in a week, the probability that it is Friday is 0.2. What is the probability
that a student is absent given that today is Friday?
Solution: P(Absent | Friday) = P(Friday ∩ Absent) / P(Friday) = 0.03 / 0.2 = 0.15
Example: A machine produces parts that are either good (90%), slightly defective (2%), or
obviously defective (8%). Produced parts get passed through an automatic inspection machine,
which is able to detect any part that is obviously defective and discard it. What is the quality of the
parts that make it through the inspection machine and get shipped?
Solution: Let G, SD, OD be the events that a randomly chosen produced part is good, slightly
defective, or obviously defective, respectively. We are told that P(G) = 0.90, P(SD) = 0.02, and
P(OD) = 0.08. We want to compute the probability that a part is good given that it passed the
inspection machine (i.e., it is not obviously defective), which is
P(G | ODc) = P(G ∩ ODc) / P(ODc)
= P(G) / (1 − P(OD))
= 0.90 / (1 − 0.08)
= 90 / 92
= 0.978
Here ODc is the complement of OD, and P(G ∩ ODc) = P(G) because every good part is also
not obviously defective.
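The three conditional probabilities worked out in this section all apply the same definition, which a one-line helper makes explicit. cond_prob() is a name chosen for this sketch, not a base R function.

```r
# Hypothetical helper: P(A | B) = P(A and B) / P(B).
cond_prob <- function(p_a_and_b, p_b) p_a_and_b / p_b

cond_prob(1 / 7, 4 / 7)      # 0.25: P(male | from the UK)
cond_prob(0.03, 0.2)         # 0.15: P(absent | Friday)

# Inspection machine: every good part is also "not obviously defective",
# so P(G and not OD) = P(G).
p_g  <- 0.90
p_od <- 0.08
round(cond_prob(p_g, 1 - p_od), 3)   # 0.978
```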
Independent Event
Two events, A and B, are independent if the fact that A occurs does not affect the
probability of B occurring.
Examples of independent events are:
•Landing on heads after tossing a coin AND rolling a 5 on a single 6-sided die.
•Rolling a 4 on a single 6-sided die, AND then rolling a 1 on a second roll of the die.
To find the probability of two independent events that occur in sequence, find the
probability of each event occurring separately, and then multiply the probabilities. This
multiplication rule is defined symbolically below. Note that “and” corresponds to
multiplication.
Suppose the conditional probability P(A | B) is equal to the unconditional probability P(A).
That is, whether or not B occurs has no effect on the probability of A occurring. This is one
way of defining the notion of independence of two events. Two events A and B are
INDEPENDENT EVENTS iff
P(A | B) = P(A).
By substituting in the definition of conditional probability and rearranging the
terms, we can equivalently state that events A and B are independent if
P(A ∩ B) = P(A)P(B)
Multiplication Rule 1: When two events, A and B, are independent, the probability of
both occurring is:
P(A and B) = P(A) · P(B)
Example: A card is chosen at random from a deck of 52 cards. It is then replaced and a
second card is chosen. What is the probability of choosing a jack and then an eight?
Probabilities:
P(jack) = 4/52
P(8) = 4/52
P(jack and 8) = P(jack) · P(8) = 4/52 · 4/52 = 16/2704 = 1/169
DEPENDENT EVENTS
Two events are dependent if the outcome or occurrence of the first affects the outcome
or occurrence of the second so that the probability is changed.
Analysis: The probability that the first card is a queen is 4 out of 52. However, if the first
card is not replaced, then the second card is chosen from only 51 cards. Accordingly, the
probability that the second card is a jack given that the first card is a queen is 4 out of 51.
Conclusion: The outcome of choosing the first card has affected the outcome of choosing
the second card, making these events dependent.
Now that we have accounted for the fact that there is no replacement, we can find the
probability of the dependent events in the example below by multiplying the probabilities of
each event.
Example: A card is chosen at random from a standard deck of 52 playing cards. Without replacing
it, a second card is chosen. What is the probability that the first card chosen is a queen and the
second card chosen is a jack?
Probabilities:
P(queen on first pick) = 4/52
P(jack on 2nd pick given queen on 1st pick) = 4/51
P(queen and jack) = 4/52 · 4/51 = 16/2652 = 4/663
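Both card examples can be checked numerically. The simulation at the end is an optional sanity check, not part of the original notes; the seed and the number of repetitions are arbitrary choices.

```r
# With replacement: the two draws are independent.
p_jack_then_eight <- (4 / 52) * (4 / 52)
all.equal(p_jack_then_eight, 1 / 169)      # TRUE

# Without replacement: the second probability is conditioned on the first draw.
p_queen_then_jack <- (4 / 52) * (4 / 51)
all.equal(p_queen_then_jack, 4 / 663)      # TRUE

# Monte Carlo check of the second result (ranks only, four of each).
set.seed(1)
deck <- rep(c("A", 2:10, "J", "Q", "K"), times = 4)
hits <- replicate(20000, {
  draw <- sample(deck, 2)                  # sample() draws without replacement
  draw[1] == "Q" && draw[2] == "J"
})
mean(hits)                                 # close to 4/663, about 0.006
```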
RANDOM VARIABLE
A RANDOM VARIABLE (or RV) is a variable that represents all the possible events in some
partition of the sample space. Put another way, an RV has several possible values, with each RV
value being one event in a partition (and where the values cover all events in the partition). We
will write random variables with uppercase letters, and their possible values with lowercase
letters or numbers.
Example 5.1.1. Define a random variable X to represent the outcome of flipping a fair coin,
where this variable can take on two possible values (h or t) representing heads or tails. What
is the distribution over X?
Solution: The distribution over an RV simply tells us the probability of each value, so the
distribution over X is P(X = h) = P(X = t) = 1/2.
We use the notation P(X) as a shorthand meaning “the entire distribution over X”, in
contrast to P(X = x), which means “the probability that X takes the value x”.
Correlation is a bivariate analysis that measures the strength of association between two
variables.
The degree of association is measured by a correlation coefficient, denoted by r. It is
sometimes called Pearson's correlation coefficient after its originator and is a measure of
linear association:
rxy = ∑(xi − x̄)(yi − ȳ) / √( ∑(xi − x̄)² · ∑(yi − ȳ)² )
Where:
•rxy – the correlation coefficient of the linear relationship between the variables x and y
•xi – the values of the x-variable in a sample
•x̄ – the mean of the values of the x-variable
•yi – the values of the y-variable in a sample
•ȳ – the mean of the values of the y-variable
SUBJECT   AGE X   GLUCOSE LEVEL Y   XY     X²     Y²
1         43      99                4257   1849   9801
2         21      65                1365   441    4225
•Σx = 247
•Σy = 486
•Σxy = 20,485
•Σx² = 11,409
•Σy² = 40,022
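Pearson's r can be computed from the sums above using the equivalent computational form r = (nΣxy − ΣxΣy) / √((nΣx² − (Σx)²)(nΣy² − (Σy)²)). The number of subjects n is not stated in the text; n = 6 below is an assumption of this sketch, and the resulting r is only valid under it.

```r
# Sums quoted in the text.
sum_x  <- 247
sum_y  <- 486
sum_xy <- 20485
sum_x2 <- 11409
sum_y2 <- 40022
n <- 6   # ASSUMPTION: the number of subjects is not given in the notes

r <- (n * sum_xy - sum_x * sum_y) /
  sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))
round(r, 4)   # 0.5298 under the assumed n
```

When the raw vectors are available, cor(x, y) returns the same coefficient without computing the sums by hand.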
Skewness
It is the degree of distortion from the symmetrical bell curve or the normal distribution. It
measures the lack of symmetry in data distribution.
It contrasts the extreme values in one tail with those in the other. A symmetrical
distribution has a skewness of 0.
• If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
Coefficient of Skewness
Kurtosis
Mesokurtic (Kurtosis = 3):
This distribution has a kurtosis statistic similar to that of the normal distribution, meaning
that its extreme values behave like those of a normal distribution. With the definition used
here, the standard normal distribution has a kurtosis of three.
Leptokurtic (Kurtosis > 3): The distribution has longer, fatter tails. The peak is higher and
sharper than in the mesokurtic case, which means that the data are heavy-tailed or have a
profusion of outliers. Outliers stretch the horizontal axis of the histogram, which makes the
bulk of the data appear in a narrow (“skinny”) vertical range, thereby giving the “skinniness”
of a leptokurtic distribution.
Platykurtic (Kurtosis < 3): The distribution has shorter, thinner tails than the normal
distribution. The peak is lower and broader than in the mesokurtic case, which means that the
data are light-tailed or lack outliers. This is because the extreme values are less extreme
than those of the normal distribution.
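Base R has no built-in skewness or kurtosis function, so the sketch below defines moment-based helpers (central_moment(), skewness() and kurtosis() are names chosen here). They use the population moments, under which a normal distribution has kurtosis 3; other packages may apply small-sample corrections and give slightly different numbers.

```r
# Hypothetical helpers based on central moments.
central_moment <- function(x, k) mean((x - mean(x))^k)
skewness <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
kurtosis <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

symmetric <- c(1, 2, 3, 4, 5)
skewness(symmetric)          # 0: perfectly symmetric
kurtosis(symmetric)          # 1.7: below 3, so platykurtic

right_skewed <- c(1, 1, 2, 2, 3, 10)
skewness(right_skewed) > 0   # TRUE: the long tail is on the right
```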
Bar Chart
A bar chart is a graph with rectangular bars. The graph usually compares different
categories. Although the graphs can be plotted vertically (bars standing up) or horizontally
(bars lying flat from left to right), the most usual type of bar graph is vertical.
The horizontal (x) axis represents the categories; the vertical (y) axis represents a value for
those categories. In the graph below, the values are percentages.
Histogram
A histogram is a specific type of bar chart, where the categories are
ranges of numbers. Histograms therefore show grouped continuous data.
Note: Plot discrete data on a bar chart, and plot continuous data on a histogram.
Pie Charts
A pie chart looks like a circle (or a pie) cut up into segments. Pie charts are used to show
how the whole breaks down into parts.
Pie charts show percentages of a whole - your total is therefore 100% and the segments of
the pie chart are proportionally sized to represent the percentage of the total.
Box Plot
Boxplots are a standardized way of displaying the distribution of data based on a
five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and
“maximum”).
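Each of the four chart types has a base R function. The sketch below uses made-up data: counts is a hypothetical set of category counts and values a hypothetical continuous sample.

```r
counts <- c(Academic = 12, Finance = 7, Energy = 5)    # hypothetical categories
values <- c(21, 35, 46, 46, 50, 63, 67, 77, 88, 92)    # hypothetical continuous data

barplot(counts, ylab = "Count")            # bar chart: discrete categories
h <- hist(values, main = "Histogram")      # histogram: data grouped into ranges
pie(counts)                                # pie chart: parts of a whole
b <- boxplot(values, horizontal = TRUE)    # box plot: five-number summary

fivenum(values)   # minimum, lower hinge, median, upper hinge, maximum
```

hist() and boxplot() also return the numbers behind the picture (bin counts in h$counts, box statistics in b$stats), which is handy for checking a plot against the data.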