Data Analytics Using R-Programming Notes
Introduction to R-Programming
Everything in R is treated as a vector; even a single number is a vector of length 1.
Ctrl + L: Clears the console screen in R.
1. Greet the world using “Hello World”
a. Command: > print("Hello World")
2. Add 2 to each element of vector x
a.
3. Add the values 6:10
a.
4. x <- 100:105
a. Then do x + 1:3
b.
5. Generate the sum of the first 10 natural numbers
a.
b. This is a case of compute, store and return.
6. Generate the sum of the squares of the first 10 natural numbers
a. sum((1:10)^2)
7. Construct a vector with elements 1,2,3,4,5
a. > c(1:5)
b. c() combines its arguments into a vector. Since 1:5 is already a vector, c(1, 2, 3, 4, 5) and 1:5 give the same result.
8. Display First 10 natural numbers, their squares and cubes
a. a <- 1:10
b. b <- a^2
c. c <- a^3
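d. print(cbind(a, b, c))   (print() displays only its first argument, so the three vectors are first bound into columns)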
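The exercises above can be sketched as follows. This is only one possible set of answers: the starting vector x in exercise 2 is an assumed example, since the notes do not define it.

```r
# 1. Greet the world
print("Hello World")

# 2. Add 2 to each element of a vector (x is an assumed example vector)
x <- c(1, 2, 3)
x + 2

# 3. Add the values 6:10
sum(6:10)

# 4. The shorter vector 1:3 is recycled along x
x <- 100:105
x + 1:3

# 5. Compute, store and return the sum of the first 10 natural numbers
s <- sum(1:10)
s

# 6. Sum of the squares of the first 10 natural numbers
sum((1:10)^2)

# 8. First 10 natural numbers with their squares and cubes, side by side
a <- 1:10
tab <- cbind(number = a, square = a^2, cube = a^3)
print(tab)
```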
Abhishek Bawa Data Analytics using R-Programming KJ SIMSR 2018-20
paste() is not a display function: it returns a character string. At the console the result is
still shown, because R automatically prints the value of any expression typed there.
print() is the function that actually displays a value.
paste() is used for combining values, and each of the values is converted to a string.
Anything written inside "" is taken as a string.
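A small sketch of the difference between paste() and print(); the variable names are illustrative only.

```r
# paste() builds a string; the number 15 is converted to the string "15"
label <- paste("Total", 10 + 5)
class(label)

# print() is what displays the value
print(label)

# sep controls what is placed between the pieces
joined <- paste("R", "Programming", sep = "-")
print(joined)
```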
Each vector has an order.
Vectors have the following:
1. Vectors have an index
2. Vectors have an order
m = c(11, 13, 12, 9, 17)
order(m): Returns the indices of m that would arrange its values in ascending order.
sort(m): Returns the values of m sorted in ascending order.
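A short sketch using the vector m from above, showing how order() and sort() relate:

```r
m <- c(11, 13, 12, 9, 17)

order(m)      # 4 1 3 2 5 : the smallest value (9) sits at index 4
sort(m)       # 9 11 12 13 17

# Indexing m by order(m) gives the same result as sort(m)
m[order(m)]
```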
When New Script is selected, the script editor opens in front and the console moves to the background.
Any programming language is made up of keywords. Keywords are the commands of the language.
A command together with its parameters makes one instruction.
In R, the console takes one instruction at a time and executes it. A set of instructions is referred
to as a program; in R it is known as a script.
Once the script is saved with a name, it persists until it is deleted.
To execute a saved script, use the source() function.
objects(): This function lists all the objects created during the session.
To check all the packages installed in R, use the function: installed.packages()
R provides built-in datasets that can be used for analysis.
x <- 1:12
y <- c(x, x^2, x^3)
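cbind(y): cbind() binds vectors together as columns; with the single vector y it shows y as one column.
dim(y) <- c(12,3): Reshapes y into a matrix of 12 rows and 3 columns (x, x^2 and x^3), filled column by column.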
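The steps above can be sketched as follows; the matrix z is added here only to show that cbind() on the three vectors gives the same layout.

```r
x <- 1:12
y <- c(x, x^2, x^3)   # one long vector of 36 values
length(y)

dim(y) <- c(12, 3)    # y becomes a 12 x 3 matrix, filled column by column
y[12, ]               # last row: 12 144 1728

# Binding the three vectors as columns gives the same values
z <- cbind(x, x^2, x^3)
```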
mode() returns the data type (storage mode) of a variable, e.g. "numeric" or "character". It does
not compute the statistical mode (the most frequent value).
A powerful illustration of paste(): when A has several items and B has only a single item, B is recycled and pasted to every item of A.
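A minimal sketch of both points; the contents of A and B are assumed examples.

```r
A <- c("Jan", "Feb", "Mar")
B <- "2017"            # a single value

# B is recycled and pasted to every element of A
paste(A, B)

# mode() reports the data type, not the most frequent value
mode(A)
mode(1:3)
```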
Function arguments:
There are two types of arguments in a function:
Arguments with a default value
Arguments without a default value
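A small sketch of the two kinds of argument. The function add_tax and its argument names are hypothetical and used only for illustration.

```r
# 'amount' has no default and must be supplied;
# 'rate' has a default value and may be omitted
add_tax <- function(amount, rate = 0.18) {
  amount * (1 + rate)
}

add_tax(100)               # uses the default rate
add_tax(100, rate = 0.05)  # overrides the default
```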
Whenever print() is given a numeric argument, the number is displayed with 7 significant digits by
default, i.e. if x is 11/7 it will be displayed as 1.571429
To specify the number of significant digits to be displayed, use the digits argument, e.g. print(x, digits = 3).
If the number is too large or too small, then it is displayed in scientific notation using e (10 to
the power of).
For example, 0.0000062
will be displayed as 6.2e-06
By default, R displays seven significant digits for a numeric value.
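A short sketch of the display rules above, assuming R's default options are in effect:

```r
x <- 11 / 7

print(x)                # 7 significant digits by default
print(x, digits = 3)    # 3 significant digits
format(x, digits = 3)   # same, but returned as a string
round(x, 2)             # rounds to 2 decimal places

print(0.0000062)        # shown in scientific notation
getOption("digits")     # the current default number of significant digits
```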
There is another option to print known as cat(). Whenever a print() command is given, the output
ends with a new line, so the next output starts on a fresh line. In the case of cat(), no new line is
added unless "\n" is included, so the next output continues on the same line.
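A minimal sketch contrasting the two functions:

```r
print("Hello")            # output ends with a new line

cat("Hello")              # no new line is added ...
cat(" World\n")           # ... so this continues on the same line

# cat() can also mix text and computed values
cat("Sum of 1 to 10 is", sum(1:10), "\n")
```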
Operators
Complex numbers
y <- 3 + 2i
Re(y) gives the real part of a number.
Im(y) gives the imaginary part of a number.
Mathematical Functions
Sr. No   Mathematical Function   Description of the function
1.       abs()                   To get the absolute value of a number
2.       log()                   To get the natural log (base e) of a number
3.       sqrt()                  To get the square root of a number
4.       exp()                   To get e raised to a power, e.g. exp(3) gives e^3
5.       choose(n, r)            To get the number of combinations, i.e. nCr
6.       floor(n)                To get the largest integer not greater than n
7.       ceiling(n)              To get the smallest integer not less than n
8.       round(n, 6)             To round n to 6 decimal places
9.       trunc(n)                To drop the decimal part of n
10.      str(v)                  To display the structure of v
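The functions in the table can be tried out as below; the input values are arbitrary examples.

```r
abs(-7.5)        # 7.5
log(exp(1))      # 1, since log() is the natural log
sqrt(49)         # 7
exp(3)           # e^3
choose(5, 2)     # 10 ways to choose 2 items out of 5
floor(2.7)       # 2
ceiling(2.2)     # 3
round(11 / 7, 2) # 1.57
trunc(-2.7)      # -2 (the decimal part is dropped)
str(1:5)         # structure of the vector 1:5
```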
Trigonometric Functions
The trigonometric functions take angles in radians. cos(pi/4) is the cosine of 180°/4 = 45°, which is 0.707
Types of vectors:
Numeric vectors
Integer vectors
Logical vectors
Character vectors
DateTime vectors
Factors
Replicate/ Repeat
Index of an element
Generate the natural numbers between 1 and 30 in reverse order. After that display elements at index 11,
17 and 21.
A continuous range is allowed in the index. For a discrete set of positions, combine the numbers into a
vector with c() and then use it in the index position. The same approach extracts the last 10 elements of the vector.
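A sketch of the indexing exercise described above:

```r
v <- 30:1            # natural numbers 1 to 30 in reverse order

v[c(11, 17, 21)]     # discrete positions: 20 14 10
v[5:8]               # continuous range: 26 25 24 23

# Last 10 elements of the vector, in two equivalent ways
n <- length(v)
v[(n - 9):n]
tail(v, 10)
```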
Or
> i <- which(a>b)
> paste(i, a[i], b[i])
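A runnable sketch of the which() approach. The vectors a and b are hypothetical, since the notes do not show the original values.

```r
a <- c(5, 2, 9, 4)   # hypothetical values
b <- c(3, 6, 1, 4)   # hypothetical values

i <- which(a > b)    # positions where a exceeds b: 1 3
paste(i, a[i], b[i]) # index, value in a, value in b
```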
Class Exercises
A1 – Dealing with missing values
A student can attend any 4 out of 5 tests, i.e. any one of the five tests may be skipped. Scores of
a student are: (27, 32, 45, NA, 39)
Any construct requires:
1. Input:
a. Through console
i. Verification and Validation happens
b. From a data file
c. From unstructured sources
2. Processing
a. Compute
b. Branching
c. Looping
3. Output
a. Format
Algorithm:
1. Input: Scores in 5 tests
2. Process:
a. Determine index of NA (Not available)
b. Determine valid test scores
c. Compute internal marks
3. Output: Provide the internal marks
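The algorithm above can be sketched as follows. The notes do not state how the internal marks are computed, so the average of the valid scores is used here purely as an assumption.

```r
# Input: scores in 5 tests
scores <- c(27, 32, 45, NA, 39)

# Process
na_index <- which(is.na(scores))   # index of the missed test
valid    <- scores[!is.na(scores)] # valid test scores
internal <- mean(valid)            # ASSUMPTION: internal marks = average of valid scores

# Output
cat("Missed test:", na_index, "\n")
cat("Internal marks:", internal, "\n")
```

mean(scores, na.rm = TRUE) gives the same average in one step.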
A4 – Solve
Consider vector j having elements 11:16. Multiply the elements of the vector alternately by 2
and 3.
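One way to solve A4 is to rely on recycling, or to build the multiplier explicitly with rep(); both are shown as a sketch.

```r
j <- 11:16

# c(2, 3) is recycled along j: 11*2, 12*3, 13*2, 14*3, ...
j * c(2, 3)

# The same multiplier written out with rep()
j * rep(c(2, 3), times = 3)
```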
Text Manipulations
However,
> nm <- c("Adam", "Smith")
> length(nm)
This function gives the number of elements (here, words) in nm, which is 2.
length() gives the number of elements. nchar() gives the number of characters in each element.
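A short sketch of the difference between length() and nchar():

```r
nm <- c("Adam", "Smith")

length(nm)        # 2 elements
nchar(nm)         # characters per element: 4 5

full <- "Adam Smith"
length(full)      # 1, because it is a single string
nchar(full)       # 10, the space is counted too
```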
islands is a built-in dataset available in R.
Exercise 5
Upper-case and lower-case alphabets
print(letters)
print(LETTERS)
str(islands)
names(islands)
str(islands) shows that islands is a named numeric vector; the names are stored in the attribute attr(*, "names")
Atomic value is a value which cannot be further split.
Extract the names of the first 8 islands:
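A sketch of the extraction, using the built-in islands dataset:

```r
# names() returns all island names; index the first 8
first8 <- names(islands)[1:8]
print(first8)

# head() gives the same result
head(names(islands), 8)
```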
strsplit() splits a string at a given separator; splitting on a space gives as many words as there are in the string. The result is returned as a list.
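A minimal sketch with an assumed example sentence:

```r
sentence <- "Data Analytics using R"

# strsplit() returns a list; [[1]] extracts the vector of words
words <- strsplit(sentence, " ")[[1]]
words
length(words)   # number of words
```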
Data Frame
A data frame looks the same as an Excel sheet: rows of observations and named columns.
An Excel sheet can be brought into R in the form of a data frame. A data frame is rectangular:
every column has the same number of rows. It is a 2D data structure.
In a list, different data types can be combined.
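A small sketch of a data frame and a list; the names and values are made up for illustration.

```r
# A data frame: columns of equal length, each column with its own type
students <- data.frame(
  name  = c("Asha", "Ravi", "Meera"),
  score = c(27, 32, 45),
  stringsAsFactors = FALSE
)
dim(students)   # 3 rows, 2 columns
str(students)

# A list can combine elements of different types and lengths
record <- list(name = "Asha", scores = c(27, 32, 45), passed = TRUE)
str(record)
```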
Other functions:
substr() extracts the specified characters from a string.
grep() returns the indices of the elements of a character vector that match a pattern
o grep() is a search function, especially useful in large databases.
gsub() replaces all matches of a pattern in a string
substr() can also be applied to all the names displayed by the head() function at once.
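A sketch of the three functions applied to the first island names. The positions and patterns used here are arbitrary choices.

```r
first6 <- head(names(islands))   # first 6 island names

substr(first6, 1, 3)             # first three characters of every name
grep("A", first6)                # indices of the names containing "A"
gsub("a", "*", first6)           # replace every "a" with "*"
```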
Benford’s Law
The median is technically the 50% trimmed mean, i.e. trimming 50% removes the half of the
observations above the middle of x and the half of the observations below the middle of x.
When one trims 50% of the values from both the ends, only the middle remains, so mean = median.
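A short sketch with an assumed vector containing one extreme value:

```r
x <- c(2, 4, 7, 9, 100)

mean(x)               # 24.4, pulled up by the extreme value
mean(x, trim = 0.2)   # drops one value from each end before averaging
mean(x, trim = 0.5)   # trimming 50% leaves the median
median(x)             # 7
```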
Year   Invs A   Invs B
1      12%      50%
2      -3%      -40%
3      8%       30%
4      15%      70%
5      0%       10%
6      4%       -50%
Geometric mean applies only to positive numbers. In order to work with negative returns, transform the
data set first, e.g. convert each percentage return r into a growth factor 1 + r.
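A sketch of the geometric mean for the two investments in the table. Converting returns to growth factors (1 + r) is one common transformation and is an assumption here; the helper geo_mean_return is hypothetical.

```r
inv_a <- c(12, -3, 8, 15, 0, 4) / 100      # returns of Invs A
inv_b <- c(50, -40, 30, 70, 10, -50) / 100 # returns of Invs B

# Geometric mean return: n-th root of the product of growth factors, minus 1
geo_mean_return <- function(r) {
  prod(1 + r)^(1 / length(r)) - 1
}

mean(inv_a); geo_mean_return(inv_a)   # arithmetic vs geometric, Invs A
mean(inv_b); geo_mean_return(inv_b)   # arithmetic vs geometric, Invs B
```

For volatile returns such as Invs B, the geometric mean is well below the arithmetic mean.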
quantile() is the function for finding the 1st quartile and the subsequent quartiles.
The standard deviation is bigger when the values are more spread out around the mean.
For computing the Z-score, the function is scale().
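A minimal sketch of the three functions on an assumed vector:

```r
x <- c(10, 15, 15, 17, 18, 21)

quantile(x)          # minimum, quartiles and maximum
quantile(x, 0.25)    # 1st quartile only
sd(x)                # standard deviation

z <- scale(x)        # z-scores: (x - mean(x)) / sd(x)
as.numeric(z)
```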
MAD: Exercise
The likes received by MAD are: (10, 15, 15, 17, 18, 21)
To find out how far, on average, each number of likes is from the mean, one would use the Mean Absolute Deviation.
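The exercise can be sketched as below. Note that R's built-in mad() computes the median absolute deviation, which is a different measure, so the mean absolute deviation is computed by hand here.

```r
likes <- c(10, 15, 15, 17, 18, 21)

m <- mean(likes)                 # 16
deviations <- abs(likes - m)     # 6 1 1 1 2 5
mean_abs_dev <- mean(deviations) # 16 / 6
mean_abs_dev
```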
In the above frequency table, the waiting time of 78 mins between two eruptions had the largest count.
The difference between the above 2 images is that in the second picture the dataset is divided
into bins, so a range of numbers is shown instead of each individual value. For obtaining
bins, the cut() function must be used. This is helpful especially when dealing with larger
datasets.
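A sketch of binning with cut(), assuming the eruption data referred to above is the built-in faithful dataset. The bin width of 10 minutes is an arbitrary choice.

```r
waiting <- faithful$waiting

# One count per distinct waiting time
table(waiting)

# Counts per bin: (40,50], (50,60], ..., (90,100]
bins <- cut(waiting, breaks = seq(40, 100, by = 10))
table(bins)
```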
Table: Illustrations
For the above pie-chart, the color should match with the colors mentioned.
In order to change the color of each slice of the pie-chart, pass the color names or color codes
through the col argument of pie().
Bar Plots
In this bar plot, 0 to 15 is the default axis range; it can be changed with the ylim argument.
abline(h = mean(sample_data))
abline(h = ...) draws a horizontal line at the given value, here the mean of sample_data.
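A sketch of a bar plot with a mean line. The contents of sample_data are hypothetical, and the plot is written to a temporary PDF file only so that the example runs without an open graphics window.

```r
sample_data <- c(10, 15, 15, 17, 18, 21)   # hypothetical values

out_file <- file.path(tempdir(), "barplot_mean.pdf")
pdf(out_file)                              # open a PDF graphics device

barplot(sample_data,
        names.arg = seq_along(sample_data),
        ylim = c(0, 25))                   # override the default axis range
abline(h = mean(sample_data), lty = 2)     # dashed horizontal line at the mean

dev.off()                                  # close the device and write the file
```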
              Business      Credit   Exception   Process   Trend
              Development   Rating   Handling    Mapping   Analysis
PG Finance    2             9        3           11        9
MMS Finance   4             8        7           3         12
PG FS         5             2        8           10        11
Boxplots Illustration – 5
Histogram
Line Graphs
plot() gives a scatter plot by default; use type = "l" to draw a line graph.
To plot the Year on the x-axis, put Year before Value in plot(), i.e. plot(Year, Value).
Density
plot(trees)
contour(volcano)
persp(volcano)
image(volcano)
Creating Factors
By typing, we get:
Research
1. Qualitative
2. Quantitative
a. Experimental (Cause and effect)
i. Between subject design: Different participants are randomly assigned to
different conditions.
ii. Within subject design: Same participants are randomly allocated to more
than one condition – referred also to as repeated measures design.
iii. Mixed design experiment
b. Non-experimental
i. Commonly used in sociology, political science and management
disciplines.
ii. Research is often done through surveys
iii. No random assignment of participants to a particular group
iv. Two approaches to analyzing such data are:
1. Testing for significant differences across the groups
a. E.g. IQ levels of participants from different ethnic
backgrounds
2. Testing for significant association between two factors
a. E.g. firms’ sales and advertising expenditure
Qualitative Research
Involves collecting qualitative data by means of interviews, observations, field notes,
open-ended questions etc.
Researcher is the primary data collection instrument.
Data could be in the form of words, images, patterns etc.
Two types of data used are:
o Primary data: Data collected directly from the subjects of study
o Secondary data: also known as archival data
Type of variables
1. Qualitative variables
a. Differ in kind rather than degree
2. Quantitative variables
a. Differ in degree rather than kind
Assessing validity
Aims at determining how accurate is the relationship between the measure and the underlying
trait it is trying to measure.
Epsilon: Inherent error in every individual, human being and task.
Types of error
1. Type I error
a. While testing a hypothesis, the null hypothesis gets rejected when it is actually true.
2. Type II error
a. While testing a hypothesis, the null hypothesis is not rejected when it is actually false.
Stats App 2
Coefficient of variation = $\frac{\text{Standard Deviation}}{\text{Mean}}$
y -> Dependent variable
x -> Independent variable
Probability Distribution
Normal distribution
Normal distribution is continuous with range -infinity to infinity.
Strip chart
RColorBrewer package
Analysis using R
Data analysis is all about searching for patterns. Research in behavioral sciences:
1. Qualitative research
2. Quantitative research
a. Data in the form of variables, establishing statistical relationship
b. Results are then generalizable to the entire population
c. There are two types of this research:
i. Experimental Research
1. In between-subject design: Different participants are randomly
assigned to different conditions
2. In within-subject design
ii. Non-experimental Research
1. Testing for significant differences
2. Testing for significant associations
Hypothesis testing: Options are mutually exclusive and exhaustive
Null hypothesis (H0)
Alternative hypothesis (H1)
hist(sales_data$Age)
> boxplot(sales_data$Age)
boxplot(sales_data)
Skewness
In a skewed distribution, one tail is longer than the other. Real data rarely has a skewness of
exactly 0. The skewness value can be positive (longer right tail) or negative (longer left tail).
Kurtosis is used for cluster analysis. The coefficient of kurtosis measures the peakedness of a
distribution. High kurtosis means that values close to the mean are relatively more frequent and
extreme values (very far from the mean) are also relatively more frequent.
Z-tests and t-tests
Common rationale but different assumptions
For z-tests, the population mean and standard deviation should be known exactly.
If the sample is small, or the population standard deviation is unknown, then use a t-test.
From the above output, the null hypothesis is rejected in favour of the alternative hypothesis: the true mean is not
equal to 750.
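A sketch of a one-sample t-test against a mean of 750. The sample values are hypothetical, since the original data is not shown in the notes.

```r
# Hypothetical sample; the original data is not available
x <- c(762, 771, 758, 769, 775, 764, 770, 766)

# H0: true mean = 750, H1: true mean != 750
res <- t.test(x, mu = 750)
print(res)

# A p-value below 0.05 leads to rejecting H0 at the 5% level
res$p.value < 0.05
```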
Analysis Case Exercise - 3
Compare the efficiency of the workers of two mines – privately owned (mine 1) and government
owned (mine 2)
What is to be rejected must be a part of the Null hypothesis (H0).
H0: There is no difference between the efficiency of the workers of mine 1
and the efficiency of the workers of mine 2.
H1: There is a difference between the efficiency of the workers of mine 1 and
the efficiency of the workers of mine 2.
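A sketch of the comparison as a two-sample t-test. The efficiency figures for both mines are hypothetical, and the default Welch test (unequal variances) is an assumption about the intended method.

```r
# Hypothetical efficiency scores; the case data is not shown in the notes
mine1 <- c(48, 52, 55, 51, 49, 53, 50, 54)   # privately owned
mine2 <- c(45, 47, 50, 44, 46, 48, 43, 49)   # government owned

# H0: no difference in mean efficiency, H1: there is a difference
res <- t.test(mine1, mine2)
print(res)

res$p.value   # compare with the chosen significance level, e.g. 0.05
```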
Regression Analysis
H0: There exists no relationship between number of hrs. of study and freshman score.
H1: There exists a relationship between number of hrs. of study and freshman score.
Mean as a measure of central tendency doesn’t fall into the scheme of things. Hence linear
regression must be done here for analysis.
In the above equation there is a deviation between the estimated value (y) and the actual value
(fscr).
The residual value (diff) = Actual Value (fscr) – Predicted Value (y)
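A sketch of the residual calculation. The hours and fscr values are hypothetical, since the original dataset is not shown.

```r
# Hypothetical data: hours of study and freshman score
hours <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
fscr  <- c(52, 55, 61, 60, 68, 70, 75, 74, 82, 85)

fit <- lm(fscr ~ hours)

y        <- predict(fit)   # predicted values
residual <- fscr - y       # actual value - predicted value

# resid() returns the same differences directly
cbind(actual = fscr, predicted = y, residual = residual)
```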
In the above picture, Pr(>|t|) is the p-value: the probability of observing a t-value at least as large
in absolute value as the one computed, if the true coefficient were zero. A coefficient marked with
‘***’ is highly significant (p < 0.001). A coefficient marked with ‘**’ is significant at p < 0.01.
Every activity carries some inherent error, known as the epsilon error.
t-value = $\frac{\text{Coefficient Estimate}}{\text{Standard error}}$
The further the t-value is from zero, the stronger the evidence for rejecting the null hypothesis.
Degrees of freedom = number of rows in the dataset - number of estimated coefficients (intercept and slope). For a simple regression on a two-column dataset this equals rows minus columns.
In this example, the degrees of freedom are 8.
Residual standard error (RSE): It is a measure of the quality of a linear regression fit.
Every linear model is assumed to contain an error term, E, which prevents exact prediction of the
response values.
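These quantities can be read off a fitted model as sketched below. The data is hypothetical; it has 10 rows so that the residual degrees of freedom come out as 10 - 2 = 8.

```r
# Hypothetical data with 10 observations
hours <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
fscr  <- c(52, 55, 61, 60, 68, 70, 75, 74, 82, 85)

fit   <- lm(fscr ~ hours)
coefs <- summary(fit)$coefficients

# t-value = coefficient estimate / standard error
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]
cbind(t_manual, t_reported = coefs[, "t value"])

df.residual(fit)      # 10 observations - 2 coefficients = 8
summary(fit)$sigma    # residual standard error
```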
R Analytics
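$$F = \frac{R^2}{1 - R^2} \cdot \frac{n - k}{k - 1}$$
where n is the number of observations and k is the number of estimated coefficients, including the intercept.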
The larger the number of factors, the more important it is to understand the statistics.
Two most important functions to remember: predict() and resid().
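A sketch that checks the F formula against the value reported by summary() and shows predict() and resid() together. The data is hypothetical.

```r
# Hypothetical data
hours <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
fscr  <- c(52, 55, 61, 60, 68, 70, 75, 74, 82, 85)

fit <- lm(fscr ~ hours)
s   <- summary(fit)

r2 <- s$r.squared
n  <- length(fscr)   # number of observations
k  <- 2              # estimated coefficients: intercept and slope

f_manual <- (r2 / (1 - r2)) * (n - k) / (k - 1)
f_manual
s$fstatistic["value"]   # F-statistic reported by R

# predicted value + residual gives back the actual value
predict(fit) + resid(fit)
```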
To predict weights for new observations, the corresponding heights must be put inside a, which is then passed to predict().
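A sketch of prediction for new heights. The built-in women dataset (height and weight) is used here as a stand-in, since the notes do not show the original data; the heights in a are arbitrary.

```r
# Stand-in data: built-in 'women' dataset with columns height and weight
fit <- lm(weight ~ height, data = women)

# New heights go into a data frame whose column name matches the model
a <- data.frame(height = c(61, 66))

pred <- predict(fit, newdata = a)
pred
```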
y = 0.006757 * t + 2.306306
y = intercept + (coefficient of t) * t + (coefficient of t^2) * t^2
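A sketch of fitting a quadratic trend. The data is generated from a known quadratic (an assumption made only so that the recovered coefficients can be checked).

```r
t <- 1:10
y <- 1 + 2 * t + 3 * t^2      # hypothetical data: an exact quadratic

# I() protects t^2 so that it is treated as arithmetic inside the formula
fit <- lm(y ~ t + I(t^2))
coef(fit)                     # intercept, coefficient of t, coefficient of t^2
```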
Multiple Regression
Total delivery time depends upon the total distance travelled in miles as well as the number of deliveries.
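A sketch of a multiple regression with two predictors. The data frame delivery_data and its values are hypothetical.

```r
# Hypothetical delivery records
delivery_data <- data.frame(
  miles      = c(89, 66, 78, 111, 44, 77, 80, 66, 109, 76),
  deliveries = c(4, 1, 3, 6, 1, 3, 3, 2, 5, 3),
  time       = c(7.0, 5.4, 6.6, 7.4, 4.8, 6.4, 7.0, 5.6, 7.3, 6.4)
)

# Delivery time explained by both miles and number of deliveries
fit <- lm(time ~ miles + deliveries, data = delivery_data)
summary(fit)
coef(fit)
```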
Logistic Regression
Logistic regression is a predictive modelling algorithm that is used when the Y variable is binary
categorical. The goal is to determine a mathematical equation that can be used to predict the
probability of event 1. Once the equation is established, it can be used to predict the Y when only
the X’s are known. Logistic regression is used for:
1. Spam detection
2. Credit card fraud
3. Health
4. Marketing
5. Banking
From the above data, check whether any value is missing (NA). Use the function is.na()
To check whether the admits are distributed well enough in each category of rank, cross-tabulate admit against rank.
Convert the rank variable from integer to factor, using the function as.factor(rank)
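The steps above can be sketched as follows. The original admissions data is not shown, so a simulated data frame mydata with columns admit, gre, gpa and rank is used; the column names other than admit and rank, and all the values, are assumptions.

```r
# Simulated stand-in for the admissions data
set.seed(42)
n <- 200
mydata <- data.frame(
  gre  = round(rnorm(n, 580, 110)),
  gpa  = round(runif(n, 2.3, 4), 2),
  rank = sample(1:4, n, replace = TRUE)
)
p <- plogis(-3.5 + 0.003 * mydata$gre + 0.6 * mydata$gpa - 0.5 * mydata$rank)
mydata$admit <- rbinom(n, 1, p)

sum(is.na(mydata))                       # number of missing values
xtabs(~ admit + rank, data = mydata)     # admits in each category of rank
mydata$rank <- as.factor(mydata$rank)    # integer -> factor

# Logistic regression: binary Y, so family = binomial
fit  <- glm(admit ~ gre + gpa + rank, data = mydata, family = binomial)
prob <- predict(fit, type = "response")  # predicted probability of admit = 1
head(prob)
```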
Decision tree
The function used is ctree()
library("party")
Time series
Time series analysis can be used, for example, for continuous monitoring of a person’s heart.
The four types of variations are:
Trend variations
Seasonal variations
Cyclical variations
Irregular (random) variations
Create a vector to contain prices of a product in a year from Jan’17 to Dec’17. Create a time
series data of 12 months.
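A sketch of the exercise using ts(). The price values are hypothetical.

```r
# Hypothetical monthly prices, Jan'17 to Dec'17
prices <- c(120, 122, 125, 123, 128, 130, 131, 129, 134, 136, 138, 140)

# frequency = 12 marks the data as monthly; start = c(year, month)
price_ts <- ts(prices, start = c(2017, 1), frequency = 12)
print(price_ts)

# Extract the last quarter, Oct'17 to Dec'17
last_quarter <- window(price_ts, start = c(2017, 10))
last_quarter
```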