Data Science Using R Programming: Data Science Using R, Units 1-5
Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to
analyze data and to extract knowledge and insights from it.
➡Data Science is about data gathering, analysis and decision-making.
➡Is about finding patterns in data, through analysis, and making future predictions.
●For route planning: To discover the best routes to ship ●To predict who will win elections
●To foresee delays for flight/ship/train etc. (through predictive analysis) ●To create promotional offers
●To find the best suited time to deliver goods ●To forecast the next year's revenue for a company
➡R is one of the programming languages that provide an intensive environment for you to research, process,
transform, and visualize information.
Nullity: Number of vectors present in the null space of a given matrix. In other words, the dimension of the null
space of the matrix A is called the nullity of A. The number of linear relations among the attributes is given by the
size of the null space. The null space vectors B can be used to identify these linear relationships.
Rank Nullity Theorem: The rank-nullity theorem helps us to relate the nullity of the data matrix to the rank and the
number of attributes in the data. The rank-nullity theorem is given by –
Nullity of A + Rank of A = Total number of attributes of A (i.e. total number of columns in A)
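As a quick check of the theorem in R (a minimal sketch; the 3-column matrix below is made up, with its third column equal to the sum of the first two so that one linear relation exists):
# Matrix with 3 attributes (columns); the 3rd column = col1 + col2,
# so there is one linear relation among the attributes
A <- matrix(c(1, 2, 3,
              4, 5, 6,
              5, 7, 9), nrow = 3)
rank_A    <- qr(A)$rank        # 2
nullity_A <- ncol(A) - rank_A  # 1
rank_A + nullity_A             # 3 = total number of attributes (columns)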
Linear Algebra: Is a branch of mathematics that is extremely useful in data science and machine learning. Linear
algebra is the most important math skill in machine learning. Most machine learning models can be expressed in
matrix form. A dataset itself is often represented as a matrix. Linear algebra is used in data preprocessing, data
transformation, and model evaluation. —————————————————————————————————
Y=matrix(c(1,2,3,4),nrow=2,ncol=2)
Y
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
————————————————
X=matrix(c(1,2,3,4),nrow=2,ncol=2,byrow=TRUE)
X
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
———————————————
A <- matrix(c(10, 8, 5, 12), ncol = 2, byrow = TRUE)
A
# [,1] [,2]
# [1,] 10 8
# [2,] 5 12
————————————————
B <- matrix(c(5, 3, 15, 6), ncol = 2, byrow = TRUE)
B
# [,1] [,2]
# [1,] 5 3
# [2,] 15 6
———————————————
Rank
qr(A)$rank # 2
qr(B)$rank # 2
# Equivalent to:
library(Matrix)
rankMatrix(A)[1] # 2
● The standard multiplication symbol ‘*’ will unfortunately provide unexpected results if you are looking for matrix multiplication: ‘*’ multiplies matrices element-wise. To do matrix multiplication, the operator is ‘%*%’.
X*X
## [,1] [,2]
## [1,] 1 4
## [2,] 9 16
X%*%X
## [,1] [,2]
## [1,] 7 10
## [2,] 15 22
● To transpose a matrix or a vector X, use the function t(X).
t(X)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
● To invert a matrix, use the solve() command.
Xinv=solve(X)
X%*%Xinv
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
● To determine the size of a matrix, use the dim() function. The result is a vector with two values: dim(x)[1] provides
the number of rows and dim(x)[2] provides the number of columns. We can label rows/columns of a matrix using the
rownames() or colnames() functions.
dim(A)
## [1] 2 2
nrows=dim(A)[1]
ncols=dim(A)[2]
★The vectors that only get scaled and not rotated by a transformation are called eigenvectors. Eigenvectors are the vectors which
●only get scaled ●or do not change at all
★The factor by which they get scaled is the corresponding eigenvalue.
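A minimal sketch of this in R using the base eigen() function (the 2x2 matrix below is an arbitrary example):
A <- matrix(c(2, 1, 1, 2), nrow = 2)  # symmetric 2x2 example matrix
e <- eigen(A)
e$values        # eigenvalues: 3 and 1
e$vectors       # eigenvectors stored as columns
v      <- e$vectors[, 1]
lambda <- e$values[1]
A %*% v         # equals lambda * v: the vector is only scaled, not rotated
lambda * v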
Hyperplane
★A hyperplane is a geometric entity whose dimension is one less than that of its ambient space.
★For example, in 3D space the hyperplane is a 2D plane, and in 2D space the hyperplane is a 1D line, and so on.
★A hyperplane can be defined by the equation Xᵀn + b = 0
The above equation can be expanded for n dimensions:
x1n1 + x2n2 + x3n3 + ... + xnnn + b = 0
For 2 dimensions the equation is
x1n1 + x2n2 + b = 0
A hyperplane passing through the origin (b = 0) has the simpler form Xᵀn = 0
Example:
Let us consider a 2D geometry with
n = [1, 3]ᵀ and b = 4
Though it is a 2D geometry, the value of X will be the column vector
X = [x1, x2]ᵀ
So, according to the equation of the hyperplane, it can be solved as
Xᵀn + b = 0
x1·1 + x2·3 + 4 = 0
x1 + 3x2 + 4 = 0
So, as you can see from the solution, in 2D the hyperplane is simply the equation of a line.
——————————————————————————————————————————————————
Half-space :
So, here we have a 2-dimensional space in X1 and X2 and as we have discussed before, an equation in two
dimensions would be a line, which would be a hyperplane. So, the equation of the line is written as Xᵀn + b = 0.
For these two dimensions, we could write this line as x1n1 + x2n2 + b = 0.
This whole two-dimensional space is broken into two regions: one on one side of the line (the positive half of the plane) and the other on the other side (the negative half of the plane). These two regions are called half-spaces.
Example: Consider the same example taken in the hyperplane case. So by
solving, we got the equation as x1 + 3x2 + 4 = 0
There may arise 3 cases. Let’s discuss each case with an example.
Case 1: x1 + 3x2 + 4 = 0 : On the line
Let's consider the point (-1, -1). When we substitute this value into the equation of the line, we get 0, so we can say that this point lies on the hyperplane (the line).
Case 2: x1 + 3x2 + 4 > 0 : Positive half-space
Consider the point (1, -1). When we substitute this value into the equation of the line, we get 2, which is greater than 0, so we can say that this point is in the positive half-space.
Case 3: x1 + 3x2 + 4 < 0 : Negative half-space
Consider the point (1, -2). When we substitute this value into the equation of the line, we get -1, which is less than 0, so we can say that this point is in the negative half-space.
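These three cases can also be checked numerically; a small sketch in R, using the same n = (1, 3) and b = 4 as in the example (the helper function side() is made up for illustration):
n <- c(1, 3)   # normal vector of the hyperplane
b <- 4         # offset
# Sign of x %*% n + b tells us which side of the hyperplane a point lies on
side <- function(x) {
  value <- sum(x * n) + b
  if (value == 0) "on the hyperplane"
  else if (value > 0) "positive half-space"
  else "negative half-space"
}
side(c(-1, -1))  # "on the hyperplane"     (value = 0)
side(c(1, -1))   # "positive half-space"   (value = 2)
side(c(1, -2))   # "negative half-space"   (value = -1)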
UNIT-II
Statistical modeling: ★Statistical modeling is essential for any data analyst who wants to make sense of the data and make scientific predictions. ★It is a process of using statistical models to analyze a set of data.
★Statistical models are mathematical representations of the observed data.
★Methods are a powerful tool in understanding consolidated data and making generalized predictions using this
data
★Is a way of applying statistical analysis to datasets in data science. ★Involves a mathematical relationship
between random and non-random variables.
★A statistical model can provide intuitive visualizations that aid data scientists in identifying relationships between
variables and making predictions by applying statistical models to raw data.
★Examples: census data, public health data, and social media data
Continuous Random variable: A continuous random variable is one which takes an infinite number of possible
values. Continuous random variables are usually measurements. Examples include height, weight, the amount of
sugar in an orange, the time required to run a mile.
➡A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values, and
is represented by the area under a curve (in advanced mathematics, this is known as an integral). The probability of
observing any single value is equal to 0, since the number of values which may be assumed by the random variable
is infinite.
➡Suppose a random variable X may take all values over an interval of real numbers. Then the probability that X is in
the set of outcomes A, P(A), is defined to be the area above A and under a curve. The curve, which represents a
function p(x), must satisfy the following:
1: The curve has no negative values (p(x) ≥ 0 for all x)
2: The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.
Normal or Gaussian Distribution: Is a continuous probability distribution that has a bell-shaped probability density function (Gaussian function), informally called a bell curve: f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)), where μ = mean and σ = standard deviation.
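For reference, a minimal sketch that draws this bell curve in R with dnorm() (μ = 0 and σ = 1 are arbitrary choices here):
mu <- 0; sigma <- 1                      # standard normal as an example
curve(dnorm(x, mean = mu, sd = sigma),
      from = -4, to = 4,
      xlab = "x", ylab = "density",
      main = "Normal (Gaussian) distribution")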
Moments of a PDF:
For continuous distributions: E[x^k] = ∫_{−∞}^{∞} x^k f(x) dx
For discrete distributions: E[x^k] = Σ_{i=1}^{N} x_i^k p(x_i)
Mean: μ = E[X]    Variance: σ² = E[(x − μ)²]    Standard deviation: σ = (E[(x − μ)²])^(1/2)
Probability sampling is a technique in which the researcher chooses samples from a larger population using a
method based on probability theory. For a participant to be considered as a probability sample, he/she must be
selected using a random selection. ➡Probability sampling uses statistical theory to randomly select a small group of
people (sample) from an existing large population and then predict that all their responses will match the overall
population. ➡For example, an organization has 500,000 employees sitting at different geographic locations. The
organization wishes to make certain amendments in its human resource policy, but before they roll out the change,
they want to know if the employees will be happy with the change or not. However, reaching out to all 500,000
employees is a tedious task. This is where probability sampling comes in handy. A sample from a larger population
i.e., from 500,000 employees, is chosen. This sample will represent the population. Deploy a survey now to the
sample. ➡From the responses received, management will now be able to know whether employees in that
organization are happy or not about the amendment.
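A random selection like this can be sketched in R with sample(); the employee IDs and the sample size below are made up for illustration:
set.seed(42)                       # for reproducibility
population <- 1:500000             # hypothetical employee IDs
survey_sample <- sample(population, size = 1000)  # simple random sample
length(survey_sample)              # 1000
head(survey_sample)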
Statistical analysis is done on data sets, and the analysis process can create different output types from the input
data. For example, the process can give summarized data, derive key values from the input, present input data
characteristics, prove a null hypothesis, etc. The output type and format vary with the analysis method used.
The two main types are descriptive statistics and inferential statistics
● Descriptive statistics: It refers to collecting, organizing, analyzing, and summarizing data sets in an understandable
format, like charts, graphs, and tables. It makes a large data set presentable and eliminates complexity to help
analysts understand it. The format of the summary can be quantitative or visual.
● Inferential statistics: Inferential statistics derive inference about a large population. It is based on the analysis and
findings produced for sample data from the large population. Hence it makes the process cost-efficient and
time-efficient. It generally includes the development of interval estimates and point estimates to conduct the analysis.
Hypothesis Testing: The method tests the validity and authenticity of a hypothesis, outcome, or argument.
➡A hypothesis is an assumption set at the beginning of the research; after the test is over and a result is obtained, the assumption turns out to be either true or false. The test checks whether the null hypothesis or the alternative hypothesis is true.
Null hypothesis (H0):
● In the null hypothesis, there is no relationship between the two variables.
● Generally, researchers and scientists try to reject or disprove the null hypothesis.
● If the null hypothesis is accepted, researchers have to make changes in their opinions and statements.
● Here no effect can be observed, i.e. it does not affect the output.
● Here the testing process is implicit and indirect.
● This hypothesis is denoted by H0.
● It is retained if we fail to reject it, which happens when the p-value is greater than the significance level.
Alternative hypothesis (Ha):
● In the alternative hypothesis, there is some relationship between the two variables, i.e. they are dependent upon each other.
● Generally, researchers and scientists try to accept or approve the alternative hypothesis.
● If the alternative hypothesis gets accepted, researchers do not have to make changes in their opinions and statements.
● Here an effect can be observed, i.e. it affects the output.
● Here the testing process is explicit and direct.
● This hypothesis is denoted by Ha or H1.
● It is accepted when we reject the null hypothesis, which happens when the p-value is smaller than the significance level.
Two-sample t test:
Null hypothesis: The mean dependent variable does not differ between group 1 (µ1) and group 2 (µ2) in the population; µ1 = µ2.
Alternative hypothesis: The mean dependent variable differs between group 1 (µ1) and group 2 (µ2) in the population; µ1 ≠ µ2.
One-way ANOVA with three groups:
Null hypothesis: The mean dependent variable does not differ between group 1 (µ1), group 2 (µ2), and group 3 (µ3) in the population; µ1 = µ2 = µ3.
Alternative hypothesis: The mean dependent variables of group 1 (µ1), group 2 (µ2), and group 3 (µ3) are not all equal in the population.
Example 1: A teacher claims that the mean score of students in his class is greater than 82 with a standard deviation
of 20. If a sample of 81 students was selected with a mean score of 90 then check if there is enough evidence to
support this claim at a 0.05 significance level.
Solution: As the sample size is 81 and the population standard deviation is known, this is an example of a right-tailed one-sample z test.
H0: μ = 82, H1: μ > 82. From the z table, the critical value at α = 0.05 is 1.645.
x̄ = 90, μ = 82, n = 81, σ = 20
z = (x̄ − μ) / (σ / √n) = (90 − 82) / (20 / 9) = 3.6
As 3.6 > 1.645, the null hypothesis is rejected and it is concluded that there is enough evidence to support the teacher's claim. ★Reject the null hypothesis
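The same calculation can be reproduced in R; a minimal sketch, where qnorm() gives the critical value and pnorm() the p-value:
x_bar <- 90; mu0 <- 82; sigma <- 20; n <- 81; alpha <- 0.05
z <- (x_bar - mu0) / (sigma / sqrt(n))   # 3.6
z_crit <- qnorm(1 - alpha)               # 1.644854 (right-tailed test)
p_value <- 1 - pnorm(z)                  # ~0.00016
z > z_crit                               # TRUE -> reject the null hypothesis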
T-test is the final statistical measure for determining differences between two means that may or may not be related. The testing uses randomly selected samples from the two categories or groups. For a one-sample test the statistic is t = (m − μ) / (S / √n), where:
t = Student's t-test statistic
m = sample mean
μ = theoretical value
S = standard deviation
n = variable set size
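A hedged sketch of a one-sample t-test in R using t.test(); the ten sample values and the theoretical mean of 82 are made up for illustration:
scores <- c(88, 92, 79, 85, 90, 95, 83, 87, 91, 84)   # illustrative sample
t.test(scores, mu = 82, alternative = "greater")
# Internally this computes t = (m - mu) / (S / sqrt(n))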
Support Vector Machine Algorithm
The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional space
into classes so that we can easily put the new data point in
the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine. In the usual illustration, two different categories are separated by a decision boundary or hyperplane, with the support vectors lying closest to it.
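The linear-model output shown next comes from the built-in cars dataset; a minimal sketch of the call that produces it (the object name linearMod is arbitrary):
# Simple linear regression: stopping distance as a function of speed
linearMod <- lm(dist ~ speed, data = cars)
print(linearMod)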
#> Call:
#> lm(formula = dist ~ speed, data = cars)
#>
#> Coefficients:
#> (Intercept) speed
#> -17.579 3.932
Now that we have built the linear model, we have also established the relationship between the predictor and the response in the form of a mathematical formula: distance (dist) as a function of speed. In the above output, the ‘Coefficients’ part has two components, Intercept: -17.579 and speed: 3.932. These are also called the beta coefficients. In other words, dist = Intercept + (β ∗ speed) => dist = −17.579 + 3.932 ∗ speed
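Once the coefficients are known, predictions follow directly; a small sketch, assuming the linearMod object from the sketch above (the speed value of 20 is arbitrary):
# Predicted stopping distance for a speed of 20 mph
predict(linearMod, newdata = data.frame(speed = 20))
# Equivalent by hand: -17.579 + 3.932 * 20, approximately 61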
UNIT-III
Predictive analysis in the R language is a branch of analysis which uses statistical operations to analyze historical facts in order to predict future events. It is a common term used in data mining and machine learning. Methods like time series analysis, non-linear least squares, etc. are used in predictive analysis.
Process of Predictive Analysis: 1.Define project: Defining the project, scope, objectives and result.
2.Data collection: Data is collected through data mining providing a complete view of customer interactions.
3.Data Analysis: It is the process of cleaning, inspecting, transforming and modelling the data.
4.Statistics: This process enables validating the assumptions and testing the statistical models.
5.Modelling: Predictive models are generated using statistics and the most optimized model is used for deployment.
6.Deployment: The predictive model is deployed to automate the production of everyday decision-making results.
7.Model monitoring: Keep monitoring the model to review performance which ensures expected results.
x <- c(580, 7813, 28266, 59287, 75700, 87820, 95314, 126214, 218843, 471497, 936851, 1508725, 2072113)
library(lubridate)
png(file ="predictiveAnalysis.png")
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)
plot(mts, xlab ="Weekly Data of sales", ylab ="Total Revenue", main ="Sales vs Revenue", col.main ="darkgreen")
dev.off()
##code##
# Heights (cm) and weights (kg) of 10 people
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Fit weight as a function of height
relation <- lm(y~x)
print(summary(relation))
# Plot the data with the regression line and save the chart to a file
png(file = "linearregression.png")
plot(y, x, col = "blue", main = "Height & Weight Regression",
     abline(lm(x~y)), cex = 1.3, pch = 16, xlab = "Weight in Kg",
     ylab = "Height in cm")
dev.off()
Multiple Regression: Multiple regression is an extension of linear regression into relationship between more than
two variables. In simple linear relation we have one predictor and one response variable, but in multiple regression
we have more than one predictor variable and one response variable.
The general equation is − y = a + b1x1 + b2x2 +...bnxn — parameters used −
●y is the response variable ●a, b1, b2...bn are the coefficients ●x1, x2, ...xn are the predictor variables.
lm() Function: This function creates the relationship model between the predictor and the response variable.
Syntax: lm(y ~ x1+x2+x3..., data) — parameters used − ●formula is a symbol representing the relation between the response variable and the predictor variables. ●data is the data frame on which the formula will be applied.
##code##
> input <- mtcars[,c("mpg","disp","hp","wt")]
> print(head(input)) —————————————
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
###code###
> input <- mtcars[,c("mpg","disp","hp","wt")]
> model <- lm(mpg~disp+hp+wt, data = input)
> print(model)
> cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
Output:
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
# # # # The Coefficient Values # # #
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891
###code###
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
Output: am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460
Create Regression Model: glm() function is used to create the regression model and get its summary for analysis.
##code##
input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
Output:
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
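To use the fitted logistic model, predicted probabilities can be obtained with predict(..., type = "response"); a short sketch, assuming the am.data object created above:
# Probability that each car has a manual transmission (am = 1)
probs <- predict(am.data, type = "response")
head(round(probs, 3))
# Convert probabilities to class labels using a 0.5 cut-off
pred_am <- ifelse(probs > 0.5, 1, 0)
table(predicted = pred_am, actual = mtcars$am)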
UNIT-IV
R - Data Frames: A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
➡The column names should be non-empty. ➡The row names should be unique. ➡The data stored in a data frame
can be of numeric, factor or character type. ➡Each column should contain the same number of data items.
➔ The structure of the data frame can be seen by using str() function.
➔ The statistical summary and nature of the data can be obtained by applying summary() function.
➔ To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as
the existing data frame and use the rbind() function.
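A minimal sketch tying these points together (the small employee data frame below is made up):
# Create a small data frame
emp.data <- data.frame(
  id     = c(1, 2, 3),
  name   = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611.0)
)
str(emp.data)       # structure of the data frame
summary(emp.data)   # statistical summary of each column
# Add a new row with the same structure using rbind()
new.row  <- data.frame(id = 4, name = "Ryan", salary = 729.0)
emp.data <- rbind(emp.data, new.row)
print(emp.data)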
What is R Programming
"R is an interpreted computer programming language which was created by Ross Ihaka and Robert Gentleman at
the University of Auckland, New Zealand." The R Development Core Team currently develops R. It is also a software
environment used for statistical analysis, graphical representation, reporting, and data modeling. R is an implementation of the S programming language, combined with lexical scoping semantics.
R:
● R packages have advanced techniques which are very useful for statistical work. The CRAN task views point to many useful R packages. These packages cover everything from Psychometrics to Genetics to Finance.
● For data analysis, R has inbuilt functionalities.
● Data visualization is a key aspect of analysis. R packages such as ggplot2, ggvis, lattice, etc. make data visualization easier.
● There are hundreds of packages and ways to accomplish needful data science tasks.
Python:
● For finding outliers in a data set both R and Python are equally good. But for developing a web service to allow people to upload datasets and find outliers, Python is better.
● Most of the data analysis functionalities are not inbuilt; they are available through packages like NumPy and Pandas.
● Python is better for deep learning because Python packages such as Caffe, Keras, OpenNN, etc. allow the development of deep neural networks in a very simple way.
● Python has a few main packages, viz. Scikit-learn and Pandas, for machine learning and data analysis respectively.
Arithmetic Operators
The following are the arithmetic operators supported by the R language. The operators act on each element of the vectors.
(+) Adds two vectors
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v+t)
[1] 10.0 8.5 10.0
(−) Subtracts the second vector from the first
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v-t)
[1] -6.0 2.5 2.0
(*) Multiplies both vectors
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v*t)
[1] 16.0 16.5 24.0
(/) Divides the first vector by the second
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v/t)
[1] 0.250000 1.833333 1.500000
(%%) Gives the remainder of the first vector divided by the second
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v%%t)
[1] 2.0 2.5 2.0
(%/%) Gives the result of integer division of the first vector by the second (quotient)
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v%/%t)
[1] 0 1 1
(^) Raises the first vector to the exponent of the second vector
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)
print(v^t)
[1] 256.000 166.375 1296.000
Logical Operators:
& It is called the Element-wise Logical AND operator. It combines each element of the first vector with the corresponding element of the second vector and gives an output of TRUE if both elements are TRUE.
> v <- c(3,1,TRUE,2+3i)
> t <- c(4,1,FALSE,2+3i)
> print(v&t)
[1] TRUE TRUE FALSE TRUE
! It is called the Logical NOT operator. It takes each element of the vector and gives the opposite logical value.
> v <- c(3,0,TRUE,2+2i)
> print(!v)
[1] FALSE TRUE FALSE FALSE
&& and || consider only the first element of the vectors and give a vector of a single element as output.
&& Called the Logical AND operator. Takes the first element of both vectors and gives TRUE only if both are TRUE.
> v <- c(3,0,TRUE,2+2i)
> t <- c(1,3,TRUE,2+3i)
> print(v&&t)
[1] TRUE
|| Called the Logical OR operator. Takes the first element of both vectors and gives TRUE if one of them is TRUE.
> v <- c(0,0,TRUE,2+2i)
> t <- c(0,3,TRUE,2+3i)
> print(v||t)
[1] FALSE
R - Matrices: A Matrix is created using the matrix() function.
Syntax - matrix(data, nrow, ncol, byrow, dimnames) parameters used −
●data is the input vector which becomes the data elements of the matrix. ●nrow is the number of rows to be created.
●ncol is the number of columns to be created. ●byrow is a logical value. If TRUE then the input vector elements are arranged by row. ●dimnames is the names assigned to the rows and columns.
Create a matrix taking a vector of numbers as input.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
Output:
# M (byrow = TRUE)
     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14
# N (byrow = FALSE)
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14
# P (with dimnames)
     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14
Matrix Multiplication & Division (element-wise):
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)
Output:
Result of multiplication
     [,1] [,2] [,3]
[1,]   15    0    6
[2,]   18   36   24
Result of division
     [,1]      [,2]      [,3]
[1,]  0.6      -Inf 0.6666667
[2,]  4.5 0.4444444 1.5000000
Simulation is a method used to examine the “what if” without having real data. We just make it up! We can use
pre-programmed functions in R to simulate data from different probability distributions or we can design our own
functions to simulate data from distributions not available in R.
➡In a simulation, you set the ground rules of a random process and then the computer uses random numbers to
generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the
following commands.
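One possible version of those commands, a sketch using sample() with a fixed seed for reproducibility:
set.seed(1)
# Simulate 10 flips of a fair coin
flips <- sample(c("H", "T"), size = 10, replace = TRUE)
flips
table(flips)   # counts of heads and tails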
Control statements are expressions used to control the execution and flow of the program based on the conditions
provided in the statements. These structures are used to make a decision after the variable has been assessed. A short combined example follows the list. In R programming, there are 8 types of control statements:
➡if condition ➡if-else condition ➡for loop ➡nested loops ➡while loop ➡repeat and break statement
➡return statement ➡next statement
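A short combined sketch of several of these statements (the numbers are arbitrary):
x <- 7
# if-else condition
if (x %% 2 == 0) {
  print("even")
} else {
  print("odd")
}
# for loop with next and break
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # stop once i exceeds 7
  print(i)
}
# while loop
i <- 1
while (i <= 3) {
  print(i^2)
  i <- i + 1
}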
Debugging in R
➔ traceback(): If our code has already crashed and we want to know where the offending line is, try traceback(). This will (sometimes) show the location of the problem somewhere in the code. When an R function fails, an error is printed on the screen. Immediately after the error, we can call traceback() to see in which function the error occurred. The traceback() function prints the list of functions which were called before the error occurred. The functions are printed in reverse order.
➔ debug(): In R, the debug() function allows the user to step through the execution of a function. At any point, we can print the values of the variables or draw a graph of the results within the function. While debugging, we can just type "c" to continue to the end of the current block of code. traceback() does not tell us where within the function the error occurred; to know which line is causing the error, we have to step through the function using debug().
➔ trace(): The trace() function call allows the user to insert bits of code into a function. The syntax for trace() is a bit awkward for first-time users, so it may be better to use debug().
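A small sketch of how these calls are used in practice (the functions f and g below are made up for illustration):
g <- function(x) if (x < 0) stop("x must be non-negative") else sqrt(x)
f <- function(x) g(x - 10)
# f(5)            # error: "x must be non-negative"
# traceback()     # run immediately after the error to see that it came from g()
# debug(f); f(5)  # step through f() line by line
# undebug(f)      # turn single-stepping off again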
R - Functions: An R function is created by using the keyword function.
Syntax: function_name <- function(arg_1, arg_2, ...) { Function body }
Built-in Function: Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc. They are directly called by user-written programs.
##code##
print(seq(32,44))
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
Output:
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
————————————————————————
Calling a Function
new.function <- function(a) {
  for(i in 1:a) {
    b <- i^2
    print(b)
  }
}
# Call the function new.function supplying 6 as an argument.
new.function(6)
Output:
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
————————————————————————
Calling a Function with Argument Values (by position and by name): The arguments to a function call can be supplied in the same sequence as defined in the function, or they can be supplied in a different sequence but assigned to the names of the arguments.
##code##
new.function <- function(a,b,c) {
  result <- a * b + c
  print(result)
}
# Call the function by position of arguments.
new.function(5,3,11)
# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
Output:
[1] 26
[1] 58
————————————————————————
Calling a Function with Default Argument: We can define the value of the arguments in the function definition and call the function without supplying any argument to get the default result. But we can also call such functions by supplying new values of the arguments and get a non-default result.
##code##
new.function <- function(a = 3, b = 6) {
  result <- a * b
  print(result)
}
# Call the function without giving any argument.
new.function()
# Call the function with giving new values of the argument.
new.function(9,5)
Output:
[1] 18
[1] 45
————————————————————————
Lazy Evaluation of Function: Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.
##code##
new.function <- function(a, b) {
  print(a^2)
  print(a)
  print(b)
}
new.function(6)
Output:
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
UNIT-V
Performance Evaluation Measures for Classification Models:
Confusion Matrix: Confusion Matrix usually causes a lot of confusion even in those who are using them regularly.
Terms used in defining a confusion matrix are TP, TN, FP, and FN.
Use case: Let’s take an example of a patient who has gone to a doctor with certain symptoms. Since it’s the season
of Covid, let’s assume that he went with fever, cough, throat ache, and cold. These are symptoms that can occur
during any seasonal changes too. Hence, it is tricky for the doctor to do the right diagnosis.
➔ True Positive (TP): Let’s say the patient was actually suffering from Covid and on doing the required
assessment, the doctor classified him as a Covid patient. This is called TP or True Positive.
➔ False Positive (FP): Let's say the patient was not suffering from Covid and he was only showing symptoms of seasonal flu, but the doctor diagnosed him with Covid. This is called FP or False Positive, also known as a Type I Error.
➔ True Negative (TN): Let’s say the patient was not suffering from Covid and the doctor also gave him a clean
chit. This is called TN or True Negative.
➔ False Negative (FN): Let’s say the patient was suffering from Covid and the doctor did not diagnose him with
Covid. This is called FN or False Negative as the case was actually positive but was falsely classified as
negative. This is also called Type II Error.
Accuracy: Accuracy = (TP + TN) / (TP + FP +TN + FN) This term tells us how many right classifications were
made out of all the classifications. In other words, how many TPs and TNs were done out of TP + TN + FP + FNs.
It tells the ratio of “True”s to the sum of “True”s and “False”s.
Precision: Precision = TP / (TP + FP) Out of all that were marked as positive, how many are actually truly
positive. Use case: Consider a model that marks emails as spam (positive) or not. Here, if emails that are actually important get marked as positive, then useful emails will end up in the “Spam” folder, which is dangerous. Hence, the model with the lowest FP count needs to be selected, i.e. the model with the highest precision should be selected among all models.
Recall or Sensitivity: Recall = TP / (TP + FN) Out of all the actual real positive cases, how many were identified as positive. Use case: Out of all the actual Covid patients who visited the doctor, how many were actually diagnosed as Covid positive. Hence, the model with the lowest FN count needs to be selected.
Specificity: Specificity = TN/ (TN + FP) Out of all the real negative cases, how many were identified as negative.
Use case: Out of all the non-Covid patients who visited the doctor, how many were diagnosed as non-Covid.
F1-Score: F1 score = 2 * (Precision * Recall) / (Precision + Recall) As we saw above, sometimes we need to give weightage to FP and sometimes to FN. The F1 score is the harmonic mean of Precision and Recall, which means equal importance is given to FP and FN. ●This is a very useful metric compared to “Accuracy”. The problem
with using accuracy is that if we have a highly imbalanced dataset for training (a training dataset with 95% positive
class and 5% negative class), the model will end up learning how to predict the positive class properly and will not
learn how to identify the negative class.
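A minimal sketch computing all of these metrics in R from hypothetical confusion-matrix counts (the TP/FP/TN/FN values are made up):
# Hypothetical confusion-matrix counts
TP <- 90; FP <- 10; TN <- 80; FN <- 20
accuracy    <- (TP + TN) / (TP + FP + TN + FN)                    # 0.85
precision   <- TP / (TP + FP)                                     # 0.90
recall      <- TP / (TP + FN)                                     # ~0.818
specificity <- TN / (TN + FP)                                     # ~0.889
f1          <- 2 * (precision * recall) / (precision + recall)    # ~0.857
c(accuracy = accuracy, precision = precision,
  recall = recall, specificity = specificity, F1 = f1)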
Area Under Curve (AUC) and ROC Curve:
AUC [Area Under Curve] is used in conjunction with the
ROC [Receiver Operating Characteristic] Curve.
AUC is the area under the ROC Curve.
AUC is the area under the ROC Curve.
➡A ROC Curve is drawn by plotting TPR [True
Positive Rate] or Recall or Sensitivity in the y-axis
against FPR [False Positive Rate] in the x-axis.
FPR = 1- Specificity
TPR = TP/ (TP + FN)
FPR = 1 – TN/ (TN+FP) = FP/ (TN + FP)
★A good model has an AUC close to 1. When we say a model has a high AUC score, it means the model's ability to separate the classes is very high (high separability).
This is a very important metric that should be checked while selecting a classification model.
K-Nearest Neighbor or K-NN is a supervised non-linear classification algorithm. K-NN is a non-parametric algorithm, i.e. it doesn't make any assumption about the underlying data or its distribution. It is one of the simplest and most widely used algorithms; it depends on its k value (the number of neighbors) and finds its applications in many industries like the finance industry, the healthcare industry, etc.
Algorithm: ●Choose the number K of neighbors. ●Take the K nearest neighbors of the unknown data point according to distance.
●Among the K neighbors, count the number of data points in each category.
●Assign the new data point to the category where you counted the most neighbors.
Features: ●KNN is a Supervised Learning algorithm that uses labeled input data set to predict output of data points.
●It is one of the simplest machine learning algorithms and it can be easily implemented for a varied set of problems.
●It is mainly based on feature similarity. KNN checks how similar a data point is to its neighbor and classifies the
data point into the class it is most similar to. ●Unlike most algorithms, KNN is a non-parametric model which means
that it does not make any assumptions about the data set. This makes the algorithm more effective since it can
handle realistic data. ●KNN can be used for solving both classification and regression problems.
●KNN is a lazy algorithm, this means that it memorizes the training data set instead of learning a discriminative
function from the training data.
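A hedged sketch of KNN classification, assuming the class package and using the built-in iris data (the 70/30 split and k = 5 are arbitrary choices):
library(class)
set.seed(123)
# Split iris into training and test sets (70/30)
idx     <- sample(1:nrow(iris), size = 0.7 * nrow(iris))
train_x <- iris[idx, 1:4];    test_x <- iris[-idx, 1:4]
train_y <- iris$Species[idx]; test_y <- iris$Species[-idx]
# Classify each test point by its 5 nearest neighbours
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
mean(pred == test_y)                     # classification accuracy on the test set
table(predicted = pred, actual = test_y)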
# plots to compare (assumes df is a scaled numeric data frame and factoextra is installed)
library(factoextra)
k2 <- kmeans(df, centers = 2, nstart = 25)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
R - Time Series Analysis: Time series is a series of data points in which each data point is associated with a
timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day.
Another example is the amount of rainfall in a region at different months of the year.
Syntax timeseries.object.name <- ts(data, start, end, frequency) parameters used −
➡data is a vector or matrix containing the values used in the time series.
➡start specifies the start time for the first observation in time series.
➡end specifies the end time for the last observation in time series.
➡frequency specifies the number of observations per unit time.
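A minimal sketch of ts() with made-up monthly rainfall values (12 observations starting January 2022, so frequency = 12):
# Illustrative monthly rainfall figures (mm)
rainfall <- c(79, 58, 66, 98, 160, 233, 311, 287, 226, 142, 61, 43)
rainfall.ts <- ts(rainfall, start = c(2022, 1), frequency = 12)
print(rainfall.ts)
plot(rainfall.ts, xlab = "Month", ylab = "Rainfall (mm)")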
Social network analysis in R: Social Network Analysis (SNA) is the process of exploring the social structure by
using graph theory. It is mainly used for measuring and analyzing the structural properties of the network.
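A hedged sketch using the igraph package (the small friendship edge list below is made up):
library(igraph)
# Hypothetical friendship ties between five people
edges <- c("A","B", "A","C", "B","C", "C","D", "D","E")
g <- make_graph(edges, directed = FALSE)
degree(g)        # number of ties per person
betweenness(g)   # brokerage position of each person
plot(g, vertex.color = "lightblue")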
library() function: The library() function loads and attaches add-on packages.