Data Science Using R Programming: Units 1-5

The document outlines the fundamentals of Data Science, emphasizing its reliance on statistics, data analysis, and machine learning to extract insights from data. It discusses the R programming language and its libraries, such as dplyr and ggplot2, which facilitate data manipulation and visualization. Additionally, it covers statistical modeling techniques, including linear regression and classification methods, as well as concepts related to random variables and probability distributions.

UNIT-I

Data Science is a combination of multiple disciplines that uses statistics, data analysis, and machine learning to
analyze data and to extract knowledge and insights from it.
➡Data Science is about data gathering, analysis and decision-making.
➡It is about finding patterns in data through analysis, and making future predictions. Typical uses:
●For route planning: to discover the best routes to ship goods ●To predict who will win elections
●To foresee delays for flights/ships/trains etc. (through predictive analysis) ●To create promotional offers
●To find the best-suited time to deliver goods ●To forecast the next year's revenue for a company
➡R is one of the programming languages that provides an extensive environment for you to research, process,
transform, and visualize information.

Features of R – Data Science


➡R provides extensive support for statistical modeling.
➡R is a suitable tool for various data science applications because it provides aesthetic visualization tools.
➡R is heavily utilized in data science applications for ETL (Extract, Transform, Load). It provides an interface for
many databases like SQL and even spreadsheets.
➡R also provides various important packages for data wrangling.
➡With R, data scientists can apply machine learning algorithms to gain insights about future events.
➡One of the important features of R is to interface with NoSQL databases and analyze unstructured data.

Data Science in R Libraries


● dplyr: For performing data wrangling and data analysis, we use the dplyr package. It provides convenient functions for working with data frames in R and is built around five core verbs. You can work with local data frames as well as with remote database tables. You might need to: ➡Select certain columns of data.
➡Filter your data to select specific rows. ➡Arrange the rows of your data into order.
➡Mutate your data frame to contain new columns. ➡Summarize chunks of your data in some way.
● ggplot2: R is most famous for its visualization library ggplot2. It provides an aesthetic set of graphics that can also be made interactive. The ggplot2 library implements a "grammar of graphics". This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of the data and their graphical representation.
● tidyr: tidyr is a package that we use for tidying or cleaning data. Data is considered tidy when each variable represents a column and each row represents an observation.
● shiny: This is a very well known package in R. When you want to share your work with people around you and make it easier for them to explore it visually, you can use shiny. It is a data scientist's best friend.
● caret: caret stands for Classification And REgression Training. Using this package, you can model complex regression and classification problems.
● e1071: This package is widely used for implementing clustering, the Fourier transform, Naive Bayes, SVM and other miscellaneous functions.

Nullity: The dimension of the null space of a given matrix. In other words, the dimension of the null space of the matrix A is called the nullity of A. The number of linear relations among the attributes is given by the size of the null space, and the null space vectors can be used to identify these linear relationships.
Rank-Nullity Theorem: The rank-nullity theorem relates the nullity of the data matrix to its rank and the number of attributes in the data. It states:
Nullity of A + Rank of A = Total number of attributes of A (i.e. total number of columns of A)
Linear Algebra: Is a branch of mathematics that is extremely useful in data science and machine learning. Linear
algebra is the most important math skill in machine learning. Most machine learning models can be expressed in
matrix form. A dataset itself is often represented as a matrix. Linear algebra is used in data preprocessing, data
transformation, and model evaluation. —————————————————————————————————

Y = matrix(c(1,2,3,4), nrow = 2, ncol = 2)               # filled column by column
Y
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
X = matrix(c(1,2,3,4), nrow = 2, ncol = 2, byrow = TRUE) # filled row by row
X
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
A <- matrix(c(10, 8, 5, 12), ncol = 2, byrow = TRUE)
A
##      [,1] [,2]
## [1,]   10    8
## [2,]    5   12
B <- matrix(c(5, 3, 15, 6), ncol = 2, byrow = TRUE)
B
##      [,1] [,2]
## [1,]    5    3
## [2,]   15    6
# Rank
qr(A)$rank        # 2
qr(B)$rank        # 2
# Equivalent to:
library(Matrix)
rankMatrix(A)[1]  # 2
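To tie this code back to the rank-nullity theorem stated above, here is a minimal sketch (reusing the 2x2 matrix X from the example) that computes the nullity as the number of columns minus the rank:
##code##
X <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)
r <- qr(X)$rank          # rank of X is 2
nullity <- ncol(X) - r   # nullity = columns - rank = 0
r; nullity
# X has full column rank, so its null space contains only the zero vector.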

● The standard multiplication symbol, '*', will unfortunately provide unexpected results if you are looking for matrix multiplication: '*' multiplies matrices element-wise. In order to do matrix multiplication, the operator is '%*%'.
X*X
##      [,1] [,2]
## [1,]    1    4
## [2,]    9   16
X%*%X
##      [,1] [,2]
## [1,]    7   10
## [2,]   15   22

● To transpose a matrix or a vector X, use the function t(X). ● To invert a matrix, use the solve() command.
t(X)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
Xinv = solve(X)
X %*% Xinv
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

● To determine the size of a matrix, use the dim() function. The result is a vector with two values: dim(x)[1] gives the number of rows and dim(x)[2] gives the number of columns. Rows and columns of a matrix can be labeled using the rownames() and colnames() functions.
dim(A)      # here A is a data matrix with 100 rows and 2 columns
## [1] 100 2
nrows = dim(A)[1]
ncols = dim(A)[2]

★The vectors which only get scaled and are not rotated by a transformation are called eigenvectors. Eigenvectors are the vectors which
●only get scaled ●or do not change at all.
★The factor by which they get scaled is the corresponding eigenvalue.
Hyperplane
★A hyperplane is a geometric entity whose dimension is one less than that of its ambient space.
★For example, for 3D space the hyperplane is a 2D plane, and for 2D space the hyperplane is a 1D line, and so on.
★A hyperplane can be defined by the equation X^T n + b = 0.
The equation can be expanded for n dimensions as
x1n1 + x2n2 + x3n3 + ... + xnnn + b = 0
For 2 dimensions the equation is
x1n1 + x2n2 + b = 0
Consider the hyperplane of the form X^T n = 0.
Example:
Let us consider a 2D geometry with n = (1, 3)^T and b = 4.
Though it is a 2D geometry, the value of X is the column vector X = (x1, x2)^T.
So, according to the equation of the hyperplane, it can be solved as
X^T n + b = 0
(x1 x2) (1, 3)^T + 4 = 0
x1 + 3x2 + 4 = 0
So, as you can see from the solution, in 2D the hyperplane is the equation of a line.
——————————————————————————————————————————————————
Half-space:
So, here we have a 2-dimensional space in x1 and x2 and, as discussed before, an equation in two dimensions describes a line, which is a hyperplane. The equation of the line is written as X^T n + b = 0, and for two dimensions we can write this line as x1n1 + x2n2 + b = 0.
The line breaks the whole two-dimensional space into two regions: one on the positive side of the line (+ve half of the plane) and one on the negative side (-ve half of the plane). These two regions are called half-spaces.
Example: Consider the same example taken in the hyperplane case. By solving it, we got the equation x1 + 3x2 + 4 = 0. There may arise 3 cases; let's discuss each case with an example (a short R sketch of these checks follows the cases).
Case 1: x1 + 3x2 + 4 = 0 : On the hyperplane
Consider the point (-1, -1). When we put this value into the equation of the line we get 0, so we can say that this point lies on the hyperplane (the line).
Case 2: x1 + 3x2 + 4 > 0 : Positive half-space
Consider the point (1, -1). When we put this value into the equation of the line we get 2, which is greater than 0, so we can say that this point lies in the positive half-space.
Case 3: x1 + 3x2 + 4 < 0 : Negative half-space
Consider the point (1, -2). When we put this value into the equation of the line we get -1, which is less than 0, so we can say that this point lies in the negative half-space.
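A minimal sketch of these half-space checks in R; the helper name half_space is hypothetical, and the hyperplane x1 + 3x2 + 4 = 0 is the one from the example:
##code##
half_space <- function(x1, x2) {
  value <- x1 + 3 * x2 + 4        # evaluate the hyperplane equation at the point
  if (value == 0) {
    "on the hyperplane"
  } else if (value > 0) {
    "positive half-space"
  } else {
    "negative half-space"
  }
}
half_space(-1, -1)   # "on the hyperplane"    (value = 0)
half_space(1, -1)    # "positive half-space"  (value = 2)
half_space(1, -2)    # "negative half-space"  (value = -1)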
UNIT-II
Statistical modeling: ★Statistical modeling is essential for any data analyst who wants to make sense of the data and make scientific predictions. ★It is a process of using statistical models to analyze a set of data.
★Statistical models are mathematical representations of the observed data.
★Statistical methods are a powerful tool for understanding consolidated data and making generalized predictions from it.
★It is a way of applying statistical analysis to datasets in data science. ★It involves a mathematical relationship between random and non-random variables.
★A statistical model can provide intuitive visualizations that aid data scientists in identifying relationships between
variables and making predictions by applying statistical models to raw data.
★Examples: census data, public health data, and social media data

Statistical Modeling Techniques in Data Analysis


Linear regression is based on using linear equations to represent a connection between two variables, one of which
is dependent and the other independent. It is classified into two categories, as follows:
Simple Linear Regression: This method uses a single independent variable to predict a dependent variable by using
the best linear correlation.
Multiple Linear Regression: This method requires more than one independent variable to predict the dependent
variable by offering the best linear relationship.
Classification
Classification divides data into distinct categories, allowing for more precise prediction and analysis. This approach can effectively analyse very big data sets. There are two primary classification techniques:
Logistic Regression: When the dependent variable is dichotomous or binary, a regression analysis approach called logistic regression is used. Statistical analysis is used to explain and predict the relationship between nominal independent variables and a binary dependent variable.
Discriminant Analysis: Two or more clusters (populations) are known a priori, and a fresh observation is sorted into one of the known clusters based on calculated characteristics. Bayes' theorem is applied to obtain, for each response class, the likelihood of that class given the values of "X".
Tree Based Methods
The predictor space is divided into simple regions in a tree-based technique. The decision-tree approach derives its name from the fact that the set of splitting rules can be described in a tree. This method may be applied to both regression and classification situations. This technique employs a variety of methodologies, including bagging, boosting, and the random forest algorithm.
Unsupervised Learning
Reinforcement learning: An algorithm that rewards positive results and penalizes steps that lead to negative results in order to learn the ideal procedure.
Clustering with K-means: Groups a set amount of data points into clusters based on commonalities.
Hierarchical clustering: Creates a cluster tree, which aids in the development of a multi-level cluster hierarchy.
Random Variables: A variable is something which can change its value; it may vary with the different outcomes of an experiment. If the value of a variable depends upon the outcome of a random experiment, it is a random variable, and it can take up any real value.
Example: Suppose a die is thrown and X = outcome of the die.
Here, the sample space is S = {1, 2, 3, 4, 5, 6}. The output of the function will be:
P(X=1) = 1/6, P(X=2) = 1/6, P(X=3) = 1/6, P(X=4) = 1/6, P(X=5) = 1/6, P(X=6) = 1/6
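A quick sketch of this die-throw random variable in R, using sample() to simulate many throws; the relative frequency of each face should be close to 1/6:
##code##
set.seed(1)
rolls <- sample(1:6, size = 10000, replace = TRUE)   # 10,000 simulated throws
table(rolls) / length(rolls)                         # empirical probabilities, each near 1/6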

Types of Random Variable


Discrete Random Variable: A discrete random variable is one which may take on only a countable number of
distinct values such as 0,1,2,3,4,........ Discrete random variables are usually (but not necessarily) counts. If a
random variable can take only a finite number of distinct values, then it must be discrete.
➡Examples of discrete random variables include the number of children in a family, the Friday night attendance at a
cinema, the number of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.
➡The probability distribution of a discrete random variable is a list of probabilities associated with each of its
possible values. It is also sometimes called the probability function or the probability mass function.
Suppose a random variable X may take k different values, with the probability that X = xi defined to be P(X = xi) = pi.
The probabilities pi must satisfy the following:
1: 0 ≤ pi ≤ 1 for each i
2: p1 + p2 + ... + pk = 1.
Example: Suppose a variable X can take the values 1, 2, 3, or 4.
The probabilities associated with each outcome are in the table:
Outcome 1 2 3 4
Probability 0.1 0.3 0.4 0.2
The probability that X is equal to 2 or 3 is the sum of the two probabilities:
P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.3 + 0.4 = 0.7.
Similarly, the probability that X is greater than 1 is equal to 1 − P(X=1) = 1 − 0.1 = 0.9
by the complement rule. This distribution can also be shown as a probability histogram.
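These two calculations can be reproduced directly in R from the probability table above:
##code##
x <- 1:4
p <- c(0.1, 0.3, 0.4, 0.2)        # P(X = 1), ..., P(X = 4)
sum(p[x == 2 | x == 3])           # P(X = 2 or X = 3) = 0.7
1 - p[x == 1]                     # P(X > 1) = 0.9 by the complement rule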

Continuous Random variable: A continuous random variable is one which takes an infinite number of possible
values. Continuous random variables are usually measurements. Examples include height, weight, the amount of
sugar in an orange, the time required to run a mile.
➡A continuous random variable is not defined at specific values. Instead, it is defined over an interval of values, and
is represented by the area under a curve (in advanced mathematics, this is known as an integral). The probability of
observing any single value is equal to 0, since the number of values which may be assumed by the random variable
is infinite.
➡Suppose a random variable X may take all values over an interval of real numbers. Then the probability that X is in
the set of outcomes A, P(A), is defined to be the area above A and under a curve. The curve, which represents a
function p(x), must satisfy the following:
1: The curve has no negative values (p(x) ≥ 0 for all x)
2: The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.

Probability Mass and Density Functions


Probability mass and density functions are used to describe discrete and continuous probability distributions,
respectively. This allows us to determine the probability of an observation being exactly equal to a target value
(discrete) or within a set range around our target value (continuous).
Probability Mass Function (PMF), also called the probability function or frequency function, characterizes the distribution of a discrete random variable. Let X be a discrete random variable; then the probability mass function of X is given by
Px(x) = P(X = x), for all x belonging to the range of X.
➡The probability function must satisfy the conditions: Px(x) ≥ 0 and ∑xϵRange(X) Px(x) = 1.
Here Range(X) is a countable set and can be written as {x1, x2, x3, ...}; that is, the random variable X takes the values x1, x2, x3, .... The probability mass function P(X = x) = f(x) of a discrete random variable is a function that satisfies the following properties:
●f(x) > 0 for x in the range (support) of X ●∑xϵRange(X) f(x) = 1 ●P(XϵA) = ∑xϵA f(x)

Normal or Gaussian Distribution: Is a continuous probability distribution that has a bell-shaped probability density
function (Gaussian function), or informally a bell curve, where μ = mean and σ = standard deviation.
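A short sketch of the bell curve using R's built-in dnorm() density, here for the standard normal with μ = 0 and σ = 1:
##code##
x <- seq(-4, 4, length.out = 200)
plot(x, dnorm(x, mean = 0, sd = 1), type = "l",
     xlab = "x", ylab = "density", main = "Normal (Gaussian) distribution")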

Moments of a PDF:
For continuous distributions: E[X^k] = ∫ x^k f(x) dx, integrated from −∞ to ∞
For discrete distributions: E[X^k] = ∑ (from i = 1 to N) xi^k p(xi)
Mean: μ = E[X]    Variance: σ^2 = E[(X − μ)^2]    Standard deviation: σ = (σ^2)^(1/2)
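As a numerical check of these definitions, here is a minimal sketch (using the standard normal density as a concrete f(x)) that evaluates the first two moments with integrate():
##code##
m1 <- integrate(function(x) x * dnorm(x), -Inf, Inf)$value     # E[X], approximately 0
m2 <- integrate(function(x) x^2 * dnorm(x), -Inf, Inf)$value   # E[X^2], approximately 1
variance <- m2 - m1^2                                          # sigma^2 = E[X^2] - mu^2, approximately 1
c(mean = m1, variance = variance)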

Probability sampling is a technique in which the researcher chooses samples from a larger population using a method based on probability theory. For a participant to be part of a probability sample, he/she must be selected through random selection. ➡Probability sampling uses statistical theory to randomly select a small group of people (a sample) from an existing large population and then infer that their responses represent the overall population. ➡For example, an organization has 500,000 employees sitting at different geographic locations. The organization wishes to make certain amendments to its human resource policy, but before rolling out the change, it wants to know whether the employees will be happy with it. However, reaching out to all 500,000 employees is a tedious task. This is where probability sampling comes in handy. A sample is chosen from the larger population of 500,000 employees; this sample represents the population, and a survey is then deployed to the sample. ➡From the responses received, management will be able to tell whether the employees in the organization are happy or not about the amendment.
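An illustrative sketch of simple random (probability) sampling in R; the population of 500,000 employee IDs and the sample size of 1,000 are assumed for the example:
##code##
set.seed(42)
population <- 1:500000                            # hypothetical employee IDs
survey_sample <- sample(population, size = 1000)  # every employee has an equal chance of selection
length(survey_sample)                             # 1000 employees receive the survey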

Statistical analysis is done on data sets, and the analysis process can create different output types from the input
data. For example, the process can give summarized data, derive key values from the input, present input data
characteristics, prove a null hypothesis, etc. The output type and format vary with the analysis method used.
The two main types are descriptive statistics and inferential statistics
● Descriptive statistics: It refers to collecting, organizing, analyzing, and summarizing data sets in an understandable
format, like charts, graphs, and tables. It makes a large data set presentable and eliminates complexity to help
analysts understand it. The format of the summary can be quantitative or visual.
● Inferential statistics: Inferential statistics derive inference about a large population. It is based on the analysis and
findings produced for sample data from the large population. Hence it makes the process cost-efficient and
time-efficient. It generally includes the development of interval estimates and point estimates to conduct the analysis.
Hypothesis Testing: This method tests the validity and authenticity of a hypothesis, outcome, or argument.
➡A hypothesis is an assumption set at the beginning of the research; after the test is over and a result is obtained, the assumption turns out to be either true or false. The test can check whether the null hypothesis or the alternative hypothesis is true.
Null hypothesis (H0):
● In the null hypothesis, there is no relationship between the two variables.
● Generally, researchers and scientists try to reject or disprove the null hypothesis.
● If the null hypothesis is accepted, researchers have to make changes in their opinions and statements.
● Here no effect can be observed, i.e. it does not affect the output.
● Here the testing process is implicit and indirect.
● This hypothesis is denoted by H0.
● It is retained (not rejected) if we fail to reject it, i.e. when the p-value is greater than the significance level.

Alternative hypothesis (Ha):
● In the alternative hypothesis, there is some relationship between the two variables, i.e. they are dependent upon each other.
● Generally, researchers and scientists try to accept or approve the alternative hypothesis.
● If the alternative hypothesis gets accepted, researchers do not have to make changes in their opinions and statements.
● Here an effect can be observed, i.e. it affects the output.
● Here the testing process is explicit and direct.
● This hypothesis is denoted by Ha or H1.
● It is generally accepted when we reject the null hypothesis, i.e. when the p-value is smaller than the significance level.

Statistical test / Null hypothesis (H0) / Alternative hypothesis (Ha)

Two-sample t test: H0: The mean dependent variable does not differ between group 1 (µ1) and group 2 (µ2) in the population; µ1 = µ2. Ha: The mean dependent variable differs between group 1 (µ1) and group 2 (µ2) in the population; µ1 ≠ µ2.

One-way ANOVA with three groups: H0: The mean dependent variable does not differ between group 1 (µ1), group 2 (µ2), and group 3 (µ3) in the population; µ1 = µ2 = µ3. Ha: The mean dependent variables of group 1 (µ1), group 2 (µ2), and group 3 (µ3) are not all equal in the population.

Pearson correlation: H0: There is no correlation between the independent variable and the dependent variable in the population; ρ = 0. Ha: There is a correlation between the independent variable and the dependent variable in the population; ρ ≠ 0.

Simple linear regression: H0: There is no relationship between the independent variable and the dependent variable in the population; β1 = 0. Ha: There is a relationship between the independent variable and the dependent variable in the population; β1 ≠ 0.

Two-proportions z test: H0: The dependent variable expressed as a proportion does not differ between group 1 (p1) and group 2 (p2) in the population; p1 = p2. Ha: The dependent variable expressed as a proportion differs between group 1 (p1) and group 2 (p2) in the population; p1 ≠ p2.

Example 1: A teacher claims that the mean score of students in his class is greater than 82, with a (population) standard deviation of 20. If a sample of 81 students has a mean score of 90, check whether there is enough evidence to support this claim at the 0.05 significance level.
Solution: As the sample size is 81 and the population standard deviation is known, this is an example of a right-tailed one-sample z test.
H0: μ = 82, H1: μ > 82. From the z table, the critical value at α = 0.05 is 1.645.
x̄ = 90, μ = 82, n = 81, σ = 20, so z = (x̄ − μ) / (σ/√n) = (90 − 82) / (20/9) = 3.6.
As 3.6 > 1.645, the null hypothesis is rejected and it is concluded that
there is enough evidence to support the teacher's claim. ★Reject the null hypothesis.
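The same calculation in R; the numbers come from the example above:
##code##
x_bar <- 90; mu0 <- 82; sigma <- 20; n <- 81
z <- (x_bar - mu0) / (sigma / sqrt(n))
z                   # 3.6
qnorm(0.95)         # critical value at alpha = 0.05, about 1.645
z > qnorm(0.95)     # TRUE, so the null hypothesis is rejected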

The t-test is a statistical measure for determining whether the difference between two means, which may or may not be related, is significant. The testing uses randomly selected samples from the two categories or groups. For a single sample the statistic is

t = (m − μ) / (s / √n), where
t = Student's t statistic
m = sample mean
μ = theoretical value
s = standard deviation
n = variable set size
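A hedged sketch of this test with R's built-in t.test(); the scores vector is made-up illustration data, testing whether the mean exceeds 82 as in the earlier example:
##code##
scores <- c(88, 79, 91, 85, 77, 93, 84, 90)        # hypothetical sample
t.test(scores, mu = 82, alternative = "greater")   # one-sample, right-tailed t test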
Support Vector Machine Algorithm
The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional space
into classes so that we can easily put the new data point in
the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary, or hyperplane:

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but
we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as
the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line, and if there are 3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of the hyperplane are
termed as Support Vector. Since these vectors support the hyperplane, hence called a Support vector.

Support Vector Machines in R


set.seed(10111)
x = matrix(rnorm(40), 20, 2)
y = rep(c(-1, 1), c(10, 10))
x[y == 1, ] = x[y == 1, ] + 1
plot(x, col = y + 3, pch = 19)

library(e1071)
dat = data.frame(x, y = as.factor(y))
svmfit = svm(y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
print(svmfit)
plot(svmfit, dat)
Build Linear Model: The function used for building linear models is lm(). The lm() function takes two main arguments, namely: 1. Formula 2. Data. The data is typically a data.frame and the formula is an object of class formula. The most common convention is to write out the formula directly in place of the argument, as written below.

# build linear regression model on full data


linearMod <- lm(dist ~ speed, data=cars)
print(linearMod)

#> Call:
#> lm(formula = dist ~ speed, data = cars)
#>
#> Coefficients:
#> (Intercept) speed
#> -17.579 3.932

Now that we have built the linear model, we have also established the relationship between the predictor and the response in the form of a mathematical formula for distance (dist) as a function of speed. For the above output, the 'Coefficients' part has two components: Intercept: -17.579 and speed: 3.932. These are also called the beta coefficients.
In other words, dist = Intercept + (β ∗ speed) => dist = −17.579 + 3.932∗speed
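A brief sketch of using the fitted model for prediction; the value speed = 20 is an assumed illustration (cars is a built-in R dataset):
##code##
newdata <- data.frame(speed = 20)
predict(linearMod, newdata)   # roughly -17.579 + 3.932 * 20 = 61.1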
UNIT-III
Predictive analysis in the R language is a branch of analysis which uses statistical operations on historical data to predict future events. It is a common term used in data mining and machine learning. Methods like time series analysis, non-linear least squares, etc. are used in predictive analysis.
Process of Predictive Analysis: 1.Define project: Defining the project, scope, objectives and result.
2.Data collection: Data is collected through data mining providing a complete view of customer interactions.
3.Data Analysis: It is the process of cleaning, inspecting, transforming and modelling the data.
4.Statistics: This process enables validating the assumptions and testing the statistical models.
5.Modelling: Predictive models are generated using statistics and the most optimized model is used for deployment.
6.Deployment: The predictive model is deployed to automate the production of everyday decision-making results.
7.Model monitoring: Keep monitoring the model to review performance which ensures expected results.

# Weekly observations to be converted into a time series
x <- c(580, 7813, 28266, 59287, 75700, 87820, 95314, 126214, 218843, 471497, 936851, 1508725, 2072113)
library(lubridate)
png(file = "predictiveAnalysis.png")
# frequency = 365.25/7 defines a weekly series starting on 22 January 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")), frequency = 365.25 / 7)
plot(mts, xlab = "Weekly Data of sales", ylab = "Total Revenue", main = "Sales vs Revenue", col.main = "darkgreen")
dev.off()

Applications of Predictive Analysis


●Health care: Predictive analysis can be used to examine the history of a patient and thus determine the risks.
●Financial modelling: Financial modelling is another aspect where predictive analysis plays a major role, for example in finding trending stocks and helping the business in its decision-making process.
●Customer Relationship Management: Predictive analysis helps firms create marketing campaigns and customer services based on the analysis produced by the predictive algorithms.
●Risk Analysis: While forecasting campaigns, predictive analysis can show an estimation of profit and also helps in evaluating the risks.
Linear Regression: Is a statistical model that analyzes the relationship between a response variable (often called
y) and one or more variables and their interactions (often called x or explanatory variables).
➡In Linear Regression these two variables are related through an equation, where exponent (power) of both these
variables is 1. Mathematically a linear relationship represents a straight line when plotted as a graph. A non-linear
relationship where the exponent of any variable is not equal to 1 creates a curve.
The general equation is y = ax + b, where y = response variable, x = predictor variable, and a and b are constants (coefficients).
lm() Function: This function creates the relationship model between the predictor and the response variable.
Syntax: lm(formula, data), with parameters:
●formula is a symbol presenting the relation between x and y. ●data is the vector on which the formula will be applied.
##code##
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
print(relation)
Output:
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
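As a follow-up sketch, the fitted model can be used with predict(); the new height value 170 cm is an assumed illustration:
##code##
a <- data.frame(x = 170)
predict(relation, a)   # about -38.4551 + 0.6746 * 170 = 76.23 kg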

##code##
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
png(file = "linearregression.png")
plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",
ylab = "Height in cm")
dev.off()
Multiple Regression: Multiple regression is an extension of linear regression into relationship between more than
two variables. In simple linear relation we have one predictor and one response variable, but in multiple regression
we have more than one predictor variable and one response variable.
The general equation is − y = a + b1x1 + b2x2 +...bnxn — parameters used −
●y is the response variable ●a, b1, b2...bn are the coefficients ●x1, x2, ...xn are the predictor variables.
lm() Function: This function creates the relationship model between the predictor and the response variable.
Syntax: lm(y ~ x1+x2+x3...,data) — parameters used − ●formula is a symbol presenting the relation between the
response variable and predictor variables. ●data is the vector on which the formula will be applied.
##code##
> input <- mtcars[,c("mpg","disp","hp","wt")]
> print(head(input)) —————————————
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460
###code###
> input <- mtcars[,c("mpg","disp","hp","wt")]
> model <- lm(mpg~disp+hp+wt, data = input)
> print(model)
> cat("# # # # The Coefficient Values # # # ","\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
Output:
Call:
lm(formula = mpg ~ disp + hp + wt, data = input)
Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891
# # # # The Coefficient Values # # #
(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891
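Putting the coefficients together, the fitted equation is mpg = 37.105 − 0.000937*disp − 0.03116*hp − 3.8009*wt. A hedged sketch of using it for prediction; the disp, hp and wt values below are assumed purely for illustration:
##code##
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
predict(model, new_car)
# Hand calculation: 37.105505 - 0.000937*221 - 0.031157*102 - 3.800891*2.91, about 22.7 mpg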
###code###
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
Output: am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460

Create Regression Model: glm() function is used to create the regression model and get its summary for analysis.
##code##
input <- mtcars[,c("am","cyl","hp","wt")]
am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
Output:
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841
Number of Fisher Scoring iterations: 8
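Since the model is a binomial glm, predict() with type = "response" returns the estimated probability that am = 1; the cyl, hp and wt values here are assumed for illustration:
##code##
predict(am.data, data.frame(cyl = 4, hp = 95, wt = 2.2), type = "response")
# returns P(am = 1) for this hypothetical car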
UNIT-IV

R - Data Frames: A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
➡The column names should be non-empty. ➡The row names should be unique. ➡The data stored in a data frame can be of numeric, factor or character type. ➡Each column should contain the same number of data items.

##Create Data Frame


# Create the data frame.
emp.data <- data.frame( emp_id = c (1:5), emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),
start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
stringsAsFactors = FALSE )
print(emp.data)
Output:
emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27

➔ The structure of the data frame can be seen by using str() function.
➔ The statistical summary and nature of the data can be obtained by applying summary() function.
➔ To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as
the existing data frame and use the rbind() function.
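A short sketch of these three operations on the emp.data frame created above; the two new employee records are assumed purely for illustration:
##code##
str(emp.data)       # structure: variable names, types and first values
summary(emp.data)   # statistical summary of each column
# New rows must have the same structure as the existing data frame before rbind()
emp.newdata <- data.frame(
  emp_id = c(6, 7),
  emp_name = c("Rasmi", "Pranab"),
  salary = c(578.0, 722.5),
  start_date = as.Date(c("2013-05-21", "2013-07-30")),
  stringsAsFactors = FALSE
)
emp.final <- rbind(emp.data, emp.newdata)
print(emp.final)    # now contains 7 rows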
What is R Programming
"R is an interpreted computer programming language which was created by Ross Ihaka and Robert Gentleman at
the University of Auckland, New Zealand." The R Development Core Team currently develops R. It is also a software
environment used to analyze statistical information, graphical representation, reporting, and data modeling. R is the
implementation of the S programming language, which is combined with lexical scoping semantics.

R vs Python

R: "R is an interpreted computer programming language which was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand." The R Development Core Team currently develops R. R is also a software environment which is used to analyze statistical information, graphical representation, reporting, and data modeling.
Python: Python is an interpreted, high-level programming language used for general-purpose programming. Guido van Rossum created it, and it was first released in 1991. Python has a very simple and clean code syntax. It emphasizes code readability, and debugging is also simple and easy in Python.

R: R packages offer advanced techniques which are very useful for statistical work. The CRAN task views are provided by many useful R packages; these packages cover everything from psychometrics to genetics to finance.
Python: For finding outliers in a data set, both R and Python are equally good. But for developing a web service to allow people to upload datasets and find outliers, Python is better.

R: For data analysis, R has inbuilt functionalities.
Python: Most of the data analysis functionalities are not inbuilt; they are available through packages such as NumPy and Pandas.

R: Data visualization is a key aspect of analysis. R packages such as ggplot2, ggvis, lattice, etc. make data visualization easier.
Python: Python is better for deep learning because Python packages such as Caffe, Keras, OpenNN, etc. allow the development of deep neural networks in a very simple way.

R: There are hundreds of packages and ways to accomplish the needed data science tasks.
Python: Python has a few main packages, namely scikit-learn and Pandas, for machine learning and data analysis respectively.
Arithmetic Operators
Table shows the arithmetic operators supported by the R language. The operators act on each element of the vector.
v <- c(2, 5.5, 6)
t <- c(8, 3, 4)

(+) Adds two vectors
print(v+t)
[1] 10.0 8.5 10.0

(−) Subtracts the second vector from the first
print(v-t)
[1] -6.0 2.5 2.0

(*) Multiplies both vectors
print(v*t)
[1] 16.0 16.5 24.0

(/) Divides the first vector by the second
print(v/t)
[1] 0.250000 1.833333 1.500000

(%%) Gives the remainder of the first vector divided by the second
print(v%%t)
[1] 2.0 2.5 2.0

(%/%) Gives the quotient of the division of the first vector by the second
print(v%/%t)
[1] 0 1 1

(^) Raises the first vector to the exponent of the second vector
print(v^t)
[1] 256.000 166.375 1296.000

Logical Operators:

& : Element-wise Logical AND operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if both elements are TRUE (non-zero values count as TRUE).
v <- c(3, 1, TRUE, 2+3i)
t <- c(4, 1, FALSE, 2+3i)
print(v & t)
[1] TRUE TRUE FALSE TRUE

| : Element-wise Logical OR operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if at least one of the elements is TRUE.
v <- c(3, 0, TRUE, 2+2i)
t <- c(4, 0, FALSE, 2+3i)
print(v | t)
[1] TRUE FALSE TRUE TRUE

! : Logical NOT operator. Takes each element of the vector and gives the opposite logical value.
v <- c(3, 0, TRUE, 2+2i)
print(!v)
[1] FALSE TRUE FALSE FALSE

&& and || consider only the first element of each vector and give a vector of a single element as output. (Note: in R 4.3 and later, && and || signal an error if an operand has length greater than one; the examples below show the older behaviour of silently using only the first element.)

&& : Logical AND operator. Takes the first element of both vectors and gives TRUE only if both are TRUE.
v <- c(3, 0, TRUE, 2+2i)
t <- c(1, 3, TRUE, 2+3i)
print(v && t)
[1] TRUE

|| : Logical OR operator. Takes the first element of both vectors and gives TRUE if one of them is TRUE.
v <- c(0, 0, TRUE, 2+2i)
t <- c(0, 3, TRUE, 2+3i)
print(v || t)
[1] FALSE
R - Matrices: A matrix is created using the matrix() function.
Syntax: matrix(data, nrow, ncol, byrow, dimnames), with parameters:
●data is the input vector which becomes the data elements of the matrix. ●nrow is the number of rows to be created.
●ncol is the number of columns to be created. ●byrow is a logical value; if TRUE, the input vector elements are arranged by row. ●dimnames is the names assigned to the rows and columns.
Create a matrix taking a vector of numbers as input.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
Output:
     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    6    7    8
[3,]    9   10   11
[4,]   12   13   14
     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14
     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14

Matrix Addition & Subtraction
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
result <- matrix1 + matrix2
cat("Result of addition","\n")
print(result)
result <- matrix1 - matrix2
cat("Result of subtraction","\n")
print(result)
Output:
     [,1] [,2] [,3]
[1,]    3   -1    2
[2,]    9    4    6
     [,1] [,2] [,3]
[1,]    5    0    3
[2,]    2    9    4
Result of addition
     [,1] [,2] [,3]
[1,]    8   -1    5
[2,]   11   13   10
Result of subtraction
     [,1] [,2] [,3]
[1,]   -2   -1   -1
[2,]    7   -5    2

Matrix Multiplication & Division
matrix1 <- matrix(c(3, 9, -1, 4, 2, 6), nrow = 2)
print(matrix1)
matrix2 <- matrix(c(5, 2, 0, 9, 3, 4), nrow = 2)
print(matrix2)
result <- matrix1 * matrix2
cat("Result of multiplication","\n")
print(result)
result <- matrix1 / matrix2
cat("Result of division","\n")
print(result)
Output:
     [,1] [,2] [,3]
[1,]    3   -1    2
[2,]    9    4    6
     [,1] [,2] [,3]
[1,]    5    0    3
[2,]    2    9    4
Result of multiplication
     [,1] [,2] [,3]
[1,]   15    0    6
[2,]   18   36   24
Result of division
     [,1]      [,2]      [,3]
[1,]  0.6      -Inf 0.6666667
[2,]  4.5 0.4444444 1.5000000

Simulation is a method used to examine the “what if” without having real data. We just make it up! We can use
pre-programmed functions in R to simulate data from different probability distributions or we can design our own
functions to simulate data from distributions not available in R.
➡In a simulation, you set the ground rules of a random process and then the computer uses random numbers to
generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the
following commands.
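A minimal sketch of such a coin-flip simulation (the commands referred to above), using sample() with equal probabilities for heads and tails:
##code##
set.seed(2024)
sample(c("H", "T"), size = 10, replace = TRUE, prob = c(0.5, 0.5))   # 10 flips of a fair coin
rbinom(1, size = 10, prob = 0.5)                                     # or: count of heads in 10 flips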
Control statements are expressions used to control the execution and flow of the program based on the conditions provided in the statements. These structures are used to make a decision after assessing a variable. In R programming, there are 8 types of control statements, with a short example after the list:
➡if condition ➡if-else condition ➡for loop ➡nested loops ➡while loop ➡repeat and break statement
➡return statement ➡next statement
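A small sketch combining a few of these control statements:
##code##
x <- 7
if (x %% 2 == 0) {
  print("even")
} else {
  print("odd")          # [1] "odd"
}
for (i in 1:5) {
  if (i == 3) next      # skip 3 and continue with the next iteration
  if (i == 5) break     # leave the loop before printing 5
  print(i)              # prints 1, 2 and 4
}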

Debugging in R

➔ traceback(): If our code has already crashed and we want to know where the offending line is, try traceback(). This will (sometimes) show the location of the problem somewhere in the code. When an R function fails, an error is printed on the screen. Immediately after the error, we can call traceback() to see in which function the error occurred. The traceback() function prints the list of functions which were called before the error occurred, in reverse order.
➔ debug(): In R, the debug() function allows the user to step through the execution of a function. At any point, we can print the values of the variables or draw a graph of the results within the function. While debugging, we can type "c" to continue to the end of the current block of code. traceback() does not tell us where inside the function the error occurred; to know which line is causing the error, we have to step through the function using debug().
➔ trace(): The trace() function allows the user to insert bits of code into a function. The syntax of trace() is a bit awkward for first-time users, so it may be easier to use debug().
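A minimal sketch of this workflow; f and g are made-up functions whose only purpose is to raise an error (the failing calls are commented out so the snippet can be pasted safely):
##code##
f <- function(x) g(x)
g <- function(x) log(x) + "a"   # adding a character string raises an error
# f(10)        # Error in log(x) + "a" : non-numeric argument to binary operator
# traceback()  # lists g(x) and then f(10), i.e. the call stack in reverse order
# debug(f); f(10)   # steps through f line by line; undebug(f) switches it off again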
R - Functions: An R function is created by using the keyword function.
Syntax: function_name <- function(arg_1, arg_2, ...) { Function body }
Built-in Function: Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc. They are directly called by user written programs.
##code##
print(seq(32,44))
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
Output:
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
————————————————————————
Calling a Function
##code##
new.function <- function(a) {
  for(i in 1:a) {
    b <- i^2
    print(b)
  }
}
# Call the function new.function supplying 6 as an argument.
new.function(6)
Output:
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
————————————————————————
Calling a Function with Argument Values (by position and by name): The arguments to a function call can be supplied in the same sequence as defined in the function, or they can be supplied in a different sequence but assigned to the names of the arguments.
##code##
new.function <- function(a,b,c) {
  result <- a * b + c
  print(result)
}
# Call the function by position of arguments.
new.function(5,3,11)
# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
Output:
[1] 26
[1] 58
————————————————————————
Calling a Function with Default Argument: We can define the values of the arguments in the function definition and call the function without supplying any argument to get the default result. But we can also call such a function by supplying new values of the arguments and get a non-default result.
##code##
new.function <- function(a = 3, b = 6) {
  result <- a * b
  print(result)
}
# Call the function without giving any argument.
new.function()
# Call the function with giving new values of the argument.
new.function(9,5)
Output:
[1] 18
[1] 45
————————————————————————
Lazy Evaluation of Function: Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the function body.
##code##
new.function <- function(a, b) {
  print(a^2)
  print(a)
  print(b)
}
new.function(6)
Output:
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
UNIT-V
Performance Evaluation Measures for Classification Models:
Confusion Matrix: The confusion matrix usually causes a lot of confusion, even for those who use it regularly. The terms used in defining a confusion matrix are TP, TN, FP, and FN.
Use case: Let’s take an example of a patient who has gone to a doctor with certain symptoms. Since it’s the season
of Covid, let’s assume that he went with fever, cough, throat ache, and cold. These are symptoms that can occur
during any seasonal changes too. Hence, it is tricky for the doctor to do the right diagnosis.
➔ True Positive (TP): Let’s say the patient was actually suffering from Covid and on doing the required
assessment, the doctor classified him as a Covid patient. This is called TP or True Positive.
➔ False Positive (FP): Let's say the patient was not suffering from Covid and he was only showing symptoms of seasonal flu, but the doctor diagnosed him with Covid. This is called FP or False Positive, also known as a Type I Error.
➔ True Negative (TN): Let’s say the patient was not suffering from Covid and the doctor also gave him a clean
chit. This is called TN or True Negative.
➔ False Negative (FN): Let’s say the patient was suffering from Covid and the doctor did not diagnose him with
Covid. This is called FN or False Negative as the case was actually positive but was falsely classified as
negative. This is also called Type II Error.
Accuracy: Accuracy = (TP + TN) / (TP + FP +TN + FN) This term tells us how many right classifications were
made out of all the classifications. In other words, how many TPs and TNs were done out of TP + TN + FP + FNs.
It tells the ratio of “True”s to the sum of “True”s and “False”s.
Precision: Precision = TP / (TP + FP) Out of all the cases that were marked as positive, how many are actually truly positive. Use case: Consider a model that marks emails as spam or not. Here, if emails that are of importance get marked as positive (spam), then useful emails will end up in the "Spam" folder, which is dangerous. Hence, the model with the least FP value needs to be selected; equivalently, the model with the highest precision should be selected among all models.
Recall or Sensitivity: Recall = TP / (TP + FN) Out of all the actual real positive cases, how many were identified as positive. Use case: Out of all the actual Covid patients who visited the doctor, how many were actually diagnosed as Covid positive. Hence, the model with the least FN value needs to be selected.
Specificity: Specificity = TN/ (TN + FP) Out of all the real negative cases, how many were identified as negative.
Use case: Out of all the non-Covid patients who visited the doctor, how many were diagnosed as non-Covid.
F1-Score: F1 score = 2* (Precision * Recall) / (Precision + Recall) As we saw above, sometimes we need to
give weightage to FP and sometimes to FN. F1 score is a weighted average of Precision and Recall, which means
there is equal importance given to FP and FN. ●This is a very useful metric compared to “Accuracy”. The problem
with using accuracy is that if we have a highly imbalanced dataset for training (a training dataset with 95% positive
class and 5% negative class), the model will end up learning how to predict the positive class properly and will not
learn how to identify the negative class.
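A small sketch computing all of these metrics from an assumed (hypothetical) set of confusion-matrix counts:
##code##
TP <- 90; FP <- 10; TN <- 80; FN <- 20      # made-up counts for illustration
accuracy    <- (TP + TN) / (TP + FP + TN + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)               # sensitivity
specificity <- TN / (TN + FP)
f1          <- 2 * precision * recall / (precision + recall)
c(accuracy = accuracy, precision = precision, recall = recall,
  specificity = specificity, f1 = f1)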
Area Under Curve (AUC) and ROC Curve:
AUC [Area Under Curve] is used in conjunction with the ROC [Receiver Operating Characteristic] Curve. AUC is the area under the ROC curve.
➡A ROC curve is drawn by plotting TPR [True Positive Rate], also called Recall or Sensitivity, on the y-axis against FPR [False Positive Rate] on the x-axis.
FPR = 1 − Specificity
TPR = TP / (TP + FN)
FPR = 1 − TN / (TN + FP) = FP / (TN + FP)
★A good model has an AUC close to 1. When we say a model has a high AUC score, it means the model's ability to separate the classes is very high (high separability). This is a very important metric that should be checked while selecting a classification model.
K-Nearest Neighbor or K-NN is a supervised, non-linear classification algorithm. K-NN is a non-parametric algorithm, i.e. it doesn't make any assumption about the underlying data or its distribution. It is one of the simplest and most widely used algorithms; it depends on its k value (the number of neighbours) and finds applications in many industries like the finance industry, healthcare industry, etc.
Algorithm: ●Choose the number K of neighbours. ●Take the K nearest neighbours of the unknown data point according to distance.
●Among the K neighbours, count the number of data points in each category.
●Assign the new data point to the category in which you counted the most neighbours.
Features: ●KNN is a supervised learning algorithm that uses a labeled input data set to predict the output for data points.
●It is one of the simplest machine learning algorithms and it can be easily implemented for a varied set of problems.
●It is mainly based on feature similarity. KNN checks how similar a data point is to its neighbours and classifies the data point into the class it is most similar to. ●Unlike most algorithms, KNN is a non-parametric model, which means that it does not make any assumptions about the data set. This makes the algorithm more effective, since it can handle realistic data. ●KNN can be used for solving both classification and regression problems.
●KNN is a lazy algorithm; this means that it memorizes the training data set instead of learning a discriminative function from the training data.

K-Means Clustering in R Programming


K-Means Clustering in R Programming is an unsupervised, non-linear algorithm that clusters data into similar groups based on similarity. It seeks to partition the observations into a pre-specified number of clusters. Segmentation of the data takes place to assign each training example to a segment called a cluster. Because the algorithm is unsupervised, it relies heavily on the raw data, and a large expenditure on manual review is needed to check the relevance of the results. It is used in a variety of fields like banking, healthcare, retail, media, etc.
Advantages: ●The labeled data isn’t required. Since so much real-world data is unlabeled, as a result, it is
frequently utilized in a variety of real-world problem statements. ●It is easy to implement. ●It can handle massive
amounts of data. ●When data is large, it works faster than hierarchical clustering (for small k).
Disadvantages: ●The K value is required to be selected manually, for example using the "elbow method".
●The presence of outliers has an adverse impact on the clustering; as a result, outliers must be eliminated before using k-means clustering. ●Clusters do not overlap; a point may only belong to one cluster at a time. As a result of the lack of overlapping, certain points may be placed in incorrect clusters.
K-means algorithm:
1. Specify the number of clusters (K) to be created (by the analyst)
2. Select randomly k objects from the data set as the initial cluster centers or means
3. Assigns each observation to their closest centroid, based on the Euclidean distance between the object and the
centroid
4. For each of the k clusters update the cluster centroid by calculating the new mean values of all the data points in the
cluster. The centroid of a Kth cluster is a vector of length p containing the means of all variables for the observations
in the kth cluster; p is the number of variables.
5. Iteratively minimize the total within-cluster sum of squares: iterate steps 3 and 4 until the cluster assignments stop changing or the maximum number of iterations is reached. By default, the R software uses 10 as the default value for the maximum number of iterations.
### KNN code###
#Import the dataset
loan <- read.csv("C:/Users/zulaikha/Desktop/DATASETS/knn dataset/credit_data.csv")
loan.subset <- loan[c('Creditability', 'Age..years.','Sex...Marital.Status', 'Occupation','Account.Balance',
'Credit.Amount', 'Length.of.current.employment', 'Purpose')]
head(loan.subset)
Creditability Age..years. Sex...Marital.Status Occupation Account.Balance Credit.Amount
1 1 21 2 3 1 1049
2 1 36 3 3 1 2799
3 1 23 2 2 2 841
4 1 39 3 2 1 2122
5 1 38 3 2 1 2171
6 1 48 3 2 1 2241
#Normalization
normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }
loan.subset.n <- as.data.frame(lapply(loan.subset[,2:8], normalize))
set.seed(123)
dat.d <- sample(1:nrow(loan.subset.n), size = nrow(loan.subset.n)*0.7, replace = FALSE)
train.loan <- loan.subset.n[dat.d,]            # 70% training data (normalized predictors)
test.loan <- loan.subset.n[-dat.d,]            # remaining 30% test data
train.loan_labels <- loan.subset[dat.d, 1]     # Creditability labels for the training rows
test.loan_labels <- loan.subset[-dat.d, 1]     # Creditability labels for the test rows

#Install and load the class package, which provides knn()
install.packages('class')
library(class)
knn.26 <- knn(train = train.loan, test = test.loan, cl = train.loan_labels, k = 26)
knn.27 <- knn(train = train.loan, test = test.loan, cl = train.loan_labels, k = 27)
#Calculate the proportion of correct classification for k = 26, 27
ACC.26 <- 100 * sum(test.loan_labels == knn.26)/NROW(test.loan_labels)
ACC.27 <- 100 * sum(test.loan_labels == knn.27)/NROW(test.loan_labels)
ACC.26    # [1] 67.66667
ACC.27    # [1] 67.33333

#Accuracy plot: k.optm stores the accuracy obtained for each candidate k
k.optm <- numeric(28)
for (i in 1:28) {
  knn.mod <- knn(train = train.loan, test = test.loan, cl = train.loan_labels, k = i)
  k.optm[i] <- 100 * sum(test.loan_labels == knn.mod)/NROW(test.loan_labels)
}
plot(k.optm, type = "b", xlab = "K- Value", ylab = "Accuracy level")


## K-Means code##
# fviz_cluster() below comes from the factoextra package; df is assumed to be a
# numeric, scaled data frame (for example, df <- scale(USArrests))
library(factoextra)
k2 <- kmeans(df, centers = 2, nstart = 25)
k3 <- kmeans(df, centers = 3, nstart = 25)
k4 <- kmeans(df, centers = 4, nstart = 25)
k5 <- kmeans(df, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = df) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = df) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = df) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = df) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)
R - Time Series Analysis: Time series is a series of data points in which each data point is associated with a
timestamp. A simple example is the price of a stock in the stock market at different points of time on a given day.
Another example is the amount of rainfall in a region at different months of the year.
Syntax timeseries.object.name <- ts(data, start, end, frequency) parameters used −
➡data is a vector or matrix containing the values used in the time series.
➡start specifies the start time for the first observation in time series.
➡end specifies the end time for the last observation in time series.
➡frequency specifies the number of observations per unit time.

# Get the data points in form of a R vector.


rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)
print(rainfall.timeseries)
png(file = "rainfall.png")
plot(rainfall.timeseries)
dev.off()

The following result and chart are produced:

        Jan    Feb   Mar    Apr   May   Jun   Jul   Aug   Sep   Oct   Nov    Dec
2012  799.0 1174.8 865.1 1334.6 635.4 918.5 685.5 998.6 784.2 985.0 882.8 1071.0

The time series chart is saved to rainfall.png.

Social network analysis in R: Social Network Analysis (SNA) is the process of exploring a social structure by using graph theory. It is mainly used for measuring and analyzing the structural properties of a network.
The library() function loads and attaches add-on packages.
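A minimal SNA sketch, assuming the igraph package is installed; the four people A-D and their ties are made up for illustration:
##code##
library(igraph)
edges <- data.frame(from = c("A", "A", "B", "C"),
                    to   = c("B", "C", "C", "D"))
g <- graph_from_data_frame(edges, directed = FALSE)   # build the social graph
degree(g)    # number of connections per person
plot(g)      # draw the network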
