Unit-15 Data Analysis and R
15.1 INTRODUCTION
This unit introduces the concept of data analysis and shows how to carry it out
using R programming. It discusses various tests and techniques for operating on
data in R and drawing insights from it. The unit covers the Chi-Square Test,
its significance and its application in R with the help of an example. It
also familiarises you with the concept of Regression Analysis and its types,
Simple Linear and Multiple Linear Regression, followed by Logistic
Regression. These are substantiated with examples in R that explain the steps,
functions and syntax to use. The unit also explains how to interpret the
output and visualise the data. Subsequently, it explains the concept of
Time Series Analysis and how to run it in R. It also discusses Stationary
Time Series, the extraction of trend, seasonality and error, and how to
create lags of a time series in R.
15.2 OBJECTIVES
After going through this Unit, you will be able to:
• run tests and techniques on data and interpret the results using R;
• explain the association between two categorical variables in a dataset by
running the Chi-Square Test in R;
• explain the concept of Regression Analysis and distinguish between its
types, Simple Linear and Multiple Linear;
• build relationship models in R to plot and interpret the data, and
use them to predict unknown variable values;
• explain the concept of Logistic Regression and its application in R;
• explain Time Series Analysis and the special case of Stationary
Time Series;
• explain the extraction of trend, seasonality and error, and how to create
lags of a time series in R.
Basics of R Programming
population and be categorical in nature, such as top/bottom, True/False or
Black/White.
Syntax of a chi-square test: chisq.test(data), where data is a table of counts
for the variables being compared.
EXAMPLE:
Consider the Cars93 dataset from the "MASS" library that ships with R. It
describes 93 car models that were on sale in the USA in 1993.
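A minimal sketch of the test on this dataset is given below. The choice of the AirBags and Type columns is an assumption made for illustration, as is the variable name carData; any two categorical columns of Cars93 could be used in the same way.

```r
library(MASS)  # provides the Cars93 data frame

# Cross-tabulate two categorical columns (assumed choice): air bag type and car type
carData <- table(Cars93$AirBags, Cars93$Type)
print(carData)

# Several cells hold small counts, so chisq.test() warns that the
# chi-squared approximation may be inaccurate; the warning is suppressed
# here only to keep the output short
chiResult <- suppressWarnings(chisq.test(carData))
print(chiResult)

# A p-value below 0.05 is conventionally read as evidence of an association
print(chiResult$p.value < 0.05)
```

The printed result reports the test statistic, the degrees of freedom and the p-value; the degrees of freedom equal (rows - 1) × (columns - 1) of the table.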
The chi-square test is one of the most useful tests for finding relationships
between categorical variables.
How can you find the relationship between two scale or numeric variables using
R? One technique that helps in establishing a model-based relationship
is regression, which is discussed next.
Input Data
Below is the sample data with observations of weight and height, which were
collected experimentally. The data is input as shown in Figure 15.4.
Summary of the relationship:
Predict function:
The predict() function is used to predict the weight of a new person from the
fitted model.
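The steps of fitting, summarising and predicting can be sketched as follows. The height and weight values are illustrative and are assumed for this sketch; they are not necessarily the observations shown in the figures. The names relation, newPerson and predictedWeight are likewise hypothetical.

```r
# Illustrative observations (assumed values, not taken from the figures)
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # in cm
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # in kg

# Fit a simple linear regression: weight is the response, height the predictor
relation <- lm(weight ~ height)

# Summary of the relationship: coefficients, residuals, R-squared, p-values
print(summary(relation))

# Predict the weight of a new person; the data frame column must carry
# the same name as the predictor used in the formula
newPerson <- data.frame(height = 170)
predictedWeight <- predict(relation, newPerson)
print(predictedWeight)
```

In the summary output, the Estimate column gives the intercept and the slope of the fitted line, and the p-value of the slope indicates whether height is a significant predictor of weight.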
Plot for Visualization: Finally, you may plot these values by setting the plot
title and axis titles (see Figure 15.8). The linear regression line is shown in
Figure 15.3.
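A plot of this kind could be produced along the following lines. The data values, colours and titles are assumptions chosen for illustration; only plot(), lm() and abline() from base R are used.

```r
# Illustrative observations (assumed values, not taken from the figures)
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # in cm
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # in kg
relation <- lm(weight ~ height)

# Scatter plot of the observations with a plot title and axis titles
plot(height, weight,
     col = "blue", pch = 16,
     main = "Height and Weight Regression",
     xlab = "Height in cm", ylab = "Weight in kg")

# Overlay the fitted regression line
abline(relation, col = "red")
```

When the script is run non-interactively, R writes the plot to a file (Rplots.pdf by default) instead of opening a window.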
Simple linear regression has one response variable and one predictor variable.
However, in many practical cases there can be more than one predictor
variable. This is the case of multiple regression, which is discussed next.
Input Data
Let's take the R inbuilt data set "mtcars", which compares various car models
on the basis of mileage per gallon ("mpg"), cylinder displacement ("disp"),
horse power ("hp"), weight of the car ("wt") and more.
The aim is to establish the relationship of mpg (the response variable) with the
predictor variables (disp, hp, wt). The head function, as used in Figure 15.9,
shows the first few rows of the dataset (six by default).
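A short sketch of this step is shown here; the variable name input is an assumption made for illustration.

```r
# Keep only the response (mpg) and the three predictors of interest
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# head() returns the first six rows unless a different n is given
print(head(input))

# Pass n explicitly to see a different number of rows
print(head(input, n = 5))
```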
Creating Equation for Regression Model: Based on the intercept and coefficient
values, one can create the mathematical equation as follows:
$$Y = a + b \times x_{disp} + c \times x_{hp} + d \times x_{wt}$$
or
$$Y = 37.1055 - 0.000937 \times x_{disp} - 0.0312 \times x_{hp} - 3.8009 \times x_{wt}$$
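The coefficients of such an equation can be obtained in R roughly as follows. The names input, model, coefs and newCar, and the values chosen for the new car, are hypothetical and serve only to show that predict() and the hand-written equation agree.

```r
# Response and predictors taken from the built-in mtcars data set
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Multiple linear regression of mpg on disp, hp and wt
model <- lm(mpg ~ disp + hp + wt, data = input)

# Intercept (a) and coefficients (b, c, d) of the equation
coefs <- coef(model)
print(coefs)

# Hypothetical new car used only for illustration
newCar <- data.frame(disp = 221, hp = 102, wt = 2.91)

# Evaluate Y = a + b*x_disp + c*x_hp + d*x_wt by hand ...
manualMpg <- coefs[["(Intercept)"]] +
  coefs[["disp"]] * newCar$disp +
  coefs[["hp"]] * newCar$hp +
  coefs[["wt"]] * newCar$wt

# ... and compare it with the value returned by predict()
predictedMpg <- predict(model, newCar)
print(c(manual = manualMpg, predicted = unname(predictedMpg)))
```

All three coefficients are negative, so the model estimates lower mileage for cars with larger displacement, more horse power or greater weight.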
Input: Import the data set and then use the ts() function.
The steps to use the function are given below. Note that the input values
should ideally be a numeric vector belonging to the "numeric" or "integer"
class.
The following function will generate a quarterly data series starting from the
second quarter of 1959:
ts(inputData, frequency = 4, start = c(1959, 2)) # frequency 4 => quarterly data
The following function will generate a monthly data series starting from
January 1990:
ts(1:10, frequency = 12, start = 1990) # frequency 12 => monthly data
The following function will generate a yearly data series from 2009 to 2014:
ts(inputData, start = c(2009), end = c(2014), frequency = 1) # yearly data
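The three calls can be tried out with a small numeric vector, as sketched below. The values in inputData are made up for illustration, and the names quarterly, monthly and yearly are hypothetical.

```r
# Made-up numeric input vector (24 values) used only for illustration
inputData <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
               115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140)

# Quarterly series starting in the second quarter of 1959
quarterly <- ts(inputData, frequency = 4, start = c(1959, 2))

# Monthly series of ten values starting in January 1990
monthly <- ts(1:10, frequency = 12, start = 1990)

# Yearly series from 2009 to 2014; only the first six values are used
yearly <- ts(inputData, start = c(2009), end = c(2014), frequency = 1)

print(quarterly)
print(monthly)
print(yearly)

# start(), end() and frequency() report the time attributes of a series
print(start(quarterly))
print(frequency(monthly))
print(end(yearly))
```

Printing a quarterly or monthly series lays the values out in a calendar-style grid, which is a quick way to confirm that the start and frequency were set as intended.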
In an Additive Time Series, the seasonal component (S), the trend component (T)
and the error component (e) at time t are added together:
$$Y_t = S_t + T_t + e_t$$
In a Multiplicative Time Series, the components are multiplied:
$$Y_t = S_t \times T_t \times e_t$$
A multiplicative time series can be converted into an additive time series by
applying the log function to it, as represented below:
$$\text{additiveTS} = \log(\text{multiplicativeTS})$$
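The effect of the log transformation can be checked on a small synthetic series, as sketched below. The three component vectors are invented for illustration; the point is only that the log of the product equals the sum of the logs of the components.

```r
# Invented components of a quarterly series (illustrative values only)
trendPart  <- seq(100, 195, length.out = 24)       # steadily rising level
seasonPart <- rep(c(0.9, 1.0, 1.2, 0.9), times = 6) # repeating quarterly factor
errorPart  <- rep(c(1.01, 0.99), times = 12)        # small multiplicative error

# Multiplicative series: Y = S * T * e
multiplicativeTS <- ts(seasonPart * trendPart * errorPart,
                       frequency = 4, start = c(2000, 1))

# Taking logs turns the product into a sum: log(Y) = log(S) + log(T) + log(e)
additiveTS <- log(multiplicativeTS)

sumOfLogs <- log(seasonPart) + log(trendPart) + log(errorPart)
print(all.equal(as.numeric(additiveTS), sumOfLogs))
```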
A time series is said to be stationary when the following conditions hold:
1. The mean value of the time series remains constant over time, which implies
that the trend component is removed.
2. The variance does not increase over time.
3. Seasonality has a minor impact.
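The condition on the mean can be illustrated with a small sketch. It uses differencing with diff(), which is one common way of removing a trend and is named here as an assumption rather than as the method prescribed by this unit. The series itself is invented for illustration.

```r
# Invented series with a clear upward trend (illustrative values only)
timeIndex <- 1:20
trendingTS <- ts(5 + 2 * timeIndex + rep(c(1, -1), times = 10))

# Compare the mean of the first and second half of a series
halfMeans <- function(x) {
  x <- as.numeric(x)
  half <- floor(length(x) / 2)
  c(first = mean(x[1:half]), second = mean(x[(half + 1):length(x)]))
}

# The mean changes strongly over time, so the series is not stationary
print(halfMeans(trendingTS))

# First differences remove the linear trend
differencedTS <- diff(trendingTS)

# The mean of the differenced series stays roughly constant over time
print(halfMeans(differencedTS))
```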
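timeSeriesData = EuStockMarkets[, 1] # first column of the built-in data set
resultofDecompose = decompose(timeSeriesData, type = "mult") # multiplicative decomposition
plot(resultofDecompose) # observed, trend, seasonal and random panels
resultsofSt1 = stl(timeSeriesData, s.window = "periodic") # decomposition using stl()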
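The individual components can then be pulled out of the decomposition, and a lagged copy of the series can be created, roughly as follows. The names trendPart, seasonalPart, errorPart and laggedSeries are hypothetical. The call is written as stats::lag() so that it is not confused with functions of the same name in other packages.

```r
timeSeriesData <- EuStockMarkets[, 1]
resultofDecompose <- decompose(timeSeriesData, type = "mult")

# Components returned by decompose()
trendPart    <- resultofDecompose$trend     # NA at both ends (moving average)
seasonalPart <- resultofDecompose$seasonal
errorPart    <- resultofDecompose$random

# For a multiplicative decomposition the product of the three components
# reproduces the observed series wherever the trend is available
available <- !is.na(trendPart)
rebuilt <- as.numeric(trendPart * seasonalPart * errorPart)
print(all.equal(rebuilt[available], as.numeric(timeSeriesData)[available]))

# Lag of a time series: a negative k shifts the series forward in time,
# so each value is aligned with a time stamp three observations later
laggedSeries <- stats::lag(timeSeriesData, k = -3)
print(start(timeSeriesData))
print(start(laggedSeries))
```

The values of the lagged series are unchanged; only the time stamps attached to them move.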
15.8 SUMMARY
This unit introduces the concept of data analysis and examines its application
using R programming. It explains the Chi-Square Test, which is used to
determine whether two categorical variables are significantly associated, and
studies its application in R. The unit explains Regression Analysis, which is
a common statistical technique for establishing a relationship model between
two variables: a predictor variable and a response variable. It further explains
the various models in Regression Analysis, including Linear and Logistic
Regression. In Linear Regression, the two variables are related through
an equation of degree one, and a straight line is used to explain the
relationship between them. It is categorised into two types: Simple Linear
Regression, which uses only one independent variable, and Multiple Linear
Regression, which uses two or more independent variables. After
regression, the unit proceeds to logistic regression, which
is a classification algorithm for determining the probability of event success and
failure. It is also known as Binomial logistic regression and is based on the sigmoid
function, with an input ranging from -∞ to +∞ and a probability as the output. At the end,
the unit introduces the concept of time series analysis and helps you understand its
application and usage in R. It also discusses the special case of Stationary Time Series
and how to make a time series stationary. Finally, it explains how to extract
the trend, seasonality and error of a time series in R and how to create lags of a time
series.
15.9 ANSWERS
Check your Progress 1
1. A regression model that employs a straight line to explain the relationship
between variables is known as linear regression. In Linear Regression, the
response and predictor variables are related through an equation in which the
exponent (power) of both variables is one. It searches for the values of the
regression coefficient(s) that minimise the total error of the model, which
gives the line of best fit through the data.
2. The Chi-square test of independence determines whether there is a
statistically significant relationship between categorical variables. It is a
hypothesis test that answers the question: do the values of one categorical
variable depend on the values of the other categorical variable?
3. Simple linear regression uses one predictor variable to explain the response
variable, whereas multiple regression uses two or more predictor variables.