Unit-4 Big Data Analytics Methods using R

Uploaded by Kalighat Okira
© All Rights Reserved

Unit-4

Big Data Analytics Methods using R
Contents

• Introduction to R-Attributes
• R Graphical user interfaces
• Data import and export
• Attribute and Data Types
• Descriptive Statistics
• Exploratory Data Analysis.
R Overview

• R is a comprehensive statistical and graphical programming language and a dialect of the S language:

1988 - S2: R. A. Becker, J. M. Chambers, A. Wilks
1992 - S3: J. M. Chambers, T. J. Hastie
1998 - S4: J. M. Chambers

• R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of the University of Auckland, New Zealand, during the 1990s.
R Overview

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
Among other things it has

• an effective data handling and storage facility,
• a suite of operators for calculations on arrays, in particular matrices,
• a large, coherent, integrated collection of intermediate tools for data analysis,
• graphical facilities for data analysis and display, either directly at the computer or on hardcopy, and
• a well-developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user-defined recursive functions and input and output facilities.
How to download?

• CRAN (Comprehensive R Archive Network)

https://www.r-project.org
Data Types and Objects in R

• Variables are reserved memory locations to store values, i.e., when we create a variable we allocate some memory space.

• In R, a variable is an object. An object is a data structure with attributes and methods that are applied to those attributes.

• Variables can be broadly divided into two types:

o Numerical
o Character

• Character variables are called factors and are divided into two types: a factor variable is considered nominal if it represents a name (e.g., names of persons), while ordered factors are called ordinal variables (e.g., satisfaction level of a user from extremely poor to extremely good).

• Numeric variables are either interval or ratio.


R Objects
R has five basic or “atomic” classes of objects:

• Numeric: also known as double; the default type when dealing with numbers, e.g. 1, 1.0, 42.5
• Integer: e.g. 1L, 2L, 42L
• Complex: e.g. 4 + 2i
• Logical: two possible values, TRUE and FALSE. You can also use T and F, but this is not recommended. NA is also considered logical.
• Character: e.g. "a", "Statistics", "1 plus 2."

Other objects:
• Inf is infinity; it can be positive or negative.
• NaN means "Not a Number"; it is an undefined value.
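These classes and special values can be inspected with class() and the is.* predicates; a minimal sketch:

```r
# Inspecting the class of each atomic type
class(42.5)         # "numeric"
class(1L)           # "integer"
class(4 + 2i)       # "complex"
class(TRUE)         # "logical"
class("Statistics") # "character"

# Special values: infinity and Not-a-Number
1 / 0               # Inf
-1 / 0              # -Inf
0 / 0               # NaN
is.nan(0 / 0)       # TRUE
```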
R Attributes

Attributes are metadata attached to an R object, such as names, dim (dimensions), dimnames, class, and levels. All attributes of an object can be queried with attributes(), and individual attributes can be read or set with attr().
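Attributes in R are metadata attached to an object (names, dim, class, levels, etc.); a minimal sketch using attributes() and attr():

```r
# A bare vector has no attributes
x <- 1:6
attributes(x)        # NULL

# Setting the 'dim' attribute turns x into a 2 x 3 matrix
dim(x) <- c(2, 3)
attributes(x)        # $dim : 2 3
is.matrix(x)         # TRUE

# Arbitrary attributes can be attached with attr()
attr(x, "note") <- "demo attribute"
attributes(x)$note   # "demo attribute"

# Names are attributes too
y <- c(a = 1, b = 2)
attributes(y)        # $names : "a" "b"
```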
Data Types and Objects in R
The most essential data structures used in R include:

• Vectors : A vector is an ordered collection of basic data types of a given length.

# Vector (an ordered collection of the same data type)
X = c(1, 3, 5, 7, 8)

# Printing the elements to the console
print(X)
Data Types and Objects in R

• Lists : A list is a generic object consisting of an ordered collection of objects.

# The first component is a numeric vector
# containing the employee IDs, created with c()
empId = c(1, 2, 3, 4)

# The second component is the employee names,
# a character vector
empName = c("Debi", "Sandeep", "Subham", "Shiba")

# The third component is the number of employees,
# a single numeric variable
numberOfEmp = 4

# Combine these three objects of different types into
# one list of employee details using list()
empList = list(empId, empName, numberOfEmp)
print(empList)
Data Types and Objects in R

• Dataframes : Dataframes are generic data objects of R which are used to store the
tabular data.
# A character vector of names
Name = c("Amiya", "Raj", "Asish")

# A character vector of languages
Language = c("R", "Python", "Java")

# A numeric vector of ages
Age = c(22, 25, 45)

# Create a dataframe by passing each of the vectors
# as an argument to data.frame()
df = data.frame(Name, Language, Age)
print(df)
Data Types and Objects in R

• Matrices : A matrix is a rectangular arrangement of numbers in rows and columns.

# byrow = TRUE fills the matrix row-wise;
# by default, matrices are filled column-wise
A = matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9),
           nrow = 3, ncol = 3, byrow = TRUE)
print(A)
Data Types and Objects in R
• Arrays : Arrays are the R data objects which store the data in more than two
dimensions.

# Taking a sequence of eight elements and arranging
# them as two rectangular matrices, each with
# two rows and two columns
A = array(c(1, 2, 3, 4, 5, 6, 7, 8),
          dim = c(2, 2, 2))
print(A)
Data Types and Objects in R

• Factors : Factors are the data objects which are used to categorize the data and store
it as levels. They are useful for storing categorical data. They can store both strings and
integers.

# Creating factor using factor()

fac = factor(c("Male", "Female", "Male",
               "Male", "Female", "Male", "Female"))
print(fac)
Statistics

• Statistics is a method of interpreting, analyzing and summarizing data.

• Statistical analysis is meant to collect and study information available in large quantities.

• For example, the collection and interpretation of data about a nation, such as its economy, population, military, literacy, etc.

• Statistics is broadly categorized into two types:

o Descriptive statistics
o Inferential statistics
Descriptive Statistics
• In descriptive statistics, the data is summarized through the given observations.

• The summarization is done from a sample of the population using parameters such as the mean or standard deviation.

• Descriptive statistics is a way to organize, represent and describe a collection of data using tables, graphs, and summary measures. For example, the collection of people in a city using the internet or using television.

• Descriptive statistics is categorized into four types:

o Measure of frequency - displays the number of times a particular data value occurs.
o Measure of dispersion - range, variance and standard deviation are measures of dispersion; they identify the spread of the data.
o Measure of central tendency - the central tendencies are the mean, median and mode of the data.
o Measure of position - describes percentile and quartile ranks.
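The four categories above map directly onto base R functions; a minimal sketch on the built-in mtcars dataset:

```r
data(mtcars)

# Measure of frequency: how often each value occurs
table(mtcars$cyl)

# Measures of central tendency
mean(mtcars$mpg)
median(mtcars$mpg)

# Measures of dispersion
range(mtcars$mpg)
var(mtcars$mpg)
sd(mtcars$mpg)

# Measure of position: percentiles and quartiles
quantile(mtcars$mpg, probs = c(0.25, 0.50, 0.75))
```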
Inferential statistics
• Inferential statistics is a branch of statistics that involves using data from a sample to
make inferences about a larger population. It is concerned with making predictions,
generalizations, and conclusions about a population based on the analysis of a sample
of data.

• Inferential statistics help to draw conclusions about the population while descriptive
statistics summarizes the features of the data set.

• Inferential statistics encompasses two primary categories:

o hypothesis testing and
o regression analysis.

• It is crucial for samples used in inferential statistics to be an accurate representation of the entire population.
Exploratory Data Analysis (EDA)
• Exploratory Data Analysis (EDA) is a process of describing the data by means of
statistical and visualization techniques in order to bring important aspects of that data
into focus for further analysis.

• EDA aims to spot patterns and trends, to identify anomalies, and to test early
hypotheses.

• Exploratory data analytics often uses visual techniques, such as graphs, plots, and
other visualizations.
Exploratory Data Analysis (EDA)

Some key benefits of an EDA include:

• Spotting missing and incorrect data
• Understanding the underlying structure of your data
• Testing your hypotheses and checking assumptions
• Identifying the most important variables
• Creating the most efficient model
• Determining error margins
• Identifying the most appropriate statistical tools to help you
Exploratory Data Analysis (EDA)

Types of EDA:

• Univariate analysis: It is one of the simplest forms of data analysis. It looks at the
distribution of a single variable (or column of data) at a time. While univariate
analysis does not strictly need to be visual, it commonly uses visualizations such as
tables, pie charts, histograms, or bar charts.

• Multivariate analysis: It looks at the distribution of two or more variables and explores the relationships between them. Most multivariate analyses compare two variables at a time (bivariate analysis).
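Both kinds of analysis can be started with a few base R calls; a minimal sketch on the built-in mtcars dataset:

```r
data(mtcars)

# Univariate EDA: structure and per-variable summaries
str(mtcars)              # variable names and types
summary(mtcars$mpg)      # five-number summary plus the mean

# Spotting missing data, column by column
colSums(is.na(mtcars))

# Bivariate EDA: relationship between two numeric variables
cor(mtcars$mpg, mtcars$disp)   # a strong negative correlation
```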
Data Visualization Using R

The four most common methods of visualizing data are:

• Histograms
• Barplots
• Boxplots
• Scatterplots
Histograms:
When visualizing a single numerical variable, a histogram is a go-to tool; it can be created in R using the hist() function.

data("mtcars")

hist(mtcars$mpg)
Histograms:

hist(mtcars$mpg,
     xlab = "Miles/gallon",
     main = "Histogram of MPG (mtcars)",
     breaks = 12,
     col = "lightseagreen",
     border = "darkorange")

To know more about the arguments that hist() can take, check this link:

https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html
Barplots:
A barplot can provide a visual summary of a categorical variable, or a numeric variable with a finite number of values, like a ranking from 1 to 10. To draw a barplot we will use the cyl variable, which is the number of cylinders in the mtcars dataset.

barplot(table(mtcars$cyl))
Barplots:

barplot(table(mtcars$cyl),
        xlab = "Number of cylinders",
        ylab = "Frequency",
        main = "mtcars dataset",
        col = "lightseagreen",
        border = "darkorange")

To know more about the arguments that barplot() can take, check this link:

https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/barplot.html
Boxplots:
We can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. Let's do a boxplot for the weight (wt) column in mtcars.

boxplot(mtcars$wt)
Boxplots:

To visualize the relationship between a numerical and categorical variable, we can use a
boxplot. Here mpg is a numerical variable and Number of cylinders is categorical.

boxplot(mpg ~ cyl , data = mtcars)


Boxplots:
You can make the box plot more attractive by setting some of its parameters

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles/(US) gallon",
        main = "Number of cylinders VS Miles/(US) gallon",
        pch = 20,
        cex = 2,
        col = "lightseagreen",
        border = "red")
Scatterplots:
To visualize the relationship between two numeric variables we will use a scatterplot. This
can be done with the plot() function.

plot(mpg ~ disp, data = mtcars)
Scatterplots:

plot(mpg ~ disp, data = mtcars,
     xlab = "Displacement",
     ylab = "Miles Per Gallon",
     main = "MPG vs Displacement",
     pch = 20,
     cex = 2,
     col = "red")
Statistical methods for evaluation:

• Hypothesis Testing
• Difference of Means
• Wilcoxon Rank-Sum Test
• Type I and Type II Errors
• Power and Sample Size
• ANOVA
Hypothesis Testing
• A statistical hypothesis is an assumption made about the population from which the data for an experiment are collected. The t-test is one of the most common forms of hypothesis testing.

• It is not mandatory for this assumption to be true every time.

• Ideally, validating a hypothesis would require taking the entire population into account. However, this is not practical, so a hypothesis is validated using random samples drawn from the population.

• On the basis of the result of testing the sample data, the hypothesis is either rejected or not rejected.

• As an example, you may make the assumption that the longer it takes to develop a
product, the more successful it will be, resulting in higher sales than ever before.
Before implementing longer work hours to develop a product, hypothesis testing
ensures there’s an actual connection between the two.
Hypothesis Testing

• Statistical Hypothesis Testing can be categorized into two types as below:

o Null Hypothesis – Hypothesis testing is carried out to test the validity of a claim or assumption made about the larger population. The claim under test is known as the null hypothesis, denoted by H0.

o Alternative Hypothesis – The alternative hypothesis is the claim considered valid if the null hypothesis is false. The evidence in the trial consists of the data and the statistical computations that accompany it. The alternative hypothesis is denoted by H1 or Ha.
Hypothesis Testing
• Hypothesis testing is conducted in the following manner:

1. State the Hypotheses – state the null and alternative hypotheses.
2. Formulate an Analysis Plan – decide how the sample data will be evaluated (test statistic, significance level, decision rule).
3. Analyze Sample Data – calculate and interpret the test statistic, as described in the analysis plan.
4. Interpret Results – apply the decision rule described in the analysis plan.

• Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, i.e., what the data say about the population. The p-value ranges between 0 and 1. It can be interpreted in the following way:

o A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis,
so you reject it.
o A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail
to reject it.

• A p-value very close to the cutoff (0.05) is considered to be marginal and could go either
way.
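The decision rule above can be applied programmatically; a minimal sketch using simulated data (the mean shift of 0.5 is an arbitrary choice for illustration):

```r
set.seed(1)                     # for reproducibility
x <- rnorm(30, mean = 0.5)      # simulated sample

res <- t.test(x, mu = 0)        # H0: the true mean is 0
alpha <- 0.05                   # chosen significance level

res$p.value                     # p-value of the test
if (res$p.value <= alpha) {
  "reject H0"                   # strong evidence against H0
} else {
  "fail to reject H0"           # weak evidence against H0
}
```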
Hypothesis Testing

The two types of error that can occur from the hypothesis testing:

o Type I Error – A Type I error occurs when we reject a null hypothesis that is actually true. The probability of a Type I error is called the significance level of the test and is represented by the symbol α (alpha).

o Type II Error – Accepting (failing to reject) a false null hypothesis H0 is referred to as a Type II error. Its probability is represented by the symbol β (beta); the power of the test, 1 − β, is the probability of avoiding a Type II error.
One Sample T-Testing
• A one-sample t-test compares the mean of a random sample against a hypothesized population mean. Performing a t-test requires (approximately) normally distributed data.

• For example, this test can check whether the height of persons living in one area differs from, or is identical to, that of persons living in other areas.
help("t.test")

# Defining a sample vector of 100 standard normal draws
x <- rnorm(100)

# One-sample t-test of H0: the true mean equals 5
t.test(x, mu = 5)
Two Sample T-Testing
• In two sample T-Testing, the sample vectors are compared

# Defining two sample vectors
x <- rnorm(100)
y <- rnorm(100)

# Two-sample t-test of H0: the two means are equal
t.test(x, y)
Difference of Means

Reference:

https://stats.libretexts.org/Courses/Luther_College/Psyc_350%3ABehavioral_Statistics_(Toussaint)/08%3A_Tests_of_Means/8.03%3A_Difference_between_Two_Means
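In R, a difference of two means is typically tested with a Welch two-sample t-test, which also reports a confidence interval for the difference; a sketch with simulated data (the group means 10 and 11 and sd 2 are arbitrary illustration values):

```r
set.seed(42)                          # for reproducibility
groupA <- rnorm(50, mean = 10, sd = 2)
groupB <- rnorm(50, mean = 11, sd = 2)

# Welch two-sample t-test (does not assume equal variances)
res <- t.test(groupA, groupB)

res$estimate   # sample mean of each group
res$conf.int   # 95% CI for mean(groupA) - mean(groupB)
res$p.value    # evidence against H0: equal means
```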
Wilcoxon Test
• The Student’s t-test requires that the data follow a normal distribution, or that the sample size be large enough (usually n ≥ 30, thanks to the central limit theorem).

• The Wilcoxon test compares two groups when the normality assumption is violated.

• The Wilcoxon test is a non-parametric test, meaning that it does not rely on the data belonging to any particular parametric family of probability distributions.

• There are actually two versions of the Wilcoxon test:

o The Wilcoxon rank-sum test (also referred to as the Mann–Whitney–Wilcoxon test or Mann–Whitney U test) is performed when the samples are independent (this test is the non-parametric equivalent of the Student’s t-test for independent samples).

o The Wilcoxon signed-rank test (also sometimes referred to as the Wilcoxon test for paired samples) is performed when the samples are paired/dependent (this test is the non-parametric equivalent of the Student’s t-test for paired samples).
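A small sketch of the Wilcoxon signed-rank test, using hypothetical before/after scores from the same ten subjects:

```r
# Hypothetical paired data: scores from the same 10 subjects
before <- c(12, 15, 9, 14, 10, 17, 11, 13, 16, 8)
after  <- c(14, 17, 10, 16, 9, 19, 14, 15, 18, 10)

# paired = TRUE selects the signed-rank version
# (ties in the differences trigger a warning and a
# continuity-corrected normal approximation)
res <- wilcox.test(before, after, paired = TRUE)
res
```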
Wilcoxon rank sum test
Problem: Apply Wilcoxon rank sum test on the given data of following 24 students (12 boys
and 12 girls)
Girls 19 18 9 17 8 7 16 19 20 9 11 18
Boys 16 5 15 2 14 15 4 7 15 6 7 14

The null and alternative hypothesis of the Wilcoxon test are as follows:

o H0 : the 2 groups are equal in terms of the variable of interest


o H1: the 2 groups are different in terms of the variable of interest

Applied to our research question, we have:

o H0 : grades of girls and boys are equal


o H1 : grades of girls and boys are different
Wilcoxon rank sum test

data <- data.frame(
  Gender = as.factor(c(rep("Girl", 12), rep("Boy", 12))),
  Grade  = c(19, 18, 9, 17, 8, 7, 16, 19, 20, 9, 11, 18,
             16, 5, 15, 2, 14, 15, 4, 7, 15, 6, 7, 14))

library(ggplot2)

# Boxplot of grades by gender
ggplot(data) +
  aes(x = Gender, y = Grade) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

# Histograms of grades for each group
hist(subset(data, Gender == "Girl")$Grade,
     main = "Grades for girls",
     xlab = "Grades")

hist(subset(data, Gender == "Boy")$Grade,
     main = "Grades for boys",
     xlab = "Grades")

# Wilcoxon rank-sum test
test <- wilcox.test(data$Grade ~ data$Gender)

test
Wilcoxon rank sum test

We obtain the test statistic, the p-value, and a reminder of the hypothesis tested:

Wilcoxon rank sum test with continuity correction

data: data$Grade by data$Gender
W = 31.5, p-value = 0.02056
alternative hypothesis: true location shift is not equal to 0

The p-value is 0.02056. Therefore, at the 5% significance level, we reject the null
hypothesis and we conclude that grades are significantly different between girls and
boys.
Type I and Type II errors

• Using hypothesis testing, one can make decisions about whether the data support or refute
the predictions with null and alternative hypotheses.

• Example: a person decides to get tested for COVID-19 based on mild symptoms. There are two errors that could potentially occur:
o Type I error (false positive): the test result says the person has coronavirus, but they actually don't.
o Type II error (false negative): the test result says the person doesn't have coronavirus, but they actually do.
Type I error

• A Type I error means rejecting the null hypothesis when it’s actually true. It means
concluding that results are statistically significant when, in reality, they came about
purely by chance or because of unrelated factors.

• The risk of committing this error is the significance level (alpha or α) you choose.
That’s a value that you set at the beginning of your study to assess the statistical
probability of obtaining your results (p value).

• The significance level is usually set at 0.05 or 5%. This means that your results only
have a 5% chance of occurring, or less, if the null hypothesis is actually true.

• If the p value of your test is lower than the significance level, it means your results are
statistically significant and consistent with the alternative hypothesis. If your p value is
higher than the significance level, then your results are considered statistically non-
significant.
Type I error
• The null hypothesis distribution curve below shows the probabilities of obtaining all
possible results if the study were repeated with new samples and the null hypothesis were
true in the population.

• At the tail end, the shaded area represents alpha. It’s also called a critical region in
statistics.

• If your results fall in the critical region of this curve, they are considered statistically
significant and the null hypothesis is rejected. However, this is a false positive conclusion,
because the null hypothesis is actually true in this case!
Type II error

• A Type II error means not rejecting the null hypothesis when it’s actually false. This is
not quite the same as “accepting” the null hypothesis, because hypothesis testing can
only tell you whether to reject the null hypothesis.

• Instead, a Type II error means failing to conclude there was an effect when there
actually was. In reality, your study may not have had enough statistical power to
detect an effect of a certain size.

• Power is the extent to which a test can correctly detect a real effect when there is
one. A power level of 80% or higher is usually considered acceptable.

• The risk of a Type II error is inversely related to the statistical power of a study. The
higher the statistical power, the lower the probability of making a Type II error.
Type II error

Statistical power is determined by:

• Size of the effect: larger effects are more easily detected.
• Measurement error: systematic and random errors in recorded data reduce power.
• Sample size: larger samples reduce sampling error and increase power.
• Significance level: increasing the significance level increases power.

To (indirectly) reduce the risk of a Type II error, you can increase the sample size or the
significance level.
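Base R's power.t.test() links effect size, sample size, significance level and power; a sketch (the effect size 0.5 and sd 1 are illustrative choices):

```r
# Power of a two-sample t-test with n = 30 per group,
# effect size (delta) 0.5, sd 1 and alpha 0.05
power.t.test(n = 30, delta = 0.5, sd = 1,
             sig.level = 0.05)$power

# Sample size per group needed to reach 80% power
# at the same effect size
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,
             power = 0.80)$n
```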
Type II error
• The alternative hypothesis distribution curve below shows the probabilities of obtaining all
possible results if the study were repeated with new samples and the alternative
hypothesis were true in the population.

• The Type II error rate is beta (β), represented by the shaded area on the left side. The
remaining area under the curve represents statistical power, which is 1 – β.

• Increasing the statistical power of your test directly decreases the risk of making a Type II
error.
Analysis Of Variance (ANOVA)
• An ANOVA test is a statistical test used to determine if there is a statistically significant
difference between two or more categorical groups by testing for differences of means
using a variance.

• Another key part of ANOVA is that it splits the independent variable into two or more
groups.
Assumptions Of ANOVA
Here are the three important ANOVA assumptions:

o The different group samples are drawn from a normally distributed population.
o The sample or distribution has a homogeneous variance.
o Analysts draw all the data in a sample independently.

ANOVA test has other secondary assumptions as well, they are:

o The observations must be independent of each other and randomly sampled.


o There are additive effects for the factors.
o The sample size must always be greater than 10.
o The sample population must be uni-modal as well as symmetrical.
Types Of ANOVA Tests

An ANOVA test involves setting up:

• Null Hypothesis: all population means are equal.
• Alternative Hypothesis: at least one population mean is different from the others.

ANOVA tests are of two types:

• One-way ANOVA: it takes one categorical variable into consideration.
• Two-way ANOVA: it takes two categorical variables into consideration.
ANOVA Formula
Source of Variation | Sum of Squares       | Degrees of Freedom | Mean Squares        | F Value
Between Groups      | SSB = ∑ nj (X̄j – X̄)² | df1 = k – 1        | MSB = SSB / (k – 1) | F = MSB / MSE
Error               | SSE = ∑∑ (X – X̄j)²   | df2 = N – k        | MSE = SSE / (N – k) |
Total               | SST = SSB + SSE      | df3 = N – 1        |                     |

where:
N = total number of observations (total sample size)
k = number of groups
nj = sample size of the jth group
X = each individual observation in the jth group
X̄j = mean of the jth group
X̄ = overall mean
SSB = sum of squares between groups
SSE = sum of squares of errors
SST = total sum of squares = SSB + SSE
MSB = mean squares between groups
MSE = mean squares of errors
df1 = degrees of freedom between groups
df2 = degrees of freedom of errors
df3 = total degrees of freedom
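The formulas above can be verified by computing each quantity by hand and comparing against aov(); a sketch on the built-in mtcars data:

```r
data(mtcars)
y <- mtcars$disp            # response
g <- factor(mtcars$gear)    # grouping variable
k <- nlevels(g)             # number of groups
N <- length(y)              # total sample size

group_means <- tapply(y, g, mean)    # X-bar_j per group
n_j         <- tapply(y, g, length)  # n_j per group

# Between-group and error sums of squares, per the table
SSB <- sum(n_j * (group_means - mean(y))^2)
SSE <- sum((y - group_means[g])^2)

MSB <- SSB / (k - 1)
MSE <- SSE / (N - k)
F_value <- MSB / MSE

# Matches the F statistic reported by aov()
F_value
summary(aov(y ~ g))
```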
One Way ANOVA test
# Installing the package (if not already installed)
install.packages("dplyr")

# Loading the package
library(dplyr)

# Variance in mean within groups and between groups
boxplot(mtcars$disp ~ factor(mtcars$gear),
        xlab = "gear", ylab = "disp")

# Step 1: Set up the null and alternative hypotheses
# H0: mu1 = mu2 = mu3 (there is no difference between
#     average displacement for different gears)
# H1: not all means are equal

# Step 2: Calculate the test statistic using aov()
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)

# Step 3: Choose the significance level, alpha = 0.05

# Step 4: Compare the p-value with alpha and conclude:
# if p < alpha, reject the null hypothesis
One Way ANOVA test

The summary shows that the gear attribute is highly significant for displacement (denoted by three stars). The p-value is less than 0.05, which shows that gear is significantly related to displacement, so we reject the null hypothesis.
Two Way ANOVA test
# Installing the package (if not already installed)
install.packages("dplyr")

# Loading the package
library(dplyr)

# Variance in mean within groups and between groups
boxplot(mtcars$disp ~ mtcars$gear, subset = (mtcars$am == 0),
        xlab = "gear", ylab = "disp", main = "Automatic")

boxplot(mtcars$disp ~ mtcars$gear, subset = (mtcars$am == 1),
        xlab = "gear", ylab = "disp", main = "Manual")

# Step 1: Set up the null and alternative hypotheses
# H0: there is no difference between average displacement
#     for different gears
# H1: not all means are equal

# Step 2: Calculate the test statistics using aov()
mtcars_aov2 <- aov(mtcars$disp ~ factor(mtcars$gear) * factor(mtcars$am))
summary(mtcars_aov2)

# Step 3: Choose the significance level, alpha = 0.05

# Step 4: Compare each p-value with alpha and conclude:
# if p < alpha, reject the null hypothesis
Two Way ANOVA test

The summary shows that the gear attribute is highly significant for displacement (denoted by three stars), while the am attribute is not. The p-value of gear is less than 0.05, which shows that gear is significantly related to displacement. The p-value of am is greater than 0.05, so am is not significantly related to displacement.
Happy Learning
