Unit-4 Big Data Analytics Methods using R
• Introduction to R-Attributes
• R Graphical user interfaces
• Data import and export
• Attribute and Data Types
• Descriptive Statistics
• Exploratory Data Analysis.
R Overview
R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
Among other things it has an effective data handling and storage facility, a suite of operators for calculations on arrays and matrices, graphical facilities for data analysis and display, and a simple and effective programming language.
Variables and objects in R:
• Variables are reserved memory locations used to store values; when we create a variable, we allocate some memory space for it.
• In R, a variable is an object. An object is a data structure with attributes and with methods that are applied to those attributes.
• Numeric: also known as double; the default type when dealing with numbers, e.g. 1, 1.0, 42.5
• Integer: e.g. 1L, 2L, 42L
• Complex: e.g. 4 + 2i
• Logical: two possible values, TRUE and FALSE. You can also use T and F, but this is not recommended. NA is also considered logical.
• Character: e.g. "a", "Statistics", "1 plus 2."
Other Objects:
• Inf is infinity. You can have either positive or negative infinity.
• NaN means Not a number. It’s an undefined value.
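A minimal sketch of these types in an R session; class() reports the basic type of each object (the variable names are only for illustration):
# Numeric (double) is the default type for numbers
x <- 42.5
class(x)        # "numeric"
# Integer values need the L suffix
i <- 7L
class(i)        # "integer"
# Complex and logical values
z <- 4 + 2i
class(z)        # "complex"
b <- c(TRUE, FALSE, NA)
class(b)        # "logical"
# Character values
s <- "Statistics"
class(s)        # "character"
# Special values: Inf, -Inf and NaN
1 / 0           # Inf
0 / 0           # NaN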
R Attributes
Attributes are metadata attached to an R object, such as names, dim, dimnames, class and levels. They can be listed with attributes() and read or set individually with attr().
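A short sketch of inspecting and setting attributes (the object names and the custom "unit" attribute are only illustrative):
# A named vector carries a "names" attribute
v <- c(a = 1, b = 2, c = 3)
attributes(v)            # $names: "a" "b" "c"
# A matrix carries a "dim" attribute
m <- matrix(1:6, nrow = 2)
attributes(m)            # $dim: 2 3
# Set a custom attribute with attr()
attr(v, "unit") <- "metres"
attributes(v)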
Data Types and Objects in R
The most essential data structures used in R include vectors, lists, matrices, data frames, arrays and factors. For example, a list can combine components of different types, as in the employee example below:
# The first component is the employee id, a numeric vector
empId = c(1, 2, 3, 4)
# The second component is the employee name, a character vector
empName = c("Debi", "Sandeep", "Subham", "Shiba")
# The third component is the total number of employees, a single numeric value
numberOfEmp = 4
# Combine these three different data types into a list
# containing the details of employees using the list() function
empList = list(empId, empName, numberOfEmp)
print(empList)
Data import and export
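A minimal sketch of importing and exporting tabular data with base R functions; the file names here are only placeholders:
# Import a CSV file into a data frame (the file name is a placeholder)
emp <- read.csv("employees.csv", header = TRUE)
head(emp)
# Export a data frame back to a CSV file
write.csv(emp, "employees_copy.csv", row.names = FALSE)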
• Dataframes: generic R data objects used to store tabular data, as in the example below.
# A character vector and a numeric vector
Name = c("Amiya", "Raj", "Asish")
Age = c(22, 25, 45)
# Combine them into a data frame and print it
df = data.frame(Name, Age)
print(df)
Data Types and Objects in R
• Arrays: R data objects that store data in more than two dimensions.
• Factors: data objects used to categorize data and store it as levels. They are useful for storing categorical data and can store both strings and integers. Both are sketched below.
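A brief sketch of an array and a factor (the values are only illustrative):
# An array with dimensions 2 x 3 x 2 (more than two dimensions)
a <- array(1:12, dim = c(2, 3, 2))
dim(a)           # 2 3 2
# A factor stores categorical data as levels
sizes <- factor(c("small", "large", "large", "small", "medium"))
levels(sizes)    # "large" "medium" "small"
table(sizes)     # frequency of each level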
• Statistical analysis is used to collect and study information that is available in large quantities.
• For example, the collection and interpretation of data about a nation, such as its economy, population, military and literacy.
• The summarization is done from a sample of the population using summary measures such as the mean or the standard deviation.
• Descriptive statistics is a way to organize, represent and describe a collection of data using tables, graphs, and summary measures; for example, describing how many people in a city use the internet or television. A short R sketch follows below.
• Inferential statistics help to draw conclusions about the population while descriptive
statistics summarizes the features of the data set.
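A minimal sketch of descriptive statistics in R, using the built-in mtcars data set:
data("mtcars")
# Summary measures for miles per gallon (mpg)
mean(mtcars$mpg)      # arithmetic mean
median(mtcars$mpg)    # median
sd(mtcars$mpg)        # standard deviation
summary(mtcars$mpg)   # min, quartiles, mean and max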
• EDA aims to spot patterns and trends, to identify anomalies, and to test early
hypotheses.
• Exploratory data analysis often uses visual techniques, such as graphs, plots, and other visualizations.
Exploratory Data Analysis (EDA)
Types of EDA:
• Univariate analysis: It is one of the simplest forms of data analysis. It looks at the
distribution of a single variable (or column of data) at a time. While univariate
analysis does not strictly need to be visual, it commonly uses visualizations such as
tables, pie charts, histograms, or bar charts.
• Histograms
• Barplots
• Boxplots
• Scatterplots
Histograms:
When visualizing a single numerical variable, a histogram can be a go-to tool,
which can be created in R using the hist() function
data("mtcars")
hist(mtcars$mpg)
Histograms:
hist(mtcars$mpg,
     xlab = "Miles/gallon",
     main = "Histogram of MPG (mtcars)",
     breaks = 12,
     col = "lightseagreen",
     border = "darkorange")
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html
Barplots:
A barplot can provide a visual summary of a categorical variable, or of a numeric variable with a finite number of values, like a ranking from 1 to 10. To draw a barplot we will use the cyl variable, which is the number of cylinders in the mtcars dataset.
barplot(table(mtcars$cyl))
Barplots:
barplot(table(mtcars$cyl),
        xlab = "Number of cylinders",
        ylab = "Frequency",
        main = "mtcars dataset",
        col = "lightseagreen",
        border = "darkorange")
To know more about the arguments that a barplot can take, check this link:
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/hist.html
Boxplots:
We can use a single boxplot as an alternative to a histogram for visualizing a single numerical variable. Let's do a boxplot for the weight (wt) column in mtcars.
boxplot(mtcars$wt)
Boxplots:
To visualize the relationship between a numerical and categorical variable, we can use a
boxplot. Here mpg is a numerical variable and Number of cylinders is categorical.
boxplot(mpg ~ cyl, data = mtcars)
Scatterplots:
To visualize the relationship between two numerical variables, we can use a scatterplot. Here mpg (miles per gallon) is plotted against disp (displacement) from mtcars.
plot(mpg ~ disp, data = mtcars)
• Hypothesis Testing
• Difference of Means
• Wilcoxon Rank-Sum Test
• Type I and Type II Errors
• Power and sample size
• ANOVA
Hypothesis Testing
• A statistical hypothesis is an assumption made about the population from which the data for an experiment are collected. Hypothesis testing is commonly carried out with tests such as the t-test.
• Validating a hypothesis exactly would require taking the entire population into account. However, this is not practically possible, so hypothesis testing uses random samples drawn from the population.
• On the basis of the result of the test on the sample data, the hypothesis is either accepted or rejected.
• As an example, you may make the assumption that the longer it takes to develop a
product, the more successful it will be, resulting in higher sales than ever before.
Before implementing longer work hours to develop a product, hypothesis testing
ensures there’s an actual connection between the two.
Hypothesis Testing
o Null Hypothesis – Hypothesis testing is carried out in order to test the validity of a claim or assumption made about the larger population. The claim that is assumed to hold in the trial is known as the null hypothesis, denoted by H0.
• Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence, in other words, what the data say about the population. The p-value ranges between 0 and 1 and can be interpreted in the following way:
o A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis,
so you reject it.
o A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail
to reject it.
• A p-value very close to the cutoff (0.05) is considered to be marginal and could go either
way.
Hypothesis Testing
The two types of error that can occur from the hypothesis testing:
o Type I Error – A Type I error occurs when we reject a null hypothesis that is actually true. The term significance level is used to express the probability of a Type I error while testing the hypothesis. The significance level is represented by the symbol α (alpha).
• The t-test is used to compare the mean of a sample with that of the population; for example, whether the height of persons living in one area is different from or identical to the height of persons living in other areas. A sketch follows below.
help("t.test")
Reference:
https://stats.libretexts.org/Courses/Luther_College/Psyc_350%3ABehavioral_Statistics_(Toussaint)/08%3A_Tests_of_Means/8.03%3A_Difference_between_Two_Means
Wilcoxon Test
• The Student's t-test requires that the data follow a normal distribution, or that the sample size is large enough (usually n ≥ 30, thanks to the central limit theorem).
• The Wilcoxon test compares two groups when the normality assumption is violated.
• The Wilcoxon test is a non-parametric test, meaning that it does not rely on the data belonging to any particular parametric family of probability distributions.
o The Wilcoxon rank sum test (also referred to as the Mann-Whitney-Wilcoxon test or the Mann-Whitney U test) is performed when the samples are independent (this test is the non-parametric equivalent of the Student's t-test for independent samples).
o The Wilcoxon signed-rank test (also sometimes referred to as the Wilcoxon test for paired samples) is performed when the samples are paired/dependent (this test is the non-parametric equivalent of the Student's t-test for paired samples).
Wilcoxon rank sum test
Problem: Apply Wilcoxon rank sum test on the given data of following 24 students (12 boys
and 12 girls)
Girls 19 18 9 17 8 7 16 19 20 9 11 18
Boys 16 5 15 2 14 15 4 7 15 6 7 14
The null and alternative hypotheses of the Wilcoxon test are as follows:
o H0: the grades of girls and boys are equal.
o H1: the grades of girls and boys are different.
The test is run with the wilcox.test() function, as in the sketch below.
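A minimal sketch, assuming the grades are entered exactly as in the table above:
# Enter the grades of the 24 students (12 girls, then 12 boys)
dat <- data.frame(
  Sex   = rep(c("Girl", "Boy"), each = 12),
  Grade = c(19, 18, 9, 17, 8, 7, 16, 19, 20, 9, 11, 18,
            16, 5, 15, 2, 14, 15, 4, 7, 15, 6, 7, 14)
)
# Wilcoxon rank sum test (the two samples are independent)
test <- wilcox.test(Grade ~ Sex, data = dat)
test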
Wilcoxon rank sum test
The output contains the test statistic, the p-value and a reminder of the hypotheses tested.
The p-value is 0.02056. Therefore, at the 5% significance level, we reject the null hypothesis and conclude that the grades are significantly different between girls and boys.
Type I and Type II errors
• Using hypothesis testing, one can make decisions about whether the data support or refute
the predictions with null and alternative hypotheses.
• A Type I error means rejecting the null hypothesis when it’s actually true. It means
concluding that results are statistically significant when, in reality, they came about
purely by chance or because of unrelated factors.
• The risk of committing this error is the significance level (alpha or α) you choose.
That’s a value that you set at the beginning of your study to assess the statistical
probability of obtaining your results (p value).
• The significance level is usually set at 0.05 or 5%. This means that your results only
have a 5% chance of occurring, or less, if the null hypothesis is actually true.
• If the p value of your test is lower than the significance level, it means your results are
statistically significant and consistent with the alternative hypothesis. If your p value is
higher than the significance level, then your results are considered statistically non-
significant.
Type I error
• The null hypothesis distribution curve below shows the probabilities of obtaining all
possible results if the study were repeated with new samples and the null hypothesis were
true in the population.
• At the tail end, the shaded area represents alpha. It’s also called a critical region in
statistics.
• If your results fall in the critical region of this curve, they are considered statistically
significant and the null hypothesis is rejected. However, this is a false positive conclusion,
because the null hypothesis is actually true in this case!
Type II error
• A Type II error means not rejecting the null hypothesis when it’s actually false. This is
not quite the same as “accepting” the null hypothesis, because hypothesis testing can
only tell you whether to reject the null hypothesis.
• Instead, a Type II error means failing to conclude there was an effect when there
actually was. In reality, your study may not have had enough statistical power to
detect an effect of a certain size.
• Power is the extent to which a test can correctly detect a real effect when there is
one. A power level of 80% or higher is usually considered acceptable.
• The risk of a Type II error is inversely related to the statistical power of a study. The
higher the statistical power, the lower the probability of making a Type II error.
Type II error
To (indirectly) reduce the risk of a Type II error, you can increase the sample size or the
significance level.
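A minimal sketch with power.t.test() from the stats package, showing how power and sample size are related; the effect size (delta) and standard deviation used here are only illustrative:
# Power of a two-sample t-test for detecting a difference of 0.5 (sd = 1)
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)
# Sample size needed per group to reach 80% power for the same effect
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05)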
Type II error
• The alternative hypothesis distribution curve below shows the probabilities of obtaining all
possible results if the study were repeated with new samples and the alternative
hypothesis were true in the population.
• The Type II error rate is beta (β), represented by the shaded area on the left side. The
remaining area under the curve represents statistical power, which is 1 – β.
• Increasing the statistical power of your test directly decreases the risk of making a Type II
error.
Analysis Of Variance (ANOVA)
• An ANOVA test is a statistical test used to determine whether there is a statistically significant difference between two or more categorical groups by testing for differences of means using variances.
• Another key part of ANOVA is that it splits the independent variable into two or more
groups.
Assumptions Of ANOVA
Here are the three important ANOVA assumptions:
o The observations are independent of one another.
o The data within each group are approximately normally distributed.
o The variances are homogeneous (roughly equal) across the groups.
One-way ANOVA table (k groups, N observations in total):
Source           Sum of Squares           df            Mean Square           F
Between Groups   SSB = Σ nj (X̄j − X̄)²     df1 = k − 1   MSB = SSB / (k − 1)   F = MSB / MSE
Within Groups    SSE = ΣΣ (Xij − X̄j)²     df2 = N − k   MSE = SSE / (N − k)
# Step 1: Set up the null and alternative hypotheses
#   H0: mu1 = mu2 = mu3 (there is no difference between the average
#       displacement for the different numbers of gears)
#   H1: not all means are equal
# Step 2: Calculate the F test statistic from the sample data
# Step 3: Determine the F-critical value at the alpha = 0.05 significance level
# Step 4: Compare the test statistic with the F-critical value and conclude:
#   if p < alpha, reject the null hypothesis
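A minimal sketch of this one-way ANOVA on the mtcars data, assuming disp (displacement) is the response and gear is treated as a factor:
data("mtcars")
# Fit the one-way ANOVA model: displacement explained by number of gears
mtcars_aov <- aov(disp ~ factor(gear), data = mtcars)
# The summary shows the F statistic and the p-value,
# which is compared with alpha = 0.05
summary(mtcars_aov)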
Two Way ANOVA test