Chapter 03 - Review of Basic Data

REVIEW OF BASIC DATA ANALYTIC METHODS USING R
Author: FU
Date: Mar-2022
Objectives

After studying this chapter, the student should be able to:
 Understand the basic features of R
 Understand how to perform data exploration and analysis with R
 Understand methods for data evaluation
Content

1. Introduction to R
2. Exploratory Data Analysis
3. Statistical Methods for Evaluation
1. Introduction to R (1)

 R is a programming language and software framework for statistical analysis and graphics.
 Available for use under the GNU General Public License, R software and installation instructions can be obtained via the Comprehensive R Archive Network (CRAN).
 Import a data file with the read.csv() function:
# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")
# examine the imported dataset
head(sales)
1. Introduction to R (2)

 The head() function, by default, displays the first six records of sales.
1. Introduction to R (3)

 The summary() function provides some descriptive statistics, such as the mean and median, for each data column.
1. Introduction to R (4)

 The plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total).
# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")
1. Introduction to R (5)

 The lm() function is used for the linear regression modeling process.
• The intercept and slope values are –154.1 and 166.2, respectively, for the fitted linear equation.
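The slide's lm() call itself is not shown; a minimal sketch of the workflow, using synthetic data in place of the yearly_sales.csv file (which is not bundled here), could look like this:

```r
# Sketch only: a synthetic stand-in for the sales dataset, generated with the
# intercept and slope quoted in the text plus random noise
set.seed(1)
num_of_orders <- sample(1:30, 100, replace = TRUE)
sales_total   <- -154.1 + 166.2 * num_of_orders + rnorm(100, sd = 200)
results <- lm(sales_total ~ num_of_orders)   # fit sales against order counts
coef(results)   # intercept and slope of the fitted line
```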
1. Introduction to R (6)

 The attributes() function shows details on the contents of results.
 summary() is a generic function. A generic function is a group of functions sharing the same name but behaving differently depending on the number and type of arguments.
 plot() is another example of a generic function.
 hist() generates a histogram (Figure 3-2) of the residuals stored in results.
1. Introduction to R (7)

FIGURE 3-2 Evidence of large residuals


1.1 R Graphical User Interfaces

 R uses a command-line interface (CLI) similar to the BASH shell in Linux or the interactive versions of scripting languages such as Python.
 RGui.exe provides a basic graphical user interface (GUI) for R on Windows.
 Popular GUIs: R Commander, Rattle, and RStudio.
1.1 R Graphical User Interfaces - RStudio GUI (1)
1.1 R Graphical User Interfaces - RStudio GUI (2)

 ?lm or help(lm) at the console prompt can be used to obtain help information on R.
 edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be made in RStudio by selecting the appropriate variable from the workspace pane.
 R allows one to save the workspace environment, including variables and loaded libraries, into an .Rdata file using the save.image() function. An existing .Rdata file can be loaded using the load() function.
1.1 R Graphical User Interfaces - RStudio GUI (3)

FIGURE 3-4 Accessing help in RStudio


1.2 Data Import and Export (1)
 The read.csv() function imports a dataset from a file:
sales <- read.csv("c:/data/yearly_sales.csv")
• Use the forward slash (/) as the separator character in directory and file paths.
• The setwd() function sets the working directory, which simplifies the import of multiple files with long path names:
setwd("c:/data/")
sales <- read.csv("yearly_sales.csv")
• read.table() and read.delim() are intended to import other common file types such as TXT:
sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")
sales_delim <- read.delim("yearly_sales.csv", sep=",")
1.2 Data Import and Export (2)

 The main difference between these import functions is their default values. For example, read.delim() expects the column separator to be a tab ("\t").
1.2 Data Import and Export (3)

 write.table(), write.csv(), and write.csv2() enable exporting of R datasets to an external file.
 For example, the following R code adds an additional column to the sales dataset and exports the modified dataset to an external file.
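A hedged sketch of that export, using a small synthetic data frame and tempdir() instead of the book's c:/data path:

```r
# Sketch: a synthetic stand-in for the sales data frame
sales <- data.frame(sales_total = c(800, 1500, 3200),
                    num_of_orders = c(2, 3, 4))
# add a per-order column, as described in the text
sales$per_order <- sales$sales_total / sales$num_of_orders
out_file <- file.path(tempdir(), "sales_modified.csv")
write.csv(sales, file = out_file, row.names = FALSE)  # export the modified dataset
```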
1.2 Data Import and Export (4)

 DBI and RODBC are R packages used to read data from a database management system (DBMS). These packages provide database interfaces for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum.
 The install.packages() function installs an R package.
 The library() function loads the package into the R workspace.
 A connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via Open Database Connectivity (ODBC) with user=user.
 The training2 database must be defined either in the /etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.
1.2 Data Import and Export (5)

 A connector needs to be present to submit a SQL query to an ODBC database using the sqlQuery() function from the RODBC package (Ref: https://rdrr.io/cran/RODBC/man/).
 The following R code retrieves specific columns from the housing table in which household income (hinc) is greater than $1,000,000.
 Use sqlFetch(conn, "housing_data", max = 100) to retrieve up to 100 rows of a table directly.


1.2 Data Import and Export (6)

 Using the DBI package (Ref: https://dbi.r-dbi.org/reference/)
o Install the library: install.packages("odbc")
o Connect to a SQL database:
con <- DBI::dbConnect(odbc::odbc(),
    Driver = "SQL Server",
    Server = "[your server's path]",
    Database = "[your database's name]",
    UID = "[your user name]",
    PWD = "[your database password]",
    Port = 1433)
• Select data:
query <- dbSendQuery(con, "SELECT speed, dist FROM cars")
dbFetch(query, n = 10)
dbClearResult(query)
1.2 Data Import and Export (7)

 The jpeg() function creates a new JPEG file, adds a histogram plot to the file, and then closes the file.
 jpeg() is useful when automating standard reports.
 png(), bmp(), pdf(), and postscript() are also available in R to save plots in the desired format.
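A minimal sketch of this pattern, writing to tempdir() rather than a fixed path (the file name is illustrative):

```r
out_file <- file.path(tempdir(), "residuals_hist.jpeg")
jpeg(file = out_file)   # create a new JPEG file (open the device)
hist(rnorm(100))        # add a histogram plot to the file
dev.off()               # close the file
```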
1.3 Attribute and Data Types (1)

 In the earlier example, the sales variable contained a record for each customer. Several characteristics, such as total annual sales, number of orders, and gender, were provided for each customer.
 These characteristics or attributes provide qualitative and quantitative measures for each item or subject of interest.
 Attributes can be categorized into four types: nominal, ordinal, interval, and ratio.
1.3 Attribute and Data Types (2)
1.3 Attribute and Data Types (3)

 Data of one attribute type may be converted to another. For example,
o The quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent} with a defined mapping.
o A ratio attribute like Age can be converted into an ordinal attribute such as {Infant, Adolescent, Adult, Senior}.
 Understanding the attribute types in a dataset is necessary to select the appropriate descriptive statistics and analytic methods, and to apply and interpret them properly.
1.3.1 Numeric, Character, and Logical Data Types (1)

 R supports numeric, character, and logical (Boolean) values.
 Examples of such variables are given in the following R code:
i <- 1 # create a numeric variable
sport <- "football" # create a character variable
flag <- TRUE # create a logical variable
 R provides several functions, e.g. class() and typeof(), to examine the characteristics of a given variable.
o The class() function represents the abstract class of an object.
o The typeof() function determines the way an object is stored in memory.
 Although i appears to be an integer, it is internally stored using double precision.
1.3.1 Numeric, Character, and Logical Data Types (2)

 R functions can test variables and coerce a variable into a specific type.
 The is.integer() function tests whether i is an integer.
 The as.integer() function coerces i into a new integer variable, j. Similar coercion functions exist for double, character, and logical types.
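For example:

```r
i <- 1
is.integer(i)          # FALSE: i is stored as a double
j <- as.integer(i)     # coerce i into a new integer variable
is.integer(j)          # TRUE
as.integer(5.7)        # truncates the double to 5
as.integer("12")       # character to integer
as.integer(TRUE)       # logical to integer: 1
```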
1.3.1 Numeric, Character, and Logical Data Types (3)

 The length() function shows that each of the created variables has a length of 1.
 One might have expected the returned length of sport to be 8, one for each character in the string "football". However, these three variables are actually one-element vectors.
1.3.2 Vectors (1)
 Vectors are a basic building block for data in R. Simple R variables are vectors.
 A vector can only consist of values of the same class.
 The is.vector() function tests whether an object is a vector.
 The c() function or the colon operator (:) can be used to create a vector, for example to build the sequence of integers from 1 to 5 as in the example below.
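For example:

```r
v <- c(1, 2, 3, 4, 5)   # combine values into a vector
u <- 1:5                # the colon operator builds the same sequence
is.vector(v)            # TRUE
length(u)               # 5
```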
1.3.2 Vectors (2)

 A vector of a specific length can be initialized, with its contents populated later.
 The vector() function creates a logical vector by default. A vector of a different type can be specified by using the mode parameter.
 Vector c, an integer vector of length 0, may be useful when the number of elements is not initially known and new elements will later be added to the end of the vector as the values become available.
1.3.3 Arrays and Matrices (1)

 The array() function can be used to restructure a vector as an array.
 Below, a three-dimensional array holds the quarterly sales for three regions over a two-year period; the sales amount of $158,000 is then assigned to the second region for the first quarter of the first year.
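The example described above can be sketched as:

```r
# 3 regions x 4 quarters x 2 years, initialized to zero
quarterly_sales <- array(0, dim = c(3, 4, 2))
# assign $158,000 to the second region, first quarter, first year
quarterly_sales[2, 1, 1] <- 158000
quarterly_sales[2, 1, 1]
```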
1.3.3 Arrays and Matrices (2)

 A matrix is a two-dimensional array.
 Below, a matrix holds the quarterly sales for the three regions. The parameters nrow and ncol define the number of rows and columns, respectively, for sales_matrix.
1.3.3 Arrays and Matrices (3)

 Matrix operations include addition, subtraction, multiplication, the transpose function t(), and the inverse matrix function matrix.inverse() included in the matrixcalc package.
 The following R code builds a 3 × 3 matrix, M, and multiplies it by its inverse to obtain the identity matrix.
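If the matrixcalc package is not installed, base R's solve() computes the same inverse; a sketch:

```r
set.seed(2)
M <- matrix(rnorm(9), nrow = 3, ncol = 3)  # a random 3 x 3 matrix
I3 <- M %*% solve(M)                       # M times its inverse
round(I3, 10)                              # the identity matrix, up to rounding
```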
1.3.4 Data Frames (1)

 Data frames provide a structure for storing and accessing several variables of possibly different data types.
 As the is.data.frame() function indicates, a data frame is created by the read.csv() function.
1.3.4 Data Frames (2)

 The $ notation is used to access variables stored in the data frame.
 The following R code illustrates that, in this example, each variable is a vector, with the exception of gender, which was imported as a factor by a read.csv() default.
 A factor denotes a categorical variable, typically with a few finite levels such as "F" and "M" in the case of gender.
1.3.4 Data Frames (3)

 Data frames are the preferred input format for many of the modeling functions available in R.
 The following use of the str() function provides the structure of the sales data frame. This function identifies the integer and numeric (double) data types, the factor variables and levels, as well as the first few values for each variable.
1.3.4 Data Frames (4)

 In their simplest form, data frames are lists of variables of the same length.
 Subsetting operators can be used to retrieve a subset of the data frame. They allow one to express complex operations in a succinct fashion and easily retrieve a subset of the dataset.
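A hedged sketch of subsetting, with a small synthetic data frame standing in for sales:

```r
sales <- data.frame(sales_total   = c(80, 200, 600, 3000),
                    num_of_orders = c(1, 2, 3, 4),
                    gender        = factor(c("F", "M", "F", "M")))
sales[1:2, ]                          # first two rows, all columns
sales[, c("sales_total", "gender")]   # selected columns
sales[sales$sales_total > 500, ]      # rows where sales_total exceeds 500
```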
1.3.5 List

 Lists can contain any type of object, including other lists.
 Using the vector v and the matrix M created in earlier examples, the following R code creates assortment, a list of different object types.
• Use double brackets, [[ ]], to display the contents of an item in assortment.
• A single set of brackets only accesses an item in the list, not its content.
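A sketch of the bracket distinction:

```r
v <- 1:5
M <- matrix(1:9, nrow = 3)
assortment <- list("football", v, M)  # a list of different object types
assortment[[3]]        # double brackets return the matrix itself
assortment[3]          # single brackets return a one-item sub-list
class(assortment[3])   # "list"
```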
1.3.6 Factors (1)

 The gender variable in the data frame sales is a factor.
 In this case, gender can assume one of two levels: F or M. Factors can be ordered or not ordered. In the case of gender, the levels are not ordered.
• Included with the ggplot2 package, the diamonds data frame contains three ordered factors.
• Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good, Premium, and Ideal.
• sales$gender contains nominal data, and diamonds$cut contains ordinal data.
1.3.6 Factors (2)

 The following code categorizes sales$sales_total into three groups (small, medium, and big) according to the amount of the sales.
 These groupings are the basis for the new ordinal factor, spender, with levels {small, medium, big}.
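A hedged sketch using cut() with illustrative break points (the book's exact thresholds are not shown on this slide):

```r
sales_total <- c(80, 200, 600, 3000, 15000)   # synthetic totals
spender <- cut(sales_total,
               breaks = c(-Inf, 100, 500, Inf),     # illustrative thresholds
               labels = c("small", "medium", "big"),
               ordered_result = TRUE)   # ordinal factor: small < medium < big
spender
```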
1.3.7 Contingency Tables

 In R, table refers to a class of objects used to store the observed counts across the factors for a given dataset.
 Such a table is commonly referred to as a contingency table and is the basis for performing a statistical test on the independence of the factors used to build the table.
 The following R code builds a contingency table based on the sales$gender and sales$spender factors.
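A sketch with small synthetic factors in place of the sales columns:

```r
gender  <- factor(c("F", "M", "F", "M", "F"))
spender <- factor(c("small", "big", "medium", "small", "big"))
sales_table <- table(gender, spender)  # observed counts across the two factors
sales_table
```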
1.4 Descriptive Statistics (1)

 The summary() function provides several descriptive statistics, e.g. the mean and median, about a variable such as the sales data frame.
 The results now include the counts for the three levels of the spender variable based on the earlier examples involving factors.
1.4 Descriptive Statistics (2)

 The following code illustrates some common R functions that provide descriptive statistics. In parentheses, the comments describe the functions.
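A runnable sketch of such functions:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
mean(x)     # arithmetic mean: 5
median(x)   # middle value: 4.5
range(x)    # minimum and maximum: 2 9
var(x)      # sample variance
sd(x)       # sample standard deviation
IQR(x)      # interquartile range
```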
2 Exploratory Data Analysis (1)

 summary() can help analysts get an idea of the magnitude and range of the data.
 The following code shows a summary view of a data frame data with two columns x and y. The output shows the range of x and y, but it is not clear what the relationship may be between these two variables.
2 Exploratory Data Analysis (2)

 Variables x and y of the data frame data can be visualized in a scatterplot.
 Figure 3-5 depicts the relationship between the two variables.

FIGURE 3-5 A scatterplot can easily show if x and y share a relationship
2.1 Visualization Before Analysis (1)

 Figure 3-6 illustrates the importance of visualizing data; consider Anscombe's quartet, which consists of four datasets.
 It was constructed by the statistician Francis Anscombe in 1973 to demonstrate the importance of graphs in statistical analyses.

FIGURE 3-6 Anscombe's quartet
2.1 Visualization Before Analysis (2)

TABLE 3-3 Statistical Properties of Anscombe's Quartet

FIGURE 3-7 Anscombe's quartet visualized as scatterplots
2.1 Visualization Before Analysis (3)

 The gl() function creates variable levels; here it generates factors of four levels (1, 2, 3, and 4), each repeating 11 times.
 The variable mydata is created using the with(data, expression) function, which evaluates an expression in an environment constructed from data.
 In this example, the data is the anscombe dataset, which includes eight attributes: x1, x2, x3, x4, y1, y2, y3, and y4.
 The expression part in the code creates a data frame from the anscombe dataset, and it only includes three attributes: x, y, and the group each data point belongs to (mygroup).
2.1 Visualization Before Analysis (4)
2.2 Dirty Data (1)

 Consider a scenario in which a bank is conducting data analyses of its account holders to gauge customer retention. Figure 3-8 shows the age distribution of the account holders.
 If the age data is in a vector called age, the graph can be created with the following R script:
hist(age, breaks=100, main="Age Distribution of Account Holders",
     xlab="Age", ylab="Frequency", col="gray")
 The figure shows that the median age of the account holders is around 40. A few accounts with account holder age less than 10 are unusual but plausible.

FIGURE 3-8 Age distribution of bank account holders
2.2 Dirty Data (2)

 The left side of the graph shows a huge spike of customers who are zero
years old or have negative ages.
 This is likely to be evidence of missing data. One possible explanation
is that the null age values could have been replaced by 0 or negative
values during the data input.
 Such an occurrence may be caused by entering age in a text box that
only allows numbers and does not accept empty values. Or it might be
caused by transferring data among several systems that have different
definitions for null values (such as NULL, NA, 0, –1, or –2).
 Therefore, data cleansing needs to be performed over the accounts
with abnormal age values. Analysts should take a closer look at the
records to decide if the missing data should be eliminated or if an
appropriate age value can be determined using other available
information for each of the accounts.
2.2 Dirty Data (3)

 The is.na() function provides tests for missing values.
 The following example creates a vector x where the fourth value is not available (NA). The is.na() function returns TRUE at each NA value and FALSE otherwise.
 mean() applied to data containing missing values can yield an NA result. To prevent this, set the na.rm parameter to TRUE to remove the missing value during the function's execution.
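The example described above:

```r
x <- c(1, 2, 3, NA, 4)     # fourth value is not available
is.na(x)                   # FALSE FALSE FALSE TRUE FALSE
mean(x)                    # NA: the missing value propagates
mean(x, na.rm = TRUE)      # 2.5: NA removed before averaging
```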
2.3 Visualizing a Single Variable (1)

 R has many functions available to examine a single variable.
2.3 Visualizing a Single Variable (2)

 dotchart(x, label=...) creates a dotchart, where x is a numeric vector and label is a vector of categorical labels for x.
 barplot(height) creates a barplot, where height represents a vector or matrix.
 Figure 3-10 shows (a) a dotchart and (b) a barplot based on the mtcars dataset, which includes the fuel consumption and 10 aspects of automobile design and performance of 32 automobiles.
2.3 Visualizing a Single Variable (3)

 A dotchart can be created with the function dotchart(x, label=...), where
o x is a numeric vector and
o label is a vector of categorical labels for x.
2.3 Visualizing a Single Variable (4)

 Figure 3-11(a) includes a histogram of household income.
 The histogram shows a clear concentration of low household incomes on the left and the long tail of the higher incomes on the right.
2.3 Visualizing a Single Variable (5)
2.3 Visualizing a Single Variable (6)

 Consider a density plot of diamond prices (in USD).
 Figure 3-12(a) contains two density plots for premium and ideal cuts of diamonds.
 Figure 3-12(b) shows more detail of the diamond prices than Figure 3-12(a) by taking the logarithm.

FIGURE 3-12 Density plots of (a) diamond prices and (b) the logarithm of diamond prices
2.3 Visualizing a Single Variable (7)

• R script to generate the plots in Figure 3-12.
• The diamonds dataset comes with the ggplot2 package.
2.4 Examining Multiple Variables

 R code to produce Figure 3-13.

FIGURE 3-13 Examining two variables with regression
2.4 Examining Multiple Variables - Dotchart and Barplot

 The dotchart and barplot from the previous section can visualize multiple variables. Both use color as an additional dimension for visualizing the data.
 For the same mtcars dataset, Figure 3-14 shows a dotchart that groups vehicles by cylinder count on the y-axis and uses colors to distinguish different cylinder counts.
 The vehicles are sorted according to their MPG values.

FIGURE 3-14 Dotplot to visualize multiple variables
2.4 Examining Multiple Variables - Dotchart and Barplot
2.4 Examining Multiple Variables - Dotchart and Barplot

 The barplot in Figure 3-15 visualizes the distribution of car cylinder counts and number of gears.
 The x-axis represents the number of cylinders, and the color represents the number of gears.

FIGURE 3-15 Barplot to visualize multiple variables
2.4 Examining Multiple Variables - Box-and-Whisker Plot

 The graph shows how household income varies by region.
 The highest median incomes are in region 0 and region 9. Region 0 is slightly higher, but the boxes for the two regions overlap enough that the difference between the two regions probably is not significant.
 The lowest household incomes tend to be in region 7, which includes states such as Louisiana, Arkansas, and Oklahoma.

FIGURE 3-16 A box-and-whisker plot of mean household income and geographical region
2.4 Examining Multiple Variables - Hexbinplot for Large Datasets

 The hexbinplot of Figure 3-17(b) is plotted using the zcta data frame.
 Running the code requires the hexbin package, installed by running install.packages("hexbin").

FIGURE 3-17 (a) Scatterplot and (b) Hexbinplot of household income against years of education
2.4 Examining Multiple Variables - Scatterplot Matrix

 The vector colors defines the color scheme for the plot.
 colors <- c("gray50", "white", "black") makes the scatterplots grayscale.

FIGURE 3-18 Scatterplot matrix of Fisher's [13] iris dataset
2.4 Examining Multiple Variables - Analyzing a Variable over Time

 Visualizing a variable over time is the same as visualizing any pair of variables, but in this case the goal is to identify time-specific patterns. Figure 3-19 plots the monthly total numbers of international airline passengers (in thousands) from January 1949 to December 1960. Enter plot(AirPassengers) in the R console to obtain a similar graph. The plot shows that, for each year, a large peak occurs mid-year around July and August, and a small peak happens around the end of the year, possibly due to the holidays. Such a phenomenon is referred to as a seasonality effect.
 plot(AirPassengers)

FIGURE 3-19 Airline passenger counts from 1949 to 1960
2.5 Data Exploration Versus Presentation

 Figure 3-20 shows the density plot of the distribution of account values from a bank.

FIGURE 3-20 Density plots are better to show to data scientists
2.5 Data Exploration Versus Presentation
3.1 Hypothesis Testing (1)

 Hypothesis testing is to form an assertion and test it with data.
 The null hypothesis (H0) is the common assumption, when performing hypothesis tests, that there is no difference between the two samples. This assumption is used as the default position for building the test or conducting a scientific experiment.
 The alternative hypothesis (HA) is that there is a difference between the two samples.

FIGURE 3-22 Distributions of two samples of data
3.1 Hypothesis Testing (2)

 For example, if the task is to identify the effect of drug A compared to drug B on patients, the null hypothesis and alternative hypothesis would be as follows:
o H0: Drug A and drug B have the same effect on patients.
o HA: Drug A has a greater effect than drug B on patients.
 If the task is to identify whether advertising Campaign C is effective at reducing customer churn, the null hypothesis and alternative hypothesis would be as follows:
o H0: Campaign C does not reduce customer churn better than the current campaign method.
o HA: Campaign C does reduce customer churn better than the current campaign.
3.1 Hypothesis Testing (3)

 It is important to state the null hypothesis and alternative hypothesis, because misstating them is likely to undermine the subsequent steps of the hypothesis testing process.
 A hypothesis test leads to either rejecting the null hypothesis in favor of the alternative or not rejecting the null hypothesis.
3.2 Difference of Means

 H0: μ1 = μ2
 HA: μ1 ≠ μ2
 μ1 and μ2 denote the population means of pop1 and pop2, respectively.

FIGURE 3-23 Overlap of the two distributions is large if X̄1 ≈ X̄2
3.2.1 Student’s t-test (1)

 Student's t-test assumes that the distributions of the two populations have equal but unknown variances. Suppose n1 and n2 samples are randomly and independently selected from two populations, pop1 and pop2.
 If each population is normally distributed with the same mean (μ1 = μ2) and with the same variance, then T (the t-statistic), given in Equation 3-1, follows a t-distribution with n1 + n2 − 2 degrees of freedom (df).

T = (X̄1 − X̄2) / (Sp · √(1/n1 + 1/n2))   (3-1)

where Sp is the pooled sample standard deviation, with Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2).
3.2.1 Student’s t-test (2)

 The shape of the t-distribution is similar to the normal distribution.
 As the degrees of freedom approach 30 or more, the t-distribution is nearly identical to the normal distribution.
 Because the numerator of T is the difference of the sample means, if the observed value of T is far enough from zero that such a value of T is unlikely, one would reject the null hypothesis that the population means are equal.
 For a small probability, say α = 0.05, a value T* is determined such that P(|T| ≥ T*) = 0.05. After the samples are collected and the observed value of T is calculated according to Equation 3-1, the null hypothesis (μ1 = μ2) is rejected if |T| ≥ T*.
3.2.1 Student’s t-test (3)
 In hypothesis testing, the small probability α is known as the significance level of the test.
 The significance level of the test is the probability of rejecting the null hypothesis when the null hypothesis is actually TRUE. In other words, for α = 0.05, if the means from the two populations are truly equal, then in repeated random sampling the observed magnitude of T would exceed T* only 5% of the time.
 In the following R code example, 10 observations are randomly selected from each of two normally distributed populations and assigned to the variables x and y.
 The two populations have means of 100 and 105, respectively, and a standard deviation equal to 5. Student's t-test is then conducted to determine if the obtained random samples support the rejection of the null hypothesis.
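A sketch of the described experiment (the seed is illustrative, so the exact values will differ from the book's output):

```r
set.seed(100)                        # illustrative seed
x <- rnorm(10, mean = 100, sd = 5)   # sample from pop1
y <- rnorm(10, mean = 105, sd = 5)   # sample from pop2
t.test(x, y, var.equal = TRUE)       # Student's t-test (pooled variance)
```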
3.2.2 Welch’s t-test (1)

 When the equal-variance assumption does not hold, Welch's t-test can be used, based on T as expressed in Equation 3-2:

T = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2)   (3-2)

• where X̄i, Si², and ni correspond to the i-th sample mean, sample variance, and sample size.
• Welch's t-test uses the sample variance (Si²) for each population instead of the pooled sample variance.
 The following R code performs Welch's t-test on the same set of data analyzed in the earlier Student's t-test example.
3.2.2 Welch’s t-test (2)

 The degrees of freedom for Welch's t-test are defined in Equation 3-3:

df = (S1²/n1 + S2²/n2)² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ]   (3-3)
3.2.2 Welch’s t-test (3)

 A confidence interval is an interval estimate of a population parameter or characteristic based on sample data.

FIGURE 3-25 A 95% confidence interval straddling the unknown population mean μ
3.2.2 Welch’s t-test (4)

 Confidence intervals are discussed again in Section 3.3.6 on ANOVA.
 A key assumption in both the Student's and Welch's t-tests is that the relevant population attribute is normally distributed.
 For non-normally distributed data, it is sometimes possible to transform the collected data to approximate a normal distribution.
 For example, taking the logarithm of a dataset can often transform skewed data into a dataset that is at least symmetric around its mean. However, if such transformations are ineffective, there are tests like the Wilcoxon rank-sum test that can be applied to see if two population distributions are different.
3.3 Wilcoxon Rank-Sum Test (1)

 The Wilcoxon rank-sum test is a nonparametric hypothesis test that checks whether two populations are identically distributed.
 Let the two populations again be pop1 and pop2, with independently drawn random samples of size n1 and n2, respectively. The total number of observations is then N = n1 + n2.
 The Wilcoxon rank-sum test determines the significance of the observed rank-sums.
 The following R code performs the test on the same dataset used for the previous t-test.
wilcox.test(x, y, conf.int = TRUE)
3.3 Wilcoxon Rank-Sum Test (2)

 The wilcox.test() function ranks the observations, determines the respective rank-sums corresponding to each population's sample, and then determines the probability of rank-sums of such magnitude being observed, assuming that the population distributions are identical.
 In this example, the probability is given by the p-value of 0.04903.
 Thus, the null hypothesis would be rejected at a 0.05 significance level. The reader is cautioned against interpreting that one hypothesis test is clearly better than another test based solely on the examples given in this section.
3.4 Type I and Type II Errors (1)
3.4 Type I and Type II Errors (2)

 significance level mentioned in the Student’s t-test discussion is


equivalent to the type I error.
 For a significance level such as α=0.05, if null hypothesis (μ1= μ2) is
TRUE, there is a 5% chance that the observed T value based on the
sample data will be large enough to reject the null hypothesis.
 By selecting n appropriate significance level, probability of committing
a type I error can be defined before any data is collected or analyzed.
 probability of committing a Type II error is somewhat more difficult to
determine. If two population means are truly not equal, probability of
committing a type II error will depend on how far apart the means
truly are. To reduce the probability of a type II error to a reasonable
level, it is often necessary to increase the sample size
3.5 Power and Sample Size
 The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted by 1 − β, where β is the probability of a type II error.
 In general, the magnitude of the difference is known as the effect size. As the sample size becomes larger, it is easier to detect a given effect size, δ, as illustrated in Figure 3-26.

FIGURE 3-26 A larger sample size better identifies a fixed effect size
3.6 ANOVA (1)

 The hypothesis tests presented so far are good for analyzing means between two populations. What if there are more than two populations? Consider an example of testing the impact of nutrition and exercise on 60 candidates between ages 18 and 50. The candidates are randomly split into six groups, each assigned a different weight loss strategy, and the goal is to determine which strategy is the most effective.
o Group 1 only eats junk food.
o Group 2 only eats healthy food.
o Group 3 eats junk food and does cardio exercise every other day.
o Group 4 eats healthy food and does cardio exercise every other day.
o Group 5 eats junk food and does both cardio and strength training every other day.
o Group 6 eats healthy food and does both cardio and strength training every other day.
3.6 ANOVA (2)
 Multiple t-tests could be applied to each pair of weight loss strategies. In this example, the weight loss of Group 1 is compared with the weight loss of Group 2, 3, 4, 5, or 6. Similarly, the weight loss of Group 2 is compared with that of the next four groups. Therefore, a total of 15 t-tests would be performed.
 However, multiple t-tests may not perform well on several populations for two reasons.
o First, the number of t-tests increases as the number of groups increases, and analysis using multiple t-tests becomes cognitively more difficult.
o Second, by doing a greater number of analyses, the probability of committing at least one type I error somewhere in the analysis greatly increases.
 ANOVA (Analysis of Variance) is designed to address these issues. ANOVA is a generalization of the hypothesis testing of the difference of two population means.
 ANOVA tests if any of the population means differ from the other population means.
o The null hypothesis is that all the population means are equal: H0: μ1 = μ2 = … = μn.
o The alternative hypothesis is that at least one pair of the population means is not equal: HA: μi ≠ μj for at least one pair of i, j.
3.6 ANOVA (3)
• In Section 3.3.2, "Difference of Means," each population is assumed to be normally distributed with the same variance.
• The first thing to calculate for the ANOVA is the test statistic. The goal is to test whether the clusters formed by each population are more tightly grouped than the spread across all the populations.
 k: the total number of populations
 N: the total number of samples, randomly split into k groups
 ni: the number of samples in the i-th group
 X̄i: the mean of the i-th group, i ∈ [1, k]
 X̄0: the mean of all the samples
 S²B (the between-groups mean sum of squares): an estimate of the between-groups variance. It measures how the population means vary with respect to the grand mean, or the mean spread across all the populations:

S²B = Σ ni(X̄i − X̄0)² / (k − 1), summing over the k groups
3.6 ANOVA (4)

 S²W (the within-group mean sum of squares): an estimate of the within-group variance:

S²W = Σ (ni − 1)Si² / (N − k), summing over the k groups

 If S²B is much larger than S²W, then some of the population means are different from each other.
 The F-test statistic is defined as the ratio of the between-groups mean sum of squares and the within-group mean sum of squares: F = S²B / S²W, with k − 1 and N − k degrees of freedom.
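A hedged one-way ANOVA sketch with three synthetic groups (the book's six-group weight-loss design works the same way with aov()):

```r
set.seed(3)
group <- factor(rep(c("g1", "g2", "g3"), each = 50))
# synthetic responses: g2 has a shifted mean, so groups should differ
response <- rnorm(150, mean = c(60, 80, 60)[as.integer(group)], sd = 20)
model <- aov(response ~ group)   # one-way ANOVA
summary(model)   # F value = between-groups MS / within-group MS
```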
3.6 ANOVA (5)
3.6 ANOVA (6)
3.6 ANOVA (7)
Summary

 R is a popular package and programming language for data exploration, analytics, and visualization.
 This chapter showed how to use R to perform exploratory data analysis, including the discovery of dirty data, visualization of one or more variables, and customization of visualization for different audiences.
 It also introduced some basic statistical methods:
o Hypothesis testing. The Student's t-test and Welch's t-test are included as two example hypothesis tests designed for testing the difference of means.
o Other statistical methods and tools: confidence intervals, the Wilcoxon rank-sum test, type I and II errors, effect size, and ANOVA.
