Chapter 03 - Review of Basic Data

REVIEW OF BASIC DATA ANALYTIC METHODS USING R
Author: FU
Date: Mar-2022
Objectives

After studying this chapter, the student should be able to:
 Understand the basic features of R
 Understand how to perform data exploration and analysis with R
 Understand methods for data evaluation
Content

1. Introduction to R
2. Exploratory Data Analysis
3. Statistical Methods for Evaluation
1. Introduction to R (1)

 R is a programming language and software framework for statistical analysis and graphics.
 Available for use under the GNU General Public License, R software and installation instructions can be obtained via the Comprehensive R Archive Network (CRAN).
 Import a data file with the read.csv() function:
# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")
# examine the imported dataset
head(sales)
1. Introduction to R (2)

 The head() function, by default, displays the first six records of sales.
1. Introduction to R (3)

 The summary() function provides some descriptive statistics, such as the mean and median, for each data column.
1. Introduction to R (4)

 The plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total).
# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")
1. Introduction to R (5)

 The lm() function is used for the linear regression modeling process.
• The intercept and slope values are –154.1 and 166.2, respectively, for the fitted linear equation.
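The slide's lm() call itself is not shown; a minimal sketch of the workflow, using synthetic data in place of the yearly_sales.csv file (which is not bundled here), could look like this:

```r
# Sketch only: a synthetic stand-in for the sales dataset, generated with the
# intercept and slope quoted in the text plus random noise
set.seed(1)
num_of_orders <- sample(1:30, 100, replace = TRUE)
sales_total   <- -154.1 + 166.2 * num_of_orders + rnorm(100, sd = 200)
results <- lm(sales_total ~ num_of_orders)   # fit sales against order counts
coef(results)   # intercept and slope of the fitted line
```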
1. Introduction to R (6)

 The attributes() function shows details on the contents of results.
 summary() is a generic function. A generic function is a group of functions sharing the same name but behaving differently depending on the number and type of arguments.
 plot() is another example of a generic function.
 hist() generates a histogram (Figure 3-2) of the residuals stored in results.
1. Introduction to R (7)

FIGURE 3-2 Evidence of large residuals


1.1 R Graphical User Interfaces

 R uses a command-line interface (CLI) similar to the BASH shell in Linux or the interactive versions of scripting languages such as Python.
 RGui.exe provides a basic graphical user interface (GUI) for R on Windows.
 Popular GUIs: R Commander, Rattle, and RStudio.
1.1 R Graphical User Interfaces - RStudio GUI (1)
1.1 R Graphical User Interfaces - RStudio GUI (2)

 ?lm or help(lm) at the console prompt can be used to obtain help information on R.
 edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be made in RStudio by selecting the appropriate variable from the workspace pane.
 R allows one to save the workspace environment, including variables and loaded libraries, into an .Rdata file using the save.image() function. An existing .Rdata file can be loaded using the load() function.
1.1 R Graphical User Interfaces - RStudio GUI (3)

FIGURE 3-4 Accessing help in RStudio


1.2 Data Import and Export (1)
 The read.csv() function imports a dataset from a file:
sales <- read.csv("c:/data/yearly_sales.csv")
• Use the forward slash (/) as the separator character in directory and file paths.
• The setwd() function sets the working directory, which simplifies the import of multiple files with long path names:
setwd("c:/data/")
sales <- read.csv("yearly_sales.csv")
• read.table() and read.delim() are intended to import other common file types such as TXT:
sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")
sales_delim <- read.delim("yearly_sales.csv", sep=",")
1.2 Data Import and Export (2)

 The main difference between these import functions is their default values. For example, read.delim() expects the column separator to be a tab ("\t").
1.2 Data Import and Export (3)

 write.table(), write.csv(), and write.csv2() enable exporting of R datasets to an external file.
 For example, the following R code adds an additional column to the sales dataset and exports the modified dataset to an external file.
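A hedged sketch of that export, using a small synthetic data frame and tempdir() instead of the book's c:/data path:

```r
# Sketch: a synthetic stand-in for the sales data frame
sales <- data.frame(sales_total = c(800, 1500, 3200),
                    num_of_orders = c(2, 3, 4))
# add a per-order column, as described in the text
sales$per_order <- sales$sales_total / sales$num_of_orders
out_file <- file.path(tempdir(), "sales_modified.csv")
write.csv(sales, file = out_file, row.names = FALSE)  # export the modified dataset
```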
1.2 Data Import and Export (4)

 DBI and RODBC are R packages used to read data from a database management system (DBMS). These packages provide database interfaces for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum.
 The install.packages() function installs an R package.
 The library() function loads the package into the R workspace.
 A connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via Open Database Connectivity (ODBC) with user=user.
 The training2 database must be defined either in the /etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.
1.2 Data Import and Export (5)

 A connector needs to be present to submit a SQL query to an ODBC database using the sqlQuery() function from the RODBC package (Ref: https://rdrr.io/cran/RODBC/man/).
 The following R code retrieves specific columns from the housing table in which household income (hinc) is greater than $1,000,000.
 Use sqlFetch(conn, "housing_data", max = 100) to retrieve up to 100 rows of a table directly.


1.2 Data Import and Export (6)

 Using the DBI package (Ref: https://dbi.r-dbi.org/reference/)
o Install the library: install.packages("odbc")
o Connect to a SQL database:
con <- DBI::dbConnect(odbc::odbc(),
    Driver = "SQL Server",
    Server = "[your server's path]",
    Database = "[your database's name]",
    UID = "[your user name]",
    PWD = "[your database password]",
    Port = 1433)
• Select data:
query <- dbSendQuery(con, "SELECT speed, dist FROM cars")
dbFetch(query, n = 10)
dbClearResult(query)
1.2 Data Import and Export (7)

 The jpeg() function creates a new JPEG file, adds a histogram plot to the file, and then closes the file.
 jpeg() is useful when automating standard reports.
 png(), bmp(), pdf(), and postscript() are also available in R to save plots in the desired format.
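A minimal sketch of this pattern, writing to tempdir() rather than a fixed path (the file name is illustrative):

```r
out_file <- file.path(tempdir(), "residuals_hist.jpeg")
jpeg(file = out_file)   # create a new JPEG file (open the device)
hist(rnorm(100))        # add a histogram plot to the file
dev.off()               # close the file
```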
1.3 Attribute and Data Types (1)

 In the earlier example, the sales variable contained a record for each customer. Several characteristics, such as total annual sales, number of orders, and gender, were provided for each customer.
 These characteristics or attributes provide qualitative and quantitative measures for each item or subject of interest.
 Attributes can be categorized into four types: nominal, ordinal, interval, and ratio.
1.3 Attribute and Data Types (2)
1.3 Attribute and Data Types (3)

 Data of one attribute type may be converted to another. For example,
o The quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent} with a defined mapping.
o A ratio attribute like Age can be converted into an ordinal attribute such as {Infant, Adolescent, Adult, Senior}.
 Understanding the attribute types in a dataset is necessary to select the appropriate descriptive statistics and analytic methods, and to apply and interpret them properly.
1.3.1 Numeric, Character, and Logical Data Types (1)

 R supports numeric, character, and logical (Boolean) values.
 Examples of such variables are given in the following R code:
i <- 1 # create a numeric variable
sport <- "football" # create a character variable
flag <- TRUE # create a logical variable
 R provides several functions, e.g. class() and typeof(), to examine the characteristics of a given variable.
o The class() function represents the abstract class of an object.
o The typeof() function determines the way an object is stored in memory.
 Although i appears to be an integer, it is internally stored using double precision.
1.3.1 Numeric, Character, and Logical Data Types (2)

 R functions can test variables and coerce a variable into a specific type.
 The is.integer() function tests whether i is an integer.
 The as.integer() function coerces i into a new integer variable, j. Similar coercion functions exist for double, character, and logical types.
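For example:

```r
i <- 1
is.integer(i)          # FALSE: i is stored as a double
j <- as.integer(i)     # coerce i into a new integer variable
is.integer(j)          # TRUE
as.integer(5.7)        # truncates the double to 5
as.integer("12")       # character to integer
as.integer(TRUE)       # logical to integer: 1
```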
1.3.1 Numeric, Character, and Logical Data Types (3)

 The length() function shows that each of the created variables has a length of 1.
 One might have expected the returned length of sport to be 8, one for each character in the string "football". However, these three variables are actually one-element vectors.
1.3.2 Vectors (1)
 Vectors are a basic building block for data in R. Simple R variables are vectors.
 A vector can only consist of values of the same class.
 The is.vector() function tests whether an object is a vector.
 The c() function or the colon operator (:) can be used to create a vector, for example to build the sequence of integers from 1 to 5 as in the example below.
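For example:

```r
v <- c(1, 2, 3, 4, 5)   # combine values into a vector
u <- 1:5                # the colon operator builds the same sequence
is.vector(v)            # TRUE
length(u)               # 5
```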
1.3.2 Vectors (2)

 A vector of a specific length can be initialized, with its contents populated later.
 The vector() function creates a logical vector by default. A vector of a different type can be specified by using the mode parameter.
 Vector c, an integer vector of length 0, may be useful when the number of elements is not initially known and new elements will later be added to the end of the vector as the values become available.
1.3.3 Arrays and Matrices (1)

 The array() function can be used to restructure a vector as an array.
 Below, a three-dimensional array holds the quarterly sales for three regions over a two-year period; the sales amount of $158,000 is then assigned to the second region for the first quarter of the first year.
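The example described above can be sketched as:

```r
# 3 regions x 4 quarters x 2 years, initialized to zero
quarterly_sales <- array(0, dim = c(3, 4, 2))
# assign $158,000 to the second region, first quarter, first year
quarterly_sales[2, 1, 1] <- 158000
quarterly_sales[2, 1, 1]
```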
1.3.3 Arrays and Matrices (2)

 A matrix is a two-dimensional array.
 Below, a matrix holds the quarterly sales for the three regions. The parameters nrow and ncol define the number of rows and columns, respectively, for sales_matrix.
1.3.3 Arrays and Matrices (3)

 Matrix operations include addition, subtraction, multiplication, the transpose function t(), and the inverse matrix function matrix.inverse() included in the matrixcalc package.
 The following R code builds a 3 × 3 matrix, M, and multiplies it by its inverse to obtain the identity matrix.
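If the matrixcalc package is not installed, base R's solve() computes the same inverse; a sketch:

```r
set.seed(2)
M <- matrix(rnorm(9), nrow = 3, ncol = 3)  # a random 3 x 3 matrix
I3 <- M %*% solve(M)                       # M times its inverse
round(I3, 10)                              # the identity matrix, up to rounding
```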
1.3.4 Data Frames (1)

 Data frames provide a structure for storing and accessing several variables of possibly different data types.
 As the is.data.frame() function indicates, a data frame is created by the read.csv() function.
1.3.4 Data Frames (2)

 The $ notation is used to access variables stored in the data frame.
 The following R code illustrates that, in this example, each variable is a vector, with the exception of gender, which was imported as a factor by a read.csv() default.
 A factor denotes a categorical variable, typically with a few finite levels such as "F" and "M" in the case of gender.
1.3.4 Data Frames (3)

 Data frames are the preferred input format for many of the modeling functions available in R.
 The following use of the str() function provides the structure of the sales data frame. This function identifies the integer and numeric (double) data types, the factor variables and levels, as well as the first few values for each variable.
1.3.4 Data Frames (4)

 In their simplest form, data frames are lists of variables of the same length.
 Subsetting operators can be used to retrieve a subset of the data frame. They allow one to express complex operations in a succinct fashion and easily retrieve a subset of the dataset.
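A hedged sketch of subsetting, with a small synthetic data frame standing in for sales:

```r
sales <- data.frame(sales_total   = c(80, 200, 600, 3000),
                    num_of_orders = c(1, 2, 3, 4),
                    gender        = factor(c("F", "M", "F", "M")))
sales[1:2, ]                          # first two rows, all columns
sales[, c("sales_total", "gender")]   # selected columns
sales[sales$sales_total > 500, ]      # rows where sales_total exceeds 500
```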
1.3.5 List

 Lists can contain any type of object, including other lists.
 Using the vector v and the matrix M created in earlier examples, the following R code creates assortment, a list of different object types.
• Use double brackets, [[ ]], to display the contents of an item in assortment.
• A single set of brackets only accesses an item in the list, not its content.
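A sketch of the bracket distinction:

```r
v <- 1:5
M <- matrix(1:9, nrow = 3)
assortment <- list("football", v, M)  # a list of different object types
assortment[[3]]        # double brackets return the matrix itself
assortment[3]          # single brackets return a one-item sub-list
class(assortment[3])   # "list"
```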
1.3.6 Factors (1)

 The gender variable in the data frame sales is a factor.
 In this case, gender can assume one of two levels: F or M. Factors can be ordered or not ordered. In the case of gender, the levels are not ordered.
• Included with the ggplot2 package, the diamonds data frame contains three ordered factors.
• Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good, Premium, and Ideal.
• sales$gender contains nominal data, and diamonds$cut contains ordinal data.
1.3.6 Factors (2)

 The following code categorizes sales$sales_total into three groups (small, medium, and big) according to the amount of the sales.
 These groupings are the basis for the new ordinal factor, spender, with levels {small, medium, big}.
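A hedged sketch using cut() with illustrative break points (the book's exact thresholds are not shown on this slide):

```r
sales_total <- c(80, 200, 600, 3000, 15000)   # synthetic totals
spender <- cut(sales_total,
               breaks = c(-Inf, 100, 500, Inf),     # illustrative thresholds
               labels = c("small", "medium", "big"),
               ordered_result = TRUE)   # ordinal factor: small < medium < big
spender
```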
1.3.7 Contingency Tables

 In R, table refers to a class of objects used to store the observed counts across the factors for a given dataset.
 Such a table is commonly referred to as a contingency table and is the basis for performing a statistical test on the independence of the factors used to build the table.
 The following R code builds a contingency table based on the sales$gender and sales$spender factors.
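A sketch with small synthetic factors in place of the sales columns:

```r
gender  <- factor(c("F", "M", "F", "M", "F"))
spender <- factor(c("small", "big", "medium", "small", "big"))
sales_table <- table(gender, spender)  # observed counts across the two factors
sales_table
```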
1.4 Descriptive Statistics (1)

 The summary() function provides several descriptive statistics, e.g. the mean and median, about a variable such as the sales data frame.
 The results now include the counts for the three levels of the spender variable based on the earlier examples involving factors.
1.4 Descriptive Statistics (2)

 The following code illustrates some common R functions that provide descriptive statistics. In parentheses, the comments describe the functions.
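A runnable sketch of such functions:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
mean(x)     # arithmetic mean: 5
median(x)   # middle value: 4.5
range(x)    # minimum and maximum: 2 9
var(x)      # sample variance
sd(x)       # sample standard deviation
IQR(x)      # interquartile range
```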
2 Exploratory Data Analysis (1)

 summary() can help analysts get an idea of the magnitude and range of the data.
 The following code shows a summary view of a data frame data with two columns x and y. The output shows the range of x and y, but it is not clear what the relationship may be between these two variables.
2 Exploratory Data Analysis (2)

 Variables x and y of the data frame data can be visualized in a scatterplot.
 Figure 3-5 depicts the relationship between the two variables.

FIGURE 3-5 A scatterplot can easily show if x and y share a relationship
2.1 Visualization Before Analysis (1)

 Figure 3-6 illustrates the importance of visualizing data; consider Anscombe's quartet, which consists of four datasets.
 It was constructed by the statistician Francis Anscombe in 1973 to demonstrate the importance of graphs in statistical analyses.

FIGURE 3-6 Anscombe's quartet
2.1 Visualization Before Analysis (2)

TABLE 3-3 Statistical Properties of Anscombe's Quartet

FIGURE 3-7 Anscombe's quartet visualized as scatterplots
2.1 Visualization Before Analysis (3)

 The gl() function creates variable levels; here it generates factors of four levels (1, 2, 3, and 4), each repeating 11 times.
 The variable mydata is created using the with(data, expression) function, which evaluates an expression in an environment constructed from data.
 In this example, the data is the anscombe dataset, which includes eight attributes: x1, x2, x3, x4, y1, y2, y3, and y4.
 The expression part in the code creates a data frame from the anscombe dataset, and it only includes three attributes: x, y, and the group each data point belongs to (mygroup).
2.1 Visualization Before Analysis (4)
2.2 Dirty Data (1)

 Consider a scenario in which a bank is conducting data analyses of its account holders to gauge customer retention. Figure 3-8 shows the age distribution of the account holders.
 If the age data is in a vector called age, the graph can be created with the following R script:
hist(age, breaks=100, main="Age Distribution of Account Holders",
     xlab="Age", ylab="Frequency", col="gray")
 The figure shows that the median age of the account holders is around 40. A few accounts with account holder age less than 10 are unusual but plausible.

FIGURE 3-8 Age distribution of bank account holders
2.2 Dirty Data (2)

 The left side of the graph shows a huge spike of customers who are zero
years old or have negative ages.
 This is likely to be evidence of missing data. One possible explanation
is that the null age values could have been replaced by 0 or negative
values during the data input.
 Such an occurrence may be caused by entering age in a text box that
only allows numbers and does not accept empty values. Or it might be
caused by transferring data among several systems that have different
definitions for null values (such as NULL, NA, 0, –1, or –2).
 Therefore, data cleansing needs to be performed over the accounts
with abnormal age values. Analysts should take a closer look at the
records to decide if the missing data should be eliminated or if an
appropriate age value can be determined using other available
information for each of the accounts.
2.2 Dirty Data (3)

 The is.na() function provides tests for missing values.
 The following example creates a vector x where the fourth value is not available (NA). The is.na() function returns TRUE at each NA value and FALSE otherwise.
 mean() applied to data containing missing values can yield an NA result. To prevent this, set the na.rm parameter to TRUE to remove the missing value during the function's execution.
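The example described above:

```r
x <- c(1, 2, 3, NA, 4)     # fourth value is not available
is.na(x)                   # FALSE FALSE FALSE TRUE FALSE
mean(x)                    # NA: the missing value propagates
mean(x, na.rm = TRUE)      # 2.5: NA removed before averaging
```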
2.3 Visualizing a Single Variable (1)

 R has many functions available to examine a single variable.
2.3 Visualizing a Single Variable (2)

 dotchart(x, label=...) creates a dotchart, where x is a numeric vector and label is a vector of categorical labels for x.
 barplot(height) creates a barplot, where height represents a vector or matrix.
 Figure 3-10 shows (a) a dotchart and (b) a barplot based on the mtcars dataset, which includes the fuel consumption and 10 aspects of automobile design and performance of 32 automobiles.
2.3 Visualizing a Single Variable (3)

 A dotchart can be created with the function dotchart(x, label=...), where
o x is a numeric vector and
o label is a vector of categorical labels for x.
2.3 Visualizing a Single Variable (4)

 Figure 3-11(a) includes a histogram of household income.
 The histogram shows a clear concentration of low household incomes on the left and the long tail of the higher incomes on the right.
2.3 Visualizing a Single Variable (5)
2.3 Visualizing a Single Variable (6)

 Consider a density plot of diamond prices (in USD).
 Figure 3-12(a) contains two density plots for premium and ideal cuts of diamonds.
 Figure 3-12(b) shows more detail of the diamond prices than Figure 3-12(a) by taking the logarithm.

FIGURE 3-12 Density plots of (a) diamond prices and (b) the logarithm of diamond prices
2.3 Visualizing a Single Variable (7)

• R script to generate the plots in Figure 3-12.
• The diamonds dataset comes with the ggplot2 package.
2.4 Examining Multiple Variables

 R code to produce Figure 3-13.

FIGURE 3-13 Examining two variables with regression
2.4 Examining Multiple Variables - Dotchart and Barplot

 The dotchart and barplot from the previous section can visualize multiple variables. Both use color as an additional dimension for visualizing the data.
 For the same mtcars dataset, Figure 3-14 shows a dotchart that groups vehicles by cylinder count on the y-axis and uses colors to distinguish different cylinder counts.
 The vehicles are sorted according to their MPG values.

FIGURE 3-14 Dotplot to visualize multiple variables
2.4 Examining Multiple Variables - Dotchart and Barplot
2.4 Examining Multiple Variables - Dotchart and Barplot

 The barplot in Figure 3-15 visualizes the distribution of car cylinder counts and number of gears.
 The x-axis represents the number of cylinders, and the color represents the number of gears.

FIGURE 3-15 Barplot to visualize multiple variables
2.4 Examining Multiple Variables - Box-and-Whisker Plot

 The graph shows how household income varies by region.
 The highest median incomes are in region 0 and region 9. Region 0 is slightly higher, but the boxes for the two regions overlap enough that the difference between the two regions probably is not significant.
 The lowest household incomes tend to be in region 7, which includes states such as Louisiana, Arkansas, and Oklahoma.

FIGURE 3-16 A box-and-whisker plot of mean household income and geographical region
2.4 Examining Multiple Variables - Hexbinplot for Large Datasets

 The hexbinplot of Figure 3-17(b) is plotted using the zcta data frame.
 Running the code requires the hexbin package, installed by running install.packages("hexbin").

FIGURE 3-17 (a) Scatterplot and (b) Hexbinplot of household income against years of education
2.4 Examining Multiple Variables - Scatterplot Matrix

 The vector colors defines the color scheme for the plot.
 colors <- c("gray50", "white", "black") makes the scatterplots grayscale.

FIGURE 3-18 Scatterplot matrix of Fisher's [13] iris dataset
2.4 Examining Multiple Variables - Analyzing a Variable over Time

 Visualizing a variable over time is the same as visualizing any pair of variables, but in this case the goal is to identify time-specific patterns. Figure 3-19 plots the monthly total numbers of international airline passengers (in thousands) from January 1949 to December 1960. Enter plot(AirPassengers) in the R console to obtain a similar graph. The plot shows that, for each year, a large peak occurs mid-year around July and August, and a small peak happens around the end of the year, possibly due to the holidays. Such a phenomenon is referred to as a seasonality effect.
 plot(AirPassengers)

FIGURE 3-19 Airline passenger counts from 1949 to 1960
2.5 Data Exploration Versus Presentation

 Figure 3-20 shows the density plot of the distribution of account values from a bank.

FIGURE 3-20 Density plots are better to show to data scientists
2.5 Data Exploration Versus Presentation
3.1 Hypothesis Testing (1)

 Hypothesis testing is to form an assertion and test it with data.
 The null hypothesis (H0) is the common assumption, when performing hypothesis tests, that there is no difference between the two samples. This assumption is used as the default position for building the test or conducting a scientific experiment.
 The alternative hypothesis (HA) is that there is a difference between the two samples.

FIGURE 3-22 Distributions of two samples of data
3.1 Hypothesis Testing (2)

 For example, if the task is to identify the effect of drug A compared to drug B on patients, the null hypothesis and alternative hypothesis would be as follows:
o H0: Drug A and drug B have the same effect on patients.
o HA: Drug A has a greater effect than drug B on patients.
 If the task is to identify whether advertising Campaign C is effective at reducing customer churn, the null hypothesis and alternative hypothesis would be as follows:
o H0: Campaign C does not reduce customer churn better than the current campaign method.
o HA: Campaign C does reduce customer churn better than the current campaign.
3.1 Hypothesis Testing (3)

 It is important to state the null hypothesis and alternative hypothesis, because misstating them is likely to undermine the subsequent steps of the hypothesis testing process.
 A hypothesis test leads to either rejecting the null hypothesis in favor of the alternative or not rejecting the null hypothesis.
3.2 Difference of Means

 H0: μ1 = μ2
 HA: μ1 ≠ μ2
 μ1 and μ2 denote the population means of pop1 and pop2, respectively.

FIGURE 3-23 Overlap of the two distributions is large if X̄1 ≈ X̄2
3.2.1 Student’s t-test (1)

 Student's t-test assumes that the distributions of the two populations have equal but unknown variances. Suppose n1 and n2 samples are randomly and independently selected from two populations, pop1 and pop2.
 If each population is normally distributed with the same mean (μ1 = μ2) and with the same variance, then T (the t-statistic), given in Equation 3-1, follows a t-distribution with n1 + n2 − 2 degrees of freedom (df).

T = (X̄1 − X̄2) / (Sp · √(1/n1 + 1/n2))   (3-1)

where Sp is the pooled sample standard deviation, with Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2).
3.2.1 Student’s t-test (2)

 The shape of the t-distribution is similar to the normal distribution.
 As the degrees of freedom approach 30 or more, the t-distribution is nearly identical to the normal distribution.
 Because the numerator of T is the difference of the sample means, if the observed value of T is far enough from zero that such a value of T is unlikely, one would reject the null hypothesis that the population means are equal.
 For a small probability, say α = 0.05, a value T* is determined such that P(|T| ≥ T*) = 0.05. After the samples are collected and the observed value of T is calculated according to Equation 3-1, the null hypothesis (μ1 = μ2) is rejected if |T| ≥ T*.
3.2.1 Student’s t-test (3)
 In hypothesis testing, the small probability α is known as the significance level of the test.
 The significance level of the test is the probability of rejecting the null hypothesis when the null hypothesis is actually TRUE. In other words, for α = 0.05, if the means from the two populations are truly equal, then in repeated random sampling the observed magnitude of T would exceed T* only 5% of the time.
 In the following R code example, 10 observations are randomly selected from each of two normally distributed populations and assigned to the variables x and y.
 The two populations have means of 100 and 105, respectively, and a standard deviation equal to 5. Student's t-test is then conducted to determine if the obtained random samples support the rejection of the null hypothesis.
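A sketch of the described experiment (the seed is illustrative, so the exact values will differ from the book's output):

```r
set.seed(100)                        # illustrative seed
x <- rnorm(10, mean = 100, sd = 5)   # sample from pop1
y <- rnorm(10, mean = 105, sd = 5)   # sample from pop2
t.test(x, y, var.equal = TRUE)       # Student's t-test (pooled variance)
```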
3.2.2 Welch’s t-test (1)

 When the equal-variance assumption does not hold, Welch's t-test can be used, based on T as expressed in Equation 3-2:

T = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2)   (3-2)

• where X̄i, Si², and ni correspond to the i-th sample mean, sample variance, and sample size.
• Welch's t-test uses the sample variance (Si²) for each population instead of the pooled sample variance.
 The following R code performs Welch's t-test on the same set of data analyzed in the earlier Student's t-test example.
3.2.2 Welch’s t-test (2)

 The degrees of freedom for Welch's t-test are defined in Equation 3-3:

df = (S1²/n1 + S2²/n2)² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ]   (3-3)
3.2.2 Welch’s t-test (3)

 A confidence interval is an interval estimate of a population parameter or characteristic based on sample data.

FIGURE 3-25 A 95% confidence interval straddling the unknown population mean μ
3.2.2 Welch’s t-test (4)

 Confidence intervals are discussed again in Section 3.3.6 on ANOVA.
 A key assumption in both the Student's and Welch's t-tests is that the relevant population attribute is normally distributed.
 For non-normally distributed data, it is sometimes possible to transform the collected data to approximate a normal distribution.
 For example, taking the logarithm of a dataset can often transform skewed data into a dataset that is at least symmetric around its mean. However, if such transformations are ineffective, there are tests like the Wilcoxon rank-sum test that can be applied to see if two population distributions are different.
3.3 Wilcoxon Rank-Sum Test (1)

 The Wilcoxon rank-sum test is a nonparametric hypothesis test that checks whether two populations are identically distributed.
 Let the two populations again be pop1 and pop2, with independently drawn random samples of size n1 and n2, respectively. The total number of observations is then N = n1 + n2.
 The Wilcoxon rank-sum test determines the significance of the observed rank-sums.
 The following R code performs the test on the same dataset used for the previous t-test.
wilcox.test(x, y, conf.int = TRUE)
3.3 Wilcoxon Rank-Sum Test (2)

 The wilcox.test() function ranks the observations, determines the respective rank-sums corresponding to each population's sample, and then determines the probability of rank-sums of such magnitude being observed, assuming that the population distributions are identical.
 In this example, the probability is given by the p-value of 0.04903.
 Thus, the null hypothesis would be rejected at a 0.05 significance level. The reader is cautioned against interpreting that one hypothesis test is clearly better than another test based solely on the examples given in this section.
3.4 Type I and Type II Errors (1)
3.4 Type I and Type II Errors (2)

 significance level mentioned in the Student’s t-test discussion is


equivalent to the type I error.
 For a significance level such as α=0.05, if null hypothesis (μ1= μ2) is
TRUE, there is a 5% chance that the observed T value based on the
sample data will be large enough to reject the null hypothesis.
 By selecting n appropriate significance level, probability of committing
a type I error can be defined before any data is collected or analyzed.
 probability of committing a Type II error is somewhat more difficult to
determine. If two population means are truly not equal, probability of
committing a type II error will depend on how far apart the means
truly are. To reduce the probability of a type II error to a reasonable
level, it is often necessary to increase the sample size
3.5 Power and Sample Size
 The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted by 1 − β, where β is the probability of a type II error.
 In general, the magnitude of the difference is known as the effect size. As the sample size becomes larger, it is easier to detect a given effect size, δ, as illustrated in Figure 3-26.

FIGURE 3-26 A larger sample size better identifies a fixed effect size
3.6 ANOVA (1)

 The hypothesis tests presented so far are good for analyzing means between two populations. What if there are more than two populations? Consider an example of testing the impact of nutrition and exercise on 60 candidates between ages 18 and 50. The candidates are randomly split into six groups, each assigned a different weight loss strategy, and the goal is to determine which strategy is the most effective.
o Group 1 only eats junk food.
o Group 2 only eats healthy food.
o Group 3 eats junk food and does cardio exercise every other day.
o Group 4 eats healthy food and does cardio exercise every other day.
o Group 5 eats junk food and does both cardio and strength training every other day.
o Group 6 eats healthy food and does both cardio and strength training every other day.
3.6 ANOVA (2)
 Multiple t-tests could be applied to each pair of weight loss strategies. In this example, the weight loss of Group 1 is compared with the weight loss of Group 2, 3, 4, 5, or 6. Similarly, the weight loss of Group 2 is compared with that of the next four groups. Therefore, a total of 15 t-tests would be performed.
 However, multiple t-tests may not perform well on several populations for two reasons.
o First, the number of t-tests increases as the number of groups increases, and analysis using multiple t-tests becomes cognitively more difficult.
o Second, by doing a greater number of analyses, the probability of committing at least one type I error somewhere in the analysis greatly increases.
 ANOVA (Analysis of Variance) is designed to address these issues. ANOVA is a generalization of the hypothesis testing of the difference of two population means.
 ANOVA tests if any of the population means differ from the other population means.
o The null hypothesis is that all the population means are equal: H0: μ1 = μ2 = … = μn.
o The alternative hypothesis is that at least one pair of the population means is not equal: HA: μi ≠ μj for at least one pair of i, j.
3.6 ANOVA (3)
• In Section 3.3.2, "Difference of Means," each population is assumed to be normally distributed with the same variance.
• The first thing to calculate for the ANOVA is the test statistic. The goal is to test whether the clusters formed by each population are more tightly grouped than the spread across all the populations.
 k: the total number of populations
 N: the total number of samples, randomly split into k groups
 ni: the number of samples in the i-th group
 X̄i: the mean of the i-th group, i ∈ [1, k]
 X̄0: the mean of all the samples
 S²B (the between-groups mean sum of squares): an estimate of the between-groups variance. It measures how the population means vary with respect to the grand mean, or the mean spread across all the populations:

S²B = Σ ni(X̄i − X̄0)² / (k − 1), summing over the k groups
3.6 ANOVA (4)

 S²W (the within-group mean sum of squares): an estimate of the within-group variance:

S²W = Σ (ni − 1)Si² / (N − k), summing over the k groups

 If S²B is much larger than S²W, then some of the population means are different from each other.
 The F-test statistic is defined as the ratio of the between-groups mean sum of squares and the within-group mean sum of squares: F = S²B / S²W, with k − 1 and N − k degrees of freedom.
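A hedged one-way ANOVA sketch with three synthetic groups (the book's six-group weight-loss design works the same way with aov()):

```r
set.seed(3)
group <- factor(rep(c("g1", "g2", "g3"), each = 50))
# synthetic responses: g2 has a shifted mean, so groups should differ
response <- rnorm(150, mean = c(60, 80, 60)[as.integer(group)], sd = 20)
model <- aov(response ~ group)   # one-way ANOVA
summary(model)   # F value = between-groups MS / within-group MS
```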
3.6 ANOVA (5)
3.6 ANOVA (6)
3.6 ANOVA (7)
Summary

 R is a popular package and programming language for data exploration, analytics, and visualization.
 This chapter showed how to use R to perform exploratory data analysis, including the discovery of dirty data, visualization of one or more variables, and customization of visualization for different audiences.
 It also introduced some basic statistical methods:
o Hypothesis testing. The Student's t-test and Welch's t-test are included as two example hypothesis tests designed for testing the difference of means.
o Other statistical methods and tools: confidence intervals, the Wilcoxon rank-sum test, type I and II errors, effect size, and ANOVA.
