Chapter - 03 - Review of Basic Data Analytic Methods Using R
Author : FU
Date : Mar-2022
Objectives
1. Introduction to R
2. Exploratory Data Analysis
3. Statistical Methods for Evaluation
1. Introduction to R (1)
DBI and RODBC are R packages used to read data from a database
management system (DBMS). These packages provide database
interfaces for communication between R and DBMSs such as MySQL,
Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum.
The install.packages() function installs R packages.
The library() function loads a package into the R workspace.
A connector (conn) is initialized for connecting to the Pivotal Greenplum
database training2 via open database connectivity (ODBC) with user=user.
The training2 database must be defined either in the /etc/ODBC.ini
configuration file or using the Administrative Tools under the Windows
Control Panel.
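The connection steps above can be sketched as follows. This is a minimal illustration, not the deck's original script: the helper name fetch_from_training2, the password value, and the table name in the example call are hypothetical, and the code only works where the RODBC package is installed and a training2 data source is configured.

```r
# Sketch only: needs the RODBC package and a configured ODBC data source.
# install.packages("RODBC")   # run once to install the package
# library(RODBC)              # the usual way to load it in an interactive session

# Hypothetical helper: opens the connection, runs one query, then closes it.
fetch_from_training2 <- function(query, uid = "user", pwd = "password") {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("Package 'RODBC' is not installed; run install.packages('RODBC').")
  }
  # "training2" must match a data source name defined in the ODBC configuration
  conn <- RODBC::odbcConnect("training2", uid = uid, pwd = pwd)
  if (!inherits(conn, "RODBC")) {
    stop("Could not connect to the 'training2' data source.")
  }
  on.exit(RODBC::odbcClose(conn))   # always release the connection
  RODBC::sqlQuery(conn, query)      # returns the result as a data frame
}

# Example call (hypothetical table name, not run here):
# housing_data <- fetch_from_training2("SELECT * FROM housing")
```

Wrapping the calls in a function keeps the connection open only for the duration of one query.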
1.2 Data Import and Export (5)
• Included with the ggplot2 package, the diamonds data frame contains three
ordered factors.
• Examining the cut factor, there are five levels in order of improving cut:
• Fair, Good, Very Good, Premium, and Ideal.
• sales$gender contains nominal data, and diamonds$cut contains ordinal
data.
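The difference between nominal and ordinal factors can be illustrated with a small sketch. The two vectors below are made-up samples (not the actual sales or diamonds data); only the five cut levels come from the text above.

```r
# Ordinal data: the levels have a meaningful order (worst to best cut)
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_sample <- factor(c("Ideal", "Fair", "Premium", "Good", "Ideal"),
                     levels = cut_levels, ordered = TRUE)

# Nominal data: the levels have no inherent order
gender_sample <- factor(c("F", "M", "M", "F", "F"))

is.ordered(cut_sample)      # TRUE
is.ordered(gender_sample)   # FALSE
levels(cut_sample)          # levels listed in order of improving cut

# Order comparisons are defined only for ordered factors
below_premium <- cut_sample < "Premium"
table(cut_sample)
```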
1.3.6 Factors (2)
The following code categorizes sales$sales_totals into three groups
(small, medium, and big) according to the amount of the sales.
These groupings are the basis for the new ordinal factor, spender,
with levels {small, medium, big}.
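One way to build such a factor is sketched below. The sales values and the cut-offs (100 and 500) are assumptions chosen for illustration, and cut() is one possible approach rather than necessarily the one used in the original slide.

```r
# Hypothetical stand-in for the sales data frame
sales <- data.frame(sales_totals = c(45.10, 120.00, 980.50, 99.99, 500.00, 310.25))

# Assumed cut-offs: below 100 is small, 100 up to 500 is medium, 500 and above is big
sales$spender <- cut(sales$sales_totals,
                     breaks = c(-Inf, 100, 500, Inf),
                     labels = c("small", "medium", "big"),
                     right = FALSE,           # intervals are [lower, upper)
                     ordered_result = TRUE)   # makes spender an ordinal factor

str(sales$spender)
table(sales$spender)
```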
1.3.7 Contingency Tables
Variables x and y of the data frame data can be visualized in a scatterplot.
Figure 3-5 depicts the relationship between the two variables.
The gl() function creates the variable levels, a factor with four levels (1, 2,
3, and 4), each repeated 11 times.
The variable mydata is created using the with(data, expression) function, which
evaluates an expression in an environment constructed from data.
In this example, the data is the anscombe dataset, which includes eight
attributes: x1, x2, x3, x4, y1, y2, y3, and y4.
The expression part of the code creates a data frame from the anscombe dataset
that includes only three attributes: x, y, and the group each data point
belongs to (mygroup).
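The steps described above can be sketched as follows, using the anscombe dataset that ships with R. The final line, which compares group means, is an added check and not part of the description above.

```r
# anscombe is part of the built-in datasets package (11 rows, 8 columns)
data(anscombe)

# Factor with 4 levels, each repeated nrow(anscombe) = 11 times
levels <- gl(4, nrow(anscombe))

# Stack the four x/y pairs into one long data frame with a group label
mydata <- with(anscombe,
               data.frame(x = c(x1, x2, x3, x4),
                          y = c(y1, y2, y3, y4),
                          mygroup = levels))

head(mydata)
table(mydata$mygroup)

# The four groups share the same mean of x despite looking very different
x_means <- tapply(mydata$x, mydata$mygroup, mean)
```

Plotting y against x separately for each value of mygroup is what reveals the differences that the summary statistics hide.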
2.1 Visualization Before Analysis (4)
2.2 Dirty Data (1)
The left side of the graph shows a huge spike of customers who are zero
years old or have negative ages.
This is likely to be evidence of missing data. One possible explanation
is that the null age values could have been replaced by 0 or negative
values during the data input.
Such an occurrence may be caused by entering age in a text box that
only allows numbers and does not accept empty values. Or it might be
caused by transferring data among several systems that have different
definitions for null values (such as NULL, NA, 0, –1, or –2).
Therefore, data cleansing needs to be performed over the accounts
with abnormal age values. Analysts should take a closer look at the
records to decide if the missing data should be eliminated or if an
appropriate age value can be determined using other available
information for each of the accounts.
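A first pass at this kind of cleansing might look like the sketch below. The accounts data frame is invented for illustration, and recoding to NA is only one option; whether to drop or impute the records is the analyst's decision, as noted above.

```r
# Hypothetical account records; 0, -1 and -2 stand in for missing ages
accounts <- data.frame(id  = 1:8,
                       age = c(34, 0, 51, -1, 27, 0, -2, 68))

# Flag ages that cannot be real (missing, zero, or negative)
abnormal <- is.na(accounts$age) | accounts$age <= 0
n_abnormal <- sum(abnormal)
accounts[abnormal, ]          # inspect the suspicious records

# One option: recode them to NA so later steps treat them as missing
accounts$age_clean <- ifelse(abnormal, NA, accounts$age)
mean_age <- mean(accounts$age_clean, na.rm = TRUE)
```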
2.2 Dirty Data (3)
R has many functions available to examine a single variable.
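A few of these functions are shown in the sketch below. The choice of mtcars$mpg as the example variable is an assumption; any numeric vector would do.

```r
# Example variable from a built-in dataset (an arbitrary choice)
x <- mtcars$mpg

summary(x)                    # min, quartiles, mean, max
sd(x)                         # standard deviation
IQR(x)                        # interquartile range
quantile(x, c(0.1, 0.9))      # selected percentiles
stem(x)                       # text-based stem-and-leaf plot
d <- density(x)               # kernel density estimate; plot(d) draws it
```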
2.3 Visualizing a Single Variable (2)
Figure 3-11(a) includes a histogram of household income.
The histogram shows a clear concentration of low household incomes on the
left and the long tail of the higher incomes on the right.
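The shape described above can be reproduced with simulated data. The income values below are drawn from a log-normal distribution purely as an assumption to mimic a right-skewed variable; they are not the data behind Figure 3-11.

```r
set.seed(1)
# Simulated, hypothetical household incomes with a long right tail
income <- rlnorm(5000, meanlog = 10.5, sdlog = 0.8)

# Compute the histogram without drawing it, to inspect the bin counts
h <- hist(income, breaks = 50, plot = FALSE)
peak_bin <- which.max(h$counts)      # index of the tallest bar
h$mids[peak_bin]                     # income value where the spike sits

# In a right-skewed distribution the mean exceeds the median
skewed_right <- mean(income) > median(income)

# To draw the plot interactively:
# hist(income, breaks = 50, main = "Household income", xlab = "Income")
```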
2.3 Visualizing a Single Variable (5)
2.3 Visualizing a Single Variable (6)
• R script to generate the plots in the figure.
• The diamonds dataset comes with the ggplot2 package.
2.4 Examining Multiple Variables
H0: μ1 = μ2
HA: μ1 ≠ μ2    (3-1)
Here μ1 and μ2 denote the population means of pop1 and pop2, respectively.
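These hypotheses can be tested with Student's t-test, as in the sketch below. The two samples are simulated with assumed means, standard deviations, and sizes. The manual calculation uses the standard pooled-variance form of the statistic and is included only to show what t.test() computes when var.equal = TRUE.

```r
set.seed(42)
# Hypothetical samples from pop1 and pop2 (equal variances assumed)
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)

# Student's t-test: var.equal = TRUE requests the pooled-variance version
res_student <- t.test(x, y, var.equal = TRUE)
res_student

# The same statistic by hand
n1 <- length(x)
n2 <- length(y)
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)  # pooled variance
t_student <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
```

A small p-value in the output is evidence against H0, that is, against equal population means.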
3.2.1 Student’s t-test (2)
• where X̄i, Si², and ni correspond to the i-th sample mean, sample variance, and
sample size
• Welch's t-test uses the sample variance (Si²) for each population instead of the
pooled sample variance.
The following R code performs Welch's t-test on the same set of data
analyzed in the earlier Student's t-test example.
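A minimal sketch of such a call is given below. Because the original data are not reproduced here, the samples are simulated with assumed parameters. The manual statistic and degrees of freedom follow the usual Welch formulas and serve only as a cross-check.

```r
set.seed(42)
# Hypothetical samples (stand-ins for the data in the Student's t-test example)
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)

# Welch's t-test: var.equal = FALSE (the default) keeps separate variances
res_welch <- t.test(x, y, var.equal = FALSE)
res_welch

# The same statistic and degrees of freedom by hand
n1 <- length(x)
n2 <- length(y)
v1 <- var(x) / n1
v2 <- var(y) / n2
t_welch  <- (mean(x) - mean(y)) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
```

Unlike the pooled version, the degrees of freedom here are generally not an integer.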
3.2.2 Welch’s t-test (2)
A confidence interval is an interval estimate of a population parameter or
characteristic based on sample data.
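As an illustration, a 95% confidence interval for a population mean can be computed as sketched below. The sample is simulated with assumed parameters, and the interval uses the t distribution because the population variance is treated as unknown.

```r
set.seed(7)
# Hypothetical sample of 25 observations
x <- rnorm(25, mean = 50, sd = 10)
n <- length(x)

# Margin of error: t quantile times the standard error of the mean
margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci_manual <- c(mean(x) - margin, mean(x) + margin)

# t.test() reports the same interval
ci_ttest <- t.test(x, conf.level = 0.95)$conf.int
ci_manual
ci_ttest
```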
If SB² is much larger than SW², then some of the population means are
different from each other.
The F-test statistic is defined as the ratio of the between-groups mean sum of
squares (SB²) to the within-groups mean sum of squares (SW²).
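The ratio described above can be checked against R's aov() function. The three groups below are simulated with assumed means and sizes, and the variable names are hypothetical.

```r
set.seed(3)
# Hypothetical data: three groups of 20 observations each
offers <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 20)),
  value = c(rnorm(20, 80, 10), rnorm(20, 85, 10), rnorm(20, 95, 10))
)

# One-way ANOVA
fit <- aov(value ~ group, data = offers)
summary(fit)
F_aov <- summary(fit)[[1]][["F value"]][1]

# The same F statistic by hand
k <- nlevels(offers$group)                        # number of groups
n <- nrow(offers)                                 # total sample size
grand_mean  <- mean(offers$value)
group_means <- tapply(offers$value, offers$group, mean)
group_sizes <- tapply(offers$value, offers$group, length)

SB2 <- sum(group_sizes * (group_means - grand_mean)^2) / (k - 1)   # between groups
SW2 <- sum((offers$value - ave(offers$value, offers$group))^2) / (n - k)  # within groups
F_manual <- SB2 / SW2
```

A large F value, with a correspondingly small p-value in the summary, suggests that at least one group mean differs from the others.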
3.6 ANOVA (5)
3.6 ANOVA (6)
3.6 ANOVA (7)
Summary