Unit 3 Data Analysis
Unit 3 Data Analysis
Raw data are the records or observations that make up a sample. Depending on the nature of
the intended analysis, these data could be stored in a specialized R object, often a data frame,
possibly read in from an external file using techniques. Before summarizing or modeling the
data, it is important to clearly identify your available variables.
A variable is a characteristic of an individual in a population, the value of which can differ
between entities within that population. Variables can take on a number of forms, determined
by the nature of the values they may take.
Numeric Variables
A numeric variable is one whose observations are naturally recorded as numbers. There are
two types of numeric variables: continuous and discrete.
A continuous variable can be recorded as any value in some interval, up to any number of
decimals. For example, if you were observing rainfall amount, a value of 15 mm would make
sense, but so would a value of 15.42135 mm. Any degree of measurement precision gives a
valid observation.
A discrete variable, on the other hand, may take on only distinct numeric values—and if the
range is restricted, then the number of possible values is finite. For example, if you were
observing the number of heads in 20 flips of a coin, only whole numbers would make sense.
It would not make sense to observe 15.42135 heads; the possible outcomes are restricted to
the integers from 0 to 20 (inclusive).
Categorical Variables
Though numeric observations are common for many variables, it’s also important to consider
categorical variables. Like some discrete variables, categorical variables may take only one
of a finite number of possibilities. Unlike discrete variables, however, categorical
observations are not always recorded as numeric values.
There are two types of categorical variables. Those that cannot be logically ranked are called
nominal. A good example of a categorical-nominal variable is gender. In most data sets, it
has two fixed possible values, male and female, and the order of these categories is irrelevant.
Categorical variables that can be naturally ranked are called ordinal. An example of a
categorical-ordinal variable would be the dose of a medicine, with the possible values low,
medium, and high. These values can be ordered in either increasing or decreasing amounts,
and the ordering might be relevant to the research.
Once you know what to look for, identifying the types of variables in a given data set is
straightforward. Take the data frame chickwts, which is available in the automatically loaded
datasets package. At the prompt, directly entering the following gives you the first five
records of this data set.
R> chickwts[1:5,]
weight feed
179 horsebean
2160 horsebean
3136 horsebean
4227 horsebean
5217 horsebean
R’s help file (?chickwts) describes these data as comprising the weights of 71 chicks (in
grams) after six weeks, based on the type of food provided to them. Now let’s take a look at
the two columns in their entirety as vectors:
R> chickwts$weight
179 160 136 227 217 168 108 124 143 140 309 229 181 141
260 203 148 169 213 257 244 271 243 230 248 327 329 250
193 271 316 267 199 171 158 248 423 340 392 339 341
226 320 295 334 322 297 318 325 257 303 315 380 153 263
242 206 344 258 368 390 379 260 404 318 352 359 216
222 283 332
R> chickwts$feed
weight is a numeric measurement that can fall anywhere on a continuum, so this is a numeric-
continuous variable. The fact that the chick weights appear to have been rounded or recorded
to the nearest gram does not affect this definition because in reality the weights can be any
value. feed is clearly a categorical variable because it has only six possible outcomes, which
aren’t numeric. The absence of any natural or easily identifiable ordering leads to the
conclusion that feed is a categorical-nominal variable.
Univariate and Multivariate Data
When discussing or analyzing data related to only one dimension, is called univariate data.
For example, the weight variable in the earlier example is univariate since each measurement
can be expressed with one component—a single number.
When it’s necessary to consider data with respect to variables that exist in more than one
dimension (in other words, with more than one component or measurement associated with
each observation), then the data are considered multivariate. Multivariate measurements are
arguably most relevant when the individual components aren’t as useful when considered on
their own (in other words, as univariate quantities) in any given statistical analysis.
An ideal example is that of spatial coordinates, which must be considered in terms of at least
two components—a horizontal x-coordinate and a vertical y-coordinate. The univariate data
alone—for example, the x-axis values only—aren’t especially useful. Consider the quakes
data set, which contains observations on 1,000 seismic events recorded off the coast of Fiji.
If you look at the first five records and read the descriptions in the help file ?quakes, you
quickly get a good understanding of what’s presented.
R>quakes[1:5,]
lat long depth mag stations
-20.42 181.62 562 4.8 41
-20.62 181.03 650 4.2 15
-26.00 184.10 42 5.4 43
-17.97 181.66 626 4.1 19
-20.42 181.96 649 4.0 11
The columns lat and long provide the latitude and longitude of the event, depth provides the
depth of the event (in kilometers), mag provides the magnitude, and stations provides the
number of observation stations that detected the event.
Parameter or Statistic
Statistics as a discipline is concerned with understanding features of an overall population,
defined as the entire collection of individuals or entities of interest. The characteristics of
that population are referred to as parameters. Because researchers are rarely able to access
relevant data on every single member of the population of interest, they typically collect a
sample of entities to represent the population and record relevant data from these entities.
They may then estimate the parameters of interest using the sample data—and those
estimates are the statistics.
For example, if you were interested in the average age of women in the United States who
own cats, the population of interest would be all women residing in the United States who
own at least one cat. The parameter of interest is the true mean age of women in the United
States who own at least one cat. Of course, obtaining the age of every single female
American with a cat would be a difficult feat. A more feasible approach would be to
randomly identify a smaller number of cat-owning American women and take data from
them—this is your sample, and the mean age of the women in the sample is your statistic.
Thus, the key difference between a statistic and a parameter is whether the characteristic
refers to the sample you drew your data from or the wider population.