
Unit-IV: Variation, Missing Values, Covariation, Patterns and Models

The document discusses variation, missing values, covariation, patterns, and models in exploratory data analysis. It explains how to visualize distributions of categorical and continuous variables using bar charts and histograms. It also covers replacing missing values, examining covariation using boxplots, identifying patterns in data, and fitting simple linear models.

UNIT-IV

Variation, missing values, covariation, patterns and models


Variation
● Variation is the tendency of the values of a variable to change from
measurement to measurement.
● Both continuous and categorical variables vary: a continuous quantity gives
slightly different results each time it is measured, and a categorical variable
varies across subjects (e.g., the eye colors of different people).
● Every variable has its own pattern of variation, which can reveal
interesting information.
● The best way to understand that pattern is to visualize the distribution of
the variable's values.
Visualizing Distributions
● In R, categorical variables are usually saved as factors or character
vectors. To examine the distribution of a categorical variable, use a bar
chart:
● > ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
● You can also compute these values manually with dplyr::count():
● diamonds %>% count(cut)
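As a quick cross-check on the bar chart, the bar heights can be compared with a base R tabulation. This is a minimal sketch, assuming the ggplot2 and dplyr packages are installed; the names cut_counts, base_counts and same_counts are introduced here for illustration only.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides count() and %>%

# Count the diamonds in each cut category (these are the bar heights)
cut_counts <- diamonds %>% count(cut)

# The same tabulation with base R, for comparison
base_counts <- table(diamonds$cut)

# Both approaches should report the same count for every level of cut
same_counts <- all(cut_counts$n == as.vector(base_counts))

print(cut_counts)
print(same_counts)
```

Because cut is a factor, both tabulations list the categories in the factor's level order, which is also the order of the bars on the x axis.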
● A variable is continuous if it can take any of an infinite set of ordered
values.
● Numbers and date-times are two examples of continuous variables.
● To examine the distribution of a continuous variable, use a histogram.
● > ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat),
binwidth = 0.5)
● You can compute this by hand by combining dplyr::count() and
ggplot2::cut_width():
● > diamonds %>% count(cut_width(carat, 0.5))
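The by-hand calculation above can be sketched a little more explicitly. This is only an illustration, assuming ggplot2 and dplyr are installed; the column name bin and the variables carat_bins and total_binned are made up for this example.

```r
library(ggplot2)  # provides diamonds and cut_width()
library(dplyr)

# Assign each diamond to a 0.5-carat-wide bin, then count the diamonds per bin.
# Naming the expression (bin = ...) gives the result a readable column name.
carat_bins <- diamonds %>% count(bin = cut_width(carat, 0.5))
print(carat_bins)

# Every diamond lands in exactly one bin, so the counts add up to the row total
total_binned <- sum(carat_bins$n)
print(total_binned == nrow(diamonds))
```

Each row of carat_bins corresponds to one bar of the histogram drawn with binwidth = 0.5.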
● To look more closely, zoom in on just the diamonds with a size of less than
three carats and choose a smaller binwidth:
● > smaller <- diamonds %>% filter(carat < 3)
● > ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
● If you wish to overlay multiple histograms in the same plot, use
geom_freqpoly() instead of geom_histogram().
● geom_freqpoly() performs the same calculation as geom_histogram(), but
displays the counts with lines instead of bars.
● It’s much easier to understand overlapping lines than bars:
● > ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
geom_freqpoly(binwidth = 0.1)
Missing Values
● If you encounter unusual values in your dataset, you have two options.
● Option 1: drop the entire row with the strange values:
● > diamonds2 <- diamonds %>% filter(between(y, 3, 20))
● Option 2 (usually preferable, because one invalid measurement does not make
the other measurements in the row invalid): replace the unusual values with
missing values. The easiest way to do this is to use mutate() to replace the
variable with a modified copy. You can use the ifelse() function to replace
unusual values with NA:
● > diamonds2 <- diamonds %>% mutate(y = ifelse(y < 3 | y > 20, NA, y))
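The two approaches can be compared side by side. The sketch below assumes ggplot2 and dplyr are installed; the names dropped, replaced, n_missing and n_dropped are hypothetical and used only for this comparison.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Approach 1: drop every row whose y value is implausible
dropped <- diamonds %>% filter(between(y, 3, 20))

# Approach 2: keep every row, but mark the implausible y values as missing
replaced <- diamonds %>% mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Number of y values turned into NA under approach 2
n_missing <- sum(is.na(replaced$y))

# Number of rows lost under approach 1
n_dropped <- nrow(diamonds) - nrow(dropped)

print(c(missing = n_missing, dropped = n_dropped))
```

The two counts should match: the rows that the first approach discards are the same rows whose y becomes NA in the second, but the second keeps their other columns available for analysis.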
Exploratory Data Analysis (EDA)
Missing Values
● ggplot2 leaves missing values out of the plot, but warns that they have been
removed:
● > ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + geom_point()
● #> Warning: Removed 9 rows containing missing values (geom_point).
● To suppress that warning, set na.rm = TRUE:
● > ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
Covariation
● Covariation is the tendency for the values of two or more variables to vary
together in a related way.
● The best way to spot covariation is to visualize the relationship between
two or more variables.
● One way to display the distribution of a continuous variable broken down by a
categorical variable is the boxplot.
● A boxplot is a type of visual shorthand for a distribution of values that is
popular among statisticians.
Each boxplot consists of:
● A box that stretches from the 25th percentile of the distribution to the 75th
percentile, a distance known as the interquartile range (IQR).
● In the middle of the box is a line that displays the median, i.e., 50th
percentile, of the distribution.
● These three lines (the two edges of the box and the median line) give you a
sense of the spread of the distribution and whether the distribution is
symmetric about the median or skewed to one side.
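The three lines of the box can also be computed directly. This is a rough sketch using base R's quantile() and IQR() on the price column of diamonds (ggplot2 is loaded only for the dataset); box_lines and box_height are names invented for this example.

```r
library(ggplot2)  # only needed for the diamonds dataset

# The three lines of the box: 25th percentile, median, 75th percentile
box_lines <- quantile(diamonds$price, probs = c(0.25, 0.5, 0.75))

# The height of the box is the interquartile range (IQR)
box_height <- IQR(diamonds$price)

print(box_lines)
print(box_height)

# Comparing the gaps on either side of the median hints at skew:
# a much larger upper gap suggests a distribution skewed toward high values
upper_gap <- box_lines[["75%"]] - box_lines[["50%"]]
lower_gap <- box_lines[["50%"]] - box_lines[["25%"]]
print(c(lower_gap = lower_gap, upper_gap = upper_gap))
```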
● Let’s take a look at the distribution of price by cut using geom_boxplot():
● > ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
● > ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot()
● cut is an ordered factor, but many categorical variables (such as class in
mpg) don't have such an intrinsic order, so you might want to reorder them to
make a more informative display. One way to do that is with the reorder()
function.
● > ggplot(data = mpg) + geom_boxplot(mapping = aes(x = reorder(class,
hwy, FUN = median), y = hwy))
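To see what reorder() does to the categories before plotting, the reordered factor can be inspected on its own. A small sketch, assuming ggplot2 is installed for the mpg dataset; class_by_hwy and median_hwy are illustrative names.

```r
library(ggplot2)  # provides the mpg dataset

# Reorder the levels of class by the median highway mileage within each class
class_by_hwy <- reorder(mpg$class, mpg$hwy, FUN = median)

# The levels now run from the lowest to the highest median hwy
print(levels(class_by_hwy))

# Median hwy per class, listed in the new level order
median_hwy <- tapply(mpg$hwy, class_by_hwy, median)
print(median_hwy)
```

The boxes in the reordered plot appear in this same level order, which is why the medians rise from left to right.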
● If you have long variable names, geom_boxplot() will work better if you
flip it 90°. You can do that with coord_flip():
● ggplot(data = mpg) + geom_boxplot(mapping = aes(x = reorder(class,
hwy, FUN = median), y = hwy)) + coord_flip()
Patterns and models
● Patterns in your data provide clues about relationships.
● If a systematic relationship exists between two variables it will appear as a
pattern in the data.
● If you spot a pattern, ask yourself:
● Could this pattern be due to coincidence (i.e., random chance)?
● How can you describe the relationship implied by the pattern?
● How strong is the relationship implied by the pattern?
● What other variables might affect the relationship?
● Does the relationship change if you look at individual subgroups of the
data?
● ggplot(data = faithful) + geom_point(mapping = aes(x = eruptions, y =
waiting))

● The scatterplot shows two clusters of eruptions, and a pattern: longer waiting
times are associated with longer eruptions.
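One possible way to put a number on "how strong is the relationship" is a correlation coefficient. The sketch below uses base R's cor() on the built-in faithful dataset; Pearson correlation is just one choice of measure and only captures linear association, and the variable name strength is made up for this example.

```r
# faithful ships with base R (datasets package), so no extra packages are needed

# Pearson correlation between eruption length and waiting time;
# values near 1 indicate a strong positive linear relationship
strength <- cor(faithful$eruptions, faithful$waiting)
print(strength)
```

A single number like this does not reveal the two clusters, so it complements the scatterplot rather than replacing it.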


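● The following code fits a model that predicts price from carat and then
computes the residuals (the difference between the predicted value and the
actual value). The residuals give a view of the price of a diamond once the
effect of carat has been removed:
● > library(modelr)
● > mod <- lm(log(price) ~ log(carat), data = diamonds)
● > diamonds2 <- diamonds %>% add_residuals(mod) %>% mutate(resid =
exp(resid))
● > ggplot(data = diamonds2) + geom_point(mapping = aes(x = carat, y = resid))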
● Once you’ve removed the strong relationship between carat and price,
you can see what you expect in the relationship between cut and price—
relative to their size, better quality diamonds are more expensive:
● ggplot(data = diamonds2) + geom_boxplot(mapping = aes(x = cut, y =
resid))
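The residual step can also be sketched without modelr. Here base R's residuals() is swapped in for add_residuals(), and a numeric summary stands in for the boxplot; this assumes ggplot2 and dplyr are installed, and diamonds_resid and median_by_cut are hypothetical names used only here.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Fit the same log-log model of price on carat
mod <- lm(log(price) ~ log(carat), data = diamonds)

# residuals() returns log(price) minus the fitted value for each row;
# exp() converts that back to a ratio of actual price to predicted price
diamonds_resid <- diamonds %>%
  mutate(resid = exp(unname(residuals(mod))))

# Median residual per cut: a numeric counterpart of the boxplot by cut
median_by_cut <- diamonds_resid %>%
  group_by(cut) %>%
  summarise(median_resid = median(resid))
print(median_by_cut)
```

A residual above 1 means the diamond costs more than the model predicts from its carat alone, and below 1 means it costs less.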

● ggplot2 calls:
● > ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_freqpoly(binwidth = 0.25)
THANK YOU
