Unit-IV: Variation, Missing Values, Covariation, Patterns and Models
The document discusses variation, missing values, covariation, patterns, and models in exploratory data analysis. It explains how to visualize distributions of categorical and continuous variables using bar charts and histograms. It also covers replacing missing values, examining covariation using boxplots, identifying patterns in data, and fitting simple linear models.
UNIT-IV
Variation, Missing Values, Covariation, Patterns and Models
Variation
● Variation is the tendency of the values of a variable to change from measurement to measurement.
● Both continuous and categorical variables vary: repeated measurements of a continuous quantity differ slightly, and a categorical variable differs across subjects (e.g., the eye colors of different people).
● Every variable has its own pattern of variation, which can reveal interesting information.
● The best way to understand that pattern is to visualize the distribution of the variable’s values.

Variation: Visualizing Distributions
● In R, categorical variables are usually saved as factors or character vectors. To examine the distribution of a categorical variable, use a bar chart:
    ggplot(data = diamonds) +
      geom_bar(mapping = aes(x = cut))
● You can also compute these counts manually with dplyr::count():
    diamonds %>% count(cut)
● A variable is continuous if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables.
● To examine the distribution of a continuous variable, use a histogram:
    ggplot(data = diamonds) +
      geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
● You can compute this by hand by combining dplyr::count() and ggplot2::cut_width():
    diamonds %>% count(cut_width(carat, 0.5))
● To see more detail, keep only the diamonds smaller than three carats and choose a smaller binwidth:
    smaller <- diamonds %>% filter(carat < 3)
    ggplot(data = smaller, mapping = aes(x = carat)) +
      geom_histogram(binwidth = 0.1)
● To overlay multiple histograms in the same plot, use geom_freqpoly() instead of geom_histogram(). geom_freqpoly() performs the same calculation as geom_histogram(), but displays the counts with lines instead of bars.
● It’s much easier to understand overlapping lines than overlapping bars:
    ggplot(data = smaller, mapping = aes(x = carat, color = cut)) +
      geom_freqpoly(binwidth = 0.1)

Missing Values
● If you encounter unusual values in your dataset, you have two options.
● Option 1: drop the entire row with the strange values:
    diamonds2 <- diamonds %>% filter(between(y, 3, 20))
● Option 2: instead of dropping rows, replace the unusual values with missing values, which keeps the other measurements in those rows. The easiest way to do this is to use mutate() to replace the variable with a modified copy. You can use the ifelse() function to replace unusual values with NA:
    diamonds2 <- diamonds %>% mutate(y = ifelse(y < 3 | y > 20, NA, y))
● ggplot2 does not plot missing values, but it warns that they have been removed:
    ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
      geom_point()
    #> Warning: Removed 9 rows containing missing values
    #> (geom_point).
● To suppress that warning, set na.rm = TRUE:
    ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
      geom_point(na.rm = TRUE)

Covariation
● Covariation is the tendency for the values of two or more variables to vary together in a related way.
● The best way to spot covariation is to visualize the relationship between two or more variables.
● A common way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot.
● A boxplot is a type of visual shorthand for a distribution of values that is popular among statisticians.

Each boxplot consists of:
● A box that stretches from the 25th percentile of the distribution to the 75th percentile, a distance known as the interquartile range (IQR).
● A line in the middle of the box that displays the median, i.e., the 50th percentile, of the distribution.
● Together, these three lines (the two edges of the box and the median line) give you a sense of the spread of the distribution and whether the distribution is symmetric about the median or skewed to one side.
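The three lines that a boxplot draws can also be computed directly. The following is a minimal sketch, assuming the dplyr and ggplot2 packages are installed; the object name box_stats and the column names q25, med, q75 and iqr are made up for illustration and do not come from the slides.

```r
library(dplyr)
library(ggplot2)  # only needed here because it provides the diamonds dataset

# For each cut, compute the 25th percentile, the median and the 75th percentile
# of price, plus the interquartile range (IQR = 75th minus 25th percentile).
box_stats <- diamonds %>%
  group_by(cut) %>%
  summarise(
    q25 = quantile(price, 0.25),
    med = median(price),
    q75 = quantile(price, 0.75),
    iqr = IQR(price)
  )

box_stats
```

Comparing how far med sits from q25 and from q75 within each row gives a rough numeric check of the skew that the boxplot shows visually.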
Covariation
● Let’s take a look at the distribution of price by cut using geom_boxplot():
    ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
      geom_boxplot()
● cut is an ordered factor, but many categorical variables don’t have such an intrinsic order, so you might want to reorder them to make a more informative display. For example, here is highway mileage (hwy) by car class in the mpg dataset:
    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
      geom_boxplot()
● One way to reorder the categories is with the reorder() function, here ordering class by the median value of hwy:
    ggplot(data = mpg) +
      geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
● If you have long variable names, geom_boxplot() will work better if you flip it 90°. You can do that with coord_flip():
    ggplot(data = mpg) +
      geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
      coord_flip()

Patterns and Models
● Patterns in your data provide clues about relationships.
● If a systematic relationship exists between two variables, it will appear as a pattern in the data.
● If you spot a pattern, ask yourself:
    ● Could this pattern be due to coincidence (i.e., random chance)?
    ● How can you describe the relationship implied by the pattern?
    ● How strong is the relationship implied by the pattern?
    ● What other variables might affect the relationship?
    ● Does the relationship change if you look at individual subgroups of the data?
● For example, plot eruption length against waiting time in the faithful dataset:
    ggplot(data = faithful) +
      geom_point(mapping = aes(x = eruptions, y = waiting))
● The scatterplot displays two clusters, and longer waiting times are associated with longer eruptions.
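One possible way to put numbers on the "how strong" and "subgroups" questions for the faithful scatterplot is the Pearson correlation. This is only a sketch: the slides do not prescribe a measure, and the 3-minute split between short and long eruptions is a hypothetical cut point chosen by eye, not a value from the source.

```r
# faithful ships with base R (datasets package), so no extra packages are needed.

# Overall linear association between eruption length and waiting time.
strength <- cor(faithful$eruptions, faithful$waiting)
strength

# Does the relationship change within subgroups?
# Split at 3 minutes (an assumed, eyeballed boundary between the two clusters).
short_eruptions <- faithful[faithful$eruptions < 3, ]
long_eruptions  <- faithful[faithful$eruptions >= 3, ]

cor_short <- cor(short_eruptions$eruptions, short_eruptions$waiting)
cor_long  <- cor(long_eruptions$eruptions, long_eruptions$waiting)
c(short = cor_short, long = cor_long)
```

Pearson correlation only captures linear association, so it complements the scatterplot rather than replacing it.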
● The following code fits a model that predicts price from carat and then computes the residuals (the difference between the actual and the predicted value). The residuals show the price of a diamond once the effect of carat has been removed:
    library(modelr)
    mod <- lm(log(price) ~ log(carat), data = diamonds)
    diamonds2 <- diamonds %>%
      add_residuals(mod) %>%
      mutate(resid = exp(resid))  # back-transform from the log scale
    ggplot(data = diamonds2) +
      geom_point(mapping = aes(x = carat, y = resid))
● Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive:
    ggplot(data = diamonds2) +
      geom_boxplot(mapping = aes(x = cut, y = resid))
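The residual boxplot can be cross-checked with a small table. The sketch below assumes dplyr and ggplot2 are installed and uses base R's residuals() as a stand-in for modelr::add_residuals(); the names resid_by_cut and median_resid are illustrative only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Same model as on the slide: log(price) explained by log(carat).
mod <- lm(log(price) ~ log(carat), data = diamonds)

resid_by_cut <- diamonds %>%
  # exp() back-transforms the log-scale residual into a ratio of
  # actual price to the price the model predicts from carat alone.
  mutate(resid = exp(unname(residuals(mod)))) %>%
  group_by(cut) %>%
  summarise(median_resid = median(resid), n = n())

resid_by_cut
```

A median_resid above 1 means diamonds of that cut typically cost more than the model predicts from carat alone, and a value below 1 means they typically cost less.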