Data Analysis in R
R is a powerful programming language widely used for statistical computing and data analysis. It
provides numerous built-in functions that simplify the process of analyzing data. These functions can
perform various tasks, including summarizing data, manipulating datasets, and visualizing results.
1. summary()
Description: This function provides a summary of the main statistical measures for
each column in a dataset.
Syntax: summary(object)
Example:
df <- data.frame(
)
summary(df)
Output:
age salary
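As a minimal sketch of summary() on a data frame: only the column names age and salary come from the text, while the values below are made up for illustration, so the printed statistics will not match any original output.

```r
# Hypothetical data: the column names come from the text, the values are illustrative.
df <- data.frame(
  age = c(25, 32, 41, 38),
  salary = c(48000, 56000, 72000, 64000)
)

# summary() reports min, quartiles, median, mean and max for each numeric column.
stats <- summary(df)
print(stats)

# The same statistics can be pulled out for a single column by name.
age_stats <- summary(df$age)
print(age_stats[["Median"]])
```

Calling summary() on a single column returns a named numeric object, which is convenient when one statistic is needed in later code rather than a printed table.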
2. mean()
Description: This function computes the arithmetic mean of a numeric vector.
Syntax: mean(x)
Example:
Output:
[1] 63333.33
3. sd()
Description: This function computes the sample standard deviation of a numeric vector.
Syntax: sd(x)
Example:
Output:
[1] 12990.38
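The relationship between mean() and sd() can be sketched as follows. The salary vector is an assumption made for illustration (it is not the data behind the outputs shown above); the point is that sd() uses the sample formula with n - 1 in the denominator.

```r
# Illustrative salaries (assumed values, not the data behind the outputs above).
salary <- c(48000, 56000, 72000, 64000)

n <- length(salary)
avg <- mean(salary)   # arithmetic mean

# sd() uses the sample formula: divide the squared deviations by n - 1.
manual_sd <- sqrt(sum((salary - avg)^2) / (n - 1))
builtin_sd <- sd(salary)

print(avg)
print(builtin_sd)
print(all.equal(manual_sd, builtin_sd))   # the two computations agree
```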
4. t.test()
Description: This function performs a t-test to compare means between two groups.
Syntax: t.test(x, y)
Example:
t.test(group1, group2)
Output (example):
-126666.7 -3333.3
sample estimates:
mean of x mean of y
55000 75000
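A self-contained sketch of the two-sample call is given here. The vectors group1 and group2 are illustrative values, chosen only so that the sample means are 55000 and 75000; the confidence interval they produce will differ from the fragment shown above.

```r
# Illustrative groups, chosen so the sample means are 55000 and 75000.
group1 <- c(50000, 55000, 60000)
group2 <- c(70000, 75000, 80000)

# By default t.test() runs Welch's test, which does not assume equal variances.
result <- t.test(group1, group2)

print(result$estimate)   # sample means of the two groups
print(result$conf.int)   # 95% confidence interval for the difference in means
print(result$p.value)    # two-sided p-value
```

The returned object is a list, so individual pieces such as the p-value can be used in later code instead of being read off the printed report.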
5. plot()
Description: This function creates a scatter plot or other types of plots based on the
input data.
Syntax: plot(x, y)
Example:
plot(df$age, df$salary)
6. lm()
Description: This function fits linear models to the data.
Syntax: lm(formula, data)
Example:
model <- lm(salary ~ age, data = df)
summary(model)
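A minimal end-to-end sketch of fitting and using a linear model might look like this. The data frame values and the new_person row are hypothetical, introduced only to make the example runnable.

```r
# Hypothetical data frame; values are illustrative only.
df <- data.frame(
  age = c(25, 32, 41, 38, 29),
  salary = c(48000, 56000, 72000, 64000, 52000)
)

# Fit salary as a linear function of age.
model <- lm(salary ~ age, data = df)

print(coef(model))                # intercept and slope
print(summary(model)$r.squared)   # share of variance explained

# Predict the salary for a new, hypothetical age.
new_person <- data.frame(age = 35)
print(predict(model, newdata = new_person))
```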
These functions are just a few examples of the many available in R for performing data analysis tasks
effectively.
Statistical Functions in R
R provides a wide array of statistical functions that facilitate data analysis. Below are some key
statistical functions along with their descriptions, syntax, and examples.
1. Mean Function
Syntax: mean(x)
x: A numeric vector.
Example:
x: A numeric vector.
Example:
3. Variance Function
Syntax: var(x)
x: A numeric vector.
4. Median Function
Syntax: median(x)
x: A numeric vector.
Example:
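The descriptive functions above can be sketched on one small vector. The vector x is an assumption for illustration; it includes a missing value to show the na.rm argument, which these functions all accept.

```r
# Illustrative vector with one missing value.
x <- c(2, 3, 5, NA, 7)

print(mean(x))                   # NA, because a missing value propagates
print(mean(x, na.rm = TRUE))     # mean of the observed values
print(median(x, na.rm = TRUE))   # middle value of the observed data
print(var(x, na.rm = TRUE))      # sample variance (divides by n - 1)

# The standard deviation is the square root of the variance.
print(all.equal(sd(x, na.rm = TRUE), sqrt(var(x, na.rm = TRUE))))
```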
5. Summary Function
Syntax: summary(object)
Example:
summary(data)
# Output:
# Min. :2.0
# 1st Qu.:3.0
# Median :5.0
# Mean :4.25
# 3rd Qu.:5.0
# Max. :7.0
# NA's :1
6. T-Test Function
Example (One-Sample):
t.test(x, mu = 0)
# Output includes t-value and p-value for the test against mean=0
Example (Two-Sample):
t.test(group1, group2)
# Output includes t-value and p-value comparing means of group1 and group2
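The one-sample form and the formula interface of t.test() can be sketched as follows. All values are illustrative assumptions. Note that the formula interface expects a numeric response on the left of ~ and a two-level grouping factor on the right, not two vectors of measurements.

```r
# Illustrative measurements (assumed values).
x <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)

# One-sample test: is the mean of x different from 5?
one_sample <- t.test(x, mu = 5)
print(one_sample$statistic)
print(one_sample$p.value)

# Formula interface: numeric response on the left, grouping factor on the right.
df <- data.frame(
  value = c(5.1, 4.9, 5.6, 6.8, 7.0, 6.4),
  group = factor(c("a", "a", "a", "b", "b", "b"))
)
two_sample <- t.test(value ~ group, data = df)
print(two_sample$estimate)   # mean in group a and mean in group b
```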
7. Correlation Function
Syntax: cor(x,y,use="everything",method="pearson")
Example:
x <- c(1,2,3)
y <- c(4,5,6)
cor(x, y)
# Output:
# [1] 1
Inferential Statistics Functions in R
1. Introduction
Inferential statistics allows researchers to draw conclusions about a population based on sample
data. In R, various functions are available for performing inferential statistical tests such as t-tests,
chi-squared tests, ANOVA, and regression analysis. Below are some commonly used inferential
functions along with their syntax and examples.
2. t-test Functions
Independent t-test
Syntax:
t.test(x, y)
Example:
# Sample data
print(result)
Paired t-test
Syntax:
t.test(x, y, paired = TRUE)
Example:
print(result)
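A paired t-test can be sketched like this. The before and after scores are hypothetical values for the same five subjects; the second call illustrates that a paired test is the same as a one-sample test on the within-subject differences.

```r
# Illustrative before/after scores for the same five subjects (assumed values).
before <- c(72, 68, 75, 80, 66)
after  <- c(75, 70, 79, 82, 71)

# paired = TRUE tests whether the mean of the within-subject differences is zero.
result <- t.test(before, after, paired = TRUE)
print(result)

# A paired test is equivalent to a one-sample test on the differences.
diff_test <- t.test(before - after, mu = 0)
print(all.equal(unname(result$statistic), unname(diff_test$statistic)))
```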
3. Chi-Squared Test
The chi-squared test is used to determine if there is a significant association between categorical
variables.
Chi-Squared Test
Syntax:
chisq.test(x)
Example:
print(result)
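As a sketch of the chi-squared test of independence, the example below builds a small contingency table by hand. The counts and the labels group and outcome are assumptions made for illustration.

```r
# Illustrative 2 x 2 contingency table of counts (assumed values).
counts <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("A", "B"), outcome = c("yes", "no"))
)

# Test of independence between the row and column variables.
# For 2 x 2 tables chisq.test() applies Yates' continuity correction by default.
result <- chisq.test(counts)
print(result)

print(result$expected)   # counts expected if the variables were independent
```

With raw categorical columns rather than counts, the table would typically be built first with table(), and the result passed to chisq.test() in the same way.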
4. ANOVA Function
ANOVA (Analysis of Variance) is used when comparing means across three or more groups.
ANOVA Function
Syntax:
aov(formula, data)
Example:
df <- data.frame(
# Perform ANOVA
summary(result)
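A complete one-way ANOVA might be sketched as follows. The column names value and group and all numbers are hypothetical; they stand in for the data that the example above would need.

```r
# Illustrative data: one numeric response measured in three groups (assumed values).
df <- data.frame(
  value = c(5.1, 4.8, 5.4, 6.2, 6.5, 6.0, 7.4, 7.1, 7.8),
  group = factor(rep(c("A", "B", "C"), each = 3))
)

# Perform one-way ANOVA: does the mean of value differ across groups?
result <- aov(value ~ group, data = df)
print(summary(result))

# Extract the F statistic and p-value from the summary table.
tab <- summary(result)[[1]]
print(tab[["F value"]][1])
print(tab[["Pr(>F)"]][1])
```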
5. Linear Regression
Linear regression is used to model the relationship between a dependent variable and one or more
independent variables.
Syntax:
lm(formula,data)
Example:
df <- data.frame(
  x = rnorm(100),
  y = rnorm(100)
)
# Fit the linear model
model <- lm(y ~ x, data = df)
summary(model)
6. Conclusion
These functions provide a foundation for conducting inferential statistics in R. Applying them to
appropriate datasets, with clearly stated hypotheses, can lead to meaningful insights from your
analyses.
Data Manipulation Functions in R
In R, the dplyr package is a powerful tool for data manipulation. It provides a variety of functions that
allow users to perform common data manipulation tasks efficiently and effectively. Below are some
of the key functions available in the dplyr package:
1. filter()
The filter() function is used to subset rows based on specific conditions. You can use logical
operators to specify these conditions.
2. distinct()
The distinct() function removes duplicate rows from a data frame, either across all columns or
based on specified columns.
3. arrange()
The arrange() function orders the rows of a data frame based on one or more columns.
4. select()
The select() function extracts specific columns from a data frame.
5. rename()
The rename() function changes the names of columns in a data frame.
6. mutate() & transmute()
These functions are used to create new variables. The mutate() function adds new variables while
keeping existing ones; the transmute() function creates new variables but drops existing ones.
Example:
7. summarize()
The summarize() function aggregates data using summary statistics like mean or sum.
Example:
These functions can be combined using the pipe operator (%>%) to create complex data
manipulation workflows that are both readable and efficient.
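The dplyr verbs described in this section can be sketched on a small data frame. The employees table and every value in it are hypothetical. The pipeline also uses group_by(), which is not listed above but is the usual dplyr companion to summarize() for per-group aggregates.

```r
library(dplyr)

# Hypothetical employee data; every name and value here is illustrative.
employees <- data.frame(
  name = c("Ana", "Ben", "Cho", "Dev", "Eli"),
  dept = c("ops", "ops", "eng", "eng", "eng"),
  salary = c(48000, 56000, 72000, 64000, 52000)
)

# mutate() keeps the existing columns; transmute() keeps only the new ones.
with_bonus <- employees %>% mutate(bonus = salary * 0.1)
bonus_only <- employees %>% transmute(bonus = salary * 0.1)
print(names(with_bonus))
print(names(bonus_only))

# A pipeline: subset rows, aggregate per department, then sort.
dept_summary <- employees %>%
  filter(salary > 50000) %>%
  group_by(dept) %>%
  summarize(mean_salary = mean(salary), n = n()) %>%
  arrange(desc(mean_salary))
print(dept_summary)
```

Each step receives the result of the previous one as its first argument, which is why the pipeline reads top to bottom in the order the operations are applied.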
Data Visualization Functions in R
Data visualization in R is facilitated through various functions and packages that allow users to create
a wide range of graphical representations of data. Below are some of the key functions and their
respective uses:
1. Base R Plotting Functions
plot(): This is the most basic function for creating scatter plots. It can be used for plotting
two variables against each other.
plot(x, y)
hist(): Used to create histograms, which display the distribution of a continuous variable.
hist(data$variable)
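barplot(): Used to create bar plots, for example of the counts of each level of a categorical
variable.
barplot(table(data$categorical_variable))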
boxplot(): Generates box plots to visualize the distribution of a dataset based on summary
statistics (minimum, first quartile, median, third quartile, and maximum).
boxplot(data$variable ~ data$grouping_variable)
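The base plotting functions above can be tried on a dataset that ships with R. The choice of mtcars and of the variables wt, mpg, and cyl is an assumption made purely for illustration.

```r
# mtcars ships with R; it is used here only as a convenient example dataset.
data(mtcars)

# Scatter plot of two numeric variables.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel economy vs. weight")

# Histogram of one continuous variable; the returned object holds the bin counts.
h <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Bar plot of category frequencies.
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# Box plot of a numeric variable split by a grouping variable.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")
```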
2. ggplot2 Package
The ggplot2 package is one of the most popular libraries for data visualization in R due to its
flexibility and powerful capabilities.
geom_bar(): Used for creating bar charts; it automatically counts occurrences unless
specified otherwise.
geom_histogram(): Similar to hist() but integrates seamlessly into the ggplot framework.
facet_wrap() and facet_grid(): These functions allow you to create multiple panels based on
one or more categorical variables, making it easier to compare subsets of your data.
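A short ggplot2 sketch of the functions listed above follows. It assumes the ggplot2 package is installed and uses the built-in mtcars dataset; the variable choices are illustrative.

```r
library(ggplot2)

# mtcars ships with R; the variable choices below are illustrative.
# geom_bar() counts the rows in each cylinder category.
bars <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number of cars")

# geom_histogram() bins a continuous variable;
# facet_wrap() draws one panel per gear count.
hists <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ gear)

print(bars)
print(hists)
```

Plots are built by adding layers with +, so the same base object can be extended with further geoms, labels, or facets without rewriting the earlier steps.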
3. Lattice Package
The lattice package provides another system for creating visualizations in R.
library(lattice)
barchart(), histogram(), and densityplot() are also available within this package for specific
types of visualizations.
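A minimal lattice sketch is shown here, again using the built-in mtcars dataset as an assumed example. It also uses xyplot(), lattice's scatter-plot function, which is not named in the text above.

```r
library(lattice)

# mtcars ships with R; the variables are chosen only for illustration.
# xyplot() draws a scatter plot; "| factor(cyl)" requests one panel per cylinder count.
scatter <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
                  xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# histogram() and densityplot() take a one-sided formula.
hist_plot <- histogram(~ mpg, data = mtcars)
dens_plot <- densityplot(~ mpg, data = mtcars)

# Lattice functions return "trellis" objects; print() draws them.
print(scatter)
print(hist_plot)
print(dens_plot)
```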
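4. Plotly Package
The plotly package can turn a ggplot2 object into an interactive plot.
library(plotly)
# p is a ggplot2 object created beforehand
ggplotly(p)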
In conclusion, R provides a rich ecosystem of functions and packages dedicated to data visualization.
Users can choose from base R plotting functions or leverage advanced libraries like ggplot2, lattice,
and others depending on their specific needs.