Data Analysis in R

The document provides an overview of data analysis functions in R, including built-in functions for summarizing, manipulating, and visualizing data. It covers common functions such as summary(), mean(), sd(), and t.test(), as well as data manipulation functions from the dplyr package and visualization functions from base R and ggplot2. Additionally, it discusses inferential statistics functions and their applications in R.

Uploaded by

lsivakum
Copyright
© © All Rights Reserved

Introduction to Data Analysis Functions in R

R is a powerful programming language widely used for statistical computing and data analysis. It
provides numerous built-in functions that simplify the process of analyzing data. These functions can
perform various tasks, including summarizing data, manipulating datasets, and visualizing results.

Common Data Analysis Functions in R

1. summary()

 Description: This function provides a summary of the main statistical measures for
each column in a dataset.

 Syntax: summary(object)

 Example:

df <- data.frame(
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

summary(df)

 Output:

      age            salary
 Min.   :25.00   Min.   :50000
 1st Qu.:28.75   1st Qu.:57500
 Median :32.50   Median :65000
 Mean   :32.50   Mean   :65000
 3rd Qu.:36.25   3rd Qu.:72500
 Max.   :40.00   Max.   :80000

2. mean()

 Description: This function calculates the average of a numeric vector.

 Syntax: mean(x, na.rm = FALSE)

 Example:

 salaries <- c(50000, 60000, NA, 80000)

mean(salaries, na.rm = TRUE) # Excludes NA values

 Output:

[1] 63333.33
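As a quick check of what na.rm changes, the sketch below reuses the salaries vector from the example above. The names with_na and without_na are illustrative only.

```r
# Same vector as in the example above: one salary is missing
salaries <- c(50000, 60000, NA, 80000)

# With the default na.rm = FALSE, a single NA makes the result NA
with_na <- mean(salaries)

# With na.rm = TRUE, the NA is dropped before averaging
without_na <- mean(salaries, na.rm = TRUE)

print(with_na)
print(without_na)
```

The same behaviour applies to sd(), var(), and median(), which all take an na.rm argument.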

3. sd()
 Description: This function computes the standard deviation of a numeric vector.

 Syntax: sd(x, na.rm = FALSE)

 Example:

sd(salaries, na.rm = TRUE) # Excludes NA values

 Output:

[1] 15275.25

4. t.test()

 Description: This function performs a t-test to compare means between two groups.

 Syntax: t.test(x, y = NULL, alternative = "two.sided", ...)

 Example:

 group1 <- c(50000, 60000)

 group2 <- c(70000, 80000)

t.test(group1, group2)

 Output:

	Welch Two Sample t-test

data:  group1 and group2
t = -2.8284, df = 2, p-value = 0.1056
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -50424.35  10424.35
sample estimates:
mean of x mean of y
    55000     75000

With only two observations per group, the difference is not significant at the 5% level.

5. plot()

 Description: This function creates a scatter plot or other types of plots based on the
input data.

 Syntax: plot(x, y)

 Example:

plot(df$age, df$salary)

6. lm()
 Description: This function fits linear models to the data.

 Syntax: lm(formula, data)

 Example:

 model <- lm(salary ~ age, data=df)

summary(model)
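In this particular data frame, salary rises by exactly 2000 for every additional year of age, so the fitted line passes through all four points and summary(model) may warn that the fit is essentially perfect. A minimal sketch for reading the fitted coefficients directly with coef() and making a prediction with predict(), both from base R; the age of 45 is a made-up input:

```r
df <- data.frame(
  age = c(25, 30, 35, 40),
  salary = c(50000, 60000, 70000, 80000)
)

model <- lm(salary ~ age, data = df)

# Named numeric vector: "(Intercept)" and "age" (the slope)
coefs <- coef(model)
print(coefs)

# Predicted salary for a hypothetical 45-year-old
predicted <- predict(model, newdata = data.frame(age = 45))
print(predicted)
```
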

These functions are just a few examples of the many available in R for performing data analysis tasks
effectively.

Statistical Functions in R

R provides a wide array of statistical functions that facilitate data analysis. Below are some key
statistical functions along with their descriptions, syntax, and examples.

1. Mean Function

 Description: Calculates the average of a numeric vector.

 Syntax: mean(x, na.rm = FALSE)

 x: A numeric vector.

 na.rm: Logical value indicating whether to remove missing values (NA).

 Example:

 data <- c(2, 3, 5, NA, 7)

mean(data, na.rm = TRUE) # Output: 4.25

2. Standard Deviation Function

 Description: Computes the standard deviation of a numeric vector.

 Syntax: sd(x, na.rm = FALSE)

 x: A numeric vector.

 na.rm: Logical value indicating whether to remove missing values (NA).

 Example:

 data <- c(2, 3, 5, NA, 7)

sd(data, na.rm = TRUE) # Output: 2.22

3. Variance Function

 Description: Calculates the variance of a numeric vector.

 Syntax: var(x, na.rm = FALSE)

 x: A numeric vector.

 na.rm: Logical value indicating whether to remove missing values (NA).


 Example:

 data <- c(2, 3, 5, NA, 7)

var(data, na.rm = TRUE) # Output: 4.92
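Both var() and sd() use the sample formulas, dividing by n - 1 rather than n, and the standard deviation is the square root of the variance. The sketch below recomputes both by hand for the same vector; manual_var and manual_sd are illustrative names.

```r
data <- c(2, 3, 5, NA, 7)

# Drop the missing value, which is what na.rm = TRUE does
x <- data[!is.na(data)]
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

# Each pair should show the same number twice
print(c(manual_var, var(data, na.rm = TRUE)))
print(c(manual_sd, sd(data, na.rm = TRUE)))
```
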

4. Median Function

 Description: Finds the median value of a numeric vector.

 Syntax: median(x, na.rm = FALSE)

 x: A numeric vector.

 na.rm: Logical value indicating whether to remove missing values (NA).

 Example:

 data <- c(2, 3, 5, NA, 7)

median(data, na.rm = TRUE) # Output: 4

5. Summary Function

 Description: Provides summary statistics for an object. For a numeric vector these are the minimum, quartiles, median, mean, and maximum, plus a count of missing values.

 Syntax: summary(object)

 Example:

 data <- c(2, 3, 5, NA, 7)

 summary(data)

# Output:
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#    2.00    2.75    4.00    4.25    5.50    7.00       1

6. T-Test Function

 Description: Performs one-sample or two-sample t-tests to compare means.

 Syntax: t.test(x) for one sample; t.test(x, y) or t.test(values ~ group, data = df) for two samples

 Example (One-Sample):

 data <- c(2, 3, 5)


 t.test(data)

# Output includes t-value and p-value for the test against mean=0

 Example (Two-Sample):

group1 <- c(2, 3)

group2 <- c(5, 7)

t.test(group1, group2)

# Output includes t-value and p-value comparing means of group1 and group2

7. Correlation Function

 Description: Computes the correlation coefficient between two variables.

 Syntax: cor(x,y,use="everything",method="pearson")

 Example:

 x <- c(1,2,3)

 y <- c(4,5,6)

cor(x,y) # Output: [1] 1 (perfect positive correlation)
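The use and method arguments matter once the data contain missing values or the relationship is not linear. A small sketch with made-up vectors (x and y here are hypothetical and differ from the example above):

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 4, 5)

# Default use = "everything": any NA in the inputs gives NA
r_default <- cor(x, y)

# Drop incomplete pairs, then compute Pearson's r on the remaining four
r_pearson <- cor(x, y, use = "complete.obs")

# Rank-based (Spearman) alternative on the same complete pairs
r_spearman <- cor(x, y, use = "complete.obs", method = "spearman")

print(c(default = r_default, pearson = r_pearson, spearman = r_spearman))
```
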

These functions are fundamental for performing various statistical analyses in R.

1. Introduction to Inferential Statistics in R

Inferential statistics allows researchers to make conclusions about a population based on sample
data. In R, various functions are available for performing inferential statistical tests such as t-tests,
chi-squared tests, ANOVA, and regression analysis. Below are some commonly used inferential
functions along with their syntax and examples.

2. t-test Functions

The t-test is used to compare the means of two groups.

 Independent t-test

Syntax:

t.test(x, y = NULL, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)

Example:

# Sample data

group1 <- c(5, 6, 7, 8)

group2 <- c(3, 4, 5, 6)

# Perform independent t-test


result <- t.test(group1, group2)

print(result)
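Printing the result is convenient for reading, but the individual numbers are often needed in later code. t.test() returns a list of class "htest", and the sketch below pulls out its main components by name. The variable names on the left are illustrative.

```r
group1 <- c(5, 6, 7, 8)
group2 <- c(3, 4, 5, 6)

result <- t.test(group1, group2)

# Components of the "htest" object can be read by name
t_value <- unname(result$statistic)   # t statistic
df_welch <- unname(result$parameter)  # Welch-adjusted degrees of freedom
p_value <- result$p.value             # two-sided p-value
ci <- result$conf.int                 # 95% CI for the difference in means

print(c(t = t_value, df = df_welch, p = p_value))
print(ci)
```
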

 Paired t-test

Syntax:

t.test(x, y = NULL, alternative = "two.sided", mu = 0, paired = TRUE, var.equal = FALSE, conf.level = 0.95)

Example:

# Sample data before and after treatment

before <- c(78,65,71)

after <- c(71,62,70)

# Perform paired t-test

result <- t.test(before, after, paired=TRUE)

print(result)
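A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below checks this on the same before/after values; paired_result and diff_result are illustrative names.

```r
before <- c(78, 65, 71)
after <- c(71, 62, 70)

# Paired test on the two vectors
paired_result <- t.test(before, after, paired = TRUE)

# One-sample test of the differences against mu = 0
diff_result <- t.test(before - after)

# Both p-values should agree
print(c(paired = paired_result$p.value, differences = diff_result$p.value))
```
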

3. Chi-Squared Test Function

The chi-squared test is used to determine if there is a significant association between categorical
variables.

 Chi-Squared Test

Syntax:

chisq.test(x)

Example:

# Sample contingency table

data <- matrix(c(10,20,30,40), nrow=2)

# Perform chi-squared test

result <- chisq.test(data)

print(result)
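For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default (correct = TRUE). The sketch below compares the corrected and uncorrected statistics and shows the expected counts under independence, which the test stores in the expected component.

```r
data <- matrix(c(10, 20, 30, 40), nrow = 2)

# Default: continuity correction applied to the 2 x 2 table
result <- chisq.test(data)

# Same test without the correction
uncorrected <- chisq.test(data, correct = FALSE)

# Expected counts: row total * column total / grand total
print(result$expected)

print(c(corrected = unname(result$statistic),
        uncorrected = unname(uncorrected$statistic)))
```
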

4. ANOVA Function

ANOVA (Analysis of Variance) is used when comparing means across three or more groups.

 ANOVA Function

Syntax:

aov(formula, data)
Example:

# Sample data frame

df <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  values = c(rnorm(10), rnorm(10), rnorm(10))
)

# Perform ANOVA

result <- aov(values ~ group, data=df)

summary(result)
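Because the sample data come from rnorm(), the ANOVA table changes on every run unless a seed is set. The sketch below fixes an arbitrary seed and extracts the p-value for the group effect from the summary table. The seed value and the names anova_table and p_value are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, chosen only so the numbers repeat between runs

df <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  values = c(rnorm(10), rnorm(10), rnorm(10))
)

result <- aov(values ~ group, data = df)

# summary() on an aov fit returns a list; its first element is the ANOVA table
anova_table <- summary(result)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]  # p-value for the group effect

print(anova_table)
print(p_value)
```

All three groups are drawn from the same distribution here, so a large p-value is the expected outcome most of the time.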

5. Linear Regression Function

Linear regression is used to model the relationship between a dependent variable and one or more
independent variables.

 Linear Regression Function

Syntax:

lm(formula,data)

Example:

# Sample data frame for linear regression

df <- data.frame(
  x = rnorm(100),
  y = rnorm(100)
)

# Perform linear regression

model <- lm(y ~ x , data=df)

summary(model)
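In the example above x and y are generated independently, so the fitted slope should be close to zero. To see lm() recover a known relationship, the sketch below simulates data with an assumed intercept of 2 and slope of 3. These values, the noise level, and the seed are all made up for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical data: y depends on x with intercept 2 and slope 3, plus noise
df <- data.frame(x = rnorm(100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 0.5)

model <- lm(y ~ x, data = df)

# Estimated coefficients should land near 2 and 3
estimates <- coef(model)
print(estimates)

# Predictions for new x values
new_points <- data.frame(x = c(0, 1))
predictions <- predict(model, newdata = new_points)
print(predictions)
```
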

6. Conclusion

These functions provide a foundation for conducting inferential statistics in R. Used with appropriate datasets and clearly stated hypotheses, they can lead to meaningful insights from your analyses.
Data Manipulation Functions in R

In R, the dplyr package is a powerful tool for data manipulation. It provides a variety of functions that
allow users to perform common data manipulation tasks efficiently and effectively. Below are some
of the key functions available in the dplyr package:

1. filter() The filter() function is used to subset rows based on specific conditions. You can use logical
operators to specify these conditions.

 Syntax: filter(dataframeName, condition)

 Example: To filter players who scored more than 100 runs:

filtered_data <- filter(stats, runs > 100)

2. distinct() The distinct() function removes duplicate rows from a data frame, either across all columns or based on specified columns.

 Syntax: distinct(dataframeName, col1, col2, ..., .keep_all = FALSE)

 Example: To keep one row per player name (only the player column is returned unless .keep_all = TRUE is set):

unique_players <- distinct(stats, player)

3. arrange() The arrange() function orders the rows of a data frame based on one or more columns.

 Syntax: arrange(dataframeName, columnName)

 Example: To sort players by their runs in ascending order:

sorted_data <- arrange(stats, runs)

4. select() The select() function extracts specific columns from a data frame.

 Syntax: select(dataframeName, col1,col2,… )

 Example: To select only the player and wickets columns:

selected_columns <- select(stats, player, wickets)

5. rename() The rename() function changes the names of columns in a data frame.

 Syntax: rename(dataframeName, newName=oldName)

 Example: To rename ‘runs’ to ‘runs_scored’:

renamed_data <- rename(stats, runs_scored = runs)

6. mutate() & transmute() These functions are used to create new variables. The mutate() function
adds new variables while keeping existing ones; the transmute() function creates new variables but
drops existing ones.

 Syntax for mutate: mutate(dataframeName, newVariable=formula)

 Syntax for transmute: transmute(dataframeName, newVariable=formula)

 Example:

 data_with_avg <- mutate(stats, avg = runs / wickets) # keeps old variables


data_with_only_avg <- transmute(stats, avg = runs / wickets) # drops old variables

7. summarize() The summarize() function aggregates data using summary statistics like mean or
sum.

 Syntax: summarize(dataframeName, aggregate_function(columnName))

 Example:

summary_stats <- summarize(stats, total_runs = sum(runs), average_runs = mean(runs))

These functions can be combined using the pipe operator (%>%) to create complex data
manipulation workflows that are both readable and efficient.
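A minimal sketch of such a pipeline follows. The stats data frame is never defined in the examples above, so the one below is hypothetical, with made-up players and numbers.

```r
library(dplyr)

# Hypothetical cricket statistics; the fourth row duplicates the first
stats <- data.frame(
  player = c("A", "B", "C", "A"),
  runs = c(120, 80, 150, 120),
  wickets = c(2, 5, 1, 2)
)

result <- stats %>%
  distinct() %>%                    # drop the duplicated row
  filter(runs > 100) %>%            # keep players with more than 100 runs
  mutate(avg = runs / wickets) %>%  # add runs per wicket
  arrange(desc(avg)) %>%            # highest average first
  select(player, avg)               # keep only the columns of interest

print(result)
```

Each step receives the data frame produced by the previous one, so the pipeline reads top to bottom in the order the operations are applied.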

Data Visualization Functions in R

Data visualization in R is facilitated through various functions and packages that allow users to create
a wide range of graphical representations of data. Below are some of the key functions and their
respective uses:

1. Base R Plotting Functions

 plot(): This is the most basic function for creating scatter plots. It can be used for plotting
two variables against each other.

plot(x, y)

 hist(): Used to create histograms, which display the distribution of a continuous variable.

hist(data$variable)

 barplot(): This function creates bar plots for categorical data.

barplot(table(data$categorical_variable))

 boxplot(): Generates box plots to visualize the distribution of a dataset based on summary
statistics (minimum, first quartile, median, third quartile, and maximum).

boxplot(data$variable ~ data$grouping_variable)
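The four calls above use placeholder names. A runnable sketch with the built-in mtcars dataset, writing all four plots to one PDF page, could look like this. The choice of mtcars and the output file name are assumptions made for illustration.

```r
# Write to a PDF in the session's temporary directory
out_file <- file.path(tempdir(), "base_plots.pdf")
pdf(out_file)

# Arrange the four plots in a 2 x 2 grid
par(mfrow = c(2, 2))

plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Miles per gallon")
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")
barplot(table(mtcars$cyl), xlab = "Cylinders", ylab = "Count")
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "Miles per gallon")

# Close the device so the file is written
dev.off()
```

In an interactive session the pdf() and dev.off() calls can be dropped, and the plots appear in the graphics window instead.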

2. ggplot2 Package The ggplot2 package is one of the most popular libraries for data visualization in
R due to its flexibility and powerful capabilities.

 ggplot(): Initializes a ggplot object. It usually takes a dataset as its first argument; layers are then added with the + operator.

ggplot(data = dataset) + geom_point(mapping = aes(x = x_variable, y = y_variable))

 geom_point(): Adds points to a scatter plot.

 geom_line(): Creates line graphs by connecting points with lines.

 geom_bar(): Used for creating bar charts; it automatically counts occurrences unless
specified otherwise.

 geom_histogram(): Similar to hist() but integrates seamlessly into the ggplot framework.
 facet_wrap() and facet_grid(): These functions allow you to create multiple panels based on
one or more categorical variables, making it easier to compare subsets of your data.

ggplot(data) + geom_point(aes(x = x_variable, y = y_variable)) + facet_wrap(~ categorical_variable)
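The calls above use placeholder variable names. A hedged, runnable version with the built-in mtcars dataset is sketched below. The dataset, axis labels, and output file name are assumptions, not part of the original examples.

```r
library(ggplot2)

# mtcars ships with R; wt = weight, mpg = miles per gallon, cyl = cylinders
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Render to a file so the sketch also runs in non-interactive sessions
out_file <- file.path(tempdir(), "mpg_by_cyl.pdf")
ggsave(out_file, plot = p, width = 7, height = 3)
```

In an interactive session, typing p or calling print(p) displays the plot directly.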

3. Lattice Package The lattice package provides another system for creating visualizations in R.

 xyplot(): For creating scatter plots with conditioning variables.

library(lattice)

xyplot(y ~ x | factor(variable), data = dataset)

 barchart(), histogram(), and densityplot() are also available within this package for specific
types of visualizations.

4. Other Useful Packages Several other packages enhance visualization capabilities in R:

 Plotly: For interactive plots that can be embedded in web applications.

library(plotly)

p <- ggplot(data, aes(x=x_variable, y=y_variable)) + geom_point()

ggplotly(p)

 Highcharter: Another package for creating interactive visualizations, using the Highcharts JavaScript library.

In conclusion, R provides a rich ecosystem of functions and packages dedicated to data visualization.
Users can choose from base R plotting functions or leverage advanced libraries like ggplot2, lattice,
and others depending on their specific needs.
