Unit 5 - R and Data Analysis
Carlo Drago PhD
University Niccolo Cusano, Rome
Project N. 2023-1-IT02-KA220-HED-000161770
ANALYST - A New Advanced Level for Your Specialised Training
THE R PROGRAMMING LANGUAGE
R scripts are plain text files containing a sequence of R commands and functions. Executing a
script runs those commands in order to manipulate data, perform calculations, and generate
visualizations, which makes scripts the basic unit of work for data analysis and statistical
tasks in R.
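As a minimal sketch of what such a script might look like, the example below builds a small dataset, performs one calculation, and reports the result. The file name `example_script.R`, the `sales` data frame, and its values are made up for illustration.

```R
# example_script.R (hypothetical file name)
# A script is a sequence of R commands executed from top to bottom.

# 1. Create a small illustrative dataset (made-up values)
sales <- data.frame(
  month = c("Jan", "Feb", "Mar", "Apr"),
  units = c(120, 135, 150, 160)
)

# 2. Perform a calculation
mean_units <- mean(sales$units)

# 3. Report the result
cat("Average units sold:", mean_units, "\n")
```

A script saved under that name could be run from the R console with `source("example_script.R")` or from a terminal with `Rscript example_script.R`.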
R PROGRAMMING
The importance of programming with R scripts lies in their ability to automate repetitive
tasks, ensuring consistency and efficiency in data processing. By writing scripts, users can
document their workflows, making analyses reproducible and easier to share with others.
Moreover, R scripts enable the handling of large datasets, complex statistical operations, and
the development of sophisticated models, which are crucial for data-driven decision-making in
various fields such as finance, healthcare, and scientific research.
PRE PROCESSING IN R
1. Handling Missing Values: It is possible to use `na.omit()` to remove rows with missing
values, or `impute()` from the `Hmisc` package to fill them in.
```R
# Drop every row that contains at least one missing value (NA)
cleaned_data <- na.omit(raw_data)
```
2. Normalizing or Scaling Data: It is possible to use the `scale()` function to standardize
numeric data. By default it centers each column to mean 0 and scales it to a standard
deviation of 1.
```R
scaled_data <- scale(raw_data)
```
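The two preprocessing steps above can be tried end to end on a small example. This is only a sketch: the `raw_data` data frame below, with its `height` and `weight` columns, is a made-up stand-in for a real dataset.

```R
# Hypothetical data frame with two numeric columns and some missing values
raw_data <- data.frame(
  height = c(170, 165, NA, 180, 175),
  weight = c(65, NA, 70, 80, 75)
)

# Step 1: remove every row that contains at least one NA
cleaned_data <- na.omit(raw_data)
n_complete <- nrow(cleaned_data)   # number of complete rows that remain

# Step 2: standardize each column (subtract the column mean, divide by the column SD)
scaled_data <- scale(cleaned_data)

# Check the result: column means and standard deviations after scaling
col_means <- colMeans(scaled_data)
col_sds <- apply(scaled_data, 2, sd)
print(col_means)
print(col_sds)
```

Note that `scale()` returns a matrix rather than a data frame; wrapping the result in `as.data.frame()` converts it back if needed.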
APPLY
In R, the `apply` function is used to perform operations on the rows or columns of a matrix or
an array. It applies a function to each row or column, making it a compact way to manipulate
and analyze data without writing explicit loops. The basic syntax is
`apply(X, MARGIN, FUN, ...)`, where `X` is the data, `MARGIN` specifies whether to apply the
function over rows (1) or columns (2), and `FUN` is the function to be applied. It also works on
data frames, which are first coerced to a matrix, so it is best suited to data frames whose
columns are all numeric.
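The following sketch illustrates the two `MARGIN` values on a small matrix. The `scores` matrix and its row and column names are invented for the example.

```R
# Hypothetical 3 x 4 matrix of scores (rows = students, columns = tests)
scores <- matrix(
  c(80, 90, 70, 60,
    75, 85, 95, 65,
    50, 60, 70, 80),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("t1", "t2", "t3", "t4"))
)

# MARGIN = 1: apply the function to each row.
# Extra arguments after FUN (here na.rm = TRUE) are passed on to FUN.
row_means <- apply(scores, 1, mean, na.rm = TRUE)

# MARGIN = 2: apply the function to each column
col_max <- apply(scores, 2, max)

print(row_means)
print(col_max)
```

For the common cases of row or column sums and means, the dedicated functions `rowMeans()`, `colMeans()`, `rowSums()`, and `colSums()` do the same job and are usually faster.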
TAPPLY
The `tapply` function in R is a useful command for applying a function over subsets of a
vector, allowing you to perform group-wise operations. It is particularly handy when you want
to compute summary statistics or perform calculations across different categories of data in a
single vector. By specifying a factor or list of factors, `tapply` applies the chosen function to
each subset, simplifying data analysis tasks involving grouped data.
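A group-wise mean is a typical use. The sketch below assumes a made-up `salary` vector and a matching `dept` factor; neither comes from the course material.

```R
# Hypothetical data: one salary per employee, plus the department of each employee
salary <- c(2500, 3200, 2800, 4100, 3900, 3000)
dept <- factor(c("HR", "IT", "HR", "IT", "IT", "HR"))

# Mean salary within each department
mean_by_dept <- tapply(salary, dept, mean)

# Number of employees within each department
n_by_dept <- tapply(salary, dept, length)

print(mean_by_dept)
print(n_by_dept)
```

The result is indexed by the factor levels, so `mean_by_dept["IT"]` retrieves the value for a single group.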
LAPPLY
The `lapply` function in R is used to apply a function to each element of a list. The result is a
new list of the same length, where each element is the result of the function applied to the
corresponding element of the original list. It is useful for processing the elements of a list
without writing an explicit loop.
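As a short illustration, the sketch below applies `mean` to every element of a list. The `samples` list is invented for the example.

```R
# Hypothetical list of numeric vectors of different lengths
samples <- list(
  a = c(1, 2, 3),
  b = c(10, 20),
  c = c(5, 5, 5, 5)
)

# lapply always returns a list with one result per input element
means <- lapply(samples, mean)
str(means)

# sapply is a related function that simplifies the result to a vector when it can
sizes <- sapply(samples, length)
print(sizes)
```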
DESCRIPTIVE ANALYSIS IN R
Descriptive statistics is a fundamental aspect of data analysis that focuses on summarizing
and organizing data to make it easily interpretable. In the R programming language, a
powerful tool for statistical computation, descriptive statistics are typically used to provide a
clear summary of the key characteristics of a dataset.
These statistics are often conveyed through visual displays like graphs and tables, as well as
summary measures including the mean, median, mode, variance, and standard deviation.
Using R, it is possible to compute these statistics efficiently and gain insights into the data.
For instance, the `summary()` function quickly gives a basic statistical overview of a dataset,
including the minimum, maximum, quartiles, median, and mean of each numeric variable.
Additionally, packages such as `ggplot2` can be used to create detailed graphical
representations of data, enhancing the understanding of its distribution and trends.
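The sketch below computes these measures on `mtcars`, a dataset that ships with R, chosen here only for illustration. Base R has no built-in function for the statistical mode (`mode()` returns the storage type), so the mode is derived from a frequency table. The plotting step assumes the `ggplot2` package is installed and is skipped otherwise.

```R
# Overview of one numeric variable: min, quartiles, median, mean, max
summary(mtcars$mpg)

# Individual summary measures
mean_mpg <- mean(mtcars$mpg)
median_mpg <- median(mtcars$mpg)
var_mpg <- var(mtcars$mpg)
sd_mpg <- sd(mtcars$mpg)

# Mode of a discrete variable: the most frequent value in its frequency table
cyl_counts <- table(mtcars$cyl)
mode_cyl <- names(which.max(cyl_counts))

# Histogram with ggplot2, built only if the package is available (assumption)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(mtcars, ggplot2::aes(x = mpg)) +
    ggplot2::geom_histogram(bins = 10)
}
```

In an interactive session, typing `p` (or calling `print(p)`) displays the histogram.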
In R, exploratory data analysis (EDA) typically starts with loading the dataset, often using
functions like `read.csv()` or `read.table()`. Once the data is loaded, functions such as
`summary()`, `str()`, and `head()` provide a quick overview of the data structure and key
statistics.
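A first look at a freshly loaded dataset might go as follows. To keep the sketch self-contained, it writes a tiny made-up CSV to a temporary file first; in practice `csv_path` would point to a real file, and the `survey` name and its columns are hypothetical.

```R
# Write a tiny illustrative CSV to a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,age,group",
  "1,34,A",
  "2,28,B",
  "3,45,A",
  "4,NA,B"
), csv_path)

# Load the data; stringsAsFactors = TRUE turns text columns into factors
survey <- read.csv(csv_path, stringsAsFactors = TRUE)

str(survey)        # column types and a preview of the values
head(survey, 3)    # first three rows
summary(survey)    # per-column summary statistics

# Count missing values per column
na_counts <- colSums(is.na(survey))
print(na_counts)
```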
CORRELATION ANALYSIS IN R
Correlation analysis is a statistical method used to evaluate the strength and direction of the
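linear relationship between two continuous variables. In R, this analysis can be performed
using functions like `cor()` to calculate the correlation coefficient, which ranges from -1
(perfect negative linear relationship) through 0 (no linear relationship) to 1 (perfect
positive linear relationship).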
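The sketch below computes the coefficient for two made-up vectors, `hours` and `score`. It also calls `cor.test()`, a base R function not mentioned above, to obtain a significance test for the correlation.

```R
# Hypothetical paired measurements
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 55, 61, 64, 70, 74)

# Pearson correlation coefficient (the default method)
r <- cor(hours, score)

# Rank-based alternative, useful when the relationship is monotonic but not linear
rho <- cor(hours, score, method = "spearman")

# Significance test for the Pearson correlation
ct <- cor.test(hours, score)

print(r)
print(rho)
print(ct$p.value)
```

Passing a numeric data frame or matrix to `cor()` returns the full correlation matrix for all pairs of columns.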
Simple regression analysis in R examines the relationship between two variables: one
independent variable and one dependent variable. To perform simple regression in R, it is
possible to use the `lm()` function. Here's a basic overview of the process:
1. Preparing the Data: The data should be clean and organized, typically in a data frame
where columns represent different variables.
2. Loading the Data: Import the dataset into R using functions like `read.csv()` for CSV files.
3. Fitting the Model: Use the `lm()` function to fit a linear model. For example, to predict
`y` from `x`, use `model <- lm(y ~ x, data = your_data)`.
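The model-fitting step can be tried on simulated data. In the sketch below, `your_data` is generated artificially with a true intercept of 2 and a true slope of 3 plus random noise, so that the fitted coefficients can be compared with known values; real data would be loaded from a file instead.

```R
# Simulated data (assumption): y = 2 + 3 * x + random noise
set.seed(1)
your_data <- data.frame(x = 1:20)
your_data$y <- 2 + 3 * your_data$x + rnorm(20)

# Fit the simple linear regression of y on x
model <- lm(y ~ x, data = your_data)

# Estimated intercept and slope
coefs <- coef(model)
print(coefs)

# Full output: coefficient table, standard errors, p-values, R-squared
summary(model)
r2 <- summary(model)$r.squared

# Predict y for a new value of x
pred <- predict(model, newdata = data.frame(x = 25))
print(pred)
```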
Multiple regression in R involves using multiple predictor variables to predict a single
outcome variable. It is an extension of simple regression, which only uses one predictor. In R,
multiple regression is also performed with the `lm()` function, with the predictors joined by
`+` on the right-hand side of the model formula.
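A basic example could look like the sketch below. It uses `mtcars`, a dataset that ships with R, and the choice of `mpg` as the outcome with `wt` and `hp` as predictors is purely illustrative.

```R
# Multiple regression: fuel consumption explained by weight and horsepower
multi_model <- lm(mpg ~ wt + hp, data = mtcars)

# One coefficient per predictor, plus the intercept
multi_coefs <- coef(multi_model)
print(multi_coefs)

# Coefficient table, standard errors, p-values, and R-squared
summary(multi_model)
multi_r2 <- summary(multi_model)$r.squared

# Prediction for a hypothetical car (weight in 1000 lbs, gross horsepower)
new_car <- data.frame(wt = 3, hp = 120)
multi_pred <- predict(multi_model, newdata = new_car)
print(multi_pred)
```

Each coefficient is read as the expected change in the outcome for a one-unit change in that predictor, holding the other predictors fixed.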
SENSITIVITY ANALYSIS IN R