Data Cleaning in R
First steps
Always set your working directory first. In RStudio, use the menu ‘Session’ -> ‘Set Working Directory’ -> ‘Choose Directory…’, or press ‘Ctrl’+‘Shift’+‘H’ and select the folder you want as the working directory.
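The working directory can also be set from the console or a script with getwd and setwd. A minimal sketch: tempdir() is only a stand-in so the example runs anywhere, and should be replaced by the path to your own project folder.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() is a placeholder for your own project folder
setwd(tempdir())
getwd()   # prints the new working directory

# Go back to where we started
setwd(old_wd)
```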
If you need to restart your R session in RStudio you can use the shortcut ‘Ctrl’+‘Shift’+‘F10’.
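To clear all objects from your global workspace you can use: rm(list = ls()). This removes objects only; it does not unload packages. For a clean state that also drops loaded packages, restart the R session.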
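A quick way to check what rm(list = ls()) does and does not remove is sketched below. The object names are hypothetical; the call empties the global environment, while a package attached with library stays on the search path.

```r
library(MASS)                 # attach a package
scratch_numbers <- 1:10       # create some throwaway objects
scratch_label <- "temporary"

rm(list = ls())               # remove every object from the global environment

ls()                          # no objects left
"package:MASS" %in% search()  # TRUE: the package is still attached
```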
Import data
There are several ways to import data in RStudio. Some datasets are already included in R packages. In the following example we will use the ‘Melanoma’ dataset included in the ‘MASS’ package:
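The loading step itself is not shown in this copy. A likely form, assuming the MASS package is installed (it ships with standard R installations), is:

```r
library(MASS)    # attach the MASS package
data(Melanoma)   # copy the dataset into the global environment
dim(Melanoma)    # number of rows and columns
```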
You can see the names of the variables included in the dataset using the names function (R is case-sensitive, so it must be written in lowercase):
names(Melanoma)
To preview the first rows of the dataset (six by default) we can use the head function:
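A minimal sketch of the call, assuming the Melanoma data from MASS is the dataset in use:

```r
library(MASS)
head(Melanoma)   # first six rows by default; head(Melanoma, n = 10) shows ten
```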
  time status sex age year thickness ulcer
1   10      3   1  76 1972      6.76     1
2   30      3   1  56 1968      0.65     0
3   35      2   1  41 1977      1.34     0
4   99      3   0  71 1968      2.90     0
... 6 rows in total
Similarly, you can see the last rows using the tail function:
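A minimal sketch, again assuming the Melanoma data from MASS:

```r
library(MASS)
tail(Melanoma)   # last six rows by default
```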
To open a spreadsheet-style viewer of your data we can use the View function:
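A minimal sketch of the call. View needs an interactive session such as RStudio, so the example guards it; the guard is an addition for illustration and is not required when typing the command in the RStudio console.

```r
library(MASS)

# View() opens a viewer window, which only works in an interactive session
if (interactive()) {
  View(Melanoma)
}
```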
We can perform operations on the data, such as calculating summary statistics for a numeric variable. The general pattern is:
function_name(dataframe_name$variable_name)
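Applied to the Melanoma data, the pattern might look like this. The choice of the thickness and age columns is only for illustration:

```r
library(MASS)

thickness_summary <- summary(Melanoma$thickness)  # min, quartiles, median, mean, max
thickness_summary

mean(Melanoma$age)   # any function that takes a numeric vector works the same way
```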
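To avoid writing the dataframe name every time we want to perform an operation on a variable, we can attach the dataframe in R.
This is convenient for small to medium datasets, but be careful: attached variables can be masked by other objects with the same name, later changes to the dataframe are not reflected in the attached copy, and big datasets make these problems harder to spot. Detach the dataframe when you are done.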
Now we can repeat the calculation of the summary statistics using only the name of the variable(s):
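The attach workflow can be sketched as follows, assuming the Melanoma data from MASS. The detach call at the end is added here as a precaution so the columns do not linger on the search path.

```r
library(MASS)

attach(Melanoma)                          # put the columns on the search path
thickness_summary <- summary(thickness)   # no Melanoma$ prefix needed now
thickness_summary
detach(Melanoma)                          # remove the columns from the search path
```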
For categorical variables, it is best to tell R to treat them as factors with the as.factor function before performing the analysis:
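A minimal sketch of the conversion. The variable is not named in this copy; sex is assumed here because its level counts match the output shown, and the copy called melanoma is a hypothetical name used so the original dataset stays untouched.

```r
library(MASS)

melanoma <- Melanoma                      # work on a copy
melanoma$sex <- as.factor(melanoma$sex)   # the 0/1 codes become factor levels
summary(melanoma$sex)                     # for a factor, summary() gives counts per level
```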
##   0   1
## 126  79
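To import a text or CSV file, follow the menu ‘File’ -> ‘Import Dataset’ and choose one of the ‘From Text’ options. This will open the Import Dataset dialog box. Make sure the options are appropriate for your dataset.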
Use read.csv for files with commas as separators and periods as decimal marks.
Use read.csv2 for files with semicolons as separators and commas as decimal marks.
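The difference between the two functions can be tried out with a pair of small files. In this sketch the data and file names are made up, and the files are written to a temporary folder so nothing in your project is touched.

```r
# A small, made-up data frame to write and read back
toy <- data.frame(id = 1:3, value = c(1.5, 2.25, 3))

csv_file  <- tempfile(fileext = ".csv")
csv2_file <- tempfile(fileext = ".csv")
write.csv(toy, csv_file, row.names = FALSE)    # "," separator, "." decimal mark
write.csv2(toy, csv2_file, row.names = FALSE)  # ";" separator, "," decimal mark

from_csv  <- read.csv(csv_file)     # expects "," and "."
from_csv2 <- read.csv2(csv2_file)   # expects ";" and ","

all.equal(from_csv, from_csv2)      # both files yield the same data frame
```

Reading a file with the wrong function usually shows up as a single column containing the whole line, or as numbers imported as text.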
To import an Excel file, follow the menu ‘File’ -> ‘Import Dataset’ -> ‘From Excel’
In the ‘Import Dataset’ dialog box you can choose the name of your imported dataframe, and the ‘Sheet’ to import, among other
options.
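An Excel file can also be imported from code. The sketch below assumes the readxl package is installed (it is not part of base R) and uses the demo workbook that ships with readxl as a stand-in for your own file; my_data is a hypothetical name for the imported dataframe.

```r
# Assumes readxl is installed: install.packages("readxl")
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to a demo workbook bundled with readxl; replace with your own .xlsx path
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  print(readxl::excel_sheets(xlsx_path))               # list the sheet names
  my_data <- readxl::read_excel(xlsx_path, sheet = 1)  # import the first sheet
  print(head(my_data))
}
```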