Data cleaning in R

By: Rodrigo Henriquez, ITM 2024

First steps
Always remember to set your working directory. In RStudio you can do it through the menu: ‘Session’ -> ‘Set Working Directory’ -> ‘Choose Directory…’, or use the shortcut ‘Ctrl’+‘Shift’+‘H’ and select the desired folder as the working directory.
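The same step can also be done from the console or a script with getwd() and setwd(). This is a minimal sketch: tempdir() is only a stand-in used for illustration, so replace it with your own project folder.

```r
# Show the current working directory and keep it for later
old_wd <- getwd()
print(old_wd)

# Change the working directory. tempdir() is a placeholder for this example;
# in practice use your own folder, e.g. setwd("C:/Rworkspace")
setwd(tempdir())
print(getwd())

# Return to the original working directory
setwd(old_wd)
```

Paths in R use forward slashes (or doubled backslashes) on Windows as well.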

If you need to restart your R session in RStudio you can use the shortcut ‘Ctrl’+‘Shift’+‘F10’.

To remove all objects from your global workspace you can use: rm(list = ls()). This does not unload packages; restart the R session to get a clean set of loaded packages.

Import data
There are several ways to import data in RStudio. Some data sets are already included in R libraries. In the following example we will
use the ‘Melanoma’ dataset included in the ‘MASS’ package:

library(MASS) # load the 'MASS' package


data("Melanoma")

You can see the names of the variables included in the dataset using the names function:

names(Melanoma)

## [1] "time"      "status"    "sex"       "age"       "year"      "thickness"
## [7] "ulcer"

To preview the first elements of the dataset we can use the head function:

head(Melanoma) # print the first elements of the data frame

     time status   sex   age  year thickness ulcer
    <int>  <int> <int> <int> <int>     <dbl> <int>
1      10      3     1    76  1972      6.76     1
2      30      3     1    56  1968      0.65     0
3      35      2     1    41  1977      1.34     0
4      99      3     0    71  1968      2.90     0
5     185      1     1    52  1965     12.08     1
6     204      1     1    28  1971      4.84     1

6 rows

You can also see the last elements using the tail function:

tail(Melanoma) # print the last elements of the data frame

     time status   sex   age  year thickness ulcer
    <int>  <int> <int> <int> <int>     <dbl> <int>
200  4479      2     0    19  1965      1.13     1
201  4492      2     1    29  1965      7.06     1
202  4668      2     0    40  1965      6.12     0
203  4688      2     0    42  1965      0.48     0
204  4926      2     0    50  1964      2.26     0
205  5565      2     0    41  1962      2.90     0

6 rows
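Both head and tail show six rows by default. As a short sketch, the n argument changes how many rows are printed, and dim reports the size of the whole data frame:

```r
library(MASS)            # provides the 'Melanoma' dataset
data("Melanoma")

head(Melanoma, n = 3)    # first 3 rows instead of the default 6
tail(Melanoma, n = 3)    # last 3 rows
dim(Melanoma)            # number of rows and number of columns
```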

To open a spreadsheet-style viewer of your data we can use the View function:

View(Melanoma) # see a spreadsheet-style data viewer

We can perform operations on the data, such as calculating summary statistics for a numerical variable. The general pattern is:
function_name(dataframe_name$variable_name)

summary(Melanoma$age) # summary statistics for a numerical variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    4.00   42.00   54.00   52.46   65.00   95.00
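The function_name(dataframe_name$variable_name) pattern works with other base R functions too. The functions below are standard ones, chosen here only as illustrations:

```r
library(MASS)
data("Melanoma")

mean(Melanoma$age)         # arithmetic mean of 'age'
median(Melanoma$age)       # median of 'age'
sd(Melanoma$thickness)     # standard deviation of 'thickness'
range(Melanoma$year)       # smallest and largest value of 'year'
```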

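To avoid writing the dataframe name every time we want to perform an operation on a variable, we can attach the dataframe in R. This is convenient for small to medium datasets, but use it with care: attached variables can mask, or be masked by, other objects with the same name, and later changes to the dataframe are not reflected in the attached copy.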

attach(Melanoma) # access variables without writing the dataset name

Now we can repeat the calculation of the summary statistics using only the name of the variable(s):

summary(thickness) # summary statistics for the 'thickness' variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.10    0.97    1.94    2.92    3.56   17.42
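An attached dataframe stays on the search path until it is detached. As a hedged sketch, with() is a base R alternative that gives the same shortcut for a single expression without attaching anything, and detach() undoes an attach:

```r
library(MASS)
data("Melanoma")

# with() evaluates the expression using the columns of the data frame,
# without adding the data frame to the search path
with(Melanoma, summary(thickness))

# attach() stays in effect until it is undone with detach()
attach(Melanoma)
"Melanoma" %in% search()   # TRUE while the data frame is attached
detach(Melanoma)           # removes the most recently attached copy
```

If the dataframe was attached more than once, each attach needs its own detach.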

For categorical variables, it is best to tell R to treat them as factors with the as.factor function before performing the analysis:

summary(as.factor(sex)) # count of observations per category

## 0 1
## 126 79
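The counts are easier to read when the categories have labels. The sketch below uses factor() with explicit labels. The coding 0 = female, 1 = male is taken from the dataset's help page (?Melanoma), and sex_f is a hypothetical name chosen for this example:

```r
library(MASS)
data("Melanoma")

# Coding assumed from ?Melanoma: 0 = female, 1 = male
sex_f <- factor(Melanoma$sex, levels = c(0, 1), labels = c("Female", "Male"))

summary(sex_f)   # count of observations per labelled category
table(sex_f)     # the same counts as a frequency table
```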

Import a dataset in .csv or .xlsx format


The easiest way is to follow the menu ‘File’ -> ‘Import Dataset’ -> ‘From Text (base)…’ and select your .csv file.

This will open the Import Dataset dialog box. Make sure the options are appropriate for your dataset.

mydata <- read.csv("C:/Rworkspace/datasets/birthwt.csv", header = TRUE)

Use read.csv when the file uses commas as separators and periods as decimal marks.
Use read.csv2 when the file uses semicolons as separators and commas as decimal marks.
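The difference between the two conventions can be seen in a small round trip. This is only a sketch: the data frame df and the temporary files are hypothetical and created just for the illustration.

```r
# Small hypothetical data frame used only for this illustration
df <- data.frame(id = 1:3, weight = c(2.5, 3.1, 2.9))

# Write the same data in both conventions to temporary files
file_comma <- tempfile(fileext = ".csv")
file_semi  <- tempfile(fileext = ".csv")
write.csv(df,  file_comma, row.names = FALSE)   # "," separator, "." decimal
write.csv2(df, file_semi,  row.names = FALSE)   # ";" separator, "," decimal

readLines(file_semi)   # inspect the raw text of the semicolon file

# Read each file with the function that matches its convention
data_comma <- read.csv(file_comma, header = TRUE)
data_semi  <- read.csv2(file_semi, header = TRUE)

all.equal(data_comma, data_semi)   # compare the two imported data frames
```

Reading a file with the wrong function usually shows up as a single column or as numbers imported as text, so it is worth checking the result with head or str after importing.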

To import an Excel file, follow the menu ‘File’ -> ‘Import Dataset’ -> ‘From Excel’

In the ‘Import Dataset’ dialog box you can choose the name of your imported dataframe, and the ‘Sheet’ to import, among other
options.
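An Excel file can also be imported with code. The sketch below assumes the readxl package is installed (it is not part of base R) and uses an example workbook that ships with readxl. The name mydata_xl is hypothetical, and path should be replaced with the location of your own file.

```r
# Assumes the 'readxl' package is installed: install.packages("readxl")
if (requireNamespace("readxl", quietly = TRUE)) {
  # Example workbook shipped with readxl; replace with the path to your file
  path <- readxl::readxl_example("datasets.xlsx")

  # List the sheet names available in the workbook
  print(readxl::excel_sheets(path))

  # Import the first sheet; 'sheet' also accepts a sheet name
  mydata_xl <- readxl::read_excel(path, sheet = 1)
  print(head(mydata_xl))
}
```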
