
Class One

The document provides an introduction to R programming and RStudio, emphasizing their roles in data analysis and statistical computing. It covers installation, updating, and basic functionalities of R and RStudio, as well as data manipulation techniques and the importance of levels of measurement. Additionally, it discusses importing data, handling missing values, and visualizing data using R packages.

Data Analysis Using R Programming Language

Strengthening Research Skills in Eastern and Southern Africa

INTRODUCTION TO R PROGRAMMING LANGUAGE

Facilitators:
Prof. Susan Balaba Tumwebaze
Dr. Thomas Odong
Dr. Hellen Namawejje
R-programming and RStudio
• R software
• Download R: http://www.r-project.org/ or http://cran.r-project.org/

• RStudio software
• Download RStudio: https://www.rstudio.com/products/rstudio/download/

• Use updated/latest version always


What is R?
• R is a free statistical programming language

• Can be used by either a scientist or a non-scientist

• You do not need to be a statistician to use R

• A widely used data management, statistical analysis and graphics software program

• R is an open source tool used in data analysis.


Why R?

• The language is designed for data analysis.

• R has a set of tools available that makes writing code easier.

• Reproducibility

• Lots of libraries (ways of extending R language)


What is RSTUDIO?
• RStudio provides free and open source tools for R

• Often referred to as an Integrated Development Environment (IDE) for R

• RStudio features include: a console, syntax-highlighting editor, environment, history, tutorials, tools for plotting, packages, and help space

• RStudio provides features to make using R and managing R much easier


Difference between R and RStudio
R
• R is a programming language used for statistical computing
• You can write a program and run the code independently of any other computer program
• R may be used without RStudio

RStudio
• RStudio uses the R language to develop statistical programs
• Must be used alongside R in order to properly function
• RStudio may not be used without R
Installing R and RStudio (Option B)
Install R
• Go to R project or CRAN (Comprehensive R Archive Network)
• Select the link to download R
• Select the R download for your operating system (Windows/Mac/Linux)
• Note: You can also reach the R and RStudio download pages directly by typing their URLs or searching for them in Google

Install RStudio
• Go to RStudio
• In the menu, go to Products > RStudio
• Select download RStudio Desktop
• Select the download for your operating system
R and RSTUDIO workspace
How to update R and RStudio
• Updating R: R 4.3.1 was the latest R version at the time of this training

Option One

• The easiest way to update R is to simply download the newest version.

• Install that, and it will overwrite your current version.


Updating R- R 4.3.1
Option 2: Use packages

• updateR for Mac,

• and installr for Windows


Updating R
• Alternatively
Updating RStudio
• Go to the Help menu in the top menu bar of RStudio

• In the Help menu, select Check for Updates, i.e., click Help > Check for Updates in RStudio

• If RStudio is up to date, you will see this window
• Otherwise, it will direct you to the website to download the latest version.
R and its associated packages
• Once R/RStudio is installed

• You will have base R and its associated packages

• We will use many add-on packages throughout this training

• You can install packages using different options


Adding/installing packages in R
• Options include:
Packages > Install in the bottom right-hand pane
install.packages("package_name") command
Tools > Install Packages in RStudio
search for the package to install
• Install multiple packages:
install.packages(c("tidyverse", "readxl", "dplyr", "ggpubr"))
• readxl is used to read Microsoft Excel files
Update Packages in R
• Click Packages in the bottom-right pane, then Update.
• If all packages are up to date, you will see the window below; otherwise, select the packages you want to update.
Getting help in R
Help on a specific command can be obtained using the following commands:

> help(solve)

> ?solve

> ?t.test
or
> help(t.test)
Basic concepts in R
• R as a calculator
• R uses = or <- to assign values to a variable name
e.g. x = 2 is the same as x <- 2
• R is case sensitive
• Factors
Data is sometimes categorized, e.g. type of soil (loam, clay, sandy)
In R, categorical data is stored as a factor
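A minimal sketch of the basic concepts listed on this slide, using base R only. The object names (x, y, X, soil) are chosen here for illustration and are not from the training material.

```r
# R as a calculator
2 + 3 * 4        # multiplication is done before addition
sqrt(16)

# Assignment: both forms store the value 2 in a variable
x = 2
y <- 2
identical(x, y)

# Write <- without a space: `x < - 2` would instead compare x with -2

# R is case sensitive: x and X are two different names
X <- 10
x == X

# Categorical data stored as a factor (soil types as in the slide)
soil <- factor(c("loam", "clay", "sandy", "clay"))
levels(soil)     # the distinct categories
table(soil)      # how many observations fall in each category
```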
DATA FRAME in R
• Data frame: represents a typical data table that researchers come up with, like a spreadsheet.

Characteristics of a Data frame
• The column names should be non-empty
• The row names should be unique
• The data stored in a data frame can be numeric, factor or character type
• Each column should contain same number of data items
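The characteristics above can be illustrated with a small data frame. This is only a sketch: the data frame soils and its columns are hypothetical and not part of the training datasets.

```r
# Hypothetical example data: three plots with a character, a factor
# and a numeric column, each holding the same number of items.
soils <- data.frame(
  plot_id   = c("P1", "P2", "P3"),
  soil_type = factor(c("loam", "clay", "sandy")),
  ph        = c(6.5, 7.1, 5.8),
  stringsAsFactors = FALSE
)

names(soils)          # column names (non-empty)
rownames(soils)       # row names (unique)
nrow(soils)           # number of rows
ncol(soils)           # number of columns
sapply(soils, class)  # type of each column
```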
Level of measurement
Some Definitions

Variable: Gender

Attributes: Female, Male
What Is Level of Measurement?
The relationship of the values that are assigned to the attributes for a variable

Variable      Party Affiliation
Attributes    Republican    Independent    Democrat
Values        1             2              3
Types of level of measurement
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Note: Interval and Ratio are sometimes referred to as Scale measurements

• Nominal: The values "name" the attribute uniquely; the name does not imply any ordering of the cases
• Ordinal: Attributes can be rank-ordered…
• Interval: When distance between attributes has meaning, e.g. temperature: distance from 30-40 is the same as distance from 70-80
Why is Level of measurement important?
• Helps you to decide what statistical analysis is appropriate on the values that were assigned
• Helps you decide how to interpret the data from that variable

• Ratio: absolute zero is meaningful, e.g. number of clients in the past six months
• It is meaningful to say that "...we had twice as many clients in this period as we did in the previous six months"
The Hierarchy of Levels

Ratio Absolute zero

Interval Distance is meaningful

Ordinal Attributes can be ordered

Nominal Attributes are only named; weakest


Variable Type and Data Analysis
R scripts and comments
R scripts
It is generally better to write your code in a script, RMarkdown or other document.
Why:
So that you can easily save, edit and share your code

File > New File > R Script

Comments
• You can add a comment on your code
• Comments act as reminders
• In R, use the (#) symbol to comment


Importing data
• Before we can work with our data in R, we need to first import the data into R
• Data can be imported from different file types and sources using add-on packages
• For example, csv files, excel files, tab delimited files
• Data files from other programs, e.g., SAS (.sas) files, SPSS (.sav) files, or STATA (.dta) files, can also be used
Importing data
• Importing the most commonly used file types, CSV and Excel files
• Using (Option 1): File > Import Dataset

Packages
• From Text (readr) to import csv files
• From Excel (readxl) to import excel files
• dplyr, tidyverse, ggpubr, ggplot2
• The haven package contains functions to read in SAS, SPSS, and STATA files
Importing data: Option 2
Step one: Set working directory (SWD)

Common ways of setting SWD

1. Go to Session > Set Working Directory and choose the folder that contains your dataset

2. Set the working directory manually using the function setwd("/path_to_folder/")
Importing data: Path/direction
• Step two: use step one to create a path to a folder with your data

• Give an object/name to your dataset

• Use read.csv() command
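The two steps can be sketched as follows. To keep the example self-contained it first writes a tiny CSV file into a temporary folder; the file name abc_example.csv, the object name abc and the values are assumptions for illustration, not the training dataset.

```r
# Create a small CSV in a temporary folder so the sketch can run anywhere.
data_dir <- tempdir()
write.csv(
  data.frame(id = 1:3, salary = c(50000, 62000, 71000)),
  file = file.path(data_dir, "abc_example.csv"),
  row.names = FALSE
)

# Step one: set the working directory to the folder holding the data.
# setwd() returns the previous directory, which is kept to restore later.
old_wd <- setwd(data_dir)
getwd()

# Step two: read the file and give the dataset a name.
abc <- read.csv("abc_example.csv")
head(abc)

# Go back to the directory we started from.
setwd(old_wd)
```

In your own session you would replace data_dir with the folder that contains your dataset.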


Importing Data
When setting the working directory for Windows or Mac/Linux

For Windows: the default directory structure involves a single backslash “\” but
R interprets these as escape characters, so you must replace these with forward
slashes “/” or two backslashes “\\”

For Mac/Linux: The default directory structure already uses forward slashes
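A short sketch of the backslash rule described above. The folder C:/Users/name/Documents/data is a made-up path used only to show the notation; nothing is read from disk.

```r
# Two ways to write the same (hypothetical) Windows folder in R code
path_forward <- "C:/Users/name/Documents/data"
path_double  <- "C:\\Users\\name\\Documents\\data"

# "\\" typed in code is a single backslash in the stored string
nchar("\\")
cat(path_double, "\n")

# Converting backslashes to forward slashes gives the first form
path_converted <- gsub("\\", "/", path_double, fixed = TRUE)
identical(path_converted, path_forward)

# file.path() joins folder names with forward slashes on any system
file.path("C:", "Users", "name", "Documents", "data")
```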
IMPORTING DATA IN R
• Using ABC dataset
• Set the working directory using the setwd() command (getwd() shows the current working directory)
• Install required packages

• After installing the various packages, remember to load them as libraries.
Importing data in R: we use the read_csv() function from the readr package
Features you can check in your data
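A few base R functions are commonly used to check an imported dataset. The sketch below applies them to a small made-up data frame; the name abc and its columns are assumptions standing in for the imported data.

```r
# Hypothetical stand-in for an imported dataset
abc <- data.frame(
  id     = 1:5,
  sex    = factor(c("Female", "Male", "Female", "Male", "Female")),
  salary = c(50000, 62000, 71000, 58000, 66000)
)

dim(abc)       # number of rows and columns
names(abc)     # column names
str(abc)       # type of each column and a preview of its values
head(abc, 3)   # first three rows
tail(abc, 2)   # last two rows
summary(abc)   # quick summary of every column
```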
Comments in R (#)
• Anything to the right of a comment symbol on the same line will be ignored by R
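A minimal sketch of how comments behave; the variable name total is chosen here only for illustration.

```r
# This whole line is a comment and is not run
total <- 2 + 3   # everything after the # on this line is ignored

# total <- 100   (this assignment is commented out, so it never runs)
total
```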
Reading data from excel to R-continued
Other important features
• How to save the R script (File > Save > choose location to save)

• To clear the environment in R, type: rm(list=ls())

• To clear the console, press Ctrl+L

• Calling rm() removes/deletes objects in your working environment.

• The ls() command lists the current objects
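The sketch below shows ls() and rm() at work. It uses a separate environment (env, an assumption made here for safety) so that running it does not delete anything from your own session; typing rm(list = ls()) in the console does the same thing to the working environment.

```r
# A scratch environment holding two example objects
env <- new.env()
assign("a", 1, envir = env)
assign("b", "text", envir = env)

ls(env)                 # lists the objects that currently exist
rm("a", envir = env)    # removes a single object
ls(env)

# Remove everything that is left, as rm(list = ls()) does in the console
rm(list = ls(env), envir = env)
length(ls(env))
```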


Data Manipulation (DM)/Inspecting Variables
Variable Type and Data Analysis
Data Manipulation (DM)
1. Creating/adding new variables
2. Recoding an existing variable
3. Rename columns of a data frame (df)
4. Subset rows/columns of a data frame
5. Remove columns of df
6. Level of measurements in R
7. Dealing with missing values-Delete/remove
8. Merging data
9. Descriptive analysis

Note: To perform the above operations we will use the dplyr and tidyverse packages. dplyr provides functions that make these operations more intuitive and the code more readable.
DM1: Creating new/adding a variable(s)
• Use the assignment operator <- to create new variables.

• A wide array of operators and functions are available here
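A minimal sketch of creating new variables with <-. The salaries data frame below is a small made-up stand-in (its columns and values are assumptions), and the new variable names are chosen for illustration.

```r
# Hypothetical stand-in for the salaries dataset
salaries <- data.frame(
  rank        = c("Prof", "AsstProf", "AssocProf", "Prof"),
  sex         = c("Male", "Female", "Female", "Male"),
  salary      = c(120000, 78000, 95000, 135000),
  yrs_service = c(20, 3, 8, 25),
  stringsAsFactors = FALSE
)

# New variable from an arithmetic operator
salaries$salary_thousands <- salaries$salary / 1000

# New variable from a function
salaries$log_salary <- log(salaries$salary)

# New variable from a logical comparison
salaries$senior <- salaries$yrs_service >= 10

salaries
```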
DM2: Recoding an existing variable
• Converting a continuous variable into a categorical variable, e.g

• Creating two salary categories, that is, low and high
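One way to build the low/high categories is with ifelse(). This is a sketch only: the data values and the cut-off of 100000 are assumptions, and the slides may have used a different threshold or function.

```r
# Hypothetical salary values
salaries <- data.frame(salary = c(120000, 78000, 95000, 135000))

# Assumed cut-off separating "low" from "high"
cutoff <- 100000

# Recode the continuous variable into two categories
salaries$salary_cat <- ifelse(salaries$salary >= cutoff, "high", "low")

# Store the result as a factor with a sensible level order
salaries$salary_cat <- factor(salaries$salary_cat, levels = c("low", "high"))

table(salaries$salary_cat)
```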


DM3: Rename columns of a data frame
• We can change column names using the rename() function from the R package dplyr
• We could rename the column "sex" to SEX in the dataset
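A minimal sketch of the rename described above, assuming the dplyr package is installed. The small salaries data frame is made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the salaries dataset
salaries <- data.frame(
  sex    = c("Male", "Female"),
  salary = c(120000, 78000),
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name
salaries <- rename(salaries, SEX = sex)

names(salaries)
```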
DM4: Subset rows/columns of a data frame
• Rows and columns of a data frame are accessed with "[ ]" by specifying their index or their name, in the form df[rows, columns]
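The bracket notation can be sketched as follows; the data frame and the object names on the left of each <- are assumptions for illustration.

```r
# Hypothetical stand-in for the salaries dataset
salaries <- data.frame(
  rank   = c("Prof", "AsstProf", "AssocProf", "Prof"),
  sex    = c("Male", "Female", "Female", "Male"),
  salary = c(120000, 78000, 95000, 135000),
  stringsAsFactors = FALSE
)

first_row   <- salaries[1, ]                       # row 1, all columns
third_col   <- salaries[, 3]                       # all rows, column 3
by_name     <- salaries[1:2, c("rank", "salary")]  # rows 1-2, two named columns
high_earner <- salaries[salaries$salary > 90000, ] # rows meeting a condition
rank_only   <- salaries$rank                       # one column by name with $

high_earner
```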
DM5: Remove/delete a column in a data frame
• Delete a column you are no longer interested in.
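Two common base R ways to drop a column are sketched below; the slides may have used another approach, such as dplyr's select(). The data frame is made up for illustration.

```r
# Hypothetical stand-in for the salaries dataset
salaries <- data.frame(
  rank        = c("Prof", "AsstProf"),
  sex         = c("Male", "Female"),
  salary      = c(120000, 78000),
  yrs_service = c(20, 3),
  stringsAsFactors = FALSE
)

# Way 1: assign NULL to the column
salaries$sex <- NULL

# Way 2: keep every column whose name is not in the list to drop
drop_cols <- c("yrs_service")
salaries <- salaries[, !(names(salaries) %in% drop_cols), drop = FALSE]

names(salaries)
```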
DM6: Level of measurements in R
Nominal and ordinal (categorical variables)

• as.factor() function for nominal data

• df$variable_of_interest <- as.factor(df$variable_of_interest)

e.g.
salaries$rank <- as.factor(salaries$rank)

check
class(salaries$rank)

levels(salaries$rank)
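The slide shows as.factor() for nominal data. For ordinal data, one option in base R is factor() with ordered = TRUE, sketched below. The data values and the assumed rank order (AsstProf, then AssocProf, then Prof) are illustrative assumptions.

```r
# Hypothetical rank values
salaries <- data.frame(
  rank = c("Prof", "AsstProf", "AssocProf", "Prof"),
  stringsAsFactors = FALSE
)

# Nominal: categories without an order (levels are sorted alphabetically)
salaries$rank <- as.factor(salaries$rank)
class(salaries$rank)
levels(salaries$rank)

# Ordinal: state the order of the categories explicitly
salaries$rank_ord <- factor(
  salaries$rank,
  levels  = c("AsstProf", "AssocProf", "Prof"),
  ordered = TRUE
)
levels(salaries$rank_ord)

# Ordered factors can be compared: is observation 2 ranked below observation 1?
salaries$rank_ord[2] < salaries$rank_ord[1]
```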
DM6: Continued, continuous variables
• Scale (ratio and interval): numerical/integer

• df$variable_of_interest <- as.numeric(df$variable_of_interest)

e.g.
salaries$salary <- as.numeric(salaries$salary)

check
class(salaries$salary)
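A short sketch of the conversion, using made-up values that were read in as text. It also shows a well-known pitfall: calling as.numeric() directly on a factor returns the internal level codes, not the original numbers.

```r
# Hypothetical salary column that was imported as text
salaries <- data.frame(
  salary = c("120000", "78000", "95000"),
  stringsAsFactors = FALSE
)
class(salaries$salary)

# Convert to numeric and check
salaries$salary <- as.numeric(salaries$salary)
class(salaries$salary)
mean(salaries$salary)

# Pitfall with factors: convert to character first
f <- factor(c("10", "20", "10"))
codes  <- as.numeric(f)                # level codes, not the values
values <- as.numeric(as.character(f))  # the actual numbers
codes
values
```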
DM7: Dealing with missing values
• It might happen that your dataset is not complete.

• When information is not available, we call these missing values.

• In R, missing values are coded by the symbol NA

• To identify missing values in your dataset, use the function is.na()

• The example on the next slide uses a different dataset, since salaries has no missing data
DM7:Continued
• Example
DM7: Considering only complete cases
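A sketch of finding missing values and keeping only complete cases with base R. The data frame df and its values are made up for illustration and are not the example used in the training.

```r
# Hypothetical dataset with some missing values (NA)
df <- data.frame(
  id     = 1:5,
  age    = c(25, NA, 31, 40, NA),
  income = c(300, 450, NA, 500, 620)
)

is.na(df$age)        # TRUE where age is missing
colSums(is.na(df))   # number of missing values per column
sum(is.na(df))       # total number of missing values

complete.cases(df)   # TRUE for rows with no missing value

# Keep only the complete cases
df_complete <- na.omit(df)
nrow(df_complete)

# Many functions can skip missing values with na.rm = TRUE
mean(df$age, na.rm = TRUE)
```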
DM7:Continued
• This is just the basic way of dealing with missing values in a df

• Other ways not mentioned here can be applied

• There are advanced ways that can be used to impute missing data.

• If missing values are all deleted, a lot of information is lost, so imputation methods can be applied to avoid this.
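As an illustration of the idea only, the sketch below fills missing ages with the mean of the observed ages. Mean imputation is the simplest possible approach and is not necessarily what the facilitators recommend; it ignores uncertainty in the filled-in values, and dedicated packages (for example mice) offer more careful methods. The data are made up.

```r
# Hypothetical dataset with missing ages
df <- data.frame(
  id  = 1:5,
  age = c(25, NA, 31, 40, NA)
)

# Mean of the observed values
age_mean <- mean(df$age, na.rm = TRUE)

# Replace each NA with that mean, keeping the original column untouched
df$age_imputed <- ifelse(is.na(df$age), age_mean, df$age)

df
```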
DM8: Merging datasets
•H •K
DM8: Merging data with R dplyr package
DM8: Using the left_join()
DM8: Using the right_join()
DM8: Using the inner_join()
DM8: Using the full_join()
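The four joins named above can be compared on two tiny tables, assuming the dplyr package is installed. The tables H and K below, their key column id and their values are made up for illustration; the datasets used in the training are not shown in this text.

```r
library(dplyr)

# Two hypothetical tables sharing the key column "id"
H <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"), stringsAsFactors = FALSE)
K <- data.frame(id = c(2, 3, 4), y = c(10, 20, 30))

# left_join: keep every row of H, add matching columns from K
left <- left_join(H, K, by = "id")

# right_join: keep every row of K, add matching columns from H
right <- right_join(H, K, by = "id")

# inner_join: keep only the ids present in both tables
inner <- inner_join(H, K, by = "id")

# full_join: keep every id from either table
full <- full_join(H, K, by = "id")

left
right
inner
full
```

Rows without a match in the other table are filled with NA, which is why id 1 has a missing y in the left join.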
DM8: Merging datasets in R: Practicals
DM9: Descriptive Analysis
Descriptive analysis
• Frequency
table() command for categorical variables

prop.table() # proportions (multiply by 100 for percentages)

summary() command for scale variables

Measures of central tendency and dispersion
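A sketch of these descriptive commands in base R, applied to a small made-up salaries data frame (values are assumptions for illustration).

```r
# Hypothetical stand-in for the salaries dataset
salaries <- data.frame(
  rank   = factor(c("Prof", "AsstProf", "AssocProf", "Prof")),
  salary = c(120000, 78000, 95000, 135000)
)

# Frequencies for a categorical variable
freq <- table(salaries$rank)
freq

# Proportions, then percentages
prop <- prop.table(freq)
round(100 * prop, 1)

# Summary for a scale variable
summary(salaries$salary)

# Central tendency
salary_mean   <- mean(salaries$salary)
salary_median <- median(salaries$salary)

# Dispersion
salary_sd    <- sd(salaries$salary)
salary_range <- range(salaries$salary)
```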
Graphical methods in
R packages for data visualization
• tidyverse and ggplot2 packages

• tidyverse is a set of packages for data tidying, manipulation, and visualization.
Graphics for one scale variable (Scatterplot or Histogram)
• Scatterplot
• Histogram
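A minimal histogram sketch with ggplot2, assuming the package is installed. The salary values, the number of bins and the labels are assumptions for illustration; the plots shown in the training are not reproduced here.

```r
library(ggplot2)

# Hypothetical salary values
salaries <- data.frame(
  salary = c(78000, 82000, 91000, 95000, 99000,
             104000, 110000, 120000, 135000, 150000)
)

# Histogram of one scale variable
p_hist <- ggplot(salaries, aes(x = salary)) +
  geom_histogram(bins = 5, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of salary", x = "Salary", y = "Count")

print(p_hist)
```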
Graphics for one categorical variable
• Option A: rank of professor • Option B
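Two common ways to draw a bar chart of a categorical variable such as rank are sketched below: geom_bar() from ggplot2 and barplot() from base R. Whether these correspond to the slide's Option A and Option B is not known, and the data are made up.

```r
library(ggplot2)

# Hypothetical rank values
salaries <- data.frame(
  rank = factor(c("Prof", "AsstProf", "AssocProf", "Prof", "Prof", "AsstProf"))
)

# ggplot2: geom_bar() counts the rows in each category
p_bar <- ggplot(salaries, aes(x = rank)) +
  geom_bar(fill = "darkorange") +
  labs(title = "Number of staff by rank", x = "Rank", y = "Count")

print(p_bar)

# Base R: barplot() of a frequency table
rank_counts <- table(salaries$rank)
barplot(rank_counts, main = "Number of staff by rank",
        xlab = "Rank", ylab = "Count")
```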
Exporting data from R-Continued
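A sketch of exporting a data frame to a CSV file with base R's write.csv(). The data, the file name salaries_export.csv and the temporary folder are assumptions; the export steps shown in the training are not reproduced in this text.

```r
# Hypothetical data to export
salaries <- data.frame(
  rank   = c("Prof", "AsstProf", "AssocProf"),
  salary = c(120000, 78000, 95000),
  stringsAsFactors = FALSE
)

# Write to a temporary folder; use your own folder path in practice
out_file <- file.path(tempdir(), "salaries_export.csv")
write.csv(salaries, out_file, row.names = FALSE)

# Confirm the file exists and read it back to check the contents
file.exists(out_file)
check <- read.csv(out_file, stringsAsFactors = FALSE)
check
```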
END OF DAY ONE
