
Introduction to R

Statistics
Statistics is a branch of mathematics dealing with
  • Data collection
  • Organization
  • Analysis
  • Interpretation
  • Decision making
• Data consists of information coming from
observations, counts, measurements, or responses.

• A variable is a characteristic or condition that can
change or take on different values.

• A population is the collection of all outcomes,
responses, measurements, or counts that are of
interest.

• A sample is a subset of a population.


Parameters & Statistics
• A parameter is a numerical description of a
population characteristic.

• A statistic is a numerical description of a sample
characteristic.
Example:
1. A recent survey of a sample of 450 college
students reported that the average weekly
income for students is $325. (This is a statistic.)

2. The average weekly income for all students is
$405. (This is a parameter.)
Branches of Statistics
Qualitative and Quantitative Data
• R is a language and environment for statistical
computing and graphics.

• R is available as free software.

• Download R from the CRAN website
(Comprehensive R Archive Network).

• The site: http://cran.r-project.org

• R provides a wide variety of statistical analyses:
  • Linear and nonlinear modeling
  • Classical statistical tests
  • Time-series analysis
  • Classification
  • Clustering
  • Big data analysis
  • Data mining
Interface of RStudio
Create a new project
Working directory
To print the current working directory:
> getwd()
To change the current working directory:
> setwd("H:")

The workspace
Your R objects are stored in a workspace.
To list the objects in your workspace:
> ls()
History
• To work with your previous commands:
> history() # display the last 25 commands
> history(max.show = Inf) # display all previous commands

• To save your command history (the default file name is
".Rhistory"):
> savehistory(file = "myfile")

• To recall your command history:
> loadhistory(file = "myfile")
Getting Help
• If you know a particular command but don't
know the correct syntax, use help("command").
Eg: > help("ls") or > ?ls

• If you don't know the command but know a
keyword, use help.search("keyword").
Eg: > help.search("ls")
• If you want to list all the functions whose names include a
particular word, use apropos("word").
Eg: > apropos("save")

• If you want to see examples for a
function, use example("function").
Eg: > example("save")
As a Calculator
• One of the simplest possible tasks in R is to enter
an arithmetic expression and receive a result.
>2+2
[1] 4
>2*5
[1] 10
>sqrt(4)
[1] 2
>exp(-2)
[1] 0.1353353
> pi
[1] 3.141593

>(5+(6+7)*(pi^2))/8
[1] 16.66311

>log(exp(1))
[1] 1

>log(10000, 10)
[1] 4
> sin(pi/3)^2 + cos(pi/3)^2
[1] 1

> Sin(pi/3)^2 + cos(pi/3)^2
Error in Sin(pi/3) : could not find function "Sin"

> ExP(-1)
Error in ExP(-1) : could not find function "ExP"

>exp(-1)
[1] 0.3678794
Naming Variables
• Three basic types of variables
  • Numeric {Ex: 3, 4.098, 1234}
  • Character {Ex: "Andrew", "today", "RRR"}
  • Logical {Ex: TRUE, FALSE}

• Names can be built from letters, digits, the period (dot) (.),
and the underscore (_).
• Names must not start with a digit, an underscore, or a period
followed by a digit.
• Names are case-sensitive.
• Some names are already used by the system. Avoid using the
following as variable names, because they are the names of
built-in functions or constants:
Eg: c, q, t, D, F, I, T, diff, df, pt
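The three variable types and the case-sensitivity rule can be checked directly in the console. The sketch below is only an illustration; the variable names (height_cm, first.name, is_student) are hypothetical and not part of the original slides.

```r
# Hypothetical variable names, chosen only for illustration
height_cm  <- 172.5      # numeric
first.name <- "Andrew"   # character
is_student <- TRUE       # logical

class(height_cm)    # "numeric"
class(first.name)   # "character"
class(is_student)   # "logical"

# Names are case-sensitive: weight and Weight are two different objects
weight <- 50
Weight <- 65
weight == Weight    # FALSE
```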
Assigning values to variables

• "<-" is the assignment operator; "=" can also be used.

> weight <- 50 or > weight = 50
• To display the value,
> weight
[1] 50

• You cannot do much statistics on single numbers.
If we want to work with more than one
number, the solution is VECTORS.
Vectors
• A vector is a sequence of data elements of the
same basic type.
• One advantage of R is that it can handle entire
data vectors as single objects.
> weight<-c(60,45,76,31,53)
> weight
[1] 60 45 76 31 53
• Eg: Enter the following numbers into Y, and
perform the following operations.
2, 4, 3, 6, 5, 1, 7, 8, 9, 10

i. a = Y + Y
ii. b = Y *(1/2)* Y
iii. c = a + b
iv. d = 1/c
v. Print Y, a, b, c, d
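One possible way to work through this exercise is sketched below. It assumes the operations are meant element by element, which is how R applies arithmetic to vectors of equal length. The exercise asks for a variable named c; R still finds the function c() when it is called, but such names are better avoided in real work.

```r
Y <- c(2, 4, 3, 6, 5, 1, 7, 8, 9, 10)

a <- Y + Y            # element-wise sum, same as 2 * Y
b <- Y * (1/2) * Y    # element-wise product, same as Y^2 / 2
c <- a + b            # the exercise's own name; it shadows nothing when c() is called
d <- 1 / c            # element-wise reciprocal

print(Y)
print(a)
print(b)
print(c)
print(d)
```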
• Suppose you want to access the 2nd element of
Y.

> Y[2]
[1] 4

> Y[1:3]
[1] 2 4 3

> Y[5:8]
[1] 5 1 7 8
Character Vectors
• A character vector is a vector of text strings, whose
elements are specified and printed in quotes.
> x = c("Wednesday", "Tuesday", "Monday")
> x
[1] "Wednesday" "Tuesday"   "Monday"

> color <- c("Red", "Blue", "Green")
> color
[1] "Red"   "Blue"  "Green"

> color[2]
[1] "Blue"
Logical Vectors

• Logical vectors can take the value TRUE or FALSE
(or NA, "not available", for missing values).
• In input, you may use the convenient abbreviations
T and F, although defining vectors this way is
not common:
> c(T, F, T, T, F, T, F)
Eg: > Y < 5
[1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
Missing Values
• Missing values are represented by the symbol NA (Not
Available)
• Impossible values are represented by the symbol NaN
(Not A Number)
• Infinite values are represented by Inf
> x <- c(12, 54, NA)
> x + 3
[1] 15 57 NA

> log(0)
[1] -Inf

> 0/0
[1] NaN
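These special values can also be detected with test functions. The following is a small sketch using base R's is.na(), is.nan() and is.finite() on the same x as above.

```r
x <- c(12, 54, NA)

is.na(x)             # FALSE FALSE  TRUE
sum(is.na(x))        # number of missing values: 1

is.nan(0/0)          # TRUE
is.na(0/0)           # TRUE, because NaN is also treated as missing
is.finite(log(0))    # FALSE, because log(0) is -Inf
```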
R as a Number Generator
• Generate a variable with numbers ranging from 1
to 12:
> x <- 1:12
>x
[1] 1 2 3 4 5 6 7 8 9 10 11 12

• Sequence – seq(from, to, by)
> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
> seq(from = 1, to = 30)
> seq(from = -5, to = 5, by = 0.2)
> seq(length = 51, by = 0.2, from = 5)
• Repetition – rep(x, times, …)
> rep(10, 3)
[1] 10 10 10

> rep(c(1:4), 3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4

> rep(c(1.2, 2.7, 4.8), 5)
[1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8
> s1=1:9
> s2 <- rep(s1, times = 3)
> s3 <- rep(s1, each= 2)
> s4 <- rep(s1, times = 3, each= 2)

• Generating levels – gl(n, k, length = n*k)

> gl(2, 4, 8)
[1] 1 1 1 1 2 2 2 2
Levels: 1 2

> gl(2, 10, length = 20)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
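The difference between the times and each arguments of rep() is easier to see on a short vector. The sketch below uses 1:3 instead of s1 so that the results stay readable; the labels argument of gl() is shown with hypothetical level names.

```r
rep(1:3, times = 2)             # 1 2 3 1 2 3
rep(1:3, each = 2)              # 1 1 2 2 3 3
rep(1:3, times = 2, each = 2)   # 1 1 2 2 3 3 1 1 2 2 3 3

# Lengths of the vectors defined above from s1 = 1:9
s1 <- 1:9
length(rep(s1, times = 3))             # 27
length(rep(s1, each = 2))              # 18
length(rep(s1, times = 3, each = 2))   # 54

# gl() returns a factor; labels (hypothetical here) name its levels
g <- gl(2, 4, labels = c("Control", "Treatment"))
levels(g)                       # "Control" "Treatment"
```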
Accessing Data
• There are several ways to extract data from a
vector.
Suppose x is the data vector, for example x = 1:10
• To find the number of elements
> length(x)
• To print the ith element (i = 2)
> x[2]
• To print all but the ith element (i = 2)
> x[-2]
• To print all but specific elements (all except the 1st
to 3rd elements)
> x[-c(1:3)]

• To print the first k elements (k = 5)
> x[1:5]

• To print specific elements (1st, 3rd and 5th)
> x[c(1,3,5)]
• To print all elements greater than some value (the value
is 3)
> x[x>3]

• To find the indices of the largest value
> which(x == max(x))
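Conditions inside the brackets can be combined, and which() returns every matching position. The sketch below illustrates this; the vector v is a hypothetical example with a repeated maximum.

```r
x <- 1:10

x[x > 3 & x %% 2 == 0]    # even elements greater than 3: 4 6 8 10
which(x > 7)              # positions of elements greater than 7: 8 9 10

v <- c(5, 9, 2, 9)        # hypothetical vector with a tied maximum
which(v == max(v))        # all positions of the maximum: 2 4
which.max(v)              # only the first position: 2
```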
Sorting & ordering
• Suppose y is a data vector which contains the
values
12, 34, 6, 48, -3, -28

• To sort the values in y
> sort(y)
[1] -28  -3   6  12  34  48

• To get the ordering of the values in y (the indices that
would sort y)
> order(y)
[1] 6 5 3 1 2 4
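The relationship between the two functions can be sketched as follows: indexing y by order(y) gives the same result as sort(y). The decreasing argument is a standard option of both functions.

```r
y <- c(12, 34, 6, 48, -3, -28)

idx <- order(y)               # positions of the values, smallest first
y[idx]                        # same result as sort(y)
identical(y[idx], sort(y))    # TRUE

sort(y, decreasing = TRUE)    # 48 34 12 6 -3 -28
```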
R as a sampler
• We have 60 people (1,2,3,…,60). If we
randomly select 5 people from the group, who
would be selected?

> sample(1:60, 5)
[1] 32 26 6 18 9
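Because the selection is random, each call normally returns a different set of people. A minimal sketch of how to make a draw repeatable with set.seed() is shown below; the seed value 123 is an arbitrary assumption, and no particular output is implied.

```r
set.seed(123)                    # arbitrary seed; any integer works
chosen <- sample(1:60, 5)        # 5 distinct people, drawn without replacement
chosen

set.seed(123)                    # resetting the seed reproduces the same draw
again <- sample(1:60, 5)
identical(chosen, again)         # TRUE

# With replace = TRUE the same person may be drawn more than once
with_repl <- sample(1:60, 5, replace = TRUE)
```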
Data frames

• We have to handle not only variables and
vectors but also data sets. When we work
with a data set in R, it is stored as a data frame.
There are basically two methods to create
data frames.

1. Store all variables as separate vectors, and then
combine.

2. Read from files.

1. Store all variables as separate vectors, and
then combine
Syntax : >data.frame(variable_1,variable_2,.........)
Eg:
High  Medium  Low
28    24      20
26    22      21
25    24      NA
29    28      25
33    30      26
38    36      32
41    40      35
36    33      32
• Create three variables High, Medium and Low as
three separate vectors.
> High = c(28, 26,........,36)
> Medium = c(24, 22,.........,33)
> Low = c(20,21,NA,.......,32)
• Combine all those vectors by using the
command
> data.frame(High, Medium, Low)
• If you want to access the created data set
again, you have to assign a name to it.
> data1 = data.frame(High, Medium, Low)
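Written out in full with the High, Medium and Low values from the table above, the first method might look like the sketch below. The calls to dim() and str() are added only to inspect the result.

```r
# Values taken from the High / Medium / Low table above
High   <- c(28, 26, 25, 29, 33, 38, 41, 36)
Medium <- c(24, 22, 24, 28, 30, 36, 40, 33)
Low    <- c(20, 21, NA, 25, 26, 32, 35, 32)

data1 <- data.frame(High, Medium, Low)

dim(data1)   # 8 rows, 3 columns
str(data1)   # column names and types
data1        # print the whole data frame
```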

2. Read from files


• If we have large data sets, it is preferable
to read data from an external file rather than
entering data at the keyboard during the R
session. First we have to modify the input
file according to the requirements of R.
• To read an entire data frame directly, the external
file will normally have a special form.
• The first line of the file should have a name for
each variable in the data frame.
• All missing values should be replaced by "NA".
• By default, numeric items (except row labels) are
read as numeric variables and non-numeric
items as character vectors (as factors in R
versions before 4.0.0).
• The most convenient way of reading data into R is
via the function
> read.table()
• It requires data created with Windows'
Notepad or any other plain-text editor.
Sub1  Sub2
91    78
84    70
80    85
75    88
93    69
How to create the text file?
• To enter the data into a file, you could start up
Windows' Notepad and simply type the data as
shown.
• Columns are separated by an arbitrary number of
blanks (Eg: a single blank or a tab).
• NA represents a missing value.
• Save as a text file.
How to import into R?

Syntax
> data2 = read.table("H:/marks.txt", header = T)
Or
> data2 = read.table(file.choose(), header = T)
• header=T indicates that the columns have headings.

• Note that you use forward slashes (/), not
backslashes (\), in the file name.

• A backslash itself must be written as \\, so we
could also have used

> data.frame.name = read.table("Drive\\Directory\\FileName.extension",
header = T)
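The whole round trip can be tried without creating a file by hand. The sketch below writes the Sub1/Sub2 table to a temporary file, which is an assumption standing in for marks.txt, and reads it back with read.table().

```r
# Temporary file used as a stand-in for marks.txt (assumption for illustration)
marks_file <- tempfile(fileext = ".txt")
writeLines(c("Sub1 Sub2",
             "91 78",
             "84 70",
             "80 85",
             "75 88",
             "93 69"),
           marks_file)

# header = TRUE tells R that the first line holds the column names
data2 <- read.table(marks_file, header = TRUE)
data2
str(data2)
```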
Variations of read.table
1. read.csv
fields are separated by commas
2. Using History Window
Naming Columns
• Columns can be named after the
data set is imported into R.
Syntax:
> names(dataset_name) = c("var_name1",
"var_name2", ............)
Eg:
> names(data) = c("Index","Weight","Height","Sex","Sub1","Sub2","Sub3","Class")
To separate the data items into separate vectors
• Syntax: > variable_name = data_frame_name[[column_no]]
Eg: > Sub1 = data2[[1]]
(Single brackets, as in data2[1], return a one-column data frame
rather than a vector.)

OR

• Syntax:
> variable_name = data_frame_name$variable_name_in_text_file
Eg: > Sub1 = data2$Sub1
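The different ways of extracting a column do not all return the same kind of object. The sketch below rebuilds data2 inline from the Sub1/Sub2 values so that it runs on its own, and compares single brackets, double brackets and the $ operator.

```r
# data2 rebuilt inline from the Sub1 / Sub2 table so the example is self-contained
data2 <- data.frame(Sub1 = c(91, 84, 80, 75, 93),
                    Sub2 = c(78, 70, 85, 88, 69))

class(data2[1])      # "data.frame": single brackets keep the data frame structure
class(data2[[1]])    # "numeric": double brackets return the column as a vector
class(data2$Sub1)    # "numeric": $ also returns the column as a vector

identical(data2[[1]], data2$Sub1)   # TRUE
```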
Descriptive Statistics
• Predefined functions can be used to
compute the necessary statistics one by one.
• Syntax:
> mean(variable_name)
> sd(variable_name)
> var(variable_name)
> min(variable_name)
> max(variable_name)
> median(variable_name)
• Eg:
>mean(Height)
>var(Height)
• Several of these statistics (minimum, quartiles, median,
mean and maximum) can be obtained at once
by using the function 'summary'.
• Syntax: > summary(variable_name)
• Eg: > summary(Height)
• If there is any missing value in the variable, R returns
NA as the result.
• To avoid that problem, you can give the argument
'na.rm' (not available, remove) to request that missing
values be removed.
• Syntax: > mean(variable_name, na.rm=T)
• Eg:
>mean(Sub1)
>mean(Sub1,na.rm=T)
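The effect of na.rm can be seen on a small vector. In the sketch below, the values of Sub1 are hypothetical (one mark is deliberately missing) and are not the data from the slides.

```r
# Hypothetical marks with one missing value
Sub1 <- c(91, 84, NA, 75, 93)

mean(Sub1)                  # NA, because one value is missing
mean(Sub1, na.rm = TRUE)    # 85.75, the mean of the four available values
sd(Sub1, na.rm = TRUE)      # standard deviation of the available values
summary(Sub1)               # also reports the number of NA's
```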
How to use a by variable?
1st method
• Consider every level of the given category
• Syntax: > tapply(association_var, classification_var, statistic)
association_variable - any continuous variable
classification_variable - any categorical
variable (by variable)
statistic - any statistic that you want to perform
• Eg:
> tapply(Height,Sex,mean)
2nd method
• Consider only one given level of the given category
• Syntax:
> summary(association_var[classification_var == level])
• Instead of 'summary', any predefined function for
descriptive statistics can also be used here.
Eg:
> summary(Height[Sex == 'M'])
> mean(Height[Sex == 'M'])
> var(Height[Sex == 'M'])
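Both methods can be compared on a small example. The Height and Sex values below are hypothetical and serve only to show that the two approaches agree for a given level.

```r
# Hypothetical data: six people with height (cm) and sex
Height <- c(170, 160, 180, 155, 175, 165)
Sex    <- c("M", "F", "M", "F", "M", "F")

# 1st method: one mean per level of Sex
by_sex <- tapply(Height, Sex, mean)
by_sex                         # F: 160, M: 175

# 2nd method: only the level "M"
mean(Height[Sex == "M"])       # 175
```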
Tally and Contingency Tables
• The function table() uses cross-classifying factors to build a
contingency table of the counts at each combination of
factor levels.

Tally table

• Syntax: > table(var_name)

• Var_name should be a Categorical Variable.

Eg: >table(Sex)
Contingency Tables
• Syntax:
> table(var_name1, var_name2)

Eg:
>table(Sex,Class)
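A short sketch of both kinds of table follows. The Sex and Class values are hypothetical; addmargins() is a base R function that appends row and column totals.

```r
# Hypothetical categorical data for six students
Sex   <- c("M", "F", "M", "F", "M", "F")
Class <- c("A", "A", "B", "B", "A", "B")

tally <- table(Sex)           # counts per level: F 3, M 3
tally

tab <- table(Sex, Class)      # counts for each Sex / Class combination
tab

addmargins(tab)               # same table with row and column totals
```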
