Lab4-Factors & DataFrames

Factors represent categorical data and can be ordered or unordered. They are created using the factor() function, which stores categorical values as integers mapped to character strings representing categories or levels. A data frame is a list of equal-length vectors that can be thought of as a rectangular structure with columns as variables and rows as observations. It allows storing multiple types of variables together. New rows and columns can be added to an existing data frame using functions like rbind() and $ operator respectively.

Uploaded by

roliho3769

Factors

• Factors are used to represent categorical data and can be unordered or
ordered.
• Factors can be considered as an integer vector where each integer has a label.
Factor objects can be created with the factor() function. The function factor()
stores the categorical values as a vector of integers in the range [1... k] (where
k is the number of unique values in the categorical variable), and an internal
vector of character strings (the original values) mapped to these integers.
• A factor is usually constructed by giving it a vector of strings. These are
translated into the different categories, and the factor becomes a vector of
these categories. These categories are called “levels”.
Example:
> f <- factor(c("small", "small", "medium", "large", "small", "large"))

> f
## [1] small  small  medium large  small  large
## Levels: large medium small

> levels(f)
## [1] "large"  "medium" "small"

By default, factor levels for character vectors are created in alphabetical order.
To see the underlying representation of a factor, use the command
> unclass(f)
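
The integer-plus-labels representation described above can be inspected directly. The following is a minimal sketch using the same factor f; the variable names codes and labels_back are only illustrative.

```r
f <- factor(c("small", "small", "medium", "large", "small", "large"))

# Integer codes stored inside the factor (positions within levels(f))
codes <- as.integer(f)
print(codes)

# Indexing the levels by the codes recovers the original labels
labels_back <- levels(f)[codes]
print(labels_back)

# unclass() shows both pieces at once: the codes plus a "levels" attribute
print(unclass(f))
```

Here "small" is stored as 3 because it is the third level in alphabetical order.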

For vectors representing ordinal variables, you add the argument ordered=TRUE
to the factor() function. Given the vector
status <- c("Poor", "Improved", "Excellent", "Poor")
the statement
status <- factor(status, ordered=TRUE)
will encode the vector as (3, 2, 1, 3) and associate these values internally
as 1=Excellent, 2=Improved, and 3=Poor.
You can override the default by specifying a levels option. For
example,
status <- factor(status, ordered=TRUE,
                 levels=c("Poor", "Improved", "Excellent"))
would assign the levels as 1=Poor, 2=Improved, 3=Excellent.
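
The two encodings can be compared side by side. This is a small sketch; the names status_raw, default_codes and custom_codes are introduced here only for illustration.

```r
status_raw <- c("Poor", "Improved", "Excellent", "Poor")

# Default: levels are sorted alphabetically (Excellent, Improved, Poor)
default_codes <- as.integer(factor(status_raw, ordered = TRUE))

# Explicit levels: Poor < Improved < Excellent
custom <- factor(status_raw, ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))
custom_codes <- as.integer(custom)

print(default_codes)  # 3 2 1 3
print(custom_codes)   # 1 2 3 1
```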
Changing the order of the levels like this changes how many functions handle
the factor. The order of factor levels mostly affects how summary
information is printed and how factors are plotted.
f <- factor(c("small", "small", "medium", "large", "small", "large"))
f
## [1] small  small  medium large  small  large
## Levels: large medium small
summary(f)
##  large medium  small
##      2      1      3

ff <- factor(c("small", "small", "medium", "large", "small", "large"),
             levels = c("small", "medium", "large"))
ff
## [1] small  small  medium large  small  large
## Levels: small medium large
summary(ff)
##  small medium  large
##      3      1      2
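
Besides summaries and plots, the level order also drives comparisons when the factor is ordered. The sketch below reuses the status example with explicit levels; the names better_than_poor, best and counts are illustrative only.

```r
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))

# Comparison operators follow the level order for ordered factors
better_than_poor <- status > "Poor"
print(better_than_poor)    # FALSE TRUE TRUE FALSE

# max() is defined for ordered factors and returns the highest level present
best <- max(status)
print(as.character(best))  # "Excellent"

# table() reports counts in level order rather than alphabetical order
counts <- table(status)
print(counts)
```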

Data Frame
A data frame is a list of equal-length vectors. A data frame can be thought of as
a rectangular structure where each column is a variable and each row is an
observation.

Example 1:
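# stringsAsFactors = TRUE converts the character column z to a factor
# (this was the default before R 4.0; newer versions keep z as character)
data <- data.frame(x = 1:3, y = 4:6, z = c("one", "two", "three"),
                   stringsAsFactors = TRUE)
str(data)
## 'data.frame': 3 obs. of 3 variables:
## $ x: int 1 2 3
## $ y: int 4 5 6
## $ z: Factor w/ 3 levels "one","three",..: 1 3 2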
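
The description of a data frame as a list of equal-length vectors can be checked directly. This is a minimal sketch on the same small data frame; lens is an illustrative name.

```r
data <- data.frame(x = 1:3, y = 4:6, z = c("one", "two", "three"))

# A data frame is a list whose elements are the columns
print(is.list(data))   # TRUE
print(length(data))    # 3, the number of columns

# Every column has the same length, which is the number of rows
lens <- sapply(data, length)
print(lens)
print(dim(data))       # rows, then columns

# List-style access returns a single column as a vector
print(data[["y"]])
```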

Example 2:

Consider the following data from the ESPN cricinfo live website:

Name        Matches  Innings  Highestscore  Average
Tendulkar       200      329           248    53.78
Ponting         168      287           257    51.85
Kallis          166      280           224    55.37
Dravid          164      286           270    52.31
Cook            161      291           294    45.35
The above data frame for batsmen with most runs can be created as follows:
# stringsAsFactors = TRUE stores name as a factor (the default before R 4.0)
> match_stat <- data.frame(
      name = c("Tendulkar", "Ponting", "kallis", "Dravid", "cook"),
      matches = c(200, 168, 166, 164, 161),
      innings = c(329, 287, 280, 286, 291),
      highestscore = c(248, 257, 224, 270, 294),
      avg = c(53.78, 51.85, 55.37, 52.31, 45.35),
      stringsAsFactors = TRUE)
> match_stat
       name matches innings highestscore   avg
1 Tendulkar     200     329          248 53.78
2   Ponting     168     287          257 51.85
3    kallis     166     280          224 55.37
4    Dravid     164     286          270 52.31
5      cook     161     291          294 45.35

OR

> match_stat <- data.frame(name = character(0), matches = numeric(0),
      innings = numeric(0), highestscore = numeric(0), avg = numeric(0))
creates an empty data frame with the given variable names and modes, and then the command
> match_stat <- edit(match_stat)
invokes a text editor that allows you to enter your data manually.
Invoking match_stat <- edit(match_stat) again allows you to edit the data you’ve entered and
to add new data.
Note:
The result of the editing is assigned back to the object (match_stat) itself. The edit()
function operates on a copy of the object. If you don’t assign it a destination, all the edits
will be lost.

Getting structure of data frame:


The structure of the data frame can be obtained with the function str() as follows:
> str(match_stat)
'data.frame': 5 obs. of 5 variables:
 $ name        : Factor w/ 5 levels "cook","Dravid",..: 5 4 3 2 1
 $ matches     : num 200 168 166 164 161
 $ innings     : num 329 287 280 286 291
 $ highestscore: num 248 257 224 270 294
 $ avg         : num 53.8 51.9 55.4 52.3 45.4
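
The individual pieces that str() reports can also be queried one at a time. This is a sketch that rebuilds the data frame so it runs on its own; stringsAsFactors = TRUE is set explicitly as an assumption so that name is a factor, and dims, col_names and col_classes are illustrative names.

```r
match_stat <- data.frame(
  name = c("Tendulkar", "Ponting", "kallis", "Dravid", "cook"),
  matches = c(200, 168, 166, 164, 161),
  innings = c(329, 287, 280, 286, 291),
  highestscore = c(248, 257, 224, 270, 294),
  avg = c(53.78, 51.85, 55.37, 52.31, 45.35),
  stringsAsFactors = TRUE)  # assumption: keep name as a factor

dims <- dim(match_stat)                  # number of rows, number of columns
col_names <- names(match_stat)           # variable names
col_classes <- sapply(match_stat, class) # class of each column

print(dims)
print(col_names)
print(col_classes)
```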

Getting summary of data in data frame:


The summary of the data in a data frame can be obtained with the function summary().
> summary(match_stat)
        name      matches         innings       highestscore        avg
 cook     :1   Min.   :161.0   Min.   :280.0   Min.   :224.0   Min.   :45.35
 Dravid   :1   1st Qu.:164.0   1st Qu.:286.0   1st Qu.:248.0   1st Qu.:51.85
 kallis   :1   Median :166.0   Median :287.0   Median :257.0   Median :52.31
 Ponting  :1   Mean   :171.8   Mean   :294.6   Mean   :258.6   Mean   :51.73
 Tendulkar:1   3rd Qu.:168.0   3rd Qu.:291.0   3rd Qu.:270.0   3rd Qu.:53.78
               Max.   :200.0   Max.   :329.0   Max.   :294.0   Max.   :55.37
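
The figures in the summary can be reproduced for individual columns. The following sketch recomputes the means and one median; num_cols and means are illustrative names, and only the numeric columns are rebuilt here.

```r
match_stat <- data.frame(
  matches = c(200, 168, 166, 164, 161),
  innings = c(329, 287, 280, 286, 291),
  highestscore = c(248, 257, 224, 270, 294),
  avg = c(53.78, 51.85, 55.37, 52.31, 45.35))

# Column means for all numeric columns at once
num_cols <- c("matches", "innings", "highestscore", "avg")
means <- colMeans(match_stat[, num_cols])
print(round(means, 2))

# summary() also works on a single column
print(summary(match_stat$matches))
print(median(match_stat$matches))
```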
If a data frame has too many rows and columns, we can display the first or last few entries
with the functions head() and tail() as follows:
> head(match_stat, n=2)
       name matches innings highestscore   avg
1 Tendulkar     200     329          248 53.78
2   Ponting     168     287          257 51.85

> tail(match_stat, n=3)
    name matches innings highestscore   avg
3 kallis     166     280          224 55.37
4 Dravid     164     286          270 52.31
5   cook     161     291          294 45.35
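
head() and tail() also accept a negative n, which drops rows from the opposite end instead of counting rows to keep. A short sketch, using a reduced two-column version of the data frame (all_but_last2 and all_but_first3 are illustrative names):

```r
match_stat <- data.frame(
  name = c("Tendulkar", "Ponting", "kallis", "Dravid", "cook"),
  matches = c(200, 168, 166, 164, 161))

# Negative n in head(): everything except the last 2 rows
all_but_last2 <- head(match_stat, n = -2)
print(all_but_last2)

# Negative n in tail(): everything except the first 3 rows
all_but_first3 <- tail(match_stat, n = -3)
print(all_but_first3)
```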

Adding New columns:


A new column can be added to the data frame as shown below. Suppose we want to add the number of
50s and 100s for every player in the data frame. We use the ‘$’ operator to introduce a new column:

> match_stat$half_cent <- c(68, 62, 58, 63, 57)
> match_stat$cent <- c(51, 41, 45, 36, 33)
> match_stat
       name matches innings highestscore   avg half_cent cent
1 Tendulkar     200     329          248 53.78        68   51
2   Ponting     168     287          257 51.85        62   41
3    kallis     166     280          224 55.37        58   45
4    Dravid     164     286          270 52.31        63   36
5      cook     161     291          294 45.35        57   33
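
A new column does not have to be typed in by hand; it can also be computed from existing columns. The sketch below is only an illustration: the column names inn_per_match and fifty_plus are hypothetical and not part of the lab data.

```r
match_stat <- data.frame(
  name = c("Tendulkar", "Ponting", "kallis", "Dravid", "cook"),
  matches = c(200, 168, 166, 164, 161),
  innings = c(329, 287, 280, 286, 291),
  half_cent = c(68, 62, 58, 63, 57),
  cent = c(51, 41, 45, 36, 33))

# Derived column with the $ operator: innings played per match
match_stat$inn_per_match <- match_stat$innings / match_stat$matches

# transform() returns a copy with the extra column; column names can be
# used directly inside the call
match_stat <- transform(match_stat, fifty_plus = half_cent + cent)

print(match_stat)
```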

Adding New rows:


New rows can be added to the existing data frame. Suppose we want to add two more players,
Sangakkara and Lara. This can be done with the rbind() function.
> new_match_stat <- data.frame(name = c("sangakkara", "lara"),
      matches = c(134, 131), innings = c(233, 232),
      highestscore = c(319, 400), avg = c(57.4, 52.8),
      half_cent = c(52, 48), cent = c(38, 34))
> match_stat <- rbind(match_stat, new_match_stat)

> match_stat
        name matches innings highestscore   avg half_cent cent
1  Tendulkar     200     329          248 53.78        68   51
2    Ponting     168     287          257 51.85        62   41
3     kallis     166     280          224 55.37        58   45
4     Dravid     164     286          270 52.31        63   36
5       cook     161     291          294 45.35        57   33
6 sangakkara     134     233          319 57.40        52   38
7       lara     131     232          400 52.80        48   34
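
rbind() on data frames matches columns by name, so the new rows must use the same column names as the existing data frame. The sketch below illustrates this on a reduced two-column example; base, reordered, bad and result are illustrative names, and stringsAsFactors = FALSE is set explicitly so the example behaves the same across R versions.

```r
base <- data.frame(name = c("Tendulkar", "Ponting"),
                   matches = c(200, 168),
                   stringsAsFactors = FALSE)

# Same column names in a different order: rbind() lines them up by name
reordered <- data.frame(matches = 131, name = "lara",
                        stringsAsFactors = FALSE)
combined <- rbind(base, reordered)
print(combined)

# A mismatched column name ("player" instead of "name") is an error;
# tryCatch() captures the message instead of stopping the script
bad <- data.frame(player = "sangakkara", matches = 134,
                  stringsAsFactors = FALSE)
result <- tryCatch(rbind(base, bad),
                   error = function(e) conditionMessage(e))
print(result)
```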

Accessing data from data frame:


The data from the data frame can be accessed as follows:
1. Accessing by position (indices)
> match_stat[4, ]                # accessing 4th row
> match_stat[, 4]                # accessing 4th column
> match_stat[c(4, 5), c(1, 2)]   # accessing 4th, 5th rows and 1st, 2nd columns
2. Accessing by column name:
> match_stat$name                # access the "name" column

3. Accessing by condition
> match_stat[match_stat$name == "Tendulkar", ]   # access the row corresponding to player Tendulkar

> match_stat[match_stat$name == "Tendulkar", 4]   # find the highest score of Tendulkar

> which(match_stat$highestscore >= 270)   # find the row numbers of the data for which
                                          # highestscore is greater than or equal to 270

> match_stat[which(match_stat$highestscore == max(match_stat$highestscore)), c(1, 5)]
# display the name and the average of the player with the maximum highestscore

> match_stat[match_stat$name == "Tendulkar", 2] <- 201   # modify Tendulkar’s number of matches to 201

4. Subset command:
> subset(match_stat, matches > 165, select = c(name, matches))
# select names and matches of players with matches > 165

> subset(match_stat, highestscore > 250, select = name)
# select names of players with highest score > 250
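
Conditions can be combined with & (and) or | (or), and the same selection can be written with either bracket indexing or subset(). The sketch below compares the two on the original five players; the thresholds and the names by_index, by_subset, one_col and as_vector are chosen only for illustration.

```r
match_stat <- data.frame(
  name = c("Tendulkar", "Ponting", "kallis", "Dravid", "cook"),
  matches = c(200, 168, 166, 164, 161),
  innings = c(329, 287, 280, 286, 291),
  highestscore = c(248, 257, 224, 270, 294),
  avg = c(53.78, 51.85, 55.37, 52.31, 45.35))

# Bracket indexing: the data frame name must be repeated in each condition
by_index <- match_stat[match_stat$matches > 160 & match_stat$avg > 52,
                       c("name", "avg")]

# subset(): column names can be used directly
by_subset <- subset(match_stat, matches > 160 & avg > 52,
                    select = c(name, avg))

print(by_index)
print(by_subset)

# Selecting one column with [ , ] returns a plain vector by default;
# drop = FALSE keeps the result as a data frame
one_col <- match_stat[, "highestscore", drop = FALSE]
as_vector <- match_stat[, "highestscore"]
print(is.data.frame(one_col))
print(is.data.frame(as_vector))
```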
