DS 3

This document provides an overview of fundamental concepts in data science, focusing on vectors, matrices, arrays, and factors in R programming. It explains how to create and manipulate these data structures, including functions for accessing, sorting, and subsetting data. Additionally, it covers the importance of categorical variables and how to create and summarize factors.

Fundamentals of Data Science

Unit 3

Prepared by
Dr. P. Sasikumar
Associate Professor, AIML Dept.

Vectors
A vector is an ordered collection of items that are all of the same type.

• To combine items into a vector, use the c() function and separate the items with commas.

• Vectors play the role that one-dimensional arrays play in other languages: they hold multiple data values of the same type.

• A key point is that in the R programming language vector indexing starts from 1, not from 0.

Six types of atomic vectors in R
Only the first four (double, integer, logical, and character) are discussed and used in this book.

No.  Type                 Example               Comment
1    double (or numeric)  -0.5, 120.9, 5.0      Floating point numbers with double precision
2    integer              -1L, 121L, 5L         “Long” integers
3    logical              TRUE, FALSE           Boolean
4    character            "R", "5" or 'R', '5'  Text
5    complex              -5+11i, 3+2i, 0+4i    Real + imaginary numbers
6    raw                  01, ff                Raw bytes (as hexadecimal)


Important vector functions
• In programming, functions are used to perform a specific task, e.g., manipulate an object, calculate a derived quantity, or investigate existing objects.

A few of the most important ones for creating and investigating simple vectors are:

• c(): Combines multiple elements into one atomic vector.

• length(): Returns the length (number of elements) of an object.

• class(): Returns the class of an object.

• typeof(): Returns the type of an object.

• attributes(): Returns further metadata of arbitrary type.
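The functions listed above can be tried on a small vector. This is a minimal sketch; the vector x and its element names are made up for illustration.

```r
# A small double vector (hypothetical example data)
x <- c(1.5, 2, 3.25)

length(x)      # number of elements
class(x)       # class of the object
typeof(x)      # internal storage type
attributes(x)  # NULL: a plain vector carries no metadata

# Naming the elements adds a "names" attribute
names(x) <- c("a", "b", "c")
attributes(x)
```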

Create a vector variable called fruits that combines strings:

# Vector of characters/strings
fruits <- c("banana", "apple", "orange")

# Print fruits
fruits

Output:
[1] "banana" "apple"  "orange"

Create a vector that combines numerical values:

# Vector of numerical values
numbers <- c(1, 2, 3)

# Print numbers
numbers

Output:
[1] 1 2 3

Numerical values
• To create a vector with numerical values in a sequence, use the : operator:

# Vector with numerical values in a sequence
numbers <- 1:10

# Print numbers
numbers

Output:
[1]  1  2  3  4  5  6  7  8  9 10

Values in a sequence
• You can also create a sequence of decimal values with the : operator. The sequence advances in steps of 1 from the first value, so if the last element does not belong to the sequence, it is not used:
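A short sketch of both cases; the variable names numbers1 and numbers2 are only illustrative.

```r
# The end value 6.5 lies on the sequence, so it is included
numbers1 <- 1.5:6.5
numbers1

# The end value 6.3 does not lie on the sequence, so the result stops at 5.5
numbers2 <- 1.5:6.3
numbers2
```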

Create a vector of logical values
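A minimal sketch, with the variable name log_values chosen for illustration:

```r
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)

# Print log_values
log_values
```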

Vector Length
To find out how many items a vector has, use the length() function:
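For example, reusing the fruits vector defined earlier in this unit:

```r
fruits <- c("banana", "apple", "orange")

# Number of items in the vector
n_fruits <- length(fruits)
n_fruits
```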

Sort a Vector
• To sort items in a vector alphabetically or numerically, use the sort() function:
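A sketch of both kinds of sorting; the example values are made up.

```r
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)

sorted_fruits <- sort(fruits)    # alphabetical order
sorted_numbers <- sort(numbers)  # increasing numerical order

sorted_fruits
sorted_numbers

# decreasing = TRUE reverses the order
sort(numbers, decreasing = TRUE)
```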

Access Vectors
• You can access the vector items by referring to their index numbers inside brackets [].

• The first item has index 1, the second item has index 2, and so on.

• You can also access multiple elements by referring to different index positions with the c() function:
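A minimal sketch of single and multiple access, using the fruits vector from the earlier slides:

```r
fruits <- c("banana", "apple", "orange")

# Access the first item
first_fruit <- fruits[1]
first_fruit

# Access the first and third items
some_fruits <- fruits[c(1, 3)]
some_fruits
```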

Change an Item

• To change the value of a specific item, refer to the index number:
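For instance, replacing the first item of the fruits vector (the new value "pear" is only an example):

```r
fruits <- c("banana", "apple", "orange")

# Change "banana" to "pear"
fruits[1] <- "pear"

# Print fruits
fruits
```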

Repeat Vectors
• To repeat vectors, use the rep() function:

• Repeat the sequence of the vector:

• Repeat each value independently:
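The two styles of repetition differ only in the argument passed to rep(). A short sketch with made-up variable names:

```r
# Repeat the whole sequence three times: 1 2 3 1 2 3 1 2 3
repeat_times <- rep(c(1, 2, 3), times = 3)
repeat_times

# Repeat each value three times: 1 1 1 2 2 2 3 3 3
repeat_each <- rep(c(1, 2, 3), each = 3)
repeat_each
```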

Matrices
• A matrix is a rectangular arrangement of numbers in rows and columns.

• Rows run horizontally and columns run vertically.

• In R programming, matrices are two-dimensional, homogeneous data structures: every element has the same type.

• These are some examples of matrices:

Creating a Matrix in R
• To create a matrix in R, use the matrix() function.

• The first argument to matrix() is the vector of elements to place in the matrix.

• You also pass the number of rows and the number of columns you want the matrix to have.

Syntax to Create R-Matrix

• matrix(data, nrow, ncol, byrow, dimnames)

Parameters:

• data – values you want to enter

• nrow – number of rows

• ncol – number of columns

• byrow – logical value; if TRUE, the values are assigned row by row (by default they are assigned column by column)

• dimnames – a list holding the names of the rows and the columns
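The parameters above can be sketched as follows. The row and column names are hypothetical labels chosen for the example.

```r
# Fill a 2 x 3 matrix row by row and name its rows and columns
m_by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
                   dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m_by_row

# With the default byrow = FALSE the same values are filled column by column
m_by_col <- matrix(1:6, nrow = 2, ncol = 3)
m_by_col
```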

Creating Special Matrices in R
• R can create several special types of matrices from the arguments passed to the matrix() function and to related functions such as diag().

1. Matrix where all rows and columns are filled by a single constant ‘k’:

• To create such an R matrix, the syntax is given below:

• Syntax: matrix(k, m, n)

Parameters:

• k: the constant

• m: number of rows

• n: number of columns
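A minimal sketch with k = 5, m = 3 and n = 4 (values chosen only for illustration):

```r
# 3 x 4 matrix in which every element is the constant 5
k_matrix <- matrix(5, 3, 4)
k_matrix
```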

2. Diagonal matrix:
• A diagonal matrix is a matrix in which the entries outside the main diagonal are all zero.

• To create such an R matrix, the syntax is given below:

• Syntax: diag(k, m, n)

• Parameters:

• k: the constant, or a vector of values, to place on the diagonal

• m: number of rows

• n: number of columns
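As a sketch, a 3 x 3 diagonal matrix built from an example vector of diagonal values:

```r
# Diagonal entries 5, 3, 3; every other entry is 0
d_matrix <- diag(c(5, 3, 3), 3, 3)
d_matrix
```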

3. Identity matrix:
• An identity matrix is a square matrix in which all the elements of the principal diagonal are ones and all other elements are zeros.

• To create such an R matrix, the syntax is given below:

• Syntax: diag(k, m, n)

• Parameters:

• k: 1

• m: number of rows

• n: number of columns (equal to m, since an identity matrix is square)
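A minimal sketch for a 3 x 3 identity matrix. The shorthand diag(3) is standard R and gives the same values.

```r
# Identity matrix with k = 1, m = 3, n = 3
i_matrix <- diag(1, 3, 3)
i_matrix

# Shorthand: a single integer argument gives an identity matrix of that size
diag(3)
```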

Accessing Elements of an R Matrix
• We can access elements of an R matrix using the same convention that is followed for data frames: write the name of the matrix followed by square brackets with a comma inside, in the form matrix[row, column].

• The value before the comma is used to access rows and the value after the comma is used to access columns.

• Let’s illustrate this by taking a simple R code.
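One way to illustrate the convention is the sketch below; the matrix m is example data, filled column by column.

```r
m <- matrix(1:6, nrow = 2)   # columns are (1, 2), (3, 4), (5, 6)

element <- m[2, 3]   # row 2, column 3
first_row <- m[1, ]  # all of row 1
second_col <- m[, 2] # all of column 2

element
first_row
second_col
```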

Subsetting in R Programming
• To subset specific elements from an R matrix, use bracket notation []. With this notation you can subset a single element, multiple elements, a range of rows or columns, or elements selected by a vector of indices.

• Let’s create a matrix from vectors by specifying the number of columns and number of rows.

Subset a specific element
• Given a matrix, pass the row index and the column index (the location of the element you want) into bracket notation [ ].

• This returns the single element at that position.
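A minimal sketch, assuming a 3 x 3 example matrix filled row by row:

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Element in row 2, column 3
one_element <- m[2, 3]
one_element
```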

Subset an R Matrix by a Specific Row
• To subset a matrix by a specific row, you can use bracket notation ([]).

• To do this, specify the row index on the left side of the notation (before the comma), and the matrix will be subset by the row at that index.
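A short sketch on an example matrix. The drop = FALSE variant is an addition that keeps the result as a one-row matrix instead of a vector.

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Second row, returned as a vector
row2 <- m[2, ]
row2

# Second row, kept as a 1 x 3 matrix
row2_matrix <- m[2, , drop = FALSE]
row2_matrix
```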

Subset a Matrix by a Specific Column
• Alternatively, you can subset a matrix by a specific column using bracket notation ([]).

• This time, specify the column index on the right side of the notation (after the comma), and the matrix will be subset by the column at that index.
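The same example matrix can be used to sketch column subsetting:

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Third column
col3 <- m[, 3]
col3

# First and third columns together
cols13 <- m[, c(1, 3)]
cols13
```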

Subset an R Matrix by Logical Condition
• Subsetting a matrix by rows based on logical conditions is possible.

• For instance, you can select rows that meet a specific condition and those rows will be included in
the result.
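As a sketch, the condition below keeps the rows whose first column is greater than 1. The matrix and the threshold are example choices.

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Logical vector with one entry per row
keep <- m[, 1] > 1
keep

# Rows for which the condition is TRUE
selected_rows <- m[keep, ]
selected_rows
```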

Subset an R Matrix Using the subset() Function
• So far, we have implemented subsetting of matrices using bracket notation ([]).

• The subset() function provides a concise way to filter data frames or matrices based on conditions.

• Here it is used to subset a matrix by rows based on a condition.
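A minimal sketch with an example matrix. For a matrix, the condition passed to subset() must be a logical vector, so the column is referenced through the matrix itself.

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# Keep the rows whose first column is greater than 1
sub_rows <- subset(m, m[, 1] > 1)
sub_rows
```

The result always stays a matrix, even when only one row matches.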

Subset a Matrix by Name
• To subset a matrix by name in R, you can use the row and column names along with square
brackets.

• Let’s create a matrix with customized column names and row names and use these names to subset
a matrix by names.
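A sketch of subsetting by name; the row names r1 to r3 and column names c1 to c3 are hypothetical labels.

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))

# Single element by row name and column name
by_name <- m["r2", "c3"]
by_name

# Two named columns for all rows
named_cols <- m[, c("c1", "c3")]
named_cols
```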

R – Array
• Arrays are essential data storage structures defined by a fixed number of dimensions. Arrays are used for the allocation of space at contiguous memory locations.
• In the R programming language, uni-dimensional arrays are called vectors, with the length being their only dimension. Two-dimensional arrays are called matrices, consisting of fixed numbers of rows and columns. All elements of an R array have the same data type. Vectors are supplied as input to the array() function, which creates an array based on the number of dimensions.

Creating an Array
• An R array can be created with the array() function. A vector of elements is passed to array() along with the required dimensions.
• Syntax: array(data, dim = c(nrow, ncol, nmat), dimnames = names)
• where
• nrow: number of rows
• ncol: number of columns
• nmat: number of matrices of dimensions nrow * ncol
• dimnames: default value is NULL. Otherwise, a list has to be specified that has one component for each dimension. Each component is either NULL or a vector of length equal to the dim value of the corresponding dimension.

Uni-Dimensional Array
• A vector is a uni-dimensional array, which is specified by a single dimension, its length.

• A vector can be created using the c() function. The values are passed to c() to create the vector.

Multi-Dimensional Array
• A two-dimensional matrix is an array specified by a fixed number of rows and columns, with all elements of the same data type.

• A matrix can be created with the array() function by passing it the values and the dimensions.
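The sketch below builds a two-dimensional and a three-dimensional array with array(). The data and the dimension names are example values.

```r
# Two dimensions: the result is a matrix with 2 rows and 3 columns
arr2 <- array(1:6, dim = c(2, 3))
arr2

# Three dimensions: 4 matrices, each with 2 rows and 3 columns
arr3 <- array(1:24, dim = c(2, 3, 4),
              dimnames = list(c("r1", "r2"),
                              c("c1", "c2", "c3"),
                              c("m1", "m2", "m3", "m4")))

# Element in row 1, column 2 of the third matrix
arr3[1, 2, 3]
```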

Class in R
• A class is the blueprint that helps to create an object; it contains the object's member variables along with its attributes. As discussed in the previous section, R has two main class systems, S3 and S4.

• S3 Class

– The S3 class is somewhat primitive in nature. It lacks a formal definition, and an object of this class can be created simply by adding a class attribute to it.

– This simplicity accounts for the fact that it is widely used in the R programming language. In fact, most of R's built-in classes are of this type.

• S3 is the simplest yet the most popular OOP system in R.
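A minimal sketch of an S3 object. The class name "student", its fields and the print method are hypothetical and serve only as an illustration.

```r
# An ordinary list becomes an S3 object once a class attribute is set
student <- list(name = "Asha", roll_no = 21)
class(student) <- "student"

# A method for the generic print(), selected by the class attribute
print.student <- function(x, ...) {
  cat("Student:", x$name, "with roll number", x$roll_no, "\n")
  invisible(x)
}

print(student)
class(student)
```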

Introduction to Factors
• A factor is a statistical data type used to store categorical variables.

• The difference between a categorical variable and a continuous variable is that a categorical variable can belong to only a limited number of categories.

• A continuous variable, on the other hand, can correspond to an infinite number of values.

• It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently.

• A good example of a categorical variable is Gender.

• In many circumstances you can limit the Gender categories to “Male” or “Female”.

• In this example, all the possible cases are known beforehand and are predefined.

• These distinct values are known as levels.

• When a factor is created, its levels are by default sorted alphabetically.

Creating a Factor in R
• The command used to create or modify a factor in R is factor(), with a vector as input.

The two steps to creating an R factor:

1. Creating a vector

2. Converting the vector into a factor using the function factor()
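The two steps can be sketched as follows; the vector contents and variable names are example data.

```r
# Step 1: create a character vector
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

# Step 2: convert the vector into a factor
factor_sex_vector <- factor(sex_vector)

factor_sex_vector
levels(factor_sex_vector)   # levels are sorted alphabetically
```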


Factor levels
When you first get a dataset, you will often notice that it contains factors with specific factor levels.
However, sometimes you will want to change the names of these levels for clarity or other reasons. R
allows you to do this with the function levels():

levels(factor_vector) <- c("name1", "name2",...)

• A good illustration is the raw data that is provided to you by a survey. A common question for every
questionnaire is the Gender of the respondent. Here, for simplicity, just two categories were
recorded, “M” and “F”.

survey_vector <- c("M", "F", "F", "M", "M")

• Recording the Gender with the abbreviations “M” and “F” can be convenient if you are collecting
data with pen and paper, but it can introduce confusion when analyzing the data. At that point, you
will often want to change the factor levels to “Male” and “Female” instead of “M” and “F” for
clarity.

• Watch out: the order in which you assign the levels is important. If you type levels(factor_survey_vector), you’ll see that it outputs [1] "F" "M". If you don’t specify the levels of the factor when creating it, R will automatically assign them alphabetically.

• To correctly map "F" to "Female" and "M" to "Male", the levels should be set to c("Female", "Male"), in this order.
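The renaming described above can be sketched like this, assuming factor_survey_vector is created from survey_vector with factor():

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)

# Levels are assigned alphabetically: "F" comes before "M"
levels(factor_survey_vector)

# Rename the levels in the same order: "F" -> "Female", "M" -> "Male"
levels(factor_survey_vector) <- c("Female", "Male")
factor_survey_vector
```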
Summarizing a factor
• One of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable:

summary(my_var)

• Going back to our survey, you would like to know how many “Male” responses you have in your study, and how many “Female” responses. The summary() function gives you the answer to this question.

• Because you identified “Male” and “Female” as factor levels in factor_survey_vector, R can show the number of elements for each category.
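A sketch comparing the summary of the character vector with the summary of the factor, assuming the survey objects are built as in the factor levels slide:

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# For a character vector, summary() reports only length, class and mode
summary(survey_vector)

# For a factor, summary() counts the elements in each level
counts <- summary(factor_survey_vector)
counts
```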

Ordered factors
• If you compare two elements of factor_survey_vector with the > operator, R returns NA with a warning message telling you that the operator is not meaningful for factors, because “Male” and “Female” are unordered (or nominal) factor levels.

• R attaches an equal value to the levels of such factors.

• But this is not always the case! Sometimes you will also deal with factors that do have a natural ordering between their categories.

• If this is the case, we have to make sure that we pass this information to R.

• Let us say that you are leading a research team of five data analysts and that you want to evaluate their performance.

• To do this, you track their speed, evaluate each analyst as “slow”, “medium” or “fast”, and save the results in speed_vector.

Instructions
• As a first step, assign speed_vector a vector with 5 entries, one for each analyst.

• Each entry should be either “slow”, “medium”, or “fast”. Use the list below:

• Analyst 1 is medium,

• Analyst 2 is slow,

• Analyst 3 is slow,

• Analyst 4 is medium and

• Analyst 5 is fast.

• No need to specify these are factors yet.

# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
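The next step, turning speed_vector into an ordered factor, could look like the sketch below. It assumes the name factor_speed_vector used later in this unit and uses the standard ordered and levels arguments of factor().

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")

# Pass the natural ordering of the categories to R
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

factor_speed_vector
summary(factor_speed_vector)
```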

Comparing ordered factors
• Having a bad day at work, ‘data analyst number two’ enters your office and starts complaining that
‘data analyst number five’ is slowing down the entire project.

• Since you know that ‘data analyst number two’ has the reputation of being a smarty-pants, you first
decide to check if his statement is true.

• The fact that factor_speed_vector is now ordered enables us to compare different elements (the data
analysts in this case).

• You can simply do this by using the well-known operators.

Instructions

• Use [2] to select from factor_speed_vector the factor value for the second data analyst. Store it as da2.

• Use [5] to select the factor_speed_vector factor value for the fifth data analyst. Store it as da5.

• Check if da2 is greater than da5; simply print out the result. Remember that you can use the >
operator to check whether one element is larger than the other.
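The instructions above can be sketched as follows, assuming factor_speed_vector is the ordered factor built from speed_vector:

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# Factor value for the second data analyst
da2 <- factor_speed_vector[2]

# Factor value for the fifth data analyst
da5 <- factor_speed_vector[5]

# Is data analyst 2 faster than data analyst 5?
da2 > da5
```

Analyst 2 is “slow” and analyst 5 is “fast”, so the comparison returns FALSE.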

Data Frames

• A data frame is data displayed in the format of a table.

• Data frames can have different types of data inside them.

• While the first column can be character, the second and third can be numeric or logical.

• However, all values within a single column should have the same type.

• Data frames in R are generic data objects that are used to store tabular data.

• Data frames can also be interpreted as matrices in which each column can have a different data type.

• An R data frame is made up of three principal components: the data, the rows, and the columns.

• Use the data.frame() function to create a data frame.

R Data Frames Structure
• As you can see in the image below, this is how a data frame is structured.

• The data is presented in tabular form, which makes it easier to operate and understand.

Create Data frame in R
• To create an R data frame, use the data.frame() function and pass each of the vectors you have created as arguments to the function.
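A minimal sketch with three example vectors of different types; the names and values are made up.

```r
# Vectors of equal length, one per column
name <- c("Asha", "Ravi", "Meena")
age <- c(21, 23, 22)
passed <- c(TRUE, FALSE, TRUE)

# Combine the vectors into a data frame
students <- data.frame(name, age, passed)

students
str(students)   # shows the type of each column
```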

Create Subsets of a Data frame
• subset() function in R Programming Language is used to create subsets of a Data frame. This can
also be used to drop columns from a data frame.

Syntax: subset(df, expr)

• Parameters:

• df: Data frame used

• expr: Condition for subset

Example 1: Basic example of the subset() function

With subset(), the original data frame remains intact, while a new data frame is created that holds the selected rows from the original data frame.
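A sketch of both uses of subset() mentioned in this unit: selecting rows by a condition and dropping a column. The data frame df and its column names are example data.

```r
df <- data.frame(col1 = 0:2, col2 = 3:5, col3 = 6:8)

# Keep only the rows where col2 is greater than 3
df_rows <- subset(df, col2 > 3)
df_rows

# Drop col3 by using a negative selection
df_cols <- subset(df, select = -col3)
df_cols

# The original data frame is unchanged
df
```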

Extending Data Frames in R
• expand.grid() function in R Programming Language is used to create a data frame with all the
values that can be formed with the combinations of all the vectors or factors passed to the function
as an argument.

• Syntax: expand.grid(...)

• Parameters: ... stands for the vectors or factors to combine (Vector1, Vector2, Vector3, …)

R program to expand a data frame
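A minimal sketch of expand.grid() with two example vectors; the column names size and colour are chosen for illustration.

```r
sizes <- c("S", "M", "L")
colours <- c("red", "blue")

# One row for every combination of size and colour
combos <- expand.grid(size = sizes, colour = colours)
combos
```

With 3 sizes and 2 colours the result has 3 * 2 = 6 rows.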

How to Sort a Data Frame in R
• In R, a data frame is a two-dimensional tabular data structure that consists of rows and columns.

• Sorting a DataFrame allows us to reorder the rows based on the values in one or more columns.
This can be useful for various purposes, such as organizing data for analysis or presentation.

• Methods to sort a dataframe:

• order() function (increasing and decreasing order)

Method 1: Using order() function
• The order() function is used to sort a data frame based on a particular column.

• Syntax: order(dataframe$column_name, decreasing = TRUE)

where

• dataframe is the input data frame

• column_name is the column by which the data frame is sorted

• decreasing specifies the sorting order: if it is TRUE, the data frame is sorted in descending order; otherwise, in increasing order

• Return type: the index positions of the elements in sorted order; using them as row indices of the data frame returns the sorted data frame

Example
• R program to create a data frame with 2 columns and order it based on particular columns in decreasing order. It displays the data frame sorted by subjects in decreasing order and the data frame sorted by roll number in decreasing order.
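One possible version of such a program is sketched below. The column names rollno and subjects and the values are assumptions, since the original listing is not available.

```r
# Data frame with 2 columns
data <- data.frame(rollno = c(1, 5, 4, 2, 3),
                   subjects = c("java", "python", "php", "sql", "c"))

# order() returns row positions; indexing with them sorts the rows
sorted_by_subjects <- data[order(data$subjects, decreasing = TRUE), ]
sorted_by_subjects

sorted_by_rollno <- data[order(data$rollno, decreasing = TRUE), ]
sorted_by_rollno
```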

A List in R programming

• A list in R programming is a generic object consisting of an ordered collection of objects.

• A list is a one-dimensional, heterogeneous data structure.

• The list can be a list of vectors, a list of matrices, a list of characters, a list of functions, and so on.

• Lists are R objects that can contain elements of different types, such as numbers, strings, vectors and even another list.

• A list can also contain a matrix or a function as one of its elements.

• A list is created using the list() function.

• In R, the indexing of a list starts with 1 instead of 0.

Creating a List
• A list can contain strings, numbers, vectors and logical values at the same time.
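A minimal sketch of such a list; the values are example data.

```r
# A list holding strings, a numeric vector, a logical value and numbers
list_data <- list("Red", "Green", c(21, 32, 11), TRUE, 51.23, 119.1)

# Print the list
list_data
```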

Naming List Components
• Naming list components makes it easier to access them.

• Example:
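A sketch of naming components with names(). The component names are hypothetical.

```r
list_data <- list(c("Jan", "Feb", "Mar"), 3.14, TRUE)

# Give a name to each component
names(list_data) <- c("months", "value", "flag")

list_data
```

Names can also be given directly when the list is created, for example list(months = c("Jan", "Feb", "Mar"), value = 3.14).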

Accessing List Elements
• Elements of the list can be accessed by the index of the element in the list.

• In the case of named lists, elements can also be accessed using their names.

• All the components of a list can be named, and we can use those names to access the components of the list with the dollar ($) operator.
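The access styles can be sketched on a small named list; the names and values are example data.

```r
list_data <- list(months = c("Jan", "Feb", "Mar"), value = 3.14, flag = TRUE)

# Access by index: double brackets return the element itself
by_index <- list_data[[1]]

# Access by name with the dollar operator or with double brackets
by_dollar <- list_data$value
by_name <- list_data[["flag"]]

# Single brackets return a shorter list, not the element
as_sublist <- list_data[1]
```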

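Manipulating List Elements
• We can add, delete and update list elements.

• Elements are most simply added at the end of a list by assigning to the next index, and an element is deleted by assigning NULL to it.

• We can update any element by assigning a new value to its index or name.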
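A minimal sketch of the three operations on an example list:

```r
lst <- list("a", 2, TRUE)

# Add an element at the end of the list
lst[[4]] <- "new"

# Update the second element
lst[[2]] <- 20

# Delete the first element; the remaining elements move up by one
lst[[1]] <- NULL

length(lst)
lst
```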

Merging Lists
• You can merge many lists into one list by passing all the lists to the c() function. Placing them inside list() instead creates a nested list with one component per original list.
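The difference between the two calls can be sketched with two example lists:

```r
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")

# c() merges the elements into one flat list of length 6
merged_list <- c(list1, list2)
length(merged_list)

# list() nests the two lists: the result has length 2
nested_list <- list(list1, list2)
length(nested_list)
```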

Converting List to Vector
• A list can be converted to a vector so that the elements of the vector can be used for further
manipulation.

• All the arithmetic operations on vectors can be applied after the list is converted into vectors.

• To do this conversion, we use the unlist() function. It takes the list as input and produces a vector.
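A sketch of the conversion followed by vector arithmetic; the lists hold example values.

```r
list1 <- list(1:5)
list2 <- list(10:14)

# Convert the lists to vectors
v1 <- unlist(list1)
v2 <- unlist(list2)

# Arithmetic now works element by element
result <- v1 + v2
result
```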
