Data Analytics Unit-1 Notes
Unit-I
Introduction to R: Handling Packages in R, Getting Started with R, Working with Directory, Data
Types in R, Commands for Data Exploration.
Introduction to R:
Statistical computing and large-scale data analysis called for a new category of computer
language, beside the existing procedural and object-oriented programming languages, that would
support these tasks directly instead of requiring new software for each one. Plenty of data is available
today, and it can be analysed in different ways to provide useful insights for operations in various
industries. The lack of support, tools and techniques for varied data analysis has been addressed by
one such language, called R.
What is R?
Why R?
R has opened tremendous scope for statistical computing and data analysis. It provides
techniques for various statistical analyses like classical tests and classification, time-series analysis,
clustering, linear and non-linear modelling and graphical operations. The techniques supported by R are
highly extensible.
Another reason behind the popularity and widespread use of R is its superior support for
graphics. It can provide well-developed and high-quality plots from data analysis. The plots can contain
mathematical formulae and symbols, if necessary, and users have full control over the selection and use
of symbols in the graphics. Hence, other than robustness, user-experience and user-friendliness are two
key aspects of R.
Handling Packages In R
A package in R is the fundamental unit of shareable code. It is a collection of the following elements:
• Functions
• Data sets
• Compiled code
• Documentation for the package and for the functions inside
• Tests: a few tests that check that everything works as it should
The directory where packages are stored is called a library. R comes with a standard set of packages.
Others are available for download and installation as required. At the time of writing, over 10,000
packages are available on CRAN. This is one of the reasons behind the huge popularity and success
of R.
Packages are used to share code with others. Anyone can develop their own R package. Any R user can
then download, install and learn to use the package. Packages therefore allow for an easy, transparent and
cross-platform extension of the R base system.
In general, there is a single package library with each installation of R on a computer. Users can
change the path to that library to install a package in a location other than the default package
library. The command .libPaths() can be used to get or set the path of the package library.
Example
> .libPaths()
Output
[1] "C:/R/R-3.1.3/library"
This is the default package library location. The following command sets another path as the first
library location:
Example
> .libPaths("~/R/win-library/3.1-mran-2016-07-02")
Output
[1] "C:/Users/User1/Documents/R/win-library/3.1-mran-2016-07-02"
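The get-and-set behaviour of .libPaths() can be sketched as follows. This is a minimal illustration, not part of the original notes: the directory name extra_lib is hypothetical and is created inside R's temporary folder so that no real library is touched.

```r
# Remember the current library search path so it can be restored later.
old_paths <- .libPaths()

# Hypothetical extra library directory, created in the session's temp folder.
extra_lib <- file.path(tempdir(), "extra_lib")
dir.create(extra_lib, showWarnings = FALSE)

# Put the new directory at the front of the search path.
# .libPaths() only keeps directories that actually exist.
.libPaths(c(extra_lib, old_paths))
new_paths <- .libPaths()
print(new_paths[1])

# Restore the original search path.
.libPaths(old_paths)
```

Packages installed while extra_lib is first in the path would go there by default, since install.packages() uses the first library location unless told otherwise.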
R can be extended easily with the help of a rich set of packages. There are more than 10,000 packages
available for R, and they serve different purposes. Tables 1.2 and 1.3 list some commonly used R
packages for different purposes.
Table 1.2 Commonly used R packages for different purposes (Data Management, Data Visualisation)
Installing an R Package
R comes with some standard packages that are installed when a user first installs R; additional
packages can be installed separately. Users need to navigate through the package library and install a
package in the desired location. The following steps are used for navigating through the R package
library and installing an R package.
1. To start R, follow either Step 2 or 3. The assumption is that R is already installed on your machine.
2. If there is an "R" icon on the desktop of the computer that you are using, double-click on the "R"
icon to start R. If there is no "R" icon on the desktop, click on the "Start" button at the bottom
left of your computer screen, choose "All programs", and start R by selecting "R" (or R
X.X.X, where X.X.X gives the version of R, e.g. R 2.10.0) from the menu of programs.
4. Once you have started R, you can install an R package (e.g. the "ggplot2" package) by choosing
"Install package(s)" from the "Packages" menu at the top of the R console. This will ask you for the
website that you wish to download the package from. You can choose "Iceland" (or another
country, if you prefer). It will also bring up a list of available packages that you can install, and you
can choose the package that you want to install from that list (e.g. "ggplot2").
6. The "ggplot2" package is now installed. Whenever you want to use the "ggplot2" package after
this, after having successfully started R, you first have to load the package by typing into the R
console: library("ggplot2").
7. You can get help on a package by typing the following at the R prompt: help(package = "ggplot2")
installed.packages()
A user can check for all installed packages on the machine by using the installed.packages()
function.
The "DESCRIPTION" file holds the basic information about a package: what the package does, who
the author is, the version, the date, the type of license, the package dependencies, and so on.
To access the description file inside R, use the function packageDescription("package"). The same
information can also be reached through the documentation of the package by using
help(package = "package").
> packageDescription("stats")
Package: stats
Author: R Core Team and contributors worldwide
Maintainer: R Core Team <[email protected]>
Description: R statistical functions.
The find.package() and install.packages() commands find and install specific R packages.
install.packages() can be used in two ways: with a single package name it installs one package at a
time, and with a character vector of names it installs multiple packages with a single command. More
details on commands like find.package() and install.packages() can be retrieved using the help()
command. For example, help(installed.packages) describes the details, such as the version number,
that installed.packages() reports for each package.
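The commands above can be combined into a short sketch. The helper ensure_package() is hypothetical (it is not an R built-in) and is shown only as one possible way to install a package when it is missing; it is called here with "stats", which ships with R, so nothing is downloaded.

```r
# List installed packages; the result is a character matrix, one row per package.
pkgs <- installed.packages()
print(head(pkgs[, c("Package", "Version")], 3))

# Version of a single package, read from its DESCRIPTION file.
stats_version <- packageDescription("stats")$Version
print(stats_version)

# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  library(pkg, character.only = TRUE)
}

# "stats" is part of every R installation, so no installation happens here.
ensure_package("stats")
```

Several packages can be installed in one call by passing a character vector, for example install.packages(c("ggplot2", "dplyr")).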
Working with Directory
Before writing a program or code using R, it is important to find out the directory being used. This can
be done using the getwd() function. If the current working directory is not the preferred one, it can be
changed using the setwd() function. The dir() or the list.files() functions give information about the files
and directories in the current working directory or any other directory.
getwd() Command
The getwd() command returns the absolute file path of the current working directory. This function
has no arguments.
Example
> getwd()
Output
[1] "C:/Users/User1/Documents/R"
Note the use of '/' as the file separator on Windows. The file path does not have a trailing '/' unless it is
the root directory. The getwd() function can return NULL if the working directory is not available.
setwd() Command
The setwd() command changes the current working directory to another location of the user's choice.
Example
> setwd("C:/path/to/my_directory")
Output
(none; setwd() prints nothing when it succeeds)
dir() Function:
This function returns a character vector of the names of files or directories in the named directory.
> dir()
character(0)
> list.files()
character(0)
The output character(0) means that there are no files or directories in the current directory.
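The directory commands can be tried safely with a throwaway folder. The sketch below is an illustration under stated assumptions: the folder (prefix wd_demo) and the file notes.txt are hypothetical names, and the folder is created inside R's temporary directory.

```r
# Remember where we started so the session can be restored afterwards.
old_wd <- getwd()

# Hypothetical working folder with a unique name inside the temp directory.
demo_dir <- tempfile("wd_demo")
dir.create(demo_dir)
setwd(demo_dir)

# A freshly created directory is empty, so list.files() returns character(0).
files_before <- list.files()
print(files_before)

# Create an empty file and list the directory again.
file.create("notes.txt")
files_after <- list.files()
print(files_after)

# Go back to the original working directory.
setwd(old_wd)
```

Saving the old directory before calling setwd() keeps the rest of the session unaffected.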
Data Types in R
Everything in R is an object. R has 6 atomic vector types.
• character
• numeric (double)
• integer
• logical
• complex
• raw
✓ Values inside " " or ' ' are text (strings). They are called characters.
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
Output:
[1] "numeric"
[1] "integer"
[1] "complex"
[1] "character"
[1] "logical"
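As a supplementary sketch (not from the original notes), typeof() shows the underlying storage type, and the as.*() and is.*() families convert and test types. The variable names are arbitrary.

```r
# typeof() reports the storage type; class() reports the higher-level class.
x <- 10.5
print(typeof(x))   # "double"
print(class(x))    # "numeric"

# Without the L suffix a whole number is still stored as a double.
print(typeof(1000))    # "double"
print(typeof(1000L))   # "integer"

# raw vectors hold bytes.
r <- charToRaw("R")
print(class(r))    # "raw"

# as.*() functions convert between types; is.*() functions test them.
n <- as.integer("42")
print(is.integer(n))         # TRUE
print(as.character(3.14))    # "3.14"
print(as.logical(0))         # FALSE
```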
Operators in R
Arithmetic Operators
Assignment Operators
Logical Operators
Relational Operators
1) Arithmetic Operators:
These operators perform basic arithmetic operations like addition, subtraction, multiplication,
division, exponent, modulus, etc.
Let us see what these operators do.
For example, take two variables:
x <- 10
y <- 5
Operator    Operation        Output
x + y       Addition         15
x - y       Subtraction      5
x * y       Multiplication   50
x / y       Division         2
x ^ y       Exponent         1e+05 (that is, 10^5)
x %% y      Modulus          0
These operators can also be used to carry out mathematical operations on vectors.
In the case of vectors, all these operations are done in an element-by-element fashion.
For example:
x <- c(9,9,9)
y <- c(1,1,1)
print(x+y)
Output:
[1] 10 10 10
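A slightly larger sketch of element-by-element arithmetic, using made-up values. It also uses %/% (integer division), which is a standard R operator although it does not appear in the table above.

```r
x <- c(10, 7, 9)
y <- c(5, 2, 4)

print(x - y)    # 5 5 5
print(x * y)    # 50 14 36
print(x %% y)   # remainders: 0 1 1
print(x %/% y)  # integer quotients: 2 3 2
print(x ^ 2)    # the single value 2 is applied to every element: 100 49 81
```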
2) Assignment Operators
Operator        Description
<-, <<-, =      Leftward assignment
->, ->>         Rightward assignment
The operators <- and = can be used, almost interchangeably, to assign to variables in the
same environment.
The <<- operator is used for assigning to variables in the parent environments (more like
global assignments). The rightward assignments, although available, are rarely used.
x <- 5
x
x <- 9
x
10 -> x
x
Output
[1] 5
[1] 9
[1] 10
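The difference between <- and <<- only shows up inside functions. The following sketch uses two hypothetical functions, increment_local() and increment_global(), to illustrate it.

```r
counter <- 0

increment_local <- function() {
  counter <- counter + 1   # creates a new local variable; the outer one is unchanged
  counter
}

increment_global <- function() {
  counter <<- counter + 1  # assigns to 'counter' in the enclosing environment
  counter
}

local_result <- increment_local()
print(local_result)   # 1
print(counter)        # 0

increment_global()
print(counter)        # 1
```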
3) Logical Operators
Logical operators are used to carry out Boolean operations like AND, OR etc.
Operator Description
!     Logical NOT
&     Element-wise logical AND
&&    Logical AND
|     Element-wise logical OR
||    Logical OR
The operators & and | work element-wise and produce a result as long as the longer operand.
The operators && and || work on single values and return a single logical value. Older versions of R
silently used only the first element of longer operands; R 4.3.0 and later raise an error instead.
Zero is considered FALSE and non-zero numbers are taken as TRUE.
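A minimal sketch of the logical operators, with made-up vectors a and b:

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

print(!a)      # FALSE TRUE FALSE
print(a & b)   # TRUE FALSE FALSE
print(a | b)   # TRUE TRUE TRUE

# && and || are meant for single values, e.g. inside if() conditions.
print(a[1] && b[3])   # FALSE
print(a[2] || b[2])   # TRUE

# Numbers are coerced: zero is FALSE, anything non-zero is TRUE.
print(c(0, 2, -1) & TRUE)   # FALSE TRUE TRUE
```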
4) Relational Operators
Relational operators are used to compare values. Here is a list of relational operators
available in R.
Operator Description
<     Less than
>     Greater than
<=    Less than or equal to
>=    Greater than or equal to
==    Equal to
!=    Not equal to
Let's see an example for this:
x <- 5
y <- 16
x<y
x>y
x <= 5
y >= 20
y == 16
x != 5
Output
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
Data Objects in R:
Data objects are used to store information. In R, we do not need to declare a variable as some
data type. Variables are assigned R objects, and the data type of the R object becomes the data
type of the variable. There are mainly six kinds of data objects in R:
1. Vectors
2. Lists
3. Matrices
4. Arrays
5. Factors
6. Data Frames
1. Vector:
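As a minimal sketch (the values are made up for illustration), a vector can be created with the c() function used in the operator examples earlier:

```r
# A vector holds elements of one type; c() combines values into a vector.
marks <- c(72, 85, 90)
fruits <- c("apple", "banana", "cherry")

print(class(marks))     # "numeric"
print(length(fruits))   # 3

# Indexing starts at 1; a negative index drops that element.
print(marks[2])         # 85
print(fruits[-1])       # "banana" "cherry"

# Mixing types coerces everything to the most general type (character here).
mixed <- c(1, "a", TRUE)
print(class(mixed))     # "character"
```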
2. Lists:
A list in R can contain many different data types inside it. A list is a collection of data which
is ordered and changeable.
To create a list, use the list() function:
Example
# List of strings
thislist <- list("apple", "banana", "cherry")
# Print the list
thislist
Output:
[[1]]
[1] "apple"
[[2]]
[1] "banana"
[[3]]
[1] "cherry"
Access Lists
You can access the list items by referring to its index number, inside brackets. The first item has index
1, the second item has index 2, and so on:
Example
thislist <- list("apple", "banana", "cherry")
thislist[1]
Output:
[[1]]
[1] "apple"
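Note that single brackets return a list, which is why the output above still shows [[1]]. A short sketch of the difference between single and double brackets:

```r
thislist <- list("apple", "banana", "cherry")

# Single brackets return a list of length one.
first_as_list <- thislist[1]
print(class(first_as_list))   # "list"

# Double brackets return the element itself.
first_item <- thislist[[1]]
print(class(first_item))      # "character"
print(first_item)             # "apple"
```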
Change Item Value
To change the value of a specific item, refer to the index number:
Example
thislist <- list("apple", "banana", "cherry")
thislist[1] <- "blackcurrant"
# Print the updated list
thislist
Output:
[[1]]
[1] "blackcurrant"
[[2]]
[1] "banana"
[[3]]
[1] "cherry"
List Length
To find out how many items a list has, use the length() function:
Example
thislist <- list("apple", "banana", "cherry")
length(thislist)
Output:
[1] 3
Add List Items
To add an item to the end of the list, use the append() function:
Example
Add "orange" to the list:
thislist <- list("apple", "banana", "cherry")
append(thislist, "orange")
Output:
[[1]]
[1] "apple"
[[2]]
[1] "banana"
[[3]]
[1] "cherry"
[[4]]
[1] "orange"
Remove List Items
You can also remove list items. The following example creates a new, updated list without an "apple"
item:
Example
Remove "apple" from the list:
thislist <- list("apple", "banana", "cherry")
newlist <- thislist[-1]
# Print the new list
newlist
Output:
[[1]]
[1] "banana"
[[2]]
[1] "cherry"
Join Two Lists
There are several ways to join, or concatenate, two or more lists in R.
The most common way is to use the c() function, which combines two elements together:
Example
list1 <- list("a", "b", "c")
list2 <- list(1,2,3)
list3 <- c(list1,list2)
list3
Output:
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] "c"
[[4]]
[1] 1
[[5]]
[1] 2
[[6]]
[1] 3
3. Matrices:
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular
layout.
A matrix is created using the matrix() function.
Syntax:
matrix(data, nrow, ncol, byrow, dimnames)
where,
✓ data is the input vector which becomes the data elements of the matrix.
✓ nrow is the number of rows to be created.
✓ ncol is the number of columns to be created.
✓ byrow is a logical flag. If TRUE, the input vector elements are arranged by row.
✓ dimnames is the names assigned to the rows and columns.
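The arguments described above can be sketched as follows; the row and column names (r1, r2, c1, c2, c3) are made up for illustration.

```r
# Fill a 2 x 3 matrix row by row with the numbers 1 to 6.
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
print(m)

print(dim(m))          # 2 3
print(m["r2", "c1"])   # 4
print(m[1, ])          # first row: 1 2 3
print(t(m)[3, 2])      # element (3, 2) of the transpose: 6
```

With byrow = FALSE (the default) the same values would fill the matrix column by column instead.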
4. Arrays:
Arrays are the R data objects which can store data in more than two dimensions.
For example, if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each
with 2 rows and 3 columns.
While matrices are confined to two dimensions, arrays can have any number of dimensions.
An array is created using the array() function. It takes vectors as input and uses the values in the
dim parameter to create an array.
An array with dim = c(3, 3, 2), for instance, consists of two 3x3 matrices.
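A minimal sketch of such an array, built from two made-up input vectors (the names vector1, vector2 and result are illustrative):

```r
# Two input vectors supply 9 values in total.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)

# dim = c(3, 3, 2) needs 18 values, so the 9 values are recycled:
# both 3x3 matrices end up identical.
result <- array(c(vector1, vector2), dim = c(3, 3, 2))

print(dim(result))        # 3 3 2
print(result[1, 2, 1])    # row 1, column 2 of the first matrix: 10
print(result[, , 2])      # the whole second matrix
```

Values are filled column by column, so the first column of each matrix is 5, 9, 3.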
5. Factors:
Factors are the data objects which are used to categorize the data and store it as levels.
They can store both strings and integers.
They are useful in data analysis for statistical modeling.
Factors are created using the factor() function.
The nlevels() function gives the count of levels.
# Create a vector.
apple_colors<- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple<- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result –
Output:
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
Ex2:
>data <- c("East","West","East","North","North","East","West","West","East")
>factor_data<- factor(data)
>factor_data
Output :
[1] East West East North North East West West East
Levels: East North West
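Continuing with the same region data, the sketch below shows how the levels are stored and counted. The custom level order in ordered_data is an arbitrary choice for illustration.

```r
data <- c("East", "West", "East", "North", "North", "East", "West", "West", "East")
factor_data <- factor(data)

print(levels(factor_data))       # "East" "North" "West" (sorted alphabetically)
print(as.integer(factor_data))   # internal codes: 1 3 1 2 2 1 3 3 1
print(table(factor_data))        # counts per level: East 4, North 2, West 3

# The levels argument sets a custom order.
ordered_data <- factor(data, levels = c("North", "East", "West"))
print(levels(ordered_data))      # "North" "East" "West"
```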
6. Data Frames:
A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
Data frames are tabular data objects.
Unlike a matrix, a data frame can hold a different mode of data in each column.
For example, the first column can be numeric while the second column is character and the third
column is logical.
A data frame is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age = c(42, 38, 26)
)
print(BMI)
When we execute the above code, it produces the following result:
Output:
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
Ex2:
>std_id = c(1:5)
>std_name = c("Rick","Dan","Michelle","Ryan","Gary")
>marks = c(623.3,515.2,611.0,729.0,843.25)
>std.data <- data.frame(std_id, std_name, marks)
>std.data
Output:
  std_id std_name  marks
1      1     Rick 623.30
2      2      Dan 515.20
3      3 Michelle 611.00
4      4     Ryan 729.00
5      5     Gary 843.25
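Once a data frame exists, columns and rows can be selected from it. The sketch below reuses the student data; the pass mark of 600 and the column name passed are hypothetical, chosen only to show how a new column is added.

```r
std.data <- data.frame(
  std_id = 1:5,
  std_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  marks = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE  # keep names as character in older R versions too
)

print(dim(std.data))          # 5 rows, 3 columns
print(std.data$std_name[3])   # "Michelle"

# Rows where marks exceed 620.
high <- std.data[std.data$marks > 620, ]
print(high$std_name)          # "Rick" "Ryan" "Gary"

# Add a new column computed from an existing one (hypothetical pass mark of 600).
std.data$passed <- std.data$marks >= 600
print(std.data)
```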
Few Commands For Data Exploration
This section will use functions such as summary(), str(), head(), tail(), View(), edit(), etc., to explore a
dataset.
It is very important to explore data before starting to build a predictive model. Exploration gives an
idea of the structure of the dataset, such as the number of continuous or categorical variables and the
number of observations (rows).
Dataset
The snapshot of the dataset is pasted below. We have five variables: Q1, Q2, Q3, Q4 and Age.
The variables Q1-Q4 represent survey responses to a questionnaire.
Each response lies between 1 and 6.
The variable Age represents the age group of the respondent. It lies between 1 and 3.
1 represents Generation Z, 2 represents Generation X and Y, 3 represents Baby Boomers.
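The exploration functions can be tried on a stand-in data frame. The sketch below is an assumption-laden illustration: the data frame survey is randomly generated with the same five column names and value ranges, and it is not the original survey data.

```r
# Made-up data: 10 respondents, answers 1-6 for Q1-Q4, age group 1-3.
set.seed(1)  # makes the random sample reproducible
survey <- data.frame(
  Q1 = sample(1:6, 10, replace = TRUE),
  Q2 = sample(1:6, 10, replace = TRUE),
  Q3 = sample(1:6, 10, replace = TRUE),
  Q4 = sample(1:6, 10, replace = TRUE),
  Age = sample(1:3, 10, replace = TRUE)
)

str(survey)              # structure: column types and first few values
print(summary(survey))   # minimum, quartiles, mean and maximum per column
print(head(survey, 3))   # first three rows
print(tail(survey, 3))   # last three rows
print(dim(survey))       # number of rows and columns
print(table(survey$Age)) # respondents per age group
```

View(survey) and edit(survey) open an interactive viewer and editor, so they are left out of this non-interactive sketch.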
Sample Dataset: