Data Analytics Unit-1 Notes

Uploaded by

Vijay Kumar

DATA ANALYTICS

Unit-I
Introduction to R: Handling Packages in R, Getting Started with R, Working with Directory, Data
Types in R, Commands for Data Exploration.
Introduction to R:
Statistical computing and large-scale data analysis called for a new category of computer language,
distinct from the existing procedural and object-oriented programming languages, that supports these
tasks directly so that new software does not have to be developed for each of them. A great deal of data is
available today, and it can be analysed in different ways to provide useful insights for operations in many
industries. The lack of support, tools and techniques for varied data analysis has been addressed by the
introduction of one such language, called R.

What is R?

R is a scripting or programming language which provides an environment for statistical
computing, data science and graphics. It was inspired by, and is mostly compatible with, the statistical
language S developed at Bell Laboratories (formerly AT&T, now Lucent Technologies). Although there are
some important differences between R and S, much of the code written for S runs unaltered on R. R
has become one of the most widely used tools for computational statistics, visualisation and data
science.

Why R?

R has opened tremendous scope for statistical computing and data analysis. It provides
techniques for various statistical analyses such as classical tests and classification, time-series analysis,
clustering, linear and non-linear modelling and graphical operations. The techniques supported by R are
highly extensible.

Another reason behind the popularity and widespread use of R is its superior support for
graphics. It can provide well-developed and high-quality plots from data analysis. The plots can contain
mathematical formulae and symbols, if necessary, and users have full control over the selection and use
of symbols in the graphics. Hence, other than robustness, user-experience and user-friendliness are two
key aspects of R.
Handling Packages In R

A package in R is the fundamental unit of shareable code. It is a collection of the following elements:

• Functions
• Data sets
• Compiled code
• Documentation for the package and for the functions inside
• Tests: a few tests to check that everything works as it should.

The directory where packages are stored is called a library. R comes with a standard set of packages.
Others can be downloaded and installed as required. At the time of writing, more than 10,000
packages were available on CRAN. This is also one of the reasons behind the huge popularity and success
of R.

Packages are used to share code with others. Anyone can develop their own R package. Any R user can
then download, install and learn to use the package. Packages therefore allow for an easy, transparent and
cross-platform extension of the R base system.

In general, there is a single package library with each installation of R on a computer. Users can
change the path to that library to install a package on a different location other than the default package
library. The command .libPaths() can be used to get or set the path of the package library.

Example

> .libPaths()

Output

C:/R/R-3.1.3/library

This is the default package library location. The following command will change it into another path:

Example
> .libPaths("~/R/win-library/3.1-mran-2016-07-02")

Output

C:/Users/User1/Documents/R/win-library/3.1-mran-2016-07-02

R can be extended easily with the help of a rich set of packages, which serve many different purposes.
Tables 1.2 and 1.3 list some commonly used R packages for different purposes.

Table 1.2 Commonly used R packages for different purposes (column headings: Data Management, Data
Visualisation; table body not available)

Installing an R Package

R comes with some standard packages that are installed when a user first installs R; additional
packages can be installed separately. Users can navigate the package library and install a
package in the desired location. The following steps start R and install an R package.

1. To start R, follow Steps 2 and 3. The assumption is that R is already installed on your machine.

2. If there is an "R" icon on the desktop of the computer that you are using, double click on the "R"
icon to start R. If there is no "R" icon on the desktop then click on the "Start" button at the bottom
left of your computer screen, and then choose "All programs", and start R by selecting "R" (or R
X.X.X, where X.X.X gives the version of R, e.g. R 2.10.0) from the menu of programs.

3. The R console should show up.

4. Once you have started R, you can install an R package (e.g. the "ggplot2" package) by choosing
"Install package(s)" from the "Packages" menu at the top of the R console. This will ask you for the
website that you wish to download the package from. You can choose "Iceland" (or another
country, if you prefer). It will also bring up a list of available packages that you can install, and you
can choose the package that you want to install from that list (e.g. "ggplot2").

5. This will install the "ggplot2" package.

6. The "ggplot2" package is now installed. Whenever you want to use the "ggplot2" package after
this, after having successfully started R, you first have to load the package by typing into the R
console: library("ggplot2").

7. You can get help on a package by typing the following at the R prompt: help(package = "ggplot2")
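
The same install-then-load sequence can also be typed at the R prompt instead of using the menus. The following is a minimal sketch, not the only way to do it: the helper name is_installed() is hypothetical (it is not part of R), and the install.packages() call is left commented out because it needs an internet connection and a CRAN mirror.

```r
# Hypothetical helper: TRUE if the named package can be found in the library.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

pkg <- "ggplot2"
if (!is_installed(pkg)) {
  # install.packages(pkg)   # uncomment to download and install from CRAN
  message("Package '", pkg, "' is not installed yet.")
} else {
  library(pkg, character.only = TRUE)   # load the package for this session
}
```

Checking first avoids re-downloading a package that is already present in the library.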

Few Commands to Get Started

installed.packages()

A user can check for all installed packages on the machine by using the installed.packages()
function.

remove.packages() can be used to uninstall a package.


packageDescription()

The "DESCRIPTION" file has the basic information about a package. It has details such as what the package
does, who the author is, the version, the date, the type of license it uses,
and the package dependencies. To access the description file inside R, use the function
packageDescription("package"). The same can also be accessed via the documentation of the package
by using help(package = "package").

Let us look at the description for the "stats" package.

> packageDescription("stats")

Package: stats
Version: 3.2.3
Priority: base
Title: The R Stats Package
Author: R Core Team and contributors worldwide
Maintainer: R Core Team <[email protected]>
Description: R statistical functions.
License: Part of R 3.2.3
Suggests: MASS, Matrix, SuppDists, methods, stats4
Built: R 3.2.3; x86_64-w64-mingw32; 2015-12-10 13:03:29 UTC; windows

-- File: C:/Program Files/R/R-3.2.3/library/stats/Meta/package.rds

find.package() and install.packages() Command

find.package() returns the directory path of an installed package, and install.packages() downloads and
installs packages from a repository such as CRAN. install.packages() can be used in two ways: with a
single package name it installs one package, and with a character vector of names it installs multiple
packages at once using a single command. More details on commands like find.package() and
install.packages() can be retrieved using the help() command. For example, help(installed.packages)
shows the documentation of installed.packages(), which reports details such as the version number of
each installed package.
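
The package-query commands described above can be tried on the "stats" package, which ships with every R installation. This is a small sketch under that assumption; the package names in the commented-out install.packages() line are only illustrations and the call needs network access.

```r
# Path of an installed package.
stats_path <- find.package("stats")
print(stats_path)

# Selected fields from the DESCRIPTION file.
desc <- packageDescription("stats", fields = c("Package", "Priority"))
print(desc)

# Name and version of every installed package, as a character matrix.
pkgs <- installed.packages()[, c("Package", "Version")]
head(pkgs)

# Several packages can be installed in one call by passing a character vector:
# install.packages(c("ggplot2", "dplyr"))   # needs network access
```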

Getting Started with R

Working with Directory

Before writing a program or code using R, it is important to find out the directory being used. This can
be done using the getwd() function. If the current working directory is not as per preference, it can be
changed using the setwd() function. The dir() or the list.files() functions give information about the files
and directories in the current working directory or any other directory.

getwd() Command
getwd() command returns the absolute file path of the current working directory. This function has no
arguments.

Example

>getwd()

Output

[1] "C:/Users/User1/Documents/R"

Note the use of '/' as the file separator on Windows. The file path does not have a trailing '/' unless it is
the root directory. The getwd() function can return NULL if the working directory is not available.

setwd() Command

setwd() command changes the current working directory to another location as per the user's preference.

Example

> setwd("C:/path/to/my_directory")

Output

Nothing is printed. The working directory is changed to the user-specified directory.

dir() Function:

This is equivalent to the list.files() function.

This function returns a character vector of the names of files or directories in the named directory.

> dir()
character(0)

> list.files()
character(0)

The output character(0) means that there are no files or directories in the current directory.
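
The directory commands above can be combined into one short session. The sketch below works in a temporary scratch directory so that no real folder is touched; the file name notes.txt is a hypothetical example.

```r
old_wd <- getwd()                # remember where we started

tmp <- tempfile("demo_dir")      # path for a scratch directory
dir.create(tmp)
setwd(tmp)                       # switch to the new directory

list.files()                     # empty directory: character(0)
file.create("notes.txt")         # create an empty file
files <- list.files()            # now lists "notes.txt"
print(files)

setwd(old_wd)                    # restore the original working directory
unlink(tmp, recursive = TRUE)    # clean up the scratch directory
```

Saving the old directory and restoring it at the end keeps later relative file paths working as before.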

Data Types in R
Everything in R is an object. R has 6 atomic vector types: the five listed below and raw (which holds
raw bytes and is rarely needed in data analysis).

• character

• numeric (real or decimal)

• integer

• logical

• complex

By atomic, we mean the vector only holds data of a single type.

• character: "a", "swc"


• numeric: 2, 15.5

• integer: 2L (the L tells R to store this as an integer)

• logical: TRUE, FALSE

• complex: 1+4i (complex numbers with real and imaginary parts)

Basic types of data

✓ 4.5 is a decimal value called numeric.

✓ 4L is a whole-number value called integer (a plain 4 without the L suffix is stored as numeric). Integers are also numeric.

✓ TRUE or FALSE is a Boolean value called logical.

✓ The values inside " " or ' ' are text (strings). They are called characters.

✓ 3+4i is a complex type of data.

Example
# numeric
x <- 10.5
class(x)

# integer
x <- 1000L
class(x)

# complex
x <- 9i + 3
class(x)

# character/string
x <- "R is exciting"
class(x)

# logical/boolean
x <- TRUE
class(x)
Output:

[1] "numeric"
[1] "integer"
[1] "complex"
[1] "character"
[1] "logical"
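
The difference between numeric and integer values, and the rule that an atomic vector holds a single type, can be checked directly. This is an illustrative sketch using only base R functions; the variable names are arbitrary.

```r
class(4)          # "numeric": a plain number is stored as a double
class(4L)         # "integer": the L suffix requests integer storage
typeof(4)         # "double"
is.integer(4)     # FALSE
is.numeric(4L)    # TRUE: integers count as numeric

y <- as.integer("12")    # character to integer
z <- as.character(3.5)   # numeric to character
class(y)          # "integer"
class(z)          # "character"

# Atomic vectors hold one type, so mixed input is coerced to a common type.
v <- c(1, "a", TRUE)
class(v)          # "character"
```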
Operators in R

An operator is a symbol that tells R to perform a specific operation on its operands. R
is very rich in built-in operators.

R has the following types of operators:

• Arithmetic Operators

• Assignment Operators

• Logical Operators

• Relational Operators

1) Arithmetic Operators:
These operators perform basic arithmetic operations such as addition, subtraction, multiplication,
division, exponentiation and modulus.
Let us see what these operators do.
For example, take two variables:
x <- 10
y <- 5

Operator   Operation        Output
x + y      Addition         15
x - y      Subtraction      5
x * y      Multiplication   50
x / y      Division         2
x ^ y      Exponent         1e+05 (that is, 10^5 = 100000)
x %% y     Modulus          0
These operators can also be used to carry out mathematical operations on vectors.

For creating vectors, we use the c() function.

In the case of vectors, all these operations are done in an element-by-element fashion.

For example:

x <- c(9,9,9)

y <- c(1,1,1)

print(x+y)

Output:

[1] 10 10 10
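
A few more cases may help, including a non-zero remainder and a vector combined with a single number. This is a small sketch; note that %/% (integer division) is a standard R operator that is not listed in the table above.

```r
x <- 10
y <- 3

x %/% y     # integer division: 3
x %% y      # remainder: 1
x ^ y       # exponent: 1000

# Element-wise arithmetic on vectors of equal length.
a <- c(2, 4, 6)
b <- c(1, 2, 3)
a * b       # 2 8 18
a / b       # 2 2 2

# A shorter operand is recycled: the single value 10 is reused for each element.
a + 10      # 12 14 16
```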
2) Assignment Operators

These operators are used to assign values to variables.

Operator      Description

<-, <<-, =    Leftwards assignment

->, ->>       Rightwards assignment

The operators <- and = can be used, almost interchangeably, to assign to variables in the
same environment.
The <<- operator is used for assigning to variables in parent environments (more like
global assignments). The rightward assignments, although available, are rarely used.

x <- 5
x
x <- 9
x
10 -> x
x

Output

[1] 5
[1] 9
[1] 10
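
The difference between <- and <<- only shows up inside a function. The following sketch uses two hypothetical functions, increment_local() and increment_global(), to contrast them.

```r
counter <- 0

increment_local <- function() {
  counter <- counter + 1    # creates a new local variable; the global one is untouched
  counter
}

increment_global <- function() {
  counter <<- counter + 1   # modifies the variable in the enclosing environment
  counter
}

increment_local()    # returns 1
counter              # still 0
increment_global()   # returns 1
counter              # now 1
```

Because <<- changes state outside the function, it is usually reserved for cases where that side effect is intended.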

3) Logical Operators

Logical operators are used to carry out Boolean operations like AND, OR etc.

Operator   Description
!          Logical NOT
&          Element-wise logical AND
&&         Logical AND (single values)
|          Element-wise logical OR
||         Logical OR (single values)

The operators & and | work element by element and produce a result as long as the longer
operand.
The operators && and || work on single values and return a single logical value. Older versions of R
silently used only the first element of longer operands; since R 4.3.0, passing a vector longer than one
element to && or || is an error, so the example selects the first elements explicitly.
Zero is considered FALSE and non-zero numbers are taken as TRUE. Let's see an example for
this:

x <- c(TRUE, FALSE, 0, 6)
y <- c(FALSE, TRUE, FALSE, TRUE)
!x
x & y
x[1] && y[1]
x | y
x[1] || y[1]

Output

[1] FALSE TRUE TRUE FALSE
[1] FALSE FALSE FALSE TRUE
[1] FALSE
[1] TRUE TRUE FALSE TRUE
[1] TRUE
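
In practice, && and || are used with single values, typically inside if(), while & and | are used on whole vectors. The sketch below illustrates this; the variable names are arbitrary, and all() and any() are base R functions not covered in the table above.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

x & y            # element-wise: TRUE FALSE FALSE
x | y            # element-wise: TRUE TRUE TRUE

# && and || are meant for single values, typically inside if().
a <- 5
if (a > 0 && a < 10) {
  result <- "a is between 0 and 10"
}
result

# Short-circuiting: the right-hand side is not evaluated when the
# left-hand side already decides the answer, so the stop() is never run.
safe <- FALSE && stop("never evaluated")
safe             # FALSE

# To reduce a whole vector to one value, use all() or any().
all(x)           # FALSE
any(x)           # TRUE
```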

4) Relational Operators

Relational operators are used to compare values. Here is a list of relational operators
available in R.

Operator   Description
<          Less than
>          Greater than
<=         Less than or equal to
>=         Greater than or equal to
==         Equal to
!=         Not equal to

Let's see an example for this:

x <- 5
y <- 16
x<y
x>y
x <= 5
y >= 20
y == 16
x != 5

Output

[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
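
Relational operators also work element by element on vectors, which makes them useful for filtering. A minimal sketch, using a made-up vector of scores and an arbitrary pass mark of 50:

```r
scores <- c(45, 82, 67, 90, 38)   # hypothetical data

passed <- scores >= 50            # logical vector: FALSE TRUE TRUE TRUE FALSE
passed

scores[passed]                    # keep only passing scores: 82 67 90
sum(passed)                       # TRUE counts as 1, so this is 3
which(scores == max(scores))      # position of the highest score: 4
```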

Data Objects in R:
Data objects are used to store information. In R, we do not need to declare a variable as some
data type. Variables are assigned R objects, and the data type of the R object becomes the data
type of the variable. There are mainly six types of data objects in R:

1. Vectors

2. Lists

3. Matrices

4. Arrays

5. Factors

6. Data Frames

1. Vector:

A Vector is a sequence of data elements of the same basic type.


Example 1:
> vtr = c(1, 3, 5, 7, 9)
or
> vtr <- c(1, 3, 5, 7, 9)
> print(vtr)
o/p: [1] 1 3 5 7 9
Example 2:
Creating a sequence vector by using the colon operator.
> v = 2:12
> print(v)
o/p: [1] 2 3 4 5 6 7 8 9 10 11 12
> v = 3.5:10.5
> v
o/p: [1] 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5
Example 3:
If the final element specified does not belong to the sequence then it is discarded.
> v <- 3.8:11
> v
o/p: [1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
Example 4:
Using the seq() (sequence) function, create a vector from 1 to 9 in increments of 2.
> a = seq(1, 10, by = 2)
> a
o/p: [1] 1 3 5 7 9
Example 5:
Accessing vector elements by position (index).
> day = c("Mon","Tue","Wed","Thurs","Fri","Sat","Sun")
> print(day[3])
o/p: [1] "Wed"
> weekend = day[c(6,7)]
> weekend
o/p: [1] "Sat" "Sun"
> print(day[c(-2,-3)])   # Negative indexing drops the 2nd and 3rd elements
o/p: [1] "Mon" "Thurs" "Fri" "Sat" "Sun"

2. Lists:
A list in R can contain many different data types inside it. A list is a collection of data which
is ordered and changeable.
To create a list, use the list() function:
Example
# List of strings
thislist <- list("apple", "banana", "cherry")
# Print the list
thislist
Output:
[[1]]
[1] "apple"
[[2]]
[1] "banana"
[[3]]
[1] "cherry"

 Access Lists
You can access the list items by referring to its index number, inside brackets. The first item has index
1, the second item has index 2, and so on:
Example
thislist <- list("apple", "banana", "cherry")
thislist[1]
Output:
[[1]]
[1] "apple"
 Change Item Value
To change the value of a specific item, refer to the index number:
Example
thislist <- list("apple", "banana", "cherry")
thislist[1] <- "blackcurrant"
# Print the updated list
thislist
Output:
[[1]]
[1] "blackcurrant"
[[2]]
[1] "banana"
[[3]]
[1] "cherry"
 List Length
To find out how many items a list has, use the length() function:
Example
thislist <- list("apple", "banana", "cherry")
length(thislist)
Output:
[1] 3
 Add List Items
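To add an item to the end of the list, use the append() function. It returns a new list and leaves the
original unchanged, so assign the result to a variable if the change should be kept: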
Example
Add "orange" to the list:
thislist <- list("apple", "banana", "cherry")
append(thislist, "orange")
Output:
[[1]]
[1] "apple"
[[2]]
[1] "banana"

[[3]]
[1] "cherry"
[[4]]
[1] "orange"
 Remove List Items
You can also remove list items. The following example creates a new, updated list without an "apple"
item:
Example
Remove "apple" from the list:
thislist <- list("apple", "banana", "cherry")
newlist <- thislist[-1]
# Print the new list
newlist
Output:
[[1]]
[1] "banana"
[[2]]
[1] "cherry"
 Join Two Lists
There are several ways to join, or concatenate, two or more lists in R.
The most common way is to use the c() function, which combines its arguments into a single list:
Example
list1 <- list("a", "b", "c")
list2 <- list(1,2,3)
list3 <- c(list1,list2)
list3
Output:
[[1]]
[1] "a"
[[2]]
[1] "b"
[[3]]
[1] "c"
[[4]]
[1] 1
[[5]]
[1] 2
[[6]]
[1] 3
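
One detail worth checking is the difference between single and double brackets: single brackets return a sub-list, while double brackets return the element itself. The sketch below also shows a named list; the names and values in person are hypothetical.

```r
thislist <- list("apple", "banana", "cherry")

sub  <- thislist[1]      # single brackets: a list of length 1
item <- thislist[[1]]    # double brackets: the element itself

class(sub)               # "list"
class(item)              # "character"

# A list can mix types, and elements can be named and accessed with $.
person <- list(name = "Asha", age = 30, scores = c(80, 92))  # hypothetical values
person$name              # "Asha"
person[["scores"]][2]    # 92

# append() returns a new list; assign it back to keep the change.
thislist <- append(thislist, "orange")
length(thislist)         # 4
```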
3. Matrices:
Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular
layout.
A Matrix is created using the matrix() function.
Syntax:
matrix(data, nrow, ncol, byrow, dimnames)
where,
✓ data is the input vector which becomes the data elements of the matrix.
✓ nrow is the number of rows to be created.
✓ ncol is the number of columns to be created.
✓ byrow is a logical value. If TRUE, the input vector elements are arranged by row; if FALSE (the default), by column.
✓ dimnames is a list holding the names assigned to the rows and columns.
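
As an illustration of the arguments above, the following sketch builds a small matrix filled row by row. The row and column names (r1, r2, c1, c2, c3) are made up for the example.

```r
# 2 x 3 matrix filled row by row, with row and column names.
m <- matrix(
  data = 1:6,
  nrow = 2,
  ncol = 3,
  byrow = TRUE,
  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3"))
)
print(m)
#    c1 c2 c3
# r1  1  2  3
# r2  4  5  6

dim(m)          # 2 3
m["r2", "c3"]   # 6
m[1, ]          # first row: 1 2 3 (with column names)
t(m)            # transpose: a 3 x 2 matrix
```

With byrow = FALSE the same data would fill the first column with 1 and 2, the second with 3 and 4, and so on.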
4. Arrays:
Arrays are the R data objects which can store data in more than two dimensions.
For example:
• If we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2
rows and 3 columns.
• While matrices are confined to two dimensions, arrays can be of any number of dimensions.
• An array is created using the array() function. It takes vectors as input and uses the values in the
dim parameter to create an array.
• For example, an array with dim = c(3, 3, 2) holds 2 matrices of 3 rows and 3 columns each.
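
As an illustration of array(), the sketch below builds an array holding two 3x3 matrices from two input vectors. The vector names v1 and v2 and their values are arbitrary choices for the example.

```r
v1 <- c(5, 9, 3)
v2 <- c(10, 11, 12, 13, 14, 15)

# dim = c(3, 3, 2): 3 rows, 3 columns, 2 matrices. The 9 input values
# are recycled to fill all 18 cells, so both matrices are identical.
arr <- array(c(v1, v2), dim = c(3, 3, 2))
print(arr)

dim(arr)        # 3 3 2
arr[3, 2, 1]    # row 3, column 2 of the first matrix: 12
arr[, , 2]      # the whole second matrix
```

Like matrices, arrays are filled column by column, so v1 becomes the first column of each matrix.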
5. Factors:
• Factors are the data objects which are used to categorize the data and store it as levels.
• They can store both strings and integers.
• They are useful in data analysis for statistical modeling.
• Factors are created using the factor() function.
• The nlevels() function gives the count of levels.
# Create a vector.
apple_colors<- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple<- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result –
Output:
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
Ex2:
>data <- c("East","West","East","North","North","East","West","West","East")
>factor_data<- factor(data)
>factor_data
Output :
[1] East West East North North East West West East
Levels: East North West
6. Data Frames:
• A data frame is a table or a two-dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column.
• Data frames are tabular data objects.
• Unlike a matrix, a data frame can hold a different mode of data in each column.
• For example, the first column can be numeric while the second column can be character and the
third column can be logical.
• It is a list of vectors of equal length.
• Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26) )
print(BMI)
When we execute the above code, it produces the following result –
Output:
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
Ex2:
>std_id = c (1:5)
>std_name = c("Rick","Dan","Michelle","Ryan","Gary")
>marks = c(623.3,515.2,611.0,729.0,843.25)
>std.data<- data.frame(std_id, std_name, marks)
>std.data
Output:
  std_id std_name  marks
1      1     Rick 623.30
2      2      Dan 515.20
3      3 Michelle 611.00
4      4     Ryan 729.00
5      5     Gary 843.25
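
Once a data frame exists, its columns and rows can be selected in several ways. The sketch below rebuilds the student data frame from Ex2; the cut-off values 620 and 600 and the column name above_600 are arbitrary illustrations.

```r
std.data <- data.frame(
  std_id   = 1:5,
  std_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  marks    = c(623.3, 515.2, 611.0, 729.0, 843.25)
)

std.data$marks                      # one column as a vector
std.data[2, ]                       # second row
std.data[std.data$marks > 620, ]    # rows where marks exceed 620

# Add a derived column.
std.data$above_600 <- std.data$marks > 600
str(std.data)
```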
Few Commands For Data Exploration
This section uses functions such as summary(), str(), head(), tail(), View(), edit(), etc., to explore a
dataset.
It is very important to explore data before starting to build a predictive model. Exploration gives an
idea about the structure of the dataset, such as the number of continuous or categorical variables and the
number of observations (rows).
Dataset
• A snapshot of the dataset is given below. It has five variables: Q1, Q2, Q3, Q4 and Age.
• The variables Q1-Q4 represent survey responses to a questionnaire.
• Each response lies between 1 and 6.
• The variable Age represents the age group of the respondent. It lies between 1 and 3.
• 1 represents Generation Z, 2 represents Generation X and Y, 3 represents Baby Boomers.
Sample Dataset:

Load Internal Dataset:


There are various inbuilt datasets in R, e.g. AirPassengers, mtcars, BOD, etc. A list of
datasets is available at https://vincentarelbundock.github.io/Rdatasets/datasets.html
Let us load the mtcars dataset from the datasets package following the steps:
1. Check if the datasets package is already installed.
>installed.packages()
2. If already installed and will be used frequently, load the package.
>library(datasets)
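
After the package is loaded, the mtcars data frame can be inspected directly. A brief sketch using only base R functions:

```r
library(datasets)     # attached by default in a standard R session

data(mtcars)          # make the built-in data frame available by name
head(mtcars, n = 3)   # first three rows
dim(mtcars)           # 32 rows, 11 columns
names(mtcars)         # column names, e.g. "mpg", "cyl", "disp"
```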
Import data into R
The read.csv() function is used to import a CSV file into R.
The argument header = TRUE tells R that the first row of the file contains the column names.
mydata <- read.csv("C:/Users/Documents/Book1.csv", header=TRUE)
1. Calculate basic descriptive statistics
summary(mydata)

To calculate summary of a particular column, say third column, you can use the following syntax :
summary( mydata[3])
To calculate summary of a particular column by its name, you can use the following syntax :
summary( mydata$Q1)
2. Lists name of variables in a dataset
> names(mydata)
Output:
[1] "Q1" "Q2" "Q3" "Q4" "Age"
3. Calculate number of rows in a dataset
> nrow(mydata)
Output:
[1] 100
4. Calculate number of columns in a dataset
> ncol(mydata)
Output:
[1] 5
5. List structure of a dataset
str(mydata)
Output:
'data.frame': 100 obs. of 5 variables:
$ Q1 : int 1 5 3 1 6 2 2 1 4 1 ...
$ Q2 : int 3 3 3 1 1 4 2 2 6 1 ...
$ Q3 : int 4 2 1 4 3 6 1 4 4 4 ...
$ Q4 : int 3 5 1 1 3 4 2 2 5 1 ...
$ Age: int 3 1 1 1 1 1 3 1 2 3 ...
6. See first 6 rows of dataset
head(mydata)
Output:
Q1 Q2 Q3 Q4 Age

7. First n rows of dataset


In the code below, we are selecting first 5 rows of dataset.
head(mydata, n=5)
8. All rows but the last row
head(mydata, n= -1)
9. Last 6 rows of dataset
tail(mydata)
10. Last n rows of dataset
In the code below, we are selecting last 5 rows of dataset.
tail(mydata, n=5)
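
The exploration commands so far can be tried without the CSV file. The sketch below generates a hypothetical stand-in for the survey data (random values in the stated ranges, not the real responses) and runs the same commands on it; table() is a base R function not covered above.

```r
set.seed(1)   # make the random sample reproducible

# Hypothetical stand-in for the survey file: 100 respondents.
mydata <- data.frame(
  Q1  = sample(1:6, 100, replace = TRUE),
  Q2  = sample(1:6, 100, replace = TRUE),
  Q3  = sample(1:6, 100, replace = TRUE),
  Q4  = sample(1:6, 100, replace = TRUE),
  Age = sample(1:3, 100, replace = TRUE)
)

dim(mydata)               # 100 rows, 5 columns
str(mydata)               # type and first values of each column
summary(mydata$Q1)        # min, quartiles, mean, max of Q1
head(mydata, n = 5)       # first 5 rows
tail(mydata, n = 5)       # last 5 rows
table(mydata$Age)         # number of respondents per age group
```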
11. View() Command
View() command displays the given dataset in a spreadsheet-like data frame viewer.
Example >View(dataset name)
Output:
The output shows a tabular view of the content of the dataset.

12. edit() Command


edit() command helps with the dynamic editing or data manipulation of a dataset. When this
command is invoked, a data editor window opens with a tabular view of the dataset, in which the
required changes can be made. edit() returns the modified copy and leaves the original object
unchanged, so the result has to be assigned to a variable to keep the changes.
Syntax
> edited <- edit(dataset_name)
13. fix() Command
fix() command saves the changes in the dataset itself, so there is no need to assign the result to a variable.
Syntax
> fix(dataset_name)
14. data() Function
The data() function lists the available datasets.
Syntax
> data()
15. save.image() Function
save.image() function writes an external representation of all the R objects in the current workspace
to the specified file.
At a later point in time when it is required to read back the objects, one can use the load() or attach()
function.
Syntax:
save.image(file = ".RData", version = NULL, ascii = FALSE, safe = TRUE)
The file is to be given an extension of RData. Note: The "R" and "D" in "RData" should be in
capitals.
If ascii = TRUE, an ASCII representation of the file is saved. The default is ascii = FALSE, which
saves a binary representation of the file.
version is used to specify the workspace format version. The value of NULL specifies the
current default format.
safe is set to a logical value. A value of TRUE means that a temporary file is used to create the saved
workspace. This temporary file is renamed to file if the save succeeds.
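
A save-and-restore round trip can be sketched as follows. The object names demo_values and demo_msg are hypothetical, and the workspace is written to a temporary file rather than to ".RData" in the working directory so that no existing file is overwritten.

```r
demo_values <- 1:10                        # example objects in the workspace
demo_msg <- "hello"

ws_file <- tempfile(fileext = ".RData")    # temporary file location
save.image(file = ws_file)                 # save every workspace object

rm(demo_values, demo_msg)                  # remove them from the session
exists("demo_values")                      # FALSE

load(ws_file)                              # read the objects back
exists("demo_values")                      # TRUE
demo_msg                                   # "hello"
```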
