DATA ANALYTICS LAB MANUAL
LIST OF EXPERIMENTS
S.NO. NAME
1. Introduction
2. To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) in R.
3. To perform data import/export (CSV, XLS, TXT) operations using data frames in R.
4. To get the input matrix from user and perform Matrix Addition, Subtraction, Multiplication, Inverse Transpose and Division operations using vector concept in R.
9. To perform K-Means clustering operation and visualize it for the iris data set.
10. Write R script to diagnose any disease using KNN classification and plot the results.
11. To perform market basket analysis using Association Rules (Apriori).
One of the principal attractions of using the R environment is the ease with which users can write their own programs and custom functions. The R programming syntax is simple to learn, even for users with no prior programming experience. Once the basic R control structures are understood, users can use the R language as a powerful environment to perform complex custom analyses of almost any type of data.
Several code editors are available that provide functionalities like R syntax highlighting, auto code
indenting and utilities to send code/functions to the R console.
History of R:
R is a programming language and free software environment for statistical computing and graphics that is
supported by the R Foundation for Statistical Computing. The R language is widely used among
statisticians and data miners for developing statistical software and data analysis.
R is an implementation of the S programming language combined with lexical scoping semantics inspired
by Scheme. S was created by John Chambers in 1976, while at Bell Labs. There are some important
differences, but much of the code written for S runs unaltered.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is
currently developed by the R Development Core Team, of which Chambers is a member. R is named partly
after the first names of the first two R authors and partly as a play on the name of S. The project was
conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.
R and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages.
R is an interpreted language; users typically access it through a command-line interpreter. If a user types 2+2 at the R command prompt and presses enter, the computer replies with 4, as shown below:
>2+2
[1] 4
Features of R:
As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R –
R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the screen or printed on paper.
Getting R:
R can be downloaded from one of the CRAN (Comprehensive R Archive Network) sites http://cran.us.r-
projects.org/. Look in the download and install area. Then, download it according to your OS either Mac
OS or Windows.
What is RStudio?
If you click on the R program you just downloaded, you will find a very basic user interface. For example,
below is what I get on Windows.
Getting RStudio:
Open up RStudio. You should see the interface shown in the figure below which has three windows:
R Data Types:
Let’s now explore what R can do. R is really just a big fancy calculator. For example, type in the
following mathematical expression next to the > in the R console (left window)
1. Characters: used to represent words or letters in R. Anything with quotes will be interpreted as a character.
2. Logicals: take two values, FALSE or TRUE. They are usually constructed with comparison operators.
3. Numeric: separated into two types: integer and double.
4. Factors: are sort of like characters, but not quite. A factor is actually a numeric code with character-valued levels.
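A minimal sketch illustrating the four data types (the object names are illustrative; typeof() reports how R stores each value):
w <- "hello"        # character: anything in quotes
typeof(w)           # "character"
l <- 3 > 5          # logical: built with a comparison operator
typeof(l)           # "logical"
n <- 4.5            # numeric (double by default)
typeof(n)           # "double"
i <- 7L             # the L suffix makes an integer
typeof(i)           # "integer"
f <- factor(c("low", "high", "low"))  # factor: numeric codes with levels
levels(f)           # "high" "low"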
R Data Structures:
Vectors: are the most common and basic R data structure. A vector is simply a sequence of values, which can be of any data type but all of the same type. There are a number of ways to create a vector depending on the data type, but the most common is to insert the data you want to save in a vector into the command c(), for example c(4, 16, 9).
This code does not actually save the values 4, 16, 9; it just presents them on the screen in a vector. If you want to use these values again, you can save them in a data object. You assign data to an object using the arrow sign <-. This will create an object in R's memory that can be called back into the command window at any time.
You should see the object b pop up in the Environment tab on the top right window of your RStudio
Interface.
The type property indicates the data type that the vector is holding. Use the command typeof() to
determine the type.
The command length() determines the number of data values that the vector is storing
You can also coerce a vector of one data type to another. For example, save the value “1” and “2” (both
in quotes) into a vector named x1.
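Putting this together, a short sketch (the object names b and x1 follow the text; the commented results are what R returns):
b <- c(4, 16, 9)     # save the values in an object
b                    # [1]  4 16  9
typeof(b)            # "double"
length(b)            # 3
x1 <- c("1", "2")    # character vector
as.numeric(x1)       # coerce to numeric: [1] 1 2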
Data Frames: are even higher level data structures. They store vectors of the same length.
Create a vector called v2 storing the values 5, 12, 25. We can create a data frame using the command
data.frame() storing the vectors v1 and v2 as columns
We can store different types of vectors in a data frame. For higher level data structures like a data
frame, use the function class() to figure out what kind of object you are working with. We can’t use
length() on a data frame as it has more than one vector. Instead, it has dimensions – the number of rows
and columns. We can find column names by using the command colnames(). Moreover, we can extract
columns from data frames by referring to their names using the $ sign. We can also extract data from
data frames using brackets [ , ].
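A sketch of these operations (v1 is assumed to hold the values from the earlier vector example, since its creation is not shown above):
v1 <- c(4, 16, 9)          # assumed from the vector example above
v2 <- c(5, 12, 25)
df <- data.frame(v1, v2)   # the vectors become columns
class(df)                  # "data.frame"
dim(df)                    # 3 rows, 2 columns
colnames(df)               # "v1" "v2"
df$v2                      # extract a column by name
df[1, 2]                   # extract by [row, column]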
Functions are also known as commands. An R function is a packaged recipe that converts one or more inputs (called arguments) into a single output. You execute most of your tasks in R using functions. Every function in R will have the following basic format:
function_name(argument1, argument2, ...)
Let's use the function seq(), which makes regular sequences of numbers. You can find out what a function does and its options by calling up its help documentation by typing ? and the function name. The help documentation should pop up in the bottom right window of your RStudio interface.
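For example (the argument values here are illustrative):
?seq                             # open the help page for seq()
seq(from = 1, to = 10, by = 2)   # [1] 1 3 5 7 9
seq(0, 1, length.out = 5)        # 5 evenly spaced values from 0 to 1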
In running the few lines of code above, you work directly in the R console and issue commands in an interactive way. That is, you type a command after >, you hit enter/return, R responds, you type the next command, hit enter, R responds, and so on.
So instead of writing the command directly into the console, you should write it in a script. The process
is as follows: Type your command in the script. Run the code from the script. R responds. You get
results. You can write two commands in a script. Run both simultaneously. R responds. You get results.
This is the basic flow.
To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) in R.
PROCEDURE:
1. max() function in R language is used to find the maximum element present in an object. This object can
be a vector, a list, a data frame etc.
Syntax: max(object, na.rm)
Parameters:
object: vector, matrix, list, data frame, etc.
na.rm: Boolean value to remove NA elements.
2. min() function in R language is used to find the minimum element present in an object. This object can be
a vector, a list, a data frame etc.
Syntax: min(object, na.rm)
Parameters:
object: vector, matrix, list, data frame, etc.
na.rm: Boolean value to remove NA elements.
5. sqrt() function in R language is used to find the square root of an individual number or an expression.
Syntax: sqrt(numeric_expression)
Parameters:
numeric_expression: it can be a numeric value or a valid numerical expression for which you want to find the square root in R.
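A minimal sketch of the experiment. readline() is one way to take user input; the comma-separated prompt and the as.numeric() conversion are assumptions:
# read a comma-separated list of numbers from the user
input <- readline(prompt = "Enter numbers separated by commas: ")
x <- as.numeric(strsplit(input, ",")[[1]])
max(x, na.rm = TRUE)    # maximum
min(x, na.rm = TRUE)    # minimum
mean(x, na.rm = TRUE)   # average
sum(x, na.rm = TRUE)    # sum
sqrt(x)                 # square root of each element
round(sqrt(x), 2)       # rounded to 2 decimal places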
To perform data import/export (CSV, XLS, TXT) operations using data frames in R.
PROCEDURE:
1. Export Data-
There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you will need to load the foreign package. For Excel, you will need the xlsx package.
a. To a CSV file- Firstly, you're required to create a data frame. Then, we will export our data frame into a CSV file.
Syntax: write.csv(df, path)
Parameters:
df: dataset to save.
path: a string. Set the destination path: path + filename + extension.
b. To an XLS/XLSX file- the xlsx package provides write.xlsx() to export a data frame to an Excel file.
library(xlsx)
write.xlsx(df, "X.xlsx")
c. To a TXT file- the basic function write.table() can be used to export a data frame to a txt file.
Syntax: write.table(x, file)
Parameters:
x: a matrix or a data frame to be written.
file: a character string specifying the name of the result file.
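A small sketch tying the export functions together (the data frame df and the file names are illustrative):
df <- data.frame(id = 1:3, score = c(90, 85, 78))             # example data frame
write.csv(df, "scores.csv", row.names = FALSE)                # CSV export
write.table(df, "scores.txt", sep = "\t", row.names = FALSE)  # TXT export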
2. Import Data-
a. From SPSS-
#save SPSS dataset in transport format (run in SPSS)
get file='c:\mydata.sav'
export outfile='c:\mydata.por'
#in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
#last option converts value labels to R factors
b. From SAS-
#in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
#character variables are converted to R factors
c. From Stata-
d. From systat-
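The original code for these two is not shown; a minimal sketch assuming the foreign package, whose read.dta() and read.systat() functions read Stata and Systat files (file paths illustrative):
library(foreign)
mydata <- read.dta("c:/mydata.dta")      # from Stata
mydata <- read.systat("c:/mydata.syd")   # from systat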
To get the input matrix from user and perform Matrix Addition, Subtraction, Multiplication, Inverse Transpose
and Division operations using vector concept in R.
PROCEDURE:
Matrices in R are a bunch of values, either real or complex numbers, arranged in a group of fixed number of
rows and columns. Matrices are used to depict the data in a structured and well-organized format.
It is necessary to enclose the elements of a matrix in parentheses or brackets.
This matrix [M] has 3 rows and 3 columns. Each element of matrix [M] can be referred to by its row and
column number.
Syntax: matrix(data, nrow, ncol, byrow)
Parameters:
data: the input vector of elements.
nrow, ncol: the desired number of rows and columns.
byrow: logical value. If TRUE, the input vector elements are arranged by row; otherwise by column.
1. Matrix addition:
Step1: Creating first matrix
Step2: Creating second matrix
Step3: Getting number of rows and columns
Step4: Creating matrix to store results
Step5: Printing original matrices
Step6: Calculating sum of matrices
Step7: Printing resultant matrix
3. Matrix multiplication:
Step1: Creating first matrix
Step2: Creating second matrix
Step3: Getting number of rows and columns
Step4: Creating matrix to store results
Step5: Printing original matrices
Step6: Calculating product of matrices
Step7: Printing resultant matrix
4. Matrix division:
Step1: Creating first matrix
Step2: Creating second matrix
Step3: Getting number of rows and columns
Step4: Creating matrix to store results
Step5: Printing original matrices
Step6: Calculating element-wise quotient of matrices
Step7: Printing resultant matrix
5. Inverse transpose:
Step1: Create 3 different vectors using combine method
Step2: Bind the three vectors into a matrix using rbind() which is basically row-wise binding
Step3: Print the original matrix
Step4: Use the solve() function to calculate the inverse
Step5: Print the inverse of the matrix
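A minimal sketch of the whole experiment with 3x3 matrices entered by the user via scan(); the element counts and prompts are assumptions:
# read 9 elements for each 3x3 matrix from the user
cat("Enter 9 elements for matrix A:\n")
A <- matrix(scan(n = 9), nrow = 3, byrow = TRUE)
cat("Enter 9 elements for matrix B:\n")
B <- matrix(scan(n = 9), nrow = 3, byrow = TRUE)
A + B      # addition
A - B      # subtraction
A %*% B    # matrix multiplication
A / B      # element-wise division
t(A)       # transpose
solve(A)   # inverse (A must be non-singular)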
To perform statistical operations (Mean, Median, Mode and Standard Deviation) using R.
PROCEDURE:
Statistical analysis in R is performed by using many in-built functions. Most of these functions are part of the
R base package. These functions take R vector as an input along with the arguments and give the result.
1. Mean- It is calculated by taking the sum of the values and dividing by the number of values in a data series.
Syntax: mean(x, trim = 0, na.rm = FALSE, ...)
Parameters:
x: input vector
trim: used to drop some observations from both ends of the sorted order
na.rm: used to remove the missing values from the input vector
2. Median- The middle most value in a data series is called the median. The median() function is used in R
to calculate this value.
Syntax: median(x, na.rm = FALSE)
Parameters:
x: input vector
na.rm: used to remove the missing values from the input vector
4. Standard Deviation- A measure that is used to quantify the amount of variation or dispersion of a set of
data values.
Syntax: sd(x, na.rm = FALSE)
Parameters:
x: input vector
na.rm: used to remove the missing values from the input vector
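A sketch covering all four statistics. Note that R's built-in mode() reports the storage mode of an object, not the statistical mode, so a small helper is written here; the helper name getmode is an assumption:
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5, 12)
mean(x)      # arithmetic mean
median(x)    # middle value
sd(x)        # standard deviation
# statistical mode: the most frequent value
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(x)   # 12, since it occurs twice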
To perform data pre-processing operations i) Handling Missing data ii) Min-Max normalization
PROCEDURE:
i) In R missing values are represented by NA (not available). Impossible values are represented by the symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.
Testing for Missing Values:
is.na(x) #returns TRUE if x is missing
y <- c(1, 2, 3, NA)
is.na(y) #returns a vector (F F F T)
The function na.omit() returns the object with listwise deletion of missing values.
#create new dataset without missing data
newdata <- na.omit(mydata)
ii) Min-max normalization subtracts the minimum value of an attribute from each value of the attribute and then divides the difference by the range of the attribute. These new values are multiplied by the new range of the attribute and finally added to the new minimum value of the attribute. These operations transform the data into a new range, generally [0,1].
Syntax: mmnorm(data, minval = 0, maxval = 1)
Parameters:
data: the dataset to be normalized, including classes
minval: the minimum value of the transformed range
maxval: the maximum value of the transformed range
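A minimal sketch implementing min-max normalization directly, so it runs without the package providing mmnorm(); the helper name minmax and the sample values are assumptions:
# transform a numeric vector into the range [minval, maxval]
minmax <- function(v, minval = 0, maxval = 1) {
  (v - min(v)) / (max(v) - min(v)) * (maxval - minval) + minval
}
minmax(c(10, 20, 50, 100))   # [1] 0.000 0.111 0.444 1.000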
To perform dimensionality reduction operation using PCA for the Houses Data Set
PROCEDURE:
In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing
the number of random variables under consideration, via obtaining a set of principal variables. It can be
divided into feature selection and feature extraction.
Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of
outliers to predictive modeling, PCA has the ability of projecting the observations described by p variables into
few orthogonal components defined. It is an unsupervised method, meaning it will always look into the
greatest sources of variation regardless of the data structure.
Step 2: Standardize the data by using scale and apply the prcomp function.
From the summary, we can understand that PC1 explains 62% of the variance, PC2 explains 24%, and so on. Usually, the principal components which together explain about 95% of the variance can be considered for models. The summary also yields the cumulative proportion of the principal components.
It is best to plot the PCA using various types of scree plot; the pcaCharts function declared earlier invokes various forms of scree plot.
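A minimal sketch of the PCA steps. The Houses data set itself is not bundled with R, so the built-in USArrests data stands in here as an assumption:
data(USArrests)                          # stand-in for the Houses data set
pca <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardizes the data
summary(pca)                             # proportion and cumulative proportion of variance
screeplot(pca, type = "lines")           # scree plot of the component variances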
PROCEDURE:
Linear Regression:
It is a commonly used type of predictive analysis. It is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.
There are two types of linear regression:
1. Simple Linear Regression- uses only one independent variable.
2. Multiple Linear Regression- uses two or more independent variables.
Simple Linear Regression:
The dataset contains observations about income (in a range of $15k to $75k) and happiness (rated on a scale of 1 to 10) in an imaginary sample of 500 people. The income values are divided by 10,000 to make the income data match the scale of the happiness scores (so a value of $2 represents $20,000, $3 is $30,000, etc.).
First of all, you are required to install the packages you need for the analysis,
install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")
As both our variables are quantitative, running summary(income.data) shows a table in our console with a numeric summary of the data. This tells us the minimum, median, mean, and maximum values of the independent variable (income) and dependent variable (happiness):
3. Linearity- The relationship between the independent and dependent variable must be linear. We can test
this visually with a scatter plot to see if the distribution of data points could be described with a straight
line.
plot(happiness ~ income, data = income.data)
Note that the par(mfrow) command will divide the Plots window into the number of rows and columns specified in the parentheses. So par(mfrow=c(2,2)) divides it up into two rows and two columns. To go back to plotting one graph in the entire window, set the parameters again and replace the (2,2) with (1,1).
Residuals are the unexplained variance. They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error. The most important thing to look for is that the red lines representing the mean of the residuals are all basically horizontal and centered on zero. This means there are no outliers or biases in the data that would make a linear regression invalid.
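A minimal sketch of the full workflow described above (the file name income.data.csv is an assumption; the model formula follows the text):
income.data <- read.csv("income.data.csv")            # load the sample data
summary(income.data)                                  # numeric summary of both variables
income.happiness.lm <- lm(happiness ~ income, data = income.data)
summary(income.happiness.lm)                          # coefficients, R-squared, p-values
par(mfrow = c(2, 2))
plot(income.happiness.lm)                             # residual diagnostic plots
par(mfrow = c(1, 1))                                  # restore single-plot layout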
To perform K-Means clustering operation and visualize it for the iris data set
PROCEDURE:
The Iris dataset contains the data for 50 flowers from each of the 3 species - Setosa, Versicolor and Virginica.
The data gives the measurements in centimeters of the variables sepal length and width and petal length and
width for each of the flowers.
Goal of the study is to perform exploratory analysis on the data and build a K-means clustering model to
cluster them into groups. Here we have assumed we do not have the species column to form clusters and then
used it to check our model performance.
First of all, install the package ggplot2 if needed, then load it:
library(ggplot2)
The dataset has 150 observations, equally distributed among the three species - Setosa, Versicolor and Virginica. The table below shows the summary statistics of all 4 variables.
summary(iris)
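The plot command that follows needs k.max and wss defined first; a sketch computing the within-cluster sum of squares for k = 1..10 (the cap of 10 and the use of the four numeric columns are assumptions):
iris_data <- iris[, 1:4]   # drop the Species column before clustering
k.max <- 10
wss <- sapply(1:k.max, function(k) {
  kmeans(iris_data, centers = k, nstart = 20)$tot.withinss
})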
plot(1:k.max, wss, type = "b", xlab = "Number of clusters (k)", ylab = "Within-cluster sum of squares")
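The elbow in this plot suggests k = 3, matching the three species; a sketch of fitting the final model and checking it against the held-out Species column (the seed value is arbitrary):
set.seed(123)                     # for reproducible cluster assignments
km <- kmeans(iris_data, centers = 3, nstart = 20)
table(km$cluster, iris$Species)   # compare clusters with the true species
ggplot(iris, aes(Petal.Length, Petal.Width, color = factor(km$cluster))) +
  geom_point()                    # visualize the clusters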
Write R script to diagnose any disease using KNN classification and plot the results.
PROCEDURE:
Machine learning finds extensive usage in the pharmaceutical industry, especially in the detection of oncogenic (cancer cell) growth. R finds application in machine learning to build models to predict the abnormal growth of cells, thereby helping in the detection of cancer and benefiting the health system.
Let’s see the process of building this model using KNN algorithm in R Programming.
We will use a data set of 100 patients (created solely for the purpose of practice) to implement the KNN algorithm and thereby interpret the results.
The data set consists of 100 observations and 10 variables: 8 numeric variables, one categorical variable (the diagnosis result) and an ID. The numeric variables are as follows:
1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness
6. Compactness
7. Symmetry
8. Fractal dimension
In real life, there are dozens of important parameters needed to measure the probability of cancerous growth
but for simplicity purposes let’s deal with 8 of them.
Let’s make sure that we understand every line of code before proceeding to the next stage:
setwd("C:/Users/Payal/Desktop/KNN") #this command points R to the folder containing the required Prostate_Cancer.csv data file; it does not itself import the file.
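The import step itself is not shown; a sketch of it (the file name comes from the comment above; stringsAsFactors = FALSE keeps the text column as plain characters):
prc <- read.csv("Prostate_Cancer.csv", stringsAsFactors = FALSE)  #import the data file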
str(prc) #we use this command to see whether the data is structured or not.
We find that the data is structured with 10 variables and 100 observations. If we observe the data set, the first
variable ‘id’ is unique in nature and can be removed as it does not provide useful information.
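A one-line sketch of that removal:
prc <- prc[-1]   #drop the first column (id)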
The data set contains patients who have been diagnosed with either Malignant (M) or Benign (B) cancer.
The variable diagnosis_result is our target variable, i.e. this variable will determine the results of the diagnosis based on the 8 numeric variables.
In case we wish to rename B as "Benign" and M as "Malignant" and see the results in percentage form, we may write as:
prc$diagnosis <- factor(prc$diagnosis_result, levels = c("B", "M"), labels = c("Benign", "Malignant"))
round(prop.table(table(prc$diagnosis)) * 100, digits = 1) #it gives the result in percentage form rounded off to 1 decimal place (and so it's digits = 1)
Once we run this code, we are required to normalize the numeric features in the data set. Instead of
normalizing each of the 8 individual variables we use:
The first variable in our data set (after removal of id) is 'diagnosis_result', which is not numeric in nature. So, we start from the 2nd variable. The function lapply() applies normalize() to each feature in the data frame. The final result is stored in the prc_n data frame using the as.data.frame() function.
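A sketch of this normalization step (the helper name normalize matches the text's description; columns 2-9 are the numeric features after the id removal):
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))   #rescale to [0, 1]
}
prc_n <- as.data.frame(lapply(prc[2:9], normalize))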
Let’s check using the variable ‘radius’ whether the data has been normalized.
summary(prc_n$radius)
The results show that the data has been normalized. Do try with the other variables such as perimeter, area etc.
The KNN algorithm is applied to the training data set and the results are verified on the test data set.
For this, we would divide the data set into 2 portions in the ratio of 65: 35 (assumed) for the training and test
data set respectively. You may use a different ratio altogether depending on the business requirement!
We shall divide the prc_n data frame into prc_train and prc_test data frames
prc_train <- prc_n[1:65,]
prc_test <- prc_n[66:100,]
A blank value in each of the above statements indicates that all rows and columns should be included.
Our target variable is 'diagnosis_result', which we have not included in our training and test data sets.
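The target labels are instead kept in separate vectors; a sketch, assuming column 1 of prc holds diagnosis_result after the id removal:
prc_train_labels <- prc[1:65, 1]    #diagnosis_result for the training rows
prc_test_labels <- prc[66:100, 1]   #diagnosis_result for the test rows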
The knn() function needs to be used to train a model for which we need to install a package ‘class’. The knn() function
identifies the k-nearest neighbors using Euclidean distance where k is a user-specified number.
install.packages("class")
library(class)
Now we are ready to use the knn() function to classify the test data.
The value for k is generally chosen as the square root of the number of observations.
knn() returns a factor value of predicted labels for each of the examples in the test data set which is then
assigned to the data frame prc_test_pred.
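A sketch of the classification call (k = 10 since sqrt(100) = 10, per the text):
prc_test_pred <- knn(train = prc_train, test = prc_test,
                     cl = prc_train_labels, k = 10)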
install.packages("gmodels")
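The gmodels package provides CrossTable() to evaluate the predictions against the true labels; a sketch:
library(gmodels)
CrossTable(x = prc_test_labels, y = prc_test_pred, prop.chisq = FALSE)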
The test data consisted of 35 observations, out of which 5 cases have been accurately predicted (TN -> True Negatives) as Benign (B) in nature, which constitutes 14.3%. Also, 16 out of 35 observations were accurately predicted (TP -> True Positives) as Malignant (M) in nature, which constitutes 45.7%. Thus, a total of 16 out of 35 predictions were TP, i.e. True Positive, in nature.
There were no cases of False Negatives (FN), meaning no cases were recorded which actually are malignant in nature but got predicted as benign. FNs, if any, pose a potential threat for this very reason, and the main focus in increasing the accuracy of the model is to reduce FNs.
There were 14 cases of False Positives (FP) meaning 14 cases were actually benign in nature but got predicted
as malignant.
The total accuracy of the model is 60% ((TN+TP)/35 = 21/35), which shows that there may be chances to improve the model performance.
To perform market basket analysis using Association Rules (Apriori).
PROCEDURE:
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between
items. It works by looking for combinations of items that occur together frequently in transactions. To put it
another way, it allows retailers to identify relationships between the items that people buy.
Association Rules are widely used to analyze retail basket or transaction data, and are intended to identify
strong rules discovered in transaction data using measures of interestingness, based on the concept of strong
rules.
In retailing, most purchases are bought on impulse. Market basket analysis gives clues as to what a customer might have bought if the idea had occurred to them. As a first step, therefore, market basket analysis can be used in deciding the location and promotion of goods inside a store. If, as has been observed, purchasers of Barbie dolls are more likely to buy candy, then high-margin candy can be placed near the Barbie doll display. Customers who would have bought candy with their Barbie dolls had they thought of it will now be suitably tempted.
Association Rules:
There are many ways to see the similarities between items. These are techniques that fall under the general
umbrella of association. The outcome of this type of technique, in simple terms, is a set of rules that can be
understood as “if this, then that”.
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were
purchased. The receipt is a representation of stuff that went into a customer’s basket — and therefore ‘Market
Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1
receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.
We can load the data and view the most frequent items with itemFrequencyPlot():
library(arules)
data(Groceries)
itemFrequencyPlot(Groceries, topN = 20, type = "absolute")
We are now ready to mine some rules! You will always have to pass the minimum required support and confidence. Here we set the minimum support to 0.001 and the minimum confidence to 0.8, and cap the rule length at 3:
rules <- apriori(Groceries, parameter = list(sup = 0.001, conf = 0.8, maxlen = 3))
The summary of the rules gives information on the data mined: the total rules mined and the minimum parameters.
The first issue we see here is that the rules are not sorted. Often we will want the most relevant rules first. Let's say we wanted to have the most likely rules: we can easily sort by confidence by executing the following code, after which the top 5 output will be sorted by confidence and the most relevant rules appear first.
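A sketch of the sorting and inspection (inspect() prints rules; the top-5 slice follows the text):
rules <- sort(rules, by = "confidence", decreasing = TRUE)
inspect(rules[1:5])   #show the five highest-confidence rules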
Redundancies
Sometimes, rules will repeat. Redundancy indicates that one item might be a given. As an analyst you can elect to drop the item from the dataset. Alternatively, you can remove the redundant rules generated. We can eliminate these repeated rules using the following snippet of code:
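The snippet itself is not shown above; a minimal sketch using the arules helper is.redundant():
rules <- rules[!is.redundant(rules)]   #keep only the non-redundant rules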
Targeting items:
Now that we know how to generate rules and limit the output, let's say we wanted to target items to generate rules. There are two types of targets we might be interested in, illustrated with the example of "whole milk" (a sketch of the first case follows the list):
1. What are customers likely to buy before buying whole milk?
2. What are customers likely to buy if they purchase whole milk?
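For the first case, the appearance argument of apriori() restricts the right-hand side of the rules to whole milk; the parameter values here are illustrative:
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8),
                 appearance = list(default = "lhs", rhs = "whole milk"))
inspect(head(sort(rules, by = "confidence"), 5))   #top 5 rules leading to whole milk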
Visualization
The last step is visualization. Let's say you wanted to map out the rules in a graph. We can do that with another library called "arulesViz".
library(arulesViz)
plot(rules, method = "graph", interactive = TRUE, shading = NA)
You will get a nice graph that you can move around to look like this: