DATA ANALYTICS LAB MANUAL


DATA ANALYTICS LAB MANUAL (KIT-651)

LIST OF EXPERIMENTS

S.NO.  NAME
1.  Introduction
2.  To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND) using R.
3.  To perform data import/export (CSV, XLS, TXT) operations using data frames in R.
4.  To get the input matrix from user and perform Matrix Addition, Subtraction, Multiplication, Inverse Transpose and Division operations using vector concept in R.
5.  To perform statistical operations (Mean, Median, Mode and Standard Deviation) using R.
6.  To perform data pre-processing operations i) Handling Missing data ii) Min-Max normalization.
7.  To perform dimensionality reduction operation using PCA for Houses Data Set.
8.  To perform Simple Linear Regression with R.
9.  To perform K-Means clustering operation and visualize for iris data set.
10. Write R script to diagnose any disease using KNN classification and plot the results.
11. To perform market basket analysis using Association Rules (Apriori).

Name: Anshika Singh


Roll No.: 2000300130022
INTRODUCTION
General Overview:

One of the principal attractions of the R environment is the ease with which users can write their own programs and custom functions. The R programming syntax is simple to learn, even for users with no previous programming experience. Once the basic R programming control structures are understood, users can use the R language as a powerful environment to perform complex custom analyses of almost any type of data.

Code Editors for R:

Several code editors are available that provide functionalities like R syntax highlighting, auto code
indenting and utilities to send code/functions to the R console.

 Basic code editors provided by R GUIs


 RStudio: GUI-based IDE for R
 Vim-R-Tmux: R working environment based on vim and tmux
 Emacs (ESS add-on package)
 Gedit and Rgedit
 RKWard
 Eclipse
 Tinn-R
 Notepad++

History of R:

R is a programming language and free software environment for statistical computing and graphics that is
supported by the R Foundation for Statistical Computing. The R language is widely used among
statisticians and data miners for developing statistical software and data analysis.

R is an implementation of the S programming language combined with lexical scoping semantics inspired
by Scheme. S was created by John Chambers in 1976, while at Bell Labs. There are some important
differences, but much of the code written for S runs unaltered.

R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is
currently developed by the R Development Core Team, of which Chambers is a member. R is named partly
after the first names of the first two R authors and partly as a play on the name of S. The project was
conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

R and its libraries implement a wide variety of statistical and graphical techniques, including linear and
nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others. R
is easily extensible through functions and extensions, and the R community is noted for its active

contributions in terms of packages. Many of R's standard functions are written in R itself, which makes it
easy for users to follow the algorithmic choices made.

R is an interpreted language; users typically access it through a command-line interpreter. If a user types
2+2 at the R command prompt and presses enter, the computer replies with 4, as shown below:

>2+2

[1] 4

Features of R:

As stated earlier, R is a programming language and software environment for statistical analysis, graphics
representation and reporting. The following are the important features of R –

 R is a well-developed, simple and effective programming language which includes conditionals, loops,
user-defined recursive functions and input and output facilities.
 R has an effective data handling and storage facility.
 R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
 R provides a large, coherent and integrated collection of tools for data analysis.
 R provides graphical facilities for data analysis and display, either directly on the computer screen or on paper.

Getting R:

R can be downloaded from one of the CRAN (Comprehensive R Archive Network) sites, http://cran.r-project.org/. Look in the download and install area, then download the version appropriate for your OS, either Mac OS or Windows.

What is RStudio?

If you click on the R program you just downloaded, you will find a very basic user interface. For example,
below is what I get on Windows.

We will not use R’s direct interface to run analyses. Instead, we will use the program RStudio. RStudio
gives you a true integrated development environment (IDE), where you can write code in a window, see
results in other windows, see locations of files, see objects you’ve created and so on.

Getting RStudio:

To download and install RStudio, follow the directions below:

 Navigate to RStudio’s download site:


https://www.rstudio.com/products/rstudio/download/#download
 Click on the appropriate link based on your OS.
 Click on the installer that you downloaded. Follow the installation wizard’s directions, making sure
to keep all defaults intact.

The RStudio interface:

Open up RStudio. You should see the interface shown in the figure below which has three windows:

 Console (left): The way R works is you write a line of code to execute some kind of task on a data
object. The R console allows you to run code interactively. The screen prompt > is an invitation
from R to enter its world. This is where you type code in, press enter to execute the code and see
the results.
 Environment, History and Connections tabs (upper-right):
o Environment- shows all the R objects that are currently open in your workspace.
o History- shows a list of executed commands in your current session.
o Connections- you can connect to a variety of data sources and explore the objects and data
inside the connection.
 Files, Plots, Packages, Help and Viewer tabs (lower-right):
o Files- show all the files and folders in your current working directory.
o Plots- show any charts, graphs, maps and plots you’ve successfully executed.
o Packages- tell you all the R packages that you have access to.
o Help- shows help documentation for R commands that you’ve called up
o Viewer- allows you to view local web content.

R Data Types:

Let’s now explore what R can do. R is really just a big fancy calculator. For example, type in the
following mathematical expression next to the > in the R console (left window)
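Any simple arithmetic expression will do; one illustrative example and the response R prints:

> (7 + 5) * 3 / 2
[1] 18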

There are four basic data types in R: character, logical, numeric and factor (there are two further types: complex and raw).

1. Characters: used to represent words or letters in R. Anything in quotes will be interpreted as a
character.
2. Logicals: takes two values i.e. FALSE or TRUE. They are usually constructed with comparison
operators.
3. Numeric: are separated into two types: integer and double.
4. Factors: are sort of like characters, but not really. It is actually a numeric code with character-
valued levels.
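A minimal sketch illustrating these four types (the object names below are arbitrary):

name <- "data analytics"                     # character: anything in quotes
flag <- 3 > 2                                # logical: TRUE, built with a comparison operator
height <- 1.75                               # numeric (double)
count <- 5L                                  # numeric (integer; note the L suffix)
grade <- factor(c("low", "high", "high"))    # factor: numeric codes with character-valued levels
class(grade)                                 # "factor"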

R Data Structures:

Now, let’s go through how we can store data in R.

Vectors: are the most common and basic R data structure. A vector is simply a sequence of values, which can
be of any data type, but all of which must be of the same type. There are a number of ways to create a vector
depending on the data type, but the most common is to insert the data you want to save in a vector into the
command c().

For example, to save the values 4, 16, 9 in a vector type in

You can also have a vector of character values.

The above code does not actually save the values 4, 16, 9; it just displays them on the screen as a vector. If
you want to use these values again, you can save them in a data object. You assign data to an object using
the arrow sign <-. This will create an object in R's memory that can be called back into the command
window at any time.

For example, after running b <- "hello world", the object b holds the value "hello world".

You should see the object b pop up in the Environment tab on the top right window of your RStudio
Interface.

Every vector has two key properties: type and length.

The type property indicates the data type that the vector is holding. Use the command typeof() to
determine the type.

The command length() determines the number of data values that the vector is storing

You can also directly determine if a vector is of a specific data type by using the command is.X()
where you replace X with the data type.

You can also coerce a vector of one data type to another. For example, save the values "1" and "2" (both
in quotes) into a vector named x1.

To remove any object from R forever, use the command rm().
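A minimal sketch of these vector operations (the object names are illustrative):

v1 <- c(4, 16, 9)        # numeric vector
b <- "hello world"       # character vector of length 1
typeof(v1)               # "double"
length(v1)               # 3
is.numeric(v1)           # TRUE
x1 <- c("1", "2")        # character values
as.numeric(x1)           # coerces x1 to the numeric vector 1 2
rm(b)                    # removes the object b from the workspace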

Data Frames: are even higher-level data structures. They store vectors of the same length as columns.

Create a vector called v1 storing the values 4, 16, 9 and a vector called v2 storing the values 5, 12, 25. We can
create a data frame called df1 using the command data.frame(), storing the vectors v1 and v2 as columns.

df1 should pop up in your Environment window.

We can store different types of vectors in a data frame. For higher level data structures like a data
frame, use the function class() to figure out what kind of object you are working with. We can’t use
length() on a data frame as it has more than one vector. Instead, it has dimensions – the number of rows
and columns. We can find column names by using the command colnames(). Moreover, we can extract
columns from data frames by referring to their names using the $ sign. We can also extract data from
data frames using brackets [ , ].
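A minimal sketch of these data frame operations (v1 and v2 hold the example values used above):

v1 <- c(4, 16, 9)
v2 <- c(5, 12, 25)
df1 <- data.frame(v1, v2)   # two vectors of the same length become columns
class(df1)                  # "data.frame"
dim(df1)                    # 3 rows, 2 columns
colnames(df1)               # "v1" "v2"
df1$v1                      # extract the column v1 by name
df1[1, 2]                   # extract the value in row 1, column 2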

Functions:

Functions are also known as commands. An R function is a packaged recipe that converts one or more
inputs (called arguments) into a single output. You execute most of your tasks in R using functions.
Every function in R will have the following basic format

functionName(arg1 = val1, arg2 = val2, ….)

Let’s use the function seq() which makes regular sequences of numbers. You can find out what a
function does and its options by calling up its help documentation, by typing ? and the function name.
The help documentation should pop up in the bottom right window of your RStudio interface.
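For example, a short sketch using seq() and its help page:

?seq                                    # opens the help documentation for seq()
seq(from = 2, to = 10, by = 2)          # 2 4 6 8 10
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00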

R Scripting:

In running the few lines of code above, you worked directly in the R console and issued commands in an
interactive way. That is, you type a command after >, you hit enter/return, R responds, you type the
next command, hit enter, R responds and so on.

So instead of writing the command directly into the console, you should write it in a script. The process
is as follows: Type your command in the script. Run the code from the script. R responds. You get
results. You can write two commands in a script. Run both simultaneously. R responds. You get results.
This is the basic flow.

Experiment No. 1
AIM:

To get the input from user and perform numerical operations (MAX, MIN, AVG, SUM, SQRT, ROUND)
using R.

PROCEDURE:

1. max() function in R language is used to find the maximum element present in an object. This object can
be a vector, a list, a data frame etc.
Syntax: max(object, na.rm)
Parameters:
object: vector, matrix, list, data frame, etc.
na.rm: Boolean value to remove NA elements.

2. min() function in R language is used to find the minimum element present in an object. This object can be
a vector, a list, a data frame etc.
Syntax: min(object, na.rm)
Parameters:
object: vector, matrix, list, data frame, etc.
na.rm: Boolean value to remove NA elements.

3. mean() function in R language is used to compute the average of values. This function takes a numerical
vector as an argument and results in the average/mean of this vector.
Syntax: mean(x, na.rm)
Parameters:
x: numeric vector
na.rm: Boolean value to ignore NA element.

4. sum() function in R language is used to compute the sum of values.
Syntax: sum(x, na.rm)
Parameters:
x: the vector containing the numeric values
na.rm: Boolean value to remove NA elements

5. sqrt() function in R language is used to find the square root of an individual number or an expression.
Syntax: sqrt(numeric_expression)
Parameters:
numeric_expression: a numeric value or a valid numerical expression for which you want to find
the square root in R.

6. round() function in R language is used to round off values to a specific number of decimal places.
Syntax: round(x, digits)
Parameters:
x: the value to be rounded off
digits: the number of decimal places to which the value is to be rounded off
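A minimal sketch combining these operations on values read from the user (run interactively; the prompt text and sample input are illustrative):

# read a line such as "12 7 3.5 20" and convert it to a numeric vector
x <- as.numeric(strsplit(readline("Enter numbers separated by spaces: "), " ")[[1]])
max(x, na.rm = TRUE)          # MAX
min(x, na.rm = TRUE)          # MIN
mean(x, na.rm = TRUE)         # AVG
sum(x, na.rm = TRUE)          # SUM
sqrt(x)                       # SQRT of each value
round(mean(x), digits = 2)    # ROUND the average to 2 decimal places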

Experiment No. 2
AIM:

To perform data import/export (CSV, XLS, TXT) operations using data frames in R.

PROCEDURE:

1. Export Data-
There are numerous methods for exporting R objects into other formats. For SPSS, SAS and Stata, you will
need to load the foreign package. For Excel, you will need the xlsReadWrite package.
a. To a CSV file- First, create a data frame; then export the data frame to a CSV file.
Syntax: write.csv(df, path)
Parameters:
df: dataset to save.
path: a string. Set the destination path. Path + filename + extension

write.csv(your_dataframe, "Path where you'd like to export the data frame\\FileName.csv", row.names = FALSE)

b. To an Excel file- Java needs to be installed first.


If you’re a Windows user, you can install the library directly with conda to export dataframe to excel:
conda install -c r r-xlsx
Once the library is installed you can use the function write.xlsx(). A new Excel workbook is created in
the working directory for R export to Excel data.

library(xlsx)
write.xlsx(df, "X.xlsx")

If you’re a Mac OS user, you need to follow these steps:


 Step1: install the latest version of java
 Step2: install library rJava
 Step3: install library xlsx

c. To a TXT file- the basic function write.table() can be used to export a data frame to a txt file.
Syntax: write.table(x, file)
Parameters:
x: a matrix or a data frame to be written
file: a character string specifying the name of the result file.

write.table(mydata, "c:/mydata.txt", sep="\t")

2. Import Data-
Importing data into R is fairly simple. For Stata and Systat, use the foreign package. For SPSS and SAS use
the Hmisc package for ease and functionality.
a. From a Comma Delimited Text file-

# first row contains variable names, comma is separator
# assign the variable id to row names
# note the / instead of \ on MS Windows systems

mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")


b. From Excel-

#read in the first worksheet from the workbook myexcel.xlsx
#first row contains variable names
library(xlsx)
mydata <- read.xlsx("c:/myexcel.xlsx", 1)

#read in the worksheet named mysheet
mydata <- read.xlsx("c:/myexcel.xlsx", sheetName="mysheet")

c. From SPSS-
#save SPSS dataset in transport format
get file=’c:\mydata.sav’
export outfile=’c:\mydata.por’

#in R
library(Hmisc)
mydata <- spss.get("c:/mydata.por", use.value.labels=TRUE)
#last option converts value labels to R factors

d. From SAS-

#save SAS dataset in transport format


libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;

#in R
library(Hmisc)
mydata <- sasxport.get("c:/mydata.xpt")
#character variables are converted to R factors
e. From Stata-

#input Stata file


library(foreign)
mydata <- read.dta("c:/mydata.dta")

f. From systat-

#input Systat file


library(foreign)
mydata <- read.systat("c:/mydata.syd")

Experiment No. 3
AIM:

To get the input matrix from user and perform Matrix Addition, Subtraction, Multiplication, Inverse Transpose
and Division operations using vector concept in R.

PROCEDURE:

Matrices in R are collections of values, either real or complex numbers, arranged in a fixed number of
rows and columns. Matrices are used to depict the data in a structured and well-organized format.
It is necessary to enclose the elements of a matrix in parentheses or brackets.

A matrix with 9 elements is shown below

This matrix [M] has 3 rows and 3 columns. Each element of matrix [M] can be referred to by its row and
column number.

Syntax for creating a matrix:

matrix(data, nrow, ncol, byrow, dimnames)

Parameters:
data: input vector which becomes the data elements of the matrix
nrow: number of rows to be created
ncol: number of columns to be created
byrow: logical value. If TRUE, the input vector elements are arranged by row.
dimnames: names assigned to the rows and columns

Order of a matrix - is defined in terms of its number of rows and columns.

Order of a matrix = No. of rows * No. of columns

Operations on Matrices – there are four basic operations i.e. DMAS (Division, Multiplication, Addition,
Subtraction)

1. Matrix addition:
Step1: Creating first matrix
Step2: Creating second matrix
Step3: Getting number of rows and columns
Step4: Creating matrix to store results
Step5: Printing original matrices
Step6: Calculating sum of matrices
Step7: Printing resultant matrix

2. Matrix subtraction:
Step1: Creating first matrix
Step2: Creating second matrix
Step3: Getting number of rows and columns
Step4: Creating matrix to store results
Step5: Printing original matrices
Step6: Calculating difference of matrices
Step7: Printing resultant matrix

3. Matrix multiplication:
Step1: Creating first matrix
Step2: Creating second matrix
Step3: Getting number of rows and columns
Step4: Creating matrix to store results
Step5: Printing original matrices
Step6: Calculating product of matrices
Step7: Printing resultant matrix
4. Matrix division:
Step1: Creating first matrix
Step2: Creating second matrix
Step3: Getting number of rows and columns
Step4: Creating matrix to store results
Step5: Printing original matrices
Step6: Calculating division of matrices

Step7: Printing resultant matrix

5. Inverse transpose:
Step1: Create 3 different vectors using combine method
Step2: Bind the three vectors into a matrix using rbind() which is basically row-wise binding
Step3: Print the original matrix
Step4: Use the solve() function to calculate the inverse
Step5: Print the inverse of the matrix
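A compact sketch covering all of these operations on two illustrative 3 x 3 matrices:

m1 <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 10), nrow = 3, byrow = TRUE)
m2 <- matrix(1:9, nrow = 3, byrow = TRUE)
m1 + m2        # matrix addition (element-wise)
m1 - m2        # matrix subtraction (element-wise)
m1 %*% m2      # matrix multiplication (use * for the element-wise product)
m1 / m2        # matrix division (element-wise)
t(m1)          # transpose
solve(m1)      # inverse (m1 must be non-singular)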

Experiment No. 4
AIM:

To perform statistical operations (Mean, Median, Mode and Standard Deviation) using R.

PROCEDURE:

Statistical analysis in R is performed by using many in-built functions. Most of these functions are part of the
R base package. These functions take R vector as an input along with the arguments and give the result.

1. Mean- It is calculated by taking the sum of the values and dividing with the number of values in a data
series.
Syntax: mean(x, trim = 0, na.rm = FALSE, ...)
Parameters:
x: input vector
trim: used to drop some observations from both ends of the sorted order
na.rm: used to remove the missing values from the input vector

2. Median- The middle most value in a data series is called the median. The median() function is used in R
to calculate this value.
Syntax: median(x, na.rm = FALSE)
Parameters:
x: input vector
na.rm: used to remove the missing values from the input vector

3. Mode- The mode is the value that has highest number of occurrences in a set of data. Unlike mean and
median, mode can have both numeric and character data.
R does not have a standard in-built function to calculate the mode, so we create a user function to calculate the
mode of a data set in R. This function takes the vector as input and gives the mode value as output (a sketch of
such a function is given after this list).

4. Standard Deviation- A measure that is used to quantify the amount of variation or dispersion of a set of
data values.
Syntax: sd(x, na.rm = FALSE)
Parameters:
x: input vector
na.rm: used to remove the missing values from the input vector
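A minimal sketch of the user-defined mode function mentioned in point 3 (the name getmode is arbitrary), alongside the built-in calls for the other measures; the sample vectors are illustrative:

x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)
mean(x); median(x); sd(x)                       # mean, median, standard deviation

getmode <- function(v) {
  uniqv <- unique(v)                            # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count
}
getmode(c(2, 1, 2, 3, 1, 2, 3, 4))              # returns 2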

Experiment No. 5
AIM:

To perform data pre-processing operations i) Handling Missing data ii) Min-Max normalization

PROCEDURE:

i) In R, missing values are represented by NA (not available). Impossible values are represented by the
symbol NaN (not a number). Unlike SAS, R uses the same symbol for character and numeric data.
 Testing for Missing Values:
is.na(x) #returns TRUE if x is missing
y <- c(1, 2, 3, NA)
is.na(y) #returns a vector (F F F T)

 Recoding Values to Missing:


#recode 99 to missing for variable v1
#select rows where v1 is 99 and recode column v1
mydata$v1[mydata$v1==99] <- NA

 Excluding Missing Values from Analyses:


x < - c(1, 2, NA, 3)
mean(x) #returns NA
mean(x, na.rm=TRUE) #returns 2

The function complete.cases() returns a logical vector indicating which cases are complete.
#list rows of data that have missing value
mydata[!complete.cases(mydata), ]

The function na.omit() returns the object with listwise deletion of missing values.
#create new dataset without missing data
newdata <- na.omit(mydata)

ii) Min-max normalization subtracts the minimum value of an attribute from each value of the attribute
and then divides the difference by the range of the attribute. These new values are multiplied by the
new range of the attribute and finally added to the new minimum value of the attribute. These
operations transform the data into a new range, generally [0,1].
Syntax: mmnorm(data, minval = 0, maxval = 1)
Parameters:
data: the dataset to be normalized, including classes
minval: the minimum value of the transformed range
maxval: the maximum value of the transformed range
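Note that mmnorm() is not part of base R (it is provided by an add-on package such as dprep); an equivalent base-R sketch for the default [0, 1] range is:

minmax <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
# apply the transformation to every numeric column of the built-in iris data
as.data.frame(lapply(iris[ , 1:4], minmax))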

Experiment No. 6
AIM:

To perform dimensionality reduction operation using PCA for Houses Data Set

PROCEDURE:

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing
the number of random variables under consideration, via obtaining a set of principal variables. It can be
divided into feature selection and feature extraction.
Principal component analysis (PCA) is routinely employed on a wide range of problems. From the detection of
outliers to predictive modeling, PCA has the ability to project the observations described by p variables onto a
few orthogonal components. It is an unsupervised method, meaning it will always look for the greatest sources
of variation regardless of the data structure.

Step 0: Build a pcaCharts function for exploratory data analysis of variance

Step 1: Load Data for analysis - Crime Data

Principal Component Analysis (PCA) is a method used to reduce the number of variables in a dataset. Now, we
will simplify the data into two-variable data. This does not mean that we are eliminating two variables and
keeping two; it means that we are replacing the four variables with two brand new ones called principal
components.
prcomp: performs a principal components analysis on the given data matrix and returns the results as an object
of class prcomp.

Step 2: Standardize the data by using scale and apply “prcomp” function

Step 3: Choose the principal components with highest variances
Now that R has computed 4 new variables (principal components), you can choose the two (or one, or three)
principal components with the highest variances.
This step identifies how much of the variance in the dataset is covered by each individual principal component.
The summary() function or a scree plot can be used to explain the variance.

From the summary, we can understand that PC1 explains 62% of the variance, PC2 explains 24%, and so on.
Usually, the principal components that together explain about 95% of the variance are considered for models.
The summary also yields the cumulative proportion of the principal components.
It is also useful to plot the PCA results using various types of scree plot. The pcaCharts function declared
above invokes various forms of scree plot.
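A hedged sketch of Steps 1-3 using R's built-in USArrests crime data (summary() and screeplot() cover the same ground as the pcaCharts helper mentioned above):

data(USArrests)
pca <- prcomp(USArrests, scale. = TRUE)   # standardize the variables, then compute the components
summary(pca)                              # proportion and cumulative proportion of variance
screeplot(pca, type = "lines")            # scree plot of the component variances
biplot(pca)                               # observations projected onto the first two components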

Step 4: Visualization of Data in the new reduced dimension

Experiment No. 7
AIM:

To perform Simple Linear Regression with R.

PROCEDURE:

Linear Regression:
It is commonly used type of predictive analysis. It is a statistical approach for modeling relationship
between a dependent variable and a given set of independent variables.
There are two types of linear regression:
1. Simple Linear Regression- uses only one independent variable.
2. Multiple Linear Regression- uses two or more independent variables.
Simple Linear Regression:
The dataset contains observations about income (in a range of $15k to $75k) and happiness (rated on a scale
of 1 to 10) in an imaginary sample of 500 people. The income values are divided by 10,000 to make the
income data match the scale of the happiness scores (so a value of $2 represents $20,000, $3 is $30,000,
etc.).
First of all, you are required to install the packages you need for the analysis,
install.packages("ggplot2")
install.packages("dplyr")
install.packages("broom")
install.packages("ggpubr")

Next, load the packages into your R environment:


library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)

Step1: Load the data into R
Follow these 4 steps:
1. In RStudio, go to File > Import Dataset > From Text(base).
2. Choose the data file and an Import Dataset window pops up.
3. In the Data Frame window, you should see an X (index) column and columns listing the data for each of
the variables (income and happiness).
4. Click on the Import button and the file should appear in your Environment tab on the upper right side of
the RStudio screen.
After you’ve loaded the data, check that it has been read in correctly using summary().
summary(income.data)

As both our variables are quantitative, when we run this function we see a table in our console with a
numeric summary of the data. This tells us the minimum, median, mean, and maximum values of the
independent variable(income) and dependent variable(happiness):

Step2: Make sure your data meet the assumptions
We can use R to check that our data meet the four main assumptions for linear regression.
1. Independence of observations- As we have only one independent variable and one dependent variable,
we don’t need to test for any hidden relationships among variables.
2. Normality- To check whether the dependent variable follows a normal distribution, use the hist()
function.
hist(income.data$happiness)

3. Linearity- The relationship between the independent and dependent variable must be linear. We can test
this visually with a scatter plot to see if the distribution of data points could be described with a straight
line.
plot(happiness ~ income, data = income.data)

The relationship looks roughly linear.
4. Homoscedasticity- This means that the prediction error doesn’t change significantly over the range of
prediction of the model. We can test this assumption later.
Step3: Perform the linear regression analysis
Now that you’ve determined your data meet the assumptions, you can perform a linear regression analysis
to evaluate the relationship between the independent and dependent variables.
Let’s see if there’s a linear relationship between income and happiness in our survey of 500 people with
incomes ranging from $15k to $75k, where happiness is measured on a scale of 1 to 10.
To perform a simple linear regression analysis and check the results, you need to run two lines of code. The
first line of code makes the linear model and the second line prints out the summary of the model:
income.happiness.lm <- lm(happiness ~income, data = income.data)
summary(income.happiness.lm)

This output table first presents the model equation, then summarizes the model residuals.

The Coefficients section shows:


1. The estimates (Estimate) for the model parameters- the value of the y-intercept (0.204) and the
estimated effect of income on happiness (0.713).
2. The standard error of the estimated values (Std. Error).
3. The test statistic (t value).
4. The p-value (Pr(>|t|)), i.e. the probability of finding the given t-statistic if the null hypothesis of no
relationship were true.
The final three lines are model diagnostics- the most important thing to note is the p-value, which will
indicate whether the model fits the data well.
From these results, we can say that there is a significant positive relationship between income and
happiness (p-value < 0.001), with a 0.713 unit (+/- 0.01) increase in happiness for every unit increase in
income.
Step4: Check for homoscedasticity

Before proceeding with data visualization, we should make sure that our models fit the homoscedasticity
assumption of the linear model.
We can run plot(income.happiness.lm) to check whether the observed data meets our model assumptions:
par(mfrow=c(2,2))
plot(income.happiness.lm)
par(mfrow=c(1,1))

Note that the par(mfrow()) command will divide the Plots window into the number of rows and columns
specified in the brackets. So par(mfrow=c(2,2)) divides it up into two rows and two columns. To go back to
plotting one graph in the entire window, set the parameters again and replace the (2,2) with (1,1).

Residuals are the unexplained variance. They are not exactly the same as model error, but they are
calculated from it, so seeing a bias in the residuals would also indicate a bias in the error. The most
important thing to look for is that the red lines representing the mean of the residuals are all basically
horizontal and centered on zero. This means there are no outliers or biases in the data that would make a

linear regression invalid. Based on these residuals, we can say that our model meets the assumption of
homoscedasticity.

Step5: Visualize the results with a graph


Next, we can plot the data and the regression line from our linear regression model so that the results can be
shared.
Follow 4 steps to visualize the results of your simple linear regression:
1. Plot the data points on a graph
income.graph <- ggplot(income.data, aes(x=income, y=happiness)) + geom_point()
income.graph

2. Add the linear regression line to the plotted data


income.graph <- income.graph + geom_smooth(method="lm", col="black")
income.graph

3. Add the equation for the regression line


income.graph <- income.graph + stat_regline_equation(label.x=3, label.y=7)
income.graph

4. Make the graph ready for publication
income.graph + theme_bw() + labs(title = "Reported happiness as a function of income", x = "Income
(x$10,000)", y = "Happiness score (0 to 10)")

Step6: Report your results


We found a significant relationship between income and happiness (p < 0.001, R2 = 0.73 +/- 0.0193), with a
0.73-unit increase in reported happiness for every $10,000 increase in income.

Experiment No. 8
AIM:

To perform K-Means clustering operation and visualize for iris data set

PROCEDURE:

The Iris dataset contains the data for 50 flowers from each of the 3 species - Setosa, Versicolor and Virginica.
The data gives the measurements in centimeters of the variables sepal length and width and petal length and
width for each of the flowers.
Goal of the study is to perform exploratory analysis on the data and build a K-means clustering model to
cluster them into groups. Here we have assumed we do not have the species column to form clusters and then
used it to check our model performance.
First of all, install the ggplot2 package and load it:
library(ggplot2)

Exploratory Data Analysis:

The dataset has 150 observations equally distributed among the three species - Setosa, Versicolor and
Virginica. The below table shows the summary statistics of all the 4 variables.

summary(iris)

sapply(iris[,-5], var)

The petal length and petal width show 3 clusters.

summary(iris)

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) + geom_point()

ggplot(iris,aes(x = Petal.Length, y = Petal.Width, col= Species)) + geom_point()

Finding the optimum number of clusters:
The plot of Within cluster sum of squares vs the number of clusters show us an elbow point at 3. So, we can
conclude that 3 is the best value for k to be used to create the final model.
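The k.max and wss objects used in the plot below can be computed, for example, as in this sketch:

k.max <- 10
scaled <- scale(iris[ , -5])   # drop the Species column and standardize
wss <- sapply(1:k.max, function(k) {
  kmeans(scaled, centers = k, nstart = 20)$tot.withinss   # within-cluster sum of squares for each k
})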

plot(1:k.max,wss, type= "b", xlab = "Number of clusters(k)", ylab = "Within cluster sum of squares")

The final cluster model:
The final model is built using kmeans and k = 3. The nstart value has also been defined as 20 which means
that R will try 20 different random starting assignments and then select the one with the lowest within cluster
variation.
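A minimal sketch of this final model and of the comparison with the true species labels discussed below:

set.seed(20)
final <- kmeans(iris[ , -5], centers = 3, nstart = 20)   # 3 clusters, 20 random starts
table(final$cluster, iris$Species)                        # compare clusters with the actual species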

From the table we can see most of the observations have been clustered correctly; however, 2 of the versicolor
have been put in the cluster with all the virginica, and 4 of the virginica have been put in cluster 3, which
mostly has versicolor.

Experiment No. 9
AIM:

Write R script to diagnose any disease using KNN classification and plot the results.

PROCEDURE:

Machine learning finds extensive usage in pharmaceutical industry especially in detection of oncogenic (cancer
cells) growth. R finds application in machine learning to build models to predict the abnormal growth of cells
thereby helping in detection of cancer and benefiting the health system.

Let’s see the process of building this model using KNN algorithm in R Programming.

Step1: Data Collection

We will use a data set of 100 patients (created solely for the purpose of practice) to implement the KNN
algorithm and thereby interpreting results.

The data set consists of 100 observations and 10 variables (8 numeric variables, one categorical variable, and
one ID variable), which are as follows:

1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness
6. Compactness
7. Symmetry
8. Fractal dimension

In real life, there are dozens of important parameters needed to measure the probability of cancerous growth
but for simplicity purposes let’s deal with 8 of them.

Here's what the data set looks like:

Step2: Preparing and exploring data:

Let’s make sure that we understand every line of code before proceeding to the next stage:
setwd("C:/Users/Payal/Desktop/KNN") #this command points R to the folder containing the required
Prostate_Cancer.csv data file.

prc <- read.csv("Prostate_Cancer.csv", stringsAsFactors=FALSE) #this command imports the
required data set and saves it to the prc data frame.

stringsAsFactors=FALSE #setting this argument to FALSE prevents character vectors from being
automatically converted to factors.

str(prc) #we use this command to inspect the structure of the data set.

We find that the data is structured with 10 variables and 100 observations. If we observe the data set, the first
variable ‘id’ is unique in nature and can be removed as it does not provide useful information.

prc <- prc[-1] #removes the first variable(id) from the data set.

The data set contains patients who have been diagnosed with either Malignant (M) or Benign (B) cancer.

table(prc$diagnosis_result) #it helps us to get the numbers of patients

The variable diagnosis_result is our target variable i.e. this variable will determine the results of the diagnosis
based on the 8 numeric variables)

In case we wish to rename B as “Benign” and M as “Malignant” and see the results in the percentage form, we
may write as:

prc$diagnosis <- factor(prc$diagnosis_result, levels=c("B", "M"), labels=c("Benign", "Malignant"))

round(prop.table(table(prc$diagnosis)) * 100, digits=1) #it gives the result in percentage form rounded off
to 1 decimal place (and so digits=1)

Normalizing numeric data


This feature is of paramount importance since the scale used for the values for each variable might be different.
The best practice is to normalize the data and transform all the values to a common scale.

normalize <- function(x) {
return((x - min(x)) / (max(x) - min(x))) }

Once we run this code, we are required to normalize the numeric features in the data set. Instead of
normalizing each of the 8 individual variables we use:

prc_n <- as.data.frame(lapply(prc[2:9], normalize))

The first variable in our data set (after removal of id) is ‘diagnosis_result’ which is not numeric in nature. So,
we start from 2nd variable. The function lapply() applies normalize() to each feature in the data frame. The final
result is stored to prc_n data frame using as.data.frame() function

Let’s check using the variable ‘radius’ whether the data has been normalized.

summary(prc_n$radius)

The results show that the data has been normalized. Do try with the other variables such as perimeter, area etc.

Creating training and test data set

The KNN algorithm is applied to the training data set and the results are verified on the test data set.

For this, we would divide the data set into 2 portions in the ratio of 65: 35 (assumed) for the training and test
data set respectively. You may use a different ratio altogether depending on the business requirement!

We shall divide the prc_n data frame into prc_train and prc_test data frames

prc_train <- prc_n[1:65, ]

prc_test <- prc_n[66:100, ]

The blank after the comma in each of the above statements indicates that all columns should be included.

Our target variable is 'diagnosis_result', which we have not included in our training and test data sets.

prc_train_labels <- prc[1:65, 1]


prc_test_labels <- prc[66:100, 1] #this code takes the diagnosis factor in column 1 of the prc data frame and
in turn creates the prc_train_labels and prc_test_labels label vectors.

Step 3 – Training a model on data

The knn() function needs to be used to train a model for which we need to install a package ‘class’. The knn() function
identifies the k-nearest neighbors using Euclidean distance where k is a user-specified number.

You need to type in the following commands to use knn()

install.packages("class")

library(class)

Now we are ready to use the knn() function to classify test data

prc_test_pred <- knn(train = prc_train, test = prc_test, cl = prc_train_labels, k = 10)

The value for k is generally chosen as the square root of the number of observations.

knn() returns a factor value of predicted labels for each of the examples in the test data set which is then
assigned to the data frame prc_test_pred.

Step 4 – Evaluate the model performance
We have built the model but we also need to check the accuracy of the predicted values in prc_test_pred as to
whether they match up with the known values in prc_test_labels. To ensure this, we need to use the
CrossTable() function available in the package ‘gmodels’.
We can install it using:

install.packages("gmodels")
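A sketch of this evaluation step (prc_test_pred and prc_test_labels come from the earlier steps):

library(gmodels)
CrossTable(x = prc_test_labels, y = prc_test_pred, prop.chisq = FALSE)   # confusion table of actual vs predicted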

The test data consisted of 35 observations. Out of these, 5 cases have been accurately predicted (TN -> True
Negatives) as Benign (B) in nature, which constitutes 14.3%. Also, 16 out of 35 observations were accurately
predicted (TP -> True Positives) as Malignant (M) in nature, which constitutes 45.7%. Thus, a total of 16 out of
35 predictions were TP, i.e. True Positive in nature.
There were no cases of False Negatives (FN), meaning no cases were recorded which actually are malignant in
nature but got predicted as benign. Any FNs would pose a potential threat for the same reason, and the main
focus in increasing the accuracy of the model is to reduce FNs.
There were 14 cases of False Positives (FP) meaning 14 cases were actually benign in nature but got predicted
as malignant.
The total accuracy of the model is 60% ((TN+TP)/35), which shows that there may be scope to improve the
model performance.

Step 5 – Improve the performance of the model


This can be taken into account by repeating the steps 3 and 4 and by changing the k-value. Generally, it is the
square root of the observations and in this case we took k=10 which is a perfect square root of 100.The k-value
may be fluctuated in and around the value of 10 to check the increased accuracy of the model. Do try it out
with values of your choice to increase the accuracy! Also remember, to keep the value of FN’s as low as
possible.

Experiment No. 10
AIM:

To perform market basket analysis using Association Rules (Apriori).

PROCEDURE:

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between
items. It works by looking for combinations of items that occur together frequently in transactions. To put it
another way, it allows retailers to identify relationships between the items that people buy.
Association Rules are widely used to analyze retail basket or transaction data, and are intended to identify
strong rules discovered in transaction data using measures of interestingness, based on the concept of strong
rules.
In retailing, most purchases are bought on impulse. Market basket analysis gives clues as to what a customer
might have bought if the idea had occurred to them. As a first step, therefore, market basket analysis can be
used in deciding the location and promotion of goods inside a store. If, as has been observed, purchasers of
Barbie dolls are more likely to buy candy, then high-margin candy can be placed near the Barbie doll
display. Customers who would have bought candy with their Barbie dolls had they thought of it will now be
suitably tempted.

Association Rules:

There are many ways to see the similarities between items. These are techniques that fall under the general
umbrella of association. The outcome of this type of technique, in simple terms, is a set of rules that can be
understood as “if this, then that”.
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were
purchased. The receipt is a representation of stuff that went into a customer’s basket — and therefore ‘Market
Basket Analysis’.
That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1
receipt and the items purchased. Each line is called a transaction and each column in a row represents an item.

Apriori Recommendation with R


#load the libraries
library(arules)
library(arulesViz)
library(datasets)

#load the dataset


data(Groceries)

Lets explore the data before we make any rules:

#create an item frequency plot for the top 20 items

itemFrequencyPlot(Groceries, topN=20, type="absolute")

We are now ready to mine some rules! You will always have to pass the minimum
required support and confidence.
 We set the minimum support to 0.001

 We set the minimum confidence to 0.8

 We then show the top 5 rules

#get the rules


rules <- apriori(Groceries, parameter = list(sup=0.001, conf=0.8))
#show the top 5 rules, but only 2 digits
options(digits=2)
inspect(rules[1:5])

The output we see should look something like this

This reads easily, for example: if someone buys yogurt and cereals, they are 81% likely to buy whole milk too.
We can get summary info. about the rules that give us some interesting information such as:
 The number of rules generated: 410

 The distribution of rules by length: Most rules are 4 items long


 The summary of quality measures: interesting to see ranges of support, lift, and confidence.

 The information on the data mined: total data mined, and minimum parameters.
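This summary information is produced by a single call:

summary(rules)   # rule count, length distribution, quality measure ranges, mining info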

Sorting stuff out

The first issue we see here is that the rules are not sorted. Often we will want the most relevant rules first. Let's
say we wanted to have the most likely rules. We can easily sort by confidence by executing the following code.

rules <- sort(rules, by="confidence", decreasing=TRUE)

Now our top 5 output will be sorted by confidence and therefore the most relevant rules appear.

Rule 4 is perhaps excessively long. Let's say you wanted more concise rules. That is also easy to do by adding a
"maxlen" parameter to your apriori function:

rules <- apriori(Groceries, parameter = list(sup =0.001, conf = 0.8, maxlen = 3))

Redundancies

Sometimes, rules will repeat. Redundancy indicates that one item might be a given. As an analyst you can elect
to drop the item from the dataset. Alternatively, you can remove redundant rules generated. We can eliminate
these repeated rules using the follow snippet of code:

subset.matrix <- is.subset(rules, rules)


subset.matrix[lower.tri(subset.matrix, diag=T) ] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
rules.pruned <- rules[!redundant]
rules <- rules.pruned

Targeting items:

Now that we know how to generate rules and limit the output, let's say we wanted to target items to generate rules.
There are two types of targets we might be interested in, illustrated with the example of "whole milk":
1. What are customers likely to buy before buying whole milk

2. What are customers likely to buy if they purchase whole milk?

Answering the first question we adjust our apriori() function as follows:

rules <- apriori(data=Groceries, parameter=list(supp=0.001, conf=0.08),
appearance=list(default="lhs", rhs="whole milk"), control=list(verbose=F))
rules <- sort(rules, decreasing=TRUE, by="confidence")
inspect(rules[1:5])

Likewise, we can set the left hand side to be "whole milk" and find what customers are likely to buy along with it.

rules <- apriori(data=Groceries, parameter=list(supp=0.001, conf=0.15, minlen=2),
appearance=list(default="rhs", lhs="whole milk"), control=list(verbose=F))
rules <- sort(rules, decreasing=TRUE, by="confidence")
inspect(rules[1:5])

Visualization
The last step is visualization. Let's say you wanted to map out the rules in a graph. We can do that with another
library called "arulesViz".

library(arulesViz)
plot(rules, method="graph", interactive=TRUE, shading=NA)

You will get a nice graph that you can move around to look like this:
